Next in our Tech Blog series, we speak with Samir Arfoui, Lead Data Engineer, about the importance of using both Data Scientists and Data Engineers within Infectious Media to deliver the best results, both internally and for our clients.
IM: What is the purpose of having both Data Engineers and Data Scientists at Infectious Media?
SA: At Infectious Media, data scientists and engineers complement each other in their duties. Our data engineers are responsible for our data pipeline, managing our ETL (Extract, Transform, Load) framework and data warehouses, ensuring the company’s daily data needs are satisfied. The role of our data scientists is to try to answer specific questions by building models around the data prepared by our data engineers.
The data engineer’s toolbox consists mainly of strong programming skills and good knowledge of database technologies. At Infectious Media, we use an in-house ETL framework developed in Python, which uses Celery to schedule and execute tasks. As for storage, we use MySQL for transactional data, as well as Google’s BigQuery for larger datasets.
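To make the Extract, Transform, Load steps concrete, here is a minimal sketch of a single ETL process. It is not our in-house framework: the sample rows, the campaign_stats table and the function names are all hypothetical, and Python's built-in sqlite3 module stands in for a real warehouse. In a Celery-based setup, a function such as run_etl would typically be registered as a task and triggered on a schedule; that wiring is left out here.

```python
import sqlite3

# Hypothetical raw rows, standing in for whatever an upstream source provides.
RAW_ROWS = [
    {"campaign": "spring_sale", "impressions": "1200", "clicks": "36"},
    {"campaign": "brand_video", "impressions": "800", "clicks": "8"},
    {"campaign": "broken_row", "impressions": "", "clicks": "3"},
]


def extract():
    """Extract: return the raw rows (a real task would read from a source system)."""
    return list(RAW_ROWS)


def transform(rows):
    """Transform: cast types, drop unusable rows, derive a click-through rate."""
    cleaned = []
    for row in rows:
        if not row["impressions"]:
            continue  # skip rows with a missing impression count
        impressions = int(row["impressions"])
        clicks = int(row["clicks"])
        cleaned.append((row["campaign"], impressions, clicks, clicks / impressions))
    return cleaned


def load(conn, rows):
    """Load: write the cleaned rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS campaign_stats "
        "(campaign TEXT PRIMARY KEY, impressions INTEGER, clicks INTEGER, ctr REAL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO campaign_stats VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()


def run_etl(conn):
    """Run the three steps in order; a scheduler would trigger this call."""
    load(conn, transform(extract()))


# sqlite3 in-memory database used purely as a stand-in for illustration.
conn = sqlite3.connect(":memory:")
run_etl(conn)
```

Keeping each step as its own function means a failed load can be retried without re-running the extract, which is one reason pipelines are often split this way.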
IM: How does Infectious Media use Data Engineers?
SA: The Data Team’s responsibilities include the management of more than 700 ETL processes, as well as the integration of new data pipelines with clients or third parties, and generally providing internal data support to the company.
Our global programmatic activity generates more than 500 million unique events daily: impressions, views, clicks, conversions, and pixel events. These have proven extremely useful in a wide range of ad hoc analyses that provide us with invaluable insights for our clients.
These events are streamed from Kafka into a Cassandra database cluster, which we’ve picked for its ability to store very large amounts of data and index it using a primary key, thus increasing query efficiency. Once our event-level data is available in Cassandra, it’s time for some Data Science.
IM: How does Infectious Media use Data Scientists?
SA: The goal of some of our advertising strategies is to drive users to specific actions by showing them specific adverts. Such an action, or conversion, could be landing on our client’s website or purchasing one of our client’s products. By recording every event leading to a conversion, we’re able to construct the user’s journey from the very first time we showed them an ad.
This leads to the following question: can we quantify how similar a user who has seen our adverts but has not yet converted is to users who have? This is what our data scientists are trying to answer using the vast amounts of data our engineers have streamed into Cassandra. This lookalike project is currently the main focus of our Data Science team.
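The journey reconstruction described above can be sketched in a few lines. This is an illustration only: the event tuples, user IDs and function names are made up, and an in-memory list stands in for event-level data read from a store such as Cassandra.

```python
from collections import defaultdict

# Hypothetical event records: (user_id, timestamp, event_type).
EVENTS = [
    ("u1", 3, "click"),
    ("u1", 1, "impression"),
    ("u2", 2, "impression"),
    ("u1", 5, "conversion"),
    ("u1", 7, "impression"),
    ("u2", 4, "view"),
]


def build_journeys(events):
    """Group events by user and sort each user's events by timestamp."""
    by_user = defaultdict(list)
    for user_id, timestamp, event_type in events:
        by_user[user_id].append((timestamp, event_type))
    return {user: sorted(user_events) for user, user_events in by_user.items()}


def journey_to_conversion(journey):
    """Return the event types up to and including the first conversion,
    or None if the user never converted."""
    path = []
    for _, event_type in journey:
        path.append(event_type)
        if event_type == "conversion":
            return path
    return None


journeys = build_journeys(EVENTS)
converted_paths = {
    user: journey_to_conversion(journey) for user, journey in journeys.items()
}
```

Here user u1 ends up with a path from first impression to conversion, while u2 has no conversion and maps to None, which is the split between converters and non-converters that the lookalike question relies on.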
IM: What is the added value of this lookalike modeling?
SA: Simply put, potentially enormous. In one bucket we have all the users whom we have successfully driven to convert, and in the other all the users who have seen our adverts but have not purchased our client’s products, which we consider a failure from a performance perspective. By computing the behavioural similarity between these two pools of users, our data scientists are able to extract a conversion probability for each non-converter. Following this step, we are then able to actively target the most likely converters or refrain from showing adverts to users with a low likelihood of converting.
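As a rough illustration of scoring non-converters against converters, the sketch below compares each non-converter's behaviour with the average converter using cosine similarity. This is an assumption made for the example, not a description of the model our data scientists use; the feature vectors (impressions, views, clicks) and user IDs are invented, and the resulting score is a similarity ranking rather than a calibrated conversion probability.

```python
import math

# Hypothetical per-user behaviour vectors: (impressions, views, clicks).
CONVERTERS = {"c1": (10, 4, 3), "c2": (8, 5, 2)}
NON_CONVERTERS = {"n1": (9, 4, 2), "n2": (20, 0, 0)}


def centroid(vectors):
    """Average the vectors component by component."""
    vectors = list(vectors)
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def lookalike_scores(converters, non_converters):
    """Score each non-converter by similarity to the average converter."""
    target = centroid(converters.values())
    return {
        user: cosine_similarity(vector, target)
        for user, vector in non_converters.items()
    }


scores = lookalike_scores(CONVERTERS, NON_CONVERTERS)
# Highest-scoring users first: candidates for more active targeting.
ranked = sorted(scores, key=scores.get, reverse=True)
```

In this toy data, n1 behaves much like the converters and ranks above n2, who saw many adverts but never engaged. A production model would need far richer features and a proper probability estimate, but the ranking idea is the same.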
IM: What makes IM’s Data team stand out from the rest?
SA: This collaboration between Infectious Media’s data engineers and scientists is a great example of how we can use data to extend the reach of our programmatic strategies and increase performance, benefiting ourselves and, of course, our clients. The extensive knowledge, wealth of data and partnership between the two teams make our offering unique within the programmatic industry.
Samir Arfoui, Lead Data Engineer, Infectious Media.