Data Management

Lookalike Modeling Your Ad Ops Team Can Build With a DMP

Digital publishers and advertisers that have access to a Data Management Platform (DMP) can bootstrap their own data modeling, or lookalike modeling, capabilities with some simple index-based approaches. That is to say, if you know the total population of every segment and, for any specific target segment, how many users of every other segment overlap with that target, you can build a fast and easily understood audience model with a little legwork. It’s not the rocket-science approach of a regression model or black-box algorithm, but it works, and it’s pretty easy for people without a degree in data science to execute once you figure out how to get the right data out of your system.

How to Do Lookalike Modeling Yourself

The first step in building a lookalike segment is to define what you are trying to model, that is, what audience you want more of. This will be your ‘target’. For our example here, let’s consider the following audiences:

Segment | Qualified Users | % of Total
Women | 20,000 | 20%
Pet Owners | 5,000 | 5%
Coffee Drinkers | 8,000 | 8%
Outdoor Enthusiasts | 9,000 | 9%
Total Users | 100,000 | 100%

Let’s say we’re trying to reach females. Unfortunately, we can only identify 20,000 of them out of a total population of 100,000. Now let’s assume that our content isn’t skewed to one gender or another, and therefore there are clearly some users among the other 80,000 that we can expect to be female. But we need to find a signal within that group that points us to which other audiences are likely to be female.

What we need to do then is compare every other audience to our female audience, and figure out how many users of each of our other segments overlap with our female segment.  To do that, we need to pull another table of data – let’s add a few more audiences while we’re at it.

Test Segment | Total Users in Test Segment | Overlap (Number of Females in Test Segment)
Pet Owners | 5,000 | 1,500
Coffee Drinkers | 8,000 | 500
Outdoor Enthusiasts | 9,000 | 1,200
Business Travelers | 14,000 | 3,000
Sports Fans | 2,800 | 1,000
Avid Readers | 7,000 | 900

Now, since every audience has a different total population, and every overlap of one audience with another is also different, we need a way to compare one overlap to another. For example, just because there are a greater number of men over 6 feet tall in China than in Norway doesn’t mean Chinese men are more likely to be over 6 feet tall than Norwegians. To know for sure, you need to know the total population of each country and figure out whether men are more likely to be over 6 feet tall in China or Norway relative to their population. And that’s exactly what we need to do when building our lookalike segment: we need to determine if one audience is more or less likely to be female relative to its population.

To do that, we divide the overlap between each test segment (pet owners, coffee drinkers, etc.) and the target segment by the population of the target segment (females). That gives us the concentration of each test segment within the target, which we can then compare to the concentration of that same test segment in the overall population. So, with some simple division, we divide the overlap figures from the table above by the total population of females, and get the following:

Segment | Total Users in Test Segment | Overlap | Total Females | Concentration of Test Segment in Female Segment
Pet Owners | 5,000 | 1,500 | 20,000 | 7.5%
Coffee Drinkers | 8,000 | 500 | 20,000 | 2.5%
Outdoor Enthusiasts | 9,000 | 1,200 | 20,000 | 6%
Business Travelers | 14,000 | 3,000 | 20,000 | 15%
Sports Fans | 2,800 | 1,000 | 20,000 | 5%
Avid Readers | 7,000 | 900 | 20,000 | 4.5%
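The division described above can be sketched in a few lines of Python. This is only an illustration using the figures from the tables; the names `overlap_with_target`, `TARGET_SIZE`, and `concentration_in_target` are made up for the example and do not come from any DMP.

```python
# Overlap counts from the table above: users in each test segment
# who are also in the target (female) segment.
overlap_with_target = {
    "Pet Owners": 1_500,
    "Coffee Drinkers": 500,
    "Outdoor Enthusiasts": 1_200,
    "Business Travelers": 3_000,
    "Sports Fans": 1_000,
    "Avid Readers": 900,
}

TARGET_SIZE = 20_000  # total identified females


def concentration_in_target(overlap: int, target_size: int) -> float:
    """Share of the target segment that also belongs to the test segment."""
    return overlap / target_size


concentrations = {
    segment: concentration_in_target(overlap, TARGET_SIZE)
    for segment, overlap in overlap_with_target.items()
}

for segment, share in concentrations.items():
    print(f"{segment}: {share:.1%}")
```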

Finally, if we divide the concentration of each test segment in the female segment by the concentration of that test segment in the total population, we get an index, or a comparison of one relative figure to another. All we need to do is multiply each ratio by 100, so that 100 becomes our benchmark. Pet owners, for example, make up 7.5% of females but only 5% of all users, so their index is 7.5 / 5 × 100 = 150. Any audience with an index greater than 100 tells us the test segment is more likely to contain female users than the general population, and any audience with an index less than 100 tells us the test segment is less likely to contain female users than the general population.

Test Segment | Total Users in Test Segment | Overlap | Concentration of Test Segment in Total Population | Concentration of Test Segment in Female Segment | Relative Concentration of Test Segment in Female Segment (Index)
Pet Owners | 5,000 | 1,500 | 5% | 7.5% | 150
Coffee Drinkers | 8,000 | 500 | 8% | 2.5% | 31
Outdoor Enthusiasts | 9,000 | 1,200 | 9% | 6% | 67
Business Travelers | 14,000 | 3,000 | 14% | 15% | 107
Sports Fans | 2,800 | 1,000 | 2.8% | 5% | 179
Avid Readers | 7,000 | 900 | 7% | 4.5% | 64
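For anyone who would rather script this than build it in a spreadsheet, here is a minimal sketch of the index calculation using the same numbers. The function name `lookalike_index` and the data layout are assumptions for illustration only.

```python
TOTAL_USERS = 100_000  # total population
TARGET_SIZE = 20_000   # users in the target (female) segment

# test segment -> (total users in test segment, overlap with target)
segments = {
    "Pet Owners": (5_000, 1_500),
    "Coffee Drinkers": (8_000, 500),
    "Outdoor Enthusiasts": (9_000, 1_200),
    "Business Travelers": (14_000, 3_000),
    "Sports Fans": (2_800, 1_000),
    "Avid Readers": (7_000, 900),
}


def lookalike_index(segment_size, overlap,
                    target_size=TARGET_SIZE, total_users=TOTAL_USERS):
    """Index of a test segment against the target; 100 is the benchmark."""
    share_of_target = overlap / target_size          # e.g. 7.5% for pet owners
    share_of_population = segment_size / total_users  # e.g. 5% for pet owners
    return 100 * share_of_target / share_of_population


indexes = {
    name: round(lookalike_index(size, overlap))
    for name, (size, overlap) in segments.items()
}

for name, index in sorted(indexes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {index}")
```

The rounded results match the Index column in the table above.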

So now, with the data above, if you wanted to model an audience of users who are likely to be women but not necessarily known to be women, you could build a segment of users who are pet owners or sports fans but not coffee drinkers, and know they are more likely to be women than the general population. In boolean logic it would be (pet owners OR sports fans) AND NOT coffee drinkers. After you create the new compound audience, you can see how it indexes against your total once the overlapping users are de-duplicated into a single segment, and then refine as necessary.
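If you can export user-level segment membership, the compound audience is just set arithmetic. The sketch below uses a tiny, invented set of user IDs (not the figures from the tables) purely to show the mechanics; a real export would be far larger and the variable names are hypothetical.

```python
# Hypothetical user IDs per segment, invented for illustration.
all_users = set(range(1, 21))   # 20 users in total
women = {1, 2, 3, 4, 5}         # the known target segment
pet_owners = {1, 2, 6, 7}
sports_fans = {2, 3, 8}
coffee_drinkers = {7, 8, 9, 10}

# (pet owners OR sports fans) AND NOT coffee drinkers.
# Set union de-duplicates users who belong to both segments.
lookalike = (pet_owners | sports_fans) - coffee_drinkers


def index_against_target(segment, target, population):
    """Index of a segment against the target; 100 is the benchmark."""
    share_of_target = len(segment & target) / len(target)
    share_of_population = len(segment) / len(population)
    return 100 * share_of_target / share_of_population


compound_index = index_against_target(lookalike, women, all_users)
print(sorted(lookalike), round(compound_index))
```

With this toy data the compound segment covers 4 of 20 users (20% of the population) but 3 of the 5 known women (60% of the target), so it indexes at 300.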

You Can Model Clickers and Converters, Too

The technique above is especially useful for optimizing campaigns that are focused on a click or online conversion metric: you simply track the campaign clickers or converters with a new audience in your DMP, and then index all audiences in your platform against their overlap with the clicking or converting audience. You could, for example, start running every performance-based campaign run-of-site (ROS) to expose every audience to the campaign, and then after a short period of time figure out which audiences are responding more favorably and reliably to the campaign goal.

In an ideal world you have lots of audiences you can overlap against a target: hundreds or even thousands. You could then index all of them against your target, sort them by the index, and then optimize your campaign targeting into the top choices. Which segments you pick, the highest indexing or the largest in scale (there will rarely be an option that is both large and high quality), depends on your goals for the campaign, budget, etc. You can also exclude the lowest-indexing audiences, and reduce your distribution against lower-performing audiences.
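The sort-and-pick step might look something like the following sketch. The index values come from the earlier table; the thresholds (`min_index`, `min_size`, `max_index`) are arbitrary assumptions that you would tune to your own campaign goals and budget.

```python
from typing import NamedTuple


class SegmentStats(NamedTuple):
    name: str
    size: int      # total users in the segment
    index: float   # index against the target segment


stats = [
    SegmentStats("Pet Owners", 5_000, 150),
    SegmentStats("Coffee Drinkers", 8_000, 31),
    SegmentStats("Outdoor Enthusiasts", 9_000, 67),
    SegmentStats("Business Travelers", 14_000, 107),
    SegmentStats("Sports Fans", 2_800, 179),
    SegmentStats("Avid Readers", 7_000, 64),
]


def rank_by_index(segments):
    """Highest-indexing segments first."""
    return sorted(segments, key=lambda s: s.index, reverse=True)


def pick_targets(segments, min_index=100, min_size=0):
    """Segments that beat the benchmark and meet a minimum scale."""
    return [s.name for s in rank_by_index(segments)
            if s.index > min_index and s.size >= min_size]


def pick_exclusions(segments, max_index=70):
    """Low-indexing segments you might exclude from targeting."""
    return [s.name for s in rank_by_index(segments) if s.index < max_index]


print(pick_targets(stats))                  # quality first
print(pick_targets(stats, min_size=5_000))  # trade some quality for scale
print(pick_exclusions(stats))
```

Raising `min_size` drops the small but high-indexing Sports Fans segment, which is the quality-versus-scale trade-off described above.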

The risk of this technique is that the number of overlapping users is so small that you lack enough of a sample to reach a statistically significant index. In other words, you don’t have enough data to trust the lookalike. To calculate this precisely, you’d need to employ a statistician; however, my rule of thumb has been to rely on standard sample size tables that clearly define how many users you need to sample from a given population for the result to meet a particular confidence level. You can easily build this check into Excel to compare your overlapping users in the test segment (pet owners in our case) to the target segment (women).

As those tables show, though, in a population of almost any size, roughly 400 users is all you need for a representative sample to meet a 95% confidence level with a ± 5% margin of error. You can use this same check when creating general lookalike audiences, but it tends to be more relevant when working with very small target segments, like users who had to take a particular action. Of course, this isn’t the most sophisticated audience modeling method out there, far from it; but for Ad Ops teams who need to play fast and loose with campaign optimization, it’s a place to start, and a great way to get more out of your investment in a DMP.
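If you would rather compute the threshold than look it up, the sketch below uses Cochran's sample size formula with a finite population correction, which is the usual basis for such tables. That this is the formula behind any particular table is an assumption, and the function names are invented for the example. Treat it as the same rule of thumb, not a substitute for a proper significance test.

```python
import math


def required_sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran's formula with finite population correction.

    z=1.96 corresponds to a 95% confidence level, margin=0.05 to a
    +/- 5% margin of error, and p=0.5 is the most conservative
    assumption about the underlying proportion.
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2  # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def overlap_is_large_enough(overlap, population):
    """True if the overlap meets the rule-of-thumb sample size."""
    return overlap >= required_sample_size(population)


for population in (1_000, 5_000, 20_000, 100_000):
    print(population, required_sample_size(population))

# Pet owners: 1,500 overlapping users out of a 5,000-user test segment.
print(overlap_is_large_enough(1_500, 5_000))
```

The required size levels off just under 400 as the population grows, which is why the same figure works for populations of almost any size.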

Data Management Part IV: Syncing Offline Data To Your DMP

Before the internet and digital advertising, direct mail solicitation was perhaps the most technologically advanced form of marketing out there.  Even today, as much as interactive marketers like to poke fun at traditional media people, the direct mail industry is far more sophisticated at accurate audience segmentation and message delivery than most of the digital realm.  Since everything in the snail mail world works off your actual name and address, it is far easier to connect data points in your life – the car you drive, your credit score, your age, gender, and plenty else from public records.  Start adding information about your purchase habits from catalogs, your credit cards, and all the hotel and airline loyalty cards stuck in your wallet and the direct marketers can profile you three ways to Sunday. The truth is that it’s far easier to move data offline by matching on a name and address than to move it online with nothing but a cookie.  That said, data companies and marketers alike have a huge incentive to try, because offline data is generally much more reliable and therefore valuable than its online competitors.

Data Management Part III: Syncing Online Data to a Data Management Platform

To get the full value out of a relationship with a data management platform, you want to provide the platform with as much data as possible. That said, the low-hanging fruit in any organization is to integrate into the DMP the 1st party data for which you already have a cookie. The mechanism to accomplish this is your standard cookie sync, which passes a user ID from one system to another via a query string appended to a pixel call, and ideally, a server-to-server integration after that.

Practically speaking, this means that when a user hits your site and calls your site analytics tag, either independently or through a container tag, that site analytics tag redirects the user to the DMP, and simultaneously passes the site analytics user ID to the DMP. When the DMP receives that call, it cookies the same user and also records what the site analytics user ID is. Now the DMP knows how to associate data from the site analytics tool with its own cookie ID. The beauty of this system is that only the user IDs need to be synced at this time, and the actual data that the site analytics tool records can be passed to the DMP later, without slowing down the user experience on site. Now imagine replicating this process with all 3rd party tools, and syncing all systems into the DMP.
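To make the mechanics concrete, here is a toy sketch of the ID handoff described above. Everything in it is hypothetical: the endpoint `dmp.example.com`, the parameter names `partner` and `partner_uid`, and the `MatchTable` class are invented for illustration and do not reflect any particular vendor's API.

```python
from urllib.parse import urlencode, urlparse, parse_qs

DMP_SYNC_ENDPOINT = "https://dmp.example.com/sync"  # hypothetical endpoint


def build_sync_url(partner, partner_user_id):
    """Pixel URL the partner tag redirects to, carrying its own user ID."""
    query = urlencode({"partner": partner, "partner_uid": partner_user_id})
    return f"{DMP_SYNC_ENDPOINT}?{query}"


class MatchTable:
    """What the DMP keeps: partner user ID -> its own cookie ID."""

    def __init__(self):
        self._by_partner_id = {}

    def record_sync(self, sync_url, dmp_user_id):
        # Read the partner's ID off the query string and store the pairing.
        params = parse_qs(urlparse(sync_url).query)
        key = (params["partner"][0], params["partner_uid"][0])
        self._by_partner_id[key] = dmp_user_id

    def dmp_id_for(self, partner, partner_user_id):
        return self._by_partner_id.get((partner, partner_user_id))


table = MatchTable()
url = build_sync_url("site_analytics", "abc123")
table.record_sync(url, dmp_user_id="dmp-42")

# Later, data keyed by the analytics ID can be attached to the DMP's user.
print(table.dmp_id_for("site_analytics", "abc123"))
```

Only the two IDs change hands at sync time; the behavioral data itself can follow later in a batch, keyed by the partner ID.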

Data Management Part II: Centralize and Synchronize Your User Data

A critical component of any DMP is the ability to centralize your audience data from multiple systems into a single interface. DMPs do this through a NoSQL database management system that imports your data from multiple systems using a match key between each system that they form via, what else, a cookie sync. It sounds complicated but it isn’t. Let’s take an example from the marketer side to explain the concept.

Say you run a large eCommerce store and want to create audience-based marketing campaigns around different customer groups. You send a weekly newsletter with a few hundred thousand users signed up, you have a site analytics tool, you have an order management database or other CRM system, and you buy media through a DSP. Each system fulfills a specific business need, but generally speaking they operate in parallel and do not talk to each other. So there’s no way for you to specifically target users on your DSP who are also signed up for your newsletter, or who are signed up for your newsletter and have also visited three or more pages in the mystery novels section of your site in the past 30 days. You have a site analytics cookie on the user’s machine, but no newsletter cookie, and even if you did, how would you identify the same user in both systems? In order to get your newsletter system to talk to your site analytics system and push that information to your DSP for future media campaigns, you need to find a way to identify the same user between systems. This is where the DMP comes in.

Data Management Part I: What Are Data Management Platforms?

If you’re working in digital advertising today and not losing sleep over your data strategy (or lack thereof), climb out from under your rock and join the rest of us trying to figure out how to leverage the mountain of consumer intent and behavior collecting on the doorstep each day. From both the marketer and publisher perspective, data isn’t the problem; access is the problem. Each party has access to vast amounts of data, either directly or through 3rd party channels, but centralizing, organizing, analyzing, and segmenting it are very difficult for all but the largest companies. Unless you have a pedigreed team that speaks SAS and Oracle and understands how to use an IBM supercomputer, or have a team of PhDs on the payroll, building your own solution to this problem just isn’t realistic. It just doesn’t exist in the DNA of most advertising companies today, at least not yet.