Digital publishers and advertisers with access to a Data Management Platform (DMP) can bootstrap their own lookalike modeling capabilities with a simple index-based approach. If you know the total population of every segment and, for a given target segment, how many users of every other segment overlap with it, you can build a fast, easily understood audience model with a little legwork. It is not as sophisticated as a regression model or a black-box algorithm, but it works, and it is easy for people without a degree in data science to execute once you figure out how to get the right data out of your system.
How to Do Lookalike Modeling Yourself
The first step in building a lookalike segment is to define what you are trying to model, that is, which audience you want more of. This will be your ‘target’. For our example here, let’s consider the following audiences:
|Segment||Qualified Users||% of Total|
Let’s say we’re trying to reach females. Unfortunately, we can identify only 20,000 of them out of a total population of 100,000. Now let’s assume that our content isn’t skewed toward one gender or the other, so some of the other 80,000 users are surely female as well. We need to find a signal within that group that tells us which other audiences are likely to be female.
What we need to do then is compare every other audience to our female audience, and figure out how many users of each of our other segments overlap with our female segment. To do that, we need to pull another table of data – let’s add a few more audiences while we’re at it.
|Test Segment||Total Users in Test Segment||Overlap (Number of Females in Test Segment)|
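As a rough illustration of how the overlap column could be produced, here is a minimal Python sketch. It assumes the DMP can export each segment as a set of user IDs; the IDs, the segment contents, and the `overlap_counts` helper are all hypothetical and are not figures from this article.

```python
# Hypothetical user-ID sets; a real DMP export would supply these.
females = {1, 2, 3, 4}
test_segments = {
    "pet owners": {1, 2, 5},
    "coffee drinkers": {3, 6, 7, 8},
}


def overlap_counts(target, segments):
    """Return {segment name: (segment size, users shared with the target)}."""
    return {
        name: (len(users), len(users & target))
        for name, users in segments.items()
    }


counts = overlap_counts(females, test_segments)
for name, (total, overlap) in counts.items():
    print(f"{name}: {total} users, {overlap} also in the target segment")
```

Set intersection counts each shared user once, which is the overlap figure the table calls for.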
Now, since every audience has a different total population, and every overlap of one audience with another is also different, we need a way to compare one overlap to another. For example, just because there are more men over 6 feet tall in China than in Norway doesn’t mean Chinese men are more likely to be over 6 feet tall than Norwegians. To know for sure, you need the total population of each country so you can work out whether men are more likely to be over 6 feet tall in China or in Norway relative to population. That’s exactly what we need to do when building our lookalike segment: we need to determine whether one audience is more or less likely to be female relative to its population.
To do that, we divide the overlap between each test segment (pet owners, coffee drinkers, etc.) and the target segment by the population of the target segment (females). This gives the concentration of each test segment within the female segment, which we can then compare with that test segment’s concentration in the overall population. So, with some simple division, we divide the overlap figures from the table above by the total population of females, and get the following:
|Segment||Total Users in Test Segment||Overlap||Total Females||Concentration of Test Segment in Female Segment|
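The division in this step can be sketched in a few lines of Python. The 20,000 total females comes from the example above; the overlap of 5,000 pet owners and the `concentration` helper are assumptions made up for illustration only.

```python
def concentration(overlap, target_total):
    """Share of the target segment that also belongs to the test segment."""
    if target_total <= 0:
        raise ValueError("target_total must be positive")
    return overlap / target_total


TOTAL_FEMALES = 20_000        # from the example in the text
pet_owner_overlap = 5_000     # hypothetical overlap, for illustration only

share = concentration(pet_owner_overlap, TOTAL_FEMALES)
print(f"Concentration of pet owners in the female segment: {share:.1%}")
```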
Finally, if we divide the concentration of each test segment in the female segment by the concentration of that test segment in the total population, we get a comparison of one relative figure to another. Multiplying that ratio by 100 turns it into an index with 100 as the benchmark. Any audience with an index greater than 100 tells us the test segment is more likely to contain female users than the general population, and any audience with an index less than 100 tells us the test segment is less likely to contain female users than the general population.
|Test Segment||Total Users in Test Segment||Overlap||Concentration of Test Segment in Total Population||Concentration of Test Segment in Female Segment||Relative Concentration of Test Segment in Female Segment (Index)|
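Put together, the index calculation might look like the sketch below. The totals of 20,000 females and 100,000 users come from the example above; the two test segments' sizes and overlaps are hypothetical numbers chosen to show one index above 100 and one below, and `lookalike_index` is an illustrative name rather than anything a DMP provides.

```python
def lookalike_index(segment_total, overlap, target_total, population_total):
    """Index of a test segment against a target segment (100 = benchmark)."""
    concentration_in_target = overlap / target_total
    concentration_in_population = segment_total / population_total
    return 100 * concentration_in_target / concentration_in_population


POPULATION = 100_000   # total users, from the example in the text
FEMALES = 20_000       # target segment size, from the example in the text

# Hypothetical (segment size, overlap with females) pairs.
hypothetical_segments = {
    "pet owners": (10_000, 3_000),
    "coffee drinkers": (40_000, 6_000),
}

for name, (total, overlap) in hypothetical_segments.items():
    index = lookalike_index(total, overlap, FEMALES, POPULATION)
    verdict = "more" if index > 100 else "less"
    print(f"{name}: index {index:.0f} ({verdict} likely to be female)")
```

With these made-up inputs, pet owners are 15% of females but only 10% of all users, so they index at 150; coffee drinkers are 30% of females but 40% of all users, so they index at 75.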
So now, with the data above, if you wanted to model an audience of users who are likely to be women but not necessarily known to be women, you could build a segment of pet owners or sports fans who are not coffee drinkers, and know they are more likely than the general population to be women. In boolean logic it would be (pet owners OR sports fans) NOT coffee drinkers. After you create the new compound audience, you can see how it ends up indexing to your total once the overlapping users are de-duplicated into a single segment, and then refine as necessary.
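The boolean expression maps directly onto set operations. The following is a minimal sketch with made-up user IDs; `build_compound_segment` is a hypothetical helper, and a real DMP would normally express the same logic in its own segment builder.

```python
def build_compound_segment(include, exclude):
    """Union the included segments, then drop anyone in an excluded segment."""
    included = set().union(*include)      # OR: union also de-duplicates users
    excluded = set().union(*exclude)
    return included - excluded            # NOT: remove excluded users


# Hypothetical user IDs, for illustration only.
pet_owners = {1, 2, 3}
sports_fans = {3, 4, 5}
coffee_drinkers = {2, 5, 6}

# (pet owners OR sports fans) NOT coffee drinkers
lookalike = build_compound_segment(
    include=[pet_owners, sports_fans],
    exclude=[coffee_drinkers],
)
print(sorted(lookalike))
```

User 3 belongs to both included segments but appears only once in the result, which is the de-duplication step described above. The resulting set can then be indexed against the target in the same way as any single test segment.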