Digital publishers and advertisers that have access to a Data Management Platform (DMP) can bootstrap their own lookalike modeling capabilities with some simple index-based approaches. That is to say, if you know the total population of users in every segment and, for any specific target segment, how many users of every other segment overlap with it, you can build a fast and easily understood audience model with a little legwork. It’s not the rocket-science approach of a regression model or black-box algorithm, but it works, and it’s pretty easy for people without a degree in data science to execute once you figure out how to get the right data out of your system.

## How to Do Lookalike Modeling Yourself

The first step in building a lookalike segment is to define what you are trying to model, that is, which audience you want more of. This will be your ‘target’. For our example here, let’s consider the following audiences:

Segment | Qualified Users | % of Total |
---|---|---|
Women | 20,000 | 20% |
Pet Owners | 5,000 | 5% |
Coffee Drinkers | 8,000 | 8% |
Outdoor Enthusiasts | 9,000 | 9% |
Total Users | 100,000 | 100% |

Let’s say we’re trying to reach females. Unfortunately, we can only identify 20,000 of them out of a total population of 100,000. Now let’s assume that our content isn’t skewed toward one gender or another, so there are clearly some users among the other 80,000 that we can expect to be female. What we need is a signal within that group that tells us which other audiences are likely to be female.

What we need to do then is compare every other audience to our female audience, and figure out how many users of each of our other segments overlap with our female segment. To do that, we need to pull another table of data – let’s add a few more audiences while we’re at it.

Test Segment | Total Users in Test Segment | Overlap (Number of Females in Test Segment) |
---|---|---|
Pet Owners | 5,000 | 1,500 |
Coffee Drinkers | 8,000 | 500 |
Outdoor Enthusiasts | 9,000 | 1,200 |
Business Travelers | 14,000 | 3,000 |
Sports Fans | 2,800 | 1,000 |
Avid Readers | 7,000 | 900 |

Now, since every audience has a different total population, and every overlap of one audience with another is also different, we need a way to compare one overlap to another. For example, just because there are more men over 6 feet tall in China than in Norway doesn’t mean Chinese men are more likely to be over 6 feet tall than Norwegians. To know for sure, you need to know the total population of each country and figure out whether men are more likely to be over 6 feet tall in China or Norway *relative to their population*. That’s exactly what we need to do when building our lookalike segment: determine whether one audience is more or less likely to be female relative to its population.

To do that, we divide each test segment’s overlap with the target segment (the number of pet owners, coffee drinkers, etc. who are also female) by the population of the target segment (females). This gives us the concentration of each test segment within the female segment, which we can then compare to that test segment’s concentration in the overall population. So, with some simple division, we divide the overlap figures from the table above by the total population of females, and get the following:

Segment | Total Users in Test Segment | Overlap | Total Females | Concentration of Test Segment in Female Segment |
---|---|---|---|---|
Pet Owners | 5,000 | 1,500 | 20,000 | 7.5% |
Coffee Drinkers | 8,000 | 500 | 20,000 | 2.5% |
Outdoor Enthusiasts | 9,000 | 1,200 | 20,000 | 6% |
Business Travelers | 14,000 | 3,000 | 20,000 | 15% |
Sports Fans | 2,800 | 1,000 | 20,000 | 5% |
Avid Readers | 7,000 | 900 | 20,000 | 4.5% |
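If you would rather script this step than do it in a spreadsheet, the division can be sketched in a few lines of Python. The figures are copied from the tables above; the variable and function names are made up for illustration and do not come from any particular DMP.

```python
# Figures copied from the overlap table; all names here are illustrative only.
TOTAL_FEMALES = 20_000  # size of the target segment

# test segment -> number of its users who are also in the female segment
overlap_with_females = {
    "Pet Owners": 1_500,
    "Coffee Drinkers": 500,
    "Outdoor Enthusiasts": 1_200,
    "Business Travelers": 3_000,
    "Sports Fans": 1_000,
    "Avid Readers": 900,
}


def concentration_in_target(overlap: int, target_size: int) -> float:
    """Share of the target segment that also belongs to the test segment."""
    return overlap / target_size


concentrations = {
    name: concentration_in_target(overlap, TOTAL_FEMALES)
    for name, overlap in overlap_with_females.items()
}

# Print each concentration as a percentage with one decimal place.
for name, share in concentrations.items():
    print(f"{name}: {share:.1%}")
```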

Finally, if we divide the concentration of each test segment in the female segment by the concentration of that test segment in the total population, we get an index: a comparison of one relative figure to another. All that remains is to multiply each ratio by 100, so that 100 becomes our benchmark. Any audience with an index greater than 100 tells us the test segment is more likely to contain female users than the general population, and any audience with an index less than 100 tells us the test segment is less likely to contain female users than the general population.
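Written out as a single formula, using the column names from the tables as shorthand (these labels are just descriptive, not fields a DMP necessarily reports under those names):

$$
\text{Index} = \frac{\text{Overlap} / \text{Total Females}}{\text{Total Users in Test Segment} / \text{Total Users}} \times 100
$$

For Pet Owners, that works out to:

$$
\frac{1{,}500 / 20{,}000}{5{,}000 / 100{,}000} \times 100 = \frac{7.5\%}{5\%} \times 100 = 150
$$

Rearranging the same fraction gives (1,500 / 5,000) ÷ (20,000 / 100,000): 30% of pet owners are female against 20% of all users, which is the same 150.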

Test Segment | Total Users in Test Segment | Overlap | Concentration of Test Segment in Total Population | Concentration of Test Segment in Female Segment | Relative Concentration of Test Segment in Female Segment (Index) |
---|---|---|---|---|---|
Pet Owners | 5,000 | 1,500 | 5% | 7.5% | 150 |
Coffee Drinkers | 8,000 | 500 | 8% | 2.5% | 31 |
Outdoor Enthusiasts | 9,000 | 1,200 | 9% | 6% | 67 |
Business Travelers | 14,000 | 3,000 | 14% | 15% | 107 |
Sports Fans | 2,800 | 1,000 | 2.8% | 5% | 179 |
Avid Readers | 7,000 | 900 | 7% | 4.5% | 64 |
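The whole index table can be reproduced with a short script, which is handy once you have dozens of segments rather than six. This is a minimal sketch using the figures from the tables; `lookalike_index` and the other names are hypothetical, chosen only for this example.

```python
# Figures copied from the tables above; all names are illustrative only.
TOTAL_USERS = 100_000
TOTAL_FEMALES = 20_000

# test segment -> (total users in segment, overlap with the female segment)
test_segments = {
    "Pet Owners": (5_000, 1_500),
    "Coffee Drinkers": (8_000, 500),
    "Outdoor Enthusiasts": (9_000, 1_200),
    "Business Travelers": (14_000, 3_000),
    "Sports Fans": (2_800, 1_000),
    "Avid Readers": (7_000, 900),
}


def lookalike_index(segment_size: int, overlap: int,
                    target_size: int, population: int) -> int:
    """Concentration in the target divided by concentration overall, times 100."""
    share_in_target = overlap / target_size          # e.g. 7.5% for Pet Owners
    share_in_population = segment_size / population  # e.g. 5% for Pet Owners
    return round(share_in_target / share_in_population * 100)


indexes = {
    name: lookalike_index(size, overlap, TOTAL_FEMALES, TOTAL_USERS)
    for name, (size, overlap) in test_segments.items()
}

# List segments from strongest to weakest signal against the benchmark of 100.
for name, value in sorted(indexes.items(), key=lambda item: item[1], reverse=True):
    verdict = "above benchmark" if value > 100 else "at or below benchmark"
    print(f"{name}: {value} ({verdict})")
```

Sorting by index makes the candidates for inclusion (high index) and exclusion (low index) easy to spot at a glance.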

So now, with the data above, if you wanted to model an audience of users who are likely to be women but not necessarily known to be women, you could build a segment of pet owners or sports fans who are not coffee drinkers, and know they are more likely than the general population to be women. In boolean logic it would be (pet owners OR sports fans) NOT coffee drinkers. After you create the new compound audience, you can see how it ends up indexing to your total once the overlapping users are de-duplicated into a single segment, and then refine as necessary.
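To make the boolean logic and the de-duplication concrete, here is a small sketch using Python sets. The user IDs are toy data invented for this example (20 users, 4 of them known women), and it assumes you can get at user-level segment membership, which not every DMP exposes.

```python
# Toy data, invented for illustration: 20 users, 4 of whom are known women.
all_users = set(range(1, 21))
women = {1, 2, 3, 4}
pet_owners = {1, 2, 5, 6}
sports_fans = {2, 3, 7}
coffee_drinkers = {3, 6, 8, 9}

# (pet owners OR sports fans) NOT coffee drinkers.
# Set union de-duplicates users who sit in both segments (user 2 here).
lookalike = (pet_owners | sports_fans) - coffee_drinkers

# Index the compound audience against the target, same method as before.
share_in_target = len(lookalike & women) / len(women)
share_in_population = len(lookalike) / len(all_users)
index = round(share_in_target / share_in_population * 100)

# Users in the modeled audience who are not already known to be women.
prospects = lookalike - women

print(sorted(lookalike), index, sorted(prospects))
```

The `prospects` set is the part you actually gain: users who look like the target but were not already identified as belonging to it.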

Hey Ben,

Could you please explain how indexing works? I don’t really understand the multiplication with 100 method.

Thanks,

Chico

Hi Chico,

Indexing is just a way to create a relative metric so that you can compare two things that are different sizes from an absolute point of view. For example, in the article I mention how you might determine which country in the world has the most tall people; to do that you wouldn’t just want to count the sheer number of tall people in each country, because every country has a different population. Of course China will have more tall people than Norway, because its total population is 250 times as large. Rather, you’d want to know how many tall people there are per some standard unit, like 1,000. If you knew how many tall people per thousand each country has, you could then say which country truly has more tall people on a per capita basis. And that’s what indexing is: converting any absolute figure into a relative, or per capita, figure that you can use to make accurate comparisons.

Specifically to your question, the only reason I end up multiplying by 100 is to make the number easy to read. .03 vs. .11 is the same as 3 vs. 11, but I find 3 vs. 11 to be easier figures to work with, so I just multiply all figures by 100 to change a percentage into a whole number.

Hope that makes sense –

Ben

I just happened upon this article and I have the same question regarding your index #. Specifically, how are you arriving at the 150 index for Pet Owners? No matter how I try to slice the data I can’t get the math to work. Can you explain the equation?

Hi LAQ,

The calculation divides the concentration of Pet Owners (the test segment) in the Female segment (7.5%) by the concentration of Pet Owners in the Total Population (5%). So, 7.5 / 5 = 1.5, and then 1.5 * 100 = 150. We multiply by 100 just to make the number easier to read, and because we do it to every test segment’s result, we’re not changing the relationship between the figures; we’re just scaling them all up by a factor of 100.

What we’re basically saying is “how concentrated are Pet Owners among women vs Pet Owners on the site in general?” If the index is higher than average (the site in general), then we know, generally speaking, Pet Owners have a higher propensity to be women than the average user on the site, and we can quantify that propensity with our index.

Hope that helps!

Ben

Hey Ben,

In test segments like Pet owners etc., how did you arrive at overlap no. (i.e. overlap no. of female in the segment)?

Thanks,

MN

Hi MN,

Your DMP should be able to provide this to you through a custom report; essentially you need a matrix style report.

For your data, you’d want all your segment IDs as column headers as well as row headers, with the overlapping members between any two segments as the cell value where they intersect. Another way to do it would be to create a segment that combines Pet Owners AND Women in the definition, but you’ll have to create a huge number of segments to get at every possibility. The whole point of this analysis is that you want to find the strong correlations without any prejudice and simply let the data tell you what matters.
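Every DMP formats this report differently, but as a rough sketch of what the matrix contains, here is how you could build one yourself if you are able to export user IDs per segment. That export is an assumption, since some platforms only expose aggregate counts, and the segment data below is toy data invented for the example.

```python
from itertools import product

# Hypothetical export: segment name -> set of user IDs (toy data).
memberships = {
    "Women": {1, 2, 3, 4},
    "Pet Owners": {1, 2, 5, 6},
    "Coffee Drinkers": {3, 6, 8, 9},
}

# Cell (row, col) holds the number of users who belong to both segments.
# The diagonal is simply each segment's own size.
overlap = {
    (row, col): len(memberships[row] & memberships[col])
    for row, col in product(memberships, repeat=2)
}

# Print a tab-separated matrix with segments as both row and column headers.
names = list(memberships)
print("\t".join(["Segment"] + names))
for row in names:
    print("\t".join([row] + [str(overlap[(row, col)]) for col in names]))
```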

Hope that helps!

Ben