Measurement & Tracking

A Primer on Data Leakage for Digital Publishers

In this new four-part series on data leakage, I’ll explore how data leakage snuck up on the digital publishing industry as a critical business risk, how data leakage happens, what the costs are, and how publishers can create a policy around their data to manage the risk and capitalize on the opportunity.

What is Data Leakage?

In the digital advertising world, data leakage means the unwanted or unknowing transfer of audience data from one party to another, typically from a publisher to an advertiser, although in some cases, from an advertiser to an intermediary, such as a data exchange or ad network.

That’s my attempt at a Webster’s definition, but plainly speaking, when people talk about data leakage as it relates to interactive advertising, in almost all cases they’re talking about advertisers, ad networks, and data companies dropping cookies on users through ad redirects running on a publisher’s site without that publisher knowing it or wanting it.  The thing is, advertisers have been doing that for years for benign purposes, like tracking ROI: for example, to see how many users from a content buy made it to their website or conversion page.  Advertisers would drop a cookie on a user through their ad tag, and if the same cookie was recognized on a landing page at some point in the future, they could assign value to their ad buy, what the ad world calls ‘attribution’.  Measuring ROI was great, but that’s about all you could do with that cookie pool.  As an advertiser, even if you knew all the people in your cookie pool were sourced while reading up on leasing a new Rolls Royce, thus including them in an extremely high-value and rare audience segment, what could you really do with that pile of cookies?

Nothing, that’s what.  So publishers didn’t pay much attention to the practice.  For an advertiser though, it’s pretty easy to drop a cookie with a callback in your redirect, so dropping third party cookies out of ad buys was fairly common in a short while. After all, this is the internet – if you can measure something, why not measure it?
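To make the attribution mechanic described above concrete, here is a minimal sketch in Python. It assumes the advertiser keeps one log of cookie IDs set through the ad tag and another of cookie IDs seen on the landing page; the function name, cookie IDs, and site names are all made up for illustration and don't reflect any particular ad server.

```python
# Hypothetical illustration: the cookie IDs, site names, and log formats are invented.

def attribute_conversions(ad_exposures, landing_visits):
    """Return the cookie IDs that were set on the ad buy and later seen on the landing page."""
    exposed = {cookie_id for cookie_id, _site in ad_exposures}
    return sorted(exposed & set(landing_visits))


# (cookie ID, publisher site where the ad tag dropped the cookie)
ad_exposures = [
    ("c1", "publisher-a.example"),
    ("c2", "publisher-a.example"),
    ("c3", "publisher-b.example"),
]
# cookie IDs recognized on the advertiser's landing page
landing_visits = ["c2", "c3", "c9"]

print(attribute_conversions(ad_exposures, landing_visits))
```

Note that nothing here does more than count matches, which is why the practice looked harmless for so long.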

Gradually though, through the increased innovation in the industry and regular practice of cookie or pixel-dropping, publishers have been caught with their pants down.  Today as an advertiser you can absolutely take action against any data you can collect or cookie pool you can build, and often those actions are in direct competition with a publisher’s sales force.  The potential impact to revenue is huge, especially as programmatic buying through ad exchanges continues to build steam.

So what happened?  How did the cookie go from a background distraction to a covert business liability? In the next post, I’ll review a brief history of data collection online and explain how data leakage made it into the mainstream.

Read Next – Audience Analytics Lights the Data Leakage Fuse

Tracking Billable Impressions and 3rd Party Discrepancies with Ad-Juster

The Problem with 3rd Party Discrepancies

It’s a sad fact that after more than a decade of innovation and growth in the digital display business, virtually nothing has been done to address the cost of 3rd party discrepancies on the industry.  As I detailed in my post on how 3rd party ad serving works, because Publisher ad servers and Marketer ad servers count an impression at a different point in the technical process, there is always a variance in the numbers, and reconciling those figures to cut an invoice is a manual, time-consuming process, and a huge administrative cost on the industry.  Discrepancies are typically around 10%, but can often exceed this, especially if there is a technical problem with the ad.  In virtually all cases however, publishers simply have to accept losses due to discrepancies as the cost of doing business.
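For reference, the arithmetic behind a discrepancy figure is simple. The sketch below assumes the discrepancy is expressed as a percentage of the publisher's count; some teams use the third party's count as the base instead, so treat the formula and the function name as one convention rather than a standard.

```python
def discrepancy_pct(publisher_impressions, third_party_impressions):
    """Percent of publisher-counted impressions that the third party did not count.

    Assumption: the publisher's number is the base of the percentage.
    """
    if publisher_impressions <= 0:
        raise ValueError("publisher_impressions must be positive")
    shortfall = publisher_impressions - third_party_impressions
    return 100 * shortfall / publisher_impressions


# Example: the local ad server counted 1,000,000 and the third party counted 900,000.
print(discrepancy_pct(1_000_000, 900_000))  # 10.0
```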

Third party ad servers have never made it easy to address this issue.  Their publisher reporting tools are woefully inadequate and in some cases comically inefficient.  For example, the leading ad server, DoubleClick’s DART product, does not provide site-level reports that allow publishers to see everything running on their site from that ad server; it only allows publishers to get reports advertiser-by-advertiser.  That means billing departments at every major online publisher spend days pulling hundreds of reports every month out of DART alone. It also means that for most operations folks, a centralized reporting database that maps 3rd party delivery to local ad server delivery at the creative or flight level and updates automatically is practically a holy grail.

The Industry’s Response: An Impression Exchange

The IAB has recommended their own solution to address the matter via the Impression Exchange project, but I find the project fundamentally flawed.  For one, the technical process the IAB uses to centralize impression reporting between systems adds another call in the ad serving process and so creates a discrepancy on the discrepancy it reports.  Additionally, it has been very slow to win adoption by the ad servers – a year and a half in and DART is the only ad server currently on board.

Ad-Juster, The Superior Solution

A far better solution is for publishers to look at a company called Ad-Juster, which has created a way to centralize third party reports and map third party delivery against their internal flights down to the creative level.  Ad-Juster has essentially mapped the schema for every ad server reporting system, figured out how to pull large data dumps from every major third party ad server on a regular basis and map it with a unique identifier back to the third party tag running on a publisher’s local ad server.  In other words, it allows them to create a unified database across lots of systems. While the system is just a read-only version of the reporting you can get yourself, the speed and automation it brings to the table is very compelling for any large publisher.

Ad-Juster offers some canned reports that actually calculate the discrepancy between systems, as well as some helpful filters that automatically email you a discrepancy report on flights that launched in the last three or five days, for example, which allows operations staff to quickly catch implementation or technical issues.  Since you can monitor the entire network on a regular basis, it is easy to adjust the padding that most publishers add to client goals to make up for an expected discrepancy.  You may very well find that some third parties track closer than others, so you can reduce the padding for those campaigns.  The reports are a boon to operations folks, but also extremely useful for billing departments.  Now that the billing staff doesn’t have to spend all their time waiting on ancient publisher-facing ad server UIs, they can push bills to clients faster.
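As a rough illustration of how padding follows from the expected discrepancy, here is a small Python sketch. The formula is an assumption about one common way to pad (deliver enough locally that the third party still counts the full goal), not something taken from Ad-Juster, and the function name is made up.

```python
import math


def padded_goal(client_goal, expected_discrepancy):
    """Impressions to book locally so the third party still counts the client goal.

    expected_discrepancy is a fraction, e.g. 0.10 for 10%.
    Assumption: the third party counts (1 - expected_discrepancy) of local delivery.
    """
    if not 0 <= expected_discrepancy < 1:
        raise ValueError("expected_discrepancy must be in [0, 1)")
    return math.ceil(client_goal / (1 - expected_discrepancy))


# A third party that tracks closer needs noticeably less padding.
print(padded_goal(1_000_000, 0.10))  # 1111112
print(padded_goal(1_000_000, 0.05))  # 1052632
```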

The system isn’t perfect, especially if you run the same third party tags in multiple flights (the system can have a hard time attributing the right amount of third party impressions at the flight level in that case), but in most cases it offers tremendous benefits.  Recently Ad-Juster has partnered with Solbright and Fattail to push their data into those workflow systems, but they also offer an API for clients that want to push the data to proprietary systems.

Highly recommended for large publishers seeking reporting relief.

Get Pixel Tracking Transparency with Ghostery

Thanks to a series of articles in the WSJ, publishers around the country are taking a hard look at their privacy practices and trying to get a handle on who collects data on their site.  You would think this would be a simple task, after all, the publisher owns the site and controls everything on it, right?

Well, not exactly.  In fact, thanks to the off-site redirects inherent to 3rd party adserving, publishers often have no idea when an advertiser or marketer attempts to redirect the user within a 3rd party ad tag.  Due to the number of players involved, it’s actually quite difficult to assess which tags are attempting to cookie the user for audience aggregation.  If publishers can’t audit their site, how can they enforce their privacy policy and contractual agreements with marketers?

Thankfully, the people at Better Advertising have developed a rather brilliant browser extension called Ghostery to make pixel tracking more transparent.  Ghostery runs on your browser and sifts through all the code and ad calls to quickly identify which 3rd parties are tracking data on your site. This particular example is from Dictionary.com – as you can see, the tool quickly pulls up a list of the various companies with pixels running on the site or somehow spawning to the browser.


From there, you can take a deeper dive on any particular tracker you want, view a brief summary of what the company does, how to access its privacy policy, and even other sites where that company was seen.  I have to say, Ghostery is a quantum leap ahead of other tools for identifying which ads are spawning pixels or running piggyback cookie requests.

Ghostery was actually developed more for Consumers to give them a way to see who is tracking their behavior online and actually block it, but I see huge potential for industry folks as well to audit their site.  Do you know what is running on your site?

P.S. – the Ghostery Blog isn’t half bad, either…

Online Ad Verification & What It Means For Online Publishers

Ad Verification is a relatively new area of digital marketing technology that has the potential to change how marketers buy media and how impressions are valued in the marketplace.

Most Popular Ad Verification Services

Some of the best-known companies that offer ad verification technology are DoubleVerify, AdSafe Media, and AdXPose, though there are others out there.  Wrapping the marketer’s ad with a snippet of javascript, these companies are able to read, or “scrape” the contents of the page calling the adserver and determine all kinds of information about the context and quality of the ad placement, as well as some information about the user viewing the page.  Advertisers use this information to validate the publisher has implemented their ads correctly, and help measure user engagement.

The Ad Verification Process

This image from AdXPose shows basically how the process works:

[Image: How Online Ad Verification Works]

For example, if a Marketer wanted their ads to only run to the New York City area, ad verification technology could give them a report on how many users who saw the ad were not in this geographic target.  Ad verification can also show, on average, how many ad placements were in-view for the user, that is, on the visible area of the browser window as opposed to further down on the page and not actually on screen.  These two functions alone are often enough to convince marketers, as publishers are often reluctant or unable to provide this information, or in some cases, even distort the truth to close a sale.

But ad verification can provide additional value when used in conjunction with an ad network buy or ad exchange buy when marketers have their ads spread across hundreds or even thousands of sites.  Trying to maintain any quality standards has been difficult to impossible in these situations because Networks and Exchanges usually don’t disclose all the publishers with which they work, selling their inventory in a so-called “non-transparent” manner.  Because verification tags scrape the page content as the ad is loading however, they know exactly what domain they’re being served to and can often categorize that site as appropriate for the marketer or not, depending on how the marketer defines it.  A beer company might not mind appearing on a site dedicated to girls in bikinis, but that probably isn’t where a toy company wants to show their ads.  Verification technology can often tell marketers if this is happening to their brand, and if so, can actively prevent or block the ad from showing in that environment.

Who Is Using Ad Verification Today?

Market leaders are already flocking to the service. In early March of 2010, GroupM, the media buying arm of agency powerhouse WPP, started using DoubleVerify tags to enforce ad specs with publishers, and major networks and exchanges are also getting on board.  In late 2009, Traffic Marketplace, Collective Media, Undertone, and Tremor Networks all added verification analytics to their reporting to ease marketer concerns. Data providers are also seeing value here: AudienceScience, Rubicon Project, x+1, and other data vendors want access to verification companies’ site ratings.

For more about ad verification technology, see these links below:

Ad Exchanger: AdSafe Media On Transparency Into Display Ad Inventory and Frame Challenge

MediaPost: GroupM Will Be Watching Where Its Clients’ Ads Run, And Where They Should Not

IAB: IAB Hosts Interactive Advertising’s First-Ever Ad Verification Summit

MediaPost: Mpire Blocks Ads From Appearing On Sites

How to Read Doubleclick Ad Tags and Ad Tag Variables

The term ad tag is thrown around quite a bit, and can refer to almost any link involved in the ad serving process, on the publisher or marketer side. Strictly speaking, Ad Tags are the HTML code a browser uses to fetch an advertisement from an Ad Server – it is a redirect to content rather than content itself.  There are also click tags, action tags, view tags, and other more specific variants within the general ad tag category.  For this particular example, we’ll look at a publisher-side tag, because our purpose is to show how ad tags help publishers organize their content into targetable products.

Ad Tag Components

So, without further ado, feast your eyes on this example of a Doubleclick ad tag:

http://ad.doubleclick.net/ADJ/publisher/zone;topic=abc;sbtpc=def;cat=ghi;kw=xyz;tile=1;slot=728x90.1;sz=728x90;ord=7268140825331981?

An ad tag can tell you quite a bit about how an ad ends up on a page – if you want, navigate to any major publisher and look at the source code; you can probably find a real-life example of a working ad tag. So how can you tell what the ad tag says about the publisher hierarchy and ad targeting? Let’s break it down piece by piece:

http://ad.doubleclick.net/ – this is the host address for the Ad Server – you can see that it is not a publisher’s website, but an independent technology company that has nothing to do with publishing content.  In this example, we’re talking about Doubleclick, the Ad Serving powerhouse that was acquired by Google for $3.1 billion in a deal announced in 2007.

/ADJ – this code defines a specific type of ad call, and what the response can be, i.e., images vs. XML vs. scripts.  For this example, the code ‘ADJ’ is the most common, and only returns images, which will serve via JavaScript.  Other responses can include ADF (only image creatives in a frame), ADX (only image creatives served through streaming technologies), as well as others.  (Thanks to Jared & Paul for correcting!)

/publisher – this is the site code that Doubleclick uses to distinguish one publisher property from another.  For example, the New York Times owns NYTimes.com, About.com, and Boston.com among other properties.  If they are a client of Doubleclick, the corporation likely pays the bill, but each site would have its own site code so ads could be targeted to a specific paper and not the entire network.

/zone – the zone is akin to a channel level, so the Homepage vs. the Arts page, vs. the Sports page.  These content verticals are likely to attract different advertisers, so it’s important for publishers to be able to target to this kind of granularity.

Zone-Based Hierarchy vs. Topic Based Hierarchy

Here is where tagging logic starts to diverge in Doubleclick.  Some publishers prefer to deeply categorize at the zone level, while others keep moving down the hierarchy to the topic level.  The benefit of using zones over the topic, subtopic, category, or keyword levels that we’ll talk about in just a minute is that the zone is the last level in which you can pull historical reporting.  So you might have sports/baseball or even sports/baseball/nymets so you can pull traffic statistics going back months or years.

The downside with this method is that zones are vertical structures, so if you had multiple verticals on your site that all had a games section, you would have to select each games zone every time you wanted to target all games when trafficking the ads, rather than just targeting a single “games” key value.  This sounds easy on paper, but adds up to lots of extra time for your trafficking staff if you have lots of subcategories in each zone.  It wouldn’t be difficult to imagine needing 50 zones or more per content vertical to tag to the lowest level of granularity.

That is why most Publishers tag at a higher level, and leave the granularity to the topic variable and below.  A great benefit of granular topic tagging as opposed to granular zone tagging, aside from being able to use the same topic tag across multiple zones, is the ability for topic tags to handle wild cards when trafficking.  This means if you had topic=newyorkmarathon and topic=bostonmarathon, you could simply target topic=*marathon* and ads would automatically fall into both areas.  This makes trafficking much easier, but has the downside of no historical reporting, which can be a challenge for your Yield or Inventory teams.
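The wild card behavior can be imitated in a few lines of Python. This is only a sketch of the idea using the standard library's shell-style matching; it is an assumption that the ad server's own wild card rules behave the same way, and the topic list and function name are invented for the example.

```python
from fnmatch import fnmatchcase


def matching_topics(pattern, topics):
    """Return the topic values that a wild card target like '*marathon*' would catch."""
    return [topic for topic in topics if fnmatchcase(topic, pattern)]


# Hypothetical topic values tagged across a publisher's zones.
topics = ["newyorkmarathon", "bostonmarathon", "worldseries"]

print(matching_topics("*marathon*", topics))
```

One target line covers both marathon topics, which is the trafficking time saved compared with selecting each zone by hand.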

topic=abc – next in the hierarchy is the topic level. As mentioned above, you can use the topic level to tag similar content across zones, for example games content that appears in several content verticals or in just one of them.

sbtpc=def – next in the hierarchy is the subtopic level.  You might use this to target sportsgames vs. adventuregames for example.  Again, you can use this to target across content verticals or within them.

kw=xyz – the keyword segment isn’t really another level in the hierarchy but a way to describe the page for contextual targeting.  The benefit here is that multiple keywords are allowed.  These are typically used in guides and directories, like a recipe directory, where you would want to be able to target chicken recipes vs. vegetarian recipes vs. winter recipes, etc., allowing some overlapping targeting.

tile=1 – the tile variable sets a unique value for each ad call on a specific page.  If there were two or more of the same size ads on a page, separate tile values would prevent the browser from trying to serve the same ad to multiple ad slots at the same time.

slot=728x90.1 – typically defines the location of the ad tag, but is really just another type of key-value.  While this may seem duplicative with the tile value, it isn’t.  For example, tile values are often set dynamically, in the order they appear on the page.  So the first call is tile=1, the second is tile=2, and so on.  But websites use different templates all the time so the homepage may not have as many ad calls as a category page which may have a different number of calls than an article page, so the tile value isn’t designed to be a consistent variable for use in targeting.  The slot however, is.  For example, if a publisher had two of the same ad units on a given page, say a 728×90 unit at the top of the page and a 728×90 at the bottom of the page, the slot value allows them to target specifically to one or the other. That said, the publisher could just as easily set the value of this to anything they want, and it’s common to see sites re-purpose this keyvalue for another purpose, use a text value such as “leaderboard”, or not use it at all.  See Jared’s post in the comment thread below for more detail.

sz=728x90 – defines the ad size of the unit for the ad server logic.  To be clear, this doesn’t restrict the size of the ad in the unit, it just provides a targeting attribute for the ad server.  If a trafficker were to mistakenly target a 300x250 ad to a market segment with a sz=728x90 attribute, however, the 300x250 creative would still serve to the 728x90 call, it would just be cut off.  It isn’t uncommon to catch one of these mistakes from time to time as you surf around the web. Additionally, you can actually include multiple values in this attribute, separated by commas.  (Thanks to Jared for correcting!)

ord=7268140825331981 – this number is a random value better known as a cache-buster.  As users move back and forth between pages of content, they often return to pages they’ve seen before, especially navigational pages like the homepage.  Browsers today try to save as much content as possible to speed up load times.  To prevent browsers from reusing a cached copy of the same ad instead of requesting a new one (so publishers can maximize revenue and advertisers can get accurate reporting), a random number is tacked on to the end of each ad call so it looks unique to the browser and forces a new series of calls through the ad server.  Click here to read more about what a cache-buster is and how it works.
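To pull the breakdown together, here is a minimal Python sketch that splits the example tag into the pieces described above and swaps in a fresh cache-buster. The function names parse_ad_tag and with_fresh_ord are hypothetical helpers written for this post, not part of any Doubleclick library, and the parsing assumes the tag follows the same layout as the example (call type, site code, zone, then semicolon-separated key-values).

```python
import random
import re
from urllib.parse import urlsplit

# The example tag from this post.
TAG = ("http://ad.doubleclick.net/ADJ/publisher/zone;topic=abc;sbtpc=def;"
       "cat=ghi;kw=xyz;tile=1;slot=728x90.1;sz=728x90;ord=7268140825331981?")


def parse_ad_tag(tag):
    """Split a tag into host, call type, site code, zone, and key-values."""
    parts = urlsplit(tag)
    # Everything before the first ';' is the path hierarchy; the rest are key-values.
    head, *pairs = parts.path.split(";")
    call_type, site, *zone_parts = head.strip("/").split("/")
    key_values = {}
    for pair in pairs:
        if not pair:
            continue
        key, _, value = pair.partition("=")
        # Keep a list per key, since keys such as kw may appear more than once.
        key_values.setdefault(key, []).append(value)
    return {
        "host": parts.netloc,
        "call_type": call_type,
        "site": site,
        "zone": "/".join(zone_parts),  # zones can be nested, e.g. sports/baseball
        "key_values": key_values,
    }


def with_fresh_ord(tag):
    """Return the tag with a new random ord value so the call looks unique."""
    fresh = random.randrange(10**15, 10**16)
    return re.sub(r"ord=\d+", f"ord={fresh}", tag)


parsed = parse_ad_tag(TAG)
print(parsed["call_type"], parsed["site"], parsed["zone"])  # ADJ publisher zone
print(parsed["key_values"]["sz"])
print(with_fresh_ord(TAG))
```

Running the same parser against tags copied from a live page's source is a quick way to see how a given publisher has organized its zones and key-values.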