Customer Data Feed

04 Mar 2018 » AAM

A while ago I wrote about the Adobe Analytics data feed. Adobe Audience Manager has the same concept: you can get a file with the raw data. This feed is called Customer Data Feed, or CDF.

Why would you want the Customer Data Feed

Before going into the details, you need to ask yourself why you want it. It does not make sense to get it just because you want to hoard as much data as you can. With GDPR on the horizon, you should only keep the minimal amount of data needed to run your business.

Besides, the CDF generates huge amounts of data. I was talking a few days ago with one of my colleagues and I asked him which customers he knew were using CDF. He said that many customers request it, but when we explain to them the amount of data it will generate, most of them realise it is not worth it.

As a particular case, if all you need is the population of trait or segment qualifications, there are other options. You can request a daily export of UUIDs together with the traits and segments for which they qualify.

On the other hand, there are a few reasons why you would want this data:

Get trait qualifications. As you know, you do not get any standard destination for traits, only for segments, and you do not know when a trait qualifies. With the Customer Data Feed, you can analyse what the visitors were doing to qualify for traits.
Analyse the data. More and more companies have data scientists who want to analyse all the data they can get their hands on. The moment they hear about CDF, they say “I want it”.

So, if you really think that you need the CDF, read on.

How to request it

In order to get your Customer Data Feed, you need to request it from Adobe: your AAM consultant, your Customer Success Specialist (formerly known as account manager) or client care. Currently, there is no UI setting you can use to request it. I do not know the exact details of what happens, but I believe our TechOps team needs to do some additional configuration in the backend. The files will be delivered to an Amazon S3 bucket.

Yes, it is that simple. You may be asked about a few details of the feed, though. The only thing worth mentioning about this process is whether to request media pixels’ data or not. By default, you will not get them in the CDF, so remember to include them in your request if this is what you want.

File format

This is probably the trickiest part of the CDF. The files are flat files and, for each call to AAM, you get a line. Only real-time calls will be output to the Customer Data Feed. In other words, only calls to http(s)://[clientID].demdex.net are included in the output. This includes Analytics calls forwarded to AAM via server-side forwarding.

Each line contains the following fields, in this exact order:

Timestamp. It uses the yyyy-mm-dd hh:mm:ss format, in the UTC time zone. A couple of things you should be aware of:
- This is the timestamp of when AAM processed the hit, not when AAM received it. There will probably be a small difference between the two timestamps.
- It may not be in the hour range of the file (see below). Under heavy load, AAM servers may take longer to generate the CDF, especially since the data collection is distributed. You should rely on this timestamp for event ordering and not the file hour.
UUID. This is the Unique User ID of the device, which you find in the demdex cookie under the demdex.net domain.
Container ID. You can safely ignore this field. I have never needed it and I think it is not used.
Realised traits. Array of trait IDs that the visitor qualified for in this hit. In other words, traits that the visitor has just qualified for thanks to the data in the hit. The IDs are the same as those you find in the UI.
Realised segments. Array of segment IDs that the visitor has qualified in this hit.
Request parameters. Array with all query string parameters in the call to demdex.net. So, if the call was something like http://abcd.demdex.net/event?c_param1=a&c_param2=b&d_param3=c, you will get in this field c_param1=a, c_param2=b, d_param3=c. Just note that the format is not exactly this, but uses the separators described below.
Referrer URL. The “Referer:” HTTP header in the call to demdex.net. I also want to clarify that the HTTP protocol specifies that this is the URL from where the call was made, not the previous URL in the visitor’s journey. In other words, in this field you get the page the visitor was viewing, the URL in the browser’s address bar.
IP address. In principle, you will get here the IP address of the visitor. However, GDPR (again) considers this value as personal data and you should have it obfuscated. Therefore, I do not expect this parameter to be useful any more.
MCID. If you have the MCID on the page, you will get it in this field. Since the rebranding from Adobe Marketing Cloud to Adobe Experience Cloud, you may also see this parameter named ECID (Experience Cloud ID). Finally, it is worth noting that UUID and MCID are mathematically bound through a reversible operation.
All segments. Array of all segment IDs the visitor qualifies for. This will include the list in “Realised segments” and other segments for which the visitor has already qualified.
All traits. Array of trait IDs. The same explanation as the previous field applies here, but with traits.

The file uses different separators, which are non-printable ASCII characters. Therefore, you need a special text editor to correctly visualise the file (I recommend Notepad++).

ASCII 0x01 (SOH) separates the fields above.
ASCII 0x02 (STX) separates the elements within an array (realised traits and segments, request parameters and all traits and segments)
ASCII 0x03 (ETX) separates key/value pairs in the request parameters field

You will also notice that not all fields are present. Sometimes you will just get two SOH characters together, sometimes the field equals “\N”. In both cases, it just means that the field is empty.

Finally, be aware that Adobe may add more fields at the end. Your code should allow for more fields without crashing.
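To make the format more concrete, here is a minimal Python sketch of how a single line could be parsed, based only on the field order and separators described above. The names FIELD_NAMES and parse_cdf_line are my own, not anything Adobe provides, and I am assuming that request parameters are best kept as a list of key/value pairs, since the same key could in principle appear more than once.

```python
from datetime import datetime, timezone

# Separators described in this post: fields, array elements, key/value pairs.
SOH, STX, ETX = "\x01", "\x02", "\x03"

# Hypothetical field names, in the order given in this post.
FIELD_NAMES = [
    "timestamp", "uuid", "container_id", "realised_traits",
    "realised_segments", "request_params", "referrer", "ip",
    "mcid", "all_segments", "all_traits",
]
ARRAY_FIELDS = {"realised_traits", "realised_segments", "all_segments", "all_traits"}


def is_empty(value):
    """An empty field is either nothing between two SOH or a literal \\N."""
    return value in ("", "\\N")


def parse_cdf_line(line):
    """Parse one CDF line into a dict; unknown trailing fields go to 'extra'."""
    raw = line.rstrip("\r\n").split(SOH)
    # Pad short lines so that missing trailing fields are treated as empty.
    raw += [""] * (len(FIELD_NAMES) - len(raw))
    record = {}
    for name, value in zip(FIELD_NAMES, raw):
        if name in ARRAY_FIELDS:
            record[name] = [] if is_empty(value) else value.split(STX)
        elif name == "request_params":
            pairs = [] if is_empty(value) else value.split(STX)
            # Each element is key ETX value; keep them as (key, value) tuples.
            record[name] = [(k, v) for k, _, v in (p.partition(ETX) for p in pairs)]
        elif is_empty(value):
            record[name] = None
        elif name == "timestamp":
            parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            record[name] = parsed.replace(tzinfo=timezone.utc)
        else:
            record[name] = value
    # Fields Adobe may add at the end in the future are kept, not rejected.
    record["extra"] = raw[len(FIELD_NAMES):]
    return record
```

Keeping any additional fields in a separate "extra" list is just one way of making sure the code does not crash if new fields appear at the end.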

Official documentation: https://marketing.adobe.com/resources/help/en_US/aam/cdf-contents-defined.html.

Directory structure

AAM puts the Customer Data Feed in the Amazon S3 bucket following this structure:

s3://aam-cdf/[bucket name]/day=[yyyy-mm-dd]/hour=[hh]/AAM_CDF_[partner ID]_[AAM process ID]_0.gz

Where each of the fields means:

bucket name is the name of the bucket where the data is stored. Each of our AAM customers will get a bucket, with read-only access, protected by a set of credentials.
yyyy-mm-dd is the day when the file was generated.
hh is the hour when the file was generated
partner ID is the unique ID assigned to you as a customer in AAM
AAM process ID is an internal ID
Files are gzip’ed
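If you need to pick these fields out of a file path in your own code, a regular expression is probably enough. The sketch below is only an illustration: parse_cdf_key is a hypothetical helper, and I am assuming that the partner ID and the AAM process ID never contain an underscore and that the trailing number is always made of digits.

```python
import re

S3_PREFIX = "s3://aam-cdf/"

# Mirrors the structure shown above. Assumption: IDs contain no "_" or "/".
KEY_PATTERN = re.compile(
    r"^(?P<bucket_name>[^/]+)"
    r"/day=(?P<day>\d{4}-\d{2}-\d{2})"
    r"/hour=(?P<hour>\d{2})"
    r"/AAM_CDF_(?P<partner_id>[^_/]+)_(?P<process_id>[^_/]+)_(?P<sequence>\d+)\.gz$"
)


def parse_cdf_key(key):
    """Return the path components as a dict, or None if the key does not match."""
    if key.startswith(S3_PREFIX):
        key = key[len(S3_PREFIX):]
    match = KEY_PATTERN.match(key)
    return match.groupdict() if match else None
```

For example, parse_cdf_key("s3://aam-cdf/mybucket/day=2018-03-04/hour=07/AAM_CDF_1234_000058_0.gz") would give you the day, the hour and the partner ID as strings, with made-up values for the bucket name and IDs.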

A few final notes about the files. As you can see, you get a file every hour, but you need to take into account the following:

It is not available exactly on the hour; it may be available at any time after that hour
It will not necessarily contain the data from the full hour; in fact, it is very common to see data from various hours, which took longer to process
The contents are not sorted; you will need to sort them and, as I have just said, remember that consecutive hits may be split in different hourly files.
It is not the previous hour that you get, but an hour from a few hours ago
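Since hits are unsorted and may be spread over several hourly files, one simple approach is to read a window of files together and sort by the timestamp in the first field. Here is a minimal sketch, assuming the files have already been downloaded, that they are UTF-8 text and that the whole window fits in memory (which, given the volumes involved, will only be true for a small window). The function names are my own.

```python
import gzip

SOH = "\x01"  # field separator


def read_cdf_lines(source):
    """Yield the non-empty lines of one gzip'ed CDF file (path or file object)."""
    # Assumption: UTF-8 content; undecodable bytes are replaced, not fatal.
    with gzip.open(source, "rt", encoding="utf-8", errors="replace", newline="\n") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line:
                yield line


def hit_timestamp(line):
    """The timestamp is the first field; yyyy-mm-dd hh:mm:ss sorts as text."""
    return line.split(SOH, 1)[0]


def sort_hits(sources):
    """Read several hourly files and return all their lines in timestamp order."""
    lines = []
    for source in sources:
        lines.extend(read_cdf_lines(source))
    lines.sort(key=hit_timestamp)
    return lines
```

Because the timestamp format sorts correctly as plain text, there is no need to parse it into a date just for ordering. For anything beyond a small window, you would probably want to do this in a tool designed for large data sets instead.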

Notification of availability

Instead of polling your S3 bucket regularly for the files and checking whether there are more in the sequence, the best way to know that the data is available is by checking the Info file. This text file is in JSON format and lists all the files generated in the hour, together with the totals. Check the details here: https://marketing.adobe.com/resources/help/en_US/aam/cdf-notifications.html.
