Forget privacy: you’re terrible at targeting anyway
I don’t mind letting your programs see my private data as long as I get
something useful in exchange. But that’s not what happens.
A former co-worker told me once: “Everyone loves collecting data,
but nobody loves analyzing it later.” This claim is almost shocking, but
people who have been involved in data collection and analysis have all seen
it. It starts with a brilliant idea: we’ll collect information about
every click someone makes on every page in our app! And we’ll track how
long they hesitate over a particular choice! And how often they use the
back button! How many seconds they watch our intro video before they abort!
How many times they reshare our social media post!
And then they do track all that. Tracking it all is easy. Add some log
events, dump them into a database, off we go.
But then what? Well, after that, we have to analyze it. And as someone who
has analyzed a lot of data
about various things, let me tell you: being a data analyst is difficult
and mostly unrewarding (except financially).
See, the problem is there’s almost no way to know if you’re right. (It’s
also not clear what the definition of “right” is, which I’ll get to in a bit.)
There are almost never any easy conclusions, just hard ones, and the hard
ones are error prone. What analysts don’t talk about is how many incorrect
charts (and therefore conclusions) get made on the way to making correct
ones. Or ones we think are correct. A good chart is so incredibly
persuasive that it almost doesn’t even matter if it’s right, as long as what
you want is to persuade someone… which is probably why newspapers,
magazines, and lobbyists publish so many misleading charts.
But let’s leave errors aside for the moment. Let’s assume, very
unrealistically, that we as a profession are good at analyzing things. What then?
Well, then, let’s get rich on targeted ads and personalized recommendation
algorithms. It’s what everyone else does!
Or do they?
The state of personalized recommendations is surprisingly terrible. At this
point, the top recommendation is always a clickbait rage-creating
article about movie stars or whatever Trump did or didn’t do in the last 6
hours. Or if not an article, then a video or documentary. That’s not what I
want to read or to watch, but I sometimes get sucked in anyway, and then
it’s recommendation apocalypse time, because the algorithm now thinks I
like reading about Trump, and now everything is Trump. Never give
positive feedback to an AI.
This is, by the way, the dirty secret of the machine learning movement:
almost everything produced by ML could have been produced, more cheaply,
using a very dumb heuristic you coded up by hand, because mostly the ML is
trained by feeding it examples of what humans did while following a very
dumb heuristic. There’s no magic here. If you use ML to teach a computer
how to sort through resumes, it will recommend you interview people with
male, white-sounding names, because it turns out that’s what your HR department already does.
If you ask it what video a person like you wants to see next, it will
recommend some political propaganda crap, because 50% of the time 90% of the
people do watch that next, because they can’t help themselves, and that’s
a pretty good success rate.
(Side note: there really are some excellent uses of ML out there, for things
traditional algorithms are bad at, like image processing or winning at
strategy games. That’s wonderful, but chances are good that your pet ML
application is an expensive replacement for a dumb heuristic.)
Someone who works on web search once told me that they already have an
algorithm that guarantees the maximum click-through rate for any web search:
just return a page full of porn links. (Someone else said you can reverse
this to make a porn detector: any link which has a high click-through
rate, regardless of which query it’s answering, is probably porn.)
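
To show how dumb that heuristic really is, here is a minimal sketch of it. The log format, the function name, and both thresholds are made up for illustration; I have no idea what any real search engine does internally.

```python
from collections import defaultdict


def flag_query_independent_links(click_log, min_queries=3, ctr_threshold=0.5):
    """Return links whose click-through rate is high on every query they appear for.

    click_log is a hypothetical iterable of (query, link, impressions, clicks).
    min_queries and ctr_threshold are arbitrary illustrative cutoffs.
    """
    per_link = defaultdict(dict)  # link -> {query: [impressions, clicks]}
    for query, link, impressions, clicks in click_log:
        stats = per_link[link].setdefault(query, [0, 0])
        stats[0] += impressions
        stats[1] += clicks

    flagged = []
    for link, by_query in per_link.items():
        # One click-through rate per query the link was shown for.
        ctrs = [c / i for i, c in by_query.values() if i > 0]
        # "Regardless of query": the *worst* per-query rate must still be high.
        if len(ctrs) >= min_queries and min(ctrs) >= ctr_threshold:
            flagged.append(link)
    return sorted(flagged)


log = [
    ("weather", "a.example", 100, 80),
    ("tax forms", "a.example", 100, 70),
    ("pasta recipe", "a.example", 100, 90),
    ("weather", "b.example", 100, 60),
    ("tax forms", "b.example", 100, 2),
    ("pasta recipe", "b.example", 100, 1),
]
print(flag_query_independent_links(log))  # ['a.example']
```

Note that nothing in there knows who is doing the clicking. It only counts clicks per link per query.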
Now, the thing is, legitimate-seeming businesses can’t just give you porn
links all the time, because that’s Not Safe For Work, so the job of most
modern recommendation algorithms is to return the closest thing to porn that
is still Safe For Work. In other words, celebrities (ideally attractive
ones, or at least controversial ones), or politics, or both. They walk that
line as closely as they can, because that’s the local maximum for their
profitability. Sometimes they accidentally cross that line, and then have
to apologize or pay a token fine, and then go back to what they were doing.
This makes me sad, but okay, it’s just math. And maybe human nature. And
maybe capitalism. Whatever. I might not like it, but I understand it.
My complaint is that none of the above had anything to do with hoarding
my personal information.
The hottest recommendations have nothing to do with me
Let’s be clear: the best targeted ads I will ever see are the ones I get from
a search engine when it serves an ad for exactly the thing I was searching
for. Everybody wins: I find what I wanted, the vendor helps me buy their
thing, and the search engine gets paid for connecting us. I don’t know
anybody who complains about this sort of ad. It’s a good ad.
And it, too, had nothing to do with my personal information!
Google was serving targeted search ads decades ago, before it ever occurred
to them to ask me to log in. Even today you can still use every search
engine web site without logging in. They all still serve ads targeted to
your search keyword. It’s an excellent business.
There’s another kind of ad that works well on me. I play video games
sometimes, and I use Steam, and sometimes I browse through games on Steam
and star the ones I’m considering buying. Later, when those games go on
sale, Steam emails me to tell me they are on sale, and sometimes then I buy
them. Again, everybody wins: I got a game I wanted (at a discount!), the
game maker gets paid, and Steam gets paid for connecting us. And I can
disable the emails if I want, but I don’t want, because they are good ads.
But nobody had to profile me to make that happen! Steam has my account, and
I told it what games I wanted and then it sold me those games. That’s
not profiling, that’s just remembering a list that I explicitly
handed to you.
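
For the skeptical, this is roughly all the "targeting" that kind of ad needs. It is a sketch, not Steam's actual code: the class and method names are hypothetical, and the only inputs are things the user did on purpose.

```python
from collections import defaultdict


class Wishlist:
    """Hypothetical store of explicitly starred games; no inferred data anywhere."""

    def __init__(self):
        self._starred = defaultdict(set)  # game -> users who starred it
        self._opted_out = set()           # users who disabled sale emails

    def star(self, user, game):
        self._starred[game].add(user)

    def disable_emails(self, user):
        self._opted_out.add(user)

    def users_to_notify(self, game_on_sale):
        # Everyone who asked about this game, minus everyone who said "stop".
        interested = self._starred.get(game_on_sale, set())
        return sorted(interested - self._opted_out)


wishlist = Wishlist()
wishlist.star("me", "Game A")
wishlist.star("you", "Game A")
wishlist.disable_emails("you")
print(wishlist.users_to_notify("Game A"))  # ['me']
```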
Amazon shows a box that suggests I might want to re-buy certain kinds of
consumable products that I’ve bought in the past. This is useful too, and
requires no profiling other than remembering the transactions we’ve had with
each other in the past, which they kinda have to do anyway. And again, everybody wins.
Now, Amazon also recommends products like the ones I’ve bought before, or
looked at before. That’s, say, 20% useful. If I just bought a computer
monitor, and you know I did because I bought it from you, then you might as
well stop selling them to me. But for a few days after I buy any
electronics they also keep offering to sell me USB cables, and
they’re probably right. So okay, 20% useful targeting is better than 0%
useful. I give Amazon some credit for building a useful profile of me,
although it’s specifically a profile of stuff I did on their site and
which they keep to themselves. That doesn’t seem too invasive. Nobody is
surprised that Amazon remembers what I bought or browsed on their site.
Worse is when (non-Amazon) vendors get the idea that I might want something.
(They get this idea because I visited their web site and looked at it.)
So their advertising partner chases me around the web trying to sell me the
same thing. They do that, even if I already bought it. Ironically, this
is because of a half-hearted attempt to protect my privacy. The vendor
doesn’t give information about me or my transactions to their advertising
partner (because there’s an excellent chance it would land them in legal
trouble eventually), so the advertising partner doesn’t know that I bought
it. All they know (because of the advertising partner’s tracker gadget on
the vendor’s web site) is that I looked at it, so they keep advertising it
to me just in case.
But okay, now we’re starting to get somewhere interesting. The advertiser
has a tracker that it places on multiple sites and tracks me around. So it
doesn’t know what I bought, but it does know what I looked at, probably over
a long period of time, across many sites.
Using this information, its painstakingly trained AI makes conclusions about
which other things I might want to look at, based on…
…well, based on what? People similar to me? Things my Facebook friends
like to look at? Some complicated matrix-driven formula humans can’t
possibly comprehend, but which is 10% better?
Probably not. Probably what it does is infer my gender, age, income level,
and marital status. After that, it sells me cars and gadgets if I’m a guy,
and fashion if I’m a woman. Not because all guys like cars and gadgets, but
because some very uncreative human got into the loop and said “please sell
my car mostly to men” and “please sell my fashion items mostly to women.”
Maybe the AI infers the wrong demographic information (I know Google has
mine wrong) but it doesn’t really matter, because it’s usually mostly right,
which is better than 0% right, and advertisers get some mostly
demographically targeted ads, which is better than 0% targeted ads.
You know this is how it works, right? It has to be. You can infer it
from how bad the ads are. Anyone can, in a few seconds, think of some stuff
they really want to buy which The Algorithm has failed to offer them, all
while Outbrain makes zillions of dollars sending links about car insurance
to non-car-owning Manhattanites. It might as well be a 1990s late-night TV
infomercial, where all they knew for sure about my demographic profile is
that I was still awake.
You tracked me everywhere I go, logging it forever, begging for someone to
steal your database, desperately fearing that some new EU privacy regulation
might destroy your business… for this?
Of course, it’s not really as simple as that. There is not just one
advertising company tracking me across every web site I visit. There are…
many advertising companies tracking me across every web site I visit. Some
of them don’t even do advertising, they just do tracking, and they sell that
tracking data to advertisers who supposedly use it to do better targeting.
This whole ecosystem is amazing. Let’s look at online news web sites. Why
do they load so slowly nowadays? Trackers. No, not ads – trackers. They
only have a few ads, which mostly don’t take that long to load. But they
have a lot of trackers, because each tracker will pay them a tiny bit of
money to be allowed to track each page view. If you’re a giant publisher
teetering on the edge of bankruptcy and you have 25 trackers on your web site
already, but tracker company #26 calls you and says they’ll pay you $50k a
year if you add their tracker too, are you going to say no? Your page runs
like sludge already, so making it 1/25th more sludgy won’t change anything,
but that $50k might.
(“Ad blockers” remove annoying ads, but they also speed up the web, mostly
because they remove trackers. Embarrassingly, the trackers themselves don’t
even need to cause a slowdown, but they always do, because their developers
to do what could be done in two. But that’s another story.)
Then the ad sellers, and ad networks, buy the tracking data from all the
trackers. The more tracking data they have, the better they can target ads,
right? I guess.
The brilliant bit here is that each of the trackers has a bit of data about
you, but not all of it, because not every tracker is on every web site. But
on the other hand, cross-referencing individuals between trackers is kinda
hard, because none of them wants to give away their secret sauce. So each
ad seller tries their best to cross-reference all the tracker
data they buy, but it mostly doesn’t work. Let’s say there are 25 trackers
each tracking a million users, probably with a ton of overlap. In a sane
world we’d guess that there are, at most, a few million distinct users. But
in an insane world where you can’t prove if there’s an overlap, it could be
as many as 25 million distinct users! The more tracker data your ad network
buys, the more information you have! Probably! And that means better
targeting! Maybe! And so you should buy ads from our network instead of
the other network with less data! I guess!
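
The arithmetic behind that sales pitch fits in a few lines. This is a toy sketch with hypothetical function names and made-up user IDs. The only real numbers are the 25 trackers and one million users each from the example above.

```python
def distinct_user_bounds(tracker_sizes):
    """Fewest and most distinct users consistent with the given tracker sizes."""
    # Fewest: every tracker's users fit inside the largest tracker's set.
    # Most: no two trackers share a single user.
    return max(tracker_sizes), sum(tracker_sizes)


def count_users(trackers, can_match_ids):
    """Count 'distinct' users across trackers (name -> set of user IDs)."""
    if can_match_ids:
        # Sane world: the same person seen by two trackers is counted once.
        return len(set().union(*trackers.values()))
    # Insane world: overlap can't be proven, so everybody counts as new.
    return sum(len(users) for users in trackers.values())


print(distinct_user_bounds([1_000_000] * 25))  # (1000000, 25000000)

trackers = {
    "t1": {"alice", "bob"},
    "t2": {"bob", "carol"},
    "t3": {"alice", "carol"},
}
print(count_users(trackers, can_match_ids=True))   # 3
print(count_users(trackers, can_match_ids=False))  # 6
```

Same three people either way. The second number just makes a better slide.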
None of this works. They are still trying to sell me car insurance for my
It’s not just ads
That’s a lot about profiling for ad targeting, which obviously doesn’t work,
if anyone would just stop and look at it. But there are way too many people
incentivized to believe otherwise. Meanwhile, if you care about your
privacy, all that matters is they’re still collecting your personal
information whether it works or not.
What about content recommendation algorithms though? Do those work?
Obviously not. I mean, have you tried them. Seriously.
That’s not quite fair. There are a few things that work. Pandora’s music recommendations
are surprisingly good, but they are doing it in a very non-obvious way. The
obvious way is to take the playlist of all the songs your users listen to,
blast it all into an ML training dataset, and then use that to produce a new
playlist for new users based on… uh… their… profile? Well, they
don’t have a profile yet because they just joined. Perhaps based on the
first few songs they select manually? Maybe, but they probably started
with either a really popular song, which tells you nothing, or a really
obscure song to test the thoroughness of your library, which tells you less than nothing.
(I’m pretty sure this is how Mixcloud works. After each mix, it tries to
find the “most similar” mix to continue with. Usually this is someone
else’s upload of the exact same mix. Then the “most similar” mix to that
one is the first one, so it does that. Great job, machine learning, keep it up.)
That leads us to the “random song followed by thumbs up/down” system that
everyone uses. But everyone sucks, except Pandora. Why? Apparently
because Pandora spent a lot of time hand-coding a bunch of music
characteristics and writing a “real algorithm” (as opposed to ML) that tries
to generate playlists based on the right combinations of those characteristics.
In that sense, Pandora isn’t pure ML. It often converges on a playlist
you’ll like within one or two thumbs up/down operations, because you’re
navigating through a multidimensional interconnected network of songs that
people encoded the hard way, not a massive matrix of mediocre playlists
scraped from average people who put no effort into generating those
playlists in the first place. Pandora is bad at a lot of things (especially
“availability in Canada”) but their music recommendations are top notch.
Just one catch. If Pandora can figure out a good playlist based
on a starter song and one or two thumbs up/down clicks, then… I guess it’s
not profiling you. They didn’t need your personal information either.
While we’re here, I just want to rant about Netflix, which is an odd case
of starting off with a really good recommendation algorithm
and then making it worse on purpose.
Once upon a time, there was the Netflix Prize,
which granted $1 million to the best team that could predict people’s movie
ratings, based on their past ratings, with better accuracy than Netflix
could themselves. (This not-so-shockingly resulted in a privacy
fiasco when it turned
out you could de-anonymize the data set that they publicly released, oops.
Well, that’s what you get when you long-term store people’s personal
information in a database.)
Netflix believed their business depended on a good
recommendation algorithm. It was already pretty good: I remember using
Netflix around 10 years ago and getting several recommendations for things I
would never have discovered, but which I turned out to like.
That hasn’t happened to me on Netflix in a long, long time.
As the story goes, once upon a time Netflix was a DVD-by-mail service.
DVD-by-mail is really slow, so it was absolutely essential that at least one
of this week’s DVDs was good enough to entertain you for
your Friday night movie. Too many Fridays with only bad movies, and
you’d surely unsubscribe. A good recommendation system was key. (I guess
there was also some interesting math around trying to make sure to rent out
as much of the inventory as possible each week, since having a zillion
copies of the most recent blockbuster, which would be popular this month and
then die out next month, was not really viable.)
Eventually though, Netflix moved online, and the cost of a bad
recommendation was much less: just stop watching and switch to a new movie.
Moreover, it was perfectly fine if everyone watched the same blockbuster.
In fact, it was better, because they could cache it at your ISP and caches
work better if people are boring and average.
Worse, as the story goes, Netflix noticed a pattern: the more hours people
watch, the less likely they are to cancel. (This makes sense: the more
hours you spend on Netflix, the more you feel like you “need” it.) And with
new people trying the service at a mostly fixed rate, higher retention
translates directly to faster growth.
When I heard this was also when I learned the word “satisficing,” which
essentially means searching through sludge not for the best option, but for
a good enough option. Nowadays Netflix isn’t about finding the best movie,
it’s about satisficing. If it has the choice between an award-winning movie
that you 80% might like or 20% might hate, and a mainstream movie that’s 0%
special but you 99% won’t hate, it will recommend the second one every time.
Outliers are bad for business.
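
Here is the difference as a sketch. The probabilities are the ones from the example above. The `Movie` type, the function names, and the 5% tolerance are my own illustrative assumptions, not anything Netflix has published.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Movie:
    title: str
    p_love: float  # chance you think it's special
    p_hate: float  # chance you hate it


def pick_best(movies):
    """Maximizer: go for whatever you're most likely to love."""
    return max(movies, key=lambda m: m.p_love)


def pick_satisficing(movies, max_hate=0.05):
    """Satisficer: take the first option that is merely unlikely to be hated."""
    for movie in movies:
        if movie.p_hate <= max_hate:
            return movie
    return None  # nothing was "good enough"


catalog = [
    Movie("award winner", p_love=0.80, p_hate=0.20),
    Movie("mainstream", p_love=0.00, p_hate=0.01),
]
print(pick_best(catalog).title)         # award winner
print(pick_satisficing(catalog).title)  # mainstream
```

Neither function takes a user as an argument, which is kind of the point.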
The thing is, you don’t need a risky, privacy-invading profile to recommend
a mainstream movie. Mainstream movies are specially designed to be
inoffensive to just about everyone. My Netflix
recommendations screen is no longer “Recommended for you,” it’s “New
Releases,” and then “Trending Now,” and “Watch it again.”
As promised, Netflix paid out their $1 million prize to buy the winning
recommendation algorithm, which was even better than their old one. But
they didn’t use it, they threw it away.
Some very expensive A/B testers determined that this is what makes me watch
the most hours of mindless TV. Their revenues keep going up. And they
don’t even need to invade my privacy to do it.
Who am I to say they’re wrong?