Editing to update: Google has actually taken the proposals I made seriously (though it’s not me personally, but critics like me in general). It’s not publicly available yet, but they are trying to make a real tool.
And, oh my God, they even admit the problems I talk about in this article:
We do not recommend using the API as a tool for automated moderation: the models make too many errors.
Open sourcing the data sets and algorithms is an important first step in making real anti-harassment tools.
Some of this problem gets addressed in this video, where Google engineers talk about designing massive neural networks and why they can’t figure out context.
Tl;dr: The first step in statistical analysis is to formulate the question that you are attempting to answer with the analysis. The next steps are to perform a series of analyses based on that question: descriptive, exploratory, inferential, and finally predictive. None of the software companies have even formulated the question for what they are attempting to do, much less performed any of the other relevant steps of statistical analysis.
None of this really ties into even deeper problems that plague statistical science and modeling.
Because of this, they are often vague, cryptic, or just plain don’t respond when asked questions about why their algorithms went into a hysterical fit. Given the damage that these failings cause, I think the companies should be legally accountable for their failings.
James Mickens gave a satirical talk about why software security always fails.
Not Even Close: The State of Computer Security (with slides) – James Mickens from NDC Conferences on Vimeo.
You could equally make one for machine learning, given how often Amazon, Facebook, Twitter, and YouTube fail spectacularly. I reposted an article about Amazon’s Hall of Spinning Knives, but I think it’s important to go into a little more detail about it.
If you want a high level overview for how this happens in a microstudy, you can read this link.
Tl;dr: A machine learning algorithm is just a statistical sorting algorithm: it looks at how commonly different things are associated, then determines whether a new input matches an expected output, i.e. for this message, we will either have an output of “ban” or “allow”.
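The idea can be sketched in a few lines. This is a minimal toy, not any company’s actual system: a naive word-frequency sorter that labels a message “ban” or “allow” based on how often its words appeared under each label in previously labeled messages. The training messages are invented for illustration.

```python
# A toy "statistical sorting algorithm": count word/label associations,
# then sort new messages into "ban" or "allow" by those counts.
from collections import Counter

training = [
    ("you are an idiot", "ban"),
    ("idiot go away", "ban"),
    ("have a nice day", "allow"),
    ("thanks for the help", "allow"),
]

counts = {"ban": Counter(), "allow": Counter()}
for text, label in training:
    counts[label].update(text.split())

def classify(message):
    # Score each label by how often the message's words were seen
    # under that label; ties arbitrarily favor the first label.
    scores = {
        label: sum(c[w] for w in message.split())
        for label, c in counts.items()
    }
    return max(scores, key=scores.get)
```

Even in this tiny sketch you can see the core weakness: the classifier knows nothing about context, only about co-occurrence counts.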
Most of these algorithms have been around a while (ID3, C4.5, CHAID, C&RT, QUEST, etc.), but never had any relevance to computer science because the data sets programmers dealt with were never large enough to require this sort of analysis.
The problem with these algorithms is that they break down the more you move away from a single variable. Nate Silver wrote an entire book on it, The Signal and the Noise.
The more variables that come into play, the more noise gets introduced into the system: data that looks relevant but ends up being worthless. It requires a lot of knowledge and research about what is and is not relevant to make a good statistician, which is why most people are horrible at it.
It’s also why there’s a process called debiasing in data science. It requires a good knowledge of human psychology and fallibility, because human folk psychology (what psychologists call people’s everyday assumptions about how their own minds work) routinely lets biases overrule observations.
It’s why anecdotal evidence is not enough to form a scientific opinion, because anecdotal evidence is usually spurious correlations taken as causal fact.
Another way to solve the problem, instead of sifting signal from noise, is to make a set of assumptions about what matters and analyze those. This is what economists do: they rather famously reduce people to simple automatons. That’s because making predictive guesses about real-world people is an exercise in madness.
However, making those sorts of reductionist assumptions brings about its own set of problems. For example, a classic tenet of economics is that humans are all rational actors. According to classical economics, no one has ever bought anything that they later regretted. For mathematical purposes, this works out quite nicely. For analyzing actual human behavior, it’s patently ridiculous.
Anyway, the type of analysis that these big companies are attempting to do, predictive, is one of the hardest types you can perform. It is at the top rung of difficulty for an analysis that can be done on humans. The other types of more rigorous analysis are causal and mechanistic, but those don’t apply to most domains involving human behavior.
The easiest form of statistics is descriptive, which merely shows two things are observed together. This is what most of statistics is.
However, a common problem is that knowing two things are observed together doesn’t mean that the two things cause each other. This is why you often hear the phrase “correlation does not mean causation”. There’s an entire website called spurious correlations that deals with just this.
As an example, ice cream sales coincide with drowning deaths. This is descriptively true, but predictively inaccurate. Knowing that someone bought ice cream will tell you nothing about whether or not someone is about to drown. The actual causative factor is that people buy ice cream when it’s hot and people go swimming when it’s hot. More swimming leads to more drownings.
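The confounder is easy to demonstrate with a toy simulation (all numbers invented): temperature drives both ice cream sales and drownings, so the two series correlate strongly even though neither causes the other.

```python
# Simulate a year of days where temperature is the hidden common cause.
import random

random.seed(0)

def correlation(xs, ys):
    # Pearson correlation coefficient, computed from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

temps = [random.uniform(10, 35) for _ in range(365)]          # daily temperature
ice_cream = [t * 3 + random.gauss(0, 5) for t in temps]       # sales rise with heat
drownings = [t * 0.1 + random.gauss(0, 0.5) for t in temps]   # more swimming when hot

r = correlation(ice_cream, drownings)  # strongly positive, with no causal link
```

Descriptively, `r` is large and real; predictively, banning ice cream would save no one.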
It’s also common for this sort of problem to creep in when you analyze data across multiple matrices. Analyzing any complex pattern of behavior requires analyzing multiple different data points. Each bit of new information may look statistically relevant, but without using techniques like the Bonferroni correction, the “probability of identifying at least one significant result due to chance increases as more hypotheses are tested”.
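A quick synthetic sketch of that quote: test a hundred meaningless hypotheses at the usual 0.05 threshold and a few will look “significant” by pure chance; the Bonferroni correction divides the threshold by the number of tests to compensate.

```python
# Multiple-comparisons demo on pure noise: every null hypothesis is
# true, so p-values are roughly uniform on [0, 1].
import random

random.seed(1)
n_tests, alpha = 100, 0.05

def fake_p_value():
    # Stand-in for a real test run on noise-vs-noise data: under the
    # null hypothesis, p-values are uniformly distributed.
    return random.random()

p_values = [fake_p_value() for _ in range(n_tests)]

naive_hits = sum(p < alpha for p in p_values)                 # false positives slip through
bonferroni_hits = sum(p < alpha / n_tests for p in p_values)  # far stricter threshold
```

With no real effects anywhere, the naive threshold still “discovers” several results; the corrected one discovers almost none.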
In most of what companies like Facebook, Google, et al. are doing, they are attempting predictive analysis across multiple matrices. That is, they are attempting the hardest form of calculation possible.
Worse than that, the only way to make a statistically relevant data claim is to be able to formulate the question that the algorithm is attempting to solve. Part of the reason Test Driven Development and Behavior Driven Development are important for software is that they force clarity in the software. They describe the exact problems, behavior and operations that the software will perform.
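Here is what that kind of forced clarity looks like in miniature, in test-driven style. The filter and banned list are hypothetical, invented for illustration; the point is that the tests pin down the exact question before any clever machinery is built.

```python
# Test-first moderation sketch: the "question" is stated explicitly
# before the implementation exists. A message is abusive if and only
# if it contains a word from an agreed-upon banned list.
BANNED_WORDS = frozenset({"idiot", "moron"})  # hypothetical list

def is_abusive(message):
    return any(word in BANNED_WORDS for word in message.lower().split())

# The tests document the expected behavior, independent of the code.
assert is_abusive("you idiot") is True
assert is_abusive("this video shows historical weapons") is False
assert is_abusive("") is False
```

A word-list filter is obviously too crude to ship, but notice what the tests buy you: when it misfires, you know exactly which stated expectation it violated.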
None of these companies appear to have any sort of defined problem that they are attempting to solve.
The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data…
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
— John Tukey (The guy who invented most of the hard stuff the rest of us use.)
For example, let’s look at Amazon’s system for abuse. People would put books on Kindle Unlimited that were thousands of pages of nonsense, then farm out the books to click-farms. That would drive up the price of the book and the payout on the KU program, which pays per page read.
In an attempt to rein this in, Amazon implemented a machine learning algorithm (or a bot) that looked for suspicious activity, and it ended up flagging several people who were not part of the problem.
What is the question that Amazon should have articulated in that algorithm? If it was “Look for any outside generated traffic” followed up by the output of “then flag it as abusive”, you can immediately see why it’s a huge problem. The formulated hypothesis is that any outside generated traffic can only be an attempt at abusing the system.
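Taken literally, that hypothesis is a one-line rule, and its failure mode falls straight out of it. The account records below are invented for illustration:

```python
# The flawed formulated hypothesis, taken literally: any outside
# generated traffic => flag as abusive.
def flag_account(record):
    return record["outside_traffic"] > 0

accounts = [
    {"name": "click-farm",     "outside_traffic": 50000},  # actual abuser
    {"name": "indie-author",   "outside_traffic": 300},    # shared links on a blog
    {"name": "organic-reader", "outside_traffic": 0},
]

flagged = [a["name"] for a in accounts if flag_account(a)]
# The abuser AND the innocent author who promoted her own book are
# both flagged, because the rule cannot distinguish them.
```

The real system is surely more elaborate, but any rule built on that hypothesis inherits the same blind spot: legitimate off-site promotion is indistinguishable from a click-farm.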
We can see the same thing at work in YouTube’s algorithms:
So, what’s the question that YouTube’s algorithm is attempting to answer here? If it is “Find any video which depicts weapons” with the expected output of “then remove/demonetize it”, you will see the problem.
Without explicitly formulating the hypothesis, there is no measurement for the effectiveness of the system.
So, at the very basic level, step 1 of any algorithm, we literally have no idea what question the algorithms are attempting to answer. Vague notions don’t count. Software requires perfection; bugs are caused by imperfection. But without knowing what the software is even supposed to do, there is no way to tell whether it’s working or how to adjust it. Hence the philosophy of “Behavior Driven Development”. Any further steps from there are already, by necessity, going to be wrong. And the wrongness keeps compounding.
The next problem is that before you can attempt to make a predictive analysis, you have to perform several other analyses first.
- A descriptive analysis is done to see if there’s any correlation between two pieces of data.
- Then an exploratory analysis is done to see if the correlation in the descriptive analysis holds true.
- Then an inferential analysis is done to see if the trend holds true beyond the data sets being studied.
- Finally, only after all of that is done, a predictive analysis is performed.
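The four steps above can be sketched end-to-end on synthetic data with a known relationship (y = 2x plus noise). Each step is drastically simplified, and a real analysis involves far more rigor at every stage, but the ordering is the point:

```python
# Four-step pipeline sketch: descriptive -> exploratory -> inferential
# -> predictive, on data whose true slope is 2.
import random

random.seed(2)
data = [(x, 2 * x + random.gauss(0, 1)) for x in range(100)]
random.shuffle(data)
half_a, half_b = data[:50], data[50:]

def slope(points):
    # Least-squares slope as a stand-in for "the trend we measured".
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((x - mx) * (y - my) for x, y in points)
    den = sum((x - mx) ** 2 for x, _ in points)
    return num / den

# 1. Descriptive: is there a trend in the data at all?
trend = slope(data)
# 2. Exploratory: does the trend hold on one slice of the data?
trend_a = slope(half_a)
# 3. Inferential: does it hold on data we held out entirely?
trend_b = slope(half_b)
# 4. Predictive: only now do we use the trend on new inputs.
def predict(x):
    return trend_a * x
```

Skipping steps 1–3 and jumping straight to `predict` is exactly the shortcut the big platforms appear to be taking.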
Next question: What work has Google, Amazon, etc. done in these regards? Where are the published papers detailing their methodology and the findings thereof?
The reason all of that rigorous work is done first is that one of the most common flaws in data analysis is “overfitting”: data that is merely descriptive becomes conflated with data that is predictive.
This is the reason why so much content is being demonetized or flagged. They have jumped straight into machine learning algorithms as the cure when they haven’t done all of the other work that makes statistical inference possible. It’s equivalent to taking medicines at random and hoping they will cure the symptoms.
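Overfitting itself fits in a few lines. This is a deliberately extreme toy, not any company’s actual model: a “model” that simply memorizes its training data describes that data perfectly, but a held-out point exposes that it learned nothing predictive.

```python
# Overfitting sketch: memorization vs. a model of the real trend.
import random

random.seed(3)
train = [(x, 2 * x + random.gauss(0, 1)) for x in range(10)]
test = [(x, 2 * x + random.gauss(0, 1)) for x in range(10, 15)]

memorized = dict(train)

def overfit_model(x):
    # Perfect on training data, useless elsewhere: pure description
    # mistaken for prediction.
    return memorized.get(x, 0.0)

def simple_model(x):
    # Crude linear model using the true slope, standing in for a
    # properly fitted one.
    return 2 * x

def mean_sq_error(model, points):
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

train_err_overfit = mean_sq_error(overfit_model, train)  # exactly 0
test_err_overfit = mean_sq_error(overfit_model, test)    # enormous
test_err_simple = mean_sq_error(simple_model, test)      # small
```

Zero training error looks like success right up until the model meets data it has never seen.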
This is also why all of these companies have been completely opaque about why something is flagged or what they consider offensive. Several people have put forward quasi-conspiracy theories about it, but I think Hanlon’s razor applies here. Don’t attribute to malice what can be explained by ignorance. I think these companies genuinely have no idea what they’re doing.
So, how do they fix this? It’s actually dead simple. It’s how scientists deal with this problem. They open source their algorithms and peer-review it. The reason why data scientists like programming languages like R and Jupyter Notebook is they allow other scientists to see the exact steps necessary to reproduce the results.
If those algorithms are wrong, data scientists and programmers can spot it. It’s what Google already does with Polymer, Angular, Chromium, and a dozen of its core web services. The same thing applies to Amazon, Facebook, Twitter, etc.
Interestingly, all of these companies open source almost all of their software, but guard their machine learning algorithms. The same arguments for open source software beating closed source software apply equally, perhaps even more so, to machine learning algorithms.
There’s another benefit too. Suppose one of these companies invents an anti-harassment algorithm that actually works. Since it will be open-source, other companies can build off of it and make the web better, not worse. One of the axioms of programming is DRTW. Don’t Reinvent The Wheel. (Unless you want to learn about wheels.)
Given how intrusive machine learning algorithms are in determining everything from credit ratings to entire livelihoods, and how many of them prove plain horrible or discriminatory in their assumptions when examined, it is very likely that a law will eventually be passed requiring all machine learning algorithms and data sets to be publicly available for scrutiny.
If these algorithms are wrong, they do irreparable damage, and the only solace offered is that there can’t be any malice involved (a common tenet of most law is proving malicious intent) because they are machine-based. But I think this is false. They violate the central tenet of the programmer’s oath: “I will not produce harmful code”.
As James Mickens noted in his talk, the law is often out of date with real-world practices. But as Robert Martin points out at the bottom, the regulators will eventually come for us lowly software developers. And when they do, we need to have a feasible defense for what we do. “It’s not my fault, it’s the machines” isn’t going to fly as an excuse, nor should it when humans are the ones programming the machines.
So, do we have a counter-example that follows my prescribed plan? One that uses open source technology, relies upon openly available data sets, publishes its results, and is wildly successful at its stated mission? Yes.
All of the things I’ve discussed have been proposed by Elon Musk for all AIs. His goal is admittedly different from mine, as he’s worried about catastrophic rogue AIs destroying the world, on the basis that it’s easy for flaws to appear in AI bots. My concern is at a lower level: that AIs without these safeguards fail to do whatever it is they are supposed to do.
Interestingly, the open source “OpenAI” program successfully beat the best players at DOTA 2. This is an impressive feat. So, step 1: the programming is open source.
Step 2: they publish research papers on it.
Step 3: they use openly available data:
The result? Winning. We see the triumph of the scientific method over randomly throwing things at the wall and seeing what sticks.
Maybe this appeal to the better nature of tech companies won’t work. Perhaps some good-old fashioned legal appeals will.
If Google had posted its algorithms and data online, it could immediately be seen whether there was any malicious intent or probable cause for being sued. Alas, it does not. But the PragerU case is going to be one amongst many, and like Uncle Bob foretold, if we don’t regulate ourselves, then the regulators will come for our industry.