[Insight] Will big data one day replace opinion polls?
Big data may be able to help the Los Angeles Police Department reduce crime. It is said to make cities smarter, and some even see it as a credible alternative to restrictive policy measures in the fight against climate change (1). So why has big data, in its seemingly all-powerful guise, not yet “uberized” the polling industry? Leonardo Noleto, data scientist, Guillaume Pataut, mathematician, and Guilhem Fouetillou, cofounder of the startup Linkfluence – which helps businesses create value from social media conversations – and associate professor at Sciences Po Paris, give their opinions on the question.
Is big data capable of making polls as we know them today (i.e., an array of mathematical methods for inferring the opinion of a group by generalizing from a subset) a thing of the past? “This is a complex question,” warns Leonardo, suggesting that the answer lies somewhere between Angers, the quiet capital of the French region of Anjou, and the United States.
Surveys and big data: complementary methods?
Had the residents of Angers not validated concepts such as the 15 cl mini can of Coca-Cola, the Kinder Pingui or the introduction of Philadelphia cheese, these products would probably never have been commercialized in France. With a population of 400,000 inhabitants, the Maine-et-Loire prefecture is known to brands around the globe as highly representative of average French behavior and, therefore, of consumer expectations (2). The phenomenon, which has lasted for more than 20 years, is cleverly exploited by two companies: MarketingScan (of the GfK-Médiamétrie group) and Scannel (Kantar Worldpanel). The startup CityPanel also chose to set itself up in the city in 2013, with the ambition of extending the tests submitted to the prophetic panel of Angers citizens to digital services (mobile applications, websites, connected devices…).
On the other side of the Atlantic lives Nate Silver, a statistician who specialized in sports statistics. Forecasting the results of games and the likely career paths of Major League Baseball players was his livelihood in the early 2000s. But it was his analysis of the 2008 United States presidential election that made him famous. Published on FiveThirtyEight.com (a blog affiliated at the time with the New York Times), his forecasts were striking in their accuracy: Nate Silver predicted the winner in 49 of 50 states and called Barack Obama the victor several months in advance. His secret? Using big data… to weight the projections of traditional polling organizations (3).
“As Nate Silver's work suggests, surveys and big data can be complementary,” Leonardo states. The case of Angers, which has become the Pythia of the food industry, shows that statistical methods, refined through decades of practice, are still relevant. “Based on the available budget, the size of the population studied, the average response rate observed, and the margin of error accepted by the client, mathematical models make it possible to accurately determine the size and composition of the sample to be probed in order to obtain representative results,” Guillaume explains. “And don’t forget that surveys are just one method of observing a society among many others: sociology, ethnography… or national statistics, based on a comprehensive census of the population. This is an area in which France was well ahead of the curve, with the creation of INSEE in 1946 (4),” recalls Guilhem.
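The sample-size arithmetic Guillaume describes follows a standard formula for estimating a proportion. The sketch below is purely illustrative (not a pollster's production code): it computes the minimum sample size for a chosen margin of error and confidence level, with an optional finite-population correction.

```python
import math

def sample_size(margin_of_error, confidence_z=1.96, p=0.5, population=None):
    """Minimum sample size for estimating a proportion.

    margin_of_error: acceptable half-width of the confidence interval (e.g. 0.03)
    confidence_z: z-score for the confidence level (1.96 ~ 95%)
    p: assumed proportion (0.5 is the most conservative choice)
    population: if given, apply the finite-population correction
    """
    n = (confidence_z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)

# A ±3% margin at 95% confidence needs roughly a thousand respondents,
# almost regardless of how large the surveyed population is.
print(sample_size(0.03))                      # → 1068 (unbounded population)
print(sample_size(0.03, population=400_000))  # → 1065 (an Angers-sized population)
```

The striking property, which explains why a town of 400,000 can stand in for a whole country, is that the required sample size barely depends on the population size once the population is large.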
Big data is well suited to picking up weak signals
The interest of big data resides in its capacity to detect emerging trends, produce unprecedented hypotheses and answer questions that have never been asked. “More data was collected in the year 2011 than in all of history from the invention of writing up to that point,” Guilhem states, citing the United Nations’ Global Pulse project – and most of that data was acquired in just the past few years. “First we must understand what big data covers. For Linkfluence, it is the possibility of capturing and analyzing what users voluntarily express, comment on or ‘like’ on the Internet (declarative data), but also what they do (observations of usage). Big data is a new lens through which to observe society, and it has the advantage of not presupposing anything.” This is in contrast to surveys, whose very methodology can introduce bias. As Pierre Bourdieu noted in his 1972 lecture “Public Opinion Doesn’t Exist,” “The simple act of asking everyone the same question implies a hypothesis that there is a consensus regarding the issues, that is to say an agreement on which questions are worth asking.” (5) To what extent does a survey contribute to shaping the very opinion it intends to measure?
“Big data allows data to be collected without a visible process and without any influence from the observer,” says Guilhem. “Pollsters have conducted well-known experiments showing how the gender, age or attractiveness of an interviewer can affect the accuracy of responses, particularly when men face a female interviewer.” Leonardo adds, “Polls work in part with ‘provoked’ data.” Big data represents a paradigm shift: it permits the exploration of data without a predefined goal, using the traces left behind by unaware Internet users – which raises obvious ethical questions about user consent, data ownership, cross-referencing and resale of data (6). “At Linkfluence, which offers monitoring and social web analysis tools for brands, it is customary to say that we only listen to those who want to be heard, that is to say those who express themselves in public web spaces. Interactions on social networks, the content you ‘like’ on Facebook or the Twitter accounts you follow provide vital information about your interests: the click is the message. The era when 1% of Internet users produced 99% of the content is over. Big data allows us to move from socio-professional categorization – which assumes that socio-professional categories behave homogeneously – to a classification closer to reality. You can understand why studying social media has become relevant, but you have to bear in mind that users can be performers. Remember the adage, ‘On the Internet, nobody knows you’re a dog.’ The personas people adopt on the web alter the level of trust that can be placed in data collected on certain subjects, for example in a study of an employer brand. But the absence of an overt measurement process lets us capture spontaneity.”
The pitfall of absurd correlations
For those who understand its limitations, big data is a revolutionary tool. But there is no magic process that removes the need for human intervention, nor the bias an unwitting analyst can introduce. Leonardo puts it with amusement: “The myth of big data is that once the data is collected, it will speak on its own! Obviously this is wrong. Today it is easy – and less and less expensive – to use algorithms to crunch data until correlations are revealed. The trap is that correlation does not necessarily mean there is a link between cause and effect.” A reductio ad absurdum is the facetious response SNCF (French National Railways) received on datascience.net (7). At the end of 2014, the public enterprise challenged participants to develop a model estimating the number of travelers in a train station on any given day of the week using the railway’s open data. “Someone correlated the number of travelers present in the stations with the number of hair salons operating inside them. Though this is a valid mathematical correlation, it is not very pertinent for estimating how many people frequent a station. These ‘absurd’ correlations may arise by chance, but more often the explanation is a hidden variable: here, it is simply likely that the busier a station is, the more shops it houses. Will a day come when machines possess the intelligence required to judge the relevance of a correlation? More work needs to be done on the subject; the ‘Master Algorithm’ still appears to be the Holy Grail.”
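The hidden-variable trap Leonardo describes is easy to reproduce with synthetic data. In the sketch below (all stations and numbers are invented for illustration), a single hidden variable – station traffic – drives both the number of travelers and the number of hair salons, and the two observables come out strongly correlated even though neither causes the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

random.seed(0)
# Hidden variable: how busy each of 50 fictional stations is.
traffic = [random.uniform(1_000, 100_000) for _ in range(50)]
# Both observables depend on traffic, plus independent noise.
travellers = [t * random.uniform(0.9, 1.1) for t in traffic]
hair_salons = [t / 20_000 * random.uniform(0.5, 1.5) for t in traffic]

# Strong correlation, yet the salons obviously do not cause the travellers.
print(round(pearson(travellers, hair_salons), 2))
```

Dropping the shared dependence on `traffic` (e.g. generating `hair_salons` independently) collapses the correlation to near zero, which is exactly the point: the correlation measures the hidden variable, not a causal link.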
Statistical methods: useful for finding relevant indicators, outdated, or necessary for verifying the hypotheses produced by big data?
Big data has not yet uberized the polling industry. Rather, we find traditional players cooperating with startups that have taken off in big data (8). “For my part, I believe that statistical protocols will remain,” confides Guillaume, whose job at OVH is to design algorithms that make it possible to examine data. “Big data presents us with volumes and varieties of information that greatly surpass human analytical capability. Before beginning to examine data with the help of algorithms, it is often necessary – to use mathematical terms – to reduce the scope of the problem. Statistical methods are a great help in identifying relevant indicators, which are like buoys floating in a sea of data, and this simplifies the problem. For example, when neuroscientists attempt to identify the role of each part of the brain, the amount of data produced by cerebral imaging is astounding. The medical team must move forward using regression methods and proximal algorithms to eliminate unnecessary data, or to weigh its influence when establishing a correlation. The mathematical principles behind the basic algorithms of big data, such as principal component analysis (PCA), or those of machine learning, are relatively simple. When we speak of big data, we imagine complicated equations and huge machines that eat data and spit it back out as dashboards. There is a much less spectacular side to our job: the hours spent understanding the data, structuring it, and sorting the interesting data from the rest…” Leonardo is more skeptical about the future of statistical methods: “Raw data does not always lend itself to statistical methods. Data sets can be heterogeneous: texts, images, videos… Moreover, the goal of data science is to examine data and extract patterns. Statistics offer a limited catalog of models.
One can sometimes find an adequate model to make sense of the data, but that is not always the case.” (9) Guilhem, for his part, sometimes sees in statistics a way to verify the hypotheses produced by big data: “One of our customers experienced a social media crisis. Analysis of online conversations showed that the bad buzz had spread across the web. To learn what percentage of the population had caught wind of the online crisis, the advertiser conducted a traditional poll. For the record, it revealed that 25% of the population was aware of the issue. Hardly negligible!”
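As an illustration of the PCA Guillaume mentions, here is a minimal sketch (my own, on invented data, using NumPy) of reducing the scope of a problem: ten noisy features that really depend on only two hidden factors are projected onto their top two principal components, which retain almost all of the variance.

```python
import numpy as np

def pca(data, n_components):
    """Project data onto its top principal components.

    data: (n_samples, n_features) array.
    Returns the projected data and the fraction of variance retained.
    """
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)           # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # pick the largest ones
    components = eigvecs[:, order]
    explained = eigvals[order].sum() / eigvals.sum()
    return centered @ components, explained

rng = np.random.default_rng(0)
# 200 fictional observations of 10 noisy features that really
# depend on just 2 hidden factors.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
data = factors @ mixing + 0.05 * rng.normal(size=(200, 10))

projected, explained = pca(data, n_components=2)
print(projected.shape)       # (200, 2): ten columns reduced to two
print(round(explained, 2))   # close to 1.0: two components suffice here
```

This is the sense in which the mathematics is "relatively simple": center, diagonalize the covariance matrix, keep the directions carrying the most variance.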
Analysis, a necessary safeguard
Leonardo and Guillaume agree on their responsibility and warn, “Just as anyone can twist statistics to tell almost any kind of story, the same tales can be told with a big data set. In a world where numbers rule, we have to stay suspicious. Statactivisme, after the title of a collective work published in 2014, urges us to resist by reappropriating the numbers” (10). Human intervention – with its subjectivity, certainly, but also its critical reflection – is not only essential but salutary. As the Belarusian researcher Evgeny Morozov, author of To Save Everything, Click Here: The Folly of Technological Solutionism (11), explains, much of big data’s current popularity rests on its capacity to provide powerful insights based only on correlations. In the words of a recent book by Viktor Mayer-Schönberger and Kenneth Cukier, Big Data: A Revolution That Will Transform How We Live, Work, and Think, “Once we have fully embraced big data, society will sacrifice causality in exchange for simple correlations: not knowing why, but only what.” A real problem, if we imagine, for example, big data applied to public policy, with actions based on a series of correlations without correcting the injustices or discrimination those correlations may reflect (12).
The particular case of political polls
Political polls are a case apart. Capturing public opinion on the eve of an election seems to have become complicated. As the political science professor Alain Garrigou pointed out in Le Monde Diplomatique (13), the latest record polling error occurred during the 2015 Greek referendum, even though, with voters given the choice of just yes or no, a referendum is assumed to be the easiest type of election to predict. In light of Nate Silver’s experiments, one might believe that big data and listening to the social web could help reduce the risk of error introduced when pollsters correct, with a wet finger in the wind, what they believe to be over-reporting or under-reporting. Guilhem, then a researcher at the Université de technologie de Compiègne, studied the role the Internet played in the French public’s “no” vote in the 2005 referendum on the European constitution. Sending robots to browse the web and study the content of sites and the links between them, he observed a quantitative imbalance between the sites in favor of the “yes” and those campaigning for the “no”: the latter outnumbered the former two to one, and the “no” community was larger and more active. What does he think today of the web as a polling ground for big data methods?
“The first pitfall of the Internet is that it defies all the principles of representativeness.” Some age groups are not connected at all while others are hyper-connected; not all categories of the population are represented. “And there is no way to weight the data gathered from the web so that the results represent the entire population. This is not a problem for the brands we work for, because either we look for weak signals announcing an upcoming trend, or we study the behavior of their online communities, and for that we have consistent and comprehensive ground.” For political polls, however, it is an issue. Moreover, in 2005 users did not yet regard the web as a medium in the way they did traditional media. “We were still in the myth of the untouched tribe, where observation by the ethnologist does not change behavior. Today, political militants know they are observed on the web and they understand the power of the Internet.” Campaigns now also take place online, with the production of content, commentary and statuses on social networks aimed at influencing opinion. The result: “Some of the noise captured by big data can be somewhat fictitious. In the case of political campaigns, a quantitative analysis of the web reveals less public opinion than the work of activists and militants, who are the loudest online.” One solution is not to listen to everything: “Studying the weight the media give to subjects offers a good way to gauge opinion on the issues that can tip a campaign. For this type of study, we must reintroduce a meritocratic logic. In the attention economy, which is the web’s economy, not all content has the same value.
A robot that tweets political messages should not be credited with the same influence as an article by a major media outlet with an audience of several thousand readers.” If the web is not a possible ground for political polls, it is certainly fertile ground for political marketing, an activity Linkfluence lends itself to from time to time. Studying the web and establishing predictive correlations based on past electoral results are already methods used during campaigns for “electoral micro-targeting” (14).
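Guilhem's meritocratic logic can be sketched in a few lines. In this toy example (sources, stances and audience figures are all invented), the same set of online mentions yields opposite conclusions depending on whether each mention counts equally or is weighted by the audience of its source.

```python
# Invented mentions: stance is +1 (favorable) or -1 (unfavorable),
# audience is the estimated readership of the source.
mentions = [
    {"source": "bot account",   "stance": +1, "audience": 10},
    {"source": "activist blog", "stance": +1, "audience": 500},
    {"source": "major outlet",  "stance": -1, "audience": 50_000},
]

def raw_score(items):
    """One mention, one vote: the bot counts as much as the newspaper."""
    return sum(m["stance"] for m in items)

def weighted_score(items):
    """Weight each mention by its audience; result lies in [-1, +1]."""
    total = sum(m["audience"] for m in items)
    return sum(m["stance"] * m["audience"] for m in items) / total

print(raw_score(mentions))                 # positive: the two small voices win
print(round(weighted_score(mentions), 2))  # negative: the big outlet dominates
```

Real influence models are of course far richer than a single audience number, but the sign flip between the two scores is the whole argument: counting noise is not measuring opinion.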
Heavy consumers of polls, the French are therefore still far from seeing the name of their next president revealed several months ahead of the election thanks to big data. And even when it becomes possible, there is doubt about the ability of the prediction itself to move voters to the polling stations… if only to contradict the predicted outcome. Witness the turnout in the second round of the regional elections in December 2015. Stéphane Rozès, a political adviser quoted by Le Monde.fr in an article about Nate Silver (15), explains that one should not be surprised that the American statistician’s book has not been translated into French: “The idea that a statistician can announce the results ahead of an election is baroque and detrimental to the imagination of French politics.”
The challenges posed by big data
Outside the very specific case of political polls, big data is a very effective method for observing individuals, especially for understanding the consumer dormant within each of them. Take the case of Aldebaran Robotics, a company that uses Linkfluence’s Radarly software suite to capture conversations around its brand – for instance, tracking discussion of the presence of its humanoid robots on the TV show Salut les Terriens, broadcast on Canal+. With the help of the Linkfluence tool, Aldebaran was able to move away from traditional top-down communication and develop an approach centered on the interests of each web user. Even as it opens up seemingly infinite possibilities, big data has also created new challenges. Technical challenges, linked to data storage and the processing power needed for analysis – the job of OVH. Intellectual challenges, with the birth of the data science discipline. And ethical challenges, concerning the necessary awareness among users of the traces they leave behind. “The social web that Linkfluence analyzes represents only part of the data on the web,” notes Guilhem. “Tomorrow, with connected devices and the massive use of social networks, the traces we leave on the Internet will continue to explode and an ever more significant part of our lives will be documented. Today, we give away a lot of data by agreeing to a site’s terms and conditions in exchange for access to many services. It is logical that companies that pay fortunes for hosting seek to make a profit (16).” Have we all become digital laborers, in the sense the sociologist Antonio Casilli describes in the blog post entitled Digital Labor (17)? Should users be compensated when their data is used to create value? What do the inhabitants of Angers think? Admittedly, the simple act of asking the question may create an opinion that did not exist before…
Please note that this article was originally published in French. Any publication in any other language is a translation of the French text. The following lists the original citations taken from the original French publications.
(1) La police de Los Angeles utilise depuis 2011 le logiciel Predpol (www.predpol.com) et a déclaré avoir fait baisser grâce à lui de 33 % les agressions et de 21 % les crimes violents entre novembre 2011 et mai 2012. Source : Le logiciel qui prédit les délits, M le magazine du Monde, par Louise Couvelaire le 04.01.2013
Lire également cet article qui nuance l’enthousiasme qui accompagne le déploiement de Predpol dans le monde : Predpol : la prédiction des banalités, internetactu.net, par Hubert Guillaud le 23.06.2015
À propos de l'utilisation du Big Data pour lutter contre le dérèglement climatique : Big data et algorithmes, l'enjeu caché du COP21, journaldunet.com, par Charles Abner Dadi le 26.08.2015
(2) Si un produit marche à Angers, il marchera partout !, capital.fr, le 12.05.2011 (mis à jour le 23.01.14)
(3) Nate Silver et les limites du Big Data, Le Monde.fr, par Ludovic Vinogradoff le 15.07.2013 et Et Nate Silver, saint patron des "nerds", créa le data, M le magazine du Monde, par Louise Couvelaire le 24.05.2013
(4) Histoire de la statistique française, Wikipédia.org
(5) L'opinion publique n'existe pas, Pierre Bourdieu. Exposé fait à Noroit (Arras) en janvier 1972 et paru dans Les temps modernes, 318, janvier 1973, pp. 1292-1309. Repris in Questions de sociologie, Paris, Les Éditions de Minuit, 1984, pp. 222-235.
(6) Au sujet des questions éthiques engendrées par le Big Data, lire cet article de ParisTech Review à propos de l’éthical data mining : Big Data et données personnelles : vers une gouvernance éthique des algorithmes, par Jérôme Béranger le 22.12.2014
(7) Prédiction de la fréquentation des gares SNCF en Île-de-France, un challenge lancé sur la plateforme datascience.net
(8) Instituts d'études et sondage, l'effet big data, lenouveleconomiste.fr, par Anne-Laurence Gollion le 17.09.2014
(9) Why Do We Need Data Science when We’ve Had Statistics for Centuries?, annenberglab.com, par Irving Wladawsky-Berger le 30.04.2014 et An executive’s guide to machine learning, mckinsey.com, par Dorian Pyle and Cristina San Jose, juin 2015
(10) Statactivisme, comment lutter avec les nombres, data.blog.lemonde.fr, par Alexandre Léchenet le 25.06.2014
(11) Pour tout résoudre, cliquez ici, d’Evgeny Morozov, aux éditions Fyp
(12) La technologie est-elle toujours la solution ? (2/2) : le risque du solutionnisme, internetactu.net, par Hubert Guillaud le 28.03.2013
(13) L’erreur record des sondages sur le référendum grec, blog.mondediplo.net, par Alain Garrigou le 13.07.2015
(14) La victoire d'Obama : cas d'étude concret d'utilisation des Big Data, journaldunet.com, par Henri Ruet le 23.03.2013
(15) Et Nate Silver, saint patron des "nerds", créa le data, M le magazine du Monde, par Louise Couvelaire le 24.05.2013
(16) Facebook : de la nécessité de protéger ses données "relationnelles", Le Monde.fr, par Guilhem Fouetillou le 22.04.2010
(17) Digital Labor : comment répondre à l’exploitation croissante du moindre de nos comportements ?, internetactu.blog.lemonde.fr, par Hubert Guillaud le 20.12.2014