The BBC news today is full of articles about the extent of, and background to, the extensive spying the government does on private citizens’ personal communications such as Facebook, e-mails, phone calls, and so on. As a big data professional as well as a citizen of a democratic country, I think there are a couple of angles worth taking. For one, what does the commercially available software that gets sold to governments to achieve this sort of thing look like, what can it do, and how does a big data architecture like that do it? Secondly, why is all this going on? Why does a government feel compelled to spy on its own citizens and on non-citizens?
Let’s start with the why. Please bear with me. In my humble opinion we’ll have to go into quite a bit of background, but it should be fairly obvious that terrorism is only part of the picture.
Let’s start with the government’s expressed desire to stop terrorism. In the olden days terrorism was either a crime like murder, or it was government-induced criminal activity and given the label of spying. Examining the label of terrorism, we find that it is applied easily and without a specific definition. The Oxford English Dictionary defines terrorism as “the unofficial or unauthorised use of violence and intimidation in the pursuit of political aims” (not sure what we call the official use of violence. War?). In any case this definition, read strictly, seems to exclude religiously motivated murderers.
I am very much an empirical person, so I am attaching a terrorism study here. Please remember that the definition of terrorism it uses is loose: it includes, for example, state-sponsored terrorism, which seems to be a contradiction in terms as it is official terrorism, and it also includes Syria, which could maybe be better described as a civil war. The study was carried out by the National Consortium for the Study of Terrorism and Responses to Terrorism (a Department of Homeland Security Science and Technology Center of Excellence based at the University of Maryland) and published in May 2013 by the U.S. Department of State (credit to the fact that they have to publish this stuff). Please make up your own mind, but to me the numbers don’t add up to much. Here are some surprising extracts:
In 2012, the main object of the research, 6,771 terrorist attacks were recorded worldwide, with 11,098 people killed. According to the research, 65% of all attacks can be attributed to three countries (Pakistan, Afghanistan and Iraq), and 80% of attacks are perpetrated in only 10 countries. What I am trying to highlight is that, comparing the periods before and after the war on terror publicly announced by the then U.S. administration in 2001, nothing has changed in a statistically significant way. The global victims of terrorism are outnumbered by the victims of crime in Washington DC (35,358 in 2011 in DC vs about 34,000 through terrorism). Death tolls from terrorism are admittedly a lot higher, but ranked globally, death rates due to terrorism are insignificant. According to the World Health Organisation terrorism would rank at spot 77 if it were a disease (just below leprosy). According to the Economist, in an article entitled “Is America spending too much on homeland security?”, published April 30th 2011, 22:21 by N.B. from WASHINGTON, “America’s spending on counterterrorism outpaced all anti-crime spending by some $15 billion”. I was unable to find how much money the USA spends on fighting leprosy worldwide, but Novartis managed to collect over $200 million cumulatively since the year 2000.
In short, the numbers of dead and affected certainly don’t reflect the effort targeted, given that there seem to be no obvious links between security spend and effect. Maybe it is prevention of terror that we have to thank for the low numbers? Would we all live in chaos if we hadn’t fought terrorism? Maths unfortunately is quite weak in that “what might have been” area. In any case, the compulsory fitting of cameras to cars in Russia prevented more crime than was committed cumulatively by terrorism in the last 3 years put together. So maybe there are better areas to spend this kind of budget on globally when it comes to protecting lives?
Not so. The wider picture of resource access by population shows clearly that the overall trade balance is very much positively affected after resource wars. Terrorism is just a byproduct. With the population growing out of all proportion, and 10% of the world’s population being significantly better off than the remaining 90%, there is a need to defend the wealth distribution. We would complain bitterly if we had 6-7 billion unhappy people suddenly working together to prevent us from accessing resources in their countries. Caesar’s mantra of “divide and conquer” still holds.
So why spy on everyone? Well, if I am correct in the above reasoning and information is a resource, then surely global mining of this resource is critical to protect the status quo. Insight into commercial, political and private behaviours gives leverage and insight, both of which are valuable when trying to influence perceptions.
So what the heck does this software stuff look like? The truth is I have no idea. I would break official secrets acts if I did know and published this information, but let’s look at software capability and hardware constraints to estimate what is possible with commercially available tools, and what isn’t.
At the heart of the envisaged architecture lies, in essence, a big data construct. Big data is an overused term, so let’s focus on the four use cases.
A) find significant events through search / automatic filters
B) create insight into and monitor a cumulative data group (big picture analysis)
C) manage and influence the perceived reality in one area
D) predict the future development for a subject area.
Let’s start with A). The simplest example would be Google. They let you search through vast amounts of collected and indexed information, and the user decides what is significant to them. Google helps by providing prioritisation based on your past behaviour and the behaviour of others who used similar terminology.
Searches are probably the least effective way to interact with big data from a military and governmental perspective, especially when it comes to crime prevention. However, search becomes more effective when I am actively investigating a crime and I have some details to work with. The key to search is having existing parameters an investigator can link. Using simplified data models/classifications such as POLE (People, Objects, Location, Events) allows tools like Palantir to be very fast in building up network diagrams. For example, a set of phone numbers, car number plates, credit/debit cards and similar info allows investigators rapid investigation output, assuming they have the source data and the tools to use the information electronically.
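To make the POLE idea a little more concrete, here is a minimal sketch of how linked people, objects, locations and events can be stored and walked as a network. It is purely illustrative: the class name `PoleGraph`, its methods and all the entities are invented for this example, and it says nothing about how Palantir or any other product works internally.

```python
from collections import defaultdict, deque

# The four POLE entity kinds: People, Objects, Locations, Events.
KINDS = {"person", "object", "location", "event"}


class PoleGraph:
    """Tiny undirected graph of POLE entities (hypothetical, illustrative only)."""

    def __init__(self):
        self.kind = {}                    # entity id -> POLE kind
        self.links = defaultdict(set)     # entity id -> ids it is linked to

    def add(self, entity, kind):
        if kind not in KINDS:
            raise ValueError(f"unknown POLE kind: {kind}")
        self.kind[entity] = kind

    def link(self, a, b):
        # Both ends must already be registered entities.
        if a not in self.kind or b not in self.kind:
            raise KeyError("link endpoints must be added first")
        self.links[a].add(b)
        self.links[b].add(a)

    def within(self, start, hops):
        """Return {entity: distance} for everything at most `hops` links from start."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if seen[node] == hops:
                continue                  # do not expand beyond the hop limit
            for nxt in self.links[node]:
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        del seen[start]
        return seen


# Invented example data: a person, her phone, a call, and who it leads to.
g = PoleGraph()
for entity, kind in [("alice", "person"), ("bob", "person"),
                     ("phone-0700", "object"), ("car-AB12", "object"),
                     ("call-1", "event"), ("depot", "location")]:
    g.add(entity, kind)
for a, b in [("alice", "phone-0700"), ("phone-0700", "call-1"),
             ("call-1", "bob"), ("bob", "car-AB12"), ("car-AB12", "depot")]:
    g.link(a, b)

print(g.within("alice", 3))
```

The point of the sketch is that once records are reduced to a handful of entity kinds and links, "who is connected to whom within N steps" is a plain breadth-first walk, which is why the source data matters far more than the cleverness of the tool.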
Three years ago I was at a Business Information conference where a large national telco company presented. They demonstrated location-based marketing analysis (which consumer goes where and how this information could be used, if only the consumer would agree). When asked by an audience member how many police and security information requests they got per year, they stated that at any point in time they were passing on live data streams from about 10,000 mobile devices, and that this number was growing rapidly. The requests were in fact so frequent that they had built a secure gateway to make this interaction automatic, rather than manual.
Assuming there are at least 6 similarly sized organisations, with an equal amount of surveillance in each, we are looking at about 0.1% of the UK population under phone surveillance in 2010. The power of scale here is predicated on legal access to data, and not on the sophistication of the software itself. If access was legally restricted (for example) to active investigations only, then people networks could not be built so easily.
Maths will have to continue to help us out. Phones register on the network. At any time there are 1-2 times as many phones connected to a network as there are people in a country. Most interesting data is logged directly with the network. Where and when you logged in, who you called, texted and so on, all of that is basic data and it is stored in telco companies’ big data stores (either super scalable dbs such as Sybase IQ (world record holder in data take-on of 34TB per hour) or a distributed storage capability like Hadoop file systems). The data is small and extremely insightful. Node network diagrams, location-based marketing and influencer networks emerge quickly and are useful marketing tools. However, data packages are a bit more complex to handle. iPhones for example love sending texts via the data network to save people money (texts can be more expensive than data, but that picture is changing too). Looking into data packages is complicated as they are often application specific, compressed, encrypted and incremental (i.e. what I am sending could be move 12 in a chess match that started 3 days ago; “a4” by itself doesn’t mean a lot). What we should remember here is that semantics and context are important to get anything out of information. It still takes the fraud office 1-12 years to bring a prosecution against a major corporation, simply because the data and documentary evidence chain is so large (just think of the total number of emails in a company).
The next bit of maths in this scenario is more significant. Assuming that a monitored phone communicated with 20 other individuals, then indirect access to people widens to 200,000 individuals for Vodafone alone. Assuming some overlap (let’s guess 1 in 4), the UK defence groups would have access to some personal phone communications for 900,000 people, equalling about 1.5% of the population. We should however not fall into the trap of thinking that 1.5% of the UK population are therefore under surveillance. Only exchanges between a watched and an unwatched person are affected.
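For anyone who wants to check the back-of-envelope figures from the last few paragraphs, the arithmetic can be written out as below. The operator count, monitored devices, contacts per phone and overlap are the guesses used in the text; the UK population of roughly 62 million is my own assumed round figure for 2010 and is not taken from any source quoted here.

```python
# Assumption (not from the text): rough UK population around 2010.
UK_POPULATION = 62_000_000

# Guesses taken from the text above.
OPERATORS = 6                    # similarly sized telcos
MONITORED_PER_OPERATOR = 10_000  # live data streams per telco
CONTACTS_PER_PHONE = 20          # individuals each monitored phone talks to
OVERLAP = 0.25                   # "1 in 4" contacts counted more than once

# Direct surveillance: phones whose data is streamed live.
monitored = OPERATORS * MONITORED_PER_OPERATOR
direct_share = monitored / UK_POPULATION

# Indirect reach: people who exchange calls or texts with a monitored phone.
contacts_per_operator = MONITORED_PER_OPERATOR * CONTACTS_PER_PHONE
contacts_total = OPERATORS * contacts_per_operator
distinct_contacts = int(contacts_total * (1 - OVERLAP))
indirect_share = distinct_contacts / UK_POPULATION

print(f"monitored phones:       {monitored:,} ({direct_share:.1%} of population)")
print(f"contacts per operator:  {contacts_per_operator:,}")
print(f"distinct contacts:      {distinct_contacts:,} ({indirect_share:.1%} of population)")
```

Changing any one of the guesses (say 50 contacts per phone instead of 20) shows how quickly the indirect reach grows, which is really the point of the exercise.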
The news that monitoring goes a lot further when it comes to emails and phone data is interesting, as we are moving into scenario B – the “big picture” analysis.
B) Getting the big picture. If a scientist wants to understand a statistical population from any angle, he or she stands and falls by the collected data. The search for true insight is determined by three core elements: data, data and more data. Let’s try an example and see if we get anywhere.
Do beans grow better to classical music? To find that out I could grow a bean while playing classical music in the background and see how that goes. But statistically speaking I am really not proving anything. First off I need to isolate the elements I wish to measure. Is it for a specific bean type? Is music volume a factor? In short, how precise can I be with my conditional attributes? The next element is standard deviation / genetics. If I grow 20 beans for 25 days under equal conditions their height will still vary. So what is the standard deviation? Knowing it, I can exclude natural variance from my result by setting the target outcome above the deviation. The next aspects are spurious influences, effects I have difficulties in identifying (maybe a bean was dropped, or some acid rain fell on it?). These lead to outliers. Having done all of the pre-work I can start forming a hypothesis with a relevant factor and design an experiment to prove or disprove it. Once set up, I can run the experiment under laboratory conditions and find out if my hypothesis is in fact correct. The final tricky step is to understand if the results are applicable and/or relevant. If I am a farmer of beans, and beans grow faster to classical music, does that mean I should play my beans classical music? The answer is “not necessarily”, as the experiment focused on speed of growth, not harvest yield, nor does it factor in costs. Providing the music might be more costly than any financial benefits from increased sales revenue.
What the above bean example shows is that data is not just important from an experimental point of view, it is critical when it comes to setting up the parameters. What factors affect bean growth? Is bean growth related to bean harvest? In short, many factors affect outcomes and test design.
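The "natural variance" step of the bean example can be sketched in a few lines. The heights below are invented for illustration, not real measurements, and the rule used (only count a gain that is larger than the control group’s standard deviation) is the crude rule of thumb from the paragraph above rather than a proper significance test.

```python
from statistics import mean, stdev

# Invented heights in cm after 25 days; NOT real measurements.
control = [30.1, 31.4, 29.8, 32.0, 30.6, 28.9, 31.1, 30.3, 29.5, 31.8]
music = [30.9, 32.2, 30.4, 31.7, 31.0, 29.6, 32.5, 30.8, 30.2, 32.1]

# Natural variation between beans grown under identical conditions.
natural_spread = stdev(control)

# Average extra height of the beans that listened to music.
gain = mean(music) - mean(control)

# Rule of thumb from the text: the outcome must sit above the deviation.
exceeds_natural_variance = gain > natural_spread

print(f"natural spread: {natural_spread:.2f} cm")
print(f"gain with music: {gain:.2f} cm")
print("gain exceeds natural variance:", exceeds_natural_variance)
```

With these made-up numbers the music beans are taller on average, but the gain is smaller than the spread you get with no music at all, so under this rule the experiment has shown nothing yet.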
So how does the bean example relate to privacy? For one thing we are a lot more complicated than beans. We learn. We adapt. For example, ever since my friends and I heard some 8 years ago that America monitored voice conversations for key words, we started our discussions with the expansion of hierarchies in electronic Bills of Materials (Bill of Materials = BoM, and expanding the hierarchy is referred to in IT as “exploding”). We were young and stupid, and nothing ever came of it, except when some colleagues of ours were extensively searched because they genuinely discussed the subject during pre-flight check-in.
Finding significance and insight for a population in regards to behaviour is mathematically very complicated, ever changing and never truly conclusive. So most of the time in this field is spent finding dynamics. Terrorists used to get trained in Pakistan and Afghanistan. That is a statistical big picture insight that affects behaviours. Try flying into New York from Pakistan during the early 1990s and compare this to flying in from Canada. What we learned is that terrorists adapt quickly. Some of the 2001 terrorists came via Germany, where they lived and trained.
Big picture analysis is in essence trying to find patterns and outliers that help direct more targeted activities. The issue in large scale data acquisition is that you find additional insights that might have nothing to do with your originally stated objectives (for example “how are people’s share buying habits affected when the government announces the possibility of a terror threat” could be a possible insight?). It should also be noted that the certainty in these studies is borderline useless, as the complexity of interactions with other attributes is almost endless. So the significance of findings is limited.
What is significantly different is the need for sampling. If I am testing the effect of music on bean growth I will run my experiment against a relevant sample of beans. Establishing the size of the sample depends largely on the attributes. Big data big picture analysis does not require sampling; it can run against the entirety of the available data (for example every Facebook account and every number plate on cars). Statistically this is significantly more accurate than sampling. Instead of being limited by your imagination in regards to which attributes are significantly affecting bean growth, perfect data allows you to find significant attributes by analysis alone.
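A toy illustration of the sampling point, under made-up assumptions: a population of 100,000 records in which exactly 0.1% carry some rare attribute, and a random sample of 500. The population, the flag and the sample size are all invented for this sketch.

```python
import random

rng = random.Random(7)   # fixed seed so the run is repeatable

N = 100_000
# Hypothetical population: every 1000th record carries a rare attribute.
population = [i % 1000 == 0 for i in range(N)]

# Running against the full data set gives the exact rate.
full_rate = sum(population) / N

# Running against a sample gives an estimate with sampling error.
sample = rng.sample(population, 500)
sample_rate = sum(sample) / len(sample)

print(f"full data rate: {full_rate:.4f}")
print(f"sample rate:    {sample_rate:.4f}")
```

On average a sample of 500 contains only half a flagged record, so many samples will report a rate of zero and miss the attribute entirely, whereas the full scan cannot miss it.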
Powerful stuff. However, as individuals we do not provide full attribute documentation. In fact we ourselves don’t know how we are likely to behave. This makes it difficult to provide perfect data. However, some data sources reveal more than just basic data about us. Word choices and sentence constructs in text and on social media sites like Twitter and Facebook reveal a lot about us. Our bank and credit cards reveal even more. Phone records such as activity and movement logs are in fact gold mines. Today we can buy sentiment analysis tools for little money, and most marketing organisations use them to understand customer and prospect trending (for example: is a customer likely to switch service providers in the next 30 days? Or which websites are most often used by my target groups?).
What is surprising to me is that this is still far more data than the Stasi ever collected in East Germany.
In architecture terms this is getting interesting from an IT perspective. We should lead in the design with some questions and suggested answers:
a) Do we need to acquire the data we wish to search? No. We can simply index the records at an abstract level, and software solutions are very good at this. Just look at scientific search systems in law. However, it becomes clear that these work a lot better the more material there is in a single record. Text analysis and sentiment analytics are now available in the cloud (think Netbase). So a cold storage layer like Hadoop and an indexing engine with sentiment analysis are important.
b) Filter capability. Based on high volume analysis as well as trends spotted in the field, it is advisable to have a real-time filter that only lets data through that has passed a set of basic rules. Usually these technologies are termed Event Stream Processors (ESPs) or Complex Event Processors (CEPs) and are a very mature technology. They are usually attached to a data stream rather than a data source; they apply real-time filters to information and act in line with fixed rules. These filters ensure nothing noisy gets through. Some of these technologies now have dynamic rule capability and learn from patterns. But in general the filter is calibrated to let doubtful material through.
c) Processing and Analysis. Here automation is still key for the primary analysis. In order to crunch through masses of data and find patterns, three things have to be present in the processing engine. 1) a real-time in-memory cache for calculation and staging [...] (People, Objects, Location and Events). The POLE data model makes it extremely simple to store any object type and its relationships without running out of headroom. The processing engine is responsible for matching new info with existing info and seeing if the additional information is significant. We mash all this information up in warm storage and create a very large repository that can be queried relatively quickly.
d) An advanced in-memory analytics engine (preferably working in cache and directly in core processors). This in-memory technology (SAP HANA is what springs to mind here) is capable of crunching through 100s of terabytes in seconds, making statistical calculations in near real time on the complete data set. If you want to find out anything remotely pattern-related about your data, this is what works.
e) Enrichment engine. Often found in ELT capabilities, this is where information is augmented with additional information such as geo-tagging, name de-duplication, spelling correction, short-hand decryption, … All of these capabilities ensure the above engines work with the best data available. Language capabilities are extremely important. Augmentation engines such as voice-to-text, handwriting algorithms, video and picture recognition and others also fall into this category.
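To give a feel for how the filter in b) and the enrichment step in e) sit on a data stream, here is a deliberately tiny sketch. Everything in it is hypothetical: the rules, the watchlist number, the keywords and the event fields are invented, and real ESP/CEP products have their own rule languages and are not programmed like this.

```python
import re

# Hypothetical rule inputs; a real rule set would be far richer.
WATCHLIST = {"+440000000001"}                       # made-up number
KEYWORDS = re.compile(r"\b(transfer|meeting)\b", re.IGNORECASE)


def on_watchlist(event):
    return event.get("sender") in WATCHLIST


def has_keyword(event):
    return bool(KEYWORDS.search(event.get("text", "")))


RULES = [on_watchlist, has_keyword]


def stream_filter(events, rules):
    """Pass an event on as soon as ANY rule matches; drop the rest.

    Matching on any single rule mirrors a filter that is calibrated
    to let doubtful material through rather than block it.
    """
    for event in events:
        if any(rule(event) for rule in rules):
            yield event


def enrich(event):
    """Return a copy with a normalised name key for de-duplication."""
    out = dict(event)
    out["name_key"] = " ".join(event.get("name", "").lower().split())
    return out


# Invented events standing in for a live stream.
events = [
    {"sender": "+440000000001", "name": "  Jane  DOE", "text": "see you later"},
    {"sender": "+440000000002", "name": "John Roe", "text": "nothing to report"},
    {"sender": "+440000000003", "name": "jane doe", "text": "Meeting at noon"},
]

passed = [enrich(e) for e in stream_filter(events, RULES)]
for e in passed:
    print(e["sender"], "->", e["name_key"])
```

Because the filter is a generator it works on one event at a time and never needs the whole stream in memory, which is the property that lets this kind of component sit on the stream rather than on the stored data. The normalised `name_key` is the simplest possible form of name de-duplication: two differently typed versions of the same name end up with the same key.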
Which brings us to Part 2 of this series. In it I hope to discuss the next two use cases and take a look at the architectures required (although most parts are in the above architecture already). What I really want to discuss is where the limits of this model lie, and what the shortcomings might be.
Thanks for reading this far. Thoughts? Questions? Just mail me.