The Search for the "God Particle" with Big Data

Dec 27, 2018

The Higgs field is responsible for giving mass to fundamental particles like electrons and quarks, which are the building blocks of matter that cannot be further broken down. To isolate the Higgs boson particle associated with this field, scientists use the Large Hadron Collider (LHC). This machine creates around 600 million collisions every second between 3 quadrillion protons. During this process, more than 50 trillion bytes of data are generated and analyzed to confirm the existence of the Higgs boson particle.

With data this massive and fast-growing, big data techniques made it possible to analyse the plethora of particles the collisions produce. The data is analysed by algorithms programmed to detect the energy signatures left behind by the appearance and disappearance of the elusive particles CERN is looking for. The algorithms then compare the results with theoretical predictions of how the particle is believed to behave.
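As a rough illustration of that filtering idea, the sketch below keeps only the events whose measured energy falls within a tolerance of a predicted value. The `Event` record, the `select_candidates` function and all the numbers are hypothetical placeholders chosen for this example; they are not CERN's actual data format or algorithms.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: int
    energy_gev: float  # hypothetical measured energy signature


def select_candidates(events, predicted_gev, tolerance_gev):
    """Keep events whose energy lies within the tolerance of the prediction."""
    return [
        e for e in events
        if abs(e.energy_gev - predicted_gev) <= tolerance_gev
    ]


# Made-up sample events; real detectors record far richer information.
events = [Event(1, 91.2), Event(2, 125.3), Event(3, 124.8), Event(4, 300.0)]

candidates = select_candidates(events, predicted_gev=125.0, tolerance_gev=1.0)
candidate_ids = [e.event_id for e in candidates]
print(candidate_ids)
```

Only the events close to the predicted value survive the filter; everything else is discarded, which is the sense in which a huge volume of data gets reduced to the rare signal that matters.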

On 4 July 2012, scientists at CERN announced that they had found a new particle consistent with the Higgs boson, a ground-breaking discovery. The following year, the 2013 Nobel Prize in physics was awarded jointly to François Englert and Peter Higgs “for the theoretical discovery of a mechanism that contributes to our understanding of the origin of mass of subatomic particles, and which was confirmed through the discovery of the predicted fundamental particle, by the ATLAS and CMS experiments at CERN’s Large Hadron Collider”.

The Higgs hunt is a vivid illustration of a much broader shift. The same pressures that define it (an overwhelming volume of data, arriving at enormous velocity, that must be filtered down to the rare signal that actually matters) now show up far from any particle accelerator, in ordinary businesses. The rest of this post steps back from the collider to look at that wider world of data.

Data

Before computers became ubiquitous, data was recorded on paper: sample surveys and questionnaires, population censuses, student grades. Collecting data this way was usually time-consuming and inefficient. Such collections are mostly structured and follow a fixed format suited to statistical analysis. The advent of computers eased the laborious task of data collection and analysis, especially with spreadsheet applications like Microsoft Excel and Lotus 1-2-3, which allow data to be organised, analysed and stored in tabular form.

The invention of the World Wide Web (WWW) in 1989 by Sir Tim Berners-Lee made it far easier to collect, store and analyse data on the computer. As the Web made data easily accessible, more people started contributing to the existing data. This resulted in a very large volume of data in different formats: structured, unstructured and semi-structured.

Structured data is the kind of data entered into the fields of a web form. It can be easily stored in a database, spreadsheet or any other table-style system with rows and columns: each row is one collected record, and each column is a defined field (e.g. name, address, date of birth) of that record.
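To make the row-and-column idea concrete, here is a minimal sketch that stores a couple of form submissions as a table and reads them back. The field names and the sample records are invented for illustration, and an in-memory buffer stands in for a real file or database.

```python
import csv
import io

# Each column is a defined field of the form (names assumed for this example).
FIELDS = ["name", "address", "date_of_birth"]

# Each dictionary is one submitted form, i.e. one row of the table.
submissions = [
    {"name": "Ada Example", "address": "1 Sample Street", "date_of_birth": "1990-01-15"},
    {"name": "Bob Example", "address": "2 Sample Street", "date_of_birth": "1985-07-30"},
]

# Write the records in tabular (CSV) form.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(submissions)

# Read the table back: every row comes out with the same, predictable fields.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
names = [row["name"] for row in rows]
print(names)
```

Because every record has the same fields, a question like "list all the names" is a one-line lookup, which is what makes structured data easy to analyse.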

Unstructured data, on the other hand, can’t be easily categorised: tweets, photos, videos, clickstreams and the like. On closer inspection, though, some of it can be reclassified as semi-structured. Email is a good example: it contains structured metadata, like the email headers, alongside unstructured text and perhaps photos or other attached documents. Metadata, data that describes and gives information about other data, can add some structure to unstructured data, like word tags on photos for ease of identification or hashtags on social networking platforms for topic classification. Unstructured data can’t be stored neatly in table-style systems such as relational databases or spreadsheets, which makes it hard to extract useful information from it even though there is so much of it around us; special tools have been built to help sort it out.
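The email example can be sketched with Python's standard `email` module. The message below is made up; the point is only that the headers can be read as named fields, while the body is free text from which a little structure (here, hashtags) has to be dug out.

```python
from email import message_from_string

# A made-up message: structured headers, then a blank line, then free text.
raw_email = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Holiday photos\n"
    "\n"
    "Hi Bob, here are the pictures from the trip. #travel #beach\n"
)

message = message_from_string(raw_email)

# Structured part: headers behave like named fields.
metadata = {
    "from": message["From"],
    "to": message["To"],
    "subject": message["Subject"],
}

# Unstructured part: the body is just text.
body = message.get_payload()

# Hashtags act as lightweight metadata embedded in the free text.
hashtags = [word[1:] for word in body.split() if word.startswith("#")]

print(metadata)
print(hashtags)
```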

Big Data Basics

The term “big data” refers not only to large data sets, but also to the techniques and tools used to analyze them. Data can be collected from any data-generating process, such as social media, public utility infrastructure, search engines, etc. And as discussed earlier, big data may be structured, semi-structured or unstructured.

Typically, big data is collected and analyzed at specific intervals, but real-time big data analytics collects and analyzes data constantly. The purpose of this continuous processing loop is to offer instant insights to users.
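The difference between the two styles can be shown with a toy example. This is only a sketch with invented purchase amounts and hypothetical function names; real systems use dedicated batch and stream-processing tools rather than plain loops.

```python
def batch_total(purchases):
    """Interval style: wait for the whole batch, then analyze it once."""
    return sum(purchases)


def streaming_totals(purchases):
    """Real-time style: update the result as each record arrives."""
    running = 0
    for amount in purchases:
        running += amount
        yield running


# Invented purchase amounts arriving over one interval.
purchases = [20, 5, 15]

end_of_interval = batch_total(purchases)
live_updates = list(streaming_totals(purchases))

print(end_of_interval)  # a single answer once the interval is over
print(live_updates)     # an updated answer after every record
```

Both approaches end at the same total; the real-time version simply has an answer available at every step along the way.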

Today, with the average large business storing more than 200 terabytes of data, companies have more than enough to tell them who is buying their products, as well as how, when and where the buying takes place. Customers expect companies to know how they feel about their products and how they can be served better. Customers drop cues all over the Web, and companies can get hold of this data and make sense of it. So the data is there; it just isn’t only in the rows, columns, reports and purchase histories we’re used to. Most of it is unstructured, and today there are technologies and tools to make sense of it (e.g. IBM Smarter Analytics).

When we have an enormous amount of information whose parts can be made to interoperate, we can come up with answers to societal problems.

Big Data Scandals

Big data has also had its share of data breaches. The biggest scandal yet involved Cambridge Analytica, a London-based political consulting firm that collected consumer data for use in election campaigns around the world.

The data of up to 87 million Facebook users, mostly in the U.S., was obtained by the analytics firm, which, among its other work, helped elect President Donald Trump. In response to that revelation, lawmakers and regulators in the U.S. and U.K. increased their scrutiny of the social media giant, and at least some Facebook users cancelled their accounts. The uproar has only added to the pressure on Facebook and Chief Executive Mark Zuckerberg over how the company was used during the 2016 presidential campaign to spread Russian propaganda and phony headlines.
