Big Data A to Z

The big data ecosystem benefits from a rich and diverse array of products and projects, veritable pieces in a puzzle that IT professionals largely assemble themselves. But this big data abundance can be overwhelming at times, so we put together this guide to help you understand some of the most common terms you’ll come across in your big data journey.

In the big data jungle, it’s worth starting at the top, which brings us to…

A is for Apache, algorithms, and analytics.

Nearly all of the most popular big data projects are hosted by the Apache Software Foundation (ASF), making it the go-to place for open source software. Machine learning algorithms lie at the heart of the big data movement, which is largely about advanced analytics.

B is for big, behavioral, and Bayes.

Love it or hate it, the phrase “big data” has stuck, even if it’s not the best one to describe the challenges and opportunities (“operationalizing data science” might be another one, but say that 10 times fast). Understanding user behavior is one of the keys to advanced analytics, while Bayes’ theorem is a standard statistical method data scientists use to understand the probability of an event.
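To make Bayes' theorem a little more concrete, here is a minimal Python sketch. The function name and the fraud-detection numbers are hypothetical and chosen only for illustration; the formula itself is the standard one, P(event | evidence) = P(evidence | event) x P(event) / P(evidence).

```python
def posterior(prior, likelihood, false_positive_rate):
    """Return P(event | evidence) using Bayes' theorem."""
    # Total probability of observing the evidence, whether or not the event occurred.
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical inputs: 1% of transactions are fraudulent, and a detector
# flags 90% of fraudulent ones and 5% of legitimate ones.
p_fraud_given_flag = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(f"P(fraud | flagged) = {p_fraud_given_flag:.3f}")
```

Even with a fairly accurate detector, a flagged transaction here is more likely legitimate than fraudulent, because fraud is rare to begin with.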

C is for Cassandra, cloud, cleansing.

Cassandra is a super popular NoSQL database from the ASF that scales horizontally like no other. The cloud figures prominently in big data workloads going forward, while data cleansing is the least loved part of a data scientist’s workload.

D is for data scientist, deep learning, distributed.

Sometimes called unicorns due to their rarity, data scientists who boast talents in statistics, computer science, and business theory are the superstars of the big data movement. The emergence of huge training data sets and cheap server clusters has given rise to a new form of artificial intelligence called deep learning. And nothing gets done in the big data world if the workloads aren’t distributed horizontally in a parallel fashion (with all due apologies to IBM’s vertical Power).

E is for Exhaust, ETL, ensembles.

When companies discovered they could get value from the so-called data “exhaust” that they used to throw away after processing transactions, the big data movement was born. Early data analytics practitioners used extract, transform, and load (ETL) software products to load data warehouses. In machine learning, an ensemble combines the outputs of a collection of many algorithms.
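One simple way to picture an ensemble is majority voting: several models each make a prediction, and the most common answer wins. The sketch below assumes three toy spam rules standing in for trained models; the rules, field names, and thresholds are all hypothetical.

```python
from collections import Counter


def majority_vote(models, x):
    """Return the label predicted by most of the models for input x."""
    votes = [model(x) for model in models]
    return Counter(votes).most_common(1)[0][0]


# Three hypothetical rule-based "models"; a real ensemble would use trained ones.
models = [
    lambda msg: "spam" if msg["links"] > 3 else "ham",
    lambda msg: "spam" if msg["caps_ratio"] > 0.5 else "ham",
    lambda msg: "spam" if msg["length"] < 20 else "ham",
]

message = {"links": 5, "caps_ratio": 0.7, "length": 120}
label = majority_vote(models, message)
print(label)
```

Voting is only one way to combine models; averaging and stacking are other common choices.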

F is for fast, fault-tolerant, forests.

Some data professionals say the growing need to process data quickly in real time, or “fast data,” is the real problem that we face. As software projects mature, users expect enterprise-like features, such as fault tolerance. Random forests are an ensemble learning method used for classification and regression.

G is for governance, geospatial, graph.

Data governance currently is one of the biggest obstacles for analytic projects that involve unstructured and semi-structured data, while geospatial data presents both a challenge and an opportunity. Graph databases are emerging as a suitable place to run certain analytic workloads, such as fraud detection and recommendation systems.

H is for Hadoop, HPC, hype.

Hadoop is one of the key technologies enabling the collection and processing of big data. Before Hadoop democratized parallel computing, it was primarily the domain of supercomputer specialists working in high performance computing (HPC). It’s tough to read anything in big data without ingesting a little hype.

I is for ingest, Impala, IoT.

Ingesting large amounts of data can be tough, but is critical if you’re to process it. Impala is one of the leading ASF projects that seek to re-create relational-style data warehousing on Hadoop using SQL. The Internet of Things (IoT) is expected to generate huge amounts of machine data in the coming years.

J is for JSON, JBOD, Java.

Many of today’s NoSQL databases store data in the JSON (JavaScript Object Notation) format that’s become popular with Web developers. Just a bunch of disks (JBOD) provides the Hadoop Distributed File System (HDFS) with a simple and straightforward way to store large amounts of data. Hadoop was written in Java, which continues to be one of the most popular languages for developing big data apps.
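For readers who have not worked with JSON, the following sketch shows a nested record being serialized to JSON text and read back using Python's standard json module. The record and its field names are made up for illustration.

```python
import json

# A hypothetical user record with nested fields, similar in shape to
# what a document-oriented NoSQL database might store.
record = {"user_id": 42, "tags": ["big", "data"], "profile": {"city": "Austin"}}

text = json.dumps(record, sort_keys=True)  # Python dict -> JSON string
restored = json.loads(text)                # JSON string -> Python dict

print(text)
print(restored["profile"]["city"])
```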

K is for Kafka, key-value stores, K-Means.

The open source Apache Kafka project has emerged to become the de facto standard data message broker used in the big data space. Key-value stores such as Memcached and Redis are the most basic databases, and are used when processing speed is valued above everything else. K-Means is a popular clustering algorithm used in the machine learning world.
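As a rough sketch of how K-Means works, the Python below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. It is a bare-bones illustration, not a production implementation: the function name and sample points are hypothetical, and it seeds the centroids with the first k points to stay deterministic, whereas real libraries typically use random or k-means++ initialization.

```python
def kmeans(points, k, iterations=10):
    """Cluster tuples of floats into k groups with plain Lloyd-style iterations."""
    # Simplifying assumption: use the first k points as the starting centroids.
    centroids = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members.
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, clusters


# Hypothetical 2-D data with two obvious groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```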

L is for Lake, log files, Lambda.

A big data lake is where you store all your unstructured or semi-structured data exhaust that you want to analyze later. Log files generated by servers, security gear, and networking equipment are among the most popular data types analyzed in big data. The Lambda Architecture was created to unify batch and real-time processing paradigms.

M is for machine learning, metadata, mining.

Data scientists commonly use machine learning models to automatically extract insight from large volumes of data. Metadata is data that describes the content of a given data file, and is very important for data governance initiatives. Data mining is not a new term, but still sufficiently describes what most modern data scientists do today.

N is for NoSQL, NLP, nodes.

NoSQL databases such as Cassandra and MongoDB have emerged to power distributed processing on large clusters of commodity servers. The analysis of text and words is the domain of natural language processing (NLP), which benefits from the scale of today’s big data platforms and deep learning methodologies. No discussion of distributed computing is complete without the notion of a node, or a single server in a larger cluster.

O is for outliers, operationalization, OLAP.

How you deal with the outliers in your data distribution is critical to the success of your big data project. In data science, much of the focus lies in how best to operationalize smaller test projects. Online analytical processing (OLAP)–an old term that was basically synonymous with multi-dimensional databases in the data warehousing days–is seeing a resurgence atop modern distributed platforms like Hadoop.

P is for pipelines, petabytes, predictive analytics.

The notion of data flows and independent “pipelines” of data is becoming increasingly popular in stream processing and real-time analytics. A petabyte of data, or 1,000,000,000,000,000 bytes, is enough to store three copies of the DNA of every United States citizen. Predictive analytics largely refers to the use of machine learning to predict an outcome from a given piece of data.
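The byte units that come up in big data are easier to keep straight as powers of ten. The small Python sketch below uses the decimal (SI) prefixes, which matches the petabyte figure above; the dictionary and function names are just for illustration.

```python
# Decimal (SI) prefixes: each step up is a factor of 1,000.
POWERS_OF_TEN = {
    "kilobyte": 3,
    "megabyte": 6,
    "gigabyte": 9,
    "terabyte": 12,
    "petabyte": 15,
    "exabyte": 18,
    "zettabyte": 21,
    "yottabyte": 24,
}


def bytes_in(unit):
    """Return the number of bytes in one of the named decimal units."""
    return 10 ** POWERS_OF_TEN[unit]


for unit in POWERS_OF_TEN:
    print(f"1 {unit} = {bytes_in(unit):,} bytes")
```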

Q is for quants, queries, quadrants (as in Magic).

Wall Street was ahead of the curve in hiring quantitative analysts, or “quants,” to make money in securities, but now other industries are catching up. Queries remain central to the big data scheme of things, particularly with regard to next-gen SQL architectures. For software vendors, getting your product rated in a Gartner Magic Quadrant is something to be proud of (or paid for, depending on your disposition towards industry analysts).

R is for R, real-time, recommendations.

The open source language R remains a popular way to build models and conduct statistical work with large data sets. Real-time processing, as opposed to batch processing, is the goal of many big data architects these days. Recommendation systems are a popular big data application in the retail space.

S is for Spark, search, SQL.

Apache Spark has emerged as a powerful data processing engine inside Hadoop, but its use cases are not restricted to being a replacement for MapReduce. Hadoop originated in Doug Cutting’s brain as a better search engine, and search engines remain key components of big data stacks. Structured Query Language (SQL) may be 40 years old, but it remains critical to analytics.

T is for time-series data, topological data analysis, terabytes.

Among the different data types popular in analytics, time-series data remains one of the toughest to ingest and store in a performant way. Topological data analysis is an emerging machine learning ensemble technique that seeks to find structure in data. Some of today’s largest analytic clusters can store on the order of 50 terabytes of data in RAM.

U is for unstructured, unsupervised.

Much of the interesting data exhaust that companies are analyzing today is unstructured, such as pictures, videos, and sound files. Unsupervised learning refers to a class of machine learning algorithms that don’t require specially groomed data sets to extract insights.

V is for 3 Vs, voice data, visualization.

If you’ve followed big data for any length of time, you’ve undoubtedly run across the “three Vs,” or volume, velocity, and variety. Voice data is another data type that’s showing some promise. Without a good visualization, it can be tough to identify or share the content or meaning behind a big data analysis.

W is for wearables, weather data.

Google started the wearables fad with its geeky glasses, but it’s likely not done yet. Lots of companies are trying to figure out how to incorporate weather data and weather predictions into their analysis.

X is for XML, exabytes.

The rise of extensible markup language (XML) is one of the factors that kicked off the big data trend on the Web. Beyond petabytes are exabytes.

Y is for Yottabytes, YarcData.

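A yottabyte (a 1 followed by 24 zeros, in bytes) is often jokingly said to be named after the popular Star Wars character Yoda, though the prefix actually derives from the Greek word for eight. YarcData is a graph database appliance developed by a subsidiary of supercomputer maker Cray.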

Z is for Zookeeper, Zeppelin, Zettabytes.

Who keeps all the big data pieces from getting too unruly? Zookeeper, of course. While old German zeppelins crashed and burned, Apache Zeppelin is gaining traction as a web-based data science notebook. Between exabytes and yottabytes are zettabytes.

Thanks To Alex Woodie