How big MNCs like Google, Facebook, Instagram etc stores, manages and manipulate Thousands of Terabytes of data with High Speed and High Efficiency?

Tanumoy Deb
6 min readSep 17, 2020

Here is the answer of this interesting technical querry. Here the topic is concerned with the problem to store huge amout of data. In the technical world the name of this problem is Big Data. Let me explain the Big Data problem in detail :-

Note: To know more about each of the technologies in details that are mentioned in this article, click on each of the names that are underlined.

Big data burst upon the scene in the first decade of the 21st century, and the first organizations to embrace it were online and startup firms. Arguably, firms like Google, eBay, LinkedIn, and Facebook were built around big data from the beginning. They didn’t have to reconcile or integrate big data with more traditional sources of data and the analytics performed upon them, because they didn’t have those traditional forms.

The problem of Big Data can be solved by a concept called Distributed Storage System. Now let’s understand the concept of Distributed Storage System with an example. Suppose you have to store 300GB of data maintaining the high speed. Here the challange is you don’t have such a big storage but your friend have three storage devices of 100Gb each in three separate systems and each one has limited Read/Write speed of about 1GB/min. Now, if somehow you manage to manufacture a storage device of 500GB of same velocity i.e read/write speed, then first of all it will be too much costly and to store 300GB data it will take up to 300 minutes. Now if you connect to the three systems remotely through networking, you don’t need to buy the storage, and the main advantage is you will get much higher speed because the whole 300GB of data will be stripped into 3 parts of 100GB each and parallely it will be transferred to the three systems. This will take only 100 minutes to transfer 300GB. Also it will take the same time to retrieve the whole data from the devices. This will save the cost, volume(storage) and velocity. Your system is connected to other three systems through networking with the help of HDFS protocol or program, this type of connection is called master-slave model or topology. The use of this particular topology is to provide storage. There is one master node and three slave nodes working as a team and the whole team of four nodes is called a cluster. To implement the concept of Distributed Storage System we need a product called Hadoop. It is mainly used to create the cluster.

Master-Slave Topology

Many companies continue to rely on incumbent data warehouses for standard BI and analytics reporting, including regional sales reports, customer dashboards, or credit risk history. In this new environment, the data warehouse can continue with its standard workload, using data from legacy operational systems and storing historical data to provision traditional business intelligence and analytics results. Persistence is done using MySQL, Memcached, Hadoop’s HBase. Memcached is used as a cache for MySQL as well as a general purpose cache. Offline processing is done using Hadoop and Hive. Data such as logging, clicks and feeds transit using Scribe and are aggregating and stored in HDFS using Scribe-HDFS, thus allowing extended analysis using MapReduce. BigPipe is a custom technology to accelerate page rendering using a pipelining logic. Varnish Cache is used for HTTP proxying. It is used for its high performance and speed. The storage of the billions of photos posted by the users is handled by Haystack, an ad-hoc storage solution developed by Facebook which brings low level optimizations and append-only writes. Facebook Messages is using its own architecture which is notably based on infrastructure sharding and dynamic cluster management. Business logic and persistence is encapsulated in so-called ‘Cell’. Each Cell handles a part of users; new Cells can be added as popularity grows. Persistence is achieved using HBase. Facebook Messages’ search engine is built with an inverted index stored in HBase. MySql at Facebook is programmed in such a way that it pretty much drives itself and there is hardly any maintenance. But because data is huge they periodically move their data to Hadoop and use Hive to process that data. Also not to forget that Facebook is a huge contributor to Apache Hive.They use Hbase which is a big data database working on Hadoop for the facebook messaging service. They have modified Hbase to read data efficiently even after a region server goes down to avoid latency. They call this version HydraBase.

Big data sites like FaceBook, Google, Twitter handle enormous amount of data by using a combination of massively paralleled systems (multiple computers working together as a single system).

Here are some of the products or technologies used by the large MNCs to maintain their storage, speed and performance :-

> Apache Hadoop : The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

> Voldemort : Voldemort is a distributed key-value storage system.

> Apache Cassandra : The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.

> Memcached : Free & open source, high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.

> Dremel : Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data.

> Teradata : Teradata Vantage™ is a powerful, modern analytics cloud platform that unifies everything — data lakes, data warehouses, analytics, and new data sources and types. Leading the way with hybrid multi-cloud environments and priced for flexibility, Vantage delivers unlimited intelligence to build the future of your business.

> Oracle Exadata : Oracle Exadata is the best place to run your Oracle Database. It helps lower your costs by up to 40% and expedite digital transformations with on-premises, Cloud at Customer, and Oracle Cloud deployments that offer high identicality and run up to 3X faster than any other solution.

> BigQuery : Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility.

> The Google File System : It is a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.

> Mesa : Mesa is a highly scalable analytic data warehousing system that stores critical measurement data related to Google’s Internet advertising business.

--

--