Where does Data live (is stored) in 2021

Subhashini Sharma Tripathi
6 min read · Jul 13, 2021

Businesses have websites, apps and sensors that generate big data, and deciding where to store this ever-growing data is critical. The IT infrastructure needs to keep pace with this growth without any dip in digital performance.

Upgrading storage facilities is a costly affair, and technology can become a bottomless pit. Staying with obsolete technology such as legacy disk-based storage systems is not an option either, since these systems do not support the latest front-end technologies. Decisions regarding storage have to consider not only how and where the data is stored but also the speed at which it can be accessed and used in the future.

For example, taking 5–10 minutes to access data is far too slow for a data-driven business that responds in real time to global business opportunities on a 24×7 basis. Amazon, Flipkart, Netflix, Practo and other online, digital businesses need near-real-time response for their customers to have a good experience.

Types of data storage options: There are two primary types of databases that we will explore, relational databases and non-relational databases.

Relational databases: Relational databases work by linking data across different tables using what are called keys. A primary key is a field that serves as a unique identifier for a particular row within a table. This unique identifier is used to locate data for the same customer, product or transaction in different tables. Sometimes you can use mobile numbers or government-issued identification numbers as primary keys for your customer data set. When this primary key is added to a record in another table, it is called a “foreign key” in the associated table. The relationships formed between records across many tables through primary and foreign keys are what allow an RDBMS to run queries on data spread across multiple tables.
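To make the key mechanism concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names and sample rows are assumptions made up for illustration; they are not taken from any particular system.

```python
import sqlite3

# In-memory database; all table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,   -- primary key: unique per row
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (101, 1, 250.0), (102, 1, 100.0), (103, 2, 75.0);
""")

# The join follows each order's foreign key back to the matching primary key.
rows = conn.execute("""
    SELECT c.name, o.order_id, o.amount
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()

for row in rows:
    print(row)
```

The customer's name is stored only once, in the customers table, and the join brings it together with each order at query time.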

Referential integrity is possible in an RDBMS because of the use of primary keys and foreign keys. As a result, you can be confident about the data retrieved through SQL queries running on an RDBMS, which makes this system robust.
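As a rough illustration of how a database enforces this integrity, the sketch below uses Python's sqlite3 module with hypothetical customers and orders tables. Note that SQLite only checks foreign keys once the foreign_keys pragma is switched on; other database systems usually enforce them by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
    )
""")
conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
conn.execute("INSERT INTO orders VALUES (101, 1)")        # valid: customer 1 exists

try:
    conn.execute("INSERT INTO orders VALUES (102, 99)")   # customer 99 does not exist
    rejected = False
except sqlite3.IntegrityError as exc:
    rejected = True
    print("Rejected:", exc)

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print("Orders stored:", order_count)
```

The second insert is refused, so the orders table can never point at a customer that does not exist.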

A non-relational (NoSQL) database also stores data, but typically without the tables, rows, primary keys and foreign keys of the relational model. Instead, non-relational databases use storage models optimised for the specific requirements of different types of big data.
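One common non-relational model is the document model, where everything about an entity is kept together in one record. The sketch below imitates that idea with a plain Python dict and JSON; the field names are hypothetical and no real NoSQL product is involved.

```python
import json

# One customer stored as a single self-contained document (hypothetical fields).
customer_doc = {
    "_id": "cust-1",
    "name": "Asha",
    "orders": [
        {"order_id": 101, "amount": 250.0},
        {"order_id": 102, "amount": 100.0},
    ],
}

# A plain dict stands in for a document store keyed by _id.
store = {customer_doc["_id"]: json.dumps(customer_doc)}

# Reading the customer back needs one lookup and no join.
loaded = json.loads(store["cust-1"])
total_spent = sum(order["amount"] for order in loaded["orders"])
print(loaded["name"], total_spent)
```

The orders live inside the customer record, so there is no separate orders table and no key linking the two.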

Some of the more popular NoSQL databases are MongoDB, Apache Cassandra, Redis, Couchbase and Apache HBase.

This brings us to the question: what is SQL? SQL stands for Structured Query Language. It is used to communicate with databases, especially relational database management systems. SQL statements are used to perform tasks on the database, such as writing data into it or retrieving data from it.
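The basic write and read statements look roughly like this when run from Python through the sqlite3 module. The products table and its values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL)")

# Writing data: INSERT adds rows, UPDATE changes them, DELETE removes them.
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [("A1", 10.0), ("B2", 20.0), ("C3", 30.0)],
)
conn.execute("UPDATE products SET price = ? WHERE sku = ?", (25.0, "B2"))
conn.execute("DELETE FROM products WHERE sku = ?", ("C3",))

# Retrieving data: SELECT reads rows back.
remaining = conn.execute("SELECT sku, price FROM products ORDER BY sku").fetchall()
print(remaining)
```

The `?` placeholders pass values separately from the SQL text, which is the usual way to avoid building statements by string concatenation.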

Some common relational database management systems that use SQL are Oracle, Sybase, Microsoft SQL Server, Access, Ingres, etc.

Source: https://searchdatamanagement.techtarget.com/definition/relational-database

A data scientist needs to know SQL to handle structured data. Structured data is stored in relational databases, so SQL is necessary to query them. To carry out data analytics on data stored in relational databases like Oracle, Microsoft SQL Server and MySQL, we need SQL.

In fact, big data platforms like Hadoop offer SQL-like querying through HiveQL, the query language of Apache Hive.

NoSQL stands for “not only SQL”. Here the data is not split into multiple tables and relationships between tables are not created. There is no need to write and run joins to bring the data together, as is required in an RDBMS. This makes it easier to work with NoSQL in a distributed environment.

NoSQL rose in importance as data outgrew RDBMS structures. The integrity of an RDBMS depends on the data living in a single place. Companies ran into chronic problems once their database requirements outgrew the capacity of single servers: it was very difficult to store relational data coherently on large clusters because there was no efficient way to keep all the indexes synchronized.

The advantage of NoSQL is that it offers an architectural approach with fewer constraints. In general, this makes it easier to break NoSQL data stores apart into smaller parts, but more difficult to query them for complex results.
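Breaking data into smaller parts is often done by hashing each record's key to pick a server. The following is a simplified sketch of that idea, not how any specific NoSQL product does it; the node names and the `node_for`, `put` and `get` helpers are hypothetical.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical server names

def node_for(key: str) -> str:
    """Pick a node from a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

# Each node holds only its own slice of the data.
shards = {node: {} for node in NODES}

def put(key, value):
    shards[node_for(key)][key] = value

def get(key):
    return shards[node_for(key)].get(key)

for i in range(9):
    put(f"user:{i}", {"visits": i})

print({node: len(data) for node, data in shards.items()})
print(get("user:4"))
```

A lookup by key touches exactly one node. A question such as "which users have more than five visits" has to visit every shard, which is the harder-to-query side of the tradeoff.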

Source: https://cdn.educba.com/academy/wp-content/uploads/2019/05/what-is-Nosql-database1.png

Some points that make NoSQL challenging:

  1. Without enforcement of atomicity, a NoSQL database has no way to ensure that data is not duplicated, or that it is collected in the first place.
  2. It’s also difficult to ensure that all relevant records are updated or deleted when modifications are made.
  3. NoSQL databases usually strive for something called “eventual consistency” in their data stores, which isn’t good news if you want accurate results from your query immediately.
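The third point, eventual consistency, can be pictured with a toy model in which writes reach one copy of the data first and the other copies catch up later. The class and method names below are invented for the illustration, and real systems replicate in the background rather than through an explicit `sync()` call.

```python
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Writes land on the primary; replicas catch up when sync() runs."""

    def __init__(self, replica_count=2):
        self.primary = Replica()
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []   # writes not yet copied to the replicas

    def write(self, key, value):
        self.primary.data[key] = value
        self.pending.append((key, value))

    def read_from_replica(self, index, key):
        return self.replicas[index].data.get(key)

    def sync(self):
        # Copy every pending write to every replica, in order.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("stock:item-1", 5)
stale_read = store.read_from_replica(0, "stock:item-1")   # replica not updated yet
store.sync()
fresh_read = store.read_from_replica(0, "stock:item-1")   # replica has caught up
print(stale_read, fresh_read)
```

Between the write and the sync, a query served by a replica does not see the new value, which is why an immediate read may not be accurate.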

Let us now spend some time on the theory behind NoSQL, which is known as the CAP theorem.

It is very important to understand the limitations of NoSQL databases. A distributed NoSQL database cannot provide consistency and high availability together when the network is partitioned. This was first expressed by Eric Brewer in the CAP theorem.

The CAP theorem, also called Brewer’s theorem, states that a database can achieve at most two of three guarantees: Consistency, Availability and Partition Tolerance.

  • Consistency means that all nodes in the network see the same data at the same time.
  • Availability is a guarantee that every request receives a response about whether it was successful or failed. However, it does not guarantee that a read request returns the most recent write. The more users a system can serve, the better its availability.
  • Partition Tolerance is a guarantee that the system continues to operate despite arbitrary message loss or failure of part of the system. In other words, even if there is a network outage in the data centre and some of the computers are unreachable, the system continues to work.

No system can provide more than two of these three guarantees. Since network partitions are unavoidable in a distributed system, the tradeoff is always between consistency and availability.
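The choice between consistency and availability during a partition can be sketched with two nodes, only one of which is reachable. This is a deliberately simplified model with hypothetical names: real consistency-first systems usually require a majority of nodes rather than all of them.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.value = "v1"

class Unavailable(Exception):
    pass

def write(nodes, reachable, new_value, mode):
    """Apply a write while only the `reachable` nodes can be contacted.

    mode="CP": refuse the write unless every node can be updated.
    mode="AP": accept the write on the reachable nodes only.
    """
    if mode == "CP" and len(reachable) < len(nodes):
        raise Unavailable("partition detected, write refused")
    for node in reachable:
        node.value = new_value

# Consistency first: the write is refused, so both nodes still agree.
cp_nodes = [Node("a"), Node("b")]
try:
    write(cp_nodes, reachable=[cp_nodes[0]], new_value="v2", mode="CP")
    cp_outcome = "accepted"
except Unavailable:
    cp_outcome = "refused"
cp_values = [node.value for node in cp_nodes]

# Availability first: the write succeeds, but the nodes now disagree.
ap_nodes = [Node("a"), Node("b")]
write(ap_nodes, reachable=[ap_nodes[0]], new_value="v2", mode="AP")
ap_values = [node.value for node in ap_nodes]

print(cp_outcome, cp_values)
print("accepted", ap_values)
```

In the first case the client gets an error but never sees conflicting data; in the second the client is always served, and the unreachable node has to be brought up to date once the partition heals.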

Source: https://www.researchgate.net/figure/Visualization-of-CAP-theorem_fig2_282679529

As depicted in the Venn diagram, an RDBMS can provide consistency and availability but not partition tolerance, while HBase and Redis provide consistency and partition tolerance. MongoDB, CouchDB, Cassandra and Dynamo favour availability and partition tolerance over consistency. Such databases generally settle for eventual consistency, meaning that after a while all nodes converge on the same data.

To conclude, both SQL and NoSQL are of crucial importance in the data science ecosystem. The fuel of data science is data, so everything starts with proper, well-maintained and easily accessible data. Both SQL and NoSQL are critical players in these processes.

MongoDB is a NoSQL database. NoSQL databases became popular for a few reasons:

1.) Developers are creating applications with masses of new and rapidly changing data types. Traditional SQL databases rely on a predefined schema, so changing the structure of previously stored records is not easy. This is because relational databases were not designed with such rapidly changing data in mind.

2.) The waterfall development cycle, which separates development into distinct phases, is no longer widely used. It has largely been replaced by Agile methods built around short sprints, often weekly, in which developers ship code at the end of every sprint.
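The first point above, about changing data types, can be illustrated by adding a new attribute in both worlds. This is only a sketch: the relational side uses Python's sqlite3 module, the document side is imitated with plain dicts rather than a real MongoDB connection, and the `loyalty_tier` field is a made-up example.

```python
import sqlite3

# Relational side: the table's columns are fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'Asha')")
# A new attribute needs a schema change before any row can carry it.
conn.execute("ALTER TABLE users ADD COLUMN loyalty_tier TEXT")
conn.execute("INSERT INTO users VALUES (2, 'Ravi', 'gold')")
sql_rows = conn.execute("SELECT * FROM users ORDER BY user_id").fetchall()

# Document side: each record carries its own fields, so a new one can simply appear.
documents = [
    {"_id": 1, "name": "Asha"},
    {"_id": 2, "name": "Ravi", "loyalty_tier": "gold"},
]
tiers = [doc.get("loyalty_tier") for doc in documents]

print(sql_rows)
print(tiers)
```

On the relational side the schema change applies to the whole table, and the older row gets an empty value for the new column. On the document side nothing has to be declared, but the application must cope with records that lack the field.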

MongoDB is popular among new developers due to its flexibility and ease of use. Even though it is easy to use, it still provides the capabilities needed to meet the complex requirements of modern applications.

The data science and analytics community needs to stay abreast of these changes in how data is stored, so that data can be retrieved easily and used for real-time business outcomes.

Do follow us on LinkedIn : https://www.linkedin.com/company/pexitics.com

Subhashini Sharma Tripathi

My passion is Intelligence Amplification - using tech and data to make great decisions. Currently, am bullish on Generative AI for Banking solutions.