Polyglot Persistence

  • The concept of using multiple data storage technologies to store and manage data is called Polyglot persistence.
  • Modern applications have led to the development of different types of databases like SQL, NoSQL and Search. SQL databases have always been the go-to databases but over the years, NoSQL and Search databases have gained a lot of attention in the industry.
  • Cazton experts have more than a decade of experience in creating highly scalable and efficient systems that use multiple database technologies that are complementary to each other. We have helped many Fortune 500 clients, mid-size companies and startups create a fool-proof polyglot persistence strategy.
  • Our team consists of Microsoft Regional Director, Microsoft Most Valuable Professional, Azure and Cosmos DB insiders.

Database technologies have undergone several generations of evolution, right from flat-file systems to relational databases to schemaless databases. Some people might say that traditional relational databases are a thing of the past, but that is not true for all the scenarios. Changing requirements and evolution of the internet has meant that new types of databases have emerged, but most have specific use cases, which makes it difficult to decide on which database should be used when. At the same time, different types of data models have emerged throughout the history of databases but only Relational and NoSQL models have prevailed. While relational data model has been around since the 1960s, it has evolved a lot since then and has stood the test of time. NoSQL data model is relatively new and has proved to be a strong contender for data storage and processing needs of modern applications.

Now let's quickly walk through the different types of databases. Cazton experts have more than a decade of experience in creating highly scalable systems with high performance that are cost-effective. These systems are created using multiple different database technologies that are complementary to each other. This is also called Polyglot Persistence. If you are new to Polyglot Persistence, we highly recommend reading Scalability Patterns. It's a weekend read and is written in such a way that anyone from executives, technical architects/developers to non-technical people can understand it.

Types of Data

  • Structured Data: Data is stored in structured databases and it can be easily analysed and used in many ways depending on the needs of the application. This type of data can be stored in tables with columns where the data type for each column is well-defined. Usually, relational databases are more suited for storing this type of data.
  • Semi-Structured Data: This type of data is partially structured and is typically stored in formats like JSON, XML or CSV and can be easily transferred to structured data stores. This type of data doesn't conform to concrete single schema, which allows developers to develop applications faster. This type of data is more difficult to analyze when compared with structured data. Most NoSQL databases allow users to store this type of data.
  • Unstructured Data: This type of data is most-prevalent in organizations and is mostly stored in textual and multimedia formats. Some of the most common examples are - images, Word documents, presentations. This type of data doesn't conform to any schema and the data can be of any type or in any format. Large organizations usually have a lot of unstructured data and a lot of value in that data. This is the most underutilized type of data in most of the organizations. This is where search databases and machine learning techniques like natural language processing are used.

Types of Databases

Modern applications have led to the development of different types of databases including SQL, NoSQL and Search. While SQL and NoSQL databases are usually generic databases and used to store data, search databases have emerged for a specific use case -- enabling developers to easily integrate search functionalities in their applications. In this article, we are going to cover the different types of databases and try to explain in which use-case the database is most suitable for.

SQL Databases (RDBMS)

SQL or relational database emerged in the 1970s with growing needs of organizations to store and retrieve data more effectively. These databases were aimed at reducing the manual work of storing and organizing the data. A SQL database is a collection of tables, each with its own schema representing the attributes and datatypes for each column. Most SQL databases allow users to query and manipulate data using a set of commands called Data Manipulation Language (DML) and define and manipulate the data schema using a set of commands called Data Definition Language (DDL). DML and DDL are a part of Structured Query Language (SQL).

Modern relational databases provide much more than just storing and retrieving data. With leading databases providing analysis, security, partitioning features by default. Apart from the simplicity and ease of access, most relational databases provide other features such as data integrity and normalization.

Now let's take a look at the most famous SQL Databases in the industry:

  • PostgreSQL: PostgreSQL is a free and open-source object-relational database management system. It offers many features such as - complex queries, foreign keys, triggers, updatable views, transactional integrity and multiversion concurrency control. It can be extended by users by adding new data types, functions, operators, aggregate functions, index methods and procedural languages. PostgreSQL has always been known for its conformance to the SQL standard and with recent versions of the database it has only gotten better. Features such as internationalization, robust authentication and easy administration make PostgreSQL a strong contender for enterprise storage needs. PostgreSQL can be installed on-prem as well as on-cloud on major cloud providers. Learn more.
  • Microsoft SQL Server: Microsoft SQL Server is a relational database management system developed by Microsoft and its latest version of SQL Server is the Microsoft SQL Server 2019 that provides features such as the ability of being used as a data warehouse. The demand for Big Data is increasing over the years, and this has led to integration of Big Data tools such as Spark and HDFS. It provides flexibility to the users to choose their own language and platform. SQL Server provides great emphasis on security and performance. It can also be used as a business intelligence platform to perform analysis and extract key insights from the data. SQL server can be deployed on Kubernetes allowing easy deployments and scalability. It can be used to store data upto a petabyte scale. SQL Server can be installed on-prem and many major cloud platforms provide a cloud version of this database, where Microsoft's Azure SQL has been the industry favorite. Learn more.
  • Oracle: Oracle Database (commonly referred to as Oracle RDBMS or simply as Oracle) is a multi-model database management system produced and marketed by Oracle Corporation. The latest version of Oracle database is available on-prem, on-cloud, or in a hybrid-cloud environment. With features such as automatic indexing, performing analysis on data, a lot of authentication options, Oracle is one of the leaders in enterprise databases. Oracle is currently working on making the database REST aware - the ability to retrieve data by making HTTP calls.
  • MySQL: MySQL is an open source RDBMS based on Structured Query Language (SQL). It is written in C and C++ and is based on Client/Server architecture. Since it is open source, it is free to download. MySQL supports databases having as much as 50 million rows. It allows extension via stored procedures and provides users with the flexibility to choose storage engines from various available options. It also provides inbuilt replication support, full-text indexing and searching.

NoSQL Databases

NoSQL databases, since their emergence, have been a strong contender for data storage and processing requirements of modern applications. NoSQL databases are inherently more scalable than SQL databases as they allow you to scale vertically and horizontally. Whereas most SQL databases are vertically scalable by design. These databases can be deployed as distributed systems, which makes them more reliable and faster while querying for data. There are different types of NoSQL databases available including Key-Value Store, Document Databases, Column-Family Databases, Graph Databases and Full-Text Search Engine Databases. Each type has its advantages and disadvantages and its usage depends on the business requirements. Let's explore this more below.

  • Document Databases: These are schemaless databases that store data in the form of JSON documents. They store data as key-value pairs where document name is the key and the content of the document is the value. Now, the good thing about these stores is that they allow you to form the content of the document in any way you want, thus enabling you to store semi-structured data. Let's quickly take a look at some frequently used document databases.
    • MongoDB: MongoDB is one of the most popular document stores. It provides features like support for ad hoc queries, indexing via primary and secondary indices, aggregation, sharding in-built replication providing high availability and more. MongoDB is available for installation on-prem for free and as a fully managed cloud service called MongoDB Atlas. Learn more.
    • Cosmos DB: Cosmos DB is a globally distributed, horizontally scalable, multi-model database service with a focus on providing strong consistency. Cosmos DB provides many features including bounded staleness, strong consistency, reduced latency, multi-api support, automatic failover and more. Cosmos DB also provides ability to store data as key-value pairs, graphs or column families. Learn more here. And if you are ready to get your hands dirty with CosmosDB, watch this video that focuses on performing CRUD operations using both the Node.js and .NET SDKs.
    • Couchbase: Couchbase, previously known as Membase is an open-source, distributed, multi-model document-oriented database. This database has evolved over the years and many consider it as a competitor to MongoDB. This database offers a lot of features including memory first architecture where it ensures all operations are performed in-memory and it offers memory optimized data management options. It also offers live cluster reconfiguration, elastic scalability due to its Multidimensional Scaling architecture, data replication options like Intra-Cluster replication, Cross Datacenter replication, Index replication and much more. Other features like Big Data and SQL Integration, Unified Programming Model, Full Stack Security and Container & Cloud Deployment ready.
  • Key-Value Pairs: These are really simple databases. They store data in the form of simple key-value pairs. They allow you to retrieve data for a certain key. These are generally used in IoT applications as the data to be stored is generally sensor data which is not complex and speed and reliability of storage is highly important. Following are some systems based on key-value pairs.
    • AWS DynamoDB: Amazon DynamoDB is a key-value and document database that delivers single-digit millisecond performance at any scale. It's a fully managed, multiregion, multimaster, durable database with built-in security, backup and restore, and in-memory caching for internet-scale applications. DynamoDB can handle more than 10 trillion requests per day and can support peaks of more than 20 million requests per second.
    • Redis: Redis is an in-memory data structure database. It implements a distributed, in-memory key-value database with optional durability. Redis supports different kinds of data structures including strings, lists, maps, sets, streams. Redis provides many other features like transactions, pub/sub, Lua scripting and more. It is written in ANSI C and can be deployed in Linux and OS X environments. Learn more.
  • Search Databases: These databases are developed with in-built search capabilities and in no way should be used as the primary source for persisting data. These databases offer great benefits like high performance, fault tolerance, high scalability and more, however these databases are more focussed in making your data easily searchable.
    • Apache Solr: Apache Solr is primarily a powerful search engine technology that is enterprise-ready, fast and highly scalable. It is built on top of Lucene and over the years Solr has become a very mature technology that offers features like full-text search, faceted search, hit highlighting, dynamic clustering, near real-time indexing and much more. Solr at its heart is a search engine that allows developers to develop sophisticated, high-performance search applications with advanced features. Learn more.
    • Elasticsearch: Elasticsearch is an open-source, powerful, scalable, and near real-time search technology. It provides a full-text search, analytics engine and machine learning capabilities. Because of its distributed nature, elasticsearch offers high scalability and fault tolerance. Elasticsearch is being used by major internet giants for their search needs. The parent company, Elastic, has developed a whole stack around Elasticsearch called ELK which stands for Elasticsearch, Logstash, Kibana but there are other new products like Beats and XPack that has joined the league. Learn more.
  • Columnar Databases: These databases allow storing data as column families in which each column is stored as key-value pairs. The rows in these databases can be thought of as multi-dimensional key-value pairs. These databases allow managing petabytes of data and generally provide SQL or SQL like query language which can be used to interact with the data.
    • Apache Cassandra: Apache Cassandra is a distributed, scalable, fault-tolerant NoSQL Database that allows us to tune consistency as per the application's requirements. Cassandra supports MapReduce which is widely used in Big Data analytics. It allows for replication across multiple data centers, which means even if one of the data centers fails, your data would still be available. It provides CQL (Cassandra Query Language) to query/manipulate the data or schema of the data stored in Cassandra. Cassandra nodes talk amongst each other by using a protocol called Gossip protocol. Learn more.
    • Google Cloud Bigtable: Google Cloud Bigtable is a fast, easily-scalable and reliable NoSQL database. As per Google, it can store data up to petabytes of scale and offers sub-10ms latency. Cloud Bigtable is ideal for modern workloads where the data is huge and low-latency is important like IoT, Financial markets etc. Bigtable is designed with machine learning and big data in mind and provides easy integration with open source big data tools like hadoop. However Bigtable is more suitable for developing applications as compared to analyzing data.
    • Google Cloud BigQuery: Google BigQuery is a large scale data warehouse that is designed for quickly extracting information from stored or streaming data. BigQuery provides great emphasis security, speed, scalability, consistency and availability. It provides features such as BigQuery ML - for building and operationalizing ML models, BigQuery BI Engine - for fast interaction with complex datasets using already available BI tools (Data Studio, Looker, Google Sheets etc.), BigQuery GIS - for geospatial analysis. Bigquery is more suitable for analyzing stored or streaming data as compared to developing applications.
  • Graph Databases: These databases represent data as a graph which is a collection of related nodes. Each node represents an entity and is related to other objects. They find significant usage in applications which have complex related data. These relations can be represented as graphs and analyzing them between data is important.
    • Neo4j: Neo4j is an open-source, NoSQL, native graph database and it is the most widely used graph database in the industry. It allows for a flexible data model, real-time data analysis, high availability, scalability and security. It also supports indexing by using Apache Lucene. It provides a REST API to integrate with other applications. Neo4j serves use cases like real time recommendations, graph search, network and IT operations and more.

Databases for Big Data

The term big data is mainly used to describe enormous datasets usually at terabytes or petabytes scale. Compared with traditional datasets, big data typically includes masses of unstructured data that need more real-time analysis. Large organizations have a huge amount of such data getting collected in different stores. As more applications are developed, this data is only going to grow. As our data and data analysis needs have evolved, the data storage and data analysis techniques have evolved as well.

  • Hadoop & HDFS: Apache Hadoop is a highly scalable open-source framework written in Java, which allows processing and storage of terabytes or even petabytes of structured and unstructured complex data (Big Data) across clusters of computers. Its unique storage mechanism over distributed file system (HDFS) maps data wherever it is located on a cluster. The speciality of Hadoop is that it can scale from one server to hundreds of servers and can still perform well. Did you know Hadoop is used as an umbrella term that refers to a lot of different technologies? Did you know Hadoop's architecture is divided into 4 different modules? Do you wish to learn why Hadoop is better than any SQL or NoSQL databases for Big Data problems? Learn more.

In the real world, technology offers sophistication and the world of databases is no different. As our needs for managing data efficiently increases, newer features and newer database technologies will keep evolving. We hope you liked reading this article and feel free to reach out to us in case you have any queries or issues managing your data.

At Cazton, our team of expert Developers, Consultants, Architects, Data Analysts, Data Scientists, DBAs who are awarded Microsoft Most Valuable Professionals and Google Developer Experts understand the changing requirements and demands of the industry. Depending on the type, structure, data model and intended use-case of the data, different types of systems will suit different needs. Our experts are well versed with all the database technologies be it on-prem or cloud and we have the expertise to help you setup your database infrastructure, tune and improve database performance, and help you make the right decision to achieve your business goals. Learn more about our Database Development services.

Need help with Big Data? Don't worry, we have you covered. Our team of Big Data Specialists, Big Data Architects, ML Experts, Data Scientists, Data Analysts & Researchers, Predictive Modelers, Statisticians, Business Intelligence & Security Experts have years of experience working on many different kinds of Big Data requirements. Learn more about our Big Data & Artificial Intelligence services.