Row Vs Column based Data Store

Arun Rajeevan
5 min readJan 20, 2024

A very simple example: Below is our normal way of storing a Row based Database. Our brain things in a sequential way were the rows whose Primary keys or shard keys are nearby, will be stored together in a Partition sequentially. So if I query on that particular Primary key or shard key, I will get my data faster.

| ID | Name  | Age |
|----|-------|-----|
| 1 | John | 25 |
| 2 | Jane | 30 |
| 3 | Bob | 25 |
| 4 | Alice | 22 |

In a Columnar Data Store, the storage might be organized like this:

| ID | 1 | 2 | 3 | 4 |
|----|---|---|---|---|
| Name | John | Jane | Bob | Alice |
| Age | 25 | 30 | 25 | 22 |

In this example, the “Age” column values are stored together, allowing for more efficient retrieval when performing queries that involve the “Age” column. This organization is especially beneficial when queries involve aggregations, filtering, or analyzing specific columns.

Columnar storage provides advantages for scenarios where analytical queries often involve working with a subset of columns rather than the entire row. It allows the database engine to read and process only the necessary data, improving query performance for certain types of workloads.

Important Note:
Columnar databases are optimized for analytical queries, reporting, and data warehousing.
You should not use it for transactional or OLTP (Online Transaction Processing) workloads. While they excel in analytical scenarios, they may not be the best choice for heavy data manipulation and frequent updates.

More details on Columnar Data Store:

  1. Storage Format:
    Data is physically stored in columns rather than rows.
    Each column is stored as a separate file or data structure on disk.
    This allows for better compression and encoding schemes tailored to the data characteristics of each column.
  2. Query Processing:
    When a query is issued, the database engine can take advantage of the columnar layout to optimize query execution.
    Since a query often involves only a subset of columns, the database engine can read and process only the relevant columns, improving I/O efficiency. This is particularly beneficial for analytical queries that involve aggregations, filtering, and other operations on specific columns.
  3. Compression and Encoding:
    Columnar storage allows for more effective compression techniques. Since column values tend to have similar data types and characteristics, compression algorithms can be optimized for each column. Additionally, encoding schemes such as dictionary encoding or run-length encoding can be applied to further reduce storage requirements.
  4. Parallel Processing:
    Many columnar databases leverage parallel processing to improve query performance.
    The database engine can process multiple columns or chunks of data in parallel across multiple CPU cores or nodes in a distributed environment.
  5. Projection Pushdown:
    In some cases, columnar databases can perform “projection pushdown,” where only the necessary columns are read from storage during query execution.
    This minimizes the amount of data that needs to be read from disk, contributing to faster query performance.
  6. Aggregation and Vectorization:
    Columnar databases are often designed to efficiently handle aggregations, as they can operate on contiguous blocks of data within a column.
    Vectorized processing, where operations are applied to entire vectors or batches of data at once, is commonly used to enhance performance.

In summary, column stores are designed to optimize query performance for analytical workloads by organizing and processing data in a column-oriented fashion. This storage format and processing approach are well-suited for scenarios where queries involve aggregations and analyses on a subset of columns.

How a query is processed in Column based Database?

Example:

| uid |   Name   | Age | Country | Salary |
|-----|----------|-----|---------|--------|
| 101 | Alice | 25 | USA | 60000 |
| 102 | Bob | 30 | Canada | 75000 |
| 103 | Charlie | 28 | UK | 80000 |
| 104 | David | 35 | Germany | 90000 |
| 105 | Eve | 22 | USA | 55000 |

In a columnar database, the data is physically stored in separate columns. Now, let’s consider a query that retrieves the names and ages of users from the USA:

SELECT Name, Age FROM users WHERE Country = 'USA';
  1. Projection:
    The database engine reads only the “Name” and “Age” columns, skipping unnecessary columns like “Salary.”
  2. Index Usage (if available):
    If there is an index on the “Country” column, the database can quickly locate rows where the country is ‘USA.’
  3. Predicate Pushdown:
    The database engine pushes down the filter condition to the storage layer, filtering out rows where the country is not ‘USA’ before reading data.
  4. Vectorized Processing:
    The database performs vectorized processing on the “Name” and “Age” columns, operating on entire columns at once.
  5. Assembling Results:
    The results are assembled by combining the selected values from the “Name” and “Age” columns based on the filtered rows. The reconstructed result set is then returned to the user.

This process allows the database to efficiently retrieve and process only the necessary data, leveraging the advantages of columnar storage for improved performance. While the logical relationships between “Name” and “Age” are maintained, the physical storage structure is optimized for query execution

More details on Row Store:

  1. Transaction Support:
    Row-based databases are designed to handle transactions efficiently. They provide strong support for ACID properties (Atomicity, Consistency, Isolation, Durability), making them suitable for applications with frequent updates, inserts, and deletes.
  2. Ease of Data Manipulation:
    Operations that involve manipulating entire rows, such as updating individual fields or inserting small amounts of data, are typically more straightforward in row stores.
    The row-based storage format is intuitive and aligns well with the way transactional applications often interact with data.
  3. Simplicity in Data Model:
    The row-oriented storage model is simpler to understand and work with, especially for developers accustomed to traditional relational databases. Data is organized in a way that mirrors the logical structure of a table.
  4. Optimal for OLTP Workloads:
    Row stores are optimized for OLTP workloads, where the emphasis is on processing individual transactions quickly and reliably.
    OLTP workloads often involve many small, frequent transactions, and row stores are designed to handle such scenarios efficiently.
  5. Indexes for Fast Retrieval:
    Row stores often use indexes to facilitate fast retrieval of specific rows based on key values.
    Indexes help optimize query performance, making row-based databases suitable for applications that require quick lookups and point queries.
  6. Consistency in Query Performance:
    For workloads that involve a mix of transactional and analytical queries, row stores can provide consistent performance across a variety of queries.
    While they may not be as specialized for analytics as columnar databases, row stores can handle a broader range of query types without significant performance trade-offs.
  7. Better for Small to Medium-Sized Datasets:
    For datasets that are not excessively large and where the primary workload involves frequent updates and retrievals of individual records, row stores can be a suitable and efficient choice.

--

--