Parquet

Apache Parquet is a columnar storage file format optimised for analytical workloads. It is widely used in data engineering and data science because it compresses well and delivers fast, column-oriented queries over large datasets.

Key Features

  • Columnar storage: Read only needed columns
  • Compression: Highly efficient (often 10x smaller than CSV)
  • Schema: Self-describing with embedded metadata (see the sketch after this list)
  • Predicate pushdown: Skip irrelevant data during reads
  • Nested data: Support for complex structures
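
Because the schema and per-row-group statistics live in the file footer, they can be inspected without scanning any data. A minimal sketch with pyarrow (the filename data.parquet is just a placeholder):

import pyarrow.parquet as pq

# Read only the embedded schema and footer metadata; no data pages are scanned
schema = pq.read_schema('data.parquet')
meta = pq.ParquetFile('data.parquet').metadata

print(schema)               # column names and types
print(meta.num_rows)        # row count taken from the footer
print(meta.num_row_groups)  # row groups carry the per-column min/max statistics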

Why Columnar Matters

When querying SELECT AVG(price) FROM sales:

Row format (CSV):  Read entire file
[date, product, price, quantity, customer, ...]

Column format: Read only price column
[price1, price2, price3, ...]

Result: analytical queries like this are often 10-100x faster, because only the bytes for the needed columns are read.
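
The same idea in code, as a small sketch with pyarrow (the filename sales.parquet and the price column are assumptions):

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Only the 'price' column chunks are read from disk; all other columns are skipped
table = pq.read_table('sales.parquet', columns=['price'])
print(pc.mean(table['price']).as_py())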

Common Use Cases

  • Data lakes (S3 + Athena)
  • Data warehouses (Snowflake, BigQuery, Redshift)
  • Machine learning pipelines
  • ETL pipelines
  • Log analytics
  • Time-series data

Reading/Writing Parquet

Python (pandas/pyarrow)

import pandas as pd

# Example data
df = pd.DataFrame({'name': ['a', 'b'], 'price': [1.5, 2.0], 'quantity': [3, 4]})

# Write (Snappy is the default codec for the pyarrow engine)
df.to_parquet('data.parquet', compression='snappy')

# Read the whole file
df = pd.read_parquet('data.parquet')

# Read specific columns only (column pruning)
df = pd.read_parquet('data.parquet', columns=['name', 'price'])
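
Predicate pushdown is available through the same API when the pyarrow engine is used: row groups whose min/max statistics cannot match the filter are skipped. A sketch (the price column and threshold are assumptions):

# Filter rows while reading; row groups that cannot contain price > 100 are skipped
df = pd.read_parquet('data.parquet', filters=[('price', '>', 100)])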

JavaScript (parquet-wasm)

import { readParquet } from 'parquet-wasm';

// buffer is the raw Parquet file as a Uint8Array (here fetched over HTTP)
const buffer = new Uint8Array(await (await fetch('data.parquet')).arrayBuffer());
const data = await readParquet(buffer);

Parquet vs Other Formats

Format  | Type     | Compression | Query Speed      | Use Case
--------|----------|-------------|------------------|--------------
Parquet | Columnar | Excellent   | Fast (analytics) | Data lakes
ORC     | Columnar | Excellent   | Fast (Hive)      | Hadoop
CSV     | Row      | None        | Slow             | Data exchange
JSON    | Row      | None        | Slow             | APIs
Avro    | Row      | Good        | Fast (streaming) | Kafka

What We Like

  • Query performance: Orders of magnitude faster for analytics
  • Compression: Dramatically smaller files
  • Ecosystem: Supported everywhere (Spark, pandas, Athena, etc.)
  • Schema evolution: Add columns without rewriting
  • Partitioning: Natural fit for date-partitioned data

What We Don't Like

  • Not human-readable: Binary format requires tools
  • Write overhead: Slower to write than CSV
  • Small files: Overhead not worth it for tiny datasets
  • Row-level operations: Poor for single-row lookups

Best Practices

  1. Use for analytics: Not for transactional workloads
  2. Partition wisely: By date, region, or common filters (see the sketch after this list)
  3. Right-size files: 128 MB - 1 GB per file is ideal
  4. Use Snappy compression: Good balance of speed and ratio
  5. Include schema: Always embed metadata
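
A minimal sketch of a partitioned, Snappy-compressed write with pandas/pyarrow (the sales_partitioned/ directory and the year and month columns are assumptions):

import pandas as pd

df = pd.DataFrame({
    'year': [2024, 2024, 2025],
    'month': [1, 2, 1],
    'price': [9.99, 19.99, 4.99],
})

# Writes Hive-style directories such as sales_partitioned/year=2024/month=1/...
# so engines like Spark or Athena can prune partitions on filtered queries
df.to_parquet('sales_partitioned/', partition_cols=['year', 'month'], compression='snappy')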

AWS Integration

S3 (Parquet files) → Glue Data Catalog (schema) → Athena (SQL queries)

This combination provides a serverless, pay-per-query analytics platform.
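
As a sketch, pandas can write Parquet straight to S3 (this needs the s3fs package installed; the bucket name and layout below are assumptions), after which a Glue crawler or a CREATE EXTERNAL TABLE statement registers the schema and Athena queries the files in place:

import pandas as pd

df = pd.DataFrame({'event_date': ['2024-01-01'], 'price': [9.99]})

# Writes partitioned Parquet objects under s3://my-bucket/sales/
df.to_parquet('s3://my-bucket/sales/', partition_cols=['event_date'], compression='snappy')

# Once the table is registered in the Glue Data Catalog, Athena can run e.g.:
#   SELECT AVG(price) FROM sales WHERE event_date = '2024-01-01';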