One Billion Row Challenge
Hey there! Today, let’s talk about a code optimization challenge.
The One Billion Row Challenge (1brc.dev) is a programming challenge designed to test how fast we can process a file containing 1 billion lines of data. Think of it as a fun way to sharpen your skills in performance optimization.
The input is a UTF-8 text file in which each line holds a station name and a temperature, separated by a semicolon. A station can appear many times in the file, like this:
Tokyo;12.8
Marseille;28.1
Philadelphia;10.1
Tokyo;14.8
...
Philadelphia;-4.1
The goal is to write an application that:
Reads the file.
Calculates the min, mean, and max temperature per station.
Writes the results to stdout (sorted by station name) in a specific format.
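To make the three steps concrete, here is a minimal, unoptimized baseline in Python. It is only a sketch: the function names aggregate and format_results are made up for illustration, and the output layout ({name=min/mean/max, ...} with one decimal place) is an assumption based on my reading of the challenge repository, so check the official rules for the exact format and rounding.

```python
from typing import Dict, Iterable, List


def aggregate(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Return {station: [min, max, sum, count]} for 'name;temperature' lines."""
    stats: Dict[str, List[float]] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # tolerate blank lines
        # Split on the last ';' so the temperature is always the final field.
        name, _, raw = line.rpartition(";")
        temp = float(raw)
        entry = stats.get(name)
        if entry is None:
            stats[name] = [temp, temp, temp, 1]
        else:
            if temp < entry[0]:
                entry[0] = temp
            if temp > entry[1]:
                entry[1] = temp
            entry[2] += temp
            entry[3] += 1
    return stats


def format_results(stats: Dict[str, List[float]]) -> str:
    """Render stations sorted by name (output layout is an assumption)."""
    parts = []
    for name in sorted(stats):
        lo, hi, total, count = stats[name]
        parts.append(f"{name}={lo:.1f}/{total / count:.1f}/{hi:.1f}")
    return "{" + ", ".join(parts) + "}"
```

Calling print(format_results(aggregate(open(path, encoding="utf-8")))) with a hypothetical path runs the whole pipeline. On a billion rows this single-threaded loop will be slow, which is exactly what makes it a useful starting point to measure optimizations against.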
The challenge was initially posted on GitHub under gunnarmorling/1brc and was specific to Java. Due to its growing popularity, it has been extended to several other languages:
C/C++
C#
Go
JavaScript
PHP
Python
Rust
Zig
If you have already tackled this challenge, please share your solutions in the comments. If not, this is a great opportunity to delve into code optimization. While the problem statement may sound straightforward, we can learn a lot about optimization techniques we may not yet be familiar with, such as concurrency, branch prediction [1], memory mapping [2], and SIMD [3].
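As a rough illustration of how two of these techniques can combine, the Python sketch below memory-maps the file, splits it into byte ranges that end on line boundaries, and aggregates each range in a separate process. All function names (chunk_bounds, process_chunk, merge, run) are hypothetical, station names are kept as raw bytes to avoid decoding every line, and this is one possible approach under those assumptions rather than how any particular submission works.

```python
import mmap
import os
from concurrent.futures import ProcessPoolExecutor


def chunk_bounds(path, n_chunks):
    """Split the file into (start, end) byte ranges that each end after a newline."""
    size = os.path.getsize(path)
    if size == 0:
        return []  # an empty file cannot be memory-mapped
    bounds = []
    step = max(1, size // n_chunks)
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        while start < size:
            end = min(size, start + step)
            if end < size:
                # Extend the range to the next newline so no line is cut in half.
                nl = mm.find(b"\n", end)
                end = size if nl == -1 else nl + 1
            bounds.append((start, end))
            start = end
    return bounds


def process_chunk(task):
    """Aggregate one byte range; returns {name_bytes: [min, max, sum, count]}."""
    path, start, end = task
    stats = {}
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = start
        while pos < end:
            nl = mm.find(b"\n", pos, end)
            if nl == -1:
                nl = end  # last line without a trailing newline
            line = mm[pos:nl]
            pos = nl + 1
            if not line:
                continue
            name, _, raw = line.rpartition(b";")
            temp = float(raw)  # float() accepts ASCII bytes
            entry = stats.get(name)
            if entry is None:
                stats[name] = [temp, temp, temp, 1]
            else:
                entry[0] = min(entry[0], temp)
                entry[1] = max(entry[1], temp)
                entry[2] += temp
                entry[3] += 1
    return stats


def merge(partials):
    """Combine per-chunk results into one dictionary."""
    total = {}
    for part in partials:
        for name, (lo, hi, s, c) in part.items():
            entry = total.get(name)
            if entry is None:
                total[name] = [lo, hi, s, c]
            else:
                entry[0] = min(entry[0], lo)
                entry[1] = max(entry[1], hi)
                entry[2] += s
                entry[3] += c
    return total


def run(path, workers=4):
    """Process the chunks in parallel worker processes and merge the results."""
    tasks = [(path, s, e) for s, e in chunk_bounds(path, workers)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return merge(pool.map(process_chunk, tasks))
```

When running this as a script, call run(...) under an if __name__ == "__main__": guard, since process pools need it on platforms that spawn workers. Each worker maps the file itself instead of receiving the data, so only the small per-station dictionaries cross process boundaries.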
Tomorrow, we will discuss a distributed systems coding challenge.
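[1] The CPU guesses the outcome of a conditional operation before it is known. The optimization is to write code whose branches are easy to guess (or absent), which minimizes the delays caused by wrong guesses.
[2] A technique that maps files or devices into a process’s virtual address space, so the data can be accessed like memory without first reading everything into a buffer.
[3] Single instruction, multiple data: a technique where a single instruction operates on multiple data elements simultaneously.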