# Parquet
For Parquet support, you must also include the header `<rfl/parquet.hpp>` and link against the Apache Arrow and Apache Parquet libraries. Furthermore, when compiling reflect-cpp, you need to pass `-DREFLECTCPP_PARQUET=ON` to cmake.
Parquet is a columnar storage format optimized for analytical workloads. Unlike most other formats supported by reflect-cpp, Parquet is designed for tabular data and has specific limitations regarding nested structures.
## Reading and writing
Suppose you have a struct like this:
```cpp
struct Person {
  rfl::Rename<"firstName", std::string> first_name;
  rfl::Rename<"lastName", std::string> last_name;
  rfl::Timestamp<"%Y-%m-%d"> birthday;
  unsigned int age;
  rfl::Email email;
};
```
**Important:** Parquet is a tabular format that requires collections of records. You cannot serialize individual structs; you must use containers like `std::vector<Person>`, `std::deque<Person>`, etc.
A collection of `Person` structs can be serialized to a bytes vector like this:
```cpp
const auto people = std::vector<Person>{
    Person{.first_name = "Bart", .birthday = "1987-04-19", .age = 10, .email = "bart@simpson.com"},
    Person{.first_name = "Lisa", .birthday = "1987-04-19", .age = 8, .email = "lisa@simpson.com"}};

const std::vector<char> bytes = rfl::parquet::write(people);
```
You can parse bytes like this:

```cpp
const rfl::Result<std::vector<Person>> result =
    rfl::parquet::read<std::vector<Person>>(bytes);
```
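Since `read` returns an `rfl::Result`, you will usually want to unwrap it explicitly. A minimal sketch using `.value()`, which throws if parsing failed:

```cpp
// .value() throws on error - convenient for quick experiments,
// but prefer handling the Result explicitly in production code.
const std::vector<Person> people2 =
    rfl::parquet::read<std::vector<Person>>(bytes).value();
```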
## Settings and compression
Parquet supports various compression algorithms and chunk sizes. You can configure these using the `Settings` struct:
```cpp
const auto settings = rfl::parquet::Settings{}
                          .with_compression(rfl::parquet::Compression::GZIP)
                          .with_chunksize(1000);

const std::vector<char> bytes = rfl::parquet::write(people, settings);
```
Available compression options include:

- `UNCOMPRESSED` - No compression; fastest read/write, but largest file size
- `SNAPPY` (default) - Fast compression/decompression; a good balance of speed and size
- `GZIP` - Good compression ratio; slower than Snappy, but better compression
- `BROTLI` - Good compression for text data; optimized for web content
- `ZSTD` - Excellent compression ratio; a modern algorithm with good speed
- `LZ4` - Very fast compression/decompression; lower compression ratio
- `LZ4_FRAME` - LZ4 with frame format; better compatibility
- `LZO` - Fast compression; an older algorithm
- `BZ2` - High compression ratio; slower compression/decompression
- `LZ4_HADOOP` - LZ4 optimized for the Hadoop ecosystem
```cpp
// Examples of different compression settings
const auto snappy_settings =
    rfl::parquet::Settings{}.with_compression(rfl::parquet::Compression::SNAPPY);

const auto gzip_settings =
    rfl::parquet::Settings{}.with_compression(rfl::parquet::Compression::GZIP);

const auto zstd_settings =
    rfl::parquet::Settings{}.with_compression(rfl::parquet::Compression::ZSTD);

const auto uncompressed_settings =
    rfl::parquet::Settings{}.with_compression(rfl::parquet::Compression::UNCOMPRESSED);
```
## Loading and saving
You can also load and save to disk using a very similar syntax:
```cpp
const rfl::Result<std::vector<Person>> result =
    rfl::parquet::load<std::vector<Person>>("/path/to/file.parquet");

const auto people = std::vector<Person>{...};
rfl::parquet::save("/path/to/file.parquet", people);
```
With custom settings:

```cpp
const auto settings =
    rfl::parquet::Settings{}.with_compression(rfl::parquet::Compression::GZIP);

rfl::parquet::save("/path/to/file.parquet", people, settings);
```
## Reading from and writing into streams
You can also read from any `std::istream` and write into any `std::ostream`:
```cpp
const rfl::Result<std::vector<Person>> result =
    rfl::parquet::read<std::vector<Person>>(my_istream);

const auto people = std::vector<Person>{...};
rfl::parquet::write(people, my_ostream);
```
With custom settings:

```cpp
const auto settings =
    rfl::parquet::Settings{}.with_compression(rfl::parquet::Compression::GZIP);

rfl::parquet::write(people, my_ostream, settings);
```
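For a concrete example, binary file streams work like any other stream. A minimal sketch, with a made-up file path and opening the streams in binary mode since Parquet is a binary format:

```cpp
#include <fstream>

// Write into a binary output file stream.
std::ofstream out("/path/to/file.parquet", std::ios::binary);
rfl::parquet::write(people, out);
out.close();

// Read back from a binary input file stream.
std::ifstream in("/path/to/file.parquet", std::ios::binary);
const auto result = rfl::parquet::read<std::vector<Person>>(in);
```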
## Field name transformations
Like other formats, Parquet supports field name transformations. You can use processors like `rfl::SnakeCaseToCamelCase`:
```cpp
const auto people = std::vector<Person>{...};
const std::vector<char> bytes =
    rfl::parquet::write<rfl::SnakeCaseToCamelCase>(people);
const auto result =
    rfl::parquet::read<std::vector<Person>, rfl::SnakeCaseToCamelCase>(bytes);
```
This will automatically convert field names from snake_case to camelCase during serialization and deserialization.
## Unsupported processors
The following processors are NOT supported and will cause compilation errors:

- `rfl::AddTagsToVariants` - Cannot be used for tabular data
- `rfl::NoOptionals` - Cannot be used for tabular data
- `rfl::DefaultIfMissing` - Cannot be used for tabular data
- `rfl::NoExtraFields` - Cannot be used for tabular data
- `rfl::NoFieldNames` - Cannot be used for tabular data
```cpp
// ✅ This works
const auto result =
    rfl::parquet::read<std::vector<Person>, rfl::SnakeCaseToCamelCase>(bytes);

// ❌ This will cause compilation errors
const auto result =
    rfl::parquet::read<std::vector<Person>, rfl::AddTagsToVariants>(bytes);
const auto result =
    rfl::parquet::read<std::vector<Person>, rfl::NoOptionals>(bytes);
const auto result =
    rfl::parquet::read<std::vector<Person>, rfl::DefaultIfMissing>(bytes);
```
## Enums and validation
Parquet supports enums and validated types. Enums are stored as strings:
```cpp
enum class FirstName { Bart, Lisa, Maggie, Homer };

struct Person {
  rfl::Rename<"firstName", FirstName> first_name;
  rfl::Rename<"lastName", std::string> last_name;
  rfl::Timestamp<"%Y-%m-%d"> birthday;
  rfl::Validator<unsigned int, rfl::Minimum<0>, rfl::Maximum<130>> age;
  rfl::Email email;
};
```
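Writing such a struct works just like before; the enum field ends up as its string name in the resulting column. A minimal sketch with made-up values:

```cpp
const auto people = std::vector<Person>{
    Person{.first_name = FirstName::Bart,
           .birthday = "1987-04-19",
           .age = 10,  // must satisfy the Minimum/Maximum validators
           .email = "bart@simpson.com"}};

const std::vector<char> bytes = rfl::parquet::write(people);
```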
## No variant types
Parquet does not support variant types like `std::variant`, `rfl::Variant`, or `rfl::TaggedUnion`. These types cannot be serialized to the Parquet format.
```cpp
// ❌ This will NOT work
struct Person {
  std::string first_name;
  std::variant<std::string, int> status;                // std::variant - not supported
  rfl::Variant<std::string, int> type;                  // rfl::Variant - not supported
  rfl::TaggedUnion<"type", std::string, int> category;  // TaggedUnion - not supported
};
```
## Limitations of tabular formats
Parquet, like other tabular formats, has specific limitations that differ from hierarchical formats like JSON or XML:
### No nested objects
Unlike JSON or XML, Parquet cannot directly represent nested objects within a single row. Each field must be a primitive type, enum, or a simple container of primitives.
```cpp
// ✅ This works fine
struct Person {
  std::string first_name;
  std::string last_name;
  unsigned int age;
};
```
```cpp
// ❌ This would NOT work as expected - nested objects are not automatically flattened
struct Address {
  std::string street;
  std::string city;
};

struct Person {
  std::string first_name;
  std::string last_name;
  Address address;  // ❌ This will cause compilation errors
};
```
### Using rfl::Flatten for nested objects
If you need to include nested objects, you can use `rfl::Flatten` to explicitly flatten them:
```cpp
struct Address {
  std::string street;
  std::string city;
};

struct Person {
  std::string first_name;
  std::string last_name;
  rfl::Flatten<Address> address;  // ✅ This will flatten the Address fields
};

// The resulting Parquet file will have columns: first_name, last_name, street, city
```
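A quick sketch of how this looks in practice, with made-up values and assuming `rfl::Flatten` is constructed from the wrapped type:

```cpp
const auto people = std::vector<Person>{
    Person{.first_name = "Bart",
           .last_name = "Simpson",
           .address = Address{.street = "742 Evergreen Terrace",
                              .city = "Springfield"}}};

// Produces the four flat columns listed above.
const std::vector<char> bytes = rfl::parquet::write(people);
```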
### Collections requirement
You must serialize collections, not individual objects:
```cpp
std::vector<Person> people = {...};  // ✅ Correct
Person person = {...};               // ❌ Wrong - must be in a container
```
### No arrays (except bytestrings)
Parquet does not support arrays of any type except for binary data (bytestrings). This includes arrays of primitive types, strings, and objects.
```cpp
// ❌ This will NOT work
struct Person {
  std::string first_name;
  std::vector<std::string> hobbies;  // Array of strings - not supported
  std::vector<int> scores;           // Array of integers - not supported
  std::vector<Address> addresses;    // Array of objects - not supported
};
```
```cpp
// ✅ This works
struct Person {
  std::string first_name;
  std::string last_name;
  std::vector<char> binary_data;  // Binary data - supported as a bytestring
};
```
## Use cases
Parquet is ideal for:

- Data warehousing and analytics
- Large datasets with repeated values
- Integration with big data tools (Spark, Hadoop, etc.)
- Simple, flat data structures with consistent types

Parquet is less suitable for:

- Complex nested data structures
- Data with arrays or variant types
- Frequent schema changes
- Row-oriented access patterns
- Small datasets where the overhead isn't justified
- Data with complex object hierarchies