# DRAFT: Parquet 3 metadata with decoupled column metadata (#242)
@@ -81,7 +81,13 @@ more pages.
- Encoding/Compression - Page

## File format

This file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read
together to understand the format.

## Overall file structure

A Parquet file is a collection of binary blocks representing data and metadata.

    4-byte magic number "PAR1"
    <Column 1 Chunk 1 + Column Metadata>

@@ -107,12 +113,97 @@ start locations. More details on what is contained in the metadata can be found
in the Thrift definition.

Metadata is written after the data to allow for single-pass writing.
This is especially useful when writing to backends such as S3.

> It is a requirement for HDFS too; S3 is simply a beneficiary of a design requirement from the outset, with an append-only filesystem being the target from day one.

Readers are expected to first read the file metadata to find all the column
chunks they are interested in. The column chunks should then be read sequentially.

![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)

### Parquet 3

> People who are not involved in the discussion of Parquet 3 might wonder: why suddenly Parquet '3'?

Parquet 3 files have the following overall structure:

```
4-byte magic number "PAR1"
4-byte magic number "PAR3"

<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>

<File-level Column 1 Metadata v3>
...
<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"

File Metadata
4-byte length in bytes of File Metadata (little endian)
4-byte magic number "PAR1"
```

> Wouldn't […]
>
> I've literally copied this from the original snippet above. This proposal doesn't change anything in the layout here, so it would be confusing if I expressed it differently, IMHO.

> If we have file-level column metadata, we can optimize the storage of key_metadata for encrypted columns by keeping the column key_metadata here, instead of in the ColumnChunk structures.
>
> Assuming that each column uses a single key_metadata for all chunks, this sounds like a good idea.

> I think we might want something slightly extended. In particular, I think it is possible we want a digest (e.g. SHA-256) of the v3 header. This can serve two purposes: […] Second, I think we might want to record an offset to the "full footer" containing additional index information, etc.
>
> Lastly, we might want to consider whether compression should be a setting; if we move enough stuff out of Thrift this probably isn't a concern anymore.
>
> I guess we can add a CRC32 here if other people want it.
>
> @emkornfield IIUC the digest is not to protect against corruption but to make sure we do not mistakenly read a V3 footer in a file without one if we happen to see "PAR3" bytes before the V1 footer, correct?
>
> I think it can serve both purposes.
>
> I posit the latter is more important. Reading a valid encoded […] Wherever we store a new metadata footer, its content hash needs to be stored somewhere for verification. A CRC32 of the footer should be good enough.
>
> This is exactly what I'm suggesting; I think it solves both use cases (SHA-1 is probably overkill). Today there is no digest on the footer as far as I know.

> One more thought: potentially having a 16-bit or 32-bit flag map, initially set to zero, if we do want to allow for future iterations on metadata interpretation.
>
> Data point: we added this to the Arrow IPC format and nobody to my knowledge is making use of it.
>
> It seems to be superior to ParquetVersion of parquet-cpp in terms of representing a collection of enabled features.
>
> But what would be a concrete use case for those feature flags?
>
> Test if the reader is compatible and time to upgrade?
>
> Readers can already scan the metadata and see if they recognize all encodings, etc.
>
> There is a lot of current discussion on evolution; some of these might come to fruition, some of them won't. As a concrete example, if we have a bitmap, we no longer need different magic footer/header values for compression. Other potential items that we might not want to close off: […]
>
> OK, we can perhaps add 4 reserved bytes just before the magic number. But I would recommend against giving them any specific meaning until necessary.
>
> Sorry, for the concrete use case I meant to say "encryption", which I think we could use immediately?
>
> You're right that encryption would be an immediate use case (as you did in #250, btw).

Unlike the legacy File Metadata, the File Metadata v3 is designed to be lightweight
to decode, regardless of the number of columns in the file. Individual column
metadata can be opportunistically decoded depending on actual needs.

This file structure is backwards-compatible. Parquet 1 readers will read and
decode the legacy File Metadata in the file footer, while Parquet 3 readers
will notice the "PAR3" magic number just before the File Metadata and will
instead read and decode the File Metadata v3.

Comment on lines +162 to +165:

> Efficient I/O can be challenging with this proposal. A reader needs to read the last 8 bytes of the file, then read 8 bytes before the legacy footer, figure out that a v3 footer exists, and then read that. It would be better if the v3 metadata were at the end of the file, right before the 4-byte length + "PAR1".
>
> Can you suggest a layout that preserves compatibility with PAR1 readers?
>
> There are a few options: […]
>
> The goal here is twofold: […] This is why this proposal creates a separate array of structure types: so that PAR3 readers don't have to eagerly decode those pesky columns, while letting PAR1 readers correctly access column information. I don't think either of your two proposals is able to achieve those two goals simultaneously, are they?
>
> I think for 2, Thrift could avoid parsing it, assuming that we still follow the pattern of a nested footer, e.g. […] So the encoding/serialization would be manual, but on decoding, old readers should automatically drop the unknown field (it is possible some Thrift implementations retain unknown fields; I know proto does); i.e. the field ID 10003 should never actually be modeled in the schema. (Note: 0x0000 is the stop field for structs, if I am reading the Thrift spec correctly.) So the trade-offs of doing this approach are: […] Effectively, for doing the operation currently as proposed in V3, the trade-offs are reversed. […]
>
> With fetch latencies of that order, does the decode latency of the existing Thrift payload even matter? I would be interested to see any empirical data that would suggest the current metadata structures are an appreciable bottleneck for use cases involving a catalog. I created a mailing list thread related to this here.
>
> Thanks for the concrete numbers. This gives us more precise insight to work on (though not all Parquet files are read from S3; local reads and writes are a very common use case as well). However, I'm a bit surprised by your numbers, because typical guidance for S3 involves larger reads (i.e. multi-megabyte).
>
> On S3 reads: guidance for S3 involves larger reads because of cost: reads from S3 typically cost an API call, and bandwidth/transfer volume is free (same zone). 8 MB reads will cost half the price of 4 MB reads. Each connection can go up to ~100 MB/sec, so just transferring 4 MB of data takes at least 25 ms. If one is reading large files, doing 100s of 8 MB reads in parallel will saturate any network card, which is typically what engines do. I posit that the vast majority of Parquet-encoded data is in some cloud somewhere (like >90% of all data). Hence working well with object stores (high latency, immutable files) is a requirement for any change. This is also lesson 4 from "An Empirical Evaluation of Columnar Storage Formats".
>
> Yes, it does. With the numbers above in mind: […]
>
> I posit that many PiB of Parquet data live in HDFS, which also has read latencies, though not as bad as S3.
>
> I'm late to this thread and haven't fully read it. I just want to give my +1 that the current proposal would require too many I/O requests and would thus basically be a deal breaker for high-latency storage like S3. We would not use this because of it. Any change that increases the number of data-dependent requests necessary to decode is basically a deal breaker for us, and I guess it is so for a lot of other data lake companies.

### Parquet 3 without legacy metadata

> @wgtmac @gszadovszky Let me know if it is a good idea to even mention this :-)
>
> My 2 cents: I think at some point we want to avoid the duplication.
>
> I agree. There will be a time when people will agree to remove the old metadata. I added the "PAR1" magic number specifically for this :)

It is possible, though not recommended for the time being, to produce Parquet 3
files without the legacy metadata blocks:
```
4-byte magic number "PAR3"

<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>

<File-level Column 1 Metadata v3>
...
<File-level Column N Metadata v3>

File Metadata v3
4-byte length in bytes of File Metadata v3 (little endian)
4-byte magic number "PAR3"
```

> offset + length

> How do encrypted footers work in this case?
>
> Hopefully, the same way they work for PAR1, but I agree this will need to be spelled out more explicitly :-)
>
> Yep, might need some analysis. If we continue using "PARE", can readers know if the file is v1 or v3?
>
> Ah, you're right. I thought the "PARE" footer was in addition to the "PAR1" footer, but apparently it replaces it. Darn.
>
> Would using PAR3 + a bit in a bitmap to represent encryption be a reasonable approach here to reduce the proliferation of magic footer values?
>
> Sure, as long as we can read the bitmap before decrypting the footer.

These files are not backwards compatible with Parquet 1 readers.
Parquet 3 readers should be prepared to read such files, so that the ecosystem
can gradually switch to this new format.
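
Under the same illustrative assumptions as the earlier sketch, a reader prepared for both flavours can first inspect the trailing magic number to decide which layout it is looking at; the function name and return values are made up for the example.

```python
# Minimal sketch: classify a file by its trailing magic number.
import struct

def trailing_footer(f):
    f.seek(0, 2)
    file_size = f.tell()
    f.seek(file_size - 8)
    length, magic = struct.unpack("<i4s", f.read(8))
    if magic == b"PAR3":
        return "parquet3-only", length    # File Metadata v3 ends 8 bytes before EOF
    if magic == b"PAR1":
        return "legacy-or-dual", length   # probe just before the legacy footer for "PAR3"
    if magic == b"PARE":
        return "encrypted-footer", length  # see the Encryption section below
    raise ValueError("not a Parquet file")
```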

## Encryption

Encryption with footer encryption enabled changes the above file structure slightly.
In particular, the "PAR1" magic number is replaced with "PARE".

> TBD (#242 (comment))

See the [encryption specification](Encryption.md) for details.

## Metadata

There are three types of metadata: file metadata, column (chunk) metadata and page
header metadata. All Thrift structures are serialized using the TCompactProtocol.
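
As a small illustration of that serialization (not part of the specification), the sketch below decodes a footer structure with Thrift's compact protocol using the Python bindings; the generated module name `parquet.ttypes` is an assumption based on running `thrift --gen py` on `parquet.thrift`.

```python
# Minimal sketch: deserializing a Thrift struct written with TCompactProtocol.
from thrift.protocol.TCompactProtocol import TCompactProtocol
from thrift.transport.TTransport import TMemoryBuffer

from parquet.ttypes import FileMetaData  # generated class; module name assumed


def decode_file_metadata(serialized: bytes) -> FileMetaData:
    transport = TMemoryBuffer(serialized)
    protocol = TCompactProtocol(transport)
    metadata = FileMetaData()
    metadata.read(protocol)  # generated structs read themselves from the protocol
    return metadata
```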

> The biggest culprit in parsing metadata is `struct Statistics`:
>
>     struct Statistics {
>        /**
>         * DEPRECATED
>         */
>        1: optional binary max;
>        2: optional binary min;
>        /** count of null value in the column */
>        3: optional i64 null_count;
>        /** count of distinct values occurring */
>        4: optional i64 distinct_count;
>        /**
>         * Only one pair of min/max will be populated. For fixed sized types, one of the
>         * minN/maxN variants will be used. Otherwise min_value/max_value is used.
>         */
>        5: optional binary max_value;
>        6: optional binary min_value;
>        7: optional byte max1;
>        8: optional byte min1;
>        9: optional i32 max4;
>        10: optional i32 min4;
>        11: optional i64 max8;
>        12: optional i64 min8;
>     }
>
> Since that structure is deprecated, do we really want to "improve" it, or should we focus our efforts on non-deprecated structures such as […]? More generally, this should probably be discussed in a separate ML thread, for better separation of concerns.
>
> I wasn't aware the whole structure was deprecated. I thought only […]
>
> I agree this is a separate topic. I still want to say that Apache ORC has defined separate statistics for different types: https://github.com/apache/orc-format/blob/6e30e63f5071b616ec5cedaac7b2e0a98ae377c5/src/main/proto/orc/proto/orc_proto.proto#L87-L99. But the dirty thing is that we might still need some sort of encoding for data types which do not have a direct representation in protobuf or thrift, e.g. the decimal type.
>
> Having an encoding shouldn't be a problem as long as the encoding is fast enough to decode (e.g. PLAIN), should it?
>
> Agree, separate topic. I think you still need variable-length types that rely on byte_array.
>
> I agree with you. Defining specific statistics for different data types adds complexity as well.
>
> Would a union work here?
>
> Ahah, sorry. I think you're right actually.
>
> Sigh, an unrelated issue is that currently […]

@@ -467,6 +467,35 @@ struct SchemaElement {
  10: optional LogicalType logicalType
}

struct SchemaElementV3 {

> Antoine, on the question of keeping the implementation simpler, would it pay to not revise this and just reuse the existing one?
>
> +1, these changes are not necessary; we should mark the relevant bits in […]
>
> We can, of course, but deprecated fields will probably never be removed from the format (except perhaps in 20 years).
>
> Are deprecated fields a problem? Assuming a writer writes them, the writer wastes CPU and bytes.
>
> Mandating that readers remove fields from the official Parquet Thrift file sounds like a complicated and error-prone arrangement.
>
> There is no mandate. The official Parquet Thrift will comment them out.
>
> The problem is that, for any given implementation, we'll have either 1+3 (compiled with old version: backwards compatibility but no perf improvement when reading), or 2+4 (compiled with new version: better performance on read, but backwards compatibility is lost). This doesn't satisfy the requirements we're trying to satisfy here.
>
> Oh I see. The problem is old readers and new writers.
>
> I'm OK with no cleanup of deprecated fields. Marking them as semantically/logically required while, from a wire-compatibility perspective, going required -> optional should be forward/backward compatible.

  /** Data type for this field. */
  1: optional Type type;

  /** If type is FIXED_LEN_BYTE_ARRAY, this is the byte length of the values.
   *
   * CHANGED from v1: this must be omitted for other column types.
   */
  2: optional i32 type_length;

  /** repetition of the field. */
  3: optional FieldRepetitionType repetition_type;

  /** Name of the field in the schema */
  4: required string name;

  /** Nested fields. */
  5: optional i32 num_children;

  /** CHANGED from v1: from i32 to i64 */
  6: optional i64 field_id;

> Have you encountered a use case for this? A field ID of 2 billion seems quite high?
>
> Same question!
>
> Same. I don't think this change is needed.
>
> OK. It seemed a bit arbitrary to limit the width of this, but I can undo the change.
>
> I think it is backwards compatible and in the long run it probably isn't a big deal either way. I was just curious if there was a reason.
>
> No concrete reason actually :-)

  /** The logical type of this SchemaElement */
  7: optional LogicalType logicalType

  /** REMOVED from v1: converted_type, scale, precision (obsolete) */
}

/**
 * Encodings supported by Parquet. Not all encodings are valid for all types. These
 * enums are also used to specify the encoding of definition and repetition levels.

@@ -835,6 +864,65 @@ struct ColumnMetaData {
  16: optional SizeStatistics size_statistics;
}

struct ColumnChunkMetaDataV3 {

> Similarly to SchemaElement, we can reuse the existing struct and deprecate the useless fields.
>
> Also, don't we need an index into the list of […]
>
> Unless I'm misunderstanding something, there should be one […]
>
> Having an index referencing a […] (1) is important when schemata are very wide but data is sparse.
>
> Well, ignoring the fact that Parquet is currently not a sparse format, your proposal implies that readers have to do an O(n) search to find a given column?
>
> Why would they need an O(n) search? The index indexes an […]
>
> OK, let's step back because I think I am not following you. Here is the file organization as proposed here: […] I'm not sure what you're proposing exactly when you say "can skip encoding columns that do not have values"? This is currently not possible using Parquet, as at least the corresponding definition levels would be expected (even if they are all 0).
>
> IIUC, finding a column via schema elements today is also O(N), assuming no nesting. I think the difference is that today the first thing implementations do is create an efficient dictionary structure to amortize lookup of further columns. I think if we want fast lookups without building any additional dictionaries in memory, we should consider a new stored index structure (or reconsider how we organize schema elements instead of a straight BFS).

  /** REMOVED from v1: type (redundant with SchemaElementV3) */
  /** REMOVED from v1: encodings (unnecessary and non-trivial to get right) */
  /** REMOVED from v1: path_in_schema (unnecessary and wasteful) */
  /** REMOVED from v1: index_page_offset (unused in practice?) */

  /** Compression codec **/
  1: required CompressionCodec codec

> This could go in the […]
>
> Well, right now it's possible to make it vary. I don't know if any writers make use of this possibility, but I'm not sure there's any harm in keeping it.
>
> Enforcing the same codec for all row groups would prohibit fast merging of row groups from different Parquet files without rewriting chunk data. So I vote for keeping it as is.

  /** Number of values in this column chunk **/
  2: required i64 num_values

  /** total byte size of all uncompressed pages in this column chunk (including the headers) **/
  3: required i64 total_uncompressed_size

  /** total byte size of all compressed, and potentially encrypted, pages
   * in this column chunk (including the headers) **/
  4: required i64 total_compressed_size

  /** Optional key/value metadata for this column chunk.
   ** CHANGED from v1: only use this for chunk-specific metadata, otherwise
   ** use `FileColumnMetadataV3.key_value_metadata`.
   **/
  5: optional list<KeyValue> key_value_metadata

  /** Byte offset from beginning of file to first data page **/
  6: required i64 data_page_offset

  /** Byte offset from the beginning of file to first (only) dictionary page **/
  7: optional i64 dictionary_page_offset

  /** optional statistics for this column chunk */
  8: optional Statistics statistics;

  /** Set of all encodings used for pages in this column chunk.
   * This information can be used to determine if all data pages are
   * dictionary encoded for example **/
  9: optional list<PageEncodingStats> encoding_stats;

  /** Byte offset from beginning of file to Bloom filter data. **/
  10: optional i64 bloom_filter_offset;

  /** Size of Bloom filter data including the serialized header, in bytes.
   * Added in 2.10 so readers may not read this field from old files and
   * it can be obtained after the BloomFilterHeader has been deserialized.
   * Writers should write this field so readers can read the bloom filter
   * in a single I/O.
   */
  11: optional i32 bloom_filter_length;

  /**
   * Optional statistics to help estimate total memory when converted to in-memory
   * representations. The histograms contained in these statistics can
   * also be useful in some cases for more fine-grained nullability/list length
   * filter pushdown.
   */
  12: optional SizeStatistics size_statistics;
}

struct EncryptionWithFooterKey {
}

@@ -885,6 +973,44 @@ struct ColumnChunk {
  9: optional binary encrypted_column_metadata
}

struct ColumnChunkV3 {
  /** File where column data is stored. **/
  1: optional string file_path

> Do we want to keep this concept of having the metadata in a separate file? I did not see it working anywhere.
>
> I think we do support it in PyArrow (see tests here and here), and I think Dask makes use of it. @jorisvandenbossche @mrocklin Am I right?
>
> This should be the summary metadata file, IIUC. cc @rdblue for more context.
>
> (thanks for pinging us)
>
> PyArrow provides write_metadata and parquet_dataset for such use cases. The original PR goes more in depth.
>
> Not really […] The reason for (2) is that the Hadoop cloud-native committer design kept clear of making any changes to the superclass of […]; with a move to table-based formats rather than directory trees, that whole commit process becomes much easier, as well as supporting atomic job commits on a table (including deletes!). And as you note, these formats can include schema info too.
>
> For context, AFAIK the […] Another Parquet dev mailing list thread with some discussion about this: https://lists.apache.org/thread/142yj57c68s2ob5wkrs80xsjoksm7rb7
>
> @adamreeve I see, so the Parquet file is one with all the metadata, and all the data is in files pointed to by this singleton.
>
> Sorry, I couldn't really follow this argument; that sounds like a Hadoop-specific problem. To me a cloud object store means something like S3, and for our use case we're mostly concerned with reducing the number of objects that need to be read to satisfy read queries that filter data and don't need to read all files in a dataset, as we have many concurrent jobs running and adding load on storage.
>
> Much of the discussion there seems to be related to issues users can run into if doing things like overwriting Parquet files or having heterogeneous schemas, which this feature was not designed for. But it sounds like others have also found this feature useful. I think this quote from Patrick Woody matches our experience: "outstandingly useful when you have well laid out data with a sort-order".
>
> Yes, exactly. The […]

  /** Byte offset in file_path to the ColumnChunkMetaDataV3, optionally encrypted
   ** CHANGED from v1: renamed to metadata_file_offset
   **/
  2: required i64 metadata_file_offset

  /** NEW from v1: Byte length in file_path of ColumnChunkMetaDataV3, optionally encrypted
   **/
  3: required i32 metadata_file_length

Comment on lines +985 to +987:

> Thank you :) The naming is a little awkward... I read these as "metadata-file offset" / "metadata-file length". Perhaps instead just "metadata_offset" and "metadata_length"? Aside: just curious, but in the current format how do you make use of the […]
>
> BTW, do we still want to put a copy […]

  /** REMOVED from v1: meta_data, encrypted_column_metadata.
   ** Use encoded_metadata instead.
   **/

  /** NEW from v1: Column metadata for this chunk, duplicated here from file_path.
   ** This is a Thrift-encoded ColumnChunkMetaDataV3, optionally encrypted.
   **/
  4: optional binary encoded_metadata

  /** File offset of ColumnChunk's OffsetIndex **/
  5: optional i64 offset_index_offset

  /** Size of ColumnChunk's OffsetIndex, in bytes **/
  6: optional i32 offset_index_length

  /** File offset of ColumnChunk's ColumnIndex **/
  7: optional i64 column_index_offset

  /** Size of ColumnChunk's ColumnIndex, in bytes **/
  8: optional i32 column_index_length

  /** Crypto metadata of encrypted columns **/
  9: optional ColumnCryptoMetaData crypto_metadata

> Not needed if we keep this in FileColumnMetadataV3.
>
> Agreed, especially if it can't be different for individual chunks.

}

struct RowGroup {
  /** Metadata for each column chunk in this row group.
   * This list must have the same order as the SchemaElement list in FileMetaData.

@@ -914,6 +1040,32 @@ struct RowGroup {
  7: optional i16 ordinal
}

struct RowGroupV3 {
  /** REMOVED from v1: columns.
   * Instead, decode each FileColumnMetadataV3 individually as needed.
   */

  /** Total byte size of all the uncompressed column data in this row group **/
  1: required i64 total_byte_size

> I'm not sure if this field is particularly useful, as it doesn't account for encodings. We might want a different field name with a clearer meaning.
>
> This field was introduced in 2013; I'm not sure anyone still has a memory of the reasons. The author is named Nong Li, without a corresponding GitHub account. @nongli Is it you?
>
> It seems to be used by parquet-mr's […]
>
> I guess it might be used as a proxy for actual data size, but it isn't a good one; maybe we can keep it and add another, better field.
>
> I'm OK with removing this, FTR; I would just like to make sure this isn't actually useful for some unexpected use case.
>
> I believe it was added before that. This commit is just referencing when we created the separate parquet-format repo. It might well be part of the original metadata. The […]

  /** Number of rows in this row group **/
  2: required i64 num_rows

  /** If set, specifies a sort ordering of the rows in this row group. */
  3: optional list<SortingColumn> sorting_columns

> I wonder if this should be per file. I guess the downside is the expense of concatenating files with different sort orders.
>
> I have no idea. Not knowing the precise use cases for this, I would err on the side of safety and not change this.
>
> I was going to make the same comment. I would imagine the point of […]
>
> I think this is kind of the same rationale as the codecs (easily being able to concatenate row groups with different sort orders).
>
> Yes, but it can be worked around by simply dropping sort-order metadata if row groups have inconsistent values. This is different from codec.
>
> That is fair, but why lose the information per row group? It seems some engines could make use of this, although it is probably superseded by page indexes.

  /** REMOVED from v1: file_offset.
   * Use the OffsetIndex for each column instead.
   */

> I believe this is useful when we want to estimate the whole read range but do not want to read the offset index.
>
> So my question would be: how do you estimate the read range if you only have the file offset, but not the on-disk length? Using […]
>
> total_compressed_size should give the range, I think.
>
> IIUC, the current spec does not prohibit placing column chunks in random order. So we must use […]

  /** Total byte size of all compressed (and potentially encrypted) column data
   * in this row group **/
  4: optional i64 total_compressed_size

  /** Row group ordinal in the file **/
  5: optional i16 ordinal
}

/** Empty struct to signal the order defined by the physical or logical type */
struct TypeDefinedOrder {}

@@ -1165,6 +1317,62 @@ struct FileMetaData {
  9: optional binary footer_signing_key_metadata
}

/** Metadata for a column in this file. */
struct FileColumnMetadataV3 {
  /** All column chunks in this file (one per row group) **/

> Add "in this column" in the comment? Also, rename the field to `chunks`:
>
>     /** Metadata for a column in this file. */
>     struct FileColumnMetadataV3 {
>       /** All column chunks in this column (in this file - one per row group) **/
>       1: required list<ColumnChunkV3> chunks

  1: required list<ColumnChunkV3> columns

  /** Sort order used for the Statistics min_value and max_value fields
   **/
  2: optional ColumnOrder column_order;

  /** NEW from v1: Optional key/value metadata for this column at the file level
   **/
  3: optional list<KeyValue> key_value_metadata

> We might want to model this as two data pages (key and value) or follow the suggestion above of simply using a page to individually store the encodings.
>
> This is file-level metadata. Do we have reasons to believe there will be enough key-values to warrant the complexity of using dedicated data pages for this? The more we deviate from Parquet 1 file organization, the more work it will create for implementors and the more potential for incompatibilities and bugs. We should perhaps ask on the ML for opinions... Edit: started a discussion on https://lists.apache.org/thread/9z4o0zbz34lt5jtllzfrr4gmxczddqyb
>
> It is mostly due to the fact that I think we want to follow a policy of "lazy" decoding as much as possible.
>
> I'm not convinced that "lazy as much as possible" is that desirable. "Lazy where necessary" sounds more reasonable to me. Hence my question about typical metadata sizes.
>
> Putting all user-defined metadata in a list is subject to limitations from Thrift. That's why we have to care about its size. Switching it to a separate "page" or something may unblock other potentials, like putting an extra user-defined secondary index there. For now, we can only choose a "black hole" somewhere in the file and put its offset/length pair into the […]
>
> OK, but is it a practical concern? In Parquet C++ we have:
>
>     constexpr int32_t kDefaultThriftStringSizeLimit = 100 * 1000 * 1000;
>     // Structs in the thrift definition are relatively large (at least 300 bytes).
>     // This limits total memory to the same order of magnitude as
>     // kDefaultStringSizeLimit.
>     constexpr int32_t kDefaultThriftContainerSizeLimit = 1000 * 1000;
>
> Well, you could also have a special-named column with 1 defined BYTE_ARRAY value for the piece of metadata you care about (or you could also model it more finely using Parquet types).

}

> We can add the column encryption key_metadata here:
>
>     /** Crypto metadata of encrypted column **/
>     4: optional ColumnCryptoMetaData crypto_metadata

struct FileMetaDataV3 {
  /** Version of this file **/
  1: required i32 version

  /** Parquet schema for this file **/
  2: required list<SchemaElementV3> schema;

> Lists cannot really be decoded in a random-access manner. I suggest that in V3 we should consider moving any list elements to a Page with type byte_array, where each element is a serialized struct (Thrift, or something else if we choose to move away from it).
>
> For heavily nested lists we might want to separate type-specific fields.
>
> Does it matter here?
>
> No, not here. I think I need to do a more formal audit to see what makes sense.
>
> Skip lists do improve list random I/O performance, or some variant where, as the element list is built up, an offset to a later list element is added somehow. But I don't know of any easy way to do that here.
>
> My concrete suggestion is to introduce a page encoding that supports random access: #250, which IIUC is similar to the approach described here for columns but allows the solution to be more easily generalized.
>
> If you build random access at the encoding level, then what about the compression step?

  /** Number of rows in this file **/
  3: required i64 num_rows

  /** Row groups in this file **/
  4: required list<RowGroupV3> row_groups

> As per the discussion above, we would really like to move away from using a list for the row_groups so that individual row groups can be read in a random-access way. That is, we don't have to read data about all the row groups if we just want a single row group from the Parquet file.
>
> Perhaps that's true, but is this something an engine would do other than when dealing with a sampling read or a limit query?
>
> Note that […]
>
> I think yes? An engine might dispatch individual row groups to different workers if there is external metadata that supports knowing the number of row groups in a file beforehand. Again, given the reduced size of row groups, this might be too much of a micro-optimization.

  /** Optional key/value metadata for this file. **/
  5: optional list<KeyValue> key_value_metadata

  /** String for application that wrote this file. **/
  6: optional string created_by

  /** NEW from v1: byte offset of FileColumnMetadataV3, for each column **/
  7: required list<i64> file_column_metadata_offset;

> Is this supposed to be at the beginning of the column chunk? Or is it part of the footer?
>
> I think having the column information in the footer is probably still useful.
>
> By "in the footer", you mean after all the data? […] Though, of course, this is purely advisory and writers are free to place them elsewhere.

  /** NEW from v1: byte length of FileColumnMetadataV3, for each column **/
  8: required list<i32> file_column_metadata_length;

> Is it too late to add something like the below for all offset/length pairs? […]
>
> If we're looking to speed up Thrift deserialization, I'd bet two int lists are going to be faster to parse than a list of structs. If the metadata objects are contiguous, maybe instead extend the offsets list by one and use deltas to calculate the lengths.
>
> I was going with the same intuition as @etseidl's, though I don't have any precise insight into typical Thrift deserializer performance characteristics. Mandating contiguous column metadata objects and using N+1 offsets is an intriguing idea. It could perhaps allow preloading many column metadata entries at once more easily.
>
> +1 to N+1 offsets. They are going to parse a lot faster (>2x) than structs.

  /** REMOVED from v1: column_orders.
   ** Use `FileColumnMetadataV3.column_order` instead.
   **/

> +1
>
> Updated; I misread where this was being moved to. This seems fine.

  /**
   * Encryption algorithm. This field is set only in encrypted files
   * with plaintext footer. Files with encrypted footer store algorithm id
   * in FileCryptoMetaData structure.
   */
  9: optional EncryptionAlgorithm encryption_algorithm

  /**
   * Retrieval metadata of key used for signing the footer.
   * Used only in encrypted files with plaintext footer.
   */
  10: optional binary footer_signing_key_metadata
}
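
To illustrate how fields 7 and 8 above let a reader decode only the columns it needs, here is a rough sketch; the helper `decode_file_column_metadata_v3` stands in for whatever Thrift deserialization an implementation uses, and the second function shows the N+1-offsets variant suggested in the review thread. Both are assumptions for illustration, not part of the draft.

```python
# Hedged sketch: lazily decoding one column's FileColumnMetadataV3 using the
# offset/length lists from FileMetaDataV3 (field names follow the draft above,
# helper names are made up for illustration).

def read_column_metadata(f, file_meta_v3, column_index):
    """Read and decode FileColumnMetadataV3 for a single column."""
    offset = file_meta_v3.file_column_metadata_offset[column_index]
    length = file_meta_v3.file_column_metadata_length[column_index]
    f.seek(offset)
    raw = f.read(length)
    return decode_file_column_metadata_v3(raw)  # e.g. a Thrift TCompactProtocol decode


# Variant discussed in the review thread: if the per-column metadata blocks are
# required to be contiguous, a single list of N+1 offsets replaces fields 7 and 8,
# and lengths are recovered as deltas.
def column_slice_from_offsets(offsets, column_index):
    start = offsets[column_index]
    end = offsets[column_index + 1]
    return start, end - start
```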

/** Crypto metadata for files with encrypted footer **/
struct FileCryptoMetaData {
  /**

> "MUST", RFC-2119 style, maybe.