Base64 encode compaction input file list #5229
The input file list is serialized as JSON in the compaction metadata object. When the compaction metadata object was itself serialized into the metadata, it too was serialized as JSON. This had the effect of escaping all of the double quotes in the input file JSON object, because it was JSON in a String inside another JSON object. Closes apache#5060
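To make the escaping problem concrete, here is a minimal sketch that uses only the JDK. The `quote` helper is a hypothetical stand-in for what a JSON library does when it writes a string value; it is not Accumulo or Gson code, and the path is made up. It shows why JSON stored inside a JSON string picks up backslashes and why a Base64 value does not.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class NestedJsonDemo {

  // Hypothetical stand-in for a JSON writer emitting a string value:
  // wraps the text in double quotes and escapes backslashes and quotes.
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // Embeds an already serialized JSON document as a plain string field,
  // so every inner double quote has to be escaped.
  static String embedAsString(String innerJson) {
    return "{\"inputs\":" + quote(innerJson) + "}";
  }

  // Embeds the same document Base64 encoded. The Base64 alphabet contains
  // no quotes or backslashes, so nothing needs escaping.
  static String embedAsBase64(String innerJson) {
    String b64 = Base64.getEncoder().encodeToString(innerJson.getBytes(StandardCharsets.UTF_8));
    return "{\"inputs\":" + quote(b64) + "}";
  }

  public static void main(String[] args) {
    String inner = "[{\"path\":\"hdfs://nn/f1.rf\"}]";
    System.out.println(embedAsString(inner));
    System.out.println(embedAsBase64(inner));
  }
}
```

The Base64 form avoids the escaping but is no longer human readable, which is the trade-off the review comments pick up on.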
@@ -96,7 +101,7 @@ public FateId getFateId() {
   // This class is used to serialize and deserialize this class using GSon. Any changes to this
   // class must consider persisted data.
   private static class GSonData {
-    List<String> inputs;
+    String inputs;
I wonder if we could make this inputs field a JsonElement, like this SO post describes, to avoid base64 and avoid escaping.
That could be an alternative, but instead of the json object example from the post (which looks like a map), we would need to use a json array of strings.
I think this is the way. In 4be28a1 I created an object to store the path, start, and end rows. The input files are now an array of these objects and the double-quotes in the output are not escaped. However, they are still escaped for the intermediate output file, so we may want to change that as well.
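As a rough illustration of why an array of objects avoids the escaping, the sketch below writes the input files as a JSON array value instead of as a string that holds JSON. It is hand-rolled with the JDK only; the `InputFile` class, its `toJson` methods, and the `quote` helper are hypothetical stand-ins and not the code from the commit.

```java
import java.util.List;
import java.util.stream.Collectors;

class InputFileArrayDemo {

  // Hypothetical stand-in for the per-file object holding path and rows.
  static final class InputFile {
    final String path;
    final String startRow;
    final String endRow;

    InputFile(String path, String startRow, String endRow) {
      this.path = path;
      this.startRow = startRow;
      this.endRow = endRow;
    }

    // Writes this file as a JSON object.
    String toJson() {
      return "{\"path\":" + quote(path) + ",\"startRow\":" + quote(startRow)
          + ",\"endRow\":" + quote(endRow) + "}";
    }
  }

  // Minimal string escaping, standing in for a real JSON writer.
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  // The array is emitted as a JSON value, not wrapped in quotes, so the
  // inner double quotes are part of the structure and are not escaped.
  static String toJson(List<InputFile> files) {
    String array =
        files.stream().map(InputFile::toJson).collect(Collectors.joining(",", "[", "]"));
    return "{\"inputs\":" + array + "}";
  }

  public static void main(String[] args) {
    System.out.println(toJson(List.of(new InputFile("hdfs://nn/f1.rf", "a", "m"))));
  }
}
```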
InputFile i = new InputFile();
i.path = stf.getPath().toString();
Range r = stf.getRange();
i.startRow = r.getStartKey() == null ? "null" : r.getStartKey().toString();
We may not want to pass data through toString, as that could corrupt binary data. It would be nice if we could use the internal class and machinery in StoredTabletFile here, but as @cshannon mentioned, that would break encapsulation. Encapsulation is nice, and I would not want to see StoredTabletFile internals in my IDE when doing completion. We could possibly break encapsulation in a limited way by doing the following.
- Move StoredTabletFile and CompactionMetadata into the same package. They are almost in the same package: one is in o.a.a.c.metadata and the other is in o.a.a.c.metadata.schema.
- Change StoredTabletFile to make TabletFileCqMetadataGson package private and add some package private static methods to convert TabletFileCqMetadataGson<->StoredTabletFile.
- In CompactionMetadata.GSonData use TabletFileCqMetadataGson.
This way most of the Accumulo code cannot see StoredTabletFile.TabletFileCqMetadataGson, and we may still get a nice human readable encoding.
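The concern about binary data can be shown with a small JDK-only sketch. It is a generic illustration of the risk of turning arbitrary row bytes into text, not a trace of what Accumulo's Key.toString does; the sample bytes and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

class BinaryRowDemo {

  // Lossy: bytes that are not valid UTF-8 are replaced with U+FFFD while
  // decoding, so the original bytes cannot be recovered from the text.
  static byte[] roundTripViaString(byte[] row) {
    String text = new String(row, StandardCharsets.UTF_8);
    return text.getBytes(StandardCharsets.UTF_8);
  }

  // Lossless: Base64 maps any byte sequence to ASCII text and back.
  static byte[] roundTripViaBase64(byte[] row) {
    String text = Base64.getEncoder().encodeToString(row);
    return Base64.getDecoder().decode(text);
  }

  public static void main(String[] args) {
    // 0xff and 0xfe never appear in valid UTF-8.
    byte[] row = {(byte) 0xff, (byte) 0xfe, 0x00, 0x41};
    System.out.println(Arrays.equals(row, roundTripViaString(row))); // false
    System.out.println(Arrays.equals(row, roundTripViaBase64(row))); // true
  }
}
```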
path and range are defined in AbstractTabletFile. Can we just move the serialization and deserialization methods, and TabletFileCqMetadataGson there and make them public?
path and range are defined in AbstractTabletFile. Can we just move the serialization and deserialization methods, and TabletFileCqMetadataGson there and make them public?
IMO it would be nice to avoid making them public. I was thinking some package private methods like the following could be added to StoredTabletFile, in addition to making TabletFileCqMetadataGson package private. We could refactor the code in StoredTabletFile to accommodate and/or use these methods.
static StoredTabletFile deserialize(TabletFileCqMetadataGson serialized) {
}
static TabletFileCqMetadataGson serialize(StoredTabletFile storedTabletFile) {
}
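A rough sketch of how such package private methods might look is below. `FileRef` and `FileRefGson` are simplified, hypothetical stand-ins for StoredTabletFile and TabletFileCqMetadataGson; the field names and the Base64 row encoding are assumptions for illustration, not the actual Accumulo internals.

```java
import java.util.Base64;

// Hypothetical, simplified stand-in for StoredTabletFile.
final class FileRef {
  private final String path;
  private final byte[] startRow; // null means no lower bound
  private final byte[] endRow; // null means no upper bound

  FileRef(String path, byte[] startRow, byte[] endRow) {
    this.path = path;
    this.startRow = startRow;
    this.endRow = endRow;
  }

  // Package private data holder that a JSON library could map directly.
  // Rows are kept as Base64 text so binary data survives serialization.
  static class FileRefGson {
    String path;
    String startRow;
    String endRow;
  }

  // Package private: only classes in the same package can convert.
  static FileRefGson serialize(FileRef file) {
    FileRefGson g = new FileRefGson();
    g.path = file.path;
    g.startRow = encode(file.startRow);
    g.endRow = encode(file.endRow);
    return g;
  }

  static FileRef deserialize(FileRefGson g) {
    return new FileRef(g.path, decode(g.startRow), decode(g.endRow));
  }

  private static String encode(byte[] row) {
    return row == null ? null : Base64.getEncoder().encodeToString(row);
  }

  private static byte[] decode(String text) {
    return text == null ? null : Base64.getDecoder().decode(text);
  }

  String getPath() {
    return path;
  }

  byte[] getStartRow() {
    return startRow;
  }

  byte[] getEndRow() {
    return endRow;
  }
}
```

With this shape, a metadata class in the same package could hold a list of `FileRefGson` objects in its own serialized form, while code in other packages would only see the outer type.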
When I first looked into this issue and mentioned that we would break encapsulation, I should have elaborated: I was referring to TabletFileCqMetadataGson being private and StoredTabletFile keeping all of that internal, so I didn't want to expose it.
I agree with @keith-turner: making the serialization code package private rather than public would be a good compromise. TabletFileCqMetadataGson would certainly be the thing to reuse here, because it already handles encoding/decoding the binary ranges correctly, so there is no need to do it again.
One other benefit of making TabletFileCqMetadataGson and its serialization code package private instead of private is that it would be easier to write unit tests if we wanted.
Closing this, will open a different PR with the suggested solution.