Unification of XML to dict list tree translation #113
Comments
For those who are not familiar with "what is so hard about translating XML to Python structures", this is a primer.
<NodeX>
<subNode1></subNode1>
<subNode2></subNode2>
</NodeX>
These child nodes can have the same name or different names. If the names differ, the children translate to a Python dict; if the names are the same, they translate to a list. The XML above could be translated to:
{"NodeX":
{"subNode1": None,
"subNode2": None}
}
If the children had the same name:
{"NodeX":
{"subNode": [None, None]}
}
Looks easy! No?
<NodeX>
<subNode>Alpha</subNode>
<subNode>Beta</subNode>
</NodeX>
which would translate to:
{"NodeX":
{"subNode": ["Alpha", "Beta"]}
}
But now pay attention: it is absolutely valid to add content alongside the children, like this:
<NodeX>This is
<subNode>Alpha</subNode> and this is
<subNode>Beta</subNode> subnodes
</NodeX>
The question is where (and how) we put the text "This is and this is subnodes". It should be one level below the "NodeX" key in the dict. How should we name the key?
<NodeX>This is
<Content>Alpha</Content> and this is
<Content>Beta</Content> subnodes
</NodeX>
To think that we could use some sensible node name and that it would not collide with a tag name in some OEM XML format would be extremely naive. There is something like Murphy's law here: if there is even a very unlikely possibility of a collision, for some bizarre reason some OEM will implement it.
{"NodeX": {"Content": ["Alpha", "Beta"], "#text": "This is and this is subnodes"}}
But that is not the only complication, because we have looked at only 2 of the 3 ways of storing data in XML. |
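As a rough illustration of where such interleaved text ends up after parsing, here is a minimal sketch using the standard library's xml.etree.ElementTree. ElementTree keeps the text before the first child in `.text` and the text following each child in that child's `.tail`. The helper name `interchild_text` and the whitespace clean-up are illustrative assumptions, not the translator discussed in this issue.

```python
import xml.etree.ElementTree as ET

xml_src = """<NodeX>This is
  <subNode>Alpha</subNode> and this is
  <subNode>Beta</subNode> subnodes
</NodeX>"""


def interchild_text(node):
    """Join the text fragments scattered between the children of `node`.

    Hypothetical helper for illustration only.
    """
    fragments = [node.text or ""]
    fragments.extend(child.tail or "" for child in node)
    # Collapse the whitespace and newlines left over from pretty-printing.
    return " ".join(" ".join(fragments).split())


root = ET.fromstring(xml_src)
result = {
    root.tag: {
        "subNode": [child.text for child in root],
        "#text": interchild_text(root),
    }
}
print(result)
```

Whether the fragments should be concatenated into one string or kept as a list is a design choice; joining them loses the information about which fragment sat between which children.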
Then there is the 3rd way: attributes. I know some proprietary scientific formats which stuff hierarchical data into a flat structure (the XML has a root node and only a single level with a long list of children), where the hierarchy is described only with attributes (in such cases it is beyond human comprehension to resolve the hierarchy in one's head without software). That would look something like this:
<root>
<Instance class="Detector" parent="root" />
<Instance class="Window" parent="Detector" thickness="10" Type="Thin" />
<Instance class="IonGun" parent="root" />
....
</root>
...while it looks messy and is hardly comprehensible for a human, such a flat structure is indeed easily mappable to a Python dictionary. Basically it would look like this:
{"root":
{ "Instance": [
{"class": "Detector", "parent": "root"},
{"class": "Window", "parent": "Detector", "thickness": 10, "Type": "Thin"},
{"class": "IonGun", "parent": "root"},
]
}
}
Making it hierarchical from that point would need some effort.
<dataNode>
<size units="bytes">2048</size>
<createdWith version="1.2.4">SoftwareX</createdWith>
<comment>From garden</comment>
<randomMPStuff title="bunch of taxes" class="taxes">
<title proposed="committee on the new taxation">Thingy</title>
<title proposed="an old man">An old people standing in the water</title>
<title proposed="some street advisor">Holiday snaps</title>
</randomMPStuff>
</dataNode>
First of all, under "dataNode" we see the first problem: Then, if we look to The simplest way to work around this collision problem is to prepend the attribute name with some character which is not valid in an XML tag name but is valid in a Python dictionary key, e.g. "@". Considering those, the string prepended to the attribute name should be customisable by the developer. Default is |
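The prefixing idea described above can be sketched roughly as follows. This is a simplified, hypothetical `to_dict` function written only for illustration: it always prefixes attribute names (with a customisable `attr_prefix`, assumed here to default to "@") and stores node text under an assumed "#text" key. It is not the translator that was merged in #111.

```python
import xml.etree.ElementTree as ET


def to_dict(node, attr_prefix="@"):
    """Translate an element into {tag: content}, prefixing attribute keys.

    Illustrative sketch; names and defaults are assumptions.
    """
    content = {}
    for child in node:
        value = to_dict(child, attr_prefix)[child.tag]
        if child.tag in content:
            # Repeated tag names are gathered into a list.
            if not isinstance(content[child.tag], list):
                content[child.tag] = [content[child.tag]]
            content[child.tag].append(value)
        else:
            content[child.tag] = value
    # The prefix keeps an attribute such as title="..." from
    # overwriting a child tag that is also called <title>.
    for key, val in node.attrib.items():
        content[attr_prefix + key] = val
    text = (node.text or "").strip()
    if not content:
        return {node.tag: text or None}
    if text:
        content["#text"] = text
    return {node.tag: content}


xml_src = """<randomMPStuff title="bunch of taxes" class="taxes">
  <title proposed="committee on the new taxation">Thingy</title>
  <title proposed="an old man">Holiday snaps</title>
</randomMPStuff>"""

result = to_dict(ET.fromstring(xml_src))
print(result)
```

Without the prefix, the `title` attribute and the `title` children would compete for the same dictionary key.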
@sem-geologist, does #111 fix this issue? At least for now? |
@ericpre I want to use this issue as a tracker for the unification of XML handling. #111 is the initial part of what this issue states. I think it will be proper to close this once most of the file readers have adopted it (I plan to review and adapt them one by one) and the dev documentation has been updated on how to use it for new formats with XML (there are dev docs, no?). |
I have updated some of the post above. This is a continuation of the discussion and a demonstration of the current state (which was merged with #111). The XML below is also added in #111 as a test file under
<?xml version="1.0" encoding="UTF-8"?>
<TestXML>
<Header>
<ShortDescription>Test XML</ShortDescription>
<HTMLDescription><![CDATA[This utter <i>nonsense</i> <b>XML</b> was created to check far-fetched ideas about sub-worst case of OEM-generated XML scenarios.]]></HTMLDescription>
</Header>
<Main>
<ClassInstance>
<Detector>
<ClassInstance>
<Angle>15.345</Angle>
<Type>SDD</Type>
<Model>BreakFast™</Model>
<PulseProcessor>FPGAv11</PulseProcessor>
<BufferSize units="bytes">2048</BufferSize>
</ClassInstance>
</Detector>
<Instrument ClassInstance="Analytical">
<Type class="chasis">Toaster</Type>
<SerialNumber>1234-5</SerialNumber>
<Dim axes="width, depth, height">33.3,27.4,25.2</Dim>
<IsCoated/>
<IsToasted>not today</IsToasted>
<IsToasting>affirmative</IsToasting>
</Instrument>
<Sample name="breakfast test" number="23">With one of these components
<Components>
<ComponentChildren>
<Instance name="Eggs" calories="345.2" breaking-speed="5.2"></Instance>
<Instance name="Bacon" calories="5000" breaking-speed="11"></Instance>
<Instance name="Spam" calories="0.1" breaking-speed="24.6"></Instance>
</ComponentChildren>
</Components>
<Project>BreakFast</Project> SDD risks to be Toasted.
</Sample>
</ClassInstance>
</Main>
</TestXML>
The XML is intentionally over-verbose, with many name collisions. The XML-to-dictionary translator also uses literal_eval from Could it be just a function? Actually it could, and that would probably be simpler for simple cases.
<Main>
<DetectorHeader/>
<EPMAHeader/>
<SEMHeader/>
<Data/>
<RedundantColors/>
<UnicornsOnTheScreen/>
<LittleMermaid/>
<SelectionOfGardenBakedClayDwarfStatues/>
</Main>
The first 4 nodes contain useful scientific (meta)data, and the last 4 nodes contain visualization instructions of a kind that is useful only for the OEM software and absolutely useless for anything outside it. |
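The use of literal_eval mentioned earlier in this comment can be sketched as below. This is only a guess at the general pattern (try to parse the text as a Python literal, fall back to the raw string); the function name `eval_text` and the exact set of caught exceptions are assumptions, not the merged code.

```python
from ast import literal_eval


def eval_text(text):
    """Return a Python literal if `text` parses as one, else the string itself.

    Hypothetical helper for illustration only.
    """
    try:
        return literal_eval(text)
    except (ValueError, SyntaxError, TypeError):
        # Not a valid literal (e.g. plain words): keep it as text.
        return text


samples = ["2048", "15.345", "33.3,27.4,25.2", "SDD", "not today"]
converted = [eval_text(s) for s in samples]
print(converted)
```

Numbers come back as int or float, a comma-separated run of numbers comes back as a tuple, and anything that is not a literal stays a string.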
Using the new functionality in action:
import xml.etree.ElementTree as ET
from rsciio.utils.tools import XmlToDict
x2d_translator = XmlToDict()
So making the etree object from ToastedBreakFastSDD.xml and converting it to a Python structure would look like:
toasted_break_fast_sdd_et = ET.fromstring(there_not_shown_loaded_as_python_bytes_or_str_xml)
py_toasted = x2d_translator.dictionarize(toasted_break_fast_sdd_et)
py_toasted will contain the following structure:
{
'TestXML': {
'Header': {
'ShortDescription': 'Test XML',
'HTMLDescription': 'This utter <i>nonsense</i> <b>XML</b> was created to check far-fetched ideas about sub-worst case of OEM-generated XML scenarios.'},
'Main': {
'ClassInstance': {
'Detector': {
'ClassInstance': {
'Angle': 15.345,
'Type': 'SDD',
'Model': 'BreakFast™',
'PulseProcessor': 'FPGAv11',
'BufferSize': {'units': 'bytes', '#value': 2048}
}
},
'Instrument': {
'Type': {'class': 'chasis', '#value': 'Toaster'},
'SerialNumber': '1234-5',
'Dim': {'axes': 'width, depth, height', '#value': (33.3, 27.4, 25.2)},
'IsCoated': None,
'IsToasted': 'not today',
'IsToasting': 'affirmative',
'@ClassInstance': 'Analytical'
},
'Sample': {
'Components': {
'ComponentChildren': {
'Instance': [
{'name': 'Eggs', 'calories': 345.2, 'breaking-speed': 5.2},
{'name': 'Bacon', 'calories': 5000, 'breaking-speed': 11},
{'name': 'Spam', 'calories': 0.1, 'breaking-speed': 24.6}
]
}
},
'Project': 'BreakFast',
'@name': 'breakfast test',
'@number': 23,
'#value': 'With one of these components'
}
}
}
}
}
It is verbose! Accessing the name of the first breakfast component needs this monstrosity:
py_toasted['TestXML']['Main']['ClassInstance']['Sample']['Components']['ComponentChildren']['Instance'][0]['name']
Using Box would hardly make any difference (maybe more pleasant for Java devs... and panoramic (ultra-wide) screen friendly):
boxy_obj = Box(py_toasted)
boxy_obj.TestXML.Main.ClassInstance.Sample.Components.ComponentChildren.Instance[0].name
Now it is time for a reminder: many XML structures used by OEMs are framework-designed, not human-designed. If we want this data to be usable in human-readable form (Box or DataTreeViewer or another helper...), it would be good to shave off some of the artificially made hierarchical scruff. It is indeed easy to do that while initializing the translator:
better_x2d = XmlToDict(
dub_text_str="#val",
interchild_text_parsing='cat',
tags_to_flatten=[
"ClassInstance",
"ComponentChildren",
"Instance"
]
)
py_better_toasted = better_x2d.dictionarize(toasted_break_fast_sdd_et)
That will have a much flatter structure without the redundant programming-framework-injected stuff:
{
'TestXML': {
'Header': {
'ShortDescription': 'Test XML',
'HTMLDescription': 'This utter <i>nonsense</i> <b>XML</b> was created to check far-fetched ideas about sub-worst case of OEM-generated XML scenarios.'
},
'Main': {
'Detector': {
'Angle': 15.345,
'Type': 'SDD',
'Model': 'BreakFast™',
'PulseProcessor': 'FPGAv11',
'BufferSize': {'units': 'bytes', '#val': 2048}
},
'Instrument': {
'Type': {'class': 'chasis', '#val': 'Toaster'},
'SerialNumber': '1234-5',
'Dim': {'axes': 'width, depth, height', '#val': (33.3, 27.4, 25.2)},
'IsCoated': None,
'IsToasted': 'not today',
'IsToasting': 'affirmative',
'@ClassInstance': 'Analytical'
},
'Sample': {
'Components': {
'name': ['Eggs', 'Bacon', 'Spam'],
'calories': [345.2, 5000, 0.1],
'breaking-speed': [5.2, 11, 24.6]
},
'Project': 'BreakFast',
'@name': 'breakfast test',
'@number': 23,
'#interchild_text': 'With one of these componentsSDD risks to be Toasted.'
}
}
}
}
I think everyone will agree that this is better than the previous one.
py_better_toasted['TestXML']['Main']['Sample']['Components']['name'][0]
or in the boxified version:
py_boxy_better.TestXML.Main.Sample.Components.name[0]
Hope this starts to look reasonable. |
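For the long key chains shown above, a small path helper is one possible convenience. The sketch below is not part of the translator; `get_by_path` is a hypothetical name, and the dictionary is a trimmed-down stand-in for the flattened result.

```python
from functools import reduce
from operator import getitem


def get_by_path(tree, path):
    """Follow a sequence of dict keys and list indices into a nested structure."""
    return reduce(getitem, path, tree)


# Trimmed stand-in for the flattened structure shown above.
py_better_toasted = {
    "TestXML": {
        "Main": {
            "Sample": {
                "Components": {"name": ["Eggs", "Bacon", "Spam"]},
            }
        }
    }
}

first_name = get_by_path(
    py_better_toasted,
    ["TestXML", "Main", "Sample", "Components", "name", 0],
)
print(first_name)
```

A path stored as a list can be kept in one place per file format, which makes it easier to adjust when the translator options change the nesting.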
@sem-geologist this is very cool! Admittedly, I attempted to do something similar with #11 but I see now that I was rather naive in my approach. I'll try to do some testing with those XML files later this week but they should be simple enough that this should definitely cover them. |
I did implement something similar for the The
Should I keep this in the filereader or add this to @sem-geologist What is your opinion on this? |
@pietsjoh, to begin with, let me ask: why do you want only the attributes (maybe the problem lies somewhere else)? If it is because of "@" being added to the name, the prefix can be set to an empty string "" when initializing the translator class (provided there is a guarantee of no name clash with child tag names). It is indeed quite simple to add a way to ignore all children altogether. It is probably wise, as you suggested, to add that kind of functionality as a keyword to BTW, how many children does such a node contain? I am asking because I am currently preparing an extension to xml2dict to be able to ignore tags by name. I have a similar (not the same) issue. In Bruker formats, metadata is sometimes intermingled with irrelevant nodes and data at the same level, like this:
<ClassInstance type="Parent">
<UsefulMetadata1>SDDtype3</UsefulMetadata1>
<UsefulMetadata2>
<ClassInstance type="HardwareHeader">
<ShappingTime>0.2</ShappingTime>
<PulserFreq>50000</PulserFreq>
<Channels>4096</Channels>
<Size>2</Size>
</ClassInstance>
</UsefulMetadata2>
<NotRelevantData>
<Count>254</Count>
<C1>0,0,0</C1>
<C2>1,1,1</C2>
<!--up to 254 shades of gray-->
</NotRelevantData>
<UsefulMetadata3>4</UsefulMetadata3>
<UsefulMetadata4>587</UsefulMetadata4>
<UsefulMetadata5>3E-2</UsefulMetadata5>
<InterestingMetadata>hh</InterestingMetadata>
<Data>1,2,3,4,5,6,7,8,9,4,2,21,5</Data>
</ClassInstance>
In the above pseudo-example, my Bruker code has hitherto read and converted UsefulMetadata 1, 3, 4 and 5 one by one, whereas dictionarization is used on UsefulMetadata2. The point is that there are a few nodes which I do not want to dictionarize, for different reasons. I.e. |
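Ignoring tags by name, as discussed above, could be approximated by pruning the element tree before it is translated. The sketch below uses only xml.etree.ElementTree; the function `prune` and its `tags_to_ignore` argument are hypothetical and do not describe the planned xml2dict extension.

```python
import xml.etree.ElementTree as ET


def prune(node, tags_to_ignore):
    """Remove, in place, every descendant whose tag is in `tags_to_ignore`."""
    for child in list(node):  # iterate over a copy, since `node` is mutated
        if child.tag in tags_to_ignore:
            node.remove(child)
        else:
            prune(child, tags_to_ignore)


xml_src = """<ClassInstance type="Parent">
  <UsefulMetadata1>SDDtype3</UsefulMetadata1>
  <NotRelevantData><Count>254</Count></NotRelevantData>
  <UsefulMetadata3>4</UsefulMetadata3>
  <Data>1,2,3</Data>
</ClassInstance>"""

root = ET.fromstring(xml_src)
prune(root, {"NotRelevantData", "Data"})
remaining = [child.tag for child in root]
print(remaining)
```

Pruning mutates the tree, so a node such as Data that still needs to be read by other code should be fetched before the call or the tree should be copied first.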
@sem-geologist, essentially my situation looks somewhat like this:
<Document Version="2" Label="Intensity" DataLabel="Counts" InfoSerialized="&lt;?xml version=&quot;1.0&quot; ...">
<NotRelevantMetadata Count="0" />
<Data>
<Frame>1;2;3;4;5</Frame>
</Data>
</Document>
Lots of useful metadata is saved in the attributes of the node
Yes, thinking about it again, this won't be needed as a generalization.
That is exactly what I did in my original version, and I think that is probably the easiest solution for my case. Another option would be to split up the dictionarize() method:
class XmlDict:
    def read_attributes(self, et_node, has_children=False):
        # Prefix attribute names only when child tags could collide with them.
        prefix = self.dub_attr_pre_str if has_children else ""
        return {prefix + key: self.eval(val) for key, val in et_node.attrib.items()}

    def dictionarize(self, et_node):
        d_node = {et_node.tag: {} if et_node.attrib else None}
        children = list(et_node)
        if children:
            ...
        if et_node.attrib:
            d_node[et_node.tag].update(self.read_attributes(et_node, bool(children)))
        if et_node.text:
            ...
That would work in my case. However, in general it would probably be more useful to just ignore the children and make a dictionary out of |
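For the attributes-only case described above, plain ElementTree is already enough. A minimal sketch, assuming the attribute values are kept as strings; `attributes_only` is a hypothetical name and not a method of the translator.

```python
import xml.etree.ElementTree as ET


def attributes_only(node):
    """Return {tag: {attribute: value}}, ignoring the node's text and children."""
    return {node.tag: dict(node.attrib)}


xml_src = """<Document Version="2" Label="Intensity" DataLabel="Counts">
  <NotRelevantMetadata Count="0" />
  <Data><Frame>1;2;3;4;5</Frame></Data>
</Document>"""

doc_attrs = attributes_only(ET.fromstring(xml_src))
print(doc_attrs)
```

Because no child tags are translated, there is nothing for the attribute names to collide with, so no prefix is needed here.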
I have spotted that there is quite a lot of duplicated effort in parsing and translating hierarchical metadata in XML into Pythonic dict and list structures. I will keep this updated. I think the XML translator used in the Bruker API is the most developed one, and I will expand it to take into account the most bizarre XML cases (and XML can be a really unreadable but valid mess).
Progress:
Move from bruker._api to utils.tools.py and expand to work on more cases. (done: move the xml_to_dict and msfiletime to datetime from bruker parser to utils.tools #111)