parser #28

byteface · 2021-08-21T20:18:34Z

I think I'm going to need a few types of parser: a normal one, one that uses Python's built-in one, PEG ones, and ones that can handle XML/SVG/HTML/..., as well as my evolving one. I'd consider importing something light if it could easily output domonic-style pyml. This could do with input from others who are good at this kind of thing, for suggestions on how to improve it.

byteface · 2021-09-03T09:58:06Z

Hmm, I've been modding expatbuilder and it seems to have worked. A decent parseString could be coming quite soon. Can you feel the excitement?

byteface · 2021-09-06T11:28:21Z

some links...
https://www.tutorialspoint.com/python3/python_xml_processing.htm
https://www.computerhope.com/unix/pylibml.htm

byteface · 2021-09-08T16:38:40Z

this looks exciting...
https://github.com/byteface/html5-parser/blob/master/src/html5_parser/dom.py

Given what I just did with expat, I may be able to mod that to generate domonic from huge sites.

byteface · 2021-09-09T10:05:42Z

I managed to mod the file. Easier than I thought...

byteface/html5-parser@fa83bf1

So that appears to work, even with lots of websites. It seems to build trees with domonic.

import requests
import html5_parser  # the patched fork; its 'dom' treebuilder builds domonic nodes

sites = []  # add hostnames here (without the scheme)
for site in sites:
    try:
        r = requests.get("https://" + site)
        some_html = r.content.decode("utf-8")
        root = html5_parser.parse(some_html, treebuilder='dom')  # , return_root=False)
        print(root)
        # print(type(root))  # a domonic Document
        # print([str(el) for el in root.getElementsByTagName("a")])
    except Exception as e:
        print('Failed to dl page', e)

byteface · 2021-09-09T10:23:02Z

So the options are to patch that file after each install, or

pip install git+https://path to my patched version

I need to figure out that path and test again. But it's very promising. It's so fast.

byteface · 2021-09-09T10:23:23Z

https://html5-parser.readthedocs.io/en/latest/

ipfans · 2021-10-08T03:49:01Z

It is a cool toolkit, but is there a way to quickly convert an HTML page to Python code?

byteface · 2021-10-08T07:21:45Z

Hi @ipfans, thanks for the feedback.

There is not yet a perfect way, as I originally only set out to generate HTML. But it is on the roadmap.

Some more complete parsers for HTML/Python will hopefully be ready by v1, which I'd love to get done within 12 months.

We can already get about 75% or more of the way there (but it is dangerous and uses eval).

See codemirror.py in this folder:
https://github.com/byteface/domonic/tree/master/examples/parsing

or via the command line util...
python3 -m domonic -d http://eventual.technology

Also, all tags recently had a secret __pyml__() function added, but it may not recurse and is not fully tested, so it is not documented.

So if you have an existing dom and do:

    mydom.__pyml__()

it might work. A precursory option was also added to the renderer:

render(root, 'test.pyml', 'pyml')

However, for this to work we need a dom that is already parsed.

As people who use minidom know (some may be coming here), it can only parse very strict XML, not HTML. So it seems to work sometimes but fails very easily. That is why the domonic parsers fail, as they leverage the same thing. The failures are usually due to content, not node structure. For example, the default parsers often work fine for HTML strings without content.
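To illustrate the strictness, here is a minimal sketch using only the standard library's minidom (not domonic's own parser). The `try_parse` helper and the sample strings are made up for this example.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def try_parse(markup: str) -> str:
    """Return 'ok' if minidom accepts the markup, else the expat error text."""
    try:
        parseString(markup)
        return "ok"
    except ExpatError as err:
        return f"failed: {err}"


well_formed = "<div><p>hello<br/>world</p></div>"
typical_html = "<div><p>hello<br>world</p></div>"  # unclosed <br> is fine in HTML, not in XML
with_entity = "<p>fish &nbsp; chips</p>"           # &nbsp; is not defined in plain XML

for sample in (well_formed, typical_html, with_entity):
    print(sample, "->", try_parse(sample))
```

Only the first string parses. The other two are ordinary HTML, which is why real pages so often trip up an XML parser.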

I then tried to get around this with a simple parser of my own, but found I wanted to keep expanding on it, and that is what sits at the heart of domonic: an unfinished, regex-based, in-place HTML-to-Python converter.

However it still has errors, and the main issue is that Python wants keyword args last. Therefore you have to not only parse but also swap the nodes around, to put 'content' before attributes such as _class, for example. (This is the only real crux of learning domonic.)
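The reordering step could look roughly like the sketch below. This is not domonic's actual code: `reorder_args` is a hypothetical helper, and it assumes the arguments of one tag call have already been split into separate source strings.

```python
def reorder_args(args):
    """Move keyword-style arguments after positional ones, keeping relative order.

    Hypothetical helper: `args` is a list of source strings for one call.
    """
    def is_keyword(arg):
        arg = arg.strip()
        if arg.startswith("**"):
            return True
        name, sep, rest = arg.partition("=")
        # `name=value` counts as a keyword; `a == b` and string literals do not.
        return bool(sep) and name.strip().isidentifier() and not rest.startswith("=")

    positional = [a for a in args if not is_keyword(a)]
    keyword = [a for a in args if is_keyword(a)]
    return positional + keyword


# Attributes come first in HTML, so a naive conversion puts them first too.
naive = ['_class="intro"', '"Hello"', 'span("world")']
print("p(" + ", ".join(reorder_args(naive)) + ")")
```

The check is deliberately naive (it only looks at the text before the first `=`), so treat it as a starting point rather than a complete solution.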

Anyway, during the investigation I found several ways to parse. Python has a built-in HTML parser too, but you have to use it like a lexer and I've not gotten round to it yet. There are also PEG parsers and some off-the-shelf ones. I also found an html5 c++ one, referenced above. So my long-term goal is to have a good default one out of the box, with the option of picking others.
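For anyone curious what "using it like a lexer" means, here is a rough sketch built on the standard library's `html.parser.HTMLParser`. It is not part of domonic. The `PymlEmitter` class, `html_to_pyml` function and the short `VOID_TAGS` list are all made up for this example, and it does not handle hyphenated attribute names or unclosed tags.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # partial list, for the sketch


class PymlEmitter(HTMLParser):
    """Turn start/end/data events into nested call strings."""

    def __init__(self):
        super().__init__()
        self.stack = [[]]    # one list of rendered children per open element; [0] is the root
        self.open_tags = []  # (tag, attrs) for each element still open

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            self.stack[-1].append(self._render(tag, attrs, []))
            return
        self.open_tags.append((tag, attrs))
        self.stack.append([])

    def handle_endtag(self, tag):
        if not self.open_tags or self.open_tags[-1][0] != tag:
            return  # ignore stray or mismatched close tags in this sketch
        name, attrs = self.open_tags.pop()
        children = self.stack.pop()
        self.stack[-1].append(self._render(name, attrs, children))

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].append(repr(data.strip()))

    @staticmethod
    def _render(tag, attrs, children):
        # Content first, keyword-style attributes last.
        kwargs = [f"_{k}={v!r}" for k, v in attrs]
        return f"{tag}({', '.join(children + kwargs)})"


def html_to_pyml(markup):
    emitter = PymlEmitter()
    emitter.feed(markup)
    emitter.close()
    return ", ".join(emitter.stack[0])


print(html_to_pyml('<div class="box"><p>hello<br>world</p></div>'))
# -> div(p('hello', br(), 'world'), _class='box')
```

Because the parser only emits events, the tree has to be tracked by hand with a stack, which is the extra work the built-in one asks of you.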

For now, if you are brave, domonic's __init__ has a host of methods that are working towards this aim.
After the initial regex parse, which does syntax only, it then passes through a series of self-iterating failures to try to fix syntax issues and swap the parameters into the order Python expects. This currently uses eval to check that the line is valid, so it is dangerous. Hence it is not documented.
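One possible way to check a line without running it is to parse it with the standard library's `ast` module instead of calling eval. This is only a sketch of an alternative, not what domonic currently does, and `is_valid_expression` is a made-up name.

```python
import ast


def is_valid_expression(source: str) -> bool:
    """True if `source` parses as a single Python expression. Nothing is executed."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


print(is_valid_expression('div("hello", _class="intro")'))  # content first: parses
print(is_valid_expression('div(_class="intro", "hello")'))  # keyword first: rejected
```

Note that this only checks syntax. It won't tell you whether `div` exists or whether the call would succeed, but it also can't run anything harmful from a downloaded page.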

By using these tools you can get 75% of the way there for some huge files, then manually modify and edit them to work, by rendering them and fixing the syntax issues pointed out when trying to compile. (There's a guide in the readme for common errors that can help speed this up.)

My biggest success was using the hacked html5 c++ parser mentioned above and then calling __pyml__() on the dom it produces. However there are still issues compared to my existing parser (which isn't too bad in some cases).

For example, the c++ one does not yet convert data-attributes to the keyword-argument syntax. It doesn't do this...

**{'_data-tag':'somevalue'}

automatically for you.
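A data-attribute fixer along those lines might look like this sketch. `attrs_to_kwargs` is a hypothetical helper, not something in domonic, and it assumes the attributes arrive as a plain dict of names to string values.

```python
def attrs_to_kwargs(attrs):
    """Render an attribute dict as keyword-argument source text.

    Names that are valid identifiers once prefixed with '_' become
    `_name='value'`; the rest (e.g. data-*) go into a trailing `**{...}`.
    """
    plain, awkward = [], {}
    for name, value in attrs.items():
        key = "_" + name
        if key.isidentifier():
            plain.append(f"{key}={value!r}")
        else:
            awkward[key] = value
    if awkward:
        plain.append("**" + repr(awkward))
    return ", ".join(plain)


print(attrs_to_kwargs({"class": "card", "data-tag": "somevalue"}))
# -> _class='card', **{'_data-tag': 'somevalue'}
```

Collecting the hyphenated names into a single dict keeps them last in the call, which also satisfies the keyword-args-last rule.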

So I haven't released any further documentation until I come back to investigate parsing, or get help.

Anyway, I hope these tips assist you while I'm still figuring it all out, and maybe you might like the codemirror.py example.

Once done, you may also enjoy this plugin, which will format it for you. It's useful for formatting flat .pyml in vscode:

https://marketplace.visualstudio.com/items?itemName=mgesbert.indent-nested-dictionary

Also, as a final note: if you don't want it all in domonic because templating some parts is laborious, you can mix in your own f-strings. See the DocumentFragment example here...

https://github.com/byteface/htmxtest/blob/master/app.py

byteface · 2021-10-08T11:33:28Z

To explain a little deeper, and cover future progress, as the parser stuff is undocumented:

domonic originally had a simple regex parser, for tags only, no content.

That grew, and domonic currently uses it (you then need to eval the result if you want to auto-fix it up):
domonic.parse

But it can also use a copy of the built-in minidom parseString. This auto-fails with single-char replacement, so it could take forever to generate a working doc if the XML is not perfect. :/ I achieved that by hacking the built-in expat parser to use domonic rather than minidom. However, that needs replacing with an html5 parser.

So the c++ one I knocked up to prove the concept and check compatibility, but it is not ideal as it is not pure Python and needs extra setup steps on Windows. So it will be a later 'option'.

I need to write a pure Python one, using the built-in if possible.

There's a new window class that will eventually let you do

window.location = x

On my own fork I swapped out the parseString method it uses to get the c++ one working. So if you need a quick fix you can do something like that. To help with this I've been moving some of the parse methods I discovered into a new utility parse package. So if you want to play, you can try hooking the data-attribute fixer up to the hacked c++ parser, and bingo.

However, I'm probably at least several months away from the full solution, as I need to start a whole new one or find a compatible lib that can build with my dom as an option, rather than hacking it like I did with expat. Only then can I get back to my regex curiosity.

Also, for compatibility, 'html' must not BE the document. So a slight re-architecture of the dom is needed, without breaking current usage. I'm in the process of considering that, and it should help with other dom builders. To understand what I'm talking about, diff the native expat parser against my 'borrowed' one and you will see.

ipfans · 2021-10-08T14:49:43Z

Thanks for your replies. I made a just-works version of the conversion :) But official support is good news.

byteface · 2021-10-27T21:11:45Z

html5lib now has an integration point.

An example exists in the /examples/parsers/html5libtest...

and notes on the release. https://github.com/byteface/domonic/releases/tag/0.6.5

byteface · 2022-01-24T23:57:01Z

I've included html5lib, and an integration point for the c++ one.

import html5_parser
from domonic.ext.html5_parser_ import parse
root = parse(some_html_string, treebuilder='domonic')

Though that one is still experimental and needs testing.

byteface added the "help wanted" label (Extra attention is needed) on Aug 21, 2021
