parser #28

Open
byteface opened this issue Aug 21, 2021 · 12 comments
Labels
help wanted Extra attention is needed

Comments

@byteface
Owner

I think I'm going to need a few types of parser: a normal one, one that uses Python's built-in one, PEG ones, ones that can do xml/svg/html/..., as well as my evolving one. Consider importing something light if it could easily output domonic-style pyml. But this could do with conversations with others who are good at that kind of thing, for suggestions to improve, etc.

@byteface byteface added the help wanted Extra attention is needed label Aug 21, 2021
@byteface
Owner Author

byteface commented Sep 3, 2021

Hmmm, been modding expatbuilder and it seems to have worked. A decent parseString could be coming quite soon. Can you feel the excitement?

@byteface
Owner Author

byteface commented Sep 6, 2021

@byteface
Owner Author

byteface commented Sep 8, 2021

this looks exciting...
https://github.com/byteface/html5-parser/blob/master/src/html5_parser/dom.py

Given what I just did with expat, I may be able to mod that to generate domonic from huge sites?

@byteface
Owner Author

byteface commented Sep 9, 2021

I managed to mod the file. Easier than I thought...

byteface/html5-parser@fa83bf1

So that appears to work, even with lots of websites. It seems to build trees with domonic.

import requests
from html5_parser import parse

sites = []  # add webpages here
for site in sites:
    try:
        r = requests.get("https://" + site)
        some_html = r.content.decode("utf-8")
        # `parse` is imported directly above, so call it without the module prefix
        root = parse(some_html, treebuilder='dom')  # , return_root=False)
        print(root)
        # print(type(root))  # a domonic Document
        # print([str(el) for el in root.getElementsByTagName("a")])
    except Exception as e:
        print('Failed to dl page', e)

@byteface
Owner Author

byteface commented Sep 9, 2021

So the options are to patch that file after each install, or

pip install git+https://path to my patched version

I need to figure out that path and test again. But very promising. It's so fast.

@byteface
Owner Author

byteface commented Sep 9, 2021

@ipfans

ipfans commented Oct 8, 2021

It is a cool toolkit, but is there a way to quickly convert an html page to python code?

@byteface
Owner Author

byteface commented Oct 8, 2021

Hi @ipfans, thanks for the feedback.

There is not yet a perfect way, as I originally only set out to generate html. But it IS on the roadmap.

Some more complete parsers for html/python will hopefully be ready by v1, which I'd love to get done within 12 months.

We can already get about 75% or more of the way (but it is dangerous and uses eval).

See codemirror.py in this folder:
https://github.com/byteface/domonic/tree/master/examples/parsing

or use the command line util:
python3 -m domonic -d http://eventual.technology

Also, all tags recently had a secret __pyml__() function added, but it may not recurse and is not fully tested, so it's not documented.

So if you have an existing dom and do:

    mydom.__pyml__()

it might work. A preliminary option was also added to the renderer:

render(root, 'test.pyml', 'pyml')

However, for this to work we need a dom that's already parsed.

As people who use minidom know (some may be coming here), it can only parse very, very strict XML, not html. So it seems to work sometimes but very easily doesn't. Hence domonic's parsers failing, as they leverage the same thing. They usually fail due to content, not node structure. For example, the default parsers often work fine for html strings without content.
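To make the strictness concrete, here is a minimal sketch using only the standard library's minidom (not domonic's own parser). The helper name `try_parse` and the sample strings are made up for illustration.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return the root tag name if minidom accepts the markup, else None."""
    try:
        return parseString(markup).documentElement.tagName
    except ExpatError:
        return None


# Well-formed XML: every tag is closed, no HTML-only entities.
strict_ok = try_parse("<div><p>hello</p><br/></div>")

# Perfectly normal HTML, but <br> is never closed, so expat rejects it.
void_tag = try_parse("<div><p>hello</p><br></div>")

# The node structure is fine; the *content* uses an entity XML does not define.
entity = try_parse("<div>fish &nbsp; chips</div>")

print(strict_ok, void_tag, entity)
```

The third case is the kind of content-driven failure described above: the tags balance, yet the document is still rejected.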

I then tried to get around this with a simple parser of my own, but found I wanted to keep expanding on it, and that is at the heart of domonic: an unfinished, regex-based, in-place html to python converter.

However, it still has errors, and the main issue is that python wants keyword args last. Therefore you have to not only parse but also swap the nodes around, to put 'content' before _classes for example. (The only real crux of learning domonic.)
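A small sketch of why the swap is needed. The `tag` function below is a hypothetical stand-in written for this example, not domonic's implementation; it only mimics the "children positional, attributes as underscore-prefixed keywords" calling style.

```python
def tag(name, *content, **attrs):
    """Hypothetical stand-in: children are positional, attributes are keywords."""
    attr_text = "".join(f' {key.lstrip("_")}="{value}"' for key, value in attrs.items())
    inner = "".join(str(child) for child in content)
    return f"<{name}{attr_text}>{inner}</{name}>"


def is_valid(source):
    """Compile (but never run) a single expression to see if the syntax is legal."""
    try:
        compile(source, "<pyml>", "eval")
        return True
    except SyntaxError:
        return False


# HTML writes attributes first and content second: <div class="box">hello</div>
# Transcribing that order literally puts a positional arg after a keyword arg.
html_order = 'tag("div", _class="box", "hello")'
python_order = 'tag("div", "hello", _class="box")'

print(is_valid(html_order), is_valid(python_order))
rendered = tag("div", "hello", _class="box")
print(rendered)
```

So a converter cannot just rewrite tags in place; it has to move the content ahead of the attributes for every element.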

Anyway, during investigation I found several ways to parse. Python has a builtin html parser too, but you have to use it like a lexer and I've not gotten round to it yet. There are also PEG parsers and some off-the-shelf ones. I also found an html5 c++ one, referenced above. So my long-term goal would be to have a good default one out of the box, with the option of picking some others.
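For anyone curious what "use it like a lexer" means, here is a rough sketch with the standard library's `html.parser.HTMLParser`. It only emits a flat token stream; turning that into a tree (domonic or otherwise) is left out. The class name `TokenCollector` and the token tuple shape are assumptions made for this example.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collects (kind, value, attrs) tokens; building a tree is up to the caller."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag, {}))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.tokens.append(("data", data, {}))


def tokenize(html):
    collector = TokenCollector()
    collector.feed(html)
    collector.close()
    return collector.tokens


# Unlike minidom, this tolerates the unclosed <br> and the HTML entity.
tokens = tokenize('<div class="box"><p>fish &amp; chips<br></p></div>')
for token in tokens:
    print(token)
```

The parser never complains about the void `<br>`; it simply reports a start tag with no matching end tag, so any tree builder on top has to know which tags are void.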

For now, if you are brave, the domonic __init__ class has a host of methods that are working towards this aim.
After the initial regex parse, which does syntax only, it then passes through a series of self-iterating failures to try and fix syntax issues and swap the parameters into the order python expects. This currently uses eval to check the line is valid, so it is dangerous. Hence not documented.
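As a possible alternative to eval for the validity check (this is a different technique, not what domonic currently does), the standard `ast` module can confirm that a line is syntactically valid without running it. The function name `is_valid_expression` and the sample lines are hypothetical.

```python
import ast


def is_valid_expression(line):
    """True if `line` parses as a single Python expression. Nothing is executed."""
    try:
        ast.parse(line, mode="eval")
        return True
    except SyntaxError:
        return False


candidates = [
    'div("hello", _class="box")',     # valid call
    'div(_class="box", "hello")',     # positional argument after keyword argument
    '__import__("os").getcwd()',      # valid syntax; eval() would actually run it
]

results = [is_valid_expression(line) for line in candidates]
print(results)
```

The last candidate is the reason eval is risky on scraped pages: it is syntactically fine, so a syntax check passes it, but only eval would execute it. A parse-only check catches the argument-order errors without that exposure, though it cannot catch errors that only show up at run time, such as unknown tag names.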

By using these tools you can get 75% of the way there for some huge files, then manually edit them to work, by rendering them and fixing the syntax issues pointed out when trying to compile. (There's a guide in the readme for common errors that can help speed this up.)

My biggest success was using the hacked html5 c++ parser mentioned above and then calling pyml() on the dom it produces. However, there are still issues compared to my existing parser (which isn't too bad in some cases).

i.e. the c++ one does not yet convert data-attributes to the keyword argument syntax.

It doesn't do this for you automatically:
**{'_data-tag':'somevalue'}
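A rough sketch of what such a data-attribute fixer could look like. The helper `attrs_to_kwargs_source` is hypothetical and not part of domonic; it assumes the underscore-prefix convention shown above and pushes any name that is not a legal Python identifier into a `**{...}` dict.

```python
def attrs_to_kwargs_source(attrs):
    """Render an attribute dict as the argument text of a pyml-style call."""
    plain = []      # attributes that can be written as normal keyword args
    unpacked = {}   # attributes whose names are not valid identifiers
    for name, value in attrs.items():
        key = "_" + name
        if key.isidentifier():
            plain.append(f"{key}={value!r}")
        else:
            unpacked[key] = value
    if unpacked:
        plain.append("**" + repr(unpacked))
    return ", ".join(plain)


args = attrs_to_kwargs_source({"class": "box", "data-tag": "somevalue"})
print(args)

# The generated text is legal call syntax (compiled only, never executed).
compile(f"div({args})", "<pyml>", "eval")
```

Because the `**{...}` part is emitted last, it also respects the "keyword args last" ordering discussed earlier.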

So I haven't released any further documentation until I come back to investigate parsing, or get help.

Anyway, I hope these tips assist you while I'm still figuring it all out, and maybe you might like the codemirror.py example.

Once done, you may also enjoy this plugin, which will format it for you. It's useful for formatting flat .pyml in vscode:

https://marketplace.visualstudio.com/items?itemName=mgesbert.indent-nested-dictionary

Also, as a final note: if you don't want it ALL in domonic because templating some parts is laborious, you can mix in your own f-strings. See the DocumentFragment example here...

https://github.com/byteface/htmxtest/blob/master/app.py

@byteface
Owner Author

byteface commented Oct 8, 2021

To explain a little deeper, and about future progress, as the parser stuff is undocumented.

domonic originally had a simple regex parser, for tags only, no content.

That grew, and domonic currently uses it (you then need to eval the output if you want to auto-fix it up):
domonic.parse

But it can also use a copy of the builtin minidom parseString. This auto-fails with single-char replacement, so it could take forever to generate a working doc if the XML is not perfect. : / I achieved that by hacking the builtin expat parser to use domonic rather than minidom. However, that needs replacing with an html5 parser.

So the c++ one I knocked up to prove the concept and check compatibility, but it is not ideal as it's not pure python and needs extra steps to set up on windows. So it will be a later 'option'.

I need to write a pure python one, using the builtin if possible.

There's a new window class that will eventually let you do

window.location = x

On my own fork I swapped out the parseString method it uses to get the c++ one working. So if you need a quick fix you can do something like that. To help with this, I've been moving some of the parse methods I discovered to a new utility parse package. So if you want to play, you can try to hook the data-attribute fixer up to the hacked c++ parser and bingo.

However, I'm probably at least several months away from the full solution, as I need to either start a whole new one or find a compatible lib that can build with my dom as an option, rather than hacking it like I did with expat. Only then can I get back to my regex curiosity.

Also, for compatibility, 'html' needs to not BE the document. So a slight re-architecture of the dom is needed, without breaking current usage. I'm in the process of considering this too, and it should help with other dom builders. To understand what I'm talking about, diff the native expat parser against my 'borrowed' one and you will see.

@ipfans

ipfans commented Oct 8, 2021

Thanks for your replies. I made a just-works version of the conversion :) But official support is good news.

@byteface
Owner Author

byteface commented Oct 27, 2021

html5lib now has an integration point.

An example exists in /examples/parsers/html5libtest...

and there are notes on the release: https://github.com/byteface/domonic/releases/tag/0.6.5

@byteface
Owner Author

I've included html5lib, and an integration point for the c++ one.

import html5_parser
from domonic.ext.html5_parser_ import parse
root = parse(some_html_string, treebuilder='domonic')

Though that one is still experimental and needs testing.
