How to use this model to extract html structure from image?

#7
by Alexziyu - opened

How can I use this pre-trained model, without fine-tuning, to perform its pre-training task: generating html representations of screenshots? Thanks

Google org

Maybe @Molbap can help out :)

Hey @Alexziyu , I'm not 100% certain it can actually be done, because what pix2struct was trained on is a simplified html version of web pages.
As a matter of fact, this issue was brought up in the official repo a few times:

https://github.com/google-research/pix2struct/issues/1
https://github.com/google-research/pix2struct/issues/38

However, if you feed the beginning of such a simplified structure as a prompt, you start getting html elements of the image. For instance, with a screenshot of the Google landing page and << as the prompt, I start to get some html-like outputs. This is after initializing the processor and model from pix2struct-base.

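from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-base")
img = Image.open("google_homepage.png").convert("RGB")
inputs = processor(text="<<", images=img, return_tensors="pt")  # "<<" is passed as the text prompt
processor.decode(model.generate(**inputs, max_new_tokens=100)[0])
>> '<pad> <<</s>> <img_src=logo_google img_alt=google>></s>'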

The reason for that is that the simplified representation of html uses >, < and so on for divisions and structures. I would suggest experimenting around this, but if there is a specific pretraining prompt, I don't know it. I'm also interested to know more!
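When experimenting with prompts, it can help to pull the attribute pairs out of the decoded string so runs are easier to compare. Below is a minimal sketch, assuming the output only uses the `img_src=` and `img_alt=` attributes seen in this thread and that values never contain `>`. The helper name `extract_image_attributes` is hypothetical, and the real simplified-html grammar may well differ.

```python
import re

# Special tokens that show up in the decoded strings in this thread.
SPECIAL_TOKENS = ("<pad>", "</s>")

# Assumption: only these two attribute names matter; values run until the
# next attribute, a closing '>', or the end of the string.
ATTR_PATTERN = re.compile(r"(img_src|img_alt)=(.*?)(?=\s+img_(?:src|alt)=|>|$)")


def extract_image_attributes(decoded):
    """Return (name, value) pairs found in a decoded pix2struct output string."""
    for token in SPECIAL_TOKENS:
        decoded = decoded.replace(token, "")
    return [(name, value.strip()) for name, value in ATTR_PATTERN.findall(decoded)]


print(extract_image_attributes(
    "<pad> <<</s>> <img_src=logo_google img_alt=google>></s>"
))
```

Values containing spaces, such as a long `img_alt` text, are kept whole because the match only stops at the next attribute name or a `>`.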

So I need "<<" as a prompt? May I ask how you know that? I will try it, thank you very much!!

I do not think it is enough, but it mimics the simplified html from the pretraining objective. I think it is still missing a special pretraining token and perhaps a prompt structure, though some html elements can be recovered.

These models are not supposed to need any prompt in the form of decoder input ids; prompts should be rendered onto the image.
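To illustrate what "rendered onto the image" can look like, here is a minimal sketch using only Pillow. It draws the prompt in a white band above the screenshot and returns the combined image. The helper name `render_prompt_header`, the fixed band height, and the default font are all assumptions for illustration; this is not the rendering code used by pix2struct or transformers, whose font and layout may differ.

```python
from PIL import Image, ImageDraw, ImageFont


def render_prompt_header(image, prompt, header_height=24, padding=6):
    """Return a new image with `prompt` drawn in a white band above `image`."""
    # New canvas: original width, original height plus the header band.
    canvas = Image.new("RGB", (image.width, image.height + header_height), "white")
    # Draw the prompt text in black inside the band (Pillow's default font).
    ImageDraw.Draw(canvas).text(
        (padding, padding), prompt, fill="black", font=ImageFont.load_default()
    )
    # Paste the original screenshot below the band.
    canvas.paste(image.convert("RGB"), (0, header_height))
    return canvas


screenshot = Image.new("RGB", (200, 100), "blue")  # stand-in for a real screenshot
with_prompt = render_prompt_header(screenshot, "What is the page title?")
print(with_prompt.size)
```

The combined image would then be passed to the processor as the only input, with no text prompt on the decoder side.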

I wonder if something else is wrong, because yeah, this appears unable to really generate much of relevance to the input image...

Further, when I feed images into the 'unfine-tuned' models in transformers, if I get anything more than a few tokens, it is gibberish (with respect to the image), like

'<pad></s> for business and personal reasons.> <The only way to get the full picture is to use the html element of the image.> <The first step is to get the html element of the image.> <The second step is to get the html element of the image.> <The third step is to get the html element of the image.> <The third step is to get the html element of the image.> <The second step is to get the'

or

'<pad></s> for commercial purposes.> <International shipments are expected to be made in the next few weeks.> <Please be aware that the shipping costs are not included in the price.> <If you are not sure what you are getting, please contact me.> <I am not a professional engineer, but I do have a good understanding of the technical aspects of the industry.> <I am a professional engineer, but I am not a professional engineer.> <I am a professional engineer, but'

Whereas, using the demo app in the original pix2struct (flax) implementation, I can get something if the text is reasonably sized. This output is for the same web page on Amazon.ca as the second gibberish output above...

<img_alt=Frog and Toad Storybook Favorites: Includes 4 Stories Plus Stickers! Hardcover – Sticker Book, Feb. 19 2019 img_src=9781442400000_cover_middle_low>

The image is a window snapshot of the upper left corner, zoomed in (in a narrow window), of this page: https://www.amazon.ca/Frog-Toad-Storybook-Favorites-Stickers/dp/0062883127

And one last comment: while the exact rules and html subset used for pretraining are sadly not specified, I feel you can piece some of it together by paying close attention to the tokenizer. Look for the html tags! Also, this paper, referenced in pix2struct in relation to Simplified HTML, might have some clues: https://openreview.net/forum?id=P-pPW1nxf1r
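As a starting point for inspecting the tokenizer, here is a minimal sketch that filters a vocabulary for markup-looking tokens. The marker list is an assumption based on the outputs seen in this thread (`<`, `>`, `img_src`, `img_alt`), and the helper names `find_markup_tokens` and `load_vocab` are hypothetical. `load_vocab` assumes the `google/pix2struct-base` checkpoint can be loaded through `AutoTokenizer` and needs network access, so it is defined but not called here.

```python
def find_markup_tokens(vocab, markers=("<", ">", "img_src", "img_alt")):
    """Return the sorted tokens in `vocab` that contain any of the marker substrings."""
    return sorted(tok for tok in vocab if any(m in tok for m in markers))


def load_vocab(model_id="google/pix2struct-base"):
    """Fetch the token -> id mapping for a checkpoint (downloads from the Hub)."""
    from transformers import AutoTokenizer  # imported lazily; only needed here

    return AutoTokenizer.from_pretrained(model_id).get_vocab()


# Toy vocabulary standing in for the real one.
toy_vocab = {"▁hello": 0, "<": 1, "▁img_src": 2, "</s>": 3, "▁>>": 4}
print(find_markup_tokens(toy_vocab))
```

With the real tokenizer, `find_markup_tokens(load_vocab())` would list the candidates; widening `markers` (for example with other tag names you suspect) is the obvious next experiment.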

Thanks @rwightman @Molbap. I was actually trying to replicate the pre-training results, but it looks like the model doesn't work very well. It is really weird, and I wonder if I should try to contact the authors of the paper; after all, so many people have the same question.
