• @SkybreakerEngineer@lemmy.world
    link
    fedilink
    English
    227
    3 months ago

    You can’t parse [X]HTML with LLM. Because HTML can’t be parsed by LLM. LLM is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of LLM will not allow you to consume HTML. LLM are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by LLM. LLM queries are not equipped to break down HTML into its meaningful parts. so many times but it is not getting to me. Even enhanced irregular LLM as used by Perl are not up to the task of parsing HTML. You will never make me crack. HTML is a language of sufficient complexity that it cannot be parsed by LLM. Even Jon Skeet cannot parse HTML using LLM. Every time you attempt to parse HTML with LLM, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with LLM summons tainted souls into the realm of the living. HTML and LLM go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of LLM and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with LLM you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-LLM will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror. 
LLM-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the transgression of a chi͡ld ensures LLM will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using LLM to parse HTML has doomed humanity to an eternity of dread torture and security holes using LLM as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of LLM parsers for HTML will ins​tantly transport a programmer’s consciousness into a world of ceaseless screaming, he comes, the pestilent slithy LLM-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he com̡e̶s, ̕h̵i​s un̨ho͞ly radiańcé destro҉ying all enli̍̈́̂̈́ghtenment, HTML tags lea͠ki̧n͘g fr̶ǫm ̡yo​͟ur eye͢s̸ ̛l̕ik͏e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩́t̲͎̩̱͔́̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOŚ͖̩͇̗̪̏̈́T ALL I​S LOST the pon̷y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE ᵒh god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ</center>

  • chud mcfalse
    link
    fedilink
    78
    edit-2
    3 months ago

    send me your data and i will parse it for you

    it may take me a week to get back to you

  • @jonne@infosec.pub
    link
    fedilink
    70
    3 months ago

    So I guess it’s just only us Millennials that know how to convert a PDF properly, and we’re just sandwiched in between boomers and gen Z finding the most ridiculous ways to try to accomplish that task.

    • @dan@upvote.au
      link
      fedilink
      39
      3 months ago

      Somehow, millennials ended up being the only generation that at least kind of knows how to use computers.

      • @Saleh@feddit.org
        link
        fedilink
        15
        3 months ago

        Naah. There are plenty of Gen X, Y, Z who know and plenty of Millennials who don’t.

        It’s just that if you wanted to “do stuff with computers” you had to develop some understanding back then.

        Today you can “do stuff” like gaming much easier out of the box. So not everyone who “does stuff” knows his way around.

        In the office most colleagues of all generations just know how to do their specific things, mostly in MS Office products.

        • @Tuxman@sh.itjust.works
          link
          fedilink
          8
          edit-2
          3 months ago

          Oh god…. The number of colleagues whose WHOLE job depends on MS Word and who have never heard of “Insert page break”………

          Then they complain when inserting an image breaks their whole document…….

          • LiveLM
            link
            fedilink
            English
            2
            edit-2
            3 months ago

            To be fair, I’ve had Word absolutely freak out with images even in the simplest documents so I don’t blame your colleagues, even without the page breaks

        • @SpacetimeMachine@lemmy.world
          link
          fedilink
          6
          3 months ago

          The difference being that a lot of millennials know how to figure out how to do stuff on computers by doing basic research. I’ve found a lot of my Gen-z friends to be more helpless in that regard.

      • Sabata
        link
        fedilink
        14
        3 months ago

        I will use my powers for evil, just to be smug.

      • Flying Squid
        link
        fedilink
        5
        3 months ago

        I was probably using computers when you were still in your mom’s ovaries. My first one was an Apple ][+ in 1982.

      • lime!
        link
        fedilink
        English
        14
        edit-2
        3 months ago

        $ pandoc doc.pdf -o doc.txt

        Edit: welp, pandoc can’t do that. pdftotext it is.
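
For scripting the pdftotext route, here is a minimal Python sketch. It assumes the `pdftotext` binary from poppler-utils is on the PATH; the helper names `pdftotext_command` and `pdf_to_text` are made up for illustration and are not part of any library.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the argv for pdftotext; "-" as the output file means stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        # -layout tries to preserve the physical layout of the page.
        cmd.append("-layout")
    cmd += [str(pdf_path), "-"]
    return cmd


def pdf_to_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        # Assumption: the tool is installed separately (poppler-utils).
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Usage would be something like `text = pdf_to_text("doc.pdf")`. This only helps with PDFs that carry a text layer; scanned pages would still need OCR.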

        • @mexicancartel@lemmy.dbzer0.com
          link
          fedilink
          English
          2
          edit-2
          3 months ago
          magick file.jpg file.html
          

          ImageMagick be converting anything into anything. (Actually, in this case it makes an HTML file and a PNG file; the HTML file references the PNG, so the HTML page displays it.)

          • lime!
            link
            fedilink
            English
            2
            3 months ago

            not really a good way to get the text out of a pdf though. then again, turns out neither is pandoc.

          • lime!
            link
            fedilink
            English
            2
            3 months ago

            damn it, you’re right. should probably have checked that…

    • Litanys
      link
      fedilink
      English
      7
      3 months ago

      This is why millennials, despite all else wrong with us, are the best generation. I asked a kid the other day if they knew what a directory was… Crickets.

      • What’s fucking wrong *is* that we’re sandwiched between Boomers/GenX and Zoomers. Too young to be able to shove the Cold War brain rotters out of power, too old to convince the Zoomer incels to talk to a girl instead of listening to Andrew Tate.

  • @wise_pancake@lemmy.ca
    link
    fedilink
    35
    3 months ago

    Yes, there are LLMs for that, you literally just have to Google “llm parse PDF”.

    You could also use tesseract or any number of other solutions which probably work as well…

    But an inexperienced kid is gonna act like an inexperienced kid

  • @Gsus4@mander.xyz
    link
    fedilink
    19
    edit-2
    3 months ago

    Same for music, like Suno. I don’t need to remix and hallucinate new fusion music, I just need a really good way to effectively search/discover all music that already exists in one place.

  • JackbyDev
    link
    fedilink
    English
    18
    3 months ago

    I’m still mad there’s no straightforward way to convert a PDF into semantic HTML. There’s plenty of tools to convert it into HTML that looks the same with pages and such, but I just want the content.

    • AnimalsDream
      link
      fedilink
      English
      6
      3 months ago

      Would it work to convert it to a simpler intermediate format like RTF or TXT, and then convert that into HTML? Why HTML anyway? Isn’t EPUB more appropriate?

      • JackbyDev
        link
        fedilink
        English
        5
        3 months ago

        I just hate two-column paginated layouts. Give me pageless single-column text.

        • AnimalsDream
          link
          fedilink
          English
          3
          3 months ago

          Yeah I get that. I’ve just gotten used to leaving pdfs the way they are, and choosing to read them on more appropriate devices like laptops or tablets.

  • @GroundedGator@lemmy.world
    link
    fedilink
    16
    3 months ago

    We absolutely should have more specialized LLMs. That being said we have dozens of tools that convert documents and data. Also any engineer worth a nickel should be able to whip something up in an hour or so for most cases.

    • @locuester@lemmy.zip
      link
      fedilink
      English
      8
      3 months ago

      This is a primary use for me. A couple times per day.

      That’s part of what makes LLMs so popular with software engineers, they solve lots of trivial daily computer tasks.

    • @jdeath@lemm.ee
      link
      fedilink
      1
      3 months ago

      oh nice! i use Prettier for that and it has worked fine for a decade or so, but it is really lacking any AI so i have been having to search for alternatives.

      maybe PrettierAI, it can use LLMs to format all your code!

      • @JaddedFauceet@lemmy.world
        link
        fedilink
        1
        edit-2
        3 months ago

        Prettier doesn’t make my markdown table prettier tho. This is what i did

        prompt

        Given the following markdown table

        | input | output l
        | -- | -- |
        | 2.6 | 3 |
        | 2.5 | 2 |
        | 2.4 | 2 |
        | 1.6 | 2 |
        | 1.5 | 2 |
        | 1.4 | 1 |
        

        Align the vertical bar. Align number to the left

        output

        Here’s the table with the vertical bars aligned and numbers left-aligned:

        | input | output l |
        |-------|----------|
        | 2.6   | 3        |
        | 2.5   | 2        |
        | 2.4   | 2        |
        | 1.6   | 2        |
        | 1.5   | 2        |
        | 1.4   | 1        |
        

        Each column has been padded so that the vertical bars line up consistently, and the numbers are aligned to the left as requested.

        • @jdeath@lemm.ee
          link
          fedilink
          1
          3 months ago

          yeah, you formatted a markdown table. now you’re just repeating yourself. Prettier handles that just fine. I have the Prettier VSCode extension and set it as default formatter on save and this gets done automatically for me.

          before:

          after:

          Didn’t require a small lake of water and a gigawatt of electricity to compute… it even works offline!
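
For anyone without Prettier at hand, the same alignment can be sketched in a few lines of plain Python with no model involved. This is an illustration, not how Prettier does it: the function name `align_markdown_table` is made up, and it assumes simple pipe tables with no escaped `|` inside cells.

```python
def align_markdown_table(table: str) -> str:
    """Pad every cell so the vertical bars line up; cell text is left-aligned."""
    rows = []
    for line in table.strip().splitlines():
        # Drop the outer bars, then split what is left into trimmed cells.
        rows.append([cell.strip() for cell in line.strip().strip("|").split("|")])

    # Pad short rows so every row has the same number of columns.
    n_cols = max(len(row) for row in rows)
    rows = [row + [""] * (n_cols - len(row)) for row in rows]

    def is_separator(cells):
        # A separator row holds only dashes and optional alignment colons.
        return all(cell and set(cell) <= set("-:") for cell in cells)

    # Column width = longest non-separator cell, with a minimum of 3.
    widths = [
        max([3] + [len(row[i]) for row in rows if not is_separator(row)])
        for i in range(n_cols)
    ]

    out = []
    for row in rows:
        if is_separator(row):
            cells = []
            for cell, width in zip(row, widths):
                # Keep any alignment colons and fill the rest with dashes.
                left = ":" if cell.startswith(":") else ""
                right = ":" if cell.endswith(":") and len(cell) > 1 else ""
                cells.append(left + "-" * (width - len(left) - len(right)) + right)
        else:
            cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        out.append("| " + " | ".join(cells) + " |")
    return "\n".join(out)
```

Feeding it the table from the prompt above, stray `l` in the header included, gives every row the same width and the missing closing bar back, and it also works offline.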

  • @sabreW4K3@lazysoci.al
    link
    fedilink
    5
    edit-2
    3 months ago

    I mean he’s not wrong.

    Edit: it seems the joke that LLMs just take other people’s data and regurgitate it in another format went over everyone’s head 🥺

    • lemmyng
      link
      fedilink
      English
      83
      3 months ago

      Using an LLM for format conversion is like taking a picture of an electronic document, taking the card out of the camera and plugging it into a computer, printing the screenshots, taking those prints to a scanner with OCR, turning the result into an audio recording, and then dictating it to an army of 3 million monkeys with typewriters.

      • @gsfraley@lemmy.world
        link
        fedilink
        22
        3 months ago

        Haha considering just how much irrelevant third-party training data you’d be looping into a format conversion, this metaphor really is spot-on.

      • qaz
        link
        fedilink
        English
        8
        3 months ago

        Sounds very appropriate for a government operation

      • htrayl
        link
        fedilink
        6
        3 months ago

        I’m not so sure. I think this is more of a question about taking arbitrary, undefined, or highly variable unstructured data and transforming it into a close approximation of structured data.

        Yes, the pipeline will include additional steps beyond “LLM do the thing”, but there are plenty of tools that seek to do this with LLM assistance.

      • @MajorHavoc@programming.dev
        link
        fedilink
        3
        3 months ago

        So…my process (which you just accurately described) could be replaced by an LLM, after all? Hooray! Monkey feed isn’t too expensive, but a million mouths is still a million mouths.