Parsing HTML with taggy-lens and lens
Extract the text contents from a div with a particular id
Taggy-lens allows us to use lenses to parse and inspect HTML documents.
#!/usr/bin/env stack
-- stack --resolver lts-7.0 --install-ghc runghc --package text --package lens --package taggy-lens

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Text.Lazy as TL
import qualified Data.Text.IO as T
import Text.Taggy.Lens
import Control.Lens

someHtml :: TL.Text
someHtml =
  "\
  \<!doctype html><html><body>\
  \<div>first div</div>\
  \<div id=\"thediv\">second div</div>\
  \<div id=\"not-thediv\">third div</div>"

main :: IO ()
main = do
  -- html parses the lazy text, allAttributed selects the elements whose
  -- id attribute equals "thediv", and contents focuses on their text.
  T.putStrLn (someHtml ^. html . allAttributed (ix "id" . only "thediv") . contents)

Filtering elements from the tree
Find the div with id="article" and strip out all of the script tags inside it.
#!/usr/bin/env stack
-- stack --resolver lts-7.1 --install-ghc runghc --package text --package lens --package taggy-lens --package string-class --package classy-prelude
{-# LANGUAGE NoImplicitPrelude #-}
{-# LANGUAGE OverloadedStrings #-}

import ClassyPrelude
import Control.Lens hiding (children, element)
import Data.String.Class (toText, fromText, toString)
import Data.Text (Text)
import Text.Taggy.Lens
import qualified Text.Taggy.Lens as Taggy
import qualified Text.Taggy.Renderer as Renderer

somehtmlSmall :: Text
somehtmlSmall =
  "<!doctype html><html><body>\
  \<div id=\"article\"><div>first</div><div>second</div><script>this should be removed</script><div>third</div></div>\
  \</body></html>"

renderWithoutScriptTag :: Text
renderWithoutScriptTag =
  let mArticle :: Maybe Taggy.Element
      mArticle =
        (fromText somehtmlSmall) ^? html .
        allAttributed (ix "id" . only "article")
      mArticleFiltered =
        fmap
          (transform
             (children %~
              filter (\n -> n ^? element . name /= Just "script")))
          mArticle
  in maybe "" (toText . Renderer.render) mArticleFiltered

main :: IO ()
main = print renderWithoutScriptTag
-- outputs:
-- "<div id=\"article\"><div>first</div><div>second</div><div>third</div></div>"

Contribution based upon @duplode’s SO answer
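To make the filtering step more concrete, here is a minimal sketch of the same idea on a toy tree, using only base. It is not how taggy-lens is implemented: the Node type and the pruneNamed, isNamed, render and article names are hypothetical stand-ins invented for this illustration, and the real library types use Text and carry attributes. The sketch only shows what dropping matching children at every level of the tree amounts to.

```haskell
module Main (main) where

-- Toy stand-in for an HTML tree (hypothetical, not taggy-lens's own types).
data Node
  = Element String [Node] -- tag name and child nodes
  | Content String        -- a text node
  deriving (Eq, Show)

-- True when the node is an element with the given tag name.
isNamed :: String -> Node -> Bool
isNamed tag (Element n _) = n == tag
isNamed _ _ = False

-- Drop every child element with the given tag name, at any depth.
-- Matching children are removed and the surviving children are pruned
-- recursively. The root itself is never removed, only its descendants.
pruneNamed :: String -> Node -> Node
pruneNamed tag (Element n kids) =
  Element n [pruneNamed tag k | k <- kids, not (isNamed tag k)]
pruneNamed _ node = node

-- Render the toy tree back to a string (no attributes, no escaping).
render :: Node -> String
render (Content t) = t
render (Element n kids) =
  "<" ++ n ++ ">" ++ concatMap render kids ++ "</" ++ n ++ ">"

-- Sample input with one script at the top level and one nested deeper.
article :: Node
article =
  Element "div"
    [ Element "div" [Content "first"]
    , Element "script" [Content "this should be removed"]
    , Element "div" [Element "script" [Content "nested"], Content "second"]
    ]

main :: IO ()
main = putStrLn (render (pruneNamed "script" article))
```

In the taggy-lens version, transform takes care of visiting every element and children %~ filter ... plays the role of the list comprehension, so the recursion does not have to be written by hand.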