Download PDF file from Wikipedia via HTTP
Wikipedia provides a link on every article (under Print/export in the left sidebar) to download the article as a PDF. I wrote a small Haskell script which first fetches the Wikipedia page and prints the rendering link. But when I then request that rendering URL, I get back empty tags, while the same URL in a browser leads to the download link.
Could someone please tell me how to solve this problem? Formatted code is on ideone.
import Network.HTTP
import Text.HTML.TagSoup
import Data.Maybe

parseHelp :: Tag String -> Maybe String
parseHelp (TagOpen _ y) =
  if any (\(_, b) -> b == "Download a PDF version of this wiki page") y
    then Just $ "http://en.wikipedia.org" ++ snd (head y)  -- assumes href is the first attribute
    else Nothing
parseHelp _ = Nothing

parse :: [Tag String] -> Maybe String
parse [] = Nothing
parse (x:xs)
  | isTagOpen x = case parseHelp x of
      Just s  -> Just s
      Nothing -> parse xs
  | otherwise = parse xs

main :: IO ()
main = do
  x <- getLine
  tags_1 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest x)  -- open url
  let lst = head . sections (~== "<div class=portal id=p-coll-print_export>") $ tags_1
      url = fromJust . parse $ lst  -- rendering url
  putStrLn url
  tags_2 <- fmap parseTags $ getResponseBody =<< simpleHTTP (getRequest url)
  print tags_2
If you try requesting the URL through some external tool like wget, you will see that Wikipedia does not serve up the result page directly. It actually returns a 302 Moved Temporarily redirect.
When entering this URL in a browser, it will be fine, as the browser will follow the redirect automatically. simpleHTTP, however, will not. simpleHTTP is, as the name suggests, rather simple. It does not handle things like cookies, SSL or redirects.
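You can make the redirect visible from Haskell itself. Here is a minimal sketch (assuming only the standard Network.HTTP API used in the question) that fetches a URL once, without following redirects, and prints the status code and Location header:

import Network.HTTP

-- For the Wikipedia rendering URL this should reveal the 302 and its
-- target, rather than a useful page body.
checkRedirect :: String -> IO ()
checkRedirect url = do
  result <- simpleHTTP (getRequest url)
  case result of
    Left err  -> putStrLn $ "Connection error: " ++ show err
    Right rsp -> do
      putStrLn $ "Status: " ++ show (rspCode rsp)          -- e.g. (3,0,2)
      putStrLn $ "Location: " ++ show (findHeader HdrLocation rsp)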
You'll want to use the Network.Browser module instead. It offers much more control over how the requests are done. In particular, the setAllowRedirects function will make it automatically follow redirects.
Here's a quick and dirty function for downloading a URL into a String, with support for redirects:
import Network.Browser

grabUrl :: String -> IO String
grabUrl url = fmap (rspBody . snd) . browse $ do
  -- Disable logging output
  setErrHandler $ const (return ())
  setOutHandler $ const (return ())
  setAllowRedirects True
  request $ getRequest url
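For example, replacing the second simpleHTTP call in the question's main with grabUrl url >>= print . parseTags should yield the tags of the final page rather than those of the empty redirect response.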
Related
How to make a request to an IPv6 address using the http-client package in Haskell?
I've been trying to make a request to an IPv6 address using the parseRequest function from the Network.HTTP.Client package (https://hackage.haskell.org/package/http-client-0.7.10/docs/Network-HTTP-Client.html) as follows:

request <- parseRequest "http://[2001:0db8:85a3:0000:0000:8a2e:0370:7334]"

Instead of parsing it as an address/addrInfo, it is parsed as a hostname and throws the error: does not exist (Name or service not known). As a next step, I tried pointing a domain to the same IPv6 address and then using the domain name in parseRequest; it then successfully resolves that into the IPv6 address and makes the request. Is there some other way I can directly use the IPv6 address to make the request using the http-client package?

PS: I also tried without square brackets around the IP address; in this case the error is Invalid URL:

request <- parseRequest "http://2001:0db8:85a3:0000:0000:8a2e:0370:7334"

More context: for an IPv4 address, the getAddrInfo function generates the address as:

AddrInfo {addrFlags = [AI_NUMERICHOST], addrFamily = AF_INET, addrSocketType = Stream,
          addrProtocol = 6, addrAddress = 139.59.90.1:80, addrCanonName = Nothing}

whereas for the IPv6 address (inside the square-brackets format):

AddrInfo {addrFlags = [AI_ADDRCONFIG], addrFamily = AF_UNSPEC, addrSocketType = Stream,
          addrProtocol = 6, addrAddress = 0.0.0.0:0, addrCanonName = Nothing}

and the error prints as:

(ConnectionFailure Network.Socket.getAddrInfo (called with preferred socket type/protocol:
 AddrInfo {addrFlags = [AI_ADDRCONFIG], addrFamily = AF_UNSPEC, addrSocketType = Stream,
           addrProtocol = 6, addrAddress = 0.0.0.0:0, addrCanonName = Nothing},
 host name: Just "[2001:0db8:85a3:0000:0000:8a2e:0370:7334]",
 service name: Just "80"): does not exist (Name or service not known))
When a literal IPv6 address is used in a URL, it should be surrounded by square brackets (as per RFC 2732) so the colons in the literal address aren't misinterpreted as some kind of port designation. When a literal IPv6 address is resolved using the C library function getaddrinfo (or the equivalent Haskell function getAddrInfo), these functions are not required to handle these extra square brackets, and at least on Linux they don't. Therefore, it's the responsibility of the HTTP client library to remove the square brackets from the hostname extracted from the URL before resolving the literal IPv6 address using getaddrinfo, and the http-client package doesn't do this, at least as of version 0.7.10. So, this is a bug, and I can see you've appropriately filed a bug report.

Unfortunately, I don't see an easy way to work around the issue. You can manipulate the Request after parsing to remove the square brackets from the host field, like so:

{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Network.HTTP.Client
import Network.HTTP.Types.Status (statusCode)

main :: IO ()
main = do
  manager <- newManager defaultManagerSettings
  request <- parseRequest "http://[::1]"
  let request' = request { host = removeBrackets (host request) }
  response <- httpLbs request' manager
  print response

removeBrackets :: ByteString -> ByteString
removeBrackets bs =
  case BS.stripPrefix "[" bs >>= BS.stripSuffix "]" of
    Just bs' -> bs'
    Nothing  -> bs

The problem with this is that it also removes the square brackets from the value in the Host header, so the HTTP request will contain the header:

Host: ::1

instead of the correct

Host: [::1]

which may or may not cause problems, depending on the web server at the other end.

You could try using a patched http-client package. The following patch against version 0.7.10 seems to work, but I didn't test it very extensively:

diff --git a/Network/HTTP/Client/Connection.hs b/Network/HTTP/Client/Connection.hs
index 0e329cd..719822e 100644
--- a/Network/HTTP/Client/Connection.hs
+++ b/Network/HTTP/Client/Connection.hs
@@ -15,6 +15,7 @@ module Network.HTTP.Client.Connection

 import Data.ByteString (ByteString, empty)
 import Data.IORef
+import Data.List (stripPrefix, isSuffixOf)
 import Control.Monad
 import Network.HTTP.Client.Types
 import Network.Socket (Socket, HostAddress)
@@ -158,8 +159,12 @@ withSocket :: (Socket -> IO ())
 withSocket tweakSocket hostAddress' host' port' f = do
     let hints = NS.defaultHints { NS.addrSocketType = NS.Stream }
     addrs <- case hostAddress' of
-        Nothing ->
-            NS.getAddrInfo (Just hints) (Just host') (Just $ show port')
+        Nothing -> do
+            let port'' = Just $ show port'
+            case ip6Literal host' of
+                Just lit -> NS.getAddrInfo (Just hints { NS.addrFlags = [NS.AI_NUMERICHOST] })
+                                           (Just lit) port''
+                Nothing -> NS.getAddrInfo (Just hints) (Just host') port''
         Just ha -> return
             [NS.AddrInfo
@@ -173,6 +178,11 @@ withSocket tweakSocket hostAddress' host' port' f = do
     E.bracketOnError (firstSuccessful addrs $ openSocket tweakSocket) NS.close f
+  where
+    ip6Literal h = case stripPrefix "[" h of
+        Just rest | "]" `isSuffixOf` rest -> Just (init rest)
+        _ -> Nothing
+
 openSocket tweakSocket addr =
     E.bracketOnError (NS.socket (NS.addrFamily addr) (NS.addrSocketType addr)
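If the mangled Host header does turn out to matter, one possible variation on the first workaround is to strip the brackets from the host field but put the bracketed form back as an explicit header. This is an untested sketch; it assumes http-client keeps a caller-supplied Host header instead of regenerating one from the host field, which is worth verifying against your version:

{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Network.HTTP.Client
import Network.HTTP.Types.Header (hHost)

main :: IO ()
main = do
  manager <- newManager defaultManagerSettings
  request <- parseRequest "http://[::1]"
  let bracketed = host request
      request' = request
        { host = removeBrackets bracketed
        -- Hypothetical fix: supply the Host header ourselves so the
        -- bracketed form still goes on the wire.
        , requestHeaders = (hHost, bracketed) : requestHeaders request
        }
  response <- httpLbs request' manager
  print response

removeBrackets :: ByteString -> ByteString
removeBrackets bs =
  case BS.stripPrefix "[" bs >>= BS.stripSuffix "]" of
    Just bs' -> bs'
    Nothing  -> bs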
Encoding problem with GET requests in Haskell
I'm trying to get some JSON data from a Jira server using Haskell. I'm counting this as "me having problems with Haskell" rather than with encodings or Jira, because the problem only shows up when I do this from Haskell. The problem occurs when the URL (or query) has plus signs. After building my request for theproject+order+by+created, Haskell prints it as:

Request {
  host = "myjiraserver.com"
  port = 443
  secure = True
  requestHeaders = [("Content-Type","application/json"),("Authorization","<REDACTED>")]
  path = "/jira/rest/api/2/search"
  queryString = "?jql=project%3Dtheproject%2Border%2Bby%2Bcreated"
  method = "GET"
  proxy = Nothing
  rawBody = False
  redirectCount = 10
  responseTimeout = ResponseTimeoutDefault
  requestVersion = HTTP/1.1
}

But the request fails with this response:

- 'Error in the JQL Query: The character ''+'' is a reserved JQL character. You must enclose it in a string or use the escape ''\u002b'' instead. (line 1, character 21)'

So it seems like Jira didn't like Haskell's %2B. Do you have any suggestions on what I can do to fix this, or any resources that might be helpful? The same request without the +order+by+created part is successful.

The code (patched together from these examples):

{-# LANGUAGE OverloadedStrings #-}

import Data.Aeson
import qualified Data.ByteString.Char8 as S8
import qualified Data.Yaml as Yaml
import Network.HTTP.Simple
import System.Environment (getArgs)

-- auth' is echo -e "username:passwd" | base64
foo urlBase proj' auth' = do
  let proj = S8.pack (proj' ++ "+order+by+created")
      auth = S8.pack auth'
  request'' <- parseRequest urlBase
  let request' = setRequestMethod "GET"
               $ setRequestPath "/jira/rest/api/2/search"
               $ setRequestHeader "Content-Type" ["application/json"]
               $ request''
      request = setRequestQueryString [("jql", Just (S8.append "project=" proj))]
              $ setRequestHeader "Authorization" [S8.append "Basic " auth]
              $ request'
  return request

main :: IO ()
main = do
  args <- getArgs
  case args of
    (urlBase:proj:auth:_) -> do
      request <- foo urlBase proj auth
      putStrLn $ show request
      response <- httpJSON request
      S8.putStrLn $ Yaml.encode (getResponseBody response :: Value)
      -- apparently this is required
      putStrLn ""
    _ -> putStrLn "usage..."

(If you know a simpler way to do the above then I'd take such suggestions as well; I'm just trying to do something analogous to this Python:

import requests
import sys

if len(sys.argv) >= 4:
    urlBase = sys.argv[1]
    proj = sys.argv[2]
    auth = sys.argv[3]
    urlBase += "/jira/rest/api/2/search?jql=project="
    proj += "+order+by+created"
    h = {}
    h["content-type"] = "application/json"
    h["authorization"] = "Basic " + auth
    r = requests.get(urlBase + proj, headers=h)
    print(r.json())

)
project+order+by+created is the URL-encoded form of the actual query project order by created (with spaces instead of +). The function setRequestQueryString expects the raw query (with spaces, not URL-encoded), and URL-encodes it itself. The Python script you give for comparison essentially does the URL-encoding by hand. So the fix is to put the raw query in proj:

foo urlBase proj' auth' = do
  let proj = S8.pack (proj' ++ " order by created")  -- spaces instead of +
  ...
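To see that behavior in isolation, here is a small sketch (the host name is the question's placeholder; Network.HTTP.Simple as in the question's code) showing that setRequestQueryString percent-encodes the raw value itself, so spaces come out as %20 rather than + surviving as a literal %2B:

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Simple

-- Pass the query value raw; setRequestQueryString does the encoding.
-- Printing the request shows the encoded queryString, e.g.
-- "?jql=project%3Dtheproject%20order%20by%20created".
main :: IO ()
main = do
  req <- parseRequest "https://myjiraserver.com/jira/rest/api/2/search"
  let req' = setRequestQueryString
               [("jql", Just "project=theproject order by created")] req
  print req'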
Ejabberd: error in simple module to handle offline messages
I have an Ejabberd 17.01 installation where I need to push a notification in case a recipient is offline. This seems to be a common task, and solutions using a customized Ejabberd module can be found everywhere. However, I just don't get it running. First, here's my script:

-module(mod_offline_push).
-behaviour(gen_mod).

-export([start/2, stop/1]).
-export([push_message/3]).

-include("ejabberd.hrl").
-include("logger.hrl").
-include("jlib.hrl").

start(Host, _Opts) ->
    ?INFO_MSG("mod_offline_push loading", []),
    ejabberd_hooks:add(offline_message_hook, Host, ?MODULE, push_message, 10),
    ok.

stop(Host) ->
    ?INFO_MSG("mod_offline_push stopping", []),
    ejabberd_hooks:delete(offline_message_hook, Host, ?MODULE, push_message, 10),
    ok.

push_message(From, To, Packet) ->
    ?INFO_MSG("mod_offline_push -> push_message", [To]),
    Type = fxml:get_tag_attr_s(<<"type">>, Packet), % Supposedly since 16.04
    %Type = xml:get_tag_attr_s(<<"type">>, Packet), % Supposedly since 13.XX
    %Type = xml:get_tag_attr_s("type", Packet),
    %Type = xml:get_tag_attr_s(list_to_binary("type"), Packet),
    ?INFO_MSG("mod_offline_push -> push_message", []),
    ok.

The problem is the Type = ... line in push_message; without that line the last info message is logged (so the hook definitely works). When browsing online, I can find all kinds of function calls to extract elements from Packet. As far as I understand, these changed over time with new releases. But none of them work; all variants lead to some kind of error. The current way returns:

2017-01-25 20:38:08.701 [error] <0.21678.0>@ejabberd_hooks:run1:332
{function_clause,
 [{fxml,get_tag_attr_s,
   [<<"type">>,
    {message,<<>>,normal,<<>>,
     {jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,
      <<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},
     {jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,
      <<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>},
     [],[{text,<<>>,<<"sfsdfsdf">>}],undefined,[],#{}}],
   [{file,"src/fxml.erl"},{line,169}]},
  {mod_offline_push,push_message,3,[{file,"mod_offline_push.erl"},{line,33}]},
  {ejabberd_hooks,safe_apply,3,[{file,"src/ejabberd_hooks.erl"},{line,382}]},
  {ejabberd_hooks,run1,3,[{file,"src/ejabberd_hooks.erl"},{line,329}]},
  {ejabberd_sm,route,3,[{file,"src/ejabberd_sm.erl"},{line,126}]},
  {ejabberd_local,route,3,[{file,"src/ejabberd_local.erl"},{line,110}]},
  {ejabberd_router,route,3,[{file,"src/ejabberd_router.erl"},{line,87}]},
  {ejabberd_c2s,check_privacy_route,5,[{file,"src/ejabberd_c2s.erl"},{line,1886}]}]}
running hook: {offline_message_hook,
 [{jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,
   <<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},
  {jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,
   <<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>},
  {message,<<>>,normal,<<>>,
   {jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,
    <<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},
   {jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,
    <<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>},
   [],[{text,<<>>,<<"sfsdfsdf">>}],undefined,[],#{}}]}

I'm new to Ejabberd and Erlang, so I cannot really interpret the error, but line 33 mentioned in {mod_offline_push,push_message,3,[{file,"mod_offline_push.erl"},{line,33}]} is definitely the line calling get_tag_attr_s.

UPDATE 2017/01/27: Since this cost me a lot of headache, and I'm still not perfectly happy, I post my current working module here in the hope it might help others. My setup is Ejabberd 17.01 running on Ubuntu 16.04. Most stuff I tried and failed with seems to be for older versions of Ejabberd:

-module(mod_fcm_fork).
-behaviour(gen_mod).

%% public methods for this module
-export([start/2, stop/1]).
-export([push_notification/3]).

%% included for writing to ejabberd log file
-include("ejabberd.hrl").
-include("logger.hrl").
-include("xmpp_codec.hrl").

%% Copied this record definition from jlib.hrl
%% Including "xmpp_codec.hrl" and "jlib.hrl" resulted in errors ("XYZ already defined")
-record(jid, {user = <<"">> :: binary(),
              server = <<"">> :: binary(),
              resource = <<"">> :: binary(),
              luser = <<"">> :: binary(),
              lserver = <<"">> :: binary(),
              lresource = <<"">> :: binary()}).

start(Host, _Opts) ->
    ?INFO_MSG("mod_fcm_fork loading", []),
    % Providing the most basic API to the clients and servers that are part of the Inets application
    inets:start(),
    % Add hook to handle messages to users who are offline
    ejabberd_hooks:add(offline_message_hook, Host, ?MODULE, push_notification, 10),
    ok.

stop(Host) ->
    ?INFO_MSG("mod_fcm_fork stopping", []),
    ejabberd_hooks:delete(offline_message_hook, Host, ?MODULE, push_notification, 10),
    ok.

push_notification(From, To, Packet) ->
    % Generate JIDs of sender and receiver
    FromJid = lists:concat([binary_to_list(From#jid.user), "@",
                            binary_to_list(From#jid.server), "/",
                            binary_to_list(From#jid.resource)]),
    ToJid = lists:concat([binary_to_list(To#jid.user), "@",
                          binary_to_list(To#jid.server), "/",
                          binary_to_list(To#jid.resource)]),
    % Get message body
    MessageBody = Packet#message.body,
    % Check that MessageBody is not empty
    case MessageBody /= [] of
        true ->
            % Get first element (no idea when this list can have more elements)
            [First | _] = MessageBody,
            % Get message data and convert to string
            MessageBodyText = binary_to_list(First#text.data),
            send_post_request(FromJid, ToJid, MessageBodyText);
        false ->
            ?INFO_MSG("mod_fcm_fork -> push_notification: MessageBody is empty", [])
    end,
    ok.

send_post_request(FromJid, ToJid, MessageBodyText) ->
    %?INFO_MSG("mod_fcm_fork -> send_post_request -> MessageBodyText = ~p", [Demo]),
    Method = post,
    PostURL = gen_mod:get_module_opt(global, ?MODULE, post_url, fun(X) -> X end, all),
    % Add data as query string. Not nice, a request body would be preferable.
    % Problem: the message body itself can be a JSON string, and I couldn't figure out the correct encoding.
    URL = lists:concat([binary_to_list(PostURL), "?",
                        "fromjid=", FromJid,
                        "&tojid=", ToJid,
                        "&body=", edoc_lib:escape_uri(MessageBodyText)]),
    Header = [],
    ContentType = "application/json",
    Body = [],
    ?INFO_MSG("mod_fcm_fork -> send_post_request -> URL = ~p", [URL]),
    % ADD SSL CONFIG BELOW!
    %HTTPOptions = [{ssl,[{versions, ['tlsv1.2']}]}],
    HTTPOptions = [],
    Options = [],
    httpc:request(Method, {URL, Header, ContentType, Body}, HTTPOptions, Options),
    ok.
Actually it fails on the second arg Packet you pass to fxml:get_tag_attr_s in the push_message function:

{message,<<>>,normal,<<>>,
 {jid,<<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>,
  <<"homer">>,<<"xxx.xxx.xxx.xxx">>,<<"conference">>},
 {jid,<<"carl">>,<<"xxx.xxx.xxx.xxx">>,<<>>,<<"carl">>,
  <<"xxx.xxx.xxx.xxx">>,<<>>},
 [],
 [{text,<<>>,<<"sfsdfsdf">>}],
 undefined,[],#{}}

because it is not an xmlel. It looks like it is the record "message" defined in tools/xmpp_codec.hrl, with <<>> id and type 'normal':

-record(message, {id :: binary(),
                  type = normal :: 'chat' | 'error' | 'groupchat' | 'headline' | 'normal',
                  lang :: binary(),
                  from :: any(),
                  to :: any(),
                  subject = [] :: [#text{}],
                  body = [] :: [#text{}],
                  thread :: binary(),
                  error :: #error{},
                  sub_els = [] :: [any()]}).

Include this file and use just

Type = Packet#message.type

or, if you expect a binary value,

Type = erlang:atom_to_binary(Packet#message.type, utf8)
The newest way to do this seems to be xmpp:get_type/1:

Type = xmpp:get_type(Packet),

It returns an atom, in this case normal.
ErrorClosed exception from Network.HTTP.simpleHTTP -- trying to upload images via XML-RPC with haxr
I'm trying to use haxr 3000.8.5 to upload images to a WordPress blog using the metaWeblog API, specifically the newMediaObject method. I've gotten it to work for small images, having successfully uploaded 20x20 icons in both PNG and JPG formats. However, when I try medium-sized images (say, 300x300) I get an ErrorClosed exception, presumably from the HTTP package (I did a bit of source diving and found that haxr ultimately calls Network.HTTP.simpleHTTP).

Can anyone shed light on the reasons why a call to simpleHTTP might fail with ErrorClosed? Suggestions of things to try and potential workarounds are also welcome.

Here are links to full tcpdump output from a successful upload and from an unsuccessful upload. The (sanitized) code is also shown below, in case it's of any use.

import Network.XmlRpc.Client (remote)
import Network.XmlRpc.Internals (Value(..), toValue)
import Data.Char (toLower)
import System.FilePath (takeFileName, takeExtension)
import qualified Data.ByteString.Char8 as B
import Data.Functor ((<$>))

uploadMediaObject :: FilePath -> IO Value
uploadMediaObject file = do
  media <- mkMediaObject file
  remote "http://someblog.wordpress.com/xmlrpc.php"
         "metaWeblog.newMediaObject"
         "default" "username" "password" media

-- Create the required struct representing the image.
mkMediaObject :: FilePath -> IO Value
mkMediaObject filePath = do
  bits <- B.unpack <$> B.readFile filePath
  return $ ValueStruct
    [ ("name", toValue fileName)
    , ("type", toValue fileType)
    , ("bits", ValueBase64 bits)
    ]
  where
    fileName = takeFileName filePath
    fileType = case (map toLower . drop 1 . takeExtension) fileName of
      "png"  -> "image/png"
      "jpg"  -> "image/jpeg"
      "jpeg" -> "image/jpeg"
      "gif"  -> "image/gif"

main = do
  v <- uploadMediaObject "images/puppy.png"
  print v
21:59:56.813021 IP 192.168.1.148.39571 > ..http: Flags [.]
22:00:01.922598 IP ..http > 192.168.1.148.39571: Flags [F.]

The connection is closed by the server after a 3-4 second timeout because the client didn't send any data, to prevent slowloris and similar DDoS attacks. (F is the FIN flag, which closes one direction of the bidirectional connection.) The server does not wait for the client to close the connection (i.e., wait for EOF / 0 == recv(fd)) but uses the close() syscall; the kernel on the server will respond with [R]eset packets if it receives further data, as you can see at the end of your dump.

I guess the client first opens the HTTP connection and then prepares the data, which takes too long.
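If that diagnosis is right, one thing to try is forcing the payload before haxr opens the connection. Here is a sketch of the question's mkMediaObject with the unpacked String fully evaluated up front (a guess at a workaround based on the answer's hypothesis, not a verified fix; the extension-based type dispatch is elided):

import Control.Exception (evaluate)
import Data.Functor ((<$>))
import Network.XmlRpc.Internals (Value(..), toValue)
import System.FilePath (takeFileName)
import qualified Data.ByteString.Char8 as B

mkMediaObject :: FilePath -> IO Value
mkMediaObject filePath = do
  bits <- B.unpack <$> B.readFile filePath
  -- Force the whole String now, so no lazy conversion work is left to
  -- happen while the HTTP connection is already open.
  _ <- evaluate (length bits)
  return $ ValueStruct
    [ ("name", toValue (takeFileName filePath))
    , ("type", toValue ("image/png" :: String)) -- type detection elided in this sketch
    , ("bits", ValueBase64 bits)
    ]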
Haskell Network.Browser HTTPS Connection
Is there a way to make HTTPS calls with the Network.Browser package? I'm not seeing it in the documentation on Hackage. If there isn't a way to do it with browse, is there another way to fetch HTTPS pages?

My current test code is:

import Network.HTTP
import Network.URI (parseURI)
import Network.HTTP.Proxy
import Data.Maybe (fromJust)
import Control.Applicative ((<$>))
import Network.Browser

retrieveUrl :: String -> IO String
retrieveUrl url = do
  rsp <- browse $ request (Request (fromJust uri) POST [] "Body")
  return $ snd (rspBody <$> rsp)
  where uri = parseURI url

I've been running nc -l -p 8000 and watching the output. I see that it doesn't encrypt it when I do retrieveUrl https://localhost:8000. Also, when I try a real HTTPS site I get:

Network.Browser.request: Error raised ErrorClosed
*** Exception: user error (Network.Browser.request: Error raised ErrorClosed)

Edit: Network.Curl solution (for doing a SOAP call):

import Network.Curl (curlGetString)
import Network.Curl.Opts

soapHeader s = CurlHttpHeaders ["Content-Type: text/xml", "SOAPAction: " ++ s]
proxy = CurlProxy "proxy.foo.org"
envelope = "myRequestEnvelope.xml"

headers = readFile envelope >>= (\x -> return [ soapHeader "myAction"
                                              , proxy
                                              , CurlPost True
                                              , CurlPostFields [x] ])

main = headers >>= curlGetString "https://service.endpoint"
An alternative, and perhaps more "haskelly", solution (as Travis Brown put it) is http-conduit. To just fetch https pages:

import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy as L

main = simpleHttp "https://www.noisebridge.net/wiki/Noisebridge" >>= L.putStr

The below shows how to pass urlencoded parameters:

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy as L

main = do
  initReq <- parseUrl "https://www.googleapis.com/urlshortener/v1/url"
  let req' = initReq { secure = True } -- Turn on https
  let req = (flip urlEncodedBody) req' $
        [ ("longUrl", "http://www.google.com/")
        -- ,
        ]
  response <- withManager $ httpLbs req
  L.putStr $ responseBody response

You can also set the method, content-type, and request body manually. The API is the same as in http-enumerator; a good example is: https://stackoverflow.com/a/5614946
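As a sketch of that last point, here is one way to set the method, content type, and body by hand (the endpoint URL and JSON body are placeholders; the field names are those of http-conduit's Request record from roughly the same era as the code above):

{-# LANGUAGE OverloadedStrings #-}

import Network.HTTP.Conduit
import qualified Data.ByteString.Lazy as L

main = do
  initReq <- parseUrl "https://example.com/api"  -- placeholder endpoint
  let req = initReq
        { method = "POST"
        , requestHeaders = [("Content-Type", "application/json")]
        , requestBody = RequestBodyLBS "{\"example\": true}"  -- placeholder body
        }
  response <- withManager $ httpLbs req
  L.putStr $ responseBody response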
I've wondered about this in the past and have always ended up just using the libcurl bindings. It would be nice to have a more Haskelly solution, but Network.Curl is very convenient.
If all you want to do is fetch a page, Network.HTTP.Wget is the simplest way. Exhibit A:

import Network.HTTP.Wget

main = putStrLn =<< wget "https://www.google.com" [] []