I am learning Go and came across this problem.
I am just downloading web page content using an HTTP client:
package main

import (
    "fmt"
    "io/ioutil"
    "log"
    "net/http"
)

func main() {
    client := &http.Client{}
    req, err := http.NewRequest("GET", "https://mail.ru/", nil)
    if err != nil {
        log.Fatal(err)
    }
    req.Close = true
    response, err := client.Do(req)
    if err != nil {
        log.Fatal(err)
    }
    defer response.Body.Close()
    content, err := ioutil.ReadAll(response.Body)
    if err != nil {
        fmt.Println(err)
    }
    fmt.Println(string(content)[:100])
}
I get an unexpected EOF error when reading the response body. At the same time, the content variable holds the full page content.
This error appears only when I download https://mail.ru/. With other URLs everything works fine, without any errors.
I used curl to download this page content and everything worked as expected.
I am a bit confused: what's happening here?
Go v1.2, tried on Ubuntu and Mac OS X.
It looks like that server (Apache 1.3, wow!) is serving up a truncated gzip response. If you explicitly request the identity encoding (preventing the Go transport from adding gzip itself), you won't get the ErrUnexpectedEOF:
req.Header.Add("Accept-Encoding", "identity")
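For context, a minimal sketch of where that header goes in the program from the question (everything else stays the same):

    req, err := http.NewRequest("GET", "https://mail.ru/", nil)
    if err != nil {
        log.Fatal(err)
    }
    // Ask the server for an uncompressed body, so the transport
    // neither adds Accept-Encoding: gzip nor tries to gunzip the reply.
    req.Header.Add("Accept-Encoding", "identity")
    response, err := client.Do(req)

With identity requested, the reply is not gzip-encoded at all, so no decoder ever sees the truncated stream and the read completes without ErrUnexpectedEOF.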
Related
I was trying to build some sort of website status checker. I found out that a Go HTTP GET request is never resolved and hangs forever for some URLs, like https://www.hetzner.com. But the same URL works if we use curl.
Golang
No error is thrown here; it just hangs on http.Get:
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    resp, err := http.Get("https://www.hetzner.com")
    if err != nil {
        fmt.Println("Error while retrieving site", err)
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Error while reading response body", err)
    }
    fmt.Println("RESPONSE", string(body))
}
CURL
I get the response when running the following command:
curl https://www.hetzner.com
What may be the reason? And how do I resolve this issue from Go's HTTP client?
Your specific case can be fixed by setting the HTTP User-Agent header:
package main

import (
    "fmt"
    "io"
    "net/http"
)

func main() {
    client := &http.Client{}
    req, err := http.NewRequest("GET", "https://www.hetzner.com", nil)
    if err != nil {
        fmt.Println("Error while creating the request", err)
    }
    req.Header.Set("User-Agent", "Golang_Spider_Bot/3.0")
    resp, err := client.Do(req)
    if err != nil {
        fmt.Println("Error while retrieving site", err)
    }
    defer resp.Body.Close()
    body, err := io.ReadAll(resp.Body)
    if err != nil {
        fmt.Println("Error while reading response body", err)
    }
    fmt.Println("RESPONSE", string(body))
}
Note: many other hosts will reject requests from your server because of security rules on their side. Some ideas:
- An empty or bot-like User-Agent HTTP header
- The location of your IP address. For example, online shops in the USA don't need to handle requests from Russia.
- The Autonomous System or CIDR of your provider. Some ASNs are completely blackholed because of the enormous malicious activity coming from their residents.
Note 2: Many modern websites sit behind DDoS-protection or CDN systems. If Cloudflare protects your target website, your HTTP request will be answered with a challenge page even though the status code is 200. To handle this, you need to build something able to render JavaScript-based websites and add some scripts to solve a captcha.
Also, if you check a considerable number of websites in a short time, you will be blocked by your DNS servers, as they have built-in rate limits. In that case, you may want to take a look at massdns or similar solutions.
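Separately, to keep a checker like this from hanging forever on unresponsive hosts, it is worth giving the client a deadline. A minimal sketch; the 10-second value is an arbitrary choice for illustration:

package main

import (
    "fmt"
    "net/http"
    "time"
)

func main() {
    client := &http.Client{
        // Timeout covers connection setup, the TLS handshake,
        // and reading the whole response body.
        Timeout: 10 * time.Second,
    }
    resp, err := client.Get("https://www.hetzner.com")
    if err != nil {
        // A hung connection now surfaces as a timeout error
        // instead of blocking forever.
        fmt.Println("request failed:", err)
        return
    }
    defer resp.Body.Close()
    fmt.Println("status:", resp.Status)
}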
My goal is to scrape a website that requires me to log in first, using HTTP requests in Golang. I actually succeeded once I found out I can send a POST request to the website with form-data in the body of the request. When I test this through an API development tool I use called Postman, the response is instantaneous with no delays. However, when performing the request with an HTTP client in Go, there is a consistent 60-second delay every single time. I end up getting a logged-in page, but for my program I need the response to be nearly instantaneous.
As you can see in my code, I've tried adding a bunch of headers to the request, like "Connection", "Content-Type", and "User-Agent", since I thought maybe the website can tell I'm requesting from a program and is forcing me to wait 60 seconds for a response. Adding these headers to make my request look more legitimate(?) doesn't work at all.
Is the delay coming from Go's HTTP client being slow, or is there something wrong with how I'm forming my HTTP POST request? Also, was I on to something with my headers, and is the HTTP client rewriting them when the request goes out?
Here's my simple program...
package main

import (
    "bytes"
    "fmt"
    "mime/multipart"
    "net/http"
    "net/http/cookiejar"
    "os"
)

func main() {
    url := "https://easypronunciation.com/en/log-in"
    method := "POST"
    payload := &bytes.Buffer{}
    writer := multipart.NewWriter(payload)
    _ = writer.WriteField("email", "foo@bar.com")
    _ = writer.WriteField("password", "*********")
    _ = writer.WriteField("persistent_login", "on")
    _ = writer.WriteField("submit", "")
    err := writer.Close()
    if err != nil {
        fmt.Println(err)
    }
    cookieJar, _ := cookiejar.New(nil)
    client := &http.Client{
        Jar: cookieJar,
    }
    req, err := http.NewRequest(method, url, payload)
    if err != nil {
        fmt.Println(err)
    }
    req.Header.Set("Content-Type", writer.FormDataContentType())
    req.Header.Set("Connection", "Keep-Alive")
    req.Header.Set("Accept-Language", "en-US")
    req.Header.Set("User-Agent", "Mozilla/5.0")
    res, err := client.Do(req)
    if err != nil {
        fmt.Println(err)
    }
    defer res.Body.Close()
    f, err := os.Create("response.html")
    if err != nil {
        fmt.Println(err)
        return
    }
    defer f.Close()
    res.Write(f)
}
I doubt this is the Go client library. I would suggest printing out the latencies of the different phases of the request and seeing where the 60-second delay is. I would also try different URLs instead.
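One way to measure those per-phase latencies is the standard net/http/httptrace package. A minimal sketch, shown with a plain GET against the question's URL for brevity (the same tracing works for the POST); whichever gap contains the 60 seconds tells you which component to blame:

package main

import (
    "crypto/tls"
    "fmt"
    "net/http"
    "net/http/httptrace"
    "time"
)

func main() {
    req, err := http.NewRequest("GET", "https://easypronunciation.com/en/log-in", nil)
    if err != nil {
        fmt.Println(err)
        return
    }
    start := time.Now()
    trace := &httptrace.ClientTrace{
        DNSDone: func(httptrace.DNSDoneInfo) {
            fmt.Println("DNS done:", time.Since(start))
        },
        ConnectDone: func(network, addr string, err error) {
            fmt.Println("TCP connect done:", time.Since(start))
        },
        TLSHandshakeDone: func(tls.ConnectionState, error) {
            fmt.Println("TLS handshake done:", time.Since(start))
        },
        GotFirstResponseByte: func() {
            fmt.Println("first response byte:", time.Since(start))
        },
    }
    req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
    res, err := http.DefaultClient.Do(req)
    if err != nil {
        fmt.Println(err)
        return
    }
    defer res.Body.Close()
    fmt.Println("headers received:", time.Since(start))
}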
I'm trying to download a file from the web. It should be a simple process, one that I've already done before. But this particular link (a 135 kB zip file) gives me an error message: Get "http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip": stopped after 10 redirects. If I copy the link into the browser, the file downloads without any issues, but when using the code below, the error pops up.
package main

import (
    "io"
    "net/http"
    "os"
)

func main() {
    link := "http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip"
    resp, err := http.Get(link)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    // Create the file
    out, err := os.Create("ms.zip")
    if err != nil {
        panic(err)
    }
    defer out.Close()
    // Write the body to file
    _, err = io.Copy(out, resp.Body)
    if err != nil {
        panic(err)
    }
}
Any ideas on why this happens and how to get around it?
Thanks for your attention.
After investigating this URL, I see that it sets a cookie:
Set-Cookie: security=true; path=/
You can set the cookie manually, or implement a CookieJar:
c := http.Client{}
req, err := http.NewRequest("GET", link, nil)
if err != nil {
    panic(err)
}
req.AddCookie(&http.Cookie{Name: "security", Value: "true", Path: "/"})
resp, err := c.Do(req)
if err != nil {
    panic(err)
}
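The CookieJar route would look roughly like this: the jar remembers the security cookie set by the redirect response and replays it automatically, so the redirect chain can terminate (a sketch; link is the URL from the question):

jar, err := cookiejar.New(nil) // import "net/http/cookiejar"
if err != nil {
    panic(err)
}
c := http.Client{Jar: jar}
resp, err := c.Get(link) // cookies set during redirects are now carried along
if err != nil {
    panic(err)
}
defer resp.Body.Close()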
Your code is totally fine, but you'll often find this kind of issue is related to the source you're trying to download the file from, rather than to Go.
You would have had the same issue with other tools and languages, because the host you are trying to reach keeps redirecting you due to an invalid 'User-Agent' header. This is often the case when a site wants its files to be downloadable only from browsers, rather than crawlers, automated scripts, etc.
With Go, you can set the header with req.Header.Set("User-Agent", "<some-user-agent-value>") before sending the request. You'd create a request instance, set the header, and execute it with an http.Client and client.Do(req).
E.g.:
link := "http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip"
req, err := http.NewRequest("GET", link, nil)
if err != nil {
    panic(err)
}
// Doesn't even have to be a full, proper user agent string
req.Header.Set("User-Agent", "Mozilla/4.0")
client := &http.Client{}
resp, err := client.Do(req)
You can read more in Go's http package docs, which state:
"For control over HTTP client headers, redirect policy, and other settings, create a Client..."
See also the http.Request and http.Client docs.
More about this in general can be found in, e.g., Mozilla's HTTP docs, as well as many other great docs and resources out there.
Btw, the zip archive you're trying to download seems to be invalid. :-)
So, I'm using the net/http package. I'm GETting a URL that I know for certain is redirecting. It may even redirect a couple of times before landing on the final URL. Redirection is handled automatically behind the scenes.
Is there an easy way to figure out what the final URL was without a hackish workaround that involves setting the CheckRedirect field on a http.Client object?
I guess I should mention that I think I came up with a workaround, but it's kind of hackish: it involves using a global variable and setting the CheckRedirect field on a custom http.Client.
There's got to be a cleaner way to do it. I'm hoping for something like this:
package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    // Try to GET some URL that redirects. Could be 5 or 6 unseen redirections here.
    resp, err := http.Get("http://some-server.com/a/url/that/redirects.html")
    if err != nil {
        log.Fatalf("http.Get => %v", err.Error())
    }
    // Find out what URL we ended up at
    finalURL := magicFunctionThatTellsMeTheFinalURL(resp)
    fmt.Printf("The URL you ended up at is: %v", finalURL)
}
package main

import (
    "fmt"
    "log"
    "net/http"
)

func main() {
    resp, err := http.Get("http://stackoverflow.com/q/16784419/727643")
    if err != nil {
        log.Fatalf("http.Get => %v", err.Error())
    }
    // Your magic function. The Request in the Response is the last URL the
    // client tried to access.
    finalURL := resp.Request.URL.String()
    fmt.Printf("The URL you ended up at is: %v\n", finalURL)
}
Output:
The URL you ended up at is: http://stackoverflow.com/questions/16784419/in-golang-how-to-determine-the-final-url-after-a-series-of-redirects
I would add a note that the http.Head method should be enough to retrieve the final URL. In theory it should be faster than http.Get, as the server is expected to send back only the headers:
resp, err := http.Head("http://stackoverflow.com/q/16784419/727643")
...
finalURL := resp.Request.URL.String()
...
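If you also want to see every intermediate hop, not just the final URL, the CheckRedirect hook mentioned in the question can do it without any global state. A minimal sketch:

client := &http.Client{
    CheckRedirect: func(req *http.Request, via []*http.Request) error {
        // Called before each redirect is followed; req is the next hop.
        fmt.Println("redirecting to:", req.URL)
        return nil // nil means: keep following the redirect
    },
}
resp, err := client.Get("http://stackoverflow.com/q/16784419/727643")
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()
fmt.Println("final URL:", resp.Request.URL)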
I'm playing with Go (first time ever) and I want to build a tool that retrieves images from the Internet and cuts (even resizes) them, but I'm stuck on the first step.
package main

import (
    "fmt"
    "net/http"
)

var client = http.Client{}

func cutterHandler(res http.ResponseWriter, req *http.Request) {
    reqImg, err := client.Get("http://www.google.com/intl/en_com/images/srpr/logo3w.png")
    if err != nil {
        fmt.Fprintf(res, "Error %v", err)
        return
    }
    buffer := make([]byte, reqImg.ContentLength)
    reqImg.Body.Read(buffer)
    res.Header().Set("Content-Length", fmt.Sprint(reqImg.ContentLength)) /* value: 7007 */
    res.Header().Set("Content-Type", reqImg.Header.Get("Content-Type")) /* value: image/png */
    res.Write(buffer)
}

func main() {
    http.HandleFunc("/cut", cutterHandler)
    http.ListenAndServe(":8080", nil) /* TODO Configurable */
}
I'm able to request an image (let's use the Google logo) and to get its kind and size.
Indeed, I'm just re-writing the image (look at this as a toy "proxy"), setting Content-Length and Content-Type and writing the byte slice back, but I get it wrong somewhere. See how the final image looks rendered on Chromium 12.0.742.112 (90304):
Also, I checked the downloaded file and it is a 7007-byte PNG image. It should be working properly if we look at the request:
GET /cut HTTP/1.1
User-Agent: curl/7.22.0 (i486-pc-linux-gnu) libcurl/7.22.0 OpenSSL/1.0.0e zlib/1.2.3.4 libidn/1.23 libssh2/1.2.8 librtmp/2.3
Host: 127.0.0.1:8080
Accept: */*
HTTP/1.1 200 OK
Content-Length: 7007
Content-Type: image/png
Date: Tue, 27 Dec 2011 19:51:53 GMT
[PNG data]
What do you think I'm doing wrong here?
Disclaimer: I'm scratching my own itch, so probably I'm using the wrong tool :) Anyway, I can implement it in Ruby, but first I would like to give Go a try.
Update: still scratching itches, but... I think this is going to be a good side project, so I'm opening it up: https://github.com/imdario/go-lazor. If it is not useful, at least somebody may find the references used to develop it useful. They were for me.
I think you went too fast to the serving part.
Focus on the first step: downloading the image.
Here you have a little program that downloads that image to memory.
It works on my 2011-12-22 weekly version; for r60.3 you just need to gofix the imports.
package main

import (
    "io/ioutil"
    "log"
    "net/http"
)

const url = "http://www.google.com/intl/en_com/images/srpr/logo3w.png"

func main() {
    // Just a simple GET request to the image URL
    // We get back a *Response, and an error
    res, err := http.Get(url)
    if err != nil {
        log.Fatalf("http.Get -> %v", err)
    }
    // We read all the bytes of the image
    // Types: data []byte
    data, err := ioutil.ReadAll(res.Body)
    if err != nil {
        log.Fatalf("ioutil.ReadAll -> %v", err)
    }
    // You have to manually close the body, check docs
    // This is required if you want to use things like
    // Keep-Alive and other HTTP sorcery.
    res.Body.Close()
    // You can now save it to disk or whatever...
    ioutil.WriteFile("google_logo.png", data, 0666)
    log.Println("I saved your image buddy!")
}
Voilà!
This will get the image into memory inside data.
Once you have that, you can decode it, crop it, and serve it back to the browser.
Hope this helps.
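For the decode-and-crop step, a rough sketch of one way to continue from the data variable above; the 100x100 rectangle is an arbitrary example, and the assertion works because the concrete image types in the standard library provide a SubImage method:

// Extra imports: "bytes", "image", "image/png".
img, err := png.Decode(bytes.NewReader(data))
if err != nil {
    log.Fatalf("png.Decode -> %v", err)
}
// SubImage is not part of the image.Image interface,
// so assert for it dynamically.
sub, ok := img.(interface {
    SubImage(r image.Rectangle) image.Image
})
if !ok {
    log.Fatal("image type does not support cropping")
}
cropped := sub.SubImage(image.Rect(0, 0, 100, 100))
_ = cropped // encode with png.Encode to serve or save it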
I tried your code and noticed that the image you were serving was the right size, but the contents of the file past a certain point were all 0x00.
Review the io.Reader documentation. The important thing to remember is that Read reads up to the number of bytes you request. It can read fewer with no error returned. (You should be checking the error too, but that's not an issue here.)
If you want to make sure your buffer is completely full, use io.ReadFull. In this case it's simpler to just copy the entire contents of the Reader with io.Copy.
It's also important to remember to close HTTP request bodies.
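For reference, a sketch of what the io.ReadFull route would look like inside the handler (assuming the server sends a valid, non-negative ContentLength; the io.Copy version below avoids that assumption):

buffer := make([]byte, reqImg.ContentLength)
if _, err := io.ReadFull(reqImg.Body, buffer); err != nil {
    // io.ErrUnexpectedEOF here means the body ended before the buffer was filled
    fmt.Fprintf(res, "Error %v", err)
    return
}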
I would rewrite the code this way:
package main

import (
    "fmt"
    "io"
    "net/http"
)

var client = http.Client{}

func cutterHandler(res http.ResponseWriter, req *http.Request) {
    reqImg, err := client.Get("http://www.google.com/intl/en_com/images/srpr/logo3w.png")
    if err != nil {
        fmt.Fprintf(res, "Error %v", err)
        return
    }
    res.Header().Set("Content-Length", fmt.Sprint(reqImg.ContentLength))
    res.Header().Set("Content-Type", reqImg.Header.Get("Content-Type"))
    if _, err = io.Copy(res, reqImg.Body); err != nil {
        // handle error
    }
    reqImg.Body.Close()
}

func main() {
    http.HandleFunc("/cut", cutterHandler)
    http.ListenAndServe(":8080", nil) /* TODO Configurable */
}