I'm trying to download a file from the web. It should be a simple process, one that I've already done before. But this particular link (a 135 kB zip file) gives me an error message: Get "http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip": stopped after 10 redirects. If I copy the link into the browser, the file downloads without any issues, but with the code below, the error pops up.
package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	link := "http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip"
	resp, err := http.Get(link)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Create the file
	out, err := os.Create("ms.zip")
	if err != nil {
		panic(err)
	}
	defer out.Close()

	// Write the body to the file
	_, err = io.Copy(out, resp.Body)
	if err != nil {
		panic(err)
	}
}
Any ideas on why this happens and how to get around it?
Thanks for your attention.
After investigating this URL, I see that it sets a cookie:
Set-Cookie: security=true; path=/
You can set the cookie manually, or implement a CookieJar:
c := http.Client{}
req, err := http.NewRequest("GET", link, nil)
if err != nil {
	panic(err)
}
req.AddCookie(&http.Cookie{Name: "security", Value: "true", Path: "/"})
resp, err := c.Do(req)
if err != nil {
	panic(err)
}
Your code is totally fine; you'll often find this issue is related to the source you're trying to download from rather than to Go.
You would hit the same issue with other tools/languages, because the host you are trying to reach keeps redirecting you when it sees an unexpected 'User-Agent' header. This is common when a site wants its files to be downloadable only from browsers, rather than crawlers, automated scripts, etc.
With Go, you can set the header with req.Header.Set("User-Agent", "<some-user-agent-value>") before sending the request. You'd create a request, set the header, and execute it with an http.Client and client.Do(req).
E.g.:
link := "http://www1.caixa.gov.br/loterias/_arquivos/loterias/D_megase.zip"
req, err := http.NewRequest("GET", link, nil)
if err != nil {
	panic(err)
}
req.Header.Set("User-Agent", "Mozilla/4.0") // doesn't even have to be a full proper user-agent string
client := &http.Client{}
resp, err := client.Do(req)
You can read more in Go's net/http package docs, which state:
"For control over HTTP client headers, redirect policy, and other
settings, create a Client..."
See also the http.Request and http.Client docs.
More about this in general can be found in, e.g., Mozilla's HTTP docs, as well as many other great docs and resources out there.
Btw, the zip archive you're trying to download seems to be invalid. :-)
Related
I was trying to build a sort of website status checker. I found that a Go HTTP GET request never resolves and hangs forever for some URLs, like https://www.hetzner.com, but the same URL works if I use curl.
Golang
No error is thrown here; it just hangs on http.Get.
func main() {
	resp, err := http.Get("https://www.hetzner.com")
	if err != nil {
		fmt.Println("Error while retrieving site", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Error while reading response body", err)
	}
	fmt.Println("RESPONSE", string(body))
}
CURL
I get the response when running the following command.
curl https://www.hetzner.com
What may be the reason? And how do I resolve this from Go's HTTP client?
Your specific case can be fixed by specifying the HTTP User-Agent header:
import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	client := &http.Client{}
	req, err := http.NewRequest("GET", "https://www.hetzner.com", nil)
	if err != nil {
		fmt.Println("Error while retrieving site", err)
	}
	req.Header.Set("User-Agent", "Golang_Spider_Bot/3.0")

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("Error while retrieving site", err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		fmt.Println("Error while reading response body", err)
	}
	fmt.Println("RESPONSE", string(body))
}
Note: many other hosts will reject requests from your server because of security rules on their side. Some possibilities:
An empty or bot-like User-Agent HTTP header
The location of your IP address. For example, online shops in the USA may have no need to handle requests from Russia.
The Autonomous System or CIDR of your provider. Some ASNs are completely blackholed because of the enormous malicious activity coming from their residents.
Note 2: Many modern websites sit behind DDoS-protection or CDN systems. If Cloudflare protects your target website, your HTTP request may be answered with a challenge page even though the status code is 200. To handle this, you need to build something able to render JavaScript-based websites and add some scripts to resolve a captcha.
Also, if you check a considerable number of websites in a short time, you may be blocked by your DNS servers, as they have built-in rate limits. In that case, you may want to take a look at massdns or similar solutions.
I'm writing a (hopefully zero-dependency) speed test in Go leveraging Netflix's fast.com servers.
The code is pulling down several pieces of 25MB content, reading the response into a buffer and counting the bytes read along the way.
The code (speed test) works as expected on my development computer, but when I run it on a much tinier linux machine, the speed test caps out at measuring ~75Mbps (despite being hardwired into a network reliably providing 400+Mbps).
I believe the issue must be that because the machine is small, it's relatively slow at either reading the response or writing into the buffer.
I did a Go trace of the program on the 2 machines, and sure enough the Heap on the small linux machine continually gets full before GC clears it out; rinse and repeat.
The question is: what can I do about this to make my speed test accurate? More specifically, since I don't actually need the response data (because this is just a speed test), is there a way I can download and count the bytes from the HTTP response without actually bothering to write them anywhere, thus potentially saving time?
The relevant code is below. (Note: the reason I'm using http.NewRequest is that in some cases I add URL params.)
client := &http.Client{}
req, err := http.NewRequest("GET", url, nil)
resp, err := client.Do(req)
defer resp.Body.Close()

buffer := make([]byte, 128*1024)
for {
	b, err := resp.Body.Read(buffer)
	if err == io.EOF {
		break
	}
	func() {
		mu.Lock()
		defer mu.Unlock()
		*bytesRead += b
		if *done {
			return
		}
	}()
}
Edit: I should also add that the linux device has been tested and validated via other speed tests that it can achieve greater than 75Mbps.
You can use the io.Discard writer to speed up the code. An example:
var bytesRead int64
client := &http.Client{}
req, err := http.NewRequest("GET", url, nil)
if err != nil {
	panic(err)
}
resp, err := client.Do(req)
if err != nil {
	panic(err)
}
defer resp.Body.Close()

nBytes, err := io.Copy(io.Discard, resp.Body)
if err != nil {
	panic(err)
}
bytesRead += nBytes
This way you don't need to iterate over a byte buffer yourself.
My goal is to scrape a website that requires me to log in first using HTTP requests in Golang. I actually succeeded by finding out I can send a post request to the website writing form-data into the body of the request. When I test this through an API development software I use called Postman, the response is instantaneous with no delays. However, when performing the request with an HTTP client in Go, there is a consistent 60 second delay every single time. I end up getting a logged in page, but for my program I need the response to be nearly instantaneous.
As you can see in my code, I've tried adding a bunch of headers to the request like "Connection", "Content-Type", "User-Agent" since I thought maaaaaybe the website can tell I'm requesting from a program and is forcing me to wait 60 seconds for a response. Adding these headers to make my request more legitimate(?) doesn't work at all.
Is the delay coming from Go's HTTP client being slow, or is there something wrong with how I'm forming my HTTP POST request? Also, was I on to something with my headers, and is the HTTP client rewriting them when they're sent out?
Here's my simple program...
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
	"net/http/cookiejar"
	"os"
)

func main() {
	url := "https://easypronunciation.com/en/log-in"
	method := "POST"

	payload := &bytes.Buffer{}
	writer := multipart.NewWriter(payload)
	_ = writer.WriteField("email", "foo@bar.com")
	_ = writer.WriteField("password", "*********")
	_ = writer.WriteField("persistent_login", "on")
	_ = writer.WriteField("submit", "")
	err := writer.Close()
	if err != nil {
		fmt.Println(err)
	}

	cookieJar, _ := cookiejar.New(nil)
	client := &http.Client{
		Jar: cookieJar,
	}
	req, err := http.NewRequest(method, url, payload)
	if err != nil {
		fmt.Println(err)
	}
	req.Header.Set("Content-Type", writer.FormDataContentType())
	req.Header.Set("Connection", "Keep-Alive")
	req.Header.Set("Accept-Language", "en-US")
	req.Header.Set("User-Agent", "Mozilla/5.0")

	res, err := client.Do(req)
	if err != nil {
		fmt.Println(err)
	}
	defer res.Body.Close()

	f, err := os.Create("response.html")
	if err != nil {
		fmt.Println(err)
	}
	defer f.Close()
	res.Write(f)
}
I doubt this is the Go client library. I would suggest printing out the latencies of the different components to see where the 60-second delay is. I would also try different URLs and compare.
I am learning Go and came across this problem.
I am just downloading web page content using HTTP client:
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	client := &http.Client{}
	req, err := http.NewRequest("GET", "https://mail.ru/", nil)
	req.Close = true
	response, err := client.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer response.Body.Close()

	content, err := ioutil.ReadAll(response.Body)
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println(string(content)[:100])
}
I get an unexpected EOF error when reading the response body; at the same time, the content variable holds the full page content.
This error appears only when downloading https://mail.ru/. With other URLs everything works fine, without any errors.
I used curl to download this page's content - everything works as expected.
I am a bit confused - what's happening here?
Go v1.2, tried on Ubuntu and MacOS X
It looks like that server (Apache 1.3, wow!) is serving a truncated gzip response. If you explicitly request the identity encoding (preventing the Go transport from adding gzip itself), you won't get the ErrUnexpectedEOF:
req.Header.Add("Accept-Encoding", "identity")
So, I'm using the net/http package. I'm GETting a URL that I know for certain is redirecting. It may even redirect a couple of times before landing on the final URL. Redirection is handled automatically behind the scenes.
Is there an easy way to figure out what the final URL was without a hackish workaround that involves setting the CheckRedirect field on a http.Client object?
I guess I should mention that I did come up with a workaround, but it's kind of hackish, as it involves a global variable and setting the CheckRedirect field on a custom http.Client.
There's got to be a cleaner way to do it. I'm hoping for something like this:
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Try to GET some URL that redirects. Could be 5 or 6 unseen redirections here.
	resp, err := http.Get("http://some-server.com/a/url/that/redirects.html")
	if err != nil {
		log.Fatalf("http.Get => %v", err.Error())
	}

	// Find out what URL we ended up at
	finalURL := magicFunctionThatTellsMeTheFinalURL(resp)
	fmt.Printf("The URL you ended up at is: %v", finalURL)
}
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://stackoverflow.com/q/16784419/727643")
	if err != nil {
		log.Fatalf("http.Get => %v", err.Error())
	}

	// Your magic function. The Request in the Response is the last URL the
	// client tried to access.
	finalURL := resp.Request.URL.String()
	fmt.Printf("The URL you ended up at is: %v\n", finalURL)
}
Output:
The URL you ended up at is: http://stackoverflow.com/questions/16784419/in-golang-how-to-determine-the-final-url-after-a-series-of-redirects
I would add that the http.Head method should be enough to retrieve the final URL. In theory it should be faster compared to http.Get, as the server is expected to send back just the headers:
resp, err := http.Head("http://stackoverflow.com/q/16784419/727643")
...
finalURL := resp.Request.URL.String()
...