I'm building an application that will be downloading roughly 5000 CSV files concurrently using go routines and plain ol http get requests. Downloading the files in parallel.
I'm currently running into open file limits imposed by OS X.
The CSV files are served over http. Are there any other network protocols that I can use to batch each request into one? I don't have access to the server, so I can't zip them. I'd also prefer not to change the ulimit because once in production, I probably won't have access to that configuration.
You probably want to limit active concurrent requests to a more sensible number than 5000. Possibly spin up 10/20 workers and send individual files to them over a channel.
The http client should reuse connections for requests, assuming you always read the entire request body, and close it.
Something like this:
func main() {
http.DefaultTransport.(*http.Transport).MaxIdleConnsPerHost = 100
for i := 0; i < 10; i++ {
wg.Add(1)
go worker()
}
var csvs = []string{"http://example.com/a.csv", "http://example.com/b.csv"}
for _, u := range csvs {
ch <- u
}
close(ch)
wg.Wait()
}
var ch = make(chan string)
var wg sync.WaitGroup
func worker() {
defer wg.Done()
for u := range ch {
get(u)
}
}
func get(u string) {
resp, err := http.Get(u)
//check err here
// make sure we always read rest of body, and close
defer resp.Body.Close()
defer io.Copy(ioutil.Discard, resp.Body)
//read and decode / handle it. Make sure to read all of body.
}
Related
I have an API that receives a CSV file to process. I'd like to be able to send back an 202 Accepted (or any status really) while processing the file in the background. I have a handler that checks the request, writes the success header, and then continues processing via a producer/consumer pattern. The problem is that, due to the WaitGroup.Wait() calls, the accepted header isn't sending back. The errors on the handler validation are sending back correctly but that's because of the return statements.
Is it possible to send that 202 Accepted back with the wait groups as I'm hoping (and if so, what am I missing)?
func SomeHandler(w http.ResponseWriter, req *http.Request) {
endAccepted := time.Now()
err := verifyRequest(req)
if err != nil {
w.WriteHeader(http.StatusBadRequest)
data := JSONErrors{Errors: []string{err.Error()}}
json.NewEncoder(w).Encode(data)
return
}
// ...FILE RETRIEVAL CLIPPED (not relevant)...
// e.g. csvFile, openErr := os.Open(tmpFile.Name())
//////////////////////////////////////////////////////
// TODO this isn't sending due to the WaitGroup.Wait()s below
w.WriteHeader(http.StatusAccepted)
//////////////////////////////////////////////////////
// START PRODUCER/CONSUMER
jobs := make(chan *Job, 100) // buffered channel
results := make(chan *Job, 100) // buffered channel
// start consumers
for i := 0; i < 5; i++ { // 5 consumers
wg.Add(1)
go consume(i, jobs, results)
}
// start producing
go produce(jobs, csvFile)
// start processing
wg2.Add(1)
go process(results)
wg.Wait() // wait for all workers to finish processing jobs
close(results)
wg2.Wait() // wait for process to finish
log.Println("===> Done Processing.")
}
You're doing all the processing in the background, but you're still waiting for it to finish. The solution would be to just not wait. The best solution would move all of the handling elsewhere to a function you can just call with go to run it in the background, but the simplest solution leaving it inline would just be
w.WriteHeader(http.StatusAccepted)
go func() {
// START PRODUCER/CONSUMER
jobs := make(chan *Job, 100) // buffered channel
results := make(chan *Job, 100) // buffered channel
// start consumers
for i := 0; i < 5; i++ { // 5 consumers
wg.Add(1)
go consume(i, jobs, results)
}
// start producing
go produce(jobs, csvFile)
// start processing
wg2.Add(1)
go process(results)
wg.Wait() // wait for all workers to finish processing jobs
close(results)
wg2.Wait() // wait for process to finish
log.Println("===> Done Processing.")
}()
Note that you elided the CSV file handling, so you'll need to ensure that it's safe to use this way (i.e. that you haven't defered closing or deleting the file, which would cause that to occur as soon as the handler returns).
I am trying to implement http server that:
Calculate farther redirect using some logic
Redirect user
Log user data
The goal is to achieve maximum throughput (at least 15k rps). In order to do this, I want to save log asynchronously. I'm using kafka as logging system and separate logging block of code into separate goroutine. Overall example of current implementation:
package main
import (
"github.com/confluentinc/confluent-kafka-go/kafka"
"net/http"
"time"
"encoding/json"
)
type log struct {
RuntimeParam string `json:"runtime_param"`
AsyncParam string `json:"async_param"`
RemoteAddress string `json:"remote_address"`
}
var (
producer, _ = kafka.NewProducer(&kafka.ConfigMap{
"bootstrap.servers": "localhost:9092,localhost:9093",
"queue.buffering.max.ms": 1 * 1000,
"go.delivery.reports": false,
"client.id": 1,
})
topicName = "log"
)
func main() {
siteMux := http.NewServeMux()
siteMux.HandleFunc("/", httpHandler)
srv := &http.Server{
Addr: ":8080",
Handler: siteMux,
ReadTimeout: 2 * time.Second,
WriteTimeout: 5 * time.Second,
IdleTimeout: 10 * time.Second,
}
if err := srv.ListenAndServe(); err != nil {
panic(err)
}
}
func httpHandler(w http.ResponseWriter, r *http.Request) {
handlerLog := new(log)
handlerLog.RuntimeParam = "runtimeDataString"
http.Redirect(w, r, "http://google.com", 301)
go func(goroutineLog *log, request *http.Request) {
goroutineLog.AsyncParam = "asyncDataString"
goroutineLog.RemoteAddress = r.RemoteAddr
jsonLog, err := json.Marshal(goroutineLog)
if err == nil {
producer.ProduceChannel() <- &kafka.Message{
TopicPartition: kafka.TopicPartition{Topic: &topicName, Partition: kafka.PartitionAny},
Value: jsonLog,
}
}
}(handlerLog, r)
}
The questions are:
Is it correct/efficient to use separate goroutine to implement async logging or should I use a different approach? (workers and channels for example)
Maybe there is a way to further improve performance of server, that I'm missing?
Yes, this is correct and efficient use of a goroutine (as Flimzy pointed in the comments). I couldn't agree more, this is a good approach.
The problem is that the handler may finish executing before the goroutine started processing everything and the request (which is a pointer) may be gone or you may have some races down the middleware stack. I read your comments, that it isn't your case, but in general, you shouldn't pass a request to a goroutine. As I can see from your code, you're really using only RemoteAddr from the request and why not to redirect straight away and put logging in the defer statement? So, I'd rewrite your handler a bit:
func httpHandler(w http.ResponseWriter, r *http.Request) {
http.Redirect(w, r, "http://google.com", 301)
defer func(runtimeDataString, RemoteAddr string) {
handlerLog := new(log)
handlerLog.RuntimeParam = runtimeDataString
handlerLog.AsyncParam = "asyncDataString"
handlerLog.RemoteAddress = RemoteAddr
jsonLog, err := json.Marshal(handlerLog)
if err == nil {
producer.ProduceChannel() <- &kafka.Message{
TopicPartition: kafka.TopicPartition{Topic: &topicName, Partition: kafka.PartitionAny},
Value: jsonLog,
}
}
}("runtimeDataString", r.RemoteAddr)
}
The goroutines unlikely improve performance of your server as you just send the response earlier and those kafka connections could pile up in the background and slow down the whole server. If you find this as the bottleneck, you may consider saving logs locally and sending them to kafka in another process (or pool of workers) outside of your server. This may spread the workload over time (like sending fewer logs when you have more requests and vice versa).
I'm trying to process a file which contains 200 URLs and use each URL to make an HTTP request. I need to process 10 URLs concurrently maximum each time (code should block until 10 URLs finish processing). Tried to solve it in go but I keep getting the whole file processed with 200 concurrent connection created.
for scanner.Scan() { // loop through each url in the file
// send each url to golang HTTPrequest
go HTTPrequest(scanner.Text(), channel, &wg)
}
fmt.Println(<-channel)
wg.Wait()
What should i do?
A pool of 10 go routines reading from a channel should fulfill your requirements.
work := make(chan string)
// get original 200 urls
var urlsToProcess []string = seedUrls()
// startup pool of 10 go routines and read urls from work channel
for i := 0; i<=10; i++ {
go func(w chan string) {
url := <-w
}(work)
}
// write urls to the work channel, blocking until a worker goroutine
// is able to start work
for _, url := range urlsToProcess {
work <- url
}
Cleanup and request results are left as an exercise for you. Go channels is will block until one of the worker routines is able to read.
code like this
longTimeAct := func(index int, w chan struct{}, wg *sync.WaitGroup) {
defer wg.Done()
time.Sleep(1 * time.Second)
println(index)
<-w
}
wg := new(sync.WaitGroup)
ws := make(chan struct{}, 10)
for i := 0; i < 100; i++ {
ws <- struct{}{}
wg.Add(1)
go longTimeAct(i, ws, wg)
}
wg.Wait()
I'm attempting to write a small webapp in Go where the user uploads a gzipped file in a multipart form. The app unzips and parses the file and writes some output to the response. However, I keep running into an error where the input stream looks corrupted when I begin writing to the response. Not writing to the response fixes the problem, as does reading from a non-gzipped input stream. Here's an example http handler:
func(w http.ResponseWriter, req *http.Request) {
//Get an input stream from the multipart reader
//and read it using a scanner
multiReader, _ := req.MultipartReader()
part, _ := multiReader.NextPart()
gzipReader, _ := gzip.NewReader(part)
scanner := bufio.NewScanner(gzipReader)
//Strings read from the input stream go to this channel
inputChan := make(chan string, 1000)
//Signal completion on this channel
donechan := make(chan bool, 1)
//This goroutine just reads text from the input scanner
//and sends it into the channel
go func() {
for scanner.Scan() {
inputChan <- scanner.Text()
}
close(inputChan)
}()
//Read lines from input channel. They all either start with #
//or have ten tab-separated columns
go func() {
for line := range inputChan {
toks := strings.Split(line, "\t")
if len(toks) != 10 && line[0] != '#' {
panic("Dang.")
}
}
donechan <- true
}()
//periodically write some random text to the response
go func() {
for {
time.Sleep(10*time.Millisecond)
w.Write([]byte("write\n some \n output\n"))
}
}()
//wait until we're done to return
<-donechan
}
Weirdly, this code panics every time because it always encounters a line with fewer than 10 tokens, although at different spots every time. Commenting out the line that writes to the response fixes the issue, as does reading from a non-gzipped input stream. Am I missing something obvious? Why would writing to the response break if reading from a gzip file, but not a plain text formatted file? Why would it break at all?
The HTTP protocol is not full-duplex: it is request-response based. You should only send output once you're done with reading the input.
In your code you use a for with range on a channel. This will try to read the channel until it is closed, but you never close the inputChan.
If you never close inputChan, the following line is never reached:
donechan <- true
And therefore receiving from donechan blocks:
<-donechan
You have to close the inputChan when EOF is reached:
go func() {
for scanner.Scan() {
inputChan <- scanner.Text()
}
close(inputChan) // THIS IS NEEDED
}()
I wrote a simple UDP server in go.
When I do go run udp.go it prints all packages I send to it. But when running go run udp.go > out it stops passing stdout to the out file when the client stops.
The client is simple program that sends 10k requests. So in the file I have around 50% of sent packages. When I run the client again, the out file grows again until the client script finishes.
Server code:
package main
import (
"net"
"fmt"
)
func main() {
addr, _ := net.ResolveUDPAddr("udp", ":2000")
sock, _ := net.ListenUDP("udp", addr)
i := 0
for {
i++
buf := make([]byte, 1024)
rlen, _, err := sock.ReadFromUDP(buf)
if err != nil {
fmt.Println(err)
}
fmt.Println(string(buf[0:rlen]))
fmt.Println(i)
//go handlePacket(buf, rlen)
}
}
And here is the client code:
package main
import (
"net"
"fmt"
)
func main() {
num := 0
for i := 0; i < 100; i++ {
for j := 0; j < 100; j++ {
num++
con, _ := net.Dial("udp", "127.0.0.1:2000")
fmt.Println(num)
buf := []byte("bla bla bla I am the packet")
_, err := con.Write(buf)
if err != nil {
fmt.Println(err)
}
}
}
}
As you suspected, it seems like UDP packet loss due to the nature of UDP. Because UDP is connectionless, the client doesn't care if the server is available or ready to receive data. So if the server is busy processing, it won't be available to handle the next incoming datagram. You can check with netstat -u (which should include UDP packet loss info). I ran into the same thing, in which the server (receive side) could not keep up with the packets sent.
You can try two things (the second worked for me with your example):
Call SetReadBuffer. Ensure the receive socket has enough buffering to handle everything you throw at it.
sock, _ := net.ListenUDP("udp", addr)
sock.SetReadBuffer(1048576)
Do all packet processing in a go routine. Try to increase the datagrams per second by ensuring the server isn't busy doing other work when you want it to be available to receive. i.e. Move the processing work to a go routine, so you don't hold up ReadFromUDP().
//Reintroduce your go handlePacket(buf, rlen) with a count param
func handlePacket(buf []byte, rlen int, count int)
fmt.Println(string(buf[0:rlen]))
fmt.Println(count)
}
...
go handlePacket(buf, rlen, i)
One final option:
Lastly, and probably not what you want, you put a sleep in your client which would slow down the rate and would also remove the problem. e.g.
buf := []byte("bla bla bla I am the packet")
time.Sleep(100 * time.Millisecond)
_, err := con.Write(buf)
Try syncing stdout after the write statements.
os.Stdout.Sync()