Happy New Year!
I have just started to learn Julia, and the first mini challenge I have set myself is to scrape data from a large list of URLs.
I have about 50k URLs (which I successfully parsed from a JSON file with Julia using a regex) in a CSV file. I want to scrape each one and return a matched string ("/page/12345/view", where 12345 is any integer).
I managed to do so using HTTP and Queryverse (I had started with CSV and CSVFiles, but I am trying out other packages for learning purposes), but the script seems to stop after just under 2k URLs. I can't see an error such as a timeout.
May I ask if anyone can advise what I'm doing wrong or how I can approach it differently? Explanations/links to learning resources would also be great!
using HTTP, Queryverse
URLs = load("urls.csv") |> DataFrame
patternid = r"\/page\/[0-9]+\/view"
touch("ids.txt")
f = open("ids.txt", "a")
for row in eachrow(URLs)
    urlResponse = HTTP.get(row[:url])
    if Int(urlResponse.status) == 404
        continue
    end
    urlHTML = String(urlResponse.body)
    urlIDmatch = match(patternid, urlHTML)
    write(f, urlIDmatch.match, "\n")
end
close(f)
There can always be a server that detects your scraper and intentionally takes a very long time to respond.
Since scraping is an IO-intensive operation, you should do it using a large number of asynchronous tasks. This should be combined with the readtimeout parameter of the get function. Your code will then look more or less like this:
asyncmap(1:nrow(URLs); ntasks=50) do n
    row = URLs[n, :]
    urlResponse = HTTP.get(row[:url], readtimeout=10)
    # the rest of your code comes here
end
Even if some servers are delaying transmission, many connections will always be working.
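The same idea (many concurrent workers, a timeout on each request, and failures skipped rather than fatal) can also be sketched in Python. This is only an illustration under assumptions, not a translation of the Julia code above: the names `fetch_html`, `extract_id` and `scrape_ids`, the worker count and the 10-second timeout are all hypothetical choices.

```python
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Same pattern as in the question: "/page/<integer>/view".
PATTERN = re.compile(r"/page/[0-9]+/view")


def fetch_html(url, timeout=10):
    """Download one page; the timeout limits how long any single
    blocking socket operation may wait on a slow server."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_id(url, fetch=fetch_html):
    """Return the first matching path in the page, or None on any failure."""
    try:
        html = fetch(url)
    except Exception:
        # HTTP errors (e.g. 404) and timeouts raise; skip this URL and carry on.
        return None
    found = PATTERN.search(html)
    return found.group(0) if found else None


def scrape_ids(urls, fetch=fetch_html, workers=50):
    """Process all URLs with a pool of worker threads, keeping input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda u: extract_id(u, fetch), urls)
        return [r for r in results if r is not None]
```

Passing `fetch` as a parameter keeps the sketch easy to try without a network: any function that maps a URL to an HTML string can stand in for `fetch_html`.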
I am working on a Python project where I download and index files from the SEC EDGAR database. The problem, however, is that when using the requests module, it takes a very long time to save the text in a variable (between ~130 and 170 seconds for one file).
The file has roughly 16 million characters, and I wanted to see if there was any way to easily lower the time it takes to retrieve the text. Example:
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.text)
Thanks!
What I found is in the code for r.text, specifically the case where no encoding was given (r.encoding is None). The time spent detecting the encoding was 20 seconds; I was able to skip it by defining the encoding.
...
r.encoding = 'utf-8'
...
Additional details
In my case, the response did not declare an encoding. The response was 256k in size, and evaluating r.apparent_encoding took 20 seconds.
Looking into the text property: it tests whether an encoding is set. If it is None, it reads the apparent_encoding property, which scans the content to autodetect the encoding scheme.
On a long string this takes time. By defining the encoding of the response (as described above), you skip the detection.
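A minimal sketch of that workaround as a reusable helper, assuming the document really is UTF-8 (or whatever you pass in). The function name `text_with_known_encoding` is hypothetical; the only parts of the response it relies on are the `encoding` attribute and the `text` property discussed above.

```python
def text_with_known_encoding(response, encoding="utf-8"):
    """Return response.text, setting an encoding first if the server sent none.

    `encoding` is an assumption about the document; a wrong value
    gives garbled text rather than an error.
    """
    if response.encoding is None:
        # With an encoding set, .text no longer needs apparent_encoding.
        response.encoding = encoding
    return response.text
```

With requests this would be called as `text_with_known_encoding(requests.get(url))`. An encoding declared by the server is left untouched.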
Validate that this is your issue
In your example above:
from datetime import datetime
import requests
url = "https://www.sec.gov/Archives/edgar/data/0001652044/000165204417000008/goog10-kq42016.htm"
r = requests.get(url, stream=True)
print(r.encoding)
print(datetime.now())
enc = r.apparent_encoding
print(enc)
print(datetime.now())
print(r.text)
print(datetime.now())
r.encoding = enc
print(r.text)
print(datetime.now())
Of course, the timestamps may get lost among the printed text, so I recommend running the above in an interactive shell. It may become apparent where you are losing the time even without printing datetime.now().
From @martijn-pieters:
Decoding and printing 15MB of data to your console is often slower than loading data from a network connection. Don't print all that data. Just write it straight to a file.
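Writing straight to a file could look roughly like the sketch below. The helper `save_chunks` and the output path are hypothetical. It takes any iterable of byte chunks, so it never decodes or prints the body.

```python
def save_chunks(chunks, path):
    """Write an iterable of byte chunks to `path`; return the bytes written."""
    total = 0
    with open(path, "wb") as out:
        for chunk in chunks:
            if chunk:  # ignore empty chunks
                out.write(chunk)
                total += len(chunk)
    return total
```

With the request from the question (made with stream=True), the call would be something like `save_chunks(r.iter_content(chunk_size=65536), "filing.htm")`, where the chunk size and file name are arbitrary choices.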
I am using an ATmega128 and I need two serial ports for communication. I have been using printf from the "stdio.h" header file to send data through USART 0. I also need to send data through USART 1 to an LCD, and I would like to use a formatted output function there too. I suspected that connecting the same printf function to both USART 1 and USART 0 would confuse the compiler, so I haven't tried it.
Can anyone suggest how to make another printf, say "Lprintf", to send data through USART 1?
What you want to do here is use fprintf(). See the avr-libc documentation for the function. Essentially, you want one character-output function for UART1 and one for UART0. Based on those, you can create two FILE streams. Once you have done so, you are free to use fprintf() on each. Optionally, you can point stdout at one of these streams so that you can also use printf().
/* At file scope: one output stream per USART. */
FILE uart1_out = FDEV_SETUP_STREAM(uart1_putc, 0, _FDEV_SETUP_WRITE);
FILE uart0_out = FDEV_SETUP_STREAM(uart0_putc, 0, _FDEV_SETUP_WRITE);

/* Inside a function: write to either stream explicitly. */
fprintf(&uart1_out, "printing to UART1");
fprintf(&uart0_out, "printing %d to UART0", 0);

/* Optionally redirect the standard streams. */
stdout = &uart1_out;
stderr = &uart0_out;
printf("This string will be printed thru UART1");
fprintf(stderr, "This string will be printed thru UART0");
You just need to provide the implementations of int uart1_putc(char, FILE*) and int uart0_putc(char, FILE*) (the signature avr-libc expects for the put function) to send each character as you wish.
Hope this helps.
Cheers.
Depending on how you've linked it, there are two alternatives that are possibly simpler:
Use sprintf() to write your formatted text to a string, and then use your own putchar() or putstring() to send it to the desired USART.
If you're using the FILE struct to link your USARTs to the stdio functions (likely), you can use fprintf() to direct the results to a particular stream.
I am making a C application in Unix that uses raw tty input.
I am calling write() to put characters on the display, but I want to manipulate the cursor:
ssize_t
write(int d, const void *buf, size_t nbytes);
I've noticed that if buf has the value 8 (I mean char tmp = 8, then passing &tmp), it will move the cursor/pointer backward on the screen.
I was wondering where I could find all the codes. For example, I wish to move the cursor forward, but I cannot seem to find out how via Google.
Is there a page that lists all the codes for the write() function, please?
Thank you very much,
Jary
8 is just the ASCII code for backspace. You can type man ascii and look at all the values (the man page on my Ubuntu box has friendlier names for the values). If you want to do more complicated things you may want to look at a library like ncurses.
You have just discovered that character code 8 is backspace (control-H).
You would probably be best off using the curses library to manage the screen. However, you can find out what control sequences curses knows about by using infocmp to decompile the terminfo entry for your terminal. The format isn't particularly easy to understand, but it is relatively comprehensive. The alternative is to find a manual for the terminal, which tends to be rather hard.
For instance, I'm using a color Xterm window; infocmp says:
# Reconstructed via infocmp from file: /usr/share/terminfo/78/xterm-color
xterm-color|nxterm|generic color xterm,
am, km, mir, msgr, xenl,
colors#8, cols#80, it#8, lines#24, ncv#, pairs#64,
acsc=``aaffggiijjkkllmmnnooppqqrrssttuuvvwwxxyyzz{{||}}~~,
bel=^G, bold=\E[1m, clear=\E[H\E[2J, cr=^M,
csr=\E[%i%p1%d;%p2%dr, cub=\E[%p1%dD, cub1=^H,
cud=\E[%p1%dB, cud1=^J, cuf=\E[%p1%dC, cuf1=\E[C,
cup=\E[%i%p1%d;%p2%dH, cuu=\E[%p1%dA, cuu1=\E[A,
dch=\E[%p1%dP, dch1=\E[P, dl=\E[%p1%dM, dl1=\E[M, ed=\E[J,
el=\E[K, enacs=\E)0, home=\E[H, ht=^I, hts=\EH, il=\E[%p1%dL,
il1=\E[L, ind=^J,
is2=\E[m\E[?7h\E[4l\E>\E7\E[r\E[?1;3;4;6l\E8, kbs=^H,
kcub1=\EOD, kcud1=\EOB, kcuf1=\EOC, kcuu1=\EOA,
kdch1=\E[3~, kf1=\E[11~, kf10=\E[21~, kf11=\E[23~,
kf12=\E[24~, kf13=\E[25~, kf14=\E[26~, kf15=\E[28~,
kf16=\E[29~, kf17=\E[31~, kf18=\E[32~, kf19=\E[33~,
kf2=\E[12~, kf20=\E[34~, kf3=\E[13~, kf4=\E[14~,
kf5=\E[15~, kf6=\E[17~, kf7=\E[18~, kf8=\E[19~, kf9=\E[20~,
kfnd=\E[1~, kich1=\E[2~, kmous=\E[M, knp=\E[6~, kpp=\E[5~,
kslt=\E[4~, meml=\El, memu=\Em, op=\E[m, rc=\E8, rev=\E[7m,
ri=\EM, rmacs=^O, rmcup=\E[2J\E[?47l\E8, rmir=\E[4l,
rmkx=\E[?1l\E>, rmso=\E[m, rmul=\E[m,
rs2=\E[m\E[?7h\E[4l\E>\E7\E[r\E[?1;3;4;6l\E8, sc=\E7,
setab=\E[4%p1%dm, setaf=\E[3%p1%dm, sgr0=\E[m, smacs=^N,
smcup=\E7\E[?47h, smir=\E[4h, smkx=\E[?1h\E=, smso=\E[7m,
smul=\E[4m, tbc=\E[3g, u6=\E[%i%d;%dR, u7=\E[6n,
u8=\E[?1;2c, u9=\E[c,
That contains information about box drawing characters, code sequences generated by function keys, various cursor movement sequences, and so on.
You can find out more about X/Open Curses (v4.2) in HTML. However, that is officially obsolete, superseded by X/Open Curses v7, which you can download for free in PDF.
If you're using write just so you have low-level cursor control, I think you are using the wrong tool for the job. There are command codes for many types of terminal. VT100 codes, for example, are sequences of the form "\x1b[...", but rather than sending raw codes, you'd be much better off using a library like ncurses.
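For simple cases where a curses library is not wanted, the cursor-movement entries from the xterm-color terminfo dump above (cuf, cub, cuu, cud, cup) can be written out by hand. The sketch below is in Python purely for illustration and assumes an xterm-compatible terminal; other terminals may use different sequences, which is the reason curses and terminfo exist. The function names are hypothetical.

```python
import sys

ESC = "\x1b"  # shown as \E in terminfo output


def cursor_forward(n=1):
    """cuf: move the cursor n columns to the right."""
    return f"{ESC}[{n}C"


def cursor_back(n=1):
    """cub: move the cursor n columns to the left."""
    return f"{ESC}[{n}D"


def cursor_up(n=1):
    """cuu: move the cursor n rows up."""
    return f"{ESC}[{n}A"


def cursor_down(n=1):
    """cud: move the cursor n rows down."""
    return f"{ESC}[{n}B"


def cursor_to(row, col):
    """cup: move to an absolute position; row and col are 1-based here."""
    return f"{ESC}[{row};{col}H"


if __name__ == "__main__":
    # Step back over part of a word and overwrite one character.
    sys.stdout.write("hello" + cursor_back(3) + "L\n")
    sys.stdout.flush()
```

In C the same bytes would be passed to write(); only the escape sequences matter, not the language used to emit them.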
I'm trying to write a Lua script that reads input from other processes and analyzes it. For this purpose I'm using io.popen, and it works as expected on Windows, but on Unix (Solaris) reading from io.popen blocks, so the script just waits there until something comes along instead of returning immediately.
As far as I know I can't change the behavior of io.popen from within the script, and if at all possible I would rather not change the C code, because the script would then need to be bundled with the patched binary.
Does that leave me with any command-line solutions?
OK, I got no answers so far, but for posterity, in case someone needs a similar solution, I did more or less the following:
function my_popen(name, cmd)
    local process = {}
    -- run the command with its output redirected to a temporary file
    process.__proc = assert(io.popen(cmd..">"..name..".tmp", 'r'))
    -- read from that file instead of from the pipe
    process.__file = assert(io.open(name..".tmp", 'r'))
    process.lines = function(self)
        return self.__file:lines()
    end
    process.close = function(self)
        self.__proc:close()
        self.__file:close()
    end
    return process
end
proc = my_popen("somename", "some command")
while true do
    --do stuff
    for line in proc:lines() do
        print(line)
    end
    --do stuff
end
Your problem seems to be related to buffering. For some reason the pipe is waiting for some data to be read before it allows the opened program to write more to it, and that amount seems to be less than a line. What you can do is use io.popen(cmd):read"*a" to read everything. This should avoid the buffering problem. Then you can split the returned string (called output here) into lines with for line in string.gmatch(output, "[^\n]+") do something_with(line) end.
Your solution consists of dumping the output of the process to a file and reading that file. You can replace your use of io.popen with os.execute, and ignore the return value beyond checking that it is 0.