downloading data and saving data to a folder in batches - r
I have 200,000 links that I am trying to download, I have tried downloading it all in one go but I ran into memory issues.
I am trying to create a function which will download 1000 links at a time and save them in a folder.
Packages:
library(dplyr)
library(purrr)
library(edgarWebR)
A small sample of the data is as follows:
Data 1:
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm"
)
I then apply the following function to download these 10 links
parsed_files <- map(urls_to_parse, possibly(parse_filing, otherwise = NA))
Which stores it as a nice list, I can then apply names(parsed_files) <- urls_to_parse to name the lists as the links from where they were downloading them from. I can also use output <- plyr::ldply(parsed_files, data.frame) to store everything in a nice data frame.
Using the below data, how could I create batches to download the data in say batches of 10?
What I have currently:
start = 1
end = 100
output <- NULL
output_fin <- NULL
for(i in start:end){
output[[i]] <- map(urls_to_parse[[i]], possibly(parse_filing, otherwise = NA))
names(output) <- urls_to_parse[start:end]
save(output_fin, file = paste0("C:/Users/Downloads/data/",i, "output.RData"))
}
I am sure there is a better way using a function, since this code breaks for some of the results.
More data: - 100 links
urls_to_parse <- c("https://www.sec.gov/Archives/edgar/data/1750/000104746918004978/a2236183z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746917004528/a2232622z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746916014299/a2228768z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746915006136/a2225345z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746914006243/a2220733z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746913007797/a2216052z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746912007300/a2210166z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746911006302/a2204709z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746910006500/a2199382z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746909006783/a2193700z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746908008126/a2186742z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465907055173/a07-18543_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465906047248/a06-15961_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000110465905033688/a05-12324_110k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746904023905/a2140220z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000104746903028005/a2116671z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/1750/000091205702033450/a2087919z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095012310108231/c61492e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095015208010514/n48172e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707018659/c22309e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013707000193/c11187e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000095013406000594/c01109e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677405000032/d16006.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000120677404000013/d13773.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000104746903001075/a2097401z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/61478/000091205702001614/a2067550z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752308008030/a5800571.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752307009801/a5515869.htm",
"https://www.sec.gov/Archives/edgar/data/319126/000115752306009238/a5227919.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046908000102/alpharmainc_10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046907000017/alo10k2006.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046906000027/alo10k2005.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046905000021/alo10k2004final.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046904000058/alo10k2003master.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046903000001/alo10k.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046902000004/alo10k2001.htm",
"https://www.sec.gov/Archives/edgar/data/730469/000073046901500003/alo.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620118000009/a10k123117.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312517051216/d286458d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312516474605/d78287d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000119312515061145/d829913d10k.htm",
"https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620113000023/amr-10kx20121231.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000119312512063516/d259681d10k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095012311014726/d78201e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620110000006/ar123109.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000620109000009/ar120810k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000000451508000014/ar022010k.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013407003888/d43815e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013406003715/d33303e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013405003726/d22731e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000095013404002668/d12953e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/6201/000104746903013301/a2108197z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095013407003823/h42902e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012906002343/h31028e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/65695/000095012905002955/h22337e10vk.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459018005085/cece-10k_20171231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459017004264/cece-10k_20161231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000156459016015157/cece-10k_20151231.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312515095828/d864880d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312514098407/d661608d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312513109153/d444138d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312512119293/d293768d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312511067373/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312510069639/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312509055504/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312508058939/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312507071909/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312506068031/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312505077739/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/3197/000119312504052176/d10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465910047121/a10-16705_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000114420409046933/v159572_10k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000110465906060737/a06-19311_110k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746905022854/a2162888z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746904028585/a2143353z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/2601/000104746903031974/a2119476z10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000143774918010388/avx20180331_10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916317000028/avx-20170331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916316000079/avx-20160331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916315000024/avx-20150331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916314000035/avx-20140331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916313000022/avx-20130331x10k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916312000024/avxform10kfy12.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916311000013/avxform10kfy11.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916310000020/avxform10kfy10.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916309000117/form10kfy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000192/form10qq1fy09.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916308000101/form10kfy08.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916307000122/form10kfy07.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916306000102/avxfy06form10-k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916305000094/fy0510k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916304000091/fy0410k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916303000020/fy0310k.htm",
"https://www.sec.gov/Archives/edgar/data/859163/000085916302000007/r10k-0302.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462218000018/pnw2017123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462217000010/pnw2016123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462216000087/pnw2015123110-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000076462215000013/pnw12311410-k.htm",
"https://www.sec.gov/Archives/edgar/data/7286/000110465914012068/a13-25897_110k.htm"
)
Looping over to do batch job as you showed is a bad idea. If you have a 1000s of files to be downloaded, how do you recover from errors?
The performance is not solely depend on your computer's configuration, but the network performance is crucial.
Here are couple of suggestions.
Option 1
partition all URLs in to batches to be able to download them parallelly. The number of files to be downloaded could be equal to number of cores in your computer. Look at this question; reading multiple files quickly in R
store these batches in a queue objects - For ex: using a package like https://cran.r-project.org/web/packages/dequer/dequer.pdf
pop the queue and use the batch of URLs in your parallel file download function.
Use a retryable file download function like in -- HTTP error 400 in R, error handling, How to retry instead of forcing to stop?
Once the queue is completed, move to the next partition.
wrap the whole operation in a retryable loop. For example; How to retry a statement on error?
Why do I use a queue? Because you could retry on error easily.
A pseudo code
file_url_partitions <- partion_as_batches(all_urls, batch_size)
attempts = 3
while( file_url_partitions is not empty && attempt <= 3 ) {
batch = file_url_partitions.pop()
tryCatch({
download_parallel(batch)
}, some_exception = function(se) {
file_url_partitions.push(batch)
attemp = attempt+1
})
}
Note: I don't have access to R studio/environment now hence no way to try.
Option 2
Download files separately using a download manager/similar and use downloaded files.
Some useful resources:
https://www.r-bloggers.com/r-with-parallel-computing-from-user-perspectives/
http://adv-r.had.co.nz/beyond-exception-handling.html
Related
How do I create a loop function to apply acoustic indices from "soundecology" to specific sections of .wav files using R
I have a large quantity of .wav files that I need to analyze using the acoustic indices from the "soundecology" package in R. However, the recordings do not have uniform start times and I need to analyze specific periods of time within the files. I want to create a function and loop for automating the process. I have created a spread sheet for each folder of recordings (each folder is a different location) that lays out the recording and the times within each recording that I need to analyze. Basically, a row contains: the sound file name, the time when the sample should start (eg. 09:00:00, the number of seconds from the start of the file that that time occurs, and the munber of seconds from the start time of the file that the end of the sample should occur. That data looks like this: Spread sheet of data I am using the package "tuneR" and "warbleR" to select the specific portion of a sound file that I want to analyze. Here is the the code and the output that I would like to loop across all the sound files: wavrow1 <-read_wave(mvb$sound.files[1], from = mvb$start[1], to = mvb$end[1]) wavrow1.aci <- acoustic_complexity(wavrow1, j=10) which yeilds max_freq not set, using value of: 22050 min_freq not set, using value of: 0 This is a mono file. Calculating index. Please wait... Acoustic Complexity Index (total): 934.568 However, when I put this into a function in order to then put it into a loop I get a different output. acianalyzeFUN <- function(mvb, i){ r <- read_wave(mvb$sound.files[i], mvb$start[i], mvb$end[i]) soundfile.aci <- acoustic_complexity(r, j=10) } row1.test <- acianalyzeFUN(mvb, 1) This gives the output: max_freq not set, using value of: 22050 min_freq not set, using value of: 0 This is a mono file. Calculating index. Please wait... Acoustic Complexity Index (total): 19183.03 Acoustic Complexity Index (by minute): 931.98 Which is different. So I need to fix this function and put it into a loop so that I can apply it across all the files and save the results into a data frame or ultimately another spread sheet. I was thinking a loop like the following might work but I am also getting errors with it: output <- vector("logical", length(97)) for (i in seq_along(mvb$sound.files)) { output[[i]] <- acianalyzeFUN(mvb, i) } Which returns this error: max_freq not set, using value of: 22050 min_freq not set, using value of: 0 This is a mono file. Calculating index. Please wait... Acoustic Complexity Index (total): 19183.03 Acoustic Complexity Index (by minute): 931.98 Error in output[[i]] <- acianalyzeFUN(mvb, i) : more elements supplied than there are to replace Thanks for any help and advice on this. Please let me know if there are any other pieces of information that would be helpful.
the read_wave function takes following arguments : read_wave(X, index, from = X$start[index], to = X$end[index], channel = NULL, header = FALSE, path = NULL) In the manual test, you specify from = mvb$start[1], to = mvb$end[1] In the function you created, you dont specify the arguments : r <- read_wave(mvb$sound.files[i], mvb$start[i], mvb$end[i]) so that mvb$start[i] gets affected to index and mvb$end[i] to from. You should write: acianalyzeFUN <- function(mvb, i){ r <- read_wave(mvb$sound.files[i], from = mvb$start[i], to = mvb$end[i]) soundfile.aci <- acoustic_complexity(r, j=10) } This should explain the difference you observe. Regarding the error, you create a vector of logical to collect the result, but acianalyzeFUN returns nothing : it just sets two variables r and soundfileaci without returning anything.
Loop to wait for result or timeout in r
I've written a very quick blast script in r to enable interfacing with the NCBI blast API. Sometimes however, the result url takes a while to load and my script throws an error until the url is ready. Is there an elegant way (i.e. a tryCatch option) to handle the error until the result is returned or timeout after a specified time? library(rvest) ## Definitive set of blast API instructions can be found here: https://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/new/BLAST_URLAPI.html ## Generate query URL query_url <- function(QUERY, PROGRAM = "blastp", DATABASE = "nr", ...) { put_url_stem <- 'https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Put' arguments = list(...) paste0( put_url_stem, "&QUERY=", QUERY, "&PROGRAM=", PROGRAM, "&DATABASE=", DATABASE, arguments ) } blast_url <- query_url(QUERY = "NP_001117.2") ## test query blast_session <- html_session(blast_url) ## create session blast_form <- html_form(blast_session)[[1]] ## pull form from session RID <- blast_form$fields$RID$value ## extract RID identifier get_url <- function(RID, ...) { get_url_stem <- "https://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Get" arguments = list(...) paste0(get_url_stem, "&RID=", RID, "&FORMAT_TYPE=XML", arguments) } hits_xml <- read_xml(get_url(RID)) ## this is the sticky part Sometimes it takes several minutes for the get_url to go live so what I would like is to do is to keep trying let's say every 20-30 seconds until it either produces the url or times out after a pre-specified time.
I think you may find this answer about the use of tryCatch useful Regarding the 'keep trying until timeout' part. I imagine you can work on top of this other answer about a tryCatch loop on error Hope it helps.
Manual API rate limiting
I am trying to write a manual rate-limiting function for the rgithub package. So far this is what I have: library(rgithub) pull <- function(i){ commits <- get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100) links <- digest_header_links(commits) number_of_pages <- links[2,]$page if (number_of_pages != 0) try_default(for (n in 1:number_of_pages){ if (as.integer(commits$headers$`x-ratelimit-remaining`) < 5) Sys.sleep(as.integer(commits$headers$`x-ratelimit-reset`)-as.POSIXct(Sys.time()) %>% as.integer()) else get.pull.request.commits(owner = owner, repo = repo, id = i, ctx = get.github.context(), per_page=100, page = n) }, default = NULL) else return(commits) } list <- c(500, 501, 502) pull_lists <- lapply(list, pull) The intention i that if the x-ratelimit-remaining variable goes below a certain threshold the script should wait until the time specified in x-ratelimit-reset has passed, and then continue the script. However, I'm not sure if this is the actual behavior of the if else set up that I have here. The function runs fine, but I have some doubts about whether it actually does the rate limiting or whether it somehow skips that steps. Hence I ask: a) how can I find out if it actually does rate-limiting, and b) if not, how can I rewrite it so that it actually does rate limiting? Would a while condition/loop perhaps be better?
You can test if it does the rate limiting changing 5 to a large enough number and adding a display of the timing of Sys.sleep using: print(system.time(Sys.sleep(...))) That said, the function seems ok to me, unfortunately I cannot test it easily as rgithub is not available for my version of R (3.1.3).
Not a canonical answer, but some working example. You should add some logging in your script, even kind of write.csv(append=TRUE). I've implemented automatic antiddos process which prevent your ip to be banned by the exchange market. You can find it jangorecki/Rbitcoin/R/utils.R. Rbitcoin.last_api_call is env object stored in package namespace, kind of session package cache. This can help you with setting it in your package. You should also consider a optional parallel supported version. Linking to database with concurrency read. My function can be easy modified to queue call and recheck timing every X seconds. Edit I forget to add that mentioned function support multiple source systems. That allows for example to extend your rgithub for bitbucket, etc. and still effectively manage API rate limiting.
Speed up API calls in R
I am querying Freebase to get the genre information for some 10000 movies. After reading How to optimise scraping with getURL() in R, I tried to execute the requests in parallel. However, I failed - see below. Besides parallelization, I also read that httr might be a better alternative to RCurl. My questions are: Is it possible to speed up the API calls by using a) a parallel version of the loop below (using a WINDOWS machine)? b) alternatives to getURL such as GET in the httr-package? library(RCurl) library(jsonlite) library(foreach) library(doSNOW) df <- data.frame(film=c("Terminator", "Die Hard", "Philadelphia", "A Perfect World", "The Parade", "ParaNorman", "Passengers", "Pink Cadillac", "Pleasantville", "Police Academy", "The Polar Express", "Platoon"), genre=NA) f_query_freebase <- function(film.title){ request <- paste0("https://www.googleapis.com/freebase/v1/search?", "filter=", paste0("(all alias{full}:", "\"", film.title, "\"", " type:\"/film/film\")"), "&indent=TRUE", "&limit=1", "&output=(/film/film/genre)") temp <- getURL(URLencode(request), ssl.verifypeer = FALSE) data <- fromJSON(temp, simplifyVector=FALSE) genre <- paste(sapply(data$result[[1]]$output$`/film/film/genre`[[1]], function(x){as.character(x$name)}), collapse=" | ") return(genre) } # Non-parallel version # ---------------------------------- for (i in df$film){ df$genre[which(df$film==i)] <- f_query_freebase(i) } # Parallel version - Does not work # ---------------------------------- # Set up parallel computing cl<-makeCluster(2) registerDoSNOW(cl) foreach(i=df$film) %dopar% { df$genre[which(df$film==i)] <- f_query_freebase(i) } stopCluster(cl) # --> I get the following error: "Error in { : task 1 failed", further saying that it cannot find the function "getURL".
This doesn't achieve parallel requests within a single R session, however, it's something I've used to achieve >1 simultaneous requests (e.g. in parallel) across multiple R sessions, so it may be useful. At a high level You'll want to break the process into a few parts: Get a list of the URLs/API calls you need to make and store as a csv/text file Use the code below as a template for starting multiple R processes and dividing the work among them Note: this happened to run on windows, so I used powershell. On mac this could be written in bash. Powershell/bash script Use a single powershell script to start off multiple instances R processes (here we divide the work between 3 processes): e.g. save a plain text file with .ps1 file extension, you can double click on it to run it, or schedule it with task scheduler/cron: start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 1; TIMEOUT 20000 } start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 2; TIMEOUT 20000 } start powershell { cd C:\Users\Administrator\Desktop; Rscript extract.R 3; TIMEOUT 20000 } What's it doing? It will: Go the the Desktop, start a script it finds called extract.R, and provide an argument to the R script (1, 2, and 3). The R processes Each R process can look like this # Get command line argument arguments <- commandArgs(trailingOnly = TRUE) process_number <- as.numeric(arguments[1]) api_calls <- read.csv("api_calls.csv") # work out which API calls each R script should make (e.g. indicies <- seq(process_number, nrow(api_calls), 3) api_calls_for_this_process_only <- api_calls[indicies, ] # this subsets for 1/3 of the API calls # (the other two processes will take care of the remaining calls) # Now, make API calls as usual using rvest/jsonlite or whatever you use for that
Logfile analysis in R?
I know there are other tools around like awstats or splunk, but I wonder whether there is some serious (web)server logfile analysis going on in R. I might not be the first thought to do it in R, but still R has nice visualization capabilities and also nice spatial packages. Do you know of any? Or is there a R package / code that handles the most common log file formats that one could build on? Or is it simply a very bad idea?
In connection with a project to build an analytics toolbox for our Network Ops guys, i built one of these about two months ago. My employer has no problem if i open source it, so if anyone is interested i can put it up on my github repo. I assume it's most useful to this group if i build an R Package. I won't be able to do that straight away though because i need to research the docs on package building with non-R code (it might be as simple as tossing the python bytecode files in /exec along with a suitable python runtime, but i have no idea). I was actually suprised that i needed to undertake a project of this sort. There are at least several excellent open source and free log file parsers/viewers (including the excellent Webalyzer and AWStats) but neither parse server error logs (parsing server access logs is the primary use case for both). If you are not familiar with error logs or with the difference between them and access logs, in sum, Apache servers (likewsie, nginx and IIS) record two distinct logs and store them to disk by default next to each other in the same directory. On Mac OS X, that directory in /var, just below root: $> pwd /var/log/apache2 $> ls access_log error_log For network diagnostics, error logs are often far more useful than the access logs. They also happen to be significantly more difficult to process because of the unstructured nature of the data in many of the fields and more significantly, because the data file you are left with after parsing is an irregular time series--you might have multiple entries keyed to a single timestamp, then the next entry is three seconds later, and so forth. i wanted an app that i could toss in raw error logs (of any size, but usually several hundred MB at a time) have something useful come out the other end--which in this case, had to be some pre-packaged analytics and also a data cube available inside R for command-line analytics. Given this, i coded the raw-log parser in python, while the processor (e.g., gridding the parser output to create a regular time series) and all analytics and data visualization, i coded in R. I have been building analytics tools for a long time, but only in the past four years have i been using R. So my first impression--immediately upon parsing a raw log file and loading the data frame in R is what a pleasure R is to work with and how it is so well suited for tasks of this sort. A few welcome suprises: Serialization. To persist working data in R is a single command (save). I knew this, but i didn't know how efficient is this binary format. Thee actual data: for every 50 MB of raw logfiles parsed, the .RData representation was about 500 KB--100 : 1 compression. (Note: i pushed this down further to about 300 : 1 by using the data.table library and manually setting compression level argument to the save function); IO. My Data Warehouse relies heavily on a lightweight datastructure server that resides entirely in RAM and writes to disk asynchronously, called redis. The proect itself is only about two years old, yet there's already a redis client for R in CRAN (by B.W. Lewis, version 1.6.1 as of this post); Primary Data Analysis. The purpose of this Project was to build a Library for our Network Ops guys to use. My goal was a "one command = one data view" type interface. So for instance, i used the excellent googleVis Package to create a professional-looking scrollable/paginated HTML tables with sortable columns, in which i loaded a data frame of aggregated data (>5,000 lines). Just those few interactive elments--e.g., sorting a column--delivered useful descriptive analytics. Another example, i wrote a lot of thin wrappers over some basic data juggling and table-like functions; each of these functions i would for instance, bind to a clickable button on a tabbed web page. Again, this was a pleasure to do in R, in part becasue quite often the function required no wrapper, the single command with the arguments supplied was enough to generate a useful view of the data. A couple of examples of the last bullet: # what are the most common issues that cause an error to be logged? err_order = function(df){ t0 = xtabs(~Issue_Descr, df) m = cbind( names(t0), t0) rownames(m) = NULL colnames(m) = c("Cause", "Count") x = m[,2] x = as.numeric(x) ndx = order(x, decreasing=T) m = m[ndx,] m1 = data.frame(Cause=m[,1], Count=as.numeric(m[,2]), CountAsProp=100*as.numeric(m[,2])/dim(df)[1]) subset(m1, CountAsProp >= 1.) } # calling this function, passing in a data frame, returns something like: Cause Count CountAsProp 1 'connect to unix://var/ failed' 200 40.0 2 'object buffered to temp file' 185 37.0 3 'connection refused' 94 18.8 The Primary Data Cube Displayed for Interactive Analysis Using googleVis: A contingency table (from an xtab function call) displayed using googleVis)
It is in fact an excellent idea. R also has very good date/time capabilities, can do cluster analysis or use any variety of machine learning alogorithms, has three different regexp engines to parse etc pp. And it may not be a novel idea. A few years ago I was in brief email contact with someone using R for proactive (rather than reactive) logfile analysis: Read the logs, (in their case) build time-series models, predict hot spots. That is so obviously a good idea. It was one of the Department of Energy labs but I no longer have a URL. Even outside of temporal patterns there is a lot one could do here.
I have used R to load and parse IIS Log files with some success here is my code. Load IIS Log files require(data.table) setwd("Log File Directory") # get a list of all the log files log_files <- Sys.glob("*.log") # This line # 1) reads each log file # 2) concatenates them IIS <- do.call( "rbind", lapply( log_files, read.csv, sep = " ", header = FALSE, comment.char = "#", na.strings = "-" ) ) # Add field names - Copy the "Fields" line from one of the log files :header line colnames(IIS) <- c("date", "time", "s_ip", "cs_method", "cs_uri_stem", "cs_uri_query", "s_port", "cs_username", "c_ip", "cs_User_Agent", "sc_status", "sc_substatus", "sc_win32_status", "sc_bytes", "cs_bytes", "time-taken") #Change it to a data.table IIS <- data.table( IIS ) #Query at will IIS[, .N, by = list(sc_status,cs_username, cs_uri_stem,sc_win32_status) ]
I did a logfile-analysis recently using R. It was no real komplex thing, mostly descriptive tables. R's build-in functions were sufficient for this job. The problem was the data storage as my logfiles were about 10 GB. Revolutions R does offer new methods to handle such big data, but I at last decided to use a MySQL-database as a backend (which in fact reduced the size to 2 GB though normalization). That could also solve your problem in reading logfiles in R.
#!python import argparse import csv import cStringIO as StringIO class OurDialect: escapechar = ',' delimiter = ' ' quoting = csv.QUOTE_NONE parser = argparse.ArgumentParser() parser.add_argument('-f', '--source', type=str, dest='line', default=[['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"'''], ['''54.67.81.141 - - [01/Apr/2015:13:39:22 +0000] "GET / HTTP/1.1" 502 173 "-" "curl/7.41.0" "-"''']]) arguments = parser.parse_args() try: with open(arguments.line, 'wb') as fin: line = fin.readlines() except: pass finally: line = arguments.line header = ['IP', 'Ident', 'User', 'Timestamp', 'Offset', 'HTTP Verb', 'HTTP Endpoint', 'HTTP Version', 'HTTP Return code', 'Size in bytes', 'User-Agent'] lines = [[l[:-1].replace('[', '"').replace(']', '"').replace('"', '') for l in l1] for l1 in line] out = StringIO.StringIO() writer = csv.writer(out) writer.writerow(header) writer = csv.writer(out,dialect=OurDialect) writer.writerows([[l1 for l1 in l] for l in lines]) print(out.getvalue()) Demo output: IP,Ident,User,Timestamp,Offset,HTTP Verb,HTTP Endpoint,HTTP Version,HTTP Return code,Size in bytes,User-Agent 54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, - 54.67.81.141, -, -, 01/Apr/2015:13:39:22, +0000, GET, /, HTTP/1.1, 502, 173, -, curl/7.41.0, - This format can easily be read into R using read.csv. And, it doesn't require any 3rd party libraries.