I use R to parse XML data from a website. I have a list of 20,000 rows with URLs from which I need to extract data. My code gets the job done using a for loop, but it is very slow (it takes approx. 12 hours). I thought of using parallel processing (I have access to several CPUs) to speed it up, but I cannot make it work properly. Would it be more efficient to use a data table instead of a data frame? Is there any way to speed the process up? Thanks!
for (i in 1:nrow(list)) {
t <- xmlToDataFrame(xmlParse(read_xml(list$path[i]))) #Read the data into a file
t$ID <- list$ID[i]
emptyDF <- bind_rows(all, t) #Bind all into one file
if (i / 10 == floor(i / 10)) {
print(i)
} #print every 10th value to monitor progress of the loop
}
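This script should point you in the correct direction. It stores each parsed data frame in a list and binds them once at the end, instead of growing a data frame on every iteration: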
t<-list()
for (i in 1:nrow(list)) {
tempdf <- xmlToDataFrame(xmlParse(list$path[i])) #Read the data into a file
tempdf$ID <- list$ID[i]
t[[i]]<-tempdf
if (i %% 10 == 0) {
print(i)
} #print every 10th value to monitor progress of the loop
}
answer <- bind_rows(t) #Bind all into one file
Instead of a for loop, an lapply would also work here. Without any sample data, this is untested.
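As a rough illustration of the lapply variant mentioned above, here is a minimal sketch. The helper name collect_rows and the stand-in parser fake_parse are hypothetical and only there so the example runs without any XML files; the commented call shows how the XML functions from the answer might be plugged in. Like the answer itself, this has not been run against the real data.

```r
# Hypothetical helper: `index` is a data frame with columns `path` and `ID`
# (as in the question); `parse_fun` turns one path into a data frame.
collect_rows <- function(index, parse_fun) {
  pieces <- lapply(seq_len(nrow(index)), function(i) {
    piece <- parse_fun(index$path[i])
    piece$ID <- index$ID[i]
    piece
  })
  do.call(rbind, pieces)  # bind once at the end; dplyr::bind_rows(pieces) also works
}

# With the XML package this might be called as:
# answer <- collect_rows(list, function(p) XML::xmlToDataFrame(XML::xmlParse(p)))

# Stand-in parser (an assumption) so the sketch runs without XML files
fake_parse <- function(path) data.frame(value = nchar(path))
index <- data.frame(path = c("a.xml", "bb.xml"), ID = c(101, 102),
                    stringsAsFactors = FALSE)
answer <- collect_rows(index, fake_parse)
```

Because each file is handled independently, the inner function could probably also be handed to parallel::parLapply or parallel::mclapply, although whether that helps depends on how much of the 12 hours is spent waiting on the network.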
Okay, so I needed to split a larger file into a bunch of CSVs to run through a non-R program. I used this loop to do it:
for(k in 1:14){
inner_edge = 2000000L*(k-1) + 1
outter_edge = 2000000L*(k)
part <- slice(nc_tu_CLEAN, inner_edge:outter_edge)
out_name = paste0("geo/geo_CLEAN",(k),".csv")
write_csv(part,out_name)
Sys.time()
}
which worked great. However, I'm having a problem in this other program and need to read a bunch of these back in to troubleshoot. I tried to write this loop for it, and get the following error:
for(k in 1:6){
csv_name <- paste0("geo_CLEAN",(k),".csv")
geo_CLEAN_(k) <- fread(file= csv_name)
}
|--------------------------------------------------|
|==================================================|
Error in geo_CLEAN_(k) <- fread(file = csv_name) :
could not find function "geo_CLEAN_<-"
I know I could do this line by line, but I'd like it to be a loop if possible. What I want is for geo_CLEAN_1 to hold the result of fread on geo_CLEAN1.csv, geo_CLEAN_2 the result of fread on geo_CLEAN2.csv, etc.
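We need assign if we want to create separately named objects inside a loop, because the left-hand side of <- cannot be built from a variable:
for(k in 1:6){
csv_name <- paste0("geo_CLEAN",(k),".csv")
#Build the object name geo_CLEAN_<k> as a string and bind the table to it
assign(paste0("geo_CLEAN_", k), fread(file= csv_name))
}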
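A common alternative to assign is to keep the tables in one named list, which is often easier to loop over afterwards. The sketch below is only an illustration: read_parts is a hypothetical helper, it defaults to base read.csv so that it runs without data.table, and the demonstration files are throwaway ones written to a temporary directory.

```r
# Hypothetical helper: read files named <prefix><k>.csv into a named list.
read_parts <- function(ks, dir = ".", prefix = "geo_CLEAN", reader = read.csv) {
  paths <- file.path(dir, paste0(prefix, ks, ".csv"))
  parts <- lapply(paths, reader)
  names(parts) <- paste0(prefix, "_", ks)  # geo_CLEAN_1, geo_CLEAN_2, ...
  parts
}

# Small demonstration with throwaway files (not the real data)
demo_dir <- tempfile("geo")
dir.create(demo_dir)
for (k in 1:3) {
  write.csv(data.frame(k = k, x = 1:2),
            file.path(demo_dir, paste0("geo_CLEAN", k, ".csv")),
            row.names = FALSE)
}
geo <- read_parts(1:3, dir = demo_dir)
```

Each table is then available as geo$geo_CLEAN_1, geo[["geo_CLEAN_2"]], and so on. Passing reader = data.table::fread should give the fread behaviour from the question.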
I am currently working on an imputation project where I need to evaluate my imputation methods. I have an incomplete data frame with NAs, from which I calculate the missing rate for every column/variable. My second data frame contains the complete cases, which I extracted from the first data frame. I now want to simulate the missingness structure of the real data in the data frame containing the complete cases. The data frame with the generated NAs gets stored in the object "result", as you can see in the code. If I now want to replicate this code and thus generate 100 different data frames like "result", how do I replicate and save them separately?
I'm a beginner and would be really thankful for your answers!
I tried to put my loop, which generates the NAs, inside another loop that contains the replicate() command, counts from 1 to 100 and saves the 100 replicated data frames, but it didn't work at all.
result = data.frame(res0=rep(NA, dim(comp_cas)[1]))
for (i in 1:length(Z32_miss_item$miss_per_item)) {
dat = comp_cas[,i]
missRate = Z32_miss_item$miss_per_item[i]
cat (i, " ", paste0(dat, collapse=",") ," ", missRate, "!\n")
df <- data.frame("res"= GenMiss(x=dat, missrate = missRate), stringsAsFactors = FALSE)
colnames(df) = gsub("res", paste0("Var", i), colnames(df))
result = cbind(result, df)
}
result = result[,-1]
I expect every data frame of the 100 runs to be saved in a separate .rda file in my project folder.
Also, is imputation and the evaluation of its fit beginner stuff in R, or at what level of proficiency am I, judging by the code that I posted?
It is difficult to guess what exactly you are doing without some dummy data. But it is fine to have loops within loops and to save data.frames. Firstly, I would avoid the replicate function here as it has a strange syntax and just stick with plain loops. Secondly, you must make sure that the loops have different indexes (i.e. for(i ...) should be surrounded by, say, for(j ...)), since a for loop in R does not create its own scope and an inner loop that reuses the index would overwrite the outer one. Finally, use saveRDS rather than save: saveRDS writes a single object (here a data.frame) to its own .rds file, and readRDS returns it so that you can assign it to any name. save (and save.image, which saves the whole workspace so that you can pick up where you left off) stores objects together with their names, and load restores them under those names.
fun <- function(i){
df <- data.frame(x=rnorm(5))
names(df) <- paste0("x",i)
df
}
for(j in 1:100){
res <- data.frame(id=1:5)
for(i in 1:10){
res <- cbind(res, fun(i))
}
saveRDS(res, sprintf("replication_%s.rds",j))
}
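To load the saved replications again, something along the following lines should work. This is a minimal sketch: it writes three small dummy replications to a temporary directory (out_dir is an assumption, not the project folder from the question) and reads them back into a list with readRDS.

```r
# Write a few dummy replications to a temporary directory
out_dir <- tempfile("replications")
dir.create(out_dir)
for (j in 1:3) {
  res <- data.frame(id = 1:5, x = rnorm(5))
  saveRDS(res, file.path(out_dir, sprintf("replication_%s.rds", j)))
}

# Read every replication file back; each list element is one data.frame
files <- list.files(out_dir, pattern = "^replication_.*\\.rds$", full.names = TRUE)
replications <- lapply(files, readRDS)
```

Note that list.files returns the paths in alphabetical order, so with 100 files replication_10.rds sorts before replication_2.rds.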
I know there are a lot of posts on how to save data out of loops to data frames, but I've been having some trouble making it work for me. Currently I am only able to get my data using print, but I would like it to be put into a data frame instead. I can't predict how many lines of data or responses per line (although I just need a single true/false) it will give.
Suggestions on how to get the P loop to output data to a dataframe?
max <- max(x$a)
for (n in 1:max) {
print(n)
#right now i'm just printing the iteration and data to console
result <- x[x$a==n,"b"]
test <- unique(as.numeric(unlist(result)))
#Below is the loop i'd like to save the data from
for (P in test)
print({
ar <- x[x$b==P & x$a!=n,"a"]
ar1 <- sapply(unique(as.numeric(unlist(ar))),
function(f)
x[x$a==f & x$b!=P,"b"])
af <- sapply(ar1, function(f) any(match(f,result)))
})
}
Thanks!
Initialize an empty data frame:
results <- data.frame(it=numeric(), P=numeric(), value=logical())
And then instead of printing, just add this inside your loop:
results[nrow(results)+1,] <- list( [your 3 values separated by ","] )
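As a small worked illustration of this pattern, the sketch below appends one row per inner-loop iteration. The loop bounds and the TRUE/FALSE check are stand-ins (assumptions), not the logic from the question.

```r
# Empty data frame with typed columns, as above
results <- data.frame(it = numeric(), P = numeric(), value = logical())

for (n in 1:3) {
  for (P in c(10, 20)) {
    value <- (n * P) %% 20 == 0   # stand-in for the real TRUE/FALSE check
    results[nrow(results) + 1, ] <- list(n, P, value)  # append one row
  }
}
```

Appending row by row is convenient when the number of rows is not known in advance, but it copies the data frame each time, so for many thousands of rows it may be faster to collect the rows in a list and bind them once after the loop.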
I am using the parallel library in R to process a large data set on which I am applying complex operations.
For the sake of providing a reproducible code, you can find below a simpler example:
#data generation
dir <- "C:/Users/things_to_process/"
setwd(dir)
for(i in 1:800)
{
my.matrix <- matrix(runif(100),ncol=10,nrow=10)
saveRDS(my.matrix,file=paste0(dir,"/matrix",i))
}
#worker function
worker.function <- function(files)
{
files.length <- length(files)
partial.results <- vector('list',files.length)
for(i in 1:files.length)
{
matrix <- readRDS(files[i])
partial.results[[i]] <- sum(diag(matrix))
}
Reduce('+',partial.results)
}
#master part
cl <- makeCluster(detectCores(), type = "PSOCK")
file_list <- list.files(path=dir,recursive=FALSE,full.names=TRUE)
part <- clusterSplit(cl,seq_along(file_list))
files.partitioned <- lapply(part,function(p) file_list[p])
results <- clusterApply(cl,files.partitioned,worker.function)
result <- Reduce('+',results)
Essentially, I am wondering whether files read in parallel are actually read in an interleaved fashion, and whether the resulting I/O bottleneck would cut into the expected gain from running the tasks in parallel.
Would it be better to first read all matrices into a list and then send chunks of this list to each core for processing? And if the matrices were much larger, would I still be able to load all of them into a list at once?
Instead of saving each matrix in a separate RDS file, have you tried saving a list of N matrices in each file, where N is the number that is going to be processed by a single worker?
Then the worker.function looks like:
worker.function <- function(file) {
matrix_list <- readRDS(file)
partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
Reduce('+',partial_results)
}
You should save some time on I/O, and maybe even on computation, by replacing the for loop with lapply.
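One way the chunked files might be produced is sketched below. save_in_chunks is a hypothetical helper, the matrices are small dummy ones, and everything is written to a temporary directory; the worker is the one from this answer, applied serially here so the sketch runs without a cluster.

```r
# Hypothetical helper: distribute a list of matrices over `n_chunks` RDS files
save_in_chunks <- function(matrices, n_chunks, out_dir) {
  chunk_id <- rep(seq_len(n_chunks), length.out = length(matrices))
  chunks <- split(matrices, chunk_id)   # list of lists, one per chunk
  paths <- file.path(out_dir, sprintf("chunk_%02d.rds", seq_along(chunks)))
  for (k in seq_along(chunks)) saveRDS(chunks[[k]], paths[k])
  paths
}

# Worker from the answer: one file holds a list of matrices
worker.function <- function(file) {
  matrix_list <- readRDS(file)
  partial_results <- lapply(matrix_list, function(mat) sum(diag(mat)))
  Reduce('+', partial_results)
}

# Dummy data: eight 2x2 matrices with i on the diagonal (trace 2 * i)
matrices <- lapply(1:8, function(i) diag(i, 2))
out_dir <- tempfile("chunks")
dir.create(out_dir)
paths <- save_in_chunks(matrices, n_chunks = 4, out_dir = out_dir)
total <- Reduce('+', lapply(paths, worker.function))
```

With a cluster, the serial lapply in the last line could be swapped for clusterApply(cl, paths, worker.function), so that each worker opens a single file.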
I'm using R, but this question isn't specific to it. Suppose you've written some loop which downloads a URL at each iteration. You want to save this data, so you could either save at each iteration or hold the information and save it every nth iteration. Are there any general rules of thumb for doing this? How slow is it to open and close a file for writing all the time? What I have in mind is
for (i in 1:1000) {
data <- url("http://...i")
write.table(data, file="file")
}
versus something like this
data <- list()
length(data) <- 20
j <- 1
for (i in 1:1000) {
data[j] <-url("http://...i")
j <- j+1
if (j > 20) {j <- 1} #wrap around once all 20 slots are filled
if (i %% 20 == 0) {
write.table(data, file="file")
}
}
If all your downloaded data have the same form, you might want to append them to a single file, in which case you can do that at each iteration. Here is a short example:
sites<-c("714A","715A","716A")
for(i in seq_along(sites)){
data<-read.table(file=paste("http://www.ngdc.noaa.gov/mgg/geology/odp/data/115",sites[i],"paleomag.txt",sep="/"),sep="\t",header=TRUE)
#In this example I downloaded paleomagnetic data from deep sea drilling sites.
h <- (i==1) #Output the column names only the first time.
write.table(data,file="paleomag_leg115.txt",sep="\t",append=!h,col.names=h,row.names=FALSE)
}
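If opening the file at every iteration turns out to be too slow, the batching idea from the question can be combined with appending. The following is a minimal sketch under stated assumptions: fetch_one is a hypothetical stand-in for the download step, the batch size of 20 is taken from the question, and the output goes to a temporary file.

```r
# Hypothetical stand-in for the download step: returns a one-row data frame
fetch_one <- function(i) data.frame(id = i, value = i^2)

out_file <- tempfile(fileext = ".txt")
n_total <- 50
batch_size <- 20
buffer <- vector("list", batch_size)
n_buffered <- 0

for (i in seq_len(n_total)) {
  n_buffered <- n_buffered + 1
  buffer[[n_buffered]] <- fetch_one(i)
  # Flush when the buffer is full or the loop is about to end
  if (n_buffered == batch_size || i == n_total) {
    batch <- do.call(rbind, buffer[seq_len(n_buffered)])
    first_write <- !file.exists(out_file)  # header only on the first write
    write.table(batch, file = out_file, sep = "\t", append = !first_write,
                col.names = first_write, row.names = FALSE)
    n_buffered <- 0
  }
}

written <- read.table(out_file, sep = "\t", header = TRUE)
```

The trade-off is that anything still in the buffer is lost if the loop fails mid-batch, so a smaller batch size may be preferable when downloads are slow or unreliable.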