I'm having memory and optimization problems when looping over 200,000 documents of JSTOR's data for research. The documents are in XML format. More information can be found here: https://www.jstor.org/dfr/.
In the first step of the code I transform an XML file into a tidy dataframe in the following manner:
library(XML)

Transform <- function(x)
{
  a <- xmlParse(x)
  aTop <- xmlRoot(a)
  Journal <- xmlValue(aTop[["front"]][["journal-meta"]][["journal-title-group"]][["journal-title"]])
  Publisher <- xmlValue(aTop[["front"]][["journal-meta"]][["publisher"]][["publisher-name"]])
  Title <- xmlValue(aTop[["front"]][["article-meta"]][["title-group"]][["article-title"]])
  Year <- as.integer(xmlValue(aTop[["front"]][["article-meta"]][["pub-date"]][["year"]]))
  Abstract <- xmlValue(aTop[["front"]][["article-meta"]][["abstract"]])
  Language <- xmlValue(aTop[["front"]][["article-meta"]][["custom-meta-group"]][["custom-meta"]][["meta-value"]])
  df <- data.frame(Journal, Publisher, Title, Year, Abstract, Language, stringsAsFactors = FALSE)
  df
}
Next, I use this first function to transform a series of XML files into a single dataframe:
TransformFiles <- function(pathFiles)
{
  files <- list.files(pathFiles, "*.xml")
  df2 <- Transform(paste(pathFiles, files[1], sep = "/", collapse = ""))
  i <- 2
  while (i <= length(files))
  {
    df <- Transform(paste(pathFiles, files[i], sep = "/", collapse = ""))
    df2[i, ] <- df
    i <- i + 1
  }
  data.frame(df2)
}
When I have more than 100,000 files it takes several hours to run. With 200,000 it eventually breaks or gets too slow over time. Even on small sets, it noticeably slows down as it runs. Is there something I'm doing wrong? Could I do something to optimize the code? I've already tried rbind and bind_rows instead of allocating the values directly with df2[i,] <- df.
Avoid growing an object in a loop, as your assignment df2[i,] <- df does (which, by the way, only works if df has exactly one row), and avoid the bookkeeping required by while and its iterator, i.
Instead, consider building a list of data frames with lapply that you can then rbind together in one call outside the loop.
TransformFiles <- function(pathFiles)
{
  files <- list.files(pathFiles, "*.xml", full.names = TRUE)
  df_list <- lapply(files, Transform)
  final_df <- do.call(rbind, unname(df_list))
  # ALTERNATIVES FOR POSSIBLE PERFORMANCE:
  # final_df <- data.table::rbindlist(df_list)
  # final_df <- dplyr::bind_rows(df_list)
  # final_df <- plyr::rbind.fill(df_list)
  final_df
}
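Usage is then a single call (the folder name here is a hypothetical placeholder):
# point this at the directory holding the XML files
jstor_df <- TransformFiles("data/jstor_xml")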
I have time series data, stored in txt files under daily subfolders inside monthly folders.
setwd(".../2018/Jan")
parent.folder <- ".../2018/Jan"
sub.folders <- list.dirs(parent.folder, recursive = TRUE)[-1] # read the sub-folders under the parent folder
r.scripts <- file.path(sub.folders)
A_2018 <- list()
for (j in seq_along(r.scripts)) {
  A_2018[[j]] <- dir(r.scripts[j], "\\.txt$")
}
Of these .txt files, I removed the ones I don't want to use for further analysis, using the following code:
trim_to_two <- function(x) {
  runs <- rle(gsub("^L1_\\d{4}_\\d{4}_", "", x))
  return(cumsum(runs$lengths)[which(runs$lengths > 2)] * -1)
}

A_2018_new <- list()
for (j in seq_along(A_2018)) {
  A_2018_new[[j]] <- A_2018[[j]][trim_to_two(A_2018[[j]])]
}
Then I want to row-bind all the .txt files with a for loop. Before that, I would like to remove some lines in each txt file and add one new column containing the file name. The following is my code:
for (i in 1:length(A_2018_new)) {
  for (j in 1:length(A_2018_new[[i]])) {
    filename <- paste(str_sub(A_2018_new[[i]][j], 1, 14))
    assign(filename, read_tsv(complete_file_name, skip = 14, col_names = FALSE))
    Y <- r.scripts %>% str_sub(46, 49)
    MD <- r.scripts %>% str_sub(58, 61)
    HM <- filename %>% str_sub(9, 12)
    Turn <- filename %>% str_sub(14, 14)
    time_minute <- paste(Y, MD, HM, sep = "-")
    Map(cbind, filename, SampleID = names(filename))
  }
}
But I didn't get my desired output. I tried coding from other examples. Could anyone help explain what my code is missing?
Your code seems overly complex for what it is doing. Your problem, however, is not 100% clear (e.g. what is the pattern in your file names that determines what to import and what not?). Here are some pointers that would greatly simplify the code, and likely avoid the issue you are having.
Use lapply() or map() from the purrr package to iterate instead of a for loop. The benefit is that it places the different data frames in a list and you don't need to assign multiple data frames into their own objects in the environment. Since you tagged the tidyverse, we'll use the purrr functions.
library(tidyverse)
You could for instance retrieve the txt file paths, using something like
txt_files <- list.files(path = 'data/folder/', pattern = "txt$", full.names = TRUE) # remove the files you don't want here, with whatever logic applies
and then use map() with read_tsv() from readr like so:
mydata <- map(txt_files, read_tsv)
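If you need the import settings from your original loop (skipping 14 lines, no header), those arguments can be passed along inside map(), e.g.:
mydata <- map(txt_files, ~ read_tsv(.x, skip = 14, col_names = FALSE))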
Then for your manipulation, you can again use lapply() or map() to apply that manipulation to each data frame. The easiest way is to create a custom function, and then apply it to each data frame:
my_func <- function(df, filename) {
df |>
filter(...) |> # Whatever logic applies here
mutate(filename = filename)
}
and then use map2() to apply this function, iterating through the data and filenames, and then list_rbind() to bind the data frames across the rows.
mydata_output <- map2(mydata, txt_files, my_func) |>
list_rbind()
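For concreteness, my_func might end up looking something like this for your case (the substring positions are assumptions to adjust, since filename here is a full path rather than your 14-character prefix):
my_func <- function(df, filename) {
  df |>
    mutate(
      SampleID = basename(filename),
      HM = str_sub(basename(filename), 9, 12),
      Turn = str_sub(basename(filename), 14, 14)
    )
}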
I have a loop to read in a series of .csv files
for (i in 1:3)
{
  nam <- paste0("A_tree", i)
  assign(nam, read.csv(sprintf("/Users/sethparker/Documents/%d_tree_from_data.txt", i), header = FALSE))
}
This works fine and generates a series of data frames comparable to this example data:
A_tree1 <- data.frame(cbind(c(1:5),c(1:5),c(1:5)))
A_tree2 <- data.frame(cbind(c(2:6),c(2:6),c(2:6)))
A_tree3 <- data.frame(cbind(c(3:10),c(3:10),c(3:10)))
What I want to do is add column names, and populate 2 new columns with data (month and model run). My current successful approach is to do this individually, like this:
colnames(A_tree1) <- c("GPP","NPP","LA")
A_tree1$month <- seq.int(nrow(A_tree1))
A_tree1$run <- c("1")
colnames(A_tree2) <- c("GPP","NPP","LA")
A_tree2$month <- seq.int(nrow(A_tree2))
A_tree2$run <- c("2")
colnames(A_tree3) <- c("GPP","NPP","LA")
A_tree3$month <- seq.int(nrow(A_tree3))
A_tree3$run <- c("3")
This is extremely inefficient for the number of _tree objects I have. Attempts to modify the loop with paste0() or sprintf() to incorporate these desired manipulations have resulted in Error: target of assignment expands to non-language object. I think I understand why this error is appearing based on reading other posts (Error in <my code> : target of assignment expands to non-language object). Is it possible to do what I want within my for loop? If not, how could I automate this better?
You can use lapply:
n <- 3 # replace with your total number of _tree files
l <- lapply(1:n, function(i) {
  # this does the same as sprintf, but I prefer paste0
  # import the data for each index i
  r <- read.csv(
    paste0("/Users/sethparker/Documents/", i, "_tree_from_data.txt"),
    header = FALSE
  )
  # add the column names and the two new columns
  colnames(r) <- c("GPP", "NPP", "LA")
  r$month <- seq.int(nrow(r))
  r$run <- i
  return(r)
})
# lapply returns a list; if you want to append the tables,
# pipe it into a bind_rows() call (dplyr package)
library(dplyr)
l %>%
  bind_rows() # like this
I am trying to combine some Excel spreadsheets. There are 50 documents. I am looking to get sheets 2:5, except some only have sheets 2:3, 2:4, etc. - this is why I include the try() function. I need ranges F6:AZ2183 and I am transposing the data.
The issue I am running into is that only the last file is saved into the data frame df.
I attached the code below. If you have any ideas, I would much appreciate it!
Also, I'm a longtime lurker first time poster, so if my etiquette is poor, I apologize.
library(readxl)

df <- data.frame()
for (i in 1:50) {
  for (j in 2:5) {
    try({
      df.temp <- t(read_excel(paste0('FqReport', i, '.xlsx'), sheet = j, range = 'F6:AZ2183'))
      df.temp <- df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
      df <- rbind(df, df.temp)
      rm(df.temp)
      gc()
    }, silent = TRUE)
  }
}
You can read the sheets available in each Excel file, which avoids the use of try(). Also, growing a data frame in a loop is quite inefficient. Try this lapply approach:
library(readxl)

filename <- paste0('FqReport', 1:50, '.xlsx')
df <- do.call(rbind, lapply(filename, function(x) {
  # every sheet in the file except the first
  sheet_name <- excel_sheets(x)[-1]
  do.call(rbind, lapply(sheet_name, function(y) {
    df.temp <- t(read_excel(x, y, range = 'F6:AZ2183'))
    # drop rows that are entirely NA
    df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
  }))
}))
I know how to delete rows in a sequence for a SINGLE list:
library(data.table)

data <- data.table('A' = c(1, 2, 3, 4), 'B' = c(900, 6, NA, 2))
row.remove <- data[!(data$A %in% seq(from = 1, to = 4, by = 2))]
However, I would like to know how to do so with MULTIPLE lists.
Code I've tried:
file.number <- c(1:5)
data <- setNames(lapply(paste0(file.number, ".csv"), read.csv), file.number) # this line imports the lists from csv files - works
data.2 <- lapply(data, data.table) # seems to work
row.remove <- lapply(data.2, function(x) x[!(data.2$A %in% seq(from = 1, to = 4, by = 2))]) # no error message, but deletes all the rows
I feel like I'm missing something obvious, any help will be greatly appreciated.
Solution:
for (i in 1:5){
file.number = i
data <- setNames(lapply(paste(file.number,".csv"), read.csv, paste(file.number))
data <- as.data.table(data)
row.remove <- data[!(data$A = seq(from=1,to=4,by=2) )]
}
Instead of analyzing the lists simultaneously, this analyzes them one by one. It's not a full solution, but more of a workaround.
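For what it's worth, the all-at-once attempt likely deleted every row because the anonymous function referenced data.2$A instead of x$A. A minimal sketch with that fixed (assuming each file has a column A):
data.2 <- lapply(data, data.table)
row.remove <- lapply(data.2, function(x) x[!(x$A %in% seq(from = 1, to = 4, by = 2))])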
I am new to R and currently working on a set of financial data. I have around 10 csv files in my working directory, and I want to analyze one of them and then apply the same commands to the rest of the csv files.
Here are all the names of these files: ("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
For example, because the Date column in the CSV files is a Factor, I need to change it to Date format:
CAN <- read.csv("CAN%10y.csv", header = T, sep = ",")
CAN$Date <- as.character(CAN$Date)
CAN$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
CAN_merge <- merge(all.dates.frame, CAN, all = T)
CAN_merge$Bid.Yield.To.Maturity <- NULL
all.dates.frame is a data frame of 731 consecutive days. I want to merge them so that each file will have the same number of rows, which later enables me to combine the 10 files together to get a 731 x 11 master data frame.
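For reference, all.dates.frame was built along these lines (the start date here is a hypothetical placeholder):
all.dates.frame <- data.frame(Date = seq(as.Date("2012-01-01"), by = "day", length.out = 731))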
Surely I can copy and paste this code and change the file name, but is there any simple approach using apply or a for loop to do that?
Thank you very much for your help.
This should do the trick. Leave a comment if a certain part doesn't work; I wrote this blind without testing.
Get a list of files in your current directory ending in .csv:
L <- list.files(".", "\\.csv$")
Loop through each of the names, read in the file, perform the actions you want to perform, return the data.frame DF_Merge, and store the results in a list:
O <- lapply(L, function(x) {
  DF <- read.csv(x, header = T, sep = ",")
  DF$Date <- as.character(DF$Date)
  DF$Date <- as.Date(DF$Date, format = "%m/%d/%y")
  DF_Merge <- merge(all.dates.frame, DF, all = T)
  DF_Merge$Bid.Yield.To.Maturity <- NULL
  return(DF_Merge)
})
Bind all the DF_Merge data.frames into one big data.frame
do.call(rbind, O)
I'm guessing you need some kind of indicator, so this may be useful: create an indicator column based on the first 3 characters of each file name with rep(substring(L, 1, 3), each = 731).
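A sketch of attaching that indicator after binding (the column name Country is hypothetical, and this assumes each merged frame really has 731 rows):
big_df <- do.call(rbind, O)
big_df$Country <- rep(substring(L, 1, 3), each = 731)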
A dplyr solution (though untested since no reproducible example was given):
library(dplyr)
file_list <- c("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
can_l <- lapply(
file_list
, read.csv
)
can_l <- lapply(
can_l
, function(df) {
df %>% mutate(Date = as.Date(as.character(Date), format ="%m/%d/%y"))
}
)
# Rows do need to match when column-binding
can_merge <- left_join(
all.dates.frame
, bind_cols(can_l)
)
can_merge <- can_merge %>%
select(-Bid.Yield.To.Maturity)
One possible solution would be to read all the files into R in the form of a list, and then use lapply to apply a function to all the data files. For example:
# Create a vector of the file names in the working directory
files <- list.files()
files <- files[grep("csv", files)]

# Create an empty list
lst <- vector("list", length(files))

# Read the files into the list
for(i in 1:length(files)) {
  lst[[i]] <- read.csv(files[i])
}
# Apply a function to the list
l <- lapply(lst, function(x) {
  x$Date <- as.Date(as.character(x$Date), format = "%m/%d/%y")
  return(x)
})
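If you also need the merge step from your CAN example, a sketch applied to each list element (untested, mirroring your code) might look like:
merged <- lapply(l, function(x) {
  m <- merge(all.dates.frame, x, all = TRUE)
  m$Bid.Yield.To.Maturity <- NULL
  m
})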
Hope it's helpful.