I am trying to combine some Excel spreadsheets. There are 50 documents. I am looking to get sheets 2:5 from each, except some only have sheets 2:3, 2:4, etc.; this is why I include the try() call. I need the range F6:AZ2183, and I am transposing the data.
The issue I am running into is that only the last file is saved into the data frame df.
I attached the code below. If you have any ideas, I would much appreciate them!
Also, I'm a longtime lurker, first-time poster, so if my etiquette is poor, I apologize.
library(readxl)

df <- data.frame()
for (i in 1:50) {
  for (j in 2:5) {
    try({
      df.temp <- t(read_excel(paste0('FqReport', i, '.xlsx'), sheet = j, range = 'F6:AZ2183'))
      df.temp <- df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
      df <- rbind(df, df.temp)
      rm(df.temp)
      gc()
    }, silent = TRUE)
  }
}
You can read the sheet names available in each Excel file, which avoids the use of try(). Also, growing a data frame inside a loop is quite inefficient. Try this lapply approach:
library(readxl)

filename <- paste0('FqReport', 1:50, '.xlsx')

df <- do.call(rbind, lapply(filename, function(x) {
  sheet_name <- excel_sheets(x)[-1]
  do.call(rbind, lapply(sheet_name, function(y) {
    df.temp <- t(read_excel(x, y, range = 'F6:AZ2183'))
    df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
  }))
}))
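One caveat: excel_sheets(x)[-1] keeps every sheet after the first, so a workbook with a sixth sheet would be picked up too. A minimal variant that sticks to the question's positions 2:5, assuming position-based selection is what's wanted (read_excel accepts sheet positions as well as names):

df <- do.call(rbind, lapply(filename, function(x) {
  # positions 2:5 that actually exist in this workbook
  sheet_pos <- intersect(2:5, seq_along(excel_sheets(x)))
  do.call(rbind, lapply(sheet_pos, function(y) {
    df.temp <- t(read_excel(x, y, range = 'F6:AZ2183'))
    df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
  }))
}))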
I am trying to write efficient code that opens data files containing a list, extracts one element from that list, stores it in a data frame, and then deletes the object before opening the next one.
My idea is to do this using loops. Unfortunately, I am quite new to loops and don't know how to write the code.
I have managed to open the datasets using the following code:
for (i in 1995:2015) {
  objects = paste("C:/Users/...", i, "agg.rda", sep=" ")
  load(objects)
}
The problem is that each dataset is extremely large and R cannot hold all of them at once. Therefore, I am now trying to extract one element within each list, called tab_<<i value>>_agg[["A"]] (for example tab_1995_agg[["A"]]), then delete the object and iterate over each i (the different years).
I have tried the following code, but it does not work:
for (i in unique(1995:2015)) {
  objects = paste("C:/Users/...", i, "agg.rda", sep=" ")
  load(objects)
  tmp = cat("tab", i, "_agg[[\"A\"]]", sep = "")
  y <- rbind(y, tmp)
  rm(list=objects)
}
I apologize for any silly mistake (or question) and greatly appreciate any help.
Here’s a possible solution using a function to rename the object you’re loading in. I got loadRData from here. The loadRData function makes this a bit more approachable because you can load in the object with a different name.
Create some data for a reproducible example.
tab2000_agg <- list(
  A = 1:5,
  b = 6:10
)
tab2001_agg <- list(
  A = 1:5,
  d = 6:10
)

save(tab2000_agg, file = "2000_agg.rda")
save(tab2001_agg, file = "2001_agg.rda")
rm(tab2000_agg, tab2001_agg)
Using your loop idea.
loadRData <- function(fileName){
  # load the .rda into this function's environment,
  # then return the loaded object regardless of its saved name
  load(fileName)
  get(ls()[ls() != "fileName"])
}
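For example, with the 2000_agg.rda file saved above:

tab <- loadRData("2000_agg.rda")  # the list comes back under the name `tab`
tab[["A"]]
#> [1] 1 2 3 4 5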
y <- list()
for (i in 2000:2001) {
  objects <- paste("", i, "_agg.rda", sep="")
  data_list <- loadRData(objects)
  tmp <- data_list[["A"]]
  y[[as.character(i)]] <- tmp  # index by name so the list doesn't grow to 2000+ mostly-NULL slots
  rm(data_list)
}
y <- do.call(rbind, y)
You could also turn it into a function rather than use a loop.
getElement <- function(year){  # note: this masks base::getElement
  objects <- paste0("", year, "_agg.rda")
  data_list <- loadRData(objects)
  tmp <- data_list[["A"]]
  return(tmp)
}
y <- lapply(2000:2001, getElement)
y <- do.call(rbind, y)
Created on 2022-01-14 by the reprex package (v2.0.1)
I'm having memory and optimization problems when looping over 200,000 documents of JSTOR's Data for Research. The documents are in XML format. More information can be found here: https://www.jstor.org/dfr/.
In the first step of the code I transform an XML file into a tidy dataframe in the following manner:
library(XML)

Transform <- function(x) {
  a <- xmlParse(x)
  aTop <- xmlRoot(a)
  Journal <- xmlValue(aTop[["front"]][["journal-meta"]][["journal-title-group"]][["journal-title"]])
  Publisher <- xmlValue(aTop[["front"]][["journal-meta"]][["publisher"]][["publisher-name"]])
  Title <- xmlValue(aTop[["front"]][["article-meta"]][["title-group"]][["article-title"]])
  Year <- as.integer(xmlValue(aTop[["front"]][["article-meta"]][["pub-date"]][["year"]]))
  Abstract <- xmlValue(aTop[["front"]][["article-meta"]][["abstract"]])
  Language <- xmlValue(aTop[["front"]][["article-meta"]][["custom-meta-group"]][["custom-meta"]][["meta-value"]])
  df <- data.frame(Journal, Publisher, Title, Year, Abstract, Language, stringsAsFactors = FALSE)
  df
}
Next, I use this first function to transform a series of XML files into a single dataframe:
TransformFiles <- function(pathFiles) {
  files <- list.files(pathFiles, "*.xml")
  i = 2
  df2 <- Transform(paste(pathFiles, files[i], sep="/", collapse=""))
  while (i <= length(files)) {
    df <- Transform(paste(pathFiles, files[i], sep="/", collapse=""))
    df2[i,] <- df
    i <- i + 1
  }
  data.frame(df2)
}
When I have more than 100,000 files it takes several hours to run. With 200,000 it eventually breaks or becomes too slow over time. Even on small sets, it noticeably slows down as it runs. Is there something I'm doing wrong? Could I do something to optimize the code? I've already tried rbind and bind_rows instead of assigning the values directly with df2[i,] <- df.
Avoid growing an object in a loop, as your assignment df2[i,] <- df does (and which, by the way, only works if df has exactly one row), and avoid the bookkeeping required by while and its iterator i.
Instead, build a list of data frames with lapply that you can then rbind together in one call outside the loop.
TransformFiles <- function(pathFiles) {
  files <- list.files(pathFiles, "*.xml", full.names = TRUE)
  df_list <- lapply(files, Transform)
  final_df <- do.call(rbind, unname(df_list))

  # ALTERNATIVES FOR POSSIBLE PERFORMANCE:
  # final_df <- data.table::rbindlist(df_list)
  # final_df <- dplyr::bind_rows(df_list)
  # final_df <- plyr::rbind.fill(df_list)

  final_df
}
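Usage is then a single call; the directory path below is a hypothetical placeholder:

# hypothetical folder containing the JSTOR XML files
articles <- TransformFiles("path/to/jstor/xml")
str(articles)  # one row per article: Journal, Publisher, Title, Year, Abstract, Language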
I have been stuck on this task for quite a long time and have tried different approaches without success.
What I want is to apply the following four functions to 30 different datasets (data1, data2, ..., data30) within a for loop or whatever works in R. These datasets all have the same 10 columns but different numbers of rows.
This is the code I wrote for the first dataset (data1). It works well.
for (i in 1:nrow(data1)) {
  data1$simp <- diversity(data1$sp, "simpson")
  data1$shan <- diversity(data1$sp, "shannon")
  data1$E <- E(data1$sp)
  data1$D <- D(data1$sp)
}
I want to apply this code to the other 29 datasets so that I don't have to repeat the process 29 times.
The following code is what I am trying now, but it is still not right:
data.list <- list(data1, data2, data3, data4, data5)
for (i in data.list) {
  data2 <- NULL
  i$simp <- diversity(i$sp, "simpson")
  i$shan <- diversity(i$sp, "shannon")
  i$E <- E(i$sp)
  i$D <- D(i$sp)
  data2 <- rbind(data2, i)
  print(data2)
}
So I want to ask: how can I tell R to apply the functions to the other 29 datasets?
Thanks in advance!
You can do this with Map.
fun <- function(DF){
  # the assignments don't use the row index, so the row loop from the
  # question just repeated the same whole-column work; one pass is enough
  DF$simp <- diversity(DF$sp, "simpson")
  DF$shan <- diversity(DF$sp, "shannon")
  DF$E <- E(DF$sp)
  DF$D <- D(DF$sp)
  DF
}
result.list <- Map(fun, data.list)
Or, if you don't want a function fun sitting in the .GlobalEnv, use lapply:
result.list <- lapply(data.list, function(DF){
  DF$simp <- diversity(DF$sp, "simpson")
  DF$shan <- diversity(DF$sp, "shannon")
  DF$E <- E(DF$sp)
  DF$D <- D(DF$sp)
  DF
})
If I understand the question, you're ultimately asking about your data2 variable and how to merge everything together. I think the issue is that you're resetting data2 <- NULL on every loop iteration. The proposed solution below moves this assignment outside the loop, so the call to rbind() now appends all your data frames together and returns the consolidated dataset.
data.list <- list(data1, data2, data3, data4, data5)  # all 29 can go here

data2 <- NULL
for (i in data.list) {
  i$simp <- diversity(i$sp, "simpson")
  i$shan <- diversity(i$sp, "shannon")
  i$E <- E(i$sp)
  i$D <- D(i$sp)
  data2 <- rbind(data2, i)
}
print(data2)
I am assuming that your data1, ..., dataN are files stored in a directory, that you're reading them one at a time, and that they share the same header.
What you can do is import them one at a time and then perform the operations you want, as you mentioned:
files <- list.files(directoryPath)  # maybe you can grep() some specific files

results <- list()
for (f in files) {
  data <- read.table(f)  # choose header, sep and so on...
  data$simp <- diversity(data$sp, "simpson")
  data$shan <- diversity(data$sp, "shannon")
  data$E <- E(data$sp)
  data$D <- D(data$sp)
  results[[f]] <- data  # keep each processed table instead of overwriting it
}
Be careful: you must either be in the working directory or add a path to the filename while reading the tables (e.g. paste(path, f, sep="")).
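A small sketch of that path handling, with a hypothetical directoryPath; list.files() can also return ready-to-read full paths directly:

directoryPath <- "path/to/data"  # hypothetical location of data1, ..., dataN

# full.names = TRUE returns complete paths, so no paste() is needed
files <- list.files(directoryPath, full.names = TRUE)

# equivalent manual construction for a single file name f:
f <- "data1.txt"                     # hypothetical file name
full <- file.path(directoryPath, f)  # safer than paste(path, f, sep="")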
There are plenty of options; here's one using only base functions:
data.list <- list(data1, data2, data3, data4, data5)

changed_data <- lapply(data.list, function(my_data) {
  my_data$simp <- diversity(my_data$sp, "simpson")
  my_data$shan <- diversity(my_data$sp, "shannon")
  my_data$E <- E(my_data$sp)
  my_data$D <- D(my_data$sp)
  my_data
})
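Optionally, you could name the list so you can tell which result came from which dataset, a small sketch:

names(changed_data) <- paste0("data", seq_along(data.list))
head(changed_data$data1)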
I am new to R and currently working on a set of financial data. I have around 10 CSV files in my working directory, and I want to analyze one of them and then apply the same commands to the rest of the CSV files.
Here are all the names of these files: ("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
For example, because the Date column in the CSV files is a Factor, I need to convert it to Date format:
CAN <- read.csv("CAN%10y.csv", header = T, sep = ",")
CAN$Date <- as.character(CAN$Date)
CAN$Date <- as.Date(CAN$Date, format ="%m/%d/%y")
CAN_merge <- merge(all.dates.frame, CAN, all = T)
CAN_merge$Bid.Yield.To.Maturity <- NULL
all.dates.frame is a data frame of 731 consecutive days. I want to merge each file with it so that every file ends up with the same number of rows, which will later let me combine the 10 files into a 731 x 11 master data frame.
Surely I can copy and paste this code and change the file name each time, but is there a simpler approach using apply or a for loop?
Thank you very much for your help.
This should do the trick. Leave a comment if a certain part doesn't work; I wrote this blind, without testing.
Get a list of the files in your current directory whose names end in .csv:
L = list.files(".", "\\.csv$")
Loop through each of the names, read in each file, perform the actions you want, return the data.frame DF_Merge, and store the results in a list:
O = lapply(L, function(x) {
  DF <- read.csv(x, header = T, sep = ",")
  DF$Date <- as.character(DF$Date)
  DF$Date <- as.Date(DF$Date, format = "%m/%d/%y")
  DF_Merge <- merge(all.dates.frame, DF, all = T)
  DF_Merge$Bid.Yield.To.Maturity <- NULL
  return(DF_Merge)
})
Bind all the DF_Merge data.frames into one big data.frame:
do.call(rbind, O)
I'm guessing you need some kind of indicator of which file each row came from, so this may be useful: create an indicator column based on the first 3 characters of the file name with rep(substring(L, 1, 3), each = 731).
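A minimal sketch of that indicator step, assuming each merged frame really has 731 rows after merging with all.dates.frame:

big <- do.call(rbind, O)
# first 3 characters of each source file name, e.g. "US%", "UK%", "GER",
# repeated once per row of that file's 731-row frame
big$Country <- rep(substring(L, 1, 3), each = 731)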
A dplyr solution (though untested since no reproducible example given):
library(dplyr)
file_list <- c("US%10y.csv", "UK%10y.csv", "GER%10y.csv","JAP%10y.csv", "CHI%10y.csv", "SWI%10y.csv","SOA%10y.csv", "BRA%10y.csv", "CAN%10y.csv", "AUS%10y.csv")
can_l <- lapply(
file_list
, read.csv
)
can_l <- lapply(
can_l
, function(df) {
df %>% mutate(Date = as.Date(as.character(Date), format ="%m/%d/%y"))
}
)
# Rows do need to match when column-binding
can_merge <- left_join(
all.dates.frame
, bind_cols(can_l)
)
can_merge <- can_merge %>%
select(-Bid.Yield.To.Maturity)
One possible solution would be to read all the files into R as a list, and then use lapply to apply a function to each data file. For example:
# Create vector of file names in the working directory
files <- list.files()
files <- files[grep("csv", files)]

# Create an empty list
lst <- vector("list", length(files))

# Read the files into the list
for (i in 1:length(files)) {
  lst[[i]] <- read.csv(files[i])
}

# Apply a function to the list
l <- lapply(lst, function(x) {
  x$Date <- as.Date(as.character(x$Date), format = "%m/%d/%y")
  return(x)
})
Hope it's helpful.
I have some bloated code in R which I'm trying to streamline. I'm trying to read spreadsheets into data frames and then transpose each one.
I have a list as follows:
var <- c("amp_genes.annotated.BLCA.txt", "amp_genes.annotated.BRCA.txt")

for (i in var) {
  var[i] <- readWorksheet(wk, sheet="var[i]", header=T)
  var[i] <- as.data.frame(var[i])
  var[i] <- t(var1[i][3:ncol(var1[i]),])
}
The sheet = line has to have double quotes around the string variable. As it stands, this just tells me I have an unexpected }.
Maybe try this; I'm not sure it will work since I don't have your spreadsheets, but give it a try and let me know. Even if it doesn't work right away, it will hopefully unblock you wherever you're stuck.
library(XLConnect)

wk <- loadWorkbook("workbookname.xls")
sheetnames <- getSheets(object = wk)
content.tr <- list()

# To access sheets by their names
for (sheetname in sheetnames) {
  content <- readWorksheet(wk, sheet = sheetname, header = T)
  content.tr[[sheetname]] <- t(content[3:ncol(content), ])
}

# To access sheets by their position
for (pos in c(1, 2)) {
  content <- readWorksheet(wk, sheet = pos, header = T)
  content.tr[[sheetnames[pos]]] <- t(content[3:ncol(content), ])
}
To access the dataframes:
names(content.tr)
spreadsheet1 <- content.tr[[1]]
spreadsheet2 <- content.tr[[2]]