How can I tell R to apply functions to multiple data? - r

I have been stuck on this for quite a long time; I tried different approaches but couldn't succeed.
What I want is to apply the following 4 functions to 30 different datasets (data1, data2, data3, ..., data30) within a for loop or something similar in R. These datasets have the same number of columns (10) but different numbers of rows.
This is the code I wrote for the first dataset (data1). It works well.
for(i in 1:nrow(data1)){
data1$simp <-diversity(data1$sp, "simpson")
data1$shan <-diversity(data1$sp, "shannon")
data1$E <- E(data1$sp)
data1$D <- D(data1$sp)
}
I want to apply this code to the other 29 datasets without repeating the process 29 times.
The following code is what I am trying now, but it is still not right.
data.list <- list(data1, data2,data3,data4,data5)
for(i in data.list){
data2 <- NULL
i$simp <-diversity(i$sp, "simpson")
i$shan <-diversity(i$sp, "shannon")
i$E <- E(i$sp)
i$D <- D(i$sp)
data2 <- rbind(data2, i)
print(data2)
}
So I want to ask: how can I tell R to apply these functions to the other 29 datasets?
Thanks in advance!

You can do this with Map.
fun <- function(DF){
  # The functions work on the whole sp column, so no loop over rows is needed.
  DF$simp <- diversity(DF$sp, "simpson")
  DF$shan <- diversity(DF$sp, "shannon")
  DF$E <- E(DF$sp)
  DF$D <- D(DF$sp)
  DF
}
result.list <- Map(fun, data.list)
Or, if you don't want to have a function fun in the .GlobalEnv, with lapply.
result.list <- lapply(data.list, function(DF){
  # The functions work on the whole sp column, so no loop over rows is needed.
  DF$simp <- diversity(DF$sp, "simpson")
  DF$shan <- diversity(DF$sp, "shannon")
  DF$E <- E(DF$sp)
  DF$D <- D(DF$sp)
  DF
})
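
With 30 objects named data1 to data30, typing each one into list() gets tedious. A minimal sketch of collecting them by name with mget() follows. The three small data frames and the add_indices helper are stand-ins invented for illustration; the helper computes a Shannon index in base R in place of the diversity(), E() and D() calls from the question.

```r
# Stand-in data; in the question these would be data1 ... data30.
data1 <- data.frame(sp = c(10, 5, 1))
data2 <- data.frame(sp = c(3, 3, 3, 3))
data3 <- data.frame(sp = c(7, 2))

# Collect the objects by name instead of typing each one into list().
data.list <- mget(paste0("data", 1:3))

# Hypothetical stand-in for the diversity(), E() and D() calls in the question.
add_indices <- function(DF) {
  p <- DF$sp / sum(DF$sp)
  DF$shan <- -sum(p * log(p))  # Shannon index, recycled to every row
  DF
}

result.list <- lapply(data.list, add_indices)
names(result.list)
```

Because mget() returns a named list, each result can be looked up by its original name, for example result.list$data2.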

If I understand the question correctly, you're ultimately asking about your data2 variable and how to merge all the data frames together. I think the issue is that you're setting data2 <- NULL on every loop iteration. The solution below moves this assignment outside the loop, so the call to rbind() appends all your data frames and returns the consolidated dataset.
data.list <- list(data1, data2,data3,data4,data5) # all 30 can go here
data2 <- NULL
for(i in data.list){
i$simp <-diversity(i$sp, "simpson")
i$shan <-diversity(i$sp, "shannon")
i$E <- E(i$sp)
i$D <- D(i$sp)
data2 <- rbind(data2, i)
}
print(data2)
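
Once everything is bound together, it is no longer obvious which rows came from which data frame. A small sketch of one way to keep track, assuming the list is named (the two data frames and the source column name are made up for the example):

```r
# Stand-in data; the names on the list are what gets recorded per row.
data.list <- list(data1 = data.frame(sp = c(10, 5)),
                  data2 = data.frame(sp = c(3, 3, 3)))

# Tag each data frame with its list name, then bind once outside any loop.
tagged <- Map(function(DF, id) { DF$source <- id; DF },
              data.list, names(data.list))
combined <- do.call(rbind, unname(tagged))
```

Calling rbind once through do.call() also avoids copying the growing result on every iteration.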

I am assuming that your data1, ..., dataN are files stored in a directory and you're reading them one at a time. Also they have the same header.
What you can do is to import them one at a time and then perform the operations you want, as you mentioned:
files <- list.files(directoryPath) #maybe you can grep() some specific files
for (f in files){
data <- read.table(f) #choose header, sep and so on...
for(i in 1:nrow(data)){
data$simp <-diversity(data$sp, "simpson")
data$shan <-diversity(data$sp, "shannon")
data$E <- E(data$sp)
data$D <- D(data$sp)
}
}
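Be careful: you must either be in the directory that contains the files or add the path to the file name when reading the tables (e.g. paste(path, f, sep="")). Note also that data is overwritten on each iteration, so store or write out each result inside the loop if you need to keep it.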
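
A sketch of the same idea that sidesteps the working-directory issue and keeps every result. It writes two small example files to a temporary directory so that it can run on its own; the file names, the demo directory and the total column are assumptions for illustration, with total standing in for the diversity(), E() and D() calls.

```r
# Write two small example files into a temporary directory
# (stand-ins for the real data files).
dir_path <- file.path(tempdir(), "diversity_demo")
dir.create(dir_path, showWarnings = FALSE)
write.csv(data.frame(sp = c(10, 5, 1)), file.path(dir_path, "data1.csv"), row.names = FALSE)
write.csv(data.frame(sp = c(3, 3)), file.path(dir_path, "data2.csv"), row.names = FALSE)

# full.names = TRUE returns paths that can be read from any working directory.
files <- list.files(dir_path, pattern = "\\.csv$", full.names = TRUE)

# Keep every result in a named list instead of overwriting one variable.
results <- lapply(files, function(f) {
  DF <- read.csv(f)
  DF$total <- sum(DF$sp)  # placeholder for the diversity(), E() and D() calls
  DF
})
names(results) <- basename(files)
```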

There are plenty of options; here's one that uses lapply from base R:
data.list <- list(data1, data2, data3, data4, data5)
changed_data <- lapply(data.list, function(my_data) {
my_data$simp <-diversity(my_data$sp, "simpson")
my_data$shan <-diversity(my_data$sp, "shannon")
my_data$E <- E(my_data$sp)
my_data$D <- D(my_data$sp)
my_data})

Related

Adding new columns and column names in a loop in R

I have a loop to read in a series of .csv files
for (i in 1:3)
{
nam <- paste0("A_tree", i)
assign(nam, read.csv(sprintf("/Users/sethparker/Documents/%d_tree_from_data.txt", i), header = FALSE))
}
This works fine and generates a series of files comparable to this example data
A_tree1 <- data.frame(cbind(c(1:5),c(1:5),c(1:5)))
A_tree2 <- data.frame(cbind(c(2:6),c(2:6),c(2:6)))
A_tree3 <- data.frame(cbind(c(3:10),c(3:10),c(3:10)))
What I want to do is add column names, and populate 2 new columns with data (month and model run). My current successful approach is to do this individually, like this:
colnames(A_tree1) <- c("GPP","NPP","LA")
A_tree1$month <- seq.int(nrow(A_tree1))
A_tree1$run <- c("1")
colnames(A_tree2) <- c("GPP","NPP","LA")
A_tree2$month <- seq.int(nrow(A_tree2))
A_tree2$run <- c("2")
colnames(A_tree3) <- c("GPP","NPP","LA")
A_tree3$month <- seq.int(nrow(A_tree3))
A_tree3$run <- c("3")
This is extremely inefficient for the number of _tree objects I have. Attempts to modify the loop with paste0() or sprintf() to incorporate these desired manipulations have resulted in Error: target of assignment expands to non-language object. I think I understand why this error is appearing based on reading other posts (Error in <my code> : target of assignment expands to non-language object). Is it possible to do what I want within my for loop? If not, how could I automate this better?
You can use lapply:
n <- 3 # set this to the total number of files (3 in the question)
l <- lapply(1:n, function(i) {
  # paste0() builds the same path as sprintf(); use whichever you prefer
  # import the file for index i
  r <- read.csv(
    paste0("/Users/sethparker/Documents/", i, "_tree_from_data.txt"),
    header = FALSE
  )
  # name the columns, then add the two new columns
  colnames(r) <- c("GPP", "NPP", "LA")
  r$month <- seq.int(nrow(r))
  r$run <- as.character(i)
  return(r)
})
# lapply returns a list; to combine the tables into one data frame,
# pass the list to bind_rows() from the dplyr package
dplyr::bind_rows(l)
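
The same pattern can be tried without any files or extra packages. The sketch below uses the example data from the question, held in a list rather than in separate A_tree objects, and combines the result with base R's rbind; the names trees and all_trees are made up for the example.

```r
# Example data from the question, held in a list rather than separate objects.
trees <- list(
  data.frame(cbind(1:5, 1:5, 1:5)),
  data.frame(cbind(2:6, 2:6, 2:6)),
  data.frame(cbind(3:10, 3:10, 3:10))
)

# Name the columns and add month and run for every element in one pass.
trees <- lapply(seq_along(trees), function(i) {
  r <- trees[[i]]
  colnames(r) <- c("GPP", "NPP", "LA")
  r$month <- seq_len(nrow(r))
  r$run <- as.character(i)
  r
})

# Stack the three data frames into one.
all_trees <- do.call(rbind, trees)
```

Keeping the data frames in a list avoids assign() entirely, which is what triggers the "target of assignment expands to non-language object" error when names are pasted together on the left-hand side.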

Issues with for loops in R

I am trying to combine some excel spreadsheets. There are 50 documents. I am looking to get sheets 2:5, except some only have sheets 2:3, 2:4, etc - this is why I include the try function. I need ranges F6:AZ2183 and I am transposing the data.
The issue I am running into is that only the last file is saving into the data frame df.
I attached the code below. If you have any ideas, I would much appreciate it!
Also, I'm a longtime lurker first time poster, so if my etiquette is poor, I apologize.
df <- data.frame()
for (i in 1:50){
for (j in 2:5) {
try({
df.temp <- t(read_excel((paste0('FqReport',i,'.xlsx')), sheet = j, range ='F6:AZ2183'))
df.temp <- df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
df <- rbind(df, df.temp)
rm(df.temp)
gc()
}, silent = TRUE)
}
}
You can read the names of the sheets available in each Excel file, which avoids the need for try. Also, growing a data frame in a loop is quite inefficient. Try this lapply approach.
library(readxl)
filename <- paste0('FqReport',1:50,'.xlsx')
df <- do.call(rbind, lapply(filename, function(x) {
sheet_name <- excel_sheets(x)[-1]
do.call(rbind, lapply(sheet_name, function(y) {
df.temp <- t(read_excel(x, y, range ='F6:AZ2183'))
df.temp[rowSums(is.na(df.temp)) != ncol(df.temp), ]
}))
}))
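
If a read can still fail for reasons other than a missing sheet, tryCatch() can return NULL for that piece so it is dropped before binding. A minimal base R sketch, where read_piece is a hypothetical reader invented for the example (it is not a readxl function):

```r
# Hypothetical reader: fails for "sheets" that do not exist (here, j > 3).
read_piece <- function(j) {
  if (j > 3) stop("sheet not found")
  data.frame(sheet = j, value = j * 10)
}

# tryCatch() returns NULL on failure; NULL entries are dropped before binding.
pieces <- lapply(2:5, function(j) tryCatch(read_piece(j), error = function(e) NULL))
pieces <- Filter(Negate(is.null), pieces)
df <- do.call(rbind, pieces)
```

Unlike try(..., silent = TRUE) wrapped around the whole loop body, this keeps the failure handling limited to the read itself, so an error in the later steps is not silently swallowed.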

Memory and optimization problems in JSTOR's XML big loop

I'm having memory and optimization problems when looping over 200,000 documents of JSTOR data for research. The documents are in XML format. More information can be found here: https://www.jstor.org/dfr/.
In the first step of the code, I transform an XML file into a tidy data frame in the following manner:
Transform <- function (x)
{
a <- xmlParse (x)
aTop <- xmlRoot (a)
Journal <- xmlValue(aTop[["front"]][["journal-meta"]][["journal-title group"]][["journal-title"]])
Publisher <- xmlValue (aTop[["front"]][["journal-meta"]][["publisher"]][["publisher-name"]])
Title <- xmlValue (aTop[["front"]][["article-meta"]][["title-group"]][["article-title"]])
Year <- as.integer(xmlValue(aTop[["front"]][["article-meta"]][["pub-date"]][["year"]]))
Abstract <- xmlValue(aTop[["front"]][["article-meta"]][["abstract"]])
Language <- xmlValue(aTop[["front"]][["article-meta"]][["custom-meta-group"]][["custom-meta"]][["meta-value"]])
df <- data.frame (Journal, Publisher, Title, Year, Abstract, Language, stringsAsFactors = FALSE)
df
}
Next, I use this first function to transform a series of XML files into a single data frame:
TransformFiles <- function (pathFiles)
{
files <- list.files(pathFiles, "*.xml")
i = 2
df2 <- Transform (paste(pathFiles, files[i], sep="/", collapse=""))
while (i<=length(files))
{
df <- Transform (paste(pathFiles, files[i], sep="/", collapse=""))
df2[i,] <- df
i <- i + 1
}
data.frame(df2)
}
With more than 100,000 files it takes several hours to run. With 200,000 it eventually breaks or gets too slow over time. Even with small sets, it noticeably runs slower as it goes. Is there something I'm doing wrong? Could I do something to optimize the code? I've already tried rbind and bind_rows instead of assigning the values directly with df2[i,] <- df.
Avoid growing objects in a loop, as your assignment df2[i,] <- df does (which, by the way, only works if df has one row), and avoid the bookkeeping that while requires with the iterator i.
Instead, consider building a list of data frames with lapply that you can then rbind together in a single call outside the loop.
TransformFiles <- function (pathFiles)
{
  # pattern is a regular expression, so match the .xml extension explicitly
  files <- list.files(pathFiles, "\\.xml$", full.names = TRUE)
  df_list <- lapply(files, Transform)
  final_df <- do.call(rbind, unname(df_list))
  # ALTERNATIVES FOR POSSIBLE PERFORMANCE:
  # final_df <- data.table::rbindlist(df_list)
  # final_df <- dplyr::bind_rows(df_list)
  # final_df <- plyr::rbind.fill(df_list)
  final_df
}
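
With hundreds of thousands of one-row data frames, the final rbind may itself become slow. One alternative sketch, under the assumption that each file yields exactly one record, is to return plain lists and build each column once. parse_one is a hypothetical stand-in for Transform, and ids stands in for the vector of file paths.

```r
# Hypothetical stand-in for Transform(): returns one record as a named list.
parse_one <- function(id) {
  list(Title = paste("Article", id), Year = 2000L + id)
}

ids <- 1:5  # stand-in for the vector of file paths
records <- lapply(ids, parse_one)

# Build each column once with vapply(), then call data.frame() a single time.
final_df <- data.frame(
  Title = vapply(records, function(r) r$Title, character(1)),
  Year  = vapply(records, function(r) r$Year, integer(1)),
  stringsAsFactors = FALSE
)
```

vapply() also checks that every record supplies a value of the declared type and length, so a malformed record fails early instead of producing a ragged result.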

R: Creating new variables in a for loop and assign values

I would like to create a new variable, assign a list of values, and write into a hierarchical data frame. I tried the below but it is not writing in.
for(i in 1:sample){
for(j in 1:10){
x[,j]<-0
name <- paste("hierdata[[i]]$Test", j, sep = "_")
assign(name, rowSums(alpha+beta+x)))
}
}
Appreciate some help.
To address @Moody_Mudskipper's comment:
The best way to declare hierarchical data is to use indexed lists:
hierdata <- lapply(1:sample,
function(iterator)
{
temp_list <- lapply(1:10,
function(j)
{
x[,j]<-0
value <- rowSums(alpha+beta+x)
return(value)
})
names(temp_list) <- paste0("temp_", 1:10)
return(temp_list)
})
Not really a "one-liner", but it contains all the good "stuff". lapply returns a list by default, so this simply nests a list inside each element of the outer list.
Hope you enjoy some "good practice". :)
I just wanted to address your question precisely in my first try.
Try using
for(i in 1:sample){
hierdata[[i]] <- list()
for(j in 1:10){
code_j_init <- paste0("hierdata[[",i,"]]$Test_",j,"<- list()")
eval(parse(text = code_j_init))
hierdata[[i]][[j]] <- list(1,2,3)
}
}
or
for(i in 1:sample){
hierdata[[i]] <- list()
names <- c()
for(j in 1:10){
hierdata[[i]][[j]] <- list(1,2,3)
names <- c(names,paste0("Test_",j))
}
names(hierdata[[i]]) <- names
}
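
A more compact sketch of the same nested, named structure, using setNames() around an inner lapply. Here n_groups stands in for sample from the question, and i * j is only a placeholder for the rowSums(alpha+beta+x) value.

```r
n_groups <- 2  # stand-in for `sample` in the question

hierdata <- lapply(seq_len(n_groups), function(i) {
  # One element per test; setNames() attaches Test_1 ... Test_10 in a single step.
  setNames(lapply(1:10, function(j) i * j), paste0("Test_", 1:10))
})

# Elements can then be reached by position and name.
hierdata[[2]]$Test_3
```

This avoids both assign() and eval(parse()), since the names are attached to the list itself rather than built into a variable name.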

R Function with for Loops creates several dataframes, how to have each one have a different name

I have a for loop that loops through a list of urls,
url_list <- c('http://www.irs.gov/pub/irs-soi/04in21id.xls',
'http://www.irs.gov/pub/irs-soi/05in21id.xls',
'http://www.irs.gov/pub/irs-soi/06in21id.xls',
'http://www.irs.gov/pub/irs-soi/07in21id.xls',
'http://www.irs.gov/pub/irs-soi/08in21id.xls',
'http://www.irs.gov/pub/irs-soi/09in21id.xls',
'http://www.irs.gov/pub/irs-soi/10in21id.xls',
'http://www.irs.gov/pub/irs-soi/11in21id.xls',
'http://www.irs.gov/pub/irs-soi/12in21id.xls',
'http://www.irs.gov/pub/irs-soi/13in21id.xls',
'http://www.irs.gov/pub/irs-soi/14in21id.xls',
'http://www.irs.gov/pub/irs-soi/15in21id.xls')
downloads an Excel file from each one, assigns it to a data frame, and performs a set of data cleaning operations on it.
library(gdata)
for (url in url_list){
test <- read.xls(url)
cols <- c(1,4:5,97:98)
test <- test[-(1:8),cols]
test <- test[1:22,]
test <- test[-4,]
test$Income <-test$Table.2.1...Returns.with.Itemized.Deductions..Sources.of.Income..Adjustments..Itemized.Deductions.by.Type..Exemptions..and.Tax..Items..by.Size.of.Adjusted.Gross.Income..Tax.Year.2015..Filing.Year.2016.
test$Total_returns <- test$X.2
test$return_dollars <- test$X.3
test$charitable_deductions <- test$X.95
test$charitable_deduction_dollars <- test$X.96
test[1:5] <- NULL
}
My problem is that the loop simply writes over the same dataframe for each iteration through the loop. How can I have it assign each iteration through the loop to a data frame with a different name?
Use assign. This question is a duplicate of this post: Change variable name in for loop using R
For your particular case, you can do something like the following:
for (i in seq_along(url_list)){
url <- url_list[i]
test <- read.xls(url)
cols <- c(1,4:5,97:98)
test <- test[-(1:8),cols]
test <- test[1:22,]
test <- test[-4,]
test$Income <-test$Table.2.1...Returns.with.Itemized.Deductions..Sources.of.Income..Adjustments..Itemized.Deductions.by.Type..Exemptions..and.Tax..Items..by.Size.of.Adjusted.Gross.Income..Tax.Year.2015..Filing.Year.2016.
test$Total_returns <- test$X.2
test$return_dollars <- test$X.3
test$charitable_deductions <- test$X.95
test$charitable_deduction_dollars <- test$X.96
test[1:5] <- NULL
assign(paste("test", i, sep=""), test)
}
You could write to a list:
result_list <- vector("list", length(url_list))
for (i_url in seq_along(url_list)){
url <- url_list[i_url]
...
result_list[[i_url]] <- test
}
You can also name the list
names(result_list) <- c("df1","df2","df3",...)
Here's another approach with lapply instead of for loops which will write all resulting data.frames as separate list items which can then be re-named (if needed).
url_list <- c('http://www.irs.gov/pub/irs-soi/04in21id.xls',
...
'http://www.irs.gov/pub/irs-soi/15in21id.xls')
readURLFunc <- function(z){
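# readxl reads local files only, so download to a temporary file first
tmp <- tempfile(fileext = ".xls")
download.file(z, tmp, mode = "wb")
test <- readxl::read_xls(tmp)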
...
test[1:5] <- NULL
return(test)}
data_list <- lapply(url_list, readURLFunc)
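
To rename the list items, the labels can be derived from the URLs themselves rather than typed by hand. A small sketch, assuming the two leading digits of each file name are the tax year; the tax_year_ prefix and the two tiny data frames are made up for illustration.

```r
url_list <- c('http://www.irs.gov/pub/irs-soi/04in21id.xls',
              'http://www.irs.gov/pub/irs-soi/15in21id.xls')

# Take the two-digit year at the start of each file name and use it as a label.
year_labels <- paste0("tax_year_20", substr(basename(url_list), 1, 2))

# Stand-in for the list returned by lapply(url_list, readURLFunc).
data_list <- list(data.frame(x = 1), data.frame(x = 2))
names(data_list) <- year_labels
```

Each data frame can then be reached by name, for example data_list$tax_year_2015, without creating a separate variable per year.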
