I have a loop to read in a series of .csv files
for (i in 1:3) {
  nam <- paste0("A_tree", i)
  assign(nam, read.csv(sprintf("/Users/sethparker/Documents/%d_tree_from_data.txt", i), header = FALSE))
}
This works fine and generates a series of objects comparable to this example data:
A_tree1 <- data.frame(cbind(c(1:5),c(1:5),c(1:5)))
A_tree2 <- data.frame(cbind(c(2:6),c(2:6),c(2:6)))
A_tree3 <- data.frame(cbind(c(3:10),c(3:10),c(3:10)))
What I want to do is add column names, and populate 2 new columns with data (month and model run). My current successful approach is to do this individually, like this:
colnames(A_tree1) <- c("GPP","NPP","LA")
A_tree1$month <- seq.int(nrow(A_tree1))
A_tree1$run <- c("1")
colnames(A_tree2) <- c("GPP","NPP","LA")
A_tree2$month <- seq.int(nrow(A_tree2))
A_tree2$run <- c("2")
colnames(A_tree3) <- c("GPP","NPP","LA")
A_tree3$month <- seq.int(nrow(A_tree3))
A_tree3$run <- c("3")
This is extremely inefficient for the number of _tree objects I have. Attempts to modify the loop with paste0() or sprintf() to incorporate these desired manipulations have resulted in Error: target of assignment expands to non-language object. I think I understand why this error is appearing based on reading other posts (Error in <my code> : target of assignment expands to non-language object). Is it possible to do what I want within my for loop? If not, how could I automate this better?
You can use lapply:
n <- 3 # total number of files (replace with your actual count)
l <- lapply(1:n, function(i) {
  # paste0() builds the same path as your sprintf() call
  r <- read.csv(
    paste0("/Users/sethparker/Documents/", i, "_tree_from_data.txt"),
    header = FALSE
  )
  # add the column names and the two new columns
  colnames(r) <- c("GPP", "NPP", "LA")
  r$month <- seq.int(nrow(r))
  r$run <- i
  return(r)
})
lapply() returns a list; if you want to stack the tables into a single data frame, pipe the list into bind_rows() from the dplyr package:
library(dplyr)
l %>%
  bind_rows()
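If you prefer to stay in base R, a minimal equivalent (assuming all elements share the same columns, which the colnames() step above guarantees) is:
# stack the list of data frames without dplyr
combined <- do.call(rbind, l)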
I’m trying to extract the column names and associated datatypes from a number of dbfs and put the results into a table to cross reference which column names and datatypes appear in which dbfs. I’ve written an initial R script for this as a trial using just two dbfs and it works fine. The problem is when I try to adapt the script to iterate over many dbfs I can’t quite get the result right.
The script for two dbfs is:
library(foreign)
dbf05 <- read.dbf("path/data05.dbf")
dbf06 <- read.dbf("path/data06.dbf")
dbf05ColnamesDT <- lapply(dbf05,class)
dbf06ColnamesDT <- lapply(dbf06,class)
ColnamesDTList <- list(dbf05ColnamesDT, dbf06ColnamesDT)
maxLength <- max(lengths(ColnamesDTList)) #Get the max length of the lists in ColnamesDTList
#Create a df from the nested list, with equal length columns
ColnamesDTDf <- as.data.frame(do.call(rbind, lapply(ColnamesDTList, `length<-`, maxLength)))
#Rename rows
years <- 2005:2006
new.names <-NULL
for(i in 1:2){
new.names[i]<-paste("dbf", years[i], sep="")
}
row.names(ColnamesDTDf)<-new.names
This produces a table like this, which is exactly what I want:
cname1 cname2 cname3 cname4 cname5 cname6
dbf2005 factor factor numeric numeric factor factor
dbf2006 numeric factor numeric factor numeric NULL
The script I have for the iteration is:
library(foreign)
files <- list.files("path/", full.names = TRUE, pattern = "*.dbf$") #List files
ColnamesDTList <- list()
for (i in 1:14){
dbfs <- read.dbf(files[i])
ColnamesDT <- lapply(dbfs,class)
ColnamesDTList[[i]] <- list(ColnamesDT)
}
maxLength <- max(lengths(ColnamesDTList))
ColnamesDTDf <- as.data.frame(do.call(rbind, lapply(ColnamesDTList, `length<-`, maxLength)))
years <- 2005:2018
new.names <-NULL
for(i in 1:14){
new.names[i]<-paste("dbf", years[i], sep="")
}
row.names(ColnamesDTDf)<-new.names
This produces a table with only one column containing the list of column names:
V1
dbf2005 list(cname1 = "factor", cname2 = "factor", cname3 = "numeric", ...
dbf2006 list(cname1 = "numeric", cname3 = "factor", cname4 = "numeric", ...
In addition, maxLength returns 34 in the first script but only 1 in the iterative script, which tells me that the list structure I build in the for loop is incorrect, but I'm not sure how to implement it correctly.
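A likely culprit, judging from those symptoms (every element reporting length 1), is the extra list() wrapper inside the loop, which nests each result one level too deep. A minimal sketch of the fix:
ColnamesDTList <- list()
for (i in 1:14) {
  dbfs <- read.dbf(files[i])
  # store the list of column classes directly, without wrapping it in list()
  ColnamesDTList[[i]] <- lapply(dbfs, class)
}
With that change, lengths(ColnamesDTList) should again report the column counts of the individual dbfs rather than 1.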
I'm trying to load multiple excel files all into one data frame. Some of the files don't have any data in the sheet that I'm looking for, so I'm looking to write code that collects the files that do have data, but also tells me which files weren't included because they didn't have any data. The code I have written does tell me which ones don't have data if I simply 'print(i)' inside the no part of my ifelse statement. However, as soon as I try to do anything else instead of printing, it seems to just ignore me! It's infuriating. How can I collect the names of the files that haven't contributed towards the total data frame?
this works fine:
library(readxl)
files <- list.files(path="./sfiles", pattern = "*.xls", full.names = T)
alldiasendcgmlist <- lapply(files,function(i){
ifelse(nrow(i)==NULL, NULL, i$name<-i)
x= read_excel(i,sheet=2,skip=4)
ifelse(nrow(x)>1, x$ID <- i, print(i))
x
})
but as soon as I try to collect these printed names in a vector, the vector remains empty:
library(readxl)
files <- list.files(path="./sfiles", pattern = "*.xls", full.names = T)
vectornodata <- character(0)
alldiasendcgmlist <- lapply(files,function(i){
ifelse(nrow(i)==NULL, NULL, i$name<-i)
x= read_excel(i,sheet=2,skip=4)
ifelse(nrow(x)>1, x$ID <- i, nodata <- append(vectornodata, i))
x
})
Help!
Consider building your list of data frames without any logical conditions. Then afterwards run Filter to separate the empty and non-empty elements. Negate (another higher-order function) is used to return the opposite, i.e., the NULL elements, which have no length.
# BUILD NAMED LIST OF EMPTY AND NON-EMPTY DFs
alldiasendcgmlist <- lapply(files, function(i) read_excel(i, sheet = 2, skip = 4))
alldiasendcgmlist <- setNames(alldiasendcgmlist, gsub("\\.xls$", "", basename(files)))
# EXTRACT ACTUAL DFs FROM FULL LIST
full_xlfiles <- Filter(length, alldiasendcgmlist)
# EXTRACT NAMES OF NULL ELEMENTS FROM FULL LIST
empty_xlfiles <- Filter(Negate(length), alldiasendcgmlist)
empty_xlfiles <- names(empty_xlfiles)
I will use my own example to show what one can do. Simply make your function return a list and then use one of its elements to filter out the empty data frames:
input <- c(1, 2, 3, 4, 5, 6)
func <- function(x) {
  list("square" = x^2, "even" = x %% 2 == 0)
}
res <- lapply(input, func) # returns a list of lists
even_numbers <- input[sapply(res, function(x) x[[2]])] # use 2nd element to filter
In your case, you can use a boolean vector to identify files that are empty.
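Adapted to the Excel files, a rough sketch (assuming an empty sheet comes back as a zero-row data frame; read_one and empty_files are names invented here for illustration):
library(readxl)
# return both the data and an "empty" flag for each file
read_one <- function(f) {
  x <- read_excel(f, sheet = 2, skip = 4)
  list(data = x, empty = nrow(x) == 0)
}
res <- lapply(files, read_one)
empty_files <- files[sapply(res, function(r) r$empty)] # files that contributed no data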
I'm having memory and optimization problems when looping over 200,000 documents of JSTOR data for research. The documents are in XML format. More information can be found here: https://www.jstor.org/dfr/.
In the first step of the code I transform an XML file into a tidy data frame in the following manner:
library(XML)

Transform <- function(x) {
  a <- xmlParse(x)
  aTop <- xmlRoot(a)
  Journal <- xmlValue(aTop[["front"]][["journal-meta"]][["journal-title-group"]][["journal-title"]])
  Publisher <- xmlValue(aTop[["front"]][["journal-meta"]][["publisher"]][["publisher-name"]])
  Title <- xmlValue(aTop[["front"]][["article-meta"]][["title-group"]][["article-title"]])
  Year <- as.integer(xmlValue(aTop[["front"]][["article-meta"]][["pub-date"]][["year"]]))
  Abstract <- xmlValue(aTop[["front"]][["article-meta"]][["abstract"]])
  Language <- xmlValue(aTop[["front"]][["article-meta"]][["custom-meta-group"]][["custom-meta"]][["meta-value"]])
  df <- data.frame(Journal, Publisher, Title, Year, Abstract, Language, stringsAsFactors = FALSE)
  df
}
Next, I use this first function to transform a series of XML files into a single data frame:
TransformFiles <- function(pathFiles) {
  files <- list.files(pathFiles, "*.xml")
  # initialize with the first file, then fill in the rest row by row
  df2 <- Transform(paste(pathFiles, files[1], sep = "/", collapse = ""))
  i <- 2
  while (i <= length(files)) {
    df <- Transform(paste(pathFiles, files[i], sep = "/", collapse = ""))
    df2[i, ] <- df
    i <- i + 1
  }
  data.frame(df2)
}
With more than 100,000 files it takes several hours to run, and with 200,000 it eventually breaks or slows to a crawl. Even on small sets you can see it getting slower over time. Is there something I'm doing wrong? Could I do something to optimize the code? I've already tried rbind and bind_rows instead of assigning the values directly with df2[i,] <- df.
Avoid growing an object inside a loop, as your assignment df2[i,] <- df does (which, by the way, only works if df has exactly one row), and avoid the bookkeeping that while requires with its iterator i.
Instead, consider building a list of data frames with lapply that you can then rbind together in one call outside the loop.
TransformFiles <- function(pathFiles) {
  files <- list.files(pathFiles, "*.xml", full.names = TRUE)
  df_list <- lapply(files, Transform)
  final_df <- do.call(rbind, unname(df_list))

  # ALTERNATIVES FOR POSSIBLE PERFORMANCE:
  # final_df <- data.table::rbindlist(df_list)
  # final_df <- dplyr::bind_rows(df_list)
  # final_df <- plyr::rbind.fill(df_list)

  final_df
}
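Usage is unchanged (the path below is just a placeholder):
# one call builds the whole data frame; nothing grows inside a loop
result <- TransformFiles("/path/to/xml/files")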
I have a for loop that loops through a list of urls,
url_list <- c('http://www.irs.gov/pub/irs-soi/04in21id.xls',
'http://www.irs.gov/pub/irs-soi/05in21id.xls',
'http://www.irs.gov/pub/irs-soi/06in21id.xls',
'http://www.irs.gov/pub/irs-soi/07in21id.xls',
'http://www.irs.gov/pub/irs-soi/08in21id.xls',
'http://www.irs.gov/pub/irs-soi/09in21id.xls',
'http://www.irs.gov/pub/irs-soi/10in21id.xls',
'http://www.irs.gov/pub/irs-soi/11in21id.xls',
'http://www.irs.gov/pub/irs-soi/12in21id.xls',
'http://www.irs.gov/pub/irs-soi/13in21id.xls',
'http://www.irs.gov/pub/irs-soi/14in21id.xls',
'http://www.irs.gov/pub/irs-soi/15in21id.xls')
downloads an Excel file from each one, assigns it to a data frame, and performs a set of data-cleaning operations on it.
library(gdata)
for (url in url_list){
test <- read.xls(url)
cols <- c(1,4:5,97:98)
test <- test[-(1:8),cols]
test <- test[1:22,]
test <- test[-4,]
test$Income <-test$Table.2.1...Returns.with.Itemized.Deductions..Sources.of.Income..Adjustments..Itemized.Deductions.by.Type..Exemptions..and.Tax..Items..by.Size.of.Adjusted.Gross.Income..Tax.Year.2015..Filing.Year.2016.
test$Total_returns <- test$X.2
test$return_dollars <- test$X.3
test$charitable_deductions <- test$X.95
test$charitable_deduction_dollars <- test$X.96
test[1:5] <- NULL
}
My problem is that the loop simply writes over the same dataframe for each iteration through the loop. How can I have it assign each iteration through the loop to a data frame with a different name?
Use assign. This question is a duplicate of this post: Change variable name in for loop using R
For your particular case, you can do something like the following:
for (i in 1:length(url_list)) {
  url <- url_list[i]
  test <- read.xls(url)
  cols <- c(1, 4:5, 97:98)
  test <- test[-(1:8), cols]
  test <- test[1:22, ]
  test <- test[-4, ]
  test$Income <- test$Table.2.1...Returns.with.Itemized.Deductions..Sources.of.Income..Adjustments..Itemized.Deductions.by.Type..Exemptions..and.Tax..Items..by.Size.of.Adjusted.Gross.Income..Tax.Year.2015..Filing.Year.2016.
  test$Total_returns <- test$X.2
  test$return_dollars <- test$X.3
  test$charitable_deductions <- test$X.95
  test$charitable_deduction_dollars <- test$X.96
  test[1:5] <- NULL
  assign(paste("test", i, sep = ""), test)
}
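If you later want those test1, test2, ... objects gathered back into one list, mget() can collect them (a small convenience sketch, not part of the original answer):
test_list <- mget(paste0("test", seq_along(url_list)))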
You could write to a list:
result_list <- list()
for (i_url in 1:length(url_list)) {
  url <- url_list[i_url]
  ...
  result_list[[i_url]] <- test
}
You can also name the list:
names(result_list) <- c("df1", "df2", "df3", ...)
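Or derive the names from the URLs themselves (one possible convention, not from the original answer):
names(result_list) <- sub("\\.xls$", "", basename(url_list))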
Here's another approach with lapply instead of a for loop, which stores all the resulting data frames as separate list items that can then be renamed (if needed).
url_list <- c('http://www.irs.gov/pub/irs-soi/04in21id.xls',
...
'http://www.irs.gov/pub/irs-soi/15in21id.xls')
readURLFunc <- function(z) {
  test <- readxl::read_xls(z)
  ...
  test[1:5] <- NULL
  return(test)
}
data_list <- lapply(url_list, readURLFunc)
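To keep track of which element came from which URL, the list can be named afterwards (a small addition, not in the original answer):
names(data_list) <- basename(url_list)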
I want to read a bunch of data sets (e.g. *.dta) with a specific prefix and an increasing number pattern into the global environment, and combine them in a list. (In this special case they're all of the same dimension.)
Traditionally I code:
library(foreign) # for reading *.dta files
df_1 <- read.dta("df_1.dta")
df_2 <- read.dta("df_2.dta")
...
df_n <- read.dta("df_n.dta") # note: consider 'n' being an arbitrary defined integer
df_lst <- mget(ls(pattern = "df_[0-9]")) # combine dfs into list
Now I want to accomplish this in one brief step.
I attempted this loop, which won't work, most likely because the variable i is quoted as literal text rather than substituted:
# initialize list
df_lst <- list()
# read and combine dfs into list
i <- 0
while(i < n) {
i = i + 1
df_[i] = read.dta("df_[i].dta")
c(df_lst, df[i])
}
Moreover I'd rather prefer a function than a loop.
How can I reach my goal?
Try using rio:
rio::import_list(dir(pattern = "df_[0-9]"))
This will return a list of the data frames.
(Generally speaking, there's no need to import data files into the global environment before putting them into a list.)
Full disclosure: I am the maintainer of rio.
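If you ultimately want one combined data frame instead of a list, import_list() can also row-bind the pieces for you (via its rbind argument; check ?import_list in your installed version):
df_all <- rio::import_list(dir(pattern = "df_[0-9]"), rbind = TRUE)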
For the loop, use paste to recreate the file name, and assign into the list by index:
# initialize list
df_lst <- list()
# read and combine dfs into list
i <- 0
while (i < n) {
  i <- i + 1
  df_lst[[i]] <- read.dta(paste("df_", i, ".dta", sep = ""))
}
And define n (I assume you did, but it does not appear defined in your post).
Cheers,
Fer
Using assign() and do.call("list",...), you can do this with a function:
# list of filenames matching the pattern
fnames <- list.files(pattern = "df_[0-9]+\\.dta$")
# function to read, assign to global env, and return data
dtafx <- function(i) {
  df <- foreign::read.dta(fnames[i])
  assign(gsub("\\.dta$", "", fnames[i]), df, envir = .GlobalEnv)
  return(df)
}
# apply function to filenames, combining dfs into list
df_lst <- do.call("list", sapply(seq_along(fnames), dtafx, simplify = FALSE))
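Going the other direction also works: build the list first, then push its elements into the global environment with list2env() (an alternative sketch, not from the original answers):
fnames <- list.files(pattern = "df_[0-9]+\\.dta$")
df_lst <- lapply(fnames, foreign::read.dta)
names(df_lst) <- gsub("\\.dta$", "", fnames)
list2env(df_lst, envir = .GlobalEnv) # creates df_1, df_2, ... as separate objects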