I am trying to read over 200 CSV files, each with multiple rows and columns of numbers. It makes most sense to read each one as a separate data frame.
Ideally, I'd like to give them meaningful names. So the data frame for store 1, room 1 would be named store.1.room.1, the one for store 1, room 2 would be store.1.room.2, and so on all the way up to store.100.room.1, store.100.room.2, etc.
I can read each file into a specified data frame. For example:
store.1.room.1 <- read.csv(filepath,...)
But how do I create a dynamically named data frame inside a for loop?
For example:
for (i in 1:100) {
  for (j in 1:2) {
    store.i.room.j <- read.csv(filepath...)
  }
}
Alternatively, is there another approach that I should consider instead of having each csv file as a separate data frame?
Thanks
You can create your data frames using read.csv as you have above, but store them in a list. Then give a name to each item (i.e. data frame) in the list:
# initialize an empty list
my_list <- list()
for (i in 1:100) {
  for (j in 1:2) {
    df <- read.csv(filename...)
    df_name <- paste("store", i, "room", j, sep = ".")  # e.g. "store.1.room.1"
    my_list[[df_name]] <- df
  }
}
# now you can access any data frame by name, e.g. my_list$store.1.room.1
# or my_list[["store.1.room.1"]]
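To make the pattern concrete, here is a runnable sketch with toy data frames standing in for the CSV contents (the column name temp and the loop bounds are illustrative assumptions, not from the original post):

```r
# Toy stand-ins for read.csv(filepath, ...) so the naming scheme is visible
my_list <- list()
for (i in 1:3) {
  for (j in 1:2) {
    df <- data.frame(temp = rnorm(5))  # placeholder for the real CSV contents
    my_list[[paste("store", i, "room", j, sep = ".")]] <- df
  }
}
# Access a single data frame by its constructed name
dim(my_list[["store.2.room.1"]])
# Or apply a function across every data frame at once
rows_per_df <- sapply(my_list, nrow)
```

The list also lets you operate on all 200+ frames in one line with lapply or sapply, which is the main payoff over separate variables.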
I'm not sure whether I'm answering your question, but you would rarely want to store those CSV files in separate data frames. What I would do in your case is this:
set <- data.frame()
for (i in 1:100) {
  ## calculate filename here
  current.csv <- read.csv(filename)
  current.csv <- cbind(current.csv, index = i)
  set <- rbind(set, current.csv)
}
The additional index column identifies which CSV file each measurement came from.
EDIT:
This makes it easy to apply tapply to particular columns of your data.frame. Also, in case you'd like to keep the measurements of only one CSV (let's say the one indexed by 5), you can enter
single.data.frame <- set[set$index == 5, ]
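As a small runnable illustration of both points (the column name value and the toy numbers are assumptions for the sketch):

```r
# Hedged sketch: assumes the combined 'set' has a numeric column named 'value'
set <- data.frame(value = c(1, 2, 3, 4), index = c(1, 1, 2, 2))
# Mean of 'value' per source file, using the index column as the grouping factor
means_by_file <- tapply(set$value, set$index, mean)
# Keep only the rows that came from the file indexed by 2
single.data.frame <- set[set$index == 2, ]
```
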
I tried using mapply to create a list of data frames from sheets in an Excel file.
To be precise, every column of the data table I want to build as one element of the list comes from a column of a separate sheet in the Excel file. There are 7 files; they have differing numbers of sheets, though each sheet has the same dimensions. Each element of my final list, which I call RAINS, should correspond to one of the 7 files.
### Excel files
weather_files <- list.files(pattern = "[M-m][0-9]{4}\\.xlsx")
year <- 1:7
dateseq <- 5:26
rainsheet <- list()
RAIN <- list()
RAINS <- list()
y <- list()
## List of vectors of sheet numbers for each file
for (i in seq_along(year)) {
  y[[i]] <- 5:length(excel_sheets(weather_files[i]))
}
### Function 'raindate' which calls read.xlsx
raindate <- function(j, i) {
  rainsheet <- read.xlsx(weather_files[i], sheet = j, startRow = 2,
                         colNames = TRUE, rowNames = FALSE, detectDates = FALSE,
                         rows = 4:108, cols = 2, check.names = FALSE)
}
### Create data frame using cbind
for (i in seq_along(year)) {
  RAIN <- read.xlsx(weather_files[1], sheet = 5, startRow = 2, colNames = TRUE,
                    rowNames = FALSE, detectDates = FALSE, rows = 4:108,
                    cols = 1, check.names = FALSE)
  RAINS[[i]] <- cbind(RAIN, mapply(raindate, y[[i]], year))
}
The problem I have is that mapply increments over pairs of elements of the vectors y and year. This gives me data frames where each successive column advances both the Excel file and the sheet, completely mixing up the data. What I need is to iterate over all values of y within one year, then increment the year.
Is there a method in R to replace mapply in the above code to achieve this?
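One way to get that iteration order (a sketch, not from the original post) is to loop over files with an outer lapply and over sheets with an inner sapply, so the file index i stays fixed while all of its sheets are read. A toy function stands in for raindate/read.xlsx here so the order is visible without the Excel files:

```r
# Toy stand-in for 'raindate' so the iteration order is visible without files
raindate_toy <- function(j, i) paste0("file", i, "_sheet", j)
y <- list(c(5, 6), c(5, 6, 7))  # assumed sheet numbers per file
year <- 1:2
# For each year i, run over all of its sheets y[[i]] while i stays fixed
RAINS <- lapply(seq_along(year), function(i) sapply(y[[i]], raindate_toy, i = i))
RAINS[[2]]
```

Replacing raindate_toy with the real raindate keeps the same structure: the inner sapply supplies j from y[[i]], and i is passed as a fixed named argument.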
I am new to R and Stack Overflow, so this probably has a very simple solution.
I have a set of data from 20 different subjects. In the future I will have to perform a lot of different actions on this data and will have to repeat those actions for every individual set, analyzing them separately and recombining them.
My question is how can I automate this process:
P4 <- read.delim("P4Rtest.txt")
P7 <- read.delim("P7Rtest.txt")
P13 <- read.delim("P13Rtest.txt")
etc etc etc.
I have tried looping with a for loop but seem to get stuck creating a new data.frame with a unique name each time.
Thank you for your help
The R way to do this would be to keep all the data sets together in a named list. For that you can use the following, where n is the number of files.
nm <- paste0("P", 1:n) ## create the names P1, P2, ..., Pn
dfList <- setNames(lapply(paste0(nm, "Rtest.txt"), read.delim), nm)
Now dfList will contain all the data sets. You can access them individually with dfList$P1 for P1, dfList$P2 for P2, and so on.
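A runnable version of the same idea, using temporary files in place of the real P1Rtest.txt, P2Rtest.txt, ... (the file contents here are made up for the sketch):

```r
# Write 3 small tab-delimited files standing in for PnRtest.txt
n <- 3
tmp <- tempdir()
for (i in 1:n) {
  write.table(data.frame(val = i), file.path(tmp, paste0("P", i, "Rtest.txt")),
              sep = "\t", row.names = FALSE)
}
nm <- paste0("P", 1:n)  # the names P1, P2, ..., Pn
dfList <- setNames(lapply(file.path(tmp, paste0(nm, "Rtest.txt")), read.delim), nm)
dfList$P2$val  # access one data set by name
```
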
There are a bunch of different ways of doing stuff like this. You could combine all the data into one data frame using rbind. The first answer here has a good way of doing that: Replace rbind in for-loop with lapply? (2nd circle of hell)
If you combine everything into one data frame, you'll need to add a column that identifies the participant. So instead of
P4 <- read.delim("P4Rtest.txt")
...
You would have something like
my.list <- vector("list", number.of.subjects)
for (participant.number in 1:number.of.subjects) {
  # load individual participant data
  participant.filename <- paste("P", participant.number, "Rtest.txt", sep = "")
  participant.df <- read.delim(participant.filename)
  # add a column identifying the participant:
  participant.df$participant.number <- participant.number
  my.list[[participant.number]] <- participant.df
}
solution <- do.call(rbind, my.list)
If you want to keep them as separate data frames for some reason, you can keep them in a list (leave off the last rbind line) and use lapply(my.list, function(participant.df) { stuff you want to do }) whenever you want to do stuff to the data frames.
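A small self-contained sketch of both options (the columns and values are toy assumptions, not the real participant data):

```r
# Toy list of per-participant data frames
my.list <- lapply(1:3, function(p)
  data.frame(participant.number = p, score = c(p, p + 1)))
# Per-participant means without merging anything
mean.scores <- sapply(my.list, function(participant.df) mean(participant.df$score))
# Or one combined frame, with the participant column preserved for grouping later
combined <- do.call(rbind, my.list)
```
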
You can use assign. Assuming all your files have a similar format as you have shown, this will work for you:
# Define how many files there are (with the numbers).
numFiles <- 10
# Run through that sequence.
for (i in 1:numFiles) {
  fileName <- paste0("P", i, "Rtest.txt") # Creating the name to pull from.
  file <- read.delim(fileName)            # Reading in the file.
  dName <- paste0("P", i)                 # Creating the name to assign the file to in R.
  assign(dName, file)                     # Creating the object in R.
}
There are other methods that are faster and more compact, but I find this to be more readable, especially for someone who is new to R.
Additionally, if your numbers aren't a complete sequence like I've used here, you can define a vector of the numbers that are used and loop over it directly with for (i in numFiles):
numFiles <- c(1, 4, 10, 25)
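One caveat with assign is that the objects end up loose in the workspace. If you later need to loop over them, mget collects them back into a named list; a minimal sketch (the toy data frames stand in for the real files):

```r
# Create a few objects the way the assign() loop would
numFiles <- c(1, 4)
for (i in numFiles) assign(paste0("P", i), data.frame(id = i))
# Gather them back into a named list by reconstructing the same names
collected <- mget(paste0("P", numFiles))
collected$P4$id
```
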
I'm taking an introductory R programming course on Coursera. The first assignment has us evaluating a list of hundreds of CSV files in a specified directory ("./specdata/"). Each CSV file, in turn, contains hundreds of records of sample pollutant data from the atmosphere: a date, a sulfate sample, a nitrate sample, and an ID that identifies the sampling location.
The assignment asks us to create a function that takes a pollutant and an id or range of ids for the sampling location, and returns the sample mean for the supplied arguments.
My code (below) uses a for loop so that the id argument determines which files are read (this seems more efficient than reading in all 322 files before doing any processing). That works great.
Within the loop, I assign the contents of each CSV file to a variable, make that variable a data frame, and use na.omit to remove rows with missing values. Then I use rbind to append the result of each iteration of the loop to the variable. When I print the data frame variable within the loop, I can see the entire list, subgrouped by id. But when I print the variable outside the loop, I only see the last element of the id vector.
I would like to build a consolidated set of all records matching the id argument within the loop, then pass that consolidated set outside the loop for further processing. I can't get this to work. My code is shown below.
Is this the wrong approach? It seems like it should work. Any help would be most appreciated. I searched Stack Overflow and couldn't find anything that quite addresses what I'm trying to do.
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
  x <- list.files(path = directory, pattern = "*.csv")
  x <- paste(directory, x, sep = "")
  id1 <- id[1]
  id2 <- id[length(id)]
  for (i in id1:id2) {
    df <- read.csv(x[i], header = TRUE)
    df <- data.frame(df)
    df <- na.omit(df)
    df <- rbind(df)
    print(df)
  }
  # would like a consolidated set of records here to do more stuff,
  # e.g. filter on pollutant and calculate the mean
}
You can just define the data frame outside the for loop and append to it. You can also skip some steps in between... There are more ways to improve here... :-)
pmean <- function(directory = "./specdata/", pollutant, id = 1:322) {
  x <- list.files(path = directory, pattern = "*.csv")
  x <- paste(directory, x, sep = "")
  df_final <- data.frame()
  for (i in id) {
    df <- read.csv(x[i], header = TRUE)
    df <- na.omit(df)
    df_final <- rbind(df_final, df)
    print(df)
  }
  # consolidated set of records, ready for more stuff,
  # e.g. filter on pollutant and calculate the mean
  return(df_final)
}
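The "more stuff" step the comment alludes to could look like the sketch below. The column names sulfate and nitrate are assumptions based on the assignment description, and the toy data frame stands in for the result of the file-reading loop:

```r
# Hedged sketch of the follow-up step: filter on the pollutant column and average
df_final <- data.frame(sulfate = c(1, NA, 3), nitrate = c(2, 4, NA))
pollutant <- "sulfate"
# Select the requested column by name and take its mean, skipping NAs
mean(df_final[[pollutant]], na.rm = TRUE)
```
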
By calling only df <- rbind(df) you are effectively overwriting df every time. You can fix this by doing something like this:
df <- data.frame()      # empty data frame
for (i in 1:10) {       # for all your csv files
  x <- mean(rnorm(10))  # some new information
  df <- rbind(df, x)    # bind old data frame and new value
}
By the way, if you know how big df will be beforehand, then this is not the proper way to do it.
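For completeness, a sketch of the preallocated alternative that remark is pointing at (sizes and contents are illustrative):

```r
# If the number of iterations is known up front, preallocate instead of growing
n <- 10
vals <- numeric(n)            # one slot per iteration
for (i in seq_len(n)) {
  vals[i] <- mean(rnorm(10))  # some new information
}
df <- data.frame(x = vals)    # build the data frame once, at the end
```

Growing with rbind copies the whole data frame on every iteration; filling a preallocated vector and building the data frame once avoids that quadratic cost.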
I need to download 300+ .csv files available online and combine them into a dataframe in R. They all have the same column names but vary in length (number of rows).
l<-c(1441,1447,1577)
s1<-"https://coraltraits.org/species/"
s2<-".csv"
for (i in l) {
  n <- paste(s1, i, s2, sep = "")  # creates download url for i
  x <- read.csv(curl(n))           # reads download url for i
  # need to successively combine each of the 3 data frames into one
}
Like #RohitDas said, continuously appending a data frame is very inefficient and will be slow. Just download each of the csv files as an entry in a list, and then bind all the rows after collecting all the data in the list.
l <- c(1441,1447,1577)
s1 <- "https://coraltraits.org/species/"
s2 <- ".csv"
# Initialize a list
x <- list()
# Loop through l and download each table as an element in the list
for (i in l) {
  n <- paste(s1, i, s2, sep = "")  # Creates download url for i
  # Use the id as the entry's name; indexing with x[[i]] directly would
  # leave NULL gaps for all the skipped numbers below 1441
  x[[as.character(i)]] <- read.csv(curl(n))
}
# Combine the list of data frames into one data frame
x <- do.call("rbind", x)
Just a warning: all the data frames in x must have the same columns to do this. If one of the entries in x has a different number of columns, or differently named columns, the rbind will fail.
More efficient row binding functions (with some extras, such as column filling) exist in several different packages. Take a look at some of these solutions for binding rows:
plyr::rbind.fill()
dplyr::bind_rows()
data.table::rbindlist()
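To show what the column filling those helpers provide actually does, here is a base-R sketch of the same idea (the toy frames with mismatched columns are made up for illustration):

```r
# Base-R sketch of column filling: pad each frame's missing columns with NA
dfs <- list(data.frame(a = 1, b = 2), data.frame(a = 3, c = 4))
all_cols <- unique(unlist(lapply(dfs, names)))
filled <- lapply(dfs, function(d) {
  d[setdiff(all_cols, names(d))] <- NA  # add the columns this frame lacks
  d[all_cols]                           # put columns in a common order
})
res <- do.call(rbind, filled)  # now rbind succeeds: all frames match
```

In practice plyr::rbind.fill(dfs), dplyr::bind_rows(dfs), or data.table::rbindlist(dfs, fill = TRUE) do this for you, faster.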
If they have the same columns, then it's just a matter of appending the rows. A simple (but not memory-efficient) approach is using rbind in a loop:
l<-c(1441,1447,1577)
s1<-"https://coraltraits.org/species/"
s2<-".csv"
data <- NULL
for (i in l) {
  n <- paste(s1, i, s2, sep = "")  # creates download url for i
  x <- read.csv(curl(n))           # reads download url for i
  # successively combine each of the 3 data frames into one
  data <- rbind(data, x)
}
A more efficient way would be to build a list and then combine them into a single data frame at the end, but I will leave that as an exercise for you.
I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop, and the values of some of the loop variables are used as the values while populating the data frame. For instance, the name of the current column could be some variable's name as a string, and the column could take the current iterator's value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I can't really do anything with it; the moment I try to populate it, I run into an error:
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What would be a good way to achieve what I am looking to do? Please let me know if I wasn't clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list()        # create an empty list
for (i in 1:5) {
  vec <- numeric(5)     # preallocate a numeric vector
  for (j in 1:5) {      # fill the vector
    vec[j] <- i^j
  }
  mylist[[i]] <- vec    # put all vectors in the list
}
df <- do.call("rbind", mylist)  # combine all vectors into a matrix
df <- as.data.frame(df)         # transform into a data.frame afterwards
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for (i in 1:iterations) {
  output[i, ] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
What this does:
create a matrix with rows and columns matching the expected final size
insert 2 random numbers into row i of the matrix on each iteration
convert the matrix into a data frame after the loop has finished
This works too:
df <- NULL
for (k in 1:10) {
  x <- 1
  y <- 2
  z <- 3
  df <- rbind(df, data.frame(x, y, z))
}
The output will look like this (10 identical rows):
df
   x y z
1  1 2 3
2  1 2 3
...
10 1 2 3
Thanks Notable1; this works for me with the tidytextr approach.
Create a data frame with the names of the files in one column and their content in another:
diretorio <- "D:/base"
arquivos <- list.files(diretorio, pattern = "*.PDF")
quantidade <- length(arquivos)
#
df <- NULL
for (k in 1:quantidade) {
  nome <- arquivos[k]
  print(nome)
  Sys.sleep(1)
  dados <- read_pdf(arquivos[k], ocr = T)
  print(dados)
  Sys.sleep(1)
  df <- rbind(df, data.frame(nome, dados))
  Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"
I had a case where I needed to use a data frame within a for loop. In this case it was "efficient" enough; keep in mind, though, that the database was small and the iterations in the loop were very simple. But maybe the code will be useful for someone with similar conditions.
The purpose of the for loop was to use the raster extract function across five locations (Tokyo, New York, São Paulo, Seoul & Mexico City), each with its own raster grids. I had a spatial point database with more than 1000 observations allocated among the 5 different locations, and I needed to extract information from 10 different raster grids (two grids per location). For the subsequent analysis, I needed not only the raster values but also the unique ID of each observation.
After preparing the spatial data, which included the following tasks:
Import the points shapefile with the readOGR function (rgdal package)
Import the raster files with the raster function (raster package)
Stack grids from the same location into one object with the stack function (raster package)
Here the for loop code with the use of a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1, 5, 1)  # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
  dat <- subset(points, LOCATION == i)                 # select the points for location i
  t <- data.frame(extract(raslist[[i]], dat), dat$ID)  # extract raster values for location i
  names(t) <- c("VAR1", "VAR2", "ID")
  TB <- rbind(TB, t)
}
I was looking for the same thing, and the following may be useful as well.
a <- vector("list", 3)
for (i in 1:3) {
  a[[i]] <- data.frame(x = rnorm(2), y = runif(2))
}
a
rbind(a[[1]], a[[2]], a[[3]])
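The final rbind above spells out each element by hand; do.call generalizes it to any number of list elements, which matters once the list holds hundreds of data frames:

```r
# Build a list of data frames, then bind them all at once with do.call
a <- vector("list", 3)
for (i in 1:3) {
  a[[i]] <- data.frame(x = rnorm(2), y = runif(2))
}
combined <- do.call(rbind, a)  # same result as rbind(a[[1]], a[[2]], a[[3]])
```
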