I need to download 300+ .csv files available online and combine them into a dataframe in R. They all have the same column names but vary in length (number of rows).
library(curl)  # needed for the curl() connections used below

l <- c(1441,1447,1577)
s1 <- "https://coraltraits.org/species/"
s2 <- ".csv"
for (i in l){
  n <- paste(s1, i, s2, sep="")  # creates download url for i
  x <- read.csv( curl(n) )       # reads download url for i
  # need to successively combine each of the 3 dataframes into one
}
As @RohitDas said, repeatedly appending to a data frame is very inefficient and will be slow. Just download each of the csv files as an entry in a list, and then bind all the rows after collecting all the data in the list.
l <- c(1441,1447,1577)
s1 <- "https://coraltraits.org/species/"
s2 <- ".csv"

# Initialize a list
x <- list()

# Loop through l and download each table as an element of the list
for(i in l) {
  n <- paste(s1, i, s2, sep = "")  # Creates download url for i
  # Download the table as a named entry in the list x
  # (indexing by as.character(i) keeps the list at 3 entries rather than length 1577)
  x[[as.character(i)]] <- read.csv( curl(n) )  # reads download url for i
}
# Combine the list of data frames into one data frame
x <- do.call("rbind", x)
Just a warning: all the data frames in x must have the same columns to do this. If one of the entries in x has a different number of columns, or differently named columns, the rbind will fail.
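One way to check this up front (a small sketch, assuming x is the list built in the loop above) is to compare every data frame's column names against the first one before calling do.call:

# sanity check: every data frame should have the same column names as the first one
stopifnot(all(sapply(x, function(d) identical(names(d), names(x[[1]])))))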
More efficient row binding functions (with some extras, such as column filling) exist in several different packages. Take a look at some of these solutions for binding rows:
plyr::rbind.fill()
dplyr::bind_rows()
data.table::rbindlist()
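For example, dplyr::bind_rows() accepts the whole list at once and fills any missing columns with NA (a sketch, assuming x is the list of data frames from above):

library(dplyr)
combined <- bind_rows(x)                   # missing columns are filled with NA
# combined <- bind_rows(x, .id = "source") # optionally record which list element each row came from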
If they have the same columns, then it's just a matter of appending the rows. A simple (but not memory-efficient) approach is to use rbind in a loop:
l <- c(1441,1447,1577)
s1 <- "https://coraltraits.org/species/"
s2 <- ".csv"
data <- NULL
for (i in l){
  n <- paste(s1, i, s2, sep="")  # creates download url for i
  x <- read.csv( curl(n) )       # reads download url for i
  # successively combine each of the 3 dataframes into one
  data <- rbind(data, x)
}
A more efficient way would be to build a list and then combine them into a single data frame at the end, but I will leave that as an exercise for you.
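For reference, a compact version of that list-based approach could look like this (just a sketch, reusing l, s1, s2 and curl() from above):

x <- lapply(l, function(i) read.csv(curl(paste0(s1, i, s2))))  # one data frame per species id
data <- do.call(rbind, x)                                      # combine once at the end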
I have a list of 181 data frames, and I want to extract the 2nd column of each, save it in a csv file, and label it; the labels for those 181 dfs are 0, 1, 2, 3, 4, 5, 6, ...
The problem is that each df has a different length (number of rows), and I don't know whether that's even possible in R!
This is an inefficient but easily coded solution (and efficiency doesn't matter when all you need to do is output a short CSV file). It writes one line of output per data frame, assuming the data frames are stored in a list l.df.
#
# Prepare for output and input.
#
fn <- "temp.csv"
if(is.null(names(l.df))) names(l.df) <- 1:length(l.df)
#
# Loop over the data frames.
#
append <- FALSE
for (s in names(l.df)) {
#
# Create a one-row data frame for this column.
#
X <- data.frame(ID=s, as.list(l.df[[s]][[2]]))
#
# Append it to the output.
#
write.table(X, file=fn, sep=",", row.names=FALSE, col.names=FALSE, append=append)
append <- TRUE
}
For example, we may prepare a set of data frames with random entries:
set.seed(17)
l.df <- lapply(1+rpois(181, 5), function(n) data.frame(X=1:n, Y=round(rnorm(n),2)))
The output file looks like this:
"1",0.37,1.61,0.02,0.51
"2",1.07,0.13,-0.55,0.34,2.24,0.41,0.26,0.13,-0.48,0.07,0.54
... (177 lines omitted)
"180",0.58,-1.5,1.85,-1.02
"181",-0.59,0.12,-0.38,-0.35,1.22,-0.63,0.81
There are many ways of solving your issue; I'll just propose the simplest one with base R, using a loop (otherwise, work with the tidyverse).
The issue of differing df lengths (in terms of rows) can be solved by adding NAs at the end.
I assume this is your setup:
# Your list of data frames
yourlistofdataframes <- list()
for (i in 1:181) { # in R, list indices run from 1 to 181 (in Python, from 0 onwards)
  nrowofdf <- sample(1:100, 1) # random number of rows between 1 and 100
  yourlistofdataframes[[i]] <- data.frame(cbind(rep(paste0("df", i, "|column1"), nrowofdf),
                                                rep(paste0("df", i, "|column2"), nrowofdf),
                                                rep(paste0("df", i, "|column3"), nrowofdf)))
}
names(yourlistofdataframes) <- 0:180 # labeling the data frames 0 through 180
Then this is your solution:
newlist <- list()
for (i in 1:length(yourlistofdataframes)){
  newlist[[i]] <- unlist(yourlistofdataframes[[i]][2])
}
names(newlist) <- 0:180 # give them the names you wanted
newlist <- lapply(newlist, `length<-`, max(lengths(newlist))) # add NAs to make them equal length

# bind back to data.frame & save as csv
newdf <- data.frame(newlist)  # if you want the data in 181 columns in your final df
newdft <- t(newdf)            # if you want the data in 181 rows in your final df
write.csv(newdf, "mycsv.csv")
Feedback on your question:
Also, if you want to ask for coding advice, post some representation of your data, so that people don't have to guess what your data looks like / build their own.
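For example, dput() prints a copy-pasteable representation of an object, so sharing the first few rows of one of your data frames is usually enough:

dput(head(yourlistofdataframes[[1]]))  # copy-pasteable snapshot of the first few rows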
I've got multiple data.frames (26) in a list. The dfs have the same structure, but I would like to work with / export only two particular columns. I can export all the dfs to individual dfs:
for(i in filelist){
list2env(setNames(filelist, paste0("names(filelist[[i]])",
seq_along(filelist))), envir = parent.frame())}
I can delete a column from all the dfs
for(i in seq_along(filelist)){filelist[[i]]$V5 = NULL}
but I cannot export the other columns individually. From a single data.frame it simply works:
token_out_mk_totatyafiak_02.txt = out_mk_totatyafiak_02.txt["V2"]
type_out_mk_totatyafiak_02.txt = out_mk_totatyafiak_02.txt["V1"]
When I tried these
for(i in seq_along(filelist)){n[[i]] <- filelist[[i]]$V2}
for(i in seq_along(filelist)){
sapply(filelist, function(x) n <- filelist[[i]]$V2)
}
the most I achieved was that, for all 26 dfs, I got the second column of the last df.
The V2 looks like:
   V2
1  az
2  a
3  fekete
4  folt
(and so on; these are Hungarian short stories...)
Depending on your desired results, you have several options.
If you want a new list, with your data frames containing only one specific column.
new_filelist <-
lapply(filelist, function(df){
df["V2"]
})
If you want to export one specific column from all data frames to separate files (in this case, .txt files):
This requires the data frames in your list to be named. In case they are not, you can replace names(filelist) with 1:length(filelist).
lapply(names(filelist), function(df){
df_filename <- paste0(df, ".txt")
write.table(filelist[[df]]["V2"], df_filename)
})
If you want to assign one specific column from each of your data frames to a new object in your environment.
Again, this requires your data frames to be named.
lapply(names(filelist), function(df){
assign(df, filelist[[df]]["V2"], envir = .GlobalEnv)
})
The below is driving me a little crazy and I'm sure there's an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
  df <- read.xlsx(file, 1, startRow=2, header=TRUE)
  # *perform calcs*
  universe_list[[count]] <- df
  count <- count + 1
}
I now have a problem where some of the new operations I want to perform would involve data from two or more excel files. So for example, I would need to import the Jan-16 and the Jan-15 excel files, perform whatever needs to be done, and then move on to the next set of files (Feb-16 and Feb-15). The files will always be a fixed period apart (e.g. one year).
I can't seem to figure out the code for how to do this... From a process perspective, I'm thinking I need to 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can't work out the code for steps 1-4!
Many thanks for helping out
Consider mapply() to handle both data frame pairs together. Your current loop is reminiscent of how other languages run for-loop operations. However, R has many vectorized approaches to iterate over lists. The code below assumes both the -15 and -16 file lists are the same length, with corresponding months in each, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
files15list <- list.files(path, pattern = "-15\\.xls$")
files16list <- list.files(path, pattern = "-16\\.xls$")
dfprocess <- function(x, y){
df1 <- read.xlsx(x, 1, startRow=2, header=TRUE)
names(df1) <- paste0(names(df1), "1") # SUFFIX COLS WITH 1
df2 <- read.xlsx(y, 1, startRow=2, header=TRUE)
names(df2) <- paste0(names(df2), "2") # SUFFIX COLS WITH 2
df <- cbind(df1, df2) # CBIND DFs
# ... perform calcs ...
return(df)
}
wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list),
function(i) wide_list[,i]) # ALTERNATE OUTPUT
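If you would rather end up with a plain list of merged data frames directly (instead of the wide matrix plus the lapply conversion), mapply() can do that with SIMPLIFY = FALSE, a small variation on the above:

merged_list <- mapply(dfprocess, files15list, files16list, SIMPLIFY = FALSE)
# merged_list[[1]] holds the merged data frame for the first pair of files, and so on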
First sort your filelist such that the two files on which you want to do your calculations are consecutive to each other. After that try this:
for (count in seq(1, length(filelist), 2)) {
  df <- read.xlsx(filelist[count], 1, startRow=2, header=TRUE)
  df1 <- read.xlsx(filelist[count+1], 1, startRow=2, header=TRUE)
  # *change column names and apply merge or append depending on requirement*
  # *perform calcs*
  # *save*
}
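The placeholder steps inside that loop might be filled in along these lines (only a sketch: the "_prev" suffix, the cbind, and the results list are assumptions to adapt to your own calcs):

results <- list()
for (count in seq(1, length(filelist), 2)) {
  df  <- read.xlsx(filelist[count], 1, startRow = 2, header = TRUE)
  df1 <- read.xlsx(filelist[count + 1], 1, startRow = 2, header = TRUE)
  names(df1) <- paste0(names(df1), "_prev")   # rename so the two files' columns can be told apart
  combined <- cbind(df, df1)                  # assumes equal row counts; otherwise merge() on a key column
  # ... perform calcs on combined ...
  results[[length(results) + 1]] <- combined  # save the result for this pair
}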
I am trying to read over 200 CSV files, each with multiple rows and columns of numbers. It makes most sense to read each one as a separate data frame.
Ideally, I'd like to give meaningful names. So the data frame of store 1, room 1 would be named store.1.room.1, and store.1.room.2. This would go all the way up to store.100.room.1, store.100.room.2 etc.
I can read each file into a specified data frame. For example:
store.1.room.1 <- read.csv(filepath,...)
But how do I create a dynamically named data frame using a for loop?
For example:
for (i in 1:100){
for (j in 1:2){
store.i.room.j <- read.csv(filepath...)
}
}
Alternatively, is there another approach that I should consider instead of having each csv file as a separate data frame?
Thanks
You can create your dataframes using read.csv as you have above, but store them into a list. Then give names to each item (i.e. dataframe) in the list:
# initialize an empty list
my_list <- list()

for (i in 1:100) {
  for (j in 1:2) {
    df <- read.csv(filename...)
    df_name <- paste("store", i, "room", j, sep=".")  # builds names like "store.1.room.1"
    my_list[[df_name]] <- df
  }
}
# now you can access any data frame you wish, e.g. my_list$store.1.room.1
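If your files follow a predictable naming pattern, the filename placeholder in the loop above can be built dynamically as well; the "store_%d_room_%d.csv" scheme below is purely hypothetical, so adjust it to your real paths:

# hypothetical naming scheme; substitute your actual directory and file pattern
df <- read.csv(file.path("data", sprintf("store_%d_room_%d.csv", i, j)))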
I'm not sure whether I am answering your question, but you would never want to store those CSV files into separate data frames. What I would do in your case is this:
set <- data.frame()
for (i in 1:100){
  ## calculate filename here
  current.csv <- read.csv(filename)
  current.csv <- cbind(current.csv, index = i)
  set <- rbind(set, current.csv)
}
An additional column is used to identify which csv file each set of measurements came from.
EDIT:
This is useful for applying tapply to particular vectors (columns) of your data.frame. Also, in case you'd like to keep the measurements of only one csv (let's say the one indexed by 5), you can enter
single.data.frame <- set[set$index == 5, ]
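For instance, assuming the combined data frame has a numeric column called value (a hypothetical name), a per-file summary could look like:

tapply(set$value, set$index, mean)  # mean of `value` within each original csv, grouped by index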
Using a particular function, I wish to merge pairs of data frames, for multiple pairings in an R directory. I am trying to write a "for loop" that will do this job for me, and while related questions such as Merge several data.frames into one data.frame with a loop are helpful, I am struggling to adapt example loops for this particular use.
My data frames end with either "_df1.csv" or "_df2.csv". Each pair that I wish to merge into an output data frame has an identical number at the beginning of the file name (e.g. 543_df1.csv and 543_df2.csv).
I have created a character string for each of the two types of file in my directory using the list.files command as below:
df1files <- list.files(path="~/Desktop/combined files", pattern="*_df1.csv", full.names=T, recursive=FALSE)
df2files <- list.files(path="~/Desktop/combined files", pattern="*_df2.csv", full.names=T, recursive=FALSE)
The function and commands that I want to apply in order to merge each pair of data frames are as follows:
findRow <- function(dt, df) { min(which(df$datetime > dt )) }
rows <- sapply(df2$datetime, findRow, df=df1)
merged <- cbind(df2, df1[rows,])
I am now trying to incorporate these commands into a for loop starting with something along the following lines, to prevent me from having to manually merge the pairs:
for(i in 1:length(df2files)){ ...
I am not yet a strong R programmer, and have hit a wall, so any help would be greatly appreciated.
My intuition (which I haven't had a chance to check) is that you should be able to do something like the following:
# read in the data as two lists of dataframes:
dfs1 <- lapply(df1files, read.csv)
dfs2 <- lapply(df2files, read.csv)
# define your merge commands as a function
merge2 <- function(df1, df2){
  findRow <- function(dt, df) { min(which(df$datetime > dt)) }
  rows <- sapply(df2$datetime, findRow, df=df1)
  merged <- cbind(df2, df1[rows,])
  merged
}
# apply that merge command to the list of lists
mergeddfs <- mapply(merge2, dfs1, dfs2, SIMPLIFY=FALSE)
# write results to files
outfilenames <- gsub("df1","merged",df1files)
mapply(function(x,y) write.csv(x,y), mergeddfs, outfilenames)