I have a list of 181 data frames, and I want to extract the 2nd column from each one, label it, and save the results in a CSV file. The labels for those 181 dfs are 0,1,2,3,4,5,6.
The problem is that each df has a different length, and I don't know whether that can be handled in R!
This is an inefficient but easily coded solution (and efficiency doesn't matter when all you need to do is output a short CSV file). It writes each data frame one line at a time, assuming the data frames are represented by a list l.df.
#
# Prepare for output and input.
#
fn <- "temp.csv"
if(is.null(names(l.df))) names(l.df) <- 1:length(l.df)
#
# Loop over the data frames.
#
append <- FALSE
for (s in names(l.df)) {
#
# Create a one-row data frame for this column.
#
X <- data.frame(ID=s, as.list(l.df[[s]][[2]]))
#
# Append it to the output.
#
write.table(X, file=fn, sep=",", row.names=FALSE, col.names=FALSE, append=append)
append <- TRUE
}
For example, we may prepare a set of data frames with random entries:
set.seed(17)
l.df <- lapply(1+rpois(181, 5), function(n) data.frame(X=1:n, Y=round(rnorm(n),2)))
The output file looks like this:
"1",0.37,1.61,0.02,0.51
"2",1.07,0.13,-0.55,0.34,2.24,0.41,0.26,0.13,-0.48,0.07,0.54
... (177 lines omitted)
"180",0.58,-1.5,1.85,-1.02
"181",-0.59,0.12,-0.38,-0.35,1.22,-0.63,0.81
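Because the rows of such a file have different numbers of fields, reading it back needs a little care. The following is only a sketch under assumptions: the file name and its three rows are made up for illustration, and the column names V1, V2, ... are hypothetical. It uses count.fields() to find the widest row and lets read.csv() pad the shorter rows with NA.

```r
# Write a small ragged file (made-up rows, same layout as the output above).
fn <- tempfile(fileext = ".csv")
writeLines(c('"1",0.37,1.61',
             '"2",1.07,0.13,-0.55,0.34',
             '"3",0.58'), fn)

# The widest row determines how many columns are needed.
n.col <- max(count.fields(fn, sep = ","))

# Supplying col.names of that length makes read.csv() pad short rows with NA.
wide <- read.csv(fn, header = FALSE, fill = TRUE,
                 col.names = c("ID", paste0("V", seq_len(n.col - 1))))
```

Without explicit col.names, read.csv() guesses the number of columns from the first few lines only, which can go wrong when a later row is longer.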
There are many ways to solve this. I'll just propose the simplest one, a loop in base R 🙌 (the alternative is the tidyverse).
The issue of differing df lengths (in terms of rows) can be solved by padding the shorter columns with NAs at the end.
I assume this is your setup:
# Your list of data frames
yourlistofdataframes <- list()
for (i in 1:182) { # 182 data frames, matching the labels 0:181 below (R list indices start at 1, Python's at 0)
nrowofdf <- sample(1:100,1) # random number of rows between 1 and 100
yourlistofdataframes[[i]] <- data.frame(cbind(rep(paste0("df",i,"|column1"),nrowofdf),
rep(paste0("df",i,"|column2"),nrowofdf),
rep(paste0("df",i,"|column3"),nrowofdf)))
}
names(yourlistofdataframes) <- 0:181 # labeling the data frames
Then this is your solution:
newlist <- list()
for (i in 1:length(yourlistofdataframes)){
newlist[[i]] <- unlist(yourlistofdataframes[[i]][2])
}
names(newlist) <- 0:181 # give them the names you wanted
newlist <- lapply(newlist, `length<-`, max(lengths(newlist))) # add NA's to make them equal length
# bind back to data.frame & save as csv
newdf <- data.frame(newlist) # if you want to have the data in 181 columns in your final df
newdft <- t(newdf) # if you want to have the data in 181 rows in your final df
write.csv(newdf, "mycsv.csv") # pass newdft instead if you want the row layout
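The padding step is the heart of this answer, so here it is in isolation. This is a minimal sketch on two made-up columns (the list cols and its contents are assumptions for illustration): assigning a larger length to a vector extends it with NA, which is what applying `length<-` through lapply() does.

```r
# Two extracted columns of unequal length (made-up values).
cols <- list(`0` = c("a", "b", "c"), `1` = c("d"))

# Extend every vector to the longest length; the new slots are filled with NA.
padded <- lapply(cols, `length<-`, max(lengths(cols)))

# Equal lengths now, so the list can become a data frame.
# check.names = FALSE keeps the labels "0" and "1" instead of "X0" and "X1".
out <- data.frame(padded, check.names = FALSE)
```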
Feedback on your question:
If you want to ask for coding advice, post some representation of your data, so that people don't have to guess what your data looks like or build their own.
I would like to delete only some data frames from the environment. Since I have more than 400 data frames there, I would like to select based on number of columns.
I tried the code below, but it does not work.
# Select data tables
dfs <- ls()
dfs <- dfs[grepl(".csv", dfs)]
myfilenames <- list.files(inputdir, pattern = ".csv$")
storage <- data.frame(matrix(nrow=length(myfilenames), ncol=1))
# Filter for 25 columns
for (i in seq_along(dfs)){
dat <- get(dfs[i])
if(ncol(dat)!=25){
storage[i,1] <- dfs[i]
}
}
# Delete
todelete <- storage[,1]
rm(todelete)
Expected result: only data frames with 25 columns remain in Global environment
Actual result: none is removed
Thank you very much for your help!
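One likely reason nothing is removed: rm(todelete) removes the variable called todelete itself, not the objects whose names it holds. To remove objects by name, pass the character vector through the list argument of rm(). The sketch below is an illustration under assumptions: it works in a scratch environment env with made-up objects rather than in the Global environment, and it drops every data frame that does not have 25 columns.

```r
# Scratch environment with made-up objects (stand-ins for the imported tables).
env <- new.env()
assign("a.csv", data.frame(matrix(0, 2, 25)), envir = env)
assign("b.csv", data.frame(matrix(0, 2, 3)), envir = env)
assign("not_a_df", 1:3, envir = env)

# Names of all data frames in the environment.
nms <- ls(env)
is.df <- vapply(nms, function(n) is.data.frame(get(n, envir = env)), logical(1))
df.names <- nms[is.df]

# Names of the data frames that do NOT have 25 columns.
wrong <- vapply(df.names, function(n) ncol(get(n, envir = env)) != 25, logical(1))
todelete <- df.names[wrong]

# rm() takes object names as a character vector via `list`.
rm(list = todelete, envir = env)
```

For the Global environment, drop the envir arguments (or use envir = globalenv()).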
The problem below is driving me a little crazy, and I'm sure there's an easy solution.
I currently use R to perform some calculations from a bunch of excel files, where the files are monthly observations of financial data. The files all have the exact same column headers. Each file gets imported, gets some calcs done on it and the output is saved to a list. The next file is imported and the process is repeated. I use the following code for this:
filelist <- list.files(pattern = "\\.xls")
universe_list <- list()
count <- 1
for (file in filelist) {
df <- read.xlsx(file, 1, startRow=2, header=TRUE)
# ... perform calcs ...
universe_list[[count]] <- df
count <- count + 1
}
I now have a problem: some of the new operations I want to perform involve data from two or more Excel files. For example, I would need to import the Jan-16 and the Jan-15 Excel files, perform whatever needs to be done, and then move on to the next set of files (Feb-16 and Feb-15). The files will always be a fixed interval apart (one year, for example).
I can't figure out the code for this. From a process perspective, I'm thinking I need to 1) design a loop to import both sets of files at the same time, 2) create two dataframes from the imported data, 3) rename the columns of one of the dataframes (so the columns can be distinguished), 4) merge both dataframes together, and 5) perform the calcs. I can't work out the code for steps 1-4!
Many thanks for helping out.
Consider mapply() to handle both data frames of each pair together. Your current loop is reminiscent of for loop operations in other languages, but R has many vectorized approaches for iterating over lists. The code below assumes that the lists of '15 and '16 files are the same length, with corresponding months in both, and that the year abbreviation comes right before the file extension (i.e., -15.xls, -16.xls):
files15list <- list.files(path, pattern = "15\\.xls") # "[15]" would match a single 1 or 5
files16list <- list.files(path, pattern = "16\\.xls")
dfprocess <- function(x, y){
df1 <- read.xlsx(x, 1, startRow=2, header=TRUE)
names(df1) <- paste0(names(df1), "1") # SUFFIX COLS WITH 1
df2 <- read.xlsx(y, 1, startRow=2, header=TRUE)
names(df2) <- paste0(names(df2), "2") # SUFFIX COLS WITH 2
df <- cbind(df1, df2) # CBIND DFs
# ... perform calcs ...
return(df)
}
wide_list <- mapply(dfprocess, files15list, files16list)
long_list <- lapply(1:ncol(wide_list),
function(i) wide_list[,i]) # ALTERNATE OUTPUT
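The approach above relies on the two file lists lining up month by month. As a hedged sketch of how that pairing could be checked, the example below uses made-up file names (no files are read), aligns the '16 names to the '15 names by month, and uses Map(), which always returns a list and so avoids the matrix that mapply() may build when it simplifies. The helper month() is hypothetical.

```r
# Made-up file names, deliberately out of order.
files <- c("Jan-15.xls", "Feb-16.xls", "Feb-15.xls", "Jan-16.xls")

f15 <- grep("-15\\.xls$", files, value = TRUE)
f16 <- grep("-16\\.xls$", files, value = TRUE)

# Hypothetical helper: strip the year and extension to get the month part.
month <- function(f) sub("-[0-9]{2}\\.xls$", "", f)

# Reorder the '16 files so each one lines up with the same month in '15.
f16 <- f16[match(month(f15), month(f16))]

# One list element per month; a real version would read and combine both files here.
pairs <- Map(function(x, y) c(prev = x, curr = y), f15, f16)
```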
First sort your filelist such that the two files on which you want to do your calculations are consecutive to each other. After that try this:
# Step through the sorted file list two files at a time.
for (count in seq(1, length(filelist), 2)) {
df <- read.xlsx(filelist[count], 1, startRow=2, header=TRUE)
df1 <- read.xlsx(filelist[count+1], 1, startRow=2, header=TRUE)
# ... change column names and apply merge or append, depending on requirement ...
# ... perform calcs ...
# ... save ...
}
I need to download 300+ .csv files available online and combine them into a dataframe in R. They all have the same column names but vary in length (number of rows).
l<-c(1441,1447,1577)
s1<-"https://coraltraits.org/species/"
s2<-".csv"
for (i in l){
n<-paste(s1,i,s2, sep="") #creates download url for i
x <- read.csv( curl(n) ) #reads download url for i
#need to successively combine each of the 3 dataframes into one
}
Like @RohitDas said, continuously appending to a data frame is very inefficient and will be slow. Just download each of the csv files as an entry in a list, and then bind all the rows after collecting all the data in the list.
l <- c(1441,1447,1577)
s1 <- "https://coraltraits.org/species/"
s2 <- ".csv"
# Initialize a list
x <- list()
# Loop through l and download the table as an element in the list
for(i in l) {
n <- paste(s1, i, s2, sep = "") # Creates download url for i
# Store the table under its species ID (indexing by the number i itself would pad the list up to position 1577)
x[[as.character(i)]] <- read.csv( curl(n) ) # reads download url for i
}
# Combine the list of data frames into one data frame
x <- do.call("rbind", x)
Just a warning: all the data frames in x must have the same columns to do this. If one of the entries in x has a different number of columns, or differently named columns, the rbind will fail.
More efficient row binding functions (with some extras, such as column filling) exist in several different packages. Take a look at some of these solutions for binding rows:
plyr::rbind.fill()
dplyr::bind_rows()
data.table::rbindlist()
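If adding a package is not an option, column filling can also be approximated in base R. The following is only a sketch, not how any of the packages above do it: the function name bind_fill and the two small data frames are made up for illustration. It adds each missing column as NA and then calls rbind once.

```r
# Hypothetical helper: row-bind data frames whose columns only partly overlap.
bind_fill <- function(dfs) {
  all.cols <- unique(unlist(lapply(dfs, names)))
  padded <- lapply(dfs, function(d) {
    # Add every column this data frame lacks, filled with NA.
    for (m in setdiff(all.cols, names(d))) d[[m]] <- rep(NA, nrow(d))
    d[all.cols]  # same column order everywhere
  })
  do.call(rbind, padded)
}

a <- data.frame(id = 1:2, x = c(0.1, 0.2), stringsAsFactors = FALSE)
b <- data.frame(id = 3L, y = "z", stringsAsFactors = FALSE)
out <- bind_fill(list(a, b))
```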
If they have the same columns then it's just a matter of appending the rows. A simple (but not memory efficient) approach is to use rbind in a loop:
l<-c(1441,1447,1577)
s1<-"https://coraltraits.org/species/"
s2<-".csv"
data <- NULL
for (i in l){
n<-paste(s1,i,s2, sep="") #creates download url for i
x <- read.csv( curl(n) ) #reads download url for i
#successively combine each of the 3 dataframes into one
data <- rbind(data,x)
}
A more efficient way would be to build a list and then combine them into a single data frame at the end, but I will leave that as an exercise for you.
I would like to find the frequency of each row of a data frame in multiple other data frames. In other words, I have to find out how many times each pair of values in a row appears in the other files.
I have 103 files. Read as data frames, they look like this:
V1 V2
1 xbc
1 xbd
1 xbf
2 xbr
2 xbt
3 xbu
3 xbi
3 xbo
(V2 is not numeric.) I have to find out how many times each row of a file appears in the other 102 files!
I tried a nested for loop, but it is damn slow, because each file has at least 4500 rows!
for(j in 1:nrow(df1)){
df1<- df1[j,] #select just one row of the data frame each time to find its frequency!
setwd("my path")
file<-list.files(pattern="^inp")
for (i in file) {
df<-read.table(i)
mylist<-list()
tmp <- merge(df1,df, c("V1", "V3"))
mylist[[i]] <- tmp #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
k<-toString(j) #to avoid connection error in write.table
setwd("my path")
write.table(x, file=k, append=FALSE, sep="\t", row.names=FALSE, col.names=FALSE)
}
I would appreciate it if anyone could suggest a faster way.
Here's a little example. dat2 is just dat1 with three values changed to show that we are finding them to be different from those in dat1.
> dat1 <- read.table(h=T, text = "V1 V2
1 xbc
1 xbd
1 xbf
2 xbr
2 xbt
3 xbu
3 xbi
3 xbo")
> dat2 <- dat1
> dat2[c(2, 4, 5),1] <- 0
You could iterate over the row indices, using %in% with all to determine whether row i of dat2 matches row i of dat1 in its entirety. This returns a logical vector, which we sum to get the number of rows of dat2 that are unchanged from dat1.
> sap <- sapply(seq(nrow(dat1)), function(i){
all(dat2[i, ] %in% dat1[i, ])
})
> sum(sap)
# [1] 5
As noted in the comments, we can rbind all the data together and then just use duplicated to get the result.
> d <- do.call(rbind, list(dat1, dat2))
> sum(duplicated(d))
[1] 5
It could be done in two steps.
First, compute statistics for each file, that is, the frequency of each line across the 102 other files. The result is smaller than the original file and holds the information you need.
The second step is similar to your code, but you can keep the statistics in memory and stop searching at the first occurrence.
It should be about 2 times faster (even without keeping the statistics in memory), but that depends on the data.
You can also sort the statistics and benefit from quick sort.
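One way to read the "statistics" idea is to count every (V1, V2) pair once and then look rows up in that table, instead of merging row by row. The sketch below is an illustration under assumptions: the three small data frames stand in for the files, and the helper key() and the "|" separator are made up (the separator must not occur in the data).

```r
# Made-up stand-ins for three of the files.
files <- list(
  f1 = data.frame(V1 = c(1, 1, 2), V2 = c("xbc", "xbd", "xbr")),
  f2 = data.frame(V1 = c(1, 2, 3), V2 = c("xbc", "xbr", "xbu")),
  f3 = data.frame(V1 = c(1, 0),    V2 = c("xbc", "xbd"))
)

# Hypothetical helper: one string per row, so rows can be compared as a unit.
key <- function(d) paste(d$V1, d$V2, sep = "|")

# Frequency of every row across all files except the first.
others <- unlist(lapply(files[-1], key))
counts <- table(others)

# Look up each row of the first file; rows never seen elsewhere count as 0.
idx <- match(key(files$f1), names(counts))
hits <- as.integer(counts)[idx]
hits[is.na(hits)] <- 0L
```

Here hits holds one count per row of f1. Each file is scanned once, rather than once per row.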
I am trying to populate a data frame from within a for loop in R. The names of the columns are generated dynamically within the loop and the value of some of the loop variables is used as the values while populating the data frame. For instance the name of the current column could be some variable name as a string in the loop, and the column can take the value of the current iterator as its value in the data frame.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I can't really do anything with it. The moment I try to populate it, I run into an error:
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What would be a good way to achieve what I am looking to do? Please let me know if I wasn't clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
vec <- numeric(5) #preallocate a numeric vector
for (j in 1:5) { #fill the vector
vec[j] <- i^j
}
mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
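For the dynamically named columns in the question, the same list-first idea applies. This is a minimal sketch in which the names var1, var2, ... and the values are made up: each column is stored in a named list inside the loop, and the list is converted to a data frame once at the end.

```r
cols <- list()  # one entry per future column
for (i in 1:3) {
  nm <- paste0("var", i)     # column name generated inside the loop
  cols[[nm]] <- rep(i, 2)    # column values derived from the iterator
}

# All entries have the same length, so the list converts directly.
d <- as.data.frame(cols)
```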
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
What this does:
Create a matrix with rows and columns according to the expected growth.
Insert 2 random numbers into each row of the matrix.
Convert it into a data frame after the loop has finished.
This works too.
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
The output will look like this (one of the 10 identical rows shown):
df #enter
x y z #col names
1 2 3
Thanks Notable1, works for me with the tidytextr
Create a dataframe with the name of files in one column and content in other.
diretorio <- "D:/base"
arquivos <- list.files(diretorio, pattern = "\\.PDF$") # pattern is a regular expression, not a glob
quantidade <- length(arquivos)
#
df = NULL
for (k in 1:quantidade) {
nome = arquivos[k]
print(nome)
Sys.sleep(1)
dados = read_pdf(arquivos[k],ocr = T)
print(dados)
Sys.sleep(1)
df = rbind(df, data.frame(nome,dados))
Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"
I had a case where I needed to use a data frame within a for loop. In this case it was "efficient" enough; keep in mind, however, that the database was small and the loop iterations were very simple. Maybe the code could be useful for someone with similar conditions.
The purpose of the for loop was to run the raster extract function over five locations (Tokyo, New York, São Paulo, Seoul and Mexico City), each of which had its own raster grids. I had a spatial point database with more than 1000 observations spread across the 5 locations, and I needed to extract information from 10 different raster grids (two grids per location). For the subsequent analysis, I needed not only the raster values but also the unique ID of each observation.
Preparing the spatial data included the following tasks:
Import points shapefile with the readOGR function (rgdal package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here is the for loop code, which uses a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
dat=subset(points,LOCATION==i) # select corresponding points for location [i]
t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
names(t)=c("VAR1","VAR2","ID")
TB=rbind(TB,t)
}
I was looking for the same thing, and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])
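Spelling out a[[1]], a[[2]], a[[3]] stops being practical once the list grows. A small sketch of the same result for a list of any length, using do.call() to pass every element of the list to rbind at once (the seed and the data are arbitrary):

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible

# Build the list of data frames without pre-sizing it.
a <- lapply(1:3, function(i) data.frame(x = rnorm(2), y = runif(2)))

# Equivalent to rbind(a[[1]], a[[2]], a[[3]]), whatever the list length.
combined <- do.call(rbind, a)
```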