I am trying to populate a data frame from within a for loop in R. The column names are generated dynamically within the loop, and the values of some loop variables are used to populate the data frame. For instance, the name of the current column could be a variable name held as a string in the loop, and the column could take the value of the current iterator.
I tried to create an empty data frame outside the loop, like this
d = data.frame()
But I can't really do anything with it; the moment I try to populate it, I run into an error:
d[1] = c(1,2)
Error in `[<-.data.frame`(`*tmp*`, 1, value = c(1, 2)) :
replacement has 2 rows, data has 0
What would be a good way to achieve this? Please let me know if I wasn't clear.
It is often preferable to avoid loops and use vectorized functions. If that is not possible there are two approaches:
Preallocate your data.frame. This is not recommended because indexing is slow for data.frames.
Use another data structure in the loop and transform into a data.frame afterwards. A list is very useful here.
Example to illustrate the general approach:
mylist <- list() #create an empty list
for (i in 1:5) {
  vec <- numeric(5) #preallocate a numeric vector
  for (j in 1:5) { #fill the vector
    vec[j] <- i^j
  }
  mylist[[i]] <- vec #put all vectors in the list
}
df <- do.call("rbind",mylist) #combine all vectors into a matrix
In this example it is not necessary to use a list, you could preallocate a matrix. However, if you do not know how many iterations your loop will need, you should use a list.
Finally here is a vectorized alternative to the example loop:
outer(1:5,1:5,function(i,j) i^j)
As you see it's simpler and also more efficient.
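For the original question (column names generated inside the loop), the same list approach can be sketched as follows. The names `col_names` and `cols` are made up for illustration, and this is only one possible way to do it: fill a named list in the loop, then convert it to a data frame once at the end.

```r
# Hypothetical column names that would normally be generated inside the loop
col_names <- paste0("var_", 1:3)

cols <- list()                      # one list element per future column
for (i in seq_along(col_names)) {
  # use the dynamically generated name as the list key,
  # and values derived from the iterator as the column content
  cols[[col_names[i]]] <- c(i, i * 10)
}

d <- as.data.frame(cols)            # convert once, after the loop
```

Every list element needs the same length for the conversion to work, since each element becomes one column.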
You could do it like this:
iterations = 10
variables = 2
output <- matrix(ncol=variables, nrow=iterations)
for(i in 1:iterations){
  output[i,] <- runif(2)
}
output
and then turn it into a data.frame
output <- data.frame(output)
class(output)
What this does:
Create a matrix with the number of rows and columns the loop is expected to fill.
Insert 2 random numbers into each row of the matrix.
Convert the matrix into a data frame after the loop has finished.
This works too:
df = NULL
for (k in 1:10)
{
x = 1
y = 2
z = 3
df = rbind(df, data.frame(x,y,z))
}
The output will look like this (one row per iteration, so 10 identical rows in total; the first two are shown):
df #enter
  x y z #col names
1 1 2 3
2 1 2 3
Thanks Notable1, works for me with the tidytextr
This creates a data frame with the file names in one column and their content in the other.
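diretorio <- "D:/base"
# full.names = TRUE returns full paths, so the files can be read from any working directory
arquivos <- list.files(diretorio, pattern = "\\.PDF$", full.names = TRUE)
quantidade <- length(arquivos)
#
df = NULL
for (k in seq_len(quantidade)) { # seq_len() skips the loop when no files are found
  nome = basename(arquivos[k])
  print(nome)
  Sys.sleep(1)
  # read_pdf() is assumed here to come from the textreadr package
  dados = read_pdf(arquivos[k], ocr = TRUE)
  print(dados)
  Sys.sleep(1)
  df = rbind(df, data.frame(nome, dados))
  Sys.sleep(1)
}
Encoding(df$text) <- "UTF-8"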
I had a case where I needed to use a data frame within a for loop. In this case it was efficient enough; however, keep in mind that the database was small and the iterations in the loop were very simple. Still, the code may be useful for someone with similar conditions.
The purpose of the for loop was to use the raster extract function across five locations (i.e. Tokyo, New York, São Paulo, Seoul & Mexico City), each with its own raster grids. I had a spatial point database with more than 1000 observations spread across the 5 locations, and I needed to extract information from 10 different raster grids (two grids per location). For the subsequent analysis I needed not only the raster values but also the unique ID of each observation.
Preparing the spatial data included the following tasks:
Import points shapefile with the readOGR function (rgdal package)
Import raster files with the raster function (raster package)
Stack grids from the same location into one file, with the function stack (raster package)
Here is the for loop code using a data frame:
1. Add stacked rasters per location into a list
raslist <- list(LOC1,LOC2,LOC3,LOC4,LOC5)
2. Create an empty dataframe, this will be the output file
TB <- data.frame(VAR1=double(),VAR2=double(),ID=character())
3. Set up for loop function
L1 <- seq(1,5,1) # the location ID is a numeric variable with values from 1 to 5
for (i in 1:length(L1)) {
  dat=subset(points,LOCATION==i) # select corresponding points for location [i]
  t=data.frame(extract(raslist[[i]],dat),dat$ID) # run extract function with points & raster stack for location [i]
  names(t)=c("VAR1","VAR2","ID")
  TB=rbind(TB,t)
}
I was looking for the same thing, and the following may be useful as well.
a <- vector("list", 1)
for(i in 1:3){a[[i]] <- data.frame(x= rnorm(2), y= runif(2))}
a
rbind(a[[1]], a[[2]], a[[3]])
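With more than a handful of pieces, writing out every element in `rbind()` gets unwieldy. A minimal sketch of the same idea using `do.call()`, which passes all list elements to `rbind()` at once (the name `pieces` and the `set.seed()` call are illustrative additions):

```r
set.seed(1)                          # only to make the sketch reproducible
pieces <- vector("list", 3)          # preallocate one slot per iteration
for (i in 1:3) {
  pieces[[i]] <- data.frame(x = rnorm(2), y = runif(2))
}

# equivalent to rbind(pieces[[1]], pieces[[2]], pieces[[3]])
combined <- do.call(rbind, pieces)
```

This also avoids growing a data frame with `rbind()` inside the loop, which copies the accumulated data on every iteration.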
I have a list of 33 data frames (each data frame has a different number of rows). I am trying to write a nested for loop that goes through each data frame in the list, then through each row within that data frame, applying a function, before moving on to the next data frame in the list. However, I'm not sure how to index a specific row within a data frame within a list. If anyone knows how to do this, or a more efficient way of doing it, it would be much appreciated.
Thanks.
for (i in 1:length(data.list)) {
#Creating a matrix of all possible combinations of pairs in order to do pairwise comparisons on all of the sites
pairs = t(combn(nrow(data.list[[i]]), m = 2))
#Some more data wrangling
pairs <- as.data.frame(pairs)
colnames(pairs) <- c("PaperOneRowNumber", "PaperTwoRowNumber")
pairs$LRR <- 0
pairs$LRR_var <- 0
for (j in 1:nrow(pairs)) {
#print(i)
#Assigning Paper IDs to variables
a <- pairs[j,1]
b <- pairs[j,2]
#print(a)
#print(b)
paperone <- data.list[[i[a,]]]
papertwo <- data.list[[i[b,]]]
#print(paperone)
#print(papertwo)
#Inputting variables into calc.effect function and saving the output
effect.size <- calc.effect(paperone, papertwo)
#print(effect.size)
pairs$LRR[j] <- effect.size$LRR
pairs$LRR_var[j] <- effect.size$LRR_var
}
}
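On the indexing question: `data.list[[i]]` returns the i-th data frame, and `[a, ]` then selects row `a` from it, so the two are chained as `data.list[[i]][a, ]` rather than `data.list[[i[a,]]]`. A minimal sketch with made-up data follows. `calc.effect` is not shown in the question, so a hypothetical stand-in `toy_effect` is used, and `results` is an added list that keeps each `pairs` table.

```r
# Made-up stand-ins for the question's data.list and calc.effect()
data.list <- list(
  data.frame(mean = c(1, 2, 4)),
  data.frame(mean = c(3, 6))
)
toy_effect <- function(p1, p2) list(LRR = log(p2$mean / p1$mean))

results <- vector("list", length(data.list))   # one pairs table per data frame
for (i in seq_along(data.list)) {
  pairs <- as.data.frame(t(combn(nrow(data.list[[i]]), m = 2)))
  colnames(pairs) <- c("PaperOneRowNumber", "PaperTwoRowNumber")
  pairs$LRR <- 0
  for (j in seq_len(nrow(pairs))) {
    # row a and row b of data frame i; drop = FALSE keeps them as data frames
    paperone <- data.list[[i]][pairs[j, 1], , drop = FALSE]
    papertwo <- data.list[[i]][pairs[j, 2], , drop = FALSE]
    pairs$LRR[j] <- toy_effect(paperone, papertwo)$LRR
  }
  results[[i]] <- pairs   # otherwise pairs is overwritten on the next i
}
```

Storing `pairs` in `results[[i]]` matters because the loop in the question rebuilds `pairs` for every data frame, so only the last one would survive.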
I am trying to subset this data frame by pre determined row numbers.
# Make dummy data frame
df <- data.frame(data=1:200)
train.length <- 1:2
# Set pre determined row numbers for subsetting
train.length.1 = 1:50
test.length.1 = 50:100
train.length.2 = 50:100
test.length.2 = 100:150
train.list <- list()
test.list <- list()
# Loop for subsetting by row, using row numbers in variables above
for (i in 1:length(train.length)) {
# subset by row number, each row number in variables train.length.1,2etc..
train.list[[i]] <- df[train.length.[i],] # need to place the variable train.length.n here...
test.list[[i]] <- df[test.length.[i],] # place test.length.n variable here..
# save outcome to lists
}
My question is: if I have my row numbers stored in variables, how do I place the [i]th one inside the subsetting code?
I have tried:
df[train.length.[i],]
also
df[paste0"train.length.",[i],]
However, that pastes the name as a character string and doesn't read my train.length.n variable, as shown below:
> train.list[[i]] <- df[c(paste0("train.length.",train.length[i])),]
> train.list
[[1]]
data data1
NA NA NA
If I put the variable in there by itself, it works as intended. I just need it to work in a for loop.
Desired output - print those below
train.set.output.1 <- df[train.length.1,]
test.set.output.1 <- df[test.length.1,]
train.set.output.2 <- df[train.length.2,]
test.set.output.2 <- df[test.length.2,]
I can do this manually, but it's cumbersome for lots of train / test sets, hence the for loop.
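The string built by `paste0()` is only a name, not the object it names; `get()` looks the object up by that name. Keeping the row numbers in a list avoids the lookup altogether. A sketch of both options, reusing names from the question (`train.idx`, `test.idx` and `train.list2` are new, illustrative names):

```r
df <- data.frame(data = 1:200)
train.length.1 <- 1:50
train.length.2 <- 50:100

# Option 1: look each variable up by its constructed name
train.list <- list()
for (i in 1:2) {
  train.list[[i]] <- df[get(paste0("train.length.", i)), , drop = FALSE]
}

# Option 2: store the row numbers in lists and index by position
train.idx <- list(1:50, 50:100)
test.idx  <- list(50:100, 100:150)
train.list2 <- lapply(train.idx, function(rows) df[rows, , drop = FALSE])
test.list   <- lapply(test.idx,  function(rows) df[rows, , drop = FALSE])
```

`drop = FALSE` is used because the dummy `df` has a single column, and without it the subset would collapse to a plain vector.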
Consider staggered seq() and pass the number sequences in lapply to slice by rows. Also, for equal-length dataframes, you likely intended starts at 1, 51, 101, ...
train_num_set <- seq(1, 200, by=50)
train.list <- lapply(train_num_set, function(i) df[c(i:(i+49)),])
test_num_set <- seq(51, 200, by=50)
test.list <- lapply(test_num_set, function(i) df[c(i:(i+49)),])
Create a function that splits your data frame into different chunks:
split_frame_by_chunks <- function(data_frame, chunk_size) {
  n <- nrow(data_frame)
  r <- rep(1:ceiling(n/chunk_size),each=chunk_size)[1:n]
  sub_frames <- split(data_frame,r)
  return(sub_frames)
}
Call your function using your data frame and chunk size. In your case, you are splitting your data frame into chunks of 50:
chunked_frames <- split_frame_by_chunks(df, 50)
Decide number of train/test splits to create in the loop
num_splits <- 2
Create the appropriate train and test sets inside your loop. In this case, I am creating the 2 you showed in your question (i.e. the first iteration creates a train and test set with rows 1-50 and 51-100 respectively). Double square brackets are needed to get the data frame itself rather than a list of length one:
for(i in 1:num_splits) {
  this_train <- chunked_frames[[i]]
  this_test <- chunked_frames[[i+1]]
}
Just do whatever you need to the dynamically created train and test frames inside your loop.
My data frame contains 22 columns: "DATE", "INDEX" and S1, S2, S3 ... S20. There are over 4322 rows. I want to calculate log returns and store the results in a data frame. That should give me 4321 rows.
I run this code, but I am sure there is a much more elegant way to do the calculation in a short way.
# count the sum of rows in order to make the following formula work appropriately - (n-1)
n <- nrow(df)
# calculating the log returns (natural logarithm), of INDEX and S1-20
LogRet_INDEX <- log(df$INDEX[2:n])-log(df$INDEX[1:(n-1)])
LogRet_S1 <- log(df$S1[2:n])-log(df$S1[1:(n-1)])
LogRet_S2 <- log(df$S2[2:n])-log(df$S2[1:(n-1)])
LogRet_S3 <- log(df$S3[2:n])-log(df$S3[1:(n-1)])
LogRet_S4 <- log(df$S4[2:n])-log(df$S4[1:(n-1)])
LogRet_S5 <- log(df$S5[2:n])-log(df$S5[1:(n-1)])
LogRet_S6 <- log(df$S6[2:n])-log(df$S6[1:(n-1)])
LogRet_S7 <- log(df$S7[2:n])-log(df$S7[1:(n-1)])
LogRet_S8 <- log(df$S8[2:n])-log(df$S8[1:(n-1)])
LogRet_S9 <- log(df$S9[2:n])-log(df$S9[1:(n-1)])
LogRet_S10 <- log(df$S10[2:n])-log(df$S10[1:(n-1)])
LogRet_S11 <- log(df$S11[2:n])-log(df$S11[1:(n-1)])
LogRet_S12 <- log(df$S12[2:n])-log(df$S12[1:(n-1)])
LogRet_S13 <- log(df$S13[2:n])-log(df$S13[1:(n-1)])
LogRet_S14 <- log(df$S14[2:n])-log(df$S14[1:(n-1)])
LogRet_S15 <- log(df$S15[2:n])-log(df$S15[1:(n-1)])
LogRet_S16 <- log(df$S16[2:n])-log(df$S16[1:(n-1)])
LogRet_S17 <- log(df$S17[2:n])-log(df$S17[1:(n-1)])
LogRet_S18 <- log(df$S18[2:n])-log(df$S18[1:(n-1)])
LogRet_S19 <- log(df$S19[2:n])-log(df$S19[1:(n-1)])
LogRet_S20 <- log(df$S20[2:n])-log(df$S20[1:(n-1)])
# adding the results from the previous calculation (log returns) to a data frame
LogRet_df <- data.frame(LogRet_INDEX, LogRet_S1, LogRet_S2, LogRet_S3, LogRet_S4, LogRet_S5, LogRet_S6, LogRet_S7, LogRet_S8, LogRet_S9, LogRet_S10, LogRet_S11, LogRet_S12, LogRet_S13, LogRet_S14, LogRet_S15, LogRet_S16, LogRet_S17, LogRet_S18, LogRet_S19, LogRet_S20)
Is there a possibility to make this code shorter? Maybe some kind of loop or using a for argument? Since I am quite new to R, I try to improve my knowledge.
Any kind of help is highly appreciated!
You can use sapply to apply a function to each column of the data.frame.
What the code below does: 1) take columns 2 to 22 from the data frame called df; 2) for each of these columns, take the logarithm and then calculate the difference between each pair of neighboring rows; 3) convert the result into a data frame called df2.
df2 <- as.data.frame(sapply(df[2:22], function(x) diff(log(x))))
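To see that `diff(log(x))` matches the formula in the question, here is a small check on made-up numbers (the toy `df` and the optional renaming step are illustrative, not part of the original answer):

```r
# Toy stand-in for the question's data: a date column plus two price series
df <- data.frame(DATE = 1:4,
                 INDEX = c(100, 110, 121, 121),
                 S1 = c(10, 20, 10, 40))
n <- nrow(df)

# the question's formula for one column
manual <- log(df$S1[2:n]) - log(df$S1[1:(n - 1)])

# the vectorized version applied to every column except DATE
df2 <- as.data.frame(sapply(df[2:3], function(x) diff(log(x))))

# optional: prefix the column names as in the question
names(df2) <- paste0("LogRet_", names(df2))
```

The result has one row fewer than the input, as expected for returns.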
I want to properly redefine elements of a multidimensional matrix using assign in R.
I tried this
lat = 3
lon = 3
laidx = 1:3
loidx = 1:3
otherDF is a 3x3 multidimensional matrix, and each element is a large data frame
for (i in 1:12){
assign(paste("STAT",i,sep=""),array(list(NA), dim=c(length(lat),length(lon))))
for (lo in loidx){
for (la in laidx){
assign(paste("STAT",i,"[",la,",",lo,"]",sep=""), as.data.frame(do.call(rbind,otherDF[la,lo])))
# otherDF[la,lo] are data frames
}
}
}
First I created 12 empty matrices STAT1, STAT2, ..., STAT12 (I need 12, one for each month).
Then I tried to fill them with elements of another data frame, but instead of filling them, the code creates a lot of new variables with names like `STAT10[[1,1]]`.
Some help, please.
Since you don't provide your data, I made up some:
lat = 3
lon = 3
otherDF <- data.frame(A=1:3, B=4:6, C=7:9)
loidx <- 1:3
laidx <- 1:3
I avoid the nested for loops and the second assign statement with expand.grid and sapply(iter(idx, by="row"), function(x) otherDF[x$Var1, x$Var2]).
install.packages("iterators") # only needed once
library(iterators) # load once, outside the loop
idx <- expand.grid(loidx,laidx) # expands all combinations of elements in loidx and laidx
for (i in 1:12){
  assign(paste0("STAT",i), matrix(sapply(iter(idx, by="row"), function(x) otherDF[x$Var1, x$Var2]), ncol=3))
}
I made some guesses on what you wanted based on your code, so edit your original post if you wanted something different.
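One reason the original attempt creates new variables: `assign()` treats its first argument as a literal variable name, so `assign("STAT1[1,1]", value)` creates an object that is literally called `STAT1[1,1]` instead of modifying an element of `STAT1`. A common alternative is to keep the twelve matrices in a list. A minimal base R sketch with made-up cell contents (`stat_list`, `n_lat`, `n_lon` and the small data frames are illustrative assumptions):

```r
n_lat <- 3
n_lon <- 3

# One list element per month, each a 3x3 list-matrix that can hold data frames
stat_list <- vector("list", 12)
for (i in 1:12) {
  m <- array(list(NA), dim = c(n_lat, n_lon))
  for (lo in seq_len(n_lon)) {
    for (la in seq_len(n_lat)) {
      # made-up cell content; in the question this would come from otherDF[la, lo]
      m[[la, lo]] <- data.frame(month = i, la = la, lo = lo)
    }
  }
  stat_list[[i]] <- m
}

# What assign() does with an index in the name: a new, oddly named object
assign("STAT1[1,1]", 42)
```

The data frame for month 10 at position [2, 3] is then `stat_list[[10]][[2, 3]]`, with no need to build variable names from strings.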
I'm a bit new to R, and this site has been an amazing help to me in answering a lot of questions. However, I've come across a problem, have exhausted all options to find a solution on my own, and am in need of some help.
I am trying to write code that creates multiple data frames (or matrices) INSIDE a loop that runs 5000 times. On each iteration I would like the variable name to change so I can retrieve the data for each iteration at a later point.
Also, I would like to be able to repeat this method for other data frames and in creating these new data frames, it draws upon other data frames based on the iteration it is on.
I have tried to find a solution to this and it seems that it could be either the for loop or apply function, but I am not sure as to how I could execute it. As an example of what I would like to see:
for (i in 1:10) {
df.a[i] <- data.frame (…information...)
df.b[i] <- data.frame (...information...)
df.c[i] <- data.frame (new.col.A=df.a[i]$column1, new.col.B=df.b[i]$column2)
}
Then, after having run the loop, if I were to write df.c3 I would find the data frame created in the loop on the third iteration which has data from iteration 3 in df.a and df.b.
The ‘closest’ I have come to getting what I thought I needed was by doing this:
df.a = seq (1, 10, by=1)
df.b = seq (1,10, by=1)
df.c = seq (1,10, by=1)
for (i in 1:10) {
df.a[[i]] <- data.frame (...information)
...
}
But this typically results in an error of: "number of items to replace is not a multiple of replacement length".
So I'm not sure what else I could do and really hope someone is able to help out.
Create objects df.x as empty lists:
df.a <- list()
df.b <- list()
df.c <- list()
Then access (and write to) individual data frames using double square brackets:
for (i in 1:10) {
df.a[[i]] <- data.frame(...)
df.b[[i]] <- data.frame(...)
df.c[[i]] <- data.frame(new.col.A=df.a[[i]]$column1, new.col.B=df.b[[i]]$column2)
}
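A runnable version of the same pattern, with made-up contents standing in for the elided `data.frame(...)` calls (the column values are illustrative only):

```r
df.a <- list()
df.b <- list()
df.c <- list()

for (i in 1:10) {
  # made-up contents; the question does not say what the real data frames hold
  df.a[[i]] <- data.frame(column1 = i * 1:3)
  df.b[[i]] <- data.frame(column2 = i + 1:3)
  df.c[[i]] <- data.frame(new.col.A = df.a[[i]]$column1,
                          new.col.B = df.b[[i]]$column2)
}

third <- df.c[[3]]   # the data frame built on the third iteration
```

So instead of a variable named `df.c3`, the third result is reached as `df.c[[3]]`.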