Generate a sequence of data frames from a function

I searched but couldn't find a similar question, so apologies in advance if this is a duplicate. I am trying to generate data frames from within a for loop in R.
What I want to do:
Define the columns of each data frame by a function,
Generate n data frames (n being the length of my sequence of data frames) using a loop.
As an example I will use n=100:
n<-100
k<-8
d1 <- data.frame()
for(i in 1:(k)) {d1 <- rbind(d1,c(a="i+1",b="i-1",c="i/1"))}
d2 <- data.frame()
for(i in 1:(k+2)) {d2 <- rbind(d2,c(a="i+2",b="i-2",c="i/2"))}
...
d100 <- data.frame()
for(i in 1:(k+100)) {d100 <- rbind(d100,c(i+100, i-100, i/100))}
It is clear that It'll be difficult to construct one by one each data.frame. I tried this:
d<-list()
for(j in 1:100) {
d[j] <- data.frame()
for(i in 1:(k+j)) {d[j] <- rbind(d[j] ,c(i+j, i-j, i/j))}
}
But I can't really do anything with it; I run into an error:
Error in d[j] <- data.frame() : replacement has length zero
In addition: Warning messages:
1: In d[j] <- rbind(d[j], c(i + j, i - j, i/j)) :
number of items to replace is not a multiple of replacement length
And a few more remarks about my example:
the number of rows is not the same in each data frame: d1 has 8 rows, d2 has 10 rows, and d100 has 8+100 rows,
the algorithm should give us: D=(d1, d2, ..., d100).
It would be great to get an answer using the same approach (rbind) and a more base-R-like approach. Both will aid my understanding. Of course, please point out where I'm going wrong if it's obvious.

Here's how to create an empty data.frame (and it's not what you are trying); see the question "Create an empty data.frame".
And you should not be creating 100 separate dataframes but rather a list of dataframes. I would not do it with rbind, since that would be very slow. Instead I would create them with a function that returns a dataframe of the required structure:
make_df <- function(n,var) {data.frame( a=(1:n)+var,b=(1:n)-var,c=(1:n)/var) }
mylist <- setNames(
lapply(1:100, function(n) make_df(n,n)) , # the dataframes
paste0("d_", 1:100)) # the names for access
head(mylist,3)
#---------------
$d_1
a b c
1 2 0 1
$d_2
a b c
1 3 -1 0.5
2 4 0 1.0
$d_3
a b c
1 4 -2 0.3333333
2 5 -1 0.6666667
3 6 0 1.0000000
Then if you want the "d_40" dataframe it's just:
mylist[[ "d_40" ]]
Or
mylist$d_40
If you want to perform the same operation on all of them, or get a result from all of them at once, just use lapply:
lapply(mylist, nrow) # will be a list
Or:
sapply(mylist, nrow) #will be a vector because each value is the same length.
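Since the question also asked for an rbind-based version, here is a minimal sketch of that approach, kept mainly for understanding. It assumes the row count k + j from the question's loop, and the names d and dj are only illustrative. The key change is d[[j]] instead of d[j]: single brackets replace list elements one at a time, so a data frame gets split into its columns, whereas double brackets store the whole data frame as one element.

```r
k <- 8
n <- 100
d <- vector("list", n)                 # pre-allocate a list with n empty slots

for (j in seq_len(n)) {
  dj <- data.frame()                   # start empty, then grow row by row
  for (i in seq_len(k + j)) {
    dj <- rbind(dj, data.frame(a = i + j, b = i - j, c = i / j))
  }
  d[[j]] <- dj                         # [[ ]] stores the whole data frame in slot j
}
names(d) <- paste0("d", seq_len(n))    # d$d1, d$d2, ..., d$d100
```

This grows every data frame one row at a time, so it is noticeably slower than building the columns in one go as make_df does.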

Related

Appending to an R List one by one

Let's say I have data like:
> data[295:300,]
Date sulfate nitrate ID
295 2003-10-22 NA NA 1
296 2003-10-23 NA NA 1
297 2003-10-24 3.47 0.363 1
298 2003-10-25 NA NA 1
299 2003-10-26 NA NA 1
300 2003-10-27 NA NA 1
Now I would like to add all the nitrate values into a new list/vector. I'm using the following code:
i <- 1
my_list <- c()
for(val in data)
{
my_list[i] <- val
i <- i + 1
}
But this is what happens:
Warning message:
In x[i] <- val :
number of items to replace is not a multiple of replacement length
> i
[1] 2
> x
[1] NA
Where am I going wrong? The data is part of the Coursera R Programming coursework. I can assure you that this is not an assignment/quiz. I have been trying to understand the best way to append elements to a list with a loop. I have not yet reached the lapply or sapply part of the coursework, so I am thinking about workarounds.
Thanks in advance.
If it's a duplicate question, please direct me to it.

As mentioned in the comments, you are not looping over the rows of your data frame but over its columns (sometimes also called variables). Hence, loop over data$nitrate.
i <- 1
my_list <- c()
for(val in data$nitrate)
{
my_list[i] <- val
i <- i + 1
}
Now, instead of looping over the values, a better way is to exploit the fact that the new vector and the old data share the same index, and loop over the index i. How do you tell R how many indexes there are? Here you have several choices again: 1:nrow(data), 1:length(data$nitrate) and several other ways. Below are three equivalent ways of extracting a value from the data frame; any one of the three lines is enough on its own.
my_vector <- c()
for(i in 1:nrow(data)){
my_vector[i] <- data$nitrate[i] ## Version 1 of extracting from data.frame
my_vector[i] <- data[i,"nitrate"] ## Version 2: [row,column name]
my_vector[i] <- data[i,3] ## Version 3: [row,column number]
}
My suggestion: Rather than calling the collection a list, call it a vector, since that is what it is. Vectors and lists behave a little differently in R.
Of course, in reality you don't want to get the data out one by one. A much more efficient way of getting your data out is
my_vector2 <- data$nitrate
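To make the column-versus-row point concrete, here is a small sketch on a made-up stand-in for the course data (the values are invented for illustration only). It counts how many steps a loop over the data frame takes, and then fills a pre-allocated vector by index using seq_len(), which is an alternative to 1:nrow(data) that also behaves when there are zero rows.

```r
# Toy stand-in for the course data; values are made up for illustration
data <- data.frame(
  Date    = as.Date("2003-10-22") + 0:2,
  sulfate = c(NA, NA, 3.47),
  nitrate = c(NA, NA, 0.363),
  ID      = 1L
)

# Looping over a data frame visits one whole column per step
visited <- 0
for (col in data) visited <- visited + 1
visited == ncol(data)                      # one step per column, not per row

# Pre-allocate the result, then fill it by index
my_vector <- numeric(nrow(data))
for (i in seq_len(nrow(data))) my_vector[i] <- data$nitrate[i]

identical(my_vector, data$nitrate)         # same values as direct extraction
```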

R: Repeat script n times, changing variables in each iteration

I have a script that I want to repeat n times, where some variables are changed by 1 each iteration. I'm creating a data frame consisting of the standard deviation of the difference of various vectors. My script currently looks like this:
standard.deviation <- data.frame(
c(
sd(diff(t1[,1])),
sd(diff(t1[,2])),
sd(diff(t1[,3])),
sd(diff(t1[,4])),
sd(diff(t1[,5]))
),
c(
sd(diff(t2[,1])),
sd(diff(t2[,2])),
sd(diff(t2[,3])),
sd(diff(t2[,4])),
sd(diff(t2[,5]))
),
c(
sd(diff(t3[,1])),
sd(diff(t3[,2])),
sd(diff(t3[,3])),
sd(diff(t3[,4])),
sd(diff(t3[,5]))
)
)
I want to write the script creating the vector only once, and repeat it n times (n=3 in this example) so that I end up with n vectors. In each iteration, I want to add 1 to a variable (in this case: 1 -> 2 -> 3, so the number next to 't'). t1, t2 and t3 are all separate data frames, and I can't figure out how to loop a script with changing data frame names.
1) How to make this happen?
2) I would also like to divide each sd value in a row by the row number. How would I do this?
3) I will be using 140 data frames in total. Is there a way to call all of these with a simple function, rather than making a list and adding each of the 140 data frames individually?

Use functions to get more readable code:
set.seed(123) # so you'll get the same number as this example
t1 <- t2 <- t3 <- data.frame(replicate(5,runif(10)))
# make a function for your sd of diff
sd.cols <- function(data) {
# loop over the df columns
sapply(data,function(x) sd(diff(x)))
}
# make a list of your data frames
dflist <- list(sdt1=t1,sdt2=t2,sdt3=t3)
# Loop over the list
result <- data.frame(lapply(dflist,sd.cols))
Which gives:
> result
sdt1 sdt2 sdt3
1 0.4887692 0.4887692 0.4887692
2 0.5140287 0.5140287 0.5140287
3 0.2137486 0.2137486 0.2137486
4 0.3856857 0.3856857 0.3856857
5 0.2548264 0.2548264 0.2548264

Assuming that you always want to use columns 1 to 5...
# some data
t3 <- t2 <- t1 <- as.data.frame(matrix(rnorm(100),10,10))
# script itself
lis=list(t1,t2,t3)
sapply(lis,function(x) sapply(x[,1:5],function(y) sd(diff(y))))
# [,1] [,2] [,3]
# V1 1.733599 1.733599 1.733599
# V2 1.577737 1.577737 1.577737
# V3 1.574130 1.574130 1.574130
# V4 1.158639 1.158639 1.158639
# V5 0.999489 0.999489 0.999489
The output is a matrix, so as.data.frame should fix that.
For completeness: As @Tensibai mentions, you can just use mget(ls(pattern="^t[0-9]+$")), assuming that all your variables are named t followed by a number. mget already returns a named list, so no extra list() wrapper is needed.
Edit: Thanks to @Tensibai for pointing out a missing step, improving the code, and suggesting the mget step.
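The pieces above can be put together roughly as follows. This is only a sketch on random stand-in data: the objects dfs, sds and scaled are illustrative names, and the division by row number is one reading of the second part of the question. Note that ls() sorts names alphabetically, so with many data frames t10 comes before t2.

```r
set.seed(1)
t1 <- as.data.frame(matrix(rnorm(50), 10, 5))
t2 <- as.data.frame(matrix(rnorm(50), 10, 5))
t3 <- as.data.frame(matrix(rnorm(50), 10, 5))

# Collect every object named t<number> into one named list
dfs <- mget(ls(pattern = "^t[0-9]+$"))

# Matrix with one column per data frame and one row per original column
sds <- sapply(dfs, function(df) sapply(df[, 1:5], function(x) sd(diff(x))))

# Divide each row by its row number: the divisor 1..5 recycles down each column
scaled <- sds / seq_len(nrow(sds))

as.data.frame(scaled)
```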

You can iterate through a list of the data frames...
ans <- data.frame()
dats <- list(t1, t2, t3) # a list, so each element stays a whole data frame
for (k in dats){
temp <- c()
for (k2 in c(1,2,3,4,5)){
temp <- c(temp , sd(diff(k[,k2]))) # sd of the differences, as in the question
}
ans <- rbind(ans,temp)
}
rownames(ans) <- c("t1","t2","t3")
colnames(ans) <- c(1,2,3,4,5)
attr(ans,"title") <- "standard deviation"

R: Using a vector to feed dataframe names for sapply

I'm quite new to R, and I am trying to use it to organize and extract info from some tables into different but similar tables. Instead of repeating the commands and changing only the table names:
#DvE, DvS, and EvS are dataframes
Sum.DvE <- data.frame(DvE$genes, DvE$FDR, DvE$logFC)
names(Sum.DvE) <- c("gene","FDR","log2FC")
Sum.DvS <- data.frame(DvS$genes, DvS$FDR, DvS$logFC)
names(Sum.DvS) <- c("gene","FDR","log2FC")
Sum.EvS <- data.frame(EvS$genes, EvS$FDR, EvS$logFC)
names(Sum.EvS) <- c("gene","FDR","log2FC")
I thought it would be easier to create a vector of the table names, and feed it into a for loop:
Sum.Comp <- c("DvE","DvS","EvS")
for(i in 1:3){
Sum.Comp[i] <- data.frame(i$genes, i$FDR, i$logFC)
names(Sum.Comp[i]) <- c("gene","FDR","log2FC")
}
But I get
>Error in i$genes : $ operator is invalid for atomic vectors
which I kind of expected because I was just trying it out, but can someone tell me if what I want to do can be done some other way, or if you have some suggestions for me, that would be much appreciated!
Clarification: Basically I'm trying to ask if there's a way to feed a dataframe name into a for loop through a vector, because I think I get the error because R doesn't realize "i" in the for loop stands for a dataframe name. This is a more simplified example:
DF1 <- data.frame(A=1:5, B=1:5, C=1:5, D=1:5)
DF2 <- data.frame(A=10:15, B=10:15, C=10:15, D=10:15)
DF3 <- data.frame(A=20:25, B=20:25, C=20:25, D=20:25)
DFs <- c("DF1", "DF2", "DF3")
for (i in 1:3){
New.i <- data.frame(i$A, i$D)
}
And I'd like it to make 3 new dataframes called "New.DF1", "New.DF2", "New.DF3" with example outputs like:
New.DF1
A D
1 1
2 2
3 3
4 4
5 5
New.DF2
A D
10 10
11 11
12 12
13 13
14 14
15 15
Thank you!

Not entirely sure I understand your problem, but the code below may do what you're asking. I've created simple values for the input data frames for testing.
DvE <- data.frame(genes=1:2, FDR=2:3, logFC=3:4)
DvS <- data.frame(genes=4, FDR=5, logFC=6)
EvS <- data.frame(genes=7, FDR=8, logFC=9)
df_names <- c("DvE","DvS", "EvS")
sum_df <- function(x) data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
for(df in df_names) {
assign(paste("Sum.",df,sep=""), do.call("sum_df", list(as.name(df)) ) )
}

Instead of operating on the names of variables, it would be easier to store the data frames you want to process in a list and then process them with lapply:
to.process <- list(DvE, DvS, EvS)
processed <- lapply(to.process, function(x) {
data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
})
Now you can access the new data frames with processed[[1]], processed[[2]], and processed[[3]].
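If you still want to drive the process from a character vector of names and keep names like New.DF1, one possible sketch is to look the data frames up with mget and name the results. It uses the simplified DF1 to DF3 example from the question (with column C in DF3); originals and new_dfs are illustrative names.

```r
DF1 <- data.frame(A = 1:5,   B = 1:5,   C = 1:5,   D = 1:5)
DF2 <- data.frame(A = 10:15, B = 10:15, C = 10:15, D = 10:15)
DF3 <- data.frame(A = 20:25, B = 20:25, C = 20:25, D = 20:25)

df_names <- c("DF1", "DF2", "DF3")

# mget() turns the vector of names into a named list of the actual data frames
originals <- mget(df_names)

# Keep only columns A and D of each one
new_dfs <- lapply(originals, function(x) x[, c("A", "D")])
names(new_dfs) <- paste0("New.", df_names)

new_dfs[["New.DF2"]]
```

Keeping the results in one named list avoids creating variables with assign, and the whole collection can still be processed with lapply.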

I have a numeric list where I'd like to add 0 or NA to extend the length of the list

I have 5 lists that need to be the same length as the lists will be combined into a dataframe. One of them may not be the same length as the other 4 so what I currently have is an if statement that checks the length against the length of one of the other lists and then...
1) I create a temporary list using rep( NA, length ) where length is the extra elements I need to add to extend the list
2) I use the concat function c() to combine the list that needs extending with the list with the NAs.
x <- as.numeric( list )
if( length( list ) < length( main ))
{
temp <- rep( NA, length( main ) - length( list ))
list <- c( list, temp )
}
List 1 - NA NA
List 2 - 32 53 45
Merged List - 32 53 45 NA NA
The problem with this is that I then get a ton of NAs introduced by coercion after the dataframe is created.
Is there a better way of handling this? I assume it has to do with the fact that the main list is numeric. I tried doing the same with 0 instead of NA but that failed for some reason. What I use to extend the length does not matter. I just need it to not be a number other than 0.

I will assume that you start with several lists like that:
n=as.list(1:2)
a=as.list(letters[1:3])
A=as.list(LETTERS[1:4])
First, I'd suggest to combine them into a list of lists:
z <- list(n,a,A)
so you can find the length of the longest sub-lists:
max.length <- max(sapply(z,length))
and use length<- to pad the shorter sub-lists (missing list elements become NULL; once the sub-lists are unlisted into atomic vectors, as in the last version, they become NA):
# z2 <- lapply(z,function(k) {length(k) <- max.length; return(k)}) # Original version
# z2 <- lapply(z, "length<-", max.length) # More elegant way
z2 <- lapply(lapply(z, unlist), "length<-", max.length) # Even better: the resulting data frame will consist of atomic vectors
The resulting list can be easily transformed into data.frame:
df <- as.data.frame(do.call(rbind,z2))
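For the case described in the question, where one numeric vector is shorter than the others, the same length<- idea can be applied directly. A minimal sketch, with main and short as made-up example vectors:

```r
main  <- c(10, 20, 30, 40, 50)
short <- c(32, 53, 45)

# Extending the length pads the tail with NA and keeps the vector numeric
length(short) <- length(main)
short

# Both vectors now have the same length and can become columns
df <- data.frame(main = main, short = short)
str(df)
```

Because the padding happens on a numeric vector, no character-to-number coercion is involved when the data frame is built.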

Another option, using stringi, would be the following ("z" is from @Marat Talipov's post), if you want to get the result as shown in "df":
library(stringi)
as.data.frame(stri_list2matrix(lapply(z, as.character), byrow=TRUE))
# V1 V2 V3 V4
#1 1 2 <NA> <NA>
#2 a b c <NA>
#3 A B C D
NOTE: Now the columns are all "factors", or "characters" if we specify stringsAsFactors=FALSE. As @Richard Scriven mentioned in the comments, it would make more sense to have the "rows" as "columns". The above method is good when you have all 'numeric' or all 'character' lists.

Update Variable within Loop in R

How should I modify my code to update variables within a loop?
Specifically, I want to do something like the following:
myMatrix1 <- read.table(someFile)
myMatrix2 <- read.table(someFile2)
for (i in nrow(myMatrix2))
{
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- cbind(myMatrix3, myMatrix4, myMatrix5)
# *see question
}
How can I get myMatrix6 to be updated instead of reassigned the product of cbind(myMatrix3, myMatrix4, myMatrix5)? In other words, if the first iteration (i = 1) gave a myMatrix6 of:
> 1 1 1 1
> 2 2 2 2
and the second iteration (i = 2) gave a myMatrix6 of:
> 3 3 3 3
> 4 4 4 4
how do I get a dataframe(?) of:
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
> 4 4 4 4
UPDATE:
I have - thanks to DWin and Timo's suggestions - got the following. However, the following code has taken me about 2 hours to run on my datasets. Are there any ways to make it run any faster??? (without using a more powerful computer I may add)
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# create matrix that has the value of myMatrix1$begin being
# situated between the values of myMatrix2begin[i] and myMatrix2finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
myMatrix4 <- rep(myMatrix2$sedimentation, nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}

Instead, initialize myMatrix6 to an empty data.frame and rbind the results onto it (which may be inefficient). If efficiency is a concern, pre-allocate to the size you want and fill in the rows of the data.frame by indexing.
# Method # 1 code
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
for (i in 1:nrow(myMatrix2)) {
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- rbind( myMatrix6, cbind(myMatrix3, myMatrix4, myMatrix5) )
}
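When repeated rbind inside the loop becomes the bottleneck, a common alternative is to collect each iteration's piece in a pre-allocated list and bind everything once at the end. The sketch below is an assumption-laden illustration, not the original data: the two small data frames and the names pieces, keep and hits are hypothetical, and the filter mirrors the begin/finish test from the question's update.

```r
# Hypothetical stand-ins for the real inputs
myMatrix1 <- data.frame(begin = c(1, 5, 12, 18, 25))
myMatrix2 <- data.frame(begin         = c(0, 10, 20),
                        finish        = c(10, 20, 30),
                        sedimentation = c("low", "mid", "high"))

pieces <- vector("list", nrow(myMatrix2))      # one slot per iteration

for (i in seq_len(nrow(myMatrix2))) {
  # Rows of myMatrix1 whose begin lies strictly inside interval i
  keep <- myMatrix1$begin > myMatrix2$begin[i] &
          myMatrix1$begin < myMatrix2$finish[i]
  hits <- myMatrix1[keep, , drop = FALSE]
  hits$sedimentation <- rep(myMatrix2$sedimentation[i], nrow(hits))
  pieces[[i]] <- hits
}

myMatrix6 <- do.call(rbind, pieces)            # bind once, after the loop
```

The loop body stays the same in spirit; only the accumulation changes, so the growing data frame is not copied on every iteration.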

In your code, you are not dealing with matrices (in the R sense) but with data frames, as read.table returns a data frame.
Either way, you can append one matrix/data frame to another (assuming the column names match) with the rbind command.
For example, if
> a = data.frame(x=c(1,2,3),y=c(4,5,6),z=c(7,8,9))
> b = data.frame(x=c(4,5),y=c(5,6),z=c(6,7))
then
> rbind(a,b)
x y z
1 1 4 7
2 2 5 8
3 3 6 9
4 4 5 6
5 5 6 7
There are other gotchas in the code you provide. For example
for (i in length(someVector))
should be
for (i in 1:length(someVector))
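The difference is easy to check on a small vector. The sketch below counts iterations for both forms and also shows seq_along(), which is an alternative to 1:length() that stays correct when the vector is empty; the counters once and each are illustrative names.

```r
someVector <- c("a", "b", "c")

once <- 0
for (i in length(someVector)) once <- once + 1       # loops over the single value 3

each <- 0
for (i in seq_along(someVector)) each <- each + 1    # loops over 1, 2, 3

c(once = once, each = each)

# With an empty vector, 1:length() counts down from 1 to 0
empty <- character(0)
length(1:length(empty))      # two values, so a loop would run twice
length(seq_along(empty))     # zero values, so a loop would not run
```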
R has many functions for iterating over data.frames, vectors etc and can do all kinds of data transformations. Most of the time one does not need to write a for loop.
If you would provide more details about what you are trying to do, maybe we can find a simpler solution.
EDIT:
It seems from your post update that you are trying to do some sort of conversion between 'wide' and 'long' format and filter out some rows that fail a test. Correct me if I am wrong.
Anyway, if that is the case, you should check out the reshape command. There is also a reshape package containing the extremely useful commands melt and cast, which can do that kind of transformation quite efficiently, and the merge command for doing certain "join" operations on data frames. I'm quite sure your problem could be solved by using a combination of the above commands, but it depends on the exact details.
For filtering rows/columns by some criteria, check out the subset command.