Update Variable within Loop in R

How should I modify my code to update variables within a loop?
Specifically, I want to do something like the following:
myMatrix1 <- read.table(someFile)
myMatrix2 <- read.table(someFile2)
for (i in nrow(myMatrix2))
{
myMatrix3 <- myMatrix1[which(doSomeTest),]
myMatrix4 <- rep(myMatrix2$header1,nrow(myMatrix1))
myMatrix5 <- rep(myMatrix2$header2, nrow(myMatrix1))
myMatrix6 <- cbind(myMatrix3, myMatrix4, myMatrix5)
# *see question
}
How can I get myMatrix6 to be updated instead of reassigned the product of cbind(myMatrix3, myMatrix4, myMatrix5)? In other words, if the first iteration (i = 1) gave a myMatrix6 of:
> 1 1 1 1
> 2 2 2 2
and the second iteration (i = 2) gave a myMatrix6 of:
> 3 3 3 3
> 4 4 4 4
how do I get a dataframe(?) of:
> 1 1 1 1
> 2 2 2 2
> 3 3 3 3
> 4 4 4 4
UPDATE:
Thanks to DWin's and Timo's suggestions, I now have the following. However, this code takes about 2 hours to run on my datasets. Is there any way to make it run faster (without using a more powerful computer, I may add)?
# create empty matrix for sedimentation
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# create empty matrix for bore
myMatrix7 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix7) <- letters[1:4]
for (i in 1:nrow(myMatrix2))
{
# create matrix that has the value of myMatrix1$begin being
# situated between the values of myMatrix2begin[i] and myMatrix2finish[i]
myMatrix3 <- myMatrix1[which((myMatrix1$begin > myMatrix2$begin[i]) & (myMatrix1$begin < myMatrix2$finish[i])),]
myMatrix4 <- rep(myMatrix2$sedimentation[i], nrow(myMatrix3))
if (is.na(myMatrix2$boreWidth[i])) {
myMatrix5 <- rep(NA, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] == 0) {
myMatrix5 <- rep(TRUE, nrow(myMatrix3))
}
else if (myMatrix2$boreWidth[i] > 0) {
myMatrix5 <- rep(FALSE, nrow(myMatrix3))
}
myMatrix6 <- rbind(myMatrix6, cbind(myMatrix3, myMatrix4))
myMatrix7 <- rbind(myMatrix7, cbind(myMatrix3, myMatrix5))
}

Instead, initialize myMatrix6 to an empty data.frame and rbind each iteration's result onto it (this can be inefficient, because the data frame is copied every time it grows). If efficiency is a concern, pre-allocate to the final size and fill in the rows of the data.frame by indexing.
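# Method 1: start from an empty data frame and grow it with rbind
myMatrix6 <- data.frame(NA,NA,NA,NA)[0,]
names(myMatrix6) <- letters[1:4]
# seq_len() visits every row; nrow() alone would give a single iteration
for (i in seq_len(nrow(myMatrix2))) {
myMatrix3 <- myMatrix1[which(doSomeTest),]
# repeat the i-th value once per selected row so that cbind lines up
myMatrix4 <- rep(myMatrix2$header1[i], nrow(myMatrix3))
myMatrix5 <- rep(myMatrix2$header2[i], nrow(myMatrix3))
myMatrix6 <- rbind( myMatrix6, cbind(myMatrix3, myMatrix4, myMatrix5) )
}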
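If the repeated rbind turns out to be the bottleneck, one common alternative is to collect each iteration's rows in a pre-allocated list and bind them once at the end. The following is only a sketch under assumed toy data: the column names begin, finish and sedimentation come from the question's update, but the values are made up, and `pieces` and `inside` are hypothetical helper names.

```r
# Toy stand-ins for the two tables read with read.table (values are invented)
myMatrix1 <- data.frame(begin = c(1, 5, 12, 18, 25))
myMatrix2 <- data.frame(begin = c(0, 10, 20),
                        finish = c(10, 20, 30),
                        sedimentation = c("clay", "sand", "silt"))

# Pre-allocate one list slot per row of myMatrix2
pieces <- vector("list", nrow(myMatrix2))

for (i in seq_len(nrow(myMatrix2))) {
  # rows of myMatrix1 whose begin lies strictly inside interval i
  inside <- myMatrix1$begin > myMatrix2$begin[i] &
            myMatrix1$begin < myMatrix2$finish[i]
  rows <- myMatrix1[inside, , drop = FALSE]
  rows$sedimentation <- rep(myMatrix2$sedimentation[i], nrow(rows))
  pieces[[i]] <- rows
}

# A single rbind at the end instead of one per iteration
myMatrix6 <- do.call(rbind, pieces)
```

The loop body stays the same as before; only the accumulation changes, so the data frame is built once rather than copied on every pass.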

In your code, you are not dealing with matrices (in the R sense) but with data frames, because read.table returns a data frame.
Either way, you can append one matrix/data frame to another (assuming the column names match) with the rbind command.
For example, if
> a = data.frame(x=c(1,2,3),y=c(4,5,6),z=c(7,8,9))
> b = data.frame(x=c(4,5),y=c(5,6),z=c(6,7))
then
> rbind(a,b)
x y z
1 1 4 7
2 2 5 8
3 3 6 9
4 4 5 6
5 5 6 7
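There are other gotchas in the code you provide. For example,
for (i in length(someVector))
runs only once, with i set to the length of the vector. It should be
for (i in 1:length(someVector))
or, more safely, for (i in seq_along(someVector)), which also copes with an empty vector.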
R has many functions for iterating over data.frames, vectors etc and can do all kinds of data transformations. Most of the time one does not need to write a for loop.
If you provide more details about what you are trying to do, maybe we can find a simpler solution.
EDIT:
From your post update, it seems that you are trying to do some sort of conversion between 'wide' and 'long' format and to filter out rows that fail a test. Correct me if I am wrong.
If that is the case, you should check out the reshape command. There is also a reshape package containing the extremely useful commands melt and cast, which can do that kind of transformation quite efficiently, and the merge command performs certain "join" operations on data frames. I'm quite sure your problem could be solved with a combination of these commands, but it depends on the exact details.
For filtering rows/columns by some criterion, check out the subset command.
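As a rough illustration of how merge and subset can combine, the sketch below pairs every observation with every interval (merge with by = NULL gives the Cartesian product) and then keeps only the pairs that pass the test. The data frames `obs` and `layers` and all their values are invented for the example, and the cross join needs nrow(obs) * nrow(layers) rows of memory, so this may not suit large inputs.

```r
# Hypothetical toy data: positions to classify and the intervals to match against
obs <- data.frame(pos = c(1, 5, 12, 18, 25))
layers <- data.frame(start = c(0, 10, 20),
                     finish = c(10, 20, 30),
                     sedimentation = c("clay", "sand", "silt"))

# by = NULL asks merge for every combination of rows (a cross join)
pairs <- merge(obs, layers, by = NULL)

# keep only the combinations where the position falls inside the interval
hits <- subset(pairs, pos > start & pos < finish,
               select = c(pos, sedimentation))
hits <- hits[order(hits$pos), ]
```

No explicit loop is needed here; whether it is actually faster than the loop depends on the sizes involved.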

Related

How to run Chisq test for multiple rows FASTER in R?

I have managed to run a chisq-test using a loop in R, but it is very slow on a large dataset, and I wonder if you could help me do it faster with something like dplyr. I've tried dplyr, but I keep getting an error and I am not sure of the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is run a chisq-test between each row of df and cs, and get back the statistics and p-values together with the row names.
Here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val =NA, stat =NA, exprs = rownames(df) )
for (i in 1:ncol(df)) {
# tbl <- table((df[i,]), cs) ### No use seen for this
# I changed the indexing in the next line to compare columns to the standard `cs`.
tst <- chisq.test(df[ ,i], cs) #chisq.test not vectorized, need some sort of loop
value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name, since there is also a df function) to Biobase::exprs(PANCAN_w).
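Since chisq.test is not vectorized, some loop is unavoidable, but it can be hidden in vapply so that the results land directly in a data frame with numeric columns. This is a sketch on made-up categorical data, not the poster's dataset: `calls`, `s1`, `s2` and `run_one` are hypothetical names, and the warnings about small expected counts are suppressed only to keep the toy example quiet.

```r
# The reference grouping from the question
cs <- c(1, 1, 1, 2, 1, 2, 2, 1, 2, 3)

# Invented categorical columns, one test per column
calls <- data.frame(
  s1 = c("a", "a", "b", "b", "a", "b", "b", "a", "b", "a"),
  s2 = c("x", "y", "x", "y", "x", "y", "x", "y", "x", "y")
)

# Run one test and return just the two numbers of interest
run_one <- function(column) {
  tst <- suppressWarnings(chisq.test(column, cs))
  c(p_val = tst$p.value, stat = unname(tst$statistic))
}

# vapply returns a 2 x ncol matrix; transpose to get one row per column
res <- t(vapply(calls, run_one, numeric(2)))

# Names go in their own character column, so the numbers stay numeric
value <- data.frame(name = rownames(res), res, row.names = NULL)
```

To test row by row instead, the same helper could be passed to apply(m, 1, run_one) on a matrix m.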

R: Using a vector to feed dataframe names for sapply

I'm quite new to R, and I'm trying to use it to organize and extract info from some tables into different but similar tables. Currently I repeat the same commands, changing only the name of the table:
#DvE, DvS, and EvS are dataframes
Sum.DvE <- data.frame(DvE$genes, DvE$FDR, DvE$logFC)
names(Sum.DvE) <- c("gene","FDR","log2FC")
Sum.DvS <- data.frame(DvS$genes, DvS$FDR, DvS$logFC)
names(Sum.DvS) <- c("gene","FDR","log2FC")
Sum.EvS <- data.frame(EvS$genes, EvS$FDR, EvS$logFC)
names(Sum.EvS) <- c("gene","FDR","log2FC")
I thought it would be easier to create a vector of the table names, and feed it into a for loop:
Sum.Comp <- c("DvE","DvS","EvS")
for(i in 1:3){
Sum.Comp[i] <- data.frame(i$genes, i$FDR, i$logFC)
names(Sum.Comp[i]) <- c("gene","FDR","log2FC")
}
But I get
>Error in i$genes : $ operator is invalid for atomic vectors
which I kind of expected because I was just trying it out, but can someone tell me if what I want to do can be done some other way, or if you have some suggestions for me, that would be much appreciated!
Clarification: Basically I'm trying to ask if there's a way to feed a dataframe name into a for loop through a vector, because I think I get the error because R doesn't realize "i" in the for loop stands for a dataframe name. This is a more simplified example:
DF1 <- data.frame(A=1:5, B=1:5, C=1:5, D=1:5)
DF2 <- data.frame(A=10:15, B=10:15, C=10:15, D=10:15)
DF3 <- data.frame(A=20:25, B=20:25, C=20:25, D=20:25)
DFs <- c("DF1", "DF2", "DF3")
for (i in 1:3){
New.i <- data.frame(i$A, i$D)
}
And I'd like it to make 3 new dataframes called "New.DF1", "New.DF2", "New.DF3" with example outputs like:
New.DF1
A D
1 1
2 2
3 3
4 4
5 5
New.DF2
A D
10 10
11 11
12 12
13 13
14 14
15 15
Thank you!
Not entirely sure I understand your problem, but the code below may do what you're asking. I've created simple values for the input data frames for testing.
DvE <- data.frame(genes=1:2, FDR=2:3, logFC=3:4)
DvS <- data.frame(genes=4, FDR=5, logFC=6)
EvS <- data.frame(genes=7, FDR=8, logFC=9)
df_names <- c("DvE","DvS", "EvS")
sum_df <- function(x) data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
for(df in df_names) {
assign(paste("Sum.",df,sep=""), do.call("sum_df", list(as.name(df)) ) )
}
Instead of operating on the names of variables, it would be easier to store the data frames you want to process in a list and then process them with lapply:
to.process <- list(DvE, DvS, EvS)
processed <- lapply(to.process, function(x) {
data.frame(gene=x$genes, FDR=x$FDR, log2FC=x$logFC)
})
Now you can access the new data frames with processed[[1]], processed[[2]], and processed[[3]].
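If the data frames already exist as separate variables and only their names are at hand, mget can fetch them into a named list, which keeps the names attached to the results. A minimal sketch using the DF1/DF2 example from the question; `df_names`, `inputs` and `new_dfs` are names chosen here for illustration.

```r
DF1 <- data.frame(A = 1:5, B = 1:5, C = 1:5, D = 1:5)
DF2 <- data.frame(A = 10:15, B = 10:15, C = 10:15, D = 10:15)

# Character vector of variable names, as in the question
df_names <- c("DF1", "DF2")

# mget looks the variables up by name and returns a named list
inputs <- mget(df_names)

# Keep only columns A and D of every data frame
new_dfs <- lapply(inputs, function(x) x[c("A", "D")])
names(new_dfs) <- paste0("New.", names(new_dfs))

new_dfs$New.DF2
```

The results stay together in one list (new_dfs$New.DF1, new_dfs$New.DF2) rather than becoming separate variables, which usually makes later processing easier.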

Generate a sequence of Data frame from function

I searched but couldn't find a similar question, so apologies in advance if this is a duplicate. I am trying to generate data frames from within a for loop in R.
What I want to do:
Define each column of each data frame by a function,
Generate n data frames (n being the length of my sequence of data frames) using a loop.
As an example I will use n=100:
n<-100
k<-8
d1 <- data.frame()
for(i in 1:(k)) {d1 <- rbind(d1,c(a="i+1",b="i-1",c="i/1"))}
d2 <- data.frame()
for(i in 1:(k+2)) {d2 <- rbind(d2,c(a="i+2",b="i-2",c="i/2"))}
...
d100 <- data.frame()
for(i in 1:(k+100)) {d100 <- rbind(d100,c(i+100, i-100, i/100))}
Clearly, it would be tedious to construct each data.frame one by one. I tried this:
d<-list()
for(j in 1:100) {
d[j] <- data.frame()
for(i in 1:(k+j)) {d[j] <- rbind(d[j] ,c(i+j, i-j, i/j))}
}
But I can't really do anything with it; I run into an error:
Error in d[j] <- data.frame() : replacement has length zero
In addition: Warning messages:
1: In d[j] <- rbind(d[j], c(i + j, i - j, i/j)) :
number of items to replace is not a multiple of replacement length
A few more remarks about my example:
the number of rows in each data frame is not the same: d1 has 8 rows, d2 has 10 rows, and d100 has 8+100 rows,
the algorithm should give us: D=(d1,d2, ... ,d100).
It would be great to get an answer using the same approach (rbind) and a more base like approach. Both will aid in my understanding. Of course, please point out where I'm going wrong if it's obvious.
Here's how to create an empty data.frame (and it's not what you are trying): see the question "Create an empty data.frame".
And you should not be creating 100 separate dataframes but rather a list of dataframes. I would not do it with rbind, since that would be very slow. Instead I would create them with a function that returns a dataframe of the required structure:
make_df <- function(n,var) {data.frame( a=(1:n)+var,b=(1:n)-var,c=(1:n)/var) }
mylist <- setNames(
lapply(1:100, function(n) make_df(n,n)) , # the dataframes
paste0("d_", 1:100)) # the names for access
head(mylist,3)
#---------------
$d_1
a b c
1 2 0 1
$d_2
a b c
1 3 -1 0.5
2 4 0 1.0
$d_3
a b c
1 4 -2 0.3333333
2 5 -1 0.6666667
3 6 0 1.0000000
Then if you want the "d_40" dataframe it's just:
mylist[[ "d_40" ]]
Or
mylist$d_40
If you want to perform the same operation on all of them, or get a result from all of them at once, just use lapply:
lapply(mylist, nrow) # will be a list
Or:
sapply(mylist, nrow) #will be a vector because each value is the same length.
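The same idea can be adapted to the row counts described in the question. The sketch below assumes that data frame j should have k + j rows (the question's own examples are not fully consistent on this point, so treat that as a guess); `n_rows` and `var` are parameter names picked for the example.

```r
k <- 8

# Build one data frame with n_rows rows, offset and scaled by var
make_df <- function(n_rows, var) {
  i <- seq_len(n_rows)
  data.frame(a = i + var, b = i - var, c = i / var)
}

# Assumption: data frame j has k + j rows
D <- setNames(lapply(1:100, function(j) make_df(k + j, j)),
              paste0("d_", 1:100))

sapply(D[1:3], nrow)
```

Each data frame is created in one vectorized call, so no rbind is needed inside the loop.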

Don't understand how apply gets its parameters in r

I am struggling to make my apply() work: I have two dataframes:
from <- c(1,2,3)
to <- c(2,3,4)
df1 <- data.frame(from, to)
long <-c(9,9.2,9.4,9.6)
lat <- c(45,45.2,45.4,45.6)
id <- c(1,2,3,4)
df2 <- data.frame(long, lat, id)
Now I want something like this:
myFunction <- function(arg){
>>> How do I access arg$from and arg$to? <<<<
}
apply(df1,1,myFunction)
In myFunction I need to make some calculations and return a value for each from-to pair. I don't understand how to access parts of the arg, since arg[0] gives me numeric(0) and arg$from just crashes.
The problem is that apply(...) requires a matrix or array as the first argument. If you pass a dataframe, it will coerce that to a matrix. Matrices are 1 indexed, so the upper left element is [1,1], not [0,0]. Also, matrix columns cannot be referenced using the $ notation.
So,
f <- function(x) {
from <- x[1]
to <- x[2]
# do stuff with from and to...
}
apply(df1,1,f)
would work.
One other thing to watch out for is that if your dataframe has (other) columns that have character strings, the conversion will make everything character (including the numbers!). This is because, by definition, all elements of a matrix must have the same data type. Your example does not have that problem, though.
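Both points can be checked on the question's df1. Each row reaches the function as a named vector, so elements can be looked up by column name with [[ ]] even though $ is not available. This is a small sketch; `span` and `mixed` are names invented here.

```r
df1 <- data.frame(from = c(1, 2, 3), to = c(2, 3, 4))

# Each row arrives as a named numeric vector, so name-based lookup works
span <- function(x) x[["to"]] - x[["from"]]
result <- unname(apply(df1, 1, span))

# With a character column present, every element is coerced to character
mixed <- data.frame(from = c(1, 2), label = c("a", "b"))
row_classes <- unname(apply(mixed, 1, class))
```

Here result holds the difference to - from for each row, and row_classes shows that the numeric column was turned into character by the conversion to a matrix.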
Try mapply(). It's a multivariate version of sapply(). For example:
> myFunction <- function(arg1, arg2){
+ return(sum(arg1, arg2))
+ }
>
> mapply(myFunction, df1$from, df1$to)
[1] 3 5 7
You can also use it to make a new variable in your data frame.
> df1$newvar <- mapply(myFunction, df1$from, df1$to)
> df1
from to newvar
1 1 2 3
2 2 3 5
3 3 4 7

Retain Vector Names as Dataframe Column Names

In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colNames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcols)] <- newcols
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
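If the vectors can be kept in a named list, filling a pre-allocated data frame and naming the columns takes two assignments and no cbind. A minimal sketch with made-up vectors: `barHeight`, `vectors` and the column positions are assumptions for illustration.

```r
# Pre-allocated data frame with placeholder column names X1, X2, X3
df1 <- data.frame(matrix(NA_real_, nrow = 4, ncol = 3))

barWidth <- c(0.5, 1, 1.5, 2)
barHeight <- c(10, 20, 30, 40)

# Keep the vectors in a named list from the start
vectors <- list(barWidth = barWidth, barHeight = barHeight)

# Target column positions, named after the vectors they receive
columnNums <- c(barWidth = 2, barHeight = 3)

# Fill the columns by position, then copy the names across
df1[columnNums] <- vectors[names(columnNums)]
names(df1)[columnNums] <- names(columnNums)
```

This also avoids get(), because the list already maps each name to its vector.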
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1
