Loop over several data frames using R

I want to filter a group of data frames, first by p-value and then split each one in two by the sign of the t statistic.
This is how I do it for a single data frame.
p.value.cut <- which(top_Na1$P.Value < 0.05)
top_Na1 <- top_Na1[p.value.cut,]
up <- which(top_Na1$t > 0)
down <- which(top_Na1$t < 0)
up.p.value <- top_Na1[up,]
down.p.value <- top_Na1[down,]
Whenever I try to replicate this with a for loop or with apply, sapply or lapply, one of three things goes wrong: the changes are applied to every column, I cannot address the right column (the for loop seems to iterate column by column instead of using the whole data frame), or I lose the row names of the data frame (as happens with some of the apply functions).
This is what one of the data frames looks like (all have the same structure).
logFC t P.Value adj.P.Val B
YMR290C -0.1952028 -4.593506 0.003484478 0.03596870 -1.602151
YBR090C -0.3406244 -4.373073 0.004429437 0.03930581 -1.857238
YPL037C -0.8737048 -4.088782 0.006100105 0.04526780 -2.197584
YGL035C -0.3058778 -3.839335 0.008159371 0.05142077 -2.506721

Thanks. It's solved. The problem was in the way I was creating the list.
This way worked.
l <- list(top_Na1, top_Na2)
function_filtering <- function(x){
  p.value.cut <- which(x$P.Value < 0.05)
  x <- x[p.value.cut,]
  up <- which(x$t > 0)
  down <- which(x$t < 0)
  # return both subsets; a bare assignment on the last line would return only "down"
  list(up.p.value = x[up,], down.p.value = x[down,])
}
lapply(l, function_filtering)

Assume that "/tmp/dat.txt" contains your sample data.frame. The code below should do what you want.
d <- read.table("/tmp/dat.txt")
d[1,"P.Value"] <- 0.05 # to make the example more general
d[2,"t"] <- -d[2,"t"] # to make the example more general
input <- list(d,d) # generating a list of data frames for testing purposes
cutIt <- function(x) {
  y <- x[x$P.Value < 0.05,]
  list(up=y[y$t > 0,],down=y[y$t <= 0,])
}
lapply(input,cutIt)
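As a minimal sketch of the same idea with a named list: the data frames `top_a` and `top_b`, their values, and the helper name `split_by_t` are made up for illustration and are not taken from the question. Because each data frame is subset by rows, the row names survive.

```r
# Two hypothetical data frames with gene IDs as row names (values are made up)
top_a <- data.frame(t = c(-4.6, 3.2, 1.1),
                    P.Value = c(0.003, 0.010, 0.200),
                    row.names = c("YMR290C", "YBR090C", "YPL037C"))
top_b <- data.frame(t = c(2.5, -3.9, -0.4),
                    P.Value = c(0.020, 0.008, 0.600),
                    row.names = c("YGL035C", "YMR290C", "YBR090C"))

# Keep rows with P.Value below the cutoff, then split them by the sign of t
split_by_t <- function(x, cutoff = 0.05) {
  kept <- x[x$P.Value < cutoff, , drop = FALSE]
  list(up = kept[kept$t > 0, , drop = FALSE],
       down = kept[kept$t < 0, , drop = FALSE])
}

# A named list keeps track of which result belongs to which data frame
res <- lapply(list(top_a = top_a, top_b = top_b), split_by_t)
rownames(res$top_a$up)    # "YBR090C"
rownames(res$top_b$down)  # "YMR290C"
```

Each element of `res` is itself a list with an `up` and a `down` data frame, so a single result can be reached with, for example, `res$top_a$down`.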

Related

How do I save a single column of data produced from a while loop in R to a dataframe?

I have written the following very simple while loop in R.
i=1
while (i <= 5) {
print(10*i)
i = i+1
}
I would like to save the results to a dataframe that will be a single column of data. How can this be done?
You may try this (if you want to keep the while loop):
df1 <- c()
i=1
while (i <= 5) {
print(10*i)
df1 <- c(df1, 10*i)
i = i+1
}
as.data.frame(df1)
df1
1 10
2 20
3 30
4 40
5 50
Or
df1 <- data.frame()
i=1
while (i <= 5) {
df1[i,1] <- 10 * i
i = i+1
}
df1
If you already have a data frame (let's call it dat), you can create a new, empty column in the data frame, and then assign each value to that column by its row number:
# Make a data frame with column `x`
n <- 5
dat <- data.frame(x = 1:n)
# Fill the column `y` with the "missing value" `NA`
dat$y <- NA
# Run your loop, assigning values back to `y`
i <- 1
while (i <= 5) {
result <- 10*i
print(result)
dat$y[i] <- result
i <- i+1
}
Of course, in R we rarely need to write loops like this. Normally, we use vectorized operations to carry out tasks like this faster and more succinctly:
n <- 5
dat <- data.frame(x = 1:n)
# Same result as your loop
dat$y <- 10 * (1:n)
Also note that, if you really did need a loop instead of a vectorized operation, that particular while loop could also be expressed as a for loop.
I recommend consulting an introductory book or other guide to data manipulation in R. Data frames are very powerful and their use is a necessary and essential part of programming in R.
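For completeness, here is a small sketch of the for-loop form mentioned above. It assumes the same five values as the question and preallocates the result vector instead of growing it inside the loop.

```r
n <- 5
# Preallocate the result vector instead of growing it inside the loop
y <- numeric(n)
for (i in seq_len(n)) {
  y[i] <- 10 * i
}
# Wrap the finished vector in a single-column data frame
dat <- data.frame(y = y)
dat$y  # 10 20 30 40 50
```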

How to call a different value for each element in a list in R

I have a list with 29 data frames.
I am trying to do a simple transformation with ifelse(), that looks something like this:
with(df, ifelse(col1 > x, col1 <- col1-y, col1<-col1+y))
The one thing I can't seem to get is how to change that x and y value so that a different value is used for each data frame in the list.
Here's a quick reproducible example of what I've got so far, but I want to take different values for x and y from a data frame (e.g. info):
df.1 <- data.frame("df"=rep(c(1), times=4),"length"=c(10:7))
df.2 <- data.frame("df"=rep(c(2),times=4),"length"=c(8:11))
df.3 <- data.frame("df"=rep(c(3),times=4),"length"=c(9:12))
list <- list(df.1,df.2,df.3)
info <- data.frame(x=rep(c(8.5,9.5,10.5)), y=rep(c(1,1.5,2)))
# using static number for x & y but wanting these to be grabbed from the above df and change
# for each list
x <- 8
y <- 1
lapply(list, function(df) {
df <- with(df, ifelse(length > x,
length <- length-y,
length <- length+y)) })
Any and all help/insight is appreciated!
Edited to add clarification:
I would like the rows to match up with lists.
E.g. Row 1 in Info (x=8.5, y=1) is used in the function and applied just to the first data frame in the list (df.1).
lapply iterates over a single list only. When you need to iterate over several inputs in parallel (here the data frames, the x values and the y values), you can use mapply instead.
mapply(
function(df, x, y) {
#print("df")
#print(df)
#print("x")
#print(x)
#print("y")
#print(y)
with(df, ifelse(length > x, length <- length - y, length <- length + y))
},
list,
info$x,
info$y
)
I've left some debugging statements in the code, which can be enabled in case you want to see how it works.
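A variant that returns the modified data frames rather than bare vectors could look like the sketch below. It uses Map(), which always returns a list, and only two of the data frames to keep it short; the name `dfs` is introduced here to avoid masking base::list and is not from the question.

```r
df.1 <- data.frame(df = 1, length = 10:7)
df.2 <- data.frame(df = 2, length = 8:11)
dfs <- list(df.1, df.2)          # named dfs to avoid masking base::list
info <- data.frame(x = c(8.5, 9.5), y = c(1, 1.5))

# Map() walks the list and the two columns of info in parallel
adjusted <- Map(function(d, x, y) {
  d$length <- ifelse(d$length > x, d$length - y, d$length + y)
  d
}, dfs, info$x, info$y)

adjusted[[1]]$length  # 9 8 9 8
adjusted[[2]]$length  # 9.5 10.5 8.5 9.5
```

Row 1 of info is used only for the first data frame and row 2 only for the second, which is the pairing asked for in the clarification.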

How to substitute negative values with a calculated value in an entire dataframe

I've got a huge dataframe with many negative values in different columns, and each of them should be replaced by its original value*0.5.
I've tried to apply many R functions but it seems I can't find a single function to work for the entire dataframe.
I would like something like the following (not working) piece of code:
mydf[] <- replace(mydf[], mydf[] < 0, mydf[]*0.5)
You can simply do,
mydf[mydf<0] <- mydf[mydf<0] * 0.5
If you have values that are non-numeric, then you may want to apply this to only the numeric ones,
ind <- sapply(mydf, is.numeric)
mydf1 <- mydf[ind]
mydf1[mydf1<0] <- mydf1[mydf1<0] * 0.5
mydf[ind] <- mydf1
You could try using lapply() on the entire data frame, making the replacements on each column in succession. Assigning into df[] keeps the result a data frame; a plain df <- lapply(...) would turn it into a list.
df[] <- lapply(df, function(x) {
  ifelse(x < 0, x*0.5, x)
})
The lapply(), or list apply, function is intended to be used on lists, but a data frame is a special type of list (a list of columns), so this works here.
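A small sketch of this column-wise approach, restricted to numeric columns, is shown below. The data frame and its values are made up for illustration.

```r
# Hypothetical mixed-type data frame (values are made up)
mydf <- data.frame(id = c("a", "b", "c"),
                   v1 = c(-2, 4, -6),
                   v2 = c(1, -3, 5),
                   stringsAsFactors = FALSE)

num <- vapply(mydf, is.numeric, logical(1))   # which columns are numeric
# Assigning into mydf[num] replaces those columns but keeps the data frame itself
mydf[num] <- lapply(mydf[num], function(x) ifelse(x < 0, x * 0.5, x))

mydf$v1  # -1  4 -3
mydf$v2  #  1.0 -1.5  5.0
```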
In replace(), the values argument should have the same length as the number of TRUE values in the index argument (called list):
replace(mydf, mydf <0, mydf[mydf <0]*0.5)
Another option is set() from data.table, which modifies the columns in place and is very efficient:
library(data.table)
for(j in seq_along(mydf)){
i1 <- mydf[[j]] < 0
set(mydf, i = which(i1), j= j, value = mydf[[j]][i1]*0.5)
}
data
set.seed(24)
mydf <- as.data.frame(matrix(rnorm(25), 5, 5))

cor() function in R with a subset

I have a table in R with three columns. I want to get the correlation of the first two columns with a subset of the third column following a specific set of conditions (values are all numeric, I want them to be > a certain number). The cor() function doesn't seem to have an argument to define such a subset.
I know that I could use the summary(lm()) function and square-root the r^2, but the issue is that I'm doing this inside a for loop and am just appending the correlation to a separate list that I have. I can't really append part of the summary of the regression easily to a list.
Here is what I am trying to do:
for (i in x) {list[i] = cor(data$column_a, data$column_b, subset = data$column_c > i)}
Obviously, though, I can't do that because the cor() function doesn't work with subsets.
(Note: x = seq(1,100) and list = NULL)
You can do this without a loop using lapply. Here's some code that will output a data frame with the month-range in one column and the correlation in another column. The do.call(rbind... business is just to take the list output from lapply and turn it into a data frame.
corrs = do.call(rbind, lapply(min(airquality$Month):max(airquality$Month),
function(x) {
data.frame(month_range=paste0(x," - ", max(airquality$Month)),
correlation = cor(airquality$Temp[airquality$Month >= x & airquality$Temp < 80],
airquality$Wind[airquality$Month >= x & airquality$Temp < 80]))
}))
corrs
month_range correlation
1 5 - 9 -0.3519351
2 6 - 9 -0.2778532
3 7 - 9 -0.3291274
4 8 - 9 -0.3395647
5 9 - 9 -0.3823090
You can subset the data first, and then find the correlation.
a <- subset(airquality, Temp < 80 & Month > 7)
cor(a$Temp, a$Wind)
Edit: I don't really know what your list variable is, but here is an example of dynamically changing the subset based on i (see how the month requirement changes with each iteration)
list <- seq(1, 5)
for (i in 1:5){
a <- subset(airquality, Temp < 80 & Month > i)
list[i] <- cor(a$Temp, a$Wind)
}
Based on the pseudo-code you provided alone, here's something that should work:
for (i in x) {
df <- subset(data, column_c > i)
list[i] = cor(df$column_a, df$column_b)
}
However, I don't know why you would want your index in list[i] to be the same value that you use to subset column_c. That could be another source of problems.
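If the index and the threshold should be kept apart, one option is to compute the correlations with vapply() and name the results afterwards. This is only a sketch on the built-in airquality data; the thresholds 5:8 are chosen for illustration.

```r
# airquality ships with base R (datasets package)
thresholds <- 5:8   # hypothetical cutoffs for the Month column

# One correlation per threshold; vapply() checks each result is a single number
corrs <- vapply(thresholds, function(m) {
  sub <- subset(airquality, Month > m)
  cor(sub$Temp, sub$Wind)
}, numeric(1))

# Label each result with the condition that produced it
names(corrs) <- paste0("Month > ", thresholds)
corrs
```

The result is a named numeric vector, so the position in the vector no longer has to coincide with the threshold value.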

Splitting a data set using two parameters and saving the sub-data sets in a list

I am trying to split my data set using two parameters, the fraction of missing values and "maf", and store the sub-data sets in a list. Here is what I have done (it's not working). Any help will be appreciated,
Thanks.
library(BLR)
library(missForest)
data(wheat)
X2<- prodNA(X, 0.4) ### creating missing values
dim(X2)
fd<-t(X2)
MAF<-function(geno){ ## markers are in the rows
geno[(geno!=0) & (geno!=1) & (geno!=-1)] <- NA
geno <- as.matrix(geno)
## calc_Freq for alleles
n0 <- apply(geno==0,1,sum,na.rm=T)
n1 <- apply(geno==1,1,sum,na.rm=T)
n2 <- apply(geno==-1,1,sum,na.rm=T)
n <- n0 + n1 + n2
## calculate allele frequencies
p <- ((2*n0)+n1)/(2*n)
q <- 1 - p
maf <- pmin(p, q)
maf}
frac.missing <- apply(fd,1,function(z){length(which(is.na(z)))/length(z)})
maf<-MAF(fd)
lst<-matrix()
for (i in seq(0.2,0.7,by =0.2)){
for (j in seq(0,0.2,by =0.005)){
lst=fd[(maf>j)|(frac.missing < i),]
}}
It sounds like you want the results that the split function provides.
If you have a vector "frac.missing", and "maf" is defined on the basis of the values in "fd" (so both have the same length as the number of rows in "fd"), then this would provide the split you are looking for:
spl.fd <- split(fd, list(maf, frac.missing) )
If you want to "group" the fd values based on maf and frac.missing within the bands specified by your for-loop, then the same split construct may do what your current code is failing to accomplish:
# maf is the vector returned by MAF(fd), not a function
lst <- split(fd, list(cut(maf, breaks = seq(0, 0.2, by = 0.005),
                          include.lowest = TRUE),
                      cut(frac.missing, breaks = seq(0.2, 0.7, by = 0.2),
                          right = TRUE, include.lowest = TRUE)))
The right argument controls which end of each interval is closed: with right = TRUE (the default) the intervals have the form (a, b], and with right = FALSE they have the form [a, b), which is the one that matches a strict "<" comparison against the upper break. The other function that provides a similar facility is by.
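To see what the right argument of cut does for values that sit exactly on a break, here is a small sketch with made-up breaks and values:

```r
breaks <- c(0, 0.2, 0.4)
x <- c(0.2, 0.4)   # values sitting exactly on a break

# right = TRUE (default): intervals are (a, b], so 0.2 falls in the first bin
r_true <- as.integer(cut(x, breaks, right = TRUE, include.lowest = TRUE))
# right = FALSE: intervals are [a, b), so 0.2 moves to the second bin
r_false <- as.integer(cut(x, breaks, right = FALSE, include.lowest = TRUE))

r_true   # 1 2
r_false  # 2 2
```

With include.lowest = TRUE the outermost break is also included, which is why 0.4 still lands in the last bin when right = FALSE.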
The code below gives me exactly what I need:
Y<-t(GBS.binary)
nn<-colnames(Y)
fd<-Y
maf<-as.matrix(MAF(Y))
dff<-cbind(frac.missing,maf,Y)
colnames(dff)<-c("fm","maf",nn)
dff<-as.data.frame(dff)
for (i in seq(0.1,0.6,by=0.1)) {
for (j in seq(0,0.2,by=0.005)){
assign(paste("fm_",i,"maf_",j,sep=""),
(subset(dff, maf>j & fm <i))[,-c(1,2)])
} }
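Since the question asked for the sub-data sets to be stored in a list, the same filtering can also be written without assign(). The sketch below uses a tiny made-up table in place of the real marker data; the column `m1`, the threshold grid and the name `subsets` are assumptions for illustration.

```r
# Hypothetical marker table: fm = fraction missing, maf = minor allele frequency
dff <- data.frame(fm  = c(0.05, 0.15, 0.25, 0.35),
                  maf = c(0.01, 0.10, 0.20, 0.30),
                  m1  = c(0, 1, -1, 0))

# All pairs of thresholds (fm varies fastest)
grid <- expand.grid(fm = c(0.2, 0.4), maf = c(0, 0.15))

# One sub-data set per threshold pair, stored in a single list
subsets <- Map(function(i, j) dff[dff$maf > j & dff$fm < i, "m1", drop = FALSE],
               grid$fm, grid$maf)
names(subsets) <- paste0("fm_", grid$fm, "maf_", grid$maf)

names(subsets)
nrow(subsets[["fm_0.4maf_0"]])  # 4
```

Keeping the results in one named list makes it possible to loop over them later with lapply(), which is awkward when each subset lives in its own global variable.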
