Remove NA value within a list of dataframes - r

I'm sure there is a very easy answer to this, but I can't find one. A separate post, "How do I remove empty data frames from a list?", covers removing an empty data frame from a list of data frames.
But how can you do this when one of the items in the list isn't a data frame at all and is just an NA value? Modifying the example from the question above slightly, you have:
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- NA
M3 <- data.frame(matrix(9:12, nrow = 2, ncol = 2))
mlist <- list(M1, M2, M3)
I would like to remove M2 in this instance, but I have several examples of these empty data frames, so I would like a function that removes them all simultaneously.
I have tried a couple of solutions to the question above which do not work:
mlist[sapply(mlist, function(x) dim(x)[1]) > 0]
## Error: (list) object cannot be coerced to type 'double'
Filter(function(x) dim(x), mlist)
## Incorrect output
Thank you in advance for any help!

One option is to use Filter to check whether the list elements are data.frames:
Filter(is.data.frame, mlist)
#[[1]]
# X1 X2
#1 1 3
#2 2 4
#[[2]]
# X1 X2
#1 9 11
#2 10 12
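For context on why the two attempts in the question fail: dim(NA) is NULL, so the sapply() call returns a list rather than a numeric vector, and Filter() expects a single TRUE/FALSE per element but gets dimension vectors instead. Below is a minimal sketch, assuming you want to drop both non-data-frame entries and zero-row data frames in one pass. The name keep_df is hypothetical and not part of the answer above.

```r
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M3 <- data.frame(matrix(9:12, nrow = 2, ncol = 2))
# Example list with an NA entry and an empty data frame mixed in
mlist <- list(M1, NA, data.frame(), M3)

# Hypothetical predicate: TRUE only for data frames with at least one row
keep_df <- function(x) is.data.frame(x) && nrow(x) > 0

cleaned <- Filter(keep_df, mlist)
length(cleaned)
```

Because the predicate always returns exactly one logical value, Filter() can use it directly without any coercion.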

Here's a slightly different way to get your result
library(tidyverse)
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- NA
M3 <- data.frame(matrix(9:12, nrow = 2, ncol = 2))
M4 <- NA
mlist <- list(M1, M2, M3,M4)
indexes <- tibble()
for (i in seq_along(mlist)) {
  # is.atomic() skips data frames, so if() always gets a single TRUE/FALSE
  if (is.atomic(mlist[[i]]) && all(is.na(mlist[[i]]))) {
    new_index <- tibble(index = i)
    indexes <- bind_rows(new_index, indexes)
  }
}
indexnums <- indexes %>% pull(index)
mlist <- mlist[-indexnums]
With this, you check if each list element is NA or not, then add the index number to a table if it is, then you pull those index numbers out and subset the list. If you have a lot of these in your data set this should remove them all.

Hope this helps.
# Method 1
mlist[!is.na(mlist)]
# Method 2
replace(mlist, is.na(mlist), NULL)
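One caveat with both methods: is.na() applied to a list flags only elements that are a length-one atomic NA, so a longer all-NA vector in the list would be kept. Here is a small sketch of a stricter check, assuming you also want to drop those. The helper name all_na is hypothetical.

```r
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M3 <- data.frame(matrix(9:12, nrow = 2, ncol = 2))
mlist <- list(M1, NA, c(NA, NA), M3)

# is.na() on a list flags only length-one atomic NA elements
flags <- is.na(mlist)

# Hypothetical stricter predicate: TRUE for atomic vectors made up entirely of NA
# (note: it also matches zero-length atomic vectors)
all_na <- function(x) is.atomic(x) && all(is.na(x))

cleaned <- Filter(Negate(all_na), mlist)
length(cleaned)
```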

Related

Dataframe output from a for-loop

I am trying to populate the output of a for loop into a data frame. The loop is repeating across the columns of a dataset called "data". The output is to be put into a new dataset called "data2". I specified an empty data frame with 4 columns (i.e. ncol=4). However, the output generates only the first two columns. I also get a warning message: "In matrix(value, n, p) : data length [2403] is not a sub-multiple or multiple of the number of columns [2]"
Why does the dataframe called "data2" have 2 columns, when I have specified 4 columns? This is my code:
a <- 0
b <- 0
GM <- 0
GSD <- 0
data2 <- data.frame(ncol=4, nrow=33)
for (i in 1:ncol(data))
{
if (i==34) {break}
a[i] <- colnames(data[i])
b <- data$cycle
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
If you look at the ?data.frame() help page, you'll see that it does not take arguments nrow and ncol--those are arguments for the matrix() function.
This is how you initialize data2, and you can see it starts with 2 columns, one column is named ncol, the second column is named nrow.
data2 <- data.frame(ncol=4, nrow=33)
data2
# ncol nrow
# 1 4 33
Instead you could try data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33)), though if you share a small sample of data and your expected result there may be more efficient ways than explicit loops to get this job done.
Generally, if you do loop, you want to do as much outside of the loop as possible. This is just guesswork without sample data, but these changes seem like a start at improving your code:
a <- colnames(data)
b <- data$cycle ## this never changes, no need to redefine every iteration
GM <- numeric(ncol(data)) ## better to initialize vectors to the correct length
GSD <- numeric(ncol(data))
data2 <- as.data.frame(matrix(NA, ncol = 4, nrow = 33))
for (i in 1:ncol(data))
{
if (i==34) {break}
GM[i] <- geoMean(data[,i], na.rm=TRUE)
GSD[i] <- geoSD(data[,i], na.rm=TRUE)
## it's weird to assign a row of data.frame at once...
## maybe you should keep it as a matrix?
data2[i,] <- c(a[i], b, GM[i], GSD[i])
}
data2
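As a possible loop-free alternative, the per-column summaries can be computed with vapply() and assembled into a data frame in one step. This is only a sketch under assumptions: the example data is made up, geo_mean() and geo_sd() are hypothetical base R stand-ins for geoMean() and geoSD(), and the cycle column is left out because its role in the result is not clear from the question.

```r
# Hypothetical stand-ins for geoMean()/geoSD(), written in base R
geo_mean <- function(x) exp(mean(log(x), na.rm = TRUE))
geo_sd   <- function(x) exp(sd(log(x), na.rm = TRUE))

# Made-up example data; the real `data` was not shown in the question
dat <- data.frame(x = c(1, 10, 100), y = c(2, 8, NA))

# One row per column of dat, one column per summary
data2 <- data.frame(
  variable = names(dat),
  GM  = vapply(dat, geo_mean, numeric(1)),
  GSD = vapply(dat, geo_sd, numeric(1)),
  row.names = NULL
)
data2
```

Building each column as a full vector avoids assigning rows one at a time, which is where mixing character and numeric values in c() would otherwise coerce everything to character.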

Create subset matrix according to criteria/ Extract key rows according to criteria

I want to subset the rows of my original matrix into two separate matrices.
I setup the problem as follows:
set.seed(2)
Mat1 <- data.frame(matrix(nrow = 4, ncol =10, data = rnorm(40,0,1)))
keep.rows = matrix(nrow =2, ncol =4)
keep.rows[,1] = c(1,2)
keep.rows[,2] = c(2,3)
keep.rows[,3] = c(2,3)
keep.rows[,4] = c(1,2)
Mat1
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 0.9959846 -2.2079198 -0.3869496 -1.183606 1.959357077 1.0744594 -0.8621983 -0.4213736 0.4718595 1.2309537
2 -1.6957649 1.8221225 0.3866950 -1.358457 0.007645872 0.2605978 2.0480403 -0.3508344 1.3589398 1.1471368
3 -0.5333721 -0.6533934 1.6003909 -1.512671 -0.842615198 -0.3142720 0.9399201 -1.0273806 0.5641686 0.1065980
4 -1.3722695 -0.2846812 1.6811550 -1.253105 -0.601160105 -0.7496301 2.0086871 -0.2505191 0.4559801 -0.7833167
Mat1 is my original matrix. Now, using the keep.rows matrix, I want to create two output matrices. The first output matrix (Output1) should store all the rows whose numbers are specified in keep.rows. The second output matrix (Output2) should store all remaining rows. In my actual application my matrices are very large and so cannot be sorted manually as I do here.
I need:
1) A function that does this simply over large matrices.
2) Ideally, one where I can change the number of entries to "keep" each time. So in this case I store 3 entries. However, imagine if my keep.rows matrix was 2x2. In this case, I might want to store five entries each time.
Results should be of the form:
Output1 <- data.frame(matrix(nrow = 2, ncol =10))
Output1[1:2,1:3] <- Mat1[c(1,2), 1:3]
Output1[1:2,4:6] <- Mat1[c(2,3), 4:6]
Output1[1:2,7:9] <- Mat1[c(2,3), 7:9]
Output1[1:2,10] <- Mat1[c(1,2), 10]
Output2 <- data.frame(matrix(nrow = 2, ncol =10))
Output2[1:2,1:3] <- Mat1[c(3,4), 1:3]
Output2[1:2,4:6] <- Mat1[c(1,4), 4:6]
Output2[1:2,7:9] <- Mat1[c(1,4), 7:9]
Output2[1:2,10] <- Mat1[c(3,4), 10]
IMPORTANT: In the answer I need Output2 to be specified in a way that keeps all remaining rows. In my application my keep.rows matrix is the same size, but Mat1 contains 1000+ rows.
You can use sapply which iterates over the columns of Mat1 with seq_along(Mat1) and subset Mat1 using keep.rows. With cbind you get a matrix-like data.frame from the returned list of sapply. To get the remaining data you simply place a - before keep.rows.
Output1 <- do.call(cbind, sapply(seq_along(Mat1), function(i) Mat1[keep.rows[,(i+2) %/% 3], i, drop = FALSE], simplify = FALSE))
Output2 <- do.call(cbind, sapply(seq_along(Mat1), function(i) Mat1[-keep.rows[,(i+2) %/% 3], i, drop = FALSE], simplify = FALSE))
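The expression (i+2) %/% 3 hard-codes a block width of three columns. The sketch below shows one way to make that width a parameter, as the question asks. It assumes a plain matrix as input and at least two kept and two remaining rows per column, so that sapply() returns a matrix. The function name split_rows and the block argument are hypothetical.

```r
# Hypothetical helper: split each column's rows into "kept" and "remaining",
# where every `block` consecutive columns share one column of `keep`
split_rows <- function(mat, keep, block = 3) {
  mat <- as.matrix(mat)
  cols <- seq_len(ncol(mat))
  grp <- (cols - 1) %/% block + 1   # which column of `keep` each data column uses
  kept <- sapply(cols, function(i) mat[keep[, grp[i]], i])
  rest <- sapply(cols, function(i) mat[-keep[, grp[i]], i])
  list(kept = kept, rest = rest)
}

# Small deterministic example with the same shape as in the question
mat <- matrix(1:40, nrow = 4, ncol = 10)
keep.rows <- cbind(c(1, 2), c(2, 3), c(2, 3), c(1, 2))
out <- split_rows(mat, keep.rows, block = 3)
dim(out$kept)
dim(out$rest)
```

With a 2x2 keep.rows matrix and ten columns, calling split_rows(mat, keep.rows, block = 5) would apply each set of row numbers to five columns.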

select a specific columns in R nested list

Suppose I have a list of data frames, like this:
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- data.frame(matrix(1:9, nrow = 3, ncol = 3))
M3 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
mlist <- list(M1, M2, M3)
Now I want to select the X1 column from all of the data frames. I tried:
M.X1 <- mlist$X1
but failed with NULL:
> mlist$X1
NULL
I don't want to use a for loop to extract each data frame's X1. Is there a better way to do this? And what if I want to extract column X3, which does not exist in every data frame?
Normally you can use lapply as below:
lapply(mlist, function(x) x$X2)
The second argument is a function, defined inline, that lapply applies to each element of mlist.
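For the two cases the question asks about, here is a short sketch using the `[[` extractor by name. It relies on `[[` returning NULL when a data frame has no column with that name, so missing columns can be filtered out afterwards. The variable names are illustrative only.

```r
M1 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
M2 <- data.frame(matrix(1:9, nrow = 3, ncol = 3))
M3 <- data.frame(matrix(1:4, nrow = 2, ncol = 2))
mlist <- list(M1, M2, M3)

# Extract column X1 from every data frame
cols_x1 <- lapply(mlist, `[[`, "X1")

# X3 exists only in M2; the other elements come back as NULL
cols_x3 <- lapply(mlist, `[[`, "X3")

# Drop the NULL entries if only the existing columns are wanted
cols_x3_present <- Filter(Negate(is.null), cols_x3)
```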

R: How to write a for loop that reads every two lines in a matrix?

I want to calculate correlation statistics using cor.test(). I have a data matrix where the two members of each pair to be tested are on consecutive rows (I have more than a thousand pairs, so I will also need to correct for multiple testing later). I was thinking that I could loop through the matrix two rows at a time and perform the test (i.e. first test the correlation between row1 and row2, then row3 and row4, row5 and row6, etc.), but I don't know how to write this kind of loop.
This is how I do the test on a single pair:
d = read.table(file="cor-test-sample-data.txt", header=T, sep="\t", row.names = 1)
d = as.matrix(d)
cor.test(d[1,], d[2,], method = "spearman")
You could try
res <- lapply(split(seq_len(nrow(mat1)),(seq_len(nrow(mat1))-1)%/%2 +1),
function(i){m1 <- mat1[i,]
if(NROW(m1)==2){
cor.test(m1[1,], m1[2,], method="spearman")
}
else NA
})
To get the p-values
resP <- sapply(res, function(x) x$p.value)
indx <- t(`dim<-`(seq_len(nrow(mat1)), c(2, nrow(mat1)/2)))
names(resP) <- paste(indx[,1], indx[,2], sep="_")
resP
# 1_2 3_4 5_6 7_8 9_10 11_12 13_14
#0.89726818 0.45191660 0.14106085 0.82532260 0.54262680 0.25384239 0.89726815
# 15_16 17_18 19_20 21_22 23_24 25_26 27_28
#0.02270217 0.16840791 0.45563229 0.28533447 0.53088721 0.23453161 0.79235990
# 29_30 31_32
#0.01345768 0.01611903
Or using mapply (assuming that the number of rows is even)
ind <- seq(1, nrow(mat1), by=2) # similar to the one used by @CathG in the for loop
mapply(function(i,j) cor.test(mat1[i,], mat1[j,],
method='spearman')$p.value , ind, ind+1)
data
set.seed(25)
mat1 <- matrix(sample(0:100, 20*32, replace=TRUE), ncol=20)
Try
d = matrix(rep(1:9, 3), ncol=3, byrow = T)
sapply(2*(1:(nrow(d)/2)), function(pair) unname(cor.test(d[pair-1,], d[pair,], method="spearman")$estimate))
pvalues<-c()
for (i in seq(1,nrow(d),by=2)) {
pvalues<-c(pvalues,cor.test(d[i,],d[i+1,],method="spearman")$p.value)
}
names(pvalues)<-paste(row.names(d)[seq(1,nrow(d),by=2)],row.names(d)[seq(2,nrow(d),by=2)],sep="_")
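Since the question mentions correcting for the number of pairs later, here is a sketch that combines the pairwise loop with p.adjust(). The choice of the Benjamini-Hochberg method is an assumption for illustration, not something the question specifies, and exact = FALSE is set only to avoid warnings about ties in the example data.

```r
set.seed(25)
mat1 <- matrix(sample(0:100, 20 * 32, replace = TRUE), ncol = 20)

# First row of each pair: 1, 3, 5, ... (an unpaired last row is skipped)
first <- seq(1, nrow(mat1) - 1, by = 2)

# One Spearman p-value per consecutive pair of rows
pvals <- vapply(first, function(i) {
  cor.test(mat1[i, ], mat1[i + 1, ], method = "spearman", exact = FALSE)$p.value
}, numeric(1))
names(pvals) <- paste(first, first + 1, sep = "_")

# Adjust for multiple testing (method chosen here only as an example)
padj <- p.adjust(pvals, method = "BH")
```

vapply() is used instead of sapply() so that the result is guaranteed to be a numeric vector with one value per pair.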

Apply function on list of multiple lists of two dataframes by rows

Pardon me if this question has been answered before, but I searched and couldn't find one.
I have a list containing multiple lists containing two dataframes. I want to apply t.test between first row of dataframe 1 and first row of dataframe 2 and so on.
I tried this:
list1 <- list(set1 = data.frame(rnorm(100), rexp(100)), set2 = data.frame(rnorm(100, mean = 5, sd = 3), rexp(100, rate = 4)))
list2 <- list(set1 = data.frame(rnorm(100), rexp(100)), set2 = data.frame(rnorm(100, mean = 6, sd = 4), rexp(100, rate = 2)))
mylist <- list(list1, list2)
ttest<-function(list){
df1 <- list$set1
df2 <- list$set2
testresults<-rep(NA,nrow(df1))
for (j in seq(nrow(df1))){
testresults[j] <- t.test(df1[j,], df2[j,])$p.value
}
return(as.matrix(testresults))}
lapply(mylist,ttest)
This works fine but takes a lot of time because of the for loop, and the actual data is much larger. I want to replace the for loop with an apply function, if possible. Any suggestions?
You basically want lapply with a function that takes more than one argument, which is what Map does. So you can replace ttest in your code with:
ttest2 <- function(list) {
df1 <- list$set1
df2 <- list$set2
l1 <- unlist(apply(df1, 1, list), recursive = FALSE)
l2 <- unlist(apply(df2, 1, list), recursive = FALSE)
testresults <- unlist(Map(function(x,y) t.test(x,y)$p.value, x=l1, y=l2))
return(as.matrix(testresults))
}
This seems to be faster. I extended your data frames to 10000 rows (with 100 rows it runs so quickly that the difference is hard to see) and got:
system.time(lapply(mylist,ttest))
# user system elapsed
# 12.736 0.000 12.760
system.time(lapply(mylist,ttest2))
# user system elapsed
# 3.825 0.000 3.833
Try:
res1 <- sapply(mylist, function(x) {
x1 <- do.call(`cbind`,x)
apply(x1, 1, function(y) t.test(y[1:2], y[3:4])$p.value)
})
Using your function
res2 <- sapply(mylist, ttest)
identical(res1, res2)
#[1] TRUE
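Another variant in the same spirit is to convert each data frame to a matrix once and then index rows of the matrix, which may reduce the overhead of repeated data frame row extraction. This is an untested-for-speed sketch, and the name ttest3 and the small example data are hypothetical.

```r
# Hypothetical variant: convert to matrices once, then loop over row indices
ttest3 <- function(lst) {
  m1 <- as.matrix(lst$set1)
  m2 <- as.matrix(lst$set2)
  p <- vapply(seq_len(nrow(m1)), function(j) {
    t.test(m1[j, ], m2[j, ])$p.value
  }, numeric(1))
  as.matrix(p)
}

# Small made-up example with the same structure as list1 in the question
example_list <- list(
  set1 = data.frame(a = c(1, 2, 3), b = c(2, 4, 7)),
  set2 = data.frame(a = c(5, 1, 0), b = c(9, 3, 4))
)
res3 <- ttest3(example_list)
dim(res3)
```

It can be dropped into the original call as lapply(mylist, ttest3); whether it is actually faster on the real data would need to be checked with system.time().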
