Error using colSds; error while loop across lists - r

I have the following list:
d1<-data.frame(y1=c(34,56,89,45),y2=c(42,54,68,25),y3=c(253,547,586,258),y4=c(233,537,554,258))
d2<-data.frame(y1=c(37,26,14,67),y2=c(65,54,43,23),y3=c(243,577,516,125),y4=c(267,527,567,368))
d3<-data.frame(y1=c(35,24,14,58),y2=c(65,51,43,21),y3=c(267,527,567,368),y4=c(243,577,516,125))
d4<-data.frame(y1=c(34,23,13,36),y2=c(65,55,44,24),y3=c(233,537,554,258),y4=c(253,547,586,258))
lst <- list(d1,d2,d3,d4)
My intention is to obtain different data frames with the means and sd of certain columns for each of the elements of the list. The first problem came when trying to use colSds to obtain the sd.
W.mean<-list()
W.sd<-list()
for (i in ids){
W.mean<-lapply(lst, function(i) colMeans(i[,c(1,2,4)],na.rm=TRUE))
W.sd<-lapply(lst, function(i) colSds(i[,c(1,2,4)],na.rm = TRUE))
}
As soon as I run this script I obtain the following error:
Error in colVars(x, rows = rows, cols = cols, ...) :
Argument 'x' must be a matrix or a vector.
The mean function still works, so I have a new list with all the means (W.mean).
Now I want to create separate data frames with just the means (I would also include the sd, but I need to make it work first).
for (i in c("d1","d2","d3","d4")){
df<-get(i)
df<-data.frame(t(W.mean[[i]]))
assign(paste0(i,"mean"), df)
}
However, I get a new error: Error in t.default(W.mean[[i]]) : argument is not a matrix
Can someone help me fix the errors? Thanks

The reason is that colSds (from the matrixStats package) works on a matrix and not on a data.frame. According to the Description in ?colSds:
Description - Standard deviation estimates for each row (column) in a
matrix.
lapply(lst, function(i) colSds(i[, c(1,2, 4)], na.rm = TRUE))
Error in colVars(x, rows = rows, cols = cols, ...) : Argument 'x'
must be a matrix or a vector.
Therefore, convert the 'data.frame' to 'matrix' and it should work fine
lapply(lst, function(i) colSds(as.matrix(i[, c(1,2, 4)]), na.rm = TRUE))
#[[1]]
#[1] 23.76272 18.24600 173.64427
#[[2]]
#[1] 22.70095 17.91415 139.72682
#[[3]]
#[1] 18.89224 18.40290 216.20110
#[[4]]
#[1] 10.66146 17.56891 180.27202
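If matrixStats is not available, the same standard deviations can be obtained in base R, because a data.frame is a list of columns. This is a minimal sketch that uses only the first data frame from the question; the name sd_d1 is illustrative.

```r
d1 <- data.frame(y1 = c(34, 56, 89, 45), y2 = c(42, 54, 68, 25),
                 y3 = c(253, 547, 586, 258), y4 = c(233, 537, 554, 258))

# sapply() walks over the selected columns and applies sd() to each one,
# so no conversion to a matrix is needed
sd_d1 <- sapply(d1[, c(1, 2, 4)], sd, na.rm = TRUE)
sd_d1
```

The result is a named numeric vector (y1, y2, y4) and should agree with the first element of the colSds output above.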
Also, the for loop in
for(i in ids) {
}
seems to be unnecessary if the intention is just to loop over the list of data.frame and get the colMeans and colSds. Also, we can do this in a single lapply call instead of multiple lapply
res1 <- lapply(lst, function(i) t(cbind(Mean = colMeans(i[, c(1,2, 4)],
na.rm = TRUE), Sds = colSds(as.matrix(i[, c(1,2, 4)]), na.rm = TRUE))))
and it can be converted to a single dataset by rbinding the contents
do.call(rbind, res1)
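The second error in the question (argument is not a matrix) most likely comes from indexing W.mean by name: lst was built without names, so W.mean[["d1"]] is NULL and t(NULL) fails. Below is a minimal sketch of one way around it, assuming it is acceptable to name the list elements and to keep the results in a list instead of creating separate objects with assign. The names lst_named and mean_dfs are illustrative, and only two of the four data frames are used for brevity.

```r
d1 <- data.frame(y1 = c(34, 56, 89, 45), y2 = c(42, 54, 68, 25),
                 y3 = c(253, 547, 586, 258), y4 = c(233, 537, 554, 258))
d2 <- data.frame(y1 = c(37, 26, 14, 67), y2 = c(65, 54, 43, 23),
                 y3 = c(243, 577, 516, 125), y4 = c(267, 527, 567, 368))

# Naming the elements lets the results be looked up by "d1", "d2", ...
lst_named <- list(d1 = d1, d2 = d2)

W.mean <- lapply(lst_named, function(d) colMeans(d[, c(1, 2, 4)], na.rm = TRUE))

# t() turns each named vector into a 1-row matrix; data.frame() keeps the column names
mean_dfs <- lapply(W.mean, function(m) data.frame(t(m)))
mean_dfs[["d1"]]
```

Keeping the per-data-frame results in one named list avoids get/assign and makes a later do.call(rbind, mean_dfs) straightforward.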

Related

T-test between a group of columns and a sorted list

This is my situation: 11 variables columns vs 1 column.
I want to test the single column (which is split by 0s and 1s to create two groups) against each of the 11 columns. The single column and the 11 columns have different lengths and are not homoscedastic.
I tried with:
TTest1 <- t.test(A$Column1, Bsorted0, var.equal = FALSE)
View(TTest1)
which gave me a p-value of 7.681668e-05
but how can I loop this in order to have A$column1,2,3,4,5...each one tested with Bsorted0? Is it the right approach?
Thanks
Loop through the list with lapply, calling the anonymous function \(x) (available since R 4.1.0) or function(x) (earlier versions of R).
TTest1_list <- lapply(A, \(x) {
tryCatch(t.test(x, Bsorted0, var.equal = FALSE),
error = \(e) e)
})
names(TTest1_list) <- names(A)
ok <- !sapply(TTest1_list, inherits, "error")
Err_list <- TTest1_list[!ok]
TTest1_list <- TTest1_list[ok]
Then, to extract the statistic or p-values, use sapply on the tests' list.
stat <- sapply(TTest1_list, `[[`, 'statistic')
pval <- sapply(TTest1_list, `[[`, 'p.value')
And to see the errors, apply conditionMessage to the errors list.
lapply(Err_list, conditionMessage)
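To see the whole pattern run end to end, here is a small self-contained sketch on made-up data. The contents of A and Bsorted0 are assumptions (random numbers, not the asker's data), and summary_df is an illustrative name. It collects the Welch t statistic and p-value of each column into one data frame.

```r
set.seed(1)
# Made-up stand-ins for the asker's objects
A <- data.frame(col1 = rnorm(20), col2 = rnorm(20, mean = 1))
Bsorted0 <- rnorm(15)   # different length from the columns of A

# One Welch t-test (var.equal = FALSE) per column of A
tests <- lapply(A, function(x) t.test(x, Bsorted0, var.equal = FALSE))

# vapply() guarantees one numeric value per test
summary_df <- data.frame(
  column    = names(tests),
  statistic = vapply(tests, function(tt) unname(tt$statistic), numeric(1)),
  p.value   = vapply(tests, function(tt) tt$p.value, numeric(1)),
  row.names = NULL
)
summary_df
```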

Intersection of pairs of sets (any possible combination)

I have more than three sets, but here is a small example.
S1<-c("Frizzy","Jack","Amy")
S2<-c("Alice","Samy","Anna","Jack")
S3<-c("Frizzy","Anna","Fred","Jack")
I would like to obtain the following result
length(intersect(S1,S2))+length(intersect(S1,S3))+length(intersect(S2,S3))
without writing all the possible combinations manually.
We can use combn to get the pairwise intersect between the elements, get the lengths of the list elements and find the sum
sum(lengths(combn(list(S1, S2, S3), 2,
FUN = function(x) Reduce(intersect, x), simplify = FALSE)))
#[1] 5
If there are many objects of the same pattern 'S' followed by some digits, use mget to get those all into a list instead of writing them manually
lst1 <- mget(ls(pattern = '^S\\d+$'))
sum(lengths(combn(lst1, 2,
FUN = function(x) Reduce(intersect, x), simplify = FALSE)))
#[1] 5
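If the per-pair counts are also of interest, a variant is to let combn generate pairs of names and look the sets up by name. This is a sketch under the assumption that the sets are stored in a named list; sets, pairs and counts are illustrative names.

```r
sets <- list(
  S1 = c("Frizzy", "Jack", "Amy"),
  S2 = c("Alice", "Samy", "Anna", "Jack"),
  S3 = c("Frizzy", "Anna", "Fred", "Jack")
)

# Each column of 'pairs' holds the names of one pair of sets
pairs <- combn(names(sets), 2)

# Size of the intersection for every pair
counts <- apply(pairs, 2, function(p) length(intersect(sets[[p[1]]], sets[[p[2]]])))
names(counts) <- apply(pairs, 2, paste, collapse = "-")

counts        # S1-S2: 1, S1-S3: 2, S2-S3: 2
sum(counts)   # 5
```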

Is there a way to sum together lists of data frames within a larger list?

I have a large list (z) containing 3 lists of 10 data frames. I would like to collapse this object into a list of 3 data frames where each data frame is the sum of the 10 prior data frames (think matrix addition). Here is what I am working with, keep in mind that these are fake numbers, as the real data are read in from hundreds of *.csv files
x = rep(1,100)
x = matrix(x,10,10)
x = as.data.frame(x)
y = list(x,x,x,x,x,x,x,x,x,x)
z = list(y,y,y)
The desired end product would look like this:
x1 = rep(10,100)
x1 = matrix(x1,10,10)
y1 = list(x1,x1,x1)
I keep trying stuff along the lines of:
z1 = c()
for (i in 1:3){
for (j in 1:10){
z1[[i]] = sum(z[[i]][[j]])
}
}
However, this does not yield the desired output. I have also messed around with some of the apply functions, but to no avail.
Thanks in advance for your help!
We can use Reduce to sum the corresponding i, j elements in the list and collapse it to a single dataset
lapply(z, function(x) Reduce(`+`, x))
If we want to remove the last column which is not numeric
lapply(z, function(x) Reduce(`+`, lapply(x, function(y) y[-ncol(y)])))
Or it can be looped over the sequence of list
lapply(seq_along(z), function(i) Reduce(`+`, lapply(seq_along(z[[i]]),
function(j) z[[i]][[j]][-ncol(z[[i]][[j]])])))
If we want to use sum, the data.frames inside the list can be converted to an array, loop over the array with apply, specify the MARGIN and do the sum. With this option, it is also possible to handle NA elements with na.rm = TRUE in sum
lapply(z, function(x) apply(array(unlist(x), c(10, 10, 10)),
1:2, sum, na.rm = TRUE))
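Or make it more efficient by looping over only one dimension and using rowSums on each slice (for a fixed column, the slice is rows by data frames, so rowSums adds across the data frames)
lapply(z, function(x) apply(array(unlist(x), c(10, 10, 10)), 2, rowSums, na.rm = TRUE))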
Or using a for loop
z1 <- replicate(length(z), matrix(0, 10, 10), simplify = FALSE)
for(i in seq_along(z)) for(j in seq_along(z[[1]])) z1[[i]] <- z1[[i]] + z[[i]][[j]]
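Because the example data are all 1s, a mistake in the choice of dimension would go unnoticed. The sketch below checks the Reduce approach against an array-based sum on small made-up data with distinct values; frames, by_reduce and by_array are illustrative names, and rowSums(arr, dims = 2) is used here as one way to sum over the third dimension.

```r
# Three 2 x 2 data frames with distinct values (toy data, not from the question)
frames <- list(
  data.frame(a = c(1, 2),     b = c(3, 4)),
  data.frame(a = c(10, 20),   b = c(30, 40)),
  data.frame(a = c(100, 200), b = c(300, 400))
)

# Element-wise sum of the data frames
by_reduce <- Reduce(`+`, frames)

# Same data as a rows x columns x data-frames array
arr <- array(unlist(frames), c(2, 2, 3))

# dims = 2 keeps the first two dimensions and sums over the rest
by_array <- rowSums(arr, dims = 2)

all.equal(unname(as.matrix(by_reduce)), by_array)
```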

Applying a function across 3 columns in a data frame only returns one column

I am trying to normalize all columns within a data frame using the function written below. When I try to apply it to all columns using the for loop below, the output returns only one column when I would expect three. The output is normalized correctly suggesting the function works and the for loop is the issue.
seq_along(df) returns the same output
### example df
df <- cbind(as.data.frame(c(2:12), c(3:13), c(4:14)))
### normalization function
rescale <- function(x) {
(x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
### for loop that returns one column although properly normalized
for (i in 1:ncol(df)){
i <- df[[i]]
output <- as.data.frame(rescale(i))
}
The syntax of as.data.frame is as.data.frame(x, row.names = NULL, optional = FALSE, ...). That means that in the call as.data.frame(c(2:12), c(3:13), c(4:14)), c(3:13) is matched to the argument row.names and c(4:14) to the argument optional. This means that your data.frame only has one column: 2:12.
The correct version should be: df <- as.data.frame(cbind(c(2:12), c(3:13), c(4:14))). Base R also has the function scale for this kind of task, but note that by default it centers on the mean and divides by the standard deviation, so its center and scale arguments must be supplied to get min-max normalization.
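Note also that the loop in the question assigns to output on every iteration, so even with three columns only the last result would survive. A minimal sketch of applying rescale to every column at once with lapply, using the corrected df and the asker's function:

```r
df <- as.data.frame(cbind(c(2:12), c(3:13), c(4:14)))

rescale <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# lapply() returns one rescaled vector per column;
# as.data.frame() puts them back together
output <- as.data.frame(lapply(df, rescale))
str(output)
```

Each of the three columns of output then runs from 0 to 1.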

Factors and Dummy Variables in R

I am new to data analytics and learning R. I have a few very basic questions which I am not very clear about. I hope to find some help here. Please bear with me, I am still learning.
I wrote a small function to perform basic exploratory analysis on a data set with 9 variables out of which 8 are of Int/Numeric type and 1 is Factor. The function is like this :
out <- function(x)
{
c <- class(x)
na.len <- length(which(is.na(x)))
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
uc <- m+3*s
lc <- m-3*s
return(c(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
}
And I apply it to the data set using :
stats <- apply(train, 2, FUN = out)
But the output file has all the class of variables as Character and all the Means as NA. After some head hurting, I figured that the problem is due to the Factor variable. I converted it to Numeric using this :
train$MonthlyIncome=as.numeric(as.character(train$MonthlyIncome))
It worked fine. But I am confused: if I use the above function without looking at the dataset first, it won't work. How can I handle this situation?
When should I consider creating dummy variables?
Thank you in advance, and I hope the questions are not too silly!
Note that c() results in a vector and all elements within the vector must be of the same class. If the elements have different classes, then c() uses the least complex class which is able to hold all information. E.g. numeric and integer will result in numeric. character and integer will result in character.
Use a list or a data.frame if you need different classes.
out <- function(x)
{
c <- class(x)
na.len <- length(which(is.na(x)))
m <- mean(x, na.rm = TRUE)
s <- sd(x, na.rm = TRUE)
uc <- m+3*s
lc <- m-3*s
return(data.frame(classofvar = c, noofNA = na.len, mean=m, stdev=s, UpperCap = uc, LowerCap = lc))
}
sum(is.na(x)) is faster than length(which(is.na(x)))
Use lapply to run the function on each variable. Use do.call to append the resulting dataframes.
stats <- do.call(
rbind,
lapply(train, out)
)
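To make the function usable without inspecting the data first, one option is to compute the numeric summaries only for numeric columns and report NA otherwise. This is a sketch under that assumption, not the only reasonable policy; out_safe and train_toy are illustrative names, and train_toy is made-up data standing in for train.

```r
out_safe <- function(x) {
  is_num <- is.numeric(x)
  # Only numeric columns get a mean and sd; everything else reports NA
  m <- if (is_num) mean(x, na.rm = TRUE) else NA_real_
  s <- if (is_num) sd(x, na.rm = TRUE) else NA_real_
  data.frame(
    classofvar = class(x)[1],   # first class only, so each column gives one row
    noofNA     = sum(is.na(x)),
    mean       = m,
    stdev      = s,
    UpperCap   = m + 3 * s,
    LowerCap   = m - 3 * s
  )
}

# Made-up data: one numeric column and one factor column
train_toy <- data.frame(
  age    = c(25, 40, NA, 31),
  income = factor(c("low", "high", "low", NA))
)

stats <- do.call(rbind, lapply(train_toy, out_safe))
stats
```

Because lapply passes each column unchanged, the factor column keeps its class, whereas apply would first coerce the whole data frame to a character matrix.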
