R - sapply Over Columns, then lapply Over Elements

Likely because I've spent an hour on this, I'm curious whether it is possible - I am trying to transform each element of each column in a dataframe, where the transformation applied to an element depends on the mean and standard deviation of the column that the element is in. I wanted to use nested lapply or sapply calls to do this, but ran into some unforeseen issues. My current "solution" (although it does not work as expected) is:
scale_variables <- function(dframe, columns) {
  # colSds() is from the matrixStats package
  means <- colMeans(dframe[sapply(dframe, is.numeric)])
  sds <- colSds(as.matrix(dframe[sapply(dframe, is.numeric)]))
  new_dframe <- lapply(seq_along(means), FUN = function(m) {
    sapply(dframe[, columns], FUN = function(x) {
      sapply(x, FUN = helper_func, means[[m]], sds[m])
    })
  })
  return(new_dframe)
}
So, I calculate the column means and SDs beforehand; then I lapply over the index of each mean in means, over each of the columns with the first sapply, and over each element with the second sapply. I look up the mean and SD of the current column using index m, then pass the current element, mean, and SD to the helper function to work on.
Running this on the numeric variables in the iris dataset yields this monstrosity:
'data.frame': 150 obs. of 16 variables:
$ Sepal.Length : num -0.898 -1.139 -1.381 -1.501 -1.018 ...
$ Sepal.Width : num -2.83 -3.43 -3.19 -3.31 -2.71 ...
$ Petal.Length : num -5.37 -5.37 -5.49 -5.25 -5.37 ...
$ Petal.Width : num -6.82 -6.82 -6.82 -6.82 -6.82 ...
$ Sepal.Length.1: num 4.69 4.23 3.77 3.54 4.46 ...
$ Sepal.Width.1 : num 1.0156 -0.1315 0.3273 0.0979 1.245 ...
$ Petal.Length.1: num -3.8 -3.8 -4.03 -3.57 -3.8 ...
$ Petal.Width.1 : num -6.56 -6.56 -6.56 -6.56 -6.56 ...
$ Sepal.Length.2: num 0.76 0.647 0.534 0.477 0.704 ...
$ Sepal.Width.2 : num -0.1462 -0.4294 -0.3161 -0.3727 -0.0895 ...
$ Petal.Length.2: num -1.34 -1.34 -1.39 -1.28 -1.34 ...
$ Petal.Width.2 : num -2.02 -2.02 -2.02 -2.02 -2.02 ...
$ Sepal.Length.3: num 5.12 4.86 4.59 4.46 4.99 ...
$ Sepal.Width.3 : num 3.02 2.36 2.62 2.49 3.15 ...
$ Petal.Length.3: num 0.263 0.263 0.132 0.394 0.263 ...
$ Petal.Width.3 : num -1.31 -1.31 -1.31 -1.31 -1.31 ...
I assume I am applying each mean in means to every column of the dataframe in turn, when I only want to use it for elements in the column it refers to, so I'm not sure that nesting apply functions this way will do what I need. But can it be done like this?

I'm not sure what your helper_func is, but I've made a toy example below:
helper_func <- function(x,m,sd) (x-m)/sd
You can then adjust your scale_variables() function like this:
scale_variables <- function(dframe, columns) {
  means <- apply(dframe[columns], 2, mean, na.rm = TRUE)
  sds <- apply(dframe[columns], 2, sd, na.rm = TRUE)
  sapply(columns, \(col) helper_func(dframe[[col]], m = means[col], sd = sds[col]))
}
And call it like this:
scale_variables(iris, names(iris)[sapply(iris, is.numeric)])
Output: (first 6 of 150 rows)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 -1.01843718 1.24503015 -1.33575163 -1.3110521482
6 -0.53538397 1.93331463 -1.16580868 -1.0486667950
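As an aside, base R's scale() already performs exactly this column-wise standardization (subtract the column mean, divide by the column SD), so it makes a handy cross-check for helper_func:

```r
# standardize all numeric columns of iris with base R's scale()
num <- sapply(iris, is.numeric)
scaled <- scale(iris[num])  # returns a matrix with the same column names
head(scaled)
```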

Related

Error when running boxcox on response variable

I'm using the following code to try to transform my response variable for regression. Seems to need a log transformation.
bc = boxCox(auto.tf.lm)
lambda.mpg = bc$x[which.max(bc$y)]
auto.tf.bc <- with(auto_mpg, data.frame(log(mpg), as.character(cylinders), displacement**.2, log(as.numeric(horsepower)), log(weight), log(acceleration), model_year))
auto.tf.bc.lm <- lm(log(mpg) ~ ., data = auto.tf.bc)
view(auto.tf.bc)
I am receiving this error though.
Error in Math.data.frame(mpg) :
non-numeric variable(s) in data frame: manufacturer, model, trans, drv, fl, class
Not sure how to resolve this. The data is in a data frame, not a CSV.
Here's the output from str(auto.tf.bc). Sorry for such bad question formatting.
'data.frame': 392 obs. of 7 variables:
$ log.mpg. : num 2.89 2.71 2.89 2.77 2.83 ...
$ as.character.cylinders.: chr "8" "8" "8" "8" ...
$ displacement.0.2 : num 3.14 3.23 3.17 3.14 3.13 ...
$ log.horsepower. : num 4.87 5.11 5.01 5.01 4.94 ...
$ log.weight. : num 8.16 8.21 8.14 8.14 8.15 ...
$ log.acceleration. : num 2.48 2.44 2.4 2.48 2.35 ...
$ model_year : num 70 70 70 70 70 70 70 70 70 70 ...
Removing the cylinders column doesn't change anything.
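The error message itself points at the cause: math functions such as log() refuse a data frame that contains non-numeric columns (the check lives in Math.data.frame). A minimal sketch of the usual workaround, on a toy data frame with made-up columns, is to restrict the transformation to the numeric columns:

```r
df <- data.frame(mpg = c(21, 22.8), cyl = c(6, 4), name = c("a", "b"))

# log(df) would fail because of the character column "name";
# apply log only to the numeric columns instead
num <- sapply(df, is.numeric)
df[num] <- lapply(df[num], log)
```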

Accessing dataframes after splitting a dataframe

I'm splitting a dataframe in multiple dataframes using the command
data <- apply(data, 2, function(x) data.frame(sort(x, decreasing=F)))
I don't know how to access them all at once; I know I can access them using df$`1`, but I'd have to do that for every dataframe:
df1<- head(data$`1`,k)
df2<- head(data$`2`,k)
Can I get these dataframes in one go (storing them in some structure), without the indexes of the individual dataframes changing?
str(data) gives
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
str(data[1:2])
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
Thanks to @r2evans I got it done; here is his code from the comments:
Yes. Two short demos: lapply(data, head, n=2), or more generically
sapply(data, function(df) mean(df$x)). – r2evans
and after that, fetching the indexes:
df <- lapply(df, rownames)
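Put together, a small runnable sketch of that approach on toy data (the column names 7 and 8 are assumed from the str() output above):

```r
set.seed(42)
data <- data.frame("7" = runif(7), "8" = runif(7), check.names = FALSE)

# split into a named list of one-column data frames, each sorted ascending
parts <- apply(data, 2, function(x) data.frame(sort(x, decreasing = FALSE)))

# take the top k rows of every data frame in one go
k <- 3
tops <- lapply(parts, head, n = k)

# fetch the original row indexes of each piece (kept as row names)
idx <- lapply(tops, rownames)
```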

Quickly sum a big list of lists?

I have 10,000 lists (the results of a simulation), each containing 22,500 lists (each one a pixel in an image), each of which contains a vector of length 55.
# Simple Example
m <- replicate(4, list(runif(55)))
m2 <- replicate(3, list(m))
str(m2,list.len = 3)
List of 3
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
$ :List of 4
..$ : num [1:55] 0.107 0.715 0.826 0.582 0.604 ...
..$ : num [1:55] 0.949 0.389 0.645 0.331 0.698 ...
..$ : num [1:55] 0.138 0.207 0.32 0.442 0.721 ...
.. [list output truncated]
# my function
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
#by hand
identical(m2[[1]][[1]] + m2[[2]][[1]] + m2[[3]][[1]], m3[[1]] )
I wrote a nested lapply with Reduce to sum the lists. On a small example like the one above it's fast, but on my real data it's really slow.
#slow code
m <- replicate(22500, list(runif(55)))
m2 <- replicate(10000, list(m))
str(m2,list.len = 3)
m3 <- lapply(seq_along(m2[[1]]), FUN = function(j) Reduce('+', lapply(seq_along(m2), FUN = function(i) m2[[i]][[j]])))
How can I speed this up, or should I change data structures?
Thanks.
This gives some improvement (>2x):
split(Reduce(`+`, lapply(m2, unlist)), rep(seq_along(m2[[1]]), lengths(m2[[1]])))
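To convince yourself that the unlist/re-split really matches the nested Reduce, a quick check on a toy example (seed and sizes chosen arbitrarily):

```r
set.seed(1)
m  <- replicate(2, list(runif(5)))
m2 <- replicate(3, list(m))

# original nested lapply + Reduce
m3 <- lapply(seq_along(m2[[1]]), function(j) Reduce(`+`, lapply(m2, `[[`, j)))

# flatten each simulation once, sum the long vectors, then re-split by pixel
m4 <- split(Reduce(`+`, lapply(m2, unlist)),
            rep(seq_along(m2[[1]]), lengths(m2[[1]])))

identical(m3[[1]], m4[[1]])  # TRUE
```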
Since your data is essentially rectangular, had you stored it in this shape:
library(data.table)
d = rbindlist(lapply(m2, function(x) transpose(as.data.table(x))),
              idcol = TRUE)[, id.in := 1:.N, by = .id]
# .id V1 V2 V55 id.in
#1: 1 0.4605065 0.09744975 ... 0.8620728 1
#2: 1 0.6666742 0.10435471 ... 0.3991940 2
#3: 2 0.4605065 0.09744975 ... 0.8620728 1
#4: 2 0.6666742 0.10435471 ... 0.3991940 2
#5: 3 0.4605065 0.09744975 ... 0.8620728 1
#6: 3 0.6666742 0.10435471 ... 0.3991940 2
You could do the aggregation even faster by doing:
d[, lapply(.SD, sum), by = id.in]
But if the list is your starting point, the conversion would take up the majority of the time.

Dealing with Zero Values in Principal Component Analysis

I've really been struggling to get my PCA working and I think it is because there are zero values in my data set. But I don't know how to resolve the issue.
The first problem is, the zero values are not missing values (they are areas with no employment in a certain sector), so I should probably keep them in there. I feel uncomfortable that they might be excluded because they are zero.
Secondly, even when I try to remove all missing data, I still get the same error message.
Starting with the following code, I get the following error message:
urban.pca.cov <- princomp(urban.cov, cor = TRUE)
Error in cov.wt(z) : 'x' must contain finite values only
Also, I can do this:
urban.cut <- na.omit(urban.cut)
> sum(is.na(urban.cut))
[1] 0
And then run it again and get the same issue.
urban.pca.cov <- princomp(urban.cov, cor = TRUE)
Error in cov.wt(z) : 'x' must contain finite values only
Is this a missing data issue? I've log transformed all of my variables according to this PCA tutorial. Here is the structure of my data.
> str(urban.cut)
'data.frame': 5490 obs. of 13 variables:
$ median.lt : num 2.45 2.57 2.53 2.6 2.31 ...
$ p.nga.lt : num 0.547 4.587 4.529 4.605 4.564 ...
$ p.mbps2.lt : num 1.66 4.17 4 3.9 4.2 ...
$ density.lt : num 3.24 3.44 3.85 3.21 4.28 ...
$ p_m_s.lt : num 4.54 4.61 4.56 4.61 4.61 ...
$ p_m_l.lt : num 1.87 -Inf 1.44 -Inf -Inf ...
$ p.tert.lt : num 4.59 4.61 4.55 4.61 4.61 ...
$ p.kibs.lt : num 4.25 3.05 3.12 3 3.03 ...
$ p.edu.lt : num 4.14 2.6 2.9 2.67 2.57 ...
$ p.non.white.lt : num 3.06 3.56 3.82 2.94 3.52 ...
$ p.claim.lt : num 0.459 1.287 1.146 1.415 1.237 ...
$ d.connections.lt: num 2.5614 0.6553 5.2573 0.9562 -0.0252 ...
$ SAM.KM.lt2 : num 1.449 1.081 1.071 1.246 0.594 ...
Thank you in advance for your help.
Sounds to me like R wants finite values. -Inf is not finite; it is minus infinity. Perhaps you should be doing log(data + 1) if you really need to log-transform your data, rather than taking the log of 0.
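In code the difference is easy to see; log1p() is base R's built-in for log(x + 1) and keeps zeros finite:

```r
x <- c(0, 1, 10)
log(x)     # first element is -Inf, which cov.wt()/princomp() reject
log1p(x)   # log(x + 1): finite everywhere, and 0 maps to 0

# count the non-finite values a plain log transform would introduce
sum(!is.finite(log(x)))
```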

Non-numeric argument to mathematical function

I want to get p-values from a data set. I have not had problems using pnorm before, but I do now.
data(iris)
iris[,-5]<- scale(as.matrix(iris[,-5]))
# K-Means Cluster Analysis
fit <- kmeans(iris[,-5], 5) # 5 cluster solution
# get cluster means
aggregate(iris[,-5],by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(iris, fit$cluster)
pval<- pnorm(iris[,-5])
After this, I get the message:
Error in pnorm(q, mean, sd, lower.tail, log.p) :
  Non-numeric argument to mathematical function
What is the problem? I do not understand why this is happening.
Please let me know.
You are trying to pass a dataframe to a function that expects a numeric vector:
> is.numeric(iris[,-5])
[1] FALSE
> str(iris[,-5])
'data.frame': 150 obs. of 4 variables:
$ Sepal.Length: num -0.898 -1.139 -1.381 -1.501 -1.018 ...
$ Sepal.Width : num 1.0156 -0.1315 0.3273 0.0979 1.245 ...
$ Petal.Length: num -1.34 -1.34 -1.39 -1.28 -1.34 ...
$ Petal.Width : num -1.31 -1.31 -1.31 -1.31 -1.31 ...
Try passing just a single column, like:
pnorm(iris[,1])
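If the goal is p-values for every column at once, two options (a sketch; note that pnorm() does accept a numeric matrix, just not a data frame):

```r
data(iris)
scaled <- scale(as.matrix(iris[, -5]))  # a matrix, so pnorm() works on it directly
p_mat <- pnorm(scaled)

# or column by column, keeping a data frame
p_df <- as.data.frame(lapply(as.data.frame(scaled), pnorm))
```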
