I want to get p-values from a data set. I have never had any problems using pnorm before, but now I do.
data(iris)
iris[,-5]<- scale(as.matrix(iris[,-5]))
# K-Means Cluster Analysis
fit <- kmeans(iris[,-5], 5) # 5 cluster solution
# get cluster means
aggregate(iris[,-5],by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata <- data.frame(iris, fit$cluster)
pval<- pnorm(iris[,-5])
After this, I get the error:

Error in pnorm(q, mean, sd, lower.tail, log.p) :
  Non-numeric argument to mathematical function

What is the problem? I do not understand why this is happening. Please let me know.
You are trying to pass a data frame to a function that expects a numeric vector:
> is.numeric(iris[,-5])
[1] FALSE
> str(iris[,-5])
'data.frame': 150 obs. of 4 variables:
$ Sepal.Length: num -0.898 -1.139 -1.381 -1.501 -1.018 ...
$ Sepal.Width : num 1.0156 -0.1315 0.3273 0.0979 1.245 ...
$ Petal.Length: num -1.34 -1.34 -1.39 -1.28 -1.34 ...
$ Petal.Width : num -1.31 -1.31 -1.31 -1.31 -1.31 ...
Try passing just a single column, like:
pnorm(iris[,1])
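If you want the values for every column at once, pnorm() also accepts a numeric matrix; a minimal sketch using the scaled iris data from the question:

```r
data(iris)
iris[, -5] <- scale(as.matrix(iris[, -5]))

# pnorm() is vectorized, so a numeric matrix works where a data frame does not:
pvals <- pnorm(as.matrix(iris[, -5]))

# Alternatively, keep a data frame by applying pnorm() column by column:
pvals_df <- as.data.frame(lapply(iris[, -5], pnorm))
```

Both give the same numbers; the first returns a matrix, the second a data frame.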
Likely because I've spent an hour on this, I'm curious whether it is possible: I am trying to transform each element of each column in a data frame, where the transformation applied to an element depends on the mean and standard deviation of the column that element is in. I wanted to use nested lapply or sapply calls to do this, but ran into some unforeseen issues. My current "solution" (although it does not work as expected) is:
library(matrixStats)  # for colSds()

scale_variables <- function(dframe, columns) {
  means <- colMeans(dframe[sapply(dframe, is.numeric)])
  sds <- colSds(as.matrix(dframe[sapply(dframe, is.numeric)]))
  new_dframe <- lapply(seq_along(means), FUN = function(m) {
    sapply(dframe[, columns], FUN = function(x) {
      sapply(x, FUN = helper_func, means[[m]], sds[m])
    })
  })
  return(new_dframe)
}
So, I calculate the column means and SDs beforehand; then I seq_along the indices of means, loop over each of the columns with the first sapply, and over each element with the second sapply. I get the mean and SD of that particular column using index m, then pass the current element, mean, and SD to the helper function to work on.
Running this on the numeric variables in the iris dataset yields this monstrosity:
'data.frame': 150 obs. of 16 variables:
$ Sepal.Length : num -0.898 -1.139 -1.381 -1.501 -1.018 ...
$ Sepal.Width : num -2.83 -3.43 -3.19 -3.31 -2.71 ...
$ Petal.Length : num -5.37 -5.37 -5.49 -5.25 -5.37 ...
$ Petal.Width : num -6.82 -6.82 -6.82 -6.82 -6.82 ...
$ Sepal.Length.1: num 4.69 4.23 3.77 3.54 4.46 ...
$ Sepal.Width.1 : num 1.0156 -0.1315 0.3273 0.0979 1.245 ...
$ Petal.Length.1: num -3.8 -3.8 -4.03 -3.57 -3.8 ...
$ Petal.Width.1 : num -6.56 -6.56 -6.56 -6.56 -6.56 ...
$ Sepal.Length.2: num 0.76 0.647 0.534 0.477 0.704 ...
$ Sepal.Width.2 : num -0.1462 -0.4294 -0.3161 -0.3727 -0.0895 ...
$ Petal.Length.2: num -1.34 -1.34 -1.39 -1.28 -1.34 ...
$ Petal.Width.2 : num -2.02 -2.02 -2.02 -2.02 -2.02 ...
$ Sepal.Length.3: num 5.12 4.86 4.59 4.46 4.99 ...
$ Sepal.Width.3 : num 3.02 2.36 2.62 2.49 3.15 ...
$ Petal.Length.3: num 0.263 0.263 0.132 0.394 0.263 ...
$ Petal.Width.3 : num -1.31 -1.31 -1.31 -1.31 -1.31 ...
I assume I am applying each mean in means to each column of the dataframe in turn, when I only want to use it for elements in the column it refers to, so I'm not sure that nesting apply functions in this way will do what I need. But can it be done like this?
I'm not sure what your helper_func is, but I've made a toy example below:
helper_func <- function(x,m,sd) (x-m)/sd
You can then adjust your scale_variables() function like this:
scale_variables <- function(dframe, columns) {
  means <- apply(dframe[columns], 2, mean, na.rm = TRUE)
  sds <- apply(dframe[columns], 2, sd)
  sapply(columns, \(col) helper_func(dframe[[col]], m = means[col], sd = sds[col]))
}
And call it like this:
scale_variables(iris,names(iris)[sapply(iris, is.numeric)])
Output (first 6 of 150 rows):
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 -0.89767388 1.01560199 -1.33575163 -1.3110521482
2 -1.13920048 -0.13153881 -1.33575163 -1.3110521482
3 -1.38072709 0.32731751 -1.39239929 -1.3110521482
4 -1.50149039 0.09788935 -1.27910398 -1.3110521482
5 -1.01843718 1.24503015 -1.33575163 -1.3110521482
6 -0.53538397 1.93331463 -1.16580868 -1.0486667950
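For what it's worth, the toy helper_func above is exactly the z-score transform, so base R's scale() reproduces these numbers without any custom function; a quick check on the same iris data:

```r
data(iris)
num_cols <- sapply(iris, is.numeric)

# scale() centers each column and divides by its SD, i.e. (x - mean) / sd:
scaled <- scale(iris[num_cols])

# Spot-check one column against the manual transform:
manual <- (iris$Sepal.Length - mean(iris$Sepal.Length)) / sd(iris$Sepal.Length)
all.equal(unname(scaled[, "Sepal.Length"]), manual)  # TRUE
```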
I have a dataset in which we measured well-being with 18 items and political orientation (let's just assume for the moment that political orientation is measured with one item).
A person’s well-being score can be computed by taking the average of all 18 items, but also by taking the average of each possible combination of items (e.g., all combinations of one item, two items, etc.), resulting in sum(choose(18, 0:18)) = 262,144 possible combinations.
I am interested in how the correlation coefficient between well-being and political orientation changes depending on how well-being is computed. That is, I am interested in getting all 18 (choose(18,1) = 18) correlation coefficients if well-being is assessed with each of the 18 items and then correlated with political orientation, all 153 correlation coefficients if well-being is computed with all possible combinations of 2-items and then correlated with political orientation etc. So in the end I'd be looking for 262,144 correlation coefficients.
The dataset looks something like this (just with >10,000 participants), where V19 is political orientation and V1 to V18 are the well-being items.
df <- as.data.frame(matrix(rnorm(190), ncol = 19))
In essence, I am asking how to compute the average of all combinations of 2, 3, …, 17 well-being items. I came across tidyr's expand() function, but it seems to do something else.
Here are some steps to (1) calculate the average across the combinations of 18 factors; and then (2) correlate each of those combined-averages with the 19th column (political orientation).
set.seed(42)
df <- as.data.frame(matrix(rnorm(190), ncol = 19))
df[,1:3]
# V1 V2 V3
# 1 1.37096 1.3049 -0.3066
# 2 -0.56470 2.2866 -1.7813
# 3 0.36313 -1.3889 -0.1719
# 4 0.63286 -0.2788 1.2147
# 5 0.40427 -0.1333 1.8952
# 6 -0.10612 0.6360 -0.4305
# 7 1.51152 -0.2843 -0.2573
# 8 -0.09466 -2.6565 -1.7632
# 9 2.01842 -2.4405 0.4601
# 10 -0.06271 1.3201 -0.6400
rowMeans(df[,c(1,2)])
# [1] 1.3379 0.8610 -0.5129 0.1770 0.1355 0.2649 0.6136 -1.3756 -0.2110 0.6287
rowMeans(df[,c(1,3)])
# [1] 0.53216 -1.17300 0.09561 0.92377 1.14973 -0.26830 0.62713 -0.92891 1.23926 -0.35135
rowMeans(df[,c(2,3)])
# [1] 0.4991 0.2527 -0.7804 0.4679 0.8809 0.1027 -0.2708 -2.2098 -0.9902 0.3401
I show the row-means for three combinations because I want to verify where in the next step those values are found.
means <- lapply(1:3, function(N) {
  do.call(cbind,
          lapply(asplit(combn(18, N), 2),
                 function(ind) rowMeans(df[, ind, drop = FALSE])))
})
str(means)
# List of 3
# $ : num [1:10, 1:18] 1.371 -0.565 0.363 0.633 0.404 ...
# $ : num [1:10, 1:153] 1.338 0.861 -0.513 0.177 0.135 ...
# $ : num [1:10, 1:816] 0.7897 -0.0198 -0.3992 0.5229 0.722 ...
That last step produces a means object that contains the "1" (singular columns), "2" (pairwise row-averages), and "3"-deep combination-averages. Note that choose(18,2) is 153 (number of columns in means[[2]]) and choose(18,3) is 816 (means[[3]]). Each column represents the average of the respective columns combined.
I included 1 here (choose(18,1)) simply to keep all data in the same structure, since we do want to test the correlation of the single columns; other methods could achieve this, but I leaned towards consistency and simplicity.
To verify we have what we think, I'll pull out three columns from means[[2]] which correspond to the three rowMeans calculations I showed above based on direct access to df (inspection will reveal they are a match):
means[[2]][,c(1,2,18)]
# [,1] [,2] [,3]
# [1,] 1.3379 0.53216 0.4991
# [2,] 0.8610 -1.17300 0.2527
# [3,] -0.5129 0.09561 -0.7804
# [4,] 0.1770 0.92377 0.4679
# [5,] 0.1355 1.14973 0.8809
# [6,] 0.2649 -0.26830 0.1027
# [7,] 0.6136 0.62713 -0.2708
# [8,] -1.3756 -0.92891 -2.2098
# [9,] -0.2110 1.23926 -0.9902
# [10,] 0.6287 -0.35135 0.3401
This means that the columns are ordered as 1,2; 1,3; 1,4; …; 1,18; then 2,3 (column 18); 2,4; etc., through 17,18 (column 153).
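That ordering can be checked directly against combn(), since the columns of means[[2]] follow the columns of combn(18, 2):

```r
# Columns 1, 2 and 18 of combn(18, 2) are the index pairs (1,2), (1,3), (2,3),
# matching the three rowMeans() calls verified above:
combn(18, 2)[, c(1, 2, 18)]
#      [,1] [,2] [,3]
# [1,]    1    1    2
# [2,]    2    3    3
```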
From here, correlating each of those columns with V19 is not difficult:
cors <- lapply(means, function(mn) apply(mn, 2, cor, df$V19))
str(cors)
# List of 3
# $ : num [1:18] 0.2819 -0.3977 0.0426 0.2501 -0.063 ...
# $ : num [1:153] -0.27 0.168 0.472 0.192 0.6 ...
# $ : num [1:816] -0.1831 -0.063 -0.3355 0.0358 -0.3829 ...
cor(df$V1, df$V19)
# [1] 0.2819
cor(rowMeans(df[,c(1,2)]), df$V19)
# [1] -0.2702
cor(rowMeans(df[,c(1,3)]), df$V19)
# [1] 0.1677
cor(rowMeans(df[,c(1,2,3)]), df$V19)
# [1] -0.1831
cor(rowMeans(df[,c(1,2,4)]), df$V19)
# [1] -0.06303
Because of the way that was done, it should be straightforward to change the N of 3 to whatever you may need. Realize that choose(18,9) is 48,620, so generating those combination-averages is not instantaneous, but it is still quite manageable:
system.time({
  means18 <- lapply(1:18, function(N) {
    do.call(cbind,
            lapply(asplit(combn(18, N), 2),
                   function(ind) rowMeans(df[, ind, drop = FALSE])))
  })
})
# user system elapsed
# 41.65 0.58 50.35
str(means18)
# List of 18
# $ : num [1:10, 1:18] 1.371 -0.565 0.363 0.633 0.404 ...
# $ : num [1:10, 1:153] 1.338 0.861 -0.513 0.177 0.135 ...
# $ : num [1:10, 1:816] 0.7897 -0.0198 -0.3992 0.5229 0.722 ...
# $ : num [1:10, 1:3060] 0.7062 0.1614 -0.0406 0.24 0.6678 ...
# $ : num [1:10, 1:8568] 0.6061 0.0569 0.1191 0.0466 0.2606 ...
# $ : num [1:10, 1:18564] 0.5588 -0.0832 0.3619 0.146 0.2321 ...
# $ : num [1:10, 1:31824] 0.4265 -0.0449 0.3933 0.3251 0.095 ...
# $ : num [1:10, 1:43758] 0.2428 -0.0505 0.4221 0.1653 0.0153 ...
# $ : num [1:10, 1:48620] 0.3839 -0.0163 0.385 0.1335 -0.1191 ...
# $ : num [1:10, 1:43758] 0.4847 -0.0623 0.4115 0.2592 -0.2183 ...
# $ : num [1:10, 1:31824] 0.5498 0.0384 0.2829 0.4037 -0.259 ...
# $ : num [1:10, 1:18564] 0.5019 0.0442 0.2189 0.3281 -0.3759 ...
# $ : num [1:10, 1:8568] 0.3484 -0.0723 0.2117 0.2262 -0.3471 ...
# $ : num [1:10, 1:3060] 0.364 -0.102 0.197 0.29 -0.219 ...
# $ : num [1:10, 1:816] 0.334 -0.155 0.154 0.269 -0.232 ...
# $ : num [1:10, 1:153] 0.311 -0.242 0.217 0.235 -0.247 ...
# $ : num [1:10, 1:18] 0.282 -0.291 0.214 0.2 -0.198 ...
# $ : num [1:10, 1] 0.254 -0.228 0.105 0.283 -0.139 ...
and the rest of the process can be completed in a similar manner.
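If a single flat vector of coefficients is more convenient than a list, unlist() collapses it; a sketch reusing the toy data above, covering the 1-, 2- and 3-item combinations (18 + 153 + 816 = 987 coefficients):

```r
set.seed(42)
df <- as.data.frame(matrix(rnorm(190), ncol = 19))

means <- lapply(1:3, function(N) {
  do.call(cbind,
          lapply(asplit(combn(18, N), 2),
                 function(ind) rowMeans(df[, ind, drop = FALSE])))
})
cors <- lapply(means, function(mn) apply(mn, 2, cor, df$V19))

# Collapse the per-size list into one flat vector of all 987 coefficients:
all_cors <- unlist(cors)
length(all_cors)  # 987
```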
I would like to use rollapply or rollapplyr to apply the modwt function to my time series data.
I'm familiar with how rollapply/r works but I need some help setting up the output so that I can correctly store my results when using rollapply.
The modwt function in the waveslim package takes a time series and decomposes it into J levels, for my particular problem J = 4 which means I will have 4 sets of coefficients from my single time series stored in a list of 5. Of this list I am only concerned with d1,d2,d3 & d4.
The output of the modwt function looks as follows
> str(ar1.modwt)
List of 5
$ d1: num [1:200] -0.223 -0.12 0.438 -0.275 0.21 ...
$ d2: num [1:200] 0.1848 -0.4699 -1.183 -0.9698 -0.0937 ...
$ d3: num [1:200] 0.5912 0.6997 0.5416 0.0742 -0.4989 ...
$ d4: num [1:200] 1.78 1.86 1.85 1.78 1.65 ...
$ s4: num [1:200] 4.64 4.42 4.19 3.94 3.71 ...
- attr(*, "class")= chr "modwt"
- attr(*, "wavelet")= chr "la8"
- attr(*, "boundary")= chr "periodic"
In the example above I have applied the modwt function to the full length time series of length 200 but I wish to apply it to a small rolling window of 30 using rollapply.
I have already tried the following, but the output is a large matrix and I cannot easily identify which values belong to d1, d2, d3 or d4:
roller <- rollapplyr(ar1, 30,FUN=modwt,wf="la8",n.levels=4,boundary="periodic")
The output of this is a large matrix with the following structure:
> str(roller)
List of 855
$ : num [1:30] 0.117 -0.138 0.199 -1.267 1.872 ...
$ : num [1:30] -0.171 0.453 -0.504 -0.189 0.849 ...
$ : num [1:30] 0.438 -0.3868 0.1618 -0.0973 -0.0247 ...
$ : num [1:30] -0.418 0.407 0.639 -2.013 1.349 ...
...lots of rows omitted...
$ : num [1:30] 0.307 -0.658 -0.105 1.128 -0.978 ...
[list output truncated]
- attr(*, "dim")= int [1:2] 171 5
- attr(*, "dimnames")=List of 2
..$ : NULL
..$ : chr [1:5] "d1" "d2" "d3" "d4" ...
How can I set up a variable such that it will store the (200-30)+1 = 171 lists, each containing a list for each of the scales d1, d2, d3 and d4?
For a reproducible example please use the following:
library(waveslim)
data(ar1)
ar1.modwt <- modwt(ar1, "la8", 4)
1) Define modwt2, which invokes modwt, takes the first 4 components and strings them out into a numeric vector. Then use rollapplyr with that, giving rollr, where each row of rollr is the result of one call to modwt2. Finally, reshape each row of rollr into a separate matrix and create a list, L, of those matrices:
modwt2 <- function(...) unlist(head(modwt(...), 4))
rollr <- rollapplyr(ar1, 30, FUN = modwt2, wf = "la8", n.levels = 4, boundary = "periodic")
L <- lapply(1:nrow(rollr), function(i) matrix(rollr[i,], , 4))
If a 30 x 4 x 171 array is desired then the following will simplify it into a 3d array:
simplify2array(L)
or as a list of lists:
lapply(L, function(x) as.list(as.data.frame(x)))
2) This is an alternate solution that just uses lapply directly and returns a list, each of whose components is the list consisting of d1, d2, d3 and d4.
lapply(1:(200-30+1), function(i, ...) head(modwt(ar1[seq(i, length = 30)], ...), 4),
wf = "la8", n.levels = 4, boundary = "periodic")
Updates: Code improvements, expand (1) and add (2).
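To pull a single scale out of the result of approach (2), index the outer list by window and the inner list by name; a minimal sketch (requires the waveslim package):

```r
library(waveslim)
data(ar1)

res <- lapply(1:(200 - 30 + 1), function(i, ...)
         head(modwt(ar1[seq(i, length = 30)], ...), 4),
       wf = "la8", n.levels = 4, boundary = "periodic")

# d2 coefficients of the first 30-observation window:
res[[1]]$d2

# All d2 coefficients across windows, as a 30 x 171 matrix:
d2_mat <- sapply(res, `[[`, "d2")
```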
I'm trying to fit a PLSR model, but I'm doing something wrong. Below you can see how I created the data frame and its structure.
reflektance <- read_excel("data/reflektance.xlsx", na = "NA")
reflektance <- dput(reflektance)
pH <- read_excel("data/rijen2016.xls", na = "NA")
pH <- na.omit(pH)
pH <- dput(pH)
reflektance<-aggregate(reflektance[, 2:753], list(reflektance$Vzorek), mean)
colnames(reflektance)[colnames(reflektance)=='Group.1']<-'Vzorek'
datapH <- merge(pH, reflektance, by="Vzorek")
datasetpH <- data.frame(pH=datapH[,2], ref=I(as.matrix(datapH[, 3:754], 22, 752)))
The problem is with using plsr, because the result is this error:
ph1<-plsr(pH ~ ref, ncomp = 5, data=datasetpH)
Error in pls::mvr(ref ~ pH, ncomp = 5, data = datasetpH, method = "kernelpls") :
Invalid number of components, ncomp
dput(reflektance):
https://jpst.it/RyyS
Here you can see structure of table datapH:
'data.frame': 22 obs. of 754 variables:
$ Vzorek: chr "5 - P01" "5 - P02" "5 - P03" "5 - R1 - A1" ...
$ pH/H2O: num 6.96 6.62 7.02 5.62 5.97 6.12 5.64 5.81 5.61 5.47 ...
$ 325 : num 0.017 0.0266 0.0191 0.0241 0.016 ...
$ 326 : num 0.021 0.0263 0.0154 0.0264 0.0179 ...
$ 327 : num 0.0223 0.0238 0.0147 0.028 0.0198 ...
...
And here structure of table datasetpH:
'data.frame': 22 obs. of 2 variables:
$ pH : num 6.96 6.62 7.02 5.62 5.97 6.12 5.64 5.81 5.61 5.47 ...
$ ref: AsIs [1:22, 1:752] 0.016983.... 0.026556.... 0.019059.... 0.024097.... 0.016000.... ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr "325" "326" "327" "328" ...
Do you have any advice or a solution? Thank you.
The problem seems to come from one of your columns containing only NAs.
The last line of the output of names(df) gives:
[745] "1068" "1069" "1070" "1071" "1072" "1073" "1074" "1075" NA
Using your data plus some randomly generated values for pH (which isn't in the reflektance data frame, named df here), and including the trailing NA column:
test=data.frame(pH=rnorm(23,5,2), ref=I(as.matrix(df[, 2:753], 22, 752)))
pls::plsr(pH ~ ref, data=test)
Error in matrix(0, ncol = ncomp, nrow = npred) :
invalid 'ncol' value (< 0)
Note that the indexing is a bit different from yours. I didn't have the second column in df (the one that contains pH in yours).
If I remove the last column, which contains only NAs:
test=data.frame(pH=rnorm(23,5,2), ref=I(as.matrix(df[, 2:752], 22, 751)))
pls::plsr(pH ~ ref, data=test)
Partial least squares regression , fitted with the kernel algorithm.
Call:
plsr(formula = pH ~ ref, data = test)
Let me know if that fixes it.
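A general way to guard against this is to drop any all-NA columns before building the predictor matrix; a small base-R sketch with toy data:

```r
# Toy data frame with one column that is entirely NA:
df <- data.frame(a = 1:5, b = NA_real_, c = 6:10)

# Keep only columns that contain at least one non-NA value:
keep <- colSums(!is.na(df)) > 0
df_clean <- df[, keep, drop = FALSE]
names(df_clean)  # "a" "c"
```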
I'm splitting a dataframe into multiple dataframes using the command
data <- apply(data, 2, function(x) data.frame(sort(x, decreasing=F)))
I don't know how to access them. I know I can access them using data$`1`, but I have to do that for every dataframe:
df1 <- head(data$`1`, k)
df2 <- head(data$`2`, k)
Can I get these dataframes in one go (e.g., stored in some structure), while keeping the indexes of each dataframe unchanged?
str(data) gives
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
str(data[1:2])
List of 2
$ 7:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.265 0.332 0.458 0.51 0.52 ...
$ 8:'data.frame': 7 obs. of 1 variable:
..$ sort.x..decreasing...F.: num [1:7] 0.173 0.224 0.412 0.424 0.5 ...
Thanks to @r2evans I got it done. Here is his code from the comments:
Yes. Two short demos: lapply(data, head, n=2), or more generically
sapply(data, function(df) mean(df$x)). – r2evans
and after that, fetching the indexes:
df<-lapply(df, rownames)
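Putting the comment's suggestion together into one runnable sketch (with toy data standing in for the real frame):

```r
# Toy stand-in for the original data frame:
data <- data.frame(`7` = c(0.52, 0.26, 0.45), `8` = c(0.42, 0.17, 0.50),
                   check.names = FALSE)

# Split into one sorted, one-column data frame per column, as in the question:
split_dfs <- apply(data, 2, function(x) data.frame(sort(x, decreasing = FALSE)))

k <- 2
# Top k rows of every data frame in one go:
tops <- lapply(split_dfs, head, n = k)

# And the row indexes of each, via the rownames:
idx <- lapply(split_dfs, rownames)
```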