SVD imputation in R

I'm trying to use the SVD imputation from the bcv package but all the imputed values are the same (by column).
This is the dataset with missing data
http://pastebin.com/YS9qaUPs
#load data
dataMiss = read.csv('dataMiss.csv')
#impute data
SVDimputation = round(impute.svd(dataMiss)$x, 2)
#find index of missing values
bool = apply(X = dataMiss, 2, is.na)
#put in a new data frame only the imputed value
SVDImpNA = mapply(function(x,y) x[y], as.data.frame(SVDimputation), as.data.frame(bool))
View(SVDImpNA)
head(SVDImpNA)
V1 V2 V3
[1,] -0.01 0.01 0.01
[2,] -0.01 0.01 0.01
[3,] -0.01 0.01 0.01
[4,] -0.01 0.01 0.01
[5,] -0.01 0.01 0.01
[6,] -0.01 0.01 0.01
Where am I going wrong?

The impute.svd algorithm works as follows:
1. Replace all missing values with the corresponding column means.
2. Compute a rank-k approximation to the imputed matrix.
3. Replace the values in the originally missing positions with the corresponding values from the rank-k approximation computed in Step 2.
4. Repeat Steps 2 and 3 until convergence.
In your example code, you are using the default k = min(n, p). With that choice, the rank-k approximation in Step 2 is exactly equal to the imputed matrix, so the algorithm converges after zero iterations. That is, the algorithm sets all imputed entries to be the column means (or something extremely close to this, up to numerical error).
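For intuition, here is a minimal plain-R sketch of that iteration (an illustration of the algorithm only, not the bcv package's actual implementation; it assumes X is a numeric matrix containing NAs):

impute_svd_sketch <- function(X, k, maxiter = 100, tol = 1e-6) {
  miss <- is.na(X)
  # Step 1: start from the column means
  X[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  for (iter in seq_len(maxiter)) {
    # Step 2: rank-k approximation of the current completed matrix
    s  <- svd(X, nu = k, nv = k)
    Xk <- s$u %*% (s$d[1:k] * t(s$v))
    # Step 3: overwrite only the originally missing cells
    delta <- sqrt(mean((X[miss] - Xk[miss])^2))
    X[miss] <- Xk[miss]
    # Step 4: stop once the imputed cells stop changing
    if (delta < tol) break
  }
  X
}

With k = min(n, p), Xk reproduces the completed matrix exactly, so the loop exits immediately and the missing cells keep their column-mean values.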
If you want to do something other than impute the missing values with the column means, you need to use a smaller value for k. The following code demonstrates this with your sample data:
> library("bcv")
> dataMiss = read.csv('dataMiss.csv')
k=3
> SVDimputation = impute.svd(dataMiss, k = 3, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-0.01 0.01
531 1062
k=2
> SVDimputation = impute.svd(dataMiss, k = 2, maxiter=10000)$x
> table(round(SVDimputation[is.na(dataMiss)], 2))
-11.31 -6.94 -2.59 -2.52 -2.19 -2.02 -1.67 -1.63
25 23 61 2 54 23 5 44
-1.61 -1.2 -0.83 -0.8 -0.78 -0.43 -0.31 -0.15
14 10 13 19 39 1 14 19
-0.14 -0.02 0 0.01 0.02 0.03 0.06 0.17
83 96 94 77 30 96 82 28
0.46 0.53 0.55 0.56 0.83 0.91 1.26 1.53
1 209 83 23 28 111 16 8
1.77 5.63 9.99 14.34
112 12 33 5
Note that for your data, the default maximum number of iterations (100) was too low (I got a warning message). To fix this, I set maxiter=10000.

The problem that you describe likely occurs because impute.svd initially sets all of the NA values to be equal to the column means, and then doesn't change these values upon convergence.
It depends on why you are using SVD imputation in the first place, but if you are flexible, a good solution is to lower the rank of the SVD by setting k explicitly, e.g. to 1. By default, k is set to min(n, p), where n = nrow and p = ncol, which for your data means k = 3. If you instead set k = 1 (as in the example in the impute.svd documentation), the problem does not occur:
library(bcv)
dataMiss = read.csv("dataMiss.csv")
SVDimputation = round(impute.svd(dataMiss, k = 1)$x, 2)
head(SVDimputation)
[,1] [,2] [,3]
[1,] 0.96 -0.23 0.52
[2,] 0.02 -0.23 -1.92
[3,] -1.87 -0.23 0.52
[4,] -0.92 -0.23 0.52
[5,] 0.49 -0.46 0.52
[6,] -1.87 -0.23 0.52
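As a quick sanity check (a sketch reusing the calls above; bcv and dataMiss are assumed to be loaded), you can compare how the imputed cells are distributed for a couple of ranks:

# Spread of the imputed values for k = 1 and k = 2
sapply(1:2, function(k) {
  fit <- impute.svd(dataMiss, k = k, maxiter = 10000)$x
  summary(fit[is.na(dataMiss)])
})

With k = 1 or k = 2 the imputed values vary from cell to cell, unlike the column-mean collapse seen with the default k.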

Related

Is there an R function to order a matrix using two or more columns?

I simulated a dataset with 3 columns:
aa <- rep(seq(0,1,0.05), seq(21,1,-1))
bb <- NA
for(i in length(seq(0,1,0.05)):1){
  bb <- c(bb, rep(seq(0,1,0.05), len = i))
}
bb <- bb[-1]
cc <- 1-(aa+bb)
Dominance <- cbind(aa,bb,cc)
Then, in my problem, a row containing (0,0,1) is equivalent to a row containing (1,0,0) or (0,1,0).
So I use the code below to sort the values within each row:
for(i in 1:dim(Dominance)[1]){
  Dominance[i,] <- Dominance[i, order(Dominance[i,], decreasing = FALSE)]
}
The problem is that when I then try to order the rows using the code below, the first column is ordered correctly, but not the second column.
Dominance[order(Dominance[,1],Dominance[,2],Dominance[,3]),]
I got this as a result
[1,] 0.00 0.00 1.00
[2,] 0.00 0.00 1.00
[3,] 0.00 0.00 1.00
[4,] 0.00 0.05 0.95
...
[59,] 0.00 0.50 0.50
[60,] 0.00 0.50 0.50
[61,] 0.05 0.35 0.60
[62,] 0.05 0.35 0.60
[63,] 0.05 0.05 0.90
[64,] 0.05 0.05 0.90
The problem starts at row 61: the first column holds 0.05 and the second column 0.35, but row 63 has the same value in the first column (0.05) while the second column holds a smaller value than 0.35.
Any ideas?
I have tried two other functions, but they gave the same results.
With the tidyverse approach, this is as simple as:
library(tidyverse)
Dominance %>%
  as_tibble() %>%
  arrange(aa, bb) %>%
  as.matrix()
Hope it helps!
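One caveat worth checking (an assumption on my part, since it depends on how the values were generated): cc <- 1-(aa+bb) can carry floating-point noise, e.g. 1-(0.35+0.6) prints as 0.05 but is slightly smaller than the literal 0.05 produced by seq(), which would explain why rows 61-62 sort ahead of rows 63-64. Rounding before ordering sidesteps this:

# Order on values rounded to the printed precision, so tiny
# floating-point differences cannot perturb the sort
D <- round(Dominance, 2)
Dominance[order(D[, 1], D[, 2], D[, 3]), ]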

R Function to get Confidence Interval of Difference Between Means

I am trying to find a function that allows me to easily get the confidence interval of the difference between two means.
I am pretty sure t.test has this functionality, but I haven't been able to make it work. [screenshot of the attempted t.test call omitted]
This is the dataset I am using
Indoor Outdoor
1 0.07 0.29
2 0.08 0.68
3 0.09 0.47
4 0.12 0.54
5 0.12 0.97
6 0.12 0.35
7 0.13 0.49
8 0.14 0.84
9 0.15 0.86
10 0.15 0.28
11 0.17 0.32
12 0.17 0.32
13 0.18 1.55
14 0.18 0.66
15 0.18 0.29
16 0.18 0.21
17 0.19 1.02
18 0.20 1.59
19 0.22 0.90
20 0.22 0.52
21 0.23 0.12
22 0.23 0.54
23 0.25 0.88
24 0.26 0.49
25 0.28 1.24
26 0.28 0.48
27 0.29 0.27
28 0.34 0.37
29 0.39 1.26
30 0.40 0.70
31 0.45 0.76
32 0.54 0.99
33 0.62 0.36
and I have been trying to use the t.test function that was installed from
install.packages("ggpubr")
I am pretty new to R, so sorry if there is a simple answer to this question. I have searched around quite a bit and haven't been able to find anything that I am looking for.
Note: The output I am looking for is Between -1.224 and 0.376
Edit:
The CI of the difference between means I am looking for is for the case where a random 34th data point is added to the table, by picking a random value from the Indoor column and a random value from the Outdoor column and duplicating it. Running the t.test will output the correct CI for the difference of means for the given sample size of 33.
How can I go about doing this pretending the sample size is 34?
There's probably something more convenient in the standard library, but it's pretty easy to calculate. Given your df variable, we can just do:
# calculate mean of difference
d_mu <- mean(df$Indoor) - mean(df$Outdoor)
# calculate SD of difference
d_sd <- sqrt(var(df$Indoor) + var(df$Outdoor))
# calculate 95% CI of this
d_mu + d_sd * qt(c(0.025, 0.975), nrow(df)*2)
giving me: -1.2246 0.3767
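For what it's worth, the closest ready-made call in the standard library (assuming your data frame is named df) is t.test itself, though note that its conf.int is the usual CI for the difference between the group means, which is narrower than the interval above:

# Welch two-sample t-test; conf.int holds the 95% CI for the
# difference between the Indoor and Outdoor means
tt <- t.test(df$Indoor, df$Outdoor)
tt$conf.int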
Mostly for @AkselA: I often find it helpful to check my work by sampling simpler distributions; in this case I'd do something like:
a <- mean(df$Indoor) + sd(df$Indoor) * rt(1000000, nrow(df)-1)
b <- mean(df$Outdoor) + sd(df$Outdoor) * rt(1000000, nrow(df)-1)
quantile(a - b, c(0.025, 0.975))
which gives me answers much closer to the CI I gave in the comment
Even though I always find the approach of manually calculating the results, as shown by @Sam Mason, the most insightful, there are some who want a shortcut. And sometimes it's also OK to be lazy :)
So among the different ways to calculate CIs, this is imho the most comfortable:
DescTools::MeanDiffCI(Indoor, Outdoor)
Here's a reprex:
library(ggplot2)  # for the diamonds dataset
IV <- diamonds$price
DV <- rnorm(length(IV), mean = mean(IV), sd = sd(IV))
DescTools::MeanDiffCI(IV, DV)
gives
meandiff lwr.ci upr.ci
-18.94825 -66.51845 28.62195
This is calculated with 999 bootstrap samples by default. If you want more (say 1000), just set the argument R:
DescTools::MeanDiffCI(IV, DV, R = 1000)
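Applied to the question's data (assuming it sits in a data frame named df), the call would be:

# 95% CI for the difference between the Indoor and Outdoor means
DescTools::MeanDiffCI(df$Indoor, df$Outdoor)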

How to find and replace the min value in a dataframe with text in R

I have a data frame with 20 columns, and I would like to identify the minimum value in each column and replace it with text such as "min". Appreciate any help.
Sample data:
a b c
-0.05 0.31 0.62
0.78 0.25 -0.01
0.68 0.33 -0.04
-0.01 0.30 0.56
0.55 0.28 -0.03
Desired output
a b c
min 0.31 0.62
0.78 min -0.01
0.68 0.33 min
-0.01 0.30 0.56
0.55 0.28 -0.03
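For anyone who wants to reproduce the answers below, the sample data can be entered as a data frame (the name df is assumed, matching the answers):

df <- data.frame(
  a = c(-0.05, 0.78, 0.68, -0.01, 0.55),
  b = c(0.31, 0.25, 0.33, 0.30, 0.28),
  c = c(0.62, -0.01, -0.04, 0.56, -0.03)
)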
You can apply a function to each column that replaces the minimum value with a string. This returns a matrix, which could be converted into a data frame if desired. As IceCreamToucan pointed out, all values will be of type character, since a matrix can only hold a single type:
apply(df, 2, function(x) {
x[x == min(x)] <- 'min'
return(x)
})
a b c
[1,] "min" "0.31" "0.62"
[2,] "0.78" "min" "-0.01"
[3,] "0.68" "0.33" "min"
[4,] "-0.01" "0.3" "0.56"
[5,] "0.55" "0.28" "-0.03"
You can use the method below, but know that this converts all your columns to character, since vectors must have elements which all have the same type.
library(dplyr)
df %>%
mutate_all(~ replace(.x, which.min(.x), 'min'))
# a b c
# 1 min 0.31 0.62
# 2 0.78 min -0.01
# 3 0.68 0.33 min
# 4 -0.01 0.3 0.56
# 5 0.55 0.28 -0.03
apply(df, MARGIN = 2, FUN = function(x) {x[which.min(x)] <- 'min'; return(x)})

Inter-scale correlation matrix wide format to long format (in R)

In one of my data files, correlation matrices are stored in long format, where the first three columns represent the variables and the last three columns represent the inter-scale correlations. Frustratingly, I have to admit, the rows may represent different sub-dimensions of a particular construct (e.g., the 0s in column 1).
The data file (which is used for a meta-analysis) was constructed by a PhD student who "decomposed" all relevant correlation matrices by hand (so the wide-format matrix was not generated by another piece of code).
Example
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 0 4 1 -0.32 0.25 -0.08
[2,] 0 4 2 -0.32 0.15 -0.04
[3,] 0 4 3 -0.32 -0.04 0.27
[4,] 0 4 4 -0.32 -0.14 -0.16
[5,] 0 4 1 -0.01 0.33 -0.08
[6,] 0 4 2 -0.01 0.36 -0.04
[7,] 0 4 3 -0.01 0.04 0.27
[8,] 0 4 4 -0.01 -0.03 -0.16
My question is how to restore the inter-scale correlation matrix, such that:
c1_0a c1_0b c2_4 c3_1 c3_2 c3_3 c3_4
c1_0a
c1_0b
c2_4 -0.32 -0.01
c3_1 0.25 0.33 -0.08
c3_2 0.15 0.36 -0.04
c3_3 -0.04 0.04 0.27
c3_4 -0.14 -0.03 -0.16
I suppose this can be done with the reshape2 package, but I am unsure how to proceed. Your help is greatly appreciated!
What I have tried so far (rather clumsy):
1. I identified the unique combinations of columns 1, 2, and 4, which I transposed into correlation matrix #1.
2. Similarly, I identified the unique combinations of columns 1, 3, and 5, which I transposed into correlation matrix #2.
3. Similarly, I identified the unique combinations of columns 2, 3, and 6, which I transposed into correlation matrix #3.
4. Finally, I bound the three matrices together, which gives an incorrect solution.
The problem here is that matrix #1 has different dimensions [as there are two different correlations reported for the relationship between the first construct (0) and the second construct (4)] than matrix #3 [as there are eight different correlations reported for the second construct (4) and the third construct (1-4)].
I tried both the melt and reshape2 packages to overcome these problems (and to come up with a more elegant solution), but I did not find any clues about how to set up the functions in these packages to reconstruct the correlation matrix.
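For what it's worth, here is a hedged sketch for the example above using direct indexing rather than reshape2. It assumes the 8x6 example is stored in a numeric matrix m, with rows 1-4 belonging to sub-dimension "a" of the first construct and rows 5-8 to sub-dimension "b":

# Build the 7x7 correlation matrix (lower triangle) by indexing
# the long-format rows directly
labs <- c("c1_0a", "c1_0b", "c2_4", paste0("c3_", 1:4))
R <- matrix(NA_real_, 7, 7, dimnames = list(labs, labs))
R["c2_4", "c1_0a"] <- m[1, 4]                      # construct 1a vs 2
R["c2_4", "c1_0b"] <- m[5, 4]                      # construct 1b vs 2
R[paste0("c3_", m[1:4, 3]), "c1_0a"] <- m[1:4, 5]  # construct 1a vs 3
R[paste0("c3_", m[5:8, 3]), "c1_0b"] <- m[5:8, 5]  # construct 1b vs 3
R[paste0("c3_", m[1:4, 3]), "c2_4"]  <- m[1:4, 6]  # construct 2 vs 3
R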

Using corr.test() and cor() to get the Pearson correlation and p-values in R

I am trying to get a matrix of p-values and a matrix of correlations.
For example, I used the code below to create the data:
library(psych)
xx <- matrix(rnorm(16), 4, 4)
xx
[,1] [,2] [,3] [,4]
[1,] 1.2349830 -0.23417979 -1.0380279 0.2119736
[2,] 0.9540995 0.05405983 0.4438048 1.8375497
[3,] 0.1583041 -1.29936451 -0.6030342 -0.4052208
[4,] 0.4524374 1.03351913 1.3253830 -0.4829464
Then I tried to run
corr.test(xx)
but I got the error below:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA-NA’
But I never set row or column names, so they should be NULL. I checked for duplicate names with the code below:
> any(duplicated(colnames(xx)))
[1] FALSE
> any(duplicated(rownames(xx)))
[1] FALSE
However, when I used the other correlation function, cor(), to get the correlation matrix of xx:
co<-cor(xx)
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.24090246 -0.28707770 0.58664566
[2,] 0.2409025 1.00000000 0.79523833 0.06658293
[3,] -0.2870777 0.79523833 1.00000000 0.04730974
[4,] 0.5866457 0.06658293 0.04730974 1.00000000
Now it works fine, so I thought maybe I could use corr.p() to get the p-values quickly. But I got the same error again:
corr.p(co,16)
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘NA-NA’
I am confused: I never set the rownames or colnames, and I checked for duplicate row.names as well, so why do I still get the error? Did I miss an important point?
The error occurs because xx has no column names: corr.test() builds the row names of its output (e.g. V1-V2) from them, so with NULL names they all become NA-NA and collide. Converting the matrix to a data frame assigns the default names V1 to V4, after which corr.test() works:
library(psych)
set.seed(123)
xx <- matrix(rnorm(16), 4, 4)
xx <- as.data.frame(xx)
out <- corr.test(xx)
print(out, short = FALSE)
# Call:corr.test(x = xx)
# Correlation matrix
# V1 V2 V3 V4
# V1 1.00 -0.02 0.96 -0.49
# V2 -0.02 1.00 -0.27 -0.75
# V3 0.96 -0.27 1.00 -0.22
# V4 -0.49 -0.75 -0.22 1.00
# Sample Size
# [1] 4
# Probability values (Entries above the diagonal are adjusted for multiple tests.)
# V1 V2 V3 V4
# V1 0.00 1.00 0.25 1
# V2 0.98 0.00 1.00 1
# V3 0.04 0.73 0.00 1
# V4 0.51 0.25 0.78 0
# To see confidence intervals of the correlations, print with the short=FALSE option
# Confidence intervals based upon normal theory. To get bootstrapped values, try cor.ci
# lower r upper p
# V1-V2 -0.96 -0.02 0.96 0.98
# V1-V3 -0.04 0.96 1.00 0.04
# V1-V4 -0.99 -0.49 0.89 0.51
# V2-V3 -0.98 -0.27 0.93 0.73
# V2-V4 -0.99 -0.75 0.75 0.25
# V3-V4 -0.97 -0.22 0.94 0.78
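Equivalently (an assumption on my part, but consistent with the diagnosis above), keeping the data as a matrix and simply giving it column names should also satisfy corr.test():

set.seed(123)
yy <- matrix(rnorm(16), 4, 4)
colnames(yy) <- paste0("V", 1:4)  # the names corr.test() needs for its output
corr.test(yy)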
