PCA: 'x' must be numeric in R

I have a dataset called df that looks like this:
head(df)
  ratio  P  T  H   S p1 p2 PM10 CO2 B  G Month Year
1  0.50 89 -7 98 133  0 40   50  30 3 20     1 2019
2  0.50 55  4 43  43 30 30   40  32 1 15     1 2019
3  0.85 75  4 63  43 30 30   42  32 1 18     1 2019
I would like to do a principal component analysis to reduce the number of variables for a regression analysis. I used this code:
library(factoextra)
df.pca <- prcomp(df, scale = TRUE)
But I got this error message, and because of it I was not able to continue:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
What am I doing wrong?

prcomp() assumes that every column in the object you pass to it should be used in the analysis. You'll need to drop any non-numeric columns, as well as any numeric columns (such as ID variables) that should not be part of the PCA.
library(factoextra)
# Example data
df <- data.frame(
  x = letters,
  y1 = rbinom(26, 1, 0.5),
  y2 = rnorm(26),
  y3 = 1:26,
  id = 1:26
)
# Reproduce your error
prcomp(df)
#> Error in colMeans(x, na.rm = TRUE): 'x' must be numeric
# Remove all non-numeric columns
df_nums <- df[sapply(df, is.numeric)]
# Conduct PCA - works but ID column is in there!
prcomp(df_nums, scale = TRUE)
#> Standard deviations (1, .., p=4):
#> [1] 1.445005e+00 1.039765e+00 9.115092e-01 1.333315e-16
#>
#> Rotation (n x k) = (4 x 4):
#>            PC1        PC2        PC3           PC4
#> y1  0.27215111 -0.5512026 -0.7887391  0.000000e+00
#> y2  0.07384194 -0.8052981  0.5882536  4.715914e-16
#> y3 -0.67841033 -0.1543868 -0.1261909 -7.071068e-01
#> id -0.67841033 -0.1543868 -0.1261909  7.071068e-01
# Remove ID
df_nums$id <- NULL
# Conduct PCA without ID - success!
prcomp(df_nums, scale = TRUE)
#> Standard deviations (1, .., p=3):
#> [1] 1.1253120 0.9854030 0.8733006
#>
#> Rotation (n x k) = (3 x 3):
#>           PC1         PC2        PC3
#> y1 -0.6856024  0.05340108 -0.7260149
#> y2 -0.4219813 -0.84181344  0.3365738
#> y3  0.5931957 -0.53712052 -0.5996836
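Applied to the data in the question, a minimal sketch might look like this (assuming df is the frame shown above, and assuming Month and Year are bookkeeping columns that should stay out of the PCA):
# Keep only numeric columns, then drop the time identifiers
# (assumption: Month and Year should not enter the PCA)
df_nums <- df[sapply(df, is.numeric)]
df_nums$Month <- NULL
df_nums$Year <- NULL
df.pca <- prcomp(df_nums, scale = TRUE)
# factoextra helpers for inspecting the result
fviz_eig(df.pca)        # scree plot of the variance explained by each component
get_eigenvalue(df.pca)  # eigenvalue / variance table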

Related

t-test through all combinations of all factors all levels

I have a dataframe with the following structure:
> str(data_l)
'data.frame': 800 obs. of 5 variables:
$ Participant: int 1 2 3 4 5 6 7 8 9 10 ...
$ Temperature: Factor w/ 4 levels "35","37","39",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Region : Factor w/ 5 levels "Eyes","Front",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Time : Factor w/ 5 levels "0","15","30",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Rating : num 5 5 5 4 5 5 5 5 5 5 ...
I want to run a one-sample t-test for each combination of all factor levels, for a total of 4*5*5 = 100 t-tests, with Rating as the dependent variable (y).
I am stuck at looping through the combinations and performing a t-test at each combo.
I tried splitting the dataframe by the factors and then lapply()-ing t.test() over the list, but to no avail.
Does anyone have a better approach? Cheers!
Edit
My ultimate intention is to calculate a confidence interval for each combination of factor levels. For instance, I was able to do this:
subset1 <- data_l$Rating[data_l$Temperature == 35 & data_l$Region == "Front" & data_l$Time == 0]
Then,
t.test(subset1)$conf.int
But the problem is I will have to do this 100 times.
Edit 2
I am recreating the dataframe.
Temperature <- rep(seq(35, 41, 2), 10)
Region <- rep(c("Front", "Back", "Eyes", "Left", "Right"), 8)
Time <- rep(seq(0, 60, 15), 8)
Rating <- sample(1:5, 40, replace = TRUE)
data_l <- data.frame(Region = factor(Region), Temperature = factor(Temperature),
                     Time = factor(Time), Rating = as.numeric(Rating))
Two things.
Can this be done? Certainly. Should it? Many of your combinations may have insufficient data to find a reasonable confidence interval. While your data sample is certainly reduced and simplified, I have no assurance that your factor combinations will be sufficiently filled.
table(sapply(split(data_l$Rating, data_l[,c("Temperature","Region","Time")]), length))
# 0 2
# 80 20
(There are 80 "empty" combinations of your factor levels.)
Let's try this:
outs <- aggregate(data_l$Rating, data_l[, c("Temperature", "Region", "Time")],
                  function(x) if (length(unique(x)) > 1) t.test(x)$conf.int else c(NA, NA))
nrow(outs)
# [1] 20
head(outs)
# Temperature Region Time x.1 x.2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
Realize that this is not five columns; the fourth is really a matrix embedded in a frame column:
head(outs$x)
# [,1] [,2]
# [1,] NA NA
# [2,] -9.706205 15.706205
# [3,] -2.853102 9.853102
# [4,] -15.559307 22.559307
# [5,] -15.559307 22.559307
# [6,] -4.853102 7.853102
It's easy enough to extract:
outs$conf1 <- outs$x[,1]
outs$conf2 <- outs$x[,2]
outs$x <- NULL
head(outs)
# Temperature Region Time conf1 conf2
# 1 35 Front 0 NA NA
# 2 37 Front 0 -9.706205 15.706205
# 3 39 Front 0 -2.853102 9.853102
# 4 41 Front 0 -15.559307 22.559307
# 5 35 Back 15 -15.559307 22.559307
# 6 37 Back 15 -4.853102 7.853102
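As an aside (a base-R idiom I'm adding here, not part of the original answer), the matrix column can also be flattened in a single call with do.call(data.frame, ...), applied to the aggregate() result before the column is split by hand:
# Rerun the aggregate() call from above and flatten its matrix column in one step
outs0 <- aggregate(data_l$Rating, data_l[, c("Temperature", "Region", "Time")],
                   function(x) if (length(unique(x)) > 1) t.test(x)$conf.int else c(NA, NA))
outs_flat <- do.call(data.frame, outs0)
head(outs_flat)  # columns Temperature, Region, Time, x.1, x.2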
(If you're wondering why I have a conditional on length(unique(x)) > 1, then see what happens without it:
aggregate(data_l$Rating, data_l[, c("Temperature", "Region", "Time")],
          function(x) t.test(x)$conf.int)
# Error in t.test.default(x) : data are essentially constant
This is because there are combinations with empty data. You'll likely see something similar with not-empty but still invariant data.)
I am stuck at looping through the combinations, and performing t-test at each combo.
I'm not sure if this is what you wanted.
N <- 800
df <- data.frame(Participant = 1:N,
                 Temperature = gl(4, 200),
                 Region = sample(1:5, 800, TRUE),
                 Time = sample(1:5, 800, TRUE),
                 Rating = sample(1:5, 800, TRUE))
head(df)
t_test <- function(data, y, x){
  x <- eval(substitute(x), data)
  y <- eval(substitute(y), data)
  comb <- combn(levels(x), m = 2)  # this gives all pair-wise combinations of levels
  n <- dim(comb)[2]
  t <- vector(n, mode = "list")
  for (i in 1:n) {
    xlevs <- comb[, i]
    DATA <- subset(data, subset = x %in% xlevs)
    x2 <- factor(x, levels = xlevs)
    tt <- t.test(y ~ x2, data = DATA)
    t[[i]] <- tt
    names(t)[i] <- toString(xlevs)
  }
  t
}
T.test <- t_test(df, Rating, Temperature)
T.test[1]
$`1, 2`
Welch Two Sample t-test
data: y by x2
t = -1.0271, df = 396.87, p-value = 0.305
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.4079762 0.1279762
sample estimates:
mean in group 1 mean in group 2
2.85 2.99
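As a follow-up (my own addition, using the T.test list created above), the confidence intervals and p-values of all pairwise tests can be collected into one table:
# One row per pairwise comparison: lower CI, upper CI, p-value
t(sapply(T.test, function(tt) c(lower = tt$conf.int[1],
                                upper = tt$conf.int[2],
                                p.value = tt$p.value)))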

Bayes Factor for linear mixed models in R

I am trying to compute the Bayes factor (BF) for one of the fixed effects with the BayesFactor package in R.
The data has the following structure:
rating is the dependent variable
cond is the independent variable with 3 levels ("A", "B", "C")
C1 is a contrast code derived from cond that opposes "A" (coded -0.50) to "B" and "C" (both coded +0.25)
C2 is a contrast code derived from cond that opposes "B" (coded -0.50) to "C" (coded +0.5; and "A" is coded 0)
judge and face are random factors such that face is crossed with judge but nested within cond (and thus also nested within C1 and C2)
DT <- fread("http://matschmitz.github.io/dataLMM.csv")
DT[, judge := factor(judge)]
DT[, face := factor(face)]
# > DT
# judge face cond C1 C2 rating
# 1: 66 13 A -0.50 0.0 1
# 2: 20 13 A -0.50 0.0 4
# 3: 22 13 A -0.50 0.0 7
# 4: 69 13 A -0.50 0.0 1
# 5: 7 13 A -0.50 0.0 3
# ---
# 4616: 45 62 C 0.25 0.5 2
# 4617: 30 62 C 0.25 0.5 6
# 4618: 18 62 C 0.25 0.5 4
# 4619: 40 62 C 0.25 0.5 3
# 4620: 65 62 C 0.25 0.5 1
Ideally I would like to test the "full" model as in:
library(lmerTest)
lmer(rating ~ C1 + C2 + (1 + C1 + C2|judge) + (1|face), data = DT)
and compute the BF for C1.
I managed to compute the BF for C1 but with random intercepts only:
library(BayesFactor)
BF1 <- lmBF(rating ~ C1 + C2 + judge + face, whichRandom = c("judge", "face"), data = DT)
BF0 <- lmBF(rating ~ C2 + judge + face, whichRandom = c("judge", "face"), data = DT)
BF10 <- BF1 / BF0
# > BF10
# Bayes factor analysis
# --------------
# [1] C1 + C2 + judge + face : 0.4319222 ±15.49%
#
# Against denominator:
# rating ~ C2 + judge + face
# ---
# Bayes factor type: BFlinearModel, JZS
I tried this solution to include the random slopes, without success:
BF1 <- lmBF(rating ~ C1 + C2 + judge + face + C1:judge + C2:judge,
whichRandom = c("judge", "face", "C1:judge", "C2:judge"), data = DT)
# Some NAs were removed from sampling results: 10000 in total.
I would also need to include (if possible) the correlation between the random intercepts and slopes for judge.
Please feel free to use any other package (e.g., rstan, bridgesampling) in your answer.
Some additional questions:
Do I need to perform any transformation on the BF10, or can I interpret it as is?
What are the default priors?
The covariates have to be factors. In your case, not just "judge" and "face": "C1" and "C2" need to be factors as well.
DT$C1 = factor(DT$C1)
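A minimal sketch following that suggestion (my own continuation, reusing the random-slope lmBF() call from the question), assuming both contrast codes are converted:
# Convert both contrast codes to factors (as suggested above); DT is a data.table
DT[, C1 := factor(C1)]
DT[, C2 := factor(C2)]
# Re-run the random-slope model from the question with the converted columns
BF1 <- lmBF(rating ~ C1 + C2 + judge + face + C1:judge + C2:judge,
            whichRandom = c("judge", "face", "C1:judge", "C2:judge"), data = DT)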

polynomial curve fitting for future frames

Originally I have the data in this form:
m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
Using the following code I read it in and convert it:
input <- file('stdin', 'r')
mn <- read.table(input, nrows = 1, as.is = TRUE)
DF <- read.table(input, skip = 0)
m <- mn[[1]]
n <- mn[[2]]
x1 <- DF[[1]]
y1 <- DF[[2]]
x2 <- DF[[3]]
y2 <- DF[[4]]
fit1 <- lm(x1 ~ poly(y1, 3, raw = TRUE))
fit2 <- lm(x2 ~ poly(y2, 3, raw = TRUE))
m = the length of the current data
n = number of points in the future to be predicted
x1 = 1 5 9 13
x2 = 2 6 10 14
I would like to predict the values of x1, y1, x2, and y2 for n points after the given values.
I tried to fit with lm, but I am not sure how to proceed when all of the future data points are missing; just getting the coefficients of one variable in terms of the other would not be sufficient, as all of them need to be predicted.
In order to get that to run without error, one needs to use skip = 1 on the second read.table():
mn <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", nrows = 1, as.is = TRUE)
DF <- read.table(text="m n
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16", skip = 1)
m <- mn[[1]]
n <- mn[[2]]
x1 <- DF[[1]]
y1 <- DF[[2]]
x2 <- DF[[3]]
y2 <- DF[[4]]
fit1 <- lm(x1 ~ poly(y1, 3, raw = TRUE))
fit2 <- lm(x2 ~ poly(y2, 3, raw = TRUE))
So those input data are exactly collinear, and you would NOT expect there to be any useful information in either the quadratic or cubic terms. That is in fact recognized by the lm machinery:
> fit1

Call:
lm(formula = x1 ~ poly(y1, 3, raw = TRUE))

Coefficients:
             (Intercept)  poly(y1, 3, raw = TRUE)1  poly(y1, 3, raw = TRUE)2
                      -1                         1                         0
poly(y1, 3, raw = TRUE)3
                       0
Generally, one should be using the data argument:
> fit3<-lm(x1 ~ poly(y1, 3, raw=TRUE), DF)
>
> fit4<-lm(x2 ~ poly(y2, 3, raw=TRUE), DF)
But in this case it doesn't seem to matter:
> predict(fit1, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit3, newdata = list(y1=20:23))
1 2 3 4
19 20 21 22
> predict(fit2, newdata = list(y1=25:28))
1 2 3 4
3 7 11 15
The way to get predictions is to supply a newdata argument that can be coerced into a data frame. Using a list value whose items all have the same length (in this case a single item) will succeed.
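To predict the n future points the question asks about, a sketch along these lines should work (assumptions on my part: y1 and y2 keep increasing on the same evenly spaced grid as the observed values, and n_future stands in for n from the input header, which in the reproduction above was read as the literal text "m n" rather than as numbers):
# Extend each predictor by n_future steps of its observed spacing, then predict
n_future <- 4
new_y1 <- max(y1) + seq_len(n_future) * mean(diff(y1))
new_y2 <- max(y2) + seq_len(n_future) * mean(diff(y2))
predict(fit1, newdata = list(y1 = new_y1))
predict(fit2, newdata = list(y2 = new_y2))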

Calculating distance between 2 elements of a data frame

I have a data frame that looks like this:
library(dplyr)
size_df <- tibble(size_chr = c("XS", "S", "M", "L", "XL", "1XL", "2XL", "3XL", "4XL", "5XL", "6XL"),
                  size_min = c(0, 36, 39, 42, 45, 48, 52, 56, 60, 64, 66),
                  size_max = c(36, 39, 42, 45, 48, 52, 56, 60, 64, 66, 70))
For any given number less than 70, I want to find the two sizes that it lies between, and the distance to each of them (normalised to between 0 and 1).
For example:
input <- 37.2
# S 0.6
# M 0.4
input <- 48
# XL 1
input <- 68
# 5XL 0.5
# 6XL 0.5
This is the perfect case for findInterval(). We'll create a vector of the breaks between categories and use those to calculate scaling factors.
size_breaks <- c(size_df[["size_min"]], max(size_df[["size_max"]]))
size_breaks
# [1] 0 36 39 42 45 48 52 56 60 64 66 70
size_spans <- diff(size_breaks)
size_scales <- 1 / size_spans
size_scales
# [1] 0.02777778 0.33333333 0.33333333 0.33333333 0.33333333 0.25000000 0.25000000
# [8] 0.25000000 0.25000000 0.50000000 0.25000000
findInterval() will give us the index of the lower bound. The upper bound is just that index + 1.
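For instance (my own illustrative check, not part of the original answer):
# 37.2 falls in the second interval, 36-39, i.e. "S"
findInterval(37.2, size_breaks)
# [1] 2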
neighbor_distances <- function(x) {
  lower <- findInterval(x, size_breaks)
  neighbors <- c(lower, lower + 1)
  distances <- abs(x - size_breaks[neighbors]) * size_scales[lower]
  tibble(
    size_chr = size_df[["size_chr"]][neighbors],
    distance = distances
  )
}
It works well for your first example.
neighbor_distances(37.2)
# # A tibble: 2 x 2
# size_chr distance
# <chr> <dbl>
# 1 S 0.4
# 2 M 0.600
The second example gives two rows instead of just one, but that can be handled with extra logic in the function. I left that logic out to keep things simple.
neighbor_distances(48)
# # A tibble: 2 x 2
# size_chr distance
# <chr> <dbl>
# 1 1XL 0
# 2 2XL 1
It gives a different answer for your third example, but I don't know why you expect a number to be compared to a size category smaller than the lower bound.
neighbor_distances(68)
# # A tibble: 2 x 2
# size_chr distance
# <chr> <dbl>
# 1 6XL 0.5
# 2 NA 0.5
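For completeness, here is one possible version of the "extra logic" mentioned above (my own sketch, following the question's expectation that an exact boundary such as 48 maps entirely to the size whose upper bound it hits):
neighbor_distances2 <- function(x) {
  hit <- which(size_df[["size_max"]] == x)
  if (length(hit) == 1) {
    # exact boundary: report the single size whose upper bound equals x
    return(tibble(size_chr = size_df[["size_chr"]][hit], distance = 1))
  }
  neighbor_distances(x)
}
neighbor_distances2(48)  # returns a one-row tibble: XL, distance 1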
A base R alternative, using the first example's input (37.2):
INDS = c(max(1, tail(which(size_df$size_min < input), 1)),
         min(NROW(size_df), 1 + head(which(size_df$size_max > input), 1)))
size_df$size_chr[INDS]
# [1] "S" "M"
DIST = c(abs(size_df$size_min[INDS[1]] - input),
         abs(size_df$size_max[INDS[2]] - input))
DIST / sum(DIST)
# [1] 0.2 0.8
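The same approach can be wrapped into a function for reuse (my own addition, keeping the logic above unchanged):
size_between <- function(input) {
  INDS <- c(max(1, tail(which(size_df$size_min < input), 1)),
            min(NROW(size_df), 1 + head(which(size_df$size_max > input), 1)))
  DIST <- c(abs(size_df$size_min[INDS[1]] - input),
            abs(size_df$size_max[INDS[2]] - input))
  setNames(DIST / sum(DIST), size_df$size_chr[INDS])
}
size_between(37.2)
#   S   M
# 0.2 0.8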
