I am quite new to R and I am trying to run a PCA for an incomplete data set with the code:
res.comp <- imputePCA(questionaire_results_PCA, ncp = nb$ncp)
but R tells me:
Error: Must use a vector in [, not an object of class matrix.
Run rlang::last_error() to see where the error occurred.
So I run:
rlang::last_error()
R says:
1. missMDA::imputePCA(questionaire_results_PCA, ncp = nb$ncp)
4. tibble:::`[.tbl_df`(X, !is.na(X))
5. tibble:::check_names_df(i, x)
Run `rlang::last_trace()` to see the full context
So I run:
rlang::last_trace()
And R says:
Must use a vector in `[`, not an object of class matrix.
Backtrace:
█
1. └─missMDA::imputePCA(questionaire_results_PCA, ncp = nb$ncp)
2. ├─base::mean((res.impute$fittedX[!is.na(X)] - X[!is.na(X)])^2)
3. ├─X[!is.na(X)]
4. └─tibble:::`[.tbl_df`(X, !is.na(X))
5. └─tibble:::check_names_df(i, x)
Does anyone know what this means and how I could get it to work?
I have run:
dput(head(questionaire_results_PCA))
and I got:
structure(list(Active = c(6, 6, 5, 7, 5, 6), `Aggressive to people` = c(NA,
4, NA, 2, NA, 1), Anxious = c(NA, 4, NA, 3, NA, 2), Calm = c(NA,
5, NA, 5, NA, 6), Cooperative = c(7, 6, 7, 6, 6, 6), Curious = c(7,
2, 7, 7, 7, 6), Depressed = c(1, 3, 1, 1, 1, 1), Eccentric = c(1,
3, 1, 4, 1, 4), Excitable = c(5, 2, 5, 5, 4, 4), `Fearful of people` = c(1,
2, 1, 2, 1, 1), `friendly of people` = c(5, 6, 7, 7, 7, 7), Insecure = c(2,
5, 2, 3, 2, 2), Playful = c(4, 6, 2, 5, 6, 6), `Self assured` = c(7,
6, 7, 5, 6, 6), Smart = c(6, 2, 7, 5, 7, 3), Solitary = c(4,
4, 3, 4, 3, 2), Tense = c(1, 2, 1, 3, 1, 2), Timid = c(2, 2,
2, 2, 2, 2), Trusting = c(6, 6, 6, 6, 6, 6), Vigilant = c(7,
6, 5, 3, 5, 3), Vocal = c(2, 7, 1, 6, 1, 7)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I then ran the code:
dput(nb$ncp)
and got:
3L
Here's the answer in case anyone comes across the same issue. Using the data provided by OP:
class(questionaire_results_PCA)
[1] "tbl_df" "tbl" "data.frame"
imputePCA expects a plain data.frame or matrix and does not work with a tibble, so we need to convert the input to a matrix or data.frame first:
library(missMDA)
res.comp <- imputePCA(data.frame(questionaire_results_PCA), ncp = 2)
Error in eigen(crossprod(t(X), t(X)), symmetric = TRUE) :
infinite or missing values in 'x'
I get this error because this is only a subset of the data and some of the columns have zero standard deviation, so we work around that first by dropping those columns:
# na.rm = TRUE keeps columns that contain NAs but still vary
sel <- which(apply(questionaire_results_PCA, 2, sd, na.rm = TRUE) != 0)
# returns you a data.frame
res1 <- imputePCA(as.data.frame(questionaire_results_PCA[,sel]), ncp = 2)
# returns you a matrix
res2 <- imputePCA(as.matrix(questionaire_results_PCA[,sel]), ncp = 2)
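As a minimal base-R sketch of the column filter above, assuming a made-up toy data frame (toy, sds and keep are illustrative names, not from the original post): computing the standard deviation with na.rm = TRUE means a column with missing values is judged on its observed values only, so it is kept as long as those values vary.

```r
# Toy data (hypothetical): column b is constant, columns a and c contain NAs
toy <- data.frame(a = c(1, NA, 3), b = c(2, 2, 2), c = c(NA, 5, 6))

# Per-column standard deviation, ignoring missing values
sds <- vapply(toy, sd, numeric(1), na.rm = TRUE)

# Keep only the columns whose observed values actually vary
keep <- names(sds)[!is.na(sds) & sds > 0]
toy_varying <- toy[keep]
print(keep)
```

The filtered data frame can then be passed to imputePCA in the same way as in the calls above.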
Though this problem has been 'solved' many times, it turns out there's always another problem.
Without the print function it runs with no errors, but with it I get the following:
Error in .subset2(x, i) : recursive indexing failed at level 2
Which I'm taking to mean it doesn't like graphs being created in two layers of iteration? Changing the method to 'qplot(whatever:whatever)' has the exact same problem.
It's designed to print a graph for every pairing of the variables I'm looking at. There are too many of them to fit in a single picture, as the pairs function would produce, and I need to be able to see the actual variable names on the axes.
load("Transport_Survey.RData")
variables <- select(Transport, "InfOfReceievingWeather", "InfOfReceievingTraffic", "InfOfSeeingTraffic", "InfWeather.Ice", "InfWeather.Rain", "InfWeather.Wind", "InfWeather.Storm", "InfWeather.Snow", "InfWeather.Cold", "InfWeather.Warm", "InfWeather.DarkMorn", "InfWeather.DarkEve", "HomeParking", "WorkParking", "Disability", "Age", "CommuteFlexibility", "Gender", "PassionReduceCongest")
varnames <- list("InfOfReceivingWeather", "InfOfReceivingTraffic", "InfOfSeeingTraffic", "InfWeather.Ice", "InfWeather.Rain", "InfWeather.Wind", "InfWeather.Storm", "InfWeather.Snow", "InfWeather.Cold", "InfWeather.Warm", "InfWeather.DarkMorn", "InfWeather.DarkEve", "HomeParking", "WorkParking", "Disability", "Age", "CommuteFlexibility", "Gender", "PassionReduceCongest")
counterx = 1
countery = 1
for (a in variables) {
for (b in variables) {
print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]],
xlab=varnames[counterx], ylab=varnames[countery]))+
geom_point())
countery = countery+1
counterx = counterx+1
}
}
#variables2 <- select(Transport, one_of(InfOfReceivingWeather, InfOfReceivingTraffic, InfOfSeeingTraffic, InfWeather.Ice, InfWeather.Rain, InfWeather.Wind, InfWeather.Storm, InfWeather.Snow, InfWeather.Cold, InfWeather.Warm, InfWeather.DarkMorn, InfWeather.DarkEve, HomeParking, WorkParking, Disability, Age, CommuteFlexibility, Gender, PassionReduceCongest))
Here is a mini-data frame for reference, sampled from the columns I'm using:
structure(list(InfOfReceievingWeather = c(1, 1, 1, 1, 4), InfOfReceievingTraffic = c(1,
1, 1, 1, 4), InfOfSeeingTraffic = c(1, 1, 1, 1, 4), InfWeather.Ice = c(3,
1, 3, 5, 5), InfWeather.Rain = c(1, 1, 2, 2, 4), InfWeather.Wind = c(1,
1, 2, 2, 4), InfWeather.Storm = c(1, 1, 1, 2, 5), InfWeather.Snow = c(1,
1, 2, 5, 5), InfWeather.Cold = c(1, 1, 1, 2, 5), InfWeather.Warm = c(1,
1, 1, 1, 3), InfWeather.DarkMorn = c(1, 1, 1, 1, 1), InfWeather.DarkEve = c(1,
1, 1, 1, 1), HomeParking = c(1, 1, 3, 1, 1), WorkParking = c(1,
4, 4, 5, 4), Disability = c(1, 1, 1, 1, 1), Age = c(19, 45, 35,
40, 58), CommuteFlexibility = c(2, 1, 5, 1, 2), Gender = c(2,
2, 2, 2, 1), PassionReduceCongest = c(0, 0, 2, 0, 2)), row.names = c(NA,
-5L), class = c("tbl_df", "tbl", "data.frame"))
The error comes from how a and b are assigned. When you write for (a in variables), a is not a column name: it is the vector of values contained in that column. So when your aes mapping calls variables[[a]], on the first iteration of the loop you are actually writing
variables[[c(1, 1, 1, 1, 4)]] instead of variables[["InfOfReceievingWeather"]], which cannot work.
To get around this issue, you can choose between:
for (a in variables) {
for (b in variables) {
print(ggplot(variables, mapping=aes(x=a, y=b)) ...
or
for (a in 1:ncol(variables)) {
for (b in 1:ncol(variables)) {
print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]])) ...
Although the first one seems simpler, I would prefer the second option because it lets you reuse a and b as column indices to extract the column names of variables for xlab and ylab.
In the end, writing something like this should work:
for (a in 1:ncol(variables)) {
for (b in 1:ncol(variables)) {
print(ggplot(variables, mapping=aes(x=variables[[a]], y=variables[[b]])) +
xlab(colnames(variables)[a])+
ylab(colnames(variables)[b])+
geom_point())
}
}
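A variant of the same idea, as a hedged sketch rather than the only way to do it: loop over the column names and look the columns up through ggplot2's .data pronoun, so the name is available for both the mapping and the axis labels. The three-column toy data frame, the plots list and the choice to skip a variable plotted against itself are assumptions for illustration.

```r
library(ggplot2)

# Toy subset of the survey columns (values taken from the sample in the question)
variables <- data.frame(Age = c(19, 45, 35, 40, 58),
                        WorkParking = c(1, 4, 4, 5, 4),
                        Gender = c(2, 2, 2, 2, 1))

plots <- list()
for (x_name in names(variables)) {
  for (y_name in names(variables)) {
    if (x_name == y_name) next  # assumption: self-pairs are not interesting
    p <- ggplot(variables, aes(x = .data[[x_name]], y = .data[[y_name]])) +
      geom_point() +
      labs(x = x_name, y = y_name)  # axis labels come straight from the names
    plots[[paste(x_name, y_name, sep = "_vs_")]] <- p
    print(p)
  }
}
```

Keeping the plots in a named list also makes it possible to save or revisit an individual pairing later.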
Does this answer your question?
I have a function that needs to be applied to every pairing of a cell from one column with a cell from another column.
I can do it with a loop, however, I'm looking to vectorize the process or make it faster. As for now, it would take me days to finish the process.
Ideally, it would be using tidyverse but any help would be appreciated.
My loop looks like this:
results <- data.frame(
pathSubject1 = as.character(),
pathSubject2 = as.character())
i <- 1 #Counter first loop
j <- 1 #Counter second loop
#Loop over subject 1
for (i in 1:dim(df)[1]) {#Start of first loop
#Loop over subject 2
for (j in 1:dim(df)[1]) {#Start of second loop
#calc my function for the subjects
tempPercentSync <- myFunc(df$subject1[i], df$subject2[j])
results <- rbind(
results,
data.frame(
pathSubject1 = df$value[i],
pathSubject2 = df$value[j],
syncData = nest(tempPercentSync)))
} #End second loop
} #End first loop
My example function:
myFunc <- function(x, y) {
temp <- dplyr::inner_join(
as.data.frame(x),
as.data.frame(y),
by = "Time")
out <- as.data.frame(summary(temp))
}
Example of my dataset using dput:
structure(list(value = c("data/ExportECG/101_1_1_0/F010.feather",
"data/ExportECG/101_1_1_0/F020.feather"), ID = c(101, 101), run = c(1,
1), timeComing = c(1, 1), part = c(0, 0), paradigm = c("F010",
"F020"), group = c(1, 1), subject1 = list(structure(list(Time = c(0,
0.5, 1, 1.5, 2, 2.5), subject1 = c(9.73940345482368, 9.08451907157601,
8.42963468832833, 7.77475030508065, 7.11986592183298, 7.24395122629289
)), .Names = c("Time", "subject1"), row.names = c(NA, 6L), class = "data.frame"),
structure(list(Time = c(0, 0.5, 1, 1.5, 2, 2.5), subject1 = c(58.3471156751544,
75.9103303197856, 83.014068283342, 89.7923167579699, 88.6748903116088,
84.7651306939912)), .Names = c("Time", "subject1"), row.names = c(NA,
6L), class = "data.frame")), subject2 = list(structure(list(
Time = c(0, 0.5, 1, 1.5, 2, 2.5), subject2 = c(77.7776200371528,
77.4139420609906, 74.9760822165258, 75.3915183650012, 77.5672070195079,
80.7418145918357)), .Names = c("Time", "subject2"), row.names = c(NA,
6L), class = "data.frame"), structure(list(Time = c(0, 0.5, 1,
1.5, 2, 2.5), subject2 = c(101.133666720578, 105.010792226714,
107.01541987713, 104.471173834529, 97.5910271952943, 92.9840354003295
)), .Names = c("Time", "subject2"), row.names = c(NA, 6L), class = "data.frame"))), .Names = c("value",
"ID", "run", "timeComing", "part", "paradigm", "group", "subject1",
"subject2"), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-2L))
The output should look like:
pathSubject1
1 data/ExportECG/101_1_1_0/F010.feather
2 data/ExportECG/101_1_1_0/F010.feather
3 data/ExportECG/101_1_1_0/F020.feather
4 data/ExportECG/101_1_1_0/F020.feather
pathSubject2
1 data/ExportECG/101_1_1_0/F010.feather
2 data/ExportECG/101_1_1_0/F020.feather
3 data/ExportECG/101_1_1_0/F010.feather
4 data/ExportECG/101_1_1_0/F020.feather
data
1 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 20, 5, 17, 14, 8, 11, 21, 6, 19, 16, 10, 13, 22, 7, 18, 15, 9, 12
2 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 21, 6, 17, 14, 8, 12, 22, 7, 19, 16, 10, 13, 20, 5, 18, 15, 9, 11
3 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 20, 5, 17, 14, 8, 11, 21, 7, 19, 16, 10, 13, 22, 6, 18, 15, 9, 12
4 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 21, 6, 17, 14, 8, 12, 22, 7, 19, 16, 10, 13, 20, 5, 18, 15, 9, 11
Thank you!
I think you're looking for lapply (or a related function).
What's probably taking the most time is the rbind, because at each step in your loops the entire object results gets slightly larger, which means it gets fully copied. With lapply, all results are calculated first, and only then do you combine them with dplyr::bind_rows.
What you get is this:
results <- dplyr::bind_rows(lapply(1:dim(df)[1], function(i) {
dplyr::bind_rows(lapply(1:dim(df)[1], function(j) {
data.frame(pathSubject1 = df$value[i],
pathSubject2 = df$value[j],
syncData = tidyr::nest(myFunc(df$subject1[[i]], df$subject2[[j]])))
}))
}))
Does that solve your problem?
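The same "compute everything first, combine once" idea can also be sketched in base R by enumerating the index pairs up front with expand.grid. This is only an illustration under assumptions: pair_fun is a hypothetical stand-in for myFunc (a plain merge on Time), and the toy df below is a shortened version of the example data.

```r
# Shortened toy version of the example data (values rounded)
df <- data.frame(value = c("F010.feather", "F020.feather"),
                 stringsAsFactors = FALSE)
df$subject1 <- list(data.frame(Time = c(0, 0.5, 1), subject1 = c(9.7, 9.1, 8.4)),
                    data.frame(Time = c(0, 0.5, 1), subject1 = c(58.3, 75.9, 83.0)))
df$subject2 <- list(data.frame(Time = c(0, 0.5, 1), subject2 = c(77.8, 77.4, 75.0)),
                    data.frame(Time = c(0, 0.5, 1), subject2 = c(101.1, 105.0, 107.0)))

# Hypothetical stand-in for myFunc: join the two series on Time
pair_fun <- function(x, y) merge(x, y, by = "Time")

# All (i, j) index pairs; j varies fastest, matching the nested loops
idx <- expand.grid(j = seq_len(nrow(df)), i = seq_len(nrow(df)))

# Build the path columns in one go, then attach the results as a list column
results <- data.frame(pathSubject1 = df$value[idx$i],
                      pathSubject2 = df$value[idx$j],
                      stringsAsFactors = FALSE)
results$syncData <- Map(function(i, j) pair_fun(df$subject1[[i]], df$subject2[[j]]),
                        idx$i, idx$j)
print(results[, c("pathSubject1", "pathSubject2")])
```

Because results is created once at its final size, nothing is copied repeatedly as it would be with rbind inside the loops.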
EDIT: speeding things up
I've edited the code to use bind_rows instead of rbind_list; it's supposed to be faster. Furthermore, if you use [[i]] instead of [i] in the call to myFunc, you can drop the as.data.frame(x) there (and the same for j/y).
Finally, you could optimize myFunc a bit by not assigning any intermediate results:
myFunc <- function(x, y) {
as.data.frame(summary(dplyr::inner_join(x, y, by = "Time")))
}
But my gut feeling says these will be small differences. To gain more speedup we need to reduce the actual computations, and then it matters what your actual data is, and what you need for your results.
Some observations, based on your example:
Do we need separate data.frames? We compare all values in df$subject1 with those in df$subject2. In the example, first making one large data.frame for subject1 and then another for subject2, with an extra label if needed, would speed up the join.
Why a join? Right now the summary of the join gives only information that we could have gotten without a join as well.
We join on Time, but in the example the timestamps for subject1 and subject2 are identical. A check whether they are the same, followed by simply copying, would be faster.
As end-result we have a data.frame, with one column containing data.frames containing the summary of the join. Is that the way you need it? I think your code could be a lot faster if you only calculate the values you actually need. And I haven't worked a lot with data.frames containing data.frames, but it could well be that bind_rows doesn't handle it efficiently. Maybe a simple list (as column of your data.frame) would work better, as there's less overhead.
Finally, it could be that you're unable to reveal more about your real data, or it's too complicated.
In that case I think you could look around for various profiling tools, functions that can help show you where most time is being spent. Personally, I like the profvis tool.
Put print(profvis::profvis({ mycode }, interval=seconds)) around a block of code, and after it finishes execution you see which lines took the most time, and which functions are called under the hood.
In the example-code, almost all time is spent in the row-binding and making data.frames. But in real data, I expect other functions may be time-consuming.
I am interested in plotting fits with confidence intervals after using two-way clustering package (multiwayvcov).
Here is my reproducible data.
rm(list=ls(all=TRUE))
library(lmtest)
library(multiwayvcov)
dv<-c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0)
int1<-c(0.0123, 0.3428, 0.2091, 0.8325, 0.7113, 0.7401, 0.6009, 0.5062, 0.4841, 0.8912, 0.3850, 0.2463, 0.0625, 0.5374, 0.1984)
int2<-c(0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0)
cont<-c(3, 1, 2, 4, 6, 7, 1, 4, 3, 2, 4, 3, 6, 1, 3)
cluster1<-c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)
cluster2<-c(1, 2, 3, 1, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2)
mydata<-as.data.frame(cbind(dv, int1, int2, cont, cluster1, cluster2))
This is my non-clustered model:
result_lm <- lm(dv~int1+int2+cont,data=mydata)
To get clustered results using "cluster1" and "cluster2", I use functions in the package of "lmtest" and "multiwayvcov" as follows.
cluster_vcov<-cluster.vcov(result_lm, ~cluster1+cluster2)
result_2c<-coeftest(result_lm, cluster_vcov)
Here, "cluster_vcov" is just a variance-covariance matrix and "result_2c" is just a matrix of coefficient tests, not a model object. Thus, I am not able to use the "predict" function to plot fits on a new dataset ("datagrid") such as
grid <- seq(0,1,.2)
datagrid <- data.frame(int1=rep(grid,2),
int2=c(rep(0,length(grid)),
rep(1,length(grid))))
datagrid$cont<-mean(mydata$cont, na.rm=T)
Before moving on to what I have done, here is something similar to what I would like to have eventually.
library(ggplot2)
fits <- predict(result_lm,newdata=datagrid,interval="confidence")
plotdata <- data.frame(fits,datagrid)
plotdata$int2 <- plotdata$int2==1
ggplot(plotdata,aes(x=int1,y=fit,ymin=lwr,ymax=upr,color=int2)) + geom_line(aes(linetype = int2)) + geom_ribbon(alpha=.2) + theme(legend.position="none") + scale_color_manual(values=c("red", "darkgreen")) + scale_linetype_manual(values=c("dashed", "solid"))
The result is a plot of the fitted lines for each value of int2 with shaded confidence bands (image not shown).
To address the problem that "result_2c" does not give an object that can be used directly with "predict", I decided to construct a data frame myself as follows.
d_twc_result<-data.frame(matrix(0, nrow =4, ncol = 4) )
colnames(d_twc_result) <- c("Estimate","Std. Error","t value", "Pr(>|t|)")
rownames(d_twc_result) <-c("(Intercept)", "int1","int2", "cont")
for (j in 1:4){
for (i in 1:4){
d_twc_result[i, j]<-result_2c[i,j]
}
}
Then, using "d_twc_result$Estimate", I generate a vector that corresponds to "fits" that one could get after running "predict".
fits<-c(1:12)
for (i in 1:12){
fits[i]<-d_twc_result$Estimate[1]+
d_twc_result$Estimate[2]*datagrid$int1[i]+
d_twc_result$Estimate[3]*datagrid$int2[i]+
d_twc_result$Estimate[4]*datagrid$cont[i]
}
Yet, I was still not able to construct vectors for "lwr" and "upr", which require 'residuals' or a 'standard error'. Where I am actually stuck is that it seems impossible to get 'residuals' or a 'standard error', because there is no observation of 'dv' in the dataset "datagrid".
Nevertheless, "predict" works with the dataset "datagrid", so I guess that I misunderstand how "predict" works or what a fit is.
I would highly appreciate it if you could help me get "lwr" and "upr" (or correct my understanding of the concept of fit). Thanks in advance for any comments.
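One point that may help: the standard error of a fitted value does not need residuals at the new points, only the design matrix X of the new data and a coefficient covariance matrix V, since se = sqrt(diag(X V X')). The sketch below is a hedged illustration, not a definitive method: ci_from_vcov is a hypothetical helper, it uses a t critical value with the model's residual degrees of freedom (an assumption that may not be the right choice for clustered errors), and it is run here with vcov(result_lm) so that it can be checked against predict. Passing the cluster_vcov matrix from above as V should give the clustered bands.

```r
# Data and model from the question
dv   <- c(1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0)
int1 <- c(0.0123, 0.3428, 0.2091, 0.8325, 0.7113, 0.7401, 0.6009, 0.5062,
          0.4841, 0.8912, 0.3850, 0.2463, 0.0625, 0.5374, 0.1984)
int2 <- c(0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0)
cont <- c(3, 1, 2, 4, 6, 7, 1, 4, 3, 2, 4, 3, 6, 1, 3)
mydata <- data.frame(dv, int1, int2, cont)
result_lm <- lm(dv ~ int1 + int2 + cont, data = mydata)

grid <- seq(0, 1, .2)
datagrid <- data.frame(int1 = rep(grid, 2),
                       int2 = rep(c(0, 1), each = length(grid)))
datagrid$cont <- mean(mydata$cont)

# Design matrix for the new data (the response is not needed)
X <- model.matrix(delete.response(terms(result_lm)), datagrid)

# Hypothetical helper: fitted values and interval from any coefficient covariance V
ci_from_vcov <- function(model, X, V, level = 0.95) {
  fit  <- drop(X %*% coef(model))
  se   <- sqrt(rowSums((X %*% V) * X))  # equals sqrt(diag(X V X'))
  crit <- qt(1 - (1 - level) / 2, df = df.residual(model))
  data.frame(fit = fit, lwr = fit - crit * se, upr = fit + crit * se)
}

# With the ordinary covariance this reproduces predict(..., interval = "confidence")
ci_plain <- ci_from_vcov(result_lm, X, vcov(result_lm))
print(head(ci_plain))
```

The resulting data frame has the same fit, lwr and upr columns as the predict output, so it can be combined with datagrid and plotted with the same ggplot call.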
I am a beginner in R, and have a question about making boxplots of columns in R. I just made a dataframe:
SUS <- data.frame(RD = c(4, 3, 4, 1, 2, 2, 4, 2, 4, 1), TK = c(4, 2, 4, 2, 2, 2, 4, 4, 3, 1),
WK = c(3, 2, 4, 1, 3, 3, 4, 2, 4, 2), NW = c(2, 2, 4, 2, NA, NA, 5, 1, 4, 2),
BW = c(3, 2, 4, 1, 4, 1, 4, 1, 5, 1), EK = c(2, 4, 3, 1, 2, 4, 2, 2, 4, 2),
AN = c(3, 2, 4, 2, 3, 3, 3, 2, 4, 2))
rownames(SUS) <- c('Pleasant to use', 'Unnecessary complex', 'Easy to use',
'Need help of a technical person', 'Different functions well integrated','Various function incohorent', 'Imagine that it is easy to learn',
'Difficult to use', 'Confident during use', 'Long duration untill I could work with it')
I tried a number of times, but I did not succeed in making boxplots for all rows. Can someone help me out here?
You can also do it using the tidyverse:
library(tidyverse)
SUS %>%
#create new column and save the row.names in it
mutate(variable = row.names(.)) %>%
#convert your data from wide to long
tidyr::gather("var", "value", 1:7) %>%
#plot it using ggplot2
ggplot(., aes(x = variable, y = value)) +
geom_boxplot()+
theme(axis.text.x = element_text(angle=35,hjust=1))
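The same reshaping can also be sketched with tidyr::pivot_longer, which the tidyr documentation recommends over gather for new code. This is a hedged sketch on a reduced toy version of SUS; SUS_small and the column names statement, column and score are illustrative choices, not part of the original answer.

```r
library(tidyr)
library(ggplot2)

# Reduced toy version of SUS: three statements (rows), three columns
SUS_small <- data.frame(RD = c(4, 3, 4), TK = c(4, 2, 4), NW = c(2, 2, NA),
                        row.names = c("Pleasant to use", "Unnecessary complex",
                                      "Easy to use"))

# Keep the row names as an ordinary column before reshaping
SUS_small$statement <- rownames(SUS_small)

# Wide to long: one row per statement/column combination
long <- pivot_longer(SUS_small, cols = -statement,
                     names_to = "column", values_to = "score")

# One box per statement; missing scores are dropped silently
p <- ggplot(long, aes(x = statement, y = score)) +
  geom_boxplot(na.rm = TRUE) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1))
print(p)
```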
As @blondeclover says in the comment, boxplot() should work fine for doing a boxplot of each column.
If what you want is a boxplot for each row, then actually your current rows need to be your columns. If you need to do this, you can transpose the data frame before plotting:
SUS.new <- as.data.frame(t(SUS))
boxplot(SUS.new)