I'm trying to run a principal component analysis (PCA) that distinguishes the quantitative variables from the qualitative ones, but I get this error when I run:
library(FactoMineR)
PCA(data, quanti.sup = 4:12, quali.sup = 1:3, scale.unit = FALSE, ncp = 2)
Error in eigen(t(X) %*% X, symmetric = TRUE) : 0 x 0 matrix
My data is a 2980 x 12 data frame with column names, so the 0 x 0 matrix is really puzzling.
Any advice would be very much appreciated.
The problem is that you have specified all of your variables as supplementary when calling PCA(): quali.sup = 1:3 and quanti.sup = 4:12 together cover all 12 columns, so no active variables are left for the analysis.
To illustrate with an example, we can use the built-in dataset USJudgeRatings.
head(USJudgeRatings)
CONT INTG DMNR DILG CFMG DECI PREP FAMI ORAL WRIT PHYS RTEN
AARONSON,L.H. 5.7 7.9 7.7 7.3 7.1 7.4 7.1 7.1 7.1 7.0 8.3 7.8
ALEXANDER,J.M. 6.8 8.9 8.8 8.5 7.8 8.1 8.0 8.0 7.8 7.9 8.5 8.7
ARMENTANO,A.J. 7.2 8.1 7.8 7.8 7.5 7.6 7.5 7.5 7.3 7.4 7.9 7.8
BERDON,R.I. 6.8 8.8 8.5 8.8 8.3 8.5 8.7 8.7 8.4 8.5 8.8 8.7
BRACKEN,J.J. 7.3 6.4 4.3 6.5 6.0 6.2 5.7 5.7 5.1 5.3 5.5 4.8
BURNS,E.B. 6.2 8.8 8.7 8.5 7.9 8.0 8.1 8.0 8.0 8.0 8.6 8.6
In this data there are 43 judges who were rated by lawyers on 11 qualities (columns 2:12). Column 1 is the number of contacts the lawyers had with the judge.
The PCA won't work if you specify that all variables are supplementary.
library(FactoMineR)
result <- PCA(USJudgeRatings, ncp = 3, quanti.sup = 1:12)
# Error in eigen(t(X) %*% X, symmetric = TRUE) : 0 x 0 matrix
We have to give the PCA some active variables to work with. Here we let the 11 rating variables go into the PCA and specify only the number of contacts the lawyers had with the judges as a quantitative supplementary variable:
result <- PCA(USJudgeRatings, ncp = 3, quanti.sup = 1)
This runs and you can then view the results with summary.PCA(result).
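To see why the original call fails, it can help to count how many columns remain active once the supplementary ones are removed. The sketch below uses only base R; active_columns() is a hypothetical helper written for this illustration and is not part of FactoMineR.

```r
# Hypothetical helper (not part of FactoMineR): indices of the columns
# that are neither quantitative nor qualitative supplementary variables.
active_columns <- function(data, quanti.sup = NULL, quali.sup = NULL) {
  setdiff(seq_len(ncol(data)), c(quanti.sup, quali.sup))
}

# Every column is supplementary, so nothing is left for the PCA itself
length(active_columns(USJudgeRatings, quanti.sup = 1:12))

# Only column 1 is supplementary, so 11 active variables remain
length(active_columns(USJudgeRatings, quanti.sup = 1))

# Same index pattern as in the question (columns 1:3 and 4:12 out of 12)
length(setdiff(1:12, c(4:12, 1:3)))
```

A count of zero active columns is consistent with the 0 x 0 matrix reported by eigen().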
I am trying to show the top 100 sales on a scatterplot by year. I used the below code to take top 100 games according to sales and then set it as a data frame.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
as.data.frame(top100)
I then tried to plot this with the below code:
ggplot(top100)+
aes(x=Year, y = Global_Sales) +
geom_point()
I get the error below when using the subset top100:
Error: data must be a data frame, or other object coercible by fortify(), not a numeric vector
If I use the actual games dataset, I get the plot attached.
Any ideas?
As pointed out in the comments by @CMichael, you have several issues in your code.
In the absence of a reproducible example, I used the iris dataset to explain what is wrong with your code.
top100 <- head(sort(games$NA_Sales,decreasing=TRUE), n = 100)
By doing that you are only extracting a single column.
The same command with the iris dataset:
> head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20)
[1] 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 7.2 7.1 7.0 6.9 6.9 6.9 6.9 6.8 6.8 6.8
So, first, you no longer have two dimensions to plot with ggplot2. Second, the column names are not kept during the extraction, so you cannot ask ggplot2 to plot Year and Global_Sales afterwards.
So, to solve your issue, you can do (here the example with the iris dataset):
top100 = as.data.frame(head(iris[order(iris$Sepal.Length, decreasing = TRUE), 1:2], n = 100))
And you get a data.frame of this type:
> str(top100)
'data.frame': 100 obs. of 2 variables:
$ Sepal.Length: num 7.9 7.7 7.7 7.7 7.7 7.6 7.4 7.3 7.2 7.2 ...
$ Sepal.Width : num 3.8 3.8 2.6 2.8 3 3 2.8 2.9 3.6 3.2 ...
> head(top100)
Sepal.Length Sepal.Width
132 7.9 3.8
118 7.7 3.8
119 7.7 2.6
123 7.7 2.8
136 7.7 3.0
106 7.6 3.0
You can then plot it:
library(ggplot2)
ggplot(top100, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
Warning: based on what you provided in your example, I suggest you do:
top100 <- as.data.frame(head(games[order(games$NA_Sales,decreasing=TRUE),c("Year","Global_Sales")], 100))
However, if this does not give you what you need, you should consider providing a reproducible example of your dataset (see How to make a great R reproducible example).
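One more detail from the question's code, sketched here with iris standing in for games: calling as.data.frame(top100) without assigning the result does not change top100, so the object passed to ggplot() is still a numeric vector.

```r
# iris stands in for the games data, which is not available here
top <- head(sort(iris$Sepal.Length, decreasing = TRUE), n = 20)

as.data.frame(top)     # builds a data frame, but the result is discarded
is.data.frame(top)     # top itself is still a plain numeric vector

# Ordering the rows of the data frame keeps every column you select
top_df <- head(iris[order(iris$Sepal.Length, decreasing = TRUE),
                    c("Sepal.Length", "Sepal.Width")], n = 20)
is.data.frame(top_df)
names(top_df)
```

Assigning the result of the row-ordering expression, as in top_df above, is what makes the column names available to aes().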
How can I get the variable name of a function to act dynamically within a string?
Below is an extract of what I am trying to achieve: a function that produces a list depending on the varName. But I cannot get the varName to act dynamically within the string sqldf(...). I assume this problem is not specific to the package sqldf.
createExcelSheetData<-function(varName){
sqldf("
SELECT Name
FROM dataTable
WHERE Choice=varName
")
}
table1<-createExcelSheetData(1)
table2<-createExcelSheetData(2)
table3<-createExcelSheetData(3)
What the above gives me is a query that compares Choice with the literal text varName rather than with the value passed to the function.
UPDATE: To have the variable within the text, not just at the end.
createExcelSheetData<-function(varName){
sqldf("
SELECT Name
FROM dataTable
WHERE Choice=varName
ORDER BY Name
")
}
table1<-createExcelSheetData(1)
table2<-createExcelSheetData(2)
table3<-createExcelSheetData(3)
Prefix sqldf with fn$ so that R variables marked with $ are substituted into the SQL string; fn$ is discussed in Example 6 on the sqldf home page. Here is a self-contained minimal reproducible example using the iris data frame that comes with R. (In the future please ensure all code is minimal and reproducible and, in particular, includes all inputs.)
library(sqldf)
# retrieve records for specified Species and Petal.Length above minPetalLength
f <- function(Species, minPetalLength) {
  fn$sqldf("SELECT *
            FROM iris
            WHERE Species = '$Species' and [Petal.Length] > $minPetalLength")
}
f("virginica", 6)
giving:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 7.6 3.0 6.6 2.1 virginica
2 7.3 2.9 6.3 1.8 virginica
3 7.2 3.6 6.1 2.5 virginica
4 7.7 3.8 6.7 2.2 virginica
5 7.7 2.6 6.9 2.3 virginica
6 7.7 2.8 6.7 2.0 virginica
7 7.4 2.8 6.1 1.9 virginica
8 7.9 3.8 6.4 2.0 virginica
9 7.7 3.0 6.1 2.3 virginica
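If you would rather not rely on fn$, one alternative is to build the query string in base R before passing it to sqldf(). The sketch below only constructs and prints the string; build_query() is a hypothetical helper, and it assumes Choice is numeric, as the calls createExcelSheetData(1), createExcelSheetData(2) and so on in the question suggest.

```r
# Hypothetical helper: paste the value of choice into the SQL text.
# Assumes choice is a single number; character input would need quoting
# and escaping, which this sketch does not attempt.
build_query <- function(choice) {
  stopifnot(is.numeric(choice), length(choice) == 1)
  sprintf("SELECT Name FROM dataTable WHERE Choice = %d ORDER BY Name",
          as.integer(choice))
}

cat(build_query(2), "\n")
# The resulting string could then be handed to sqldf(), e.g.
# sqldf(build_query(2))
```

Because the value is inserted with sprintf(), it can appear anywhere in the query, not only at the end.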
I have a 207x7 xts object (called temp). I have a 207x3 matrix (called ac.topn), each row of which contains the columns I'd like from the corresponding row in the xts object.
For example, given the following top two rows of temp and ac.topn,
temp
v1 v2 v3 v4 v5 v6 v7
1997-09-30 14.5 8.7 -5.8 2.6 4.7 1.9 17.2
1997-10-31 6.0 -2.0 -25.7 2.9 4.9 9.6 8.4
head(ac.topn)
Rank1 Rank2 Rank3
1997-09-30 7 4 2
1997-10-31 6 5 7
I would like to get the result:
1997-09-30 17.2 2.6 8.7 (elements 7, 4, and 2 from the first row of temp)
1997-10-31 9.6 4.9 8.4 (elements 6, 5, 7 from the second row of temp)
My first attempt was temp[,ac.topn]. I've browsed for help, but am struggling to word my request effectively.
Thank you.
Well, this works, but I've got to think there's a better way...
result <- do.call(rbind,lapply(index(temp),function(i)temp[i,ac.topn[i]]))
colnames(result) <- colnames(ac.topn)
result
# Rank1 Rank2 Rank3
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
You may subset a matrix version of the xts object, using indexing via a numeric matrix:
m <- as.matrix(temp)
cols <- as.vector(ac.topn)
rows <- rep(1:nrow(ac.topn), ncol(ac.topn))
vals <- m[cbind(rows, cols)]
xts(x = matrix(vals, nrow = nrow(temp)), order.by = index(temp))
# [,1] [,2] [,3]
# 1997-09-30 17.2 2.6 8.7
# 1997-10-31 9.6 4.9 8.4
However, I say the same as @jlhoward: I've got to think there's a better way...
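The matrix-indexing idea can be checked on the two rows from the question. This sketch uses a plain matrix instead of an xts object and builds the row indices with row(), which is a small variation on the rep() call above.

```r
# The two rows shown in the question, as a plain matrix
m <- matrix(c(14.5,  8.7,  -5.8, 2.6, 4.7, 1.9, 17.2,
               6.0, -2.0, -25.7, 2.9, 4.9, 9.6,  8.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("1997-09-30", "1997-10-31"), paste0("v", 1:7)))
idx <- matrix(c(7, 4, 2,
                6, 5, 7),
              nrow = 2, byrow = TRUE,
              dimnames = list(rownames(m), paste0("Rank", 1:3)))

# A two-column matrix of (row, column) pairs picks one element per pair
picked <- m[cbind(c(row(idx)), c(idx))]

# Reshape back to the layout of idx (both are in column-major order)
result <- matrix(picked, nrow = nrow(idx), dimnames = dimnames(idx))
result
```

The same pattern should carry over to the full 207-row objects, since nothing in it depends on the number of rows.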
I'm working on two datasets derived from cats, a dataset that ships with R in the MASS package.
> cats
Sex Bwt Hwt
1 F 2.0 7.0
2 F 2.0 7.4
3 F 2.0 9.5
4 F 2.1 7.2
5 F 2.1 7.3
6 F 2.1 7.6
7 F 2.1 8.1
8 F 2.1 8.2
9 F 2.1 8.3
10 F 2.1 8.5
11 F 2.1 8.7
12 F 2.1 9.8
...
137 M 3.6 13.3
138 M 3.6 14.8
139 M 3.6 15.0
140 M 3.7 11.0
141 M 3.8 14.8
142 M 3.8 16.8
143 M 3.9 14.4
144 M 3.9 20.5
I want to find the 99% confidence interval on the difference in mean Bwt between male and female specimens (Sex == M and Sex == F respectively).
I know that t.test does this, among other things, but if I break up cats into two datasets that contain the Bwt of males and females, t.test() complains that the two datasets are not of the same length, which is true. There are only 47 females in cats, and 97 males.
Is it doable some other way or am I misinterpreting data by breaking them up?
EDIT:
I have a function suggested to me by an Answerer on another Question that gets the CI of means on a dataset, may come in handy:
ci_func <- function(data, ALPHA){
c(
mean(data) - qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data)),
mean(data) + qnorm(1-ALPHA/2) * sd(data)/sqrt(length(data))
)
}
You should apply the t.test with the formula interface:
t.test(Bwt ~ Sex, data=cats, conf.level=.99)
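t.test() does not need groups of equal size unless paired = TRUE. As a rough check, the sketch below computes the Welch interval by hand on two made-up vectors of different lengths and compares it with what t.test() reports; the numbers are illustrative and are not taken from cats.

```r
# Made-up samples of unequal length (not the cats data)
x <- c(2.0, 2.1, 2.3, 2.4, 2.9)
y <- c(2.5, 2.8, 3.0, 3.1, 3.3, 3.6, 3.9)
alpha <- 0.01

# Welch standard error and Welch-Satterthwaite degrees of freedom
vx <- var(x) / length(x)
vy <- var(y) / length(y)
se <- sqrt(vx + vy)
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# 99% interval for mean(x) - mean(y), by hand and from t.test()
manual  <- (mean(x) - mean(y)) + c(-1, 1) * qt(1 - alpha / 2, df) * se
builtin <- t.test(x, y, conf.level = 1 - alpha)$conf.int

manual
as.numeric(builtin)
```

Note that this interval uses the t distribution, whereas the ci_func in the question uses qnorm, so the two approaches will not agree exactly on small samples.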
As an alternative to t.test, if you are really only interested in the difference of means, you can pass the two groups to MeanDiffCI:
with(cats, DescTools::MeanDiffCI(Bwt[Sex == "F"], Bwt[Sex == "M"]))
which gives something like
meandiff lwr.ci upr.ci
-23.71474 -71.30611 23.87662
With one of the bootstrap methods, 999 resamples are drawn by default. If you want more, you can specify this in the R parameter:
with(cats, DescTools::MeanDiffCI(Bwt[Sex == "F"], Bwt[Sex == "M"], method = "perc", R = 1000))
I asked a similar question before, but I have since simplified my data format because I'm very new to R and didn't understand what was going on. Here is the link to that question: How to handle more than multiple sets of data in R programming?
I edited my data so that it now looks like this:
X1.0 X X2.0 X.1
0.9 0.9 0.2 1.2
1.3 1.4 0.8 1.4
As you can see, I have four columns of data; the real data has up to 2000 data points. Columns "X1.0" and "X2.0" hold the time, so what I want is the average of "X" and "X.1" every 100 seconds, based on my two time columns "X1.0" and "X2.0". I can do it using these commands:
cuts <- cut(data$X1.0, breaks=seq(0, max(data$X1.0)+400, 400))
by(data$X, cuts, mean)
But this only gives me the averages for one set of data, "X1.0" and "X". How can I get the averages for more than one data set? I would also like to avoid this kind of output:
cuts: (0,400]
[1] 0.7
------------------------------------------------------------
cuts: (400,800]
[1] 0.805
Note that this output was computed every 400 s. What I really want is a list of the averages for the different intervals. Please help. I used data=read.delim("clipboard") to get my data into R.
It is a little unclear what output you want.
First I rename the columns, but this is optional:
colnames(dat) <- c('t1','v1','t2','v2')
Then I use ave, which is like by but with more convenient output. I use a matrix of column indices as a trick to pair each time column with its value column:
matrix(1:ncol(dat),ncol=2) ## column 1 holds cols 1 and 2, column 2 holds cols 3 and 4
[,1] [,2]
[1,] 1 3
[2,] 2 4
Then I use this matrix with apply. Here is the entire solution:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) {  ## bins of 10 seconds; replace this
                                    ## with 100 or 400 in your real data
              t.col <- dat[, x][, 1]  ## time column (t1 or t2)
              v.col <- dat[, x][, 2]  ## value column (v1 or v2)
              ave(v.col,
                  cut(t.col, breaks = seq(0, max(t.col) + by, by)),
                  FUN = mean)
            })
)
EDIT: correct the binning and simplify the code. The value column, dat[,x][,2], is averaged within each time bin:
cbind(dat,
      apply(matrix(1:ncol(dat), ncol = 2), 2,
            function(x, by = 10) ave(dat[, x][, 2], dat[, x][, 1] %/% by)))
   X1.0   X X2.0 X.1    1   2
1   0.9 0.9  0.2 1.2 1.38 1.3
2   1.3 1.4  0.8 1.4 1.38 1.3
3   2.0 1.7  1.6 1.1 1.38 1.3
4   2.6 1.9  2.2 1.6 1.38 1.3
5   9.7 1.0  2.8 1.3 1.38 1.3
6  10.7 0.8  3.5 1.1 1.65 1.3
7  11.6 1.5  4.1 1.8 1.65 1.3
8  12.1 1.4  4.7 1.2 1.65 1.3
9  12.6 1.8  5.4 1.2 1.65 1.3
10 13.2 2.1  6.3 1.3 1.65 1.3
11 13.7 1.6  6.9 1.1 1.65 1.3
12 14.2 2.2  9.4 1.3 1.65 1.3
13 14.6 1.8 10.0 1.5 1.65 1.5
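If what you want is just the bin averages as a plain vector, rather than the by() printout or a column repeated on every row, tapply() is one option. This is a small sketch with made-up numbers and a bin width of 10; tm and v are placeholder names for one time column and its value column.

```r
# Made-up data: tm is a time column, v the matching value column
tm <- c(0.9, 1.3, 2.0, 2.6, 9.7, 10.7, 11.6)
v  <- c(0.9, 1.4, 1.7, 1.9, 1.0,  0.8,  1.5)
by <- 10

# Integer division assigns each time to a bin: 0 for [0, 10), 1 for [10, 20), ...
bins <- tm %/% by

# One mean per bin, returned as a named vector
bin_means <- tapply(v, bins, mean)
bin_means
```

The names of the result are the bin numbers, so multiplying them by the bin width gives the start of each interval.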