Subscript out of bounds error in R

While using the factanal function from the stats package to perform factor analysis,
I tried the following:
library(mirt)
library(ltm)
library(psych)
library(stats)
data(SAT12)
data=SAT12
cor_mat=polychoric(data, ML=TRUE, global=F)
fit <- factanal(factors=2, n.obs=nrow(data), covmat=cor_mat$rho)
Divide_item_Factor_Loading(fit)
When I run Divide_item_Factor_Loading(fit), the following error pops up:
Error in a[[i]][[2]] : subscript out of bounds
Here is the complete code of Divide_item_Factor_Loading:
Divide_item_Factor_Loading=function(fit)
{
a=list()
items=NULL
for(i in 1:nrow(fit$loadings)) ######corresponding to rows of loading matrix
{
k=which(fit$loadings[i,]==max(abs(fit$loadings[i,])))
a[[i]]=c(i,as.numeric(k))
}
fact_item_mat=matrix(, nrow=nrow(fit$loadings), ncol=ncol(fit$loadings))
for(j in 1:(ncol(fit$loadings)))
{
for(i in 1:(nrow(fit$loadings)))
{
if(a[[i]][[2]]==j) {fact_item_mat[i,j]=a[[i]][[1]]}
}
}
nam=names(fit$loadings[,1])
factor=list()
for(i in 1:ncol(fit$loadings))
{
factor[[i]]=sort(fact_item_mat[,i], decreasing = FALSE, na.last = NA)
fac=factor[[i]]
fac=nam[fac]
factor[[i]]=fac
}
names(factor)=paste("factor", 1:ncol(fit$loadings), sep="")
return(factor)
}
What steps should I take now to avoid this error?

To change the way the loadings are printed, use the cutoff argument to print.loadings.
Try something like this:
print(fit$loadings, cutoff=0)
The actual matrix contains all the values.
print(loadings(fit), cutoff=0)
Loadings:
       Factor1 Factor2
Item 1 0.014   0.418
Item 2 0.130   0.350
Item 3 0.036   0.553
Item 4 0.166   0.294
Item 5 0.990   0.125

               Factor1 Factor2
SS loadings    1.025   0.705
Proportion Var 0.205   0.141
Cumulative Var 0.205   0.346
Now extract the maximum loading on each factor, using apply():
apply(loadings(fit), 2, max)
  Factor1   Factor2
0.9895743 0.5531770

Running your code and stepping through your function with debug(), I can see why you get the "subscript out of bounds" error:
the 15th element (among others) of your list a has length 1, so R fails when you try to access a[[15]][[2]].
That element has length 1 instead of 2 because, for that item, the loading with the largest absolute value is negative. You are asking which raw (not absolute) loading equals the maximum absolute value, and none does, so which() returns an empty result.
Hence you need to change the line
which(fit$loadings[i,]==max(abs(fit$loadings[i,])))
to
which(abs(fit$loadings[i,])==max(abs(fit$loadings[i,])))
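The empty result can be reproduced on a single row. The sketch below uses a made-up loading vector (loadings_row is hypothetical, not taken from the fit above) in which the largest absolute loading is negative:

```r
# Hypothetical row of a loading matrix: the dominant loading is negative
loadings_row <- c(Factor1 = -0.62, Factor2 = 0.10)

# Original comparison: no raw value equals max(abs(...)) = 0.62, so nothing matches
original <- which(loadings_row == max(abs(loadings_row)))

# Corrected comparison: compare absolute values on both sides
fixed <- which(abs(loadings_row) == max(abs(loadings_row)))

length(original)  # 0, which is why c(i, k) ended up with length 1
fixed             # position 1, named Factor1
```

which.max(abs(loadings_row)) is another option; it always returns exactly one index, the first one in case of ties.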
and you'll get:
Divide_item_Factor_Loading(fit)
#$factor1
#[1] "Item.1" "Item.4" "Item.6" "Item.7" "Item.8" "Item.9" "Item.10" "Item.11" "Item.13" "Item.14" "Item.15"
#[12] "Item.17" "Item.19" "Item.20" "Item.24" "Item.26" "Item.27" "Item.28" "Item.29"
#$factor2
#[1] "Item.2" "Item.3" "Item.5" "Item.12" "Item.16" "Item.18" "Item.21" "Item.22" "Item.23" "Item.25" "Item.30"
#[12] "Item.31" "Item.32"
The debugged function now works, but I think you should rewrite it, because it is more complicated than it needs to be.
Here is my suggestion for an alternative function:
Divide_item_Factor_Loading_v2<-function(fit){
a<-apply(fit$loadings,1,function(facs) which(abs(facs)==max(abs(facs))))
return(list(factor1=names(a)[a==1],factor2=names(a)[a==2]))
}
This gives for your fit object the exact same result as your (debugged) function:
Divide_item_Factor_Loading_v2(fit)
#$factor1
#[1] "Item.1" "Item.4" "Item.6" "Item.7" "Item.8" "Item.9" "Item.10" "Item.11" "Item.13" "Item.14" "Item.15"
#[12] "Item.17" "Item.19" "Item.20" "Item.24" "Item.26" "Item.27" "Item.28" "Item.29"
#$factor2
#[1] "Item.2" "Item.3" "Item.5" "Item.12" "Item.16" "Item.18" "Item.21" "Item.22" "Item.23" "Item.25" "Item.30"
#[12] "Item.31" "Item.32"
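The version above hard-codes two factors. As a minimal sketch, assuming each item should go to the factor with the largest absolute loading and that ties go to the first such factor, the same idea can be written for any number of factors. The helper name assign_items is hypothetical, and the toy matrix reuses the five loadings printed earlier in this thread:

```r
# Hypothetical helper: group item names by the factor with the largest |loading|
assign_items <- function(L) {
  L <- unclass(L)  # drop the "loadings" class, keep the plain numeric matrix
  best <- max.col(abs(L), ties.method = "first")  # winning column index per row
  groups <- factor(best,
                   levels = seq_len(ncol(L)),
                   labels = paste0("factor", seq_len(ncol(L))))
  split(rownames(L), groups)
}

# Toy loading matrix built from the five printed loadings
toy <- matrix(c(0.014, 0.130, 0.036, 0.166, 0.990,
                0.418, 0.350, 0.553, 0.294, 0.125),
              ncol = 2,
              dimnames = list(paste("Item", 1:5), c("Factor1", "Factor2")))

res <- assign_items(toy)
res
```

With a real fit this would be called as assign_items(fit$loadings). Factors that win no items still appear in the result, as empty character vectors.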

Check ?loadings: you'll find that there is a cutoff parameter, documented as "loadings smaller than this (in absolute value) are suppressed".

Related

Finding out if column has only identical values in R for if...else

I'm new to R and have the following problem. I have a big df that looks something like this (PPC = precipitation):
ID   Area   ppc_1   ppc_2   ppc_3   ....   other_values
1    PB     0.989   0.787   0.676   ..
2    DR     0.978   0.989   .....   ..
3    ...    ...     ...     ...
I need to test every PPC station per area for normality with the Shapiro-Wilk test. To gather the data from the weather stations into one column, I used the following for loop.
My problem is that some weather stations seem to have identical values, which causes an error in the Shapiro-Wilk test. I tried to work around it with an if...else statement, but the "if" clause does not work.
#create empty vector to put results in
pvalue_ppc <- as.character()
for (i in 1:length(df$ID)) {
####filter for ID and PPC
temp_df<-df %>%select(starts_with("ppc"), ID)%>% filter(ID==i )
#melt weather stations over ID
temp_df_ID_ppc <- melt(temp_df,
id = "ID",
value.name = "all_ppc")
#i want to test if all values in df$all_ppc are the same = TRUE
if(isTRUE(all.equal(temp_df_ID_ppc$all_ppc, countEQ = TRUE))){
#and if all values are equal, put an "NA" into the vector
output <- NA
pvalue_ppc[i]<- rbind(output)
}
else {
output<- shapiro.test(ttemp_df_ID_ppc$all_ppc)
#bind results to vector
pvalue_ppc[i]<- rbind(output$p.value)
}
}
#bind vector to orig. df
df<- cbind(df, "Pvalue_PPC"=pvalue_ppc)
The error I get is
$ operator is invalid for atomic vectors
or it returns TRUE, even if not all values are equal.
I also tried using identical(), but it only compares one vector to another.
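One common way to test whether a single vector holds only one distinct value is length(unique(x)) == 1. The sketch below wraps that check around shapiro.test(). The helper name safe_shapiro_p and the sample values are assumptions for illustration, not part of the original code:

```r
# Hypothetical helper: return NA when the Shapiro-Wilk test cannot be run,
# otherwise return its p-value.
safe_shapiro_p <- function(x) {
  x <- x[!is.na(x)]
  n <- length(x)
  # shapiro.test() needs between 3 and 5000 non-missing values
  # and stops with an error if they are all identical
  if (n < 3 || n > 5000 || length(unique(x)) == 1) {
    return(NA_real_)
  }
  shapiro.test(x)$p.value
}

constant_station <- rep(0.989, 10)                        # made-up identical values
varying_station  <- c(0.989, 0.787, 0.676, 0.978, 0.912)  # made-up varying values

safe_shapiro_p(constant_station)  # NA, the test is skipped
safe_shapiro_p(varying_station)   # a p-value between 0 and 1
```

Inside the loop this would replace the if...else block, for example pvalue_ppc[i] <- safe_shapiro_p(temp_df_ID_ppc$all_ppc), with pvalue_ppc created as numeric() so that the p-values stay numeric.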

What is the difference between fitdistr and fitdist for t-distribution?

More specifically, I have a set of data x e.g
x <- scan(textConnection(
"10.210126 10.277015 10.402625 10.208137 9.831884 9.672501
9.815476 9.980124 9.710509 9.148997 9.406072
9.991224 10.324005 10.402747 10.449439 10.051304 9.886748 9.771041
9.761175 10.049102 9.457981 9.114380 9.461333 9.804220 10.395986
10.192419 10.202962 9.984330 9.765604 9.473166 9.966462 10.120895
9.631744"))
If I use
fitdistr(x,"t")
I get the following results:
       m              s              df
  10.63855874    0.50766169   37.08954639
 ( 0.02217171) ( 0.01736012) (23.12225558)
Compared to
fitdist(x,"t",start=list(length(x)-1,mean(x)),lower=c(0))
(starting parameters were found in an answer on this site) which results in:
      estimate Std. Error
1 103129.28018         NA
2     10.63869         NA
Why is there a difference? Am I perhaps using the wrong starting points for fitdist()?
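One possible source of the difference is the parametrisation, not the starting values. For "t", MASS::fitdistr fits a location-scale family with location m, scale s and df, which matches the three estimates in the first output. fitdistrplus::fitdist looks up the density by name, so "t" means base R's dt(), whose parameters are df and ncp and which has no scale. That would be consistent with only two estimates in the second output. The base-R sketch below only illustrates the two densities; dt_ls is a hypothetical name and the parameter values are arbitrary:

```r
# Hypothetical location-scale t density with parameters m, s, df
dt_ls <- function(x, m, s, df) dt((x - m) / s, df) / s

# It is a proper density: it integrates to 1 for any location and scale
area <- integrate(function(x) dt_ls(x, m = 10, s = 0.5, df = 5), -Inf, Inf)$value

# With m = 0 and s = 1 it reduces to the standard t density
grid <- seq(-3, 3, by = 0.5)
same_as_dt <- all.equal(dt_ls(grid, m = 0, s = 1, df = 5), dt(grid, df = 5))

# Base dt() itself only offers df and ncp as distribution parameters
dt_args <- names(formals(dt))
dt_args
```

If that reading is right, passing a location-scale density such as dt_ls to fitdist under its own distribution name, with a named start list, would be the like-for-like comparison. This is a suggestion under the stated assumptions, not a verified fix.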

Error when using cv.tree

Hi I tried using the function cv.tree from the package tree. I have a binary categorical response (called Label) and 30 predictors. I fit a tree object using all predictors.
I got the following error message that I don't understand:
Error in as.data.frame.default(data, optional = TRUE) :
cannot coerce class "function" to a data.frame
The data is the file 'training' taken from this site.
This is what I did:
x <- read.csv("training.csv")
attach(x)
library(tree)
Tree <- tree(Label~., x, subset=sample(1:nrow(x), nrow(x)/2))
CV <- cv.tree(Tree,FUN=prune.misclass)
The error occurs when cv.tree calls model.frame. The 'call' element of the tree object must reference a data frame whose name is not also the name of a loaded function.
Thus, subsetting in the call to tree generates the error when cv.tree later uses the 'call' element of the tree object, and a data frame with a name like "df" would cause an error as well, because model.frame would take it to be the name of an existing function (here, the density of the F distribution from the stats package).
I think the problem is in the dependent variable list. The following works, but I think you need to read the problem description more carefully. First, setup the formula without weight.
x <- read.csv("training.csv")
vars<-setdiff(names(x),c("EventId","Label","Weight"))
fmla <- paste("Label", "~", vars[1], "+",
paste(vars[-c(1)], collapse=" + "))
Here's what you've been running:
Tree <- tree(fmla, x, subset=sample(1:nrow(x), nrow(x)/2))
plot(Tree)
Subsetting the data frame before calling tree works with cv.tree:
urows = sample(1:nrow(x), nrow(x)/2)
x_sub <- x[urows,]
Tree <- tree(fmla, x_sub)
plot(Tree)
CV <- cv.tree(Tree,FUN=prune.misclass)
CV
$size
[1] 6 5 4 3 1
$dev
[1] 25859 25859 27510 30075 42725
$k
[1] -Inf 0.0 1929.0 2791.0 6188.5
$method
[1] "misclass"
attr(,"class")
[1] "prune" "tree.sequence"
You may want to consider package rpart also:
library(rpart)
tr <- rpart(fmla, data=x_sub, method="class")
printcp(tr)
Classification tree:
rpart(formula = fmla, data = x_sub, method = "class")
Variables actually used in tree construction:
[1] DER_mass_MMC DER_mass_transverse_met_lep
[3] DER_mass_vis
Root node error: 42616/125000 = 0.34093
n= 125000
        CP nsplit rel error  xerror      xstd
1 0.153733      0   1.00000 1.00000 0.0039326
2 0.059274      2   0.69253 0.69479 0.0035273
3 0.020016      3   0.63326 0.63582 0.0034184
4 0.010000      5   0.59323 0.59651 0.0033393
If you include weight, then that is the only split.
vars<-setdiff(names(x),c("EventId","Label"))

R object not found, although returned using print

I don't understand the R output. It seems that my clearly defined object outcome is not found, although it is successfully used in sub-functions and printed. How is that possible?
My R code:
f.hazardratio <- function(input)
{
outcome <- c("A","B","C","D","E","F")
category <- c(rep("surv",2),rep("term",2),rep("lobw",2))
for(i in 1:length(outcome))
{
if(nrow(subset(input,input[,paste("out",category[i],sep=".")]==outcome[i]))>0)
{
lex <- f.lexis(data=input,
out=category[i],
out.case=outcome[i])
print(str(lex))
print(outcome[i])
print(head(subset(lex, lex.Xst=="A")))
print(head(subset(lex, lex.Xst==outcome[i])))
# nrow(subset(lex, lex.Xst==outcome[i])) is the value I am actually interested in; it causes the same error message as print(), which I only added to identify the problem
# code continues, but not shown ...
}
}
}
And the output:
Classes ‘Lexis’ and 'data.frame': 107455 obs. of 6 variables:
$ pre.time : num
$ lex.dur : num
$ lex.Xst : Factor w/ 3 levels
$ lex.Cst : Factor w/ 3 levels
[1] "A"
pre.time lex.dur lex.Xst lex.Cst
930 145 36 A vv
2255 273 14 A vv
4842 115 99 A vv
5127 260 30 A vv
5217 71 108 A v
5422 152 2 A vv
Error in eval(expr, envir, enclos) (from #32) : object 'outcome' not found
I have already tried to alter the type of variables from factor to character or vice versa and tried to define an intermediate, temporary variable tmp <- outcome[i]. Unfortunately, nothing has worked so far.
Replacing subset() using square brackets as suggested by Spacedman solved the problem.
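A minimal sketch of the square-bracket replacement, using a small made-up data frame in place of the real Lexis object (the toy data and the helper name count_state are hypothetical). With bracket indexing the condition is an ordinary expression, so outcome[i] is looked up like any other variable of the calling function:

```r
# Hypothetical stand-in for the Lexis object returned by f.lexis()
lex <- data.frame(lex.dur = c(36, 14, 99, 30),
                  lex.Xst = factor(c("A", "B", "A", "C")))

# Count the rows in a given state with [ ] instead of subset()
count_state <- function(data, state) {
  # which() drops NA comparisons so they do not produce all-NA rows
  rows <- data[which(data$lex.Xst == state), , drop = FALSE]
  nrow(rows)
}

outcome <- c("A", "B", "C", "D")
counts <- vapply(outcome, function(o) count_state(lex, o), integer(1))
counts
```

In the loop from the question, the corresponding expression would be nrow(lex[which(lex$lex.Xst == outcome[i]), , drop = FALSE]).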
Welcome to functional programming. If you want a function to return a particular value, wrap that value in return(). Otherwise the value returned is simply the result of the last evaluation. All of the variables created within the function are inaccessible from outside and are later garbage-collected. Calling print may or may not give you exactly the object: some authors pass the object to be printed on to summary.class (where class is an attribute of the object) first and do not return an exact copy of the argument.

Sapply is turning a vector into a vector of vectors.. I think

Honestly I am unsure if the title accurately describes what's happening, but here it goes.
Suppose I start with the following object "Tempcheck"
> str(Tempcheck)
'data.frame': 1872 obs. of 3 variables:
$ Time : POSIXlt, format: "2013-07-10 14:26:40" "2013-07-10 14:26:43" "2013-07-10 14:26:50" "2013-07-10 14:26:53" ...
$ rawTemp : int 107461 108551 109940 110258 110740 110890 111096 111164 111238 111296 ...
$ rawConductivity: int 969903 1287631 1298627 1292063 1303909 1297249 1305610 1297557 1305070 1298703 ...
I then call a function and use sapply to normalize some data.
TCalibration<- function(x){ #this function normalizes data based on the calculated y intercept and slope
dc <- (x*((Tempcor[[2]])))+((Tempcor[[1]])) # y = 1/m*x + -1/b
dc <- dc[[1]]
}
##calibrates rawTemp into real temp
Tempcheck$Temp <- sapply(Tempcheck[[2]],TCalibration)
Tempcor is a previous object that stores coefficients from a linear model. If this is relevant I can post it later.
> str(Tempcheck)
'data.frame': 1872 obs. of 4 variables:
$ Time : POSIXlt, format: "2013-07-10 14:26:40" ...
$ rawTemp : int 107461 108551 109940 110258 110740 110890 111096 111164 111238 111296 ...
$ rawConductivity: int 969903 1287631 1298627 1292063 1303909 1297249 1305610 1297557 1305070 1298703 ...
$ Temp : num 23.6 23.9 24.3 24.4 24.5 ...
This is all fine and dandy! UNTIL ....
I call another function
ConductivityCorrection <- function(x){
t <- 1+.02*(Tempcheck$Temp-25)
EC25 <- (x/t)
}
Then use sapply again to Tempcheck
Tempcheck$rawCEC <-sapply(Tempcheck[[3]] ,ConductivityCorrection)
I was expecting to get the same thing that I got with the previous line of code, but something strange happened.
str(Tempcheck$rawCEC)
num [1:1872, 1:1872] 998390 991974 983917 982090 979335 ...
The length of this vector is 1872^2, which I thought was odd. My suspicion is that it comes from the line
t <- 1+.02*(Tempcheck$Temp-25)
I know I could do this a different way, but I'm trying to force myself to use the apply family and learn it better. Any help would be appreciated. Thank you!
I am aware that this piece of code solves my problem.
Tempcheck$alphaT <- 1+.02*(Tempcheck$Temp-25)
Tempcheck$rawCEC <- Tempcheck[[3]]/Tempcheck$alphaT
I was looking for a way to turn this into a function and apply it to each element of the column Tempcheck[[3]].
The issue is that Tempcheck$Temp in your ConductivityCorrection function is a vector so t is a vector and thus x/t also returns a vector. Instead you can use mapply or sapply(seq_along(Tempcheck[[3]]), ...) and index both accordingly.
ConductivityCorrection <- function(x){
t <- 1+.02*(Tempcheck$Temp[x]-25)
EC25 <- (Tempcheck$rawConductivity[x]/t)
}
sapply(seq_along(Tempcheck$Temp), ConductivityCorrection)
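The mapply route mentioned above can be sketched as follows. It walks over the conductivity and temperature columns in parallel, so each call sees one value from each. The three-row Tempcheck and the function name correct_ec are made up for illustration:

```r
# Hypothetical miniature of Tempcheck with already calibrated temperatures
Tempcheck <- data.frame(rawConductivity = c(1000, 1100, 1200),
                        Temp            = c(25, 30, 20))

# Temperature correction to 25 degrees for one (conductivity, temperature) pair
correct_ec <- function(ec, temp) {
  ec / (1 + 0.02 * (temp - 25))
}

# mapply() pairs the i-th conductivity with the i-th temperature
rawCEC <- mapply(correct_ec, Tempcheck$rawConductivity, Tempcheck$Temp)

length(rawCEC)  # 3: one value per row, not a 3 x 3 matrix
```

Because correct_ec takes the temperature as an argument instead of reading the whole Tempcheck$Temp column, the result has one element per row.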
Generally, if you're applying a function to every row in a data.frame, you can vectorize your solution and skip apply functions altogether:
Tempcheck$Temp <- Tempcheck$rawTemp * Tempcor[[2]] + Tempcor[[1]]
Tempcheck$rawCEC <- Tempcheck$rawConductivity / (1 + 0.02 * (Tempcheck$Temp - 25))
However, for simpler functions like these, I really like the data.table syntax:
library(data.table)
DT <- data.table(Tempcheck)
DT[, rawCEC := rawConductivity / (1 + 0.02 * (Temp - 25))]

Resources