Please have a look at the factor scores returned by the robCompositions package in this example:
data(expenditures)
x <- expenditures
res.rob <- pfa(x, factors=1, score="regression")
According to the pfa help page, since no covariance is specified, the covariance is estimated internally from isometric log-ratio (ilr) transformed data, but the data used for the factor analysis are back-transformed to the "clr" space.
So the clr-transformed data can be obtained as follows:
# ilr transformation
ilr <- function(x){
  x.ilr <- matrix(NA, nrow = nrow(x), ncol = ncol(x) - 1)
  for (i in 1:ncol(x.ilr)){
    x.ilr[, i] <- sqrt(i/(i + 1)) *
      log(((apply(as.matrix(x[, 1:i]), 1, prod))^(1/i)) / (x[, i + 1]))
  }
  return(x.ilr)
}
# construct orthonormal basis
# (matrix with ncol(x) rows and ncol(x)-1 columns)
V <- matrix(0, nrow = ncol(x), ncol = ncol(x) - 1)
for (i in 1:ncol(V)){
  V[1:i, i] <- 1/i
  V[i + 1, i] <- -1
  V[, i] <- V[, i] * sqrt(i/(i + 1))
}
z <- ilr(x)      # ilr-transformed data
y <- z %*% t(V)  # clr-transformed data
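A quick sanity check on the transformation, relying on the fact that clr coordinates sum to zero within each row:
summary(rowSums(y))  # should be (numerically) zero for every row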
Now the factor scores using the regression method might be calculated as follows:
loa <- c(0.970, 0.830, 0.986, 0.876, 0.977)  # loadings taken from the res.rob object
facscores <- y %*% loa
head(facscores)
-0.009485110
0.009680645
0.008426665
-0.015401000
-0.003610644
-0.004584145
But calling res.rob$scores returns:
head(res.rob$scores)
Factor1
-755.2681
705.5309
4196.5652
-778.6955
-628.2141
-663.4534
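For comparison, here is a sketch of the textbook regression (Thomson) score estimator, which works on standardized data and weights the loadings by the inverse correlation matrix; because clr coordinates are singular, a Moore-Penrose inverse from MASS is used here, and it is only an assumption on my part that pfa() does anything comparable internally:
library(MASS)                   # for ginv(); cor(y) is singular in clr space
Z <- scale(y)                   # standardized clr-transformed data
R <- cor(y)                     # correlation matrix of the clr data
f.reg <- Z %*% ginv(R) %*% loa  # regression-method factor scores
head(f.reg)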
So please check: am I wrong, or is there a bug in the pfa command?
Yours,
Hamid
I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship between these F2 values and 3 other variables and, if there is one, to determine which variable is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes"/"no" answers or whether responses are positive, neutral, or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a t-test but am unsure whether this will give me what I need.
A quick solution might look like this. Here, the cor function is used; read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below returns the variable with the highest Pearson correlation with respect to the reference variable.
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
  val <- sapply(vars, cor, y = ref)  # Pearson correlation of each variable with ref
  val[which.max(val)]                # keep the variable with the largest correlation
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)
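Since the three predictors are categorical codes rather than true numeric measurements, a complementary option, closer to the ANOVA mentioned in the question, is a linear model with factor predictors; their F-tests indicate which variable matters most. This is only a sketch with made-up column names (f2, q1, q2, q3) and the dummy ref from above standing in for the F2 measurements:
# Hypothetical layout: f2 is the measured F2 value, q1-q3 are the coded answers
set.seed(222)
dat <- data.frame(f2 = ref,
                  q1 = factor(sample(1:2, 100, replace = TRUE)),  # yes/no
                  q2 = factor(sample(1:3, 100, replace = TRUE)),  # positive/neutral/negative
                  q3 = factor(sample(1:3, 100, replace = TRUE)))
fit <- lm(f2 ~ q1 + q2 + q3, data = dat)
anova(fit)    # one F-test per variable: the smallest p-value marks the most significant one
summary(fit)  # level-by-level contrasts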
I would like to automatically produce linear regressions for a data frame for each category separately.
My data frame includes one column with time categories, one column (slope$Abs) as the dependent variable, and several columns that should be used as independent variables.
head(slope)
timepoint Abs In1 In2 In3 Out1 Out2 Out3 ...
1: t0 275.0 2.169214 2.169214 2.169214 2.069684 2.069684 2.069684
2: t0 275.5 2.163937 2.163937 2.163937 2.063853 2.063853 2.063853
3: t0 276.0 2.153298 2.158632 2.153298 2.052088 2.052088 2.057988
4: ...
All in all, for each timepoint I have 40 variables, and I want to end up with a linear regression for each combination, such as In1 ~ Abs at t0, In1 ~ Abs at t1, and so on for each column.
Of course I can do this manually, but I guess there must be a more elegant way to do the work.
I did my research and found out that dlply() might be the function I'm looking for. However, my attempt results in an error.
So I tried to combine the answers from previous questions I had found: one on individual variables per column and one on subsets per category.
I came up with a function like this:
lm.fun <- function(x) {summary(lm(x ~ slope$Abs, data=slope))}
lm.list <- dlply(.data=slope, .variables=slope$timepoint, .fun=lm.fun )
But I get the following error:
Error in eval.quoted(.variables, data) :
envir must be either NULL, a list, or an environment.
Hope someone can help me out.
Thanks a lot in advance!
The dplyr package in R does not, based on my research, do well at accepting formulas of the form y ~ x in its functions. So the alternative is to calculate the regressions somewhat manually. First note that slope = cor(x, y) * sd(y) / sd(x) (reference: http://faculty.cas.usf.edu/mbrannick/regression/regbas.html) and that intercept = mean(y) - slope * mean(x). Simple linear regression uses the centroid as the point of reference when finding the intercept because it is an unbiased estimator; using a single data point will only get you the intercept implied by that point and not the overall intercept.
Now for this explanation, I will be using the mtcars data set. I only wanted a subset of the data so I am using variables c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec') to basically mimic your dataset. In my example, my grouping variable is 'cyl', which is the equivalent of your 'timepoint' variable. The variable 'mpg' is the y-variable in this case, which is equivalent to 'Abs' in your data.
Based on my explanation of slope and intercept above, it is clear that we need three tables/datasets: a correlation dataset for your y with respect to your x for each group, a standard deviation table for each variable and group, and a table of means for each group and each variable.
To get the correlation dataset, we want to group by 'cyl' and calculate the correlation coefficients. For that, you should use:
library(dplyr)
df <- mtcars[c('mpg', 'cyl', 'disp', 'hp', 'drat', 'wt', 'qsec')]
corrs <- data.frame(df %>% group_by(cyl) %>%
                      do(head(data.frame(cor(.[, c(1, 3:7)])), n = 1)))
Because of the way my dataset is structured, the second variable (df[, 2]) is 'cyl'. For your data, you should use
do(head(data.frame(cor(.[, 2:40])), n = 1))
since your first column is the grouping variable and it is not numeric. Essentially, you want to go across all numeric variables. Not using head will produce the full correlation matrix, but since you are interested in the slope of each x-variable independently of the others, you only need the row that has the correlation coefficient of your y-variable equal to 1 (r_yy = 1).
To get the standard deviations and means for each group and each variable, use
sds <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(sd)))
means <- data.frame(df %>% group_by(cyl) %>% summarise_each(funs(mean)))
Your group names will be the first column, so make sure to rename your rows for each dataset corrs, sds, and means and delete column 1.
rownames(corrs) <- rownames(means) <- rownames(sds) <- corrs[ ,1]
corrs <- corrs[ ,-1]; sds <- sds[ ,-1]; means <- means[ ,-1]
Now we need to calculate sd(y)/sd(x). The best way I have done this, and seen it done, is with an apply-family function.
sdst <- data.frame(t(apply(sds, 1, function(X) X[1]/X)))  # sd(y)/sd(x) for every x, per group
I use X[1] because the first variable in sds is my y-variable. In your data, the first variable after you have deleted timepoint is Abs, which is your y-variable, so use that.
Now the rest is pretty straightforward. Since everything is saved as a data frame, to find the slopes all you need to do is
slopes <- sdst*corrs    # slope = cor(x, y) * sd(y)/sd(x)
inter <- slopes*means   # slope * mean(x), per group
intercept <- data.frame(t(apply(inter, 1, function(x) x[1] - x)))  # mean(y) - slope * mean(x)
Again, since our y-variable is in the first column, we use x[1]. To check that all is well, the slope for your y-variable should be 1 and its intercept should be 0.
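One way to verify the tables, assuming the row names are the cyl values set above, is to compare a single group and predictor against what lm() reports:
# Sanity check for one group (cyl == 4) and one predictor (disp)
coef(lm(mpg ~ disp, data = subset(mtcars, cyl == 4)))
slopes["4", "disp"]      # should equal the lm() slope
intercept["4", "disp"]   # should equal the lm() intercept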
I have solved the issue with a simpler approach, so I wanted to update the answer.
To make life easier I reshaped the data frame so that all measurement columns are stacked into rows, using the melt() function from the reshape package.
library(reshape)
slope <- melt(slope, id = c("Abs", "timepoint"), variable_name = "Sites")
The stacked measurement column is named "value" by default.
Then create one column that combines both grouping variables with paste():
slope$FullTreat <- paste(slope$Sites, slope$timepoint, sep = "_")
Run a function through the dataset to create separate models for each treatment combination.
library(plyr)
models <- dlply(slope, ~ FullTreat, function(df) {
  lm(value ~ Abs, data = df)
})
To extract the coefficients, simply run
coefs <- ldply(models, coef)
Then split the FullTreat column back into separate columns with colsplit(), also from reshape, and bind the intercept and slope columns to the new data frame:
coefs <- cbind(colsplit(coefs$FullTreat, split="_",
c("Sites","Timepoint")), coefs[,2:3])
I haven't worked on a function that plots all the regressions from the models, but I guess this is feasible with the ldply() function.
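A possible sketch for that plotting step, assuming ggplot2 is acceptable; geom_smooth() simply refits the same per-combination regressions for display:
# Sketch: one fitted line per FullTreat combination
library(ggplot2)
ggplot(slope, aes(x = Abs, y = value)) +
  geom_point(size = 0.5) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ FullTreat)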
I am experiencing difficulties estimating a BMA-model via glib(), due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij);
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both as numeric levels and as factors, but of course not in the same model. Since I can include i and j each as a factor, as a numeric level, or not at all, and k either as a numeric level or not at all, there are a total of 18 (3 x 3 x 2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
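A quick look at the resulting columns confirms the layout that the model specifications below rely on (9 i-dummies, 9 j-dummies, then k, i, j):
ncol(X)      # 21 columns: 9 i.factor dummies, 9 j.factor dummies, k, i, j
colnames(X)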
Next, the following matrix specifies which variables to consider in each of the 18 models. As far as I can tell, no linear dependence exists within any of these specifications (a quick glm() check is sketched after the matrix).
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
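A minimal check of that claim, assuming each model is fitted with an intercept as glm() does here, is to fit every specification separately and see whether any coefficients come out aliased (NA), which would indicate rank deficiency within that individual model:
# Fit each of the 18 specifications on its own and flag aliased (NA) coefficients
rank.ok <- apply(model.set, 1, function(m) {
  Xm <- as.matrix(X[, m == 1, drop = FALSE])
  if (ncol(Xm) == 0) return(TRUE)                    # intercept-only model
  fit <- glm(Cij ~ Xm, family = poisson(link = "log"))
  !any(is.na(coef(fit)))                             # FALSE would mean rank deficiency
})
rank.ok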
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function apparently first checks whether the full X matrix has full column rank, before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there any other way to include all 18 models in the glib() function?
Thank you in advance.
I'm looking to perform classification on data with mostly categorical features. For that purpose, Euclidean distance (or any other distance that assumes numerical features) is not a good fit.
I'm looking for a kNN implementation for R where it is possible to select different distance methods, such as Hamming distance.
Is there a way to use common kNN implementations like the one in {class} with different distance metric functions?
I'm using R 2.15
As long as you can calculate a distance/dissimilarity matrix (in whatever way you like), you can easily perform kNN classification without the need for any special package.
# Generate dummy data
y <- rep(1:2, each=50) # True class memberships
x <- y %*% t(rep(1, 20)) + rnorm(100*20) < 1.5 # Binary (logical) dataset with 20 variables
design.set <- sample(length(y), 50)
test.set <- setdiff(1:100, design.set)
# Calculate distance and nearest neighbors
library(e1071)
d <- hamming.distance(x)
NN <- t(apply(d[test.set, design.set], 1, order)) # row i: design-set indices ordered by distance to test case i
# Predict class membership of the test set
k <- 5
pred <- apply(NN[, 1:k, drop=FALSE], 1, function(nn){
  tab <- table(y[design.set][nn])
  as.integer(names(tab)[which.max(tab)]) # This is a pretty dirty line
})
# Inspect the results
table(pred, y[test.set])
If anybody knows a better way of finding the most common value in a vector than the dirty line above, I'd be happy to know.
The drop=FALSE argument is needed to preserve the subset of NN as a matrix in the case k=1. If not, it will be converted to a vector and apply will throw an error.
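If the features are a mix of categorical and numeric variables, one option (an assumption on my part, not something required above) is to swap the Hamming distance for a Gower dissimilarity from the cluster package; the rest of the code stays the same:
# Alternative dissimilarity for mixed/categorical feature types (Gower, cluster package)
library(cluster)
x.df <- data.frame(lapply(as.data.frame(x), factor))  # treat each column as nominal
d <- as.matrix(daisy(x.df, metric = "gower"))
NN <- t(apply(d[test.set, design.set], 1, order))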