R - Extract coefficients from a factor of lm object using conditions - r

I have fitted a lm with the following code:
Eq1_females <- lm(earnings ~ event_time + factor(age) + factor(year) - 1, data=females)
Now, I would like to calculate a predicted value based on the factor coefficients, but this predicted value depends on certain conditions in the data. I therefore create a list of the coefficients and I now want to extract the factor coefficients if age = k and year = y, but it keeps returning 0 or NA. However, if I input a number (e.g. 34) instead of k, it does give the right value. I tried two different codes:
estimates <- coef(Eq1_females)
k = females$age[1]
Eq1_females$coefficients["factor(age)k"]
and
estimates <- coef(Eq1_females)
k = females$age[1]
beta_age = estimates[grep("^factor\\(age\\)k", names(estimates))]
(note that in the end, I would like to loop over different rows of females$age)
What does work, is calculating
beta_age = estimates[grep("^factor\\(age\\)34", names(estimates))]
Could anyone tell me if there is a way of also getting the code to work with k in the beta_age formula?
Thanks a lot in advance!

Answer
Append the value of k to the regex pattern using paste0 (the example below uses a factor(Petal.Width) term in place of factor(age)):
beta = estimates[grep(paste0("^factor\\(Petal.Width\\)", k), names(estimates))]
This returns:
factor(Petal.Width)0.2
3.764947
Rationale
In "^factor\\(age\\)k", the k inside the quotes is treated as the literal character k, not as the variable k. By using paste(..., sep = "") or paste0(...), you can append the value of k to the base pattern.
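Because the coefficient names are plain strings, the same idea also works without a regex: build the exact name with paste0 and index the coefficient vector by name. The sketch below is a minimal illustration on made-up data; the toy `females` data frame and its values are assumptions, not the asker's data. Indexing with a vector of names also covers the "loop over different rows" case in one step.

```r
# Toy stand-in for the asker's `females` data (hypothetical values)
set.seed(1)
females <- data.frame(
  age        = rep(c(30, 34, 40), each = 10),
  year       = rep(2001:2002, times = 15),
  event_time = rnorm(30)
)
females$earnings <- 100 + 2 * females$age + rnorm(30)

fit <- lm(earnings ~ event_time + factor(age) + factor(year) - 1, data = females)
estimates <- coef(fit)

# Single lookup: build the exact coefficient name from the value of k
k <- females$age[1]
beta_age <- estimates[paste0("factor(age)", k)]

# Vectorised lookup: one age coefficient per row, no explicit loop
beta_age_all <- estimates[paste0("factor(age)", females$age)]
```

An exact name lookup also avoids a pitfall of the unanchored pattern: "^factor\\(age\\)3" would match every level that starts with 3.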

Related

Mclust() - NAs in model selection

I recently tried to fit a GMM in R on a multivariate matrix (400 obs. of 196 variables) whose elements belong to known categories. The Mclust() function (from package mclust) gave very poor results (around 30% of individuals were well classified, whereas k-means reaches more than 90%).
Here is my code :
library(mclust)
X <- read.csv("X.csv", sep = ",", h = T)
y <- read.csv("y.csv", sep = ",")
nclusters <- 5 # I want 5 clusters
gmm <- Mclust(X, G = nclusters)
cl_gmm <- gmm$classification
cl_gmm_lab <- cl_gmm
for (k in 1:nclusters){
ii = which(cl_gmm == k) # individuals of group k
counts=table(y[ii]) # number of occurences for each label
imax = which.max(counts) # Majority label
maj_lab = attributes(counts)$dimnames[[1]][imax]
print(paste("Group ",k,", majority label = ",maj_lab))
cl_gmm_lab[ii] = maj_lab
}
conf_mat_gmm <- table(y,cl_gmm_lab) # CONFUSION MATRIX
The problem seems to come from the fact that every other model than "EII" (spherical, equal volume) is "NA" when looking at gmm$BIC.
Until now I did not find any solution to this problem...are you familiar with this issue?
Here is the link for the data: https://drive.google.com/file/d/1j6lpqwQhUyv2qTpm7KbiMRO-0lXC3aKt/view?usp=sharing
Here is the link for the labels: https://docs.google.com/spreadsheets/d/1AVGgjS6h7v6diLFx4CxzxsvsiEm3EHG7/edit?usp=sharing&ouid=103045667565084056710&rtpof=true&sd=true
Answer
I finally found the answer. GMMs simply cannot fit every covariance model when too many explanatory variables are involved (here 196 variables for only 400 observations). The right thing to do is to reduce the dimensionality first, choosing a number of dimensions small enough for the GMM to be fitted properly while preserving as much information about the data as possible.
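One possible way to do the reduction is PCA followed by Mclust() on the leading component scores. The sketch below uses simulated data, and the 80% variance cutoff, the variable names and the cluster count are all illustrative assumptions rather than recommended settings. The clustering step only runs if mclust is installed.

```r
# Simulated stand-in for the real data: 400 observations, 50 variables, 3 groups
set.seed(123)
n <- 400; p <- 50
labels  <- rep(1:3, length.out = n)
centers <- matrix(rnorm(3 * p, sd = 3), nrow = 3)
X <- matrix(rnorm(n * p), nrow = n) + centers[labels, ]

# Step 1: PCA on centred and scaled variables
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Step 2: keep the smallest number of components explaining >= 80% of the variance
# (the 80% threshold is an arbitrary choice for this sketch)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_comp <- which(cumvar >= 0.8)[1]
scores <- pca$x[, seq_len(n_comp), drop = FALSE]

# Step 3: fit the GMM on the reduced data, if mclust is available
if (requireNamespace("mclust", quietly = TRUE)) {
  library(mclust)
  gmm <- Mclust(scores, G = 3)
  print(table(labels, gmm$classification))  # compare clusters with known labels
}
```

Inspecting gmm$BIC after the reduction shows which covariance models could be estimated at that dimensionality.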

What does lag(log(emp), 1:2) mean when using pgmm function?

I tried an example of the pgmm function from the plm package. The code is as follows:
library(plm)
data("EmplUK", package = "plm")
## Arellano and Bond (1991), table 4 col. b
z1 <- pgmm(log(emp) ~ lag(log(emp), 1:2) + lag(log(wage), 0:1)
+ log(capital) + lag(log(output), 0:1) | lag(log(emp), 2:99),
data = EmplUK, effect = "twoways", model = "twosteps")
summary(z1, robust = FALSE)
I am not sure about the meaning of lag(log(emp), 1:2) and of lag(log(emp), 2:99). Does lag(log(emp), 1:2) mean the one-unit and two-unit lagged values of log(emp), and lag(log(emp), 2:99) the lagged values of log(emp) from two units up to 99 units?
Also, I sometimes get an error when running the summary step, although at other times the identical code runs without error:
Error in !class_ind : invalid argument type
Can anyone help me with these problems?
Answer
log, a base R function, gives you the (natural) logarithm, in this case of the variable emp.
lag of package plm can be given a second argument, called k, like in your example. By looking at ?plm::lag.plm it becomes clear: k is
an integer, the number of lags for the lag and lead methods (can also
be negative). For the lag method, a positive (negative) k gives lagged
(leading) values. For the lead method, a positive (negative) k gives
leading (lagged) values, thus, lag(x, k = -1) yields the same as
lead(x, k = 1). If k is an integer with length > 1 (k = c(k1, k2,
...)), a matrix with multiple lagged pseries is returned
Thus, instead of typing lag twice to get the first and second lags:
lag(<your_variable>, 1)
lag(<your_variable>, 2)
one can simply type
lag(<your_variable>, k = 1:2)
or, without the named argument,
lag(<your_variable>, 1:2)
Setting k to 2:99 gives you the 2nd to 99th lags.
The number refers to the number of time periods the lagging is applied to, not to the number of individuals (units) as the lagging is applied to all individuals.
You may want to run the example in ?plm::lag.plm to aid understanding of that function.
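To make the "time periods, not individuals" point concrete, here is a base-R sketch of what per-individual lags 1 and 2 look like on a toy panel. It is only an illustration of the idea, not plm's implementation; the data and the helper shift_by() are made up for this example, and the rows are assumed to be sorted by time within each individual.

```r
# Hypothetical toy panel: 2 individuals observed over 4 periods each
panel <- data.frame(
  id   = rep(c("A", "B"), each = 4),
  year = rep(2001:2004, times = 2),
  x    = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Helper (defined here, not part of plm): shift a vector down by k positions, padding with NA
shift_by <- function(v, k) c(rep(NA, k), head(v, -k))

# Lag within each individual so that values never carry over from A to B
lags <- sapply(1:2, function(k) {
  ave(panel$x, panel$id, FUN = function(v) shift_by(v, k))
})
colnames(lags) <- c("lag1", "lag2")

cbind(panel, lags)
```

Each individual loses its first observation for lag 1 and its first two for lag 2, which is why a range such as 2:99 simply means "as many lags as the panel length allows".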

R function to find which of 3 variables correlates most with another value?

I am conducting a study that analyzes speakers' production and measures their average F2 values. What I need is an R function that allows me to find a relationship for these F2 values with 3 other variables, and if there is, which one is the most significant. These variables have been coded as 1, 2, or 3 for things like "yes" "no" answers or whether responses are positive, neutral or negative (1, 2, 3 respectively).
Is there a particular technique or R function/test that we can use to approach this problem? I've considered using ANOVA or a T-Test but am unsure if this will give me what I need.
Answer
A quick solution might look like this. Here, the cor function is used. Read its help page (?cor) to understand what is calculated. By default, the Pearson correlation coefficient is used. The function below returns the variable with the highest Pearson correlation with the reference variable. Note that which.max picks the largest signed value, so a strong negative correlation would be missed unless you apply it to abs(val).
set.seed(111)
x <- rnorm(100)
y <- rnorm(100)
z <- rnorm(100)
ref <- 0.5*x + 0.5*rnorm(100)
find_max_corr <- function(vars, ref){
val <- sapply(vars, cor, y = ref)
val[which.max(val)]
}
find_max_corr(list('x' = x, 'y' = y, 'z' = z), ref)
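Since the three predictors are category codes (1, 2, 3) rather than measurements, a one-way ANOVA per variable, as the question suggests, may be a better fit than Pearson correlation. The sketch below is one possible approach on simulated data; the variable names, the F2 values and the effect sizes are all invented for illustration.

```r
# Simulated stand-in data: three coded variables and an F2 measurement
set.seed(42)
n <- 90
dat <- data.frame(
  v1 = sample(1:3, n, replace = TRUE),
  v2 = sample(1:3, n, replace = TRUE),
  v3 = sample(1:3, n, replace = TRUE)
)
# By construction, the hypothetical F2 values depend only on v2
dat$f2 <- 1500 + 100 * dat$v2 + rnorm(n, sd = 50)

# Fit a one-way ANOVA of F2 on one coded variable, treated as a factor
fit_one <- function(var) {
  fit <- lm(dat$f2 ~ factor(dat[[var]]))
  c(r_squared = summary(fit)$r.squared,
    p_value   = anova(fit)[1, "Pr(>F)"])
}

results <- t(sapply(c("v1", "v2", "v3"), fit_one))
results[order(results[, "p_value"]), ]  # strongest association first
```

Comparing R-squared values indicates which variable explains the most variance in F2; with three separate tests, the p-values should be interpreted with a multiple-comparison correction in mind.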

R: multicollinearity issues using glib(), Bayesian Model Averaging (BMA-package)

I am experiencing difficulties estimating a BMA-model via glib(), due to multicollinearity issues, even though I have clearly specified which columns to use. Please find the details below.
The data I'll be using for the estimation via Bayesian Model Averaging:
Cij <- c(357848,766940,610542,482940,527326,574398,146342,139950,227229,67948,
352118,884021,933894,1183289,445745,320996,527804,266172,425046,
290507,1001799,926219,1016654,750816,146923,495992,280405,
310608,1108250,776189,1562400,272482,352053,206286,
443160,693190,991983,769488,504851,470639,
396132,937085,847498,805037,705960,
440832,847631,1131398,1063269,
359480,1061648,1443370,
376686,986608,
344014)
n <- length(Cij);
TT <- trunc(sqrt(2*n))
i <- rep(1:TT,TT:1); #row numbers: year of origin
j <- sequence(TT:1) #col numbers: year of development
k <- i+j-1 #diagonal numbers: year of payment
#Since k=i+j-1, we have to leave out another dummy in order to avoid multicollinearity
k <- ifelse(k == 2, 1, k)
I want to evaluate the effect of i and j both via levels and factors, but of course not in the same model. Since I can decide to include i and j as factors, levels, or not include them at all and for k either to include as level, or exclude, there are a total of 18 (3x3x2) models. This brings us to the following data frame:
X <- data.frame(Cij,i.factor=as.factor(i),j.factor=as.factor(j),k,i,j)
X <- model.matrix(Cij ~ -1 + i.factor + j.factor + k + i + j,X)
X <- as.data.frame(X[,-1])
Next, the following declaration specifies which variables to consider in each of the 18 models. As far as I can tell, none of these specifications contains a linear dependence.
model.set <- rbind(
c(rep(0,9),rep(0,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,0),
c(rep(0,9),rep(0,9),0,0,1),
c(rep(0,9),rep(0,9),1,0,0),
c(rep(1,9),rep(0,9),0,0,0),
c(rep(0,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),0,1,1),
c(rep(0,9),rep(0,9),1,1,0),
c(rep(0,9),rep(1,9),0,1,0),
c(rep(0,9),rep(0,9),1,0,1),
c(rep(1,9),rep(0,9),0,0,1),
c(rep(1,9),rep(0,9),1,0,0),
c(rep(0,9),rep(1,9),1,0,0),
c(rep(1,9),rep(1,9),0,0,0),
c(rep(0,9),rep(0,9),1,1,1),
c(rep(0,9),rep(1,9),1,1,0),
c(rep(1,9),rep(0,9),1,0,1),
c(rep(1,9),rep(1,9),1,0,0))
Then I call the glib() function, telling it to select the specified columns from X according to model.set.
library(BMA)
model.glib <- glib(X,Cij,error="poisson", link="log",models=model.set)
which results in the error
Error in glim(x, y, n, error = error, link = link, scale = scale) : X matrix is not full rank
The function first checks whether the whole matrix X has full column rank, before it evaluates which columns to select from X via model.set. How do I circumvent this, or is there any other way to include all 18 models in the glib() function?
Thank you in advance.
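The rank check can be reproduced outside glib() to see where the dependence comes from. The sketch below rebuilds the design on a smaller triangle (TT = 4 is an assumption made for brevity; the response is omitted because it does not affect the rank) and compares the rank of the full matrix with its number of columns.

```r
# Smaller run-off triangle than in the question (TT = 4 chosen for brevity)
TT <- 4
i <- rep(1:TT, TT:1)        # row numbers: year of origin
j <- sequence(TT:1)         # column numbers: year of development
k <- i + j - 1              # diagonal numbers: year of payment
k <- ifelse(k == 2, 1, k)   # same recoding as in the question

d <- data.frame(i.factor = factor(i), j.factor = factor(j), k, i, j)
X <- model.matrix(~ -1 + i.factor + j.factor + k + i + j, d)
X <- X[, -1]

# The numeric i and j are linear combinations of the dummy columns plus a constant,
# so the matrix containing all candidate columns at once cannot have full column rank
rank_X <- qr(X)$rank
c(rank = rank_X, columns = ncol(X))
```

So the full X is rank deficient even though each row of model.set, taken on its own, may select a full-rank subset of columns.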

Maximum first derivative in for values in a data frame R

Good day, I am looking for some help in processing my dataset. I have 14000 rows and 500 columns and I am trying to get the maximum value of the first derivative for individual rows in different column groups. I have my data saved as a data frame with the first column being the name of a variable. My data looks like this:
Species Spec400 Spec405 Spec410 Spec415
1 AfricanOilPalm_1_Lf_1 0.2400900 0.2318345 0.2329633 0.2432734
2 AfricanOilPalm_1_Lf_10 0.1783162 0.1808581 0.1844433 0.1960315
3 AfricanOilPalm_1_Lf_11 0.1699646 0.1722618 0.1615062 0.1766804
4 AfricanOilPalm_1_Lf_12 0.1685733 0.1743336 0.1669799 0.1818896
5 AfricanOilPalm_1_Lf_13 0.1747400 0.1772355 0.1735916 0.1800227
For each of the variables in the Species column, I want to get the maximum derivative from Spec495 to Spec550, for example. This is what I did before I ran into errors.
x<-c(495,500,505,510,515,520,525,530,535,540,545,550)##get x values of reflectance (Spec495 to Spec550)
y.data.f<-hsp[,21:32]##get row values for the required columns
y<-as.numeric(y.data.f[1,])##convert to a vector, for just the first row of data
library(pspline) ##Using a spline so a derivative maybe calculated from a list of numeric values
I really wanted to avoid using a loop because of the time it takes, but this is the only way I know of thus far
for(j in 1:14900) {
  y<-as.numeric(y.data.f[j,])
  a1d<-max(predict(sm.spline(x, y), x, 1))
  write.table(a1d, file = "a1-d-appended.csv", sep = ",",
              col.names = FALSE, append=TRUE)
}
This loop runs up to the 7861st row and then I get this error:
Error in smooth.Pspline(x = ux, y = tmp[, 1], w = tmp[, 2], method = method, :
NA/NaN/Inf in foreign function call (arg 6)
I am sure there must be a way to avoid using a loop, maybe using the plyr package, but I can't figure out how to do so, nor which package would be best to get the value for maximum derivative.
Can anyone offer some insight or suggestions? Thanks in advance
Answer
First differences are the numerical analog of first derivatives when the x-dimension is evenly spaced. So something along the lines of:
which.max( diff( predict(sm.spline(x, y))$ysmth ) )
... will return the location of the maximum (positive) slope of the smoothed spline. If you want the maximal slope regardless of sign, wrap the diff() call in abs(). If you are having difficulties with non-finite values, then using an index built with is.finite will clear both Inf and NaN difficulties:
predy <- predict(sm.spline(x, y))$ysmth
predx <- predict(sm.spline(x, y))$x
is.na( predy ) <- !is.finite(predy)
plot(predx, predy, # NA values will not blow up R plotting function,
# ... just create discontinuities.
main ="First Derivative")
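To run this over every row without an explicit loop and without stopping at the first bad row, one option is apply() with a guard for non-finite values. The sketch below swaps in smooth.spline from base R's stats package (not pspline::sm.spline) so that it runs without extra packages; the reflectance matrix and the function name max_deriv are made up for illustration.

```r
# Wavelengths Spec495 to Spec550 in 5 nm steps
x <- seq(495, 550, by = 5)

# Hypothetical reflectance matrix: 5 spectra (rows) by 12 bands (columns)
set.seed(1)
refl <- t(replicate(5, 0.1 + cumsum(runif(length(x), 0, 0.01))))
refl[3, 4] <- NA  # simulate a row containing a missing value

# Maximum first derivative of a smoothing spline fitted to one spectrum;
# rows with NA/NaN/Inf return NA instead of raising an error
max_deriv <- function(y) {
  if (!all(is.finite(y))) return(NA_real_)
  fit <- smooth.spline(x, y)
  max(predict(fit, x, deriv = 1)$y)
}

result <- apply(refl, 1, max_deriv)
result
```

The rows that come back as NA are the ones to inspect in the original data, and the whole result can then be written to a file with a single write.table() call.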
