GAM distributed lag model with factor smooth interaction (by variable) - r

I'm trying to compare the climate response over the last 60 years of two subgroups of a plant (factor variable subgroups with 2 levels). The response of the two subgroups, which both grew on the same plots, is measured as the deviation from long-term growth (plant_growth). The available climate data are mean temperature (tmean) and mean precipitation (prec).
I formulated a distributed lag model using mgcv's gam() to test the hypothesis that the climate response differs between the plant subgroups:
library(mgcv)
climate_model <- gam(plant_growth ~ te(tmean, lag, by = subgroups) +
                       te(prec, lag, by = subgroups) +
                       te(tmean, prec, lag, by = subgroups),
                     data = plant_data)
plant_data is a list that contains tmean, prec and lag as separate numeric matrices, subgroups as factor variable which distinguishes between subgroup A and B, a character variable giving the ID of the plant, and the numeric measured plant_growth as vector.
The problem is, however, that factor by variables cannot be used with the matrix arguments from plant_data. The error message looks as follows:
Error in smoothCon(split$smooth.spec[[i]], data, knots, absorb.cons, scale.penalty = scale.penalty, :
factor `by' variables can not be used with matrix arguments.
I'm wondering if there is a way to include the factor variable subgroups into the distributed lag model so that a comparison between the two levels of the factor is possible.
I've already tried running two separate lag models for the two levels of subgroups. This works fine. However, I cannot really compare the predictions of the two models because the fit and the parameters of the smooths are different. Moreover, this approach treats the climate response of the two subgroups as if it were completely independent, which is not the case.
I was able to reproduce my problem with growth data from the treeclim package:
library("treeclim") #Data library
library("dplyr") #Data wrangling (%>%, mutate, left_join)
library("tidyr") #pivot_wider
data("muc_spruce") #Plant growth
data("muc_clim") #Climate data
#Format climate to wide
clim <- pivot_wider(muc_clim, names_from = month, values_from = c(temp,prec))
#Format the growth data and add three new growth time series
growth <- muc_spruce %>%
select(-samp.depth) %>%
mutate(year = as.numeric(row.names(muc_spruce))) %>%
mutate(ID = 1) %>%
rename("plant_growth" = "mucstd")
additional_growth <- data.frame()
for (i in c(1:3)){
A <- growth %>%
mutate(plant_growth = plant_growth + runif(nrow(muc_spruce), min = 0, max = 0.5)) %>%
mutate(ID = ID + i)
additional_growth <- rbind(additional_growth, A)
}
growth <- rbind(growth, additional_growth)
#Bring growth and climate data together
plant_data <- na.omit(left_join(growth, clim))
rm(A, growth, clim, muc_clim, muc_spruce, additional_growth, i) #clean
#Add the subgroups label
plant_data$subgroups <- as.factor(c(rep("A", nrow(plant_data)/2), rep("B", nrow(plant_data)/2)))
#Format for gam input
plant_data <- list(lag = matrix(1:12,nrow(plant_data),12,byrow=TRUE),
year = plant_data$year,
ID = plant_data$ID,
plant_growth = plant_data$plant_growth,
subgroups = as.factor(plant_data$subgroups),
tmean = data.matrix(plant_data[,c(4:15)]),
prec = data.matrix(plant_data[,c(16:27)]))

From ?mgcv::linear.functional.terms:
The mechanism is usable with random effect smooths which take factor arguments, by using a trick to create a 2D array of factors. Simply create a factor vector containing the columns of the factor matrix stacked end to end (column major order). Then reset the dimensions of this vector to create the appropriate 2D array: the first dimension should be the number of response data and the second the number of columns of the required factor matrix. You can not use matrix or data.matrix to set up the required matrix of factor levels. See example below:
## set up a `factor matrix'...
fac <- factor(sample(letters,n*2,replace=TRUE))
dim(fac) <- c(n,2)
You cannot create a factor matrix directly, though; you can create a factor and modify its dims afterwards.
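As a minimal base-R sketch of that trick applied to a two-level grouping factor: the sizes (6 observations, 12 lags) and the name subgroup_mat are made up for illustration. It only shows how to build the factor array; whether the result is accepted by te(..., by = ) in your model is not checked here.

```r
# Hypothetical sizes: 6 observations and 12 lag columns
n <- 6
n_lag <- 12
subgroups <- factor(rep(c("A", "B"), each = n / 2))

# Stack the factor column n_lag times end to end (column-major order) ...
subgroup_mat <- factor(rep(as.character(subgroups), times = n_lag),
                       levels = levels(subgroups))
# ... then reset the dimensions: rows = observations, columns = lags
dim(subgroup_mat) <- c(n, n_lag)

is.factor(subgroup_mat)  # still a factor, now with a dim attribute
dim(subgroup_mat)
```

In the question's setup, n would be the number of rows of tmean and n_lag its number of columns, so the factor array lines up with the lag matrices.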

Related

Weighted dataset after IPTW using weightit?

I'm trying to get a weighted dataset after IPTW using weightit. Unfortunately, I'm not even sure where to start. Any help would be appreciated.
library(WeightIt)
library(cobalt)
library(survey)
data("lalonde", package = "cobalt") # example data used below
W.out <- weightit(treat ~ age + educ + race + married + nodegree + re74 + re75,
data = lalonde, estimand = "ATT", method = "ps")
bal.tab(W.out)
# pre-weighting dataset
lalonde
# post-weighting dataset??
The weightit() function produces balance weights. In your case, setting method = "ps" will produce propensity scores that are transformed into weights. More details of how it produces those weights can be found with ?method_ps. You can extract the weights from your output and store them as a column in a data.frame via: data.frame(w = W.out[["weights"]]). The output is a vector of weights with a length equal to the number of non-NA rows in your data (lalonde).
What you actually mean by "weighted dataset" is ambiguous for two reasons. First, analyses that use those weights typically do not produce a new dataset; rather, they weight the contribution of each row to the likelihood. This is substantively different from analyzing a dataset in which each row's values have been multiplied by its weight, and it will produce different results for many models. Second, you are asking how to get a weighted dataset that has character vectors in columns. For example, lalonde$race is a character vector. Multiplying 5*"black" doesn't make much sense.
If you are indeed intent on multiplying every value in every row of your data by the row's respective weight, you will need to convert your race variable to numeric indicators, remove it from your data, then you can apply sweep():
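library(dplyr)
# turn race into numeric indicator columns, then drop the original column
df <- lalonde %>%
  mutate(black = if_else(race == "black", 1, 0),
         hispan = if_else(race == "hispan", 1, 0),
         white = if_else(race == "white", 1, 0)) %>%
  select(-race)
# the weights are per row, so sweep across rows (MARGIN = 1)
sweep(df, MARGIN = 1, W.out[["weights"]], `*`)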
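To make the difference between weighting rows and multiplying values concrete, here is a minimal base-R sketch on simulated data. The data frame toy and its weights w are made up for illustration; they are not lalonde or WeightIt output. The same regression is fitted twice, once with the weights passed to lm() and once on a copy whose values were multiplied by the weights.

```r
set.seed(1)

# Simulated data with hypothetical weights (not real IPTW weights)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50)
toy$w <- runif(50, min = 0.5, max = 2)

# (a) weights control each row's contribution to the fit
fit_weighted <- lm(y ~ x, data = toy, weights = w)

# (b) every value is multiplied by the row's weight, then fitted unweighted
scaled <- data.frame(x = toy$x * toy$w, y = toy$y * toy$w)
fit_scaled <- lm(y ~ x, data = scaled)

coef(fit_weighted)
coef(fit_scaled)
```

The two sets of coefficients do not agree, which is why storing the weights as an extra column and passing them to the modelling function is usually what "weighted dataset" should mean in practice.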

vegan::betadisper() extract distance and error associated with centroid

I am trying to construct a meta regression to look at distance between centroids across multiple independent monitoring datasets. To build that model, for each dataset I need to extract the distance to each centroid (each dataset has the same two grouping variables -- before, after), the number of points that went into calculating the centroid (n), and the standard deviation associated with each distance to centroid (sd). I'm using vegan::betadisper() to calculate the distance to each centroid, but I am not sure whether it is possible to extract a single unit of standard deviation associated with the centroid?
I've modified the dune dataset below as sample code. The 'Use' grouping variable has two levels: before, after.
rm(list=ls())
library (vegan)
library(dplyr)
# Species and environmental data
dune2.spe <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.spe.txt', row.names = 1)
dune2.env <- read.delim ('https://raw.githubusercontent.com/zdealveindy/anadat-r/master/data/dune2.env.txt', row.names = 1)
data (dune) # matrix with species data (20 samples in rows and 30 species in columns)
data (dune.env)# matrix of environmental variables (20 samples in rows and 5 environmental variables in columns)
#select two grouping levels for 'use'
dune_data <- cbind(dune2.spe,dune2.env)%>%
filter(Use=='Pasture'|Use=='Hayfield')
dune_data$Use <- recode_factor(dune_data$Use, 'Pasture'='Before')
dune_data$Use <- recode_factor(dune_data$Use, 'Hayfield'='After')
dune_sp <- dune_data%>%
dplyr::select(1:28)
dune_en <- dune_data%>%
dplyr::select(29:33)
#transform relative species counts
dune_rel <- decostand(dune_sp, method = "hellinger")
dune_distmat <- vegdist(dune_rel, method = "bray", na.rm=T)
(dune_disper <- betadisper(dune_distmat, type="centroid", group=dune_en$Use))
plot(dune_disper, label=FALSE)
I am trying to arrive at the following output:
Group   before_distance   n_before   sd_before   after_distance   n_after   sd_after
Dune    0.4009            5          ?           0.4314           7         ?
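One possible way to fill in the missing columns, sketched in base R: a betadisper object stores the per-sample distances to the group centroid in $distances and the grouping in $group, so the mean, count and standard deviation per group can be computed with tapply(). The numbers below are made-up stand-ins for dune_disper$distances and dune_disper$group, and treating the standard deviation of the within-group distances as the "sd associated with the centroid" is an assumption about what the meta-regression needs.

```r
# Hypothetical stand-ins for dune_disper$distances and dune_disper$group
distances <- c(0.35, 0.42, 0.40, 0.45, 0.38,
               0.41, 0.47, 0.39, 0.44, 0.46, 0.43, 0.42)
group <- factor(c(rep("Before", 5), rep("After", 7)),
                levels = c("Before", "After"))

# One row per group: mean distance to centroid, sample count, and SD
summary_tab <- data.frame(
  group    = levels(group),
  distance = as.vector(tapply(distances, group, mean)),
  n        = as.vector(tapply(distances, group, length)),
  sd       = as.vector(tapply(distances, group, sd))
)
summary_tab
```

With the real object, replacing the two stand-ins by dune_disper$distances and dune_disper$group should give the per-group means that betadisper prints as the average distance to centroid.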

Fama Macbeth Regression in R pmg

In the past few days I have been trying to find out how to do Fama-MacBeth regressions in R. The usual advice is to use the plm package with pmg; however, every attempt returns an error saying that I have an insufficient number of time periods.
My dataset consists of 2828419 observations with 13 columns of variables, on which I want to run multiple cross-sectional regressions.
My firms are identified by seriesid, I have a date variable, and I want to run the following Fama-MacBeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value + size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid"))
Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset as a data.table, a data.frame and a pdata.frame. Switching the index does not work either.
My data contains NAs as well.
Who can fix this, or find a different way for me to do Fama Macbeth?
This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful - it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" due to missing data being dropped.
You have two options: impute the missing data or drop observations with missing data (the latter being a quick test that the model works without missing points before deciding what you want to do that is valid for estimation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
EDIT: I dug into the source code to see where the error was triggered and here is what the issue is - again likely caused by missing data, but it does interact with your degrees of freedom:
...
# part of the code where error is triggered below, here is context:
# X = matrix of the RHS of your model including intercept, so X[,1] is all 1s
# k = number of coefficients used determined by length(coef(plm.model))
# ind = vector of ID values
# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[,1], ind, length))
# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate.
if (t < (k + 1))
stop("Insufficient number of time periods")
That is what is triggering your error. So imputation is definitely a solution, but there might be a single offender in your data and importantly, once this condition is satisfied your model will run just fine with missing data.
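The same check can be run by hand before calling pmg() to find the offending groups. This is a minimal base-R sketch on a made-up toy panel; the column names id, y and x are placeholders, and id stands for whichever variable you pass first in index.

```r
# Toy panel (made-up data): group "c" keeps only one complete row
panel <- data.frame(
  id = c("a", "a", "a", "b", "b", "b", "b", "c", "c"),
  y  = c(1.0, 1.2, 0.9, 2.1, 2.0, 2.3, 1.9, 0.5, NA),
  x  = c(0.1, 0.3, 0.2, 0.8, 0.7, 0.9, 0.6, 0.4, 0.2)
)

# Rows that survive once missing values in the model variables are dropped
used <- panel[complete.cases(panel[, c("y", "x")]), ]

# k = number of coefficients (intercept + one slope in y ~ x)
k <- 2

# Count the remaining rows per group and flag groups with fewer than k + 1
n_per_group <- table(used$id)
offenders <- names(n_per_group)[as.vector(n_per_group) < k + 1]
offenders
```

Dropping or repairing the flagged groups should be enough to get past the "Insufficient number of time periods" condition shown above.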
I eventually got the Fama-MacBeth regression working in R.
From a Data Table with all of the characteristics within the rows, the following works and gives the opportunity to equally weight or apply weights to the regression (remove the ",weights = marketcap" for equally weighted). totret is a total return variable, logmarket is the logarithm of market capitalization.
logmarket <- df %>%
  group_by(date) %>%
  # one cross-sectional regression per date; keep intercept, slope and R squared
  summarise(constant = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[1],
            rsquared = summary(lm(totret ~ logmarket, weights = marketcap))$r.squared,
            beta = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[2])
You obtain a DataFrame with monthly alphas (constant), betas (beta), the R squared (rsquared).
To retrieve coefficients with t-statistics in a dataframe:
library(lmtest) # provides coeftest()
Summarystatistics <- as.data.frame(matrix(data=NA, nrow=6, ncol=1))
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant","t-stat", "beta", "tstat", "R^2", "observations")
Summarystatistics[1,1] <- mean(logmarket$constant)
Summarystatistics[2,1] <- coeftest(lm(logmarket$constant~1))[1,3]
Summarystatistics[3,1] <- mean(logmarket$beta)
Summarystatistics[4,1] <- coeftest(lm(logmarket$beta~1))[1,3]
Summarystatistics[5,1] <- mean(logmarket$rsquared)
Summarystatistics[6,1] <- nrow(subset(df, !is.na(logmarket)))
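For readers who want to see the two Fama-MacBeth steps without dplyr, here is a minimal base-R sketch on simulated data. The data frame toy, its sizes and the true slope of 0.5 are made up for illustration, and the regressions are unweighted. Step one runs one cross-sectional regression per date; step two averages the coefficients over time and forms t-statistics from their time-series standard errors.

```r
set.seed(42)

# Simulated panel: 30 firms observed on 24 dates (hypothetical data)
toy <- expand.grid(firm = 1:30, date = 1:24)
toy$logmarket <- rnorm(nrow(toy))
toy$totret <- 0.5 * toy$logmarket + rnorm(nrow(toy))

# Step 1: one cross-sectional regression per date
by_date <- split(toy, toy$date)
coefs <- t(sapply(by_date, function(d) coef(lm(totret ~ logmarket, data = d))))

# Step 2: time-series mean of each coefficient and its t-statistic
fm_mean <- colMeans(coefs)
fm_tstat <- fm_mean / (apply(coefs, 2, sd) / sqrt(nrow(coefs)))

fm_mean
fm_tstat
```

The t-statistic computed this way equals the one from regressing the per-date coefficients on a constant, which is what the coeftest(lm(... ~ 1)) calls above do.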
Some values of "seriesid" appear in only one row, which is why pmg gives the error. Something like the following (substituting the variable names you use) removes them and stops the error:
try2 <- try2 %>%
group_by(cusip) %>%
mutate(flag = (if (length(cusip)==1) {1} else {0})) %>%
ungroup() %>%
filter(flag == 0)

Looking for analysis that clusters like SIMPROF, but allows for many observations per category

I need to run a clustering or similarity analysis on some biological data and I am looking for an output like the one SIMPROF gives. Aka a dendrogram or hierarchical cluster.
However, I have 3200 observations/rows per group. SIMPROF, see example here,
library(clustsig)
usarrests<-USArrests[,c(1,2,4)]
rownames(usarrests)<-state.abb
# Run simprof on the data
res <- simprof(data= usarrests,
method.distance="braycurtis")
# Graph the result
pl.color <- simprof.plot(res)
seems to expect only one observation per group (US state in this example).
Now, again, my biological data (140k rows total) has about 3200 obs per group.
I am trying to cluster the groups together that have a similar representation in the variables provided.
As if in the example above, AK would be represented by more than one observation.
What's my best bet for a function/package/analysis?
Cheers,
Mo
Example from a paper: [figure not reproduced]
The solution became obvious upon further reflection.
Instead of using all observations (200k) in the long format, I made longitude and depth of sampling into one variable, used like sampling units along a transect. Thus, ending up with 3800 columns of longitude - depth combinations, and 61 rows for the taxa, with the value variable being the abundance of the taxa (If you want to cluster sampling units then you have to transpose the df). This is then feasible for hclust or SIMPROF since now the quadratic complexity only applies to 61 rows (as opposed to ~200k as I tried at the beginning).
Cheers
Here is some code:
library(reshape2)
library(dplyr)
d4<-d4 %>% na.omit() %>% arrange(desc(LONGITUDE_DEC))
# make 1 variable of longitude and depth that can be used for all taxa measured, like
#community ecology sampling units
d4$sampling_units<-paste(d4$LONGITUDE_DEC,d4$BIN_MIDDEPTH_M)
d5<-d4 %>% select(PREDICTED_GROUP,CONCENTRATION_IND_M3,sampling_units)
d5<-d5%>%na.omit()
# dcast data frame so that you get the taxa as rows, sampling units as columns w
# concentration/abundance as values.
d6<-dcast(d5,PREDICTED_GROUP ~ sampling_units, value.var = "CONCENTRATION_IND_M3")
d7<-d6 %>% na.omit()
d7$PREDICTED_GROUP<-as.factor(d7$PREDICTED_GROUP)
# give the rownames the taxa names
rownames(d7)<-paste(d7$PREDICTED_GROUP)
#delete that variable that is no longer needed
d7$PREDICTED_GROUP<-NULL
library(vegan)
# calculate the dissimilarity matrix with vegdist so you can use the sorenson/bray
#method
distBray <- vegdist(d7, method = "bray")
# calculate the clusters with ward.D2
clust1 <- hclust(distBray, method = "ward.D2")
clust1
#plot the cluster dendrogram with dendextend
library(dendextend)
library(ggdendro)
library(ggplot2)
dend <- clust1 %>% as.dendrogram %>%
set("branches_k_color", k = 5) %>% set("branches_lwd", 0.5) %>% set("clear_leaves") %>% set("labels_colors", k = 5) %>% set("leaves_cex", 0.5) %>%
set("labels_cex", 0.5)
ggd1 <- as.ggdend(dend)
ggplot(ggd1, horiz = TRUE)

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm trying now to analyze the effects of covariates in my model. I find the help file in mlogit.effects() to be not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1) ).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Officials and Responsive). Would I just model whatever they picked as their first rank and ignore what they chose in other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling
$$\log\left[\frac{\Pr(\text{Item}_k \text{ is 1st})}{\Pr(\text{Item}_K \text{ is 1st})}\right] = \alpha_k + x^{T}\beta_k$$
for k = 1, ..., K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for reference levels for the categorical predictor variables.)
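The statement that the reference level changes the coefficients but not the predicted probabilities can be checked with a small base-R calculation. The three items and their linear predictor values below are made up for illustration; the reference item is the one whose linear predictor is fixed at 0.

```r
# Turn linear predictors (log-odds relative to the reference) into probabilities
softmax <- function(eta) exp(eta) / sum(exp(eta))

# Hypothetical linear predictors with item3 as the reference (value 0)
eta_ref3 <- c(item1 = 0.4, item2 = -1.1, item3 = 0)

# Same model re-expressed with item1 as the reference:
# every linear predictor shifts by the same constant
eta_ref1 <- eta_ref3 - eta_ref3[["item1"]]

p_ref3 <- softmax(eta_ref3)
p_ref1 <- softmax(eta_ref1)

eta_ref1                   # the "coefficients" change ...
all.equal(p_ref3, p_ref1)  # ... the probabilities do not
```

Shifting all linear predictors by one constant multiplies the numerator and denominator of each probability by the same factor, so it cancels.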
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
# factor() is needed because read.csv() returns character columns in R >= 4.0
dat3$gender <- relevel(factor(dat3$gender), ref="Female")
dat3$Job <- relevel(factor(dat3$Job), ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. In mlogit's structure for ranked data, the first set of records for a person contains all options, the second contains all options except the one ranked first, and so on. The index, however, assumes an equal number of options each time, which produces a bunch of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91
