tapply only producing missing values in R

I'm trying to generate estimates of the percent of Catholics within a given municipality in a country and I'm using multilevel regression and post-stratification of survey data.
The approach fits a multilevel logit and generates predicted probabilities of the dependent variable. It then weights the probabilities using poststratification of the sample to census data.
I can generate the initial estimates (which are essentially just the predicted probabilities of being Catholic for each individual in the survey data). However, when I try to take the average with the last line of code below, it only returns NAs for each of the municipalities. The initial cell predictions have some missing values, but nowhere near a majority.
I don't understand why I can't generate municipal weighted averages, as I've followed the same procedure using different data. Any help would be greatly appreciated.
rm(list=ls(all=TRUE))
library("arm")
library("foreign")
#read in megapoll and attach
ES.data <- read.dta("ES4.dta", convert.underscore = TRUE)
#read in municipal-level dataset
munilevel <- read.dta("election.dta",convert.underscore = TRUE)
munilevel <- munilevel[order(munilevel$municode),]
#read in Census data
Census <- read.dta("poststratification4.dta",convert.underscore = TRUE)
Census <- Census[order(Census$municode),]
Census$municode <- match(Census$municode, munilevel$municode)
#Create index variables
#At level of megapoll
ES.data$ur.female <- (ES.data$female *2) + ES.data$ur
ES.data$age.edr <- 6 * (ES.data$age -1) + ES.data$edr
#At census level (same coding as above for all variables)
Census$cur.cfemale <- (Census$cfemale *2) + Census$cur
Census$cage.cedr <- 6 * (Census$cage -1) + Census$cedr
##Municipal level variables
Census$c.arena<- munilevel$c.arena[Census$municode]
Census$c.fmln <- munilevel$c.fmln[Census$municode]
#run individual-level opinion model
individual.model1 <- glmer(formula = catholic ~ (1|ur.female) + (1|age)
                           + (1|edr) + (1|age.edr) + (1|municode) + p.arena + p.fmln,
                           data = ES.data, family = binomial(link = "logit"))
display(individual.model1)
#examine random effects and standard errors for urban-female
ranef(individual.model1)$ur.female
se.ranef(individual.model1)$ur.female
#create vector of municipal ranefs and then fill in missing ones
muni.ranefs <- array(NA,c(66,1))
dimnames(muni.ranefs) <- list(c(munilevel$municode),"effect")
for(i in munilevel$municode){
muni.ranefs[i,1] <- ranef(individual.model1)$municode[i,1]
}
muni.ranefs[,1][is.na(muni.ranefs[,1])] <- 0 #set municipalities with missing REs (b/c not in data) to zero
#create a prediction for each cell in Census data
cellpred1 <- invlogit(fixef(individual.model1)["(Intercept)"]
+ranef(individual.model1)$ur.female[Census$cur.cfemale,1]
+ranef(individual.model1)$age[Census$cage,1]
+ranef(individual.model1)$edr[Census$cedr,1]
+ranef(individual.model1)$age.edr[Census$cage.cedr,1]
+muni.ranefs[Census$municode,1]
+(fixef(individual.model1)["p.fmln"] *Census$c.fmln) # municipal level
+(fixef(individual.model1)["p.arena"] *Census$c.arena)) # municipal level
#weights the prediction by the freq of cell
cellpredweighted1 <- cellpred1 * Census$cpercent.muni
#calculates the percent within each municipality (weighted average of responses)
munipred <- 100* as.vector(tapply(cellpredweighted1, Census$municode, sum))
munipred

The extensive amount of code is of little use without the data! I suppose you have NAs in the object cellpredweighted1, and by default sum() propagates NAs to the answer, because if one or more elements of a vector is NA then by definition the sum of those elements is also NA.
If the above is the case here, then simply adding na.rm = TRUE to the tapply() call should solve the problem.
tapply(cellpredweighted1, Census$municode, sum, na.rm = TRUE)
You should be asking yourself why there are NAs at this stage and whether they result from errors earlier in the process.
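To see the propagation in miniature, and to trace where the missing predictions enter, something along these lines (a quick sketch using the objects already defined above) can help:
sum(c(0.4, NA, 0.6))               # NA: one missing element poisons the sum
sum(c(0.4, NA, 0.6), na.rm = TRUE) # 1
# how many cell predictions are missing, and in which municipalities?
sum(is.na(cellpred1))
table(Census$municode[is.na(cellpred1)])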

Related

MASS package LDA - error in assessing accuracy of the prediction when using a loop: Error in table(): all arguments must have the same length

I'm trying to use a linear discriminant analysis (LDA) with the MASS package in R to determine if species within communities are more discriminable than random sets of species (with multiple traits predicting species identity).
When I run the LDA on the full dataset (all communities together), no problem. My problem comes when I try to use loops to run the LDA on each community separately.
I have 19 communities. I make a list of 19 called new2; each item in the list represents a single community with all of its associated species and their scores for various traits. I use a loop to run the LDA on each item of this list, new2[[i]], and put the result in c.dfas[[i]]. No problem here.
I then try to run a loop to assess the prediction accuracy (% correct for each category of Species) and put the summaries into the coms.sum[[i]] list. I should get a list of 19, with the prediction success for each community. Instead, I get this error:
Error in table(new2[[i]]$Species, c.dfas[[i]]$class) :
all arguments must have the same length
When I run length(new2[[i]]) I get 26,
and when I run length(c.dfas[[i]]) I get 5.
I understand that these are different lengths, but I am unsure how to solve it or re-write the loop to make this work. I have successfully run this exact (or almost exact) code before with a different dataset with no problem, so I know it is possible to get the prediction success from these lists in some way.
When I run length(new2[i]) I get 19,
and when I run length(c.dfas[i]) I get 19.
How do I get the species stored in new2 and the class stored in c.dfas to be the same length?
data1 <-read.csv("Apr4_DFA.csv")
#load packages
library(MASS)
library(tidyverse)
library(caret)
library(lattice)
library(ggplot2)
theme_set(theme_classic())
#######################_Full_Model_############################
# Linear Discriminant Analysis with Jacknifed Prediction
full <- lda(Species ~ Pixelwise.Shannon.Entropy + Average.Local.Entropies + sclFWMC +
Local.Patch.Covariance + logAGM + Pixelwise.Differential.Entropy +
Patchwise.Differential.Entropy + Global.Patch.Covariance + logPMD +
logWDC + Colour.Diversity + pattern + logH1, data = data1, na.action="na.omit", CV=TRUE)
full # show results
# Assess the accuracy of the prediction
# percent correct for each category of Species
ct_full <- table(data1$Species, full$class)
diag(prop.table(ct_full, 1))
# total percent correct
sum(diag(prop.table(ct_full)))
#full model 91%
###################_Run DFA on Each Community_######################
#see levels in factor community
levels(data1$Community)
#see number of levels in factor community
length(levels(data1$Community))
#take the levels of community and put it in coms
coms <- levels(data1[,"Community"])
coms
#sample 1 community of coms without replacement
sample(coms,1, replace=F)
#make empty list of 19 put it in new2
new2 <- list(19)
#take each randomly drawn community and put them in new2
for(i in 1:19){
temp <- sample(coms,1, replace=F)
new2[[i]] <- data1 %>% filter(Community %in% temp)
new2[[i]] <- droplevels(new2[[i]])
rm(temp)
}
new2
#create empty list of 19 called c.dfas
c.dfas <- list(19)
#run dfa on each community subset and put it in c.dfas
for(i in 1:19){
c.dfas[[i]] <- lda(Species ~ Pixelwise.Shannon.Entropy + Average.Local.Entropies + sclFWMC +
Local.Patch.Covariance + logAGM + Pixelwise.Differential.Entropy +
Patchwise.Differential.Entropy + Global.Patch.Covariance + logPMD +
logWDC + Colour.Diversity + pattern + logH1, data = data1, na.action="na.omit", CV=TRUE)
}
c.dfas
unclass(c.dfas)
#make empty list of 19 called coms.sum
coms.sum <- list(19)
#make another empty list of 19 called ct.c.dfas
ct.c.dfas <-list(19)
################ this is where I get the error ###################
#loop to assess the accuracy of the prediction % correct for each category of Species
#put the summaries of total % correct in the coms.sum list
for(i in 1:19){
ct.c.dfas[[i]] <- table(new2[[i]]$Species, c.dfas[[i]]$class)
diag(prop.table(ct.c.dfas[[i]], 1))
coms.sum[[i]] <- sum(diag(prop.table(ct.c.dfas[[i]])))
}
coms.sum
length(new2[[i]]) #26
length(c.dfas[[i]]) #5
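One thing worth checking in the second loop above: every lda() call there is fit with data = data1 (the full dataset) rather than the community subset, so c.dfas[[i]]$class has the length of data1 while new2[[i]]$Species only has that community's rows - very likely the source of this table() length error. A sketch of the loop with the subset passed in (note that na.action = "na.omit" can still drop rows, so the lengths are worth re-checking after fitting):
for(i in 1:19){
c.dfas[[i]] <- lda(Species ~ Pixelwise.Shannon.Entropy + Average.Local.Entropies + sclFWMC +
Local.Patch.Covariance + logAGM + Pixelwise.Differential.Entropy +
Patchwise.Differential.Entropy + Global.Patch.Covariance + logPMD +
logWDC + Colour.Diversity + pattern + logH1,
data = new2[[i]], na.action = "na.omit", CV = TRUE)
}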

Fama Macbeth Regression in R pmg

In the past few days I have been trying to work out how to do Fama Macbeth regressions in R. The advice is to use the plm package with pmg; however, every attempt returns the error that I have an insufficient number of time periods.
My dataset consists of 2,828,419 observations with 13 columns of variables, on which I am looking to run multiple cross-sectional regressions.
My firms are identified by seriesid, I have a date variable, and I want to run the following Fama Macbeth regressions:
totret ~ size
totret ~ momentum
totret ~ reversal
totret ~ volatility
totret ~ value + size
totret ~ value + size + momentum
totret ~ value + size + momentum + reversal + volatility
I have been using this command:
fpmg <- pmg(totret ~ momentum, Data, index = c("date", "seriesid"))
Which returns: Error in pmg(totret ~ mom, Dataset, index = c("seriesid", "datem")) : Insufficient number of time periods
I tried it with my dataset as a data.table, data.frame and pdata.frame. Switching the index does not help either.
My data contains NAs as well.
Can anyone fix this, or suggest a different way for me to do Fama Macbeth?
This is almost certainly due to having NAs in the variables in your formula. The error message is not very helpful - it is probably not a case of "too few time periods to estimate" and very likely a case of "there are firm/unit IDs that are not represented across all time periods" due to missing data being dropped.
You have two options - impute the missing data, or drop observations with missing data (the latter being a quick test that the model works without missing points, before deciding what you want to do that is valid for estimation).
If the missingness in your data is truly random, you might be okay just dropping observations with missingness. Otherwise you should probably impute. A common strategy here is to impute multiple times - at least 5 - and then estimate for each of those 5 resulting data sets and average the effect together. Amelia or mice are very strong imputation packages. I like Amelia because with one call you can impute n times for that many resulting data sets and it's easy to pass in a set of variables to not impute (e.g., id variable or time period) with the idvars parameter.
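As a sketch of that workflow (names here follow your call - Data, totret, momentum, seriesid, date - and I'm assuming the non-id columns are numeric; adjust as needed), dropping incomplete rows for a quick test and then doing the multiple-imputation route with Amelia would look roughly like this:
library(plm)
library(Amelia)
# quick test: does the model run once incomplete rows are dropped?
Data_cc <- Data[complete.cases(Data[, c("totret", "momentum")]), ]
fpmg_cc <- pmg(totret ~ momentum, Data_cc, index = c("date", "seriesid"))
# impute 5 data sets, leaving the identifiers out of the imputation model
imp <- amelia(Data, m = 5, idvars = c("seriesid", "date"))
# estimate on each imputed data set and average the coefficients
fits <- lapply(imp$imputations, function(d) pmg(totret ~ momentum, d, index = c("date", "seriesid")))
rowMeans(sapply(fits, coef))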
EDIT: I dug into the source code to see where the error was triggered and here is what the issue is - again likely caused by missing data, but it does interact with your degrees of freedom:
...
# part of the code where error is triggered below, here is context:
# X = matrix of the RHS of your model including intercept, so X[,1] is all 1s
# k = number of coefficients used determined by length(coef(plm.model))
# ind = vector of ID values
# so t here is the minimum value from a count of occurrences for each unique ID
t <- min(tapply(X[,1], ind, length))
# then if the minimum number of times a single ID appears across time is
# less than the number of coefficients + 1, you do not have enough time
# points (for that ID/those IDs) to estimate.
if (t < (k + 1))
stop("Insufficient number of time periods")
That is what is triggering your error. So imputation is definitely a solution, but there might be a single offender in your data and importantly, once this condition is satisfied your model will run just fine with missing data.
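A quick way to look for such offenders yourself, mirroring that check (again using the names from your call - Data, totret, momentum, seriesid - so adjust to your data):
# usable (non-missing) observations per firm for this formula
cc <- complete.cases(Data[, c("totret", "momentum")])
obs_per_firm <- tapply(cc, Data$seriesid, sum)
# k = length(coef(.)) = 2 here (intercept + momentum), so fewer than k + 1 = 3
# usable observations for any single firm triggers the error
k <- 2
sort(obs_per_firm)[1:10]
names(obs_per_firm)[obs_per_firm < (k + 1)]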
I recently got the Fama Macbeth regression working in R.
Starting from a data table with all of the characteristics in the rows, the following works and gives you the option of weighting the regression equally or by market cap (remove ", weights = marketcap" for equal weighting). totret is a total return variable, logmarket is the logarithm of market capitalization.
logmarket <- df %>%
  group_by(date) %>%
  summarise(constant = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[1],
            rsquared = summary(lm(totret ~ logmarket, weights = marketcap))$r.squared,
            beta     = summary(lm(totret ~ logmarket, weights = marketcap))$coefficients[2])
You obtain a data frame with monthly alphas (constant), betas (beta), and R squareds (rsquared).
To retrieve coefficients with t-statistics in a dataframe:
library(lmtest)  # for coeftest()
Summarystatistics <- as.data.frame(matrix(data = NA, nrow = 6, ncol = 1))
names(Summarystatistics) <- "logmarket"
row.names(Summarystatistics) <- c("constant","t-stat", "beta", "tstat", "R^2", "observations")
Summarystatistics[1,1] <- mean(logmarket$constant)
Summarystatistics[2,1] <- coeftest(lm(logmarket$constant~1))[1,3]
Summarystatistics[3,1] <- mean(logmarket$beta)
Summarystatistics[4,1] <- coeftest(lm(logmarket$beta~1))[1,3]
Summarystatistics[5,1] <- mean(logmarket$rsquared)
Summarystatistics[6,1] <- nrow(subset(df, !is.na(logmarket)))
There are some values of "seriesid" with only one observation, which is why pmg gives the error. If you do something like this (with the variable names you use), it will stop the error:
try2 <- try2 %>%
group_by(cusip) %>%
mutate(flag = (if (length(cusip)==1) {1} else {0})) %>%
ungroup() %>%
filter(flag == 0)
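Equivalently, and a little more compactly, dplyr's n() gives the group size directly, so the flag variable isn't needed:
try2 <- try2 %>%
group_by(cusip) %>%
filter(n() > 1) %>%
ungroup()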

Need help applying regression model to dataset in R (sports data)

Update: Solved!
I'm currently trying to create a regression model for football that predicts a team's total points based on their pass yards and rush yards. I was able to get all the way to figuring out the regression equation but from here I do not know how to "plug in" the formula.
The data table is essentially all 32 NFL teams listed in rows and their offensive stats listed in columns.
Code:
# 1. Import
Offense <- read.csv(file.choose(), header=TRUE)
#2 View
show (Offense)
#3 Attach so headers can be referenced
attach (Offense)
#4 Create Regression Model
mod1 <-lm(Total.Points ~ Pass.Yds + Rush.Yds)
summary(mod1)
#Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#Plug in the Regression Equation
predict(mod1)
Output: https://imgur.com/a/AbTNF
I see that at the end it applied the regression equation to all 32 rows, but how do I
1. get it to display as a ranked list, and
2. get it to display, say, the team name as well as the projected score (so I don't have to wonder which team "1" or "2" refers to)?
Since I have the equation, could I also just write a loop function that ran the equation for every row of data I have and print the results?
I'm a beginner, so any help is much appreciated!
Update: Came up with this
####Part 2. Interpretation
#1. Examining quality of model
summary(mod1)
cor(Pass.Yds, Rush.Yds)
#2. Formula obtained from summary: -255.60178 + .10565(Pass) + .12154(Rush)
#3. Predicted Points (Descending Order)
proj <- sort(predict(mod1), decreasing = TRUE)
proj
#4. Corresponding Name (Descending)
name <- Team[order(predict(mod1), decreasing = TRUE)]
name
#Data Frame
Projections <- data.frame(name, proj)
Projections
While bbrot provided a much simpler version:
Assuming that Teams is the vector of team names, something like cbind(Teams[order(predict(mod1), decreasing = TRUE)], sort(predict(mod1), decreasing = TRUE)) should do...
Edit: Your Teams vector seems to be a factor. In this case, the following commands are going to work:
# returns a character matrix
cbind(as.character(Teams)[order(predict(mod1), decreasing = TRUE)],
sort(predict(mod1), decreasing = TRUE))
# returns a data frame
data.frame(Teams = Teams[order(predict(mod1), decreasing = TRUE)],
Points = sort(predict(mod1), decreasing = TRUE))
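As for the loop idea in the question: predict(mod1) already does that row-by-row calculation, but the same numbers can be reproduced by hand from the coefficients if you want to see the equation applied explicitly. A quick sketch using the objects above:
b <- coef(mod1)
manual <- b["(Intercept)"] + b["Pass.Yds"] * Offense$Pass.Yds + b["Rush.Yds"] * Offense$Rush.Yds
all.equal(unname(manual), unname(predict(mod1)))  # should be TRUE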

Effects from multinomial logistic model in mlogit

I received some good help getting my data formatted properly to produce a multinomial logistic model with mlogit here (Formatting data for mlogit).
However, I'm now trying to analyze the effects of covariates in my model. I find the help file for mlogit.effects() not very informative. One of the problems is that the model appears to produce a lot of rows of NAs (see below, index(mod1)).
Can anyone clarify why my data is producing those NAs?
Can anyone help me get mlogit.effects to work with the data below?
I would consider shifting the analysis to multinom(). However, I can't figure out how to format the data to fit the formula for use with multinom(). My data is a series of rankings of seven different items (Accessible, Information, Trade offs, Debate, Social, Responsive and Officials). Would I just model whatever they picked as their first rank and ignore what they chose in the other ranks? I can get that information.
Reproducible code is below:
#Loadpackages
library(RCurl)
library(mlogit)
library(tidyr)
library(dplyr)
#URL where data is stored
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
#Get data
dat <- read.csv(dat.url)
#Complete cases only as it seems mlogit cannot handle missing values or tied data which in this case you might get because of median imputation
dat <- dat[complete.cases(dat),]
#Change the choice index variable (X) to have no interruptions, as a result of removing some incomplete cases
dat$X <- seq(1,nrow(dat),1)
#Tidy data to get it into long format
dat.out <- dat %>%
gather(Open, Rank, -c(1,9:12)) %>%
arrange(X, Open, Rank)
#Create mlogit object
mlogit.out <- mlogit.data(dat.out, shape='long',alt.var='Open',choice='Rank', ranked=TRUE,chid.var='X')
#Fit Model
mod1 <- mlogit(Rank~1|gender+age+economic+Job,data=mlogit.out)
Here is my attempt to set up a data frame similar to the one portrayed in the help file. It doesn't work. I confess that although I know the apply family pretty well, tapply is murky to me.
with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt, mean)))
Compare from the help:
data("Fishing", package = "mlogit")
Fish <- mlogit.data(Fishing, varying = c(2:9), shape = "wide", choice = "mode")
m <- mlogit(mode ~ price | income | catch, data = Fish)
# compute a data.frame containing the mean value of the covariates in
# the sample data in the help file for effects
z <- with(Fish, data.frame(price = tapply(price, index(m)$alt, mean),
catch = tapply(catch, index(m)$alt, mean),
income = mean(income)))
# compute the marginal effects (the second one is an elasticity)
effects(m, covariate = "income", data = z)
I'll try Option 3 and switch to multinom(). This code will model the log-odds of ranking an item as 1st, compared to a reference item (e.g., "Debate" in the code below). With K = 7 items, if we call the reference item Item_K, then we're modeling
log[ Pr(Item_k is 1st) / Pr(Item_K is 1st) ] = α_k + x'β_k
for k = 1, ..., K-1, where Item_k is one of the other (i.e. non-reference) items. The choice of reference level will affect the coefficients and their interpretation, but it will not affect the predicted probabilities. (Same story for the reference levels of the categorical predictor variables.)
I'll also mention that I'm handling missing data a bit differently here than in your original code. Since my model only needs to know which item gets ranked 1st, I only need to throw out records where that info is missing. (E.g., in the original dataset record #43 has "Information" ranked 1st, so we can use this record even though 3 other items are NA.)
# Get data
dat.url <- 'https://raw.githubusercontent.com/sjkiss/Survey/master/mlogit.out.csv'
dat <- read.csv(dat.url)
# dataframe showing which item is ranked #1
ranks <- (dat[,2:8] == 1)
# for each combination of predictor variable values, count
# how many times each item was ranked #1
dat2 <- aggregate(ranks, by=dat[,9:12], sum, na.rm=TRUE)
# remove cases that didn't rank anything as #1 (due to NAs in original data)
dat3 <- dat2[rowSums(dat2[,5:11])>0,]
# (optional) set the reference levels for the categorical predictors
dat3$gender <- relevel(dat3$gender, ref="Female")
dat3$Job <- relevel(dat3$Job, ref="Government backbencher")
# response matrix in format needed for multinom()
response <- as.matrix(dat3[,5:11])
# (optional) set the reference level for the response by changing
# the column order
ref <- "Debate"
ref.index <- match(ref, colnames(response))
response <- response[,c(ref.index,(1:ncol(response))[-ref.index])]
# fit model (note that age & economic are continuous, while gender &
# Job are categorical)
library(nnet)
fit1 <- multinom(response ~ economic + gender + age + Job, data=dat3)
# print some results
summary(fit1)
coef(fit1)
cbind(dat3[,1:4], round(fitted(fit1),3)) # predicted probabilities
I didn't do any diagnostics, so I make no claim that the model used here provides a good fit.
You are working with ranked data, not just multinomial choice data. The structure for ranked data in mlogit is that the first set of records for a person contains all options, the second set contains all options except the one ranked first, and so on. But the index assumes an equal number of options each time, hence the rows of NAs. We just need to get rid of them.
> with(mlogit.out, data.frame(economic=tapply(economic, index(mod1)$alt[complete.cases(index(mod1)$alt)], mean)))
economic
Accessible 5.13
Debate 4.97
Information 5.08
Officials 4.92
Responsive 5.09
Social 4.91
Trade.Offs 4.91

How to find an optimal adstock decay factor for an independent variable in panel analysis in R?

I'm working with a panel dataset (24 months of data for 210 DMAs). I'm trying to optimize the adstock decay factor for an independent variable by minimizing the standard error of a fixed effects model.
In this particular case, I want to get a decay factor that minimizes the SE of the adstock-transformed variable "SEM_Br_act_norm" in the model "Mkt_TRx_norm = b0 + b1*Mkt_TRx_norm_prev + b2*SEM_Br_act_norm_adstock".
So far, I've loaded the dataset in panel format using plm and created a function to generate the adstock values. The function also runs a fixed effects model on the adstock values and returns the SE. I then use optimize() to find the best decay value within the bounds (0,1). While my code returns an optimal value, I am worried something is wrong because it returns the same optimum (close to 1) for all the other variables.
I've attached a sample of my data, as well as key parts of my code. I'd greatly appreciate if someone could take a look and see what is wrong.
Sample Data
# Set panel data structure
alldata <- plm.data (alldata, index = c("DMA", "Month_Num"))
alldata$var <- alldata$SEM_Br_act_norm +0
# Create 1 month time lag for TRx
alldata <- ddply(
alldata, .(DMA), transform,
# This assumes that the data is sorted
Mkt_TRx_norm_prev = c(NA,Mkt_TRx_norm[-length(Mkt_TRx_norm)])
)
# Create adstock function and obtain SE of regression
adstockreg <-function(decay, period, data_vector, pool_vector=0){
data_vector <-alldata$var
pool_vector <- alldata$DMA
data2<-data_vector
l<-length(data_vector)
#if no pool apply zero to vector
if(length(pool_vector)==1)pool_vector<-rep(0,l)
#outer loop: extract data to decay from observation i
for( i in 1:l){
x<-data_vector[i]
#inner loop: apply decay onto following observations after i
for(j in 1:min(period,l)){
#constrain decay to same pool (if data is pooled)
if( pool_vector[i]==pool_vector[min(i+j,l)]){data2[(i+j)]<- data2[(i+j)]+(x*(decay)^j)}
}
}
#reduce length of edited data to equal length of initial data
data2<-data2[1:l]
#regression - excludes NA values
alldata <- plm.data (alldata, index = c("DMA", "Month_Num"))
var_fe <- plm(alldata$Mkt_TRx_norm ~ alldata$Mkt_TRx_norm_prev + data2, data = alldata , model = "within", na.action = na.exclude)
se <- summary(var_fe)$coefficients["data2","Std. Error"]
return(se)
}
# Optimize decay for adstock variable
result <- optimize(adstockreg, interval=c(0,1), period = 6)
print(result)
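For what it's worth, a geometric adstock with unlimited carryover (a simplification of the 6-month window used above, shown here only as a sanity check on the transform) can be written without the nested loops, assuming the data are sorted by DMA and Month_Num as the lag construction above requires; the decay value 0.5 is arbitrary:
# y[t] = x[t] + decay * y[t-1], applied separately within each DMA
adstock_geometric <- function(x, decay) as.numeric(stats::filter(x, filter = decay, method = "recursive"))
alldata$SEM_adstock <- ave(alldata$var, alldata$DMA, FUN = function(x) adstock_geometric(x, decay = 0.5))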
