R logit model variable choice

I have a dataset where each household has an observation for each of five power generation methods (so each household appears 5 times). There is a dummy variable marking which method they use, so a sample household might look like --
hh_id  choice     choice_dummy
1      Grid       0
1      Diesel     0
1      Ownsolar   1
1      Solargrid  0
1      None       0
I have some other variables (price, avail, load, peakhours) that I need to put into a logit model to see what influences the decision to pick a particular source. I know glm() can fit a logit model, but I'm unsure what to use as the dependent variable: "choice" on its own doesn't capture the decision that was made, "choice_dummy" marks the decision, but "choice_dummy" by itself carries no information about which source it refers to.
I can't simply filter for choice_dummy == 1 because that would drop the rows (and covariate values) for all the alternatives that were not chosen. How would I run a logit model that relates the probability of each household choosing an energy source to the variables "price", "avail", "load", and "peakhours", ideally with code?

Replying to OP's comment 5/22/20:
hh_id <- c("1", "1", "1", "1", "1")
choice <- c("Grid", "Diesel", "Ownsolar", "Solargrid", "None")
choice_dummy <- c(0, 0, 1, 0, 0)
df <- data.frame(hh_id, choice, choice_dummy)
library(reshape2)
# reshape to wide: one row per household, one 0/1 column per alternative
df2 <- dcast(df, hh_id ~ choice, value.var = "choice_dummy")
# collapse the dummy columns into a single variable holding the chosen source
df2$power_choice <- ifelse(df2$Grid == 1, "Grid",
                    ifelse(df2$Diesel == 1, "Diesel",
                    ifelse(df2$Ownsolar == 1, "Ownsolar",
                    ifelse(df2$Solargrid == 1, "Solargrid",
                    ifelse(df2$None == 1, "None", NA)))))
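Since price, avail, load and peakhours presumably vary across the five alternatives, another route is to keep the data in its long form and fit a conditional (McFadden) logit with one stratum per household. A minimal sketch, assuming the long data frame df also holds those four columns and that choice_dummy is stored as numeric 0/1 (the survival package is used here; the mlogit package offers the same model with more discrete-choice conveniences):
library(survival)
# conditional logit: each household is a stratum and choice_dummy marks
# which of its five rows (alternatives) was actually chosen
fit <- clogit(choice_dummy ~ price + avail + load + peakhours + strata(hh_id), data = df)
summary(fit)
The coefficients then describe how each attribute shifts the odds of an alternative being chosen, and no rows have to be filtered out.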


Is there an R function for merging duplicates into the same row?

I am conducting research on SARS-CoV-2 testing of healthcare workers. Some workers were tested multiple times (they are identified by employee number). I would therefore like new columns where the second/third test value (numeric) and test date are listed on the same row for that healthcare worker. However, I am not sure how to approach this. I'd guess you could group by employee number and use some sort of mutate() call?
All tips are appreciated!
Maybe you could use the dcast function from the data.table package.
Let's assume you have the following data table:
library(data.table)
cov_test <- data.table(worker_id = c(1, 1, 2), test_count = c(1, 2, 1), test_result = c("negative", "positive", "negative"))
   worker_id test_count test_result
1:         1          1    negative
2:         1          2    positive
3:         2          1    negative
Using the following code you get the following table:
dcast(data = cov_test, ... ~ test_count, value.var = "test_result")
   worker_id        1        2
1:         1 negative positive
2:         2 negative     <NA>
The question is whether you have a column that describes the current test number for a person. If not, you would have to extract this information from the date column.
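If such a test-number column is missing, here is a hedged sketch of deriving it from the test date, assuming a test_date column (not shown in the example above):
library(data.table)
# order each worker's tests by date, then number them 1, 2, 3, ...
setorder(cov_test, worker_id, test_date)
cov_test[, test_count := seq_len(.N), by = worker_id]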

Propensity Score Matching with panel data

I am trying to use MatchIt to perform propensity score matching (PSM) on my panel data, which contains multi-year observations from the same group of companies.
The data describes a list of bonds together with the financial data of their issuers and the bond terms, such as issue date, coupon rate, maturity, and bond type. For instance:
Firmnames       Year  ROA  Bond_type
AAPL US Equity  2015  0.3  0
AAPL US Equity  2015  0.3  1
AAPL US Equity  2016  0.3  0
AAPL US Equity  2017  0.3  0
C US Equity     2015  0.3  0
C US Equity     2016  0.3  0
C US Equity     2017  0.3  0
......
I already know how to match observations on the criteria I want, and I use exact = "Year" to make sure I only match observations from the same year. The problem I am facing now is that observations from the same company get matched together, which is not what I want. The code I used:
matchit(Bond_type ~ Year + Amount_Issued + Cpn + Total_Assets_bf + AssetsEquityRatio_bf + Asset_Turnover_bf, data = rdata, method = "nearest", distance = "glm", exact = "Year")
However, as you can see in the second row of my sample, there can be two observations in the same year from the same company, due to the nature of my study (a company can issue bonds more than once a year). The only difference between them is the Bond_type. The matchit() function will therefore treat them as the best treatment-control pair and match these two observations together, since they have the same ROA and the same values of the other matching factors in that year.
In my opinion there are two ways to solve this:
1. Remove the duplicate observations from the same year and company; however, removing observations might bias the results and ruin the study.
2. Prevent matchit() from matching observations from the same company (i.e. with the same Firmnames).
The second approach would be better since it does not introduce bias, but I don't know whether it can be done within the matchit() function. I hope someone can give me advice on this, or share a better solution to the problem. Thanks in advance!
Note: if there is any further information I should provide, please let me know. This is my first time asking a question here!
This is not possible with MatchIt at the moment (though it's an interesting idea and not hard to implement, so I may add it as a feature).
In the optmatch package, which performs optimal pair and full matching, there is a constraint that can be added called "anti-exact matching", which sounds exactly like what you want. Units with the same value of the anti-exact matching variable will not be matched with each other. This can be implemented using optmatch::antiExactMatch().
In the Matching package, which performs nearest neighbor and genetic matching, the restrict argument can be supplied to the matching function to restrict certain matches. You could manually create the restriction matrix by restricting all pairs of observations in the same company and then supply the matrix to Match().
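A hedged sketch of the Matching route, building a restrict matrix that rules out within-firm pairs. This assumes (per the Matching documentation; please verify against your installed version) that restrict is a three-column matrix of observation indices plus a value, with -1 marking a pair that must not be matched:
library(Matching)
treated <- which(rdata$Bond_type == 1)
control <- which(rdata$Bond_type == 0)
# all treated-control pairs that come from the same firm
pairs <- expand.grid(i = treated, j = control)
same_firm <- rdata$Firmnames[pairs$i] == rdata$Firmnames[pairs$j]
restrict_mat <- cbind(pairs$i[same_firm], pairs$j[same_firm], -1)
# nearest-neighbour matching on (a subset of) the covariates from the question,
# with same-firm pairs ruled out
m <- Match(Tr = rdata$Bond_type, X = rdata[, c("Amount_Issued", "Cpn", "Total_Assets_bf")], restrict = restrict_mat)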

Discriminant analysis and column name in the code

I have been writing code to make it easier to perform a discriminant analysis with the lda function, but there is one step I cannot solve: introducing the name of the categorical column into the code. Imagine we have the following table (called Smoke), in which the column Factor represents the groups (in our case, smoker and nsmok).
Smoke
  Factor Lung Heart Blood
1 smoker    7    22    15
2 smoker    8    21    12
3  nsmok   22     9     5
This is the code I have been preparing. Please look at the XXXX placeholders in the code (they appear twice). I want the name of the categorical column to be filled in automatically, instead of writing it out by hand twice.
lda=lda(XXXX~.,data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
Tabla=table(Smoke$XXXX,lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
I thought that writing...
colnames(Table)[1]
... would solve it, but I still get errors when running the code.
Alternatively, I thought that introducing the name directly this way:
Column_Factor -> Factor
and writing Column_Factor in the two places in the code would solve it. But it doesn't.
Any ideas?
You could do something like this:
library(MASS)
#gets the column name of the factor, maybe check if there is only one factor column first
Column_Factor <- names(Smoke)[sapply(Smoke, class)=="factor"]
#creates the formula by pasting the name and the RHS
lda <- lda(as.formula(paste(Column_Factor,"~.",sep="")),data=Smoke)
plot(lda)
lda
lda$counts
lda$svd
lda.p=predict(lda)
#selects the column using the variable
Tabla=table(Smoke[,Column_Factor],lda.p$class)
Tabla
diag(prop.table(Tabla, 1))
sum(diag(prop.table(Tabla)))
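A further option, sketched here as an aside rather than as part of the original answer: base R's reformulate() builds the same formula without paste(), under the same assumptions about Smoke and Column_Factor:
# build Column_Factor ~ . programmatically
form <- reformulate(".", response = Column_Factor)
lda_fit <- lda(form, data = Smoke)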

Reliability tests for classic content analysis (multiple categorial codes per item)

In classic content analysis (or qualitative content analysis), as typically done with Atlas.TI or Nvivo type tools (sometimes called CAQDAS tools), you face the situation of multiple raters rating many objects with many codes, so there are multiple codes that each rater might apply to each object. I think this is what the excellent John Uebersax page on agreement statistics calls "Two Raters, Polytomous Ratings".
For example you might have two raters read articles and code them with some group of topic codes from a coding scheme (e.g., diy, shelving, circular saw), and you are asking how well the coders agree on applying the codes.
What I'd like is to use the irr package functions, agree and kappa2, in these situations. Yet their documentation didn't help me figure out how to proceed, since they expect input in the form of "n*m matrix or dataframe, n subjects m raters." which implies that there is a single rating per rater, per object.
Given two raters using (up to) three codes to code two articles, the input data looks like this (two diy articles, the second with some additional topic tags):
article,rater,code
article1,rater1,diy
article1,rater2,diy
article2,rater1,diy
article2,rater2,diy
article2,rater1,circular-saw
article2,rater1,shelving
article2,rater2,shelving
I'd like to get:
Overall percentage agreement.
Percentage agreement for each code.
Contingency table for each code.
Ideally, I'd also like to get Positive agreement (how often do the raters agree that a code should be present?) and Negative Agreement (how often do the raters agree that a code should not be present). See discussion of these at http://www.john-uebersax.com/stat/raw.htm#binspe
I'm pretty sure that this involves breaking the input data.frame up and processing it code by code, using something like dplyr, but I wondered if others have tackled this problem.
(The kappa functions take the same input, so let's keep this simple by using the agree function from the irr package; besides, positive and negative agreement only really make sense alongside percentage agreement.)
Looking at the meta.stackexchange threads on answering one's own question, it seems that is an acceptable thing to do. Makes sense, good place to store stuff for others to find :)
I solved most of this with the following code:
library(plyr); library(dplyr); library(reshape2); library(irr)
# The irr package expects input in the form of n x m (objects in rows, raters in columns)
# for multiple coders per coded item that is really confusing. Here we have 10 articles (to be coded) and
# many codes, so each rater rates each combination of article and code as present (or not).
# Basically you send only the ratings columns to agree and kappa2. You can send them all at
# once for overall agreement, or send only those for each code for code-by-code agreement.
# letter,code,rater
# letter1,code1,rater1
# letter1,code2,rater1
# letter2,code3,rater2
coding <- read.csv("CombinedCoding.csv")
# Now want:
# letter, code, rater1, rater2
# where 0 = no (this code wasn't used), 1 = yes (this code was used)
# dcast can do this, collapsing across a group. In this case we're not really
# grouping, so if the code was not present length gives a 0, if it was length
# gives a 1.
# This excludes all the times where we agreed that both codes weren't present.
ccoding <- dcast(coding, letter + code ~ rater, length)
# create data.frame from combination of letters and codes
# this handles the negative agreement parts.
codelist <- unique(coding$code)
letterlist <- unique(coding$letter)
coding_with_negatives <- merge(codelist, letterlist) # Gets the Cartesian product of these.
names(coding_with_negatives) <- c("code", "letter") # align the names
# merge this with the coding, produces NA for rows that don't exist in ccoding
coding_with_negatives <- merge(coding_with_negatives,ccoding,by=c("letter","code"), all.x=T)
# replace NAs with zeros.
coding_with_negatives[is.na(coding_with_negatives)] <- 0
# Now want agreement per code.
# need a function that returns a df
# this function gets given the split data frame (ie this happens once per code)
getagree <- function(df) {
  # for positive agreement remove the cases where we both coded it negative
  positive_df <- filter(df, (rater1 == 1 & rater2 == 1) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
  # for negative agreement remove the cases where we both coded it positive
  negative_df <- filter(df, (rater1 == 0 & rater2 == 0) | (rater1 == 0 & rater2 == 1) | (rater1 == 1 & rater2 == 0))
  data.frame( positive_agree = round(agree(positive_df[, 3:4])$value, 2)  # run agree on the rater columns, take $value, and round it
            , negative_agree = round(agree(negative_df[, 3:4])$value, 2)
            , agree = round(agree(df[, 3:4])$value, 2)
            , used_in_articles = nrow(positive_df)  # gives some idea of the prevalence
            )
}
# split the df up by code, run getagree on the sections
# recombine into a data frame.
results <- ddply(coding_with_negatives, .(code), getagree)
The confusion matrices can be obtained with:
print(table(coding_with_negatives[,3],coding_with_negatives[,4],dnn=c("rater1","rater2")))
I haven't done it but I think I could do that per code inside the function using print to push them into a text file.
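A hedged sketch of those per-code tables (assuming coding_with_negatives with the letter, code, rater1, rater2 columns built above); wrapping the loop in sink() or capture.output() would push them into a text file:
# print one rater1 x rater2 contingency table per code
for (cd in unique(coding_with_negatives$code)) {
  sub <- coding_with_negatives[coding_with_negatives$code == cd, ]
  cat("\n== code:", as.character(cd), "==\n")
  print(table(sub$rater1, sub$rater2, dnn = c("rater1", "rater2")))
}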

Using lm in dlply while sorting by variable

I have the following data, called dataframe:
planid (each plan indicated by a number from 1 to 126)
US_FRAC (a value between 0 and 1 for each fund in each year) and
market.premium (a value indicating the market premium for every fund in every year)
For every planid I want to run a regression of US_FRAC on market.premium, as I have 10 years of data for every planid.
I used the following code:
mods=dlply(dataframe,.('planid'),lm,formula=ADJ_US_FRAC ~ market.premium)
I need both the t-statistic and the coefficient for every planid in a table, but I could only find code for the coefficient. I must have done something wrong, as I only get an output with one value for an intercept and nothing else.
Removing the quotes around planid and the ADJ_ prefix before US_FRAC makes this work for the sample data below:
library(plyr)
dataframe <- data.frame(planid = round(runif(1000) * 126), US_FRAC = runif(1000), market.premium = rnorm(1000))
dlply(dataframe, .(planid), lm, formula = US_FRAC ~ market.premium)
summary() performs the coefficient t-tests. You can create data frames with the fits with something like:
C <- ddply(dataframe,.(planid),function(x) {summary(lm(formula=US_FRAC ~ market.premium,data=x))$coefficients['(Intercept)', ]})
Beta <- ddply(dataframe,.(planid),function(x) {summary(lm(formula=US_FRAC ~ market.premium,data=x))$coefficients['market.premium', ]})
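A hedged variant of the same plyr approach that collects the estimate and t-statistic for every term and every planid in a single table (column names taken from summary.lm's coefficient matrix):
coef_table <- ddply(dataframe, .(planid), function(x) {
  cf <- summary(lm(US_FRAC ~ market.premium, data = x))$coefficients
  data.frame(term = rownames(cf), estimate = cf[, "Estimate"], t_value = cf[, "t value"])
})
head(coef_table)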
kind greetings
