R Conditional Regression with Multiple Conditions

I am trying to run a regression in R based on two conditions. My data has binary variables for both year and another classification. I can get the regression to run properly while only using 1 condition:
# now time for the millions of OLS
# format: OLSABCD where ABCD are binary for the values of MSA/UA and years
# A = 1 if MSA, 0 if UA
# B = 1 if 2010
# C = 1 if 2000
# D = 1 if 1990
OLS1000<-summary(lm(lnrank ~ lnpop, data = subset(df, msa==1)))
OLS1000
However, I cannot figure out how to combine the MSA/UA classification with the year variables. I have tried:
OLS1100<-summary(lm(lnrank ~ lnpop, data = subset(df, msa==1, df$2010==1)))
OLS1100
But it returns the error:
Error: unexpected numeric constant in "OLS1100<-summary(lm(lnrank ~ lnpop,
data = subset(df, msa==1, df$2010"
How can I get the program to run utilizing both conditions?
Thank you again!

The problem is:
df$2010
If your data really has a column named 2010, then you need backticks around it:
df$`2010`
And in your subset, don't specify df twice:
subset(df, msa == 1, `2010` == 1)
In general it's better if column names don't start with digits. It's also best not to name data frames df, since that's a function name.

@neilfws pointed out the issue with numeric column names, but there is another problem in your code.
The third argument of subset() is select, which chooses which columns to include (or exclude); it is not a second row condition. To filter on both conditions, combine them with & in the second argument:
subset(df, msa == 1 & `2010` == 1)
instead of
subset(df, msa == 1, `2010` == 1)
The second form parses without a syntax error, but `2010` == 1 is then interpreted as a column selection rather than a row filter, so it does not apply the condition you intended.
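As a minimal sketch of the two-condition subset feeding into lm(), here is a self-contained example. The data frame dat and its values are simulated for illustration only; the column names (lnrank, lnpop, msa, 2010) follow the question, and check.names = FALSE is needed to keep a column literally named 2010.

```r
# Simulated data; only the column names come from the question
set.seed(1)
dat <- data.frame(
  lnpop  = rnorm(40),
  msa    = rep(c(1, 0), 20),
  `2010` = rep(c(1, 1, 0, 0), 10),
  check.names = FALSE            # keep the non-syntactic name "2010"
)
dat$lnrank <- 2 - 1.1 * dat$lnpop + rnorm(40, sd = 0.1)

# Both conditions go in the second argument, joined with &
both <- subset(dat, msa == 1 & `2010` == 1)
nrow(both)

OLS1100 <- summary(lm(lnrank ~ lnpop, data = both))
OLS1100
```

Every row of both has msa equal to 1 and 2010 equal to 1, so the regression uses only observations that satisfy both conditions.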

Related

Multivariate Cointegration Test?

I am currently performing an analysis on FDI and income inequality in a panel of 10 countries over 30 years. In this setting, I want to test for panel cointegration, unit roots etc. My data is currently in the format:
ID year var1 var2
1 1 3 4
1 2 4 NA
1 3 1 6
2 1 4 2
2 2 1 3
2 3 2 2
and so on. I have used
data.frame(split(df$var1, df$ID))
to create a dataframe with the IDs as columns to use the purtest function from the plm package (note that my panel data is unbalanced). However, I have a lot of variables and this whole procedure seems excessively tedious.
Is there a way to perform a cointegration test on multiple variables at once? There appears to be an index specification in purtest but I don't fully understand how to use it. What format would my data have to be in and how would I get there? Would it be an issue that there are quite a few NAs in some variables?
df <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year = c(1,2,3,4,5,6,7,1,2,3,4,5,6,7),
var1 = c(3,5,6,2,NA,6,4, 2,3,4,5,8,NA,7),
var2 = c(5,6,3,8,NA,NA,2, NA,6,7,4,5,9,2))
The index argument to plm's purtest works like in the other functions of the package: it is used to specify the individual and the time dimension of your data. However, there is no need to use it directly in the purtest (or other functions) if data is converted to a pdata.frame first.
There is no need to split the data by individual on your end. purtest supports various input formats, and a user-driven split is just one option. If you already have a data frame in long format, I suggest converting it to a pdata.frame and passing a pseries to purtest (thus, no need to split the data yourself). Something along these lines:
library(plm)
df <- data.frame(id = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
year = c(1,2,3,4,5,6,7,1,2,3,4,5,6,7),
var1 = c(3,5,6,2,NA,6,4,2,3,4,5,8,NA,7),
var2 = c(5,6,3,8,NA,NA,2,NA,6,7,4,5,9,2))
pdf <- pdata.frame(df)
purtest(pdf$var1, test = "hadri", exo = "intercept")
To test multiple variables at once, you can put the purtest function into a loop, e.g., like this:
myvars <- as.list(pdf[ , -c(1:2)], keep.attributes = TRUE)
lapply(myvars, function(x) purtest(x, test = "ips", exo = "intercept", pmax = 1L))
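If the individual and time columns are not the first two columns of the data frame, the index can be given explicitly when building the pdata.frame. The following is a small sketch under that assumption; it reuses the example data from above, and the object name p is made up for illustration.

```r
library(plm)

df <- data.frame(id   = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2),
                 year = c(1,2,3,4,5,6,7,1,2,3,4,5,6,7),
                 var1 = c(3,5,6,2,NA,6,4,2,3,4,5,8,NA,7),
                 var2 = c(5,6,3,8,NA,NA,2,NA,6,7,4,5,9,2))

# index = c(<individual column>, <time column>)
p <- pdata.frame(df, index = c("id", "year"))

# Columns extracted from a pdata.frame are pseries objects,
# which carry the index along and can be passed to purtest()
class(p$var1)
```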

R code to detect a change in a variable over time for multiple patients

I have a data set with multiple rows per patient, where each row represents a 1-week period of time over the course of 4 months. There is a variable grade that can take on values of 1, 2, or 3, and I want to detect when a single patient's grade INCREASES (1 to 2, 1 to 3, or 2 to 3) at any point (the result would be a yes/no variable). I could write a function to do it, but I'm betting there is some clever functional programming I could do to make use of existing R functions. Here is a sample data set below. Thank you!
df=data.frame(patient=c(1,1,1,2,2,3,3,3,3),period=c(1,2,3,1,3,1,3,4,5),grade=c(1,1,1,2,3,1,1,2,3))
what I would want is a resulting data frame of:
data.frame(patient=c(1,2,3),grade.increase=c(0,1,1))
library(dplyr)
df %>%
  arrange(patient, period) %>%
  # group before mutate so that lag() never compares across patients
  group_by(patient) %>%
  mutate(grade.increase = case_when(grade > lag(grade) ~ TRUE, TRUE ~ FALSE)) %>%
  summarise(grade.increase = max(grade.increase))
Combining lag which checks the previous value with case_when allows us to identify each grade.increase.
Summarising the maximum of grade.increase for each patient gets the desired results as boolean calculations treat FALSE as 0 and TRUE as 1.
If you feel like doing this in base R, here's a solution that uses the split-apply-combine approach.
You use split to make a list with a separate data frame for each patient;
you use lapply to iterate a summarization function over each list element, where the summarization function uses diff to look at changes in grade and if and any to summarize; and then
you wrap the whole thing in do.call(rbind, ...) to collapse the resulting list into a data frame.
Here's what that looks like:
do.call(rbind, lapply(split(df, df[,"patient"]), function(i) {
data.frame(patient = i[,"patient"][1],
grade.increase = if (any(diff(i[,"grade"]) > 0)) 1 else 0 )
}))
Result:
  patient grade.increase
1       1              0
2       2              1
3       3              1
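A more compact base R variant of the same split-apply-combine idea can be sketched with aggregate(). This is an alternative, not part of the answers above; it assumes the rows are sorted by period within each patient, which the code enforces first.

```r
df <- data.frame(patient = c(1,1,1,2,2,3,3,3,3),
                 period  = c(1,2,3,1,3,1,3,4,5),
                 grade   = c(1,1,1,2,3,1,1,2,3))

# diff() only makes sense on time-ordered rows
df <- df[order(df$patient, df$period), ]

# For each patient: 1 if any consecutive change in grade is positive, else 0
res <- aggregate(grade ~ patient, data = df,
                 FUN = function(g) as.integer(any(diff(g) > 0)))
names(res)[2] <- "grade.increase"
res
```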

R function that creates indicator variable values unique between several columns

I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
Count <- Count + 1
respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
else {
assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
}
}
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and, for each non-NA value, check whether DRUGID_1 is already in the list respDRUGID. If it is not, create a new column variable 'r.DRUGID', set its value to 1, and increase the unique count by 1. Otherwise, if the value of DRUGID_1 is already in respDRUGID, just set r.DRUGID = 1.
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.
Taking a guess at your data and required result format, here is an approach using the tidyverse package:
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
# -patient gathers every column except patient; key only records the source column
gather(drug_df, key = "column", value = "DRUGID", -patient, na.rm = TRUE) %>%
  arrange(patient, DRUGID) %>%
  group_by(patient) %>%
  summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99 whose post helped think about the problem in another way.
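For the wide indicator format described in the question (one 0/1 column per drug, one row per patient), a base R sketch could look like the following. The toy data mirror the guessed example above; the r. prefix follows the question, and the helper names id_cols, ids, drugs and ind are made up for illustration.

```r
drug_df <- data.frame(patient  = c("A", "B", "C", "D"),
                      DRUGID_1 = c(1, 2, 2, 3),
                      DRUGID_2 = c(2, NA, 1, 1),
                      DRUGID_3 = c(3, NA, NA, 2))

# All DRUGID_* columns as one matrix (rows = patients)
id_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)
ids <- as.matrix(drug_df[id_cols])

# Every distinct drug code that appears anywhere
drugs <- sort(unique(na.omit(as.vector(ids))))

# One indicator column per drug: 1 if the code appears in any DRUGID_* column
ind <- sapply(drugs, function(d) as.integer(rowSums(ids == d, na.rm = TRUE) > 0))
colnames(ind) <- paste0("r.", drugs)

result <- cbind(drug_df, ind)
result
```

Because the row order is never changed, result stays aligned with the original observations, which matters when merging with a survey weight variable.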

targeting a single value of a dichotomous variable in R

I am trying to use the command assocstats() in order to get Cramer's V for 2 variables. This is not a problem as long as I target the entirety of both variables:
assocstats(table(democrat, sex))
Problems arise when I try to target only 1 specific value of the dichotomous variable sex, which consists of 1 and 2.
I thought that dplyr might be of help with the filter command, but
assocstats(table(democrat, filter(sex==1))
does not yield any results.
Does anybody know how I can target only 1 value of the variable sex in this case?
Many thanks
As an example, take the Arthritis data from library(vcd): filter the rows that match 'Male' (or 1 in your dataset), select the columns of interest ('Treatment' and 'Sex'), get the frequencies with table, and pass the result to assocstats.
library(vcd)
assocstats(table(Arthritis[Arthritis$Sex=='Male', c('Treatment', 'Sex')]))
Assuming that the OP has two vectors, 'democrat' and 'sex':
i1 <- sex ==1
assocstats(table(democrat[i1], sex[i1]))
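The same logical index can restrict any pair of vectors to one subgroup before tabulating. Here is a small sketch with invented values; union is a hypothetical third variable that does not appear in the question, added only to show a two-way table within sex == 1.

```r
# Invented example vectors; 'union' is hypothetical
democrat <- c(1, 0, 1, 1, 0, 0, 1, 0)
sex      <- c(1, 1, 2, 1, 2, 1, 2, 1)
union    <- c(1, 0, 1, 1, 0, 1, 0, 0)

# TRUE for the observations to keep
i1 <- sex == 1

# Apply the same index to every vector that goes into the table
tab <- table(democrat = democrat[i1], union = union[i1])
tab
```

A table built this way can then be passed to assocstats() from vcd in the same manner as above.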

R - boxplotting specific variables against one another

I am using the dative data frame within R, and I am trying to plot LengthOfRecipient against Modality for only the rows where PronomOfRec == 'nonpronominal'. I gathered the LengthOfRecipient values for those rows:
library('languageR')
lor.np = dative[dative$PronomOfRec == 'nonpronominal',]$LengthOfRecipient
I have tried nesting this subset function, and applied vectors, but I cannot figure out a way to then access the Modality column for only the items in lor.np and store it in mod.np, so that I can plot and analyze the data with:
boxplot(lor.np, mod.np)
I'm very new to R and the syntax is extremely confusing. Any help would be very appreciated. Thanks in advance!
It might be easier to select all the columns you want at once and then use the formula feature in boxplot rather than using vectors:
library('languageR')
lor.np <- dative[dative$PronomOfRec == 'nonpronominal',
c('LengthOfRecipient','Modality')]
head(lor.np)
# LengthOfRecipient Modality
# 2 2 written
# 3 1 written
# 5 2 written
# 6 2 written
# 7 2 written
# 11 2 written
## but you don't even need to select the columns:
lor.np <- dative[dative$PronomOfRec == 'nonpronominal', ]
boxplot(LengthOfRecipient ~ Modality, lor.np)
After looking at the data, you don't need droplevels here, but this is an example of when it may be useful, since subsetting on one Modality leaves the unused level in the factor:
dat1 <- dative[dative$Modality == 'written', ]
boxplot(LengthOfRecipient ~ Modality, dat1)
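To illustrate what droplevels() does in that situation, here is a minimal sketch on a toy data frame; dat, len and mode are made-up names and are not part of the dative data.

```r
# Toy data with a two-level factor
dat <- data.frame(len  = c(1, 2, 2, 3),
                  mode = factor(c("written", "written", "spoken", "written")))

# Subsetting keeps the unused level in the factor
sub <- dat[dat$mode == "written", ]
levels(sub$mode)

# droplevels() removes levels that no longer occur
sub <- droplevels(sub)
levels(sub$mode)
```

With the unused level dropped, a call such as boxplot(len ~ mode, sub) draws only the groups that actually contain data.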
