Using lm in dlply while sorting by variable - R

I have the following data, called dataframe, with these columns:
planid (each plan indicated with a number from 1 to 126),
US_FRAC (a value between 0 and 1 for each fund in each year), and
market.premium (a value indicating the market premium for every fund in every year).
For every planid I want to run a regression of US_FRAC against market.premium, as I have 10 years of data for every planid.
I used the following code:
mods=dlply(dataframe,.('planid'),lm,formula=ADJ_US_FRAC ~ market.premium)
I need both the t-statistic and the coefficient for every planid in a table, but I could only find code for the coefficient. I must have done something wrong, as my output contains only a single intercept value and nothing else.

Removing the quotes around planid and the ADJ_ prefix before US_FRAC worked for this sample data:
dataframe <- data.frame(planid=round(runif(1000)*126), US_FRAC=runif(1000), market.premium=rnorm(1000))
dlply(dataframe,.(planid),lm,formula=US_FRAC ~ market.premium)
summary() performs the coefficient t-tests. You can create data frames with the fits with something like:
C <- ddply(dataframe,.(planid),function(x) {summary(lm(formula=US_FRAC ~ market.premium,data=x))$coefficients['(Intercept)', ]})
Beta <- ddply(dataframe,.(planid),function(x) {summary(lm(formula=US_FRAC ~ market.premium,data=x))$coefficients['market.premium', ]})
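If both the estimates and their t-statistics are wanted in one table per planid, a single ddply call can pull them out of the coefficient matrix together. A minimal sketch, using the sample dataframe above (the output column names are illustrative):
library(plyr)
# one row per planid: intercept, slope, and their t-statistics
fit_stats <- ddply(dataframe, .(planid), function(x) {
  cf <- summary(lm(US_FRAC ~ market.premium, data = x))$coefficients
  c(intercept   = cf["(Intercept)", "Estimate"],
    intercept_t = cf["(Intercept)", "t value"],
    slope       = cf["market.premium", "Estimate"],
    slope_t     = cf["market.premium", "t value"])
})
head(fit_stats)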
kind greetings

Related

Why do the aggregate data frame and aggregate formula methods not return the same results?

I just started learning R last month and I am learning the aggregate functions.
To start off, I have a dataset called property_data and I am trying to get the mean price per city.
I first used the formula method of aggregate:
mean_price_per_city_1 <- aggregate(PRICE ~ PROPERTYCITY,
property_data, mean)
The results are as follows (just the head):
PROPERTYCITY    PRICE
ALLISON PARK    193814.08
AMBRIDGE         62328.92
ASPINWALL       226505.50
BADEN           400657.52
BAIRDFORD        59337.37
Then I decided to try the data frame method:
mean_price_per_city_2 <- aggregate(list(property_data$PRICE),
by = list(property_data$PROPERTYCITY),
FUN = mean)
The results are as follows (just the head):
Group.1         c.12000L.. 1783L..4643L..
ALLISON PARK    NA
AMBRIDGE         62328.92
ASPINWALL       226505.50
BADEN           400657.52
BAIRDFORD        59337.37
I thought that the two methods would return the same results. However, I noticed that when I used the data frame method, there are NAs in the second column.
I checked whether there are NAs in the PRICE column, but there are none, so I am lost as to why the two methods don't return the same values.
You have two issues.
First, aggregate(list(property_data$PRICE), by = list(property_data$PROPERTYCITY), FUN = mean) should just have property_data$PRICE without the list(); only the by= argument must be a list. That is why your column name is so strange.
Second, as documented in the manual page (?aggregate), the formula method has a default value of na.action = na.omit, but the method for class data.frame does not. Since you have at least one missing value in the ALLISON PARK group, the formula command deleted that value, but the second command did not, so the result for ALLISON PARK is NA.
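For completeness, a data-frame call that matches the formula method would look roughly like this (a sketch against the asker's property_data; dropping the list() fixes the column name, and na.rm = TRUE makes mean() ignore the missing ALLISON PARK price):
mean_price_per_city_2 <- aggregate(property_data$PRICE,
                                   by = list(PROPERTYCITY = property_data$PROPERTYCITY),
                                   FUN = mean, na.rm = TRUE)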

How to alter R code in an efficient way

I have a sample created as follows:
survival1a= data.frame(matrix(vector(), 50, 2,dimnames=list(c(), c("Id", "district"))),stringsAsFactors=F)
survival1a$Id <- 1:nrow(survival1a)
survival1a$district<- sample(1:4, size=50, replace=TRUE)
This sample has 50 individuals from 4 different districts.
I have probabilities (a matrix) that show the likelihood of migration from one district to another (Migdata), as follows:
district    prob1      prob2      prob3      prob4
1           0.83790    0.08674    0.05524    0.02014
2           0.02184    0.88260    0.03368    0.06191
3           0.01093    0.03565    0.91000    0.04344
4           0.03338    0.06933    0.03644    0.86090
I merge these probabilities with my data with this code:
survival1a<-merge( Migdata,survival1a, by.x=c("district"), by.y=c("district"))
By the end of the year, I would like to know which district each person resides in, based on the migration probabilities I have (Migdata).
I have already written code that works perfectly, but with big data it is very time-consuming since it is based on a loop:
for (k in 1:nrow(survival1a)){
survival1a$migration[k]<-sample(1:4, size=1,replace = TRUE,prob=survival1a[k,2:5])}
Now I want to write the code in a way that is not based on a loop and shows every person's district by the end of the year.
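One loop-free sketch (not part of the original thread; it assumes, as in the loop above, that columns 2:5 of the merged survival1a hold the four migration probabilities) draws one uniform number per person and inverts the cumulative probabilities:
# cumulative migration probabilities per person (columns = districts 1..4)
cum <- t(apply(as.matrix(survival1a[, 2:5]), 1, cumsum))
u <- runif(nrow(survival1a))
# a person's new district is the first one whose cumulative probability exceeds the draw
survival1a$migration <- pmin(rowSums(u > cum) + 1L, ncol(cum))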

R logit model variable choice

I have a dataset where each household has an observation for each of five power generation methods (so each household appears 5 times). There is a dummy variable marking which method they use, so a sample household might look like --
hh_id choice choice_dummy
1 Grid 0
1 Diesel 0
1 Ownsolar 1
1 Solargrid 0
1 None 0
I have some other variables (price, avail, load, peakhours) that I need to run in a logit model to see what is influencing the decision to pick a particular choice. I know to use glm() for this, but I'm unsure of what to put in for the dependent variable. "choice" doesn't by itself capture the decision that was made; "choice_dummy" marks the decision, but on its own it doesn't say which option it refers to.
I can't merely filter for choice_dummy being 1 because I will lose the values for all the other variables by doing that. Does anyone know how I would go about running a logit model that relates the probability of each household choosing an energy source to the variables "price," "avail," "load," and "peakhours", ideally with code?
Replying to OP's comment 5/22/20:
hh_id<-c("1","1","1","1","1")
choice <- c("Grid","Diesel","Ownsolar","Solargrid","None")
choice_dummy <- c("0","0","1","0","0")
df <- data.frame(hh_id,choice,choice_dummy)
library(reshape2)
df2 <- dcast(df, hh_id ~ choice)
df2$power_choice <- ifelse(df2$Grid==1,"Grid",
ifelse(df2$Diesel==1,"Diesel",
ifelse(df2$Ownsolar==1,"Ownsolar",
ifelse(df2$Solargrid==1,"Solargrid",
ifelse(df2$None==1,"None",NA)))))
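To then relate the probability of each choice to price, avail, load, and peakhours directly, one common option for this kind of long-format choice data is a conditional (McFadden) logit rather than a plain glm(). A rough sketch with survival::clogit, assuming hh_long is the full long-format dataset containing those covariates (the name hh_long and the numeric conversion are assumptions, not part of the reply above):
library(survival)
# choice_dummy must be numeric 0/1; each household forms one stratum of 5 alternatives
hh_long$choice_dummy <- as.numeric(as.character(hh_long$choice_dummy))
fit <- clogit(choice_dummy ~ price + avail + load + peakhours + strata(hh_id),
              data = hh_long)
summary(fit)
The mlogit package offers a similar model with more flexibility if alternative-specific intercepts or individual-specific variables are needed.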

Juxtaposing Replicate Data

I have provided a sample dataset that I have arranged in column format (called "full.table").
These data were extracted from a 96-well PCR plate, & while collecting my data, I always ran a duplicate experiment, meaning each variable (aka test) has 1 replicate. I would like to take all replicates and juxtapose them (have them be side by side), which would allow me to easily visualize replicates next to each other, and finally calculate an average value for the variable "Cq" between the two.
The complications stem from having done multiple tests over several days (complication one) and NOT having my samples always run in the same fashion on the PCR plate (complication two). Typically, as you see in my data set below, Well A1 has a duplicate in Well B1; however, this is not always the case. Occasionally, Well A7 matches Well A8 (and NOT B7).
Replicates were always run on the same day, so an important variable here is "date", which I added via R before uploading to Stack Exchange. I am confused about how to rearrange the data to get my desired result (not even sure where to start).
I have provided an example of what I would like in the end, called “sample.finished.table”
Logically, with 768 observations in this example, this should divide the data in two, resulting in 384 total lines of data (385 with the header).
I appreciate any feedback. Thank you
full.table<- read.table("https://pastebin.com/raw/kTQhuttv", header=T, sep="")
sample.finished.table <- read.table("https://pastebin.com/raw/Phg7C9xD", header=T, sep="")
You can use dplyr here to group by sample and extract the requested values:
library(dplyr)
full.table %>% group_by(sample,date) %>% summarise(
Well1 = first(Well), Cq1 = first(Cq),
Well2 = last(Well), sample1 = last(sample), Cq2 = last(Cq), Cq_mean = mean(Cq[Cq > 0]))

Logical in my R function always returns "TRUE"

I'm trying to write an R function that calculates whether a data subject is eligible for subsidies based on their income (X_INCOMG), the size of their household (household calculated from CHILDREN and NUMADULT), and the federal poverty limit for their household size (fpl_matrix). I use a number of if statements to evaluate whether the record is eligible, but for some reason my code is labeling everyone as eligible, even though I know that's not true. Could someone else take a look at my code?
Note that the coding for the variable X_INCOMG denotes income categories (less than $15000, 25-35000, etc).
#Create a sample data set
sampdf=data.frame(NUMADULT=sample(3,1000,replace=T),CHILDREN=sample(0:5,1000,replace=T),X_INCOMG=sample(5,1000,replace=T))
#Introducing some "impurities" into the data so it's more realistic
sampdf[sample(1000,3),'CHILDREN']=13
sampdf[sample(1000,3),'CHILDREN']=NA
sampdf[sample(1000,3),'X_INCOMG']=9
#this is just a matrix of the federal poverty limit, which is based on household size
fpl_2004=matrix(c(
1,9310,
2,12490,
3,15670,
4,18850,
5,22030,
6,25210,
7,28390,
8,31570,
9,34750,
10,37930,
11,41110),byrow=T,ncol=2)
##################here is the function I'm trying to create
fpl250 = function(data, fpl_matrix, add_limit){ #add_limit is the money you add on for every extra person beyond a household size of 11
  data[which(is.na(data$CHILDREN)), 'CHILDREN'] = 99 #This code wasn't liking NAs so I'm coding NA as 99
  data$household = data$CHILDREN + data$NUMADULT #calculate household size
  for(i in seq(nrow(data))){
    if(data$household[i] <= 11){
      data$bcccp_cutoff[i] = 2.5*fpl_matrix[data$household[i], 2] #this calculates what the subsidy cutoff should be, which is 250% of the FPL
    } else {
      data$bcccp_cutoff[i] = 2.5*((data$household[i]-11)*add_limit + fpl_matrix[11, 2])
    }
  }
  data$incom_elig = 'yes' #setting the default value as 'yes', then changing each record to 'no' if the income is definitely more than the eligibility cutoff
  for(i in seq(nrow(data))){
    if(data$X_INCOMG[i]=='1' | data$X_INCOMG[i]=='9'){data$incom_elig='yes'} #This is the lowest income category and almost all of these people will qualify
    if(data$X_INCOMG[i]=='2' & data$bcccp_cutoff[i]<15000){data$incom_elig[i]='no'}
    if(data$X_INCOMG[i]=='3' & data$bcccp_cutoff[i]<25000){data$incom_elig[i]='no'}
    if(data$X_INCOMG[i]=='4' & data$bcccp_cutoff[i]<35000){data$incom_elig[i]='no'}
    if(data$X_INCOMG[i]=='5' & data$bcccp_cutoff[i]<50000){data$incom_elig[i]='no'}
    if(data$household[i]>90){data$incom_elig[i]='no'}
  }
  return(data)
}
dd=fpl250(sampdf,fpl_2004,3180)
with(dd,table(incom_elig)) #it's coding all except one as eligible
I know this is a lot of code to digest, but I appreciate whatever help you have to offer!
I find it easier to get the logic working well outside of a function first, then wrap it in a function once it is all working well. My code below does this.
I think one issue was you had the literal comparisons to X_INCOMG as strings (data$X_INCOMG[i]=='1'). That field is a numeric in your sample code, so remove the quotes. Try using a coded factor for X_INCOMG as well. This will make your code easier to manage later.
There is no need to loop over each row in the data frame.
#put the poverty level data in a data frame for merging
fpl_2004.df<- as.data.frame(fpl_2004)
names(fpl_2004.df)<-c("household","pov.limit")
#Include cutoffs
fpl_2004.df$cutoff = 2.5 * fpl_2004.df$pov.limit
add_limit=3181
#compute household size (if NA's this will skip them)
sampdf$household = numeric(nrow(sampdf))
cc<-which(complete.cases(sampdf))
sampdf$household[cc] = sampdf$NUMADULT[cc] + sampdf$CHILDREN[cc]
#get max household and fill fpl_2004 frame
max.hh<-max(sampdf$household,na.rm=TRUE)
#get the 11 person poverty limit
fpl11=subset(fpl_2004.df,household==11)$pov.limit
#rows to fill out the data frame
append<-data.frame(household=12:max.hh,pov.limit=numeric(max.hh-12+1),
cutoff=2.5 *(((12:max.hh)-11)*add_limit+fpl11))
fpl_2004.df<- rbind(fpl_2004.df,append)
#merge the two data frames
sampdf<- merge(sampdf,fpl_2004.df, by="household",all.x=TRUE)
#Add a logical variable to hold the eligibility
sampdf$elig <- logical(nrow(sampdf))
#compute eligibility
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 1,"elig"] = TRUE
sampdf[!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 9,"elig"] = TRUE
#for clarity define variable of what to subset
lvl2 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 2
lvl2 <- lvl2 & !is.na(sampdf$cutoff) & sampdf$cutoff>=15000
#set the eligibility (note the initial value was false thus cutoff logic reversed)
sampdf[lvl2,"elig"] = TRUE
#continue computing these
lvl3 <-!is.na(sampdf$X_INCOMG) & sampdf$X_INCOMG == 3
lvl3 <- lvl3 & !is.na(sampdf$cutoff) & sampdf$cutoff>=25000
sampdf[lvl3,"elig"] = TRUE
Alternately you could load in a small data frame with the cutoff comparison values (15000; 25000; 35000 etc) and the X_INCOMG. Then merge by X_INCOMG, as I did with the household size, and set all the values in one go, like the line below. You may need to use complete.cases again.
sampdf$elig = sampdf$cutoff >= sampdf$comparison.value
You will then have elig == FALSE for any incomplete cases, which will need further investigation.
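The lookup-and-merge alternative from the previous paragraph could look roughly like this (a sketch; the income_lookup name and its comparison values are illustrative, taken from the cutoffs used above, and income categories 1 and 9 stay as already set):
income_lookup <- data.frame(X_INCOMG = 2:5,
                            comparison.value = c(15000, 25000, 35000, 50000))
sampdf <- merge(sampdf, income_lookup, by = "X_INCOMG", all.x = TRUE)
# complete.cases guards the comparison; rows without a comparison value are left untouched
ok <- complete.cases(sampdf[, c("cutoff", "comparison.value")])
sampdf$elig[ok] <- sampdf$cutoff[ok] >= sampdf$comparison.value[ok]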
