Automate basic calculations with residuals in R - r

I have some basic calculations I want to apply on residuals of a plm model but I am stuck on how to automate the steps for a lot of data.
Let's assume the input is a data.frame (df) with the following data:
Id Year Population Y X1 X2 X3
country A 2009 977612 212451.009 19482.7995 0.346657979 0.001023221
country A 2010 985332 221431.632 18989.3 0.345142551 0.001015205
country A 2011 998211 219939.296 18277.79286 0.344020453 0.001002106
country A 2012 1010001 218487.503 17916.2765 0.342434314 0.000990409
country B 2009 150291 177665.268 18444.04522 0.330864789 0.001940218
country B 2010 150841 183819.407 18042 0.327563461 0.001933143
country B 2011 152210 183761.566 17817.3515 0.32539255 0.001915756
country B 2012 153105 182825.112 17626.62261 0.321315437 0.001904557
country c 2009 83129 132328.034 17113.64268 0.359525557 0.005862866
country c 2010 83752 137413.878 16872.5 0.357854141 0.005819254
country c 2011 84493 136002.537 16576.17856 0.356479235 0.005768219
country c 2012 84958 133064.911 16443.3057 0.355246122 0.005736648
A model was applied and the residuals are stored:
fixed <- plm(Y ~ Y1 + X2 + X3,
data=df, drop.unused.levels = TRUE, index=c("Id", "Year"), model="within")
residuals <- resid(fixed)
In my next step, I want to calculate "weighted averages" of my residuals with:
with nit standing for the population in country i at time t and nt being the total population at t.
My approach so far is:
First I compute the total population nt for every year:
year_range <- seq(from=2009,to=2012,by=1)
tot_pop = NULL
for (n in year_range)
{
tot_pop[n] = with(df, sum(Population[Year == n]))
}
Before taking the sum of the "weighted" residuals, my next step would be to automate the calculation of my "new" residuals:
res1 <- df$Population[1]/tot_pop[2009] * residuals[1]
res2 <- df$Population[2]/tot_pop[2010] * residuals[2]
res3 <- df$Population[3]/tot_pop[2011] * residuals[3]
...
res12 <- df$Population[12]/tot_pop[2011] * residuals[12]
Edit: Applying the solution of JTT to my problem, the last step would then be:
year_range1 <- rep(year_range, 3)
df_res <- data.frame(year = year_range1, res=as.vector(res))
aggr_res <- aggregate(df_res$res, list(df_res$year), sum)
colnames(aggr_res) <- c("Year", "Aggregated residual")
Is that correct?
I have tried the lapply function and a double "for-loop" without success. I don't know how to do this. Your help would be appreciated. If my question is unclear, please comment and I will try to improve it.

First, instead of a for-loop, you might want to calculate the total population using the aggregate funtion, e.g.:
a<-aggregate(df$Population, list(df$Year), sum)
Notice the column names of a (Group.1 and x).
Then you could match the results in a to the data in df using the match()-function. It gives the matching row numbers, which can be used to subset data from df to the division before multiplying with the residuals. For example:
res<-df$Population/a$x[match(df$Year, a$Group.1)]*residuals
Now you should have a vector of "new" residuals in object res.

Related

A way to get Column Names as Row Names?

My goal is to plot a map with each point representing the year of the highest measured value. So for that I need the year as one value and the Station Name as Row Name.
I get to the point where I get the year of the maximum value for each Station but don´t know how to get the station name as Row Name.
My example is the following:
set.seed(123)
df1<-data.frame(replicate(6,sample(0:200,2500,rep=TRUE)))
date_df1<-seq(as.Date("1995-01-01"), by = "day", length.out = 2500)
test_sto<-cbind(date_df1, df1)
test_sto$date_df1<-as.Date(test_sto$date_df1)
test_sto<-test_sto%>% dplyr::mutate( year = lubridate::year(date_df1),
month = lubridate::month(date_df1),
day = lubridate::day(date_df1))
This is my Dataframe, i then applied the following steps:
To get all values above the treshold for each year and station:
test_year<-aggregate.data.frame(x=test_sto[2:7] > 120, by = list(test_sto$year), FUN = sum, na.rm=TRUE )
This works as it should, the nex is the following
m <- ncol(test_year)
Value <- rep(NA,m)
for (j in 2:m) {
idx<- which.max(test_year[,j])
Value[j] <- test_year[,1][idx]
}
test_test<-Value[2:m]
At the end of this, I get the following table:
x
1
1996
2
1996
3
1998
4
1996
5
1999
6
1999
But instead of the 1,2,3,4,5..I need there my Column Names (X1,X2,X3 etc.):
x
X1
1996
X2
1996
X3
1998
X4
1996
X5
1999
X6
1999
but this is the point where i´m struggeling.
I tried it with the following step:
test_year$max<-apply(test_year[2:7], 1, FUN = max)
apply(test_year[2:7], 2, FUN = max)
test_year2<-subset(test_year, ncol(2:7) == max(ncol(2:7)))
But i´m just getting an error message saying:
in max(ncol(2:7)):
non not-missing Argument for max; give -Inf back<
Maybe someone knows a work around! Thanks in advance!
The 'test_test' is just a vector. Its magnitude characterized by length and is a one 1 dimensional object which doesn't have row.names attribute. But, we can have names attribute
names(test_test) <- colnames(test_year)[-1]

How can I use only one year from my dummy?

I have some data with some individuals where I know their age (18-98) and which country they are from (1 = Germany , 2 = France, etc. ).
I have an other variable for which i want to see the effect eg. of 18 year old people from Germany.
With dummy(data$age, sep='_') and dummy(data$sg2, sep='_')[sg2 = country] I was able to create dummy variables for these two variables.
But while regressing, the output shows the effect now of every age and every country separate.
How can I combine 18 year olds from Germany that I see their effect on the other variable?
#Dummy age
x1 <- dummy(dat$age, sep='_')
#Dummy country
x2 <- dummy(dat$sg2, sep = '_')
fm <- lm(myvariable~x1+x2, data=dat)
summary(fm)
Estimate Std. Error t value Pr(>|t|)
x1age_18 1.547691 0.567995 2.725 0.006437 **
x1age_19 1.632648 0.567939 2.875 0.004047 **
x2sg2_1 0.083239 0.030118 2.764 0.005717 **
x2sg2_2 0.056555 0.030655 1.845 0.065063 .
This is what I get, bit how can i get x1age_18 & x2sg2_1 in one?

R: How to output subset calculations (n, %) using ddply

I hope you can help me with this problem: For my work I have to use R to analyze survey data. The data has a number of columns by which I have/want to group the data and then do some calculations, e.g. How many men or women do work at a certain department? And then calculate the number and percentage for each group. --> at department A work 42 people, whereof 30 women and 12 men, at department B work 70 people, whereof 26 women and 44 men.
I currently use the following code to output the data (using ddply):
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
library(plyr)
descriptive <- ddply( data, column_name,
function(x){
percentage_median_per_group(x, column_name)
percentage_median_per_group(x, column_name2)
}
)
print(data.frame(descriptive))
}
## give number, percentage and median per group_value in column
percentage_median_per_group <- function(data, column_name3){
library(plyr)
descriptive <- ddply( data, column_name3,
function(x){
c(
N <- nrow(x[column_name3]), #number
pct <- (N/nrow(data))*100 #percentage
#TODO: median
)
}
)
return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")
Now the data outputs like this:
Department Sex N % per sex
A f 30 71,4
m 12 28,6
B f 26 37,1
m 44 62,9
But, I want the output to look like this, so calculations take place and are printed in each substep:
Department N % per department Sex N % per sex
A 42 37,5 f 30 71,4
m 12 28,6
B 70 62,5 f 26 37,1
m 44 62,9
Does anyone have a suggestion of how I can do that, if possible even build it dynamic so I can potentially group it by the variables in multiple columns (e.g. department + sex + type of software + ...), but I would be happy if I can have it already like in the example =)
thanks!
EDIT
You can use this to generate example data:
n=100
sample_data = data.frame(department=sample(1:20,n,replace=TRUE), gender=sample(1:2,n,replace=TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")
V1 in the output stands for N (number) and V2 for %

R-sq values, linear regression of several trends within one dataset

I am running into a sticky spot trying to solve for variance accounted for by trend several times within a single data set.....
My data is structured like this
x <- read.table(text = "
STA YEAR VALUE
a 1968 457
a 1970 565
a 1972 489
a 1974 500
a 1976 700
a 1978 650
a 1980 659
b 1968 457
b 1970 565
b 1972 350
b 1974 544
b 1976 678
b 1978 650
b 1980 690
c 1968 457
c 1970 565
c 1972 500
c 1974 600
c 1976 678
c 1978 670
c 1980 750 " , header = T)
and I am trying to return something like this
STA R-sq
a n1
b n2
c n3
where n# is the corresponding r-squared value of the locations data in the original set....
I have tried
fit <- lm(VALUE ~ YEAR + STA, data = x)
to give the model of yearly trend of VALUE for each individual station over the years data is available for VALUE, within the master data set....
Any help would be greatly appreciated.... I am really stumped on this one and I know it is just a familiarity with R problem.
To get r-squared for VALUE ~ YEAR for each group of STA, you can take this previous answer, modify it slightly and plug-in your values:
# assuming x is your data frame (make sure you don't have Hmisc loaded, it will interfere)
models_x <- dlply(x, "STA", function(df)
summary(lm(VALUE ~ YEAR, data = df)))
# extract the r.squared values
rsqds <- ldply(1:length(models_x), function(x) models_x[[x]]$r.squared)
# give names to rows and col
rownames(rsqds) <- unique(x$STA)
colnames(rsqds) <- "rsq"
# have a look
rsqds
rsq
a 0.6286064
b 0.5450413
c 0.8806604
EDIT: following mnel's suggestion here are more efficient ways to get the r-squared values into a nice table (no need to add row and col names):
# starting with models_x from above
rsqds <- data.frame(rsq =sapply(models_x, '[[', 'r.squared'))
# starting with just the original data in x, this is great:
rsqds <- ddply(x, "STA", summarize, rsq = summary(lm(VALUE ~ YEAR))$r.squared)
STA rsq
1 a 0.6286064
2 b 0.5450413
3 c 0.8806604
#first load the data.table package
library(data.table)
#transform your dataframe to a datatable (I'm using your example)
x<- as.data.table(x)
#calculate all the metrics needed (r^2, F-distribution and so on)
x[,list(r2=summary(lm(VALUE~YEAR))$r.squared ,
f=summary(lm(VALUE~YEAR))$fstatistic[1] ),by=STA]
STA r2 f
1: a 0.6286064 8.462807
2: b 0.5450413 5.990009
3: c 0.8806604 36.897258
there's only one r-squared value, not three.. please edit your question
# store the output
y <- summary( lm( VALUE ~ YEAR + STA , data = x ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]
# or are you looking to run three separate
# lm() functions on 'a' 'b' and 'c' ..where this would be the first?
y <- summary( lm( VALUE ~ YEAR , data = x[ x$STA %in% 'a' , ] ) )
# access the attributes of `y`
attributes( y )
y$r.squared
y$adj.r.squared
y$coefficients
y$coefficients[,1]

Aggregate or summarize to get ratios

The following is a toy problem that demonstrates my question.
I have a data frame that contains a bunch of employees; for each employee, it has a name, salary, gender and state.
aggregate(salary ~ state) # Returns the average salary per state
aggregate(salary ~ state + gender, data, FUN = mean) # Avg salary per state/gender
What I actually need is a summary of the fraction of the total salary earned by women in each state.
aggregate(salary ~ state + gender, data, FUN = sum)
returns the total salary earned by women (and men) in each state ,but what I really need is salary_w / salary_total on a per-state level. I can write a for-loop, etc -- but I am wondering if there is some way to use aggregate to do that.
Another option would be using plyr. ddply() expects a data.frame as an input and will return a data.frame as an output. The second argument is how you want to split the data frame. The third argument is what we want to apply to the chunks, here we are using summarise to create a new data.frame from the existing data.frame.
library(plyr)
#Using the sample data from kohske's answer above
> ddply(d, .(state), summarise, ratio = sum(salary[gender == "Woman"]) / sum(salary))
state ratio
1 1 0.5789860
2 2 0.4530224
probably reshape or reshape2 would help your work.
Here is a sample script:
library(reshape2) # from CRAN
# sample data
d <- data.frame(expand.grid(state=gl(2,2),gender=gl(2,1, labels=c("Men","Wemon"))),
salaly=runif(8))
d2 <- dcast(d, state~gender, sum)
d2$frac <- d2$Wemon/(d2$Men+d2$Wemon)
The ave function is good for problems like this.
Data$ratio <- ave(Data$salary, Data$state, Data$gender, FUN=sum) /
ave(Data$salary, Data$state, FUN=sum)
Another solution is to use xtabs and prop.table:
prop.table(xtabs(salary ~ state + gender,data),margin=1)
It's generally not advisable to name your datasets, "data", so I will change the problem slightly to name the dataset "dat1".
with( subset(dat1, gender="Female"), aggregate(salary, state, sum )/
# should return a vector
with( data=dat1, aggregate(salary, state, sum )
# using R's element-wise division
I think you are also using attach and there are good reasons to reconsider that decision, despite what you might read in Crawley.
Since you want the results on a per state basis the tapply might be what you want.
To illustrate let's generate some arbitrary data to play with:
set.seed(349) # For replication
n <- 20000 # Sample size
gender <- sample(c('M', 'W'), size = n, replace = TRUE) # Random selection of gender
state <- c('AL','AK','AZ','AR','CA','CO','CT','DE','DC','FL','GA','HI',
'ID','IL','IN','IA','KS','KY','LA','ME','MD','MA','MI','MN',
'MS','MO','MT','NE','NV','NH','NJ','NM','NY','NC','ND','OH',
'OK','OR','PA','RI','SC','SD','TN','TX','UT','VT','VA','WA',
'WV','WI','WY') # All US states
state <- sample(state, size = n, replace = TRUE) # Random selection of the states
state_index <- tapply(state, state) # Just for the data generatino part ...
gender_index <- tapply(gender, gender)
# Generate salaries
salary <- runif(length(unique(state)))[state_index] # Make states different
salary <- salary + c(.02, -.02)[gender_index] # Make gender different
salary <- salary + log(50) + rnorm(n) # Add mean and error term
salary <- exp(salary) # The variable of interest
What you asked for, the sum of salaries for the women per state and the sum of total salaries per state:
salary_w <- tapply(salary[gender == 'W'], state[gender == 'W'], sum)
salary_total <- tapply(salary, state, sum)
Or if it is in a data-frame:
salary_w <- with(myData, tapply(salary[gender == 'W'], state[gender == 'W'], sum))
salary_total <- with(myData, tapply(salary, state, sum))
Then the answer is:
> salary_w / salary_total
AK AL AR AZ CA CO CT DC
0.4667424 0.4877013 0.4554831 0.4959573 0.5382478 0.5544388 0.5398104 0.4750799
DE FL GA HI IA ID IL IN
0.4684846 0.5365707 0.5457726 0.4788805 0.5409347 0.4596598 0.4765021 0.4873932
KS KY LA MA MD ME MI MN
0.5228247 0.4955802 0.5604342 0.5249406 0.4890297 0.4939574 0.4882687 0.5611435
MO MS MT NC ND NE NH NJ
0.5090843 0.5342312 0.5492702 0.4928284 0.5180169 0.5696885 0.4519603 0.4673822
NM NV NY OH OK OR PA RI
0.4391634 0.4380065 0.5366625 0.5362918 0.5613301 0.4583937 0.5022793 0.4523672
SC SD TN TX UT VA VT WA
0.4862358 0.4895377 0.5048047 0.4443220 0.4881062 0.4880047 0.5338397 0.5136393
WI WV WY
0.4787588 0.5495602 0.5029816

Resources