I am trying to group my data by Year and CountyID and then use splinefun (cubic spline interpolation) on each subset. I am open to ideas, but splinefun is a must and cannot be changed.
Here is the code I am trying to use:
age <- seq(from = 0, by = 5, length.out = 18)
TOT_POP <- df %.%
group_by(unique(df$Year), unique(df$CountyID) %.%
splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
Here is a sample of my data. Year = 2010:2013, Agegrp = 1:17, and CountyID covers all counties in the US.
CountyID Year Agegrp TOT_POP
1001 2010 1 3586
1001 2010 2 3952
1001 2010 3 4282
1001 2010 4 4136
1001 2010 5 3154
What I am doing is taking Agegrp 1:17 and splitting each group into individual years of age, 0 - 84. Right now each group represents 5 years. splinefun allows me to do this while providing a level of mathematical rigour to the process, i.e., it allows me to provide a population total for each single year of age in each individual county in the US.
Lastly, the splinefun code works by itself, but within the group_by function it does not. It produces:
Error: wrong result size(4), expected 68 or 1.
The splinefun code the way I am using it works like this
TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)),
method = "hyman")
TOT_POP = pmax(0, diff(TOT_POP(c(0:85))))
This was tested on one CountyID for one Year. I need to iterate this process over "x" number of years and roughly 3200 counties.
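For anyone following along, here is a minimal sketch of the single county-year step described above, with a check that the single-year values add back up to the five-year totals. The first five totals come from the sample rows; the remaining twelve are made-up placeholders, and the names group_totals, knots, cum_pop and regrouped are only for illustration.

```r
# One county-year: 17 five-year totals.
# First five values are from the sample rows; the rest are made up.
group_totals <- c(3586, 3952, 4282, 4136, 3154, rep(3000, 12))

# Knots at the age-group boundaries: 0, 5, ..., 85
knots <- seq(from = 0, by = 5, length.out = 18)

# Monotone (Hyman) spline through the cumulative population
cum_pop <- splinefun(knots, c(0, cumsum(group_totals)), method = "hyman")

# Differences of the cumulative curve give one value per single year of age (0..84)
single_year <- pmax(0, diff(cum_pop(0:85)))

# Summing back into blocks of five should recover the original totals
regrouped <- tapply(single_year, rep(1:17, each = 5), sum)
all.equal(as.numeric(regrouped), group_totals)
```

Because the spline passes through the cumulative totals at every boundary, each block of five single-year values sums to its original group total.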
# Reproducible data set
set.seed(22)
df = data.frame( CountyID = rep(1001:1005,each = 100),
Year = rep(2001:2010, each = 10),
Agegrp = sample(1:17, 500, replace=TRUE),
TOT_POP = rnorm(500, 10000, 2000))
# Convert Agegrp to age
df$Agegrp = df$Agegrp*5
colnames(df)[3] = "age"
# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[,"age"], x[,"TOT_POP"]))
# Use the spline functions to interpolate populations for all years between 0 and 85
new.split.dfs = list()
for( i in 1:length(split.dfs)) {
new.split.dfs[[i]] = data.frame( CountyID=split.dfs[[i]]$CountyID[1],
Year=split.dfs[[i]]$Year[1],
age=0:85,
TOT_POP=spline.funs[[i]](0:85))
}
# Does this do what you want? If so, then it will be
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age TOT_POP
# 1 1001 2001 0 909033.4
# 2 1001 2001 1 833999.8
# 3 1001 2001 2 763181.8
# 4 1001 2001 3 696460.2
# 5 1001 2001 4 633716.0
# 6 1001 2001 5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age TOT_POP
# 81 1002 2001 80 10201.693
# 82 1002 2001 81 9529.030
# 83 1002 2001 82 8768.306
# 84 1002 2001 83 7916.070
# 85 1002 2001 84 6968.874
# 86 1002 2001 85 5923.268
First, I believe I was using the wrong wording for what I was trying to achieve, my apologies; group_by was never going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved the issue:
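library(plyr)  # provides colwise() and ddply()
# Spline the cumulative totals, then difference to get single-year values
interpolate <- function(x, ageVector){
result <- splinefun(ageVector,
c(0, cumsum(x)), method = "hyman")
diff(result(c(0:85)))
}
mainFunc <- function(df){
age <- seq(from = 0, by = 5, length.out = 18)
# Every column except the identifiers gets interpolated
colNames <- setdiff(colnames(df),
c("Year","CountyID","Agegrp"))
colWiseSpline <- colwise(interpolate, .cols = true,
age)(df[ , colNames])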
cbind(data.frame(
Year = df$Year[1],
County = df$CountyID[1],
Agegrp = 0:84
),
colWiseSpline
)
}
CompleteMainRaw <- ddply(.data = df,
.variables = .(CountyID, Year),
.fun = mainFunc)
The code now takes each county by year and runs splinefun on that subset of the population data. At the same time it creates a data.frame with the results, i.e., it splits the data from 17 age groups into 85 single-year ages while apportioning the totals appropriately, which is what splinefun does.
Thanks!
I have a data table with several columns.
Let's say:
Location, which may include Los Angeles, etc.
age_Group, let's say (young, child, teenager), etc.
year = (2000, 2001, ..., 2015)
month = c(jan, ..., dec)
I would like to group_by them and see how many people have spent money
in certain intervals. Let's say I have intervals of interval_1 = (1, 100), (100, 1000), ..., interval_20 = (1000, infinity)
How should I proceed? What should I do after the following?
data %>% group_by(location, age_Group, year, month)
sample:
location age_gp year month spending
LA child 2000 1 102
LA teen 2000 1 15
LA teen 2000 10 9
NY old 2000 11 1000
NY old 2010 2 1000000
NY teen 2020 3 10
desired output
LA, child, 2000, jan interval_1
LA, child, 2000, feb interval_20
...
NY OLD 2015 Dec interval_1
The last column has to be determined by adding up the spending of all people belonging to the same city, age_group, year and month.
You can first create a new column (spending_cat) using, for example, the cut function. Afterwards you can add the new variable as a grouping variable, and then you just need to count:
library(dplyr)
df <- data.frame(group = sample(letters[1:4], size = 1000, replace = TRUE),
spending = rnorm(1000))
df %>%
mutate(spending_cat = cut(spending, breaks = c(-5:5))) %>%
group_by(group, spending_cat) %>%
summarise(n_people = n())
# A tibble: 26 x 3
# Groups: group [?]
group spending_cat n_people
<fct> <fct> <int>
1 a (-3,-2] 6
2 a (-2,-1] 36
3 a (-1,0] 83
4 a (0,1] 78
5 a (1,2] 23
6 a (2,3] 10
7 b (-4,-3] 1
8 b (-3,-2] 4
9 b (-2,-1] 40
10 b (-1,0] 78
# … with 16 more rows
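Since the question asks for the interval of the summed spending per group, a base R sketch of "sum first, then cut" might look like the following. The rows are the sample from the question; the three break points and the interval labels are my own assumption, because the full list of twenty intervals is not given.

```r
# Sample rows from the question
spend <- data.frame(
  location = c("LA", "LA", "LA", "NY", "NY", "NY"),
  age_gp   = c("child", "teen", "teen", "old", "old", "teen"),
  year     = c(2000, 2000, 2000, 2000, 2010, 2020),
  month    = c(1, 1, 10, 11, 2, 3),
  spending = c(102, 15, 9, 1000, 1000000, 10)
)

# Total spending per location / age group / year / month
totals <- aggregate(spending ~ location + age_gp + year + month,
                    data = spend, FUN = sum)

# Assumed breaks: (0,100], (100,1000], (1000,Inf]; cut() is right-closed by default
totals$interval <- cut(totals$spending,
                       breaks = c(0, 100, 1000, Inf),
                       labels = c("interval_1", "interval_2", "interval_3"))
totals
```

Note that with right-closed intervals a total of exactly 1000 falls in the second interval; pass right = FALSE to cut if the boundaries should go the other way.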
I found that dplyr is speedy and simple for aggregating and summarising data, but I can't figure out how to solve the following problem with it.
Given these data frames:
df_2017 <- data.frame(
expand.grid(1:195,1:65,1:39),
value = sample(1:1000000,(195*65*39)),
period = rep("2017",(195*65*39)),
stringsAsFactors = F
)
df_2017 <- df_2017[sample(1:(195*65*39),450000),]
names(df_2017) <- c("company", "product", "acc_concept", "value", "period")
df_2017$company <- as.character(df_2017$company)
df_2017$product <- as.character(df_2017$product)
df_2017$acc_concept <- as.character(df_2017$acc_concept)
df_2017$value <- as.numeric(df_2017$value)
ratio_df <- data.frame(concept=c("numerator","numerator","numerator","denom", "denom", "denom","name"),
ratio1=c("1","","","4","","","Sales over Assets"),
ratio2=c("1","","","5","6","","Sales over Expenses A + B"), stringsAsFactors = F)
where the columns in df_2017 are:
company = This is a categorical variable with companies from 1 to 195
product = This is a categorical variable with home appliance products from 1 to 65. For example, 1 could be equal to irons, 2 to televisions, etc.
acc_concept = This is a categorical variable with accounting concepts from 1 to 39. For example, 1 would be equal to "Sales", 2 to "Total Expenses", 3 to "Returns", 4 to "Assets", etc.
value = This is a numeric variable, with USD from 1 to 100.000.000
period = Categorical variable. Always 2017
As the expand.grid implies, the combinations of company - product - acc_concept are never duplicated, but it could happen that certain companies do not have every company - product - acc_concept combination. That's the reason for the code line "df_2017 <- df_2017[sample(1:(195*65*39),450000),]", and that's why the output could turn out to be NA (see below).
And where the columns in ratio_df are:
Concept = which acc_concept corresponds to numerator, which one to
denominator, and which is name of the ratio
ratio1 = acc_concept and name for ratio1
ratio2 = acc_concept and name for ratio2
I want to calculate 2 ratios (ratio_df) between acc_concept, for each product within each company.
For example:
I take the first ratio "acc_concepts" and "name" from ratio_df:
num_acc_concept <- ratio_df[ratio_df$concept == "numerator", 2]
denom_acc_concept <- ratio_df[ratio_df$concept == "denom", 2]
ratio_name <- ratio_df[ratio_df$concept == "name", 2]
Then I calculate the ratio for one product of one company, just to show you what I want to do:
ratio1_value <- sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% num_acc_concept, 4]) / sum(df_2017[df_2017$company == 1 & df_2017$product == 1 & df_2017$acc_concept %in% denom_acc_concept, 4])
Output:
output <- data.frame(Company="1", Product="1", desc_ratio=ratio_name, ratio_value = ratio1_value, stringsAsFactors = F)
As I said before, I want to do this for each product within each company.
The output data.frame could be something like this (the ratios aren't the true ones because I haven't done the calculations yet):
company product desc_ratio ratio_value
1 1 Sales over Assets 0.9303675
1 2 Sales over Assets 1.30
1 3 Sales over Assets NaN
1 4 Sales over Assets Inf
1 5 Sales over Assets 2.32
1 6 Sales over Assets NA
.
.
.
1 1 Sales over Expenses A + B 3.25
.
.
.
2 1 Sales over Assets 0.256
and so on...
NaN when ratio is 0 / 0
Inf when ratio is number / 0
NA when there is no data for certain company and product.
I hope I have made myself clear this time :)
Is there any way to solve this row problem with dplyr? Should I cast df_2017 before mutating? If so, what is the best way to cast it?
Any help would be welcome!
This is one way of doing it. At the end I timed the code on all of your records.
First create a function to create all the ratios. Do note, this function is only useful inside the dplyr code.
ratio <- function(data){
result <- data.frame(desc_ratio = rep(NA, ncol(ratio_df) -1), ratio_value = rep(NA, ncol(ratio_df) -1))
for(i in 2:ncol(ratio_df)){
num <- ratio_df[ratio_df$concept == "numerator", i]
denom <- ratio_df[ratio_df$concept == "denom", i]
result$desc_ratio[i-1] <- ratio_df[ratio_df$concept == "name", i]
result$ratio_value[i-1] <- sum(ifelse(data$acc_concept %in% num, data$value, 0)) / sum(ifelse(data$acc_concept %in% denom, data$value, 0))
}
return(result)
}
Then use dplyr, tidyr and purrr to put everything together: group the data, nest the columns the function needs, run the function on the nested data with mutate, drop the nested data that is no longer needed, and unnest to get the output you want. I leave the sorting up to you.
library(dplyr)
library(purrr)
library(tidyr)
output <- df_2017 %>%
group_by(company, product, period) %>%
nest() %>%
mutate(ratios = map(data, ratio)) %>%
select(-data) %>%
unnest(ratios)
output
# A tibble: 25,350 x 5
company product period desc_ratio ratio_value
<chr> <chr> <chr> <chr> <dbl>
1 103 2 2017 Sales over Assets 0.733
2 103 2 2017 Sales over Expenses A + B 0.219
3 26 26 2017 Sales over Assets 0.954
4 26 26 2017 Sales over Expenses A + B 1.01
5 85 59 2017 Sales over Assets 4.14
6 85 59 2017 Sales over Expenses A + B 1.83
7 186 38 2017 Sales over Assets 7.85
8 186 38 2017 Sales over Expenses A + B 0.722
9 51 25 2017 Sales over Assets 2.34
10 51 25 2017 Sales over Expenses A + B 0.627
# ... with 25,340 more rows
Time it took to run this code on my machine measured with system.time:
user system elapsed
6.75 0.00 6.81
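One caveat on the NaN / Inf / NA distinction from the question: a sum over no matching rows is 0, so a group with neither concept present gives 0 / 0 = NaN rather than NA. Below is a small base R sketch of one way to separate the cases. safe_ratio is a hypothetical helper, and treating "no relevant concept present" as NA is my assumption about what is wanted.

```r
# Hypothetical helper: ratio of summed values, with NA when the group
# contains none of the numerator or denominator concepts at all.
safe_ratio <- function(values, concepts, num, denom) {
  in_num   <- concepts %in% num
  in_denom <- concepts %in% denom
  if (!any(in_num) && !any(in_denom)) return(NA_real_)
  sum(values[in_num]) / sum(values[in_denom])
}

r_ok  <- safe_ratio(c(10, 5), c("1", "4"), num = "1", denom = "4")  # ordinary ratio
r_inf <- safe_ratio(c(10),    c("1"),      num = "1", denom = "4")  # number / 0
r_nan <- safe_ratio(c(0, 0),  c("1", "4"), num = "1", denom = "4")  # 0 / 0
r_na  <- safe_ratio(c(7),     c("9"),      num = "1", denom = "4")  # no relevant data
c(r_ok, r_inf, r_nan, r_na)
```

The empty strings in ratio_df never match a real acc_concept, so they can be passed through in num and denom unchanged.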
year <- c(2000:2014)
group <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"C","C","C","C","C","C","C","C","C","C","C","C","C","C","C")
value <- sample(1:5, 45, replace=TRUE)
df <- data.frame(year,group,value)
df$value[df$value==1] <- NA
year group value
1 2000 A NA
2 2001 A 2
3 2002 A 2
...
11 2010 A 2
12 2011 A 3
13 2012 A 5
14 2013 A NA
15 2014 A 3
16 2000 B 2
17 2001 B 3
...
26 2010 B NA
27 2011 B 5
28 2012 B 4
29 2013 B 3
30 2014 B 5
31 2000 C 5
32 2001 C 4
33 2002 C 3
34 2003 C 4
...
44 2013 C 5
45 2014 C 3
Above is the sample dataframe for my question.
Each group (A, B or C) has a value for each year from 2000 to 2014, but in some years the value might be missing for some of the groups.
The graph I would like to plot is as below:
x-axis is year
y-axis is group (i.e. A, B & C should be shown as the y-axis labels)
the bar or line represents the availability of the value for each group
If the value is NA, then the bar would not show at that time point.
ggplot2 is preferred if possible.
Can anyone help?
Thank you.
I think my description is confusing. I am expecting a graph like below, BUT the x-axis would be year. And the bar or line represents the availability of the value for a given group across the year.
In the sample dataframe of group A, we have
2012 A 5
2013 A NA
2014 A 3
Then there should be nothing at the point of group A in 2013, and then a dot would be presented at the point of group A in 2014.
You can use geom_errorbar with no range (geom_errorbarh for horizontal bars). Then just subset the data with complete.cases (or !is.na(df$value)).
library(ggplot2)
set.seed(10)
year <- c(2000:2014)
group <- c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B","B","B","B",
"C","C","C","C","C","C","C","C","C","C","C","C","C","C","C")
value <- sample(1:5, 45, replace=TRUE)
df <- data.frame(year,group,value)
df$value[df$value==1] <- NA
no_na_df <- df[complete.cases(df), ]
ggplot(no_na_df, aes(x=year, y = group)) +
geom_errorbarh(aes(xmax = year, xmin = year), size = 2)
Edit:
To get a continuous bar, you can use this slightly unappealing method. It is necessary to make a numeric representation of the group data to give the bars a width. Thereafter, we can make the scale represent the variable as discrete again.
# factor() first, so this also works when group is stored as character
df$group_n <- as.numeric(factor(df$group))
no_na_df <- df[complete.cases(df), ]
ggplot(no_na_df, aes(xmin=year-0.5, xmax=year+0.5, y = group_n)) +
geom_rect(aes(ymin = group_n-0.1, ymax = group_n+0.1)) +
scale_y_discrete(limits = levels(factor(df$group)))
I have searched the forum but found nothing that answers, or hints at, how to do what I want.
I have yearly measurements of exposure data, from which I wish to calculate an individual-level annual average based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example, the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution from January 2002 (annual average is 18) and from February to December 2001 (annual average is 19). The time-weighted average for this person would be (1/12*18) + (11/12*19). The two-year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will include a contribution of 11 months in 2004 and one month in 2003, and his annual average exposure will be (11/12*5), derived from 2004, plus (1/12*6), which comes from the annual average of 2003.
How can I calculate the 1, 2 and 5 year average exposure going back from the date of entry into the study? How can I use lags in the manner that I have described?
The sample data can be accessed from this link:
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
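To make the one-year weighting concrete, here is a minimal sketch of the arithmetic for the two people described above. one_year_avg and its argument names are hypothetical; the only inputs are the number of full months before entry that fall in the entry year and the two annual averages.

```r
# Hypothetical helper for the one-year time-weighted average.
# months_in_entry_year: full months before entry that fall in the entry year
one_year_avg <- function(months_in_entry_year, avg_entry_year, avg_prev_year) {
  w <- months_in_entry_year / 12
  w * avg_entry_year + (1 - w) * avg_prev_year
}

first_person <- one_year_avg(1, 18, 19)   # (1/12*18) + (11/12*19)
last_person  <- one_year_avg(11, 5, 6)    # (11/12*5) + (1/12*6)
c(first_person, last_person)
```

The two- and five-year versions follow the same pattern, with the full years in between each contributing 12 months.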
This is not an elegant answer, but I would like to leave what I tried. I first rearranged the data frame into long format so that I could identify the key year for each subject. I created id to identify each subject. variable comes from the column names (e.g., pol_2000) in your original data set, and entryYear and entryMonth both come from entry. check flags which year is the base year for each participant. Next, I extracted six rows for each participant using getMyRows from the SOfun package, then used lapply to do the math you described in your question. For the two- and five-year averages, I divided the total by the number of years (2 or 5). I was not sure what the final output should look like, so I kept the base-year row for each subject and added three columns to it.
library(stringi)
library(SOfun)
devtools::install_github("hadley/tidyr")
library(tidyr)
library(reshape2)  # provides melt()
library(dplyr)
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
mutate(id = 1:n()) %>%
melt(id.vars = c("id","entry")) %>%
arrange(id) %>%
mutate(variable = as.numeric(gsub("^.*_", "", variable)),
entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
entryMonth = mo2Num(substr(entry, 3,5)) - 1,
check = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x){
x$one <- ((x[6,6] / 12) * x[6,4]) + (((12-x[5,6])/12) * x[5,4])
x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x){
x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12-x[4,6])/12) * x[4,4])) / 2
x
})
two <- unnest(lapply(cathy, `[`, i = 6, j =8))
### Get five-year average
cathy <- lapply(bob, function(x){
x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12-x[2,6])/12) * x[1,4])) / 5
x
})
five <- unnest(lapply(cathy, `[`, i =6 , j =8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable x, whose values I would like to sum for each duplicate id. The observations are time dependent: there are year and month variables, and I'd like to keep the chronologically first observation of each duplicate id and add the subsequent dupes to this first observation.
I've included dummy data that resembles what I have: dat1. I've also included a data set that shows the structure of my desired outcome: outcome.
I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns, and I need to retain these for the first occurrence of each duplicated id. The second strategy doesn't sum the values of x correctly.
Any suggestions for how to get my desired outcome would be much appreciated.
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
year = rep(c("2006", "2007"), each = 5),
month = rep(c("December", "January"), each = 5),
x = round(rnorm(10, 10, 3), 2))
# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
year = c(rep("2006", 4), rep("2007", 3)),
month = c(rep("December", 4), rep("January", 3)),
x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))
# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))
# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]
# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
.(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes,
all.x = TRUE), FUN = sum)
I got different sums, but that was because I forgot the seed:
> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
id year month x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December 7.16
6 8503 2007 January 13.23
7 2961 2007 January 7.38
9 8564 2007 January 7.21
(To be safer, it would be better to work on a copy. You might also need to add an ordering step.)
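A rough sketch of that copy-plus-ordering variant is below. The three rows and their x values are made up for illustration, and I am assuming the month column holds full English month names so that month.name can be used for sorting.

```r
# Made-up rows: id 1234 appears twice, with the later observation listed first
dat <- data.frame(id    = c(1234, 1321, 1234),
                  year  = c("2007", "2006", "2006"),
                  month = c("January", "December", "December"),
                  x     = c(1, 2, 3),
                  stringsAsFactors = FALSE)

# Sort a copy chronologically (assumes full English month names)
ord <- order(as.integer(dat$year), match(dat$month, month.name))
res <- dat[ord, ]

# Sum x within each id, then keep the chronologically first row per id
res$x <- ave(res$x, res$id, FUN = sum)
res <- res[!duplicated(res$id), ]
res
```

The original dat is left untouched, and each id keeps the year and month of its earliest observation.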
You could do this with data.table (quicker and more memory efficient than plyr), with a bit of self-joining fun using mult = 'first'. Keying by id, year and month will sort by id, year, then month.
library(data.table)
DT <- data.table(dat1, key = c('id','year','month'))
# setnames is required as there are two x columns that get renamed x, x.1
DT1 <- setnames(DT[DT[,list(x=sum(x)),by=id],mult='first'][,x:=NULL],'x.1','x')
Or a simpler approach:
DT = as.data.table(dat1)
DT[,x:=sum(x),by=id][!duplicated(id)]
id year month x
1: 1234 2006 December 36.42
2: 1321 2006 December 11.55
3: 4321 2006 December 17.31
4: 7423 2006 December 5.97
5: 8503 2007 January 12.48
6: 2961 2007 January 10.22
7: 8564 2007 January 11.41