calculate proportion in grouped data frame - r

I have the following data frame :
year owngun N
1 2000 Yes 603
2 2000 No 1231
3 2000 Refused 23
4 2012 Yes 440
5 2012 No 841
6 2012 Refused 24
How can I make a column to have the proportions for each year and level of owngun?

Assuming your N's are already your aggregated counts, you could get proportions using data.table:
library(data.table)
setDT(df)[,prop:=N/sum(N),by=year]
df
year owngun N prop
1: 2000 Yes 603 0.32471729
2: 2000 No 1231 0.66289715
3: 2000 Refused 23 0.01238557
4: 2012 Yes 440 0.33716475
5: 2012 No 841 0.64444444
6: 2012 Refused 24 0.01839080
Same approach using plyr:
library(plyr)
df2<-ddply(df,.(year),transform,prop=N/sum(N))

We can use ave from base R
df1$prop <- with(df1, N/ave(N, year, FUN = sum))
df1$prop
#[1] 0.32471729 0.66289715 0.01238557 0.33716475 0.64444444 0.01839080
Or another option with tapply
with(df1, prop.table(tapply(N, list(year, owngun), FUN = sum), 1))
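For anyone who wants to run the base R approach end to end, here is a minimal sketch. It rebuilds the question's data under the name df1 (the name used in the answer above) and checks that the proportions sum to 1 within each year. The variable year_totals is only an illustrative name.

```r
# Rebuild the data from the question
df1 <- data.frame(
  year   = c(2000, 2000, 2000, 2012, 2012, 2012),
  owngun = c("Yes", "No", "Refused", "Yes", "No", "Refused"),
  N      = c(603, 1231, 23, 440, 841, 24)
)

# Divide each count by the total of its year
df1$prop <- df1$N / ave(df1$N, df1$year, FUN = sum)

# Sanity check: proportions within each year should add up to 1
year_totals <- tapply(df1$prop, df1$year, sum)
print(df1)
print(year_totals)
```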

>df
year owngun N
1 2000 Yes 603
2 2000 No 1231
3 2000 Refused 23
4 2012 Yes 440
5 2012 No 841
6 2012 Refused 24
>library(dplyr)
> df %>% group_by(year) %>% mutate(Proportion=N/sum(N))
year owngun N Proportion
(int) (fctr) (int) (dbl)
1 2000 Yes 603 0.32471729
2 2000 No 1231 0.66289715
3 2000 Refused 23 0.01238557
4 2012 Yes 440 0.33716475
5 2012 No 841 0.64444444
6 2012 Refused 24 0.01839080

prop.table and xtabs can be handy tools here:
library(magrittr)
xtabs(N ~., df) %>% prop.table(1) %>% round(2)
# owngun
#year No Refused Yes
# 2000 0.66 0.01 0.32
# 2012 0.64 0.02 0.34

Is that what you want?
ndf<-reshape2::dcast(dfr[,-1], owngun ~ year)
# compute the row totals once, before any proportion column is added
tot<-rowSums(ndf[,-1])
ndf$p2000=ndf$`2000`/tot
ndf$p2012=ndf$`2012`/tot
ndf[c(3,1,2),]
Proportion of year by level of owngun
owngun 2000 2012 p2000 p2012
3 Yes 603 440 0.5781400 0.4218600
1 No 1231 841 0.5941120 0.4058880
2 Refused 23 24 0.4893617 0.5106383
Proportion of owngun by year
ndf<-reshape2::dcast(dfr[,-1], year ~ owngun)
cbind(year=ndf$year,(100*ndf[,-1]/apply(ndf[,-1], 1, sum))[,c(3,1,2)])
year Yes No Refused
1 2000 32.47173 66.28971 1.238557
2 2012 33.71648 64.44444 1.839080

Related

Add multiple columns lagged by one year

I need to add a 1-year-lagged version of multiple columns from my dataframe. Here's my data:
data<-data.frame(Year=c("2011","2011","2011","2012","2012","2012","2013","2013","2013"),
Country=c("America","China","India","America","China","India","America","China","India"),
Value1=c(234,443,754,334,117,112,987,903,476),
Value2=c(2,4,5,6,7,8,1,2,2))
And I want to add two columns that contain Value1 and Value2 at t-1, so that my dataframe looks like this:
How can I do this? Would this be the correct way to lag my variables by year?
Thanks in advance!
Using data.table:
library(data.table)
setDT(data)
cols <- grep("^Value", colnames(data), value = TRUE)
data[, paste0(cols, "_lag") := lapply(.SD, shift), .SDcols = cols, by = Country]
# Year Country Value1 Value2 Value1_lag Value2_lag
# 1: 2011 America 234 2 NA NA
# 2: 2011 China 443 4 NA NA
# 3: 2011 India 754 5 NA NA
# 4: 2012 America 334 6 234 2
# 5: 2012 China 117 7 443 4
# 6: 2012 India 112 8 754 5
# 7: 2013 America 987 1 334 6
# 8: 2013 China 903 2 117 7
# 9: 2013 India 476 2 112 8
In dplyr, use lag by group:
library(dplyr) #1.1.0
data %>%
mutate(across(contains("Value"), lag, .names = "{col}_lagged"), .by = Country)
Year Country Value1 Value2 Value1_lagged Value2_lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
Below 1.1.0:
data %>%
group_by(Country) %>%
mutate(across(c(Value1, Value2), lag, .names = "{col}_lagged")) %>%
ungroup()
Another way to get the job done using dplyr:
library(dplyr)
data_lagged <- data %>%
group_by(Country) %>%
mutate(Value1_Lagged = lag(Value1),
Value2_Lagged = lag(Value2),
Year = as.integer(as.character(Year)) + 1)
data_final <- cbind(data, data_lagged[, c("Value1_Lagged", "Value2_Lagged")])
data_final
Output:
Year Country Value1 Value2 Value1_Lagged Value2_Lagged
1 2011 America 234 2 NA NA
2 2011 China 443 4 NA NA
3 2011 India 754 5 NA NA
4 2012 America 334 6 234 2
5 2012 China 117 7 443 4
6 2012 India 112 8 754 5
7 2013 America 987 1 334 6
8 2013 China 903 2 117 7
9 2013 India 476 2 112 8
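All of the answers above rely on the rows already being in year order within each country. As a hedged sketch of the same idea without extra packages, the base R version below sorts first and then shifts each column by one row per country. The helper lag1 and the "_lag" suffix are illustrative choices, not part of any package.

```r
data <- data.frame(
  Year    = c("2011","2011","2011","2012","2012","2012","2013","2013","2013"),
  Country = c("America","China","India","America","China","India","America","China","India"),
  Value1  = c(234,443,754,334,117,112,987,903,476),
  Value2  = c(2,4,5,6,7,8,1,2,2)
)

# Sort so that "previous row within Country" really means "previous year"
data <- data[order(data$Country, data$Year), ]

# Hypothetical helper: shift a vector down by one position
lag1 <- function(x) c(NA, head(x, -1))

# Apply the shift within each country for every Value column
for (col in c("Value1", "Value2")) {
  data[[paste0(col, "_lag")]] <- ave(data[[col]], data$Country, FUN = lag1)
}

# Restore the original Year-then-Country order
data <- data[order(data$Year, data$Country), ]
rownames(data) <- NULL
print(data)
```

Note that this lags by row, not by calendar year: if a country is missing a year, the "lag" is the previous available year.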

replacing NA with next available number within a group

I have a relatively large dataset, and I want to replace each NA price for a given ID and year with the value from the following year for the same ID, if one is available. Here is a reproducible example:
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
ID year value
1 1 2000 1000
2 2 2001 20000
3 3 2002 30000
4 2 2002 NA
5 2 2003 40000
6 3 2007 NA
7 1 2001 6000
8 4 2000 4000
9 5 2005 NA
10 5 2006 20000
11 1 2002 7000
12 2 2004 50000
13 2 2005 60000
So, for example for ID=2 we have following value and years:
ID year value
2 2001 20000
2 2002 NA
2 2003 40000
2 2004 50000
2 2005 60000
So in the above case, NA should be replaced with 40000 (Values in next year). And the same story for other IDs.
the final result should be in this form:
ID year value
1 2000 1000
1 2001 6000
1 2002 7000
2 2001 20000
2 2002 40000
2 2003 40000
2 2004 50000
2 2005 60000
3 2007 NA
4 2000 4000
5 2005 20000
5 2006 20000
Please note that for ID=3 since there is no next year available, we want to leave it as is. That's why it's in the form of NA
I appreciate if you can suggest a solution
Thanks
dplyr solution
library(tidyverse)
data2 <- data %>%
dplyr::group_by(ID) %>%
dplyr::arrange(year) %>%
dplyr::mutate(replaced_value = ifelse(is.na(value), lead(value), value))
print(data2)
# A tibble: 13 x 4
# Groups: ID [5]
ID year value replaced_value
<dbl> <dbl> <dbl> <dbl>
1 1 2000 1000 1000
2 4 2000 4000 4000
3 2 2001 20000 20000
4 1 2001 6000 6000
5 3 2002 30000 30000
6 2 2002 NA 40000
7 1 2002 7000 7000
8 2 2003 40000 40000
9 2 2004 50000 50000
10 5 2005 NA 20000
11 2 2005 60000 60000
12 5 2006 20000 20000
13 3 2007 NA NA
Try this tidyverse approach using a flag to check sequential years and fill() to complete data:
library(tidyverse)
#Data
ID <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data <- data.frame(ID, year, value)
#Code
data2 <- data %>% arrange(ID,year) %>%
group_by(ID) %>%
mutate(Flag=c(1,diff(year))) %>%
fill(value,.direction = 'up') %>%
mutate(value=ifelse(Flag!=1,NA,value)) %>% select(-Flag)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
You could do:
library(dplyr)
data %>%
group_by(ID) %>%
mutate(value = coalesce(value, as.integer(sapply(pmin(year + 1, max(year)), function(x) value[year == x])))) %>%
arrange(ID, year)
Output:
# A tibble: 13 x 3
# Groups: ID [5]
ID year value
<dbl> <dbl> <dbl>
1 1 2000 1000
2 1 2001 6000
3 1 2002 7000
4 2 2001 20000
5 2 2002 40000
6 2 2003 40000
7 2 2004 50000
8 2 2005 60000
9 3 2002 30000
10 3 2007 NA
11 4 2000 4000
12 5 2005 20000
13 5 2006 20000
Now in case you want to replace NA with any value that follows immediately - i.e. even if the year is not necessarily consecutive - you could do:
library(tidyverse)
data %>%
arrange(ID, year) %>%
group_by(ID, idx = cumsum(is.na(value))) %>%
fill(value, .direction = 'up') %>%
ungroup %>%
select(-idx)
This is much more straightforward (and likely much faster) in data.table:
library(data.table)
setDT(data)
setorder(data, ID, year) # sort in place so := updates data itself, not a sorted copy
data[, value := nafill(value, type = 'nocb'), by = .(ID, cumsum(is.na(value)))]
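If the replacement must come strictly from the following calendar year (year + 1) for the same ID, a lookup by key makes that rule explicit. The sketch below uses base R only and assumes at most one row per ID and year. The names key, next_key and next_val are illustrative.

```r
ID    <- c(1,2,3,2,2,3,1,4,5,5,1,2,2)
year  <- c(2000,2001,2002,2002,2003,2007,2001,2000,2005,2006,2002,2004,2005)
value <- c(1000,20000,30000,NA,40000,NA,6000,4000,NA,20000,7000,50000,60000)
data  <- data.frame(ID, year, value)

# Build a key per row, and the key of the same ID one year later
key      <- paste(data$ID, data$year)
next_key <- paste(data$ID, data$year + 1)

# Value found in the following year (NA when that year does not exist)
next_val <- data$value[match(next_key, key)]

# Fill only where the current value is missing
data$value <- ifelse(is.na(data$value), next_val, data$value)

data <- data[order(data$ID, data$year), ]
rownames(data) <- NULL
print(data)
```

With the example data, ID 3 in 2007 stays NA because there is no 2008 row for that ID.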

Summarizing a dataframe by date and group

I am trying to summarize a data set by a few different factors. Below is an example of my data:
household<-c("household1","household1","household1","household2","household2","household2","household3","household3","household3")
date<-c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value<-c(1:9)
type<-c("income","water","energy","income","water","energy","income","water","energy")
df<-data.frame(household,date,value,type)
household date value type
1 household1 1999-05-10 100 income
2 household1 1999-05-25 200 water
3 household1 1999-10-12 300 energy
4 household2 1999-02-02 400 income
5 household2 1999-08-20 500 water
6 household2 1999-02-19 600 energy
7 household3 1999-07-01 700 income
8 household3 1999-10-13 800 water
9 household3 1999-01-01 900 energy
I want to summarize the data by month. Ideally the resulting data set would have 12 rows per household (one for each month) and a column for each category of expenditure (water, energy, income) that is a sum of that month's total.
I tried starting by adding a column with a short date, and then I was going to filter for each type and create a separate data frame for the summed data per transaction type. I was then going to merge those data frames together to have the summarized df. I attempted to summarize it using ddply, but it aggregated too much, and I can't keep the household level info.
ddply(df,.(shortdate),summarize,mean_value=mean(value))
shortdate mean_value
1 14/07 15.88235
2 14/09 5.00000
3 14/10 5.00000
4 14/11 21.81818
5 14/12 20.00000
6 15/01 10.00000
7 15/02 12.50000
8 15/04 5.00000
Any help would be much appreciated!
It sounds like what you are looking for is a pivot table. I like to use reshape::cast for these types of tables. If there is more than one value returned for a given expenditure type for a given household/year/month combination, this will sum those values. If there is only one value, it returns the value. The "sum" argument is not required but only placed there to handle exceptions. I think if your data is clean you shouldn't need this argument.
hh <- c("hh1", "hh1", "hh1", "hh2", "hh2", "hh2", "hh3", "hh3", "hh3")
date <- c(sample(seq(as.Date('1999/01/01'), as.Date('2000/01/01'), by="day"), 9))
value <- c(1:9)
type <- c("income", "water", "energy", "income", "water", "energy", "income", "water", "energy")
df <- data.frame(hh, date, value, type)
# Load lubridate library, add month and year
library(lubridate)
df$month <- month(df$date)
df$year <- year(df$date)
# Load reshape library, run cast from reshape, creates pivot table
library(reshape)
dfNew <- cast(df, hh+year+month~type, value = "value", sum)
> dfNew
hh year month energy income water
1 hh1 1999 4 3 0 0
2 hh1 1999 10 0 1 0
3 hh1 1999 11 0 0 2
4 hh2 1999 2 0 4 0
5 hh2 1999 3 6 0 0
6 hh2 1999 6 0 0 5
7 hh3 1999 1 9 0 0
8 hh3 1999 4 0 7 0
9 hh3 1999 8 0 0 8
Try this:
df$ym<-zoo::as.yearmon(as.Date(df$date), "%y/%m")
library(dplyr)
df %>% group_by(ym,type) %>%
summarise(mean_value=mean(value))
Source: local data frame [9 x 3]
Groups: ym [?]
ym type mean_value
<S3: yearmon> <fctr> <dbl>
1 jan 1999 income 1
2 jun 1999 energy 3
3 jul 1999 energy 6
4 jul 1999 water 2
5 ago 1999 income 4
6 set 1999 energy 9
7 set 1999 income 7
8 nov 1999 water 5
9 dez 1999 water 8
Edit: the wide format:
reshape2::dcast(dfr, ym ~ type)
ym energy income water
1 jan 1999 NA 1 NA
2 jun 1999 3 NA NA
3 jul 1999 6 NA 2
4 ago 1999 NA 4 NA
5 set 1999 9 7 NA
6 nov 1999 NA NA 5
7 dez 1999 NA NA 8
If I understood your requirement correctly (from the description in the question), this is what you are looking for:
library(dplyr)
library(tidyr)
df %>% mutate(date = lubridate::month(date)) %>%
complete(household, date = 1:12) %>%
spread(type, value) %>% group_by(household, date) %>%
mutate(Total = sum(energy, income, water, na.rm = T)) %>%
select(household, Month = date, energy:water, Total)
#Source: local data frame [36 x 6]
#Groups: household, Month [36]
#
# household Month energy income water Total
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 household1 1 NA NA NA 0
#2 household1 2 NA NA NA 0
#3 household1 3 NA NA 200 200
#4 household1 4 NA NA NA 0
#5 household1 5 NA NA NA 0
#6 household1 6 NA NA NA 0
#7 household1 7 NA NA NA 0
#8 household1 8 NA NA NA 0
#9 household1 9 300 NA NA 300
#10 household1 10 NA NA NA 0
# ... with 26 more rows
Note: I used the same df you provided in the question. The only change I made was the value column. Instead of 1:9, I used seq(100, 900, 100)
If I got it wrong, please let me know and I will delete my answer. I will add an explanation of what's going on if this is correct.
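For comparison, here is a base R sketch of the same goal (12 rows per household, one summed column per type). It uses the dates printed in the question and values of 100 to 900. The column prefix "total" and the object names tab, long and wide are illustrative assumptions, and the sketch ignores the year, which is fine only because all example dates fall in 1999.

```r
df <- data.frame(
  household = rep(c("household1", "household2", "household3"), each = 3),
  date  = as.Date(c("1999-05-10", "1999-05-25", "1999-10-12",
                    "1999-02-02", "1999-08-20", "1999-02-19",
                    "1999-07-01", "1999-10-13", "1999-01-01")),
  value = seq(100, 900, 100),
  type  = rep(c("income", "water", "energy"), times = 3)
)

# Month as a factor with all 12 levels, so months without data are kept
df$month <- factor(as.integer(format(df$date, "%m")), levels = 1:12)

# Sum value for every household x month x type combination (0 where empty)
tab  <- xtabs(value ~ household + month + type, data = df)
long <- as.data.frame(tab, responseName = "total")

# One row per household and month, one column per type
wide <- reshape(long, idvar = c("household", "month"),
                timevar = "type", direction = "wide")
print(head(wide))
```

Unlike the spread() version, empty cells come out as 0 rather than NA, because xtabs sums over an empty set.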

Aggregate by specific year in R

Apologies if this question has already been dealt with on SO, but I have not been able to find a quick solution yet.
I am trying to aggregate a dataset by a specific year. My data frame consists of hourly climate data over a period of 10 years.
head(df)
# day month year hour rain temp pressure wind
#1 1 1 2005 0 0 7.6 1016 15
#2 1 1 2005 1 0 8.0 1015 14
#3 1 1 2005 2 0 7.7 1014 15
#4 1 1 2005 3 0 7.8 1013 17
#5 1 1 2005 4 0 7.3 1012 17
#6 1 1 2005 5 0 7.6 1010 17
To calculate daily means from the above dataset, I use this aggregate function
g <- aggregate(cbind(temp,pressure,wind) ~ day + month + year, df, mean)
options(digits=2)
head(g)
# day month year temp pressure wind
#1 1 1 2005 6.6 1005 25
#2 2 1 2005 6.5 1018 25
#3 3 1 2005 9.7 1019 22
#4 4 1 2005 7.5 1010 25
#5 5 1 2005 7.3 1008 25
#6 6 1 2005 9.6 1009 26
Unfortunately, I get a huge dataset spanning the whole 10 years (2005 to 2014). Could anybody help me tweak the above aggregate code so that I can summarise daily means for a specific year, as opposed to all of them in one sweep?
You can use the subset argument in aggregate
aggregate(cbind(temp,pressure,wind) ~ day + month + year, df,
subset=year %in% 2005:2014, mean)
dplyr also does it nicely.
library(dplyr)
df %>%
filter(year==2005) %>%
group_by(day, month, year) %>%
summarise(across(c(temp, pressure, wind), mean)) # summarise_each() and funs() are deprecated
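To pick out a single year with base R, one option is to filter the data before aggregating. The sketch below wraps this in a small function. The function name daily_means and the tiny data set are made up for illustration and are not taken from the question.

```r
# Small illustrative data set (values are invented)
df <- data.frame(
  day      = c(1, 1, 2, 2, 1, 1),
  month    = 1,
  year     = c(2005, 2005, 2005, 2005, 2006, 2006),
  temp     = c(7.6, 8.0, 6.0, 7.0, 3.0, 5.0),
  pressure = c(1016, 1014, 1010, 1012, 1000, 1002),
  wind     = c(15, 17, 20, 22, 10, 12)
)

# Hypothetical wrapper: daily means for one chosen year
daily_means <- function(d, yr) {
  aggregate(cbind(temp, pressure, wind) ~ day + month + year,
            data = d[d$year == yr, ], FUN = mean)
}

g2005 <- daily_means(df, 2005)
print(g2005)
```

Filtering the rows up front avoids depending on how the subset argument is evaluated when aggregate is called from inside a function.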

Merge 2 data frame based on 2 columns with different column names

I have 2 very large data sets that look like this:
merge_data <- data.frame(ID = c(1,2,3,4,5,6,7,8,9,10),
position=c("yes","no","yes","no","yes",
"no","yes","no","yes","yes"),
school = c("a","b","a","a","c","b","c","d","d","e"),
year1 = c(2000,2000,2000,2001,2001,2000,
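2003,2005,2008,2009))
# year1 is not visible inside the data.frame() call, so derive year2 afterwards
merge_data$year2 <- merge_data$year1 - 1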
merge_data
ID position school year1 year2
1 1 yes a 2000 1999
2 2 no b 2000 1999
3 3 yes a 2000 1999
4 4 no a 2001 2000
5 5 yes c 2001 2000
6 6 no b 2000 1999
7 7 yes c 2003 2002
8 8 no d 2005 2004
9 9 yes d 2008 2007
10 10 yes e 2009 2008
merge_data_2 <- data.frame(year=c(1999,1999,2000,2000,2000,2001,2003
,2012,2009,2009,2008,2002,2009,2005,
2001,2000,2002,2000,2008,2005),
amount=c(100,200,300,400,500,600,700,800,900,
1000,1100,1200,1300,1400,1500,1600,
1700,1800,1900,2000),
ID=c(1,1,2,2,2,3,3,3,5,6,8,9,10,13,15,17,19,20,21,7))
merge_data_2
year amount ID
1 1999 100 1
2 1999 200 1
3 2000 300 2
4 2000 400 2
5 2000 500 2
6 2001 600 3
7 2003 700 3
8 2012 800 3
9 2009 900 5
10 2009 1000 6
11 2008 1100 8
12 2002 1200 9
13 2009 1300 10
14 2005 1400 13
15 2001 1500 15
16 2000 1600 17
17 2002 1700 19
18 2000 1800 20
19 2008 1900 21
20 2005 2000 7
And what I want is:
ID position school year1 year2 amount
1 yes a 2000 1999 300
2 no b 2000 1999 1200
10 yes e 2009 2008 1300
For ID=1 the amount is 300, because merge_data_2 has 2 rows with ID=1 (amounts 100 and 200) whose year equals year1 or year2 of ID=1 in merge_data.
So basically what I want is to perform a merge based on the ID and year.
2 conditions:
ID from merge_data matches the ID from merge_data_2
one of the year1 and year2 from merge_data also matches the year from merge_data_2.
then make the merge based on the sum of the amount for each IDs.
I think the code would look something like this:
merge_data_final <- merge(merge_data, merge_data_2,
merge_data$ID == merge_data_2$ID && (merge_data$year1 ||
merge_data$year2 == merge_data_2$year))
Then somehow to aggregate the amount by ID.
Obviously I know the code is wrong. I have been thinking about the plyr or reshape libraries, but I am having difficulty getting to grips with them.
Any help would be great! Thanks guys!
As noted above, I think you have some discrepancies between your example input and output data. Here's the basic approach - you were on the right track with reshape2. You can simply melt() your data into long format so you are joining on a single column instead of the either/or bit you had going on before.
library(reshape2)
#melt into long format
merge_data_m <- melt(merge_data, measure.vars = c("year1", "year2"))
#merge together, specifying the joining columns
merge(merge_data_m, merge_data_2, by.x = c("ID", "value"), by.y = c("ID", "year"))
#-----
ID value position school variable amount
1 1 1999 yes a year2 100
2 1 1999 yes a year2 200
3 2 2000 no b year1 500
4 2 2000 no b year1 300
5 2 2000 no b year1 400
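The remaining step from the question, summing amount per ID, can be sketched in base R as follows. This version joins on ID alone and then filters on the year condition, which is an alternative to the melt() approach above rather than a continuation of it. On very large inputs the intermediate join may be big. The names joined and result are illustrative.

```r
merge_data <- data.frame(
  ID       = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  position = c("yes","no","yes","no","yes","no","yes","no","yes","yes"),
  school   = c("a","b","a","a","c","b","c","d","d","e"),
  year1    = c(2000,2000,2000,2001,2001,2000,2003,2005,2008,2009)
)
merge_data$year2 <- merge_data$year1 - 1

merge_data_2 <- data.frame(
  year   = c(1999,1999,2000,2000,2000,2001,2003,2012,2009,2009,
             2008,2002,2009,2005,2001,2000,2002,2000,2008,2005),
  amount = seq(100, 2000, 100),
  ID     = c(1,1,2,2,2,3,3,3,5,6,8,9,10,13,15,17,19,20,21,7)
)

# Join on ID, then keep rows whose year matches year1 or year2
joined <- merge(merge_data, merge_data_2, by = "ID")
joined <- joined[joined$year == joined$year1 | joined$year == joined$year2, ]

# Sum the matching amounts for each row of merge_data
result <- aggregate(amount ~ ID + position + school + year1 + year2,
                    data = joined, FUN = sum)
result <- result[order(result$ID), ]
rownames(result) <- NULL
print(result)
```

With the example data this keeps IDs 1, 2 and 10, matching the desired output in the question.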
