Sum duplicates then remove all but first occurrence - R

I have a data frame (~5000 rows, 6 columns) that contains some duplicate values for an id variable. I have another continuous variable x, whose values I would like to sum for each duplicate id. The observations are time dependent (there are year and month variables), and I'd like to keep the chronologically first observation of each duplicate id and add the subsequent dupes to this first observation.
I've included dummy data that resembles what I have: dat1. I've also included a data set that shows the structure of my desired outcome: outcome.
I've tried two strategies, neither of which quite gives me what I want (see below). The first strategy gives me the correct values for x, but I lose my year and month columns - I need to retain these for the first occurrence of each duplicate id. The second strategy doesn't sum the values of x correctly.
Any suggestions for how to get my desired outcome would be much appreciated.
# dummy data set
set.seed(179)
dat1 <- data.frame(id = c(1234, 1321, 4321, 7423, 4321, 8503, 2961, 1234, 8564, 1234),
                   year = rep(c("2006", "2007"), each = 5),
                   month = rep(c("December", "January"), each = 5),
                   x = round(rnorm(10, 10, 3), 2))
# desired outcome
outcome <- data.frame(id = c(1234, 1321, 4321, 7423, 8503, 2961, 8564),
                      year = c(rep("2006", 4), rep("2007", 3)),
                      month = c(rep("December", 4), rep("January", 3)),
                      x = c(36.42, 11.55, 17.31, 5.97, 12.48, 10.22, 11.41))
# strategy 1:
library(plyr)
dat2 <- ddply(dat1, .(id), summarise, x = sum(x))
# strategy 2:
# partition into two data frames - one with unique cases, one with dupes
dat1_unique <- dat1[!duplicated(dat1$id), ]
dat1_dupes <- dat1[duplicated(dat1$id), ]
# merge these data frames while summing the x variable for duplicated ids
# with plyr
dat3 <- ddply(merge(dat1_unique, dat1_dupes, all.x = TRUE),
              .(id), summarise, x = sum(x))
# in base R
dat4 <- aggregate(x ~ id, data = merge(dat1_unique, dat1_dupes, all.x = TRUE),
                  FUN = sum)

I got different sums at first, but that was because I forgot to set the seed:
> dat1$x <- ave(dat1$x, dat1$id, FUN=sum)
> dat1[!duplicated(dat1$id), ]
    id year    month     x
1 1234 2006 December 25.18
2 1321 2006 December 15.06
3 4321 2006 December 15.50
4 7423 2006 December  7.16
6 8503 2007  January 13.23
7 2961 2007  January  7.38
9 8564 2007  January  7.21
(To be safe, it would be better to work on a copy, and you might need to add an ordering step so the chronologically first observation of each id really comes first.)
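A minimal sketch of that safer variant, assuming month holds full month names so matching against the built-in month.name vector gives chronological order:
# order chronologically, work on a copy, then sum x and keep first occurrences
dat_safe <- dat1[order(dat1$year, match(dat1$month, month.name)), ]
dat_safe$x <- ave(dat_safe$x, dat_safe$id, FUN = sum)
dat_safe[!duplicated(dat_safe$id), ]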

You could do this with data.table (quicker and more memory-efficient than plyr), with a bit of self-joining fun using mult = 'first'. Keying by id, year, and month will sort by id, then year, then month.
library(data.table)
DT <- data.table(dat1, key = c('id','year','month'))
# setnames is required as there are two x columns that get renamed x, x.1
DT1 <- setnames(DT[DT[, list(x = sum(x)), by = id], mult = 'first'][, x := NULL], 'x.1', 'x')
Or a simpler approach:
DT = as.data.table(dat1)
DT[,x:=sum(x),by=id][!duplicated(id)]
     id year    month     x
1: 1234 2006 December 36.42
2: 1321 2006 December 11.55
3: 4321 2006 December 17.31
4: 7423 2006 December  5.97
5: 8503 2007  January 12.48
6: 2961 2007  January 10.22
7: 8564 2007  January 11.41
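For completeness, a one-step data.table aggregation (a sketch, not from the original answers) that keeps the chronologically first year and month per id even when the rows arrive unsorted:
DT <- as.data.table(dat1)
DT[order(year, match(month, month.name)),
   .(year = year[1], month = month[1], x = sum(x)),
   by = id]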

Related

How to organise a date in 3 different columns in R?

I have a lot of climatic data organised by date, like this:
df = data.frame(date = c("2011-03-24", "2011-02-03", "2011-01-02"), Precipitation = c(20, 22, 23))
And I want to organise it like this:
df = data.frame(year = c("2011", "2011","2011"), month = c("03","02","01"), day = c("24", "03", "02"), pp = c(20, 22, 23))
I have a lot of information and I cannot do it manually. Can anybody help me? Thanks a lot.
Using strsplit you can do it like this:
Logic: strsplit splits each date on the dashes, creating a list with one element per row, each holding the three parts year, month, and day. We row-bind these elements with rbind; to do that over the whole list at once we use do.call, which binds the list elements into 3 rows. Since the outcome is a matrix, we convert it into a data frame, and then using setNames we give the columns new names. The last cbind binds this 3x3 data frame with the original precipitation.
cbind(setNames(data.frame(do.call('rbind', strsplit(df$date, '-'))),
               c('Year', 'month', 'day')),
      'Precipitation' = df$Precipitation)
Output:
Year month day Precipitation
1 2011 03 24 20
2 2011 02 03 22
3 2011 01 02 23
Another option is the clock package. Its accessors return integer values for year, month, and day; if you really need them as characters padded with 0, you can use formatC(x, width = 2, flag = "0") on the result, as sketched after the output below.
library(clock)
library(dplyr)
df <- data.frame(
  date = c("2011-03-24", "2011-02-03", "2011-01-02"),
  pp = c(20, 22, 23)
)
df %>%
  mutate(
    date = as.Date(date),
    year = get_year(date),
    month = get_month(date),
    day = get_day(date)
  )
#> date pp year month day
#> 1 2011-03-24 20 2011 3 24
#> 2 2011-02-03 22 2011 2 3
#> 3 2011-01-02 23 2011 1 2
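A minimal sketch of the zero-padded character version, wrapping the clock accessors in formatC:
df %>%
  mutate(
    date = as.Date(date),
    year = get_year(date),
    month = formatC(get_month(date), width = 2, flag = "0"),
    day = formatC(get_day(date), width = 2, flag = "0")
  )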

Replace missing values with mean for subsets of dataframe

I have a data frame titled final_project_data with the following structure. It includes 17 columns with data corresponding to county/state and year. For example, Baldwin county in Alabama in 2006 had a population of 69162, an unemployment rate of 4.2%, etc.
ID County State Population Year Ump.Rate Fertility
<dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1003 Baldwin County Alabama 69162 2006 4.2 88
1015 Calhoun County Alabama 112903 2006 2.4 na
1043 Baldwin County Alabama na 2007 1.9 71
1049 Calhoun County Alabama 68014 2007 na 90
1050 CountyY Alaska 2757 2006 3.9 na
1070 CountyZ Alaska 11000 2006 7.8 95
1081 CountyY Alaska na 2007 6.5 70
1082 CountyZ Alaska 67514 2007 4.5 60
There are a number of columns with missing values in them, which I am trying to replace with the mean for the given State and Year. I am running into issues trying to loop over each column with missing values and then each subset of years and rows to fill in the missing values with the mean. The code I have thus far is below:
#get list of unique states
states <- unique(final_project_data$State)
#get list of columns with na in them - we will use this to impute missing values
list_na <- colnames(final_project_data)[apply(final_project_data, 2, anyNA)]
list_na
#create a place to hold the missing values
average_missing <- c()
#Loop through each state to impute the missing values with the mean
for(i in 1:length(states)){
  average_missing <- apply(final_project_data[which(final_project_data$State == states[i]),
                                              colnames(final_project_data) %in% list_na],
                           2, mean, na.rm = TRUE)
}
average_missing
However, when I run the above bit of code, I only get one set of values for each of the columns with missing values, not for a different value for every state. I am also not sure how to extend this to include years. Any help or advice would be appreciated!
In a for loop:
dt <- data.frame(
  ID = c(1003, 1015, 1043, 1049, 1050, 1070, 1081, 1082, NA, NA),
  State = c(rep("Alabama", 4), rep("Alaska", 4), "Alabama", "Alaska"),
  Population = c(sample(10000:100000, 8, replace = T), NA, NA),
  Year = c(2006, 2006, 2007, 2007, 2006, 2006, 2007, 2007, 2007, 2006),
  Unemployment = c(sample(1:5, 8, replace = T), NA, NA)
)
# index through each row in data frame
for (i in 1:nrow(dt)){
  # if Population variable is NA
  if (is.na(dt$Population[i])) {
    # calculate mean from all Population values with the same State and Year as this row
    dt$Population[i] <- mean(dt$Population[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])], na.rm = TRUE)
  }
  # repeat for Unemployment variable
  if (is.na(dt$Unemployment[i])) {
    dt$Unemployment[i] <- mean(dt$Unemployment[which(dt$State == dt$State[i] & dt$Year == dt$Year[i])], na.rm = TRUE)
  }
}
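The same per-State-and-Year mean imputation can be written without an explicit loop in base R; a sketch using ave() (not part of the original answer):
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))
dt$Population <- ave(dt$Population, dt$State, dt$Year, FUN = impute_mean)
dt$Unemployment <- ave(dt$Unemployment, dt$State, dt$Year, FUN = impute_mean)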
Here's a dplyr version without a loop. Just add all the columns you want to transform inside vars():
your_data %>%
  group_by(State, Year) %>%
  mutate_at(vars(Population, Ump.Rate, Fertility),
            ~ ifelse(is.na(.), mean(., na.rm = TRUE), .))
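On dplyr 1.0 or later, where mutate_at() is superseded, an equivalent sketch with across() would be:
your_data %>%
  group_by(State, Year) %>%
  mutate(across(c(Population, Ump.Rate, Fertility),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))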

Calculate difference between values using different column and with gaps using R

Can anyone help me figure out how to calculate the difference in values based on my monthly data? For example, I would like to calculate the difference in groundwater values between Jan-Jul, Feb-Aug, Mar-Sep, etc., for each well by year. Note that in some years some months will be missing. Any tidyverse solutions would be appreciated.
Well year month value
<dbl> <dbl> <fct> <dbl>
1 222 1995 February 8.53
2 222 1995 March 8.69
3 222 1995 April 8.92
4 222 1995 May 9.59
5 222 1995 June 9.59
6 222 1995 July 9.70
7 222 1995 August 9.66
8 222 1995 September 9.46
9 222 1995 October 9.49
10 222 1995 November 9.31
# ... with 18,400 more rows
library(dplyr)
library(reshape2)  # for dcast()
df1 <- subset(df, month %in% c("February", "August"))
test <- df1 %>%
  dcast(site + year + Well ~ month, value.var = "value") %>%
  mutate(Diff = February - August)
Thanks,
Simon
So I attempted to manufacture a data set and use dplyr to create a solution. It is best practice to include a method of generating a sample data set, so please do so in future questions.
# load required library
library(dplyr)
# generate data set of all site, well, and month combinations
## define valid values
sites = letters[1:3]
wells = 1:5
months = month.name
## perform a series of merges
full_sites_wells_months_set <-
  merge(sites, wells) %>%
  dplyr::rename(sites = x, wells = y) %>%  # this line and the prior could be replaced on your system with: initial_tibble %>% dplyr::select(sites, wells) %>% unique()
  merge(months) %>%
  dplyr::rename(months = y) %>%
  dplyr::arrange(sites, wells)
# create sample initial_tibble
## define fraction of records to simulate missing months
data_availability <- 0.8
initial_tibble <-
  full_sites_wells_months_set %>%
  dplyr::sample_frac(data_availability) %>%
  dplyr::mutate(values = runif(nrow(full_sites_wells_months_set) * data_availability))  # generate random groundwater values
# generate final result by joining the full expected set of sites, wells, and months
# to the actual data, then grouping by sites and wells and taking the lagged difference
final_tibble <-
  full_sites_wells_months_set %>%
  dplyr::left_join(initial_tibble) %>%
  dplyr::group_by(sites, wells) %>%
  dplyr::mutate(trailing_difference_6_months = values - dplyr::lag(values, 6L))
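To spot-check the result, you could inspect a single site and well and confirm each value is paired with the observation six rows (months) earlier; a usage sketch:
final_tibble %>%
  dplyr::filter(sites == "a", wells == 1) %>%
  head(12)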

How to calculate time-weighted average and create lags

I have searched the forum, but found nothing that could answer my question or provide a hint on how to do what I wish to do.
I have yearly measurements of exposure data from which I wish to calculate individual-level annual averages based on each individual's entry into the study. For each row, the one-year exposure assignment should include data from the preceding 12 months, starting from the last month before joining the study.
As an example, the first person in the sample data joined the study on Feb 7, 2002. His exposure will include a contribution from January 2002 (annual average 18) and from February to December 2001 (annual average 19). The time-weighted average for this person would be (1/12 * 18) + (11/12 * 19) ≈ 18.92. The two-year average exposure for the same person would extend back from January 2002 to February 2000.
Similarly, the last person, who joined the study in December 2004, will include a contribution of 11 months in 2004 and one month in 2003, so his annual average exposure will be (11/12 * 5), derived from 2004, plus (1/12 * 6), which comes from the annual average of 2003.
How can I calculate the 1-, 2- and 5-year average exposures going back from the date of entry into the study? How can I use lags in the manner that I have described?
Sample data is accessed from this link
https://drive.google.com/file/d/0B_4NdfcEvU7La1ZCd2EtbEdaeGs/view?usp=sharing
This is not an elegant answer, but I would like to leave what I tried. I first arranged the data frame. I wanted to identify which year is the key year for each subject, so I created id. variable comes from the column names (e.g., pol_2000) in your original data set. entryYear and entryMonth come from entry in your data. check was created in order to identify which year is the base year for each participant. In my next step, I extracted six rows for each participant using getMyRows in the SOfun package. Then I used lapply and did the math as you described in your question. For the two- and five-year averages, I divided the total values by the number of years (2 or 5). I was not sure what the final output should look like, so I decided to use the base-year row for each subject and add three columns to it.
library(stringi)
library(SOfun)
library(reshape2)  # for melt()
devtools::install_github("hadley/tidyr")
library(tidyr)
library(dplyr)
### Big thanks to BondedDust for this function
### http://stackoverflow.com/questions/6987478/convert-a-month-abbreviation-to-a-numeric-month-in-r
mo2Num <- function(x) match(tolower(x), tolower(month.abb))
### Arrange the data frame.
ana <- foo %>%
  mutate(id = 1:n()) %>%
  melt(id.vars = c("id", "entry")) %>%
  arrange(id) %>%
  mutate(variable = as.numeric(gsub("^.*_", "", variable)),
         entryYear = as.numeric(stri_extract_last(entry, regex = "\\d+")),
         entryMonth = mo2Num(substr(entry, 3, 5)) - 1,
         check = ifelse(variable == entryYear, "Y", "N"))
### Find a base year for each subject and get some parts of data for each participant.
indx <- which(ana$check == "Y")
bob <- getMyRows(ana, pattern = indx, -5:0)
### Get one-year average
cathy <- lapply(bob, function(x){
  x$one <- ((x[6,6] / 12) * x[6,4]) + (((12 - x[5,6]) / 12) * x[5,4])
  x
})
one <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get two-year average
cathy <- lapply(bob, function(x){
  x$two <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + (((12 - x[4,6]) / 12) * x[4,4])) / 2
  x
})
two <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Get five-year average
cathy <- lapply(bob, function(x){
  x$five <- (((x[6,6] / 12) * x[6,4]) + x[5,4] + x[4,4] + x[3,4] + x[2,4] + (((12 - x[2,6]) / 12) * x[1,4])) / 5
  x
})
five <- unnest(lapply(cathy, `[`, i = 6, j = 8))
### Combine the results with the key observations
final <- cbind(ana[which(ana$check == "Y"),], one, two, five)
colnames(final) <- c(names(ana), "one", "two", "five")
# id entry variable value entryYear entryMonth check one two five
#6 1 07feb2002 2002 18 2002 1 Y 18.916667 18.500000 18.766667
#14 2 06jun2002 2002 16 2002 5 Y 16.583333 16.791667 17.150000
#23 3 16apr2003 2003 14 2003 3 Y 15.500000 15.750000 16.050000
#31 4 26may2003 2003 16 2003 4 Y 16.666667 17.166667 17.400000
#39 5 11jun2003 2003 13 2003 5 Y 13.583333 14.083333 14.233333
#48 6 20feb2004 2004 3 2004 1 Y 3.000000 3.458333 3.783333
#56 7 25jul2004 2004 2 2004 6 Y 2.000000 2.250000 2.700000
#64 8 19aug2004 2004 4 2004 7 Y 4.000000 4.208333 4.683333
#72 9 19dec2004 2004 5 2004 11 Y 5.083333 5.458333 4.800000
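As a quick sanity check, the one-year figure for the first subject matches the hand calculation from the question:
(1/12) * 18 + (11/12) * 19
# [1] 18.91667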

R - group_by utilizing splinefun

I am trying to group my data by Year and CountyID, then use splinefun (cubic spline interpolation) on the subsetted data. I am open to ideas; however, the splinefun is a must and cannot be changed.
Here is the code I am trying to use:
age <- seq(from = 0, by = 5, length.out = 18)
TOT_POP <- df %.%
  group_by(unique(df$Year), unique(df$CountyID)) %.%
  splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
Here is a sample of my data: Year = 2010:2013, Agegrp = 1:17, and CountyIDs equal to all counties in the US.
CountyID Year Agegrp TOT_POP
1001 2010 1 3586
1001 2010 2 3952
1001 2010 3 4282
1001 2010 4 4136
1001 2010 5 3154
What I am doing is taking Agegrp 1:17 and splitting the grouping into individual years 0-84. Right now each group represents 5 years. The splinefun allows me to do this while providing a level of mathematical rigour to the process, i.e., splinefun allows me to produce a population total for each single year of age, in each individual county in the US.
Lastly, the splinefun code by itself does work, but within the group_by function it does not; it produces:
Error: wrong result size(4), expected 68 or 1.
The splinefun code works on its own like this:
TOT_POP <- splinefun(age, c(0, cumsum(df$TOT_POP)), method = "hyman")
TOT_POP <- pmax(0, diff(TOT_POP(c(0:85))))
This was tested on one CountyID during one Year; I need to iterate the process over a number of years and roughly 3200 counties.
# Reproducible data set
set.seed(22)
df = data.frame(CountyID = rep(1001:1005, each = 100),
                Year = rep(2001:2010, each = 10),
                Agegrp = sample(1:17, 500, replace = TRUE),
                TOT_POP = rnorm(500, 10000, 2000))
# Convert Agegrp to age
df$Agegrp = df$Agegrp*5
colnames(df)[3] = "age"
# Make a spline function for every CountyID-Year combination
split.dfs = split(df, interaction(df$CountyID, df$Year))
spline.funs = lapply(split.dfs, function(x) splinefun(x[,"age"], x[,"TOT_POP"]))
# Use the spline functions to interpolate populations for all years between 0 and 85
new.split.dfs = list()
for (i in 1:length(split.dfs)) {
  new.split.dfs[[i]] = data.frame(CountyID = split.dfs[[i]]$CountyID[1],
                                  Year = split.dfs[[i]]$Year[1],
                                  age = 0:85,
                                  TOT_POP = spline.funs[[i]](0:85))
}
# Does this do what you want? If so, then it will be
# easier for others to work from here
# > head(new.split.dfs[[1]])
# CountyID Year age TOT_POP
# 1 1001 2001 0 909033.4
# 2 1001 2001 1 833999.8
# 3 1001 2001 2 763181.8
# 4 1001 2001 3 696460.2
# 5 1001 2001 4 633716.0
# 6 1001 2001 5 574829.9
# > tail(new.split.dfs[[2]])
# CountyID Year age TOT_POP
# 81 1002 2001 80 10201.693
# 82 1002 2001 81 9529.030
# 83 1002 2001 82 8768.306
# 84 1002 2001 83 7916.070
# 85 1002 2001 84 6968.874
# 86 1002 2001 85 5923.268
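If you then want a single data frame rather than a list, a small addition (not in the original answer) would be:
# combine the per-group interpolations into one data frame
interpolated.df <- do.call(rbind, new.split.dfs)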
First, I believe I was using the wrong wording in what I was trying to achieve, my apologies; group_by actually wasn't going to solve the issue. However, I was able to solve the problem using two functions and ddply. Here is the code that solved the issue:
library(plyr)  # for colwise() and ddply()
interpolate <- function(x, ageVector){
  result <- splinefun(ageVector,
                      c(0, cumsum(x)), method = "hyman")
  diff(result(c(0:85)))
}
mainFunc <- function(df){
  age <- seq(from = 0, by = 5, length.out = 18)
  colNames <- setdiff(colnames(df),
                      c("Year", "CountyID", "AgeGrp"))
  colWiseSpline <- colwise(interpolate, .cols = true,
                           age)(df[, colNames])
  cbind(data.frame(
          Year = df$Year[1],
          County = df$CountyID[1],
          Agegrp = 0:84
        ),
        colWiseSpline
  )
}
CompleteMainRaw <- ddply(.data = df,
                         .variables = .(CountyID, Year),
                         .fun = mainFunc)
The code now takes each county by year and runs the splinefun on that subset of population data. At the same time it creates a data.frame with the results, i.e., it splits the data from 17 five-year age groups into 85 single years of age while apportioning the totals appropriately, which is what splinefun does.
Thanks!
