How to add means to an existing column in R

I am manipulating a dataset but can't get the result I want.
Here's an example, where df is the name of the data frame.
year ID value
2013 1 10
2013 2 20
2013 3 10
2014 1 20
2014 2 20
2014 3 30
2015 1 20
2015 2 10
2015 3 30
So I tried to make another data frame df1 <- aggregate(value ~ year, df, mean, na.rm = TRUE)
And made this data frame df1:
year ID value
2013 avg 13.3
2014 avg 23.3
2015 avg 20
But I want to append each yearly mean to df as an extra row after the rows of that year.
The expected form is:
year ID value
2013 1 10
2013 2 20
2013 3 10
2013 avg 13.3
2014 1 20
2014 2 20
2014 3 30
2014 avg 23.3
2015 1 20
2015 2 10
2015 3 30
2015 avg 20

Here is an option with data.table: convert the data.frame to a data.table with setDT(df), compute the mean of 'value' grouped by 'year' while setting 'ID' to 'avg', then use rbindlist to stack the two datasets and order the result by 'year'.
library(data.table)
rbindlist(list(setDT(df), df[, .(ID = 'avg', value = mean(value)), year]))[order(year)]
# year ID value
# 1: 2013 1 10.00000
# 2: 2013 2 20.00000
# 3: 2013 3 10.00000
# 4: 2013 avg 13.33333
# 5: 2014 1 20.00000
# 6: 2014 2 20.00000
# 7: 2014 3 30.00000
# 8: 2014 avg 23.33333
# 9: 2015 1 20.00000
#10: 2015 2 10.00000
#11: 2015 3 30.00000
#12: 2015 avg 20.00000
Or using the OP's method, rbind both the datasets and then order
df2 <- rbind(df, transform(df1, ID = 'avg'))
df2 <- df2[order(df2$year),]
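For reference, here is a self-contained sketch of the base R route above. It rebuilds the example data from the question and adds the ID column to df1 by hand; the rownames reset at the end is my own addition for tidier printing, not part of the original answer.

```r
# Rebuild the example data from the question
df <- data.frame(
  year  = rep(2013:2015, each = 3),
  ID    = rep(1:3, times = 3),
  value = c(10, 20, 10, 20, 20, 30, 20, 10, 30)
)

# Per-year means; na.rm = TRUE is passed through to mean()
df1 <- aggregate(value ~ year, data = df, FUN = mean, na.rm = TRUE)

# Label the summary rows; ID becomes character once the rows are stacked
df1$ID <- "avg"

# rbind() matches data frame columns by name, so df1's column order does not matter
df2 <- rbind(df, df1)

# order() is stable, so each "avg" row stays after the rows of its year
df2 <- df2[order(df2$year), ]
rownames(df2) <- NULL
df2
```

Note that ID is no longer numeric in df2, because a column can hold only one type.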

Related

Averaging a monthly time series with incomplete observations

I have the following dataset:
id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
I would like to create monthly averages of Observation_value. For months with no values, I would like to fill in the data by interpolating between the averages of the surrounding months that do have data.
Using the data in the Note at the end -- we have added a second id -- convert to zoo using column 1 to split by and column 2 as the index with yearmon class. Also in the same statement aggregate using mean over year/month giving the zoo object z. Then convert to ts which will fill in the missing months with NA and then convert back to zoo and use na.approx to fill in the NAs (or use na.spline or na.locf depending on what you want). fortify.zoo(zz) and fortify.zoo(zz, melt = TRUE) can be used to convert zoo objects to data frames.
library(zoo)
z <- read.zoo(dat, FUN = as.yearmon, index = 2, split = 1, aggregate = mean)
zz <- na.approx(as.zoo(as.ts(z)))
giving
> zz
1 2
Feb 2015 5.5 5.5
Mar 2015 24.0 24.0
Apr 2015 18.5 18.5
May 2015 13.0 13.0
Jun 2015 7.5 7.5
Jul 2015 2.0 2.0
Aug 2015 5.5 5.5
Sep 2015 9.0 9.0
Oct 2015 10.0 10.0
Nov 2015 11.0 11.0
Dec 2015 12.0 12.0
Note
Lines <- "id observation_date Observation_value
1 2015-02-23 5
1 2015-02-24 6
1 2015-03-01 24
1 2015-07-16 2
1 2015-09-28 9
1 2015-12-05 12
2 2015-02-23 5
2 2015-02-24 6
2 2015-03-01 24
2 2015-07-16 2
2 2015-09-28 9
2 2015-12-05 12"
dat <- read.table(text = Lines, header = TRUE)
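If zoo is not available, the same idea can be sketched in base R. This is only an illustration for a single id, not the answer's method: the names key, means, all_months and filled are mine, and approx() stands in for na.approx by interpolating linearly over evenly spaced month positions.

```r
dat <- data.frame(
  observation_date = as.Date(c("2015-02-23", "2015-02-24", "2015-03-01",
                               "2015-07-16", "2015-09-28", "2015-12-05")),
  Observation_value = c(5, 6, 24, 2, 9, 12)
)

# Year-month key for each observation, e.g. "2015-02"
key <- format(dat$observation_date, "%Y-%m")

# Mean per observed month; the result is named and sorted by key
means <- tapply(dat$Observation_value, key, mean)

# Every month from the first to the last observed month
obs_months <- as.Date(paste0(names(means), "-01"))
all_months <- seq(min(obs_months), max(obs_months), by = "month")

# Position of each observed month in the full sequence
idx <- match(names(means), format(all_months, "%Y-%m"))

# Linear interpolation over month positions fills the gaps
filled <- data.frame(
  month = format(all_months, "%Y-%m"),
  value = approx(x = idx, y = as.numeric(means),
                 xout = seq_along(all_months))$y
)
filled
```

For several ids, one could run the same steps per id, for example with split() and lapply().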

How to lump sum the number of days of a data of several year?

I have data similar to this. I would like to compute a running total of days (I'm not sure whether "lump sum" is the right term) and store it in a new column "date", so that the column counts the days cumulatively across the 3 years of data in ascending order.
year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24
I wrote this code, but the result is wrong and the code is too long. It doesn't handle February correctly, since February has only 28 days. Is there a shorter way?
cday <- function(data,syear=2011,smonth=1,sday=1){
year <- data[1]
month <- data[2]
day <- data[3]
cmonth <- c(0,31,28,31,30,31,30,31,31,30,31,30,31)
date <- (year-syear)*365+sum(cmonth[1:month])+day
for(yr in c(syear:year)){
if(yr==year){
if(yr%%4==0&&month>2){date<-date+1}
}else{
if(yr%%4==0){date<-date+1}
}
}
return(date)
}
op10$day.no <- apply(op10[,c("year","month","day")],1,cday)
I expect the result like this:
year month day date
2011 1 5 5
2011 1 14 14
2011 1 21 21
2011 1 24 24
2011 2 3 31
2011 2 4 32
2011 2 6 34
2011 2 14 42
2011 2 17 45
2011 2 24 52
Thank you for helping!!
Use Date classes. Dates and times are complicated, so look for tools that do this for you rather than writing your own. Pick whichever of these you want:
df$date = with(df, as.Date(paste(year, month, day, sep = "-")))
df$julian_day = as.integer(format(df$date, "%j"))
df$days_since_2010 = as.integer(df$date - as.Date("2010-12-31"))
df
# year month day date julian_day days_since_2010
# 1 2011 1 5 2011-01-05 5 5
# 2 2011 2 14 2011-02-14 45 45
# 3 2011 8 21 2011-08-21 233 233
# 4 2012 2 24 2012-02-24 55 420
# 5 2012 3 3 2012-03-03 63 428
# 6 2012 4 4 2012-04-04 95 460
# 7 2012 5 6 2012-05-06 127 492
# 8 2013 2 14 2013-02-14 45 776
# 9 2013 5 17 2013-05-17 137 868
# 10 2013 6 24 2013-06-24 175 906
# using this data
df = read.table(text = "year month day
2011 1 5
2011 2 14
2011 8 21
2012 2 24
2012 3 3
2012 4 4
2012 5 6
2013 2 14
2013 5 17
2013 6 24", header = TRUE)
This is all using base R. If you handle dates and times frequently, you may also want to look at the lubridate package.
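As a small check that the Date class takes care of leap years, here is a sketch on a four-row subset of the data. The column name day_no and the choice to count 1 January of the earliest year as day 1 are my assumptions, not something from the question.

```r
df <- data.frame(
  year  = c(2011, 2012, 2012, 2013),
  month = c(1, 2, 3, 2),
  day   = c(5, 24, 3, 14)
)

# ISOdate() builds a date-time from numeric parts; as.Date() drops the time
df$date <- as.Date(ISOdate(df$year, df$month, df$day))

# Running day count, with 1 January of the earliest year counted as day 1
origin <- as.Date(ISOdate(min(df$year), 1, 1))
df$day_no <- as.integer(df$date - origin) + 1L

# 2012 is a leap year, so 24 Feb to 3 Mar spans 8 days rather than 7
diff(df$day_no)
df
```

No month-length table or leap-year rule appears anywhere in the code; the subtraction of two Date values does the work.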

How to remove subjects with missing yearly observations in R?

num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432
My data follows various subjects over 5 years. I need to remove every subject that is missing any of the years from 2011 to 2015. How can I do this, so that in the given data only subject A is left?
Using data.table:
A data.table solution might look something like this:
library(data.table)
dt <- as.data.table(df)
dt[, keep := identical(unique(year), 2011:2015), by = Name ][keep == T, ][,keep := NULL]
# num Name year age X
#1: 1 A 2011 68 116292
#2: 1 A 2012 69 46132
#3: 1 A 2013 70 7042
#4: 1 A 2014 71 -100425
#5: 1 A 2015 72 6493
This is more strict in that it requires that the unique years be exactly equal to 2011:2015. If there is a 2016, for example that person would be excluded.
A less restrictive solution would be to check that 2011:2015 is in your unique years. This should work:
dt[, keep := all(2011:2015 %in% unique(year)), by = Name ][keep == T, ][,keep := NULL]
Thus, if for example, A had a 2016 year and a 2010 year it would still keep all of A. But if anyone is missing a year in 2011:2015 this would exclude them.
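To make the difference between the two checks concrete, here is a base R sketch on three made-up year vectors. The helper names strict and lenient are hypothetical, used only for this illustration.

```r
target <- 2011:2015

cases <- list(
  exact = 2011:2015,                 # exactly the target years
  extra = 2010:2016,                 # target years plus two extra
  gappy = c(2011L, 2012L, 2013L)     # 2014 and 2015 missing
)

# Strict: same values, same order, same type as the target
strict <- function(y) identical(unique(y), target)

# Lenient: every target year appears at least once
lenient <- function(y) all(target %in% y)

strict_res  <- sapply(cases, strict)
lenient_res <- sapply(cases, lenient)
rbind(strict = strict_res, lenient = lenient_res)

# identical() also compares storage type: doubles never match an integer range
identical(c(2011, 2012, 2013, 2014, 2015), target)
```

One thing to watch with the strict version: if year was read in as double rather than integer, identical() returns FALSE even when the values agree, so the lenient %in% check is the safer default.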
Using base R & aggregate:
Same option, but using aggregate from base R:
agg <- aggregate(df$year, by = list(df$Name), FUN = function(x) all(2011:2015 %in% unique(x)))
df[df$Name %in% agg[agg$x == T, 1] ,]
Here is a slightly more straightforward tidyverse solution.
First, expand the dataframe to include all combinations of Name + year:
library(tidyverse) # provides %>%, complete(), group_by() and filter()
df %>% complete(Name, year)
# A tibble: 20 x 5
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
6 B 2011 2 20 -8484
7 B 2012 NA NA NA
8 B 2013 NA NA NA
9 B 2014 NA NA NA
10 B 2015 NA NA NA
...
Then extend the pipe to group by "Name", and filter to keep only those with 0 NA values:
df %>% complete(Name, year) %>%
group_by(Name) %>%
filter(sum(is.na(age)) == 0)
# A tibble: 5 x 5
# Groups: Name [1]
Name year num age X
<fctr> <int> <int> <int> <int>
1 A 2011 1 68 116292
2 A 2012 1 69 46132
3 A 2013 1 70 7042
4 A 2014 1 71 -100425
5 A 2015 1 72 6493
Just check which names have the right number of entries.
## Reproduce your data
df = read.table(text=" num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432",
header=TRUE)
Tab = table(df$Name)
Keepers = names(Tab)[which(Tab == 5)]
df[df$Name %in% Keepers,]
num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
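Counting rows with table() works here because every subject has at most one row per year and no years outside 2011 to 2015. If that might not hold, a slightly more defensive variant could count the distinct target years per name instead. This is a sketch on a cut-down copy of the data; target, n_hit and keepers are names I made up.

```r
df <- data.frame(
  Name = c("A", "A", "A", "A", "A", "B", "C", "D", "D", "D"),
  year = c(2011:2015, 2011, 2015, 2011:2013)
)

target <- 2011:2015

# Number of distinct target years seen for each name
n_hit <- tapply(df$year, df$Name, function(y) length(intersect(y, target)))

# Keep only names that cover every target year
keepers <- names(n_hit)[n_hit == length(target)]
df[df$Name %in% keepers, ]
```

Because intersect() drops duplicates and ignores out-of-range years, repeated rows or an extra 2016 row would not change which names are kept.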
Here is a somewhat different approach using tidyverse packages:
library(tidyverse)
df <- read.table(text = " num Name year age X
1 1 A 2011 68 116292
2 1 A 2012 69 46132
3 1 A 2013 70 7042
4 1 A 2014 71 -100425
5 1 A 2015 72 6493
6 2 B 2011 20 -8484
7 3 C 2015 23 -120836
8 4 D 2011 3 -26523
9 4 D 2012 4 9923
10 4 D 2013 5 82432")
df2 <- spread(data = df, key = Name, value = year)
x <- colSums(df2[, 4:7], na.rm = TRUE) > 10000
df3 <- select(df2, num, age, X, c(4:7)[x])
df4 <- na.omit(df3)
All steps can of course be constructed as one single pipe with the %>% operator.

Correct previous year by id within R

I have data something like this:
df <- data.frame(Id=c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,9,9,9,9),Date=c("2013-04","2013-12","2013-01","2013-12","2013-11",
"2013-12","2012-04","2013-12","2012-08","2014-12","2013-08","2014-12","2013-08","2014-12","2011-01","2013-11","2013-12","2014-01","2014-04"))
To get the correct format:
df$Date <- paste0(df$Date,"-01")
I need to obtain only the years, so that each Id ends up with two consecutive years.
If I do something like this on the existing data:
require(lubridate)
df$Date <- year(as.Date(df$Date)-days(1))
I get sometimes same date for given id.
The desired output for the column Date is this:
2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011 2013 2014
Please note that the last date for a given Id is always correct, so only the preceding year has to be corrected based on the last date. The date has to be in a format that can be converted to years only, as shown.
EDIT Here is the case:
Id Date
1 2013-11-01
1 2013-12-01
1 2014-01-01
1 2014-04-01
Now I'm getting this: 2012,2013,2013,2013
I would need: 2012,2013,2013,2014
This is how I would solve this using data.table package (though it looks over complicated to me)
library(data.table)
setDT(df)[, year := year(Date)][,
year := if(.N == 2) (year[2] - 1):year[2] else year,
Id][]
# Id Date year
# 1: 1 2013-04-01 2012
# 2: 1 2013-12-01 2013
# 3: 2 2013-01-01 2012
# 4: 2 2013-12-01 2013
# 5: 3 2013-11-01 2012
# 6: 3 2013-12-01 2013
# 7: 4 2012-04-01 2012
# 8: 4 2013-12-01 2013
# 9: 5 2012-08-01 2013
# 10: 5 2014-12-01 2014
# 11: 6 2013-08-01 2013
# 12: 6 2014-12-01 2014
# 13: 7 2013-08-01 2013
# 14: 7 2014-12-01 2014
# 15: 8 2011-01-01 2011
Or all in one step (thanks to @Arun for providing this):
setDT(df)[, year := {tmp = year(Date);
if (.N == 2L) (tmp[2]-1L):tmp[2] else tmp},
Id]
Edit:
Per the OP's new data, we can modify the code by adding an additional index:
setDT(df)[, indx := if(.N > 2) rep(seq_len(.N/2), each = 2) + 1L else .N, Id]
df[, year := {tmp = year(Date); if (.N > 1L) (tmp[2] - 1L):tmp[2] else tmp},
list(Id, indx)][]
# Id Date indx year
# 1: 1 2013-04-01 2 2012
# 2: 1 2013-12-01 2 2013
# 3: 2 2013-01-01 2 2012
# 4: 2 2013-12-01 2 2013
# 5: 3 2013-11-01 2 2012
# 6: 3 2013-12-01 2 2013
# 7: 4 2012-04-01 2 2012
# 8: 4 2013-12-01 2 2013
# 9: 5 2012-08-01 2 2013
# 10: 5 2014-12-01 2 2014
# 11: 6 2013-08-01 2 2013
# 12: 6 2014-12-01 2 2014
# 13: 7 2013-08-01 2 2013
# 14: 7 2014-12-01 2 2014
# 15: 8 2011-01-01 1 2011
# 16: 9 2013-11-01 2 2012
# 17: 9 2013-12-01 2 2013
# 18: 9 2014-01-01 3 2013
# 19: 9 2014-04-01 3 2014
Or another possible solution provided by @akrun
setDT(df)[, `:=`(year = year(Date), indx = .N, indx2 = as.numeric(gl(.N,2, .N))), Id]
df[indx > 1, year:=(year[2]-1):year[2], list(Id, indx2)][]
Using dplyr, with an approach similar to @David Arenburg's
library(dplyr)
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date)),
year=replace(year, n()>1, c(year[2]-1, year[2])))
# Id Date year
#1 1 2013-04 2012
#2 1 2013-12 2013
#3 2 2013-01 2012
#4 2 2013-12 2013
#5 3 2013-11 2012
#6 3 2013-12 2013
#7 4 2012-04 2012
#8 4 2013-12 2013
#9 5 2012-08 2013
#10 5 2014-12 2014
#11 6 2013-08 2013
#12 6 2014-12 2014
#13 7 2013-08 2013
#14 7 2014-12 2014
#15 8 2011-01 2011
Or using base R
with(df, ave(as.numeric(sub('-.*', '', Date)), Id,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
Update
You can try
df$indx <- with(df, ave(Id, Id, FUN=function(x) (seq_along(x)-1)%/%2+1))
with(df, ave(as.numeric(sub('-.*', '', Date)), Id, indx,
FUN=function(x) if(length(x)>1)(x[2]-1):x[2] else x))
#[1] 2012 2013 2012 2013 2012 2013 2012 2013 2013 2014 2013 2014 2013 2014 2011
#[16] 2012 2013 2013 2014
Or
df %>%
group_by(Id) %>%
mutate(year=as.numeric(sub('-.*', '', Date))) %>%
group_by(indx=cumsum(rep(c(TRUE,FALSE), length.out=n())), .add=TRUE) %>% # .add replaces the add argument deprecated in dplyr 1.0.0
mutate(year=replace(year, n()>1, c(year[2]-1, year[2])))
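To see what the pairing index in the update does, here is a stripped-down base R sketch using plain vectors for Id 8 and Id 9. The names id, year, pair and fixed are mine, and the year values are typed in by hand rather than parsed from Date.

```r
id   <- c(8, 9, 9, 9, 9)
year <- c(2011, 2013, 2013, 2014, 2014)

# Number the rows within each id in pairs: rows 1-2 get 1, rows 3-4 get 2, ...
pair <- ave(id, id, FUN = function(x) (seq_along(x) - 1) %/% 2 + 1)

# Within each (id, pair) block, rewrite a two-row block as (last year - 1, last year);
# single-row blocks are left unchanged
fixed <- ave(year, id, pair,
             FUN = function(x) if (length(x) > 1) (x[2] - 1):x[2] else x)

data.frame(id, year, pair, fixed)
```

Grouping on the pair index as well as the id is what lets an id with four rows be corrected as two separate two-row blocks. As written, this assumes each id has either one row or an even number of rows.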
Here's a dplyr solution. You can remove the intermediate fields last_year and year2, but I left them here for clarity:
library(stringr)
library(dplyr)
df %>%
group_by(Id) %>%
mutate(
last_year = last(as.integer(str_sub(Date, 1, 4))),
year2 = row_number() - n(),
year = last_year + year2
)

Taking Average and Median by Month and then Ordering by Date and Factor in R

Let's suppose I have the following data:
set.seed(123)
Dates <- c("2013-10-07","2013-10-14","2013-11-21","2013-11-28" , "2013-12-04" , "2013-12-11","2013-01-18","2013-01-18")
Dates.New <- c(Dates,Dates)
Values <- sample(seq(1:10),16,replace = TRUE)
Factor <- c(rep("Group 1",8),rep("Group 2",8))
df <- data.frame(Dates.New,Values,Factor)
df[sample(1:nrow(df)),]
This returns
Dates.New Values Factor
4 2013-11-28 9 Group 1
1 2013-10-07 3 Group 1
5 2013-12-04 10 Group 1
13 2013-12-04 7 Group 2
11 2013-11-21 10 Group 2
8 2013-01-18 9 Group 1
7 2013-01-18 6 Group 1
9 2013-10-07 6 Group 2
6 2013-12-11 1 Group 1
14 2013-12-11 6 Group 2
16 2013-01-18 9 Group 2
3 2013-11-21 5 Group 1
2 2013-10-14 8 Group 1
15 2013-01-18 2 Group 2
12 2013-11-28 5 Group 2
10 2013-10-14 5 Group 2
What I am trying to do is find the monthly average and median for both of my factors, then order each group by month in a new data frame. So the new data frame would have the median and average for months 10, 11, 12, 1 for Group 1 bundled together, and the next 4 rows would have the median and average for months 10, 11, 12, 1 for Group 2 bundled together as well. I am open to packages. Thanks!
Here is a data.table solution. The question seems to be looking for both mean and median. See if this suits your need.
library(zoo); library(data.table)
setDT(df)[, list(Mean = mean(Values),
Median = median(Values)),
by = list(Factor, as.yearmon(Dates.New))][order(Factor, as.yearmon)]
# Factor as.yearmon Mean Median
# 1: Group 1 Jan 2013 7.5 7.5
# 2: Group 1 Oct 2013 5.5 5.5
# 3: Group 1 Nov 2013 7.0 7.0
# 4: Group 1 Dec 2013 5.5 5.5
# 5: Group 2 Jan 2013 5.5 5.5
# 6: Group 2 Oct 2013 5.5 5.5
# 7: Group 2 Nov 2013 7.5 7.5
# 8: Group 2 Dec 2013 6.5 6.5
Like this?
df$Dates.New <- as.Date(df$Dates.New)
library(zoo) # for as.yearmon(...)
result <- aggregate(Values~as.yearmon(Dates.New)+Factor,df,mean)
names(result)[1] <- "Year.Mon"
result
# Year.Mon Factor Values
# 1 Jan 2013 Group 1 7.5
# 2 Oct 2013 Group 1 5.5
# 3 Nov 2013 Group 1 7.0
# 4 Dec 2013 Group 1 5.5
# 5 Jan 2013 Group 2 5.5
# 6 Oct 2013 Group 2 5.5
# 7 Nov 2013 Group 2 7.5
# 8 Dec 2013 Group 2 6.5
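The aggregate() call above returns only the mean. If both statistics are wanted without extra packages, one possible base R sketch follows. It uses a small hand-made subset of the data and a "%Y-%m" text key in place of as.yearmon; the column names Month, Mean and Median are my own.

```r
df <- data.frame(
  Dates.New = as.Date(rep(c("2013-10-07", "2013-10-14",
                            "2013-01-18", "2013-01-18"), times = 2)),
  Values = c(3, 8, 6, 9, 6, 5, 2, 9),
  Factor = rep(c("Group 1", "Group 2"), each = 4)
)

# Year-month key; "%Y-%m" strings sort chronologically
df$Month <- format(df$Dates.New, "%Y-%m")

# FUN returns two values, which aggregate() stores as a matrix column
res <- aggregate(Values ~ Month + Factor, data = df,
                 FUN = function(v) c(Mean = mean(v), Median = median(v)))

# Flatten the matrix column into two ordinary columns
res <- do.call(data.frame, res)
names(res) <- c("Month", "Factor", "Mean", "Median")

# Group first, then month within group
res <- res[order(res$Factor, res$Month), ]
res
```

The do.call(data.frame, res) step is needed because aggregate() packs a multi-value FUN result into a single matrix column, which is awkward to work with afterwards.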
