Calculate average of values in R and add result as new rows instead of as a new column - r

I have a dataframe like the following one:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
and I want to calculate the average value by day for the three year period (2014, 2015, 2016). The following code works for this purpose:
data %>%
group_by(day) %>%
mutate(MEAN = mean(value))
and produces this output:
day year value MEAN
1 2014 5 7
1 2015 16 7
1 2016 0 7
2 2014 3 3
2 2015 1 3
2 2016 4 3
but I want to add the average values as new rows in the same dataframe as follows:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
1 avg 7 <--
2 avg 3 <--
Any suggestions about how can I possibly do this? Thanks!

We can use summarise (instead of mutate - which adds a new column in the original dataset) to calculate the mean and then with bind_rows can bind with original data. The tidyverse functions are very particular about type, so make sure the class are the same before we do the binding
library(dplyr)
data %>%
group_by(day) %>%
summarise(year = 'avg', value = mean(value)) %>%
bind_rows(data %>%
mutate(year = as.character(year)), .)
# day year value
#1 1 2014 5.00
#2 1 2015 16.00
#3 1 2016 0.00
#4 2 2014 3.00
#5 2 2015 1.00
#6 2 2016 4.00
#7 1 avg 7.00
#8 2 avg 2.67
Another option is to split by the 'day' and then with add_row (from tibble) create a new row on each of the list elements
library(tibble)
library(purrr)
data %>%
mutate(year = as.character(year)) %>%
group_split(day) %>%
map_dfr(~ .x %>% add_row(day = first(.$day),
year = 'avg', value = mean(.$value)))

Here is a base R option using aggregate
rbind(df,cbind(aggregate(value~day,df,mean),year = "avg")[c(1,3,2)])
or a variation (by #thelatemail from comments)
rbind(df, aggregate(df["value"], cbind(df["day"], year="avg"), FUN=mean))
which gives
day year value
1 1 2014 5.000000
2 1 2015 16.000000
3 1 2016 0.000000
4 2 2014 3.000000
5 2 2015 1.000000
6 2 2016 4.000000
7 1 avg 7.000000
8 2 avg 2.666667

Related

r conditional subtract number

I am trying to do the following logic to create 'subtract' column.
I have years from 1986-2014 and around 100 firms.
year firm count sum_of_year subtract
1986 A 1 2 2
1986 B 1 2 4
1987 A 2 4 5
1987 C 1 4 2
1987 D 1 4 5
1988 C 3 5
1988 E 2 5
That is, if a firm i at t appears in t+1, then subtract its count at t+1 from the sum_of_year at t+1,
if a firm i does not appear in t+1, then just put sum_of_year at t+1 as shown in the sample.
I am having difficulties in creating this conditional code.
How can I do this in a generalized version?
Thank you for your help.
One way using dplyr with the help of tidyr::complete. We complete the missing combinations of rows for year and firm and fill count with 0. For each year, we subtract the count by sum of count for that entire year and finally for each firm, we take the value from the next year using lead.
library(dplyr)
df %>%
tidyr::complete(year, firm, fill = list(count = 0)) %>%
group_by(year) %>%
mutate(n = sum(count) - count) %>%
group_by(firm) %>%
mutate(subtract = lead(n)) %>%
filter(count != 0) %>%
select(-n)
# year firm count sum_of_year subtract
# <int> <fct> <dbl> <int> <dbl>
#1 1986 A 1 2 2
#2 1986 B 1 2 4
#3 1987 A 2 4 5
#4 1987 C 1 4 2
#5 1987 D 1 4 5
#6 1988 C 3 5 NA
#7 1988 E 2 5 NA

Average percentage change over different years in R

I have a data frame from which I created a reproducible example:
country <- c('A','A','A','B','B','C','C','C','C')
year <- c(2010,2011,2015,2008,2009,2008,2009,2011,2015)
score <- c(1,2,2,1,4,1,1,3,2)
country year score
1 A 2010 1
2 A 2011 2
3 A 2015 2
4 B 2008 1
5 B 2009 4
6 C 2008 1
7 C 2009 1
8 C 2011 3
9 C 2015 2
And I am trying to calculate the average percentage increase (or decrease) in the score for each country by calculating [(final score - initial score) รท (initial score)] for each year and averaging it over the number of years.
country year score change
1 A 2010 1 NA
2 A 2011 2 1
3 A 2015 2 0
4 B 2008 1 NA
5 B 2009 4 3
6 C 2008 1 NA
7 C 2009 1 0
8 C 2011 3 2
9 C 2015 2 -0.33
The final result I am hoping to obtain:
country avg_change
1 A 0.5
2 B 3
3 C 0.55
As you can see, the trick is that countries have spans over different years, sometimes with a missing year in between. I tried different ways to do it manually but I do struggle. If someone could hint me a solution would be great. Many thanks.
With dplyr, we can group_by country and get mean of difference between scores.
library(dplyr)
df %>%
group_by(country) %>%
summarise(avg_change = mean(c(NA, diff(score)), na.rm = TRUE))
# country avg_change
# <fct> <dbl>
#1 A 0.500
#2 B 3.00
#3 C 0.333
Using base R aggregate with same logic
aggregate(score~country, df, function(x) mean(c(NA, diff(x)), na.rm = TRUE))
We can use data.table to group by 'country' and take the mean of the difference between the 'score' and the lag of 'score'
library(data.table)
setDT(df1)[, .(avg_change = mean(score -lag(score), na.rm = TRUE)), .(country)]
# country avg_change
#1: A 0.5000000
#2: B 3.0000000
#3: C 0.3333333

how do I identify rows where an element appears for the first time?

I have the following data frame of student records. what I want is to identify students who joined a certain program in 2014 for the first time when they were in 9th grade.
names.first<-c('a','a','b','b','c','d')
names.last<-c('c','c','z','z','f','h')
year<-c(2014,2013,2014,2015,2015,2014)
grade<-c(9,8,9,10,10,10)
df<-data.frame(names.first,names.last,year,grade)
df
To do this, I have used the following statement to say that I want students where the program year==2014 and their grade ==9.
df$first.cohort<-ifelse(df$year==2014 & df$grade==9,1,0)
df
names.first names.last year grade first.cohort
1 a c 2014 9 1
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
However, as you can notice this would include students who didn't enter the program in year 2014 such as student awho started in 2013. How do I create a ifelse statement where I only capture students who are in 9th grade and started the program in 2014 for the first time so that the df looks like
names.first names.last year grade first.cohort
1 a c 2014 9 0
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
We can use first after arrangeing by 'name' and 'year' to create the logical expression
library(dplyr)
df %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 4
# Groups: names [4]
# names year grade first.cohort
# <fct> <dbl> <dbl> <int>
#1 a 2013 8 0
#2 a 2014 9 0
#3 b 2014 9 1
#4 b 2015 10 0
#5 c 2015 10 0
#6 d 2014 10 0
For keeping the same order as in the input dataset, we can create a sequence column first and then do the arrange on the column after the mutate
df %>%
mutate(rn = row_number()) %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
Or using the same logic with data.table that have the additional advantage of keeping the same order as in the input dataset
library(data.table)
setDT(df)[order(names, year), first.cohort := as.integer(grade == 9 &
first(year) == 2014), names]
Update
With the new example in the OP's post, we do the grouping by both the 'names' column
df %>%
arrange(names.first, names.last, year) %>%
group_by(names.first, names.last) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 5
# Groups: names.first, names.last [4]
# names.first names.last year grade first.cohort
# <fct> <fct> <dbl> <dbl> <int>
#1 a c 2013 8 0
#2 a c 2014 9 0
#3 b z 2014 9 1
#4 b z 2015 10 0
#5 c f 2015 10 0
#6 d h 2014 10 0
Using dplyr
library(dplyr)
df%>%group_by(names)%>%dplyr::mutate(Fc=as.numeric((year==2014&grade==9)&(min(year)==2014)))
# A tibble: 6 x 4
# Groups: names [4]
names year grade Fc
<fctr> <dbl> <dbl> <dbl>
1 a 2014 9 0
2 a 2013 8 0
3 b 2014 9 1
4 b 2015 10 0
5 c 2015 10 0
6 d 2014 10 0

Combine data in many row into a columnn

I have a data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
1
If every year has the same number of data, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2
If every year does not have the same number of data, you could use rbind.fill of plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3
Yet another way is to use dcast to convert data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA

data standardization for all group data.frame in R

I have a dataset as below
Date <- rep(c("Jan", "Feb"), 3)[1:5]
Group <- c(rep(letters[1:2],each=2),"c")
value <- sample(1:10,5)
data <- data.frame(Date, Group, value)
> data
Date Group value
1 Jan a 2
2 Feb a 7
3 Jan b 3
4 Feb b 9
5 Jan c 1
As you can observed, for group c it do not have data on Date=Feb.
How can i make a dataset such that
> DATA
Date Group value
1 Jan a 2
2 Feb a 7
3 Jan b 3
4 Feb b 9
5 Jan c 1
6 Feb c 0
I have added last row such that value for group c in feb is 0.
Thanks
With base R you can use xtabs wrapped in as.data.frame:
as.data.frame(xtabs(formula = value ~ Date + Group, data = data))
# Date Group Freq
#1 Feb a 8
#2 Jan a 6
#3 Feb b 4
#4 Jan b 1
#5 Feb c 0
#6 Jan c 10
Using merge:
#get all combinations of 2 columns
all.comb <- expand.grid(unique(data$Date),unique(data$Group))
colnames(all.comb) <- c("Date","Group")
#merge with all.x=TRUE to keep nonmatched rows
res <- merge(all.comb,data,all.x=TRUE)
#convert NA to 0
res$value[is.na(res$value)] <- 0
#result
res
# Date Group value
# 1 Feb a 3
# 2 Feb b 4
# 3 Feb c 0
# 4 Jan a 5
# 5 Jan b 7
# 6 Jan c 10
Using reshape2
library(reshape2)
melt(dcast(data, Date~Group, value.var="value",fill=0), id.var="Date") #values differ as there was no set.seed()
# Date variable value
#1 Feb a 1
#2 Jan a 10
#3 Feb b 7
#4 Jan b 4
#5 Feb c 0
#6 Jan c 5
Or using dplyr
library(dplyr)
library(tidyr)
data%>%
spread(Group, value, fill=0) %>%
gather(Group, value, a:c)
# Date Group value
#1 Feb a 1
#2 Jan a 10
#3 Feb b 7
#4 Jan b 4
#5 Feb c 0
#6 Jan c 5

Resources