Subtract rows varying one column but keeping others fixed - r

I have an experiment where I need to subtract values of two different treatments from the Control (baseline), but these subtractions must correspond to other columns, named block and year sampled.
Dummy data frame:
df <- data.frame("Treatment" = c("Control","Treat1", "Treat2"),
"Block" = rep(1:3, each=3), "Year" = rep(2011:2013, each=3),
"Value" = c(6,12,4,3,9,5,6,3,1));df
Treatment Block Year Value
1 Control 1 2011 6
2 Treat1 1 2011 12
3 Treat2 1 2011 4
4 Control 2 2012 3
5 Treat1 2 2012 9
6 Treat2 2 2012 5
7 Control 3 2013 6
8 Treat1 3 2013 3
9 Treat2 3 2013 1
Desired output:
Treatment Block Year Value
1 Control-Treat1 1 2011 -6
2 Control-Treat2 1 2011 2
3 Control-Treat1 2 2012 -6
4 Control-Treat2 2 2012 -2
5 Control-Treat1 3 2013 3
6 Control-Treat2 3 2013 5
Any suggestion, preferably using dplyr?
I have found similar questions but none addressing this specific issue.

We can use dplyr, group_by Block and subtract Value where Treatment == "Control" from each Value and remove the "Control" rows.
library(dplyr)
df %>%
group_by(Block) %>%
mutate(Value = Value[which.max(Treatment == "Control")] - Value) %>%
filter(Treatment != "Control")
# Treatment Block Year Value
# <fct> <int> <int> <dbl>
#1 Treat1 1 2011 -6
#2 Treat2 1 2011 2
#3 Treat1 2 2012 -6
#4 Treat2 2 2012 -2
#5 Treat1 3 2013 3
#6 Treat2 3 2013 5
It is not clear whether the Treatment values in the expected output (Control-Treat1, Control-Treat2) only illustrate the calculation or are actually wanted in the result. If they are needed, we can use
df %>%
group_by(Block) %>%
mutate(Value = Value[which.max(Treatment == "Control")] - Value,
Treatment = paste0("Control-", Treatment)) %>%
filter(Treatment != "Control-Control")
# Treatment Block Year Value
# <chr> <int> <int> <dbl>
#1 Control-Treat1 1 2011 -6
#2 Control-Treat2 1 2011 2
#3 Control-Treat1 2 2012 -6
#4 Control-Treat2 2 2012 -2
#5 Control-Treat1 3 2013 3
#6 Control-Treat2 3 2013 5
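The grouped subtraction above can also be sketched in base R. This is only an illustration, assuming exactly one Control row per Block/Year combination; the helper names `key`, `ctrl`, `baseline`, and `out` are made up for the example. Unlike the answer above, it matches on both Block and Year rather than Block alone.

```r
# Rebuild the example data from the question
df <- data.frame(
  Treatment = c("Control", "Treat1", "Treat2"),
  Block = rep(1:3, each = 3),
  Year = rep(2011:2013, each = 3),
  Value = c(6, 12, 4, 3, 9, 5, 6, 3, 1)
)

# Look up the Control value that belongs to each row's Block/Year pair
key <- paste(df$Block, df$Year)
ctrl <- df[df$Treatment == "Control", ]
baseline <- ctrl$Value[match(key, paste(ctrl$Block, ctrl$Year))]

# Subtract, drop the Control rows, and relabel the treatments
out <- df
out$Value <- baseline - df$Value
out <- out[out$Treatment != "Control", ]
out$Treatment <- paste0("Control-", out$Treatment)
rownames(out) <- NULL
out
```

If a Block/Year pair had no Control row, `match` would return NA and the differences for that pair would be NA as well.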

A somewhat different tidyverse possibility could be:
library(dplyr)
library(tidyr)
df %>%
spread(Treatment, Value) %>%
gather(var, val, -c(Block, Year, Control)) %>%
mutate(Value = Control - val,
Treatment = paste("Control", var, sep = " - ")) %>%
select(Treatment, Block, Year, Value) %>%
arrange(Block)
Treatment Block Year Value
1 Control - Treat1 1 2011 -6
2 Control - Treat2 1 2011 2
3 Control - Treat1 2 2012 -6
4 Control - Treat2 2 2012 -2
5 Control - Treat1 3 2013 3
6 Control - Treat2 3 2013 5
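In current tidyr releases, spread and gather are superseded by pivot_wider and pivot_longer. A minimal sketch of the same reshape with the newer functions, assuming tidyr 1.0.0 or later is installed (the intermediate names `var` and `val` follow the answer above; `result` is just a name for the example):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Treatment = c("Control", "Treat1", "Treat2"),
  Block = rep(1:3, each = 3),
  Year = rep(2011:2013, each = 3),
  Value = c(6, 12, 4, 3, 9, 5, 6, 3, 1)
)

result <- df %>%
  # one column per treatment, one row per Block/Year
  pivot_wider(names_from = Treatment, values_from = Value) %>%
  # back to long form, keeping Control alongside each treatment value
  pivot_longer(cols = -c(Block, Year, Control),
               names_to = "var", values_to = "val") %>%
  mutate(Value = Control - val,
         Treatment = paste0("Control-", var)) %>%
  select(Treatment, Block, Year, Value)

result
```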

This can be done with an SQL self join like this:
library(sqldf)
sqldf("select a.Treatment || '-' || b.Treatment as Treatment,
a.Block,
a.Year,
a.Value - b.Value as Value
from df a
join df b on a.block = b.block and
a.Treatment = 'Control' and
b.Treatment != 'Control'")
giving:
Treatment Block Year Value
1 Control-Treat1 1 2011 -6
2 Control-Treat2 1 2011 2
3 Control-Treat1 2 2012 -6
4 Control-Treat2 2 2012 -2
5 Control-Treat1 3 2013 3
6 Control-Treat2 3 2013 5
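The same self-join idea can be sketched in base R with merge, without SQL. This is an illustration only; the suffixes `.ctrl` and `.trt` and the names `control`, `treated`, `joined`, and `result` are chosen for the example.

```r
df <- data.frame(
  Treatment = c("Control", "Treat1", "Treat2"),
  Block = rep(1:3, each = 3),
  Year = rep(2011:2013, each = 3),
  Value = c(6, 12, 4, 3, 9, 5, 6, 3, 1)
)

# Split into the baseline rows and the treatment rows
control <- df[df$Treatment == "Control", ]
treated <- df[df$Treatment != "Control", ]

# Join each treatment row to the Control row of the same Block and Year
joined <- merge(control, treated, by = c("Block", "Year"),
                suffixes = c(".ctrl", ".trt"))
joined <- joined[order(joined$Block, joined$Treatment.trt), ]

result <- data.frame(
  Treatment = paste(joined$Treatment.ctrl, joined$Treatment.trt, sep = "-"),
  Block = joined$Block,
  Year = joined$Year,
  Value = joined$Value.ctrl - joined$Value.trt
)
result
```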

Another dplyr-tidyr approach (unwanted columns can be removed afterwards with select):
library(tidyr)
library(dplyr)
df %>%
spread(Treatment,Value) %>%
gather(key,value,Treat1:Treat2) %>%
group_by(Block,Year,key) %>%
mutate(Val=Control-value)
# A tibble: 6 x 6
# Groups: Block, Year, key [6]
Block Year Control key value Val
<int> <int> <dbl> <chr> <dbl> <dbl>
1 1 2011 6 Treat1 12 -6
2 2 2012 3 Treat1 9 -6
3 3 2013 6 Treat1 3 3
4 1 2011 6 Treat2 4 2
5 2 2012 3 Treat2 5 -2
6 3 2013 6 Treat2 1 5
Closer to the requested output:
df %>%
spread(Treatment,Value) %>%
gather(key,value,Treat1:Treat2) %>%
mutate(Treatment=paste0("Control-",key)) %>%
group_by(Block,Year,Treatment) %>%
mutate(Val=Control-value) %>%
select(Treatment,everything(),-value,-key)%>%
arrange(Year)
Result:
# A tibble: 6 x 5
# Groups: Block, Year, Treatment [6]
Treatment Block Year Control Val
<chr> <int> <int> <dbl> <dbl>
1 Control-Treat1 1 2011 6 -6
2 Control-Treat2 1 2011 6 2
3 Control-Treat1 2 2012 3 -6
4 Control-Treat2 2 2012 3 -2
5 Control-Treat1 3 2013 6 3
6 Control-Treat2 3 2013 6 5

Another tidyverse solution. We can use filter to separate the "Control" rows and the treatment rows into different data frames, use left_join to combine them by Block and Year, and then compute the difference.
library(tidyverse)
df2 <- df %>%
filter(!Treatment %in% "Control") %>%
left_join(df %>% filter(Treatment %in% "Control"),
.,
by = c("Block", "Year")) %>%
mutate(Value = Value.x - Value.y) %>%
unite(Treatment, Treatment.x, Treatment.y, sep = "-") %>%
select(names(df))
# Treatment Block Year Value
# 1 Control-Treat1 1 2011 -6
# 2 Control-Treat2 1 2011 2
# 3 Control-Treat1 2 2012 -6
# 4 Control-Treat2 2 2012 -2
# 5 Control-Treat1 3 2013 3
# 6 Control-Treat2 3 2013 5

Related

Calculate average of values in R and add result as new rows instead of as a new column

I have a dataframe like the following one:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
and I want to calculate the average value by day for the three year period (2014, 2015, 2016). The following code works for this purpose:
data %>%
group_by(day) %>%
mutate(MEAN = mean(value))
and produces this output:
day year value MEAN
1 2014 5 7
1 2015 16 7
1 2016 0 7
2 2014 3 2.67
2 2015 1 2.67
2 2016 4 2.67
but I want to add the average values as new rows in the same dataframe as follows:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
1 avg 7 <--
2 avg 2.67 <--
Any suggestions about how I can do this? Thanks!
We can use summarise (instead of mutate, which adds a new column to the original dataset) to calculate the mean, and then bind the result to the original data with bind_rows. The tidyverse functions are strict about types, so make sure the year column has the same class in both data frames before binding.
library(dplyr)
data %>%
group_by(day) %>%
summarise(year = 'avg', value = mean(value)) %>%
bind_rows(data %>%
mutate(year = as.character(year)), .)
# day year value
#1 1 2014 5.00
#2 1 2015 16.00
#3 1 2016 0.00
#4 2 2014 3.00
#5 2 2015 1.00
#6 2 2016 4.00
#7 1 avg 7.00
#8 2 avg 2.67
Another option is to split by 'day' and then use add_row (from tibble) to append a new row to each of the list elements
library(tibble)
library(purrr)
data %>%
mutate(year = as.character(year)) %>%
group_split(day) %>%
map_dfr(~ .x %>% add_row(day = first(.$day),
year = 'avg', value = mean(.$value)))
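The split-and-append idea can also be sketched in base R. This is an illustration under the assumption that the question's data frame is called `data`; `pieces` and `result` are names made up for the example. Note that with this approach each average row lands directly after its own day rather than at the end of the data frame.

```r
data <- data.frame(
  day = c(1, 1, 1, 2, 2, 2),
  year = rep(2014:2016, times = 2),
  value = c(5, 16, 0, 3, 1, 4)
)

# year must be character so that "avg" can share the column
data$year <- as.character(data$year)

# For each day, append one summary row holding the mean of value
pieces <- lapply(split(data, data$day), function(d) {
  rbind(d, data.frame(day = d$day[1], year = "avg", value = mean(d$value)))
})

result <- do.call(rbind, pieces)
rownames(result) <- NULL
result
```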
Here is a base R option using aggregate
rbind(data, cbind(aggregate(value ~ day, data, mean), year = "avg")[c(1, 3, 2)])
or a variation (by @thelatemail from comments)
rbind(data, aggregate(data["value"], cbind(data["day"], year = "avg"), FUN = mean))
which gives
day year value
1 1 2014 5.000000
2 1 2015 16.000000
3 1 2016 0.000000
4 2 2014 3.000000
5 2 2015 1.000000
6 2 2016 4.000000
7 1 avg 7.000000
8 2 avg 2.666667

Find the difference within a group based on a condition

I have a data frame with entries such as:
V1 zipcode year V2
1 11747 2012 5
2 11747 2012 10
3 11747 2012 20
4 11747 2012 15
1 11000 2012 20
2 11000 2012 15
3 11000 2012 20
For each zipcode-year combination, I want the difference between V2 at the maximum value of V1 and V2 at the minimum value of V1. In this case, 15-5 and 20-20.
I would like the result to look like:
V1 zipcode year V2 V3
1 11747 2012 5 10
2 11747 2012 10 10
3 11747 2012 20 10
4 11747 2012 15 10
1 11000 2012 20 0
2 11000 2012 15 0
3 11000 2012 20 0
So far I have attempted to use group_by and mutate with the following conditions:
df %>% group_by(year, zipcode) %>% mutate(difV2 = df[df$V1== max(df$V1),4] - df[df$V1== min(df$V1),4])
Any tips would be greatly appreciated!
After the group_by, we need to only use the column names instead of df$V1. When we do df$V1, it is extracting the whole column and not the values that are specific to that group.
library(dplyr)
df %>%
group_by(year, zipcode) %>%
mutate(V3 = V2[V1== max(V1)] - V2[V1== min(V1)])
# A tibble: 7 x 5
# Groups: year, zipcode [2]
# V1 zipcode year V2 V3
# <int> <int> <int> <int> <int>
#1 1 11747 2012 5 10
#2 2 11747 2012 10 10
#3 3 11747 2012 20 10
#4 4 11747 2012 15 10
#5 1 11000 2012 20 0
#6 2 11000 2012 15 0
#7 3 11000 2012 20 0
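A base R sketch of the same per-group difference, for comparison. It assumes V1 has a single maximum and a single minimum within each group; which.max and which.min are used so that exactly one row is picked even if there were ties, which is a small change from the `V1 == max(V1)` comparison above. The names `groups` and `diffs` are made up for the example.

```r
df <- data.frame(
  V1 = c(1, 2, 3, 4, 1, 2, 3),
  zipcode = c(rep(11747, 4), rep(11000, 3)),
  year = 2012,
  V2 = c(5, 10, 20, 15, 20, 15, 20)
)

# One data frame per zipcode/year combination
groups <- split(df, list(df$zipcode, df$year), drop = TRUE)

# V2 at the largest V1 minus V2 at the smallest V1, per group
diffs <- sapply(groups, function(g) {
  g$V2[which.max(g$V1)] - g$V2[which.min(g$V1)]
})

# Map the per-group result back onto every row via the group label
df$V3 <- unname(diffs[paste(df$zipcode, df$year, sep = ".")])
df
```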

How do I identify rows where an element appears for the first time?

I have the following data frame of student records. What I want is to identify students who joined a certain program in 2014 for the first time when they were in 9th grade.
names.first<-c('a','a','b','b','c','d')
names.last<-c('c','c','z','z','f','h')
year<-c(2014,2013,2014,2015,2015,2014)
grade<-c(9,8,9,10,10,10)
df<-data.frame(names.first,names.last,year,grade)
df
To do this, I have used the following statement to say that I want students where the program year==2014 and their grade ==9.
df$first.cohort<-ifelse(df$year==2014 & df$grade==9,1,0)
df
names.first names.last year grade first.cohort
1 a c 2014 9 1
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
However, as you can see, this would include students who didn't enter the program in 2014, such as student a, who started in 2013. How do I create an ifelse statement that only captures students who are in 9th grade and started the program in 2014 for the first time, so that the df looks like
names.first names.last year grade first.cohort
1 a c 2014 9 0
2 a c 2013 8 0
3 b z 2014 9 1
4 b z 2015 10 0
5 c f 2015 10 0
6 d h 2014 10 0
We can use first after arranging by 'names' and 'year' to create the logical expression
library(dplyr)
df %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 4
# Groups: names [4]
# names year grade first.cohort
# <fct> <dbl> <dbl> <int>
#1 a 2013 8 0
#2 a 2014 9 0
#3 b 2014 9 1
#4 b 2015 10 0
#5 c 2015 10 0
#6 d 2014 10 0
For keeping the same order as in the input dataset, we can create a sequence column first and then do the arrange on the column after the mutate
df %>%
mutate(rn = row_number()) %>%
arrange(names, year) %>%
group_by(names) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014)) %>%
ungroup %>%
arrange(rn) %>%
select(-rn)
Or use the same logic with data.table, which has the additional advantage of keeping the same order as in the input dataset
library(data.table)
setDT(df)[order(names, year), first.cohort := as.integer(grade == 9 &
first(year) == 2014), names]
Update
With the new example in the OP's post, we group by both 'names' columns
df %>%
arrange(names.first, names.last, year) %>%
group_by(names.first, names.last) %>%
mutate(first.cohort = as.integer(grade == 9 & first(year) == 2014))
# A tibble: 6 x 5
# Groups: names.first, names.last [4]
# names.first names.last year grade first.cohort
# <fct> <fct> <dbl> <dbl> <int>
#1 a c 2013 8 0
#2 a c 2014 9 0
#3 b z 2014 9 1
#4 b z 2015 10 0
#5 c f 2015 10 0
#6 d h 2014 10 0
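For comparison, a base R sketch using ave to get each student's first year without reordering the rows. It is an illustration only: the helper names `student` and `first_year` are made up, and the condition also requires the row's own year to be 2014, which is slightly stricter than the expression above.

```r
df <- data.frame(
  names.first = c("a", "a", "b", "b", "c", "d"),
  names.last = c("c", "c", "z", "z", "f", "h"),
  year = c(2014, 2013, 2014, 2015, 2015, 2014),
  grade = c(9, 8, 9, 10, 10, 10)
)

# A single key per student, then the earliest year seen for that key
student <- paste(df$names.first, df$names.last)
first_year <- ave(df$year, student, FUN = min)

# Flag 9th-grade rows from 2014 only when 2014 is the student's first year
df$first.cohort <- as.integer(df$grade == 9 & df$year == 2014 & first_year == 2014)
df
```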
Using dplyr
library(dplyr)
df %>%
group_by(names) %>%
dplyr::mutate(Fc = as.numeric((year == 2014 & grade == 9) & (min(year) == 2014)))
# A tibble: 6 x 4
# Groups: names [4]
names year grade Fc
<fctr> <dbl> <dbl> <dbl>
1 a 2014 9 0
2 a 2013 8 0
3 b 2014 9 1
4 b 2015 10 0
5 c 2015 10 0
6 d 2014 10 0

Search in row for a certain value and report the date

Ciao,
Here is my replicating example.
a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
k=c(1,1,1,0,1,1,1,"NA")
l=c("11/1/2012","7/1/2012","11/1/2010",0 ,"8/1/2012","6/1/2012","3/1/2012","NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j,k,l)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3","anytest","date")
I am aiming to search through each row and find the first test column that is equal to 1. The new column I am aiming to create is "anytest". This column is 1 if test1, test2, or test3 equals 1, and 0 if none of them do. NA values are ignored: if test1 and test2 are NA but test3 equals 0, then anytest equals 0. I think I have made progress using this code:
anytestTRY = ifelse(rowSums(mydata[,c("test1","test2","test3")] == 1, na.rm=TRUE) > 0,1,0)
But now I am at a crossroads because I am aiming to search through each row to find the first column of test1 test2 or test3 that equals to 1 and then report the month and year for that test. So if test1 equals to 0 and test2 equals to NA and test3 equals to 1 I want the column which I created called date to have the month3 and year3 in analyzable time format. Thanks a million.
a=c(1,2,3,4,5,6,7,8)
b=c(1,1,0,0,0,"NA",0,"NA")
c=c(11,7,9,9,5,"NA",7,"NA")
d=c(2012,2011,2012,2014,2014,"NA",2011,"NA")
e=c(1,0,1,0,0,1,"NA","NA")
f=c(10,4,11,10,10,6,"NA","NA")
g=c(2014,2012,2010,2012,2013,2011,"NA","NA")
h=c(1,0,1,0,1,0,1,"NA")
i=c(2,12,12,6,8,11,3,"NA")
j=c(2011,2012,2011,2012,2012,2014,2012,"NA")
mydata = data.frame(a,b,c,d,e,f,g,h,i,j)
names(mydata) = c("id","test1","month1","year1","test2","month2","year2","test3","month3","year3")
library(tidyverse)
library(lubridate)
mydata %>%
mutate_all(~as.numeric(as.character(.))) %>% # update columns to numeric
group_by(id) %>% # for each id
nest() %>% # nest data
mutate(date = map(data, ~case_when(.$test1==1 ~ ymd(paste0(.$year1,"-",.$month1,"-",1)), # get date based on first test that is 1
.$test2==1 ~ ymd(paste0(.$year2,"-",.$month2,"-",1)),
.$test3==1 ~ ymd(paste0(.$year3,"-",.$month3,"-",1)))),
anytest = map(data, ~as.numeric(case_when(sum(c(.$test1, .$test2, .$test3)==1, na.rm = TRUE) > 0 ~ "1", # create anytest column; na.rm = TRUE so an NA test does not mask a 1
sum(is.na(c(.$test1, .$test2, .$test3))) == 3 ~ "NA",
TRUE ~ "0")))) %>%
unnest() # unnest data
which returns:
# # A tibble: 8 x 12
# id date anytest test1 month1 year1 test2 month2 year2 test3 month3 year3
# <dbl> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2012-11-01 1 1 11 2012 1 10 2014 1 2 2011
# 2 2 2011-07-01 1 1 7 2011 0 4 2012 0 12 2012
# 3 3 2010-11-01 1 0 9 2012 1 11 2010 1 12 2011
# 4 4 NA 0 0 9 2014 0 10 2012 0 6 2012
# 5 5 2012-08-01 1 0 5 2014 0 10 2013 1 8 2012
# 6 6 2011-06-01 1 NA NA NA 1 6 2011 0 11 2014
# 7 7 2012-03-01 1 0 7 2011 NA NA NA 1 3 2012
# 8 8 NA NA NA NA NA NA NA NA NA NA NA
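The same row-wise search can be sketched in base R with matrix indexing. This is only an illustration: it assumes the data are entered with real NA values rather than the string "NA", and the helper names `tests`, `months`, `years`, `hit`, `first_hit`, and `idx` are made up for the example.

```r
mydata <- data.frame(
  id = 1:8,
  test1 = c(1, 1, 0, 0, 0, NA, 0, NA),
  month1 = c(11, 7, 9, 9, 5, NA, 7, NA),
  year1 = c(2012, 2011, 2012, 2014, 2014, NA, 2011, NA),
  test2 = c(1, 0, 1, 0, 0, 1, NA, NA),
  month2 = c(10, 4, 11, 10, 10, 6, NA, NA),
  year2 = c(2014, 2012, 2010, 2012, 2013, 2011, NA, NA),
  test3 = c(1, 0, 1, 0, 1, 0, 1, NA),
  month3 = c(2, 12, 12, 6, 8, 11, 3, NA),
  year3 = c(2011, 2012, 2011, 2012, 2012, 2014, 2012, NA)
)

tests <- as.matrix(mydata[, c("test1", "test2", "test3")])
months <- as.matrix(mydata[, c("month1", "month2", "month3")])
years <- as.matrix(mydata[, c("year1", "year2", "year3")])

# TRUE where a test equals 1; NA entries count as "not 1"
hit <- !is.na(tests) & tests == 1

# anytest: 1 if any test is 1, 0 if none is, NA only if all tests are missing
all_missing <- rowSums(!is.na(tests)) == 0
mydata$anytest <- ifelse(all_missing, NA, as.integer(rowSums(hit) > 0))

# Column position (1, 2 or 3) of the first test equal to 1, NA if there is none
first_hit <- unname(apply(hit, 1, function(r) which(r)[1]))

# Pick month and year from that position; NA positions yield NA dates
idx <- cbind(seq_len(nrow(mydata)), first_hit)
date_str <- sprintf("%d-%02d-01", as.integer(years[idx]), as.integer(months[idx]))
mydata$date <- as.Date(date_str, format = "%Y-%m-%d")

mydata[, c("id", "anytest", "date")]
```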

Combine data in many rows into columns

I have data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
1. If every year has the same number of observations, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2. If the years do not all have the same number of observations, you could use rbind.fill.matrix from plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3. Yet another way is to use dcast to convert the data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
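The unequal-length case can also be sketched in base R alone by padding each year's values with NA to a common length. This is an illustration only; `groups`, `n`, and `wide` are names made up for the example, and the result is a matrix rather than a data frame.

```r
df <- data.frame(
  year = c(rep(2011:2013, each = 3), 2015),
  Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3, 5)
)

# One vector of Male values per year
groups <- split(df$Male, df$year)

# Indexing past the end of a vector yields NA, which pads the short groups
n <- max(lengths(groups))
wide <- sapply(groups, function(v) v[seq_len(n)])
wide
```

Wrapping the result in as.data.frame(wide, check.names = FALSE) would give a data frame that keeps the years as column names.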

Resources