Find first event and report year - r

Ciao,
I did post a similar inquiry but what I needed changed so deeply sorry as I work for a school district and they need different information!
Here is my replicating example.
a=c(1,2,3,4,5,6)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,b,c,d,e,f,g,h,i)
names(mydata) = c("id","test1","year1","test2","year2","test3","year3","anytest","year")
The nuts and bolts is to find the first '1' in test1 and test2 and test3 and then add to column value in year1 or year2 or year3 based on where the first '1' is found. I am aiming to search through each row and find the first test column that is equal to 1. The new column I am aiming to create is "anytest." This column is 1 if test1 or test2 or test3 equals to 1. If none of them do then it equals to 0. This ignores NA values..if test1 and test2 are NA but test3 equals to 0 then anytest equals to 0. Now I have made progress I think using this code:
anytestTRY = if(rowSums(mydata[,c(test1,test2,test3)] == 1, na.rm=TRUE) > 0],1,0)
But now I am at a crossroads because I am aiming to search through each row to find the first column of test1 test2 or test3 that equals to 1 and then report the year for that test. So if test1 equals to 0 and test2 equals to NA and test3 equals to 1 I want the column which I created called year3 to be in date. Then last of all if test 1 and test2 and test3 all equals to 0 or NA or some combination of the sort then date should be last year which here is 2014.

We can use rowSums from base R to create the anytest column
i1 <- grep('test', names(mydata))
NA^(rowSums(is.na(mydata[i1])) == 3) * (rowSums(mydata[i1] == 1, na.rm = TRUE) !=0)
#[1] 1 1 0 1 0 NA
If we also need a column of 'date', use max.col to get the column index of the max value of 'test' per row, extract the 'year' based on cbinding the row index with column index
i2 <- grep('year', names(mydata))
m1 <- replace(mydata[i1], is.na(mydata[i1]), 0)
i3 <- !rowSums(m1 == 1)
date <- rep(NA, nrow(mydata))
date[!i3] <- mydata[i2][!i3,][cbind(seq_len(sum(!i3)), max.col(m1[!i3,], 'first'))]
date[i3] <- do.call(pmax, mydata[i2][i3,])
date
#[1] 2010 2012 2014 2012 2014 2014

a=c(1,2,3,4,5,6)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,b,c,d,e,f,g)
names(mydata) = c("id","test1","year1","test2","year2","test3","year3")
library(tidyverse)
library(lubridate)
mydata %>%
mutate_all(~as.numeric(as.character(.))) %>% # update columns to numeric
group_by(id) %>% # for each id
nest() %>% # nest data
mutate(date = map(data, ~case_when(.$test1==1 ~ .$year1, # get year based on first test that is 1
.$test2==1 ~ .$year2,
.$test3==1 ~ .$year3,
TRUE ~ max(c(mydata$year1, mydata$year2, mydata$year3)))), # if no test is 1 get the maximum year in the original dataset
anytest = map(data, ~as.numeric(case_when(sum(c(.$test1, .$test2, .$test3)==1, na.rm = T) > 0 ~ "1", # create anytest column
sum(is.na(c(.$test1, .$test2, .$test3))) == 3 ~ "NA",
TRUE ~ "0")))) %>%
unnest()
which returns:
# # A tibble: 6 x 9
# id date anytest test1 year1 test2 year2 test3 year3
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 2010 1 1 2010 1 2012 1 2014
# 2 2 2012 1 0 2010 1 2012 0 2014
# 3 3 2014 0 NA 2010 0 2012 0 2014
# 4 4 2012 1 NA 2010 1 2012 0 2014
# 5 5 2014 0 0 2010 0 2012 0 2014
# 6 6 2014 NA NA 2010 NA 2012 NA 2014

Related

Sum values from rows with conditions in R

I need to create a new column with the sum values in several other columns, but with conditions.
My data is
ID <- c(A,B,C,D,E,F)
Q1 <- c(0,1,7,9,na,3)
Q2 <- c(0,3,2,2,na,3)
Q3 <- c(0,0,7,9,na,3)
dta <- as.data.frame (ID,Q1,Q2,Q3)
I need to sum values from the columns only if the values are < 4. If there is any value in any column that is > 4, the result should be dismissed. And I need to preserve the rows with only "na".
The result should look like
Result
0
4
na
na
na
9
I have tried :
library(dplyr)
dta %>% filter(Q1 < 4) %>% mutate(Result = rowSums(.[2:4]))
but then, all the rows with values > 4 disappear, and I was only able filter one row at a time. I have also tried:
dta$Result <- ifelse(c("Q1", "Q2", "Q3") < 4, rowSums(.[2:4]), NA)
but then all my results are "na"
ID <- c("A","B","C","D","E","F")
Q1 <- c(0,1,7,9,NA,3)
Q2 <- c(0,3,2,2,NA,3)
Q3 <- c(0,0,7,9,NA,3)
dta <- data.frame(ID,Q1,Q2,Q3)
You have to switch the sum and ifelse statement.
dta %>%
rowwise() %>%
mutate(result = sum(ifelse(c(Q1, Q2, Q3)<4, c(Q1, Q2, Q3), NA)))
You can use the following solution:
library(dplyr)
dta %>%
rowwise() %>%
mutate(Result = ifelse(any(c_across(Q1:Q3) > 4), NA, Reduce(`+`, c_across(Q1:Q3))))
# A tibble: 6 x 5
# Rowwise:
ID Q1 Q2 Q3 Result
<chr> <dbl> <dbl> <dbl> <dbl>
1 A 0 0 0 0
2 B 1 3 0 4
3 C 7 2 7 NA
4 D 9 2 9 NA
5 E NA NA NA NA
6 F 3 3 3 9

Removing rows with dates conditional to specific IDs

Basically, I have a data frame that contains IDs, Dates, VolumeX, and VolumeY.
I want to split the VolumeX data frame into before and after the max date of VolumeY specific to an ID.
Ex.
df looks like (with many different IDs) :
ID Date VolX VolY
1 2018 - 02- 01 5 -
1 2018 - 03- 01 6 -
1 2018 - 08- 01 3 -
1 2018 - 10- 01 1 -
1 2017 - 02- 01 - 1
1 2014 - 10- 01 - 0
1 2014 - 11- 01 - 5
1 2018 - 02- 01 - 0
So for the max date of VolY for every ID, I'd like to split the data frame into two: before and after that date for each ID soas to sum VolX before and after VolY max date.
Seems like this needs to be some kind of nested for loop. I am able to extract max dates and total volume... just having a hard time selecting out ID-specific
Is this what you're after?
library(dplyr)
df %>%
replace(., . == "-", NA) %>%
mutate(Date = as.Date(gsub("\\s", "", Date))) %>%
mutate_at(vars(VolX, VolY), as.numeric) %>%
group_by(ID, Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))) %>%
mutate(
sum_Volx = sum(VolX[Date != max(Date)], na.rm = T),
sum_VolY = sum(VolY[Date != max(Date)], na.rm = T)
) %>% ungroup() %>% select(-Before_After)
Output:
# A tibble: 8 x 6
ID Date VolX VolY sum_Volx sum_VolY
<int> <date> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0
2 1 2018-03-01 6 NA 14 0
3 1 2018-08-01 3 NA 14 0
4 1 2018-10-01 1 NA 14 0
5 1 2017-02-01 NA 1 0 6
6 1 2014-10-01 NA 0 0 6
7 1 2014-11-01 NA 5 0 6
8 1 2018-02-01 NA 0 0 6
You could also make separate columns for before/after, like this:
df %>%
replace(., . == "-", NA) %>%
mutate_at(vars(VolX, VolY), as.numeric) %>%
group_by(ID) %>%
mutate(
Date = as.Date(gsub("\\s", "", Date)),
Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1])),
sum_Volx_Before = sum(VolX[Date != max(Date) & Before_After == 0], na.rm = T),
sum_VolY_Before = sum(VolY[Date != max(Date) & Before_After == 0], na.rm = T),
sum_Volx_After = sum(VolX[Date != max(Date) & Before_After == 1], na.rm = T),
sum_VolY_After = sum(VolY[Date != max(Date) & Before_After == 1], na.rm = T)
) %>% ungroup() %>% select(-Before_After)
Output:
# A tibble: 8 x 8
ID Date VolX VolY sum_Volx_Before sum_VolY_Before sum_Volx_After sum_VolY_After
<int> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2018-02-01 5 NA 14 0 0 6
2 1 2018-03-01 6 NA 14 0 0 6
3 1 2018-08-01 3 NA 14 0 0 6
4 1 2018-10-01 1 NA 14 0 0 6
5 1 2017-02-01 NA 1 14 0 0 6
6 1 2014-10-01 NA 0 14 0 0 6
7 1 2014-11-01 NA 5 14 0 0 6
8 1 2018-02-01 NA 0 14 0 0 6
On the other hand, you could just create 2 separate new data frames in your environment, named Before and After, that literally exclude the maximum date and summarise the information, like below:
df_list <- df %>%
replace(., . == "-", NA) %>%
mutate_at(vars(VolX, VolY), as.numeric) %>%
group_by(ID) %>%
mutate(
Date = as.Date(gsub("\\s", "", Date)),
Before_After = cumsum(c(0, lag(+(Date == max(Date)))[-1]))
) %>%
filter(!Date == max(Date)) %>%
group_by(ID, Before_After) %>%
summarise(
sum_VolX = sum(VolX, na.rm = T),
sum_VolY = sum(VolY, na.rm = T)
) %>%
split(., .$Before_After)
names(df_list) <- c("Before", "After")
list2env(df_list, envir = .GlobalEnv)
Let's go through one-by-one:
first we replace the - signs by NA (not strictly needed, just to avoid errors later on);
afterwards we transform VolX and VolY into numeric;
then we group by ID so that everything is applied to each group separately;
afterwards we transform the Date into a proper Date format;
then it is the crucial part: we calculate the flag Before_After column where first we flag with 1 if in the previous row the maximum date was observed; afterwards we calculate a cumulative sum of such column, so that everything before this event is 0 and everything after 1;
then we filter out the maximum Date;
we group again by ID and Before_After indicator;
we shrink the data frame with summarise so that it only contains the sum of the respective columns;
we turn the data frame into 2 different ones by splitting on Before_After column;
as the obtained result is a list of 2 data frames, we need to get them into global environment, so first we assign the names to each one and then we turn them into 'proper' data frames.
Output:
Before
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 0 14 0
After
# A tibble: 1 x 4
# Groups: ID [1]
ID Before_After sum_VolX sum_VolY
<int> <dbl> <dbl> <dbl>
1 1 1 0 6
Note that 0 corresponds to Before and 1 to After.

Dplyr solution using slice and group

Ciao, Here is my replicating example.
a=c(1,2,3,4,5,6)
a1=c(15,17,17,16,14,15)
a2=c(0,0,1,1,1,0)
b=c(1,0,NA,NA,0,NA)
c=c(2010,2010,2010,2010,2010,2010)
d=c(1,1,0,1,0,NA)
e=c(2012,2012,2012,2012,2012,2012)
f=c(1,0,0,0,0,NA)
g=c(2014,2014,2014,2014,2014,2014)
h=c(1,1,0,1,0,NA)
i=c(2010,2012,2014,2012,2014,2014)
mydata = data.frame(a,a1,a2,b,c,d,e,f,g,h,i)
names(mydata) = c("id","age","gender","drop1","year1","drop2","year2","drop3","year3","drop4","year4")
mydata2 <- reshape(mydata, direction = "long", varying = list(c("year1","year2","year3","year4"), c("drop1","drop2","drop3","drop4")),v.names = c("year", "drop"), idvar = "X", timevar = "Year", times = c(1:4))
x1 = mydata2 %>%
group_by(id) %>%
slice(which(drop==1)[1])
x2 = mydata2 %>%
group_by(id) %>%
slice(which(drop==0)[1])
I have data "mydata2" which is tall such that every ID has many rows.
I want to make new data set "x" such that every ID has one row that is based on if they drop or not.
The first of drop1 drop2 drop3 drop4 that equals to 1, I want to take the year of that and put that in a variable dropYEAR. If none of drop1 drop2 drop3 drop4 equals to 1 I want to put the last data point in year1 year2 year3 year4 in the variable dropYEAR.
Ultimately every ID should have 1 row and I want to create 2 new columns: didDROP equals to 1 if the ID ever dropped or 0 if the ID did not ever drop. dropYEAR equals to the year of drop if didDROP equals to 1 or equals to the last reported year1 year2 year3 year4 if the ID did not ever drop. I try to do this in dplyr but this gives part of what I want only because it gets rid of ID values that equals to 0.
This is desired output, thank you to #Wimpel
First mydata2 %>% arrange(id) to understand the dataset, then using dplyr first and lastwe can pull the first year where drop==1 and the last year in case of drop never get 1 where drop is not null. Usingcase_when to check didDROP as it has a nice magic in dealing with NAs.
library(dplyr)
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year[!is.na(drop)]),dropY)) %>%
slice(1)
#Update
mydata2 %>% group_by(id) %>%
mutate(dropY=first(year[!is.na(drop) & drop==1]),
dropYEAR=if_else(is.na(dropY), last(year),dropY),
didDROP=case_when(any(drop==1) ~ 1, #Return 1 if there is any drop=1 o.w it will return 0
TRUE ~ 0)) %>%
select(-dropY) %>% slice(1)
# A tibble: 6 x 9
# Groups: id [6]
id age gender Year year drop X dropYEAR didDROP
<dbl> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 15 0 1 2010 1 1 2010 1
2 2 17 0 1 2010 0 2 2012 1
3 3 17 1 1 2010 NA 3 2014 0
4 4 16 1 1 2010 NA 4 2012 1
5 5 14 1 1 2010 0 5 2014 0
6 6 15 0 1 2010 NA 6 2014 0
I hope this what you're looking for.
You can sort by id, drop and year, conditionally on dropping or not:
library(dplyr)
mydata2 %>%
mutate(drop=ifelse(is.na(drop),0,drop)) %>%
arrange(id,-drop,year*(2*drop-1)) %>%
group_by(id) %>%
slice(1) %>%
select(id,age,gender,didDROP=drop,dropYEAR=year)
# A tibble: 6 x 5
# Groups: id [6]
id age gender didDROP dropYEAR
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 15 0 1 2010
2 2 17 0 1 2012
3 3 17 1 0 2014
4 4 16 1 1 2012
5 5 14 1 0 2014
6 6 15 0 0 2014

R - countifs like excel

I have a CSV like this, saved as an object in R named df1.
X Y Z Year
0 2 4 2014
3 1 3 2014
5 4 0 2014
0 3 0 2014
2 1 0 2015
I want to:
Count each column where there are no "0" for year 2014. For example, for column X, the count = 2 (not 3 because I want 2014 data only). For column Y the count is 4. For column Z the count is 1.
Sum all the counts for each column
This is what I tried:
count_total <- sum(df1$x != 0 &
df1$y != 0 &
df1&z != 0 &
df1$Year == 2014)
count_total
I want the output to be simply be 1 (i.e. the 2nd row in df has no 0's)
However, this does not align with my countifs on excel. In excel, it's like this:
=COUNTIFS('df1'!$A$2:$A$859,"<>0",'df1'!$B$2:$B$859,"<>0",
'df1'!$C$2:$C$859,"<>0",'df1'!$D$2:$D$859,2014)
Wondering if I mistyped something on R? I'm a dyplr user but can't find anything particularly useful on google.
Thank you very much!
One way is using rowSums on subset of data
sum(rowSums(subset(df1, Year == 2014) == 0) == 0)
#[1] 1
You can do this with aggregate then colSums to get the totals by column.
agg <- aggregate(. ~ Year, df1, function(x) sum(x != 0))
agg
# Year X Y Z
#1 2014 2 4 2
#2 2015 1 1 0
colSums(agg[-1])
#X Y Z
#3 5 2
Data.
df1 <- read.table(text = "
X Y Z Year
0 2 4 2014
3 1 3 2014
5 4 0 2014
0 3 0 2014
2 1 0 2015
",header = TRUE)
dplyrapproach:
library(dplyr)
df1 %>%
group_by(Year) %>%
summarise_at(vars(X:Z), function (x) sum(x != 0))
Output:
# A tibble: 2 x 4
# Year X Y Z
# <int> <int> <int> <int>
# 1 2014 2 4 2
# 2 2015 1 1 0
Alternative using summaryBy.
library(doBy)
summaryBy(list(c('X','Y','Z'), c('Year')), df1, FUN= function(x) sum(x!=0), keep.names=T)
Year X Y Z
1 2014 2 4 2
2 2015 1 1 0
When needed use colSums as explained before.

Speeding up rowwise comparison

I have a data.frame like the following:
id year x y v1
1 2006 12 1 0.8510703
1 2007 12 1 0.5954527
1 2008 12 2 -1.9312854
1 2009 12 1 0.1558393
1 2010 8 1 0.9051487
2 2001 12 2 -0.5480566
2 2002 12 2 -0.7607420
2 2003 3 2 -0.8094283
2 2004 3 2 -0.1732794
I would like to sum up (grouped by id) v1 of consecutive years (so 2010 and 2009, 2009 and 2008 and so on) only if x and y match. Expected output:
id year res
1 2010 NA
1 2009 NA
1 2008 NA
1 2007 1.4465230
2 2004 -0.9827077
2 2003 NA
2 2002 -1.3087987
The oldest year per id is removed, as there is no preceding year.
I have a slow lapply solution in place but would like to speed things up, as my data is rather large.
Data:
set.seed(1)
dat <- data.frame(id = c(rep(1,5),rep(2,4)),year = c(2006:2010,2001:2004),
x = c(12,12,12,12,8,12,12,3,3), y = c(1,1,2,1,1,2,2,2,2),
v1 = rnorm(9))
Current Solution:
require(dplyr)
myfun <- function(dat) { do.call(rbind,lapply(rev(unique(dat$year)[-1]),
function(z) inner_join(dat[dat$year==z,2:5],
dat[dat$year==z-1,2:5],
by=c("x","y")) %>%
summarise(year = z, res = ifelse(nrow(.) < 1,NA,sum(v1.x,v1.y)))))
}
dat %>% group_by(id) %>% do(myfun(.))
Here is a data.table solution, I think.
datNew <- setDT(dat)[, .(year=year, res=(v1+shift(v1)) * NA^(x != shift(x) | y != shift(y))),
by=id][-1, .SD, by=id][]
id year res
1: 1 2007 -0.4428105
2: 1 2008 NA
3: 1 2009 NA
4: 1 2010 NA
5: 2 2001 NA
6: 2 2002 -0.3330393
7: 2 2003 NA
8: 2 2004 1.3141061
Here, the j statement contains a list with two elements, the year and a function. This function sums values with the lagged value, using shift, but is multiplied by NA or 1 depending on whether the x and y match with their lagged values. This calculation is performed by id. The output is fed to a second chain, which drops the first observation of each id which is all NA.
You can adjust the order efficiently using setorder if desired.
setorder(datNew, id, -year)
datNew
id year res
1: 1 2010 NA
2: 1 2009 NA
3: 1 2008 NA
4: 1 2007 -0.4428105
5: 2 2004 1.3141061
6: 2 2003 NA
7: 2 2002 -0.3330393
8: 2 2001 NA
Assuming there are sorted years as in the example:
dat %>%
group_by(id) %>%
mutate(res = v1 + lag(v1), #simple lag for difference
res = ifelse(x == lag(x) & y == lag(y), v1, NA)) %>% #NA if x and y don't match
slice(-1) #drop the first year
You can use %>% select(id, year, res), and %>% arrange(id, desc(year)) at the end if you want.

Resources