Select row with highest year followed by highest month

For my data frame I want to filter the row with the highest year followed by the highest month.
Sample data frame:
df <- data.frame(ID = 1:5,
                 year = c(2018, 2018, 2018, 2018, 2019),
                 month = c(9, 10, 11, 12, 11))
I tried the following but this does not return any record (which makes sense, because the max year and max month are in different rows). Does anybody have the answer? Obviously my desired output would be row 5.
df %>% filter(year == max(year) & month == max(month))

We can filter in two steps: first keep the highest year, then the highest month within that year. (A single combined filter fails because max(month) is taken over all rows; piping into a second filter recomputes it on the 2019 rows only.)
df %>%
  filter(year == max(year)) %>%
  filter(month == max(month))
  ID year month
1  5 2019    11

This sorts by year and then month, both descending, so the first row holds the highest month within the highest year.
df <- df %>%
  arrange(desc(year), desc(month))
ID year month
1 5 2019 11
2 4 2018 12
3 3 2018 11
4 2 2018 10
5 1 2018 9
df[1,] # first row
ID year month
1 5 2019 11
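If a recent dplyr (>= 1.0.0) is available, the two-column ordering can also be done in one step with slice_max(), whose order_by argument accepts a tibble of columns to break ties; a small sketch:

```r
library(dplyr)

df <- data.frame(ID = 1:5,
                 year = c(2018, 2018, 2018, 2018, 2019),
                 month = c(9, 10, 11, 12, 11))

# order by year, then month, and keep the single top row
df %>%
  slice_max(order_by = tibble(year, month), n = 1)
#>   ID year month
#> 1  5 2019    11
```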


R script for extracting last row by ID where the next row value for that ID is not consecutive

I'm preparing data for a Cox regression model and I have a dataset that shows all of the years that participants were registered as living in the province. There is a variable that identifies how many days they were registered as living in the province for each year. I want their start year to be the first year that they were fully registered (>=365 days) as living in the province. I also want the last year that they were fully registered as living in the province. However, there are some participants who left the province, then returned later for at least one full year. For this analysis, I want to consider participants' follow-up to end when they leave the first time, as we can't track health outcomes that may have occurred while they were outside the province.
Imagine I have already sorted the dataset by ID, then year. I then removed any observations where there were less than 365 days registered.
Here is a test dataset:
df <- data.frame(
  ID = c(1,1,1,1,1,2,2,2,2,2,2,3,3,3,3),
  values = c(1996,1998,1999,2000,2001,2001,2002,2003,2004,2007,2008,2004,2005,2006,2007)
)
df_inc <- df %>%
  group_by(ID) %>%
  filter(row_number(values) == 1)
This works as intended, returning the first fully registered year per participant.
df_lastoverall <- df %>%
  group_by(ID) %>%
  filter(row_number(values) == n())
This works, but returns the last fully registered year, regardless of whether their years were all consecutive, or they left the province then returned to have at least one full year. This gives a last year of 2001 for ID1, 2008 for ID2, and 2007 for ID3.
Here's where I'm at and could use some help... I'm looking for some way to identify the last full year of the consecutive run that begins at each participant's start year (just in case there are people who left and returned more than once). This should return a last year of 1996 for ID1, 2004 for ID2, and 2007 for ID3.
Something like this, perhaps?
df_last <- df %>%
  group_by(ID) %>%
  filter(row_number(values)[cumsum(c(1, diff(values) != 1))])
# OR
df_last <- df %>%
  group_by(ID) %>%
  filter(row_number(values) == max(values[cumsum(c(1, diff(values) != 1))]))
You can leverage data.table::rleid() as follows:
group_by(df, ID) %>%
  filter(data.table::rleid(c(1, diff(values))) == 1)
Output:
ID values
<dbl> <dbl>
1 1 1996
2 2 2001
3 2 2002
4 2 2003
5 2 2004
6 3 2004
7 3 2005
8 3 2006
9 3 2007
If you wanted only the last year of each group, you can add a second filter at the end:
group_by(df, ID) %>%
  filter(data.table::rleid(c(1, diff(values))) == 1) %>%
  filter(row_number() == n())
Output:
ID values
<dbl> <dbl>
1 1 1996
2 2 2004
3 3 2007
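To see why the rleid() filter keeps only the leading consecutive run, it helps to evaluate the expression by hand on one participant's values (a quick sketch using ID 1's years):

```r
library(data.table)

values <- c(1996, 1998, 1999, 2000, 2001)  # ID 1's fully registered years

steps <- c(1, diff(values))  # 1 2 1 1 1: any gap shows up as a value != 1
rleid(steps)                 # 1 2 3 3 3: the run id changes at every gap

# only positions whose run id is 1 belong to the initial consecutive
# streak, so the filter keeps just 1996 for this participant
values[rleid(steps) == 1]
#> [1] 1996
```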
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df_first <- df %>%
  group_by(ID) %>%
  filter(cumsum(c(1, diff(values)) - 1) == 0) %>%
  slice_min(values) %>%
  ungroup()
df_last <- df %>%
  group_by(ID) %>%
  filter(cumsum(c(1, diff(values)) - 1) == 0) %>%
  slice_max(values) %>%
  ungroup()
This returns
#> df_first
# A tibble: 3 × 2
ID values
<dbl> <dbl>
1 1 1996
2 2 2001
3 3 2004
and
#> df_last
# A tibble: 3 × 2
ID values
<dbl> <dbl>
1 1 1996
2 2 2004
3 3 2007

I want to select the # users_read in the last row for each book_id and calculate mean

I have a dataset table that's set like this
Year | Book_id | Event_num | Users_read
2018. 1. 1 14
2018. 1. 2 13
2018. 2. 3 15
2018. 1. 4 13
2018. 2. 5 12
I want to select the last row for each book_id in the year 2018 and take the mean in R. For the sample data above that means selecting 13 users_read for book_id = 1 and 12 for book_id = 2, giving mean = 25/2 = 12.5.
Here is a way of how you can do that using dplyr package.
# Loading Required libraries
library(dplyr)
# Creating sample data
example <- read.table(text ='
Year|Book_id|Event_num|Users_read
2018.|1.|1|14
2018.|1.|2|13
2018.|2.|3|15
2018.|1.|4|13
2018.|2.|5|12',header = TRUE, stringsAsFactors = FALSE, sep = "|")
example %>%
  # Filtering to get data for 2018 only
  filter(Year == "2018") %>%
  # Grouping by Book_id
  group_by(Book_id) %>%
  # Selecting the last occurrence for each Book_id
  slice(n()) %>%
  ungroup() %>%
  # Calculating the average
  summarize(avg_users_read = mean(Users_read))
For the sake of completeness, here is a solution using the data.table package:
library(data.table)
setDT(df1)[Year == 2018, last(Users_read), by = Book_id][, mean(V1)]
[1] 12.5
Data
df1 <- data.table::fread("
Year Book_id Event_num Users_read
2018. 1. 1 14
2018. 1. 2 13
2018. 2. 3 15
2018. 1. 4 13
2018. 2. 5 12")
Just out of curiosity, I tried to find a sql solution as well:
sqldf::sqldf(
"select avg(Users_read)
from df1 A
inner join (select Year, Book_id, max(Event_num) Last_event from df1
where Year = 2018
group by Year, Book_id) B
where A.Year = B.Year and A.Book_id = B.Book_id and A.Event_num = B.Last_event"
)
avg(Users_read)
1 12.5
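The same computation also fits in base R with tapply(), assuming the rows are already ordered by Event_num within each book (a sketch alongside the answers above, with the sample data rebuilt as a plain data frame):

```r
df1 <- data.frame(
  Year = rep(2018, 5),
  Book_id = c(1, 1, 2, 1, 2),
  Event_num = 1:5,
  Users_read = c(14, 13, 15, 13, 12)
)

# take the last Users_read per Book_id (rows assumed sorted by Event_num),
# then average those per-book values
d2018 <- df1[df1$Year == 2018, ]
last_read <- tapply(d2018$Users_read, d2018$Book_id, function(x) tail(x, 1))
mean(last_read)
#> [1] 12.5
```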

How can I keep only those rows that together contain the longest consecutive run of a variable increasing by one, using dplyr in R?

I have a tibble where each row contains a subject identifier and a year. My goal is to isolate, for each subject, only those rows which together constitute the longest sequence of rows in which the variable year increases by 1 from one row to the next.
I've tried quite a few things with a grouped filter, such as building helper variables that code whether year on one row is one more or less than year on the previous row, and using the rle() function. But so far nothing has worked exactly as it should.
Here is a toy example of my data. Note that the number of rows varies between subjects and that there are typically (some) gaps between years. Also note that the data have been arranged so that the year value always increases from one row to the next within each subject.
# A tibble: 8 x 2
subject year
<dbl> <dbl>
1 1 2012
2 1 2013
3 1 2015
4 1 2016
5 1 2017
6 1 2019
7 2 2011
8 2 2013
The toy example tibble can be recreated by running this code:
dat = structure(list(subject = c(1, 1, 1, 1, 1, 1, 2, 2), year = c(2012,
2013, 2015, 2016, 2017, 2019, 2011, 2013)), row.names = c(NA,
-8L), class = c("tbl_df", "tbl", "data.frame"))
To clarify, for this tibble the desired output is:
# A tibble: 3 x 2
subject year
<dbl> <dbl>
1 1 2015
2 1 2016
3 1 2017
(Note that subject 2 is dropped because she has no sequence of years increasing by one.)
There must be an elegant way to do this using dplyr!
This doesn't take into account ties, but ...
dat %>%
  group_by(subject) %>%
  mutate(r = cumsum(c(TRUE, diff(year) != 1))) %>%
  group_by(subject, r) %>%
  mutate(rcount = n()) %>%
  group_by(subject) %>%
  filter(rcount > 1, rcount == max(rcount)) %>%
  select(-r, -rcount) %>%
  ungroup()
# # A tibble: 3 x 2
# subject year
# <dbl> <dbl>
# 1 1 2015
# 2 1 2016
# 3 1 2017
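For comparison, the same grouping idea works without dplyr; this base R sketch uses the identical cumsum(diff != 1) run id and keeps each subject's largest run of length > 1 (like the answer above, it does not handle ties):

```r
dat <- data.frame(subject = c(1, 1, 1, 1, 1, 1, 2, 2),
                  year = c(2012, 2013, 2015, 2016, 2017, 2019, 2011, 2013))

longest_run <- function(d) {
  g <- cumsum(c(TRUE, diff(d$year) != 1))  # new id whenever the gap isn't 1
  sizes <- table(g)
  if (max(sizes) < 2) return(d[0, ])       # no increasing run at all
  d[g == names(sizes)[which.max(sizes)], ]
}

do.call(rbind, lapply(split(dat, dat$subject), longest_run))
```

For this data the result is the three rows for subject 1 with years 2015, 2016, and 2017; subject 2 contributes nothing because her longest run has length 1.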

Create cohort dropout rate table from raw data

I need help creating a cohort dropout table from raw data.
I have a dataset that looks like this:
DT <- data.table(
  id = c(1,2,3,4,5,6,7,8,9,10,
         11,12,13,14,15,16,17,18,19,20,
         21,22,23,24,25,26,27,28,29,30,31,32,33,34,35),
  year = c(2014,2014,2014,2014,2014,2014,2014,2014,2014,2014,
           2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,
           2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016),
  cohort = c(1,1,1,1,1,1,1,1,1,1,
             2,2,2,1,1,2,1,2,1,2,
             1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
I want to calculate the dropout rate by cohort, and get a table like this:
cohortdt<-data.table(
cohort =c(1,2,3),
drop_rateY1 =c(.60,0.0,0.0),
droprate_Y2 =c (.50,.33,0.0))
For cohort 1, the dropout rate at the end of Y1 is 60% (i.e., 60 percent of the students originally enrolled dropped out by the end of year 1). The value in Y2 means that 50% of those who remained at the end of year 1 dropped out by the end of year 2.
How can I create a table like this from the raw data?
Here is one solution:
library(tidyverse)
DT %>%
  group_by(year) %>%
  count(cohort) %>%
  ungroup() %>%
  spread(year, n) %>%
  mutate(year_1_drop_rate = 1 - (`2015` / `2014`),
         year_2_drop_rate = 1 - (`2016` / `2015`)) %>%
  replace_na(list(year_1_drop_rate = 0.0,
                  year_2_drop_rate = 0.0)) %>%
  select(cohort, year_1_drop_rate, year_2_drop_rate)
Which returns:
# A tibble: 3 x 3
cohort year_1_drop_rate year_2_drop_rate
<dbl> <dbl> <dbl>
1 1 0.6 0.5000000
2 2 0.0 0.3333333
3 3 0.0 0.0000000
group by year
count cohort by year groups
ungroup
spread year in columns 2014, 2015, and 2016
mutate twice to get dropout rates for year 1 and year 2
replace_na to 0
select cohort, year_1_drop_rate, and year_2_drop_rate
This solution takes a tidy dataset and makes it untidy by spreading the year variable (i.e. 2014, 2015, and 2016 are separate columns).
I have a simple data.table solution:
x <- DT[,.N, by = .(cohort,year)]
count the number of students per year in each cohort and store the result in a new data.table x
x[,drop := (1-N/(c(NA,N[-.N])))*100,by = cohort]
here I take the ratio between each year's headcount and the previous year's (c(NA, N[-.N]) is the vector N shifted down by one), which gives the percentage of students lost each year
x[,.SD,by = cohort]
cohort year N drop
1: 1 2014 10 NA
2: 1 2015 4 60.00000
3: 1 2016 2 50.00000
4: 2 2015 6 NA
5: 2 2016 4 33.33333
6: 3 2016 9 NA
Hope it helps
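The data.table idea of comparing each year's headcount with the previous year's translates directly to dplyr via lag() (a sketch; the drop_rate column name is my own, and NA marks each cohort's first year just as in the data.table output):

```r
library(dplyr)

DT <- data.frame(
  id = 1:35,
  year = c(rep(2014, 10), rep(2015, 10), rep(2016, 15)),
  cohort = c(rep(1, 10),
             c(2,2,2,1,1,2,1,2,1,2),
             c(1,1,3,3,3,2,2,2,2,3,3,3,3,3,3))
)

DT %>%
  count(cohort, year) %>%                 # headcount per cohort per year
  group_by(cohort) %>%
  mutate(drop_rate = 1 - n / lag(n)) %>%  # compare with the previous year
  ungroup()
```

This reproduces the rates above: 0.6 and 0.5 for cohort 1, 0.33 for cohort 2's second year.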

Display in data.frame a conditional row count by group

I am struggling with creating a new variable in my data.frame. I apologize for the question title, which might not be very clear. I have a database that looks like this:
obs year type
1 2015 A
2 2015 A
3 2015 B
4 2014 A
5 2014 B
I want to add to the current data.frame a column (freq2015) that gives the number of rows by type in 2015, repeated on every row with the same type regardless of that row's year. Here is the output I am looking for:
obs year type freq2015
1 2015 A 2 (there are 2 obs. of type A in 2015)
2 2015 A 2 (there are 2 obs. of type A in 2015)
3 2015 B 1 (there is 1 obs. of type B in 2015)
4 2014 A 2 (there are 2 obs. of type A in 2015)
5 2014 B 1 (there is 1 obs. of type B in 2015)
I know how to add to my data.frame the number of rows by type by year using dplyr:
data <- data %>%
  group_by(year, type) %>%
  mutate(freq = n())
But then, for year == "2014" the added column will display the count of 2014 rows by type instead of that of 2015.
I know how to isolate into a new data.frame the number of rows by type for 2015:
data2015 <- data[data$year == 2015, ] %>%
  group_by(type) %>%
  mutate(freq2015 = n())
But I don't know how to add a column (with the count of rows by type for 2015) for the entire data.frame conditional on the type being the same (as shown in the example). I am looking for a solution that does not explicitly use the values of the "type" variable. That is, I don't want to tell R: do this if type == A, do that otherwise. The reason for this restriction is that I have far too many types.
Any ideas? Thank you in advance.
If you group_by using only type, you can sum the rows when year == 2015.
data %>%
  group_by(type) %>%
  mutate(freq2015 = sum(year == 2015))
Source: local data frame [5 x 4]
Groups: type [2]
obs year type freq2015
<int> <int> <fctr> <int>
1 1 2015 A 2
2 2 2015 A 2
3 3 2015 B 1
4 4 2014 A 2
5 5 2014 B 1
Using data.table we could do:
setDT(df)
setkey(df, type)
df[df[year == 2015, .(freq2015 = .N), by = type]]
Result:
obs year type freq2015
1: 1 2015 A 2
2: 2 2015 A 2
3: 4 2014 A 2
4: 3 2015 B 1
5: 5 2014 B 1
You could use a left_join(), as follows:
temp <- data %>%
  filter(year == 2015) %>%
  group_by(type) %>%
  summarize(freq2015 = n())
data <- data %>% left_join(temp, "type")
We can do this with base R using ave (without any external packages) and it is reasonably fast as well.
df1$freq2015 <- with(df1, ave(year == 2015, type, FUN = sum))
df1$freq2015
#[1] 2 2 1 2 1
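The ave() call works because year == 2015 is a logical vector that sum() treats as 0/1 within each type group, and ave() then recycles each group's total back across the group's rows; a minimal walk-through on the sample data:

```r
df1 <- data.frame(obs = 1:5,
                  year = c(2015, 2015, 2015, 2014, 2014),
                  type = c("A", "A", "B", "A", "B"))

# year == 2015 -> TRUE TRUE TRUE FALSE FALSE; ave() sums the TRUEs
# within each level of type and repeats that sum on every group row
ave(df1$year == 2015, df1$type, FUN = sum)
#> [1] 2 2 1 2 1
```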
