Merging rows in R while excluding certain data

Let's say I have a data frame with many subjects and many test variables:
Name Date1 Date2 `Test1` `Test2` `Test3`
<dbl> <dttm> <dttm> <chr> <chr> <chr>
1 Steve 2012-02-27 2011-11-18 <NA> <NA> 3
2 Steve 2012-02-27 2012-01-22 4 <NA> <NA>
3 Steve 2012-02-27 2014-08-09 <NA> 8 <NA>
4 Mike 2012-02-09 2007-03-29 1 2 3
5 Mike 2012-02-09 2009-07-13 <NA> 5 6
6 Mike 2012-02-09 2014-03-11 <NA> <NA> 9
7 John 2012-03-20 2013-10-22 1 2 <NA>
8 John 2012-03-20 2014-03-17 4 5 <NA>
9 John 2012-03-20 2015-06-01 <NA> 8 9
I would like to know (most likely with dplyr) how to exclude rows whose Date2 is later than Date1, then combine the remaining rows into a single row per Name, keeping the most recent non-missing result for each test (so an earlier value is dropped when a more recent one exists). The new data frame should omit the Date2 column while still keeping the "NA"s in the data.
Also, if none of a subject's Date2 values are before Date1, I would like to keep the Name but fill the row with "NA"s (as in the case of "John").
So the results should look like this:
Name Date1 `Test1` `Test2` `Test3`
<dbl> <dttm> <chr> <chr> <chr>
1 Steve 2012-02-27 4 <NA> 3
2 Mike 2012-02-09 1 5 6
3 John 2012-03-20 <NA> <NA> <NA>
Any help on this would be greatly appreciated, thank you.

This will do it with dplyr...
library(dplyr)
df2 <- df %>%
  filter(as.Date(Date2) <= as.Date(Date1)) %>% # remove rows where Date2 is past Date1
  arrange(as.Date(Date2)) %>% # make sure rows are ordered by Date2
  group_by(Name, Date1) %>% # group by Name and Date1
  summarise_all(function(x) last(x[!is.na(x)])) %>% # summarise the remaining columns (i.e. the test columns) by the last non-NA value
  right_join(df %>% distinct(Name, Date1), by = c("Name", "Date1")) %>% # join Name and Date1 from the original df (to restore NA rows such as John)
  select(-Date2) # remove Date2
df2
Name Date1 Test1 Test2 Test3
1 Steve 2012-02-27 4 <NA> 3
2 Mike 2012-02-09 1 5 6
3 John 2012-03-20 <NA> <NA> <NA>
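For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample data and applies the same idea. It assumes a dplyr version that provides across(); the helper last_non_na() is a hypothetical name introduced only for this illustration, and the column types (Date for the dates, character for the tests) are an assumption about the original data.

```r
library(dplyr)

# Sample data from the question (types are assumed, see above).
df <- data.frame(
  Name  = rep(c("Steve", "Mike", "John"), each = 3),
  Date1 = as.Date(rep(c("2012-02-27", "2012-02-09", "2012-03-20"), each = 3)),
  Date2 = as.Date(c("2011-11-18", "2012-01-22", "2014-08-09",
                    "2007-03-29", "2009-07-13", "2014-03-11",
                    "2013-10-22", "2014-03-17", "2015-06-01")),
  Test1 = c(NA, "4", NA, "1", NA, NA, "1", "4", NA),
  Test2 = c(NA, NA, "8", "2", "5", NA, "2", "5", "8"),
  Test3 = c("3", NA, NA, "3", "6", "9", NA, NA, "9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last non-missing value, or an NA of the same type
# when every value is missing.
last_non_na <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) > 0) x[length(x)] else x[NA_integer_]
}

df2 <- df %>%
  filter(Date2 <= Date1) %>%                 # drop rows where Date2 is past Date1
  arrange(Date2) %>%                         # oldest first, so "last" means most recent
  group_by(Name, Date1) %>%
  summarise(across(starts_with("Test"), last_non_na), .groups = "drop") %>%
  right_join(distinct(df, Name, Date1),      # restore subjects with no rows left (John)
             by = c("Name", "Date1"))

df2
```

Because only the Test columns are summarised here, Date2 never reaches the result and no select(-Date2) is needed. The row order of the result can differ between dplyr versions, so sort explicitly if the order matters.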

Related

Pivot_wider: Combine Duplicate Observations AND Create New Variable Columns for Those Values [duplicate]

I'm new to R and have scoured the site to find a solution - I've found lots of similar, but slightly different questions. I'm stumped.
I have a dataset in this structure:
SURVEY_ID CHILD_NAME CHILD_AGE
Survey1 Billy 4
Survey2 Claude 12
Survey2 Maude 6
Survey2 Constance 3
Survey3 George 22
Survey4 Marjoram 14
Survey4 LeBron 37
I'm trying to pivot the data wider so that there's a) only one unique SURVEY_ID per row, and, critically, b) a new column for second, third, etc. children for surveys with more than one child.
So the result would look like:
SURVEY_ID CHILD_NAME1 CHILD_NAME2 CHILD_NAME3 CHILD_AGE1 CHILD_AGE2 CHILD_AGE3
Survey1 Billy 4
Survey2 Claude Maude Constance 12 6 3
Survey3 George 22
Survey4 Marjoram LeBron 14 37
The actual data has thousands of surveys and the number of "child names" and "child ages" could be as high as 10. It's the issue of creating the new columns not from existing value names and only where there are multiple children that has me perplexed.
Using base R:
reshape(transform(df, time = ave(SURVEY_ID, SURVEY_ID, FUN=seq)),
v.names = c('CHILD_NAME', 'CHILD_AGE'),
direction = 'wide', idvar = 'SURVEY_ID', sep = '_')
SURVEY_ID CHILD_NAME_1 CHILD_AGE_1 CHILD_NAME_2 CHILD_AGE_2 CHILD_NAME_3 CHILD_AGE_3
1 Survey1 Billy 4 <NA> NA <NA> NA
2 Survey2 Claude 12 Maude 6 Constance 3
5 Survey3 George 22 <NA> NA <NA> NA
6 Survey4 Marjoram 14 LeBron 37 <NA> NA
using tidyverse:
library(tidyverse)
df %>%
  group_by(SURVEY_ID) %>%
  mutate(name = row_number()) %>% # number the children within each survey
  pivot_wider(id_cols = SURVEY_ID, names_from = name,
              values_from = c(CHILD_NAME, CHILD_AGE))
# A tibble: 4 x 7
# Groups: SURVEY_ID [4]
SURVEY_ID CHILD_NAME_1 CHILD_NAME_2 CHILD_NAME_3 CHILD_AGE_1 CHILD_AGE_2 CHILD_AGE_3
<chr> <chr> <chr> <chr> <int> <int> <int>
1 Survey1 Billy NA NA 4 NA NA
2 Survey2 Claude Maude Constance 12 6 3
3 Survey3 George NA NA 22 NA NA
4 Survey4 Marjoram LeBron NA 14 37 NA
using data.table
library(data.table)
dcast(setDT(df), SURVEY_ID~rowid(SURVEY_ID), value.var = c('CHILD_AGE', 'CHILD_NAME'))
SURVEY_ID CHILD_AGE_1 CHILD_AGE_2 CHILD_AGE_3 CHILD_NAME_1 CHILD_NAME_2 CHILD_NAME_3
1: Survey1 4 NA NA Billy <NA> <NA>
2: Survey2 12 6 3 Claude Maude Constance
3: Survey3 22 NA NA George <NA> <NA>
4: Survey4 14 37 NA Marjoram LeBron <NA>
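As a self-contained sketch of the tidyverse approach, the example below rebuilds the question's data and names every pivot_wider() argument explicitly. The counter column child is a hypothetical name chosen for this illustration; the integer type of CHILD_AGE is an assumption.

```r
library(dplyr)
library(tidyr)

# Data from the question.
df <- data.frame(
  SURVEY_ID  = c("Survey1", "Survey2", "Survey2", "Survey2",
                 "Survey3", "Survey4", "Survey4"),
  CHILD_NAME = c("Billy", "Claude", "Maude", "Constance",
                 "George", "Marjoram", "LeBron"),
  CHILD_AGE  = c(4L, 12L, 6L, 3L, 22L, 14L, 37L),
  stringsAsFactors = FALSE
)

wide <- df %>%
  group_by(SURVEY_ID) %>%
  mutate(child = row_number()) %>%   # 1, 2, 3, ... within each survey
  ungroup() %>%
  pivot_wider(id_cols = SURVEY_ID,
              names_from = child,    # becomes the _1, _2, _3 suffix
              values_from = c(CHILD_NAME, CHILD_AGE))

wide
```

The key step is creating the within-group counter: pivot_wider() needs some column to supply the new column names, and the data has none until row_number() adds one.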

Selecting distinct entries based on specific variables in R

I want to select distinct entries for my dataset based on two specific variables. I may, in fact, like to create a subset and do analysis using each subset.
The data set looks like this
id <- c(3,3,6,6,4,4,3,3)
date <- c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7", "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3")
date_cat <- c(1,1,1,1,2,2,3,3)
measurement <- c(10, 13, 14,13, 12, 11, 14, 17)
myData <- data.frame(id, date, date_cat, measurement)
myData
myData$date1 <- as.Date(myData$date)
myData
id date date_cat measurement date1
1 3 2017-1-1 1 10 2017-01-01
2 3 2017-3-3 1 13 2017-03-03
3 6 2017-4-3 1 14 2017-04-03
4 6 2017-4-7 1 13 2017-04-07
5 4 2017-10-1 2 12 2017-10-01
6 4 2017-11-1 2 11 2017-11-01
7 3 2018-3-1 3 14 2018-03-01
8 3 2018-4-3 3 17 2018-04-03
#select the last date for the ID in each date category.
Here date_cat is the date category and date1 is date formatted as date. How can I get the last date for each ID in each date_category?
I want my data to show up as
id date date_cat measurement date1
1 3 2017-3-3 1 13 2017-03-03
2 6 2017-4-7 1 13 2017-04-07
3 4 2017-11-1 2 11 2017-11-01
4 3 2018-4-3 3 17 2018-04-03
Thanks!
I am not sure if you want something like below
subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
which gives
> subset(myData,ave(date1,id,date_cat,FUN = function(x) tail(sort(x),1))==date1)
id date date_cat measurement date1
2 3 2017-3-3 1 13 2017-03-03
4 6 2017-4-7 1 13 2017-04-07
6 4 2017-11-1 2 11 2017-11-01
8 3 2018-4-3 3 17 2018-04-03
Using data.table:
library(data.table)
myData_DT <- as.data.table(myData)
myData_DT[, .SD[.N] , by = .(date_cat, id)]
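We could create a group with rleid on the 'id' column, slice the last row of each group, and remove the temporary grouping column. Note that this relies on the rows already being sorted so that each run of consecutive 'id' values corresponds to one id/date_cat combination, as in the example data.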
library(dplyr)
library(data.table)
myData %>%
group_by(grp = rleid(id)) %>%
slice(n()) %>%
ungroup %>%
select(-grp)
# A tibble: 4 x 5
# id date date_cat measurement date1
# <dbl> <chr> <dbl> <dbl> <date>
#1 3 2017-3-3 1 13 2017-03-03
#2 6 2017-4-7 1 13 2017-04-07
#3 4 2017-11-1 2 11 2017-11-01
#4 3 2018-4-3 3 17 2018-04-03
Or this can be done on the fly without creating a temporary column
myData %>%
filter(!duplicated(rleid(id), fromLast = TRUE))
Or using base R with subset and rle
subset(myData, !duplicated(with(rle(id),
rep(seq_along(values), lengths)), fromLast = TRUE))
# id date date_cat measurement date1
#2 3 2017-3-3 1 13 2017-03-03
#4 6 2017-4-7 1 13 2017-04-07
#6 4 2017-11-1 2 11 2017-11-01
#8 3 2018-4-3 3 17 2018-04-03
Using dplyr:
myData %>%
  group_by(id, date_cat) %>%
  top_n(1, date1) # rank on the Date column; the character column 'date' sorts as text
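To illustrate why the ranking should use date1 rather than the character column, here is a small sketch. It assumes a dplyr version that provides slice_max(), and last_per_group is a name introduced only for this example.

```r
library(dplyr)

# Data from the question.
myData <- data.frame(
  id = c(3, 3, 6, 6, 4, 4, 3, 3),
  date = c("2017-1-1", "2017-3-3", "2017-4-3", "2017-4-7",
           "2017-10-1", "2017-11-1", "2018-3-1", "2018-4-3"),
  date_cat = c(1, 1, 1, 1, 2, 2, 3, 3),
  measurement = c(10, 13, 14, 13, 12, 11, 14, 17),
  stringsAsFactors = FALSE
)
myData$date1 <- as.Date(myData$date)

# Unpadded date strings do not sort chronologically: compared as text,
# October ("2017-10-1") comes before March ("2017-3-3").
"2017-10-1" < "2017-3-3"

# Latest row per id within each date category, ranked on the Date column.
last_per_group <- myData %>%
  group_by(id, date_cat) %>%
  slice_max(date1, n = 1, with_ties = FALSE) %>%
  ungroup()

last_per_group
```

Grouping explicitly by id and date_cat does not depend on the row order of the input, unlike the run-length approaches above.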

Group records with time interval overlap

I have a data frame (N = 16) containing ID (character), w_from (date), and w_to (date). Each record represents a task.
Here’s the data in R.
ID <- c(1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2)
w_from <- c("2010-01-01","2010-01-05","2010-01-29","2010-01-29",
"2010-03-01","2010-03-15","2010-07-15","2010-09-10",
"2010-11-01","2010-11-30","2010-12-15","2010-12-31",
"2011-02-01","2012-04-01","2011-07-01","2011-07-01")
w_to <- c("2010-01-31","2010-01-15", "2010-02-13","2010-02-28",
"2010-03-16","2010-03-16","2010-08-14","2010-10-10",
"2010-12-01","2010-12-30","2010-12-20","2011-02-19",
"2011-03-23","2012-06-30","2011-07-31","2011-07-06")
df <- data.frame(ID, w_from, w_to)
df$w_from <- as.Date(df$w_from)
df$w_to <- as.Date(df$w_to)
I need to generate a group number, within each ID, for records whose time intervals overlap. In general terms, if record#1 overlaps with record#2, and record#2 overlaps with record#3, then record#1, record#2, and record#3 all belong to the same group.
Also, if record#1 overlaps with record#2 and record#3, but record#2 doesn't overlap with record#3, then record#1, record#2, and record#3 still all belong to the same group.
In the example above and for ID=1, the first four records overlap.
Here is the final output:
Also, if this can be done using dplyr, that would be great!
Try this:
library(dplyr)
df %>%
  group_by(ID) %>%
  arrange(w_from, .by_group = TRUE) %>% # sort by start date within each ID
  mutate(group = 1 + cumsum(
    cummax(lag(as.numeric(w_to), default = first(as.numeric(w_to)))) < as.numeric(w_from)))
# A tibble: 16 x 4
# Groups: ID [2]
ID w_from w_to group
<dbl> <date> <date> <dbl>
1 1 2010-01-01 2010-01-31 1
2 1 2010-01-05 2010-01-15 1
3 1 2010-01-29 2010-02-13 1
4 1 2010-01-29 2010-02-28 1
5 1 2010-03-01 2010-03-16 2
6 1 2010-03-15 2010-03-16 2
7 1 2010-07-15 2010-08-14 3
8 1 2010-09-10 2010-10-10 4
9 1 2010-11-01 2010-12-01 5
10 1 2010-11-30 2010-12-30 5
11 1 2010-12-15 2010-12-20 5
12 1 2010-12-31 2011-02-19 6
13 1 2011-02-01 2011-03-23 6
14 2 2011-07-01 2011-07-31 1
15 2 2011-07-01 2011-07-06 1
16 2 2012-04-01 2012-06-30 2
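The logic of that one-liner can be unpacked with a base R sketch. overlap_group() is a hypothetical helper written for this illustration; it assumes a non-empty input for a single ID that is already sorted by start date.

```r
# Hypothetical helper: group numbers for intervals sorted by start date.
# A new group starts whenever an interval begins after the latest end
# date seen among all earlier intervals.
overlap_group <- function(from, to) {
  to_num <- as.numeric(to)
  # End date of the previous row; the first row is compared with itself.
  prev_end <- c(to_num[1], to_num[-length(to_num)])
  # Running maximum of the earlier end dates.
  latest_end_so_far <- cummax(prev_end)
  # TRUE marks the start of a new group; cumsum turns the marks into ids.
  1 + cumsum(latest_end_so_far < as.numeric(from))
}

# First five records of ID 1 from the question.
from <- as.Date(c("2010-01-01", "2010-01-05", "2010-01-29",
                  "2010-01-29", "2010-03-01"))
to   <- as.Date(c("2010-01-31", "2010-01-15", "2010-02-13",
                  "2010-02-28", "2010-03-16"))

overlap_group(from, to)
```

The running maximum is what handles chains: the second record ends on 2010-01-15, but the third is still compared against 2010-01-31 from the first record, so it stays in the same group. Because the comparison is strict, an interval that starts on the very day the previous one ends is treated as overlapping.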

How to create a column based on two conditions from other data frame?

I'm trying to create a column that identifies if the row meets two conditions. For example, I have a table similar to this:
> dat <- data.frame(Date = c(rep(c("2019-01-01", "2019-02-01","2019-03-01", "2019-04-01"), 4)),
+ Rep = c(rep("Mike", 4), rep("Tasha", 4), rep("Dane", 4), rep("Trish", 4)),
+ Manager = c(rep("Amber", 2), rep("Michelle", 2), rep("Debbie", 4), rep("Brian", 4), rep("Tim", 3), "Trevor"),
+ Sales = floor(runif(16, min = 0, max = 10)))
> dat
Date Rep Manager Sales
1 2019-01-01 Mike Amber 6
2 2019-02-01 Mike Amber 3
3 2019-03-01 Mike Michelle 9
4 2019-04-01 Mike Michelle 2
5 2019-01-01 Tasha Debbie 9
6 2019-02-01 Tasha Debbie 6
7 2019-03-01 Tasha Debbie 0
8 2019-04-01 Tasha Debbie 4
9 2019-01-01 Dane Brian 3
10 2019-02-01 Dane Brian 6
11 2019-03-01 Dane Brian 6
12 2019-04-01 Dane Brian 1
13 2019-01-01 Trish Tim 6
14 2019-02-01 Trish Tim 7
15 2019-03-01 Trish Tim 6
16 2019-04-01 Trish Trevor 1
Out of the Reps that have switched manager, I would like to identify whether each manager is the first or the second manager with respect to the date. The ideal output would look something like:
Date Rep Manager Sales New_Column
1 2019-01-01 Mike Amber 6 1
2 2019-02-01 Mike Amber 3 1
3 2019-03-01 Mike Michelle 9 2
4 2019-04-01 Mike Michelle 2 2
5 2019-01-01 Trish Tim 6 1
6 2019-02-01 Trish Tim 7 1
7 2019-03-01 Trish Tim 6 1
8 2019-04-01 Trish Trevor 1 2
I have tried a few things but they're not quite working out. I have created two separate data frames where one consists of the first instance of that Rep and associated manager (df1) and the other one consists of the last instance of that rep and associated manager (df2). The code that I have tried that has gotten the closest is:
dat$New_Column <- ifelse(dat$Rep %in% df1$Rep & dat$Manager %in% df1$Manager, 1,
ifelse(dat$Rep %in% df2$Rep & dat$Manager %in% df2$Manager, 2, NA))
However this reads as two separate conditions, rather than having a condition of a condition (i.e. If Mike exists in the first instance and Amber exists in the first instance assign 1 rather than If Mike exists with the manager Amber in the first instance assign 1). Any help would be really appreciated. Thank you!
One option is to group by 'Rep', keep only the groups where the number of unique 'Manager' values is 2, and then add a column by matching each 'Manager' against the unique 'Manager' values of that group, which gives the index of first appearance.
library(dplyr)
dat %>%
group_by(Rep) %>%
filter(n_distinct(Manager) == 2) %>%
mutate(New_Column = match(Manager, unique(Manager)))
# A tibble: 8 x 5
# Groups: Rep [2]
# Date Rep Manager Sales New_Column
# <chr> <chr> <chr> <int> <int>
#1 2019-01-01 Mike Amber 6 1
#2 2019-02-01 Mike Amber 3 1
#3 2019-03-01 Mike Michelle 9 2
#4 2019-04-01 Mike Michelle 2 2
#5 2019-01-01 Trish Tim 6 1
#6 2019-02-01 Trish Tim 7 1
#7 2019-03-01 Trish Tim 6 1
#8 2019-04-01 Trish Trevor 1 2
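The match/unique step can be seen in isolation with a tiny base R sketch (the vectors mike and back_again are made up for the illustration). unique() keeps values in order of first appearance, so match() returns 1 for the first manager seen, 2 for the second, and so on. This assumes the rows are already in date order within each Rep.

```r
# Managers for one Rep, in date order.
mike <- c("Amber", "Amber", "Michelle", "Michelle")

# Position of each manager among the distinct managers, by first appearance.
mike_index <- match(mike, unique(mike))
mike_index

# A Rep who returns to an earlier manager gets the earlier index again.
back_again <- c("Tim", "Trevor", "Tim")
back_index <- match(back_again, unique(back_again))
back_index
```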

Using the r dplyr library to generate aggregate numbers in a new column

I am trying to use dplyr to generate a new column in a data frame, based on the aggregation of values in existing columns. Given my dataframe:
group1 <- c("2019","2019","2019","2018","2018","2017","2017","2017")
group2 <- c("2019-01-01", "2019-01-01","2019-01-01","2018-05-01","2018-06-01","2017-01-01","2017-01-01","2017-02-01")
group3 <- c("A","A","B","A","A","C","C","B")
df <- data.frame("Year" = group1,"Date" = group2,"Sample" = group3)
Gives:
Year Date Sample
1 2019 2019-01-01 A
2 2019 2019-01-01 A
3 2019 2019-01-01 B
4 2018 2018-05-01 A
5 2018 2018-06-01 A
6 2017 2017-01-01 C
7 2017 2017-01-01 C
8 2017 2017-02-01 B
So I'd like to generate new column "Count", that for each row gives the total number of unique dates per sample. So for the above data, I would expect the results to be:
Year Date Sample Count
1 2019 2019-01-01 A 1
2 2019 2019-01-01 A 1
3 2019 2019-02-01 B 1
4 2018 2018-05-01 A 2
5 2018 2018-06-01 C 2
6 2017 2017-01-01 C 1
7 2017 2017-01-01 C 1
8 2017 2017-02-01 B 1
I've tried using the following code in r:
df %>%
group_by(Year) %>%
group_by(Sample) %>%
group_by(Date) %>%
mutate(Count = n_distinct(Date))
But I'm not getting the correct answer!
You could try:
library(dplyr)
df %>%
group_by(Year, Sample) %>%
mutate(Count = n_distinct(Date))
If you want to group by several variables, pass them together in a single group_by() call. By default, each new group_by() call replaces the previous grouping rather than adding to it.
Moreover, if you'd like to count unique dates, you shouldn't group by them.
The above code would give:
# A tibble: 8 x 4
# Groups: Year, Sample [5]
Year Date Sample Count
<fct> <fct> <fct> <int>
1 2019 2019-01-01 A 1
2 2019 2019-01-01 A 1
3 2019 2019-01-01 B 1
4 2018 2018-05-01 A 2
5 2018 2018-06-01 A 2
6 2017 2017-01-01 C 1
7 2017 2017-01-01 C 1
8 2017 2017-02-01 B 1
Note that there is a mismatch between your generated data frame and the one you show us. The data frame generated by your code is:
Year Date Sample
1 2019 2019-01-01 A
2 2019 2019-01-01 A
3 2019 2019-01-01 B
4 2018 2018-05-01 A
5 2018 2018-06-01 A
6 2017 2017-01-01 C
7 2017 2017-01-01 C
8 2017 2017-02-01 B
Where indeed the only Sample with 2 distinct Dates in a given Year is A (in 2018).
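The grouping behaviour can be checked directly with group_vars(). This is a small sketch using the data frame from the question; the names replaced, added, and counted are introduced only for the illustration, and the .add argument assumes dplyr 1.0 or later.

```r
library(dplyr)

# Data frame from the question.
df <- data.frame(
  Year = c("2019", "2019", "2019", "2018", "2018", "2017", "2017", "2017"),
  Date = c("2019-01-01", "2019-01-01", "2019-01-01", "2018-05-01",
           "2018-06-01", "2017-01-01", "2017-01-01", "2017-02-01"),
  Sample = c("A", "A", "B", "A", "A", "C", "C", "B")
)

# A second group_by() replaces the first grouping...
replaced <- df %>% group_by(Year) %>% group_by(Sample)
group_vars(replaced)

# ...unless .add = TRUE is given, which is equivalent to group_by(Year, Sample).
added <- df %>% group_by(Year) %>% group_by(Sample, .add = TRUE)
group_vars(added)

# Distinct dates per Year/Sample combination, attached to every row.
counted <- df %>%
  group_by(Year, Sample) %>%
  mutate(Count = n_distinct(Date)) %>%
  ungroup()

counted
```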

Resources