Dealing with ties using rank (R)

I'm trying to create a dummy variable for whether a child is first born, and another for whether the child is second born. My data looks something like this:
ID MID CMOB CYRB
1 1 1 1991
2 1 7 1989
3 2 1 1985
4 2 11 1985
5 2 9 1994
6 3 4 1992
7 4 2 1992
8 4 10 1983
With ID = child ID, MID = mother ID, CMOB = month of birth and CYRB = year of birth.
For the first born dummy I tried using this:
Identifiers_age <- Identifiers_age %>% group_by(MID) %>%
mutate(first = as.numeric(rank(CYRB) == 1))
But there doesn't seem to be a way of breaking ties by the rank of another column (in this case CMOB). Whenever I try using the "ties.method" argument, it tells me the input must be a character vector.
Am I missing something here?

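order might be more convenient here. From ?order:
order returns a permutation which rearranges its first argument into
ascending or descending order, breaking ties by further arguments.
Note that order() returns positions rather than ranks, so it has to be applied twice to get each child's rank within the family:
Identifiers_age <- Identifiers_age %>% group_by(MID) %>%
# order(order(...)) turns the sort permutation into ranks; CYRB ties are broken by CMOB
mutate(first = as.numeric(order(order(CYRB, CMOB)) == 1))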
Identifiers_age
#Source: local data frame [8 x 5]
#Groups: MID [4]
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <dbl>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
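
The question also asks for a second-born dummy. The following is a minimal sketch, assuming the sample data from the question; the column name `birth_rank` and the result name `birth_order` are made up for illustration. It relies on `order(order(...))` giving each row's rank within its group, with ties in CYRB broken by CMOB.

```r
library(dplyr)

# Sample data copied from the question
Identifiers_age <- data.frame(
  ID   = 1:8,
  MID  = c(1, 1, 2, 2, 2, 3, 4, 4),
  CMOB = c(1, 7, 1, 11, 9, 4, 2, 10),
  CYRB = c(1991, 1989, 1985, 1985, 1994, 1992, 1992, 1983)
)

birth_order <- Identifiers_age %>%
  group_by(MID) %>%
  mutate(
    # rank within mother: earliest year first, month breaks ties
    birth_rank = order(order(CYRB, CMOB)),
    first  = as.integer(birth_rank == 1),
    second = as.integer(birth_rank == 2)
  ) %>%
  ungroup()

birth_order
```

Keeping the intermediate rank column makes it easy to add further dummies (third born, and so on) without repeating the ordering logic.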

If we still want to use rank, we can convert 'CYRB' and 'CMOB' into a 'Date', apply rank to it, and then get the binary output from the logical vector:
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(as.Date(paste(CYRB, CMOB, 1,
sep="-"), "%Y-%m-%d"))==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1
Or we can use arithmetic to do this with rank
Identifiers_age %>%
group_by(MID) %>%
mutate(first = as.integer(rank(CYRB + CMOB/12)==1))
# ID MID CMOB CYRB first
# <int> <int> <int> <int> <int>
#1 1 1 1 1991 0
#2 2 1 7 1989 1
#3 3 2 1 1985 1
#4 4 2 11 1985 0
#5 5 2 9 1994 0
#6 6 3 4 1992 1
#7 7 4 2 1992 0
#8 8 4 10 1983 1


How to keep only first value from distinct values in one column based on repeated values in other column in R? [duplicate]

The code below should group the data by group and then create two new columns with the first and last value of each group.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, replace = TRUE))
d %>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
However, it does not work as it should. The expected result would be
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
Yet, I get this (it takes the first and the last value over the entire data frame, not just the groups):
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 5
2 1 2001 8 3 5
3 1 2002 4 3 5
4 2 2000 8 3 5
5 2 2001 9 3 5
6 2 2002 1 3 5
7 3 2000 5 3 5
8 3 2001 9 3 5
9 3 2002 5 3 5
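Calling dplyr::mutate() explicitly did the trick. This symptom typically appears when plyr is loaded after dplyr, so that plyr::mutate(), which ignores grouping, masks dplyr::mutate():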
d %>%
group_by(group) %>%
dplyr::mutate(
first = dplyr::first(value),
last = dplyr::last(value)
)
You can also use the summarise function from dplyr to get the first and last values for each group, and then join them back onto the original data:
d %>%
group_by(group) %>%
summarise(first_value = first(na.omit(value)),
last_value = last(na.omit(value))) %>%
left_join(d, ., by = 'group')
If you are from the future and dplyr has stopped supporting the first and last functions, or you want a future-proof solution, you can just index the column like you would a list:
d %>%
group_by(group) %>%
mutate(
first = value[[1]],
last = value[[length(value)]]
)
# A tibble: 9 × 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 8 8 1
5 2 2001 9 8 1
6 2 2002 1 8 1
7 3 2000 5 5 5
8 3 2001 9 5 5
9 3 2002 5 5 5
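
If attaching packages is what causes the masking problem, the same columns can also be built without dplyr at all. This is a rough base R sketch using `ave()`; the data frame below uses fixed values copied from the expected output instead of `sample()`, so it is an illustration and not the asker's exact data.

```r
# Fixed illustrative data (values taken from the expected output above)
d <- data.frame(
  group = rep(1:3, each = 3),
  year  = rep(2000:2002, times = 3),
  value = c(3, 8, 4, 8, 9, 1, 5, 9, 5)
)

# ave() applies FUN within each group and recycles the result to the group's rows
d$first <- ave(d$value, d$group, FUN = function(v) v[1])
d$last  <- ave(d$value, d$group, FUN = function(v) v[length(v)])

d
```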

Recode dates to study day within subject

I have data in which subjects completed multiple ratings per day over 6-7 days. The number of ratings per day varies. The data set includes subject ID, date, and the ratings. I would like to create a new variable that recodes the dates for each subject into "study day" --- so 1 for first day of ratings, 2 for second day of ratings, etc.
For example, I would like to take this:
id Date Rating
1 10/20/2018 2
1 10/20/2018 3
1 10/20/2018 5
1 10/21/2018 1
1 10/21/2018 7
1 10/21/2018 9
1 10/22/2018 4
1 10/22/2018 5
1 10/22/2018 9
2 11/15/2018 1
2 11/15/2018 3
2 11/15/2018 4
2 11/16/2018 3
2 11/16/2018 1
2 11/17/2018 0
2 11/17/2018 2
2 11/17/2018 9
And end up with this:
id Day Date Rating
1 1 10/20/2018 2
1 1 10/20/2018 3
1 1 10/20/2018 5
1 2 10/21/2018 1
1 2 10/21/2018 7
1 2 10/21/2018 9
1 3 10/22/2018 4
1 3 10/22/2018 5
1 3 10/22/2018 9
2 1 11/15/2018 1
2 1 11/15/2018 3
2 1 11/15/2018 4
2 2 11/16/2018 3
2 2 11/16/2018 1
2 3 11/17/2018 0
2 3 11/17/2018 2
2 3 11/17/2018 9
I was going to look into setting up some kind of loop, but I thought it would be worth asking if there is a more efficient way to pull this off. Are there any functions that would allow me to automate this sort of thing? Thanks very much for any suggestions.
Since you want to reset the count for every id, this question is a bit different.
Using only base R, we can split Date by id and then number the distinct dates within each group (this assumes the rows are already sorted by id, as in the example).
df$Day <- unlist(lapply(split(df$Date, df$id), function(x) match(x, unique(x))), use.names = FALSE)
df
# id Date Rating Day
#1 1 10/20/2018 2 1
#2 1 10/20/2018 3 1
#3 1 10/20/2018 5 1
#4 1 10/21/2018 1 2
#5 1 10/21/2018 7 2
#6 1 10/21/2018 9 2
#7 1 10/22/2018 4 3
#8 1 10/22/2018 5 3
#9 1 10/22/2018 9 3
#10 2 11/15/2018 1 1
#11 2 11/15/2018 3 1
#12 2 11/15/2018 4 1
#13 2 11/16/2018 3 2
#14 2 11/16/2018 1 2
#15 2 11/17/2018 0 3
#16 2 11/17/2018 2 3
#17 2 11/17/2018 9 3
I don't know how I missed this, but thanks to @thelatemail, who pointed out that this is basically the same as
library(dplyr)
df %>%
group_by(id) %>%
mutate(Day = match(Date, unique(Date)))
and
df$Day <- as.numeric(with(df, ave(Date, id, FUN = function(x) match(x, unique(x)))))
If you want a slightly hacky dplyr version, you can convert the date column to a numeric date and then manipulate that number to give the desired result. Note that this counts calendar days, so a day without ratings leaves a gap in the numbering.
library(tidyverse)
library(lubridate)
df <- tibble(id=c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2),
Date= c('10/20/2018', '10/20/2018', '10/20/2018', '10/21/2018', '10/21/2018', '10/21/2018',
'10/22/2018', '10/22/2018', '10/22/2018','11/15/2018', '11/15/2018', '11/15/2018',
'11/16/2018', '11/16/2018', '11/17/2018', '11/17/2018', '11/17/2018'),
Rating=c(2,3,5,1,7,9,4,5,9,1,3,4,3,1,0,2,9))
df %>%
group_by(id) %>%
mutate(
Date = mdy(Date),
Day = as.numeric(Date),
Day = Day-min(Day)+1)
# A tibble: 17 x 4
# Groups: id [2]
id Date Rating Day
<dbl> <date> <dbl> <dbl>
1 1 2018-10-20 2 1
2 1 2018-10-20 3 1
3 1 2018-10-20 5 1
4 1 2018-10-21 1 2
5 1 2018-10-21 7 2
6 1 2018-10-21 9 2
7 1 2018-10-22 4 3
8 1 2018-10-22 5 3
9 1 2018-10-22 9 3
10 2 2018-11-15 1 1
11 2 2018-11-15 3 1
12 2 2018-11-15 4 1
13 2 2018-11-16 3 2
14 2 2018-11-16 1 2
15 2 2018-11-17 0 3
16 2 2018-11-17 2 3
17 2 2018-11-17 9 3
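
The approaches above differ in edge cases: `match(x, unique(x))` numbers days in order of appearance, while subtracting the minimum date counts calendar days. A possible middle ground, sketched here on a small made-up data set with a skipped day and out-of-order rows, is to parse the dates and use `dplyr::dense_rank()`, which numbers the distinct dates chronologically without gaps. The name `study_days` is illustrative only.

```r
library(dplyr)

# Small made-up example: subject 1 skips a day, subject 2 has rows out of order
df <- data.frame(
  id     = c(1, 1, 1, 2, 2, 2),
  Date   = c("10/20/2018", "10/20/2018", "10/22/2018",
             "11/15/2018", "11/17/2018", "11/16/2018"),
  Rating = c(2, 3, 5, 1, 3, 4)
)

study_days <- df %>%
  group_by(id) %>%
  # parse month/day/year strings, then rank the distinct dates within each subject
  mutate(Day = dense_rank(as.Date(Date, format = "%m/%d/%Y"))) %>%
  ungroup()

study_days
```

Whether a skipped day should still advance the counter depends on the study design; if it should, the date-subtraction version is the one to use.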

multiple step gathering of columns in R

I have a data.frame like this:
df <- data.frame(Time = 1:100,
Rome_population = 1:100, Rome_gdp = 1:100, Rome_LifeLenght = 1:100,
London_population = 1:100, London_gdp = 1:100, London_LifeLenght = 1:100,
Berlin_population = 1:100, Berlin_gdp = 1:100, Berlin_LifeLenght = 1:100)
And I would like to have a data.frame like this:
df<-data.frame(Time,City,population,gdp,LifeLenght)
How can I make it? Possibly with tidyr?
Thanks!
Try:
library(tidyr)
df %>%
gather(key, value, Rome_population:Berlin_LifeLenght) %>%
separate(key, into = c("city", "stat"), sep = "_") %>%
spread(stat, value)
Output:
# A tibble: 300 x 5
Time city gdp LifeLenght population
<int> <chr> <int> <int> <int>
1 1 Berlin 1 1 1
2 1 London 1 1 1
3 1 Rome 1 1 1
4 2 Berlin 2 2 2
5 2 London 2 2 2
6 2 Rome 2 2 2
7 3 Berlin 3 3 3
8 3 London 3 3 3
9 3 Rome 3 3 3
10 4 Berlin 4 4 4
# ... with 290 more rows
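
With tidyr 1.0.0 or later, the same reshaping can be done in one step with `pivot_longer()`. The sketch below assumes that every column name has the form `City_statistic` with exactly one underscore; it uses a smaller made-up data frame with two cities and two statistics to keep the output short.

```r
library(tidyr)

# Reduced illustrative data: two cities, two statistics
df <- data.frame(
  Time              = 1:3,
  Rome_population   = 1:3,
  Rome_gdp          = 4:6,
  London_population = 7:9,
  London_gdp        = 10:12
)

# The part before "_" becomes the City column; ".value" tells tidyr that the
# part after "_" names the output columns (population, gdp)
long <- pivot_longer(
  df,
  cols      = -Time,
  names_to  = c("City", ".value"),
  names_sep = "_"
)

long
```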

group_by n unique sequential values of a variable

It's easy to group_by unique values of a variable:
library(tidyverse)
library(gapminder)
gapminder %>%
group_by(year)
If we wanted to make a group ID just to show us what the groups would be:
gapminder %>%
select(year) %>%
distinct %>%
mutate(group = group_indices(., year))
A tibble: 12 x 2
year group
<int> <int>
1 1952 1
2 1957 2
3 1962 3
4 1967 4
5 1972 5
6 1977 6
7 1982 7
8 1987 8
9 1992 9
10 1997 10
11 2002 11
12 2007 12
But what if I want to group by pairs ("group2"), triplets ("group3"), etc. of sequential years? How could I produce the following tibble using dplyr/tidyverse?
A tibble: 12 x 5
year group group2 group3 group5
<int> <int> <int> <int> <int>
1 1952 1 1 1 1
2 1957 2 1 1 1
3 1962 3 2 1 1
4 1967 4 2 2 1
5 1972 5 3 2 1
6 1977 6 3 2 2
7 1982 7 4 3 2
8 1987 8 4 3 2
9 1992 9 5 3 2
10 1997 10 5 4 2
11 2002 11 6 4 3
12 2007 12 6 4 3
With ceiling() you can create groups very easily.
gapminder %>%
select(year) %>%
distinct() %>%
mutate(group1 = group_indices(., year)) %>%
mutate(group2=ceiling(group1 / 2)) %>%
mutate(group3=ceiling(group1 / 3)) %>%
mutate(group4=ceiling(group1 / 4)) %>%
mutate(group5=ceiling(group1 / 5))
# A tibble: 12 x 6
year group1 group2 group3 group4 group5
<int> <int> <dbl> <dbl> <dbl> <dbl>
1 1952 1 1 1 1 1
2 1957 2 1 1 1 1
3 1962 3 2 1 1 1
4 1967 4 2 2 1 1
5 1972 5 3 2 2 1
6 1977 6 3 2 2 2
7 1982 7 4 3 2 2
8 1987 8 4 3 2 2
9 1992 9 5 3 3 2
10 1997 10 5 4 3 2
11 2002 11 6 4 3 3
12 2007 12 6 4 3 3
Here's an alternative solution, where you specify the largest group size up front and the process creates all the corresponding grouping columns:
library(tidyverse)
library(gapminder)
# input number of groups
nn = 5
gapminder %>%
select(year) %>%
distinct() %>%
mutate(X = seq_along(year),
d = map(X, ~data.frame(t(ceiling(.x/2:nn))))) %>%
unnest() %>%
setNames(c("year", paste0("group",1:nn)))
# # A tibble: 12 x 6
# year group1 group2 group3 group4 group5
# <int> <int> <dbl> <dbl> <dbl> <dbl>
# 1 1952 1 1 1 1 1
# 2 1957 2 1 1 1 1
# 3 1962 3 2 1 1 1
# 4 1967 4 2 2 1 1
# 5 1972 5 3 2 2 1
# 6 1977 6 3 2 2 2
# 7 1982 7 4 3 2 2
# 8 1987 8 4 3 2 2
# 9 1992 9 5 3 3 2
#10 1997 10 5 4 3 2
#11 2002 11 6 4 3 3
#12 2007 12 6 4 3 3
Here's a function that does the job
group_by_n = function(x, n) {
ux <- match(x, sort(unique(x)))
ceiling(ux / n)
}
It does not require that x be ordered, or that values be evenly spaced or even numeric values. Use as, e.g.,
mutate(gapminder, group3 = group_by_n(year, 3))
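
To see what the function does on unordered, unevenly spaced input, here is a small self-contained check. The function body is repeated from above so the snippet runs on its own, and the `years` vector is made up.

```r
# Same function as above, repeated so this snippet is self-contained
group_by_n <- function(x, n) {
  ux <- match(x, sort(unique(x)))  # position of each value among the sorted distinct values
  ceiling(ux / n)                  # every n consecutive distinct values share a group
}

# Unordered years with a repeated value
years <- c(1962, 1952, 1957, 1952, 1967, 1972)

pairs <- group_by_n(years, 2)
pairs
# 2 1 1 1 2 3
```

The distinct sorted years are 1952, 1957, 1962, 1967, 1972, so 1952 and 1957 land in group 1, 1962 and 1967 in group 2, and 1972 in group 3.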

Create a conditional timeline based on events in R

I have data where the 'Law' variable indicates changes in legislation, in different places ('Place'):
Person Place Year Law
1 A 1990 0
2 A 1991 1
3 A 1992 1
4 B 1990 0
5 B 1991 0
6 B 1992 1
7 B 1993 1
8 B 1993 1
9 B 1993 1
10 B 1992 1
Basically the law was implemented in place A in 1991 and remained in force for all subsequent time periods. It was implemented in place B in 1992 and remained in force, & so on.
I would like to create a new variable that takes on a value of 0 for the year the law was implemented, 1 for 1 year after, 2 for 2 years after, -1 for the year before, -2 for 2 years before, and so on.
I need the final dataframe to look like:
Person Place Year Law timeline
1 A 1990 0 -1
2 A 1991 1 0
3 A 1992 1 1
4 B 1990 0 -2
5 B 1991 0 -1
6 B 1992 1 0
7 B 1993 1 1
8 B 1993 1 2
9 B 1993 1 2
10 B 1992 1 1
I have tried:
library(dplyr)
df %>%
group_by(Place) %>%
arrange(Year) %>%
mutate(timeline = rank(Law))
but it's not working like I need. What am I doing wrong? Can I do this in dplyr or do I need to create a complex for loop?
You can subtract the index at which the Law is implemented from the row_number:
df %>%
arrange(Year) %>%
group_by(Place) %>%
mutate(timeline = row_number() - which(diff(Law) == 1) - 1) %>%
arrange(Place)
# A tibble: 7 x 5
# Groups: Place [2]
# Person Place Year Law timeline
# <int> <fct> <int> <int> <dbl>
#1 1 A 1990 0 -1.
#2 2 A 1991 1 0.
#3 3 A 1992 1 1.
#4 4 B 1990 0 -2.
#5 5 B 1991 0 -1.
#6 6 B 1992 1 0.
#7 7 B 1993 1 1.
Using data.table:
library(data.table)
setDT(dat)[,timeline:=sequence(.N)-which.min(!Law),by=Place]
dat
Person Place Year Law timeline
1: 1 A 1990 0 -1
2: 2 A 1991 1 0
3: 3 A 1992 1 1
4: 4 B 1990 0 -2
5: 5 B 1991 0 -1
6: 6 B 1992 1 0
7: 7 B 1993 1 1
Using base R:
transform(dat, timeline = ave(Law, Place, FUN = function(x) seq_along(x) - which.min(!x)))
Person Place Year Law timeline
1 1 A 1990 0 -1
2 2 A 1991 1 0
3 3 A 1992 1 1
4 4 B 1990 0 -2
5 5 B 1991 0 -1
6 6 B 1992 1 0
7 7 B 1993 1 1
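
All three answers count rows, so they assume exactly one row per year and no missing years within a place. A sketch of an alternative that works from the Year column directly is given below; it assumes the law stays in force once enacted and that every place enacts it at some point (otherwise `min()` of an empty set returns `Inf` with a warning). The name `timeline_df` is illustrative.

```r
library(dplyr)

# The seven-row data used in the answers above
df <- data.frame(
  Person = 1:7,
  Place  = c("A", "A", "A", "B", "B", "B", "B"),
  Year   = c(1990, 1991, 1992, 1990, 1991, 1992, 1993),
  Law    = c(0, 1, 1, 0, 0, 1, 1)
)

timeline_df <- df %>%
  group_by(Place) %>%
  # years elapsed since the first year in which the law was in force in that place
  mutate(timeline = Year - min(Year[Law == 1])) %>%
  ungroup()

timeline_df
```

Because the result depends only on Year, repeated years for the same place get the same timeline value, and the rows do not need to be sorted first.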
