Calculate number of days passed since first event in R

I would like to calculate the number of days that have passed since the first event. There are different groups, so each group's starting date for an event is different, and I want to calculate, for each group, the number of days passed since its own first event.
names = c('Ben',"Ben","Ben","Ben","Ben","Ben" ,'Dan',"Dan","Dan","Dan", 'Peter',"Peter","Peter","Peter","Peter","Peter","Peter",'Betty',"Betty","Betty",'Betty', "Betty")
dates = c('2000-02-01','2000-02-02',"2000-02-03","2000-02-04",'2000-02-05','2000-02-05', '2000-01-11','2000-01-12',"2000-01-13",'2000-01-14',
'2000-09-10','2000-09-11',"2000-09-12",'2000-09-13','2000-09-14','2000-09-15','2000-09-16','2000-11-13','2000-11-14', "2000-11-15",'2000-11-16','2000-11-17')
events = c(0,0,1,4,5,11,0,0,2,6,0,0,1,2,3,4,5,0,0,1,2,3)
newd = data.frame(names,dates,events)
newd
So the data frame looks like this:
> newd
names dates events
1 Ben 2000-02-01 0
2 Ben 2000-02-02 0
3 Ben 2000-02-03 1
4 Ben 2000-02-04 4
5 Ben 2000-02-05 5
6 Ben 2000-02-05 11
7 Dan 2000-01-11 0
8 Dan 2000-01-12 0
9 Dan 2000-01-13 2
10 Dan 2000-01-14 6
11 Peter 2000-09-10 0
12 Peter 2000-09-11 0
13 Peter 2000-09-12 1
14 Peter 2000-09-13 2
15 Peter 2000-09-14 3
16 Peter 2000-09-15 4
17 Peter 2000-09-16 5
18 Betty 2000-11-13 0
19 Betty 2000-11-14 0
20 Betty 2000-11-15 1
21 Betty 2000-11-16 2
22 Betty 2000-11-17 3
This is just an example: the 'events' values are in no particular order and are random, and there are many other dates where events is 0. I would like to start counting days only where events > 0.
So if events is 0, the number of days counted should also be 0.

Convert the dates to the Date class; you can then subtract the minimum date within each names group.
newd$dates <- as.Date(newd$dates)
library(dplyr)
newd %>% group_by(names) %>% mutate(events = as.integer(dates - min(dates)))
# names dates events
# <chr> <date> <int>
# 1 Ben 2000-02-01 0
# 2 Ben 2000-02-02 1
# 3 Ben 2000-02-03 2
# 4 Ben 2000-02-04 3
# 5 Ben 2000-02-05 4
# 6 Ben 2000-02-05 4
# 7 Dan 2000-01-11 0
# 8 Dan 2000-01-12 1
# 9 Dan 2000-01-13 2
#10 Dan 2000-01-14 3
#11 Peter 2000-09-10 0
#12 Peter 2000-09-11 1
#13 Peter 2000-09-12 2
#14 Peter 2000-09-13 3
#15 Peter 2000-09-14 4
#16 Peter 2000-09-15 5
#17 Peter 2000-09-16 6
#18 Betty 2000-11-13 0
#19 Betty 2000-11-14 1
#20 Betty 2000-11-15 2
#21 Betty 2000-11-16 3
#22 Betty 2000-11-17 4
In base R:
newd$events <- with(newd, dates - ave(dates, names, FUN = min))
And with data.table:
library(data.table)
setDT(newd)[, events := dates - min(dates), names]
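The answers above measure days from each person's earliest date, while the question asks to start counting only once events > 0. Here is a minimal base R sketch of that reading, on a shortened subset of the example data. The column name days_since_first is hypothetical, and two assumptions are made: the day of the first event counts as day 0, and every row with events == 0 gets 0.

```r
# Shortened subset of the question's data (two people only).
newd <- data.frame(
  names  = c("Ben", "Ben", "Ben", "Ben", "Dan", "Dan", "Dan", "Dan"),
  dates  = as.Date(c("2000-02-01", "2000-02-02", "2000-02-03", "2000-02-04",
                     "2000-01-11", "2000-01-12", "2000-01-13", "2000-01-14")),
  events = c(0, 0, 1, 4, 0, 0, 2, 6)
)

# Dates as day numbers; rows without an event become NA so they are ignored
# when looking for the first event date of each person.
day_num   <- as.numeric(newd$dates)
event_day <- ifelse(newd$events > 0, day_num, NA_real_)

# Earliest event date per person, repeated on every row of that person
# (NA for a person who never has an event).
first_event <- ave(event_day, newd$names, FUN = function(x) {
  if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
})

# Assumption: rows with events == 0 get 0; other rows get the days elapsed
# since the first event, so the first event day itself is 0.
newd$days_since_first <- ifelse(newd$events > 0, day_num - first_event, 0)
newd
```

If the first event day should count as day 1 instead, add 1 to the difference inside the final ifelse().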


Group two dfs based on dates that closely match

These are subsets of two dataframes.
df1:
plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245
(cont.)
df2:
plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17
(cont.)
df2 has some dates that match df1, but sometimes the dates are off by one or a few days.
I would like to group both dfs based on both 'date' and 'plot', keeping df2, without losing 'gdd' data from df1.
I would lose that data if, for example, I inner_join both dfs, because the dates will not match.
So if a date in df1 is one to three days earlier or later than the closest date available in df2, that's fine because the dates are relatively close. This is tricky because I want this data replacement only if there is no data available in df1 for that date range.
My goal is to have something like this:
plot date flowers gdd
1 2019-07-12 2 60
1 2019-07-13 9 60
1 2019-07-14 3 60
1 2019-07-15 3 60
2 2019-05-12 10 100
2 2019-05-13 10 100
2 2019-05-14 14 100
2 2019-05-15 17 100
Is it possible to do?
I greatly appreciate any help!
Thanks!
I think a 'rolling join' from the data.table package can handle this:
library(data.table)
setDT(df1)
setDT(df2)
df1[, mean_first_flower_date := as.Date(mean_first_flower_date)]
df2[, date := as.Date(date)]
df1[df2, on=c("plot","mean_first_flower_date==date"), roll=3, rollends=TRUE]
# plot mean_first_flower_date gdd flowers
#1: 1 2019-07-12 60 2
#2: 1 2019-07-13 60 9
#3: 1 2019-07-14 60 3
#4: 1 2019-07-15 60 3
#5: 2 2019-05-12 100 10
#6: 2 2019-05-13 100 10
#7: 2 2019-05-14 100 14
#8: 2 2019-05-15 100 17
Using this data:
df1 <- read.table(text="plot mean_first_flower_date gdd
1 2019-07-15 60
1 2019-07-21 50
1 2019-07-23 78
2 2019-05-13 100
2 2019-05-22 173
2 2019-05-25 245", header=TRUE)
df2 <- read.table(text="plot date flowers
1 2019-07-12 2
1 2019-07-13 9
1 2019-07-14 3
1 2019-07-15 3
2 2019-05-12 10
2 2019-05-13 10
2 2019-05-14 14
2 2019-05-15 17", header=TRUE)
Try fill from tidyr after a dplyr left_join, using this syntax:
library(dplyr)
library(tidyr)
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 NA
8 2 2019-05-15 17 NA
As you can see, the last two rows are NA. That should not happen when you join your actual df2, where those rows will be filled with 173 because there will be a match for 2019-05-22. If you still want to fill any trailing NA rows, you can use fill again with .direction = "down":
df2 %>% left_join(df1, by = c("plot" = "plot", "date" = "mean_first_flower_date")) %>%
fill(gdd, .direction = "up") %>% fill(gdd, .direction = "down")
plot date flowers gdd
1 1 2019-07-12 2 60
2 1 2019-07-13 9 60
3 1 2019-07-14 3 60
4 1 2019-07-15 3 60
5 2 2019-05-12 10 100
6 2 2019-05-13 10 100
7 2 2019-05-14 14 100
8 2 2019-05-15 17 100
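For comparison, the "within three days either way" rule from the question can also be written out explicitly in base R. This is only a sketch under stated assumptions: nearest_gdd and its max_gap argument are hypothetical names, the window is symmetric (unlike a rolling join, which rolls in one direction), and a row with no df1 date within the window gets NA.

```r
df1 <- data.frame(
  plot = c(1, 1, 1, 2, 2, 2),
  mean_first_flower_date = as.Date(c("2019-07-15", "2019-07-21", "2019-07-23",
                                     "2019-05-13", "2019-05-22", "2019-05-25")),
  gdd = c(60, 50, 78, 100, 173, 245)
)
df2 <- data.frame(
  plot = c(1, 1, 1, 1, 2, 2, 2, 2),
  date = as.Date(c("2019-07-12", "2019-07-13", "2019-07-14", "2019-07-15",
                   "2019-05-12", "2019-05-13", "2019-05-14", "2019-05-15")),
  flowers = c(2, 9, 3, 3, 10, 10, 14, 17)
)

# Hypothetical helper: gdd of the nearest df1 date in the same plot,
# or NA when no df1 date lies within max_gap days of d.
nearest_gdd <- function(p, d, max_gap = 3) {
  cand <- df1[df1$plot == p, ]
  if (nrow(cand) == 0) return(NA_real_)
  gap <- abs(as.numeric(cand$mean_first_flower_date) - as.numeric(d))
  if (min(gap) > max_gap) return(NA_real_)
  cand$gdd[which.min(gap)]
}

# Look up one df2 row at a time; df2 keeps all of its rows and columns.
df2$gdd <- vapply(seq_len(nrow(df2)), function(i) {
  nearest_gdd(df2$plot[i], df2$date[i])
}, numeric(1))
df2
```

This row-by-row lookup is easy to read but slow on large tables, where the data.table rolling join is likely the better choice.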

How can I create a Variable for experience in R? [duplicate]

I have a dataset that has observations for different case files. And I would like to create a variable that indicates the number of cases that have been dealt with of that kind before a specific case is looked into.
Here is a test code and dataset to specify what I am asking.
df <- data.frame( ID= c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
name = c("Jon", "Jon", "Maria","Jon", "Jon", "Maria","Jon", "Jon", "Maria","Prince", "Jon", "Maria","Prince", "Jon", "Maria","Prince"),
date = c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25", "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13", "2007-08-22",
"2007-05-10", "2007-04-18", "2007-07-09","2007-06-10", "2008-02-13","2007-09-22", "2007-05-15"))
I would like to group the observations into categories and for each observation check the date and give a count of the number of observations in that category before the stated observation.
df$date <- as.Date(df$date, '%Y-%m-%d')
df$exp = NA
for(i in 1:nrow(df)){
temp = df %>% filter(!is.na(date))
temp = temp %>% filter(name == name[i])
df$exp[i]= nrow( filter(temp,date[i]>date))
}
I tried running the code above, but it doesn't give the results I am looking for. It gives me the following results
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 5
4 7 Jon 2007-03-22 4
5 11 Jon 2007-04-18 0
6 5 Jon 2007-04-22 3
7 8 Jon 2007-07-13 7
8 14 Jon 2008-02-13 0
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 3
11 12 Maria 2007-07-09 0
12 9 Maria 2007-08-22 0
13 15 Maria 2007-09-22 0
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 0
16 13 Prince 2007-06-10 0
instead of
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
How can I efficiently get this done?
You can sort by name and date, group by name, and use row_number() to get the result:
library(tidyverse)
df %>%
arrange(name, as.Date(date)) %>%
group_by(name) %>%
mutate(n = row_number() - 1)
# A tibble: 16 x 4
# Groups: name [3]
ID name date n
<dbl> <chr> <chr> <dbl>
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
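The loop in the question most likely fails because, inside filter(), name[i] and date[i] are looked up in the columns of temp rather than in df, so after the first filtering step the index i no longer points at the intended row. A minimal base R sketch that avoids both the loop and the sort, assuming "experience" means the number of strictly earlier dates for the same name:

```r
df <- data.frame(
  ID   = 1:16,
  name = c("Jon", "Jon", "Maria", "Jon", "Jon", "Maria", "Jon", "Jon", "Maria",
           "Prince", "Jon", "Maria", "Prince", "Jon", "Maria", "Prince"),
  date = as.Date(c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25",
                   "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13",
                   "2007-08-22", "2007-05-10", "2007-04-18", "2007-07-09",
                   "2007-06-10", "2008-02-13", "2007-09-22", "2007-05-15"))
)

# Within each name, the "min" rank of a date is 1 + the number of strictly
# earlier dates, so rank - 1 is the count of earlier cases.
df$exp <- ave(as.numeric(df$date), df$name,
              FUN = function(d) rank(d, ties.method = "min") - 1)

# Sorted only for display; the counts do not depend on row order.
df[order(df$name, df$date), ]
```

Unlike row_number() - 1, this gives two cases on the same date the same count, which may or may not be what is wanted.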

How to populate values of one row conditional of another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected, because as.numeric(as.character(df$likes)) turns the non-numeric strings into NA.
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
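A dependency-free variant of the same idea can be sketched with ave() from base R. It assumes each participant/day group has one question == 0 row holding the number; the helper vector picked is a hypothetical name. Converting only the question == 0 rows avoids the coercion warning.

```r
df <- data.frame(
  participant = c(rep("John", 6), rep("Mary", 6)),
  day         = c(rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 3)),
  question    = rep(c(1, 1, 0), 4),
  likes       = c("apples", "apples", "18", "apples", "apples", "7",
                  "bananas", "bananas", "24", "bananas", "bananas", "3"),
  stringsAsFactors = FALSE
)

# Keep the value of likes only on the question == 0 rows; all others stay NA.
picked <- rep(NA_real_, nrow(df))
is_num <- df$question == 0
picked[is_num] <- as.numeric(df$likes[is_num])

# Copy that value to every row of the same participant/day group.
df$number <- ave(picked, df$participant, df$day,
                 FUN = function(x) x[!is.na(x)][1])
df
```

Because the grouping is explicit, this does not rely on the question == 0 row coming last in each group.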

R - Add row index to a data frame but handle ties with minimum rank

I successfully used the answer in this SO thread (r-how-to-add-row-index-to-a-data-frame-based-on-combination-of-factors), but I need to handle the situation where two (or more) rows can be tied.
df <- data.frame(
season = c(2014,2014,2014,2014,2014,2014, 2014, 2014),
week = c(1,1,1,1,2,2,2,2),
player.name = c("Matt Ryan","Peyton Manning","Cam Newton","Matthew Stafford","Carson Palmer","Andrew Luck", "Aaron Rodgers", "Chad Henne"),
fant.pts.passing = c(28,19,29,28,18,22,29,22)
)
df <- df[order(-df$season, df$week, -df$fant.pts.passing),]
df$Index <- ave( 1:nrow(df), df$season, df$week, FUN=function(x) 1:length(x) )
df
In this example, for week 1, Matt Ryan and Matthew Stafford would both be 2, and then Peyton Manning would be 4.
You would want to use the rank function with ties.method="min" within your ave call:
df$Index <- ave(-df$fant.pts.passing, df$season, df$week,
FUN=function(x) rank(x, ties.method="min"))
df
# season week player.name fant.pts.passing Index
# 3 2014 1 Cam Newton 29 1
# 1 2014 1 Matt Ryan 28 2
# 4 2014 1 Matthew Stafford 28 2
# 2 2014 1 Peyton Manning 19 4
# 7 2014 2 Aaron Rodgers 29 1
# 6 2014 2 Andrew Luck 22 2
# 8 2014 2 Chad Henne 22 2
# 5 2014 2 Carson Palmer 18 4
Assuming you want ranks by season and week, this can be easily accomplished with dplyr's min_rank:
library(dplyr)
df %>% group_by(season, week) %>%
mutate(indx = min_rank(desc(fant.pts.passing)))
# season week player.name fant.pts.passing Index indx
# 1 2014 1 Cam Newton 29 1 1
# 2 2014 1 Matt Ryan 28 2 2
# 3 2014 1 Matthew Stafford 28 3 2
# 4 2014 1 Peyton Manning 19 4 4
# 5 2014 2 Aaron Rodgers 29 1 1
# 6 2014 2 Andrew Luck 22 2 2
# 7 2014 2 Chad Henne 22 3 2
# 8 2014 2 Carson Palmer 18 4 4
You could use the faster frank from data.table and assign (:=) the column by reference
library(data.table)#v1.9.5+
setDT(df)[, indx := frank(-fant.pts.passing, ties.method='min'), .(season, week)]
# season week player.name fant.pts.passing indx
#1: 2014 1 Cam Newton 29 1
#2: 2014 1 Matt Ryan 28 2
#3: 2014 1 Matthew Stafford 28 2
#4: 2014 1 Peyton Manning 19 4
#5: 2014 2 Aaron Rodgers 29 1
#6: 2014 2 Andrew Luck 22 2
#7: 2014 2 Chad Henne 22 2
#8: 2014 2 Carson Palmer 18 4
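To see what the tie handling changes, here is a small base R comparison on the week 1 scores from the question, kept in their original row order. The dense variant is included only as a contrast and is not something the question asks for.

```r
# Week 1 scores: Matt Ryan, Peyton Manning, Cam Newton, Matthew Stafford.
pts <- c(28, 19, 29, 28)

# Negate so that the highest score gets rank 1.
r_min   <- rank(-pts, ties.method = "min")    # tied rows share the lowest rank
r_first <- rank(-pts, ties.method = "first")  # ties broken by row order
r_dense <- match(-pts, sort(unique(-pts)))    # dense: no gap after a tie

r_min    # 2 4 1 2
r_first  # 2 4 1 3
r_dense  # 2 3 1 2
```

With "min", Peyton Manning is ranked 4 because two players are tied at 2, which matches the result requested in the question.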

remove individuals based on their range of values

I have a df with two variables, one with IDs and one with a variable called numbers. I would like to exclude individuals who do not start their sequence of numbers with the number 1.
I have managed to do this by creating a binary indicator and excluding if the person has this indicator. However, there must be a simpler more elegant way to do this?
Example data and the code I've used to achieve desired result are below.
Thank you.
sample df:
zz<-" names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 mary 4
10 mary 5
11 mary 6
12 mary 7
13 mary 8
14 mary 9
15 mary 10
16 mary 11
17 mary 12
18 pat 1
19 pat 2
20 pat 3
21 pat 4
22 pat 5
23 pat 6
24 pat 7
25 pat 8
26 pat 9
27 pat 10
28 sue 2
29 sue 3
30 sue 4
31 sue 5
32 sue 6
33 sue 7
34 sue 8
35 sue 9
36 tom 5
37 tom 6
38 tom 7
39 tom 8
40 tom 9
41 tom 10
42 tom 11
"
Data <- read.table(text=zz, header = TRUE)
Step 1 - add binary indicator
Data$all<-ifelse(Data$numbers==1, 1,0)
Data$allperson<-ave(Data$all, Data$names, FUN=cumsum)
Step 2 - get rid of people who do not have 1 as their start number
Data[!Data$allperson==0,]
If you want elegance, I must recommend the package dplyr:
library(dplyr)
Data %>%
group_by(names) %>%
filter(min(numbers) == 1)
It means just what it appears to mean: keep only the records of groups (defined by names) whose minimum numbers value equals 1, which drops everyone who does not start at 1.
names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 pat 1
10 pat 2
11 pat 3
12 pat 4
13 pat 5
14 pat 6
15 pat 7
16 pat 8
17 pat 9
18 pat 10
You may also try:
zz1 <- Data[with(Data, names %in% unique(names)[!!table(Data)[,1]]),]
head(zz1,4)
# names numbers
#1 john 1
#2 john 2
#3 john 3
#4 john 4
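The same filter can be sketched in one step with ave() from base R, without an indicator column. This version assumes "start their sequence with 1" refers to the first row of each person in the data as ordered; the shortened data set and the names first_num and kept are only for illustration.

```r
# Shortened version of the example data.
Data <- data.frame(
  names   = c(rep("john", 3), rep("mary", 3), rep("pat", 3)),
  numbers = c(1, 2, 3, 4, 5, 6, 1, 2, 3)
)

# First value of each person's sequence, repeated on every row of that person.
first_num <- ave(Data$numbers, Data$names, FUN = function(x) x[1])

# Keep only the people whose sequence starts at 1.
kept <- Data[first_num == 1, ]
unique(kept$names)
```

Replacing function(x) x[1] with min gives the "smallest value is 1" reading instead, which is equivalent when each person's numbers are sorted in ascending order.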
