Subtracting rows based on conditions in other columns - r

I am working with a dataframe which is similar to this:
df1 <- data.frame(p1 = c("John", "John", "John", "John", "John", "John", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim","Jim" ),
elapsed_time = c(0, 4, 6, 9, 12, 14, 17, 22, 27, 35, 42, 47, 51, 57),
event_type = c("start of period", "play", "play", "play", "play", "play", "play", "play", "play", "timeout", "play", "play", "play", "play"))
and looks like this:
p1 elapsed_time event_type
1 John 0 start of period
2 John 4 play
3 John 6 play
4 John 9 play
5 John 12 play
6 John 14 play
7 Jim 17 play
8 Jim 22 play
9 Jim 27 play
10 Jim 35 timeout
11 Jim 42 play
12 Jim 47 play
13 Jim 51 play
14 Jim 57 play
What I'd like to do is add a 4th column that calculates elapsed time since 1 of 3 things happened: 1) event_type == "start of period" 2) event_type == "timeout" 3) p1 was changed (like in row 7 from John to Jim). Any of these three things should reset the 4th column to zero.
My desired output is
p1 elapsed_time event_type elapsed_time_since_last_break
1 John 0 start of period 0
2 John 4 play 4
3 John 6 play 6
4 John 9 play 9
5 John 12 play 12
6 John 14 play 14
7 Jim 17 play 0
8 Jim 22 play 5
9 Jim 27 play 10
10 Jim 35 timeout 0
11 Jim 42 play 7
12 Jim 47 play 12
13 Jim 51 play 16
14 Jim 57 play 22
I'm somewhat new to r and haven't had much success. I'm sure there's probably a simple solution I'm overlooking.

library(dplyr)
df1 %>%
group_by(p1, elps = cumsum(event_type != 'play'))%>%
mutate(elps = elapsed_time - elapsed_time[1])
# A tibble: 14 × 4
# Groups: p1, elps [13]
p1 elapsed_time event_type elps
<chr> <dbl> <chr> <dbl>
1 John 0 start of period 0
2 John 4 play 4
3 John 6 play 6
4 John 9 play 9
5 John 12 play 12
6 John 14 play 14
7 Jim 17 play 0
8 Jim 22 play 5
9 Jim 27 play 10
10 Jim 35 timeout 0
11 Jim 42 play 7
12 Jim 47 play 12
13 Jim 51 play 16
14 Jim 57 play 22
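If a player can return later, so that the same p1 value appears in more than one stretch, grouping by p1 would merge those stretches. A minimal base R sketch, assuming the rows are already in time order, flags every break explicitly and numbers the resulting runs. The names player_changed, is_break, run_id, run_start and elapsed_since_break are illustrative and do not come from the answers here:

```r
df1 <- data.frame(
  p1 = c(rep("John", 6), rep("Jim", 8)),
  elapsed_time = c(0, 4, 6, 9, 12, 14, 17, 22, 27, 35, 42, 47, 51, 57),
  event_type = c("start of period", rep("play", 8), "timeout", rep("play", 4))
)

# A break is the first row, any change of player, or any non-"play" event
player_changed <- c(TRUE, df1$p1[-1] != df1$p1[-nrow(df1)])
is_break <- df1$event_type != "play" | player_changed

# Each break starts a new run; subtract the first elapsed_time of the run
run_id <- cumsum(is_break)
run_start <- ave(df1$elapsed_time, run_id, FUN = function(x) x[1])
df1$elapsed_since_break <- df1$elapsed_time - run_start
```

On the example data this reproduces the desired column; on data where a player reappears it would restart the clock at each reappearance, which may or may not be what you want.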

A data.table option
library(data.table)
setDT(df1)[
,
grp := (rowid(p1) == 1 | (event_type != "play"))
][
,
elps := elapsed_time - elapsed_time[grp][cumsum(grp)]
][
,
grp := NULL
]
gives
> df1
p1 elapsed_time event_type elps
1: John 0 start of period 0
2: John 4 play 4
3: John 6 play 6
4: John 9 play 9
5: John 12 play 12
6: John 14 play 14
7: Jim 17 play 0
8: Jim 22 play 5
9: Jim 27 play 10
10: Jim 35 timeout 0
11: Jim 42 play 7
12: Jim 47 play 12
13: Jim 51 play 16
14: Jim 57 play 22
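The indexing step elapsed_time[grp][cumsum(grp)] can be hard to read at first. A small base R sketch on a made-up vector (the values and the names elapsed, is_break, break_times and last_break are illustrative only) shows what each piece does:

```r
elapsed  <- c(0, 4, 6, 17, 22, 35, 42)
is_break <- c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)

# Times at which a break happened, one entry per break
break_times <- elapsed[is_break]

# cumsum(is_break) says which break each row belongs to (1, 1, 1, 2, 2, 3, 3),
# so indexing with it repeats each break time across the rows that follow it
last_break <- break_times[cumsum(is_break)]

since_break <- elapsed - last_break
```

This relies on the first row being a break, otherwise cumsum() would start at 0 and the index would drop rows.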

df1 %>%
group_by(p1) %>%
# cummax() carries the most recent reset time forward (elapsed_time never decreases)
mutate(result = elapsed_time - cummax(ifelse(elapsed_time == elapsed_time[1] | event_type != 'play', elapsed_time, 0)))

Related

How can I create a Variable for experience in R? [duplicate]

I have a dataset with observations for different case files, and I would like to create a variable that indicates how many cases of the same kind had been dealt with before a specific case was looked into.
Here is a test code and dataset to specify what I am asking.
df <- data.frame( ID= c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
name = c("Jon", "Jon", "Maria","Jon", "Jon", "Maria","Jon", "Jon", "Maria","Prince", "Jon", "Maria","Prince", "Jon", "Maria","Prince"),
date = c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25", "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13", "2007-08-22",
"2007-05-10", "2007-04-18", "2007-07-09","2007-06-10", "2008-02-13","2007-09-22", "2007-05-15"))
I would like to group the observations into categories and for each observation check the date and give a count of the number of observations in that category before the stated observation.
df$date <- as.Date(df$date, '%Y-%m-%d')
df$exp = NA
for(i in 1:nrow(df)){
temp = df %>% filter(!is.na(date))
temp = temp %>% filter(name == name[i])
df$exp[i]= nrow( filter(temp,date[i]>date))
}
I tried running the code above, but it doesn't give the results I am looking for. It gives me the following results:
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 5
4 7 Jon 2007-03-22 4
5 11 Jon 2007-04-18 0
6 5 Jon 2007-04-22 3
7 8 Jon 2007-07-13 7
8 14 Jon 2008-02-13 0
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 3
11 12 Maria 2007-07-09 0
12 9 Maria 2007-08-22 0
13 15 Maria 2007-09-22 0
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 0
16 13 Prince 2007-06-10 0
instead of
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
How can I efficiently get this done?
You can sort by name and date, group by name, and use row_number() to get the result:
library(tidyverse)
df %>%
arrange(name, as.Date(date)) %>%
group_by(name) %>%
mutate(n = row_number() - 1)
# A tibble: 16 x 4
# Groups: name [3]
ID name date n
<dbl> <chr> <chr> <dbl>
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
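If you would rather keep the original row order and avoid extra packages, a base R sketch along the same lines is possible. It assumes that "experience" means the number of rows with the same name and a strictly earlier date, so tied dates get the same count:

```r
df <- data.frame(
  ID = 1:16,
  name = c("Jon", "Jon", "Maria", "Jon", "Jon", "Maria", "Jon", "Jon", "Maria",
           "Prince", "Jon", "Maria", "Prince", "Jon", "Maria", "Prince"),
  date = as.Date(c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25",
                   "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13",
                   "2007-08-22", "2007-05-10", "2007-04-18", "2007-07-09",
                   "2007-06-10", "2008-02-13", "2007-09-22", "2007-05-15"))
)

# Within each name, the minimum rank of a date minus 1 is the number of
# strictly earlier dates for that name
df$exp <- ave(as.numeric(df$date), df$name,
              FUN = function(d) rank(d, ties.method = "min") - 1)
```

Sorting with df[order(df$name, df$date), ] afterwards gives the same layout as the output above.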

Replicating table in R with change in one column

I have this table in R :
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
I want to replicate this table 4 times. All values should stay the same except the Month column, which needs to be incremented by 1 each time. The final table should look like this:
Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1
John 8 2017 8 16
Carol 90 2017 8 30
Bug 9 2017 8 1
John 8 2017 9 16
Carol 90 2017 9 30
Bug 9 2017 9 1
John 8 2017 10 16
Carol 90 2017 10 30
Bug 9 2017 10 1
John 8 2017 11 16
Carol 90 2017 11 30
Bug 9 2017 11 1
Please point out how to do this efficiently in R. Many thanks!
If this is your dataframe:
df = read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)
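Then this is your dataframe repeated five times (the original plus four copies):
df2 = df[rep(rownames(df), 5),]
And this is it again, but with the months incremented. Note each = nrow(df), so that every row of a copy gets the same offset:
df2$Month = df2$Month + rep(0:4, each = nrow(df))
In the more general case:
m = 5 # <-- number of copies desired, including the original
df2 = df[rep(rownames(df), m), ]
df2$Month = df2$Month + rep(0:(m - 1), each = nrow(df))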
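The general case can also be wrapped in a small helper. This is only a sketch: the function name replicate_with_offset and its arguments are made up for illustration, and it does not roll months past 12 over into the next year.

```r
# Stack `copies` copies of df and add 0, 1, 2, ... to `column` in successive copies
replicate_with_offset <- function(df, copies, column = "Month") {
  out <- df[rep(seq_len(nrow(df)), times = copies), ]
  # each = nrow(df) keeps the offset constant within one copy of the table
  out[[column]] <- out[[column]] + rep(seq_len(copies) - 1, each = nrow(df))
  rownames(out) <- NULL
  out
}

df <- read.table(text = "Name ID Year Month Date
John 8 2017 7 16
Carol 90 2017 7 30
Bug 9 2017 7 1", header = TRUE)

df2 <- replicate_with_offset(df, copies = 5)
```

With copies = 5 the result has 15 rows, with Month running from 7 to 11 in blocks of three.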

How to populate values of one row conditional of another row in R?

I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
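For comparison, the lookup idea from the question can be written in a few lines of base R with a named vector. This is a sketch that assumes exactly one question == 0 row per participant and day; the names key, is_answer and lookup are illustrative:

```r
participant <- c(rep("John", 6), rep("Mary", 6))
day <- c(rep(1, 3), rep(2, 3), rep(1, 3), rep(2, 3))
likes <- c("apples", "apples", "18", "apples", "apples", "7",
           "bananas", "bananas", "24", "bananas", "bananas", "3")
question <- rep(c(1, 1, 0), 4)
df <- data.frame(participant, day, question, likes)

# One key per participant/day combination, e.g. "John 1"
key <- paste(df$participant, df$day)

# Named vector mapping each key to the number stored in its question == 0 row
is_answer <- df$question == 0
lookup <- setNames(as.numeric(as.character(df$likes[is_answer])), key[is_answer])

# Look every row's key up in the named vector
df$number <- unname(lookup[key])
```

Only the question == 0 rows are converted, so no coercion warning is raised.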
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected, because as.numeric(as.character(df$likes)) cannot convert the fruit names to numbers.
If your data are ordered like in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)

R - Add row index to a data frame but handle ties with minimum rank

I successfully used the answer in this SO thread, r-how-to-add-row-index-to-a-data-frame-based-on-combination-of-factors, but I need to handle the situation where two (or more) rows are tied.
df <- data.frame(
season = c(2014,2014,2014,2014,2014,2014, 2014, 2014),
week = c(1,1,1,1,2,2,2,2),
player.name = c("Matt Ryan","Peyton Manning","Cam Newton","Matthew Stafford","Carson Palmer","Andrew Luck", "Aaron Rodgers", "Chad Henne"),
fant.pts.passing = c(28,19,29,28,18,22,29,22)
)
df <- df[order(-df$season, df$week, -df$fant.pts.passing),]
df$Index <- ave( 1:nrow(df), df$season, df$week, FUN=function(x) 1:length(x) )
df
In this example, for week 1, Matt Ryan and Matthew Stafford would both be 2, and then Peyton Manning would be 4.
You would want to use the rank function with ties.method="min" within your ave call:
df$Index <- ave(-df$fant.pts.passing, df$season, df$week,
FUN=function(x) rank(x, ties.method="min"))
df
# season week player.name fant.pts.passing Index
# 3 2014 1 Cam Newton 29 1
# 1 2014 1 Matt Ryan 28 2
# 4 2014 1 Matthew Stafford 28 2
# 2 2014 1 Peyton Manning 19 4
# 7 2014 2 Aaron Rodgers 29 1
# 6 2014 2 Andrew Luck 22 2
# 8 2014 2 Chad Henne 22 2
# 5 2014 2 Carson Palmer 18 4
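To see what ties.method changes, it can help to rank the week 1 scores on their own. A short base R sketch (the variable names are illustrative):

```r
# Week 1 scores in the original row order:
# Matt Ryan, Peyton Manning, Cam Newton, Matthew Stafford
pts <- c(28, 19, 29, 28)

# Negate so that the highest score gets rank 1
rank_min     <- rank(-pts, ties.method = "min")      # tied rows share the lowest rank
rank_first   <- rank(-pts, ties.method = "first")    # ties broken by position
rank_average <- rank(-pts, ties.method = "average")  # the default: mean of the tied ranks
```

Here rank_min is 2 4 1 2, rank_first is 2 4 1 3 and rank_average is 2.5 4.0 1.0 2.5, so only "min" gives the 1, 2, 2, 4 pattern asked for.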
Assuming you want ranks by season and week, this can be easily accomplished with dplyr's min_rank:
library(dplyr)
df %>% group_by(season, week) %>%
mutate(indx = min_rank(desc(fant.pts.passing)))
# season week player.name fant.pts.passing Index indx
# 1 2014 1 Cam Newton 29 1 1
# 2 2014 1 Matt Ryan 28 2 2
# 3 2014 1 Matthew Stafford 28 3 2
# 4 2014 1 Peyton Manning 19 4 4
# 5 2014 2 Aaron Rodgers 29 1 1
# 6 2014 2 Andrew Luck 22 2 2
# 7 2014 2 Chad Henne 22 3 2
# 8 2014 2 Carson Palmer 18 4 4
You could use the faster frank from data.table and assign (:=) the column by reference
library(data.table)#v1.9.5+
setDT(df)[, indx := frank(-fant.pts.passing, ties.method='min'), .(season, week)]
# season week player.name fant.pts.passing indx
#1: 2014 1 Cam Newton 29 1
#2: 2014 1 Matt Ryan 28 2
#3: 2014 1 Matthew Stafford 28 2
#4: 2014 1 Peyton Manning 19 4
#5: 2014 2 Aaron Rodgers 29 1
#6: 2014 2 Andrew Luck 22 2
#7: 2014 2 Chad Henne 22 2
#8: 2014 2 Carson Palmer 18 4

remove individuals based on their range of values

I have a df with two variables, one with IDs and one with a variable called numbers. I would like to exclude individuals who do not start their sequence of numbers with the number 1.
I have managed to do this by creating a binary indicator and excluding if the person has this indicator. However, there must be a simpler more elegant way to do this?
Example data and the code I've used to achieve desired result are below.
Thank you.
sample df:
zz<-" names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 mary 4
10 mary 5
11 mary 6
12 mary 7
13 mary 8
14 mary 9
15 mary 10
16 mary 11
17 mary 12
18 pat 1
19 pat 2
20 pat 3
21 pat 4
22 pat 5
23 pat 6
24 pat 7
25 pat 8
26 pat 9
27 pat 10
28 sue 2
29 sue 3
30 sue 4
31 sue 5
32 sue 6
33 sue 7
34 sue 8
35 sue 9
36 tom 5
37 tom 6
38 tom 7
39 tom 8
40 tom 9
41 tom 10
42 tom 11
"
Data <- read.table(text=zz, header = TRUE)
Step 1 - add binary indicator
Data$all<-ifelse(Data$numbers==1, 1,0)
Data$allperson<-ave(Data$all, Data$names, FUN=cumsum)
Step 2 - get rid of people who do not have 1 as their start number
Data[!Data$allperson==0,]
If you want elegance, I must recommend the package dplyr:
library(dplyr)
Data %>%
group_by(names) %>%
filter(min(numbers) == 1)
It means just what it appears to mean: keep only records where a group (defined by names) has a minimum numbers value equal to 1, which drops everyone whose sequence does not start at 1.
names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 pat 1
10 pat 2
11 pat 3
12 pat 4
13 pat 5
14 pat 6
15 pat 7
16 pat 8
17 pat 9
18 pat 10
You may also try (with Data being the data frame read in above):
Data1 <- Data[with(Data, names %in% unique(names)[!!table(Data)[,1]]),]
head(Data1,4)
# names numbers
#1 john 1
#2 john 2
#3 john 3
#4 john 4
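The same filter can be sketched in base R with ave(), without the helper columns from the question. The data frame is rebuilt compactly here so the example is self-contained, and the names starts_at_one and kept are illustrative:

```r
Data <- data.frame(
  names = rep(c("john", "mary", "pat", "sue", "tom"), times = c(8, 9, 10, 8, 7)),
  numbers = c(1:8, 4:12, 1:10, 2:9, 5:11)
)

# TRUE for every row of a person whose first recorded number is 1
starts_at_one <- ave(Data$numbers, Data$names, FUN = function(x) x[1]) == 1

kept <- Data[starts_at_one, ]
```

This takes "start" to mean the first row for each person. If the rows might not be sorted, FUN = min would be the safer choice, at the cost of also keeping anyone who has a 1 anywhere in their sequence.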
