I have a df with two variables, one with IDs and one with a variable called numbers. I would like to exclude individuals who do not start their sequence of numbers with the number 1.
I have managed to do this by creating a binary indicator and excluding if the person has this indicator. However, there must be a simpler, more elegant way to do this?
Example data and the code I've used to achieve desired result are below.
Thank you.
sample df:
zz<-" names numbers
1 john 1
2 john 2
3 john 3
4 john 4
5 john 5
6 john 6
7 john 7
8 john 8
9 mary 4
10 mary 5
11 mary 6
12 mary 7
13 mary 8
14 mary 9
15 mary 10
16 mary 11
17 mary 12
18 pat 1
19 pat 2
20 pat 3
21 pat 4
22 pat 5
23 pat 6
24 pat 7
25 pat 8
26 pat 9
27 pat 10
28 sue 2
29 sue 3
30 sue 4
31 sue 5
32 sue 6
33 sue 7
34 sue 8
35 sue 9
36 tom 5
37 tom 6
38 tom 7
39 tom 8
40 tom 9
41 tom 10
42 tom 11
"
Data <- read.table(text=zz, header = TRUE)
Step 1 - add binary indicator
Data$all <- ifelse(Data$numbers == 1, 1, 0)
Data$allperson <- ave(Data$all, Data$names, FUN = cumsum)
Step 2 - get rid of people who do not have 1 as their start number
Data[!Data$allperson == 0, ]
If you want elegance, I must recommend the package dplyr:
library(dplyr)
Data %>%
group_by(names) %>%
filter(min(numbers) != 1)
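It means just what it appears to mean: keep only the records whose group (defined by names) has a minimum numbers value not equal to 1. As written, that returns the people the question wants to exclude; to keep only those whose sequence starts at 1, change the condition to min(numbers) == 1.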
names numbers
1 mary 4
2 mary 5
3 mary 6
4 mary 7
5 mary 8
6 mary 9
7 mary 10
8 mary 11
9 mary 12
10 sue 2
11 sue 3
You may also try:
zz1 <- Data[with(Data, names %in% unique(names)[!!table(Data)[,1]]),]
head(zz1,4)
# names numbers
#1 john 1
#2 john 2
#3 john 3
#4 john 4
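For comparison, here is a base R sketch that needs no extra packages. It rebuilds the example data compactly (the rep()/c() construction is only a shorthand for the table in the question) and uses ave() to repeat each person's first value on all of their rows. It assumes the rows are already ordered within each person; if they are not, FUN = min may be the safer choice. The names first_number and kept are made up for this illustration.

```r
# Compact reconstruction of the example data from the question
Data <- data.frame(
  names   = rep(c("john", "mary", "pat", "sue", "tom"), times = c(8, 9, 10, 8, 7)),
  numbers = c(1:8, 4:12, 1:10, 2:9, 5:11)
)

# First value of `numbers` for each person, repeated on every row of that person
first_number <- ave(Data$numbers, Data$names, FUN = function(x) x[1])

# Keep only the people whose sequence starts with 1
kept <- Data[first_number == 1, ]
unique(kept$names)
```

Because ave() returns a vector aligned with the original rows, no helper columns have to be added to the data frame and removed afterwards.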
I am working with a dataframe which is similar to this:
df1 <- data.frame(p1 = c("John", "John", "John", "John", "John", "John", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim", "Jim","Jim" ),
elapsed_time = c(0, 4, 6, 9, 12, 14, 17, 22, 27, 35, 42, 47, 51, 57),
event_type = c("start of period", "play", "play", "play", "play", "play", "play", "play", "play", "timeout", "play", "play", "play", "play"))
and looks like this:
p1 elapsed_time event_type
1 John 0 start of period
2 John 4 play
3 John 6 play
4 John 9 play
5 John 12 play
6 John 14 play
7 Jim 17 play
8 Jim 22 play
9 Jim 27 play
10 Jim 35 timeout
11 Jim 42 play
12 Jim 47 play
13 Jim 51 play
14 Jim 57 play
What I'd like to do is add a 4th column that calculates the elapsed time since any one of three things happened: 1) event_type == "start of period", 2) event_type == "timeout", or 3) p1 changed (as in row 7, from John to Jim). Any of these three things should reset the 4th column to zero.
My desired output is
p1 elapsed_time event_type elapsed_time_since_last_break
1 John 0 start of period 0
2 John 4 play 4
3 John 6 play 6
4 John 9 play 9
5 John 12 play 12
6 John 14 play 14
7 Jim 17 play 0
8 Jim 22 play 5
9 Jim 27 play 10
10 Jim 35 timeout 0
11 Jim 42 play 7
12 Jim 47 play 12
13 Jim 51 play 16
14 Jim 57 play 22
I'm somewhat new to R and haven't had much success. I'm sure there's probably a simple solution I'm overlooking.
df1 %>%
group_by(p1, elps = cumsum(event_type != 'play'))%>%
mutate(elps = elapsed_time - elapsed_time[1])
# A tibble: 14 × 4
# Groups: p1, elps [13]
p1 elapsed_time event_type elps
<chr> <dbl> <chr> <dbl>
1 John 0 start of period 0
2 John 4 play 4
3 John 6 play 6
4 John 9 play 9
5 John 12 play 12
6 John 14 play 14
7 Jim 17 play 0
8 Jim 22 play 5
9 Jim 27 play 10
10 Jim 35 timeout 0
11 Jim 42 play 7
12 Jim 47 play 12
13 Jim 51 play 16
14 Jim 57 play 22
A data.table option
library(data.table)
setDT(df1)[
,
grp := (rowid(p1) == 1 | (event_type != "play"))
][
,
elps := elapsed_time - elapsed_time[grp][cumsum(grp)]
][
,
grp := NULL
]
gives
> df1
p1 elapsed_time event_type elps
1: John 0 start of period 0
2: John 4 play 4
3: John 6 play 6
4: John 9 play 9
5: John 12 play 12
6: John 14 play 14
7: Jim 17 play 0
8: Jim 22 play 5
9: Jim 27 play 10
10: Jim 35 timeout 0
11: Jim 42 play 7
12: Jim 47 play 12
13: Jim 51 play 16
14: Jim 57 play 22
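Another dplyr option, using cummax() so that the reference time moves forward at every break (this assumes elapsed_time never decreases):
df1 %>%
group_by(p1) %>%
mutate(result = elapsed_time - cummax(ifelse(elapsed_time == elapsed_time[1] | event_type != 'play', elapsed_time, 0)))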
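The three reset conditions can also be written out explicitly in base R. The following is a sketch rather than a definitive implementation: it flags every row that starts a new block, numbers the blocks with cumsum(), and subtracts the first elapsed_time of each block. The helper names changed, is_break and block are made up for this illustration.

```r
df1 <- data.frame(
  p1 = rep(c("John", "Jim"), times = c(6, 8)),
  elapsed_time = c(0, 4, 6, 9, 12, 14, 17, 22, 27, 35, 42, 47, 51, 57),
  event_type = c("start of period", rep("play", 8), "timeout", rep("play", 4))
)

n <- nrow(df1)

# TRUE where p1 differs from the previous row (the first row always counts)
changed <- c(TRUE, df1$p1[-1] != df1$p1[-n])

# A row is a break if the player changed or the event is one of the two resets
is_break <- changed | df1$event_type %in% c("start of period", "timeout")

# Consecutive block id: increases by one at every break
block <- cumsum(is_break)

# Elapsed time relative to the first row of each block
df1$elapsed_time_since_last_break <-
  df1$elapsed_time - ave(df1$elapsed_time, block, FUN = function(x) x[1])

df1
```

Comparing p1 with the previous row, instead of grouping by p1, means a player who reappears later starts a fresh block rather than being merged with their earlier rows.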
I would like to calculate the number of days that have passed since the first event. There are different groups, so each group's starting date for an event is different, and I want to calculate each group's number of days passed since its own first event.
names = c('Ben',"Ben","Ben","Ben","Ben","Ben" ,'Dan',"Dan","Dan","Dan", 'Peter',"Peter","Peter","Peter","Peter","Peter","Peter",'Betty',"Betty","Betty",'Betty', "Betty")
dates = c('2000-02-01','2000-02-02',"2000-02-03","2000-02-04",'2000-02-05','2000-02-05', '2000-01-11','2000-01-12',"2000-01-13",'2000-01-14',
'2000-09-10','2000-09-11',"2000-09-12",'2000-09-13','2000-09-14','2000-09-15','2000-09-16','2000-11-13','2000-11-14', "2000-11-15",'2000-11-16','2000-11-17')
events = c(0,0,1,4,5,11,0,0,2,6,0,0,1,2,3,4,5,0,0,1,2,3)
newd = data.frame(names,dates,events)
newd
so the data frame looks like this:
> newd
names dates events
1 Ben 2000-02-01 0
2 Ben 2000-02-02 0
3 Ben 2000-02-03 1
4 Ben 2000-02-04 4
5 Ben 2000-02-05 5
6 Ben 2000-02-05 11
7 Dan 2000-01-11 0
8 Dan 2000-01-12 0
9 Dan 2000-01-13 2
10 Dan 2000-01-14 6
11 Peter 2000-09-10 0
12 Peter 2000-09-11 0
13 Peter 2000-09-12 1
14 Peter 2000-09-13 2
15 Peter 2000-09-14 3
16 Peter 2000-09-15 4
17 Peter 2000-09-16 5
18 Betty 2000-11-13 0
19 Betty 2000-11-14 0
20 Betty 2000-11-15 1
21 Betty 2000-11-16 2
22 Betty 2000-11-17 3
This is just an example; the 'events' are not in any specific order and are effectively random, and there are also many other dates with an event of 0. So I would like to start counting days only where event > 0.
So if there's a 0 at 'event', then 0 days should be counted as well.
Convert the dates to actual Date objects; you can then subtract the minimum date within each names group.
newd$dates <- as.Date(newd$dates)
library(dplyr)
newd %>% group_by(names) %>% mutate(events = as.integer(dates - min(dates)))
# names dates events
# <chr> <date> <int>
# 1 Ben 2000-02-02 0
# 2 Ben 2000-02-03 1
# 3 Ben 2000-02-04 2
# 4 Ben 2000-02-05 3
# 5 Ben 2000-02-05 3
# 6 Dan 2000-01-12 0
# 7 Dan 2000-01-13 1
# 8 Dan 2000-01-14 2
# 9 Peter 2000-09-11 0
#10 Peter 2000-09-12 1
#11 Peter 2000-09-13 2
#12 Peter 2000-09-14 3
#13 Peter 2000-09-15 4
#14 Peter 2000-09-16 5
#15 Betty 2000-11-14 0
#16 Betty 2000-11-15 1
#17 Betty 2000-11-16 2
#18 Betty 2000-11-17 3
In base R :
newd$events <- with(newd, dates - ave(dates, names, FUN = min))
and data.table :
library(data.table)
setDT(newd)[, events := dates - min(dates), names]
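If counting should start only at the first date with events > 0, as the question describes, one possible reading is sketched below in base R. It is an interpretation, not the asker's confirmed intent: rows before a group's first positive event get 0, and the day of that first positive event itself is day 0. The column name days is made up for this illustration.

```r
newd <- data.frame(
  names = rep(c("Ben", "Dan", "Peter", "Betty"), times = c(6, 4, 7, 5)),
  dates = as.Date(c(
    "2000-02-01", "2000-02-02", "2000-02-03", "2000-02-04", "2000-02-05", "2000-02-05",
    "2000-01-11", "2000-01-12", "2000-01-13", "2000-01-14",
    "2000-09-10", "2000-09-11", "2000-09-12", "2000-09-13", "2000-09-14",
    "2000-09-15", "2000-09-16",
    "2000-11-13", "2000-11-14", "2000-11-15", "2000-11-16", "2000-11-17"
  )),
  events = c(0, 0, 1, 4, 5, 11, 0, 0, 2, 6, 0, 0, 1, 2, 3, 4, 5, 0, 0, 1, 2, 3)
)

# ave() over row indices lets the function see several columns per group
newd$days <- ave(seq_len(nrow(newd)), newd$names, FUN = function(i) {
  started <- newd$events[i] > 0
  # A group with no positive event never starts counting
  if (!any(started)) return(rep(0L, length(i)))
  # Earliest date in this group that has a positive event
  start <- min(newd$dates[i][started])
  # Days since that date; rows before it are clamped to 0
  pmax(as.integer(newd$dates[i] - start), 0L)
})

newd
```

If the first positive event should count as day 1 instead, add 1 to the rows on or after the start date.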
I have a dataset that has observations for different case files. And I would like to create a variable that indicates the number of cases that have been dealt with of that kind before a specific case is looked into.
Here is a test code and dataset to specify what I am asking.
df <- data.frame( ID= c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16),
name = c("Jon", "Jon", "Maria","Jon", "Jon", "Maria","Jon", "Jon", "Maria","Prince", "Jon", "Maria","Prince", "Jon", "Maria","Prince"),
date = c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25", "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13", "2007-08-22",
"2007-05-10", "2007-04-18", "2007-07-09","2007-06-10", "2008-02-13","2007-09-22", "2007-05-15"))
I would like to group the observations into categories and for each observation check the date and give a count of the number of observations in that category before the stated observation.
df$date <- as.Date(df$date, '%Y-%m-%d')
df$exp = NA
for(i in 1:nrow(df)){
temp = df %>% filter(!is.na(date))
temp = temp %>% filter(name == name[i])
df$exp[i]= nrow( filter(temp,date[i]>date))
}
I tried running the code above, but it doesn't give the results I am looking for. It gives me the following results:
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 5
4 7 Jon 2007-03-22 4
5 11 Jon 2007-04-18 0
6 5 Jon 2007-04-22 3
7 8 Jon 2007-07-13 7
8 14 Jon 2008-02-13 0
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 3
11 12 Maria 2007-07-09 0
12 9 Maria 2007-08-22 0
13 15 Maria 2007-09-22 0
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 0
16 13 Prince 2007-06-10 0
instead of
ID name date exp
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
How can I efficiently get this done?
You can sort by name and date, make groups by name and use the row_number to get the result
library(tidyverse)
df %>%
arrange(name, as.Date(date)) %>%
group_by(name) %>%
mutate(n = row_number() - 1)
# A tibble: 16 x 4
# Groups: name [3]
ID name date n
<dbl> <chr> <chr> <dbl>
1 1 Jon 2007-01-22 0
2 2 Jon 2007-02-13 1
3 4 Jon 2007-02-25 2
4 7 Jon 2007-03-22 3
5 11 Jon 2007-04-18 4
6 5 Jon 2007-04-22 5
7 8 Jon 2007-07-13 6
8 14 Jon 2008-02-13 7
9 6 Maria 2007-03-13 0
10 3 Maria 2007-05-22 1
11 12 Maria 2007-07-09 2
12 9 Maria 2007-08-22 3
13 15 Maria 2007-09-22 4
14 10 Prince 2007-05-10 0
15 16 Prince 2007-05-15 1
16 13 Prince 2007-06-10 2
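As for why the loop in the question misbehaves: inside filter(), name[i] and date[i] are looked up in the columns of temp, not in df, so once temp has been filtered the index i no longer points at the intended row. A loop-free base R sketch that keeps the original row order is shown below. It counts, for each row, how many rows with the same name have a strictly earlier date; rank() with ties.method = "min" is used so that tied dates are not counted as "before" each other.

```r
df <- data.frame(
  ID = 1:16,
  name = c("Jon", "Jon", "Maria", "Jon", "Jon", "Maria", "Jon", "Jon", "Maria",
           "Prince", "Jon", "Maria", "Prince", "Jon", "Maria", "Prince"),
  date = as.Date(c("2007-01-22", "2007-02-13", "2007-05-22", "2007-02-25",
                   "2007-04-22", "2007-03-13", "2007-03-22", "2007-07-13",
                   "2007-08-22", "2007-05-10", "2007-04-18", "2007-07-09",
                   "2007-06-10", "2008-02-13", "2007-09-22", "2007-05-15"))
)

# Within each name: rank of the date minus one = number of strictly earlier dates
df$exp <- ave(as.numeric(df$date), df$name,
              FUN = function(d) rank(d, ties.method = "min") - 1)

# Sort only for display, to compare with the desired output in the question
df[order(df$name, df$date), ]
```

With no tied dates, as in this example, the result is the same as row_number() - 1 after sorting.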
I inherited a data set coded in an unusual way. I would like to learn a less verbose way of reshaping it. The data frame looks like this:
# Input.
participant = c(rep("John",6), rep("Mary",6))
day = c(rep(1,3), rep(2,3), rep(1,3), rep(2,3))
likes = c("apples", "apples", "18", "apples", "apples", "7", "bananas", "bananas", "24", "bananas", "bananas", "3")
question = rep(c(1,1,0),4)
number = c(rep(18,3), rep(7,3), rep(24,3), rep(3,3))
df = data.frame(participant, day, question, likes)
participant day question likes
1 John 1 1 apples
2 John 1 1 apples
3 John 1 0 18
4 John 2 1 apples
5 John 2 1 apples
6 John 2 0 7
7 Mary 1 1 bananas
8 Mary 1 1 bananas
9 Mary 1 0 24
10 Mary 2 1 bananas
11 Mary 2 1 bananas
12 Mary 2 0 3
As you can see, the column likes is heterogeneous. When question equals 0, likes conveys a number chosen by the participants, not their preferred fruit. So I would like to re-code it in a new column as follows:
participant day question likes number
1 John 1 1 apples 18
2 John 1 1 apples 18
3 John 1 0 18 18
4 John 2 1 apples 7
5 John 2 1 apples 7
6 John 2 0 7 7
7 Mary 1 1 bananas 24
8 Mary 1 1 bananas 24
9 Mary 1 0 24 24
10 Mary 2 1 bananas 3
11 Mary 2 1 bananas 3
12 Mary 2 0 3 3
My current solution with base R involves subsetting the initial data frame, creating a lookup table, changing the column names and then merging the lookup table with the original data frame. But this involves several steps and I suspect there is a simpler solution. I think that tidyr might be the answer, but I don't know how to use it to spread values in one column (likes) conditional on other columns (day and question).
Do you have any suggestions? Thanks a lot!
Using the data set above, you can try the following. You group your data by participant and day and look for a row with question == 0 for each group.
library(dplyr)
group_by(df, participant, day) %>%
mutate(age = as.numeric(as.character(likes[which(question == 0)])))
Or as alistaire suggested, you can use grep() too.
group_by(df, participant, day) %>%
mutate(age = as.numeric(grep('\\d+', likes, value = TRUE)))
# participant day question likes age
# (fctr) (dbl) (dbl) (fctr) (dbl)
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
If you want to use data.table, you can do:
library(data.table)
setDT(df)[, age := as.numeric(as.character(likes[which(question == 0)])),
by = list(participant, day)]
NOTE
The present data set is a new one. Jota's answer works for the deleted data set.
Addressing the new example data:
# create a key column, overwrite it later
df$number <- paste0(df$participant, df$day) # use as a key
# create lookup table
lookup <- df[!is.na(as.numeric(as.character(df$likes))), c("number", "likes")]
# use lookup to overwrite df$number with the appropriate number
df$number <- lookup$likes[match(df$number, lookup$number)]
# participant day question likes number
#1 John 1 1 apples 18
#2 John 1 1 apples 18
#3 John 1 0 18 18
#4 John 2 1 apples 7
#5 John 2 1 apples 7
#6 John 2 0 7 7
#7 Mary 1 1 bananas 24
#8 Mary 1 1 bananas 24
#9 Mary 1 0 24 24
#10 Mary 2 1 bananas 3
#11 Mary 2 1 bananas 3
#12 Mary 2 0 3 3
The warning about NAs being introduced by coercion is expected: it comes from converting non-numeric strings to numeric in as.numeric(as.character(df$likes)).
If your data are ordered as in the example, you can use na.locf from the zoo package:
library(zoo)
df$age <- na.locf(as.numeric(as.character(df$likes)), fromLast = TRUE)
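A base R sketch that does not depend on row order or on extra packages: ave() with two grouping variables picks, within each participant/day combination, the one value of likes that consists only of digits and repeats it on every row of that group. It assumes each participant/day group contains exactly one such value, as in the example data.

```r
df <- data.frame(
  participant = rep(c("John", "Mary"), each = 6),
  day = rep(rep(c(1, 2), each = 3), times = 2),
  question = rep(c(1, 1, 0), times = 4),
  likes = c("apples", "apples", "18", "apples", "apples", "7",
            "bananas", "bananas", "24", "bananas", "bananas", "3")
)

# Within each participant/day group, take the all-digit entry of `likes`
digits_only <- ave(df$likes, df$participant, df$day,
                   FUN = function(x) x[grepl("^[0-9]+$", x)][1])

# ave() on a character vector returns character, so convert at the end
df$number <- as.numeric(digits_only)

df
```

Selecting by pattern avoids the coercion warning, because only strings that are already numeric are passed to as.numeric().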
Say that I have two dataframes. One lists the names of soccer players, the teams that they have played for, and the number of goals that they have scored on each team. The other contains the soccer players' names and ages. How do I add a "names_age" column to the goals dataframe that holds the age of the player in the first column, "names", not of the player in "teammates_names"? How do I add an additional column with the teammates' ages? In short, I'd like two age columns: one for the first set of players and one for the second set.
> AGE_DF
names age
1 Sam 20
2 Jon 21
3 Adam 22
4 Jason 23
5 Jones 24
6 Jermaine 25
> GOALS_DF
names goals team teammates_names teammates_goals teammates_team
1 Sam 1 USA Jason 1 HOLLAND
2 Sam 2 ENGLAND Jason 2 PORTUGAL
3 Sam 3 BRAZIL Jason 3 GHANA
4 Sam 4 GERMANY Jason 4 COLOMBIA
5 Sam 5 ARGENTINA Jason 5 CANADA
6 Jon 1 USA Jones 1 HOLLAND
7 Jon 2 ENGLAND Jones 2 PORTUGAL
8 Jon 3 BRAZIL Jones 3 GHANA
9 Jon 4 GERMANY Jones 4 COLOMBIA
10 Jon 5 ARGENTINA Jones 5 CANADA
11 Adam 1 USA Jermaine 1 HOLLAND
12 Adam 1 ENGLAND Jermaine 1 PORTUGAL
13 Adam 4 BRAZIL Jermaine 4 GHANA
14 Adam 3 GERMANY Jermaine 3 COLOMBIA
15 Adam 2 ARGENTINA Jermaine 2 CANADA
What I have tried: I've successfully got this to work using a for loop. The actual data that I am working with have thousands of rows, and this takes a long time. I would like a vectorized approach but I'm having trouble coming up with a way to do that.
Try merge or match.
Here's merge (which is likely to screw up your row ordering and can sometimes be slow):
merge(AGE_DF, GOALS_DF, all = TRUE)
Here's match, which makes use of basic indexing and subsetting. Assign the result to a new column, of course.
AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
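Spelled out for both age columns, the match() approach might look like the sketch below. GOALS_DF is trimmed here to the columns needed for the lookup, and the column name teammates_age is an assumption (the question only names names_age).

```r
AGE_DF <- data.frame(
  names = c("Sam", "Jon", "Adam", "Jason", "Jones", "Jermaine"),
  age = 20:25
)

# Trimmed version of GOALS_DF: only the columns involved in the lookup
GOALS_DF <- data.frame(
  names = rep(c("Sam", "Jon", "Adam"), each = 5),
  goals = c(1:5, 1:5, 1, 1, 4, 3, 2),
  teammates_names = rep(c("Jason", "Jones", "Jermaine"), each = 5)
)

# match() gives, for every row of GOALS_DF, the row of AGE_DF with the same name
GOALS_DF$names_age     <- AGE_DF$age[match(GOALS_DF$names, AGE_DF$names)]
GOALS_DF$teammates_age <- AGE_DF$age[match(GOALS_DF$teammates_names, AGE_DF$names)]

GOALS_DF
```

Unlike merge(), this leaves the row order of GOALS_DF untouched, and players missing from AGE_DF simply get NA.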
Here's another option to consider: Convert your dataset into a long format first, and then do the merge. Here, I've done it with melt and "data.table":
library(reshape2)
library(data.table)
setkey(melt(as.data.table(GOALS_DF, keep.rownames = TRUE),
measure.vars = c("names", "teammates_names"),
value.name = "names"), names)[as.data.table(AGE_DF)]
# rn goals team teammates_goals teammates_team variable names age
# 1: 1 1 USA 1 HOLLAND names Sam 20
# 2: 2 2 ENGLAND 2 PORTUGAL names Sam 20
# 3: 3 3 BRAZIL 3 GHANA names Sam 20
# 4: 4 4 GERMANY 4 COLOMBIA names Sam 20
# 5: 5 5 ARGENTINA 5 CANADA names Sam 20
# 6: 6 1 USA 1 HOLLAND names Jon 21
## <<SNIP>>
# 28: 13 4 BRAZIL 4 GHANA teammates_names Jermaine 25
# 29: 14 3 GERMANY 3 COLOMBIA teammates_names Jermaine 25
# 30: 15 2 ARGENTINA 2 CANADA teammates_names Jermaine 25
# rn goals team teammates_goals teammates_team variable names age
I've added the rownames so that you can use dcast to get back to the wide format and retain the row ordering if it's important.