I am trying to group a series of observations by two columns, and then create a third column with an id number. I've tried group_indices, but that gives each combination of observations a unique number. I want the number to revert to 1 for the first observation of each group.
In my data there are a series of Sites with a number of rows showing the calendar Day when an observation was collected. I want to calculate the chronological day within a Site.
library(dplyr)
# Make some data
df <- data.frame(Site = rep(c("A", "B", "C"), each = 70),
Day = as.integer(rep(c(21,22,23,24,25,26,27,1,2,3,4,5,6,7,
24,25,26,27,28,29,30), each = 10)))
# Create Day Number column (this doesn't actually work, but is the sort
# of thing I'm looking for...)
df <- df %>% group_by(Site, Day) %>%
mutate(Day.Number = group_indices(Day))
# Desired output
Site Day Day.Number
1 A 21 1
2 A 21 1
3 A 21 1
...
11 A 22 2
12 A 22 2
13 A 22 2
14 A 22 2
15 A 22 2
...
141 C 24 1
142 C 24 1
143 C 24 1
144 C 24 1
...
151 C 25 2
152 C 25 2
153 C 25 2
154 C 25 2
155 C 25 2
...
This is just a toy dataset to demonstrate the problem. Although most sites will have ten observations of seven days it is not always a given, so I can't just use a sequence of rep() etc.
There is a bit of a discussion about this on github here and here but it doesn't seem to have been resolved. Any suggestions for workarounds are much appreciated.
Here's one way to do it:
df <- df %>%
left_join(unique(df) %>% group_by(Site) %>% mutate(Day.Number=1:n()))
head(df)
# Site Day Day.Number
# 1 A 21 1
# 2 A 21 1
# 3 A 21 1
# 4 A 21 1
# 5 A 21 1
# 6 A 21 1
Related
Say I have a large data frame for e.g.
Block Subject Stim
10 100 1
11 100 2
12 100 1
...
24. 101 1
25 101 3
26 101 1
27 101 3
28 101 3
29 101 3
For every subject, every stim appears in either 2 or 4 blocks.
My objective is to mutate a column that determines for each row whether this is the first half of the blocks that this stim appears for this Subject (so if a stim appears in 2 blocks then I'm checking if its first block, and if in 4 I'm checking if this is within the first two blocks).
So my desired output would be
Block Subject Stim BlockType
10 100 1 1
11 100 2 1
12 100 1 2 # already learned in block 10
...
24. 101 1 1 # block 10/12 dont count since its a different subject
25 101 3 1
26 101 1 2 # already learned in block 24
27 101 3 1 # in four total blocks so still in bottom half
28 101 3 2
29 101 3 2
I know i can do this with a forloop for each row but thats incredibly slow. Any suggestions for how i can speed it up would be greatly appreciated!
EDIT thanks to suggestion I try
df %>%
group_by(Subject,Stim) %>%
mutate(BlockType = rep(c(1,2), each = n()/2, length.out = n())) %>%
ungroup() -> result
Most of the outputs are correct but then sometimes I get a strange mistake
Block Subject Stim BlockType
13 1 2 1
13 1 3 1
13 1 2 2
Where that third row should be Blocktype 1. Any idea what could be tripping it up?
You can try with rep :
library(dplyr)
df %>%
group_by(Subject, Stim) %>%
mutate(BlockType = rep(1:2, each = n()/2, length.out = n())) %>%
ungroup() -> result
result
How can I get the index of the sample whose previous samples were consecutive and were greater than a fixed threshold in groups?
In the below example, I need to find the time when I have consecutively 3 samples whose speed is greater than 35 speed >= 35 group-wise
speed_threshold = 35
Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40 # Speed > 35
1 25 42 # Speed > 35
1 30 52 # Speed > 35
1 35 48 # <--- Return time = 35 as answer for Group 1 !
1 40 45
2 5 22
2 10 36 # Speed > 35
2 15 38 # Speed > 35
2 20 46 # Speed > 35
2 25 53 # <--- Return time = 25 as answer for Group 2 !
3 5 45
3 10 58 # <--- Return time = NA as answer for group 3 !
If it's above the threshold and it's the third such value in a row, capture the index in ends. Select the first index in ends and add one to get the index of the return time. (There may be more than 1 such group of 3 and therefore more than one element of ends. In this case, the first end needs to be used.)
Note: In your example, the speed at return time is always above the threshold. This code does not check that as a condition at all, but simply gives the first time after three rows with speeds above threshold (regardless of whether the speed at that time is still above the threshold).
library(data.table)
setDT(df)
speed_thresh <- 35
df[, {above <- Speed > speed_thresh
ends <- which(above & rowid(rleid(above)) == 3)
.(Return_Time = Time[ends[1] + 1])}
, Group]
# Group Return_Time
# 1: 1 35
# 2: 2 25
# 3: 3 NA
Data used:
df <- fread('
Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40
1 25 42
1 30 52
1 35 48
1 40 45
2 5 22
2 10 36
2 15 38
2 20 46
2 25 53
3 5 45
3 10 58
')
One option is to use rleid to create a grouping variable based on the logic in 'Speed' and filter the rows where the number of rows (n()) is equal to 3 and all 'Speed' is greater than 35
library(dplyr)
library(data.table)
df1 %>%
group_by(Group, grp = rleid(Speed > speed_threshold)) %>%
filter(n() >= 3, all(Speed > speed_threshold)) %>%
slice(1:3)
1) Using DF defined reproducibly in the Note at the end, define a function ok which takes a vector of logicals indicating whether speed is greater than 35 and returns a logical vector of the same length which is TRUE for the first speed that comes after 3 consecutive TRUEs. Apply that to each group using ave and subset DF down those rows which are TRUE giving s.
If just returning the groups which satisfy the condition is sufficient then we are done; otherwise, define Groups as a one column data frame with one row per Group and merge that with s so that we get an NA for those groups not satisfying the condition.
library(zoo)
ok <- function(x) cumsum(rollapplyr(x, list(-(1:3)), all, fill = FALSE)) == 1
s <- subset(DF, ave(Speed > 35, Group, FUN = ok))
Groups <- data.frame(Group = unique(DF$Group))
merge(Groups, s, all.x = TRUE)[1:2]
## Group Time
## 1 1 35
## 2 2 25
## 3 3 NA
2) A second approach is to split DF by group and then perform the calculation over each component of the split.
library(zoo)
calc <- function(x) {
r <- rollapplyr(x$Speed > 35, list(-(1:3)), all, fill = FALSE)
c(which(cumsum(r) == 1), NA)[1]
}
sapply(split(DF, DF$Group), calc)
## 1 2 3
## 35 25 NA
Note
Lines <- "Group Time Speed
1 5 25
1 10 23
1 15 21
1 20 40 # Speed > 35
1 25 42 # Speed > 35
1 30 52 # Speed > 35
1 35 48 # <--- Return time = 35 as answer for Group 1 !
1 40 45
2 5 22
2 10 36 # Speed > 35
2 15 38 # Speed > 35
2 20 46 # Speed > 35
2 25 53 # <--- Return time = 25 as answer for Group 2 !
3 5 45
3 10 58 # <--- Return time = NA as answer for group 3 !"
DF <- read.table(text = Lines, header = TRUE)
I'm working on an unbalanced panel dataset. Data came from a game and for every user (user_id) in the record I have data for every level (level) of the game. As recording data started some time after introduction of the game, for some users I don't have data regarding the first levels, that's why I want to throw them out in a first step.
I've tried the complete.cases-function, but it only excludes the rows with the missing values (NAs), but not data for the whole user with missing values in level 1.
panel <- panel[complete.cases(panel), ]
That's why I need a code that excludes every user who has no record in level 1 (which in my dataset means he has an "NA" at one of the dependent variables, i.e. number of activities).
Update #1:
Data looks like this (thanks to thc):
> game_data <- data.frame(player = c(1,1,1,2,2,2,3,3,3), level = c(1,2,3,1,2,3,1,2,3), score=c(0,150,170,80,100,110,75,100,0))
> game_data
player level score
1 1 1 0
2 1 2 150
3 1 3 170
4 2 1 80
5 2 2 100
6 2 3 110
7 3 1 75
8 3 2 100
9 3 3 0
I now want to exclude data from player 1, because he has a score of 0 in level 1.
Here is one approach
Example data:
game_data <- data.frame(player = c(1,1,2,2,2,3,3,3), level = c(2,3,1,2,3,1,2,3), score=sample(100, 8))
> game_data
player level score
1 1 2 19
2 1 3 13
3 2 1 65
4 2 2 32
5 2 3 22
6 3 1 98
7 3 2 58
8 3 3 84
library(dplyr)
game_data %>% group_by(player) %>% filter(any(level == 1)) %>% as.data.frame
player level score
1 2 1 65
2 2 2 32
3 2 3 22
4 3 1 98
5 3 2 58
6 3 3 84
I think I now find a solution with your help:
game_data %>% group_by(player) %>% filter(any(level == 1 & score > 0)) %>% as.data.frame
This seems to work and I just needed a little adjustment from your code thc, thank you very much for your help!
I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID regardless of whether the score goes above 5 or below -5 during the event( this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5]
Let me know if you need anymore information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11
I'm having a hard time to describe this so it's best explained with an example (as can probably be seen from the poor question title).
Using dplyr I have the result of a group_by and summarize I have a data frame that I want to do some further manipulation on by factor.
As an example, here's a data frame that looks like the result of my dplyr operations:
> df <- data.frame(run=as.factor(c(rep(1,3), rep(2,3))),
group=as.factor(rep(c("a","b","c"),2)),
sum=c(1,8,34,2,7,33))
> df
run group sum
1 1 a 1
2 1 b 8
3 1 c 34
4 2 a 2
5 2 b 7
6 2 c 33
I want to divide sum by a value that depends on run. For example, if I have:
> total <- data.frame(run=as.factor(c(1,2)),
total=c(45,47))
> total
run total
1 1 45
2 2 47
Then my final data frame will look like this:
> df
run group sum percent
1 1 a 1 1/45
2 1 b 8 8/45
3 1 c 34 34/45
4 2 a 2 2/47
5 2 b 7 7/47
6 2 c 33 33/47
Where I manually inserted the fraction in the percent column by hand to show the operation I want to do.
I know there is probably some dplyr way to do this with mutate but I can't seem to figure it out right now. How would this be accomplished?
(In base R)
You can use total as a look-up table where you get a total for each run of df :
total[df$run,'total']
[1] 45 45 45 47 47 47
And you simply use it to divide the sum and assign the result to a new column:
df$percent <- df$sum / total[df$run,'total']
run group sum percent
1 1 a 1 0.02222222
2 1 b 8 0.17777778
3 1 c 34 0.75555556
4 2 a 2 0.04255319
5 2 b 7 0.14893617
6 2 c 33 0.70212766
If your "run" values are 1,2...n then this will work
divisor <- c(45,47) # c(45,47,...up to n divisors)
df$percent <- df$sum/divisor[df$run]
first you want to merge in the total values into your df:
df2 <- merge(df, total, by = "run")
then you can call mutate:
df2 %<>% mutate(percent = sum / total)
Convert to data.table in-place, then merge and add new column, again in-place:
library(data.table)
setDT(df)[total, on = 'run', percent := sum/total]
df
# run group sum percent
#1: 1 a 1 0.02222222
#2: 1 b 8 0.17777778
#3: 1 c 34 0.75555556
#4: 2 a 2 0.04255319
#5: 2 b 7 0.14893617
#6: 2 c 33 0.70212766