Nested Subsetting - r

I have the following data frame
library(dplyr)
ID <- c(1,1,1,2,2,2,2,3,3)
Tag <- c(1,2,6,1,3,4,6,4,3)
Value <- c(5,9,3,3,5,6,4,8,9)
DF <- data.frame(ID,Tag,Value)
ID Tag Value
1 1 1 5
2 1 2 9
3 1 6 3
4 2 1 3
5 2 3 5
6 2 4 6
7 2 6 4
8 3 4 8
9 3 3 9
I would like to do the following: 1) group the rows by ID, and 2) assign the Value corresponding to a specific Tag to a new column. In the example below, I assign the Value of Tag 6 to a new column, by ID:
ID Tag Value New_Value
1 1 1 5 3
2 1 2 9 3
3 1 6 3 3
4 2 1 3 4
5 2 3 5 4
6 2 4 6 4
7 2 6 4 4
8 3 4 8 NA
9 3 3 9 NA
To the best of my knowledge, I need to subset the data within each group to get the Value for Tag 6. Here are my code and the error message:
DF %>% group_by(ID) %>% mutate(New_Value = select(filter(.,Tag==6),Value))
Adding missing grouping variables: `ID`
Error: Column `New_Value` is of unsupported class data.frame
Another possible solution is to create a new data frame with the IDs and Values for Tag 6 and join it with DF. However, I believe there is a more generic solution using only dplyr.
I would appreciate it if you could help me understand how to perform a nested subset in this situation.
Thank you

On the assumption that Tag is unique within groups, you could do:
library(dplyr)
DF %>%
group_by(ID) %>%
mutate(New_Value = ifelse(any(Tag == 6), Value[Tag == 6], NA))
# A tibble: 9 x 4
# Groups: ID [3]
ID Tag Value New_Value
<dbl> <dbl> <dbl> <dbl>
1 1 1 5 3
2 1 2 9 3
3 1 6 3 3
4 2 1 3 4
5 2 3 5 4
6 2 4 6 4
7 2 6 4 4
8 3 4 8 NA
9 3 3 9 NA
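An alternative sketch of the same idea (not from the answer above): match() returns NA when a group has no Tag 6, so indexing Value with it yields NA automatically and no ifelse() is needed.

```r
library(dplyr)

DF <- data.frame(ID    = c(1,1,1,2,2,2,2,3,3),
                 Tag   = c(1,2,6,1,3,4,6,4,3),
                 Value = c(5,9,3,3,5,6,4,8,9))

res <- DF %>%
  group_by(ID) %>%
  # match(6, Tag) is NA when the group has no Tag 6, so Value[NA] is NA
  mutate(New_Value = Value[match(6, Tag)]) %>%
  ungroup()
res
```

This relies on Tag being unique within groups, just like the ifelse() version.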

Related

Splitting a grouping column using another column and map it using an increasing sequence

I have data that is uniquely identified with an id, and grouped by a sub_id and group_id. Each instance with a unique object_id generates an id. Similar instances concerning different objects are grouped by a sub_id. Multiple interactions for one instance with one or more objects are grouped by the group_id column. Essentially, if someone interacts twice for two objects, there will be 4 unique ids, two unique sub_ids and a single group_id. The only way to distinguish the difference in the duplicate interactions is through the object_id column:
library(tidyverse)
dat <- tibble(id = 1:11,
sub_id = c(1,2,1,2,3,4,3,4,5,6,6),
group_id = c(rep(1,4),rep(3,4),5,6,6),
object_id = c(1,1,2,2,2,2,3,3,2,2,3))
dat
# A tibble: 11 × 4
id sub_id group_id object_id
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 1 1
3 3 1 1 2
4 4 2 1 2
5 5 3 3 2
6 6 4 3 2
7 7 3 3 3
8 8 4 3 3
9 9 5 5 2
10 10 6 6 2
11 11 6 6 3
I would like some sort of sub_group_id that will split the group_id interactions in sequences using the object_id column. The expected output would look like this:
# A tibble: 11 × 5
id sub_id group_id subgroup_id object_id
<int> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 2 2 1 1 1
3 3 1 1 2 2
4 4 2 1 2 2
5 5 3 3 3 2
6 6 4 3 3 2
7 7 3 3 4 3
8 8 4 3 4 3
9 9 5 5 5 2
10 10 6 6 6 2
11 11 6 6 7 3
That column essentially needs to increase indefinitely to distinguish between different subgroups. It resembles the object_id column, but the same object can be present in different sub_ids, group_ids, and subgroup_ids, whereas each subgroup_id only appears in one group_id. Any ideas on how to achieve that output?
Group by group_id and object_id, then use cur_group_id() to get an increasing id for each subgroup:
dat %>%
group_by(group_id, object_id) %>%
mutate(subgroup_id = cur_group_id())
id sub_id group_id object_id subgroup_id
<int> <dbl> <dbl> <dbl> <int>
1 1 1 1 1 1
2 2 2 1 1 1
3 3 1 1 2 2
4 4 2 1 2 2
5 5 3 3 2 3
6 6 4 3 2 3
7 7 3 3 3 4
8 8 4 3 3 4
9 9 5 5 2 5
10 10 6 6 2 6
11 11 6 6 3 7
We could use
library(dplyr)
library(data.table)
dat %>%
mutate(subgroup_id = rleid(object_id, group_id))
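If you prefer to stay within dplyr, the run-length idea behind rleid() can be sketched with cumsum() over a change indicator. Like rleid(), this assumes rows with the same (group_id, object_id) pair are adjacent:

```r
library(dplyr)

dat <- tibble(id = 1:11,
              sub_id    = c(1,2,1,2,3,4,3,4,5,6,6),
              group_id  = c(rep(1,4), rep(3,4), 5, 6, 6),
              object_id = c(1,1,2,2,2,2,3,3,2,2,3))

res <- dat %>%
  # a new subgroup starts whenever group_id or object_id changes
  mutate(subgroup_id = cumsum(group_id  != lag(group_id,  default = first(group_id)) |
                              object_id != lag(object_id, default = first(object_id))) + 1)
res
```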

Filter question in R, remove duplicate values and only keep min value

I am trying to use a database in R and need to add some filters.
On selected routes you have to change buses to get to your final destination. I have filtered on these routes, but I need to remove duplicate values and keep the one with the minimum value, so I can see how many departures there are for the selected destination.
Current filter code:
filterroutes <- c("5", "10")
busroutes <- database %>% filter(Route %in% filterroutes)
Table after filter on routes 5 and 10
Route Time NDepartures
5 2 1
5 3 1
5 3 1
5 4 1
5 5 1
10 1 1
10 3 3
10 4 2
10 6 1
10 7 2
I want to keep rows with a unique time stamp; where there are duplicates, I want to keep the one with the minimum NDepartures.
Should Return
Route Time NDepartures
5 2 1
5 3 1
5 3 1
5 4 1
5 5 1
10 1 1
10 6 1
10 7 2
Someone told me I could use NDepartures == min(NDepartures), but I could not get it to work.
library(dplyr)
busroutes %>%
group_by(Time) %>%
#n() equals # of obs in each group "Time"
filter(n()==1 | (n()>1 & NDepartures==min(NDepartures))) %>%
ungroup()
# A tibble: 8 x 3
Route Time NDepartures
<int> <int> <int>
1 5 2 1
2 5 3 1
3 5 3 1
4 5 4 1
5 5 5 1
6 10 1 1
7 10 6 1
8 10 7 2
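As a minor simplification (a sketch, not the answer above): when n() == 1, the condition NDepartures == min(NDepartures) is automatically true, so the n() branch is redundant and the filter can be reduced to a single condition. Using data rebuilt from the table above:

```r
library(dplyr)

busroutes <- data.frame(
  Route = c(5, 5, 5, 5, 5, 10, 10, 10, 10, 10),
  Time  = c(2, 3, 3, 4, 5, 1, 3, 4, 6, 7),
  NDepartures = c(1, 1, 1, 1, 1, 1, 3, 2, 1, 2))

res <- busroutes %>%
  group_by(Time) %>%
  # singleton groups trivially satisfy the min condition
  filter(NDepartures == min(NDepartures)) %>%
  ungroup()
res
```

This returns the same 8 rows as the answer above.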

R: Assign incremental ids based on the groups [duplicate]

This question already has answers here:
Numbering rows within groups in a data frame
(10 answers)
Closed 3 years ago.
I have the following sample data frame:
> test = data.frame(UserId = sample(1:5, 10, replace = T)) %>% arrange(UserId)
> test
UserId
1 1
2 1
3 1
4 1
5 1
6 3
7 4
8 4
9 4
10 5
I now want another column, loginCount, that assigns incremental ids within each group, as shown below. Using mutate as below creates an id per group, but how do I get incremental ids within each group, independent of the other groups?
> test %>% mutate(loginCount = group_indices_(test, .dots = "UserId"))
UserId loginCount
1 1 1
2 1 1
3 1 1
4 1 1
5 1 1
6 3 2
7 4 3
8 4 3
9 4 3
10 5 4
I want something like shown below:
UserId loginCount
1 1
1 2
1 3
1 4
1 5
3 1
4 1
4 2
4 3
5 1
You could group and use row_number:
test %>%
arrange(UserId) %>%
group_by(UserId) %>%
mutate(loginCount = row_number()) %>%
ungroup()
# A tibble: 10 x 2
UserId loginCount
<int> <int>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
One solution using base R's tapply():
test$loginCount <- unlist(tapply(rep(1, nrow(test)), test$UserId, cumsum))
> test
UserId loginCount
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 3 1
7 4 1
8 4 2
9 4 3
10 5 1
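Another base R sketch (an addition, not from the answers above): since the data are sorted by UserId, sequence() applied to the group counts from table() yields the same per-group 1..n counter without tapply():

```r
test <- data.frame(UserId = c(1, 1, 1, 1, 1, 3, 4, 4, 4, 5))

# table() counts rows per user; sequence() expands each count to 1..n
test$loginCount <- sequence(table(test$UserId))
test
```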

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
With this, ID = 4 ends up in chunks 1 and 2. How can I code this so that all rows with ID = 4 end up in the same chunk (it doesn't matter which one)? I looked at the split() function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
You can use mutate with ntile to add a chunk column:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or, given Frank's comment, you could run ntile on the distinct/unique values of the id and then join the original table back on id:
test<-data_frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
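The same distinct-then-map idea can be sketched without a join: compute a chunk per unique id, then index back with match(). This guarantees every row of a given id lands in one chunk (it bins ids by position, so exact chunk boundaries may differ slightly from the join version):

```r
library(dplyr)

test <- data.frame(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
                   val = 1:16)

ids <- unique(test$id)
chunk_of <- ntile(seq_along(ids), 3)   # one chunk label per distinct id
out <- chunk_of[match(test$id, ids)]   # map the label back to every row
```

Then test[out == 1, ] returns the first chunk, test[out == 2, ] the second, and so on.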

Assign value to group based on condition in column

I have a data frame that looks like the following:
> df = data.frame(group = c(1,1,1,2,2,2,3,3,3),
date = c(1,2,3,4,5,6,7,8,9),
value = c(3,4,3,4,5,6,6,4,9))
> df
group date value
1 1 1 3
2 1 2 4
3 1 3 3
4 2 4 4
5 2 5 5
6 2 6 6
7 3 7 6
8 3 8 4
9 3 9 9
I want to create a new column that contains the date value per group that is associated with the value "4" from the value column.
The following data frame shows what I hope to accomplish.
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
As we can see, group 1 has the newValue "2" because that is the date associated with the value "4". Similarly, group 2 has newValue 4 and group 3 has newValue 8.
I assume there is an easy way to do this using ave() or a range of dplyr/data.table functions, but I have been unsuccessful with my many attempts.
Here's a quick data.table one
library(data.table)
setDT(df)[, newValue := date[value == 4L], by = group]
df
# group date value newValue
# 1: 1 1 3 2
# 2: 1 2 4 2
# 3: 1 3 3 2
# 4: 2 4 4 4
# 5: 2 5 5 4
# 6: 2 6 6 4
# 7: 3 7 6 8
# 8: 3 8 4 8
# 9: 3 9 9 8
Here's a similar dplyr version
library(dplyr)
df %>%
group_by(group) %>%
mutate(newValue = date[value == 4L])
Or a possible base R solution using merge after filtering the data (some renaming will be needed afterwards):
merge(df, df[df$value == 4, c("group", "date")], by = "group")
Here is a base R option
df$newValue = rep(df$date[which(df$value == 4)], table(df$group))
Another alternative using lapply
do.call(rbind, lapply(split(df, df$group),
function(x){x$newValue = rep(x$date[which(x$value == 4)],
each = length(x$group)); x}))
# group date value newValue
#1.1 1 1 3 2
#1.2 1 2 4 2
#1.3 1 3 3 2
#2.4 2 4 4 4
#2.5 2 5 5 4
#2.6 2 6 6 4
#3.7 3 7 6 8
#3.8 3 8 4 8
#3.9 3 9 9 8
One more base R path:
df$newValue <- ave(`names<-`(df$value==4,df$date), df$group, FUN=function(x) as.numeric(names(x)[x]))
df
group date value newValue
1 1 1 3 2
2 1 2 4 2
3 1 3 3 2
4 2 4 4 4
5 2 5 5 4
6 2 6 6 4
7 3 7 6 8
8 3 8 4 8
9 3 9 9 8
10 3 11 7 8
I tested this on variable-length groups. It assigns the date column as the names of the logical index of value equal to 4, then extracts the name (the date) where the value is 4 within each group.
Data
df = data.frame(group = c(1,1,1,2,2,2,3,3,3,3),
date = c(1,2,3,4,5,6,7,8,9,11),
value = c(3,4,3,4,5,6,6,4,9,7))
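One caveat, as a hedged sketch (an addition, not part of the answers above): date[value == 4L] returns a zero-length vector when a group contains no 4, which makes mutate() error. A match()-based variant degrades to NA instead:

```r
library(dplyr)

df <- data.frame(group = c(1,1,1,2,2,2,3,3,3),
                 date  = c(1,2,3,4,5,6,7,8,9),
                 value = c(3,4,3,4,5,6,6,4,9))

res <- df %>%
  group_by(group) %>%
  # match(4, value) is NA when a group has no 4, so date[NA] is NA
  mutate(newValue = date[match(4, value)]) %>%
  ungroup()
res
```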