Get columns in frame based on values in second frame

I have two data frames. The first has an id column with a lot of ordered ids.
The second contains only specific rows of the first one. Those are my markers.
I need to get the sum of the values in a specific column of the first data frame, based on the id values in the second data frame.
The first data frame may look like this:
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
the second one:
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
What I need to get:
id goals cards group points
1 2 2 1 2-(2+2)
2 3 2 1 0 (because it is in the second list)
3 4 2 1 4-(2+1+2)
4 5 1 1 5-(1+2)
5 1 2 1 0 (because it is in the second list)
1 2 2 2 2-(2+2)
2 3 2 2 0
3 4 2 2 0
4 5 1 3 5-(1+2)
5 1 2 3 0
Something like: ??
df1<- df1%>%
rowwise() %>%
mutate(points=
goals
-(sum( df1$cards[df1$id <= df2$id & df1$id>df1$id])))

df1 = read.table(text = "
id goals cards
1 2 2
2 3 2
3 4 2
4 5 1
5 1 2
", header=T)
df2 = read.table(text = "
id goals cards
2 3 2
5 1 2
", header=T)
library(dplyr)
# function that gets an id and returns the sum of cards based on df2
GetSumOfCards = function(x) {
ids = min(df2$id[df2$id >= x]) # for a given id of df1, find the smallest id in df2 that is greater than or equal to it
ifelse(x %in% df2$id, # if the given id exists in df2
0, # sum of cards is zero
sum(df1$cards[df1$id >= x & df1$id <= ids])) # otherwise get sum of cards in df1 from this id until the id obtained before
}
# update function to be vectorised
GetSumOfCards = Vectorize(GetSumOfCards)
df1 %>%
mutate(sum_cards = GetSumOfCards(id), # get sum of cards for each id using the function
points = goals - sum_cards) # get the points
# id goals cards sum_cards points
# 1 1 2 2 4 -2
# 2 2 3 2 0 3
# 3 3 4 2 5 -1
# 4 4 5 1 3 2
# 5 5 1 2 0 1
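One caveat with the function above: if an id in df1 lies beyond the last id in df2, min() is called on an empty vector and returns Inf with a warning. The following is a minimal base R sketch of the same idea with that case made explicit. The helper name sum_cards_until_marker and the choice to return NA when no marker follows are assumptions for illustration, not part of the original answer.

```r
df1 <- data.frame(id = 1:5,
                  goals = c(2, 3, 4, 5, 1),
                  cards = c(2, 2, 2, 1, 2))
df2 <- data.frame(id = c(2, 5))

# Hypothetical helper: sum of cards from id x up to the next marker id
sum_cards_until_marker <- function(x, ids, cards, markers) {
  if (x %in% markers) return(0)            # marker rows get 0 cards
  later <- markers[markers > x]            # markers that come after x
  if (length(later) == 0) return(NA_real_) # assumption: undefined if no marker follows
  sum(cards[ids >= x & ids <= min(later)])
}

# vapply guarantees one numeric value per id
df1$sum_cards <- vapply(df1$id, sum_cards_until_marker, numeric(1),
                        ids = df1$id, cards = df1$cards, markers = df2$id)
df1$points <- df1$goals - df1$sum_cards
df1
```

On the example data this gives the same sum_cards and points columns as the dplyr version above.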
Based on your updated question, applying a similar function to every row would make the process very slow. Instead, this solution groups the data so that the cards can be summed cumulatively over chunks of rows:
df1 = read.table(text = "
id goals cards group
1 2 2 1
2 3 2 1
3 4 2 1
4 5 1 1
5 1 2 1
1 2 2 2
2 3 2 2
3 4 2 2
4 5 1 3
5 1 2 3
", header=T)
df2 = read.table(text = "
id goals cards group
2 3 2 1
5 1 2 1
2 3 2 2
3 4 2 2
5 1 2 3
", header=T)
library(dplyr)
df1 %>%
arrange(group, desc(id)) %>% # order by group and id descending (this will help with counting the cards)
left_join(df2 %>% # join specific columns of df2 and add a flag to know that this row exists in df2
select(id, group) %>%
mutate(flag = 1), by=c("id","group")) %>%
mutate(flag = ifelse(is.na(flag), 0, flag), # replace NA with 0
flag2 = cumsum(flag)) %>% # this flag will create the groups we need to count cards
group_by(group, flag2) %>% # for each new group (we need both as the card counting will change when we have a row from df2, or if group changes)
mutate(sum_cards = ifelse(flag == 1, 0, cumsum(cards))) %>% # get cumulative sum of cards unless the flag = 1, where we need 0 cards
ungroup() %>% # forget the grouping
arrange(group, id) %>% # back to original order
mutate(points = goals - sum_cards) %>% # calculate points
select(-flag, -flag2) # remove flags
# # A tibble: 10 x 6
# id goals cards group sum_cards points
# <int> <int> <int> <int> <dbl> <dbl>
# 1 1 2 2 1 4 -2
# 2 2 3 2 1 0 3
# 3 3 4 2 1 5 -1
# 4 4 5 1 1 3 2
# 5 5 1 2 1 0 1
# 6 1 2 2 2 4 -2
# 7 2 3 2 2 0 3
# 8 3 4 2 2 0 4
# 9 4 5 1 3 3 2
# 10 5 1 2 3 0 1
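To see why sorting by descending id and taking cumsum(flag) produces the right chunks, here is a small base R sketch for group 1 only. The variable names are illustrative assumptions; the logic mirrors the flag2 step of the pipeline above.

```r
cards  <- c(2, 2, 2, 1, 2)   # group 1, ids 1..5
marker <- c(0, 1, 0, 0, 1)   # 1 where the id also appears in df2

# Walk from the last id to the first, as arrange(group, desc(id)) does
cards_rev  <- rev(cards)
marker_rev <- rev(marker)

# Every marker row starts a new chunk (this plays the role of flag2)
chunk <- cumsum(marker_rev)

# Running sum of cards inside each chunk, then 0 for the marker rows themselves
sum_cards_rev <- ave(cards_rev, chunk, FUN = cumsum)
sum_cards_rev[marker_rev == 1] <- 0

# Back to the original order
sum_cards <- rev(sum_cards_rev)
sum_cards
```

Because each chunk begins at a marker and runs backwards, the running sum at any non-marker row covers exactly the cards from that row up to and including the next marker.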

Related

Code values in new column based on whether values in another column are unique

Given the following data I would like to create a new column new_sequence based on the condition:
If an id is present only once, the new value should be 0. If an id is present several times, the new value should be numbered according to the values in sequence.
dat <- tibble(id = c(1,2,3,3,3,4,4),
sequence = c(1,1,1,2,3,1,2))
# A tibble: 7 x 2
id sequence
<dbl> <dbl>
1 1 1
2 2 1
3 3 1
4 3 2
5 3 3
6 4 1
7 4 2
So, for the example data I am looking to produce the following output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
I have tried the code below, which does not work because the first row of every id is coded as 0, including ids that occur more than once:
dat %>% mutate(new_sequence = ifelse(!duplicated(id), 0, sequence))

Use dplyr::add_count() rather than !duplicated():
library(dplyr)
dat %>%
add_count(id) %>%
mutate(new_sequence = ifelse(n == 1, 0, sequence)) %>%
select(!n)
Output:
# A tibble: 7 x 3
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
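For comparison, a base R sketch of why !duplicated() alone is not enough and how it could be adapted. This is an alternative under the same example data, not the method of the answer above: duplicated() only marks the second and later occurrences, so combining it with fromLast = TRUE marks every row of an id that occurs more than once.

```r
dat <- data.frame(id = c(1, 2, 3, 3, 3, 4, 4),
                  sequence = c(1, 1, 1, 2, 3, 1, 2))

# TRUE for the first occurrence of each id, repeated or not
first_seen <- !duplicated(dat$id)

# TRUE for every row whose id occurs more than once
repeated <- duplicated(dat$id) | duplicated(dat$id, fromLast = TRUE)

dat$new_sequence <- ifelse(repeated, dat$sequence, 0)
dat
```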

You can also try the following. After grouping by id check if the number of rows in the group n() is 1 or not. Use separate if and else instead of ifelse since the lengths are different within each group.
dat %>%
group_by(id) %>%
mutate(new_sequence = if(n() == 1) 0 else sequence)
Output
id sequence new_sequence
<dbl> <dbl> <dbl>
1 1 1 0
2 2 1 0
3 3 1 1
4 3 2 2
5 3 3 3
6 4 1 1
7 4 2 2
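The same per-group count can be sketched without dplyr by using ave() to attach the group size to every row. The name n_per_id is an assumption for illustration.

```r
dat <- data.frame(id = c(1, 2, 3, 3, 3, 4, 4),
                  sequence = c(1, 1, 1, 2, 3, 1, 2))

# Number of rows per id, repeated for each row of that id
n_per_id <- ave(dat$sequence, dat$id, FUN = length)

# n_per_id has one value per row, so the vectorised ifelse() is safe here
dat$new_sequence <- ifelse(n_per_id == 1, 0, dat$sequence)
dat
```

Unlike the grouped if/else version, the condition here is evaluated row by row on vectors of equal length, which is the situation ifelse() is designed for.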

Remove groups with only one individual in R without using dplyr package [duplicate]

Consider the following dataset. The data is grouped with either one or two people per group. However, an individual may have several entries.
group<-c(1,1,1,1,2,2,3,3,3,3,4,4)
individualID<-c(1,1,2,2,3,3,5,5,6,6,7,7)
X<-rbinom(12,1,0.5)
df1<-data.frame(group,individualID,X)
> df1
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 2 3 1
6 2 3 1
7 3 5 1
8 3 5 1
9 3 6 1
10 3 6 1
11 4 7 0
12 4 7 1
From the above Group 1 and group 3 have 2 individuals whereas group 2 and group 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
group individualID
1 1 2
2 2 1
3 3 2
4 4 1
How can I subset the data, without using the dplyr package, to keep only groups that have more than 1 individual, i.e. omit groups with 1 individual?
I should end up with only group 1 and group 3.

Or another option is with tidyverse - after grouping by 'group', filter the rows where the number of distinct (n_distinct) elements in 'individualID' is greater than 1
library(dplyr)
df1 %>%
group_by(group) %>%
filter(n_distinct(individualID) > 1) %>%
ungroup()
# A tibble: 8 × 3
group individualID X
<dbl> <dbl> <int>
1 1 1 0
2 1 1 0
3 1 2 1
4 1 2 1
5 3 5 0
6 3 5 0
7 3 6 1
8 3 6 0
Or with subset and ave from base R
subset(df1, ave(individualID, group, FUN = function(x) length(unique(x))) > 1)
group individualID X
1 1 1 0
2 1 1 0
3 1 2 1
4 1 2 1
7 3 5 0
8 3 5 0
9 3 6 1
10 3 6 0

There are more concise ways for sure, but here is the general idea.
# use your code to get the counts by group
df1_counts <- aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
# create a vector of groups where the count is > 1
keep_groups <- df1_counts$group[df1_counts$individualID > 1]
# filter the rows to only groups you want to keep
df1[df1$group %in% keep_groups,]
# group individualID X
# 1 1 1 0
# 2 1 1 0
# 3 1 2 1
# 4 1 2 0
# 7 3 5 1
# 8 3 5 1
# 9 3 6 0
# 10 3 6 1
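One of the more concise base R variants could look like the sketch below, which uses tapply() instead of aggregate(). The X column is left out because it is random in the question (rbinom() without a seed, which is also why the X values differ between the outputs shown in this thread).

```r
df1 <- data.frame(group = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
                  individualID = c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7))

# Named vector: number of distinct individuals per group
n_ind <- tapply(df1$individualID, df1$group, function(x) length(unique(x)))

# tapply() names its result by group, so compare as character
keep_groups <- names(n_ind)[n_ind > 1]
res <- df1[as.character(df1$group) %in% keep_groups, ]
res
```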

Remove groups with only one individual in R [duplicate]

Consider the following dataset. The data is grouped with either one or two people per group. However, an individual may have several entries.
df1<-data.frame(group,individualID,X)
> df1
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 2 3 1
6 2 3 1
7 3 5 1
8 3 5 1
9 3 6 1
10 3 6 1
11 4 7 0
12 4 7 1
From the above Group 1 and group 3 have 2 individuals whereas group 2 and group 4 have 1 individual each.
> aggregate(data = df1, individualID ~ group, function(x) length(unique(x)))
group individualID
1 1 2
2 2 1
3 3 2
4 4 1
How can I subset the data to keep only groups that have more than 1 individual, i.e. omit groups with 1 individual?
I should end up with only group 1 and group 3.

You could make a lookup table to identify the groups that have more than one unique individualID (similar to what you did with aggregate), then filter df1 based on that:
library(dplyr)
lookup <- df1 %>%
group_by(group) %>%
summarise(count = n_distinct(individualID)) %>%
filter(count > 1)
df1 %>% filter(group %in% unique(lookup$group))
group individualID X
1 1 1 0
2 1 1 1
3 1 2 1
4 1 2 1
5 3 5 1
6 3 5 1
7 3 6 1
8 3 6 1
Or, as @MrGumble suggests above, you could also merge df1 with lookup after creating it:
merge(df1, lookup)
group individualID X count
1 1 1 0 2
2 1 1 1 2
3 1 2 1 2
4 1 2 1 2
5 3 6 1 2
6 3 6 1 2
7 3 5 1 2
8 3 5 1 2
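Note that the merge() output above carries the extra count column and is not in the original row order. If that matters, one option is a filtering join. This is a sketch assuming dplyr is available and using the same lookup table; X is omitted because it is random in the question.

```r
library(dplyr)

df1 <- data.frame(group = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4),
                  individualID = c(1, 1, 2, 2, 3, 3, 5, 5, 6, 6, 7, 7))

lookup <- df1 %>%
  group_by(group) %>%
  summarise(count = n_distinct(individualID)) %>%
  filter(count > 1)

# semi_join() keeps the rows of df1 that have a match in lookup,
# without adding any columns from lookup
res <- semi_join(df1, lookup, by = "group")
res
```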

R Tidyverse - Randomize by ID

I have a df like this one:
id <- c(1,1,2,2,3,3,4,4,5,5)
v1 <- c(3,1,2,3,4,5,6,1,5,4)
pos <- c(1,2,1,2,1,2,1,2,1,2)
df <- data.frame(id,v1,pos)
How can I "randomize" the values of v1 while keeping the original order of id and the values of pos, so that I get a df with randomized values like this:
id v1 pos
1 1 1
1 3 2
2 2 1
2 3 2
3 5 1
3 4 2
4 6 1
4 1 2
5 5 1
5 4 2
Above is an example of the resulting df, with id and pos as originally created and v1 randomized.
Thx!

Is sample what you're looking for?
df %>%
group_by(id) %>%
mutate(v1 = sample(v1, size = length(v1)))
# A tibble: 10 x 3
# Groups: id [5]
id v1 pos
<dbl> <dbl> <dbl>
1 1 3 1
2 1 1 2
3 2 3 1
4 2 2 2
5 3 4 1
6 3 5 2
7 4 1 1
8 4 6 2
9 5 5 1
10 5 4 2
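Two practical points when shuffling within groups can be sketched in base R. First, the result changes on every run unless a seed is set. Second, sample(x) treats a single number x >= 1 as "sample from 1:x", so a group with only one row could get a different value; permuting the indices with sample.int() avoids that. The seed value and the helper name shuffle are assumptions for illustration.

```r
set.seed(42)  # assumption: any fixed seed, only to make the run reproducible

df <- data.frame(id  = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5),
                 v1  = c(3, 1, 2, 3, 4, 5, 6, 1, 5, 4),
                 pos = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2))

# Hypothetical helper: permute the positions, not the values,
# so a length-one group is returned unchanged
shuffle <- function(x) x[sample.int(length(x))]

# ave() applies shuffle() within each id and keeps the row order
df$v1_shuffled <- ave(df$v1, df$id, FUN = shuffle)
df
```

Each id keeps exactly its own set of v1 values; only their order within the id may change.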

Subsetting a data frame based on multiple criteria for deletion of rows

Consider the following data frame consisting of column names "id" and "x", where each id is repeated four times. Data is as follows:
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
The question is about how to subset the data frame by the following criteria:
(1) Keep all entries of an id if its values in column x do not contain 3, or contain 3 only as the last number.
(2) For an id with multiple 3s in column x, keep all the numbers up to the first 3 and delete the remaining 3s. The expected output would look like:
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
I am familiar with the 'filter' function in the dplyr package for subsetting data, but this particular situation confuses me because of the complexity of the above criteria. Any help on this would be greatly appreciated.

Here's one solution that uses / creates some new columns to help you filter on:
library(dplyr)
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
df %>%
group_by(id) %>% # for each id
mutate(num_threes = sum(x == 3), # count number of 3s
flag = ifelse(unique(num_threes) > 0, # if there is a 3
min(row_number()[x == 3]), # keep the row of the first 3
0)) %>% # otherwise put a 0
filter(num_threes == 0 | row_number() <= flag) %>% # keep ids with no 3s or up to first 3
ungroup() %>%
select(-num_threes, -flag) # remove helper columns
# # A tibble: 13 x 2
# id x
# <dbl> <dbl>
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 1
# 5 2 2
# 6 2 3
# 7 3 1
# 8 3 2
# 9 3 2
# 10 3 3
# 11 4 2
# 12 4 2
# 13 4 3
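The same rule can be sketched in base R by counting, for each row, how many 3s occur earlier within the same id and keeping only rows where that count is zero. This assumes the rule is "drop everything after the first 3 of an id"; the helper column names are illustrative.

```r
df <- data.frame(id = rep(1:4, each = 4),
                 x  = c(2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 3, 2, 2, 3, 3))

# 1 where x is 3, 0 otherwise
is_three <- as.integer(df$x == 3)

# Running count of 3s within each id, minus the current row,
# gives the number of 3s strictly before the row
threes_before <- ave(is_three, df$id, FUN = cumsum) - is_three

# Keep rows that have no 3 before them in their id
res <- df[threes_before == 0, ]
res
```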

this works for me:
data
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
commands
library(dplyr)
df <- mutate(df, before = lag(x))
df$condition1 <- 1
df$condition1[df$x == 3 & df$before == 3] <- 0
final_df <- df[df$condition1 == 1, 1:2]
result
id x
1 2
1 2
1 1
1 1
2 2
2 3
3 1
3 2
3 2
3 3
4 2
4 2
4 3
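One thing to watch with the lag() approach above is that it compares each row with the previous row regardless of id, so a 3 at the start of one id would be dropped if the previous id ended in 3. Below is a base R sketch that takes the previous value within each id instead; the function name drop_repeated_threes and the small edge data frame are assumptions for illustration.

```r
# Hypothetical helper: drop a 3 only when the previous x of the same id is also 3
drop_repeated_threes <- function(d) {
  prev_x <- ave(d$x, d$id, FUN = function(v) c(NA, head(v, -1)))
  d[!(d$x == 3 & !is.na(prev_x) & prev_x == 3), ]
}

df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
                 x  = c(2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 3, 2, 2, 3, 3))
final_df <- drop_repeated_threes(df)

# Edge case: id 1 ends in 3 and id 2 starts with 3; both 3s should stay
edge <- data.frame(id = c(1, 1, 2, 2), x = c(2, 3, 3, 1))
edge_res <- drop_repeated_threes(edge)
```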

One idea is to pick out the rows with x==3 and apply unique() to them. Then append the unique rows, which now hold a single 3 per id, to the rest of the data frame, and finally restore the original row order.
Here is a base R solution that implements the idea above:
res <- (r <- with(df,rbind(df[x!=3,],unique(df[x==3,]))))[order(as.numeric(rownames(r))),]
rownames(res) <- seq(nrow(res))
which gives
> res
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
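Because unique() compares whole rows, the approach above depends on the data frame having only the id and x columns. A sketch of a variant that restricts the comparison to those two columns and needs no re-ordering, using duplicated() on a column subset:

```r
df <- data.frame(id = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
                 x  = c(2, 2, 1, 1, 2, 3, 3, 3, 1, 2, 2, 3, 2, 2, 3, 3))

# TRUE for rows whose (id, x) pair has already appeared further up
seen_before <- duplicated(df[c("id", "x")])

# Drop only the 3s that repeat within an id; other repeated values stay
res <- df[!(df$x == 3 & seen_before), ]
rownames(res) <- NULL
res
```

Like the unique() version, this keeps the first 3 of each id and removes later 3s, wherever they occur within that id.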
DATA
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))