Extract rows with levels of a dataframe - r

I've got a dataframe such as this:
df = data.frame(col1=c(1,1,1,2,2,2,3,3,3),
col2=as.factor(c('a','b','b','a','a','a','b','a','b')))
Then I extract all the categories (levels) related to each column:
levels_df = expand.grid(unique(df$col1), unique(df$col2))
colnames(levels_df)=c('col1','col2')
My objective now is to perform for the rows belonging to each pair of levels a function. How can I do that?
sapply(levels, FUN, dataset=df)
Any other strategy to perform the same task is accepted. The function operation could be whatever you like, for example a counting function (how many rows belong to each pair of levels), in which case the output would have this aspect:
In conclusion I want to susbset rows from a dataframe using each pair of levels, so I can manipulate those rows to perform a function ( such as nrows() )

You can skip the levels part, and just use dplyr to group by col1 and col2, then count the rows. Finally, we use complete to add in any combinations that don't appear in our dataset:
library(tidyverse)
df %>%
group_by(col1, col2) %>% # group df by col1 and col2
summarise(n = n()) %>% # make a new column, n, which is the count
complete(col1, col2, fill=list(n=0)) # Fill in missing pairs with 0
The output matches what you expected:
# A tibble: 6 x 3
# Groups: col1 [3]
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2

I‘m not sure if this specific count example will help you, but here‘s what you could do in the tidyverse:
library(tidyverse)
df %>%
group_by(col1, col2) %>%
count() %>%
ungroup() %>%
complete(col1, col2, fill = list(n = 0))
which gives:
# A tibble: 6 x 3
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2

Related

New variable conditional on whether a df1 column value equals any value included in a specific df2 column

I am trying to make a new variable using mutate() . In df1, I have ranges of values in col1, col2, col3, and col4. I would like to create a new binary variable in df1 that is "1" IF any of the col1-4 values are found in a specific df2 column (let's say col10).
Thanks!
This is what I have tried so far, but I don't think it is returning a value of "1" for all matching value, only some of them.
df1 %>%
mutate(newvar = case_when(
col1 == df2$col10 | col2 == df2$col10 | col3 == df2$col10 | col4 == df2$col10 ~ 1
))
We could use if_any here. If the number of rows are the same, use == for elementwise comparison instead of %in%
library(dplyr)
df1 %>%
mutate(newvar = +(if_any(col1:col4, ~.x %in% df2$col10)))
First, let's make some dummy data. df1 has 4 columns and df2 has one column named col10. In the dummy data, rows 1,2,3 and 5 have matches in df2$col10.
library(dplyr)
df1 <- data.frame(col1 = 1:5, col2=3:7, col3=5:9, col4=10:14)
df2 <- data.frame(col10 = c(1,2,3,14))
We can use rowwise() to do computations within each row and then c_across() to identify that variables of interest. The code identifies whether any of the values in the four columns are in df2$col10 and returns a logical value. The as.numeric() turns that logical value into 0 (FALSE) and 1 (TRUE).
df1 %>%
rowwise() %>%
mutate(newvar = as.numeric(any(c_across(col1:col4) %in% df2$col10)))
#> # A tibble: 5 × 5
#> # Rowwise:
#> col1 col2 col3 col4 newvar
#> <int> <int> <int> <int> <dbl>
#> 1 1 3 5 10 1
#> 2 2 4 6 11 1
#> 3 3 5 7 12 1
#> 4 4 6 8 13 0
#> 5 5 7 9 14 1
Created on 2023-02-09 by the reprex package (v2.0.1)

Select specific/all columns in rowwise

I have the following table:
col1
col2
col3
col4
1
2
1
4
5
6
6
3
My goal is to find the max value per each row, and then find how many times it was repeated in the same row.
The resulting table should look like this:
col1
col2
col3
col4
max_val
repetition
1
2
1
4
4
1
5
6
6
3
6
2
Now to achieve this, I am doing the following for Max:
df%>% rowwise%>%
mutate(max=max(col1:col4))
However, I am struggling to find the repetition. My idea is to use this pseudo code in mutate:
sum( "select current row entirely or only for some columns"==max). But I don't know how to select entire row or only some columns of it and use its content to do the check, i.e.: is it equal to the max. How can we do this in dplyr?
A dplyr approach:
library(dplyr)
df %>%
rowwise() %>%
mutate(max_val = max(across(everything())),
repetition = sum(across(col1:col4) == max_val))
# A tibble: 2 × 6
# Rowwise:
col1 col2 col3 col4 max_val repetition
<int> <int> <int> <int> <int> <int>
1 1 2 1 4 4 1
2 5 6 6 3 6 2
An R base approach:
df$max_val <- apply(df,1,max)
df$repetition <- rowSums(df[, 1:4] == df[, 5])
For other (non-tidyverse) readers, a base R approach could be:
df$max_val <- apply(df, 1, max)
df$repetition <- apply(df, 1, function(x) sum(x[1:4] == x[5]))
Output:
# col1 col2 col3 col4 max_val repetition
# 1 1 2 1 4 4 1
# 2 5 6 6 3 6 2
Although dplyr has added many tools for working across rows of data, it remains, in my mind at least, much easier to adhere to tidy principles and always convert the data to "long" format for these kinds of operations.
Thus, here is a tidy approach:
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
group_by(row) %>%
mutate(max_val = max(value), repetitions = sum(value == max(value))) %>%
pivot_wider(id_cols = c(row, max_val, repetitions)) %>%
select(col1:col4, max_val, repetitions)
The last select() is just to get the columns in the order you want.

Deleting rows that are duplicated in one column based on value in another column

A similar question was asked here. However, I did not manage to adopt that solution to my particular problem, hence the separate question.
An example dataset:
id group
1 1 5
2 1 998
3 2 2
4 2 3
5 3 998
I would like to delete all rows that are duplicated in id and where group has value 998.
In this example, only row 2 should be deleted.
I tried something along those lines:
df1 <- df %>%
subset((unique(by = "id") | group != 998))
but got
Error in is.factor(x) : Argument "x" is missing, with no default
Thank you in advance
Here is an idea
library(dplyr)
df %>%
group_by(id) %>%
filter(!any(n() > 1 & group == 998))
# A tibble: 3 x 2
# Groups: id [2]
id group
<int> <int>
1 2 2
2 2 3
3 3 998
In case you want to remove only the 998 entry from the group then,
df %>%
group_by(id) %>%
filter(!(n() > 1 & group == 998))
One way could be:
library(dplyr)
df1 <- df %>%
filter(duplicated(id) & group=="998")
anti_join(df, df1)
Joining, by = c("id", "group")
id group
1 1 5
3 2 2
4 2 3
5 3 998

How to find duplicates based on values in 2 columns but also the groupings by another column in R?

I have a dataset with 3 columns: ID, value a, and value b. I want to group the dataset based on the values in the ID column and then identify duplicates that have identical data in the value a and b columns between the different groupings.
I know that I can use the dplyr package and data %>% group_by (ID) to group my dataset based on the ID column. I also know that I can use data[duplicated(data[,2:3]),] to return all rows with duplicate data in rows 2 (value a) and 3 (value b).
However, I would like a function that can only finds duplicates between different ID groups instead of just duplicates within the whole dataset. I've tried combining group_by and duplicated, but it doesn't return the correct results. Which function would do this?
It was a little unclear if you wanted to return:
only the distinct rows
single examples of duplicated rows
all duplicated rows
So here are some options:
library(dplyr)
library(readr)
"ID,a,b
1, 1, 1
1, 1, 1
1, 1, 2
2, 1, 1
2, 1, 2" %>%
read_csv() -> exp_dat
# return only distinct rows
exp_dat %>%
distinct(ID, a, b)
# # A tibble: 4 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 2
# 3 2 1 1
# 4 2 1 2
# return single examples of duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 1 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# return all duplicated rows
exp_dat %>%
group_by(ID, a, b) %>%
add_count() %>%
filter(n > 1) %>%
ungroup() %>%
select(-n)
# # A tibble: 2 x 3
# ID a b
# <dbl> <dbl> <dbl>
# 1 1 1 1
# 2 1 1 1

Joining data in R by first row, then second and so on

I have two data sets with one common variable - ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use left-join because the first or left file so to say needs to stay as it is (I don't want it to return all combinations and add rows). But I also don't want it to link data like vlookup in Excel which finds the first match and returns it so when I have duplicate ID numbers it only returns the first match. I need it to return the first match, then the second, then third (because the dates are sorted so that the newest date is always first for every ID number) and so on BUT I can't have added rows. Is there any way to do this? Since I don't know how else to show you I have included an example picture of what I need. data joining. Not sure if I made myself clear but thank you in advance!
You can add a second column to create subid's that follow the order of the rownumbers. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
df1 <- df1 %>%
group_by(ID) %>%
mutate(follow_id = row_number())
df2 <- df2 %>% group_by(ID) %>%
mutate(follow_id = row_number())
outcome <- df1 %>% inner_join(df2)
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an autoincrement id for each group, then join as usual
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05

Resources