Separate data frame depending on one column duplicates - r

I have a large data frame with many rows and columns. One column holds character values; some of them occur only once, others several times. I would like to split the whole data frame into two: one with all the rows whose value in this column is repeated, and another with all the rows whose value occurs only once. For example:
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
> df
One Two Three
1 1 4 a
2 2 5 b
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
7 7 1 f
8 8 8 e
9 9 1 g
10 10 9 c
I wish to have two data frames like
> dfSingle
One Two Three
1 1 4 a
2 2 5 b
7 7 1 f
9 9 1 g
> dfMultiple
One Two Three
3 3 3 c
4 4 6 d
5 5 2 d
6 6 7 e
8 8 8 e
10 10 9 c
I tried with the duplicated() function
dfSingle = subset(df, !duplicated(df$Three))
dfMultiple = subset(df, duplicated(df$Three))
but it does not work as the first of the "c", "d" and "e" go to the "dfSingle".
I also tried to do a for-loop
MulipleValues = unique(df$Three[c(which(duplicated(df$Three)))])
dfSingle = data.frame()
x = 1
dfMultiple = data.frame()
y = 1
for (i in 1:length(df$One)) {
if(df$Three[i] %in% MulipleValues){
dfMultiple[x,] = df[i,]
x = x+1
} else {
dfSingle[y,] = df[i,]
y = y+1
}
}
It seems to do the right thing, as the data frames now have the right number of rows, but they somehow have 0 columns.
> dfSingle
data frame with 0 columns and 4 rows
> dfMultiple
data frame with 0 columns and 6 rows
What am I doing wrong? Or is there another way to do this?
Thanks for your help!

In base R, we can use split with duplicated, which returns a list of two data frames.
df1 <- split(df, duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE))
df1
#$`FALSE`
# One Two Three
#1 1 4 a
#2 2 5 b
#7 7 1 f
#9 9 1 g
#$`TRUE`
# One Two Three
#3 3 3 c
#4 4 6 d
#5 5 2 d
#6 6 7 e
#8 8 8 e
#10 10 9 c
where df1[[1]] can be considered as dfSingle and df1[[2]] as dfMultiple.
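If two named data frames are wanted rather than a list, the same logical index can be used for direct subsetting. This is a minimal sketch under that assumption; the helper name is_repeated is illustrative and not part of the original answer.

```r
# Rebuild the example data (stringsAsFactors = FALSE keeps Three as character)
df <- data.frame(
  One = 1:10,
  Two = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c"),
  stringsAsFactors = FALSE
)

# TRUE for every row whose value in Three occurs more than once:
# duplicated() flags later occurrences, fromLast = TRUE flags earlier ones
is_repeated <- duplicated(df$Three) | duplicated(df$Three, fromLast = TRUE)

dfSingle <- df[!is_repeated, ]
dfMultiple <- df[is_repeated, ]
```

The original row names are kept, so dfSingle still shows rows 1, 2, 7 and 9.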

Here is a dplyr one for fun,
library(dplyr)
df %>%
group_by(Three) %>%
mutate(new = n() > 1) %>%
split(.$new)
which gives,
$`FALSE`
# A tibble: 4 x 4
# Groups: Three [4]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
$`TRUE`
# A tibble: 6 x 4
# Groups: Three [3]
One Two Three new
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE

A way with dplyr:
library(dplyr)
df %>%
group_split(Duplicated = (add_count(., Three) %>% pull(n)) > 1)
Output:
[[1]]
# A tibble: 4 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 1 4 a FALSE
2 2 5 b FALSE
3 7 1 f FALSE
4 9 1 g FALSE
[[2]]
# A tibble: 6 x 4
One Two Three Duplicated
<dbl> <dbl> <fct> <lgl>
1 3 3 c TRUE
2 4 6 d TRUE
3 5 2 d TRUE
4 6 7 e TRUE
5 8 8 e TRUE
6 10 9 c TRUE

You can do it using base R
One = c(1,2,3,4,5,6,7,8,9,10)
Two = c(4,5,3,6,2,7,1,8,1,9)
Three = c("a", "b", "c", "d","d","e","f","e","g","c")
df <- data.frame(One, Two, Three)
str(df)
df$Three <- as.character(df$Three)
df$count <- as.numeric(ave(df$Three,df$Three,FUN = length))
dfSingle = subset(df,df$count == 1)
dfMultiple = subset(df,df$count > 1)
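As for why the loop in the question ends up with 0 columns: a likely explanation is that data.frame() creates a frame without any columns, so an assignment like dfSingle[y,] = df[i,] has no columns to fill and only the row count grows. A minimal sketch of the same loop, assuming the result frames are seeded with the zero-row template df[0, ] instead (the name multiple_values is illustrative):

```r
df <- data.frame(
  One = 1:10,
  Two = c(4, 5, 3, 6, 2, 7, 1, 8, 1, 9),
  Three = c("a", "b", "c", "d", "d", "e", "f", "e", "g", "c"),
  stringsAsFactors = FALSE
)

# Values of Three that occur more than once
multiple_values <- unique(df$Three[duplicated(df$Three)])

# df[0, ] has zero rows but keeps all column names and types
dfSingle <- df[0, ]
dfMultiple <- df[0, ]

for (i in seq_len(nrow(df))) {
  if (df$Three[i] %in% multiple_values) {
    dfMultiple[nrow(dfMultiple) + 1, ] <- df[i, ]
  } else {
    dfSingle[nrow(dfSingle) + 1, ] <- df[i, ]
  }
}
```

Growing a data frame row by row is slow on large data, so the vectorised answers above are usually preferable; the loop is shown only to illustrate the fix.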

Related

Concatenate values in two data frames in R

Given two dataframes with the same column names:
a <- data.frame(x=1:4,y=5:8)
b <- data.frame(x=LETTERS[1:4],y=LETTERS[5:8])
>a
x y
1 5
2 6
3 7
4 8
>b
x y
A E
B F
C G
D H
How can the columns that share a name be concatenated?
Desired output:
cat_x cat_y
1 A 5 E
2 B 6 F
3 C 7 G
4 D 8 H
Tried so far, merging columns one at a time:
a$cat_x <- paste(a$x,b$x)
a$cat_y <- paste(a$y,b$y)
This approach works, but the real data has 40 columns (and will include multiple more dataframes). Looking for a more efficient method for larger dataframes.
We may use Map to loop over the matching columns of both data frames:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b,
MoreArgs = list(sep = "_")))
-output
cat_x cat_y
1 1_A 5_E
2 2_B 6_F
3 3_C 7_G
4 4_D 8_H
sep is used above in case we want a custom delimiter. Without it, paste separates the values with a space by default:
data.frame(Map(paste, setNames(a, paste0("cat_", names(a))), b ))
cat_x cat_y
1 1 A 5 E
2 2 B 6 F
3 3 C 7 G
4 4 D 8 H
Another possible solution, using purrr::map2_dfc:
library(tidyverse)
map2_dfc(a,b, ~ str_c(.x, .y, sep = " ")) %>%
rename_with(~ str_c("cat", .x, sep = "_"))
#> # A tibble: 4 × 2
#> cat_x cat_y
#> <chr> <chr>
#> 1 1 A 5 E
#> 2 2 B 6 F
#> 3 3 C 7 G
#> 4 4 D 8 H
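Since the question mentions that more data frames will follow, the Map approach can be extended to any number of them by passing a list through do.call. This is a sketch under the assumption that all data frames have the same columns in the same order; the third data frame and the names third, dfs and res are made up for illustration.

```r
a <- data.frame(x = 1:4, y = 5:8)
b <- data.frame(x = LETTERS[1:4], y = LETTERS[5:8])
third <- data.frame(x = letters[1:4], y = letters[5:8])  # hypothetical extra input

dfs <- list(a, b, third)

# Map() walks over the columns of all data frames in parallel and
# paste() joins the values position by position
res <- data.frame(do.call(Map, c(f = paste, dfs)), stringsAsFactors = FALSE)
names(res) <- paste0("cat_", names(a))
res
```

Adding another data frame then only means appending it to dfs.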

How to remove all rows from dataframe if count of similar `person_id` values is not `== 2`

I need to remove all rows from a dataframe where the number of rows sharing the same person_id is not == 2. For example:
a1 <- data.frame(person_id = 1:5, b=letters[1:5])
a2 <- data.frame(person_id = 2:6, b=letters[6:10])
data = rbind(a1, a2)
person_id b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 2 f
7 3 g
8 4 h
9 5 i
10 6 j
Rows 1 and 10 must be removed, because person_id==1 and person_id==6 each have only 1 record, whereas person_id==2, for example, has 2 rows.
How can I get a new dataset containing only the rows whose person_id occurs exactly 2 times (and, in the future, 3 or 4 times)?
Base R solution:
subset(
data,
ave(person_id, person_id, FUN = length) == 2
)
To remove the rows where count of person_id isn't equal to 2:
library(dplyr)
data %>%
group_by(person_id) %>%
filter(n() == 2)
person_id b
<int> <chr>
1 2 b
2 3 c
3 4 d
4 5 e
5 2 f
6 3 g
7 4 h
8 5 i
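Because the required count may later change to 3 or 4, the base R idea can be wrapped in a small function that takes the count as an argument. A minimal sketch; the function name keep_ids_with_count and its parameters are hypothetical.

```r
a1 <- data.frame(person_id = 1:5, b = letters[1:5])
a2 <- data.frame(person_id = 2:6, b = letters[6:10])
data <- rbind(a1, a2)

# Keep only the rows whose id occurs exactly k times
keep_ids_with_count <- function(data, id_col, k) {
  # ave() returns, for every row, the size of the group it belongs to
  counts <- ave(seq_len(nrow(data)), data[[id_col]], FUN = length)
  data[counts == k, , drop = FALSE]
}

twice <- keep_ids_with_count(data, "person_id", 2)
once <- keep_ids_with_count(data, "person_id", 1)
```

Here twice holds the eight rows for person_id 2 to 5, and once holds the two rows that the question wants removed.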

Using dplyr and mutate to count the number of columns that meet a criterion

Minimal example: A small dataframe with 4 columns and a variable that holds the name of a new column I want to create. The new column is TRUE if responses to more than a certain number of questions exceed a threshold, and is FALSE otherwise
df1 <- data.frame(ID = LETTERS[1:5],
                  Q1 = sample(0:10, 5, replace = TRUE),
                  Q2 = sample(0:10, 5, replace = TRUE),
                  Q3 = sample(0:10, 5, replace = TRUE),
                  Q4 = sample(0:10, 5, replace = TRUE)
)
This gives me my dataframe with responses to the various questions:
> df1
ID Q1 Q2 Q3 Q4
1 A 2 4 5 0
2 B 9 6 6 3
3 C 5 5 3 2
4 D 0 5 3 10
5 E 7 5 6 7
I also define the following constants:
QUESTIONS <- c("Q1", "Q2", "Q3", "Q4")
MY_NEW_COL <- "New_Col"
THESHOLD1 <- 5
THESHOLD2 <- 2
I want to add a new column named New_Col that is TRUE if more than THESHOLD2 columns have a value in excess of THESHOLD1. I can get this to work in a clumsy but obvious way:
df1 %>%
  mutate(!!MY_NEW_COL := ( (Q1 > THESHOLD1) + (Q2 > THESHOLD1) +
                           (Q3 > THESHOLD1) + (Q4 > THESHOLD1) ) > THESHOLD2)
This gives the right answer:
ID Q1 Q2 Q3 Q4 New_Col
1 A 2 4 5 0 FALSE
2 B 9 6 6 3 TRUE
3 C 5 5 3 2 FALSE
4 D 0 5 3 10 FALSE
5 E 7 5 6 7 TRUE
But I would like to systematize this, as there are 17 questions in all. My code, shown below, gives the wrong answer:
df1 %>%
  mutate(!!MY_NEW_COL := (sum(all_of(QUESTIONS) > THESHOLD1)) > THESHOLD2)
ID Q1 Q2 Q3 Q4 New_Col
1 A 2 4 5 0 TRUE
2 B 9 6 6 3 TRUE
3 C 5 5 3 2 TRUE
4 D 0 5 3 10 TRUE
5 E 7 5 6 7 TRUE
What am I doing wrong, and how can I fix this?
Many thanks in advance
Thomas Philips
As you didn't provide a seed, it is not possible to reproduce your results exactly. The solution to your problem is to use across() and rowSums():
df1 %>%
  mutate(!!MY_NEW_COL := rowSums(across(all_of(QUESTIONS)) > THESHOLD1) > THESHOLD2)
It gives the output,
ID Q1 Q2 Q3 Q4 New_Col
1 A 7 9 1 1 FALSE
2 B 3 9 9 7 TRUE
3 C 4 0 6 6 FALSE
4 D 5 1 6 10 FALSE
5 E 6 5 5 1 FALSE
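To check the row-wise counting against known values, the table printed in the question can be typed in by hand instead of sampled. The sketch below uses the base R equivalent of the same idea (a logical matrix passed to rowSums) rather than dplyr, and reuses the constant names exactly as the question defines them.

```r
# Data copied from the table shown in the question
df1 <- data.frame(
  ID = LETTERS[1:5],
  Q1 = c(2, 9, 5, 0, 7),
  Q2 = c(4, 6, 5, 5, 5),
  Q3 = c(5, 6, 3, 3, 6),
  Q4 = c(0, 3, 2, 10, 7)
)

QUESTIONS <- c("Q1", "Q2", "Q3", "Q4")
MY_NEW_COL <- "New_Col"
THESHOLD1 <- 5  # spelling follows the question's definitions
THESHOLD2 <- 2

# df1[QUESTIONS] > THESHOLD1 is a logical matrix; rowSums() counts the
# TRUE values per row, i.e. how many questions exceed the threshold
df1[[MY_NEW_COL]] <- unname(rowSums(df1[QUESTIONS] > THESHOLD1) > THESHOLD2)
df1
```

Only rows B and E have more than two answers above 5, which matches the result of the manual version in the question.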
We can also do
library(dplyr)
library(purrr)
library(magrittr)
df1 %>%
mutate(!! MY_NEW_COL := map(select(cur_data(), starts_with("Q")),
~ .x > THESHOLD1) %>%
reduce(`+`) %>%
is_greater_than(THESHOLD2) )
I am not sure the following output is what you have in mind, but I first check whether all Qs are greater than threshold 1 and, if so, whether their sum is greater than threshold 2:
library(dplyr)
f1 <- function(x, threshold1 = 2, threshold2 = 5) {
  x %>%
    group_by(ID) %>%
    mutate(threshold_1 = if_all(Q1:Q4, ~ .x > threshold1),
           # c_across() collects Q1..Q4 for the single row in each ID group
           sum_Qs = sum(c_across(Q1:Q4)),
           threshold_2 = sum_Qs > threshold2 & threshold_1)
}
f1(df1, 2, 5)
# A tibble: 5 x 8
# Groups: ID [5]
ID Q1 Q2 Q3 Q4 threshold_1 sum_Qs threshold_2
<chr> <int> <int> <int> <int> <lgl> <int> <lgl>
1 A 8 0 1 10 FALSE 19 FALSE
2 B 2 3 2 8 FALSE 15 FALSE
3 C 1 8 4 3 FALSE 16 FALSE
4 D 9 3 3 9 TRUE 24 TRUE
5 E 1 3 0 1 FALSE 5 FALSE

Checking the presence of values in multiple datasets

I have a number of tables and all the "a" columns of the tables must have identical values for the analysis I am conducting. The actual tables are very big so I will use simplified (mock) data frames.
Let's say I have the following data:
A <- data.frame(a = c(3,4,5,6,7,8), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
B <- data.frame(a = c(2,3,4,5,6,7), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
C <- data.frame(a = c(1,2,3,4,5,6), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
D <- data.frame(a = c(4,5,6,7,8,9), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
Now, the values in column "a" differ between the data frames. My goal is to delete every row whose "a" value does not appear in all the other tables.
In order to have identical values in column "a" for all tables A, B and C, I could use the following operations:
A <- A[A$a %in% B$a,]
B <- B[B$a %in% A$a,]
C <- C[C$a %in% B$a,]
B <- B[B$a %in% C$a,]
A <- A[A$a %in% C$a,]
This is already getting very tedious, as you can see. What if I add table D or further data frames to the mix? It becomes almost impossible to proceed, as each table contains at least one unique value.
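One dplyr option could be:
library(dplyr)
bind_rows(list(A, B, C, D), .id = "ID") %>%
  # count the source tables directly; max(ID) compares strings and breaks with 10+ tables
  mutate(n_datasets = n_distinct(ID)) %>%
  group_by(a) %>%
  filter(n_distinct(ID) == n_datasets)
ID a b c n_datasets
<chr> <dbl> <dbl> <dbl> <int>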
1 1 4 5 6 4
2 1 5 6 7 4
3 1 6 7 8 4
4 2 4 6 7 4
5 2 5 7 8 4
6 2 6 8 9 4
7 3 4 7 8 4
8 3 5 8 9 4
9 3 6 9 10 4
10 4 4 4 5 4
11 4 5 5 6 4
12 4 6 6 7 4
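If the tables should stay separate instead of being stacked, a base R alternative is to compute the values of "a" shared by all tables once and then filter each table against that set. A minimal sketch; the names tables, common_a and filtered are illustrative.

```r
A <- data.frame(a = c(3,4,5,6,7,8), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
B <- data.frame(a = c(2,3,4,5,6,7), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
C <- data.frame(a = c(1,2,3,4,5,6), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))
D <- data.frame(a = c(4,5,6,7,8,9), b = c(4,5,6,7,8,9), c = c(5,6,7,8,9,10))

tables <- list(A = A, B = B, C = C, D = D)

# Reduce() applies intersect() pairwise, leaving the values present in every table
common_a <- Reduce(intersect, lapply(tables, function(d) d$a))

# Keep only the rows whose "a" value is shared by all tables
filtered <- lapply(tables, function(d) d[d$a %in% common_a, ])
```

Each element of filtered (for example filtered$A) then contains only the rows with a equal to 4, 5 or 6, and further tables can be handled by adding them to the list.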

bind_rows to each group of tibble

Consider the following two tibbles:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
So a and b have the same columns and b has an additional column called id.
I want to do the following: group b by id and then add tibble a on top of each group.
So the output should look like this:
# A tibble: 10 x 3
id time value
<chr> <int> <int>
1 a -1 100
2 a 0 200
3 a 1 1
4 a 2 2
5 a 3 3
6 b -1 100
7 b 0 200
8 b 1 4
9 b 2 5
10 b 3 6
Of course there are multiple workarounds to achieve this (like loops for example). But in my case I have a large number of IDs and a very large number of columns.
I would be thankful if anyone could point me towards the direction of a solution within the tidyverse.
Thank you
We can expand the data frame a with id from b and then bind_rows them together.
library(tidyverse)
a2 <- expand(a, id = b$id, nesting(time, value))
b2 <- bind_rows(a2, b) %>% arrange(id, time)
b2
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
split from base R will divide a data frame into a list of subsets based on an index.
b %>%
split(b[["id"]]) %>%
lapply(bind_rows, a) %>%
lapply(select, -"id") %>%
bind_rows(.id = "id")
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 2 2
# 3 a 3 3
# 4 a -1 100
# 5 a 0 200
# 6 b 1 4
# 7 b 2 5
# 8 b 3 6
# 9 b -1 100
# 10 b 0 200
An idea (via base R) is to split your data frame and create a new one with id + the other data frame and rbind, i.e.
df = do.call(rbind, lapply(split(b, b$id), function(i)rbind(data.frame(id = i$id[1], a), i)))
which gives
id time value
a.1 a -1 100
a.2 a 0 200
a.3 a 1 1
a.4 a 2 2
a.5 a 3 3
b.1 b -1 100
b.2 b 0 200
b.3 b 1 4
b.4 b 2 5
b.5 b 3 6
NOTE: You can remove the rownames by simply calling rownames(df) <- NULL
We can nest and add the relevant rows to each nested item :
library(tidyverse)
b %>%
  nest(data = -id) %>%
  mutate(data = map(data, ~ bind_rows(a, .x))) %>%
  unnest(data)
# # A tibble: 10 x 3
# id time value
# <chr> <dbl> <dbl>
# 1 a -1 100
# 2 a 0 200
# 3 a 1 1
# 4 a 2 2
# 5 a 3 3
# 6 b -1 100
# 7 b 0 200
# 8 b 1 4
# 9 b 2 5
# 10 b 3 6
Maybe not the most efficient way, but easy to follow:
library(tidyverse)
a <- tibble(time = c(-1, 0), value = c(100, 200))
b <- tibble(id = rep(letters[1:2], each = 3), time = rep(1:3, 2), value = 1:6)
# nrow(a), not length(a): length() of a tibble is its number of columns
a.a <- a %>% add_column(id = rep("a", nrow(a)))
a.b <- a %>% add_column(id = rep("b", nrow(a)))
joint <- bind_rows(b, a.a, a.b)
(joint <- arrange(joint, id, time))
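The manual approach above needs one line per id. For a large number of ids, the same effect can be sketched in base R with a Cartesian product: merge with by = NULL pairs every id with every row of a. The inputs are plain data frames here to keep the example free of dependencies, and the names a_per_id and out are illustrative.

```r
a <- data.frame(time = c(-1, 0), value = c(100, 200))
b <- data.frame(id = rep(letters[1:2], each = 3),
                time = rep(1:3, 2),
                value = 1:6,
                stringsAsFactors = FALSE)

# by = NULL makes merge() return the Cartesian product:
# every distinct id combined with every row of a
a_per_id <- merge(data.frame(id = unique(b$id), stringsAsFactors = FALSE),
                  a, by = NULL)

# Stack the repeated rows of a on b, then sort within each id by time
out <- rbind(a_per_id, b)
out <- out[order(out$id, out$time), ]
rownames(out) <- NULL
out
```

Sorting by time puts the rows of a first only because their times are smaller than those in b; with other data an explicit ordering column would be needed.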
