Show rows that appear only once in R dataframe - r

I know that we can use unique() to effectively show a dataframe without duplicate values, but is there an elegant way to show only those rows that appear once in a dataframe?
E.g.,
a = c(10,20,10,10)
b = c(10,30,10,20)
ab = data.frame(a,b)
should return the second and final row only, and not the first and third (since this row exists more than once).
Thanks

We can use duplicated
subset(ab, !(duplicated(ab)|duplicated(ab, fromLast = TRUE)))
-output
a b
2 20 30
4 10 20

dplyr option:
library(dplyr)
ab %>%
group_by(across(everything())) %>%
filter(n() == 1)
Output:
# A tibble: 2 × 2
# Groups: a, b [2]
a b
<dbl> <dbl>
1 20 30
2 10 20

Related

count number of combinations by group

I am struggling to count the number of unique combinations in my data. I would like to first group them by the id and then count, how many times combination of each values occurs. here, it does not matter if the elements are combined in 'd-f or f-d, they still belongs in teh same category, as they have same element:
combinations:
n
c-f: 2 # aslo f-c
c-d-f: 1 # also cfd or fdc
d-f: 2 # also f-d or d-f. The dash is only for isualization purposes
Dummy example:
# my data
dd <- data.frame(id = c(1,1,2,2,2,3,3,4, 4, 5,5),
cat = c('c','f','c','d','f','c','f', 'd', 'f', 'f', 'd'))
> dd
id cat
1 1 c
2 1 f
3 2 c
4 2 d
5 2 f
6 3 c
7 3 f
8 4 d
9 4 f
10 5 f
11 5 d
Using paste is a great solution provided by #benson23, but it considers as unique category f-d and d-f. I wish, however, that the order will not matter. Thank you!
Create a "combination" column in summarise, we can count this column afterwards.
An easy way to count the category is to order them at the beginning, then in this case they will all be in the same order.
library(dplyr)
dd %>%
group_by(id) %>%
arrange(id, cat) %>%
summarize(combination = paste0(cat, collapse = "-"), .groups = "drop") %>%
count(combination)
# A tibble: 3 x 2
combination n
<chr> <int>
1 c-d-f 1
2 c-f 2
3 d-f 2

Joining data in R by first row, then second and so on

I have two data sets with one common variable - ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use left-join because the first or left file so to say needs to stay as it is (I don't want it to return all combinations and add rows). But I also don't want it to link data like vlookup in Excel which finds the first match and returns it so when I have duplicate ID numbers it only returns the first match. I need it to return the first match, then the second, then third (because the dates are sorted so that the newest date is always first for every ID number) and so on BUT I can't have added rows. Is there any way to do this? Since I don't know how else to show you I have included an example picture of what I need. data joining. Not sure if I made myself clear but thank you in advance!
You can add a second column to create subid's that follow the order of the rownumbers. Then you can use an inner_join to join everything together.
Since you don't have example data sets I created two to show the principle.
df1 <- df1 %>%
group_by(ID) %>%
mutate(follow_id = row_number())
df2 <- df2 %>% group_by(ID) %>%
mutate(follow_id = row_number())
outcome <- df1 %>% inner_join(df2)
# A tibble: 7 x 3
# Groups: ID [?]
ID sub_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an autoincrement id for each group, then join as usual
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05

dplyr distinct over two columns

I have a table where the first two rows are sample identifiers and the third a measure of distance eg:
df<-data.table(H1=c(1,2,3,4,5),H2=c(7,3,2,8,9), D=c(100,4,55,66,35))
I want to find only the unique pairs across both columns, ie 1-7,2-3,4-8,5-9. Removing the duplicate 2-3 and 3-2 pairings which appears in different columns but keeping the third row (which being a distance is identical for 2-3 and 3-2).
# example data
df<-data.frame(H1=c(1,2,3,4,5),
H2=c(7,3,2,8,9),
D=c(100,4,55,66,35), stringsAsFactors = F)
library(dplyr)
df %>%
rowwise() %>% # for each row
mutate(HH = paste0(sort(c(H1,H2)), collapse = ",")) %>% # create a new variable that orders and combines H1 and H2
group_by(HH) %>% # group by that variable
filter(D == max(D)) %>% # keep the row where D is the maximum (assumed logic*)
ungroup() %>% # forget the grouping
select(-HH) # remove unnecessary variable
# # A tibble: 4 x 3
# H1 H2 D
# <dbl> <dbl> <dbl>
# 1 1 7 100
# 2 3 2 55
# 3 4 8 66
# 4 5 9 35
*Note: No idea what your logic is to keep 1 row from the duplicates. I had to use something as an example and here I'm keeping the row with the highest D value. This logic can change if needed.

Make multiple random number of copies of rows in a dataframe

I have a dataframe in r with 100 rows of unique first and last name and address. I also have columns for weather 1 and weather 2. I want to make a random number of copies between 50 and 100 for each row. How would I do that?
df$fname df$lname df$street df$town df%state df$weather1 df$weather2
Using iris and baseR:
#example data
iris2 <- iris[1:100, ]
#replicate rows at random
iris2[rep(1:100, times = sample(50:100, 100, replace = TRUE)), ]
Each row of iris2 will be replicated between 50-100 times at random
This is probably not the easiest way to do this, but...
What I've done here is for each for of the data set select just that row and make 1-3 (sub 50-100) copies of that row, and finally stack all the results together.
library(dplyr)
library(purrr)
df <- tibble(foo = 1:3, bar = letters[1:3])
map_dfr(seq_len(nrow(df)), ~{
df %>%
slice(.x) %>%
sample_n(size = sample(1:3, 1), replace = TRUE)
})
#> # A tibble: 7 x 2
#> foo bar
#> <int> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 a
#> 4 2 b
#> 5 2 b
#> 6 3 c
#> 7 3 c

group by in R dplyr for more than one variable on unique value of other variable

I have a dataset with three columns as below:
data <- data.frame(
grpA = c(1,1,1,1,1,2,2,2),
idB = c(1,1,2,2,3,4,5,6),
valueC = c(10,10,20,20,10,30,40,50),
otherD = c(1,2,3,4,5,6,7,8)
)
valueC is unique to each unique value of idB.
I want to use dplyr pipe (as the rest of my code is in dplyr) and use group_by on grpA to get a new column with sum of valueC values for each group.
The answer should be like:
newCol <- c(40,40,40,40,40,120,120,120)
but with data %>% group_by(grpA) %>%
mutate(newCol=sum(valueC), I get newCol <- c(70,70,70,70,70,120,120,120)
How do I include unique value of idB? Is there anything else I can use instead of group_by in dplyr %>% pipe.
I cant use summarise as I need to keep values in otherD intact for later use.
Other option I have is to create newCol separately through sql and then merge with left join. But I am looking for a better solution inline.
If it has been answered before, please refer me to the link as I could not find any relevant answer to this issue.
We need unique with match
data %>%
group_by(grpA) %>%
mutate(ind = sum(valueC[match(unique(idB), idB)]))
# A tibble: 8 x 5
# Groups: grpA [2]
# grpA idB valueC otherD ind
# <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 1 10 1 40
#2 1 1 10 2 40
#3 1 2 20 3 40
#4 1 2 20 4 40
#5 1 3 10 5 40
#6 2 4 30 6 120
#7 2 5 40 7 120
#8 2 6 50 8 120
Or another option is to get the distinct rows by 'grpA', 'idB', grouped by 'grpA', get the sum of 'valueC' and left_join with the original data
data %>%
distinct(grpA, idB, .keep_all = TRUE) %>%
group_by(grpA) %>%
summarise(newCol = sum(valueC)) %>%
left_join(data, ., by = 'grpA')

Resources