Collapsing together duplicate rows with synonyms - r

If my data looks something like this:
species1 species2 info1 info2
Loro Parrot 3 NA
NA Parrot NA 7
Osprey NA NA 89
Sparrow Finch NA 19
Sparrow NA 27 NA
Mallard Duck 69 16
Mallard NA NA NA
NA Swift 25 NA
And I want to merge it together like this:
species1 species2 info1 info2
Loro Parrot 3 7
Osprey NA NA 89
Sparrow Finch 27 19
Mallard Duck 69 16
NA Swift 25 NA
How could I do it, taking into account that I need to keep the NA records?
Thank you very much! :)

We could use a similar approach to the previous post, but in a different way: first create a named vector from the 'species' columns, use it to replace the values in the 'species1' column, coalesce the result with 'species2' to form a grouping key, and then summarise.
library(dplyr)
library(tibble)
nm1 <- df1 %>%
  select(species1, species2) %>%
  na.omit %>%
  deframe
df1 %>%
  group_by(species = coalesce(nm1[species1], species2)) %>%
  summarise(across(everything(), ~ .[complete.cases(.)][1])) %>%
  select(-species)
# A tibble: 5 x 4
species1 species2 info1 info2
<chr> <chr> <int> <int>
1 Mallard Duck 69 16
2 Sparrow Finch 27 19
3 Loro Parrot 3 7
4 <NA> Swift 25 NA
5 Osprey <NA> NA 89
data
df1 <- structure(list(species1 = c("Loro", NA, "Osprey", "Sparrow",
"Sparrow", "Mallard", "Mallard", NA), species2 = c("Parrot",
"Parrot", NA, "Finch", NA, "Duck", NA, "Swift"), info1 = c(3L,
NA, NA, NA, 27L, 69L, NA, 25L), info2 = c(NA, 7L, 89L, 19L, NA,
16L, NA, NA)), class = "data.frame", row.names = c(NA, -8L))
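To see what the lookup step is doing, here is a minimal base R sketch of the same idea. It uses setNames and ifelse as stand-ins for deframe and coalesce, and the names pairs, translated and key are only illustrative.

```r
df1 <- data.frame(
  species1 = c("Loro", NA, "Osprey", "Sparrow", "Sparrow", "Mallard", "Mallard", NA),
  species2 = c("Parrot", "Parrot", NA, "Finch", NA, "Duck", NA, "Swift"),
  info1 = c(3L, NA, NA, NA, 27L, 69L, NA, 25L),
  info2 = c(NA, 7L, 89L, 19L, NA, 16L, NA, NA)
)

# Lookup from species1 to species2, built from rows where both names are known
pairs <- na.omit(unique(df1[c("species1", "species2")]))
nm1 <- setNames(pairs$species2, pairs$species1)

# Translate species1 where a synonym is known; unknown or missing names give NA
translated <- unname(nm1[df1$species1])

# Grouping key: the translated name if available, otherwise species2
key <- ifelse(is.na(translated), df1$species2, translated)
key
# "Parrot" "Parrot" NA "Finch" "Finch" "Duck" "Duck" "Swift"
```

The Osprey row ends up with an NA key because it has no synonym at all. group_by treats NA as a group of its own, so several such rows would be collapsed together; in this data there is only one.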

You may group by one species column and fill the NA values in the other to recover the pairs, then group by both species columns and sum the values. Note that sum with na.rm = TRUE returns 0, not NA, for a group where every value is missing.
library(dplyr)
library(tidyr)
df1 %>%
  group_by(species2) %>%
  fill(species1) %>%
  group_by(species1) %>%
  fill(species2) %>%
  group_by(species2, .add = TRUE) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
  ungroup
# species1 species2 info1 info2
# <chr> <chr> <int> <int>
#1 Loro Parrot 3 7
#2 Mallard Duck 69 16
#3 Osprey NA 0 89
#4 Sparrow Finch 27 19
#5 NA Swift 25 0
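Since the question asks to keep the NA records, here is a hedged variant of the fill approach that returns NA instead of 0 when a group has no values at all. The helper sum_or_na is hypothetical (not part of dplyr or tidyr), and .direction = "downup" is my own addition so that the fill does not depend on row order.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(
  species1 = c("Loro", NA, "Osprey", "Sparrow", "Sparrow", "Mallard", "Mallard", NA),
  species2 = c("Parrot", "Parrot", NA, "Finch", NA, "Duck", NA, "Swift"),
  info1 = c(3L, NA, NA, NA, 27L, 69L, NA, 25L),
  info2 = c(NA, 7L, 89L, 19L, NA, 16L, NA, NA)
)

# Hypothetical helper: sum the values, but return NA if all of them are missing
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}

res <- df1 %>%
  group_by(species2) %>%
  fill(species1, .direction = "downup") %>%   # recover species1 within each species2
  group_by(species1) %>%
  fill(species2, .direction = "downup") %>%   # recover species2 within each species1
  group_by(species2, .add = TRUE) %>%
  summarise(across(c(info1, info2), sum_or_na), .groups = "drop")

res
```

With this data, Osprey keeps NA in info1 and Swift keeps NA in info2, rather than 0.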

Related

Grouping into desired number of groups

I have a data frame like this:
ID is the primary key and Apples is the number of apples that person has.
ID Apples
E1 10
E2 5
E3 NA
E4 5
E5 8
E6 12
E7 NA
E8 4
E9 NA
E10 8
I want to group NA and non-NA values into only 2 separate groups and get the count of each. I tried the normal group_by(), but it does not give me the desired output.
Fruits %>% group_by(Apples) %>% summarize(n())
Apples n()
<dbl> <int>
4 1
5 2
8 2
10 1
12 1
NA 3
My desired output:
Apples n()
<dbl> <int>
non-NA 7
NA 3
We can create a group for NA and non-NA using group_by, and we can also make it a factor so that we can change the labels in the same step. Then, get the number of observations for each group.
library(dplyr)
df %>%
  group_by(grp = factor(is.na(Apples), labels = c("non-NA", "NA"))) %>%
  summarise(`n()` = n())
# grp `n()`
# <fct> <int>
#1 non-NA 7
#2 NA 3
Or in base R, we could use colSums:
data.frame(Apples = c("non-NA", "NA"), n = c(colSums(!is.na(df))[2], colSums(is.na(df))[2]), row.names = NULL)
Data
df <- structure(list(ID = c("E1", "E2", "E3", "E4", "E5", "E6", "E7",
"E8", "E9", "E10"), Apples = c(10L, 5L, NA, 5L, 8L, 12L, NA,
4L, NA, 8L)), class = "data.frame", row.names = c(NA, -10L))
In base R, this can be done with table on a logical vector
table(!is.na(df$Apples))
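The plain table call labels the groups FALSE and TRUE. As a small sketch, the factor trick from the dplyr answer can be combined with table to get the requested labels in base R; the names status and counts are only illustrative.

```r
df <- data.frame(
  ID = paste0("E", 1:10),
  Apples = c(10L, 5L, NA, 5L, 8L, 12L, NA, 4L, NA, 8L)
)

# Turn the logical "is missing" flag into a labelled factor:
# FALSE becomes "non-NA", TRUE becomes "NA"
status <- factor(is.na(df$Apples),
                 levels = c(FALSE, TRUE),
                 labels = c("non-NA", "NA"))

counts <- table(status)
counts
# non-NA 7, NA 3
```

Setting levels explicitly keeps both labels in the result even when one of the groups happens to be empty.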

Collapse duplicated rows simultaneously on several columns

R beginner here. I currently have a database with over 33 variables for over 10000 species of animals and have just got myself into a bit of trouble.
Summing up, my data looks something like this:
species1 species2 info1 info2
Parrot Parrot 3 NA
NA Parrot NA 7
Osprey NA NA 89
Sparrow Sparrow NA 19
Sparrow NA 27 NA
NA Duck 69 16
What I'm trying to do here is to collapse or merge together rows that have duplicates on species columns, while keeping those rows that have NA. Something like this:
species1 species2 info1 info2
Parrot Parrot 3 7
Osprey NA NA 89
Sparrow Sparrow 27 19
NA Duck 69 16
I have tried with group_by, but apart from the fact that it only groups by one variable, I'm not sure it is the correct way. I have 5 other species rows that may also have duplicates on them; should I just use this for each one?
data %>%
group_by(species1) %>%
summarise_each(funs(max))
Sorry if this is too obvious, I'm just a novice!
Thank you!! :)
We could coalesce the 'species1' and 'species2' columns into a single column, group on the coalesced column before doing the summarise, and then remove it with select.
library(dplyr)
data %>%
  group_by(species = coalesce(species1, species2)) %>%
  summarise(across(everything(), ~ .[complete.cases(.)][1])) %>%
  select(-species)
output
# A tibble: 4 x 4
species1 species2 info1 info2
<chr> <chr> <int> <int>
1 <NA> Duck 69 16
2 Osprey <NA> NA 89
3 Parrot Parrot 3 7
4 Sparrow Sparrow 27 19
data
data <- structure(list(species1 = c("Parrot", NA, "Osprey", "Sparrow",
"Sparrow", NA), species2 = c("Parrot", "Parrot", NA, "Sparrow",
NA, "Duck"), info1 = c(3L, NA, NA, NA, 27L, 69L), info2 = c(NA,
7L, 89L, 19L, NA, 16L)), class = "data.frame", row.names = c(NA,
-6L))
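The question mentions several more species columns. Below is a rough base R sketch of how the same key could be built from any number of columns. It swaps coalesce for Reduce with ifelse and summarise for aggregate, and it assumes (my assumption, not stated in the question) that all the name columns start with "species". The names species_cols, key and first_non_na are illustrative.

```r
data <- data.frame(
  species1 = c("Parrot", NA, "Osprey", "Sparrow", "Sparrow", NA),
  species2 = c("Parrot", "Parrot", NA, "Sparrow", NA, "Duck"),
  info1 = c(3L, NA, NA, NA, 27L, 69L),
  info2 = c(NA, 7L, 89L, 19L, NA, 16L)
)

# Assumed naming convention: every species column starts with "species"
species_cols <- grep("^species", names(data), value = TRUE)

# First non-missing name per row, taken across all species columns in order
key <- Reduce(function(a, b) ifelse(is.na(a), b, a), data[species_cols])

# First non-missing value of a column within a group (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

collapsed <- aggregate(data, by = list(key = key), FUN = first_non_na)
collapsed
```

One thing to keep in mind: aggregate drops rows whose key is NA, so a row with every species column missing would disappear here, whereas group_by would keep it as an NA group.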

How to replace NA's in rows with a fixed value by group?

I'm trying to find a way to replace NA values in my data with a fixed value.
Example
Animals Value
Dog 13
Dog 20
Dog 27
Dog 35
Dog NA
Dog NA
Dog NA
Cat 17
Cat 24
Cat 31
Cat NA
Cat NA
Mouse 100
Mouse 107
Mouse NA
Mouse NA
Mouse NA
Mouse NA
What I would like to do is replace NA values with multiples of 7 so it would look like:
Animals Value
Dog 13
Dog 20
Dog 27
Dog 34
Dog 41
Dog 48
Dog 55
Cat 17
Cat 24
Cat 31
Cat 38
Cat 45
Mouse 100
Mouse 107
Mouse 114
Mouse 121
Mouse 128
Mouse 135
I tried looking into the "fill" and "complete" functions, but from what I gathered, it usually fills NA's with a previous row value or fixed value. Any help would be appreciated!
We can use seq after grouping by 'Animals'
library(dplyr)
df1 %>%
  group_by(Animals) %>%
  mutate(Value = seq(first(Value), length.out = n(), by = 7))
# A tibble: 18 x 2
# Groups: Animals [3]
# Animals Value
# <chr> <dbl>
# 1 Dog 13
# 2 Dog 20
# 3 Dog 27
# 4 Dog 34
# 5 Dog 41
# 6 Dog 48
# 7 Dog 55
# 8 Cat 17
# 9 Cat 24
#10 Cat 31
#11 Cat 38
#12 Cat 45
#13 Mouse 100
#14 Mouse 107
#15 Mouse 114
#16 Mouse 121
#17 Mouse 128
#18 Mouse 135
Or another option is coalesce
df1 %>%
  group_by(Animals) %>%
  mutate(Value = coalesce(Value, seq(first(Value), length.out = n(), by = 7L)))
data
df1 <- structure(list(Animals = c("Dog", "Dog", "Dog", "Dog", "Dog",
"Dog", "Dog", "Cat", "Cat", "Cat", "Cat", "Cat", "Mouse", "Mouse",
"Mouse", "Mouse", "Mouse", "Mouse"), Value = c(13L, 20L, 27L,
35L, NA, NA, NA, 17L, 24L, 31L, NA, NA, 100L, 107L, NA, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -18L))
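The two options are not quite equivalent: seq regenerates every value in the group, while coalesce only fills the gaps. A small base R sketch of the difference, using ave and ifelse as stand-ins for group_by/mutate and coalesce (the names regenerated and kept are illustrative):

```r
df1 <- data.frame(
  Animals = c(rep("Dog", 7), rep("Cat", 5), rep("Mouse", 6)),
  Value = c(13L, 20L, 27L, 35L, NA, NA, NA,
            17L, 24L, 31L, NA, NA,
            100L, 107L, NA, NA, NA, NA)
)

# Option 1: rebuild the whole sequence per group from its first value
regenerated <- ave(df1$Value, df1$Animals,
                   FUN = function(v) seq(v[1], length.out = length(v), by = 7L))

# Option 2: keep observed values and use the sequence only where Value is NA
kept <- ifelse(is.na(df1$Value), regenerated, df1$Value)

regenerated[4]  # 34, the observed 35 is overwritten
kept[4]         # 35, the observed value survives
```

In both cases the filled values continue from the first value of the group, not from the last observed one, so with option 2 the Dog group reads 13, 20, 27, 35, 41, 48, 55.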

How can I remove NAs when both columns are missing only?

I have a df in R as follows:
ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20
Now I want to remove only the rows where both Score1 and Score2 are missing (i.e. the 3rd row).
You can filter it like this:
df <- read.table(header = TRUE, text = "ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20")
df[!(is.na(df$Score1) & is.na(df$Score2)), ]
# ID Age Score1 Score2
# 1 2 22 12 NA
# 2 3 19 11 22
# 4 1 21 NA 20
I.e. keep the rows where it is not (!) the case that Score1 is missing and (&) Score2 is missing.
Here are two versions with dplyr, which can be extended to any number of columns with the prefix "Score".
Using filter_at
library(dplyr)
df %>% filter_at(vars(starts_with("Score")), any_vars(!is.na(.)))
# ID Age Score1 Score2
#1 2 22 12 NA
#2 3 19 11 22
#3 1 21 NA 20
and filter_if
df %>% filter_if(startsWith(names(.),"Score"), any_vars(!is.na(.)))
A base R version with apply
df[apply(!is.na(df[startsWith(names(df),"Score")]), 1, any), ]
One option is rowSums
df1[ rowSums(is.na(df1[grep("Score", names(df1))])) < 2,]
Or another option with base R
df1[!Reduce(`&`, lapply(df1[grep("Score", names(df1))], is.na)),]
data
df1 <- structure(list(ID = c(2L, 3L, 4L, 1L), Age = c(22L, 19L, 20L,
21L), Score1 = c(12L, 11L, NA, NA), Score2 = c(NA, 22L, NA, 20L
)), class = "data.frame", row.names = c(NA, -4L))
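In recent dplyr versions the scoped verbs such as filter_at are superseded, and the same filter can be written with if_any. A minimal sketch, assuming a dplyr version that provides if_any (the name kept is illustrative):

```r
library(dplyr)

df <- data.frame(
  ID = c(2L, 3L, 4L, 1L),
  Age = c(22L, 19L, 20L, 21L),
  Score1 = c(12L, 11L, NA, NA),
  Score2 = c(NA, 22L, NA, 20L)
)

# Keep a row if at least one of the Score columns is not missing
kept <- df %>%
  filter(if_any(starts_with("Score"), ~ !is.na(.x)))

kept
```

Only the row with ID 4 is dropped, since it is the only one where every Score column is NA.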

How to join data frames in R without duplicating original data values

I have 2 data frames (DF1 & DF2) and I would like to join them together by a unique value called "acc_num". In DF2, payment was made twice by acc_num A and thrice by B. Data frames are as follows.
DF1:
acc_num total_use sales
A 433 145
A NA 2
A NA 18
B 149 32
DF2:
acc payment
A 150
A 98
B 44
B 15
B 10
My desired output is:
acc_num total_use sales payment
A 433 145 150
A NA 2 98
A NA 18 NA
B 149 32 44
B NA NA 15
B NA NA 10
I've tried full_join and merge but the output was not as desired. I couldn't work this out as I'm still a beginner in R, and haven't found the solution to this.
Example of the code I used was
test_full_join <- DF1 %>% full_join(DF2, by = c("acc_num" = "acc"))
The displayed output was:
acc_num total_use sales payment
A 433 145 150
A 433 145 98
A NA 2 150
A NA 2 98
A NA 18 150
A NA 18 98
B 149 32 44
B 149 32 15
B 149 32 10
This is contrary to my desired output, as in the end my concern is to get the total sum of total_use, sales and payment. This output will definitely give me a wrong interpretation for data visualization later on.
We may need to do a join by row_number() based on 'acc_num'
library(dplyr)
df1 %>%
  group_by(acc_num) %>%
  mutate(grpind = row_number()) %>%
  full_join(df2 %>%
              group_by(acc_num = acc) %>%
              mutate(grpind = row_number()),
            by = c("acc_num", "grpind")) %>%
  select(acc_num, total_use, sales, payment)
# A tibble: 6 x 4
# Groups: acc_num [2]
# acc_num total_use sales payment
# <chr> <int> <int> <int>
#1 A 433 145 150
#2 A NA 2 98
#3 A NA 18 NA
#4 B 149 32 44
#5 B NA NA 15
#6 B NA NA 10
data
df1 <- structure(list(acc_num = c("A", "A", "A", "B"), total_use = c(433L,
NA, NA, 149L), sales = c(145L, 2L, 18L, 32L)), class = "data.frame",
row.names = c(NA,
-4L))
df2 <- structure(list(acc = c("A", "A", "B", "B", "B"), payment = c(150L,
98L, 44L, 15L, 10L)), class = "data.frame", row.names = c(NA,
-5L))
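Because the goal is to get correct totals, it may be worth checking that the join does not duplicate any values. Here is a base R sketch of the same row-number join, using ave and merge as stand-ins for the dplyr verbs, followed by the sums (the name joined is illustrative):

```r
df1 <- data.frame(
  acc_num = c("A", "A", "A", "B"),
  total_use = c(433L, NA, NA, 149L),
  sales = c(145L, 2L, 18L, 32L)
)
df2 <- data.frame(
  acc = c("A", "A", "B", "B", "B"),
  payment = c(150L, 98L, 44L, 15L, 10L)
)

# Running index within each account, so the n-th row of df1 pairs with the n-th row of df2
df1$grpind <- ave(seq_along(df1$acc_num), df1$acc_num, FUN = seq_along)
df2$grpind <- ave(seq_along(df2$acc), df2$acc, FUN = seq_along)

# Full join on account and index; unmatched rows on either side are kept with NA
joined <- merge(df1, df2,
                by.x = c("acc_num", "grpind"),
                by.y = c("acc", "grpind"),
                all = TRUE)

# Totals match the sums of the original data frames
colSums(joined[c("total_use", "sales", "payment")], na.rm = TRUE)
# total_use 582, sales 197, payment 317
```

Each original value appears exactly once in joined, which is what keeps the totals correct.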

Resources