R beginner here. I currently have a database with more than 33 variables for over 10,000 animal species, and I have run into a bit of trouble.
Summing up, my data looks something like this:
species1 species2 info1 info2
Parrot Parrot 3 NA
NA Parrot NA 7
Osprey NA NA 89
Sparrow Sparrow NA 19
Sparrow NA 27 NA
NA Duck 69 16
What I'm trying to do is collapse (merge) rows that have duplicates in the species columns, while keeping the rows that contain NA. Something like this:
species1 species2 info1 info2
Parrot Parrot 3 7
Osprey NA NA 89
Sparrow Sparrow 27 19
NA Duck 69 16
I have tried group_by, but apart from the fact that it only groups by one variable here, I'm not sure it is the correct approach. I have 5 other species columns that may also contain duplicates; should I just repeat this for each one?
data %>%
group_by(species1) %>%
summarise_each(funs(max))
Sorry if this is too obvious, I'm just a novice!
Thank you!! :)
We could coalesce the 'species1' and 'species2' columns into a single column, group on that coalesced column, summarise by taking the first non-NA value of each column, and then drop the helper column with select
library(dplyr)
data %>%
group_by(species = coalesce(species1, species2)) %>%
summarise(across(everything(), ~ .[complete.cases(.)][1])) %>%
select(-species)
-output
# A tibble: 4 x 4
species1 species2 info1 info2
<chr> <chr> <int> <int>
1 <NA> Duck 69 16
2 Osprey <NA> NA 89
3 Parrot Parrot 3 7
4 Sparrow Sparrow 27 19
data
data <- structure(list(species1 = c("Parrot", NA, "Osprey", "Sparrow",
"Sparrow", NA), species2 = c("Parrot", "Parrot", NA, "Sparrow",
NA, "Duck"), info1 = c(3L, NA, NA, NA, 27L, 69L), info2 = c(NA,
7L, 89L, 19L, NA, 16L)), class = "data.frame", row.names = c(NA,
-6L))
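As a sketch of the same idea with the anonymous function pulled out into a named helper, the pipeline could look like the following. `first_non_na` is a hypothetical name introduced here (it is not part of dplyr), and the sketch assumes that taking the first non-missing value per group is the desired way to merge.

```r
library(dplyr)

# Hypothetical helper: first non-missing value of a vector, or NA if there is none
first_non_na <- function(x) x[!is.na(x)][1]

# Example data from the question
data <- data.frame(
  species1 = c("Parrot", NA, "Osprey", "Sparrow", "Sparrow", NA),
  species2 = c("Parrot", "Parrot", NA, "Sparrow", NA, "Duck"),
  info1 = c(3L, NA, NA, NA, 27L, 69L),
  info2 = c(NA, 7L, 89L, 19L, NA, 16L)
)

result <- data %>%
  # Grouping key: species1 where available, otherwise species2
  group_by(species = coalesce(species1, species2)) %>%
  # Collapse every remaining column to its first non-missing value
  summarise(across(everything(), first_non_na)) %>%
  select(-species)

result
```

With more species columns, coalesce accepts further arguments, e.g. `coalesce(species1, species2, species3)`, where `species3` is a placeholder for one of the additional columns.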
Related
I'm having issues figuring out how to merge non-unique columns that look like this:
2_2 2_3 2_4 2_2 3_2
1 2 3 NA NA
2 3 -1 NA NA
NA NA NA 3 -2
NA NA NA -2 4
To make them look like this:
2_2 2_3 2_4 3_2
1 2 3 NA
2 3 -1 NA
3 NA NA -2
-2 NA NA 4
Essentially, I want to merge any columns that share the same name. I have a large data set to work with, so this is becoming an issue!
Note that data.frame() does not keep duplicate column names by default (check.names = TRUE makes them unique). Even if we do create them, they may be modified when we apply functions, because make.unique is applied automatically. Assuming the data.frame was created with duplicate names, one option is to use split.default to split the data into a list of subsets by column name, then loop over the list with map and apply coalesce to each subset
library(dplyr)
library(purrr)
map_dfc(split.default(df1, names(df1)),~ invoke(coalesce, .x))
-output
# A tibble: 4 × 4
`2_2` `2_3` `2_4` `3_2`
<int> <int> <int> <int>
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
data
df1 <- structure(list(`2_2` = c(1L, 2L, NA, NA), `2_3` = c(2L, 3L, NA,
NA), `2_4` = c(3L, -1L, NA, NA), `2_2` = c(NA, NA, 3L, -2L),
`3_2` = c(NA, NA, -2L, 4L)), class = "data.frame", row.names = c(NA,
-4L))
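For readers who prefer to avoid purrr here (invoke is, as far as I know, deprecated in recent purrr releases), the same split-then-coalesce idea can be sketched with lapply and do.call. This is only an illustration under the assumption that the data frame really carries duplicate names, as in the example data above.

```r
library(dplyr)

df1 <- structure(list(`2_2` = c(1L, 2L, NA, NA), `2_3` = c(2L, 3L, NA, NA),
  `2_4` = c(3L, -1L, NA, NA), `2_2` = c(NA, NA, 3L, -2L),
  `3_2` = c(NA, NA, -2L, 4L)), class = "data.frame", row.names = c(NA, -4L))

# One list element per distinct column name, holding all columns that share it
groups <- split.default(df1, names(df1))

# Coalesce the columns within each group; unname() passes them positionally
merged <- lapply(groups, function(cols) do.call(coalesce, unname(as.list(cols))))

# check.names = FALSE keeps the non-syntactic names such as `2_2`
res <- data.frame(merged, check.names = FALSE)
res
```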
Also using coalesce:
You use non-syntactic names. R is strict about names (see https://adv-r.hadley.nz/names-values.html); also note the explanation by @akrun:
library(dplyr)
df %>%
mutate(X2_2 = coalesce(X2_2, X2_2.1), .keep="unused")
X2_2 X2_3 X2_4 X3_2
1 1 2 3 NA
2 2 3 -1 NA
3 3 NA NA -2
4 -2 NA NA 4
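The names X2_2 and X2_2.1 used above presumably come from R repairing the original names on import (an assumption; for example read.table with its default check.names = TRUE does this). A small sketch of that repair step with make.names:

```r
# Original, non-syntactic and partly duplicated column names
original <- c("2_2", "2_3", "2_4", "2_2", "3_2")

# make.names() prefixes an "X" to names starting with a digit;
# unique = TRUE appends .1, .2, ... to duplicates
fixed <- make.names(original, unique = TRUE)
fixed
# "X2_2" "X2_3" "X2_4" "X2_2.1" "X3_2"
```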
I have a data frame like this:
ID is the primary key and Apples is the number of apples that person has.
ID Apples
E1 10
E2 5
E3 NA
E4 5
E5 8
E6 12
E7 NA
E8 4
E9 NA
E10 8
I want to group NA and non-NA values into only 2 separate groups and get the count of each. I tried the normal group_by(), but it does not give me the desired output.
Fruits %>% group_by(Apples) %>% summarize(n())
Apples n()
<dbl> <int>
4 1
5 2
8 2
10 1
12 1
NA 3
My desired output:
Apples n()
<dbl> <int>
non-NA 7
NA 3
We can create a group for NA and non-NA using group_by, and we can also make it a factor so that we can change the labels in the same step. Then, get the number of observations for each group.
library(dplyr)
df %>%
group_by(grp = factor(is.na(Apples), labels=c("non-NA", "NA"))) %>%
summarise(`n()`= n())
# grp `n()`
# <fct> <int>
#1 non-NA 7
#2 NA 3
Or in base R, we could use colSums:
data.frame(Apples = c("non-NA", "NA"), n = c(colSums(!is.na(df))[2], colSums(is.na(df))[2]), row.names = NULL)
Data
df <- structure(list(ID = c("E1", "E2", "E3", "E4", "E5", "E6", "E7",
"E8", "E9", "E10"), Apples = c(10L, 5L, NA, 5L, 8L, 12L, NA,
4L, NA, 8L)), class = "data.frame", row.names = c(NA, -10L))
In base R, this can be done with table on a logical vector (FALSE counts the NA values, TRUE the non-NA values)
table(!is.na(df$Apples))
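A compact variant of the grouping approach is dplyr's count(), sketched below. The column name `status` is a hypothetical label chosen for this example, and the data frame is rebuilt from the values in the question.

```r
library(dplyr)

df <- data.frame(
  ID = paste0("E", 1:10),
  Apples = c(10L, 5L, NA, 5L, 8L, 12L, NA, 4L, NA, 8L)
)

# Label each row as "NA" or "non-NA" (plain strings), then count rows per label
res <- df %>%
  count(status = if_else(is.na(Apples), "NA", "non-NA"))

res
```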
If my data looks something like this:
species1 species2 info1 info2
Loro Parrot 3 NA
NA Parrot NA 7
Osprey NA NA 89
Sparrow Finch NA 19
Sparrow NA 27 NA
Mallard Duck 69 16
Mallard NA NA NA
NA Swift 25 NA
And I want to merge it together like this:
species1 species2 info1 info2
Loro Parrot 3 7
Osprey NA NA 89
Sparrow Finch 27 19
Mallard Duck 69 16
NA Swift 25 NA
How could I do it, taking into account that I need to keep the NA records?
Thank you very much! :)
We could use a similar approach to the previous post, but in a different way: first create a named vector from the 'species' columns (species1 as names, species2 as values). Use it to replace the values in the 'species1' column, coalesce the result with 'species2' to build the grouping key, and then do the summarise
library(dplyr)
library(tibble)
nm1 <- df1 %>%
select(species1, species2) %>%
na.omit %>%
deframe
df1 %>%
group_by(species = coalesce(nm1[species1], species2)) %>%
summarise(across(everything(), ~ .[complete.cases(.)][1])) %>%
select(-species)
# A tibble: 5 x 4
species1 species2 info1 info2
<chr> <chr> <int> <int>
1 Mallard Duck 69 16
2 Sparrow Finch 27 19
3 Loro Parrot 3 7
4 <NA> Swift 25 NA
5 Osprey <NA> NA 89
data
df1 <- structure(list(species1 = c("Loro", NA, "Osprey", "Sparrow",
"Sparrow", "Mallard", "Mallard", NA), species2 = c("Parrot",
"Parrot", NA, "Finch", NA, "Duck", NA, "Swift"), info1 = c(3L,
NA, NA, NA, 27L, 69L, NA, 25L), info2 = c(NA, 7L, 89L, 19L, NA,
16L, NA, NA)), class = "data.frame", row.names = c(NA, -8L))
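One thing to watch with the lookup-based key: a row whose species1 has no match in the named vector and whose species2 is NA gets the key NA, so several such rows would be merged into one group. A possible guard, sketched here as an assumption about the desired behaviour, is to fall back to species1 as a third coalesce argument. "Eagle" is a made-up row added only to show the effect.

```r
library(dplyr)

# Lookup as produced by deframe() on the complete species pairs in the data above
nm1 <- c(Loro = "Parrot", Sparrow = "Finch", Mallard = "Duck")

species1 <- c("Loro", NA, "Osprey", "Sparrow", "Eagle")
species2 <- c("Parrot", "Parrot", NA, NA, NA)

# Indexing by name returns NA for missing or unknown names
looked_up <- unname(nm1[species1])

# Key without fallback: "Osprey" and "Eagle" both end up as NA
key_plain <- coalesce(looked_up, species2)

# Key with species1 as a last fallback keeps them apart
key_safe <- coalesce(looked_up, species2, species1)

key_plain
key_safe
```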
You may group by one species column and fill the NA values in the other one to get the pairs, then sum the values while grouping by both species columns.
library(dplyr)
library(tidyr)
df %>%
group_by(species2) %>%
fill(species1) %>%
group_by(species1) %>%
fill(species2) %>%
group_by(species2, .add = TRUE) %>%
summarise(across(everything(), ~ sum(.x, na.rm = TRUE))) %>%
ungroup()
# species1 species2 info1 info2
# <chr> <chr> <int> <int>
#1 Loro Parrot 3 7
#2 Mallard Duck 69 16
#3 Osprey NA 0 89
#4 Sparrow Finch 27 19
#5 NA Swift 25 0
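The zeros in the output (Osprey's info1, Swift's info2) appear because sum() with na.rm = TRUE returns 0 when every value in the group is NA. If NA should be kept in that case, a small wrapper can be used in place of sum. `sum_or_na` is a hypothetical helper name, and the sketch assumes integer columns as in the example data.

```r
# Hypothetical helper: sum that stays NA when all inputs are missing
sum_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}

sum_or_na(c(NA, 27L))                    # 27
sum_or_na(c(NA_integer_, NA_integer_))   # NA
sum(c(NA_integer_, NA_integer_), na.rm = TRUE)  # 0, the behaviour seen above
```

Inside the pipeline it could replace the lambda, e.g. `summarise(across(everything(), sum_or_na))`.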
I'm trying to find a way to replace NA values in my data with values that increase by a fixed step.
Example
Animals Value
Dog 13
Dog 20
Dog 27
Dog 35
Dog NA
Dog NA
Dog NA
Cat 17
Cat 24
Cat 31
Cat NA
Cat NA
Mouse 100
Mouse 107
Mouse NA
Mouse NA
Mouse NA
Mouse NA
What I would like to do is replace the NA values by continuing each group's sequence in increments of 7, so it would look like:
Animals Value
Dog 13
Dog 20
Dog 27
Dog 34
Dog 41
Dog 48
Dog 55
Cat 17
Cat 24
Cat 31
Cat 38
Cat 45
Mouse 100
Mouse 107
Mouse 114
Mouse 121
Mouse 128
Mouse 135
I tried looking into the "fill" and "complete" functions, but from what I gathered, they usually fill NAs with the previous row's value or a fixed value. Any help would be appreciated!
We can use seq after grouping by 'Animals'
library(dplyr)
df1 %>%
group_by(Animals) %>%
mutate(Value = seq(first(Value), length.out = n(), by = 7))
# A tibble: 18 x 2
# Groups: Animals [3]
# Animals Value
# <chr> <dbl>
# 1 Dog 13
# 2 Dog 20
# 3 Dog 27
# 4 Dog 34
# 5 Dog 41
# 6 Dog 48
# 7 Dog 55
# 8 Cat 17
# 9 Cat 24
#10 Cat 31
#11 Cat 38
#12 Cat 45
#13 Mouse 100
#14 Mouse 107
#15 Mouse 114
#16 Mouse 121
#17 Mouse 128
#18 Mouse 135
Or another option is coalesce, which fills only the NA positions and leaves the existing values untouched
df1 %>%
group_by(Animals) %>%
mutate(Value = coalesce(Value, seq(first(Value), length.out = n(), by = 7L)))
data
df1 <- structure(list(Animals = c("Dog", "Dog", "Dog", "Dog", "Dog",
"Dog", "Dog", "Cat", "Cat", "Cat", "Cat", "Cat", "Mouse", "Mouse",
"Mouse", "Mouse", "Mouse", "Mouse"), Value = c(13L, 20L, 27L,
35L, NA, NA, NA, 17L, 24L, 31L, NA, NA, 100L, 107L, NA, NA, NA,
NA)), class = "data.frame", row.names = c(NA, -18L))
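The two options differ for the Dog group, where the data contains 35 while the expected output shows 34. A short sketch on that group alone (values copied from the data above) shows the difference:

```r
library(dplyr)

# Dog values from the example data
value <- c(13L, 20L, 27L, 35L, NA, NA, NA)

# Full sequence regenerated from the first value in steps of 7
steps <- seq(value[1], length.out = length(value), by = 7L)

# Option 1: use the regenerated sequence everywhere (35 becomes 34)
steps
# 13 20 27 34 41 48 55

# Option 2: keep existing values, fill only the NAs (35 is kept)
filled <- coalesce(value, steps)
filled
# 13 20 27 35 41 48 55
```

Which one is appropriate depends on whether the 35 in the data is a typo or a value to preserve.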
I have a df in R as follows:
ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20
Now I want to remove only the rows where both Score1 and Score2 are missing (i.e. the 3rd row).
You can filter it like this:
df <- read.table(header = TRUE, text = "ID Age Score1 Score2
2 22 12 NA
3 19 11 22
4 20 NA NA
1 21 NA 20")
df[!(is.na(df$Score1) & is.na(df$Score2)), ]
# ID Age Score1 Score2
# 1 2 22 12 NA
# 2 3 19 11 22
# 4 1 21 NA 20
I.e. keep the rows for which it is not (!) the case that Score1 is missing and (&) Score2 is missing.
Here are two versions with dplyr that can be extended to many columns with the prefix "Score".
Using filter_at
library(dplyr)
df %>% filter_at(vars(starts_with("Score")), any_vars(!is.na(.)))
# ID Age Score1 Score2
#1 2 22 12 NA
#2 3 19 11 22
#3 1 21 NA 20
and filter_if
df %>% filter_if(startsWith(names(.),"Score"), any_vars(!is.na(.)))
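In current dplyr the scoped verbs filter_at and filter_if are superseded. A sketch of the same filter with if_any (available in dplyr 1.0.4 and later) could look like this, with the data rebuilt from the question:

```r
library(dplyr)

df <- data.frame(
  ID = c(2L, 3L, 4L, 1L),
  Age = c(22L, 19L, 20L, 21L),
  Score1 = c(12L, 11L, NA, NA),
  Score2 = c(NA, 22L, NA, 20L)
)

# Keep rows where at least one "Score" column is not missing
kept <- df %>%
  filter(if_any(starts_with("Score"), ~ !is.na(.x)))

kept
```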
A base R version with apply
df[apply(!is.na(df[startsWith(names(df),"Score")]), 1, any), ]
One option is rowSums
df1[ rowSums(is.na(df1[grep("Score", names(df1))])) < 2,]
Or another option with base R
df1[!Reduce(`&`, lapply(df1[grep("Score", names(df1))], is.na)),]
data
df1 <- structure(list(ID = c(2L, 3L, 4L, 1L), Age = c(22L, 19L, 20L,
21L), Score1 = c(12L, 11L, NA, NA), Score2 = c(NA, 22L, NA, 20L
)), class = "data.frame", row.names = c(NA, -4L))
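The rowSums option hard-codes the number of score columns (2). A small generalisation, sketched here under the assumption that all relevant columns start with "Score", compares against the number of matched columns instead:

```r
df1 <- structure(list(ID = c(2L, 3L, 4L, 1L), Age = c(22L, 19L, 20L, 21L),
  Score1 = c(12L, 11L, NA, NA), Score2 = c(NA, 22L, NA, 20L)),
  class = "data.frame", row.names = c(NA, -4L))

# Positions of all columns whose name starts with "Score"
score_cols <- grep("^Score", names(df1))

# A row is kept if fewer than all of its score columns are NA
keep <- rowSums(is.na(df1[score_cols])) < length(score_cols)

df1[keep, ]
```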