Keeping all NAs in dplyr distinct function - r

I have a data.frame (the eBird basic dataset) in which several observers may upload a record of the same sighting. In that case the event is given a "group identifier"; records that do not come from a group session have NA in that column. I want to filter out the duplicates from group events while keeping every NA row, and I would like to do this without splitting the data frame in two:
library(dplyr)
set.seed(1)
df <- tibble(
x = sample(c(1:6, NA), 30, replace = T),
y = sample(c(letters[1:4]), 30, replace = T)
)
df %>% count(x,y)
gives:
> df %>% count(x,y)
# A tibble: 20 x 3
x y n
<int> <chr> <int>
1 1 a 1
2 1 b 2
3 2 a 1
4 2 b 1
5 2 c 1
6 2 d 3
7 3 a 1
8 3 b 1
9 3 c 4
10 4 d 1
11 5 a 1
12 5 b 2
13 5 c 1
14 5 d 1
15 6 a 1
16 6 c 2
17 NA a 1
18 NA b 2
19 NA c 2
20 NA d 1
I don't want the rows with NA in x to be collapsed together, as happened here with the "NA b" and "NA c" combinations. The documentation for distinct says nothing about excluding NAs from the comparison. Is splitting the data frame the only solution?

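With distinct, an option is to create a new column based on the NA elements in 'x'. The helper column x1 holds the row number where 'x' is NA and 0 elsewhere, so every NA row becomes unique and survives distinct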
library(dplyr)
df %>%
mutate(x1 = row_number() * is.na(x)) %>%
distinct %>%
select(-x1)
Or we can use duplicated with an OR (|) condition inside filter to return all NA elements in 'x' (in dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(everything()))
df %>%
filter(is.na(x)|!duplicated(cur_data()))
# A tibble: 20 x 2
# x y
# <int> <chr>
# 1 1 b
# 2 4 b
# 3 NA a
# 4 1 d
# 5 2 c
# 6 5 a
# 7 NA d
# 8 3 c
# 9 6 b
#10 2 b
#11 3 b
#12 1 c
#13 5 d
#14 2 d
#15 6 d
#16 2 a
#17 NA c
#18 NA a
#19 1 a
#20 5 b
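For comparison, the same idea as the filter() answer can be sketched in base R, with no packages. The data below are made up for illustration (obs, group_id and species are hypothetical names, not eBird column names): a row is kept when its group identifier is NA or when it does not repeat an earlier row.

```r
# Hypothetical toy data: group_id is NA for records outside a group session
obs <- data.frame(
  group_id = c(1L, 1L, NA, NA, 2L, 2L),
  species  = c("a", "a", "b", "b", "c", "d"),
  stringsAsFactors = FALSE
)

# Keep every NA row, and only the first occurrence of each non-NA row
kept <- obs[is.na(obs$group_id) | !duplicated(obs), ]
kept
```

Here the second (1, "a") row is dropped, while both (NA, "b") rows survive.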

Related

How to create column, from the cumulative column in r?

df <- data.frame(dat=c("11-03","12-03","13-03"),
c=c(0,15,20,4,19,21,2,10,14), d=rep(c("A","B","C"),each=3))
Suppose c holds the cumulative values. I want to create a column daily so that the result looks like
dat c d daily
1 11-03 0 A 0
2 12-03 15 A 15
3 13-03 20 A 5
4 11-03 4 B 4
5 12-03 19 B 15
6 13-03 21 B 2
7 11-03 2 C 2
8 12-03 10 C 8
9 13-03 14 C 4
For each value of d, the daily change is computed date by date from the cumulative values in column c.
We can get the diff of 'c' after grouping by 'd'
library(dplyr)
df %>%
group_by(d) %>%
mutate(daily = c(first(c), diff(c)))
# A tibble: 9 x 4
# Groups: d [3]
# dat c d daily
# <fct> <dbl> <fct> <dbl>
#1 11-03 0 A 0
#2 12-03 15 A 15
#3 13-03 20 A 5
#4 11-03 4 B 4
#5 12-03 19 B 15
#6 13-03 21 B 2
#7 11-03 2 C 2
#8 12-03 10 C 8
#9 13-03 14 C 4
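Or take the difference between 'c' and the lag of 'c', using default = 0 so that the first row of each group keeps its own value instead of becoming NA
df %>%
group_by(d) %>%
mutate(daily = c - lag(c, default = 0))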
A data.table solution:
library(data.table)
df <- as.data.table(df)
df[, daily := c - shift(c, fill = 0), by = d]
shift is data.table's lag operator, so within each group we subtract the previous value of c from the current one.
fill = 0 replaces the NA that shift(c) would otherwise produce for the first element of each group, which has no previous value.
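As a cross-check on the answers above, here is a minimal base R sketch of the same computation using ave(). It rebuilds the question's data, differences c within each level of d, and then confirms that re-accumulating the daily values gives back the original column. The name recovered is introduced here only for illustration.

```r
# Data from the question: c is cumulative within each level of d
df <- data.frame(dat = c("11-03", "12-03", "13-03"),
                 c = c(0, 15, 20, 4, 19, 21, 2, 10, 14),
                 d = rep(c("A", "B", "C"), each = 3))

# Within each group, keep the first value and difference the rest
df$daily <- ave(df$c, df$d, FUN = function(v) c(v[1], diff(v)))

# Re-accumulating the daily values per group should give back c
recovered <- ave(df$daily, df$d, FUN = cumsum)
all(recovered == df$c)
```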

Expand dataframe by ID to generate a special column

I have the following dataframe
df<-data.frame("ID"=c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
'A_Frequency'=c(1,2,3,4,5,1,2,3,4,5),
'B_Frequency'=c(1,2,NA,4,6,1,2,5,6,7))
The dataframe appears as follows
ID A_Frequency B_Frequency
1 A 1 1
2 A 2 2
3 A 3 NA
4 A 4 4
5 A 5 6
6 B 1 1
7 B 2 2
8 B 3 5
9 B 4 6
10 B 5 7
I wish to create a new dataframe df2 from df that looks as follows
ID CFreq
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 B 1
8 B 2
9 B 3
10 B 4
11 B 5
12 B 6
13 B 7
The new dataframe has a column CFreq that holds the unique values from A_Frequency and B_Frequency combined, computed separately for each ID and with NA values ignored.
I have tried dplyr but am unable to get the required response
df2<-df%>%group_by(ID)%>%select(ID, A_Frequency,B_Frequency)%>%
mutate(Cfreq=unique(A_Frequency, B_Frequency))
This yields the following which is quite different
ID A_Frequency B_Frequency Cfreq
<fct> <dbl> <dbl> <dbl>
1 A 1 1 1
2 A 2 2 2
3 A 3 NA 3
4 A 4 4 4
5 A 5 6 5
6 B 1 1 1
7 B 2 2 2
8 B 3 5 3
9 B 4 6 4
10 B 5 7 5
Could someone help me here?
The gather function from the tidyr package will be helpful here:
library(tidyverse)
df %>%
gather(x, CFreq, -ID) %>%
select(-x) %>%
na.omit() %>%
unique() %>%
arrange(ID, CFreq)
A different tidyverse possibility could be:
df %>%
nest(A_Frequency, B_Frequency, .key = C_Frequency) %>%
mutate(C_Frequency = map(C_Frequency, function(x) unique(x[!is.na(x)]))) %>%
unnest()
ID C_Frequency
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
9 A 6
10 B 1
11 B 2
12 B 3
13 B 4
14 B 5
18 B 6
19 B 7
A base R approach would be to split the dataframe by ID and, for each piece, count the number of unique entries and create a sequence of that length.
do.call(rbind, lapply(split(df, df$ID), function(x) data.frame(ID = x$ID[1] ,
CFreq = seq_len(length(unique(na.omit(unlist(x[-1]))))))))
# ID CFreq
#A.1 A 1
#A.2 A 2
#A.3 A 3
#A.4 A 4
#A.5 A 5
#A.6 A 6
#B.1 B 1
#B.2 B 2
#B.3 B 3
#B.4 B 4
#B.5 B 5
#B.6 B 6
#B.7 B 7
This will also work when A_Frequency and B_Frequency contain characters or arbitrary numbers instead of sequential numbers.
In tidyverse we can do
library(tidyverse)
df %>%
group_split(ID) %>%
map_dfr(~ data.frame(ID = .$ID[1],
CFreq= seq_len(length(unique(na.omit(flatten_chr(.[-1])))))))
A data.table option
library(data.table)
cols <- c('A_Frequency', 'B_Frequency')
out <- setDT(df)[, .(CFreq = sort(unique(unlist(.SD)))),
.SDcols = cols,
by = ID]
out
# ID CFreq
# 1: A 1
# 2: A 2
# 3: A 3
# 4: A 4
# 5: A 5
# 6: A 6
# 7: B 1
# 8: B 2
# 9: B 3
#10: B 4
#11: B 5
#12: B 6
#13: B 7
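The answers above can also be sketched in base R so that the actual unique values, rather than a sequence of counts, end up in CFreq. This is an illustrative variant, not one of the original answers; vals is a name introduced here.

```r
df <- data.frame(ID = rep(c("A", "B"), each = 5),
                 A_Frequency = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5),
                 B_Frequency = c(1, 2, NA, 4, 6, 1, 2, 5, 6, 7))

# For each ID, pool both frequency columns; sort() drops NA by default
vals <- lapply(split(df[-1], df$ID), function(g) sort(unique(unlist(g))))

# Rebuild a long data frame with one row per (ID, value) pair
df2 <- data.frame(ID = rep(names(vals), lengths(vals)),
                  CFreq = unlist(vals, use.names = FALSE))
df2
```

Because the values themselves are kept, this form does not assume the frequencies are consecutive integers.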

Generating multiple column to sort the data out in R

I have a database including names, codes and rooms as follows:
Name1 Code1 R1
A A 12 1
A B 13 2
A C 15 5
A B 8 4
A C 13 2
A D 17 1
A B 16 7
I want to generate columns for the repeated names like this:
Name1 Code1 R1 Name2 Code2 R2 Name3 Code3 R3
A A 12 1
A B 13 2
A C 15 5
A B 8 4 A B 8 4
A C 13 2 A C 13 2
A D 17 1
A B 16 7 A B 16 7
I have googled for a solution, but I could not find one, or maybe I have missed something. Would you be able to help me? Some names (Name1) are repeated up to 5 times, which I did not show here, so I would have Name2, Code2, R2; Name3, Code3, R3; and so on.
Sample data:
library(dplyr)
library(tidyr)
df <- read.table(stringsAsFactors = F, header = T, text = "
Name1a Name1b Code1 R1
1 A A 12 1
2 A B 13 2
3 A C 15 5
4 A B 8 4
5 A C 13 2
6 A D 17 1
7 A B 16 7") %>%
tidyr::unite(Name1, Name1a, Name1b)
Edit: The original answer was in a packed format, but the OP would like the first set of columns repeated for all rows, with the second and third appearances showing up in the rows where they originally appeared.
Here's an approach using dplyr and tidyr.
# Keep track of original rows, label repeats, and make it long format
df_order <- df %>%
mutate(orig_row = row_number()) %>%
group_by(Name1) %>% mutate(repeat_no = row_number()) %>% ungroup() %>%
gather(col_type, value, Code1:R1)
# Make one copy of all the rows to keep in first column
df_ones <- df_order %>%
mutate(repeat_no = 1) %>%
unite(col_rpt, repeat_no, col_type)
# Get the repeated rows to add on
df_repeats <- df_order %>%
filter(repeat_no > 1) %>%
unite(col_rpt, repeat_no, col_type)
# Combine the two and spread out
output <- df_ones %>%
bind_rows(df_repeats) %>%
spread(col_rpt, value) %>%
arrange(orig_row) %>%
select(-orig_row)
Output:
> output
# A tibble: 7 x 7
Name1 `1_Code1` `1_R1` `2_Code1` `2_R1` `3_Code1` `3_R1`
<chr> <int> <int> <int> <int> <int> <int>
1 A_A 12 1 NA NA NA NA
2 A_B 13 2 NA NA NA NA
3 A_C 15 5 NA NA NA NA
4 A_B 8 4 8 4 NA NA
5 A_C 13 2 13 2 NA NA
6 A_D 17 1 NA NA NA NA
7 A_B 16 7 NA NA 16 7
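The key step in the approach above is repeat_no, the running count of how often each name has appeared so far. As a small illustration, the same labels can be computed in base R with ave(); the vector name1 below simply copies the united names from the sample data.

```r
# Names from the sample data, after uniting the two name columns
name1 <- c("A_A", "A_B", "A_C", "A_B", "A_C", "A_D", "A_B")

# Running count of how many times each name has appeared so far
repeat_no <- ave(seq_along(name1), name1, FUN = seq_along)
repeat_no
```

Rows with repeat_no of 2 or 3 are the ones that fill the second and third sets of columns in the output.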

Dynamic select expression in function [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 4 years ago.
I am trying to write a function that will convert this data frame
library(dplyr)
library(rlang)
library(purrr)
df <- data.frame(obj=c(1,1,2,2,3,3,3,4,4,4),
S1=rep(c("a","b"),length.out=10),PR1=rep(c(3,7),length.out=10),
S2=rep(c("c","d"),length.out=10),PR2=rep(c(7,3),length.out=10))
obj S1 PR1 S2 PR2
1 1 a 3 c 7
2 1 b 7 d 3
3 2 a 3 c 7
4 2 b 7 d 3
5 3 a 3 c 7
6 3 b 7 d 3
7 3 a 3 c 7
8 4 b 7 d 3
9 4 a 3 c 7
10 4 b 7 d 3
into this data frame
df %>% {bind_rows(select(., obj, S = S1, PR = PR1),
select(., obj, S = S2, PR = PR2))}
obj S PR
1 1 a 3
2 1 b 7
3 2 a 3
4 2 b 7
5 3 a 3
6 3 b 7
7 3 a 3
8 4 b 7
9 4 a 3
10 4 b 7
11 1 c 7
12 1 d 3
13 2 c 7
14 2 d 3
15 3 c 7
16 3 d 3
17 3 c 7
18 4 d 3
19 4 c 7
20 4 d 3
But I would like the function to be able to work with any number of columns, so it would also work if I had S1, S2, S3, S4 or if there was an additional category, i.e. DS1, DS2. Ideally the function would take as arguments the patterns that determine which columns are stacked on top of each other, the number of sets of each column, the names of the output columns, and the names of any variables that should also be kept.
This is my attempt at this function:
stack_col <- function(df, patterns, nums, cnames, keep){
keep <- enquo(keep)
build_exp <- function(x){
paste0("!!sym(cnames[[", x, "]]) := paste0(patterns[[", x, "]],num)") %>%
parse_expr()
}
exps <- map(1:length(patterns), ~expr(!!build_exp(.)))
sel_fun <- function(num){
df %>% select(!!keep,
!!!exps)
}
map(nums, sel_fun) %>% bind_rows()
}
I can get the sel_fun part to work for a fixed number of patterns like this
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
keep <- quo(obj)
sel_fun <- function(num){
df %>% select(!!keep,
!!sym(cnames[[1]]) := paste0(patterns[[1]], num),
!!sym(cnames[[2]]) := paste0(patterns[[2]], num))
}
sel_fun(1)
But the dynamic version that I have tried does not work and gives this error:
Error: `:=` can only be used within a quasiquoted argument
Here is a function to get the expected output. Loop through the 'patterns' and the corresponding new column names ('cnames') using map2, gather into 'long' format, rename the 'val' column to the 'cnames' passed into the function, bind the columns (bind_cols) and select the columns of interest
stack_col <- function(dat, pat, cname, keep) {
purrr::map2(pat, cname, ~
dat %>%
dplyr::select(keep, matches(.x)) %>%
tidyr::gather(key, val, matches(.x)) %>%
dplyr::select(-key) %>%
dplyr::rename(!! .y := val)) %>%
dplyr::bind_cols(.) %>%
dplyr::select(keep, cname)
}
stack_col(df, patterns, cnames, 1)
# obj Species PR
#1 1 a 3
#2 1 b 7
#3 2 a 3
#4 2 b 7
#5 3 a 3
#6 3 b 7
#7 3 a 3
#8 4 b 7
#9 4 a 3
#10 4 b 7
#11 1 c 7
#12 1 d 3
#13 2 c 7
#14 2 d 3
#15 3 c 7
#16 3 d 3
#17 3 c 7
#18 4 d 3
#19 4 c 7
#20 4 d 3
Also, multiple patterns reshaping can be done with data.table::melt
library(data.table)
melt(setDT(df), measure = patterns("^S\\d+", "^PR\\d+"),
value.name = c("Species", "PR"))[, variable := NULL][]
This solves your problem, although it does not fix your function:
The idea is to use gather and spread on the columns which start with the specific patterns. I build a regex which matches those column names, gather all of them, extract the group from each column name, and then rename the groups with the cnames. Finally, spread separates the groups into new columns.
library(dplyr)
library(purrr)
library(tidyr)
library(stringr)
patterns <- c("S", "PR")
cnames <- c("Species", "PR")
names(cnames) <- patterns
complete_pattern <- str_c("^", str_c(patterns, collapse = "|^"))
df %>%
mutate(rownumber = 1:n()) %>%
gather(new_variable, value, matches(complete_pattern)) %>%
mutate(group = str_extract(new_variable, complete_pattern),
group = str_replace_all(group, cnames),
group_number = str_extract(new_variable, "\\d+")) %>%
select(-new_variable) %>%
spread(group, value)
# obj rownumber group_number PR Species
# 1 1 1 1 3 a
# 2 1 1 2 7 c
# 3 1 2 1 7 b
# 4 1 2 2 3 d
# 5 2 3 1 3 a
# 6 2 3 2 7 c
# 7 2 4 1 7 b
# 8 2 4 2 3 d
# 9 3 5 1 3 a
# 10 3 5 2 7 c
# 11 3 6 1 7 b
# 12 3 6 2 3 d
# 13 3 7 1 3 a
# 14 3 7 2 7 c
# 15 4 8 1 7 b
# 16 4 8 2 3 d
# 17 4 9 1 3 a
# 18 4 9 2 7 c
# 19 4 10 1 7 b
# 20 4 10 2 3 d
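For a version that keeps the argument list described in the question (patterns, nums, cnames, keep), here is a minimal base R sketch. It is not a repair of the quasiquotation attempt: stack_sets is a hypothetical helper, and it assumes column names are passed as plain strings and that every pattern/number combination (for example "S1", "PR2") exists in the data.

```r
# Hypothetical helper: stack numbered column sets (S1/PR1, S2/PR2, ...) into long form
stack_sets <- function(df, patterns, nums, cnames, keep) {
  pieces <- lapply(nums, function(num) {
    # Pick the kept columns plus this set's columns, e.g. "S1" and "PR1"
    piece <- df[c(keep, paste0(patterns, num))]
    names(piece) <- c(keep, cnames)
    piece
  })
  do.call(rbind, pieces)
}

df <- data.frame(obj = c(1, 1, 2, 2, 3, 3, 3, 4, 4, 4),
                 S1 = rep(c("a", "b"), length.out = 10),
                 PR1 = rep(c(3, 7), length.out = 10),
                 S2 = rep(c("c", "d"), length.out = 10),
                 PR2 = rep(c(7, 3), length.out = 10),
                 stringsAsFactors = FALSE)

out <- stack_sets(df, patterns = c("S", "PR"), nums = 1:2,
                  cnames = c("Species", "PR"), keep = "obj")
out
```

Adding a third set or another category then only means extending nums or patterns and cnames, without touching the function body.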

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on columns a and c) so that, within each set of duplicates, the row with a value in column b is kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b', and the logical vector is.na(b), so that the NA elements of 'b' come last within each group of 'a'. Then apply duplicated on 'a' and 'c' and keep only the non-duplicated rows
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two datasets, one with the rows whose value in column a occurs more than once and one with the rows whose value in column a occurs only once, using the code below:
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use the na.omit function on x to delete the rows with NA, and then rbind the result with x1 to get the final data set.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
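You can use dplyr to do this. distinct keeps the first row of each (a, c) combination, so sort the rows with NA in b to the end of each group first; otherwise row 1 (A, NA) would be kept instead of row 2.
library(dplyr)
df %>%
arrange(a, is.na(b)) %>%
distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A 1 1 2
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9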
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr
