Finding duplicate observations of selected variables in a tibble - r

I have a rather large tibble (called df.tbl with ~26k rows and 22 columns) and I want to find the "twins" of each object, i.e. each row that has the same values in columns 2:7 (date:Pos).
If I use:
inner_join(df.tbl, df.tbl[i,], by = c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos"))
with i being the row I want to check for "twins", everything is working as expected, spitting out a 2 x 22 tibble, and I can expand this using:
x <- NULL
for (i in 1:nrow(df.tbl)) {
  x[[i]] <- inner_join(df.tbl,
                       df.tbl[i, ],
                       by = c("date",
                              "forge",
                              "serNum",
                              "PinMain",
                              "PinMainNumber",
                              "Pos")) %>%
    select(rowNum.x) %>%
    as_vector()
}
to create a list containing the row numbers for each twin for each object (row).
However much I try, I cannot use map to produce a similar result:
twins <- map(df.tbl, ~ inner_join(df.tbl,
                                  .,
                                  by = c("date",
                                         "forge",
                                         "serNum",
                                         "PinMain",
                                         "PinMainNumber",
                                         "Pos")) %>%
               select(rowNum.x))
All I get is the following error:
Error in UseMethod("tbl_vars") : no applicable method for 'tbl_vars' applied to an object of class "c('double', 'numeric')"
How would I go about to convert the for loop into an equivalent using map?
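For context: map() applied to a data frame iterates over its columns, so each `.` arrives as a bare numeric vector rather than a one-row tibble, which is exactly what the tbl_vars error complains about. A minimal row-wise sketch, mapping over row indices instead (and assuming the rowNum.x column the self-join produces):
library(dplyr)
library(purrr)
# iterate over row indices; mapping over df.tbl itself iterates over columns
twins <- map(seq_len(nrow(df.tbl)), ~
  inner_join(df.tbl, df.tbl[.x, ],
             by = c("date", "forge", "serNum",
                    "PinMain", "PinMainNumber", "Pos")) %>%
    pull(rowNum.x))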
My original data look like this:
> head(df.tbl, 3)
# A tibble: 3 x 22
rowNum date forge serNum PinMain PinMainNumber Pos FrontBack flow Sharped SV OP max min mean
<dbl> <date> <chr> <fct> <fct> <fct> <fct> <fct> <chr> <fct> <fct> <chr> <dbl> <dbl> <dbl>
1 1 2017-10-18 NA 179 Pin 1 W F NA 3 36237 235 77.7 55.3 64.7
2 2 2017-10-18 NA 179 Pin 2 W F NA 3 36237 235 77.5 52.1 67.4
3 3 2017-10-18 NA 179 Pin 3 W F NA 3 36237 235 79.5 58.6 69.0
# ... with 7 more variables: median <dbl>, sd <dbl>, Round2 <dbl>, Round4 <dbl>, OrigData <list>, dataSize <int>,
# fileName <chr>
and I would like a list with the same length as nrow(df.tbl), looking like this:
> twins
[[1]]
[1] 1 7
[[2]]
[1] 2 8
[[3]]
[1] 3 9
Almost all objects have one twin/duplicate (as above), but a few have two or even three duplicates (as defined above, i.e. columns 2:7 are the same).

A bit late to the party, but you can do it much more neatly with nest().
df.tbl1 <- df.tbl %>% group_by(date, forge, serNum, PinMain, PinMainNumber, Pos) %>% nest(rowNum)
The twins will be in the list of tibbles created by nest.
df.tbl1$data
# [[1]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 1
# 2 7
#[[2]]
# A tibble: 2 x 1
# rowNum
# <dbl>
# 1 2
# 2 8
# etc
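A version note: in tidyr 1.0.0 and later, nest() takes the columns to nest via a named argument, so a version-safe sketch of the same idea (assuming only the six key columns and rowNum are needed) would be:
library(dplyr)
library(tidyr)
# nest rowNum within each unique combination of the six key columns
df.tbl1 <- df.tbl %>%
  select(date, forge, serNum, PinMain, PinMainNumber, Pos, rowNum) %>%
  nest(data = rowNum)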

Do you really need to solve it with map?
I would solve it by combining duplicated() and semi_join() from dplyr, like this:
defining_columns <- c("date", "forge", "serNum", "PinMain", "PinMainNumber", "Pos")
dplyr::semi_join(
  df.tbl,
  df.tbl[duplicated(df.tbl[defining_columns]), ],
  by = defining_columns
) %>%
  group_by_at(defining_columns) %>%
  arrange(.by_group = TRUE) %>%
  summarise(twins = paste0(rowNum, collapse = ",")) %>%
  pull(twins) %>%
  strsplit(",")
duplicated() flags the rows that repeat an earlier combination, and semi_join() keeps only the rows of x that have a match in y, so every member of a duplicated group (including its first occurrence) survives.
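One small note: strsplit() returns character vectors, so if you want numeric row indices like in the desired output, add a final conversion (assuming the pipeline's result is stored in twins):
# strsplit() yields character vectors; convert them to numeric indices
twins <- lapply(twins, as.numeric)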
Hope this helps!!

Reordering columns using common names - dplyr

My data comes from a database which, depending on when I run my SQL query, could contain different values for POS from one week to the next.
Not knowing which values will be in a variable makes it very hard to automate the creation of a report.
My data looks as follows:
sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))
I need to pivot this data frame wider so that there's a column for each POS by cost (gross & net).
This is easily achieved using pivot_wider:
x <- sample %>% pivot_wider(names_from = POS, values_from = c(GROSS_COST,NET_COST))
Objective
I would like to be able to keep the columns for each POS together, i.e. GROSS_COST_Hospital and NET_COST_Hospital would be side by side, and similarly for all other POS columns.
Is there an elegant way to group columns using string matching?
Unfortunately, I don't think there is a direct solution to this (yet!). See https://github.com/tidyverse/tidyr/issues/839 .
For now, you can get the data in long format so you can control the ordering the way you want.
library(tidyr)
sample %>%
  pivot_longer(cols = c(GROSS_COST, NET_COST)) %>%
  pivot_wider(names_from = c(name, POS), values_from = value)
# DRUG GROSS_COST_Hosp… NET_COST_Hospit… GROSS_COST_Phys… NET_COST_Physic…
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 A 50 45 100 80
#2 B NA NA NA NA
# … with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>
We can do the ordering in the select step:
library(dplyr)
library(tidyr)
library(stringr)
sample %>%
  pivot_wider(names_from = POS, values_from = c(GROSS_COST, NET_COST)) %>%
  select(DRUG, names(.)[-1][order(str_extract(names(.)[-1], '[^_]+$'))])
# A tibble: 2 x 7
# DRUG GROSS_COST_Home NET_COST_Home GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physician NET_COST_Physician
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A NA NA 50 45 100 80
#2 B 60 40 NA NA NA NA
A data.table option using dcast + melt:
library(data.table)
dcast(melt(setDT(sample), id.vars = c("DRUG", "POS")), DRUG ~ variable + POS)
   DRUG GROSS_COST_Home GROSS_COST_Hospital GROSS_COST_Physician NET_COST_Home NET_COST_Hospital NET_COST_Physician
1:    A              NA                  50                  100            NA                45                 80
2:    B              60                  NA                   NA            40                NA                 NA
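Note that DRUG ~ variable + POS groups the resulting columns by cost type rather than by POS. If each POS's columns should sit together, one option (at the cost of POS-first column names) is to swap the terms on the right-hand side of the formula:
# POS-major ordering; columns come out as Home_GROSS_COST, Home_NET_COST, ...
dcast(melt(setDT(sample), id.vars = c("DRUG", "POS")), DRUG ~ POS + variable)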
With the advent of tidyr 1.2.0, the issue is finally resolved; you can do this directly using the names_vary argument:
library(tidyr)
sample <- data.frame(DRUG = c("A","A","B"),POS = c("Hospital","Physician","Home"),GROSS_COST = c(50,100,60), NET_COST = c(45,80,40))
sample %>%
  pivot_wider(names_from = POS, values_from = c(GROSS_COST, NET_COST), names_vary = 'slowest')
#> # A tibble: 2 x 7
#> DRUG GROSS_COST_Hospital NET_COST_Hospital GROSS_COST_Physi~ NET_COST_Physic~
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A 50 45 100 80
#> 2 B NA NA NA NA
#> # ... with 2 more variables: GROSS_COST_Home <dbl>, NET_COST_Home <dbl>
Created on 2022-02-18 by the reprex package (v2.0.1)

Filtering specific column in list of dataframes with map by index in R

I have a list of dataframes (test), each with one numeric column whose name starts with "total_", one column with names, and one holding years.
I want to find the cases (names) where the total is missing in one specific year but is there in the next.
I experimented with str_detect(names) and contains(), tried using the column index to address the "total" column, and tried filtering by year and is.na() for the missings, but I cannot figure it out.
lapply(test, function(x) filter(x[[x]][,1], is.na(.) & year == 1820))
map(test, ~filter(.x, sum(is.na(.x)), year == 1820))
map(test, ~filter_at(.x, sum(is.na(starts_with("total")))))
I just cannot figure out how to filter by index with multiple conditions and then use map or lapply.
Uhhh very bad example: (I know this is just three times the same dataframe, but it should do the trick).
library(tidyr)  # for expand()

year <- c(1820, 1821, 1822)
names <- c("A", "B", "C")
df <- data.frame(year, names)
df <- expand(df, year, names)
df$total_1 <- c(NA, 1, 2, 1, 2, 3, NA, 2, 3)
l <- list(df, df, df)
And here is what I want
[[1]]
# A tibble: 9 x 3
year names total_1
<dbl> <chr> <dbl>
1 1820 A NA
7 1822 A NA
[[2]]
# A tibble: 9 x 3
year names total_1
<dbl> <chr> <dbl>
1 1820 A NA
7 1822 A NA
[[3]]
# A tibble: 9 x 3
year names total_1
<dbl> <chr> <dbl>
1 1820 A NA
7 1822 A NA
So you cannot use the column name total_1 directly since it can be anything which has the word 'total' in it?
Maybe something like this helps?
library(dplyr)
library(purrr)
map(l, ~.x %>% filter(if_any(contains('total'), ~is.na(.) & lead(!is.na(.)))))
#[[1]]
# A tibble: 2 x 3
# year names total_1
# <dbl> <chr> <dbl>
#1 1820 A NA
#2 1822 A NA
#[[2]]
# A tibble: 2 x 3
# year names total_1
# <dbl> <chr> <dbl>
#1 1820 A NA
#2 1822 A NA
#[[3]]
# A tibble: 2 x 3
# year names total_1
# <dbl> <chr> <dbl>
#1 1820 A NA
#2 1822 A NA
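One caveat: lead() runs straight down the column, and in the expand()-ed data the rows alternate between names within each year, so "the next" row belongs to a different name. If the intent is "missing in one year but present for the same name in the next year", a grouped sketch under that assumption would be (note it then drops the 1822 rows, since A has no following year):
library(dplyr)
library(purrr)
map(l, ~ .x %>%
      group_by(names) %>%
      arrange(year, .by_group = TRUE) %>%
      filter(if_any(contains("total"), ~ is.na(.) & !is.na(lead(.)))) %>%
      ungroup())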

dplyr `across()` function and data frame length while grouping

packageVersion("dplyr")
#[1] ‘0.8.99.9002’
Please note that this question uses dplyr's new across() function. To install the latest dev version of dplyr, issue the remotes::install_github("tidyverse/dplyr") command. To restore the released version of dplyr, issue the install.packages("dplyr") command. If you are reading this at some point in the future and are already on dplyr 1.X+, you won't need to worry about this note.
library(tidyverse)
df <- tibble(Date = c(rep(as.Date("2020-01-01"), 3),
                      rep(as.Date("2020-02-01"), 2)),
             Type = c("A", "A", "B", "C", "C"),
             col1 = 1:5,
             col2 = c(0, 8, 0, 3, 0),
             col3 = 25:29,
             colX = rep(99, 5))
#> # A tibble: 5 x 6
#> Date Type col1 col2 col3 colX
#> <date> <chr> <int> <dbl> <int> <dbl>
#> 1 2020-01-01 A 1 0 25 99
#> 2 2020-01-01 A 2 8 26 99
#> 3 2020-01-01 B 3 0 27 99
#> 4 2020-02-01 C 4 3 28 99
#> 5 2020-02-01 C 5 0 29 99
I'd like to sum columns 1 through X above row-wise, grouped by "Date" and "Type". I will always start at the third column (ie col1), but will never know the numerical value of X in colX. That's OK because I can use the length of the data frame to determine how far I need to go 'out' to capture all columns until the end of the data frame. Here's my approach:
df %>%
  group_by(Date, Type) %>%
  summarize(across(3:length(.)), sum())
#> Error: Problem with `summarise()` input `..1`.
#> x Can't subset columns that don't exist.
#> x Locations 5 and 6 don't exist.
#> i There are only 4 columns.
#> i Input `..1` is `across(3:length(.))`.
#> i The error occured in group 1: Date = 2020-01-01, Type = "A".
#> Run `rlang::last_error()` to see where the error occurred.
But it seems my usage of the base R length(.) function is improper. Am I using dplyr's new across() function in the right manner? How can I get the length of the data frame in the portion of the pipe where I need it? I'll never know how many columns there are to the end, nor are the actual names nearly as clean as my example data frame.
packageVersion("dplyr")
#[1] ‘0.8.99.9002’
First, you just have a little problem with your syntax: the column selection and the function both go inside the across() call.
df %>% summarize(across(3:length(.), sum))
## A tibble: 1 x 4
# col1 col2 col3 colX
# <int> <dbl> <int> <dbl>
#1 15 11 135 495
The following code does not work because across() only sees the columns that are not being group_by-ed on, while `.` is still the full six-column data frame, so 3:length(.) asks for positions 5 and 6 among the four remaining columns.
df %>%
  group_by(Date, Type) %>%
  summarize(across(3:length(.), sum))
#Error: Problem with `summarise()` input `..1`.
#x Can't subset columns that don't exist.
#x Locations 5 and 6 don't exist.
#ℹ There are only 4 columns.
This is obvious when you try the following:
df %>%
  group_by(Date, Type) %>%
  summarize(across(everything(), sum))
## A tibble: 3 x 6
## Groups: Date [2]
# Date Type col1 col2 col3 colX
# <date> <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A 3 8 51 198
#2 2020-01-01 B 3 0 27 99
#3 2020-02-01 C 9 3 57 198
Other options include the starts_with() tidy-select helper.
df %>%
  group_by(Date, Type) %>%
  summarize(across(starts_with("col"), sum))
## A tibble: 3 x 6
## Groups: Date [2]
# Date Type col1 col2 col3 colX
# <date> <chr> <int> <dbl> <int> <dbl>
#1 2020-01-01 A 3 8 51 198
#2 2020-01-01 B 3 0 27 99
#3 2020-02-01 C 9 3 57 198
The row-wise and column-wise vignettes are pretty good. The row-wise one actually discusses how group_by columns are subset.
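If the positional intent ("column 3 to the end, names unknown") must be kept literally, one workaround is to capture the names before grouping and pass them to across() with all_of(); a sketch under that assumption:
# grab "column 3 onward" from the ungrouped frame, then select by name
cols <- names(df)[3:ncol(df)]
df %>%
  group_by(Date, Type) %>%
  summarize(across(all_of(cols), sum))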

Cannot use multi word variables in dplyr or am I missing something?

Why doesn't dplyr like this format of 'beta linalool' in my function as compared to beta.linalool?
It took me a few hours of troubleshooting to figure out what the problem was. Is there any way to use data where variables are labeled as more than one word or should I just move everything to the beta.linalool type format?
Everything I have learned has been from Programming with dplyr.
library(ggplot2)
library(readxl)
library(dplyr)
library(magrittr)
Data3<- read_excel("Desktop/Data3.xlsx")
Data3 %>%
  filter(Variety == "CS 420A" & `Red Blotch` == "-") %>%
  group_by(`Time Point`) %>%
  summarise(m = mean(`beta linalool`), SD = sd(`beta linalool`))
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End 0.00300 0.000117
2 Mid 0.00385 0.000353
3 Must 0.000254 0.00000633
4 Start 0.000785 0.000283
Now when I work it into a function:
cwine <- function(df, v, rb, c){
  c <- enquo(c)
  df %>%
    filter(Variety == v & `Red Blotch` == rb) %>%
    group_by(`Time Point`) %>%
    summarise_(m = mean(!!c), SD = sd(!!c))
}
cwine(Data3,"CS 420A","-",'beta linalool')
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End NA NA
2 Mid NA NA
3 Must NA NA
4 Start NA NA
Warning messages:
1: In mean.default(~"beta linalool") :
argument is not numeric or logical: returning NA #this statement is repeated 4 more times
5: In var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
NAs introduced by coercion #this statement is repeated 4 more times
The problem lies in beta linalool being passed as the quoted string 'beta linalool'. I figured this out by trying the same approach on the iris dataset and noticing that Petal.Length is passed as a bare name, not as a quoted string:
my_function <- function(ds, x, y, c){
  c <- enquo(c)
  ds %>%
    filter(Sepal.Length > x & Sepal.Width < y) %>%
    group_by(Species) %>%
    summarise(m = mean(!!c), SD = sd(!!c))
}
my_function(iris, 5, 4, Petal.Length)
# A tibble: 3 x 3
Species m SD
<fct> <dbl> <dbl>
1 setosa 1.53 0.157
2 versicolor 4.32 0.423
3 virginica 5.57 0.536
In fact my function works fine on a different variable:
> cwine(Data3, "CS 420A", "-", nerol)
# A tibble: 4 x 3
`Time Point` m SD
<chr> <dbl> <dbl>
1 End 0.000453 0.0000338
2 Mid 0.000659 0.0000660
3 Must 0.000560 0.0000234
4 Start 0.000927 0.0000224
Is dplyr just that sensitive or am I missing something?
One option would be to convert it to a symbol and evaluate it:
library(tidyverse)
cwine <- function(df, v, rb, c){
  df %>%
    filter(Variety == v & `Red Blotch` == rb) %>%
    group_by(`Time Point`) %>%
    summarise(m = mean(!!rlang::sym(c)),
              SD = sd(!!rlang::sym(c)))
}
cwine(Data3,"CS 420A","-",'beta linalool')
# A tibble: 2 x 3
# `Time Point` m SD
# <int> <dbl> <dbl>
#1 2 -2.11 2.23
#2 4 0.0171 NA
Also, passing it as a quosure with enquo() works when the variable name is supplied with backquotes (an unquoted name usually works, but here the space between the words means backquotes are needed for it to be read as a single name):
cwine <- function(df, v, rb, c){
  c1 <- enquo(c)
  df %>%
    filter(Variety == v & `Red Blotch` == rb) %>%
    group_by(`Time Point`) %>%
    summarise(m = mean(!!c1),
              SD = sd(!!c1))
}
cwine(Data3,"CS 420A","-",`beta linalool`)
# A tibble: 2 x 3
# `Time Point` m SD
# <int> <dbl> <dbl>
#1 2 -2.11 2.23
#2 4 0.0171 NA
data
set.seed(24)
Data3 <- tibble(Variety = sample(c("CS 420A", "CS 410A"), 20, replace = TRUE),
                `Red Blotch` = sample(c("-", "+"), 20, replace = TRUE),
                `Time Point` = sample(1:4, 20, replace = TRUE),
                `beta linalool` = rnorm(20))
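For completeness: in current dplyr (1.0+), a plain string column name can be used directly through the .data pronoun, with no sym()/enquo() needed; a sketch against the same Data3:
cwine <- function(df, v, rb, c){
  df %>%
    filter(Variety == v & `Red Blotch` == rb) %>%
    group_by(`Time Point`) %>%
    # .data[[c]] looks the column up by its string name
    summarise(m = mean(.data[[c]]), SD = sd(.data[[c]]))
}
cwine(Data3, "CS 420A", "-", "beta linalool")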

How to rename a whitespace contained column names with look up table using dplyr [duplicate]

This question already has an answer here:
Rename multiple columns given character vectors of column names and replacement [duplicate]
(1 answer)
Closed 5 years ago.
I have the following data frame. The column names to be replaced contain whitespace, so this is different from the previous post.
library(tidyverse)
dat <- tribble(
  ~group,    ~y,    ~`ARE(NR)/LNCAP-AR-ChIP-Seq(GSE27824)/Homer Best Motif log-odds Score`, ~`Znf263/K562-Znf263-ChIP-Seq/Homer Best Motif log-odds Score`,
  "group_1", "foo", 10,  3,
  "group_2", "bar", 700, 4,
  "group_2", "qux", 150, 5
)
dat
#> # A tibble: 3 x 4
#> group y
#> <chr> <chr>
#> 1 group_1 foo
#> 2 group_2 bar
#> 3 group_2 qux
#> # ... with 2 more variables: `ARE(NR)/LNCAP-AR-ChIP-Seq(GSE27824)/Homer
#> # Best Motif log-odds Score` <dbl>, `Znf263/K562-Znf263-ChIP-Seq/Homer
#> # Best Motif log-odds Score` <dbl>
lookup_dat <- tribble(
  ~old, ~new,
  'ARE(NR)/LNCAP-AR-ChIP-Seq(GSE27824)/Homer Best Motif log-odds Score', 'ARE',
  'Znf263/K562-Znf263-ChIP-Seq/Homer Best Motif log-odds Score', 'Znf263'
)
And a lookup table for converting column names. If a column name in dat is not contained in lookup_dat$old, that column name should be retained.
lookup_dat
#> # A tibble: 2 x 2
#> old
#> <chr>
#> 1 ARE(NR)/LNCAP-AR-ChIP-Seq(GSE27824)/Homer Best Motif log-odds Score
#> 2 Znf263/K562-Znf263-ChIP-Seq/Homer Best Motif log-odds Score
#> # ... with 1 more variables: new <chr>
The final new data frame I hope to get is this:
group y ARE Znf263
group_1 foo 10 3
group_2 bar 700 4
group_2 qux 150 5
How can I do that?
I tried with this, but with error:
> dat %>%
+ rename_(.dots=with(lookup_dat, setNames(as.list(as.character(old)), new)))
Error in parse(text = x) : <text>:1:43: unexpected symbol
1: ARE(NR)/LNCAP-AR-ChIP-Seq(GSE27824)/Homer Best
^
One could also use tidyr to gather up all the troublesome column names, then merge with the lookup table, and spread using the new column names:
dat.long <- gather(dat, column, value, -group, -y) %>%
  left_join(lookup_dat, by = c(column = 'old')) %>%
  select(-column) %>%
  spread(new, value)
# A tibble: 3 × 4
group y ARE Znf263
* <chr> <chr> <dbl> <dbl>
1 group_1 foo 10 3
2 group_2 bar 700 4
3 group_2 qux 150 5
Use rename() with UQS (or !!!): setNames(lookup_dat$old, lookup_dat$new) creates a named vector mapping old names to new names, and !!! splices the vector as separate arguments to rename():
rename(dat, !!!setNames(lookup_dat$old, lookup_dat$new))
# A tibble: 3 x 4
# group y ARE Znf263
# <chr> <chr> <dbl> <dbl>
#1 group_1 foo 10 3
#2 group_2 bar 700 4
#3 group_2 qux 150 5
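For what it's worth, current dplyr can also express this with any_of() and a named vector, which additionally skips lookup entries whose old names are absent from dat (matching the "retain if not in the lookup" requirement):
# vector names are the new column names, values are the old ones;
# any_of() silently ignores old names that dat doesn't have
rename(dat, any_of(setNames(lookup_dat$old, lookup_dat$new)))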
