I have a list of records that I need to dedupe. They look like combinations of the same set of values, but the regular functions for deduplicating records do not work because the two columns are not exact duplicates of each other. Below is a reproducible example.
df <- data.frame(A = c("2","2","2","43","43","43","331","391","481","490","501","501","501","502","502","502"),
                 B = c("43","501","502","2","501","502","491","496","490","481","2","43","502","2","43","501"))
Below is the desired output that I'm looking for.
df_Final <- data.frame(A = c("2","2","2","331","391","481"),
                       B = c("43","501","502","491","496","490"))
I guess the idea is that you want to find where the elements in column A first appear in column B:
idx = match(df$A, df$B)
and keep the row if the element in A isn't in B at all (is.na(idx)) or the element in A occurs before its first occurrence in B (seq_along(idx) < idx):
df[is.na(idx) | seq_along(idx) < idx,]
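Applied to the example data, this keeps rows 1, 2, 3, 7, 8, and 9, which matches df_Final:
    A   B
1   2  43
2   2 501
3   2 502
7 331 491
8 391 496
9 481 490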
Maybe a more-or-less literal tidyverse approach to this would be to create and then drop a temporary column:
library(tidyverse)
df %>%
  mutate(idx = match(A, B)) %>%
  filter(is.na(idx) | seq_along(idx) < idx) %>%
  select(-idx)
You can remove all rows that would be duplicates under some reordering of the columns with:
require(dplyr)
df %>%
  apply(1, sort) %>% t %>%
  data.frame %>%
  group_by_all %>%
  slice(1)
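Note that this keeps one row per unordered pair, so for the example it returns 9 rows rather than the 6 in df_Final (e.g. (43, 501) survives as its own pair), and it sorts the values within each row. If per-pair deduplication is the interpretation you want but you'd rather preserve the original orientation of each first occurrence, here is a base R sketch of the same idea:
# build a canonical (sorted) key per row, then keep only the first
# occurrence of each unordered pair; pmin/pmax compare element-wise
key <- paste(pmin(df$A, df$B), pmax(df$A, df$B))
df[!duplicated(key), ]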
I have a data.frame which maps an id column to a group column, and the id column is not unique because the same id can map to multiple groups:
set.seed(1)
df <- data.frame(id = paste0("id", sample(1:10,300,replace = T)), group = c(rep("A",100), rep("B",100), rep("C",100)), stringsAsFactors = F)
I'd like to convert this data.frame into a list where each element is the ids in each group.
This seems a bit slow for the size of data I'm working with:
library(dplyr)
df.list <- lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id)
So I was thinking about this:
df.list <- df %>%
  dplyr::group_by(group) %>%
  dplyr::group_split()
Assuming it is faster than my first option, any idea how to get it to return the same output as in the first option rather than a list of data.frames?
Using base R only with split. It should be faster than the == comparison against each unique value:
with(df, split(id, group))
Or, with tidyverse, we can pull the column after the group_split. group_split returns a list of data.frames/tibbles and could be slower than the split-only method above. But here we can make a performance improvement by removing the group column (keep = FALSE; renamed .keep in dplyr >= 1.0) and then, within the list, pulling the 'id' column to create the list of vectors:
library(dplyr)
library(purrr)
df %>%
  group_split(group, keep = FALSE) %>%
  map(~ .x %>% pull(id))
Or use {} with the pipe:
df %>%
  {split(.$id, .$group)}
Or wrap it in with:
df %>%
  with(., split(id, group))
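If you want to check the speed claims on your own data, here is a quick benchmark sketch (assuming the bench package is installed; check = FALSE because the list names differ between approaches even though the contents match):
library(bench)
bench::mark(
  filter = lapply(unique(df$group), function(g) dplyr::filter(df, group == g)$id),
  split  = with(df, split(id, group)),
  check = FALSE
)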
I'm trying to filter across a list of dataframes on a specific column. Typically, across a single dataframe using dplyr, I would use:
#creating dataframe
df <- data.frame(a = 0:10, d = 10:20)
# filtering column a for rows greater than 7
df %>% filter(a > 7)
I've tried doing this across a list using the following:
# creating list
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(c = 11:20, d = 21:30),
          data.frame(e = 15:25, f = 35:45))
# selecting the appropriate column and trying to filter
# this is not working
x[1][[1]][1] %>% lapply(. %>% {filter(. > 2)})
# however, if I use the min() function it works
x[1][[1]][1] %>% lapply(. %>% {min(.)})
I find the %>% syntax quite easy to understand and work with. However, in this case, selecting a specific column and doing something quite simple like filtering is not working. I'm guessing map could be equally useful here. Any help is appreciated.
You can use filter_at to refer to a column by position:
library(dplyr)
purrr::map(x, ~.x %>% filter_at(1, any_vars(. > 7)))
In filter, you can subset the column by position and use it directly:
purrr::map(x, ~.x %>% filter(.[[1]] > 7))
In base R, that would be:
lapply(x, function(y) y[y[[1]] > 7, ])
It seems you are interested in checking the condition on the first column of each dataframe in your list.
One solution using dplyr would be
lapply(x, function(df) {df %>% filter_at(1, ~. > 7)})
The 1 in filter_at indicates that I want to check the condition on the first column (1 is a positional index) of each dataframe in the list.
EDIT
After the discussion in the comments, I propose the following solution
lapply(x, function(df) {df %>% filter(a > 7) %>% select(a) %>% slice(1)})
Input data
x <- list(data.frame(a = 0:10, b = 10:20),
          data.frame(a = 11:20, b = 21:30),
          data.frame(a = 15:25, b = 35:45))
Output
[[1]]
a
1 8
[[2]]
a
1 11
[[3]]
a
1 15
Using filter with across
library(dplyr)
library(purrr)
map(x, ~ .x %>%
      filter(across(names(.)[1], ~ . > 7)))
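In more recent dplyr (1.0.4+), if_any()/if_all() are the intended replacements for across() inside filter(); a sketch under that assumption, using the same positional selection:
# dplyr >= 1.0.4: if_any() replaces across() inside filter()
map(x, ~ .x %>% filter(if_any(1, ~ . > 7)))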
I have a data.frame() in R which contains 3 columns:
id <- c(12312, 12312, 12312, 48373, 345632, 223452)
id2 <- c(1928277, 17665363, 8282922, 82827722, 1231233, 12312333)
description <- c("Positive", "Negative", "Indetermined", "Positive", "Negative", "Positive")
df <- data.frame(id, id2, description)
I want to delete the rows whose id is duplicated and whose description has the value Indetermined.
This seems like a problem for filter(), so:
library(dplyr)
df %>%
  mutate(count = 1) %>%                                    # helper column for counting ids
  group_by(id) %>%
  mutate(count = sum(count), Duplicate = count > 1) %>%    # count how often each id occurs and mark duplicates
  ungroup() %>%
  filter(!(Duplicate & description == "Indetermined")) %>% # drop only duplicates that are "Indetermined"
  select(-count, -Duplicate)                               # remove the helper columns
Not the best approach, but this should do the trick.
(d <- tibble(id, id2, description))
dup <- d$id %in% d$id[duplicated(d$id)]        # ids that occur more than once
d[!(dup & d$description == "Indetermined"), ]  # drop only duplicated Indetermined rows
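With the sample data, both approaches keep everything except the duplicated-id Indetermined row, so the result should look something like:
      id      id2 description
1  12312  1928277    Positive
2  12312 17665363    Negative
3  48373 82827722    Positive
4 345632  1231233    Negative
5 223452 12312333    Positive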
I can easily slice the 1st half (or any other percentage) of a data frame using:
library(dplyr)
df <- data.frame(x = 1:10)
df %>%
  slice(seq(0.5 * n()))
However, how can I slice the 2nd half of my data frame?
With negative indices
library(dplyr)
df <- data.frame(x = 1:10)
df %>%
  slice(-seq(0.5 * n()))
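For the 10-row example this drops rows 1 through 5, leaving the rows with x = 6 through 10:
   x
1  6
2  7
3  8
4  9
5 10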
slice() can do two things: keep rows if you give it positive row numbers, or drop rows if you give it negative row numbers. You can use either of these to grab the second half of your dataframe:
# Keeping later rows
df %>% slice(seq(n()/2, n()))
# Dropping earlier rows
df %>% slice(-seq(1, n()/2))
You'll want to be careful if you have an odd number of rows, since n()/2 won't be an integer in those cases. Using seq(0.5 * n()) as in your example could run into this problem too. To be safe, you can be explicit about how to handle the middle cases with floor() and ceiling():
df <- data.frame(x = 1:11)
# Include row 5
df %>% slice(seq(floor(n()/2), n()))
# Exclude row 5
df %>% slice(seq(ceiling(n()/2), n()))
You can also just slightly modify your seq argument:
df <- data.frame(x = 1:10)
df %>%
  slice(seq(n() * 0.5, n()))
Update per #Kerry Jackson's suggestion:
df %>%
  slice(seq(floor(n() * 0.5) + 1, n()))
If there is an odd number of rows, you'll need to choose how to deal with the middle row.
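On dplyr 1.0.0 or later, slice_tail() with prop expresses this directly; a sketch, assuming that version (prop rounds the row count down, so the middle row of an odd-length frame is excluded):
# dplyr >= 1.0.0: keep the last 50% of rows (rounds down)
df %>% slice_tail(prop = 0.5)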
I have a dataframe with a column consisting of lists. I wish to use mutate to create a new column containing the length of the list in each row, but I'm having trouble doing that.
I've tried
df <- data.frame(a=1:3,b=I(list(1,1:2,1:3)))
df %>% mutate(len = length(b))
but this just sets len to the number of rows in the dataframe (the value of len is 3 for every row). Any help would be greatly appreciated!
Figured it out: use the vectorised lengths()
df %>% mutate(len = lengths(b))
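For the example data this gives one length per row:
  a       b len
1 1       1   1
2 2    1, 2   2
3 3 1, 2, 3   3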
We can unnest and then count within each group to create the new column (note that this returns one row per list element rather than the original three rows):
library(tidyverse)
df %>%
  unnest(b) %>%   # tidyr >= 1.0 wants the columns to unnest named explicitly
  group_by(a) %>%
  mutate(c = n())
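Alternatively, a purrr sketch that keeps the original shape without unnesting:
library(purrr)
# map_int() applies length() to each list element, returning one integer per row
df %>% mutate(len = map_int(b, length))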