R: show ALL rows with duplicated elements in a column [duplicate]

Does a function like this exist in any package?
isdup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)
My intention is to use it with dplyr to display all rows with duplicated values in a given column. I need the first occurrence of the duplicated element to be shown as well.
In this data.frame for instance
dat <- as.data.frame(list(l = c("A", "A", "B", "C"), n = 1:4))
> dat
  l n
1 A 1
2 A 2
3 B 3
4 C 4
I would like to display the rows where column l is duplicated, i.e. those with an A value, by doing:
library(dplyr)
dat %>% filter(isdup(l))
returns
  l n
1 A 1
2 A 2

dat %>% group_by(l) %>% filter(n() > 1)
I don't know if it exists in any package, but since you can implement it easily, I'd say just go ahead and implement it yourself.
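For what it's worth, a packaged version of this does exist: janitor::get_dupes(dat, l) returns all rows with duplicated values in l together with a count column. As a dplyr variation (assuming dplyr 0.8 or later for the name argument), add_count() gives the same result as the group_by() answer while keeping the pipeline flat:
library(dplyr)
dat %>%
  add_count(l, name = "n_l") %>%  # group size per value of l; "n_l" avoids clashing with column n
  filter(n_l > 1) %>%             # keep values of l that occur more than once
  select(-n_l)                    # drop the helper column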


Delete rows based on values in R [duplicate]

Is there a way to delete rows based on values? For example:
df
ColA ColB
A 1
B 2
A 3
Expected output (basically, I know we can delete based on row number, but is there a way to delete based on values such as ("A", 3)?):
df
ColA ColB
A 1
B 2
You can use subset from base R
> subset(df, !(ColA == "A" & ColB == 3))
  ColA ColB
1    A    1
2    B    2
or a data.table solution
> library(data.table)
> setDT(df)[!.("A", 3), on = .(ColA, ColB)]
   ColA ColB
1:    A    1
2:    B    2
An option with filter
library(dplyr)
df %>%
  filter(!(ColA == "A" & ColB == 3))
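The same exclusion can also be written as an anti-join, which scales better when there are many value pairs to drop (a sketch; the lookup table drop is hypothetical and ColB is assumed numeric):
library(dplyr)
drop <- data.frame(ColA = "A", ColB = 3)        # the value pairs to remove
df %>% anti_join(drop, by = c("ColA", "ColB"))  # keep rows with no match in drop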
The easiest way to do this is to use the which() function (?which). You can then use it with a minus sign, in conjunction with indexing, to subset based on particular criteria.
df <- as.data.frame(cbind("ColA"=c("A", "B", "A"), "ColB" = c(1, 2, 3)))
df <- df[-which(df[,2]==3),]
View(df)
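One caveat worth adding to the which() approach: when nothing matches, which() returns a zero-length vector, and negative indexing with integer(0) silently drops every row instead of none. A logical mask avoids that edge case (a sketch on a freshly built df):
df <- data.frame(ColA = c("A", "B", "A"), ColB = c(1, 2, 3))
df[-which(df$ColB == 99), ]              # no match: unexpectedly returns 0 rows
df[!(df$ColA == "A" & df$ColB == 3), ]   # logical mask: safe when nothing matches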

how to filter data by the number of unique values in R [duplicate]

I have some data that I would like to investigate and would like to pull out
all features which have a certain number of unique values, whether that's 2,
5, 10, etc.
I'm not sure how to go about doing this though.
For example:
tst = data.frame(
  a = c(1, 1, 1, 0, 0),
  b = c(1, 2, 3, 3, 3),
  c = c(1, 2, 3, 4, 4),
  d = c(1, 2, 3, 4, 5)
)
tst
tst %>%
  filter(<variables with x unique values>)
where x = 2 would select just column a, x = 3 column b, etc.
You can use select_if with the n_distinct function.
library(dplyr)
tst %>%
  select_if(~ n_distinct(.) == 2)
# a
# 1 1
# 2 1
# 3 1
# 4 0
# 5 0
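On dplyr 1.0.0 or later, select_if() is superseded; the same test is usually written with where():
library(dplyr)
x <- 2
tst %>%
  select(where(~ n_distinct(.x) == x))   # keep columns with exactly x distinct values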
Here is one way in base R:
x <- 2
tst[, sapply(tst, function(col) length(unique(col))) == x, drop = FALSE]
This example creates a combined key from columns a through d, flags the duplicated combinations, and returns only the rows whose combination is not duplicated. I hope this is what you were asking for...
tst = data.frame(
  a = c(1, 1, 1, 0, 0),
  b = c(1, 2, 3, 3, 3),
  c = c(1, 2, 3, 4, 4),
  d = c(1, 2, 3, 4, 5)
)
library(dplyr)
library(tidyr)
tst %>%
  unite(new, a, b, c, d, sep = "") %>%
  mutate(duplicate = duplicated(new)) %>%
  filter(!duplicate)
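One caveat not raised in the original answer: with sep = "" the pasted keys can collide, e.g. the pairs (1, 12) and (11, 2) would both become "112". Keeping a separator makes the key unambiguous:
tst %>%
  unite(new, a, b, c, d, sep = "_") %>%   # "_" keeps the combined key unambiguous
  mutate(duplicate = duplicated(new)) %>%
  filter(!duplicate)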

Selecting/subsetting blank and non-blank rows with a vector [duplicate]

I'd like to use a vector featuring blank ("") and non-blank character strings to subset rows so that I end up with a result like in dfgoal.
I've tried using dplyr::select(), but I get an error message (Error: Strings must match column names. Unknown columns: tooth, , head, foot).
I realise I've got a problem in that I want to keep some "" and get rid of others, but I don't know how to resolve it.
Thanks for any help!
# Data
df <- data.frame(avar=c("tooth","","","head","","foot","",""),bvar=c(1:8))
# Vector
veca <- c("tooth","foot")
vecb <- c("")
vecc <- as.vector(rbind(veca,vecb))
vecc <- unique(vecc)
# Attempt
library(dplyr)
df <- df %>% dplyr::select(vecc)
# Goal
dfgoal <- data.frame(avar=c("tooth","","","foot","",""),bvar=c(1,2,3,6,7,8))
I'm not entirely clear on what you're trying to do. I assume you're asking how to select rows where avar %in% veca, including the blank ("") rows that follow each match.
Perhaps something like this, using tidyr::fill() to carry the last non-blank label down into the blank rows?
library(tidyverse)
veca <- c("tooth","foot")
df %>%
  mutate(tmp = ifelse(avar == "", NA, as.character(avar))) %>%
  fill(tmp) %>%
  filter(tmp %in% veca) %>%
  select(-tmp)
#   avar bvar
#1 tooth    1
#2          2
#3          3
#4  foot    6
#5          7
#6          8
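For comparison, here is the same carry-forward idea in base R (a sketch; it assumes the first value of avar is non-blank, otherwise the group index would start at 0):
labels <- df$avar[df$avar != ""]   # "tooth" "head" "foot"
grp    <- cumsum(df$avar != "")    # which label each row belongs to
df[labels[grp] %in% veca, ]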

Selecting rows from a data frame from combinations of lists given by another dataframe [duplicate]

I have a dataframe, dat:
dat <- data.frame(col1 = rep(1:4, 3),
                  col2 = rep(letters[24:26], 4),
                  col3 = letters[1:12])
I want to filter dat on two different columns using ONLY the combinations given by the rows in the data frame filter:
filter <- data.frame(col1 = 1:3, col2 = NA)
lists <- list(list("x", "y"), list("y", "z"), list("x", "z"))
filter$col2 <- lists
So for example, rows containing (1,x) and (1,y), would be selected, but not (1,z),(2,x), or (3,y).
I know how I would do it using a for loop:
# create a frame to drop results in
results <- dat[0, ]
for (f in 1:nrow(filter)) {
  temp_filter <- filter[f, ]
  temp_dat <- dat[dat$col1 == temp_filter[1, 1] &
                    dat$col2 %in% unlist(temp_filter[1, 2]), ]
  results <- rbind(results, temp_dat)
}
Or if you prefer dplyr style:
require(dplyr)
results <- dat[0, ]
for (f in 1:nrow(filter)) {
  temp_filter <- filter[f, ]
  # dplyr::filter, since the data frame named `filter` masks the function
  temp_dat <- dplyr::filter(dat, col1 == temp_filter[1, 1] &
                                 col2 %in% unlist(temp_filter[1, 2]))
  results <- rbind(results, temp_dat)
}
results should return
  col1 col2 col3
1    1    x    a
5    1    y    e
2    2    y    b
6    2    z    f
3    3    z    c
7    3    x    g
I would normally do the filtering using a merge, but I can't now since I have to check col2 against a list rather than a single value. The for loop works but I figured there would be a more efficient way to do this, probably using some variation of apply or do.call.
A solution using the tidyverse; dat2 is the final output. The idea is to extract the values from the list column of the filter data frame, converting it to a long format (filter2) whose col1 and col2 columns match the corresponding columns in dat. Finally, semi_join filters dat to create dat2.
By the way, filter is a pre-defined function in the dplyr package. Since your example uses dplyr, it is better to avoid naming a data frame filter.
library(tidyverse)
filter2 <- filter %>%
  mutate(col2_a = map_chr(col2, 1),
         col2_b = map_chr(col2, 2)) %>%
  select(-col2) %>%
  gather(group, col2, -col1)
dat2 <- dat %>%
  semi_join(filter2, by = c("col1", "col2")) %>%
  arrange(col1)
dat2
  col1 col2 col3
1    1    x    a
2    1    y    e
3    2    y    b
4    2    z    f
5    3    z    c
6    3    x    g
Update
Another way to prepare the filter2 data frame, which does not need to know how many elements are in each list. The rest is the same as in the previous solution.
library(tidyverse)
filter2 <- filter %>%
  rowwise() %>%
  do(data_frame(col1 = .$col1, col2 = flatten_chr(.$col2)))
dat2 <- dat %>%
  semi_join(filter2, by = c("col1", "col2")) %>%
  arrange(col1)
This is doable with a straight-forward join once you get the filter list back to a standard data.frame:
merge(
  dat,
  with(filter, data.frame(col1 = rep(col1, lengths(col2)), col2 = unlist(col2)))
)
#  col1 col2 col3
#1    1    x    a
#2    1    y    e
#3    2    y    b
#4    2    z    f
#5    3    x    g
#6    3    z    c
Arguably, I'd do away with whatever process is creating those nested lists in the first place.
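For clarity, the inner with(filter, ...) call just flattens the list column into a long lookup table of the allowed pairs, which merge() then joins on:
with(filter, data.frame(col1 = rep(col1, lengths(col2)), col2 = unlist(col2)))
#   col1 col2
# 1    1    x
# 2    1    y
# 3    2    y
# 4    2    z
# 5    3    x
# 6    3    z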

How to count with condition how many zeros in a data frame using just one function() in R? [duplicate]

Consider the following reproducible data frame:
col1 <- c(rep("a", times = 5), rep("b", times = 5), rep("c", times = 5))
col2 <- c(0,0,1,1,0,0,1,1,1,0,0,0,0,0,1)
data <- as.data.frame(cbind(col1, col2))
Now data is a 15 x 2 data frame (note that cbind() first builds a character matrix, so both columns end up as character). I want to count how many zeros there are, with the condition that I only count the rows of a's. I use table():
table <- table(data$col2[data$col1 == "a"])
table[names(table) == 0]
This works just fine and the result is 3.
But my real data has 100,000 observations and 12 different values of col1, so I want to make a function so that I don't have to type the lines above 12 times.
countzero <- function(row){
  table <- table(data$col2[data$col1 == "row"])
  result <- table[names(table) == 0]
  return(result)
}
I expected that when I run countzero(row = a) it would return 3 as well, but instead it returns 0, and also 0 for b and c.
For my real data, it returns
numeric(0)
which I have no idea why.
Anyone could help me out please?
EDIT: To all the answers showing how to count the zeros for every value of col1: they all work fine, but my purpose is to build a function that returns the count for one specific col1 value, e.g. just the a's, because that count will be used later to compute other things (the percentage of 0's among all a's, for example).
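For what it's worth, the immediate bug in the posted function is the quoting: data$col1 == "row" compares col1 against the literal string "row" rather than the argument, and the call countzero(row = a) also passes the undefined object a instead of the string "a". A minimal fix (a sketch, keeping data as a global like the original) could look like:
countzero <- function(value) {
  # compare col1 against the argument's value, not a literal string
  sum(data$col1 == value & data$col2 == 0)
}
countzero("a")   # 3
countzero("b")   # 2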
1) aggregate Try aggregate:
aggregate(col2 == 0 ~ col1, data, sum)
giving:
  col1 col2 == 0
1    a         3
2    b         2
3    c         4
2) table or try table (omit the [,1] if you want the counts of 1's too):
table(data)[, 1]
giving:
a b c
3 2 4
We can use data.table, which would be efficient:
library(data.table)
setDT(data)[col2==0, .N, col1]
#   col1 N
#1:    a 3
#2:    b 2
#3:    c 4
Or with dplyr
library(dplyr)
data %>%
  filter(col2 == 0) %>%
  count(col1)
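And since the edit mentions wanting percentages later on, the grouped version extends naturally (a dplyr sketch; mean() of a logical vector gives the proportion of TRUEs):
library(dplyr)
data %>%
  group_by(col1) %>%
  summarise(zeros    = sum(col2 == 0),    # count of zeros per group
            pct_zero = mean(col2 == 0))   # proportion of zeros per group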
