Efficient way to remove dups in data frame but determining the row that stays randomly - r

I'm looking for the most compact and efficient way to find duplicates in a data frame based on a single variable (user_ID), randomly keeping one row and deleting the others. I've been using something like this:
dupIDs <- user_db$user_ID[duplicated(user_db$user_ID)]
The important part is that I want the user_ID variable to be unique, so whenever there are dups, one should be randomly selected (cannot pick first or last, has to be random). I am looking for a loop-less solution - Thanks!
user_ID, var1, var2
1 3 4
1 5 6
2 7 7
3 8 8
Randomly yielding either:
user_ID, var1, var2
1 5 6
2 7 7
3 8 8
or
user_ID, var1, var2
1 3 4
2 7 7
3 8 8
Thanks in advance!!

Here's one option:
library(data.table)
setDT(df) # convert to data.table in place
set.seed(1)
# select 1 row randomly for each user_ID
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 3 4
#2: 2 7 7
#3: 3 8 8
set.seed(4)
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 5 6
#2: 2 7 7
#3: 3 8 8
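A base R alternative (my sketch, not from the original answer): shuffle the rows first, then `duplicated()` keeps a random row per user_ID, with no loop:

```r
# toy data matching the question
df <- data.frame(user_ID = c(1, 1, 2, 3),
                 var1 = c(3, 5, 7, 8),
                 var2 = c(4, 6, 7, 8))

set.seed(1)
# shuffle rows, then keep the first (now random) occurrence of each user_ID
shuffled <- df[sample(nrow(df)), ]
result   <- shuffled[!duplicated(shuffled$user_ID), ]
```

Because the rows are in random order before `duplicated()` is applied, which row survives for a duplicated user_ID is random.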

Using base functions:
DF <-
read.csv(text=
'user_ID,var1,var2
1,3,4
2,7,7
3,8,8
3,6,7
2,5,5
3,5,6
1,5,6')
# sort the data by user_ID
DF <- DF[order(DF$user_ID),]
# create random sub-indexes for each user_ID
subIdx <- unlist(sapply(rle(DF$user_ID)$lengths,FUN=function(l)sample(1:l,l)))
# order again by user_ID then by sub-index
DF <- DF[order(DF$user_ID,subIdx),]
# remove the duplicate
DF <- DF[!duplicated(DF$user_ID),]
> DF
user_ID var1 var2
7 1 5 6
2 2 7 7
4 3 6 7

Related

How to check if rows in one column present in another column in R

I have a data set = data1 with id and emails as follows:
id emails
1 A,B,C,D,E
2 F,G,H,A,C,D
3 I,K,L,T
4 S,V,F,R,D,S,W,A
5 P,A,L,S
6 Q,W,E,R,F
7 S,D,F,E,Q
8 Z,A,D,E,F,R
9 X,C,F,G,H
10 A,V,D,S,C,E
I have another data set = data2 with check_email as follows:
check_email
A
D
S
V
I want to take from data1 only those id whose emails contain at least one of the check_email values in data2.
My desired output will be:
id
1
2
4
5
7
8
10
I wrote this with a for loop, but it is taking forever because my actual dataset is very large.
Any advice in this regard will be highly appreciated!
You can use regular expressions to subset your data. First collapse everything into one pattern:
paste(data2$check_email, collapse = "|")
# [1] "A|D|S|V"
Then get the row indices where the pattern matches the emails:
grep(paste(data2$check_email, collapse = "|"), data1$emails)
# [1] 1 2 4 5 7 8 10
And then combine everything:
data1[grep(paste(data2$check_email, collapse = "|"), data1$emails), ]
# id emails
# 1 1 A,B,C,D,E
# 2 2 F,G,H,A,C,D
# 3 4 S,V,F,R,D,S,W,A
# 4 5 P,A,L,S
# 5 7 S,D,F,E,Q
# 6 8 Z,A,D,E,F,R
# 7 10 A,V,D,S,C,E
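A caveat worth adding here (my note, not part of the original answer): the single-letter toy values are regex-safe, but real e-mail addresses contain metacharacters such as `.`, so literal matching with `fixed = TRUE` is safer than an unescaped alternation pattern:

```r
# toy data; the address with a dot would break an unescaped regex
data1 <- data.frame(id = 1:2,
                    emails = c("a.b@x.com,c@y.org", "d@z.net"),
                    stringsAsFactors = FALSE)
check <- "a.b@x.com"

# fixed = TRUE matches each address literally, so metacharacters are safe
hits <- Reduce(`|`, lapply(check, function(x)
  grepl(x, data1$emails, fixed = TRUE)))
data1$id[hits]
```

Note that fixed substring matching can still hit partial overlaps (e.g. `a@x.com` inside `ba@x.com`); splitting on commas and comparing with `%in%`, as in the strsplit-based answer, is the fully exact option.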
data1[rowSums(sapply(data2$check_email, function(x) grepl(x, data1$emails))) > 0, "id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10
We can split the elements of the character vector as.character(data1$emails) into substrings, then iterate over this list with sapply, checking whether any substring appears in data2$check_email. Finally we extract those rows from data1:
> emails <- strsplit(as.character(data1$emails), ",")
> ind <- sapply(emails, function(emails) any(emails %in% as.character(data2$check_email)))
> data1[ind,"id", drop = FALSE]
id
1 1
2 2
4 4
5 5
7 7
8 8
10 10

Subset a dataframe according to matches between dataframe column and separate character vector in R

I want to use a character vector to:
Find rows in a dataframe that contain one or more matches to this vector in a comma-delimited list within a column of the dataframe
Subset the dataframe, retaining only the rows with matches
Example data
look<-c("ID1", "ID2", "ID5", "ID9")
df<-data.frame(var1=1:10, var2=3:12, var3=rep(c("","ID1,ID3","ID1,ID9","","")))
df
var1 var2 var3
1 1 3
2 2 4 ID1,ID3
3 3 5 ID1,ID9
4 4 6
5 5 7
6 6 8
7 7 9 ID1,ID3
8 8 10 ID1,ID9
9 9 11
10 10 12
Where the output would look like:
var1 var2 var3
1 2 4 ID1,ID3
2 3 5 ID1,ID9
3 7 9 ID1,ID3
4 8 10 ID1,ID9
A var3 value may match more than one value from the look vector.
Is there a base r solution that doesn't involve using strsplit on the var3 column?
1) Create the appropriate regular expression and perform a grep. As requested this does not use any packages and does not use strsplit:
subset(df, grepl(paste0("\\b", paste(look, collapse = "|"), "\\b"), var3))
giving:
var1 var2 var3
2 2 4 ID1,ID3
3 3 5 ID1,ID9
7 7 9 ID1,ID3
8 8 10 ID1,ID9
1a) Depending on precisely what var3 and look contain, it may be possible to shorten it to just this (but it is less general than the longer version above -- for example, ID1 would also match ID11 here, a problem the prior solution does not have):
subset(df, grepl(paste(look, collapse = "|"), var3))
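A tiny demonstration of that pitfall (my own example): without the `\\b` anchors, ID1 also matches inside ID11:

```r
x <- c("ID1,ID3", "ID11,ID3")

grepl("ID1", x)        # TRUE TRUE  - "ID1" also matches inside "ID11"
grepl("\\bID1\\b", x)  # TRUE FALSE - word boundaries require an exact token
```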
2) If you are willing to relax the strsplit requirement then this still does not use any packages:
subset(df, sapply(strsplit(as.character(var3), ","), function(x) any(x %in% look)))
We can use filter with str_detect in dplyr
library(dplyr)
library(stringr)
df %>%
filter(str_detect(var3, paste(look, collapse="|")))
# var1 var2 var3
# 1 2 4 ID1,ID3
# 2 3 5 ID1,ID9
# 3 7 9 ID1,ID3
# 4 8 10 ID1,ID9
You can do the same in base R with grepl, matching against the look vector (note: just checking for a comma, as in grepl("\\,", df$var3), only happens to work on this example data):
df <- df[grepl(paste(look, collapse = "|"), df$var3), ]
var1 var2 var3
2 2 4 ID1,ID3
3 3 5 ID1,ID9
7 7 9 ID1,ID3
8 8 10 ID1,ID9

Subsetting a data frame to the rows not appearing in another data frame

I have a data frame A with observations
Var1 Var2 Var3
1 3 4
2 5 6
4 5 7
4 5 8
6 7 9
and data frame B with observations
Var1 Var2 Var3
1 3 4
2 5 6
which is basically a subset of A.
Now I want to select observations in A NOT in B, i.e, the data frame C with observations
Var1 Var2 Var3
4 5 7
4 5 8
6 7 9
Is there a way I can do this in R? The data frames I've used are just arbitrary data.
dplyr has a nice anti_join function that does exactly that:
> library(dplyr)
> anti_join(A, B)
Joining by: c("Var1", "Var2", "Var3")
Var1 Var2 Var3
1 6 7 9
2 4 5 8
3 4 5 7
Using sqldf is an option.
require(sqldf)
C <- sqldf('SELECT * FROM A EXCEPT SELECT * FROM B')
One approach could be to paste all the columns of A and B together, limiting to the rows in A whose pasted representation doesn't appear in the pasted representation of B:
A[!(do.call(paste, A) %in% do.call(paste, B)),]
# Var1 Var2 Var3
# 3 4 5 7
# 4 4 5 8
# 5 6 7 9
One obvious downside of this approach is that it assumes two rows with the same pasted representation are in fact identical. Here is a slightly more clunky approach that doesn't have this limitation:
combined <- rbind(B, A)
combined[!duplicated(combined) & seq_len(nrow(combined)) > nrow(B), ]
# Var1 Var2 Var3
# 5 4 5 7
# 6 4 5 8
# 7 6 7 9
Basically I used rbind to append A below B and then limited to rows that are both non-duplicated and that are not originally from B.
Another option:
C <- rbind(A, B)
C[!(duplicated(C) | duplicated(C, fromLast = TRUE)), ]
Output:
Var1 Var2 Var3
3 4 5 7
4 4 5 8
5 6 7 9
Using data.table you could do an anti-join as follows:
library(data.table)
setDT(A)[!B, on = names(A)]
which gives the desired result:
Var1 Var2 Var3
1: 4 5 7
2: 4 5 8
3: 6 7 9

R - Output of aggregate and range gives 2 columns for every column name - how to restructure?

I am trying to produce a summary table showing the range of each variable by group. Here is some example data:
df <- data.frame(group=c("a","a","b","b","c","c"), var1=c(1:6), var2=c(7:12))
group var1 var2
1 a 1 7
2 a 2 8
3 b 3 9
4 b 4 10
5 c 5 11
6 c 6 12
I used the aggregate function like this:
df_range <- aggregate(df[,2:3], list(df$group), range)
Group.1 var1.1 var1.2 var2.1 var2.2
1 a 1 2 7 8
2 b 3 4 9 10
3 c 5 6 11 12
The output looked normal, but the dimensions are 3x3 instead of 3x5 and there are only 3 names:
names(df_range)
[1] "Group.1" "var1" "var2"
How do I get this back to the normal data frame structure with one name per column? Or alternatively, how do I get the same summary table without using aggregate and range?
That is documented behavior: range() returns a length-2 vector, so aggregate() stores each result as a matrix column within the data frame. You can undo the effect with:
newdf <- do.call(data.frame, df_range)
# Group.1 var1.1 var1.2 var2.1 var2.2
#1 a 1 2 7 8
#2 b 3 4 9 10
#3 c 5 6 11 12
dim(newdf)
#[1] 3 5
Here's an approach using dplyr (note that this computes the width max - min rather than the two endpoints):
library(dplyr)
df %>%
group_by(group) %>%
summarise_each(funs(max(.) - min(.)), var1, var2)
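If the goal is one plain column per endpoint, here is a base-R sketch that avoids matrix columns entirely (the .min/.max suffixes are just a naming choice):

```r
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"),
                 var1 = 1:6, var2 = 7:12)

# aggregate min and max separately, then merge into one flat data frame
mins <- aggregate(df[2:3], df["group"], min)
maxs <- aggregate(df[2:3], df["group"], max)
df_range <- merge(mins, maxs, by = "group", suffixes = c(".min", ".max"))
```

Every column of df_range is an ordinary vector, so names() and dim() behave as expected (3 rows, 5 columns).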

Multirow deletion: delete row depending on other row

I'm stuck with a quite complex problem. I have a data frame with three columns: id, info and row. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do is delete all other rows of an id if one of its rows has the info value a. For example, rows 2 and 3 should be removed because row 1's info column contains the value a. Please note that the info values are not ordered (see id 3, rows 5 & 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all id containing an "a"-value
a_val <- data$id[grep("a", data$info)]
# check for every id containing an "a"-value
for(i in a_val) {
temp_data <- data[which(data$id == i),]
# only go on if the given id contains more than one row
if (nrow(temp_data) > 1) {
for (ii in seq_len(nrow(temp_data))) {
if (temp_data$info[ii] != "a") {
temp <- temp_data$row[ii]
if (!exists("delete_rows")) {
delete_rows <- temp
} else {
delete_rows <- c(delete_rows, temp)
}
}
}
}
}
My solution works, but it is very, very, very slow, as the original data contains more than 700k rows and more than 150k rows with an "a" value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. #BenBarnes found out that this solution only works if the data frame is ordered according to id.
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
if df is this (RE Sacha above): note that match() only finds the index of the first occurrence of 'a' in the whole column, so combining it with which(info != 'a') keeps every non-'a' row and does not give the desired per-id result. Applying the any-test per id with ave() does:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# keep 'a' rows, plus all rows of ids that contain no 'a' at all
with(df, df[info == 'a' | !ave(info == 'a', id, FUN = any), ])
  id info row
1  1    a   1
4  2    a   4
6  3    a   6
7  4    b   7
8  4    c   8
Take a look at subset(); it's quite easy to use and will solve your problem.
You just need to specify a condition on the column you want to subset on; you can also combine conditions on several columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html
