Subsetting a data frame to the rows not appearing in another data frame - r

I have a data frame A with observations
Var1 Var2 Var3
1 3 4
2 5 6
4 5 7
4 5 8
6 7 9
and data frame B with observations
Var1 Var2 Var3
1 3 4
2 5 6
which is basically a subset of A.
Now I want to select the observations in A that are NOT in B, i.e., the data frame C with observations
Var1 Var2 Var3
4 5 7
4 5 8
6 7 9
Is there a way I can do this in R? The data frames I've used are just arbitrary data.
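For reference, the two frames can be reconstructed from the tables above (a sketch; column types are assumed numeric):

```r
# The example frames from the question
A <- data.frame(Var1 = c(1, 2, 4, 4, 6),
                Var2 = c(3, 5, 5, 5, 7),
                Var3 = c(4, 6, 7, 8, 9))
B <- A[1:2, ]  # the subset of A shown above
```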

dplyr has a nice anti_join function that does exactly that:
> library(dplyr)
> anti_join(A, B)
Joining by: c("Var1", "Var2", "Var3")
Var1 Var2 Var3
1 6 7 9
2 4 5 8
3 4 5 7

Using sqldf is an option:
library(sqldf)
C <- sqldf('SELECT * FROM A EXCEPT SELECT * FROM B')
Note that EXCEPT is a set operation, so exact duplicate rows in A are collapsed to a single row; that makes no difference for this example, but it differs from an anti-join if A contains repeated rows.

One approach could be to paste all the columns of A and B together, limiting to the rows in A whose pasted representation doesn't appear in the pasted representation of B:
A[!(do.call(paste, A) %in% do.call(paste, B)),]
# Var1 Var2 Var3
# 3 4 5 7
# 4 4 5 8
# 5 6 7 9
One obvious downside of this approach is that it assumes two rows with the same pasted representation are in fact identical. Here is a slightly more clunky approach that doesn't have this limitation:
combined <- rbind(B, A)
combined[!duplicated(combined) & seq_len(nrow(combined)) > nrow(B), ]
# Var1 Var2 Var3
# 5 4 5 7
# 6 4 5 8
# 7 6 7 9
Basically I used rbind to append A below B and then kept only the rows that are both non-duplicated and not originally from B. (Note that this also drops any rows that are exact duplicates within A itself.)
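The paste-collision caveat mentioned above can be made concrete: two rows that differ can share one pasted representation. A minimal illustration with hypothetical values:

```r
# Two rows that differ, yet paste to the same string "a b c",
# so the paste-based comparison would wrongly treat them as equal
r1 <- data.frame(x = "a b", y = "c")
r2 <- data.frame(x = "a",   y = "b c")
do.call(paste, r1)                        # "a b c"
do.call(paste, r2)                        # "a b c"
do.call(paste, r1) == do.call(paste, r2)  # TRUE: a false match
```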

Another option:
C <- rbind(A, B)
C[!(duplicated(C) | duplicated(C, fromLast = TRUE)), ]
Output:
Var1 Var2 Var3
3 4 5 7
4 4 5 8
5 6 7 9
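To see why both duplicated() calls are needed: duplicated() alone leaves the first copy of each repeated row unflagged, and fromLast = TRUE leaves the last copy unflagged, so their union flags every row that occurs more than once. (Like the rbind approach above, this also drops rows duplicated within A itself.) A minimal base R illustration:

```r
# Flag every element that occurs more than once, regardless of position
x <- c(1, 2, 2, 3)
duplicated(x)                                   # FALSE FALSE  TRUE FALSE
duplicated(x, fromLast = TRUE)                  # FALSE  TRUE FALSE FALSE
duplicated(x) | duplicated(x, fromLast = TRUE)  # FALSE  TRUE  TRUE FALSE
```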

Using data.table you could do an anti-join as follows:
library(data.table)
setDT(df1)[!df2, on = names(df1)]
which gives the desired result:
Var1 Var2 Var3
1: 4 5 7
2: 4 5 8
3: 6 7 9
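The data.table answer writes df1 and df2 for A and B; a self-contained version (assuming data.table is installed):

```r
library(data.table)
df1 <- data.frame(Var1 = c(1, 2, 4, 4, 6),
                  Var2 = c(3, 5, 5, 5, 7),
                  Var3 = c(4, 6, 7, 8, 9))
df2 <- df1[1:2, ]
# anti-join: rows of df1 with no match in df2 on all columns
res <- setDT(df1)[!df2, on = names(df1)]
res
```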

Related

How do I add observations to an existing data frame column?

I have a data frame. Let's say it looks like this:
I have simulated some values and put them into a vector c(4,5,8,8). I want to add these simulated values to columns a, b and c.
I have tried rbind or inserting the vector into the existing data frame, but that replaced the existing values with the simulated ones, instead of adding the simulated values below the existing ones.
x <- data.frame("a" = c(2,3,1), "b" = c(5,1,2), "c" = c(6,4,7))
y <- c(4,5,8,8)
This is the output I expect to see:
  a b c
1 2 5 6
2 3 1 4
3 1 2 7
4 4 4 4
5 5 5 5
6 8 8 8
7 8 8 8
Help would be greatly appreciated. Thank you.
Can do:
as.data.frame(sapply(x, function(z) append(z, y)))
a b c
1 2 5 6
2 3 1 4
3 1 2 7
4 4 4 4
5 5 5 5
6 8 8 8
7 8 8 8
An option is assignment:
n <- nrow(x)
x[n + seq_along(y), ] <- y
x
# a b c
#1 2 5 6
#2 3 1 4
#3 1 2 7
#4 4 4 4
#5 5 5 5
#6 8 8 8
#7 8 8 8
Another option is to replicate 'y' and rbind:
rbind(x, `colnames<-`(replicate(ncol(x), y), names(x)))
Or, equivalent to the assignment above, index the new row range directly:
x[(nrow(x) + 1):(nrow(x) + length(y)), ] <- y

Subset a dataframe according to matches between dataframe column and separate character vector in R

I want to use a character vector to:
1. find rows in a data frame that contain one or more matches to this vector in a comma-delimited list within a column of the data frame, and
2. subset the data frame, retaining only the rows with matches.
Example data
look<-c("ID1", "ID2", "ID5", "ID9")
df <- data.frame(var1 = 1:10, var2 = 3:12, var3 = rep(c("", "ID1,ID3", "ID1,ID9", "", ""), 2))
df
var1 var2 var3
1 1 3
2 2 4 ID1,ID3
3 3 5 ID1,ID9
4 4 6
5 5 7
6 6 8
7 7 9 ID1,ID3
8 8 10 ID1,ID9
9 9 11
10 10 12
Where the output would look like:
var1 var2 var3
1 2 4 ID1,ID3
2 3 5 ID1,ID9
3 7 9 ID1,ID3
4 8 10 ID1,ID9
A single var3 value may contain matches to more than one element of the look vector.
Is there a base R solution that doesn't involve using strsplit on the var3 column?
1) Create the appropriate regular expression and perform a grep. As requested, this does not use any packages and does not use strsplit. Note the grouping parentheses, which ensure the \b word boundaries apply to every alternative rather than only the first and last:
subset(df, grepl(paste0("\\b(", paste(look, collapse = "|"), ")\\b"), var3))
giving:
var1 var2 var3
2 2 4 ID1,ID3
3 3 5 ID1,ID9
7 7 9 ID1,ID3
8 8 10 ID1,ID9
1a) Depending on precisely what var3 and look contain, it may be possible to shorten this to the following. It is less general than the longer version above: for example, ID1 here would also match ID11, a problem the prior solution avoids:
subset(df, grepl(paste(look, collapse = "|"), var3))
2) If you are willing to relax the strsplit requirement then this still does not use any packages:
subset(df, sapply(strsplit(as.character(var3), ","), function(x) any(x %in% look)))
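A quick sanity check of the word-boundary pattern: with grouping parentheses around the alternation, \b applies to every ID, so ID1 does not falsely match ID11. (Without the parentheses, the anchors would bind only to the first and last alternatives.)

```r
look <- c("ID1", "ID2", "ID5", "ID9")
pat <- paste0("\\b(", paste(look, collapse = "|"), ")\\b")
grepl(pat, "ID11")     # FALSE: ID1 is not followed by a word boundary
grepl(pat, "ID1,ID3")  # TRUE: the comma is a word boundary
```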
We can use filter with str_detect from dplyr:
library(dplyr)
library(stringr)
df %>%
filter(str_detect(var3, paste(look, collapse="|")))
# var1 var2 var3
# 1 2 4 ID1,ID3
# 2 3 5 ID1,ID9
# 3 7 9 ID1,ID3
# 4 8 10 ID1,ID9
The same idea works in base R with grepl, as in the answers above. Note, however, that the pattern below matches any row whose var3 contains a comma, rather than testing against look; it gives the same result here only because every comma-containing row also happens to contain a match:
df <- df[grepl(",", df$var3), ]
var1 var2 var3
2 2 4 ID1,ID3
3 3 5 ID1,ID9
7 7 9 ID1,ID3
8 8 10 ID1,ID9

R - Output of aggregate and range gives 2 columns for every column name - how to restructure?

I am trying to produce a summary table showing the range of each variable by group. Here is some example data:
df <- data.frame(group=c("a","a","b","b","c","c"), var1=c(1:6), var2=c(7:12))
group var1 var2
1 a 1 7
2 a 2 8
3 b 3 9
4 b 4 10
5 c 5 11
6 c 6 12
I used the aggregate function like this:
df_range <- aggregate(df[,2:3], list(df$group), range)
Group.1 var1.1 var1.2 var2.1 var2.2
1 a 1 2 7 8
2 b 3 4 9 10
3 c 5 6 11 12
The output looked normal, but the dimensions are 3x3 instead of 3x5 and there are only 3 names:
names(df_range)
[1] "Group.1" "var1" "var2"
How do I get this back to the normal data frame structure with one name per column? Or alternatively, how do I get the same summary table without using aggregate and range?
That is the documented behaviour: range returns a length-2 vector, so aggregate stores a two-column matrix inside each of the var1 and var2 columns. You can undo the effect with:
newdf <- do.call(data.frame, df_range)
# Group.1 var1.1 var1.2 var2.1 var2.2
#1 a 1 2 7 8
#2 b 3 4 9 10
#3 c 5 6 11 12
dim(newdf)
#[1] 3 5
Here's an approach using dplyr. Note that it returns the spread (max minus min) of each variable rather than the min/max pair:
library(dplyr)
df %>%
group_by(group) %>%
summarise_each(funs(max(.) - min(.)), var1, var2)
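summarise_each() and funs() are deprecated in current dplyr; a sketch of the same grouping with across(), which can also return the actual min/max pair instead of the difference:

```r
library(dplyr)
df <- data.frame(group = c("a", "a", "b", "b", "c", "c"),
                 var1 = 1:6, var2 = 7:12)
# one min and one max column per variable, per group
rng <- df %>%
  group_by(group) %>%
  summarise(across(c(var1, var2), list(min = min, max = max)))
rng
```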

Efficient way to remove dups in data frame but determining the row that stays randomly

I'm looking for the most compact and efficient way of looking for dups in a dataframe based on a single variable (user_ID) and randomly keeping one and deleting the others. Been using something like this:
dupIDs <- user_db$user_ID[duplicated(user_db$user_ID)]
The important part is that I want the user_ID variable to be unique, so whenever there are dups, one should be randomly selected (cannot pick first or last, has to be random). I am looking for a loop-less solution - Thanks!
user_ID, var1, var2
1 3 4
1 5 6
2 7 7
3 8 8
Randomly yielding either:
user_ID, var1, var2
1 5 6
2 7 7
3 8 8
or
user_ID, var1, var2
1 3 4
2 7 7
3 8 8
Thanks in advance!!
Here's one option:
library(data.table)
setDT(df) # convert to data.table in place
set.seed(1)
# select 1 row randomly for each user_ID
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 3 4
#2: 2 7 7
#3: 3 8 8
set.seed(4)
df[df[, .I[sample(.N, 1)], by = user_ID]$V1]
# user_ID var1 var2
#1: 1 5 6
#2: 2 7 7
#3: 3 8 8
Using base functions:
DF <-
read.csv(text=
'user_ID,var1,var2
1,3,4
2,7,7
3,8,8
3,6,7
2,5,5
3,5,6
1,5,6')
# sort the data by user_ID
DF <- DF[order(DF$user_ID),]
# create random sub-indexes for each user_ID
subIdx <- unlist(sapply(rle(DF$user_ID)$lengths, FUN = function(l) sample(1:l, l)))
# order again by user_ID then by sub-index
DF <- DF[order(DF$user_ID,subIdx),]
# remove the duplicate
DF <- DF[!duplicated(DF$user_ID),]
> DF
user_ID var1 var2
7 1 5 6
2 2 7 7
4 3 6 7
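A shorter base R variant of the same idea, assuming the original row order carries no meaning: shuffle all rows once, then let duplicated() keep whichever copy of each user_ID happens to come first:

```r
DF <- read.csv(text = "user_ID,var1,var2
1,3,4
1,5,6
2,7,7
3,8,8")
set.seed(1)
shuffled <- DF[sample(nrow(DF)), ]               # random row order
picked <- shuffled[!duplicated(shuffled$user_ID), ]  # one random row per ID
picked
```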

subset with pattern

Say I have a data frame df
df <- data.frame( a1 = 1:10, b1 = 2:11, c2 = 3:12 )
I wish to subset the columns by a pattern, something like this (not working code):
df1 <- subset(df, select = (pattern = "1"))
To get
> df1
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Is this possible?
It is possible to do this via
subset(df, select = grepl("1", names(df)))
For automating this as a function, one can use [ to do the subsetting. Couple that with one of R's regular expression functions and you have all you need.
By way of an example, here is a custom function implementing the ideas I mentioned above.
Subset <- function(df, pattern) {
ind <- grepl(pattern, names(df))
df[, ind]
}
Note this does no error checking etc. and just relies upon grepl to return a logical vector indicating which columns match pattern, which is then passed to [ to subset by columns. Applied to your df this gives:
> Subset(df, pattern = "1")
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Same same but different:
df2 <- df[, grep("1", names(df))]
a1 b1
1 1 2
2 2 3
3 3 4
4 4 5
5 5 6
6 6 7
7 7 8
8 8 9
9 9 10
10 10 11
Base R now has a convenience function endsWith():
df[, endsWith(names(df), "1")]
In data.table you can do:
library(data.table)
setDT(df)
df[, .SD, .SDcols = patterns("1")]
# Or more precisely
df[, .SD, .SDcols = patterns("1$")]
In dplyr:
library(dplyr)
select(df, ends_with("1"))
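If the selection needs a full regular expression rather than a suffix, dplyr's matches() helper takes a pattern (a sketch, assuming the df from the question):

```r
library(dplyr)
df <- data.frame(a1 = 1:10, b1 = 2:11, c2 = 3:12)
df1 <- select(df, matches("1$"))  # columns whose names end in "1"
names(df1)
```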
