Subset a data frame with multiple match conditions in R

With the sample data
> df1 <- data.frame(x=c(1,1,2,3), y=c("a","b","a","b"))
> df1
  x y
1 1 a
2 1 b
3 2 a
4 3 b
> df2 <- data.frame(x=c(1,3), y=c("a","b"))
> df2
  x y
1 1 a
2 3 b
I want to remove all the value pairs (x,y) of df2 from df1. I can do it using a for loop over each row in df2 but I'm sure there is a better and simpler way that I just can't think of at the moment. I've been trying to do something starting with the following:
> df1$x %in% df2$x & df1$y %in% df2$y
[1] TRUE TRUE FALSE TRUE
But this isn't what I want as df1[2,] = (1,b) is pulled out for removal. Thank you very much in advance for your help.

Build a set of pairs from df2:
prs <- with(df2, paste(x,y,sep="."))
Test each row of df1, encoded the same way, for membership in the pair set, and negate the test to drop the matching pairs:
df1[ !(paste(df1$x, df1$y, sep=".") %in% prs) , ]
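On the sample data this should leave just the two non-matching rows:
> df1[ !(paste(df1$x, df1$y, sep=".") %in% prs) , ]
  x y
2 1 b
3 2 a
(Note the chosen separator must not itself occur in the data, or distinct pairs could collide.)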

You could go the other way around: rbind everything and remove all rows that appear twice (this assumes the rows of df2 are themselves unique and all occur in df1):
out <- rbind(df1, df2)
out[!duplicated(out, fromLast=TRUE) & !duplicated(out), ]
  x y
2 1 b
3 2 a

How to rank rows by two columns at once in R?

Here is the code to rank based on column v2:
x <- data.frame(v1 = c(2,1,1,2), v2 = c(1,1,3,2))
x$rank1 <- rank(x$v2, ties.method='first')
But I really want to rank based on both v2 and/then v1 since there are ties in v2. How can I do that without using RPostgreSQL?
How about:
within(x, rank2 <- rank(order(v2, v1), ties.method='first'))
#   v1 v2 rank1 rank2
# 1  2  1     1     2
# 2  1  1     2     1
# 3  1  3     4     4
# 4  2  2     3     3
order works, but for manipulating data frames, also check out the plyr and dplyr packages:
> library(dplyr)
> arranged_x <- arrange(x, v2, v1)
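Note that arrange only sorts the rows; if you also want an explicit rank column, one minimal sketch (assuming a reasonably current dplyr is attached) is:
> arranged_x <- mutate(arranged_x, rank2 = row_number())
where row_number() numbers the rows in their current (sorted) order.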
Here we create a sequence of numbers and then reorder it to match the positions the rows would occupy in the sorted data:
x$rank <- seq.int(nrow(x))[match(rownames(x),rownames(x[order(x$v2,x$v1),]))]
Or:
x$rank <- (1:nrow(x))[order(order(x$v2,x$v1))]
Or even:
x$rank <- rank(order(order(x$v2,x$v1)))
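To see why the double order works: order(v) is the permutation that sorts v, and ordering that permutation inverts it, which is exactly each element's rank. A small check:
v <- c(30, 10, 20)
order(v)        # 2 3 1: the indices that would sort v
order(order(v)) # 3 1 2: the rank of each element
rank(v)         # 3 1 2: agrees (for distinct values)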
Try this:
x <- data.frame(v1 = c(2,1,1,2), v2 = c(1,1,3,2))
# The order function returns the indices (addresses) of the rows
# of the examined object, in sorted order
orderlist <- order(x$v2, x$v1)
# So to get the rank of each row, find its position in that index
# (a grep works; match(seq_len(nrow(x)), orderlist) is equivalent)
x$rank <- sapply(seq_len(nrow(x)), function(i) grep(paste0("^", i, "$"), orderlist))
x
# For a slightly more general case, with one tie
x <- data.frame(v1 = c(2,1,1,2,2), v2 = c(1,1,3,2,2))
x$rankv2 <- rank(x$v2)
x$rankv1 <- rank(x$v1)
orderlist <- order(x$rankv2, x$rankv1)
orderlist
# This rank is not yet appropriate, because there are ties
x$rank <- sapply(seq_len(nrow(x)), function(i) grep(paste0("^", i, "$"), orderlist))
# Locate the tied rows (duplicated on both ranking columns)
dups <- which(duplicated(x[, c("rankv2", "rankv1")]))
# Example for only one tie: average the rank across the tied rows
tied <- which(x$rankv2 %in% x$rankv2[dups] & x$rankv1 %in% x$rankv1[dups])
x$rank[tied] <- mean(x$rank[tied])
x

The rules of subsetting

Having df1 and df2 as follows:
df1 <- read.table(text =" x y z
1 1 1
1 2 1
1 1 2
2 1 1
2 2 2",header=TRUE)
df2 <- read.table(text =" a b c
1 1 1
1 2 8
1 1 2
2 6 2",header=TRUE)
I can ask of the data a bunch of things like:
df2[ df2$b == 6 | df2$c == 8 ,]  # any rows where b == 6 or c == 8 in df2
# and additive (AND) conditions:
df2[ df2$b == 6 & df2$c == 8 ,]  # zero rows
Between data frames:
df1[ df1$z %in% df2$c ,]  # rows in df1 where the values of z appear in c (all rows)
This gives me all rows:
df1[ (df1$x %in% df2$a) &
     (df1$y %in% df2$b) &
     (df1$z %in% df2$c) ,]
but shouldn't this give me all rows of df1 too:
df1[ df1$z %in% df2$c | df1$b == 9,]
What I am really hoping to do is to subset df1 and df2 on three column conditions,
so that I only get rows in df1 where a,b,c all equal x,y,z at the same time within a row. In real data I will have more than 3 columns, but I will still want to subset on 3 additive column conditions.
So subsetting my example data df1 on df2, my result would be:
  x y z
1 1 1 1
3 1 1 2
Playing with syntax has confused me more, and the SO posts are all variations of what I want that actually lead to more confusion for me.
I figured out I can do this:
merge(df1,df2, by.x=c("x","y","z"),by.y=c("a","b","c"))
which gives me what I want, but I would like to understand why I am wrong in my [ attempts.
In addition to your nice solution using merge (thanks for that, I always forget merge), this can be achieved in base R using ?interaction as follows. There may be other variations of this, but this is the one I am familiar with:
> df1[interaction(df1) %in% interaction(df2), ]
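which returns the two fully matching rows:
  x y z
1 1 1 1
3 1 1 2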
Now to answer your question: First, I think there's a typo (corrected) in:
df1[ df1$z %in% df2$c | df2$b == 9,] # second part should be df2$b == 9
You would get a warning, because the first part evaluates to:
[1] TRUE TRUE TRUE TRUE TRUE
and the second evaluates to:
[1] FALSE FALSE FALSE FALSE
A | operation on vectors of unequal length recycles the shorter one and warns:
longer object length is not a multiple of shorter object length
Edit: If you have multiple columns, then you can restrict the interaction to just the relevant ones. For example, if you want the rows of df1 whose first two columns match those of df2, you could simply do:
> df1[interaction(df1[, 1:2]) %in% interaction(df2[, 1:2]), ]
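To see what interaction is doing: it collapses each row into a single factor level (pasted with "." by default), so row-wise equality becomes ordinary vector matching:
> as.character(interaction(df1))
[1] "1.1.1" "1.2.1" "1.1.2" "2.1.1" "2.2.2"
The same separator-collision caveat as any paste-based key applies if the columns can themselves contain dots.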

Random row selection in R

I have this dataframe
id <- c(1,1,1,2,2,3)
name <- c("A","A","A","B","B","C")
value <- c(7:12)
df<- data.frame(id=id, name=name, value=value)
df
This function selects a random row from it:
randomRows <- function(df, n) {
  return(df[sample(nrow(df), n), ])
}
i.e.
randomRows(df,1)
But I want to randomly select one row per 'name' (or per 'id', which is the same) and bind each selected row into a new table, so in this case, three rows. This has to work over a 2,000+ row data frame. Please show me how?!
I think you can do this with the plyr package:
library("plyr")
ddply(df,.(name),randomRows,1)
which gives you for example:
  id name value
1  1    A     8
2  2    B    11
3  3    C    12
Is this what you are looking for?
Here's one way of doing it in base R.
> df.split <- split(df, df$name)
> df.sample <- lapply(df.split, randomRows, 1)
> df.final <- do.call("rbind", df.sample)
> df.final
  id name value
A  1    A     7
B  2    B    11
C  3    C    12
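Another base option, avoiding split entirely, is to sample one row index per group with tapply; a sketch, where the helper pick() dodges sample()'s scalar quirk (for a length-one group, sample(i, 1) would draw from 1:i):
pick <- function(i) if (length(i) == 1L) i else sample(i, 1L)
df[unlist(tapply(seq_len(nrow(df)), df$name, pick)), ]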

Improving performance of updating contents of large data frame using contents of similar data frame

I'm looking for a general solution for updating one large data frame with the contents of a second similar data frame. I have dozens of datasets, each with thousands of rows and upwards of 10,000 columns. An "update" dataset will overlap its corresponding "base" dataset by anywhere from a few percent to perhaps 50 percent, rowwise. The datasets have a "key" column and there will be only one row per each unique key value in any given dataset.
The basic rule is: if a non-NA value exists in the update dataset for a given cell, replace the same cell in the base dataset with that value. (The "same cell" means same value of the "key" column and colname.)
Note the update dataset will likely contain new rows ("inserts") which I can handle with an rbind.
So given the base data frame "df1", where column "K" is the unique key column, and "P1" .. "P3" represent the 10,000 columns, whose names will vary from one pair of datasets to the next:
  K P1 P2 P3
1 A  1  1  1
2 B  1  1  1
3 C  1  1  1
...and the update data frame "df2":
  K P1 P2 P3
1 B  2 NA  2
2 C NA  2  2
3 D  2  2  2
The result I need is as follows, where the 1's for "B" and "C" were overwritten by the 2's but not overwritten by the NA's:
  K P1 P2 P3
1 A  1  1  1
2 B  2  1  2
3 C  1  2  2
4 D  2  2  2
This doesn't seem to be a merge candidate as merge gives me either duplicate rows (with respect to the "key" column) or duplicate columns (e.g. P1.x, P1.y), which I have to iterate over to collapse somehow.
I have tried pre-allocating a matrix with the dimensions of the final rows/columns, and populating it with the contents of df1, then iterating over the overlapping rows of df2, but I cannot get better than 20 cells per second performance, requiring hours to complete (compared to minutes for the equivalent DATA step UPDATE functionality in SAS).
I'm sure I'm missing something, but can't find a comparable example.
I see ddply usage that looks close, but not a general solution. The data.table package didn't seem to help as it's not obvious to me that this is a join problem, at least not generally over so many columns.
Also a solution that focuses only on the intersecting rows is adequate as I can identify the others and rbind them in.
Here is some code to fabricate the data frames above:
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n");
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n");
df1 <- read.table("f1.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
df2 <- read.table("f2.dat", sep=",", header=TRUE, stringsAsFactors=FALSE);
Thanks
This loops by column, updating dt1 by reference, and should be quick:
library(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
if (!identical(names(dt1), names(dt2)))
    stop("Assumed for now. Can relax later if needed.")
w = chmatch(dt2$K, dt1$K)  # fast character match of the keys
for (i in 2:ncol(dt2)) {
    nna = !is.na(dt2[[i]])            # only non-NA cells overwrite
    set(dt1, w[nna], i, dt2[[i]][nna])
}
dt1 = rbind(dt1, dt2[is.na(w)])  # append the unmatched (insert) rows
dt1
     K P1 P2 P3
[1,] A  1  1  1
[2,] B  2  1  2
[3,] C  1  2  2
[4,] D  2  2  2
This is likely not the fastest solution but is done entirely in base.
(updated answer per Tommy's comments)
#READING IN YOUR DATA FRAMES
df1 <- read.table(text=" K P1 P2 P3
1 A 1 1 1
2 B 1 1 1
3 C 1 1 1", header=TRUE)
df2 <- read.table(text=" K P1 P2 P3
1 B 2 NA 2
2 C NA 2 2
3 D 2 2 2", header=TRUE)
all <- c(levels(df1$K), levels(df2$K)) #all cells of key column
dups <- all[duplicated(all)] #the overlapping key cells
ndups <- all[!all %in% dups] #unique key cells
df3 <- rbind(df1[df1$K%in%ndups, ], df2[df2$K%in%ndups, ]) #bind the unique rows
decider <- function(x, y) ifelse(is.na(x), y, x) # replaces NAs in x with the corresponding y values
df4 <- data.frame(mapply(df2[df2$K %in% dups, ], df1[df1$K %in% dups, ],
                         FUN = decider)) # replace all NAs of df2 with df1 values where they exist
df5 <- rbind(df3, df4) #bind unique rows of df1 and df2 with NA replaced df4
df5 <- df5[order(df5$K), ] #reorder based on key column
rownames(df5) <- 1:nrow(df5) #give proper non duplicated rownames
df5
This yields:
K P1 P2 P3
1 A 1 1 1
2 B 2 1 2
3 C 1 2 2
4 D 2 2 2
Upon closer reading, not all columns have the same name, but I am assuming the same order. This may be a more helpful approach:
all <- c(levels(df1$K), levels(df2$K))
dups <- all[duplicated(all)]
ndups <- all[!all %in% dups]
LS <- list(df1, df2)
LS2 <- lapply(seq_along(LS), function(i) {
    colnames(LS[[i]]) <- colnames(LS[[2]])
    return(LS[[i]])
})
LS3 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%ndups, ])
LS4 <- lapply(seq_along(LS2), function(i) LS2[[i]][LS2[[i]]$K%in%dups, ])
decider <- function(x, y) ifelse(is.na(x), y, x)
DF <- data.frame(mapply(LS4[[2]], LS4[[1]], FUN = decider))
DF$K <- LS4[[1]]$K
LS3[[3]] <- DF
df5 <- do.call("rbind", LS3)
df5 <- df5[order(df5$K), ]
rownames(df5) <- 1:nrow(df5)
df5
EDIT: Please ignore this answer. Bad idea to loop by row. It works but is very slow. Left for posterity! See my 2nd attempt as a separate answer.
require(data.table)
dt1 = as.data.table(df1)
dt2 = as.data.table(df2)
setkey(dt1, K)  # J() below joins against the key
K = dt2[[1]]
for (i in 1:nrow(dt2)) {
    k = K[i]
    p = unlist(dt2[i, -1, with=FALSE])
    p = p[!is.na(p)]  # only non-NA values overwrite
    dt1[J(k), names(p) := as.list(p), with=FALSE]
}
Or, can you use a matrix instead of a data.frame? If so, it could be a single line using A[B] syntax, where B is a 2-column matrix containing the row and column numbers to update.
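To illustrate that idea on the fabricated data: a minimal sketch, assuming the non-key columns are all numeric; keys present only in df2 (the inserts) are skipped here and can be rbind-ed in afterwards, as noted above:
A <- as.matrix(df1[-1]); rownames(A) <- df1$K      # base matrix, keyed by rowname
upd <- as.matrix(df2[-1]); rownames(upd) <- df2$K  # update matrix
idx <- which(!is.na(upd), arr.ind=TRUE)            # row/col positions of non-NA updates
B <- cbind(match(rownames(upd)[idx[, "row"]], rownames(A)), idx[, "col"])
keep <- !is.na(B[, 1])                             # drop keys absent from df1 (the inserts)
A[B[keep, , drop=FALSE]] <- upd[idx][keep]
A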
The following gives the correct answer for the small example data, tries to minimize the number of "copies" of tables, and uses the new fread and (new?) rbindlist. Does it work with your larger actual data set? I didn't quite follow all the comments in the original post about the memory issues you had when trying to flatten/normalize/stack, so apologies if you've already tried this route.
library(data.table)
library(reshape2)
cat("K,P1,P2,P3", "A,1,1,1", "B,1,1,1", "C,1,1,1", file="f1.dat", sep="\n")
cat("K,P1,P2,P3", "B,2,,2", "C,,2,2", "D,2,2,2", file="f2.dat", sep="\n")
# read f1.dat, melt to long/stacked format, and convert to a keyed data.table
dt1s <- data.table(melt(fread("f1.dat"), id.vars="K"), key=c("K","variable"))
# read f2.dat, melt to long/stacked format (dropping NAs), and convert likewise
dt2s <- data.table(melt(fread("f2.dat"), id.vars="K", na.rm=TRUE), key=c("K","variable"))
setnames(dt2s, "value", "value.new")
# update the existing cells by join
dt1s[dt2s, value := value.new]
# use rbindlist to insert the new records, then reshape back to wide format
dtout <- reshape(rbindlist(list(dt1s, dt1s[dt2s][is.na(value), list(K, variable, value=value.new)])),
                 direction="wide", idvar="K", timevar="variable")
setkey(dtout, K)
# clean up the column names
setnames(dtout, colnames(dtout), sub("value.", "", colnames(dtout)))

Change the order of columns

I am working on a large dataframe with >40 columns. I want to be able to move a column, without having to specify all the column names. For example:
a<-c(1:5)
b<-c(4,3,2,1,1)
Percent<-c(40,30,20,10,10)
Labels<-c("Cat","Dog","Rabbit","Rat","Mouse")
df1<-data.frame(a,b,Percent,Labels)
How would I move the column 'Labels' to before column 'a' WITHOUT having to write all the other column names (i.e. can I just specify a column to come before/after another column)?
Thanks.
Something quick and dirty would be (i.e. no sanity checking etc. and assuming only a single colname is supplied):
moveToFirstCol <- function(df, colname) {
    cnams <- colnames(df)
    want <- which(colname == cnams)
    df[, c(cnams[want], cnams[-want])]
}
which gives:
> moveToFirstCol(df1, "Labels")
  Labels a b Percent
1    Cat 1 4      40
2    Dog 2 3      30
3 Rabbit 3 2      20
4    Rat 4 1      10
5  Mouse 5 1      10
That should suggest a way to handle this sort of thing if you need additional flexibility.
Solution with additional flexibility:
move_variable <- function(x, where, data, after=FALSE) {
    vnames <- names(data)
    x_idx <- match(x, vnames)          # position of the column to move
    where_idx <- match(where, vnames)  # position of the anchor column
    idx <- seq(length(vnames))
    idx[x_idx] <- where_idx            # give both columns the same sort key
    idx1 <- rep(0L, length(vnames))    # tie-breaker: 0 sorts before 1
    if (after) idx1[x_idx] <- 1 else idx1[where_idx] <- 1
    return(data[order(idx, idx1)])
}
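For example, on df1 from above:
> move_variable("Labels", "a", df1)              # Labels moved before a
> move_variable("a", "Percent", df1, after=TRUE) # a moved after Percent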
