How to find which elements of one set are in another set? - r

I have two sets: A with columns x,y, and B also with columns x, y.
I need to find the index of the rows of A which are inside of B (both x and y must match).
I have come up with a simple solution (see below), but this comparison is inside of the loop and paste adds much more extra time.
B <- data.frame(x = sample(1:1000, 1000), y = sample(1:1000, 1000))
A <- B[sample(1:1000, 10),]
#change some elements
A$x[c(1,3,7,10)] <- A$x[c(1,3,7,10)] + 0.5
A$xy <- paste(A$x, A$y, sep='ZZZ')
B$xy <- paste(B$x, B$y, sep='ZZZ')
indx <- which(A$xy %in% B$xy)
indx
For example for a single observation an alternative to paste is almost 3 times faster
ind <- sample(1:1000, 1)
xx <- B$x[ind]
yy <- B$y[ind]
ind <- which(with(B, x==xx & y==yy))
# [1] 0.0160000324249268 seconds
xy <- paste(xx,'ZZZ',yy, sep='')
ind <- which(B$xy == xy)
# [1] 0.0469999313354492 seconds

How about using merge() to do the matching for you?
A$id <- seq_len(nrow(A))
sort(merge(A, B)$id)
# [1] 2 4 5 6 8 9
Edit:
Or, to get rid of two unnecessary sorts, use the sort= option to merge()
merge(A, B, sort=FALSE)$id
# [1] 2 4 5 6 8 9

Related

How to "translate" variables in one data frame using a second data frame as a key?

I have a data frame with two string variables, and would like to convert them to numeric values using a separate "key" data frame. The below example is simplified, but I need to be able to apply it to replace the contents of the V1 and V2 variables based on an arbitrary key that will not always be a=1, b=2 etc...
Example:
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters, 1:26)
I need to reference the first element of V1 against the key, replace with the according value (e.g. a = 1, b = 2, etc.), do the same for the second element, and then when done with V1 move on and do the same for V2.
I've been struggling to work out a solution using lapply() and sub() but keep getting stuck because I can't see a way to pass the sub() function more than a 1:1 comparison. Is there a different function I should be using?
Forgive me- I'm sure the solution must be simple but I'm quite new to R still.
Here are two approaches with base R to make it:
using sapply()
x[] <- with(key, sapply(x, function(v) values[match(v,letters)]))
or
x <- data.frame(with(key, sapply(x, function(v) values[match(v,letters)])))
using as.matrix (similar to the unlist() approach by #Ronak Shah)
x[] <- with(key, values[match(as.matrix(x),letters)])
You can create a lookup table with data.table and then apply the mapping along the columns of your data frame with apply:
library(data.table)
key <- data.table(letters = letters, value = 1:26, key = "letters")
apply(x, 2, function(x) key[x]$value)
>
V1 V2
1 y a
2 d u
3 g u
4 a j
5 b v
6 w n
7 k j
8 n g
9 r i
10 s o
You could unlist and match in base R
x[] <- key$values[match(unlist(x), key$letters)]
x
# V1 V2
#1 25 1
#2 4 21
#3 7 21
#4 1 10
#5 2 22
#6 23 14
#7 11 10
#8 14 7
#9 18 9
#10 19 15
Or using dplyr
library(dplyr)
x %>% mutate_all(~key$values[match(., key$letters)])
data
set.seed(1)
x <- data.frame(
V1 = sample((letters), 10, replace=TRUE),
V2 = sample((letters), 10, replace=TRUE)
)
key <- data.frame(letters = letters, values = 1:26)
You could use apply with both row and column margins, e.g, as.data.frame(apply(x, c(1,2), function(l) key[key$letters == l,c(2)])).

Using pmatch for checking columns of a matrix into another

I have two matrices and I want to check which (column) vectors of the first one are also in the second one, and if so to get their index.
I tried to use pmatch but I have to tweak it a bit because it first convert the matrices into vector, see the MWE:
X <- matrix(rnorm(12), 3, 4)
x <- X[, c(2, 4)]
pm <- pmatch(x, X)
print(pm)
[1] 4 5 6 10 11 12
d1 <- dim(X)[1]
d2 <- length(pm)/d1
ind <- pmatch(x, X)[d1*c(1:d2)]/d1
print(ind)
[1] 2 4
ind is what I want, but I guess there might be prebuilt function to do it. And I'm also concerned with computational efficiency.
We can loop over the columns of 'x' and use ==
sapply(seq_len(ncol(x)), function(i) which(!colSums(X != x[,i])))
#[1] 2 4

Find top deciles from dataframe by group

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
"v1" = rep(c("A", "B"), c(10,20)),
"v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
df <- df[df$v1 == cat,]
df <- df[order(-df$v2),]
#Divide the customers into top percents
nr <- nrow(df)
p10 <- round(nr * .10, 0)
cat("Number of people in the Top 10% :", p10, "\n")
p20 <- round(nr * .20, 0)
p11_20 <- p20-p10
cat("Number of people in the 11-20% :", p11_20, "\n")
#Keep only those customers in the top groups
df <- df[1:p20,]
#Create a variable to identify the percent group the customer is in
top_pct <- integer(length = p10 + p11_20)
#Identify those in each group
top_pct[1:p10] <- 10
top_pct[(p10+1):p20] <- 20
#Add this variable to the data frame
df$top_pct <- top_pct
#Keep only custID and the new variable
df <- subset(df, select = c(custID, top_pct))
return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for(i in 1:length(res)) {
names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for(i in 2:length(res)) {
res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[,g:={
x = rep(0,.N)
x[r>.8] = 20
x[r>.9] = 10
x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to #ExperimenteR:)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
custID~v1)
Currently, dcast requires the latest version of data.table, available (I think) from Github.
You don't need the function pf to achieve what you want. Try dplyr/tidyr combo
library(dplyr)
library(tidyr)
data %>%
group_by(v1) %>%
arrange(desc(v2))%>%
mutate(n=n()) %>%
filter(row_number() <= round(n * .2)) %>%
mutate(top_pct= ifelse(row_number()<=round(n* .1), 10, 20)) %>%
select(custID, top_pct) %>%
spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
cutoff <- quantile(df$v2, c(0.8, 0.9))
top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
na.omit(data.frame(id=df$custID, top_pct))
})
Finding quantiles is done with quantile.

How to combine a data frame and a vector

df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c(1,2)
How do I combine df and y into a new data frame that has all combinations of rows from df with elements from y? In this example, the output should be
data.frame(w=c("r","r","q","q"), x=c("a","a","b","b"),y=c(1,2,1,2))
w x y
1 r a 1
2 r a 2
3 q b 1
4 q b 2
This should do what you're trying to do, and without too much work.
dl <- unclass(df)
dl$y <- y
merge(df, expand.grid(dl))
# w x y
# 1 q b 1
# 2 q b 2
# 3 r a 1
# 4 r a 2
data.frame(lapply(df, rep, each = length(y)), y = y)
this should work
library(combinat)
df<-data.frame(w=c("r","q"), x=c("a","b"))
y=c("one", "two") #for generality
indices <- permn(seq_along(y))
combined <- NULL
for(i in indices){
current <- cbind(df, y=y[unlist(i)])
if(is.null(combined)){
combined <- current
} else {
combined <- rbind(combined, current)
}
}
print(combined)
Here is the output:
w x y
1 r a one
2 q b two
3 r a two
4 q b one
... or to make it shorter (and less obvious):
combined <- do.call(rbind, lapply(indices, function(i){cbind(df, y=y[unlist(i)])}))
First, convert class of columns from factor to character:
df <- data.frame(lapply(df, as.character), stringsAsFactors=FALSE)
Then, use expand.grid to get a index matrix for all combinations of rows of df and elements of y:
ind.mat = expand.grid(1:length(y), 1:nrow(df))
Finally, loop through the rows of ind.mat to get the result:
data.frame(t(apply(ind.mat, 1, function(x){c(as.character(df[x[2], ]), y[x[1]])})))

Creating a function to replace NAs from one data.frame with values from another

I regularly have situations where I need to replace missing values from a data.frame with values from some other data.frame that is at a different level of aggregation. So, for example, if I have a data.frame full of county data I might replace NA values with state values stored in another data.frame. After writing the same merge... ifelse(is.na()) yada yada a few dozen times I decided to break down and write a function to do this.
Here's what I cooked up, along with an example of how I use it:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols){
mergedDf <- merge(naDf, fillDf, by=mergeCols)
for (col in fillCols){
colWithNas <- mergedDf[[paste(col, "x", sep=".")]]
colWithOutNas <- mergedDf[[paste(col, "y", sep=".")]]
k <- which( is.na( colWithNas ) )
colWithNas[k] <- colWithOutNas[k]
mergedDf[col] <- colWithNas
mergedDf[[paste(col, "x", sep=".")]] <- NULL
mergedDf[[paste(col, "y", sep=".")]] <- NULL
}
return(mergedDf)
}
## test case
fillDf <- data.frame(a = c(1,2,1,2), b = c(3,3,4,4) ,f = c(100,200, 300, 400), g = c(11, 12, 13, 14))
naDf <- data.frame( a = sample(c(1,2), 100, rep=TRUE), b = sample(c(3,4), 100, rep=TRUE), f = sample(c(0,NA), 100, rep=TRUE), g = sample(c(0,NA), 200, rep=TRUE) )
fillNaDf(naDf, fillDf, mergeCols=c("a","b"), fillCols=c("f","g") )
So after I got this running I had this odd feeling that someone has probably solved this problem before me and in a much more elegant way. Is there a better/easier/faster solution to this problem? Also, is there a way that eliminates the loop in the middle of my function? That loop is there because I am often replacing NAs in more than one column. And, yes, the function assumes the columns we're filling from are named the same and the columns we are filling to and the same applies to the merge.
Any guidance or refactoring would be helpful.
EDIT on Dec 2 I realized I had logic flaws in my example which I fixed.
What a great question.
Here's a data.table solution:
# Convert data.frames to data.tables (i.e. data.frames with extra powers;)
library(data.table)
fillDT <- data.table(fillDf, key=c("a", "b"))
naDT <- data.table(naDf, key=c("a", "b"))
# Merge data.tables, based on their keys (columns a & b)
outDT <- naDT[fillDT]
# a b f g f.1 g.1
# [1,] 1 3 NA 0 100 11
# [2,] 1 3 NA NA 100 11
# [3,] 1 3 NA 0 100 11
# [4,] 1 3 0 0 100 11
# [5,] 1 3 0 NA 100 11
# First 5 rows of 200 printed.
# In outDT[i, j], on the following two lines
# -- i is a Boolean vector indicating which rows will be operated on
# -- j is an expression saying "(sub)assign from right column (e.g. f.1) to
# left column (e.g. f)
outDT[is.na(f), f:=f.1]
outDT[is.na(g), g:=g.1]
# Just keep the four columns ultimately needed
outDT <- outDT[,list(a,b,g,f)]
# a b g f
# [1,] 1 3 0 0
# [2,] 1 3 11 0
# [3,] 1 3 0 0
# [4,] 1 3 11 0
# [5,] 1 3 11 0
# First 5 rows of 200 printed.
Here's a slightly more concise/robust version of your approach. You could replace the for-loop with a call to lapply, but I find the loop easier to read.
This function assumes any columns not in mergeCols are fair game to have their NAs filled. I'm not really sure this helps, but I'll take my chances with the voters.
fillNaDf.ju <- function(naDf, fillDf, mergeCols) {
mergedDf <- merge(fillDf, naDf, by=mergeCols, suffixes=c(".fill",""))
dataCols <- setdiff(names(naDf),mergeCols)
# loop over all columns we didn't merge by
for(col in dataCols) {
rows <- is.na(mergedDf[,col])
# skip this column if it doesn't contain any NAs
if(!any(rows)) next
rows <- which(rows)
# replace NAs with values from fillDf
mergedDf[rows,col] <- mergedDf[rows,paste(col,"fill",sep=".")]
}
# don't return ".fill" columns
mergedDf[,names(naDf)]
}
My preference would be to pull out the code from merge that does the matching and do it myself so that I could keep the ordering of the original data frame intact, both row-wise and column-wise. I also use matrix indexing to avoid any loops, though to do so I create a new data frame with the revised fillCols and replace the columns of the original with it; I thought I could fill it in directly but apparently you can't use matrix ordering to replace parts of a data.frame, so I wouldn't be surprised if a loop over the names would be faster in some situations.
With matrix indexing:
fillNaDf <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
na.ind <- is.na(naDf[,fillCols])
fill.ind <- cbind(match(naB, fillB)[row(na.ind)[na.ind]], col(na.ind)[na.ind])
naX <- naDf[,fillCols]
fillX <- fillDf[,fillCols]
naX[na.ind] <- fillX[fill.ind]
naDf[,colnames(naX)] <- naX
naDf
}
With a loop:
fillNaDf2 <- function(naDf, fillDf, mergeCols, fillCols) {
fillB <- do.call(paste, c(fillDf[, mergeCols, drop = FALSE], sep="\r"))
naB <- do.call(paste, c(naDf[, mergeCols, drop = FALSE], sep="\r"))
m <- match(naB, fillB)
for(col in fillCols) {
fix <- which(is.na(naDf[,col]))
naDf[fix, col] <- fillDf[m[fix],col]
}
naDf
}

Resources