Matching without replacement by id in R

In R, I can easily match unique identifiers using the match function:
match(c(1,2,3,4),c(2,3,4,1))
# [1] 4 1 2 3
When I try to match non-unique identifiers, I get the following result:
match(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 3
Is there a way to match the indices "without replacement", that is, each index appearing only once?
othermatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4 # note the 4 where there was a 3 at the end

You're looking for pmatch:
pmatch(c(1,2,3,1),c(2,3,1,1))
# [1] 3 1 2 4
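
For completeness, here is a hand-rolled sketch of the same "without replacement" behaviour using only match (a hypothetical helper; it assumes every element of x still has an unused match in table, as with permutations):
match_no_replace <- function(x, table) {
  avail <- seq_along(table)          # indices of table not yet used
  vapply(x, function(v) {
    j <- match(v, table[avail])      # position among the remaining values
    idx <- avail[j]                  # translate back to an index of table
    avail <<- avail[-j]              # drop it so it cannot be reused
    idx
  }, integer(1))
}
match_no_replace(c(1, 2, 3, 1), c(2, 3, 1, 1))
# [1] 3 1 2 4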

A more naive approach:
library(data.table)
a <- data.table(p = c(1,2,3,1))
a[,indexa := .I]
b <- data.table(q = c(2,3,1,1))
b[,indexb := .I]
setkey(a,p)
setkey(b,q)
# since they are permutations of each other, cbinding the ordered vectors gives ab with ab[,p] equal to ab[,q]
ab <- cbind(a,b)
setkey(ab,indexa)
ab[,indexb]
#[1] 3 1 2 4
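
The same permutation idea also works in base R without data.table (a sketch, valid only when the two vectors contain exactly the same values):
x <- c(1, 2, 3, 1)
tab <- c(2, 3, 1, 1)
m <- integer(length(x))
m[order(x)] <- order(tab)   # align the sorted positions of both vectors
m
# [1] 3 1 2 4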

Related

using %in% to subset a data.table

I have a data.table
library(data.table)
DT <- data.table(a=c(1,2,3,4), b=c(4,4,4,4), x=c(1,3,5,5))
> DT
a b x
1: 1 4 1
2: 2 4 3
3: 3 4 5
4: 4 4 5
and I would like to select rows where x equals either a or b. Obviously, I could use
> DT[x==a | x==b]
a b x
1: 1 4 1
which gives the correct result. However, with many columns I thought the following should work just as well:
> DT[x%in%c(a,b)]
a b x
1: 1 4 1
2: 2 4 3
but it gives a different result that is not intuitive to me. Can anyone help?
The expression
DT[x==a | x==b]
returns all rows in DT where the values in x and a are equal or x and b are equal. This is the desired result.
On the other hand
DT[x%in%c(a,b)]
returns all rows where x matches any value in c(a, b), not just the corresponding value. Thus your second row appears because x == 3 and 3 appears (somewhere) in a.
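You can see what is going on by evaluating the pieces inside the data.table (a quick check):
DT[, c(a, b)]
# [1] 1 2 3 4 4 4 4 4    # one long vector; the row correspondence is lost
DT[, x %in% c(a, b)]
# [1]  TRUE  TRUE FALSE FALSE   # row 2 is TRUE because 3 occurs somewhere in a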
We can use Reduce with .SDcols for multiple columns: specify the columns of interest in .SDcols, loop over .SD (the Subset of Data.table), compare each column to 'x' with ==, and Reduce the results to a single logical vector with |.
DT[DT[, Reduce(`|`, lapply(.SD, `==`, x)), .SDcols = a:b]]
# a b x
#1: 1 4 1
Another way is to use rowSums:
DT[rowSums(DT[,.SD,.SDcols=-'x']==x)>0,]
# a b x
#1: 1 4 1
You can change it to rowMeans(...) == 1 if you want to select rows where all columns equal x.
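For reference, that variant would look like this (just a sketch; in this toy example no row qualifies, since no row has both a and b equal to x):
DT[rowMeans(DT[, .SD, .SDcols = -'x'] == x) == 1, ]
# (no rows match in this example)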

Get the mean across list of dataframes by rows

I have a list of dataframes and I want to calculate a mean from each first rows, for all second rows etc.
I think this is possible by creating some common factor as an index, putting the dataframes together using rbind and then calculating the mean value using aggregate(value ~ row.index, data = large.df, FUN = mean). However, I guess there is a more straightforward way?
Here is my example:
df1 = data.frame(val = c(4,1,0))
df2 = data.frame(val = c(5,2,1))
df3 = data.frame(val = c(6,3,2))
myLs=list(df1, df2, df3)
[[1]]
val
1 4
2 1
3 0
[[2]]
val
1 5
2 2
3 1
[[3]]
val
1 6
2 3
3 2
And my expected dataframe output, as rowise means:
df.means
mean
1 5
2 2
3 1
My first steps, not working as expected yet:
# Calculate the mean of list by rows
lapply(myLs, function(x) mean(x[1,]))
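(For reference, the rbind-plus-index idea described above could be written like this; just a sketch, with row.index as a hypothetical helper column:)
large.df <- do.call(rbind, lapply(myLs, function(d) transform(d, row.index = seq_len(nrow(d)))))
aggregate(val ~ row.index, data = large.df, FUN = mean)
#   row.index val
# 1         1   5
# 2         2   2
# 3         3   1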
A simple way would be to cbind the list and calculate the mean of each row with rowMeans:
rowMeans(do.call(cbind, myLs))
#[1] 5 2 1
We can also use bind_cols from dplyr to combine all the dataframes.
rowMeans(dplyr::bind_cols(myLs))
Here is another base R solution using unlist + data.frame + rowMeans, i.e.,
rowMeans(data.frame(unlist(myLs,recursive = F)))
# [1] 5 2 1
Using a double loop:
sapply(1:3, function(i) mean(sapply(myLs, function(j) j[i, ] )))
# [1] 5 2 1
Another base R possibility could be:
Reduce("+", myLs)/length(myLs)
val
1 5
2 2
3 1
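
If you want the exact df.means shape shown in the question, wrap any of these in a data.frame (a small follow-up):
df.means <- data.frame(mean = rowMeans(do.call(cbind, myLs)))
df.means
#   mean
# 1    5
# 2    2
# 3    1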

Select dataframe if both values exist

Here is an example:
df1 <- data.frame(x=1:2, account=c(-1,-1))
df2 <- data.frame(x=1:3, account=c(1,-1,1))
df3 <- data.frame(x=1, account=c(-1))
ls <- list(df1,df2,df3)
Failed attempt:
for (i in 1:length(ls)) {
  d <- ls[[i]]
  if (d$account %in% c(-1, 1)) { dout <- d } else { next }
}
I also tried: (not sure why this doesn't work)
grepl(paste(c(-1,1), collapse="|"), as.character(df1$account))
gives the following (which is correct, since | means "or", so it is enough for one of the values to match):
[1] TRUE TRUE
However, I have also tried this:
df1 <- data.frame(x=1:2, account=c(-1,1))
grepl(paste(c(-1,1), collapse="&"), as.character(df1$account))
gives:
[1] FALSE FALSE
I would like to keep only the dataframes that contain both -1 and 1 in the column account, and discard the rest.
Desired result:
d
x account
1 1 1
2 2 -1
3 3 1
Or, you could stop using a list of data.frames:
library(data.table)
DT <- rbindlist(ls, idcol="id")
# id x account
# 1: 1 1 -1
# 2: 1 2 -1
# 3: 2 1 1
# 4: 2 2 -1
# 5: 2 3 1
# 6: 3 1 -1
And filter the single table:
DT[, if (uniqueN(account) > 1) .SD, by=id]
# id x account
# 1: 2 1 1
# 2: 2 2 -1
# 3: 2 3 1
(This follows #akrun's answer; uniqueN(x) is a fast shortcut to length(unique(x)).)
We could loop through the list and check whether the number of unique elements in 'account' is greater than 1 (assuming that -1 and 1 are the only possible values). Use this logical index to filter the list.
ls[sapply(ls, function(x) length(unique(x$account))>1)]
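If you literally need both values to be present (not just more than one distinct value), a base R sketch with Filter would be:
Filter(function(d) all(c(-1, 1) %in% d$account), ls)
# [[1]]
#   x account
# 1 1       1
# 2 2      -1
# 3 3       1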

match values in dataframes with values in a column

I have two data.frames that looks like these ones:
>df1
V1
a
b
c
d
e
>df2
V1 V2
1 a,k,l
2 c,m,n
3 z,b,s
4 l,m,e
5 t,r,d
I would like to match the values in df1$V1 with those in df2$V2 and add a new column to df1 containing, for each match, the corresponding value of df2$V1. The desired output would be:
>df1
V1 V2
a 1
b 3
c 2
d 5
e 4
I've tried this approach, but it only works if df2$V2 contains just one element:
match(as.character(df1[,1]), strsplit(as.character(df2[,2]), ",")) -> idx
df1$V2 <- df2[idx,1]
Many thanks
You can just use grep, which will return the position of the string found:
sapply(df1$V1, grep, x = df2$V2)
# a b c d e
# 1 3 2 5 4
If you expect repeats, you can use paste.
Let's modify your data so that there is a repeat:
df2$V2[3] <- "z,b,s,a"
And modify the solution accordingly:
sapply(df1$V1, function(z) paste(grep(z, x = df2$V2), collapse = ";"))
# a b c d e
# "1;3" "3" "2" "5" "4"
Similar to Tyler's answer, but in base R using stack:
df.stack <- stack(setNames(strsplit(as.character(df2$V2), ","), df2$V1))
transform(df1, V2=df.stack$ind[match(V1, df.stack$values)])
produces:
V1 V2
1 a 1
2 b 3
3 c 2
4 d 5
5 e 4
One advantage of splitting over grep is that with grep you run the risk of searching for a and matching things like alabama, etc. (though you can be careful with the patterns to mitigate this, e.g. by including word boundaries).
Note this will only find the first matching value.
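If partial matches are a concern with the grep approach, one way to guard against them is to add word boundaries to the pattern (a sketch; with this toy data the result is the same):
sapply(df1$V1, function(z) grep(paste0("\\b", z, "\\b"), df2$V2))
# a b c d e
# 1 3 2 5 4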
Here's an approach:
library(qdap)
key <- setNames(strsplit(as.character(df2$V2), ","), df2$V1)
df1$V2 <- as.numeric(df1$V1 %l% key)
df1
## V1 V2
## 1 a 1
## 2 b 3
## 3 c 2
## 4 d 5
## 5 e 4
First we used strsplit to create a named list. Then we used qdap's lookup operator %l% to match values and create a new column (I converted to numeric though this may not be necessary).

Assignment to the result of a function changes variable

Looking through the ave function, I found a remarkable line:
split(x, g) <- lapply(split(x, g), FUN) # From ave
Interestingly, this line changes the value of x, which I found unexpected. I expected that split(x,g) would result in a list, which could be assigned to but would then be discarded. My question is, why does the value of x change?
Another example may explain better:
a <- data.frame(id=c(1,1,2,2), value=c(4,5,7,6))
# id value
# 1 1 4
# 2 1 5
# 3 2 7
# 4 2 6
split(a,a$id) # Split a row-wise by id into a list of size 2
# $`1`
# id value
# 1 1 4
# 2 1 5
# $`2`
# id value
# 3 2 7
# 4 2 6
# Find the row with highest value for each id
lapply(split(a,a$id),function(x) x[which.max(x$value),])
# $`1`
# id value
# 2 1 5
# $`2`
# id value
# 3 2 7
# Assigning to the split changes the data.frame a!
split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),])
a
# id value
# 1 1 5
# 2 1 5
# 3 2 7
# 4 2 7
Not only has a changed, but it changed to a value that does not look like the right hand side of the assignment! Even if assigning to split(a,a$id) somehow changes a (which I don't understand), why does it result in a data.frame instead of a list?
Note that I understand that there are better ways to accomplish this task. My question is why does split(a,a$id)<-lapply(split(a,a$id),function(x) x[which.max(x$value),]) change a?
The help page for split says in its header: "The replacement forms replace values corresponding to such a division." So it really should not be unexpected, although I admit it is not widely used. I do not understand how your example illustrates that the assigned values "do not look like the RHS of the assignment!". The max values are assigned to the 'value' lists within categories defined by the second argument factor.
(I do thank you for the question. I had not realized that split<- was at the core of ave. I guess it is more widely used than I realized, since I think ave is a wonderfully useful function.)
Just after the definition of a, perform split(a, a$id) <- 1; the result would be:
> a
id value
1 1 1
2 1 1
3 1 1
4 1 1
The key here is that split<- actually modifies the LHS with the RHS values.
Here's an example:
> x <- c(1,2,3);
> split(x,x==2)
$`FALSE`
[1] 1 3
$`TRUE`
[1] 2
> split(x,x==2) <- split(c(10,20,30),c(10,20,30)==20)
> x
[1] 10 20 30
Note the line where I assign to split(x, x==2). This actually reassigns x.
As the comments below have stated, you can look up the definition of split<- like so:
> `split<-.default`
function (x, f, drop = FALSE, ..., value)
{
    ix <- split(seq_along(x), f, drop = drop, ...)
    n <- length(value)
    j <- 0
    for (i in ix) {
        j <- j %% n + 1
        x[i] <- value[[j]]
    }
    x
}
<bytecode: 0x1e18ef8>
<environment: namespace:base>
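
Tracing the original data.frame example with that logic (the data.frame method works analogously, replacing whole rows group by group; this is just a sketch of what happens):
a <- data.frame(id = c(1, 1, 2, 2), value = c(4, 5, 7, 6))
ix <- split(seq_len(nrow(a)), a$id)     # list(`1` = 1:2, `2` = 3:4)
value <- lapply(split(a, a$id), function(x) x[which.max(x$value), ])
a[ix[[1]], ] <- value[[1]]   # the one-row piece is recycled over rows 1:2
a[ix[[2]], ] <- value[[2]]   # and over rows 3:4
a
#   id value
# 1  1     5
# 2  1     5
# 3  2     7
# 4  2     7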
