Dictionaries and pairs - r

In R, I was wondering whether I could have a dictionary (in the Python sense) where a pair (i, j) is the key with a corresponding integer value. I have not seen a clean or intuitive way to construct this in R. A visual of my dictionary would be:
(1, 2) --> 1
(1, 3) --> 3
(1, 4) --> 4
(1, 5) --> 3
EDIT: The line of code to insert these key value pairs is in a loop with counters i and j. For example suppose I have:
for(i in 1: 5)
{
for(j in 2: 4)
{
maps[i][j] = which.min(someVector)
}
}
How do I change maps[i][j] to get the functionality I am looking for?

Both named lists and environments provide a mapping, but it is between a character string (the name, which acts as a key) and an arbitrary R object (the value). If your i and j are simple (as they appear to be in your example, being integers), you can easily make a unique string out of the pair by concatenating them with some delimiter. This makes a valid name/key:
mykey <- function(i, j) {
paste(i, j, sep="|")
}
maps <- list()
for(i in 1: 5) {
for(j in 2: 4) {
maps[mykey(i,j)] <- which.min(someVector)
}
}
You can extract any value for a specific i and j with
maps[[mykey(i,j)]]
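If the map grows large, an environment can play the role of the hash table instead of a list. The following is only a minimal sketch of that variant, assuming the same "i|j" key format as above; pair_key and dict are hypothetical names, and the stored values are placeholders rather than the question's which.min(someVector).

```r
# Hypothetical helper: build a string key from the pair (i, j)
pair_key <- function(i, j) paste(i, j, sep = "|")

# An environment looks names up by hashing, so access stays fast as it grows
dict <- new.env(hash = TRUE)

for (i in 1:5) {
  for (j in 2:4) {
    # Placeholder value; in the question this would be which.min(someVector)
    assign(pair_key(i, j), i * 10L + j, envir = dict)
  }
}

# Look up the value stored for the pair (1, 3)
value_1_3 <- get(pair_key(1, 3), envir = dict)

# Check whether a pair is present without raising an error
has_9_9 <- exists(pair_key(9, 9), envir = dict, inherits = FALSE)
```

Using inherits = FALSE keeps exists() from finding an unrelated object of the same name in an enclosing environment.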

Here's how you'd do this in a data.table, which will be fast and easily extendable:
library(data.table)
d = data.table(i = 1, j = 2:5, value = c(1,3,4,3), key = c("i", "j"))
# access the i=1,j=3 value
d[J(1, 3)][, value]
# change that value
d[J(1, 3), value := 12]
# do some vector assignment (you should stop thinking loops, and start thinking vectors)
d[, value := i * j]
etc.

You can do this with a list of vectors.
maps <- lapply(vector('list',5), function(i) integer(0))
maps[[1]][2] <- 1
maps[[1]][3] <- 3
maps[[1]][4] <- 4
maps[[1]][5] <- 3
That said, there's probably a better way to do what you're trying to do, but you haven't given us enough background.

You can just use a data.frame:
a = data.frame(spam = c("alpha", "gamma", "beta"),
shrub = c("primus","inter","parus"),
stringsAsFactors = FALSE)
rownames(a) = c("John", "Eli","Seth")
> a
spam shrub
John alpha primus
Eli gamma inter
Seth beta parus
> a["John", "spam"]
[1] "alpha"
This handles the case of a 2D dictionary-style object with named keys. The keys can also be integers, although they might have to be stored as character strings instead of numerics.

R matrices allow you to do this. There are both sparse and dense versions. I believe the tm package uses a variation on sparse matrices to form its implementation of dictionaries. This shows how to extract the [i,j] elements of matrix M, where the [i,j] pairs are represented as a two-column matrix.
M<- matrix(1:20, 5, 5)
ij <- cbind(sample(1:5), sample(1:5) )
> ij
[,1] [,2]
[1,] 4 4
[2,] 1 2
[3,] 5 3
[4,] 3 1
[5,] 2 5
> M[ij]
[1] 19 6 15 3 2
@Justin also points out that you could use lists, which can be indexed by position:
L <- list( as.list(letters[1:5] ), as.list( paste(2,letters[1:5] ) ) )
> L[[1]][[2]]
[1] "b"
> L[[2]][[2]]
[1] "2 b"
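Applied to the loop from the question, the matrix approach might look like the sketch below. It assumes i runs over 1:5 and j over 2:4 as in the question; some_vector is placeholder random data standing in for the question's someVector.

```r
set.seed(42)  # only to make the placeholder data reproducible

# Rows index i, columns index j; NA marks pairs that were never assigned
maps <- matrix(NA_integer_, nrow = 5, ncol = 4)

for (i in 1:5) {
  for (j in 2:4) {
    some_vector <- runif(10)              # placeholder for someVector
    maps[i, j] <- which.min(some_vector)  # store the value for the pair (i, j)
  }
}

# Look up a single pair
maps[1, 3]

# Look up several (i, j) pairs at once with a two-column index matrix
pairs <- cbind(c(1, 2), c(3, 4))
maps[pairs]
```

Column 1 stays NA because j starts at 2, so a missing pair can be told apart from a stored value.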

Related

concatenation sublists of two different lists [duplicate]

I have two lists
first = list(a = 1, b = 2, c = 3)
second = list(a = 2, b = 3, c = 4)
I want to merge these two lists so the final product is
$a
[1] 1 2
$b
[1] 2 3
$c
[1] 3 4
Is there a simple function to do this?
If lists always have the same structure, as in the example, then a simpler solution is
mapply(c, first, second, SIMPLIFY=FALSE)
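Note that mapply pairs elements by position, not by name. If the two lists hold the same names in a different order, one possible adjustment (a sketch, assuming every name in first also occurs in second) is to reorder second before merging:

```r
first <- list(a = 1, b = 2, c = 3)
second <- list(c = 4, a = 2, b = 3)   # same names, different order

# Reorder 'second' to follow the names of 'first', then combine pairwise
merged <- mapply(c, first, second[names(first)], SIMPLIFY = FALSE)
merged
```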
This is a very simple adaptation of the modifyList function by Sarkar. Because it is recursive, it will handle more complex situations than mapply would, and it will handle mismatched name situations by ignoring the items in 'second' that are not in 'first'.
appendList <- function (x, val)
{
stopifnot(is.list(x), is.list(val))
xnames <- names(x)
for (v in names(val)) {
x[[v]] <- if (v %in% xnames && is.list(x[[v]]) && is.list(val[[v]]))
appendList(x[[v]], val[[v]])
else c(x[[v]], val[[v]])
}
x
}
> appendList(first,second)
$a
[1] 1 2
$b
[1] 2 3
$c
[1] 3 4
Here are two options, the first:
both <- list(first, second)
n <- unique(unlist(lapply(both, names)))
names(n) <- n
lapply(n, function(ni) unlist(lapply(both, `[[`, ni)))
and the second, which works only if they have the same structure:
apply(cbind(first, second),1,function(x) unname(unlist(x)))
Both give the desired result.
Here's some code that I ended up writing, based upon @Andrei's answer but without the elegance/simplicity. The advantage is that it allows a more complex recursive merge and also distinguishes between elements that should be connected with rbind and those that are just connected with c:
# Decided to move this outside the mapply, not sure this is
# that important for speed but I imagine redefining the function
# might be somewhat time-consuming
mergeLists_internal <- function(o_element, n_element){
if (is.list(n_element)){
# Fill in non-existent elements with NA elements
if (length(n_element) != length(o_element)){
n_unique <- names(n_element)[! names(n_element) %in% names(o_element)]
if (length(n_unique) > 0){
for (n in n_unique){
if (is.matrix(n_element[[n]])){
o_element[[n]] <- matrix(NA,
nrow=nrow(n_element[[n]]),
ncol=ncol(n_element[[n]]))
}else{
o_element[[n]] <- rep(NA,
times=length(n_element[[n]]))
}
}
}
o_unique <- names(o_element)[! names(o_element) %in% names(n_element)]
if (length(o_unique) > 0){
for (n in o_unique){
# These names exist only in o_element, so check its type
if (is.matrix(o_element[[n]])){
n_element[[n]] <- matrix(NA,
nrow=nrow(o_element[[n]]),
ncol=ncol(o_element[[n]]))
}else{
n_element[[n]] <- rep(NA,
times=length(o_element[[n]]))
}
}
}
}
# Now merge the two lists
return(mergeLists(o_element,
n_element))
}
if(length(n_element)>1){
new_cols <- ifelse(is.matrix(n_element), ncol(n_element), length(n_element))
old_cols <- ifelse(is.matrix(o_element), ncol(o_element), length(o_element))
if (new_cols != old_cols)
stop("Your length doesn't match on the elements,",
" new element (", new_cols , ") !=",
" old element (", old_cols , ")")
# Vectors and matrices of matching width are stacked row-wise
return(rbind(o_element,
n_element,
deparse.level=0))
}
# Single values are simply concatenated
return(c(o_element,
n_element))
}
mergeLists <- function(old, new){
if (is.null(old))
return (new)
m <- mapply(mergeLists_internal, old, new, SIMPLIFY=FALSE)
return(m)
}
Here's my example:
v1 <- list("a"=c(1,2), b="test 1", sublist=list(one=20:21, two=21:22))
v2 <- list("a"=c(3,4), b="test 2", sublist=list(one=10:11, two=11:12, three=1:2))
mergeLists(v1, v2)
This results in:
$a
[,1] [,2]
[1,] 1 2
[2,] 3 4
$b
[1] "test 1" "test 2"
$sublist
$sublist$one
[,1] [,2]
[1,] 20 21
[2,] 10 11
$sublist$two
[,1] [,2]
[1,] 21 22
[2,] 11 12
$sublist$three
[,1] [,2]
[1,] NA NA
[2,] 1 2
Yeah, I know - perhaps not the most logical merge but I have a complex parallel loop that I had to generate a more customized .combine function for, and therefore I wrote this monster :-)
library(purrr)
merged = map(names(first), ~c(first[[.x]], second[[.x]]))
merged = set_names(merged, names(first))
Using purrr. Also solves the problem of your lists not being in order.
In general one could,
merge_list <- function(...) by(v<-unlist(c(...)),names(v),base::c)
Note that the by() solution returns an attributed list, so it will print differently, but will still be a list. But you can get rid of the attributes with attr(x,"_attribute.name_")<-NULL. You can probably also use aggregate().
We can do a lapply with c(), and use setNames to assign the original name to the output.
setNames(lapply(1:length(first), function(x) c(first[[x]], second[[x]])), names(first))
$a
[1] 1 2
$b
[1] 2 3
$c
[1] 3 4
Following the answers by @Aaron left Stack Overflow and @Theo, the elements of the merged list are vectors built with c().
If you want to bind rows or columns instead, use rbind or cbind.
merged = map(names(first), ~rbind(first[[.x]], second[[.x]]))
merged = set_names(merged, names(first))
Using dplyr, I found that this line works for named lists using the same names:
as.list(bind_rows(first, second))

Swapping elements between more than 2 arrays

Swapping elements within a single array (x) is a classic problem in computer science. The immediate (but by no means only, e.g., XOR) solution in a low-level language like C is to use a temporary variable:
tmp = x[0]
x[0] = x[1]
x[1] = tmp
The above algorithm swaps the first and second elements of x.
Swapping elements between two subarrays, x and y, is similar:
tmp = x[0]
x[0] = y[1]
y[1] = tmp
What about for the case of 3 arrays with the added restriction that an element of Array 1 must be swapped with an element of Array 2 and an element of Array 2 must be swapped with an element of Array 3? Elements in Arrays 1 and 3 are not swapped with one another.
How can such an approach (with the added restriction) be generalized to k arrays?
You could create a for-loop that repeats your set of instructions:
l=list(x = c(1,2,3,4,5),y = c(5,4,3,2,1),z = c(6,7,8,9,10))
swap_elements <- function(l)
{
for(i in 1:(length(l)-1))
{
tmp = l[[i]][1]
l[[i]][1] = l[[i+1]][2]
l[[i+1]][2] = tmp
}
return(l)
}
Output:
> swap_elements(l)
$x
[1] 4 2 3 4 5
$y
[1] 7 1 3 2 1
$z
[1] 6 5 8 9 10
If the arrays are stacked as the rows of a matrix, you can lag the rows to create the required action:
k <- 6
#generate dummy data with k rows and 3 columns
mat <- matrix(seq_len(3*k), nrow=k, byrow=TRUE)
mat
#lag the matrix
mat[c(seq_len(k)[-1], 1),]
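To make the effect of the lag concrete, here is a small sketch under the assumption that each row of mat stands for one array. The names shift, lagged, and first_col_only are hypothetical. The row index moves every row up by one place and wraps the first row around to the end; restricting it to one column moves a single element per array instead of the whole row.

```r
k <- 4
mat <- matrix(seq_len(3 * k), nrow = k, byrow = TRUE)

# Row order 2, 3, ..., k, 1: each position receives the following row
shift <- c(seq_len(k)[-1], 1)
lagged <- mat[shift, ]

# Same idea restricted to the first element of each array (column 1 only)
first_col_only <- mat
first_col_only[, 1] <- mat[shift, 1]

lagged
first_col_only
```

Because the index wraps around, the last array receives an element from the first one. If arrays 1 and k must not exchange elements, that wrap-around would need separate handling.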

How to store multidimensional subscript as variable in R

Suppose I have a matrix,
mat <- matrix((1:9)^2, 3, 3)
I can slice the matrix like so
> mat[2:3, 2]
[1] 25 36
How does one store the subscript as a variable? That is, what should my_sub be, such that
> mat[my_sub]
[1] 25 36
A list gets "invalid subscript type" error. A vector will lose the multidimensionality. Seems like such a basic operation to not have a primitive type that fits this usage.
I know I can access the matrix via vector addressing, which means converting from [2:3, 2] to c(5, 6), but that mapping presumes knowledge of matrix shape. What if I simply want [2:3, 2] for any matrix shape (assuming it is at least those dimensions)?
Here are some alternatives. They both generalize to higher-dimensional arrays.
1) matrix subscripting If the indexes are all scalar except possibly one, as in the question, then:
mi <- cbind(2:3, 2)
mat[mi]
# test
identical(mat[mi], mat[2:3, 2])
## [1] TRUE
In higher dimensions:
a <- array(1:24, 2:4)
mi <- cbind(2, 2:3, 3)
a[mi]
# test
identical(a[mi], a[2, 2:3, 3])
## [1] TRUE
It would be possible to extend this to eliminate the scalar restriction using:
L <- list(2:3, 2:3)
array(mat[as.matrix(do.call(expand.grid, L))], lengths(L))
however, in light of (2) which also uses do.call but avoids the need for expand.grid it seems unnecessarily complex.
2) do.call This approach does not have the scalar limitation. mat and a are from above:
L2 <- list(2:3, 1:2)
do.call("[", c(list(mat), L2))
# test
identical(do.call("[", c(list(mat), L2)), mat[2:3, 1:2])
## [1] TRUE
L3 <- list(2, 2:3, 3:4)
do.call("[", c(list(a), L3))
# test
identical(do.call("[", c(list(a), L3)), a[2, 2:3, 3:4])
## [1] TRUE
This could be made prettier by defining:
`%[%` <- function(x, indexList) do.call("[", c(list(x), indexList))
mat %[% list(2:3, 1:2)
a %[% list(2, 2:3, 3:4)
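A stored subscript list can also express "everything along this dimension" and pass drop through to the indexing call. The sketch below is one way this might be wrapped; apply_sub is a hypothetical helper name, and TRUE is used as the stand-in for an empty subscript.

```r
mat <- matrix((1:9)^2, 3, 3)

# Hypothetical helper: apply a stored list of subscripts to any array
apply_sub <- function(x, sub, drop = TRUE) {
  do.call("[", c(list(x), sub, list(drop = drop)))
}

my_sub   <- list(2:3, 2)      # stands for [2:3, 2]
all_cols <- list(2:3, TRUE)   # TRUE selects every column, like [2:3, ]

apply_sub(mat, my_sub)                 # same as mat[2:3, 2]
apply_sub(mat, all_cols)               # same as mat[2:3, ]
apply_sub(mat, my_sub, drop = FALSE)   # keeps the result as a 2 x 1 matrix
```

Since the list stores only the subscripts, the same my_sub can be reused on any matrix that has at least three rows and two columns.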
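Use which() with the argument arr.ind = TRUE. Note that mat == x would recycle x along mat (with a warning) and find the right cells only by coincidence, so test membership with %in% instead:
x <- c(25, 36)
# %in% drops the dimensions, so restore them before calling which()
inx <- which(matrix(mat %in% x, nrow = nrow(mat)), arr.ind = TRUE)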
mat[inx]
#[1] 25 36
This is an interesting question. The subset function can actually help. You cannot subset your matrix directly using a list, but you can store the indexes in a list and use subset to do the trick.
mat <- matrix(1:12, nrow=4)
mat[2:3, 1:2]
# example using subset
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2)
# double check
identical(mat[2:3, 1:2],
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2))
# TRUE
Actually, we can write a custom function if we want to store the row- and column- indexes in the same list.
cust.subset <- function(mat, dim.list){
subset(mat, subset = 1:nrow(mat) %in% dim.list[[1]], select = dim.list[[2]])
}
# initialize a list that includes your sub-setting indexes
sbdim <- list(2:3, 1:2)
sbdim
# [[1]]
# [1] 2 3
# [[2]]
# [1] 1 2
# subset using your custom f(x) and your list
cust.subset(mat, sbdim)
# [,1] [,2]
# [1,] 2 6
# [2,] 3 7

Can I further vectorize this function

I am relatively new to R, and to matrix-based scripting languages in general. I have written this function to return the indices of each row whose content is similar to any other row's content. It is a primitive form of spam reduction that I am developing.
if (!require("RecordLinkage")) install.packages("RecordLinkage")
library("RecordLinkage")
# Takes a column of strings, returns a vector of indices
check_similarity <- function(x) {
threshold <- 0.8
values <- NULL
for(i in 1:length(x)) {
values <- c(values, which(jarowinkler(x[i], x[-i]) > threshold))
}
return(values)
}
Is there a way that I could write this to avoid the for loop entirely?
We can simplify the code somewhat using sapply.
# some test data #
x = c('hello', 'hollow', 'cat', 'turtle', 'bottle', 'xxx')
# use the same threshold as in the question
threshold = 0.8
# create an x by x matrix specifying which strings are alike
m = sapply(x, jarowinkler, x) > threshold
# set diagonal to FALSE: we're not interested in strings being identical to themselves
diag(m) = FALSE
# And find index positions of all strings that are similar to at least one other string
which(rowSums(m) > 0)
# [1] 1 2 4 5
That is, this returns the index positions of 'hello', 'hollow', 'turtle', and 'bottle' as being similar to another string.
If you prefer, you can use colSums instead of rowSums to get a named vector, but this could be messy if the strings are long:
which(colSums(m) > 0)
# hello hollow turtle bottle
# 1 2 4 5
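The same matrix-then-threshold pattern can be tried without any add-on package. The sketch below swaps in a different similarity measure, a normalized Levenshtein distance from base R's adist(), in place of Jaro-Winkler. The scores are therefore not comparable to jarowinkler(), and the 0.6 cutoff is an assumed value chosen only for illustration.

```r
x <- c("hello", "hollow", "cat", "turtle", "bottle", "xxx")
threshold <- 0.6  # assumed cutoff for this stand-in measure, not the question's 0.8

# adist() returns the full matrix of pairwise edit distances in one call
d <- adist(x)

# Scale each distance by the longer of the two strings to get a 0..1 similarity
longest <- outer(nchar(x), nchar(x), pmax)
similarity <- 1 - d / longest

# Same steps as above: threshold, ignore self-matches, keep rows with any match
m <- similarity > threshold
diag(m) <- FALSE
similar_idx <- which(rowSums(m) > 0)
similar_idx
```

With this measure and cutoff fewer pairs qualify than with Jaro-Winkler, which shows that the threshold has to be tuned to the similarity function in use.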

How to get counts of intersections of six or more sets?

I am running an analysis of a number of sets and I have been using the package VennDiagram, which has been working just fine, but it only handles up to 5 sets, and now it turns out that I need to look at 6 or more sets.
Ideally, I'm looking for something that can do this (below) with 6 or more sets, but it doesn't necessarily have to have a plot function as long as the counts can be retrieved:
Any ideas of what I can do to add one or more sets to these five and still get the counts?
Thanks!
Here is a recursive solution to find all of the intersections in the venn diagram. sets can be a list containing any number of sets to find the intersections of. For some reason, the code in the package you are using is all hard-coded for each set size, so it doesn't scale to arbitrary intersections.
## Build intersections, 'out' accumulates the result
intersects <- function(sets, out=NULL) {
if (length(sets) < 2) return ( out ) # return result
len <- seq(length(sets))
if (missing(out)) out <- list() # initialize accumulator
for (idx in split((inds <- combn(length(sets), 2)), col(inds))) { # 2-way combinations
ii <- len > idx[2] & !(len %in% idx) # indices to keep for next intersect
out[[(n <- paste(names(sets[idx]), collapse="."))]] <- intersect(sets[[idx[1]]], sets[[idx[2]]])
out <- intersects(append(out[n], sets[ii]), out=out)
}
out
}
The function builds pairwise intersections. To avoid building repeated solutions it only calls itself on components of the set with indices greater than those that were joined (ii in the code). The result is a list of all the intersections. If you pass named components, then the result will be named by the convention "set1.set2" etc.
Results
## Some sample data
set.seed(0)
sets <- setNames(lapply(1:3, function(.) sample(letters, 10)), letters[1:3])
## Manually check intersections
a.b <- intersect(sets[[1]], sets[[2]])
b.c <- intersect(sets[[2]], sets[[3]])
a.c <- intersect(sets[[1]], sets[[3]])
a.b.c <- intersect(a.b, sets[[3]])
## Compare
res <- intersects(sets)
all.equal(res[c("a.b","a.c","b.c","a.b.c")], list(a.b=a.b, a.c=a.c, b.c=b.c, a.b.c=a.b.c))
# TRUE
res
# $a.b
# [1] "g" "i" "n" "e" "r"
#
# $a.b.c
# [1] "g"
#
# $a.c
# [1] "x" "g"
#
# $b.c
# [1] "f" "g"
## Get the counts of intersections
lengths(res)
# a.b a.b.c a.c b.c
# 5 1 2 2
Or, with numbers
intersects(list(a=1:10, b=c(1, 5, 10), c=9:20))
# $a.b
# [1] 1 5 10
# $a.b.c
# [1] 10
# $a.c
# [1] 9 10
# $b.c
# [1] 10
Here's an attempt:
list1 <- c("a","b","c","e")
list2 <- c("a","b","c","e")
list3 <- c("a","b")
list4 <- c("a","b","g","h")
list_names <- c("list1","list2","list3","list4")
lapply(1:length(list_names),function(y){
combinations <- combn(list_names,y)
res<-as.list(apply(combinations,2,function(x){
if(length(x)==1){
p <- setdiff(get(x),unlist(sapply(setdiff(list_names,x),get)))
}
else if(length(x) < length(list_names)){
p <- setdiff(Reduce(intersect,lapply(x,get)),Reduce(union,sapply(setdiff(list_names,x),get)))
}
else p <- Reduce(intersect,lapply(x,get))
if(!identical(p,character(0))) p
else NA
}))
if(y==length(list_names)) {
res[[1]] <- unlist(res);
res<-res[1]
}
names(res) <- apply(combinations,2,paste,collapse="-")
res
})
The first lapply is used to loop from 1 to the number of sets you have. Then I took all possible combinations of list names, taken y at a time. This essentially generates all of the different subareas in the Venn diagram.
For each combination, the output is the difference between the intersection of the lists in the current combination to the union of the other lists that are not in the combination.
The final result is a list whose length is the number of input sets. The first element of that list holds the elements unique to each list, the second element holds the elements unique to each combination of two lists, and so on.
OK, here's one way, assuming you represent sets as a list of vectors, and the items to be searched for in those sets also as a vector:
# Example data format
sets <- list(v1 = 1:6, v2 = 1:8, v3 = 3:8)
items <- c(2:7)
# Search for items in each set
result <- data.frame(searched = items)
for (set in names(sets)) {
result <- cbind(result, items %in% sets[[set]])
names(result)[length(names(result))] <- set
}
# Count
library(plyr)
ddply(result, names(sets), function (i) {
data.frame(count = nrow(i))
})
This gives you all combinations actually existing in the itemset:
v1 v2 v3 count
1 FALSE TRUE TRUE 1
2 TRUE TRUE FALSE 1
3 TRUE TRUE TRUE 4
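The same membership-pattern idea can be written in base R for six sets. This is a rough sketch rather than a replacement for the answers above; the six example sets and the names universe, membership, region, and counts are made up for illustration.

```r
# Six small example sets (illustrative data only)
sets <- list(s1 = 1:5, s2 = 3:6, s3 = c(1, 6, 7),
             s4 = 7:8, s5 = c(2, 8), s6 = c(4, 9))

# Every element that occurs in at least one set
universe <- unique(unlist(sets))

# Logical matrix: one row per element, one column per set
membership <- sapply(sets, function(s) universe %in% s)

# Label each element by exactly the sets that contain it
region <- apply(membership, 1,
                function(r) paste(names(sets)[r], collapse = "&"))

# Count the elements falling into each exclusive region of the diagram
counts <- table(region)
counts
```

Each element is counted once, in the region named after exactly the sets that contain it, so the counts add up to the size of the union. Regions that contain no elements do not appear in the table.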
