I have 100,000 vectors of length 5 (the list VECTORS below) whose elements are drawn from one million values.
# dictionary of one million possible values
dictionary <- seq_len(1e6)
# generate 100,000 vectors of length 5 whose elements are drawn from dictionary
VECTORS <- lapply(1:1e5, sample, x = dictionary, size = 5)
My problem is to map each distinct vector to one integer, i.e. I need a mappy function that takes a vector and returns an integer.
mappy(c(58431, 976854, 661294, 460685, 341123)) = 15, for example. Do you know how to do this in an efficient way?
Subsidiary question: what if my vectors aren't all the same length anymore?
I assume here you want a bijection between the vectors in your list and integers. One approach is to create a factor variable out of character representations of your vectors. Let's start with a reproducible version of your code (I'll use a smaller example):
set.seed(144)
VECTORS <- replicate(1e2, sample(seq_len(1e6), 5), simplify = FALSE)
Now you can create a factor variable from the character representation of each vector:
fvar <- factor(sapply(VECTORS, paste, collapse=" "))
Now we have a bijection between string representations of elements of VECTORS and integers:
vec <- c(894025, 153892, 98596, 218401, 36616) # 15th element of VECTORS
which(levels(fvar) == paste(vec, collapse=" "))
# [1] 90
levels(fvar)[90]
# [1] "894025 153892 98596 218401 36616"
as.numeric(strsplit(levels(fvar)[90], " ")[[1]])
# [1] 894025 153892 98596 218401 36616
If you wanted to wrap them up into nice functions:
id.from.vec <- function(vec) which(levels(fvar) == paste(vec, collapse=" "))
id.from.vec(c(894025, 153892, 98596, 218401, 36616))
# [1] 90
vec.from.id <- function(id) as.numeric(strsplit(levels(fvar)[id], " ")[[1]])
vec.from.id(90)
# [1] 894025 153892 98596 218401 36616
Note that this works out of the box even if the vectors are different lengths.
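For instance, a minimal sketch with made-up ragged vectors:
RAGGED <- list(c(1, 2), c(3, 4, 5), c(1, 2))
fvar2 <- factor(sapply(RAGGED, paste, collapse=" "))
as.integer(fvar2) # identical vectors get identical ids
# [1] 1 2 1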
A keyed data.table has nice lookup properties:
library(data.table)
set.seed(1)
VECTORS <- lapply(seq(1e5), sample, x = 1e6, size = 5)
VECmap <- setkey(rbindlist(lapply(unique(VECTORS), as.list)))[, ID := .I]
# V1 V2 V3 V4 V5 ID
# 1: 13 897309 366563 678873 6571 1
# 2: 15 557977 640484 732531 848939 2
# 3: 48 18120 911805 188728 805726 3
# 4: 48 830301 862433 506297 877432 4
# 5: 52 873436 824165 86251 576173 5
# ---
# 99996: 999911 583599 803402 240910 931996 99996
# 99997: 999931 146505 287431 180259 230904 99997
# 99998: 999937 175888 266336 874987 982951 99998
# 99999: 999950 960139 455084 586956 875504 99999
# 100000: 999993 191750 258982 518519 78087 100000
mapVEC <- function(...) VECmap[.(...)]$ID
mapID <- function(id) unlist(VECmap[ID==id,!"ID",with=FALSE], use.names=FALSE)
# example usage
mapVEC(52, 873436, 824165, 86251, 576173)
# 5
mapID(5)
# 52 873436 824165 86251 576173
Comments
As mentioned by @Roland, a bijection between (a) the integers 1..1e5 and (b) all length-5 sequences of distinct numbers from 1..1e6 is not possible, so I'm just guessing that a mapping of the vectors actually present is what the OP is after.
When you write a function with ... as an argument, that means an arbitrary number of unnamed arguments are accepted. Within the function, these arguments can be referred to with ..., but are often also seen with c(...) and list(...). Within a data.table, .(...) is an alias for list(...). To see documentation for writing functions, type help.start() and click through to the "R Language Definition."
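For instance, a minimal sketch (the function name is made up):
count_args <- function(...) length(list(...))
count_args(52, 873436, 824165)
# [1] 3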
HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when the variable is a data frame column instead of a free-standing vector, and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
                         nrow = length(rows),
                         ncol = length(cols),
                         dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example may seem inane, since I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables of varying lengths into their respective dataframe columns, so keeping the names in a list I could iterate through would be very helpful. I've tried everything I can think of, but it may be that this just isn't possible in R, especially since I cannot find any posts with solutions to the issue.
You could try:
length(eval(parse(text = list$cols[1])))
# [1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
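Since the question mentions iterating over hundreds of stored names, here is a quick sketch of that loop using the toy x and l from above:
sapply(l$vars, function(nm) length(x[[nm]]))
#  a
# 10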
I have 5 lists that need to be the same length, as the lists will be combined into a dataframe. One of them may not be the same length as the other 4, so what I currently have is an if statement that checks its length against the length of one of the other lists, and then:
1) I create a temporary list using rep(NA, length), where length is the number of extra elements needed to extend the list.
2) I use the concatenate function c() to combine the list that needs extending with the list of NAs.
x <- as.numeric(list)
if (length(list) < length(main)) {
  temp <- rep(NA, length(main) - length(list))
  list <- c(list, temp)
}
List 1 - NA NA
List 2 - 32 53 45
Merged List - 32 53 45 NA NA
The problem with this is that I then get a ton of "NAs introduced by coercion" after the dataframe is created.
Is there a better way of handling this? I assume it has to do with the fact that the main list is numeric. I tried doing the same with 0 instead of NA, but that failed for some reason. What I use to extend the list doesn't matter; I just need it to not be a number (other than 0).
I will assume that you start with several lists like these:
n=as.list(1:2)
a=as.list(letters[1:3])
A=as.list(LETTERS[1:4])
First, I'd suggest to combine them into a list of lists:
z <- list(n,a,A)
so you can find the length of the longest sub-lists:
max.length <- max(sapply(z,length))
and use length<- to pad the shorter elements up to that length with missing values:
# z2 <- lapply(z,function(k) {length(k) <- max.length; return(k)}) # Original version
# z2 <- lapply(z, "length<-", max.length) # More elegant way
z2 <- lapply(lapply(z, unlist), "length<-", max.length) # Even better, because it ensures that the resulting data frame consists of atomic vectors
The resulting list can be easily transformed into data.frame:
df <- as.data.frame(do.call(rbind, z2))
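For the toy lists above, df should look something like this (rbind coerces everything to character, since the sub-lists mix numbers and letters):
df
#   V1 V2   V3   V4
# 1  1  2 <NA> <NA>
# 2  a  b    c <NA>
# 3  A  B    C    D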
Another option uses stringi ("z" taken from @Marat Talipov's post). If you want to get the result as shown in "df":
library(stringi)
as.data.frame(stri_list2matrix(lapply(z, as.character), byrow=TRUE))
# V1 V2 V3 V4
#1 1 2 <NA> <NA>
#2 a b c <NA>
#3 A B C D
NOTE: The columns are now all "factors" (or "characters", if we specify stringsAsFactors=FALSE). As @Richard Scriven mentioned in the comments, it would make more sense to have the "rows" as "columns". The above method is good when the lists are all 'numeric' or all 'character'.
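Following up on that comment, dropping byrow=TRUE makes each list a column instead (a quick sketch; stri_list2matrix pads the shorter lists with NA):
as.data.frame(stri_list2matrix(lapply(z, as.character)), stringsAsFactors=FALSE)
#     V1   V2 V3
# 1    1    a  A
# 2    2    b  B
# 3 <NA>    c  C
# 4 <NA> <NA>  D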
I want to have the intersection of all groups of a data table. So for the given data:
data.table(a = c(1, 2, 3, 2, 3, 2), myGroup = c("x", "x", "x", "y", "z", "z"))
I want to have the result:
2
I know that
Reduce(intersect, list(c(1,2,3), c(2), c(3,2)))
will give me the desired result, but I couldn't figure out how to produce that list of per-group values from a data.table query.
I would try using Reduce in the following way (assuming dt is your data)
Reduce(intersect, dt[, .(list(unique(a))), myGroup]$V1)
## [1] 2
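To see what feeds into Reduce, the inner expression builds one vector of unique a values per group (V1 is a list column):
dt[, .(list(unique(a))), myGroup]
#    myGroup    V1
# 1:       x 1,2,3
# 2:       y     2
# 3:       z   3,2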
Here's one approach.
nGroups <- length(unique(dt[,myGroup]))
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
And here it is with some explanatory comments.
## Mark down the number of groups in your data set
nGroups <- length(unique(dt[,myGroup]))
## Then, use `by="a"` to examine in turn subsets formed by each value of "a".
## For subsets having the full complement of groups
## (i.e. those for which `length(unique(myGroup))==nGroups`),
## return the value of "a" (stored in .BY).
## For the other subsets, return NULL.
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
If that code and the comments aren't clear on their own, a quick glance at the following might help. Basically, the approach above is just looking for and reporting the value of a for those groups that return x,y,z in column V1 below.
dt[,list(list(unique(myGroup))), by="a"]
# a V1
# 1: 1 x
# 2: 2 x,y,z
# 3: 3 x,z
I have a large data.table that I am collapsing to the month level using by.
There are 5 by vars, with numbers of levels c(4, 3, 106, 3, 1380); the 106 is months and the 1380 is a geographic unit. As it turns out there are some 0's, in that some cells have no values. by drops these, but I'd like it to keep them.
Reproducible example:
require(data.table)
set.seed(1)
n <- 1000
s <- function(n,l=5) sample(letters[seq(l)],n,replace=TRUE)
dat <- data.table( x=runif(n), g1=s(n), g2=s(n), g3=s(n,25) )
datCollapsed <- dat[ , list(nv=.N), by=list(g1,g2,g3) ]
datCollapsed[ , prod(dim(table(g1,g2,g3))) ] # how many there should be: 5*5*25=625
nrow(datCollapsed) # how many there are
Is there an efficient way to fill in these missing cells with 0's, so that all combinations of the by vars are in the resultant collapsed data.table?
I'd also go with a cross-join, but would use it in the i-slot of the original call to [.data.table:
keycols <- c("g1", "g2", "g3") ## Grouping columns
setkeyv(dat, keycols) ## Set dat's key
ii <- do.call(CJ, sapply(dat[, ..keycols], unique)) ## CJ() to form index
datCollapsed <- dat[ii, list(nv=.N)] ## Aggregate
## Check that it worked
nrow(datCollapsed)
# [1] 625
table(datCollapsed$nv)
# 0 1 2 3 4 5 6
# 135 191 162 82 39 13 3
This approach is referred to as a "by-without-by" and, as documented in ?data.table, it is just as efficient and fast as passing the grouping instructions in via the by argument (note that current versions of data.table require this grouping to be requested explicitly, with by=.EACHI):
Advanced: Aggregation for a subset of known groups is
particularly efficient when passing those groups in i. When
i is a data.table, DT[i,j] evaluates j for each row
of i. We call this by without by or grouping by i.
Hence, the self join DT[data.table(unique(colA)),j] is
identical to DT[,j,by=colA].
Make a Cartesian join of the unique values, and use that to join back to your results:
dat.keys <- dat[,CJ(g1=unique(g1), g2=unique(g2), g3=unique(g3))]
setkey(datCollapsed, g1, g2, g3)
nrow(datCollapsed[dat.keys]) # effectively a left join of datCollapsed onto dat.keys
# [1] 625
Note that the missing values are NA right now, but you can easily change that to 0s if you want.
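For example, one way to do that (a sketch, reusing the objects above):
res <- datCollapsed[dat.keys] # join keeps all 625 key rows, with NA counts for empty cells
res[is.na(nv), nv := 0L]      # replace the NAs with integer zeros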
Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8, "Believing it does as intended", in one of the "string not the name" sub-sections. You might also find some explanation in the line "The main difference is that $ does not allow computed indices, whereas [[ does." from the help page at ?Extract.
Note that this approach is taken because the question specified using the approach to extract columns from a matrix or data frame, in which case, the [row, column] mode of extraction is really the way to go anyway (and the $ approach would not work with a matrix).
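To illustrate that last point with the same toy data, the string-based [ extraction works on a matrix too, where $ is not available:
m <- as.matrix(mydf)
unname(m[, x[1]])
# [1] 1 2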