R data.table intersection of all groups

I want to have the intersection of all groups of a data table. So for the given data:
data.table(a=c(1,2,3, 2, 3,2), myGroup=c("x","x","x", "y", "z","z"))
I want to have the result:
2
I know that
Reduce(intersect, list(c(1,2,3), c(2), c(3,2)))
will give me the desired result, but I couldn't figure out how to produce such a list of groups from a data.table query.

I would try using Reduce in the following way (assuming dt is your data)
Reduce(intersect, dt[, .(list(unique(a))), myGroup]$V1)
## [1] 2
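If it helps to see what Reduce() is being fed, the inner query simply builds one list element per group, each holding that group's unique values of a. A quick illustration, assuming dt is built from the question's example data:
library(data.table)
dt <- data.table(a = c(1, 2, 3, 2, 3, 2), myGroup = c("x", "x", "x", "y", "z", "z"))
dt[, .(list(unique(a))), myGroup]
#    myGroup    V1
# 1:       x 1,2,3
# 2:       y     2
# 3:       z   3,2
Reduce(intersect, ...) then intersects those per-group vectors, leaving only the values present in every group.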

Here's one approach.
nGroups <- length(unique(dt[,myGroup]))
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
And here it is with some explanatory comments.
## Mark down the number of groups in your data set
nGroups <- length(unique(dt[,myGroup]))
## Then, use `by="a"` to examine in turn subsets formed by each value of "a".
## For subsets having the full complement of groups
## (i.e. those for which `length(unique(myGroup))==nGroups`),
## return the value of "a" (stored in .BY).
## For the other subsets, return NULL.
dt[, if(length(unique(myGroup))==nGroups) .BY else NULL, by="a"][[1]]
# [1] 2
If that code and the comments aren't clear on their own, a quick glance at the following might help. Basically, the approach above just looks for and reports each value of a whose group list in column V1 below contains all of x, y, z.
dt[,list(list(unique(myGroup))), by="a"]
# a V1
# 1: 1 x
# 2: 2 x,y,z
# 3: 3 x,z

Related

Which() for the whole dataset

I want to write a function in R that does the following:
I have a table of cases, and some data. I want to find the correct row matching to each observation from the data. Example:
crit1 <- c(1,1,2)
crit2 <- c("yes","no","no")
Cases <- matrix(c(crit1,crit2),ncol=2,byrow=FALSE)
data1 <- c(1,2,1)
data2 <- c("no","no","yes")
data <- matrix(c(data1,data2),ncol=2,byrow=FALSE)
Now I want a function that returns for each row of my data, the matching row from Cases, the result would be the vector
c(2,3,1)
Are you sure you want to be using matrices for this?
Note that the numeric data in crit1 and data1 has been converted to string (matrices can only store one data type):
typeof(data[ , 1L])
# [1] character
In R, a data.frame is a much more natural choice for what you're after. data.table is (among many other things) a toolset for working with "enhanced" data.frames; See the Introduction.
I would create your data as:
library(data.table)
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
We can get the matching row indices as asked by doing a keyed join (See the vignette on keys):
setkey(Cases) # key by all columns
Cases
# crit1 crit2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
setkey(data)
data
# data1 data2
# 1: 1 no
# 2: 1 yes
# 3: 2 no
Cases[data, which=TRUE]
# [1] 1 2 3
This differs from 2,3,1 because setkey() has sorted both tables, but note that the matches are still correct for the reordered data.
If you don't want to change the order of your data, it's slightly more complicated (but more readable if you're not used to data.table syntax):
Cases = data.table(crit1, crit2)
data = data.table(data1, data2)
Cases[data, on = setNames(names(data), names(Cases)), which=TRUE]
# [1] 2 3 1
The on= part creates the mapping between the columns of data and those of Cases.
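To see what on= actually receives here, the setNames() call just builds a named character vector mapping each column of Cases to the corresponding column of data; a small illustration using the column names above:
setNames(names(data), names(Cases))
#   crit1   crit2
# "data1" "data2"
So it is equivalent to writing on = c(crit1 = "data1", crit2 = "data2") by hand.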
We could write this in a bit more SQL-like fashion as:
Cases[data, on = .(crit1 == data1, crit2 == data2), which=TRUE]
# [1] 2 3 1
This is shorter and more readable for your sample data, but not as extensible if your data has many columns or if you don't know the column names in advance.
The prodlim package has a function for that:
library(prodlim)
row.match(data,Cases)
[1] 2 3 1

Conditional group by join in R

I am new to R and rather flummoxed by the following problem. I have two vectors of dates (the vectors are not necessarily aligned, nor of the same length).
I want to find for each date in the first vector the next date in the second vector.
vecA <- as.Date(c('1951-07-01', '1953-01-01', '1957-04-01', '1958-12-01',
'1963-06-01', '1965-05-01'))
vecB <- as.Date(c('1952-01-12', '1952-02-01', '1954-03-01', '1958-08-01',
'1959-03-01', '1964-03-01', '1966-05-01'))
In SQL I would write something like the following, but I cannot find any tips on SO as to how to do this in R.
select vecA.Date, min(vecB.Date)
from vecA inner join vecB
on vecA.Date < vecB.Date
group by vecA.Date
The output should look like this:
Start End
1951-07-01 1952-01-12
1953-01-01 1954-03-01
1957-04-01 1958-08-01
1958-12-01 1959-03-01
1963-06-01 1964-03-01
1965-05-01 1966-05-01
Here's a possible solution using a data.table rolling join:
library(data.table)
dt1 <- as.data.table(vecA) ## convert to `data.table` object
dt2 <- as.data.table(vecB) ## convert to `data.table` object
setkey(dt2) # key dt2 so the join can use binary search
res <- dt2[dt1, vecB, roll = -Inf, by = .EACHI] # rolling join: match each dt1 date to the next dt2 date
setnames(res, c("Start", "End"))
res
# Start End
# 1: 1951-07-01 1952-01-12
# 2: 1953-01-01 1954-03-01
# 3: 1957-04-01 1958-08-01
# 4: 1958-12-01 1959-03-01
# 5: 1963-06-01 1964-03-01
# 6: 1965-05-01 1966-05-01
Alternatively, we can also do:
data.table(vecA=vecB, vecB, key="vecA")[dt1, roll=-Inf]
Here vecB is copied into the key column vecA; after the rolling join that key column holds dt1's dates, so the duplicated vecB column still shows the matched end date.
This code will do what you are asking, but it's not clear what you are trying to accomplish, so this might not be the best way. In essence, the code first orders both vectors to ensure they are sorted the same way. Then, using a for loop, it loops over all the elements of vecA and uses x < vecB to find out which elements of vecB are greater than x.
That is wrapped in which(), which returns the numeric index of each TRUE element of a vector, and then in min(), which gives the smallest such index. This is then used to subset vecB to return the date; it's all wrapped in print so you can see the output of the loop.
This is probably not the best way of doing this (a vectorised alternative is sketched after the loop output below), but without more context on your goals it should at least get you started.
> vecA <- vecA[order(vecA)]
> vecB <- vecB[order(vecB)]
> for(x in vecA) {print(vecB[min(which(x < vecB))])}
[1] "1952-01-12"
[1] "1954-03-01"
[1] "1958-08-01"
[1] "1959-03-01"
[1] "1964-03-01"
[1] "1966-05-01"

Multiple one-to-many matching between vectors in R

I want to update a dataframe with values from a table of new values where there is a one-to-many relationship between the dataframe and table of new values. This code illustrates the intent:
df = data.frame(x=rep(letters[1:4],5,rep=T), y=1:20)
and a table of new values:
eds = data.frame(x=c('c','d'), val=c(101, 102))
For a one-to-one relationship the following should work:
df$x[match(eds$x, df$x)] = eds$x[match(df$x, eds$x)]
But match() only returns the first match, so this throws the error "number of items to replace is not a multiple of replacement length". Grateful for any tips on the most efficient way to approach this. I'm guessing some sapply wrapper, but I can't think of the method.
Thanks in advance.
tmp <- eds$val[match(df$x, eds$x)]    # New values lined up with df$x (NA where there is no match)
df$y <- ifelse(is.na(tmp), df$y, tmp) # Use the new value where one exists, keep the old y otherwise
head(df, 5)
# x y
# 1 a 1
# 2 b 2
# 3 c 101
# 4 d 102
# 5 a 5
Note that this is not a very robust solution. It depends on your exact data structure here (the repeating 'c', 'd' pattern), but it works for this case:
df[df[["x"]] %in% eds[["x"]], "y"] = eds[[2]]

Compare values of data frames with different numbers of rows

I defined the following function, which takes two data frames, DF_TAGS_LIST and DF_epc_list. Each data frame has a single column, and the two columns have different numbers of rows. I want to search for each value of DF_TAGS_LIST in DF_epc_list and, if it is found, store it in another data frame.
One example of DF_TAGS_LIST:
TAGS_LIST
3036029B539869100000000B
3036029B537663000000002A
3036029B5398694000000009
3036029B539869400000000C
3036029B5398690000000006
3036029B5398692000000007
And one example of DF_epc_list:
EPC
3036029B539869100000000B
3036029B537663000000002A
3036029B5398690000000006
3036029B5398692000000007
3036029B5398691000000006
3036029B5376630000000034
3036029B53986940000000WF
3036029B5398694000000454
3036029B5398690000000234
3036029B53986920000000FG
In this case, I would like one dataframe output that had the following values:
FOUND_TAGS
3036029B5398690000000006
3036029B5398692000000007
3036029B539869100000000B
3036029B537663000000002A
My function is:
FOUND_COMPARE_TAGS<-function(DF_TAGS_LIST, DF_epc_list){
DF_epc_list<-toString(DF_epc_list)
DF_TAGS_LIST<-toString(DF_TAGS_LIST)
DF_found_epc_tags <- data.frame(DF_found_epc_tags=intersect(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list)); setdiff(union(DF_TAGS_LIST$DF_TAGS_LIST, DF_epc_list$DF_epc_list), DF_found_epc_tags$DF_found_epc_tags)
#DF_found_epc_tags <- data.frame(DF_found_epc_tags = DF_TAGS_LIST[unique(na.omit(match(DF_epc_list$DF_epc_list, DF_TAGS_LIST$DF_TAGS_LIST))),])
return(DF_found_epc_tags)
}
It now returns an empty data frame with two columns. I have only recently started programming in R.
You can use %in% or (as I mentioned in my comment) intersect:
DF_TAGS_LIST[DF_TAGS_LIST$TAGS_LIST %in% DF_epc_list$EPC, , drop = FALSE]
# TAGS_LIST
# 1 3036029B539869100000000B
# 2 3036029B537663000000002A
# 5 3036029B5398690000000006
# 6 3036029B5398692000000007
intersect(DF_TAGS_LIST$TAGS_LIST, DF_epc_list$EPC)
# [1] "3036029B539869100000000B" "3036029B537663000000002A"
# [3] "3036029B5398690000000006" "3036029B5398692000000007"
Another posted approach stacks the two columns and keeps the values that occur in both (the column names have to match for rbind(), hence the setNames(); this also assumes neither column contains internal duplicates):
FOUND_TAGS <- rbind(DF_TAGS_LIST, setNames(DF_epc_list, names(DF_TAGS_LIST)))
FOUND_TAGS <- FOUND_TAGS[duplicated(FOUND_TAGS), , drop = FALSE]

Remove quotes from vector element in order to use it as a value

Suppose that I have a vector x whose elements I want to use to extract columns from a matrix or data frame M.
If x[1] = "A", I cannot use M$x[1] to extract the column with header name A, because M$A is recognized while M$"A" is not. How can I remove the quotes so that M$x[1] is M$A rather than M$"A" in this instance?
Don't use $ in this case; use [ instead. Here's a minimal example (if I understand what you're trying to do).
mydf <- data.frame(A = 1:2, B = 3:4)
mydf
# A B
# 1 1 3
# 2 2 4
x <- c("A", "B")
x
# [1] "A" "B"
mydf[, x[1]] ## As a vector
# [1] 1 2
mydf[, x[1], drop = FALSE] ## As a single column `data.frame`
# A
# 1 1
# 2 2
I think you would find your answer in the R Inferno. Start around Circle 8, "Believing It Does as Intended", in one of the "string not the name" sub-sections. You might also find some explanation in the line "The main difference is that $ does not allow computed indices, whereas [[ does." from the help page at ?Extract.
Note that this approach is taken because the question asks about extracting columns from a matrix or data frame, in which case the [row, column] form of extraction is really the way to go anyway (the $ approach would not work on a matrix at all).
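Since that help-page line mentions [[, here is the same computed-index extraction done with [[, reusing the mydf and x objects defined above:
mydf[[x[1]]]  # [[ accepts a computed (character) index, unlike $
# [1] 1 2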
