I have 2 dataframes
df1 <- data.frame(col11=c("a","a","a","b","b"), col=c(1,2,3,4,5))
df2 <- data.frame(col21=c("c","c","d","d","d"), col=c(1,5,1,2,5))
I want to compute index1 as the number of values in col that the two data frames have in common, for each pair of groups from col11 of df1 and col21 of df2, and index2 as the number of unique values in col across both data frames. Then I want the ratio index3, calculated as index1/index2, for each pair of groups in col11 and col21.
What I did was use inner and outer joins to count index1 and index2, which creates these intermediate data frames
df3 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index1=c(1,2,1,1))
df4 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index2=c(5,6,4,5))
to get this resulting data frame
df5 <- data.frame(group11=c("a","a","b","b"), group21=c("c","d","c","d"), index3=c(0.2,0.33,0.25,0.2))
Could you help me get the resulting data frame without using joins and without creating the intermediate data frames? Thank you.
You could define two Vectorized functions that do the job.
First we split the col columns of both data frames according to their col** and put them into a list.
L <- c(split(df1$col, df1$col11), split(df2$col, df2$col21))
Define FUN3 to count the lengths of the intersections, and define FUN4 to count the lengths of the combined values. (I named the functions after your df3 and df4 interim data frames, since each one is the corresponding step.)
FUN3 <- Vectorize(function(x, y) length(intersect(x, y)))
FUN4 <- Vectorize(function(x, y) length(c(x, y)))
Use outer which comes from the outer product of matrices. We just need [1:2, 3:4] as subset of the result.
res3 <- as.vector(outer(L, L, FUN3)[1:2, 3:4])
res4 <- as.vector(outer(L, L, FUN4)[1:2, 3:4])
To get the group names in the matching order, we apply the same logic to the letters from the col** columns, using the list positions 1 to 4.
nm <- do.call(rbind, strsplit(as.vector(outer(1:2, 3:4, paste)), " "))
nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)])
Finally we cbind everything together and setNames.
setNames(cbind.data.frame(nm, res3 / res4), c("group11", "group21", "index3"))
# group11 group21 index3
# 1 a c 0.2000000
# 2 b c 0.2500000
# 3 a d 0.3333333
# 4 b d 0.2000000
Edit
outer gives the outer product over the whole list, i.e. a 4x4 matrix. Since we only compare a, b to c, d, we just want a part of the resulting matrix. In this example we want the upper right 2x2 sub-matrix, which is rows 1:2 and columns 3:4.
(res3 <- outer(L, L, FUN3))
# a b c d
# a 3 0 1 2
# b 0 2 1 1
# c 1 1 2 2
# d 2 1 2 3
We may formulate that less hard-coded like so:
(rows <- which(rownames(res3) %in% unique(df1$col11))) ## i.e. %in% c("a", "b")
# [1] 1 2
(cols <- which(colnames(res3) %in% unique(df2$col21))) ## i.e. %in% c("c", "d")
# [1] 3 4
(res3 <- as.vector(res3[rows, cols]))
# [1] 1 1 2 1
FUN4 accordingly.
For the names nm we want to subset the names of the L list. To correspond with the data, we need numerical sequences according to the length (i.e. number) of unique strings of the relevant columns of both data frames. Since the numbers of the second one should be consecutive, we just add the number of the first data frame.
lg1 <- seq(length(unique(df1$col11)))
lg2 <- seq(length(unique(df2$col21))) + length(unique(df1$col11))
nm <- do.call(rbind, strsplit(as.vector(outer(lg1, lg2, paste)), " "))
(nm <- apply(nm, 1:2, function(x) names(L)[as.double(x)]))
# [,1] [,2]
# [1,] "a" "c"
# [2,] "b" "c"
# [3,] "a" "d"
# [4,] "b" "d"
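For comparison, here is a minimal sketch of the same calculation without outer, assuming the df1 and df2 from the question. The names s1, s2 and pairs are introduced only for illustration. expand.grid varies its first argument fastest, so the pairs come out in the same order as the result above.

```r
df1 <- data.frame(col11 = c("a","a","a","b","b"), col = c(1,2,3,4,5))
df2 <- data.frame(col21 = c("c","c","d","d","d"), col = c(1,5,1,2,5))

# Split the values of col by group, once per data frame
s1 <- split(df1$col, df1$col11)
s2 <- split(df2$col, df2$col21)

# All pairs of groups (hypothetical helper object, not from the answer above)
pairs <- expand.grid(group11 = names(s1), group21 = names(s2),
                     stringsAsFactors = FALSE)

# index3 = shared values / combined length, computed pair by pair
pairs$index3 <- mapply(
  function(g1, g2) {
    length(intersect(s1[[g1]], s2[[g2]])) / length(c(s1[[g1]], s2[[g2]]))
  },
  pairs$group11, pairs$group21, USE.NAMES = FALSE
)
pairs
```

This avoids subsetting a 4x4 matrix, at the cost of one explicit function call per pair.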
Here is a data.table approach which unfortunately still has a lot of joining.
library(data.table)
df1 <- data.frame(col11=c("a","a","a","b","b"), col=c(1,2,3,4,5))
df2 <- data.frame(col21=c("c","c","d","d","d"), col=c(1,5,1,2,5))
setDT(df1)
setDT(df2)
res = CJ(col11 = df1[["col11"]], col21 = df2[["col21"]], unique = TRUE)
res[, index1 := df1[df2, on = .(col)][, .N, keyby = .(col11, col21)]$N]
res[, index2 := mapply(function(x, y) length((c(df1[col11 == x, col], df2[col21 == y, col]))), col11, col21)]
res[, index3 := index1 / index2][]
#> col11 col21 index1 index2 index3
#> <char> <char> <int> <int> <num>
#> 1: a c 1 5 0.2000000
#> 2: a d 2 6 0.3333333
#> 3: b c 1 4 0.2500000
#> 4: b d 1 5 0.2000000
We use data.table's reference semantics to directly update the data.table within the call so we have no additional objects.
The CJ(...) is to set up all the unique combinations.
The index1 := df1[df2, ...] is join syntax followed by determining the count (.N) of each combination. Note, I believe it is safe to not join this back to res because the keyby will result in the same order as what was made in the CJ.
The mapply(...) call is a fancy loop where we filter for each row in the res for each combination. I will make changes depending on feedback on whether the col is unique or not.
Finally, it is worth pointing out that there is no simple solution for this. There are going to be intermediate calculation steps to prevent these calls from going too long.
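The index1 step above relies on keyby returning rows in the same order as CJ, and on every pair of groups sharing at least one value. If a pair had no match, the counts would have fewer rows than res. A hedged sketch of joining the counts back by key instead, where counts is a name introduced here for illustration:

```r
library(data.table)

df1 <- data.table(col11 = c("a","a","a","b","b"), col = c(1,2,3,4,5))
df2 <- data.table(col21 = c("c","c","d","d","d"), col = c(1,5,1,2,5))

# All unique pairs of groups
res <- CJ(col11 = df1$col11, col21 = df2$col21, unique = TRUE)

# Inner join on col, then count the matches per pair of groups
counts <- df1[df2, on = .(col), nomatch = NULL][
  , .(index1 = .N), by = .(col11, col21)]

# Update join: match on the group columns rather than on row order
res[counts, on = .(col11, col21), index1 := i.index1]

# Pairs without any shared value get a count of zero
res[is.na(index1), index1 := 0L]
res
```

With the example data every pair has a match, so the result is the same as above; the difference only shows up when some pair shares no values.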
I have a dataframe df1
ID <- c("A","B","C")
Measurement <- c("Length","Height","Breadth")
df1 <- data.frame(ID,Measurement)
I am trying to create combinations of measurements with an underscore between them and put it under the ID column "ALL"
Here is my desired output
ID Measurement
A Length
B Height
C Breadth
ALL Length_Height_Breadth
ALL Length_Breadth_Height
ALL Breadth_Height_Length
ALL Breadth_Length_Height
ALL Height_Length_Breadth
ALL Height_Breadth_Length
Also when there are similar measurements in the "measurement" column, I want to eliminate the underscore.
For example:
ID <- c("A","B")
Measurement <- c("Length","Length")
df2 <- data.frame(ID,Measurement)
Then I would want the desired output to be
ID Measurement
A Length
B Length
ALL Length
I am trying to do something like this which is totally wrong
df1$ID <- paste(df1$Measurement, df1$Measurement, sep="_")
Can someone point me in the right direction to achieving the above outputs?
I would like to see how it is done programmatically instead of using the actual measurement names. I am intending to apply the logic to a larger dataset that has several measurement names and so a general solution would be much appreciated.
We could use the permn function from the combinat package:
library(combinat)
sol_1 <- sapply(permn(unique(df1$Measurement)),
FUN = function(x) paste(x, collapse = '_'))
rbind.data.frame(df1, data.frame('ID' = 'All', 'Measurement' = sol_1))
# ID Measurement
# 1 A Length
# 2 B Height
# 3 C Breadth
# 4 All Length_Height_Breadth
# 5 All Length_Breadth_Height
# 6 All Breadth_Length_Height
# 7 All Breadth_Height_Length
# 8 All Height_Breadth_Length
# 9 All Height_Length_Breadth
sol_2 <- sapply(permn(unique(df2$Measurement)),
FUN = function(x) paste(x, collapse = '_'))
rbind.data.frame(df2, data.frame('ID' = 'All', 'Measurement' = sol_2))
# ID Measurement
# 1 A Length
# 2 B Length
# 3 All Length
Giving credit where credit is due: Generating all distinct permutations of a list.
We could also use permutations from the gtools package (HT #joel.wilson):
library(gtools)
unique_meas <- as.character(unique(df1$Measurement))
apply(permutations(length(unique_meas), length(unique_meas), unique_meas),
1, FUN = function(x) paste(x, collapse = '_'))
# "Breadth_Height_Length" "Breadth_Length_Height"
# "Height_Breadth_Length" "Height_Length_Breadth"
# "Length_Breadth_Height" "Length_Height_Breadth"
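If adding a package is not an option, the permutations can also be generated in base R. The following is a minimal sketch, not taken from either package: perms is a hypothetical recursive helper, and it is only practical for a small number of distinct measurements because the number of permutations grows factorially.

```r
# Hypothetical helper: all permutations of a vector, returned as a list
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    # Fix v[i] in front, then permute the remaining elements
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

ID <- c("A", "B", "C")
Measurement <- c("Length", "Height", "Breadth")
df1 <- data.frame(ID, Measurement, stringsAsFactors = FALSE)

# One label per permutation of the distinct measurements
labels <- vapply(perms(unique(df1$Measurement)), paste, character(1),
                 collapse = "_")
result <- rbind(df1, data.frame(ID = "ALL", Measurement = labels,
                                stringsAsFactors = FALSE))
result
```

Because unique() is applied first, a column with only one distinct measurement yields a single "ALL" row with no underscore.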
I have a data table containing 3 columns, one of them
contains a key:value list of different lengths.
I wish to rearrange the table such that each row will have only one key, conditioned on the value
For example, suppose that I wish to get all entries whose value is <= 2, so that each key is on its own row:
input_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}"),
c=c(1))
the wanted table then should be
tbl_output <- data.table::data.table(a=c("AA",
"AA"),b=c("ha:llo","wor:ld"), c=c(1,1), s=c(1,2))
I had tried the following function:
data_table_clean <- function(dt){
dt[ ,"b" := data.table::tstrsplit(b, ',', fixed = T),by=c(a, c)]
dt[,c('b', 's'):= data.table::tstrsplit(b, ':', fixed=TRUE)]
return(dt[s <=2,])
}
this produces the following error
"Error in eval(expr, envir, enclos) : object 'a' not found"
Any suggestions are welcome, of course.
The keys are actually of the form :
input2_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"99:1d:3u:7y:89:67\":1,\"99:1D:34:YY:T6:Y6\":2,\"ll:5Y:UY:56:R5:R6\":3}"),
c=c(1))
and accordingly the output table should be:
tbl2_output <- data.table::data.table(a=c("AA",
"AA"),b=c("99:1d:3u:7y:89:67","99:1D:34:YY:T6:Y6"),
c=c(1,1), s=c(1,2))
Thank you!
update
data_table_clean <- function(dt){
res <- dt[, data.table::tstrsplit(unlist(strsplit(gsub('[{}"]', '', b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE),
by = .(a, c)][V2 > -100]
data.table::setnames(res, 3:4, c("b", "s"))
res
}
when running this I get the following error:
Error in .subset(x, j) : invalid subscript type 'list'
One option would be to extract the characters that we need in the final output. We use str_extract to do that after grouping by 'a', 'c'. The output is a list, which we unlist, get the non-numeric and numeric into two columns and then subset the rows with the condition s<3.
library(stringr)
library(data.table)
input_tbl[, {
tmp <- unlist(str_extract_all(b, "[A-Za-z]+:[A-Za-z]+|\\d+"))
list(b=tmp[c(TRUE, FALSE)], s=tmp[c(FALSE, TRUE)])
}, by = .(a,c)][s<3]
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Or, if we are using strsplit/tstrsplit, grouped by 'a', 'c', we remove the curly brackets and quotes ([{}"]) with gsub, split by , (strsplit), unlist the output, and then use tstrsplit to split by a : that is followed by a number. The subset part is similar to the above.
res <- input_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b), ',', fixed=TRUE)), ":(?=\\d)", perl=TRUE) ,.(a,c)][V2<3]
setnames(res, 3:4, c("b", "s"))
res
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Update
For the updated dataset, we can do the tstrsplit on the last delimiter (:)
res1 <- input2_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE) ,
by = .(a, c)][V2 < 3]
setnames(res1, 3:4, c("b", "s"))
res1
# a c b s
# 1: AA 1 99:1d:3u:7y:89:67 1
# 2: AA 1 99:1D:34:YY:T6:Y6 2
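One caveat: tstrsplit returns character columns unless type.convert = TRUE is passed, so V2 < 3 compares strings. That works for single-digit values but would also keep a value such as "10". Below is a base R sketch of the same last-delimiter split with an explicit numeric conversion. The string b is a made-up variant of the input whose last value is 10, and items, parts, keys and vals are illustrative names.

```r
# Made-up variant of the input: the last value is 10 instead of 3
b <- "{\"99:1d:3u:7y:89:67\":1,\"99:1D:34:YY:T6:Y6\":2,\"ll:5Y:UY:56:R5:R6\":10}"

# Strip braces and quotes, then split into key:value items
items <- strsplit(gsub('[{}"]', '', b), ",", fixed = TRUE)[[1]]

# Split each item on the last colon only (lookahead: no further colon follows)
parts <- strsplit(items, ":(?=[^:]+$)", perl = TRUE)
keys <- vapply(parts, `[`, character(1), 1)
vals <- as.numeric(vapply(parts, `[`, character(1), 2))

# Numeric comparison, so 10 is correctly excluded
keys[vals < 3]
```

In the data.table version, converting the column (for example with as.numeric, or type.convert = TRUE in tstrsplit) before filtering should have the same effect.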
Since it seems like you are working with a JSON object, why not use something that parses the JSON, for example, the "jsonlite" package?
With that, you can make a simple function, that looks like this:
myFun <- function(invec) {
require(jsonlite)
x <- fromJSON(invec)
list(b = names(x), s = unlist(x))
}
Now, applied to your dataset, you would get:
input_tbl[, myFun(b), by = .(a, c)]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
# 3: AA 1 doog:bye 3
And, for the subsetting:
input_tbl[, myFun(b), by = .(a, c)][s <= 2]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
You can probably also even rewrite the myFun function to add a "threshold" argument that lets you subset within the function itself.
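A possible version of that idea is sketched below. myFunThreshold and its threshold argument are hypothetical names, and the default of Inf (keep everything) is an assumption made for this sketch.

```r
library(jsonlite)
library(data.table)

input_tbl <- data.table(
  a = "AA",
  b = "{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}",
  c = 1
)

# Hypothetical variant of myFun: parse the JSON, keep values <= threshold
myFunThreshold <- function(invec, threshold = Inf) {
  x <- unlist(fromJSON(invec))   # named vector: key -> value
  keep <- x <= threshold
  list(b = names(x)[keep], s = unname(x[keep]))
}

out <- input_tbl[, myFunThreshold(b, threshold = 2), by = .(a, c)]
out
```

Filtering inside the function avoids building rows that are dropped right afterwards, although the chained [s <= 2] form is arguably easier to read.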
How do you refer to variables in a data.table if the variable names are stored in a character vector? For instance, this works for a data.frame:
df <- data.frame(col1 = 1:3)
colname <- "col1"
df[colname] <- 4:6
df
# col1
# 1 4
# 2 5
# 3 6
How can I perform this same operation for a data.table, either with or without := notation? The obvious thing of dt[ , list(colname)] doesn't work (nor did I expect it to).
Two ways to programmatically select variable(s):
with = FALSE:
DT = data.table(col1 = 1:3)
colname = "col1"
DT[, colname, with = FALSE]
# col1
# 1: 1
# 2: 2
# 3: 3
'dot dot' (..) prefix:
DT[, ..colname]
# col1
# 1: 1
# 2: 2
# 3: 3
For further description of the 'dot dot' (..) notation, see New Features in 1.10.2 (it is currently not described in help text).
To assign to variable(s), wrap the LHS of := in parentheses:
DT[, (colname) := 4:6]
# col1
# 1: 4
# 2: 5
# 3: 6
The latter is known as a column plonk, because you replace the whole column vector by reference. If a subset i were present, it would subassign by reference. The parens around (colname) are a shorthand introduced in version v1.9.4 on CRAN Oct 2014. Here is the news item:
Using with = FALSE with := is now deprecated in all cases, given that wrapping
the LHS of := with parentheses has been preferred for some time.
colVar = "col1"
DT[, (colVar) := 1] # please change to this
DT[, c("col1", "col2") := 1] # no change
DT[, 2:4 := 1] # no change
DT[, c("col1","col2") := list(sum(a), mean(b))] # no change
DT[, `:=`(...), by = ...] # no change
See also Details section in ?`:=`:
DT[i, (colnamevector) := value]
# [...] The parens are enough to stop the LHS being a symbol
And to answer further question in comment, here's one way (as usual there are many ways) :
DT[, (colname) := cumsum(get(colname))]  # parenthesised LHS; with = FALSE is deprecated with :=
# col1
# 1: 4
# 2: 9
# 3: 15
or, you might find it easier to read, write and debug just to eval a paste, similar to constructing a dynamic SQL statement to send to a server :
expr = paste0("DT[,",colname,":=cumsum(",colname,")]")
expr
# [1] "DT[,col1:=cumsum(col1)]"
eval(parse(text=expr))
# col1
# 1: 4
# 2: 13
# 3: 28
If you do that a lot, you can define a helper function EVAL :
EVAL = function(...)eval(parse(text=paste0(...)),envir=parent.frame(2))
EVAL("DT[,",colname,":=cumsum(",colname,")]")
# col1
# 1: 4
# 2: 17
# 3: 45
Now that data.table 1.8.2 automatically optimizes j for efficiency, it may be preferable to use the eval method. The get() in j prevents some optimizations, for example.
Or, there is set(). A low overhead, functional form of :=, which would be fine here. See ?set.
set(DT, j = colname, value = cumsum(DT[[colname]]))
DT
# col1
# 1: 4
# 2: 21
# 3: 66
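Since set() takes the column name as an ordinary argument, it also fits naturally in a loop over several names. A small sketch, using a fresh two-column table and a cols vector that are made up for illustration:

```r
library(data.table)

DT <- data.table(col1 = 1:3, col2 = 4:6)
cols <- c("col1", "col2")   # column names held in a character vector

# Replace each named column by its cumulative sum, by reference
for (cn in cols) {
  set(DT, j = cn, value = cumsum(DT[[cn]]))
}
DT
```

No non-standard evaluation is involved here, which is why this form is often convenient inside functions.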
*This is not an answer really, but I don't have enough street cred to post comments :/
Anyway, for anyone who might be looking to actually create a new column in a data table with a name stored in a variable, I've got the following to work. I have no clue as to its performance. Any suggestions for improvement? Is it safe to assume a nameless new column will always be given the name V1?
colname <- as.name("users")
# Google Analytics query is run with chosen metric and resulting data is assigned to DT
DT2 <- DT[, sum(eval(colname, .SD)), by = country]
setnames(DT2, "V1", as.character(colname))
Notice I can reference it just fine in the sum() but can't seem to get it to assign in the same step. BTW, the reason I need to do this is colname will be based on user input in a Shiny app.
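One way to get the name assigned in the same step is to pass the name through .SDcols, since lapply over .SD keeps the column names. A minimal sketch with made-up data; the country and users values stand in for the Google Analytics result, and colname is kept as a plain character string here:

```r
library(data.table)

# Made-up stand-in for the query result
DT <- data.table(country = c("x", "x", "y"), users = c(1, 2, 3))
colname <- "users"   # e.g. chosen by the user in the app

# The aggregated column keeps the name stored in colname
DT2 <- DT[, lapply(.SD, sum), by = country, .SDcols = colname]
DT2
```

This also extends to several columns at once if colname holds more than one name.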
Retrieve multiple columns from data.table via variable or function:
library(data.table)
x <- data.table(this=1:2,that=1:2,whatever=1:2)
# === explicit call
x[, .(that, whatever)]
x[, c('that', 'whatever')]
# === indirect via variable
# ... direct assignment
mycols <- c('that','whatever')
# ... same as result of a function call
mycols <- grep('a', colnames(x), value=TRUE)
x[, ..mycols]
x[, .SD, .SDcols=mycols]
# === direct 1-liner usage
x[, .SD, .SDcols=c('that','whatever')]
x[, .SD, .SDcols=grep('a', colnames(x), value=TRUE)]
which all yield
#    that whatever
# 1:    1        1
# 2:    2        2
I find the .SDcols way the most elegant.
With development version 1.14.3, data.table has gained a new interface for programming on data.table, see item 10 in New Features. It uses the new env = parameter.
library(data.table) # development version 1.14.3 used
dt <- data.table(col1 = 1:3)
colname <- "col1"
dt[, cn := cn + 3L, env = list(cn = colname)][]
#     col1
#    <int>
# 1:     4
# 2:     5
# 3:     6
For multiple columns and a function applied on column values.
When updating the values from a function, the RHS must be a list object, so using a loop on .SD with lapply will do the trick.
The example below converts integer columns to numeric columns
a1 <- data.table(a=1:5, b=6:10, c1=letters[1:5])
sapply(a1, class) # show classes of columns
# a b c1
# "integer" "integer" "character"
# column name character vector
nm <- c("a", "b")
# Convert columns a and b to numeric type
a1[, j = (nm) := lapply(.SD, as.numeric ), .SDcols = nm ]
sapply(a1, class)
# a b c1
# "numeric" "numeric" "character"
You could try this:
colname <- as.name("COL_NAME")
DT2 <- DT[, list(COL_SUM=sum(eval(colname, .SD))), by = c(group)]