Split rows by condition in R data.table

I have a data.table with 3 columns, one of which contains a key:value list of varying length.
I wish to rearrange the table so that each row holds only one key, filtered on the value.
For example, suppose that I want all rows whose value is <= 2, with each key on its own row:
input_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}"),
c=c(1))
The wanted table should then be:
tbl_output <- data.table::data.table(a=c("AA",
"AA"),b=c("ha:llo","wor:ld"), c=c(1,1), s=c(1,2))
I tried the following function:
data_table_clean <- function(dt){
dt[ ,"b" := data.table::tstrsplit(b, ',', fixed = T),by=c(a, c)]
dt[,c('b', 's'):= data.table::tstrsplit(b, ':', fixed=TRUE)]
return(dt[s <=2,])
}
This produces the following error:
"Error in eval(expr, envir, enclos) : object 'a' not found"
Any suggestions are welcome, of course.
The keys are actually of the form :
input2_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"99:1d:3u:7y:89:67\":1,\"99:1D:34:YY:T6:Y6\":2,\"ll:5Y:UY:56:R5:R6\":3}"),
c=c(1))
and accordingly the output table should be:
tbl2_output <- data.table::data.table(a=c("AA",
"AA"),b=c("99:1d:3u:7y:89:67","99:1D:34:YY:T6:Y6"),
c=c(1,1), s=c(1,2))
Thank you!
update
data_table_clean <- function(dt){
res <- dt[, data.table::tstrsplit(unlist(strsplit(gsub('[{}"]', '', b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE),
by = .(a, c)][V2 > -100]
data.table::setnames(res, 3:4, c("b", "s"))
res
}
When running this, I get the following error:
Error in .subset(x, j) : invalid subscript type 'list'

One option would be to extract the characters that we need in the final output. We use str_extract to do that after grouping by 'a', 'c'. The output is a list, which we unlist, get the non-numeric and numeric into two columns and then subset the rows with the condition s<3.
library(stringr)
library(data.table)
input_tbl[, {
tmp <- unlist(str_extract_all(b, "[A-Za-z]+:[A-Za-z]+|\\d+"))
list(b=tmp[c(TRUE, FALSE)], s=tmp[c(FALSE, TRUE)])
}, by = .(a,c)][s<3]
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
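The alternating index in the snippet above relies on R recycling a short logical vector along the whole vector. Here is a minimal base R sketch of just that step. The hand-written `tmp` is an assumption that stands in for the `str_extract_all` result, and the values are converted to numeric before the comparison:

```r
# Stand-in for unlist(str_extract_all(...)): keys and values alternate
tmp <- c("ha:llo", "1", "wor:ld", "2", "doog:bye", "3")

# c(TRUE, FALSE) is recycled, so it selects positions 1, 3, 5, ...
keys <- tmp[c(TRUE, FALSE)]
# c(FALSE, TRUE) selects positions 2, 4, 6, ...; convert to numeric for comparisons
vals <- as.numeric(tmp[c(FALSE, TRUE)])

keys[vals < 3]
# [1] "ha:llo" "wor:ld"
```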
Or, if we are using strsplit/tstrsplit grouped by 'a' and 'c', we remove the curly brackets and quotes ([{}"]) with gsub, split on , with strsplit, unlist the output, and then use tstrsplit to split on a : that is followed by a digit. The subsetting is the same as above.
res <- input_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b), ',', fixed=TRUE)), ":(?=\\d)", perl=TRUE) ,.(a,c)][V2<3]
setnames(res, 3:4, c("b", "s"))
res
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Update
For the updated dataset, we can do the tstrsplit on the last delimiter (:)
res1 <- input2_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE) ,
by = .(a, c)][V2 < 3]
setnames(res1, 3:4, c("b", "s"))
res1
# a c b s
# 1: AA 1 99:1d:3u:7y:89:67 1
# 2: AA 1 99:1D:34:YY:T6:Y6 2
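The pattern ":(?=[^:]+$)" matches a colon only when no other colon follows it before the end of the string, so each piece is split exactly once, at the last colon. The base R sketch below shows this on two made-up strings (assumptions for illustration). It also shows a pitfall worth checking in your own data: the split pieces are character, so comparing them with a number compares text, which may misbehave once values reach two digits.

```r
x <- c("99:1d:3u:7y:89:67:1", "ll:5Y:UY:56:R5:R6:10")

# Split on the last ":" only: the lookahead requires no further ":" up to the end
parts <- strsplit(x, ":(?=[^:]+$)", perl = TRUE)
keys <- vapply(parts, `[`, character(1), 1)
vals_chr <- vapply(parts, `[`, character(1), 2)

# Character values are compared as text, so "10" sorts before "3"
vals_chr < 3
# [1] TRUE TRUE

# Converting first gives the numeric comparison
as.numeric(vals_chr) < 3
# [1]  TRUE FALSE
```

In the data.table version, wrapping the value column in as.numeric() before the subset should have the same effect.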

Since it seems like you are working with a JSON object, why not use something that parses the JSON, for example, the "jsonlite" package?
With that, you can make a simple function, that looks like this:
myFun <- function(invec) {
require(jsonlite)
x <- fromJSON(invec)
list(b = names(x), s = unlist(x))
}
Now, applied to your dataset, you would get:
input_tbl[, myFun(b), by = .(a, c)]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
# 3: AA 1 doog:bye 3
And, for the subsetting:
input_tbl[, myFun(b), by = .(a, c)][s <= 2]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
You could probably also rewrite the myFun function to add a "threshold" argument that lets you subset within the function itself.
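A rough sketch of that idea follows. The function name `myFunThreshold` and its `threshold` argument are hypothetical, not part of jsonlite or data.table, and the sketch assumes each group holds exactly one JSON string whose values are all numeric:

```r
library(data.table)
library(jsonlite)

# Hypothetical variant of myFun: keeps only entries whose value is <= threshold
myFunThreshold <- function(invec, threshold = Inf) {
  x <- unlist(fromJSON(invec))   # named vector: names are the keys
  keep <- x <= threshold
  list(b = names(x)[keep], s = unname(x)[keep])
}

input_tbl <- data.table(
  a = "AA",
  b = "{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}",
  c = 1
)

res <- input_tbl[, myFunThreshold(b, threshold = 2), by = .(a, c)]
res
```

Filtering inside the function avoids building the rows that would be dropped anyway, at the cost of a less general helper.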

Related

Non-zero Values of Data.Table Column

I would like to extract the non-zero values of a specific column a of my data.table. Here is an example
set.seed(42)
DT <- data.table(
id = c("b","b","b","a","a","c"),
a = sample(c(0,1), 6, replace=TRUE),
b = 7:12,
c = 13:18
)
col <- "a"
If DT is a data.frame, I can do
x <- DT[,col] # I can do DT[,..col] to translate this line
x[x>0] # here is where I am stuck
Since DT is a data.table, this code fails. The error message is: "i is invalid type (matrix)".
I tried as.vector(x) but without success.
Any hint appreciated. This seems to be a beginner question. However, searching SO and the introduction vignette for data.table did not turn up a solution.
We can either use .SDcols to specify the column
DT[DT[, .SD[[1]] > 0, .SDcols = col]]
or with get
DT[DT[ ,get(col) > 0]]
DT[get(col) > 0][[col]]
#[1] 1 1
Or another option is [[
DT[DT[[col]] > 0]
# id a b c
#1: a 1 11 17
#2: c 1 12 18
Or to get only the column
DT[DT[[col]] >0][[col]]
#[1] 1 1
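The original attempt fails because `DT[, ..col]` still returns a data.table (with one column), so `x > 0` is a logical matrix rather than a vector. The sketch below contrasts that with `[[`, which returns a plain vector. It uses fixed example values instead of the seeded sample from the question:

```r
library(data.table)

DT <- data.table(id = c("b", "a", "c"), a = c(0, 1, 1), b = 7:9)
col <- "a"

x_dt  <- DT[, ..col]   # still a data.table with one column
x_vec <- DT[[col]]     # plain numeric vector

is.data.table(x_dt)
# [1] TRUE

x_vec[x_vec > 0]
# [1] 1 1
```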
You can also use filter from dplyr:
library(dplyr)
DT %>% filter(a > 0) # replace a with the name of your column

Filter data.table based on string match from another vector

I'm trying to select rows in a data.table. I need the values in variable dt$s to start with any of the strings in vector y
dt <- data.table(x = (c(1:5)), s = c("a", "ab", "b.c", "db", "d"))
y <- c("a", "b")
Desired result:
x s
1: 1 a
2: 2 ab
3: 3 b.c
I would use dt[s %in% y] for a full match, and %like% or "^a*" for a partial match with a single string, but I'm not sure how to get a strict starts with match on a character vector.
My real dataset and character vector is quite large, so I'd appreciate an efficient solution.
Thanks.
You can create a pattern dynamically from y.
library(data.table)
pat <- sprintf('^(%s)', paste0(y, collapse = '|'))
pat
#[1] "^(a|b)"
and use it to subset the data.
dt[grepl(pat, s)]
# x s
#1: 1 a
#2: 2 ab
#3: 3 b.c
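One caveat with a pasted pattern is that the entries of y are interpreted as regular expressions, so a prefix containing a metacharacter such as . would need escaping. A possible alternative, sketched here in base R on plain vectors, is startsWith, which matches literally. Whether it is faster on your data is something to measure rather than assume:

```r
s <- c("a", "ab", "b.c", "db", "d")
y <- c("a", "b")

# One logical vector per prefix, combined element-wise with OR
hits <- Reduce(`|`, lapply(y, function(p) startsWith(s, p)))

s[hits]
# [1] "a"   "ab"  "b.c"
```

With the data.table from the question, the same logical vector could be computed from dt$s and used as dt[hits].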
Using glue and filter
library(glue)
library(dplyr)
library(stringr)
dt %>%
filter(str_detect(s, glue("^({str_c(y, collapse = '|')})")))
# x s
#1: 1 a
#2: 2 ab
#3: 3 b.c

What is the function of `with` parameter when selecting dataframe subset

I encountered this code in a Kaggle notebook:
corrplot.mixed(corr = cor(videos[,c("category_id","views","likes",
"dislikes","comment_count"),with=F]))
videos is a data.frame
"category_id","views","likes","dislikes","comment_count" are columns in the videos data.frame
I would like to understand what the with parameter does when selecting a subset of a data frame.
As mentioned by @user20650, it might be a data.table, although in this case your code should work even without with = F.
Consider this example :
library(data.table)
dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
To subset columns a and b using a character vector, you could do
dt[, c('a', 'b'), with = F]
# a b
#1: 1 5
#2: 2 4
#3: 3 3
#4: 4 2
#5: 5 1
However, as mentioned this would work the same without with = F.
dt[, c('a', 'b')]
with = F is helpful when you have a vector of column names stored in a variable.
cols <- c('a', 'b')
dt[, cols] ##Error
dt[, cols, with = F] ##Works
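data.table also offers the .. prefix for the same purpose: it tells j to look the name up as a variable in the calling scope rather than as a column. A minimal sketch comparing the two forms on the example table:

```r
library(data.table)

dt <- data.table(a = 1:5, b = 5:1, c = 1:5)
cols <- c("a", "b")

res_with <- dt[, cols, with = FALSE]  # cols is treated as a vector of column names
res_dots <- dt[, ..cols]              # .. looks cols up outside the table

all.equal(res_with, res_dots)
# [1] TRUE
```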

Specific Ordering in R

I am generating all possible combinations for a specific input, but the output also has to be ordered according to the order of the input. Since the combinations have different lengths, I'm struggling to apply the answers previously posted.
I would like to know if this is possible.
Input:
D N A 3
This means I need to output all combinations as strings of up to 3 characters:
D
DD
DDD
DDN
DDA
DND
DNA
.
.
Which is basically ascending order if we consider D<N<A
So far my output looks like this:
A
AA
AAA
AAD
AAN
AD
ADA
ADD
ADN
AN
.
.
I have tried converting the input to a factor c("D","N","A") and sorting my output, but then every string longer than 1 character disappears.
Here's one possible solution:
generateCombs <- function(x, n){
if (n == 1) return(data.frame(collapsedPermutations = x)) # Base case: every single element, in input order
# Create a grid of all index tuples over 0:length(x). 0 == "", and 1:length(x) correspond to elements of x
permutations = expand.grid(replicate(n, 0:length(x), simplify = F))
# Order permutations
orderedPermutations = permutations[do.call(order, as.list(permutations)),]
# Map permutations now such that 0 == "", and 1:length(x) correspond to elements of x
mappedPermutations = sapply(orderedPermutations, function(y) c("", x)[y + 1])
# Collapse each row into a single string
collapsedPermutations = apply(mappedPermutations, 1, function(x) paste0(x, collapse = ""))
# Due to the 0's, there will be duplicates. We remove the duplicates in reverse order
collapsedPermutations = rev(unique(rev(collapsedPermutations)))[-1] # -1 removes blank
# Return as data frame
return (as.data.frame(collapsedPermutations))
}
x = c("D", "N", "A")
n = 3
generateCombs(x, n)
The output is:
collapsedPermutations
1 D
2 DD
3 DDD
4 DDN
5 DDA
6 DN
7 DND
8 DNN
9 DNA
10 DA
11 DAD
...
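A different way to get the same ordering, sketched here as an alternative rather than as a replacement for the function above, is to build every string of length 1 to n and sort on a translated key in which each symbol is replaced by a letter whose alphabetical rank equals its position in the input. The function name `ordered_strings` is hypothetical, and the sketch assumes single-character symbols and at most 26 of them:

```r
ordered_strings <- function(x, n) {
  # All strings of length 1..n over the symbols in x
  combos <- unlist(lapply(seq_len(n), function(k) {
    grid <- expand.grid(replicate(k, x, simplify = FALSE),
                        stringsAsFactors = FALSE)
    do.call(paste0, grid)
  }))
  # Translate symbols so that plain alphabetical order equals input order
  key <- chartr(paste(x, collapse = ""),
                paste(letters[seq_along(x)], collapse = ""),
                combos)
  # method = "radix" sorts in the C locale, independent of the session locale
  combos[order(key, method = "radix")]
}

res <- ordered_strings(c("D", "N", "A"), 3)
head(res, 5)
# [1] "D"   "DD"  "DDD" "DDN" "DDA"
```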
A solution using a library I just found called iterpc (so I might be using it wrong).
Generate all the combinations, factor the elements, sort, then hack into a string.
ordered_combn = function(elems) {
require(data.table)
require(iterpc)
I = lapply(seq_along(elems), function(i) iterpc::iterpc(table(elems), i, replace=TRUE, ordered=TRUE))
I = lapply(I, iterpc::getall)
I = lapply(I, as.data.table)
dt = rbindlist(I, fill = TRUE)
dt[is.na(dt)] = ""
cols = paste0("V", 1:length(elems))
dt[, (cols) := lapply(.SD, factor, levels = c("", elems)), .SDcols = cols]
setkey(dt)
dt[, ID := 1:.N]
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]
dt[, ord := paste0(.SD, collapse = ""), ID, .SDcols = cols]
# return dt[, ord] as an ordered factor for neatness
dt
}
elems = c("D", "N", "A")
combs = ordered_combn(elems)
combs
Output
V1 V2 V3 ID ord
1: D 1 D
2: D D 2 DD
3: D D D 3 DDD
4: D D N 4 DDN
5: D D A 5 DDA
6: D N 6 DN
7: D N D 7 DND
8: D N N 8 DNN
...

data.table add list as column when only one row

I have a function that operates on words using data.table and assigns a list of vectors as a column. This works well unless the data.table has only one row. I demonstrate the problem below. How can I make data.table assign a list of one vector as a column in the same way it adds a list of 2 vectors as a column?
MWE
dat2 <- dat <- data.frame(
x = 1:2,
y = c('dog', 'cats'),
stringsAsFactors = FALSE
)
library(data.table)
setDT(dat) # 2 row data.table
(dat2 <- dat2[1, ]) # single row data.frame
setDT(dat2)
letterfy <- function(x) strsplit(x, "")
## works as expected when >= 2 rows
dat[, letters := letterfy(y)]
dat
## x y letters
## 1: 1 dog d,o,g
## 2: 2 cats c,a,t,s
## Try on 1 row
dat2[, letters := letterfy(y)]
#Warning message:
#In `[.data.table`(dat2, , `:=`(letters, letterfy(y))) :
# Supplied 3 items to be assigned to 1 items of column 'letters' (2 unused)
# x y letters
#1: 1 dog d
Desired Output for dat2
## x y letters
## 1: 1 dog d,o,g
Simply wrap the output in list:
> dat2[, letters := list(letterfy(y))][ ]
x y letters
1: 1 dog d,o,g
Note that dat[ , class(letters)] is list; since typically lists are passed on the RHS of := for multiple assignments, it seems data.table was a bit confused. I imagine the developers have a reason for unlisting in the assignment here... but this approach also works when there is more than one row, i.e., dat[ , letters := list(letterfy(y))] also works as expected.
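To make the behaviour concrete, here is a small sketch on a fresh one-row table (the name `one_row` is made up for illustration). strsplit returns a list of length 1 here, and the extra list() signals that this inner list is the value of a single column:

```r
library(data.table)

letterfy <- function(x) strsplit(x, "")
one_row <- data.table(x = 1L, y = "dog")

length(letterfy(one_row$y))
# [1] 1

# Wrapping in list() marks the inner list as the value of one column
one_row[, letters := list(letterfy(y))]

one_row$letters[[1]]
# [1] "d" "o" "g"
```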
Another option is to assign the letters column as a character vector by changing letterfy:
letterfy2 <- function(x) vapply(strsplit(x, ""), paste0, character(1), collapse = ",")
> dat[ , letters := letterfy2(y)][ ]
x y letters
1: 1 dog d,o,g
2: 2 cats c,a,t,s
dat2[, letters := letterfy2(y)][ ]
x y letters
1: 1 dog d,o,g
