Specific Ordering in R - r

I am making all possible combinations for a specific input, but it has to be ordered according to the order of the input aswell. Since the combinations are different sized, I'm struggling with the answers previously posted.
I would like to know if this is possible.
Input:
D N A 3
This means I need to output it in all combinations up to 3 character strings:
D
DD
DDD
DDN
DDA
DND
DNA
.
.
Which is basically ascending order if we consider D<N<A
So far my output looks like this:
A
AA
AAA
AAD
AAN
AD
ADA
ADD
ADN
AN
.
.
I have tried converting the input as factor c("D","N","A") and sort my output, but then it disappears any string bigger than 1 character.

Here's one possible solution:
generateCombs <- function(x, n){
if (n == 1) return(x[1]) # Base case
# Create a grid with all possible permutations of 0:n. 0 == "", and 1:n correspond to elements of x
permutations = expand.grid(replicate(n, 0:n, simplify = F))
# Order permutations
orderedPermutations = permutations[do.call(order, as.list(permutations)),]
# Map permutations now such that 0 == "", and 1:n correspond to elements of x
mappedPermutations = sapply(orderedPermutations, function(y) c("", x)[y + 1])
# Collapse each row into a single string
collapsedPermutations = apply(mappedPermutations, 1, function(x) paste0(x, collapse = ""))
# Due to the 0's, there will be duplicates. We remove the duplicates in reverse order
collapsedPermutations = rev(unique(rev(collapsedPermutations)))[-1] # -1 removes blank
# Return as data frame
return (as.data.frame(collapsedPermutations))
}
x = c("D", "N", "A")
n = 3
generateCombs(x, n)
The output is:
collapsedPermutations
1 D
2 DD
3 DDD
4 DDN
5 DDA
6 DN
7 DND
8 DNN
9 DNA
10 DA
11 DAD
...

A solution using a random library I just found (so I might be using it wrong) called iterpc.
Generate all the combinations, factor the elements, sort, then hack into a string.
ordered_combn = function(elems) {
require(data.table)
require(iterpc)
I = lapply(seq_along(elems), function(i) iterpc::iterpc(table(elems), i, replace=TRUE, ordered=TRUE))
I = lapply(I, iterpc::getall)
I = lapply(I, as.data.table)
dt = rbindlist(I, fill = TRUE)
dt[is.na(dt)] = ""
cols = paste0("V", 1:length(elems))
dt[, (cols) := lapply(.SD, factor, levels = c("", elems)), .SDcols = cols]
setkey(dt)
dt[, ID := 1:.N]
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]
dt[, ord := paste0(.SD, collapse = ""), ID, .SDcols = cols]
# return dt[, ord] as an ordered factor for neatness
dt
}
elems = c("D", "N", "A")
combs = ordered_combn(elems)
combs
Output
V1 V2 V3 ID ord
1: D 1 D
2: D D 2 DD
3: D D D 3 DDD
4: D D N 4 DDN
5: D D A 5 DDA
6: D N 6 DN
7: D N D 7 DND
8: D N N 8 DNN
...

Related

Transpose whole dataframe into one row dataframe- (or transposing each row of data.table and column binding)

I have tried to transform my_dataset with the help of library reshape & data.table in order to achieve the result.dataset but haven't been successful as yet.
I have a data table my_dataset that looks like this :-
A X Count
id1 b 1
id1 c 2
And I want to have the result.dataset that should look like this :-
A X1 Count1 X2 Count2
id1 b 1 c 2
It would be great if anyone could help me to get the result.dataset as above, preferably by using reshape or data.table (or both lib).
Here's a solution that is using only reshape2 (trying to stick to the suggested packages). It starts by adding a column rep, that allows one to call dcast.
require(reshape2)
#adding rep
my_dataset$rep = unlist(tapply(my_dataset$A, my_dataset$A, function(x)1:length(x)))
#cast at work
C1 = dcast(my_dataset, A ~ paste('X',rep, sep=''), value.var='X')
C2 = dcast(my_dataset, A ~ paste('Count',rep, sep=''), value.var='Count')
result.dataset = cbind(C1, C2[,-1])
The columns will not be in the same order as your example though.
Try this:
dt <- read.table(text = 'A X Count
id1 b 1
id1 c 2',header=T)
a <- aggregate(.~A, dt, paste, collapse=",")
library(splitstackshape)
result <- concat.split.multiple(data = a, split.cols = c("X","Count"), seps = ",")
output:
> result
A X_1 X_2 Count_1 Count_2
1: id1 b c 1 2
We can aggregate the rows and use cSplit to split them.
library(data.table)
library(splitstackshape)
dat2 <- setDT(dat)[, lapply(.SD, paste, collapse = ","), by = A]
cols <- c(names(dat[, 1]), paste(names(dat[, -1]),
rep(1:nrow(dat), each = nrow(dat),
sep = "_"))
cSplit(dat2, splitCols = names(dat[, -1]))[, cols, with = FALSE]
# A X_1 Count_1 X_2 Count_2
# 1: id1 b 1 c 2
DATA
dat <- read.table(text = "A X Count
id1 b 1
id1 c 2",
header = TRUE, stringsAsFactors = FALSE)

split rows by condition in R data.table

I have a data table containing 3 columns, one of them
contains a key:value list of different lengths.
I wish to rearrange the table such that each row will have only one key, conditioned on the value
for example, suppose that I wish to get all rows for whom the value is <= 2 so that each key is on its own row:\
input_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"ha:llo\":1,\"wor:ld\":2,\"doog:bye\":3}"),
c=c(1))
the wanted table then should be
tbl_output <- data.table::data.table(a=c("AA",
"AA"),b=c("ha:llo","wor:ld"), c=c(1,1), s=c(1,2))
I had tried the following function:
data_table_clean <- function(dt){
dt[ ,"b" := data.table::tstrsplit(b, ',', fixed = T),by=c(a, c)]
dt[,c('b', 's'):= data.table::tstrsplit(b, ':', fixed=TRUE)]
return(dt[s <=2,])
}
this produces the following error
"Error in eval(expr, envir, enclos) : object 'a' not found"
Any suggestions are welcome, off course.
The keys are actually of the form :
input2_tbl <-
data.table::data.table(a=c("AA"),b=c("{\"99:1d:3u:7y:89:67\":1,\"99:1D:34:YY:T6:Y6\":2,\"ll:5Y:UY:56:R5:R6\":3}"),
c=c(1))
and accordingly the output table should be:
tbl2_output <- data.table::data.table(a=c("AA",
"AA"),b=c(""99:1d:3u:7y:89:67","99:1D:34:YY:T6:Y6"),
c=c(1,1), s=c(1,2))
Thank you!
update
data_table_clean <- function(dt){
res <- dt[, data.table::tstrsplit(unlist(strsplit(gsub('[{}"]', '', b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE),
by = .(a, c)][V2 > -100]
data.table::setnames(res, 3:4, c("b", "s"))
res
}
when running this I get the following error:
Error in .subset(x, j) : invalid subscript type 'list'
One option would be to extract the characters that we need in the final output. We use str_extract to do that after grouping by 'a', 'c'. The output is a list, which we unlist, get the non-numeric and numeric into two columns and then subset the rows with the condition s<3.
library(stringr)
library(data.table)
input_tbl[, {
tmp <- unlist(str_extract_all(b, "[A-Za-z]+:[A-Za-z]+|\\d+"))
list(b=tmp[c(TRUE, FALSE)], s=tmp[c(FALSE, TRUE)])
}, by = .(a,c)][s<3]
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Or if we are using strsplit/tstrsplit, grouped by 'a', 'c', we remove the curly brackets and quotes ([{}]") with gsub, split by , (strsplit), unlist the output, and then use tstrsplit to split by : that is followed by a number. The subset part is similar as above.
res <- input_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b), ',', fixed=TRUE)), ":(?=\\d)", perl=TRUE) ,.(a,c)][V2<3]
setnames(res, 3:4, c("b", "s"))
res
# a c b s
#1: AA 1 ha:llo 1
#2: AA 1 wor:ld 2
Update
For the updated dataset, we can do the tstrsplit on the last delimiter (:)
res1 <- input2_tbl[, tstrsplit(unlist(strsplit(gsub('[{}"]', '',
b),',', fixed=TRUE)), ":(?=[^:]+$)", perl=TRUE) ,
by = .(a, c)][V2 < 3]
setnames(res1, 3:4, c("b", "s"))
res1
# a c b s
# 1: AA 1 99:1d:3u:7y:89:67 1
# 2: AA 1 99:1D:34:YY:T6:Y6 2
Since it seems like you are working with a JSON object, why not use something that parses the JSON, for example, the "jsonlite" package?
With that, you can make a simple function, that looks like this:
myFun <- function(invec) {
require(jsonlite)
x <- fromJSON(invec)
list(b = names(x), s = unlist(x))
}
Now, applied to your dataset, you would get:
input_tbl[, myFun(b), by = .(a, c)]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
# 3: AA 1 doog:bye 3
And, for the subsetting:
input_tbl[, myFun(b), by = .(a, c)][s <= 2]
# a c b s
# 1: AA 1 ha:llo 1
# 2: AA 1 wor:ld 2
You can probably also even rewrite the myFun function to add a "threshold" argument that lets you subset within the function itself.

Expand an list in an R dataframe into additional rows of the dataframe?

In a separate question earlier today I asked how to flatten nested lists into a row in a dataframe. I wish to further understand how to manipulate my lists within a dataframe, this time by expanding the list vertically within the dataframe, adding new rows to accommodate the the data.
In this case I wish to go from this structure:
CAT COUNT TREAT
A 1,2,3 Treat-a, Treat-b
B 4,5 Treat-c,Treat-d,Treat-e
To this structure:
CAT COUNT TREAT
A 1 Treat-a
A 2 Treat-b
A 3 NA
B 4 Treat-c
B 5 Treat-d
B NA Treat-e
Code to generate the test (source) data:
df<-data.frame(CAT=c("A","B"))
df$COUNT <-list(1:3,4:5) # as numbers
df$TREAT <-list(paste("Treat-", letters[1:2],sep=""),paste("Treat-", letters[3:5],sep=""))
I tried using CBIND in an approach similar to the answer supplied to my earlier question but it failed due to the different number of values between the multiple lists. Thank you for your patience as I attempt to grasp these basic manipulation tasks.
Here's what I came up with, using a helper function cbind.fill: (cbind a df with an empty df (cbind.fill?))
cbind.fill <- function(...){
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
#Split df by CAT
df.split <- split(df, df$CAT)
#Apply cbind.fill to make a matrix filled with NA where needed
rawlist <- lapply(df.split, function(x) cbind(as.character(x$CAT), cbind.fill(unlist(x$COUNT), unlist(x$TREAT) ) ))
#Bind rows and convert matrix to data.frame
df.new <- as.data.frame(do.call(rbind, rawlist))
#Column names
colnames(df.new) <- names(df)
df.new
CAT COUNT TREAT
1 A 1 Treat-a
2 A 2 Treat-b
3 A 3 <NA>
4 B 4 Treat-c
5 B 5 Treat-d
6 B <NA> Treat-e
Extending my answer from your previous question, you can create another function, let's call this one flattenLong that combines flatten and melt from "data.table" to get your desired output.
The function looks like this:
flattenLong <- function(indt, cols) {
ob <- setdiff(names(indt), cols)
x <- flatten(indt, cols, TRUE)
mv <- lapply(cols, function(y) grep(sprintf("^%s_", y), names(x)))
setorderv(melt(x, measure.vars = mv, value.name = cols), ob)[]
}
Usage is simply:
flattenLong(df, c("COUNT", "TREAT"))
## CAT variable COUNT TREAT
## 1: A 1 1 Treat-a
## 2: A 2 2 Treat-b
## 3: A 3 3 NA
## 4: B 1 4 Treat-c
## 5: B 2 5 Treat-d
## 6: B 3 NA Treat-e
For convenience, here's the flatten function again:
flatten <- function(indt, cols, drop = FALSE) {
require(data.table)
if (!is.data.table(indt)) indt <- as.data.table(indt)
x <- unlist(indt[, lapply(.SD, function(x) max(lengths(x))), .SDcols = cols])
nams <- paste(rep(cols, x), sequence(x), sep = "_")
indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE), .SDcols = cols]
if (isTRUE(drop)) {
indt[, (nams) := unlist(lapply(.SD, transpose), recursive = FALSE),
.SDcols = cols][, (cols) := NULL]
}
indt[]
}

R extract matching values from list of data frames

I have a relatively large amount of data stored in a list of data frames with several columns.
For each element of the list I wish to check one column against a reference and if present extract the value held in another column of the same element and place in a new summary matrix.
e.g. with the following example code:
add1 = c("N1","N1","N1")
coords1 = c(1,2,3)
vals1 = c("a","b","c")
extra1 = c("x","y","x")
add2 = c("N2","N2","N2","N2")
coords2 = c(2,3,4,5)
vals2 = c("b","c","d","e")
extra2 = c("z","y","x","x")
add3 = c("N3","N3","N3")
coords3 = c(1,3,5)
vals3 = c("a","c","e")
extra3 = c("z","z","x")
df1 <- data.frame(add1, coords1, vals1, extra1)
df2 <- data.frame(add2, coords2, vals2, extra2)
df3 <- data.frame(add3, coords3, vals3, extra3)
list_all <- list(df1, df2, df3)
coordinate.extract <- unique(unlist(lapply(list_all, "[", 1)))
my_matrix <- matrix(0, ncol = length(list_all)
, nrow = (length(coordinate.extract)))
my_matrix_new <- cbind(as.character(coordinate.extract)
, my_matrix)
I would like to end up with:
my_matrix_new = V1 V2 V3 V4
1 a a
2 b b
3 c c c
4 d
5 e e
i.e. the 3rd column of each list element is chosen based on the value of the second column.
I hope this is clear.
Thanks,
Matt
I would use data.frame as there are mixed classes. You may try merge with Reduce to get the expected output. Select the 2nd and 3rd columns,in each list element, change the column name for the 2nd to be same across all the list elements, merge, and if needed replace the NA elements with ''
lst1 <- lapply(list_all, function(x) {names(x)[2] <- 'V1';x[2:3] })
res <- Reduce(function(...) merge(..., by='V1', all=TRUE), lst1)
res[-1] <- lapply(res[-1], as.character)
res[is.na(res)] <- ''
res
# V1 vals1 vals2 vals3
#1 1 a a
#2 2 b b
#3 3 c c c
#4 4 d
#5 5 e e
We can change the column names
names(res) <- paste0('V', seq_along(res))

Aggregating data frame rows using an input vector

I have this toy data.frame:
df = data.frame(id = c("a","b","c","d"), value = c(2,3,6,5))
and I'd like to aggregate its rows according to this toy vector:
collapsed.ids = c("a,b","c","d")
where the aggregated data.frame should keep max(df$value) of its aggregated rows.
So for this toy example the output would be:
> aggregated.df
id value
1 a,b 3
2 c 6
3 d 5
I should note that my real data.frame is ~150,000 rows
I would use data.table for this.
Something like the following should work:
library(data.table)
DT <- data.table(df, key = "id") # Main data.table
Key <- data.table(ind = collapsed.ids) # your "Key" table
## We need your "Key" table in a long form
Key <- Key[, list(id = unlist(strsplit(ind, ",", fixed = TRUE))), by = ind]
setkey(Key, id) # Set the key to facilitate a merge
## Merge and aggregate in one step
DT[Key][, list(value = max(value)), by = ind]
# ind value
# 1: a,b 3
# 2: c 6
# 3: d 5
You don't need data.table, you can just use base R.
split.ids <- strsplit(collapsed.ids, ",")
split.df <- data.frame(id = tmp <- unlist(split.ids),
joinid = rep(collapsed.ids, sapply(split.ids, length)))
aggregated.df <- aggregate(value ~ id, data = merge(df, split.df), max)
Result:
# id value
# 1 a,b 3
# 2 c 6
# 3 d 5
Benchmark
df <- df[rep(1:4, 50000), ] # Make a big data.frame
system.time(...) # of the above code
# user system elapsed
# 1.700 0.154 1.947
EDIT: Apparently Ananda's code runs in 0.039, so I'm eating crow. But either are acceptable for this size.

Resources