Find frequencies of combinations where the data.frame needs to be parsed - r

I'm sure there's a simple solution to this, but I can't figure it out!! Suppose I have a dataframe that has the following information:
aaa<-c("A,B","B,C","B,D,E")
vvv<-c("101","101,102","102,103,104")
data_h<-data.frame(aaa,vvv)
data_h
aaa vvv
1 A,B 101
2 B,C 101,102
3 B,D,E 102,103,104
Desired output is a frequency map of individual hits, for subsequent analysis in a heat map. So:
101 102 103 104
A 1 0 0 0
B 2 2 1 1
C 1 1 0 0
D 0 1 1 1
E 0 1 1 1
How do I make this transformation? I've seen many similar examples, but none where the contents of the data frame need to be parsed.
The goal is to ultimately use heatmap or something similar on the output table to visualize the correlation between "aaa" and "vvv".
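Since the stated goal is a heat map, here is a minimal sketch of that final step (assuming tab holds the frequency table produced by one of the answers below):
# tab <- xtabs(data = dd)   # or any of the equivalent tables from the answers
heatmap(as.matrix(tab), Rowv = NA, Colv = NA, scale = "none")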

Here is a base R solution in 4 lines of code. First we define a function, spl, which splits a comma-separated string into a vector of its fields. eg takes two string arguments, applies spl to each of them, and then creates a grid of all combinations of the results. Finally we apply eg to each row of data_h, rbind the results together and tabulate them with xtabs:
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
dd <- do.call("rbind", Map(eg, data_h$aaa, data_h$vvv))
xtabs(data = dd)
The result is:
   vvv
aaa 101 102 103 104
  A   1   0   0   0
  B   2   2   1   1
  C   1   1   0   0
  D   0   1   1   1
  E   0   1   1   1
dcast. Alternatively, replace the last line of code above (the one with xtabs) with:
library(reshape2)
dcast(dd, aaa ~ vvv, fun = length, value.var = "vvv")
in which case the result is:
aaa 101 102 103 104
1 A 1 0 0 0
2 B 2 2 1 1
3 C 1 1 0 0
4 D 0 1 1 1
5 E 0 1 1 1
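If you plan to feed this to heatmap, note that dcast returns a data.frame with aaa as an ordinary column rather than as row names, so a small follow-up step is needed (a sketch; dd2 is a hypothetical name for the dcast result):
dd2 <- dcast(dd, aaa ~ vvv, fun = length, value.var = "vvv")
m <- as.matrix(dd2[-1])
rownames(m) <- dd2$aaa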
tapply. Another alternative would be tapply (however, it will fill in empty cells with NA rather than 0):
tapply(1:nrow(dd), dd, length)
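If the NA fill matters for downstream code, it is easy to replace afterwards (a small follow-up sketch, reusing dd from above):
tab <- tapply(1:nrow(dd), dd, length)
tab[is.na(tab)] <- 0   # turn the empty cells back into zero counts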
ADDED: Alternatives and improvements.

The shape of the data.frame suggests using the splitstackshape package. I don't know this package very well, so I just use it to reshape the data and then compute the frequencies by hand using table:
library(splitstackshape)
data_h_split <- concat.split.multiple(data_h,1:2)
#   aaa_1 aaa_2 aaa_3 vvv_1 vvv_2 vvv_3
# 1     A     B  <NA>   101    NA    NA
# 2     B     C  <NA>   101   102    NA
# 3     B     D     E   102   103   104
Once you have the data in this format (no commas, regular columns), it is easy to compute frequencies using table (you could also use tapply or reshape):
table(cbind.data.frame(ff = unlist(data_h_split[1:3]),
                       xx = unlist(data_h_split[4:6])))
    xx
ff    101 102 103 104
  A     1   0   0   0
  B     1   1   0   0
  C     0   1   0   0
  D     0   0   1   0
        0   0   0   0
  E     0   0   0   1
Note that the unlabelled row comes from the empty level created by the blank cell, and that this pairs the split fields positionally (aaa_1 with vvv_1, and so on) rather than crossing every aaa with every vvv within a row, so the counts differ from the desired output. Ananda's edit below fixes that.
Ananda's edit
Here's a multi-step approach to get the result using "splitstackshape" to work for this.
library(splitstackshape)
## Split the "vvv" column first, and reshape at the same time
x <- concat.split.multiple(data_h, split.cols="vvv", ",", "long")
## Add an ID column
x$id <- 1:nrow(x)
## Split the "aaa" column next, again reshaping as we do so
x <- concat.split.multiple(x[complete.cases(x), ], split.cols="aaa", ",", "long")
## Use `table` with `droplevels`
with(droplevels(x), table(aaa, vvv))
#    vvv
# aaa 101 102 103 104
#   A   1   0   0   0
#   B   2   2   1   1
#   C   1   1   0   0
#   D   0   1   1   1
#   E   0   1   1   1

My concat.split.multiple function is in great need of a rewrite to improve its efficiency. I've done some work on that in my cSplit function, which might be useful if you have a particularly large dataset.
Here's how I would solve your given problem with cSplit:
table(
  cSplit(
    cSplit(data_h, splitCols = 2, sep = ",",
           direction = "long", makeEqual = FALSE),
    splitCols = 1, sep = ",", direction = "long",
    makeEqual = FALSE))
#    vvv
# aaa 101 102 103 104
#   A   1   0   0   0
#   B   2   2   1   1
#   C   1   1   0   0
#   D   0   1   1   1
#   E   0   1   1   1
It seems to be pretty efficient too...
First, the functions to test:
fun1 <- function() table(cSplit(cSplit(df, 2, ",", "long", FALSE), 1, ",", "long", FALSE))
fun2 <- function() {
  spl <- function(x) strsplit(as.character(x), ",")[[1]]
  eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
  dd <- do.call("rbind", Map(eg, df$A, df$V))
  xtabs(data = dd)
}
Second, some sample data. Change Nrow and regenerate to see the effect on different sized data.frames.
set.seed(1)
Nrow <- 100
aaa <- 100:200
vvv <- LETTERS
maxA <- 10
maxV <- 10
Aaa <- sample(maxA, Nrow, TRUE)
Vvv <- sample(maxV, Nrow, TRUE)
A <- vapply(seq_along(Aaa), function(x)
  paste(sample(aaa, Aaa[x], TRUE), collapse = ","), character(1L))
V <- vapply(seq_along(Vvv), function(x)
  paste(sample(vvv, Vvv[x], TRUE), collapse = ","), character(1L))
df <- data.frame(A, V)
head(df)
# A V
# 1 127,122,152 E,E,O,S,W,S,M
# 2 127,118,152,156 V,A,Z,Q
# 3 113,125,172,197,110,177 L,A,T
# 4 195,182,131,165,196,196,134,126,116,132 F,Z,X,S,T,M,W,E,Q,H
# 5 151,193,151 L,B,E,B,Y,I,N
# 6 126,104,142,186,135,113,137,163,139 Q,G,N
Compare the two approaches to make sure the results are the same:
X <- fun1()
Y <- fun2()
all(X == Y[dimnames(X)[[1]], dimnames(X)[[2]]])
# [1] TRUE
Benchmark (on 100 rows).
library(microbenchmark)
## Nrow = 100
microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#    expr       min        lq    median        uq      max neval
#  fun1()  7.263802  7.326237  7.440843  7.868905 10.26451    10
#  fun2() 62.869130 64.046836 68.525880 73.595061 80.02027    10
Benchmark (on 1000 rows).
## Nrow = 1000
microbenchmark(fun1(), fun2(), times = 10)
# Unit: milliseconds
#    expr      min        lq    median        uq       max neval
#  fun1()  19.2303  20.21857  23.14337  26.97776  35.56338    10
#  fun2() 775.6586 815.01639 835.98951 852.47804 888.15345    10

Related

Dummy/Binary Category Variable creation in Data Frame [duplicate]

I have an R data frame containing a factor that I want to "expand" so that for each factor level, there is an associated column in a new data frame, which contains a 1/0 indicator. E.g., suppose I have:
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
I want:
df.desired <- data.frame(foo = c(1,1,0,0), bar=c(0,0,1,1), ham=c(1,2,3,4))
Because certain analyses require a completely numeric data frame (e.g., principal component analysis), I thought this feature might be built in. Writing a function to do this shouldn't be too hard, but I can foresee some challenges relating to column names and if something exists already, I'd rather use that.
Use the model.matrix function:
model.matrix( ~ Species - 1, data=iris )
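For reference, the first rows of iris are all setosa, so the result begins:
head(model.matrix(~ Species - 1, data = iris), 3)
#   Speciessetosa Speciesversicolor Speciesvirginica
# 1             1                 0                0
# 2             1                 0                0
# 3             1                 0                0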
If your data frame is only made of factors (or you are working on a subset of variables which are all factors), you can also use the acm.disjonctif function from the ade4 package:
R> library(ade4)
R> df <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c("red","blue","green","red"))
R> acm.disjonctif(df)
eggs.bar eggs.foo ham.blue ham.green ham.red
1 0 1 0 0 1
2 0 1 1 0 0
3 1 0 0 1 0
4 1 0 0 0 1
Not exactly the case you are describing, but it can be useful too...
A quick way using the reshape2 package:
require(reshape2)
> dcast(df.original, ham ~ eggs, length)
Using ham as value column: use value_var to override.
ham bar foo
1 1 0 1
2 2 0 1
3 3 1 0
4 4 1 0
Note that this produces precisely the column names you want.
A dummy variable is probably what you want, and model.matrix is useful for creating it:
> with(df.original, data.frame(model.matrix(~eggs+0), ham))
eggsbar eggsfoo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4
A late entry: class.ind from the nnet package.
library(nnet)
with(df.original, data.frame(class.ind(eggs), ham))
bar foo ham
1 0 1 1
2 0 1 2
3 1 0 3
4 1 0 4
Just came across this old thread and thought I'd add a function that uses ade4 to take a dataframe consisting of factors and/or numeric data and return a dataframe with the factors as dummy codes.
dummy <- function(df) {
  # Helpers that pull out the numeric and factor columns
  NUM <- function(dataframe) dataframe[, sapply(dataframe, is.numeric)]
  FAC <- function(dataframe) dataframe[, sapply(dataframe, is.factor)]
  require(ade4)
  if (is.null(ncol(NUM(df)))) {
    # A single numeric column drops to a vector, so restore its name
    DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
    names(DF)[1] <- colnames(df)[which(sapply(df, is.numeric))]
  } else {
    DF <- data.frame(NUM(df), acm.disjonctif(FAC(df)))
  }
  return(DF)
}
Let's try it.
df <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                 ham = c("red","blue","green","red"), x = rnorm(4))
dummy(df)
df2 <- data.frame(eggs = c("foo", "foo", "bar", "bar"),
                  ham = c("red","blue","green","red"))
dummy(df2)
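One caveat: since R 4.0, data.frame() no longer converts strings to factors by default, so on a current R the eggs and ham columns above are character and FAC() selects nothing. Wrap them in factor() (or pass stringsAsFactors = TRUE) to get the intended behaviour:
df <- data.frame(eggs = factor(c("foo", "foo", "bar", "bar")),
                 ham = factor(c("red", "blue", "green", "red")),
                 x = rnorm(4))
dummy(df)   # now splits eggs and ham into dummy columns, keeping x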
Here is a clearer way to do it. I use model.matrix to create the dummy boolean variables and then merge them back into the original dataframe.
df.original <-data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
df.original
# eggs ham
# 1 foo 1
# 2 foo 2
# 3 bar 3
# 4 bar 4
# Create the dummy boolean variables using the model.matrix() function.
> mm <- model.matrix(~eggs-1, df.original)
> mm
# eggsbar eggsfoo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Remove the "eggs" prefix from the column names as the OP desired.
colnames(mm) <- gsub("eggs","",colnames(mm))
mm
# bar foo
# 1 0 1
# 2 0 1
# 3 1 0
# 4 1 0
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$eggs
# [1] "contr.treatment"
# Combine the matrix back with the original dataframe.
result <- cbind(df.original, mm)
result
# eggs ham bar foo
# 1 foo 1 0 1
# 2 foo 2 0 1
# 3 bar 3 1 0
# 4 bar 4 1 0
# At this point, you can select out the columns that you want.
I needed a function to 'explode' factors that is a bit more flexible, and made one based on the acm.disjonctif function from the ade4 package.
This allows you to choose the exploded values, which are 0 and 1 in acm.disjonctif. It only explodes factors that have 'few' levels. Numeric columns are preserved.
# Function to explode factors that are considered to be categorical,
# i.e., they do not have too many levels.
# - data: The data.frame in which categorical variables will be exploded.
# - values: The exploded values for the value being unequal and equal to a level.
# - max_factor_level_fraction: Maximum number of levels as a fraction of column length. Set to 1 to explode all factors.
# Inspired by the acm.disjonctif function in the ade4 package.
explode_factors <- function(data, values = c(-0.8, 0.8), max_factor_level_fraction = 0.2) {
  exploders <- colnames(data)[sapply(data, function(col){
    is.factor(col) && nlevels(col) <= max_factor_level_fraction * length(col)
  })]
  if (length(exploders) > 0) {
    exploded <- lapply(exploders, function(exp){
      col <- data[, exp]
      n <- length(col)
      dummies <- matrix(values[1], n, length(levels(col)))
      dummies[(1:n) + n * (unclass(col) - 1)] <- values[2]
      colnames(dummies) <- paste(exp, levels(col), sep = '_')
      dummies
    })
    # Only keep numeric data.
    data <- data[sapply(data, is.numeric)]
    # Add exploded values.
    data <- cbind(data, exploded)
  }
  return(data)
}
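A quick usage sketch: with only 4 rows, the default max_factor_level_fraction of 0.2 means even a 2-level factor is not "few" enough, so it is raised to 1 here, and eggs is made a factor explicitly (since R 4.0 strings are no longer auto-converted):
df <- data.frame(eggs = factor(c("foo", "foo", "bar", "bar")),
                 ham = c(1, 2, 3, 4))
explode_factors(df, max_factor_level_fraction = 1)
#   ham eggs_bar eggs_foo
# 1   1     -0.8      0.8
# 2   2     -0.8      0.8
# 3   3      0.8     -0.8
# 4   4      0.8     -0.8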
(The question is 10yo, but for the sake of completeness...)
The function i() from the fixest package does exactly that.
Beyond creating a design matrix from a factor-like variable, you can also very easily do two extra things on the fly:
binning values (with the argument bin),
excluding some factor values (with the argument ref).
And since it is made for this task, if your variable happens to be numeric you don't need to wrap it with factor(x_num) (as opposed to the model.matrix solution).
Here's an example:
library(fixest)
data(airquality)
table(airquality$Month)
#> 5 6 7 8 9
#> 31 30 31 31 30
head(i(airquality$Month))
#> 5 6 7 8 9
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 1 0 0 0 0
#> [4,] 1 0 0 0 0
#> [5,] 1 0 0 0 0
#> [6,] 1 0 0 0 0
#
# Binning (check out the help, there are many many ways to bin)
#
colSums(i(airquality$Month, bin = 5:6))
#> 5 7 8 9
#> 61 31 31 30
#
# References
#
head(i(airquality$Month, ref = c(6, 9)), 3)
#> 5 7 8
#> [1,] 1 0 0
#> [2,] 1 0 0
#> [3,] 1 0 0
And here's a little wrapper expanding all non-numeric variables (by default):
library(fixest)
# data: data.frame
# var: vector of variable names // if missing, all non numeric variables
# no argument checking
expand_factor = function(data, var){
  if(missing(var)){
    var = names(data)[!sapply(data, is.numeric)]
    if(length(var) == 0) return(data)
  }
  data_list = unclass(data)
  new = lapply(var, \(x) i(data_list[[x]]))
  data_list[names(data_list) %in% var] = new
  do.call("cbind", data_list)
}
my_data = data.frame(eggs = c("foo", "foo", "bar", "bar"), ham = c(1,2,3,4))
expand_factor(my_data)
#> bar foo ham
#> [1,] 0 1 1
#> [2,] 0 1 2
#> [3,] 1 0 3
#> [4,] 1 0 4
Finally, for those wondering, the timing is equivalent to the model.matrix solution.
library(microbenchmark)
my_data = data.frame(x = as.factor(sample(100, 1e6, TRUE)))
microbenchmark(mm = model.matrix(~x, my_data),
i = i(my_data$x), times = 5)
#> Unit: milliseconds
#>  expr      min       lq     mean   median       uq      max neval
#>    mm 155.1904 156.7751 209.2629 182.4964 197.9084 353.9443     5
#>     i 154.1697 154.7893 159.5202 155.4166 163.9706 169.2550     5
Within sapply, == over the unique values of eggs can be used to generate dummy vectors:
x <- with(df.original, data.frame(+sapply(unique(eggs), `==`, eggs), ham))
x
# foo bar ham
#1 1 0 1
#2 1 0 2
#3 0 1 3
#4 0 1 4
all.equal(x, df.desired)
#[1] TRUE
A maybe faster variant - Result best used as list or data.frame:
. <- unique(df.original$eggs)
with(df.original,
     data.frame(+do.call(cbind, lapply(setNames(., .), `==`, eggs)), ham))
Indexing in a matrix - Result best used as matrix:
. <- unique(df.original$eggs)
i <- match(df.original$eggs, .)
nc <- length(.)
nr <- length(i)
cbind(matrix(`[<-`(integer(nc * nr), 1:nr + nr * (i - 1), 1), nr, nc,
             dimnames=list(NULL, .)), df.original["ham"])
Using outer - Result best used as matrix:
. <- unique(df.original$eggs)
cbind(+outer(df.original$eggs, setNames(., .), `==`), df.original["ham"])
Using rep - Result best used as matrix:
. <- unique(df.original$eggs)
n <- nrow(df.original)
cbind(+matrix(df.original$eggs == rep(., each=n), n, dimnames=list(NULL, .)),
      df.original["ham"])

How to compare with values adjacent in a sequence in the same group

Let's say I have something like this:
set.seed(0)
the.df <- data.frame(x = rep(letters[1:3], each = 4),
                     n = rep(0:3, 3),
                     val = round(runif(12)))
the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
Within each x, starting from n==2 (going from small to large), I want to set val to 0 if the previous val (in terms of n) is 0; otherwise, leave it as is.
For example, in the subset x=="b", I first ignore the two rows where n < 2. Now, in Row 7, because the previous val is 0 (the.df$val[the.df$x=="b" & the.df$n==1]), I set val to 0 (the.df$val[the.df$x=="b" & the.df$n==2] <- 0). Then on Row 8, now that val for the previous n is 0 (we just set it), I also want to set val here to 0 (the.df$val[the.df$x=="b" & the.df$n==3] <- 0).
Imagine that the data.frame is not sorted. Therefore procedures that depend on the order would require a sort. I also can't assume that adjacent rows exist (e.g., the row the.df[the.df$x=="a" & the.df$n==1, ] might be missing).
The trickiest part seems to be evaluating val in sequence. I can do this using a loop but I imagine that it would be inefficient (I have millions of rows). Is there a way I can do this more efficiently?
EDIT: wanted output
the.df
x n val wanted
1 a 0 1 1
2 a 1 0 0
3 a 2 0 0
4 a 3 1 0
5 b 0 1 1
6 b 1 0 0
7 b 2 1 0
8 b 3 1 0
9 c 0 1 1
10 c 1 1 1
11 c 2 0 0
12 c 3 0 0
Also, I don't mind making new columns (e.g., putting the wanted values there).
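For reference, here is the naive loop the answers below are trying to beat (a sketch of the rule on sorted data, not something to run on millions of rows):
the.df <- the.df[order(the.df$x, the.df$n), ]
the.df$wanted <- the.df$val
for (i in seq_len(nrow(the.df))[-1]) {
  if (the.df$x[i] == the.df$x[i - 1] &&   # same group
      the.df$n[i] >= 2 &&                 # only from n == 2 onwards
      the.df$wanted[i - 1] == 0) {        # previous val (possibly already updated) is 0
    the.df$wanted[i] <- 0
  }
}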
Using data.table I would try the following
library(data.table)
setDT(the.df)[order(n),
              val := if(length(indx <- which(val[2:.N] == 0L)))
                c(val[1:(indx[1L] + 1L)], rep(0L, .N - (indx[1L] + 1L))),
              by = x]
the.df
# x n val
# 1: a 0 1
# 2: a 1 0
# 3: a 2 0
# 4: a 3 0
# 5: b 0 1
# 6: b 1 0
# 7: b 2 0
# 8: b 3 0
# 9: c 0 1
# 10: c 1 1
# 11: c 2 0
# 12: c 3 0
This will simultaneously order the data by n (as you said it isn't ordered in real life) and recreate val conditionally (meaning that if the condition is not satisfied, val is left untouched).
Hopefully this feature will be implemented in data.table in the near future, and then the code could potentially be
setDT(the.df)[order(n), val[n > 2] := if(val[2L] == 0) 0L, by = x]
which would be a great improvement both performance- and syntax-wise.
A base R approach might be
df <- the.df[order(the.df$x, the.df$n),]
df$val <- ave(df$val, df$x, FUN=fun)
As for fun, @DavidArenburg's answer in plain R and written a bit more poetically might be
fun0 <- function(v) {
  idx <- which.max(v[2:length(v)] == 0L) + 1L
  if (length(idx))
    v[idx:length(v)] <- 0L
  v
}
It seems like a good idea to formulate the solution as an independent function first, because then it is easy to test. fun0 fails for some edge cases, e.g.,
> fun0(0)
[1] 0 0 0
> fun0(1)
[1] 0 0 0
> fun0(c(1, 1))
[1] 1 0
A better version is
fun1 <- function(v) {
  tst <- tail(v, -1) == 0L
  if (any(tst)) {
    idx <- which.max(tst) + 1L
    v[idx:length(v)] <- 0L
  }
  v
}
And even better, following #Arun
fun <- function(v)
if (length(v) > 2) c(v[1], cummin(v[-1])) else v
This is competitive (same order of magnitude) with the data.table solution, with ordering and return occurring in less than 1s for the ~10m row data.frame of @m-dz's timings. At a second for millions of rows, it doesn't seem worthwhile to pursue further optimization.
Nonetheless, when there are a very large number of small groups (e.g., 2M each of size 5), an improvement is to avoid the per-group function calls of the ave()/tapply() approach by using group identity to offset the minimum. For instance,
df <- df[order(df$x, df$n),]
grp <- match(df$x, unique(df$x)) # strictly sequential groups
keep <- duplicated(grp) # ignore the first of each group
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
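A quick, hedged sanity check that the offset trick matches the ave() version (fun and the.df as defined above):
df <- the.df[order(the.df$x, the.df$n), ]
ref <- ave(df$val, df$x, FUN = fun)
grp <- match(df$x, unique(df$x))
keep <- duplicated(grp)
df$val[keep] <- cummin(df$val[keep] - grp[keep]) + grp[keep]
identical(df$val, ref)   # should be TRUE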
Hmmm, should be pretty efficient if you switch to data.table...
library(data.table)
# Define the.df as a data.table (or use data.table::setDT() function)
set.seed(0)
the.df <- data.table(
  x = rep(letters[1:3], each = 4),
  n = rep(0:3, 3),
  val = round(runif(12))
)
m_dz <- function() {
  setorder(the.df, x, n)
  repeat{
    # Get IDs of rows to change
    # ids <- which(the.df[, (n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0)])
    ids <- the.df[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
    # If no IDs break
    if(length(ids) == 0){
      break
    }
    # Set val to 0
    # for (i in ids) set(the.df, i = i, j = "val", value = 0)
    set(the.df, i = ids, j = "val", value = 0)
  }
  return(the.df)
}
Edit: The function above was slightly modified thanks to @jangorecki's suggestions, i.e. it now uses which = TRUE and set(the.df, i = ids, j = "val", value = 0), which made the timings much more stable (no very high max timings).
Edit: timing comparison with @DavidArenburg's answer on a slightly bigger table, m_dz() updated (@FoldedChromatin's answer skipped because of different results).
My function is slightly faster in terms of median and upper quantile, but there is quite a big spread in timings (see max...); I cannot figure out why. Hopefully the timing methodology is correct (returning the result to a different object etc.).
Anything bigger will kill my PC :(
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace=TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e7/size1)
the.df1 <- data.table(
  x = rep(groups_ids, each = size2), # 52 * 500 = 26000
  n = rep(0:(size2-1), size1),
  val = round(runif(size1*size2))
)
the.df2 <- copy(the.df1)
# m-dz
m_dz <- function() {
  setorder(the.df1, x, n)
  repeat{
    ids <- the.df1[(n > 1) & (val == 1) & (shift(val, 1L, type = "lag") == 0), , which = TRUE]
    if(length(ids) == 0){
      break
    }
    set(the.df1, i = ids, j = "val", value = 0)
  }
  return(the.df1)
}
# David Arenburg
DavidArenburg <- function() {
  setorder(the.df2, x, n)
  the.df2[, val := if(length(indx <- which.max(val[2:.N] == 0) + 1L)) c(val[1:indx], rep(0L, .N - indx)), by = x]
  return(the.df2)
}
library(microbenchmark)
microbenchmark(
res1 <- m_dz(),
res2 <- DavidArenburg(),
times = 100
)
# Unit: milliseconds
#                     expr      min       lq     mean   median       uq       max neval cld
#           res1 <- m_dz() 247.4136 268.5005 363.0117 288.4216 312.7307 7071.0960   100   a
#  res2 <- DavidArenburg() 270.6074 281.3935 314.7864 303.5229 328.1210  525.8095   100   a
identical(res1, res2)
# [1] TRUE
Edit: (Old) results for even bigger table:
set.seed(0)
groups_ids <- replicate(300, paste(sample(LETTERS, 5, replace=TRUE), collapse = ""))
size1 <- length(unique(groups_ids))
size2 <- round(1e8/size1)
# Unit: seconds
#                     expr      min       lq     mean   median       uq       max neval cld
#           res1 <- m_dz() 5.599855 5.800264 8.773817 5.923721 6.021132 289.85107   100   a
#          res2 <- m_dz2() 5.571911 5.836191 9.047958 5.970952 6.123419 310.65280   100   a
#  res3 <- DavidArenburg() 9.183145 9.519756 9.714105 9.723325 9.918377  10.28965   100   a
Why not just use by?
> set.seed(0)
> the.df <- data.frame( x=rep(letters[1:3], each=4),
                        n=rep(0:3, 3),
                        val=round(runif(12)))
> the.df
x n val
1 a 0 1
2 a 1 0
3 a 2 0
4 a 3 1
5 b 0 1
6 b 1 0
7 b 2 1
8 b 3 1
9 c 0 1
10 c 1 1
11 c 2 0
12 c 3 0
> Mod.df <- by(the.df, INDICES = the.df$x, function(x){
    x$val[x$n==2] = 0
    Which = which(x$n==2 & x$val==0) + 1
    x$val[Which] = 0
    x})
> do.call(rbind, Mod.df)
x n val
a.1 a 0 1
a.2 a 1 0
a.3 a 2 0
a.4 a 3 0
b.5 b 0 1
b.6 b 1 0
b.7 b 2 0
b.8 b 3 0
c.9 c 0 1
c.10 c 1 1
c.11 c 2 0
c.12 c 3 0
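Note that this hard-codes the n==2 and n==3 steps, so it gives different results on data where the first 0 appears elsewhere (presumably why it was skipped in the benchmarks above). A hedged sketch of the same by() scaffolding using the general cummin rule from the earlier answer instead:
Mod.df <- by(the.df, INDICES = the.df$x, function(x) {
  x <- x[order(x$n), ]
  if (nrow(x) > 2) x$val <- c(x$val[1], cummin(x$val[-1]))
  x
})
do.call(rbind, Mod.df)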

Make a vector with counts of rows that meet criteria

I want to make a vector that contains the number of rows that meet my criteria:
leftE0 <- c(900,2000,4000,9000,15000,30000,53000,100000,160000)
rightE0 <- c(2000,4000,9000,15000,30000,53000,100000,160000,300000)
sum(datap$CF > 0 & (datap$E0.keV > leftE0[1]) & (datap$E0.keV < rightE0[1]), na.rm=TRUE)
I don't understand how to vectorise this action.
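A direct way to vectorise the exact expression above is to mapply over the pairs of interval endpoints (a sketch; the cut/table approach below is the more idiomatic route):
counts <- mapply(function(l, r) {
  sum(datap$CF > 0 & datap$E0.keV > l & datap$E0.keV < r, na.rm = TRUE)
}, leftE0, rightE0)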
Use cut and table:
#some example data
set.seed(42)
datap <- data.frame(CF = rnorm(100), E0.keV = exp(runif(100, 0, log(4e6))))
breaks <- c(-Inf, 900,2000,4000,9000,15000,30000,53000,100000,160000, 300000, Inf)
table(cut(datap$E0.keV, breaks), datap$CF > 0)
#                   FALSE TRUE
#   (-Inf,900]         21   32
#   (900,2e+03]         6    3
#   (2e+03,4e+03]       3    3
#   (4e+03,9e+03]       6    0
#   (9e+03,1.5e+04]     1    1
#   (1.5e+04,3e+04]     0    1
#   (3e+04,5.3e+04]     1    0
#   (5.3e+04,1e+05]     2    0
#   (1e+05,1.6e+05]     1    0
#   (1.6e+05,3e+05]     2    1
#   (3e+05, Inf]        3   13

column bind in R and name the column

I want to column bind (cbind) mydf[,"c"], give it a new name newcolumn in one step, and get the result matrix mydf shown below. How do I do it?
mydf
# a b c
# 1 2 6
# 1 3 4
mydf
# a b c newcolumn
# 1 2 6 6
# 1 3 4 4
You can specify the new column name in the call to cbind:
mydf <- cbind(mydf, newcolumn=mydf[,"c"])
mydf
# a b c newcolumn
# [1,] 1 2 6 6
# [2,] 1 3 4 4
Data (constructed with the same approach):
mydf <- cbind(a=c(1, 1), b=c(2, 3), c=c(6, 4))
If you had a data frame instead of a matrix, you could simply do mydf$newcolumn <- mydf$c.
There are many approaches you could take here:
mydf <- data.frame(a=c(1,1),b=c(2,3),c=c(6,4));
mydf;
## a b c
## 1 1 2 6
## 2 1 3 4
data.frame(mydf,newcolumn=mydf$c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
cbind(mydf,newcolumn=mydf$c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
transform(mydf,newcolumn=c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
within(mydf,newcolumn <- c);
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
mydf$newcolumn <- mydf$c;
mydf;
## a b c newcolumn
## 1 1 2 6 6
## 2 1 3 4 4
Of the 5 approaches shown above, only the last actually modifies mydf. For the other 4, you have to assign the return value back to mydf to replace it with the new data.frame that has the additional column.
library('microbenchmark');
bind.df <- function() mydf <- data.frame(mydf,newcolumn=mydf$c);
bind.cb <- function() mydf <- cbind(mydf,newcolumn=mydf$c);
bind.tr <- function() mydf <- transform(mydf,newcolumn=c);
bind.wi <- function() mydf <- within(mydf,newcolumn <- c);
bind.as1 <- function() mydf$newcolumn <- mydf$c;
bind.as2 <- function() mydf['newcolumn'] <- mydf['c'];
bind.as3 <- function() mydf[,'newcolumn'] <- mydf[,'c'];
bind.as4 <- function() mydf[['newcolumn']] <- mydf[['c']];
N <- 1e5; mydf <- data.frame(a=rep(c(1,1),N),b=rep(c(2,3),N),c=rep(c(6,4),N));
microbenchmark(bind.df(),bind.cb(),bind.tr(),bind.as1(),bind.as2(),bind.as3(),bind.as4(),times=1e4);
## Unit: microseconds
##        expr     min      lq      mean  median      uq       max neval
##   bind.df()  97.077 112.046 128.66080 121.027 134.711  1690.513 10000
##   bind.cb()  86.814 100.927 117.14364 109.907 122.737  1849.172 10000
##   bind.tr() 105.203 120.171 138.90802 131.290 145.830  1680.250 10000
##  bind.as1()  12.402  20.100  23.35085  22.239  25.660   148.397 10000
##  bind.as2() 370.776 412.686 596.47901 425.088 449.036 41799.239 10000
##  bind.as3() 347.682 385.743 564.78320 396.435 419.528 42144.355 10000
##  bind.as4()  17.534  26.087  30.09639  28.654  32.930   638.915 10000
If you have two blocks of columns and would like to combine them column by column into one data frame, convert each to a data.frame and cbind them:
dapu <- cbind(data.frame(data_r), data.frame(data_c))

Convert a dataframe to presence absence matrix

I have a table in which the rows hold an unequal number of string elements:
File1 A B C
File2 A B D
File3 E F
I want to convert into a format as follows
A B C D E F
File1 1 1 1 0 0 0
FIle2 1 1 0 1 0 0
File3 0 0 0 0 1 1
I tried to do it using reshape2 but was not successful.
Sample data:
mydata <- structure(list(V1 = c("File1", "File2", "File3"),
                         V2 = c("A", "A", "E"), V3 = c("B", "B", "F"),
                         V4 = c("C", "D", "")),
                    .Names = c("V1", "V2", "V3", "V4"),
                    class = "data.frame", row.names = c(NA, -3L))
One possibility:
library(reshape2)
df2 <- melt(df, id.var = "V1")
with(df2, table(V1, value))
#        value
# V1      A B C D E F
#   File1 1 1 1 0 0 0
#   File2 1 1 0 1 0 0
#   File3 0 0 0 0 1 1
A reasonably efficient approach is to use the (presently) non-exported charMat function from my "splitstackshape" package. Since it's not exported, you will have to use ::: to access it.
library(splitstackshape)
cbind(mydata[1], splitstackshape:::charMat(
  split.default(mydata[-1], sequence(ncol(mydata)-1)), fill = 0))
#      V1   A B C D E F
# 1 File1 0 1 1 1 0 0 0
# 2 File2 0 1 1 0 1 0 0
# 3 File3 1 0 0 0 0 1 1
(The extra unnamed column is the level for the empty string produced by the blank cell in row 3.)
Under the hood, charMat makes use of matrix indexing to process everything pretty efficiently. Step-by-step, this is what charMat does.
X <- split.default(mydata[-1], sequence(ncol(mydata)-1))
len <- length(X)
vec <- unlist(X, use.names=FALSE)
lvl <- sort(unique(vec))
out <- matrix(0L, nrow = len, ncol = length(lvl), dimnames = list(NULL, lvl))
i.idx <- rep(seq.int(len), vapply(X, length, integer(1L)))
j.idx <- match(vec, lvl)
out[cbind(i.idx, j.idx)] <- 1
out
#        A B C D E F
# [1,] 0 1 1 1 0 0 0
# [2,] 0 1 1 0 1 0 0
# [3,] 1 0 0 0 0 1 1
That looks like a mouthful, but it is actually quite a fast operation, made faster by using the charMat function :-)
Update: Benchmarks
The following benchmarks test Henrik's answer with my charMat answer, and also adapts Henrik's answer to use "data.table" instead, for better efficiency.
Two tests were run. The first is on a similar dataset with 90K rows, and the second on one with 900K rows.
Here's the sample data:
biggerdata <- do.call(rbind, replicate(30000, mydata, simplify = FALSE))
biggerdata$V1 <- make.unique(biggerdata$V1)
dim(biggerdata)
# [1] 90000 4
evenBigger <- do.call(rbind, replicate(10, biggerdata, simplify = FALSE))
evenBigger$V1 <- make.unique(evenBigger$V1)
dim(evenBigger)
# [1] 900000 4
Here are the functions to benchmark:
fun1 <- function(indf) {
  cbind(indf[1], splitstackshape:::charMat(
    split.default(indf[-1], sequence(ncol(indf)-1)), fill = 0))
}
library(reshape2)
fun2 <- function(indf) {
  df2 <- melt(indf, id.var = "V1")
  with(df2, table(V1, value))
}
library(data.table)
library(reshape2)
DT <- data.table(biggerdata)
DT2 <- data.table(evenBigger)
fun3 <- function(inDT) {
  DTL <- melt(inDT, id.vars = "V1")
  dcast.data.table(DTL, V1 ~ value, fun.aggregate = length)
}
And the results of the benchmarking.
library(microbenchmark)
microbenchmark(fun1(biggerdata), fun2(biggerdata), suppressMessages(fun3(DT)), times = 20)
# Unit: milliseconds
#                        expr       min        lq    median        uq       max neval
#            fun1(biggerdata)  185.3652  199.8725  289.0206  308.5826  327.4185    20
#            fun2(biggerdata) 1453.8791 1605.6053 1639.8567 1758.3984 1797.2229    20
#  suppressMessages(fun3(DT))  469.8979  570.4664  586.4715  598.6229  675.2961    20
microbenchmark(fun1(evenBigger), fun2(evenBigger), fun3(DT2), times = 5)
# Unit: seconds
#              expr       min        lq    median        uq       max neval
#  fun1(evenBigger)  1.871611  1.896351  2.071355  2.140580  2.464569     5
#  fun2(evenBigger) 26.911523 27.212910 27.363442 27.469812 27.938178     5
#         fun3(DT2)  7.103615  7.131603  7.141908  7.205006  7.218321     5
