Select subset of columns which minimise a criterion in R

I have a sparse binary data.frame which looks like this
set.seed(123)
dat <- as.data.frame(matrix(rep(round(runif(40,0,0.9),0),5),ncol = 20))
# > dat
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# 2 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
# 3 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 7 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
# 8 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1
# 9 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
# 10 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
I need to find the 3 columns which minimise the number of zeros obtained when I call rowSums on those columns.
Example:
# > rowSums(dat[,1:3])
# [1] 2 2 2 3 2 2 0 2 0 1
#
# > rowSums(dat[,2:4])
# [1] 3 2 3 3 1 2 1 1 0 1
Here when I call rowSums on the first 3 columns I get 2 zeros, while when I call rowSums on columns 2:4 I get only one 0, so the second solution would be preferred.
Of course, I don't need the columns to be next to each other when I apply rowSums, so I need to explore all the possible combinations (e.g., I want rowSums to also consider the case of V1+V5+V17, ...), and if there are multiple "optimal" solutions, it's OK for me to just keep one of them.
Note that my real data.frame is 220,000 rows x 200 columns, so I need an approach that is efficient in terms of time/memory consumed.

This is the most obvious solution, although it likely won't scale very well:
which.min(combn(dat,3L,function(x) sum(rowSums(x)==0)));
## [1] 2
The output value of 2 can be thought of as a combination index. You can get the columns that belong to that combination by running combn() on the complete column index set of the input object and indexing out that particular combination of indexes:
cis <- combn(seq_along(dat),3L)[,2L];
cis;
## [1] 1 2 4
And then getting the column names is easy:
names(dat)[cis];
## [1] "V1" "V2" "V4"
You can get the number of zeroes in the solution as follows:
sum(rowSums(dat[,cis])==0);
## [1] 1
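For convenience, the steps above can be rolled into one small helper. This is just a sketch; findBestCombo is a hypothetical name, and the double combn() pass is as brute-force as the one-liner above:
findBestCombo <- function(dat, r = 3L) {
    ## count zero rows for every r-column combination, then pick the best one
    zeroCounts <- combn(seq_along(dat), r, function(ci) sum(rowSums(dat[, ci]) == 0));
    cis <- combn(seq_along(dat), r)[, which.min(zeroCounts)];
    list(columns = names(dat)[cis], zeros = min(zeroCounts));
};
findBestCombo(dat, 3L);
## should recover "V1" "V2" "V4" with 1 zero, matching the steps above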
I've written a much faster solution in Rcpp.
To make the function more generic, I wrote it to take a logical matrix rather than a data.frame, and designed it to find the column combination with the fewest all-true rows. Thus, for your case, you can compute the argument as dat==0. I also parameterized the number of columns in the combination as a second parameter r, which will be 3 for your case.
library(Rcpp);
Sys.setenv('PKG_CXXFLAGS'='-std=c++11');
cppFunction('
IntegerVector findColumnComboWithMinimumAllTrue(LogicalMatrix M, int r) {
    // row indices (0-based) of the full matrix
    std::vector<int> rzFull(M.nrow()); std::iota(rzFull.begin(),rzFull.end(),0);
    std::vector<int> rzErase;
    // per-level storage of the surviving (still all-true) row sets
    std::vector<std::vector<int>> rzs(M.ncol(),std::vector<int>(M.nrow()));
    std::vector<std::vector<int>*> rzps(M.ncol());
    std::vector<int>* rzp = &rzFull;
    std::vector<int> com(r);           // current combination (0-based column indexes)
    int bestAllTrueCount = M.nrow()+1;
    std::vector<int> bestCom(r);
    int pmax0 = M.ncol()-r;
    int p = 0;                         // current position within the combination
    while (true) {
        // rows in the current surviving set that are false in column com[p]
        rzErase.clear();
        for (int rzi = 0; rzi < rzp->size(); ++rzi)
            if (!M((*rzp)[rzi],com[p])) rzErase.push_back(rzi);
        if (p+1==r) {
            // last column of the combination: count the surviving all-true rows
            if (rzp->size()-rzErase.size() < bestAllTrueCount) {
                bestAllTrueCount = rzp->size()-rzErase.size();
                bestCom = com;
            }
            if (com[p]==pmax0+p) {
                // advance the combination like an odometer
                do {
                    --p;
                } while (p >= 0 && com[p]==pmax0+p);
                if (p==-1) break;
                ++com[p];
                rzp = p==0 ? &rzFull : rzps[p-1];
            } else {
                ++com[p];
            }
        } else {
            // not the last column: filter the surviving row set and descend
            if (rzErase.empty()) {
                rzps[p] = rzp;
            } else {
                rzs[p].clear();
                int rzi = -1;
                for (int ei = 0; ei < rzErase.size(); ++ei)
                    for (++rzi; rzi < rzErase[ei]; ++rzi)
                        rzs[p].push_back((*rzp)[rzi]);
                for (++rzi; rzi < rzp->size(); ++rzi)
                    rzs[p].push_back((*rzp)[rzi]);
                rzp = rzps[p] = &rzs[p];
            }
            ++p;
            com[p] = com[p-1]+1;
        }
    }
    // convert to 1-based column indexes for R
    IntegerVector res(bestCom.size());
    for (int i = 0; i < res.size(); ++i)
        res[i] = bestCom[i]+1;
    return res;
}
');
Here's a demo on your example input:
set.seed(123L);
dat <- as.data.frame(matrix(rep(round(runif(40,0,0.9),0),5),ncol=20L));
findColumnComboWithMinimumAllTrue(dat==0,3L);
## [1] 1 2 4
And here's a full-size test, which takes almost 10 minutes on my system:
set.seed(1L); NR <- 220e3L; NC <- 200L;
dat <- as.data.frame(matrix(sample(0:1,NR*NC,T),NR,NC));
system.time({ res <- findColumnComboWithMinimumAllTrue(dat==0,3L); });
## user system elapsed
## 555.641 0.328 556.401
res;
## [1] 28 64 89
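As with the combn() approach above, the returned indices can be mapped back to column names and a zero count (a quick usage sketch, using the res assigned in the timing call):
names(dat)[res];
## [1] "V28" "V64" "V89"
sum(rowSums(dat[,res])==0);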

Related

Drawing conditional combinations of a binary vector one by one

I am trying to write a routine that conditionally finds combinations of a binary vector. For example, consider the following vector:
> A <- rep(c(1,0,0),3)
> A
[1] 1 0 0 1 0 0 1 0 0
Note that the length of the vector A is always a multiple of 3, so the following condition always holds:
length(A) %% 3 == 0
The main condition is that there must be exactly one 1 in each consecutive set of 3 elements. In this example, for instance, one element of A[1:3] will be 1, one element of A[4:6] will be 1 and one element of A[7:9] will be 1, and the rest are all 0. Therefore, for this example, there will be a total of 27 possible combinations.
The objective is to make a routine that draws/returns the next valid combination until all possible legal combinations have been returned.
Note that I am not looking for a table with all the possible combinations. That solution is already available in another question of mine on StackOverflow. However, with that method I run into memory problems when going beyond a length of about 45 elements in A, since it returns the full matrix, which is huge. Therefore, instead of storing the full matrix, I want to retrieve one combination at a time and then decide later whether I want to store it.
What the OP is after is an iterator. If we were to do this properly, we would write a class in C++ with a get_next method, and expose this to R. As it stands, with base R, since everything is passed by value, we must call a function on our object-to-be-updated and reassign the object-to-be-updated every time.
Here is a very crude implementation:
get_next <- function(comb, v, m) {
  ## start/end indices of each length(v)-sized section of comb
  s <- seq(1L, length(comb), length(v))
  e <- seq(length(v), length(comb), length(v))
  ## a section is "maxed out" once its 1 has moved to the last position
  last_comb <- rev(v)
  can_be_incr <- sapply(seq_len(m), function(x) {
    !identical(comb[s[x]:e[x]], last_comb)
  })
  if (all(!can_be_incr)) {
    return(FALSE)
  } else {
    ## shift the 1 one place to the right in the first incrementable section
    idx <- which(can_be_incr)[1L]
    span <- s[idx]:e[idx]
    j <- which(comb[span] == 1L)
    comb[span[j]] <- 0L
    comb[span[j + 1L]] <- 1L
    if (idx > 1L) {
      ## Reset previous maxed out sections
      for (i in 1:(idx - 1L)) {
        comb[s[i]:e[i]] <- v
      }
    }
  }
  return(comb)
}
And here is a simple usage:
m <- 3L
v <- as.integer(c(1,0,0))
comb <- rep(v, m)
count <- 1L
while (!is.logical(comb)) {
  cat(count, ": ", comb, "\n")
  comb <- get_next(comb, v, m)
  count <- count + 1L
}
1 : 1 0 0 1 0 0 1 0 0
2 : 0 1 0 1 0 0 1 0 0
3 : 0 0 1 1 0 0 1 0 0
4 : 1 0 0 0 1 0 1 0 0
5 : 0 1 0 0 1 0 1 0 0
6 : 0 0 1 0 1 0 1 0 0
7 : 1 0 0 0 0 1 1 0 0
8 : 0 1 0 0 0 1 1 0 0
9 : 0 0 1 0 0 1 1 0 0
10 : 1 0 0 1 0 0 0 1 0
11 : 0 1 0 1 0 0 0 1 0
12 : 0 0 1 1 0 0 0 1 0
13 : 1 0 0 0 1 0 0 1 0
14 : 0 1 0 0 1 0 0 1 0
15 : 0 0 1 0 1 0 0 1 0
16 : 1 0 0 0 0 1 0 1 0
17 : 0 1 0 0 0 1 0 1 0
18 : 0 0 1 0 0 1 0 1 0
19 : 1 0 0 1 0 0 0 0 1
20 : 0 1 0 1 0 0 0 0 1
21 : 0 0 1 1 0 0 0 0 1
22 : 1 0 0 0 1 0 0 0 1
23 : 0 1 0 0 1 0 0 0 1
24 : 0 0 1 0 1 0 0 0 1
25 : 1 0 0 0 0 1 0 0 1
26 : 0 1 0 0 0 1 0 0 1
27 : 0 0 1 0 0 1 0 0 1
Note that this implementation is memory efficient; however, it will be very slow.
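Since the point of the iterator is to decide per combination whether to keep it, here is a hedged usage sketch that stores only the combinations passing an arbitrary, made-up filter (here: the first section's 1 sits in position 1):
m <- 3L
v <- as.integer(c(1, 0, 0))
comb <- rep(v, m)
kept <- list()
while (!is.logical(comb)) {
  if (comb[1L] == 1L) {               # hypothetical keep-condition; replace with your own
    kept[[length(kept) + 1L]] <- comb
  }
  comb <- get_next(comb, v, m)
}
length(kept)   # 9 of the 27 combinations satisfy this particular condition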

R: How to drop columns with less than 10% 1's

My dataset:
a b c
1 1 0
1 0 0
1 1 0
I want to drop columns which have less than 10% 1's. I have this code but it's not working:
sapply(df, function(x) df[df[,c(x)]==1]>0.1))
Maybe I need a totally different approach.
Try this option with apply() and a small custom function to test the threshold of 1's across each column. I have created a dummy example. The index i contains the columns that will be dropped after using myfun to compute the proportion of 1's in each column. Here is the code:
#Data
df <- as.data.frame(matrix(c(1,0),20,10))
df$V1<-c(1,rep(0,19))
df$V2<-c(1,rep(0,19))
#Function
myfun <- function(x) {sum(x==1)/length(x)}
#Index For removing
i <- unname(which(apply(df,2,myfun)<0.1))
#Drop
df2 <- df[,-i]
The output:
df2
V3 V4 V5 V6 V7 V8 V9 V10
1 1 1 1 1 1 1 1 1
2 0 0 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1
4 0 0 0 0 0 0 0 0
5 1 1 1 1 1 1 1 1
6 0 0 0 0 0 0 0 0
7 1 1 1 1 1 1 1 1
8 0 0 0 0 0 0 0 0
9 1 1 1 1 1 1 1 1
10 0 0 0 0 0 0 0 0
11 1 1 1 1 1 1 1 1
12 0 0 0 0 0 0 0 0
13 1 1 1 1 1 1 1 1
14 0 0 0 0 0 0 0 0
15 1 1 1 1 1 1 1 1
16 0 0 0 0 0 0 0 0
17 1 1 1 1 1 1 1 1
18 0 0 0 0 0 0 0 0
19 1 1 1 1 1 1 1 1
20 0 0 0 0 0 0 0 0
Columns V1 and V2 are dropped because their proportion of 1's is below 0.1.
You can use colMeans in base R to keep columns where at least 10% of the values are 1:
df[, colMeans(df == 1) >= 0.1]
Or, in dplyr, use select() with where():
library(dplyr)
df %>% select(where(~mean(. == 1) >= 0.1))
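Applied to the dummy df built above, both forms drop V1 and V2; a quick check:
colnames(df)[colMeans(df == 1) >= 0.1]
# [1] "V3"  "V4"  "V5"  "V6"  "V7"  "V8"  "V9"  "V10"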

String decomposition

I need to decompose about 75 million character strings using R. I need to do something like creating a Term Document matrix, where each word that occurs in the document becomes a column in the matrix and anywhere the term occurs, the matrix element is coded as 1.
I have:
About 75 million character strings ranging in length from about 0-100 characters; they represent a time series giving coded information about what happened in that period. Each code is exactly one character and corresponds to a time period.
I need:
Some kind of matrix or way of conveying the information that takes away the time series and just tells me how many times a certain code was reported in each series.
For instance:
The string "ABCDEFG-123" would become a row in the matrix where each character would be tallied as occurring once. If this is too difficult, a matrix of 0s and 1s would also give me some information, though I would prefer to keep as much information as possible.
Does anyone have any ideas of how to do this quickly? There are 20 possible codes.
Example:
my20chars = c(LETTERS[1:10], 0:9)
set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))
One approach:
library(data.table)
d = setDT(stack(strsplit(setNames(x,x),"")))
dcast(d[ values %in% my20chars ], ind ~ values, fun = length)
Result:
ind 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
1: ---8EEAD8I 0 0 0 0 0 0 0 0 2 0 1 0 0 1 2 0 0 0 1 0
2: --33B6E-32 0 0 1 3 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 0
3: --3IFBG8GI 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 1 2 0 2 0
4: --4210I8H5 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0
5: --5H4DE9F- 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0
---
9996: JJFJBJ24AJ 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 5
9997: JJI-J-0FGB 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 3
9998: JJJ1B54H63 0 1 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 3
9999: JJJED7A3FI 0 0 0 1 0 0 0 1 0 0 1 0 0 1 1 1 0 0 1 3
10000: JJJIF6GI13 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 2 3
Benchmark:
library(microbenchmark)
nstrs = 1e5
nchars = 10
x = replicate(nstrs, paste0(sample(c(my20chars,"-"), nchars, replace=TRUE), collapse=""))
microbenchmark(
  dcast = {
    d = setDT(stack(strsplit(setNames(x,x),"")))
    dcast(d[ values %in% my20chars ], ind ~ values, fun = length, value.var="ind")
  },
  times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# dcast 3.112633 3.423935 3.480692 3.494176 3.573967 3.741931 10
So, this is not fast enough to handle the OP's 75 million strings, but may be a good place to start.
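One hedged way to push this toward the full 75 million strings is to process the input in chunks, so the long-format intermediate never holds everything at once. This is only a sketch; the 1e5 chunk size is an arbitrary assumption, and each chunk simply contributes its own block of rows:
library(data.table)
count_chunk <- function(chunk, chars = my20chars) {
  d <- setDT(stack(strsplit(setNames(chunk, chunk), "")))
  dcast(d[values %in% chars], ind ~ values, fun = length, value.var = "ind")
}
chunks <- split(x, ceiling(seq_along(x) / 1e5))
res <- rbindlist(lapply(chunks, count_chunk), fill = TRUE)  # fill = TRUE pads characters absent from a chunk with NA; replace with 0 afterwards if needed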
I really like #Frank's solution, but here's another way that has two advantages:
It uses a sparse matrix format, so you are more likely to fit everything into memory; and
It is (even) simpler.
It uses our quanteda package, where you tokenise the characters in each string and form a document-feature matrix from these in one command:
my20chars = c(LETTERS[1:10], 0:9)
set.seed(1)
x = replicate(1e4, paste0(sample(c(my20chars,"-"),10, replace=TRUE), collapse=""))
require(quanteda)
myDfm <- dfm(x, what = "character", toLower = FALSE, verbose = FALSE)
# for equivalent printing, does not change content:
myDfm <- myDfm[, order(features(myDfm))]
rownames(myDfm) <- x
head(myDfm)
# Document-feature matrix of: 6 documents, 20 features.
# 6 x 20 sparse Matrix of class "dfmSparse"
# features
# docs 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J
# FH29E8933B 0 0 1 2 0 0 0 0 1 2 0 1 0 0 1 1 0 1 0 0
# ED4I605-H6 1 0 0 0 1 1 2 0 0 0 0 0 0 1 1 0 0 1 1 0
# 9E3CFIAI8H 0 0 0 1 0 0 0 0 1 1 1 0 1 0 1 1 0 1 2 0
# 020D746C5I 2 0 1 0 1 1 1 1 0 0 0 0 1 1 0 0 0 0 1 0
# 736116A054 1 2 0 1 1 1 2 1 0 0 1 0 0 0 0 0 0 0 0 0
# 08JFBCG03I 2 0 0 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 1
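If a dense base matrix is ever needed downstream, the dfm can be converted with as.matrix (at the obvious memory cost); a small usage sketch:
denseCounts <- as.matrix(myDfm)
dim(denseCounts)   # 10000 documents x 20 features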
Disadvantage:
It's (much) slower.
Benchmark:
microbenchmark(
  dcast = {
    d = setDT(stack(strsplit(setNames(x,x),"")))
    dcast(d[ values %in% my20chars ], ind ~ values, fun = length, value.var="ind")
  },
  quanteda = dfm(x, what = "character", toLower = FALSE, removePunct = FALSE, verbose = FALSE),
  times = 10)
# Unit: seconds
# expr min lq mean median uq max neval
# dcast 2.380971 2.423677 2.465338 2.429331 2.521256 2.636102 10
# quanteda 21.106883 21.168145 21.369443 21.345173 21.519018 21.883966 10

More efficient approach to create a dummy coding

Question: In Python, I would work with dictionaries and use lots of map/apply functions. However, with R I started with this simple approach using lists, and I would like to know whether there is a more efficient/more elegant approach to doing the following.
In statistics, you use dummy variables to represent levels of a nominal attribute. E.g., A/B/C would become 00, 01, 10; A/B/C/D would become 000, 001, 010, 100. Thus, only one 1 per item is allowed. You therefore need n-1 digits to represent n levels/letters.
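As a tiny base-R illustration of that coding (the answer below builds on the same model.matrix idea): with three levels you keep n-1 = 2 indicator columns, and the remaining level is implied by all zeros.
x <- factor(c("A", "B", "C"))
model.matrix(~ 0 + x)[, 1:2]
#   xA xB
# 1  1  0
# 2  0  1
# 3  0  0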
Here I create some data:
data <- data.frame(
  "upper"   = c(1,1,1,2,2,2,3,3,3), # var 1
  "country" = c(1,2,3,1,2,3,1,2,3), # var 2
  "price"   = c(1,2,3,2,3,1,3,1,2)  # var 3
)
Create a list with keys (attributes) and values (lists of unique attribute levels):
lst <- list()
for (attribute in colnames(data)) {
  lst[[attribute]] = unique(data[[attribute]])
}
Create dummy coding, i is used to consider only n-1 items:
dummy <- list()
for (attribute in colnames(data)) {
  i <- 1
  for (level in lst[[attribute]]) {
    if (length(lst[[attribute]]) != i) {
      dummy[[paste0(attribute, level)]] <- ifelse(
        data[[attribute]] == level,
        1,
        0
      )
    }
    i <- i + 1
  }
}
Results:
dummy
$upper1
[1] 1 1 1 0 0 0 0 0 0
$upper2
[1] 0 0 0 1 1 1 0 0 0
$country1
[1] 1 0 0 1 0 0 1 0 0
$country2
[1] 0 1 0 0 1 0 0 1 0
$price1
[1] 1 0 0 0 0 1 0 1 0
$price2
[1] 0 1 0 1 0 0 0 0 1
We create a design matrix using model.matrix, split the columns to create a list of lists, and finally concatenate the list elements together with do.call(c, ...):
res <- do.call("c", lapply(data, function(x) {
  x1 <- model.matrix(~0+factor(x))
  split(x1, col(x1))
}))
As we only need the first two levels, we can subset 'res' using c(TRUE, TRUE, FALSE), which recycles to the end of the list.
res[c(TRUE, TRUE, FALSE)]
#$upper.1
#[1] 1 1 1 0 0 0 0 0 0
#$upper.2
#[1] 0 0 0 1 1 1 0 0 0
#$country.1
#[1] 1 0 0 1 0 0 1 0 0
#$country.2
#[1] 0 1 0 0 1 0 0 1 0
#$price.1
#[1] 1 0 0 0 0 1 0 1 0
#$price.2
#[1] 0 1 0 1 0 0 0 0 1
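If you prefer the dummies as columns of a data.frame rather than a list, a small follow-up usage sketch:
dummy_df <- as.data.frame(res[c(TRUE, TRUE, FALSE)])
str(dummy_df)   # 9 obs. of 6 variables: upper.1, upper.2, country.1, country.2, price.1, price.2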

How to count the frequency in binary transactions per column, and add the result after the last row in R?

I have a txt file (data5.txt):
1 0 1 0 0
1 1 1 0 0
0 0 1 0 0
1 1 1 0 1
0 0 0 0 1
0 0 1 1 1
1 0 0 0 0
1 1 1 1 1
0 1 0 0 1
1 1 0 0 0
I need to count the frequency of ones and zeros in each column.
If the frequency of ones >= the frequency of zeros, then I will print 1 after the last row for that column.
I'm new to R, but I tried the following and got this error:
Error in if (z >= d) data[n, i] = 1 else data[n, i] = 0 :
missing value where TRUE/FALSE needed
my code:
data<-read.table("data5.txt", sep="")
m =length(data)
d=length(data[,1])/2
n=length(data[,1])+1
for(i in 1:m)
{
z=sum(data[,i])
if (z>=d) data[n,i]=1 else data[n,i]=0
}
You may try this:
rbind(df, ifelse(colSums(df == 1) >= colSums(df == 0), 1, NA))
# V1 V2 V3 V4 V5
# 1 1 0 1 0 0
# 2 1 1 1 0 0
# 3 0 0 1 0 0
# 4 1 1 1 0 1
# 5 0 0 0 0 1
# 6 0 0 1 1 1
# 7 1 0 0 0 0
# 8 1 1 1 1 1
# 9 0 1 0 0 1
# 10 1 1 0 0 0
# 11 1 1 1 NA 1
Update, thanks to a nice suggestion from #Arun:
rbind(df, ifelse(colSums(df == 1) >= ceiling(nrow(df)/2), 1, NA))
or even:
rbind(df, ifelse(colSums(df == 1) >= nrow(df)/2, 1, NA))
Thanks to #SvenHohenstein.
Possibly I misinterpreted your intended results. If you want 0 (rather than NA) when the frequency of ones is not equal to or larger than the frequency of zeros, then this suffices:
rbind(df, colSums(df) >= nrow(df) / 2)
Again, thanks to #SvenHohenstein for his useful comments!
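For completeness, the error in the original loop arises because assigning data[n, i] creates row n with NA in every other column, so later sum(data[, i]) calls return NA and the if() condition fails. A minimal sketch of a fix, keeping the spirit of the original code, is to compute all the column sums before appending the new row:
data <- read.table("data5.txt", sep = "")
z <- colSums(data)                                   # frequency of ones per column
data[nrow(data) + 1, ] <- ifelse(z >= nrow(data) / 2, 1, 0)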
