Choose the specific variable in R

I have data like this table:
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15
4 0 0 2 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 3 0 0 0 0 0 0 0 0 0 0 0
I want to make a new matrix containing only the columns that have at least one non-zero value (in this case, v1 and v4).
I know the subset function, but I cannot find a way to choose columns conditionally using an "if" statement.
In other words, how can I build a matrix with only the columns that contain a non-zero value, using something like an "if" statement?
Please help me solve this problem.
Thanks.

You haven't specified what format your data is in, but if you have a matrix or a data.frame, you should be able to use the R extract operator ([) to specify only the columns you want. You can feed it a vector of logical values (TRUE or FALSE) for that specification, so all you need is a function that will return the logical values you want.
As a simple example with a matrix, you could apply a function seeing if there are any non-zero values across each of the columns of the matrix:
> a
[,1] [,2] [,3] [,4]
[1,] 0 1 0 4
[2,] 0 2 0 5
[3,] 0 3 0 6
> a[, apply(a, 2, function(x) { return(any(x != 0)) })]
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
This same extract mechanism works on data.frames as well:
> a
V1 V2 V3 V4
1 0 1 0 4
2 0 2 0 5
3 0 3 0 6
> a[, sapply(a, function(x) { return(any(x != 0)) })]
V2 V4
1 1 4
2 2 5
3 3 6
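Applied to the table in the question, the same logical-index idea might look like the sketch below. The object name `m` and the way the matrix is rebuilt are assumptions made for illustration; `drop = FALSE` is added so the result stays a matrix even when only one column qualifies.

```r
# Rebuild the question's table as a 5 x 15 matrix (assumed layout: v1..v15).
m <- matrix(0, nrow = 5, ncol = 15,
            dimnames = list(NULL, paste0("v", 1:15)))
m[, "v1"] <- c(4, 1, 2, 1, 2)
m[, "v4"] <- c(2, 1, 1, 1, 3)

# TRUE for every column that contains at least one non-zero value.
keep <- colSums(m != 0) > 0

# drop = FALSE keeps the result a matrix even if only one column is kept.
nonzero <- m[, keep, drop = FALSE]
colnames(nonzero)
```

No explicit "if" statement is needed: the logical vector `keep` plays that role, one TRUE/FALSE decision per column.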

Related

Using R to remove all columns that sum to 0

I have a very large CSV file containing counts of unique DNA sequences, and there is a column for each unique sequence. I started with hundreds of samples and cut it down to only 15 that I care about but now I have THOUSANDS of columns that contain nothing but Zeroes and it is messing up my data processing. How do I go about completely removing any column that sums to zero? I’ve seen some similar questions on here but none of those suggestions have worked for me.
I have 6653 columns and 16 rows in my data frame.
If it matters my columns all have super crazy names, some several hundred characters long ( AATCGGCTAA..., etc) and the row names are the sample IDs which are also not entirely numeric. Any tips greatly appreciated. I am still new to R so please let me know where I will need to change things in code examples if you can! Thanks!
You can use colSums:
set.seed(10)
df <- as.data.frame(matrix(sample(0:1, 50, replace = TRUE, prob = c(.8, .2)),
5, 10))
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 0 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
df[colSums(df) != 0]
# V4 V5 V6 V7 V8 V10
# 1 0 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
But you might not want to remove all columns which sum to 0, because that could be true even if not all elements are 0. Take V4 in the data frame below as an example.
df$V4[1] <- -1
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
# 1 0 0 0 -1 1 0 0 0 0 0
# 2 0 0 0 0 0 1 0 1 0 0
# 3 0 0 0 0 0 0 0 1 0 0
# 4 0 0 0 0 0 0 1 0 0 0
# 5 0 0 0 1 0 0 0 0 0 1
So if you want to only remove columns where all elements are 0, you can do
df[colSums(df == 0) < nrow(df)]
# V4 V5 V6 V7 V8 V10
# 1 -1 1 0 0 0 0
# 2 0 0 1 0 1 0
# 3 0 0 0 0 1 0
# 4 0 0 0 1 0 0
# 5 1 0 0 0 0 1
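If the data frame also holds non-numeric columns or missing values, the all-zero test can be wrapped in a small helper. This is only a sketch: the function name `drop_all_zero_cols` and the toy `counts` data are hypothetical, and treating NA as "not a count" is an assumption you may want to change.

```r
# Hypothetical helper: drop numeric columns whose values are all zero.
# Non-numeric columns are always kept. NAs are ignored, so a numeric
# column containing only zeros and NAs (or only NAs) is dropped as well.
drop_all_zero_cols <- function(df) {
  all_zero <- vapply(
    df,
    function(x) is.numeric(x) && all(x == 0, na.rm = TRUE),
    logical(1)
  )
  df[, !all_zero, drop = FALSE]
}

# Toy data: one ID column, one all-zero column, two columns to keep.
counts <- data.frame(sample = c("a", "b"),
                     s1 = c(0, 0),
                     s2 = c(0, 3),
                     s3 = c(-1, 1))
cleaned <- drop_all_zero_cols(counts)
names(cleaned)
```

Note that `s3` sums to zero but is kept, because the test looks at the individual values rather than the column sum.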
Welcome to SO. Here is a tidyverse approach, shown on mtcars:
library(tidyverse)
mtcars %>%
select_if(is.numeric) %>%
select_if(~ sum(.x) != 0) # keep columns whose sum is not zero

Transforming Binary data

I have a data frame that consists only of 0s and 1s. So for each individual, instead of having one column with a factor value (e.g. low price, 4 rooms), I have
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21
1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 1 0 0 0 1 0
2 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 1
3 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0
4 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 1 0 1 0 0
How can I transform the dataset in R so that I create new columns (e.g. number of rooms) and give a 1 in the 4th column the value vhigh?
I have multiple explanatory variables I need to do this for. The 21 columns represent 6 variables for 1000+ observations. The result should look something like this:
PurchaseP. NumberofRooms ...
1. vhigh. 4
2. low. 4
3. vhigh. 1
4. vhigh. 2
I only did it for the first 2 explanatory variables here, but essentially it repeats like this, with each explanatory variable having 3-4 possible factor values.
V1:V4 = purchase price, V5:V8 = number of rooms, V9:V11 = floors, and so on.
In my head, something like this could work:
Create an if statement to give each 1 a value depending on column position, e.g. if the value in V4 is 1, then name it "vhigh", and do this for each Vx.
Then combine the columns V1:V4, V5:V8, V9:V11 (depending on whether the variable has 3 or 4 possible factor/integer values) while ignoring 0 values.
Would this work, or is there a simpler approach? How would one code this in R?
Here is an approach that should work for you. I wrote a function, which will take as arguments your data.frame, the columns representing one of your variables of interest (e.g. purchase price is stored in columns 1 to 4), and the names of the levels you would like as a result. The function will then return the result you requested. You'll need to write this out for the 6 variables you are interested in.
I'll simulate some data and illustrate the approach.
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0,0,0,0,1), 2),
nrow = 4, byrow = TRUE))
df
#> X1 X2 X3 X4 X5 X6 X7 X8
#> 1 0 0 0 1 1 0 0 0
#> 2 1 0 0 0 0 0 0 1
#> 3 0 0 0 1 1 0 0 0
#> 4 1 0 0 0 0 0 0 1
We'll say that the first four columns are the purchase price in v.low to v.high, and the second four are the number of rooms (1:4). We'll write a function that takes this information as arguments and returns the result:
rangeToCol <- function(df, # Your data.frame
range, # the columns that encode the category of interest
lev.names # The names of the category levels
) {
tdf <- df[range]
lev.names[unlist(apply(tdf, 1, function(rw){which(rw==1)}))]
}
new.df <- data.frame(PurchaseP = rangeToCol(df, 1:4,
c('vlow','low','high','vhigh')),
NumberofRooms = rangeToCol(df, 5:8, c(1:4)))
new.df
#> PurchaseP NumberofRooms
#> 1 vhigh 1
#> 2 vlow 4
#> 3 vhigh 1
#> 4 vlow 4
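A variation on the same idea, sketched with base R's `max.col()`. The helper name `onehot_to_factor` is hypothetical. It adds a check that every row has exactly one 1 in the block, since rows with no 1 or several 1s would otherwise make the decoded vector misalign with the rows.

```r
# Hypothetical helper: decode one block of one-hot columns into a factor.
onehot_to_factor <- function(df, cols, level_names) {
  block <- as.matrix(df[cols])
  stopifnot(length(level_names) == ncol(block),
            all(rowSums(block == 1) == 1))  # exactly one 1 per row
  # max.col() returns, for each row, the index of the column holding the maximum.
  factor(level_names[max.col(block, ties.method = "first")],
         levels = level_names)
}

# Same simulated data as above.
df <- data.frame(matrix(rep(c(0,0,0,1, 1,0,0,0, 1,0,0,0, 0,0,0,1), 2),
                        nrow = 4, byrow = TRUE))
decoded <- data.frame(
  PurchaseP     = onehot_to_factor(df, 1:4, c("vlow", "low", "high", "vhigh")),
  NumberofRooms = onehot_to_factor(df, 5:8, as.character(1:4))
)
decoded
```

Returning a factor with all levels declared keeps levels that happen not to occur in the data, which can matter for later modelling.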

How can I create this special sequence?

I would like to create the following vector sequence.
0 1 0 0 2 0 0 0 3 0 0 0 0 4
My thought was to create 0 first with rep() but not sure how to add the 1:4.
Create a diagonal matrix, take the upper triangle, and remove the first element:
d <- diag(0:4)
d[upper.tri(d, TRUE)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
If you prefer a one-liner that makes no global assignments, wrap it up in a function:
(function() { d <- diag(0:4); d[upper.tri(d, TRUE)][-1L] })()
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
And for code golf purposes, here's another variation using d from above:
d[!lower.tri(d)][-1L]
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
rep and rbind up to their old tricks:
rep(rbind(0,1:4),rbind(1:4,1))
#[1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
This essentially creates 2 matrices, one for the value, and one for how many times the value is repeated. rep does not care if an input is a matrix, as it will just flatten it back to a vector going down each column in order.
rbind(0,1:4)
# [,1] [,2] [,3] [,4]
#[1,] 0 0 0 0
#[2,] 1 2 3 4
rbind(1:4,1)
# [,1] [,2] [,3] [,4]
#[1,] 1 2 3 4
#[2,] 1 1 1 1
You can use rep() to create a sequence that has n + 1 of each value:
n <- 4
myseq <- rep(seq_len(n), seq_len(n) + 1)
# [1] 1 1 2 2 2 3 3 3 3 4 4 4 4 4
Then you can use diff() to find the elements you want. You need to append a 1 to the end of the diff() output, since you always want the last value.
c(diff(myseq), 1)
# [1] 0 1 0 0 1 0 0 0 1 0 0 0 0 1
Then you just need to multiply the original sequence with the diff() output.
myseq <- myseq * c(diff(myseq), 1)
myseq
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
unlist(lapply(1:4, function(i) c(rep(0,i),i)))
# the sequence
s = 1:4
# create zeros vector
vec = rep(0, sum(s+1))
# assign the sequence to the corresponding position in the zeros vector
vec[cumsum(s+1)] <- s
vec
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
Or to be more succinct, use replace:
replace(rep(0, sum(s+1)), cumsum(s+1), s)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
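The replace() idea generalizes to any length. A minimal sketch, with the function name `zero_runs_seq` chosen here purely for illustration:

```r
# Hypothetical wrapper: for each k in 1..n, emit k zeros followed by k.
zero_runs_seq <- function(n) {
  s <- seq_len(n)
  # Each value k occupies a block of length k + 1; it sits at the block's end.
  replace(numeric(sum(s + 1)), cumsum(s + 1), s)
}

zero_runs_seq(4)
# [1] 0 1 0 0 2 0 0 0 3 0 0 0 0 4
```

Using seq_len() means that n = 0 returns an empty vector instead of failing.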

Select subset of columns which minimise a criterion in R

I have a sparse binary data.frame which looks like this
set.seed(123)
dat <- as.data.frame(matrix(rep(round(runif(40,0,0.9),0),5),ncol = 20))
# > dat
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1
# 2 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1
# 3 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
# 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 5 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
# 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 7 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0
# 8 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1
# 9 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 1 1 0
# 10 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0
I need to find the 3 columns which minimise the number of zeros obtained when I call rowSums on those columns.
Example:
# > rowSums(dat[,1:3])
# [1] 2 2 2 3 2 2 0 2 0 1
#
# > rowSums(dat[,2:4])
# [1] 3 2 3 3 1 2 1 1 0 1
Here when I call rowSums on the first 3 columns I get 2 zeros, while when I call rowSums on columns 2:4 I get only one 0, so the second solution would be preferred.
Of course, the columns don't need to be next to each other when I apply rowSums, so I need to explore all the possible combinations (e.g. I want rowSums to also consider the case of V1+V5+V17, ...), and if there are multiple "optimal" solutions, it's OK for me to just keep one of them.
Note that my real data.frame is 220,000 rows x 200 columns, so I need an efficient approach in terms of time and memory.
This is the most obvious solution, although it likely won't scale very well:
which.min(combn(dat,3L,function(x) sum(rowSums(x)==0)));
## [1] 2
The output value of 2 can be thought of as a combination index. You can get the columns that belong to that combination by running combn() on the complete column index set of the input object and indexing out that particular combination of indexes:
cis <- combn(seq_along(dat),3L)[,2L];
cis;
## [1] 1 2 4
And then getting the column names is easy:
names(dat)[cis];
## [1] "V1" "V2" "V4"
You can get the number of zeroes in the solution as follows:
sum(rowSums(dat[,cis])==0);
## [1] 1
I've written a much faster solution in Rcpp.
To make the function more generic, I wrote it to take a logical matrix rather than a data.frame, with the design of finding the column combination with the fewest all-true rows. Thus, for your case, you can compute the argument as dat==0. I also parameterized the number of columns in the combination as a second parameter r, which will be 3 for your case.
library(Rcpp);
Sys.setenv('PKG_CXXFLAGS'='-std=c++11');
cppFunction('
IntegerVector findColumnComboWithMinimumAllTrue(LogicalMatrix M,int r) {
std::vector<int> rzFull(M.nrow()); std::iota(rzFull.begin(),rzFull.end(),0);
std::vector<int> rzErase;
std::vector<std::vector<int>> rzs(M.ncol(),std::vector<int>(M.nrow()));
std::vector<std::vector<int>*> rzps(M.ncol());
std::vector<int>* rzp = &rzFull;
std::vector<int> com(r);
int bestAllTrueCount = M.nrow()+1;
std::vector<int> bestCom(r);
int pmax0 = M.ncol()-r;
int p = 0;
while (true) {
rzErase.clear();
for (int rzi = 0; rzi < rzp->size(); ++rzi)
if (!M((*rzp)[rzi],com[p])) rzErase.push_back(rzi);
if (p+1==r) {
if (rzp->size()-rzErase.size() < bestAllTrueCount) {
bestAllTrueCount = rzp->size()-rzErase.size();
bestCom = com;
}
if (com[p]==pmax0+p) {
do {
--p;
} while (p >= 0 && com[p]==pmax0+p);
if (p==-1) break;
++com[p];
rzp = p==0 ? &rzFull : rzps[p-1];
} else {
++com[p];
}
} else {
if (rzErase.empty()) {
rzps[p] = rzp;
} else {
rzs[p].clear();
int rzi = -1;
for (int ei = 0; ei < rzErase.size(); ++ei)
for (++rzi; rzi < rzErase[ei]; ++rzi)
rzs[p].push_back((*rzp)[rzi]);
for (++rzi; rzi < rzp->size(); ++rzi)
rzs[p].push_back((*rzp)[rzi]);
rzp = rzps[p] = &rzs[p];
}
++p;
com[p] = com[p-1]+1;
}
}
IntegerVector res(bestCom.size());
for (int i = 0; i < res.size(); ++i)
res[i] = bestCom[i]+1;
return res;
}
');
Here's a demo on your example input:
set.seed(123L);
dat <- as.data.frame(matrix(rep(round(runif(40,0,0.9),0),5),ncol=20L));
findColumnComboWithMinimumAllTrue(dat==0,3L);
## [1] 1 2 4
And here's a full-size test, which takes almost 10 minutes on my system:
set.seed(1L); NR <- 220e3L; NC <- 200L;
dat <- as.data.frame(matrix(sample(0:1,NR*NC,T),NR,NC));
system.time({ res <- findColumnComboWithMinimumAllTrue(dat==0,3L); });
## user system elapsed
## 555.641 0.328 556.401
res;
## [1] 28 64 89
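To sanity-check a compiled version on small inputs, a brute-force reference in plain R can be useful. The sketch below is not the Rcpp code above; it is a slow, hypothetical helper (`find_combo_bruteforce`) that follows the same contract as described: take a logical matrix and r, and return the 1-based indexes of the column combination with the fewest all-TRUE rows.

```r
# Hypothetical brute-force reference; only practical for small matrices.
find_combo_bruteforce <- function(M, r) {
  stopifnot(is.matrix(M), is.logical(M), r >= 1, r <= ncol(M))
  combos <- combn(ncol(M), r)  # one candidate combination per column
  all_true_counts <- apply(combos, 2, function(cols) {
    # A row is "all TRUE" when every one of the r selected columns is TRUE.
    sum(rowSums(M[, cols, drop = FALSE]) == r)
  })
  combos[, which.min(all_true_counts)]  # first minimum wins on ties
}

# Toy input: columns 1 and 3 are never TRUE in the same row.
M <- matrix(c(TRUE,  TRUE,  FALSE, FALSE,
              TRUE,  FALSE, TRUE,  FALSE,
              FALSE, FALSE, TRUE,  TRUE), nrow = 4)
best <- find_combo_bruteforce(M, 2)
best
```

When several combinations tie, which.min() returns the first one in combn() order, which may or may not match the tie-breaking of another implementation.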

Creating a factor/categorical variable from 4 dummies

I have a data frame with four columns, let's call them V1-V4 and ten observations. Exactly one of V1-V4 is 1 for each row, and the others of V1-V4 are 0. I want to create a new column called NEWCOL that takes on the value of 3 if V3 is 1, 4 if V4 is 1, and is 0 otherwise.
I have to do this for MANY sets of variables V1-V4 so I would like the solution to be as short as possible so that it will be easy to replicate.
This does it for 4 columns to add a fifth using matrix multiplication:
> cbind( mydf, newcol=data.matrix(mydf) %*% c(0,0,3,4) )
V1 V2 V3 V4 newcol
1 1 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 0 0 1 0 3
6 0 0 1 0 3
7 0 0 0 1 4
8 0 0 0 1 4
9 0 0 0 1 4
10 0 0 0 1 4
It's generalizable to building multiple columns; we just need the rules. You need to make a matrix with the same number of rows as there are columns in the original data, with one column of multipliers for each new variable. This shows how to build one new column from 3 times the third column plus 4 times the fourth, and another new column from 1 times the first plus 2 times the second.
> cbind( mydf, newcol=data.matrix(mydf) %*% matrix(c(0,0,3,4, # first set of factors
1,2,0,0), # second set
ncol=2) )
V1 V2 V3 V4 newcol.1 newcol.2
1 1 0 0 0 0 1
2 1 0 0 0 0 1
3 0 1 0 0 0 2
4 0 1 0 0 0 2
5 0 0 1 0 3 0
6 0 0 1 0 3 0
7 0 0 0 1 4 0
8 0 0 0 1 4 0
9 0 0 0 1 4 0
10 0 0 0 1 4 0
An example data set:
mydf <- data.frame(V1 = c(1, 1, rep(0, 8)),
V2 = c(0, 0, 1, 1, rep(0, 6)),
V3 = c(rep(0, 4), 1, 1, rep(0, 4)),
V4 = c(rep(0, 6), rep(1, 4)))
# V1 V2 V3 V4
# 1 1 0 0 0
# 2 1 0 0 0
# 3 0 1 0 0
# 4 0 1 0 0
# 5 0 0 1 0
# 6 0 0 1 0
# 7 0 0 0 1
# 8 0 0 0 1
# 9 0 0 0 1
# 10 0 0 0 1
Here's an easy approach to generate the new column:
mydf <- transform(mydf, NEWCOL = V3 * 3 + V4 * 4)
# V1 V2 V3 V4 NEWCOL
# 1 1 0 0 0 0
# 2 1 0 0 0 0
# 3 0 1 0 0 0
# 4 0 1 0 0 0
# 5 0 0 1 0 3
# 6 0 0 1 0 3
# 7 0 0 0 1 4
# 8 0 0 0 1 4
# 9 0 0 0 1 4
# 10 0 0 0 1 4
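Since the question mentions repeating this for many sets of dummies, the weighting can be wrapped in a small function. This is a sketch only: `dummies_to_code` and `example_df` are hypothetical names, and it assumes each set is given as a vector of column names with one code per column.

```r
# Hypothetical helper: collapse one set of 0/1 dummy columns into a single code.
dummies_to_code <- function(df, cols, codes) {
  stopifnot(length(cols) == length(codes))
  # Each row's code is the sum of (dummy value * code) across the set.
  as.vector(data.matrix(df[cols]) %*% codes)
}

example_df <- data.frame(V1 = c(1, 1, rep(0, 8)),
                         V2 = c(0, 0, 1, 1, rep(0, 6)),
                         V3 = c(rep(0, 4), 1, 1, rep(0, 4)),
                         V4 = c(rep(0, 6), rep(1, 4)))

example_df$NEWCOL <- dummies_to_code(example_df,
                                     c("V1", "V2", "V3", "V4"),
                                     c(0, 0, 3, 4))
example_df$NEWCOL
```

For further sets, call the helper again with a different vector of column names and codes.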
