How to extract columns which contain lots of 0 values with R? - r

I have a matrix with lots of columns (more than 817,000) and 40 rows. I would like to extract the columns which contain lots of 0s (for example more than 30 or 35; the exact number does not matter).
That should extract several columns, and I will choose one randomly to use as a reference for the rest of the matrix.
Any idea?
Edit :
OTU0001 OTU0004 OTU0014 OTU0016 OTU0017 OTU0027 OTU0029 OTU0030
Sample_10.rare 0 0 85 0 0 0 0 0
Sample_11.rare 0 42 169 0 42 127 0 85
Sample_12.rare 0 0 0 0 0 0 0 42
Sample_13.rare 762 550 2159 127 550 0 677 1397
Sample_14.rare 847 508 2751 169 1397 169 593 1990
Sample_15.rare 1143 593 3725 677 2116 466 212 2286
Sample_16.rare 5630 5291 5291 1270 3852 1185 296 2836
It should extract 4 columns, OTU0001, OTU0016, OTU0027 and OTU0029, because they have 3 zeros each. If possible, I would also like to get the positions of the extracted columns.

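An option with base R, assuming the data is a data frame df (convert a matrix with as.data.frame() first) and 7 is the zero-count threshold: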
Filter(function(x) sum(x == 0) > 7, df)
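Since the question has a matrix and also asks for the column positions, here is a minimal sketch that works on the matrix directly. It uses the example data from the question; the names m, min_zeros, idx, zero_cols and ref_pos are mine, and the threshold of 3 is only chosen to match the example.

```r
# Example data copied from the question. read.table() takes the first column
# as row names because the header has one field fewer than the data rows.
m <- as.matrix(read.table(header = TRUE, text = "
OTU0001 OTU0004 OTU0014 OTU0016 OTU0017 OTU0027 OTU0029 OTU0030
Sample_10.rare 0 0 85 0 0 0 0 0
Sample_11.rare 0 42 169 0 42 127 0 85
Sample_12.rare 0 0 0 0 0 0 0 42
Sample_13.rare 762 550 2159 127 550 0 677 1397
Sample_14.rare 847 508 2751 169 1397 169 593 1990
Sample_15.rare 1143 593 3725 677 2116 466 212 2286
Sample_16.rare 5630 5291 5291 1270 3852 1185 296 2836
"))

min_zeros   <- 3                          # threshold (assumption, matches the example)
zero_counts <- colSums(m == 0)            # number of zeros in each column
idx         <- which(zero_counts >= min_zeros)  # named vector of column positions
zero_cols   <- m[, idx, drop = FALSE]     # the extracted columns

# Pick one extracted column at random as the reference.
# Indexing through sample.int() avoids the sample(n, 1) pitfall when idx has length 1.
set.seed(42)                              # only for reproducibility
ref_pos <- idx[sample.int(length(idx), 1)]
```

colSums() on the logical matrix m == 0 is vectorised, so it should scale to a very wide matrix better than looping over columns. names(idx) gives the column names and the values of idx give their positions.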

You could do something like this (Where 7 is the number of relevant zeros):
library(dplyr)
df <- tibble(Col1 = c(rep(0, 10), rep(1, 10)),
Col2 = c(rep(0,5), rep(1, 15)),
Col3 = c(rep(0,15), rep(1, 5)))
y <- df %>%
select_if(function(col) length(which(col==0)) > 7)

Related

Reformat output of R table() command for plotting

I wish to plot some count data (likely as a bubble plot). I've some different experiments and for each experiment, I've three replicates. The output from the table() command is given below.
> with(myData.df, table(ChargeGroup,Expt,Repx))
, , Repx = 1
Expt
ChargeGroup Ctrl CV2 Gas n15 n30 n45 n60 p15 p30 v0
<+10 540 512 567 204 642 648 71 2 2 6
+10:+15 219 258 262 156 283 16 0 1 0 7
+15:+20 119 118 14 200 14 0 0 7 0 51
+20:+25 57 38 0 84 1 0 0 31 7 87
+25: 30 16 0 17 0 0 0 24 19 18
, , Repx = 2
Expt
ChargeGroup Ctrl CV2 Gas n15 n30 n45 n60 p15 p30 v0
<+10 529 522 582 201 642 626 77 1 2 5
+10:+15 232 249 264 150 273 14 0 1 0 5
+15:+20 116 113 18 204 13 0 0 12 0 41
+20:+25 53 46 0 82 0 0 0 36 6 94
+25: 28 12 0 26 0 0 0 33 21 28
, , Repx = 3
Expt
ChargeGroup Ctrl CV2 Gas n15 n30 n45 n60 p15 p30 v0
<+10 536 525 591 224 671 641 63 1 2 6
+10:+15 236 238 257 170 276 16 0 2 1 10
+15:+20 113 108 15 212 12 0 0 10 0 47
+20:+25 57 40 0 77 0 0 0 34 3 107
+25: 32 11 0 25 0 0 0 26 15 26
Can anyone help me process the output further so that I can plot it directly in either base graphics or ggplot?
Thanks
There are a couple of methods. With base R, loop over the third dimension and plot each slice with barplot:
par(mfrow = c(3, 1))
apply(with(myData.df, table(ChargeGroup,Expt,Repx)), 3, barplot)
-testing
par(mfrow = c(3, 1))
apply(with(mtcars, table(cyl, vs, gear)), 3, barplot)
Or convert to a single data.frame with as.data.frame and use ggplot, or get the data.frame/tibble output directly with count:
library(dplyr)
library(ggplot2)
myData.df %>%
count(ChargeGroup,Expt,Repx) %>%
ggplot(aes(x=ChargeGroup, y = n, fill = Expt)) +
geom_col() +
facet_wrap(~ Repx)
-testing
mtcars %>%
count(cyl = factor(cyl), vs = factor(vs), gear = factor(gear)) %>%
ggplot(aes(x = cyl, y = n, fill = vs)) +
geom_col() +
facet_wrap(~ gear)
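The as.data.frame route mentioned above is not spelled out, so here is a rough sketch of it on mtcars. The name counts_df is mine, and the plotting part is guarded so it only runs if ggplot2 is installed.

```r
# Flatten the 3-way contingency table into long format:
# one row per (cyl, vs, gear) combination, with the count in column Freq.
counts_df <- as.data.frame(with(mtcars, table(cyl, vs, gear)))

# The factor columns can be mapped straight to aesthetics.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(counts_df, aes(x = cyl, y = Freq, fill = vs)) +
    geom_col() +
    facet_wrap(~ gear)
  # print(p) to draw the plot
}
```

Unlike count(), this keeps the combinations with a count of zero, which may or may not be what you want for a bubble plot.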

Subsetting nested lists based on condition (values) in R

I have a large nested list (a list of named lists); an example of such a list is given below. I would like to create a new list in which only the sub-lists whose "co" vectors contain both 0 and 1 values are preserved, while the 0-only sub-lists are discarded (e.g. the output should contain only the first, third and fourth groups).
I played with lapply and filter according to this thread:
Subset elements in a list based on a logical condition
However, it threw errors. I would appreciate tips on how to handle lists within lists.
# reprex
set.seed(123)
## empty lists
first_group <- list()
second_group <- list()
third_group <- list()
fourth_group <- list()
# dummy_vecs
values1 <- c(sample(120:730, 30, replace=TRUE))
coeff1 <- c(sample(0:1, 30, replace=TRUE))
values2 <- c(sample(50:810, 43, replace=TRUE))
coeff2 <- c(rep(0, 43))
values3 <- c(sample(510:730, 57, replace=TRUE))
coeff3 <- c(rep(0, 8), rep(1, 4), rep(0, 45))
values4 <- c(sample(123:770, 28, replace=TRUE))
coeff4 <- c(sample(0:1, 28, replace=TRUE))
## fill lists with values:
first_group[["val"]] <- values1
first_group[["co"]] <- coeff1
second_group[["val"]] <- values2
second_group[["co"]] <- coeff2
third_group[["val"]] <- values3
third_group[["co"]] <- coeff3
fourth_group[["val"]] <- values4
fourth_group[["co"]] <- coeff4
#concatenate lists:
dummy_list <- list()
dummy_list[["first-group"]] <- first_group
dummy_list[["second-group"]] <- second_group
dummy_list[["third-group"]] <- third_group
dummy_list[["fourth-group"]] <- fourth_group
rm(values1, values2, values3, values4, coeff1, coeff2, coeff3, coeff4, first_group, second_group, third_group, fourth_group)
gc()
#show list
print(dummy_list)
# create boolean for where condition is TRUE
cond <- sapply(dummy_list, function(x) any(0 %in% x$co) & any(1 %in% x$co))
# subset
dummy_list[cond]
You could use Filter from base R:
Filter(function(x) sum(x$co) !=0, dummy_list)
Or you can use purrr:
library(tidyverse)
dummy_list %>%
keep( ~ sum(.$co) != 0)
Output
$`first-group`
$`first-group`$val
[1] 534 582 298 645 314 237 418 348 363 133 493 721 722 210 467 474 145 638 545 330 709 712 674 492 262 663 609 142 428 254
$`first-group`$co
[1] 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 0
$`third-group`
$`third-group`$val
[1] 713 721 683 526 699 555 563 672 619 603 588 533 622 724 616 644 730 716 660 663 611 669 644 664 679 514 579 525 533 541 530 564 584 673 592 726 548 563 727
[40] 646 708 557 586 592 693 620 548 705 510 677 539 603 726 525 597 563 712
$`third-group`$co
[1] 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
$`fourth-group`
$`fourth-group`$val
[1] 142 317 286 174 656 299 676 206 645 755 514 424 719 741 711 552 550 372 551 520 650 503 667 162 644 595 322 247
$`fourth-group`$co
[1] 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1
However, if you also want to exclude any co that have all 1s, then we can add an extra condition.
Filter(function(x) sum(x$co) !=0 & sum(x$co == 0) > 0, dummy_list)
purrr
dummy_list %>%
keep( ~ sum(.$co) != 0 & sum(.$co == 0) > 0)
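The "both 0 and 1" condition can also be written as a set-membership test, which does not rely on sums. A small sketch on a toy list (toy_list, has_both and kept are made-up names, not from the question):

```r
# Toy nested list: one mixed group, one all-zero group, one all-one group.
toy_list <- list(
  mixed = list(val = c(10, 20, 30), co = c(0, 1, 0)),
  zeros = list(val = c(5, 6),       co = c(0, 0)),
  ones  = list(val = c(7, 8),       co = c(1, 1))
)

# TRUE only when the co vector contains at least one 0 and at least one 1.
has_both <- function(x) all(c(0, 1) %in% x$co)

kept <- Filter(has_both, toy_list)
names(kept)
```

Filter() keeps the names of the retained sub-lists, so the result can be indexed the same way as the original list.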

Running Total with subtraction

I have a data set with closing and opening dates of public schools in California. Available here or dput() at the bottom of the question. The data also lists what type of school it is and where it is. I am trying to create a running total column which also takes into account school closings as well as school type.
Here is the solution I've come up with, which basically entails me encoding a lot of different 1's and 0's based on the conditions using ifelse:
# open charter schools
pubschls$open_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# open public schools
pubschls$open_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==TRUE, 1, 0)
# closed charters
pubschls$closed_chart <- ifelse(pubschls$Charter=="Y" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
# closed public schools
pubschls$closed_pub <- ifelse(pubschls$Charter=="N" & is.na(pubschls$ClosedDate)==FALSE, 1, 0)
lausd <- filter(pubschls, NCESDist=="0622710")
# count number open during each year
Then I subtract the columns from each other to get totals.
la_schools_count <- aggregate(lausd[c('open_chart','closed_chart','open_pub','closed_pub')],
by=list(year(lausd$OpenDate)), sum)
# find net charters by subtracting closed from open
la_schools_count$net_chart <- la_schools_count$open_chart - la_schools_count$closed_chart
# find net public schools by subtracting closed from open
la_schools_count$net_pub <- la_schools_count$open_pub - la_schools_count$closed_pub
# add running totals
la_schools_count$cum_chart <- cumsum(la_schools_count$net_chart)
la_schools_count$cum_pub <- cumsum(la_schools_count$net_pub)
# total totals
la_schools_count$total <- la_schools_count$cum_chart + la_schools_count$cum_pub
My output looks like this:
la_schools_count <- select(la_schools_count, "year", "cum_chart", "cum_pub", "pen_rate", "total")
year cum_chart cum_pub pen_rate total
1 1952 1 0 100.00000 1
2 1956 1 1 50.00000 2
3 1969 1 2 33.33333 3
4 1980 55 469 10.49618 524
5 1989 55 470 10.47619 525
6 1990 55 470 10.47619 525
7 1991 55 473 10.41667 528
8 1992 55 476 10.35782 531
9 1993 55 477 10.33835 532
10 1994 56 478 10.48689 534
11 1995 57 478 10.65421 535
12 1996 57 479 10.63433 536
13 1997 58 481 10.76067 539
14 1998 59 480 10.94620 539
15 1999 61 480 11.27542 541
16 2000 61 481 11.25461 542
17 2001 62 482 11.39706 544
18 2002 64 484 11.67883 548
19 2003 73 485 13.08244 558
20 2004 83 496 14.33506 579
21 2005 90 524 14.65798 614
22 2006 96 532 15.28662 628
23 2007 90 534 14.42308 624
24 2008 97 539 15.25157 636
25 2009 108 546 16.51376 654
26 2010 124 566 17.97101 690
27 2011 140 580 19.44444 720
28 2012 144 605 19.22563 749
29 2013 162 609 21.01167 771
30 2014 179 611 22.65823 790
31 2015 195 611 24.19355 806
32 2016 203 614 24.84700 817
33 2017 211 619 25.42169 830
I'm just wondering if this could be done in a better way. Like an apply statement to all rows based on the conditions?
dput:
structure(list(CDSCode = c("19647330100289", "19647330100297",
"19647330100669", "19647330100677", "19647330100743", "19647330100750"
), OpenDate = structure(c(12324, 12297, 12240, 12299, 12634,
12310), class = "Date"), ClosedDate = structure(c(NA, 15176,
NA, NA, NA, NA), class = "Date"), Charter = c("Y", "Y", "Y",
"Y", "Y", "Y")), .Names = c("CDSCode", "OpenDate", "ClosedDate",
"Charter"), row.names = c(NA, -6L), class = c("tbl_df", "tbl",
"data.frame"))
I followed your code and understood everything except pen_rate, which seems to be cum_chart divided by total. I downloaded the original data set and did the following. I called the data set foo. To build the groups, I combined Charter and ClosedDate: I checked whether ClosedDate is NA and converted the logical output to numbers (1 = open, 0 = closed). This is how I created the four groups (i.e., open_chart, closed_chart, open_pub, and closed_pub), which should save you some typing. Since the dates are stored as character, I extracted the year using substr(); if you have a Date object, you need to do something else. Once you have the year, you group the data by it and count how many schools exist for each type using count(). This part is the equivalent of your aggregate() code. Then I converted the output to wide format with spread() and did the rest of the calculation as in your code. The final output looks different from what you have in your question, but my outcome was identical to the one I obtained by running your code. I hope this helps.
library(dplyr)
library(tidyr)
library(readxl)
# Get the necessary data
foo <- read_xls("pubschls.xls") %>%
select(NCESDist, CDSCode, OpenDate, ClosedDate, Charter) %>%
filter(NCESDist == "0622710" & (!Charter %in% NA))
mutate(foo, group = paste(Charter, as.numeric(is.na(ClosedDate)), sep = "_"),
year = substr(OpenDate, start = nchar(OpenDate) - 3, stop = nchar(OpenDate))) %>%
count(year, group) %>%
spread(key = group, value = n, fill = 0) %>%
mutate(net_chart = Y_1 - Y_0,
net_pub = N_1 - N_0,
cum_chart = cumsum(net_chart),
cum_pub = cumsum(net_pub),
total = cum_chart + cum_pub,
pen_rate = cum_chart / total)
# A part of the outcome
# year N_0 N_1 Y_0 Y_1 net_chart net_pub cum_chart cum_pub total pen_rate
#1 1866 0 1 0 0 0 1 0 1 1 0.00000000
#2 1873 0 1 0 0 0 1 0 2 2 0.00000000
#3 1878 0 1 0 0 0 1 0 3 3 0.00000000
#4 1881 0 1 0 0 0 1 0 4 4 0.00000000
#5 1882 0 2 0 0 0 2 0 6 6 0.00000000
#110 2007 0 2 15 9 -6 2 87 393 480 0.18125000
#111 2008 2 8 9 15 6 6 93 399 492 0.18902439
#112 2009 1 9 4 15 11 8 104 407 511 0.20352250
#113 2010 5 26 5 21 16 21 120 428 548 0.21897810
#114 2011 2 16 2 18 16 14 136 442 578 0.23529412
#115 2012 2 27 3 7 4 25 140 467 607 0.23064250
#116 2013 1 5 1 19 18 4 158 471 629 0.25119237
#117 2014 1 3 1 18 17 2 175 473 648 0.27006173
#118 2015 0 0 2 18 16 0 191 473 664 0.28765060
#119 2016 0 3 0 8 8 3 199 476 675 0.29481481
#120 2017 0 5 0 9 9 5 208 481 689 0.30188679
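If OpenDate is already a Date rather than character, the same steps can be sketched in base R with format() to get the year. This is only an illustration on made-up data (the schools data frame and its values are hypothetical), following the same logic as above: schools are counted in the year they opened.

```r
# Hypothetical toy data with the same columns as the question's dput().
schools <- data.frame(
  OpenDate   = as.Date(c("2003-09-01", "2003-09-01", "2004-09-01",
                         "2004-09-01", "2005-09-01")),
  ClosedDate = as.Date(c(NA, NA, NA, "2011-07-01", "2012-06-30")),
  Charter    = c("Y", "Y", "N", "Y", "N")
)

# Charter flag plus open/closed flag (1 = open, 0 = closed), e.g. "Y_1".
schools$group <- paste(schools$Charter,
                       as.integer(is.na(schools$ClosedDate)), sep = "_")
schools$year  <- as.integer(format(schools$OpenDate, "%Y"))

# Fix the levels so all four columns exist even if a group never occurs.
groups <- factor(schools$group, levels = c("N_0", "N_1", "Y_0", "Y_1"))
counts <- as.data.frame.matrix(table(schools$year, groups))
counts$year <- as.integer(rownames(counts))

# Net counts per year, then running totals.
counts$cum_chart <- cumsum(counts$Y_1 - counts$Y_0)
counts$cum_pub   <- cumsum(counts$N_1 - counts$N_0)
counts$total     <- counts$cum_chart + counts$cum_pub
counts$pen_rate  <- counts$cum_chart / counts$total
```

Fixing the factor levels up front plays the role of fill = 0 in spread(): a group with no schools in any year still gets a column of zeros.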

Matching columns from a data frame (csv file) with the columns of another data frame (csv file) and adding new columns

I have two big data frames (csv format). One (df1) has this structure:
chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength
Chr1 176 377 202 202 202
Chr1 472 746 275 275 275
Chr1 1276 1382 107 107 107
Chr1 1581 1761 181 173 4
Chr1 1890 2080 191 93 71
The other (df2) includes the results for 5'target_id_start, 5'target_id_end and 3'target_id_start, 3'target_id_end together, and it looks like this:
Chr target_id_start target_id_end tot_counts uniq_counts est_counts
1 Chr1 10000016 10000066 0 0 0
2 Chr1 10000062 10000112 0 0 0
3 Chr1 10000171 10000221 0 0 0
4 Chr1 10000347 10000397 0 0 0
5 Chr1 1000041 1000091 0 0 0
What I'm trying to do is check whether target_id_start and target_id_end fall within (or are equal to) fragStart and fragEnd. If so, I want to write the columns tot_counts, uniq_counts and est_counts into the first file, df1. This should be done for both 5'target_id_start/5'target_id_end and 3'target_id_start/3'target_id_end, and the result should look like this:
chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength tot_counts5' uniq_counts5' est_counts5' tot_counts3' uniq_counts3' est_counts3'
Chr1 176 377 202 202 202 0 0 0 0 0 0
Chr1 472 746 275 275 275 0 0 0 0 0 0
Chr1 1276 1382 107 107 107 0 0 0 0 0 0
Chr1 1581 1761 181 173 4 0 0 0 0 0 0
Chr1 1890 2080 191 93 71 0 0 0 0 0 0
Do you know a good way to do this in R? Thank you very much.
Even though I really hate loops, the best I can offer is:
a <- data.frame(x = c(1,10,100), y = c(2, 20, 200))
b <- data.frame(x = c(1.5, 30, 90, 150), y = c(1.6, 50, 101, 170), z = c("a","b","c", "d"))
a$z <- NA
for(i in 1:length(a$x)){
temp <- which((b$x >= a$x[i] & b$x <= a$y[i]) | (b$y >= a$x[i] & b$y <= a$y[i]))
a$z[i] <- ifelse(length(temp) > 0, temp, NA)
}
In this example, the loop writes into a$z the row index of data frame b whose interval overlaps the interval in a (only the first match is kept if there are several). From there you can write a second loop that takes these row indices and copies the corresponding values into another column.
This might give you some ideas, but it is not efficient on large data sets. I hope it inspires a proper solution rather than a workaround such as mine.
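For comparison, a loop-free sketch in base R: join every target to every fragment on the same chromosome with merge(), keep the pairs where the target lies inside the fragment, then merge the counts back. The frags and targets data frames below are made up, using the column names from the question, and only tot_counts is carried over to keep it short.

```r
# Hypothetical toy versions of df1 (fragments) and df2 (targets).
frags <- data.frame(chromName = "Chr1",
                    fragStart = c(176, 472, 1276),
                    fragEnd   = c(377, 746, 1382))
targets <- data.frame(Chr = "Chr1",
                      target_id_start = c(180, 500, 2000),
                      target_id_end   = c(230, 550, 2050),
                      tot_counts      = c(3, 0, 7))

# All fragment/target pairs that share a chromosome.
pairs <- merge(frags, targets, by.x = "chromName", by.y = "Chr")

# Keep pairs where the target lies within the fragment (bounds inclusive).
inside <- pairs[pairs$target_id_start >= pairs$fragStart &
                pairs$target_id_end   <= pairs$fragEnd, ]

# Attach the counts to the fragments; fragments without a target get NA.
result <- merge(frags,
                inside[, c("chromName", "fragStart", "fragEnd", "tot_counts")],
                all.x = TRUE)
```

The intermediate pairs table has one row per fragment/target combination on each chromosome, so memory use grows quickly on big inputs, and a fragment containing several targets appears once per target in result.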

How to remove rows with 0 values using R

Hi, I am using a matrix of gene expression frag counts to calculate differentially expressed genes. I would like to know how to remove the rows whose values are all 0. That would make my data set more compact and give fewer spurious results in the downstream analysis I do with this matrix.
Input
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
Desired output
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000007 0 0 0 0 1 3
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
For now I only want to remove the rows where all the frag count columns are 0. If a row has some values that are 0 and others that are non-zero, I would like to keep that row intact, as you can see in my example above.
Please let me know how to do this.
df[apply(df[,-1], 1, function(x) !all(x==0)),]
A lot of options for doing this within the tidyverse have been posted here: How to remove rows where all columns are zero using dplyr pipe
My preferred option is using rowwise() (replace col1, col2, col3 with your count columns):
library(tidyverse)
df <- df %>%
rowwise() %>%
filter(sum(c(col1,col2,col3)) != 0)
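A vectorised base R alternative is rowSums() on the logical matrix of non-zero entries. The sketch below uses the input data from the question; keep and filtered are names I made up.

```r
# Input data copied from the question.
df <- read.table(header = TRUE, text = "
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
")

# Count the non-zero entries per row, ignoring the gene column,
# and keep the rows that have at least one.
keep     <- rowSums(df[, -1] != 0) > 0
filtered <- df[keep, ]
```

Testing for non-zero entries rather than summing the values also stays correct if the data ever contains negative numbers that cancel out.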
