Hi, I am using a matrix of gene expression fragment counts to call differentially expressed genes. I would like to know how to remove the rows in which all values are 0. The data set will then be more compact, and the downstream analysis I run on this matrix will give fewer spurious results.
Input
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
Desired output
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000007 0 0 0 0 1 3
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
As of now I only want to remove those rows where all the frag count columns are 0; if in any row some values are 0 and others are non-zero, I would like to keep that row intact, as you can see in my example above.
Please let me know how to do this.
df[apply(df[,-1], 1, function(x) !all(x==0)),]
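To see that one-liner in action, here is a minimal sketch on a cut-down version of the posted data; `rowSums()` is an equivalent, fully vectorised alternative, since for non-negative counts a zero row sum implies an all-zero row:

```r
# cut-down version of the posted count matrix
df <- data.frame(gene   = c("XLOC_000004", "XLOC_000005", "XLOC_000007"),
                 ZPT.1  = c(143, 0, 0), ZPT.0  = c(30, 0, 0),
                 ZPT.2  = c(67, 0, 0),  ZPT.3  = c(37, 0, 0),
                 PDGT.1 = c(90, 0, 1),  PDGT.0 = c(236, 0, 3))

# keep rows where not every count column is zero
kept <- df[apply(df[, -1], 1, function(x) !all(x == 0)), ]

# equivalent and faster on big matrices: counts are non-negative,
# so a row is all-zero exactly when its sum is zero
kept2 <- df[rowSums(df[, -1]) != 0, ]

kept$gene  # "XLOC_000004" "XLOC_000007"
```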
A lot of options to do this within the tidyverse have been posted here: How to remove rows where all columns are zero using dplyr pipe
My preferred option is using rowwise():
library(tidyverse)
df <- df %>%
  rowwise() %>%
  filter(sum(c(col1, col2, col3)) != 0)
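With dplyr 1.0+, c_across() saves listing every column by hand; a minimal sketch on toy data (the column names here are assumed, not the poster's full set):

```r
library(dplyr)

df <- tibble(gene  = c("g1", "g2"),
             ZPT.1 = c(0, 5), PDGT.1 = c(0, 3))

res <- df %>%
  rowwise() %>%
  filter(sum(c_across(-gene)) != 0) %>%  # drop rows whose counts are all zero
  ungroup()
```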
I have a matrix with lots of columns (more than 817,000) and 40 rows. I would like to extract the columns which contain many zeros (for example > 30 or 35; the exact number doesn't matter).
That should extract several columns, and I will choose one randomly to use as a reference for the rest of the matrix.
Any idea?
Edit :
OTU0001 OTU0004 OTU0014 OTU0016 OTU0017 OTU0027 OTU0029 OTU0030
Sample_10.rare 0 0 85 0 0 0 0 0
Sample_11.rare 0 42 169 0 42 127 0 85
Sample_12.rare 0 0 0 0 0 0 0 42
Sample_13.rare 762 550 2159 127 550 0 677 1397
Sample_14.rare 847 508 2751 169 1397 169 593 1990
Sample_15.rare 1143 593 3725 677 2116 466 212 2286
Sample_16.rare 5630 5291 5291 1270 3852 1185 296 2836
It should extract 4 columns, OTU0001, OTU0016, OTU0027 and OTU0029, because they each contain 3 zeros. And if it is possible, I would like to extract the positions of the extracted columns.
An option with base R
Filter(function(x) sum(x == 0) > 7, df)
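The poster also asked for the positions of the extracted columns; colSums() counts the zeros per column and which() then gives both names and indices. A sketch on a cut-down version of the example data (threshold adjusted to the toy data):

```r
# three columns from the example, four samples
m <- data.frame(OTU0001 = c(0, 0, 0, 762),
                OTU0004 = c(0, 42, 0, 550),
                OTU0014 = c(85, 169, 0, 2159))

nzero <- colSums(m == 0)    # number of zeros per column
pos   <- which(nzero >= 3)  # columns with at least 3 zeros
names(pos)                  # column names: "OTU0001"
unname(pos)                 # column positions: 1
```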
You could do something like this (where 7 is the number of relevant zeros):
library(dplyr)
df <- tibble(Col1 = c(rep(0, 10), rep(1, 10)),
             Col2 = c(rep(0, 5),  rep(1, 15)),
             Col3 = c(rep(0, 15), rep(1, 5)))

y <- df %>%
  select_if(function(col) length(which(col == 0)) > 7)
Sorry for the very specific question, but I have a file as such:
Adj Year man mt wm wmt by bytl gr grtl
3 careless 1802 0 126 0 54 0 13 0 51
4 careless 1803 0 166 0 72 0 1 0 18
5 careless 1804 0 167 0 58 0 2 0 25
6 careless 1805 0 117 0 5 0 5 0 7
7 careless 1806 0 408 0 88 0 15 0 27
8 careless 1807 0 214 0 71 0 9 0 32
...
560 mean 1939 21 5988 8 1961 0 1152 0 1512
561 mean 1940 20 5810 6 1965 1 914 0 1444
562 mean 1941 10 6062 4 2097 5 964 0 1550
563 mean 1942 8 5352 2 1660 2 947 2 1506
564 mean 1943 14 5145 5 1614 1 878 4 1196
565 mean 1944 42 5630 6 1939 1 902 0 1583
566 mean 1945 17 6140 7 2192 4 1004 0 1906
Now I need to look up specific values (e.g. [careless, 1804, man] or [mean, 1944, wmt]).
I have no clue how to do that; one possibility would be to split the data frame and create an array, if I'm correct. But I'd love a simpler solution.
Thank you in advance!
Subsetting on the specific values in the Adj and Year columns and selecting the man column will give you the required output.
df[df$Adj == "careless" & df$Year == 1804, "man"]
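A minimal reproducible sketch of that lookup on a toy frame (values taken from the posted table); subset() is an equivalent that reads a little more cleanly for interactive use:

```r
df <- data.frame(Adj  = c("careless", "mean"),
                 Year = c(1804, 1944),
                 man  = c(0, 42),
                 wmt  = c(58, 1939))

df[df$Adj == "careless" & df$Year == 1804, "man"]  # 0

# same lookup with subset(): no repeated df$ prefixes
subset(df, Adj == "mean" & Year == 1944, select = wmt)
```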
I've got two big data frames (CSV format). One (df1) has this structure:
chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength
Chr1 176 377 202 202 202
Chr1 472 746 275 275 275
Chr1 1276 1382 107 107 107
Chr1 1581 1761 181 173 4
Chr1 1890 2080 191 93 71
The other (df2) holds the results for 5'target_id_start/5'target_id_end and 3'target_id_start/3'target_id_end together, and it looks like this:
Chr target_id_start target_id_end tot_counts uniq_counts est_counts
1 Chr1 10000016 10000066 0 0 0
2 Chr1 10000062 10000112 0 0 0
3 Chr1 10000171 10000221 0 0 0
4 Chr1 10000347 10000397 0 0 0
5 Chr1 1000041 1000091 0 0 0
What I'm trying to do is check whether the columns target_id_start and target_id_end fall between (or equal) the columns fragStart and fragEnd. If this is true, I want to write the columns tot_counts, uniq_counts and est_counts into the first file, df1. This applies to both the 5'target_id_start/5'target_id_end and the 3'target_id_start/3'target_id_end results, so the output should look like this:
chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength tot_counts5' uniq_counts5' est_counts5' tot_counts3' uniq_counts3' est_counts3'
Chr1 176 377 202 202 202 0 0 0 0 0 0
Chr1 472 746 275 275 275 0 0 0 0 0 0
Chr1 1276 1382 107 107 107 0 0 0 0 0 0
Chr1 1581 1761 181 173 4 0 0 0 0 0 0
Chr1 1890 2080 191 93 71 0 0 0 0 0 0
Do you know any good way to do this in R? Thank you very much.
Even though I really hate loops, the best I can offer is:
a <- data.frame(x = c(1, 10, 100), y = c(2, 20, 200))
b <- data.frame(x = c(1.5, 30, 90, 150), y = c(1.6, 50, 101, 170),
                z = c("a", "b", "c", "d"))

a$z <- NA  # initialise the result column
for (i in seq_along(a$x)) {
  temp <- which((b$x >= a$x[i] & b$x <= a$y[i]) | (b$y >= a$x[i] & b$y <= a$y[i]))
  a$z[i] <- ifelse(length(temp) > 0, temp[1], NA)
}
As an example, the loop writes the row index of data frame b wherever an interval in b overlaps an interval in a. You can then write a further loop that takes these row indices and writes the corresponding values into some other column.
This might give you some idea, but it is not efficient on large data sets. I hope it inspires you to a proper solution rather than a workaround such as mine.
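A vectorised base R sketch of the same idea (toy values; column names taken from the question): match each df2 window to the fragment that fully contains it, then copy the summed counts back onto df1. For genuinely large tables, data.table's foverlaps() or Bioconductor's GenomicRanges are the usual tools.

```r
df1 <- data.frame(chromName = "Chr1",
                  fragStart = c(176, 472, 1276),
                  fragEnd   = c(377, 746, 1382))
df2 <- data.frame(Chr             = "Chr1",
                  target_id_start = c(200, 500, 9000),
                  target_id_end   = c(300, 600, 9100),
                  tot_counts      = c(5, 2, 7))

# for each df2 window, the index of the fragment that contains it (NA if none)
hit <- mapply(function(s, e) {
  i <- which(df1$fragStart <= s & df1$fragEnd >= e)
  if (length(i)) i[1] else NA_integer_
}, df2$target_id_start, df2$target_id_end)

# sum the matched counts back onto df1
df1$tot_counts <- vapply(seq_len(nrow(df1)),
                         function(i) sum(df2$tot_counts[which(hit == i)]),
                         numeric(1))
```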
I'm new to R and want to use the mlogit function.
However, after putting my data into a data frame and running
x <- mlogit.data(mlogit, choice="PlacedN", shape="long", alt.var="RaceID")
I get: duplicate 'row.names' are not allowed
I can upload my file if needed. I've spent days trying to get this to work, so any help will be appreciated.
You may want to put "RaceID" into the alt.levels argument instead of alt.var. From the mlogit.data help file:
alt.levels
the name of the alternatives: if null, for a wide data.frame, they are guessed from the variable names and the choice variable (both should be the same), for a long data.frame, they are guessed from the alt.var argument.
Give this a try.
library(mlogit)
m <- read.csv("mlogit.csv")
mlogd <- mlogit.data(m, choice="PlacedN", shape="long", alt.levels="RaceID")
head(mlogd)
# RaceID PlacedN RSP TrA JoA aDS bDS mDS aDH bDH mDH LDH MR eMR
# 1.RaceID 20119552 TRUE 3.00 13 12 0 0 0 0 0 0 0 0 131
# 2.RaceID 20119552 FALSE 4.00 23 26 91 94 94 139 153 145 153 150 150
# 3.RaceID 20119552 FALSE 0.83 15 15 99 127 99 150 153 150 153 159 159
# 4.RaceID 20119552 FALSE 18.00 21 15 0 0 0 0 0 0 0 0 131
# 5.RaceID 20119552 FALSE 16.00 16 12 92 127 92 134 135 134 135 136 136
# 6.RaceID 20119617 TRUE 2.50 12 10 0 0 0 0 0 0 0 0 152
I am trying to aggregate the columns of this data frame by unique column name (date), but I keep getting an error. I have tried merge_all, merge_recurse, and aggregate, but cannot get any of them to work. I have hit an impasse that seems unconquerable with my knowledge set, and I cannot find answers that help anywhere. Is this even possible? The data frame is below:
2014-02-14 2014-02-14 2014-02-14 2014-02-21 2014-06-20 2014-06-20 2014-06-20 2014-09-19 Totals
PutWing 12 -6 0 171 7 -31 0 0 -77
Ten -6 0 0 24 -19 52 0 0 -10
Eighteen -15 0 0 73 0 -70 0 0 100
Thirty 0 0 0 -149 41 64 0 0 -463
FortyTwo 0 0 0 -91 0 121 0 0 426
ATM 44 0 0 -118 -25 -199 0 0 -134
FortyTwoC 0 0 0 -67 14 0 0 0 792
ThirtyC 0 0 0 79 0 0 0 0 -509
EighteenC 61 0 0 -57 0 -32 0 0 20
CallWing 1 0 0 -48 0 0 0 0 -28
Totals 95 -6 0 -183 17 -95 0 0 116
SlopeRisk 0 0 0 26 5 -6 0 0 -26
Assuming your data is in df:
df <- t(df)
# only necessary if you get funny row names from data import (e.g. X2014.02.14.1);
# if your data is as it's shown you can skip this step
rownames(df) <- substr(rownames(df), 1, 11)
df.agg <- aggregate(df, by=list(rownames(df)), sum)
row.names(df.agg) <- df.agg[[1]]
t(df.agg[-1])
Produces:
# Totals X2014.02.14 X2014.02.21 X2014.06.20 X2014.09.19
# PutWing -77 6 171 -24 0
# Ten -10 -6 24 33 0
# Eighteen 100 -15 73 -70 0
# Thirty -463 0 -149 105 0
# FortyTwo 426 0 -91 121 0
# ATM -134 44 -118 -224 0
# FortyTwoC 792 0 -67 14 0
# ThirtyC -509 0 79 0 0
# EighteenC 20 61 -57 -32 0
# CallWing -28 1 -48 0 0
# Totals 116 89 -183 -78 0
# SlopeRisk -26 0 26 -1 0
Basically, you need to transpose your data to use the group/apply functions R offers. After transposing, you could also use plyr, data.table, or dplyr to do the aggregation instead of aggregate as I did, but those are all non-base packages.
The result will need some cleanup (column names, etc.), but I'll leave that up to you.
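An alternative sketch that stays in base R and avoids the double transpose: group the column indices by column name, then rowSums() each group. Toy values taken from the first rows and columns of the posted table:

```r
m <- matrix(c(12, -6, 0,
              -6,  0, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("PutWing", "Ten"),
                            c("2014-02-14", "2014-02-14", "2014-02-21")))

# split the column indices by date, then sum each group of columns
agg <- sapply(split(seq_len(ncol(m)), colnames(m)),
              function(j) rowSums(m[, j, drop = FALSE]))

# agg["PutWing", "2014-02-14"] is 6 (= 12 + -6), matching the aggregate() output
```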