I have a large nested list (a list of named lists); an example is given below. I would like to create a new list in which only sub-lists whose "co" vectors contain both 0 and 1 values are preserved, while all-zero sub-lists are discarded (e.g. the output should contain only the first, third and fourth groups).
I played with lapply and filter following this thread:
Subset elements in a list based on a logical condition
However, it threw errors. I would appreciate tips on how to handle lists within lists.
# reprex
set.seed(123)
## empty lists
first_group <- list()
second_group <- list()
third_group <- list()
fourth_group <- list()
# dummy_vecs
values1 <- c(sample(120:730, 30, replace=TRUE))
coeff1 <- c(sample(0:1, 30, replace=TRUE))
values2 <- c(sample(50:810, 43, replace=TRUE))
coeff2 <- c(rep(0, 43))
values3 <- c(sample(510:730, 57, replace=TRUE))
coeff3 <- c(rep(0, 8), rep(1, 4), rep(0, 45))
values4 <- c(sample(123:770, 28, replace=TRUE))
coeff4 <- c(sample(0:1, 28, replace=TRUE))
## fill lists with values:
first_group[["val"]] <- values1
first_group[["co"]] <- coeff1
second_group[["val"]] <- values2
second_group[["co"]] <- coeff2
third_group[["val"]] <- values3
third_group[["co"]] <- coeff3
fourth_group[["val"]] <- values4
fourth_group[["co"]] <- coeff4
#concatenate lists:
dummy_list <- list()
dummy_list[["first-group"]] <- first_group
dummy_list[["second-group"]] <- second_group
dummy_list[["third-group"]] <- third_group
dummy_list[["fourth-group"]] <- fourth_group
rm(values1, values2, values3, values4, coeff1, coeff2, coeff3, coeff4, first_group, second_group, third_group, fourth_group)
gc()
#show list
print(dummy_list)
# logical vector: TRUE where "co" contains both 0 and 1
cond <- sapply(dummy_list, function(x) 0 %in% x$co & 1 %in% x$co)
# subset
dummy_list[cond]
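Since %in% already returns a single TRUE/FALSE here, the any() wrappers add nothing; the same check, sketched on a minimal stand-in list (group names are placeholders, and vapply just pins down the return type):

```r
# minimal stand-in (placeholder names) for the reprex's structure
mini_list <- list(
  a = list(val = 1:3, co = c(0, 1, 0)),  # mixed 0/1 -> keep
  b = list(val = 1:3, co = c(0, 0, 0)),  # all zero  -> drop
  c = list(val = 1:3, co = c(1, 1, 1))   # all one   -> drop
)
# TRUE only where "co" contains both 0 and 1
cond2 <- vapply(mini_list, function(x) 0 %in% x$co && 1 %in% x$co, logical(1))
kept <- names(mini_list[cond2])
kept
```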
You could use Filter from base R:
Filter(function(x) sum(x$co) != 0, dummy_list)
Or you can use purrr:
library(tidyverse)
dummy_list %>%
  keep(~ sum(.$co) != 0)
Output
$`first-group`
$`first-group`$val
[1] 534 582 298 645 314 237 418 348 363 133 493 721 722 210 467 474 145 638 545 330 709 712 674 492 262 663 609 142 428 254
$`first-group`$co
[1] 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 0
$`third-group`
$`third-group`$val
[1] 713 721 683 526 699 555 563 672 619 603 588 533 622 724 616 644 730 716 660 663 611 669 644 664 679 514 579 525 533 541 530 564 584 673 592 726 548 563 727
[40] 646 708 557 586 592 693 620 548 705 510 677 539 603 726 525 597 563 712
$`third-group`$co
[1] 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
$`fourth-group`
$`fourth-group`$val
[1] 142 317 286 174 656 299 676 206 645 755 514 424 719 741 711 552 550 372 551 520 650 503 667 162 644 595 322 247
$`fourth-group`$co
[1] 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 1 1
However, if you also want to exclude any co that is all 1s, we can add an extra condition.
Filter(function(x) sum(x$co) != 0 & sum(x$co == 0) > 0, dummy_list)
purrr
dummy_list %>%
  keep(~ sum(.$co) != 0 & sum(.$co == 0) > 0)
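A quick way to see what the extra condition buys you (a sketch with a hypothetical toy list; only the co element matters for the filter):

```r
toy <- list(
  mixed    = list(co = c(0, 1, 1)),
  all_zero = list(co = c(0, 0, 0)),
  all_one  = list(co = c(1, 1, 1))
)
# sum(co) != 0 alone also keeps the all-1 group:
k1 <- names(Filter(function(x) sum(x$co) != 0, toy))
# adding sum(co == 0) > 0 drops it:
k2 <- names(Filter(function(x) sum(x$co) != 0 & sum(x$co == 0) > 0, toy))
k1
k2
```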
Related
I have a data frame that looks like this
Frame RightEye_x RightEye_y RightEye_z LeftEye_x LeftEye_y LeftEye_z
0 773 490 0 778 322 0
1 780 490 0 789 334 0
2 781 490 0 792 334 0
3 783 337 0 797 334 1
And I would like to transform it into
BodyPart Frame x y z
RightEye 0 773 490 0
RightEye 1 780 490 0
RightEye 2 781 490 0
RightEye 3 783 337 0
LeftEye 0 778 322 0
LeftEye 1 789 334 0
LeftEye 2 792 334 0
LeftEye 3 797 334 1
Using the melt(...) method in data.table:
library(data.table)
setDT(df)
result <- melt(df, measure.vars = patterns(c('_x', '_y', '_z')), value.name = c('x', 'y', 'z'))
result[, variable := c('RightEye', 'LeftEye')[variable]]
result
## Frame variable x y z
## 1: 0 RightEye 773 490 0
## 2: 1 RightEye 780 490 0
## 3: 2 RightEye 781 490 0
## 4: 3 RightEye 783 337 0
## 5: 0 LeftEye 778 322 0
## 6: 1 LeftEye 789 334 0
## 7: 2 LeftEye 792 334 0
## 8: 3 LeftEye 797 334 1
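Not part of the answer above, but a tidyr equivalent may be worth sketching: pivot_longer with a ".value" sentinel in names_to splits each "BodyPart_axis" column name into a BodyPart label and separate x/y/z value columns (df rebuilt here from the question's table so the block is self-contained):

```r
library(tidyr)

df <- data.frame(
  Frame = 0:3,
  RightEye_x = c(773, 780, 781, 783),
  RightEye_y = c(490, 490, 490, 337),
  RightEye_z = c(0, 0, 0, 0),
  LeftEye_x  = c(778, 789, 792, 797),
  LeftEye_y  = c(322, 334, 334, 334),
  LeftEye_z  = c(0, 0, 0, 1)
)

# "RightEye_x" -> BodyPart = "RightEye", value column "x", etc.
res <- pivot_longer(df, -Frame,
                    names_to = c("BodyPart", ".value"),
                    names_sep = "_")
res
```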
We can use base R reshape like below
reshape(
setNames(df, gsub("(.*)_(.*)", "\\2_\\1", names(df))),
direction = "long",
idvar = "Frame",
varying = -1,
timevar = "BodyPart",
sep = "_"
)
which gives
Frame BodyPart x y z
0.RightEye 0 RightEye 773 490 0
1.RightEye 1 RightEye 780 490 0
2.RightEye 2 RightEye 781 490 0
3.RightEye 3 RightEye 783 337 0
0.LeftEye 0 LeftEye 778 322 0
1.LeftEye 1 LeftEye 789 334 0
2.LeftEye 2 LeftEye 792 334 0
3.LeftEye 3 LeftEye 797 334 1
I have a matrix with many columns (more than 817,000) and 40 rows. I would like to extract the columns which contain many 0s (for example more than 30 or 35; the exact threshold doesn't matter).
That should extract several columns, and I will choose one of them randomly to use as a reference for the rest of the matrix.
Any idea?
Edit :
OTU0001 OTU0004 OTU0014 OTU0016 OTU0017 OTU0027 OTU0029 OTU0030
Sample_10.rare 0 0 85 0 0 0 0 0
Sample_11.rare 0 42 169 0 42 127 0 85
Sample_12.rare 0 0 0 0 0 0 0 42
Sample_13.rare 762 550 2159 127 550 0 677 1397
Sample_14.rare 847 508 2751 169 1397 169 593 1990
Sample_15.rare 1143 593 3725 677 2116 466 212 2286
Sample_16.rare 5630 5291 5291 1270 3852 1185 296 2836
It should extract 4 columns, OTU0001 OTU0016 OTU0027 OTU0029, because they have 3 zeros each. And if it is possible, I would also like to extract the positions of the extracted columns.
An option with base R
Filter(function(x) sum(x == 0) > 7, df)
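Since the question also asks for the positions of the extracted columns, colSums on the logical matrix gives both the columns and their indices in one pass. A sketch on a subset of the example columns (the threshold of > 2 zeros fits the 7-row excerpt, not the full 40-row matrix):

```r
m <- data.frame(
  OTU0001 = c(0, 0, 0, 762, 847, 1143, 5630),
  OTU0016 = c(0, 0, 0, 127, 169, 677, 1270),
  OTU0029 = c(0, 0, 0, 677, 593, 212, 296),
  OTU0030 = c(0, 85, 42, 1397, 1990, 2286, 2836)
)
# count zeros per column, then keep columns above the threshold
idx <- which(colSums(m == 0) > 2)  # named vector: positions of matching columns
idx
m[, idx, drop = FALSE]             # the columns themselves
```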
You could do something like this (where 7 is the number of relevant zeros):
library(dplyr)
df <- tibble(Col1 = c(rep(0, 10), rep(1, 10)),
             Col2 = c(rep(0, 5), rep(1, 15)),
             Col3 = c(rep(0, 15), rep(1, 5)))
y <- df %>%
  select_if(function(col) length(which(col == 0)) > 7)
I have two big data frames (CSV format). One (df1) has this structure
chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength
Chr1 176 377 202 202 202
Chr1 472 746 275 275 275
Chr1 1276 1382 107 107 107
Chr1 1581 1761 181 173 4
Chr1 1890 2080 191 93 71
The other (df2) includes the results for the 5' target_id_start/target_id_end and the 3' target_id_start/target_id_end together, and it looks like this
Chr target_id_start target_id_end tot_counts uniq_counts est_counts
1 Chr1 10000016 10000066 0 0 0
2 Chr1 10000062 10000112 0 0 0
3 Chr1 10000171 10000221 0 0 0
4 Chr1 10000347 10000397 0 0 0
5 Chr1 1000041 1000091 0 0 0
What I'm trying to do is check whether the columns target_id_start and target_id_end fall between (or are equal to) the columns fragStart and fragEnd. If so, I want to write the columns tot_counts, uniq_counts and est_counts into the first file, df1. This applies to both the 5' and the 3' target_id_start/target_id_end, and the result should look like this
chromName fragStart fragEnd fragLength leftFragEndLength rightFragEndLength tot_counts5' uniq_counts5' est_counts5' tot_counts3' uniq_counts3' est_counts3'
Chr1 176 377 202 202 202 0 0 0 0 0 0
Chr1 472 746 275 275 275 0 0 0 0 0 0
Chr1 1276 1382 107 107 107 0 0 0 0 0 0
Chr1 1581 1761 181 173 4 0 0 0 0 0 0
Chr1 1890 2080 191 93 71 0 0 0 0 0 0
Do you know any good way to do this in R? Thank you very much.
Even though I really hate loops, the best I can offer is:
a <- data.frame(x = c(1,10,100), y = c(2, 20, 200))
b <- data.frame(x = c(1.5, 30, 90, 150), y = c(1.6, 50, 101, 170), z = c("a","b","c", "d"))
a$z <- NA  # note: assignment (<-), not comparison (<=)
for (i in seq_along(a$x)) {
  # rows of b whose start or end falls inside the interval [a$x[i], a$y[i]]
  temp <- which((b$x >= a$x[i] & b$x <= a$y[i]) | (b$y >= a$x[i] & b$y <= a$y[i]))
  a$z[i] <- ifelse(length(temp) > 0, temp[1], NA)
}
As an example, the loop writes the row index of data frame b whose interval overlaps the interval in a. Further on, you can write a loop that takes these row indices and writes the corresponding values to some other column.
This might give you some idea, but it is not efficient on large data sets. I hope it inspires you toward a proper solution rather than a workaround such as mine.
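Not part of this answer, but for large data an interval join avoids the loop entirely. A sketch with data.table::foverlaps on the same toy intervals (assuming data.table is available; foverlaps requires the lookup table to be keyed on its interval columns):

```r
library(data.table)

a <- data.table(x = c(1, 10, 100), y = c(2, 20, 200))
b <- data.table(x = c(1.5, 30, 90, 150), y = c(1.6, 50, 101, 170),
                z = c("a", "b", "c", "d"))

setkey(a, x, y)  # key the lookup table on (start, end)
# one row per overlapping pair; nomatch = NULL drops rows of b with no overlap
hits <- foverlaps(b, a, type = "any", nomatch = NULL)
hits
```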
I am experimenting with PCA in R. I have the following data:
V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
2454 0 168 290 45 1715 61 551 245 30 91
222 188 94 105 60 3374 615 7 294 0 169
552 0 0 465 0 3040 0 0 771 0 0
2872 0 0 0 0 3380 0 289 0 0 0
2938 0 56 56 0 2039 538 311 113 0 254
2849 0 0 332 0 2548 0 332 0 0 221
3102 0 0 0 0 2690 0 0 0 807 807
3134 0 0 0 0 2897 289 144 144 144 0
558 0 0 0 0 3453 0 0 0 0 0
2893 0 262 175 0 2452 350 1138 262 87 175
552 0 0 351 0 3114 0 0 678 0 0
2874 0 109 54 0 2565 272 1037 109 0 0
1396 0 0 407 0 1730 0 0 305 0 0
2866 0 71 179 0 2403 358 753 35 107 143
449 0 0 0 0 2825 0 0 0 0 0
2888 0 0 523 0 2615 104 627 209 0 0
2537 0 57 0 0 1854 0 0 463 0 0
2873 0 0 342 0 3196 0 114 0 0 114
720 0 0 365 4 2704 0 4 643 4 0
218 125 31 94 219 2479 722 0 219 0 94
to which I apply the following code:
fit <- prcomp(data)
ev <- fit$rotation # pc loadings
In order to run a sanity check, I tried to reconstruct the data matrix while keeping all the components I can keep:
numberComponentsKept <- 10
featureVector <- ev[, 1:numberComponentsKept]
newData <- as.matrix(data) %*% as.matrix(featureVector)
The newData matrix should be the same as the original one, but instead I get a very different result:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
2454 1424.447 867.5986 514.0592 -155.4783720 -574.7425 85.38724 -86.71887 90.872507 4.305168 92.08284
222 3139.681 1020.4150 376.3165 471.8718398 -796.9549 142.14301 -119.86945 32.919950 -31.269467 32.55846
552 2851.544 539.6075 883.3969 -93.3579153 -908.6689 68.34030 -40.97052 -13.856931 23.133566 89.00851
2872 3111.317 1210.0187 433.0382 -144.4065362 -381.2305 -20.08927 -49.03447 9.569258 44.201571 70.13113
2938 1788.334 945.8162 189.6526 308.7703509 -593.5577 124.88484 -109.67276 -115.127348 14.170615 99.19492
2849 2291.839 978.1819 374.7567 -243.6739292 -496.8707 287.01065 -126.22501 -18.747873 54.080763 62.80605
3102 2530.989 814.7548 -510.5978 -410.6295894 -1015.3228 46.85727 -21.20662 14.696831 23.687923 72.37691
3134 2679.430 970.1323 311.8627 124.2884480 -536.4490 -26.23858 83.86768 -17.808390 -28.802387 92.09583
558 3268.599 988.2515 353.6538 -82.9155988 -342.5729 12.96219 -60.94886 18.537087 7.291126 96.14917
2893 1921.761 1664.0084 631.0800 -55.6321469 -864.9628 -28.11045 -104.78931 37.797727 -12.078535 104.88374
552 2927.108 607.6489 799.9602 -79.5494412 -827.6994 14.14625 -50.12209 -14.020936 29.996639 86.72887
2874 2084.285 1636.7999 621.6383 -49.2934502 -577.4815 -67.27198 -11.06071 -7.167577 47.395309 51.02962
1396 1618.171 337.4320 488.2717 -100.1663625 -469.8857 212.37199 -1.19409 13.531485 -23.332701 64.58806
2866 2007.261 1387.6890 395.1586 0.8640971 -636.1243 133.41074 12.34794 -26.969634 5.506828 74.13767
449 2674.136 808.5174 289.3345 -67.8356695 -280.2689 10.60475 -49.86404 15.165731 5.965083 78.66244
2888 2254.171 1162.4988 749.7230 -206.0215007 -652.2364 302.36320 40.76341 -1.079259 17.635956 57.86999
2537 1747.098 371.8884 429.1309 9.3761544 -480.7130 -196.25019 -81.31580 2.819608 24.089379 56.91885
2873 2973.872 974.3854 433.7282 -197.0601947 -478.3647 301.96576 -81.81105 14.516646 -1.191972 100.79057
720 2537.535 504.4124 744.5909 -78.1162036 -771.1396 38.17725 -36.61446 -9.079443 25.488688 78.21597
218 2292.718 800.5257 260.6641 603.3295960 -641.9296 187.38913 11.71382 70.011487 78.047216 96.10967
What did I do wrong?
I think the problem is a PCA problem rather than an R problem. You multiply the original data by the rotation matrix and then wonder why newData != data. That would only be the case if the rotation matrix were the identity matrix.
What you probably were planning to do is the following:
# Run PCA:
fit <- prcomp(USArrests)
ev <- fit$rotation # pc loadings
# Reversed PCA:
head(fit$x %*% t(as.matrix(ev)))
# Centered Original data:
head(t(apply(USArrests,1,'-',colMeans(USArrests))))
In the last step you have to center the data, because the function prcomp centers them by default.
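To see this numerically, here is a quick check (a sketch on the same built-in USArrests data): reconstructing from the scores and loadings recovers the centered data up to floating-point error.

```r
fit <- prcomp(USArrests)                    # centers the data by default
reconstructed <- fit$x %*% t(fit$rotation)  # scores %*% t(loadings)
centered <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)
err <- max(abs(reconstructed - centered))   # should be ~ 0 (floating-point noise)
err
```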
Hi, I am using a matrix of gene expression (frag counts) to calculate differentially expressed genes. I would like to know how to remove the rows whose values are all 0. Then my data set will be more compact and fewer spurious results will come out of the downstream analysis I do using this matrix.
Input
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000005 0 0 0 0 0 0
XLOC_000006 0 0 0 0 0 0
XLOC_000007 0 0 0 0 1 3
XLOC_000008 0 0 0 0 0 0
XLOC_000009 0 0 0 0 0 0
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
Desired output
gene ZPT.1 ZPT.0 ZPT.2 ZPT.3 PDGT.1 PDGT.0
XLOC_000001 3516 626 1277 770 4309 9030
XLOC_000002 342 82 185 72 835 1095
XLOC_000003 2000 361 867 438 454 687
XLOC_000004 143 30 67 37 90 236
XLOC_000007 0 0 0 0 1 3
XLOC_000010 7 1 5 3 0 1
XLOC_000011 63 10 19 15 92 228
As of now I only want to remove rows where all the frag-count columns are 0. If in any row some values are 0 and others are non-zero, I would like to keep that row intact, as you can see in my example above.
Please let me know how to do this.
df[apply(df[, -1], 1, function(x) !all(x == 0)), ]
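An equivalent vectorised form avoids the per-row apply call (a sketch on a cut-down version of the example; it assumes all columns after the first are numeric counts):

```r
df <- data.frame(
  gene   = c("XLOC_000001", "XLOC_000005", "XLOC_000007"),
  ZPT.1  = c(3516, 0, 0),
  PDGT.1 = c(4309, 0, 1)
)
# keep rows with at least one non-zero count among the count columns
res <- df[rowSums(df[, -1] != 0) > 0, ]
res
```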
A lot of options to do this within the tidyverse have been posted here: How to remove rows where all columns are zero using dplyr pipe
My preferred option is using rowwise():
library(tidyverse)
df <- df %>%
  rowwise() %>%
  filter(sum(c(col1, col2, col3)) != 0)
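With newer dplyr (1.0+), c_across() avoids naming each column by hand (a sketch; the column names here are placeholders):

```r
library(dplyr)

df <- tibble(gene = c("a", "b"),
             col1 = c(0, 1), col2 = c(0, 2), col3 = c(0, 0))

# sum across all numeric columns of each row, keep rows with a non-zero total
res <- df %>%
  rowwise() %>%
  filter(sum(c_across(where(is.numeric))) != 0) %>%
  ungroup()
res
```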