Find genes present in at least x of n lists - R

I have a list of 20 lists; each list contains the genes of 20 populations.
For each population, I have to find the genes that are present in at least 15 lists.
I'm using R.
Any help, please?
Example:
BigList$list1$pop1
BigList$list1$pop2
BigList$list1$pop3
BigList$list2$pop1
BigList$list2$pop2
BigList$list2$pop3
BigList$list3$pop1
BigList$list3$pop2
BigList$list3$pop3
My list looks like this:
[[1]]$pop1
[1] "CFC1" "ZNF536" "TRIM67" "AC092431.3" "RP11-572M11.4" "HCG23" "AC006372.4" "RP11-6O2.4" "CACNG3"
[10] "AC129492.6" "POTEC" "RP11-862L9.3" "AC018766.5" "RP11-506O24.1" "RP11-397O8.7" "RP11-54O7.11" "RP11-335O13.7" "RP11-392O17.1"
[19] "AC140481.2" "RP11-284H18.1" "RP11-370B11.3" "SLC17A8" "RP11-474D1.2" "GOLGA8H" "RP11-815J21.3" "CTD-2135D7.2" "RP11-388M20.6"
[28] "CTD-2034I21.2" "KRT31" "USH1G" "CTC-360G5.9" "TBL1Y" "RP11-143E21.6" "SERPINA10" "RP11-303E16.3" "RP11-849F2.5"
[37] "VCAN-AS1" "OPN4" "MS4A2" "LIMS3" "SYNE1-AS1" "RP11-881M11.4" "GCSAML-AS1" "LIMS3L" "FBXW12"
[46] "RP11-364P22.1" "ADAMTS19" "AC005276.1" "RP11-513D5.5" "RP11-68L18.1" "RP11-402G3.3" "PGA3" "PGA4" "RP11-582E3.2"
[55] "LINC00943" "AC073657.1" "RP11-773H22.4" "ANKRD30B" "RP11-103J8.2" "CTA-407F11.8" "ETNPPL" "RP11-1M18.1" "RP11-277P12.10"
[64] "AC105339.1" "DDX4" "CTD-2342N23.3" "RP11-684B21.1" "NDST4" "CCDC60" "U91319.1" "RGR" "AC108868.6"
[73] "RP11-480G7.1"
[[1]]$pop2
[1] "RP11-469N6.1" "GDF5" "NELL1"
[[1]]$pop3
[1] "RP3-398G3.5" "AC010091.1" "RP11-3B12.5" "RP11-78F17.1" "C20ORF135" "CTC-325J23.3" "DBH" "FOXE3" "FOXD4L1"
[10] "AC114730.8" "AC008697.1" "RP3-323N1.2" "RP11-142M10.2" "AC005616.2" "DCDC2B" "RP11-415J8.7" "LINC00326" "IL1RAPL2"
[19] "RP11-167N4.2" "RP11-114H23.1" "RP11-57A19.2" "C17orf98" "XX-CR54.3" "DLX2" "RP11-337N6.1" "RP11-416O18.1" "RP11-25H12.1"
[28] "RP11-269F21.3" "LINC00491" "CTB-43E15.3" "GABRR1" "H2BFWT" "TRPC5OS" "HTR2C" "RP11-642C5.1" "RP11-64P14.7"
[[2]]$pop1
[1] "CNGA3" "ITLN2" "RP11-400N13.1" "RP11-331F9.4" "GPR88" "LINC01037" "RP11-255M2.2" "LA16c-329F2.1" "RP11-154H12.2" "DUXA"
[11] "RP11-36B6.1" "RP11-12A16.3"
[[2]]$pop2
[1] "AC011893.3" "ISM1-AS1" "CA10" "RP11-301L8.2" "RP11-1250I15.3" "GABRG2" "NAMA" "CLEC1B" "RP11-458D21.5"
[10] "RGPD4" "SLITRK3" "RP3-495K2.2" "C11orf87" "RCVRN" "RP5-1112F19.2" "RP3-333A15.1" "RP5-836J3.1" "METTL11B"
[19] "AC112721.1" "RP11-761N21.1" "GRID2" "GML" "CLEC2A" "RP11-834C11.8" "RP11-406H23.2" "RP4-715N11.2" "RHD"
[28] "EYA1" "TAS2R19" "GABRA1" "SLC8A3" "RP3-510H16.3" "GRM7-AS3" "RP11-71H9.1" "PPEF2" "TULP1"
[37] "RP11-704J17.5" "RP11-10C8.2" "RP11-298H24.1" "RP11-263K4.3" "METTL21C" "AC012317.1" "CCDC42" "AC139100.3" "AF015262.2"
[[2]]$pop3
[1] "SYT10" "SPATA13-AS1" "AC064834.2" "CTD-2544H17.2" "AC106786.1" "RP11-25L3.3" "IMPG1" "DDX4" "RP11-50B3.4"
I have to find the intersections, that is, for each population the genes present in at least:
2 lists
3 lists
Thank you.
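One possible approach, sketched under assumptions: BigList is taken to be a list of lists whose inner elements are character vectors named by population, as in the example above. The toy data and the helper name genes_in_at_least are hypothetical and only serve as illustration. The idea is to count, per population, in how many lists each gene occurs and keep the genes whose count reaches the threshold k.

```r
# Toy stand-in for BigList (hypothetical data): 3 lists, 2 populations each.
BigList <- list(
  list1 = list(pop1 = c("A", "B", "C"), pop2 = c("X", "Y")),
  list2 = list(pop1 = c("A", "B"),      pop2 = c("Y", "Z")),
  list3 = list(pop1 = c("A", "D"),      pop2 = c("Y", "X"))
)

# Hypothetical helper: genes of population `pop` found in at least `k` lists.
genes_in_at_least <- function(big, pop, k) {
  # unique() so that a gene is counted at most once per list
  per_list <- lapply(big, function(l) unique(l[[pop]]))
  # table() counts in how many lists each gene appears
  counts <- table(unlist(per_list))
  names(counts)[as.vector(counts) >= k]
}

# Apply the helper to every population name found in the first list.
pops <- names(BigList[[1]])
result <- lapply(setNames(pops, pops),
                 function(p) genes_in_at_least(BigList, p, k = 2))
print(result)
```

With the real data, k = 15 would give the genes present in at least 15 of the 20 lists; this assumes every inner list uses the same population names.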

Related

list files pattern select date

Hello, I have a set of daily meteorological data. Using the expression:
f <- list.files(getwd(), include.dirs=TRUE, recursive=TRUE, pattern= "PREC")
I select only the precipitation files.
I wonder how to select only the files of a given month, for example January: a file named 20170103 (yyyymmdd), or in general yyyy01dd.
The files are named like this: "PREC_20010120.grd".
Try pattern='PREC_\\d{4}01\\d[2].*'.
PREC_ literally
\\d{4} four digits
01 "01" literally
\\d{2} two digits
.* any character repeatedly
Thank you, but I retrieved only 35 items instead of 31 days * 10 years. What's wrong?
[1] "20100102/PREC_20100102.tif" "20100112/PREC_20100112.tif"
[3] "20100122/PREC_20100122.tif" "20110102/PREC_20110102.tif"
[5] "20110112/PREC_20110112.tif" "20110122/PREC_20110122.tif"
[7] "20120102/PREC_20120102.tif" "20120112/PREC_20120112.tif"
[9] "20120122/PREC_20120122.tif" "20130102/PREC_20130102.tif"
[11] "20130112/PREC_20130112.tif" "20130122/PREC_20130122.tif"
[13] "20140102/PREC_20140102.tif" "20140112/PREC_20140112.tif"
[15] "20140122/PREC_20140122.tif" "20150102/PREC_20150102.tif"
[17] "20150112/PREC_20150112.tif" "20150122/PREC_20150122.tif"
[19] "20160102/PREC_20160102.tif" "20160112/PREC_20160112.tif"
[21] "20160122/PREC_20160122.tif" "20170102/PREC_20170102.tif"
[23] "20170112/PREC_20170112.tif" "20170122/PREC_20170122.tif"
[25] "20180102/PREC_20180102.tif" "20180112/PREC_20180112.tif"
[27] "20180122/PREC_20180122.tif" "20190102/PREC_20190102.tif"
[29] "20190112/PREC_20190112.tif" "20190122/PREC_20190122.tif"
[31] "20200102/PREC_20200102.tif" "20200112/PREC_20200112.tif"
[33] "20200122/PREC_20200122.tif" "20210102/PREC_20210102.tif"
[35] "20210112/PREC_20210112.tif" "20210122/PREC_20210122.tif"
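Resolved with the pattern below. The suggested pattern contained \\d[2], which matches one digit followed by a literal "2" (hence only the days ending in 2), where \\d{2} (two digits) was intended: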
f <- list.files(getwd(), include.dirs=TRUE, recursive=TRUE, pattern='PREC_\\d{4}01.*')
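The difference between the two quantifier spellings can be checked without touching the file system. The sketch below applies grepl to a few made-up file names (hypothetical, modelled on the listing above): \\d[2] requires the day to end in 2, while \\d{2} accepts any two-digit day.

```r
# Hypothetical file names in the same layout as the listing above.
files <- c("20100102/PREC_20100102.tif",   # January, day 02
           "20100115/PREC_20100115.tif",   # January, day 15
           "20100202/PREC_20100202.tif",   # February, day 02
           "20110131/PREC_20110131.tif")   # January, day 31

# \\d[2] = one digit followed by the literal character "2"
typo  <- grepl("PREC_\\d{4}01\\d[2]", files)
# \\d{2} = exactly two digits
fixed <- grepl("PREC_\\d{4}01\\d{2}", files)

print(files[typo])    # only the January file whose day ends in 2
print(files[fixed])   # all three January files
```

The same pattern string can be passed to the pattern argument of list.files, which matches it against the file names.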

R - Operations over corresponding vector items in list

Let's say I have a list of vectors, like so:
[[1]]
[1] -0.36603596 -0.41461025 -0.68573296 -0.55516173 0.05071238 0.47723472 0.10851948
[8] 0.67005116 0.25519780 -0.79428716 0.16506077 0.81905548 0.22808934 -0.39257712
[15] 0.44778539 -0.36149934 -0.90142102 -0.99826169 0.24544167 -0.18989310 -0.67592344
[22] -0.65447808 0.26617179 -0.25020153 0.19562031 0.53520465 -0.47531100 -0.60152887
[29] 0.12012461 -0.68947499 -0.33258301 0.19914520 -0.70396942 0.21574644 -0.67197365
[36] -0.12744723 -0.07113916 0.44497439 0.07592963 -0.29082130 -0.27967624 0.28314801
[43] -0.09840383 -0.55582233 -0.29474315 -0.41717316 0.51017306 -0.31227399 0.39484400
[50] -0.88843530
[[2]]
[1] -0.14763873 -0.69009083 -0.55705599 -0.43779047 0.15626341 -0.00629513 -0.95227841
[8] 0.85645849 -0.40110676 -0.35732008 0.31375323 0.71478975 0.02262899 -0.12802829
[15] 0.58750725 -0.25629463 -0.65609956 -0.83185625 -0.35244759 -0.33287717 -0.99199682
[22] -0.45836093 -0.19431609 -0.41590652 1.06120542 0.20687783 0.13268137 -0.34219985
[29] -0.18096691 -0.24496102 -0.47769117 0.89134577 -0.56128402 0.70825268 0.10426368
[36] -0.13962506 -0.72478276 -0.40178315 0.65943132 -0.82083464 0.22569929 -1.02243310
[43] -0.70983610 -1.36733592 0.68807554 0.09156598 0.76850778 -0.64040433 0.79276407
[50] -0.40297792
[[3]]
[1] 0.34405450 -0.07928067 0.08353835 -0.37919066 -0.47233278 -0.38839824 -0.13269067
[8] 0.17348495 0.42777652 -0.19297300 -0.86438130 0.75787336 -0.34358747 0.47852682
[15] 1.29980892 -0.42527812 -0.25074922 -0.59565850 0.32800193 -0.56109570 -0.72905476
[22] -0.11498356 -0.29827083 -0.21653428 0.78533418 0.64735755 0.31889828 -0.37129803
[29] -0.51252162 0.24192268 -0.29281809 1.03299397 -0.11251429 0.13157698 -0.06404053
[36] 0.01904473 -0.13162565 0.30488937 0.31933970 0.14135025 -0.31501649 0.16738399
[43] -0.19627252 -1.29613018 -0.03572980 -0.72008672 0.13932428 -0.06117093 -0.62665670
[50] -0.12662761
[[4]]
[1] 0.183303468 0.160037845 -0.053473912 0.005199917 -0.126312554 0.116465956 -0.061730281
[8] 0.392903969 -0.008337453 -0.752631038 -0.235599857 0.999534398 0.375208363 0.201100799
[15] 0.444068886 -0.575795949 -0.873388633 -0.863612264 0.076050073 -0.188358603 -0.391865671
[22] -1.726690292 -1.206992567 -0.547175750 0.290255919 1.119834989 0.551360182 -0.510140345
[29] -0.460314706 -0.245835558 -0.315087602 0.947181076 -0.132550448 0.038419545 -0.017929636
[36] 0.041870497 -0.520961791 0.195326850 -0.117783785 -0.427426472 -0.119577158 0.702550914
[43] -0.045789957 -0.794299036 0.181420440 0.407347072 0.571894407 -0.217325835 0.280283391
[50] -0.492866084
[[5]]
[1] -0.40852268 -0.33488615 -0.30609700 -0.67467326 -0.11966383 1.01161858 -0.27108333
[8] 0.92772286 0.39047166 0.29019594 0.24404167 0.07824440 0.32786441 0.21657727
[15] 0.34362648 -0.44996166 -0.27823770 -1.24962127 -0.57241699 -0.30297804 -0.66728157
[22] 0.01783441 0.50773758 -0.31477033 -0.14581338 -0.13827194 -0.25574117 0.40049840
[29] 0.38634920 -0.29027963 -0.03381480 0.48510557 -0.61594522 1.09573928 -0.27992008
[36] -0.41523542 -0.24131548 0.43480320 0.32855110 0.48579320 0.47366867 0.62697303
[43] -0.57792202 -0.81951194 0.21583044 0.15593484 -0.10270703 -0.10206812 -0.25195873
[50] -0.89835763
I want to average corresponding vector items (e.g.: [[1]][1], [[1]][2], [[1]][3], etc.) to result in a single vector of averaged values. For instance, the mean of every first vector item across the list would be -0.07896788. What's the best way to go about this?
Let's say the list is called mylist. Bind the vectors as rows, then take the column means:
mydf <- as.data.frame(do.call("rbind", mylist))
colMeans(mydf)
Would that be the desired output?
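A small check of the idea on made-up numbers (the three short vectors are hypothetical, chosen so the result is easy to verify by hand). It also shows an alternative that sums the vectors element-wise with Reduce; both assume all vectors have the same length.

```r
# Hypothetical toy list: three vectors of equal length.
mylist <- list(c(1, 2, 3), c(3, 4, 5), c(5, 6, 7))

# Stack the vectors as rows of a matrix, then average each column.
m   <- do.call(rbind, mylist)
avg <- colMeans(m)

# Alternative: element-wise sum of all vectors, divided by their number.
avg2 <- Reduce(`+`, mylist) / length(mylist)

print(avg)
print(avg2)
```

The conversion to a data frame is optional here, since colMeans also accepts a numeric matrix.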

adaboost model gives a vector of output for one row

I have built a model using Adaboost. When I give one row as input, this is the output I get. I was expecting to get just one number as the prediction.
> predict(Model,testset[1,],type="prob")[,2]
[1] 0.5159268 0.5143351 0.5135043 0.5127763 0.5116162 0.5097892 0.5098299 0.5098701
[9] 0.5083176 0.5088486 0.5073487 0.5082424 0.5078101 0.5073640 0.5053638 0.5066038
[17] 0.5063418 0.5055067 0.5060952 0.5051869 0.5050157 0.5038692 0.5040837 0.5052188
[25] 0.5040825 0.5046496 0.5050795 0.5042205 0.4976465 0.5046798 0.5047607 0.4957011
[33] 0.5048601 0.5039299 0.5032739 0.5042044 0.5044005 0.5044902 0.5037352 0.4981865
[41] 0.5021579 0.5038746 0.5043289 0.5032334 0.5051926 0.5021917 0.5015447 0.5029390
[49] 0.4951465 0.5033675
> predict(Model,testset[2,],type="prob")[,2]
[1] 0.5159268 0.5143351 0.5135043 0.5127763 0.5116162 0.5097892 0.5098299 0.5098701
[9] 0.5083176 0.5088486 0.5073487 0.5082424 0.4921899 0.5073640 0.5053638 0.5066038
[17] 0.5063418 0.5055067 0.5060952 0.5051869 0.5050157 0.5038692 0.5040837 0.5052188
[25] 0.5040825 0.5046496 0.5050795 0.5042205 0.5023535 0.4953202 0.5047607 0.5042989
[33] 0.4951399 0.5039299 0.4967261 0.5042044 0.5044005 0.4955098 0.5037352 0.5018135
[41] 0.5021579 0.5038746 0.5043289 0.5032334 0.4948074 0.5021917 0.4984553 0.5029390
[49] 0.4951465 0.5033675
If I give, say, 5 rows as input, I get 5 predictions, as expected.
> predict(Model,testset[1:5,],type="prob")[,2]
[1] 0.7470780 0.7101257 0.4795726 0.7451049 0.5607364
Why is the first command giving me 50 predictions when I'm giving just one row as input?

How to find unique extra column name between two data.frames?

I have two almost identical data.frames, and I want to find the unique column name that is added to the x.2 object.
> colnames(x.1)
[1] "listPrice" "rent" "floor" "livingArea"
[5] "rooms" "published" "constructionYear" "objectType"
[9] "booliId" "soldDate" "soldPrice" "url"
[13] "additionalArea" "isNewConstruction" "location.namedAreas" "location.address.streetAddress"
[17] "location.address.city" "location.position.latitude" "location.position.longitude" "location.region.municipalityName"
[21] "location.region.countyName" "location.distance.ocean" "source.name" "source.id"
[25] "source.type" "source.url" "areaSize" "priceDiff"
[29] "perc.priceDiff" "sqrmPrice"
> colnames(x.2)
[1] "listPrice" "livingArea" "additionalArea" "plotArea"
[5] "rooms" "published" "constructionYear" "objectType"
[9] "booliId" "soldDate" "soldPrice" "url"
[13] "isNewConstruction" "floor" "rent" "location.namedAreas"
[17] "location.address.streetAddress" "location.address.city" "location.position.latitude" "location.position.longitude"
[21] "location.region.municipalityName" "location.region.countyName" "location.distance.ocean" "source.name"
[25] "source.id" "source.type" "source.url" "areaSize"
[29] "priceDiff" "perc.priceDiff" "sqrmPrice"
You can use setdiff to get the column names that are in 'x.2' and not in 'x.1'
setdiff(colnames(x.2), colnames(x.1))
Alternatively, try:
colnames(x.2)[!colnames(x.2) %in% colnames(x.1)]
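Both answers can be tried on a shortened set of the column names shown in the question (cols1 and cols2 are hypothetical stand-ins for colnames(x.1) and colnames(x.2)). Note that setdiff is not symmetric: the order of its arguments decides which side's extra names are returned.

```r
# Shortened stand-ins for colnames(x.1) and colnames(x.2).
cols1 <- c("listPrice", "rent", "floor", "livingArea")
cols2 <- c("listPrice", "livingArea", "plotArea", "floor", "rent")

# Names present in cols2 but not in cols1.
extra <- setdiff(cols2, cols1)

# Same result via logical indexing with %in%.
extra2 <- cols2[!cols2 %in% cols1]

# Reversed arguments: names present in cols1 but not in cols2 (none here).
missing_in_2 <- setdiff(cols1, cols2)

print(extra)
print(missing_in_2)
```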

R sort list of files numerically

I have a list of files that I need to sort numerically, so that I can import them in order.
My code is:
bed = '/files/coverage_v2'
beds=list.files(path=bed, pattern='ctcf.motif.minus[0-9]+.bed.IGTB950.bed')
for(b in beds){
  print(b)
  # list.files() returns bare file names here, so prepend the directory
  read.table(file.path(bed, b))
}
> [1] "ctcf.motif.minus1.bed.IGTB950.bed" "ctcf.motif.minus10.bed.IGTB950.bed"
[3] "ctcf.motif.minus100.bed.IGTB950.bed" "ctcf.motif.minus101.bed.IGTB950.bed"
[5] "ctcf.motif.minus102.bed.IGTB950.bed" "ctcf.motif.minus103.bed.IGTB950.bed"
[7] "ctcf.motif.minus104.bed.IGTB950.bed" "ctcf.motif.minus105.bed.IGTB950.bed"
[9] "ctcf.motif.minus106.bed.IGTB950.bed" "ctcf.motif.minus107.bed.IGTB950.bed"
[11] "ctcf.motif.minus108.bed.IGTB950.bed" "ctcf.motif.minus109.bed.IGTB950.bed"
[13] "ctcf.motif.minus11.bed.IGTB950.bed" "ctcf.motif.minus110.bed.IGTB950.bed"
[15] "ctcf.motif.minus111.bed.IGTB950.bed" "ctcf.motif.minus112.bed.IGTB950.bed"
[17] "ctcf.motif.minus113.bed.IGTB950.bed" "ctcf.motif.minus114.bed.IGTB950.bed"
[19] "ctcf.motif.minus115.bed.IGTB950.bed" "ctcf.motif.minus116.bed.IGTB950.bed"
[21] "ctcf.motif.minus117.bed.IGTB950.bed" "ctcf.motif.minus118.bed.IGTB950.bed"
[23] "ctcf.motif.minus119.bed.IGTB950.bed" "ctcf.motif.minus12.bed.IGTB950.bed"
[25] "ctcf.motif.minus120.bed.IGTB950.bed" "ctcf.motif.minus121.bed.IGTB950.bed"
[27] "ctcf.motif.minus122.bed.IGTB950.bed" "ctcf.motif.minus123.bed.IGTB950.bed"
[29] "ctcf.motif.minus124.bed.IGTB950.bed" "ctcf.motif.minus125.bed.IGTB950.bed"
[31] "ctcf.motif.minus126.bed.IGTB950.bed" "ctcf.motif.minus127.bed.IGTB950.bed"
[33] "ctcf.motif.minus128.bed.IGTB950.bed" "ctcf.motif.minus129.bed.IGTB950.bed"
[35] "ctcf.motif.minus13.bed.IGTB950.bed" "ctcf.motif.minus130.bed.IGTB950.bed"
[37] "ctcf.motif.minus131.bed.IGTB950.bed" "ctcf.motif.minus132.bed.IGTB950.bed"
[39] "ctcf.motif.minus133.bed.IGTB950.bed" "ctcf.motif.minus134.bed.IGTB950.bed"
But what I really want is for it to be sorted numerically:
> "ctcf.motif.minus1.bed.IGTB950.bed"
"ctcf.motif.minus10.bed.IGTB950.bed"
"ctcf.motif.minus11.bed.IGTB950.bed"
"ctcf.motif.minus12.bed.IGTB950.bed"
"ctcf.motif.minus13.bed.IGTB950.bed"
"ctcf.motif.minus100.bed.IGTB950.bed"
"ctcf.motif.minus101.bed.IGTB950.bed"
etc, so that it will be imported numerically.
Thanks in advance!!
You could try mixedsort from gtools:
library(gtools)
beds1 <- mixedsort(beds)
head(beds1)
#[1] "ctcf.motif.minus1.bed.IGTB950.bed"  "ctcf.motif.minus10.bed.IGTB950.bed"
#[3] "ctcf.motif.minus11.bed.IGTB950.bed" "ctcf.motif.minus12.bed.IGTB950.bed"
#[5] "ctcf.motif.minus13.bed.IGTB950.bed" "ctcf.motif.minus100.bed.IGTB950.bed"
Or using a regex (assuming that the order depends on the number after 'minus' and before 'bed'):
beds[order(as.numeric(gsub('\\D+|\\.bed.*', '', beds)))]
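A base R variant of the regex idea, as a sketch on a few hypothetical file names that follow the naming scheme in the question. Instead of deleting everything that is not the number, it captures the digits between 'minus' and '.bed' with a group, which may be easier to adapt to other naming schemes.

```r
# Hypothetical file names following the scheme from the question.
beds <- c("ctcf.motif.minus1.bed.IGTB950.bed",
          "ctcf.motif.minus10.bed.IGTB950.bed",
          "ctcf.motif.minus100.bed.IGTB950.bed",
          "ctcf.motif.minus11.bed.IGTB950.bed",
          "ctcf.motif.minus2.bed.IGTB950.bed")

# Capture the digits between "minus" and ".bed" and convert them to numbers.
num <- as.numeric(sub(".*minus(\\d+)\\.bed.*", "\\1", beds))

# Reorder the file names by that number.
beds_sorted <- beds[order(num)]
print(beds_sorted)
```

Sorting the names before the loop means the files are then read in numeric order.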

Resources