How to find unique extra column name between two data.frames?

I have two almost identical data.frames, and I want to find the unique column name that is added to the x.2 object.
> colnames(x.1)
[1] "listPrice" "rent" "floor" "livingArea"
[5] "rooms" "published" "constructionYear" "objectType"
[9] "booliId" "soldDate" "soldPrice" "url"
[13] "additionalArea" "isNewConstruction" "location.namedAreas" "location.address.streetAddress"
[17] "location.address.city" "location.position.latitude" "location.position.longitude" "location.region.municipalityName"
[21] "location.region.countyName" "location.distance.ocean" "source.name" "source.id"
[25] "source.type" "source.url" "areaSize" "priceDiff"
[29] "perc.priceDiff" "sqrmPrice"
> colnames(x.2)
[1] "listPrice" "livingArea" "additionalArea" "plotArea"
[5] "rooms" "published" "constructionYear" "objectType"
[9] "booliId" "soldDate" "soldPrice" "url"
[13] "isNewConstruction" "floor" "rent" "location.namedAreas"
[17] "location.address.streetAddress" "location.address.city" "location.position.latitude" "location.position.longitude"
[21] "location.region.municipalityName" "location.region.countyName" "location.distance.ocean" "source.name"
[25] "source.id" "source.type" "source.url" "areaSize"
[29] "priceDiff" "perc.priceDiff" "sqrmPrice"

You can use setdiff to get the column names that are in 'x.2' but not in 'x.1':
setdiff(colnames(x.2), colnames(x.1))

Try
colnames(x.2)[!colnames(x.2) %in% colnames(x.1)]
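Both answers above can be checked on a small example. This is only a sketch: the two data frames below are hypothetical cut-down stand-ins for x.1 and x.2, keeping a few of the column names from the question.

```r
# Hypothetical miniature versions of the two data frames from the question
x.1 <- data.frame(listPrice = 1, rooms = 2, floor = 3)
x.2 <- data.frame(listPrice = 1, plotArea = 4, rooms = 2, floor = 3)

# Set difference: names present in x.2 but absent from x.1
extra_setdiff <- setdiff(colnames(x.2), colnames(x.1))

# Same idea with logical indexing via %in%
extra_in <- colnames(x.2)[!colnames(x.2) %in% colnames(x.1)]

print(extra_setdiff)
print(extra_in)
```

Note that setdiff is not symmetric: swapping the arguments would instead list columns that exist only in x.1.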

Related

Code to rename multiple columns in rStudio

I want to rename these columns in R: I want to remove the X from each of them so that only the figures remain, which represent the years from 1960 to 2020. The first two (Country Name and Country Code) are already sorted out.
[1] "ï..Country.Name" "Country.Code" "X1960" "X1961" "X1962"
[6] "X1963" "X1964" "X1965" "X1966" "X1967"
[11] "X1968" "X1969" "X1970" "X1971" "X1972"
[16] "X1973" "X1974" "X1975" "X1976" "X1977"
[21] "X1978" "X1979" "X1980" "X1981" "X1982"
[26] "X1983" "X1984" "X1985" "X1986" "X1987"
[31] "X1988" "X1989" "X1990" "X1991" "X1992"
[36] "X1993" "X1994" "X1995" "X1996" "X1997"
[41] "X1998" "X1999" "X2000" "X2001" "X2002"
[46] "X2003" "X2004" "X2005" "X2006" "X2007"
[51] "X2008" "X2009" "X2010" "X2011" "X2012"
[56] "X2013" "X2014" "X2015" "X2016" "X2017"
[61] "X2018" "X2019" "X2020"

names(df) <- gsub("^X", "", names(df))
gsub() matches a regular expression and replaces it if found. In our case, the regex says the string must have an X at the beginning.
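A minimal sketch of what the anchor does, assuming a small hypothetical data frame (the values and the name "MaX" are made up for illustration):

```r
# Hypothetical data frame with two of the year columns from the question
df <- data.frame(Country.Name = "Aruba", Country.Code = "ABW",
                 X1960 = 1, X1961 = 2)

# Strip a leading X only; "^" anchors the match to the start of each name
names(df) <- gsub("^X", "", names(df))
print(names(df))

# An X that is not at the start is left alone
anchored <- gsub("^X", "", c("X1960", "MaX"))
print(anchored)
```

The resulting names such as 1960 are not syntactic R names, so they need backticks or string indexing afterwards, e.g. df[["1960"]].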

find a list of x from n lists

I have a list of 20 lists (each list contains the genes of 20 populations).
For each population, I have to find the genes that are present in at least 15 lists.
I'm using R.
Any help, please?
example:
BigList$list1$pop1
BigList$list1$pop2
BigList$list1$pop3
BigList$list2$pop1
BigList$list2$pop2
BigList$list2$pop3
BigList$list3$pop1
BigList$list3$pop2
BigList$list3$pop3
My list is like :
[[1]]$pop1
[1] "CFC1" "ZNF536" "TRIM67" "AC092431.3" "RP11-572M11.4" "HCG23" "AC006372.4" "RP11-6O2.4" "CACNG3"
[10] "AC129492.6" "POTEC" "RP11-862L9.3" "AC018766.5" "RP11-506O24.1" "RP11-397O8.7" "RP11-54O7.11" "RP11-335O13.7" "RP11-392O17.1"
[19] "AC140481.2" "RP11-284H18.1" "RP11-370B11.3" "SLC17A8" "RP11-474D1.2" "GOLGA8H" "RP11-815J21.3" "CTD-2135D7.2" "RP11-388M20.6"
[28] "CTD-2034I21.2" "KRT31" "USH1G" "CTC-360G5.9" "TBL1Y" "RP11-143E21.6" "SERPINA10" "RP11-303E16.3" "RP11-849F2.5"
[37] "VCAN-AS1" "OPN4" "MS4A2" "LIMS3" "SYNE1-AS1" "RP11-881M11.4" "GCSAML-AS1" "LIMS3L" "FBXW12"
[46] "RP11-364P22.1" "ADAMTS19" "AC005276.1" "RP11-513D5.5" "RP11-68L18.1" "RP11-402G3.3" "PGA3" "PGA4" "RP11-582E3.2"
[55] "LINC00943" "AC073657.1" "RP11-773H22.4" "ANKRD30B" "RP11-103J8.2" "CTA-407F11.8" "ETNPPL" "RP11-1M18.1" "RP11-277P12.10"
[64] "AC105339.1" "DDX4" "CTD-2342N23.3" "RP11-684B21.1" "NDST4" "CCDC60" "U91319.1" "RGR" "AC108868.6"
[73] "RP11-480G7.1"
[[1]]$pop2
[1] "RP11-469N6.1" "GDF5" "NELL1"
[[1]]$pop3
[1] "RP3-398G3.5" "AC010091.1" "RP11-3B12.5" "RP11-78F17.1" "C20ORF135" "CTC-325J23.3" "DBH" "FOXE3" "FOXD4L1"
[10] "AC114730.8" "AC008697.1" "RP3-323N1.2" "RP11-142M10.2" "AC005616.2" "DCDC2B" "RP11-415J8.7" "LINC00326" "IL1RAPL2"
[19] "RP11-167N4.2" "RP11-114H23.1" "RP11-57A19.2" "C17orf98" "XX-CR54.3" "DLX2" "RP11-337N6.1" "RP11-416O18.1" "RP11-25H12.1"
[28] "RP11-269F21.3" "LINC00491" "CTB-43E15.3" "GABRR1" "H2BFWT" "TRPC5OS" "HTR2C" "RP11-642C5.1" "RP11-64P14.7"
[[2]]$pop1
[1] "CNGA3" "ITLN2" "RP11-400N13.1" "RP11-331F9.4" "GPR88" "LINC01037" "RP11-255M2.2" "LA16c-329F2.1" "RP11-154H12.2" "DUXA"
[11] "RP11-36B6.1" "RP11-12A16.3"
[[2]]$pop2
[1] "AC011893.3" "ISM1-AS1" "CA10" "RP11-301L8.2" "RP11-1250I15.3" "GABRG2" "NAMA" "CLEC1B" "RP11-458D21.5"
[10] "RGPD4" "SLITRK3" "RP3-495K2.2" "C11orf87" "RCVRN" "RP5-1112F19.2" "RP3-333A15.1" "RP5-836J3.1" "METTL11B"
[19] "AC112721.1" "RP11-761N21.1" "GRID2" "GML" "CLEC2A" "RP11-834C11.8" "RP11-406H23.2" "RP4-715N11.2" "RHD"
[28] "EYA1" "TAS2R19" "GABRA1" "SLC8A3" "RP3-510H16.3" "GRM7-AS3" "RP11-71H9.1" "PPEF2" "TULP1"
[37] "RP11-704J17.5" "RP11-10C8.2" "RP11-298H24.1" "RP11-263K4.3" "METTL21C" "AC012317.1" "CCDC42" "AC139100.3" "AF015262.2"
[[2]]$pop3
[1] "SYT10" "SPATA13-AS1" "AC064834.2" "CTD-2544H17.2" "AC106786.1" "RP11-25L3.3" "IMPG1" "DDX4" "RP11-50B3.4"
I have to find the intersections, i.e. for each population, the genes that appear in at least:
2 lists
3 lists
Thank you.
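One possible approach, sketched under assumptions: count, for a given population, in how many of the outer lists each gene occurs, then keep the genes whose count reaches the threshold. The toy BigList and the helper name genes_in_at_least below are hypothetical; the real data would have 20 lists and the threshold 15.

```r
# Hypothetical toy data standing in for BigList: 3 lists x 2 populations
BigList <- list(
  list1 = list(pop1 = c("A", "B", "C"), pop2 = c("X", "Y")),
  list2 = list(pop1 = c("A", "B"),      pop2 = c("Y", "Z")),
  list3 = list(pop1 = c("A", "D"),      pop2 = c("Y"))
)

# Hypothetical helper: genes of population `pop` found in at least k lists
genes_in_at_least <- function(big, pop, k) {
  # unique() so that a gene repeated inside one list is counted once
  per_list <- lapply(big, function(l) unique(l[[pop]]))
  counts <- table(unlist(per_list))
  names(counts)[as.vector(counts) >= k]
}

pop1_k2 <- genes_in_at_least(BigList, "pop1", 2)
pop2_k3 <- genes_in_at_least(BigList, "pop2", 3)
print(pop1_k2)
print(pop2_k3)
```

Looping the helper over the population names, for example with lapply, would give one result per population.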

R - Operations over corresponding vector items in list

Let's say I have a list of vectors, like so:
[[1]]
[1] -0.36603596 -0.41461025 -0.68573296 -0.55516173 0.05071238 0.47723472 0.10851948
[8] 0.67005116 0.25519780 -0.79428716 0.16506077 0.81905548 0.22808934 -0.39257712
[15] 0.44778539 -0.36149934 -0.90142102 -0.99826169 0.24544167 -0.18989310 -0.67592344
[22] -0.65447808 0.26617179 -0.25020153 0.19562031 0.53520465 -0.47531100 -0.60152887
[29] 0.12012461 -0.68947499 -0.33258301 0.19914520 -0.70396942 0.21574644 -0.67197365
[36] -0.12744723 -0.07113916 0.44497439 0.07592963 -0.29082130 -0.27967624 0.28314801
[43] -0.09840383 -0.55582233 -0.29474315 -0.41717316 0.51017306 -0.31227399 0.39484400
[50] -0.88843530
[[2]]
[1] -0.14763873 -0.69009083 -0.55705599 -0.43779047 0.15626341 -0.00629513 -0.95227841
[8] 0.85645849 -0.40110676 -0.35732008 0.31375323 0.71478975 0.02262899 -0.12802829
[15] 0.58750725 -0.25629463 -0.65609956 -0.83185625 -0.35244759 -0.33287717 -0.99199682
[22] -0.45836093 -0.19431609 -0.41590652 1.06120542 0.20687783 0.13268137 -0.34219985
[29] -0.18096691 -0.24496102 -0.47769117 0.89134577 -0.56128402 0.70825268 0.10426368
[36] -0.13962506 -0.72478276 -0.40178315 0.65943132 -0.82083464 0.22569929 -1.02243310
[43] -0.70983610 -1.36733592 0.68807554 0.09156598 0.76850778 -0.64040433 0.79276407
[50] -0.40297792
[[3]]
[1] 0.34405450 -0.07928067 0.08353835 -0.37919066 -0.47233278 -0.38839824 -0.13269067
[8] 0.17348495 0.42777652 -0.19297300 -0.86438130 0.75787336 -0.34358747 0.47852682
[15] 1.29980892 -0.42527812 -0.25074922 -0.59565850 0.32800193 -0.56109570 -0.72905476
[22] -0.11498356 -0.29827083 -0.21653428 0.78533418 0.64735755 0.31889828 -0.37129803
[29] -0.51252162 0.24192268 -0.29281809 1.03299397 -0.11251429 0.13157698 -0.06404053
[36] 0.01904473 -0.13162565 0.30488937 0.31933970 0.14135025 -0.31501649 0.16738399
[43] -0.19627252 -1.29613018 -0.03572980 -0.72008672 0.13932428 -0.06117093 -0.62665670
[50] -0.12662761
[[4]]
[1] 0.183303468 0.160037845 -0.053473912 0.005199917 -0.126312554 0.116465956 -0.061730281
[8] 0.392903969 -0.008337453 -0.752631038 -0.235599857 0.999534398 0.375208363 0.201100799
[15] 0.444068886 -0.575795949 -0.873388633 -0.863612264 0.076050073 -0.188358603 -0.391865671
[22] -1.726690292 -1.206992567 -0.547175750 0.290255919 1.119834989 0.551360182 -0.510140345
[29] -0.460314706 -0.245835558 -0.315087602 0.947181076 -0.132550448 0.038419545 -0.017929636
[36] 0.041870497 -0.520961791 0.195326850 -0.117783785 -0.427426472 -0.119577158 0.702550914
[43] -0.045789957 -0.794299036 0.181420440 0.407347072 0.571894407 -0.217325835 0.280283391
[50] -0.492866084
[[5]]
[1] -0.40852268 -0.33488615 -0.30609700 -0.67467326 -0.11966383 1.01161858 -0.27108333
[8] 0.92772286 0.39047166 0.29019594 0.24404167 0.07824440 0.32786441 0.21657727
[15] 0.34362648 -0.44996166 -0.27823770 -1.24962127 -0.57241699 -0.30297804 -0.66728157
[22] 0.01783441 0.50773758 -0.31477033 -0.14581338 -0.13827194 -0.25574117 0.40049840
[29] 0.38634920 -0.29027963 -0.03381480 0.48510557 -0.61594522 1.09573928 -0.27992008
[36] -0.41523542 -0.24131548 0.43480320 0.32855110 0.48579320 0.47366867 0.62697303
[43] -0.57792202 -0.81951194 0.21583044 0.15593484 -0.10270703 -0.10206812 -0.25195873
[50] -0.89835763
I want to average corresponding vector items across the list (e.g. [[1]][1], [[2]][1], [[3]][1], etc.) to get a single vector of averaged values. For instance, the mean of the first item of every vector in the list would be -0.07896788. What's the best way to go about this?

Let's say the list is called mylist:
mydf <- as.data.frame(do.call(rbind, mylist))
colMeans(mydf)
Would that be the desired output?
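The idea can be checked on a small list where the means are easy to verify by hand. This is a sketch with made-up numbers, assuming all vectors have the same length; the Reduce variant is an alternative added here, not part of the answer above.

```r
# Hypothetical list of three equal-length vectors
mylist <- list(c(1, 2, 3), c(3, 4, 5), c(5, 6, 10))

# Stack the vectors as rows, then average each column
m <- do.call(rbind, mylist)
means_rbind <- colMeans(m)

# Alternative: add the vectors element-wise and divide by their count
means_reduce <- Reduce(`+`, mylist) / length(mylist)

print(means_rbind)
print(means_reduce)
```

Here both give 3, 4 and 6. Going through a matrix also makes colMeans(m, na.rm = TRUE) available if some entries are missing.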

Remove string from a vector in R

I have a vector that looks like
> inecodes
[1] "01001" "01002" "01049" "01003" "01006" "01037" "01008" "01004" "01009" "01010" "01011"
[12] "01013" "01014" "01016" "01017" "01021" "01022" "01023" "01046" "01056" "01901" "01027"
[23] "01019" "01020" "01028" "01030" "01031" "01032" "01902" "01033" "01036" "01058" "01034"
[34] "01039" "01041" "01042" "01043" "01044" "01047" "01051" "01052" "01053" "01054" "01055"
And I want to remove these "numbers" from this vector:
>pob
[1] "01001-Alegría-Dulantzi" "01002-Amurrio"
[3] "01049-Añana" "01003-Aramaio"
[5] "01006-Armiñón" "01037-Arraia-Maeztu"
[7] "01008-Arratzua-Ubarrundia" "01004-Artziniega"
[9] "01009-Asparrena" "01010-Ayala/Aiara"
[11] "01011-Baños de Ebro/Mañueta" "01013-Barrundia"
[13] "01014-Berantevilla" "01016-Bernedo"
[15] "01017-Campezo/Kanpezu" "01021-Elburgo/Burgelu"
[17] "01022-Elciego" "01023-Elvillar/Bilar"
[19] "01046-Erriberagoitia/Ribera Alta"
The real vectors are longer than these samples, and the two don't have the same length. The result should look like the following:
>pob
[1] "Alegría-Dulantzi" "Amurrio"
[3] "Añana" "Aramaio"
[5] "Armiñón" "Arraia-Maeztu"
[7] "Arratzua-Ubarrundia" "Artziniega"
[9] "Asparrena" "Ayala/Aiara"
[11] "Baños de Ebro/Mañueta" "Barrundia"
[13] "Berantevilla" "Bernedo"
[15] "Campezo/Kanpezu" "Elburgo/Burgelu"
[17] "Elciego" "Elvillar/Bilar"
[19] "Erriberagoitia/Ribera Alta"

Not sure why you need inecodes at all, since you can use sub to remove the leading digits and the hyphen:
sub('^\\d+-', '', pob)
Result:
[1] "Alegría-Dulantzi" "Amurrio" "Añana"
[4] "Aramaio" "Armiñón" "Arraia-Maeztu"
[7] "Arratzua-Ubarrundia" "Artziniega" "Asparrena"
[10] "Ayala/Aiara" "Baños de Ebro/Mañueta" "Barrundia"
[13] "Berantevilla" "Bernedo" "Campezo/Kanpezu"
[16] "Elburgo/Burgelu" "Elciego" "Elvillar/Bilar"
[19] "Erriberagoitia/Ribera Alta"
One reason that you might need inecodes is that you have codes in pob that don't exist in inecodes, but that doesn't seem like the case here. If you insist on using inecodes to remove numbers from pob, you can use str_replace_all from stringr:
library(stringr)
str_replace_all(pob, setNames(rep("", length(inecodes)), paste0(inecodes, "-")))
This gives you the exact same result:
[1] "Alegría-Dulantzi" "Amurrio" "Añana"
[4] "Aramaio" "Armiñón" "Arraia-Maeztu"
[7] "Arratzua-Ubarrundia" "Artziniega" "Asparrena"
[10] "Ayala/Aiara" "Baños de Ebro/Mañueta" "Barrundia"
[13] "Berantevilla" "Bernedo" "Campezo/Kanpezu"
[16] "Elburgo/Burgelu" "Elciego" "Elvillar/Bilar"
[19] "Erriberagoitia/Ribera Alta"
Data:
inecodes = c("01001", "01002", "01049", "01003", "01006", "01037", "01008",
"01004", "01009", "01010", "01011", "01013", "01014", "01016",
"01017", "01021", "01022", "01023", "01046", "01056", "01901",
"01027", "01019", "01020", "01028", "01030", "01031", "01032",
"01902", "01033", "01036", "01058", "01034", "01039", "01041",
"01042", "01043", "01044", "01047", "01051", "01052", "01053",
"01054", "01055")
pob = c("01001-Alegría-Dulantzi", "01002-Amurrio", "01049-Añana", "01003-Aramaio",
"01006-Armiñón", "01037-Arraia-Maeztu", "01008-Arratzua-Ubarrundia",
"01004-Artziniega", "01009-Asparrena", "01010-Ayala/Aiara", "01011-Baños de Ebro/Mañueta",
"01013-Barrundia", "01014-Berantevilla", "01016-Bernedo", "01017-Campezo/Kanpezu",
"01021-Elburgo/Burgelu", "01022-Elciego", "01023-Elvillar/Bilar",
"01046-Erriberagoitia/Ribera Alta")

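library(stringr)
for (code in inecodes) {
  # elements of pob that start with this code followed by a hyphen
  ix <- which(str_detect(pob, paste0("^", code, "-")))
  # keep only the part after the first hyphen for those elements
  pob[ix] <- str_split_fixed(pob[ix], "-", 2)[, 2]
}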

Try this. match should be much faster:
pos <- which(!is.na(match(sub('^([0-9]+)-.*$', '\\1', pob), inecodes)))
pob[pos] <- sub('^[0-9]+-(.*)$', '\\1', pob[pos])
Please do post the timings if you manage to get this working. match usually solves many performance issues with lookups on large data sets. I would like to see whether there are scenarios where the opposite is true.
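A small sketch of the lookup idea: strip the prefix only from entries whose code is actually listed in inecodes. The shortened vectors are cut down from the question, and the "99999-Unknown" entry is a made-up example of a code that is not in inecodes.

```r
# Cut-down inputs; "99999-Unknown" is hypothetical and has no match in inecodes
inecodes <- c("01002", "01003")
pob <- c("01002-Amurrio", "01003-Aramaio", "99999-Unknown")

# Extract the leading code of every element
codes <- sub("^([0-9]+)-.*$", "\\1", pob)

# %in% is built on match(): TRUE where the code is listed in inecodes
known <- codes %in% inecodes

# Remove the "code-" prefix only for the known codes
pob[known] <- sub("^[0-9]+-", "", pob[known])
print(pob)
```

The unmatched entry keeps its prefix, which is the one case where consulting inecodes changes the result compared with a plain sub over the whole vector.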

A bit shorter than sub, str_detect and str_replace is str_remove (anchored here so that only a leading code is removed):
library(stringr)
c("01001-Alegría-Dulantzi", "01002-Amurrio") %>%
  str_remove("^[0-9]+-")
returns
[1] "Alegría-Dulantzi" "Amurrio"

R sort list of files numerically [duplicate]

I have a list of files that I need to sort numerically, so that I can import them in order.
My code is:
bed = '/files/coverage_v2'
beds = list.files(path=bed, pattern='ctcf.motif.minus[0-9]+.bed.IGTB950.bed')
for(b in beds){
  print(b)
  read.table(b)
}
> [1] "ctcf.motif.minus1.bed.IGTB950.bed" "ctcf.motif.minus10.bed.IGTB950.bed"
[3] "ctcf.motif.minus100.bed.IGTB950.bed" "ctcf.motif.minus101.bed.IGTB950.bed"
[5] "ctcf.motif.minus102.bed.IGTB950.bed" "ctcf.motif.minus103.bed.IGTB950.bed"
[7] "ctcf.motif.minus104.bed.IGTB950.bed" "ctcf.motif.minus105.bed.IGTB950.bed"
[9] "ctcf.motif.minus106.bed.IGTB950.bed" "ctcf.motif.minus107.bed.IGTB950.bed"
[11] "ctcf.motif.minus108.bed.IGTB950.bed" "ctcf.motif.minus109.bed.IGTB950.bed"
[13] "ctcf.motif.minus11.bed.IGTB950.bed" "ctcf.motif.minus110.bed.IGTB950.bed"
[15] "ctcf.motif.minus111.bed.IGTB950.bed" "ctcf.motif.minus112.bed.IGTB950.bed"
[17] "ctcf.motif.minus113.bed.IGTB950.bed" "ctcf.motif.minus114.bed.IGTB950.bed"
[19] "ctcf.motif.minus115.bed.IGTB950.bed" "ctcf.motif.minus116.bed.IGTB950.bed"
[21] "ctcf.motif.minus117.bed.IGTB950.bed" "ctcf.motif.minus118.bed.IGTB950.bed"
[23] "ctcf.motif.minus119.bed.IGTB950.bed" "ctcf.motif.minus12.bed.IGTB950.bed"
[25] "ctcf.motif.minus120.bed.IGTB950.bed" "ctcf.motif.minus121.bed.IGTB950.bed"
[27] "ctcf.motif.minus122.bed.IGTB950.bed" "ctcf.motif.minus123.bed.IGTB950.bed"
[29] "ctcf.motif.minus124.bed.IGTB950.bed" "ctcf.motif.minus125.bed.IGTB950.bed"
[31] "ctcf.motif.minus126.bed.IGTB950.bed" "ctcf.motif.minus127.bed.IGTB950.bed"
[33] "ctcf.motif.minus128.bed.IGTB950.bed" "ctcf.motif.minus129.bed.IGTB950.bed"
[35] "ctcf.motif.minus13.bed.IGTB950.bed" "ctcf.motif.minus130.bed.IGTB950.bed"
[37] "ctcf.motif.minus131.bed.IGTB950.bed" "ctcf.motif.minus132.bed.IGTB950.bed"
[39] "ctcf.motif.minus133.bed.IGTB950.bed" "ctcf.motif.minus134.bed.IGTB950.bed"
But what I really want is for it to be sorted numerically:
> "ctcf.motif.minus1.bed.IGTB950.bed"
"ctcf.motif.minus10.bed.IGTB950.bed"
"ctcf.motif.minus11.bed.IGTB950.bed"
"ctcf.motif.minus12.bed.IGTB950.bed"
"ctcf.motif.minus13.bed.IGTB950.bed"
"ctcf.motif.minus100.bed.IGTB950.bed"
"ctcf.motif.minus101.bed.IGTB950.bed"
etc, so that it will be imported numerically.
Thanks in advance!!

You could try mixedsort from gtools:
library(gtools)
beds1 <- mixedsort(beds)
head(beds1)
#[1]"ctcf.motif.minus1.bed.IGTB950.bed" "ctcf.motif.minus10.bed.IGTB950.bed"
#[3]"ctcf.motif.minus11.bed.IGTB950.bed" "ctcf.motif.minus12.bed.IGTB950.bed"
#[5]"ctcf.motif.minus13.bed.IGTB950.bed" "ctcf.motif.minus100.bed.IGTB950.bed"
Or using a regex (assuming that the order depends on the numbers after 'minus' and before 'bed'):
beds[order(as.numeric(gsub('\\D+|\\.bed.*', '', beds)))]
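The regex idea can also be written with a capture group, which makes it explicit which number drives the ordering. This is a base R sketch on a handful of the file names from the question, assuming every name contains "minus" followed by digits and ".bed".

```r
# A few of the file names from the question, in list.files() (alphabetical) order
beds <- c("ctcf.motif.minus1.bed.IGTB950.bed",
          "ctcf.motif.minus10.bed.IGTB950.bed",
          "ctcf.motif.minus100.bed.IGTB950.bed",
          "ctcf.motif.minus101.bed.IGTB950.bed",
          "ctcf.motif.minus11.bed.IGTB950.bed",
          "ctcf.motif.minus12.bed.IGTB950.bed")

# Capture only the digits between "minus" and the first ".bed"
idx <- as.numeric(sub("^.*minus([0-9]+)\\.bed.*$", "\\1", beds))

# Reorder the file names by that number
beds_sorted <- beds[order(idx)]
print(idx)
print(beds_sorted)
```

When reading the files afterwards, file.path(bed, beds_sorted) or list.files(..., full.names = TRUE) would be needed unless the working directory is the folder that holds them.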