I want to retrieve only numeric values - r

vector<-c("0.78953744969927742", "0.46557689748480685", "0.19740881059705201",
"9.7073839462985714E-2", "4.9051709747422199E-2", "0.1167420589551126",
"0.12679434401288708", "0.51370748568563795", "0.1925345466801483",
"0.48287163643195624", "4.211984449707315E-2", "blablablab",
"0.10553766233766231", "7.8187250996015922E-2", "0.20718689788053954",
"1.6450511945392491E-2", "0.51752961082910309", "0.10978571428571428",
"0.42610062893081763", "0.52208333333333334", "0.27569868995633184",
"7.7189939288811793E-2", "0.53982300884955747", "38.25% (blablabla) blablablablablablablablablablablabla","0.22324159021406728")
I have to convert all observations to numeric values. Those consisting only of words should become NA. If an observation starts with a number and is followed by words, only the number should be kept. If the number is followed by a percent sign, the sign should be dropped and only the number kept.

With readr's parse_number:
library(readr)
vec_num <- parse_number(vector)
Warning: 1 parsing failure.
row col expected actual
12 -- a number blablablab
vec_num
[1] 0.78953745 0.46557690 0.19740881 0.09707384 0.04905171 0.11674206
[7] 0.12679434 0.51370749 0.19253455 0.48287164 0.04211984 NA
[13] 0.10553766 0.07818725 0.20718690 0.01645051 0.51752961 0.10978571
[19] 0.42610063 0.52208333 0.27569869 0.07718994 0.53982301 38.25000000
[25] 0.22324159
attr(,"problems")
# A tibble: 1 × 4
row col expected actual
<int> <int> <chr> <chr>
1 12 NA a number blablablab
vec_num[24]
[1] 38.25

Or remove every character that cannot be part of a number:
> as.numeric(gsub("[^0-9\\.\\E\\-]","",vector))
[1] 0.78953745 0.46557690 0.19740881 0.09707384 0.04905171 0.11674206
[7] 0.12679434 0.51370749 0.19253455 0.48287164 0.04211984 NA
[13] 0.10553766 0.07818725 0.20718690 0.01645051 0.51752961 0.10978571
[19] 0.42610063 0.52208333 0.27569869 0.07718994 0.53982301 38.25000000
[25] 0.22324159

You can use
as.numeric(stringr::str_extract(vector, '[\\d+.\\-\\E]+'))

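It depends a bit on exactly what the numbers look like, but this regular expression will do (note the escaped dot, so that it only matches a literal decimal point):
\d+(\.\d+(E[+-]\d+)?)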
Check and refine at regex101.com
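A minimal base R sketch of the regex approach, assuming every value of interest starts with the number. The names x, pattern, m and num are only illustrative, and the pattern also accepts plain integers and a lowercase e, which goes slightly beyond the regex above.

```r
# Illustrative subset of the question's vector
x <- c("0.78953744969927742", "9.7073839462985714E-2",
       "blablablab", "38.25% (blablabla) blablabla")

# Leading digits, optional decimal part, optional exponent
pattern <- "^[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?"

m <- regexpr(pattern, x)             # -1 where nothing matches
num <- rep(NA_real_, length(x))      # strings without a leading number stay NA
num[m != -1] <- as.numeric(regmatches(x, m))
num
```

Because the match is anchored at the start of the string, a stray "E", "." or "-" inside the trailing text cannot leak into the number, which can happen when unwanted characters are simply deleted.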

Related

reorder a 1 dimensional dataframe based on the column order of a larger dataframe (R)

relevant_ods_reordered <- relevant_ods[names(cpm)]
The line above is meant to reorder the columns of the data frame relevant_ods:
Plate1_DMSO_A01 Plate1_DMSO_B01 Plate1_DMSO_C01 Plate1_Lopinavir_D01
OD595 0.431 0.4495 0.4993 0.5785
Plate1_DMSO_E01 Plate1_DMSO_F01 Plate1_DMSO_G01 Plate1_DMSO_H01
OD595 0.5336 0.5133 0.527 0.5413
Plate1_DMSO_C12 Plate1_DMSO_D12 Plate1_Lopinavir_E12 Plate1_DMSO_F12
OD595 0.4137 0.4274 0.5241 0.4264
Plate1_DMSO_G12 Plate1_DMSO_H12
OD595 0.4561 0.4767
so that they match the column order of a much larger data frame, cpm:
[1] "Plate1_DMSO_A01" "Plate1_DMSO_A12"
[3] "Plate1_DMSO_B01" "Plate1_DMSO_B12"
[5] "Plate1_DMSO_C01" "Plate1_DMSO_C12"
[7] "Plate1_DMSO_D12" "Plate1_DMSO_E01"
[9] "Plate1_DMSO_F01" "Plate1_DMSO_F12"
[11] "Plate1_DMSO_G01" "Plate1_DMSO_G12"
[13] "Plate1_DMSO_H01" "Plate1_DMSO_H12"
[15] "Plate1_Lopinavir_D01" "Plate1_Lopinavir_E12"
[17] "Plate1_NS1519_22009_A02" "Plate1_NS1519_22009_A04"
[19] "Plate1_NS1519_22009_A05" "Plate1_NS1519_22009_A06"
[21] "Plate1_NS1519_22009_A07" "Plate1_NS1519_22009_A08"
[23] "Plate1_NS1519_22009_A09" "Plate1_NS1519_22009_A10"
[25] "Plate1_NS1519_22009_A11" "Plate1_NS1519_22009_B02"
[27] "Plate1_NS1519_22009_B03" "Plate1_NS1519_22009_B04"
[29] "Plate1_NS1519_22009_B05" "Plate1_NS1519_22009_B06"
etc.
As expected, this returns an error:
Error in `[.data.frame`(relevant_ods, names(cpm)) :
undefined columns selected
because cpm contains many columns that relevant_ods does not have.
I have also tried:
relevant_ods_reordered <- relevant_ods[names(cpm),]
relevant_ods_reordered <- select(relevant_ods, names(cpm))
relevant_ods_reordered <- match(relevant_ods, names(cpm))
With base R, you need to find the names in common. intersect is good for this and preserves the order of its first argument:
relevant_ods[intersect(names(cpm), names(relevant_ods))]
Or with dplyr, use the select helper any_of:
select(relevant_ods, any_of(names(cpm)))
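To see the base R version in action, here is a small sketch with made-up data. The column names and values are hypothetical stand-ins for names(cpm) and relevant_ods, not the real plate data.

```r
# Hypothetical stand-in for names(cpm): the larger, reference column order
cpm_names <- c("A01", "A12", "B01", "C01", "D01")

# Hypothetical one-row data frame with a subset of those columns, out of order
relevant_ods <- data.frame(C01 = 0.4993, A01 = 0.431, B01 = 0.4495,
                           row.names = "OD595")

# Names present in both, in the order of the first argument
common <- intersect(cpm_names, names(relevant_ods))

reordered <- relevant_ods[common]
names(reordered)
```

Columns that exist only in the larger data frame (A12 and D01 here) are dropped from the index, so the "undefined columns selected" error no longer occurs.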

Substitution of strings results in incorrect names

I'd like to change several strings in a vector. In my case, the all.images object contains:
# Original character vector
all.images <-c("S2B2A_20171003_124_IndianaIIPR00911120170922_BOA_10.tif",
"S2B2A_20181028_124_IndianaIIPR0065820181024_BOA_10.tif",
"S2B2A_20170715_124_SantaMariaCalcasPR0033420170731_BOA_10.tif",
"S2B2A_20180928_124_NSraAparecidaBortolettoPR0042720180912_BOA_10.tif",
"S2A2A_20170610_124_LagoaAmarelaPR0022020170619_BOA_10.tif",
"S2A2A_20160705_124_AguaSumidaPR001320160629_BOA_10.tif",
"S2A2A_20181023_124_SaoPedroGabrielGarciaPR001720181031_BOA_10.tif",
"S2B2A_20180908_124_NSraAparecidaBortolettoPR001920180911_BOA_10.tif",
"S2A2A_20180824_124_NSraAparecidaBortolettoPR0043320180911_BOA_10.tif",
"S2A2A_20170720_124_VoAnaPR001520170802_BOA_10.tif",
"S2B2A_20180322_124_SaoMateusPR0021920180314_BOA_10.tif",
"S2A2A_20181212_124_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180413_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif",
"S2B2A_20170913_124_PerdizesPR0034920170905_BOA_10.tif",
"S2A2A_20170610_124_TresMeninasPR001820170601_BOA_10.tif",
"S2B2A_20180428_081_SantaFeSebastiaoFogacaPR0021020180501_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0022320180427_BOA_10.tif",
"S2A2A_20170809_124_VoAnaPR001620170803_BOA_10.tif",
"S2B2A_20180819_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20181214_081_NSradeFatimaJoaoBatistaPR002320181128_BOA_10.tif",
"S2A2A_20180423_081_SantaFeSebastiaoFogacaPR0033920180427_BOA_10.tif",
"S2A2A_20180814_124_PontalIIPR0012220180801_BOA_10.tif",
"S2B2A_20170715_124_VoAnaPR0015A20170803_BOA_10.tif",
"S2A2A_20160615_124_AguaSumidaPR0011220160627_BOA_10.tif",
"S2A2A_20170720_124_SantaMariaCalcasPR0022820170726_BOA_10.tif",
"S2A2A_20180913_124_SantaMariaCalcasPR001620180829_BOA_10.tif",
"S2B2A_20170804_124_NSraAparecidaBortolettoPR0035720170811_BOA_10.tif",
"S2A2A_20170809_124_SantaFeBaracatPR001920170801_BOA_10.tif",
"S2B2A_20180322_124_NSradeFatimaGlebaAPR001320180403_BOA_10.tif",
"S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif")
My idea is to: 1) remove the S2B2A_ prefix and the _BOA_10.tif suffix; 2) convert the 8 digits after the prefix into a date (e.g. 2017-09-05); 3) move the three digits that follow that date (e.g. 124 or 081) to the end; and 4) split the remaining characters into name, code and date based on the capital letters and the trailing date (e.g. AguaSumidaPR0011220160627 becomes AguaSumida-PR00112-2016-06-27).
But when I try to do:
sub("^\\w+_(\\d+)_(\\d+)_([A-Za-z]+)([A-Z]{2}\\d{3})(\\d)(\\d{4})(\\d{2})(\\d+)_.*",
"\\3_\\4_\\5_\\6-\\7-\\8_\\1_\\2", all.images)
[1] "IndianaII_PR009_1_1120-17-0922_20171003_124"
[2] "IndianaII_PR006_5_8201-81-024_20181028_124"
...
[28] "SantaFeBaracat_PR001_9_2017-08-01_20170809_124"
[29] "NSradeFatimaGlebaA_PR001_3_2018-04-03_20180322_124"
[30] "SantaFeSebastiaoFogaca_PR002_1_9201-80-427_20180508_081"
The dates come out wrong (e.g. 9201-80-427_20180508_081 in [30]). My desired output is:
[1] "IndianaII_PR009111_2017-09-22_2017-10-03_124"
[2] "IndianaII_PR00658_2018-10-24_2018-10-28_124"
...
[28] "SantaFeBaracat_PR0019_2017-08-01_2017-08-09_124"
[29] "NSradeFatimaGlebaA_PR0013_2018-04-03_2018-03-22_124"
[30] "SantaFeSebastiaoFogaca_PR00219_2018-04-27_2018-05-08_081"
Any help would be appreciated.
I think this handles the exceptions mentioned in the comments on the other answer, using a lookahead:
sub("^\\w+_(\\d{4})(\\d{2})(\\d{2})_(\\d+)_([A-Za-z]+)([A-Z]{2}\\w+)(?=\\d{8})+(\\d{4})(\\d{2})(\\d+)_.*",
"\\5_\\6_\\7-\\8-\\9_\\1-\\2-\\3_\\4", all.images, perl = TRUE)
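The attempt in the question fails because ([A-Z]{2}\\d{3})(\\d) assumes the PR code always has the same number of digits, while in the data its length varies (PR0019, PR00219, PR009111, PR0015A). As an alternative sketch, not the answer's exact pattern, the code can be left variable-length and the regex engine made to backtrack until exactly eight digits remain before the next underscore. The names imgs, pattern and out are illustrative, and imgs holds just three of the file names.

```r
# Three file names taken from all.images
imgs <- c("S2B2A_20171003_124_IndianaIIPR00911120170922_BOA_10.tif",
          "S2B2A_20170715_124_VoAnaPR0015A20170803_BOA_10.tif",
          "S2B2A_20180508_081_SantaFeSebastiaoFogacaPR0021920180427_BOA_10.tif")

pattern <- paste0(
  "^[^_]+_",                       # sensor prefix, e.g. S2B2A_
  "(\\d{4})(\\d{2})(\\d{2})_",     # groups 1-3: first date
  "(\\d+)_",                       # group 4: three-digit code (124, 081)
  "([A-Za-z]+)",                   # group 5: name
  "([A-Z]{2}[0-9A-Z]*)",           # group 6: PR code of variable length
  "(\\d{4})(\\d{2})(\\d{2})_.*$"   # groups 7-9: the 8 digits before "_BOA"
)

out <- sub(pattern, "\\5_\\6_\\7-\\8-\\9_\\1-\\2-\\3_\\4", imgs, perl = TRUE)
out
```

This assumes the name is always followed by a two-letter code such as PR and that the second date is always exactly eight digits directly before the underscore.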

Delete/filter rows with a specific value

We conducted an experiment at uni, which we tried out ourselves before giving it to real test persons. The problem now is that our own test data is included in the CSV data file, so I need to delete the first 23 "test persons".
Each of them got a unique code, and I could count how many unique codes exist (as you can see, there are 38). Now I only need the last 15 of them. I tried subset, but I don't really know how to filter for those last 15 subject IDs (VPcount).
unique(d$VPcount)
uniqueN(d$VPcount)
[1] 7.941675e-312 7.941683e-312 7.941686e-312 7.941687e-312 7.941695e-312 7.941697e-312 7.941734e-312
[8] 7.942134e-312 7.942142e-312 7.942146e-312 7.942176e-312 7.942191e-312 7.942194e-312 7.942199e-312
[15] 7.942268e-312 7.942301e-312 7.942580e-312 7.943045e-312 7.944383e-312 7.944386e-312 7.944388e-312
[22] 7.944388e-312 7.944429e-312 7.944471e-312 7.944477e-312 7.944478e-312 7.944494e-312 7.944500e-312
[29] 7.944501e-312 7.944501e-312 7.944503e-312 7.944503e-312 7.944506e-312 7.944506e-312 7.944506e-312
[36] 7.944506e-312 7.944508e-312 7.944511e-312
[1] 38
You can try:
data <- subset(d, VPcount %in% tail(unique(VPcount), 15))
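Here is the same idea on a tiny made-up data frame, keeping the last 2 of 5 IDs instead of the last 15 of 38. The column score and all values are hypothetical.

```r
# Hypothetical data: 5 subjects with 2 rows each
d <- data.frame(
  VPcount = rep(c(11, 12, 13, 14, 15), each = 2),
  score   = 1:10
)

# unique() keeps IDs in order of first appearance; tail() takes the last ones
keep_ids <- tail(unique(d$VPcount), 2)

# Keep only the rows that belong to those IDs
real <- subset(d, VPcount %in% keep_ids)
real
```

Note that this relies on the rows being stored in the order the subjects took part, so that the last IDs to appear really are the real test persons.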

Rename row.name in data frame using matches or partial matches from a list

I have a data frame in R with 341 rows. I want to rename the rows using a list of 349 names. Every one of the 341 row names has a counterpart in this list, but not all of them are exact matches.
The data looks like this
rownames(df_RPM1)
[1] "LQNS02059392.1_11686_5p"
[2] "LQNS02277998.1_30984_3p"
[3] "LQNS02277998.1_30984_5p"
[4] "LQNS02277998.1_30988_3p"
[5] "LQNS02277998.1_30988_5p"
[6] "LQNS02277997.1_30943_3p"
[7] "miR-9|LQNS02278070.1_31740_3p"
[8] "miR-9|LQNS02278094.1_36129_3p"
head(inlist)
[1] "dpu-miR-2-03_LQNS02059392.1_11686_5p" "dpu-miR-10-P2_LQNS02277998.1_30984_3p"
[3] "dpu-miR-10-P2_LQNS02277998.1_30984_5p" "dpu-miR-10-P3_LQNS02277998.1_30988_3p"
[5] "dpu-miR-10-P3_LQNS02277998.1_30988_5p" "miR-9|LQNS02278070.1_31740_3p"
[6] "miR-9|LQNS02278094.1_36129_3p"
The order won't necessarily be the same in the two.
Can anyone suggest how to do this in R?
Thanks a lot
It depends a lot on what a "non-perfect hit" looks like. Assuming the row name is a substring of the real name, str_which() does the job quite well:
library(tidyverse)
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
"dpu-miR-10-P2_LQNS02277998.1_30984_3p",
"dpu-miR-10-P2_LQNS02277998.1_30984_5p",
"dpu-miR-10-P3_LQNS02277998.1_30988_3p",
"dpu-miR-10-P3_LQNS02277998.1_30988_5p",
"miR-9|LQNS02278070.1_31740_3p",
"miR-9|LQNS02278094.1_36129_3p")
str_which(real_names, "LQNS02059392.1_11686_5p")
#> [1] 1
So we can vectorize this (I removed element 6 of the row names, which is not found in the example list):
pos <- map_int(rownames(df_RPM1), ~ str_which(real_names, fixed(.)))
pos
#> [1] 1 2 3 4 5 6 7
And all that's left is to change the row names:
rownames(df_RPM1) <- real_names[pos]
Of course, if a non-perfect hit means something more complicated, you may need to create a regex from the row names or something like that.
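For comparison, a base R sketch of the same substring lookup that returns NA instead of failing when a row name has no match or more than one. The vectors are shortened, and short_names, pos and new_names are only illustrative names.

```r
# Shortened example data taken from the question
short_names <- c("LQNS02059392.1_11686_5p",
                 "LQNS02277997.1_30943_3p",          # has no match below
                 "miR-9|LQNS02278070.1_31740_3p")
real_names <- c("dpu-miR-2-03_LQNS02059392.1_11686_5p",
                "dpu-miR-10-P2_LQNS02277998.1_30984_3p",
                "miR-9|LQNS02278070.1_31740_3p")

# fixed = TRUE treats "." and "|" literally instead of as regex operators
pos <- vapply(short_names, function(nm) {
  hits <- grep(nm, real_names, fixed = TRUE)
  if (length(hits) == 1L) hits else NA_integer_   # NA for zero or several hits
}, integer(1), USE.NAMES = FALSE)

# Keep the old name wherever no unique match was found
new_names <- ifelse(is.na(pos), short_names, real_names[pos])
new_names
```

The result could then be assigned with rownames(df_RPM1) <- new_names, after checking the NA positions by hand.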

R: order a vector of strings with both character and numeric values both alphabetically and numerically

I have a vector of strings that contain both character and numeric values. For example:
a=c("ILLUMINA:420:C2D7UACXX:1:1102:14591:91480",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:3881",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37103",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37356")
I'd like to order the vector so that the characters are sorted alphabetically and the numbers numerically. The structure of the strings is always of the format:
"ILLUMINA:420:C2D7UACXX:1:<number>:<number>:<number>", so actually the order only applies to the last three colon separated numbers.
I did try mixedsort {gtools} but the result was the same as using sort and
sort.int, which is:
> mixedsort(a)
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
Clearly the right order should be:
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
Is there any immediate solution?
EDIT: I completely changed the solution after the OP's clarification.
You can extract the last 3 fields into a data.frame:
dat = read.table(text=sub('.*:1:([0-9]+):([0-9]+):([0-9]+)','\\1|\\2|\\3',a),sep='|')
dat
V1 V2 V3
1 1102 14591 91480
2 1102 14592 3881
3 1102 14592 37103
4 1102 14592 37356
Then you order using 3 columns:
a[with(dat,order(V1,V2,V3))]
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
gtools::mixedsort does work in your case, actually:
> a=c("ILLUMINA:420:C2D7UACXX:1:1102:14591:91480",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:3881",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37103",
"ILLUMINA:420:C2D7UACXX:1:1102:14592:37356")
>
> mixedsort(a)
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480"
[2] "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881"
[3] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[4] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
I am using gtools_3.4.2 and R-3.2.0
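Here's a faster solution (it needs the data.table package):
library(data.table)
fields.list = strsplit(a,split=":")
# one row per string, one numeric column (V1, V2, V3) per field 5 to 7
sort.dt = data.table(t(sapply(fields.list,function(x) as.numeric(c(x[5],x[6],x[7])))))
sorted.a = a[with(sort.dt,order(V1,V2,V3))]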
> sorted.a
[1] "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480" "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881" "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103"
[4] "ILLUMINA:420:C2D7UACXX:1:1102:14592:37356"
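If you would rather not load a package, the same split-and-order idea can be sketched in base R. The input is deliberately shuffled here so the effect is visible, and fields, keys and sorted_a are illustrative names.

```r
# The question's strings, in shuffled order
a <- c("ILLUMINA:420:C2D7UACXX:1:1102:14592:37356",
       "ILLUMINA:420:C2D7UACXX:1:1102:14591:91480",
       "ILLUMINA:420:C2D7UACXX:1:1102:14592:3881",
       "ILLUMINA:420:C2D7UACXX:1:1102:14592:37103")

# Split on ":" and keep fields 5 to 7 as numbers (one row per string)
fields <- strsplit(a, ":", fixed = TRUE)
keys <- t(vapply(fields, function(x) as.numeric(x[5:7]), numeric(3)))

# Order numerically by the three fields in turn
sorted_a <- a[order(keys[, 1], keys[, 2], keys[, 3])]
sorted_a
```

This assumes every string has at least seven colon-separated fields and that fields 5 to 7 are always numeric.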

Resources