Conditionally add character to a string - r

I am trying to add 0s into character strings, but only under certain conditions.
I have a vector of file names like such:
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
"res_P1_R19.rds", "res_P2_R2.rds",
"res_P10_R1.rds", "res_P10_R19.rds")
I want to sort(my.fl) so that the file names are ordered by the numbers following the P and R, but as it stands sorting results in this:
"res_P1_R1.rds" "res_P1_R19.rds" "res_P10_R1.rds" "res_P10_R19.rds" "res_P2_R1.rds" "res_P2_R2.rds"
To fix this I need to add a 0 after P and R, but only when the following number is in the range 1-9; if it is greater than 9, I want to leave it alone.
The result should be as follows:
"res_P01_R01.rds" "res_P01_R19.rds" "res_P10_R01.rds" "res_P10_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds"
and if I sort it, it is ordered as expected e.g.:
"res_P01_R01.rds" "res_P01_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds" "res_P10_R01.rds" "res_P10_R19.rds"
I can add 0s based on position, but since the required position changes my solution only works on a subset of the file names. I think this would be a common problem but I haven't managed to find an answer on SO (or anywhere), any help much appreciated.

You should be able to just use mixedsort from the gtools package which removes the need to insert zeroes.
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
"res_P1_R19.rds", "res_P2_R2.rds",
"res_P10_R1.rds", "res_P10_R19.rds")
library(gtools)
mixedsort(my.fl)
[1] "res_P1_R1.rds" "res_P1_R19.rds" "res_P2_R1.rds" "res_P2_R2.rds" "res_P10_R1.rds" "res_P10_R19.rds"
But if you do want to insert the zeroes you could use something like:
sort(gsub("(?<=\\D)(\\d{1})(?=\\D)", "0\\1", my.fl, perl = TRUE))
[1] "res_P01_R01.rds" "res_P01_R19.rds" "res_P02_R01.rds" "res_P02_R02.rds" "res_P10_R01.rds" "res_P10_R19.rds"
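If neither gtools nor a lookaround regex is wanted, the same result can be sketched in base R by pulling out the two numbers and ordering on them. This is a minimal sketch that assumes every file name follows the res_P<number>_R<number>.rds pattern from the question; the variable names p, r, sorted and padded are illustrative only.

```r
my.fl <- c("res_P1_R1.rds", "res_P2_R1.rds",
           "res_P1_R19.rds", "res_P2_R2.rds",
           "res_P10_R1.rds", "res_P10_R19.rds")

# Extract the number after P and the number after R as integers
# (assumes the res_P<number>_R<number>.rds layout shown in the question)
p <- as.integer(sub("^res_P(\\d+)_R\\d+\\.rds$", "\\1", my.fl))
r <- as.integer(sub("^res_P\\d+_R(\\d+)\\.rds$", "\\1", my.fl))

# Order by P first, then by R, without renaming anything
sorted <- my.fl[order(p, r)]

# If zero-padded names are wanted, rebuild them with sprintf
padded <- sprintf("res_P%02d_R%02d.rds", p, r)
```

Ordering on the extracted integers keeps working if the numbers later grow to three digits, whereas a fixed padding width would need to be widened.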

Related

Stringr str_which: first compare the 1st row with the whole column, then move on to the next row

I am trying to match DNA sequences within a column. For each sequence, I want to find the rows that contain it, including longer sequences in the same column that contain it as a substring.
I am trying to use str_which, which I know works, because if I put the search pattern in manually it finds the rows that include the sequence.
As a preview of the data I have:
SNID type seqs2
9584818 seqs TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA
9584818 reversed TTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGA
9562505 seqs GTCTTCAGCATCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAACTTTGTGAAT
9562505 reversed ATTCACAAAGTTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGATGCTGAAGAC
Using a simple search of row one as x
x <- "TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA"
str_which(df$seqs2, x)
I get the answer I expect:
> str_which(df$seqs2, x)
[1] 1 3
But when I pass the whole column as the pattern, each row is only compared with itself, so I just get every row matching itself and not the other rows that also contain the sequence.
> str_which(df$seqs2, df$seqs2)
[1] 1 2 3 4
Since my data set is quite large, I do not want to do this manually, and rather use the column as input, and not just state "x" first.
Does anybody have an idea how to solve this? I have tried most stringr functions by now, but I may have used them incorrectly or skipped some important ones.
Thanks in advance
You may need lapply :
lapply(df$seqs2, function(x) stringr::str_which(df$seqs2, x))
You can also use grep to keep this in base R :
lapply(df$seqs2, function(x) grep(x, df$seqs2))
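As a minimal sketch of the lapply idea on toy data: the short sequences below are made up for illustration, and dropping each row's match with itself is an assumption about what is wanted. fixed = TRUE makes grep treat each sequence as a literal string rather than a regular expression.

```r
# Toy data standing in for df$seqs2 (made-up sequences, not from the question)
df <- data.frame(seqs2 = c("ACGT", "TTACGTAA", "GGCC", "AGGCCT"),
                 stringsAsFactors = FALSE)

# For each row, find the rows whose sequence contains it,
# then drop the row's trivial match with itself
hits <- lapply(seq_along(df$seqs2), function(i) {
  idx <- grep(df$seqs2[i], df$seqs2, fixed = TRUE)
  setdiff(idx, i)
})
```

Here hits[[1]] points at row 2 and hits[[3]] at row 4, while the longer sequences have no matches besides themselves.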

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
You could try str_extract from the stringr package.
If each element of your vector holds one record, you can do:
library(stringr)
str_extract(x, "(?<=\\bsample_id=)[[:digit:]]+") # target a run of digits that is preceded by sample_id=; the + captures all of the digits
This extracts just the number from each element. If several records are collected in a single string, it becomes a tad more difficult, because the extraction has to continue after the first match. The code would look something like this:
str_extract_all(x, "(?<=sample_id=)\\d+")
This extracts every matching number and returns a list with one element per input string. From there you can manipulate the list as you see fit.
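The same lookbehind idea can also be sketched in base R with regexpr and regmatches, which avoids the extra package. This assumes every element contains a sample_id field; regmatches silently drops elements without a match, so the result could be shorter than the input otherwise.

```r
x <- c("sample_id=10221108;gender=male",
       "tissue_id=23;sample_id=321108;gender=male",
       "treatment=no;tissue_id=98;sample_id=22")

# perl = TRUE enables the lookbehind; [^;]+ takes everything up to the next semicolon
m <- regexpr("(?<=sample_id=)[^;]+", x, perl = TRUE)
ids <- regmatches(x, m)
ids
```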

How to separate a text file into columns

This is what my text file looks like:
1241105.41129.97Y317052.03
2282165.61187.63N364051.40
2251175.87190.72Y366447.49
2243125.88150.81N276045.45
328192.89117.68Y295050.51
2211140.81165.77N346053.11
1291125.61160.61Y335048.3
3273127.73148.76Y320048.04
2191132.22156.94N336051.38
3221118.73161.03Y349349.5
2341189.01200.31Y360048.02
1253144.45180.96N305051.51
2251125.19152.75N305052.72
2192137.82172.25N240046.96
3351140.96174.85N394048.09
1233135.08173.36Y265049.82
1201112.59140.75N380051.25
2202128.19159.73N307048.29
2192132.82172.25Y240046.96
3351148.96174.85Y394048.09
1233132.08173.36N265049.82
1231114.59140.75Y380051.25
3442128.19159.73Y307048.29
2323179.18191.27N321041.12
All these values are continuous and each character indicates something. I am unable to figure out how to separate each value into columns and specify a heading for all these new columns which will be created.
I used this code, however it does not seem to work.
birthweight <- read.table("birthweighthw1.txt", sep="", col.names=c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth"))
Any help would be appreciated.
Assuming that you have a clear definition for every column, you can use regular expressions to solve this in no time.
From your column names and example data, I guess that the regular expression that matches each field is:
ethnic: \d{1}
age: \d{1,2}
smoke: \d{1}
preweight: \d{3}\.\d{2}
delweight: \d{3}\.\d{2}
breastfed: Y|N
brthwght: \d{3}
brthlngth: \d{3}\.\d{1,2}
We can put all this together in a regular expression that captures each of these fields
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
Note: In an R string literal the backslash must itself be escaped, which is why we write \\d instead of \d.
That said, here comes the code to solve the problem.
First, you need to read your strings
lines <- readLines("birthweighthw1.txt")
Now, we define our regular expression and use the function str_match from the package stringr to get your data into a character matrix.
require(stringr)
reg.expression <- "(\\d{1})(\\d{1,2})(\\d{1})(\\d{3}\\.\\d{2})(\\d{3}\\.\\d{2})(Y|N)(\\d{3})(\\d{3}\\.\\d{1,2})"
captured <- str_match(string= lines, pattern= reg.expression)
You can check that the first column in the matrix contains the text matched, and the following columns the data captured. So, we can get rid of the first column
captured <- captured[,-1]
and transform it into a data.frame with appropriate column names
result <- as.data.frame(captured,stringsAsFactors = FALSE)
names(result) <- c("ethnic","age","smoke","preweight","delweight","breastfed","brthwght","brthlngth")
Now, every column in result is of type character, you can transform each of them into other types. For example:
require(dplyr)
result <- result %>% mutate(ethnic=as.factor(ethnic),
age=as.integer(age),
smoke=as.factor(smoke),
preweight=as.numeric(preweight),
delweight=as.numeric(delweight),
breastfed=as.factor(breastfed),
brthwght=as.integer(brthwght),
brthlngth=as.numeric(brthlngth)
)
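One caveat worth checking against your own data: the field widths above are a guess. With that pattern, a row such as 328192.89117.68Y295050.51 is split as age 2, smoke 8 and preweight 192.89, and brthwght only receives three digits. The sketch below uses a different guess (two-digit age, two or three digits before the decimal point in the weights, a four-digit birth weight), which is an assumption read off the sample rows and not a confirmed file specification. It uses base R only.

```r
# Three sample rows from the question, including the one with a two-digit preweight
lines <- c("1241105.41129.97Y317052.03",
           "328192.89117.68Y295050.51",
           "1291125.61160.61Y335048.3")

# Assumed layout (hypothetical): ethnic 1 digit, age 2 digits, smoke 1 digit,
# weights with 2-3 integer digits, breastfed Y/N, brthwght 4 digits, brthlngth 2 digits + decimals
pattern <- paste0("^(\\d)(\\d{2})(\\d)(\\d{2,3}\\.\\d{2})(\\d{2,3}\\.\\d{2})",
                  "([YN])(\\d{4})(\\d{2}\\.\\d{1,2})$")

# regexec returns the full match plus each capture group; drop the full match
parts <- regmatches(lines, regexec(pattern, lines))
captured <- do.call(rbind, lapply(parts, function(p) p[-1]))

result <- as.data.frame(captured, stringsAsFactors = FALSE)
names(result) <- c("ethnic", "age", "smoke", "preweight", "delweight",
                   "breastfed", "brthwght", "brthlngth")
```

Whichever layout is right, anchoring the pattern with ^ and $ makes rows that do not fit fail to match instead of being split silently in the wrong place.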

Processing files in a particular order in R

I have several datafiles, which I need to process in a particular order. The pattern of the names of the files is, e.g. "Ad_10170_75_79.txt".
Currently they are sorted according to the first numbers (which differ in length), see below:
f <- as.matrix (list.files())
f
[1] "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_1049_25_79.txt" "Ad_10531_77_79.txt"
But I need them to be sorted by the middle number, like this:
> f
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
As I just need the middle number of the filename, I thought the easiest way would be to get rid of the rest of the name and rename all files. For this I tried using strsplit.
f2 <- strsplit (f,"_79.txt")
But I'm sure there is a way to sort the files directly, without renaming all files. I tried using sort and to describe the name with regex but without success. This has been a problem for many days, and I spent several hours searching and trying, to solve this presumably easy task. Any help is very much appreciated.
old example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_25_79.txt", "Ad_10531_77_79.txt")
Thank you for your answers. I think I have to modify my example, because the solution should work for all possible middle numbers, regardless of how many digits they have.
new example dataset:
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
"Ad_1049_9_79.txt", "Ad_10531_77_79.txt")
Here's a regex approach.
f[order(as.numeric(gsub('Ad_\\d+_(\\d+)_\\d+\\.txt', '\\1', f)))]
# [1] "Ad_1049_9_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
Try this:
f[order(as.numeric(unlist(lapply(strsplit(f, "_"), "[[", 3))))]
[1] "Ad_1049_25_79.txt" "Ad_10170_75_79.txt" "Ad_10345_76_79.txt" "Ad_10531_77_79.txt"
First we split by _, then select the third element of every list element, find the order and subset f based on that order.
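The steps just described can be spelled out one at a time on the new example dataset, where the middle number has a different number of digits. This is only a sketch; the intermediate names pieces, middle and sorted are illustrative.

```r
f <- c("Ad_10170_75_79.txt", "Ad_10345_76_79.txt",
       "Ad_1049_9_79.txt", "Ad_10531_77_79.txt")

# Step 1: split each name on the underscore
pieces <- strsplit(f, "_", fixed = TRUE)

# Step 2: take the third piece of every name and convert it to a number
middle <- as.numeric(vapply(pieces, `[[`, character(1), 3))

# Step 3: reorder the file names by that number
sorted <- f[order(middle)]
sorted
```

Because the comparison is done on numbers and not on text, 9 sorts before 75 regardless of how many digits each has.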
I would create a small dataframe containing filenames and their respective extracted indices:
f<- c("Ad_10170_75_79.txt","Ad_10345_76_79.txt","Ad_1049_25_79.txt","Ad_10531_77_79.txt")
f2 <- strsplit (f,"_79.txt")
mydb <- as.data.frame(cbind(f,substr(f2,start=nchar(f2)-1,nchar(f2))))
names(mydb) <- c("filename","index")
library(plyr)
arrange(mydb,index)
Take the first column of this as your filename vector.
ADDENDUM:
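If a numeric index is required, convert the column to numeric before sorting. Going through as.character first guards against the column being a factor, which is the default in R versions before 4.0:
mydb$index <- as.numeric(as.character(mydb$index))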

Data frame column naming

I am creating a simple data frame like this:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579")
My understanding is that the column names should be "2D6" and "3A4", but they are actually "X2D6" and "X3A4". Why are the X's being added and how do I make that stop?
I do not recommend working with column names starting with numbers, but if you insist, use the check.names=FALSE argument of data.frame:
qcCtrl <- data.frame("2D6"="DNS00012345", "3A4"="DNS000013579",
check.names=FALSE)
qcCtrl
2D6 3A4
1 DNS00012345 DNS000013579
One of the reasons I caution against this, is that the $ operator becomes more tricky to work with. For example, the following fails with an error:
> qcCtrl$2D6
Error: unexpected numeric constant in "qcCtrl$2"
To get round this, you have to enclose your column name in back-ticks whenever you work with it:
> qcCtrl$`2D6`
[1] DNS00012345
Levels: DNS00012345
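The X is added because data.frame() checks column names by default (check.names = TRUE) and a syntactically valid R name cannot start with a digit, so an X is prepended. Converting the names with as.character() does not prevent this; pass check.names = FALSE as shown above, or set the names afterwards with names(qcCtrl) <- c("2D6", "3A4").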
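To see where the X comes from, the name adjustment can be reproduced directly with make.names, which is the function data.frame() applies when check.names = TRUE. The sketch below also shows [[ ]] as an alternative to back-ticks for reaching such a column; the object name qc is illustrative.

```r
# make.names is what turns "2D6" into a syntactically valid name
make.names(c("2D6", "3A4"))

# With check.names = FALSE the names are kept as given
qc <- data.frame("2D6" = "DNS00012345", "3A4" = "DNS000013579",
                 check.names = FALSE, stringsAsFactors = FALSE)
names(qc)

# Character indexing works without back-ticks
qc[["2D6"]]
```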
