I am using R currently and I have produced a table with 3 columns. The first column contains names looking like "XXX_YYY_ZZZ" and I would only want to keep the "XXX" part. This is why I tried gsub, but couldn't make it so I turned to strapplyc(), which works but produces only one column. Apparently, I would want to keep my initial table, but with the first column replaced by the strapplyc() output. Or any other different approach you think would fit better!
Thank you in advance.
Since you have NOT showed samples so creating a simplex example here for testing it.
cal1 <- c("XXX_YYY_ZZZ","XXX_YYY_ZZZ")
gsub("_.*","",cal1)
Output will be as follows.
> gsub("_.*","",cal1)
[1] "XXX" "XXX"
Works for me. Here is a regex which looks for three groups of text, separated by underscores. The ^ indicates start of string and $ indicates end of string. I capture first (\\1) group, but there's nothing stopping you from capturing \\2, \\3 or even \\1\\3.
gsub("^(.*)_(.*)_(.*)$", "\\1", "XXX_YYY_ZZZ")
[1] "XXX"
You could also use strsplit.
> strsplit("XXX_YYY_ZZZ", "_")[[1]][1]
[1] "XXX"
Related
I just learnt R and was trying to clean data for analysis using R using string manipulation using the code given below for Amount_USD column of a table. I could not find why changes were not made. Please help.
Code:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,10) == "\\\xc2\\\xa0",
str_sub(csv_file$Amount_USD,12,-1),csv_file2$Amount_USD)
Result:
\\xc2\\xa010,000,000
\\xc2\\xa016,200,000
\\xc2\\xa019,350,000
Expected Result:
10,000,000
16,200,000
19,350,000
You could use the following code, but maybe there is a more compact way:
vec <- c("\\xc2\\xa010,000,000", "\\xc2\\xa016,200,000", "\\xc2\\xa019,350,000")
gsub("(\\\\x[[:alpha:]]\\d\\\\x[[:alpha:]]0)([d,]*)", "\\2", vec)
[1] "10,000,000" "16,200,000" "19,350,000"
A compact way to extract the numbers is by using str_extract and negative lookahead:
library(stringr)
str_extract(vec, "(?!0)[\\d,]+$")
[1] "10,000,000" "16,200,000" "19,350,000"
How this works:
(?!0): this is negative lookahead to make sure that the next character is not 0
[\\d,]+$: a character class allowing only digits and commas to occur one or more times right up to the string end $
Alternatively:
str_sub(vec, start = 9)
There were a few minor issues with your code.
The main one being two unneeded backslashes in your matching statement. This also leads to a counting error in your first str_sub(), where you should be getting the first 8 characters not 10. Finally, you should be getting the substring from the next character after the text you want to match (i.e. position 9, not 12). The following should work:
csv_file2$Amount_USD <- ifelse(str_sub(csv_file$Amount_USD,1,8) == "\\xc2\\xa0", str_sub(csv_file$Amount_USD,9,-1),csv_file2$Amount_USD)
However, I would have done this with a more compact gsub than provided above. As long as the text at the start to remove is always going to be "\\xc2\\xa0", you can simply replace it with nothing. Note that for gsub you will need to escape all the backslashes, and hence you end up with:
csv_file2$Amount_USD <- gsub("\\\\xc2\\\\xa0", replacement = "", csv_file2$Amount_USD)
Personally, especially if you plan to do any sort of mathematics with this column, I would go the additional step and remove the commas, and then coerce the column to be numeric:
csv_file2$Amount_USD <- as.numeric(gsub("(\\\\xc2\\\\xa0)|,", replacement = "", csv_file2$Amount_USD))
I am trying to match DNA sequences in a column. I am trying to find the longer version of itself, but also in this column it has the same sequence.
I am trying to use Str_which for which I know it works, since if I manually put the search pattern in it finds the rows which include the sequence.
As a preview of the data I have:
SNID type seqs2
9584818 seqs TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA
9584818 reversed TTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGA
9562505 seqs GTCTTCAGCATCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAACTTTGTGAAT
9562505 reversed ATTCACAAAGTTCTGTGGCCCCAGGAGCAGAATCTACAAGTTCATTGCACTCACATATAAGTAGATTCCTYAGATGAAGAAGTTTCTTTGGTAGGTTCCCTTTCAGCTTGGGACAGTGTCTTAAAGAAAGATGCTGAAGAC
Using a simple search of row one as x
x <- "TCTTTCTTTAAGACACTGTCCCAAGCTGAAAGGGAACCTACCAAAGAAACTTCTTCATCTRAGGAATCTACTTATATGTGAGTGCAATGAACTTGTAGATTCTGCTCCTGGGGCCACAGAA"
str_which(df$seqs2, x)
I get the answer I expect:
> str_which(df$seqs3, x)
[1] 1 3
But when I try to search as a whole column, I just get the result of the rows finding itself. And not the other rows in which it is also stated.
> str_which(df$seqs2, df$seqs2)
[1] 1 2 3 4
Since my data set is quite large, I do not want to do this manually, and rather use the column as input, and not just state "x" first.
Anybody any idea how to solve this? I have tried most Stringr cmds by now, but by mistake I might have did it wrongly or skipped some important ones.
Thanks in advance
You may need lapply :
lapply(df$seqs2, function(x) stringr::str_which(df$seqs2, x))
You can also use grep to keep this in base R :
lapply(df$seqs2, function(x) grep(x, df$seqs2))
I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I don't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
You could try the str_extract method which utilizes the Stringr package.
If your data is separated by line, you can do:
str_extract("(?<=\\bsample_id=)([:digit:]+)") #this tells the extraction to target anything that is proceeded by a sample_id= and is a series of digits, the + captures all of the digits
This would extract just the numbers per line, if your data is all collected like that, it becomes a tad more difficult because you will have to tell the extraction to continue even if it has extracted something. The code would look something like this:
str_extract_all("((?<=sample_id=)\\d+)")
This code will extract all of the numbers you're looking for and the output will be a list. From there you can manipulate the list as you see fit.
I have two datasets that I want to link (inner_join) with a common key which is a string. The problem is that in one of the two dataset the key is not complete, but this uncomplete key is included in the other one, like the following example:
key for 1st dataset: PV955--075P412171042--
and for the 2nd: PV955--???P412171042--
The ??? represents numbers that are missing, so my question is can we do like a string comparison/inclusion to check if the characters of my 2nd key are included my 1st key and do the join on this if yes?
Idk if the issue is clear, and thanks for the answers.
It's hard to answer without seeing your data, however you can try this:
library(stringr)
> str_detect("075P412171042","P412171042")
[1] TRUE
In base R with regular expressions :
key1 <- "PV955--075P412171042--"
key2 <- "PV955--???P412171042--"
key2re <- gsub("--...", "--...", key2)
grepl(key2re, key1)
## [1] TRUE
Replace the 3 unknown characters after "--" by dots meaning any character in regular expressions.
Then grepl check if the two strings match.
I have a dataframe Genotypes and it has columns of loci labeled D2S1338, D2S1338.1, CSF1PO, CSF1PO.1, Penta.D, Penta.D.1. These names were automatically generated when I imported the Excel spreadsheet into R such that the for the two columns labeled CSF1PO, the column with the first set of alleles was labeled CSF1PO and the second column was labeled CSF1PO.1. This works fine until I get to Penta D which was listed with a space in Excel and imported as Penta.D. When I apply the following code, Penta.D gets combined with Penta.C and Penta.E to give me nonsensical results:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE), function(x) x[1])))
Expected <- sapply(locuses, function(x) 1 - sum(unlist(Freqs[grepl(x, names(Freqs))])^2))
This code works great for all loci except the Pentas because of how they were automatically names. How do I either write an exception for the strsplit at Penta.C, Penta.D, and Penta.E or change these names to PentaC, PentaD, and PentaE so that the above code works as expected? I run the following line:
Genotypes <- transform(Genotypes, rename.vars(Genotypes, from="Penta.C", to="PentaC", info=TRUE))
and it tells me:
Changing in Genotypes
From: Penta.C
To: PentaC
but when I view Genotypes, it still has my Penta loci written as Penta.C. I thought this function would write it back to the original data frame, not just a copy. What am I missing here? Thanks for your help.
The first line of your code is splitting the variable names by . and extracting the first piece. It sounds like you instead want to split by . and extract all the pieces except for the last one:
locuses = unique(unlist(lapply(strsplit(names(Freqs), ".", fixed=TRUE),
function(x) paste(x[1:(length(x)-1)], collapse=""))))
Looks like you want to remove ".n" where n is a single digit if and only if it appears at the end of a line.
loci.columns <- read.table(header=F,
text="D2S1338,D2S1338.1,CSF1PO,CSF1PO.1,Penta.D,Penta.D.1",
sep=",")
loci <- gsub("\\.\\d$",replace="",unlist(loci.columns))
loci
# [1] "D2S1338" "D2S1338" "CSF1PO" "CSF1PO" "Penta.D" "Penta.D"
loci <- unique(loci)
loci
# [1] "D2S1338" "CSF1PO" "Penta.D"
In gsub(...), \\. matches ".", \\d matches any digit, and $ forces the match to be at the end of the line.
The basic problem seems like the names are being made "valid" on import by the make.names function
> make.names("Penta C")
[1] "Penta.C"
Avoid R's column re-naming with use of the check.names=FALSE argument to read.table. If you refer explicitly to columns you'll need to provide a back-quoted strings
df$`Penta C`