I am new to text mining and R.
I have 200 gene names that are mixed strings. I want to split them so that the plain words (e.g., cadherins, orphan receptors) go in one column and the numbers (e.g., 2/3) and number+letter tokens (e.g., 7D, 7TM) go in another column.
I used strsplit to split the words. Any suggestions on how to parse them would be helpful.
example:
> Genes <- c("7D cadherins", "7TM orphan receptors", "7TM orphan receptors RNA18S", "28S ribosomal RNAs RNA28S", "45S pre-ribosomal RNAs RNA45S", "5.8S ribosomal RNAs", "Actin related protein 2/3 complex")
Expected result(2nd and 3rd column):
7D cadherins cadherins 7D
7TM orphan receptors orphan receptors 7TM
18S ribosomal RNAs RNA18S ribosomal RNAs RNA18S 18S RNA18S
28S ribosomal RNAs RNA28S ribosomal RNAs RNA28S 28S RNA28S
45S pre-ribosomal RNAs RNA45S pre-ribosomal RNAs 45S RNA45S
5.8S ribosomal RNAs ribosomal RNAs 5.8S
Actin related protein 2/3 complex Actin related protein complex 2/3
Use strsplit to split the names, grep to detect words with or without digits, and paste to collapse the words. Put everything in a function to avoid repetition:
wordS <- function(x, invert = TRUE) {
  clean <- gsub( '[[:space:]]+', ' ', x ) # to remove extra spaces
  split <- strsplit( clean, ' ' )
  detec <- lapply( split, function(y) grep('[0-9]', y, invert = invert, value = TRUE) )
  words <- sapply( detec, paste, collapse = ' ' )
  return( words )
}
data.frame(
  Gene = Genes,
  column2 = wordS(Genes),
  column3 = wordS(Genes, invert = FALSE)
)
                               Gene                       column2    column3
1                      7D cadherins                     cadherins         7D
2              7TM orphan receptors              orphan receptors        7TM
3       7TM orphan receptors RNA18S              orphan receptors 7TM RNA18S
4         28S ribosomal RNAs RNA28S                ribosomal RNAs 28S RNA28S
5     45S pre-ribosomal RNAs RNA45S            pre-ribosomal RNAs 45S RNA45S
6               5.8S ribosomal RNAs                ribosomal RNAs       5.8S
7 Actin related protein 2/3 complex Actin related protein complex        2/3
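For comparison, the same split-and-filter idea can be sketched in Python. This is only an illustration of the approach, not part of the R answer; the function name split_tokens is made up here, and it assumes that a "number" token is any whitespace-separated token containing at least one digit.

```python
import re

def split_tokens(name):
    """Return (words without digits, tokens containing digits) for one gene name."""
    tokens = name.split()  # split on any run of whitespace
    with_digits = [t for t in tokens if re.search(r"\d", t)]
    without_digits = [t for t in tokens if not re.search(r"\d", t)]
    return " ".join(without_digits), " ".join(with_digits)

genes = [
    "7D cadherins",
    "45S pre-ribosomal RNAs RNA45S",
    "Actin related protein 2/3 complex",
]
for gene in genes:
    words, numbers = split_tokens(gene)
    print(gene, "|", words, "|", numbers)
```

Tokens such as RNA45S land in the digit column because they contain a digit, which mirrors what grep('[0-9]', ...) does in the R function.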
I have the following string:
Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews
I want to replace the semicolons that are not inside parentheses with commas. There can be any number of nested parentheses and any number of semicolons within them, and the result should look like this:
Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews
This is my current code:
x <- "Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews"
gsub(";(?![^(]*\\))",",",x,perl=TRUE)
[1] "Almonds , Roasted Peanuts (Peanuts, Canola Oil (Antioxidants (319; 320)); Salt), Cashews "
The problem I am facing is that if there is a nested () inside a bigger pair of parentheses, my regex still replaces a semicolon in the outer pair with a comma.
Can I please get some help on regex that will solve the problem? Thank you in advance.
The pattern ;(?![^(]*\)) matches a semicolon and asserts that what follows is not a ) without a ( in between.
That assertion also holds when a nested opening parenthesis comes before the next ), so a ; inside the outer parentheses is still matched.
You could use a recursive pattern to match the nested parentheses, which is the part you don't want to change, and then use the SKIP FAIL approach to exclude it.
Then you can match the remaining semicolons and replace them with a comma.
[^;]*(\((?>[^()]+|(?1))*\))(*SKIP)(*F)|;
In parts, the pattern matches
[^;]* Match 0+ times any char except ;
( Capture group 1
\( Match the opening (
(?> Atomic group
[^()]+ Match 1+ times any char except ( and )
| Or
(?1) Recurse the whole first sub pattern (group 1)
)* Close the atomic group and optionally repeat
\) Match the closing )
) Close group 1
(*SKIP)(*F) Skip what is matched
| Or
; Match a semicolon
See a regex demo and an R demo.
x <- c("Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews",
"Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)")
gsub("[^;]*(\\((?>[^()]+|(?1))*\\))(*SKIP)(*F)|;",",",x,perl=TRUE)
Output
[1] "Almonds , Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt), Cashews"
[2] "Peanuts (32.5%), Macadamia Nuts (14%; PPPG(AHA)), Hazelnuts (9%), nuts(98%)"
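Python's built-in re module does not support recursion or (*SKIP)(*F), so the pattern above cannot be reused there as is. A different technique that gives the same result on these two inputs is a plain depth counter. The sketch below is only an illustration; the function name replace_top_level is made up, and it assumes the parentheses are balanced.

```python
def replace_top_level(text, old=";", new=","):
    """Replace `old` with `new` only where the character is outside all parentheses."""
    depth = 0  # current nesting level of parentheses
    out = []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray closing parenthesis
        out.append(new if ch == old and depth == 0 else ch)
    return "".join(out)

x = [
    "Almonds ; Roasted Peanuts (Peanuts; Canola Oil (Antioxidants (319; 320)); Salt); Cashews",
    "Peanuts (32.5%); Macadamia Nuts (14%; PPPG(AHA)); Hazelnuts (9%); nuts(98%)",
]
for s in x:
    print(replace_top_level(s))
```

The counter makes a single pass over the string, which avoids regex backtracking, but unlike the regex it has to be written as a loop rather than passed to gsub.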
I would like to extract the name of the drug, where a label such as "Drug:" or "Other:" precedes the name.
Take the first word after every ":", including characters like "-".
If there are 2 instances of ":", then "and" should join the 2 words into one string. The output should be a one-column data frame with the column name Drugs.
Here is my reproducible example:
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
The output should look something like this:
output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-matose", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))
This is what I've tried, which didn't work.
Attempt 1:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)')
Attempt 2:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')
I am not so familiar with R, but a pattern that would give you the matches from the example data could be:
(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*
Then you could concatenate the matches with and in between.
The pattern matches:
(?<=:\s) Positive lookbehind, assert : and a whitespace char to the left
\w+(?:-\w+)* Match 1+ word chars, followed by optionally repeating - and 1+ word chars
(?: Non capture group
and \w+(?:-\w+)* Match and followed by 1+ word chars followed by optionally repeating - and 1+ word chars
)* Close non capture group and optionally repeat
Regex demo
To get all the matches, you can use str_extract_all
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
For example
library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
lapply(
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
, paste, collapse=" and ")
Output
[[1]]
[1] "TLD-1433"
[[2]]
[1] "CG0070 and n-dodecyl-B-D-maltoside"
[[3]]
[1] "Atezolizumab"
[[4]]
[1] "N-803 and BCG and N-803"
[[5]]
[1] "Everolimus and Intravesical"
[[6]]
[1] "Association and Association"
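The same pattern also works with Python's re module, since the lookbehind (?<=:\s) has a fixed width. The sketch below is only an illustration of the extract-then-join step; the names pattern, rows and extract_drugs are made up here.

```python
import re

# Same pattern as above: a word with optional -parts, optionally chained with " and "
pattern = re.compile(r"(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*")

def extract_drugs(text):
    """Find every match in one row and join the matches with ' and '."""
    return " and ".join(pattern.findall(text))

rows = [
    "Product: TLD-1433 infusion Therapy",
    "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
    "Drug: N-803 and BCG|Drug: N-803",
]
for row in rows:
    print(extract_drugs(row))
```

Because the pattern has no capture groups, findall returns the full matches, which plays the role of str_extract_all here.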
Use
:\s*\b([\w-]+\b(?:\s+and\s+\b[\w-]+)*)\b
See regex proof.
EXPLANATION
--------------------------------------------------------------------------------
: ':'
--------------------------------------------------------------------------------
\s* whitespace (\n, \r, \t, \f, and " ") (0 or
more times (matching the most amount
possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
--------------------------------------------------------------------------------
( group and capture to \1:
--------------------------------------------------------------------------------
[\w-]+ any character of: word characters (a-z,
A-Z, 0-9, _), '-' (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
(?: group, but do not capture (0 or more
times (matching the most amount
possible)):
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
and 'and'
--------------------------------------------------------------------------------
\s+ whitespace (\n, \r, \t, \f, and " ")
(1 or more times (matching the most
amount possible))
--------------------------------------------------------------------------------
\b the boundary between a word char (\w)
and something that is not a word char
--------------------------------------------------------------------------------
[\w-]+ any character of: word characters (a-
z, A-Z, 0-9, _), '-' (1 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
)* end of grouping
--------------------------------------------------------------------------------
) end of \1
--------------------------------------------------------------------------------
\b the boundary between a word char (\w) and
something that is not a word char
R code:
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))
library(stringr)
matches <- str_match_all(my.df$col1, ":\\s*\\b([\\w-]+\\b(?:\\s+and\\s+\\b[\\w-]+)*)\\b")
Drugs <- sapply(matches, function(z) paste(z[,-1], collapse=" and "))
output.df <- data.frame(Drugs)
output.df
Results:
Drugs
1 TLD-1433
2 CG0070 and n-dodecyl-B-D-maltoside
3 Atezolizumab
4 N-803 and BCG and N-803
5 Everolimus and Intravesical
6 Association and Association
I would like to retrieve the intron sequences of some genes (e.g. https://www.ncbi.nlm.nih.gov/nuccore/X62462.1).
I can get them from the Nucleotide database for some of the genes, but others only appear in the Gene database at NCBI. I am using Biopython for this.
Here is a piece of code that retrieves introns from the Nucleotide database.
from Bio.Seq import Seq
from Bio import SeqIO, Entrez
count = 4 # Number of entries to see
genes = ["estrogen receptor"]
shortname = genes[0]
Entrez.email = "email#gmail.com"
handle = Entrez.esearch(db="nucleotide", term="Human[Orgn] AND "+shortname+"[GENE] AND biomol_genomic[PROP] AND nucleotide_protein[Filter]", idtype="acc", retmax=count)
record = Entrez.read(handle)
handle.close()
With this part I check which entry I want:
print("Entries:", record["Count"])
seq_records = []
for i, idname in enumerate(record["IdList"]):
    with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id=idname) as handle:
        seq_record = SeqIO.read(handle, "gb")
    seq_records.append(seq_record)
    print(i, "--", seq_record.description, seq_record.id)
Entries: 1
0 -- H.sapiens 5' flanking region for estrogen receptor (breast) gene X62462.1
And now I retrieve the introns sequences for this gene:
id_chosen = 0
chosen = seq_records[id_chosen]  # use the chosen record, not the last one fetched
introns = [f for f in chosen.features if f.type == "intron"]
for x, f in enumerate(introns, start=1):
    # start is 0-based and end is exclusive, so start + 1 gives the 1-based position
    start, end = int(f.location.start), int(f.location.end)
    print(">>>", chosen.id, "Intron:", x, start + 1, end, ",len:", len(chosen.seq[start:end]))
    print(chosen.seq[start:end], "\n")
Output: >X62462.1 Intron: 1 911 2933 ,len: 2023
GTAGGCTTGTTTTGATTTCTCTCTCTGTAGCTTTAGCATTTTGAGAAAGCAACTTACCTTTCTGGCTAGTGTCTGTATCCTAGCAGGGAGATGAGGATTGCTGTTCTCCATGG......
In this case there is only one intron.
So my question is: how can I do this for a gene that has several introns and only appears in the Gene database? How can I access those features?
Example: https://www.ncbi.nlm.nih.gov/gene/374
Thanks!
I am looking to explore the GameTheory package from CRAN, but I would appreciate help in converting my data (in the form of a data frame of unique combinations and results) into the required coalition object. I believe the precursor to this is an ordered list of all coalition values (https://cran.r-project.org/web/packages/GameTheory/vignettes/GameTheory.pdf).
My real data has n ~ 30 'players' and a large number of unique combinations (say 1000), for which I have 1 and 0 identifiers to describe the combinations. The data is sparsely populated in that I do not have data for all combinations, but I will assume that combinations not described have zero value. I plan to have one specific 'player' who will appear in all combinations and act as a baseline.
By way of example this is the data frame I am starting with:
require(GameTheory)
games <- read.csv('C:\\Users\\me\\Desktop\\SampleGames.csv', header = TRUE, row.names = 1)
games
n1 n2 n3 n4 Stakes Wins Success_Rate
1 1 1 0 0 800 60 7.50%
2 1 0 1 0 850 45 5.29%
3 1 0 0 1 150000 10 0.01%
4 1 1 1 0 300 25 8.33%
5 1 1 0 1 1800 65 3.61%
6 1 0 1 1 1900 55 2.89%
7 1 1 1 1 700 40 5.71%
8 1 0 0 0 3000000 10 0.00333%
where n1 is my universal player, and in this instance, I have described all combinations.
To calculate my 'base' coalition value from player {1} alone, I am looking to perform the calculation: 0.00333% (success rate) * all stakes, i.e.
0.00333% * (800 + 850 + 150000 + 300 + 1800 + 1900 + 700 + 3000000) = 105
I'll then have zero values for {2}, {3} and {4} as they never "play" alone in this example.
To calculate my first pair coalition value, I am looking to perform the calculation:
7.5%(800 + 300 + 1800 + 700) + 0.00333%(850 + 150000 + 1900 + 3000000) = 375
This is calculated as players {1,2} base win rate (7.5%) by the stakes they feature in, plus player {1} base win rate (0.00333%) by the combinations he features in that player {2} does not - i.e. exclusive sets.
This logic is repeated for the other unique combinations. For example row 4 would be the combination of {1,2,3} so the calculation is:
7.5%(800+1800) + 5.29%(850+1900) + 8.33%(300+700) + 0.00333%(3000000+150000) = 529 which descriptively is set {1,2} success rate% by Stakes for the combinations it appears in that {3} does not, {1,3} by where {2} does not feature, {1,2,3} by their occurrences, and the base player {1} by examples where neither {2} nor {3} occur.
My expected outcome therefore should look like this I believe:
c(105,0,0,0, 375,304,110,0,0,0, 529,283,246,0, 400)
where the first four numbers are the single player combinations {1} {2} {3} and {4}, the next six numbers are two player combinations {1,2} {1,3} {1,4} (and the null cases {2,3} {2,4} {3,4} which don't exist), then the next four are the three player combinations {1,2,3} {1,2,4} {1,3,4} and the null case {2,3,4}, and lastly the full combination set {1,2,3,4}.
I'd then feed this into the DefineGame function of the package to create my coalitions object.
Appreciate any help: I have tried to be as descriptive as possible. I really don't know where to start on generating the necessary sets and set exclusions.
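One possible reading of the rule described above, sketched in Python rather than R: for a coalition S, each row contributes its Stakes multiplied by the success rate of the row whose member set equals S intersected with that row's members, or zero if no such row exists. This is an interpretation, not a confirmed specification. The names rows, coalition_value and ordered_values are made up, and the rates are typed in from the rounded Success_Rate column, so some entries may differ slightly from the expected vector.

```python
from itertools import combinations

# (members, stakes, success rate as a fraction), copied from the sample table
rows = [
    ({1, 2}, 800, 0.0750),
    ({1, 3}, 850, 0.0529),
    ({1, 4}, 150000, 0.0001),
    ({1, 2, 3}, 300, 0.0833),
    ({1, 2, 4}, 1800, 0.0361),
    ({1, 3, 4}, 1900, 0.0289),
    ({1, 2, 3, 4}, 700, 0.0571),
    ({1}, 3000000, 0.0000333),
]
rate = {frozenset(members): r for members, _, r in rows}

def coalition_value(coalition):
    """Sum stakes over all rows, each weighted by the rate of (coalition & row members)."""
    s = frozenset(coalition)
    # Subsets with no row of their own (e.g. anything without player 1) contribute 0
    return sum(rate.get(s & frozenset(members), 0.0) * stakes
               for members, stakes, _ in rows)

def ordered_values(players):
    """Coalition values ordered by size, then lexicographically: {1}, {2}, ..., {1,2}, ..."""
    return [coalition_value(c)
            for k in range(1, len(players) + 1)
            for c in combinations(players, k)]

values = ordered_values([1, 2, 3, 4])
print([round(v) for v in values])
```

With n ~ 30 players the full list has 2^30 - 1 entries, so enumerating every coalition this way may not be practical for the real data.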
1. ZFP112
Official Symbol: ZFP112 and Name: zinc finger protein 112 homolog (mouse)[Homo sapiens]
Other Aliases: ZNF112, ZNF228
Other Designations: zfp-112; zinc finger protein 112; zinc finger protein 228
Chromosome: 19; Location: 19q13.2
Annotation: Chromosome 19NC_000019.9 (44830706..44860856, complement)
ID: 7771

2. SEP15
15 kDa selenoprotein[Homo sapiens]
Chromosome: 1; Location: 1p31
Annotation: Chromosome 1NC_000001.10 (87328128..87380107, complement)
MIM: 606254
ID: 9403

3. MLL4
myeloid/lymphoid or mixed-lineage leukemia 4[Homo sapiens]
Other Aliases: HRX2, KMT2B, MLL2, TRX2, WBP7
Other Designations: KMT2D; WBP-7; WW domain binding protein 7; WW domain-binding protein 7; histone-lysine N-methyltransferase MLL4; lysine N-methyltransferase 2B; lysine N-methyltransferase 2D; mixed lineage leukemia gene homolog 2; myeloid/lymphoid or mixed-lineage leukemia protein 4; trithorax homolog 2; trithorax homologue 2
Chromosome: 19; Location: 19q13.1
Annotation: Chromosome 19NC_000019.9 (36208921..36229779)
MIM: 606834
ID: 9757

37. LOC100509547
hypothetical protein LOC100509547[Homo sapiens]
This record was discontinued.
ID: 100509547

43. LOC100509587
hypothetical protein LOC100509587[Homo sapiens]
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
I want to get the gene name (ZFP112, SEP15, MLL4), the Location field (if present), the ID field, and skip the other stuff. All the string utilities like scan() seem geared toward more regular data. The blank line between records is effectively the record separator. I can write this to disk and read it back in with readLines() but I'd prefer to do it from memory since I downloaded it over HTTP.
Read the data in from "myfile.dat", say (or just start from L below if you have already read it in as separate lines). Extract the lines that begin with digits followed by a dot and a space, that contain Location:, or that start with ID:. Then remove everything in those lines up to and including the last space, giving v2.
Create a grouping vector g which identifies the group to which each component of v2 belongs. (This uses the fact that the first field of each group starts with a non-digit and the other fields start with a digit.) Then split v2 into those groups, giving s.
Expand the short components of s by inserting an NA, assuming that a short component means Location: is missing. (We assume the first field and the ID field cannot be missing.) Finally, transpose the result so that the fields are in columns and the cases are in rows.
L <- readLines("myfile.dat")
v <- grep("^\\d+\\. |Location: |^ID: ", L, value = TRUE)
v2 <- sub(".* ", "", v)
g <- cumsum(regexpr("^\\D", v2) > 0)
s <- split(v2, g)
m <- sapply(s, function(x) if (length(x) == 2) c(x[[1]], NA, x[[2]]) else x)
t(m)
Using the sample data in the post we get this from the last line:
  [,1]           [,2]      [,3]
1 "ZFP112"       "19q13.2" "7771"
2 "SEP15"        "1p31"    "9403"
3 "MLL4"         "19q13.1" "9757"
4 "LOC100509547" NA        "100509547"
5 "LOC100509587" NA        "100509587"
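A similar line-by-line parse can be sketched in Python, working on a string that is already in memory. This is only an illustration and takes a slightly different route from the R code: it starts a new record at each "N. NAME" header line instead of building a grouping vector. The function name parse_gene_records and the dictionary keys are made up here.

```python
import re

def parse_gene_records(text):
    """Return one dict per record with the gene name, Location (or None) and ID."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        header = re.match(r"^\d+\.\s+(\S+)$", line)  # e.g. "1. ZFP112"
        if header:
            current = {"name": header.group(1), "location": None, "id": None}
            records.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first header
        loc = re.search(r"Location:\s*(\S+)", line)
        if loc:
            current["location"] = loc.group(1)
        gene_id = re.match(r"^ID:\s*(\d+)$", line)  # anchored, so "GeneID:" is skipped
        if gene_id:
            current["id"] = gene_id.group(1)
    return records

sample = """1. ZFP112
Chromosome: 19; Location: 19q13.2
ID: 7771

43. LOC100509587
Chromosome: 6
This record was replaced with GeneID: 100506601
ID: 100509587
"""
for rec in parse_gene_records(sample):
    print(rec["name"], rec["location"], rec["id"])
```

Keying on the header line means a missing Location simply stays None, so no padding step is needed afterwards.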