how to only use sub when there are multiple values - R

So this is a short example of the data:
x <- c("WB (16)","CT (14)WB (15)","ET (13)CITG-TILm (16)EE-SS (17)TN-SE (17)")
My question is how to get sub(".*?)", "", x) (or a different function) to work such that this will be the result:
x <- c("WB (16)","WB (15)","TN-SE (17)")
instead of
x<-c("","WB (15)")
There are different types of letters (so not only WB, CT and TN-SE), such as:
"NBIO(15)" "CITG-TP(08)" "BK-AR(10)"
So it should be a general function...
Thanks!

Could you please try the following.
sub(".*[0-9]+[^)]\\)?([^)$])", "\\1", x)
Output will be as follows.
[1] "WB (16)" "WB (15)" "TN-SE (17)"
Where Input will be as follows.
> x
[1] "WB (16)" "CT (14)WB (15)"
[3] "ET (13)CITG-TILm (16)EE-SS (17)TN-SE (17)"
Explanation: the following is for explanation purposes only.
sub("               ## Using base R's sub function here: sub(pattern, replacement, x) replaces the first match of pattern in each element of x.
.*[0-9]+[^)]\\)?    ## Greedily consume everything up to and including the closing ")" of the second-to-last "Prefix (NN)" block (the trailing \\)? makes that ")" optional).
([^)$])",           ## The () create capture group 1, which grabs the very next character, i.e. the first character of the last "Prefix (NN)" block.
"\\1",              ## Replace the whole matched part with \\1, the captured character, so everything before the last block is dropped and the rest of that block is left untouched.
x)                  ## The input vector. Strings that contain only one "(NN)" block have no match and are returned unchanged.
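A quick demonstration of that last point (added here, not part of the original answer): with a single-block string there is no character left after the final ")", so sub finds no match and returns the string as is.
sub(".*[0-9]+[^)]\\)?([^)$])", "\\1", "WB (16)")
[1] "WB (16)"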

I think that I understand what you want. This certainly works on your example.
sub(".*?([^()]+\\(\\d+\\))$", "\\1", x)
[1] "WB (16)" "WB (15)" "TN-SE (17)"
Details: This looks for something of the form SomeStuff (Numbers) at the end of the string and throws away anything before it. SomeStuff is not allowed to contain parentheses.
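As a quick check (not from the original answer; the vector below is made up from the extra prefixes mentioned in the question), the same pattern also copes with other letter combinations, with or without a space before the parenthesis:
y <- c("NBIO(15)", "CITG-TP(08)BK-AR(10)")
sub(".*?([^()]+\\(\\d+\\))$", "\\1", y)
[1] "NBIO(15)"  "BK-AR(10)"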

Related

How can I dynamically get words surrounding a keyword?

I have a sentence that may contain keywords. I search for them, and if one is found, I want the word before and after the keyword.
cont <- c("could not","would not","does not","will not","do not","were not","was not","did not")
text <- "this failed to increase incomes and production did not improve"
str_extract(text,"([^\\s]+\\s+){1}names(which(sapply(cont,grepl,text)))(\\s+[^\\s]+){1}")
This fails when I dynamically search using the names function, but if I input:
str_extract(text,"([^\\s]+\\s+){1}did not(\\s+[^\\s]+){1}")
it correctly returns: production did not improve.
How can I get this to function without directly inputing the keywords?
Final note: I do not completely understand the syntax used to get surrounding objects. Basic R books have not covered this. Can someone please explain?
You could use your cont vector to create a vector of regex strings:
targets <- paste0("([^\\s]+\\s+){1}", cont, "(\\s+[^\\s]+){1}")
Which you can feed into str_extract_all and then unlist:
unlist(stringr::str_extract_all(text, targets))
#> [1] "production did not improve"
If this is something you need to do quite frequently, you could wrap it in a function:
get_surrounding <- function(string, keywords) {
  targets <- paste0("([^\\s]+\\s+){1}", keywords, "(\\s+[^\\s]+){1}")
  unlist(stringr::str_extract_all(string, targets))
}
With which you can easily run the query on new strings:
new_text <- "The production did not increase because the manager would not allow it."
get_surrounding(new_text, cont)
#> [1] "manager would not allow" "production did not increase"
Perhaps we can try this
> regmatches(text, gregexpr(sprintf("\\w+\\s(%s)\\s\\w+", paste0(cont, collapse = "|")), text))[[1]]
[1] "production did not improve"
Each match of the following regular expression will save the preceding and following words in capture groups 1 and 2, respectively.
\\b([a-z]+) +(?:could|would|does|will|do|were|was|did) +not +([a-z]+)\\b
You will of course have to form this expression programmatically, but that should be straightforward.
Hover the cursor over each element of the expression in a regex-testing demo to obtain an explanation of its function.
For the string
"she could not believe that production did not improve"
there are two matches. For the first ("she could not believe") "she" and "believe" are saved to capture groups 1 and 2, respectively. For the second ("production did not improve") "production" and "improve" are saved to capture groups 1 and 2, respectively.
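A minimal sketch of that programmatic step (not from the original answer; it reuses the cont and text objects from the question and base R's regexec to pull out the two capture groups):
keywords <- sub(" not$", "", cont)   # "could", "would", "does", ...
pattern <- sprintf("\\b([a-z]+) +(?:%s) +not +([a-z]+)\\b", paste(keywords, collapse = "|"))
m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
m[2]   # word before the keyword: "production"
m[3]   # word after the keyword:  "improve"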

Unexpected outcome, not replacing, in R out of a gsub function

As the output of a certain operation, I have the following dataframe with 729 observations.
> head(con)
Connections
1 r_con[C3-C3,Intercept]
2 r_con[C3-C4,Intercept]
3 r_con[C3-CP1,Intercept]
4 r_con[C3-CP2,Intercept]
5 r_con[C3-CP5,Intercept]
6 r_con[C3-CP6,Intercept]
As can be seen, the pattern to be removed is everything except the pair of electrode information; for instance, in the first observation this would be C3-C3. Now, this is my take on the issue, which I expected to leave the dataframe with everything else removed. If I'm not wrong (which I probably am) the regex syntax is OK, and from my understanding I believe fixed=TRUE is also necessary. However, I do not understand the R output. Where I would expect the pattern to be replaced by nothing (""), it returns this output, which doesn't make sense to me.
> gsub("r_con\\[\\,Intercept\\]\\","",con,fixed=TRUE)
[1] "3:731"
I believe this will probably be a silly question for an expert programmer, which I am far from being, and any insight would be much appreciated.
[UPDATE WITH SOLUTION]
Thanks to Tim and Ben I realised I was using wrong regex syntax and the wrong source; this did it for me:
con2 <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", con$Connections)
I think your problem is that you're passing the whole data frame "con" to your gsub call rather than the column con$Connections. Also, as the user above me pointed out, you probably don't want to use sub for this.
I'm assuming that your data is consistent, i.e., the strings in con$Connections all follow more or less the same pattern. Then this works:
I have set up this example:
con <- data.frame(Connections = c("r_con[C3-C3,Intercept]", "r_con[C3-CP1,Intercept]"))
library(stringr)
f <- function(x){
  part <- str_split(x, ",")[[1]][1]
  str_sub(part, 7, -1)
}
f(con$Connections[1])
sapply(con$Connections, f)
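For reference (output added here, not in the original answer), with the two example rows defined above this gives roughly:
f(con$Connections[1])
#> [1] "C3-C3"
unname(sapply(con$Connections, f))   # unname() just drops the names that sapply attaches
#> [1] "C3-C3"  "C3-CP1"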
The sub function doesn't work this way. One viable approach would be to capture the quantity you want, then use this capture group as the replacement:
x <- "r_con[C3-C3,Intercept]"
term <- sub("^r_con\\[([^,]+),Intercept\\]", "\\1", x)
term
[1] "C3-C3"

How to match any character existing between a pattern and a semicolon

I am trying to get anything existing between sample_id= and ; in a vector like this:
sample_id=10221108;gender=male
tissue_id=23;sample_id=321108;gender=male
treatment=no;tissue_id=98;sample_id=22
My desired output would be:
10221108
321108
22
How can I get this?
I've been trying several things like this, but I can't find the way to do it correctly:
clinical_data$sample_id<-c(sapply(myvector, function(x) sub("subject_id=.;", "\\1", x)))
You could use sub with a capture group to isolate that which you are trying to match:
out <- sub("^.*\\bsample_id=(\\d+).*$", "\\1", x)
out
[1] "10221108" "321108" "22"
Data:
x <- c("sample_id=10221108;gender=male",
"tissue_id=23;sample_id=321108;gender=male",
"treatment=no;tissue_id=98;sample_id=22")
Note that the actual output above is character, not numeric. But, you may easily convert using as.numeric if you need to do that.
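For instance (just illustrating the conversion mentioned above):
as.numeric(out)
#> [1] 10221108   321108       22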
Edit:
If you are unsure that the sample IDs would always be just digits, here is another version you may use to capture any content following sample_id:
out <- sub("^.*\\bsample_id=([^;]+).*$", "\\1", x)
out
You could try the str_extract approach, which uses the stringr package.
If your data is one record per element, you can do:
library(stringr)
str_extract(x, "(?<=\\bsample_id=)([:digit:]+)") # target the series of digits that is preceded by sample_id=; the + keeps matching so all of the digits are captured
This extracts just the number from each element. If all of your data is collected in one string, it becomes a tad more difficult, because you have to tell the extraction to keep going after the first match. The code would look something like this:
str_extract_all(x, "((?<=sample_id=)\\d+)")
This extracts all of the numbers you're looking for, and the output will be a list. From there you can manipulate the list as you see fit.
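For instance (a small sketch, not from the original answer, assuming the x vector defined in the earlier answer and stringr loaded), the list can be collapsed back into a plain vector with unlist:
unlist(str_extract_all(x, "(?<=sample_id=)\\d+"))
#> [1] "10221108" "321108"   "22"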

extracting only relevant comments from a list of comments

Continuing with my exploration into text analysis, I have encountered yet another roadblock. I understand the logic but don't know how to do it in R.
Here's what I want to do:
I have 2 CSVs: 1. one containing 10,000 comments, 2. one containing a list of words.
I want to select all those comments that contain any of the words in the 2nd CSV. How can I go about it?
example:
**CSV 1:**
this is a sample set
the comments are not real
this is a random set of words
hope this helps the problem case
thankyou for helping out
i have learned a lot here
feel free to comment
**CSV 2**
sample
set
comment
**Expected output:**
this is a sample set
the comments are not real
this is a random set of words
feel free to comment
Please note:
different forms of the words are also considered; e.g., comment and comments are both counted as matches.
We can use grep after pasting together the elements of the second dataset.
v1 <- scan("file2.csv", what ="")
lines1 <- readLines("file1.csv")
grep(paste(v1, collapse="|"), lines1, value=TRUE)
#[1] "this is a sample set" "the comments are not real"
#[3] "this is a random set of words" "feel free to comment"
First create two objects called lines and words.to.match from your files. You could do it like this:
lines <- read.csv('csv1.csv', stringsAsFactors=F)[[1]]
words.to.match <- read.csv('csv2.csv', stringsAsFactors=F)[[1]]
Let's say they look like this:
lines <- c(
'this is a sample set',
'the comments are not real',
'this is a random set of words',
'hope this helps the problem case',
'thankyou for helping out',
'i have learned a lot here',
'feel free to comment'
)
words.to.match <- c('sample', 'set', 'comment')
You can then compute the matches with two nested *apply-functions:
matches <- mapply(
  function(words, line)
    any(sapply(words, grepl, line, fixed=T)),
  list(words.to.match),
  lines
)
matched.lines <- lines[which(matches)]
What's going on here? I use mapply to compute a function over each line in lines, taking words.to.match as the other argument. Note that the cardinality of list(words.to.match) is 1. I just recycle this argument across each application. Then, inside the mapply function I call an sapply function to check whether any of the words match the line (I check for the match via grepl).
This is not necessarily the most efficient solution, but it's a bit more intelligible to me. Another way you could compute matches is:
matches <- lapply(words.to.match, grepl, lines, fixed=T)
matches <- do.call("rbind", matches)
matches <- apply(matches, c(2), any)
I dislike this solution because you need to do a do.call("rbind",...), which is a bit hacky.
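If the do.call("rbind", ...) step is the part that bothers you, a minimal alternative sketch (same idea, just combining the per-word logical vectors with Reduce) is:
matches <- Reduce(`|`, lapply(words.to.match, grepl, lines, fixed=T))
matched.lines <- lines[matches]
matched.lines
#> [1] "this is a sample set"          "the comments are not real"
#> [3] "this is a random set of words" "feel free to comment"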

R: how to get grep to return the match, rather than the whole string

I have what is probably a really dumb grep in R question. Apologies, because this seems like it should be so easy - I'm obviously just missing something.
I have a vector of strings, let's call it alice. Some of alice is printed out below:
T.8EFF.SP.OT1.D5.VSVOVA#4
T.8EFF.SP.OT1.D6.LISOVA#1
T.8EFF.SP.OT1.D6.LISOVA#2
T.8EFF.SP.OT1.D6.LISOVA#3
T.8EFF.SP.OT1.D6.VSVOVA#4
T.8EFF.SP.OT1.D8.VSVOVA#3
T.8EFF.SP.OT1.D8.VSVOVA#4
T.8MEM.SP#1
T.8MEM.SP#3
T.8MEM.SP.OT1.D106.VSVOVA#2
T.8MEM.SP.OT1.D45.LISOVA#1
T.8MEM.SP.OT1.D45.LISOVA#3
I'd like grep to give me the number after the D that appears in some of these strings, conditional on the string containing "LIS", and an empty string (or something similar) otherwise.
I was hoping that grep would return me the value of a capturing group rather than the whole string. Here's my R-flavoured regexp:
pattern <- "(?<=\\.D)([0-9]+)(?=.LIS)"
nothing too complicated. But in order to get what I'm after, rather than just using grep(pattern, alice, value = TRUE, perl = TRUE) I'm doing the following, which seems bad:
reg.out <- regexpr(
"(?<=\\.D)[0-9]+(?=.LIS)",
alice,
perl=TRUE
)
substr(alice,reg.out,reg.out + attr(reg.out,"match.length")-1)
Looking at it now it doesn't seem too ugly, but the amount of messing about it's taken to get this utterly trivial thing working has been embarrassing. Anyone any pointers about how to go about this properly?
Bonus marks for pointing me to a webpage that explains the difference between whatever I access with $, @ and attr.
Try the stringr package:
library(stringr)
str_match(alice, ".*\\.D([0-9]+)\\.LIS.*")[, 2]
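With just the twelve strings printed above (the full alice vector may of course be longer), this returns the captured digits where the pattern matches and NA elsewhere:
str_match(alice, ".*\\.D([0-9]+)\\.LIS.*")[, 2]
#> [1] NA   "6"  "6"  "6"  NA   NA   NA   NA   NA   NA   "45" "45"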
You can do something like this:
pat <- ".*\\.D([0-9]+)\\.LIS.*"
sub(pat, "\\1", alice)
If you only want the subset of alice where your pattern matches, try this:
pat <- ".*\\.D([0-9]+)\\.LIS.*"
sub(pat, "\\1", alice[grepl(pat, alice)])
