Finding consecutive values in a string in R

I am trying to find 3 or more consecutive "a" within the last 10 characters of each string in my data frame. My data frame looks like this:
V1
aaashkjnlkdjfoin
jbfkjdnsnkjaaaas
djshbdkjaaabdfkj
jbdfkjaaajbfjna
ndjksnsjksdnakns
aaaandfjhsnsjna
I have written some code, but it only returns the number of consecutive "a" within the whole string. I want it to look only at the last 10 characters and then print the strings in which the consecutive "a" are found. The code I have written is:
out: [1] 3
I want my output to look like this:
jbfkjdnsnkjaaaas
djshbdkjaaabdfkj
jbdfkjaaajbfjna
Can anyone help?

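Using regex, you could do:
string <- c("aaashkjnlkdjfoin", "jbfkjdnsnkjaaaas", "djshbdkjaaabdfkj",
"jbdfkjaaajbfjna", "ndjksnsjksdnakns", "aaaandfjhsnsjna")
grep("(?=.{10}$).*?a{3,}", string, perl = TRUE, value = TRUE)
[1] "jbfkjdnsnkjaaaas" "djshbdkjaaabdfkj" "jbdfkjaaajbfjna"
The lookahead (?=.{10}$) only succeeds at the position where exactly 10 characters remain before the end of the string, so the run of "a" has to be found within those last 10 characters.
If you have a data frame and need to subset it:
subset(df, grepl("(?=.{10}$).*?a{3}", V1, perl = TRUE))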
V1
2 jbfkjdnsnkjaaaas
3 djshbdkjaaabdfkj
4 jbdfkjaaajbfjna
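For comparison, the same check can be sketched without a lookahead by cutting out the last 10 characters first and then searching only that piece. This is a minimal base R sketch rather than the method used in the answer; tail10 and has_run are names made up for illustration, and unlike the regex above it also examines strings shorter than 10 characters in full.

```r
string <- c("aaashkjnlkdjfoin", "jbfkjdnsnkjaaaas", "djshbdkjaaabdfkj",
            "jbdfkjaaajbfjna", "ndjksnsjksdnakns", "aaaandfjhsnsjna")

# Last 10 characters of each string (the whole string if it is shorter)
tail10 <- substring(string, pmax(nchar(string) - 9, 1))

# TRUE where that tail contains three or more consecutive "a"
has_run <- grepl("a{3,}", tail10)

string[has_run]
```

For a data frame, the same logical vector could be used as a row index, for example df[has_run, , drop = FALSE] after computing it from df$V1.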

Related

Remove part of string after 3-digit number

I would like to substitute the strings in the list by cutting each string after the first 3-digit number.
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
I would like for it to look something like
a <- c("MTH314","LB471","PHY472")
I have tried something like
b <- gsub("[100-999].*","",a)
but it returns c("MTH","LB","PHY") without the first number

A possible solution, based on stringr::str_remove:
library(stringr)
a <- c("MTH314PHY410","LB471LB472","PHY472CHM141")
str_remove(a, "(?<=\\d{3}).*")
#> [1] "MTH314" "LB471" "PHY472"

Another option, using str_extract (loading stringr also makes the %>% pipe available):
library(stringr)
c("MTH314PHY410","LB471LB472","PHY472CHM141") %>%
  str_extract('.+?\\d{3}')
[1] "MTH314" "LB471" "PHY472"
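As a side note on why the attempt in the question fails: inside square brackets, [100-999] is a character class, not a numeric range. It matches a single character that is 1, 0, anything from 0 to 9, or 9, which amounts to any one digit, so everything from the first digit onward is removed. Below is a base R sketch of one way to keep the first 3-digit number, assuming a capture group with sub() is acceptable; it is not taken from the answers above, and b is just an illustrative name.

```r
a <- c("MTH314PHY410", "LB471LB472", "PHY472CHM141")

# Capture everything up to and including the first run of three digits,
# then replace the whole string with just that captured part
b <- sub("^(.*?\\d{3}).*$", "\\1", a, perl = TRUE)
b
```

The lazy .*? matters here: a greedy .* would run to the last three digits in each string and return the input unchanged.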

R integer (date) to number

I have a matrix date that looks like this:
Date Time
1 2017-05-19 08:52:21
2
3 2017-05-20 22:29:29
4 2017-05-20 15:21:35
Both date$Date and date$Time are integers.
I would like to obtain a new column like this:
Date Time
1 20170519 085221
2 NA NA
3 20170520 222929
4 20170520 152135
I've tried as.character, as.numeric, as.Date, and so on, but can't find a solution.
Sorry if this question was already answered in another post; I wasn't able to find it!

You need format...
format(as.POSIXct("2017-05-19"),"%Y%m%d")
[1] "20170519"
format(as.POSIXct("08:52:21",format="%H:%M:%S"),"%H%M%S")
[1] "085221"
See ?strptime for the formatting codes.
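Applied to whole columns, the format() approach might look like the sketch below. It assumes the columns can be turned into text with as.character() (for example if they are stored as factors) and that empty cells are empty strings. The data frame df is made up to resemble the question, and tz = "UTC" is set only so the time parsing does not depend on the local time zone.

```r
df <- data.frame(
  Date = c("2017-05-19", "", "2017-05-20", "2017-05-20"),
  Time = c("08:52:21", "", "22:29:29", "15:21:35"),
  stringsAsFactors = TRUE  # mimic columns that are stored as factors
)

# Parse with an explicit format so empty strings become NA, then re-format
df$Date <- format(as.Date(as.character(df$Date), format = "%Y-%m-%d"), "%Y%m%d")
df$Time <- format(as.POSIXct(as.character(df$Time), format = "%H:%M:%S", tz = "UTC"),
                  "%H%M%S")
df
```

The results are character strings, which is what preserves the leading zero in the first Time value.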

Since you apparently don't necessarily want date or time class objects (do you?), and since you don't further specify what exactly you need this for, there seems no need to work with date or time functions.
You could try this:
Step 1: First, if you want empty cells to contain NA, fill those in per column
df$Date[df$Date == ""] <- NA
df$Time[df$Time == ""] <- NA
Step 2: And then simply replace the "-" and ":" in the Date and Time values, respectively, to get the wanted strings
df$Date <- gsub(pattern = "-", x = df$Date, replacement = "")
df$Time <- gsub(pattern = ":", x = df$Time, replacement = "")
Date Time
1 20170519 85221
2 <NA> <NA>
3 20170520 222929
4 20170520 152135
The result may not be of class integer (my test df, built to resemble yours, did not contain integers, so I can't double-check; my results were character vectors). If you really want integers, simply apply as.integer().
As you can see, the output is the same as your expected output, except for the leading "0" of the row 1 Time value. If need be, there's a workaround to get that in there, although I'm not sure what that would add, since it would disappear again after applying as.integer().
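The workaround mentioned above is not spelled out in the answer. One possibility, sketched here under the assumption that the times have ended up as integers, is sprintf() with a zero-padded width of 6. The vector time_int is made up for illustration.

```r
time_int <- c(85221L, NA, 222929L, 152135L)

# Pad to six digits with leading zeros; keep missing values as NA
time_chr <- ifelse(is.na(time_int), NA_character_, sprintf("%06d", time_int))
time_chr

# Converting back to integer drops the leading zero again
as.integer(time_chr)
```

The padded values are necessarily character strings, since an integer cannot store a leading zero.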

How do I find a subtext without comma using regex in R?

I have a data frame as:
result <- c('Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)', 'Ab1 : 4 ug/mL(S), Ab2(disk); <2mm(R)')
df <- data.frame(result)
How can I check whether '(R)' appears after 'Ab1' (antibiotic 1), before the comma?
grep("Ab1[[:print:]]*\\(R\\)", result)
gives
[1] 1 2
while the result I want is
[1] 1

Try this:
grep("Ab1[^(]*?\\(R\\)", result)
[1] 1
Ab1 matches 'Ab1' literally
[^(]*? matches anything besides an opening parenthesis, non-greedily
\\(R\\) matches '(R)' literally
In the second string, '(R)' cannot be reached from 'Ab1' without first passing an opening parenthesis (the one in '(S)'), hence only the first string matches.
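To see the two patterns side by side, here is a small sketch using grepl(), which returns one TRUE/FALSE per string. The lazy quantifier is left out because [^(]* already cannot run past an opening parenthesis; that simplification is an assumption of this sketch, not part of the answer. The names loose and strict are illustrative.

```r
result <- c("Ab1 : 256 ug/mL(R), Ab2(disk); 18mm(S)",
            "Ab1 : 4 ug/mL(S), Ab2(disk); <2mm(R)")

# [[:print:]]* can skip over "(S), Ab2(disk); <2mm" to reach a later "(R)"
loose <- grepl("Ab1[[:print:]]*\\(R\\)", result)

# [^(]* stops at the first "(", so "(R)" must be the first parenthesised item
strict <- grepl("Ab1[^(]*\\(R\\)", result)

loose
strict
```

Calling which(strict) gives the same kind of index vector that grep() returns.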

R: Make character string refer to an object

I have a large list of files (file1, file2, file3, etc.) and, for each analysis, I want to refer to two files from this list (e.g. function(file1,file2)). When I try to do this using paste0("file", pairs[1,x]), I get back the character string "file1" rather than the object file1.
How can I refer to the objects rather than create a character string?
Thank you very much!
Additional comment:
pairs is a 2xn matrix where each column is the combination of files for one analysis (e.g. pairs[1,1] = 1 and pairs[2,1] = 2 for the comparison between file1 and file2).

Are you looking for get()?
> a <- 1:5
> get("a")
[1] 1 2 3 4 5
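Applied to the question's setup, get() can be combined with paste0() as sketched below. The objects file1 to file3 and the function compare_files() are placeholders invented for illustration; pairs is built with combn() so that each column holds one pair, as described in the question.

```r
file1 <- c(1, 2, 3)
file2 <- c(4, 5, 6)
file3 <- c(7, 8, 9)

# Hypothetical analysis function taking two objects
compare_files <- function(x, y) sum(x) - sum(y)

pairs <- combn(3, 2)  # 2 x n matrix, one column per pair

# Look up each object by its constructed name, then run the analysis
results <- sapply(seq_len(ncol(pairs)), function(x) {
  first  <- get(paste0("file", pairs[1, x]))
  second <- get(paste0("file", pairs[2, x]))
  compare_files(first, second)
})
results
```

If the files can be read into a named list instead of separate objects, indexing that list by name is often easier to manage than looking up objects with get().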

How to get the variable from a string containing the variable name:
> a = 10
> string = "a"
> string
[1] "a"
> eval(parse(text = string))
[1] 10
> eval(parse(text = "a"))
[1] 10
Hope this helps.

Another alternative, using as.name() (shown here for an object called file1):
eval(as.name("file1"))

R: Find pattern and get the values in between

I am using readLines() to read the HTML source of a site. Almost every line of the code contains a pattern of the form <td>VALUE1<td>VALUE2<td>. I would like to extract the values in between the <td> tags. I tried some combinations such as:
output <- gsub(pattern='(.*<td>)(.*)(<td>.*)(.*)(.*<td>)',replacement='\\2',x='<td>VALUE1<td>VALUE2<td>')
but the output contains only one of the values. Any idea how to get all of them?

string <- "<td>VALUE1<td>VALUE2<td>"
# one-liner: extract every run of word characters enclosed by <td> tags
regmatches(string, gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE))
# step by step: gregexpr() returns the start positions and lengths of the matches
indices <- gregexpr("(?<=<td>)\\w+(?=<td>)", string, perl = TRUE)
indices[[1]]
# [1]  5 15
# attr(,"match.length")
# [1] 6 6
# (further attributes omitted)
# i.e. there are two matches, starting at positions 5 and 15,
# and each of them has length 6
# regmatches() then extracts the substrings at those positions
regmatches(string, indices)
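If the values may contain characters that \\w does not match (spaces, for instance), splitting on the tag is another option. This is a minimal base R sketch, assuming every value is delimited by the literal text <td> as in the question; parts and values are illustrative names.

```r
string <- "<td>VALUE1<td>VALUE2<td>"

# Split on the literal tag; the leading tag yields an empty first piece
parts <- strsplit(string, "<td>", fixed = TRUE)[[1]]

# Drop empty pieces to keep only the values
values <- parts[nzchar(parts)]
values
```

For real pages with nested or attribute-carrying tags, a proper HTML parser is likely to be more robust than either approach.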

Did you take a look at the "XML" package that can extract tables from HTML? You probably need to provide more context of the entire message that you are trying to parse so that we could see if it might be appropriate.
