I have managed to extract all values starting with "N_" from a string; however, I can't extract only those whose number falls within a certain range.
The data is in an R data frame and looks something like this:
V1 N_words
(N_R33A, N_R35B, N_T44N, J_T7B) N_R33A, N_R35B, N_T44N
My desired output would keep only the N_ values whose number falls within a specific range:
V1 N_words (range 30-35)
(N_R33A, N_R35B, N_T44N, J_T7B) N_R33A, N_R35B
The code I am using only extracts the N_ values, and I don't seem to be able to select a range. I am also adding the extracted words to my x data frame as a new column:
x$N_words = str_extract_all(x$V1, "N_([A-Z]|[0-9])+")
One option is modifying the regex:
x = "(N_R33A, N_R35B, N_T44N, J_T7B)"
str_extract_all(x, "N_[A-Z]3[0-5][A-Z]")
# [[1]]
# [1] "N_R33A" "N_R35B"
matches N_
followed by an uppercase letter ([A-Z])
followed by 3
followed by 0, 1, 2, 3, 4 or 5 ([0-5])
followed by an uppercase letter ([A-Z])
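If the range is awkward to express as a character class (28-41, say), another option is to extract every N_ token first and then filter on its numeric part. The following is a minimal base-R sketch, assuming every token has the shape N_, one uppercase letter, digits, one uppercase letter; the helper name extract_n_range is made up for illustration:

```r
# Hypothetical helper: keep the N_ tokens whose numeric part lies in [lo, hi].
extract_n_range <- function(strings, lo, hi) {
  # All tokens shaped like N_<letter><digits><letter>, one vector per input string
  tokens <- regmatches(strings, gregexpr("N_[A-Z][0-9]+[A-Z]", strings))
  lapply(tokens, function(tok) {
    # Drop everything that is not a digit, then compare as a number
    num <- as.integer(gsub("[^0-9]", "", tok))
    tok[num >= lo & num <= hi]
  })
}

x <- "(N_R33A, N_R35B, N_T44N, J_T7B)"
extract_n_range(x, 30, 35)
# [[1]]
# [1] "N_R33A" "N_R35B"
```

Like str_extract_all, this returns a list with one element per input string, so it can be assigned to a data frame column in the same way.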
I am trying to extract the most recent date on which a report was added from an R data frame of reports. The text always looks like Date Ordered: M/DD/YYYY and may appear zero or many times in a given report. If it repeats, I want the most recent (usually the last) instance, and I'm trying to convert it to a date in a column created with dplyr's mutate.
Using the code below on my actual dataframe, I get the error:
Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed
However, it runs fine on a single item making me think that it's trying to concatenate the entire column.
The test code doesn't give an error, but actually pulls the last date from the last report for all instances:
lastdate
1 1999-04-15
2 1999-04-15
dataset=data.frame(cbind(ID=c(001,002),
Report=c("Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
"Meh Date Ordered: 4/15/1999")),
stringsAsFactors = F)
dataset %>%
mutate(lastdate = as.Date(last(gsub("Date Ordered:\\s+", "",
strapplyc(Report,
"Date Ordered:\\s*\\d+/\\d+/\\d+", simplify = TRUE))),
"%m/%d/%Y"))
Desired output should be:
2005-4-2
1999-4-15
Actual for dataset:
Error in if (nchar(s) > 0 && substring(s, 1, 1) == "\002") { :
missing value where TRUE/FALSE needed
Actual for test data:
lastdate
1 1999-04-15
2 1999-04-15
I suggest a gsub like
dataset$lastdate <- as.Date(gsub(".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*","\\1", dataset$Report),"%m/%d/%Y")
The regex matches:
.* - any 0+ chars as many as possible
Date Ordered: - a literal substring
\s* - 0+ whitespaces
(\d{1,2}/\d{1,2}/\d{4}) - Capturing group 1 (\1): 1 or 2 digits, /, 1 or 2 digits, /, 4 digits
.* - the rest of the string
| - or
.* - the entire string.
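Because the leading .* is greedy, the captured date is the last one in each report, and the |.* alternative blanks out reports that contain no date at all, which as.Date then turns into NA. A small sketch of that behaviour on a plain character vector; the third entry is an assumed extra case that is not part of the original data:

```r
reports <- c(
  "Blah Blah Date Ordered: 5/19/2000 test is positive. Date Ordered: 4/2/2005 Additional testing negative.",
  "Meh Date Ordered: 4/15/1999",
  "No order recorded"  # assumed extra case: no date present
)

pattern <- ".*Date Ordered:\\s*(\\d{1,2}/\\d{1,2}/\\d{4}).*|.*"

# Replace each whole string with its last date (or "" when there is none)
last_txt <- gsub(pattern, "\\1", reports)

# Empty strings do not parse with this format and become NA
last_date <- as.Date(last_txt, format = "%m/%d/%Y")
last_date
# [1] "2005-04-02" "1999-04-15" NA
```

Since gsub is vectorised over the column, there is no need for last(), which collapses the whole column to a single value.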
I have a string in R as
x <- "The length of the word is going to be of nice use to me"
I want the first 10 words of the above specified string.
Also for example I have a CSV file where the format looks like this :-
Keyword,City(Column Header)
The length of the string should not be more than 10,New York
The Keyword should be of specific length,Los Angeles
This is an experimental basis program string,Seattle
Please help me with getting only the first ten words,Boston
I want to get only the first 10 words from the column 'Keyword' for each row and write them to a CSV file.
Please help me in this regard.
Regular expression (regex) answer using \w (word character) and its negation \W:
gsub("^((\\w+\\W+){9}\\w+).*$","\\1",x)
^ Beginning of the token (zero-width)
((\\w+\\W+){9}\\w+) Ten words separated by not-words.
(\\w+\\W+){9} A word followed by not-a-word, 9 times
\\w+ One or more word characters (i.e. a word)
\\W+ One or more non-word characters (i.e. a space)
{9} Nine repetitions
\\w+ The tenth word
.* Anything else, including other following words
$ End of the token (zero-width)
\\1 when this token found, replace it with the first captured group (the 10 words)
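One point worth checking is what happens with short strings: the pattern needs at least ten words to match, so anything shorter is returned unchanged. A quick sketch on a two-element vector; the second string is an assumed example:

```r
keywords <- c(
  "The length of the word is going to be of nice use to me",
  "Only five words in here"  # assumed short example: fewer than ten words
)

# gsub is vectorised, so the same call works on a whole column
first_ten <- gsub("^((\\w+\\W+){9}\\w+).*$", "\\1", keywords)
first_ten
# [1] "The length of the word is going to be of"
# [2] "Only five words in here"
```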
How about using the word function from Hadley Wickham's stringr package?
word(string = x, start = 1, end = 10, sep = fixed(" "))
Here is a small function that splits the string into words, keeps the first ten, and pastes them back together.
string_fun <- function(x) {
ul = unlist(strsplit(x, split = "\\s+"))[1:10]
paste(ul,collapse=" ")
}
string_fun(x)
df <- read.table(text = "Keyword,City(Column Header)
The length of the string should not be more than 10 is or are in,New York
The Keyword should be of specific length is or are in,Los Angeles
This is an experimental basis program string is or are in,Seattle
Please help me with getting only the first ten words is or are in,Boston", sep = ",", header = TRUE)
df <- as.data.frame(df)
Using apply (the function isn't doing anything in the second column)
df$Keyword <- apply(df[,1:2], 1, string_fun)
EDIT
Probably this is a more general way to use the function.
df[,1] <- as.character(df[,1])
df$Keyword <- unlist(lapply(df[,1], string_fun))
print(df)
# Keyword City.Column.Header.
# 1 The length of the string should not be more than New York
# 2 The Keyword should be of specific length is or are Los Angeles
# 3 This is an experimental basis program string is or Seattle
# 4 Please help me with getting only the first ten Boston
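The question also asks for the result to be written to a CSV file. A possible end-to-end sketch follows, assuming a small stand-in data frame and a temporary output path (both are placeholders, as is the helper name first_words). It uses head() rather than [1:10], because indexing past the end of a short string pads with NA, which paste would then turn into the text "NA":

```r
# Assumed example data; in practice this would come from read.csv() on the real file
df <- data.frame(
  Keyword = c("The length of the string should not be more than 10 is or are in",
              "Short keyword"),
  City = c("New York", "Boston"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep at most the first n whitespace-separated words
first_words <- function(s, n = 10) {
  words <- strsplit(s, "\\s+")[[1]]
  paste(head(words, n), collapse = " ")
}

df$Keyword <- vapply(df$Keyword, first_words, character(1), USE.NAMES = FALSE)

out_file <- tempfile(fileext = ".csv")  # placeholder path
write.csv(df, out_file, row.names = FALSE)
```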
x <- "The length of the word is going to be of nice use to me"
head(strsplit(x, split = " ")[[1]], 10)
I have a string of names in the following format:
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
I am trying to extract the single digit after the second hyphen. There are instances where there will be a third hyphen and an additional digit at the end of the name. The desired output is:
1, 2, 1, 2
I assume that I will need to use sub/gsub but am not sure where to start. Any suggestions?
We can use sub to match, from the start (^) of the string, zero or more characters that are not a - ([^-]*), followed by a -, followed again by zero or more characters that are not a -, followed by another -, with the digit that follows captured as a group. In the replacement, we use the backreference to the captured group (\\1):
as.integer(sub("^[^-]*-[^-]*-(\\d).*", "\\1", names))
#[1] 1 2 1 2
Or this can be modified to
as.integer(sub("^([^-]*-){2}(\\d).*", "\\2", names))
#[1] 1 2 1 2
Here's an alternative using stringr
library("stringr")
names <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")
output = str_split_fixed(names, pattern = "-", n = 4)[,3]
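The same split-and-pick idea also works without stringr. A base-R sketch, using names_vec as the variable name to avoid masking the base function names():

```r
names_vec <- c("Q-1234-1", "Q-1234-2", "Q-1234-1-8", "Q-1234-2-8")

# Split on the literal hyphen, then take the third piece of each name
parts <- strsplit(names_vec, "-", fixed = TRUE)
third <- as.integer(vapply(parts, `[`, character(1), 3))
third
# [1] 1 2 1 2
```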
There are functions in Excel called left, right, and mid, where you can extract part of the entry from a cell. For example, =left(A1, 3) would return the 3 leftmost characters in cell A1, and =mid(A1, 3, 4) would start with the third character in cell A1 and give you characters 3 to 6. Are there similar functions in R or similarly straightforward ways to do this?
As a simplified sample problem I would like to take a vector
sample<-c("TRIBAL","TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")
and create 3 new vectors that contain the first 3 characters in each entry, the middle 2 characters in each entry, and the last 4 characters in each entry.
A slightly more complicated question that Excel doesn't have a function for (that I know of) would be how to create a new vector with the 1st, 3rd, and 5th characters from each entry.
You are looking for the function substr or its close relative substring:
The leading characters are straightforward:
substr(sample, 1, 3)
[1] "TRI" "TRI" "RHO" "EUG" "BYR"
So is extracting some characters at a defined position:
substr(sample, 2, 3)
[1] "RI" "RI" "HO" "UG" "YR"
To get the trailing characters, you have two options:
substr(sample, nchar(sample)-3, nchar(sample))
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
substring(sample, nchar(sample)-3)
[1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
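If the Excel names are more comfortable, the calls above can be wrapped in small helpers. This is only a sketch: left, right and mid are hypothetical names, not base R functions, and the vector is called codes here because sample masks the base function sample():

```r
# Hypothetical Excel-style helpers built on substr()
left  <- function(x, n) substr(x, 1, n)
right <- function(x, n) substr(x, nchar(x) - n + 1, nchar(x))
mid   <- function(x, start, n) substr(x, start, start + n - 1)

codes <- c("TRIBAL", "TRISTO", "RHOSTO", "EUGFRI", "BYRRAT")

left(codes, 3)    # first 3 characters
# [1] "TRI" "TRI" "RHO" "EUG" "BYR"
mid(codes, 3, 2)  # 2 characters starting at position 3
# [1] "IB" "IS" "OS" "GF" "RR"
right(codes, 4)   # last 4 characters
# [1] "IBAL" "ISTO" "OSTO" "GFRI" "RRAT"
```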
And your final "complicated" question:
characters <- function(x, pos){
sapply(x, function(x)
paste(sapply(pos, function(i)substr(x, i, i)), collapse=""))
}
characters(sample, c(1,3,5))
TRIBAL TRISTO RHOSTO EUGFRI BYRRAT
"TIA" "TIT" "ROT" "EGR" "BRA"