I'm trying to extract the nth character onwards in a string, using R. Here's my data:
StringField
example_string1
example_string2
example_string3
example_string4
example_string5
example_string6
example_string7
example_string8
example_string9
example_string10
example_string11
example_string12
I want to extract only the numbers after example_string, so the result would be:
1
2
3
4
5
6
7
8
9
10
11
12
I've tried something along the lines of:
df$unique_number <- substr(df$stringField, 15:)
to indicate I want everything from the 15th position onward, till the end of the string. Is there an easy way to accomplish what I'm trying to do?
Here is an easy option using sub. We can capture the final digits in the input, and then replace with only that captured quantity.
x <- "example_string10"
num <- sub("^.*?(\\d+)$", "\\1", x)
num
[1] "10"
x <- "example_string10"
substr(x, 15, 20)
#> [1] "10"
Created on 2020-02-06 by the reprex package (v0.3.0)
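For the literal request in the question (everything from position 15 to the end of the string), a minimal sketch with base R's substring() may be simpler, because its last argument defaults to a very large value and no end position has to be guessed. The data frame below is rebuilt from the question's sample and is only an assumption about how the real data looks.

```r
# Sample data rebuilt from the question (assumed layout)
df <- data.frame(StringField = paste0("example_string", 1:12),
                 stringsAsFactors = FALSE)

# substring() runs to the end of each string when `last` is omitted
df$unique_number <- as.numeric(substring(df$StringField, 15))
df$unique_number
```

This relies on every value having the same 14-character prefix; if the prefix length varies, the regex-based answers are the safer choice.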
Replace each non-digit (\D) with an empty string and convert to numeric:
transform(df, unique_number = as.numeric(gsub("\\D", "", StringField)))
Note
We used this as input:
df <- data.frame(StringField = c("example_string1", "example_string2",
"example_string3"), stringsAsFactors = FALSE)
df %>% tidyr::extract(StringField, into = "nmb", "([0-9]+)")
If you are interested in extracting only numbers from a string, this can be a solution:
library(stringr)
as.numeric(str_extract(df$StringField, "\\d+"))
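One caveat when choosing among the regex answers above: removing every non-digit keeps all digits in the string, while the anchored sub() pattern keeps only the trailing number. The sketch below compares the two in base R; the second input value is a hypothetical string that is not in the question, and perl = TRUE is added here only so the lazy quantifier follows PCRE rules.

```r
# Second element is a made-up input with a digit in the prefix
x <- c("example_string10", "example2_string10")

# Strip every non-digit: all digits are glued together
all_digits <- gsub("\\D", "", x)

# Capture only the digits at the end of the string
trailing <- sub("^.*?(\\d+)$", "\\1", x, perl = TRUE)

all_digits
trailing
```

For the data shown in the question both approaches agree; they only differ when digits appear before the final number.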
['ax', 'byc', 'crm', 'dop']
This is a character string, and I want a count of all substrings, i.e. 4 here as output. I want to do this for the entire column containing such strings.
We may use str_count
library(stringr)
str_count(str1, "\\w+")
[1] 4
Or we may extract the alphanumeric substrings into a list and get the lengths
lengths(str_extract_all(str1, "[[:alnum:]]+"))
If it is a data.frame column, extract the column as a vector and apply str_count
str_count(df1$str1, "\\w+")
data
str1 <- "['ax', 'byc', 'crm', 'dop']"
df1 <- data.frame(str1)
Here are a few base R approaches. We use the 2 row input defined reproducibly in the Note at the end. No packages are used.
lengths(strsplit(DF$b, ","))
## [1] 4 4
nchar(gsub("[^,]", "", DF$b)) + 1
## [1] 4 4
count.fields(textConnection(DF$b), ",")
## [1] 4 4
Note
DF <- data.frame(a = 1:2, b = "['ax', 'byc', 'crm', 'dop']")
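The comma-based counts above assume every string holds at least one item. As a hedged sketch, counting the quoted items directly with gregexpr() and regmatches() also handles an empty list; the "[]" value below is a hypothetical input that does not appear in the question.

```r
# Second element is a made-up empty list
b <- c("['ax', 'byc', 'crm', 'dop']", "[]")

# Counting commas and adding one reports 1 for the empty list
by_commas <- nchar(gsub("[^,]", "", b)) + 1

# Counting the single-quoted items reports 0 for the empty list
by_items <- lengths(regmatches(b, gregexpr("'[^']*'", b)))

by_commas
by_items
```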
Consider the following dataframe:
status
1 file-status-done-bad
2 file-status-maybe-good
3 file-status-underreview-good
4 file-status-complete-final-bad
We want to extract the second-to-last part of status, where parts are delimited by -, like so:
status status_extract
1 file-status-done-bad done
2 file-status-maybe-good maybe
3 file-status-underreview-good underreview
4 file-status-complete-final-bad final
In SQL this is easy, select split_part(status, '-', -2).
However, the solutions I've seen with R either operate on vectors or are messy to extract particular elements (they return ALL elements). How is this done in a mutate chain? The below is a failed attempt.
df %>%
mutate(status_extract = str_split_fixed(status, pattern = '-')[[-2]])
Found a really simple answer.
library(tidyverse)
df %>%
mutate(status_extract = word(status, -2, sep = "-"))
In base R you can combine the functions sapply and strsplit
df$status_extract <- sapply(strsplit(df$status, "-"), function(x) x[length(x) - 1])
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
You can use map() and nth() to extract the nth value from a vector.
library(tidyverse)
df %>%
mutate(status_extract = map_chr(str_split(status, "-"), nth, -2))
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
which is equivalent to a base version like
sapply(strsplit(df$status, "-"), function(x) rev(x)[2])
# [1] "done" "maybe" "underreview" "final"
You can use regex to get what you want without splitting the string.
sub('.*-(\\w+)-.*$', '\\1', df$status)
#[1] "done" "maybe" "underreview" "final"
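If the SQL split_part() behaviour is wanted as a reusable function, the base R answers above can be wrapped in a small helper. This is only a sketch: the name split_part_r and its argument order are hypothetical, and it treats the separator as a fixed string rather than a regex.

```r
# Hypothetical helper mimicking split_part(): negative n counts from the end
split_part_r <- function(x, sep, n) {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    i <- if (n < 0) length(p) + n + 1 else n
    # Return NA when the requested part does not exist
    if (i < 1 || i > length(p)) NA_character_ else p[i]
  }, character(1))
}

status <- c("file-status-done-bad",
            "file-status-maybe-good",
            "file-status-underreview-good",
            "file-status-complete-final-bad")

status_extract <- split_part_r(status, "-", -2)
status_extract
```

Because it takes and returns a plain character vector, it can also be called inside mutate(), for example mutate(status_extract = split_part_r(status, "-", -2)).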
I have a dataset where column names have prefixes (corresponding to panel waves), e.g.
a_age
a_sex
a_jbstat
b_age
b_sex
b_jbstat
I would like to convert the prefixes into suffixes, so that it becomes:
age_a
sex_a
jbstat_a
age_b
sex_b
jbstat_b
I'd be grateful for suggestions on efficient ways of doing this.
You can use sub and backreference:
sub("([a-z])_([a-z]+)", "\\2_\\1", x)
[1] "age_a" "sex_a" "jbstat_a" "age_b" "sex_b" "jbstat_b"
The backreferences \\1 and \\2 recall the exact character strings matched by the two capturing groups: ([a-z]) is recalled by \\1 and ([a-z]+) by \\2. To obtain the desired change, these 'recollections' are simply swapped in the replacement argument to sub.
EDIT:
If the elements are column names, you can do this:
names(df) <- sub("([a-z])_([a-z]+)", "\\2_\\1", names(df))
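The pattern above is not anchored, so it swaps the first place it matches. A minimal sketch of an anchored variant follows; the names "id" and "a_job_status" are hypothetical and not from the question. It leaves names without a prefix untouched and moves the prefix past the whole remainder of the name.

```r
# "id" and "a_job_status" are made-up column names for illustration
nms <- c("id", "a_age", "b_sex", "a_job_status")

# Anchored: one-letter prefix, underscore, then everything else
new_nms <- sub("^([a-z])_(.+)$", "\\2_\\1", nms)
new_nms

# Unanchored pattern from the answer, applied to a name with two underscores
partial <- sub("([a-z])_([a-z]+)", "\\2_\\1", "a_job_status")
partial
```

Names that do not match the pattern are returned unchanged by sub(), which is why "id" survives as-is.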
One way to do it is to use a regex:
x <- c(
"a_age",
"a_sex",
"a_jbstat",
"b_age",
"b_sex",
"b_jbstat"
)
stringr::str_replace(x, "^([a-z]+)_([a-z]+)$", "\\2_\\1")
#> [1] "age_a" "sex_a" "jbstat_a" "age_b" "sex_b" "jbstat_b"
Created on 2020-05-25 by the reprex package (v0.3.0)
Edit: Full Example
df <- data.frame(
a_age = 1,
a_sex = 1,
b_age = 2,
b_sex = 2
)
df
#> a_age a_sex b_age b_sex
#> 1 1 1 2 2
names(df) <- stringr::str_replace(names(df), "^([a-z]+)_([a-z]+)$", "\\2_\\1")
df
#> age_a sex_a age_b sex_b
#> 1 1 1 2 2
Created on 2020-05-26 by the reprex package (v0.3.0)
I have a vector as below
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
Here I want to extract the first number before the "X" for each of the elements.
When an element contains two "X"s, e.g. "6X2X75ML", the number 12 (6 multiplied by 2) should be calculated.
expected output
6, 24, 12, 168
Thank you for the help...
Here's a possible solution using regular expressions:
data <- c("6X75ML","24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")
# this regular expression finds every group of digits followed
# by an upper-case 'X' in each string and returns a list of the matches
tokens <- regmatches(data,gregexpr('[[:digit:]]+(?=X)',data,perl=TRUE))
res <- sapply(tokens,function(x)prod(as.numeric(x)))
> res
[1] 6 24 12 168
Here is a method using base R:
dataList <- strsplit(data, split="X")
sapply(dataList, function(x) Reduce("*", as.numeric(head(x, -1))))
[1] 6 24 12 168
strsplit breaks up each element of the vector at "X". The resulting list is fed to sapply, which then performs an operation on all but the final element of each vector in the list. The operation is to convert the elements to numeric and then multiply them. The final element is dropped using head(x, -1).
As @zheyuan-li comments, prod can fill in for Reduce and will probably be a bit faster:
sapply(dataList, function(x) prod(as.numeric(head(x, -1))))
[1] 6 24 12 168
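The steps described above can be traced on a single element. This is a small sketch in base R using the question's data; vapply() is swapped in for sapply() only so that the result type is fixed, and the intermediate variable names are illustrative.

```r
data <- c("6X75ML", "24X37.5ML (KKK)", "6X2X75ML", "168X5CL (UUU)")

# Step 1: split every element at "X"
parts <- strsplit(data, split = "X")

# Step 2: look at the pieces of "6X2X75ML"
third <- parts[[3]]

# Step 3: drop the last piece (the size), keeping only the multipliers
multipliers <- head(third, -1)

# Step 4: multiply the remaining pieces, for every element
res <- vapply(parts, function(p) prod(as.numeric(head(p, -1))), numeric(1))
res
```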
We can also use str_extract_all
library(stringr)
sapply(str_extract_all(data, "\\d+(?=X)"), function(x) prod(as.numeric(x)))
#[1] 6 24 12 168
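# position of the first "X" in each element
ind <- regexpr("X", data)
# number before the first "X"
val <- as.integer(substr(data, 1, ind - 1))
# remainder of each string after the first "X"
data2 <- substring(data, ind + 1)
# does the remainder itself start with "<digits>X"?
ind2 <- regexpr("[0-9]+X", data2)
if (any(ind2 == 1)) {
  # second multiplier, taken only where the remainder starts with it
  val2 <- as.integer(substr(data2[ind2 == 1], 1, attr(ind2, "match.length")[ind2 == 1] - 1))
  val[ind2 == 1] <- val[ind2 == 1] * val2
}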
I have character vector of the following form (this is just a sample):
R1Ng(10)
test(0)
n.Ex1T(34)
where as can be seen above, the first part is always some combination of alphanumeric and punctuation marks, then there are parentheses with a number inside. I want to create a numeric vector which will store the values inside the parentheses, and each number should have name attribute, and the name attribute should be the string before the number. So, for example, I want to store 10, 0, 34, inside a numeric vector and their name attributes should be, R1Ng, test, n.Ex1T, respectively.
I can always do something like this to get the numbers and create a numeric vector:
counts <- regmatches(data, gregexpr("[[:digit:]]+", data))
as.numeric(unlist(counts))
But how can I extract the first string part and store it as the name attribute of that numeric vector?
How about this:
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name = gsub( "\\(.*", "", x),
Count = as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)))
# Name Count
# 1 R1Ng 10
# 2 test 0
# 3 n.Ex1T 34
Or alternatively as a vector
setNames(as.numeric(gsub(".*?\\((.*?)\\).*", "\\1", x)),
gsub( "\\(.*", "", x ))
# R1Ng test n.Ex1T
# 10 0 34
Here is another variation using the same expression and capturing parentheses:
temp <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")
data.frame(Name = gsub("^(.*)\\((\\d+)\\)$", "\\1", temp),
count = as.numeric(gsub("^(.*)\\((\\d+)\\)$", "\\2", temp)))
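The same capture-group pattern can also be applied only once with base R's regexec(), which returns the full match followed by each captured group. The sketch below builds the named numeric vector the question asks for; the variable names are illustrative.

```r
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")

# Each list element holds: full match, group 1 (name), group 2 (digits)
m <- regmatches(x, regexec("^(.*)\\((\\d+)\\)$", x))

nms  <- vapply(m, `[`, character(1), 2)
vals <- as.numeric(vapply(m, `[`, character(1), 3))

counts <- setNames(vals, nms)
counts
```

This assumes every element matches the pattern; an element without a trailing "(number)" would yield an empty match and make vapply() fail, which may be a useful check in itself.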
We can use str_extract_all
library(stringr)
lst <- str_extract_all(x, "[^()]+")
Or with strsplit from base R
lst <- strsplit(x, "[()]")
If we need to store as a named vector
sapply(lst, function(x) setNames(as.numeric(x[2]), x[1]))
# R1Ng test n.Ex1T
# 10 0 34
data
x <- c("R1Ng(10)", "test(0)", "n.Ex1T(34)")