Extract String Part to Column in R

Consider the following dataframe:
status
1 file-status-done-bad
2 file-status-maybe-good
3 file-status-underreview-good
4 file-status-complete-final-bad
We want to extract the second-to-last part of status, where parts are delimited by -, like so:
status status_extract
1 file-status-done-bad done
2 file-status-maybe-good maybe
3 file-status-underreview-good underreview
4 file-status-complete-final-bad final
In SQL this is easy, select split_part(status, '-', -2).
However, the solutions I've seen with R either operate on vectors or are messy to extract particular elements (they return ALL elements). How is this done in a mutate chain? The below is a failed attempt.
df %>%
mutate(status_extract = str_split_fixed(status, pattern = '-')[[-2]])

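Found a really simple answer: stringr's word() accepts a negative index that counts from the end, so -2 picks the second-to-last part.
library(tidyverse)
df %>%
mutate(status_extract = word(status, -2, sep = "-"))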

In base R you can combine the functions sapply and strsplit
df$status_extract <- sapply(strsplit(df$status, "-"), function(x) x[length(x) - 1])
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final

You can use map_chr() and nth() to extract the nth value from each split vector.
library(tidyverse)
df %>%
mutate(status_extract = map_chr(str_split(status, "-"), nth, -2))
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
which is equivalent to a base version like
sapply(strsplit(df$status, "-"), function(x) rev(x)[2])
# [1] "done" "maybe" "underreview" "final"

You can use regex to get what you want without splitting the string.
sub('.*-(\\w+)-.*$', '\\1', df$status)
#[1] "done" "maybe" "underreview" "final"
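For readers coming from SQL, the split-and-index answers above can be wrapped into a small helper. This is only a minimal base R sketch: the name split_part and its argument order are borrowed from SQL for illustration and are not an existing R or stringr function, and the separator is assumed to be a literal string rather than a regex.

```r
# Hypothetical helper mimicking SQL's split_part(); not part of base R.
# n > 0 counts parts from the start, n < 0 counts from the end.
split_part <- function(x, sep, n) {
  parts <- strsplit(x, sep, fixed = TRUE)
  vapply(parts, function(p) {
    i <- if (n < 0) length(p) + n + 1 else n
    # Return NA when the requested part does not exist
    if (i >= 1 && i <= length(p)) p[i] else NA_character_
  }, character(1))
}

status <- c("file-status-done-bad",
            "file-status-maybe-good",
            "file-status-underreview-good",
            "file-status-complete-final-bad")

split_part(status, "-", -2)
# [1] "done"        "maybe"       "underreview" "final"
```

Because the helper is vectorised over x, it can be called directly inside mutate(), e.g. mutate(status_extract = split_part(status, "-", -2)).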

Related

How to extract n-th occurrence of a pattern with regex

Let's say I have a string like this:
my_string = "my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool"
And I'd like to extract the first and the second date separately with stringr.
I tried something like str_extract(my_string, '(\\d+\\.\\d+\\.\\d+){n}') and while it works when n=1 it doesn't work with n=2. How can I extract the second occurrence?
Example of data.frame:
df <- data.frame(string_col = c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
"my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
"asdad asda-adsad KK-ASD-20.05.05-jjj"))
And I want to create columns date1, date2.
Edit:
Although @RonanShah and @ThomasIsCoding provided solutions based on str_extract_all, I'd really like to know how to do it using regex only, as finding the n-th occurrence seems to be an important pattern and may result in a much neater solution.
(I) Capturing groups (marked by ()) can be repeated with {n}, but they then count as only one capture group and match the last instance. If you explicitly write down capturing groups for both dates, you can use str_match (without the "_all"):
> stringr::str_match(df$string_col, '(\\d+\\.\\d+\\.\\d+)-(\\d+\\.\\d+\\.\\d+)?')[, -1, drop = FALSE]
[,1] [,2]
[1,] "19.01.03" "20.01.22"
[2,] "20.01.08" "20.04.01"
[3,] "20.05.05" NA
Here, ? makes the occurrence of the second date optional and [, -1, drop = FALSE] removes the first column that always contains the whole match. You might want to change the - in the pattern to something more general.
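To really find only the nth match, you could use (I) in an expression like this (shown here with n <- 2):
stringr::str_match(df$string_col, paste0('(?:(\\d+\\.\\d+\\.\\d+).*?){', n, '}'))[, -1]
[1] "20.01.22" "20.04.01" NA
Here, we used (?: ) to specify a non-capturing group, so that the capture (( )) does not include what is in between the dates (.*?). The quantifier is lazy (.*? rather than .*) because a greedy .* swallows as much as it can and leaves only the shortest tail that still looks like a date, e.g. "0.01.22" instead of "20.01.22".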
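The repeated-group idea can also be tried in base R. The following is a sketch under stated assumptions, not a definitive implementation: it assumes PCRE via perl = TRUE, uses a lazy .*? between dates so that the last repetition captures a complete date, and wraps the logic in a hypothetical helper named nth_date.

```r
strings <- c("my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool",
             "my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd",
             "asdad asda-adsad KK-ASD-20.05.05-jjj")

# Hypothetical helper: return the n-th date-like match of each string.
nth_date <- function(x, n) {
  # The capture group is repeated n times; it keeps only its last match.
  pattern <- sprintf("(?:(\\d+\\.\\d+\\.\\d+).*?){%d}", n)
  m <- regmatches(x, regexec(pattern, x, perl = TRUE))
  # Element 1 is the whole match, element 2 the capture; no match gives NA.
  vapply(m, function(g) if (length(g) >= 2) g[2] else NA_character_,
         character(1))
}

nth_date(strings, 1)
# [1] "19.01.03" "20.01.08" "20.05.05"
nth_date(strings, 2)
# [1] "20.01.22" "20.04.01" NA
```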
You could use stringr::str_extract_all() instead, like this:
str_extract_all(my_string, '\\d+\\.\\d+\\.\\d+')
str_extract always returns the first match. There may be ways to alter your regex to capture the nth occurrence of a pattern, but a simple way is to use str_extract_all and return the nth value.
library(stringr)
n <- 1
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "19.01.03"
n <- 2
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')[[1]][n]
#[1] "20.01.22"
For the data frame input, we can extract all the date patterns, store them in a list column, and use unnest_wider to get them as separate columns.
library(dplyr)
df %>%
mutate(date = str_extract_all(string_col, '\\d+\\.\\d+\\.\\d+')) %>%
tidyr::unnest_wider(date) %>%
rename_with(~paste0('date', seq_along(.)), starts_with('..'))
# string_col date1 date2
# <chr> <chr> <chr>
#1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 NA
I guess you might need str_extract_all
str_extract_all(my_string, '(\\d+\\.\\d+\\.\\d+)')
or regmatches if you prefer base R
regmatches(my_string,gregexpr('(\\d+\\.\\d+\\.\\d+)',my_string))
Update
With your data frame df
transform(df,
date = do.call(
rbind,
lapply(
u <- str_extract_all(string_col, "(\\d+\\.\\d+\\.\\d+)"),
`length<-`,
max(lengths(u))
)
)
)
we will get
string_col date.1 date.2
1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>
This is a good example to showcase {unglue}.
Here you have 2 patterns (one date or two dates): the first is two dates separated by a dash and surrounded by anything, the second is a single date surrounded by anything. We can write it this way:
library(unglue)
unglue_unnest(
df, string_col,
c("{}{date1=\\d+\\.\\d+\\.\\d+}-{date2=\\d+\\.\\d+\\.\\d+}{}",
"{}{date1=\\d+\\.\\d+\\.\\d+}{}"),
remove = FALSE)
#> string_col date1 date2
#> 1 my string a-maxeka UU-AA-19.01.03-20.01.22-bamdanool 19.01.03 20.01.22
#> 2 my string a-maxeka UU-AA-20.01.08-20.04.01-jdasdasd 20.01.08 20.04.01
#> 3 asdad asda-adsad KK-ASD-20.05.05-jjj 20.05.05 <NA>

Remove non-unique string components from a column in R

example <- data.frame(
file_name = c("some_file_name_first_2020.csv",
"some_file_name_second_and_third_2020.csv",
"some_file_name_4_2020_update.csv"),
a = 1:3
)
example
#> file_name a
#> 1 some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3 some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often and the unique identifier is usually in the middle and there can be suffixed information (sometimes) that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3
This works for the example shared; perhaps you can use it to build a more general solution:
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".
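One possible way to generalise the idea is sketched below. Note that it is a variation, not the answer's method: instead of keeping tokens that occur exactly once, it drops tokens that occur in every file name. The function name strip_common_parts and its arguments are hypothetical, and it still assumes the components are separated by "_" or ".".

```r
# Hypothetical helper: remove tokens shared by all strings in x.
strip_common_parts <- function(x, split = "_|\\.", collapse = "_") {
  parts <- strsplit(x, split)
  # Count in how many strings each token appears (once per string)
  counts <- table(unlist(lapply(parts, unique)))
  common <- names(counts)[counts == length(x)]
  # Keep the remaining tokens in their original order and paste them back
  vapply(parts, function(p) paste(p[!p %in% common], collapse = collapse),
         character(1))
}

file_name <- c("some_file_name_first_2020.csv",
               "some_file_name_second_and_third_2020.csv",
               "some_file_name_4_2020_update.csv")

strip_common_parts(file_name)
# [1] "first"            "second_and_third" "4_update"
```

Unlike the once-only filter, this version keeps an identifier that happens to repeat in some (but not all) rows.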

How to extract a subset based on value by grepl

Hi, I want to extract all the observations starting with "120.5". I am doing it in the following way.
a<-c(120.1,120.3,120.5,120.566)
Part<-c(1,2,3,4)
DFFF<-data.frame(a,Part)
lill <- subset(DFFF, grepl('^120.5', a), select = Part)
> lill
Part
3 3
I want the outcome to be 3 and 4. How can I do that in R?
Since you're only subsetting on a numerical variable, @NelsonGon's solution DFFF[DFFF$a >= 120.5, ] is absolutely the first option. If, for some reason, you have to use grepl, you can subset like this:
DFFF[grepl("120.5", DFFF$a), ]
a Part
3 120.500 3
4 120.566 4
But bear in mind that this gives the same result as a >= 120.5 only as long as no value in a is 120.6 or greater; such values will not be matched.
In base R
ind <- which(DFFF$a >= 120.5)
lill <- DFFF$Part[ind]
Tidyverse
library(tidyverse)
DFFF %>%
filter(a >= 120.5) %>%
pull(Part)
You were close, quotes needed.
lill <- subset(DFFF, grepl('^120.5', a), select="Part")
lill
# Part
# 3 3
# 4 4
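To make the difference between the two approaches concrete, here is a small base R sketch using the question's data. It assumes that "starting with 120.5" means the numeric range from 120.5 up to (but not including) 120.6; the variable names by_range, by_regex, loose and strict are only illustrative.

```r
a <- c(120.1, 120.3, 120.5, 120.566)
Part <- c(1, 2, 3, 4)
DFFF <- data.frame(a, Part)

# Numeric range: values whose decimal expansion starts with 120.5
by_range <- DFFF$Part[DFFF$a >= 120.5 & DFFF$a < 120.6]

# Regex on the printed value, anchored and with the dot escaped
by_regex <- DFFF$Part[grepl("^120\\.5", as.character(DFFF$a))]

by_range
# [1] 3 4
by_regex
# [1] 3 4

# An unescaped dot matches any character, so an unrelated value matches too
loose  <- grepl("^120.5", "12035")    # TRUE
strict <- grepl("^120\\.5", "12035")  # FALSE
```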

Extracting nth character till the end of string in R

I'm trying to extract the nth character onwards in a string, using R. Here's my data:
StringField
example_string1
example_string2
example_string3
example_string4
example_string5
example_string6
example_string7
example_string8
example_string9
example_string10
example_string11
example_string12
I want to extract only the numbers after example_string, so the result would be:
1
2
3
4
5
6
7
8
9
10
11
12
I've tried something along the lines of:
df$unique_number <- substr(df$stringField, 15:)
to indicate I want everything from the 15th position onward, till the end of the string. Is there an easy way to accomplish what I'm trying to do?
Here is an easy option using sub. We can capture the final digits in the input, and then replace with only that captured quantity.
x <- "example_string10"
num <- sub("^.*?(\\d+)$", "\\1", x)
num
[1] "10"
x <- "example_string10"
substr(x, 15, 20)
#> [1] "10"
Created on 2020-02-06 by the reprex package (v0.3.0)
Replace each non-digit (\D) with an empty string and convert to numeric:
transform(df, unique_number = as.numeric(gsub("\\D", "", StringField)))
Note
We used this as input:
df <- data.frame(StringField = c("example_string1", "example_string2",
"example_string3"), stringsAsFactors = FALSE)
df %>% tidyr::extract(StringField, into = "nmb", "([0-9]+)")
If you are interested in extracting only numbers from a string, this can be a solution:
library(stringr)
as.numeric(str_extract(df$StringField, "\\d+"))
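The original attempt failed because substr() needs an explicit stop position. A minimal base R sketch of "from the 15th character to the end" follows; it assumes every value starts with the 14-character prefix example_string, and the variable alt is only there to show the equivalent substr() call.

```r
df <- data.frame(StringField = paste0("example_string", 1:12),
                 stringsAsFactors = FALSE)

# substring() defaults to reading until the end of the string
df$unique_number <- as.numeric(substring(df$StringField, 15))

# Equivalent with substr(): pass the string length as the stop position
alt <- as.numeric(substr(df$StringField, 15, nchar(df$StringField)))

df$unique_number
# [1]  1  2  3  4  5  6  7  8  9 10 11 12
```

If the prefix length can vary, the regex-based answers above are the safer choice.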

R - number of unique values in a column of data frame

For a data frame df, I need to find the number of unique values in some_col. I tried the following:
length(unique(df["some_col"]))
but this is not giving the expected results. However length(unique(some_vector)) works on a vector and gives the expected results.
Some preceding steps while the df is created
df <- read.csv(file, header=T)
typeof(df) #=> "list"
typeof(unique(df["some_col"])) #=> "list"
length(unique(df["some_col"])) #=> 1
Try with [[ instead of [. [ returns a list (a data.frame in fact), [[ returns a vector.
df <- data.frame( some_col = c(1,2,3,4),
another_col = c(4,5,6,7) )
length(unique(df[["some_col"]]))
#[1] 4
class( df[["some_col"]] )
[1] "numeric"
class( df["some_col"] )
[1] "data.frame"
You're getting a value of 1 because the list is of length 1 (1 column), even though that 1 element contains several values.
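The difference can be checked directly. This short sketch uses a made-up column with one duplicate so that the counts differ; the variable names are illustrative only.

```r
df <- data.frame(some_col = c(1, 2, 2, 3),
                 another_col = c(4, 5, 6, 7))

# [ keeps the data frame, and length() of a data frame is its column count
n_cols  <- length(df["some_col"])            # 1
n_wrong <- length(unique(df["some_col"]))    # 1

# [[ extracts the column as a vector, so length() counts the values
n_vec   <- length(unique(df[["some_col"]]))  # 3

# For a one-column data frame, count the unique rows instead
n_rows  <- nrow(unique(df["some_col"]))      # 3
```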
You need to use
length(unique(unlist(df[c("some_col")])))
When you select a column with df[c("some_col")] or df["some_col"], it is returned as a list (a one-column data frame). unlist converts it into a vector that you can work with easily. When you select a column with df$some_col, it is returned as a vector.
I think you might just be missing a ,
Try
length(unique(df[,"some_col"]))
In response to comment :
df <- data.frame(cbind(A=c(1:10),B=rep(c("A","B"),5)))
df["B"]
Output :
B
1 A
2 B
3 A
4 B
5 A
6 B
7 A
8 B
9 A
10 B
and
length(unique(df["B"]))
Output:
[1] 1
Which is the same incorrect/undesirable output as the OP posted
HOWEVER With a comma ,
df[,"B"]
Output :
[1] A B A B A B A B A B
Levels: A B
and
length(unique(df[,"B"]))
Now gives you the correct/desired output by the OP. Which in this example is 2
[1] 2
The reason is that df["some_col"] returns a data.frame, and length() called on a data.frame counts its columns, which is 1 here, while df[,"some_col"] returns a vector, and length() called on a vector correctly returns the number of elements in that vector. So you see, a comma (,) makes all the difference.
using tidyverse
df %>%
select("some_col") %>%
n_distinct()
The data.table package contains the convenient shorthand uniqueN. From the documentation
uniqueN is equivalent to length(unique(x)) when x is an atomic vector, and nrow(unique(x)) when x is a data.frame or data.table. The number of unique rows are computed directly without materialising the intermediate unique data.table and is therefore faster and memory efficient.
You can use it with a data frame:
df <- data.frame(some_col = c(1,2,3,4),
another_col = c(4,5,6,7) )
data.table::uniqueN(df[['some_col']])
[1] 4
or if you already have a data.table
dt <- setDT(df)
dt[,uniqueN(some_col)]
[1] 4
Here is another option:
df %>%
distinct(column_name) %>%
count()
or the same without the pipe (distinct() and count() still come from dplyr):
count(distinct(df, column_name))
If you check benchmarks on the web, you will see that distinct() is fast.
