How to remove everything from a row except pattern - r

I have a data frame with one column whose values are separated by ;, like this:
AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455
What I want is to remove everything except the codes that start with AB:
AB00001;AB00002
AB84375
AB84375;AB84375
AB001
I've tried to separate them with separate(), but I don't know how to continue. Any suggestions?

If your data frame is called df and your column is called V1, you could try:
sapply(strsplit(df$V1, ";"), function(x) paste(grep("^AB", x, value = TRUE), collapse = ";"))
#> [1] "AB00001;AB00002" "AB84375" "AB84375;AB84375" "AB001"
This splits at all the semicolons then matches all strings starting with "AB", then joins them back together with semicolons.
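The same split, filter and re-join steps can be wrapped in a small function. This is only a sketch: the name keep_prefixed and its arguments are hypothetical, and it assumes the codes are plain strings with no surrounding whitespace. A row with no matching code becomes an empty string.

```r
# Hypothetical helper: keep only the sep-separated codes that start with `prefix`.
keep_prefixed <- function(x, prefix = "AB", sep = ";") {
  parts <- strsplit(x, sep, fixed = TRUE)          # split each row into codes
  vapply(parts,
         function(p) paste(p[startsWith(p, prefix)], collapse = sep),
         character(1))                             # re-join the codes that survive
}

v <- c("AB00001;09843;AB00002;GD00001", "AB84375;34", "123;GD1")
res <- keep_prefixed(v)
res
```

Using startsWith() instead of grep("^AB", ...) avoids regex interpretation of the prefix, which matters only if the prefix could contain special characters.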

I thought of using stringr and Daniel O's data:
df %>%
mutate(data = str_extract_all(data, "AB\\w+"))
which gives us a list column (each row holds a character vector of matches):
data
1 AB00001, AB00002
2 AB84375
3 AB84375, AB84375
4 AB001
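One thing to watch with a pattern like "AB\\w+" is that it is not anchored to the start of a code, so it would also match inside a token such as XAB001. A possible base R sketch that only matches at the start of the string or right after a semicolon, and collapses the matches back to one string per row, is shown here. The sample vector x is made up for illustration.

```r
# Made-up sample; the second row contains "XAB001", which should NOT be kept.
x <- c("AB00001;09843;AB00002;GD00001", "74859375;XAB001;4455")

# (?<![^;]) succeeds only at the start of the string or directly after ";"
m <- regmatches(x, gregexpr("(?<![^;])AB\\w+", x, perl = TRUE))

# Collapse each vector of matches into a single ";"-separated string
out <- vapply(m, paste, character(1), collapse = ";")
out
```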

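1) Base R Assuming DF is as shown reproducibly in the Note at the end, we prefix each line with a semicolon so that every code, including the first, is preceded by one. gsub with the pattern shown then keeps each ;AB code and deletes every other ;code, and finally sub removes the leading semicolon we added. No packages are used.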
transform(DF, V1 = sub("^;", "", gsub("(;AB\\d+)|;[^;]*", "\\1", paste0(";", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
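To see what each stage of the one-liner in (1) does, the nested calls can be unrolled into separate steps. This is the same expression split up, run on two of the sample rows; the names step1 to step3 are only for illustration.

```r
V1 <- c("AB00001;09843;AB00002;GD00001", "74859375;AB001;4455;FG3455")

# Prefix a semicolon so every code is preceded by ";"
step1 <- paste0(";", V1)

# Keep ";AB<digits>" (captured as \\1); any other ";code" is replaced by nothing
step2 <- gsub("(;AB\\d+)|;[^;]*", "\\1", step1)

# Drop the leading semicolon that was added in the first step
step3 <- sub("^;", "", step2)
step3
```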
2) dplyr/tidyr This one is longer than the others in this answer, but it is straightforward and has no complex regular expressions.
library(dplyr)
library(tidyr)
DF %>%
mutate(id = 1:n()) %>%
separate_rows(V1, sep = ";") %>%
filter(substr(V1, 1, 2) == "AB") %>%
group_by(id) %>%
summarize(V1 = paste(V1, collapse = ";")) %>%
ungroup %>%
select(-id)
giving:
# A tibble: 4 x 1
V1
<chr>
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
3) gsubfn Replace codes that do not start with AB with an empty string and then remove redundant semicolons from what is left.
library(gsubfn)
transform(DF, V1 = gsub("^;|;$", "", gsub(";+", ";",
gsubfn("[^;]*", ~ if (substr(x, 1, 2) == "AB") x else "", V1))))
giving:
V1
1 AB00001;AB00002
2 AB84375
3 AB84375;AB84375
4 AB001
Note
Lines <- "AB00001;09843;AB00002;GD00001
AB84375;34
AB84375;AB84375
74859375;AB001;4455;FG3455"
DF <- read.table(text = Lines, as.is = TRUE, strip.white = TRUE)

Related

How to sort a concatenated string in a column in R?

I have a data frame with a concatenated column whose parts are separated by -, and I want to order the parts numerically by the number after the _
df <- data.frame(Order = c("A23_2-A27_3-A40_4-A10_1", "A25_2-A21_3-A11_1", "A9_1", "A33_2-A8_1"))
and want to have a result like this:
df <- data.frame(Order = c("A10A23A27A40", "A11A25A21", "A9", "A8A33"))
I tried a couple of things with tidyverse but couldn't get a clean result.
df %>%
rowid_to_column() %>%
separate_rows(Order, sep='-') %>%
separate(Order, c('Order', 'v'), convert = TRUE) %>%
arrange(v)%>%
group_by(rowid) %>%
summarise(Order = str_c(Order, collapse = ''))
# A tibble: 4 x 2
rowid Order
<int> <chr>
1 1 A10A23A27A40
2 2 A11A25A21
3 3 A9
4 4 A8A33
Another base R approach:
df$Order <- sapply(strsplit(df$Order, '-'), function(x) {
spl <- strsplit(x, '_') # split by '_'
spl <- do.call(rbind, spl) # create a 2-column matrix
ord <- order(as.numeric(spl[, 2])) # order of numeric parts
paste(spl[ord, 1], collapse='') # concatenate in correct order
})
Here is a base R option:
df$Order <-
sapply(strsplit(df$Order, "-"), function(x)
paste0(gsub("\\_.*", "", x[order(as.numeric(sub("^[^_]*_", "", x)))]), collapse = ""))
Output
Order
1 A10A23A27A40
2 A11A25A21
3 A9
4 A8A33
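The base R answers above all follow the same three steps: split on -, order by the number after _, and paste the labels back together. A sketch with each step named, checked against the question's sample data, could look like this. The function name sort_tokens is hypothetical, and it assumes every part contains exactly one numeric suffix after the underscore.

```r
# Hypothetical helper: reorder the "-"-separated parts of one string
# by the number after "_" and drop that number from the result.
sort_tokens <- function(s) {
  tokens <- strsplit(s, "-", fixed = TRUE)[[1]]
  key    <- as.numeric(sub(".*_", "", tokens))   # numeric part after "_"
  label  <- sub("_.*", "", tokens)               # label before "_"
  paste(label[order(key)], collapse = "")
}

orders <- c("A23_2-A27_3-A40_4-A10_1", "A25_2-A21_3-A11_1", "A9_1", "A33_2-A8_1")
sorted <- vapply(orders, sort_tokens, character(1), USE.NAMES = FALSE)
sorted
```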
Or a tidyverse option:
library(tidyverse)
df %>%
mutate(Order = map_chr(str_split(Order, "-"), ~
str_c(
str_replace_all(.x[order(as.numeric(str_replace_all(.x, "^[^_]*_", "")))], "\\_.*", ""), collapse = ""
)))

How to join multiple columns together on blanks of one column in R

This is my dataframe:
df <- data.frame(option_1 = c("Box 1", "", ""), option_2 = c("", 4, ""), Width = c("","",3))
I want to get this data frame:
option_1
1 Box 1
2 4
3 3
I'm doing this on a much bigger dataframe with 5+ columns I'm merging on blanks with respect to the option_1 column. I have tried using coalesce, but some of the columns won't "merge" on the blanks. For example:
df %>%
mutate(option_value_1 = coalesce(option_value_1, option_value_2, option_value_3, option_value_4, option_value_5, option_value_6, option_value_7))
option_value_5 wouldn't come together with option_value_1 on the blanks, but the other option values did. Should I put the vectors in a list then use coalesce?
We convert the blank ("") to NA and coalesce with the splice (!!!) operator. According to ?"!!!":
The big-bang operator !!! forces-splice a list of objects. The elements of the list are spliced in place, meaning that they each become one single argument.
library(dplyr)
df %>%
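# na_if() on a whole data frame errors in dplyr >= 1.1.0, so apply it per column
mutate(across(everything(), ~ na_if(.x, ""))) %>%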
transmute(option_1 = coalesce(!!! .))
-output
option_1
1 Box 1
2 4
3 3
If we are interested only in the 'option' columns, subset the columns (we can also use invoke with coalesce):
library(purrr)
df %>%
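# na_if() on a whole data frame errors in dplyr >= 1.1.0, so apply it per column
mutate(across(everything(), ~ na_if(.x, ""))) %>%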
mutate(option_1 = invoke(coalesce,
across(starts_with("option"))), .keep = "unused")
With a base R approach:
df <- data.frame(option_1 = apply(df, 1, \(x) paste(x, collapse = "")))
df
#> option_1
#> 1 Box 1
#> 2 4
#> 3 3
Or using tidyverse:
df %>%
rowwise %>%
transmute(option_1 = str_c(c_across(everything()), collapse = "")) %>%
ungroup
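Note that pasting the columns together only matches coalesce when at most one column per row is non-blank; if two columns are filled, their values get concatenated. A base R sketch that takes the first non-blank value per row instead, closer to what coalesce does, is shown here. The helper name first_non_blank is hypothetical, and all columns are assumed to be character.

```r
df <- data.frame(option_1 = c("Box 1", "", ""),
                 option_2 = c("", "4", ""),
                 Width    = c("", "", "3"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: first value in a row that is neither NA nor ""
first_non_blank <- function(row) {
  hits <- row[!is.na(row) & row != ""]
  if (length(hits) > 0) hits[[1]] else NA_character_
}

merged <- data.frame(option_1 = unname(apply(df, 1, first_non_blank)),
                     stringsAsFactors = FALSE)
merged
```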

extract the first n characters from concatenated strings

I've a data frame with two columns: id and value.
id <- seq(1,5)
value <- c("123;21232;78678",
"4575;23423",
"1245;45757;45375;39873",
"787;536453;1121;3232;58684;2434",
"444;44444;4444;444")
df <- data.frame(id, value)
df$value <- as.character(df$value)
I need to extract the first two digits of each number (separated by ;). The output would look like:
We can use str_extract_all from stringr
library(stringr)
library(dplyr)
library(purrr)
df %>%
mutate(value_2 = map_chr(str_extract_all(value, "\\b(\\d{2})"),
str_c, collapse=";"))
# id value value_2
#1 1 123;21232;78678 12;21;78
#2 2 4575;23423 45;23
#3 3 1245;45757;45375;39873 12;45;45;39
#4 4 787;536453;1121;3232;58684;2434 78;53;11;32;58;24
#5 5 444;44444;4444;444 44;44;44;44
Or with str_replace_all
df %>%
mutate(value_2 = str_replace_all(value, "(^|;)(..)[^;]+", "\\2;") %>%
trimws(whitespace = ";"))
Or using base R
df$value_2 <- sapply(regmatches(df$value,
gregexpr("\\b(\\d{2})", df$value)), paste, collapse=";")
Or with gsub from base R
trimws(gsub("(^|;)(..)[^;]+", "\\2;", df$value, perl = TRUE), whitespace = ";")
#[1] "12;21;78" "45;23" "12;45;45;39"
#[4] "78;53;11;32;58;24" "44;44;44;44"
Using just base R:
df$value_2 = sapply(strsplit(df$value, ';'), function(i){
paste(substr(i, 1, 2), collapse = ';')
})
This goes row by row and separates each string by the semicolons, then pulls out the first two characters and returns them into a new string separated by semicolons.
This will give you:
id value value_2
1 1 123;21232;78678 12;21;78
2 2 4575;23423 45;23
3 3 1245;45757;45375;39873 12;45;45;39
4 4 787;536453;1121;3232;58684;2434 78;53;11;32;58;24
5 5 444;44444;4444;444 44;44;44;44
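The regex-based and substr-based answers agree on the sample data, but they can differ when a number has fewer than two digits: the pattern \\b\\d{2} skips such a number, while substr keeps whatever characters are there. A small sketch of the difference, using a made-up second row and perl = TRUE for the word boundary:

```r
value <- c("123;21232;78678", "7;4575")   # second row is made up: "7" has one digit

# Regex: first two digits at each word boundary; one-digit numbers do not match
by_regex <- sapply(regmatches(value, gregexpr("\\b\\d{2}", value, perl = TRUE)),
                   paste, collapse = ";")

# substr: first (up to) two characters of every piece
by_substr <- sapply(strsplit(value, ";", fixed = TRUE),
                    function(tok) paste(substr(tok, 1, 2), collapse = ";"))

by_regex
by_substr
```

Which behaviour is preferable depends on whether short numbers should appear in the output at all.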

R: Extracting Bigrams with Zero-Width Lookaheads

I want to extract bigrams from sentences, using the regex described here and store the output to a new column which references the original.
library(dplyr)
library(stringr)
library(splitstackshape)
df <- data.frame(a =c("apple orange plum"))
# Single Words - Successful
df %>%
# Base R
mutate(b = sapply(regmatches(a,gregexpr("\\w+\\b", a, perl = TRUE)),
paste, collapse=";")) %>%
# Duplicate with Stringr
mutate(c = sapply(str_extract_all(a,"\\w+\\b"),paste, collapse=";")) %>%
cSplit(., c(2,3), sep = ";", direction = "long")
Initially, I thought the problem seemed to be with the regex engine but neither stringr::str_extract_all (ICU) nor base::regmatches (PCRE) works.
# Bigrams - Fails
df %>%
# Base R
mutate(b = sapply(regmatches(a,gregexpr("(?=(\\b\\w+\\s+\\w+))", a, perl = TRUE)),
paste, collapse=";")) %>%
# Duplicate with Stringr
mutate(c = sapply(str_extract_all(a,"(?=(\\b\\w+\\s+\\w+))"),paste, collapse=";")) %>%
cSplit(., c(2,3), sep = ";", direction = "long")
As a result, I'm guessing the problem has to do with using a zero-width lookahead around a capturing group. Is there any valid regex in R which will allow these bigrams to be extracted?
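As @WiktorStribiżew suggested, using str_match_all helps here: the lookahead itself matches an empty string, so we need the captured group rather than the whole match. Here's how to apply it with multiple rows in a data frame. Let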
(df <- data.frame(a = c("one two three", "four five six")))
# a
# 1 one two three
# 2 four five six
Then we may do
df %>% rowwise() %>%
do(data.frame(., b = str_match_all(.$a, "(?=(\\b\\w+\\s+\\w+))")[[1]][, 2], stringsAsFactors = FALSE))
# Source: local data frame [4 x 2]
# Groups: <by row>
#
# A tibble: 4 x 2
# a b
# * <fct> <chr>
# 1 one two three one two
# 2 one two three two three
# 3 four five six four five
# 4 four five six five six
where stringsAsFactors = FALSE is just to avoid warnings that come from binding rows.
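If a regex is not required, overlapping bigrams can also be built without lookaheads by splitting each sentence into words and pasting neighbours together. This is a different technique from the lookahead approach above, shown only as a sketch; the function name bigrams is hypothetical and it assumes words are separated by whitespace with no leading spaces.

```r
# Hypothetical helper: all adjacent word pairs of one sentence
bigrams <- function(sentence) {
  words <- strsplit(sentence, "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))   # word i joined with word i + 1
}

bg <- bigrams("apple orange plum")
bg
```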

How to change code T-25-4 into T-25-04 in a dataframe in R?

I have a data.frame in R. The first column contains codes like T-25-4. I want to change them to T-25-04 and so on, so that the last number always has 2 digits.
Example:
T-25-1
T-25-2
T-25-3
T-25-4
T-25-5
T-25-6
T-25-7
T-25-8
T-25-9
Borrowing first part of ycw's answer, but simpler with mutate and gsub:
library(tidyverse)
dt <- tibble(Col = c("T-25-1", "T-25-2", "T-25-3", "T-25-4", "T-25-5",
"T-25-6", "T-25-7", "T-25-8", "T-25-9"))
dt %>%
mutate(Col = gsub("(\\d)$", paste0("0", "\\1"), Col))
If the last number can go higher than 9 and you don't want to add a 0 to those:
dt %>%
mutate(Col = ifelse(nchar(sub(".*-(\\d+)$", "\\1", Col)) < 2, # Check if last number is less than 10
sub("(\\d+)$", paste0("0", "\\1"), Col), # Add 0 in front if less than 10
Col))
We can use functions from tidyverse and stringr. dt2 is the final output.
library(tidyverse)
library(stringr)
# Create example data frame
dt <- tibble(Col = c("T-25-1", "T-25-2", "T-25-3", "T-25-4", "T-25-5",
"T-25-6", "T-25-7", "T-25-8", "T-25-9"))
# Process the data
dt2 <- dt %>%
# Separate the original column to three columns
separate(Col, into = c("Col1", "Col2", "Col3")) %>%
# Pad zero to Col3 until the width is 2
mutate(Col3 = str_pad(Col3, width = 2, side= "left", pad = "0")) %>%
# Combine all three columns separated by "-"
unite(Col, Col1:Col3, sep = "-")
# View the results
dt2
# A tibble: 9 x 1
Col
* <chr>
1 T-25-01
2 T-25-02
3 T-25-03
4 T-25-04
5 T-25-05
6 T-25-06
7 T-25-07
8 T-25-08
9 T-25-09
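For comparison, the padding can also be done in base R, either with sprintf on the last number or with a single sub that only touches one-digit endings. This is a sketch on made-up codes that include a two-digit case; it assumes every code ends in a hyphen followed by digits.

```r
codes <- c("T-25-1", "T-25-9", "T-25-10")   # made-up sample with a two-digit ending

# Option A: split off the last number and format it to width 2
prefix <- sub("\\d+$", "", codes)                 # everything up to the last number
last   <- as.integer(sub(".*-", "", codes))       # the last number itself
padded_a <- paste0(prefix, sprintf("%02d", last))

# Option B: insert a 0 only when exactly one digit follows the final hyphen
padded_b <- sub("-(\\d)$", "-0\\1", codes)

padded_a
padded_b
```

Both leave T-25-10 unchanged, unlike a pattern such as "(\\d)$" that pads any final digit.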
