Change underscore behind word within column in R - r

Hi, I have a data frame like this, with two columns (A and B):
A B
x_1234 rs4566
x_1567 rs3566
z_1444 rs78654
r_1234 rs34567
I would like to move the letter in front of the number in column A to after the number, still separated by an underscore.
Expected output:
A B
1234_x rs4566
1567_x rs3566
1444_z rs78654
1234_r rs34567
I tried something like this, but it doesn't work:
DF$A <- gsub(".*_", "_*.", DF$A)

We can capture the two parts as groups and swap them. The first group, (.*), captures the characters before the _, and the second group, (\\d+), captures the one or more digits after it. The replacement then uses the backreferences in reverse order: \\2, followed by a _, followed by \\1.
DF$A <- sub("(.*)_(\\d+)", "\\2_\\1", DF$A)
-output
> DF
A B
1 1234_x rs4566
2 1567_x rs3566
3 1444_z rs78654
4 1234_r rs34567
The OP's code matches any characters (.*) followed by the _ and replaces them with the literal string _*. because regex metacharacters have no special meaning in the replacement. Instead, the replacement should be built from the capture group backreferences.
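To see the difference concretely, here is a minimal sketch that runs both calls on the values from the question. The names x, attempt and swapped are only illustrative and are not part of the original answer.

```r
# Values of column A from the question
x <- c("x_1234", "x_1567", "z_1444", "r_1234")

# The original attempt: "_*." in the replacement is literal text,
# so everything up to and including the "_" is replaced by "_*."
attempt <- gsub(".*_", "_*.", x)

# Capture groups plus backreferences swap the two parts
swapped <- sub("(.*)_(\\d+)", "\\2_\\1", x)

print(attempt)
print(swapped)
```

Only a backslash is special in the replacement string, which is why \\1 and \\2 are needed to refer back to the captured text.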
data
DF <- structure(list(A = c("x_1234", "x_1567", "z_1444", "r_1234"),
B = c("rs4566", "rs3566", "rs78654", "rs34567")),
class = "data.frame", row.names = c(NA,
-4L))

Related

Using grep to match variables in one column to a string of text in another column [duplicate]

I need to match a string in the first variable with a string in the second variable and then return true or false in the third column.
Here is my data
regex <- c("cat", "dog", "mouse")
text<- c("asdf.cat/asdf", "asdf=asdf", "asdf=mouse asdf")
df <- data.frame(regex, text)
And I need an output like this
regex text result
cat asdf.cat/asdf 1
dog asdf=asdf 0
mouse asdf=mouse asdf 1
I have tried using grepl, but I can't figure out how to use it in a data frame.
df$result <- as.integer(grepl("cat", df$text))
This only works for the first row.
I have also tried the following code, which filters down to the matching rows, but I want to keep all rows and just return true or false.
df %>%
filter(unlist(Map(function(x, y) grepl(x, y), regex, text)))
As you can see, it is complicated by the text strings containing various characters.
I feel like this should be easy, but I can't wrap my head around it!
Instead of grepl, use str_detect, which is vectorised over both the pattern and the string:
library(stringr)
library(dplyr)
df %>%
mutate(result= +(str_detect(text, regex)))
-output
regex text result
1 cat asdf.cat/asdf 1
2 dog asdf=asdf 0
3 mouse asdf=mouse asdf 1
data
df <- structure(list(regex = c("cat", "dog", "mouse"), text = c("asdf.cat/asdf",
"asdf=asdf", "asdf=mouse asdf")), class = "data.frame", row.names = c(NA,
-3L))
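If adding stringr is not an option, a base R sketch of the same row-wise idea is to pair each pattern with its own string via mapply. This is an alternative to the answer above, not a restatement of it. Like str_detect, it treats the regex column as regular expressions.

```r
# Data from the question; stringsAsFactors = FALSE keeps the columns
# as character vectors on R versions before 4.0 as well
df <- data.frame(regex = c("cat", "dog", "mouse"),
                 text = c("asdf.cat/asdf", "asdf=asdf", "asdf=mouse asdf"),
                 stringsAsFactors = FALSE)

# mapply calls grepl(pattern, x) once per row, pairing regex[i] with text[i];
# as.integer turns TRUE/FALSE into 1/0 and drops the names mapply adds
df$result <- as.integer(mapply(grepl, df$regex, df$text))

print(df)
```

Plain grepl cannot do this directly because its pattern argument is not vectorised: it expects a single pattern.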

Separating a column by using regex expressions

I have a data frame like this:
tibble(x = c("asdh.1", "asdh.1.1", "cccc.1.1", "asdh.1.2", "cccc.1.2", "asdh.1.11", "cccc.1.11"))
# A tibble: 7 x 1
x
<chr>
1 asdh.1
2 asdh.1.1
3 cccc.1.1
4 asdh.1.2
5 cccc.1.2
6 asdh.1.11
7 cccc.1.11
Now I would like to split the column x into 2 columns, such that the second column contains only the digits after the last dot and the first column contains everything before the last dot. I tried messing around with regex but did not get the desired outcome. The closest I got might be %>% separate(col=x, into=c("y", "numbers"), sep="(.*)\\.([1-9]{1,2}$)"), but that gives only two empty columns.
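We can use a regex lookahead in the sep argument of separate to split at the . that is followed by one or more digits (\\d+) at the end ($) of the string. Because . is a metacharacter that matches any character, it is escaped (\\.). The lookahead (?=\\d+$) only checks for the digits without consuming them, so they stay in the second column.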
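library(tidyr)
library(tibble)
# df1 is the tibble from the question
df1 <- tibble(x = c("asdh.1", "asdh.1.1", "cccc.1.1", "asdh.1.2", "cccc.1.2", "asdh.1.11", "cccc.1.11"))
separate(df1, col = x, into = c("y", "numbers"),
         sep = "\\.(?=\\d+$)", convert = TRUE)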
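The same lookahead can be tried out in base R, which makes it easier to see what the separator actually consumes. The sketch below is an illustration only (the names pieces, parts and res are made up here). It uses strsplit with perl = TRUE, since lookaheads need the PCRE engine. The pattern in the question presumably returned empty columns because it matched the whole string, leaving nothing on either side of the separator.

```r
x <- c("asdh.1", "asdh.1.1", "cccc.1.1", "asdh.1.2",
       "cccc.1.2", "asdh.1.11", "cccc.1.11")

# Split only at the dot that is directly followed by trailing digits;
# the lookahead does not consume the digits, so they survive the split
pieces <- strsplit(x, "\\.(?=\\d+$)", perl = TRUE)

# Every element has exactly two pieces here, so they bind into a 2-column matrix
parts <- do.call(rbind, pieces)

res <- data.frame(y = parts[, 1],
                  numbers = as.integer(parts[, 2]),
                  stringsAsFactors = FALSE)
print(res)
```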

Extract three groups: between second and second to last, between second to last and last, and after last underscores

Can someone help with these regular expressions?
d_total_v_conf.int.low_all
I want to extract three parts: total_v, conf.int.low, all
I can't just capture elements before the third _, it is more complex than that:
d_share_v_hskill_wc_mean_plus
Should yield share_v_hskill_wc, mean and plus
The first match is for all characters between the second and the penultimate _, the second match takes all between the penultimate and the last _ and the third takes everything after the last _
We can use sub to capture the three groups and join them with a delimiter, then read the pieces back with scan:
f1 <- function(str_input) {
  scan(text = sub("^[^_]+_(.*)_([^_]+)_([^_]+)$",
                  "\\1,\\2,\\3", str_input), what = "", sep = ",")
}
f1(str1)
#[1] "total_v" "conf.int.low" "all"
f1(str2)
#[1] "share_v_hskill_wc" "mean" "plus"
If it is a data.frame column
library(tidyr)
library(dplyr)
df1 %>%
extract(col1, into = c('col1', 'col2', 'col3'),
"^[^_]+_(.*)_([^_]+)_([^_]+)$")
# col1 col2 col3
#1 total_v conf.int.low all
#2 share_v_hskill_wc mean plus
data
str1 <- "d_total_v_conf.int.low_all"
str2 <- "d_share_v_hskill_wc_mean_plus"
df1 <- data.frame(col1 = c(str1, str2))
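As a variation on the answer above, the same pattern can be used with regexec and regmatches, which return the capture groups directly. This is only a sketch of an alternative; the names x, m and parts are illustrative. It avoids the intermediate comma-delimited string, so it would also cope with inputs that themselves contain commas.

```r
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")

# regexec records the position of the full match and of each capture group;
# regmatches then extracts the corresponding substrings per input string
m <- regmatches(x, regexec("^[^_]+_(.*)_([^_]+)_([^_]+)$", x))

# Drop the first element (the full match) and keep the three groups,
# giving one row per input string
parts <- t(sapply(m, `[`, -1))
print(parts)
```

The greedy (.*) takes as much as it can, and the two trailing [^_]+ groups force it to stop before the penultimate underscore.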
Here is a single regex that yields the three groups as requested:
(?<=^[^_]_)((?:(?:(?!_).)+)|_)+(_[^_]+$)
The idea is to use a lookaround, plus an explicit match for the first group, an everything-but match in the middle, and another explicit match for the last part.
You may need to adjust the start and end anchors if those strings show up in free text.
You can use {unglue} for this task:
library(unglue)
x <- c("d_total_v_conf.int.low_all", "d_share_v_hskill_wc_mean_plus")
pattern <- "d_{a}_{b=[^_]+}_{c=[^_]+}"
unglue_data(x, pattern)
#> a b c
#> 1 total_v conf.int.low all
#> 2 share_v_hskill_wc mean plus
What you want, basically, is to extract a, b and c from a pattern that looks like "d_{a}_{b}_{c}", where b and c are made of one or more non-underscore characters, which is what "[^_]+" means in regex.

Remove all the characters before the last comma in R

I have a data table like this:
id number
1 5562,4024,...,1213
2 4244,4214,...,244
3 424,4213
4 1213,441
...
And I want to keep only the last part of each value in the number column, which should look like this:
id number
1 1213
2 244
3 4213
4 441
...
So what should I do to achieve that?
One option is to capture the digits at the end ($) of the string that follow a , as a group, and replace the whole string with the backreference (\\1) of the captured group:
df$number <- as.numeric(sub(".*,(\\d+)$", "\\1", df$number))
Or match all the characters (.*) up to and including the last , (the match is greedy) and replace them with a blank (""):
df$number <- as.numeric(sub(".*,", "", df$number))
data
df <- structure(list(id = 1:4, number = c("5562,4024,...,1213",
"4244,4214,...,244",
"424,4213", "1213,441")), class = "data.frame", row.names = c(NA,
-4L))
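The second option relies on .* being greedy, which the short sketch below illustrates by comparing it with a lazy quantifier. The input values are simplified, illustrative strings rather than the question's data, and perl = TRUE is used for the lazy version as a cautious choice.

```r
x <- c("5562,4024,1213", "424,4213")

# Greedy: .* grabs as much as possible, so the match ends at the LAST comma
greedy <- sub(".*,", "", x)

# Lazy: .*? grabs as little as possible, so the match ends at the FIRST comma
lazy <- sub(".*?,", "", x, perl = TRUE)

print(greedy)
print(lazy)
```

For strings with a single comma the two give the same result, so the difference only shows up on the longer values.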

How to apply strsplit to a dataframe in R?

I have a very similar question to this one, but I have a more complicated situation.
Here is my sample code:
test = data.frame(x = c(1:4),
y = c("/abc/werts/h1-1234", "/abc/fghye/seths/h2-234",
"/abc/gvawrttd/hyeadar/h3-9868", "/abc/qqras/x1-7653"))
test$y = as.character(test$y)
And I want an output like this:
1 h1-1234
2 h2-234
3 h3-9868
4 x1-7653
I tried:
test$y = tail(unlist(strsplit(test$y, "/")), 1)
However, the code above returned:
1 h1-1234
2 h1-1234
3 h1-1234
4 h1-1234
So my question is, how should I modify my code to get my desired output?
Thanks in advance!
Here is the line you are looking for:
test$y = sapply(strsplit(test$y, "/"), tail, 1)
It applies tail to each element in the list returned by strsplit.
Here is an option using sub to match zero or more characters (.*) followed by / (\\/) followed by zero or more characters that are not a / captured as a group (([^/]*)) until the end ($) of the string, and replace with the backreference (\\1) of the capture group
test$y <- sub(".*\\/([^/]*)$", "\\1", test$y)
test$y
#[1] "h1-1234" "h2-234" "h3-9868" "x1-7653"
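The sketch below shows why the attempt in the question fills every row with the same value, and compares the per-element fix with basename, a base R function that returns the last component of a path. Using basename is a suggestion that is not in the answers above; it fits here only because the strings happen to look like file paths.

```r
y <- c("/abc/werts/h1-1234", "/abc/fghye/seths/h2-234",
       "/abc/gvawrttd/hyeadar/h3-9868", "/abc/qqras/x1-7653")

# unlist() flattens all pieces into one vector, so tail(., 1) yields a single
# string, which R recycles across all rows when it is assigned to a column
flat <- tail(unlist(strsplit(y, "/")), 1)

# Applying tail to each list element keeps one result per row
per_row <- vapply(strsplit(y, "/"), function(p) tail(p, 1), character(1))

# basename() extracts the last path component directly
last_part <- basename(y)

print(length(flat))
print(per_row)
print(last_part)
```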
