Regex for consecutive repeated word in R

Regular expression to find words and digits that are repeated back to back
Suppose I have a data frame
df <- data.frame(name = c("mike","mike","mike","bob","mike"), age = c(23,23,23,25,23))
How can I write a regular expression to check whether a value in the "name" column, such as "mike", is repeated back to back (here "mike" appears 3 times in a row), and likewise whether a digit in the "age" column is repeated back to back (here 23 appears 3 times in a row)?

You can try this:
library(dplyr)
df %>%
  mutate(across(.fns = data.table::rleid, .names = '{col}_grp')) %>%
  group_by(across(ends_with('grp'))) %>%
  filter(n() >= 3) %>%
  ungroup %>%
  select(names(df))
# name age
# <chr> <dbl>
#1 mike 23
#2 mike 23
#3 mike 23
For every column in df we use rleid to assign a unique number to each run of consecutive identical values, then keep only the groups that have >= 3 rows in them.
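To see what rleid contributes here, this small sketch (my addition, using the df defined above) shows the run ids it assigns to the name column:
data.table::rleid(df$name)
#> [1] 1 1 1 2 3
# rows 1-3 share run id 1, so that group passes the n() >= 3 filter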

Base R one-liner (no regex):
# keeps rows up to the point where the name first changes, i.e. the leading run
df[which(c(0, cumsum(abs(diff(as.integer(as.factor(df$name)))))) == 0), ]
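The one-liner above keeps only the leading run of identical names. A slightly more general base R sketch (my addition, still no regex) uses rle() to keep any run of 3 or more consecutive identical names:
# rle() encodes runs of identical values; repeat each run's length test across its rows
r <- rle(as.character(df$name))
keep <- rep(r$lengths >= 3, times = r$lengths)
df[keep, ]
#   name age
# 1 mike  23
# 2 mike  23
# 3 mike  23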

Related

Summarize a data frame based on consecutive rows with repeated values

I have a data frame with the following structure:
pos<- c(67,125,158,195,235,458,499,526,785,912,999,1525)
v_1<-c("j","c","v","r","s","q","r","r","s","t","u","v")
v_2<-c("c","t","v","r","s","q","r","w","c","c","o","v")
v_3<-c("z","c","v","r","s","q","r","w","c","b","p","v")
v_4<-c("x","w","z","z","s","q","r","w","c","o","t","v")
data<-as.data.frame(cbind(pos,v_1,v_2,v_3,v_4))
In this dataframe it is possible to find the same letters among the different columns in consecutive rows. I need to obtain a separate data frame with the values of the variable "pos" for consecutive rows with shared letters, as can be seen in the figure:
In this figure, even though all the columns have the same letter at pos 1525, that row isn't included since it isn't consecutive with another row with repeated letters.
Solution using tidyr and dplyr:
- After pivoting to long, use dplyr::add_count() to find repeated values within each pos;
- Within each v, find consecutive rows with repeated values, defined as: >1 repeat and >1 repeat in either the preceding or following row;
- Create a column containing pos for consecutive rows and NA otherwise;
- Take the minimum and maximum to get the start and end for each v.
library(tidyr)
library(dplyr)
data %>%
  pivot_longer(!pos, names_to = "v") %>%
  add_count(pos, value) %>%
  group_by(v) %>%
  mutate(consec = ifelse(
    n > 1 & (lag(n) > 1 | lead(n) > 1),
    pos,
    NA
  )) %>%
  summarize(
    start = min(consec, na.rm = TRUE),
    end = max(consec, na.rm = TRUE)
  )
# A tibble: 4 × 3
v start end
<chr> <chr> <chr>
1 v_1 125 499
2 v_2 158 785
3 v_3 125 785
4 v_4 235 785
Note: I'm not sure if/how you want to handle more than one set of consecutive rows, so this solution doesn't address that.
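If you did need several separate stretches per column, one possible extension of the same pipeline (a sketch of my own, not part of the original answer) is to add a run id with cumsum() before summarizing, so each consecutive stretch gets its own start/end:
data %>%
  pivot_longer(!pos, names_to = "v") %>%
  add_count(pos, value) %>%
  group_by(v) %>%
  mutate(
    consec = n > 1 & (lag(n, default = 0) > 1 | lead(n, default = 0) > 1),
    run = cumsum(consec != lag(consec, default = FALSE))  # new id whenever consec flips
  ) %>%
  filter(consec) %>%
  group_by(v, run) %>%
  summarize(start = min(as.numeric(pos)), end = max(as.numeric(pos)), .groups = "drop")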

Creating new columns in R using parts of an existing column

I am trying to create new columns using the information in an existing column:
e.g. the column 'name' contains the value 0112200015-1_R2_001.fastq.gz. From this I would like to generate a column 'sample_id' containing 0112200015 (the first 10 digits), a column 'timepoint' containing 1 (from "-1"), and a column 'paired_end' containing 2 (from "R2").
What would the correct code for this be?
tidyr::extract
You can use extract from tidyr package.
library(tidyr)
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d{10})-(\\d)_R(\\d)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
where df is:
df <- data.frame(name = "0112200015-1_R2_001.fastq.gz")
To make the solution more tailored to your needs, you should provide more examples, so that rare cases and exceptions can be handled.
Several regexes could work here. This one, for example, extracts the first 3 numbers it finds between non-numeric separators:
df %>%
  extract(name, c("sample_id", "timepoint", "paired_end"),
          regex = "^(\\d+)\\D+(\\d+)\\D+(\\d+)")
#> sample_id timepoint paired_end
#> 1 0112200015 1 2
I assume you want to create a new data frame with this information.
I created a vector with values similar to your column names, but you should be using the colnames output.
library(stringr)
vector <- c("1234-1_R2_001.fastq.gz", "5678-1_R2_001.fastq.gz", "1928-1_R2_001.fastq.gz")
df <- data.frame(sample_id  = str_replace(vector, "-.*$", ""),
                 timepoint  = str_extract(vector, "(?<=-)."),
                 paired_end = str_extract(vector, "(?<=R)."))
All the str_* functions are from the stringr package.
This should give you the correct answer using dplyr and stringr in a tidy way. It is based on the assumption that the timepoint and paired_end always consist of one digit. If this is not the case, the small adjustment of replacing "\\d{1}" by "\\d+" returns one or multiple digits, depending on the actual value.
library(dplyr)
library(stringr)
df <- tibble(name = "0112200015-1_R2_001.fastq.gz")
df %>%
  # Extract the 10 digit sample id
  mutate(sample_id = str_extract(name, pattern = "\\d{10}"),
         # Extract the 1 digit timepoint which comes after "-" and before the first "_"
         timepoint = str_extract(name, pattern = "(?<=-)\\d{1}(?=_)"),
         # Extract the 1 digit paired_end which comes after "_R"
         paired_end = str_extract(name, pattern = "(?<=_R)\\d{1}"))
# A tibble: 1 x 4
name sample_id timepoint paired_end
<chr> <chr> <chr> <chr>
1 0112200015-1_R2_001.fastq.gz 0112200015 1 2
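If you prefer base R (my addition, assuming every name matches the pattern used in the first answer), regmatches() with regexec() performs the same extraction:
m <- regmatches(df$name, regexec("^(\\d{10})-(\\d)_R(\\d)", df$name))
parts <- do.call(rbind, m)  # one row per name: full match plus the three capture groups
data.frame(sample_id = parts[, 2], timepoint = parts[, 3], paired_end = parts[, 4])
#    sample_id timepoint paired_end
# 1 0112200015         1          2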

R - Identifying only strings ending with A and B in a column

I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B at the end, and some samples repeat themselves, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is to isolate all sample names that have an A and B at the end and not include the sample names that only have either A or B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried to strip the A's and B's at the end and then count how many times each of those stripped names appears (keeping those with a count > 2), but again, this does not give me the desired results.
Any suggestions?
We could group by the substring that excludes the last character, get the last character with substring, and check that both 'A' and 'B' are present within each group.
library(dplyr)
df %>%
  group_by(grp = substr(Samples, 1, nchar(Samples) - 1)) %>%
  filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
  ungroup %>%
  select(-grp)
Output:
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B
You can extract the last character from Samples into a separate column, keep only the values that have both 'A' and 'B', and keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
  extract(Samples, c('value', 'last'), '(.*)(.)') %>%
  group_by(value) %>%
  filter(all(c('A', 'B') %in% last)) %>%
  ungroup %>%
  distinct(value)
# value
# <chr>
#1 S_026
#2 S_028
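A base R sketch of the same idea (my addition, assuming the label is always a single trailing A or B):
s <- as.character(df$Samples)
stem <- substr(s, 1, nchar(s) - 1)
suffix <- substring(s, nchar(s))
# keep stems that occur with both an "A" and a "B" suffix
both <- tapply(suffix, stem, function(x) all(c("A", "B") %in% x))
names(both)[both]
# [1] "S_026" "S_028"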

Use case_when and startsWith to selectively mutate by row

I'm trying to create a new column based on another, using case_when to give different outputs based on the value of each row.
I start with df <- data.frame(a=c("abc", "123", "abc", "123"))
And want to generate a new column b like so
#> a b
#> 1 abc letter
#> 2 123 number
#> 3 abc letter
#> 4 123 number
I've tried df %>% mutate(b = case_when(startsWith(a, "a") ~ "letter", startsWith(a, "1") ~ "number")) but it gives an error. Can someone show me how to get different values for column b based on the first character of each value in column a?
According to ?startsWith:
x - vector of character strings whose "starts" are considered.
So, startsWith expects a character vector, and here 'a' is a factor. Converting it to character solves the issue:
library(dplyr)
df %>%
  mutate(b = case_when(startsWith(as.character(a), "a") ~ "letter",
                       TRUE ~ "number"))
# a b
#1 abc letter
#2 123 number
#3 abc letter
#4 123 number
In R versions before 4.0.0 the default behavior of data.frame() is stringsAsFactors = TRUE; if we specify stringsAsFactors = FALSE (the default from R 4.0.0 onward), the 'a' column will be of character class.
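For example, building the data like this (a small sketch of that alternative) keeps 'a' as character, so the original case_when call works without as.character():
df <- data.frame(a = c("abc", "123", "abc", "123"), stringsAsFactors = FALSE)
df %>%
  mutate(b = case_when(startsWith(a, "a") ~ "letter",
                       startsWith(a, "1") ~ "number"))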
Another option is str_detect to create a logical expression by checking if the character from the start (^) of the string is a digit ([0-9])
library(stringr)
library(dplyr)
df %>%
  mutate(b = c("letter", "number")[1 + str_detect(a, "^[0-9]")])
# a b
#1 abc letter
#2 123 number
#3 abc letter
#4 123 number
You can just use if_else() since there are only two cases here. A regex seems more appropriate, given the test you're trying to run; the key is that ^ anchors the match to the start of the string, and [:alpha:] matches alphabetical letters regardless of case.
library(tidyverse)
df <- data.frame(a=c("abc", "123", "abc", "123"))
df %>% mutate(
  b = a %>% str_detect("^[:alpha:]") %>% if_else("letter", "number")
)
#> a b
#> 1 abc letter
#> 2 123 number
#> 3 abc letter
#> 4 123 number
Created on 2019-09-29 by the reprex package (v0.3.0)
As #akrun pointed out, there is an issue here with factors vs. characters - are you sure this is an appropriate example for your use case, i.e. your real data is in factors? Luckily, though, str_detect() works just as well either way.

R dplyr rowwise mutate

Good morning all, this is my first time posting on stack overflow. Thank you for any help!
I have 2 data frames that I am using to analyze stock data. One data frame has dates among other information; we can call it df1:
library(tibble)
library(lubridate)
df1 <- tibble(Key = c('a','b','c'), i = 11:13, date = ymd(20110101:20110103))
The second dataframe also has dates and other important information.
df2 <- tibble(Answer = c('a','d','e','b','f','c'), j = 14:19, date = ymd(20150304:20150309))
Here is what I want to do:
For each row in df1, I need to:
- Find the date in df2 that is closest to that row's date in df1, considering only the rows of df2 where df2$Answer matches df1$Key.
- Then extract the value of another column from that row of df2 and put it in a new column in df1.
The code I tried:
df1 %>%
  group_by(Key, i) %>%
  mutate(
    `New Column` = df2$j[
      which.min(subset(df2$date, df2$Answer == Key) - date)])
This has the result:
Key i date `New Column`
1 a 11 2011-01-01 14
2 b 12 2011-01-02 14
3 c 13 2011-01-03 14
This is correct for the first row, a. In df2, the closest date is 2015-03-04, for which the value of j is in fact 14.
However, for the second row, Key=b, I want df2 to subset to only look at dates for rows where df2$Answer = b. Therefore, the date should be 2015-03-07, for which j =17.
Thank you for your help!
Jesse
This should work:
library(dplyr)
df1 %>%
  left_join(df2, by = c("Key" = "Answer")) %>%
  mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
  group_by(Key) %>%
  arrange(date_diff) %>%
  slice(1) %>%
  ungroup()
We are first joining the two data frames with left_join. Yes, I'm aware there are possibly multiple dates for each Key, bear with me.
Next, we calculate (with mutate) the absolute value (abs) of the difference between the two dates date.x and date.y.
Now that we have this, we will group the data by Key using group_by. This will make sure that each distinct Key will be treated separately in subsequent calculations.
Since we've calculated the date_diff, we can now re-order (arrange) the data for each Key, with the smallest date_diff as first for each Key.
Finally, we are only interested in that first, smallest date_diff for each Key, so we can discard the rest using slice(1).
This pipeline gives us the following:
Key i date.x j date.y date_diff
<chr> <int> <date> <int> <date> <time>
1 a 11 2011-01-01 14 2015-03-04 131587200
2 b 12 2011-01-02 17 2015-03-07 131760000
3 c 13 2011-01-03 19 2015-03-09 131846400
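As a side note (my addition, assuming dplyr 1.0 or later), the arrange() + slice(1) step can be collapsed into slice_min(), which directly keeps the row with the smallest date_diff per Key:
df1 %>%
  left_join(df2, by = c("Key" = "Answer")) %>%
  mutate(date_diff = abs(difftime(date.x, date.y, units = "secs"))) %>%
  group_by(Key) %>%
  slice_min(date_diff, n = 1, with_ties = FALSE) %>%
  ungroup()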
