I have two lists A and B. The dates in A are 2000 - 2022 while those in B are 2023-2030.
names(A) and names(B) give the following character vectors:
a <- c("ACC_a_his", "BCC_b_his", "Can_c_his", "CES_d_his")
b <- c("ACC_a_fu", "BCC_b_fu", "Can_c_fu", "CES_d_fu","FGO_c_fu")
Also, I have a string vector, c which is common across the names in a and b:
c=c("ACC","BCC", "Can", "CES", "FGO")
Note that the strings in c do not always appear in the same position in filenames. The string can be at the beginning, middle or end of filenames.
Challenge
Using the strings in c I would like to get the difference (i.e., which name exists in b but not in a or vice versa) between the names in a and b
Expected output = "FGO_c_fu"
rbind (or whatever is best) matching dataframes in lists A and B if the names are similar based on string in c
Update: See OP's comment:
Try this:
library(dplyr)
library(tibble)
library(tidyr)
library(stringr)
# or just library(tidyverse)
df %>%
pivot_longer(everything()) %>%
mutate(x = str_extract(value, paste(c, collapse = "|"))
) %>%
group_by(x) %>%
filter(!any(row_number() > 1)) %>%
na.omit() %>%
pull(value)
[1] "FGO_c_fu"
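For comparison, here is a minimal base R sketch of the same idea. It assumes each name contains at most one of the keys; the vector is called keys here (rather than c) only to avoid masking base::c, and key_of is a hypothetical helper, not part of any package.

```r
a <- c("ACC_a_his", "BCC_b_his", "Can_c_his", "CES_d_his")
b <- c("ACC_a_fu", "BCC_b_fu", "Can_c_fu", "CES_d_fu", "FGO_c_fu")
keys <- c("ACC", "BCC", "Can", "CES", "FGO")

# One alternation pattern that matches any key, wherever it sits in the name
pattern <- paste(keys, collapse = "|")

# Hypothetical helper: first key found in each name, NA if none is found
key_of <- function(x) {
  m <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m > 0] <- regmatches(x, m)
  out
}

key_a <- key_of(a)
key_b <- key_of(b)

# Names whose key is missing from the other vector, in either direction
unmatched <- c(a[!key_a %in% key_b], b[!key_b %in% key_a])
unmatched
```

Because the comparison runs in both directions, a name that exists only in a would be reported as well.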
First answer:
Here is an alternative approach:
We create a list
the vectors are of unequal length
With data.frame(lapply(my_list, `length<-`, max(lengths(my_list)))) we create a data frame
pivot longer and group by everything before the first underscore
remove NA and filter:
library(dplyr)
library(tidyr)
library(tibble)
my_list <- tibble::lst(a, b)
df <- data.frame(lapply(my_list, `length<-`, max(lengths(my_list))))
df %>%
pivot_longer(everything()) %>%
group_by(x = sub("\\_.*", "", value)) %>%
filter(!any(row_number() > 1)) %>%
na.omit() %>%
pull(value)
[1] "FGO_c_fu"
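The second part of the question (row-binding the matching data frames) is not covered above. A rough sketch, assuming every list name contains exactly one key; the small lists A and B below are made-up stand-ins for the real data:

```r
# Made-up stand-ins for the real lists of data frames
A <- list(ACC_a_his = data.frame(year = 2000:2001, value = c(1, 2)),
          BCC_b_his = data.frame(year = 2000:2001, value = c(3, 4)))
B <- list(ACC_a_fu = data.frame(year = 2023:2024, value = c(5, 6)),
          BCC_b_fu = data.frame(year = 2023:2024, value = c(7, 8)),
          FGO_c_fu = data.frame(year = 2023:2024, value = c(9, 10)))
keys <- c("ACC", "BCC", "Can", "CES", "FGO")
pattern <- paste(keys, collapse = "|")

# Key of each list element (assumes every name matches exactly one key,
# otherwise regmatches() drops the non-matching names and positions shift)
key_a <- regmatches(names(A), regexpr(pattern, names(A)))
key_b <- regmatches(names(B), regexpr(pattern, names(B)))

# Bind the historical and future data frames for every key present in both
shared <- intersect(key_a, key_b)
combined <- lapply(shared, function(k) {
  rbind(A[[match(k, key_a)]], B[[match(k, key_b)]])
})
names(combined) <- shared
combined$ACC
```

Keys that occur in only one list (FGO in this toy data) are simply left out of combined.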
Let's say I have a string as follows:
string <- "the home home on the range the friend"
All I want to do is determine which words in the string appear at least 2 times.
The pseudocode here is:
Count how many times each word appears
Return list of words that have at least two appearances in the string
Final result should be a list featuring both the and home, in that order.
I am hoping to do this using the tidyverse, ideally with stringr or dplyr. Was attempting to use tidytext as well but have been struggling.
We can split the string by space, get the table and subset based on frequency
out <- table(strsplit(string, "\\s+")[[1]])
out[out >=2]
home the
2 3
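Note that table() orders its result alphabetically, which is why home comes before the here. If the order requested in the question matters, one possible tweak (a sketch, not the only way) is to sort the filtered table by decreasing count:

```r
string <- "the home home on the range the friend"

# Count each whitespace-separated word
counts <- table(strsplit(string, "\\s+")[[1]])

# Keep words seen at least twice, most frequent first
repeated <- sort(counts[counts >= 2], decreasing = TRUE)
names(repeated)
```

names(repeated) then gives the words themselves, and repeated still carries the counts.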
Yet another possible solution:
library(tidyverse)
data.frame(x = str_split(string, "\\s+", simplify = T) %>% t) %>%
add_count(x) %>%
filter(n >= 2) %>%
distinct %>%
pull(x)
#> [1] "the" "home"
library(tidyverse)
data.frame(string) %>%
separate_rows(string) %>%
count(string, sort = TRUE) %>%
filter(n >= 2)
Result
# A tibble: 2 × 2
string n
<chr> <int>
1 the 3
2 home 2
Here's an approach using quanteda that prints "the" before "home" as requested in the original post.
library(quanteda)
aString <- "the home home on the range the friend"
aDfm<- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]
...and the output:
> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the" "home"
Here is another option using tidytext and tidyverse, where we first separate each word (unnest_tokens), then we can count each word and sort by frequency. Then, we keep only words that have more than 1 observation, then use tibble::deframe to return a named vector.
library(tidytext)
library(tidyverse)
tibble(string) %>%
unnest_tokens(word, string) %>%
count(word, sort = TRUE) %>%
filter(n >= 2) %>%
deframe()
Output
the home
3 2
Or if you want to leave as a dataframe, then you can just ignore the last step with deframe.
I'm trying to extract dates from a Notes column using tidyr's extract function. The data I'm working on looks like this:
dates <- data.frame(col1 = c("customer", "customer2", "customer3"),
Notes = c("DOB: 12/10/62
START: 09/01/2019
END: 09/01/2020", "
S/DATE: 28/08/19
R/DATE: 27/08/20", "DOB: 13/01/1980
Start:04/12/2018"),
End_date = NA,
Start_Date = NA )
I tried extracting the date following the string "S/DATE" like this:
extract <- extract(
dates,
col = "Notes",
into = "Start_date",
regex = "(?<=(S\\/DATE:)).*" # Using regex lookbehind
)
However, this only extracts the string "S/DATE:", not the date after it. When I tried this on regex101.com, it works as expected.
Thanks. Ibrahim
You could use sub here for a base R option:
s_date <- ifelse(grepl("S/DATE", dates$Notes),
sub("^.*\\bS/DATE: (\\S+).*$", "\\1", dates$Notes), NA)
s_date
[1] NA "28/08/19" NA
Note that the call to grepl above is needed because sub returns the entire input string unchanged (in this case the full Notes) when S/DATE is not found in the text.
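As for the original tidyr attempt: extract() fills the into columns from the regex capture groups, and the only group in that pattern wraps S/DATE: itself, which would explain the result. A minimal sketch that puts the capture group around the date instead (the data frame is retyped here with explicit \n line breaks):

```r
library(tidyr)

dates <- data.frame(
  col1 = c("customer", "customer2", "customer3"),
  Notes = c("DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020",
            "\nS/DATE: 28/08/19\nR/DATE: 27/08/20",
            "DOB: 13/01/1980\nStart:04/12/2018")
)

# The parenthesised group is what ends up in Start_date;
# rows without S/DATE: get NA
res <- extract(dates, col = "Notes", into = "Start_date",
               regex = "S/DATE:\\s*(\\S+)", remove = FALSE)
res$Start_date
```

remove = FALSE keeps the Notes column alongside the new one.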
Here is another method. (It assumes you need the value after either S/DATE or START, since your expected new column is named Start_date. If you don't need both, the code is easy to modify.)
Explanation -
In the innermost expression, the Notes column is split into a list on any of these separators: a colon, a space, | or \n.
Blanks are then removed from each list element.
In each cleaned element, the item immediately after Start or S/Date is extracted; sapply simplifies the result into a vector where possible.
Lastly, lubridate::dmy is applied in the outermost expression.
sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])
[1] "09/01/2019" "28/08/19" "04/12/2018"
If you wrap the above in lubridate::dmy, the values are also parsed into proper dates
dmy(sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))]))
[1] "2019-01-09" "2019-08-28" "2018-12-04"
Further, this can be passed into dplyr pipes, so as to simultaneously create a new column in your dates
dates %>% mutate(Start_Date = dmy(sapply(strsplit(Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])))
col1 Notes End_date Start_Date
1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
I would combine stringr and lubridate:
dates %>%
mutate(
Start_Date =
sub("\ns/date:", "\nstart:", tolower(Notes)) %>%
str_remove_all("(.*\nstart:)|(\n.*)") %>%
trimws() %>%
lubridate::dmy()
)
# col1 Notes End_date Start_Date
# 1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
# 2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
# 3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
The answer is not as concise, but I find it intuitive and easy to follow the steps.
First I replace one start pattern with the other (sub), after using tolower to convert everything to lowercase. Then I remove everything before the start date and everything after the line break (str_remove_all). Finally I trim whitespace (trimws) and convert the result to a date (lubridate::dmy).
Another approach is splitting the text and dealing with smaller chunks.
Step by step illustration, with one row of data
# Split the text on newlines, yielding dates with labels
dates$Notes %>% head(1) %>% strsplit("\n")
[[1]]
[1] "DOB: 12/10/62" "START: 09/01/2019" "END: 09/01/2020"
Drilling down to the next level
# Split each name/value pair on colons
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*")
[[1]]
[1] "DOB" "12/10/62"
[[2]]
[1] "START" "09/01/2019"
[[3]]
[1] "END" "09/01/2020"
Extract the individual values
# extract a vector of name labels
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1])
[1] "DOB" "START" "END"
# extract a vector of associated values
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
[1] "12/10/62" "09/01/2019" "09/01/2020"
With some clever dplyr usage, you'll get a data frame
dates %>%
group_by(col1) %>%
# summarize can collapse many rows into one or expand one into many
summarize(
name = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1]),
value = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
) %>%
ungroup()
Result, all of the values separated and ready for further processing
# A tibble: 8 x 3
col1 name value
<chr> <chr> <chr>
1 customer DOB 12/10/62
2 customer START 09/01/2019
3 customer END 09/01/2020
4 customer2 NA NA
5 customer2 S/DATE 28/08/19
6 customer2 R/DATE 27/08/20
7 customer3 DOB 13/01/1980
8 customer3 Start 04/12/2018
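In dplyr 1.1.0 and later, returning several rows per group from summarize() is deprecated in favour of reframe(). A sketch of the same step with reframe(), assuming such a dplyr version, which also splits each note only once:

```r
library(dplyr)

dates <- data.frame(
  col1 = c("customer", "customer2", "customer3"),
  Notes = c("DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020",
            "\nS/DATE: 28/08/19\nR/DATE: 27/08/20",
            "DOB: 13/01/1980\nStart:04/12/2018")
)

pairs <- dates %>%
  group_by(col1) %>%
  # reframe() may return any number of rows per group and drops the grouping
  reframe({
    parts <- strsplit(unlist(strsplit(Notes, "\n")), ":\\s*")
    tibble(name  = sapply(parts, `[`, 1),
           value = sapply(parts, `[`, 2))
  })
pairs
```

The empty first line of customer2's note still produces a row of NA values, as in the result above.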
I want to extract bigrams from sentences, using the regex described here and store the output to a new column which references the original.
library(dplyr)
library(stringr)
library(splitstackshape)
df <- data.frame(a =c("apple orange plum"))
# Single Words - Successful
df %>%
# Base R
mutate(b = sapply(regmatches(a,gregexpr("\\w+\\b", a, perl = TRUE)),
paste, collapse=";")) %>%
# Duplicate with Stringr
mutate(c = sapply(str_extract_all(a,"\\w+\\b"),paste, collapse=";")) %>%
cSplit(., c(2,3), sep = ";", direction = "long")
Initially, I thought the problem seemed to be with the regex engine but neither stringr::str_extract_all (ICU) nor base::regmatches (PCRE) works.
# Bigrams - Fails
df %>%
# Base R
mutate(b = sapply(regmatches(a,gregexpr("(?=(\\b\\w+\\s+\\w+))", a, perl = TRUE)),
paste, collapse=";")) %>%
# Duplicate with Stringr
mutate(c = sapply(str_extract_all(a,"(?=(\\b\\w+\\s+\\w+))"),paste, collapse=";")) %>%
cSplit(., c(2,3), sep = ";", direction = "long")
As a result, I'm guessing the problem is probably to do with using a zero-width lookahead around a capturing group. Is there any valid regex in R that allows these bigrams to be extracted?
As @WiktorStribiżew suggested, using str_match_all helps here. Here's how to apply it to multiple rows of a data frame. Let
(df <- data.frame(a = c("one two three", "four five six")))
# a
# 1 one two three
# 2 four five six
Then we may do
df %>% rowwise() %>%
do(data.frame(., b = str_match_all(.$a, "(?=(\\b\\w+\\s+\\w+))")[[1]][, 2], stringsAsFactors = FALSE))
# Source: local data frame [4 x 2]
# Groups: <by row>
#
# A tibble: 4 x 2
# a b
# * <fct> <chr>
# 1 one two three one two
# 2 one two three two three
# 3 four five six four five
# 4 four five six five six
where stringsAsFactors = FALSE is just to avoid warnings when binding rows.
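If a regex is not a hard requirement, the bigrams can also be built without one by splitting each sentence into words and pasting neighbours together. This is a different technique from the lookahead above; bigrams is a hypothetical helper name:

```r
# Hypothetical helper: all adjacent word pairs of a single sentence
bigrams <- function(x) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

df <- data.frame(a = c("one two three", "four five six"))

# One vector of bigrams per row, then a long data frame that repeats the
# original sentence once per bigram
pairs <- lapply(df$a, bigrams)
long <- data.frame(a = rep(df$a, lengths(pairs)), b = unlist(pairs))
long
```

Sentences with fewer than two words contribute no rows to long.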
I'm trying to separate a string column into two pieces based on chopping up the string. It's best illustrated with example below. rowwise does work, but given the size of the data.frame, I'd like to use a more efficient method. How can I avoid using rowwise?
library(dplyr)
library(stringr)
library(tidyr)
#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"
df <- data.frame(size = c(a,b,c))
# Goal is to separate the 'size' column into 'lower' and 'upper' by
# extracting the value contained in the parens and split by a comma.
# Once the column is split into 'upper' and 'lower' I will perform
# additional operations.
# DESIRED RESULT
size lower upper
<fct> <chr> <chr>
1 (1, 10) 1 10
2 (10, 20) 10 20
3 (20, 30) 20 30
# WHAT I HAVE TRIED
> #This works... but too inefficient
> df %>%
+ rowwise() %>%
+ mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+ str_split("\\(") %>% .[[1]] %>% .[2])
size lower
<fct> <chr>
1 (1, 10) 1
2 (10, 20) 10
3 (20, 30) 20
> # I'm not sure why this doesn't work
> df %>%
+ mutate(lower = str_split(size, ",") %>% .[[1]] %>% .[1] %>%
+ str_split("\\(") %>% .[[1]] %>% .[2])
size lower
1 (1, 10) 1
2 (10, 20) 1
3 (20, 30) 1
> #Not obvious how to use separate (tidyr)
> df %>%
+ separate(size, sep=",", c("lower", "upper"))
lower upper
1 (1 10)
2 (10 20)
3 (20 30)
You don't state your goal explicitly, but it seems like you want to extract the first number from a string. This is easy with stringi::stri_extract_first_regex
library(stringi)
stri_extract_first_regex(df$size, "[0-9]+")
# [1] "1" "10" "20"
So in your case,
df %>% mutate(lower = as.numeric(stri_extract_first_regex(size, "[0-9]+")))
You can extract all numbers with stri_extract_all_regex.
Based on your edits:
df$nums = str_extract_all(df$size, "[0-9]+")
df$lower = as.numeric(sapply(df$nums, `[[`, 1))
df$upper = as.numeric(sapply(df$nums, `[[`, 2))
df
# size nums lower upper
# 1 (1, 10) 1, 10 1 10
# 2 (10, 20) 10, 20 10 20
# 3 (20, 30) 20, 30 20 30
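As for why the attempt without rowwise returns 1 for every row: str_split returns one list element per row, and .[[1]] keeps only the first row's pieces, so that single value is recycled down the column. A base R sketch of the vectorised idea that needs no extra packages:

```r
df <- data.frame(size = c("(1, 10)", "(10, 20)", "(20, 30)"))

# All runs of digits per row, as a list of character vectors
nums <- regmatches(df$size, gregexpr("[0-9]+", df$size))

# Pick the first and second number of every row, not just of the first row
df$lower <- as.numeric(vapply(nums, `[`, character(1), 1))
df$upper <- as.numeric(vapply(nums, `[`, character(1), 2))
df
```

vapply with character(1) raises an error if an element does not yield exactly one string, which can help catch malformed rows early.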
Another way to go is to get rid of the parens and whitespace and then use separate:
df %>%
mutate(just_nums = str_replace_all(size, "[^0-9,]", "")) %>%
separate(just_nums, into = c("lower", "upper"))
# size lower upper
# 1 (1, 10) 1 10
# 2 (10, 20) 10 20
# 3 (20, 30) 20 30
The regex pattern "[^0-9,]" matches any character that is not a digit or a comma.
For rowwise operations, I prefer data.table.
Try this
library(data.table)
library(stringi)
#make data
a <- "(1, 10)"
b <- "(10, 20)"
c <- "(20, 30)"
dt <- data.table(c(a,b,c))
dt[, lower := tstrsplit(V1, ",")[1]]
dt[, lower:= stri_replace_all_regex(lower, '\\(', '')]
dt
An option is to use tidyr::separate after removing both ( and ) from the data.
library(tidyverse)
df %>% mutate(size = gsub("\\(|)","",size)) %>% # Both ( and ) have been removed.
separate(size, c("Min", "Max"), sep = ",")
# Min Max
# 1 1 10
# 2 10 20
# 3 20 30
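One caveat: with sep = "," the Max values keep the space that follows the comma (" 10" rather than "10"). A small variation, offered as a sketch, splits on the comma plus any following whitespace and lets separate convert the columns to numbers:

```r
library(dplyr)
library(tidyr)

df <- data.frame(size = c("(1, 10)", "(10, 20)", "(20, 30)"))

res <- df %>%
  mutate(size = gsub("[()]", "", size)) %>%   # drop both parentheses
  separate(size, c("Min", "Max"),
           sep = ",\\s*",                     # comma plus optional spaces
           convert = TRUE)                    # turn "1", "10" into integers
res
```

With convert = TRUE both columns come back as integers, so no as.numeric step is needed afterwards.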
You are almost there. Here are two approaches, one of which is similar to yours:
In the first, I use unnest_tokens from the tidytext package, which splits the words onto separate rows. Since you want the first item before the comma (I have assumed this from your example, although you should state it), I keep the first row of each pair with filter.
In the second, I use a regex (str_replace would also work here). Because str_split returns a list, I use map to iterate over its elements and pass each one to gsub, which replaces the regex match with the back-referenced group. To select only the first item, I add [[1]] after the gsub call.
library(tidyverse)
library(stringr)
library(tidytext)
df %>%
unnest_tokens(lower,size, token="words",drop=F) %>%
filter(row_number()%%2==T)
df %>%
mutate(lower = map(str_split(df$size, ","), function(x)gsub("\\((\\w+)","\\1",x)[[1]]))
Output:
# size lower
# 1 (1, 10) 1
# 2 (10, 20) 10
# 3 (20, 30) 20
If you want to extract the terms both before and after the comma, you can use the extract function as well.
tidyr::extract(df, size, c("lower", "upper"), regex= "\\((\\w+),\\s+(\\w+)\\)")
Output:
# lower upper
# 1 1 10
# 2 10 20
# 3 20 30