I'm trying to extract dates from a Notes column using tidyr's extract function. The data I'm working on looks like this:
dates <- data.frame(
  col1 = c("customer", "customer2", "customer3"),
  Notes = c("DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020",
            "\nS/DATE: 28/08/19\nR/DATE: 27/08/20",
            "DOB: 13/01/1980\nStart:04/12/2018"),
  End_date = NA,
  Start_Date = NA)
I tried extracting the date following the string "S/DATE" like this:
extract <- extract(
  dates,
  col = "Notes",
  into = "Start_date",
  regex = "(?<=(S\\/DATE:)).*" # using a regex lookbehind
)
However, this only extracts the string "S/DATE:", not the date after it. When I tried this on regex101.com, it works as expected.
Thanks. Ibrahim
You could use sub here for a base R option:
s_date <- ifelse(grepl("S/DATE", dates$Notes),
sub("^.*\\bS/DATE: (\\S+).*$", "\\1", dates$Notes), NA)
s_date
[1] NA "28/08/19" NA
Note that the call to grepl above is needed here, because sub will return the entire input string (in this case the full Notes value) whenever S/DATE is not found in the text.
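To see why, here is what sub alone does on a row that has no S/DATE label: the pattern fails to match, so the whole string comes back unchanged.
sub("^.*\\bS/DATE: (\\S+).*$", "\\1", "DOB: 13/01/1980\nStart:04/12/2018")
[1] "DOB: 13/01/1980\nStart:04/12/2018"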
Here is another method. (It assumes you want whichever of S/DATE or START is present, since your expected new column is Start_date.) If you don't need all of those values, the syntax is easy to modify.
Explanation -
In the innermost expression the Notes column is split into a list on either of the separators : or \n.
Blanks are then removed from that list.
The item next to Start or S/DATE is extracted with sapply, which simplifies the list into a vector (where possible).
Lastly, lubridate::dmy is used in the outermost expression.
sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])
[1] "09/01/2019" "28/08/19" "04/12/2018"
If you wrap the above in lubridate::dmy, the dates will be converted to proper Date objects too:
dmy(sapply(strsplit(dates$Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))]))
[1] "2019-01-09" "2019-08-28" "2018-12-04"
Further, this can be used inside a dplyr pipe to create the new column directly in your dates data frame:
dates %>% mutate(Start_Date = dmy(sapply(strsplit(Notes,
"[: | \n]"),
function(x) subset(x, x != "")[1 + which(toupper(subset(x, x != "")) %in% c("S/DATE", "START"))])))
col1 Notes End_date Start_Date
1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
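The same idea extends to the end date (END or R/DATE), should you also want to fill End_date. A sketch (assuming dplyr and lubridate are loaded, as above), with a small guard so a row that has neither label returns NA instead of breaking the simplification:
dates %>%
  mutate(End_date = dmy(sapply(strsplit(Notes, "[: | \n]"), function(x) {
    x <- subset(x, x != "")
    i <- which(toupper(x) %in% c("R/DATE", "END"))[1]
    # NA when neither label is present (dmy may warn that it failed to parse)
    if (is.na(i)) NA_character_ else x[i + 1]
  })))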
I would combine stringr and lubridate:
dates %>%
mutate(
Start_Date =
sub("\ns/date:", "\nstart:", tolower(Notes)) %>%
str_remove_all("(.*\nstart:)|(\n.*)") %>%
trimws() %>%
lubridate::dmy()
)
# col1 Notes End_date Start_Date
# 1 customer DOB: 12/10/62\nSTART: 09/01/2019\nEND: 09/01/2020 NA 2019-01-09
# 2 customer2 \nS/DATE: 28/08/19\nR/DATE: 27/08/20 NA 2019-08-28
# 3 customer3 DOB: 13/01/1980\nStart:04/12/2018 NA 2018-12-04
This answer is not as concise, but I find it intuitive and the steps easy to follow.
First I substitute one start pattern for the other (sub), using tolower to make everything lowercase. Then I remove everything before the start date and everything after the line break (str_remove_all). Finally I trim whitespace (trimws) and convert to a date (lubridate::dmy).
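A quick sketch of the intermediate steps, run on the first Notes value, to show what each call does:
x <- tolower(dates$Notes[1])
x
[1] "dob: 12/10/62\nstart: 09/01/2019\nend: 09/01/2020"
x <- sub("\ns/date:", "\nstart:", x)  # no s/date label in this row, so unchanged
x <- stringr::str_remove_all(x, "(.*\nstart:)|(\n.*)")
x
[1] " 09/01/2019"
lubridate::dmy(trimws(x))
[1] "2019-01-09"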
Another approach is splitting the text and dealing with smaller chunks.
A step-by-step illustration, with one row of data:
# Split the text on newlines, yielding dates with labels
dates$Notes %>% head(1) %>% strsplit("\n")
[[1]]
[1] "DOB: 12/10/62" "START: 09/01/2019" "END: 09/01/2020"
Drilling down to the next level
# Split each name/value pair on colons
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*")
[[1]]
[1] "DOB" "12/10/62"
[[2]]
[1] "START" "09/01/2019"
[[3]]
[1] "END" "09/01/2020"
Extract the individual values
# extract a vector of name labels
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1])
[1] "DOB" "START" "END"
# extract a vector of associated values
dates$Notes %>% head(1) %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
[1] "12/10/62" "09/01/2019" "09/01/2020"
With some clever dplyr usage, you'll get a data frame
dates %>%
group_by(col1) %>%
# summarize can collapse many rows into one or expand one into many
summarize(
name = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[1]),
value = Notes %>% strsplit("\n") %>%
unlist() %>% strsplit(":\\s*") %>%
sapply(function(x) x[2])
) %>%
ungroup()
Result, all of the values separated and ready for further processing
# A tibble: 8 x 3
col1 name value
<chr> <chr> <chr>
1 customer DOB 12/10/62
2 customer START 09/01/2019
3 customer END 09/01/2020
4 customer2 NA NA
5 customer2 S/DATE 28/08/19
6 customer2 R/DATE 27/08/20
7 customer3 DOB 13/01/1980
8 customer3 Start 04/12/2018
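From that long name/value table it is only one more step to the column the original question was after. A sketch, assuming the summarized tibble above has been saved as long, and that lubridate is available:
library(lubridate)
long %>%
  mutate(name = toupper(name)) %>%
  filter(name %in% c("START", "S/DATE")) %>%
  transmute(col1, Start_Date = dmy(value))
# A tibble: 3 x 2
  col1      Start_Date
  <chr>     <date>
1 customer  2019-01-09
2 customer2 2019-08-28
3 customer3 2018-12-04
The result can then be joined back onto dates by col1 if needed.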
Related
Let's say I have a string as follows:
string <- "the home home on the range the friend"
All I want to do is determine which words in the string appear at least 2 times.
The pseudocode here is:
Count how many times each word appears
Return list of words that have more than two appearances in the string
The final result should be a list featuring both "the" and "home", in that order.
I am hoping to do this using the tidyverse, ideally with stringr or dplyr. I was attempting to use tidytext as well but have been struggling.
We can split the string on whitespace, get the table, and subset based on frequency:
out <- table(strsplit(string, "\\s+")[[1]])
out[out >=2]
home the
2 3
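If you only want the words themselves, most frequent first (so "the" before "home" as requested), sort the same table and take its names:
names(sort(out[out >= 2], decreasing = TRUE))
[1] "the"  "home"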
Yet another possible solution:
library(tidyverse)
data.frame(x = str_split(string, "\\s+", simplify = T) %>% t) %>%
add_count(x) %>%
filter(n >= 2) %>%
distinct %>%
pull(x)
#> [1] "the" "home"
library(tidyverse)
data.frame(string) %>%
separate_rows(string) %>%
count(string, sort = TRUE) %>%
filter(n >= 2)
Result
# A tibble: 2 × 2
string n
<chr> <int>
1 the 3
2 home 2
Here's an approach using quanteda that prints "the" before "home" as requested in the original post.
library(quanteda)
aString <- "the home home on the range the friend"
aDfm <- dfm(tokens(aString))
# extract the features where the count > 1
aDfm@Dimnames$features[aDfm@x > 1]
...and the output:
> aDfm@Dimnames$features[aDfm@x > 1]
[1] "the" "home"
Here is another option using tidytext and tidyverse, where we first separate each word (unnest_tokens), then count each word and sort by frequency. We then keep only words with more than one occurrence and use tibble::deframe to return a named vector.
library(tidytext)
library(tidyverse)
tibble(string) %>%
unnest_tokens(word, string) %>%
count(word, sort = TRUE) %>%
filter(n >= 2) %>%
deframe()
Output
the home
3 2
Or, if you want to keep the result as a data frame, just skip the last deframe step.
I have a data frame where some values for "revenue" are listed in the hundreds, say "300," and others are listed as "1.5k." Obviously this is annoying, so I need to find some way of splitting the "k" and "." characters from those values and only those values. Any thoughts?
Another way to do this is just with regex (and the tidyverse for pipes):
library(tidyverse)
string <- c("300", "1.5k")
string %>% ifelse(
# check if string ends in k (upper/lower case)
grepl("[kK]$", .),
# if string ends in k, remove it and multiply by 1000
1000 * as.numeric(gsub("[kK]$", "", .)),
.) %>% as.numeric()
[1] 300 1500
You could create a function that removes the "k", converts to a numeric vector, and multiplies by 1,000.
to_1000 <- function(x){
x %>%
str_remove("k") %>%
as.numeric() %>%
{.*1000}
}
x <- c("3000","1.5k")
tibble(x) %>%
mutate(x_num = if_else(str_detect(x,"k"),to_1000(x),as.numeric(x)))
# A tibble: 2 x 2
x x_num
<chr> <dbl>
1 3000 3000
2 1.5k 1500
I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B, and some samples repeat themselves, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is isolate all sample names that appear with both an A and a B at the end, and not include the sample names that only have either A or B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried to strip the A's and B's at the end and then count how many times each of those stripped names appears, keeping those that appear at least twice, but again, this does not give me the desired results.
Any suggestions?
We could group by the substring that excludes the last character, then take the last character with substring and check that both 'A' and 'B' are present within each group.
library(dplyr)
df %>%
group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>%
filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
ungroup %>%
select(-grp)
Output:
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B
You can extract the last character of Samples into a different column, keep only those values that have both 'A' and 'B', and then keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
extract(Samples, c('value', 'last'), '(.*)(.)') %>%
group_by(value) %>%
filter(all(c('A', 'B') %in% last)) %>%
ungroup %>%
distinct(value)
# value
# <chr>
#1 S_026
#2 S_028
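The counting approach described in the question also works directly in base R. A sketch (the helper names base_name and last_chr are just illustrative): strip the trailing letter, then keep the groups that contain both an A and a B.
base_name <- sub("[AB]$", "", df$Samples)
last_chr <- substr(df$Samples, nchar(df$Samples), nchar(df$Samples))
names(which(tapply(last_chr, base_name, function(x) all(c("A", "B") %in% x))))
[1] "S_026" "S_028"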
I have a dataframe that looks like this:
index id
1 abc;def;ghi;jkl;mno
2 bcd;efg;hij;klm;nop
3 cde;fgh;ijk;lmn;opq
.
.
.
I would like to use R to find if "abc" is in the dataframe and return its index.
I have tried separating the 'id' column into 5 different columns and checking whether "abc" is in each row, but my dataset contains about 200,000 rows and it takes too long to loop through every row. I am wondering if there is a more efficient way to detect it.
For example, "abc" is part of df$id[1], then the result should return 1;
"cde" should return 3.
You can use the which function in combination with grepl like this:
which(grepl("abc", df$id))
grepl returns TRUE if "abc" is contained in a string and FALSE otherwise.
which returns the indices of the entries that are TRUE.
Or even easier with grep:
grep("abc", df$id)
Try:
library(stringr)
df[str_detect(df$id, "abc"), "index"]
I've been using a %g% operator lately (inspired by %in%) to do this type of thing:
library(tidyverse)
`%g%` <- function(x,y) {
z <- paste0(y, collapse = "|")
grepl(z, x, ignore.case = T)
}
df <- read.table(h = T,
stringsAsFactors = F,
text = "index id
1 abc;def;ghi;jkl;mno
2 bcd;efg;hij;klm;nop
3 cde;fgh;ijk;lmn;opq")
df %>%
filter(id %g% "abc") %>%
pull(index)
#> [1] 1
df %>%
filter(id %g% "cde") %>%
pull(index)
#> [1] 3
This supports multiple values as well:
df %>%
filter(id %g% c("abc", "cde")) %>%
pull(index)
#> [1] 1 3
Created on 2019-04-24 by the reprex package (v0.2.1)
I am trying to find the words with more than 4 letters in each sentence of a text.
I tried this:
fullsetence <- as.character(c("A test setence with test length","A second test for length"))
nchar(fullsetence)
Based on the example above, I expect the result to show that the first sentence/string has 2 words with more than 4 letters and the second also has 2 words.
Using nchar I only get the total number of characters in each string.
What is the right way to do this?
library(dplyr)
library(purrr)
# vector of sentences
fullsetence <- as.character(c("A test setence with test length","A second test for length"))
# get vector of counts for words with more than 4 letters
fullsetence %>%
strsplit(" ") %>%
map(~sum(nchar(.) > 4)) %>%
unlist()
# [1] 2 2
# create a dataframe with sentence and the corresponding counts
# use previous code as a function within "mutate"
data.frame(fullsetence, stringsAsFactors = F) %>%
mutate(Counts = fullsetence %>%
strsplit(" ") %>%
map(~sum(nchar(.) > 4)) %>%
unlist() )
# fullsetence Counts
# 1 A test setence with test length 2
# 2 A second test for length 2
If you want to get the actual words with more than 4 letters you can use this in a similar way:
fullsetence %>%
strsplit(" ") %>%
map(~ .[nchar(.) > 4])
data.frame(fullsetence, stringsAsFactors = F) %>%
mutate(Words = fullsetence %>%
strsplit(" ") %>%
map(~ .[nchar(.) > 4]))
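If stringr is loaded anyway, a compact alternative (a sketch) is to count or extract words of 5 or more letters directly with a quantified pattern:
library(stringr)
str_count(fullsetence, "\\b\\w{5,}\\b")
# [1] 2 2
str_extract_all(fullsetence, "\\b\\w{5,}\\b")
# [[1]]
# [1] "setence" "length"
# [[2]]
# [1] "second" "length"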