How to extract the longest match? - r

Consider this simple example
library(stringr)
library(dplyr)
dataframe <- data_frame(text = c('how is the biggest ??',
'really amazing stuff'))
# A tibble: 2 x 1
text
<chr>
1 how is the biggest ??
2 really amazing stuff
I need to extract some terms based on a regular expression, but only the longest match in each string.
So far, I have only been able to extract the first match (not necessarily the longest) using str_extract.
> dataframe %>% mutate(mymatch = str_extract(text, regex('\\w+')))
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? how
2 really amazing stuff really
I tried to play with str_extract_all, but I can't find an efficient syntax.
Output should be:
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing
Any ideas?
Thanks!

You can do something like this:
library(stringr)
library(dplyr)
dataframe %>%
mutate(mymatch = sapply(str_extract_all(text, '\\w+'),
function(x) x[nchar(x) == max(nchar(x))][1]))
With purrr:
library(purrr)
dataframe %>%
mutate(mymatch = map_chr(str_extract_all(text, '\\w+'),
~ .[nchar(.) == max(nchar(.))][1]))
Result:
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing
Note:
If there is a tie, this takes the first one.
Data:
dataframe <- data_frame(text = c('how is the biggest ??',
'really amazing biggest stuff'))

An easy way is to break the process down into two steps: first build a list holding all the words in each row, then find and return the longest word from each sub-list:
df <- data_frame(text = c('how is the biggest ??',
'really amazing stuff'))
library(stringr)
#create a list of all words per row
splits<-str_extract_all(df$text, '\\w+', simplify = FALSE)
#find longest word and return it
sapply(splits, function(x) {x[which.max(nchar(x))]})

As a variant of other answers, I'd suggest writing a function that does the manipulation
longest_match <- function(x, pattern) {
matches <- str_match_all(x, pattern)
purrr::map_chr(matches, ~ .[which.max(nchar(.))])
}
Then use it
dataframe %>%
mutate(mymatch = longest_match(text, "\\w+"))
By way of commentary, it seems better practice to isolate the function that does the new work, longest_match(), from the manipulations enabled by mutate(). For instance, the function is easy to test, can be used in other circumstances, and can be modified ('return the last rather than first longest match') independently of the data transformation step. There's no real value in sticking everything into one line, so it makes sense to write lines of code that each logically accomplish one thing: find all matches, then map from all matches to the longest. purrr::map_chr() is better than sapply() because it is more robust: it guarantees that the result is a character vector, so that something like
> df1 = dataframe[FALSE,]
> df1 %>% mutate(mymatch = longest_match(text, "\\w+"))
# A tibble: 0 x 2
# ... with 2 variables: text <chr>, mymatch <chr>
'does the right thing', i.e., mymatch is a character vector (sapply() would return a list in this case).
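One case the function above does not cover is a row with no match at all: which.max() then returns a zero-length result and map_chr() complains. A minimal base R sketch that returns NA for such rows follows; the name longest_match_base is made up for illustration, and the third input string is a hypothetical no-match case that is not in the original data.

```r
# Hypothetical helper: longest regex match per string, NA when nothing matches.
longest_match_base <- function(x, pattern) {
  # gregexpr() finds every match; regmatches() extracts them as a list
  matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(matches, function(m) {
    # which.max() returns the first maximum, so ties go to the earliest match
    if (length(m) == 0) NA_character_ else m[which.max(nchar(m))]
  }, character(1))
}

text <- c("how is the biggest ??", "really amazing stuff", "??")
result <- longest_match_base(text, "\\w+")
print(result)
```

vapply() with character(1) gives the same type guarantee as map_chr(), so the function can still be dropped into mutate().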

Or, using purrr...
library(dplyr)
library(purrr)
library(stringr)
dataframe %>% mutate(mymatch=map_chr(str_extract_all(text,"\\w+"),
~.[which.max(nchar(.))]))
# A tibble: 2 x 2
text mymatch
<chr> <chr>
1 how is the biggest ?? biggest
2 really amazing stuff amazing

Related

How to search for words with asterisks and wildcards (e.g., exampl*) in R (word appearance in a data frame)

I wrote a code to count the appearance of words in a data frame:
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
item <- vector()
count <- vector()
for (i in 1:length(unique(Items))){
item[i] <- Items[i]
count[i]<- sum(df_main$words == item[i])}
word_freq <- data.frame(cbind(item, count))
word_freq
However, the results are like this:
    item count
1 decid*     0
2   head     1
3  heads     1
As you see, it does not correctly count for "decid*". The actual results I expect should be like this:
    item count
1 decid*     2
2   head     1
3  heads     1
I think I need to change the item word (decid*) format, however, I could not figure it out. Any help is much appreciated!
I think you want to use decid* as a wildcard pattern. == looks for an exact match; you can use grepl to look for a pattern instead. Note that decid* used directly as a regex would also match undecided, so glob2rx converts it to the anchored regex ^decid first.
I have used sapply as an alternative to the for loop.
result <- stack(sapply(unique(df1$Items), function(x) {
if(grepl('*', x, fixed = TRUE)) sum(grepl(glob2rx(x), df_main$words))
else sum(x == df_main$words)
}))
result
# values ind
#1 2 decid*
#2 1 head
#3 1 heads
Using tidyverse
library(dplyr)
library(stringr)
df1 %>%
rowwise %>%
mutate(count =sum(str_detect(df_main$words,
str_c("\\b", str_replace(Items, fixed("*"), ".*" ), "\\b")))) %>%
ungroup
-output
# A tibble: 3 × 2
Items count
<chr> <int>
1 decid* 2
2 head 1
3 heads 1
Perhaps as an alternative approach altogether: instead of creating a new dataframe word_freq, why not create a new column in df_main (if that's your "main" dataframe) which indicates whether each word matches one of your (apparently key) Items? That column will not actually contain counts, because the input column words only contains a single word per row. So the question is not how many matches there are for each row but whether there is a match in the first place. That can be indicated by grepl in base R or str_detect in stringr.
EDIT:
Given the newly posted input data
Items <- c('decid*','head', 'heads')
df1<-data.frame(Items)
words<- c('head', 'heads', 'decided', 'decides', 'top', 'undecided')
df_main<-data.frame(words)
and the OP's wish to have the matches in df_main, the solution might be this:
library(stringr)
df_main$Items_match <- +str_detect(df_main$words, str_c(Items, collapse = "|"))
Result:
df_main
words Items_match
1 head 1
2 heads 1
3 decided 1
4 decides 1
5 top 0
6 undecided 1
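The result above flags undecided because the combined pattern is unanchored. If undecided should not count (the expected output in the question suggests it should not), one possible sketch is to treat the items as wildcards and convert them with glob2rx() from utils. The names patterns and counts are only illustrative.

```r
Items <- c("decid*", "head", "heads")
words <- c("head", "heads", "decided", "decides", "top", "undecided")
df_main <- data.frame(words)

# glob2rx() anchors each wildcard: "decid*" -> "^decid", "head" -> "^head$"
patterns <- glob2rx(Items)

# 1 if the word matches any item, 0 otherwise
df_main$Items_match <- +grepl(paste(patterns, collapse = "|"), df_main$words)

# number of matching words per item
counts <- vapply(patterns, function(p) sum(grepl(p, df_main$words)), integer(1))
names(counts) <- Items

print(df_main)
print(counts)
```

With anchored patterns, head also stops matching longer words that merely contain it.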

Remove non-unique string components from a column in R

example <- data.frame(
file_name = c("some_file_name_first_2020.csv",
"some_file_name_second_and_third_2020.csv",
"some_file_name_4_2020_update.csv"),
a = 1:3
)
example
#> file_name a
#> 1 some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3 some_file_name_4_2020_update.csv 3
I have a dataframe that looks something like this example. The "some_file_name" part changes often and the unique identifier is usually in the middle and there can be suffixed information (sometimes) that is important to retain.
I would like to end up with the dataframe below. The approach I can think of is finding all common string "components" and removing them from each row.
desired
#> file_name a
#> 1 first 1
#> 2 second_and_third 2
#> 3 4_update 3
This works for the example shared, perhaps you can use this to make a more general solution :
#split the data on "_" or "."
list_data <- strsplit(example$file_name, '_|\\.')
#Get the words that occur only once
unique_words <- names(Filter(function(x) x==1, table(unlist(list_data))))
#Keep only unique_words and paste the string back.
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))
#[1] "first" "second_and_third" "4_update"
However, this answer relies on the fact that you would have separators like "_" in the filenames to detect each "component".
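A slightly different take, sketched under the same assumption about separators: instead of keeping components that occur exactly once, drop the components that occur in every file name. That keeps a component shared by only some of the files. The function name strip_common_parts is made up for illustration.

```r
# Hypothetical helper: remove the components shared by all elements of x.
strip_common_parts <- function(x, split = "_|\\.") {
  parts <- strsplit(x, split)
  # components present in every file name
  common <- Reduce(intersect, parts)
  vapply(parts, function(p) paste(p[!p %in% common], collapse = "_"),
         character(1))
}

file_name <- c("some_file_name_first_2020.csv",
               "some_file_name_second_and_third_2020.csv",
               "some_file_name_4_2020_update.csv")
cleaned <- strip_common_parts(file_name)
print(cleaned)
```

On the example data both approaches give the same result; they only differ when a component appears in some but not all file names.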

Efficiently transform XML to data frame

I need to transform some vanilla XML into a data frame. The XML is a simple representation of rectangular data (see example below). I can achieve this pretty straightforwardly in R with xml2 and a couple of for loops. However, I'm sure there is a much better/faster way (purrr?). The XML files I will ultimately be working with are very large, so more efficient methods are preferred. I would be grateful for any advice from the community.
library(tidyverse)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
df <- data.frame()
for(i in seq_along(episodes)) {
for(j in seq_along(dx_names)) {
df[i, j] <- xml_text(xml_find_all(episodes[i], xpath = dx_names[j]))
}
}
names(df) <- dx_names
df
#> item1 item2
#> 1 A 1
#> 2 B 2
Created on 2019-09-19 by the reprex package (v0.3.0)
Thank you in advance.
This is a general solution which handles a varying number of different sub-nodes for each parent node. Each Episode node may have different sub-nodes.
This strategy parses the children nodes identifying the name and values of each sub node. Then it converts this list into a longer style dataframe and then reshapes it into your desired wider style:
library(tidyr)
library(xml2)
demo_xml <-
"<DEMO>
<EPISODE>
<item1>A</item1>
<item2>1</item2>
</EPISODE>
<EPISODE>
<item1>B</item1>
<item2>2</item2>
</EPISODE>
</DEMO>"
dx <- read_xml(demo_xml)
#find all episodes
episodes <- xml_find_all(dx, xpath = "//EPISODE")
#extract the node names and values from all of the episodes
nodenames<-xml_name(xml_children(episodes))
contents<-trimws(xml_text(xml_children(episodes)))
#Identify the number of subnodes under each episode for labeling
IDlist<-rep(seq_along(episodes), xml_length(episodes))
#make a long dataframe
df<-data.frame(episodes=IDlist, nodenames, contents, stringsAsFactors = FALSE)
#make the dataframe wide, Remove unused blank nodes:
answer <- spread(df[df$contents!="",], nodenames, contents)
#tidyr 1.0.0 version
#answer <- pivot_wider(df, names_from = nodenames, values_from = contents)
# A tibble: 2 x 3
episodes item1 item2
<int> <chr> <chr>
1 1 A 1
2 2 B 2
This may be an option without using a for loop:
episodes <- xml_find_all(dx, xpath = "//EPISODE")
dx_names <- xml_name(xml_children(episodes[1]))
# You can get all values between the tags by xml_text()
values <- xml_children(episodes) %>% xml_text()
as.data.frame(matrix(values,
ncol=length(dx_names),
dimnames =list(seq_along(episodes),dx_names),byrow=TRUE))
gives,
item1 item2
1 A 1
2 B 2
Note that you may need to convert the item2 column to numeric with as.numeric(), since this solution stores it as a factor (or as character from R 4.0.0 onwards). For a factor, convert with as.character() first.
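The conversion can be sketched without the XML step. The values vector below stands in for the output of xml_text() on the demo document, and stringsAsFactors = TRUE is set explicitly only to reproduce the factor behaviour of older R versions.

```r
# Stand-in for xml_text(xml_children(episodes)) on the demo XML
values <- c("A", "1", "B", "2")
dx_names <- c("item1", "item2")

df <- as.data.frame(
  matrix(values, ncol = length(dx_names), byrow = TRUE,
         dimnames = list(NULL, dx_names)),
  stringsAsFactors = TRUE  # mimic the pre-4.0.0 default
)

# as.numeric() on a factor returns the internal level codes,
# so go through as.character() to get the actual values
df$item2 <- as.numeric(as.character(df$item2))
str(df)
```

The as.character() step is harmless when the column is already character, so the same line works on any R version.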

R: Count the frequency of every unique character in a column

I have a data frame df which contains a column named strings. The values in this column are some sentences.
For example:
id strings
1 "I want to go to school, how about you?"
2 "I like you."
3 "I like you so much"
4 "I like you very much"
5 "I don't like you"
Now, I have a list of stop words,
["I", "don't", "you"]
How can I make another data frame which stores the total number of occurrences of each unique word (except stop words) in the column of the previous data frame?
keyword frequency
want 1
to 2
go 1
school 1
how 1
about 1
like 4
so 1
very 1
much 2
My idea is that:
combine the strings in the column to a big string.
Make a list storing the unique character in the big string.
Make the df whose one column is the unique words.
Compute the frequency.
But this seems really inefficient and I don't know how to really code this.
First, you can create a vector of all words with str_split and then build a frequency table of the words.
library(stringr)
stop_words <- c("I", "don't", "you")
# create a vector of all words in your df
all_words <- unlist(str_split(df$strings, pattern = " "))
# create a frequency table
word_list <- as.data.frame(table(all_words))
# omit all stop words from the frequency table
word_list[!word_list$all_words %in% stop_words, ]
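Splitting on spaces alone leaves punctuation attached, so "you?" and "you." would slip past the stop word filter. A base R sketch that strips punctuation (except apostrophes) first is shown below; the data frame is rebuilt from the example in the question, and the names cleaned, kept and word_freq are only illustrative.

```r
df <- data.frame(
  id = 1:5,
  strings = c("I want to go to school, how about you?",
              "I like you.",
              "I like you so much",
              "I like you very much",
              "I don't like you"),
  stringsAsFactors = FALSE
)
stop_words <- c("I", "don't", "you")

# drop everything that is not a letter, digit, apostrophe or space
cleaned <- gsub("[^[:alnum:]' ]", "", df$strings)

# split on runs of spaces and flatten to one vector of words
all_words <- unlist(strsplit(cleaned, " +"))

# remove stop words, then tabulate
kept <- all_words[!all_words %in% stop_words]
word_freq <- as.data.frame(table(keyword = kept),
                           responseName = "frequency",
                           stringsAsFactors = FALSE)
print(word_freq)
```

The comparison with stop_words is case-sensitive here, so "I" is removed but a lower-case "i" would not be.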
One way is to use tidytext (there is a book on it as well). Here is the code:
library("tidytext")
library("tidyverse")
df <- data.frame( id = 1:6, strings = c("I want to go to school", "how about you?",
"I like you.", "I like you so much", "I like you very much", "I don't like you"))
df %>%
mutate(strings = as.character(strings)) %>%
unnest_tokens(word, strings) %>% # this tokenizes the strings and extracts the words
filter(!word %in% c("I", "i", "don't", "you")) %>%
count(word)
#> # A tibble: 11 x 2
#> word n
#> <chr> <int>
#> 1 about 1
#> 2 go 1
#> 3 how 1
#> 4 like 4
#> 5 much 2
EDIT
All the tokens are transformed to lower case, so you either include i in the stop_words or add the argument to_lower = FALSE to unnest_tokens.
Assuming you have a mystring object and a vector of stopWords, you can do it like this:
# split text into words vector
wordvector = strsplit(mystring, " ")[[1]]
# remove stopwords from the vector
wordvector = wordvector[!wordvector %in% stopWords]
At this point you can turn a frequency table() into a dataframe object:
frequency_df = data.frame(table(wordvector))
Let me know if this can help you.

using variable column names in dplyr summarise

I found this question already asked but without a proper answer. R using variable column names in summarise function in dplyr
I want to calculate the difference between two column means, but the column name should be provided by variables... So far I found only the function as.name to provide column names as text, but this somehow doesn't work here...
With fixed column names it works.
x <- c('a','b')
df <- group_by(data.frame(a=c(1,2,3,4), b=c(2,3,4,5), c=c(1,1,2,2)), c)
df %>% summarise(mean(a) - mean(b))
With variable columns, it doesn't work
df %>% summarise(mean(x[1]) - mean(x[2]))
df %>% summarise(mean(as.name(x[1])) - mean(as.name(x[2])))
Since this was asked already 3 years ago and dplyr is under good development, I am wondering if there is an answer to this now.
You can use base::get:
df %>% summarise(mean(get(x[1])) - mean(get(x[2])))
# # A tibble: 2 x 2
# c `mean(a) - mean(b)`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
get will search in current environment by default.
As the error message says, mean expects a logical or numeric object, as.name returns a name:
class(as.name("a")) # [1] "name"
You could evaluate your name, that would work as well :
df %>% summarise(mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2]))))
# # A tibble: 2 x 2
# c `mean(eval(as.name(x[1]))) - mean(eval(as.name(x[2])))`
# <dbl> <dbl>
# 1 1 -1
# 2 2 -1
This is not a direct answer to your question but maybe could be useful for other people reading your post:
It could be easier to use variable columns directly, like
df %>% summarise(someName = mean(.[[1]]) - mean(.[[2]]))
############ which is the same as ############
df %>% summarise(someName = mean(.[,1,drop=T]) - mean(.[,2,drop=T]))
Note that drop=T is needed because single square brackets preserve the class of the result (here class( . ) = data.frame), which is not what we want: summarise needs the columns in vector form.
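Another option, assuming a reasonably recent dplyr, is the .data pronoun, which looks up a column by the string stored in a variable. This is a sketch on the question's data; the output column name diff is made up for illustration.

```r
library(dplyr)

x <- c("a", "b")
df <- group_by(data.frame(a = c(1, 2, 3, 4),
                          b = c(2, 3, 4, 5),
                          c = c(1, 1, 2, 2)), c)

# .data[[name]] picks the column called `name` within each group
res <- df %>%
  summarise(diff = mean(.data[[x[1]]]) - mean(.data[[x[2]]]))
print(res)
```

Unlike get(), .data only looks inside the data frame, so a variable of the same name in the calling environment cannot be picked up by accident.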
