Finding the Matching Characters after a Pattern in an R DataFrame

I am fairly new to string manipulation, and I am stuck on a problem involving string and character data in an R dataframe. I am attempting to extract numeric values that appear after a pattern in a long string and then store the result as a new column in my dataframe. I have a fairly large dataset, and I am trying to pull some useful information out of a column called "notes".
For instance, the strings I am interested in always follow this pattern (there is nothing significant about the tasks):
df$notes[1] <- "On 5 June, some people walked down the street in this area. [size=around 5]"
df$notes[2] <- "On 6 June, some people rode bikes down the street in this area. [size= nearly 4]"
df$notes[3] <- "On 7 June, some people walked into a grocery store in this area. [size= about 100]"
In some columns, we do not get a numeric value, and that is a problem I can deal with after I get a solution to this one. Those rows follow something similar to this:
df$notes[4] <- "On 10 July, an hundreds of people drank water from this fountain [size=hundreds]"
df$notes[5] <- "on 15 August, an unreported amount of people drove their cars down the street. [size= no report]"
I am trying to extract the entire match after "size= (some quantifier)", and store the value into an appended column of my dataframe.
Eventually, I need to write a loop that goes through this column (call it "notes") in my dataframe and stores the values "5, 4, 100" into a new column (call it "est_size").
Ideally, my new column will look like this:
df$est_size[1] <- "around 5"
df$est_size[2] <- "nearly 4"
df$est_size[3] <- "about 100"
df$est_size[4] <- "hundreds"
df$est_size[5] <- "no report"
Code that I have tried / am stuck on:
stringr::str_extract(notes[1], "\\w[size=]\\d")
but all I get back is "size=" and not the value after it.
Thank you in advance for helping!

We can use a regex lookaround to match one or more characters that are not a closing square bracket (]) after the size=.
library(dplyr)
library(stringr)
df <- df %>%
  mutate(est_size = trimws(str_extract(notes, '(?<=size=)[^\\]]+')))
Output
df
#  notes est_size
#1 On 5 June, some people walked down the street in this area. [size=around 5] around 5
#2 On 6 June, some people rode bikes down the street in this area. [size= nearly 4] nearly 4
#3 On 7 June, some people walked into a grocery store in this area. [size= about 100] about 100
#4 On 10 July, an hundreds of people drank water from this fountain [size=hundreds] hundreds
#5 on 15 August, an unreported amount of people drove their cars down the street. [size= no report] no report
Data
df <- structure(list(notes = c("On 5 June, some people walked down the street in this area. [size=around 5]",
"On 6 June, some people rode bikes down the street in this area. [size= nearly 4]",
"On 7 June, some people walked into a grocery store in this area. [size= about 100]",
"On 10 July, an hundreds of people drank water from this fountain [size=hundreds]",
"on 15 August, an unreported amount of people drove their cars down the street. [size= no report]"
)), class = "data.frame", row.names = c(NA, -5L))

Using str_extract:
library(stringr)
trimws(str_extract(df$notes, "(?<=size=)[\\w\\s]+"))
[1] "around 5" "nearly 4" "about 100" "hundreds" "no report"
Here, we use a positive lookbehind (?<=...) to assert an accompanying pattern for what we want to extract: we want the alphanumeric string(s) that follow after size=, so we put size= into the lookbehind expression and extract whatever alphanumeric chars (\\w) and whitespace chars (\\s) (but not special chars such as ]!) come after it.
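For comparison, the same lookbehind also works in base R without stringr; here is a minimal sketch, assuming the df from the Data section above:
# base R equivalent: perl = TRUE enables the (?<=...) lookbehind
m <- regmatches(df$notes, regexpr("(?<=size=)[^\\]]+", df$notes, perl = TRUE))
trimws(m)
#[1] "around 5"  "nearly 4"  "about 100" "hundreds"  "no report"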

Related

How to reformat similar text for merging in R?

I am working with the NYC open data, and I want to merge two data frames based on community board. The issue is that the two data frames have slightly different ways of representing this. I have provided an example of the two different formats below.
CommunityBoards <- data.frame(
  FormatOne = c("01 BRONX", "05 QUEENS", "15 BROOKLYN", "03 STATEN ISLAND"),
  FormatTwo = c("BRONX COMMUNITY BOARD #1", "QUEENS COMMUNITY BOARD #5",
                "BROOKLYN COMMUNITY BOARD #15", "STATEN ISLAND COMMUNITY BD #3"))
Along with the issue of the placement of the numbers and the "#", the second data frame shortens "COMMUNITY BOARD" to "COMMUNITY BD" just for Staten Island. I don't have a strong preference for what the string looks like, so long as I can discern the borough and community board number. What would be the easiest way to reformat one or both of these strings so I could merge the two sets?
Thank you for any and all help!
You can use regex to get out just the district numbers. For the first format, the only thing that matters is the beginning of the string before the space, hence you could do
districtsNrs1 <- as.numeric(gsub("(\\d+) .*","\\1",CommunityBoards$FormatOne))
For the second, I assume that the formats look like "something #number", hence you could do
districtsNrs2 <- as.numeric(gsub(".* #(\\d+)","\\1",CommunityBoards$FormatTwo))
to get the pure district numbers.
Now you know how to extract the district numbers. With that information, you can name/reformat the district-names how you want.
To know which district number is which district, you can create a translation data.frame between the districts and numbers like
districtNumberTranslations <- data.frame(
  districtNumber = districtsNrs2,
  districtName = sapply(strsplit(CommunityBoards$FormatTwo, " COMMUNITY "), "[[", 1)
)
giving
# districtNumber districtName
#1 1 BRONX
#2 5 QUEENS
#3 15 BROOKLYN
#4 3 STATEN ISLAND
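To actually carry out the merge, one approach (a sketch using the CommunityBoards frame above; the "BOROUGH NUMBER" key format is an arbitrary choice) is to build the same key from both columns:
# normalize both formats to a common "BOROUGH NUMBER" key
key1 <- paste(sub("^\\d+ ", "", CommunityBoards$FormatOne),
              as.numeric(sub("^(\\d+) .*", "\\1", CommunityBoards$FormatOne)))
key2 <- paste(sapply(strsplit(CommunityBoards$FormatTwo, " COMMUNITY "), "[[", 1),
              as.numeric(sub(".* #(\\d+)$", "\\1", CommunityBoards$FormatTwo)))
identical(key1, key2)
#[1] TRUE
# with each data frame carrying its key column, merge(df1, df2, by = "key") lines up the rows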

Calculate similarity within a dataframe across specific rows (R)

I have a dataframe that looks something like this:
df <- data.frame("index" = 1:10, "title" = c("Sherlock","Peaky Blinders","Eastenders","BBC News", "Antiques Roadshow","Eastenders","BBC News","Casualty", "Dragons Den","Peaky Blinders"), "date" = c("01/01/20","01/01/20","01/01/20","01/01/20","01/01/20","02/01/20","02/01/20","02/01/20","02/01/20","02/01/20"))
The output looks like this:
Index Title Date
1 Sherlock 01/01/20
2 Peaky Blinders 01/01/20
3 Eastenders 01/01/20
4 BBC News 01/01/20
5 Antiques Roadshow 01/01/20
6 Eastenders 02/01/20
7 BBC News 02/01/20
8 Casualty 02/01/20
9 Dragons Den 02/01/20
10 Peaky Blinders 02/01/20
I want to be able to determine the number of times that a title appears on different dates. In the example above, "BBC News", "Peaky Blinders" and "Eastenders" all appear on 01/01/20 and 02/01/20. The similarity between the two dates is therefore 60% (3 out of 5 titles are identical across both dates).
It's probably also worth mentioning that the actual dataframe is much larger, and has 120 titles per day, and spans some 700 days. I need to compare the "titles" of each "date" with the previous "date" and then calculate their similarity. So to be clear, I need to determine the similarity of 01/01/20 with 02/01/20, 02/01/20 with 03/01/20, 03/01/20 with 04/01/20, and so on...
Does anyone have any idea how I might go about doing this? My eventual aim is to use Tableau to visualise similarity/difference over time, but I fear that such a calculation would be too complicated for that particular software and I'll have to somehow add it into the actual data itself.
Here is another possibility. You can create a simple function to calculate the similarity or another index between groups. Then split your data frame by date into a list and lapply the custom function to each element of the list (the final result will be a list).
calc_similar <- function(i) {
  # share of the previous date's titles that also appear on date i
  sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])
}
s <- split(df$title, df$date)
setNames(lapply(seq_along(s)[-1], calc_similar), names(s)[-1])
Output
$`02/01/20`
[1] 0.6
I have come up with this solution. However, I'm unsure about how it will work when the number of records per day is different (i.e. you have 8 titles for day n and 15 titles for day n+1). I guess you would like to normalize with respect to the day with more records. Anyway, here it comes:
divide <- split.data.frame(df, as.factor(df$date))
similarity <- vector()
for (i in 1:(length(divide) - 1)) {
  # overlap between consecutive days, normalized by the larger day
  index <- sum(divide[[i]]$title %in% divide[[i + 1]]$title) /
    max(length(divide[[i]]$title), length(divide[[i + 1]]$title))
  similarity <- c(similarity, index)
}
similarity
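If the end goal is a table to feed into Tableau, a small variation on the first answer collects each consecutive-date comparison into a data.frame (a sketch using the df above; the file name is illustrative, and for a real 700-day dataset you would convert df$date with as.Date(df$date, "%d/%m/%y") first so the dates split in chronological order):
s <- split(df$title, df$date)  # one vector of titles per date
sim_df <- data.frame(
  date = names(s)[-1],
  # share of the previous day's titles that reappear on each day
  similarity = sapply(seq_along(s)[-1],
                      function(i) sum(s[[i]] %in% s[[i - 1]]) / length(s[[i - 1]])))
sim_df
#       date similarity
#1 02/01/20        0.6
# write.csv(sim_df, "similarity_by_date.csv", row.names = FALSE)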

Extracting N number of matches from a text string in R?

I am using stringr in R, and I have a string of text that lists titles of news articles. I want to extract these titles, but only the first N-number of titles that appear. In my example string of text, I have three article titles, but I only want to extract the first two.
How can I tell str_extract to only collect the first 2 titles? Thank you.
Here is my current code with the example texts.
library(stringr)
Here is the example text.
texting <- ("Time: Friday, September 14, 2018 4:34:00 PM EDT\r\nJob Number: 73591483\r\nDocuments (100)\r\n 1. U.S. Stocks Rebound Slightly After Tech-Driven Slump\r\n Client/Matter: -None-\r\n Search Terms: trade war or US-China trade or china tariff and not dealbook\r\n Search Type: Terms and Connectors\r\n Narrowed by:\r\n Content Type Narrowed by\r\n News Sources: The New York Times; Content Type: News;\r\n Timeline: Jan 01, 2018 to Dec 31, 2018\r\n 2. Shifting Strategy on Tariffs\r\n Client/Matter: -None-\r\n Search Terms: trade war or US-China trade or china tariff and not dealbook\r\n 100. Example")
titles.1 <- str_extract_all(texting, "\\d+\\.\\s.+")
titles.1
The current code brings back all three matches in the string:
[[1]]
[1] "1. U.S. Stocks Rebound Slightly After Tech-Driven Slump"
[2] "2. Shifting Strategy on Tariffs"
[3] "100. Example"
I only want it to collect the first two matches.
You can use the option simplify = TRUE to get a matrix as the result rather than a list; it can be indexed like a vector, so just pick the first N elements:
titles.1 <- str_extract_all(texting, "\\d+\\.\\s.+", simplify = TRUE)[1:2]
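Alternatively, keep the full list result and take the first N elements with head() (a sketch using the same texting string):
titles.1 <- head(str_extract_all(texting, "\\d+\\.\\s.+")[[1]], 2)
titles.1
#[1] "1. U.S. Stocks Rebound Slightly After Tech-Driven Slump"
#[2] "2. Shifting Strategy on Tariffs"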

Row-wise count the number of words in a review text in an R dataframe

I want to count the number of words in each row:
Review_ID Review_Date Review_Content Listing_Title Star Hotel_Name
1 1/25/2016 I booked both the Crosby and Four Seasons but decided to cancel the Four Seasons closer to the arrival date based on reviews. Glad I did. The Crosby is an outstanding hotel. The rooms are immaculate and luxurious, with real attention to detail and none of the bland furnishings you find in even the top chain hotels. Staff on the whole were extremely attentive and seemed to enjoy being there. Breakfast was superb and facilities at ground level gave an intimate and exclusive feel to the hotel. It's a fairly expensive place to stay but is one of those hotels where you feel you're getting what you pay for, helped by an excellent location. Hope to be back! Outstanding 5 Crosby Street Hotel
2 1/18/2016 We've stayed many times at the Crosby Street Hotel and always have an incredible, flawless experience! The staff couldn't be more accommodating, the housekeeping is immaculate, the location's awesome and the rooms are the coolest combination of luxury and chic. During our most recent trip over The New Years holiday, we stayed in the stunning Crosby Suite which has the most extraordinary, gorgeous decor. The Crosby remains our absolute favorite in NYC. Can't wait to return! Always perfect! 5 Crosby Street Hotel
I was thinking something like:
WordFreqRowWise %>%
  rowwise() %>%
  summarise(n = n())
To get results something like this:
Review_ID Review_Content total_Words Min_occrd_word Max Average
1 .... 230 great: 1 the: 25 total_unique/total_words in the row
But I have no idea how to do it.
Here is a method in base R using strsplit and sapply. Let's say the data is stored in a data.frame df and the reviews are stored in the variable Review_Content
# break up the strings in each row by " "
temp <- strsplit(df$Review_Content, split=" ")
# count the number of words as the length of the vectors
df$wordCount <- sapply(temp, length)
In this instance, sapply will return a vector of the counts for each row.
Since the word count is now an object, you can perform whatever analysis you want on it. Here are some examples:
summarize the distribution of word counts: summary(df$wordCount)
maximum word count: max(df$wordCount)
mean word count: mean(df$wordCount)
range of word counts: range(df$wordCount)
interquartile range of word counts: IQR(df$wordCount)
Adding to @lmo's answer above, the code below will generate a dataframe that consists of all the words, row-wise, and their frequencies:
temp2 <- data.frame()
for (i in seq_along(temp)) {
  # frequency table of the words in row i, tagged with the row it came from
  temp1 <- as.data.frame(table(temp[[i]]))
  temp1$ID <- paste0("Row_", i)
  temp2 <- rbind(temp2, temp1)
  temp1 <- NULL
}
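Building on both answers, here is a rough sketch of the remaining summary columns from the expected output, reusing temp from above (the column names mirror the question; the exact definitions are assumptions):
df$total_Words <- sapply(temp, length)
stats <- lapply(temp, function(w) {
  tab <- table(w)
  data.frame(Min_occrd_word = paste0(names(tab)[which.min(tab)], ": ", min(tab)),
             Max            = paste0(names(tab)[which.max(tab)], ": ", max(tab)),
             Average        = length(tab) / length(w))  # total_unique / total_words
})
df <- cbind(df, do.call(rbind, stats))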

How to create a column and replace value

Question
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
Expected output
1 An artist impression of a star system is responsible for a nova.
1 The team from university of VYU focus on a class of compounds.
1 The young people was seen enjoying the football match.
2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful.
2 Heart attacks more due to nurture than nature.
2 SA footballer Senzo Meyiwa shot dead to save girlfriend
The data is in CSV format and has around 1,000 data points; the numbers are in column (1) and the sentences are in column (2). I need to split the strings and retain the row number for each sentence. I need your help to build the R code.
Note: Number and the sentence are two different columns
I have tried this code to split the strings, but I need code for the row index:
x$qwerty <- as.character(x$qwerty)
sa <- list(strsplit(x$qwerty, ".", fixed = TRUE))[[1]]
s <- unlist(sa)
write.csv(s, "C:\\Users\\Suhas\\Desktop\\out23.csv")
One inconvenience of vectorization in R is that it operates from "inside" the vector. That is, it operates on the elements themselves rather than on the elements in the context of the vector, so the user loses the innate ability to keep track of the index, i.e. where the element being operated on was located in the original object.
The workaround is to generate the index separately. This is easy to achieve with seq_along, which is an optimized version of 1:length(qwerty). Then you can just paste the index and the results together. In your case, you'll obviously want to do the pasting before you unlist.
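A minimal sketch of that idea, assuming the two-column data frame x from the question with the text in x$qwerty (the row position stands in for the index here; substitute the number column if that is what you need):
sa <- strsplit(as.character(x$qwerty), ".", fixed = TRUE)
# pair each row index with its sentences, paste, then unlist
res <- unlist(Map(function(i, sentences) paste(i, trimws(sentences)),
                  seq_along(sa), sa), use.names = FALSE)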
If your dataset is as shown above, maybe this helps. You can read it from a file with readLines("file.txt").
lines <- readLines(n=7)
1
An artist impression of a star system is responsible for a nova. The team from university of VYU focus on a class of compounds. The young people was seen enjoying the football match.
2
Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful. Heart attacks more due to nurture than nature. SA footballer Senzo Meyiwa shot dead to save girlfriend
lines1 <- lines[lines != '']
# split at the boundary following a sentence-ending period
lines2 <- unlist(strsplit(lines1, '(?<=\\.)(\\b| )', perl = TRUE))
# flag the elements that are just the row numbers
indx <- grepl("^\\d+$", lines2)
res <- unlist(lapply(split(lines2, cumsum(indx)),
                     function(x) paste(x[1], x[-1])), use.names = FALSE)
res
#[1] "1 An artist impression of a star system is responsible for a nova."
#[2] "1 The team from university of VYU focus on a class of compounds."
#[3] "1 The young people was seen enjoying the football match."
#[4] "2 Scientists have made a breakthrough and solved a decades-old mystery by revealing how a powerful."
#[5] "2 Heart attacks more due to nurture than nature."
#[6] "2 SA footballer Senzo Meyiwa shot dead to save girlfriend"
If you want it as a 2-column data.frame:
dat <- data.frame(id = rep(lines2[indx],
                           diff(c(which(indx), length(indx) + 1)) - 1),
                  Col1 = lines2[!indx], stringsAsFactors = FALSE)
head(dat,2)
# id Col1
#1 1 An artist impression of a star system is responsible for a nova.
#2 1 The team from university of VYU focus on a class of compounds.
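If the numbers and sentences are already two columns of one data frame, tidyr offers a more direct route (a sketch; the column name qwerty is carried over from the question, and the sep pattern splits after each sentence-ending period):
library(tidyr)
# one row per sentence, repeating the row's number alongside it
dat2 <- separate_rows(x, qwerty, sep = "(?<=\\.) ")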
