Removing all sentences that begin with a specific word - r

I have a dataset with a "Notes" column, which I'm trying to clean up with R. The notes look something like this:
Collected for 2 man-hours total. Cloudy, imminent storms.
Collected for 2 man-hours total. Rainy.
Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.
...and so on.
I want to remove all sentences that start with "Collected" but not any of the sentences that follow. The number of sentences that follow varies, from 0 to 4. I tried removing every combination of Collected + (last word of the sentence), but there are too many combinations. Removing everything from Collected to a [.] removes all the subsequent sentences too. Does anyone have any suggestions? Thank you in advance.

An option using gsub:
gsub("^Collected[^.]*\\. ","",df$Notes)
# [1] "Cloudy, imminent storms."
# [2] "Rainy."
# [3] "Sunny."
Regex explanation:
- `^Collected` : Starts with `Collected`
- `[^.]*` : Followed by anything other than `.`
- `\\. ` : The match ends with a literal `.` followed by a space.
Replace such matches with "".
Data:
df <- read.table(text =
"Notes
'Collected for 2 man-hours total. Cloudy, imminent storms.'
'Collected for 2 man-hours total. Rainy.'
'Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny.'",
header = TRUE, stringsAsFactors = FALSE)

a = "Collected 30 min w/2 staff for a total of 1 man-hour of sampling. Sunny."
sub("^ ","",sub("Collected.*?\\.","",a))
> [1] "Sunny."
Or if you know that there will be a space after the period:
sub("Collected.*?\\. ","",a)

Related

Replace more than one word in a column with R

I'm trying to change all the names containing the word stocker in Job.tittle into a new column, job.title.2.
I tried to use gsub() but didn't get the expected result.
My data.frame looks like this:
x <- data.frame(Job.tittle = c("DW Overnight Stockers", "Checkers", "TH Stockers", "CM Midland Stockers"), Head.counts = c(100, 50, 100, 200))
Thank you.
I tried this: x$job.tittle.2 <- gsub("\bDW Overnight Stockers\w+", "Stocker", x$Job.tittle)
and it did not work.
Here you go. Using regex, this takes any string that contains the word "stocker" or "stockers", in either upper or lower case, anywhere in the string, and replaces the whole string with "Stocker". (A grepl() alternative follows the output below.)
x$job.title.2 <- gsub(".*stockers?.*", "Stocker", x$Job.tittle, ignore.case = TRUE)
x
Job.tittle Head.counts job.title.2
1 DW Overnight Stockers 100 Stocker
2 Checkers 50 Checkers
3 TH Stockers 100 Stocker
4 CM Midland Stockers 200 Stocker
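If you'd rather make the test explicit, a sketch of the same logic with grepl() and ifelse() (has_stocker is a made-up helper variable):
has_stocker <- grepl("stockers?", x$Job.tittle, ignore.case = TRUE)  # TRUE where the title mentions stocker(s)
x$job.title.2 <- ifelse(has_stocker, "Stocker", as.character(x$Job.tittle))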

Extract a 100-Character Window around Keywords in Text Data with R (Quanteda or Tidytext Packages)

This is my first time asking a question on here, so I hope I don't miss any crucial parts. I want to perform sentiment analysis on windows of speeches around certain keywords. My dataset is a large csv file containing a number of speeches, but I'm only interested in the sentiment of the words immediately surrounding certain key words.
I was told that the quanteda package in R would likely be my best bet for finding such a function, but I've been unsuccessful in locating it so far. If anyone knows how to do such a task, it would be greatly appreciated!
Reprex (I hope?) below:
speech = c("This is the first speech. Many words are in this speech, but only few are relevant for my research question. One relevant word, for example, is the word stackoverflow. However there are so many more words that I am not interested in assessing the sentiment of", "This is a second speech, much shorter than the first one. It still includes the word of interest, but at the very end. stackoverflow.", "this is the third speech, and this speech does not include the word of interest so I'm not interested in assessing this speech.")
data <- data.frame(id = 1:3,
                   speechContent = speech)
I'd suggest using tokens_select() with the window argument set to a range of tokens surrounding your target terms.
To take your example, if "stackoverflow" is the target term, and you want to measure sentiment in the +/- 10 tokens around that, then this would work:
library("quanteda")
## Package version: 3.2.1
## Unicode version: 13.0
## ICU version: 69.1
## Parallel computing: 8 of 8 threads used.
## See https://quanteda.io for tutorials and examples.
## [CODE FROM ABOVE]
corp <- corpus(data, text_field = "speechContent")
toks <- tokens(corp) %>%
  tokens_select("stackoverflow", window = 10)
toks
## Tokens consisting of 3 documents and 1 docvar.
## text1 :
## [1] "One" "relevant" "word" ","
## [5] "for" "example" "," "is"
## [9] "the" "word" "stackoverflow" "."
## [ ... and 9 more ]
##
## text2 :
## [1] "word" "of" "interest" ","
## [5] "but" "at" "the" "very"
## [9] "end" "." "stackoverflow" "."
##
## text3 :
## character(0)
There are many ways to compute sentiment from this point. An easy one is to apply a sentiment dictionary, e.g.
tokens_lookup(toks, data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 3 documents, 4 features (91.67% sparse) and 1 docvar.
## features
## docs negative positive neg_positive neg_negative
## text1 0 1 0 0
## text2 0 0 0 0
## text3 0 0 0 0
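From here you could, for example, convert the dfm to a data frame and take a simple net score per document; a minimal sketch (sent and net are made-up names; the feature columns are those shown in the output above):
sent <- convert(dfm(tokens_lookup(toks, data_dictionary_LSD2015)), to = "data.frame")
sent$net <- sent$positive - sent$negative  # crude net sentiment per document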
Using quanteda:
library(quanteda)
corp <- corpus(data, docid_field = "id", text_field = "speechContent")
x <- kwic(tokens(corp, remove_punct = TRUE),
  pattern = "stackoverflow",
  window = 3
)
x
Keyword-in-context with 2 matches.
[1, 29] is the word | stackoverflow | However there are
[2, 24] the very end | stackoverflow |
as.data.frame(x)
docname from to pre keyword post pattern
1 1 29 29 is the word stackoverflow However there are stackoverflow
2 2 24 24 the very end stackoverflow stackoverflow
Now read the help for kwic() (use ?kwic in the console) to see what kinds of patterns you can use. With tokens() you can specify which data cleaning you want to apply before using kwic(); in my example I removed the punctuation.
The end result is a data frame with the window before and after the keyword(s), here a window of length 3. After that you can run some form of sentiment analysis on the pre and post results (or paste them together first, as in the sketch below).
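For instance, a minimal sketch that pastes the pre and post windows into one snippet per match (context is a made-up column name), ready for whatever sentiment scorer you prefer:
windows <- as.data.frame(x)
windows$context <- paste(windows$pre, windows$post)  # rejoin the text around each keyword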

how to clean irregular strings & organize them into a dataframe at right column

I have two long strings that look like this in a vector:
x <- c("Job Information\n\nLocation: \n\n\nScarsdale, New York, 10583-3050, United States \n\n\n\n\n\nJob ID: \n53827738\n\n\nPosted: \nApril 22, 2020\n\n\n\n\nMin Experience: \n3-5 Years\n\n\n\n\nRequired Travel: \n0-10%",
"Job Information\n\nLocation: \n\n\nGlenview, Illinois, 60025, United States \n\n\n\n\n\nJob ID: \n53812433\n\n\nPosted: \nApril 21, 2020\n\n\n\n\nSalary: \n$110,000.00 - $170,000.00 (Yearly Salary)")
and my goal is to organize them neatly into a dataframe (output form), something like this:
#View(df)
Location Job ID Posted Min Experience Required Travel Salary
[1] Scarsdale,... 53827738 April 22... 3-5 Years 0-10% NA
[2] Glenview,... 53812433 April 21... NA NA $110,000.00 - $170,000.00 (Yearly Salary)
(...) marks text shortened to present the dataframe neatly here.
However, as you can see, the two strings don't necessarily have the same attributes. For example, the first string has Min Experience and Required Travel, but on the second string those fields don't exist; it has Salary instead. So this is getting very tricky for me. I thought I would split on the \n characters, but their number isn't fixed: some fields are separated by two newlines, others by 4 or 5. I was wondering if someone could help me out. I would appreciate it!
We can split each string on one or more '\n' ('\n{1,}'). We remove the first element from each (which is 'Job Information') as we don't need it anywhere (x <- x[-1]). The remaining parts come in pairs of the form column name, column value. We create a dataframe from each string using alternating indices, and bind_rows() combines all of them by name.
dplyr::bind_rows(sapply(strsplit(gsub(':', '', x), '\n{1,}'), function(x) {
  x <- x[-1]  # drop the leading "Job Information"
  # odd elements are column names, even elements are values
  setNames(as.data.frame(t(x[c(FALSE, TRUE)])), x[c(TRUE, FALSE)])
}))
# Location Job ID Posted Min Experience
#1 Scarsdale, New York, 10583-3050, United States 53827738 April 22, 2020 3-5 Years
#2 Glenview, Illinois, 60025, United States 53812433 April 21, 2020 <NA>
# Required Travel Salary
#1 0-10% <NA>
#2 <NA> $110,000.00 - $170,000.00 (Yearly Salary)
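If you'd rather avoid dplyr, a base R sketch of the same idea (parts, all_cols, and padded are made-up helper names); it pads each one-row dataframe with the missing columns before rbind():
parts <- lapply(strsplit(gsub(':', '', x), '\n{1,}'), function(p) {
  p <- p[-1]  # drop the leading "Job Information"
  setNames(as.data.frame(t(p[c(FALSE, TRUE)])), p[c(TRUE, FALSE)])
})
all_cols <- Reduce(union, lapply(parts, names))
padded <- lapply(parts, function(d) {
  for (col in setdiff(all_cols, names(d))) d[[col]] <- NA  # add missing fields as NA
  d[all_cols]  # same column order everywhere
})
do.call(rbind, padded)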

Text Processing : extract fixed number of numbers from text

I am trying the following:
gg <- c("delete from below 110 11031133 11 11031135 110",
        "delete froml #10989431 from adfdaf 10888022 <(>&<)> 10888018",
        "this is for the deletion of an incorrect numberss that is no longer used for asd09 and sd040",
        "please delete the following mangoes from trey 10246211 1 10821224 1 10821248 1 10821249",
        "from 11015647 helppp 1 na from 0050 - zfhhhh 10840637 1")
pattern_to_find <- c('\\d{4,}')
aa <- str_extract_all(gg, pattern_to_find)
aa
With this code I am able to extract any numeric pattern with at least a fixed number of digits. But if I want to extract 2-digit numbers, it picks up the first two digits of each longer number as well.
pattern_to_find <- c('\\d{2}')
How can I modify my pattern to work both ways?
Tidyverse solution:
library(tidyverse)
pattern_to_find <- c('\\d{2,}')
aa <- str_extract_all(gg, pattern_to_find)
Base R solution:
base_aa <- regmatches(gg, gregexpr(pattern_to_find, gg))
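If instead you want numbers of exactly two digits (rather than the first two digits of longer numbers), word boundaries help; a sketch:
# \\b stops matches inside longer digit runs, so 11031133 is skipped but a standalone 11 is kept
str_extract_all(gg, '\\b\\d{2}\\b')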

How to count number of words for each row of a column and then convert to numeric?

I have a column in a data frame that lists the amenities found at a hotel location. I need to count how many amenities are in each row and then convert this to a number so I can make another column with these counts.
> airbnb$amenities[1:25]
[1] "{TV,Internet,Wifi,\"Air conditioning\",\"Paid parking off premises\",Breakfast,Heating,\"Smoke detector\",\"Carbon monoxide detector\",\"First aid kit\",\"Safety card\",\"Fire extinguisher\",Essentials,Shampoo,\"Lock on bedroom door\",\"24-hour check-in\",Hangers,\"Hair dryer\",Iron,\"Laptop friendly workspace\",\"translation missing: en.hosting_amenity_49\",\"translation missing: en.hosting_amenity_50\",\"Private entrance\",\"Hot water\",\"Patio or balcony\",\"Garden or backyard\",\"Luggage dropoff allowed\",\"Well-lit path to entrance\",\"Host greets you\"}"
[2] "{TV,Wifi,\"Air conditioning\",Kitchen,\"Pets live on this property\",Cat(s),\"Free street parking\",Heating,Washer,Dryer,\"Smoke detector\",Essentials,Shampoo,Hangers,\"Hair dryer\",Iron,\"Laptop friendly workspace\",\"Hot water\",\"Luggage dropoff allowed\",Other}"
[3] "{TV,\"Cable TV\",Wifi,\"Air conditioning\",Pool,Kitchen,\"Free parking on premises\",Breakfast,Elevator,\"Hot tub\",\"Buzzer/wireless intercom\",Heating,\"Family/kid friendly\",Washer,\"Smoke detector\",\"First aid kit\",Essentials,Shampoo,\"24-hour check-in\",Hangers,\"Hair dryer\",Iron,\"translation missing: en.hosting_amenity_50\"}"
[4] "{Internet,Wifi,Pool,Kitchen,\"Free street parking\",\"Buzzer/wireless intercom\",Heating,\"Smoke detector\",Essentials,Hangers,Iron,\"Hot water\",Microwave,\"Coffee maker\",Refrigerator,\"Dishes and silverware\",\"Cooking basics\",\"BBQ grill\",\"Garden or backyard\",\"Long term stays allowed\",\"Host greets you\"}"
[5] "{TV,Internet,Wifi,\"Air conditioning\",Kitchen,\"Paid parking off premises\",Elevator,\"Buzzer/wireless intercom\",Heating,Washer,Dryer,\"Smoke detector\",\"First aid kit\",\"Safety card\",Essentials,Shampoo,Hangers,\"Hair dryer\",Iron,\"Laptop friendly workspace\",\"Hot water\",Microwave,Refrigerator,Dishwasher,\"Dishes and silverware\",\"Cooking basics\",Oven,Stove,\"Long term stays allowed\",Other}"
I'm familiar with using grep, gsub and the like, but I'm confused as to how to count within each row. I thought that grep('[a-z]', airbnb$amenities) might work to somehow count the patterns in each row, but I'm still confused.
One option is str_count to count the delimiter (,) and then add 1 to get the number of amenities.
library(stringr)
airbnb$amenityCount <- str_count(airbnb$amenities, ",") + 1
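An equivalent base R sketch splits each row on the comma and counts the pieces (both approaches assume the amenity names themselves contain no commas):
airbnb$amenityCount <- lengths(strsplit(airbnb$amenities, ","))  # pieces = commas + 1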
