str_extract() and summarise() give me an NA row in R

This should be pretty straightforward, as I think I'm just looking for verification of what I'm seeing.
I'm trying to use str_extract() to pull areas of interest out of a column in my data frame, and then count how often each word appears. When I do this, though, the data frame I produce has NA listed in one of the rows. This is confusing to me, because I don't know what is causing it or whether it is a sign of an error in my code. I'm not sure how to fix this.
Additionally, note that the last item in words is "the table is light", which contains two of the words of interest in this example. I've done this intentionally because I want to make sure that it will be counted twice.
library(tidyverse)
df <- data.frame(words =c("paper book", "food press", "computer monitor", "my fancy speakers",
"my two dogs", "the old couch", "the new couch", "loud speakers",
"wasted paper", "put the dishes away", "set the table", "put it on the table",
"lets go to church", "turn out the lights", "why are the lights on",
"the table is light"))
keep <- c("dogs|paper|table|light|couch")
new_df <- df %>%
mutate(Subject = str_extract(words, keep), n = n()) %>%
group_by(Subject)%>%
summarise(`Word Count` = length(Subject))
This is what I'm getting now
Subject `Word Count`
<chr> <int>
1 couch 2
2 dogs 1
3 light 2
4 paper 2
5 table 3
6 NA 6
So my question is: what is causing the NA row in Subject? Is it all the other records?

The NA appears for rows where none of the words in keep occur, so there is nothing to extract.
library(dplyr)
library(stringr)
df %>% mutate(Subject = str_extract(words, keep))
# words Subject
#1 paper book paper
#2 food press <NA>
#3 computer monitor <NA>
#4 my fancy speakers <NA>
#5 my two dogs dogs
#6 the old couch couch
#7 the new couch couch
#8 loud speakers <NA>
#9 wasted paper paper
#10 put the dishes away <NA>
#11 set the table table
#12 put it on the table table
#13 lets go to church <NA>
#14 turn out the lights light
#15 why are the lights on light
#16 the table is light table
For example, the 2nd row, 'food press', contains none of the values from "dogs|paper|table|light|couch", hence str_extract() returns NA for it.
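The question also asks for "the table is light" to be counted twice, but str_extract() only returns the first match in each string (row 16 above comes back as table only). A minimal sketch of one way to count every match, assuming str_extract_all() from stringr and base R's table() are acceptable here; the name counts is just for illustration:

```r
library(stringr)

df <- data.frame(words = c("paper book", "food press", "computer monitor", "my fancy speakers",
                           "my two dogs", "the old couch", "the new couch", "loud speakers",
                           "wasted paper", "put the dishes away", "set the table", "put it on the table",
                           "lets go to church", "turn out the lights", "why are the lights on",
                           "the table is light"))
keep <- "dogs|paper|table|light|couch"

# str_extract_all() returns every match per row (character(0) when there is none)
matches <- str_extract_all(df$words, keep)

# unlist() drops the empty results, so unmatched rows never turn into an NA group
counts <- table(unlist(matches))
counts
```

With this approach the rows without a keyword simply contribute nothing, so no NA row appears, and a row containing two keywords contributes to both counts.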

Related

R Concatenate Across Rows Within Groups but Preserve Sequence

My data consists of text from many dyads that has been split into sentences, one per row. I'd like to concatenate the data by speaker within dyads, essentially converting the data to speaking turns. Here's an example data set:
dyad <- c(1,1,1,1,1,2,2,2,2)
speaker <- c("John", "John", "John", "Paul","John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play",
"We're wasting time",
"Let's make a record!",
"Let's work it out first",
"Why?",
"It goes like this",
"Hold on",
"Have to tighten my snare",
"Ready?")
dat <- data.frame(dyad, speaker, text)
And this is what I'd like the data to look like:
dyad speaker text
1 1 John Let's play. We're wasting time. Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on. Have to tighten my snare
6 2 George Ready?
I've tried grouping by speaker and pasting/collapsing with dplyr, but the concatenation combines all of a speaker's text without preserving speaking turn order. For example, John's last statement ("Why?") winds up with his other text in the output rather than coming after Paul's comment. I also tried checking whether the next speaker (using lead(speaker)) is the same as the current one and then combining, but that only handles adjacent rows, so it misses John's third comment in the example. It seems like it should be simple, but I can't make it happen. It should also be flexible enough to combine any series of consecutive rows by a given speaker.
Thanks in advance
Create another group with rleid (from data.table) and paste the rows in summarise:
library(dplyr)
library(data.table)
library(stringr)
dat %>%
group_by(dyad, grp = rleid(speaker), speaker) %>%
summarise(text = str_c(text, collapse = ' '), .groups = 'drop') %>%
select(-grp)
Output:
# A tibble: 6 × 3
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on Have to tighten my snare
6 2 George Ready?
Not as elegant as akrun's solution, but the helper column does the same job as the rleid function here without the need for an additional package:
library(dplyr)
dat %>%
mutate(helper = (speaker != lag(speaker, 1, default = "xyz")),
helper = cumsum(helper)) %>%
group_by(dyad, helper, speaker) %>%
summarise(text = paste0(text, collapse = " "), .groups = 'drop') %>%
select(-helper)
dyad speaker text
<dbl> <chr> <chr>
1 1 John Let's play We're wasting time Let's make a record!
2 1 Paul Let's work it out first
3 1 John Why?
4 2 George It goes like this
5 2 Ringo Hold on Have to tighten my snare
6 2 George Ready?
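The run-id idea behind both answers can also be sketched in base R only. This is just an illustration under the assumption that a new turn starts whenever the dyad or the speaker differs from the previous row; the names key, run, first and turns are made up for the example:

```r
dyad <- c(1, 1, 1, 1, 1, 2, 2, 2, 2)
speaker <- c("John", "John", "John", "Paul", "John", "George", "Ringo", "Ringo", "George")
text <- c("Let's play", "We're wasting time", "Let's make a record!",
          "Let's work it out first", "Why?", "It goes like this",
          "Hold on", "Have to tighten my snare", "Ready?")
dat <- data.frame(dyad, speaker, text)

# Run id: increases by one each time the dyad/speaker pair changes from the row before
key <- paste(dat$dyad, dat$speaker)
run <- cumsum(c(TRUE, key[-1] != key[-length(key)]))

# Keep dyad and speaker from the first row of each run, paste the text within each run
first <- !duplicated(run)
turns <- data.frame(
  dyad    = dat$dyad[first],
  speaker = dat$speaker[first],
  text    = as.vector(tapply(dat$text, run, paste, collapse = " "))
)
turns
```

Because run only ever increases from top to bottom, the rows of turns stay in the original speaking order.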

Split strings into utterances and assign same-speaker utterances to columns in dataframe

I have multi-party conversations in strings like this:
convers <- "Peter: Hiya Mary: Hi. How w'z your weekend. Peter: a::hh still got a headache. An' you (.) party a lot? Mary: nuh, you know my kid's sick 'n stuff Peter: yeah i know that's=erm al hamshi: hey guys how's it goin'? Peter: Great! Mary: where've you BEn last week al hamshi: ah y' know, camping with my girl friend."
I also have a vector with the speakers' names:
speakers <- c("Peter", "Mary", "al hamshi")
I'd like to create a dataframe with the utterances by each individual speaker in a separate column. So far I can only do this piecemeal, by addressing each speaker through its index in speakers and then combining the separate results in a list, but what I'd really like to have is a dataframe with a separate column for each speaker:
Peter <- str_extract_all(convers, paste0("(?<=", speakers[1],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
Mary <- str_extract_all(convers, paste0("(?<=", speakers[2],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
al_hamshi <- str_extract_all(convers, paste0("(?<=", speakers[3],":\\s).*?(?=\\s*(?:", paste(speakers, collapse="|"),"):|\\z)"))
df <- list(
Peter = Peter, Mary = Mary , al_hamshi = al_hamshi
)
df
$Peter
$Peter[[1]]
[1] "Hiya" "a::hh still got a headache. An' you (.) party a lot?"
[3] "yeah i know that's=erm" "Great!"
$Mary
$Mary[[1]]
[1] "Hi. How w'z your weekend." "nuh, you know my kid's sick 'n stuff" "where've you BEn last week"
$al_hamshi
$al_hamshi[[1]]
[1] "hey guys how's it goin'?" "ah y' know, camping with my girl friend."
How can I extract the same-speaker utterances not one by one but in one go and how can the results be assigned not to a list but a dataframe?
With a bit of pre-processing, and assuming the names exactly match the speakers in the conversation text, you can do:
# Pattern to use to insert new lines in string
pattern <- paste0("(", paste0(speakers, ":", collapse = "|"), ")")
# Split string by newlines
split_conv <- strsplit(gsub(pattern, "\n\\1", convers), "\n")[[1]][-1]
# Capture speaker and text into data frame
dat <- strcapture("(.*?):(.*)", split_conv, data.frame(speaker = character(), text = character()))
Which gives:
speaker text
1 Peter Hiya
2 Mary Hi. How w'z your weekend.
3 Peter a::hh still got a headache. An' you (.) party a lot?
4 Mary nuh, you know my kid's sick 'n stuff
5 Peter yeah i know that's=erm
6 al hamshi hey guys how's it goin'?
7 Peter Great!
8 Mary where've you BEn last week
9 al hamshi ah y' know, camping with my girl friend.
To get each speaker into their own column:
# Count lines by speaker
dat$cnt <- with(dat, ave(speaker, speaker, FUN = seq_along))
# Reshape and rename
dat <- reshape(dat, idvar = "cnt", timevar = "speaker", direction = "wide")
names(dat) <- sub("text\\.", "", names(dat))
cnt Peter Mary al hamshi
1 1 Hiya Hi. How w'z your weekend. hey guys how's it goin'?
3 2 a::hh still got a headache. An' you (.) party a lot? nuh, you know my kid's sick 'n stuff ah y' know, camping with my girl friend.
5 3 yeah i know that's=erm where've you BEn last week <NA>
7 4 Great! <NA> <NA>
If new lines already exist in your text, choose another character that doesn't occur in it to split the string on.
You can add :\\s to each speaker, as you are already doing, then use gregexpr to find the positions where a speaker starts. Extract these matches with regmatches and remove the previously added :\\s to get the speaker. Call regmatches again, this time with invert = TRUE, to get the utterances. With split, the utterances are grouped by speaker. To bring this into the desired data.frame you have to pad with NA so that all speakers have the same length, done here with [ inside lapply:
x <- gregexpr(paste0(speakers, ":\\s", collapse="|"), convers)
y <- sub(":\\s$", "", regmatches(convers, x)[[1]])
z <- trimws(regmatches(convers, x, TRUE)[[1]][-1])
tt <- split(z, y)
do.call(data.frame, lapply(tt, "[", seq_len(max(lengths(tt)))))
# al.hamshi Mary Peter
#1 hey guys how's it goin'? Hi. How w'z your weekend. Hiya
#2 ah y' know, camping with my girl friend. nuh, you know my kid's sick 'n stuff a::hh still got a headache. An' you (.) party a lot?
#3 <NA> where've you BEn last week yeah i know that's=erm
#4 <NA> <NA> Great!
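Note that split() orders the columns alphabetically and data.frame() turns "al hamshi" into al.hamshi. If the original speaker order and names matter, one possible variation is sketched below. It assumes a shortened, made-up conversation string and uses factor(levels = speakers) and check.names = FALSE, which are not part of the answer above:

```r
# Shortened, hypothetical conversation just for this sketch
convers <- "Peter: Hiya Mary: Hi. Peter: Great! al hamshi: hey guys"
speakers <- c("Peter", "Mary", "al hamshi")

x <- gregexpr(paste0(speakers, ":\\s", collapse = "|"), convers)
y <- sub(":\\s$", "", regmatches(convers, x)[[1]])                 # who speaks
z <- trimws(regmatches(convers, x, invert = TRUE)[[1]][-1])        # what is said

# A factor with explicit levels keeps the columns in the order given in speakers
tt <- split(z, factor(y, levels = speakers))

# Pad every speaker to the same length with NA; check.names = FALSE keeps "al hamshi" as is
padded <- lapply(tt, `[`, seq_len(max(lengths(tt))))
out <- do.call(data.frame, c(padded, check.names = FALSE))
out
```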

Add value in one column based on multiple key words in another column in r

I want to do the following things: if key words "GARAGE", "PARKING", "LOT" exist in column "Name" then I would add value "Parking&Garage" into column "Type".
Here is the dataset:
df<-data.frame(Name=c("GARAGE 1","GARAGE 2", "101 GARAGE","PARKING LOT","CENTRAL PARKING","SCHOOL PARKING 1","CITY HALL"))
The following code works well for me, but is there a neat way to make it shorter? Thanks!
df$Type[grepl("GARAGE", df$Name) |
grepl("PARKING", df$Name) |
grepl("LOT", df$Name)]<-"Parking&Garage"
The regex "or" operator | is your friend here:
df$Type[grepl("GARAGE|PARKING|LOT", df$Name)]<-"Parking&Garage"
You can create a list of keywords to change, create a pattern dynamically and replace the values.
keywords <- c('GARAGE', 'PARKING', 'LOT')
df$Type <- NA
df$Type[grep(paste0(keywords, collapse = '|'), df$Name)] <- "Parking&Garage"
df
# Name Type
#1 GARAGE 1 Parking&Garage
#2 GARAGE 2 Parking&Garage
#3 101 GARAGE Parking&Garage
#4 PARKING LOT Parking&Garage
#5 CENTRAL PARKING Parking&Garage
#6 SCHOOL PARKING 1 Parking&Garage
#7 CITY HALL <NA>
This would be helpful if you need to add more keywords to your list later.
An alternative with the dplyr and stringr packages:
library(stringr)
library(dplyr)
df %>%
dplyr::mutate(TYPE = stringr::str_detect(Name, "GARAGE|PARKING|LOT"),
TYPE = ifelse(TYPE == TRUE, "Parking&Garage", NA_character_))
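One thing to watch with a plain "GARAGE|PARKING|LOT" pattern is that it also matches inside longer words (LOT inside PILOT, for instance). A small sketch of building the pattern with word boundaries, assuming whole-word matching is what is wanted; the names in names_vec are invented for the example:

```r
keywords <- c("GARAGE", "PARKING", "LOT")

# \\b marks a word boundary, so LOT no longer matches inside PILOT
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# Hypothetical names, not from the question's data
names_vec <- c("PARKING LOT", "PILOT SCHOOL", "CITY HALL")

type <- ifelse(grepl(pattern, names_vec, perl = TRUE), "Parking&Garage", NA_character_)
type
```

Here only "PARKING LOT" gets the label; "PILOT SCHOOL" would have been labelled too without the boundaries.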

How to remove columns/rows having empty value?

I have 3 columns.
type <- Tv show, Movie,Movie
title <- Norm of the North: King Sized Adventure,Jandino: Whatever it Takes,Transformers Prime
director <- Richard Finn and Tim Maltby
The 3rd column has only one value(i.e director).
How to remove those rows with empty values?
One way to remove the rows with empty cells is this:
Illustrative data:
df <- data.frame(
type = c("Tv show", "Movie", "Sitcom"),
title = c("Norm of the North", "King Sized Adventure","Whatever it Takes"),
director = c("Richard Finn", "Tim Maltby", "")
)
First, transform empty cells to NA:
df[df==""] <- NA
df
type title director
1 Tv show Norm of the North Richard Finn
2 Movie King Sized Adventure Tim Maltby
3 Sitcom Whatever it Takes <NA>
Then, remove rows with NA using na.omit:
na.omit(df)
type title director
1 Tv show Norm of the North Richard Finn
2 Movie King Sized Adventure Tim Maltby
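If you would rather not overwrite the empty strings in df, the same filtering can be sketched in one step. This assumes that both "" and NA should count as empty; keep_rows and clean are illustrative names:

```r
df <- data.frame(
  type = c("Tv show", "Movie", "Sitcom"),
  title = c("Norm of the North", "King Sized Adventure", "Whatever it Takes"),
  director = c("Richard Finn", "Tim Maltby", "")
)

# TRUE for rows in which no cell is an empty string or NA
keep_rows <- rowSums(df == "" | is.na(df)) == 0

clean <- df[keep_rows, ]
clean
```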

matching text between two different data frames in R

I have the following data in a data frame:
structure(list(`head(ker$text)` = structure(1:6, .Label = c("#_rpg_17 little league travel tourney. These parents about to be wild.",
"#auscricketfan #davidwarner31 yes WI tour is coming soon", "#keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR",
"#NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave",
"Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy",
"Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO"
), class = "factor")), .Names = "head(ker$text)", row.names = c(NA,
-6L), class = "data.frame")
I have another data frame that contains hashtags extracted from the above data frame. It is as follows:
structure(list(destination = c("#topstation", "#destination", "#munnar",
"#Kerala", "#Delhi", "#beach")), .Names = "destination", row.names = c(NA,
6L), class = "data.frame")
I want to create a new column in my first data frame that contains only the tags matched with the second data frame. For example, the first line of df1 has no matching hashtags, so this cell in the new column will be blank. However, the third line contains five hashtags, three of which match the second data frame. I have tried using the str_match and str_extract functions. I came very close to getting this using code given in one of the posts here.
new_col <- ker[unlist(lapply(destn$destination, agrep, ker$text)), ]
While I understand that I am getting a list as output, I am getting an error indicating:
replacement has 1472 rows, data has 644
I have tried setting max.distance to different values, and each gave me different errors. Can someone help me with a solution? One alternative I am thinking of is to have each hashtag in a separate column, but I am not sure whether that will help me in analysing the data further with the other variables that I have. The output I am looking for is as follows:
text new_col new_col2 new_col3
statement1
statement2
statement3 #destination #munnar #topstation
statement4
statement5 #Kerala
statement6 #Kerala
library(stringi);
df1$tags <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) paste(x[x%in%df2[[1]]],collapse=','));
df1;
## head(ker$text) tags
## 1 #_rpg_17 little league travel tourney. These parents about to be wild.
## 2 #auscricketfan #davidwarner31 yes WI tour is coming soon
## 3 #keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination,#munnar,#topstation
## 4 #NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave
## 5 Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy #Kerala
## 6 Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO #Kerala
Edit: If you want a separate column for each tag:
library(stringi);
m <- sapply(stri_extract_all(df1[[1]],regex='#\\w+'),function(x) x[x%in%df2[[1]]]);
df1 <- cbind(df1,do.call(rbind,lapply(m,`[`,1:max(sapply(m,length)))));
df1;
## head(ker$text) 1 2 3
## 1 #_rpg_17 little league travel tourney. These parents about to be wild. <NA> <NA> <NA>
## 2 #auscricketfan #davidwarner31 yes WI tour is coming soon <NA> <NA> <NA>
## 3 #keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR #destination #munnar #topstation
## 4 #NWAWhatsup tour of duty in NWA considered a dismal assignment? Companies send in their best ppl and then those ppl don't want to leave <NA> <NA> <NA>
## 5 Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.Visit:http://t.co/zFCoaoqCMP http://t.co/zaGNd0aOBy #Kerala <NA> <NA>
## 6 Are you Looking for a trip to Kerala? #Kerala, God's own country, is one of the prime tourist attractions of... http://t.co/FLZrEo7NpO #Kerala <NA> <NA>
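For anyone without stringi, the same extract-then-filter idea can be sketched with base R's gregexpr() and regmatches(). The vectors texts and wanted below are cut-down stand-ins for df1[[1]] and df2[[1]], so treat this as an illustration rather than a drop-in replacement:

```r
# Cut-down stand-ins for df1[[1]] and df2[[1]]
texts <- c("#keralatourism #favourite #destination #munnar #topstation https://t.co/sm9qz7Z9aR",
           "#auscricketfan #davidwarner31 yes WI tour is coming soon",
           "Are you Looking for a trip to Kerala? #Kerala-prime tourist attractions of India.")
wanted <- c("#topstation", "#destination", "#munnar", "#Kerala", "#Delhi", "#beach")

# All hashtags per text: '#' followed by one or more word characters
found <- regmatches(texts, gregexpr("#\\w+", texts))

# Keep only the wanted tags and collapse them into one string per text
tags <- vapply(found, function(x) paste(x[x %in% wanted], collapse = ","), character(1))
tags
```

Texts with no matching tag end up as an empty string, which corresponds to the blank cells in the desired output.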
You could do something like this:
library(stringr)
results <- sapply(df$`head(ker$text)`,
function(x) { str_match_all(x, paste(df2$destination, collapse = "|")) })
df$matches <- results
If you want to separate the results out, you can use:
df <- cbind(df, do.call(rbind, lapply(results, `[`, 1:max(sapply(results, length)))))
