R: Counting frequency of words from a predefined dictionary

I have a very large dataset: one column contains names, and the second column contains their respective (very long) texts. I also have a predefined dictionary that contains at least 20 terms. How can I count the number of times these keywords occur in each row of my dataframe? I have tried str_detect, grep/grepl, and %like%, and looped over each row, but the problem seems to be that I want to detect too many terms: these functions stop working once I use roughly 15 or more terms.
Would be sooo happy if anyone could help me out with this!
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me") #but my actual dictionary is much larger

Create a unique identifier for your rows. Split col2 into words, one per row. Filter for only the words in your dict. Then count by row. Finally, join back to the original df and set NA counts to zero for rows that contain no words from your dict.
library(dplyr)
col1 <- c("A","B","A")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)
dict <- c("groceries", "going", "me")
# add a row id so the counts can be joined back later
df <- df %>%
  mutate(row = row_number()) %>%
  select(row, everything())

# one word per row, keep only dictionary words, count per original row
counts <- df %>%
  tidyr::separate_rows(col2) %>%
  filter(col2 %in% dict) %>%
  count(row, name = "counts")

# rows with no dictionary words get NA from the join; make them 0
final <- left_join(df, counts, by = "row") %>%
  tidyr::replace_na(list(counts = 0L))
final
#>   row col1                        col2 counts
#> 1   1    A I am going to get groceries      2
#> 2   2    B        He called me at six.      1
#> 3   3    A              No, he did not      0
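One thing to watch: separate_rows() splits on its default separator, "[^[:alnum:].]+", which keeps trailing periods attached ("six." stays one token). If that matters for your dictionary, a hedged tweak is to pass an explicit sep that splits on any non-alphanumeric run:
counts <- df %>%
  tidyr::separate_rows(col2, sep = "[^[:alnum:]]+") %>%  # "six." becomes "six"
  filter(col2 %in% dict) %>%
  count(row, name = "counts")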

Here is a base R option using gregexpr
# one combined alternation pattern; gregexpr returns -1 when there is
# no match, hence sum(x > 0) to count only real matches
dfout <- within(
  df,
  counts <- sapply(
    gregexpr(paste0(dict, collapse = "|"), col2),
    function(x) sum(x > 0)
  )
)
or
dfout <- within(
  df,
  counts <- sapply(
    # split each text into words, then count how many are in dict
    regmatches(col2, gregexpr("\\w+", col2)),
    function(v) sum(v %in% dict)
  )
)
which gives
> dfout
  col1                        col2 counts
1    1 I am going to get groceries      2
2    2        He called me at six.      1
3    3              No, he did not      0
Data
structure(list(col1 = 1:3, col2 = c("I am going to get groceries",
"He called me at six.", "No, he did not")), class = "data.frame", row.names = c(NA,
-3L))
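One caveat with the first option: the alternation pattern matches substrings, so a term like "me" is also counted inside "men" or "time". A hedged variant that anchors each term at word boundaries (assuming your terms contain no regex metacharacters):
# \\b word boundaries stop "me" from matching inside "time"
pattern <- paste0("\\b(", paste(dict, collapse = "|"), ")\\b")
dfout <- within(
  df,
  counts <- sapply(
    gregexpr(pattern, col2),
    function(x) sum(x > 0)
  )
)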

I think my solution gives you the output you want: for each word in your dict vector, you can see how many times it appears in each sentence. Each row of df$col2 is a sentence, and dict is the vector of terms you are looking to match. We loop over that vector and, for each term, count how many times it appears in each row/sentence using stringr::str_count. Note the argument order: str_count(string being searched, pattern to match).
str_count returns a vector showing how many times the word appears in each row. I build a data frame from these vectors, with one row per entry in dict, and then cbind dict to it so you can see how many times each word is used in each sentence. The column names are adjusted at the end so you can match words to sentence numbers. Note that if you want to calculate row means, you'll need to drop the dict column of the final data frame first, because it is character.
library(stringr)
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he
did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me")
word_matches <- data.frame()
for (i in dict) {
  # how many times does term i appear in each sentence?
  word_tot <- str_count(df$col2, i)
  word_matches <- rbind(word_matches, word_tot)
}
word_matches
colnames(word_matches) <- paste("Sentence", 1:ncol(word_matches))
cbind(dict,word_matches)
       dict Sentence 1 Sentence 2 Sentence 3
1 groceries          1          0          0
2     going          1          0          0
3        me          0          1          0
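As a follow-up, the loop can be replaced with a single sapply() call that returns the whole sentence-by-word matrix at once; a minimal sketch, with word boundaries added so terms such as "me" only match whole words:
# rows = sentences, columns = dictionary terms
word_matrix <- sapply(dict, function(w) str_count(df$col2, paste0("\\b", w, "\\b")))
word_matrix
#      groceries going me
# [1,]         1     1  0
# [2,]         0     0  1
# [3,]         0     0  0
rowSums(word_matrix)  # total dictionary hits per sentence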

Related

How to find an exact set of strings in a column of varied strings in R dataframe?

I'm looking to match an exact set of strings in an R dataframe column containing strings.
Here's the format in which I have my reference strings, which will be stored in the variable splitval:
library(gsubfn)
#Splitting each rule into its individual parameter elements
str <- strsplit(gsub("\\,\\+"," +", gsub("=>","", gsubfn(".", list("{" = "", "}" = ""), gsub("corpsi", "+corpsi", "{dog} => {pet}")))), split='+', fixed=TRUE)
parameters <- data.frame(do.call(rbind, str)) #Creating a df of the split parameters
parameters <- data.frame(t(parameters))
parameters <- parameters[parameters$t.parameters.!="",]
parameters <- trimws(parameters, "r")
#Applying filter on all the parameters of a single rule row
splitval = strsplit(parameters[1],split=' ', fixed=TRUE)
splitval = lapply(list(splitval[[1]]), function(z){ z[z != ""]}) #Eliminating the "" instances
So now, splitval has the following value:
[[1]]
[1] "dog" "pet"
Now my objective is to filter out all those row entries of the following dataframe where the string column's entries have both the exact words dog and pet.
Note: it should not match strings containing phrases like doganimal pets or dogsareanimals and petssss
This is my dataframe:
df <- data.frame(Srno = 1:5, Description = c("dog is my pet", "doganimal pets country", "my pet is my dog", "dogsareanimals and petssss", "a pet dog is great"))
Which looks like this:
  Srno                Description
1    1              dog is my pet
2    2     doganimal pets country
3    3           my pet is my dog
4    4 dogsareanimals and petssss
5    5         a pet dog is great
Hence, I need only rows 1, 3 & 5 in my extract, since only these contain the exact words "dog" and "pet" together (in no specific order).
But when I use the following code, I get all the rows of the dataframe, since every string contains both reference keywords; grep is not serving the intended purpose.
extract_df <- df[(grep(splitval[[1]][1], df$Description)),]
for(k in 2:length(splitval[[1]]))
{
extract_df <- extract_df[(grep(splitval[[1]][k], df$Description)),]
}
So can anyone help me to get only rows 1,3 & 5 in the output extracted dataframe?
Assuming that splitval can contain any number of words, not always two fixed ones, you can split each string into words and select the rows that contain all the words in vec.
In base R you can do this as:
vec <- splitval[[1]]
#For this case
#vec <- c("dog", "pet")
subset(df, sapply(strsplit(df$Description, '\\s+'), function(x) all(vec %in% x)))
#  Srno        Description
#1    1      dog is my pet
#3    3   my pet is my dog
#5    5 a pet dog is great
Using tidyverse:
library(tidyverse)
df %>% filter(map_lgl(str_split(Description, '\\s+'), ~ all(vec %in% .x)))
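If you'd rather stay close to the grep() approach from the question, the same exact-word test can be written with word-boundary patterns, one grepl() per word, combined with Reduce(); a hedged base R sketch:
# one logical vector per word; & them together so a row must contain all words
hits <- Reduce(`&`, lapply(vec, function(w)
  grepl(paste0("\\b", w, "\\b"), df$Description)))
df[hits, ]  # rows 1, 3 and 5, as required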

How do I change all the character values of a column that starts with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups; for example, I want to replace all the values of the column that start with "AA" (e.g., "AABC" or "AAUCC") with just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
f2016
#   SECT16
# 1      A
# 2      A
# 3 ABBBBC
# 4  DDDDE
# 5   BABA
This replaces values that have the specified prefix, in this case "AA". An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^", x)
where prefix is not to contain special regular expression characters.
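For example, a quick check of that equivalence on a throwaway vector:
x <- c("AABC", "ABCD")
startsWith(x, "AA")                  # TRUE FALSE
substring(x, 1, nchar("AA")) == "AA" # TRUE FALSE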
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
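On current dplyr (1.0+), mutate_all()/mutate_at() are superseded by across(); a sketch of the same replacement, assuming the columns are character (the R 4.0+ default for data.frame()):
library(dplyr)
library(stringr)

# all columns
df <- df %>% mutate(across(everything(), ~ str_replace(.x, "^AA.+", "A")))

# or only the first column
df <- df %>% mutate(across(1, ~ str_replace(.x, "^AA.+", "A")))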

Selecting/subsetting blank and non-blank rows with a vector [duplicate]

I'd like to use a vector featuring blank ("") and non-blank character strings to subset rows so that I end up with a result like in dfgoal.
I've tried using dplyr::select(), but I get an error message (Error: Strings must match column names. Unknown columns: tooth, , head, foot).
I realise I've got a problem in that I want to keep some "" and get rid of others, but I don't know how to resolve it.
Thanks for any help!
# Data
df <- data.frame(avar=c("tooth","","","head","","foot","",""),bvar=c(1:8))
# Vector
veca <- c("tooth","foot")
vecb <- c("")
vecc <- as.vector(rbind(veca,vecb))
vecc <- unique(vecc)
# Attempt
library(dplyr)
df <- df %>% dplyr::select(vecc)
# Goal
dfgoal <- data.frame(avar=c("tooth","","","foot","",""),bvar=c(1,2,3,6,7,8))
I'm not entirely clear on what you're trying to do. I assume you're asking how to select rows where avar %in% veca, including the subsequent blank ("") rows.
Perhaps something like this using tidyr::fill?
library(tidyverse)
veca <- c("tooth","foot")
df %>%
  mutate(tmp = ifelse(avar == "", NA, as.character(avar))) %>%
  fill(tmp) %>%
  filter(tmp %in% veca) %>%
  select(-tmp)
#   avar bvar
#1 tooth    1
#2          2
#3          3
#4  foot    6
#5          7
#6          8
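The same carry-forward idea also works in base R by indexing the non-blank values with a cumulative count; a minimal sketch, assuming the first row of avar is always non-blank:
# label every row with the most recent non-blank avar, then subset
lab <- as.character(df$avar)
grp <- lab[lab != ""][cumsum(lab != "")]
df[grp %in% veca, ]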

R Count Words in String Only Once

My data looks something like this:
df <- c("I am a car","I will","I have","I give","A bat","A cat")
df <- as.data.frame(df)
colnames(df) <- c("text")
df$count <- str_count(df$text, regex("a{1}?",ignore_case = T))
I want to count just the one instance of 'a' in the first row, not every time that it appears in the whole string. Thanks!
Perhaps we need grepl: it tests whether the pattern occurs at all, so each row contributes at most 1, and the \b word boundaries restrict the match to "a" as a standalone word.
as.integer(grepl("\\ba\\b", df$text, ignore.case=TRUE))
Or using stringr
library(stringr)
as.integer(str_detect(df$text, regex("\\ba\\b", ignore_case = TRUE)))
#[1] 1 0 0 0 1 1

Return duplicates in a list based on 2 criteria

I have a list that contains 2 data sets.
a = data.frame(c(1,1,1,1,1,2,2,2,2,2), c("a","b", "c", "d","e","e","f", "g", "h","i"))
colnames(a) = c("Numbers","Letters")
c = data.frame(c(3,3,3,3,3,4,4,4,4,4), c("q","r", "s", "t","u","u","v", "w", "x","y"))
colnames(c) = c("Numbers","Letters")
my.list = list(a,c)
my.list
I am interested in returning only the letters that are found in common among the unique numbers of each data set. The desired results are given by the following:
new_a = data.frame(c(1,2),c("e","e"))
new_c = data.frame(c(3,4),c("u","u"))
colnames(new_a) = c("Numbers","Letters")
colnames(new_c) = c("Numbers","Letters")
my.new.list = list(new_a,new_c)
my.new.list
As you will see, letter "e" is the only common letter that numbers "1" and "2" share in data set 1 while letter "u" is the only common letter shared by numbers 3 and 4 in data set 2.
I am trying to do this for a very large list. To give you an idea of my true problem: I have a list where each element is a state. Within each state I have multiple asset managers, or "accounts", and each account holds multiple tickers. I am trying to find the tickers that the accounts have in common in each geographical location. In the above example, the numbers would be the accounts, the letters would be the tickers, and the two data sets contained in the list would be two different states. I hope that helps frame my problem.
Thanks!
library(data.table)
a <- as.data.table(a)
a[, if(.N > 1) .SD, by = list(Letters)]
#    Letters Numbers
# 1:       e       1
# 2:       e       2
Explanation: Take table a and group by the column Letters (by = list(Letters)) and return the subset of data for each group (.SD) only when the number of rows (.N) for that group is >1.
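To apply this to every data set in the list at once, the grouped filter can be wrapped in lapply(); a sketch, assuming (as in the example) that each letter occurs at most once per number, so .N > 1 means the letter is shared by more than one number:
library(data.table)
lapply(my.list, function(x)
  as.data.table(x)[, if (.N > 1) .SD, by = Letters])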
We can use Reduce with intersect in base R
lapply(my.list, function(x)
  x[with(x, Letters %in% Reduce(intersect, split(Letters, Numbers))), ])
Or using dplyr
library(dplyr)
lapply(my.list, function(x)
  x %>%
    group_by(Letters) %>%
    filter(n_distinct(Numbers) == 2))
Instead of keeping a list, the data sets can be combined into a single data frame with an additional grouping column, and then the same logic applies:
my.list %>%
  bind_rows(.id = "group") %>%
  group_by(group, Letters) %>%
  filter(n_distinct(Numbers) == 2)
If we don't know the number of unique Numbers in each list element:
my.list %>%
  bind_rows(.id = "group") %>%
  group_by(group) %>%
  mutate(n = n_distinct(Numbers)) %>%
  group_by(Letters, .add = TRUE) %>%
  filter(n_distinct(Numbers) == n) %>%
  select(-n)
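If you need the result back in list form afterwards, one option is dplyr::group_split(); a sketch:
# split the combined result back into one data frame per original data set
my.list %>%
  bind_rows(.id = "group") %>%
  group_by(group, Letters) %>%
  filter(n_distinct(Numbers) == 2) %>%
  group_by(group) %>%
  group_split()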
