R Count Words in String Only Once

My data looks something like this:
df <- c("I am a car","I will","I have","I give","A bat","A cat")
df <- as.data.frame(df)
colnames(df) <- c("text")
df$count <- str_count(df$text, regex("a{1}?",ignore_case = T))
I want to count just the one instance of 'a' in the first row, not every time that it appears in the whole string. Thanks!

Perhaps we need grepl
as.integer(grepl("\\ba\\b", df$text, ignore.case=TRUE))
Or using stringr
library(stringr)
as.integer(str_detect(df$text, regex("\\ba\\b", ignore_case = TRUE)))
#[1] 1 0 0 0 1 1
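To make the difference between counting every match and flagging each row once concrete, here is a minimal base R sketch on the same six strings. It assumes two possible readings of the question (the standalone word "a" versus the letter "a" anywhere in the string); the variable names are illustrative only.

```r
text <- c("I am a car", "I will", "I have", "I give", "A bat", "A cat")

# Reading 1: flag rows that contain the standalone word "a" (case-insensitive).
word_flag <- as.integer(grepl("\\ba\\b", text, ignore.case = TRUE))
word_flag
# [1] 1 0 0 0 1 1

# Reading 2: count every letter "a", then cap the count at one per row.
letter_n <- lengths(regmatches(text, gregexpr("a", text, ignore.case = TRUE)))
letter_n
# [1] 3 0 1 0 2 2
letter_flag <- pmin(letter_n, 1L)
letter_flag
# [1] 1 0 1 0 1 1
```

Either way the idea is the same: turn a per-row count into a per-row yes/no, either by using a detecting function (grepl, str_detect) or by capping the count.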

Related

Remove a number of character from string in a column

I have a data frame with a column of strings and I would like to remove the first three characters in each of the strings. As in the following example:
From this:
df <- data_frame(col1 = c('01_A','02_B', '03_C'))
To this:
df <- data_frame(col1 = c('A','B', 'C'))
I have been trying to use the dplyr transmute function but I can't really get it to work.
Any help would be super appreciated!
I think this will work:
library(dplyr)
library(stringr)
df %>%
mutate(col1 = str_remove(col1, "\\d+(_)"))
col1
1 A
2 B
3 C
We could also use substring from base R, since the OP asked for position-based removal of the first three characters
df$col1 <- substring(df$col1, 4)
df$col1
#[1] "A" "B" "C"
You can use sub like below
> df %>%
+ mutate(col1 = sub("^.{3}", "", col1))
# A tibble: 3 x 1
col1
<chr>
1 A
2 B
3 C
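The answers above agree on the sample data but not in general. A small sketch of where pattern-based and position-based removal diverge, assuming a hypothetical value with a three-digit prefix ("123_C") that is not in the original question:

```r
x <- c("01_A", "02_B", "123_C")  # "123_C" is a made-up edge case

# Pattern-based: drop leading digits plus the underscore, whatever their length.
by_pattern <- sub("^\\d+_", "", x)
by_pattern
# [1] "A" "B" "C"

# Position-based: always drop exactly the first three characters.
by_position <- substring(x, 4)
by_position
# [1] "A"  "B"  "_C"
```

If the prefix is guaranteed to be three characters, the position-based version is the simplest; if the prefix length can vary, the pattern-based version is probably the safer choice.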

R: Counting frequency of words from predefined dictionary

I have a very large dataset that looks like this: one column contains names, the second column contains their respective (very long) texts. I also have a pre-defined dictionary that contains at least 20 terms. How can I count the number of times these key words occur in each row of my dataframe? I have tried str_detect,grep(l), and %>% like, and looped over each row, but the problem seems to be that I want to detect too many terms, and these functions stop working when I use 15+ terms or so.
Would be sooo happy if anyone could help me out with this!
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me") #but my actual dictionary is much larger
Create a unique identifier for your rows. Split your col2 by words, one in each row. Filter for only the select words in your dict. Then count by each row. Finally, combine with original df and set NA to Zeros for rows that don't have any words from your dict.
library(dplyr)
col1 <- c("A","B","A")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)
dict <- c("groceries", "going", "me")
df <- df %>% mutate(row=row_number()) %>% select(row, everything())
counts <- df %>% tidyr::separate_rows(col2) %>% filter(col2 %in% dict) %>% group_by(row) %>% count(name = "counts")
final <- left_join(df, counts, by="row") %>% tidyr::replace_na(list(counts=0L))
final
#> row col1 col2 counts
#> 1 1 A I am going to get groceries 2
#> 2 2 B He called me at six. 1
#> 3 3 A No, he did not 0
Here is a base R option using gregexpr
dfout <- within(
df,
counts <- sapply(
gregexpr(paste0(dict, collapse = "|"), col2),
function(x) sum(x > 0)
)
)
or
dfout <- within(
df,
counts <- sapply(
regmatches(col2, gregexpr("\\w+", col2)),
function(v) sum(v %in% dict)
)
)
which gives
> dfout
col1 col2 counts
1 1 I am going to get groceries 2
2 2 He called me at six. 1
3 3 No, he did not 0
Data
structure(list(col1 = 1:3, col2 = c("I am going to get groceries",
"He called me at six.", "No, he did not")), class = "data.frame", row.names = c(NA,
-3L))
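One caveat with the first gregexpr option: a plain alternation such as "groceries|going|me" also matches inside longer words. The sketch below, which adds a hypothetical fourth sentence to show the effect, wraps the dictionary in word boundaries so only whole words are counted. It assumes the dictionary terms contain no regex special characters.

```r
dict <- c("groceries", "going", "me")
texts <- c("I am going to get groceries",
           "He called me at six.",
           "No, he did not",
           "Some men came home")  # made-up sentence: "me" occurs only inside words

# Loose: the alternation matches substrings as well as whole words.
loose_pat <- paste(dict, collapse = "|")
loose <- sapply(gregexpr(loose_pat, texts), function(m) sum(m > 0))
loose
# [1] 2 1 0 4

# Strict: \\b anchors each alternative at word boundaries.
strict_pat <- paste0("\\b(", loose_pat, ")\\b")
strict <- sapply(gregexpr(strict_pat, texts, perl = TRUE), function(m) sum(m > 0))
strict
# [1] 2 1 0 0
```

Because the whole dictionary is collapsed into one pattern, each text is scanned once regardless of how many terms the dictionary holds.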
I think my solution gives you the output you want - that is for each word in your "dict" list, you can see how many times it appears in each sentence. Each row is an entry in df$col2 i.e. a sentence. "Dict" is your vector of terms that you're looking to match. We can loop over the vector and for each entry in the vector we match how many times that entry appears in each row/sentence using stringr::str_count. Note the syntax for str_count: str_count(string being checked over, expression you're trying to match)
str_count returns a vector showing how many times the word appears in each row. I create a data frame of these vectors which will contain the same number of rows as there are entries in the dict vector. Then you can just cbind "dict" to that data frame and you can see how many times each word is used in each sentence. I adjust the column names at very end so you can match the words to the sentence #'s. Note that if you want to calculate row means you'll need to subset out the "dict" column of the final data frame because it's character.
library(stringr)
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me")
word_matches <- data.frame()
for (i in dict) {
word_tot<-(str_count(df$col2, i))
word_matches <- rbind(word_matches,word_tot)
}
word_matches
colnames(word_matches) <- paste("Sentence", 1:ncol(word_matches))
cbind(dict,word_matches)
dict Sentence 1 Sentence 2 Sentence 3
1 groceries 1 0 0
2 going 1 0 0
3 me 0 1 0
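The same per-word table can be built without growing a data frame inside a loop. This is only a sketch of an alternative, not the answerer's method: it uses base R with sapply over the dictionary and, like str_count with a plain string, counts substring matches.

```r
dict <- c("groceries", "going", "me")
texts <- c("I am going to get groceries", "He called me at six.", "No, he did not")

# One column per dictionary word, one row per sentence.
counts <- sapply(dict, function(w) {
  lengths(regmatches(texts, gregexpr(w, texts, fixed = TRUE)))
})
counts

# Total dictionary hits per sentence.
totals <- rowSums(counts)
totals
# [1] 2 1 0
```

Keeping the result as a numeric matrix means row and column summaries work directly, without first dropping a character column.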

Partial string match in another dataframe in r

Is there a way where I can find all the partial matches from df_2 to df_1?
partial match (if part of DF_1 string is in the whole string of DF_2)
For example, part of "for solution" is in the whole string of "solution"
df_1=data.frame(
DF_1=c("suspension","tablet","for solution","capsule")
)
df_2=data.frame(
index=c("1","2","3","4","5"),
DF_2=c("for suspension", "suspension", "solution", "tablet,ER","tablet,IR")
)
df_out=data.frame(
DF_1=c("suspension","suspension","tablet","tablet","for solution"),
DF_2=c("for suspension", "suspension","tablet,ER","tablet,IR","solution"),
index=c("1","2","4","5","3")
)
We can use fuzzyjoin
library(fuzzyjoin)
regex_left_join(df_2, df_1, by = c("DF_2"= "DF_1"))
Following @akrun's suggestion of using fuzzyjoin:
According to your expected output, you want to join in both directions, and each join should be an inner join.
An exact match is then found twice, once per direction, so the result needs deduplicating (I did it with distinct from dplyr, but any method works).
df_out = distinct(
rbind(
regex_inner_join(df_1, df_2, by = c("DF_1"= "DF_2")),
regex_inner_join(df_2, df_1, by = c("DF_2"= "DF_1"))
)
)
df_out
The output is:
DF_1 index DF_2
1 suspension 2 suspension
2 for solution 3 solution
3 suspension 1 for suspension
4 tablet 4 tablet,ER
5 tablet 5 tablet,IR
This is your expected table, though with the rows and columns in a different order.
Here is a base R option using nested *apply + grepl
df_out <- within(
df_2,
DF_1 <- unlist(sapply(
DF_2,
function(x) {
Filter(
Negate(is.na),
lapply(
df_1$DF_1,
function(y) ifelse(grepl(y, x), y, ifelse(grepl(x, y), x, NA))
)
)
}
), use.names = FALSE)
)
such that
> df_out
index DF_2 DF_1
1 1 for suspension suspension
2 2 suspension suspension
3 3 solution solution
4 4 tablet,ER tablet
5 5 tablet,IR tablet
This sounds like a job for grepl()!
E.g. grepl(value, chars, fixed = TRUE)
Let me quote an example from a different answer:
> chars <- "test"
> value <- "es"
> grepl(value, chars)
[1] TRUE
> chars <- "test"
> value <- "et"
> grepl(value, chars)
[1] FALSE
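Building on the grepl idea, here is a minimal base R sketch that checks every pair of values in both directions and keeps the pairs where one string contains the other. It uses fixed = TRUE so the values are treated as literal text rather than regular expressions; the names pairs, hit and matches are illustrative.

```r
DF_1 <- c("suspension", "tablet", "for solution", "capsule")
DF_2 <- c("for suspension", "suspension", "solution", "tablet,ER", "tablet,IR")

# All combinations of the two vectors.
pairs <- expand.grid(DF_1 = DF_1, DF_2 = DF_2, stringsAsFactors = FALSE)

# A pair is a partial match if either string contains the other.
hit <- unname(mapply(
  function(x, y) grepl(x, y, fixed = TRUE) || grepl(y, x, fixed = TRUE),
  pairs$DF_1, pairs$DF_2
))

matches <- pairs[hit, ]
nrow(matches)
# [1] 5
```

This compares every value with every other value, so it is only practical for small inputs; for large tables a join-based approach such as fuzzyjoin is likely to scale better.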

How do I change all the character values of a column that starts with specific characters?

I have a dataset with millions of observations.
One of the columns of this dataset uses 4 or 5 characters to classify these observations.
My goal is to merge this classification into smaller groups, for example, I want to replace all the values of the column that STARTS with "AA" (e.g., "AABC" or "AAUCC") for just "A". How can I do this?
To illustrate:
Considering that my data is labeled "f2016" and the column that I'm interested in is "SECT16", I've been using the following code to replace values:
f2016$SECT16[f2016$SECT16 == "AABB"] <- "A"
But I cannot do this to all combinations of letters that I have in the dataset. Is there a way that I can do the same replacement holding the first two letters constant?
Here is another base R solution:
f2016$SECT16[startsWith(f2016$SECT16, "AA")] <- "A"
# SECT16
# 1 A
# 2 A
# 3 ABBBBC
# 4 DDDDE
# 5 BABA
This replaces values that start with the specified prefix, in this case AA. An excerpt from help(startsWith):
startsWith() is equivalent to but much faster than
substring(x, 1, nchar(prefix)) == prefix
or also
grepl("^<prefix>", x)
where prefix is not to contain special regular expression characters.
Data
f2016 <- data.frame(SECT16 = c("AAABBB", "AAAAAABBBB", "ABBBBC", "DDDDE", "BABA"), stringsAsFactors = FALSE)
We can use grep/grepl
f2016$SECT16[grep("^AA", f2016$SECT16)] <- "A"
#f2016$SECT16[grepl("^AA", f2016$SECT16)] <- "A"
Consider this dataset
df <- data.frame(A = c("ABCD", "AACD", "DASDD", "AABB"), stringsAsFactors = FALSE)
df
# A
#1 ABCD
#2 AACD
#3 DASDD
#4 AABB
df$A[grep("^AA", df$A)] <- "A"
df
# A
#1 ABCD
#2 A
#3 DASDD
#4 A
You can use stringr and dplyr.
Modify all columns:
df <- df %>% mutate_all(function(x) stringr::str_replace(x, "^AA.+", "A"))
Modify specific columns:
df <- df %>% mutate_at(1, function(x) stringr::str_replace(x, "^AA.+", "A"))
Data
df <- data.frame(SECT16 = c("AABC", "AABB"),
SECT17 = c("AADD", "AAEE"))
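Since the goal is to merge many codes into a few groups, the single-prefix replacement can be extended to several prefixes at once. The sketch below assumes a hypothetical mapping (AA to A, BB to B) and made-up codes; only the startsWith approach itself comes from the answers above.

```r
codes <- c("AABC", "AAUCC", "ABBC", "BBXY", "DDDE")  # made-up sample values

# Hypothetical mapping: names are prefixes, values are the merged group labels.
prefix_map <- c(AA = "A", BB = "B")

recoded <- codes
for (p in names(prefix_map)) {
  # Test against the original codes so an already-replaced value
  # is never matched again by a later prefix.
  recoded[startsWith(codes, p)] <- prefix_map[[p]]
}
recoded
# [1] "A"    "A"    "ABBC" "B"    "DDDE"
```

Values that match none of the prefixes are left unchanged.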

Variable as a column name in data frame

Is there any way to use string stored in variable as a column name in a new data frame? The expected result should be:
col.name <- 'col1'
df <- data.frame(col.name=1:4)
print(df)
# Real output
col.name
1 1
2 2
3 3
4 4
# Expected output
col1
1 1
2 2
3 3
4 4
I'm aware that I can create data frame and then use names() to rename column or use df[, col.name] for existing object, but I'd like to know if there is any other solution which could be used during creating data frame.
You cannot pass a variable into the name of an argument like that.
Instead what you can do is:
df <- data.frame(placeholder_name = 1:4)
names(df)[names(df) == "placeholder_name"] <- col.name
or use the default name, which is "X1.4" for data.frame(1:4):
df <- data.frame(1:4)
names(df)[names(df) == "X1.4"] <- col.name
or assign by position:
df <- data.frame(1:4)
names(df)[1] <- col.name
or if you only have one column just replace the entire names attribute:
df <- data.frame(1:4)
names(df) <- col.name
There's also the set_names function in the magrittr package that you can use to do this last solution in one step:
library(magrittr)
df <- set_names(data.frame(1:4), col.name)
But set_names is just an alias for:
df <- `names<-`(data.frame(1:4), col.name)
which is part of base R. Figuring out why this expression works and makes sense will be a good exercise.
In addition to ssdecontrol's answer, there is a second option.
You're looking for mget. First assign the name to a variable, then the value to the variable that you have previously assigned. After that, mget will evaluate the string and pass it to data.frame.
assign("col.name", "col1")
assign(paste(col.name), 1:4)
df <- data.frame(mget(col.name))
print(df)
col1
1 1
2 2
3 3
4 4
I don't recommend you do this, but:
col.name <- 'col1'
eval(parse(text=paste0('data.frame(', col.name, '=1:4)')))
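For completeness, two base R ways to set the name in the same expression that creates the data frame, without eval(parse()). This is a sketch using stats::setNames; df1 and df2 are illustrative names.

```r
col.name <- "col1"

# Option 1: build the data frame, then name it in the same expression.
df1 <- setNames(data.frame(1:4), col.name)

# Option 2: name a list first, then turn the named list into a data frame.
df2 <- data.frame(setNames(list(1:4), col.name))

names(df1)
# [1] "col1"
names(df2)
# [1] "col1"
```

The second form extends naturally to several columns when both the values and the names are held in variables.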
