I have two data frames: DF1 is a word dictionary and DF2 contains sentences. I want to match text so that if any word in DF1 appears in a DF2 sentence, the output has a column showing yes for a match and no otherwise. The data frames are as follows:
(DF1) word dictionary:
DF1 <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
(DF2) sentences:
DF2 <- c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor")
and output should be:
Customer satisfaction index improvement ( yes)
reduction in retail cycle (no)
Improve market share (yes)
% recovery from vendor (no)
Note: yes and no should be in a separate column showing the result of the text matching.
Can anyone help? Thanks in advance.
You could do it like this:
df <- data.frame(sentence = c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor"))
words <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
# combine the words in a regular expression and bind it as column yes
df <- cbind(df, yes = grepl(paste(words, collapse = "|"), df$sentence))
This outputs
sentence yes
1 Customer satisfaction index improvement TRUE
2 reduction in retail cycle FALSE
3 Improve market share TRUE
4 % recovery from vendor FALSE
See it working on ideone.com.
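One caveat with the pattern above: grepl matches substrings and is case-sensitive, so a sentence containing "marketing" would also count as a match for "market", while "Market" would not. A minimal sketch of a whole-word, case-insensitive variant follows, assuming that is the behaviour wanted. The yes/no labels follow the question; the names sentences, pattern and result are only illustrative.

```r
words <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
sentences <- c("Customer satisfaction index improvement",
               "reduction in retail cycle",
               "Improve market share",
               "% recovery from vendor")

# Wrap the alternation in word boundaries so only whole words match
pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")

# ignore.case = TRUE makes "Market" and "market" equivalent
matched <- grepl(pattern, sentences, ignore.case = TRUE, perl = TRUE)

result <- data.frame(sentence = sentences,
                     match = ifelse(matched, "yes", "no"),
                     stringsAsFactors = FALSE)
result
```

This assumes the dictionary words contain no regular-expression metacharacters; if they might, they would need escaping before being pasted into the pattern.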
Try this:
DF1 <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
DF2 <- c("Customer satisfaction index improvement", "reduction in retail cycle", "Improve market share", "% recovery from vendor")
result <- cbind(DF2, "word found" = ifelse(rowSums(sapply(DF1, grepl, x = DF2)) > 0, "YES", "NO"))
> result
DF2 word found
[1,] "Customer satisfaction index improvement" "YES"
[2,] "reduction in retail cycle" "NO"
[3,] "Improve market share" "YES"
[4,] "% recovery from vendor" "NO"
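If it helps to see which dictionary words triggered each match, the logical matrix that sapply builds can be reused. This is only a sketch on the same data; hits and matched_words are illustrative names, and fixed = TRUE is an assumption that the words should be treated as literal text rather than regular expressions.

```r
DF1 <- c("csi", "dsi", "market", "share", "improvement", "dealers", "increase")
DF2 <- c("Customer satisfaction index improvement",
         "reduction in retail cycle",
         "Improve market share",
         "% recovery from vendor")

# One row per sentence, one column per dictionary word
hits <- sapply(DF1, grepl, x = DF2, fixed = TRUE)

# For each sentence, list the dictionary words that were found
matched_words <- apply(hits, 1, function(row) paste(names(row)[row], collapse = ", "))

result <- data.frame(sentence = DF2,
                     found = ifelse(rowSums(hits) > 0, "YES", "NO"),
                     matched_words = matched_words,
                     stringsAsFactors = FALSE)
result
```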
Related
I have a data set of 50,176 tweets (tweets_data: 50176 obs. of 1 variable). I have also created my own lexicon (formal_lexicon) of around 1 million words, all in a formal language style. I want to write a short piece of code that counts, per tweet, how many words (if any) are also in that lexicon.
tweets_data:
Content
1 "Blablabla"
2 "Hi my name is"
3 "Yes I need"
.
.
.
50176 "TEXT50176"
formal_lexicon:
X
1 "admittedly"
2 "Consequently"
3 "Furthermore"
.
.
.
1000000 "meanwhile"
The output should thus look like:
Content Lexicon
1 "TEXT1" 1
2 "TEXT2" 3
3 "TEXT3" 0
.
.
.
50176 "TEXT50176" 2
Should be a simple for loop like:
for(sentence in tweets_data$Content){
for(word in sentence){
if(word %in% formal_lexicon){
...
}
}
}
I don't think "word" works and I'm not sure how to count in the specific column if a word is in the lexicon. Can anyone help?
structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
You can try something like this:
library(tidytext)
library(dplyr)
# some fake phrases and lexicon
formal_lexicon <- structure(list(X = c("admittedly", "consequently", "conversely", "considerably", "essentially", "furthermore")), row.names = c(NA, 6L), class = "data.frame")
tweets_data <- c("#barackobama Thank you for your incredible grace in leadership and for being an exceptional… ", "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles County Museum of Art", "2017 resolution: to embody authenticity!", "Happy Holidays! Sending love and light to every corner of the earth \U0001f381", "Damn, it's hard to wrap presents when you're drunk. cc #santa", "When my whole fam tryna have a peaceful holiday " )
# put your tweets in a data.frame, with an id per tweet
tweets_data_df <- data.frame(Content = tweets_data, id = 1:length(tweets_data))
tweets_data_df %>%
# split each tweet into one word per row
unnest_tokens(txt, Content) %>%
# flag each word: 1 if it is in the lexicon, 0 otherwise (zeros are kept)
mutate(pres = ifelse(txt %in% formal_lexicon$X, 1, 0)) %>%
# grouping
group_by(id) %>%
# summarise
summarise(cnt = sum(pres)) %>%
# put back the texts
left_join(tweets_data_df ) %>%
# reorder the columns
select(id, Content, cnt)
With result:
Joining, by = "id"
# A tibble: 6 x 3
id Content cnt
<int> <chr> <dbl>
1 1 "#barackobama Thank you for your incredible grace in leadership a~ 0
2 2 "happy 96th gma #fourmoreyears! \U0001f388 # LACMA Los Angeles Co~ 0
3 3 "2017 resolution: to embody authenticity!" 0
4 4 "Happy Holidays! Sending love and light to every corner of the ea~ 0
5 5 "Damn, it's hard to wrap presents when you're drunk. cc #santa" 0
6 6 "When my whole fam tryna have a peaceful holiday " 0
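The sample tweets above contain no lexicon words, so every count is 0. To see non-zero counts, here is a rough base R sketch of the same idea (tokenise, then count with %in%), closer to the loop in the question. The tweets and the three-word lexicon below are made up purely for illustration, and the tokenisation rule (lowercase, split on anything that is not a letter, digit or apostrophe) is an assumption, not what tidytext does internally.

```r
# Hypothetical example data, not from the original post
formal_lexicon <- data.frame(X = c("admittedly", "consequently", "furthermore"),
                             stringsAsFactors = FALSE)
tweets_data <- data.frame(Content = c("Admittedly, this is hard. Furthermore, it is slow",
                                      "Hi my name is",
                                      "consequently consequently"),
                          stringsAsFactors = FALSE)

# Lowercase each tweet and split it into words
tokens <- strsplit(tolower(tweets_data$Content), "[^[:alnum:]']+")

# Count, per tweet, how many tokens appear in the lexicon (repeats count each time)
tweets_data$Lexicon <- vapply(tokens,
                              function(w) sum(w %in% formal_lexicon$X),
                              integer(1))
tweets_data
```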
Hope this is useful for you:
library(magrittr)
library(dplyr)
library(tidytext)
# Data frame with tweets, including an ID
tweets <- data.frame(
id = 1:3,
text = c(
'Hello, this is the first tweet example to your answer',
'I hope that my response help you to do your task',
'If it is tha case, please upvote and mark as the correct answer'
)
)
lexicon <- data.frame(
word = c('hello', 'first', 'response', 'task', 'correct', 'upvote')
)
# Counting words in tweets present in your lexicon
in_lexicon <- tweets %>%
# Put every word of your tweets on its own row
tidytext::unnest_tokens(output = 'words', input = text) %>%
# Determining if a word is in your lexicon
dplyr::mutate(
in_lexicon = words %in% lexicon$word
) %>%
dplyr::group_by(id) %>%
dplyr::summarise(words_in_lexicon = sum(in_lexicon))
# Binding count and the original data
dplyr::left_join(tweets, in_lexicon)
I am trying to score documents based on the words that occur in them. I have two types of scores for each word occurring in the corpus. It is essentially like a sentiment analysis, but with a bespoke dictionary and respective scores. THANK YOU <3
#documents to be scored on 2 dimensions: score1 and score2
documents <- data.frame(textID = 1:3, text = c("Hello everybody, pleased to see everyone together", " DHL postmen have faced difficulties this year", "divorcees have trouble finding jobs in this country"), scored1 = rep(NA,3), scored2=rep(NA,3) )
#first scoring dimension
scores1 <- as.matrix(data.frame(words = c("hello", "everybody", "pleased", "to" ,"see", "everyone","together", "DHL", "postmen", "have", "faced","difficulties","this", "year", "divorcees", "trouble", "finding", "jobs", "in", "country" ), scores = 1:20))
#second scoring dimension
scores2 <- as.matrix(data.frame(words = c("hello", "everybody", "pleased", "to" ,"see", "everyone","together", "DHL", "postmen", "have", "faced","difficulties","this", "year", "divorcees", "trouble", "finding", "jobs", "in", "country" ), scores = 10:29))
#the result should look like this, where each text receives a score that represents the sum of the individual word scores:
#textID text scored1 scored2
#1 1 Hello everybody, pleased to see everyone together 28 91
#2 2 DHL postmen have faced difficulties this year 77 140
#3 3 divorcees have trouble finding jobs in this country 128 200
This could be achieved by using
tidytext::unnest_tokens to split the documents into single words
dplyr::left_join to attach the word scores
dplyr::summarise to compute the scores for each document
library(dplyr)
library(tidytext)
#documents to be scored on 2 dimensions: score1 and score2
documents <- data.frame(textID = 1:3, text = c("Hello everybody, pleased to see everyone together", " DHL postmen have faced difficulties this year", "divorcees have trouble finding jobs in this country"), scored1 = rep(NA,3), scored2=rep(NA,3) )
# 1. Get rid of as.matrix
#first scoring dimension
scores1 <- data.frame(words = c("hello", "everybody", "pleased", "to" ,"see", "everyone","together", "DHL", "postmen", "have", "faced","difficulties","this", "year", "divorcees", "trouble", "finding", "jobs", "in", "country" ), scores = 1:20)
#second scoring dimension
scores2 <- data.frame(words = c("hello", "everybody", "pleased", "to" ,"see", "everyone","together", "DHL", "postmen", "have", "faced","difficulties","this", "year", "divorcees", "trouble", "finding", "jobs", "in", "country" ), scores = 10:29)
# 2. Make words lowercase
scores1 <- mutate(scores1, words = tolower(words))
scores2 <- mutate(scores2, words = tolower(words))
# 3. Compute scores
documents %>%
select(-scored1, -scored2) %>%
tidytext::unnest_tokens(text, output = words, drop = FALSE) %>%
left_join(scores1, by = c("words" = "words")) %>%
left_join(scores2, by = c("words" = "words"), suffix = c("1", "2")) %>%
group_by(textID, text) %>%
summarise(across(starts_with("scores"), sum, na.rm = TRUE)) %>%
rename(scored1 = scores1, scored2 = scores2) %>%
ungroup()
#> `summarise()` regrouping output by 'textID' (override with `.groups` argument)
#> # A tibble: 3 x 4
#> textID text scored1 scored2
#> <int> <chr> <int> <int>
#> 1 1 "Hello everybody, pleased to see everyone together" 28 91
#> 2 2 " DHL postmen have faced difficulties this year" 77 140
#> 3 3 "divorcees have trouble finding jobs in this country" 128 200
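For comparison, the same sums can be reproduced without tidytext by treating each score table as a named vector and indexing it by token. This is a sketch only: score_text and score_lookup are hypothetical names, the tokenisation rule (lowercase, split on non-alphanumeric characters) is an assumption, and only the first scoring dimension is shown.

```r
words <- c("hello", "everybody", "pleased", "to", "see", "everyone", "together",
           "dhl", "postmen", "have", "faced", "difficulties", "this", "year",
           "divorcees", "trouble", "finding", "jobs", "in", "country")

# Named vector: names are the (lowercased) words, values are their scores
score_lookup <- setNames(1:20, words)

texts <- c("Hello everybody, pleased to see everyone together",
           " DHL postmen have faced difficulties this year",
           "divorcees have trouble finding jobs in this country")

# Sum the scores of all tokens in one text; unknown tokens give NA and are dropped
score_text <- function(txt, lookup) {
  tokens <- strsplit(tolower(txt), "[^[:alnum:]]+")[[1]]
  sum(lookup[tokens], na.rm = TRUE)
}

scored1 <- vapply(texts, score_text, numeric(1),
                  lookup = score_lookup, USE.NAMES = FALSE)
scored1
```

The second dimension would follow the same pattern with a lookup built from 10:29.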
I am looking for a function that lets me add a new column, called ID, holding the ID value for each string. That is,
I have a list of words with their IDs:
car = 9112
red = 9512
employee = 6117
sky = 2324
words<- c("car", "sky", "red", "employee", "domestic")
match<- c("car", "red", "domestic", "employee", "sky")
The comparison is made against data read from an Excel file: if a value equals one of the words in my vector, the word is replaced with its ID; otherwise the original word is left as is.
x10<- c(words)# string
words.corpus <- c(L4$`match`) # pattern
idwords.corpus <- c(L4$`ID`) # replace
words.corpus <- paste0("\\A",idwords.corpus, "\\z|\\A", words.corpus,"\\z")
vect.corpus <- idwords.corpus
names(vect.corpus) <- words.corpus
data15 <- str_replace_all(x10, vect.corpus)
result:
data15:
" 9112", "2324", "9512", "6117", "employee"
What I'm looking for is to add a new column with the ID, instead of replacing the word with the ID
words ID
car 9112
red 9512
employee 6117
sky 2324
domestic domestic
I'd use data.table for fast lookup based on the fixed words value. While it's not 100% clear what you are asking for, it sounds like you want to replace words with an index value if there is a match, or leave the word as a word if not. This code will do that:
library("data.table")
# associate your ids with fixed word matches in a keyed data.table
ids <- data.table(
word = c("car", "red", "employee", "sky"),
ID = c(9112, 9512, 6117, 2324)
)
setkey(ids, word)
# this is what you would read in
data <- data.table(
word = c("car", "sky", "red", "employee", "domestic", "sky")
)
setkey(data, word)
data <- ids[data]
# replace NAs from no match with word
data[, ID := ifelse(is.na(ID), word, ID)]
data
## word ID
## 1: car 9112
## 2: domestic domestic
## 3: employee 6117
## 4: red 9512
## 5: sky 2324
## 6: sky 2324
Here "domestic" is not matched, so it remains as the word in the ID column. I also repeated "sky" to show that this works for every instance of a word.
If you want to preserve the original sort order, you could create an index variable before the merge, and then reorder the output by that index variable.
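The index-variable idea can be sketched as follows. Note that this uses base R's merge rather than the keyed data.table join above, purely to keep the example dependency-free; idx is a hypothetical helper column that is dropped at the end.

```r
ids <- data.frame(word = c("car", "red", "employee", "sky"),
                  ID = c(9112, 9512, 6117, 2324),
                  stringsAsFactors = FALSE)
data <- data.frame(word = c("car", "sky", "red", "employee", "domestic", "sky"),
                   stringsAsFactors = FALSE)

# Remember the original row order before merging
data$idx <- seq_len(nrow(data))

# Left join: keep every row of data, NA where no ID exists
merged <- merge(data, ids, by = "word", all.x = TRUE)

# Restore the original order, then fall back to the word itself when unmatched
merged <- merged[order(merged$idx), ]
merged$ID <- ifelse(is.na(merged$ID), merged$word, merged$ID)

merged$idx <- NULL
rownames(merged) <- NULL
merged
```

As in the data.table version, the ID column ends up as character because it mixes IDs with unmatched words.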
Suppose I have the following data
specialty <- c("Primary Care", "Internal Medicine Subspecialties" ,
"Pediatric subspecialties","Surgical subspecialties", "Emergency
Medicine","All other specialties", "No Medical specialty")
test <- c(23,43,67,77,54)
dfTEST <- data.frame(test)
dfTEST<- t(dfTEST)
colnames(dfTEST) <- c(1,2,4,5,7)
> dfTEST
1 2 4 5 7
test 23 43 67 77 54
Note that dfTEST has 5 columns whose names skip numbers. I need to create a data frame that maps these column names (1, 2, 4, 5, 7) to the specialty. specialty holds 7 strings whose positions correspond to the dfTEST column names, meaning column 2 = "Internal Medicine Subspecialties", column 4 = "Surgical subspecialties", and so on. Below is a snippet of what I am looking to achieve, but I am stumped on how to go about it. I need it to be flexible so that the code still works no matter what the numbers in the column names are. Any ideas? Thanks!
> dfTEST
1 2 4 5 7
test 23 43 67 77 54
added "primary care" "internal" ...
This should solve your problem.
library(dplyr)
specialty_lookup <- data.frame(specialty = c("Primary Care",
"Internal Medicine Subspecialties",
"Pediatric subspecialties",
"Surgical subspecialties",
"Emergency Medicine",
"All other specialties",
"No Medical specialty"),
test = 1:7,
stringsAsFactors = F)
data <- data.frame(code = c(23,43,67,77,54),
test = c(1,2,4,5,7))
data <- data %>%
left_join(specialty_lookup)
data_wide <- data %>%
select(-test) %>%
t() %>%
data.frame()
colnames(data_wide) <- data$test
data_wide
But you should ask yourself whether this is really the format you want your data to have. From the little I could see of your problem, the following format would be more suitable:
library(dplyr)
specialty_lookup <- data.frame(specialty = c("Primary Care",
"Internal Medicine Subspecialties",
"Pediatric subspecialties",
"Surgical subspecialties",
"Emergency Medicine",
"All other specialties",
"No Medical specialty"),
test = 1:7, stringsAsFactors = F)
data <- data.frame(code = c(23,43,67,77,54),
test = c(1,2,4,5,7))
data <- data %>%
left_join(specialty_lookup)
data
Hope this helps:
# get the indexes of the corresponding specialties
ids <- as.integer(colnames(dfTEST))
dfTEST<- as.data.frame(t(dfTEST))
dfTEST$added <- specialty[ids]
dfTEST<- t(dfTEST)
The output:
> dfTEST
1 2 4
test "23" "43" "67"
added "Primary Care" "Internal Medicine Subspecialties" "Surgical subspecialties"
5 7
test "77" "54"
added "Emergency \n Medicine" "No Medical specialty"
I have a dataframe that is in this format:
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy", "We tried the tea but didn't enjoy it at all", "Family is very important as it gives you values")
df <- data.frame(A, B)
My intention is to get the result back as:
ID A B
1 John Smith is a very highly smart guy
2 Red Shirt We tried the tea but didn't enjoy it at all
3 Family values are better is very important as it gives you
I have tried:
test<-df %>% filter(sapply(1:nrow(.), function(i) grepl(A[i], B[i])))
But this doesn't get me the desired output.
One solution is to use mapply along with strsplit.
The trick is to split df$A into separate words, collapse them with | as the separator, and use the result as the pattern in gsub, replacing matches with "".
lst <- strsplit(df$A, split = " ")
df$B <- mapply(function(x,y){gsub(paste0(x,collapse = "|"), "",df$B[y])},lst,1:length(lst))
df
# A B
# 1 John Smith is a very highly smart guy
# 2 Red Shirt We tried the tea but didn't enjoy it at all
# 3 Family values are better is very important as it gives you
Another option is as:
df$B <- mapply(function(x,y)gsub(x,"",y) ,gsub(" ", "|",df$A),df$B)
Data:
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy", "We tried the tea but didn't enjoy it at all", "Family is very important as it gives you values")
df <- data.frame(A, B, stringsAsFactors = FALSE)
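Both gsub variants above match substrings and leave stray spaces where a word was removed. A possible refinement, sketched here under the assumption that only whole words should be removed and that df$A contains no regular-expression metacharacters, adds word boundaries and tidies the whitespace. remove_shared_words and B_clean are hypothetical names.

```r
A <- c("John Smith", "Red Shirt", "Family values are better")
B <- c("John is a very highly smart guy",
       "We tried the tea but didn't enjoy it at all",
       "Family is very important as it gives you values")
df <- data.frame(A, B, stringsAsFactors = FALSE)

remove_shared_words <- function(a, b) {
  # Turn "John Smith" into "\\b(John|Smith)\\b" so only whole words match
  pattern <- paste0("\\b(", gsub(" ", "|", a, fixed = TRUE), ")\\b")
  cleaned <- gsub(pattern, "", b, perl = TRUE)
  # Collapse repeated spaces and trim the ends
  trimws(gsub("\\s+", " ", cleaned))
}

df$B_clean <- unname(mapply(remove_shared_words, df$A, df$B))
df
```

The result is written to a new column so the original B stays available for comparison.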
Just another option using stringr::str_split_fixed function:
library(stringr)
str_split_fixed(sapply(paste(df$A,df$B, sep=" columnbreaker "),
function(i){
paste(unique(
strsplit(as.character(i), split=" ")[[1]]),
collapse = " ")}),
" columnbreaker ", 2)
# [,1] [,2]
# [1,] "John Smith" "is a very highly smart guy"
# [2,] "Red Shirt" "We tried the tea but didn't enjoy it at all"
# [3,] "Family values are better" "is very important as it gives you"