I have a dataframe with Twitter bios formatted like the table below.
account  bio
38374    i love candy as much as life itself proud liberal
45673    can all just get along
94928    conserv christian mom and proud pro trump veteran maga
11204    professor of women and gender studies at wesleyan university blacklivesmatter
37465    former ohio state football coach now a proud papa to seven grandchildren
A number of questions on Stack Overflow ask how to remove a specified list of words from a dataframe column (like "R - remove word from a sentence" and "How to remove words of a sentence by using a dictionary as reference"). But I want to remove ALL words in the bio column UNLESS they are found in a predetermined list of words. The list of words to keep contains 1052 words (an excerpt is shown below):
> termstokeep
[1] love life follow live just like music regist trademark
[10] make fan one copyright lover thing world time god
[19] can get design peopl artist girl univers writer will
[28] student work busi good new know friend famili best
[37] day account market sport art game manag want book
[46] enthusiast person alway travel never free real help dream
[55] servic mom husband profession beauti offici wife now news
[64] social food come father heart educ develop need anim
[73] everyth proud tri year happi also media way man
[82] team produc look state take back support director home
[91] find call engin learn provid photograph great author video
[100] guy communiti coach name big passion see teacher school
[109] product sinc gamer enjoy keep player better let believ
[118] mother think mind dog futur give colleg say owner
[127] jesus fun got littl chang founder boy use first
[136] liberal write footbal kid fuck event polit consult care
[145] conserv much health technolog tech opinion stay everi right
[154] full former member special well young high creat snap
[163] entrepreneur movi feel view compani coffe cat citi human
[172] digit show singer sometim interest dad watch scienc creativ
[181] blogger base addict fit read bless fashion part noth
[190] run forev editor born hard die around onlin nerd
[199] class web musician made stuff leader ever inspir still
[208] christian place current public danc pleas geek talk film
[217] realli babi someth page rock lot women lead two
Ideally, after all non-specified words are removed, the dataframe would look something like this:
account  bio
38374    love life proud liberal
45673
94928    conserv christian mom proud pro trump veteran maga
11204    professor women gender university blacklivesmatter
37465    ohio state football coach proud grandchildren
How can I accomplish this?
Here is a way with base R's gregexpr and regmatches. The \\< and \\> anchors match word boundaries, so only whole words from termstokeep are kept.
pattern <- paste0("\\<", termstokeep, "\\>")
pattern <- paste(pattern, collapse = "|")
m <- gregexpr(pattern, df1$bio)
r <- regmatches(df1$bio, m)
df1$bio_clean <- sapply(r, paste, collapse = " ")
Created on 2022-02-22 by the reprex package (v2.0.1)
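To see why the word-boundary anchors matter in the pattern above, here is a minimal sketch on toy data. The vectors terms and text are made up for illustration and are not part of the question's data.

```r
# Hypothetical toy data, chosen so that terms occur inside longer words
terms <- c("art", "love")
text  <- "artist love art lovely"

# Without anchors, "art" also matches inside "artist" and "love" inside "lovely"
unanchored <- paste(terms, collapse = "|")
loose_hits <- regmatches(text, gregexpr(unanchored, text))[[1]]
loose_hits
# "art" "love" "art" "love"

# With \\< and \\> only whole words match
anchored <- paste0("\\<", terms, "\\>", collapse = "|")
exact_hits <- regmatches(text, gregexpr(anchored, text))[[1]]
exact_hits
# "love" "art"
```

With 1052 terms the alternation gets long, but it is still a single pattern applied once to the whole column.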
Data
termstokeep <-
c("love", "life", "follow", "live", "just", "like", "music",
"regist", "trademark", "make", "fan", "one", "copyright", "lover",
"thing", "world", "time", "god", "can", "get", "design", "peopl",
"artist", "girl", "univers", "writer", "will", "student", "work",
"busi", "good", "new", "know", "friend", "famili", "best", "day",
"account", "market", "sport", "art", "game", "manag", "want",
"book", "enthusiast", "person", "alway", "travel", "never", "free",
"real", "help", "dream", "servic", "mom", "husband", "profession",
"beauti", "offici", "wife", "now", "news", "social", "food",
"come", "father", "heart", "educ", "develop", "need", "anim",
"everyth", "proud", "tri", "year", "happi", "also", "media",
"way", "man", "team", "produc", "look", "state", "take", "back",
"support", "director", "home", "find", "call", "engin", "learn",
"provid", "photograph", "great", "author", "video", "guy", "communiti",
"coach", "name", "big", "passion", "see", "teacher", "school",
"product", "sinc", "gamer", "enjoy", "keep", "player", "better",
"let", "believ", "mother", "think", "mind", "dog", "futur", "give",
"colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang",
"founder", "boy", "use", "first", "liberal", "write", "footbal",
"kid", "fuck", "event", "polit", "consult", "care", "conserv",
"much", "health", "technolog", "tech", "opinion", "stay", "everi",
"right", "full", "former", "member", "special", "well", "young",
"high", "creat", "snap", "entrepreneur", "movi", "feel", "view",
"compani", "coffe", "cat", "citi", "human", "digit", "show",
"singer", "sometim", "interest", "dad", "watch", "scienc", "creativ",
"blogger", "base", "addict", "fit", "read", "bless", "fashion",
"part", "noth", "run", "forev", "editor", "born", "hard", "die",
"around", "onlin", "nerd", "class", "web", "musician", "made",
"stuff", "leader", "ever", "inspir", "still", "christian", "place",
"current", "public", "danc", "pleas", "geek", "talk", "film",
"realli", "babi", "someth", "page", "rock", "lot", "women", "lead",
"two")
df1 <- read.table(text = "
account bio
38374 'i love candy as much as life itself proud liberal'
45673 'can all just get along'
94928 'conserv christian mom and proud pro trump veteran maga'
11204 'professor of women and gender studies at wesleyan university blacklivesmatter'
37465 'former ohio state football coach now a proud papa to seven grandchildren'
", header = TRUE)
Here is another base R option:
df$bio <- sapply(lapply(strsplit(df$bio, "\\s"), intersect, termstokeep),
paste, collapse = " ")
Output
account bio
1 38374 love much life proud liberal
2 45673 can just get
3 94928 conserv christian mom proud
4 11204 women
5 37465 former state coach now proud
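One thing to keep in mind with the intersect approach above: intersect returns unique values, so a kept word that appears twice in a bio is reported only once. If repeats should survive, a sketch along these lines may help. The helper name keep_terms and the toy inputs are assumptions for illustration.

```r
# Hypothetical helper: keep every occurrence of a word that is in `terms`
keep_terms <- function(bio, terms) {
  tokens <- strsplit(bio, "\\s+")
  vapply(tokens,
         function(w) paste(w[w %in% terms], collapse = " "),
         character(1))
}

# Toy inputs, not the question's data
bios  <- c("proud proud mom", "can all just get along")
terms <- c("proud", "mom", "can", "just", "get")

kept <- keep_terms(bios, terms)
kept
# "proud proud mom" "can just get"
```

It would be applied to the question's data as df$bio <- keep_terms(df$bio, termstokeep).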
Data (thanks @RuiBarradas!)
df <- structure(list(account = c(38374L, 45673L, 94928L, 11204L, 37465L
), bio = c("i love candy as much as life itself proud liberal",
"can all just get along", "conserv christian mom and proud pro trump veteran maga",
"professor of women and gender studies at wesleyan university blacklivesmatter",
"former ohio state football coach now a proud papa to seven grandchildren"
)), class = "data.frame", row.names = c(NA, -5L))
termstokeep <- c("love", "life", "follow", "live", "just", "like", "music",
"regist", "trademark", "make", "fan", "one", "copyright", "lover",
"thing", "world", "time", "god", "can", "get", "design", "peopl",
"artist", "girl", "univers", "writer", "will", "student", "work",
"busi", "good", "new", "know", "friend", "famili", "best", "day",
"account", "market", "sport", "art", "game", "manag", "want",
"book", "enthusiast", "person", "alway", "travel", "never", "free",
"real", "help", "dream", "servic", "mom", "husband", "profession",
"beauti", "offici", "wife", "now", "news", "social", "food",
"come", "father", "heart", "educ", "develop", "need", "anim",
"everyth", "proud", "tri", "year", "happi", "also", "media",
"way", "man", "team", "produc", "look", "state", "take", "back",
"support", "director", "home", "find", "call", "engin", "learn",
"provid", "photograph", "great", "author", "video", "guy", "communiti",
"coach", "name", "big", "passion", "see", "teacher", "school",
"product", "sinc", "gamer", "enjoy", "keep", "player", "better",
"let", "believ", "mother", "think", "mind", "dog", "futur", "give",
"colleg", "say", "owner", "jesus", "fun", "got", "littl", "chang",
"founder", "boy", "use", "first", "liberal", "write", "footbal",
"kid", "fuck", "event", "polit", "consult", "care", "conserv",
"much", "health", "technolog", "tech", "opinion", "stay", "everi",
"right", "full", "former", "member", "special", "well", "young",
"high", "creat", "snap", "entrepreneur", "movi", "feel", "view",
"compani", "coffe", "cat", "citi", "human", "digit", "show",
"singer", "sometim", "interest", "dad", "watch", "scienc", "creativ",
"blogger", "base", "addict", "fit", "read", "bless", "fashion",
"part", "noth", "run", "forev", "editor", "born", "hard", "die",
"around", "onlin", "nerd", "class", "web", "musician", "made",
"stuff", "leader", "ever", "inspir", "still", "christian", "place",
"current", "public", "danc", "pleas", "geek", "talk", "film",
"realli", "babi", "someth", "page", "rock", "lot", "women", "lead",
"two")
A possible solution, based on tidyverse:
library(tidyverse)
df %>%
  rowwise() %>%
  # intersect with termstokeep (the list of words to keep defined above)
  mutate(bio = str_split(bio, "\\s") %>% unlist() %>% intersect(termstokeep) %>%
           str_c(collapse = " ")) %>%
  ungroup()
#> # A tibble: 5 x 2
#> account bio
#> <int> <chr>
#> 1 38374 love much life proud liberal
#> 2 45673 can just get
#> 3 94928 conserv christian mom proud
#> 4 11204 women
#> 5 37465 former state coach now proud
I'm trying to classify a data frame of customer reviews into the respective categories. For example,
x <- data.frame(Reviews = c("The phone performance and display is good","Worth the money","Camera is good"))
The desired output is shown in the image below.
I tried creating a dictionary, as below, using R's quanteda package:
dic <- dictionary(list(camera = c("camera","lens","pixel", "pictures",
"pixels", "snap"), display = c("resolution", "display", "depth", "mode",
"color", "colour", "discolour"), performance = c("performance", "speed",
"usage", "fast", "run", "running", "lag", "processor", "shut", "shut down",
"restart", "hanging","hang"), Value = c("money", "worth", "budget", "value",
"price", "specs", "specifications", "invest",
"under","expectations","expected","expecting","expect")))
I would like to classify the texts based on the keywords stated above. Please help.
P.S.: A dfm is one option, but in particular I would like to know how to classify a data frame of texts as per the desired output.
Reusing most of your code:
# Save the reviews in a vector, then tokenize them and create a DFM
# (recent quanteda versions expect a tokens object in dfm())
library(quanteda)
Reviews <- c(
  "The phone performance and display is good",
  "Worth the money",
  "Camera is good")
x <- dfm(tokens(Reviews), tolower = TRUE)
I converted the capital letters to lowercase; otherwise the fixed comparison would not work. I also recommend stopword removal and some kind of stemming.
# Creating the dictionary
dic <- dictionary(list(camera = c("camera","lens","pixel", "pictures", "pixels", "snap"),
display = c("resolution", "display", "depth", "mode", "color", "colour", "discolour"),
performance = c("performance", "speed", "usage", "fast", "run", "running", "lag", "processor", "shut", "shut down", "restart", "hanging","hang"),
Value = c("money", "worth", "budget", "value", "price", "specs", "specifications", "invest", "under","expectations","expected","expecting","expect")))
Using the dfm_lookup function:
# valuetype = "fixed" for exact matching
res <- dfm_lookup(x, dic, valuetype = "fixed")
# label each row with its review text
docnames(res) <- Reviews
res
Hope this is what you are looking for :)
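Since the question asks for a data frame rather than a dfm, here is a minimal base R sketch of the same keyword lookup. It does not use quanteda, it handles single-word keywords only, and the shortened dic list and the classify helper are assumptions made for illustration.

```r
reviews <- c("The phone performance and display is good",
             "Worth the money",
             "Camera is good")

# Shortened, hypothetical version of the dictionary from the question
dic <- list(camera      = c("camera", "lens", "pixel"),
            display     = c("resolution", "display", "colour"),
            performance = c("performance", "speed", "lag"),
            Value       = c("money", "worth", "price"))

# Return the names of all categories with at least one keyword in the text
classify <- function(text, dic) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]
  hits  <- names(dic)[vapply(dic, function(k) any(words %in% k), logical(1))]
  paste(hits, collapse = ", ")
}

result <- data.frame(
  Reviews  = reviews,
  Category = vapply(reviews, classify, character(1), dic = dic,
                    USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)
result
```

A review that matches several categories gets all of them, separated by commas, and a review with no match gets an empty string.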
I am building a forecasting tool that works on a historical stock database. I have a problem downloading all the historical prices from https://stooq.pl
My R code works fine, but I don't know how to bypass the download limit (the problem occurs after ~40 downloads, and I need about 450). Code below:
stock<-c("06n", "08n", "11b", "1at", "4fm", "aal", "aat", "aba", "abc", "abe", "abm", "abs", "acg", "acp", "act", "adv", "ago", "agt", "ahl", "alc", "ali", "all", "alm", "alr", "amb", "amc", "aml", "ape", "apl", "apn", "apr", "apt", "arc", "arh", "arr","06n", "08n", "11b", "1at", "4fm", "aal", "aat", "aba", "abc", "abe", "abm", "abs", "acg", "acp", "act", "adv", "ago", "agt", "ahl", "alc", "ali", "all", "alm", "alr", "amb", "amc", "aml", "ape", "apl", "apn", "apr", "apt", "arc", "arh", "arr","06n", "08n", "11b", "1at", "4fm", "aal", "aat", "aba", "abc", "abe", "abm", "abs", "acg", "acp", "act", "adv", "ago", "agt", "ahl", "alc", "ali", "all", "alm", "alr", "amb", "amc", "aml", "ape", "apl", "apn", "apr", "apt", "arc", "arh", "arr") #example
Dane <- list()
i <- 1
for (s in stock) {
  Dane[[i]] <- read.csv(url(paste0("https://stooq.pl/q/d/l/?s=", s, "&i=d")))
  i <- i + 1
}
After ~40 downloads this message appears:
[1] Przekroczony.dzienny.limit.wywolan (you have exceeded the daily limit of downloads). It is not a real R error: the program downloads a file that contains no data, only this message.
Is there a way to bypass this limit? I don't know of another website (I am not sure there is one at all) from which I can download the data I need.
I couldn't get your link to work. Anyway, take a look at this one.
http://investexcel.net/multiple-stock-quote-downloader-for-excel/
Obviously it's Excel, not R, but it does a nice job. In addition, you can try something like this.
codes <- c("MSFT","SBUX","S","AAPL","ADT")
urls <- paste0("https://www.google.com/finance/historical?q=",codes,"&output=csv")
# one local CSV file per code, e.g. "MSFT.csv"
paths <- paste0(codes, ".csv")
# TRUE for files that have not been downloaded yet
missing <- !file.exists(paths)
missing
# simple error handling in case the file doesn't exist
downloadFile <- function(url, path, ...) {
  # remove file if it exists already
  if (file.exists(path)) file.remove(path)
  # download file
  tryCatch(
    download.file(url, path, ...), error = function(c) {
      # remove file if error
      if (file.exists(path)) file.remove(path)
      # create error message
      c$message <- paste(substr(path, 1, 4), "failed")
      message(c$message)
    }
  )
}
# download only the missing files, pairing each URL with its path
Map(downloadFile, urls[missing], paths[missing])
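For the stooq.pl loop in the question, a sketch like the following would not lift the daily limit, but it could stop cleanly once the limit message comes back instead of storing empty results, and it skips duplicate symbols so no calls are wasted. The names is_limit_response and fetch_all, and the pause length, are assumptions for illustration. The check relies on the limit reply parsing as a single column whose name contains "limit", as in the message quoted in the question.

```r
# Hypothetical check: the limit reply parses as one column named after the message
is_limit_response <- function(d) {
  ncol(d) == 1 && grepl("limit", names(d)[1], ignore.case = TRUE)
}

# Hypothetical driver: `fetch` is passed in so it can be swapped for a cached
# or test version; the default mirrors the URL used in the question
fetch_all <- function(symbols,
                      fetch = function(s) {
                        read.csv(paste0("https://stooq.pl/q/d/l/?s=", s, "&i=d"))
                      },
                      pause = 1) {
  out <- list()
  for (s in unique(symbols)) {        # duplicates would only burn quota
    d <- fetch(s)
    if (is_limit_response(d)) {
      message("Limit reached at '", s, "', stopping after ",
              length(out), " symbols")
      break
    }
    out[[s]] <- d
    Sys.sleep(pause)                  # assumed pause between requests
  }
  out
}
```

The returned list is named by symbol, so a later run can be limited to setdiff(stock, names(Dane)) to pick up where the previous one stopped.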
SELECT d
FROM YYY d
WHERE d MEMBER OF :parameter.myCollection
What is wrong with that query? parameter is an entity I retrieve from the database in a previous step. I keep getting the following exception:
org.apache.openjpa.persistence.ArgumentException: "Encountered "d MEMBER OF :" at character 18, but expected: ["(", "*", "+", "-", ".", "/", ":", "<", "<=", "<>", "=", ">", ">=", "?", "ABS", "ALL", "AND", "ANY", "AS", "ASC", "AVG", "BETWEEN", "BOTH", "BY", "CONCAT", "COUNT", "CURRENT_DATE", "CURRENT_TIME", "CURRENT_TIMESTAMP", "DELETE", "DESC", "DISTINCT", "EMPTY", "ESCAPE", "EXISTS", "FETCH", "FROM", "GROUP", "HAVING", "IN", "INDEX", "INNER", "IS", "JOIN", "KEY", "LEADING", "LEFT", "LENGTH", "LIKE", "LOCATE", "LOWER", "MAX", "MEMBER", "MIN", "MOD", "NEW", "NOT", "NULL", "OBJECT", "OF", "OR", "ORDER", "OUTER", "SELECT", "SET", "SIZE", "SOME", "SQRT", "SUBSTRING", "SUM", "TRAILING", "TRIM", "TYPE", "UPDATE", "UPPER", "VALUE", "WHERE", , , , , , , , , ]." while parsing JPQL "SELECT d
FROM YYY d
WHERE d MEMBER OF :parameter.myCollection
". See nested stack trace for original parse error.
at org.apache.openjpa.kernel.jpql.JPQLParser.parse(JPQLParser.java:51)
at org.apache.openjpa.kernel.ExpressionStoreQuery.newCompilation(ExpressionStoreQuery.java:154)
at org.apache.openjpa.kernel.QueryImpl.newCompilation(QueryImpl.java:672)
at org.apache.openjpa.kernel.QueryImpl.compilationFromCache(QueryImpl.java:654)
at org.apache.openjpa.kernel.QueryImpl.compileForCompilation(QueryImpl.java:620)
at org.apache.openjpa.kernel.QueryImpl.compileForExecutor(QueryImpl.java:682)
at org.apache.openjpa.kernel.QueryImpl.compile(QueryImpl.java:589)
at org.apache.openjpa.persistence.EntityManagerImpl.createNamedQuery(EntityManagerImpl.java:1037)
at org.apache.openjpa.persistence.EntityManagerImpl.createNamedQuery(EntityManagerImpl.java:1016)
Check this out:
SELECT d
FROM YYY d
WHERE d IN :parameter.myCollection
An explanation of how to use 'MEMBER OF' and 'IN' is here