rtweet - multiple AND/OR keyword search - r

I am using the rtweet package to retrieve tweets that contain specific keywords. I know how to do an "and"/"or" match, but not how to chain these together into one keyword query with multiple AND/OR conditions. For example, a search query I may wish to pass to the search_tweets() function is:
('cash' OR 'currency' OR 'banknote' OR 'accepting cash' OR 'cashless') AND ('covid' OR 'virus' OR 'coronavirus')
So a matching tweet must contain at least one of the words in the first bracket and also at least one of the words in the second bracket.

Using dplyr:
Assuming you have a data frame with a character column containing the tweet text:
Sample data:
df <- structure(list(Column = c("coronavirus cash", "covid", "currency covid",
"currency coronavirus", "coronavirus virus", "trees", "plants",
"moneys")), row.names = c(NA, -8L), class = c("tbl_df", "tbl",
"data.frame"))
You can use the following:
library(dplyr)
library(stringr)  # str_detect() comes from stringr, not dplyr

match <- df %>%
  dplyr::filter(str_detect(Column, "cash|currency|banknote|accepting cash|cashless")) %>%
  dplyr::filter(str_detect(Column, "covid|virus|coronavirus"))

Related

Regular expression to remove ellipses

I am trying to remove the dots at the end of the stname column, but nothing I try works.
This is what the dataset looks like.
df = structure(list(stname = c("Alabama……………………………………",
"Alaska………………………………………", "American Samoa……………………………",
"Arizona………………………………………", "Arkansas……………………………………",
"California………………………………"), value = c(34305795,
20236292, 103657, 267021650, 15045025, 3976908430)), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
I tried the following but the dots are still there.
library(tidyverse)
#remove non alpha-numeric characters
df %>%
  mutate(stname = str_replace_all(stname, "[^[:alnum:][:space:]]", ""))
#remove dots
df %>%
  mutate(stname = str_replace(stname, "\\.+", ""))
Neither of those approaches worked.
The problem you are facing is that it's not an actual dot in your stname column. You have a horizontal ellipsis character there (see the ANSI character table, where it is number 133: http://www.alanwood.net/demos/ansi.html).
That's why your regex also needs to match the horizontal ellipsis. Try running this; it should help:
df %>%
  mutate(stname = str_replace(stname, "…+", ""))
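If the literal ellipsis character gets mangled by your editor or file encoding, a safer way to write the same pattern is the Unicode escape for U+2026 (a small sketch using the same stringr approach):

df %>%
  mutate(stname = str_replace_all(stname, "\u2026+", ""))  # \u2026 is the horizontal ellipsis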

Categorize observations in dataframe by different identifiers

I've searched around for a solution to this problem, but can't seem to find any.
I have pulled tweets from Danish MPs using the rtweet package to access the Twitter API. I used get_timeline() to pull the data.
get_timeline(c(politikere), n = 100, parse = TRUE, since_id = "1315756184247435264", max_id = "1333904927559725056", type = "recent") %>%
  dplyr::filter(created_at > "2020-10-25" & created_at <= "2020-12-01")
Now I would like to categorize the different Twitter users by their party ID, in order to do some party-specific sentiment analysis.
The API call returns all sorts of information in a tibble data frame, e.g. user_id, spanning around 90 different variables:
user_id | status_id | created_at | screen_name | text | description | ...x_i
The point is that I want to create a new column in the dataset named party_id and assign a value to each user according to the party they belong to, i.e. a column which identifies the party affiliation. It should look something like this:
user_id | status_id | created_at | screen_name | text | description | party_id
1234346 | 683901040 | 2020-11-23 | larsen_mc | gg.. | Danish MP.. | Conservatives
I looked at the dplyr package, but I can't quite get my head around how to assign the same value to different rows that do not share the same identifier. If, e.g., all the conservative MPs shared the same status_id, it would be a somewhat easier task using inner_join, but every user has its own unique identifier in this case (of course).
Here is the example_data
structure(list(user_id = c("2373406198", "4360080437", "3512158337",
"746909257", "36910691", "58550919", "279986859", "1225930531",
"26263965", "2222188479"), status_id = c("1354094283230474241",
"1354707826317393922", "1354391556900483072", "1347169543853117444",
"1354866447735005185", "1332633849659088897", "1355522537669734401",
"1355554489361686530", "1329028442105458688", "1330791375449829376"
), created_at = structure(c(1611676209, 1611822489, 1611747085,
1610025223, 1611860307, 1606559643, 1612016732, 1612024349, 1605700047,
1606120363), tzone = "UTC", class = c("POSIXct", "POSIXt")),
screen_name = c("jacobmark_sf", "RuneLundEL", "kimvalentinDK",
"TommyPetersenDK", "JuulMona", "Blixt22", "JanEJoergensen",
"RasmusJarlov", "StemLAURITZEN", "olebirkolesen")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
Hope this makes sense.
Best,
Gustav
Okay - I found a solution! After making the identifier column manually (called Parti_id), I used the tidyverse and its left_join():
poldata <- poldata %>%
  select(screen_name, Parti_id)

FTtweets <- left_join(tmlpol, poldata, by = "screen_name")
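For anyone reproducing this: the lookup table can be built by hand as a small data frame mapping each screen name to a party. A sketch against the example data above; the party assignments here are purely illustrative, not verified:

library(dplyr)
library(tibble)

party_lookup <- tribble(
  ~screen_name,     ~Parti_id,
  "JanEJoergensen", "Conservatives",  # hypothetical assignment
  "RasmusJarlov",   "Conservatives",  # hypothetical assignment
  "jacobmark_sf",   "SF"              # hypothetical assignment
)

tweets_with_party <- left_join(example_data, party_lookup, by = "screen_name")
# rows with no match in the lookup get NA in Parti_id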

Removing Stop words from a list of strings in R

Sample data
dput output of my data:
x <- structure(list(Comments = structure(2:1, .Label = c("I have a lot of home-work to be completed..",
"I want to vist my teacher today only!!"), class = "factor"),
Comment_ID = c(704, 802)), class = "data.frame", row.names = c(NA,
-2L))
I want to remove the stop words from the above data set using tidytext::stop_words$word and retain the same columns in the output. Along with this, how can I remove punctuation with the tidytext package?
Note: I don't want to change my dataset into a corpus.
You can collapse all the words in tidytext::stop_words$word into one regex by adding word boundaries. However, tidytext::stop_words$word has length 1149, and that might be too big for the regex engine to handle, so you can drop the words you don't need before applying it.
For example, taking only the first 10 words from tidytext::stop_words$word, you can do:
gsub(paste0(paste0('\\b', tidytext::stop_words$word[1:10], '\\b',
                   collapse = "|"), '|[[:punct:]]+'), '', x$Comments)
#[1] "I want to vist my teacher today only"
#    "I have lot of homework to be completed"
Alternatively, with the tm package:
library(tm)
clean_tweet <- removeWords(clean_tweet, stopwords("english"))
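A more tidytext-idiomatic route is to tokenize, anti-join against the stop word table, and paste the words back together. A sketch, with the caveat that unnest_tokens() lowercases the text and strips punctuation by default:

library(dplyr)
library(tidytext)

x %>%
  mutate(Comments = as.character(Comments)) %>%
  unnest_tokens(word, Comments) %>%        # one row per word; punctuation dropped
  anti_join(stop_words, by = "word") %>%   # remove stop words
  group_by(Comment_ID) %>%
  summarise(Comments = paste(word, collapse = " "), .groups = "drop")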

Spread dataframe

I have the following dataframe/tibble sample:
structure(list(name = c("Contents.Key", "Contents.LastModified",
"Contents.ETag", "Contents.Size", "Contents.Owner", "Contents.StorageClass",
"Contents.Bucket", "Contents.Key", "Contents.LastModified", "Contents.ETag"
), value = c("2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0_0e94e664-4d5e-4646-b2b9-1937398cfaed_2019-01-01-07-54-46-064",
"2019-01-01T07:54:47.000Z", "\"378d04496cb27d93e1c37e1511a79ec7\"",
"24187", "e7c0d260939d15d18866126da3376642e2d4497f18ed762b608ed2307778bdf1",
"STANDARD", "vfevvv-edrfvevevev-streamed-data", "2019/01/01/07/556662_cba3a4fc-cb8f-4150-859f-5f21a38373d0_33a8ba28-245c-490b-99b2-254507431d47_2019-01-01-07-54-56-755",
"2019-01-01T07:54:57.000Z", "\"df8cc7082e0cc991aa24542e2576277b\""
)), row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"
))
I want to spread the name column using the tidyr::spread() function, but I don't get the desired result:
df %>% tidyr::spread(key = name, value = value)
I get an error:
Error: Duplicate identifiers for rows:...
I also tried the melt function, with the same result.
I connected to S3 using the aws.s3::get_bucket() function and am trying to convert the result to a data frame. I am aware there is an aws.s3::get_bucket_df() function which should do this, but it doesn't work (see my related question).
After getting the bucket list, I unlisted it and ran enframe().
Please advise.
You can introduce a row-number column first (this introduces NAs, which you will have to deal with):
library(dplyr)
library(tidyr)

df %>%
  mutate(RN = row_number()) %>%
  group_by(RN) %>%
  spread(name, value)
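Since spread() has been superseded by pivot_wider(), here is a sketch of an alternative that groups the fields into records first. It assumes each S3 object's block of fields starts with Contents.Key, so a running count of that name gives a record id:

library(dplyr)
library(tidyr)

df %>%
  mutate(record = cumsum(name == "Contents.Key")) %>%  # new record at each Contents.Key
  pivot_wider(names_from = name, values_from = value)
# yields one row per S3 object, with NA for fields missing from a record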

R not producing the same result when the data set source is changed

If I manually create 2 DFs, the code does what it is intended to do:
df1 <- structure(list(CompanyName = c("Google", "Tesco")), .Names = "CompanyName", class = "data.frame", row.names = c(NA, -2L))
df2 <- structure(list(CompanyVariationsNames = c("google plc", "tesco bank", "tesco insurance", "google finance", "google play")), .Names = "CompanyVariationsNames", class = "data.frame", row.names = c(NA, -5L))
library(dplyr)
library(splitstackshape)  # for cSplit()

test <- df2 %>%
  rowwise() %>%
  mutate(CompanyName = as.character(Filter(length,
    lapply(df1$CompanyName, function(x) x[grepl(x, CompanyVariationsNames, ignore.case = TRUE)])))) %>%
  group_by(CompanyName) %>%
  summarise(Variation = paste(CompanyVariationsNames, collapse = ",")) %>%
  cSplit("Variation", ",")
This produces the following result:
CompanyName Variation_1 Variation_2 Variation_3
1: Google google plc google finance google play
2: Tesco tesco bank tesco insurance NA
But if I import a data set (using read.csv), I get the following error: Error in mutate_impl(.data, dots) : Column CompanyName must be length 1 (the group size), not 0. My data sets are rather large: df1 has about 1,000 rows and df2 about 54k rows.
Is there a specific reason why the code works when the data set is created manually but not when the data is imported?
df1 contains company names and df2 contains variation names of those companies.
Help please!
Importing from CSV can be tricky. Check whether the default separator (comma) actually applies to your file. If not, you can change it by setting the sep argument to a character that works (e.g. read.csv(file_path, sep = ";")), which is a common problem in my country due to our local conventions.
In fact, if your standard is semicolons, read.csv2(file_path) will suffice.
Also, to avoid further trouble: it is very common for CSV imports to mess up columns with decimal values, because here we use commas as decimal separators rather than dots. So it is worth checking whether this is a problem in any of the other columns of your file too.
If that is your case, you can set the appropriate parameter in either read.csv or read.csv2 with dec = "," (e.g. read.csv(file_path, sep = ";", dec = ",")).
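A quick sanity check after importing, to confirm the columns came through as expected before running the matching code (a sketch; the file name is hypothetical):

df1 <- read.csv("companies.csv", sep = ";", dec = ",", stringsAsFactors = FALSE)
str(df1)               # check that CompanyName is character, not factor or mangled
head(df1$CompanyName)  # eyeball a few values for stray quotes or whitespace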
