Twitter recently expanded the character limit of a tweet to 280 characters. Since then, the twitteR package only retrieves (or at least only shows) the first 140 characters of an extended tweet.
# load package
library(twitteR)
# set oauth
setup_twitter_oauth(Consumer_Key,Consumer_Secret,Access_Token,Access_Token_Secret)
# get user timeline
k<-userTimeline("SenateFloor", n = 50, includeRts = T)
# to data frame
k<-twListToDF(k)
# print tweet text
print(k$text[1:5])
Console output
[1] "#Senate in at 4:00 PM. Following Leader remarks, will proceed to Executive Session & resume consideration of Cal. #… https:// t.co/BpcPa15Twp"
[2] "RT #GovTop: Weekly Digest of the #CongressionalRecord https:// t.co/vuH71y8FpH"
[3] "#HJRes123 ( Making further continuing appropriations for fiscal year 2018). The Joint Resolution was agreed to by a… https:// t.co/bquyMPPhhm"
[4] "#HJRes123 ( Making further continuing appropriations for fiscal year 2018). https:// t.co/SOmYJ3Dv4t"
[5] "Cal. #167, Susan Bodine to be Assistant Administrator of the Environmental Protection Agency. The nomination was co… https:// t.co/pW7qphwloh"
As you can see, an ellipsis (…) truncates the tweets that exceed the 140-character limit.
> nchar(k$text[1:5])
[1] 144 77 140 99 140
Is there any way to get the whole text of these extended tweets?
As noted in the comment, just use rtweet:
library(rtweet)
library(tidyverse)
sen_df <- get_timeline("SenateFloor", 300)
mutate(sen_df, `Tweet Length`=map_dbl(text, nchar)) %>%
ggplot(aes(`Tweet Length`)) +
ggalt::geom_bkde(color="steelblue", fill="steelblue", alpha=2/3) +
scale_y_continuous(expand=c(0,0)) +
labs(title="#SenateFloor Tweet Length Distribution") +
hrbrthemes::theme_ipsum_rc(grid="XY")
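A quick sanity check that the full 280-character text is coming back (rtweet stores the tweet body in the text column used above):
# tweet lengths above 140 confirm the full text was retrieved
summary(nchar(sen_df$text))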
If you would like to continue to use twitteR then you could try this:
# get user timeline
k<-userTimeline("SenateFloor", n = 50, includeRts = T, tweet_mode = "extended")
As this announcement mentions (https://www.facebook.com/business/news/transparency-social-issue-electoral-political-ads), new targeting information (or a summary of it) has been made available in the Facebook Ad Library.
I usually work with the 'Radlibrary' package in R, but I can't seem to find any fields in 'Radlibrary' that allow me to get this information. Does anyone know either how to access it from the Radlibrary package in R (preferred, since this is what I know and usually work with) or how to access it from the API in another way?
I use it to look at how politicians choose to target their ads, so it would be too big a task to look this up manually at facebook.com/ads/library.
EDIT
The targeting I refer to is what you find when browsing the ad library, as in the screenshots below.
Thanks for highlighting this data release, which I did not know had been announced. I just registered for an API token to play around with it.
It seems to me that looking for ads from a particular politician or organisation is a question of downloading large amounts of data and then manipulating it in R. For example, to recreate the curl query on the API docs page:
curl -G \
-d "search_terms='california'" \
-d "ad_type=POLITICAL_AND_ISSUE_ADS" \
-d "ad_reached_countries=['US']" \
-d "access_token=<ACCESS_TOKEN>" \
"https://graph.facebook.com/<API_VERSION>/ads_archive"
We can simply do:
library(Radlibrary)
# enter token interactively so it doesn't get added to R history
token <- readline()
query <- adlib_build_query(
search_terms = "california",
ad_reached_countries = 'US',
ad_type = "POLITICAL_AND_ISSUE_ADS"
)
response <- adlib_get(params = query, token = token)
results_df <- Radlibrary::as_tibble(response, censor_access_token = TRUE)
This seems to return what one would expect:
names(results_df)
# [1] "id" "ad_creation_time" "ad_creative_bodies" "ad_creative_link_captions" "ad_creative_link_titles" "ad_delivery_start_time"
# [7] "ad_snapshot_url" "bylines" "currency" "languages" "page_id" "page_name"
# [13] "publisher_platforms" "estimated_audience_size_lower" "estimated_audience_size_upper" "impressions_lower" "impressions_upper" "spend_lower"
# [19] "spend_upper" "ad_creative_link_descriptions" "ad_delivery_stop_time"
library(dplyr)
results_df |>
group_by(page_name) |>
summarise(n = n()) |>
arrange(desc(n))
# # A tibble: 237 x 2
# page_name n
# <chr> <int>
# 1 Senator Brian Dahle 169
# 2 Katie Porter 122
# 3 PragerU 63
# 4 Results for California 28
# 5 Big News Buzz 20
# 6 California Water Service 20
# 7 Cancer Care is Different 17
# 8 Robert Reich 14
# 9 Yes On 28 14
# 10 Protect Tribal Gaming 13
# # ... with 227 more rows
Now, assuming that you are interested specifically in the ads by Senator Brian Dahle, it does not appear that you can query for all ads he has placed (i.e. using the page_name parameter in the query). But you can request all political ads in the relevant area (setting the limit parameter to a high number) with a particular search_term or search_page_id, and then filter the data down to the relevant person, as sketched below.
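A rough sketch of that approach (the limit argument name and the column names follow the output above; treat this as illustrative rather than a tested query):
library(Radlibrary)
library(dplyr)
# pull a large batch of political ads and filter afterwards
query <- adlib_build_query(
  search_terms = "california",
  ad_reached_countries = "US",
  ad_type = "POLITICAL_AND_ISSUE_ADS",
  limit = 1000  # assumed parameter; raise it to fetch more ads per call
)
response <- adlib_get(params = query, token = token)
dahle_ads <- Radlibrary::as_tibble(response, censor_access_token = TRUE) |>
  filter(page_name == "Senator Brian Dahle")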
I have a dataframe with a column with some text in it. I want to do three data pre-processing steps:
1) remove words that occur only once,
2) remove words with a low inverse document frequency (IDF), and
3) remove words that occur most frequently.
This is an example of the data:
head(stormfront_data$stormfront_self_content)
Output:
[1] " , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!"
[2] "bonjour warm brother ! forward speaking !"
[3] " check time time forums. frequently moved columbia distinctly numbered. groups gatherings "
[4] " ! site pretty nice. amount news articles. main concern moment islamification."
[5] " , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed."
[6] " white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
Any help would be greatly appreciated, as I am not too familiar with R.
Here's a solution to Q1 in several steps:
Step 1: clean the data by removing anything that is not a word character (\\W):
data2 <- trimws(paste0(gsub("\\W+", " ", data), collapse = ""))
Step 2: Make a sorted frequency list of the words:
fw <- as.data.frame(sort(table(strsplit(data2, "\\s{1,}")), decreasing = T))
Step 3: define a pattern to match (namely all the words that occur only once), making sure to wrap them in word-boundary markers (\\b) so that only exact matches get matched (e.g., network but not networking):
pattern <- paste0("\\b(", paste0(fw$Var1[fw$Freq==1], collapse = "|"), ")\\b")
Step 4: remove matched words:
data3 <- gsub(pattern, "", data2)
Step 5: clean up by removing superfluous spaces:
data4 <- trimws(gsub("\\s{1,}", " ", data3))
Result:
[1] "stormfront introduction white networking site decided networking site white brothers stormfront introduction stormfront stormfront main board forums forums articles stormfront stormfront stormfront local local stormfront stormfront main board groups networking member member site online online stormfront time time forums groups site articles main site decided time site white brothers jay member stormfront jay"
Here is an approach with tidytext
library(tidytext)
library(dplyr)
word_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
unnest_tokens(word, text) %>%
count(document, word, sort = TRUE)
total_count <- tibble(document = seq(1,nrow(data)), text = data) %>%
unnest_tokens(word, text) %>%
group_by(word) %>%
summarize(total = n())
words <- left_join(word_count,total_count)
words %>%
bind_tf_idf(word, document, n)
# A tibble: 111 x 7
document word n total tf idf tf_idf
<int> <chr> <int> <int> <dbl> <dbl> <dbl>
1 1 stormfront 10 11 0.139 1.10 0.153
2 1 networking 3 3 0.0417 1.79 0.0747
3 1 site 3 6 0.0417 0.693 0.0289
4 1 board 2 2 0.0278 1.79 0.0498
5 1 forums 2 3 0.0278 1.10 0.0305
6 1 introduction 2 2 0.0278 1.79 0.0498
7 1 local 2 2 0.0278 1.79 0.0498
8 1 main 2 3 0.0278 1.10 0.0305
9 1 member 2 3 0.0278 1.10 0.0305
10 1 online 2 2 0.0278 1.79 0.0498
# … with 101 more rows
From here it is trivial to filter with dplyr::filter, but since you don't define any specific criteria other than "only once", I'll leave that to you.
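As an illustration only (the cut-offs below are arbitrary and should be adjusted to your needs), the three criteria could look something like this:
words %>%
  bind_tf_idf(word, document, n) %>%
  filter(
    total > 1,                    # 1) drop words occurring only once
    idf > quantile(idf, 0.25),    # 2) drop low-IDF words (threshold is arbitrary)
    total < max(total)            # 3) drop the most frequent word(s)
  )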
Data
data <- structure(c(" , , stormfront! thread members post introduction, \".\" stumbled white networking site, reading & decided register account, largest networking site white brothers, sisters! read : : guidelines posting - stormfront introduction stormfront - stormfront main board consists forums, -forums : newslinks & articles - stormfront ideology philosophy - stormfront activism - stormfront network local level: local regional - stormfront international - stormfront , . addition main board supply social groups utilized networking. final note: steps sustaining member, core member site online, affords additional online features. sf: shopping cart stormfront!",
"bonjour warm brother ! forward speaking !", " check time time forums. frequently moved columbia distinctly numbered. groups gatherings ",
" ! site pretty nice. amount news articles. main concern moment islamification.",
" , discovered site weeks ago. finally decided join found article wanted share . proud race long time idea site people shared views existed.",
" white brothers, names jay member years, bit info ? stormfront meet ups ? stay strong guys jay, uk"
), .Dim = c(6L, 1L))
Base R solution:
# Remove double spacing and punctuation at the start of strings:
# cleaned_str => character vector
cstr <- trimws(gsub("\\s*[[:punct:]]+", "", trimws(gsub('\\s+|^\\s*[[:punct:]]+|"',
' ', df), "both")), "both")
# Calculate the document frequency: document_freq => data.frame
document_freq <- data.frame(table(unlist(sapply(cstr, function(x){
unique(unlist(strsplit(x, "[^a-z]+")))}))))
# Store the inverse document frequency as a vector: idf => double vector:
document_freq$idf <- log(length(cstr)/document_freq$Freq)
# For each record remove terms that occur only once, occur the maximum number
# of times a word occurs in the dataset, or words with a "low" idf:
# pp_records => character vector
pp_records <- do.call("rbind", lapply(cstr, function(x){
# Store the term and corresponding term frequency as a data.frame: tf_dataf => data.frame
tf_dataf <- data.frame(table(as.character(na.omit(gsub("^$", NA_character_,
unlist(strsplit(x, "[^a-z]+")))))),
stringsAsFactors = FALSE)
# Store a vector containing each term's idf: idf => double vector
tf_dataf$idf <- document_freq$idf[match(tf_dataf$Var1, document_freq$Var1)]
# Explicitly return the ppd vector: .GlobalEnv() => character vector
return(
data.frame(
cleaned_record = x,
pp_records =
paste0(unique(unlist(
strsplit(gsub("\\s+", " ",
trimws(
gsub(paste0(tf_dataf$Var1[tf_dataf$Freq == 1 |
tf_dataf$idf < (quantile(tf_dataf$idf, .25) - (1.5 * IQR(tf_dataf$idf))) |
tf_dataf$Freq == max(tf_dataf$Freq)],
collapse = "|"), "", x), "both"
)), "\\s")
)), collapse = " "),
row.names = NULL,
stringsAsFactors = FALSE
)
)
}
))
# Column bind cleaned strings with the original records: ppd_cleaned_df => data.frame
ppd_cleaned_df <- cbind(orig_record = df, pp_records)
# Output to console: ppd_cleaned_df => stdout (console)
ppd_cleaned_df
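Note that the snippet above assumes df holds the character vector of documents; with the data object from the Data block, that would be something like (an assumption on my part, not part of the original code):
# treat each row of `data` as one document
df <- as.character(data[, 1])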
So I want to find patterns and "clusters" based on which items are bought together, and according to the wiki for eclat:
The Eclat algorithm is used to perform itemset mining. Itemset mining let us find frequent patterns in data like if a consumer buys milk, he also buys bread. This type of pattern is called association rules and is used in many application domains.
However, when I use eclat in R, I get "zero frequent items" and NULL when retrieving the results through tidLists. Can anyone see what I am doing wrong?
The full dataset: https://pastebin.com/8GbjnHK2
Each row is a transaction, containing different items in the columns. A quick snapshot of the data:
3060615;;;;;;;;;;;;;;;
3060612;3060616;;;;;;;;;;;;;;
3020703;;;;;;;;;;;;;;;
3002469;;;;;;;;;;;;;;;
3062800;;;;;;;;;;;;;;;
3061943;3061965;;;;;;;;;;;;;;
The code
library(arules)
trans = read.transactions("Transactions.csv", format = "basket", sep = ";")
f <- eclat(trans, parameter = list(supp = 0.1, maxlen = 17, tidLists = TRUE))
dim(tidLists(f))
as(tidLists(f), "list")
Could it be due to the data structure? In that case, how should I change it? Furthermore, what do I do to get the suggested itemsets? I couldn't figure that out from the wiki.
EDIT: I used 0.004 for supp, as suggested by @hpesoj626. But it seems like the function is grouping the orders/users and not the items. I don't know how to export the data, so here is a picture of the tidLists:
The problem is that you have set your support too high. Try lowering supp, say to supp = .001, for which we get
dim(tidLists(f))
# [1] 928 15840
For your data set, the highest support is 0.08239 which is below 0.1. That is why you are getting no results with supp = 0.1.
inspect(head(sort(f, by = "support"), 10))
# items support count
# [1] {3060620} 0.08239 1305
# [2] {3060619} 0.07260 1150
# [3] {3061124} 0.05688 901
# [4] {3060618} 0.05663 897
# [5] {4027039} 0.04975 788
# [6] {3060617} 0.04564 723
# [7] {3061697} 0.04306 682
# [8] {3060619,3060620} 0.03087 489
# [9] {3039715} 0.02727 432
# [10] {3045117} 0.02708 429
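To look only at the multi-item patterns (the "bought together" part), you can subset the itemsets by size; a small sketch using arules' size():
# keep itemsets with at least two items and show the most frequent ones
pairs <- f[size(f) >= 2]
inspect(head(sort(pairs, by = "support"), 10))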
I am trying to fetch text from an anchor tag, which is embedded in a div tag. The following is the link to the website: http://mmb.moneycontrol.com/forum-topics/stocks-1.html
The text I want to extract is Mawana Sugars
So I want to extract all the stock names listed on this website along with their descriptions.
Here is my attempt to do it in R
library(XML)
doc <- htmlParse("http://mmb.moneycontrol.com/forum-topics/stocks-1.html")
xpathSApply(doc, "//div[@class='clearfix PR PB5']//text()", xmlValue)
But, it does not return anything. How can I do it in R?
My answer is essentially the same as the one I just gave here.
The data is dynamically loaded and cannot be retrieved directly from the HTML. But, looking at "Network" in Chrome DevTools for instance, we can find a nicely formatted JSON at http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1
To get you started:
library(jsonlite)
dat <- fromJSON("http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=1")
Output looks like:
dat[1:3, c("msg_id", "user_id", "topic", "heading", "flag", "price", "message")]
# msg_id user_id topic heading flag
# 1 47730730 liontrade NMDC Stocks APR
# 2 47730726 agrawalknath Glenmark Glenmark APR
# 3 47730725 bissy91 Infosys Stocks APR
# price
# 1 Price when posted : BSE: Rs. 127.90 NSE: Rs. 128.15
# 2 Price when posted : NSE: Rs. 714.10
# 3 Price when posted : BSE: Rs. 956.50 NSE: Rs. 955.00
# message
# 1 There is no mention of dividend in the announcement.
# 2 Eagerly Waiting for 670 to 675 to BUY second phase of Buying in Cash Delivery. Already Holding # 800.
# 3 6 ✂ ✂--Don t Pay High Brokerage While Trading. Take Delivery Free & Rs 20 to trade in any size - Join Today .👉 goo.gl/hDqLnm
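If you need more than the first page, the pgno query parameter looks like a page counter; a hedged sketch (the parameter name is taken from the URL above and the site's paging behaviour is not verified):
library(jsonlite)
base_url <- "http://mmb.moneycontrol.com/index.php?q=topic/ajax_call&section=get_messages&offset=&lmid=&isp=0&gmt=cat_lm&catid=1&pgno=%d"
# fetch the first three pages and stack them (assumes each page returns the same columns)
pages <- lapply(1:3, function(p) fromJSON(sprintf(base_url, p)))
dat_all <- do.call(rbind, pages)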
I have Twitter data. Using library(stringr), I have extracted all the weblinks. However, when I try to do the same for hashtags, I get an error. The same code had worked some days ago. The following is the code:
library(stringr)
hash <- "#[a-zA-Z0-9]{1, }"
hashtag <- str_extract_all(travel$texts, hash)
The following is the error:
Error in stri_extract_all_regex(string, pattern, simplify = simplify, :
Error in {min,max} interval. (U_REGEX_BAD_INTERVAL)
I have re-installed the stringr package, but it doesn't help.
The code that I used for weblink is:
pat1 <- "http://t.co/[a-zA-Z0-9]{1,}"
twitlink <- str_extract_all(travel$texts, pat1)
A reproducible example is as follows:
rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))
Your problem comes from the whitespace inside the {1, } quantifier in the hash pattern:
# Not working (note the whitespace after the comma)
str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1, }")
# Working
str_extract_all(rtt$texts, "#[a-zA-Z0-9]{1,}")
You may want to consider using the qdapRegex package that I maintain for this task. It makes extracting URLs and hashtags easy. qdapRegex contains a bunch of canned regexes and uses the amazing stringi package as a back end to do the regex work.
rtt <- structure(data.frame(texts = c("Review Anthem of the Seas Anthems maiden voyage httptcoLPihj2sNEP #stevenewman", "#Job #Canada #Marlin Travel Agentagente de voyages Full Time in #St Catharines ON httptconMHNlDqv69", "Experience #Fiji amp #NewZealand like never before on a great 10night voyage 4033 pp departing Vancouver httptcolMvChSpaBT"), source = c("Twitter Web Client", "Catch a Job Canada", "Hootsuite"), tweet_time = c("2015-05-07 19:32:58", "2015-05-07 19:37:03", "2015-05-07 20:45:36")))
library(qdapRegex)
## first combine the built in url + twitter regexes into a function
rm_twitter_n_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)
rm_twitter_n_url(rtt$texts)
rm_hash(rtt$texts, extract=TRUE)
Giving the following output:
## > rm_twitter_n_url(rtt$texts)
## [[1]]
## [1] "httptcoLPihj2sNEP"
##
## [[2]]
## [1] "httptconMHNlDqv69"
##
## [[3]]
## [1] "httptcolMvChSpaBT"
## > rm_hash(rtt$texts, extract=TRUE)
## [[1]]
## [1] "#stevenewman"
##
## [[2]]
## [1] "#Job" "#Canada" "#Marlin" "#St"
##
## [[3]]
## [1] "#Fiji" "#NewZealand"