How to create a multi-line timeline graph of tweets posted - r

I have a dataset of 56,040 tweets in R called 'tweets', collected in the week following the Roe v. Wade announcement. I'm attempting to analyze them using sentiment analysis scores. I have three columns:
'stance' = included for each tweet - either 'life' or 'choice', depending on which stance the tweet takes.
'POSIX' = the timestamp of when the tweet was posted, currently in YYYY-MM-DD-HH-MM-SS format.
'score' = the sentiment score for each tweet, ranging from around -10 to 10.
I've tried various ways without success and honestly don't know what I'm doing, but I figure this can't be that difficult. I'm attempting to create a line graph with two lines (one for 'life' and another for 'choice') over the course of the timeframe (right after midnight on the 22nd until 11:59 on the 3rd), showing the average sentiment score of tweets by hour, controlling for the number of tweets that were sent out in that hour. Any suggestions?
So far I've made various ggplot and plotly attempts with no success. Pls help lol
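A minimal sketch of one way to do this with dplyr and ggplot2, assuming 'tweets' has the three columns described above and that lubridate can parse the POSIX column (if it is already POSIXct, drop the ymd_hms() call). Only the data-frame and column names come from the question; everything else is illustrative:
library(dplyr)
library(ggplot2)
library(lubridate)

hourly <- tweets %>%
  mutate(hour = floor_date(ymd_hms(POSIX), unit = "hour")) %>%   # bin each tweet into its hour
  group_by(stance, hour) %>%
  summarise(mean_score = mean(score, na.rm = TRUE),              # hourly average per stance
            n_tweets   = n(),                                    # hourly tweet volume
            .groups = "drop")

ggplot(hourly, aes(x = hour, y = mean_score, colour = stance)) +
  geom_line() +
  labs(x = "Hour", y = "Mean sentiment score", colour = "Stance")
Averaging within each hour already normalises for how many tweets were posted in that hour; n_tweets is kept in case you also want to show volume (for example in a second panel or mapped to line alpha).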

Related

How to make a Line Chart with multiple Series where one series has null values

I have a line chart with 3 series: the first shows cumulative sales for last week, the second shows daily sales for this week, and the third shows cumulative sales for this week.
The chart shows the days of the week along the bottom.
Last week's cumulative sales show fine, as there is data for each day of the week.
For this week's daily and cumulative sales I don't want a point drawn if the day hasn't happened yet.
I loop through the days of the week, and if a day hasn't happened yet I want to assign null instead of 0, so that the lines for these series show nothing for days that haven't happened yet.
Does anyone have advice on this point?
After much looking I found out how to do this. It was very easy and is described in the documentation below from shieldui.com:
Example using null in dataSeries
I had earlier tried all manner of empty strings and "null" value strings; in the end it was as simple as setting that data member to null, like this, and shieldui handled it correctly:
dailyincome = null;

Converting (parsing?) character string of Facebook fancount data using R

I need to extract Facebook country likes for a number of brands. My problem is I don't even know where to start - I have spent the past 4 hours searching, but to be honest I'm not even sure what to search for. I can get the data but am struggling to convert it into a usable format in R for time series analysis.
Any assistance would be gratefully accepted
The data I'm retrieving via the Facebook Graph API for likes by country (for Coca-Cola) is as follows:
[1] "{\"data\":[{\"id\":\"1542920115951985\/insights\/page_fans_country\/lifetime\",\"name\":\"page_fans_country\",\"period\":\"lifetime\",\"values\":[{\"value\":{\"BR\":17270087,\"US\":13567311,\"MX\":5674950,\"AR\":3616300,\"FR\":3409959,\"IN\":2949669,\"GB\":2670260,\"TH\":2657306,\"IT\":2401621,\"CO\":1946677,\"ID\":1921076,\"EG\":1805233,\"PK\":1665707,\"PH\":1614358,\"TR\":1607936,\"CL\":1504917,\"VN\":1384143,\"DE\":1312448,\"PL\":1201112,\"VE\":1084783,\"CA\":990114,\"RO\":932538,\"EC\":856116,\"PE\":815942,\"ES\":790320,\"AU\":759775,\"MA\":578003,\"TN\":515510,\"RS\":476986,\"NG\":476934,\"PT\":469059,\"MY\":435316,\"BE\":431930,\"ZA\":431509,\"IQ\":354145,\"SE\":352331,\"KE\":342997,\"GR\":333749,\"HU\":333281,\"NL\":330307,\"GT\":326328,\"CR\":304006,\"DZ\":300497,\"PR\":287430,\"DO\":278847},\"end_time\":\"2015-01-01T08:00:00+0000\"},{\"value\":{\"BR\":17270151,\"US\":13566624,\"MX\":5675012,\"AR\":3618242,\"FR\":3409837,\"IN\":2949969,\"GB\":2669934,\"TH\":2658044,\"IT\":2401726,\"CO\":1946797,\"ID\":1921156,\"EG\":1805337,\"PK\":1665824,\"PH\":1614402,\"TR\":1608104,\"CL\":1504979,\"VN\":1384782,\"DE\":1312138,\"PL\":1201212,\"VE\":1084776,\"CA\":990093,\"RO\":932788,\"EC\":856129,\"PE\":816002,\"ES\":790385,\"AU\":759775,\"MA\":578080,\"TN\":518210,\"RS\":477264,\"NG\":476965,\"PT\":469177,\"MY\":435296,\"ZA\":433741,\"BE\":431908,\"IQ\":364228,\"SE\":352267,\"KE\":343007,\"GR\":333771,\"HU\":333312,\"NL\":330232,\"GT\":326513,\"CR\":304021,\"DZ\":300587,\"PR\":287432,\"DO\":278892},\"end_time\":\"2015-01-02T08:00:00+0000\"}],\"title\":\"Lifetime Likes by Country\",\"description\":\"Lifetime: Aggregated Facebook location data, sorted by country, about the people who like your Page. (Unique Users)\"}],\"paging\":{\"previous\":\"https:\/\/graph.facebook.com\/cocacola\/insights\/page_fans_country?access_token=EAACEdEose0cBAMLTB1Ufx44l8Q2hT34jxjmVjONPzhqncAvv985cUXOY6Q9FZBLuL3OM8oLXDPTBroD5DY8SS9ZBd1OIhSAMwjrISQRgWh5kkJVu75Ss7aWESIlKrwBLyLt6VYHUEUlUlmCV72TSQGZBkkOeE4OaZA4gvHIZBngZDZD&since=1419897600&until=1420070400\",\"next\":\"https:\/\/graph.facebook.com\/cocacola\/insights\/page_fans_country?access_token=EAACEdEose0cBAMLTB1Ufx44l8Q2hT34jxjmVjONPzhqncAvv985cUXOY6Q9FZBLuL3OM8oLXDPTBroD5DY8SS9ZBd1OIhSAMwjrISQRgWh5kkJVu75Ss7aWESIlKrwBLyLt6VYHUEUlUlmCV72TSQGZBkkOeE4OaZA4gvHIZBngZDZD&since=1420243200&until=1420416000\"}}"
This data covers two days. The data I need to retrieve starts at line 3 (BR has 17270087 fans for Coca-Cola) and ends on line 10 (DO has 278847 fans), plus the date (indicated by the end_time of 2015-01-01). I then need to repeat the extract for lines 12 to 19, plus the end_time of 2015-01-02, for each of the country references. Ideally I also want to capture the Facebook ID on line 2 (1542920115951985) so I can build a data frame with Facebook ID, Date, Country and Likes in each record.
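Not the only way to do it, but here is a rough sketch with jsonlite, assuming the raw JSON string shown above is stored in a character vector called resp (a name made up for illustration):
library(jsonlite)

parsed  <- fromJSON(resp, simplifyVector = FALSE)
insight <- parsed$data[[1]]                              # the page_fans_country insight object

fan_counts <- do.call(rbind, lapply(insight$values, function(v) {
  data.frame(
    facebook_id = sub("/insights.*$", "", insight$id),   # "1542920115951985"
    date        = as.Date(substr(v$end_time, 1, 10)),    # e.g. "2015-01-01"
    country     = names(v$value),                        # "BR", "US", ...
    likes       = unlist(v$value),                       # fan counts
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}))

head(fan_counts)   # one row per Facebook ID / date / country / likes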

Generating "Hovmöller" style diagram from dataset with gaps in R

What I have is data in a tab delimited txt file in the following format (http://pastebin.com/XN3y9Wek):
Date Time Flow (L/h)
...
6/10/15 05:19:05 -0.175148624605041
6/10/15 05:34:05 -0.170297042615798
...
7/10/15 07:34:08 -0.033833540932291
7/10/15 07:49:08 -0.0256913011453011
...
The data currently ranges from 6/10/15 to 22/11/15. Measurements occur approximately every 15 minutes, but sometimes there is data loss, which means there is not the same number of data points for every day. There are also periods with a larger gap (for example the evening of 16/11 -> the morning of 17/11) due to logger malfunction.
From this data I would like to create a figure similar to this one, as it offers a very nice seasonal representation of a large amount of data (my full dataset spans several years):
It's similar in style to a Hovmöller diagram. I have tried experimenting with R and the lattice package, but I struggle with the data gaps in my datasets and the irregular number of data points per day.
Any help you can offer me, an R beginner, would be greatly appreciated!
(If it would be possible in PHP or Javascript, feel free to post this as well)
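As a starting point, here is a sketch in R (using ggplot2's geom_tile rather than lattice, purely as one option), under the assumption that the file looks like the pastebin excerpt: tab-delimited with Date in d/m/y, Time in H:M:S and a numeric Flow column; the file name is made up. Binning each reading into its 15-minute slot of the day means days with missing readings simply leave blank cells rather than breaking the plot:
library(ggplot2)

flow <- read.delim("flow.txt", header = TRUE, stringsAsFactors = FALSE)
names(flow) <- c("Date", "Time", "Flow")   # tidy up the "Flow (L/h)" header

flow$Date <- as.Date(flow$Date, format = "%d/%m/%y")
# seconds since midnight, rounded down to the 15-minute slot, expressed in hours
secs      <- as.numeric(as.difftime(flow$Time, format = "%H:%M:%S", units = "secs"))
flow$Slot <- floor(secs / 900) * 0.25

ggplot(flow, aes(x = Date, y = Slot, fill = Flow)) +
  geom_tile() +
  scale_y_continuous("Hour of day", breaks = seq(0, 24, 4)) +
  scale_fill_gradient2(name = "Flow (L/h)")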

How to download intraday stock market data with R

All,
I'm looking to download stock data either from Yahoo or Google on 15 - 60 minute intervals for as much history as I can get. I've come up with a crude solution as follows:
library(RCurl)
# 15-minute bars (i = 900 seconds) for up to 1000 days of AAPL
tmp <- getURL('https://www.google.com/finance/getprices?i=900&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL')
tmp <- strsplit(tmp,'\n')
tmp <- tmp[[1]]                       # one element per line of the response
tmp <- tmp[-c(1:8)]                   # drop the header block
tmp <- strsplit(tmp,',')              # split each line into its fields
tmp <- do.call('rbind',tmp)           # stack into a character matrix
tmp <- apply(tmp,2,as.numeric)        # convert columns to numeric ('a...' timestamp rows become NA)
tmp <- tmp[!apply(tmp,1,function(x) any(is.na(x))),]   # drop rows containing NAs
Given the amount of data I'm looking to import, I worry that this could be computationally expensive. I also don't, for the life of me, understand how the time stamps are coded by Yahoo and Google.
So my question is twofold--what's a simple, elegant way to quickly ingest data for a series of stocks into R, and how do I interpret the time stamping on the Google/Yahoo files that I would be using?
I will try to answer the timestamp question first. Please note this is my interpretation and I could be wrong.
Using the link in your example, https://www.google.com/finance/getprices?i=900&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL, I get the following data:
EXCHANGE%3DNASDAQ
MARKET_OPEN_MINUTE=570
MARKET_CLOSE_MINUTE=960
INTERVAL=900
COLUMNS=DATE,CLOSE,HIGH,LOW,OPEN,VOLUME
DATA=
TIMEZONE_OFFSET=-300
a1357828200,528.5999,528.62,528.14,528.55,129259
1,522.63,528.72,522,528.6499,2054578
2,523.11,523.69,520.75,522.77,1422586
3,520.48,523.11,519.6501,523.09,1130409
4,518.28,520.579,517.86,520.34,1215466
5,518.8501,519.48,517.33,517.94,832100
6,518.685,520.22,518.63,518.85,565411
7,516.55,519.2,516.55,518.64,617281
...
...
Note the first value of the first column, a1357828200; my intuition was that this has something to do with POSIXct. Hence a quick check:
> as.POSIXct(1357828200, origin = '1970-01-01', tz='EST')
[1] "2013-01-10 14:30:00 EST"
So my intuition seems to be correct, but the time seems to be off. Now we have one more piece of info in the data: TIMEZONE_OFFSET=-300. So if we offset our timestamps by this amount we should get:
as.POSIXct(1357828200-300*60, origin = '1970-01-01', tz='EST')
[1] "2013-01-10 09:30:00 EST"
Note that I didn't know which day's data you had requested, but a quick check on Google Finance reveals those were indeed the price levels on 10 Jan 2013.
The remaining values in the first column seem to be some sort of offset from the first row's value.
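A small sketch of that reading: a row whose first field starts with 'a' carries a full Unix timestamp, and the bare integers on the following rows are multiples of INTERVAL (900 seconds here) added to the most recent 'a' value. The three example rows are copied from the output above; the rest is just my interpretation:
rows <- c("a1357828200,528.5999,528.62,528.14,528.55,129259",
          "1,522.63,528.72,522,528.6499,2054578",
          "2,523.11,523.69,520.75,522.77,1422586")
interval <- 900                                    # INTERVAL from the header block

first_field <- sapply(strsplit(rows, ","), `[`, 1)
is_base     <- grepl("^a", first_field)            # rows carrying a full timestamp
base        <- as.numeric(sub("^a", "", first_field[is_base]))[cumsum(is_base)]
offset      <- numeric(length(first_field))
offset[!is_base] <- as.numeric(first_field[!is_base])

unix_time <- base + offset * interval
# shown in UTC; shift by TIMEZONE_OFFSET (in minutes) for exchange-local time, as above
as.POSIXct(unix_time, origin = "1970-01-01", tz = "UTC")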
So downloading and standardizing the data ended up being much more of a bear than I figured it would--about 150 lines of code. The problem is that while Google provides the past 50 trading days of data for all exchange-traded stocks, the time stamps within the days are not standardized: an index of '1', for example, could refer to either the first or the second time increment on the first trading day in the data set. Even worse, stocks that trade at low volumes only have entries where a transaction is recorded. For a high-volume stock like AAPL that's no problem, but for low-volume small caps it means that your series will be missing much if not most of the data. This was problematic because I need all the stock series to lie neatly on top of each other for the analysis I'm doing.
Fortunately, there is still a general structure to the data. Using this link:
https://www.google.com/finance/getprices?i=1800&p=1000d&f=d,o,h,l,c,v&df=cpct&q=AAPL
and changing the stock ticker at the end will give you the past 50 trading days in half-hourly increments. POSIX time stamps, very helpfully decoded by geektrader above, appear in the timestamp column at roughly three-week intervals. The timestamp indexes don't invariably correspond in a convenient 1:1 manner (I almost suspect this was intentional on Google's part), but there is a pattern. For example, for the half-hourly series that I looked at, the first trading day of every three-week increment uniformly has timestamp indexes running in the 1:15 neighborhood. This could be 1:13, 1:14, 2:15--it all depends on the stock. I'm not sure what the 14th and 15th entries are: I suspect they are either daily summaries or after-hours trading info. The point is that there's no consistent pattern you can bank on. The first stamp in a trading day, sadly, does not always contain the opening data, and the same goes for the last entry and the closing data. I found that the only way to know what actually represents the trading data is to compare the numbers to the series on Google Finance.

After days of futilely trying to figure out how to pry a 1:1 mapping pattern from the data, I settled on a "ballpark" strategy. I scraped AAPL's data (a very heavily traded stock) and set its timestamp indexes within each trading day as the reference values for the entire market. All days had a minimum of 13 increments, corresponding to the 6.5-hour trading day, but some had 14 or 15; where this was the case I just truncated by taking the first 13 indexes. From there I used a while loop to progress through the downloaded data of each stock ticker and compare its timestamp indexes within a given trading day to the AAPL timestamps. I kept the overlap, gap-filled the missing data, and cut out the non-overlapping portions.
Sounds like a simple fix, but for low-volume stocks with sparse transaction data there were literally dozens of special cases that I had to bake in, and lots of data to interpolate. I got some pretty bizarre results for a few of these that I know are incorrect. For high-volume, mid- and large-cap stocks, however, the solution worked brilliantly: for the most part the series synced up very neatly with the AAPL data and matched their Google Finance profiles.
There's no way around the fact that this method introduces some error, and I still need to fine-tune it for sparse small-caps. That said, shifting a series by half an hour or gap-filling a single time increment introduces a very minor amount of error relative to the overall movement of the market and the stock. I am confident that the data set I have is "good enough" to let me get relevant answers to some questions that I have. Getting this stuff commercially costs literally thousands of dollars.
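For what it's worth, here is a very rough sketch of the merge-and-fill step of that alignment idea (not the actual 150-line script; ref and stk are hypothetical data frames with columns day, slot and close, where ref holds the 13 AAPL reference slots per trading day):
library(dplyr)
library(zoo)

aligned <- ref %>%                                   # AAPL grid: every day x 13 slots
  select(day, slot) %>%
  left_join(stk, by = c("day", "slot")) %>%          # keep only slots that lie on the grid
  group_by(day) %>%
  mutate(close = na.locf(close, na.rm = FALSE)) %>%  # gap-fill forward within a day
  ungroup()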
Thoughts or suggestions?
Why not load the data from Quandl? E.g.
library(Quandl)
Quandl('YAHOO/AAPL')
Update: sorry, I have just realized that Quandl only fetches daily data - but I'll leave my answer here, as Quandl is really easy to query in similar cases
For the timezone offset, try:
as.POSIXct(1357828200, origin = '1970-01-01', tz=Sys.timezone(location = TRUE))
(The tz will automatically adjust according to your location)

Compare Twitter Hashtag Volume using twitteR package

I would like to use the twitteR package in R to compare the number (count) of mentions of two competing hashtags from 11/14/2012-11/22/2012 (i.e. an 8-day time period). For example, I would like hourly comparisons of two hashtags: #A vs #B.
I am wondering if there is a way to use the twitteR package in R to do this. Something using the searchTwitter function:
searchTwitter(searchString, n=25, lang=NULL, since=NULL, until=NULL,
locale=NULL, geocode=NULL, sinceID=NULL, ...)
I am not interested in grabbing all tweets, just getting an hourly count comparison for #A vs. #B over the specified time period. I know I have to be cognizant of the rate limit and maybe will have to do some clever sampling of tweets to avoid the rate limit. Any ideas if this is feasible, and how I would go about coding it?
I would pull 100 tweets for each hashtag every 2 minutes. Use #TweetsReturned / (TimePulled - TimeOfOldestTweet) to get a tweets-per-unit-time estimate. You can plot these to get a moving-average-style chart of activity over time. If you work in tweets per 2 minutes, just add them up to estimate tweets per hour.
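A rough sketch of that polling idea with twitteR, assuming you have already authenticated with setup_twitter_oauth; the function name estimate_rate and the five-repetition loops are just illustrative:
library(twitteR)

estimate_rate <- function(hashtag, n = 100) {
  pulled_at <- Sys.time()
  tw <- searchTwitter(hashtag, n = n)
  if (length(tw) == 0) return(0)
  created <- sapply(tw, function(t) t$getCreated())             # seconds since epoch
  oldest  <- as.POSIXct(min(created), origin = "1970-01-01", tz = "UTC")
  # tweets per hour = tweets returned / hours spanned by the returned sample
  length(tw) / as.numeric(difftime(pulled_at, oldest, units = "hours"))
}

# poll both hashtags every 2 minutes and keep the estimates for plotting
rates_A <- replicate(5, { Sys.sleep(120); estimate_rate("#A") })
rates_B <- replicate(5, { Sys.sleep(120); estimate_rate("#B") })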
