Convert R JSON Twitter data to list

When using searchTwitteR, I converted the results to a data frame and then exported to JSON. However, all the text ends up on one line, etc. (sample below). I need to separate the output so that each tweet is its own record.
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <- toJSON(phishdf)
write(exportJson, file = "phishdf.json")
json_phishdf <- fromJSON(file="phishdf.json")
I tried converting to a list and am wondering if maybe converting to a data frame is a mistake.
However, for a list, I tried:
newlist['text']=phish[[1]]$getText()
But this will just give me the text for the first tweet. Is there a way to iterate over the entire data set, maybe in a for loop?
{"text":["#ilazer #abbijacobson I do feel compelled to say that I phind phish awphul... sorry, Abbi!","#phish This on-sale was an embarrassment. Something needs to change.","FS: Have 2 Tix To Phish In Chula Vista #Phish #facevaluetickets #phish #facevalue GO: https://t.co/dFdrpyaotp","RT #WKUPhiDelt: Come unwind from a busy week of class and kick off the weekend with a Phish Fry! 4:30-7:30 at the Phi Delt house. Cost is $\u2026","RT #phish: Tickets for Phish's July 15 & 16 shows at The Gorge go on sale in fifteen minutes at 1PM ET: https://t.co/tEKLNjI5u7 https://t.c\u2026"],
"favorited":[false,false,false,false,false],
"favoriteCount":[0,0,0,0,0],
"replyToSN":["rAlexandria","phish","NA","NA","NA"],
"created":[1456521159,1456521114,1456521022,1456521016,1456520988],
"truncated":[false,false,false,false,false],
"replyToSID":["703326502629277696","703304948990222337","NA","NA","NA"],
"id":["703326837720662016","703326646074343424","703326261045829632","703326236722991105","703326119328686080"],
"replyToUID":["26152867","14503997","NA","NA","NA"],"statusSource":["Mobile Web (M5)","Twitter for iPhone","CashorTrade - Face Value Tickets","Twitter for iPhone","Twitter for Android"],
"screenName":["rAlexandria","adamgelvan","CashorTrade","Kyle_Smith1087","timogrennell"],
"retweetCount":[0,0,0,2,5],
"isRetweet":[false,false,false,true,true],
"retweeted":[false,false,false,false,false],
"longitude":["NA","NA","NA","NA","NA"],
"latitude":["NA","NA","NA","NA","NA"]}

I followed your code and don't have the issue you're describing. Are you using library(twitteR) and library(jsonlite)?
Here is the code that works for me:
library(twitteR)
library(jsonlite)
phish <- searchTwitteR('phish', n = 5, lang = 'en')
phishdf <- do.call("rbind", lapply(phish, as.data.frame))
exportJson <- toJSON(phishdf)
write(exportJson, file = "./../phishdf.json")
## note the `txt` argument, as opposed to `file` used in the question
json_phishdf <- fromJSON(txt="./../phishdf.json")
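To address the iteration part of the question: searchTwitteR() returns a list of status objects, so you can extract the text of every tweet rather than just the first one. A minimal sketch, assuming the phish object from the question (tweet_text is an illustrative name):
# pull the text of each tweet from the list of status objects
tweet_text <- sapply(phish, function(status) status$getText())
# twitteR also ships a helper that builds the full data frame in one call
phishdf <- twListToDF(phish)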

Related

Having a problem with ggmap's mapdist() function

I have this code. My Google API key is already set up and registered in R, and the Distance Matrix API has been enabled in the Google Cloud console.
Here is the data frame I have, with 25 random FROM and TO postal codes.
Dataset_test = data.frame(
FROM_POSTAL = c("V8A 0E5","T4G 6M4","V1N 8X3",
"C1B 5G1","R5H 2L4","H9S 8L4","L8E 4Y0","H2Y 7N6",
"K1B 7C0","G4A 5B0","E4P 3T2","E4V 5P4","H3J 1R5",
"G0B 4J7","E7A 6E7","E5B 2Y9","S4H 1T8","A2V 4G5",
"V8L 2A9","T9E 1M5","A5A 5M2","E4T 5B4","S2V 6C4",
"S9H 5P8","B1Y 0V0"),
TO_POSTAL = c("G0J 0B8","N0H 9N4","J9B 4Y4",
"L3Z 2Y7","E8K 4R4","B4P 7X9","S4H 2M0","A1Y 0B8",
"A1W 1E9","P9N 7X1","E4R 4B0","N0P 0M8","E1W 9Y7",
"T9W 8E2","G6X 4S9","A0E 0V4","J5X 7N8","N4N 8A1",
"V9K 0B9","L4G 3H7","E1W 0T2","G5R 9G3","L7C 9S2",
"E8P 2X6","E2A 2M1")
)
Here is the simple script I use to calculate the driving distance between the two postal code columns with Google's Distance Matrix API.
Driving_Distance = mapdist(from = Dataset_test[["FROM_POSTAL"]], to = Dataset_test[["TO_POSTAL"]], mode = c("driving")) %>% distinct()
When I run this, it throws an error when creating Driving_Distance:
Error: Argument 1 is a list, must contain atomic vectors
Your Canadian postal codes do work here. The number of addresses was shortened for the sake of brevity.
A tibble was used instead of a data frame so that the variables are character rather than factor, and the actual Google API key has been replaced with placeholder text.
This was a good mapping question. The working code is below:
library(ggmap)
library(plyr)
library(googleway)
library(tidyverse)
df = tibble(
FROM_POSTAL = c("V8A 0E5","T4G 6M4","V1N 8X3",
"C1B 5G1","R5H 2L4","H9S 8L4"),
TO_POSTAL = c("G0J 0B8","N0H 9N4","J9B 4Y4",
"L3Z 2Y7","E8K 4R4","B4P 7X9"))
dd <- apply(df, 1, function(x){
  # the columns are FROM_POSTAL / TO_POSTAL, so index the row by those names
  google_distance(origins = list(x[["FROM_POSTAL"]]),
                  destinations = list(x[["TO_POSTAL"]]),
                  key = "My_secret_key")
})
dd
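If you would rather stay with ggmap::mapdist(), the error message ("must contain atomic vectors") points at the inputs not being plain character vectors. A minimal sketch, assuming a registered Google API key and the Dataset_test data frame from the question; the as.character() coercion is the assumption here, mirroring the tibble-versus-factor point above:
library(ggmap)
library(dplyr)
# hand mapdist() plain character vectors rather than factors or list-columns
Driving_Distance <- mapdist(from = as.character(Dataset_test$FROM_POSTAL),
                            to   = as.character(Dataset_test$TO_POSTAL),
                            mode = "driving") %>%
  distinct()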

R: Replace all Values that are not equal to a set of values

Hi all, I've been trying to solve a problem on a large data set for some time and could use some of your wisdom.
I have a DF (1.3M obs) with a column called customer along with 30 other columns. Let's say it contains multiple instances of customers Customer1 through Customer3000. I know that I have issues with 30 of those customers. I need to find all the customers that are NOT the customers I have issues with and replace the value in the 'customer' column with the text 'Supported Customer'. That seems like it should be a simple thing; if it weren't for the number of obs, I would have loaded it up in Excel, filtered all the bad customers out, and copy/pasted the text 'Supported Customer' over what remained.
I've tried replace() and str_replace_all() using grepl() and paste()/paste0(), but to no avail. My current code looks like this:
#All the customers that have issues
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127",
"Customer128", ..... , "Customer140")
#Look for everything that is NOT in the list above and replace with "Enabled"
orderData$customer <- str_replace_all(orderData$customer,
                                       paste0("[^", paste(out, collapse = "|"), "]"),
                                       "Enabled Customers")
That code gets me this error:
Error in stri_replace_all_regex(string, pattern, fix_replacement(replacement), :
In a character range [x-y], x is greater than y. (U_REGEX_INVALID_RANGE)
I've tried the inverse of this approach and pulled a list of all obs that don't match the list of out customers. Something like this:
in <- orderData %>% filter(!customer %in% out) %>% select(customer) %>%
distinct(customer)
This gets me a much larger list of customers that ARE enabled (~3,100). Using the str_replace_all and paste approach seems to have issues, though: at this large number of patterns, paste no longer collapses using the "|" operator. Instead I get a string that looks like:
"c(\"Customer1\", \"Customer2345\", \"Customer54\", ......)
When passed into str_replace_all, this does not match any patterns.
Anyways, there's got to be an easier way to do this. Thanks for any/all help.
Here is a data.table approach.
First, some example data since you didn't provide any.
customer <- sample(paste0("Customer",1:300),5000,replace = TRUE)
orderData <- data.frame(customer = sample(paste0("Customer",1:300),5000,replace = TRUE),stringsAsFactors = FALSE)
orderData <- cbind(orderData,matrix(runif(0,100,n=5000*30),ncol=30))
out <- c("Customer123", "Customer124", "Customer125", "Customer126", "Customer127", "Customer128","Customer140")
library(data.table)
setDT(orderData)
result <- orderData[!(customer %in% out),customer := gsub("Customer","Supported Customer ",customer)]
result
customer 1 2 3 4 5 6 7 8 9
1: Supported Customer 134 65.35091 8.57117 79.594166 84.88867 97.225276 84.563997 17.15166 41.87160 3.717705
2: Supported Customer 225 72.95757 32.80893 27.318046 72.97045 28.698518 60.709381 92.51114 79.90031 7.311200
3: Supported Customer 222 39.55269 89.51003 1.626846 80.66629 9.983814 87.122153 85.80335 91.36377 14.667535
4: Supported Customer 184 24.44624 20.64762 9.555844 74.39480 49.189537 73.126275 94.05833 36.34749 3.091072
5: Supported Customer 194 42.34858 16.08034 34.182737 75.81006 35.167769 23.780069 36.08756 26.46816 31.994756
---
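If you don't need data.table, the same replacement can be done with one line of base R; a minimal sketch, assuming the orderData and out objects defined above (note it overwrites the whole value with "Supported Customer" rather than keeping the customer number):
# overwrite every customer that is NOT on the problem list
orderData$customer[!(orderData$customer %in% out)] <- "Supported Customer"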

Is it possible to repeat code in R - but replace certain characters? (needs to be done 350 times)

I am writing a script in R (using RStudio) to scrape analyst share ratings and current share prices from the web:
library(rvest)
BKGURL <- 'http://www.marketbeat.com/stocks/LON/BKG/' #analysts
BKGwebpage <- read_html(BKGURL)
BKGhtml <- html_nodes(BKGwebpage, "td:nth-child(5) , td:nth-child(4) , td:nth-child(3) , td:nth-child(2) , td:nth-child(1)")
BKG <- html_text(BKGhtml) #imports analyst text
BKGprice <- 'http://markets.investorschronicle.co.uk/research/Markets/Companies/Summary?s=BKG:LSE'
BKGpricewebpage <- read_html(BKGprice)
BKGpriceHTML <- html_nodes(BKGpricewebpage, "#wsod td.first")
BKGgbpp <- html_text(BKGpriceHTML) #imports current share price in text
before compiling them into a data frame (code for INN not shown to save space):
Code <- c('BKG', 'INN')
Analysts_Opinion <- c(BKG[2], INN[2])
Consensus <- c(BKG[4], INN[4])
Price_target <- c(BKG[6], INN[6])
Last_rating <- c(BKG[7], INN[7])
Current_price <- c(BKGgbpp[1], INNgbpp[1])
Scrapev1 <- data.frame(Code, Analysts_Opinion, Consensus, Price_target, Last_rating, Current_price)
Scrapev1 then gives
Code Analysts_Opinion Consensus Price_target Last_rating Current_price
1 BKG 2 Sell Rating(s), 6 Hold Rating(s), 8 Buy Rating(s) Hold (Score: 2.38) GBX 3,434.29 7/26/2016 2,650
2 INN 1 Buy Rating(s) Buy (Score: 3.00) GBX 190 2/2/2016 198.00
So the code works fine for importing the data, but I need to repeat/replicate the code in the top panel 350 times, changing "BKG" to each of the 349 other codes in the URLs and object names. Currently I am stumped, as copying and pasting each one would take quite some time; surely there is a quicker way of doing it in R?
Any help or suggestions as to how to tackle this problem would be much appreciated. Apologies if the code is sloppy; I have taught myself R (poorly) using this very website and come from a Pharmacology background, with an interest in technology!
You could do it by building the code as strings and then parsing and evaluating them, but I wouldn't advise that. The best way, in my opinion, is to use lists and names. Something like the following (a fuller sketch follows the loop):
library(rvest)
auxlist <- c('BKG', 'ASD', 'QWE')
URLS <- c() # or list()
webpages <- list()
# etc...
for (comp in auxlist) {
  URLS[[comp]] <- paste0('http://www.marketbeat.com/stocks/LON/', comp, '/')
  webpages[[comp]] <- read_html(URLS[[comp]])
  # etc...
}
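Extending that idea to the question's actual scraping steps, the per-ticker work can live in one function and be applied over all the codes with lapply(), then bound into a single data frame. A minimal sketch, assuming the CSS selectors and element positions from the question still match the two sites (scrape_ticker and the index positions are illustrative, not tested against the live pages):
library(rvest)
scrape_ticker <- function(code) {
  # analyst ratings page for this ticker
  ratings_page <- read_html(paste0('http://www.marketbeat.com/stocks/LON/', code, '/'))
  ratings <- html_text(html_nodes(ratings_page,
    "td:nth-child(5) , td:nth-child(4) , td:nth-child(3) , td:nth-child(2) , td:nth-child(1)"))
  # current price page for this ticker
  price_page <- read_html(paste0('http://markets.investorschronicle.co.uk/research/Markets/Companies/Summary?s=', code, ':LSE'))
  price <- html_text(html_nodes(price_page, "#wsod td.first"))
  data.frame(Code = code,
             Analysts_Opinion = ratings[2],
             Consensus = ratings[4],
             Price_target = ratings[6],
             Last_rating = ratings[7],
             Current_price = price[1],
             stringsAsFactors = FALSE)
}
Scrapev1 <- do.call(rbind, lapply(c('BKG', 'INN'), scrape_ticker))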

Function to iterate over list, merging results into one data frame

I've completed the first couple of R courses on DataCamp, and in order to build up my skills I've decided to use R to prep for fantasy football this season; thus I have begun playing around with the nflscrapR package.
With the nflscrapR package, one can pull game information using the season_games() function, which simply returns a data frame with the game ID, game date, the home and away team abbreviations, and the season.
Example:
games.2012 = season_games(2012)
head(games.2012)
GameID date home away season
1 2012090500 2012-09-05 NYG DAL 2012
2 2012090900 2012-09-09 CHI IND 2012
3 2012090908 2012-09-09 KC ATL 2012
4 2012090907 2012-09-09 CLE PHI 2012
5 2012090906 2012-09-09 NO WAS 2012
6 2012090905 2012-09-09 DET STL 2012
Initially I copied and pasted the original function call and changed the year manually for each season, then rbind-ed all the seasons into one data frame, games.
games.2012 <- season_games(2012)
games.2013 <- season_games(2013)
games.2014 <- season_games(2014)
games.2015 <- season_games(2015)
games = rbind(games.2012, games.2013, games.2014, games.2015)
I'd like to write a function to simplify this process.
My failed attempt:
gameID <- function(years) {
for (i in years) {
games[i] = season_games(years[i])
}
}
With years = list(2012, 2013) for testing purposes, this produced the following:
Error in strsplit(headers, "\r\n") : non-character argument
Called from: strsplit(headers, "\r\n")
Thanks in advance!
While @Gregor has an apparent solution, he didn't run it because this wasn't a minimal reproducible example. I googled, found, and tried to use that code, and it doesn't work, or at least does not finish in a reasonable amount of time.
On the other hand, I took this code from Vivek Patil's blog.
library(XML)
weeklystats = as.data.frame(matrix(ncol = 14)) # Initializing our empty dataframe
names(weeklystats) = c("Week", "Day", "Date", "Blank",
"Win.Team", "At", "Lose.Team",
"Points.Win", "Points.Lose",
"YardsGained.Win", "Turnovers.Win",
"YardsGained.Lose", "Turnovers.Lose",
"Year") # Naming columns
URLpart1 = "http://www.pro-football-reference.com/years/"
URLpart3 = "/games.htm"
#### Our workhorse function ####
getData = function(URLpart1, URLpart3) {
  for (i in 2012:2015) {
    URL = paste(URLpart1, as.character(i), URLpart3, sep = "")
    tablefromURL = readHTMLTable(URL)
    table = tablefromURL[[1]]
    names(table) = c("Week", "Day", "Date", "Blank", "Win.Team", "At", "Lose.Team",
                     "Points.Win", "Points.Lose", "YardsGained.Win", "Turnovers.Win",
                     "YardsGained.Lose", "Turnovers.Lose")
    table$Year = i # Inserting a value for the year
    weeklystats = rbind(table, weeklystats) # Appending happening here
  }
  return(weeklystats)
}
I posted this because it works, you might learn something about web scraping you didn't know, and it runs in about 11 seconds.
system.time(weeklystats <- getData(URLpart1, URLpart3))
user system elapsed
0.870 0.014 10.926
You should probably take a look at some popular answers for working with lists, specifically How do I make a list of data frames? and What's the difference between [ and [[?.
There's no reason to put your years in a list. They're just integers, so just do a normal vector.
years = 2012:2015
Then we can get your function to work (we'll need to initialize an empty list before the for loop):
gameID <- function(years) {
  games = list()
  for (i in seq_along(years)) {
    games[[i]] = season_games(years[i])
  }
  return(games)
}
Read my link above for why we're using [[ with the list and [ with the vector. And we could run it like this:
game_list = gameID(2012:2015)
But this is such a simple function that it's easier to use lapply. Your function is just a wrapper around a for loop that returns a list, and that's precisely what lapply is too. But where your function has season_games hard-coded in, lapply can work with any function.
game_list = lapply(2012:2015, season_games)
# should be the same result as above
In either case, we have the list of data frames and want to combine it into one big data frame. The base R way is rbind with do.call, but dplyr and data.table have more efficient versions.
# pick your favorite
games = do.call(rbind, args = game_list) # base
games = dplyr::bind_rows(game_list)
games = data.table::rbindlist(game_list)
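For completeness, the map-then-bind pattern can also be collapsed into a single call with the purrr package; a minimal sketch, assuming nflscrapR's season_games() is loaded (this is an extra option, not part of the answer above):
library(purrr)
# map over the seasons and row-bind the resulting data frames in one step
games <- map_dfr(2012:2015, season_games)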

linking crsp and compustat in R via WRDS

I am using R to connect to WRDS. Now I would like to link the Compustat and CRSP tables. In SAS, this would be achieved using macros and the CCM link table. What would be the best way to approach this in R?
PROGRESS UPDATE:
I downloaded the CRSP, Compustat, and CCM link tables from WRDS.
sql <- "select * from CRSP.CCMXPF_LINKTABLE"
res <- dbSendQuery(wrds, sql)
ccmxpf_linktable <- fetch(res, n = -1)
ccm.dt <- data.table(ccmxpf_linktable)
rm(ccmxpf_linktable)
I then converted the matching routine suggested in the WRDS event-study SAS file into R:
ccm.dt[, typeflag := linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG == "1"]
setkey(ccm.dt, gvkey, typeflag)
for (i in 1:nrow(compu.dt)) {
  gvkey.comp = compu.dt[i, gvkey]
  endfyr.comp = compu.dt[i, endfyr]
  PERMNO.val <- ccm.dt[.(gvkey.comp, TRUE), ][linkdt <= endfyr.comp & endfyr.comp <= linkenddt, lpermno]
  if (length(PERMNO.val) == 0) PERMNO.val <- NA
  suppressWarnings(compu.dt[i, "PERMNO"] <- PERMNO.val)
}
However, this code is fantastically inefficient. I started out with data.table, but do not really understand how to apply that logic inside the for loop. I am hoping that someone could point me to a way to improve the for loop.
Matching the fields in stages works better; maybe someone will find this useful. Any suggestions for further improvement are of course very welcome!
# filter on ccm.dt
ccm.dt <- ccm.dt[linktype %in% c("LU","LC","LD","LN","LS","LX") & USEDFLAG=="1"]
setkey(ccm.dt, gvkey)
setkey(compu.dt, gvkey)
compu.merged <- merge(compu.dt, ccm.dt, all.x = TRUE, allow.cartesian = TRUE)
# deal with NAs in linkenddt - set NAs to todays date, assuming they still exist.
today <- as.character(Sys.Date())
compu.merged[is.na(linkenddt), "linkenddt":=today]
# filter out date mismatches
compu <- compu.merged[linkdt <= endfyr & endfyr<=linkenddt]
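One further improvement worth noting: recent versions of data.table support non-equi joins, which can express the date-range condition directly in the join instead of a cartesian merge followed by a filter. A minimal sketch, assuming compu.dt carries gvkey and endfyr and ccm.dt carries gvkey, lpermno, linkdt, and linkenddt as above (untested against the actual WRDS tables):
library(data.table)
# join each Compustat row to the link row whose date range covers its endfyr
matched <- ccm.dt[compu.dt,
                  on = .(gvkey, linkdt <= endfyr, linkenddt >= endfyr),
                  nomatch = NA]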
