I am working on saving Twitter search results into a database (SQL Server) and am getting an error when I pull the search results from twitteR.
If I execute:
library(twitteR)
puppy <- as.data.frame(searchTwitter("puppy", session=getCurlHandle(),num=100))
I get an error of:
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class structure("status", package = "twitteR") into a data.frame
This is important because in order to use RODBC to add this to a table using sqlSave it needs to be a data.frame. At least that's the error message I got:
Error in sqlSave(localSQLServer, puppy, tablename = "puppy_staging", :
should be a data frame
So does anyone have any suggestions on how to coerce the list to a data.frame or how I can load the list through RODBC?
My final goal is to have a table that mirrors the structure of values returned by searchTwitter. Here is an example of what I am trying to retrieve and load:
library(twitteR)
puppy <- searchTwitter("puppy", session=getCurlHandle(),num=2)
str(puppy)
List of 2
$ :Formal class 'status' [package "twitteR"] with 10 slots
.. ..@ text : chr "beautifull and kc reg Beagle Mix for rehomes: This little puppy is looking for a new loving family wh... http://bit.ly/9stN7V "| __truncated__
.. ..@ favorited : logi FALSE
.. ..@ replyToSN : chr(0)
.. ..@ created : chr "Wed, 16 Jun 2010 19:04:03 +0000"
.. ..@ truncated : logi FALSE
.. ..@ replyToSID : num(0)
.. ..@ id : num 1.63e+10
.. ..@ replyToUID : num(0)
.. ..@ statusSource: chr "<a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>"
.. ..@ screenName : chr "puppy_ads"
$ :Formal class 'status' [package "twitteR"] with 10 slots
.. ..@ text : chr "the cutest puppy followed me on my walk, my grandma won't let me keep it. taking it to the pound sadface"
.. ..@ favorited : logi FALSE
.. ..@ replyToSN : chr(0)
.. ..@ created : chr "Wed, 16 Jun 2010 19:04:01 +0000"
.. ..@ truncated : logi FALSE
.. ..@ replyToSID : num(0)
.. ..@ id : num 1.63e+10
.. ..@ replyToUID : num(0)
.. ..@ statusSource: chr "<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry®</a>"
.. ..@ screenName : chr "iamsweaters"
So I think the data.frame of puppy should have column names like:
- text
- favorited
- replyToSN
- created
- truncated
- replyToSID
- id
- replyToUID
- statusSource
- screenName
I use this code I found from http://blog.ouseful.info/2011/11/09/getting-started-with-twitter-analysis-in-r/ a while ago:
#get data
tws<-searchTwitter('#keyword',n=10)
#make data frame
df <- do.call("rbind", lapply(tws, as.data.frame))
#write to csv file (or your RODBC code)
write.csv(df,file="twitterList.csv")
I know this is an old question, but here is what I think is a "modern" way to solve this. Just use the function twListToDF:
gvegayon <- getUser("gvegayon")
timeline <- userTimeline(gvegayon,n=400)
tl <- twListToDF(timeline)
Hope it helps
Try this:
library(plyr)  # ldply comes from the plyr package
ldply(searchTwitter("#rstats", n=100), text)
twitteR returns an S4 class, so you need to either use one of its helper functions, or deal directly with its slots. You can see the slots by using unclass(), for instance:
unclass(searchTwitter("#rstats", n=100)[[1]])
These slots can be accessed directly as I do above by using the related functions (from the twitteR help: ?statusSource):
text Returns the text of the status
favorited Returns the favorited information for the status
replyToSN Returns the replyToSN slot for this status
created Retrieves the creation time of this status
truncated Returns the truncated information for this status
replyToSID Returns the replyToSID slot for this status
id Returns the id of this status
replyToUID Returns the replyToUID slot for this status
statusSource Returns the status source for this status
As I mentioned, it's my understanding that you will have to specify each of these fields yourself in the output. Here's an example using two of the fields:
> head(ldply(searchTwitter("#rstats", n=100),
function(x) data.frame(text=text(x), favorited=favorited(x))))
text
1 #statalgo how does that actually work? does it share mem between #rstats and postgresql?
2 #jaredlander Have you looked at PL/R? You can call #rstats from PostgreSQL: http://www.joeconway.com/plr/.
3 #CMastication I was hoping for a cool way to keep data in a DB and run the normal #rstats off that. Maybe a translator from R to SQL code.
4 The distribution of online data usage: AT&T has recently announced it will no longer http://goo.gl/fb/eTywd #rstat
5 #jaredlander not that I know of. Closest is sqldf package which allows #rstats and sqlite to share mem so transferring from DB to df is fast
6 #CMastication Can #rstats run on data in a DB?Not loading it in2 a dataframe or running SQL cmds but treating the DB as if it wr a dataframe
favorited
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
You could turn this into a function if you intend on doing it frequently.
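As a rough sketch of such a function: twitteR and a live Twitter session are not assumed here, so the hypothetical S4 class DemoStatus stands in for the real status class, and first_or_na and status_to_row are made-up helper names. The idea is to read every slot by name, pad empty slots such as chr(0) with NA, and bind the one-row data frames together.

```r
library(methods)

# Hypothetical stand-in for twitteR's S4 "status" class (only three slots shown).
setClass("DemoStatus", representation(text = "character",
                                      replyToSN = "character",
                                      id = "numeric"))

# Indexing a zero-length vector with [1] yields an NA of the same type,
# so empty slots such as chr(0) still contribute one value per row.
first_or_na <- function(x) if (length(x) == 0) x[1] else x

# Turn one S4 object into a one-row data.frame by reading every slot by name.
status_to_row <- function(s) {
  vals <- lapply(slotNames(s), function(nm) first_or_na(slot(s, nm)))
  names(vals) <- slotNames(s)
  as.data.frame(vals, stringsAsFactors = FALSE)
}

statuses <- list(
  new("DemoStatus", text = "first tweet", replyToSN = character(0), id = 1),
  new("DemoStatus", text = "second tweet", replyToSN = "someone", id = 2)
)

# One row per status, one column per slot.
puppy_df <- do.call(rbind, lapply(statuses, status_to_row))
puppy_df
```

The result is an ordinary data.frame, which is the shape sqlSave expects.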
For those that run into the same problem I did which was getting an error saying
Error in as.double(y) : cannot coerce type 'S4' to vector of type 'double'
I simply changed the word text in
ldply(searchTwitter("#rstats", n=100), text)
to statusText, like so:
ldply(searchTwitter("#rstats", n=100), statusText)
Just a friendly heads-up :P
Here is a nice function to convert it into a DF.
TweetFrame<-function(searchTerm, maxTweets)
{
tweetList<-searchTwitter(searchTerm,n=maxTweets)
return(do.call("rbind",lapply(tweetList,as.data.frame)))
}
Use it as :
tweets <- TweetFrame(" ", n)
The twitteR package now includes a function twListToDF that will do this for you.
puppy_table <- twListToDF(puppy)
I have 4 years experience using R but I am very new to the Big Data game as I always worked on csv files.
It is thrilling to manipulate large amounts of data remotely, but also somewhat frustrating, as simple things you were used to have to be re-engineered.
The task I am struggling with right now is getting a basic 5 figure summary of a variable:
summary(df$X)
Some context: I am connected to Impala, and these lines of code work fine:
library(dbplyr)
localTable <- tbl(con, 'serverTable')
localTable %>% tally()
localTable %>% filter(X > 10) %>% tally()
If I just write
localTable
instead, RStudio gets stuck or takes so long that I kill it with the task manager.
Coming back to my current question, I tried to have a 5 figure summary in these ways:
summary(localTable$X) #returns Length 0, Class NULL, Mode NULL
localTable %>% fivenum(X) #returns Error in rank(x, ties.method = "min", na.last = "keep") : unimplemented type 'list' in 'greater'
I also tried building a custom summary() with summarise:
localTable %>% summarize(Min = min(X),
Q1 = quantile(X, .25),
Avg = mean(X),
Q3 = quantile(X, .75),
Max = max(X))
returns me a SYNTAX ERROR.
My guess is that there is a very trivial missing link between my code and the server in the form of a data structure, but I can't figure out what.
I also tried to save localTable$X to an in-memory variable with
XL <- localTable$X
but I always get a NULL
On the graphical side, using dbplot, if I try
library(dbplot)
localTable %>% dbplot_histogram(X)
I get an empty graphic.
I thought about leveraging the 5 figure summary computed by the boxplot function, along the lines of ggplot_build(object)$data, but with dbplot_boxplot I get the error could not find function "dbplot_boxplot".
I started using dbplyr because I am quite fluent with dplyr and I don't want to write queries in SQL with DBI::dbGetQuery, but feel free to suggest other packages such as implyr or sparklyr, as well as tutorials on the subject at large, as the ones I found are quite basic.
EDIT:
as requested in a comment, I add the result of
str(localTable)
which is
List of 2
$ src:List of 2
..$ con :Formal class 'Impala' [package ".GlobalEnv"] with 4 slots
.. .. ..@ ptr :<externalptr>
.. .. ..@ quote : chr "`"
.. .. ..@ info :List of 15
.. .. .. ..$ dbname : chr "IMPALA"
.. .. .. ..$ dbms.name : chr "Impala"
.. .. .. ..$ db.version : chr "2.9.0-cdh5.12.1"
.. .. .. ..$ username : chr "User"
.. .. .. ..$ host : chr ""
.. .. .. ..$ port : chr ""
.. .. .. ..$ sourcename : chr "impala connector"
.. .. .. ..$ servername : chr "Impala"
.. .. .. ..$ drivername : chr "Cloudera ODBC Driver for Impala"
.. .. .. ..$ odbc.version : chr "03.80.0000"
.. .. .. ..$ driver.version : chr "2.6.11.1011"
.. .. .. ..$ odbcdriver.version : chr "03.80"
.. .. .. ..$ supports.transactions : logi FALSE
.. .. .. ..$ getdata.extensions.any_column: logi TRUE
.. .. .. ..$ getdata.extensions.any_order : logi TRUE
.. .. .. ..- attr(*, "class")= chr [1:3] "Impala" "driver_info" "list"
.. .. ..@ encoding: chr ""
..$ disco: NULL
..- attr(*, "class")= chr [1:4] "src_Impala" "src_dbi" "src_sql" "src"
$ ops:List of 2
..$ x : 'ident' chr "serverTable"
..$ vars: chr [1:157] "X" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:5] "tbl_Impala" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
Not sure if I can dput my table as it is sensitive information
There are quite a few aspects to your post. I am going to try and address the main ones.
(1) What you are calling localTable is not local. What you have is a local access point to a remote table. It is a remote table because the data is stored in the database, rather than in R.
To copy a remote table into local R memory use localTable = collect(remoteTable). Use this carefully. If the table is many GB in the database this will be slow to transfer into R. Also, if you collect a database table that is bigger than the RAM available to R then you will receive an out of memory error.
I recommend using collect for moving summary results into R. Do the processing and summarizing in the database and just fetch the results into R. Alternatively, use remoteTable %>% head(20) %>% collect() to copy just the first 20 rows into R.
(2) The tableName$colname will not work for remote tables. In R the $ notation lets you access a named component of a list. Data.frames are a special kind of list. If you try data(iris) followed by names(iris) you will get the column names of iris. Any of these can be accessed using iris$ followed by the column name, for example iris$Species.
However as your str(localTable) shows, localTable is a list of length 2 with the first named item src. If you call names(localTable) then you will receive two names back, the first of which is src. This means you can call localTable$src (and as localTable$src is also a list you can also call localTable$src$con).
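The behaviour of $ described here can be reproduced with plain lists. In this small sketch, remote_like is a hypothetical stand-in that only mimics the top-level shape of the str() output; it is not a real database table.

```r
data(iris)
names(iris)               # column names of a local data frame
head(iris$Sepal.Length)   # $ extracts a column because iris is a list of columns

# Hypothetical stand-in with the same top-level components as a lazy remote table.
remote_like <- list(
  src = list(con = "connection goes here"),
  ops = list(x = "serverTable", vars = c("X", "Y"))
)
names(remote_like)        # "src" "ops"
remote_like$X             # NULL: the list has no component named X
summary(remote_like$X)    # summary of NULL, as reported in the question
```

Because X only appears as a name inside ops$vars, the $ operator has nothing to return at the top level.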
When working with dbplyr R translates data manipulation commands into the database language. There are translations defined for most dplyr commands, but there are not translations defined for all R commands.
So the recommended approach to access just a specific column is using select from dplyr:
local_copy_of_just_one_column = remoteTable %>%
select(required_column) %>%
collect()
(3) You have the right approach with a custom summary function. This is the best approach for producing the five figure summary without pulling the data into local memory (RAM).
One possible cause of the syntax error is that you may have used R commands that do not have a translation into your database language.
You can check whether a command has translations defined using translate_sql. I recommend you try
library(dbplyr)
translate_sql(quantile(colname, 0.25))
to see what the translation looks like.
You can view the translation of an entire table manipulation using show_query. This is my go-to approach when debugging SQL translation. Try:
localTable %>%
summarize(Min = min(X),
Q1 = quantile(X, .25),
Avg = mean(X),
Q3 = quantile(X, .75),
Max = max(X)) %>%
show_query()
If this does not produce valid SQL then executing the command will error.
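Translations can also be inspected without touching the server. The sketch below assumes the dplyr and dbplyr packages are installed and uses lazy_frame() with simulate_impala(), which only select the SQL dialect and never connect anywhere. The column values and the names x_min, x_avg and x_max are made up for illustration, and quantile is left out because whether it translates depends on the backend and dbplyr version.

```r
library(dplyr)
library(dbplyr)

# lazy_frame() builds a table description that is never sent to a database;
# simulate_impala() only chooses the SQL dialect used for translation.
remote_like <- lazy_frame(X = c(1, 5, 20), con = simulate_impala())

query <- remote_like %>%
  summarise(x_min = min(X, na.rm = TRUE),
            x_avg = mean(X, na.rm = TRUE),
            x_max = max(X, na.rm = TRUE))

# Print the SQL that would be sent to the database.
show_query(query)

# The same SQL as a character string, handy for logging or inspection.
sql_text <- as.character(sql_render(query))
```

Adding one summary at a time to such a query is a cheap way to find which function fails to translate.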
One possible cause is that Min and Max have special meanings in SQL and so might produce odd behavior in your translation.
When I experimented with quantile it looks like it might need an OVER clause in SQL. This is created using group_by. So perhaps you want something like the following:
localSummary = remoteTable %>%
# create dummy column
mutate(ones = 1) %>%
# group to satisfy over clause
group_by(ones) %>%
summarise(var_min = min(var),
var_lq = quantile(var, 0.25),
var_mean = mean(var),
var_uq = quantile(var, 0.75),
var_max = max(var)) %>%
# copy results from database into R memory
collect()
I want to remove all stopwords from my text using R. The list of stopwords that I want to remove can be found at http://www.ranks.nl/stopwords under the section which says "Long Stopword List" (a very long list version). I'm using the tm package. Can anyone help me, please? Thanks!
You can copy that list (after you select it in your browser) and then paste it into this expression in R:
LONGSWS <- " <paste into this position> "
You would place the cursor for your editor or the IDE console device inside the two quotes. Then do this:
sw.vec <- scan(text=LONGSWS, what="")
#Read 474 items
The scan function needs to have the type of input specified via an example given to the what argument, and for that purpose just using "" is sufficient for character types. Then you should be able to apply the code you offered in your comment:
tm_map(text, removeWords, sw.vec)
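To see the scan step in isolation, here is a small sketch that needs no packages. The five words are a stand-in for the pasted list, and the gsub call is only a base-R approximation of whole-word removal, not the tm implementation.

```r
# Short stand-in for the pasted stopword list (the real list is much longer).
LONGSWS <- "a about above
across after"

# what = "" makes scan() return character values;
# spaces and newlines both separate items.
sw.vec <- scan(text = LONGSWS, what = "", quiet = TRUE)
sw.vec

# Approximate whole-word removal with a regular expression.
pattern <- paste0("\\b(", paste(sw.vec, collapse = "|"), ")\\b")
cleaned <- gsub(pattern, "", "a walk across the park after lunch", perl = TRUE)
cleaned
```

The removed words leave their surrounding spaces behind, so a whitespace clean-up step is usually applied afterwards.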
You have not supplied an example text object. Using just a character vector is not successful:
tm_map("test of my text", removeWords, sw.vec )
#Error in UseMethod("tm_map", x) :
# no applicable method for 'tm_map' applied to an object of class "character"
So we will need to assume you have a suitable object of a suitable class to place in the first position of the arguments to tm_map. So using the example from the ?tm_map help page:
> res <- tm_map(crude, removeWords, sw.vec )
> str(res)
List of 20
$ 127:List of 2
..$ content: chr "Diamond Shamrock Corp said \neffective today cut contract prices crude oil \n1.50 dlrs barrel.\n The re"| __truncated__
..$ meta :List of 15
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "1987-02-26 17:00:56"
.. ..$ description : chr ""
.. ..$ heading : chr "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
.. ..$ id : chr "127"
.. ..$ language : chr "en"
.. ..$ origin : chr "Reuters-21578 XML"
.. ..$ topics : chr "YES"
.. ..$ lewissplit : chr "TRAIN"
.. ..$ cgisplit : chr "TRAINING-SET"
# ----------------snipped remainder of long output.
I have a nested list of userAccts and tweets, the structure (in R) is below.
> str(botdetails[[1]][[100]])
Reference class 'status' [package "twitteR"] with 17 fields
$ text : chr "RT #jeremyslevin: 30% of the Bush tax cuts-which wrote the book on giveaways to the rich-went to the 1%. Trump "| __truncated__
$ favorited : logi FALSE
$ favoriteCount: num 0
$ replyToSN : chr(0)
$ created : POSIXct[1:1], format: "2017-09-02 07:59:32"
$ truncated : logi FALSE
$ replyToSID : chr(0)
$ id : chr "903890119945359360"
$ replyToUID : chr(0)
$ statusSource : chr "Twitter for Android"
$ screenName : chr "Monalisazelf"
$ retweetCount : num 252
$ isRetweet : logi TRUE
$ retweeted : logi FALSE
$ longitude : chr(0)
$ latitude : chr(0)
$ urls :'data.frame': 0 obs. of 4 variables:
..$ url : chr(0)
..$ expanded_url: chr(0)
..$ dispaly_url : chr(0)
..$ indices : num(0)
and 53 methods, of which 39 are possibly relevant:
getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet, getLatitude,
getLongitude, getReplyToSID, getReplyToSN, getReplyToUID, getRetweetCount,
getRetweeted, getRetweeters, getRetweets, getScreenName, getStatusSource, getText,
getTruncated, getUrls, initialize, setCreated, setFavoriteCount, setFavorited, setId,
setIsRetweet, setLatitude, setLongitude, setReplyToSID, setReplyToSN, setReplyToUID,
setRetweetCount, setRetweeted, setScreenName, setStatusSource, setText, setTruncated,
setUrls, toDataFrame, toDataFrame#twitterObj
>
My issue is trying to convert the nested lists into a data frame; twListToDF gives me this error:
> twListToDF(botdetails)
Error in twListToDF(botdetails) :
Elements of twList are not of an appropriate class
>
The help page for twListToDF lists status as an appropriate class for the function:
Details
The classes supported by this function are status, user, and directMessage.
Can anyone suggest an effective method of creating an R dataframe from this nested list?
Both the purrr and jsonlite packages have a flatten() function. The jsonlite version would probably work best for your purposes, as I'm assuming the Twitter API returns a JSON object (and purrr::flatten only removes one layer of recursion at a time).
Info here: https://rdrr.io/cran/jsonlite/man/flatten.html
I am assuming that you are trying to extract user information of certain Twitter bot accounts. If you are looking at extracting all this information as a dataframe, try the following:
botdetails <- map(Botlist[1:100], ~twListToDF(lookupUsers(.x)))
botdf <- rbind.fill(botdetails)
Here, Botlist is a character vector containing the names of the Twitter accounts. botdetails will return you a list of dataframes which you can combine by using the rbind.fill() function from the plyr package.
I know this has been asked before and the answer has always been to use sink but I'm getting the prompt written out too and I don't want that. Any way to just store the output to a .txt file and not the prompts?
info(hdr) prints the following to console:
> info(hdr)
DataFrame with 3 rows and 3 columns
Number Type Description
<character> <character> <character>
NS 1 Integer Number of Samples With Data
DP 1 Integer Total Depth
DB 0 Flag dbSNP membership, build 131
I want to send all of info()'s output to a .txt file and these are the commands I'm using
sink("info.txt")
info(hdr)
sink()
What's written in info.txt:
> info(hdr)
DataFrame with 3 rows and 3 columns
Number Type Description
<character> <character> <character>
NS 1 Integer Number of Samples With Data
DP 1 Integer Total Depth
DB 0 Flag dbSNP membership, build 131
> sink()
Why are the commands showing up too? Any way to prevent that?
EDIT: for clarity. The function info() comes from VariantAnnotation package from the following link: http://www.bioconductor.org/help/workflows/variants/. The output of str(hdr) is:
Formal class 'VCFHeader' [package "VariantAnnotation"] with 3 slots
..@ reference: chr(0)
..@ samples : chr "GS06985-1100-37-ASM"
..@ header :Formal class 'SimpleDataFrameList' [package "IRanges"] with 4 slots
.. .. ..@ elementType : chr "DataFrame"
.. .. ..@ elementMetadata: NULL
.. .. ..@ metadata : list()
.. .. ..@ listData :List of 4
Many more lines but I truncated.
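One possible way around this, offered as a sketch rather than a confirmed fix for the VariantAnnotation case, is base R's capture.output(), which writes only what an expression prints. The data frame demo_info below is a made-up stand-in for info(hdr).

```r
# Made-up stand-in for the object returned by info(hdr).
demo_info <- data.frame(Number = c("1", "1", "0"),
                        Type = c("Integer", "Integer", "Flag"),
                        row.names = c("NS", "DP", "DB"))

out_file <- tempfile(fileext = ".txt")

# Only the printed representation goes to the file; the command itself is not echoed.
capture.output(print(demo_info), file = out_file)

written <- readLines(out_file)
written
```

With the real object, the first argument would be print(info(hdr)) or simply info(hdr).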
Whenever I use any sort of HTTP command via the system() function in RStudio, the rainbow circle of death appears and I have to force-quit RStudio. Up until now, I've written a bunch of checks to make sure a user isn't in RStudio before using an HTTP command (which I use a ton to access data), but it's quite a pain, and it would be fantastic to get to the root of the problem.
e.g.
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
causes RStudio to crash. Oddly, on another laptop of mine, such commands don't crash RStudio but cause the following error: 'sh: http: command not found', even though http is installed and works fine when using the terminal.
Does anybody know how to fix this problem / why it happens / does it occur for you guys too? Although I know a lot about R, I'm afraid I have no idea how to try to fix this problem.
Thanks!!!
Using http from the httpie package hangs RStudio (but not plain terminal R) on my Linux system (your rainbow circle implies it's a Mac?), so I'm getting the same behaviour as you.
Installing and using wget works for me:
system("wget -O /tmp/data.out http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
Or you could try R's native download.file function. There's a whole bunch of other functions for getting stuff off the web - see the Web Task View http://cran.r-project.org/web/views/WebTechnologies.html
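A minimal sketch of the download.file route follows. So that it runs without network access, it fetches a small local file through a file:// URL on a Unix-like system; src, dest and the JSON text are made up, and for the real API the url would be the http:// address used above.

```r
# Stand-in for a remote resource: a small local file addressed through a file:// URL.
src <- tempfile(fileext = ".json")
writeLines('{"status":"REQUEST_SUCCEEDED"}', src)
url <- paste0("file://", normalizePath(src, winslash = "/"))

dest <- tempfile(fileext = ".out")

# download.file() returns 0 (invisibly) when the transfer succeeds.
status <- download.file(url, destfile = dest, quiet = TRUE)
status
readLines(dest)
```

Because no external program is launched, nothing is left waiting on standard input.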
I've not seen this http command used much, so maybe it's flaky. Or maybe it's opening stdin...
Yes... Try this:
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M >/tmp/data2.out </dev/null" )
I think http is opening stdin, the Unix standard input channel, and RStudio isn't sending anything to it, so it waits. If you explicitly assign http's stdin as /dev/null then http completes. This works for me in RStudio.
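The effect of the redirect can be tried without httpie. In this sketch for Unix-like systems, cat stands in for any program that reads standard input; it is an illustration of the redirect, not a diagnosis of what http does internally.

```r
# Unix-like systems only. cat copies stdin to stdout until end-of-file.
# With </dev/null it sees end-of-file immediately and returns at once.
out <- system("cat </dev/null", intern = TRUE)
out        # character(0): the command finished and printed nothing

# The exit status can be checked the same way; output is discarded here.
exit_code <- system("cat </dev/null >/dev/null")
exit_code
```

Without the redirect, the same command would keep waiting for input that the calling process never sends.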
However, I still prefer wget or curl-based solutions!
Without more contextual information regarding RStudio version / operating system it is hard to do more than suggest an alternative approach that avoids the use of system().
Instead you could use RCurl and getURL
library(RCurl)
getURL('http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M')
#[1] "{\"status\":\"REQUEST_SUCCEEDED\",\"responseTime\":129,\"message\":[],\"Results\":{\n\"series\":\n[{\"seriesID\":\"CXUALCBEVGLB0101M\",\"data\":[{\"year\":\"2013\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"445\",\"footnotes\":[{}]},{\"year\":\"2012\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"451\",\"footnotes\":[{}]},{\"year\":\"2011\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"456\",\"footnotes\":[{}]}]}]\n}}"
You could also use PUT, GET, POST, etc directly in R, abstracted from RCurl by the httr package:
library(httr)
tmp <- GET("http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
dat <- content(tmp, as="parsed")
str(dat)
## List of 4
## $ status : chr "REQUEST_SUCCEEDED"
## $ responseTime: num 27
## $ message : list()
## $ Results :List of 1
## ..$ series:'data.frame': 1 obs. of 2 variables:
## .. ..$ seriesID: chr "CXUALCBEVGLB0101M"
## .. ..$ data :List of 1
## .. .. ..$ :'data.frame': 3 obs. of 5 variables:
## .. .. .. ..$ year : chr [1:3] "2013" "2012" "2011"
## .. .. .. ..$ period : chr [1:3] "A01" "A01" "A01"
## .. .. .. ..$ periodName: chr [1:3] "Annual" "Annual" "Annual"
## .. .. .. ..$ value : chr [1:3] "445" "451" "456"
## .. .. .. ..$ footnotes :List of 3
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables