Five-figure summary with dbplyr - r

I have 4 years of experience using R, but I am very new to the Big Data game as I have always worked on csv files.
It is thrilling to manipulate large amounts of data from a distance, but also somewhat frustrating, as simple things you were used to have to be re-engineered.
The task I am struggling with right now is getting a basic five-figure summary of a variable:
summary(df$X)
Some context: I am connected to Impala, and these lines of code work fine:
library(dbplyr)
localTable <- tbl(con, 'serverTable')
localTable %>% tally()
localTable %>% filter(X > 10) %>% tally()
If I just write
localTable
instead, RStudio gets stuck or takes a very long time, so I kill it with the task manager.
Coming back to my current question, I tried to get a five-figure summary in these ways:
summary(localTable$X) #returns Length 0, Class NULL, Mode NULL
localTable %>% fivenum(X) #returns Error in rank(x, ties.method = "min", na.last = "keep") : unimplemented type 'list' in 'greater'
I also tried building a custom summary() with summarise:
localTable %>% summarize(Min = min(X),
                         Q1 = quantile(X, .25),
                         Avg = mean(X),
                         Q3 = quantile(X, .75),
                         Max = max(X))
returns a SYNTAX ERROR.
My guess is that there is a very trivial missing link between my code and the server, in the form of a data structure, but I can't figure out what it is.
I also tried to save localTable$X to an in-memory variable with
XL <- localTable$X
but I always get NULL.
On the graphical side, using dbplot, if I try
library(dbplot)
localTable %>% dbplot_histogram(X)
I get an empty graphic.
I thought about extracting the five-figure summary from a boxplot, in the spirit of ggplot_build(object)$data so to speak, but with dbplot_boxplot I get the error could not find function "dbplot_boxplot".
I started using dbplyr as I am quite fluent with dplyr and I don't want to write SQL queries with DBI::dbGetQuery, but feel free to suggest other packages such as implyr or sparklyr, as well as tutorials on the subject at large, as the ones I found are quite basic.
EDIT:
As requested in a comment, I am adding the result of
str(localTable)
which is
List of 2
$ src:List of 2
..$ con :Formal class 'Impala' [package ".GlobalEnv"] with 4 slots
.. .. ..# ptr :<externalptr>
.. .. ..# quote : chr "`"
.. .. ..# info :List of 15
.. .. .. ..$ dbname : chr "IMPALA"
.. .. .. ..$ dbms.name : chr "Impala"
.. .. .. ..$ db.version : chr "2.9.0-cdh5.12.1"
.. .. .. ..$ username : chr "User"
.. .. .. ..$ host : chr ""
.. .. .. ..$ port : chr ""
.. .. .. ..$ sourcename : chr "impala connector"
.. .. .. ..$ servername : chr "Impala"
.. .. .. ..$ drivername : chr "Cloudera ODBC Driver for Impala"
.. .. .. ..$ odbc.version : chr "03.80.0000"
.. .. .. ..$ driver.version : chr "2.6.11.1011"
.. .. .. ..$ odbcdriver.version : chr "03.80"
.. .. .. ..$ supports.transactions : logi FALSE
.. .. .. ..$ getdata.extensions.any_column: logi TRUE
.. .. .. ..$ getdata.extensions.any_order : logi TRUE
.. .. .. ..- attr(*, "class")= chr [1:3] "Impala" "driver_info" "list"
.. .. ..# encoding: chr ""
..$ disco: NULL
..- attr(*, "class")= chr [1:4] "src_Impala" "src_dbi" "src_sql" "src"
$ ops:List of 2
..$ x : 'ident' chr "serverTable"
..$ vars: chr [1:157] "X" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:5] "tbl_Impala" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
Not sure if I can dput my table, as it is sensitive information.

There are quite a few aspects to your post. I am going to try and address the main ones.
(1) What you are calling localTable is not local. What you have is a local access point to a remote table. It is a remote table because the data is stored in the database, rather than in R.
To copy a remote table into local R memory use localTable = collect(remoteTable). Use this carefully. If the table is many GB in the database, this will be slow to transfer into R. Also, if you collect a database table that is bigger than the RAM available to R, then you will receive an out-of-memory error.
I recommend using collect for moving summary results into R. Do the processing and summarizing in the database and just fetch the results into R. Alternatively, use remoteTable %>% head(20) %>% collect() to copy just the first 20 rows into R.
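As a sketch of that pattern (the grouping column name here is made up):
grouped_counts <- remoteTable %>%
  group_by(some_grouping_column) %>%   # hypothetical column
  tally() %>%                          # counting happens in the database
  collect()                            # only the small result is copied into R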
(2) The tableName$colname notation will not work for remote tables. In R the $ notation lets you access a named component of a list. Data frames are a special kind of list. If you try data(iris) followed by names(iris) you will get the column names of iris. Any of these can be accessed using iris$column_name.
However as your str(localTable) shows, localTable is a list of length 2 with the first named item src. If you call names(localTable) then you will receive two names back, the first of which is src. This means you can call localTable$src (and as localTable$src is also a list you can also call localTable$src$con).
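A quick illustration of the difference (using the built-in iris data):
data(iris)
names(iris)               # column names of a local data frame
head(iris$Sepal.Length)   # $ works because a data frame is a list of columns
names(localTable)         # for the remote table this returns "src" and "ops", not column names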
When working with dbplyr R translates data manipulation commands into the database language. There are translations defined for most dplyr commands, but there are not translations defined for all R commands.
So the recommended approach to access just a specific column is using select from dplyr:
local_copy_of_just_one_column = remoteTable %>%
  select(required_column) %>%
  collect()
(3) You have the right approach with a custom summary function. This is the best approach for producing the five-figure summary without pulling the data into local memory (RAM).
One possible cause of the syntax error is that you may have used R commands that do not have a translation into your database language.
You can check whether a command has translations defined using translate_sql. I recommend you try
library(dbplyr)
translate_sql(quantile(colname, 0.25))
to see what the translation looks like.
You can view the translation of an entire table manipulation using show_query. This is my go-to approach when debugging SQL translation. Try:
localTable %>%
  summarize(Min = min(X),
            Q1 = quantile(X, .25),
            Avg = mean(X),
            Q3 = quantile(X, .75),
            Max = max(X)) %>%
  show_query()
If this does not produce valid SQL then executing the command will error.
One possible cause is that MIN and MAX have special meanings in SQL and so might produce odd behavior in your translation.
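For instance, you could rename the outputs and inspect the generated SQL before running it:
localTable %>%
  summarize(x_min = min(X),
            x_max = max(X)) %>%
  show_query()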
When I experimented with quantile, it looked like it might need an OVER clause in SQL. This is created using group_by. So perhaps you want something like the following:
localSummary = remoteTable %>%
  # create dummy column
  mutate(ones = 1) %>%
  # group to satisfy over clause
  group_by(ones) %>%
  summarise(var_min = min(var),
            var_lq = quantile(var, 0.25),
            var_mean = mean(var),
            var_uq = quantile(var, 0.75),
            var_max = max(var)) %>%
  # copy results from database into R memory
  collect()
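If you want a rough sanity check of the result, you could also pull a small sample of the column into R and summarise it locally (this uses only the first rows, so treat it as a plausibility check, not an exact comparison):
x_sample <- remoteTable %>%
  select(var) %>%
  head(10000) %>%   # LIMIT in SQL, so only a sample is transferred
  collect()
fivenum(x_sample$var)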


Complex nested list to a clean matrix in R?

I have struggled for two days to find a way to create a specific matrix from a nested list.
First of all, I am sorry if I don't explain my issue correctly; I am one week new to StackOverflow* and R (and programming...)!
I use a file that you can find here:
original link: https://parltrack.org/dumps/ep_mep_activities.json.lz
Uncompressed by me here: https://wetransfer.com/downloads/701b7ac5250f451c6cb26d29b41bd88020200808183632/bb08429ca5102e3dc277f2f44d08f82220200808183652/666973
first 3 lists and last one (out of 23905) pasted here: https://pastebin.com/Kq7mjis5
With rjson, I have a nested list like this:
Nested list of MEP Votes
List of 23905
$ :List of 7
..$ ts : chr "2004-12-16T11:49:02"
..$ url : chr "http://www.europarl.europa.eu/RegData/seance_pleniere/proces_verbal/2004/12-16/votes_nominaux/xml/P6_PV(2004)12-16(RCV)_XC.xml"
..$ voteid : num 7829
..$ title : chr "Projet de budget général 2005 modifié - bloc 3"
..$ votes :List of 3
.. ..$ +:List of 2
.. .. ..$ total : num 45
.. .. ..$ groups:List of 6
.. .. .. ..$ ALDE :List of 1
.. .. .. .. ..$ : Named num 4404
.. .. .. .. .. ..- attr(*, "names")= chr "mepid"
.. .. .. ..$ GUE/NGL:List of 25
.. .. .. .. ..$ : Named num 28469
.. .. .. .. .. ..- attr(*, "names")= chr "mepid"
.. .. .. .. ..$ : Named num 4298
.. .. .. .. .. ..- attr(*, "names")= chr "mepid"
Then my goal is to have something like this:
final matrix
First I would like to keep only the lists (from [[1]] to [[23905]]) containing $vote$+$groups$Renew, $vote$-$groups$Renew or $vote$'0'$groups$Renew. The entries of the main list (all 23905 of them) are registered votes. My work is on the Renew group, so I am only interested in votes where the Renew group appears, in order to compare it with the other groups.
After that, my goal is to create a matrix like this from all the [[x]] where groups$Renew exists:
V1            V2 (not mandatory)   V3 ([[x]]$voteid)
[mepid==666]  GUE/NGL              +   (mepid==[666] is found in [[1]]$vote$+$groups$GUE/NGL)
[mepid==777]  Renew                -   (mepid==[777] is found in [[1]]$vote$-$groups$GUE/NGL)
I want to create a matrix so I can process the votes of each MEP (referenced by their MEPid). Their votes are either + (for yea), - (for nay) or 0 (for abstain). Moreover, I would like the political group of each MEP displayed in the column next to their mepid. We can find their political group from the place where their votes are stored: if the mepid is shown in the list [[x]]$vote$+$groups$GUE/NGL, she or he belongs to the GUE/NGL group.
What I want to do might look like this:
# Clean the nested list
Keep Vote[[x]] if the Vote[[x]] list contains
$vote$+$groups$Renew,
or $vote$-$groups$Renew,
or $vote$'0'$groups$Renew
# Create the matrix (or a data.frame if it is easier)
VoteMatrix <- as.matrix(
V1 = all "mepid" found in the nested list
V2 = groups (name of the list where we can find the mepid) (not mandatory)
V3 to Vy = If.else(mepid is in [[x]]$vote$+ then "+",
                   mepid is in [[x]]$vote$- then "-", "0")
)
Thank you in advance,
*Nevertheless, I am reading this website actively since I started R!
You can see that the 'votes' sublist is composed of three items, each a list of member numbers stored within what I think are party designators. Here's how you might "straighten" the positive voters' 'mepids' by party:
str( unlist( sapply(names(jlis[[1]]$votes$'+'$groups), function(x) unlist(jlis[[1]]$votes$'+'$groups[[x]]) ) ) )
Named num [1:104] 28268 4514 28841 28314 28241 ...
- attr(*, "names")= chr [1:104] "ALDE.mepid" "ALDE.mepid" "ALDE.mepid" "ALDE.mepid" ...
You get a named numeric vector with 104 entries. Perhaps this will demonstrate what sort of terminology to use in better describing your desired result. (Just giving a partial schema for the desired result leaves way too much ambiguity to support a fully formed request.)
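For what it's worth, here is a sketch of turning that named vector into a two-column data frame of group and mepid, reusing the expression above and the "<group>.mepid" names it produces:
pos <- unlist( sapply(names(jlis[[1]]$votes$'+'$groups),
                      function(x) unlist(jlis[[1]]$votes$'+'$groups[[x]]) ) )
data.frame(group = sub("\\.mepid$", "", names(pos)),   # strip the ".mepid" suffix
           mepid = unname(pos))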
I do NOT see the number 23905 anywhere in what I downloaded from your link. We are clearly looking at different data. I see this for the timestamp: chr "2004-12-01T15:20:31". I'm not going to cut you any slack for not knowing R, since the task needs to be fully explained in natural language. I will cut you slack regarding grammar if English is not your native tongue, but you definitely need to make a better effort at explication. This is what I see for the names within the votes$'+'$groups sublists of the first three items, but since RENEW is not in any of them there's not a lot that could be demonstrated about picking items:
> names( jlis[[1]]$votes$'+'$groups)
[1] "ALDE" "GUE/NGL" "IND/DEM" "NI" "PPE-DE" "PSE" "UEN"
> names( jlis[[2]]$votes$'+'$groups)
[1] "GUE/NGL" "IND/DEM" "NI" "PPE-DE"
> names( jlis[[3]]$votes$'+'$groups)
[1] "ALDE" "GUE/NGL" "IND/DEM" "NI" "PPE-DE" "PSE" "UEN" "Verts/ALE"
Furthermore, when I looked at all of the possible votes values using this method (for all three of the items you made available) I still see no RENEW names.
sapply( jlis[[1]]$votes[c("+","-","0")], function(x) names(x$groups) )
After second edit: Here's the next step of isolating those votes that contain a "Renew" value. I'm assuming that it's possible to have a "Renew" value in only one of the three possible 'votes' values (+, -, 0). If not (and there are always "Renew" values in each of them when there is one in any of them) then you might be able to simplify the logic. We make three logical vectors:
sapply( seq_along(MEPVotes) , function(i){ 'Renew' %in% names( MEPVotes[[i]]$votes[['0']][['groups']]) } )
#[1] FALSE FALSE FALSE TRUE
sapply( seq_along(MEPVotes) , function(i){ 'Renew' %in% names( MEPVotes[[i]]$votes[['+']][['groups']]) } )
#[1] FALSE FALSE FALSE TRUE
sapply( seq_along(MEPVotes) , function(i){ 'Renew' %in% names( MEPVotes[[i]]$votes[['-']][['groups']]) } )
#[1] FALSE FALSE FALSE TRUE
And then wrap them in a matrix call with 3 columns, take the maximum of each row (the maximum of c(TRUE, FALSE) is 1), and convert back to logical:
selection_vec = as.logical( apply( matrix( c(
sapply( seq_along(MEPVotes) , function(i){ 'Renew' %in% names( MEPVotes[[i]]$votes[['0']][['groups']]) } ),
sapply( seq_along(MEPVotes) , function(i){ 'Renew' %in% names( MEPVotes[[i]]$votes[['+']][['groups']]) } ),
sapply( seq_along(MEPVotes) , function(i){ 'Renew' %in% names( MEPVotes[[i]]$votes[['-']][['groups']]) } ) ),
ncol=3 ), 1,max))
> selection_vec
[1] FALSE FALSE FALSE TRUE
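With that logical vector you could then keep only the votes involving Renew, e.g. something like:
renew_votes <- MEPVotes[selection_vec]   # subset the main list of votes
length(renew_votes)                      # how many votes mention Renew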

Modifying a `citation` object in `R`

I am trying to modify a citation object in R as follows:
cit <- citation("ggplot2")
cit$textVersion
#[1] "H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009."
cit$textVersion <- "Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data Visualisations Using
the Grammar of Graphics. R package version 2.2.1."
But there is no change.
cit$textVersion
#[1] "H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009."
If we examine the structure of cit, there are now two textVersion attributes. How can I modify the original textVersion alone?
str(cit)
List of 1
$ :Class 'bibentry' hidden list of 1
..$ :List of 6
.. ..$ author :Class 'person' hidden list of 1
.. .. ..$ :List of 5
.. .. .. ..$ given : chr "Hadley"
.. .. .. ..$ family : chr "Wickham"
.. .. .. ..$ role : NULL
.. .. .. ..$ email : NULL
.. .. .. ..$ comment: NULL
.. ..$ title : chr "ggplot2: Elegant Graphics for Data Analysis"
.. ..$ publisher: chr "Springer-Verlag New York"
.. ..$ year : chr "2009"
.. ..$ isbn : chr "978-0-387-98140-6"
.. ..$ url : chr "http://ggplot2.org"
.. ..- attr(*, "bibtype")= chr "Book"
.. ..- attr(*, "textVersion")= chr "H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009."
.. ..- attr(*, "textversion")= chr "Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data Visualisations Using\n the Grammar of Gr"| __truncated__
- attr(*, "mheader")= chr "To cite ggplot2 in publications, please use:"
- attr(*, "class")= chr "bibentry"
A citation object is not made to be modified. The subset operators ($, [, but also $<-) are specific and don't allow easy modifications. This is for a reason: citation information is written in a specific file of a package and is not meant to be modified.
I don't know why you are trying it, but if you really need to, here is a little hack.
# store the class of the object, so it can be reassigned later
oc <- class(cit)
# unclass the object to be free to modify it
tmp <- unclass(cit)
# assign the new "textVersion"
attr(tmp[[1]], "textVersion") <- "Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 2.2.1."
# assign the class back
class(tmp) <- oc
tmp
#To cite ggplot2 in publications, please use:
#
# Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data
# Visualisations Using the Grammar of Graphics. R package version
# 2.2.1.
#
#A BibTeX entry for LaTeX users is
#
# @Book{,
# author = {Hadley Wickham},
# title = {ggplot2: Elegant Graphics for Data Analysis},
# publisher = {Springer-Verlag New York},
# year = {2009},
# isbn = {978-0-387-98140-6},
# url = {http://ggplot2.org},
# }
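As a quick check that only the textVersion attribute changed, you could compare the attribute on the modified and original objects:
attr(unclass(tmp)[[1]], "textVersion")   # new text
attr(unclass(cit)[[1]], "textVersion")   # original text, untouched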

R Studio crashes when using system(some-http-command) - anybody know a fix?

Whenever I use any sort of HTTP command via the system() function in R studio, the rainbow circle of death appears and I have to force-quit R Studio. Up until now, I've written a bunch of checks to make sure a user isn't in R Studio before using an HTTP command (which I use a ton to access data), but it's quite a pain, and it would be fantastic to get to the root of the problem.
e.g.
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
causes R studio to crash. Oddly, on another laptop of mine, such commands don't crash R Studio but cause the following error: 'sh: http: command not found', even though http is installed and works fine when using the terminal.
Does anybody know how to fix this problem / why it happens / does it occur for you guys too? Although I know a lot about R, I'm afraid I have no idea how to try to fix this problem.
Thanks!!!
Using http from the httpie package hangs RStudio (but not plain terminal R) on my Linux system (your rainbow circle implies it's a Mac?), so I'm getting the same behaviour as you.
Installing and using wget works for me:
system("wget -O /tmp/data.out http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
Or you could try R's native download.file function. There's a whole bunch of other functions for getting stuff off the web - see the Web Task View http://cran.r-project.org/web/views/WebTechnologies.html
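For example, a minimal download.file call might look like this (the destination path is just an example):
download.file("http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M",
              destfile = "/tmp/data.json")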
I've not seen this http command used much, so maybe it's flaky. Or maybe it's opening stdin...
Yes... Try this:
system("http get http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M >/tmp/data2.out </dev/null" )
I think http is opening stdin, the Unix standard input channel, but RStudio isn't sending anything to it, so it waits. If you explicitly redirect http's stdin from /dev/null then http completes. This works for me in RStudio.
However, I still prefer wget or curl-based solutions!
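For instance, something along these lines with curl (the flags and output path are just an example):
system("curl -s -o /tmp/data3.out http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")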
Without more contextual information regarding RStudio version / operating system, it is hard to do more than suggest an alternative approach that avoids the use of system().
Instead you could use RCurl and getURL:
library(RCurl)
getURL('http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M')
#[1] "{\"status\":\"REQUEST_SUCCEEDED\",\"responseTime\":129,\"message\":[],\"Results\":{\n\"series\":\n[{\"seriesID\":\"CXUALCBEVGLB0101M\",\"data\":[{\"year\":\"2013\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"445\",\"footnotes\":[{}]},{\"year\":\"2012\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"451\",\"footnotes\":[{}]},{\"year\":\"2011\",\"period\":\"A01\",\"periodName\":\"Annual\",\"value\":\"456\",\"footnotes\":[{}]}]}]\n}}"
You could also use PUT, GET, POST, etc directly in R, abstracted from RCurl by the httr package:
library(httr)
tmp <- GET("http://api.bls.gov/publicAPI/v1/timeseries/data/CXUALCBEVGLB0101M")
dat <- content(tmp, as="parsed")
str(dat)
## List of 4
## $ status : chr "REQUEST_SUCCEEDED"
## $ responseTime: num 27
## $ message : list()
## $ Results :List of 1
## ..$ series:'data.frame': 1 obs. of 2 variables:
## .. ..$ seriesID: chr "CXUALCBEVGLB0101M"
## .. ..$ data :List of 1
## .. .. ..$ :'data.frame': 3 obs. of 5 variables:
## .. .. .. ..$ year : chr [1:3] "2013" "2012" "2011"
## .. .. .. ..$ period : chr [1:3] "A01" "A01" "A01"
## .. .. .. ..$ periodName: chr [1:3] "Annual" "Annual" "Annual"
## .. .. .. ..$ value : chr [1:3] "445" "451" "456"
## .. .. .. ..$ footnotes :List of 3
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables
## .. .. .. .. ..$ :'data.frame': 1 obs. of 0 variables

Can't run r package mixOmics functions with large matrix

The mixOmics package is meant to analyze big data sets (e.g. from high-throughput experiments), but it does not seem to be working with my big matrix.
I am having issues with both rcc (regularized canonical correlation) and tune.rcc (lambda parameter estimation for regularized canonical correlation).
> str(Y)
num [1:13, 1:17766] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:17766] ...
> str(X)
num [1:13, 1:26] ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:13] ...
..$ : chr [1:26] ...
tune.rcc(X, Y, grid1 = seq(0.001, 1, length = 5),
         grid2 = seq(0.001, 1, length = 5),
         validation = "loo", plt = F)
On Mavericks: runs forever (I quit R after hours)
Since I know Mavericks is problematic, I tried it on a Windows 8 machine and on the mixOmics web interface.
On Windows 8:
Error: cannot allocate vector of size 2.4 Gb
On the web interface, since it is not possible to estimate the lambdas (tune.rcc), I tried rcc with "some" lambdas and got:
Error: cannot allocate vector of size 2.4 Gb
Am I doing something obviously wrong?
Any help very much appreciated.

How to convert searchTwitter results (from library(twitteR)) into a data.frame?

I am working on saving twitter search results into a database (SQL Server) and am getting an error when I pull the search results from twitteR.
If I execute:
library(twitteR)
puppy <- as.data.frame(searchTwitter("puppy", session=getCurlHandle(),num=100))
I get an error of:
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class structure("status", package = "twitteR") into a data.frame
This is important because in order to use RODBC to add this to a table using sqlSave it needs to be a data.frame. At least that's the error message I got:
Error in sqlSave(localSQLServer, puppy, tablename = "puppy_staging", :
should be a data frame
So does anyone have any suggestions on how to coerce the list to a data.frame or how I can load the list through RODBC?
My final goal is to have a table that mirrors the structure of values returned by searchTwitter. Here is an example of what I am trying to retrieve and load:
library(twitteR)
puppy <- searchTwitter("puppy", session=getCurlHandle(),num=2)
str(puppy)
List of 2
$ :Formal class 'status' [package "twitteR"] with 10 slots
.. ..# text : chr "beautifull and kc reg Beagle Mix for rehomes: This little puppy is looking for a new loving family wh... http://bit.ly/9stN7V "| __truncated__
.. ..# favorited : logi FALSE
.. ..# replyToSN : chr(0)
.. ..# created : chr "Wed, 16 Jun 2010 19:04:03 +0000"
.. ..# truncated : logi FALSE
.. ..# replyToSID : num(0)
.. ..# id : num 1.63e+10
.. ..# replyToUID : num(0)
.. ..# statusSource: chr "<a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>"
.. ..# screenName : chr "puppy_ads"
$ :Formal class 'status' [package "twitteR"] with 10 slots
.. ..# text : chr "the cutest puppy followed me on my walk, my grandma won't let me keep it. taking it to the pound sadface"
.. ..# favorited : logi FALSE
.. ..# replyToSN : chr(0)
.. ..# created : chr "Wed, 16 Jun 2010 19:04:01 +0000"
.. ..# truncated : logi FALSE
.. ..# replyToSID : num(0)
.. ..# id : num 1.63e+10
.. ..# replyToUID : num(0)
.. ..# statusSource: chr "<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry®</a>"
.. ..# screenName : chr "iamsweaters"
So I think the data.frame of puppy should have column names like:
- text
- favorited
- replytoSN
- created
- truncated
- replytoSID
- id
- replytoUID
- statusSource
- screenName
I use this code, which I found at http://blog.ouseful.info/2011/11/09/getting-started-with-twitter-analysis-in-r/ a while ago:
#get data
tws<-searchTwitter('#keyword',n=10)
#make data frame
df <- do.call("rbind", lapply(tws, as.data.frame))
#write to csv file (or your RODBC code)
write.csv(df,file="twitterList.csv")
I know this is an old question, but still, here is what I think is a "modern" way to solve this. Just use the function twListToDF:
gvegayon <- getUser("gvegayon")
timeline <- userTimeline(gvegayon,n=400)
tl <- twListToDF(timeline)
Hope it helps
Try this:
ldply(searchTwitter("#rstats", n=100), text)
twitteR returns an S4 class, so you need to either use one of its helper functions, or deal directly with its slots. You can see the slots by using unclass(), for instance:
unclass(searchTwitter("#rstats", n=100)[[1]])
These slots can be accessed directly as I do above by using the related functions (from the twitteR help: ?statusSource):
text Returns the text of the status
favorited Returns the favorited information for the status
replyToSN Returns the replyToSN slot for this status
created Retrieves the creation time of this status
truncated Returns the truncated information for this status
replyToSID Returns the replyToSID slot for this status
id Returns the id of this status
replyToUID Returns the replyToUID slot for this status
statusSource Returns the status source for this status
As I mentioned, it's my understanding that you will have to specify each of these fields yourself in the output. Here's an example using two of the fields:
> head(ldply(searchTwitter("#rstats", n=100),
function(x) data.frame(text=text(x), favorited=favorited(x))))
text
1 #statalgo how does that actually work? does it share mem between #rstats and postgresql?
2 #jaredlander Have you looked at PL/R? You can call #rstats from PostgreSQL: http://www.joeconway.com/plr/.
3 #CMastication I was hoping for a cool way to keep data in a DB and run the normal #rstats off that. Maybe a translator from R to SQL code.
4 The distribution of online data usage: AT&T has recently announced it will no longer http://goo.gl/fb/eTywd #rstat
5 #jaredlander not that I know of. Closest is sqldf package which allows #rstats and sqlite to share mem so transferring from DB to df is fast
6 #CMastication Can #rstats run on data in a DB?Not loading it in2 a dataframe or running SQL cmds but treating the DB as if it wr a dataframe
favorited
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
You could turn this into a function if you intend on doing it frequently.
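For example, a rough sketch of such a wrapper (using just the two fields from above; add the other accessors as needed):
library(plyr)
library(twitteR)
tweets_to_df <- function(query, n = 100) {
  # build one row per status from the chosen accessor functions
  ldply(searchTwitter(query, n = n),
        function(x) data.frame(text = text(x), favorited = favorited(x)))
}
# usage, e.g.: rstats_df <- tweets_to_df("#rstats", n = 50)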
For those that run into the same problem I did, which was getting an error saying
Error in as.double(y) : cannot coerce type 'S4' to vector of type 'double'
I simply changed the word text in
ldply(searchTwitter("#rstats", n=100), text)
to statusText, like so:
ldply(searchTwitter("#rstats", n=100), statusText)
Just a friendly heads-up :P
Here is a nice function to convert it into a DF.
TweetFrame <- function(searchTerm, maxTweets)
{
  tweetList <- searchTwitter(searchTerm, n = maxTweets)
  return(do.call("rbind", lapply(tweetList, as.data.frame)))
}
Use it as:
tweets <- TweetFrame("searchTerm", maxTweets)
The twitteR package now includes a function twListToDF that will do this for you.
puppy_table <- twListToDF(puppy)
