I am trying to modify a citation object in R as follows:
cit <- citation("ggplot2")
cit$textVersion
#[1] "H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009."
cit$textVersion <- "Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data Visualisations Using
the Grammar of Graphics. R package version 2.2.1."
But there is no change.
cit$textVersion
#[1] "H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009."
If we examine the structure of cit, there are now two textVersion attributes (the original "textVersion" plus a new lowercase "textversion"). How can I modify the original textVersion alone?
str(cit)
List of 1
$ :Class 'bibentry' hidden list of 1
..$ :List of 6
.. ..$ author :Class 'person' hidden list of 1
.. .. ..$ :List of 5
.. .. .. ..$ given : chr "Hadley"
.. .. .. ..$ family : chr "Wickham"
.. .. .. ..$ role : NULL
.. .. .. ..$ email : NULL
.. .. .. ..$ comment: NULL
.. ..$ title : chr "ggplot2: Elegant Graphics for Data Analysis"
.. ..$ publisher: chr "Springer-Verlag New York"
.. ..$ year : chr "2009"
.. ..$ isbn : chr "978-0-387-98140-6"
.. ..$ url : chr "http://ggplot2.org"
.. ..- attr(*, "bibtype")= chr "Book"
.. ..- attr(*, "textVersion")= chr "H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009."
.. ..- attr(*, "textversion")= chr "Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data Visualisations Using\n the Grammar of Gr"| __truncated__
- attr(*, "mheader")= chr "To cite ggplot2 in publications, please use:"
- attr(*, "class")= chr "bibentry"
A citation object is not designed to be modified. Its subset operators ($, [, but also $<-) are specialized and don't allow easy modification. This is deliberate: citation information is generated from a specific file in a package and is not meant to be modified.
I don't know why you are trying this, but if you really need to, here is a little hack.
# store the class of the object so it can be reassigned later
oc <- class(cit)
# unclass the object to be free to modify it
tmp <- unclass(cit)
# assign the new "textVersion"
attr(tmp[[1]], "textVersion") <- "Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 2.2.1."
# assign the class back
class(tmp) <- oc
tmp
#To cite ggplot2 in publications, please use:
#
# Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data
# Visualisations Using the Grammar of Graphics. R package version
# 2.2.1.
#
#A BibTeX entry for LaTeX users is
#
# @Book{,
# author = {Hadley Wickham},
# title = {ggplot2: Elegant Graphics for Data Analysis},
# publisher = {Springer-Verlag New York},
# year = {2009},
# isbn = {978-0-387-98140-6},
# url = {http://ggplot2.org},
# }
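If you need to do this more than once, the same hack can be wrapped in a small helper (a sketch based on the code above; the function name set_text_version is mine, not part of any package):
set_text_version <- function(cit, new_text, which = 1) {
  oc <- class(cit)
  tmp <- unclass(cit)
  # overwrite the camel-case "textVersion" attribute used in the printed citation
  attr(tmp[[which]], "textVersion") <- new_text
  class(tmp) <- oc
  tmp
}
cit <- set_text_version(cit, "Hadley Wickham and Winston Chang (2016). ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 2.2.1.")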
I have 4 years' experience using R, but I am very new to the Big Data game, as I have always worked on csv files.
It is thrilling to manipulate large amounts of data from a distance, but also somehow frustrating, as simple things you were used to have to be re-engineered.
The task I am struggling with right now is getting a basic five-number summary of a variable:
summary(df$X)
Some context: I am connected via Impala, and these lines of code work fine:
library(dbplyr)
localTable <- tbl(con, 'serverTable')
localTable %>% tally()
localTable %>% filter(X > 10) %>% tally()
If I just write
localTable
instead, RStudio gets stuck or takes a very long time, so I kill it with the task manager.
Coming back to my current question, I tried to get the five-number summary in these ways:
summary(localTable$X) #returns Length 0, Class NULL, Mode NULL
localTable %>% fivenum(X) #returns Error in rank(x, ties.method = "min", na.last = "keep") : unimplemented type 'list' in 'greater'
I also tried building a custom summary with summarize:
localTable %>% summarize(Min = min(X),
                         Q1 = quantile(X, .25),
                         Avg = mean(X),
                         Q3 = quantile(X, .75),
                         Max = max(X))
which returns a syntax error.
My guess is that there is a very trivial missing link between my code and the server, in the form of a data structure, but I can't figure out what it is.
I also tried to save localTable$X to an in-memory variable with
XL <- localTable$X
but I always get NULL.
On the graphical side, using dbplot, if I try
library(dbplot)
localTable %>% dbplot_histogram(X)
I get an empty graphic.
I thought about leveraging the five-number summary in the boxplot function, like ggplot_build(object)$data so to speak, but with dbplot_boxplot I get the error could not find function "dbplot_boxplot".
I started using dbplyr because I am quite fluent with dplyr and don't want to write SQL queries with DBI::dbGetQuery, but feel free to suggest other packages such as implyr or sparklyr, as well as tutorials on the subject at large, as the ones I found are quite basic.
EDIT:
as requested in a comment, here is the result of
str(localTable)
which is
List of 2
$ src:List of 2
..$ con :Formal class 'Impala' [package ".GlobalEnv"] with 4 slots
.. .. ..# ptr :<externalptr>
.. .. ..# quote : chr "`"
.. .. ..# info :List of 15
.. .. .. ..$ dbname : chr "IMPALA"
.. .. .. ..$ dbms.name : chr "Impala"
.. .. .. ..$ db.version : chr "2.9.0-cdh5.12.1"
.. .. .. ..$ username : chr "User"
.. .. .. ..$ host : chr ""
.. .. .. ..$ port : chr ""
.. .. .. ..$ sourcename : chr "impala connector"
.. .. .. ..$ servername : chr "Impala"
.. .. .. ..$ drivername : chr "Cloudera ODBC Driver for Impala"
.. .. .. ..$ odbc.version : chr "03.80.0000"
.. .. .. ..$ driver.version : chr "2.6.11.1011"
.. .. .. ..$ odbcdriver.version : chr "03.80"
.. .. .. ..$ supports.transactions : logi FALSE
.. .. .. ..$ getdata.extensions.any_column: logi TRUE
.. .. .. ..$ getdata.extensions.any_order : logi TRUE
.. .. .. ..- attr(*, "class")= chr [1:3] "Impala" "driver_info" "list"
.. .. ..# encoding: chr ""
..$ disco: NULL
..- attr(*, "class")= chr [1:4] "src_Impala" "src_dbi" "src_sql" "src"
$ ops:List of 2
..$ x : 'ident' chr "serverTable"
..$ vars: chr [1:157] "X" ...
..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
- attr(*, "class")= chr [1:5] "tbl_Impala" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
Not sure if I can dput my table, as it contains sensitive information.
There are quite a few aspects to your post. I am going to try and address the main ones.
(1) What you are calling localTable is not local. What you have is a local access point to a remote table. It is a remote table because the data is stored in the database, rather than in R.
To copy a remote table into local R memory use localTable = collect(remoteTable). Use this carefully. If the table is many GB in the database, this will be slow to transfer into R. Also, if you collect a database table that is bigger than the RAM available to R, then you will receive an out-of-memory error.
I recommend using collect for moving summary results into R. Do the processing and summarizing in the database and just fetch the results into R. Alternatively, use remoteTable %>% head(20) %>% collect() to copy just the first 20 rows into R.
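For example (a sketch; remoteTable stands for any table handle created with tbl(con, ...)):
library(dplyr)
# copies the whole remote table into R memory; slow and risky for big tables
localTable <- collect(remoteTable)
# safer: copy only the first 20 rows to inspect the data
preview <- remoteTable %>% head(20) %>% collect()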
(2) The tableName$colname syntax will not work for remote tables. In R, the $ notation lets you access a named component of a list. Data frames are a special kind of list. If you try data(iris) followed by names(iris), you will get the column names of iris; any of these can be accessed using iris$, for example iris$Species.
However, as your str(localTable) output shows, localTable is a list of length 2 whose first named item is src. If you call names(localTable), you will get two names back, the first of which is src. This means you can call localTable$src (and, as localTable$src is also a list, you can also call localTable$src$con).
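To see the difference (a sketch; the names for localTable come from your str output above):
data(iris)
names(iris)        # column names: "Sepal.Length" "Sepal.Width" ...
iris$Species[1:3]  # works, because a data.frame is a list of columns
names(localTable)  # "src" "ops": internals of the remote tbl, not columns
localTable$X       # NULL, because "X" is not a named component of that list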
When working with dbplyr, R translates data manipulation commands into the database's query language. There are translations defined for most dplyr commands, but not for all R commands.
So the recommended approach to access just a specific column is using select from dplyr:
local_copy_of_just_one_column <- remoteTable %>%
  select(required_column) %>%
  collect()
(3) You have the right approach with a custom summary function. This is the best approach for producing the five-number summary without pulling the data into local memory (RAM).
One possible cause of the syntax error is that you may have used R commands that do not have a translation into your database language.
You can check whether a command has translations defined using translate_sql. I recommend you try
library(dbplyr)
translate_sql(quantile(colname, 0.25))
to see what the translation looks like.
You can view the translation of an entire table manipulation using show_query. This is my go-to approach when debugging SQL translation. Try:
localTable %>%
  summarize(Min = min(X),
            Q1 = quantile(X, .25),
            Avg = mean(X),
            Q3 = quantile(X, .75),
            Max = max(X)) %>%
  show_query()
If this does not produce valid SQL then executing the command will error.
One possible cause is that MIN and MAX have special meanings in SQL, so using them as output column names might produce odd behavior in your translation.
When I experimented with quantile, it looked like it might need an OVER clause in SQL, which is created using group_by. So perhaps you want something like the following:
localSummary <- remoteTable %>%
  # create a dummy column
  mutate(ones = 1) %>%
  # group to satisfy the OVER clause
  group_by(ones) %>%
  summarise(var_min = min(var),
            var_lq = quantile(var, 0.25),
            var_mean = mean(var),
            var_uq = quantile(var, 0.75),
            var_max = max(var)) %>%
  # copy results from the database into R memory
  collect()
I'm interested in removing all stopwords from my text using R. The list of stopwords I want to remove can be found at http://www.ranks.nl/stopwords, under the section "Long Stopword List" (the very long list version). I'm using the tm package. Can anyone help me, please? Thanks!
You can copy that list (after you select it in your browser) and then paste it into this expression in R:
LONGSWS <- " <paste into this position> "
Place the cursor inside the two quotes in your editor or the IDE console, then do this:
sw.vec <- scan(text=LONGSWS, what="")
#Read 474 items
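As a toy illustration of what scan is doing here (three words instead of 474):
scan(text = "a an the", what = "")
#Read 3 items
#[1] "a"   "an"  "the"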
The scan function needs the type of input specified via an example passed to the what argument; for character input, "" is sufficient. Then you should be able to apply the code you offered in your comment:
tm_map(text, removeWords, sw.vec)
You have not supplied an example text object. Using just a character vector is not successful:
tm_map("test of my text", removeWords, sw.vec )
#Error in UseMethod("tm_map", x) :
# no applicable method for 'tm_map' applied to an object of class "character"
So we need to assume you have an object of a suitable class to pass as the first argument to tm_map. Using the crude corpus from the ?tm_map help page:
> res <- tm_map(crude, removeWords, sw.vec )
> str(res)
List of 20
$ 127:List of 2
..$ content: chr "Diamond Shamrock Corp said \neffective today cut contract prices crude oil \n1.50 dlrs barrel.\n The re"| __truncated__
..$ meta :List of 15
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "1987-02-26 17:00:56"
.. ..$ description : chr ""
.. ..$ heading : chr "DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES"
.. ..$ id : chr "127"
.. ..$ language : chr "en"
.. ..$ origin : chr "Reuters-21578 XML"
.. ..$ topics : chr "YES"
.. ..$ lewissplit : chr "TRAIN"
.. ..$ cgisplit : chr "TRAINING-SET"
# ----------------snipped remainder of long output.
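If you only have a plain character vector, a minimal sketch (using tm's standard VectorSource API) to wrap it in a corpus first:
library(tm)
# wrap the plain character vector in a corpus so tm_map has a method for it
docs <- VCorpus(VectorSource(c("test of my text")))
docs <- tm_map(docs, removeWords, sw.vec)
docs[[1]]$content  # stopwords removed; extra whitespace remains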
I am building an API wrapper using httr. The API I'm using has no content under people for the current year, but it still returns the copyright data. When I use httr::GET, I still get a 200 status code since there is a response.
The response should have data similar to 2019's. How do I use httr to throw an error here? Is there a warning similar to httr::warn_for_status available?
Here is an example of a request that works and returns people, versus one that doesn't:
library(httr)
data <- GET("https://statsapi.mlb.com/api/v1/sports/1/players?season=2019")
# response with data in content. I'll spare everyone the 1,410 rows in the JSON Response
str(content(data, type= "application/json"), list.len = 2)
#> List of 2
#> $ copyright: chr "Copyright 2020 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms po"| __truncated__
#> $ people :List of 1410
#> ..$ :List of 36
#> .. ..$ id : int 472551
#> .. ..$ fullName : chr "Fernando Abad"
#> .. .. [list output truncated]
#> ..$ :List of 35
#> .. ..$ id : int 650556
#> .. ..$ fullName : chr "Bryan Abreu"
#> .. .. [list output truncated]
#> .. [list output truncated]
no_data <- GET("https://statsapi.mlb.com/api/v1/sports/1/players?season=2020")
str(content(no_data, type= "application/json"))
#> List of 2
#> $ copyright: chr "Copyright 2020 MLB Advanced Media, L.P. Use of any content on this page acknowledges agreement to the terms po"| __truncated__
#> $ people : list()
The alternative I've used is to parse the data and then check nrow(df) < 1. There has to be a better way.
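For what it's worth, one way to make that workaround explicit is to wrap the request and check the parsed body (a sketch of the approach described above, not an httr built-in; the helper name get_people is mine):
library(httr)
get_people <- function(url) {
  resp <- GET(url)
  stop_for_status(resp)  # catches genuine HTTP errors (non-2xx)
  people <- content(resp, type = "application/json")$people
  # the status is 200 even when people is empty, so check the body too
  if (length(people) == 0) {
    stop("Request succeeded but returned no people: ", url, call. = FALSE)
  }
  people
}
# returns the list for 2019, throws an error for 2020
people_2019 <- get_people("https://statsapi.mlb.com/api/v1/sports/1/players?season=2019")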
R - tm package - Issue with Arabic - difference between Mac OS X and Windows OS
ON MACBOOK PRO with RSTUDIO
```{r}
sessionInfo()
```
R version 3.1.0 (2014-04-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
Packages: tm_0.6 NLP_0.1-3
ON WINDOWS 8.1 with RSTUDIO
```{r}
sessionInfo()
```
R version 3.1.0 (2014-04-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Packages: tm_0.6 NLP_0.1-3
Problem description
Dear all,
I have been working all weekend. I'm working on a PhD in social network analysis. At the moment, I'm using the tm package for text mining and analysis, with English and Arabic mixed in big data sets.
The data sets are collected from the Twitter API with a Java program and placed in a MongoDB database.
For test purposes, I use a small dataset of 36,000 tweets.
The problem is that for computations on huge datasets (>1,000,000 rows), my MacBook Pro is not sufficient. I need to use a PC running Windows 8.1, which has more storage and RAM.
When testing on Windows 8.1 the code that works fine in RStudio on Mac OS X, with the same test dataset, I get different results from the tm package at the corpus computation step.
Here is the beginning of the R code:
```{r}
y <<- dget("file") # get the file ext rated from MongoDB with rmongodb package
a <<- y$tweet_text # extract only the text of the tweets in the dataset
text_df <<- data.frame(a, stringsAsFactors = FALSE) # Save as a data frame
myCorpus_df <<- Corpus(DataframeSource(text_df_2)) # Compute a Corpus from the data frame
```
When I check in R on Mac OS, all the characters, English and Arabic, are represented correctly:
```{r}
str(myCorpus_df[1:2])
```
List of 2
$ 1:List of 2
..$ content: chr "The CHRONICLE EYE Ahrar al#Sham is clearly fighting #ISIS where its men storm some #Manbij buildings #Aleppo "
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "1"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
$ 2:List of 2
..$ content: chr "RT ######### جبهة النصرة مهاجرينها وأنصارها مقراتها مكان آمن لكل من يخشى على نفسه الآذى "
..$ meta :List of 7
.. ..$ author : chr(0)
.. ..$ datetimestamp: POSIXlt[1:1], format: "2014-07-03 22:42:18"
.. ..$ description : chr(0)
.. ..$ heading : chr(0)
.. ..$ id : chr "2"
.. ..$ language : chr "en"
.. ..$ origin : chr(0)
.. ..- attr(*, "class")= chr "TextDocumentMeta"
..- attr(*, "class")= chr [1:2] "PlainTextDocument" "TextDocument"
- attr(*, "class")= chr [1:2] "VCorpus" "Corpus"
Nevertheless, when I run the same code in RStudio on Windows, all the Arabic text is decoded incorrectly (I can't paste it here). The str of the corpus shows the same parameters; only the display of the Arabic is unreadable. When I check the data frame text_df, the Arabic text is displayed correctly.
When I check the encoding of an Arabic word on both OSes (Mac & Windows), it appears to be encoded correctly:
```{r}
Encoding("لمياه_و_الإصحا")
```
[1] "UTF-8"
I've tried passing additional options when creating the corpus (readerControl, etc.), but nothing changed: my Arabic text is not displayed correctly in R or RStudio on Windows with the tm package.
Has anyone encountered the same issues between Mac OS X and Windows when text mining non-Latin languages?
As far as I can tell, the Arabic characters are being encoded in some native (Windows-specific) encoding, while your R code is incorrectly decoding them as UTF-8. That's why you're getting all those annoying symbols such as "Ø" **. To verify this, inspect the raw bytes of your string variables using charToRaw and then check them against the UTF-8 character table.
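For instance (a sketch; the exact bytes will depend on your data):
x <- "لمياه"
charToRaw(x)  # UTF-8 Arabic letters show as two-byte sequences starting 0xd8-0xdb
# if you instead see one byte per letter, the text is in a native
# Windows code page rather than UTF-8
Encoding(x)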
I haven't worked with the rmongodb package before, but I wonder if there is a way to force the data to be read from MongoDB in UTF-8 format, perhaps by specifying an encoding parameter of some "read" function.
** Actually, the reason I can immediately recognize those characters is that I ran into this kind of problem while working with Arabic tweets I had obtained using the twitteR package.
I am working on saving Twitter search results into a database (SQL Server) and am getting an error when I pull the search results from twitteR.
If I execute:
library(twitteR)
puppy <- as.data.frame(searchTwitter("puppy", session=getCurlHandle(),num=100))
I get an error of:
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot coerce class structure("status", package = "twitteR") into a data.frame
This is important because, in order to add this to a table using RODBC's sqlSave, it needs to be a data.frame. At least, that's the error message I got:
Error in sqlSave(localSQLServer, puppy, tablename = "puppy_staging", :
should be a data frame
So does anyone have any suggestions on how to coerce the list to a data.frame, or on how I can load the list through RODBC?
My final goal is to have a table that mirrors the structure of the values returned by searchTwitter. Here is an example of what I am trying to retrieve and load:
library(twitteR)
puppy <- searchTwitter("puppy", session=getCurlHandle(),num=2)
str(puppy)
List of 2
$ :Formal class 'status' [package "twitteR"] with 10 slots
.. ..# text : chr "beautifull and kc reg Beagle Mix for rehomes: This little puppy is looking for a new loving family wh... http://bit.ly/9stN7V "| __truncated__
.. ..# favorited : logi FALSE
.. ..# replyToSN : chr(0)
.. ..# created : chr "Wed, 16 Jun 2010 19:04:03 +0000"
.. ..# truncated : logi FALSE
.. ..# replyToSID : num(0)
.. ..# id : num 1.63e+10
.. ..# replyToUID : num(0)
.. ..# statusSource: chr "<a href="http://twitterfeed.com" rel="nofollow">twitterfeed</a>"
.. ..# screenName : chr "puppy_ads"
$ :Formal class 'status' [package "twitteR"] with 10 slots
.. ..# text : chr "the cutest puppy followed me on my walk, my grandma won't let me keep it. taking it to the pound sadface"
.. ..# favorited : logi FALSE
.. ..# replyToSN : chr(0)
.. ..# created : chr "Wed, 16 Jun 2010 19:04:01 +0000"
.. ..# truncated : logi FALSE
.. ..# replyToSID : num(0)
.. ..# id : num 1.63e+10
.. ..# replyToUID : num(0)
.. ..# statusSource: chr "<a href="http://blackberry.com/twitter" rel="nofollow">Twitter for BlackBerry®</a>"
.. ..# screenName : chr "iamsweaters"
So I think the data.frame of puppy should have column names like:
- text
- favorited
- replyToSN
- created
- truncated
- replyToSID
- id
- replyToUID
- statusSource
- screenName
I use this code, which I found at http://blog.ouseful.info/2011/11/09/getting-started-with-twitter-analysis-in-r/ a while ago:
library(twitteR)
# get data
tws <- searchTwitter('#keyword', n = 10)
# make data frame
df <- do.call("rbind", lapply(tws, as.data.frame))
# write to csv file (or your RODBC code)
write.csv(df, file = "twitterList.csv")
I know this is an old question, but here is what I think is a "modern" way to solve this. Just use the function twListToDF:
gvegayon <- getUser("gvegayon")
timeline <- userTimeline(gvegayon, n = 400)
tl <- twListToDF(timeline)
Hope it helps
Try this (ldply is from the plyr package):
library(plyr)
ldply(searchTwitter("#rstats", n=100), text)
twitteR returns an S4 class, so you need to either use one of its helper functions, or deal directly with its slots. You can see the slots by using unclass(), for instance:
unclass(searchTwitter("#rstats", n=100)[[1]])
These slots can be accessed directly, as I do below, by using the related accessor functions (from the twitteR help: ?statusSource):
- text: returns the text of the status
- favorited: returns the favorited information for the status
- replyToSN: returns the replyToSN slot for this status
- created: retrieves the creation time of this status
- truncated: returns the truncated information for this status
- replyToSID: returns the replyToSID slot for this status
- id: returns the id of this status
- replyToUID: returns the replyToUID slot for this status
- statusSource: returns the status source for this status
As I mentioned, it's my understanding that you will have to specify each of these fields yourself in the output. Here's an example using two of the fields:
> head(ldply(searchTwitter("#rstats", n=100),
function(x) data.frame(text=text(x), favorited=favorited(x))))
text
1 #statalgo how does that actually work? does it share mem between #rstats and postgresql?
2 #jaredlander Have you looked at PL/R? You can call #rstats from PostgreSQL: http://www.joeconway.com/plr/.
3 #CMastication I was hoping for a cool way to keep data in a DB and run the normal #rstats off that. Maybe a translator from R to SQL code.
4 The distribution of online data usage: AT&T has recently announced it will no longer http://goo.gl/fb/eTywd #rstat
5 #jaredlander not that I know of. Closest is sqldf package which allows #rstats and sqlite to share mem so transferring from DB to df is fast
6 #CMastication Can #rstats run on data in a DB?Not loading it in2 a dataframe or running SQL cmds but treating the DB as if it wr a dataframe
favorited
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 FALSE
You could turn this into a function if you intend on doing it frequently.
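For example, a small wrapper might look like this (a sketch; the function name tweets_to_df is mine, and on newer twitteR versions you may need statusText() instead of text(), as noted in the next answer):
library(plyr)
library(twitteR)
# collect selected fields from each status object into one data frame
tweets_to_df <- function(term, n = 100) {
  ldply(searchTwitter(term, n = n),
        function(x) data.frame(text = text(x),
                               favorited = favorited(x),
                               stringsAsFactors = FALSE))
}
df <- tweets_to_df("#rstats", n = 100)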
For those who run into the same problem I did, which was getting an error saying
Error in as.double(y) : cannot coerce type 'S4' to vector of type 'double'
I simply changed the word text in
ldply(searchTwitter("#rstats", n=100), text)
to statusText, like so:
ldply(searchTwitter("#rstats", n=100), statusText)
Just a friendly heads-up :P
Here is a nice function to convert it into a DF.
TweetFrame <- function(searchTerm, maxTweets)
{
  tweetList <- searchTwitter(searchTerm, n = maxTweets)
  return(do.call("rbind", lapply(tweetList, as.data.frame)))
}
Use it as:
tweets <- TweetFrame("puppy", 100)
The twitteR package now includes a function, twListToDF, that will do this for you:
puppy_table <- twListToDF(puppy)