I figure readLines() or maybe the tm package will be the way to go for this, but I was wondering what recommendations folks have for reading a .txt screenplay. Demarcating parts of the screenplay will require case-sensitive and newline-sensitive regexes and such, so I need to preserve all those elements in the content that is read. Thoughts?
read_file() from the readr package (part of the tidyverse) does a nice job of this.
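For comparison, here is a minimal base R sketch of the same idea, assuming a plain-text screenplay where scene headings start with INT. or EXT. (the sample lines and the heading pattern are made up for illustration). readLines() keeps case and blank lines as they are, and pasting with "\n" gives a single string for newline-sensitive regexes.

```r
# Hypothetical sample screenplay written to a temporary file for the demo.
script_path <- tempfile(fileext = ".txt")
writeLines(c("INT. KITCHEN - DAY",
             "",
             "ALICE",
             "Where is the coffee?",
             "",
             "EXT. STREET - NIGHT"),
           script_path)

# One element per line; case and empty lines are preserved.
lines <- readLines(script_path)

# A single string with embedded newlines, for multi-line regexes.
whole <- paste(lines, collapse = "\n")

# Case-sensitive match on an assumed scene-heading convention.
is_heading <- grepl("^(INT|EXT)\\.", lines)
lines[is_heading]
```

Whether to work line by line or on the single string depends on whether the patterns need to span line breaks.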
I have a 5.1 GB JSON file that I would like to read in R using rjson. Afterwards I want to construct a data frame from it, but it won't load because the file is too large.
Is there any way to work around it?
Thank you for your help =)
Nina, I would recommend using the jsonlite package instead of rjson.
library(jsonlite)
your_json <- "your_path.json"
unpacked_json <- jsonlite::stream_in(textConnection(readLines(your_json, n = 100000)), verbose = FALSE)
Here readLines(n = 100000) limits the read to the first 100,000 lines so that the data fits in memory. Note that stream_in() expects newline-delimited JSON, with one record per line. For more information, I would also recommend doing some research on this topic:
https://community.rstudio.com/t/how-to-read-large-json-file-in-r/13486
Reading a huge json file in R, issues
I know that documentation can be hard to get through (and, like everyone else, we are lazy), and I don't like reading documentation myself, but I highly recommend getting familiar with the jsonlite documentation and vignettes. Here's the CRAN link: https://cran.r-project.org/web/packages/jsonlite/index.html
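As a hedged sketch of how the streaming approach can avoid holding everything in memory at once: stream_in() accepts a handler function that is called once per page of records, so each page can be filtered or summarised before the next one is read. The tiny file, the column names and the score > 1 filter below are made up for the demo, and the file is assumed to be newline-delimited JSON.

```r
library(jsonlite)

# Stand-in for a large file: three records, one JSON object per line.
json_path <- tempfile(fileext = ".json")
writeLines(c('{"id": 1, "score": 0.5}',
             '{"id": 2, "score": 1.5}',
             '{"id": 3, "score": 2.5}'),
           json_path)

# Keep only the rows of interest from each page instead of the whole file.
chunks <- list()
stream_in(file(json_path),
          handler = function(df) {
            chunks[[length(chunks) + 1]] <<- df[df$score > 1, , drop = FALSE]
          },
          pagesize = 2,      # records per page; tune to the available memory
          verbose = FALSE)

# Combine the filtered pages into one data frame.
result <- do.call(rbind, chunks)
```

For a real 5 GB file a much larger pagesize would make sense; the value 2 is only there so that the demo produces more than one page.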
I want to use Python's rpy2 package to get a FlexTable and merge cells. The code is:
r['spanFlexTable'](tab1,i=1,from=2,to=3)
But from is a keyword in Python. Can someone help me solve this?
In this post the solution of postfixing an underscore (from_) may work:
Is it possible to escape a reserved word in Python?
I have written and collected R code on various topics that solves particular problems at hand. I stored the R scripts/code in .txt files. I now have hundreds of them.
How do you keep your R code at hand efficiently?
@Manetheran has the right idea: write a package. It's easy to do (especially with RStudio). Read "Writing R Extensions" and then on top of that learn about roxygen2 (which allows you to document each function in-line and avoid writing .Rd files).
Then you can use devtools to load your package locally or, once it's stable and you think other people can use the functions, you can submit your package to CRAN.
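To give a rough idea of the roxygen2 style mentioned above, a function documented in-line might look like the sketch below. The function rescale01 and what it does are made up for the example; in a package this would live in a file under R/, and roxygen2 would generate the .Rd file from the #' comments.

```r
#' Rescale a numeric vector to the range [0, 1]
#'
#' @param x A numeric vector with at least two distinct values.
#' @return A numeric vector of the same length as x.
#' @export
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Plain call, independent of any package tooling.
rescale01(c(0, 5, 10))
```

The #' lines are ordinary comments as far as R is concerned, so the function also works when the file is simply sourced.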
I prefer to keep it simple. I use Total Commander and when I need an example which uses some R function, I just do Alt-F7 and search for *.R files which contain the desired string.
I use RStudio and have created two or three basic scripts. I save my much-used functions in the basic script that is most appropriate. Then, at the start of an RStudio script for a project, I source one or more of the basic scripts as is appropriate.
My aim is to better organize the work done by my R code.
In particular, it could be useful to split the R code I have written into different R files, perhaps with each R file accomplishing a different task. I have in mind what we can do in Matlab with different M files, where we can easily call functions written in different M files directly from the main code.
Is it useful to write these R files in the form of functions?
How can we call these R files/functions in the main code?
Thanks
You can use source("filename.R") to include the file in your main script.
I am not sure if there is a ready-made function to include an entire directory, but it is straightforward to write one using list.files() and then calling source() dynamically for each filename. You can also filter the files to list only *.R, for example.
Unless you intend to write an R package, you should rethink your organization. R is not Matlab, thank goodness! You can place as many functions as you wish into a single file, and make them all available in your environment with source("foo.r"). If you are writing a collection of generic functions and don't want to build a package, this really is the cleaner way to go.
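A minimal sketch of that list.files() plus source() idea follows. The helper name source_dir and the demo files are made up for illustration; the pattern argument restricts the listing to files ending in .R or .r.

```r
# Source every R script found directly inside a directory.
source_dir <- function(path, pattern = "\\.[Rr]$", ...) {
  files <- list.files(path, pattern = pattern, full.names = TRUE)
  for (f in files) source(f, ...)
  invisible(files)   # return the sourced file names without printing them
}

# Demo with a throwaway directory holding one script and one non-R file.
demo_dir <- file.path(tempdir(), "r_helpers_demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("add_one <- function(x) x + 1", file.path(demo_dir, "add_one.R"))
writeLines("this is not R code", file.path(demo_dir, "notes.txt"))

sourced <- source_dir(demo_dir)
add_one(41)
```

Files are sourced in the alphabetical order returned by list.files(), so scripts that depend on each other may need explicit ordering instead.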
As a side thought, consider making some of your tools more flexible by adding more input arguments. You may find that you don't really need so many separate functions/files. As a trivial example, if you have some function do_it_double , another do_it_integer , and yet another do_it_character , all of which do basically the same thing, just merge them into a single do_it_all(x,y,datatype='double') and override the default datatype as desired. (I know this can be done with internal input validation. I'm just giving an example)
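As a sketch of what such a merged function could look like: the body below (adding x and y) is only a placeholder, since the original functions are not shown, and the vector of allowed choices with match.arg() is one possible way to handle the datatype argument rather than the only one.

```r
# One function with a datatype argument instead of three near-identical ones.
do_it_all <- function(x, y, datatype = c("double", "integer", "character")) {
  datatype <- match.arg(datatype)   # first choice ("double") is the default
  total <- x + y                    # placeholder for the shared work
  switch(datatype,
         double    = as.double(total),
         integer   = as.integer(total),
         character = as.character(total))
}

do_it_all(1, 2)                  # default: double
do_it_all(1L, 2L, "integer")     # override the default
do_it_all(1, 2, "character")
```

match.arg() also rejects unknown values of datatype with an error, which covers part of the input validation mentioned above.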
Your approach may work well. I would recommend wrapping the code in functions and using one R file per function.
It might be interesting to look at the packages devtools and ProjectTemplate, which aim to help organize R code.
I have been cribbing off of the very helpful responses on Scraping html tables into R data frames using the XML package to scrape some html off the web and work with it in R.
The XML package seems to be pretty thorough about escaping non-alphabetic characters in text strings. Is there a simple way in XML or some other package that would reverse some/all of the character escaping that passing my data through XML did? I started to do it myself, but after encountering cases like 'Representative JoaquÃÂn Castro' thought 'there must be a better solution...'
Just for clarity, using the XML package to parse this HTML
library(XML)
apos_str <- c("<b>Tim O'Reilly</b>")
apos_str.parsed <- htmlTreeParse(apos_str, error=function(...){})
apos_str.parsed$children$html[[1]][[1]]
would produce
<b>Tim O&apos;Reilly</b>
And I'd ideally like a function or package that would search for that
&apos;
and turn it back into
'<b>Tim O'Reilly</b>'
Edit: To clarify, from the comments below, I get how to do this for the particular case of apostrophes, or any other character I see in my data. What I'm looking for is a package where someone has worked this out more generally.
Research I've done so far:
-Read everything I could find in the XML documentation on escaping.
-Looked for a promising package on the CRAN NLP page.
-Did a search for 'unescape [R]' and 'reverse escape [R]' here on SO.
Wasn't able to make any headway so thought I would bring the question here.
I'm not sure I understand the difficulty. String processing for replacements is done with the base regex functions: sub, gsub, regexpr, gregexpr
?sub # the same help page will also discuss 'gsub'
txt <- "<b>Tim O&apos;Reilly</b>"
sub("&apos;", "'", txt)
[1] "<b>Tim O'Reilly</b>"
If you had a list of values that occur between "&" and ";" you could split on those and then recombine. I suppose it is possible that you were hoping someone had already done that. You should clarify what level of abstraction you were hoping to achieve.
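A minimal sketch of the lookup-table idea, assuming only the five predefined XML entities matter. The function name unescape_entities is made up, and the table would need extending for the full HTML entity list. &amp; is replaced last so that text such as &amp;lt; is decoded one level only.

```r
# Replace a small, fixed set of named entities with their characters.
unescape_entities <- function(x,
                              entities = c("&apos;" = "'",
                                           "&quot;" = "\"",
                                           "&lt;"   = "<",
                                           "&gt;"   = ">",
                                           "&amp;"  = "&")) {
  for (name in names(entities)) {
    # fixed = TRUE treats the entity as a literal string, not a regex
    x <- gsub(name, entities[[name]], x, fixed = TRUE)
  }
  x
}

unescape_entities("<b>Tim O&apos;Reilly</b>")
```

Because gsub() is vectorised, the same call works on a whole character column of a data frame.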
EDIT:
A blogger discusses the specific case of "&apos" http://fishbowl.pastiche.org/2003/07/01/the_curse_of_apos/
I've done some further research of my own. Those are not properly called "escapes" but rather "named entities". I cannot find any references to them in the rhelp archives. I have downloaded the XML listing from the w3.org website that defines these "entities" and am trying to convert it to a tabular form that would support search and replace. But your comment about 'Representative JoaquÃÂn Castro' has me puzzled. The odd characters are not in the form "&#xxx;", so what exactly are you asking for? Please post a suitable test case with the expected output.
EDIT 2: There was a basically identical question from Michael Friendly that just got answered by David Carlson on Rhelp. Here's the link to the posting on the Rhelp archives:
https://stat.ethz.ch/pipermail/r-help/2012-August/321478.html
He's already done a better job than I had on creating a translation table and has included code to march through html text. (and a bonus... he included &apos). And a next-day followup from Michael Friendly has wrapped the process up in a function. You can follow the link on the Archives page.