Converting warc.gz to .warc - webarchive

My attempt to extract a warc.gz file, using gzip, resulted in a WARC, but it won't load in http://replayweb.page.
Extracting it using The Unarchiver gave me all the expanded html and other files.
What is the latest recommended method for converting warc.gz to warc? For some reason I am coming up short in my attempts to find suggestions for this simple task.
Thanks!

The programming way is using "warcio" python lib, command-line way is using "warc2warc" utility from warctools.

Related

RDCOMClient log file

I have been using RDCOMClient for a while now to interact with vendor software. For the most part it has worked fine. Recently, however, I have the need to loop through many operations (several hundred). I am running into problems with the RDCOM.err file growing to a very large size (easily GBs). This file is put in C: with no apparent option to change that. Is there some way that I can suppress this output or specify another location for the file to go? I don't need any of the output in the file so suppressing it would be best.
EDIT: I tried to add to my script a file.remove but R has the file locked. The only way I can get the lock released is to restart R.
Thanks.
Setting the permissions to read only was going to be my suggested hack.
A slightly more elegant approach is to edit one line of the C code in the package in src/RUtils.h from
\#define errorLog(a,...) fprintf(getErrorFILE(), a, ##__VA_ARGS__); fflush(getErrorFILE());
to
\#define errorLog(a, ...) {}
However, I've pushed some simple updates to the package on github that add a writeErrors() function that one can use to toggle whether errors are written or not. So this allows this to be turned on and off dynamically.
So
library(RDCOMClient)
writeErrors(FALSE)
will turn off the error logging to the file.
I found a work around for this. I created the files C:\RDCOM.err and C:\RDCOM_server.err and marked them both as read-only. I am not sure if there is a better way to accomplish this, but for now I am running without logging.

Using glmm_funs.R?

I am fitting a GLMM and I had seen some examples where is used the function: overdisp_fun, deļ¬ned in glmm_funs.R, but I don't know which package contain them or how can I call it from R, can somebody help me?
Thanks,
If you google for glmm_funs.R, you'll find links to the script (eg here: http://glmm.wdfiles.com/local--files/trondheim/glmm_funs.R).
You can save the file on your local machine, then call it in your R session with source("path to file/glmm_funs.R").
You will then be able to use the functions contained in the script, including overdisp_fun().
You can think of it a little bit like loading a package, except the functions are just presented in a script.

Organizing R's work in functions and subfuctions

My aim is to better organize the work done by a R code.
In particular it could be useful to split the R code I have written in different R files, perhaps with each R file accomplishing to a different task. I have in mind what we can do in Matlab with different M files, where we can easily call functions written in different M files directly from the main code.
Is it useful to write this R files in the form of functions?
How can we call these R files /functions in the main code?
Thanks
You can use source("filename.R") to include the file in your main script.
I am not sure if there is a ready function to include an entire directory, but it is straightforward to write using list.files() and then call source dynamicly for each filename. You can also filter files to only list *.R for example.
Unless you intend to write an R package, you should rethink your organization. R is not Matlab, thank goodness! You can place as many functions as you wish into a single file, and make them all available in your environment with source foo.r . If you are writing a collection of generic functions and don't want to build a package, this really is the cleaner way to go.
As a side thought, consider making some of your tools more flexible by adding more input arguments. You may find that you don't really need so many separate functions/files. As a trivial example, if you have some function do_it_double , another do_it_integer , and yet another do_it_character , all of which do basically the same thing, just merge them into a single do_it_all(x,y,datatype='double') and override the default datatype as desired. (I know this can be done with internal input validation. I'm just giving an example)
Your approach might be working good. I would recommend to wrap the code in a function and use one R file for one R function.
It might be interesting to look at the packages devtools and ProjectTemplate which aim to help organizing R code.

R2PPT crashes R; are there alternatives to R2PPT?

I am attempting to automate the insertion of JPEG images into Powerpoint. I have a macro done for that already, except using R would be infinitely better for my purposes.
The package R2PPT should do this, I understand. However, I cannot use it. For example, when I try to use PPT.Open, I understand I can do it two different ways by calling method = "rcom" or method = "RDCOMClient". Using the latter, R will always crash, sending an error report to windows. Using the former, it tells me I need to install statconnDCOM , before giving the error:
Error in PPT.Open(x) : attempt to apply non-function.
I cannot install statconnDCOM freely, as I wouldn't call this work non-commercial. So if there isn't a way to get around this issue, are there at least some free alternatives to R2PPT so that I can save several hours of manual work with a simple R code? If there is a way for me to use R2PPT, that would be ideal.
Thanks!
Edit:
I'm using R version 2.15 and downloaded the most recent version of R2PPT. Powerpoint is 2007.
Do you have administrative privileges on this machine?
There is an issue with package RDCOMClient. It needs permissions to write file rdcom.err in the root of drive C:. If you don't have privileges to write to c:, there is a rather cumbersome workaround:
Close R
Create "c:\temp" folder if it doesn't exist.
Locate on your hard drive file rdcomclient.dll. It usually placed in \R\library\RDCOMClient\libs\i386\ and in \R\library\RDCOMClient\libs\x64\ (you need to patch file which corresponds your Windows version - 32 bit or 64 bit). It's recommended to make backup copy of this files before patching.
Open rdcomclient.dll in text editor (Notepad++, for example -http://notepad-plus-plus.org/)
Find in file string c:\rdcom.err - it occurs only once.
Go into overwrite mode (usually by pressing "Ins" key). It is very important that new path will have the same number of characters as original one. Type C:\temp\e.rr instead of c:\rdcom.err
Save the file.
Now all should work fine.
Arguably not an answer, but have you looked at using Sweave/knitr to render your presentations in LaTeX using something like Beamer? (As discussed on slide 17 here.)
Wouldn't help any with getting JPGs into a PowerPoint, but would certainly make putting R-output (numerical or graphical) into a presentation much easier!
Edit: if you want to use knitr (which I recommend), here's another reference.

Is there an existing way to import EDF files into R?

I have some European Data Format (EDF) files that I would like to import into R.
There are some Python libraries for parsing EDF files and the EDF spec is available, so I know it's possible, but I would avoid writing code if I could.
Does there already exist a facility for importing these kinds of files?
Was looking for the same thing. Found this function written by Fabien Feschet - works well for my data. http://feschet.fr/?p=11
Found another resource recently. This works very well. Need to download both read_edf.R and utilities.R
https://github.com/bwrc/edf/tree/master/R
I tried look for the same thing a while ago, but I couldn't find anything for R. I ended up using biosig Python module to convert edfs to ascii. There is also this edf2ascii-converter.
I guess there wasn't any package available at the time when the question was asked but now you could use edfReader:
https://cran.r-project.org/web/packages/edfReader/
https://github.com/Pisca46/edfReader

Resources