How do I read an .tar.xz file? - r

I downloaded the Gwern Branwen dataset here: https://www.gwern.net/DNM-archives
I'm trying to read the dataset in R and I'm having a lot of trouble. I tried to open one of the files in the dataset called "1776.tar.xz" and I think I "unzipped" it with untar() but I'm not getting anything past that.
untar("C:/User/user/Downloads/dnmarchives/1776.tar.xz",
files = NULL,
list = FALSE, exdir = ".",
compressed = "xz", extras = NULL, verbose = FALSE, restore_times = TRUE,
tar = Sys.getenv("TAR"))
Edit: Thanks for all of the comments so far! The code is in base R. I have multiple datasets that I downloaded from Gwern's website. I'm just trying to open one to explore.

Base R includes function untar. On my Ubuntu 19.10 running R 3.6.2, default installation, the following was enough.
fls <- list.files(pattern = "\\.xz")
untar(fls[1], verbose = TRUE)
Note.
In the question, "dataset" is singular but there were several datasets (plural) on that website. To download the files I used
args <- "--verbose rsync://78.46.86.149:873/dnmarchives/grams.tar.xz rsync://78.46.86.149:873/dnmarchives/grams-20150714-20160417.tar.xz ./"
cmd <- "rsync"
od <- getwd()
setwd('~/tmp')
system2(cmd, args)

Thanks everyone! Not sure what was wrong with r for a bit but I reinstalled. I ended up unzipping manually and loading up the files.

I find that base R's untar() is a bit unreliable and/or slow on Windows.
What worked very well for me (on all platforms) was
library(archive)
archive_extract("C:/User/user/Downloads/dnmarchives/1776.tar.xz",
dir="C:/User/user/Downloads/dnmarchives")
It supports 'tar', 'ZIP', '7-zip', 'RAR', 'CAB', 'gzip', 'bzip2', 'compress', 'lzma' and 'xz' formats.
And one can also use it directly read in a csv file within an archive without having to UNZIP it first using
read_csv(archive_read("C:/User/user/Downloads/dnmarchives/1776.tar.xz", file = 1), col_types = cols())

On Debian or Ubuntu, first install the package xz-utils
$ sudo apt-get install xz-utils
Extract a .tar.xz the same way you would extract any tar.__ file.
$ tar -xf file.tar.xz
Done.

Related

How to run R Code within Python interface?

I am trying to run below code where I want to read csv file and then write "sas7bdat". I have tried below code.
we already have prerequisite library installed on the system for R.
from rpy2 import robjects
robjects.r('''
library(haven)
data <- read_csv("filename.csv")
write_sas(data, "filename.sas7bdat")
''')
After running above code, there are no output get generated by this code and even I am not getting any error.
Expected output: trying to read .csv file and then that data i want to export in .sas7bdat format. (In Standard python 3.9.2 Editor)
python do not have such functionality/library hence I am trying this way to export data in .sas7bdat format.
Plz Suggest some change in above code or any other way in python through which I can create/export .sas7bdat format in python.
Thanks.
I had experience using R in Python Jupyter Notebooks, it is a bit complicated at beginning, but it did work. Here I just pasted my personal notes, hope these help:
# Major steps in installing "rpy2":
# Step 1: install R on Jupyter Notebook: conda install -c r r-essentials
# Step 2: install the "rpy2" Python package: pip install rpy2 (you may have to check the version)
# Step 3: create the environment variables: R_HOME, R_USER and R_LIBS_USER
# you can modify these environment variables in the system settings on your windows PC or use codes to set them every time)
# load the rpy2 module after installation
# Then you will be able to enable R cells within the Python Jupyter Notebook
# run this line in your Jupyter Notebook
%load_ext rpy2.ipython
My work was to do ggplot2 in Python, so I did:
# now use R to access this dataframe and plot it using ggplot2
# tell Jupyter Notebook that you are going to use R in this cell, and for the "test_data" generated using the Python
%%R -i test_data
library(ggplot2)
plot <- ggplot(test_data) +
geom_point(aes(x,y),size = 20)
plot
ggsave('test.png')
Please before you run the code make sure that haven and reader are installed in your R kernel.
from rpy2.robjects.packages import SignatureTranslatedAnonymousPackage
string = """
write_sas <- function(file, col_names = TRUE, write_to){
data <- readr::read_csv(file, col_names = col_names)
haven::write_sas(data, path = write_to)
print(paste("Data is written to ", write_to))
}
"""
rwrap = SignatureTranslatedAnonymousPackage(string, "rwrap")
rwrap.write_sas( file = "https://robjhyndman.com/data/ausretail.csv",
col_names = False,
write_to = "~/Downloads/filename.sas7bdat")
You can use any of the R function arguments. same as I used col_names

How to export sf object to GDB using RPyGeo in R (Windows)?

I have a bunch of sf objects I'd like to export to GDB from R. I'm running R 4.0.2 on Windows 10. In this case the sf objects are all vector point data. The main reasons to export to GDB are to keep longer field names (the shapefile truncation is very annoying), and because GDBs are more desirable storage locations for our workflows.
Yes, I know about the ArcGisBinding package. I've got it to work in a test script but it's pretty unstable - often crashing and requiring a restart of R. This is a problem, because the sf objects I'd like to export come after an already long Rmd that reads in, formats and cleans the data. So it's not a simple manner of re-running the script until arc.write doesn't break. I could break up the script, but then I'd still have to read in a bunch of shapefiles. One option I haven't yet explored is using reticulate to call a python script instead of trying to do everything in R, but we're trying to do our analysis all in one place, if possible.
I'm pretty sure I've managed to set up RPyGeo appropriately, first setting my python path using the reticulate package. I'm doing it this way because IT restrictions means I can't edit PATH variables on my machine.
#package calls
library(sf)
library(spData)
library(reticulate)
#set python version in reticulate
py_path <- "C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
reticulate::use_python(python = py_path, required = TRUE)
#call RPyGeo
library(RPyGeo) # for potential point export
#output gdb
out.gdb <- "C:/LOCAL_PROJECTS/Output/Output.gdb"
#RPyGeo Parameters
# Note that, in order to use RPyGeo you need a working ArcMap or ArcGIS Pro installation on your computer.
# python path - note that this will change depending on which version of Arc one is using
# py_path <- "C:/Program Files/ArcGIS/Pro/bin/Python/envs/arcgispro-py3/python.exe"
arcpy <- rpygeo_build_env(workspace = out.gdb,
overwrite = TRUE,
extensions = c("Spatial","DataInteroperability"),
path = py_path)
I've tried a bunch of different tools to export an sf object, here using dummy data also used in the RPyGeo vignette
data(nz, package = "spData")
arcpy$Copy_management(in_data = nz,out_data = "nz_test")
arcpy$Copy_management(in_data = nz,out_data = file.path(out.gdb,"nz"))
arcpy$FeatureClassToGeodatabase_conversion(Input_Features = nz,Output_Geodatabase = out.gdb)
arcpy$FeatureClassToFeatureClass_conversion(in_features = nz,out_path = out.gdb,out_name = "nz")
arcpy$QuickExport_interop(Input = nz,Output = file.path(out.gdb,"nz"))
arcpy$CopyFeatures_management(in_features = nz,out_feature_class = file.path(out.gdb,"nz"))
arcpy$CopyFeatures_management(in_features = nz,out_feature_class = "nz")
Each time I get an error, for example:
Error in py_call_impl(callable, dots$args, dots$keywords) :
RuntimeError: Object: Error in executing tool
Detailed traceback:
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3232, in CopyFeatures
raise e
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\management.py", line 3229, in CopyFeatures
retval = convertArcObjectToPythonObject(gp.CopyFeatures_management(*gp_fixargs((in_features, out_feature_class, config_keyword, spatial_grid_1, spatial_grid_2, spatial_grid_3), True)))
File "C:\Program Files\ArcGIS\Pro\Resources\ArcPy\arcpy\geoprocessing\_base.py", line 511, in <lambda>
return lambda *args: val(*gp_fixargs(args, True))
I'm not an expert in ArcPy by any means. Nor am I an expert in tracing errors inside packages. Am I making a simple syntax mistake? Is there something else that I'm missing? Any help would be much appreciated!

TreeTagger in R

I have downloaded TreeTaggerv3.2 for Windows and have configured it per the install.txt. I am trying to use it in R with koRpus package. I have set the kRp.env as -
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en",
preset="en", treetagger="manual", format="file",
TT.tknz=TRUE, encoding="UTF-8" )
.My data to be tagged is in a file and trying to use it as treetag("myfile.txt") but it is throwing the error-
Error in matrix(unlist(strsplit(tagged.text, "\t")), ncol = 3, byrow = TRUE, :
'data' must be of a vector type, was 'NULL'
In addition: Warning message:
running command 'C:\windows\system32\cmd.exe /c C:\TreeTagger\bin\tag-english.bat
C:\Users\vivsingh\Desktop\NLP\tree_tag_ex.txt' had status 255
The standalone TreeTagger is working on by windows.Any idea on how it works?
I had the exact same error and warning while trying lemmatization on R word vector following Bernhard Learns blog using windows 7 and R 3.4.1 (x64). The issue was also appearing using textstem package but TreeTagger was running properly in cmd window.
I mixed several answers I found on this post and here is my steps and code running properly:
get into R win_library (~\Documents\R\win-library\3.4\rJava\jri\x64\jri.dll) and copy jri.dll (thanks kravi!) to replace it the parent folder.
close and restart R
library(koRpus)
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
lemma_tagged <- treetag(lemma_unique$word_clean, treetagger="manual", format="obj", TT.tknz=FALSE , lang="en", TT.options=list(path="c:/TreeTagger", preset="en"))
lemma_tagged_tbl <- tbl_df(lemma_tagged#TT.res)
Hope it helps.
I am posting this answer to keep a record. I also faced the same issue due to incorrect specification of the location of jri.dll on 64-Bit processor and windows 8.1. If we call
set.kRp.env(TT.cmd="manual", lang="en", TT.options=list(path="/path/to/tree-tagger-windows-x.x/TreeTagger", preset="en")) and we follow either of following two steps, we can resolve this error:
While installing R, if we install only 64 Bit version of R, and
specify the proper path for these variables
LD_LIBRARY_PATH = /path/to/rJava/jri
JAVA_HOME = /path/to/jdk1.x.x
java.library.path = /path/to/rJava/jri/jri.dll
CLASSPATH = /path/to/rJava/jri
If we already installed both versions viz. 32 bit and 64 bit of R on your computer then just copy jri.dll from /path/to/rJava/jri/x64/jri.dll and replace at path/to/rJava/jri/jri.dll. Further, we need to set the path of above mentioned four variables.
I've got this issue (very similar I guess) and posted query to GitHub.
https://github.com/unDocUMeantIt/koRpus/issues/7
The current working solution for me for this case was easier than I could expect, just downgrading the koRpus package. This can change with time but this version should remain appropriate.
library("devtools")
install_github("unDocUMeantIt/koRpus", ref="0.06-5")
This package is not Java related they said.
You can face the same error while setting up the korpus environment and getting the result from treetagger. For example, when you use:
tagged.text <- treetag(
"C:/temp/sample_text.txt",
treetagger = "manual",
lang = "en",
TT.options = list(
path = "c:/Treetagger",
preset = "en"
),
doc_id = "sample"
)
You would receive a similar error
Error: Awww, this should not happen: TreeTagger didn't return any useful data.
This can happen if the local TreeTagger setup is incomplete or different from what presets expected.
You should re-run your command with the option 'debug=TRUE'. That will print all relevant configuration.
Look for a line starting with 'sys.tt.call:' and try to execute the full command following it in a command line terminal. Do not close this R session in the meantime, as 'debug=TRUE' will keep temporary files that might be needed.
If running the command after 'sys.tt.call:' does fail, you'll need to fix the TreeTagger setup.
If it does not fail but produce a table with proper results, please contact the author!
Here you need to change the value of treetagger, from
treetagger = "manual"
to
treetagger = "kRp.env"
However, before that remember to set the kRp.env as #Xochitl C. suggested in their answer
set.kRp.env(TT.cmd="C:\\TreeTagger\\bin\\tag-english.bat", lang="en", preset="en", treetagger="manual", format="file", TT.tknz=TRUE, encoding="UTF-8")
Once you do this, you'll get the desired result.

issue with get_rollit_source

I tried to use get_rollit_source from the RcppRoll package as follows:
library(RcppRoll)
get_rollit_source(roll_max,edit=TRUE,RStudio=TRUE)
I get an error:
Error in get("outFile", envir = environment(fun)) :
object 'outFile' not found
I tried
outFile="C:/myDir/Test.cpp"
get_rollit_source(roll_max,edit=TRUE,RStudio=FALSE,outFile=outFile)
I get an error:
Error in get_rollit_source(roll_max, edit = TRUE, RStudio = FALSE, outFile = outFile) :
File does not exist!
How can fix this issue?
I noticed that the RcppRoll folder in the R library doesn't contain any src directory. Should I download it?
get_rollit_source only works for 'custom' functions. For things baked into the package, you could just download + read the source code (you can download the source tarball here, or go to the GitHub repo).
Anyway, something like the following should work:
rolling_sqsum <- rollit(final_trans = "x * x")
get_rollit_source(rolling_sqsum)
(I wrote this package quite a while back when I was still learning R / Rcpp so there are definitely some rough edges...)

Accessing "examples" subdirectory in R-package

I using a CRAN package which contains a subdirectory "examples/" containing a file "ex.txt". How do I access this file?
I tried
require("XX")
read.table(paste(.path.package("XX"), "/examples/ex.txt", sep=""), header=TRUE, sep="\t")
but then the file is not found. When I look in the installation directory of the package, I indeed see no "examples/" subdirectory. However, when I run R CMD check and R CMD INSTALL on the package source, I get no warnings about the "examples/" subdirectory. So the package installs without problems, but omits the examples. What do I have to do in order to access the files in "examples/"?
At first I misread your question and thought you were the package author. The problem is that as you noticed examples doesn't get copied in when installed. A solution would be for the package authors to put the folder in /inst/examples instead of /examples. Since you don't have control of that we can create a workaround by downloading the source and then using that instead.
# Downloads the source code for a package
# Extracts it to a temporary directory
downloadAndExtract <- function(package, tdir = tempdir()){
down <- download.packages(package, destdir = tdir)
targz <- down[,2]
untar(targz, exdir = tdir)
file.path(tdir, package)
}
path <- downloadAndExtract("XX")
filepath <- file.path(path, "examples", "ex.txt")
dat <- read.table(filepath, header = TRUE, sep = "\t")
Clearly this isn't ideal but since you won't find that file in the installed package we need to resort to some sort of workaround...

Resources