I'm trying to convert a large JSON file (6GB) into a CSV to more easily load it into R. I happened upon this solution (from https://community.rstudio.com/t/how-to-read-large-json-file-in-r/13486/33):
library(sparklyr)
library(dplyr)
library(jsonlite)
Sys.setenv(SPARK_HOME="/usr/lib/spark")
# Configure cluster (c3.4xlarge 30G 16core 320disk)
conf <- spark_config()
conf$'sparklyr.shell.executor-memory' <- "7g"
conf$'sparklyr.shell.driver-memory' <- "7g"
conf$spark.executor.cores <- 20
conf$spark.executor.memory <- "7G"
conf$spark.yarn.am.cores <- 20
conf$spark.yarn.am.memory <- "7G"
conf$spark.executor.instances <- 20
conf$spark.dynamicAllocation.enabled <- "false"
conf$maximizeResourceAllocation <- "true"
conf$spark.default.parallelism <- 32
sc <- spark_connect(master = "local", config = conf, version = '2.2.0')
sample_tbl <- spark_read_json(sc, name = "example", path = "example.json", header = TRUE,
                              memory = FALSE, overwrite = TRUE)
sdf_schema_viewer(sample_tbl)
I've never used Spark before, and I'm trying to understand where the data I loaded actually lives relative to RStudio, and how I can write the data out to a CSV.
I'm not sure about sparklyr, but if you are trying to read a large JSON file and write it to a CSV using SparkR, below is sample code for that.
This code will only run in a Spark environment, not in plain RStudio.
# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files.
path <- "examples/src/main/resources/people.json"
# Create a DataFrame from the file(s) pointed to by path
people <- read.json(path)
# Write data frame to CSV
write.df(people, "people.csv", "csv")
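A note on the sparklyr side: after spark_read_json() the data lives in Spark, not in RStudio's memory; sample_tbl is only a reference to it. A minimal sketch of writing it out to CSV directly from Spark (the output path "example_csv" is a name I chose for illustration):
library(sparklyr)
# sample_tbl is the Spark table returned by spark_read_json() above
# spark_write_csv() makes Spark write a directory of CSV part files
spark_write_csv(sample_tbl, path = "example_csv", header = TRUE)
# only pull data into the R session if it fits in memory, e.g.
# small_df <- dplyr::collect(head(sample_tbl, 1000))
Because Spark does the writing, you get a folder of part files rather than one CSV; they can be concatenated afterwards if a single file is required.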
I need your help; I am new to R.
The scenario: I have a list of SAS datasets in a specific location.
path <- 'C:\\XXXX\\XXX'
files <- list.files(path = path, pattern = "*.sas7bdat", full.names = FALSE)
The files variable gives the list of file names available in that directory.
I want to use the file name (with the extension stripped off via a split, stored in the domain_name variable) as the data frame name.
Iterating over each filename, I want to import the SAS dataset and create each data frame dynamically (for instance, if there are 30 SAS datasets, 30 R data frames should be created).
library(haven)
for (i in 1:length(files)) {
  domain_name = strsplit(i, split = '.sas7bdat', fixed = TRUE)
  domain_name <- read_sas(data_file = paste(path, i, sep = '/'))
}
Could you explain the concept and fix this problem?
Thanks in advance.
The following should in principle work. As there is no real example I can only guess.
path <- 'C:/path2file/'
print(path)
files <- list.files(path = path, pattern="*.sas7bdat", full.names=FALSE)
print(files)
mydf <- list()
for (i in 1:length(files)) {
  filename <- paste0(path, files[i])
  print(filename)
  # browser() # if you like to step through the loop
  mydf[[i]] <- haven::read_sas(data_file = filename)   # collect everything in a list
  print(names(mydf[[i]]))
  # additionally create an individually named data frame (mydf_1, mydf_2, ...)
  eval(parse(text = paste0("mydf_", i, " <- haven::read_sas(data_file=filename)")))
}
Then you can access each data.frame via e.g. df1 <- mydf[[1]]
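If the goal from the question is to end up with one data frame per dataset, named after the file, here is a small sketch building on the above; tools::file_path_sans_ext() and list2env() are my additions, not part of the original answer:
library(haven)
path <- 'C:/path2file/'
files <- list.files(path = path, pattern = "\\.sas7bdat$", full.names = FALSE)
# read every file into a named list; names are the file names without the extension
mydf <- lapply(files, function(f) read_sas(data_file = paste0(path, f)))
names(mydf) <- tools::file_path_sans_ext(files)
# optionally expose each element as its own data frame in the global environment
list2env(mydf, envir = globalenv())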
This is a rather atypical scenario: I am using an R custom visual in Power BI to plot a raster, and the only way to pass data in is as a data frame.
This is what I have done so far:
generate a raster in R
save it to a file using saveRDS()
encode the file as base64 and save it as a CSV.
Now, using the code below, I manage to read the CSV, load it into a data frame, and combine all the rows.
My question is: how do I decode it back to a raster object?
Here is a reproducible example:
# Input load. Please do not change #
`dataset` = read.csv('https://raw.githubusercontent.com/djouallah/keplergl/master/raster.csv', check.names = FALSE, encoding = "UTF-8", blank.lines.skip = FALSE);
# Original Script. Please update your script content here and once completed copy below section back to the original editing window #
library(caTools)
library(readr)
dataset$Value <- as.character(dataset$Value)
dataset <- dataset[order(dataset$Index),]
z <- paste(dataset$Value)
Raster <- base64decode(z,"raw")
It turned out the solution is very easy: saveRDS() has an option to save in ASCII format with ascii = TRUE.
saveRDS(background, 'test.rds', ascii = TRUE, compress = FALSE)
Now I just read it back as a human-readable format (which is easy to load into Power BI) and it works:
library(raster)   # for plotRGB()
fil <- 'https://raw.githubusercontent.com/djouallah/keplergl/master/test.rds'
cony <- gzcon(url(fil))
XXX <- readRDS(cony, refhook = NULL)
plotRGB(XXX)
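For completeness, the base64 route asked about in the question can in principle be reversed without the ASCII trick: decode the text back to raw bytes, write them to a temporary file, and readRDS() that file. This is a sketch under my assumptions (dataset as loaded in the question, and the whole RDS file encoded as one base64 stream), not part of the original answer:
library(caTools)
library(raster)
# rebuild a single base64 string from the ordered rows, then decode to raw bytes
z <- paste(dataset$Value, collapse = "")
raw_bytes <- base64decode(z, "raw")
tmp <- tempfile(fileext = ".rds")
writeBin(raw_bytes, tmp)        # restore the original RDS file on disk
background <- readRDS(tmp)      # back to the raster object
plotRGB(background)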
I am trying to automatically download a bunch of zip files using R. Each zip contains a wide variety of files, and I only need to load one of them as a data.frame to post-process it. It has a unique name, so I could catch it with str_detect(). However, using tempfile(), I cannot get a list of all the files within the archive using list.files().
This is what I've tried so far:
temp <- tempfile()
download.file("https://url/file.zip", destfile = temp)
files <- list.files(temp) # this is where I only get "character(0)"
# After, I'd like to use something along the lines of:
data <- read.table(unz(temp, str_detect(files, "^file123.txt")), header = TRUE, sep = ";")
unlink(temp)
I know that the read.table() command probably won't work, but I think I'll be able to figure that out once I get a vector with the list of the files within temp.
I am on a Windows 7 machine and I am using R 3.6.0.
Following what was said before, this structure should let you download to a temporary file and check its contents. Note that list.files() lists a directory, so for a zip file you need unzip(..., list = TRUE) to see what is inside:
temp <- tempfile(fileext = ".zip")
download.file("https://url/file.zip", destfile = temp, mode = "wb")   # mode = "wb" matters on Windows
files <- unzip(temp, list = TRUE)$Name
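From there, a sketch of pulling out just the file you need, continuing from the lines above (the pattern "^file123" reuses the example name from the question; str_subset() returns the matching file names rather than a logical vector):
library(stringr)
target <- str_subset(files, "^file123")            # the uniquely named file inside the zip
data <- read.table(unz(temp, target), header = TRUE, sep = ";")
unlink(temp)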
I want to fetch a parquet file from my S3 bucket using R. Spark is not installed on my server.
How can I read and write parquet files in R without Spark? I am able to read and write data from S3 in other formats, but not in parquet format.
My code is given below.
# Read csv file from s3
library(aws.s3)
obj <-get_object("s3://mn-dl.sandbox/Internal Data/test.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
data1 <-data
#Write csv data directly to s3
s3write_using(data1, FUN = write.csv,
              bucket = "mn-dl.sandbox",
              object = "Internal Data/abc.csv")
Thanks in advance
I'm definitely a rookie with R and AWS, so hopefully this is a universal solution and not just one that worked for me, but here's what I did:
install.packages("paws")
install.packages("arrow")
library(paws)
library(arrow)
s3 <- paws::s3(config=list(<your configurations here to give access to s3>))
object <- s3$get_object(Bucket = "path_to_bucket", Key = "file_name.parquet")
data <- object$Body
read_parquet(data)
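For the write direction with the same paws/arrow combination, here is a sketch under my assumptions (write to a local temporary file first, then upload the raw bytes; the bucket and key names are placeholders like above):
library(paws)
library(arrow)
s3 <- paws::s3(config = list())   # supply your credentials/region as in the read example
tmp <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tmp)        # any data frame; mtcars is just a stand-in
s3$put_object(Bucket = "path_to_bucket",
              Key = "file_name.parquet",
              Body = readBin(tmp, "raw", n = file.size(tmp)))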
I have been using the read function, but this write function should work.
Even though you're using the arrow package, you shouldn't need a Spark server installed.
install.packages("aws.s3")
install.packages("arrow")
library(aws.s3)
library(arrow)
# Read data
obj <-get_object("s3://mn-dl.sandbox/Internal Data/test.csv")
csvcharobj <- rawToChar(obj)
con <- textConnection(csvcharobj)
data <- read.csv(file = con)
data1 <- data
# Write the data frame directly to s3 in parquet format
aws.s3::s3write_using(x = data1,
                      FUN = arrow::write_parquet,
                      bucket = "mn-dl.sandbox",
                      object = "Internal Data/abc.parquet")
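The read direction can follow the same pattern; this sketch is my addition rather than part of the original answer, reusing the bucket and object names from the write example:
library(aws.s3)
library(arrow)
# read the parquet object straight back into a data frame
data2 <- aws.s3::s3read_using(FUN = arrow::read_parquet,
                              bucket = "mn-dl.sandbox",
                              object = "Internal Data/abc.parquet")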
I have downloaded multiple zip files from a website. Each zip file contains multiple files with html and xml extensions (~100K in each).
It is possible to manually extract the files and then parse them. However, I would like to be able to do this within R (if possible).
Here is an example file (sorry, it is a bit big), downloaded using code from a previous question:
library(XML)
pth <- "http://download.companieshouse.gov.uk/en_monthlyaccountsdata.html"
doc <- htmlParse(pth)
myfiles <- doc["//a[contains(text(),'Accounts_Monthly_Data')]", fun = xmlAttrs][[1]]
fileURLS <- file.path("http://download.companieshouse.gov.uk", myfiles)[[1]]
# create the download directory plus the cache directory used later by xbrlDoAll()
dir.create(file.path("temp", "hmrcCache"), recursive = TRUE)
download.file(fileURLS, destfile = file.path("temp", myfiles))
I can parse the files using the XBRL package if I manually extract them. This can be done as follows:
library(XBRL)
inst <- file.path("temp", "Prod224_0004_00000121_20130630.html")
out <- xbrlDoAll(inst, cache.dir="temp/hmrcCache", prefix.out=NULL, verbose=T)
I am struggling with how to extract these files from the zip archive and parse each one, say in a loop using R, without manually extracting them.
I tried making a start, but I don't know how to progress from here. Thanks for any advice.
# Get names of files
lst <- unzip(file.path("temp", myfiles), list=TRUE)
dim(lst) # 118626
# unzip and extract first file
nms <- lst$Name[1] # Prod224_0004_00000121_20130630.html
lst2 <- unz(file.path("temp", myfiles), filename=nms)
I am using Windows 8.1
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Using the suggestion from Karsten in the comments, I unzipped the files to a temporary directory, and then parsed each file. I used the snow package to speed things up.
library(snow)
# Parse one zip file to start
temp <- "temp"                         # directory the zip was downloaded to (see above)
fls <- list.files(temp)[[1]]
# Unzip into a temporary directory
tmp <- tempdir()
lst <- unzip(file.path(temp, fls), exdir = tmp)
# Only parse the first 10 extracted files
inst <- lst[1:10]
# Start to parse - in parallel
cl <- makeCluster(parallel::detectCores())
clusterCall(cl, function() library(XBRL))
# Time the run
st <- Sys.time()
out <- parLapply(cl, inst, function(i)
  xbrlDoAll(i,
            cache.dir = "temp/hmrcCache",
            prefix.out = NULL, verbose = TRUE))
stopCluster(cl)
Sys.time() - st
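To cover the original question of handling every downloaded zip without manual extraction, here is a sketch extending the code above; the directory names follow the earlier snippets, and parsing all ~100K files per archive in one go is a memory and time question, so you may want to chunk it:
library(snow)
zips <- list.files("temp", pattern = "\\.zip$", full.names = TRUE)
cl <- makeCluster(parallel::detectCores())
clusterCall(cl, function() library(XBRL))
results <- list()
for (z in zips) {
  exdir <- tempfile("xbrl_")               # fresh temporary directory per archive
  files <- unzip(z, exdir = exdir)         # extract, keeping the returned file paths
  results[[basename(z)]] <- parLapply(cl, files, function(i)
    XBRL::xbrlDoAll(i, cache.dir = "temp/hmrcCache", prefix.out = NULL, verbose = FALSE))
  unlink(exdir, recursive = TRUE)          # clean up before the next archive
}
stopCluster(cl)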