s3.save a json file to aws s3 - r

I'm trying to save a correctly formatted json file to aws s3.
I can save a regular data frame to s3 with e.g.
library(tidyverse)
library(aws.s3)
s3save(mtcars, bucket = "s3://ourco-emr/", object = "tables/adhoc.db/mtcars/mtcars")
But I need to get mtcars into json format. Specifically ndjson.
I am able to create a correctly formatted json file with, e.g.:
predictions_file <- file("mtcars.json")
jsonlite::stream_out(mtcars, predictions_file)
This saves a file to my directory called mtcars.json.
However, with the aws.s3 function s3save(), I need to send an object that's in memory, not a file.
Tried:
predictions_file <- file("mtcars.json")
s3write_using(mtcars,
              FUN = jsonlite::stream_out,
              con = predictions_file,
              "s3://ourco-emr/",
              object = "tables/adhoc.db/mtcars/mtcars")
Gives:
Error in if (verbose) message("opening ", is(con), " output connection.") :
argument is not interpretable as logical
I tried the same code block but left out the con = predictions_file line; that just gave:
Argument con must be a connection.
If the function jsonlite::stream_out() creates a correctly formatted json file, how can I then write that file to s3?
Edit:
The desired json output would look like this:
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3,"wt":2,"qsec":16,"vs":0,"am":1,"gear":4,"carb":4,"year":"2020","month":"03","day":"05"}
{"mpg":21,"cyl":6,"disp":160,"hp":110,"drat":3,"wt":2,"qsec":17,"vs":0,"am":1,"gear":4,"carb":4,"year":"2020","month":"03","day":"05"}
{"mpg":22,"cyl":4,"disp":108,"hp":93,"drat":35,"wt":2,"qsec":18,"vs":1,"am":1,"gear":4,"carb":1,"year":"2020","month":"03","day":"05"}
{"mpg":21,"cyl":6,"disp":258,"hp":110,"drat":8,"wt":3,"qsec":19,"vs":1,"am":0,"gear":3,"carb":1,"year":"2020","month":"03","day":"05"}
{"mpg":18,"cyl":8,"disp":360,"hp":175,"drat":3,"wt":3,"qsec":17,"vs":0,"am":0,"gear":3,"carb":2,"year":"2020","month":"03","day":"05"}
When attempting with readChar():
mtcars_string <- readChar("mtcars.json", 1e6)
s3save(mtcars_string, bucket = "s3://ourco-emr/", object = "tables/adhoc.db/mtcars/2020/03/06/mtcars")
If I then download and open the resulting json file, it looks like this:
5244 5833 0a58 0a00 0000 0300 0306 0000
0305 0000 0000 0555 5446 2d38 0000 0402
0000 0001 0004 0009 0000 000d 6d74 6361
7273 5f73 7472 696e 6700 0000 1000 0000
0100 0400 0900 0012 347b 226d 7067 223a
3231 2c22 6379 6c22 3a36 2c22 6469 7370
So it looks like a serialized R object (.Rdata) has been sent to AWS S3 as opposed to json.

I had the same problem. I need to write and upload JSON lines (ndjson) to S3 and, as far as I know, only stream_out() from the jsonlite package writes JSON lines.
stream_out() only accepts connection objects as a destination. s3write_using(), however, writes to a temporary file tmp and passes the path to that file as a string to FUN, so stream_out() throws the error:
Argument con must be a connection.
A tentative fix is to modify s3write_using() so that it passes a connection to FUN instead of a file path string.
Running trace(s3write_using, edit = TRUE) opens an editor.
Change line 5:
value <- FUN(x, tmp, ...)
To this:
value <- FUN(x, file(tmp), ...)
You can then upload the data using stream_out():
s3write_using(x = data,
              FUN = stream_out,
              bucket = 'mybucket',
              object = 'my/object.json',
              opts = list(acl = "private", multipart = FALSE, verbose = TRUE, show_progress = TRUE))
The edit remains for the whole session or until you do untrace(s3write_using).
Someone should probably file a request on the cloudyr/aws.s3 GitHub, as this seems to be a common use case.
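Alternatively, if you would rather not trace into s3write_using(), a rough sketch is to write the ndjson to a local temporary file with stream_out() yourself and upload that file unchanged with put_object(). The bucket and object path below are the ones from the question; adjust as needed.
library(aws.s3)
library(jsonlite)

# Write ndjson to a local temporary file first
tmp <- tempfile(fileext = ".json")
stream_out(mtcars, file(tmp))

# Then upload the file as-is; "ourco-emr" is the bucket name from the question
put_object(file = tmp,
           object = "tables/adhoc.db/mtcars/mtcars.json",
           bucket = "ourco-emr")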

Related

Converting .docx files to .pdf files with docx2pdf

Not sure what I am doing wrong.
I want to convert multiple .docx files to .pdf files, each into a separate PDF.
I decided to use the doconv package with the following command:
docx_files <- list.files(pattern = paste0("Protokollnr_"))[39:73]
docx_files %>% length
lapply(1:35, function(x) {
  docx2pdf(input = docx_files[[x]],
           output = tempfile(fileext = ".pdf"))})
It does not say anything specific in the error message, only that the file cannot be converted.
Should I have specified the full file path? At the moment I only give the file name in my working directory.
The object docx_files contains:
c("Protokollnr_1.docx", "Protokollnr_10.docx", "Protokollnr_11.docx",
"Protokollnr_12.docx", "Protokollnr_13.docx", "Protokollnr_14.docx",
"Protokollnr_15.docx", "Protokollnr_16.docx", "Protokollnr_17.docx",
"Protokollnr_18.docx", "Protokollnr_19.docx", "Protokollnr_2.docx",
"Protokollnr_20.docx", "Protokollnr_21.docx", "Protokollnr_22.docx",
"Protokollnr_23.docx", "Protokollnr_24.docx", "Protokollnr_25.docx",
"Protokollnr_26.docx", "Protokollnr_27.docx", "Protokollnr_28.docx",
"Protokollnr_29.docx", "Protokollnr_3.docx", "Protokollnr_30.docx",
"Protokollnr_31.docx", "Protokollnr_32.docx", "Protokollnr_33.docx",
"Protokollnr_34.docx", "Protokollnr_35.docx", "Protokollnr_4.docx",
"Protokollnr_5.docx", "Protokollnr_6.docx", "Protokollnr_7.docx",
"Protokollnr_8.docx", "Protokollnr_9.docx")
The error message is:
Error in docx2pdf(input = docx_files[[x]], output = tempfile(fileext = ".pdf")) :
could not convert C:/Users/Nadine/OneDrive/Documents/Arbeit_Büro_papa/Protokolle_Sallapulka/fertige_Protokolle/Protokollnr_1.docx
Many thanks,
Nadine
I'd recommend specifying the full file path, since the function expects the following format:
docx2pdf(input, output = gsub("\\.docx$", ".pdf", input))
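For instance, a sketch that builds full paths with list.files(full.names = TRUE); the folder is the one shown in your error message, and the pattern/indexing should be adjusted to your needs:
library(doconv)

# full.names = TRUE returns complete paths instead of bare file names
docx_files <- list.files(path = "C:/Users/Nadine/OneDrive/Documents/Arbeit_Büro_papa/Protokolle_Sallapulka/fertige_Protokolle",
                         pattern = "Protokollnr_", full.names = TRUE)

# Convert each .docx next to the original, swapping the extension
lapply(docx_files, function(f) {
  docx2pdf(input = f, output = gsub("\\.docx$", ".pdf", f))
})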

How to get rid of embedded NUL on a raw vector?

I'm scraping an ASP.NET website.
This returns a raw element (reporte_nacido), which is a csv file (tab as delimiter):
reporte_nacido = postForm('https://xxxxx/WebSiteNDE/BirthsPages/FiltrosExcelNac.aspx',
                          .params = params,
                          curl = curl,
                          .opts = RCurl::curlOptions(ssl.verifypeer = FALSE, verbose = T))
If I load the file in a text viewer, it looks like readable tab-separated text.
Now I'm trying to load that raw element within R, but I get the following error. I believe the file downloaded from the server is somehow corrupted and R is being picky about it:
rawToChar(as.vector(unlist(reporte_nacido)))
Error in rawToChar(as.vector(unlist(reporte_nacido))) :
embedded nul in string: '\xfe\xff\0N\0\xda\0M\0E\0R\0O\0 \0C\0E\0R\0T\0I\0F\0I\0C\0A\0D\0O\0\t\0D\0E\0P\0A\0R\0T\0A\0M\0E\0N\0T\0O\0\t\0M\0U\0N\0I\0C\0I\0P\0I\0O\0\t\0A\0R\0E\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0I\0N\0S\0P\0E\0C\0C\0I\0O\0N\0 \0C\0O\0R\0R\0E\0G\0I\0M\0I\0E\0N\0T\0O\0 \0O\0 \0C\0A\0S\0E\0R\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0S\0I\0T\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0C\0\xd3\0D\0I\0G\0O\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0N\0O\0M\0B\0R\0E\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0S\0E\0X\0O\0\t\0P\0E\0S\0O\0 \0(\0G\0r\0a\0m\0o\0s\0)\0\t\0T\0A\0L\0L\0A\0 \0(\0C\0e\0n\0t\0\xed\0m\0e\0t\0r\0o\0s\0)\0\t\0F\0E\0C\0H\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0H\0O\0R\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0P\0A\0R\0T\0O\0 \0A\0T\0E\0N\0D\0I\0D\0O\0 \0P\0O\0R\0\t\0T\0I\0E\0M\0P\0O\0 \0D\0E\0 \0G\0E\0S\0T\0A\0C\0I\0\xd3\0N\0\t\0N\0\xda\0M\0E\0R\0O\0 \0C\0O\0N\0S\0U\0L\0T\0A\0S\0 \0P\0R\0E\0N\0A\0T\0A\0L\0E\0S\0\t\0T\0I\0P\0O\0 \0P\0A
The raw vector you are getting is text encoded as UTF-16. You can convert it like this:
library(stringi)
raw_vec <- as.vector(unlist(reporte_nacido))
decoded <- stri_encode(raw_vec, "UTF16")
decoded
#> [1] "NÚMERO CERTIFICADO\tDEPARTAMENTO\tMUNICIPIO\tAREA NACIMIENTO\tINSPECCION CORREGIMIENTO O CASERIO NACIMIENTO\tSITIO NACIMIENTO\tCÓDIGO INSTITUCIÓN\tNOMBRE INSTITUCIÓN\tSEXO\tPESO (Gramos)\tTALLA (Centímetros)\tFECHA NACIMIENTO\tHORA NACIMIENTO\tPARTO ATENDIDO POR\tTIEMPO DE GESTACIÓN\tNÚMERO CONSULTAS PRENATALES\tTIPO PA"
It appears to be tab-separated rather than comma-separated, so you probably want to read it like this:
read.table(text = decoded, sep = "\t", header = TRUE)
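If you prefer to stay in base R, iconv() also accepts a list with a raw element, so something along these lines should work as well (the leading fe ff bytes in your dump are a UTF-16 byte-order mark); this is a sketch, not tested against your server response:
# Base-R alternative: convert the raw vector from UTF-16 to UTF-8
raw_vec <- as.vector(unlist(reporte_nacido))
decoded <- iconv(list(raw_vec), from = "UTF-16", to = "UTF-8")
nacidos <- read.table(text = decoded, sep = "\t", header = TRUE)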

The encryption won't decrypt

I was given an encrypted copy of the study guide here, but how do you decrypt and read it?
In a file called pa11.py write a method called decode(inputfile,outputfile). Decode should take two parameters - both of which are strings. The first should be the name of an encoded file (either helloworld.txt or superdupertopsecretstudyguide.txt or yet another file that I might use to test your code). The second should be the name of a file that you will use as an output file.
Your method should read in the contents of the inputfile and, using the scheme described in the hints.txt file above, decode the hidden message, writing to the outputfile as it goes (or all at once when it is done depending on what you decide to use).
The penny math lecture is here.
"""
Program: pennyMath.py
Author: CS 1510
Description: Calculates the penny math value of a string.
"""
# Get the input string
original = input("Enter a string to get its cost in penny math: ")
cost = 0
Go through each character in the input string
for char in original:
value = ord(char) #ord() gives us the encoded number!
if char>="a" and char<="z":
cost = cost+(value-96) #offset the value of ord by 96
elif char>="A" and char<="Z":
cost = cost+(value-64) #offset the value of ord by 64
print("The cost of",original,"is",cost)
Another hint: Don't forget about while loops...
Another hint: After letters, skip ahead by their pennymath value + 2 positions.
After numbers, skip ahead by their number + 7 positions.
After anything else, just skip ahead by 1 position.
The issue I'm having is that I can't seem to get the code right to decode the file; the output comes out looking the same as the input. This is the current code I have been using, but once I try to decrypt the message it stays the same.
def pennycost(c):
    if c >= "a" and c <= "z":
        return ord(c) - 96
    elif c >= "A" and c <= "Z":
        return ord(c) - 64

def decryption(inputfile, outputfile):
    with open(inputfile) as f:
        fo = open(outputfile, "w")
        count = 0
        while True:
            c = f.read(1)
            if not c:
                break
            if count > 0:
                count = count - 1
                continue
            elif c.isalpha():
                count = pennycost(c)
                fo.write(c)
            elif c.isdigit():
                count = int(c)
                fo.write(c)
            else:
                count = 6
                fo.write(c)
        fo.close()

inputfile = input("Please enter the input file name: ")
outputfile = input("Please enter the output file name (EXISTING FILE WILL BE OVERWRITTEN!): ")
decryption(inputfile, outputfile)

How to set a keyword to write fully to the CSV file

This script works insofar as the printed output is correct. However, it is not fully populating the CSV file; the file only contains the last iteration of the loop. Being new to IDL, I need to grasp this concept of the keyword.
I believe I need a keyword, but my attempts at inserting one have all failed.
Can someone amend the script so that the CSV file populates fully, please?
PRO Lat_Lon_Alt_Array
  ; This program extracts the latitude, longitude & altitude,
  ; along with the site name and file code.
  ; The purpose is to output the above dimensions from the station files
  ; into a csv file.
  COMPILE_OPT IDL2
  the_file_list = FILE_SEARCH('D:/Rwork/Project/25_Files/', '*.nc')
  FOR filein = 0, N_ELEMENTS(the_file_list)-1 DO BEGIN
    station = NCDF_OPEN(the_file_list[filein])
    NCDF_VARGET, station, 'station_name', St_Name
    NCDF_VARGET, station, 'lat', latitude
    NCDF_VARGET, station, 'lon', longitude
    NCDF_VARGET, station, 'alt', height
    latitude = REFORM(latitude, 1)
    longitude = REFORM(longitude, 1)
    height = REFORM(height, 1)
    PRINT, the_file_list[filein]
    PRINT, 'name'
    PRINT, St_Name
    PRINT, 'lat'
    PRINT, latitude
    PRINT, 'lon'
    PRINT, longitude
    PRINT, 'alt'
    PRINT, height
    ; Add each station's data to the file
    WRITE_CSV, 'LatLon.csv', the_file_list[filein], latitude, longitude, height
  ENDFOR
  RETURN
END
WRITE_CSV overwrites the file every time it is called, hence you only ever see the last entry.
Create arrays to hold all the values before the for loop:
n_files = N_ELEMENTS(the_file_list)
latitude_arr = DBLARR(n_files) ; Assuming type is double
longitude_arr = DBLARR(n_files)
height_arr = DBLARR(n_files)
In your for loop fill them with:
latitude_arr[filein] = latitude
longitude_arr[filein] = longitude
height_arr[filein] = height
Then after the for loop, write them with:
WRITE_CSV, 'LatLon.csv', the_file_list, latitude_arr, longitude_arr, height_arr

How can I cut large csv files using any R packages like ff or data.table?

I want to cut large csv files (file size greater than RAM size) and use them, or save each piece to disk for later use. Which R package is best for doing this with large files?
I haven't tried it, but using the skip and nrows parameters of read.table or read.csv is worth a try. These are from ?read.table:
skip: integer. The number of lines of the data file to skip before beginning to read data.
nrows: integer. The maximum number of rows to read in. Negative and other invalid values are ignored.
To avoid some troublesome issues at the end you need to do some error handling; in other words, I don't know what happens when the skip value is greater than the number of rows in your big csv.
P.S. I also don't know whether header=TRUE affects skip or not; you will have to check that as well.
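To make this concrete, here is a rough sketch of chunked reading with skip and nrows; "big.csv" and the chunk size are placeholders, and the tryCatch covers the case where skip runs past the end of the file:
chunk_size <- 100000
# Read the header once and reuse the column names for every chunk
col_names <- names(read.csv("big.csv", nrows = 1))
i <- 0
repeat {
  chunk <- tryCatch(
    read.csv("big.csv",
             skip = 1 + i * chunk_size,  # 1 for the header line
             nrows = chunk_size,
             header = FALSE,
             col.names = col_names),
    error = function(e) NULL)            # skip went past the end of the file
  if (is.null(chunk) || nrow(chunk) == 0) break
  write.csv(chunk, paste0("chunk_", i + 1, ".csv"), row.names = FALSE)
  i <- i + 1
}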
The answer given by @berkorbay is OK, and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent read after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short script in Perl which fragments the original file into chunks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
system("cls");
print("Fragment .csv file keeping header in each chunk\n");
print("\nEnter input file name = ");
$entrada = <STDIN>;
print("\nEnter maximum number of lines in each fragment = ");
$nlineas = <STDIN>;
print("\nEnter output file name stem = ");
$salida = <STDIN>;
chop($salida);
open(IN, $entrada) || die "Cannot open input file: $!\n";
$cabecera = <IN>;      # keep the header line so it can be repeated in every chunk
$leidas = 0;           # lines written to the current chunk
$fragmento = 1;        # chunk counter
$fichero = $salida.$fragmento;
open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
print OUT $cabecera;
while (<IN>) {
    if ($leidas > $nlineas) {
        # current chunk is full: close it and start the next one, repeating the header
        close(OUT);
        $fragmento++;
        $fichero = $salida.$fragmento;
        open(OUT, ">$fichero") || die "Cannot open output file: $!\n";
        print OUT $cabecera;
        $leidas = 0;
    }
    $leidas++;
    print OUT $_;
}
close(OUT);
Just save it with whatever name and execute it. The first line might have to be changed if you have Perl in a different place (and, if you are on Windows, you might have to invoke the script as "perl name-of-script").
You can use read.csv.ffdf from the ff package with specific parameters like these to read a big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once the big file is read into an ff object, it can be subset into regular data frames with:
a[1000:1000000,]
The rest of the code handles subsetting and saving the broken-up data frames:
totalrows = dim(a)[1]
row.size = as.integer(object.size(a[1:10000, ])) / 10000  # in bytes
block.size = 200000000  # in bytes, i.e. 200 MB
# rows.block is the number of rows per block
rows.block = ceiling(block.size / row.size)
# nmaps is the number of chunks/maps of the big ff data frame, minus 1 (the loop runs 0:nmaps)
nmaps = floor(totalrows / rows.block)
for (i in (0:nmaps)) {
  if (i == nmaps) {
    df = a[(i * rows.block + 1):totalrows, ]
  } else {
    df = a[(i * rows.block + 1):((i + 1) * rows.block), ]
  }
  # process df or save it
  write.csv(df, paste0("M", i + 1, ".csv"))
  # remove df
  rm(df)
}
Alternatively, you can first read the file into MySQL using dbWriteTable and then use the read.dbi.ffdf function from the ETLUtils package to read it back into R. Consider the function below:
library(RMySQL)
library(ETLUtils)

read.csv.sql.ffdf <- function(file, name, overwrite = TRUE, header = TRUE,
                              drv = MySQL(), dbname = "new", username = "root",
                              host = "localhost", password = "1234") {
  conn = dbConnect(drv, user = username, password = password, host = host, dbname = dbname)
  dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
  on.exit(dbRemoveTable(conn, name))
  command = paste0("select * from ", name)
  ret = read.dbi.ffdf(command, dbConnect.args = list(drv = drv, dbname = dbname,
                                                     username = username, password = password))
  return(ret)
}
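Assuming a local MySQL server reachable with the credentials above, usage might then look something like this (the file and table names are placeholders):
big_ffdf <- read.csv.sql.ffdf(file = "big.csv", name = "big_table")
dim(big_ffdf)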
