How to get rid of embedded NUL on a raw vector? - r

Im scraping a ASP.NET website.
This will return a raw element (reporte_nacido) which is a csv file (tab as delimiter):
reporte_nacido = postForm('https://xxxxx/WebSiteNDE/BirthsPages/FiltrosExcelNac.aspx',
.params = params,
curl = curl,
.opts = RCurl::curlOptions(ssl.verifypeer=FALSE, verbose=T))
If i load the file on a text viewer, it looks like this
Now im trying to load that raw element within R but i get the following error. I believe the file downloaded from the server comes corrupted somehow and R is being picky about it
rawToChar(as.vector(unlist(reporte_nacido)))
Error in rawToChar(as.vector(unlist(reporte_nacido))) :
embedded nul in string: '\xfe\xff\0N\0\xda\0M\0E\0R\0O\0 \0C\0E\0R\0T\0I\0F\0I\0C\0A\0D\0O\0\t\0D\0E\0P\0A\0R\0T\0A\0M\0E\0N\0T\0O\0\t\0M\0U\0N\0I\0C\0I\0P\0I\0O\0\t\0A\0R\0E\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0I\0N\0S\0P\0E\0C\0C\0I\0O\0N\0 \0C\0O\0R\0R\0E\0G\0I\0M\0I\0E\0N\0T\0O\0 \0O\0 \0C\0A\0S\0E\0R\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0S\0I\0T\0I\0O\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0C\0\xd3\0D\0I\0G\0O\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0N\0O\0M\0B\0R\0E\0 \0I\0N\0S\0T\0I\0T\0U\0C\0I\0\xd3\0N\0\t\0S\0E\0X\0O\0\t\0P\0E\0S\0O\0 \0(\0G\0r\0a\0m\0o\0s\0)\0\t\0T\0A\0L\0L\0A\0 \0(\0C\0e\0n\0t\0\xed\0m\0e\0t\0r\0o\0s\0)\0\t\0F\0E\0C\0H\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0H\0O\0R\0A\0 \0N\0A\0C\0I\0M\0I\0E\0N\0T\0O\0\t\0P\0A\0R\0T\0O\0 \0A\0T\0E\0N\0D\0I\0D\0O\0 \0P\0O\0R\0\t\0T\0I\0E\0M\0P\0O\0 \0D\0E\0 \0G\0E\0S\0T\0A\0C\0I\0\xd3\0N\0\t\0N\0\xda\0M\0E\0R\0O\0 \0C\0O\0N\0S\0U\0L\0T\0A\0S\0 \0P\0R\0E\0N\0A\0T\0A\0L\0E\0S\0\t\0T\0I\0P\0O\0 \0P\0A

The raw vector you are getting is text encoded as UTF-16. You can convert it like this:
library(stringi)
raw_vec <- as.vector(unlist(reporte_nacido))
decoded <- stri_encode(raw_vec, "UTF16")
decoded
#> [1] "NÚMERO CERTIFICADO\tDEPARTAMENTO\tMUNICIPIO\tAREA NACIMIENTO\tINSPECCION CORREGIMIENTO O CASERIO NACIMIENTO\tSITIO NACIMIENTO\tCÓDIGO INSTITUCIÓN\tNOMBRE INSTITUCIÓN\tSEXO\tPESO (Gramos)\tTALLA (Centímetros)\tFECHA NACIMIENTO\tHORA NACIMIENTO\tPARTO ATENDIDO POR\tTIEMPO DE GESTACIÓN\tNÚMERO CONSULTAS PRENATALES\tTIPO PA"
It appears to be tab-separated rather than csv format, so you probably want to read it like this:
read.table(text = decoded, sep = "\t", header = TRUE)

Related

Converting docx.files to pdf.files with docx2pdf

Not sure what I am doing wrong.
I want to convert multiple docx.files to pdf.files - each file into a separate one.
I decided to use the "doconv"-package with following command:
docx_files <- list.files(pattern=paste0("Protokollnr_"))[39:73]
docx_files %>% length
lapply(1:35, function(x) {
docx2pdf(input = docx_files[[x]],
output = tempfile(fileext = ".pdf"))})
I does not say anything specific in the error message - only that it cannot be converted.
Is it that I should have specified the file path - now I only define the file name in my WD.
The object "docx_files" contain:
c("Protokollnr_1.docx", "Protokollnr_10.docx", "Protokollnr_11.docx",
"Protokollnr_12.docx", "Protokollnr_13.docx", "Protokollnr_14.docx",
"Protokollnr_15.docx", "Protokollnr_16.docx", "Protokollnr_17.docx",
"Protokollnr_18.docx", "Protokollnr_19.docx", "Protokollnr_2.docx",
"Protokollnr_20.docx", "Protokollnr_21.docx", "Protokollnr_22.docx",
"Protokollnr_23.docx", "Protokollnr_24.docx", "Protokollnr_25.docx",
"Protokollnr_26.docx", "Protokollnr_27.docx", "Protokollnr_28.docx",
"Protokollnr_29.docx", "Protokollnr_3.docx", "Protokollnr_30.docx",
"Protokollnr_31.docx", "Protokollnr_32.docx", "Protokollnr_33.docx",
"Protokollnr_34.docx", "Protokollnr_35.docx", "Protokollnr_4.docx",
"Protokollnr_5.docx", "Protokollnr_6.docx", "Protokollnr_7.docx",
"Protokollnr_8.docx", "Protokollnr_9.docx")
The error message is:
Error in docx2pdf(input = docx_files[[x]], output = tempfile(fileext = ".pdf")) :
could not convert C:/Users/Nadine/OneDrive/Documents/Arbeit_Büro_papa/Protokolle_Sallapulka/fertige_Protokolle/Protokollnr_1.docx
Many thanks,
Nadine
I'd recommend specifying the file path since the function requires the following format:
docx2pdf(input, output = gsub("\\.docx$", ".pdf", input))

Uploading a file to Figshare using rapiclient swagger api

I am trying to programmatically update my Figshare repository using rapiclient.
Following the answer to this question, I managed to authenticate and see my repository by:
library(rapiclient)
library(httr)
# figshare repo id
id = 3761562
fs_api <- get_api("https://docs.figshare.com/swagger.json")
header <- c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT")))
fs_api <- list(operations = get_operations(fs_api, header),
schemas = get_schemas(fs_api))
reply <- fs_api$operations$article_files(id)
I also managed to delete a file using:
fs_api$operations$private_article_file_delete(article_id = id, file_id = F)
Now, I would like to upload a new file to the repository. There seem to be two methods I need:
fs_api$operations$private_article_upload_initiate
fs_api$operations$private_article_upload_complete
But I do not understand the documentation. According to fs_api$operations$private_article_upload_initiate help:
> fs_api$operations$private_article_upload_initiate
private_article_upload_initiate
Initiate Upload
Description:
Initiate new file upload within the article. Either use link to
provide only an existing file that will not be uploaded on figshare
or use the other 3 parameters(md5, name, size)
Parameters:
link (string)
Url for an existing file that will not be uploaded on figshare
md5 (string)
MD5 sum pre computed on the client side
name (string)
File name including the extension; can be omitted only for linked
files.
size (integer)
File size in bytes; can be omitted only for linked files.
What does "file that will not be uploaded on Figshare" mean? How would I use the API to upload a local file ~/foo.txt?
fs_api$operations$private_article_upload_initiate(link='~/foo.txt')
returns HTTP 400.
I feel like I sent you down a bad path with my previous answer because I am not sure how to edit some of the api endpoints when using rapiclient. For example, the corresponding endpoint for fs_api$operations$private_article_upload_initiate() will be https://api.figshare.com/v2/account/articles/{article_id}/files, and I am not sure how to substitute for {article_id} prior to sending the request.
You may have to define your own client for operations you cannot get working any other way.
Here is an example of uploading a file to an existing private article as per the goal of your question.
library(httr)
# id of previously created figshare article
my_article_id <- 99999999
# make example file to upload
my_file <- tempfile("my_file", fileext = ".txt")
writeLines("Hello World!", my_file)
# Step 1 initiate upload
# https://docs.figshare.com/#private_article_upload_initiate
r <- POST(
url = sprintf("https://api.figshare.com/v2/account/articles/%s/files", my_article_id),
add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT")))),
body = list(
md5 = tools::md5sum(my_file)[[1]],
name = basename(my_file),
size = file.size(my_file)
),
encode = "json"
)
initiate_upload_response <- content(r)
# Step 2 single file info (get upload url)
# https://docs.figshare.com/#private_article_file
r <- GET(url = initiate_upload_response$location,
add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT"))))
)
single_file_response <- content(r)
# Step 3 uploader service (get number of upload parts required)
# https://docs.figshare.com/#endpoints
r <- GET(url = single_file_response$upload_url,
add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT"))))
)
upload_service_response <- content(r)
# Step 4 upload parts (this example only has one part)
# https://docs.figshare.com/#endpoints_1
r <- PUT(url = single_file_response$upload_url, path = 1,
add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT")))),
body = upload_file(my_file)
)
upload_parts_response <- content(r)
# Step 5 complete upload (after all part uploads are successful)
# https://docs.figshare.com/#private_article_upload_complete
r <- POST(
url = initiate_upload_response$location,
add_headers(c(Authorization = sprintf("token %s", Sys.getenv("RFIGSHARE_PAT"))))
)
complete_upload_response <- content(r)

How to append suffix to file names in write.csv() in R?

I have many data frames. I write them to csv, but I would not like to manually enter to each file the ending '_100' only to be able to specify it once and that each file would write with this ending
write.csv(results_SVM, file = "results_SVM.csv")
write.csv(results_ANN, file = "results_ANN.csv")
write.csv(results_RBF, file = "results_ANN.csv")
Get the same suffix for each file:
write.csv(results_SVM, file = "results_SVM_100.csv")
write.csv(results_ANN, file = "results_ANN_100.csv")
write.csv(results_RBF, file = "results_ANN_100.csv")
You can use paste in the filename:
#suf <- "" #nothing
suf <- "_100" #with _100
write.csv(results_SVM, file = paste0("results_SVM",suf,".csv"))
write.csv(results_ANN, file = paste0("results_ANN",suf,".csv"))
write.csv(results_RBF, file = paste0("results_ANN",suf,".csv"))

'Con not a connection' Error in R program

I am trying to use readLines in R but I am getting below error
orders1<-readLines(orders,2)
# Error in readLines(orders, 2) : 'con' is not a connection
Code :
orders<-read.csv("~/orders.csv")
orders
orders1<-readLines(orders,2)
orders1
Data:
id,item,quantity_ordered,item_cost
1,playboy roll,1,12
1,rockstar roll,1,10
1,spider roll,1,8
1,keystone roll,1,25
1,salmon sashimi,6,3
1,tuna sashimi,6,2.5
1,edamame,1,6
2,salmon skin roll,1,8
2,playboy roll,1,12
2,unobtanium roll,1,35
2,tuna sashimi,4,2.5
2,yellowtail hand roll,1,7
4,california roll,1,4
4,cucumber roll,1,3.5
5,unagi roll,1,6.5
5,firecracker roll,1,9
5,unobtanium roll,1,35
,chicken teriaki hibachi,1,7.95
,diet coke,1,1.95
I'm guessing you want this:
orders1 <- readLines( file("~/orders.csv") )
It's not clear why you want to do your own parsing or substitution, but that should give readLines a valid connection object.

How can I cut large csv files using any R packages like ff or data.table?

I want to cut large csv files (file size more than RAM size) and use them or save each in disk for later usage. Which R package is best for doing this for large files?
I haven't tried but using skip and nrows parameters in read.table or read.csv is worth a try. These are from ?read.table
skip integer: the number of lines of the data file to skip before
beginning to read data.
nrows integer: the maximum number of rows to read in. Negative and
other invalid values are ignored.
To avoid some troublesome issues at the end you need to do some error handling. In other words I don't know what happpens when skip value is greater than the number of rows in your big csv.
p.s. I also don't know whether header=TRUE is affecting skip or not, you also have to check that.
The answer given bu #berkorbay is OK and I can confirm that header can be used with skip. However, if your file is really large it gets painfully slow, as each subsequent reading after the first must skip over all previously read lines.
I had to do something similar and, after wasting quite a bit of time, I wrote a short script in PERL which fragments the original file in chuncks that you can read one after the other. It is much faster. I enclose the source here, translating some parts so that the intent is clear:
#!/usr/bin/perl
system("cls");
print("Fragment .csv file keeping header in each chunk\n") ;
print("\nEnter input file name = ") ;
$entrada = <STDIN> ;
print("\nEnter maximum number of lines in each fragment = ") ;
$nlineas = <STDIN> ;
print("\nEnter output file name stem = ") ;
$salida = <STDIN> ;
chop($salida) ;
open(IN,$entrada) || die "Cannot open input file: $!\n" ;
$cabecera = <IN> ;
$leidas = 0 ;
$fragmento = 1 ;
$fichero = $salida.$fragmento ;
open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
while(<IN>) {
if ($leidas > $nlineas) {
close(OUT) ;
$fragmento++ ;
$fichero = $salida.$fragmento ;
open(OUT,">$fichero") || die "Cannot open output file: $!\n" ;
print OUT $cabecera ;
$leidas = 0;
}
$leidas++ ;
print OUT $_ ;
}
close(OUT) ;
Just save with whatever name and execute. The first line might have to be changed if you have PERL in a diferent place (an, if you are on Windows, you migh have to invoke the script as "perl name-of-script").
One should have used read.csv.ffdf of ff package with specific parameters like this to read big file:
library(ff)
a <- read.csv.ffdf(file="big.csv", header=TRUE, VERBOSE=TRUE, first.rows=1000000, next.rows=1000000, colClasses=NA)
Once big file is read into a ff object, Subsetting ffobject into data frames can be done using:
a[1000:1000000,]
Rest of the code for subsetting and saving broken dataframes
totalrows = dim(a)[1]
row.size = as.integer(object.size(a[1:10000,])) / 10000 #in bytes
block.size = 200000000 #in bytes .IN Mbs 200 Mb
#rows.block is rows per block
rows.block = ceiling(block.size/row.size)
#nmaps is the number of chunks/maps of big dataframe(ff), nmaps = number of maps - 1
nmaps = floor(totalrows/rows.block)
for(i in (0:nmaps)){
if(i==nmaps){
df = a[(i*rows.block+1) : totalrows,]
}
else{
df = a[(i*rows.block+1) : ((i+1)*rows.block),]
}
#process df or save it
write.csv(df,paste0("M",i+1,".csv"))
#remove df
rm(df)
}
Alternatively you can first read the files into mysql using dbWriteTable and then use read.dbi.ffdf function from the ETLUtils package to read it back to R. Consider the function below;
read.csv.sql.ffdf <- function(file, name,overwrite = TRUE, header = TRUE, drv = MySQL(), dbname = "new", username = "root",host='localhost', password = "1234"){
conn = dbConnect(drv, user = username, password = password, host = host, dbname = dbname)
dbWriteTable(conn, name, file, header = header, overwrite = overwrite)
on.exit(dbRemoveTable(conn, name))
command = paste0("select * from ", name)
ret = read.dbi.ffdf(command, dbConnect.args = list(drv =drv, dbname = dbname, username = username, password = password))
return(ret)
}

Resources