I'm working with StreamSets on a Cloudera Distribution, trying to ingest some data from this website http://files.data.gouv.fr/sirene/
I've encountered some issues choosing the parameters of both the HTTP Client and the Hadoop FS Destination.
https://image.noelshack.com/fichiers/2017/44/2/1509457504-streamsets-f.jpg
I get this error: HTTP_00 - Cannot parse record: java.io.IOException: org.apache.commons.compress.archivers.ArchiveException: No Archiver found for the stream signature
I'll show you my configuration.
HTTP Client :
General
Name : HTTP Client INSEE
Description : Client HTTP SIRENE
On Record Error : Send to Error
HTTP
Resource URL : http://files.data.gouv.fr/sirene/
Headers : sirene_ : sirene_
Mode : Streaming
Per-Status Actions
HTTP Status Code : 500 | Action for status : Retry with exponential backoff
Base Backoff Interval (ms) : 1000 | Max Retries : 10
HTTP Method : GET
Body Time Zone : UTC (UTC)
Request Transfer Encoding : BUFFERED
HTTP Compression : None
Connect Timeout : 0
Read Timeout : 0
Authentication Type : None
Use OAuth 2
Use Proxy
Max Batch Size (records) : 1000
Batch Wait Time (ms) : 2000
Pagination
Pagination Mode : None
TLS
UseTLS
Timeout Handling
Action for timeout : Retry immediately
Max Retries : 10
Data Format
Data Format : Delimited
Compression Format : Archive
File Name Pattern within Compressed Directory : *.csv
Delimiter Format Type : Custom
Header Line : With Header Line
Max Record Length (chars) : 1024
Allow Extra Columns
Delimiter Character : Semicolon
Escape Character : Other \
Quote Character : Other "
Root Field Type : List-Map
Lines to Skip : 0
Parse NULLs
Charset : UTF-8
Ignore Control Characters
Hadoop FS Destination :
General
Name : Hadoop FS 1
Description : Writing into HDFS
Stage Library : CDH 5.7.6
Produce Events
Required Fields
Preconditions
On Record Error : Send to Error
Output Files
File Type : Whole File
Files Prefix
Directory in Header
Directory Template : /user/pap/StreamSets/sirene/
Data Time Zone : UTC (UTC)
Time Basis : ${time:now()}
Use Roll Attribute
Validate HDFS Permissions : ON
Skip file recovery : ON
Late Records
Late Record Time Limit (secs) : ${1 * HOURS}
Late Record Handling : Send to error
Data Format
Data Format : Whole File
File Name Expression : ${record:value('/fileInfo/filename')}
Permissions Expression : 777
File Exists : Overwrite
Include Checksum in Events
... so what am I doing wrong? :(
It looks like http://files.data.gouv.fr/sirene/ is returning a file listing, rather than a compressed archive. This is a tricky one, since there isn't a standard way to iterate through such a listing. You might be able to read http://files.data.gouv.fr/sirene/ as text, then use the Jython evaluator to parse out the zip file URLs, retrieve, decompress and parse them, adding the parsed records to the batch. I think you'd have problems with this method, though, as all the records would end up in the same batch, blowing out memory.
Another idea might be to use two pipelines - the first would use HTTP client origin and a script evaluator to download the zipped files and write them to a local directory. The second pipeline would then read in the zipped CSV via the Directory origin as normal.
If you do decide to have a go, please engage with the StreamSets community via one of our channels - see https://streamsets.com/community
I'm writing the Jython evaluator now. I'm not familiar with the constants/objects/records that are available to the script, as presented in the comments. I tried to adapt this Python script for the Jython evaluator:
import re
import itertools
import urllib2

# collect the sirene_*.zip file names from the saved copy of the listing
data = [re.findall(r'(sirene\w+.zip)', line) for line in open('/home/user/Desktop/filesdatatest.txt')]
data_list = filter(None, data)
data_brackets = list(itertools.chain(*data_list))

# build the full download URLs
data_clean = ["http://files.data.gouv.fr/sirene/" + url for url in data_brackets]

# fetch each zip (the response isn't used yet)
for url in data_clean:
    urllib2.urlopen(url)
The line
records = [re.findall(r'(sirene\w+.zip)', record) for record in records]
gave me this error message:
SCRIPTING_05 - Script error while processing record: javax.script.ScriptException: TypeError: expected string or buffer, but got in at line number 50
filesdatatest.txt contains things like:
Listing of /v1/AUTH_6032cb4c2159474684c8df1da2e2b642/storage/sirene/
Name Size Date
../
README.txt 2Ki 2017-10-11 03:31:57
sirene_201612_L_M.zip 1Gi 2017-01-05 00:12:08
sirene_2017002_E_Q.zip 444Ki 2017-01-05 00:44:58
sirene_2017003_E_Q.zip 6Mi 2017-01-05 00:45:01
sirene_2017004_E_Q.zip 2Mi 2017-01-05 03:37:42
sirene_2017005_E_Q.zip 2Mi 2017-01-06 03:40:47
sirene_2017006_E_Q.zip 2Mi 2017-01-07 05:04:04
So I know how to parse the records.
Related
I am trying to fetch the list of folders/files from a specific location in Google Drive from the server. I authenticated Drive using a service token (I downloaded the JSON file and passed the file's location as the parameter).
drive_auth(path = 'servicetoken.json')
I am trying to get the list of files from a specific location
folders_list <- drive_ls(path = "0EIgLgNPOMnJaWVJsdlkey") %>%
as.data.frame()
I am getting an error message
Warning: Error in : Client error: (404) Not Found
* domain: global
* reason: notFound
* message: File not found: 0EIgLgNPOMnJaWVJsdlkey.
* locationType: parameter
* location: fileId
Do we need to generate a JSON file every time we authenticate the drive?
What am I doing wrong here to get this error message on the server?
Not able to test right now, but you might have to wrap the id in googledrive::as_id('0EIgLgNPOMnJaWVJsdlkey') to declare that you're passing an id and not a path.
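Untested, but based on that suggestion the call would look roughly like this (the id is the placeholder string from the question):
library(googledrive)
drive_auth(path = 'servicetoken.json')
# Wrap the raw string so drive_ls() treats it as a file id rather than a path
folders_list <- as.data.frame(drive_ls(path = googledrive::as_id('0EIgLgNPOMnJaWVJsdlkey')))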
Probably a usage or settings issue:
I'm trying to use R's googleCloudStorageR package to upload files to my google storage bucket.
Running:
googleCloudStorageR::gcs_upload("test/my_test.csv")
prints these messages:
2020-05-11 18:57:19 -- File size detected as 368 bytes
2020-05-11 18:57:20> Request Status Code: 400
And then this error:
Error: API returned: Cannot insert legacy ACL for an object when uniform bucket-level access is enabled. Read more at https://cloud.google.com/storage/docs/uniform-bucket-level-access.
Is there a different usage of googleCloudStorageR::gcs_upload that will succeed? (It's not clear from its documentation.)
If I set predefinedAcl to "default" I get a JSON related error:
Error : lexical error: invalid char in json text
This error message is followed by some html code and following that this message:
> xdg-open: unexpected argument 'minimum-scale=1,'
Try 'xdg-open --help' for more information.
I'm not sure what JSON it's referring to, but if it's the JSON I set up for googleCloudStorageR to authenticate access to my bucket, then I'm surprised it's complaining at this stage.
It looks like in https://github.com/cloudyr/googleCloudStorageR/pull/84 support was added to inherit the bucket-level ACL when you set predefinedAcl to default. In your example this would be:
googleCloudStorageR::gcs_upload("test/my_test.csv", predefinedAcl = "default")
The issue has been resolved by the googleCloudStorageR developers. It is not yet in the CRAN distribution, but installing it from GitHub (devtools::install_github("cloudyr/googleCloudStorageR")) should do.
And the usage is:
googleCloudStorageR::gcs_upload("test/my_test.csv", predefinedAcl = "bucketLevel")
I'm trying to get data from an OPeNDAP server using R and the ncdf4 package. However, the NASA EOSDIS server requires a username/password. How can I pass this information using R?
Here is what I'm trying to do:
require(ncdf4)
f1 <- nc_open('https://disc2.gesdisc.eosdis.nasa.gov/opendap/TRMM_L3/TRMM_3B42.7/2018/020/3B42.20180120.15.7.HDF')
And the error message:
Error in Rsx_nc4_get_vara_double: NetCDF: Authorization failure syntax
error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or
SCAN_ERROR context: HTTP^ Basic: Access denied. Var: nlat Ndims: 1
Start: 0 Count: 400 Error in ncvar_get_inner(d$dimvarid$group_id,
d$dimvarid$id, default_missval_ncdf4(), : C function
R_nc4_get_vara_double returned error
I tried the URL https://username:password#disc2.... but that did not work either.
Daniel,
The service you are accessing is using third-party redirection to authenticate users. Therefore the simple way of providing credentials in the URL doesn't work.
You need to create 2 files.
A .dodsrc file (an RC file for the netcdf-c library) with the following content:
HTTP.COOKIEFILE=.cookies
HTTP.NETRC=.netrc
A .netrc file, in the location referenced in the .dodsrc, with your credentials:
machine urs.earthdata.nasa.gov
login YOURUSERNAMEHERE
password YOURPASSWORDHERE
You can find more details at
https://www.unidata.ucar.edu/software/netcdf/docs/md__Users_wfisher_Desktop_v4_86_81-prep_netcdf-c_docs_auth.html
Regards
Antonio
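For reference, a rough, untested sketch of setting this up from R. It assumes the two files can live in your home directory and that the netcdf library picks them up from there (on some setups the paths in .dodsrc may need to be absolute); the credentials are placeholders.
home <- path.expand("~")

# RC file for the netcdf-c library, with the content described above
writeLines(c("HTTP.COOKIEFILE=.cookies",
             "HTTP.NETRC=.netrc"),
           file.path(home, ".dodsrc"))

# Earthdata credentials; keep this file private
writeLines(c("machine urs.earthdata.nasa.gov",
             "login YOURUSERNAMEHERE",
             "password YOURPASSWORDHERE"),
           file.path(home, ".netrc"))
Sys.chmod(file.path(home, ".netrc"), mode = "0600")

library(ncdf4)
f1 <- nc_open('https://disc2.gesdisc.eosdis.nasa.gov/opendap/TRMM_L3/TRMM_3B42.7/2018/020/3B42.20180120.15.7.HDF')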
Unfortunately, even after defining the credentials and their location,
ncdf4::nc_open("https://gpm1.gesdisc.eosdis.nasa.gov/opendap/GPM_L3/GPM_3IMERGDE.06/2020/08/3B-DAY-E.MS.MRG.3IMERG.20200814-S000000-E235959.V06.nc4")
still returns
Error in Rsx_nc4_get_vara_double: NetCDF: Authorization failure
The same happens when using ncdump from a terminal:
$ ncdump https://gpm1.gesdisc.eosdis.nasa.gov/opendap/GPM_L3/GPM_3IMERGDE.06/2020/08/3B-DAY-E.MS.MRG.3IMERG.20200814-S000000-E235959.V06.nc4
returns
syntax error, unexpected WORD_WORD, expecting SCAN_ATTR or SCAN_DATASET or
SCAN_ERROR context: HTTP^ Basic: Access denied. NetCDF: Authorization
failure Location: file
/build/netcdf-KQb2aQ/netcdf-4.6.0/ncdump/vardata.c; line 473
I have an application URL. I need to run a login test using JMeter. I recorded the login steps using the BlazeMeter extension for Chrome, but when I run it I get the error below. I know there have been questions like this; I have tried a few of the suggestions and it seems my case is different.
I have tried:
Added these two lines in jmeter.bat
set JAVA_HOME=C:\Program Files\Java\jdk1.8.0_65
set PATH=%JAVA_HOME%\bin;%PATH%
Ran JMeter using "Run as Administrator"
Downloaded the certificate from here https://gist.github.com/borisguery/9ef114c53b83e553b635 and installed it this way:
https://www.youtube.com/watch?v=2k581jcWk9M
Restarted JMeter and tried again, but no luck.
When I expand the error in the JMeter View Results Tree listener, I get an error on this particular CSS file: https://abcurl.xyzsample.com/assets/loginpage/css/okta-sign-in.min.7c7cfd15fa939095d61912dd8000a2a8.css
Error:
Thread Name: Thread Group 1-1
Load time: 268
Connect Time: 0
Latency: 0
Size in bytes: 2256
Headers size in bytes: 0
Body size in bytes: 2256
Sample Count: 1
Error Count: 1
Response code: Non HTTP response code: javax.net.ssl.SSLHandshakeException
Response message: Non HTTP response message: Received fatal alert: handshake_failure
Response headers:
HTTPSampleResult fields:
ContentType:
DataEncoding: null
If you are getting the error for only one .css file and it does not belong to the application under test (i.e. it is an external stylesheet), the best thing you could do is just exclude it from the load test via the "URLs must match" section, which lives under the "Advanced" tab of the HTTP Request Defaults configuration element.
If you need to load this .css by any means, you could also try the following approaches:
Play with the https.default.protocol and https.socket.protocols properties (look for these lines in the jmeter.properties file); see the example after this list
Install the Java Cryptography Extension (JCE) Unlimited Strength Jurisdiction Policy Files into the /jre/lib/security folder of your JRE or JDK home (replace the existing files with the downloaded ones)
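For illustration, the property change might look something like this in jmeter.properties (the exact protocol values depend on what the server supports; these are only an example, not a recommendation):
https.default.protocol=TLSv1.2
https.socket.protocols=TLSv1.1 TLSv1.2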
If your URL needs a client certificate, copy your cert to the /bin folder; then, from the JMeter console, go to Options -> SSL Manager and select your cert, and it will prompt you for the certificate password. If you run your tests again, that should work.
Additionally, you can also do a Keystore Configuration (http://jmeter.apache.org/usermanual/component_reference.html#Keystore_Configuration), if you haven't done so already.
Please note that my JMeter version is 4.0. Hope this helps.
I want to import datasets into R that come from an FTP server. I am using FileZilla to manually see the files. Currently my data is in a xxxx.csv.gz file on the FTP server and a new file gets added once a day.
My issue is that I have tried using the following link as guidance and it doesn't seem to work in my case:
Using R to download newest files from ftp-server
When I attempt the following code an error message comes up:
library(RCurl)
url <- "ftp://yourServer"
userpwd <- "yourUser:yourPass"
filenames <- getURL(url, userpwd = userpwd,
                    ftp.use.epsv = FALSE, dirlistonly = TRUE)
Error:
Error in function (type, msg, asError = TRUE) :
Failed to connect to ftp.xxxx.com port 21: Timed out
The reason this happened is that the credentials state I should use Port: 22 as the secure port.
How do I modify my getURL function so that I can access Port: 22?
Also, after making this call, there is a directory that I need to get to in order to access the files.
For example purposes: let's say the directory is:
Directory: /xxx/xxxxx/xxxxxx
(I've also tried appending this to the original URL and the same error message comes up.)
Basically I want to get access to this directory, upload individual csv.gz files into R and then automatically call the following day's data.
The file names are:
XXXXXX_20160205.csv.gz
(The file names are just dates and each file will correspond to the previous day)
I guess the first step is just to make a connection to the files and download them, and then later down the road automatically pull the previous day's csv.gz file.
Any help would be great, thanks!
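Here's a rough, untested sketch of the port-22 route, assuming the server actually speaks SFTP on that port and your libcurl build includes SSH/SFTP support (server, credentials, directory and file-name pattern are the placeholders from above):
library(RCurl)

# SFTP runs over SSH on port 22, so switch the scheme and include the directory
url <- "sftp://yourServer:22/xxx/xxxxx/xxxxxx/"
userpwd <- "yourUser:yourPass"

# List the file names in the directory
filenames <- getURL(url, userpwd = userpwd, dirlistonly = TRUE)
filenames <- strsplit(filenames, "\r?\n")[[1]]

# Yesterday's file, e.g. XXXXXX_20160205.csv.gz
target <- paste0("XXXXXX_", format(Sys.Date() - 1, "%Y%m%d"), ".csv.gz")
bin <- getBinaryURL(paste0(url, target), userpwd = userpwd)
writeBin(bin, target)

# the gzipped CSV can be read directly
dat <- read.csv(gzfile(target))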