Right way to transfer a CSV file to a BI application?

We are building a BI application, and our customers send us data files daily. We exchange data as CSV files because our customers are used to viewing data in Excel and are not yet ready to use an API on their systems (we hope to move to an XML/JSON web service in a few years).
Currently the transfer uses FTP (SFTP, in fact). Our customers upload files automatically to an FTP server, and a cron job on our side checks whether a file has arrived.
But this has several disadvantages:
We cannot reliably know whether an upload is finished or still in progress (we asked customers to upload under a temporary name and rename the file afterwards, but many of them still don't do that).
So we can only guess, and treat the upload as done once enough time has passed. But the FTP protocol offers no way to get the server time, and clocks can be out of sync. We could upload an empty file and read its date to learn the server time, but that requires write permission...
The FTP protocol allows an upload to be paused...
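The temporary-name convention mentioned in the list above can be sketched as follows. This is a minimal illustration on a local directory, not an SFTP client: the `.part` suffix and the function names `publish` and `ready_files` are assumptions made for the example.

```python
import os

TEMP_SUFFIX = ".part"  # hypothetical marker for uploads still in progress


def publish(directory, name, data):
    """Write under a temporary name, then rename once the file is complete."""
    temp_path = os.path.join(directory, name + TEMP_SUFFIX)
    final_path = os.path.join(directory, name)
    with open(temp_path, "wb") as handle:
        handle.write(data)
    # The rename is a single step, so a watcher never sees a half-written final file.
    os.replace(temp_path, final_path)
    return final_path


def ready_files(directory):
    """Return the names the watcher may process, skipping in-progress uploads."""
    return sorted(
        entry
        for entry in os.listdir(directory)
        if not entry.endswith(TEMP_SUFFIX)
        and os.path.isfile(os.path.join(directory, entry))
    )
```

Over SFTP the same idea would rely on the protocol's rename operation once the upload has finished; the watcher side stays the same.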
So we are considering having our customers upload files directly to our application over HTTPS. This is more reliable, but less convenient:
Our customers cannot check the content of the file after upload.
We have to be careful with upload size and timeouts on our server.
Files can be quite large (up to 300 MB), so it is better to zip them before upload (this can reduce the size to 10%).
This is more work for us than just an FTP server (we need to create a UI, upload progress, a file list to download them back, ...).
Are there other solutions? How do BI applications usually share data? Is HTTPS a good solution for us?

We found a solution: a WebDAV server. We are using Nextcloud, which provides a web interface plus scripted access over the WebDAV protocol.
It is more reliable than FTP, because a file appears only once its upload is complete.
And it is better than HTTP upload to our own application: we don't have to handle file uploads, build interfaces, ...
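For the scripted side, a WebDAV upload is an ordinary HTTP PUT. The sketch below uses only the Python standard library. The host, user name, and password are placeholders, and the `remote.php/dav/files/<user>/` path is assumed to be the WebDAV endpoint layout of the Nextcloud installation; check it against your own server.

```python
import base64
import os
import urllib.parse
import urllib.request


def build_put(base_url, remote_name, user, password, body=None):
    """Build (but do not send) an authenticated WebDAV PUT request."""
    url = base_url.rstrip("/") + "/" + urllib.parse.quote(remote_name)
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {"Authorization": "Basic " + token}
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


def upload_file(local_path, base_url, user, password, timeout=300):
    """Stream a local file to the server and return the HTTP status code."""
    with open(local_path, "rb") as handle:
        request = build_put(
            base_url, os.path.basename(local_path), user, password, body=handle
        )
        # An explicit length lets the file be streamed without chunked encoding.
        request.add_header("Content-Length", str(os.path.getsize(local_path)))
        request.add_header("Content-Type", "application/octet-stream")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
```

A call might look like `upload_file("export.csv", "https://cloud.example.com/remote.php/dav/files/alice/", "alice", "app-password")`, where every argument is hypothetical.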

Related

How can I track downloads of files from remote websites

I am sharing a link to a file (e.g. a PDF) that is stored on my server. Is it possible to track whenever a user downloads the file? I don't have access to the script of the other page, but I thought I could track the incoming requests to my server. Would that be computationally expensive? Any hints on which direction to look in?
You can use the Measurement Protocol, a language-agnostic description of an HTTP tracking request to the Google Analytics tracking server.
The problem in your case is that you do not have a script between the click and the download to send the tracking request. One possible workaround would be to use the server access log, provided you have some control over the server.
For example, the Apache web server can use piped logs, i.e. instead of being written directly to a file, the log entry is passed to a script or program. I'm reasonably sure that other servers have something similar.
You could pipe the logs to a script that checks whether the log entry points at the URL of your PDF file and, if so, breaks the entry down into individual data fields and sends them, in a programming language of your choice, to the GA tracking server.
If you cannot control the server to that level, you'd need to place a script with the same name and location as the original file on the server, map the pdf extension to a script interpreter of your choice (in Apache via AddType, which with many hosts can be done via an .htaccess file) and have the script send the tracking request before delivering the original file.
Both solutions require a modicum of programming practice (the latter much less than the former). Piping logs might be expensive, depending on the number of requests to your server (you might create an extra log file for downloadable files, though). An intermediary script would not be an expensive operation.
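The log-filtering step could look roughly like the sketch below. It assumes log lines in Apache's common/combined format and a hypothetical tracked path; the function name `filter_hits` is made up for the example, and the actual call to the tracking server is deliberately left out.

```python
import re

# Matches the start of an Apache common/combined log line (assumed format).
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def filter_hits(lines, tracked_path):
    """Yield the parsed fields of successful GET requests for tracked_path."""
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # not a log line we understand
        fields = match.groupdict()
        path = fields["path"].split("?", 1)[0]  # ignore any query string
        if fields["method"] != "GET" or path != tracked_path:
            continue
        if fields["status"] not in ("200", "206"):
            continue  # skip errors and redirects
        yield fields
```

In a piped-log setup the script would iterate over `filter_hits(sys.stdin, "/files/report.pdf")` and forward each hit to the tracking server.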

How do download managers work?

Hi, I have had this doubt for a long time. When we use the HTTP protocol to download something, the download starts from the first byte of the file. I mean, if there is a 2 MB file on a site and we click it, it starts downloading from the first byte. But when we give the link to a download manager, it works differently: if we pause after a few bytes have been downloaded, it stops, and when we resume, it continues from where it stopped (not from the beginning). How is this possible?
The answer is the server setting. If a server allows the client to read the file from somewhere after the first byte, the client can specify the number of bytes to skip and the server will start sending the file from that position. If the server doesn't allow it, the client is forced to start reading the file from the beginning, whether or not a download manager is used.
For example, 4shared.com only allows downloads to start from the beginning.
Note: in such cases a download manager provides no gains.
It really depends on whether the server hosting the file allows byte seeking. In other words, if a file hosting service has a "streaming" feature rather than just a "download" feature, applications like download managers can pull a file in pieces and combine them once all the pieces have been downloaded.
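In HTTP, this "skip N bytes" mechanism is the `Range` request header: a server that honours it replies with status 206 and a `Content-Range` header, while a server that does not simply returns the whole file with status 200. A minimal sketch of the client side, with hypothetical function names:

```python
import urllib.request


def build_resume_request(url, bytes_already_downloaded):
    """Build a GET request that asks the server to skip what we already have."""
    request = urllib.request.Request(url)
    if bytes_already_downloaded > 0:
        # "bytes=N-" means: from offset N to the end of the file.
        request.add_header("Range", f"bytes={bytes_already_downloaded}-")
    return request


def parse_content_range(value):
    """Parse a header such as 'bytes 500-999/2000' into (first, last, total)."""
    unit, _, rest = value.partition(" ")
    span, _, total = rest.partition("/")
    first, _, last = span.partition("-")
    if unit != "bytes" or not first.isdigit() or not last.isdigit():
        raise ValueError(f"unsupported Content-Range: {value!r}")
    # The total may be "*" when the server does not know the full length.
    return int(first), int(last), None if total == "*" else int(total)
```

A download manager that splits a file into pieces sends several such requests with different ranges and writes each response at its own offset.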

Writing large volume of web post requests to flat files (File based Queuing )

I am developing a Spring-based web application that will handle a large volume of requests per minute and needs to respond very quickly.
For this purpose, we decided to implement a flat-file-based queuing mechanism, which just writes each request (a set of database column values) to a flat file; another process periodically picks the data up from the flat files and writes it to the database. It picks up only those files that I am done writing to.
As I am using a flat file, for each request I receive I need to open and close the flat file inside my controller method.
My question is: is there a better way to implement this solution? JMS is out of scope as we don't have the infrastructure right now.
If this file-based approach seems good, is there a better way to reduce the file I/O? With the current design I open/write/close the flat file for each web request received, which I know is bad. :(
Env: SpringSource Tool Suite, Apache/Tomcat, with Oracle as the back end.
File access has to be synchronized, otherwise you'll corrupt the file. Synchronized access clashes with the large volume of requests you plan to handle.
Take a look at things like Kestrel, or just go with a database like SQLite (at least you can delegate the synchronization burden).
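To illustrate the SQLite suggestion, here is a minimal queue sketch. It is written in Python rather than Java purely for brevity, the table and function names are invented for the example, and it assumes a single consumer process draining the queue.

```python
import json
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the queue database; SQLite handles the file locking."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pending ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue(conn, record):
    """Append one request's column values to the queue."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO pending (payload) VALUES (?)", (json.dumps(record),)
        )


def drain(conn, limit=100):
    """Remove and return up to `limit` of the oldest queued records."""
    with conn:
        rows = conn.execute(
            "SELECT id, payload FROM pending ORDER BY id LIMIT ?", (limit,)
        ).fetchall()
        if rows:
            # Rows are ordered by id, so everything up to the last id was read.
            conn.execute("DELETE FROM pending WHERE id <= ?", (rows[-1][0],))
    return [json.loads(payload) for _, payload in rows]
```

The web tier would call `enqueue` per request, and the periodic job would call `drain` and write each batch to the main database.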

Need to check uptime on a large file being hosted

I have a dynamically generated RSS feed that is about 150 MB in size (don't ask).
The problem is that it keeps crapping out sporadically, and there is no way to monitor it without downloading the entire feed to get a 200 status. Pingdom times out on it and returns a 'down' error.
So my question is, how do I check that this thing is up and running?
What type of web server and server-side coding platform are you using (if any)? Is any of the content coming from a backend system/database to the web tier?
Are you sure the problem is not with the client code accessing the file? Most clients have timeouts, and downloading large files over the internet can be a problem depending on how the server behaves. That is why file download utilities track progress and download in chunks.
It is also possible that other load on the web server, or the number of users, is impacting the server. If you have little memory available, then with certain servers it may not be able to serve a file of that size to many users. You should review how the server is sending the file and make sure it is chunking it up.
I would recommend that you do a HEAD request to check that the URL is accessible and that the server is at least responding. The next step might be to set up your download test inside or very close to the data center hosting the file to monitor further. This may reduce cost and will reduce interference.
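A HEAD check of this kind can be sketched with the Python standard library. The function names and the 10-second timeout are assumptions for the example; the server only has to send the status line and headers, not the 150 MB body.

```python
import urllib.error
import urllib.request


def build_head_request(url):
    """Build a request that asks for headers only, without the body."""
    return urllib.request.Request(url, method="HEAD")


def is_up(url, timeout=10):
    """Return True if the server answers the HEAD request without an error status."""
    try:
        with urllib.request.urlopen(build_head_request(url), timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP error statuses, refused connections, and timeouts.
        return False
```

Note that a dynamically generated feed may still do all of its generation work before answering a HEAD request, depending on how the server-side code handles it.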
Found an online tool that does what I needed.
http://wasitup.com uses HEAD requests, so it doesn't time out waiting to download the whole 150 MB file.
Thanks for the help BrianLy!
Looks like Pingdom does not support HEAD requests. I've put in a feature request, but who knows.
I hacked this capability into mon for now (mon is a nice compromise between paying someone else to monitor and doing everything yourself). I have switched entirely to HTTPS, so I modified the https monitor to do it. I did it the dead-simple way: copied the https.monitor file and called it https.head.monitor. In the new monitor file I changed the line below (you might also want to update the function name and the place where it is called):
get_https to head_https
Now in mon.cf you can call a head request:
monitor https.head.monitor -u /path/to/file

Export large amounts of data to client in asp.net

I need to export a large amount of data (~100 MB) from a SQL table to a user via the web. What would be the best solution for doing so? One thought was to export the data to a folder on the DB server, compress it (by some means) and then provide a download link for the user. Any other methods for doing so? Also, can we compress data from within SQL Server?
Any approaches are welcome.
I wouldn't tie up the database waiting for the user to download 100 MB, even for a high-speed user. When the user requests the file, have them specify an email address. Then call an asynchronous process to pull the data, write it to a temp file (you don't want more than 100 MB in memory, after all), zip the temp file to a storage location, and send the user an email with a link to download the file.
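The "temp file, then zip" part of that background job could be sketched as below. It is shown in Python as a language-neutral illustration rather than ASP.NET code; the function name, the `export.csv` member name, and the row source are all assumptions, and the emailing step is omitted.

```python
import csv
import os
import tempfile
import zipfile


def export_rows_to_zip(rows, header, zip_path, member_name="export.csv"):
    """Stream rows into a temporary CSV file, then compress it to zip_path."""
    with tempfile.NamedTemporaryFile(
        "w", newline="", suffix=".csv", delete=False, encoding="utf-8"
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(header)
        for row in rows:  # rows can be a database cursor, read one row at a time
            writer.writerow(row)
        tmp_path = tmp.name
    try:
        with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
            archive.write(tmp_path, arcname=member_name)
    finally:
        os.remove(tmp_path)  # the uncompressed copy is no longer needed
    return zip_path
```

Because the rows are written as they are read, memory use stays small regardless of the export size.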
You can respond to a page request with a file:
Response.AddHeader("Content-Disposition",
"attachment; filename=yourfile.csv");
Response.ContentType = "text/plain";
Be sure to turn buffering off, so IIS can start sending the first part of the file while you are building the second:
Response.BufferOutput = false;
After that, you can start writing the file like:
Response.Write("field1,field2,field3\r\n");
When the file is completely written, end the response, so ASP.NET doesn't append a web page to your file:
Response.End();
This way, you don't have to write files on your web servers, you just create the files in memory and send them to your users.
If compression is required, you can write a ZIP file in the same way. This is a nice free library to create ZIP files.
Your approach works fine. SSIS + 7zip might be useful for automating the process if you need to do it more than a couple times.
If XML is OK, one approach would be to select the data "FOR XML" like this:
http://www.sqljunkies.ddj.com/Article/296D1B56-8BDD-4236-808F-E62CC1908C4E.scuk
And then spit out the raw XML directly to the browser as content-type: text/xml. Also be sure to set up Gzip compression on your web server for files with XML extensions. http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/502ef631-3695-4616-b268-cbe7cf1351ce.mspx?mfr=true
This will shrink the XML file down to 1/3 or maybe 1/4 the size as it's transferred. This wouldn't be the highest performance option because of the inherent wasted space in XML files, but a lot depends on what format you're looking for in the end.
Another option would be to use the free SharpZipLib to compress the XML (or whatever format you want) into a zip file that the user would download. Along those lines, if this is something that will be used frequently, you might want to look into caching and storing the zip file on the web server with some sort of expiration, so it's not regenerated for every single request.
The download link is a perfectly valid and reasonable solution. Another would be to automatically redirect the user to that file so they didn't need to click a link. It really depends on your workflow and UI experience.
I would advise against implementing compression in the SQL Server engine. Instead, look at the DotNetZip library (or System.IO.Compression if you think your users are able to decompress gzip archives) and implement the compression within the web application.
