Update a single sheet in a workbook - R

I like using Excel as a poor man's database for storing data dictionaries and such, because Excel makes it super easy to edit the data in there without the pain of installing an RDBMS.
Now I hit an unexpected problem. I can't find a simple way to rewrite just one of the worksheets, at least not without reading and writing the whole file.
write.xlsx(df, file = "./codebook.xlsx", sheetName = "mysheet", overwrite = FALSE)
This complains that the file exists. With overwrite = TRUE, my existing sheets are lost.

Related

Is there a way of rendering rmarkdown to an object directly without saving to disk?

We are working on a project where, following an analysis of data using R, we use rmarkdown to render an HTML report which will be returned to the users who uploaded the original dataset. This will be part of a complex online system involving multiple steps. One of the requirements is that the rendered HTML be serialized and saved in a SQL database for the system to return to users.
My question is: is there a way to render the markdown directly to an object in R to allow for direct serialization? We want to avoid saving to disk unless absolutely needed, as there will be multiple parallel processes doing similar tasks and resources might be limited. From my research so far it doesn't seem possible, but I would appreciate any insight.
You are correct, it's not possible due to the architecture of rmarkdown.
If you have this level of control over your server, you could create a RAM disk, using part of the machine's memory to simulate a hard drive. The actual hard disk won't be used.

Importing a very large SQLite table to BigQuery

I have a relatively large SQLite table (5 million rows, 2GB) which I'm trying to move to Google BigQuery. The easy solution, which I've used for other tables in the db, was to use something like SQLite Manager (the Firefox extension) to export to CSV, but this fails with what I'd imagine is an out of memory error when trying to export the table in question. I'm trying to think of the best way to approach this, and have come up with the following:
1. Write something that will manually write a single, gigantic CSV. This seems like a bad idea for many reasons, but the big ones are that one of the fields is text data which will inevitably screw things up with any of the delimiters supported by BQ's import tools, and I'm not sure that BQ could even support a single CSV that big.
2. Write a script to manually export everything to a series of CSVs, ~100k rows each or so, the main problem being that this will then require importing 50 files.
3. Write everything to a series of JSONs and try to figure out a way to deal with it from there, same as above.
4. Try to import it to MySQL and then do a mysqldump, which apparently can be read by BQ.
5. Use Avro, which seems like the same as option 2 except it's going to be in binary, so it'll be harder to debug when it inevitably fails.
I also have some of this data on a local ElasticSearch node, but I couldn't find any way of migrating that to BQ either. Does anyone have any suggestions? Most of what I've found online has been trying to get things out of BQ, not put things in.
(2) is not a problem. BQ can import up to 10k files per import job.
BQ can also import very large CSV/JSON/Avro files, as long as the input can be sharded (for text-based formats that means uncompressed files and, for CSV, no quoted newlines).
See https://cloud.google.com/bigquery/quota-policy#import for more.
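For what it's worth, option 2 is also easy to script. Below is a minimal sketch in Go (my own illustration, not part of the original answer), assuming the mattn/go-sqlite3 driver and a hypothetical table named records with columns id and body; the shard size and file names are arbitrary:

package main

import (
	"database/sql"
	"encoding/csv"
	"fmt"
	"log"
	"os"

	_ "github.com/mattn/go-sqlite3" // third-party SQLite driver (assumed available)
)

func main() {
	db, err := sql.Open("sqlite3", "data.db") // hypothetical database file
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Hypothetical table and columns; substitute your own schema.
	rows, err := db.Query("SELECT id, body FROM records")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	const rowsPerShard = 100000 // ~100k rows per file, as in option 2
	var (
		f     *os.File
		w     *csv.Writer
		n     int
		shard int
	)
	for rows.Next() {
		if n%rowsPerShard == 0 {
			// Start a new shard file every rowsPerShard rows.
			if w != nil {
				w.Flush()
				f.Close()
			}
			f, err = os.Create(fmt.Sprintf("export-%03d.csv", shard))
			if err != nil {
				log.Fatal(err)
			}
			w = csv.NewWriter(f)
			shard++
		}
		var id, body string
		if err := rows.Scan(&id, &body); err != nil {
			log.Fatal(err)
		}
		// encoding/csv quotes fields containing commas, quotes or newlines,
		// so a free-text column doesn't break the format.
		if err := w.Write([]string{id, body}); err != nil {
			log.Fatal(err)
		}
		n++
	}
	if w != nil {
		w.Flush()
		f.Close()
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}

With ~50 shards this stays far below the per-job file limit mentioned above, and each file is small enough to inspect if a load fails.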

PHPExcel: Opening a file takes a long time

I'm using PHPExcel to read through Excel spreadsheets of various sizes and then import the cell data into a database. Reading through the spreadsheet itself works great and is very quick, but I've noticed that actually loading/opening the file for PHPExcel to use can take 10-20 seconds (the larger the file, the longer it takes, especially if the spreadsheet is over 1MB in size).
This is the code I'm using to load the file before iterating through it:
// Detect the spreadsheet format and create a matching reader
$filetype = PHPExcel_IOFactory::identify($file);
$objReader = PHPExcel_IOFactory::createReader($filetype);
// We only need cell values, so skip formatting information
$objReader->setReadDataOnly(true);
$objPHPExcel = $objReader->load($file);
What can I do to get the file to load faster? It's frustrating that the greatest latency in importing the data is just in opening up the file initially.
Thank you!
I've seen this same behavior with Ruby and an Excel library: a non-trivial amount of time to open a large file, where large is > 500KB.
I think the cause is two things:
1) an xlsx file is zip-compressed, so it must first be uncompressed
2) an xlsx file is a series of XML files, which all must be parsed.
#1 can be a small hit, but it most likely pales in comparison to #2. I believe it's the XML parsing that is the real culprit. In addition, the XML parser is DOM-based, so the whole document must be parsed and loaded into memory as a DOM.
I don't think there is really anything you can do to speed this up. A large xlsx file contains a lot of XML which must be parsed and loaded into memory.
Actually, there is something you can do. The problem with most XML parsers is that they first load the entire document into memory. For big documents, this takes a considerable amount of time.
A way to avoid this is to use a parser that supports streaming. Instead of loading the entire XML content into memory, you load only the part you need. That way, you hold pretty much one row at a time in memory. This is super fast AND memory-efficient.
If you are curious, you can find an example of a library using this technique here: https://github.com/box/spout
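To make the streaming idea concrete outside PHP, here is a minimal sketch in Go (my own illustration of the pull-parser approach, not how box/spout works internally): treat the .xlsx as a zip archive and consume each worksheet's XML one token at a time, so memory use stays flat regardless of file size.

package main

import (
	"archive/zip"
	"encoding/xml"
	"fmt"
	"log"
	"strings"
)

func main() {
	// An .xlsx file is just a zip archive of XML parts.
	r, err := zip.OpenReader("big.xlsx") // hypothetical file name
	if err != nil {
		log.Fatal(err)
	}
	defer r.Close()

	for _, zf := range r.File {
		if !strings.HasPrefix(zf.Name, "xl/worksheets/") {
			continue
		}
		rc, err := zf.Open()
		if err != nil {
			log.Fatal(err)
		}
		// Pull-parse the sheet: only one token (at most one row's worth
		// of data) is held in memory at a time, instead of a full DOM.
		dec := xml.NewDecoder(rc)
		rowCount := 0
		for {
			tok, err := dec.Token()
			if err != nil {
				break // typically io.EOF at the end of the part
			}
			if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "row" {
				rowCount++
			}
		}
		rc.Close()
		fmt.Printf("%s: %d rows\n", zf.Name, rowCount)
	}
}

Reading actual cell values takes a bit more work (shared strings live in a separate part of the archive), but the principle is the same: never build the whole DOM in memory.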

Store map key/values in a persistent file

I will be creating a structure more or less of the form:
type FileState struct {
    LastModified int64
    Hash         string
    Path         string
}
I want to write these values to a file and read them in on subsequent calls. My initial plan is to read them into a map and lookup values (Hash and LastModified) using the key (Path). Is there a slick way of doing this in Go?
If not, what file format can you recommend? I have read about and experimented with some key/value file stores in previous projects, but not using Go. Right now my requirements are probably fairly simple, so a big database server system would be overkill. I just want something I can write to and read from quickly, easily, and portably (Windows, Mac, Linux). Because I have to deploy on multiple platforms, I am trying to keep my non-Go dependencies to a minimum.
I've considered XML, CSV, JSON. I've briefly looked at the gob package in Go and noticed a BSON package on the Go package dashboard, but I'm not sure if those apply.
My primary goal here is to get up and running quickly, which means the least amount of code I need to write along with ease of deployment.
As long as your entire data set fits in memory, you shouldn't have a problem. Using an in-memory map and writing snapshots to disk regularly (e.g. by using the gob package) is a good idea. The Practical Go Programming talk by Andrew Gerrand uses this technique.
If you need to access those files with different programs, using a popular encoding like JSON or CSV is probably a good idea. If you only have to access those files from within Go, I would use the excellent gob package, which has a lot of nice features.
As soon as your data becomes bigger, it's not a good idea to write the whole database to disk on every change, and your data might no longer fit into RAM. In that case, you might want to take a look at the leveldb key-value database package by Nigel Tao, another Go developer. It's currently under active development (and not yet usable), but it will also offer advanced features like transactions and automatic compression, and the read/write throughput should be quite good because of the leveldb design.
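As a minimal sketch of the in-memory map plus gob snapshot approach described above, using the FileState struct from the question (the snapshot file name and helper functions are illustrative, not anything the gob package prescribes):

package main

import (
	"encoding/gob"
	"log"
	"os"
)

type FileState struct {
	LastModified int64
	Hash         string
	Path         string
}

// saveSnapshot writes the whole map to disk with gob.
func saveSnapshot(path string, states map[string]FileState) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	return gob.NewEncoder(f).Encode(states)
}

// loadSnapshot reads a previously written snapshot; a missing file
// simply means an empty map on first run.
func loadSnapshot(path string) (map[string]FileState, error) {
	states := make(map[string]FileState)
	f, err := os.Open(path)
	if os.IsNotExist(err) {
		return states, nil
	}
	if err != nil {
		return nil, err
	}
	defer f.Close()
	err = gob.NewDecoder(f).Decode(&states)
	return states, err
}

func main() {
	states, err := loadSnapshot("filestate.gob")
	if err != nil {
		log.Fatal(err)
	}
	states["/tmp/example.txt"] = FileState{LastModified: 1700000000, Hash: "abc123", Path: "/tmp/example.txt"}
	if err := saveSnapshot("filestate.gob", states); err != nil {
		log.Fatal(err)
	}
	log.Printf("%d entries on disk", len(states))
}

Whether you rewrite the snapshot on every change or only periodically is the trade-off the answer points out once the data grows.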
There's an ordered key-value persistence library for Go that I wrote called gkvlite:
https://github.com/steveyen/gkvlite
JSON is very simple but produces bigger files because of the repeated field names. XML has no advantage. You should go with CSV, which is really simple too; your program will be less than a page of code.
But it depends, in fact, on your modifications. If you make a lot of modifications and must have them stored synchronously on disk, you may need something a little more complex than a single file. If your map is mostly read-only, or if you can afford to dump it to file only rarely (not every second), a single CSV file alongside an in-memory map will keep things simple and efficient.
BTW, use Go's encoding/csv package to do this.
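A sketch of that CSV route with Go's encoding/csv package; the column order and file name are my own choices:

package main

import (
	"encoding/csv"
	"log"
	"os"
	"strconv"
)

type FileState struct {
	LastModified int64
	Hash         string
	Path         string
}

func main() {
	states := map[string]FileState{
		"/tmp/example.txt": {LastModified: 1700000000, Hash: "abc123", Path: "/tmp/example.txt"},
	}

	// Write one record per entry: path, last-modified, hash.
	out, err := os.Create("filestate.csv") // illustrative file name
	if err != nil {
		log.Fatal(err)
	}
	w := csv.NewWriter(out)
	for _, s := range states {
		if err := w.Write([]string{s.Path, strconv.FormatInt(s.LastModified, 10), s.Hash}); err != nil {
			log.Fatal(err)
		}
	}
	w.Flush()
	if err := w.Error(); err != nil {
		log.Fatal(err)
	}
	out.Close()

	// Read it back into a map keyed by Path.
	in, err := os.Open("filestate.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer in.Close()
	records, err := csv.NewReader(in).ReadAll()
	if err != nil {
		log.Fatal(err)
	}
	loaded := make(map[string]FileState)
	for _, rec := range records {
		mod, _ := strconv.ParseInt(rec[1], 10, 64)
		loaded[rec[0]] = FileState{Path: rec[0], LastModified: mod, Hash: rec[2]}
	}
	log.Printf("loaded %d entries", len(loaded))
}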

Export Excel from web, what's the BEST way?

About 3 years ago, I was looking for a way to allow a web app user to download table results to an Excel file. I knew that I didn't want to put Office on the web server and that I probably wanted to create the XLS file in XML format. The question was: what was the best way?
Now I am writing my resume and trying to recap the things that I did, and I am concerned that I didn't take the best approach. I am wondering if somebody can tell me whether my suspicions are true.
Basically, I saved an Excel file as XML, looked at the contents of the saved file, and reverse-engineered what I thought was a pretty cool SDK to create an Excel file in XML format. It was fairly robust, with options, a nice object model, etc.
But did such a library already exist? One that I could have used? I want to know whether I will need to defend this "accomplishment".
Also, could anyone recommend a good place where I can see actual resumes of people with .NET / SQL Server or general developer skills?
You can try SmartXLS (for Java or .NET); it supports most features of Excel (cell formatting, charts, formulas, pivot tables, etc.) and can read/write both the Excel 97-2003 XLS format and the Excel 2007 OpenXML format.
These people wrote a perfectly good one that you probably couldn't implement yourself for as cheaply.
