K6 Load Testing - Can you specify whether the output metric for data received/sent is shown in kB or MB? - k6

K6 Load Testing
At the end of testing, the following metrics are shown:
data_received..............: 1.4 MB 44 kB/s
data_sent..................: 214 kB 6.7 kB/s
In the above example, data_received is shown in MB whereas data_sent is shown in kB.
Is it possible to specify that these values are ALWAYS shown in kB or MB?
This will make comparing values easier as we know that we are comparing like with like.

There is no such option, and while it will probably be possible to add I would argue this is better done using the json export summary where this is just a number.
There is also a proposal (the whole ticket is about problems with the current summary output) to have a way to have user (and standard) templates for this output where this might be possible, but nobody has started working on it.

Related

Copying data from SQL Server to R using ODBC connection

I have successfully set up a R SQL Server ODBC connection by doing:
DBI_connection <- dbConnect(odbc(),
driver = "SQL Server"
server = server_name
database = database_name)
Dataset_in_R <- dbFetch(dbSendQuery(DBI_connection,
"SELECT * FROM MyTable_in_SQL"))
3 quick questions:
1-Is there a quicker way to copy data from SQL Server to R? This table has +44million rows and it is still running...
2-If I make any changes to this data in R does it change anything in my MyTable_in_SQL? I dont think so because I have saved it in a global data.frame variable in R, but just checking.
3-How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
1: Is there a quicker way to copy data from SQL Server to R?
The answer here is rather simple to answer. the odbc package in R does quite a bit under-the-hood to ensure compatibility between the result fetched from the server and R's data structure. It might be possible to obtain a slight increase in speed by using an alternative package (RODBC is an old package, and it sometimes seems faster). In this case however, with 44 mil. rows, I expect that the bigger performance boost comes from preparing your sql-statement. The general idea would be to
Remove any unnecessary columns. Remember each column will need to be downloaded, so if you have 20 columns, removing 1 column may reduce your query execution time by ~5% (assuming linear run-time)
If you plan on performing aggregation, it will (very close to almost) faster to perform this directly in your query, eg, if you have a column called Ticker and a column called Volume and you want the average value of Volume you could calculate this directly in your query. Similar for last row using last_value(colname) over ([partition by [grouping col 1], [grouping col 2] ...] order by [order col 1], [order col 2]) as last_value_colname.
If you choose to do this, it might be beneficial to test your query on a small subset of rows using TOP N or LIMIT N eg: select [select statement] from mytable_in_sql order by [order col] limit 100 which would only return the first 100 rows. As Martin Schmelzer commented this can be done via R with the dplyr::tbl function as well, however it is always faster to correct your statement.
Finally if your query becomes more complex (does not seem to be the case here), it might be beneficial to create a View on the table CREATE VIEW with the specific select statement and query this view instead. The server will then try to optimize the query, and if your problem is on the server side rather than local side, this can improve performance.
Finally one must state the obvious. As noted above when you query the server you are downloading some (maybe quite a lot) of data. This can be improved by improving your internet connection either by repositioning your computer, router or directly connecting via a cord (or purely upgrading ones internet connection). For 44 Mil. rows if you have only a single 64 bit double precision variable, you have 44 * 10^6 / 1024^3 = 2.6 GiB of data (if not compressed). If you have 10 columns, this goes up to 26 GiB of data. It simply is going to take quite a long time to download all of this data. Thus getting this row count down would be extremely helpful!
As a side note you might be able to simply download the table directly via SSMS slightly faster (still slow due to table size) and then import the file locally. For the fastest speed you likely have to look into the Bulk import and export functionality of SQL-server.
2: If I make any changes to this data in R does it change anything in my MyTable_in_SQL?
No: R has no internal pointer/connection once the table has been loaded. I don't even believe a package exists (in R at least) that opens a stream to the table which could dynamically update the table. I know that a functionality like this exists in Excel, but even using this has some dangerous side effects and should (in my opinion) only be used in read-only applications, where the user wants to see a (almost) live-stream of the data.
3: How to avoid going through this step every time I open R? Is there a way to save my data.frame in the "background" in R?
To avoid this, simply save the table after every session. Whenever you close Rstudio it will ask you if you want to save your current session, and here you may click yes, at which point it will save .Rhistory and .Rdata in the getwd() directory, which will be imported the next time you open your session (unless you changed your working directory before closing the session using setwd(...). However I highly suggest you do not do this for larger datasets, as it will cause your R session to take forever to open the next time you open R, as well as possibly creating unnecessary copies of your data (for example if you import it into df and make a transformation in df2 then you will suddenly have 2 copies of a 2.6+ GiB dataset to load every time you open R). Instead I highly suggest saving the file using arrow::write_parquet(df, file_path), which is a much (and I mean MUCH!!) faster alternative to saving as RDS or csv files. These can't be opened as easily in Excel, but can be opened in R using arrow::read_parquet and python using pandas.read_parquet or pyarrow.parquet.read_parquet, while being compressed to a size that is usually 50 - 80 % smaller than the equivalent csv file.
Note:
If you already did save your R session after loading in the file, and you experience a very slow startup, I suggest removing the .RData file from your working directory. Usually the documents folder (C:/Users/[user]/Documents) from your system.
On question 2 you're correct, any changes in R won't change anything in the DB.
About question 3, you can save.image() or save.image('path/image_name.Rdata') and it will save your environment so you can recover it later on another session with load.image('path/image_name.Rdata').
Maybe with this you don't need a faster way to get data from a DB.

Why do objects in RStudio/R which display as 1 GB in size get saved by RData or RDS file formats at much larger sizes, even at no compression?

I currently have an list object in RStudio, which shows up in the Environment listing as 1.2 GB. However, when I save with the function saveRDS with compress = FALSE, the size of saved object shows up as nearly 4 GB.
Is the reporting of the size of my list object wrong or is something else happening? I thought that if an object took up a certain space in R, it should save at that same size without compression? I understand there are a few questions on Stackoverflow similar to this, but none seems to explain why it differs even with no compression.
The calculation of the size of objects in R is complicated by the need for efficient memory management. Your list may contain elements that are not accounted for while in memory as they may be shared resources, but will need to be included when exported. The help file for object.size states that:
Exactly which parts of the memory allocation should be attributed to
which object is not clear-cut. This function merely provides a rough
indication: it should be reasonably accurate for atomic vectors, but
does not detect if elements of a list are shared, for example.
(Sharing amongst elements of a character vector is taken into account,
but not that between character vectors in a single object.)

correlation matrix using large data sets in R when ff matrix memory allocation is not enough

I have a simple analysis to be done. I just need to calculate the correlation of the columns (or rows ,if transposed). Simple enough? I am unable to get the results for the whole week and I have looked through most of the solutions here.
My laptop has a 4GB RAM. I do have access to a server with 32 nodes. My data cannot be loaded here as it is huge (411k columns and 100 rows). If you need any other information or maybe part of the data I can try to put it up here, but the problem can be easily explained without really having to see the data. I simply need to get a correlation matrix of size 411k X 411k which means I need to compute the correlation among the rows of my data.
Concepts I have tried to code: (all of them in some way give me memory issues or run forever)
The most simple way, one row against all, write the result out using append.T. (Runs forever)
biCorPar.r by bobthecat (https://gist.github.com/bobthecat/5024079), splitting the data into blocks and using ff matrix. (unable to allocate memory to assign the corMAT matrix using ff() in my server)
split the data into sets (every 10000 continuous rows will be a set) and do correlation of each set against the other (same logic as bigcorPar) but I am unable to find a way to store them all together finally to generate the final 411kX411k matrix.
I am attempting this now, bigcorPar.r on 10000 rows against 411k (so 10000 is divided into blocks) and save the results in separate csv files.
I am also attempting to run every 1000 vs 411k in one node in my server and today is my 3rd day and I am still on row 71.
I am not an R pro so I could attempt only this much. Either my codes run forever or I do not have enough memory to store the results. Are there any more efficient ways to tackle this issue?
Thanks for all your comments and help.
I'm familiar with this problem myself in the context of genetic research.
If you are interested only in the significant correlations, you may find my package MatrixEQTL useful (available on CRAN, more info here: http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/ ).
If you want to keep all correlations, I'd like to first warn you that in the binary format (economical compared to text) it would take 411,000 x 411,000 x 8 bytes = 1.3 TB. If this what you want and you are OK with the storage required for that, I can provide my code for such calculations and storage.

pytables: how to fill in a table row with binary data

I have a bunch of binary data in N-byte chunks, where each chunk corresponds exactly to one row of a PyTables table.
Right now I am parsing each chunk into fields, writing them to the various fields in the table row, and appending them to the table.
But this seems a little silly since PyTables is going to convert my structured data back into a flat binary form for inclusion in an HDF5 file.
If I need to optimize the CPU time necessary to do this (my data comes in large bursts), is there a more efficient way to load the data into PyTables directly?
PyTables does not currently expose a 'raw' dump mechanism like you describe. However, you can fake it by using UInt8Atom and UInt8Col. You would do something like:
import tables as tb
f = tb.open_file('my_file.h5', 'w')
mytable = f.create_table('/', 'mytable', {'mycol': tb.UInt8Col(shape=(N,))})
mytable.append(myrow)
f.close()
This would likely get you the fastest I/O performance. However, you will miss out on the meaning of the various fields that are part of this binary chunk.
Arguably, raw dumping of the chunks/rows is not what you want to do anyway, which is why it is not explicitly supported. Internally HDF5 and PyTables handle many kinds of conversion for you. This includes but is not limited to things like endianness and thet platform specific feature. By managing the data types for you the resultant HDF5 file and data set cross platform. When you dump raw bytes in the manner you describe you short-circuit one of the main advantages of using HDF5/PyTables. If you do short-circuit, you have a high probability that the resulting file will look like garbage on anything but the original system that produced it.
So in summary, you should be converting the chunks to the appropriate data types in memory and then writing out. Yes this takes more processing power, time, etc. So in addition to being the right thing to do it will ultimately save you huge headaches down the road.

R using waaay more memory than expected

I have an Rscript being called from a java program. The purpose of the script is to automatically generate a bunch of graphs in ggplot and them splat them on a pdf. It has grown somewhat large with maybe 30 graphs each of which are called from their own scripts.
The input is a tab delimited file from 5-20mb but the R session goes up to 12gb of ram usage sometimes (on a mac 10.68 btw but this will be run on all platforms).
I have read about how to look at the memory size of objects and nothing is ever over 25mb and even if it deep copies everything for every function and every filter step it shouldn't get close to this level.
I have also tried gc() to no avail. If I do gcinfo(TRUE) then gc() it tells me that it is using something like 38mb of ram. But the activity monitor goes up to 12gb and things slow down presumably due to paging on the hd.
I tried calling it via a bash script in which I did ulimit -v 800000 but no good.
What else can I do?
In the process of making assignments R will always make temporary copies, sometimes more than one or even two. Each temporary assignment will require contiguous memory for the full size of the allocated object. So the usual advice is to plan to have _at_least_ three time the amount of contiguous _memory available. This means you also need to be concerned about how many other non-R programs are competing for system resources as well as being aware of how you memory is being use by R. You should try to restart your computer, run only R, and see if you get success.
An input file of 20mb might expand quite a bit (8 bytes per double, and perhaps more per character element in your vectors) depending on what the structure of the file is. The pdf file object will also take quite a bit of space if you are plotting each point within a large file.
My experience is not the same as others who have commented. I do issue gc() before doing memory intensive operations. You should offer code and describe what you mean by "no good". Are you getting errors or observing the use of virtual memory ... or what?
I apologize for not posting a more comprehensive description with code. It was fairly long as was the input. But the responses I got here were still quite helpful. Here is how I mostly fixed my problem.
I had a variable number of columns which, with some outliers got very numerous. But I didn't need the extreme outliers, so I just excluded them and cut off those extra columns. This alone decreased the memory usage greatly. I hadn't looked at the virtual memory usage before but sometimes it was as high as 200gb lol. This brought it down to up to 2gb.
Each graph was created in its own function. So I rearranged the code such that every graph was first generated, then printed to pdf, then rm(graphname).
Futher, I had many loops in which I was creating new columns in data frames. Instead of doing this, I just created vectors not attached to data frames in these calculations. This actually had the benefit of greatly simplifying some of the code.
Then after not adding columns to the existing dataframes and instead making column vectors it reduced it to 400mb. While this is still more than I would expect it to use, it is well within my restrictions. My users are all in my company so I have some control over what computers it gets run on.

Resources