How to store multidimensional data in R

Please consider the following situation:
I measure values every hour (time) (campaigns lasting from a few months to ~10 years)
of several species (1 to 10)
with several instruments (1 to 5)
on several measurement sites (~70)
and each site has several sampling levels (1 to 5)
and each value has a flag indicating if it is valid or not
I am looking for the fastest and simplest way to store these data, considering the fact that the database/files/whatever should be readable and writeable with R.
Note that:
Some experiments consist of measuring a few species over a very long time, with a single instrument and sampling level,
Some experiments consist of comparing the same few-months timeframe across many sites (~70)
Some sites have many sampling levels and/or instruments (which will be compared)
The storage system must be readable (and if possible writeable) in parallel
What I tried so far:
MySQL database, with one table per site/species, each table containing the following columns: time, sampling level, instrument, value and flag. Of course, as the number of sites grows, the number of tables grows too. And comparing sites is painful, as it requires a lot of queries. Moreover, sampling level and instrument are repeated many times within each table, which wastes space.
NetCDF files: interesting for their ability to store multi-dimensional data; they are good for storing a fixed set of data, but not practical for daily modification and not very "scalable".
Druid, a multidimensional database management system, originally "business intelligence"-oriented. The principle is good, but it is much too heavy and slow for my application.
Thus, I am looking for a system which:
Takes more or less the same time to retrieve
100 hours of data of 1 site, 1 species, 1 instrument, 1 sampling level, or
10 hours of data of 10 sites, 1 species, 1 instrument, 1 sampling level, or
10 hours of data of 1 site, 2 species, 1 instrument, 5 sampling levels, or
etc.
Allows parallel reads and writes
Minimizes the time to write to and read from the database
Minimizes the disk space used
Allows easy addition of a new site, instrument, species, etc.
Works with R
A good system would be a kind of hypercube which allows complex queries on all dimensions...

A relational database with a multi-column primary key (or candidate key) is well suited to store this kind of multi-dimensional data. From your description, it seems that the appropriate primary key would be time, species, instrument, site, and sampling_level. The flag appears to be an attribute of the value, not a key. This table should have indexes for all the columns that you will use to select data for retrieval. You may want additional tables to store descriptions or other attributes of the species, instruments, and sites. The main data table would have foreign keys into each of these.
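For concreteness, here is a minimal sketch of such a schema, written with DBI from R and an in-memory SQLite database so it runs as-is (with MySQL you would connect through RMariaDB::MariaDB() instead); the table and column names are illustrative assumptions, not a prescription:

library(DBI)

# In-memory SQLite keeps the example self-contained; swap the driver
# for a real server in production (connection details are assumptions).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# One fact table keyed by all the dimensions; flag is an attribute of the value.
dbExecute(con, "
  CREATE TABLE measurement (
    time           TEXT    NOT NULL,   -- ISO-8601 timestamp
    species_id     INTEGER NOT NULL,
    instrument_id  INTEGER NOT NULL,
    site_id        INTEGER NOT NULL,
    sampling_level INTEGER NOT NULL,
    value          REAL,
    flag           INTEGER,
    PRIMARY KEY (time, species_id, instrument_id, site_id, sampling_level)
  )")

# Secondary indexes on the columns used to filter retrievals.
dbExecute(con, "CREATE INDEX idx_meas_site    ON measurement (site_id, time)")
dbExecute(con, "CREATE INDEX idx_meas_species ON measurement (species_id, time)")

# A typical 'hypercube-style' slice: 10 hours of 2 species at 5 sampling levels on 1 site.
res <- dbGetQuery(con, "
  SELECT time, species_id, sampling_level, value, flag
  FROM   measurement
  WHERE  site_id = 12
    AND  species_id IN (1, 2)
    AND  time BETWEEN '2015-06-01T00:00:00' AND '2015-06-01T10:00:00'")

dbDisconnect(con)

Because every dimension is just another key column, adding a new site or species is a plain INSERT, and any combination of dimensions can be sliced with a single query instead of one table per site/species.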

Related

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers and conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
(DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument, which is a MySQL query, and various others which aren't important right now.)
sample data looks like this:
org_id  inserted_at  lead_converted_at
1       10/17/2021   2021-01-27T03:39:03
2       10/18/2021   2021-01-28T03:39:03
1       10/17/2021   2021-01-28T03:39:03
3       10/19/2021   2021-01-29T03:39:03
2       10/18/2021   2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that is needed pre-aggregation (such as the number of leads per month per org, which isn't possible once the aggregation has occurred, because in the sample above the aggregation would remove the ability to see more than one instance of org_id 1, for example) from a dataset that needs to be aggregated in order to be accessed in the first place. Maybe I just don't understand this well enough to know the right questions to ask. Any direction is appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (i.e. one year at a time) so that it fits in memory. You could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when conversion_date is not null then 1 else 0 end) as convert_count
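Putting that together, here is a hedged sketch of what the single query could look like, reusing the domo_get_query call and query ID from the question; the MySQL date handling assumes inserted_at is a DATE/DATETIME column:

monthly_leads <- domo_get_query(
  '6d969e8b-fe3e-46ca-9ba2-21106452eee2',
  auto_limit = TRUE,
  query = "SELECT org_id,
                  DATE_FORMAT(inserted_at, '%Y-%m') AS lead_month,
                  COUNT(*) AS lead_count,
                  SUM(CASE WHEN lead_converted_at IS NOT NULL THEN 1 ELSE 0 END) AS convert_count,
                  SUM(CASE WHEN lead_converted_at IS NOT NULL THEN 1 ELSE 0 END) / COUNT(*) AS conversion_rate
           FROM table
           GROUP BY org_id, DATE_FORMAT(inserted_at, '%Y-%m')
           ORDER BY org_id, lead_month")

The aggregation then happens inside the database, so only one row per org and month comes back to R.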

Is there a suitable way to cluster time series where only four values are possible (0, 1, 2, 4) and the length is not fixed?

I am trying to cluster customers consumption behaviors using time series techniques. Customers buy tokens and use them whenever they want (a max of 4 tokens per day).
This is a sample of what the customers' journey time series look like (x = days after first order, y = number of tokens consumed per day), as shown in the image below.
I tried clustering with derived variables (median delay between two events, standard deviation of the delays, total number of tokens, time between first and last consumption, mean number of tokens consumed per consumption event, ...). I used k-means and this gave me some good results, but it wasn't enough to spot all patterns in the data. I looked at some papers about the use of dynamic time warping in such cases, but I have never used such algorithms.
Are there any materials (demos) on the use of such algorithms to cluster such time series?
Yes.
There are many techniques that can be useful here.
The obvious approach from the literature would be HAC (hierarchical agglomerative clustering) with a DTW (dynamic time warping) distance.
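For a concrete starting point, here is a minimal R sketch of that approach using the dtw package together with base R's hclust; the example series are made up, and DTW handles the unequal lengths:

library(dtw)

# Toy journeys: tokens consumed per day, with different lengths per customer.
series <- list(
  c(0, 1, 0, 2, 4, 0, 1),
  c(1, 0, 0, 1, 2),
  c(4, 4, 2, 0, 0, 1, 0, 2),
  c(0, 0, 1, 1, 0, 2)
)

# Pairwise DTW distance matrix.
n <- length(series)
d <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  for (j in (i + 1):n) {
    d[i, j] <- d[j, i] <- dtw(series[[i]], series[[j]])$distance
  }
}

# Hierarchical agglomerative clustering on the DTW distances.
hc <- hclust(as.dist(d), method = "average")
clusters <- cutree(hc, k = 2)   # k = 2 is arbitrary for the toy data

Packages such as dtwclust wrap this workflow (distance computation, hierarchical and partitional clustering, validity indices) and are a good place to look for demos.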

Queries on the same big data dataset

Let's say I have a very big dataset (billions of records), one that doesn't fit on a single machine, and I want to run multiple unknown queries on it (it's a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar, but the problem is that I'm going to have a lot of I/O and network activity, since Spark will have to keep re-reading the dataset from disk and distributing it to the workers, instead of, for instance, dividing the data among the workers when the cluster goes up and then just asking each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
If I want to do what I described above, do I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in a MPP database or use OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
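Since the rest of this page is R-centric, here is a hedged sketch of the columnar/Parquet option using the arrow package with dplyr; the dataset name, path and columns are invented for illustration:

library(arrow)
library(dplyr)

# One-off: write the table as a partitioned Parquet dataset
# ('records' and its columns are hypothetical).
write_dataset(records, "records_parquet", partitioning = "year")

# Per request: open lazily, push the filter and the max down to the scan,
# so only the needed row groups and columns are read from disk.
ds <- open_dataset("records_parquet")

result <- ds %>%
  filter(year == 2020, category %in% c("a", "b")) %>%   # the user's chosen subset
  summarise(max_value = max(value)) %>%
  collect()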
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
Million entries, each has 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name for all the entries in the subset. So each possible subset (of each possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
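Here is a small R sketch of the precalculation and query steps described above; the data layout (each window as a list of named numeric vectors) and all names are invented for illustration, and names missing from some windows would need NA handling in a real version:

# Precalculation: per-window 'index' of (sum, count) for every name.
build_index <- function(window_rows) {
  vals <- unlist(window_rows)   # values keep their names
  nms  <- names(vals)
  list(sum   = tapply(vals, nms, sum),
       count = tapply(vals, nms, length))
}

# Querying: add up the per-window sums and counts, then divide.
average_over_windows <- function(indices, wanted_names) {
  sums   <- Reduce(`+`, lapply(indices, function(ix) ix$sum[wanted_names]))
  counts <- Reduce(`+`, lapply(indices, function(ix) ix$count[wanted_names]))
  sums / counts
}

# Tiny demo with the Window 1 rows from above:
window1 <- list(
  c(aaa = 20, abcd = 25, bb = 10, caca = 25, ddddd = 50, bada = 30),
  c(aaa = 12, abcd = 31, bb = 15, caca = 24, ddddd = 48, bada = 43)
)
ix1 <- build_index(window1)                       # matches the 'index' shown earlier
average_over_windows(list(ix1), c("aaa", "bb"))   # aaa = 16, bb = 12.5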
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

Setting minimum sample size for multiple sub-populations based on smallest sub-population

So I have one population of users, and this population is split into sub-populations of users based on their date of birth. There are about 20 different buckets of users that fall into the desired age groups.
The question is to see how the different buckets interact with a system over time.
The buckets vary in size: the biggest bucket has about 20,000 users (at the midpoint of the distribution), while both tail ends have fewer than 200 users each.
To answer the question of system usage over time, I have cleaned the data and am taking, from each bucket, a sample of 0.9 times the size of the smallest sub-population.
Then I re-sample with replacement N times (anywhere between 100 and 10,000 or so). The average of these re-samples closely approaches the sub-population mean of each bucket. What I find is that, over time and for most metrics of interaction (1, 2, 3, 4, 5, 6 months), the tail-end bucket with the lowest number of users is the most active. (This could suggest that the larger buckets contain a large proportion of users who are not active, or that their active users are simply not as active as those in other buckets.)
I took a quick summary of each of the buckets to make sure that there are no irregularities, and indeed the data shows that the smallest bucket does have higher quartiles, mean, and lowest and highest data values compared to the other buckets.
I went over the data collection methodology to make sure that there are no errors in obtaining the data, and looking through various data points does support the result of graphing the re-sampled values.
My question is: should I take the sample size based on each individual bucket independently? My gut tells me no, as all the buckets belong to the same population, and if I sample from the buckets each sample has to be fair and should thus use the same number of data points, based on the smallest bucket.
There is no modelling involved, this is just looking at the average number of usage of each user bucket per month.
Is my approach more or less on the right track?
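For reference, a minimal sketch of the resampling scheme described in the question, assuming a data frame usage with columns bucket and monthly_usage (both names are assumptions):

set.seed(1)

# Common sample size: 0.9 x the size of the smallest bucket.
n_common <- floor(0.9 * min(table(usage$bucket)))
n_boot   <- 1000   # number of resamples, N

# For each bucket, resample n_common users with replacement n_boot times
# and record the mean usage of each resample.
boot_means <- sapply(split(usage$monthly_usage, usage$bucket), function(x) {
  replicate(n_boot, mean(sample(x, size = n_common, replace = TRUE)))
})

colMeans(boot_means)   # bootstrap estimates of mean usage per bucket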

R - Cluster x number of events within y time period

I have a dataset of 59k entries recorded over 63 years. I need to identify clusters of events, the criterion being:
6 or more events within 6 hours
Each event has a unique ID, a time (HH:MM:SS) and a date (DD:MM:YY). An output would ideally have a cluster ID, the events that took place within each cluster, and the start and finish time and date.
Thinking about the problem in R, we would need to look at every date/time and count the number of events in the following 6 hours; if the number is 6 or greater, save the event IDs, and if not, move on to the next date and perform the same task. I have taken a data extract that just contains EventID, Date, Time and Year.
https://dl.dropboxusercontent.com/u/16400709/StackOverflow/DataStack.csv
If I come up with anything in the meantime I will post below.
Update: Having taken a break to think about the problem I have a new approach.
Add 6 hours to the date/time of each event, then count the number of events that fall between the start and end time; if there are 6 or more, take the event IDs and assign them a cluster ID. Then move on to the next event and repeat, looping over all 59k events.
Don't use clustering. It's the wrong tool. And the wrong term. You are not looking for abstract "clusters", but something much simpler and much more well defined. In particular, your data is 1 dimensional, which makes things a lot easier than the multivariate case omnipresent in clustering.
Instead, sort your data and use a sliding window.
If your data is sorted, and time[x+5] - time[x] < 6 hours, then these events satisfy your condition.
Sorting is O(n log n), but highly optimized. The remainder is O(n) in a single pass over your data. This will beat every single clustering algorithm, because they don't exploit your data characteristics.
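A minimal sketch of that idea in R, assuming a data frame events with columns EventID and DateTime (POSIXct); the column names are assumptions based on the extract described in the question:

# Sort once, then compare each event with the one five positions later.
events <- events[order(events$DateTime), ]

gap_hours <- as.numeric(difftime(events$DateTime[-(1:5)],     # time[x+5]
                                 head(events$DateTime, -5),   # time[x]
                                 units = "hours"))
starts <- which(gap_hours < 6)   # each index starts a run of >= 6 events within 6 hours

# Overlapping runs can then be merged into cluster IDs, for example:
if (length(starts) > 0) {
  cluster_id <- cumsum(c(1, diff(starts) > 5))   # new cluster when runs stop overlapping
}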
