Unexpected throughput with DynamoDB - amazon-dynamodb

I have a table in DDB with site_id as my hash key and person_id as the range key. There are another 6-8 columns on this table with numeric statistics about this person (e.g. times seen, last log in etc). This table has data for about 10 sites and 20 million rows (this is only used as a proof of concept now - the production table will have much bigger numbers).
I'm trying to retrieve all person_ids for a given site where time_seen > 10. So I'm doing a query using the hash key and adding time_seen > 10 as a filter criterion. This results in a few thousand entries, which I expected to get back pretty much instantly. My test harness runs in AWS in the same region.
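For reference, a rough sketch of the query being described, using the paws R SDK (the actual harness is in C#; the table name and the site value below are made up):

library(paws)
ddb <- dynamodb()

res <- ddb$query(
  TableName                 = "person_stats",           # hypothetical table name
  KeyConditionExpression    = "site_id = :site",
  FilterExpression          = "time_seen > :t",         # filter on a non-key attribute
  ExpressionAttributeValues = list(
    ":site" = list(S = "site-001"),                     # made-up site id
    ":t"    = list(N = "10")
  ),
  ProjectionExpression      = "person_id"
)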
The read capacity on this table is 100 units. The results I'm getting are attached.
For some reason I'm hitting the limits. The only two limits I'm aware of are the maximum data size returned and the query time. I'm only returning 32 bytes per row (so approx 100 KB per result), so there's no chance the size is the problem, and as you can see the time doesn't hit the 5 sec limit either. So why can't I get my results faster?
Results are retrieved in a single thread from C#.
Thanks

Related

How to access unaggregated results when aggregation is needed due to dataset size in R

My task is to get total inbound leads for a group of customers, leads by month for the same group of customers and conversion rate of those leads.
The dataset I'm pulling from is 20 million records, so I can't query the whole thing. I have successfully done the first step (getting the total lead count for each org) with this:
inbound_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                COUNT(*)
                                         from table
                                         GROUP BY org_id
                                         ORDER BY org_id")
DOMO is the BI tool I'm pulling from, and domo_get_query is an internal function from a custom library my company built. It takes a query argument, which is a MySQL query, and various others which aren't important right now.
Sample data looks like this:
org_id, inserted_at, lead_converted_at
1 10/17/2021 2021-01-27T03:39:03
2 10/18/2021 2021-01-28T03:39:03
1 10/17/2021 2021-01-28T03:39:03
3 10/19/2021 2021-01-29T03:39:03
2 10/18/2021 2021-01-29T03:39:03
I have looked through many aggregation tutorials online, but none of them seem to cover how to get data that is needed pre-aggregation (such as the number of leads per month per org, which isn't possible once the aggregation has occurred, because in the sample above the aggregation would remove the ability to see more than one instance of org_id 1, for example) from a dataset that needs to be aggregated in order to be accessed in the first place. Maybe I just don't understand this enough to know the right questions to ask. Any direction appreciated.
If you're unable to fit your data in memory, you have a few options. You could process the data in batches (i.e. one year at a time) so that it fits in memory. You could use a package like chunked to help.
But in this case I would bet the easiest way to handle your problem is to solve it entirely in your SQL query. To get leads by month, you'll need to truncate your date column and group by org_id, month.
To get conversion rate for leads in those months, you could add a column (in addition to your count column) that is something like:
sum(case when lead_converted_at is not null then 1 else 0 end) as convert_count
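Putting that together, a sketch of what the pushed-down query might look like through the same domo_get_query interface (column names are taken from the sample data above; the exact date-truncation syntax depends on the SQL dialect):

monthly_leads <- domo_get_query('6d969e8b-fe3e-46ca-9ba2-21106452eee2',
                                auto_limit = TRUE,
                                query = "select org_id,
                                                DATE_FORMAT(inserted_at, '%Y-%m') as lead_month,
                                                COUNT(*) as lead_count,
                                                sum(case when lead_converted_at is not null
                                                         then 1 else 0 end) as convert_count
                                         from table
                                         GROUP BY org_id, lead_month
                                         ORDER BY org_id, lead_month")

# Conversion rate can then be computed on the small aggregated result in R
monthly_leads$conversion_rate <- monthly_leads$convert_count / monthly_leads$lead_count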

Queries on the same big data dataset

Let's say I have a very big dataset (billions of records), one that doesn't fit on a single machine, and I want to run multiple unknown queries against it (it's a service where a user can choose a certain subset of the dataset and I need to return the max of that subset).
For the computation itself I was thinking about Spark or something similar. The problem is that I'm going to have a lot of IO/network activity, since Spark is going to have to keep re-reading the data set from disk and distributing it to the workers, instead of, for instance, having Spark divide the data among the workers when the cluster goes up and then just asking each worker to do the work on certain records (by their number, for example).
So, to the big data people here, what do you usually do? Just have Spark redo the read and distribution for every request?
And if I want to do what I described above, do I have no choice but to write something of my own?
If the queries are known but the subsets unknown, you could precalculate the max (or whatever the operator) for many smaller windows / slices of the data. This gives you a small and easily queried index of sorts, which might allow you to calculate the max for an arbitrary subset. In case a subset does not start and end neatly where your slices do, you just need to process the ‘outermost’ partial slices to get the result.
If the queries are unknown, you might want to consider storing the data in an MPP database or using OLAP cubes (Kylin, Druid?) depending on the specifics; or you could store the data in a columnar format such as Parquet for efficient querying.
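If you go the columnar route, a minimal sketch from R with the arrow package (the path, the partition column and wanted_groups are made up for illustration):

library(arrow)
library(dplyr)

# One-off: write the data out as partitioned Parquet
# (in practice this step would be done by whatever pipeline produces the data)
write_dataset(big_df, "datastore/records", partitioning = "record_group")

# Per query: only the needed columns and partitions are read from disk
ds <- open_dataset("datastore/records")
ds %>%
  filter(record_group %in% wanted_groups) %>%
  summarise(max_value = max(value)) %>%
  collect()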
Here's a precalculating solution based on the problem description in the OP's comment to my other answer:
A million entries, each with 3k name->number pairs. Given a subset of the million entries and a subset of the names, you want the average for each name over all the entries in the subset. Precomputing that for every possible subset (of every possible size) of a million entries is too much to calculate and keep.
Precalculation
First, we split the data into smaller 'windows' (shards, pages, partitions).
Let's say each window contains around 10k rows with roughly 20k distinct names and 3k (name,value) pairs in each row (choosing the window size can affect performance, and you might be better off with smaller windows).
Assuming ~24 bytes per name and 2 bytes for the value, each window contains 10k*3k*(24+2 bytes) = 780 MB of data plus some overhead that we can ignore.
For each window, we precalculate the number of occurrences of each name, as well as the sum of the values for that name. With those two values we can calculate the average for a name over any set of windows as:
Average for name N = (sum of sums for N)/(sum of counts for N)
Here's a small example with much less data:
Window 1
{'aaa':20,'abcd':25,'bb':10,'caca':25,'ddddd':50,'bada':30}
{'aaa':12,'abcd':31,'bb':15,'caca':24,'ddddd':48,'bada':43}
Window 2
{'abcd':34,'bb':8,'caca':22,'ddddd':67,'bada':9,'rara':36}
{'aaa':21,'bb':11,'caca':25,'ddddd':56,'bada':17,'rara':22}
Window 3
{'caca':20,'ddddd':66,'bada':23,'rara':29,'tutu':4}
{'aaa':10,'abcd':30,'bb':8,'caca':42,'ddddd':38,'bada':19,'tutu':6}
The precalculated Window 1 'index' with sums and counts:
{'aaa':[32,2],'abcd':[56,2],'bb':[25,2],'caca':[49,2],'ddddd':[98,2],'bada':[73,2]}
This 'index' will contain around 20k distinct names and two values for each name, or 20k*(24+2+2 bytes) = 560 KB of data. That's one thousand times less than the data itself.
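A minimal sketch of the precalculation in R, assuming each row is a named numeric vector of name->value pairs as in the Window 1 example above:

window1 <- list(
  c(aaa = 20, abcd = 25, bb = 10, caca = 25, ddddd = 50, bada = 30),
  c(aaa = 12, abcd = 31, bb = 15, caca = 24, ddddd = 48, bada = 43)
)

# Build the per-window index: the sum and the count of values for each name
build_index <- function(window) {
  vals   <- unlist(window, use.names = FALSE)
  nms    <- unlist(lapply(window, names))
  sums   <- tapply(vals, nms, sum)
  counts <- tapply(vals, nms, length)
  data.frame(name = names(sums), sum = as.numeric(sums), count = as.numeric(counts))
}

index1 <- build_index(window1)
# e.g. 'aaa' -> sum 32, count 2, matching the precalculated index above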
Querying
Now let's put this in action: given an input spanning 1 million rows, you'll need to load (1M/10k)=100 indices or 56 MB, which fits easily in memory on a single machine (heck, it would fit in memory on your smartphone).
But since you are aggregating the results, you can do even better; you don't even need to load all of the indices at once, you can load them one at a time, filter and sum the values, and discard the index before loading the next. That way you could do it with just a few megabytes of memory.
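A sketch of that streaming aggregation, with a hypothetical load_index() that reads one precalculated window index (like index1 above) at a time:

average_for_names <- function(window_ids, wanted_names, load_index) {
  sums   <- setNames(numeric(length(wanted_names)), wanted_names)
  counts <- setNames(numeric(length(wanted_names)), wanted_names)
  for (id in window_ids) {
    idx <- load_index(id)                        # one ~560 KB index in memory at a time
    hit <- idx[idx$name %in% wanted_names, ]
    sums[hit$name]   <- sums[hit$name]   + hit$sum
    counts[hit$name] <- counts[hit$name] + hit$count
  }
  sums / counts                                  # average per requested name
}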
More importantly, the calculation should take no more than a few seconds for any set of windows and names. If the names are sorted alphabetically (another worthwhile pre-optimization) you get the best performance, but even with unsorted lists it should run more than fast enough.
Corner cases
The only thing left to do is handle the case where the input span doesn't line up exactly with the precalculated windows. This requires a little bit of logic for the two 'ends' of the input span, but it can be easily built into your code.
Say each window contains exactly one week of data, from Monday through Sunday, but your input specifies a period starting on a Wednesday. In that case you would have to load the actual raw data from Wednesday through Sunday of the first week (a few hundred megabytes as we noted above) to calculate the (count,sum) tuples for each name first, and then use the indices for the rest of the input span.
This does add some processing time to the calculation, but with an upper bound of 2*780MB it still fits very comfortably on a single machine.
At least that's how I would do it.

Hitting Query Limit in Google Distance Matrix API on R

I have a list of 36 locations for which I have to get a distance matrix from each location to every other location, i.e. a 36x36 matrix. Using help from other questions on this topic on this forum, I was able to put together a basic code (demonstrated with four locations only) as follows:
library(googleway)
library(plyr)
key <- "VALID KEY" #removed for security reasons
districts <- c("Attock, Pakistan",
"Bahawalnagar, Pakistan",
"Bahawalpur, Pakistan",
"Bhakkar, Pakistan")
#Calculate pairwise distance between each location
lst <- google_distance(origins=districts, destinations=districts, key=key)
res.lst <- list()
lst_elements <- for (i in 1:length(districts)) {
e.row <- rbind(cbind(districts[i], distance_destinations(lst),
distance_elements(lst)[[i]][['distance']]))
res.lst[[i]] <- e.row
}
# view results as list
res.lst
# combine each element of list into a dataframe.
res.df <- ldply(res.lst, rbind)
#give names to columns
colnames(res.df) <- c("origin", "destination", "dist.km", "dist.m")
#Display result
res.df
This code works fine for a small number of queries, i.e. when there are only a few locations (e.g. 5) at a time. For anything larger, I get an "Over-Query-Limit" error with the message "You have exceeded your rate-limit for this API", even though I have not reached the 2500 limit. I also signed up for the 'Pay-as-you-use' billing option but I continue to get the same error. I wonder if this is an issue of how many requests are being sent per second (i.e. the rate)? And if so, can I modify my code to address this? Even without an API key, this code does not ask for more than 2500 queries, so I should be able to do it, but I'm stumped on how to resolve this even with billing enabled.
The free quota is 2500 elements.
Each query sent to the Distance Matrix API is limited by the number of allowed elements, where the number of origins times the number of destinations defines the number of elements.
Standard Usage Limits
Users of the standard API:
2,500 free elements per day, calculated as the sum of client-side and server-side queries.
Maximum of 25 origins or 25 destinations per request.
A 36x36 request would be 1296 elements, so after 2 such requests you would be out of quota.
For anyone still struggling with this issue: I was able to resolve it by using a while loop. Since I was well under the 2500-element quota, this was a rate problem rather than a query-limit problem. With a while loop, I broke the locations into chunks (running distance queries for 2x36 at a time) and looped over the entire data to build the 36x36 matrix I needed.
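A rough sketch of that chunked approach (the chunk size and the one-second pause are illustrative, not the exact code used):

chunk_size <- 2
chunks <- list()
i <- 1
while (i <= length(districts)) {
  origins_chunk <- districts[i:min(i + chunk_size - 1, length(districts))]
  # 2 origins x 36 destinations = 72 elements per request
  lst <- google_distance(origins = origins_chunk,
                         destinations = districts,
                         key = key)
  chunks[[length(chunks) + 1]] <- lst
  Sys.sleep(1)   # throttle so requests stay under the per-second rate limit
  i <- i + chunk_size
}
# each element of chunks can then be unpacked as in the original loop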

How to store multidimensional data

Please consider the following situation:
I measure values every hour (time) (campaigns lasting from a few months to ~10 years)
of several species (1 to 10)
with several instruments (1 to 5)
on several measurement sites (~70)
and each site has several sampling levels (1 to 5)
and each value has a flag indicating if it is valid or not
I am looking for the fastest and simplest way to store these data, considering the fact that the database/files/whatever should be readable and writeable with R.
Note that:
Some experiments consist of measuring a few species over a very long time, for a single instrument and sampling level,
Some experiments consist of comparing the same few-months timeframe for a lot of sites (~70)
Some sites have many sampling levels and/or instruments (which will be compared)
The storage system must be readable (and if possible writeable) in parallel
What I tried so far:
MySQL database, with 1 table per site/species, each table containing the following columns: time, sampling level, instrument, value and flag. Of course, as the number of sites grows, the number of tables grows too. And comparing sites is painful, as it requires a lot of requests. Moreover, sampling level and instrument are repeated many times within a table, which occupies space inefficiently.
NetCDF files: interesting for their ability to store multi-dimensional data, they are good for storing a set of data but are not practical for daily modification and not very "scalable".
Druid, a multidimensional database management system, originally "business intelligence"-oriented. The principle is good, but it is much too heavy and slow for my application.
Thus, I am looking for a system which:
Takes more or less the same time to retrieve
100 hours of data of 1 site, 1 species, 1 instrument, 1 sampling level, or
10 hours of data of 10 sites, 1 species, 1 instrument, 1 sampling level, or
10 hours of data of 1 site, 2 species, 1 instrument, 5 sampling levels, or
etc.
Allows parallel R/W
Minimizes the time to write to and read from the database
Minimizes used disk space
Allows easy addition of a new site, or instrument, or species, etc.
Works with R
A good system would be a kind of hypercube which allows complex requests on all dimensions...
A relational database with a multi-column primary key (or candidate key) is well suited to store this kind of multi-dimensional data. From your description, it seems that the appropriate primary key would be time, species, instrument, site, and sampling_level. The flag appears to be an attribute of the value, not a key. This table should have indexes for all the columns that you will use to select data for retrieval. You may want additional tables to store descriptions or other attributes of the species, instruments, and sites. The main data table would have foreign keys into each of these.
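As an illustration, a minimal sketch of such a schema from R, using DBI with SQLite as a stand-in for whichever relational database you choose (table and column names are only suggestions):

library(DBI)
con <- dbConnect(RSQLite::SQLite(), "measurements.sqlite")

dbExecute(con, "
  CREATE TABLE measurement (
    time           TEXT    NOT NULL,
    species_id     INTEGER NOT NULL,
    instrument_id  INTEGER NOT NULL,
    site_id        INTEGER NOT NULL,
    sampling_level INTEGER NOT NULL,
    value          REAL,
    flag           INTEGER,
    PRIMARY KEY (time, species_id, instrument_id, site_id, sampling_level)
  )")

# A typical multi-dimensional slice: 10 hours of one species at one site,
# one instrument, all sampling levels
res <- dbGetQuery(con, "
  SELECT time, sampling_level, value, flag
  FROM   measurement
  WHERE  site_id = 1 AND species_id = 2 AND instrument_id = 1
    AND  time BETWEEN '2020-01-01 00:00' AND '2020-01-01 10:00'")
dbDisconnect(con)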

Setting minimum sample size for multiple sub-populations based on smallest sub-population

So I have 1 population of users, this population is split into sub-populations of users based on their date of birth. There are about 20 different buckets of users that fall into the desired age groups.
The question is to see how the different buckets interact with a system over time.
Each bucket has varied size, biggest bucket has about 20,000 users (at the mid point of the distribution) with both tail ends having <200 users each.
To answer the question of system usage over time, I have cleaned the data and am taking a sample equal to 0.9 times the size of the smallest sub-population from each of the buckets.
Then I re-sample with replacement N times (N can be between 100 and 10000 or whatnot). The average of these re-samples closely approaches the sub-population mean of each bucket. What I find is that, pretty much over time for most metrics of interaction (1, 2, 3, 4, 5, 6 months), the tail-end bucket with the lowest number of users is the most active (this could suggest that the buckets with more members contain a large proportion of users who are not active, or that their active users are just not as active as those in other buckets).
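For concreteness, a rough sketch of that resampling in R (assuming a hypothetical data frame usage with columns bucket and monthly_usage):

# Sample size: 0.9 times the size of the smallest bucket
n_sample <- floor(0.9 * min(table(usage$bucket)))
N <- 1000   # number of re-samples

# For each bucket, re-sample n_sample users with replacement N times
# and take the mean each time
boot_means <- sapply(split(usage$monthly_usage, usage$bucket), function(x) {
  replicate(N, mean(sample(x, n_sample, replace = TRUE)))
})
colMeans(boot_means)   # bootstrap estimate of mean usage per bucket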
I took a quick summary of each of the buckets to make sure that there are no irregularities and indeed the data shows that the lowest bucket does have higher quartiles, mean, lowest and highest data values compared to the other buckets.
I went over the data collection methodology to make sure that there are no errors in obtaining the data and looking through various data points it does support the result of graphing the re-sampled values.
My question is: should I base the sample size on each individual bucket independently? My gut tells me no, since all the buckets belong to the same population, and if I sample from the buckets each sample has to be fair and therefore use the same number of data points, taken from the size of the smallest bucket.
There is no modelling involved, this is just looking at the average number of usage of each user bucket per month.
Is my approach more or less on the right track?
