I have what I think is an interesting question about Google Sheets and some maths. Here is the scenario:
4 numbers as follows:
64.20 | 107 | 535 | 1070
And a reference number into which the previous numbers need to fit, leaving the minimum possible residue, while recording the number of times each of them fits into the reference number. For example, say the reference number is the following:
806.45
So here is the problem:
I'm calculating how many times those 4 numbers fit into the reference number, working from the highest to the lowest number, like this:
| 1070 | => =IF(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0)) > 0,ROUNDDOWN(E12/((I15+J15)+IF(H17,K17,0)+IF(H19,K19,0))),0)
| 535 | => =IF(H15>0,ROUNDDOWN((E12-K15-IF(H17,K17,0)-IF(H19,K19,0))/(I14+J14)),ROUNDDOWN(E12/((I14+J14)+IF(H17,K17,0)+IF(H19,K19,0))))
| 107 | => =IF(OR(H15>0,H14>0),ROUNDDOWN((E12-K15-K14-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I13+J13)))
| 64.20 | => =IF(OR(H15>0,H14>0,H13>0),ROUNDDOWN((E12-K15-K14-K13-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)),ROUNDDOWN((E12-IF(H17,K17,0)-IF(H19,K19,0))/(I12+J12)))
As you can see, I'm checking whether the higher values have occurrences, so I can subtract that amount from the original number and calculate again how many times the lower number fits into the result of that subtraction. You can also see that I'm including some checkboxes in the formulas in order to add a fixed number to the main number.
This actually works, and as you can see in the example, the result is:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 2 times
| 64.20 | -> Fits 0 times
The residue of 806.45 in this example is: 57.45
But each number that needs to fit into the main number needs to take the others into consideration; if you solve this exercise manually, you can get something much better, like this:
| 1070 | -> Fits 0 times
| 535 | -> Fits 1 time
| 107 | -> Fits 0 times
| 64.20 | -> Fits 4 times
The residue of 806.45 in this example is: 14.65
When I'm talking about residue, I mean the result of the subtraction. I'm sorry if this is not clear; it's hard for me to explain maths in English, since it is not my native language. Please see the spreadsheet and make a copy to better understand what I'm trying to do, or suggest a way I could explain it better, if possible.
So what would you do to make this work more efficiently and "smartly", leaving the minimum possible residue after the calculation?
Here is the Google spreadsheet for reference and practice; please make a copy so others can try their own solutions:
LINK TO SPREADSHEET
Thanks in advance for any help or hints.
Delete all current formulas in H12:H15.
Then place this mega-formula in H12:
=ArrayFormula(QUERY(SPLIT(FLATTEN(SPLIT(VLOOKUP(E12,QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)&" "&I15)&"|"&QUERY(SPLIT(FLATTEN(QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&(SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13))),"|"),"Select Col1")&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)&" "&I14)&"|"&QUERY(SPLIT(FLATTEN(SEQUENCE(ROUNDUP(E12/I12),1,0)&" "&I12&" / "&TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)&" "&I13)&"|"&((SEQUENCE(ROUNDUP(E12/I12),1,0)*I12)+(TRANSPOSE(SEQUENCE(ROUNDUP(E12/I13),1,0)*I13)))),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I14),1,0)*I14)),"|"),"Select Col2")+TRANSPOSE(SEQUENCE(ROUNDUP(E12/I15),1,0)*I15)),"|"),"Select Col2, Col1 WHERE Col2 <= "&E12&" ORDER BY Col2 Asc, Col1 Desc"),2,TRUE)," / ",0,0))," "),"Select Col1"))
Typically, I explain my formulas. In this case, I trust that readers will understand why I cannot explain it. I can only offer it in working order.
To briefly give the general idea, this formula figures out how many times each of the four numbers fits into the target number alone and then adds every possible combination of all of those. Those are then limited to only the combinations less than the target number and sorted smallest to largest in total. Then a VLOOKUP looks up the target number in that list, returns the closest match, SPLITs the multiples from the amounts (which, in the end, have been concatenated into long strings), and returns only the multiples.
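If it helps to see the same idea outside of Sheets, here is a small brute-force sketch in R of what the formula is doing (the numbers are just the ones from the question; this is an illustration, not part of the sheet):

values <- c(1070, 535, 107, 64.20)
target <- 806.45
# Every combination of counts, each capped by how often that value fits alone
grid   <- expand.grid(lapply(values, function(v) 0:floor(target / v)))
totals <- as.vector(as.matrix(grid) %*% values)
keep   <- totals <= target
grid[keep, ][which.max(totals[keep]), ]  # 0 of 1070, 1 of 535, 0 of 107, 4 of 64.20
target - max(totals[keep])               # residue: 14.65

The Sheets version builds essentially the same combination table with the nested SEQUENCE/FLATTEN calls, which is why it is so long.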
I am at the beginning of setting up a survival analysis in R.
I took a look at this book: https://www.powells.com/book/modeling-survival-data-9780387987842/ but I am struggling to set the data up properly in the first place. So this is a very basic question about survival analysis, as I cannot find a good example online.
I'd like to understand how to incorporate the censored data into my Surv() function. I understand the event inputs in Surv() are:
0 = right censored
1 = event
2 = left censored
3 = interval censored
Right Censored: The study ends before an event takes place (ob1)
Left Censored: The event has already happened before the study starts
Event: Typically death or some other form of expected outcome (marked by x)
Interval Censored: The observation starts at some point during the study and has an event / drops out before the end of the study (ob5)
Left truncated: Ob 3,4,5 are left truncated
To better show what I am talking about, I sketched the described types of censored data below:
"o" marks the beginning of data / first occurrence in the data set
"x" marks event
Start of study End of observation
ob1 o-|-----------------------------------------------------------|--------
| |
ob2 o-|-------------------------------xo |
| |
ob3 | o-----------------------------------xo |
| |
ob4 | o------------------x-|----------o
| |
ob5 | o----------------------------o |
|--------------------------------------------------------------
1999 2010
Finally, here is what I would like to know:
Did I classify ob1-ob5 correctly?
How about the other types of observations?
How do I represent these as input for the Surv() function? If, for example, an observation is right censored, i.e. the study ends first, how does a "0" indicate this? What is the input for the time series when neither the event (1) nor the end of observation (0) occurs? What happens at a time when "nothing" happens?
When and how is interval censored data marked? A 3 for both the beginning and the end?
I can provide some sample code if needed.
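In the meantime, to make the question more concrete, here is how I currently imagine the calls would look (made-up follow-up times; this may be exactly where I am wrong):

library(survival)
# Right censoring only: status 1 = event, 0 = censored at end of study
time   <- c(11, 9, 8)   # follow-up times in years (made up)
status <- c(0, 1, 1)    # ob1 would be the 0: no event before the study ends
Surv(time, status)
# Mixed censoring with type = "interval" uses the 0/1/2/3 codes from above;
# time2 is only read for interval-censored rows (event = 3)
Surv(time = c(11, 2), time2 = c(11, 9), event = c(0, 3), type = "interval")
# Left truncation (ob3-ob5 entering late) would, as I understand it, use the
# counting-process form instead: Surv(start, stop, event)
Surv(c(3, 5), c(11, 9), c(0, 1))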
Again, thank you for your help with this!
I currently have a very large dataset with 2 attributes which contain only strings. The first attribute has search queries (single words) and the second attribute has their corresponding categories.
So the data is set up like this (a search query can have multiple categories):
Search Query | Category
X | Y
X | Z
A | B
C | G
C | H
Now I'm trying to use clustering algorithms to get an idea of the different groups my data is composed of. I read somewhere that when using a clustering algorithm with just strings, it is recommended to first use the Expectation Maximization (EM) clustering algorithm to get a sense of how many clusters you need, and then to use that number with k-means.
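From what I read, the idea would look roughly like this in R terms (using the mclust package on numeric stand-in data; I realize my strings would first have to be converted to numeric features, and I am trying to do the equivalent in Weka):

library(mclust)
X  <- scale(iris[, 1:4])   # stand-in numeric data
em <- Mclust(X)            # EM mixture model; BIC chooses the number of components
k  <- em$G                 # cluster count suggested by EM
km <- kmeans(X, centers = k, nstart = 25)
table(km$cluster)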
Unfortunately, I'm still very new to machine learning and Weka, so I'm constantly reading up on everything to teach myself. I might be making some very simple mistakes here so bear with me, please :)
So I imported a sample (100,000 lines out of 2.7 million) of my dataset into Weka and used the EM clustering algorithm, and it gives me the following results:
=== Run information ===
Scheme: weka.clusterers.EM -I 100 -N -1 -X 10 -max -1 -ll-cv 1.0E-6 -ll-iter 1.0E-6 -M 1.0E-6 -K 10 -num-slots 1 -S 100
Relation: testrunawk1_weka_sample.txt
Instances: 100000
Attributes: 2
att1
att2
Test mode: split 66% train, remainder test
=== Clustering model (full training set) ===
EM
==
Number of clusters selected by cross-validation: 2
Number of iterations performed: 14
[135,000-line table with strings, 2 clusters and their values]
Time taken to build model (percentage split): 28.42 seconds
Clustered Instances
0 34000 (100%)
Log-likelihood: -20.2942
So should I infer from this that I should be using 2 or 34000 clusters with k-means?
Unfortunately, both seem unusable to me. What I was hoping for was, for example, 20 clusters that I could then look at individually to figure out what kinds of groups can be found in my data. 2 clusters seems too low given the wide range of categories in my data, and 34000 clusters would be far too many to inspect manually.
I am unsure whether I'm doing something wrong in the Weka EM algorithm settings (currently the defaults) or whether my data is just a mess, and if so, how would I go about making this work?
I am still very much learning how this all works, so any advice is much appreciated! If there is a need for more examples of my settings or anything else just tell me and I'll get it for you. I could also send you this dataset if that is easier, but it's too large to paste in here. :)
I have a stream of numbers, and in every cycle I need to compute the average of the last N of them. This can, of course, be solved using an array where I store the last N numbers; in every cycle I shift it, add the new number, and compute the average.
N = 3
+---+-----+
| a | avg |
+---+-----+
| 1 | |
| 2 | |
| 3 | 2.0 |
| 4 | 3.0 |
| 3 | 3.3 |
| 3 | 3.3 |
| 5 | 3.7 |
| 4 | 4.0 |
| 5 | 4.7 |
+---+-----+
The first N numbers (where there "isn't enough data for computing the average") don't interest me much, so the results there may be anything/undefined.
My question is: can this be done without using an array, that is, with a constant amount of memory? If so, how?
I'll do the coding myself - I just need to know the theory.
Thanks
Think of this as a black box containing some state. If you control the input stream, you can draw conclusions about the state. In your sliding-window, array-based approach, it is kind of obvious that if you feed a bunch of zeros into the algorithm after the original input, you get a bunch of averages with a decreasing number of non-zero values taken into account. The last one has just one original non-zero value, so if you multiply it by N you get the last input back. Using that and the second-to-last output, which accounts for two non-zero inputs, you can reconstruct the second-to-last input, and so on.
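With N = 3 and a stream ending in ..., 4, 5, the reconstruction looks like this (a quick R illustration with made-up values):

a1 <- (4 + 5 + 0) / 3   # first zero fed in: two original values still in the window
a2 <- (5 + 0 + 0) / 3   # second zero fed in: one original value left
a2 * 3                  # 5 -> recovers the last original input
a1 * 3 - a2 * 3         # 4 -> recovers the second-to-last input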
So essentially your algorithm needs to maintain sufficient state to reconstruct the last N elements of input, at least if you formulate it as an on-line algorithm. I don't think an off-line algorithm can do any better, except if you allow it to read the input multiple times, but I don't have as strong an argument for that.
Of course, in some theoretical models you can avoid the array and e.g. encode all the state into a single arbitrary-length integer, but that's just cheating the theory and doesn't make any difference in practice.
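For completeness, here is a sketch of the usual compromise: you keep the N-element buffer (which the argument above says you cannot avoid), but a running sum makes each update O(1) work instead of re-adding N values. R is used just to illustrate; the inputs are the ones from the question:

make_averager <- function(n) {
  buf <- numeric(n); pos <- 0L; total <- 0; seen <- 0L
  function(x) {
    pos   <<- (pos %% n) + 1L
    total <<- total - buf[pos] + x      # drop the value leaving the window, add the new one
    buf[pos] <<- x
    seen  <<- seen + 1L
    if (seen < n) NA_real_ else total / n   # undefined until the window fills
  }
}
avg3 <- make_averager(3)
sapply(c(1, 2, 3, 4, 3, 3, 5, 4, 5), avg3)
# NA NA 2.0 3.0 3.33 3.33 3.67 4.0 4.67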
I am quite a beginner in data warehouse design. I have read some theory, but recently met a practical problem with the design of an OLAP cube. I use a star schema.
Let's say I have 2 dimension tables and 1 fact table:
Dimension Gazetteer:
dimension_id
country_name
province_name
district_name
Dimension Device:
dimension_id
device_category
device_subcategory
Fact table:
gazetteer_id
device_dimension_id
hazard_id (measure column)
area_m2 (measure column)
A "business object" (which is a mine field actually) can have multiple devices, is located in a single location (Gazetteer) and ocuppies X square meters.
So in order to know which device categories there are, I created one fact per device in a hazard, like this:
+--------------+---------------------+-----------------------+-----------+
| gazetteer_id | device_dimension_id | hazard_id | area_m2 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 321 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 654 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
| 123 | 987 | 0a0a-502c-11aa1331e98 | 6000 |
+--------------+---------------------+-----------------------+-----------+
I defined a measure "number of hazards" as distinct-count of hazard_id.
I also defined a "total area occupied" measure as a sum of area_m2.
Now I can use the Gazetteer and Device dimensions and know how many hazards there are for given dimension members.
But the problem is the area_m2: because it is defined as a sum, it gives a value n times higher than the actual area, where n is the number of devices of the hazard object. For example, the data above would give 18000 m² instead of 6000 m².
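To make the double counting concrete, here is the same arithmetic as a tiny R sketch (the second sum is the number I actually want):

facts <- data.frame(
  hazard_id = "0a0a-502c-11aa1331e98",
  device_id = c(321, 654, 987),
  area_m2   = 6000
)
sum(facts$area_m2)                                       # 18000 -- area triple-counted
sum(unique(facts[, c("hazard_id", "area_m2")])$area_m2)  # 6000  -- counted once per hazard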
How would you solve this problem?
I am using the Pentaho stack.
Thanks in advance
[moved from comment]
If a hazard_id is a minefield, and you're looking at mines-by-region (gazetteer) and size-of-minefields-by-gazetteer, maybe you could make a Hazard dimension which holds the area of the hazard; or possibly make a null-device entry in the DeviceDimension table, where only the null-device entry gets area_m2 set and the real devices get area_m2 = 0.
If you need to answer queries like "total area of minefields containing device 321", the second approach isn't going to answer them easily, which suggests that making a Hazard dimension might be the better approach.
I would also consider adding a device-count fact, which could hold the number of devices of each type per hazard.