Need some help determining how this date/time is being encoded.
I've tried different methods (unixtime, little endian, big endian) and can't figure it out.
Here are some examples (known date only):
20 94 9D 21 = 29-12-2016
C7 91 9E 21 = 30-12-2016
AD 6A 72 22 ~ around 24-03-2017
Thank you.
It would be very helpful to have a midpoint time (do you have any more examples?), but it appears to be approximately half a second per integer increment.
Sample 1: 0x219d9420 -> 563,975,200 (decimal)
Sample 2: 0x219e91c7 -> 564,040,135 (decimal)
Sample 3: 0x22726aad -> 577,923,757 (decimal)
Timestamp 1: 29-12-2016 -> 1482969600 (unixtime)
Timestamp 2: 30-12-2016 -> 1483056000 (unixtime)
Timestamp 3: 24-03-2017 -> 1490313600 (unixtime)
The difference between sample 3 and samples 1/2 definitely grows roughly in proportion to the gap between timestamp 3 and timestamps 1/2, but because 1 and 2 are so close together (and uncertain), it's difficult to say for certain.
Overall, 7,344,000 seconds pass while 13,948,557 mystery timestamps pass, which is pretty close (given the uncertainty in the given dates) to 2 mystery-timestamps per second. This would put the start time about 9.4 years before the first timestamp, around Aug 2, 2007.
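As a rough back-of-the-envelope check, here is the same arithmetic in R (a sketch only: the hex values and dates come from the question above, and a constant tick rate is an assumption):
samples   <- c(0x219d9420, 0x219e91c7, 0x22726aad)
unixtimes <- as.numeric(as.POSIXct(c("2016-12-29", "2016-12-30", "2017-03-24"), tz = "UTC"))
# Ticks per second between sample 1 and sample 3
rate <- diff(samples[c(1, 3)]) / diff(unixtimes[c(1, 3)])   # roughly 1.9
# Extrapolated "tick zero", assuming the rate is constant
as.POSIXct(unixtimes[1] - samples[1] / rate, origin = "1970-01-01", tz = "UTC")
# lands in early August 2007, give or take the uncertainty in the known dates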
I've got a dataset that has monthly metrics for different stores. Each store has three monthly metrics (total sales, customers, and transaction count), and my task is to find the store that, over a year, most closely matches a specific test store (e.g. Store 77).
In other words, over the year the test store and the most similar store need to have similar performance. My question is: how do I go about finding the most similar store? I've currently used Euclidean distance, but I'd like to know if there's a better way to go about it.
Thanks in advance
Store    Month     Metric 1
===========================
22       Jan-18    10
23       Jan-18    20
Is correlation a better way to measure similarity in this case than distance? I'm fairly new to data analysis, so if there are any resources where I can learn more about this, it would be much appreciated!
In general, deciding the similarity of items is domain-specific, i.e. it depends on the problem you are trying to solve. Therefore, there is no one-size-fits-all solution. Nevertheless, there is a basic procedure you can follow when tackling this kind of problem.
Case 1 - only distance matters:
If you want to find the most similar items (stores, in our case) using a distance measure, it's good practice to first scale your features in some way.
Example (min-max normalization):
Store    Month     Total sales    Total sales (normalized)
==========================================================
1        Jan-18    50             0.64
2        Jan-18    40             0.45
3        Jan-18    70             1
4        Jan-18    15             0
After you apply normalization to all attributes, you can calculate the Euclidean distance or any other metric that you think fits your data.
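A minimal sketch of the scale-then-measure idea in base R (the column names and values are made up for illustration):
stores <- data.frame(store = c(1, 2, 3, 4),
                     total_sales = c(50, 40, 70, 15))
# Min-max normalization: rescale each attribute to [0, 1]
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
stores$total_sales_norm <- minmax(stores$total_sales)
# Pairwise Euclidean distances between the (normalized) stores;
# with several metrics you would pass all normalized columns at once
dist(stores["total_sales_norm"])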
Some resources:
Similarity measures
Feature scaling
Case 2 - Trend matters:
Now, say that you want to find the similarity over the whole year. If your definition of similarity is just the state of the stores at the end of the year, then distance will do the job.
But if you want to find similar trends of increase/decrease in the attributes of two stores, then distance measures conceal this information. You would have to use correlation metrics or some other technique more sophisticated than a plain distance.
Simple example:
To keep it simple, let's say we are interested in a 3-month analysis and that we use only the sales attribute (unscaled):
Store    Month     Total sales
==============================
1        Jan-18    20
1        Feb-18    20
1        Mar-18    20
2        Jan-18    5
2        Feb-18    15
2        Mar-18    40
3        Jan-18    10
3        Feb-18    30
3        Mar-18    78
At the end of March, in terms of distance, Store 1 and Store 2 are identical, both having 60 total sales.
But as far as the increase ratio per month is concerned, Store 2 and Store 3 are our match: both tripled their sales from January to February, and grew by a factor of roughly 2.67 and 2.6 respectively from February to March.
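As a quick sketch in R, here are the two views on the example above (the vectors are just the monthly sales from the table):
s1 <- c(20, 20, 20)   # Store 1
s2 <- c(5, 15, 40)    # Store 2
s3 <- c(10, 30, 78)   # Store 3
c(sum(s1), sum(s2), sum(s3))       # 60 60 118: Stores 1 and 2 tie on totals
dist(rbind(s1, s2, s3))            # Stores 1 and 2 are also the closest pair month-by-month
cor(s2, s3)                        # Pearson: close to 1, Stores 2 and 3 move together
cor(s2, s3, method = "spearman")   # Spearman (rank-based) agrees
# Store 1 is flat, so its correlation with anything is undefined (NA with a warning)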
Bottom line: It really depends on what you want to quantify.
Well-known correlation metrics:
Pearson correlation coefficient
Spearman correlation coefficient
I am trying to get the top-frequency words in a data.table.
data.table : dtable4G
key freq value
================================
thanks for the 612 support
thanks for the 380 drink
thanks for the 215 payment
thanks for the 27 encouragement
have a great 154 day
have a great 132 weekend
have a great 54 week
have a great 42 time
have a great 19 night
at the same 346 time
at the same 57 damn
at the same 30 pace
at the same 11 speed
at the same 7 level
at the same 1 rate
I tried the code
dtable4G[ , max(freq), by = key]
and
dtable4G[ , .I[which.max(freq)] , by = key]
With both of the above commands, I get the same result:
key V1
====================
thanks for the 612
have a great 154
at the same 346
I want the result to be:
key freq value
================================
thanks for the 612 support
have a great 154 day
at the same 346 time
Any ideas what I am doing wrong?
EDITED
dtable4G[dtable4G[, .I[which.max(freq)], by = key]$V1]
worked for me, though it took some time to run through my 5.4 million rows.
But this was way faster than using
dtable4G[,.SD[which.max(freq)],by=key]
Reference: With data.table, is SD[which.max(Var1)] the fastest way to find the max of a group?
We can subset the data table for only the max freq of each key column value with the following:
dtable4G[,.SD[which.max(freq)],by=key]
For better performance you can use the approach below as well. It doesn't construct the .SD and is thus faster:
dtable4G[dtable4G[, .I[which.max(freq)], by = key]$V1]
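For reference, a small self-contained sketch of both approaches on a made-up table in the same shape as dtable4G:
library(data.table)
dtable4G <- data.table(key   = rep(c("thanks for the", "have a great"), times = c(3, 2)),
                       freq  = c(612L, 380L, 215L, 154L, 132L),
                       value = c("support", "drink", "payment", "day", "weekend"))
# Full row with the maximum freq per key, via .SD (simple, slower on big tables)
dtable4G[, .SD[which.max(freq)], by = key]
# Same result via .I (row indices), which avoids materialising .SD and is faster
dtable4G[dtable4G[, .I[which.max(freq)], by = key]$V1]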
I'm deciphering some hex codes that I've determined to be dates.
I've determined that:
50 C0 01 00 => 2014-05-21
52 C0 01 00 => 2014-05-23
59 C0 01 00 => 2014-05-30
The last byte of 00 seems to be superfluous padding.
I tried applying the packing scheme that MySQL uses for dates, but that doesn't appear to work here.
Do you guys have any insight on how these dates are being packed into binary/hexcode?
Just a wild guess, but maybe it's going to be useful.
If you interpret the numbers as a little-endian 32-bit integer, then:
0x0001c050 = 114768
in decimal. Also notice that a difference of one in the numbers corresponds to a difference of one day, so the unit is one day. Out of curiosity, I divided 114768 by 365.25 (the average number of days in a year). That's about 314.22, which is 314 years and 79 or 80 days.
If you count back 314 years and 80 days from 21/05/2014, then you get to 1 March, 1700. That's the fundamental date of the Gregorian system.
So I suppose this date format is just the number of days elapsed since the Gregorian epoch.
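A quick sanity check of that interpretation in R (this is an assumption, not a confirmed format: a little-endian 32-bit day count with 28 February 1700 as day 0, i.e. 1 March 1700 as day 1):
decode_date <- function(bytes) {
  # bytes: raw vector in file order, e.g. 50 C0 01 00
  days <- sum(as.integer(bytes) * 256^(seq_along(bytes) - 1))  # little-endian
  as.Date("1700-02-28") + days
}
decode_date(as.raw(c(0x50, 0xC0, 0x01, 0x00)))  # expected 2014-05-21
decode_date(as.raw(c(0x59, 0xC0, 0x01, 0x00)))  # expected 2014-05-30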
I have a dataframe "x" with 5.9 million rows and 4 columns: idnumber (integer), compdate (integer) and judge (character), representing individual cases completed in an administrative court. The data was imported from a Stata dataset and the date field came in as an integer, which is fine for my purposes. I want to create a caseload variable by calculating the number of cases completed by the same judge within the 30-day window ending on the completion date of the case at issue.
Here are the first 34 rows of data:
idnumber compdate judge
1 9615 JVC
2 15316 BAN
3 15887 WLA
4 11968 WFN
5 15001 CLR
6 13914 IEB
7 14760 HSD
8 11063 RJD
9 10948 PPL
10 16502 BAN
11 15391 WCP
12 14587 LRD
13 10672 RTG
14 11864 JCW
15 15071 GMR
16 15082 PAM
17 11697 DLK
18 10660 ADP
19 13284 ECC
20 13052 JWR
21 15987 MAK
22 10105 HEA
23 14298 CLR
24 18154 MMT
25 10392 HEA
26 10157 ERH
27 9188 RBR
28 12173 JCW
29 10234 PAR
30 10437 ADP
31 11347 RDW
32 14032 JTZ
33 11876 AMC
34 11470 AMC
Here's what I came up with: for each record I take a subset of the data for that particular judge, subset the cases decided in the 30-day window, and then assign the number of rows of the subsetted dataframe to the caseload variable for the subject case, as follows:
for(i in 1:length(x$idnumber)){
  e <- x$compdate[i]                                    # completion date of this case
  f <- e - 29                                           # start of the 30-day window
  a <- x[x$judge == x$judge[i] & !is.na(x$compdate), ]  # same judge, valid dates
  b <- a[a$compdate <= e & a$compdate >= f, ]           # cases inside the window
  x$caseload[i] <- length(b$idnumber)
}
It is working, but it is taking extremely long to complete. How can I optimize this or do it more simply? Sorry, I'm very new to R and to programming -- I'm a law professor trying to analyze court data. Your help is appreciated. Thanks.
Ken
You don't have to loop through every row. You can do operations on the entire column at once. First, create some data:
# Create some data.
n<-6e6 # cases
judges<-apply(combn(LETTERS,3),2,paste0,collapse='') # About 2600 judges
set.seed(1)
x<-data.frame(idnumber=1:n,judge=sample(judges,n,replace=TRUE),compdate=Sys.Date()+round(runif(n,1,120)))
Now, you can make a rolling window function, and run it on each judge.
# Sort
x<-x[order(x$judge,x$compdate),]
# Create a little rolling window function.
rolling.window<-function(y,window=30) seq_along(y) - findInterval(y-window,y)
# Run the little function on each judge.
x$workload<-unlist(by(x$compdate,x$judge,rolling.window))
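For intuition, here is the findInterval() trick on a tiny hand-made vector (hypothetical, sorted completion dates expressed as day numbers, not the question's data):
y <- c(1, 5, 20, 40, 45, 80)
# findInterval(y - 30, y) counts, for each case, how many earlier cases fall
# outside the 30-day window; subtracting that from the case's position leaves
# the number of cases within the window, including the case itself
seq_along(y) - findInterval(y - 30, y)
# e.g. the 5th case (day 45) counts the cases on days 20, 40 and 45, giving 3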
I don't have much experience with rolling calculations, but...
Calculate this per-day, not per-case (since it will be the same for cases on the same day).
Calculate a cumulative sum of the number of cases, and then take the difference of the current value of this sum and the value of the sum 31 days ago (or min{daysAgo:daysAgo>30} since cases are not resolved every day).
It's probably fastest to use a data.table. This is my attempt, using @nograpes' simulated data. Comments start with #.
require(data.table)
DT <- data.table(x)
DT[,compdate:=as.integer(compdate)]
setkey(DT,judge,compdate)
# count cases for each day
ldt <- DT[,.N,by='judge,compdate']
# cumulative sum of counts
ldt[,nrun:=cumsum(N),by=judge]
# see how far to look back
ldt[,lookbk:=sapply(1:.N,function(i){
z <- compdate[i]-compdate[i:1]
older <- which(z>30)
if (length(older)) min(older)-1L else as(NA,'integer')
}),by=judge]
# compute cumsum(today) - cumsum(more than 30 days ago)
ldt[,wload:=list(sapply(1:.N,function(i)
nrun[i]-ifelse(is.na(lookbk[i]),0,nrun[i-lookbk[i]])
))]
On my laptop, this takes under a minute. Run this command to see the output for one judge:
print(ldt['XYZ'],nrow=120)
I'm trying to build quite a complex loop in R.
I have a set of data stored as an object called p_int (p_int is peak intensity).
For this example the structure of p_int i.e. str(p_int) is:
num [1:1599]
The size of p_int can vary i.e. [1:688], [1:1200] etc.
What I'm trying to do with p_int is construct a loop to extract the monoisotopic peaks; these are peaks with certain characteristics, which will be extracted into a second object, mono_iso:
Search the first eight data values in p_int. Of these eight, find the value with the greatest score (this score also needs to be above 50).
Once this result has been found, record it into mono_iso.
The loop will then fix on the position where this result is located within the larger dataset. From this position it will skip the next result along the dataset before doing the same for the next set of 8 results.
So something similar to this:
16 Results: 100 120 90 66 220 90 70 30 70 100 54 85 310 200 33 41
** So, to begin with, the loop would take the first 8 results:
100 120 90 66 220 90 70 30
**It would then decide which peak is the greatest:
220
**It would determine whether 220 was greater than 50
IF YES: It would record 220 into "mono_iso"
IF NO: It would move on to the next set of 8 results
**220 is greater than 50... so records into mono_iso
The loop would then place its position at the 220, skip the "90", and begin the same thing again for the next set of 8 results, starting at the next data value in line, in this case the 70:
70 30 70 100 54 85 310 200
It would then record the "310" value (the highest value) and do the same thing again, and so on until the end of the dataset.
Hope this makes sense. If anyone could help me make such a loop work in R, I'd very much appreciate it.
Use this:
mono_iso <- aggregate(p_int, by=list(group=((seq_along(p_int)-1)%/%8)+1), function(x)ifelse(max(x)>50,max(x),NA))$x
This will put NA for groups such that max(...)<=50. If you want to filter those out, use this:
mono_iso <- mono_iso[!is.na(mono_iso)]
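As a quick check, here is the one-liner run on the 16 values from the question (note that it groups p_int into fixed, non-overlapping blocks of eight rather than re-anchoring after each peak):
p_int <- c(100, 120, 90, 66, 220, 90, 70, 30,
           70, 100, 54, 85, 310, 200, 33, 41)
mono_iso <- aggregate(p_int,
                      by = list(group = ((seq_along(p_int) - 1) %/% 8) + 1),
                      function(x) ifelse(max(x) > 50, max(x), NA))$x
mono_iso   # 220 310: the two block maxima, both above 50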