Let's take an example: we have one master table that contains telemetry data, with one record every 5 seconds and millions of rows in this table:
ID  DateTime             IngestionTime        X        Y   Z
1   2012-12-28T12:04:00  2012-12-28T12:04:00  12       11  10
2   2012-12-28T12:06:00  2012-12-28T12:06:00  2        9   7
3   2012-12-29T12:11:00  2012-12-29T12:11:00  2        9   7
1   2012-12-29T12:15:00  2012-12-29T12:15:00  (empty)  33  7
2   2012-12-29T12:24:00  2012-12-29T12:24:00  (empty)  9   72
3   2012-12-29T12:30:00  2012-12-29T12:30:00  44       40  54
4   2012-12-29T12:35:00  2012-12-29T12:35:00  11       9   92
I have a function demo(datetime:fromTime, datetime:toTime). With it I am querying from fromTime 2012-12-29T12:11:00 to a toTime on the same 29th of December. If there are any empty values in the result, I need to fill them with the last known value of the respective column from a previous record. The requirement is to fill this last known value in a very optimized way, since we are dealing with millions of rows. The expected result is:
ID  DateTime             IngestionTime        X                         Y   Z
1   2012-12-28T12:04:00  2012-12-28T12:04:00  12                        11  10
2   2012-12-28T12:06:00  2012-12-28T12:06:00  8                         9   7
3   2012-12-29T12:11:00  2012-12-29T12:11:00  2                         9   7
1   2012-12-29T12:15:00  2012-12-29T12:15:00  lastKnownValueForThisID?  33  7
2   2012-12-29T12:24:00  2012-12-29T12:24:00  lastKnownValueForThisID   9   72
3   2012-12-29T12:30:00  2012-12-29T12:30:00  44                        40  54
4   2012-12-29T12:35:00  2012-12-29T12:35:00  11                        9   92
I am not sure that I fully understand the question; however, for filling missing values in a time series there are a few useful built-in functions, for example series_fill_backward(), series_fill_const(), and more.
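As an illustration of the last-known-value (last-observation-carried-forward) logic itself, independent of Kusto, here is a minimal R sketch; it assumes a data frame df with the columns above and NA for the empty cells:

library(dplyr)
library(tidyr)

# carry the last known X/Y/Z values forward within each ID, ordered by time
df %>%
  arrange(ID, DateTime) %>%
  group_by(ID) %>%
  fill(X, Y, Z, .direction = "down") %>%
  ungroup()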
I have a dataset that includes individual events across a time period. Some example records are below; each individual has 2-4 records observed within the period. The event # is ordered by time; however, the same event # did not occur on the same date (A's #1 event occurs on 6/1, while C's #1 event happens on 6/3). Should I analyze the data as unbalanced panel data with two dimensions, individual and event # (i.e., the time dimension)? If not, how should I treat this data? Thanks.
obs  ind  event#  date  var1  y
1    A    1       6/1   11    33
2    A    2       6/4   12    23
3    A    3       6/5   13    32
4    A    4       6/5   14    55
5    B    1       6/1   15    44
6    B    2       6/2   18    54
7    C    1       6/3   15    22
8    C    2       6/3   29    55
9    C    3       6/6   31    23
10   D    1       6/3   13    45
11   D    2       6/5   2     12
I am trying to create pen-pal pairs in R. The problem is that I can't figure out how to loop it so that once I pair one person, that person and their pair are eliminated from the pool, and the loop continues until everyone has a pair.
I have already rated the criteria to pair them and found a score for every person for how well they would pair with each other person. I then added every pair's two scores together to get a sense of how good the pair is overall (not perfect, but good enough for these purposes). I then found each person's ideal match and ordered these matches from most picky person to least picky person (basically the person with the lowest best-pair score to the highest best-pair score). I also found each person's 2nd-8th best matches (there will probably be about 300 people in the data).
A test of the best-matches is below:
indexed_fake apply.fin_fake..1..max. X1 X2 X3 X4 X5 X6 X7 X8
14 14 151 3 9 8 4 10 12 2 6
4 4 177 9 5 8 7 11 3 10 12
9 9 177 4 11 3 6 10 7 12 5
5 5 179 7 4 11 3 12 10 8 5
10 10 179 12 10 2 9 3 5 6 4
13 13 182 8 1 12 11 10 5 3 2
1 1 185 7 1 3 8 6 13 2 11
7 7 185 1 12 5 7 4 6 9 11
3 3 187 12 3 8 5 9 1 2 10
8 8 190 8 12 13 3 4 11 1 6
2 2 191 6 12 11 10 3 4 5 1
6 6 191 2 11 7 1 6 9 10 8
11 11 193 12 6 9 5 2 8 11 4
12 12 193 11 3 8 7 12 10 2 5
Columns X1-X8 are the 8 best matches for the people listed in the first column. With this example, every person would ideally get paired with someone in their top 8, maximizing overall pair compatibility as another user mentioned. Every person would get exactly one pair.
Any help is appreciated!
This is not a specific answer, but it's easier to write in this space. You have a classic assignment optimization problem. These problems can be solved using packages in R. You have to assign preference weights to your feasible pairings. So, for example, 14-3 could be assigned 8 points; 14-9, 7 points; 14-8, 6 points; ...; 14-6, 1 point. Note that 3-14 would be assigned no points because, while 14 likes 3, 3 does not like 14. The preference score for any x-y, y-x pairing could be the weight for the x-y preference plus the weight of the y-x preference.
The optimization model would then choose the weighted pairs to maximize the total satisfaction across all of the pairings.
With 300 people, I can't think of an alternative algorithm that could be implemented simply.
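To make this concrete, here is a rough sketch (not a definitive implementation) of the pairing as a binary optimization with the lpSolve package. It assumes a symmetric score matrix S in which S[i, j] is the combined preference weight of persons i and j, and an even number of people n:

library(lpSolve)

n <- nrow(S)                               # S: assumed symmetric pair-score matrix

# one binary decision variable per unordered pair {i, j}
pairs <- t(combn(n, 2))
obj   <- S[pairs]                          # score contributed by each candidate pair

# constraint: every person appears in exactly one chosen pair
con <- matrix(0, nrow = n, ncol = nrow(pairs))
for (k in seq_len(nrow(pairs))) {
  con[pairs[k, 1], k] <- 1
  con[pairs[k, 2], k] <- 1
}

sol     <- lp("max", obj, con, rep("=", n), rep(1, n), all.bin = TRUE)
matched <- pairs[sol$solution > 0.5, ]     # each row is one chosen pen-pal pair

With 300 people this creates choose(300, 2) = 44,850 binary variables, which may be slow to solve; restricting the candidate pairs to each person's top 8 matches before building the model would shrink the problem considerably.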
I have a dataset with the reports from a local shop, where each line has a client's ID, date of purchase and total value per purchase.
I want to create a new plot where for each client ID I have all the purchases in the last month or even just sample purchases in a range of dates I choose.
The main problem is that certain customers might buy once a month, while others can come daily - so the number of observations per period of time can vary.
I have tried subsetting my dataset to a specific range of time, but either I choose a specific date - and then I only get a small % of all customers, or I choose a range and get multiple observations for certain customers.
(In this case - I wouldn't mind getting the earliest observation)
An important note: I know how to write a for loop to solve this problem, but since the dataset has over 4 million observations, that isn't practical because it would take an extremely long time to run.
A basic example of what the dataset looks like:
ID Date Sum
1 1 1 234
2 1 2 45
3 1 3 1
4 2 4 223
5 3 5 546
6 4 6 12
7 2 1 20
8 4 3 30
9 6 2 3
10 3 5 45
11 7 6 456
12 3 7 65
13 8 8 234
14 1 9 45
15 3 2 1
16 4 3 223
17 6 6 546
18 3 4 12
19 8 7 20
20 9 5 30
21 11 6 3
22 12 6 45
23 14 9 456
24 15 10 65
....
And the new data set would look something like this:
ID 1Date 1Sum 2Date 2Sum 3Date 3Sum
1 1 234 2 45 3 1
2 1 20 4 223 NA NA
3 2 1 5 546 5 45
...
Thanks for your help!
I think you can do this with a bit of help from dplyr and tidyr:
library(dplyr)
library(tidyr)
dd %>% group_by(ID) %>% mutate(seq = 1:n()) %>%
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum))
Where dd is your sample data frame above.
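If you also want to restrict the purchases to a chosen date window and keep the earliest ones first, a sketch of the same idea (the cut-off values below are just placeholders) would be:

dd %>%
  filter(Date >= 1, Date <= 6) %>%   # hypothetical date range
  arrange(ID, Date) %>%              # earliest purchase becomes column 1
  group_by(ID) %>%
  mutate(seq = 1:n()) %>%
  pivot_wider(id_cols = ID, names_from = seq, values_from = c(Date, Sum))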
I have a table with a column "Age" that has values from 1 to 10, and a column "Population" that has a value specified for each of the "Age" values. I want to generate a cumulative sum of the population such that each resulting value covers an age and above: at least 1 and above, 2 and above, and so on. In other words, the resulting array should be (203, 180, ... and so on). Any help would be appreciated!
Age Population Withdrawn
1 23 3
2 12 2
3 32 2
4 33 3
5 15 4
6 10 1
7 19 2
8 18 3
9 19 1
10 22 5
You can use cumsum() and rev(): reversing Population makes the cumulative sum run from the oldest age down to the youngest, and reversing the result back aligns each total with its age, so each row holds the population at that age and above:
df$sum_above <- rev(cumsum(rev(df$Population)))
The result:
> df
Age Population sum_above
1 1 23 203
2 2 12 180
3 3 32 168
4 4 33 136
5 5 15 103
6 6 10 88
7 7 19 78
8 8 18 59
9 9 19 41
10 10 22 22
I used igraph package to detect communities. When I used membership(community) function, the result is:
1 2 3 4 5 6 7 13 17 18 19 20 22 23 24 25
12 9 1 10 12 6 12 16 1 11 6 6 3 13 16 1
29 30 31 33 34 37 38 39 40 41 42 43 44 45 46 47
9 5 11 14 13 6 13 11 12 13 1 16 11 6 12 7
...
The first line is node ID and the second line is its corresponding community ID.
Suppose the name of the above result is X. I used Y=data.frame(X). The result is:
community
1 12
2 9
3 1
4 10
5 12
6 6
7 12
13 16
...
I want to index by the first column (1, 2, 3, ...), so that, for instance, Y[13,] = 16; but in this case it is Y[8,] = 16. How can I do this?
This question may be very simple, but I do not know how to google it. Thanks.
The data.frame() / as.data.frame() functions convert a named vector to a data frame in which the names of the vector elements become the row names.
In other words, what looks like the first column is actually the row names, and you can access them with a construct like rownames(Y)[8].
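A small sketch of both options, assuming X is the named vector returned by membership(community):

Y <- data.frame(X)

# the node IDs live in the row names, so index with a character string
Y["13", ]                     # community of node 13 (here 16), not the 13th row

# alternatively, keep the node ID as an explicit column
Y2 <- data.frame(node = as.integer(names(X)), community = as.vector(X))
Y2$community[Y2$node == 13]   # also 16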