I would like to ask about extracting data from a given dataset (I guess it is similar to data decomposition).
The goal is to decompose a dataset to extract its features.
For example: extract the individual components of the volume of a rectangular prism without knowledge of the individual features (length, width, height).
Please recommend best practices for carrying out this operation. Also, can you suggest any book or article that explains the process in detail?
Update
The example code for the analysis:
using DataFrames
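mutable struct rect
    length
    width   # named width (not breadth) to match the DataFrame columns below
    height
end
# Bounded values keep the element-wise product from overflowing Int64
r = rect(rand(1:100, 40), rand(1:100, 40), rand(1:100, 40))
volume(rect) = rect.length .* rect.width .* rect.height
volume_val = volume(r)
df = DataFrame(:length => r.length, :width => r.width, :height => r.height, :volume => volume_val)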
# For this df dataframe, I would like to extract length, width and height from volume without the use of volume equation
Thanks in advance!
Do you mean finding which values of length, width, and height produce a given volume?
using DataFrames
mutable struct rect
length
width
height
end
r = rect(rand(1:100, 10), rand(1:100, 10), rand(1:100, 10))
volume(rect) = rect.length .* rect.width .* rect.height
volume_val = volume(r)
julia> df = DataFrame(:length => r.length, :width=> r.width, :height=> r.height, :volume => volume_val)
10×4 DataFrame
 Row │ length  width  height  volume
     │ Int64   Int64  Int64   Int64
─────┼───────────────────────────────
   1 │     41     82      58  194996
   2 │     41     57      92  215004
   3 │     88     42      63  232848
   4 │     32     98      12   37632
   5 │     26     65      14   23660
   6 │     94     26      40   97760
   7 │     14     72      65   65520
   8 │     51     72      79  290088
   9 │     36     50      26   46800
  10 │     63     22      94  130284
julia> df[df.volume .== 46800,:]
1×4 DataFrame
 Row │ length  width  height  volume
     │ Int64   Int64  Int64   Int64
─────┼───────────────────────────────
   1 │     36     50      26   46800
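If the question is instead read as "recover the sides from the volume alone", it may help to see why that is ambiguous. The following is a minimal Python sketch, not part of the answer above; the function name factor_triples and the bound max_side are hypothetical, chosen only for illustration. It assumes integer sides and lists every triple whose product equals a given volume.

```python
def factor_triples(volume, max_side):
    """Return all (l, w, h) with l <= w <= h <= max_side and l * w * h == volume."""
    triples = []
    for l in range(1, max_side + 1):
        if volume % l:
            continue  # l does not divide the volume
        rest = volume // l
        for w in range(l, max_side + 1):
            if rest % w:
                continue  # w does not divide what is left
            h = rest // w
            if w <= h <= max_side:
                triples.append((l, w, h))
    return triples


# Volume taken from row 9 of the table above; sides assumed to lie in 1:100
candidates = factor_triples(46800, 100)
print(len(candidates), "candidate triples, including", (26, 36, 50))
```

Several triples share the same volume, so the volume column on its own does not identify a unique (length, width, height) without further constraints.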
Using the prev() function I can access previous rows individually.
mytable
| sort by Time asc
| extend mx = max_of(prev(Value, 1), prev(Value, 2), prev(Value, 3))
How can I define a window to aggregate over in a more generic way? Say I need the maximum of the 100 values in the previous rows. How do I write a query that does not require repeating prev() 100 times?
This can be achieved by combining scan and series_stats_dynamic().
scan is used to create an array of last x values, per record.
series_stats_dynamic() is used to get the max value of each array.
// Data sample generation. Not part of the solution
let mytable = materialize(range i from 1 to 15 step 1 | extend Time = ago(1d*rand()), Value = toint(rand(100)));
// Solution starts here
let window_size = 3; // >1
mytable
| order by Time asc
| scan declare (last_x_vals:dynamic)
with
(
step s1 : true => last_x_vals = array_concat(array_slice(s1.last_x_vals, -window_size + 1, -1), pack_array(Value));
)
| extend toint(series_stats_dynamic(last_x_vals).max)
i   Time                          Value  last_x_vals  max
5   2022-06-10T11:25:49.9321294Z  45     [45]         45
14  2022-06-10T11:54:13.3729674Z  82     [45,82]      82
2   2022-06-10T13:25:40.9832745Z  44     [45,82,44]   82
1   2022-06-10T17:38:28.3230397Z  24     [82,44,24]   82
7   2022-06-10T18:29:33.926463Z   17     [44,24,17]   44
15  2022-06-10T19:54:33.8253844Z  9      [24,17,9]    24
3   2022-06-10T20:17:46.1347592Z  43     [17,9,43]    43
12  2022-06-11T00:02:55.5315197Z  94     [9,43,94]    94
9   2022-06-11T00:11:18.5924511Z  61     [43,94,61]   94
11  2022-06-11T00:39:40.6858444Z  38     [94,61,38]   94
4   2022-06-11T03:54:59.418534Z   84     [61,38,84]   84
10  2022-06-11T05:55:38.2904242Z  6      [38,84,6]    84
6   2022-06-11T07:25:43.3977923Z  36     [84,6,36]    84
13  2022-06-11T09:36:08.7904844Z  28     [6,36,28]    36
8   2022-06-11T09:51:45.2225391Z  73     [36,28,73]   73
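To make the windowing logic easier to follow outside Kusto, here is a minimal Python sketch of the same idea: keep the last window_size values per record and take their maximum. It is an illustration only, not a replacement for the query; the function name rolling_max is hypothetical, and the input is the Value column from the table above, already ordered by Time.

```python
from collections import deque


def rolling_max(values, window_size):
    """Maximum over the current value and up to window_size - 1 preceding values."""
    window = deque(maxlen=window_size)  # oldest value drops out automatically
    result = []
    for v in values:
        window.append(v)
        result.append(max(window))
    return result


# Value column from the sample output, in Time order
values = [45, 82, 44, 24, 17, 9, 43, 94, 61, 38, 84, 6, 36, 28, 73]
maxima = rolling_max(values, 3)
print(maxima)
```

The deque plays the role of last_x_vals in the scan step, and max(window) corresponds to series_stats_dynamic(last_x_vals).max.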
In Julia, the dot syntax f.(x) applies a function element-wise.
Running f.(x) means f(x[1]), f(x[2]), ... are evaluated in turn.
However, suppose I have a function which takes two arguments, say g(x,y).
I want g(x[1],y[1]), g(x[2],y[1]), g(x[3],y[1]), ..., g(x[1],y[2]), g(x[2],y[2]), g(x[3],y[2]), ...
Is there any way to evaluate all combinations of x and y?
Matt's answer is good, but I'd like to provide an alternative using an array comprehension:
julia> x = 1:5
y = 10:10:50
[i + j for i in x, j in y]
5×5 Array{Int64,2}:
11 21 31 41 51
12 22 32 42 52
13 23 33 43 53
14 24 34 44 54
15 25 35 45 55
In my opinion the array comprehension can often be more readable and more flexible than broadcast and reshape.
Yes, reshape y such that it is orthogonal to x. The . vectorization uses broadcast to do its work. I imagine this as "extruding" singleton dimensions across all the other dimensions.
That means that for vectors x and y, you can evaluate a function over every combination of x and y simply by reshaping one of them:
julia> x = 1:5
y = 10:10:50
(+).(x, reshape(y, 1, length(y)))
5×5 Array{Int64,2}:
11 21 31 41 51
12 22 32 42 52
13 23 33 43 53
14 24 34 44 54
15 25 35 45 55
Note that the shape of the array matches the orientation of the arguments; x spans the rows and y spans the columns since it was transposed to a single-row matrix.
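For readers more familiar with other languages, the same "all combinations" grid can be sketched in plain Python with a nested comprehension. This is only an illustration of the idea; the helper name outer is hypothetical, and the result is a list of rows rather than a Julia matrix.

```python
def outer(g, xs, ys):
    """Evaluate g(a, b) for every a in xs (rows) and every b in ys (columns)."""
    return [[g(a, b) for b in ys] for a in xs]


x = range(1, 6)        # 1, 2, 3, 4, 5
y = range(10, 60, 10)  # 10, 20, 30, 40, 50
grid = outer(lambda a, b: a + b, x, y)

for row in grid:
    print(row)
```

As in the Julia output, x varies down the rows and y across the columns.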
I have two vectors of uneven lengths.
> starts
[1] 1 4 7 11 13 15 18 20 37 41 53 61
> ends
[1] 3 6 10 17 19 35 52 60 63
Each corresponding pair in starts and ends is supposed to form a boundary, e.g. (1, 3) for the first, (4, 6) for the second, etc. However, you will notice that starts has 12 elements and ends has just 9. Because of some anomaly, there may be consecutive starts, e.g. the 4th to 6th elements of starts (11, 13, 15) are all smaller than the 4th element of ends (17).
Edit: please note also that the next start is not always 1 higher than the preceding end; the sample above was edited to reflect this, i.e. after the end 35, the next start is 37.
My question is, how do I find all these extraneous unpaired starts? My aim is to lengthen ends to the same length as starts, and pair every extraneous start with a corresponding NA in ends. The actual vector lengths are in the thousands, with mismatches in the hundreds. I can imagine a nested for loop to address this, but am wondering if there is a more efficient solution.
Edit: the expected result would be (starts unchanged, displayed for comparison):
> starts
[1] 1 4 7 11 13 15 18 20 37 41 53 61
> ends
[1] 3 6 10 NA NA 17 19 35 NA 52 60 63
or equivalent, not particular about format.
> starts = c(1, 4, 7, 11, 15, 19, 23, 27)
> ends = c(3, 5, 14, 22, 25)
> e = ends[findInterval(starts, ends)+1]
> e
[1] 3 5 14 14 22 22 25 NA
> e[duplicated(e, fromLast=T)]=NA
> e
[1] 3 5 NA 14 NA 22 25 NA
findInterval seems to work
Assuming both starts and ends are sorted and that it's only in ends where the values are missing, you might be able to do something as straightforward as:
ends[c(match(starts, ends + 1)[-1], length(ends))]
# [1] 3 6 10 NA 17 19 36 52 60 63
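The pairing rule behind the findInterval approach can also be sketched in Python: a start keeps an end only if the first end at or after it comes before the next start. This is a sketch under the assumption that both vectors are sorted; the function name pair_ends is hypothetical, and None stands in for NA.

```python
from bisect import bisect_left


def pair_ends(starts, ends):
    """For each start, return its matching end, or None if the start is unpaired."""
    paired = []
    for i, s in enumerate(starts):
        j = bisect_left(ends, s)  # index of the first end >= this start
        next_start = starts[i + 1] if i + 1 < len(starts) else float("inf")
        if j < len(ends) and ends[j] < next_start:
            paired.append(ends[j])
        else:
            paired.append(None)  # another start comes before any end
    return paired


starts = [1, 4, 7, 11, 13, 15, 18, 20, 37, 41, 53, 61]
ends = [3, 6, 10, 17, 19, 35, 52, 60, 63]
print(pair_ends(starts, ends))
```

Each lookup is a binary search, so the whole pass avoids the nested loop mentioned in the question.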
I am trying to create a new variable from V265 and V267.
V267 should be padded with a leading 0 when it has two digits, but not when it has three digits, and any leading 0 of V265 should be dropped (as in row 5). The adjusted V265 and V267 are then concatenated into the new variable.
I tried a bunch of code and googled, but I couldn't make it work. Thanks in advance!
V265 V267 New
1 26 55 -> 26055
2 36 61 -> 36061
3 36 71 -> 36071
4 47 125 -> 47125
5 06 37 -> 6037
6 42 81 -> 42081
df$New <- 1000*df$V265+df$V267
More general, but more typing, than @josilber's answer:
as.numeric(paste0(df$V265,
formatC(df$V267, format = "d", width = 3, flag = "0")))
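For comparison, the same zero-padding idea can be sketched in Python. This is only an illustration; the function name combine is hypothetical, and it assumes V267 never has more than three digits, in which case it agrees with the 1000*V265 + V267 formula.

```python
def combine(v265, v267):
    """Drop leading zeros from v265, pad v267 to three digits, and join them."""
    return int(f"{int(v265)}{int(v267):03d}")


# Rows from the question; "06" is passed as text to show the leading zero being dropped
rows = [(26, 55), (36, 61), (36, 71), (47, 125), ("06", 37), (42, 81)]
new_values = [combine(a, b) for a, b in rows]
print(new_values)
```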
I have a question about adding array elements. I have this code below:
B[row][col] = B[row+1][col+1] + B[row][col+1];
Let's say row = 2 and col = 3. I don't quite understand what happens here. There is the assignment (=), so I'm guessing it assigns whatever is on the right, but I don't know how to evaluate it. In this example the right side comes to 13 for me, but that doesn't make sense. Would it assign the value 13 to B[row][col]? The tracing program showed 2. I don't understand, please help!
I'm not entirely sure what it is you're asking but essentially you have a 2D array and the B[row][col] syntax is to access a specific "cell" within the 2D array. Think of it like a grid. So what you're doing with the assignment operator is taking the values in cells B[row+1][col+1] and B[row][col+1], adding them together, and assigning that resulting value to the cell B[row][col]. Does that make sense? Also it'll be good to make sure you don't get any index out of bounds exceptions doing this.
This does somewhat depend on the tool/language you are using. For instance, MATLAB starts indexing arrays at 1, so the first element of an array a is a[1], while languages like C/Java start indexing at 0, so the first element of an array a is a[0].
Let's assume that indexing is done like in C/Java, then consider a multidimensional array B
12 13 14 11
41 17 23 22
18 10 20 38
81 17 32 61
Then with row = 2 and col = 3 you will have that B[row][col] as the element that sits on the third row (remembering indexing starts at 0, so B[2] is the third row) and fourth column, marked here between * signs.
12 13 14 11
41 17 23 22
18 10 20 *38*
81 17 32 61
As for changing a value in the multidimensional array, it is done by assigning a new value to the index of the old value.
B[row][col] = B[row+1][col+1] + B[row][col+1];
With row=1 and col=0 we get
B[1][0] = B[2][1] + B[1][1];
B[1][0] = 10 + 17;
B[1][0] = 27;
Or:
 12  13  14  11           12  13  14  11
(41) 17  23  22          (27) 17  23  22
 18  10  20  38    ==>    18  10  20  38
 81  17  32  61           81  17  32  61
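The trace above can be checked with a short Python sketch, where a list of lists stands in for the 2D array and indexing is 0-based as in C/Java. The values are the ones from the example grid; nothing here is specific to the asker's program.

```python
# The example grid, stored row by row
B = [
    [12, 13, 14, 11],
    [41, 17, 23, 22],
    [18, 10, 20, 38],
    [81, 17, 32, 61],
]

row, col = 1, 0
# The right-hand side is evaluated first: B[2][1] + B[1][1] = 10 + 17
B[row][col] = B[row + 1][col + 1] + B[row][col + 1]

print(B[row][col])
```

Note that with row = 2 and col = 3 on this 4x4 grid, col + 1 would be 4, which is out of bounds, so the same statement would raise an index error there.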