Aggregate per partition and merge results across all partitions in Cosmos DB - azure-cosmosdb

I have sample data like the following:
partition_key1
idA = 1
idB = 1
value = 1
partition_key1
idA = 1
idB = 1
value = 2
partition_key1
idA = 1
idB = 2
value = 3
partition_key2
idA = 1
idB = 2
value = 2
partition_key2
idA = 1
idB = 2
value = 5
I want to find the max value for each (partition_key, idA, idB) on a per-partition basis and then merge the results.
Sample result from the sample data:
partition_key1
idA = 1
idB = 1
value = 2
partition_key1
idA = 1
idB = 2
value = 3
partition_key2
idA = 1
idB = 2
value = 5
I am able to get the above result by fixing the partition key; now I want to merge across all partition keys. Is that possible?
SELECT c.partition_key, c.idA, c.idB, MAX(c.value)
FROM c
WHERE c.partition_key = <FIXED_PARTITION_KEY>
GROUP BY c.partition_key, c.idA, c.idB
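The Cosmos DB SQL (Core) API supports cross-partition GROUP BY queries, so one option, assuming your SDK version supports it and cross-partition queries are enabled, is simply to run the same query without the partition-key filter:

```sql
SELECT c.partition_key, c.idA, c.idB, MAX(c.value) AS maxValue
FROM c
GROUP BY c.partition_key, c.idA, c.idB
```

The query fans out to each physical partition and the engine merges the per-partition aggregates for you. Be aware that GROUP BY cannot be combined with ORDER BY, and in some SDK versions GROUP BY queries do not support continuation tokens, so very large result sets may need client-side aggregation instead.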

Related

How to PartitionBy (SQL) to rank on RStudio

So I have something like this:
data.frame(content = c("a","a","b","b","c","c"),
eje = c("politics","sports","education","sports","health","politics"),
value = c(3,2,1,2,1,1))
And I'd like to group by content and keep the rows in eje that have the highest value in value, keeping both rows when there is a tie.
So on the sample data I'd end up with:
data.frame(content = c("a","b","c","c"),
eje = c("politics","sports","health","politics"),
value = c(3,2,1,1))
In SQL I'd do something like RANK() OVER (PARTITION BY content ORDER BY value DESC) and then filter the rows where the created RANK column equals 1.
d = data.frame(content = c("a","a","b","b","c","c"),
eje = c("politics","sports","education","sports","health","politics"),
value = c(3,2,1,2,1,1))
library(dplyr)
d %>%
group_by(content) %>%
slice_max(value)
# # A tibble: 4 × 3
# # Groups: content [3]
# content eje value
# <chr> <chr> <dbl>
# 1 a politics 3
# 2 b sports 2
# 3 c health 1
# 4 c politics 1
data.table option:
library(data.table)
dt <- data.table(d)
dt[dt[, .I[value == max(value)], by=content]$V1]
Output:
content eje value
1: a politics 3
2: b sports 2
3: c health 1
4: c politics 1

Restructure data.frame

I have survey data structured as follows:
df <- data.frame(userid = c(1, 2, 3),
pos1 = c("itemA_1", "itemB_1", "itemA_2"),
pos2 = c("itemB_1", "itemC_2", "itemC_1"),
pos3 = c("itemC_5", "itemA_4", "itemB_3")
)
df
> df
  userid    pos1    pos2    pos3
1      1 itemA_1 itemB_1 itemC_5
2      2 itemB_1 itemC_2 itemA_4
3      3 itemA_2 itemC_1 itemB_3
In the survey several items (itemA, itemB, itemC, ...) were rated on a five-point Likert scale ranging from 1 to 5. The order in which the items were answered was also saved.
For example, in the above data.frame user 1 rated itemA first and the rating was 1. Then he rated itemB and the rating was 1. Finally he rated itemC and the rating was 5.
User 2 started with itemB and the rating was 1, etc.
Obviously, that structure is not very useful to analyse the data. So I'd rather have it in a form like this:
userid itemA itemB itemC ...
1 1 1 5
2 4 1 2
3 2 3 1
But how can I get there? Thanks for the help!
Get the data in long format, separate the rating value from 'item' and get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = starts_with('pos'), names_to = NULL) %>%
separate(value, c('item', 'rating'), sep = '_', convert = TRUE) %>%
pivot_wider(names_from = item, values_from = rating)
# userid itemA itemB itemC
# <dbl> <int> <int> <int>
#1 1 1 1 5
#2 2 4 1 2
#3 3 2 3 1

Create new rows and put a flag to differentiate them from existing rows

I have a dataset like this:
df_have <- data.frame(id = rep("a",3), time = c(1,3,5), flag = c(0,1,1))
The data has one row per time per id but I need to have the second row duplicated and put into the data.frame like this:
df_want <- data.frame(id = rep("a",4), time = c(1,3,3,5), flag = c(0,0,1,1))
The flag variable should be 0 in the newly added row, with all other information staying the same. Any help would be appreciated.
Edit:
The comments below are helpful, but I would also need to do this in groups by id, and some ids have more rows than others. After reading this and seeing the comments below, I see the logic isn't clear. My original data does not have a counter variable (what I call flag), but the final output needs one. Every row except the first and last time point (within each id) should be duplicated, and each time a row is duplicated a counter should increment, so it shows when a row was created until the next new row is created.
df_have2 <- data.frame(id = c(rep("a",3),rep("b",4)) ,
time = c(1,3,5,1,3,5,7))
df_want2 <- data.frame(id = c(rep("a",4),rep("b",6)),
time = c(1,3,3,5,1,3,3,5,5,7),
flag = c(1,1,2,2,1,1,2,2,3,3))
We could expand the data with slice and then create the 'flag' column by matching 'time' with the unique values of 'time' and taking the lag of it:
library(dplyr)
df_have2 %>%
group_by(id) %>%
slice(rep(row_number(), c(1, rep(2, n() - 2), 1))) %>%
mutate(flag = lag(match(time, unique(time)), default = 1)) %>%
ungroup
# A tibble: 10 x 3
# id time flag
# <chr> <dbl> <dbl>
# 1 a 1 1
# 2 a 3 1
# 3 a 3 2
# 4 a 5 2
# 5 b 1 1
# 6 b 3 1
# 7 b 3 2
# 8 b 5 2
# 9 b 5 3
#10 b 7 3

Creating new rows for each group using values from first row of group

I need to create a new row for each group of a grouped tibble based on values from the first row of each group.
I am trying to use do(add_row()) to create the new row and use top_n to access the value from the first row of each group.
df = tibble(ID = rep(1:2, each = 2), x = rep(1:2, each = 2), y = seq(1:4))
gb_df <- group_by(df, ID, x)
new_df <- gb_df %>%
  do(add_row(., ID = top_n(., 1, wt = y)[, "ID"], x = 0,
             y = top_n(., 1, wt = y)[, "y"] - 1, .before = 0))
However I get the following error message.
Error: Columns `ID`, `y` must be 1d atomic vectors or lists
I want the following output.
> new_df
# A tibble: 6 x 3
# Groups: ID, x [4]
     ID     x     y
  <dbl> <dbl> <dbl>
1     1     0     0
2     1     1     1
3     1     1     2
4     2     0     2
5     2     2     3
6     2     2     4
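The error most likely comes from subsetting the tibble with [, "ID"], which returns a one-column tibble rather than an atomic vector ([["ID"]] or pull() would give a plain vector). A sketch of one way to sidestep do() altogether, assuming dplyr >= 1.0: build the new first rows with summarise() and bind them back on.

```r
library(dplyr)

df <- tibble(ID = rep(1:2, each = 2), x = rep(1:2, each = 2), y = seq(1:4))

# One new row per ID: x is fixed at 0 and y is the group's first y minus 1
new_rows <- df %>%
  group_by(ID) %>%
  summarise(x = 0, y = first(y) - 1, .groups = "drop")

# Bind the new rows back in and sort so each one comes first in its group
new_df <- bind_rows(new_rows, df) %>% arrange(ID, y)
```

Note that arrange(ID, y) relies on the new row's y being smaller than the group's existing y values; if that does not hold for your real data, an explicit ordering column would be safer.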

Compare rows and replace value if there is a difference

First of all: Happy New Year :)
I'm struggling with a loop so I'm now seeking your help.
Below is a short dummy:
df <- data.frame(name = c("a","a","b","b","c","d"), type = c(1,1,2,2,3,4), area = c("a","b","a","a","b","b"), length = c(10), power = c(10, 100))
I'd like to compare each unique combination of name, type and area, and see whether length and power vary or not. If they do not, I want to keep their value; if they do, I want to replace their value with 'Unknown'.
In the example above, there would thus only be a replacement for name = b: length would remain '10' but power would become 'Unknown'. The resulting dataframe would therefore have only five rows.
That seems like a rather simple loop to come up with, but I haven't succeeded so far... do you have any idea?
Cheers,
Fred
I think you don't need a for loop but can use duplicated.
First look up the rows that have the same name, type, area and length but do not have the same power value, and replace one of the power values with 'Unknown':
df[which(duplicated(df[1:4]) & !duplicated(df[1:5])), 'power'] <- 'Unknown'
Next create a new dataframe that discards the other row:
df2 <- df[which(!duplicated(df[1:4], fromLast = TRUE)), ]
Output:
> df2
name type area length power
1 a 1 a 10 10
2 a 1 b 10 100
4 b 2 a 10 Unknown
5 c 3 b 10 10
6 d 4 b 10 100
EDIT: Following additional requests from the OP, here's a dplyr solution that works for more general cases.
# New dataframe; containing multiple duplicates
df3 <- data.frame(name = c("a","a","b","b","b","c","d"),
type = c(1,1,2,2,2,3,4), area = c("a","b","a","a","a","b","b"),
length = rep(10,7),
power = c(10, 100, 10, 100,100,10,100))
library(dplyr)
df3 %>%
group_by(name, type, area) %>%
mutate(length = ifelse(n() > 1 && var(length) != 0, "Unknown", paste0(length)),
power = ifelse(n() > 1 && var(power) != 0, "Unknown", paste0(power)))
The code first groups by name, type and area. Next it checks whether the group has more than one row; if so, it checks whether the values vary, and if both conditions hold it replaces all of the group's values with "Unknown".
Output:
# A tibble: 7 x 5
# Groups: name, type, area [5]
name type area length power
<fct> <dbl> <fct> <chr> <chr>
1 a 1 a 10 10
2 a 1 b 10 100
3 b 2 a 10 Unknown
4 b 2 a 10 Unknown
5 b 2 a 10 Unknown
6 c 3 b 10 10
7 d 4 b 10 100
With dplyr you can do:
df %>%
group_by(name, type, area) %>%
mutate(length = ifelse(length != first(length), "Unknown", paste0(length)),
power = ifelse(power != first(power), "Unknown", paste0(power)))
name type area length power
<fct> <dbl> <fct> <chr> <chr>
1 a 1. a 10 10
2 a 1. b 10 100
3 b 2. a 10 10
4 b 2. a 10 Unknown
5 c 3. b 10 10
6 d 4. b 10 100
It checks whether the values are the same as for the first row for a given combination of "name", "type" and "area". If not, it fills the rows with "Unknown".
