Calculate difference in percent grouped on Country and Date - azure-data-explorer

I can already calculate the percentage difference grouped by date:
let t = (datatable(Value:int, Date:datetime, Сountry:string)
[1000, '2018-01-01', "USA", // 1 Jan == 100%
2000, '2018-01-01', "Poland", // this 100% also because grouped by Date
3000, '2018-01-02', "USA", // 2 Jan == 233% because compared with 1 Jan
4000, '2018-01-02', "Poland", // 233% also
]);
let val_2018 = toscalar(t | where Date >= datetime(2018-01-01) and Date < datetime(2018-01-02) | summarize sum(Value));
t
| summarize percent = 100.0 * sum(Value) / val_2018 by Date
So I need to get 4 records like these:
1 Jan, USA, 100%
1 Jan, Poland, 100%
2 Jan, USA, 300%
2 Jan, Poland, 200%
But how do I add grouping by country as well? As I understand it, I need to add "by Сountry" to every "summarize", but toscalar can't have "by" after "summarize" :(

You can use a combination of the prev() and iff() functions.
For example:
datatable(['value']: int, ['date']: datetime, ['country']: string) [
1000, datetime(2018-01-01), "USA",
2000, datetime(2018-01-01), "Poland",
3000, datetime(2018-01-02), "USA",
4000, datetime(2018-01-02), "Poland",
]
| order by country asc, ['date'] asc
| extend percentage = iff(
prev(country) == country,
100.0 * value / prev(value),
100.0)
| value | date | country | percentage |
|------:|:-----|:--------|-----------:|
| 2000 | 2018-01-01 00:00:00.0000000 | Poland | 100 |
| 4000 | 2018-01-02 00:00:00.0000000 | Poland | 200 |
| 1000 | 2018-01-01 00:00:00.0000000 | USA | 100 |
| 3000 | 2018-01-02 00:00:00.0000000 | USA | 300 |

Related

How to convert this list of lists of lists into data frame in R?

Suppose we have something that looks like:
list of two months:
Jan
Feb
Then when you expand Jan and Feb, you see other lists (countries):
Jan -> USA, UK, etc.
Feb -> USA, UK, etc.
Finally, each country is a list, containing two items. E.g. opening USA in the Jan list:
USA -> gdp, population
What I'm trying to do is achieve a data frame like so (numbers would just be the values in data):
month country gdp population
Jan USA number no.
Jan UK number no.
Feb USA number no.
Feb UK number no.
I tried bind_rows(list), which doesn't quite work because the header becomes the countries and the columns are populated with numbers that no longer have their labels.
Does anyone have any ideas?
Edit: dput() output:
list(`Jan` = list(UK = list(date_value = "Jan",
country_code = "UK", gdp = 308L, pop = 0L)
Here is an option
library(purrr)
library(dplyr)
map_dfr(lst2, bind_rows)
# A tibble: 2 x 4
# date_value country_code gdp pop
# <chr> <chr> <int> <int>
#1 Jan UK 308 0
#2 Jan UK 308 0
data
lst2 <- list(Jan = list(UK = list(date_value = "Jan", country_code = "UK",
gdp = 308L, pop = 0L)), Feb = list(UK = list(date_value = "Jan",
country_code = "UK", gdp = 308L, pop = 0L)))
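If you would rather not depend on purrr/dplyr, here is a minimal base R sketch of the same flattening. It assumes every country entry is a flat list of scalars, as in `lst2` above. The `month` column is my addition (taken from the outer list names), not something the original answer produces.

```r
# Same sample structure as lst2 above: months -> countries -> fields
lst2 <- list(
  Jan = list(UK = list(date_value = "Jan", country_code = "UK", gdp = 308L, pop = 0L)),
  Feb = list(UK = list(date_value = "Jan", country_code = "UK", gdp = 308L, pop = 0L))
)

# For each month, turn every country list into a one-row data frame,
# stack the rows, and record which outer list element they came from
rows <- lapply(names(lst2), function(month) {
  per_country <- lapply(lst2[[month]], function(x) {
    as.data.frame(x, stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, per_country)
  out$month <- month  # assumption: the outer names are the months you want
  out
})

result <- do.call(rbind, rows)
rownames(result) <- NULL
result
```

Taking the month from the outer names rather than from `date_value` means it stays correct even when the inner field is wrong, as it is for the `Feb` entry in the sample data.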

Creating a new variable from the ranges of another column in which the ranges change - R

I am a beginner in R so sorry if it is a very simple question. I looked but I could not find the same problem.
I want to create a new variable from the ranges of another column in R but the ranges are not the same for each row.
To be more specific, my data covers the years 1960 - 2000 and I have value ranges for employment. For 1960 to 1980 a teacher is 1 and a lawyer is 2 etc. For 1980 - 1990 a teacher is in the value range 1-29 and a lawyer is 50-89 etc. Then finally for 1990-2000, the value range for the teacher is 40-65 and for the lawyer it is 1-39.
I don't even know how to begin with it (teacher and lawyer are not the only occupations; there are 10 different occupations with overlapping value ranges for different years, which makes it very confusing for me).
I would appreciate your help. Thank you very much.
Here are a couple of approaches to get you started.
First, say you have a data frame with year and occupation_code:
df1 <- data.frame(
year = c(1965, 1985, 1995),
occupation_code = c(1, 2, 3)
)
year occupation_code
1 1965 1
2 1985 2
3 1995 3
Then, create a second data frame that maps each occupation to its year range and occupation code range. You can include all of your occupations here.
df2 <- data.frame(
year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
occupation_code_start = c(1, 2, 1, 50, 40, 1),
occupation_code_end = c(1, 2, 29, 89, 65, 39),
occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer")
)
year_start year_end occupation_code_start occupation_code_end occupation
1 1960 1980 1 1 teacher
2 1960 1980 2 2 lawyer
3 1980 1990 1 29 teacher
4 1980 1990 50 89 lawyer
5 1990 2000 40 65 teacher
6 1990 2000 1 39 lawyer
Then, you can merge the two together.
One approach is with data.table package.
library(data.table)
setDT(df1)
setDT(df2)
df2[df1,
on = .(year_start <= year,
year_end >= year,
occupation_code_start <= occupation_code,
occupation_code_end >= occupation_code),
.(year, occupation = occupation)]
This will give you:
year occupation
1: 1965 teacher
2: 1985 teacher
3: 1995 lawyer
Another approach is with fuzzyjoin and tidyverse:
library(tidyverse)
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
by = c("year" = "year_start",
"year" = "year_end",
"occupation_code" = "occupation_code_start",
"occupation_code" = "occupation_code_end"),
match_fun = list(`>=`, `<=`, `>=`, `<=`)) %>%
select(year, occupation)
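If neither package is available, the same range lookup can be sketched in base R with a cross join followed by a filter. This is only a sketch: it builds every combination of rows, so it assumes both tables are small enough for that.

```r
df1 <- data.frame(
  year = c(1965, 1985, 1995),
  occupation_code = c(1, 2, 3)
)
df2 <- data.frame(
  year_start = c(1960, 1960, 1980, 1980, 1990, 1990),
  year_end = c(1980, 1980, 1990, 1990, 2000, 2000),
  occupation_code_start = c(1, 2, 1, 50, 40, 1),
  occupation_code_end = c(1, 2, 29, 89, 65, 39),
  occupation = c("teacher", "lawyer", "teacher", "lawyer", "teacher", "lawyer"),
  stringsAsFactors = FALSE
)

# merge() with by = NULL returns the Cartesian product of the two tables
crossed <- merge(df1, df2, by = NULL)

# Keep only the pairs where both the year and the code fall inside the range
inside <- crossed$year >= crossed$year_start &
  crossed$year <= crossed$year_end &
  crossed$occupation_code >= crossed$occupation_code_start &
  crossed$occupation_code <= crossed$occupation_code_end

matched <- crossed[inside, c("year", "occupation")]
matched <- matched[order(matched$year), ]
matched
```

Both ends of each range are inclusive in all three approaches, so a boundary year such as 1980 or 1990 falls into two year ranges. If that matters for your data, make one side strict (for example `year < year_end`).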

Column manipulation in R - matching correct names

I have a data.frame composed of multiple columns and thousands of rows. Below is its head():
|year |state_name|idealPoint| vote_no| vote_yes|
|:--------------|---------:|---------:|---------:|---------:|
|1971 | China | -25.0000| 31.0000| 45.4209|
|1972 | China | -26.2550| 38.2974| 45.4209|
|1973 | China | 28.2550| 35.2974| 45.4209|
|1994 | Czech | 27.2550| 34.2974| 45.4209|
As you can see, not all countries [there are 196 of them] joined voting at the UN in the same year.
What I want to do is create a new column in my data.frame (votes) that holds the absolute difference between China's ideal point and the Czech ideal point for a given year. I know how to create the new column with dplyr, but how do I match up the correct countries from the list of 196? (The rows from years before a country joined can then be deleted manually, I think.)
The final output should be a new data.frame (or new columns in votes) looking like this (assuming, for instance, that China's ideal point in 1994 was 2.2550):
|year |state_name|idealPoint|Abs.Difference China_Czech|
|:--------------|---------:|---------:|-------------------------:|
|1971 | China | -25.0000| NA |
|1972 | China | -26.2550| NA |
|1973 | China | 28.2550| NA |
|1994 | Czech | 27.2550| 25.0000 |
Code:
df1 <- data.frame(year = c(1994,1995,1996,1997,1994,1995,1996,1997),
state_name = c("China","China","China","China","Czech_Republic","Czech_Republic","Czech_Republic","Czech_Republic"),
idealpoints = c(-25.0000,-26.2550,28.2550,27.2550,-27.0000,-28.2550,29.2550,22.2550),
vote_no = c(31.0000,38.2974,35.2974,34.2974,33.0000,36.2974,37.2974,38.2974),
vote_yes = c(45.4209,45.4209,45.4209,45.4209,45.4209,45.4209,45.4209,45.4209))
china_df <- df1[df1$state_name == "China",]
czech_df <- df1[df1$state_name == "Czech_Republic",]
china_czech_merge <- merge(china_df,czech_df,by = "year")
china_czech_merge$Abs_diff <- abs(china_czech_merge$idealpoints.x - china_czech_merge$idealpoints.y)
Output:
year state_name.x idealpoints.x vote_no.x vote_yes.x state_name.y idealpoints.y vote_no.y vote_yes.y Abs_diff
1 1994 China -25.000 31.0000 45.4209 Czech_Republic -27.000 33.0000 45.4209 2
2 1995 China -26.255 38.2974 45.4209 Czech_Republic -28.255 36.2974 45.4209 2
3 1996 China 28.255 35.2974 45.4209 Czech_Republic 29.255 37.2974 45.4209 1
4 1997 China 27.255 34.2974 45.4209 Czech_Republic 22.255 38.2974 45.4209 5
I think this will work for you.
Thanks
Does this perhaps solve your problem?
library(tibble)
library(dplyr)
a <- tribble(
~year, ~ctry, ~vote,
1994, "China", 5,
1995, "China", 100,
1996, "China", 600,
1997, "China", 45,
1998, "China", 9,
1994, "Czech_Republic", 1,
1995, "Czech_Republic", 5,
1996, "Czech_Republic", 100,
1997, "Czech_Republic", 40,
1998, "Czech_Republic", 6,
)
a %>%
group_by(year) %>%
mutate(foo = abs(lag(lead(vote) - vote)))
Output:
# A tibble: 10 x 4
# Groups: year [5]
year ctry vote foo
<dbl> <chr> <dbl> <dbl>
1 1994 China 5 NA
2 1995 China 100 NA
3 1996 China 600 NA
4 1997 China 45 NA
5 1998 China 9 NA
6 1994 Czech_Republic 1 4
7 1995 Czech_Republic 5 95
8 1996 Czech_Republic 100 500
9 1997 Czech_Republic 40 5
10 1998 Czech_Republic 6 3
You'll have to filter down the data to fit your needs, e.g. by country.
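If you want to keep one row per country and year (rather than the wide result that merge() gives), one option is to look up China's value for each row's year with match(). Below is a minimal base R sketch using the idealpoints sample values from the first answer. The column name `abs_diff_china` is just an illustrative choice.

```r
df1 <- data.frame(
  year = c(1994, 1995, 1996, 1997, 1994, 1995, 1996, 1997),
  state_name = rep(c("China", "Czech_Republic"), each = 4),
  idealpoints = c(-25.0000, -26.2550, 28.2550, 27.2550,
                  -27.0000, -28.2550, 29.2550, 22.2550),
  stringsAsFactors = FALSE
)

# Reference country: one row per year is assumed here
china <- df1[df1$state_name == "China", ]

# For every row, find China's ideal point in that row's year;
# match() returns NA for years in which China has no entry
china_in_year <- china$idealpoints[match(df1$year, china$year)]

# Absolute difference of each country-year to China in the same year
df1$abs_diff_china <- abs(df1$idealpoints - china_in_year)
df1
```

China's own rows come out as 0, and any year where China has no entry gives NA. Because nothing is hard-coded except the reference country, the same lines should work unchanged with all 196 countries in the table.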

Monthly average of data for a time series in a group by condition

Below is what my data looks like.
Date, City , Cost
Jan, New York, 1000
Feb, New York, 1500
Mar, New York, 1200
Apr, New York, 900
May, New York, 1100
June, New York, 1500
Jan, London, 2000
Feb, London, 2400
Mar, London, 1700
Apr, London, 1900
May, London, 1900
June, London, 1000
I want to calculate the below things:
1. % Cost change for last 3 months and last 6 months
2. Month by Month % Cost change for every group.
Hence, outcome will be like
Date, City , Cost
Jan, New York, 1000, 0%
Feb, New York, 1500 , 50%
Mar, New York, 1200 , -20%
Apr, New York, 900 , -25%
May, New York, 1100, 23%
June, New York, 1500, 36%
Jan, London, 2000 , 0%
Feb, London, 2400 , 20%
Mar, London, 1200 , -50%
Apr, London, 1200 , 0%
May, London, 1900 , 56%
June, London, 1900 , 0%
July, London, 1000 , -44%
and
City, Last 3 month change, Last 6 month change,
New York,-44% (1000-1900)/1900 , 58% (1000-2400)/2400
London, and so on...
Note: Concerning point 1: I'm unsure what you're after: 3 months, 6 months change relative to what? Concerning point 2: I can't reproduce your expected output. Please double-check your numbers.
I assume you want to calculate the percentage change in Cost relative to the previous value. You can do the following using dplyr::lag:
library(tidyverse);
df %>%
group_by(City) %>%
mutate(perc_change = (Cost - lag(Cost)) / lag(Cost) * 100)
## A tibble: 12 x 4
## Groups: City [2]
# Date City Cost perc_change
# <fct> <fct> <int> <dbl>
# 1 Jan " New York" 1000 NA
# 2 Feb " New York" 1500 50.0
# 3 Mar " New York" 1200 -20.0
# 4 Apr " New York" 900 -25.0
# 5 May " New York" 1100 22.2
# 6 June " New York" 1500 36.4
# 7 Jan " London" 2000 NA
# 8 Feb " London" 2400 20.0
# 9 Mar " London" 1700 -29.2
#10 Apr " London" 1900 11.8
#11 May " London" 1900 0.
#12 June " London" 1000 -47.4
Sample data
df <- read.csv(text =
"Date, City , Cost
Jan, New York, 1000
Feb, New York, 1500
Mar, New York, 1200
Apr, New York, 900
May, New York, 1100
June, New York, 1500
Jan, London, 2000
Feb, London, 2400
Mar, London, 1700
Apr, London, 1900
May, London, 1900
June, London, 1000", header = TRUE)
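For reference, the same month-over-month change can be sketched without dplyr by using ave() to apply a function within each city. The helper name `pct_change` is hypothetical, and the sketch assumes the rows are already in chronological order within each city, as in the sample data.

```r
df <- data.frame(
  Date = rep(c("Jan", "Feb", "Mar", "Apr", "May", "June"), times = 2),
  City = rep(c("New York", "London"), each = 6),
  Cost = c(1000, 1500, 1200, 900, 1100, 1500,
           2000, 2400, 1700, 1900, 1900, 1000),
  stringsAsFactors = FALSE
)

# Percentage change relative to the previous element; the first element
# has no predecessor, so it gets NA
pct_change <- function(x) c(NA, diff(x) / head(x, -1) * 100)

# ave() applies pct_change separately to the Cost values of each City
df$perc_change <- ave(df$Cost, df$City, FUN = pct_change)
df
```

As with the dplyr version, the first month of each city is NA rather than 0%, because there is no earlier value to compare against.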

Calculate duration/difference between first and n rows that match on column value

I'm trying to calculate the difference/duration between the first and nth rows of a dataframe that match in one column. I want to place that value in a new column "duration". Sample data is below.
y <- data.frame(c("USA", "USA", "USA", "France", "France", "Mexico", "Mexico", "Mexico"), c(1992, 1993, 1994, 1989, 1990, 1999, 2000, 2001))
colnames(y) <- c("Country", "Year")
y$Year <- as.integer(y$Year) # this is to match the class of my actual data
My desired result is:
1992 USA 0
1993 USA 1
1994 USA 2
1989 France 0
1990 France 1
1999 Mexico 0
2000 Mexico 1
2001 Mexico 2
I've tried using dplyr's group_by and mutate
y <- y %>% group_by(Country) %>% mutate(duration = Year - lag(Year))
but I can only get the actual lag year (e.g. 1999), or calculate the difference between sequential rows, which gives me NA for the first row of a country and 1 for all other rows of that country. Many Q&As focus on the difference between sequential rows and not between the first and nth rows.
Thoughts?
This can be done by subtracting the first 'Year' from the 'Year' column after grouping by 'Country'.
library(dplyr)
y %>%
group_by(Country) %>%
mutate(duration = Year - first(Year))
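Note that first(Year) depends on row order within each country. If the rows might not be sorted by year, subtracting the group minimum is a safer variant. Here is a minimal base R sketch of that idea, under the assumption that "duration" means years since the earliest year recorded for the country:

```r
y <- data.frame(
  Country = c("USA", "USA", "USA", "France", "France", "Mexico", "Mexico", "Mexico"),
  Year = as.integer(c(1992, 1993, 1994, 1989, 1990, 1999, 2000, 2001)),
  stringsAsFactors = FALSE
)

# Within each Country, subtract the earliest Year from every Year;
# min() makes the result independent of the row order
y$duration <- ave(y$Year, y$Country, FUN = function(x) x - min(x))
y
```

On sorted data like the sample this gives the same result as `Year - first(Year)`.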
