Pig: Count only specific rows

Pig: Count only specific rows - count

I have a data that has location, sentiment and brand fields. I want to calculate the number of positives, negatives and neutrals in a location for a brand.
Assuming x has the data, I did:
a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand), COUNT(x.sentiment=="positive"?1:0) AS positive_count, COUNT(x.sentiment=="negative"?1:0) AS negative_count, COUNT(x.sentiment=="neutral:?1:0) as neutral_count;
But I am getting a syntax error saying Unexpected character '"'
I tried grouping by all three: location, sentiment and brand but I am getting only the overall count like:
{location: "newyork", brand: "pampers", sentiment = "positive", count = 10}
{location: "newyork", brand: "pampers", sentiment = "negative", count = 2}
{location: "newyork", brand: "pampers", sentiment = "neutral", count = 20}
I want seperate fields for positives_count, negatives_count and neutrals_count. Something like this:
{location: "newyork", brand: "pampers", positive_count = 10, negative_count = 2, neutral_count = 20}
{location: "london", brand: "pampers", positive_count = 12, negative_count = 0, neutral_count = 35}
{location: "newyork", brand: "huggies", positive_count = 40, negative_count = 6, neutral_count = 10}
Can some one help me out please?

Use single quotes
a1 = GROUP x BY (location, brand);
a2 = FOREACH a1 GENERATE FLATTEN(group) AS (location, brand),
COUNT(x.sentiment=='positive'?1:0) AS positive_count,
COUNT(x.sentiment=='negative'?1:0) AS negative_count,
COUNT(x.sentiment=='neutral'?1:0) as neutral_count;
EDIT
newyork pampers positive
newyork pampers positive
newyork pampers negative
newyork pampers positive
newyork pampers positive
newyork pampers neutral
newyork pampers positive
newyork pampers negative
newyork pampers neutral
newyork pampers positive
newyork pampers positive
newyork pampers neutral
Script
B = GROUP A BY (location,brand);
C = FOREACH B {
A1 = FILTER A BY sentiment matches 'positive';
A2 = FILTER A BY sentiment matches 'negative';
A3 = FILTER A BY sentiment matches 'neutral';
GENERATE FLATTEN(group) as (location,brand),COUNT(A1),COUNT(A2),COUNT(A3);
};
Output

I filtered the alias that contained the original data and counted each number of entries and joined them all.
p = FILTER y BY (sentiment == 'positive');
p1 = GROUP p BY (location, brand, avl_author_type);
p2 = FOREACH p1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(p) AS positive_counts;
n = FILTER y BY (sentiment == 'negative');
n1 = GROUP n BY (location, brand, avl_author_type);
n2 = FOREACH n1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(n) AS negative_counts;
ne = FILTER y BY (sentiment == 'neutral');
ne1 = GROUP ne BY (location, brand, avl_author_type);
ne2 = FOREACH ne1 GENERATE FLATTEN(group) AS (location, brand, avl_author_type), COUNT(ne) AS neutral_counts;
j1 = JOIN p2 BY (location, brand, avl_author_type) LEFT OUTER, n2 BY (location, brand, avl_author_type);
j2 = FOREACH j1 GENERATE p2::location as location, p2::brand as brand, p2::avl_author_type as avl_author_type, p2::positive_counts as positive_counts, n2::negative_counts as negative_counts;
j3 = JOIN j2 BY (location, brand, avl_author_type) LEFT OUTER, ne2 BY (location, brand, avl_author_type);
j4 = FOREACH j3 GENERATE j2::location as location, j2::brand as brand, j2::avl_author_type as avl_author_type, j2::positive_counts as positive, j2::negative_counts as negative, ne2::neutral_counts as neutral;
Kind of lengthy but worked.

Related

subtracting values in dataframe with condition(s) from another dataframe

I have two dataframes that have sales data from fruits store.
1st Data frame has sales data from 'Store A',
and the 2nd data frame has that data gathered from 'Store A + Store B'
StoreA = data.frame(
Fruits = c('Apple', 'Banana', 'Blueberry'),
Customer = c('John', 'Peter', 'Jenny'),
Quantity = c(2, 3, 1)
)
Total = data.frame(
Fruits = c('Blueberry', 'Apple', 'Banana', 'Blueberry', 'Pineapple'),
Customer = c('Jenny' , 'John', 'Peter', 'John', 'Peter'),
Quantity = c(4, 7, 3, 5, 3)
)
StoreA
Total
I wish to subtract the sales data of 'StoreA' from 'Total' to get sales data for 'StoreB'.
At the end, I wish to have something like

Great Question! There is a simple and graceful way of achieving exactly what you want.
The title to this question is: "subtracting values in a data frame with conditons() from another data frame"
This subtraction can be accomplished just like the title says. But there is a better way than using subtraction. Turning a subtraction problem into an addition problem is often the easiest way of solving a problem.
To make this into an addition problem, just convert one of the data frames (StoreA$Quantity) into negative values. Only convert the Quantity variable into negative values. And then rename the other data frame (Total) into StoreB.
Once that is done, it's easy to finish. Just use the join function with the two data frames (StoreA & StoreB). Doing that brings the negative and positive values together and the data is more understandable. When there are the same things with positive and negative values, then it's obvious these things need to be combined.
To combine those similar items, use the group_by() function and pipe it into a summarize() function. Doing the coding this way makes the code easy to read and easy to understand. The code can almost be read like a book.
Create data frames:
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
StoreB = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
Convert StoreA$Quantity to negative values:
StoreA_ <- StoreA
StoreA_Quanity <- StoreA$Quantity * -1
StoreA_
StoreA_ now looks like this:
Fruits Customer Quantity
<fct> <fct> <dbl>
Apple John -2
Banana Peter -3
Blueberry Jenny -1
Now combine StoreA and StoreB. Use the full_join() function to join the two stores:
Total <- full_join(StoreA_, StoreB, disparse = 0)
Total
The last thing is accomplished using the group_by function. This will combine the positive and negative values together.
Total %>% group_by(Fruits, Customer) %>% summarize(s = sum(Quantity))
It's Done! The output is shown at this link:

You could do a full join first, then rename the columns, fill the missing values resulting from the join and then compute the difference.
library(tidyverse)
StoreA = data.frame(Fruits = c('Apple', 'Banana', 'Blueberry'), Customer = c('John', 'Peter', 'Jenny'), Quantity = c(2,3,1))
Total = data.frame(Fruits = c('Blueberry','Apple', 'Banana', 'Blueberry', 'Pineapple'), Customer = c('Jenny' ,'John', 'Peter', 'John', 'Peter'), Quantity = c(4,7,3,5,3))
full_join(StoreA %>%
rename(Qty_A = Quantity),
Total %>%
rename(Qty_Total = Quantity), by = c("Fruits", "Customer")) %>%
# fill NAs with zero
replace_na(list(Qty_A = 0)) %>%
# compute the difference
mutate(Qty_B = Qty_Total - Qty_A)
#> Fruits Customer Qty_A Qty_Total Qty_B
#> 1 Apple John 2 7 5
#> 2 Banana Peter 3 3 0
#> 3 Blueberry Jenny 1 4 3
#> 4 Blueberry John 0 5 5
#> 5 Pineapple Peter 0 3 3
Created on 2020-09-28 by the reprex package (v0.3.0)

R: How to remove duplicates with some conditions in a complex Dataframe?

I have a dataset for factory producing Gold and Silver product (Pen), we would like to check the quality by assigning employees to check these products produced from all machines in the factory. Data sample data below:
Every machine is in a specific Room/Section/Building, and we have two columns to group employee IDs that are testing Gold and Silver Pens.
The issue is I have duplicates employees testing the quality of the same machine. So I would like to remove these duplicates and group the ones which are not duplicates.
Sample:
Bld.No <- c(1,1,1,1,1,1,2,2,2,2)
Section <- c("A","A","A","A","B","B","C","C","D","D")
Room.No <- c(100,100,100,100,200,200,300,300,400,400)
Gold <- c(8,6,4,0,6,0,7,2,2,1)
Silver <- c(1,0,0,1,2,3,4,0,4,0)
Total <- c(9,6,4,1,8,3,11,2,6,1)
Emp.Gold.ID <- c("A11, A09, B22, E12, A04, C09, D33, A01", "A11, A09, B22, E12, A04, A01", "A09, 822, E12, A04", NA, "A71, A09, B12, E32, A04, C19", NA, "B22, E12, A04, C09, D33, A01, M11", "E12, Z09", "C09, D33", "D18")
Emp.Silver.ID <- c("A17", NA, NA, "D33", "B22, E12", "A09, B12, E32", "A44, C02, D03, A71", NA, "A12, A01, M11, D18", NA)
df <- data.frame(Bld.No, Section, Room.No, Gold, Silver, Total, Emp.Gold.ID, Emp.Silver.ID)
Note: if emp.Id is already in the previous records, either gold or silver, we should remove it. Meaning ID should be in either one and remove the duplicate. See the example of the last record in the sample and output table, we removed the last record (2, D, 400, 1, 0, 1, D18, NA), because of D18 is already in the previous record, even though it's in the Silver column.
Sample Data and Output:
Sample Data and Output

To do this, I would use separate_rows to get all the IDs in separate rows to remove duplicates later on with distinct.
After removing duplicate IDs, would create comma separated strings of the IDs for gold and silver. You can either summarize total Gold and Silver before this step or afterwards.
Note that to get the same results as in your sample Data and Output, I changed 822 to B22.
Please let me know if this is what you had in mind.
library(dplyr)
library(tidyr)
df$Emp.Gold.ID <- as.character(df$Emp.Gold.ID)
df$Emp.Silver.ID <- as.character(df$Emp.Silver.ID)
df %>%
separate_rows(Emp.Gold.ID) %>%
separate_rows(Emp.Silver.ID) %>%
pivot_longer(cols = starts_with("Emp."), names_to = "ID", values_drop_na = TRUE) %>%
group_by(Bld.No, Section, Room.No) %>%
distinct(value, .keep_all = TRUE) %>%
group_by(Bld.No, Section, Room.No, ID) %>%
summarise(NewID = toString(value)) %>%
pivot_wider(names_from = ID, values_from = NewID) %>%
mutate(Gold = length(unlist(strsplit(Emp.Gold.ID, ", "))),
Silver = length(unlist(strsplit(Emp.Silver.ID, ", "))),
Total = Gold + Silver)
# A tibble: 4 x 8
# Groups: Bld.No, Section, Room.No [8]
Bld.No Section Room.No Emp.Gold.ID Emp.Silver.ID Gold Silver Total
<dbl> <fct> <dbl> <chr> <chr> <int> <int> <int>
1 1 A 100 A11, A09, B22, E12, A04, C09, D33, A01 A17 8 1 9
2 1 B 200 A71, A09, B12, E32, A04, C19 B22, E12 6 2 8
3 2 C 300 B22, E12, A04, C09, D33, A01, M11, Z09 A44, C02, D03, A71 8 4 12
4 2 D 400 C09, D33 A12, A01, M11, D18 2 4 6

Unnest lists of list from an elastic Search result in R

EDIT 1: the problem in more simple terms (for the whole issue, check the Original Edit)
How can I unlist a list of key, values pairs within a dataframe, knowing that the number of pairs may vary.
For instance:
_source.types _source.label
1 key1, key2, value1, value2 label1
2 NULL label1
3 key3, value3 label2
Note that (key1, key2, value1, value2) is a <data.frame>
Expected result:
types.k1 types.v1 types.k2 types.v2 label
1 key1 value1 key2 value2 label1
2 NULL NULL NULL NULL label1
3 key3 value3 NULL NULL label2
I've tried unnest, unlist,... without success as I have always an error due to the number of elements or the class of the object.
ORIGINAL EDIT
I have a result from a Search request to an elastic search base, using elastic package. As the query is a loop from terms within a pre-existing dataframe, I have a list of responses for each term.
#existing dataframe
df <- data.frame(id=c("1","2"),terms=(c("Guy de Maupassant","Vincent Cassel")))
#loop query to ES
query_es <- '{"_source": ["id", "label", "types", "subTypes"],
"query":{"bool":{"must":[{"term":{"label":"%s"}}]}}}'
out = list()
for (i in seq_along(df$terms)) {
out[[i]] <- Search(index = "index_1",
body = sprintf(query_es, df$terms[i]),
size = 3, asdf=TRUE)$hits$hits
}
The result is a list of lists like this (I just display the first result for clarity) :
[[1]]
_index _type _id _score _source.types
1 index_1 triplet Q9327 13.18037 Q5, dbPedia.Person, être humain, personne
2 index_1 triplet Q3122270 13.17847 Q11424, dbPedia.Film, film, film
_source.subTypes _source.label _source.id
1 Q1930187, Q36180, Q15949613, Q6625963, Q214917, journaliste, écrivain, nouvelliste, romancier, dramaturge Guy de Maupassant Q9327
2 NULL Guy de Maupassant Q3122270
As you can see, I have 2 possible results for the first term: a writer or a movie, each one having a list of {id,value} for the types and the subTypes.
In order to have a more comprehensive view, I re-arrange the output:
out2 <- bind_rows(out, .id = "id")
out2 <- out_i_bind2[,-c(2:5)]
colnames(out2) <- c("id","types","subTypes","entityLabel","entityId")
As a result, I have (for the first term only):
id types
1 1 Q5, dbPedia.Person, être humain, personne
2 1 Q11424, dbPedia.Film, film, film
subTypes entityLabel entityId
1 Q1930187, Q36180, Q15949613, Q6625963, Q214917, journaliste, écrivain, nouvelliste, romancier, dramaturge Guy de Maupassant Q9327
2 NULL Guy de Maupassant Q3122270
Notice that for the second result (movie), I do not have any subType. Moreover, the length of the listed elements within types or subTypes may vary according to the search term.
Now, I would like to unnest the lists in order to have a dataframe like this (sorry the format is not very comprehensive, but basically the idea is to have each {key,value} unnested in 2 columns with an incremental index):
X_id X_source.types.id X_source.types.value X_source.types.id.1 X_source.types.value.1 X_source.subTypes.id
1 1 Q5 être humain dbPedia.Person personne Q1930187
2 1 Q11424 film dbPedia.Film film <NA>
X_source.subTypes.value X_source.subTypes.id.1 X_source.subTypes.value.1 X_source.subTypes.id.2 X_source.subTypes.value.2
1 journaliste Q36180 écrivain Q15949613 nouvelliste
2 <NA> <NA> <NA> <NA> <NA>
X_source.subTypes.id.3 X_source.subTypes.value.3 X_source.subTypes.id.4 X_source.subTypes.value.4 X_source.label X_source.id
1 Q6625963 romancier Q214917 dramaturge Guy de Maupassant Q9327
2 <NA> <NA> <NA> <NA> Guy de Maupassant Q3122270
The conservation of the related ids is very important. I tried many things found here :
Convert in R output of package Elastic (nested list?) to data.frame or JSON
or here:
Extract data from elasticsearch into R with elastic package, load into a data frame, error due to hits not expanding to the same length
without any success...
Any idea to deal with it? I was wondering if I should transform the rearranged output (out2) or if it's better to come back to the original output (out)...
Thanks in advance!
PS : here is the dput version of the "out" (from df Search):
> dput(out, control="useSource")
list(list(`_index` = c("alias_fr", "alias_fr"), `_type` = c("triplet",
"triplet"), `_id` = c("Q9327", "Q3122270"), `_score` = c(13.180366,
13.178474), `_source.types` = list(list(id = c("Q5", "dbPedia.Person"
), value = c("être humain", "personne")), list(id = c("Q11424",
"dbPedia.Film"), value = c("film", "film"))), `_source.subTypes` = list(
list(id = c("Q1930187", "Q36180", "Q15949613", "Q6625963",
"Q214917"), value = c("journaliste", "écrivain", "nouvelliste",
"romancier", "dramaturge")), NULL), `_source.label` = c("Guy de Maupassant",
"Guy de Maupassant"), `_source.id` = c("Q9327", "Q3122270")),
list(`_index` = "alias_fr", `_type` = "triplet", `_id` = "Q193504",
`_score` = 13.18018, `_source.types` = list(list(id = c("Q5",
"dbPedia.Person"), value = c("être humain", "personne"
))), `_source.subTypes` = list(list(id = c("Q33999",
"Q10800557", "Q3282637", "Q2526255", "Q28389"), value = c("acteur",
"acteur de cinéma", "producteur de cinéma", "réalisateur",
"scénariste"))), `_source.label` = "Vincent Cassel",
`_source.id` = "Q193504"))
And the same for out2 :
> dput(out2, control="useSource")
list(id = c("1", "1", "2"), types = list(list(id = c("Q5", "dbPedia.Person"
), value = c("être humain", "personne")), list(id = c("Q11424",
"dbPedia.Film"), value = c("film", "film")), list(id = c("Q5",
"dbPedia.Person"), value = c("être humain", "personne"))), subTypes = list(
list(id = c("Q1930187", "Q36180", "Q15949613", "Q6625963",
"Q214917"), value = c("journaliste", "écrivain", "nouvelliste",
"romancier", "dramaturge")), NULL, list(id = c("Q33999",
"Q10800557", "Q3282637", "Q2526255", "Q28389"), value = c("acteur",
"acteur de cinéma", "producteur de cinéma", "réalisateur",
"scénariste"))), entityLabel = c("Guy de Maupassant", "Guy de Maupassant",
"Vincent Cassel"), entityId = c("Q9327", "Q3122270", "Q193504"
))

I finally manage to solve the problem, thanks to this post and some transformation steps. The solution is not very elegant though, but it works:
out_bind <- bind_rows(out, .id = "id")
#transform to data table in order to apply rbindlist
out <- as.data.table(out_bind)
#rbindlist for "types" variable
out_nest1 <- rbindlist(out$types, fill = T, id = "row")[, entityId := out$entityId[row]][]
#rbindlist to "subTypes variable (choosing another id name -row1-, if not Rstudio was crashing!)
out_nest2 <- rbindlist(out$subTypes, fill = T, id = "row1")[, entityId := out$entityId[row1]][]
#finally joining the whole data
out <- full_join(out,out_nest1,by="entityId")
out <- full_join(out,out_nest2,by="entityId")
Now I can spend a good christmas ;)
Edit: the crash was not due to the id name, but to a data.table issue, solved here.

How to do a sequential merge in R based on multiple columns in two same datasets

I need to perform a sequential merging in R and what I mean by this is that let's say I have two datasets: orders and deliveries.
I want to match up these orders and deliveries together but I first want to merge based on the address column, then for the rows that don't match up, I want to merge based on zip code, then for those rows that don't match up, I want to merge based on latitude and longitude, then for those rows that don't match up, I want to merge on some other attribute and so forth.
I can easily do a merge based on one attribute like so:
merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
by.y = c("date", "delivery_address"), sort = FALSE)
But now I want to match up those rows that didn't match up in merge1 by let's say zip code which has two different names in both columns ("zipcode" in one dataset and "postcode" in another).
I tried doing a left join on the orders and then finding the rows which return NA for some column in the deliveries dataset for merge1 and then tried doing another merge using that subset, but haven't been able to successfully do that.
merge1 <- merge(orders, deliveries, by.x = c("order_date", "address"),
by.y = c("date", "delivery_address"), all.x = TRUE, sort = FALSE)
merge2 <- merge(merge1[is.na(merge1$delivery_address),], deliveries, by.x = c("order_date", "zipcode"),
by.y = c("date", "postcode"), all.x = TRUE, sort = FALSE)
I know that's totally wrong as it only returns me NA values and it duplicates the columns, but that was my train of thought.
Basically, just want a way to do a sequential merging in R between two datasets, first by one column, then by another, and so on and so forth. I don't want a left join though, an inner join where only the matching rows are returned, however, I could do a left join and then after all of the merging, select only the rows which don't have NA's. So my final result should be all the orders matched up with deliveries, but only the ones which matched up accordingly.
EDIT:
People asked for some example data, so here is some:
orders <- data.frame( order = c(1,2,3,4,5,6,7,8,9,10),
address = c(1111, 1112, 1314, 1113, 1114, 1618, 1917, 1118, 1945, 2000),
zipcode = c(001, 002, 001, 999, 999, 006, 007, 007, 999, 010))
deliveries <- data.frame(length = c(4, 5, 9, 11, 13, 15, 93, 17, 4, 8, 12),
delivery_address = c(1111, 1112, 0111, 1113, 1114, 0000, 1618, 0001, 0002, 0405, 1121),
postcode = c(001, 912, 001, 910, 913, 006, 080, 007, 074, 088, 010))
merge1 <- merge(orders, deliveries, by.x = "address", by.y = "delivery_address", sort = FALSE)
So merge1 properly gives me the orders matched up with deliveries that had the same address, now how do I add to the merge1 dataset and add those rows which didn't get matched with the deliveries dataset so I can match them by postcode since there are still some orders and deliveries that can get matched by postcode.

This works for your example data:
# functions used here use dplyr to process data
library("dplyr")
# using forward pipe syntax for readability of this example
# though this isn't necessary for functions to work
library("magrittr")
# merge by exact matches between address and delivery_address
# add column of delivery_address for binding later so dataframes match
merge1 <- orders %>%
inner_join(y = deliveries,
by = c("address" = "delivery_address")) %>%
mutate(delivery_address = address)
# extract unmerged columns from orders then merge exact matches by
# zipcode to postcode.
# add postcode column for binding
merge2 <- orders %>%
anti_join(y = deliveries,
by = c("address" = "delivery_address")) %>%
inner_join(y = deliveries,
by = c("zipcode" = "postcode")) %>%
mutate(postcode = zipcode)
# bind two sets of results together.
results <- bind_rows(merge1, merge2)
results
I highly recommend the RStudio cheat sheets on data transformation for this sort of work

Consider merging by all and row binding each, then drop duplicates with unique():
merge1 <- unique(rbind(transform(merge(orders, deliveries, by.x = "address", by.y = "delivery_address", sort = FALSE),
delivery_address = address),
transform(merge(orders, deliveries, by.x = "zipcode", by.y = "postcode", sort = FALSE),
postcode = zipcode)))
# address order zipcode length postcode delivery_address
# 1 1111 1 1 4 1 1111
# 2 1112 2 2 5 912 1112
# 3 1113 4 999 11 910 1113
# 4 1114 5 999 13 913 1114
# 5 1618 6 6 93 80 1618
# 6 1314 3 1 9 1 111
# 7 1314 3 1 4 1 1111
# 8 1111 1 1 9 1 111
# 10 1618 6 6 15 6 0
# 11 1917 7 7 17 7 1
# 12 1118 8 7 17 7 1
# 13 2000 10 10 12 10 1121
And for a generalizable solution across multiple columns use Map() and do.call() on a user-defined function, seqmerge, where you extend the xvar and yvar to pairings of merge columns. Be sure both are same length.
seqmerge <- function(xvar, yvar) {
df <- merge(orders, deliveries, by.x = xvar, by.y = yvar, sort = FALSE)
df[[yvar]] = df[[xvar]]
return(df)
}
xvars <- c("address", "zipcode") # ADD MORE AS NEEDED
yvars <- c("delivery_address", "postcode") # ADD MORE AS NEEDED
merge2 <- unique(do.call(rbind, Map(seqmerge, xvars, yvars, USE.NAMES=FALSE)))
all.equal(merge1, merge2)
# [1] TRUE
identical(merge1, merge2)
# [1] TRUE

Sliding window over data.frame with nested hierarchy

Description of the data
My data.frame represents the salary of people living in different cities (city) in different countries (country). city names, country names and salaries are integers. In my data.frame, the variable country is ordered, the variable city is ordered within each country and the variable salary is ordered within each city (and country). There are two additional columns called arg1 and arg2, which contain floats/doubles.
Goal
For each country and each city, I want to consider a window of size WindowSize of salaries and calculate D = sum(arg1)/sum(arg2) over this window. Then, the window slide by WindowStep and D should be recalculated and so on. For example, let's consider a WindowSize = 1000 and WindowStep = 10. Within each country and within each city, I would like to get D for the range of salaries between 0 and 1000 and for the range between 10 and 1010 and for the range 20 and 1020, etc...
At the end the output should be a data.frame associating a D statistic to each window. If a given window has no entry (for example nobody has a salary between 20 and 1020 in country 1, city 3), then the D statistic should be NA.
Note on performance
I will have to run this algorithm about 10000 times on pretty big tables (that have nothing to do with countries, cities and salaries; I don't yet have a good estimate of the size of these tables), so performance is of concern.
Example data
set.seed(84)
country = rep(1:3, c(30, 22, 51))
city = c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt = paste0(city, country)
salary = c()
for (i in unique(tt)) salary = append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 = rnorm(length(country), 1, 1)
arg2 = rnorm(length(country), 1, 1)
dt = data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dim)
country city salary arg1 arg2
1 1 1 22791 -1.4606212 1.07084528
2 1 1 34598 0.9244679 1.19519158
3 1 1 76411 0.8288587 0.86737330
4 1 1 76790 1.3013056 0.07380115
5 1 1 87297 -1.4021137 1.62395596
6 1 2 12581 1.3062181 -1.03360620
With this example, if windowSize = 70000 and windowStep = 30000, the first values of D are -0.236604 and 0.439462 which are the results of sum(dt$arg1[1:2])/sum(dt$arg2[1:2]) and sum(dt$arg1[2:5])/sum(dt$arg2[2:5]), respectively.

Unless I've misunderstood something, the following might be helpful.
Define a simple function regardless of hierarchical groupings:
ff = function(salary, wSz, wSt, arg1, arg2)
{
froms = (wSt * (0:ceiling(max(salary) / wSt)))
tos = froms + wSz
Ds = mapply(function(from, to, salaries, args1, args2) {
inds = salaries > from & salaries < to
sum(args1[inds]) / sum(args2[inds])
},
from = froms, to = tos,
MoreArgs = list(salaries = salary, args1 = arg1, args2 = arg2))
list(from = froms, to = tos, D = Ds)
}
Compute on the groups with, for example, data.table:
library(data.table)
dt2 = as.data.table(dt)
ans = dt2[, ff(salary, 70000, 30000, arg1, arg2), by = c("country", "city")]
head(ans, 10)
# country city from to D
# 1: 1 1 0 70000 -0.2366040
# 2: 1 1 30000 100000 0.4394620
# 3: 1 1 60000 130000 0.2838260
# 4: 1 1 90000 160000 NaN
# 5: 1 2 0 70000 1.8112196
# 6: 1 2 30000 100000 0.6134090
# 7: 1 2 60000 130000 0.5959344
# 8: 1 2 90000 160000 NaN
# 9: 1 3 0 70000 1.3216255
#10: 1 3 30000 100000 1.8812397
I.e. a faster equivalent of
lapply(split(dt[-c(1, 2)], interaction(dt$country, dt$city, drop = TRUE)),
function(x) as.data.frame(ff(x$salary, 70000, 30000, x$arg1, x$arg2)))

Without your expected outcome it is a bit hard to guess whether my result is correct but it should give you a head start for the first step. From a performance point of view the data.table package is very fast. Much faster than loops.
set.seed(84)
country <- rep(1:3, c(30, 22, 51))
city <- c(rep(1:5, c(5,5,5,5,10)), rep(1:5, c(1,1,10,8,2)), rep(c(1,3,4,5), c(20, 7, 3, 21)))
tt <- paste0(city, country)
salary <- c()
for (i in unique(tt)) salary <- append(salary, sort(round(runif(sum(tt==i), 0,100000))))
arg1 <- rnorm(length(country), 1, 1)
arg2 <- rnorm(length(country), 1, 1)
dt <- data.frame(country = country, city = city, salary = salary, arg1 = arg1, arg2 = arg2)
head(dt)
# For data table
require(data.table)
# For rollapply
require(zoo)
setDT(dt)
WindowSize <- 10
WindowStep <- 3
dt[, .(D = (rollapply(arg1, width = WindowSize, FUN = sum, by = WindowStep) /
rollapply(arg2, width = WindowSize, FUN = sum, by = WindowStep)),
by = list(country = country, city = city))]
You can achieve the latter part of your goal by melting the data and doing and writing a custom summary function that you use to dcast your data together again.

Table = NULL
StepNumber = 100
WindowSize = 1000
WindowRange = c(0,WindowSize)
WindowStep = 100
for(x in dt$country){
#subset of data for that country
CountrySubset = dt[dt$country == x,,drop=F]
for(y in CountrySubset$city){
#subset of data for citys within country
CitySubset = CountrySubset[CountrySubset$city == y,,drop=F]
for(z in 1:StepNumber){
WinRange = WindowRange + (z*WindowStep)
#subset of salarys within country of city via windowRange
WindowData = subset(CitySubset, salary > WinRange[1] & salary < WinRange[2])
CalcD = sum(WindowData$arg1)/sum(WindowData$arg2)
Output = c(Country = x, City = y, WinStart = WinRange[1], WinEnd = WinRange[2], D = CalcD)
Table = rbind(Table,Output)
}
}
}
Using your example code this should work, its just a series of nested loops that will write to Table. It does however duplicate a line every now and then because the only way I know to keep adding results to a table is rbind.
So if someone can alter this to fix that. Should be good.
WindowStep is the difference between each consecutive WindowSize you want.
StepNumber is how many steps you want to take in total, might be best to find out what the maximum salary is and then adjust for that.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Pig: Count only specific rows - count

Related

subtracting values in dataframe with condition(s) from another dataframe

R: How to remove duplicates with some conditions in a complex Dataframe?

Unnest lists of list from an elastic Search result in R

How to do a sequential merge in R based on multiple columns in two same datasets

Sliding window over data.frame with nested hierarchy

Categories

Resources