Aggregation on the result of group - crossfilter

I have a data set as follows:
Year | Sale
2010 | 100
2011 | 60
2012 | 10
2013 | 1400
2010 | 900
2012 | 730
2014 | 300
First I want to group the Sale column by Year and sum it, so the result would be:
key | value
2010 | 1000
2011 | 60
2012 | 740
2013 | 1400
2014 | 300
and then I want the difference between consecutive years, as follows:
key | value
2010 | 0
2011 | -940
2012 | 680
2013 | 660
2014 | -1100
Can you help me do this with crossfilter's reduce, using the add, remove and initial methods?

I'd strongly recommend transforming your data before building your Crossfilter so that each record includes the sales for both the current and the subsequent year. You would then have records that look like
Year | Sales | NextYearSales
2010 | 100 | 60
Then you can just do
var cf = crossfilter(data);
var dim = cf.dimension(function(d) { return d.Year });
var group = dim.group();
group.reduceSum(function(d) { return d.Sales - d.NextYearSales; });
Alternatively, you could use Crossfilter's groupAll method, or Reductio's somewhat nicer wrapper of it.
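For reference, the two-step computation the question describes (yearly totals, then consecutive differences) is easy to verify outside the browser. A minimal R sketch of just the arithmetic, not using crossfilter at all:
# the question's sample data
sales <- data.frame(Year = c(2010, 2011, 2012, 2013, 2010, 2012, 2014),
                    Sale = c(100, 60, 10, 1400, 900, 730, 300))
totals <- aggregate(Sale ~ Year, data = sales, FUN = sum)  # yearly totals per key
totals$Diff <- c(0, diff(totals$Sale))                     # consecutive-year differences
totals
#   Year Sale  Diff
# 1 2010 1000     0
# 2 2011   60  -940
# 3 2012  740   680
# 4 2013 1400   660
# 5 2014  300 -1100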

Related

Groupby and filter by max value in pandas

I am working on a dataframe that looks like the following: (ID 2 added in edit)
+-------+----+------+------+
| Value | ID | Date | ID 2 |
+-------+----+------+------+
|     1 |  5 | 2012 |  111 |
|     1 |  5 | 2013 |  112 |
|     0 | 12 | 2017 |  113 |
|     0 | 12 | 2022 |  114 |
|     1 | 27 | 2005 |  115 |
|     1 | 27 | 2011 |  116 |
+-------+----+------+------+
Using only rows with "Value" == 1 ("Value" is boolean), I would like to group the dataframe by ID and write the string "Latest" into a new (blank) column on the row with the latest Date in each group, giving the following output:
+-------+----+------+--------+
| Value | ID | Date | Latest |
+-------+----+------+--------+
|     1 |  5 | 2012 |        |
|     1 |  5 | 2013 | Latest |
|     0 | 12 | 2017 |        |
|     0 | 12 | 2022 |        |
|     1 | 27 | 2005 |        |
|     1 | 27 | 2011 | Latest |
+-------+----+------+--------+
The syntax of pandas is throwing me off as I am fairly new to Python.
In R I suppose I would try something like
df %>% filter(Value == 1) %>% group_by(ID) %>% filter(Date == max(Date))
but I am not sure of the syntax in pandas. I am trying to first select the subset of rows that meets the criterion "Value == 1" by using
q = df.query('Value == 1')
my_query_index = q.index
my_query_index
This returns the index of all the matching rows, but I am not sure how to incorporate this into the dataframe before grouping and filtering by max(Date).
All help appreciated. Thank you.
EDIT: I used the pinned answer as follows:
latest = df.query('Value==1').groupby("ID").max().assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
But I have since realized that some of the max Dates are tied, i.e. there could be 4 rows in a group, all with max year 2017. For the tiebreaker, I need to use the max ID 2 within groups. I have added .groupby("ID 2").max("ID 2") to the line of code, giving
latest = df.query('Value==1').groupby("ID").max("Year").groupby("ID 2").max("ID 2").assign(Latest = "Latest")
df = pd.merge(df,latest,how="outer")
df
but it is giving me a dataframe completely different than the one desired.
Thank you for your help, all is appreciated.
You can do this:
latest = df.query('Value==1').groupby("ID").max().assign(Latest = "Latest")
pd.merge(df,latest,how="outer")
Value ID Date Latest
0 1 5 2012 NaN
1 1 5 2013 Latest
2 0 12 2017 NaN
3 0 12 2022 NaN
4 1 27 2005 NaN
5 1 27 2011 Latest
Sort by 'ID' then by 'Date'
Use duplicated(keep='last') to identify the last item in each group
Use loc to assign in the right spot
df = df.sort_values(['ID', 'Date'])
mask1 = df.Value.eq(1)
mask2 = ~df.ID.duplicated(keep='last')
df.loc[mask1 & mask2, 'Latest'] = 'Latest'
df
Value ID Date Latest
0 1 5 2012 NaN
1 1 5 2013 Latest
2 0 12 2017 NaN
3 0 12 2022 NaN
4 1 27 2005 NaN
5 1 27 2011 Latest
One option is to groupby, using transform to get the max, then use a conditional statement with np.where to get the output:
import numpy as np

max_values = df.groupby("ID").Date.transform("max")
df.assign(Latest=np.where(df.Date.eq(max_values) & df.Value.eq(1), "Latest", ""))
Value ID Date Latest
0 1 5 2012
1 1 5 2013 Latest
2 0 12 2017
3 0 12 2022
4 1 27 2005
5 1 27 2011 Latest
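Since the question sketches an R pipeline, the same "mark the latest row per group" idea can also be written as a dplyr sketch for comparison (one assumption here: Value is constant within each ID, as in the sample data):
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Latest = if_else(Value == 1 & Date == max(Date), "Latest", "")) %>%
  ungroup()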

Add new column to dataframe, based on values at specific rows within that dataframe [duplicate]

This question already has answers here:
Matching up two vectors in R
(2 answers)
Closed 4 years ago.
Suppose I have a dataframe such as the below
people.dat <- data.frame("ID" = c(2001, 1001, 2005, 2001, 5000),
                         "Data" = c(100, 300, 500, 900, 200))
Which looks something like this
+------+------+
| ID | Data |
+------+------+
| 2001 | 100 |
| 1001 | 300 |
| 2005 | 500 |
| 2001 | 900 |
| 5000 | 200 |
+------+------+
Suppose the first thing I do is work out how many unique ID values are in the dataframe (this is necessary, due to the size of the real dataset in question)
unique_ids <- sort(c(unique(people.dat$ID)))
Which gives
[1] 1001 2001 2005 5000
Where I get stuck is that I would like to add a new column, say "new_id", which looks at the "ID" value in each row, finds its position in unique_ids, and assigns that positional value (so the "new_id" column consists of values ranging from 1 to length(unique_ids)).
An example of the output would be as follows
+------+------+--------+
| ID | Data | new_id |
+------+------+--------+
| 2001 | 100 | 2 |
| 1001 | 300 | 1 |
| 2005 | 500 | 3 |
| 2001 | 900 | 2 |
| 5000 | 200 | 4 |
+------+------+--------+
I thought about using a for loop with if statements, but my first attempts didn't quite hit the mark. Although, if I just wanted to replace "ID" with a sequential value, the following code would work (but where I get stuck is that I want to keep ID, but add another "new_id" column)
for (i in seq_along(unique_ids)) {
  people.dat$ID[people.dat$ID == unique_ids[i]] <- i
}
Thank you for any help. Hope I have made the question as clear as possible (although I struggled to phrase some of it, so please let me know if there is anything specific that needs clarifying)
This is more like a 'rank' problem
people.dat$rank = as.numeric(factor(people.dat$ID))
people.dat
ID Data rank
1 2001 100 2
2 1001 300 1
3 2005 500 3
4 2001 900 2
5 5000 200 4
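Since the question already builds a sorted unique_ids vector, match() returns exactly that positional index and avoids the loop entirely; a minimal sketch along those lines:
unique_ids <- sort(unique(people.dat$ID))
people.dat$new_id <- match(people.dat$ID, unique_ids)  # position of each ID in the sorted vector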

Data imputation for empty subsetted dataframes in R

I'm trying to build a function in R in which I can subset my raw dataframe according to some specifications, and thereafter convert this subsetted dataframe into a proportion table.
Unfortunately, some of these subsettings yield an empty dataframe, as for some particular specifications I do not have data; hence no proportion table can be calculated. So, what I would like to do is to take the closest time step for which I have a non-empty subsetted dataframe and use it as input for the empty one.
Here some insights to my dataframe and function:
My raw dataframe looks +/- as follows:
| year | quarter | area | time_comb | no_individuals | lenCls | age |
|------|---------|------|-----------|----------------|--------|-----|
| 2005 | 1 | 24 | 2005.1.24 | 8 | 380 | 3 |
| 2005 | 2 | 24 | 2005.2.24 | 4 | 490 | 2 |
| 2005 | 1 | 24 | 2005.1.24 | 3 | 460 | 6 |
| 2005 | 1 | 21 | 2005.1.21 | 25 | 400 | 2 |
| 2005 | 2 | 24 | 2005.2.24 | 1 | 680 | 6 |
| 2005 | 2 | 21 | 2005.2.21 | 2 | 620 | 5 |
| 2005 | 3 | 21 | 2005.3.21 | NA | NA | NA |
| 2005 | 1 | 21 | 2005.1.21 | 1 | 510 | 5 |
| 2005 | 1 | 24 | 2005.1.24 | 1 | 670 | 4 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 750 | 4 |
| 2006 | 4 | 24 | 2006.4.24 | 1 | 660 | 8 |
| 2006 | 2 | 24 | 2006.2.24 | 8 | 540 | 3 |
| 2006 | 2 | 24 | 2006.2.24 | 4 | 560 | 3 |
| 2006 | 1 | 22 | 2006.1.22 | 2 | 250 | 2 |
| 2006 | 3 | 22 | 2006.3.22 | 1 | 520 | 2 |
| 2006 | 2 | 24 | 2006.2.24 | 1 | 500 | 2 |
| 2006 | 2 | 22 | 2006.2.22 | NA | NA | NA |
| 2006 | 2 | 21 | 2006.2.21 | 3 | 480 | 2 |
| 2006 | 1 | 24 | 2006.1.24 | 1 | 640 | 5 |
| 2007 | 4 | 21 | 2007.4.21 | 2 | 620 | 3 |
| 2007 | 2 | 21 | 2007.2.21 | 1 | 430 | 3 |
| 2007 | 4 | 22 | 2007.4.22 | 14 | 410 | 2 |
| 2007 | 1 | 24 | 2007.1.24 | NA | NA | NA |
| 2007 | 2 | 24 | 2007.2.24 | NA | NA | NA |
| 2007 | 3 | 24 | 2007.3.24 | NA | NA | NA |
| 2007 | 4 | 24 | 2007.4.24 | NA | NA | NA |
| 2007 | 3 | 21 | 2007.3.21 | 1 | 560 | 4 |
| 2007 | 1 | 21 | 2007.1.21 | 7 | 300 | 3 |
| 2007 | 3 | 23 | 2007.3.23 | 1 | 640 | 5 |
Here year, quarter and area refer to a particular time (Year & Quarter) and area for which X no. of individuals were measured (no_individuals). For example, from the first row we get that in the first quarter of the year 2005, in area 24, I had 8 individuals belonging to a length class (lenCls) of 380 mm and age = 3. It is worth mentioning that for a particular year, quarter and area combination I can have different length classes and ages (thus, multiple rows)!
So what I want to do is basically to subset the raw dataframe for a particular year, quarter and area combination, and from that combination calculate a proportion table based on the number of individuals in each length class.
So far my basic function looks as follows:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # expand to one row per measured individual
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
From the dataset example above, one can notice that for year=2005 & quarter=3 & area=21, I do not have any measured individuals. Yet, for the same area AND year I have data for either quarter 1 or 2. The most reasonable assumption would be to take the subsetted dataframe from the closest time step (here, quarter 2 of the same area and year), and replace the NAs in the columns "no_individuals", "lenCls" and "age" accordingly.
Note also that for some cases I do not have data for a particular year! In the example above, one can see this by looking into area 24 from year 2007. In this case I can not borrow the information from the nearest quarter, and would need to borrow from the previous year instead. This would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
I have tried to include this in my function by specifying some extra rules, but due to my poor programming skills I didn't make any progress.
So, any help here will be very much appreciated.
Here is my LAK function, which I'm trying to update:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  # In case of an empty subset
  #if(is.data.frame(sALK) && nrow(sALK)==0){
  if(any(rowSums(is.na(sALK)) > 0)){
    warning("Empty subset combination; data will be subsetted based on the nearest timestep combination")
    # FIXME: include imputation rules here
  }
  dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
  raw <- t(table(dfexp$lenCls, dfexp$age))
  key <- round(prop.table(raw, margin=1), 3)
  if(alkplot==TRUE){
    alkPlot(key, "area", xlab="Age")
  }
  return(key)
}
So, I finally came up with a partial solution to my problem and will include my function here in case it is of interest to someone:
LAK <- function(df, Year="2005", Quarter="1", Area="22", alkplot=TRUE){
  require(FSA)
  # subset alk by year, quarter and area
  sALK <- subset(df, year==Year & quarter==Quarter & area==Area)
  print(sALK)
  if(nrow(sALK)==1){
    warning("Empty subset combination; data has been subsetted to the nearest input combination")
    syear <- unique(as.numeric(as.character(sALK$year)))
    sarea <- unique(as.numeric(as.character(sALK$area)))
    # all data for the same year and area, regardless of quarter
    sALK2 <- subset(df, year==syear & area==sarea)
    vals <- as.data.frame(table(sALK2$time_comb))
    colnames(vals)[1] <- "time_comb"
    # keep the time combination that actually holds measurements
    idx <- which(vals$Freq > 1)
    quarterId <- as.character(vals[idx, "time_comb"])
    imput <- subset(df, year==syear & area==sarea & time_comb==quarterId)
    dfexp2 <- imput[rep(seq(nrow(imput)), imput$no_individuals), 1:ncol(imput)]
    raw2 <- t(table(dfexp2$lenCls, dfexp2$age))
    key2 <- round(prop.table(raw2, margin=1), 3)
    print(key2)
    if(alkplot==TRUE){
      alkPlot(key2, "area", xlab="Age")
    }
  } else {
    dfexp <- sALK[rep(seq(nrow(sALK)), sALK$no_individuals), 1:ncol(sALK)]
    raw <- t(table(dfexp$lenCls, dfexp$age))
    key <- round(prop.table(raw, margin=1), 3)
    print(key)
    if(alkplot==TRUE){
      alkPlot(key, "area", xlab="Age")
    }
  }
}
This solves my problem when I have data for at least one quarter of a particular Year & Area combination. Yet, I'm still struggling to figure out how to deal when I do not have data for a particular Year & Area combination. In this case I need to borrow data from the closest Year that contains data for all the quarters for the same area.
For the example exposed above, this would mean that for year=2007 & area=24 & quarter=1 I would borrow the information from year=2006 & area=24 & quarter 1, and so on and so forth.
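One possible rule for that year-level fallback, sketched under the assumptions that syear and sarea are defined as in the function above and that no_individuals is NA whenever a combination is missing: pick the closest year that has measurements for the same area, then rerun the quarter logic on it.
# years that actually have measurements for the target area
candidates <- unique(df$year[df$area == sarea & !is.na(df$no_individuals)])
candidates <- as.numeric(as.character(candidates))   # in case year is a factor
# the closest such year to the requested one
nearestYear <- candidates[which.min(abs(candidates - as.numeric(as.character(syear))))]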
I don't know if you have ever encountered MICE, but it is a pretty cool and comprehensive tool for variable imputation. It also allows you to see how the imputed data is distributed so that you can choose the method most suited for your problem. Check this brief explanation and the original package description
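For the record, a minimal sketch of what a mice run looks like (assuming the raw dataframe is df and the default per-column imputation methods are acceptable):
library(mice)
imp <- mice(df, m = 5, seed = 1)   # build 5 imputed versions of the data
completed <- complete(imp, 1)      # extract the first completed dataset
densityplot(imp)                   # inspect how the imputed values are distributed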

Keeping all the observations from a subset

I'm trying to create two different subsets from a big data frame. Here's the code:
amelioration <- subset(resilience,
(subset(resilience, resilience$YEAR==2010)$SCORE>
subset(resilience, resilience$YEAR==2009)$SCORE))
The purpose is to select the observations which have a better score in 2010 than in 2009.
The problem is that in the new table (amelioration), the lines for 2009 and 2010 are missing, but I want to keep them.
It should look like this:
+-----+------+-------+
| ID  | YEAR | SCORE |
+-----+------+-------+
| 177 | 2008 |    14 |
| 177 | 2009 |    11 |
| 177 | 2010 |    17 |
| 177 | 2011 |    16 |
+-----+------+-------+
But it looks like this instead:
+-----+------+-------+
| ID  | YEAR | SCORE |
+-----+------+-------+
| 177 | 2008 |    14 |
| 177 | 2011 |    16 |
+-----+------+-------+
I tried with the which() command but it doesn't work either. What should I do?
Using data.table you can recast the YEAR variable over SCORE, and then filter directly on the year columns:
library(data.table)
resilience <- data.table(resilience)
castedDF <- dcast(resilience, ID ~ YEAR, value.var = "SCORE", fun.aggregate = sum)
targetRows <- castedDF[`2010` > `2009`]
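Note that targetRows only identifies the qualifying IDs; to keep all of their yearly observations (2008-2011 in the example), filter the original table by those IDs. A sketch continuing from the code above:
amelioration <- resilience[ID %in% targetRows$ID]  # keeps every year for the qualifying IDs
amelioration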

Calculating percentage changes in data frame subsets using a base year

I have the following dataset with information on the sales figures for two companies over a period of five years. I want to use the first year's figure as a baseline and calculate the percentage change in sales for each subsequent year for each company. I use the following:
transform(dataset, SalesD = unlist(aggregate(Sales ~ Company, data = dataset,
    FUN = function(x) ((x - x[1]) / x[1]) * 100)$Sales))
yet I do not get the correct values for the second company (I expect the value at row 6 to be zero, as this is the base year). Here are the results:
+----+---------+------+--------+--------+
| ID | Company | Year | Sales | SalesD |
+----+---------+------+--------+--------+
| 1 | LSL | 2015 | 100000 | 0 |
| 2 | LSL | 2016 | 120000 | 20 |
| 3 | LSL | 2017 | 150000 | 50 |
| 4 | LSL | 2018 | 100000 | 0 |
| 5 | LSL | 2019 | 50000 | -50 |
| 6 | IDA | 2015 | 150000 | 50 |
| 7 | IDA | 2016 | 180000 | 80 |
| 8 | IDA | 2017 | 200000 | 100 |
| 9 | IDA | 2018 | 180000 | 80 |
| 10 | IDA | 2019 | 160000 | 60 |
+----+---------+------+--------+--------+
Could you help me point out what is wrong in the code?
Many thanks!
We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)); then, grouped by "Company", we get the percentage change by subtracting from "Sales" the value of "Sales" that corresponds to the minimum "Year", dividing by that base value, multiplying by 100, rounding, and assigning (:=) the result to create "SalesD".
library(data.table)
setDT(df1)[, SalesD := round(100 * (Sales - Sales[which.min(Year)]) /
                             Sales[which.min(Year)]), Company]
df1
# ID Company Year Sales SalesD
# 1: 1 LSL 2015 100000 0
# 2: 2 LSL 2016 120000 20
# 3: 3 LSL 2017 150000 50
# 4: 4 LSL 2018 100000 0
# 5: 5 LSL 2019 50000 -50
# 6: 6 IDA 2015 150000 0
# 7: 7 IDA 2016 180000 20
# 8: 8 IDA 2017 200000 33
# 9: 9 IDA 2018 180000 20
#10: 10 IDA 2019 160000 7
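For comparison, the same baseline logic as a dplyr sketch (an alternative, not part of the answer above; it likewise takes the Sales value at the minimum Year per Company as the base):
library(dplyr)
df1 %>%
  group_by(Company) %>%
  mutate(SalesD = round(100 * (Sales - Sales[which.min(Year)]) /
                          Sales[which.min(Year)])) %>%
  ungroup()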
