R: Create a column of averages based upon groups of four rows

> head(df)
   person week target actual drop_out organization agency
1:    QJ1    1     30     19     TRUE           BB    LLC
2:    GJ2    1     30     18    FALSE           BB    LLC
3:    LJ3    1     30     22     TRUE           CC    BBR
4:    MJ4    1     30     24    FALSE           CC    BBR
5:    PJ5    1     35     55    FALSE           AA    FUN
6:    EJ6    1     35     50    FALSE           AA    FUN
There are roughly 30 weeks in the dataset, with each person ID repeating every week.
I want to look at each person's values FOUR weeks at a time (so weeks 1-4, 5-8, 9-12, and so on). For each of these chunks, I want to add up all the "actual" values and divide that by the sum of the "target" values. Then we could put that value in a column called "monthly percent."
As per Shape's recommendation I've created a month column like so
fullReshapedDT$month <- with(fullReshapedDT, ceiling(week/4))
Trying to figure out how to iterate over the month column and calculate averages now. Trying something like this, but it obviously doesn't work:
fullReshapedDT[,.(monthly_attendance = actual/target,by=.(person_id, month)]

Have you tried creating a group variable? It will allow you to group operations by the four-week period:
setDT(df1)[, grps := ceiling(week/4)            # create 4-week groups
  ][, sum(actual)/sum(target), .(person, grps)  # grouped operation
  ][, grps := NULL][]                           # remove the helper column
# person V1
# 1: QJ1 1.1076923
# 2: GJ2 1.1128205
# 3: LJ3 0.9948718
# 4: MJ4 0.6333333
# 5: PJ5 1.2410256
# 6: EJ6 1.0263158
# 7: QJ1 1.2108108
# 8: GJ2 0.6378378
# 9: LJ3 0.9891892
# 10: MJ4 0.8564103
# 11: PJ5 1.1729730
# 12: EJ6 0.8666667
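If you want the result column to keep the "monthly percent" name from the question rather than the default V1, a minimal variant of the same idea (assuming df1 is the questioner's data) is to name the expression inside .( ) in j:
library(data.table)
setDT(df1)[, grps := ceiling(week/4)
  ][, .(monthly_percent = sum(actual)/sum(target)), by = .(person, grps)]
Only the naming changes; the grouping logic is identical to the answer above.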


I need to find the mean for the data with cells without values

I need to find the average price for each of the different weeks. I need to make a ggplot to show how the price develops during the year.
When you find the mean, how do the empty cells affect it?
I have tried several things, including using the melt() function, so that I only have 3 variables. The variables are factors, and I want to find the mean of the value column.
Company  variable       value
ns       Price week 24   1749
ns       Price week 24
ns       Price week 24   1599
ns       Price week 24
ns       Price week 24
ns       Price week 24    359
ns       Price week 24    460
I have more than 300K observations and would like to end up with a small data.frame that has only the Company and the mean price for the different weeks. Right now I have every observation for every week, and I need the weekly means in order to use ggplot.
When I use the following code
dat %in% mutate(means=mean(value), na.rm=TRUE)
I get a warning message saying "argument is not numeric or logical: returning NA".
I am looking forward to getting your help!
Clean code from PavoDive's comment
dt[!is.na(value), mean(value), by = .(Price, Week)]
and even better
dt[ , mean(value, na.rm = TRUE), by = .(Price, Week)]
Original:
This works using data.table. The first part filters out rows that don't have a number in value (comparing NA with a number yields NA, so those rows are dropped). Next we say we want the average of the value column. Finally, the by defines how to group the rows.
Code:
dt[value >0 | value<1, .(MeanValues = mean(`value`)), by = c("Price", "Week")][]
Input:
dt <- data.table(`Price` = c("A","B","B","A","A","B","B","A"),
`Week`= c(1,2,1,1,2,2,1,2),
`value` = c(3,7,2,NA,1,46,1,NA))
Price Week value
1: A 1 3
2: B 2 7
3: B 1 2
4: A 1 NA
5: A 2 1
6: B 2 46
7: B 1 1
8: A 2 NA
Output:
   Price Week MeanValues
1:     A    1        3.0
2:     B    2       26.5
3:     B    1        1.5
4:     A    2        1.0
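Since the original attempt used dplyr's mutate(), here is a minimal dplyr sketch of the same aggregation, assuming dat has the Company, variable and value columns shown in the question:
library(dplyr)
# mean price per company and week variable, ignoring empty cells (NAs)
dat %>%
  group_by(Company, variable) %>%
  summarise(means = mean(value, na.rm = TRUE))
Note that na.rm = TRUE has to go inside mean(), and that the pipe operator is %>%, not %in%.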

data table lapply and additional columns in output

I am just hoping there is a more convenient way. Imagine I would like to run a model with different transformations of some of the columns, e.g. winsorizing. I would like to provide the model with the transformed data set plus some additional columns that do not need to be transformed. Is there a practical way to do this in one line? I do not want to replace the data using := because I am planning to run the model with different specifications of the transformation.
dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
sel.col<-c("x","y")
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
I would need to call data.table again to merge the original dt with the transformed data and pay attention to the order.
data.table(dt[,.(id,Country),by=factor],
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor])
I was hoping that I could include the additional columns with the lapply call
dt[,.(lapply(.SD,Winsorize), id, Country),.SDcols=sel.col,by=factor]
Are there any other solutions?
Do you just need?
dt[, c(lapply(.SD,Winsorize), list(id = id, Country = Country)), .SDcols=sel.col,by=factor]
Unfortunately this method gets slow with big data. Apparently this was optimised in some recent update, but it is still very slow.
There is no need to merge; you can assign the columns after the lapply call:
> library(DescTools)
> library(data.table)
> dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
> sel.col<-c("x","y")
> dt
id Country x y factor
1: 1 Germany 13.116248 -0.4609152 B
2: 2 Germany -6.623404 -3.7048052 A
3: 3 USA -18.027532 22.2946805 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -12.585897 0.8255081 B
6: 6 Germany -8.816252 -12.1218135 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 6.3262951 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 15.857069 8.6422997 A
> # Notice an assignment `(sel.col) :=` here:
> dt[,(sel.col) := lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
> dt
id Country x y factor
1: 1 Germany 11.129140 -0.4609152 B
2: 2 Germany -6.623404 -1.7234191 A
3: 3 USA -17.097573 19.5642043 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -11.831968 0.8255081 B
6: 6 Germany -8.816252 -12.0116571 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 5.2261377 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 11.581528 8.6422997 A
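If the original columns must stay untouched (the question explicitly wants to avoid overwriting with := so the model can be rerun with other transformation specs), one option is to apply the same assignment to a copy. A minimal sketch, assuming the same dt, sel.col and DescTools::Winsorize as above (dt_wins is just an illustrative name):
library(data.table)
library(DescTools)
dt_wins <- copy(dt)   # leave dt unchanged
dt_wins[, (sel.col) := lapply(.SD, Winsorize), .SDcols = sel.col, by = factor]
dt_wins then holds the winsorized x and y next to the untouched id and Country columns, so no merging or reordering is needed.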

R identifying and aggregating balances with no history

I have a group of accounts with balances over 4 months. I want to sum the balances that have just appeared in that particular month. This is what I have gotten so far.
One account originates (is new) each month.
Accounts <- c('A','B','C','A','B','C','A','B','C')
Dates <- as.Date(c('2016-01-31', '2016-01-31','2016-01-31','2016-02-28','2016-02-28','2016-02-28','2016-03-31','2016-03-31','2016-03-31'))
Balances <- c(100,NA,NA,90,50,NA,80,40,120)
Origination <- data.frame(Dates,Accounts,Balances)
library(reshape2)
Origination <- dcast(Origination,Dates ~ Accounts, value.var = "Balances")
Origination$Originated <- apply(Origination[2:4],1,function(x) ifelse(sum(is.na(x))==nrow(Origination),NA,tail(na.omit(x),1)))
Origination <- melt(Origination, id = c("Dates"))
Origination <-dcast(Origination, variable ~ Dates, value.var = "value")
variable 2016-01-31 2016-02-28 2016-03-31
1 A 100 90 80
2 B NA 50 40
3 C NA NA 120
4 Originated 100 50 120
This creates an origination table with a row called Originated. In the first month we only have the 100; in the second month we have A amortized to 90 but also a new account at 50; and in the last month we have both the amortized A and B along with the new C at 120. The Originated row captures it exactly as I want.
But if I introduce another account, D, then in month 2 the code picks up just the last new amount (10) and not the sum of the two accounts that originated that month, i.e. 50 (B) plus 10 (C).
Accounts <- c('A','B','C','D','A','B','C','D','A','B','C','D')
Dates <- as.Date(c('2016-01-31', '2016-01-31','2016-01-31','2016-01-31','2016-02-28','2016-02-28','2016-02-28','2016-02-28','2016-03-31','2016-03-31','2016-03-31','2016-03-31'))
Balances <- c(100,NA,NA,NA,90,50,10,NA,80,40,5,120)
Origination <- data.frame(Dates,Accounts,Balances)
library(reshape2)
Origination <- dcast(Origination,Dates ~ Accounts, value.var = "Balances")
Origination$Originated <- apply(Origination[2:4],1,function(x) ifelse(sum(is.na(x))==nrow(Origination),NA,tail(na.omit(x),1)))
Origination <- melt(Origination, id = c("Dates"))
Origination <-dcast(Origination, variable ~ Dates, value.var = "value")
variable 2016-01-31 2016-02-28 2016-03-31
1 A 100 90 80
2 B NA 50 40
3 C NA 10 5
4 D NA NA 120
5 Originated 100 10 5
So the ask is: how do I sum the newly added accounts from A through D across dates? Perhaps I am overthinking it. The result I would like is this:
variable 2016-01-31 2016-02-28 2016-03-31
1 A 100 90 80
2 B NA 50 40
3 C NA 10 5
4 D NA NA 120
5 Originated 100 60 120
Help is much appreciated.
Aksel
I have finally found a way to get the output I want. Here is the answer for those who are interested.
# TRUE where the account already had a balance in the previous month (i.e. not newly originated)
sel <- rbind(FALSE, !is.na(head(Origination[-1], -1)))
#sel
# A B C D
#[1,] FALSE FALSE FALSE FALSE
#[2,] TRUE FALSE FALSE FALSE
#[3,] TRUE TRUE TRUE FALSE
# zero out the pre-existing balances, then sum the remaining (newly originated) balances per date
rowSums(replace(Origination[-1], sel, 0), na.rm=TRUE)
#[1] 100 60 120
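For what it's worth, here is a hedged data.table sketch of the same idea that works directly on the long data (the Dates/Accounts/Balances vectors as built before the dcast; orig_long is just an illustrative name): flag the first month each account shows a balance, then sum those balances per date.
library(data.table)
orig_long <- data.table(Dates, Accounts, Balances)
setorder(orig_long, Accounts, Dates)   # make sure rows are in date order within each account
# TRUE only in the first month an account has a non-NA balance
orig_long[, originated := !is.na(Balances) & cumsum(!is.na(Balances)) == 1, by = Accounts]
orig_long[originated == TRUE, .(Originated = sum(Balances)), by = Dates]
On the second example this gives 100, 60 and 120 for the three dates, matching the desired Originated row.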

Calculate rolling average of simulated data series with data.table

I am simulating a price time series in which each month consists of 20 working days and 12 months make up one year. I now would like to calculate the rolling average of this price, always based on the first day of the month.
I do have a working solution, but would like to know if there's a more elegant or faster one.
dt.oil.price
Period Month Day.Month Oil.Price Oil.Supply Risk.Free.Interest
1: 1 1 1 39.4560000 NA 0.08642857
2: 2 1 2 3.7889460 NA 0.08642857
3: 3 1 3 51.0748751 NA 0.08642857
4: 4 1 4 60.6282853 NA 0.08642857
5: 5 1 5 35.7267224 NA 0.08642857
6: 6 1 6 26.1868977 NA 0.08642857
7: 7 1 7 32.6488136 NA 0.08642857
8: 8 1 8 42.6397549 NA 0.08642857
9: 9 1 9 18.8969991 NA 0.08642857
...
20: 20 1 20 8.8036135 NA 0.08642857
21: 21 2 1 2.5559526 NA 0.08642857
22: 22 2 2 24.3996401 NA 0.08642857
...
40: 40 2 20 41.2988566 NA 0.08642857
41: 41 3 1 20.8012327 NA 0.08642857
42: 42 3 2 70.5297726 NA 0.08642857
Just to give you an idea of the structure of the data. To create the above data structure with 60 periods:
set.seed(1);
dt.oil.price <- as.data.table(cbind( Period = 1:60,
Month = as.integer(rep(1:(60/20), each = 20))[1:60],
Oil.Price=rnorm(3*20,mean = 50, sd = 10)))
dt.oil.price[,"Day.Month" := rank(Period),by="Month"]
With the following code I can then select all first days of a month and calculate the mean of the oil price for these days:
dt.oil.price[ Day.Month == 1, mean(Oil.Price)]
In the next step I use another helper column, "Num.Months", to rank the months from most recent to oldest:
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)]
and with this I can then select only the last two months for the average calculation by subsetting:
dt.oil.price[Day.Month == 1 & Period <= 8921,"Num.Months" := rank(-Period)][Num.Months <= 2, Oil.Price]
A code snippet that calculates the mean for the last three months without using an explicit helper column:
dt.oil.price[Day.Month == 1 & Period <= 60,
             {Num.Months = rank(-Period);
              list("Period" = Period, "Month" = Month, "Oil.Price" = Oil.Price, "Num.Months" = Num.Months)}
             ][Num.Months <= 12, mean(Oil.Price)]
I hope my steps are all clear and that it is clear what I would like to achieve. It is also possible to calculate the moving average dynamically, for example by defining a period and then calculating the moving average for the 12 months preceding that period. This can be achieved by subsetting the data.table to periods smaller than the defined period and then calculating "Num.Months" for this subset.
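One possibly more compact route, sketched under the assumption that dt.oil.price is built as above and that data.table's frollmean() (available in newer versions of the package) is acceptable: keep only the first-of-month prices and roll the mean over them. first_days and Roll.Avg are just illustrative names; change n for other window lengths, e.g. n = 12 for a 12-month window.
library(data.table)
first_days <- dt.oil.price[Day.Month == 1][order(Month)]   # one row per month
first_days[, Roll.Avg := frollmean(Oil.Price, n = 2)]      # mean of the last two first-of-month prices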

Assign rows to a group based on spatial neighborhood and temporal criteria in R

I have an issue that I just cannot seem to sort out. I have a dataset that was derived from a raster in arcgis. The dataset represents every fire occurrence during a 10-year period. Some raster cells had multiple fires within that time period (and, thus, will have multiple rows in my dataset) and some raster cells will not have had any fire (and, thus, will not be represented in my dataset). So, each row in the dataset has a column number (sequential integer) and a row number assigned to it that corresponds with the row and column ID from the raster. It also has the date of the fire.
I would like to assign a unique ID (fire_ID) to all of the fires that are within 4 days of each other and in adjacent pixels from one another (within the 8-cell neighborhood) and put this into a new column.
To clarify, if there were an observation from row 3, col 3, Jan 1, 2000 and another from row 2, col 4, Jan 4, 2000, those observations would be assigned the same fire_ID.
Below is a sample dataset with "rows", which are the row IDs of the raster, "cols", which are the column IDs of the raster, and "dates" which are the dates the fire was detected.
rows<-sample(seq(1,50,1),600, replace=TRUE)
cols<-sample(seq(1,50,1),600, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),600, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
I've tried sorting the data by "row", then "column", then "date" and looping through to create a new fire_ID when the row and column IDs are within one unit and the date is within 4 days, but this obviously doesn't work: fires that should be assigned the same fire_ID get different fire_IDs whenever observations belonging to a different fire_ID fall between them in the sorted list.
fire_df2<-fire_df[order(fire_df$rows, fire_df$cols, fire_df$date),]
fire_ID=numeric(length=nrow(fire_df2))
fire_ID[1]=1
for (i in 2:nrow(fire_df2)){
fire_ID[i]=ifelse(
fire_df2$rows[i]-fire_df2$rows[i-1]<=abs(1) & fire_df2$cols[i]-fire_df2$cols[i-1]<=abs(1) & fire_df2$date[i]-fire_df2$date[i-1]<=abs(4),
fire_ID[i-1],
i)
}
length(unique(fire_ID))
fire_df2$fire_ID<-fire_ID
Please let me know if you have any suggestions.
I think this task requires something along the lines of hierarchical clustering.
Note, however, that there will necessarily be some degree of arbitrariness in the IDs. This is because it is entirely possible that a cluster of fires spans more than 4 days in total, yet every fire is less than 4 days away from some other fire in that cluster (and thus should share the same ID).
library(dplyr)
# Create the distances
fire_dist <- fire_df %>%
# Normalize dates
mutate( norm_dates = as.numeric(dates)/4) %>%
# Only keep the three variables of interest
select( rows, cols, norm_dates ) %>%
# Compute distance using L-infinite-norm (maximum)
dist( method="maximum" )
# Do hierarchical clustering with "single" aggl method
fire_clust <- hclust(fire_dist, method="single")
# Cut the tree at height 1 and obtain groups
group_id <- cutree(fire_clust, h=1)
# First attach the group ids back to the data frame
fire_df2 <- cbind( fire_df, group_id ) %>%
# Then sort the data
arrange( group_id, dates, rows, cols )
# Print the first 10 records
fire_df2[1:10,]
(Make sure you have the dplyr library installed. You can run install.packages("dplyr", dep=TRUE) if it is not installed. It is a really good and very popular package for data manipulation.)
A couple of simple tests:
Test #1. The same forest fire moving.
rows<-1:6
cols<-1:6
dates<-seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
fire_df<-data.frame(rows, cols, dates)
gives me this:
rows cols dates group_id
1 1 1 2000-01-01 1
2 2 2 2000-01-02 1
3 3 3 2000-01-03 1
4 4 4 2000-01-04 1
5 5 5 2000-01-05 1
6 6 6 2000-01-06 1
Test #2. 6 different random forest fires.
set.seed(1234)
rows<-sample(seq(1,50,1),6, replace=TRUE)
cols<-sample(seq(1,50,1),6, replace=TRUE)
dates<-sample(seq(from=as.Date("2000/01/01"), to=as.Date("2000/02/01"), by="day"),6, replace=TRUE)
fire_df<-data.frame(rows, cols, dates)
output:
rows cols dates group_id
1 6 1 2000-01-10 1
2 32 12 2000-01-30 2
3 31 34 2000-01-10 3
4 32 26 2000-01-27 4
5 44 35 2000-01-10 5
6 33 28 2000-01-09 6
Test #3: one expanding forest fire
dates <- seq(from=as.Date("2000/01/01"), to=as.Date("2000/01/06"), by="day")
rows_start <- 50
cols_start <- 50
fire_df <- data.frame(dates = dates) %>%
rowwise() %>%
do({
diff = as.numeric(.$dates - as.Date("2000/01/01"))
expand.grid(rows=seq(rows_start-diff,rows_start+diff),
cols=seq(cols_start-diff,cols_start+diff),
dates=.$dates)
})
gives me:
rows cols dates group_id
1 50 50 2000-01-01 1
2 49 49 2000-01-02 1
3 49 50 2000-01-02 1
4 49 51 2000-01-02 1
5 50 49 2000-01-02 1
6 50 50 2000-01-02 1
7 50 51 2000-01-02 1
8 51 49 2000-01-02 1
9 51 50 2000-01-02 1
10 51 51 2000-01-02 1
and so on. (All records identified correctly to belong to the same forest fire.)
