R - replace zero values by average of non-zero ones for fixed categories

I am given a dataset of the following form
year<-rep(c(1990:1999),each=10)
age<-rep(50:59, 10)
cat1<-rep(c("A","B","C","D","E"),each=100)
value<-rnorm(10*10*5)
value[c(3,51,100,340,441)]<-0
df<-data.frame(year,age,cat1,value)
year age cat1 value
1 1990 50 A -0.7941799
2 1990 51 A 0.1592270
3 1990 52 A 0.0000000
4 1990 53 A 1.9222384
5 1990 54 A 0.3922259
6 1990 55 A -1.2671957
I would now like to replace any zeroes in the "value" column by the average, taken over the column "cat1", of the non-zero entries of "value" for the corresponding year and age. For example, for year 1990 and age 52 the entry for cat1=A is zero; it should be replaced by the average of the non-zero entries of the remaining categories for this specific year and age.
As we have
df[df$year==1990 & df$age==52,]
year age cat1 value
3 1990 52 A 0.0000000
103 1990 52 B -1.1325446
203 1990 52 C -1.6136773
303 1990 52 D 0.5724360
403 1990 52 E 0.2795241
we would replace the entry 0 by
sum(df[df$year==1990 & df$age==52,4])/4
[1] -0.4735654
Is there a nice and clean way to do this generally?

library(data.table)
setDT(df)[value==0, value := NA,]
df[, value := replace(value, is.na(value), mean(value, na.rm = TRUE)) , by = .(year, age)]

Nearly all table operations can be decomposed into a few basic, fast and well-optimized steps: split, aggregate (for numeric data: sum, product, etc.), filter, sort and join.
Here left_join from dplyr is the way to go.
Create a second data frame with the zeroes filtered out and value averaged over the proper grouping (year and age). Join it back to the original, then replace the zeroes with the values from the new joined column.
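A minimal sketch of that filter, aggregate and join idea, assuming the df from the question. The seed and the helper names avg, avg_value and result are made up for illustration:

```r
library(dplyr)

# Rebuild the example data from the question (the seed is an assumption,
# added only so the sketch is reproducible)
set.seed(1)
df <- data.frame(
  year  = rep(1990:1999, each = 10),
  age   = rep(50:59, 10),
  cat1  = rep(c("A", "B", "C", "D", "E"), each = 100),
  value = rnorm(10 * 10 * 5)
)
df$value[c(3, 51, 100, 340, 441)] <- 0

# Step 1: drop the zeroes and average what is left per year and age
avg <- df %>%
  filter(value != 0) %>%
  group_by(year, age) %>%
  summarise(avg_value = mean(value), .groups = "drop")

# Step 2: join the averages back and use them wherever value is zero
result <- df %>%
  left_join(avg, by = c("year", "age")) %>%
  mutate(value = if_else(value == 0, avg_value, value)) %>%
  select(-avg_value)

result[result$year == 1990 & result$age == 52, ]
```

If every entry for some year/age combination were zero, that group would have no row in avg and the join would leave NA there, so that case needs its own handling.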

Related

Calculating rolling rates and excluding null rows

I have a dataset with ~40 variables, with one row for each combination of the 25 areas and the quarters; we have data from 2019 Q1 to today, 2022 Q2. For each quarter I am creating a rate (variable/population*10000) to allow comparison. However, we want each quarter's rate to be based on the preceding 4 quarters, i.e. the 2022 Q2 rate will be the sum of the variable for 2022 Q2, 2022 Q1, 2021 Q4 and 2021 Q3. I can calculate this for all the relevant columns using the code below:
full_data_rates_pop %>%
group_by(Area) %>%
summarise(across(4:21, ~(sum(., na.rm = T))/(mean(Population_17.24))*10000)) %>%
bind_rows(full_data_rates_pop) %>%
arrange(Area,-Quarter)%>% # sort so that total rows come after the quarters
mutate(Timeframe=if_else(is.na(Quarter),timeframe_value, 'Quarterly'))
This does the job for my areas. However, I also want to create regional rates for each time period. Originally I just summed the variable and the population over all the areas and created the rates in the same way, but I have realised that data is missing for some areas/time periods, so the current method produces inaccurate results. For each column, I want to be able to exclude any rows that are Null.
Area Quarter Metric_1 Metric_2 Population
A 2022.2 45 89 12000
A 2022.1 58 23 12000
A 2021.4 NULL 64 11000
A 2021.3 20 76 11000
B 2022.2 56 101 9700
B 2022.1 32 78 9700
B 2021.4 41 NULL 10100
B 2021.3 38 NULL 10100
This is a mini dummy version of my data with just the latest 4 quarters. I want the new row to be calculated so that each rate uses the sum of all values and the sum of population, excluding any rows where the metric value was null:
Area Quarter Metric_1_rate Metric_2_rate
ALL 2022.2 38.87 75.08
Is there a way to exclude the rows that have a null value for a given column, while still using those rows for the other columns where the value is not null?
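One possible sketch, assuming the NULLs are stored as NA and that the intended formula is "sum of the non-missing values divided by the sum of Population over those same rows, times 10000". The names full_data and regional are made up for illustration. On the dummy data this reproduces the 38.87 for Metric_1; for Metric_2 it gives about 65.90 rather than the 75.08 shown above, so the formula may not be exactly what was intended there.

```r
library(dplyr)

# Dummy data from the question, with NULL represented as NA
full_data <- data.frame(
  Area       = rep(c("A", "B"), each = 4),
  Quarter    = rep(c(2022.2, 2022.1, 2021.4, 2021.3), 2),
  Metric_1   = c(45, 58, NA, 20, 56, 32, 41, 38),
  Metric_2   = c(89, 23, 64, 76, 101, 78, NA, NA),
  Population = c(12000, 12000, 11000, 11000, 9700, 9700, 10100, 10100)
)

# For each metric separately: sum the non-missing values and divide by the
# population summed over only the rows where that metric is present
regional <- full_data %>%
  summarise(across(
    starts_with("Metric"),
    ~ sum(.x, na.rm = TRUE) / sum(Population[!is.na(.x)]) * 10000,
    .names = "{.col}_rate"
  )) %>%
  mutate(Area = "ALL", Quarter = max(full_data$Quarter), .before = 1)

regional
```

Because the NA mask is built inside the across() lambda, a row dropped for Metric_1 still contributes to Metric_2, and vice versa.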

Calculation of rolling standard deviation by group

I have a long dataset in the following format:
Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
: : :
: : :
2019-06-30 Australia 57
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10
: : :
: : :
2019-06-30 Austria 21
I want to calculate a 90-day period rolling standard deviation of the Score for each country. I have tried using the rollapply function (Package:zoo) and roll_sd (Package:RcppRoll) but they are not working for groupwise standard deviation. Can anyone please suggest a possible way to calculate the rolling standard deviation.
Thanks!
In general grouping is done separately from the base operation in R so it is not that those functions can't be used for grouped data. It is just that you need to embed them within a grouping operation. Here we use ave to do the grouping and rollapplyr to perform the rolling sd.
Now, at each point can we assume that the last 90 days are the last 90 rows? Assuming yes and taking rolling standard deviations of 2 so that we can use the selected rows of the posted data shown reproducibly in the Note at the end:
library(zoo)
roll <- function(x) rollapplyr(x, 2, sd, fill = NA)
transform(DF, roll = ave(Score, Country, FUN = roll))
giving:
Date Country Score roll
1 1995-01-01 Australia 100 NA
2 1995-01-02 Australia 99 0.7071068
3 1995-01-03 Australia 85 9.8994949
4 1995-01-01 Austria 67 NA
5 1995-01-02 Austria 12 38.8908730
6 1995-01-03 Austria 10 1.4142136
Wide form approach
Another approach is to convert the data to wide form and then perform the rolling operation:
library(zoo)
z <- read.zoo(DF, split = "Country")
zr <- rollapplyr(z, 2, sd, fill = NA)
zr
giving this zoo series:
Australia Austria
1995-01-01 NA NA
1995-01-02 0.7071068 38.890873
1995-01-03 9.8994949 1.414214
You can then just leave it as a zoo series in order to take advantage of the other time series functions in that package or can convert it back to a data frame using fortify.zoo(zr) or fortify.zoo(zr, melt = TRUE, names = names(DF)) depending on what you need.
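A short sketch of both conversions on the sample input. The names wide and long, and the column name roll, are made up for illustration:

```r
library(zoo)

# Same sample input as in the Note, built directly as a data frame
DF <- data.frame(
  Date    = as.Date(rep(c("1995-01-01", "1995-01-02", "1995-01-03"), 2)),
  Country = rep(c("Australia", "Austria"), each = 3),
  Score   = c(100, 99, 85, 67, 12, 10)
)

z  <- read.zoo(DF, split = "Country")     # one column per country
zr <- rollapplyr(z, 2, sd, fill = NA)     # rolling sd down each column

# Wide data frame: an index column plus one column per country
wide <- fortify.zoo(zr)

# Long data frame: one row per date/country pair, with custom column names
long <- fortify.zoo(zr, melt = TRUE, names = c("Date", "Country", "roll"))
```

The long form has the same layout as the original DF, so it can be merged back onto it by Date and Country if needed.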
Note
The input used in reproducible form.
Lines <- "Date Country Score
1995-01-01 Australia 100
1995-01-02 Australia 99
1995-01-03 Australia 85
1995-01-01 Austria 67
1995-01-02 Austria 12
1995-01-03 Austria 10"
DF <- read.table(text = Lines, header = TRUE)
DF$Date <- as.Date(DF$Date)

How to find the max value across multiple columns and return the value and the date of the max value by group in R

I have a df called "Ak_total" with 3819 observations and 93 variables.
I would like to calculate the max value of each column in 6:93 by group (Ak_total$Year).
The problem is that I would like to get not only the max value of each column for each year (which is simple) but also the day/date (Ak_total$Date) on which the max value occurs.
Example:
Year Date BetulaMAX
1998 1998-05-26 42
1999 1999-06-07 32
2000 2000-06-04 173
2001 2001-06-03 113
2002 2002-06-05 65
Year Date GrassMax
1998 1998-08-27 260
1999 1999-08-19 215
2000 2000-08-02 173
2001 2001-08-23 76
2002 2002-08-22 193
I did
# max value (peak date)
max_all <- function(x) if(length(x))x==max(x)
Ak_max_date_Betula <- subset(Ak_total,!!ave(Betula, Year, FUN=max_all))
But this gives me the max and the date for only one column (Betula).
Is there a way to do that for all columns at once?
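One way this could be sketched in base R is to loop over the columns and, within each year, pick the row holding the maximum. The toy data below is a made-up stand-in for Ak_total, and taxa, peak_by_year and peaks are hypothetical names. The sketch assumes every year has at least one non-missing value in each column, and that the first date is wanted when there are ties.

```r
# Toy stand-in for Ak_total (values are invented; only the layout matters)
Ak_total <- data.frame(
  Year   = c(1998, 1998, 1998, 1999, 1999, 1999),
  Date   = as.Date(c("1998-05-25", "1998-05-26", "1998-08-27",
                     "1999-06-07", "1999-08-19", "1999-08-20")),
  Betula = c(10, 42, 0, 32, 1, 0),
  Grass  = c(0, 3, 260, 2, 215, 100)
)

# Columns to scan; for the real data this would be names(Ak_total)[6:93]
taxa <- c("Betula", "Grass")

# For one column: within each year, keep the row that holds the maximum
peak_by_year <- function(col) {
  pieces <- lapply(split(Ak_total, Ak_total$Year), function(d) {
    i <- which.max(d[[col]])   # first position of the max; NAs are skipped
    data.frame(Year = d$Year[i], Date = d$Date[i],
               Taxon = col, Max = d[[col]][i])
  })
  do.call(rbind, pieces)
}

# Stack the per-column results into one long table
peaks <- do.call(rbind, lapply(taxa, peak_by_year))
rownames(peaks) <- NULL
peaks
```

The result is in long form (one row per year and column), which is easy to filter by Taxon to get tables like the Betula and Grass examples above.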

I need to find the mean for the data with cells without values

I need to find the average price for each week, and I need to make a ggplot to show how the price changes during the year.
When you compute the mean, how do the empty cells affect it?
I have tried several things, including using the melt() function so that I only have 3 variables. The values I want to average are stored as factors.
Company variable value
ns Price week 24 1749
ns Price week 24
ns Price week 24 1599
ns Price week 24
ns Price week 24
ns Price week 24 359
ns Price week 24 460
I have more than 300K observations and would love to have a small data.frame with only the Company and the mean price for each week. Right now I have all observations for each week, and I need the means to use with ggplot.
When I use following code
dat %in% mutate(means=mean(value), na.rm=TRUE)
I got a warning message saying the argument is not numeric or logical: returning NA.
I am looking forward to getting your help!
Clean code from PavoDive's comment
dt[!is.na(value), mean(value), by = .(Price, Week)]
and even better
dt[ , mean(value, na.rm = TRUE), by = .(Price, Week)]
Original:
This works using data.table. The first part filters out rows that don't have a number in value. The next part says we want the average of the value column. Finally, by defines how to group the rows.
Code:
dt[value >0 | value<1, .(MeanValues = mean(`value`)), by = c("Price", "Week")][]
Input:
dt <- data.table(`Price` = c("A","B","B","A","A","B","B","A"),
`Week`= c(1,2,1,1,2,2,1,2),
`value` = c(3,7,2,NA,1,46,1,NA))
Price Week value
1: A 1 3
2: B 2 7
3: B 1 2
4: A 1 NA
5: A 2 1
6: B 2 46
7: B 1 1
8: A 2 NA
Output:
Price Week MeanValues
1: A 1 3.0
2: B 2 26.5
3: B 1 1.5
4: A 2 1.0
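The warning in the question ("argument is not numeric or logical") is what mean() gives when it is handed a factor, so the column has to be converted to numeric before any of the above will work. A minimal sketch, assuming the empty cells are empty strings inside a factor column. The data below copies the sample rows from the question, and means and mean_value are made-up names:

```r
library(data.table)

# Sample rows from the question, with value stored as a factor
dat <- data.table(
  Company  = "ns",
  variable = "Price week 24",
  value    = factor(c("1749", "", "1599", "", "", "359", "460"))
)

# Go through as.character() first; as.numeric() on a factor alone would
# return the internal level codes. Empty strings become NA.
dat[, value := as.numeric(as.character(value))]

# Mean per company and week, ignoring the empty (NA) cells
means <- dat[, .(mean_value = mean(value, na.rm = TRUE)),
             by = .(Company, variable)]
means
```

With na.rm = TRUE the empty cells are dropped from both the sum and the count, so they do not pull the mean towards zero.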

R: subsetting all observations of individuals that have one matched observation

Sorry for another dang subsetting question; I just can't find this case described, though it must be common. Boiled-down data looks like this:
Plot Year BA
A 1980 44
A 1990 54
A 2000 66
B 1980 58
B 1990 69
B 2000 80
I want all observations for any plot with BA < 50 in 1980 -- in the above, all three A rows. I understand subset(Df, BA<50 & Year==1980) but can't figure out the next level of indexing.
Also if anyone has a better way to phrase the title I'll change it. Every way I could think of to search on only turned up the &/| questions. (So many &/| questions...)
Index your condition on Plot, checking membership with %in% in case there is more than one Plot satisfying the condition in the real data.
subset(df, Plot %in% unique(Plot[BA < 50 & Year == 1980]))
# Plot Year BA
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
Or with standard evaluation [ subsetting,
df[with(df, Plot %in% unique(Plot[BA < 50 & Year == 1980])), ]
# Plot Year BA
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
Another option with dplyr, this assumes there is only one record equal to 1980 for each plot, otherwise you may want to wrap the condition with all() or any() depending on your desired logic:
library(dplyr)
df %>% group_by(Plot) %>% filter(BA[Year == 1980] < 50)
# Source: local data frame [3 x 3]
# Groups: Plot [1]
# Plot Year BA
# <fctr> <int> <int>
# 1 A 1980 44
# 2 A 1990 54
# 3 A 2000 66
In cases where multiple 1980 records exist for some plots, the logic in @DirtySockSniffer's answer is equivalent to df %>% group_by(Plot) %>% filter(any(BA[Year == 1980] < 50)) in dplyr.
We can use data.table
library(data.table)
setDT(df1)[, if(all(BA[Year == 1980] < 50)) .SD, by = Plot]
# Plot Year BA
#1: A 1980 44
#2: A 1990 54
#3: A 2000 66
