How to find the max value across multiple columns by group in R and return both the value and the date of the max

I have a data frame called "Ak_total" with 3819 observations and 93 variables.
I would like to calculate the max value of each column in 6:93 by group (Ak_total$Year).
The problem is that I want not only the max value of each column for each year (which is simple) but also the day/date (Ak_total$Date) on which that max value occurs.
Example:
Year Date BetulaMAX
1998 1998-05-26 42
1999 1999-06-07 32
2000 2000-06-04 173
2001 2001-06-03 113
2002 2002-06-05 65
Year Date GrassMax
1998 1998-08-27 260
1999 1999-08-19 215
2000 2000-08-02 173
2001 2001-08-23 76
2002 2002-08-22 193
I tried:
# max value (peak date)
max_all <- function(x) if(length(x))x==max(x)
Ak_max_date_Betula <- subset(Ak_total,!!ave(Betula, Year, FUN=max_all))
But this gives me the max and the date for only one column (Betula).
Is there any way to do this for all columns at once?
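One possible approach, as a minimal base R sketch: loop over the columns of interest and, within each year, use which.max() to find the row holding the peak. The toy Ak_total below, the helper name peak_by_year and the output column names (Column, Max) are assumptions made for illustration; on the real data, cols would be names(Ak_total)[6:93].

```r
# Toy stand-in for Ak_total (values are hypothetical)
Ak_total <- data.frame(
  Year   = c(1998, 1998, 1998, 1999, 1999, 1999),
  Date   = as.Date(c("1998-05-26", "1998-06-01", "1998-08-27",
                     "1999-06-07", "1999-07-01", "1999-08-19")),
  Betula = c(42, 10, 0, 32, 5, 1),
  Grass  = c(3, 50, 260, 2, 40, 215)
)

# For every column in `cols`, return the yearly maximum and the date it occurred
peak_by_year <- function(df, cols, group = "Year", date = "Date") {
  per_column <- lapply(cols, function(col) {
    per_year <- lapply(split(df, df[[group]]), function(g) {
      i <- which.max(g[[col]])           # first maximum; NAs are ignored
      if (length(i) == 0) return(NULL)   # column is all NA in this year
      data.frame(Year = g[[group]][i], Column = col,
                 Date = g[[date]][i], Max = g[[col]][i])
    })
    do.call(rbind, per_year)
  })
  res <- do.call(rbind, per_column)
  rownames(res) <- NULL
  res
}

peaks <- peak_by_year(Ak_total, c("Betula", "Grass"))
print(peaks)
```

If a maximum is tied within a year, which.max() keeps only the first date; the subset()/ave() approach from the question keeps all tied rows.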

Related

Calculating rolling rates and excluding null rows

I have a dataset with ~40 variables and one row per area and quarter (25 areas); the data run from 2019 Q1 to today, 2022 Q2. For each quarter I am creating a rate (variable/population*10000) to allow comparison. However, we want each quarter's rate to be based on the latest 4 quarters, i.e. the 2022 Q2 rate will use the sum of the variable for 2022 Q2, 2022 Q1, 2021 Q4 and 2021 Q3. I can calculate this for all the relevant columns using the code below:
full_data_rates_pop %>%
  group_by(Area) %>%
  summarise(across(4:21, ~(sum(., na.rm = TRUE))/(mean(Population_17.24))*10000)) %>%
  bind_rows(full_data_rates_pop) %>%
  arrange(Area,-Quarter) %>% # sort so that total rows come after the quarters
  mutate(Timeframe=if_else(is.na(Quarter),timeframe_value, 'Quarterly'))
This does the job for my areas. However, I also want to create regional rates for each time period. Originally I just summed the variable and the population across all the areas and created the rates in the same way, but I have realised that data is missing for some areas/time periods, so the current method produces inaccurate results. For each column, I want to be able to exclude any rows where that column is Null.
Area  Quarter  Metric_1  Metric_2  Population
A     2022.2   45        89        12000
A     2022.1   58        23        12000
A     2021.4   NULL      64        11000
A     2021.3   20        76        11000
B     2022.2   56        101       9700
B     2022.1   32        78        9700
B     2021.4   41        NULL      10100
B     2021.3   38        NULL      10100
This is a mini dummy version of my data with just the latest 4 quarters. I want the new row to be calculated so that each value is the sum of the metric over all rows divided by the sum of the population, excluding any rows where that metric is null:
Area  Quarter  Metric_1_rate  Metric_2_rate
ALL   2022.2   38.87          75.08
Is there a way to filter out the rows that have a null value for a given column, while still keeping those rows for the other columns where the value is not null?
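One way to read the requirement, sketched in base R: for each metric, drop the rows where that metric is NA from both the numerator and the population denominator, then divide. The data frame below is the dummy table from the question with NULL entered as NA; the helper name rate_excluding_na is an assumption for illustration.

```r
# Dummy data from the question, with NULL represented as NA
full_data <- data.frame(
  Area       = rep(c("A", "B"), each = 4),
  Quarter    = rep(c(2022.2, 2022.1, 2021.4, 2021.3), times = 2),
  Metric_1   = c(45, 58, NA, 20, 56, 32, 41, 38),
  Metric_2   = c(89, 23, 64, 76, 101, 78, NA, NA),
  Population = c(12000, 12000, 11000, 11000, 9700, 9700, 10100, 10100)
)

# Rate per `per` people, using only the rows where the metric is present
rate_excluding_na <- function(x, pop, per = 10000) {
  keep <- !is.na(x)
  sum(x[keep]) / sum(pop[keep]) * per
}

metric_cols <- c("Metric_1", "Metric_2")
rates <- sapply(full_data[metric_cols], rate_excluding_na,
                pop = full_data$Population)

# One regional row, labelled with the latest quarter
regional <- cbind(
  data.frame(Area = "ALL", Quarter = max(full_data$Quarter)),
  as.data.frame(setNames(as.list(rates), paste0(metric_cols, "_rate")))
)
print(regional)
```

On the dummy data this gives 38.87 for Metric_1, matching the figure in the question. Metric_2 comes out as 65.90 rather than the 75.08 quoted, so that figure may have been computed with a different denominator.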

How to subset columns based on the value in another column in R

I'm looking to subset multiple columns based on a value (a year) that is given elsewhere in the data. For example, I have columns holding various measurements, and another column containing a year. My data looks something like this:
Individual  Age 2010  weight 2010  Age 2011  Weight 2011  Age 2012  Weight 2012  Age 2013  Weight 2013  Year
A           53        50           85        100          82        102          56        90           2013
B           22        NA           23        75           NA        68           25        60           2013
C           33        65           34        64           35        70           NA        75           2010
D           NA        70           28        NA           29        78           30        55           2012
E           NA        NA           64        90           NA        NA           NA        NA           2011
I want to create new columns holding the data that the 'Year' column points to. For example, taking the data for 'Individual' A from 2013, and for 'Individual' D from 2012.
My end goal is to have a table that looks like:
Individual  Age  Weight
A           56   90
B           25   60
C           33   65
D           29   78
E           64   90
Is there any way to select the yearly columns based on the year given in the final column?
I made a subset of your data and came up with the following (could be more elegant but this works):
library(dplyr)
library(stringr)
Individual<-c("A","B","C","D","E")
Age2010<-c(53,22,33,NA,NA)
`weight 2010`<-c(50,NA,65,70,NA)
Age2011<-c(85,23,34,28,64)
Weight2011<-c(100,75,64,NA,90)
df<-as.data.frame(cbind(Individual,Age2010,`weight 2010`,Age2011,Weight2011))
df[-1]<-lapply(df[-1],as.numeric) # cbind() coerced every column to character
colnames(df)<-str_replace_all(colnames(df)," ", "") # remove spaces
# create a dataframe for each year (prob could do this using `apply`)
df2010<-df %>% select(Individual, contains("2010")) %>% mutate(year=2010) %>% rename(weight=weight2010,age=Age2010)
df2011<-df %>% select(Individual, contains("2011")) %>% mutate(year=2011) %>% rename(weight=Weight2011,age=Age2011)
final<-bind_rows(df2010,df2011)
Of course, you can extend this for the remaining years in your dataset. You will then have a year variable to perform your analyses.
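The stacked table still needs the final step of picking each individual's chosen year. As an alternative, here is a minimal base R sketch that picks the values directly with matrix indexing, one (row, column) pair per individual. It assumes the columns have been renamed to a consistent Age_<year> / Weight_<year> pattern, which is not how they are named in the question; pick_by_year is a hypothetical helper.

```r
# Data from the question, with columns renamed to a consistent pattern (assumption)
df <- data.frame(
  Individual  = c("A", "B", "C", "D", "E"),
  Age_2010    = c(53, 22, 33, NA, NA), Weight_2010 = c(50, NA, 65, 70, NA),
  Age_2011    = c(85, 23, 34, 28, 64), Weight_2011 = c(100, 75, 64, NA, 90),
  Age_2012    = c(82, NA, 35, 29, NA), Weight_2012 = c(102, 68, 70, 78, NA),
  Age_2013    = c(56, 25, NA, 30, NA), Weight_2013 = c(90, 60, 75, 55, NA),
  Year        = c(2013, 2013, 2010, 2012, 2011)
)

# For each row, take the value from the "<prefix>_<Year>" column
pick_by_year <- function(df, prefix) {
  m <- as.matrix(df[startsWith(names(df), paste0(prefix, "_"))])
  wanted <- match(paste0(prefix, "_", df$Year), colnames(m))
  m[cbind(seq_len(nrow(df)), wanted)]   # matrix indexing: one cell per row
}

result <- data.frame(
  Individual = df$Individual,
  Age        = pick_by_year(df, "Age"),
  Weight     = pick_by_year(df, "Weight")
)
print(result)
```

A year with no matching column yields NA for that individual rather than an error.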

R - replace zero values by average of non-zero ones for fixed categories

I am given a dataset of the following form
year<-rep(c(1990:1999),each=10)
age<-rep(50:59, 10)
cat1<-rep(c("A","B","C","D","E"),each=100)
value<-rnorm(10*10*5)
value[c(3,51,100,340,441)]<-0
df<-data.frame(year,age,cat1,value)
year age cat1 value
1 1990 50 A -0.7941799
2 1990 51 A 0.1592270
3 1990 52 A 0.0000000
4 1990 53 A 1.9222384
5 1990 54 A 0.3922259
6 1990 55 A -1.2671957
I would now like to replace any zeroes in the "value" column by the average, taken across "cat1", of the non-zero entries of "value" for the corresponding year and age. For example, for year 1990 and age 52 the entry for cat1=A is zero; this should be replaced by the average of the non-zero entries of the remaining categories for this specific year and age.
As we have
df[df$year==1990 & df$age==52,]
year age cat1 value
3 1990 52 A 0.0000000
103 1990 52 B -1.1325446
203 1990 52 C -1.6136773
303 1990 52 D 0.5724360
403 1990 52 E 0.2795241
we would replace the entry 0 by
sum(df[df$year==1990 & df$age==52,4])/4
[1] -0.4735654
Is there a nice and clean way to do this in general?
library(data.table)
setDT(df)[value==0, value := NA,]
df[, value := replace(value, is.na(value), mean(value, na.rm = TRUE)) , by = .(year, age)]
Perhaps 99.9% of operations on tables can be decomposed into basic, fast and optimized primitives: split, concatenation (for numeric data: sum, multiplication, etc.), filter, sort and join.
Here left_join from dplyr is the way to go.
Just create another data frame with the zeroes filtered out and "value" aggregated over the proper grouping. Then join it back and replace the zeroes with the values from the new joined column.
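The join idea described above can be sketched as follows. This version uses base R, with aggregate() for the grouped mean and merge(all.x = TRUE) standing in for dplyr's left_join; the column name mean_nonzero and the seed are assumptions added for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
year  <- rep(1990:1999, each = 10)
age   <- rep(50:59, 10)
cat1  <- rep(c("A", "B", "C", "D", "E"), each = 100)
value <- rnorm(10 * 10 * 5)
value[c(3, 51, 100, 340, 441)] <- 0
df <- data.frame(year, age, cat1, value)

# 1. Mean of the non-zero values for each (year, age) combination
means <- aggregate(value ~ year + age, data = df[df$value != 0, ], FUN = mean)
names(means)[names(means) == "value"] <- "mean_nonzero"

# 2. Left join the means back onto the full data (note: merge() reorders rows)
filled <- merge(df, means, by = c("year", "age"), all.x = TRUE)

# 3. Replace zeroes with the joined group mean, then drop the helper column
filled$value <- ifelse(filled$value == 0, filled$mean_nonzero, filled$value)
filled$mean_nonzero <- NULL
```

Because merge() sorts by the join keys, rows in filled are no longer in the original order; sort afterwards if the order matters.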

Looking up values without loop in R

I need to look up a value in a data frame based on multiple criteria in another data frame. Example
A=
Country Year Number
USA 1994 455
Canada 1997 342
Canada 1998 987
should gain a column named "Rate", with values taken from
B=
Year USA Canada
1993 21 654
1994 41 321
1995 56 789
1996 85 123
1997 65 456
1998 1 999
So that the final data frame is
C=
Country Year Number Rate
USA 1994 455 41
Canada 1997 342 456
Canada 1998 987 999
In other words: Look up year and country from A in B and result is C. I would like to do this without a loop. I would like a general approach, such that I would be able to look up based on more than two criteria.
Here's another way using data.table that doesn't require converting the 2nd data table to long form:
require(data.table) # 1.9.6+
A[B, Rate := get(Country), by=.EACHI, on="Year"]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
where A and B are data.tables, and Country is of character type.
We can melt the second dataset from 'wide' to 'long' format, merge with the first dataset to get the expected output.
library(reshape2)
res <- merge(A, melt(B, id.var='Year'),
by.x=c('Country', 'Year'), by.y=c('variable', 'Year'))
names(res)[4] <- 'Rate'
res
# Country Year Number Rate
#1 Canada 1997 342 456
#2 Canada 1998 987 999
#3 USA 1994 455 41
Or we can use gather from tidyr and right_join to get this done.
library(dplyr)
library(tidyr)
gather(B, Country,Rate, -Year) %>%
right_join(., A)
# Year Country Rate Number
#1 1994 USA 41 455
#2 1997 Canada 456 342
#3 1998 Canada 999 987
Or, as @DavidArenburg mentioned in the comments, this can also be done with data.table. We convert the 'data.frame' to 'data.table' (setDT(A)), melt the second dataset and join on 'Year' and 'Country'.
library(data.table)#v1.9.6+
setDT(A)[melt(setDT(B), 1L, variable = "Country", value = "Rate"),
on = c("Country", "Year"),
nomatch = 0L]
# Country Year Number Rate
# 1: USA 1994 455 41
# 2: Canada 1997 342 456
# 3: Canada 1998 987 999
Or a shorter version (if we are not too picky about variable names)
setDT(A)[melt(B, 1L), on = c(Country = "variable", Year = "Year"), nomatch = 0L]
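For reference, the same wide-to-long-then-join idea can be sketched without any packages. The data frames below are typed in from the question; long_B and countries are names chosen for illustration. Because the lookup is an ordinary merge() on key columns, further criteria can be handled by adding more columns to by.

```r
A <- data.frame(Country = c("USA", "Canada", "Canada"),
                Year    = c(1994, 1997, 1998),
                Number  = c(455, 342, 987))
B <- data.frame(Year   = 1993:1998,
                USA    = c(21, 41, 56, 85, 65, 1),
                Canada = c(654, 321, 789, 123, 456, 999))

# Reshape B from wide (one column per country) to long (one row per Year/Country)
countries <- setdiff(names(B), "Year")
long_B <- data.frame(
  Year    = rep(B$Year, times = length(countries)),
  Country = rep(countries, each = nrow(B)),
  Rate    = unlist(B[countries], use.names = FALSE)
)

# Look up by both keys at once; all.x = TRUE keeps rows of A with no match
C <- merge(A, long_B, by = c("Country", "Year"), all.x = TRUE)
print(C)
```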

r ddply error undefined columns selected

I have a time series data set like below:
age time income
16 to 24 2004 q1 400
16 to 24 2004 q2 500
… …
65 and over 2014 q3 800
It has 60 quarters of income data for each age group. As income data is seasonal, I am trying to apply a decomposition function to extract the trend. What I have done so far is below, but R consistently throws an error (error message: undefined columns selected). Any idea how to go about it?
fun =function(x){
ts = ts(x,frequency=4,start=c(2004,1))
ts.d =decompose(ts,type='additive')
as.vector(ts.d$trend)
}
trend.dt = ddply(my.dat,.(age),transform,trend=fun(income))
The expected result is below (the NAs are there because, after decomposition, the first and last observations have no trend value, but the rest should):
age time income trend
16 to 24 2004 q1 400 NA
16 to 24 2004 q2 500 489
… …
65 and over 2014 q3 800 760
65 and over 2014 q3 810 NA
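As a point of comparison, here is a minimal base R sketch that computes the per-group trend with ave() instead of ddply. It does not diagnose the "undefined columns selected" error; it only shows the intended calculation on made-up data (two age groups, 12 hypothetical quarters each, rows already ordered by time within each group). With frequency = 4, decompose() leaves the first two and last two trend values of each group as NA.

```r
set.seed(42)  # arbitrary seed; the income values below are made up
my.dat <- data.frame(
  age    = rep(c("16 to 24", "65 and over"), each = 12),
  time   = rep(paste(rep(2004:2006, each = 4), paste0("q", 1:4)), times = 2),
  income = round(runif(24, min = 400, max = 900))
)

# Trend component of an additive decomposition of one quarterly series
trend_of <- function(x) {
  series <- ts(x, frequency = 4, start = c(2004, 1))
  as.vector(decompose(series, type = "additive")$trend)
}

# Apply the function within each age group, keeping the original row order
my.dat$trend <- ave(my.dat$income, my.dat$age, FUN = trend_of)
print(head(my.dat))
```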
