Matching DFs on two columns and multiplying

I have a dataframe like the following one, only with many more columns and an additional ID variable.
data <- data.frame(year = c(rep(2014, 12), rep(2015, 12)),
                   month = c(seq(1, 12), seq(1, 12)),
                   value = rep(5, 24))
The data for some year/month combinations is incorrect, and must be adjusted by multiplying by a factor for the periods shown below.
fix <- data.frame(year = c(2014, 2014, 2015), month = c(1, 5, 6), f = c(.9, 1.1, 12))
I'm currently doing this via ddply, but I'm looking for a more elegant solution:
library(plyr)

factorize <- function(x) {
  x$value <- x$value * fix[fix$year == unique(x$year) & fix$month == unique(x$month), "f"]
  x
}
data2 <- ddply(data, c("year", "month"), factorize)
Any thoughts or suggestions?
Thanks!

Here's a base R approach:
transform(merge(data, fix, all.x = TRUE),
          value = ifelse(is.na(f), value, value * f), f = NULL)
And in case you need faster performance you can use data.table:
library(data.table)
data <- merge(setDT(data), setDT(fix), all.x = TRUE, by = c("year", "month"))
data[!is.na(f), value := value * f]
data[, f := NULL]
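The same join-then-multiply idea can also be written with dplyr; a sketch assuming dplyr is installed (left_join and coalesce are standard dplyr verbs):

```r
library(dplyr)

data <- data.frame(year = c(rep(2014, 12), rep(2015, 12)),
                   month = rep(1:12, 2),
                   value = rep(5, 24))
fix <- data.frame(year = c(2014, 2014, 2015),
                  month = c(1, 5, 6),
                  f = c(.9, 1.1, 12))

# Join the factors on both key columns; rows with no matching factor
# get NA, which coalesce() turns into a neutral factor of 1
data2 <- data %>%
  left_join(fix, by = c("year", "month")) %>%
  mutate(value = value * coalesce(f, 1)) %>%
  select(-f)
```

Unlike merge(), left_join keeps the original row order, so no re-sorting is needed afterwards.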

With dplyr and ifelse you can do this in one line, though note that this relies on fix$year and fix$month recycling evenly into data (24 rows against 3), so it only works when the row counts happen to align; a join-based approach is safer in general.
data %>% mutate(value = ifelse(year == fix$year & month == fix$month,
                               value * fix$f, value))
   year month value
1  2014     1   4.5
2  2014     2   5.0
3  2014     3   5.0
4  2014     4   5.0
5  2014     5   5.5
6  2014     6   5.0
7  2014     7   5.0
8  2014     8   5.0
9  2014     9   5.0
10 2014    10   5.0
11 2014    11   5.0
12 2014    12   5.0
13 2015     1   5.0
14 2015     2   5.0
15 2015     3   5.0
16 2015     4   5.0
17 2015     5   5.0
18 2015     6  60.0
19 2015     7   5.0
20 2015     8   5.0
21 2015     9   5.0
22 2015    10   5.0
23 2015    11   5.0
24 2015    12   5.0

Related

Computing information taken out from applying 'aggregate' in R

I have the following information:
head(Callao20)
Dia Mes Año Temp
1 12 Feb 2020 NA
2 12 Feb 2020 NA
3 12 Feb 2020 NA
4 12 Feb 2020 NA
5 12 Feb 2020 NA
6 12 Feb 2020 NA
Although there are NA's in the first rows, there is more data further down. Incidentally, would you recommend deleting those NA's?
Anyway, I'd like to estimate the cv for each month, so I computed the following parameters monthly:
aggregate(Callao20[, 4], list(Callao20$Mes), mean)
Group.1 x
1 Feb NA
2 Mar 17.84195
3 Abr 17.50487
4 May 16.77294
5 Jun 16.45750
6 Jul 15.53369
7 Ago 14.93071
8 Set 14.65176
9 Oct 14.60224
10 Nov 14.48786
11 Dic 14.47635
...and also:
aggregate(Callao20[, 4], list(Callao20$Mes), sd)
Group.1 x
1 Feb NA
2 Mar 0.6280132
3 Abr 0.7163050
4 May 0.3962204
5 Jun 0.4165841
6 Jul 0.3743657
7 Ago 0.4063140
8 Set 0.3538223
9 Oct 0.6060919
10 Nov 0.5034747
11 Dic 0.3035467
Knowing that cv = (sd/mean)*100, how would you recommend estimating it for each month from what I already have?
We could use the tidyverse, as it handles NA values well:
library(dplyr)
Callao20 %>%
  group_by(Mes) %>%
  summarise(out = sd(Temp, na.rm = TRUE) / mean(Temp, na.rm = TRUE) * 100)
Or, if we want to use aggregate, we can use a formula approach (the \(x) lambda shorthand requires R >= 4.1.0):
aggregate(Temp ~ Mes, Callao20,
          \(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE) * 100)
I would suggest doing this in a single aggregate call instead of breaking it into separate calls and then trying to combine them.
aggregate(Callao20[, 4], list(Callao20$Mes),
          function(x) (sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)) * 100)
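If you want to reuse the two aggregates you already computed, you can also merge them on the month column and divide; a minimal base-R sketch with made-up stand-in data (your real Callao20 has many more rows and months):

```r
# Stand-in data: one all-NA month and one month with values
Callao20 <- data.frame(Mes  = rep(c("Feb", "Mar"), each = 3),
                       Temp = c(NA, NA, NA, 17.2, 18.1, 18.2))

# The two aggregates from the question, kept separate
m <- aggregate(Callao20$Temp, list(Mes = Callao20$Mes), mean, na.rm = TRUE)
s <- aggregate(Callao20$Temp, list(Mes = Callao20$Mes), sd, na.rm = TRUE)

# Merge on Mes and compute cv = (sd / mean) * 100 from the merged columns
cv <- merge(m, s, by = "Mes", suffixes = c(".mean", ".sd"))
cv$cv <- cv$x.sd / cv$x.mean * 100
```

Months that are entirely NA (like Feb here) come out as NA, matching the behaviour of the separate aggregates in the question.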

Correlation matrix using a moving window in data

I'm trying to create correlation matrices using a 5-year moving window, so using data from 2000-2005, 2001-2006 etc.
Here's some example data:
d <- data.frame(v1 = seq(2000, 2015, 1),
                v2 = rnorm(16),
                v3 = rnorm(16),
                v4 = rnorm(16))
v1 v2 v3 v4
1 2000 -1.0907101 -1.3697559 0.52841978
2 2001 -1.3143654 -0.6443144 -0.44653227
3 2002 -0.1762554 2.0513870 -1.07372405
4 2003 0.1668012 -1.6985891 -0.32962331
5 2004 0.6006146 -0.1843326 -0.56936906
6 2005 -1.3113762 -0.3854868 -1.61247953
7 2006 3.1914908 -0.2635004 0.04689692
8 2007 0.7935639 -1.0844792 -0.25895397
9 2008 1.4217089 1.9572254 1.27221568
10 2009 -0.4192379 -0.5451291 0.18891557
11 2010 -0.1304170 -1.4676465 0.17137507
12 2011 1.2212943 0.9523027 -0.39269076
13 2012 -0.4464840 -0.7117153 -0.71619199
14 2013 0.1879822 1.0693801 -0.44835571
15 2014 -0.5602422 -0.7036433 0.53531753
16 2015 1.4322259 1.5398703 1.00294281
I've created new start and end columns for each group using dplyr:
d <- d %>%
  mutate(start = floor(v1),
         end = ifelse(ceiling(v1) == start, start + 5, ceiling(v1)))
I tried group_by(start, end) and then running the correlation, but that didn't work. Is there a quicker way to do this than repeatedly filtering the data?
This prints correlation matrices for 5-year windows:
library(tidyverse)
lapply(2000:2011, function(y) {
  d %>%
    filter(v1 >= y & v1 <= (y + 4)) %>%
    dplyr::select(-v1) %>%
    cor()
})
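A compact base-R variant of the same idea keeps each window's matrix in a named list, so a window can be looked up by its start year (set.seed and the data frame here are only stand-ins to make the sketch reproducible):

```r
# Reproducible stand-in for the question's data
set.seed(1)
d <- data.frame(v1 = 2000:2015,
                v2 = rnorm(16), v3 = rnorm(16), v4 = rnorm(16))

# One correlation matrix per 5-year window, named by its start year
starts <- setNames(2000:2011, 2000:2011)
mats <- lapply(starts, function(y) cor(d[d$v1 >= y & d$v1 <= y + 4, -1]))

mats[["2003"]]  # the 2003-2007 window
```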

How to count how many values were used in a mean() function?

I am trying to create a column in a data frame containing how many values were used in the mean function for each line.
First, I had a data frame df like this:
df <- data.frame(tree_id = rep(c("CHC01", "CHC02"), each = 8),
                 rad = c(rep("A", 4), rep("B", 4), rep("A", 4), rep("C", 4)),
                 year = rep(2015:2018, 4),
                 growth = c(NA, NA, 1.2, 3.2, 2.1, 1.5, 2.3, 2.7,
                            NA, NA, NA, 1.7, 3.5, 1.4, 2.3, 2.7))
Then I created a new data frame called avg_df, containing the mean growth values grouped by tree_id and year:
library(dplyr)
avg_df <- df %>%
  group_by(tree_id, year) %>%
  summarise(avg_growth = mean(growth, na.rm = TRUE))
Now I would like to add a new column to avg_df containing how many values were used to calculate the mean growth for each tree_id and year, ignoring the NAs.
Example: for CHC01 in 2015 the result is 1, because the value is the average of 2.1 and NA, and
for CHC01 in 2018 it is 2, because the value is the average of 3.2 and 2.7.
Here is the expected output:
avg_df$radii <- c(1,1,2,2,1,1,1,2)
tree_id year avg_growth radii
CHC01 2015 2.1 1
CHC01 2016 1.5 1
CHC01 2017 1.75 2
CHC01 2018 2.95 2
CHC02 2015 3.5 1
CHC02 2016 1.4 1
CHC02 2017 2.3 1
CHC02 2018 2.2 2
*In my real data, the values in radii will vary from 1 to 4.
Could anyone help me with this?
Thank you very much!
We can get the sum of non-NA elements (!is.na(growth)) after grouping by 'tree_id' and 'year':
library(dplyr)
df %>%
  group_by(tree_id, year) %>%
  summarise(avg_growth = mean(growth, na.rm = TRUE),
            radii = sum(!is.na(growth)))
# A tibble: 8 x 4
# Groups: tree_id [2]
# tree_id year avg_growth radii
# <fct> <int> <dbl> <int>
#1 CHC01 2015 2.1 1
#2 CHC01 2016 1.5 1
#3 CHC01 2017 1.75 2
#4 CHC01 2018 2.95 2
#5 CHC02 2015 3.5 1
#6 CHC02 2016 1.4 1
#7 CHC02 2017 2.3 1
#8 CHC02 2018 2.2 2
Or using data.table
library(data.table)
setDT(df)[, .(avg_growth = mean(growth, na.rm = TRUE),
              radii = sum(!is.na(growth))), by = .(tree_id, year)]
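If you prefer base R, a sketch using aggregate's formula method works too: its default na.action = na.omit drops the NA rows before the function is applied, so length(x) counts exactly the values that entered the mean.

```r
# The question's data
df <- data.frame(
  tree_id = rep(c("CHC01", "CHC02"), each = 8),
  rad = c(rep("A", 4), rep("B", 4), rep("A", 4), rep("C", 4)),
  year = rep(2015:2018, 4),
  growth = c(NA, NA, 1.2, 3.2, 2.1, 1.5, 2.3, 2.7,
             NA, NA, NA, 1.7, 3.5, 1.4, 2.3, 2.7))

# NA rows are dropped up front, so mean() needs no na.rm and
# length() counts only the values actually averaged
avg_df <- aggregate(growth ~ tree_id + year, df,
                    function(x) c(avg_growth = mean(x), radii = length(x)))
```

Note that the resulting growth column is a two-column matrix, and a group consisting only of NAs would be dropped entirely rather than producing NaN, unlike the dplyr version.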

Aggregation on two columns while keeping two columns unique in R

So I have this:
Staff Result Date Days
1 50 2007 4
1 75 2006 5
1 60 2007 3
2 20 2009 3
2 11 2009 2
And I want to get to this:
Staff Result Date Days
1 55 2007 7
1 75 2006 5
2 15 2009 5
I want each row to have a unique combination of Staff ID and Date, summing 'Days' and averaging 'Result'.
I can't work out how to do this in R; I'm sure I need several aggregations, but I keep getting results different from what I'm aiming for.
Many thanks
The simplest way to do this is to group_by Staff and Date and summarise the results with the dplyr package:
library(dplyr)
df <- data.frame(Staff = c(1, 1, 1, 2, 2),
                 Result = c(50, 75, 60, 20, 11),
                 Date = c(2007, 2006, 2007, 2009, 2009),
                 Days = c(4, 5, 3, 3, 2))
df %>%
  group_by(Staff, Date) %>%
  summarise(Result = floor(mean(Result)),
            Days = sum(Days)) %>%
  data.frame
Staff Date Result Days
1 1 2006 75 5
2 1 2007 55 7
3 2 2009 15 5
You can aggregate on two variables by using a formula and then merge the two aggregates:
merge(aggregate(Result ~ Staff + Date, data = df, mean),
      aggregate(Days ~ Staff + Date, data = df, sum))
Staff Date Result Days
1 1 2006 75.0 5
2 1 2007 55.0 7
3 2 2009 15.5 5
Here is another option with data.table
library(data.table)
setDT(df)[, .(Result = floor(mean(Result)), Days = sum(Days)), .(Staff, Date)]
# Staff Date Result Days
#1: 1 2007 55 7
#2: 1 2006 75 5
#3: 2 2009 15 5
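For completeness, the same two-key summary can be done in a single pass in base R with split(); a sketch using the column names from the question:

```r
# The question's data
df <- data.frame(Staff = c(1, 1, 1, 2, 2),
                 Result = c(50, 75, 60, 20, 11),
                 Date = c(2007, 2006, 2007, 2009, 2009),
                 Days = c(4, 5, 3, 3, 2))

# Split on both keys, summarise each group, then stack the pieces
out <- do.call(rbind, lapply(
  split(df, df[c("Staff", "Date")], drop = TRUE),
  function(g) data.frame(Staff = g$Staff[1], Date = g$Date[1],
                         Result = mean(g$Result), Days = sum(g$Days))))
```

Note this keeps the exact mean (15.5 for Staff 2), whereas the dplyr answer applies floor() to match the asker's expected output.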

Average column in daily information at every n-th row

I am very new to R. I have daily observations of temperature and PP over a 12-year period (6574 rows, 6 columns, some NAs). I want to calculate, for example, the average from the 1st to the 10th day of January 2001, then the 11th to the 20th, and finally the 21st to the 31st, and so on for every month through December for each year in the period.
A complication is that February has 28 or 29 days (leap years).
This is how I read my CSV file with read.table:
# read the CSV
setwd("C:\\Users\\GVASQUEZ\\Documents\\ESTUDIO_PAMPAS\\R_sheet")
huancavelica <- read.table("huancavelica.csv", header = TRUE, sep = ",",
                           dec = ".", fileEncoding = "latin1", nrows = 6574)
This is the output of my CSV file
Año Mes Dia PT101 TM102 TM103
1 1998 1 1 6.0 15.6 3.4
2 1998 1 2 8.0 14.4 3.2
3 1998 1 3 8.6 13.8 4.4
4 1998 1 4 5.6 14.6 4.6
5 1998 1 5 0.4 17.4 3.6
6 1998 1 6 3.4 17.4 4.4
7 1998 1 7 9.2 14.6 3.2
8 1998 1 8 2.2 16.8 2.8
9 1998 1 9 8.6 18.4 4.4
10 1998 1 10 6.2 15.0 3.6
. . . . . . .
With the data setup that you have, a fairly tried-and-true method should work:
# pad single-digit months with a leading 0 so "1" and "10" sort correctly
huancavelica$MesChar <- ifelse(nchar(huancavelica$Mes) == 1,
                               paste0("0", huancavelica$Mes),
                               as.character(huancavelica$Mes))
# time-of-month ID: 1 = days 1-10, 2 = days 11-20, 3 = days 21+
huancavelica$timeMonth <- ifelse(huancavelica$Dia < 11, 1,
                                 ifelse(huancavelica$Dia > 20, 3, 2))
# final group ID
huancavelica$ID <- paste(huancavelica$Año, huancavelica$MesChar,
                         huancavelica$timeMonth, sep = ".")
# group average (na.rm must go inside FUN; ave() would treat a bare
# na.rm = TRUE argument as another grouping variable)
huancavelica$myStat <- ave(huancavelica$PT101, huancavelica$ID,
                           FUN = function(x) mean(x, na.rm = TRUE))
We can try
library(data.table)
setDT(huancavelica)[, Grp := (Dia - 1) %/% 10 + 1
  ][Grp > 3, Grp := 3
  ][, lapply(.SD, mean, na.rm = TRUE), by = .(Año, Mes, Grp)]
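The dekad bucketing at the heart of these answers can also be done in base R alone; a minimal sketch with made-up stand-in data ("Ano" stands in for the question's Año column, and the real file has six columns and NAs):

```r
# Stand-in data: two months of daily values
huanca <- data.frame(Ano = 1998,
                     Mes = rep(1:2, c(31, 28)),
                     Dia = c(1:31, 1:28),
                     PT101 = seq_len(59) / 10)

# Bucket days into dekads: 1 = days 1-10, 2 = 11-20, 3 = 21 to month end;
# the third bucket simply absorbs days 29-31, so 28-, 30- and 31-day
# months all work without special-casing leap years
huanca$Grp <- pmin((huanca$Dia - 1) %/% 10 + 1, 3)

# Mean per year/month/dekad
out <- aggregate(PT101 ~ Ano + Mes + Grp, huanca, mean, na.rm = TRUE)
```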
It adds a bit more complexity, but you could cut each month into thirds and get the average for each third. For example:
library(dplyr)
library(lubridate)

# fake data
set.seed(10)
df = data.frame(date = seq(as.Date("2015-01-01"), as.Date("2015-12-31"), by = "1 day"),
                value = rnorm(365))

# cut each month into thirds
df = df %>%
  mutate(mon_yr = paste0(month(date, label = TRUE, abbr = TRUE), " ", year(date))) %>%
  group_by(mon_yr) %>%
  mutate(cutMonth = cut(day(date),
                        breaks = c(0, round(1/3 * n()), round(2/3 * n()), n()),
                        labels = c("1st third", "2nd third", "3rd third")),
         cutMonth = paste0(mon_yr, ", ", cutMonth)) %>%
  ungroup %>%
  mutate(cutMonth = factor(cutMonth, levels = unique(cutMonth)))
date value cutMonth
1 2015-01-01 0.01874617 Jan 2015, 1st third
2 2015-01-02 -0.18425254 Jan 2015, 1st third
3 2015-01-03 -1.37133055 Jan 2015, 1st third
...
363 2015-12-29 -1.3996571 Dec 2015, 3rd third
364 2015-12-30 -1.2877952 Dec 2015, 3rd third
365 2015-12-31 -0.9684155 Dec 2015, 3rd third
# summarise to get the average value for each third of a month
df.summary = df %>%
  group_by(cutMonth) %>%
  summarise(average.value = mean(value))
cutMonth average.value
1 Jan 2015, 1st third -0.49065685
2 Jan 2015, 2nd third 0.28178222
3 Jan 2015, 3rd third -1.03870698
4 Feb 2015, 1st third -0.45700203
5 Feb 2015, 2nd third -0.07577199
6 Feb 2015, 3rd third 0.33860882
7 Mar 2015, 1st third 0.12067388
...