Calculate weighted average in R dataframe

"f","index","values","lo.80","lo.95","hi.80","hi.95"
"auto.arima",2017-07-31 16:40:00,2.81613884762163,NA,NA,NA,NA
"auto.arima",2017-07-31 16:40:10,2.83441637197378,NA,NA,NA,NA
"auto.arima",2017-07-31 20:39:10,3.18497899649267,2.73259824384436,2.49312233904087,3.63735974914098,3.87683565394447
"auto.arima",2017-07-31 20:39:20,3.16981166809297,2.69309866988864,2.44074205235297,3.64652466629731,3.89888128383297
"ets",2017-07-31 16:40:00,2.93983529828936,NA,NA,NA,NA
"ets",2017-07-31 16:40:10,3.09739640066054,NA,NA,NA,NA
"ets",2017-07-31 20:39:10,3.1951571771414,2.80966705285567,2.60560090776504,3.58064730142714,3.78471344651776
"ets",2017-07-31 20:39:20,3.33876776870274,2.93593322313957,2.72268549604222,3.7416023142659,3.95485004136325
"bats",2017-07-31 16:40:00,2.82795253090081,NA,NA,NA,NA
"bats",2017-07-31 16:40:10,2.96389759682623,NA,NA,NA,NA
"bats",2017-07-31 20:39:10,3.1383560278272,2.76890864400062,2.573335012715,3.50780341165378,3.7033770429394
"bats",2017-07-31 20:39:20,3.3561357998535,2.98646195085452,2.79076843614824,3.72580964885248,3.92150316355876
I have a dataframe like the one above, with the column names "f","index","values","lo.80","lo.95","hi.80","hi.95".
What I want to do is calculate the weighted average of the forecast results from the different models at each particular timestamp. What I mean is:
for every row in auto.arima there is a corresponding row in ets and bats with the same timestamp value, so the weighted average should be calculated something like this:
value_arima*1/3 + values_ets*1/3 + values_bats*1/3; the values for lo.80 and the other columns should be calculated similarly.
This result should be stored in a new dataframe with all the weighted average values.
New dataframe can look something like:
index (timestamp from the above dataframe),avg,avg_lo_80,avg_lo_95,avg_hi_80,avg_hi_95
I think I need to use the spread() and mutate() functions to achieve this. Being new to R, I'm unable to proceed after forming this dataframe.
Please help.

The example you provide is not a weighted average but a simple average.
What you want is a simple aggregate.
The first part below is your dataset as produced by dput() (the better way to share data here):
d <- structure(list(f = structure(c(1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L,
2L, 2L, 2L, 2L), .Label = c("auto.arima", "bats", "ets"), class = "factor"),
index = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L,
3L, 4L), .Label = c("2017-07-31 16:40:00", "2017-07-31 16:40:10",
"2017-07-31 20:39:10", "2017-07-31 20:39:20"), class = "factor"),
values = c(2.81613884762163, 2.83441637197378, 3.18497899649267,
3.16981166809297, 2.93983529828936, 3.09739640066054, 3.1951571771414,
3.33876776870274, 2.82795253090081, 2.96389759682623, 3.1383560278272,
3.3561357998535), lo.80 = c(NA, NA, 2.73259824384436, 2.69309866988864,
NA, NA, 2.80966705285567, 2.93593322313957, NA, NA, 2.76890864400062,
2.98646195085452), lo.95 = c(NA, NA, 2.49312233904087, 2.44074205235297,
NA, NA, 2.60560090776504, 2.72268549604222, NA, NA, 2.573335012715,
2.79076843614824), hi.80 = c(NA, NA, 3.63735974914098, 3.64652466629731,
NA, NA, 3.58064730142714, 3.7416023142659, NA, NA, 3.50780341165378,
3.72580964885248), hi.95 = c(NA, NA, 3.87683565394447, 3.89888128383297,
NA, NA, 3.78471344651776, 3.95485004136325, NA, NA, 3.7033770429394,
3.92150316355876)), .Names = c("f", "index", "values", "lo.80",
"lo.95", "hi.80", "hi.95"), class = "data.frame", row.names = c(NA,
-12L))
> aggregate(d[,3:7], by = d["index"], FUN = mean)
index values lo.80 lo.95 hi.80 hi.95
1 2017-07-31 16:40:00 2.861309 NA NA NA NA
2 2017-07-31 16:40:10 2.965237 NA NA NA NA
3 2017-07-31 20:39:10 3.172831 2.770391 2.557353 3.575270 3.788309
4 2017-07-31 20:39:20 3.288238 2.871831 2.651399 3.704646 3.925078
You can save this output in an object and change the column names as you want.
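Since the asker mentioned dplyr verbs, here is a hedged equivalent with current dplyr (>= 1.0) using group_by() plus summarise(across()); this is a sketch on a cut-down version of the data, not the full dataset above:

```r
library(dplyr)

# Cut-down version of the example data: two timestamps, three models
d <- data.frame(
  f      = rep(c("auto.arima", "ets", "bats"), each = 2),
  index  = rep(c("2017-07-31 16:40:00", "2017-07-31 16:40:10"), 3),
  values = c(2.816, 2.834, 2.940, 3.097, 2.828, 2.964)
)

# Simple average per timestamp; across() would also cover lo.80 etc.
avg <- d %>%
  group_by(index) %>%
  summarise(across(values, mean, .names = "avg_{.col}"))
```

The `.names` argument renames the summary columns in one step, which saves the manual renaming mentioned above.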
If you really want a weighted average, this is a way to obtain it (here bats has a weight of 0.8 and the other two have 0.1):
> d$weight <- d$f
> levels(d$weight) # check the levels
[1] "auto.arima" "bats" "ets"
> levels(d$weight) <- c(0.1, 0.8, 0.1)
> # transform the factor into numbers
> # warning: as.numeric(d$weight) alone is not correct !!
> d$weight <- as.numeric(as.character(d$weight))
>
> # Here the result is saved in a data.frame called "result"
> result <- aggregate(d[,3:7] * d$weight, by = d["index"], FUN = sum)
> result
index values lo.80 lo.95 hi.80 hi.95
1 2017-07-31 16:40:00 2.837959 NA NA NA NA
2 2017-07-31 16:40:10 2.964299 NA NA NA NA
3 2017-07-31 20:39:10 3.148698 2.769353 2.568540 3.528043 3.728857
4 2017-07-31 20:39:20 3.335767 2.952073 2.748958 3.719460 3.922576
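A variant that skips the factor-level recoding entirely: look the weights up by model name with a named vector. This is a sketch on cut-down data, using the same 0.1/0.8/0.1 weights as above:

```r
# Cut-down data: two timestamps, three models
d <- data.frame(
  f      = rep(c("auto.arima", "ets", "bats"), each = 2),
  index  = rep(c("16:40:00", "16:40:10"), 3),
  values = c(2.816, 2.834, 2.940, 3.097, 2.828, 2.964)
)

# Named-vector lookup avoids the levels() recoding and the
# as.numeric(as.character()) round trip
w <- c(auto.arima = 0.1, ets = 0.1, bats = 0.8)
d$weight <- w[as.character(d$f)]

# The weights sum to 1, so summing value * weight per timestamp
# is the weighted average
result <- aggregate(d["values"] * d$weight, by = d["index"], FUN = sum)
```

The named-vector lookup also fails loudly (with NA weights) if an unexpected model name appears, whereas recoding levels by position silently depends on the level order.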


R: function or similar to sum up number of non-NA values for columns that contain specific characters in large data set [duplicate]

This question already has an answer here:
How many non-NA values in each row for a matrix?
I have a large data set (907 x 1855). I need to count how many follow-ups each patient has had. A follow-up column contains either 1, 2 or NA, and a follow-up may be defined as the specific column being !is.na().
There are up to a maximum of 20 follow-ups. As you can see, each follow-up has _vX added as a suffix, where X corresponds to the number of the follow-up.
Thus, follow-up nr 20 has the very inconvenient RedCap-autogenerated column name p$fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20
> head(p)
fu_location fu_location_v2 fu_location_v2_v3 fu_location_v2_v3_v4 ...
1 1 1 1 1 ...
2 2 2 1 2 ...
3 1 1 1 2 ...
4 2 2 2 2 ...
I need to count the number of non-NA values across the columns whose names contain "fu_location". I tried mutate(n_fu = sum(!is.na(contains("fu_location")))) but that did not work.
Preferably, the solution is in dplyr. Perhaps a function?
Expected output:
> head(p)
fu_location fu_location_v2 fu_location_v2_v3 fu_location_v2_v3_v4 n_fu
1 1 1 1 1 8
2 2 2 1 2 20
3 1 1 1 2 4
4 2 2 2 2 4
Data
p <- structure(list(fu_location = c(1L, 2L, 1L, 2L), fu_location_v2 = c(1L,
2L, 1L, 2L), fu_location_v2_v3 = c(1L, 1L, 1L, 2L), fu_location_v2_v3_v4 = c(1L,
2L, 2L, 2L), fu_location_v2_v3_v4_v5 = c(2L, 2L, NA, NA), fu_location_v2_v3_v4_v5_v6 = c(1L,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7 = c(2L, 1L, NA, NA
), fu_location_v2_v3_v4_v5_v6_v7_v8 = c(1L, 2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18 = c(NA,
2L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19 = c(NA,
1L, NA, NA), fu_location_v2_v3_v4_v5_v6_v7_v8_v9_v10_v11_v12_v13_v14_v15_v16_v17_v18_v19_v20 = c(NA,
2L, NA, NA)), row.names = c(NA, -4L), class = "data.frame")
Use rowSums():
library(dplyr)
p %>% mutate(n_fu = rowSums(!is.na(select(., contains('fu_location')))))
Or in base R:
p$n_fu <- rowSums(!is.na(p[grep('fu_location', names(p))]))
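If the data set could also contain unrelated columns whose names merely contain the string somewhere, startsWith() is a stricter, prefix-only variant of the grep() call (a sketch on a toy frame):

```r
# Toy stand-in: three patients, two follow-up columns, plus an
# unrelated column that must not be counted
p <- data.frame(fu_location    = c(1, 2, NA),
                fu_location_v2 = c(1, NA, NA),
                other          = c(9, 9, 9))

# startsWith() keeps only names beginning with the prefix,
# which is stricter than grep()'s anywhere-in-the-name match
fu_cols <- startsWith(names(p), "fu_location")
p$n_fu <- rowSums(!is.na(p[fu_cols]))
```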

R: populate data.frame within function in mapply

A data.frame df1 is queried (fuzzy-matched) against another data.frame df2 with agrep. By iterating over its output (a list called matches holding the row numbers of the respective matches in df2), df1 is populated with the affiliated values from df2.
The goal is a function that is passed to mapply; however, in all my attempts df1 remains unchanged.
In a for loop, the code works as expected and populates df1 with the affiliated variables from df2. Still, I would be interested in how to solve this with a function that is passed to mapply.
First, the two data.frames:
df1 <- structure(list(Species = c("Alisma plantago-aquatica", "Alnus glutinosa",
"Carex davalliana", "Carex echinata",
"Carex elata"),
CheckPoint = c(NA, NA, NA, NA, NA),
L = c(NA, NA, NA, NA, NA),
R = c(NA, NA, NA, NA, NA),
K = c(NA, NA, NA, NA, NA)),
row.names = c(NA, 5L), class = "data.frame")
df2 <- structure(list(Species = c("Alisma gramineum", "Alisma lanceolatum",
"Alisma plantago-aquatica", "Alnus glutinosa",
"Alnus incana", "Alnus viridis",
"Carex davalliana", "Carex depauperata",
"Carex diandra", "Carex digitata",
"Carex dioica", "Carex distans",
"Carex disticha", "Carex echinata",
"Carex elata"),
L = c(7L, 7L, 7L, 5L, 6L, 7L, 9L, 4L, 8L, 3L, 9L, 9L, 8L,
8L, 8L),
R = c(7L, 7L, 5L, 5L, 4L, 3L, 4L, 7L, 6L, NA, 4L, 6L, 6L,
NA, NA),
K = c(6L, 2L, NA, 3L, 5L, 4L, 4L, 2L, 7L, 4L, NA, 3L, NA,
3L, 2L)),
row.names = seq(1:15), class = "data.frame")
Then, fuzzy match by Species:
matches <- lapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
Populating df1 with the values from df2 works as expected:
for (i in 1:dim(df1)[1]) {
  df1[i, 2:5] <- df2[matches[[i]], ]
}
This is in contrast to my approach with mapply, which does return the correct values, although as a list of disassembled values that are never written into df1. No combination (with or without return(df1), writing into another variable, or desperate attempts with SIMPLIFY and USE.NAMES) yielded the desired result.
populatedf1 <- function(matches, index){
  df1[index, 2:5] <- df2[matches, ]
  #return(df1)
}
mapply(populatedf1, matches, seq_along(matches), SIMPLIFY = FALSE,
USE.NAMES = FALSE)
Would be great if someone knows the solution or could point me into a certain direction, thanks! :)
Actually, you do not need any loop here (for or mapply) if you replace lapply with sapply (so that it returns a vector instead of a list) and then do a direct assignment. (Note this relies on each species having exactly one match; otherwise sapply cannot simplify to a vector.)
matches <- sapply(df1$Species, agrep, x = df2$Species, value = FALSE,
max.distance = c(deletions = 0,
insertions = 1,
substitutions = 1))
df1[, 2:5] <- df2[matches,]
df1
# Species CheckPoint L R K
#1 Alisma plantago-aquatica Alisma plantago-aquatica 7 5 NA
#2 Alnus glutinosa Alnus glutinosa 5 5 3
#3 Carex davalliana Carex davalliana 9 4 4
#4 Carex echinata Carex echinata 8 NA 3
#5 Carex elata Carex elata 8 NA 2
As far as your approach is concerned, you can use Map or mapply with SIMPLIFY = FALSE, bring the resulting list of data frames into one data frame using do.call and rbind, and then assign.
df1[, 2:5] <- do.call(rbind, Map(populatedf1, matches, seq_along(matches)))
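A self-contained sketch of that pattern, with a hypothetical, simplified populatedf1 that just returns the matched row. The in-function assignment can be dropped entirely, since it only ever modified a local copy of df1:

```r
# Minimal stand-ins for the two data frames
df2 <- data.frame(Species = c("Alnus glutinosa", "Carex elata"),
                  L = c(5L, 8L), R = c(5L, NA), K = c(3L, 2L))
df1 <- data.frame(Species = c("Carex elata", "Alnus glutinosa"),
                  CheckPoint = NA, L = NA, R = NA, K = NA)
matches <- lapply(df1$Species, agrep, x = df2$Species)

# Each call returns one matched row of df2; Map collects them,
# do.call(rbind, ...) stacks them, and the assignment happens once,
# outside the function
populatedf1 <- function(m, index) df2[m, ]
df1[, 2:5] <- do.call(rbind, Map(populatedf1, matches, seq_along(matches)))
```

This makes the original problem visible: inside a function, `df1[index, 2:5] <- ...` modifies a copy local to the call, so the outer df1 is untouched; collecting the return values and assigning once is the idiomatic fix.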

How do I use approx() inside mutate_at() with a conditional statement in dplyr?

I want to interpolate missing values using dplyr, piping, and approx().
Data:
test <- structure(list(site = structure(c(3L, 3L, 3L, 3L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L), .Label = c("lake", "stream", "wetland"), class = "factor"),
depth = c(0L, -3L, -4L, -8L, 0L, -1L, -3L, -5L, 0L, -2L,
-4L, -6L), var1 = c(1L, NA, 3L, 4L, 1L, 2L, NA, 4L, 1L, NA,
NA, 4L), var2 = c(1L, NA, 3L, 4L, NA, NA, NA, NA, NA, 2L,
NA, NA)), .Names = c("site", "depth", "var1", "var2"), class = "data.frame", row.names = c(NA,
-12L))
This code works:
library(tidyverse)
# interpolate missing var1 values for each site using approx()
test_int <- test %>%
group_by(site) %>%
mutate_at(vars(c(var1)),
funs("i" = approx(depth, ., depth, rule=1, method="linear")[["y"]]))
But the code no longer works if it encounters a grouping (site & var) that doesn't have at least 2 non-NA values, e.g.,
# here I'm trying to interpolate missing values for var1 & var2
test_int2 <- test %>%
group_by(site) %>%
mutate_at(vars(c(var1, var2)),
funs("i" = approx(depth, ., depth, rule=1, method="linear")[["y"]]))
R appropriately throws this error:
Error in mutate_impl(.data, dots) :
Evaluation error: need at least two non-NA values to interpolate.
How do I include a conditional statement or filter so that it only tries to interpolate cases where the site has at least 2 non-NA values and skips the rest or returns NA for those?
This will do what you are looking for...
test_int2 <- test %>%
group_by(site) %>%
mutate_at(vars(c(var1, var2)),
funs("i"=if(sum(!is.na(.))>1)
approx(depth, ., depth, rule=1, method="linear")[["y"]]
else
NA))
test_int2
# A tibble: 12 x 6
# Groups: site [3]
site depth var1 var2 var1_i var2_i
<fctr> <int> <int> <int> <dbl> <dbl>
1 wetland 0 1 1 1.0 1.0
2 wetland -3 NA NA 2.5 2.5
3 wetland -4 3 3 3.0 3.0
4 wetland -8 4 4 4.0 4.0
5 lake 0 1 NA 1.0 NA
6 lake -1 2 NA 2.0 NA
7 lake -3 NA NA 3.0 NA
8 lake -5 4 NA 4.0 NA
9 stream 0 1 NA 1.0 NA
10 stream -2 NA 2 2.0 NA
11 stream -4 NA NA 3.0 NA
12 stream -6 4 NA 4.0 NA
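Note that funs() and mutate_at() have since been deprecated; with current dplyr (>= 1.0) the same guarded interpolation can be written with across() and a lambda. A sketch on a smaller frame:

```r
library(dplyr)

# Two sites; site "b" has only one non-NA value, so it must be skipped
test <- data.frame(site  = rep(c("a", "b"), each = 3),
                   depth = c(0, -1, -2, 0, -1, -2),
                   var1  = c(1, NA, 3, NA, NA, 4))

# The lambda mirrors the funs() guard: interpolate only when the
# group has at least two non-NA values, otherwise return NA
test_int <- test %>%
  group_by(site) %>%
  mutate(across(var1,
                ~ if (sum(!is.na(.x)) > 1)
                    approx(depth, .x, depth, rule = 1)[["y"]]
                  else
                    NA_real_,
                .names = "{.col}_i")) %>%
  ungroup()
```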

Appending and merging two datasets of unequal length in R

I am trying to add two variables to my dataset from another dataset of a different length. I have a coral reef survey dataset that is missing the start and end times of each dive per site and zone of survey.
Additionally, I have a table containing the start and end times of each dive per site and zone:
This table repeats the wpt (site) because 2 zones are measured per site, meaning that each row in this table should be unique. In my own dataset I have many more repetitions of wpt because I have several observations for the same site and zone. I need to match the unique rows of mergingdata and merge them into my fishdata, returning the start and end times from mergingdata. So I want to match and merge by "wpt" and by "zone".
this is what I have tried:
merged <- merge(fishdata, mergingdata, by = "wpt", all = TRUE, sort = FALSE)
but this merges by "wpt" alone, and my output gets extra columns zone.x and zone.y. Is there a way I can merge by the unique combination of the 2 variables "wpt" and "zone"?
Thank you!
The documentation of merge help(merge) says:
By default the data frames are merged on the columns with names they
both have, but separate specifications of the columns can be given by
by.x and by.y.
As you have both id columns in both data.frames, the merge function will combine the data using those common columns. So, omitting the by parameter in your code should work.
merged <- merge(fishdata, mergingdata, all = TRUE, sort = FALSE)
However, you can also specify the identifier columns using the by, by.x and by.y parameters as follows:
merged <- merge(fishdata, mergingdata, by = c("wpt", "zone"), all = TRUE, sort = FALSE)
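For completeness, the dplyr equivalent would be left_join() on both keys. A sketch on toy frames shaped like fishdata and mergingdata:

```r
library(dplyr)

# Toy frames with the same key structure as fishdata / mergingdata
fishdata    <- data.frame(wpt = c(2, 2), zone = c("DO", "DO"),
                          species = c("TP", "IP"))
mergingdata <- data.frame(wpt = c(2, 2), zone = c("DO", "LT"),
                          start.time = c(10.34, 10.57),
                          end.time   = c(10.5, 11.1))

# Keeps every survey row and attaches the matching times
merged <- left_join(fishdata, mergingdata, by = c("wpt", "zone"))
```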
EDIT
Looking at your post modifications, I figured out that your data has the following structure:
fishdata <- structure(list(date = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "23.11.2014", class = "factor"),
entry = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "shore", class = "factor"),
wpt = c(2L, 2L, 2L, 2L, 2L, 2L), zone = structure(c(1L, 1L,
1L, 1L, 1L, 1L), .Label = "DO", class = "factor"), transect = c(1L,
1L, 1L, 1L, 1L, 1L), gps = c(NA, NA, NA, NA, NA, NA), surveyor = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "ev", class = "factor"), depth_code = c(NA,
NA, NA, NA, NA, NA), phase = structure(c(2L, 2L, 1L, 1L,
1L, 1L), .Label = c("S_PRIN", "S_STOP"), class = "factor"),
species = structure(c(2L, 1L, 2L, 2L, 1L, 1L), .Label = c("IP",
"TP"), class = "factor"), family = c(NA, NA, NA, NA, NA,
NA)), .Names = c("date", "entry", "wpt", "zone", "transect",
"gps", "surveyor", "depth_code", "phase", "species", "family"
), class = "data.frame", row.names = c(NA, -6L))
mergingdata <- structure(list(start.time = c(10.34, 10.57, 10, 10.24, 9.15,
9.39), end.time = c(10.5, 11.1, 10.2, 10.4, 9.3, 9.5), wpt = c(2L,
2L, 3L, 3L, 4L, 4L), zone = structure(c(1L, 2L, 1L, 2L, 1L, 2L
), .Label = c("DO", "LT"), class = "factor")), .Names = c("start.time",
"end.time", "wpt", "zone"), class = "data.frame", row.names = c(NA,
-6L))
Assuming that the dataset structures are correct...
> fishdata
date entry wpt zone transect gps surveyor depth_code phase species family
1 23.11.2014 shore 2 DO 1 NA ev NA S_STOP TP NA
2 23.11.2014 shore 2 DO 1 NA ev NA S_STOP IP NA
3 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
4 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN TP NA
5 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
6 23.11.2014 shore 2 DO 1 NA ev NA S_PRIN IP NA
> mergingdata
start.time end.time wpt zone
1 10.34 10.5 2 DO
2 10.57 11.1 2 LT
3 10.00 10.2 3 DO
4 10.24 10.4 3 LT
5 9.15 9.3 4 DO
6 9.39 9.5 4 LT
I do the merge as follow:
> merge(x = fishdata, y = mergingdata, all.x = TRUE)
wpt zone date entry transect gps surveyor depth_code phase species family start.time end.time
1 2 DO 23.11.2014 shore 1 NA ev NA S_STOP TP NA 10.34 10.5
2 2 DO 23.11.2014 shore 1 NA ev NA S_STOP IP NA 10.34 10.5
3 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
4 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN TP NA 10.34 10.5
5 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
6 2 DO 23.11.2014 shore 1 NA ev NA S_PRIN IP NA 10.34 10.5
Note that I use all.x = TRUE, because what we want is all the rows from the x object (fishdata) merged with the extra columns of the y object (mergingdata), using the common columns of both objects as an index.

Breaking the tapply junkie habit

I've learned R by toying around, and I'm starting to think that I'm abusing the tapply function. Are there better ways to do some of the following actions? Granted, they work, but as they get more complex I wonder if I'm losing out on better options. I'm looking for some criticism here:
tapply(var1, list(fac1, fac2), mean, na.rm=T)
tapply(var1, fac1, sum, na.rm=T) / tapply(var2, fac1, sum, na.rm=T)
cumsum(tapply(var1, fac1, sum, na.rm=T)) / sum(var1)
Update: Here's some example data...
var1 var2 fac1 fac2
1 NA 275.54 10 (266,326]
2 NA 565.89 10 (552,818]
3 NA 815.41 6 (552,818]
4 NA 281.77 6 (266,326]
5 NA 640.24 NA (552,818]
6 NA 78.42 NA [78.4,266]
7 NA 1027.06 NA (818,1.55e+03]
8 NA 355.20 NA (326,552]
9 NA 464.52 NA (326,552]
10 NA 1397.11 10 (818,1.55e+03]
11 NA 229.82 NA [78.4,266]
12 NA 542.77 NA (326,552]
13 NA 829.32 NA (818,1.55e+03]
14 NA 284.78 NA (266,326]
15 NA 194.97 10 [78.4,266]
16 NA 672.55 8 (552,818]
17 NA 348.01 10 (326,552]
18 NA 1550.79 9 (818,1.55e+03]
19 101.98 101.98 4 [78.4,266]
20 NA 292.80 6 (266,326]
Update data dump:
structure(list(var1 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, 101.98, NA), var2 = c(275.54,
565.89, 815.41, 281.77, 640.24, 78.42, 1027.06, 355.2, 464.52,
1397.11, 229.82, 542.77, 829.32, 284.78, 194.97, 672.55, 348.01,
1550.79, 101.98, 292.8), fac1 = c(10L, 10L, 6L, 6L, NA, NA, NA,
NA, NA, 10L, NA, NA, NA, NA, 10L, 8L, 10L, 9L, 4L, 6L), fac2 = structure(c(2L,
4L, 4L, 2L, 4L, 1L, 5L, 3L, 3L, 5L, 1L, 3L, 5L, 2L, 1L, 4L, 3L,
5L, 1L, 2L), .Label = c("[78.4,266]", "(266,326]", "(326,552]",
"(552,818]", "(818,1.55e+03]"), class = "factor")), .Names = c("var1",
"var2", "fac1", "fac2"), row.names = c(NA, -20L), class = "data.frame")
For part 1, I prefer aggregate because it keeps the data in a more R-like one-observation-per-row format.
aggregate(var1, list(fac1, fac2), mean, na.rm=T)
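Sketches of the second and third expressions in the same one-row-per-group style. One gotcha worth flagging: the formula method of aggregate() drops whole rows that have an NA in any variable by default, which would change the var2 sums, so na.action = na.pass is needed alongside na.rm = TRUE:

```r
# Small stand-in data
d <- data.frame(var1 = c(1, 2, NA, 4),
                var2 = c(10, 20, 30, 40),
                fac1 = c("a", "a", "b", "b"))

# Part 2: ratio of group sums. na.action = na.pass keeps rows whose
# var1 is NA so that na.rm = TRUE still sees the full var2 column
s <- aggregate(cbind(var1, var2) ~ fac1, data = d, FUN = sum,
               na.rm = TRUE, na.action = na.pass)
s$ratio <- s$var1 / s$var2

# Part 3: cumulative share of the grand total of var1
s$cumshare <- cumsum(s$var1) / sum(d$var1, na.rm = TRUE)
```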
