Calculate moving geometric mean by water sampling station in R

I need to calculate the moving geometric mean of fecal coliform over time (at each value I want the geomean of that value and the previous 29 values), by individual sampling station. When I download the data from our database the column headers are:
Station SampleDate FecalColiform
Depending on the growing area there are a few to over a dozen stations.
I tried to adapt some code that I found here:
#File: Fecal
Fecal <- group_by(Fecal, Station) %>%
  arrange(SampleDate) %>%
  mutate(logres = log10(ResultValue)) %>%
  mutate(mgm = stats::filter(logres, rep(1/24, 24), sides = 1))
This worked, but the problem is that I don't want the resulting log values; I want the regular geomean so that I can plot it and everyone can easily understand the values. I tried to sneak the geometric.mean function from the psych package in there, but I could not make that work.
There are resources for calculating a moving average, and code for calculating a geometric mean, and I have tried to combine several of them, but I can't find an example of a moving geometric mean.
Eventually I would like to graph all of the geomeans by station, similar to the example in the link above.
> dput(ByStationRGMData[1:10,])
structure(list(Station = c(114L, 114L, 114L, 114L, 114L, 114L,
114L, 114L, 114L, 114L), Classification = structure(c(3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c(" Approved ", " Conditionally Approved ",
" Prohibited "), class = "factor"), SampleDate = c(19890103L,
19890103L, 19890209L, 19890316L, 19890413L, 19890511L, 19890615L,
19890713L, 19890817L, 19890914L), SWTemp = c(NA, NA, 5L, 8L,
NA, 13L, 15L, 18L, NA, 18L), Salinity = c(NA, NA, 22L, 18L, NA,
26L, 22L, 24L, NA, 32L), FecalColiform = c(180, 49, 2, 17, 7.9,
1.8, 4.5, 11, 33, 1.8), RGM = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_
)), .Names = c("Station", "Classification", "SampleDate", "SWTemp",
"Salinity", "FecalColiform", "RGM"), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -10L), vars = list(
Station), drop = TRUE, indices = list(0:9), group_sizes = 10L, biggest_group_size = 10L, labels = structure(list(
Station = 114L), class = "data.frame", row.names = c(NA,
-1L), vars = list(Station), drop = TRUE, .Names = "Station"))
I would also like to add a moving 90th percentile to the dataframe and the graphs. I tried the following:
ByStationRGMData <- RawData %>%
  group_by(Station) %>%
  arrange(SampleDate) %>%
  mutate(RGM = as.numeric(rollapply(FecalColiform, 30, geometric.mean, fill=NA, align="right"))) +
  mutate(F90 = as.numeric(rollapply(FecalColiform, 30, quantile, p=0.90, fill=NA, align="right")))
This gives me the error:
Error in mutate_(.data, .dots = lazyeval::lazy_dots(...)) : argument ".data" is missing, with no default
I can't seem to figure out what I'm missing.

You can use rollapply from the zoo package (illustrated here using the built-in mtcars data frame). I've used a window of 3 values, but you can set that to 30 in your actual data. align="right" uses the current value and the n-1 previous values, where n is the window width:
library(psych)
library(dplyr)
library(zoo)
mtcars %>%
  mutate(mpgGM = rollapply(mpg, 3, geometric.mean, fill=NA, align="right"))
Include a grouping variable to get rolling geometric means separately for each group. As for the error in your own attempt: the two mutate() calls are chained with + rather than %>%, which is what produces the 'argument ".data" is missing' message.
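If you'd rather avoid the psych dependency, note that the geometric mean is exp(mean(log(x))), so a rolling geometric mean is just the exponential of a rolling mean of logs; that is the same quantity the stats::filter() code above computes, back-transformed. A minimal base-R sketch (the roll_geomean name is my own, not from any package):

```r
# Rolling geometric mean: exponentiate a rolling mean of logs.
# Right-aligned: each output uses the current value and the
# previous n - 1 values, matching rollapply(..., align = "right").
roll_geomean <- function(x, n) {
  out <- rep(NA_real_, length(x))
  for (i in seq_along(x)) {
    if (i >= n) out[i] <- exp(mean(log(x[(i - n + 1):i])))
  }
  out
}

roll_geomean(c(1, 10, 100, 1000), 2)
# NA 3.162278 31.622777 316.227766 (each is the sqrt of a pairwise product)
```

The window positions before the first full window stay NA, just as with fill=NA in rollapply.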

Related

How to skip and disregard a row in a loop when it can't be processed or produces an error?

structure(list(`total primary - yes RS` = c(0L, 138L, 101L, 86L,
118L), `total primary - no RS` = c(0L, 29L, 39L, 35L, 38L), `total secondary- yes rs` = c(0L,
6L, 15L, 3L, 15L), `total secondary- no rs` = c(0L, 0L, 7L, 1L,
2L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"
))
I had previously asked for a line of code that could run a chi-square test for each of the four rows included:
https://stackoverflow.com/questions/66750999/with-r-i-would-like-to-loop-through-each-row-and-create-corresponding-chisquare/66751018#66751018
Though the script worked, it only worked because all four rows were able to run through it.
library(broom)
library(dplyr)
apply(df, 1, function(x) tidy(chisq.test(matrix(x, ncol = 2)))) %>%
bind_rows
I now have a row that contains zeros, and when I run the same script I get:
Error in stats::chisq.test(x, y, ...) :
at least one entry of 'x' must be positive
I tried to do something using tryCatch(), this way:
tryCatch(apply(df, 1, function(x) tidy(chisq.test(matrix(x, ncol = 2))))) %>%
bind_rows
but it did not work. Ultimately the dataset has a bunch of rows like this; I would like the script to recognize that the problem isn't only in row 1 but can be in multiple rows, like 5, 23, 67, and so on.
I am not sure I am following your code/data exactly, but what if you move your tryCatch statement inside the apply statement and give it an error handler, like so: apply(df, 1, function(x) tryCatch(tidy(chisq.test(matrix(x, ncol = 2))), error = function(e) NULL)) %>% bind_rows? Without an error handler, tryCatch() simply re-signals the error. Does that help at all?
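To make the row-skipping concrete, here is a self-contained base-R sketch (toy data modeled on yours, without broom, so only the stats package is involved): the error handler returns NULL for any row chisq.test() rejects, and Filter() then drops those entries.

```r
# Toy data: the first row is all zeros, which makes chisq.test() error.
df <- data.frame(a = c(0, 138, 101), b = c(0, 29, 39),
                 c = c(0, 6, 15),   d = c(0, 0, 7))

results <- apply(df, 1, function(x) {
  tryCatch(chisq.test(matrix(x, ncol = 2)),
           error = function(e) NULL)  # NULL marks a skipped row
})
results <- Filter(Negate(is.null), results)
length(results)  # 2 tests survive; the all-zero row was skipped
```

With broom loaded you would wrap the chisq.test() call in tidy() the same way and bind_rows() the surviving results.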

How to drop multiple columns which have categorical values in R?

I know how to drop columns by name, but I am not quite sure how to drop the columns which have categorical values. It can be done manually by looking at which columns have categorical values, but that is not intuitive for me in R code. How do I detect columns which have categorical values? Is there any way to make this happen?
Minimal data:
mydf=structure(list(taken_time = c(15L, 5L, 39L, -21L, 46L, 121L),
ap6xl = c(203.2893857, 4.858269406, 2, 14220, 218.2215352,
115.5227706), pct5 = c(732.074484, 25.67901235, 1.01, 120.0477168,
3621.328567, 79.30561111), crp4 = c(196115424.7, 1073624.455,
1.23, 1457496.474, 10343851.7, 81288042.73), age = c(52L,
74L, 52L, 67L, 82L, 67L), gender = structure(c(2L, 2L, 2L,
1L, 2L, 1L), .Label = c("F", "M"), class = "factor"), inpatient_readmission_time_rtd = c(79.78819444,
57.59068053, 57.59068053, 57.59068053, 57.59068053, 9.893055556
), infection_flag = c(0L, 0L, 1L, 1L, 0L, 1L), temperature_value = c(98.9,
98.9, 98, 101.3, 99.5, 98.1), heartrate_value = c(106, 61,
78, 91, 120, 68), pH_result_time_rta = c(11, 85.50402145,
85.50402145, 85.50402145, 85.50402145, 85.50402145), gcst_value = c(15,
15, 15, 14.63769293, 15, 14.63769293)), row.names = c(NA,
6L), class = "data.frame")
Instead of manually typing the names of the columns which have categorical values, is there any way we can detect categorical columns and drop them?
I am concerned about cases where the dataframe might have more than 10 categorical columns; doing this by hand is sort of a pain, so I am curious if it is possible using R. Any thoughts?
For example, I can do this for the above dataframe by manually looking at which ones are categorical columns:
mydf <- mydf[!names(mydf) %in% c("gender", "infection_flag")]
Is there any way we can detect which ones are categorical columns and drop them for numerical calculation purposes? Any ideas?
An option with base R:
i1 <- sapply(mydf, is.numeric)
mydf[i1]
You can use dplyr and select all the numerical columns:
library(dplyr)
mydf %>% select_if(is.numeric)
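One caveat with either approach: is.numeric() keeps integer-coded categoricals such as infection_flag, since R has no way to know those 0/1 codes are categories. A small self-contained illustration (toy columns of my own) using base R's Filter(), which keeps the columns for which the predicate returns TRUE:

```r
mydf <- data.frame(age = c(52, 74),
                   gender = factor(c("M", "F")),
                   infection_flag = c(0L, 1L))

# Filter() subsets a data frame column-wise by a predicate.
num_only <- Filter(is.numeric, mydf)
names(num_only)  # "age" "infection_flag" -- the 0/1 flag survives
```

If dummy-coded flags should be dropped too, you have to name them explicitly, as in the manual approach above.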

Group variables depending on a defined circular area, with circle centers having variable radii

I have a data table object:
> dput(head(trackdatacompvar))
structure(list(wellvid = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = c("A4-009",
"B3-006", "B4-015", "C2-009", "C2-034", "C3-017", "C4-014", "C4-016",
"C4-026", "C4-036"), class = "factor"), TRACK_ID = c(0L, 0L,
0L, 0L, 0L, 0L), treatment = structure(c(2L, 2L, 2L, 2L, 2L,
2L), .Label = c("Si_induced", "Si_notinduced"), class = "factor"),
A = c(0L, 0L, 0L, 0L, 0L, 0L), X = c(50.216, 50.216, 50.091,
50.091, 50.216, 50.216), Y = c(295.609, 295.609, 295.477,
295.477, 295.609, 295.609), T = 0:5, V = c(0, 0, 0.181793839279557,
0, 0.181793839279557, 0), x_grpA = c(641.67, 641.67, 641.67,
641.67, 641.67, 641.67), y_grpA = c(625, 625, 625, 625, 625,
625), rad_grpA = c(50L, 50L, 50L, 50L, 50L, 50L), x_grpB = c(889.58,
889.58, 889.58, 889.58, 889.58, 889.58), y_grpB = c(377.08,
377.08, 377.08, 377.08, 377.08, 377.08), rad_grpB = c(20L,
20L, 20L, 20L, 20L, 20L)), .Names = c("wellvid", "TRACK_ID",
"treatment", "A", "X", "Y", "T", "V", "x_grpA", "y_grpA", "rad_grpA",
"x_grpB", "y_grpB", "rad_grpB"), sorted = "wellvid", class = c("data.table",
"data.frame"), row.names = c(NA, -6L), .internal.selfref = <pointer: 0x0000000000210788>)
I want to define 4 groups of data depending on circular area. Groups A and B depend on the x,y origins of 2 beads (labelled x_grpA, y_grpA and x_grpB, y_grpB), group C is the outside area, and group D is the area where groups A and B overlap (but this area is sometimes not there). The 2 circular groups should be inside a circular area with a radius of 115 µm. This 115 µm depends on the size of the bead, so I also have 2 radii in my data (rad_grpA and rad_grpB). To understand it visually, here are 2 pictures:
My original idea was to reuse the awesome script I was given before. So I tried defining the center of each data point and the corresponding length of the whole area of group A as:
center_grpA <- c(trackdatacompvar$x_grpA, trackdatacompvar$y_grpA)
circle_grpA <- (trackdatacompvar$rad_grpA)*2 + 115
But after this I am lost.
In the end I want to put inside my dataframe their grouping as one variable.
Would appreciate any help! Thanks :)
We can use a little convenience function from a package of mine here:
check_if_in_circle <- function(points, x, y, r) {
(points[, 1] - x) ^ 2 + (points[, 2] - y) ^ 2 < r ^ 2
}
Now we check, for each point, whether it's in circle A and in circle B, then use ifelse to figure out whether to assign A, B, C or D. I use within to avoid typing that long data name.
trackdatacompvar <- within(trackdatacompvar,
{
grpA <- check_if_in_circle(points = cbind(X, Y),
x_grpA, y_grpA, rad_grpA + 115)
grpB <- check_if_in_circle(points = cbind(X, Y),
x_grpB, y_grpB, rad_grpB + 115)
grp <- ifelse(grpA, ifelse(grpB, 'D', 'A'),
ifelse(grpB, 'B', 'C'))
} )
For the few rows you gave us, all are in group C.
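As a quick self-contained sanity check of the helper (coordinates made up to mimic bead A): a point near the bead's center should test TRUE, a far-away point FALSE.

```r
# Squared-distance test: is each point strictly inside the circle
# centered at (x, y) with radius r?
check_if_in_circle <- function(points, x, y, r) {
  (points[, 1] - x) ^ 2 + (points[, 2] - y) ^ 2 < r ^ 2
}

pts <- rbind(c(641, 625),  # close to bead A's center
             c(0, 0))      # far outside
check_if_in_circle(pts, x = 641.67, y = 625, r = 50 + 115)
# TRUE FALSE
```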

R: How to create multiple maps (rworldmap) using apply?

I want to create multiple maps (similar to this example) using the apply family. Here is a small sample of my data (~200 rows x 150 cols; UN and ISO3 are codes for rworldmap):
df <- structure(list(BLUE.fruits = c(12803543,
3745797, 19947613, 0, 130, 4), BLUE.nuts = c(21563867, 533665,
171984, 0, 0, 0), BLUE.veggies = c(92690, 188940, 34910, 0, 0,
577), GREEN.fruits = c(3389314, 15773576, 8942278, 0, 814, 87538
), GREEN.nuts = c(6399474, 1640804, 464688, 0, 0, 0), GREEN.veggies = c(15508,
174504, 149581, 0, 0, 6190), UN = structure(c(4L, 5L, 1L, 6L,
2L, 3L), .Label = c("12", "24", "28", "4", "8", "n/a"), class = "factor"),
ISO3 = structure(c(1L, 3L, 6L, 4L, 2L, 5L), .Label = c("AFG",
"AGO", "ALB", "ASM", "ATG", "DZA"), class = "factor")), .Names = c("BLUE.fruits", "BLUE.nuts", "BLUE.veggies", "GREEN.fruits", "GREEN.nuts",
"GREEN.veggies", "UN", "ISO3"), row.names = c(97L, 150L, 159L,
167L, 184L, 191L), class = "data.frame")
and the code I used before to plot one single map:
library(rworldmap)
mapDevice('x11')
spdf <- joinCountryData2Map(df, joinCode="ISO3", nameJoinColumn="ISO3")
mapWF <- mapCountryData(spdf, nameColumnToPlot="BLUE.nuts",
catMethod="quantiles")
Note: in mapCountryData() I used the name of a single column (in this case "BLUE.nuts"). My question is: is there a way to apply this mapping code to the different columns, creating six different maps? Either in one multi-panel figure using layout(), or even better, creating six different plots that get saved according to their colnames. Ideas? Thanks a lot in advance.
You are close.
Add this to save one plot per column.
# put column names to plot in a vector
col_names <- names(df)[1:6]

lapply(col_names, function(x) {
  # opens device to store pdf
  pdf(paste0(x, '.pdf'))
  # plots map
  mapCountryData(spdf, nameColumnToPlot = x)
  # closes created pdf
  dev.off()
})
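For the multi-panel variant mentioned in the question: mapCountryData() draws with base graphics, so par(mfrow = c(2, 3)) can arrange all six maps on one device. This is a sketch assuming the spdf object built in the question exists; it is guarded so it is a no-op if spdf or the rworldmap package is unavailable.

```r
col_names <- c("BLUE.fruits", "BLUE.nuts", "BLUE.veggies",
               "GREEN.fruits", "GREEN.nuts", "GREEN.veggies")

if (exists("spdf") && requireNamespace("rworldmap", quietly = TRUE)) {
  op <- par(mfrow = c(2, 3))  # 2 x 3 grid of panels
  for (nm in col_names) {
    rworldmap::mapCountryData(spdf, nameColumnToPlot = nm,
                              catMethod = "quantiles", mapTitle = nm)
  }
  par(op)                     # restore the previous layout
}
```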

Using summarise function to make sumIF with the dplyr package [duplicate]

This question already has answers here:
Why does summarize or mutate not work with group_by when I load `plyr` after `dplyr`?
(2 answers)
Closed 2 years ago.
I am using dplyr to make a sumIF function on my data frame. However, it does not give me the desired output:
> dput(sys)
structure(list(NUMERIC = c(244L, 24L, 1L, 2L, 4L, 111L, 23L,
2L, 3L, 4L, 24L), VAL = c("FALSE", "FALSE", "TES", "TEST", "TRUE",
"TRUE", "TRUE", "asdfs", "asdfs", "safd", "sd"), IDENTIFIER = c(99L,
99L, 98L, 98L, 99L, 99L, 99L, 13L, 13L, 99L, 12L)), .Names = c("NUMERIC",
"VAL", "IDENTIFIER"), row.names = c(NA, 11L), class = c("grouped_dt",
"tbl_dt", "tbl", "grouped_dt", "tbl_dt", "tbl", "data.table",
"data.frame"), .internal.selfref = <pointer: 0x0000000000100788>, sorted = c("VAL",
"IDENTIFIER"), vars = list(VAL, IDENTIFIER))
>
>
> sys <- group_by(sys, VAL, IDENTIFIER)
> df.summary <- summarise(sys,
+ numeric = sum(NUMERIC)
+ )
>
> (df.summary)
numeric
1 442
My desired result should look like this (one NUMERIC sum per VAL/IDENTIFIER group):
Any recommendation as to what I am doing wrong?
This can occur when you have plyr loaded along with dplyr: plyr's summarise() masks dplyr's version. You can either start a new R session or call the dplyr function explicitly:
dplyr::summarise(sys,
numeric = sum(NUMERIC)
)
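If you want to double-check the grouped sums without worrying about masking at all, base R's aggregate() computes the same "sumIF" (data re-typed from the dput above):

```r
sys <- data.frame(
  NUMERIC = c(244L, 24L, 1L, 2L, 4L, 111L, 23L, 2L, 3L, 4L, 24L),
  VAL = c("FALSE", "FALSE", "TES", "TEST", "TRUE", "TRUE", "TRUE",
          "asdfs", "asdfs", "safd", "sd"),
  IDENTIFIER = c(99L, 99L, 98L, 98L, 99L, 99L, 99L, 13L, 13L, 99L, 12L))

# Sum NUMERIC within each VAL/IDENTIFIER group.
out <- aggregate(NUMERIC ~ VAL + IDENTIFIER, data = sys, FUN = sum)
out[out$VAL == "FALSE", "NUMERIC"]  # 268 (= 244 + 24)
```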
