Percentile in a data frame using two columns - r

Perhaps it's an easy problem but I'm stuck.
My data frame (which comes from a yearly survey) contains length data of several species by year and by haul. I want to obtain, for each year, the 95th percentile for each species. A sample of my dataframe,
structure(list(year = c(2015L, 2015L, 2015L, 2015L, 2014L, 2016L,
2015L, 2016L, 2014L, 2016L, 2015L, 2015L, 2016L, 2016L, 2014L, 2014L,
2014L, 2015L, 2016L, 2016L), cod_haul = structure(c(72L, 51L, 77L,
43L, 20L, 92L, 75L, 93L, 9L, 103L, 65L, 63L, 85L, 102L, 27L, 24L,
14L, 55L, 114L, 105L), .Label = c("N14_02", "N14_03", "N14_04",
"N14_06", "N14_07", "N14_08", "N14_10", "N14_13", "N14_16", "N14_17",
"N14_19", "N14_21", "N14_24", "N14_25", "N14_26", "N14_27", "N14_28",
"N14_29", "N14_30", "N14_32", "N14_33", "N14_35", "N14_37", "N14_39",
"N14_40", "N14_41", "N14_42", "N14_44", "N14_51", "N14_54", "N14_55",
"N14_56", "N14_57", "N14_58", "N14_61", "N14_62", "N14_64", "N14_66",
"N14_67", "N15_01", "N15_03", "N15_07", "N15_11", "N15_12", "N15_14",
"N15_16", "N15_18", "N15_19", "N15_20", "N15_22", "N15_23", "N15_24",
"N15_25", "N15_26", "N15_27", "N15_28", "N15_29", "N15_30", "N15_31",
"N15_32", "N15_36", "N15_37", "N15_39", "N15_41", "N15_44", "N15_46",
"N15_47", "N15_48", "N15_52", "N15_55", "N15_56", "N15_58", "N15_59",
"N15_60", "N15_62", "N15_63", "N15_64", "N15_66", "N15_67", "N16_04",
"N16_06", "N16_07", "N16_08", "N16_11", "N16_12", "N16_13", "N16_15",
"N16_17", "N16_18", "N16_20", "N16_22", "N16_23", "N16_25", "N16_28",
"N16_29", "N16_30", "N16_31", "N16_32", "N16_33", "N16_34", "N16_35",
"N16_37", "N16_40", "N16_41", "N16_45", "N16_46", "N16_47", "N16_48",
"N16_49", "N16_50", "N16_51", "N16_52", "N16_53", "N16_54", "N16_56",
"N16_58", "N16_60", "N16_61", "N16_62", "N16_63", "N16_64","N16_66"),
class = "factor"), haul = c(58L, 23L, 64L, 11L, 32L, 23L, 62L, 25L,
16L, 40L, 44L, 39L, 12L, 37L, 42L, 39L, 25L, 27L, 54L, 45L), name =
structure(c(2L, 23L, 11L, 2L, 19L, 15L, 18L, 16L, 3L, 21L, 16L, 21L,
20L, 19L, 3L, 18L, 16L, 11L, 7L, 13L), .Label = c("Argentina
sphyraena", "Arnoglossus laterna", "Blennius ocellaris", "Boops
boops", "Callionymus lyra", "Callionymus maculatus", "Capros aper",
"Cepola macrophthalma", "Chelidonichthys cuculus", "Chelidonichthys
lucerna", "Conger conger", "Eutrigla gurnardus", "Gadiculus
argenteus", "Galeus melastomus", "Helicolenus dactylopterus",
"Lepidorhombus boscii", "Lepidorhombus whiffiagonis", "Merluccius
merluccius", "Microchirus variegatus", "Micromesistius poutassou",
"Phycis blennoides", "Raja clavata", "Scyliorhinus canicula",
"Solea solea", "Trachurus trachurus", "Trisopterus luscus"), class
= "factor"), length = c(9L, 18L, 50L, 12L, 14L, 12L, 31L, 19L, 15L,
16L, 26L, 48L, 23L, 10L, 16L, 24L, 12L, 46L, 75L, 13L), number =
c(5L, 4L, 1L, 2L, 29L, 5L, 2L, 14L, 1L, 1L, 4L, 1L, 29L, 21L, 2L,
1L, 2L, 1L, 2L, 14L)), row.names = c(NA, 20L), class =
"data.frame")
I haven't been able to find how to solve it even though I have tried several approaches, but none worked.
Any suggestions or advice is much appreciated.
Thanks!
Ps: Although it isn't absolutely necessary, it would be great if the percentile could be added to the dataframe as a new column.

# For each year and species, compute the 95th percentile of the observed
# lengths. Note: the sample data has no `species` column -- the species is
# stored in `name` and the measurement in `length`, so group on both keys.
df %>%
  group_by(year, name) %>%
  summarize(length.95 = quantile(length, 0.95))
I cannot download your dataframe but you can use the quantile function to find the 95th percentile for each species.

if I get you right
library(tidyverse)
# Compute the 95th percentile of length per year/species group and attach it
# as a new column; mutate() (unlike summarize) keeps every original row, so
# each observation carries its group's q95 -- which is what the asker wanted.
df %>%
group_by(year, name) %>%
mutate(q95 = quantile(length, probs = 0.95))
or
# data.table equivalent: setDT() converts df in place, `:=` adds the q95
# column by reference within each (year, name) group, and the chained
# [order(name, year)] just prints the result sorted by species then year.
library(data.table)
setDT(df)
df[, q95 := quantile(length, probs = 0.95), by = list(year, name)][order(name, year)]

Related

ANOVA error: why is each row of output *not* identified by a unique combination of keys?

I have a two-way ANOVA test (w/repeated measures) that I'm using with four almost identical datasets:
> res.aov <- anova_test(
+ data = LST_Weather_dataset_N, dv = LST, wid = Month,
+ within = c(Buffer, TimePeriod),
+ effect.size = "ges",
+ detailed = TRUE,
+ )
Where:
LST = surface temperature deviation in C
Month = 1-12
Buffer = a value 100-1900 - one of 19 areas outward from the boundary of a solar power plant (each 100m wide)
TimePeriod = a factor with a value of 1 or 2 corresponding to pre-/post-construction of a solar power plant.
For one dataset I get the error:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 38 rows:
* 10, 11
* 217, 218
* 240, 241
* 263, 264
* 286, 287
* 309, 310
* 332, 333
...
As far as I can tell I have unique combinations.
dplyr::count(LST_Weather_dataset_N, LST, Month, Buffer, TimePeriod, sort = TRUE)
returns
LST Month Buffer TimePeriod n
1 -6.309045316 12 100 2 1
2 -5.655279925 9 1000 2 1
3 -5.224196295 12 200 2 1
4 -5.194473224 9 1100 2 1
5 -5.025429891 12 400 2 1
6 -4.987575966 9 700 2 1
7 -4.979453868 12 600 2 1
8 -4.825298768 12 300 2 1
9 -4.668994574 12 500 2 1
10 -4.652282192 12 700 2 1
...
'n' is always 1.
I can't work out why this is happening.
Extract of dataframe below:
> dput(LST_Weather_dataset_N[sample(1:nrow(LST_Weather_dataset_N), 50),])
structure(list(Buffer = c(1400L, 700L, 300L, 1400L, 100L, 200L,
1700L, 100L, 800L, 1900L, 1100L, 100L, 700L, 800L, 1400L, 400L,
1300L, 200L, 1200L, 500L, 1200L, 1300L, 400L, 1000L, 1300L, 1100L,
100L, 300L, 300L, 600L, 1100L, 1400L, 1500L, 1600L, 1700L, 1800L,
1700L, 1300L, 1200L, 300L, 1100L, 1900L, 1700L, 700L, 1400L,
1200L, 1600L, 1700L, 1900L, 1300L), Date = c("02/05/2014", "18/01/2017",
"19/06/2014", "25/12/2013", "15/09/2017", "08/04/2017", "22/08/2014",
"21/07/2014", "13/07/2017", "25/12/2013", "22/10/2013", "02/05/2014",
"07/03/2017", "15/03/2014", "13/07/2017", "19/06/2014", "25/12/2013",
"17/10/2017", "16/04/2014", "06/10/2013", "15/09/2017", "18/01/2017",
"10/01/2014", "17/12/2016", "13/07/2017", "19/06/2014", "07/03/2017",
"15/03/2014", "11/02/2014", "22/10/2013", "06/10/2013", "15/09/2017",
"16/04/2014", "18/01/2017", "15/03/2014", "21/07/2014", "17/10/2017",
"15/09/2017", "10/01/2014", "23/09/2014", "16/04/2014", "22/10/2013",
"11/06/2017", "26/05/2017", "19/06/2014", "14/08/2017", "11/02/2014",
"26/02/2017", "26/02/2017", "11/02/2014"), LST = c(1.255502397,
4.33385966, 3.327025603, -0.388631166, -0.865430798, 4.386292648,
-0.243018665, 3.276865987, 0.957036835, -0.065821795, 0.69731779,
4.846851651, -1.437700684, 1.003808572, 0.572460421, 2.995902374,
-0.334633662, -1.231447567, 0.644520741, 0.808262029, -3.392959991,
2.324569449, 2.346707612, -3.124354627, 0.58719862, 1.904859254,
1.701580958, 2.792443253, 1.638270039, 1.460743317, 0.699767335,
-3.015643366, 0.930527864, 1.309519336, 0.477789664, 0.147584938,
-0.498188865, -3.506795723, -1.007487965, 1.149604087, 1.192366386,
0.197471474, 0.999391224, -0.190613618, 1.27324015, 2.686622796,
0.573109026, 0.97847983, 0.395005095, -0.40855426), Month = c(5L,
1L, 6L, 12L, 9L, 4L, 8L, 7L, 7L, 12L, 10L, 5L, 3L, 3L, 7L, 6L,
12L, 10L, 4L, 10L, 9L, 1L, 1L, 12L, 7L, 6L, 3L, 3L, 2L, 10L,
10L, 9L, 4L, 1L, 3L, 7L, 10L, 9L, 1L, 9L, 4L, 10L, 6L, 5L, 6L,
8L, 2L, 2L, 2L, 2L), Year = c(2014L, 2017L, 2014L, 2013L, 2017L,
2017L, 2014L, 2014L, 2017L, 2013L, 2013L, 2014L, 2017L, 2014L,
2017L, 2014L, 2013L, 2017L, 2014L, 2013L, 2017L, 2017L, 2014L,
2016L, 2017L, 2014L, 2017L, 2014L, 2014L, 2013L, 2013L, 2017L,
2014L, 2017L, 2014L, 2014L, 2017L, 2017L, 2014L, 2014L, 2014L,
2013L, 2017L, 2017L, 2014L, 2017L, 2014L, 2017L, 2017L, 2014L
), JulianDay = c(122L, 18L, 170L, 359L, 258L, 98L, 234L, 202L,
194L, 359L, 295L, 122L, 66L, 74L, 194L, 170L, 359L, 290L, 106L,
279L, 258L, 18L, 10L, 352L, 194L, 170L, 66L, 74L, 42L, 295L,
279L, 258L, 106L, 18L, 74L, 202L, 290L, 258L, 10L, 266L, 106L,
295L, 162L, 146L, 170L, 226L, 42L, 57L, 57L, 42L), TimePeriod = c(1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L, 2L,
1L), Temperature = c(28L, 9L, 31L, 12L, 27L, 21L, 29L, 36L, 38L,
12L, 23L, 28L, 12L, 21L, 38L, 31L, 12L, 23L, 25L, 22L, 27L, 9L,
11L, 7L, 38L, 31L, 12L, 21L, 14L, 23L, 22L, 27L, 25L, 9L, 21L,
36L, 23L, 27L, 11L, 31L, 25L, 23L, 29L, 27L, 31L, 34L, 14L, 16L,
16L, 14L), Humidity = c(6L, 34L, 7L, 31L, 29L, 22L, 34L, 15L,
19L, 31L, 16L, 6L, 14L, 14L, 19L, 7L, 31L, 12L, 9L, 12L, 29L,
34L, 33L, 18L, 19L, 7L, 14L, 14L, 31L, 16L, 12L, 29L, 9L, 34L,
14L, 15L, 12L, 29L, 33L, 18L, 9L, 16L, 8L, 13L, 7L, 13L, 31L,
31L, 31L, 31L), Wind_speed = c(6L, 0L, 6L, 7L, 13L, 33L, 6L,
20L, 9L, 7L, 0L, 6L, 0L, 6L, 9L, 6L, 7L, 6L, 0L, 7L, 13L, 0L,
0L, 35L, 9L, 6L, 0L, 6L, 6L, 0L, 7L, 13L, 0L, 0L, 6L, 20L, 6L,
13L, 0L, 0L, 0L, 0L, 24L, 11L, 6L, 24L, 6L, 26L, 26L, 6L), Wind_gust = c(0L,
0L, 0L, 0L, 0L, 54L, 0L, 46L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 46L, 0L, 0L, 0L, 0L, 0L, 0L, 48L, 0L, 0L, 39L,
0L, 41L, 41L, 0L), Wind_trend = c(1L, 0L, 1L, 1L, 2L, 2L, 0L,
1L, 2L, 1L, 0L, 1L, 0L, 1L, 2L, 1L, 1L, 0L, 0L, 2L, 2L, 0L, 1L,
1L, 2L, 1L, 0L, 1L, 1L, 0L, 2L, 2L, 0L, 0L, 1L, 1L, 0L, 2L, 1L,
1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), Wind_direction = c(0,
0, 0, 337.5, 360, 22.5, 0, 22.5, 0, 337.5, 0, 0, 0, 0, 0, 0,
337.5, 180, 0, 247.5, 360, 0, 0, 180, 0, 0, 0, 0, 337.5, 0, 247.5,
360, 0, 0, 0, 22.5, 180, 360, 0, 0, 0, 0, 360, 22.5, 0, 360,
337.5, 360, 360, 337.5), Pressure = c(940.2, 943.64, 937.69,
951.37, 932.69, 933.94, 937.07, 938.01, 937.69, 951.37, 939.72,
940.2, 948.33, 947.71, 937.69, 937.69, 951.37, 943.32, 932.69,
944.71, 932.69, 943.64, 942.31, 943.01, 937.69, 937.69, 948.33,
947.71, 941.94, 939.72, 944.71, 932.69, 932.69, 943.64, 947.71,
938.01, 943.32, 932.69, 942.31, 938.94, 932.69, 939.72, 928.31,
931.12, 937.69, 932.37, 941.94, 936.13, 936.13, 941.94), Pressure_trend = c(1L,
2L, 0L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 0L, 2L,
1L, 2L, 1L, 0L, 2L, 2L, 2L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 0L, 2L,
2L, 1L, 1L, 1L, 0L, 2L, 1L, 2L, 1L, 0L, 0L, 0L, 1L, 1L, 2L, 2L,
1L)), row.names = c(179L, 14L, 195L, 426L, 306L, 118L, 299L,
229L, 244L, 436L, 374L, 153L, 90L, 91L, 256L, 197L, 424L, 348L,
137L, 355L, 328L, 26L, 7L, 419L, 254L, 211L, 78L, 81L, 43L, 359L,
373L, 332L, 143L, 32L, 109L, 263L, 393L, 330L, 23L, 309L, 135L,
398L, 224L, 166L, 217L, 290L, 69L, 72L, 76L, 63L), class = "data.frame")
Well, this is a bit embarrassing.
The error arose as there were not, in fact, paired months of the data. Rather than there being 38 data (19x2) for each month, due to an error in determining the month value one month had 57 data (19x3). Correcting this, and checking that each month had the same number of paired data for the ANOVA allowed the test to run successfully.
> res.aov <- anova_test(
+ data = LST_Weather_dataset_N, dv = LST, wid = Month,
+ within = c(Buffer, TimePeriod),
+ effect.size = "ges",
+ detailed = TRUE,
+ )
> get_anova_table(res.aov, correction = "auto")
ANOVA Table (type III tests)
Effect DFn DFd SSn SSd F p p<.05 ges
1 (Intercept) 1 11 600.135 974.584 6.774 2.50e-02 * 0.189
2 Buffer 18 198 332.217 331.750 11.015 2.05e-21 * 0.115
3 TimePeriod 1 11 29.561 977.945 0.333 5.76e-01 0.011
4 Buffer:TimePeriod 18 198 13.055 283.797 0.506 9.53e-01 0.005
I still don't understand how the error message was telling me this, though.

Changing the Font of ggplot text?

So I'm trying to change the font in ggplot of my graph title and labels. I want to set the font to calibri but no matter what I do I keep getting the following error message:
1: In grid.Call.graphics(C_text, as.graphicsAnnot(x$label), x$x, x$y, :
font family 'Calibri' not found, will use 'sans' instead
I've done the following to try and load fonts
library(extrafont)
font_import()
loadfonts(device = "win")
But when I'm making the graph using the following code I get the error message
churchplot <- ggplot(church, aes(x = year, y = Great.deal.Quite.a.lot, color = Great.deal.Quite.a.lot)) + geom_line(size = 1.5, color = "palegreen4") +
theme(axis.line = element_blank(),
panel.grid.minor.x = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(color = "gray85"),
panel.border = element_blank(),
panel.background = element_blank()) +
geom_hline(aes(yintercept = 0), size = .5) +
expand_limits(y = 0) +
scale_y_continuous(expand = c(NA, 0), limits = c(0, 85)) +
theme(axis.text.x = element_text(vjust = 1, size = 11),
axis.text.y = element_text(size = 11),
axis.title.x = element_text(vjust = -1, hjust = .01 ,size = 15),
plot.title = element_text(size = 16, face = "bold")) +
ggtitle("Trust in the Church and Organized Religion") +
ylab("") +
xlab("Year") +
labs(color = "Trust in Church") +
annotate(geom="text", x = 2011, y = 58, label = "Title",
color="forestgreen", size = 5, fontface = "bold", family = "Calibri")
churchplot
dput(church)
structure(list(year = c(2020L, 2019L, 2018L, 2017L, 2016L, 2015L,
2014L, 2013L, 2012L, 2011L, 2010L, 2009L, 2008L, 2007L, 2006L,
2005L, 2004L, 2003L, 2002L, 2001L, 2000L, 1999L, 1998L, 1997L,
1996L, 1995L, 1994L, 1993L, 1991L, 1991L, 1990L, 1989L, 1988L,
1987L, 1986L, 1985L, 1984L, 1983L, 1981L, 1979L, 1977L, 1975L,
1973L), Great.deal = c(25L, 21L, 20L, 23L, 20L, 25L, 25L, 25L,
25L, 25L, 25L, 29L, 26L, 24L, 28L, 31L, 26L, 27L, 26L, 32L, 28L,
32L, 34L, 35L, 30L, 32L, 29L, 29L, 31L, 33L, 33L, 30L, 35L, 35L,
34L, 42L, 41L, 39L, 40L, 40L, 38L, 44L, 43L), Quite.a.lot = c(17L,
15L, 18L, 18L, 21L, 17L, 20L, 23L, 19L, 23L, 23L, 23L, 22L, 22L,
24L, 22L, 27L, 23L, 19L, 28L, 28L, 26L, 25L, 21L, 27L, 25L, 25L,
24L, 25L, 26L, 23L, 22L, 24L, 26L, 23L, 24L, 23L, 23L, 24L, 25L,
26L, 24L, 22L), Some = c(31L, 36L, 33L, 29L, 31L, 32L, 29L, 32L,
29L, 29L, 30L, 29L, 31L, 30L, 26L, 28L, 28L, 30L, 32L, 24L, 26L,
28L, 26L, 28L, 27L, 28L, 29L, 29L, 27L, 26L, 26L, 26L, 27L, 28L,
27L, 21L, 22L, 26L, 20L, 21L, 20L, 20L, 21L), Very.little = c(23L,
25L, 24L, 25L, 24L, 20L, 20L, 17L, 22L, 20L, 18L, 14L, 15L, 21L,
19L, 16L, 15L, 17L, 18L, 13L, 14L, 12L, 12L, 12L, 13L, 11L, 14L,
14L, 12L, 12L, 14L, 17L, 11L, 10L, 12L, 11L, 13L, 9L, 11L, 11L,
13L, 9L, 7L), None..vol.. = c(3L, 4L, 3L, 3L, 3L, 3L, 4L, 2L,
4L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 2L, 2L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 2L, 1L, 2L, 3L, 2L, 1L, 3L, 1L, NA, 1L, 5L, 1L,
1L, 1L, 4L), No.opinion = c(1L, 0L, 2L, 2L, 1L, 2L, 2L, 2L, 1L,
1L, 2L, 2L, 3L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L,
2L, 1L, 1L, 3L, 2L, 2L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 3L,
2L, 2L), Great.deal.Quite.a.lot = c(42L, 36L, 38L, 41L, 41L,
42L, 45L, 48L, 44L, 48L, 48L, 52L, 48L, 46L, 52L, 53L, 53L, 50L,
45L, 60L, 56L, 58L, 59L, 56L, 57L, 57L, 54L, 53L, 56L, 59L, 56L,
52L, 59L, 61L, 57L, 66L, 64L, 62L, 64L, 65L, 64L, 68L, 65L)), class = "data.frame", row.names = c(NA,
-43L))
This works for me :
#install.packages('extrafont')
library(extrafont)
library(ggplot2)
font_import() #Import all fonts
ggplot(mtcars, aes(mpg, disp)) +
geom_point() +
annotate(geom="text", x = 30, y = 200, label = "This is Title",
color="forestgreen",size = 5, fontface = "bold", family="Comic Sans MS")
To know all the available fonts you can use fonts() function.
fonts()
# [1] "CM Roman" "CM Roman Asian" "CM Roman CE"
# [4] "CM Roman Cyrillic" "CM Roman Greek" "CM Sans"
# [7] "CM Sans Asian" "CM Sans CE" "CM Sans Cyrillic"
# [10] "CM Sans Greek" "CM Symbol" "CM Typewriter"
# [13] "CM Typewriter Asian" "CM Typewriter CE" "CM Typewriter Cyrillic"
# [16] "CM Typewriter Greek" ".SF Compact Rounded" ".Keyboard"
# [19] ".New York" ".SF Compact" "System Font"
# [22] ".SF NS Mono" ".SF NS Rounded" "Academy Engraved LET"
# [25] "Andale Mono" "Apple Braille" "AppleMyungjo"
#...
#...

create sql expression in R for certain condition

I get the data from the sql server to perform regression analysis, and then the regression results i return back to another sql table.
library("RODBC")
library(sqldf)
dbHandle <- odbcDriverConnect("driver={SQL Server};server=MYSERVER;database=MYBASE;trusted_connection=true")
sql <-
"select
Dt
,CustomerName
,ItemRelation
,SaleCount
,DocumentNum
,DocumentYear
,IsPromo
from dbo.mytable"
df <- sqlQuery(dbHandle, sql)
After this query i must perform regression analysis separately for groups
# Fit a simple linear regression of sales count on the promo flag for one
# group's data; returns the lm fit object for later tidying.
my_lm <- function(df) {
  lm(SaleCount ~ IsPromo, data = df)
}
reg=df %>%
group_by(CustomerName,ItemRelation,DocumentNum,DocumentYear) %>%
nest() %>%
mutate(fit = map(data, my_lm),
tidy = map(fit, tidy)) %>%
select(-fit, - data) %>%
unnest()
View(reg)
#save to sql table
sqlSave(dbHandle, as.data.frame(reg), "dbo.mytableforecast", verbose = TRUE) # use "append = TRUE" to add rows to an existing table
odbcClose(dbHandle)
The question:
The script works automatically, i.e. in the scheduler there is task that script in certain time was launched.
For example, today was loaded 100 observations.
From 01.01.2017-10.04.2017
Script performed regression and returned data to sql table.
Tomorrow will loaded new 100 observations.
11.04.2017-20.07.2017
I.E. when tomorrow the data will loaded and the script will start at 10 pm, it must work only with data from 11.04.2017-20.07.2017, and not from 01.01.2017-20.07.2017
the situation is complicated by the fact that after the regression the column Dt is dropped, so the solution given me here does not work
Automatic transfer data from the sql to R
because Dt is absent.
How can i set the condition for schedule select Dt ,CustomerName ,ItemRelation ,SaleCount ,DocumentNum ,DocumentYear ,IsPromo from dbo.mytable "where Dt>the last date when the script was launched"
is it possible to create this expression?
data example from sql
df=structure(list(Dt = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 12L, 13L, 13L, 14L, 14L,
15L, 15L, 16L, 16L, 16L, 16L, 17L, 17L, 17L, 17L, 18L, 18L, 18L,
18L, 19L), .Label = c("2017-10-12 00:00:00.000", "2017-10-13 00:00:00.000",
"2017-10-14 00:00:00.000", "2017-10-15 00:00:00.000", "2017-10-16 00:00:00.000",
"2017-10-17 00:00:00.000", "2017-10-18 00:00:00.000", "2017-10-19 00:00:00.000",
"2017-10-20 00:00:00.000", "2017-10-21 00:00:00.000", "2017-10-22 00:00:00.000",
"2017-10-23 00:00:00.000", "2017-10-24 00:00:00.000", "2017-10-25 00:00:00.000",
"2017-10-26 00:00:00.000", "2017-10-27 00:00:00.000", "2017-10-28 00:00:00.000",
"2017-10-29 00:00:00.000", "2017-10-30 00:00:00.000"), class = "factor"),
CustomerName = structure(c(1L, 11L, 12L, 13L, 14L, 15L, 16L,
17L, 18L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 1L, 11L, 12L,
13L, 14L, 15L, 16L, 17L, 18L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L, 10L), .Label = c("x1", "x10", "x11", "x12", "x13", "x14",
"x15", "x16", "x17", "x18", "x2", "x3", "x4", "x5", "x6",
"x7", "x8", "x9"), class = "factor"), ItemRelation = c(13322L,
13322L, 13322L, 13322L, 13322L, 13322L, 13322L, 11706L, 13322L,
11706L, 13322L, 11706L, 13322L, 11706L, 13322L, 11706L, 13322L,
11706L, 13322L, 11706L, 13322L, 11706L, 13322L, 11706L, 13163L,
13322L, 158010L, 11706L, 13163L, 13322L, 158010L, 11706L,
13163L, 13322L, 158010L, 11706L), SaleCount = c(10L, 3L,
1L, 0L, 9L, 5L, 5L, 11L, 7L, 0L, 5L, 11L, 1L, 0L, 0L, 19L,
10L, 0L, 1L, 12L, 1L, 11L, 6L, 0L, 167L, 7L, 0L, 16L, 165L,
1L, 0L, 0L, 29L, 0L, 0L, 11L), DocumentNum = c(36L, 36L,
36L, 36L, 36L, 36L, 36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L,
36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L, 36L, 51L, 131L, 36L,
89L, 51L, 131L, 36L, 89L, 51L, 131L, 36L, 89L, 51L), DocumentYear = c(2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L, 2017L),
IsPromo = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L)), .Names = c("Dt", "CustomerName",
"ItemRelation", "SaleCount", "DocumentNum", "DocumentYear", "IsPromo"
), class = "data.frame", row.names = c(NA, -36L))
Consider saving the max DT (retrieved before regression that drops field) in a log file at the end of your scheduled script, then add a log read-in at beginning of script for the last logged date to include in WHERE clause:
# READ DATE FROM LOG FILE
log_dt <- readLines("/path/to/SQL_MaxDate.txt", warn=FALSE)
# QUERY WITH WHERE CLAUSE
sql <- paste0("SELECT Dt, CustomerName, ItemRelation, SaleCount,
DocumentNum, DocumentYear, IsPromo
FROM dbo.mytable WHERE Dt > '", log_dt, "'")
df <- sqlQuery(dbHandle, sql)
# RETRIEVE MAX DATE VALUE
max_DT <- as.character(max(df$Dt))
# ... regression
# WRITE DATE TO LOG FILE
cat(max_DT, file="/path/to/SQL_MaxDate.txt")
Better yet, use parameterization with RODBCext to avoid string concatenation and quoting:
library(RODBC)
library(RODBCext)
# READ DATE FROM LOG FILE (last max Dt saved by the previous scheduled run)
log_dt <- readLines("/path/to/SQL_MaxDate.txt", warn=FALSE)
dbHandle <- odbcDriverConnect(...)
# PREPARED STATEMENT WITH PLACEHOLDER (? is bound safely, no string pasting)
sql <- "SELECT Dt, CustomerName, ItemRelation, SaleCount,
DocumentNum, DocumentYear, IsPromo
FROM dbo.mytable WHERE Dt > ?"
# EXECUTE QUERY BINDING PARAM VALUE
df <- sqlExecute(dbHandle, sql, log_dt, fetch=TRUE)
# RETRIEVE MAX DATE VALUE (must happen before regression drops the Dt column)
max_DT <- as.character(max(df$Dt))
# ... regression
# WRITE DATE TO LOG FILE for the next run's WHERE clause
cat(max_DT, file="/path/to/SQL_MaxDate.txt")

Ordering factor levels with the same name across different variables - facet_grid in ggplot2

I am plotting a timeline or time series plot of birds that I attached radio-transmitters to, and followed over the course of a breeding season. The timeline shows when I first tagged a bird, and when I stopped tracking a bird.
Each bird is labelled with its radio tag frequency, which is 6 numbers (151.XXX). 2 radio tags, 151.164 and 151.094, were used in both years. This is problematic when I use facet_grid to plot the timelines by year. Basically all birds are plotted, but this messes up the ordering of the plots (I would like to have the initial date a bird was tagged to go in order; see plot below). Is it possible to keep these factor levels the same but fix the ordering?
My data frame looks like this:
dat<-structure(list(freq = structure(c(21L, 1L, 32L, 8L, 11L, 16L,
5L, 30L, 13L, 26L, 10L, 19L, 22L, 34L, 23L, 4L, 17L, 33L, 36L,
3L, 14L, 31L, 24L, 35L, 15L, 20L, 27L, 29L, 6L, 18L, 28L, 25L,
12L, 9L, 7L, 2L), .Label = c("094_1", "094_2", "11", "112", "122",
"132", "152", "164_1", "164_2", "179", "191", "216", "226", "231",
"250", "251", "265", "280", "295", "338", "34", "372", "38",
"385", "429", "46", "475", "53", "558", "57", "71", "72", "831",
"876", "919", "965"), class = "factor"), site = structure(c(3L,
4L, 3L, 3L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 3L,
3L, 3L, 3L), .Label = c("BSLP", "GSPR", "HSGL", "SCFA"), class = "factor"),
zone = c(18L, 17L, 18L, 18L, 17L, 18L, 18L, 18L, 18L, 18L,
18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L, 18L,
17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 17L, 18L, 18L,
18L, 18L), year = c(2014L, 2014L, 2014L, 2014L, 2014L, 2014L,
2014L, 2014L, 2014L, 2014L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L,
2015L, 2015L, 2015L), jstart = c(106L, 113L, 119L, 119L,
122L, 124L, 125L, 125L, 128L, 131L, 104L, 104L, 104L, 104L,
104L, 105L, 105L, 105L, 105L, 105L, 106L, 106L, 109L, 109L,
110L, 110L, 110L, 110L, 111L, 111L, 115L, 121L, 139L, 142L,
147L, 159L), jend = c(147L, 154L, 133L, 129L, 143L, 170L,
168L, 170L, 164L, 178L, 153L, 160L, 147L, 156L, 140L, 141L,
160L, 122L, 166L, 160L, 133L, 160L, 149L, 160L, 162L, 158L,
155L, 159L, 162L, 163L, 156L, 160L, 194L, 173L, 196L, 202L
)), .Names = c("freq", "site", "zone", "year", "jstart",
"jend"), class = "data.frame", row.names = c(NA, -36L))
And here is the code:
library(ggplot2)
library(RColorBrewer)
dat$freq<-as.factor(dat$freq)
dat$year<-as.factor(dat$year)
# change the factor names:
levels(dat$freq) #just checking the levels
levels(dat$freq)[levels(dat$freq)=="094_1"] <- "151.094"
levels(dat$freq)[levels(dat$freq)=="094_2"] <- "151.094"
levels(dat$freq)[levels(dat$freq)=="164_1"] <- "151.164"
levels(dat$freq)[levels(dat$freq)=="164_2"] <- "151.164"
levels(dat$freq)[levels(dat$freq)=="34"] <- "151.034"
levels(dat$freq)[levels(dat$freq)=="72"] <- "151.072"
levels(dat$freq)[levels(dat$freq)=="191"] <- "151.191"
levels(dat$freq)[levels(dat$freq)=="251"] <- "151.251"
levels(dat$freq)[levels(dat$freq)=="122"] <- "151.122"
levels(dat$freq)[levels(dat$freq)=="57"] <- "151.057"
levels(dat$freq)[levels(dat$freq)=="226"] <- "151.226"
levels(dat$freq)[levels(dat$freq)=="179"] <- "151.179"
levels(dat$freq)[levels(dat$freq)=="295"] <- "151.295"
levels(dat$freq)[levels(dat$freq)=="372"] <- "151.372"
levels(dat$freq)[levels(dat$freq)=="876"] <- "151.876"
levels(dat$freq)[levels(dat$freq)=="38"] <- "151.038"
levels(dat$freq)[levels(dat$freq)=="112"] <- "151.112"
levels(dat$freq)[levels(dat$freq)=="265"] <- "151.265"
levels(dat$freq)[levels(dat$freq)=="831"] <- "151.831"
levels(dat$freq)[levels(dat$freq)=="965"] <- "151.965"
levels(dat$freq)[levels(dat$freq)=="11"] <- "151.011"
levels(dat$freq)[levels(dat$freq)=="231"] <- "151.231"
levels(dat$freq)[levels(dat$freq)=="71"] <- "151.071"
levels(dat$freq)[levels(dat$freq)=="385"] <- "151.385"
levels(dat$freq)[levels(dat$freq)=="919"] <- "151.919"
levels(dat$freq)[levels(dat$freq)=="250"] <- "151.250"
levels(dat$freq)[levels(dat$freq)=="338"] <- "151.338"
levels(dat$freq)[levels(dat$freq)=="475"] <- "151.475"
levels(dat$freq)[levels(dat$freq)=="558"] <- "151.558"
levels(dat$freq)[levels(dat$freq)=="132"] <- "151.132"
levels(dat$freq)[levels(dat$freq)=="280"] <- "151.280"
levels(dat$freq)[levels(dat$freq)=="53"] <- "151.053"
levels(dat$freq)[levels(dat$freq)=="429"] <- "151.429"
levels(dat$freq)[levels(dat$freq)=="216"] <- "151.216"
levels(dat$freq)[levels(dat$freq)=="152"] <- "151.152"
levels(dat$freq)[levels(dat$freq)=="46"] <- "151.046"
# order the factors:
dat$freq <- factor(dat$freq, levels = rev(dat$freq[order(dat$year,dat$jstart)]))
dat$freq # notice the changed order of factor levels
# plot the data:
ggplot(dat) +
geom_segment(aes(x = jstart, y = freq, xend = jend, yend = freq, color=site), alpha=0.5, size = 3) +
scale_color_brewer(palette="Set1", name="Study site") +
xlab("Date") +
ylab("Bird ID") +
theme_bw() +
theme(legend.position="bottom",
legend.title = element_text(colour="black", size=16, face="bold"),
legend.text = element_text(colour="black", size = 14, face = "plain"),
axis.title.x = element_text(colour="black", size=16, vjust=-0.1, face="bold"),
axis.text.x = element_text(angle=0, vjust=1, size=14, colour="black"),
axis.title.y =element_text(colour="black", size=16, vjust=1.3, face="bold"),
axis.text.y = element_text(angle=0, size=14, colour="black"),
strip.text.y = element_text(size=18)) +
facet_grid(year ~ ., scales = "free_y", space = "free_y") +
scale_x_continuous(labels = function(x) format(as.Date(as.character(x), "%j"), "%d-%b"))
The factors that are present in both years, 151.164 and 151.094, do not sort correctly for 2014.
Is there a way to correct this? I would like to keep these factor level names the same (i.e. 151.XXX, and retain 151.164 and 151.094 for both years).
Thanks, Jay

R - Average columns by information in row X

I have a data.frame where the first 13 rows contain site/observation information. Each column represents 1 individual, however most individuals have an A and B observation (although some only have A while a few have an A, B, and C observation). I'd like to average each row for every individual, and create a new data.frame from this information.
Example (small subset with row 1, row 7, row 13, and row 56-61):
OriginalID Tree003A Tree003B Tree008B Tree013A
1 Township LY LY LY LY
7 COFECHA ID LY1A003A LY1A003B LY1A008B LY1A013A
13 PathLength 37.5455 54.8963 57.9732 64.0679
56 2006 1.538 1.915 0.827 2.722
57 2007 1.357 1.923 0.854 2.224
58 2008 1.311 2.204 0.669 2.515
59 2009 0.702 1.125 0.382 2.413
60 2010 0.937 1.556 0.907 2.315
61 2011 0.942 1.268 1.514 1.858
I'd like to create a new data.frame that averages each individual's annual observations, whether they have an A, A and B, or A B and C observation. Individual's IDs are in Row 7 (COFECHA ID):
Intended Output:
OriginalID Tree003avg Tree008avg Tree013avg
1 Township LY LY LY
7 COFECHA ID LY1A003avg LY1A008avg LY1A013avg
13 PathLength 46.2209 57.9732 64.0679
56 2006 1.727 0.827 2.722
57 2007 1.640 0.854 2.224
58 2008 1.758 0.669 2.515
59 2009 0.914 0.382 2.413
60 2010 1.247 0.907 2.315
61 2011 1.105 1.514 1.858
Any ideas on how to average the columns would be great. I've been trying to modify the following code, but due to the 13 rows of additional information at the top of the data.frame, I didn't know how to specify to only average rows 14:61.
rowMeans(subset(LY011B, select = c("LY1A003A", "LY1A003B")), na.rm=TRUE)
The code for a larger set of the data that I'm working with is:
> dput(LY011B)
structure(list(OriginalTreeID = structure(c(58L, 53L, 57L, 59L,
51L, 61L, 50L, 55L, 56L, 60L, 54L, 49L, 52L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L,
19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L, 29L, 30L, 31L,
32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 42L, 43L, 44L,
45L, 46L, 47L, 48L), .Label = c("1964", "1965", "1966", "1967",
"1968", "1969", "1970", "1971", "1972", "1973", "1974", "1975",
"1976", "1977", "1978", "1979", "1980", "1981", "1982", "1983",
"1984", "1985", "1986", "1987", "1988", "1989", "1990", "1991",
"1992", "1993", "1994", "1995", "1996", "1997", "1998", "1999",
"2000", "2001", "2002", "2003", "2004", "2005", "2006", "2007",
"2008", "2009", "2010", "2011", "AnalysisDateTime", "COFECHA ID",
"CoreLetter", "PathLength", "Plot#", "RingCount", "SiteID", "SP",
"Subplot#", "Township", "Tree#", "YearLastRing", "YearLastWhiteWood"
), class = "factor"), Tree003A = structure(c(35L, 8L, 34L, 7L,
34L, 21L, 36L, 31L, 37L, 30L, 32L, 29L, 33L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 23L, 22L, 25L, 28L, 27L, 24L, 26L, 20L, 16L,
15L, 6L, 18L, 12L, 10L, 3L, 9L, 11L, 19L, 17L, 14L, 13L, 2L,
4L, 5L), .Label = c("", "0.702", "0.803", "0.937", "0.942", "0.961",
"003", "1", "1.09", "1.116", "1.124", "1.224", "1.311", "1.357",
"1.471", "1.509", "1.538", "1.649", "1.679", "1.782", "1999",
"2.084", "2.148", "2.162", "2.214", "2.313", "2.429", "2.848",
"2/19/2014 11:06", "2011", "23017323011sp1", "24", "37.5455",
"A", "LY", "LY1A003A", "sp1"), class = "factor"), Tree003B = structure(c(56L,
19L, 54L, 18L, 55L, 49L, 57L, 51L, 58L, 50L, 52L, 48L, 53L, 1L,
1L, 1L, 1L, 10L, 7L, 8L, 6L, 5L, 4L, 3L, 2L, 11L, 9L, 30L, 15L,
24L, 20L, 23L, 33L, 37L, 42L, 13L, 44L, 36L, 12L, 16L, 21L, 27L,
35L, 41L, 38L, 26L, 40L, 14L, 46L, 32L, 28L, 17L, 31L, 22L, 39L,
43L, 45L, 47L, 25L, 34L, 29L), .Label = c("", "0.073", "0.092",
"0.173", "0.174", "0.358", "0.413", "0.425", "0.58", "0.697",
"0.719", "0.843", "0.883", "0.896", "0.937", "0.941", "0.964",
"003", "1", "1.048", "1.067", "1.075", "1.097", "1.119", "1.125",
"1.176", "1.207", "1.267", "1.268", "1.27", "1.297", "1.402",
"1.429", "1.556", "1.662", "1.693", "1.704", "1.735", "1.76",
"1.792", "1.816", "1.881", "1.915", "1.92", "1.923", "2.155",
"2.204", "2/19/2014 11:06", "2000", "2011", "23017323011sp1",
"48", "54.8963", "A", "B", "LY", "LY1A003B", "sp1"), class = "factor"),
Tree008B = structure(c(59L, 24L, 57L, 23L, 58L, 52L, 60L,
54L, 61L, 53L, 55L, 51L, 56L, 19L, 14L, 13L, 22L, 7L, 8L,
9L, 4L, 6L, 3L, 1L, 2L, 10L, 25L, 47L, 43L, 49L, 46L, 40L,
50L, 48L, 44L, 17L, 36L, 31L, 27L, 30L, 39L, 37L, 34L, 45L,
38L, 32L, 41L, 29L, 42L, 33L, 28L, 26L, 21L, 11L, 15L, 16L,
18L, 12L, 5L, 20L, 35L), .Label = c("0.302", "0.31", "0.318",
"0.357", "0.382", "0.412", "0.452", "0.476", "0.5", "0.539",
"0.591", "0.669", "0.673", "0.787", "0.79", "0.827", "0.835",
"0.854", "0.879", "0.907", "0.917", "0.967", "008", "1",
"1.027", "1.037", "1.141", "1.152", "1.172", "1.263", "1.383",
"1.411", "1.446", "1.498", "1.514", "1.611", "1.671", "1.685",
"1.695", "1.719", "1.783", "1.879", "1.884", "1.927", "1.97",
"2.019", "2.069", "2.35", "2.696", "2.979", "2/19/2014 11:06",
"2000", "2011", "23017323011sp1", "48", "57.9732", "A", "B",
"LY", "LY1A008B", "sp1"), class = "factor"), Tree013A = structure(c(45L,
6L, 44L, 5L, 44L, 38L, 46L, 40L, 47L, 39L, 42L, 37L, 43L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 10L,
13L, 8L, 22L, 14L, 18L, 24L, 4L, 11L, 25L, 7L, 36L, 41L,
33L, 29L, 17L, 28L, 23L, 21L, 16L, 26L, 15L, 3L, 20L, 12L,
2L, 9L, 34L, 35L, 27L, 32L, 31L, 30L, 19L), .Label = c("",
"0.608", "0.916", "0.945", "013", "1", "1.125", "1.18", "1.388",
"1.423", "1.493", "1.498", "1.554", "1.579", "1.619", "1.629",
"1.719", "1.756", "1.858", "1.867", "1.869", "1.876", "1.9",
"1.916", "2.023", "2.089", "2.224", "2.246", "2.247", "2.315",
"2.413", "2.515", "2.547", "2.645", "2.722", "2.785", "2/19/2014 11:11",
"2002", "2011", "23017323011sp1", "3.375", "34", "64.0679",
"A", "LY", "LY1A013A", "sp1"), class = "factor")), .Names = c("OriginalTreeID",
"Tree003A", "Tree003B", "Tree008B", "Tree013A"), row.names = c(NA,
61L), class = "data.frame")
Here is another approach where most of the work is done
by rearranging the data with the reshape package.
After the data is "munged", it can be rearranged into almost anything
you want with the cast function.
# I'm used to the transpose
y = t(x)
# Make the first row the column names
# Also get rid of hashes. They make things difficult
library(stringr)
colnames(y) = str_replace( y[1,], "#", "" )
y = data.frame(y[-1,],check.names=FALSE)
# reshape the data by defining the "ID" variables
library(reshape)
z = melt(y,id.vars=c("Township","Plot","Subplot","Tree",
"CoreLetter","COFECHA ID","SiteID","SP","AnalysisDateTime"))
z$value = as.numeric(as.character(z$value))
# Now you can do lots of things!
# All the info you wanted is there, but it's in a different format
# than your "intended output"
cast( z, Tree ~ variable, mean, na.rm=TRUE )
# To get to your "intended output"
out = cast( z, Township + Plot + Subplot + Tree ~ variable, mean, na.rm=TRUE )
out[["COFECHA ID"]] = with(out,paste0(Township,Plot,Subplot,Tree,"avg"))
out2 = out[,c(1,ncol(out),8:(ncol(out)-1))]
out3 = cbind(colnames(out2),t(out2))
colnames(out3) = c("OriginalID",paste0("Tree",out$Tree,"avg"))
# For kicks, here are some other things. Have fun!
cast(z, Tree ~ variable, median, na.rm=TRUE ) # the median instead of the mean
cast(z, Tree + CoreLetter ~ variable ) # back to your original data
cast(z, CoreLetter ~ variable, length ) # How many measurements from each core?
cast(z, CoreLetter ~ variable, mean ) # The average across different cores
For even more fun!
library(ggplot2)
d = z[-c(1:16), ] # A not so pretty hack
colnames(d)[10] = "Year"
d$Year = as.integer(as.character(d$Year))
ggplot(d, aes(x=Year, y=value, group=Tree, color=Tree, shape=CoreLetter)) +
geom_point() + geom_smooth(method="loess",span=0.3)
Does this mean that early 2000's were dry?
try this.....
d.f <- your data structure...above
subset the data
d.f <- d[-(1:13), -1]
c.n <- colnames(d.f)
build the grouping var
f <- gsub(".?$", "", c.n)
f <- d[4, 2:ncol(d)]
split the dataframe into sub-dataframes
d.f <- apply(d.f, 2, as.numeric)
d.f[is.na(d.f)] <- 0
d.f.g <- as.data.frame(t(d.f))
a <- split(d.f.g, f)
calculate the groupwise averages as colMeans (because transposed)
grp.means <- lapply(a, colMeans)
the grp.means is a list of dataframes each containing the date averages for each grp. re-form this as required, you'll probably want to transpose again.

Resources