I have a linear mixed effects model that determines change in grass based on both the previous year's grass and several environmental variables (and their interaction) at different distinct sites over time.
Using this mixed effects model and established, projected future environmental variables, I want to predict change in grass density. Each year's prediction thus depends on the previous year's density, located on the row above it in my dataframe. We begin with a real value from the present year, and then predict into the future.
library(tidyverse); library(lme4)
#data we have from the past, where each site has annual ChlA/Sal/Temp as well as grass density. our formula, change.mod, predicts grass.change, based on these env variables AND last year's grass coverage (grass.y1)
ThePast = tibble(
year = c(2017, 2018, 2019, 2020, 2021, 2017, 2018, 2019, 2020, 2021,2017, 2018, 2019, 2020, 2021),
site = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "C", "C"),
ChlA = c(50, 210, 190, 101, 45, 20, 20, 80, 5, 40, 25, 12, 11, 5, 20),
Sal= c(1, 4, 5, 0.1, 10, 18, 14, 17, 10, 21, 30, 28, 25, 20, 22),
Temp = c(28, 21, 24, 25, 22, 19, 20, 17, 18, 15, 18, 16, 19, 20, 20),
grass = c(.5, .3, .1, .4, .1, .25, .33, .43, .44, .08, .75, .54, .69, .4, .6)) %>%
group_by(site) %>%
mutate(grass.y1 = lag(grass, order_by = year)) %>% #last year's grass
mutate(grass.change = grass - grass.y1) %>% #calculate change
ungroup()
#the ME model
change.mod = lmer(grass.change ~ grass.y1 + log10(ChlA) + log10(Sal) + grass.y1:log10(Temp) + grass.y1:log10(Sal) + (1|site), data = ThePast)
#Future environmental data per site per year, to be used to predict grass.
TheDistantFuture <- tibble(
year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
site = c( "A", "B", "C","A", "B", "C", "A", "B", "C"),
ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
Sal= c(12, 11, 15, 16, 21, 32, 21, 21, 22),
Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18))
#The final dataframe should look like this, where both of the grass columns are predicted out into the future. could have the grass.y1 column in here if we wanted
PredictedFuture <- tibble(
year = c(2022, 2022, 2022, 2023, 2023, 2023, 2024, 2024, 2024),
site = c( "A", "B", "C","A", "B", "C", "A", "B", "C"),
ChlA = c(40, 200, 10, 95, 10, 4, 149, 10, 15),
Sal= c(12, 11, 15, 16, 21, 32, 21, 21, 22),
Temp = c(24, 22, 26, 28, 29, 32, 31, 20, 18),
grass = c(0.237, 0.335, 0.457, 0.700, 0.151, 0.361, 0.176, 0.380, 0.684),
grass.change = c(0.1368, 0.2550, -0.1425, -0.1669, -0.18368, -0.0962, 0.106, 0.229, 0.323 ))
Right now, I can generate the next year's (2022) correct predictions using group_by() and predict(), referencing last year's grass density with a lag function.
#How do we get to PredictedFuture?? Here is what I'm trying:
FutureIsNow = ThePast %>%
filter(year == 2021) %>% #take last year of real data to have baseline starting grass density
bind_rows(TheDistantFuture) %>% #bind future data
arrange(site, year) %>% #arrange by site then year
group_by(site) %>% #maybe this should be rowwise?
mutate(grass.change = predict(change.mod, newdata = data.frame(
grass.y1 = lag(grass, n = 1, order_by = year),
ChlA = ChlA, Sal = Sal, Temp = Temp, site = site))) %>% #this correctly predicts 2022 grass change
mutate(grass = grass.change + lag(grass, n = 1)) #this also works to calculate grass in 2022
This df looks like this:
> FutureIsNow
# A tibble: 12 × 7
# Groups: site [3]
year site ChlA Sal Temp grass grass.change
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2021 A 45 10 22 NA NA
2 2022 A 40 12 24 0.237 0.137
3 2023 A 95 16 28 NA NA
4 2024 A 149 21 31 NA NA
5 2021 B 40 21 15 NA NA
6 2022 B 200 11 22 0.335 0.255
7 2023 B 10 21 29 NA NA
8 2024 B 10 21 20 NA NA
9 2021 C 20 22 20 NA NA
10 2022 C 10 15 26 0.457 -0.143
11 2023 C 4 32 32 NA NA
12 2024 C 15 22 18 NA NA
Close, but not really repeatable...
Any ideas for predicting grass change for 2023, 2024, and so on down the rows? I prefer working in the tidyverse, though it may be easier to solve this with nested for loops. Potential solutions include a rowwise data structure, or maybe nest_by(site), but I don't know how I would then reference the grass.y1 column. Maybe a rolling prediction with rollify would work, but I am not sure!
Thank you in advance for your help! Long time reader, first time asker!
So, let's go with a simpler example here for a reprex to show how purrr::accumulate2() can work for you here.
Let's set up a discrete-time population model in which a covariate also affects the population at each time step:
$N_t = 1.5N_{t-1} + C_t$
Simple! Heck, we can even use accumulate2 to simulate a population, and then add some noise.
library(tidyverse)
# ok, let's make a population from a simple discrete time growth model
# but, with a covariate!
covariate <- runif(5, 5, 10)
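# use accumulate2 with the covariate to generate a population timeseries
# accumulate2 passes three arguments: ..1 is the running value,
# ..2 is the element of 1:5 (unused), ..3 is the covariate
pop <- accumulate2(1:5, covariate, ~ ..1 * 1.5 + ..3, .init = 0) %>% unlist()
pop <- pop[-1] # drop the .init value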
pop_obs <- rnorm(5, pop, 1) #add some noise
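One detail that is easy to trip over, shown here as a small sketch with made-up numbers: accumulate2() calls its function with three arguments (the running value, the next element of the first vector, and the next element of the second vector). In formula shorthand that means `.x` and `.y` are the first two, and the second vector is reached with `..3`. With a single covariate, plain accumulate() gives the same result.

```r
library(purrr)

# Made-up covariate values, chosen so the arithmetic is easy to follow
covariate <- c(10, 20, 30)

# ..1 = running value, ..2 = element of seq_along(covariate) (unused),
# ..3 = element of covariate
pop <- unlist(accumulate2(seq_along(covariate), covariate,
                          ~ ..1 * 1.5 + ..3, .init = 0))

# With only one covariate, accumulate() is enough:
# .x = running value, .y = element of covariate
pop_simple <- as.numeric(accumulate(covariate, ~ .x * 1.5 + .y, .init = 0))

pop[-1]  # drop the .init value to keep one value per time step
```

Both versions start from the `.init` value of 0, so the first element has to be dropped to line up with the time steps.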
Great! Now, turn it into data and fit a model
# the data ####
dat <- tibble(
time = 1:5,
covariate = covariate,
pop_obs = pop_obs,
lag_pop = lag(pop_obs)
)
# the model ####
mod <- lm(pop_obs ~ covariate + lag_pop, data = dat)
# does this look reasonable?
coef(mod)
My coefficients looked reasonable, but, set a seed and see!
Now we will need some data we want to simulate for - new covariates, but, we will need to incorporate the lag.
# now, simulation data ####
simdat <- tibble(
time = 6:10,
covariate = runif(5, 15,20),
lag_pop = dat$pop_obs[5] #the last observed value, used as the first lag!
)
Great! To make this work, we'll need a function that takes arguments of the lagged value and covariate and runs a prediction. Note, here the second argument is just a numeric. But, you could pass an element of a list - a row of a data frame, if you will. This might be accomplished later with some rowwise nesting or somesuch. For you to work out!
# OK, now we need to get predictions for pop at each step in time! ####
sim_pred <- function(lag_pop, covariate){
newdat <- tibble(covariate = covariate,
lag_pop = lag_pop)
predict(mod, newdata = newdat)
}
With this in hand, we can simulate forward using lag_pop to generate a new population. Note, we'll need to use .init to make sure our first value is correct, and then strip off the first element of the result, which is just the .init value itself.
# and let her rip!
# note, we have to init with the first value and
# for multiple covariates, make a rowwise list -
# each element of the list is
# one row of the data and the sim_pred function takes it apart
simdat %>%
  mutate(pop = accumulate2(lag_pop,
                           covariate,
                           # ..1 = previous prediction, ..3 = covariate
                           ~sim_pred(..1, ..3),
                           .init = lag_pop[1]) %>% `[`(-1) %>% unlist())
That should do!
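For the "rowwise list" idea with several covariates and several sites, here is a rough sketch of one way it could go. Everything in it is hypothetical: the toy data, the names `step`, `predict_site`, and `start`, and the plain lm() standing in for the mixed model (predict() also accepts newdata for lme4 fits, but I have not run this against the grass data). Each site's rows are split into one-row data frames, and accumulate() feeds the previous prediction into the next row.

```r
library(dplyr)
library(purrr)

# Toy training data with an exact linear relationship (made-up numbers)
train <- tibble(
  lag_y = c(1, 2, 3, 4, 5, 6),
  c1    = c(0, 1, 0, 2, 1, 3),
  c2    = c(1, 0, 2, 1, 3, 0)
) %>%
  mutate(y = 2 + 0.5 * lag_y + c1 - c2)

mod <- lm(y ~ lag_y + c1 + c2, data = train)

# Future covariates per site and year, plus the last observed value per site
future <- tibble(
  site = c("A", "A", "A", "B", "B", "B"),
  year = c(1, 2, 3, 1, 2, 3),
  c1   = c(1, 0, 2, 0, 1, 1),
  c2   = c(0, 1, 1, 2, 0, 1)
)
start <- c(A = 4, B = 10)

# One step: predict from the previous value and one row of covariates
step <- function(prev, row) {
  unname(predict(mod, newdata = mutate(row, lag_y = prev)))
}

# Roll the prediction down the rows of a single site
predict_site <- function(df, init) {
  rows <- split(df, seq_len(nrow(df)))      # list of one-row data frames
  out <- accumulate(rows, step, .init = init)
  unname(unlist(out))[-1]                   # drop the .init value
}

predicted <- future %>%
  arrange(site, year) %>%
  group_by(site) %>%
  group_modify(function(df, key) {
    df$y <- predict_site(df, start[[key$site]])
    df
  }) %>%
  ungroup()
```

A for loop over years within each site would do the same job; the accumulate() version just keeps it inside a grouped pipeline.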
I am looking to make a bar graph of medals in R. I have 3 distinct columns (gold, silver, bronze). The gold column has a total of 8, the silver column 10, and the bronze column 13.
For the code, I started writing: ggplot(data, aes(x=?)) + geom_bar()
I am not sure how to pass all 3 medal columns to the function where it shows x=?
Thanks
For plotting purposes, it is "easier" to work with long data instead of wide. Below I converted the data you mentioned in your comment to long and plotted the data as a grouped bar.
library(tidyverse)
# load data
raw_data <- structure(list(Rank = c(1, 2, 3, 4, 5, 6),
`Team/Noc` = c("United States of America", "People's Republic of China", "Japan", "Great Britain", "ROC", "Australia"),
Gold = c(39, 38, 27, 22, 20, 17),
Silver = c(41,32, 14, 21, 28, 7),
Bronze = c(33, 18, 17, 22, 23, 22),
Total = c(113, 88, 58, 65, 71, 46),
`Rank by Total` = c(1, 2, 5, 4, 3, 6)),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl", "data.frame"))
# convert wide data to long
long_data <- raw_data %>%
pivot_longer(cols = -`Team/Noc`, names_to = 'Medal') %>% # convert wide data to long format
filter(Medal %in% c("Gold", "Silver", "Bronze")) # only select medal columns
# plot
ggplot(long_data) +
geom_col(aes(x = `Team/Noc`,
y = value,
fill = Medal),
position = "dodge" # grouped bars
)
Hope this gets you started!
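If what you are after is one bar per medal type showing the column totals (the 8, 10 and 13 from the question), a minimal sketch could look like this. The `medals` data frame and its team names are made up so that the totals match those numbers.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical wide data whose column totals are 8, 10 and 13
medals <- tibble(
  team   = c("X", "Y", "Z"),
  gold   = c(3, 2, 3),
  silver = c(4, 3, 3),
  bronze = c(5, 4, 4)
)

# Sum each medal column, then reshape to one row per medal type
totals <- medals %>%
  summarise(across(c(gold, silver, bronze), sum)) %>%
  pivot_longer(everything(), names_to = "medal", values_to = "total")

# geom_col() uses the y values as given, so no counting is needed
p <- ggplot(totals, aes(x = medal, y = total)) +
  geom_col()
```

geom_bar() counts rows by default, which is why it wants only an x aesthetic; geom_col() is the one to use when the heights are already in the data.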
I have two lists (the dataframes in the list contain more columns than those two, but they are not important for my question):
KPI_new <- list(June=data.frame(ID=(rep("",17)), eRec= c("107349", "110878", "110024", "112188", "6187", "100420", "94436", "110165", "108508", "108773", "111859", "111907", "110704", "100413", "88995", "91644","111298") ))
KPI_old <- list(May=data.frame(ID=c(27, 30, 4, 6, 7, 20, 31, 8, 28, 25, 29, 16, 17, 18), eRec = c( "107349", "110024", "6187" , "100420", "94436", "88995" , "110165" ,"91644", "108508", "105213", "108773", "102636" ,"102339" ,"100413")),
April = data.frame(ID=c(26, 27, 2, 4, 5, 6, 7, 20, 21, 22, 8, 23, 28, 25, 29, 9, 24, 16, 17, 18), eRec=c("37866", "107349", "93051", "6187", "98274", "100420", "94436", "88995" ,"105107", "105109", "91644", "105103" ,"108508" ,"105213", "108773", "85409" ,"104145","102636" ,"102339" ,"100413")),
March = data.frame(ID= c(2, 19, 4, 5, 6, 7, 20, 21, 22, 8, 23, 25, 9, 24, 15, 16, 17, 18), eRec=c("93051" , "104499" ,"6187", "98274", "100420" ,"94436", "88995" ,"105107" ,"105109", "91644" ,"105103", "105213" ,"85409" , "104145", "100989", "102636" ,"102339", "100413")),
February = data.frame(ID= c(1 , 2, 19, 4, 5, 6, 7 ,20, 21, 22, 8, 23, 9 ,10, 24, 12, 13, 14, 15, 16, 17, 18), eRec=c("94266" , "93051", "104499" ,"6187" , "98274", "100420", "94436" ,"88995", "105107", "105109", "91644" ,"105103", "85409" ,"102252", "104145", "94559", "101426", "100992" ,"100989" ,"102636" ,"102339" ,"100413")),
January = data.frame(ID = c(1:18), eRec=c("94266" , "93051", "99836", "6187" , "98274", "100420", "94436", "91644", "85409", "102252", "94412", "94559", "101426", "100992", "100989", "102636", "102339", "100413")))
The list KPI_old contains several dataframes. The ID column is assigned based on the eRec column. So if the same eRec value appears in both January and February, it has the same ID in both.
Now I want to assign IDs to the (at this point empty) ID column of the dataframe in the KPI_new list based on KPI_old.
I tried the following:
KPI_old_df <- do.call("rbind", KPI_old)
KPI_new[[1]]$ID[(KPI_new[[1]][,2]) %in% KPI_old_df[,2]] <- unique(KPI_old_df$ID[(KPI_old_df[,2]) %in% KPI_new[[1]][,2]])
This assigns the right values - the IDs of KPI_old to KPI_new for the eRec values in KPI_new which already occur in KPI_old - but it assigns some of them to the wrong rows. The order is not right.
It seems like there is something very basic which I am missing.
Thanks in advance.
Try using match in the following way
KPI_new[[1]]$ID <- KPI_old_df$ID[match(KPI_new[[1]]$eRec, KPI_old_df$eRec)]
KPI_new
#$June
# ID eRec
#1 27 107349
#2 NA 110878
#3 30 110024
#4 NA 112188
#5 4 6187
#6 6 100420
#7 7 94436
#8 31 110165
#9 28 108508
#10 29 108773
#11 NA 111859
#12 NA 111907
#13 NA 110704
#14 18 100413
#15 20 88995
#16 8 91644
#17 NA 111298
Not all IDs are present in KPI_old_df, hence some of them return NA.
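To see why match() keeps the rows aligned, here is a tiny sketch using three ID/eRec pairs taken from the May data above (the object names `old`, `new_eRec`, `pos`, and `new_ID` are just for illustration). match() returns, for each element of its first argument, the position of the first match in the second argument, or NA if there is none.

```r
# Three ID/eRec pairs from the May data frame in the question
old <- data.frame(ID = c(27, 30, 4),
                  eRec = c("107349", "110024", "6187"))

# New eRec values in a different order, one of them unknown
new_eRec <- c("6187", "110878", "107349")

# Position of each new eRec in old$eRec (NA when it is absent)
pos <- match(new_eRec, old$eRec)

# Index the IDs by those positions, so the result follows the order of new_eRec
new_ID <- old$ID[pos]
```

The %in% attempt goes wrong because it pulls the IDs out in the order of KPI_old_df and then drops them into KPI_new in that order, whereas match() gives one position per row of KPI_new.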
I have this graph:
I just need to add labels to each colored line.
I need to add to the blue one Forecast Sales and for the red one Historical Sales.
I tried to adapt the examples here, but I get many errors. Also, I cannot plot the graph above using just this code:
To make it reproducible:
> dput(df1)
structure(list(Semaine = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,
28, 29, 30, 31), M = c(5649.96284329564, 7400.19639744335, 6948.61488673139,
5043.28209277238, 7171.29719525351, 7151.04746494067, 5492.96601941748,
6796.1160130719, 5532.95496473142, 7371.33061889251, 5462.73861171367,
7156.01570964247, 5558.63194819212, 9329.49289405685, 5770.02903225806,
7348.68497576737, 5261.26655896607, 8536.11304909561, 7463.97630586968,
6133.49774339136, 7252.69089929995, 6258.54674403611, 8167.67766497462,
5644.66612816371, 7512.5169628433, 5407.84275713516, 7795.63220247711,
5596.75282714055, 7264.37264404954, 5516.98492191707, 8188.80776699029
)), .Names = c("Semaine", "M"), row.names = c(NA,
-31L), class = "data.frame")
> dput(df2)
structure(list(Semaine = c(32, 33.2, 34.4, 35.6, 36.8, 38), M = c(5820.32304669441,
6296.32038834951, 7313.24757281553, 7589.714214588, 8992.35922330097,
9664.95469255663)), .Names = c("Semaine", "M"), row.names = c(NA,
-6L), class = "data.frame")
ggplot() + geom_line(data=df1, aes(x = Semaine, y = M),color = "red") +
stat_smooth(data=df2, aes(x = Semaine, y = M),color = "blue")+
scale_x_continuous(breaks = seq(0,40,1))
Thank you!
cols <- c("A"="red", "B"="blue")
ggplot() + geom_line(data=df1, aes(x = Semaine, y = M,color = "A")) +
stat_smooth(data=df2, aes(x = Semaine, y = M,color = "B"), method = 'loess')+
scale_x_continuous(breaks = seq(0,40,1)) +
scale_color_manual(name="Title", values=cols)
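If you want the legend to show the labels from the question instead of "A" and "B", one option is to use the label text itself as the constant inside aes() and as the names of the colour vector. A minimal sketch, with the data shortened and rounded from the question and "Sales" as a placeholder legend title:

```r
library(ggplot2)

# Shortened, rounded copies of the data in the question
df1 <- data.frame(Semaine = 1:6,
                  M = c(5649.96, 7400.20, 6948.61, 5043.28, 7171.30, 7151.05))
df2 <- data.frame(Semaine = c(32, 33.2, 34.4, 35.6, 36.8, 38),
                  M = c(5820.32, 6296.32, 7313.25, 7589.71, 8992.36, 9664.95))

# The names must match the strings used inside aes(color = ...)
cols <- c("Historical Sales" = "red", "Forecast Sales" = "blue")

p <- ggplot() +
  geom_line(data = df1, aes(x = Semaine, y = M, color = "Historical Sales")) +
  stat_smooth(data = df2, aes(x = Semaine, y = M, color = "Forecast Sales"),
              method = "loess") +
  scale_color_manual(name = "Sales", values = cols)
```

The key point is that color sits inside aes(); a colour set outside aes() is a fixed property of the layer and never gets a legend entry.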
I've got an R data frame that looks like this:
> glimpse(spottingIntensity)
Observations: 28
Variables: 5
$ nClassifications <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,...
$ nPhotosClassified <int> 45816, 25252, 12327, 5286, 2327, 1231, 713, 565, 447, 435, 318, 227, 192, 156,...
$ totalClassifiedPhotos <int> 95781, 95781, 95781, 95781, 95781, 95781, 95781, 95781, 95781, 95781, 95781, 9...
$ proportionOfClassified <dbl> 4.783412e-01, 2.636431e-01, 1.286998e-01, 5.518840e-02, 2.429501e-02, 1.285224...
$ cumulativeProportions <dbl> 0.4783412, 0.7419843, 0.8706842, 0.9258726, 0.9501676, 0.9630198, 0.9704639, 0...
In it, nClassifications and nPhotosClassified are data and the other variables are derived.
I use the following to plot the data with ggplot2:
ggplot(data = spottingIntensity, mapping = aes(x = nClassifications, y = cumulativeProportions)) +
geom_col() +
geom_text(mapping = aes(label = nPhotosClassified), nudge_y = 0.03) +
scale_x_continuous(limits = c(NA, 10),
breaks = seq.int(from = 1, to = 10, by = 1))
Which gives me these warnings:
Warning messages:
1: Removed 18 rows containing missing values (position_stack).
2: Removed 18 rows containing missing values (geom_text).
And this output:
I see that in the plot, the column for nClassifications = 10 is not shown even though data for it exists in my original data frame.
I checked the data frame, and I do have a few "missing rows" for nClassifications = 24, 27, 30, 31, but not for nClassifications = 10.
So:
1. Why isn't the bar in the plot for nClassifications = 10 showing up? How do I fix this? (I expected a bar similar in height to nClassifications = 9)
2. How do I programmatically "fill/complete" my data frame so that there are corresponding rows for nClassifications = 24, 27, 30, 31? In this case, nPhotosClassified <- 0 for those four nClassifications. And with that I can derive the other variables.
Can dplyr/tidyr help with 1. and 2.? Or is there another way? Thank you!
EDIT: Ooooops, I pasted the wrong code snippet before, it's correct now.
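A possible direction for both points, as a hedged sketch on a made-up miniature of the data (`spotting` and `completed` are hypothetical names). As far as I understand ggplot2, limits set on a scale discard anything that falls outside them, and a bar centred on 10 with the default width reaches past 10, so it gets dropped; coord_cartesian() zooms without discarding data. For the missing rows, tidyr::complete() can add them with a fill value, after which the derived columns can be recomputed.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Made-up miniature of spottingIntensity with gaps at 3 and 5
spotting <- tibble(
  nClassifications  = c(1L, 2L, 4L, 6L),
  nPhotosClassified = c(50L, 30L, 15L, 5L)
)

# Add the missing rows with a count of 0, then derive the other columns
completed <- spotting %>%
  complete(nClassifications = 1:6, fill = list(nPhotosClassified = 0L)) %>%
  mutate(
    totalClassifiedPhotos  = sum(nPhotosClassified),
    proportionOfClassified = nPhotosClassified / totalClassifiedPhotos,
    cumulativeProportions  = cumsum(proportionOfClassified)
  )

# Zoom to the first four bars without dropping any data
p <- ggplot(completed, aes(x = nClassifications, y = cumulativeProportions)) +
  geom_col() +
  geom_text(aes(label = nPhotosClassified), nudge_y = 0.03) +
  scale_x_continuous(breaks = 1:4) +
  coord_cartesian(xlim = c(0.5, 4.5))
```

For the real data the range would be something like 1:31 in complete() and xlim = c(0.5, 10.5) in coord_cartesian(), so that the bar at 10 fits entirely inside the visible range.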