I'm trying to improve a small but important detail in my facet_grid graphic.
I have played with the scales and space parameters of facet_grid(), but I would like to increase the height of the bottom facet only, so that the sample counts that appear cut off in the graph (highlighted in red) become readable. See Image 1.
The code used is:
ggplot(salm_data, aes(x = SALM)) +
  geom_bar(stat = "count", position = position_dodge()) +
  facet_grid(YEAR ~ TYPE, scales = "free_x", space = "free") +
  geom_text(size = 2.5, stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  theme_bw() +
  labs(y = "Number of samples") +
  theme(panel.grid.minor = element_blank()) +
  labs(x = expression(paste(italic("Klebsiella spp."), " ")))
Image 1: [bar chart; the sample counts above the bars in the bottom facet are cut off]
To try to overcome this problem I added the following code to enlarge the y-axis:
scale_y_continuous(breaks = seq(from = 0, to = 300, by = 40))
but it did not improve the situation (see Image 2).
Image 2: [same plot after changing the y-axis breaks; the counts are still cut off]
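For reference, another direction (only a sketch, assuming ggplot2 >= 3.3, not tested on this data) would be to expand the upper end of the y scale so the count labels get headroom instead of being clipped:
scale_y_continuous(expand = expansion(mult = c(0.05, 0.15)))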
Could you help me with some ideas to overcome this?
Thanks in advance for your help/comments,
Magi
Grid graphics let you modify the widths and heights of a ggplot() object once it has been converted to a gtable.
Sample code:
library(grid)
gg <- ggplot(salm_data, aes(x = type)) +
  geom_bar(stat = "count", position = position_dodge()) +
  facet_grid(year ~ type, scales = "free_x", space = "free") +
  geom_text(size = 2.5, stat = "count", aes(label = after_stat(count)), vjust = -0.3) +
  theme_bw() +
  labs(y = "Number of samples") +
  theme(panel.grid.minor = element_blank()) +
  labs(x = expression(paste(italic("Klebsiella spp."), " ")))
gt <- ggplot_gtable(ggplot_build(gg))
gt$widths[7] <- 4 * gt$widths[7]    # widen the gtable column holding the target panels
gt$heights[7] <- 4 * gt$heights[7]  # enlarge the gtable row holding the target panels
grid.draw(gt)
Sample plot: [resulting plot omitted]
Sample data:
salm_data<-structure(list(type = c("A", "B", "C", "D", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "C", "A", "B", "C", "D", "A", "B", "C",
"D", "A", "B", "C", "D", "A", "C", "A", "B", "C", "D", "A", "B",
"C", "D", "A", "B", "C", "D", "A", "C", "A", "B", "C", "D", "A",
"B", "C", "D", "A", "B", "C", "D", "A", "C", "A", "B", "C", "D",
"A", "B", "C", "D", "A", "B", "C", "D", "A", "C", "A", "B", "C",
"D", "A", "B", "C", "D", "A", "B", "C", "D", "A", "C", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A", "A"
), year = c(2001, 2001, 2001, 2001, 2001, 2002, 2001, 2002, 2001,
2002, 2001, 2002, 2002, 2002, 2001, 2001, 2001, 2001, 2001, 2002,
2001, 2002, 2001, 2002, 2001, 2002, 2002, 2002, 2001, 2001, 2001,
2001, 2001, 2002, 2001, 2002, 2001, 2002, 2001, 2002, 2002, 2002,
2001, 2001, 2001, 2001, 2001, 2002, 2001, 2002, 2001, 2002, 2001,
2002, 2002, 2002, 2001, 2001, 2001, 2001, 2001, 2002, 2001, 2002,
2001, 2002, 2001, 2002, 2002, 2002, 2001, 2001, 2001, 2001, 2001,
2002, 2001, 2002, 2001, 2002, 2001, 2002, 2002, 2002, 2001, 2001,
2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001,
2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001,
2001, 2001, 2001, 2001)), class = "data.frame", row.names = c(NA,
-112L))
Many thanks @Rfanatic. That's exactly what I wanted.
I created the gtable as you indicated, but I also used the gtable package to help me work out the panel layout in my case. Here's the code, in case it helps someone in the future:
gt <- ggplot_gtable(ggplot_build(gg))
gt$widths[14] <- 1 * gt$widths[14]    # width left unchanged
gt$heights[14] <- 4 * gt$heights[14]  # enlarge the row holding the problem panel
grid.draw(gt)
I used index 14 because it is the row number of the panel I had problems with:
library(gtable)
gtable_show_layout(gt)
But what worked best for me was printing the TableGrob:
print(gt)
#17 1 (14-14,11-11) panel-4-4 gTree[panel-16.gTree.8635]
I was interested in panel-4-4 in my example (I have only reproduced that line of the gt output here). Since I wanted to change the row height, I picked 14 (the row index) rather than 11 (the column index).
Thanks.
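For reference, the same indices can also be looked up programmatically from the gtable's layout data frame instead of reading the printed output (a small sketch; it assumes the panel of interest is the one named "panel-4-4", as above):
# t/b give the layout row and l/r the layout column of the named grob
idx_row <- gt$layout$t[gt$layout$name == "panel-4-4"]
idx_col <- gt$layout$l[gt$layout$name == "panel-4-4"]
gt$heights[idx_row] <- 4 * gt$heights[idx_row]  # same as gt$heights[14] here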
I have a data frame with a time column that contains some NAs. I would like to fill these cells with the previous year + 1 (provided the missing cell is not at the beginning of the series). Here's a reproducible example:
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, NA, NA, 2000, 2001, 2002))
I tried to follow this post
df <- df %>%
complete(y = seq(min(y), max(y), by = "year"))
but I can't find out how to do so. Any idea?
Edit: expected output:
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, 2001, 2002, 2000, 2001, 2002))
Note: I would prefer a dplyr solution.
Note 2 (October 23rd, 2019): The three answers so far are good but quite complicated. I'm really surprised that there is no simpler way to do this (for example, an option to apply a lag/offset in the fill() function would be really useful, I think).
This solution is a bit verbose but fully vectorized in dplyr. I doubled your df into a new df2 to test it across a couple of gapped runs.
library(tidyr)
library(dplyr)
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, NA, NA, 2000, 2001, 2002))
df2 <- bind_rows(df, df)
Basically you need to create groups that span each block of NAs. Then you can calculate a within-group cumsum and use fill() to drag down the prior value. It is a little long because of all the steps.
df2 %>%
group_by(grp = cumsum(!is.na(y) & lag(is.na(y), default = FALSE))) %>%
mutate(add_year = cumsum(is.na(y))) %>%
fill(y) %>%
mutate(y = y + add_year) %>%
ungroup() %>%
select(-grp, -add_year)
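If, as in your example, every non-NA value starts a new run, the same cumsum grouping can be written a bit more compactly (a sketch along the same idea, not part of the code above; it gives the same result on df2):
df2 %>%
  group_by(grp = cumsum(!is.na(y))) %>%
  mutate(y = first(y) + row_number() - 1) %>%
  ungroup() %>%
  select(-grp)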
In base R you can use ave() in combination with cumsum() to split your dataset and apply seq() within each block, as you already tried.
df$y <- ave(df$y, cumsum(!is.na(df$y)), FUN=function(x)
seq(x[1], length.out = length(x)))
identical(df, dfExpected)
#[1] TRUE
df$y
#[1] 2000 2001 2002 2000 2001 2002
In case the series starts with NA and you want it to start at 2000, you can use replace():
df2$y <-ave(df2$y, cumsum(!is.na(df2$y)), FUN=function(x)
seq(replace(x[1],is.na(x[1]),2000), length.out = length(x)))
identical(df2, dfExpected)
#[1] TRUE
Data:
df <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, NA, NA, 2000, 2001, 2002))
dfExpected <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(2000, 2001, 2002, 2000, 2001, 2002))
df2 <- data.frame(x = c("A", "B", "C", "A", "B", "C"),
y = c(NA, NA, NA, 2000, 2001, 2002))
This uses the dplyr functions case_when() and lag() combined with a while loop inside a custom function.
The output is as expected; try it out.
library(dplyr)
lag_years <- function(df) {
  while (anyNA(df$y)) {
    df <- df %>%
      mutate(y = case_when(is.na(y) & !is.na(lag(y)) ~ lag(y) + 1,
                           TRUE ~ y))
  }
  df
}
lag_years(df) %>%
head()
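On the question's df this converges after two passes through the while loop and reproduces the expected output from the question:
lag_years(df)
#   x    y
# 1 A 2000
# 2 B 2001
# 3 C 2002
# 4 A 2000
# 5 B 2001
# 6 C 2002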
I have a dataset of multiple lakes with water-level elevations through time. The observations are not regularly spaced and have many large gaps. Further, some of the older observations may be of lower or unknown quality. I created a separate model that does a reasonably good job of predicting water levels across time, but it still misses the actual observations by varying amounts.
I would like to create a third, imputed/interpolated set of data in which the solution:
is informed by the modeled values where observations are missing,
crosses the highly weighted observations,
and is informed by the lower-weighted observations.
So far I have used the fable package's TSLM -> interpolate() to perform this. It works reasonably well, but I cannot see a way to introduce weighting into the process. Further, it relies too heavily on the global coefficients and intercepts, making it a bit too volatile when the modeled value significantly misses the observed one. I am thinking that I need some sort of weighted loess that relies on local coefficients and can accommodate weighting; a rough sketch of what I mean follows the example below.
library(dplyr)
library(tsibble)
library(fable)
library(ggplot2)
test_data <- data.frame(obs_year = c(2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009),
site_name = c("Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake1","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2","Lake2"),
observed = c(100,200,NA, NA, NA, NA, 220, NA, NA, 125, NA,NA,425, NA, 475, NA, 450, 450, 475, 500),
weights = c(1,1,NA, NA, NA, NA, 2, NA, NA, 2, NA,NA,2, NA, 1, NA, 2, 2, 2, 2),
modeled = c(110,120,165,150, 200, 225, 240, 250, 150, 130, 450,430,415,400, 425, 450, 460, 460, 470, 490))
test_tsibble <- as_tsibble(test_data, key = site_name, index = obs_year)
tslm_interpolate <- test_tsibble %>%
group_by(site_name) %>%
model(lm = TSLM(observed~modeled)) %>%
fabletools::interpolate(test_tsibble)
tslm_interpolate <- left_join(tslm_interpolate, test_data, by = c("site_name", "obs_year")) %>%
dplyr::select(obs_year, site_name, observed = observed.y, imputed = observed.x, modeled, weights)
tslm_interpolate %>%
ggplot(aes(x=obs_year))+
geom_line(aes(y = imputed), color = "blue")+
geom_line(aes(y = modeled), color = "red")+
geom_point(aes(y = observed), color = "green")+
facet_wrap(~site_name, scales = "free_y")
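The direction I am imagining is something like a per-lake weighted loess on the observed-minus-modeled residuals, added back onto the modeled series, so gaps follow the model and observations pull the curve locally. This is only a rough, untested sketch; the span/degree values are placeholders and extrapolation beyond the observed years relies on surface = "direct":
impute_weighted_loess <- function(d, ...) {
  obs <- d[!is.na(d$observed), ]
  fit <- loess(I(observed - modeled) ~ obs_year, data = obs, weights = weights,
               span = 1, degree = 1,                          # placeholder smoothing settings
               control = loess.control(surface = "direct"))   # allow extrapolation
  d$imputed <- d$modeled + predict(fit, newdata = d)
  d
}

test_data %>%
  group_by(site_name) %>%
  group_modify(impute_weighted_loess) %>%
  ungroup()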
I have a data table, data. Within each city, I want to subtract the value of group B from the values of all the other groups.
In other words, from every "NYC" value, whatever its group, I want to subtract the value of NYC in group B, and from every value associated with "LA" I want to subtract the value of LA in group B, so that my result looks like result below. How can I do that?
data = data.table(city = c("NYC", "NYC", "NYC", "LA", "LA", "LA"),
group_label = c("A", "A", "B", "B", "A", "C"),
time_period = c(1980, 1990, 2000, 1982, 2007, 2010),
value = c(2, 20, 13, 24, 4, 6)
)
result = data.table(city = c("NYC", "NYC", "NYC", "LA", "LA", "LA"),
group_label = c("A", "A", "B", "B", "A", "C"),
value = c(2, 20, 13, 24, 4, 6),
time_period = c(1980, 1990, 2000, 1982, 2007, 2010),
diff = c(-11, 7, 0, 0, -20, -18)
)
An option would be
data[, diff := value - value[group_label == "B"], city]
Or with dplyr
library(dplyr)
data %>%
group_by(city) %>%
mutate(diff = value - value[group_label == "B"])
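Note that both versions assume there is exactly one group "B" row per city; if a city could have several, the lookup would need to be reduced to a single value, e.g. value[group_label == "B"][1].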
I have this dataset (run the structure() call below in the console to have a look at it):
structure(list(Staz = c("Carmagnola", "Chieri", "Chivasso", "Ivrea",
"Moncalieri", "Orbassano"), Year = c(2004, 2004, 2004, 2004,
2004, 2004), Season = c("Autumn", "Autumn", "Autumn", "Autumn",
"Autumn", "Autumn"), Avg_T = c(11.7361111111111, 11.7361111111111,
11.7361111111111, 11.7361111111111, 11.7361111111111, 11.7361111111111
), Min_T = c(7.27222222222222, 7.27222222222222, 7.27222222222222,
7.27222222222222, 7.27222222222222, 7.27222222222222), Max_T = c(16.6722222222222,
16.6722222222222, 16.6722222222222, 16.6722222222222, 16.6722222222222,
16.6722222222222), Moisture = c(69.6388888888889, 69.6388888888889,
69.6388888888889, 69.6388888888889, 69.6388888888889, 69.6388888888889
), Rain = c(79.2, 79.2, 79.2, 79.2, 79.2, 79.2), Year_Bef = c(2004,
2004, 2004, 2004, 2004, 2004), Year_Bef_Two = c(2004, 2004, 2004,
2004, 2004, 2004)), .Names = c("Staz", "Year", "Season", "Avg_T",
"Min_T", "Max_T", "Moisture", "Rain", "Year_Bef", "Year_Bef_Two"
), row.names = c(NA, 6L), class = "data.frame")
As you can see, there is a variable named 'Season' defining the season of each observation. I would like to split the weather variables ('Avg_T', 'Min_T', 'Max_T', 'Moisture', 'Rain') by season, but keep them in the same row, so that I end up with just one row per study area per year, containing the seasonal values as separate columns.
I tried to do that with the 'cast' and 'dcast' commands from the 'reshape' and 'reshape2' packages, but it didn't work.
Could somebody help me?
Thanks,
Jacopo
First, let's say your data lives in df. I rbind df to itself and change the season in the second half to Summer, so that more than one season is present:
df <- rbind(df, df)
df[7:12,]$Season = 'Summer'
Then I get rid of the last two columns in df (they don't seem to be doing anything):
df = df[, -c(9,10)]
Now, we're ready to use the reshape function:
r_df <- reshape(df, timevar = 'Season', idvar = c('Staz', 'Year'), direction = 'wide')
I think that should give you what you're looking for.
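Since the question mentions trying dcast from reshape2, the equivalent there would be to melt the weather variables into long format first and then cast them wide by Season (a sketch of that idea, not part of the reshape() answer above):
library(reshape2)
# long format: one row per station/year/season/variable
m <- melt(df, id.vars = c("Staz", "Year", "Season"),
          measure.vars = c("Avg_T", "Min_T", "Max_T", "Moisture", "Rain"))
# wide format: one row per station/year, with columns like Avg_T_Autumn
w <- dcast(m, Staz + Year ~ variable + Season, value.var = "value")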