"Heatbars" for visualizing consecutive missing data days? - r

I am trying to visualize large chunks of consecutive missing data side-by-side on ranges of 3, 5 and 10 years sampled daily. Hopefully using ggplot2 since I already have some aesthetics functions done.
I imagined this would come from a barplot or maybe some heatmap variation, but I am not too sure how to use them with time-series data.
I chose a black/white list of bars because I think it is easier to observe where (1) lies large chunks of missing data and (2) if they are occurring on different moments in time (which would be important to choose which stations to use, etc), while being (3) relatively easy to observe many bars which would not be true to the more conventional line plots for time-series.
This is a draft of what I had in mind.
Here is some example data for 5 stations (in practice this could be up to over 80):
#Data from 5 different stations sampled daily.
df <- cbind(seq(as.Date(("2010/01/01")),by="day",length.out=365*5),data.frame(matrix(rnorm(365*5*5),365*5,5)))
colnames(df) <- c("timestamp","st1","st2","st3","st4","st5")
#Add varying ranges of missing consecutive amount of days to observe result on visualization.
df[1:50,"st1"] <- NA # 50
df[51:200,"st2"] <- NA # 150
df[1:400,"st3"] <- NA # 400
df[501:1300,"st5"] <- NA # 800

Here's a rough stab at it...Alter the scales and theme elements to your liking...
library(ggplot2)
library(scales)
library(reshape2)
melt(df, id.vars = "timestamp") -> k
k$value <- ifelse(is.na(k$value), "NA", "Not NA")
ggplot(data = k) +
geom_point(aes(x = timestamp, y = variable, fill = value, colour = value), shape =22) +
scale_x_date() +
theme_bw()

Related

ggplot: Plotting timeseries data with missing values

I have been trying to plot a graph between two columns from a data frame which I had created. The data values stored in the first column is daily time data named "Time"(format- YYYY-MM-DD) and the second column contains precipitation magnitude, which is a numeric value named "data1".
This data is taken from an excel file "St Lucia3" which has a total 11598 data points and stores daily precipitation data from 1981 to 2018 in two columns:
YearMonthDay (format- "YYYYMMDD", example "19810501")
Rainfall (mm)
The code for importing data into R:
StLucia <- read_excel("C:/Users/hp/Desktop/St Lucia3.xlsx")
The code for time data "Time" :
Time <- as.Date(as.character(StLucia$YearMonthDay), format= "%Y%m%d")
The code for precipitation data "data1" :
library("imputeTS")
data1 <- na_ma(StLucia$`Rainfall (mm)`, k = 4, weighting = "exponential")
The code for data frame "Pecip1" :
Precip1 <- data.frame(Time, data1, check.rows=TRUE)
The code for ggplot is:
ggplot(data = Precip1, mapping= aes(x= Time, y= data1)) + geom_line()
Using ggplot for plotting the graph between "Time" and "data1" results as:
Can someone please explain to me why there is an "unusual kink" like behavior at the right end of the graph, even though there are no such values in the column "data1".
The plot of "data1" data against its index is as shown:
The code for this plot is:
plot(data1, type = "l")
Any help would be highly appreciated. Thanks!
By using pad we can make up for those lost values an assign an NA value as to
avoid plotting in the region of missing data.
library(padr)
library(zoo)
YearMonthDay<-c(19810501,19810502,19810504,19810505)
Data<-c(1,2,3,4)
StLucia<-data.frame(YearMonthDay,Data)
StLucia$YearMonthDay <- as.Date(as.character(StLucia$YearMonthDay), format=
"%Y%m%d")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-04 3
4 1981-05-05 4
Note: you can see we are missing a date, but still there is no gap between position 2 and 3, thus plotting versus indexing you would not see a gap.
So lets add the missing date:
StLucia<-pad(StLucia,interval="day")
> StLucia
YearMonthDay Data
1 1981-05-01 1
2 1981-05-02 2
3 1981-05-03 NA
4 1981-05-04 3
5 1981-05-05 4
plot(StLucia, type = "l")
If you want to fill in those NA values, use na.locf() from package(zoo)
Here is a reproducible example - change the names to match your data.
# create sample data
set.seed(47)
dd = data.frame(t = Sys.Date() + c(0:5, 30:32), y = runif(9))
# demonstrate problem
ggplot(dd, aes(t, y)) +
geom_point() +
geom_line()
The easiest solution, as Tung points out, is to use a more appropriate geom, like geom_col:
ggplot(dd, aes(t, y)) +
geom_col()
If you really want to use lines, you should fill in the missing dates with NA for rainfall. H
# calculate all days
all_days = data.frame(t = seq.Date(from = min(dd$t), to = max(dd$t), by = "day"))
# join to original data
library(dplyr)
dd_complete = left_join(all_days, dd, by = "t")
# ggplot won't connect lines across missing values
ggplot(dd_complete, aes(t, y)) +
geom_point() +
geom_line()
Alternately, you could replace the missing values with 0s to have the line just go along the axis, but I think it's nicer to not plot the line, which implies no data/missing data, rather than plot 0s which implies no rainfall.

R stacked area chart - ignore NA and retain full x-axis

i've decadal time series from 1700 to 1900 (21 time slices) and for each decade i've got 7 categories that represent a quantity; see here
As you can see, only 5 of the decades actually have data.
I can plot a nice little stacked area chart in R, with the help of this very nice example, which retains only the 5 time slices that have data.
My problem is that i want an x-axis that retains all 21 times slices but still plots a stacked area chart using only the 5 time slices. The idea is that the stacked areas will still only be plotted against the correct year but simply connect up to the next point, 10 ticks down the x-axis, ignoring the no-data in between. i can achieve something in excel but i dont like it.
My reasoning is i want to plot lines on the top of the stacked area that are much more complete, for example from 1700 to 1850, or 1800 to 1900, for visual comparison purposes.
This post suggests how to connect dots in a line chart when you want to ignore NAs but it doesnt work for me in this instance.
a <- 1700:1900
b <- a[seq(1, length(a), 10)]
df <- data.frame("Year"=b,replicate(7,sample(1:21)))
rows <- c(2:10,11:15,17,19,21)
df[rows,2:8] <- NA
df
thanks a lot
If you wish to transform your year to factor, on the lines of the code below:
# Transform the data to long
library(reshape2)
df <- melt(data = df, na.rm = FALSE, id.vars = "Year")
df$Year <- as.factor(df$Year)
# Chart
require(ggplot2)
ggplot(df, aes(Year, value)) +
geom_area(aes(colour = variable, fill= variable), position = 'stack')
It will generate the chart below:
I wasn't sure if you are interested in mapping all of the X variables. I was thinking that this is the case so I reshaped your data. Presumably, it is wiser not to change the Year to factor. The code below:
a <- 1700:1900
b <- a[seq(1, length(a), 10)]
df <- data.frame("Year"=b,replicate(7,sample(1:21)))
rows <- c(2:10,11:15,17,19,21)
df[rows,2:8] <- NA
# Transform the data to long
library(reshape2)
df <- melt(data = df, na.rm = FALSE, id.vars = "Year")
# Leave it as int.
# df$Year <- as.factor(df$Year)
# Chart
require(ggplot2)
ggplot(df, aes(Year, value)) +
geom_area(aes(colour = variable, fill= variable), position = 'stack')
would generate much more meaningful chart:
Potentially, if you decide to use years as factors you may group them and have one category for a number of missing years so the x-axis is more readable. I would say it's a matter of presentation to great extent.

Color Dependent Bar Graph in R

I'm a bit out of my depth with this one here. I have the following code that generates two equally sized matrices:
MAX<-100
m<-5
n<-40
success<-matrix(runif(m*n,0,1),m,n)
samples<-floor(MAX*matrix(runif(m*n),m))+1
the success matrix is the probability of success and the samples matrix is the corresponding number of samples that was observed in each case. I'd like to make a bar graph that groups each column together with the height being determined by the success matrix. The color of each bar needs to be a color (scaled from 1 to MAX) that corresponds to the number of observations (i.e., small samples would be more red, for instance, whereas high samples would be green perhaps).
Any ideas?
Here is an example with ggplot. First, get data into long format with melt:
library(reshape2)
data.long <- cbind(melt(success), melt(samples)[3])
names(data.long) <- c("group", "x", "success", "count")
head(data.long)
# group x success count
# 1 1 1 0.48513473 8
# 2 2 1 0.56583802 58
# 3 3 1 0.34541582 40
# 4 4 1 0.55829073 64
# 5 5 1 0.06455401 37
# 6 1 2 0.88928606 78
Note melt will iterate through the row/column combinations of both matrices the same way, so we can just cbind the resulting molten data frames. The [3] after the second melt is so we don't end up with repeated group and x values (we only need the counts from the second melt). Now let ggplot do its thing:
library(ggplot2)
ggplot(data.long, aes(x=x, y=success, group=group, fill=count)) +
geom_bar(position="stack", stat="identity") +
scale_fill_gradient2(
low="red", mid="yellow", high="green",
midpoint=mean(data.long$count)
)
Using #BrodieG's data.long, this plot might be a little easier to interpret.
library(ggplot2)
library(RColorBrewer) # for brewer.pal(...)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=count),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)
Note that actual values are probably different because you use random numbers in your sample. In future, consider using set.seed(n) to generate reproducible random samples.
Edit [Response to OP's comment]
You get numbers for x-axis and facet labels because you start with matrices instead of data.frames. So convert success and samples to data.frames, set the column names to whatever your test names are, and prepend a group column with the "list of factors". Converting to long format is a little different now because the first column has the group names.
library(reshape2)
set.seed(1)
success <- data.frame(matrix(runif(m*n,0,1),m,n))
success <- cbind(group=rep(paste("Factor",1:nrow(success),sep=".")),success)
samples <- data.frame(floor(MAX*matrix(runif(m*n),m))+1)
samples <- cbind(group=success$group,samples)
data.long <- cbind(melt(success,id=1), melt(samples, id=1)[3])
names(data.long) <- c("group", "x", "success", "count")
One way to set a threshold color is to add a column to data.long and use that for fill:
threshold <- 25
data.long$fill <- with(data.long,ifelse(count>threshold,max(count),count))
Putting it all together:
library(ggplot2)
library(RColorBrewer)
ggplot(data.long) +
geom_bar(aes(x=x, y=success, fill=fill),colour="grey70",stat="identity")+
scale_fill_gradientn(colours=brewer.pal(9,"RdYlGn")) +
facet_grid(group~.)+
theme(axis.text.x=element_text(angle=-90,hjust=0,vjust=0.4))
Finally, when you have names for the x-axis labels they tend to get jammed together, so I rotated the names -90°.

Plotting an filled line chart with 4 variables against a 5th variable ggplot2

I am trying to create a postion="fill" which represents an allocation on the y axis (to always sum to 100) and another variable on the x axis. Variable 1-4 are numeric integers, variable 5 is also numeric. Variable 5 is a continuous numeric. All five variables on are on the same row.
Y axis: variable 1 + variable 2 + variable 3 + variable 4 = 100
X axis: variable 5
Is there a way to do this without melting my data table?
Sample code, caution: runs a bit slow due to how I set up variables 1-4...
library(combinat)
combinations <- combn(100, 4)
permutations <- combinations[, colSums(combinations) == 100]
rm(combinations)
data <- t(rbind(permutations,
replicate(ncol(permutations), cumprod(1+rnorm(20, 0.05, 0.30))[20])
))
One way to generate a reproducible example would be
set.seed(1)
data_ex <- data.frame(t(rmultinom(1000,prob=rep(0.25,4),size=100)),
v5=runif(1000,0.8,1))
and then
library(ggplot2)
library(reshape2)
ggplot(melt(data_ex,id.var="v5")) +
geom_area(aes(x=v5,y=value,fill=variable))
draws the plot.
If you really want to do things the hard way you can avoid using melt, but melt is much (much much) easier!
cumvals <- t(apply(data_ex[,1:4],1,cumsum))
data2 <- data.frame(cumvals,v5=data_ex$v5)
ggplot(data2,aes(x=v5)) +
## these must go in reverse order
geom_area(aes(y=X4),fill="green")+
geom_area(aes(y=X3),fill="purple")+
geom_area(aes(y=X2),fill="red")+
geom_area(aes(y=X1),fill="blue")

Convert absolute values to ranges for charting in R

Warning: still new to R.
I'm trying to construct some charts (specifically, a bubble chart) in R that shows political donations to a campaign. The idea is that the x-axis will show the amount of contributions, the y-axis the number of contributions, and the area of the circles the total amount contributed at this level.
The data looks like this:
CTRIB_NAML CTRIB_NAMF CTRIB_AMT FILER_ID
John Smith $49 123456789
The FILER_ID field is used to filter the data for a particular candidate.
I've used the following functions to convert this data frame into a bubble chart (thanks to help here and here).
vals<-sort(unique(dfr$CTRIB_AMT))
sums<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, sum)
counts<-tapply( dfr$CTRIB_AMT, dfr$CTRIB_AMT, length)
symbols(vals,counts, circles=sums, fg="white", bg="red", xlab="Amount of Contribution", ylab="Number of Contributions")
text(vals, counts, sums, cex=0.75)
However, this results in way too many intervals on the x-axis. There are several million records all told, and divided up for some candidates could still result in an overwhelming amount of data. How can I convert the absolute contributions into ranges? For instance, how can I group the vals into ranges, e.g., 0-10, 11-20, 21-30, etc.?
----EDIT----
Following comments, I can convert vals to numeric and then slice into intervals, but I'm not sure then how I combine that back into the bubble chart syntax.
new_vals <- as.numeric(as.character(sub("\\$","",vals)))
new_vals <- cut(new_vals,100)
But regraphing:
symbols(new_vals,counts, circles=sums)
Is nonsensical -- all the values line up at zero on the x-axis.
Now that you've binned vals into a factor with cut, you can just use tapply again to find the counts and the sums using these new breaks. For example:
counts = tapply(dfr$CTRIB_AMT, new_vals, length)
sums = tapply(dfr$CTRIB_AMT, new_vals, sum)
For this type of thing, though, you might find the plyr and ggplot2 packages helpful. Here is a complete reproducible example:
require(ggplot2)
# Options
n = 1000
breaks = 10
# Generate data
set.seed(12345)
CTRIB_NAML = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_NAMF = replicate(n, paste(letters[sample(10)], collapse=''))
CTRIB_AMT = paste('$', round(runif(n, 0, 100), 2), sep='')
FILER_ID = replicate(10, paste(as.character((0:9)[sample(9)]), collapse=''))[sample(10, n, replace=T)]
dfr = data.frame(CTRIB_NAML, CTRIB_NAMF, CTRIB_AMT, FILER_ID)
# Format data
dfr$CTRIB_AMT = as.numeric(sub('\\$', '', dfr$CTRIB_AMT))
dfr$CTRIB_AMT_cut = cut(dfr$CTRIB_AMT, breaks)
# Summarize data for plotting
plot_data = ddply(dfr, 'CTRIB_AMT_cut', function(x) data.frame(count=nrow(x), total=sum(x$CTRIB_AMT)))
# Make plot
dev.new(width=4, height=4)
qplot(CTRIB_AMT_cut, count, data=plot_data, geom='point', size=total) + opts(axis.text.x=theme_text(angle=90, hjust=1))

Resources