I am relatively new to RStudio and R in general, and I am not even sure if this is the right place to ask this question. I was instructed to draw a graph showing seasonality using daily rainfall over a number of years. I need help more with interpreting the graph than with plotting it.
There is an example already in R using mscdata that I was able to replicate with my own data; the code for the example is below. Any help with what this graph means or explains will be greatly appreciated. Thank you.
install.packages("seas")
library(seas)
data(mscdata)
dat <- mksub(mscdata, id=1108447)
dat.ss <- seas.sum(dat, width="mon")
x<-mscdata
# Structure in R
str(dat.ss)
tail(mscdata)
# Annual data
dat.ss$ann
# Demonstrate how to slice through a cubic array
dat.ss$seas["1990",,]
dat.ss$seas[,2,] # or "Feb", if using English locale
dat.ss$seas[,,"precip"]
# Simple calculation on an array
(monthly.mean <- apply(dat.ss$seas[,,"precip"], 2, mean,na.rm=TRUE))
barplot(monthly.mean, ylab="Mean monthly total (mm/month)",
main="Un-normalized mean precipitation in Vancouver, BC")
text(6.5, 150, paste("Un-normalized rates given 'per month' should be",
"avoided since ~3-9% error is introduced",
"to the analysis between months", sep="\n"))
# Normalized precip
norm.monthly <- dat.ss$seas[,,"precip"] / dat.ss$days
norm.monthly.mean <- apply(norm.monthly, 2, mean,na.rm=TRUE)
print(round(norm.monthly, 2))
print(round(norm.monthly.mean, 2))
barplot(norm.monthly.mean,
ylab="Normalized mean monthly total (mm/day)",
main="Normalized mean precipitation in Vancouver, BC")
# Better graphics of data
dat.ss <- seas.sum(dat, width=11)
image(dat.ss)
This code gives a graph showing sample quantiles and annual rainfall, but I don't really know what it means. Any help whatsoever will be appreciated.
The graph produced with the seas package is shown below.
[Plot]
I'll start with the top left graph:
You've probably guessed that each row is a year (as shown on the Y-axis), while the day groups/months of the year are on the X-axis. The color of each box in the heatmap is darker in proportion to the mm of rain in that day group, with the scale displayed on the far right. I assume the red X's mark missing values.
Top right is like a barplot of the total rainfall for each year (row), just plotted continuously. The red bar should be the overall average precipitation (I'm not sure about the orange one).
Bottom left is a bit trickier. Think of it as reordering the rows in each column so that the heaviest rainfall of the day group is at the top (forgetting about the year information here). The Y-axis shows the quantiles. The rainfall value at a given quantile changes from one day group to the next, so the lines drawn on top of the plot mark key rainfall values in mm (4,6,8,10,12). For example, if you look at the 2 mm line (the lowest one), you'll see that in January about 20% of rainfall values (across all years) are below this threshold, while at the end of July over 80% are below 2 mm (expect less rainfall in the summer).
Lastly, bottom right is similar to the one above it. It's the sum of all rows, referring to the quantiles rather than years this time, resulting in the staircase pattern.
You'll notice that since the scale of the plot is the same as the one showing the average per year, the top of the staircase is outside of the plot...
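To make the two bottom panels more concrete, here is a minimal sketch of the reordering idea in base R. It uses made-up rainfall values in a hypothetical years-by-months matrix called rain, not the seas internals, so treat it as an illustration of the idea rather than what image() actually computes.

```r
set.seed(123)

# Hypothetical data: 10 years (rows) x 12 months (columns) of rainfall in mm/day
rain <- matrix(rgamma(120, shape = 2, scale = 3), nrow = 10,
               dimnames = list(1991:2000, month.abb))

# Bottom left: sort each column so the heaviest value is at the top;
# the year information is lost, each row is now a sample quantile
sorted <- apply(rain, 2, sort, decreasing = TRUE)

# Fraction of years below 2 mm in each month (where a 2 mm line would sit)
frac_below_2mm <- colMeans(rain < 2)
print(round(frac_below_2mm, 2))

# Bottom right: sum across each row of the sorted matrix,
# which gives the staircase of totals per quantile
quantile_totals <- rowSums(sorted)
print(round(quantile_totals, 1))
```

The top row of sorted adds up the wettest value of every month, which is why the top of the staircase can be larger than any real annual total.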
Hope I made that clear enough.
I have a series of daily values, y. For each day, d_i (i.e., each row), I would like to calculate the (graph) area, a_i, of the region between the curve and the horizontal line y = y_i between d_i and the most recent previous occurrence of the value y_i. Sketch below. Because observations occur at regular, discrete timesteps (daily), the calculated area, a_i, is equivalent to the sum of the daily differences between each daily y and y_i (black bars in figure). I'm interested only in valleys, so the calculated area, a_i, can be set to 0 when y is decreasing (y_i - y_(i-1) <= 0).
Toy data below. Expected result shown in dat$a.
dat$a[6] was calculated from 55 - 50;
dat$a[7] was calculated from (60-55)+(60-50). And so on.
library(lubridate)  # provides as_date()
dat = data.frame(d = seq.Date(as_date("2021-01-01"), as_date("2021-01-10"), by = "1 day"),
                 y = c(100,95,90,70,50,55,60,75,85,90),
                 a = c(0,0,0,0,0,5,15,65,115,145))
My first thought was to calculate the area between the curve and the horizontal line y = yi between day di and the most recent previous occurrence of the value yi, using perhaps geiger::area.between.curves(), but I couldn't work out how to identify the most recent previous occurrence of the value yi.
[In case the context helps, the actual data are daily values of the area (m2) of a wetland not submerged by water. When the water rises, a portion of the wetland that had been dry for some time becomes wet. Here, I'm trying to calculate the extent of the reflooding in m2-days. A portion of the wetland that has been dry for a long time but becomes reflooded will contribute many m2-days to the sum.]
I'm most comfortable in the tidyverse, and such answers are greatly preferred. I am not familiar with data.table.
Thanks in advance
Update
I was able to achieve my desired calculation in Excel, though it's brutally inelegant. A couple hundred rows are in an example, linked below. Given that my real data are 180k rows, my poor machine hated the 18 million calculated cells. Though I can move on with my analysis, I am still very interested in an R solution. My implemented approach differs subtly from my imagined R approach in that it's summing 'horizontal rectangles', so to speak, each of the same (small) y-unit height, rather than 'vertical rectangles', each of unit width.
Here's the file.
Since the question is missing complete information, we will compute the area under the curve, assuming that a day is one unit. Modify as appropriate for your specific problem.
library(pracma)
nr <- nrow(dat)
dat0 <- dat[c(1, 1:nr, nr), ]
# dat0 has nr + 2 rows; put the two padded end points on y = 0 to close the polygon
dat0[c(1, nrow(dat0)), "y"] <- 0
with(dat0, abs(polyarea(as.numeric(d), y)))
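If what is wanted is the per-day valley area described in the question rather than the total area under the curve, a plain loop is one way to sketch it. The function name reflood_area is made up for this example, and I am assuming that when no earlier value reaches y[i] the sum simply runs back to the start of the series.

```r
# Hypothetical helper: for each day i, sum (y[i] - y[j]) over the preceding
# days j, walking backwards until a value >= y[i] is met.
# Days where y is not rising get an area of 0.
reflood_area <- function(y) {
  n <- length(y)
  a <- numeric(n)
  for (i in seq_len(n)[-1]) {
    if (y[i] <= y[i - 1]) next          # decreasing or flat: area stays 0
    j <- i - 1
    while (j >= 1 && y[j] < y[i]) {     # stop at the previous occurrence of y[i]
      a[i] <- a[i] + (y[i] - y[j])
      j <- j - 1
    }
  }
  a
}

# Toy data from the question
y <- c(100, 95, 90, 70, 50, 55, 60, 75, 85, 90)
reflood_area(y)
```

On the toy data this reproduces the expected column dat$a. Since it takes a plain vector, it should also work inside dplyr::mutate(a = reflood_area(y)). The backward walk can be slow on long series with deep valleys, so it may need a smarter approach for 180k rows.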
I'm trying to build a histogram in which the X-axis shows each case I'm working with, that is, each police station (my matrix holds the murder resolution rate for the different police stations in one city for one year), and the Y-axis shows the resolution rate (from 0 to 1). So, there would be 51 bars, one for each police station, and each one should reach that station's rate between 0 and 1.
But when I run hist with my matrix, the X-axis displays resolution rates and the Y-axis displays the frequency, the number of police stations that reach each resolution rate.
How can I get the result I wrote before? This is the code I'm using:
anobase<-matrix(CResolucion[seleccion_ano==2018], length(seleccion_estado), 1)
rownames(anobase) <- seleccion_estado
colnames(anobase) <- 2018
hist(anobase)
(and, yeah, I'm new at using R)
So, that's the plot. As you can see, the X-axis displays values from 0 to 1. These values are the resolution rate mentioned above (solved murders divided by the total number of murders registered). The Y-axis, on the other hand, displays a frequency from 0 to 15. So each bar shows how many police stations have each resolution rate. What I want is for the X-axis to show each police station, so that each bar is a police station and reaches its resolution rate from 0 to 1 (Y-axis). I hope I'm being clear.
You don't want a histogram; you want a column or bar chart. Histograms summarize the distribution of a single continuous variable; column charts compare values of a continuous variable across categories (here, police stations).
You haven't posted a reproducible example, so I can't tell exactly what's going on with your data. Let's assume, though, that you have a vector of resolution rates called rates and a vector of station names associated with those rates called stations. In base R, you could then create a column chart with barplot(rates, names.arg = stations).
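Here is a minimal sketch of that call with made-up numbers. The vectors rates and stations are hypothetical stand-ins for your data; with your matrix, something like barplot(anobase[, 1], names.arg = rownames(anobase)) should be the equivalent, assuming the row names hold the station names.

```r
# Hypothetical data: one resolution rate (0 to 1) per police station
stations <- paste("Station", 1:5)
rates <- c(0.42, 0.80, 0.15, 0.63, 0.55)

# One bar per station, bar height = resolution rate
mids <- barplot(rates,
                names.arg = stations,
                ylim = c(0, 1),
                las = 2,               # rotate labels so long names fit
                ylab = "Resolution rate",
                main = "Murder resolution rate by police station")
```

barplot() returns the x positions of the bars (stored in mids here), which is handy if you later want to add text or points on top of them.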
I am trying to draw a boxplot in R, but I can't see the boxes. Below is what I see:
and here is the code I am using:
ggplot(data = FIR,
       aes(x = as.factor(Revised.Status), y = as.numeric(Diff_Date_Requested))) +
  geom_boxplot(fill = "blue", alpha = 0.2) +
  xlab("Status")
Revised.Status is the result of the request; it is either A or C.
Diff_Date_Requested is the difference between two dates.
From ?geom_boxplot:
The lower and upper hinges correspond to the first and third quartiles
(the 25th and 75th percentiles).
If your 25th percentile and 75th percentile are the same value, or are really close together, you won't see the box, just a single line.
It looks like you might be working with count data or other integer values. If more than half your y-values are 1 (i.e., whatever it is happens the next day, so difference = 1), that might explain it.
You might want to think about other ways of visualizing count data, like geom_bar().
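A quick way to check whether this is what is happening is to look at the quartiles directly. The sketch below uses a made-up vector y in place of Diff_Date_Requested, and quantile() as a stand-in for how the hinges are computed, so the exact numbers for your data will differ.

```r
# Hypothetical day differences: most requests are handled the next day
y <- c(1, 1, 1, 1, 1, 1, 1, 1, 2, 5)

# First and third quartiles (the lower and upper hinges of the box)
hinges <- quantile(y, probs = c(0.25, 0.75))
print(hinges)

# Height of the box; 0 means the box collapses to a single line
box_height <- unname(diff(hinges))
print(box_height)

# A bar chart of the counts shows the same data more usefully
counts <- table(y)
barplot(counts, xlab = "Difference in days", ylab = "Number of requests")
```

If box_height comes out as 0 (or close to it) for your groups, the plot is behaving correctly and a count-based chart is probably the better choice.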
Dear stackoverflow community,
I'm quite new to R and this is my first Stack Overflow post, so please bear with me if the question is not perfectly asked.
I'm calculating the standardized precipitation index (SPI) with the package "SPEI" for a time series from a climate station with 20 years of monthly precipitation data. I have done this for time scales of 1 and 12 months like this:
spi1 <- spi(SPI_Anu_input_ts[,'PRCP_Anu'], 1)
spi12 <- spi(SPI_Anu_input_ts[,'PRCP_Anu'], 12)
The output of SPI is not a matrix or a dataframe, it's a list. Inside this list under the entry fitted you find a timeseries with the wanted and calculated index values.
To plot these index values you don't have to enter x & y like usual:
plot(x, y, ...)
You can just use the complete list:
par(mfcol=c(2,1))
plot(spi1, 'Anuradhapura, SPI-1')
plot(spi12, 'Anuradhapura, SPI-12')
Then it looks like this:
[Plot: SPI1 & SPI12]
Part of the SPI calculation is that, for a time scale of n months, the first index value only becomes available in month n. The precipitation data start in Jan 1990, so the indices for SPI1 start in January, but those for SPI12 start in December (the first 11 months are NA).
As you can see in the graphic both x and y axes are shifted. Neither
xlim=as.Date(c("1990-01-01","2017-09-01"))
nor any axes limitation like
ylim=c(-2.5,2.5)
is working to have the same value range in both graphics.
Does anyone know how to solve this?
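One thing that may be worth trying is to plot the fitted series directly instead of the whole list, since a ts object is plotted on a numeric time axis (years such as 1990.0), not on Dates, which would explain why xlim given as Dates has no visible effect. The sketch below uses random stand-ins for spi1$fitted and spi12$fitted and assumes both are monthly ts objects starting in Jan 1990; I have not checked how the plot method of the SPEI package handles xlim and ylim.

```r
set.seed(1)

# Stand-ins for spi1$fitted and spi12$fitted (assumed: monthly ts from Jan 1990)
fit1  <- ts(rnorm(240), start = c(1990, 1), frequency = 12)
fit12 <- ts(c(rep(NA, 11), rnorm(229)), start = c(1990, 1), frequency = 12)

# Shared axis limits: time is in numeric years, not Dates
xl <- range(as.numeric(time(fit1)))
yl <- c(-2.5, 2.5)

op <- par(mfcol = c(2, 1))
plot(fit1, xlim = xl, ylim = yl, ylab = "SPI-1", main = "Anuradhapura, SPI-1")
abline(h = 0, lty = 2)
plot(fit12, xlim = xl, ylim = yl, ylab = "SPI-12", main = "Anuradhapura, SPI-12")
abline(h = 0, lty = 2)
par(op)
```

Because both panels get the same xl and yl, the leading NA values of the 12-month series just leave a gap at the start instead of shifting the axis.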
I need to plot a Lorenz curve of a cumulative variable as a function of the number of observations. I want both axes to be displayed as percentages. For example, say the observations are buyers and the y variable is the amount they bought; the buyers are already ranked in descending order, and I want a plot that says "The top 10% of buyers purchased 90% of the total bought". My dataset is a couple million observations.
What is the best way to do this? Sub-questions:
If I need to add two variables for the quantiles of total observations and total $ bought (so as to use them to plot), what is the object that returns the row number? I tried:
user_quantile <- row(df)/nrow(df)
but I get a matrix of identical columns (user_quantile.1, user_quantile.2) of which I only need one column.
Is there instead any way to skip adding percentages as variables and only have them for axes values?
The plot has far more points than I need to draw the line. What is the best approach to minimize the computational effort and get a nice graph?
Thanks.
You may want to acquaint yourself with the excellent RSeek search engine for R content. One quick query for Lorentz curve (and Lorenz curve) led to these packages:
ineq: Measuring inequality, concentration, and poverty
reldist: Relative Distribution Methods
GeoXp: Interactive exploratory spatial data analysis
lawstat: An R package for biostatistics, public policy and law
all of which seem to supply a Lorenz curve function.
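If you would rather build the curve by hand, the sub-questions can be sketched in base R as follows. The data frame df and its column amount are made up here; seq_len(nrow(df)) gives the row number as a plain vector, and plotting only about 1000 evenly spaced rows is just one simple way to thin the line.

```r
set.seed(42)

# Hypothetical data: purchase amounts, already ranked in descending order
df <- data.frame(amount = sort(rexp(1e5, rate = 1 / 50), decreasing = TRUE))
n <- nrow(df)

# Row number as a vector (row(df) returns a matrix instead)
df$user_share   <- seq_len(n) / n
df$amount_share <- cumsum(df$amount) / sum(df$amount)

# Thin the curve: keep roughly 1000 evenly spaced rows for plotting
keep <- unique(round(seq(1, n, length.out = 1000)))

plot(100 * df$user_share[keep], 100 * df$amount_share[keep], type = "l",
     xlab = "Cumulative % of buyers (largest first)",
     ylab = "Cumulative % of amount bought")
abline(0, 1, lty = 2)  # line of perfect equality
```

The shares are multiplied by 100 only inside plot(), so the axes read as percentages without having to store extra percentage columns.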
In order to get the plot done you need first to arrange the raw data.
1) You can use the cut2() function from the Hmisc package to cut the data in quantiles. Check the documentation, it's not hard. It's similar to the cut() from the base package.
2) After using the cut2() function with the income data, you need to compute the frequency of each decile. Use table() for that. Then calculate percentages of income for each decile.
3) Now you should have a very small table with the following columns:
Decile, cumulative % of total income.
Add another column with the 45 degree line. Just add a constant cumulative % of income.
finaltable$cumulative_equality_line = seq(0.1, 1, by = 0.1)
4) You can use base graphics or ggplot2 for plotting. I guess you can do it with the info of step 3 or perhaps check out specific plotting questions.
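The steps above can be sketched roughly as follows. Note the swap: this uses base R's cut() with quantile() breaks instead of Hmisc::cut2(), and the income vector is simulated, so both the data and the column name cumulative_income are assumptions for illustration.

```r
set.seed(1)

# Hypothetical income data
income <- rlnorm(1000, meanlog = 10, sdlog = 1)

# Step 1: assign each observation to a decile (base R stand-in for cut2)
breaks <- quantile(income, probs = seq(0, 1, by = 0.1))
decile <- cut(income, breaks = breaks, include.lowest = TRUE, labels = 1:10)

# Step 2: total income per decile, then cumulative share of total income
income_by_decile <- tapply(income, decile, sum)

# Step 3: small table with the cumulative share and the 45 degree line
finaltable <- data.frame(
  decile = 1:10,
  cumulative_income = as.numeric(cumsum(income_by_decile) / sum(income)),
  cumulative_equality_line = seq(0.1, 1, by = 0.1)
)
print(finaltable)

# Step 4: plot the Lorenz curve against the line of equality
plot(c(0, finaltable$decile / 10), c(0, finaltable$cumulative_income),
     type = "b", xlab = "Cumulative share of population",
     ylab = "Cumulative share of income")
lines(c(0, 1), c(0, 1), lty = 2)
```

With only ten points the curve is coarse; using more quantile groups (for example percentiles) gives a smoother line at almost no extra cost.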
I'll have to do it soon, but I already have the final table. I'll post the code for plotting once I do it.
Good luck!