Line being plotted in surv_fit when there are no events - r

I have several cumulative incidence curves showing the incidence of an outcome after an exposure, plotted on the same graph, stratified by age at exposure.
The longest follow-up period is 25 years, but in the oldest group, the follow-up does not last any longer than 15 years (as they are all dead by then).
When I plot the cumulative incidence curves, the curve for the oldest group 'flatlines' after 15 years but does not disappear until the 25-year point, when all the other curves also end (i.e. the plot ends).
Is there a way to stop a cumulative incidence curve being plotted when there is nobody at risk, even if other curves for different groups on the same graph go on for longer than the point you wish to stop plotting at for that one problematic curve?
I did this with the surv_fit function in R.
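One possible starting point, as a minimal sketch only: if the curve points can be put into a data frame with one row per group and time point, each group can be cut off at the last time anyone is still at risk before plotting. The data frame and its column names (strata, time, n.risk, cuminc) are hypothetical and made up for illustration; they are not the output of surv_fit.

```r
# Hypothetical curve data: one row per (group, time) point.
# Column names and values are assumptions for illustration only.
curves <- data.frame(
  strata = rep(c("young", "old"), each = 6),
  time   = rep(c(0, 5, 10, 15, 20, 25), times = 2),
  n.risk = c(100, 90, 80, 70, 60, 50,
             100, 60, 20, 5, 0, 0),
  cuminc = c(0, 0.02, 0.05, 0.08, 0.10, 0.12,
             0, 0.05, 0.09, 0.11, 0.11, 0.11)
)

# Last time point in each group at which somebody is still at risk.
at_risk   <- curves[curves$n.risk > 0, ]
last_time <- tapply(at_risk$time, at_risk$strata, max)

# Look up each row's own group cutoff and keep only points up to it.
cutoff    <- as.vector(last_time[as.character(curves$strata)])
truncated <- curves[curves$time <= cutoff, ]

# Draw each group as a step function that stops at its own cutoff.
plot(NULL, xlim = range(curves$time), ylim = c(0, max(curves$cuminc)),
     xlab = "Years since exposure", ylab = "Cumulative incidence")
groups <- unique(truncated$strata)
for (i in seq_along(groups)) {
  g <- truncated[truncated$strata == groups[i], ]
  lines(g$time, g$cuminc, type = "s", col = i)
}
legend("topleft", legend = groups, col = seq_along(groups), lty = 1)
```

In this made-up data the "old" curve stops at 15 years while the "young" curve runs to 25. The same filtering idea should carry over to whatever table of time points and numbers at risk the fitted object provides.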

Related

R function for determining oscillations

Given the plot of the time-series data, I was wondering if there is a robust function/mathematical formula I can use in R to determine which plots are oscillating. For example each individual graph corresponds to a single cell's intensity value over a certain time period. I would want a method to give a score or some value that would be able to differentiate between plots that are not oscillating (#513 and 559) compared to the plots that are oscillating (508,512,557,558). All the plots have the same scaling.

How to change info in a histogram in r?

I'm trying to build a histogram in which the X-axis shows each case I'm working with (my matrix's info includes the murders' resolution rate for different police stations in one city for a year), each police station, and the Y-axis would show the resolution rate (from 0 to 1). So, there would be 51 bars, one for each police station, and each one should reach one of those rates from 0 to 1.
But when I run hist with my matrix, the X-axis displays resolution rates and the Y-axis displays the frequency, the number of police stations that reach each resolution rate.
How can I get the result I wrote before? This is the code I'm using:
anobase<-matrix(CResolucion[seleccion_ano==2018], length(seleccion_estado), 1)
rownames(anobase) <- seleccion_estado
colnames(anobase) <- 2018
hist(anobase)
(and, yeah, I'm new at using R)
So, that's the plot. As you see, the X-axis displays values from 0 to 1. These values represent the resolution rate mentioned before (the result of dividing solved murders by the total number of murders registered). The Y-axis, on the other hand, displays a frequency from 0 to 15. So each bar shows how many stations have each resolution rate. What I want is to show each police station on the X-axis, so each bar would be a police station, and its height would be that station's resolution rate from 0 to 1 (Y-axis). I hope I'm being clear.
You don't want a histogram; you want a column or bar chart. Histograms summarize the distribution of a single continuous variable; column charts compare values of a continuous variable across categories (here, police stations).
You haven't posted a reproducible example, so I can't tell exactly what's going on with your data. Let's assume, though, that you have a vector of resolution rates called rates and a vector of station names associated with those rates called stations. In base R, you could then create a column chart with barplot(rates, names.arg = stations).
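As a minimal sketch of that barplot call, here is a version with made-up data. The station names and rates are hypothetical placeholders for the 51 stations in the question, and the one-column matrix only imitates the shape of anobase.

```r
# Hypothetical data standing in for the real stations and rates.
stations <- c("Station A", "Station B", "Station C", "Station D")
rates    <- c(0.35, 0.80, 0.55, 0.10)

# One bar per station; bar height is the resolution rate (0 to 1).
mids <- barplot(rates, names.arg = stations, ylim = c(0, 1),
                ylab = "Resolution rate", las = 2)

# If the rates sit in a one-column matrix with station names as row names
# (the shape anobase appears to have), pull out the column first.
rate_matrix <- matrix(rates, ncol = 1, dimnames = list(stations, "2018"))
barplot(rate_matrix[, 1], names.arg = rownames(rate_matrix),
        ylim = c(0, 1), ylab = "Resolution rate", las = 2)
```

The las = 2 argument turns the axis labels perpendicular to the axis, which tends to help when there are many station names.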

pixelwise rescaling of a time series using cumulative distribution function matching

I'm using R and I have a raster stack of surface soil moisture measurements from a radiometer fixed on an observation tower. These data are daily values going back 10 years.
I also have another raster stack of satellite microwave measurements of soil moisture over a larger area going back 25 years. Both sensors have similar frequencies.
On a per-pixel basis, I would like to use linear cumulative distribution function (CDF) matching to rescale the satellite data against the tower data, so that the result is a longer time series of rescaled satellite data.
The point is to correct for systematic differences between the soil moisture values and extend the time series. This is similar to what was done in the figure below, where they matched the AMSR-E (blue plot) and ASCAT (red plot) data to Noah data (black plot).
Does anyone know how to implement this in R? Or at the very least help me get started? I've scoured the Internet and this website without success.
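To get started, here is a rough sketch of piecewise-linear CDF matching for a single pixel in base R. The function name cdf_match, its arguments and the simulated series are all hypothetical. It assumes the quantiles are estimated from the overlapping period and then applied to the full satellite series, which may differ from the method behind the figure. Looping it over every pixel of a raster stack is left out.

```r
# Piecewise-linear CDF matching for ONE pixel (hypothetical helper).
# sat_full:      full satellite series to rescale
# sat_overlap:   satellite values from the period both sensors cover
# tower_overlap: tower values from that same period
cdf_match <- function(sat_full, sat_overlap, tower_overlap,
                      probs = seq(0, 1, by = 0.1)) {
  q_sat   <- quantile(sat_overlap,   probs, na.rm = TRUE, names = FALSE)
  q_tower <- quantile(tower_overlap, probs, na.rm = TRUE, names = FALSE)
  # Map satellite values onto the tower quantiles, interpolating linearly
  # between percentiles; rule = 2 clamps values outside the overlap range.
  approx(x = q_sat, y = q_tower, xout = sat_full,
         rule = 2, ties = mean)$y
}

# Simulated example: the satellite reads systematically lower and flatter.
set.seed(42)
tower_overlap <- runif(365, 0.10, 0.40)
sat_overlap   <- 0.05 + 0.5 * runif(365, 0.10, 0.40)
sat_full      <- 0.05 + 0.5 * runif(1000, 0.10, 0.40)

rescaled <- cdf_match(sat_full, sat_overlap, tower_overlap)
summary(sat_full)
summary(rescaled)
```

After matching, the rescaled values lie in the range of the tower data rather than the original satellite range. Using more percentiles in probs gives a finer piecewise-linear fit.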

K-means clustering interpretation

I have a pairs plot of 3 clusters over "Av. Mon. Hrs", "Sat. Lvl" and "Last Eval", which I produced as a scatterplot matrix with the code below.
library("ggplot2") # Expanded plotting functionality over "lattice" package
x<-cbind(HR_left$average_montly_hours,HR_left$satisfaction_level,HR_left$last_evaluation)
kmfit<-kmeans(x,3,nstart=25)
# Find the best 3 clusters using 25 random sets of (distinct) rows in x as initial centres.
pairs(x,col=(kmfit$cluster), labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
It says
Cluster 1: The pairs plot characterises this cluster as employees with low average monthly hours, a middle satisfaction range and a low last evaluation.
Cluster 2: From the pairs plot, this cluster is characterised by high monthly hours, very low satisfaction and high evaluation.
Cluster 3: From the pairs plot, this cluster is characterised by high monthly hours, high satisfaction and high evaluation.
But I don't understand how the pairs plot graphs lead to these three findings.
library(readr)
HR_comma_sep <- read_csv("https://stluc.manta.uqcloud.net/mdatascience/public/datasets/HumanResourceAnalytics/HR_comma_sep.csv")
HR_left<-HR_comma_sep[HR_comma_sep$left==1,]
library("ggplot2") # Expanded plotting functionality over "lattice" package
x<-cbind(HR_left$average_montly_hours,HR_left$satisfaction_level,HR_left$last_evaluation)
kmfit<-kmeans(x,3,nstart=25)
# Find the best 3 clusters using 25 random sets of (distinct) rows in x as initial centres.
pairs(x,col= (kmfit$cluster),labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
The "monthly hours" variable is on a very different scale from the other two variables, which skews the clustering: the differences in hours worked dominate the differences in the other two variables.
Normalize each column by dividing by the mean, the range or finding the z-score.
Original Code:
library(readr)
HR_comma_sep <- read_csv("https://stluc.manta.uqcloud.net/mdatascience/public/datasets/HumanResourceAnalytics/HR_comma_sep.csv")
HR_left<-HR_comma_sep[HR_comma_sep$left==1,]
library("ggplot2")
x_org<-cbind(HR_left$average_montly_hours,
HR_left$satisfaction_level,
HR_left$last_evaluation)
kmfit<-kmeans(x_org, 3, nstart = 25)
pairs(x_org,col= (kmfit$cluster),labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
Repeating the calculation using scaled values:
x_scaled<-cbind(scale(HR_left$average_montly_hours),
scale(HR_left$satisfaction_level),
scale(HR_left$last_evaluation))
kmfit<-kmeans(x_scaled, 3)
pairs(x_org,col= (kmfit$cluster),labels=c("Av. Mon. Hrs","Sat. Lvl","Last Eval."))
Using just the raw values, the clustering is based almost entirely on differences in "monthly hours". The top plot shows 2 clusters (black and green) merged together and not clearly distinct.
After scaling the values and repeating the clustering, 3 clearly differentiated clusters are now shown (bottom image).
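For reference, the three normalization options mentioned above (dividing by the mean, dividing by the range, or taking the z-score) could be written as follows. This is only a sketch on a small made-up matrix, not the HR data.

```r
# Small hypothetical matrix: hours on a much larger scale than the two scores.
x <- cbind(hours = c(150, 260, 140, 280, 250),
           sat   = c(0.40, 0.10, 0.45, 0.80, 0.85),
           eval  = c(0.50, 0.90, 0.48, 0.95, 0.88))

# Option 1: divide each column by its mean.
by_mean <- sweep(x, 2, colMeans(x), "/")

# Option 2: divide each column by its range (max - min).
col_range <- apply(x, 2, function(v) diff(range(v)))
by_range  <- sweep(x, 2, col_range, "/")

# Option 3: z-score, i.e. (value - column mean) / column sd.
z_score <- scale(x)

# Compare the spread of the columns before and after z-scoring.
apply(x, 2, sd)
apply(z_score, 2, sd)
```

Any of the three puts the columns on comparable scales, so that no single variable dominates the distances kmeans works with.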

Interpretation of a graph created by the R package seas

I am relatively new to RStudio and R in general, and I am not even sure if this is the right place to ask this question. I was instructed to draw a graph showing seasonality using daily rainfall over a number of years. I need help with interpreting the graph more than with plotting it.
There is an example already in R using mscdata that I was able to replicate using my own data; the code for the example is below. Any help with what this graph means or explains will be greatly appreciated. Thank you.
install.packages("seas")
library(seas)
data(mscdata)
dat <- mksub(mscdata, id=1108447)
dat.ss <- seas.sum(dat, width="mon")
x<-mscdata
# Structure in R
str(dat.ss)
tail(mscdata)
# Annual data
dat.ss$ann
# Demonstrate how to slice through a cubic array
dat.ss$seas["1990",,]
dat.ss$seas[,2,] # or "Feb", if using English locale
dat.ss$seas[,,"precip"]
# Simple calculation on an array
(monthly.mean <- apply(dat.ss$seas[,,"precip"], 2, mean,na.rm=TRUE))
barplot(monthly.mean, ylab="Mean monthly total (mm/month)",
main="Un-normalized mean precipitation in Vancouver, BC")
text(6.5, 150, paste("Un-normalized rates given 'per month' should be",
"avoided since ~3-9% error is introduced",
"to the analysis between months", sep="\n"))
# Normalized precip
norm.monthly <- dat.ss$seas[,,"precip"] / dat.ss$days
norm.monthly.mean <- apply(norm.monthly, 2, mean,na.rm=TRUE)
print(round(norm.monthly, 2))
print(round(norm.monthly.mean, 2))
barplot(norm.monthly.mean,
ylab="Normalized mean monthly total (mm/day)",
main="Normalized mean precipitation in Vancouver, BC")
# Better graphics of data
dat.ss <- seas.sum(dat, width=11)
image(dat.ss)
This code gives a graph showing sample quartiles and annual rainfall, but I don't really know what it means. Any help whatsoever will be appreciated.
The graph produced with the seas package is below:
[Plot]
I'll start with the top-left graph:
You've probably guessed that each row is a year (as shown by the Y-axis) while day groups/months of the year are X-axis. The color of each box of the heatmap is proportionally darker according to the mm's worth of rain in that day group, with the scale being displayed on the far right. I assume the red X's mean missing values.
Top right is like a barplot with the sum of rainfall each year (row), just continuously plotted. The red bar should be the average precipitation overall (not sure about the orange one).
Bottom left is a bit more tricky. Think of it like you reordered the rows in each column to have the heaviest rainfall of the day group at the top (forgetting about the year info here). The Y-axis shows the quantiles. The quantiles' respective values change for each day group, so the lines you see on top of the plot indicate key rainfall values in mm (4,6,8,10,12). Indeed, if you look at the 2mm line (lowest one), you'll see that in January, about 20% of rainfalls (across all years) are below this threshold, while at the end of July, over 80% are below 2mm (expect less rainfall in the summer).
Lastly, bottom right is similar to the one above it. It's the sum of all rows, referring to the quantiles rather than years this time, resulting in the staircase pattern.
You'll notice that since the scale of the plot is the same as the one showing the average per year, the top of the staircase is outside of the plot...
Hope I made that clear enough.
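If it helps, the idea behind the bottom-left panel can be imitated in base R with simulated numbers. This is only an illustration of the "sort each column and read off quantiles" reasoning; the rainfall values are made up, and this is not how seas computes its plot.

```r
set.seed(1)
# Hypothetical rainfall (mm/day): 30 years (rows) x 12 months (columns),
# simulated to be wetter in winter than in summer.
wetness <- c(6, 5, 4, 3, 2, 1.5, 1, 1, 2, 4, 6, 6)
rain <- sapply(wetness, function(m) rexp(30, rate = 1 / m))
colnames(rain) <- month.abb

# Sort each column so the heaviest value is on top; the year is forgotten.
sorted <- apply(rain, 2, sort, decreasing = TRUE)

# Quantiles of each month across all years.
qs <- apply(rain, 2, quantile, probs = c(0.2, 0.5, 0.8))
round(qs, 1)

# Share of years below a 2 mm threshold in each month.
below_2mm <- colMeans(rain < 2)
round(below_2mm, 2)
```

With data shaped like this, the share of values below 2 mm should typically come out much higher in the summer months than in winter, which is the pattern the 2mm line traces in the plot.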
