How to count number of events of length 5<=N<10 days meeting a set condition using CDO? - netcdf

I need help with a CDO operation on a netCDF file. I downloaded 40 years of ERA5 data over a grid region, and using cdo I masked the variable so that values in a range (30-50) become 1 and all other values become 0.
cdo -expr,'var2=(var>=30 && var<50)' data1.nc data2.nc
Now I want to calculate the number of times each grid cell recorded var2 = 1 consecutively for at least 5 days but less than 10 days over the last 40 years. Is that possible using cdo or nco?

First of all, I'm assuming your input has already been converted to daily values; you don't say.
Then you also need to clarify the question. Your title originally said "how to count the number of days?", which was a bit ambiguous.
Let's say you have a series like this that represents an 8 day event:
0 0 1 1 1 1 1 1 1 1 0 0 0 0
Does that count as a single occurrence? Your text seemed to imply that it does, but the title did not. I think you wanted to know the "number of events", not "days", so I edited your title to agree with the main text of the question. I hope this interpretation is correct.
I think you can do it, but the solution is a bit long-winded. You can use runsum to give you a "1" for any day which is 1 and is at the end of a series of N days, like this:
cdo gec,N -runsum,N in.nc out5.nc
But that doesn't totally answer your question. For example if N=5 this would convert the above series into this:
0 0 0 0 0 0 1 1 1 1 0 0 0 0
i.e there are 4 days on the end of a 5 day series.
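To see what the thresholded running sum does, here is a minimal pure-Python sketch of the same idea on the example series. The function name `end_of_run_flags` is made up for illustration, and the sketch assumes the result is placed on the last day of each window and padded with zeros at the start, whereas cdo shortens the series and stamps the window mid-point by default.

```python
def end_of_run_flags(series, n):
    """Return 1 for each day that closes a window of n consecutive 1s, else 0.

    Mimics `gec,n -runsum,n` on a 0/1 series. Assumption: the flag sits on
    the last day of the window, and days without a full window get 0.
    """
    flags = []
    for t in range(len(series)):
        window = series[max(0, t - n + 1): t + 1]
        flags.append(1 if len(window) == n and sum(window) >= n else 0)
    return flags


example = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
print(end_of_run_flags(example, 5))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```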
How can we get an upper limit on the length of an event? Well, if we do the same calculation for events of >=10 days and add the two results together, each day gets one of three values:
0: not an event
1: an event of at least 5 days but less than 10
2: an event of >=10 days (and of course >=5 days)
So we just add the two series and pick out the 1s to get the range of event lengths you require:
cdo gec,10 -runsum,10 in.nc out10.nc
# only keep events of 5,6,7,8 and 9 days in length:
cdo eqc,1 -add out5.nc out10.nc out5-10.nc
Okay now we have a file where var=1 when it is at the end of a series of at least 5 but less than ten days.
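The add-and-select step can be sketched in plain Python as follows. The helper name `keep_sum_of_one` is hypothetical, and the sketch assumes both flag series share the same time axis (for instance, both stamped on the last day of the window), which is something to verify for the real files.

```python
def keep_sum_of_one(out5, out10):
    """Add two 0/1 flag series day by day and keep only days where the sum is 1.

    Mimics `eqc,1 -add out5.nc out10.nc` for a single grid point.
    Assumption: both inputs are aligned on the same days.
    """
    summed = [a + b for a, b in zip(out5, out10)]
    return [1 if s == 1 else 0 for s in summed]


# Flags for the 8-day example event: four days close a 5-day window,
# no day closes a 10-day window.
out5 = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
out10 = [0] * 14
print(keep_sum_of_one(out5, out10))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```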
Now this is the cool part: we can apply the same technique using runmean/runsum to pick up the START and END of each of these series, and then we can add up these events. If we apply a runsum with a window size of 2, this produces 1 for a sequence of "0 1" or "1 0", i.e. it picks up the start and end points of each event.
cdo eqc,1 -runsum,2 out5-10.nc out_start_end.nc
This command turns our example series into the following, since only a sequence of "0 1" or "1 0" sums to 1 (each value is shown on the second day of its pair):
0 0 0 0 0 0 1 0 0 0 1 0 0 0
Now we just need to sum this in time and divide by 2 (I told you it was long winded!)
cdo divc,2 -timsum out_start_end.nc number_of_events.nc
ta da!
Note 1: if the whole input series ends mid-event, e.g. 0 0 1 1 1, this method will count this as a "half" event, since you only pick up the start. Round down to the nearest integer if this upsets you.
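The start/end trick and the "half event" case can be checked on a single series with a short sketch like this one. The function name `count_events` is made up for illustration; it counts adjacent pairs that sum to 1 and halves the total, which is what the window-2 running sum followed by the time sum and division does.

```python
def count_events(flags):
    """Count "0 1" and "1 0" pairs in a 0/1 series and divide by two.

    Mimics `eqc,1 -runsum,2` followed by `divc,2 -timsum` for one grid point.
    """
    transitions = sum(1 for a, b in zip(flags, flags[1:]) if a + b == 1)
    return transitions / 2


flags = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
print(count_events(flags))            # 1.0
print(count_events([0, 0, 1, 1, 1]))  # 0.5, the series ends mid-event
```

The same half count appears if a series starts mid-event, since then only the end is picked up.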
Putting this all together (and you can probably pipe to combine some of this), here is the whole solution, five command lines chaining ten cdo operators in total:
cdo gec,5 -runsum,5 in.nc out5.nc
cdo gec,10 -runsum,10 in.nc out10.nc
cdo eqc,1 -add out5.nc out10.nc out5-10.nc
cdo eqc,1 -runsum,2 out5-10.nc out_start_end.nc
cdo divc,2 -timsum out_start_end.nc number_of_events.nc
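For a sanity check on a single grid point, the target quantity can also be computed directly from run lengths. This is not the cdo method above, just a plain-Python cross-check; the function name and test series are made up. Runs of 10 days or longer are worth including in such a check, because days 5 to 9 of a long run also sum to 1 after the add step.

```python
from itertools import groupby


def count_runs_in_range(series, min_len=5, max_len=10):
    """Count runs of consecutive 1s whose length L satisfies min_len <= L < max_len."""
    count = 0
    for value, group in groupby(series):
        if value == 1:
            length = sum(1 for _ in group)
            if min_len <= length < max_len:
                count += 1
    return count


eight_day = [0, 0] + [1] * 8 + [0] * 4
twelve_day = [0, 0] + [1] * 12 + [0] * 4
print(count_runs_in_range(eight_day))   # 1
print(count_runs_in_range(twelve_day))  # 0
```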
Note 2: the runsum commands stamp each result with the mid-point of the window, but that is not important for this use case. If anyone also wants to use the outN.nc files to see when the event days occur, it is usual to move the time stamp to the end of the window using --timestat_date last; see this video for more details.
Note 3: if you sum the number of days within the events, you can divide this by the number of events to get the mean event length.
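As an illustration of the mean event length, here is a small sketch that works on one 0/1 series. The function name and the test series are hypothetical, and the sketch measures run lengths directly rather than going through the cdo output files.

```python
from itertools import groupby


def mean_event_length(series, min_len=5, max_len=10):
    """Mean length of runs of 1s with min_len <= length < max_len, or None if there are none."""
    lengths = []
    for value, group in groupby(series):
        run = len(list(group))
        if value == 1 and min_len <= run < max_len:
            lengths.append(run)
    if not lengths:
        return None
    # Total event days divided by the number of events.
    return sum(lengths) / len(lengths)


# Runs of 6, 8 and 3 days; only the first two qualify as events.
series = [0] + [1] * 6 + [0, 0] + [1] * 8 + [0] + [1] * 3 + [0]
print(mean_event_length(series))  # 7.0
```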

Related

Is there a way to track change in response over time and map that onto changes in response of another category in r?

(I'm SUPER new to coding in general so all suggestions are much appreciated.)
So I'm working with a data set that contains panel survey data from the same 8000 participants, surveyed 7 times over the course of the last decade. I currently have dummy variable forms for the answers I'm interested in, so now my data looks like this:
colour2011 colour2016 colour2018
1 1 0
0 0 0
0 1 1
1 0 0
1 1 1
and the other variable's data looks similar with column names being tied to the year the question was asked. Is there a way to not only show change of answer for both using ggplot2, but also track rate of change and display that visually by year?

Finding Correlations between data in dataframe (including binary)

I have a dataset called dolls.csv that I imported using
dolls <- read.csv("dolls.csv")
This is a snippet of the data
Name Review Year Strong Skinny Weak Fat Normal
Bell 3.5 1990 1 1 0 0 0
Jan 7.2 1997 0 0 1 0 1
Tweet 7.6 1987 1 1 0 0 0
Sall 9.5 2005 0 0 0 1 0
I am trying to run some preliminary analysis of this data. The Name is the name of the doll, the review is a rating 1-10, year is year made, and all values after that are binary where they are 1 if they possess a characteristic or 0 if they don't.
I ran
summary(dolls)
and got the header, means, mins and maxes of the values.
I am trying to see what the correlations are between the characteristics and the year or review rating (for example, to see whether certain dolls have really high ratings yet have unfavorable traits), but I am not sure how to construct charts or what functions to use in this case. I was considering some ANOVA tail testing for outliers and means of different values, but I am not sure how to compare values like this (in Python I'd run an if-then statement, but I don't know how to in R).
This is for a personal study I wanted to conduct and improve my R skills.
Thank you!

How to create an interval file defined by values from another file - for circos imaging of WGS data

I am trying to depict the whole-genome sequence (WGS) data of my parasite using the circos software.
One of the elements I would like to depict is the areas of the reference genome for which I do not have sequencing data from my parasite.
In order to do this, I have used Samtools to create an mpileup file, from which I have extracted the positions where the sequence depth = 0. I therefore have a file that looks like this:
$chromosome_name $chromosome_position $depth
chr_1 1 0
chr_1 2 0
chr_1 3 0
chr_2 67 0
chr_2 68 0
chr_2 1099 0
chr_2 1100 0
chr_2 1101 0
This means that there are 3 positions in chromosome 1 with no sequence data (depth = 0), namely positions 1, 2 and 3. For chromosome 2, the positions with no data are 67, 68, 1099, 1100 and 1101.
Because my files are enormous (up to 3 million lines) and a lot of the unsequenced positions come in intervals, I would like to create an interval file from the above data. Also, circos requires such an interval file in order to create tiles. I therefore need to create a new file from the above that looks like this:
$chromosome_name $start_pos $end_pos
chr_1 1 3
chr_2 67 68
chr_2 1099 1101
I have searched a bunch, but I have only found questions pertaining to grouping data by pre-defined intervals (e.g. group purchases occurring over a period of 6 months, patients by age etc).
So if anybody can help me out, I will be extremely happy!
Sidsel
Consider using bedtools. Specifically the bedtools merge sub-command:
http://bedtools.readthedocs.io/en/latest/content/tools/merge.html
From this page, it would seem to do what you want:
bedtools merge combines overlapping or “book-ended” features in an
interval file into a single feature which spans all of the combined
features.
Moreover, you can use the -d option to specify the maximum distance between features to be merged:
-d Maximum distance between features allowed for features to be merged. Default is 0. That is, overlapping and/or book-ended features
are merged.
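If you would rather not convert the file to an interval format first, the same collapsing can be sketched in a few lines of plain Python. This is not bedtools, just an illustration of the merging idea; the function name is made up, and the sketch assumes the rows are already sorted by chromosome and position, as in the question's example.

```python
def positions_to_intervals(rows):
    """Collapse sorted (chromosome, position) pairs into (chromosome, start, end) runs.

    A new interval starts whenever the chromosome changes or the position
    is not exactly one past the end of the current interval.
    """
    intervals = []
    for chrom, pos in rows:
        last = intervals[-1] if intervals else None
        if last and last[0] == chrom and pos == last[2] + 1:
            last[2] = pos  # extend the current interval
        else:
            intervals.append([chrom, pos, pos])
    return [tuple(interval) for interval in intervals]


rows = [
    ("chr_1", 1), ("chr_1", 2), ("chr_1", 3),
    ("chr_2", 67), ("chr_2", 68),
    ("chr_2", 1099), ("chr_2", 1100), ("chr_2", 1101),
]
for chrom, start, end in positions_to_intervals(rows):
    print(chrom, start, end)
# chr_1 1 3
# chr_2 67 68
# chr_2 1099 1101
```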

create an index for aggregating daily data to match periodic data

I have daily measurements prec.d and periodic measurements prec.p. The periodic measurements (3-12 days apart) are roughly the sum of the daily measurements between the start and end dates, and I need to compare prec in the two data frames. I have so far manually created an index week that represents the time span of each periodic measurement, but it would be great to make week in a reproducible fashion.
data.frame prec.d
day week prec
6/20/2013 1 0
6/21/2013 1 0
6/22/2013 1 0
6/23/2013 1 0
6/24/2013 1 41.402
6/25/2013 1 2.794
6/26/2013 1 6.096
6/27/2013 2 0.508
6/28/2013 2 0
6/29/2013 2 0
6/30/2013 2 2.54
7/1/2013 2 18.034
7/2/2013 2 4.064
And data.frame prec.p
start end week prec1 prec2 prec3
6/20/2013 6/26/2013 1 50.28 31.78042615 42.76461716
6/27/2013 7/2/2013 2 25.1 15.70964247 20.49507586
I would like to create the week field automatically, so that it spans from start to end in prec.p. Then I can aggregate by week to make prec in both data frames match.
Introduce a YYYYWW field in both data frames, where WW stands for the week number; that will give you a common index. For example:
x <- as.Date(runif(100)*100, origin="1970-01-01")
yyyyww <- strftime(x, format="%Y%U")
yyyyww
Or take a look at package quantmod, if I remember correctly, it has functions for time frame conversion.
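Because the periods in the question are 3-12 days long rather than calendar weeks, another option is to use the start dates in prec.p as breakpoints. The sketch below shows that idea in plain Python rather than R. It is a swapped-in approach, not the YYYYWW method above; the function name is hypothetical, and the sketch assumes the periods are contiguous and sorted by start date.

```python
from bisect import bisect_right
from datetime import date


def period_index(day, starts):
    """1-based index of the period whose start date is the latest one on or before `day`.

    Returns 0 for days before the first period. Assumes `starts` is sorted.
    """
    return bisect_right(starts, day)


starts = [date(2013, 6, 20), date(2013, 6, 27)]  # start column of prec.p
daily = [  # (day, prec) rows of prec.d
    (date(2013, 6, 20), 0.0), (date(2013, 6, 21), 0.0), (date(2013, 6, 22), 0.0),
    (date(2013, 6, 23), 0.0), (date(2013, 6, 24), 41.402), (date(2013, 6, 25), 2.794),
    (date(2013, 6, 26), 6.096), (date(2013, 6, 27), 0.508), (date(2013, 6, 28), 0.0),
    (date(2013, 6, 29), 0.0), (date(2013, 6, 30), 2.54), (date(2013, 7, 1), 18.034),
    (date(2013, 7, 2), 4.064),
]

totals = {}
for day, prec in daily:
    week = period_index(day, starts)
    totals[week] = totals.get(week, 0.0) + prec

print({week: round(total, 3) for week, total in totals.items()})
# {1: 50.292, 2: 25.146}
```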

How to enter censored data into R's survival model?

I'm attempting to model customer lifetimes on subscriptions. As the data is censored I'll be using R's survival package to create a survival curve.
The original subscriptions dataset looks like this..
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
Which I manipulate to look like this..
id tenure_in_months status(1=cancelled, 0=active)
1 2 1
2 ? 0
3 1 1
..in order to feed the survival model:
obj <- with(subscriptions, Surv(time=tenure_in_months, event=status, type="right"))
fit <- survfit(obj~1, data=subscriptions)
plot(fit)
What shall I put in the tenure_in_months variable for the censored cases, i.e. the cases where the subscription is still active today: should it be the tenure up until today, or should it be NA?
First, I should say that I disagree with the previous answer. For a subscription that is still active today, the tenure should be recorded neither as the tenure up until today nor as NA. What do we know exactly about those subscriptions? We know they have lasted up until today. In other words, although we don't know exactly how long tenure_in_months will be for those observations, we know it is longer than the tenure accrued up to today.
This situation is known as right censoring in survival analysis. See: http://en.wikipedia.org/wiki/Censoring_%28statistics%29
So your data would need to translate from
id start_date end_date
1 2013-06-01 2013-08-25
2 2013-06-01 NA
3 2013-08-01 2013-09-12
to:
id t1 t2 status(3=interval_censored)
1 2 2 3
2 3 NA 3
3 1 1 3
Then you will need to change your R surv object, from:
Surv(time=tenure_in_months, event=status, type="right")
to:
Surv(t1, t2, event=status, type="interval2")
See http://stat.ethz.ch/R-manual/R-devel/library/survival/html/Surv.html for more syntax details. A very good summary of computational details can be found: http://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_lifereg_sect018.htm
Interval censored data can be represented in two ways. For the first use type = interval and the codes shown above. In that usage the value of the time2 argument is ignored unless event=3. The second approach is to think of each observation as a time interval with (-infinity, t) for left censored, (t, infinity) for right censored, (t,t) for exact and (t1, t2) for an interval. This is the approach used for type = interval2, with NA taking the place of infinity. It has proven to be the more useful.
If a missing end date means that the subscription is still active, then you need to take the time until the current date as censor date.
NA won't work with the survival object; I think those cases will be omitted. That is not what you want, because these cases contain important information about the survival.
SQL code to get the time till event (use in SELECT part of query)
DATEDIFF(M,start_date,ISNULL(end_date,GETDATE())) AS tenure_in_months
BTW:
I would use the difference in days for my analysis. It does not make sense to round the time off to months.
You need to know the date the data was collected. The tenure_in_months for id 2 should then be this date minus 2013-06-01.
Otherwise I believe your encoding of the data is correct. The status of 0 for id 2 indicates that it is right-censored (meaning we have a lower bound on its lifetime, but not an upper bound).
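The encoding described here (time up to the collection date, status 0 for active subscriptions) can be sketched as follows, in plain Python rather than R or SQL. The function name and the collection date of 2013-10-01 are assumptions made up for the example; use the real date your data was extracted.

```python
from datetime import date


def tenure_and_status(start, end, censor_date):
    """Return (tenure_in_days, status) with status 1 = cancelled, 0 = right-censored.

    For a subscription that is still active (end is None), the tenure runs
    up to the data-collection date.
    """
    if end is None:
        return (censor_date - start).days, 0
    return (end - start).days, 1


censor_date = date(2013, 10, 1)  # hypothetical data-collection date
subscriptions = [
    (1, date(2013, 6, 1), date(2013, 8, 25)),
    (2, date(2013, 6, 1), None),
    (3, date(2013, 8, 1), date(2013, 9, 12)),
]
for sub_id, start, end in subscriptions:
    print(sub_id, *tenure_and_status(start, end, censor_date))
# 1 85 1
# 2 122 0
# 3 42 1
```

The resulting tenure and status columns are the kind of input a right-censored survival object expects.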

Resources