Transition matrix based on amounts - r

I have the following dataset:
library(data.table)

example_data <- data.table(ID = c(1,2,3,4,4,6,1,2,3,4,4,5,1,2,3,4),
                           ContractID = c(1,2,3,4,5,6,1,2,3,4,5,7,10,11,12,13),
                           Day = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3),
                           Rating = c("A","B","A","A","A","C","A","A","C","A","D","C","A","B","C","D"),
                           Amount = c(79,71,74,10,20,30,18,91,33,59,90,32,5,71,79,60))
What I would like to create is a transition matrix from day 1 to day 2 and from day 2 to day 3.
It would look something like this:
Of course, the labels Day 1 and Day 2 are for illustrative purposes only.
For example, the cell (A, C) has the value -41, which is 33 - 74: the contract moved from rating A to rating C and its amount decreased from day 1 to day 2.
Observe that there are some small caveats:
We consider only the ID and ContractID combinations that are present on both Day 1 and Day 2.
A record is identified by ID and ContractID together, not by ID alone.
I was wondering how one would be able to create such a table, without actually splitting it into two datasets, for the two days.
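One possible approach, sketched here in base R rather than data.table, is to join the table to a copy of itself whose Day is shifted back by one, so that each row meets its own next-day row. The names next_day, joined and change are made up for this illustration, and summing the amount changes per (from rating, to rating) cell is an assumption about the intended matrix, based on the -41 example above.

```r
example_data <- data.frame(
  ID         = c(1,2,3,4,4,6,1,2,3,4,4,5,1,2,3,4),
  ContractID = c(1,2,3,4,5,6,1,2,3,4,5,7,10,11,12,13),
  Day        = c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3),
  Rating     = c("A","B","A","A","A","C","A","A","C","A","D","C","A","B","C","D"),
  Amount     = c(79,71,74,10,20,30,18,91,33,59,90,32,5,71,79,60),
  stringsAsFactors = FALSE
)

# Shift Day back by one so that a day-2 row lines up with its day-1 row.
next_day <- transform(example_data, Day = Day - 1)

# Inner join: keeps only (ID, ContractID) pairs present on both days.
# After the join, Day is the "from" day.
joined <- merge(example_data, next_day,
                by = c("ID", "ContractID", "Day"),
                suffixes = c("_from", "_to"))
joined$change <- joined$Amount_to - joined$Amount_from

# Assumed aggregation: sum of amount changes per (from, to) rating cell.
day1_to_day2 <- xtabs(change ~ Rating_from + Rating_to,
                      data = joined[joined$Day == 1, ])
day1_to_day2
```

With the sample data, no (ID, ContractID) pair from day 2 appears again on day 3, so the day 2 to day 3 matrix would be empty.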

Related

Turn date column into days from beginning integer Rstudio

Hi everyone. I am currently plotting time series graphs in RStudio. I have created a nice time series graph, but I would like the x axis to show an integer counting the days from the starting date of the graph instead of the date.
Time Series Graph
For example, instead of seeing 01/01/2021 I want to see day 100, as in it's the 100th day of recording data.
Do I need to create another column converting all the days into a numerical value and then plot this?
If so, how do I do this? At the moment all I have is a Date column and the column with the value I am plotting.
Column Data
Thanks
Assuming you want 01/01/2021 as the first day, you can use it as a reference, calculate the number of days passed since the first day of recording, and plot that number instead of the date.
Not sure what your data frame looks like so hopefully this helps.
Using lubridate
library(lubridate)
df
Date
1 01/01/2021
2 02/01/2021
3 03/01/2021
4 04/01/2021
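# yday() is the day of the year, so this only counts from the start while all dates fall in the same year
df$days <- yday(dmy(df$Date)) -1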
Output:
Date days
1 01/01/2021 0
2 02/01/2021 1
3 03/01/2021 2
4 04/01/2021 3
Which is indeed a numeric
str(df$days)
num [1:4] 0 1 2 3
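Because yday() restarts at 1 every January, a recording that spans a year boundary would need a different calculation. A minimal sketch, assuming day/month/year strings as above and using base R's as.Date() in place of lubridate (the sample dates are made up for illustration):

```r
# Hypothetical data crossing a year boundary
df <- data.frame(Date = c("30/12/2020", "31/12/2020", "01/01/2021", "02/01/2021"))

parsed <- as.Date(df$Date, format = "%d/%m/%Y")

# Days elapsed since the earliest date in the column, as a plain number
df$days <- as.numeric(parsed - min(parsed))
df
```

The days column keeps increasing across the new year, and it can be used directly as a numeric x axis.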
This is a simulation of dates:
date.simulation = as.Date(1:100, "2001-01-01")
factor(date.simulation-min(date.simulation))
You just subtract the minimum date from each date. You need the result as a factor for plotting purposes.

Plot data over time in R

I'm working with a dataframe including the columns 'timestamp' and 'amount'. The data can be produced like this
sample_size <- 40
start_date = as.POSIXct("2020-01-01 00:00")
end_date = as.POSIXct("2020-01-03 00:00")
timestamps <- as.POSIXct(sample(seq(start_date, end_date, by=60), sample_size))
amount <- rpois(sample_size, 5)
df <- data.frame(timestamps=timestamps, amount=amount)
Now I'd like to plot the sum of the amount entries for some timeframe (like every hour, 30 min, 20 min). The final plot would look like a histogram of the timestamps, but instead of counting how many timestamps fall into each timeframe it should show the total amount that falls into it.
How can I approach this? I could create an extra vector with the amount of each timeframe, but don't know how to proceed.
I'd also like to add a feature to reduce by hour, such that just one day is plotted (notice the range between start_date and end_date is two days) and in each timeframe (let's say every hour) the amount of data located in this hour is plotted. In this case the data
2020-01-01 13:03:00 5
2020-01-02 13:21:00 10
2020-01-02 13:38:00 1
2020-01-01 13:14:00 3
would produce a bar of height sum(5, 10, 1, 3) = 19 in the timeframe 13:00-14:00. How can I implement the plotting to easily switch between these two modes (plot days/plot just one day and reduce)?
EDIT: Following the advice of @Gregor Thomas I added a grouping column like this:
df$time_group <- lubridate::floor_date(df$timestamps, unit="20 minutes")
Now I'm wondering how to ignore the dates and thus reduce by 20-minute frame (independent of date).
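One way to ignore the date, sketched in base R under the assumption that the bin width divides the day evenly, is to group by minute of day instead of by full timestamp. The function name sum_by_time_of_day and its bin_minutes parameter are made up for this illustration.

```r
# The four rows from the question; the UTC time zone is an assumption
df <- data.frame(
  timestamps = as.POSIXct(c("2020-01-01 13:03:00", "2020-01-02 13:21:00",
                            "2020-01-02 13:38:00", "2020-01-01 13:14:00"),
                          tz = "UTC"),
  amount = c(5, 10, 1, 3)
)

# Hypothetical helper: total amount per time-of-day bin, independent of date
sum_by_time_of_day <- function(df, bin_minutes = 60) {
  minute_of_day <- as.integer(format(df$timestamps, "%H")) * 60L +
    as.integer(format(df$timestamps, "%M"))
  # Start of each bin, in minutes after midnight
  df$bin_start <- (minute_of_day %/% bin_minutes) * bin_minutes
  aggregate(amount ~ bin_start, data = df, FUN = sum)
}

hourly <- sum_by_time_of_day(df, 60)   # one bin starting at minute 780 (13:00)
fine   <- sum_by_time_of_day(df, 20)   # bins starting at 13:00 and 13:20
```

The result could then be drawn with, for example, barplot(hourly$amount, names.arg = hourly$bin_start). For the mode that keeps the dates, the same aggregate() call on the time_group column from the edit above should work in the same way.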

How to match dates in 2 data frames in R, then sum specific range of values up to that date?

I have two data frames: rainfall data collected daily and nitrate concentrations of water samples collected irregularly, approximately once a month. I would like to create a vector of values for each nitrate concentration that is the sum of the previous 5 days' rainfall. Basically, I need to match the nitrate date with the rain date, sum the previous 5 days' rainfall, then print the sum with the nitrate data.
I think I need to either make a function, a for loop, or use tapply to do this, but I don't know how. I'm not an expert at any of those, though I've used them in simple cases. I've searched for similar posts, but none get at this exactly. This one deals with summing by factor groups. This one deals with summing each possible pair of rows. This one deals with summing by aggregate.
Here are 2 example data frames:
# rainfall df
mm<- c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0)
date<- c(1:15)
rain <- data.frame(cbind(mm, date))
# b/c sums of rainfall depend on correct chronological order, make sure the data are in order by date.
rain <- rain[order(rain$date), ]
# nitrate df
nconc <- c(15, 12, 14, 20, 8.5) # nitrate concentration
ndate<- c(6,8,11,13,14)
nitrate <- data.frame(cbind(nconc, ndate))
I would like to have a way of finding the matching rainfall date for each nitrate measurement, such as:
j <- match(nitrate$ndate[i], rain$date)
(Note: Will match work with as.Date dates?) And then sum the preceding 5 days' rainfall (not including the measurement date), such as:
sum(rain$mm[(j-5):(j-1)])
And then print the sum in a new column in nitrate:
print(nitrate$mm_sum[i])
To make sure it's clear what result I'm looking for, here's how to do the calculation 'by hand'. The first nitrate concentration was collected on day 6, so the sum of rainfall on days 1-5 is 5mm.
Many thanks in advance.
You were more or less there!
nitrate$prev_five_rainfall = NA
for (i in seq_along(nitrate$ndate)) {
  day = nitrate$ndate[i]
  # rows (day-5) to (day-1) are the five days before the measurement
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
}
Step by step explanation:
Initialize empty result column:
nitrate$prev_five_rainfall = NA
For each line in the nitrate df: (i = 1,2,3,4,5)
for (i in seq_along(nitrate$ndate)) {
Grab the day we want the final result for:
  day = nitrate$ndate[i]
Take the rainfall sum over the five preceding days and put it in the results column:
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)])
Close the for loop :)
}
Disclaimer: This answer is basic in that:
It will break, or silently sum fewer than five days, if nitrate's ndate < 6
It will be incorrect if some dates are missing in the rain dataframe
It will be slow on larger data
As you get more experience with R, you might use data manipulation packages like dplyr or data.table for these types of manipulations.
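As a rough sketch of how the first two limitations could be avoided in base R, the rainfall rows can be selected by comparing dates instead of by row position. This is an alternative to the loop above, not the original answer's method, and the name in_window is made up for the illustration.

```r
rain <- data.frame(mm = c(0,0,0,0,5, 0,0,2,0,0, 10,0,0,0,0), date = 1:15)
nitrate <- data.frame(nconc = c(15, 12, 14, 20, 8.5), ndate = c(6, 8, 11, 13, 14))

# For each nitrate date, sum the rain recorded on the five preceding days.
# Selecting by date value means missing rain dates simply contribute nothing,
# and early nitrate dates sum whatever days exist instead of raising an error.
nitrate$prev_five_rainfall <- sapply(nitrate$ndate, function(day) {
  in_window <- rain$date >= day - 5 & rain$date <= day - 1
  sum(rain$mm[in_window])
})
nitrate
```

Since the window is defined by comparison and subtraction, the same idea should also apply when both date columns are of class Date, though that is not tested here.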
@nelsonauner's answer does all the heavy lifting. One thing to note, though: in my actual data the dates are not numerical like they are in the example above. They are dates listed as MM/DD/YYYY and converted with the appropriate as.Date(nitrate$date, "%m/%d/%Y").
I found that the for loop above gave me all zeros for nitrate$prev_five_rainfall and I suspected it was a problem with the dates.
So I changed my dates in both data sets to numerical using the difference in number of days between a common start date and the recorded date, so that the for loop would look for a matching number of days in each data frame rather than a date. First, make a column of the start date using rep_len() and format it:
nitrate$startdate <- rep_len("01/01/1980", nrow(nitrate))
nitrate$startdate <- as.Date(nitrate$startdate, "%m/%d/%Y")
Then, calculate the difference using difftime():
nitrate$diffdays <- as.numeric(difftime(nitrate$date, nitrate$startdate, units="days"))
Do the same for the rain data frame. Finally, the for loop looks like this:
nitrate$prev_five_rainfall = NA
for (i in 1:length(nitrate$diffdays)) {
  day = nitrate$diffdays[i]
  nitrate$prev_five_rainfall[i] = sum(rain$mm[(day-5):(day-1)]) # 5 days
}

Calculate Running Difference in Dates as New Dataframe Column

I've searched for several days and am still stumped.
Given a dataset defined by the following:
ids = c("a","b","c")
dates = c(as.Date("2015-01-01"), as.Date("2015-02-01"), as.Date("2015-02-15"))
test = data.frame(ids, dates)
I am trying to dynamically add new columns to the data frame whose values will be the difference between the column date (2015-03-01) and the value in the date column. I would expect the result would look like the following, but with a better column name:
d20150301 = c(59, 28, 14)
result = data.frame(ids, dates, d20150301)
Many thanks in advance.
You can subtract a vector of dates from a single date, so
test$d2015_03_01 <- as.Date('2015-03-01')-test$dates
makes test look like
> test
  ids      dates d2015_03_01
1   a 2015-01-01     59 days
2   b 2015-02-01     28 days
3   c 2015-02-15     14 days
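To add such columns dynamically for several reference dates, the column name can be built from each date with format(). A minimal sketch; the vector ref_dates and the second date in it are assumptions added for illustration:

```r
test <- data.frame(ids = c("a", "b", "c"),
                   dates = as.Date(c("2015-01-01", "2015-02-01", "2015-02-15")))

# Hypothetical list of reference dates to compare against
ref_dates <- as.Date(c("2015-03-01", "2015-04-01"))

# Loop over indices: iterating over a Date vector directly drops its class
for (i in seq_along(ref_dates)) {
  ref <- ref_dates[i]
  col_name <- format(ref, "d%Y%m%d")            # e.g. "d20150301"
  test[[col_name]] <- as.numeric(ref - test$dates)  # plain number of days
}
test
```

Wrapping the difference in as.numeric() stores plain numbers instead of difftime values, which matches the expected result in the question.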

Combine different rows

Consider a dataframe of the form
                 id      start        end
2009.36220 65693384 2010-03-20 2010-07-04
2010.36221 65693592 2010-01-01 2010-12-31
2010.36222 65698250 2010-01-01 2010-12-31
2010.36223 65704349 2010-01-01 2010-12-31
where I have around 20k observations per year for 15 years.
I need to combine the rows by the following rule:
if for the same id, there exists a record that ends at the last day of the year
and a record that starts at the first day of the following year
then
- create a new row with the start value of the earlier row and the end value of the later row
- and delete the two original rows
Given that the same id can appear several times (since I have more than 2 years), I will then just run the script several times to combine ids that have, for example, 4 rows in consecutive years that satisfy the condition.
The Question
I'd know how to program this in an iterative manner, where I would go over every single row and check whether there's a row somewhere in the whole data frame with a start date next year that corresponds to the end date this year, but that's extremely slow and unsatisfying from an aesthetic point of view. I'm a complete beginner with R, so I have no clue where to even look to do such a thing more efficiently. I'm open to any suggestion.
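Warning: growing a data frame with rbind() inside a loop is slow and generally discouraged, but this is the easiest solution I could think of. Note that it collapses all rows of an id into a single row from the earliest start to the latest end, without checking that the rows are contiguous. Let df be your data.
df$start = as.POSIXct(df$start)
df$end = as.POSIXct(df$end)
df2 = data.frame()
for (i in unique(df$id)){
  s = subset(df, id==i)
  # build a one-row data frame so the dates keep their class
  df2 = rbind(df2, data.frame(id = i, start = min(s$start), end = max(s$end)))
}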
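A sketch of a vectorised alternative that follows the stated rule more closely: sort by id and start, flag each row that continues the previous row of the same id (previous end on 31 December, start on the following day), and number the resulting runs with cumsum(). The sample rows and the names continues and spell are made up for this illustration, and Date columns are assumed.

```r
df <- data.frame(
  id    = c(1, 1, 1, 2, 2),
  start = as.Date(c("2010-03-20", "2011-01-01", "2012-01-01",
                    "2010-01-01", "2012-01-01")),
  end   = as.Date(c("2010-12-31", "2011-12-31", "2012-07-04",
                    "2010-12-31", "2012-12-31"))
)

df <- df[order(df$id, df$start), ]
n  <- nrow(df)

# Values of the previous row (NA for the first row)
prev_id  <- c(NA, df$id[-n])
prev_end <- c(as.Date(NA), df$end[-n])

# TRUE when a row continues the previous row of the same id across a year end
continues <- !is.na(prev_id) & df$id == prev_id &
  format(prev_end, "%m-%d") == "12-31" & df$start == prev_end + 1

# Every row that does not continue its predecessor opens a new spell
df$spell <- cumsum(!continues)

# Collapse each spell to one row; a single rbind call, not a growing loop
merged <- do.call(rbind, lapply(split(df, df$spell), function(s) {
  data.frame(id = s$id[1], start = min(s$start), end = max(s$end))
}))
rownames(merged) <- NULL
merged
```

Chains of three or more consecutive years are merged in one pass, so the script would not need to be run repeatedly.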
