I have an R question concerning data wrangling. A sample data set, which I will use below, is downloadable online:
x<- read.csv("http://mgimond.github.io/ES218/Data/CO2.csv")
The data table is shown in the attached image.
I want to create a new column, let's say "time_since". This column would look at the "Average" column and calculate the time (in this case months) for which "Average" has been less than 300. In this screenshot all values are >300, so the value would be "0", but the first month with a value below 300 would be "1" (representing one month under 300). If the following months are still under 300, the count would keep increasing with each month that goes by, but as soon as the value goes back above 300 it would reset.
Basically, it would be a function that calculates the time elapsed since a condition was met, and restarts the count when the condition is broken, across dates.
I apologize if I worded it a bit confusingly, but hopefully the message comes across.
Maybe you can try :
library(dplyr)
x %>%
  group_by(grp = cumsum(as.integer(Average > 300))) %>%
  mutate(time_since = row_number()) %>%
  ungroup() -> result
Just to show you one excerpt of output where time_since > 1.
result %>% filter(grp == 61)
# Year Month Average Interpolated Trend Daily_mean grp time_since
# <int> <int> <dbl> <dbl> <dbl> <int> <int> <int>
#1 1964 1 320. 320. 320. -1 61 1
#2 1964 2 -100. 320. 320. -1 61 2
#3 1964 3 -100. 321. 320. -1 61 3
#4 1964 4 -100. 322. 319. -1 61 4
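If you want months where Average is still at or above 300 to show 0 (as described in the question) rather than counting from 1, one possible tweak, sketched here and not part of the original answer, is to count only the below-300 rows within each run:
library(dplyr)
x %>%
  group_by(grp = cumsum(Average >= 300)) %>%      # a new run starts at every month at or above 300
  mutate(time_since = cumsum(Average < 300)) %>%  # 0 while >= 300, then 1, 2, ... for months below 300
  ungroup()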
Here is a data.table approach. For this example, time_since is a running count of rows that restarts whenever the Average variable drops below 315: consecutive months at or above 315 count up, and each month below 315 resets the count to 1.
x<- read.csv("http://mgimond.github.io/ES218/Data/CO2.csv")
library(data.table)
setDT(x)
x[, ':='(time_since = seq_len(.N)), keyby = .(cumsum(Average < 315))][1:10, ]
#> Year Month Average Interpolated Trend Daily_mean time_since
#> 1: 1959 1 315.62 315.62 315.70 -1 1
#> 2: 1959 2 316.38 316.38 315.88 -1 2
#> 3: 1959 3 316.71 316.71 315.62 -1 3
#> 4: 1959 4 317.72 317.72 315.56 -1 4
#> 5: 1959 5 318.29 318.29 315.50 -1 5
#> 6: 1959 6 318.15 318.15 315.92 -1 6
#> 7: 1959 7 316.54 316.54 315.66 -1 7
#> 8: 1959 8 314.80 314.80 315.81 -1 1
#> 9: 1959 9 313.84 313.84 316.55 -1 1
#> 10: 1959 10 313.26 313.26 316.19 -1 1
Created on 2021-03-17 by the reprex package (v0.3.0)
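If you instead want the question's behaviour with this threshold (0 while the condition is not met, then 1, 2, ... for consecutive rows below 315), a hedged data.table variant, an assumption on top of the answer above rather than part of it, is:
library(data.table)
setDT(x)
# rows at or above 315 show 0; consecutive rows below 315 count up 1, 2, ...
x[, time_since := cumsum(Average < 315), by = .(grp = cumsum(Average >= 315))][1:10, ]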
I got the data from MySQL and I'm trying to visualize it and uncover some answers, using R for the statistics.
The final product is the % discount for each price change (= row).
Here is an example of my dataset.
itemId pricehis timestamp
1 69295477 1290 2022-04-12 04:42:53
2 69295624 1145 2022-04-12 04:42:53
3 69296136 3609 2022-04-12 04:42:54
4 69296607 855 2022-04-12 04:42:53
5 69295291 1000 2022-04-12 04:42:50
6 69295475 4188 2022-04-12 04:42:52
7 69295614 1145 2022-04-12 04:42:51
8 69295622 1290 2022-04-12 04:42:50
9 69295692 3609 2022-04-12 04:42:49
10 69295917 1725 2022-04-12 04:42:48
11 69296090 2449 2022-04-12 04:42:53
12 69296653 1145 2022-04-12 04:42:51
13 69296657 5638 2022-04-12 04:42:48
14 69296661 1725 2022-04-12 04:42:51
15 69296696 710 2022-04-12 04:42:51
I've been stuck at one part of the calculation: the maximum value for each itemId over the last 6 months.
In the dataset there are rows for a specific itemId with different pricehis values and different timestamps. I need to find the max value for a given row, looking back no more than 6 months.
The formula for calculating the desired discount is:
Discount grouped by itemId = 1 - pricehis / max(pricehis in the last 6 months)
At this moment I'm unable to solve the second part - pricehis in the last 6 months.
- I need a new column with the maximum 'pricehis' in the last 6 months for each itemId. This could also be called an interval (rolling) maximum.
I can group it by the itemId, but I can't figure out how to add the condition on 6 months max.
Any tips on how to get this?
I like slider::slide_index_dbl for this sort of thing. Here's some fake data chosen to demonstrate the 6mo window:
data.frame(itemId = rep(1:2, each = 6),
           price = floor(100*cos(0:11)^2),
           timestamp = as.Date("2000-01-01") + 50*(0:11)) -> df
We can start with df, group it by itemId, and then apply the window function. (Note that slider requires the data to be sorted by date within each group.)
library(dplyr)
library(lubridate) # for `%m-%`, to get sliding months (harder than it sounds!)
df %>%
  group_by(itemId) %>%
  mutate(max_6mo = slider::slide_index_dbl(.x = price,     # based on price...
                                           .i = timestamp, # and timestamp...
                                           .f = max,       # what's the max...
                                           .before = ~.x %m-% months(6))) %>% # over the last 6mo
  mutate(discount = 1 - price / max_6mo) %>% # use that to calc discount
  ungroup()
Result
# A tibble: 12 × 5
itemId price timestamp max_6mo discount
<int> <dbl> <date> <dbl> <dbl>
1 1 100 2000-01-01 100 0
2 1 29 2000-02-20 100 0.71
3 1 17 2000-04-10 100 0.83
4 1 98 2000-05-30 100 0.0200
5 1 42 2000-07-19 98 0.571 # new max since >6mo since 100
6 1 8 2000-09-07 98 0.918
7 2 92 2000-10-27 92 0
8 2 56 2000-12-16 92 0.391
9 2 2 2001-02-04 92 0.978
10 2 83 2001-03-26 92 0.0978
11 2 70 2001-05-15 83 0.157 # new max since >6mo since 92
12 2 0 2001-07-04 83 1
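To apply the same idea to the original data, here is a sketch; it assumes the posted columns itemId, pricehis, and timestamp, that the data frame is called prices_df (a made-up name), and that the timestamps parse with lubridate::ymd_hms():
library(dplyr)
library(lubridate)
prices_df %>%
  mutate(timestamp = ymd_hms(timestamp)) %>%   # parse the character timestamps
  group_by(itemId) %>%
  arrange(timestamp, .by_group = TRUE) %>%     # slider needs the index sorted within each group
  mutate(max_6mo = slider::slide_index_dbl(pricehis, timestamp, max,
                                           .before = ~.x %m-% months(6)),
         discount = 1 - pricehis / max_6mo) %>%
  ungroup()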
I have a data series of daily snow depth values over a 60-year period. I would like to see the number of days with a snow depth higher than 30 cm for each season, for example from July 1980 to June 1981. What would the code for this look like? I know how I could count the days above 30 cm for a single season, but not how to write code that calculates it for all seasons.
I have uploaded my dataframe on wetransfer: Dataframe
Thank you so much for your help in advance.
Pernilla
Something like this would work
library(dplyr)
library(lubridate)
df <- read.csv('BayrischerWald_Brennes_SH_daily_merged.txt', sep=';')
df_season <- df %>%
  # shift each date back 181 days so that a July-June season falls within one
  # calendar year, then take that year as the season label
  mutate(season = (Day %>% ymd() - days(181)) %>% floor_date("year") %>% year())
df_group_by_season <- df_season %>%
  filter(!is.na(SHincm)) %>%                       # drop days with missing snow depth
  group_by(season) %>%
  summarize(days_above_30 = sum(SHincm > 30)) %>%  # days with more than 30 cm of snow
  ungroup()
df_group_by_season
#> # A tibble: 61 × 2
#> season days_above_30
#> <dbl> <int>
#> 1 1961 1
#> 2 1962 0
#> 3 1963 0
#> 4 1964 0
#> 5 1965 0
#> 6 1966 0
#> 7 1967 129
#> 8 1968 60
#> 9 1969 107
#> 10 1970 43
#> # … with 51 more rows
Created on 2022-01-15 by the reprex package (v2.0.1)
Here is an approach using the aggregate() function. After reading the data, convert the Day field to a Date object and get rid of the rows with missing dates:
snow <- read.table("BayrischerWald_Brennes_SH_daily_merged.txt", header=TRUE, sep=";")
snow$Day <- as.Date(snow$Day)
str(snow)
# 'data.frame': 51606 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
snow <- snow[!is.na(snow$Day), ]
str(snow)
# 'data.frame': 21886 obs. of 2 variables:
# $ Day : Date, format: "1961-11-01" "1961-11-02" "1961-11-03" "1961-11-04" ...
# $ SHincm: int 0 0 0 0 2 9 19 22 15 5 ...
Notice more than half of your data has missing values for the date. Now we need to divide the data by ski season:
brks <- as.Date(paste(1961:2022, "07-01", sep="-"))
lbls <- paste(1961:2021, 1962:2022, sep="/")
snow$Season <- cut(snow$Day, breaks=brks, labels=lbls)
Now we use aggregate() to get the number of days with over 30 cm of snow:
days30cm <- aggregate(SHincm~Season, snow, subset=snow$SHincm > 30, length)
colnames(days30cm)[2] <- "Over30cm"
head(days30cm, 10)
# Season Over30cm
# 1 1961/1962 1
# 2 1967/1968 129
# 3 1968/1969 60
# 4 1969/1970 107
# 5 1970/1971 43
# 6 1972/1973 101
# 7 1973/1974 119
# 8 1974/1975 188
# 9 1975/1976 126
# 10 1976/1977 112
In addition, you can get other statistics such as the maximum snow of the season or the total cm of snow:
maxsnow <- aggregate(SHincm~Season, snow, max)
totalsnow <- aggregate(SHincm~Season, snow, sum)
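If you would like all three statistics in a single table, a sketch (assuming the days30cm, maxsnow, and totalsnow objects created above; the merged column names here are illustrative) is to merge them by season. Note that seasons with no days over 30 cm will show NA in that column:
season_stats <- Reduce(function(a, b) merge(a, b, by = "Season", all = TRUE),
                       list(days30cm, maxsnow, totalsnow))
names(season_stats) <- c("Season", "Over30cm", "MaxDepthCm", "TotalCm")
head(season_stats)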
I have event log data. For a reproducible example, let's use the data from eventdataR:
eventdataR::patients
## look at patient 1 sequence
eventdataR::patients %>% dplyr::filter(patient == '1')
# A tibble: 12 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 1 r1 1 start 2017-01-02 11:41:53 1
2 Triage and A~ 1 r2 501 start 2017-01-02 12:40:20 2
3 Blood test 1 r3 1001 start 2017-01-05 08:59:04 3
4 MRI SCAN 1 r4 1238 start 2017-01-05 21:37:12 4
5 Discuss Resu~ 1 r6 1735 start 2017-01-07 07:57:49 5
6 Check-out 1 r7 2230 start 2017-01-09 17:09:43 6
7 Registration 1 r1 1 complete 2017-01-02 12:40:20 7
8 Triage and A~ 1 r2 501 complete 2017-01-02 22:32:25 8
9 Blood test 1 r3 1001 complete 2017-01-05 14:34:27 9
10 MRI SCAN 1 r4 1238 complete 2017-01-06 01:54:23 10
11 Discuss Resu~ 1 r6 1735 complete 2017-01-07 10:18:08 11
12 Check-out 1 r7 2230 complete 2017-01-09 19:45:45 12
In the above example, we can see the sequence of handling steps for patient 1 over a period of time. We can imagine that different patients would have different sequences or would go through a different number of steps.
Now let's say I'm interested in a specific sequence and want to know which patients went through it. How can I filter this dataset by that sequence to find out who these patients are?
The traces function from the edeaR library can help me identify the unique sequences and their frequencies:
patients %>% traces
# A tibble: 7 x 3
trace absolute_frequen~ relative_frequen~
<chr> <int> <dbl>
1 Registration,Triage and Assessment,X-Ray,Discuss R~ 258 0.516
2 Registration,Triage and Assessment,Blood test,MRI ~ 234 0.468
3 Registration,Triage and Assessment,Blood test,MRI ~ 2 0.004
4 Registration,Triage and Assessment,X-Ray 2 0.004
5 Registration,Triage and Assessment 2 0.004
6 Registration,Triage and Assessment,X-Ray,Discuss R~ 1 0.002
7 Registration,Triage and Assessment,Blood test 1 0.002
Let's say I'm interested in the sequence from row 5, that is, patients who went through exactly the sequence Registration -> Triage and Assessment; there are 2 patients with this sequence.
It seems the library doesn't provide a ready-made function to extract this. At least from this doc page, https://www.bupar.net/subsetting.html#trace_length, it's not available.
Basically, given an exhaustive sequence of activities, return all the patients who went through exactly this sequence.
In fact, if I can rebuild the trace and map it back to the original dataset, that should allow for a simple dplyr::filter. But this may not be ideal either in the case where I'm interested in an open-ended sequence, for example, finding all patients who started with Registration -> Triage and Assessment and can be followed by any sequence.
Here's my long-winded attempt
# get trace for each patient
patient_trace <- as_tibble(patients) %>%
  group_by(patient) %>%
  dplyr::filter(registration_type == 'complete') %>%
  summarise(trace = paste(handling, collapse = ","), n = n())
# identify the sequence trace of interest
trace_summary <- patients %>% traces
# here we want to see patients who had the sequence from row 5
res <- patients %>%
  dplyr::filter(patient %in% c(patient_trace %>% dplyr::filter(trace %in% trace_summary$trace[5]) %>% .$patient)) %>%
  dplyr::filter(registration_type == 'complete') %>%
  arrange(patient, time)
# A tibble: 4 x 7
handling patient employee handling_id registration_ty~ time .order
<fct> <chr> <fct> <chr> <fct> <dttm> <int>
1 Registration 499 r1 499 complete 2018-05-01 22:57:38 1
2 Triage and As~ 499 r2 999 complete 2018-05-04 23:53:27 3
3 Registration 500 r1 500 complete 2018-05-02 01:28:23 2
4 Triage and As~ 500 r2 1000 complete 2018-05-05 07:16:02 4
You can filter them with dplyr:
library(dplyr)
req_sequence <- c('Registration', 'Triage and Assessment')
eventdataR::patients %>%
  group_by(patient) %>%
  filter(all(handling == req_sequence)) %>%
  filter(registration_type == 'complete') %>%
  ungroup()
# handling patient employee handling_id registration_type time .order
# <fct> <chr> <fct> <chr> <fct> <dttm> <int>
#1 Registration 499 r1 499 complete 2018-05-01 22:57:38 3220
#2 Registration 500 r1 500 complete 2018-05-02 01:28:23 3221
#3 Triage and Assessment 499 r2 999 complete 2018-05-04 23:53:27 3720
#4 Triage and Assessment 500 r2 1000 complete 2018-05-05 07:16:02 3721
For this case, to be sure of the output and to avoid any recycling effect, we can filter registration_type == 'complete' first and also add a check that length(req_sequence) equals the number of rows for the patient ID.
eventdataR::patients %>%
  filter(registration_type == 'complete') %>%
  group_by(patient) %>%
  filter(length(req_sequence) == n() && all(handling == req_sequence)) %>%
  ungroup()
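For the open-ended case mentioned in the question (patients whose trace merely starts with a given prefix and can be followed by anything), here is a rough sketch following the same pattern; req_prefix is a made-up name, and this compares only the first length(req_prefix) completed activities per patient:
library(dplyr)
req_prefix <- c('Registration', 'Triage and Assessment')
eventdataR::patients %>%
  filter(registration_type == 'complete') %>%
  arrange(patient, time) %>%             # make sure activities are in chronological order
  group_by(patient) %>%
  filter(n() >= length(req_prefix) &&
           all(head(as.character(handling), length(req_prefix)) == req_prefix)) %>%
  ungroup()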
I've got a dataset of many individuals ("ID") with body weight measurements ("BW") at random time points ("Time") spanning over 15 years.
Example:
ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
ID Time BW
1 1 2015/1/1 75.01736
2 1 2015/3/1 75.44717
3 1 2016/1/1 73.09934
4 1 2016/3/1 74.79920
5 1 2017/1/1 74.70097
6 1 2018/5/1 74.23496
7 2 2012/1/1 73.57179
8 2 2017/5/1 74.50970
9 2 2019/4/1 74.43412
10 2 2020/4/1 75.02952
11 3 2019/10/1 76.41390
12 3 2020/1/1 75.79827
13 3 2020/4/1 74.46035
What I'm trying to find are IDs with measurements that have one measurement 12 +/- 3 months before them and one 12 +/- 3 months after, i.e. one body weight at 0 yr +/- 3 months, one at 1 yr, and one at 2 yr +/- 3 months. In this case, only rows 3 to 5 fulfill the criteria.
And among all individuals that fulfill such criteria, I would like to choose the measurement that has the most data points within this +/- 15 month range. The example desired output may look like:
ID Time BW Fulfill Counts
1 1 2015/1/1 75.01736 0 4
2 1 2015/3/1 75.44717 0 4
3 1 2016/1/1 73.09934 1 5
4 1 2016/3/1 74.79920 1 5
5 1 2017/1/1 74.70097 1 3
6 1 2018/5/1 74.23496 0 2
7 2 2012/1/1 73.57179 0 1
8 2 2017/5/1 74.50970 0 1
9 2 2019/4/1 74.43412 0 2
10 2 2020/4/1 75.02952 0 2
11 3 2019/10/1 76.41390 0 3
12 3 2020/1/1 75.79827 0 3
13 3 2020/4/1 74.46035 0 3
I've tried my best searching for similar answers on the internet, but I couldn't come up with anything remotely close to what I want to do. I could only make it to the grouping part with
group_by(ID)%>%
mutate(Fulfill==if time-...)
and then got stuck on the "calculating the difference with every other row" part. I'm imagining something like a loop over each row within a group (ID) to calculate the difference in time, and then a logical statement determining whether the criterion is met. I've used R for a while but only for descriptive statistics previously, so I'm sorry if it's actually quite simple. Thanks.
Here is a tidyverse approach (not completely optimised, you could probably even simplify it to only one function call with map_dfr or so).
I've chosen to use purrr::map_ functions. This allows me to apply a function to every entry of the Time column separately (the first argument passed to map_) while also passing the complete Time column (the second argument), so the filter operations can check whether there are entries in the +/- 15 month windows.
ID=c("1","1","1","1","1","1","2","2","2","2","3","3","3")
Time=c("2015/1/1","2015/3/1","2016/1/1","2016/3/1","2017/1/1","2018/5/1","2012/1/1","2017/5/1","2019/4/1","2020/4/1","2019/10/1","2020/1/1","2020/4/1")
BW=rnorm(13,mean=75)
df<-data.frame(ID,Time,BW)
library(dplyr)
library(purrr)
library(lubridate)
check_entries <- function(curr_entry, entries) {
  # establish bounds in which there must be entries
  lower_bound_1 <- curr_entry %m-% months(15)
  lower_bound_2 <- curr_entry %m-% months(9)
  upper_bound_1 <- curr_entry %m+% months(9)
  upper_bound_2 <- curr_entry %m+% months(15)
  # filter the entries that match the time period constraints
  entries <- data.frame(entries = entries)
  filtered_lower <- entries %>%
    filter(entries >= lower_bound_1 & entries <= lower_bound_2)
  filtered_upper <- entries %>%
    filter(entries >= upper_bound_1 & entries <= upper_bound_2)
  # check if there is a matching earlier and later entry
  if (nrow(filtered_lower) > 0 && nrow(filtered_upper) > 0) {
    TRUE
  } else {
    FALSE
  }
}
calculate_number_entries <- function(curr_entry, entries) {
  # establish bounds in which there must be entries
  lower_bound <- curr_entry %m-% months(15)
  upper_bound <- curr_entry %m+% months(15)
  # filter the matching entries and calculate the number of observations
  entries <- data.frame(entries = entries)
  entries %>%
    filter(entries >= lower_bound & entries <= upper_bound) %>%
    nrow()
}
df %>%
  group_by(ID) %>%
  mutate(Time = as.Date(Time, format = "%Y/%m/%d"),
         Fulfill = map_lgl(Time, check_entries, Time),
         Fulfill_ID = sum(Fulfill) > 0,
         Counts = map_int(Time, calculate_number_entries, Time))
#> # A tibble: 13 x 6
#> # Groups: ID [3]
#> ID Time BW Fulfill Fulfill_ID Counts
#> <chr> <date> <dbl> <lgl> <lgl> <int>
#> 1 1 2015-01-01 75.4 FALSE TRUE 4
#> 2 1 2015-03-01 74.0 FALSE TRUE 4
#> 3 1 2016-01-01 74.2 TRUE TRUE 5
#> 4 1 2016-03-01 74.9 TRUE TRUE 5
#> 5 1 2017-01-01 75.6 FALSE TRUE 3
#> 6 1 2018-05-01 73.8 FALSE TRUE 1
#> 7 2 2012-01-01 75.6 FALSE FALSE 1
#> 8 2 2017-05-01 75.0 FALSE FALSE 1
#> 9 2 2019-04-01 74.3 FALSE FALSE 2
#> 10 2 2020-04-01 74.9 FALSE FALSE 2
#> 11 3 2019-10-01 75.5 FALSE FALSE 3
#> 12 3 2020-01-01 75.3 FALSE FALSE 3
#> 13 3 2020-04-01 76.0 FALSE FALSE 3
Created on 2020-12-06 by the reprex package (v0.3.0)
Note that I get a different result for the 5th entry; you may want to check whether the month addition/subtraction behaves as you need it to. Check out lubridate for more info.
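If, as the question asks, you then want the single measurement per qualifying ID with the most data points in its +/- 15 month window, here is a sketch; it assumes the result above is stored in an object called res (a made-up name) and dplyr >= 1.0 for slice_max():
res %>%
  filter(Fulfill) %>%                              # keep only the qualifying measurements
  group_by(ID) %>%
  slice_max(Counts, n = 1, with_ties = FALSE) %>%  # the one with the most points in its window
  ungroup()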
I'm trying to extend some code to be able to:
1) read in a vector of prices
2) left join that vector of prices to a data frame of years (or years and months)
3) append/fill the prices for missing years with interpolated data based on the last year of available prices plus a specified inflation rate. Consider an example like this one:
prices <- data.frame(year=2018:2022,
                     wti=c(75,80,90,NA,NA),
                     brent=c(80,85,94,93,NA))
What I need is something that will fill the missing rows of each column with the last price plus inflation (suppose 2%). I can do this in a pretty brute force way as:
i_rate <- 0.02
for(i in c(1:nrow(prices))){
  if(is.na(prices$wti[i]))
    prices$wti[i] <- prices$wti[i-1]*(1+i_rate)
  if(is.na(prices$brent[i]))
    prices$brent[i] <- prices$brent[i-1]*(1+i_rate)
}
It seems to me there should be a way to do this using some combination of apply() and/or fill() but I can't seem to make it work.
Any help would be much appreciated.
As noted by @camille, the problem with dplyr::lag is that it doesn't work here with consecutive NAs, because it uses the "original" ith element of a vector instead of the "revised" ith element. We first have to create a version of lag that handles this, by writing a new function:
impute_inflation <- function(x, rate) {
  output <- x
  y <- rep(NA, length = length(x)) # Creating an empty vector to fill in with the loop. This makes R faster to run for vectors with a large number of elements.
  for (i in seq_len(length(output))) {
    if (i == 1) {
      y[i] <- output[i] # To avoid an error attempting to use the 0th element.
    } else {
      y[i] <- output[i - 1]
    }
    if (is.na(output[i])) {
      output[i] <- y[i] * (1 + rate)
    } else {
      output[i]
    }
  }
  output
}
Then it's a cinch to apply this across a bunch of variables with dplyr::mutate_at():
library(dplyr)
mutate_at(prices, vars(wti, brent), impute_inflation, 0.02)
year wti brent
1 2018 75.000 80.00
2 2019 80.000 85.00
3 2020 90.000 94.00
4 2021 91.800 93.00
5 2022 93.636 94.86
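mutate_at() has since been superseded; with dplyr 1.0 or later the same call can be written with across(). A sketch with the same behaviour:
library(dplyr)
mutate(prices, across(c(wti, brent), ~ impute_inflation(.x, 0.02)))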
You can use dplyr::lag to get the previous value in a given column. Your lagged values look like this:
library(dplyr)
inflation_factor <- 1.02
prices <- data_frame(year=2018:2022,
                     wti=c(75,80,90,NA,NA),
                     brent=c(80,85,94,93,NA)) %>%
  mutate_at(vars(wti, brent), as.numeric)
prices %>%
  mutate(prev_wti = lag(wti))
#> # A tibble: 5 x 4
#> year wti brent prev_wti
#> <int> <dbl> <dbl> <dbl>
#> 1 2018 75 80 NA
#> 2 2019 80 85 75
#> 3 2020 90 94 80
#> 4 2021 NA 93 90
#> 5 2022 NA NA NA
When a value is NA, multiply the lagged value by the inflation factor. As you can see, that doesn't handle consecutive NAs, however.
prices %>%
  mutate(wti = ifelse(is.na(wti), lag(wti) * inflation_factor, wti),
         brent = ifelse(is.na(brent), lag(brent) * inflation_factor, brent))
#> # A tibble: 5 x 3
#> year wti brent
#> <int> <dbl> <dbl>
#> 1 2018 75 80
#> 2 2019 80 85
#> 3 2020 90 94
#> 4 2021 91.8 93
#> 5 2022 NA 94.9
Or to scale this and avoid doing the same multiplication over and over, gather the data into a long format, get lags within each group (wti, brent, or any others you may have), and adjust values as needed. Then you can spread back to the original shape:
prices %>%
  tidyr::gather(key = key, value = value, wti, brent) %>%
  group_by(key) %>%
  mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
  tidyr::spread(key = key, value = value)
#> # A tibble: 5 x 3
#> year brent wti
#> <int> <dbl> <dbl>
#> 1 2018 80 75
#> 2 2019 85 80
#> 3 2020 94 90
#> 4 2021 93 91.8
#> 5 2022 94.9 NA
Created on 2018-07-12 by the reprex package (v0.2.0).
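gather() and spread() have since been superseded by pivot_longer() and pivot_wider() in tidyr 1.0+; here is a sketch of the same reshape (it shares the consecutive-NA limitation noted above):
library(dplyr)
library(tidyr)
prices %>%
  pivot_longer(c(wti, brent), names_to = "key", values_to = "value") %>%
  group_by(key) %>%
  mutate(value = ifelse(is.na(value), lag(value) * inflation_factor, value)) %>%
  ungroup() %>%
  pivot_wider(names_from = key, values_from = value)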