Say I have a raw dataset, already in a data frame that I can easily convert with as.xts.data.table, like the following:
| Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature |
| ---- | ---- | ----- | ------- | ------------------- | ------------------- | ---------------------- |
| 2018-02-03 | New York City | NY | US | 18 | 22 | 19 |
| 2018-02-03 | London | LDN | UK | 10 | 25 | 15 |
| 2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 |
| 2018-02-02 | New York City | NY | US | 12 | 30 | 18 |
| 2018-02-02 | London | LDN | UK | 12 | 15 | 14 |
| 2018-02-02 | Singapore | SG | SG | 27 | 31 | 30 |
and so on (many more cities and many more days).
And I would like to show both the current day's temperatures and the day-over-day changes from the previous day, together with the other info on the city (state, country). I.e., the new data frame should look something like this (from the example above):
| Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature | ChangeInDailyMin | ChangeInDailyMax | ChangeInDailyMedian |
| ---- | ---- | ----- | ------- | ------------------- | ------------------- | ---------------------- | ---------------- | ---------------- | ------------------- |
| 2018-02-03 | New York City | NY | US | 18 | 22 | 19 | 6 | -8 | 1 |
| 2018-02-03 | London | LDN | UK | 10 | 25 | 15 | -2 | 10 | 1 |
| 2018-02-03 | Singapore | SG | SG | 28 | 32 | 29 | 1 | 1 | -1 |
| 2018-02-03 | New York City | NY | US | ... | | | | | |
and so on; i.e., add three more columns showing the day-over-day change.
Note that I may not have data for every day; the change is defined as the temperature on day t minus the temperature on the most recent earlier date for which I have data.
I tried to use the shift function, but R complained about the := sign.
Is there any way in R I could get this to work?
Thanks!
You can use dplyr::mutate_at together with the lubridate package to transform the data into the desired format. The data needs to be arranged by Date, and the difference between the current and the previous record can be taken with dplyr::lag:
library(dplyr)
library(lubridate)
df %>%
  mutate_if(is.character, funs(trimws)) %>%        # trim blank spaces left by the '|' separator
  mutate(Date = ymd(Date)) %>%                     # convert to Date
  group_by(City, State, Country) %>%
  arrange(City, State, Country, Date) %>%          # order each city's records by date
  mutate_at(vars(starts_with("Daily")), funs(Change = . - lag(.))) %>%
  filter(!is.na(DailyMinTemperature_Change))
Result:
# # A tibble: 3 x 10
# # Groups: City, State, Country [3]
# Date City State Country DailyMinTemperature DailyMaxTemperature DailyMedianTemperature DailyMinTemperature_Change DailyMaxT~ DailyMed~
# <date> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl> <int>
# 1 2018-02-03 London LDN UK 10.0 25.0 15 -2.00 10.0 1
# 2 2018-02-03 New York City NY US 18.0 22.0 19 6.00 -8.00 1
# 3 2018-02-03 Singapore SG SG 28.0 32.0 29 1.00 1.00 -1
#
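For what it's worth, on current dplyr releases funs() is deprecated and the _if/_at verbs are superseded by across(); with the same libraries loaded, an equivalent pipeline would be:

df %>%
  mutate(across(where(is.character), trimws),
         Date = ymd(Date)) %>%
  group_by(City, State, Country) %>%
  arrange(Date, .by_group = TRUE) %>%
  mutate(across(starts_with("Daily"), ~ .x - lag(.x),
                .names = "{.col}_Change")) %>%
  filter(!is.na(DailyMinTemperature_Change))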
Data:
df <- read.table(text =
"Date | City | State | Country | DailyMinTemperature | DailyMaxTemperature | DailyMedianTemperature
2018-02-03 | New York City | NY | US | 18 | 22 | 19
2018-02-03 | London | LDN |UK | 10 | 25 | 15
2018-02-03 | Singapore | SG | SG | 28 | 32 | 29
2018-02-02 | New York City | NY | US | 12 | 30 | 18
2018-02-02 | London | LDN | UK | 12 | 15 | 14
2018-02-02 | Singapore | SG | SG | 27 | 31 | 30",
header = TRUE, stringsAsFactors = FALSE, sep = "|")
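Since the question mentions shift and :=: that error usually appears when the object is still a plain data.frame, because := only works on a data.table. A minimal data.table sketch of the same computation, assuming the same df as above (shift() takes the previous available row within each city, matching the "most recent date with data" definition):

library(data.table)

setDT(df)                                           # := requires a data.table, not a plain data.frame
chr <- names(df)[sapply(df, is.character)]
df[, (chr) := lapply(.SD, trimws), .SDcols = chr]   # trim the padding left by the '|' separator
df[, Date := as.IDate(Date)]
setorder(df, City, State, Country, Date)
cols <- grep("^Daily", names(df), value = TRUE)
df[, paste0("ChangeIn", cols) := lapply(.SD, function(x) x - shift(x)),
   by = .(City, State, Country), .SDcols = cols]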
Related
I have a strange problem with my R code. I am attempting to create a data frame holding each quantile of fantasy-sports scores for each week of the NFL season across 5 years. Example (table 1):
| Year | week | Position | date | Score | ranking |
| ---- | ---- | -------- | ---------- | ----- | ------- |
| 2018 | 1 | RB | 2018-01-01 | NA | 50 |
| 2018 | 1 | RB | 2018-01-01 | NA | 75 |
| 2018 | 1 | RB | 2018-01-01 | NA | 85 |
| 2018 | 1 | RB | 2018-01-01 | NA | 90 |
| 2018 | 2 | RB | 2018-01-02 | NA | 50 |
| 2018 | 2 | RB | 2018-01-02 | NA | 75 |
| 2018 | 2 | RB | 2018-01-02 | NA | 85 |
| 2018 | 2 | RB | 2018-01-02 | NA | 90 |
| 2019 | 1 | RB | 2019-01-01 | NA | 50 |
| 2019 | 1 | RB | 2019-01-01 | NA | 75 |
| 2019 | 1 | RB | 2019-01-01 | NA | 85 |
| 2019 | 1 | RB | 2019-01-01 | NA | 90 |
Table 1: def_rb_table. For simplicity, I have only included one position type, 'RB'. I have also joined the NFL week and year into one date, with the NFL year as the year of the date and the NFL week as the day of the month. Ranking refers to the quantile of that date's scoring.
Using the code below (figure 1), table 1 above, and the weekly scoring of every NFL player in a second df (rb_data), I can create table 2.
repeat {
for (y in rb_data$date) {
for (d in def_rb_table$ranking){
def_rb_table$score[is.na(def_rb_table$score) & y == def_rb_table$date & def_rb_table$ranking == d] <-
apply(def_rb_table, 1, function(x) {(quantile(rb_data$fantasy_points[y == rb_data$date], (d/100)))})
}
}
if (def_rb_table$week == 17 & def_rb_table$season == 2022)
{break}
}
Figure 1. rb_data holds the individual scoring of each player for every season.
The problem: This code works (very slowly); it loops through each date and each ranking and fills in the quantile score for each ranking. It does appear to get stuck somewhere, as I have to stop it to see the results, but more running time doesn't make it progress to the next season. I am having two problems with this code; table 2 shows both.
| Year | week | Position | date | Score | ranking |
| ---- | ---- | -------- | ---------- | ----- | ------- |
| 2018 | 1 | RB | 2018-01-01 | 5.15 | 50 |
| 2018 | 1 | RB | 2018-01-01 | 10.65 | 75 |
| 2018 | 1 | RB | 2018-01-01 | 15.96 | 85 |
| 2018 | 1 | RB | 2018-01-01 | 18.78 | 90 |
| 2018 | 2 | RB | 2018-01-02 | 5.00 | 50 |
| 2018 | 2 | RB | 2018-01-02 | 9.00 | 75 |
| 2018 | 2 | RB | 2018-01-02 | 11.90 | 85 |
| 2018 | 2 | RB | 2018-01-02 | 13.9 | 90 |
| 2018 | 3 | RB | 2018-01-03 | NA | 50 |
| 2018 | 3 | RB | 2018-01-03 | NA | 75 |
| 2018 | 3 | RB | 2018-01-03 | NA | 85 |
| 2018 | 3 | RB | 2018-01-03 | NA | 90 |
| 2018 | 4 | RB | 2018-01-04 | 4.35 | 50 |
| 2018 | 4 | RB | 2018-01-04 | 11.80 | 75 |
| 2018 | 4 | RB | 2018-01-04 | 24.60 | 85 |
| 2018 | 4 | RB | 2018-01-04 | 19.02 | 90 |
| 2019 | 1 | RB | 2019-01-01 | NA | 50 |
| 2019 | 1 | RB | 2019-01-01 | NA | 75 |
| 2019 | 1 | RB | 2019-01-01 | NA | 85 |
| 2019 | 1 | RB | 2019-01-01 | NA | 90 |
| 2020 | 1 | RB | 2020-01-01 | NA | 50 |
| 2020 | 1 | RB | 2020-01-01 | NA | 75 |
| 2020 | 1 | RB | 2020-01-01 | NA | 85 |
| 2020 | 1 | RB | 2020-01-01 | NA | 90 |
Table 2: The completed def_rb_table after figure 1 has run. For an unknown reason, the code will not progress past the first year in the data set and always skips one random week.
Problem 1: Once the code has progressed through the 2018 season, it will not move on to the 2019 season. I have been working around this by saving the completed season, removing it from the original data frame, and rerunning the code. (Not great, but it's the best I can do.) The second problem is the one that has stumped me considerably.
Problem 2: For an unknown reason, as the code progresses through a year, it randomly skips one week of the NFL season: week 11 in 2018, week 6 in 2019, and week 8 in 2020. When I close the program and rerun the code, it occasionally skips a different week of the same season. I do receive multiple warnings: "number of items to replace is not a multiple of replacement length." I am assuming these warnings are not related to my current problems.
What I'm hoping to build:
| Year | week | Position | date | Score | ranking |
| ---- | ---- | -------- | ---------- | ----- | ------- |
| 2018 | 1 | RB | 2018-01-01 | 5.15 | 50 |
| 2018 | 1 | RB | 2018-01-01 | 10.65 | 75 |
| 2018 | 1 | RB | 2018-01-01 | 15.96 | 85 |
| 2018 | 1 | RB | 2018-01-01 | 18.78 | 90 |
| 2018 | 2 | RB | 2018-01-02 | 5.00 | 50 |
| 2018 | 2 | RB | 2018-01-02 | 9.00 | 75 |
| 2018 | 2 | RB | 2018-01-02 | 11.90 | 85 |
| 2018 | 2 | RB | 2018-01-02 | 13.9 | 90 |
| 2018 | 3 | RB | 2018-01-03 | 4.35 | 50 |
| 2018 | 3 | RB | 2018-01-03 | 11.80 | 75 |
| 2018 | 3 | RB | 2018-01-03 | 14.60 | 85 |
| 2018 | 3 | RB | 2018-01-03 | 16.47 | 90 |
| 2018 | 4 | RB | 2018-01-04 | 4.80 | 50 |
| 2018 | 4 | RB | 2018-01-04 | 13.97 | 75 |
| 2018 | 4 | RB | 2018-01-04 | 19.54 | 85 |
| 2018 | 4 | RB | 2018-01-04 | 22.41 | 90 |
| 2019 | 1 | RB | 2019-01-01 | 4.90 | 50 |
| 2019 | 1 | RB | 2019-01-01 | 9.30 | 75 |
| 2019 | 1 | RB | 2019-01-01 | 13.90 | 85 |
| 2019 | 1 | RB | 2019-01-01 | 18.00 | 90 |
| 2020 | 1 | RB | 2020-01-01 | 4.80 | 50 |
| 2020 | 1 | RB | 2020-01-01 | 10.40 | 75 |
| 2020 | 1 | RB | 2020-01-01 | 13.50 | 85 |
| 2020 | 1 | RB | 2020-01-01 | 16.62 | 90 |
Table 3: example of the end product
I apologize for any missing information and the clunkiness of my code. This is the project I took on to teach myself R, and I am learning along the way. Please ask if you need more information, and I will be happy to provide it.
I have been stuck on this problem for two days and have tried numerous solutions. I will try any suggestion, even ones I have tried before. If anyone has an idea for making this code simpler or more efficient, that would also be great. Thank you for any help you can provide :)
EDIT: The most straightforward way to recreate the rb_data table is to use the online database it comes from. Because quantile is being computed without the entire database, you will not be able to reproduce the exact values I get. This method also creates several unused columns; the important ones are player_name, season, week, fantasy_points, and date. Table 4 is a simple recreation of rb_data to show what that table contains. The database is from https://github.com/nflverse/nflfastR/
rb_data <- nflfastR::load_player_stats(seasons = TRUE) %>%
  dplyr::filter(week <= 17, position == "RB", season >= 2018)
rb_data$date <- as.Date(paste(rb_data$season, "01", rb_data$week, sep = "-"), "%Y-%m-%d")
| Year | week | Position | date | Fantasy points | player id |
| ---- | ---- | -------- | ---------- | -------------- | ---------- |
| 2018 | 1 | RB | 2018-01-01 | 6.1 | F.Gore |
| 2018 | 2 | RB | 2018-01-02 | 4.4 | F.Gore |
| 2018 | 3 | RB | 2018-01-03 | 1.2 | F.Gore |
| 2018 | 4 | RB | 2018-01-04 | 11.7 | F.Gore |
| 2018 | 5 | RB | 2018-01-05 | 6.3 | F.Gore |
| 2018 | 6 | RB | 2018-01-06 | 11.9 | F.Gore |
| 2019 | 1 | RB | 2019-01-01 | 2.0 | F.Gore |
| 2019 | 2 | RB | 2019-01-02 | 14.3 | F.Gore |
| 2019 | 3 | RB | 2019-01-03 | 14.9 | F.Gore |
| 2019 | 4 | RB | 2019-01-04 | 10.9 | F.Gore |
| 2019 | 5 | RB | 2019-01-05 | 6.9 | F.Gore |
| 2019 | 6 | RB | 2019-01-06 | 6.6 | F.Gore |
| 2020 | 1 | RB | 2020-01-01 | 2.4 | F.Gore |
| 2020 | 2 | RB | 2020-01-02 | 6.3 | F.Gore |
| 2020 | 3 | RB | 2020-01-03 | 6.2 | F.Gore |
| 2020 | 4 | RB | 2020-01-04 | 3.6 | F.Gore |
| 2020 | 5 | RB | 2020-01-05 | 3.0 | F.Gore |
| 2020 | 6 | RB | 2020-01-06 | 7.0 | F.Gore |
| 2018 | 1 | RB | 2018-01-01 | 20.6 | A.Peterson |
| 2018 | 2 | RB | 2018-01-02 | 5.0 | A.Peterson |
| 2018 | 3 | RB | 2018-01-03 | 24.0 | A.Peterson |
| 2018 | 4 | RB | 2018-01-04 | 4.2 | A.Peterson |
| 2018 | 5 | RB | 2018-01-05 | 9.7 | A.Peterson |
| 2018 | 6 | RB | 2018-01-06 | 10.7 | A.Peterson |
| 2019 | 1 | RB | 2019-01-01 | 9.2 | A.Peterson |
| 2019 | 2 | RB | 2019-01-02 | 3.4 | A.Peterson |
| 2019 | 3 | RB | 2019-01-03 | 2.8 | A.Peterson |
| 2019 | 4 | RB | 2019-01-04 | 1.8 | A.Peterson |
| 2019 | 5 | RB | 2019-01-05 | 13.6 | A.Peterson |
| 2019 | 6 | RB | 2019-01-06 | 6.1 | A.Peterson |
| 2020 | 1 | RB | 2020-01-01 | 11.4 | A.Peterson |
| 2020 | 2 | RB | 2020-01-02 | 4.1 | A.Peterson |
| 2020 | 3 | RB | 2020-01-03 | 8.5 | A.Peterson |
| 2020 | 4 | RB | 2020-01-04 | 9.6 | A.Peterson |
| 2020 | 5 | RB | 2020-01-05 | 11.8 | A.Peterson |
| 2020 | 6 | RB | 2020-01-06 | 3.0 | A.Peterson |
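For what it's worth, two things in figure 1 would produce exactly these symptoms: depending on the R version, if() with a length > 1 condition either errors or silently tests only its first element, so the break logic never behaves as intended; and apply(def_rb_table, 1, ...) builds one replacement value per row of def_rb_table while the left-hand side selects only a few rows, which is what triggers the "number of items to replace is not a multiple of replacement length" warnings and the misaligned (seemingly skipped) weeks. Below is a loop-free sketch, assuming dplyr >= 1.1.0 for reframe() and the column names listed in the edit (season, week, date, fantasy_points):

library(dplyr)

def_rb_table <- rb_data %>%
  filter(!is.na(fantasy_points)) %>%
  group_by(season, week, date) %>%                        # one group per NFL week
  reframe(ranking = c(50, 75, 85, 90),                    # the quantiles being tracked
          score   = quantile(fantasy_points, ranking / 100)) %>%
  mutate(Position = "RB")                                 # constant in this example

This computes every week of every season in a single pass, so there is no repeat/break bookkeeping left to go wrong.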
My data frame contains options data with several variables. The relevant ones are: date, days (days to expiration), and mid. Mid needs to be interpolated via a natural spline at a future time point 30 days out. This interpolation should be done for every day and every strike price.
The data table looks as follows:
| date | days | mid | strike |
| ---------- | ---- | --- | ------ |
| 2020-01-01 | 8 | 12 | 110 |
| 2020-01-01 | 28 | 14 | 110 |
| 2020-01-01 | 49 | 15 | 110 |
| 2020-01-01 | 80 | 17 | 110 |
| 2020-01-01 | 8 | 11 | 120 |
| 2020-01-01 | 28 | 12 | 120 |
| 2020-01-01 | 49 | 13 | 120 |
| 2020-01-01 | 80 | 14 | 120 |
| 2020-01-12 | 6 | 12 | 110 |
| 2020-01-12 | 26 | 14 | 110 |
| 2020-01-12 | 47 | 15 | 110 |
| 2020-01-12 | 82 | 17 | 110 |
| 2020-01-12 | 7 | 11 | 120 |
| 2020-01-12 | 27 | 12 | 120 |
| 2020-01-12 | 47 | 13 | 120 |
| 2020-01-12 | 85 | 14 | 120 |
This is just an example; the original data frame contains over 1 million rows, so I can't use a for loop and want to interpolate by group.
I found some approaches online; unfortunately, none of them really worked for me.
My last guess was:
df$id <- paste0(df$date, df$strike)
df %>%
  group_by(id) %>%
  mutate(mid_30 = spline(df$days, df$mid, xout = 30, method = "natural"))
Do you have any possible solution?
Thank you very much in advance!
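For reference, here is a minimal sketch of a grouped version that should work, assuming days and mid are numeric. Two changes from the attempt above: spline() returns a list, so the interpolated value is its $y component; and the bare column names must be used inside mutate() so that each group's own rows are interpolated, rather than the whole data frame's as df$days would:

library(dplyr)

df %>%
  group_by(date, strike) %>%                      # one curve per day/strike pair
  mutate(mid_30 = spline(days, mid, xout = 30,    # natural spline evaluated at 30 days
                         method = "natural")$y) %>%
  ungroup()

Grouping by date and strike directly also makes the pasted id column unnecessary.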
For my thesis, I am trying to use several variables from two types of surveys (the British Election Studies (BES) and the British Social Attitudes Survey (BSA)) and combine them into one dataset.
Currently, I have two datasets, one with BES data, which looks like this (in simplified version):
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | working | A-levels | female | 32 |
| 1992 | middle | GCSE | male | 49 |
| 1997 | lower | Undergrad | female | 24 |
| 1997 | middle | GCSE | male | 29 |
The BSA data looks like this (again, simplified):
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | middle | A-levels | male | 22 |
| 1993 | working | GCSE | female | 45 |
| 1994 | upper | Postgrad | female | 38 |
| 1994 | middle | GCSE | male | 59 |
Basically, what I am trying to do is combine the two into one dataframe that looks like this:
| year | class | education | gender | age |
| ---- | ----- | --------- | ------ | --- |
| 1992 | working | A-levels | female | 32 |
| 1992 | middle | GCSE | male | 49 |
| 1992 | middle | A-levels | male | 22 |
| 1993 | working | GCSE | female | 45 |
| 1994 | upper | Postgrad | female | 38 |
| 1994 | middle | GCSE | male | 59 |
| 1997 | lower | Undergrad | female | 24 |
| 1997 | middle | GCSE | male | 29 |
I have googled a lot about joins and merging, but I can't figure out a way that works correctly. From what I understand, I believe I should join "by" the year variable, but is that correct? And how can I keep the computation from taking up a lot of memory (the actual datasets are about 30k rows for the BES and 130k for the BSA)? Is there a solution using either dplyr or data.table in R?
Any help is much appreciated!!!
This is not a "merge" (or join) operation; it's just row-concatenation. In R, that's done with rbind (which has methods for both matrix and data.frame). (For perspective, there's also cbind, which concatenates by columns; it's not applicable here.)
base R
rbind(BES, BSA)
# year class education gender age
# 1 1992 working A-levels female 32
# 2 1992 middle GCSE male 49
# 3 1997 lower Undergrad female 24
# 4 1997 middle GCSE male 29
# 5 1992 middle A-levels male 22
# 6 1993 working GCSE female 45
# 7 1994 upper Postgrad female 38
# 8 1994 middle GCSE male 59
other dialects
dplyr::bind_rows(BES, BSA)
data.table::rbindlist(list(BES, BSA))
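At roughly 30k + 130k rows, memory should not be a concern for any of these approaches. If you later need to know which survey each row came from, both dialects can tag the source as they bind (the survey column name here is illustrative):

dplyr::bind_rows(BES = BES, BSA = BSA, .id = "survey")
data.table::rbindlist(list(BES = BES, BSA = BSA), idcol = "survey")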
I am new to R; your help here will be appreciated.
I have inputs such as:
columnA <- 14 # USERINPUT
columnB <- 1 # incremented 1, 2, 3, ...
columnC <- columnA * columnB
columnD <- 25 # remains constant
columnE <- columnC / columnD
columnF <- 8 # remains constant
columnG <- columnE + columnF
mydf <- data.frame(columnA,columnB,columnC,columnD,columnE,columnF,columnG)
Based on the above data frame, I need to create a data frame in which the value of columnB is incremented in every subsequent row (1, 2, 3, ...) such that the value of columnG never exceeds 600, at which point we stop creating rows. I tried to do this in Excel; below is the kind of output I need.
+---------+--------+---------+---------+---------+---------+---------+
| columnA | columB | columnC | columnD | columnE | columnF | columnG |
+---------+--------+---------+---------+---------+---------+---------+
| 14 | 1 | 14 | 25 | 0.56 | 8 | 8.56 |
| 14 | 2 | 28 | 25 | 1.12 | 8 | 9.12 |
| 14 | 3 | 42 | 25 | 1.68 | 8 | 9.68 |
| 14 | 4 | 56 | 25 | 2.24 | 8 | 10.24 |
| 14 | 5 | 70 | 25 | 2.8 | 8 | 10.8 |
| 14 | 6 | 84 | 25 | 3.36 | 8 | 11.36 |
| 14 | 7 | 98 | 25 | 3.92 | 8 | 11.92 |
| 14 | 8 | 112 | 25 | 4.48 | 8 | 12.48 |
+---------+--------+---------+---------+---------+---------+---------+
The end result should be a data frame
First you can compute the length of the data.frame by solving columnG = columnA * columnB / columnD + columnF <= 600 for columnB:
userinput <- 14
N <- (600 - 8) * 25 / userinput
Then, using dplyr you create the data.frame:
library(dplyr)

mydf <- tibble(ColA = userinput, ColB = 1:floor(N), ColD = 25, ColF = 8) %>%
  mutate(ColC = ColA * ColB, ColE = ColC / ColD, ColG = ColE + ColF)
If you need the columns in the correct order:
> mydf <- mydf %>% select(ColA, ColB, ColC, ColD, ColE, ColF, ColG)
> mydf
      ColA ColB  ColC ColD   ColE ColF   ColG
   1:   14    1    14   25   0.56    8   8.56
   2:   14    2    28   25   1.12    8   9.12
   3:   14    3    42   25   1.68    8   9.68
   4:   14    4    56   25   2.24    8  10.24
   5:   14    5    70   25   2.80    8  10.80
  ---
1053:   14 1053 14742   25 589.68    8 597.68
1054:   14 1054 14756   25 590.24    8 598.24
1055:   14 1055 14770   25 590.80    8 598.80
1056:   14 1056 14784   25 591.36    8 599.36
1057:   14 1057 14798   25 591.92    8 599.92
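As a quick sanity check on the stopping rule:

max(mydf$ColG)   # 599.92 -- the largest ColG, still below 600
nrow(mydf)       # 1057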
I have an ASP.NET MVC 4 app with an ASP.NET Web Form hosting a ReportViewer control. I'm using Entity Framework.
I want to display a table of Orders (id, date_add, name, quantity, price):
1 | 2011-12-08 | apples | 4 | 0.99
2 | 2012-01-07 | oranges | 20 | 1.39
4 | 2012-03-04 | plumes | 80 | 1.59
5 | 2012-05-01 | apples | 15 | 0.89
6 | 2012-05-03 | pears | 10 | 1.29
7 | 2012-05-09 | oranges | 18 | 1.49
I want my report to look like this:
December - Sum(price for dec 2011)
1 | 2011-12-08 | apples | 4 | 0.99
2011 - Sum(price for 2011)
January - Sum(price for jan 2012)
2 | 2012-01-07 | oranges | 20 | 1.39
March - Sum(price for mar 2012)
4 | 2012-03-04 | plumes | 80 | 1.59
May - Sum(price for may 2012)
5 | 2012-05-01 | apples | 15 | 0.89
6 | 2012-05-03 | pears | 10 | 1.29
7 | 2012-05-09 | oranges | 18 | 1.49
2012 - Sum(price for 2012)
How can it be done using drill-through?
And could the month groups be collapsed, with only the years visible at first run?
I've made a stored procedure where I extract the week, month, and year from the date_add field:
CREATE PROCEDURE [dbo].[Orders]
AS
    SELECT id_order, date_add, order_name, quantity, price,
           MONTH(date_add) AS Mn, YEAR(date_add) AS Ye, DATEPART(wk, date_add) AS [Week]
    FROM [order]
RETURN
In the report's Tablix, I've grouped twice: first by month, then by year.