Allow duplicate rows in row selection using an interval in R

I have the following extract of my dataset:
basisanddowngradessingledates[1716:1721, ]
# A tibble: 6 x 23
Dates Bank CDS Bond `Swap zero rate` `CDS-bond basis` `Basis change` `Rating agency`
<dttm> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2015-05-15 Allied Irish Banks PLC 129.63 201.0235 40.6 -30.79352 1.9408116 NA
2 2015-05-18 Allied Irish Banks PLC 129.64 202.1998 41.0 -31.55976 -0.7662374 NA
3 2015-05-19 Allied Irish Banks PLC 129.65 200.4579 39.0 -31.80792 -0.2481631 Fitch
4 2015-05-20 Allied Irish Banks PLC 129.65 203.9960 39.0 -35.34598 -3.5380550 DBRS
5 2015-05-21 Allied Irish Banks PLC 129.63 203.5341 41.0 -32.90415 2.4418300 NA
6 2015-05-22 Allied Irish Banks PLC 130.64 203.2723 40.0 -32.63234 0.2718045 NA
I would like to select the interval [-1:1] around each downgrade, i.e. the day before, the day of, and the day after. A row where the column "Rating agency" is not NA indicates that a downgrade has occurred. In my example above, that means rows [1717:1719] and [1718:1720], so 6 rows in total, 3 per downgrade.
My dataset has 45276 entries with 536 downgrades (rows where "Rating agency" is not NA), and for each downgrade I would like to collect the 3 surrounding rows into a list.
I tried it using the following code:
keepindex <- which(basisanddowngradessingledates[,8] != "NA")
interval11 <- unique(c(keepindex-1, keepindex, keepindex+1))
interval1ra1 <- basisanddowngradessingledates[interval11,]
This works if there are no downgrades on consecutive days. However in my example extract I have two downgrades right after each other and I get the following output:
print(interval1ra1[c(11:12, 348, 674), ])
# A tibble: 4 x 23
Dates Bank CDS Bond `Swap zero rate` `CDS-bond basis` `Basis change` `Rating agency`
<dttm> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 2015-05-18 Allied Irish Banks PLC 129.64 202.1998 41 -31.55976 -0.7662374 NA
2 2015-05-19 Allied Irish Banks PLC 129.65 200.4579 39 -31.80792 -0.2481631 Fitch
3 2015-05-20 Allied Irish Banks PLC 129.65 203.9960 39 -35.34598 -3.5380550 DBRS
4 2015-05-21 Allied Irish Banks PLC 129.63 203.5341 41 -32.90415 2.4418300 NA
I get 4 rows instead of the 6 I need.
I guess the unique() function removes the duplicate rows, but in my example I need these rows, as described above.
How can I fix this?

Here is one possible solution to get the previous and next row for each matching row:
keepindex <- c(1718, 1719)
lookupindex <- c()
for (index in keepindex) {
  lookupindex <- c(lookupindex, index - 1, index, index + 1)
}
lookupindex
[1] 1717 1718 1719 1718 1719 1720
In this solution the overlapping rows 1718 and 1719 appear twice.

Found a simple solution myself, without using the unique() function:
keepindex <- which(basisanddowngradessingledates[, 8] != "NA")
interval1ra1 <- basisanddowngradessingledates[c(keepindex - 1, keepindex, keepindex + 1), ]
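If you would rather keep each downgrade's three rows together (day before, day of, day after) instead of grouped by offset, you can build the index one triple at a time — a minimal sketch of the same idea:

keepindex <- which(basisanddowngradessingledates[[8]] != "NA")

# one (i-1, i, i+1) triple per downgrade; duplicate rows are preserved
idx <- as.vector(sapply(keepindex, function(i) c(i - 1, i, i + 1)))
interval1ra1 <- basisanddowngradessingledates[idx, ]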

Related

list of data frames, trying to create new column with normalisation values for each dataframe

I'm new to R and mostly work with data frames. A frequent task is to normalize counts for several parameters from several data frames. I have a demo dataset:
dataset:

Season  Product  Quality  Sales
Winter  Apple    bad        345
Winter  Apple    good        13
Winter  Potato   bad         23
Winter  Potato   good        66
Winter  Beer     bad        345
Winter  Beer     good        34
Summer  Apple    bad         88
Summer  Apple    good        90
Summer  Potato   bad        123
Summer  Potato   good       457
Summer  Beer     bad         44
Summer  Beer     good       546
What I want to do is add a column "FC" (fold change) for "Sales". FC must be calculated for each "Season" and "Product" according to "Quality", with "bad" as the baseline.
Desired result:
Season  Product  Quality  Sales     FC
Winter  Apple    bad        345   1.00
Winter  Apple    good        13   0.04
Winter  Potato   bad         23   1.00
Winter  Potato   good        66   2.87
Winter  Beer     bad        345   1.00
Winter  Beer     good        34   0.10
Summer  Apple    bad         88   1.00
Summer  Apple    good        90   1.02
Summer  Potato   bad        123   1.00
Summer  Potato   good       457   3.72
Summer  Beer     bad         44   1.00
Summer  Beer     good       546  12.41
One way to do it is to filter first by "Season" and then by "Product" (e.g. creating a subset data frame subset_winter_apple) and then calculate FC similarly to this:
subset_winter_apple$FC = subset_winter_apple$Sales / subset_winter_apple$Sales[1]
Later on, I can combine all subset data frames again, e.g. using rbind(), to reconstitute the original data frame with the FC column. However, this is highly inefficient. So I thought of splitting the data frame into a list:
split(
  dataset,
  list(dataset$Season, dataset$Product)
)
However, now I struggle with the normalisation (FC calculation), as I do not know how to reference the first cell value of "Sales" within each data frame of the list, so that each value in that column is normalized within its own data frame. I did manage to calculate an FC value for the list using lapply(), but it is an exact copy of the first data frame's values in every listed data frame:
lapply(
  dataset,
  function(DF) {DF$FC = dataset[[1]]$Sales / dataset[[1]]$Sales[1]; DF}
)
Clearly, I do not know how to reference the first cell in a specific column to normalize the entire column for each listed data frame. Can somebody please help me?
Many thanks in advance for your suggestions.
dplyr solution
Using logical indexing within a grouped mutate():
library(dplyr)

dataset %>%
  group_by(Season, Product) %>%
  mutate(FC = Sales / Sales[Quality == "bad"]) %>%
  ungroup()
# A tibble: 12 × 5
Season Product Quality Sales FC
<chr> <chr> <chr> <int> <dbl>
1 Winter Apple bad 345 1
2 Winter Apple good 13 0.0377
3 Winter Potato bad 23 1
4 Winter Potato good 66 2.87
5 Winter Beer bad 345 1
6 Winter Beer good 34 0.0986
7 Summer Apple bad 88 1
8 Summer Apple good 90 1.02
9 Summer Potato bad 123 1
10 Summer Potato good 457 3.72
11 Summer Beer bad 44 1
12 Summer Beer good 546 12.4
Base R solution
Using by():
dataset <- by(
  dataset,
  list(dataset$Season, dataset$Product),
  \(x) transform(x, FC = Sales / Sales[Quality == "bad"])
)
dataset <- do.call(rbind, dataset)
dataset[order(as.numeric(rownames(dataset))), ]
Season Product Quality Sales FC
1 Winter Apple bad 345 1.00000000
2 Winter Apple good 13 0.03768116
3 Winter Potato bad 23 1.00000000
4 Winter Potato good 66 2.86956522
5 Winter Beer bad 345 1.00000000
6 Winter Beer good 34 0.09855072
7 Summer Apple bad 88 1.00000000
8 Summer Apple good 90 1.02272727
9 Summer Potato bad 123 1.00000000
10 Summer Potato good 457 3.71544715
11 Summer Beer bad 44 1.00000000
12 Summer Beer good 546 12.40909091
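For completeness, the asker's split()/lapply() approach can also be made to work by iterating over the list that split() returns and referencing each group's own columns — a sketch, assuming every group contains exactly one "bad" baseline row:

# one data frame per Season/Product combination
dataset_list <- split(dataset, list(dataset$Season, dataset$Product))

# normalize each group's Sales by its own "bad" baseline
dataset_list <- lapply(dataset_list, function(DF) {
  DF$FC <- DF$Sales / DF$Sales[DF$Quality == "bad"]
  DF
})

# recombine into a single data frame
result <- do.call(rbind, dataset_list)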

Comparing Dates Across Multiple Variables

I'm attempting to figure out the number of days between games and whether that has an impact on wins/losses. This is the information I'm starting with:
schedule:
Home  Away  Home_Final  Away_Final  Date
DAL   OAK           30          35  9/1/2015
KC    PHI           21          28  9/2/2015
This is the result I'd like to get:
Home  Away  Home_Final  Away_Final  Date      Home_Rest  Away_Rest  Adv   Adv_Days  Adv_Won
DAL   OAK           30          35  9/1/2015  null       null       null  null      null
KC    PHI           21          28  9/2/2015  null       null       null  null      null
DAL   PHI           28           7  9/9/2015  8          7          1     1         1
OAK   KC            14          21  9/9/2015  8          7          1     1         0
'Home_Rest' = the home team's number of days since its previous game
'Away_Rest' = the away team's number of days since its previous game
'Adv' = true/false indicating that one side had a rest advantage
'Adv_Days' = the size of the advantage in days
'Adv_Won' = whether the side with the advantage won
Here is what I've tried. I was able to count the days between games for one team, but I can't wrap my head around how to extend it to all the other teams.
library(tidyverse)
library(lubridate)

team_post <- schedule %>% filter(home == 'PHI' | visitor == 'PHI')
day_dif <- interval(lag(ymd(team_post$date)), ymd(team_post$date))
team_post <- team_post %>% mutate(days_off = time_length(day_dif, "days"))
You can extend this to all teams using a grouped mutate; see the documentation for group_by().
Something like
schedule %>%
  group_by(vars_to_group_by) %>%
  mutate(new_var = expr_to_calculate_new_var)
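Concretely, the per-team rest days could be computed like this (a sketch assuming a long table schedule_long with one row per team per game, and hypothetical columns team and date):

library(dplyr)
library(lubridate)

schedule_long %>%
  group_by(team) %>%
  arrange(mdy(date), .by_group = TRUE) %>%
  mutate(days_off = as.integer(mdy(date) - lag(mdy(date)))) %>%
  ungroup()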
In future, it would be helpful if you included code to recreate a minimal dataset for your example.
The problem is that before you can calculate differences between dates, you must put your dataframe into a friendlier format: the Date applies to both teams, i.e. one value belongs to two columns of the dataframe, which makes a uniform treatment difficult.
We'll add an id (row number) to the schedule dataframe, as a primary key, so it becomes easy to identify the rows later on.
schedule <- tibble::tribble(
  ~Home, ~Away, ~Home_Final, ~Away_Final, ~Date,
  "DAL", "OAK", 30, 35, "9/1/2015",
  "KC",  "PHI", 21, 28, "9/2/2015",
  "DAL", "PHI", 28,  7, "9/9/2015",
  "OAK", "KC",  14, 21, "9/9/2015"
)
schedule <- schedule %>% mutate(id = row_number())
> schedule
# A tibble: 4 x 6
Home Away Home_Final Away_Final Date id
<chr> <chr> <dbl> <dbl> <chr> <int>
1 DAL OAK 30 35 9/1/2015 1
2 KC PHI 21 28 9/2/2015 2
3 DAL PHI 28 7 9/9/2015 3
4 OAK KC 14 21 9/9/2015 4
Now we'll place your dataframe in a more 'relational' format.
schedule_relational <-
  rbind(
    schedule %>%
      transmute(
        id,
        Team = Home,
        Role = "Home",
        Final = Home_Final,
        Date
      ),
    schedule %>%
      transmute(
        id,
        Team = Away,
        Role = "Away",
        Final = Away_Final,
        Date
      )
  )
> schedule_relational
# A tibble: 8 x 5
id Team Role Final Date
<int> <chr> <chr> <dbl> <chr>
1 1 DAL Home 30 9/1/2015
2 2 KC Home 21 9/2/2015
3 3 DAL Home 28 9/9/2015
4 4 OAK Home 14 9/9/2015
5 1 OAK Away 35 9/1/2015
6 2 PHI Away 28 9/2/2015
7 3 PHI Away 7 9/9/2015
8 4 KC Away 21 9/9/2015
How about that!
Now it becomes easy to calculate the difference between dates of games for each team:
schedule_relational <-
  schedule_relational %>%
  group_by(Team) %>%
  arrange(Date) %>%
  mutate(Rest = mdy(Date) - mdy(lag(Date))) %>%
  ungroup()
> schedule_relational
# A tibble: 8 x 6
id Team Role Final Date Rest
<int> <chr> <chr> <dbl> <chr> <drtn>
1 1 DAL Home 30 9/1/2015 NA days
2 1 OAK Away 35 9/1/2015 NA days
3 2 KC Home 21 9/2/2015 NA days
4 2 PHI Away 28 9/2/2015 NA days
5 3 DAL Home 28 9/9/2015 8 days
6 4 OAK Home 14 9/9/2015 8 days
7 3 PHI Away 7 9/9/2015 7 days
8 4 KC Away 21 9/9/2015 7 days
Observe that the appropriate function for parsing these character dates is mdy(), because your dates are in month/day/year format.
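For example:
> mdy("9/9/2015")
[1] "2015-09-09"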
We're very close to a solution! Now all we have to do is to pivot your data back to the wider format. We'll join back the data on the home team and away team by using the id as our unique key.
result <-
  schedule_relational %>%
  pivot_wider(
    names_from = Role,
    values_from = c(Team, Final, Rest),
    names_glue = "{Role}_{.value}"
  )
> result
# A tibble: 4 x 8
id Date Home_Team Away_Team Home_Final Away_Final Home_Rest Away_Rest
<int> <chr> <chr> <chr> <dbl> <dbl> <drtn> <drtn>
1 1 9/1/2015 DAL OAK 30 35 NA days NA days
2 2 9/2/2015 KC PHI 21 28 NA days NA days
3 3 9/9/2015 DAL PHI 28 7 8 days 7 days
4 4 9/9/2015 OAK KC 14 21 8 days 7 days
We'll adjust column names and ordering, and make the final calculations now.
result_final <-
  result %>%
  transmute(
    Home = Home_Team,
    Away = Away_Team,
    Home_Final,
    Away_Final,
    Date,
    Home_Rest,
    Away_Rest,
    Adv = as.integer(Home_Rest != Away_Rest),
    Adv_Days = abs(as.integer(Home_Rest - Away_Rest)),
    Adv_Won = as.integer(Home_Rest > Away_Rest & Home_Final > Away_Final |
                         Away_Rest > Home_Rest & Away_Final > Home_Final)
  )
> result_final
# A tibble: 4 x 10
Home Away Home_Final Away_Final Date Home_Rest Away_Rest Adv Adv_Days Adv_Won
<chr> <chr> <dbl> <dbl> <chr> <drtn> <drtn> <int> <int> <int>
1 DAL OAK 30 35 9/1/2015 NA days NA days NA NA NA
2 KC PHI 21 28 9/2/2015 NA days NA days NA NA NA
3 DAL PHI 28 7 9/9/2015 8 days 7 days 1 1 1
4 OAK KC 14 21 9/9/2015 8 days 7 days 1 1 0
It would be interesting if, instead of reducing Adv and Adv_Won to discrete yes/no values, you tracked the number of rest days and the difference in score; that way you could also correlate the results in terms of magnitude.
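For example, a continuous variant could keep the signed rest difference and the score margin (a sketch building on the result tibble from above):

result_cont <- result %>%
  mutate(
    Rest_Diff = as.integer(Home_Rest - Away_Rest),  # positive: home team had more rest
    Margin = Home_Final - Away_Final                # positive: home team won by this margin
  )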
I've made the code step by step, so you can see intermediate results and understand it better. You may later coalesce some of the statements if you want.
There may be shorter but more convoluted solutions; this one is very clear to read and understand.

How to assign one dataframe column's value to be the same as another column's value in r?

I am trying to run the line of code below to copy the city.output column into pm.city wherever city.output is not NA (in my sample dataframe nothing is NA, though), because city.output contains the correct city spellings.
resultdf <- dplyr::mutate(df, pm.city = ifelse(is.na(city.output) == FALSE, city.output, pm.city))
df:
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <chr> <fct>
1 1 1809 MAIN ST OH 63312 NORWOOD NORWOOD
2 2 123 ELM DR CA NA BRYAN BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 BATEN ROUGE BATON ROUGE
4 4 4444 OAK AVE OH 87481 CINCINATTI CINCINNATI
5 5 3333 HELPME DR MT 87482 HELENA HELENA
6 6 2342 SOMEWHERE RD LA 45103 BATON ROUGE BATON ROUGE
resultdf (pm.city should be the same as city.output but it's an integer)
pm.uid pm.address pm.state pm.zip pm.city city.output
<int> <chr> <chr> <chr> <int> <fct>
1 1 1809 MAIN ST OH 63312 7 NORWOOD
2 2 123 ELM DR CA NA 2 BRYAN
3 3 8970 WOOD ST UNIT 4 LA 33333 1 BATON ROUGE
4 4 4444 OAK AVE OH 87481 3 CINCINNATI
5 5 4444 HELPME DR MT 87482 4 HELENA
6 6 2342 SOMEWHERE RD LA 45103 1 BATON ROUGE
An integer is assigned to pm.city instead. It appears the integer is the position of the city name in alphabetical order. Prior to this, I used dplyr's left_join() to attach the city.output column from another dataframe, and even there I never supplied a row number explicitly.
This works on my computer in RStudio but not when I run it on a server. Maybe it has something to do with my version of dplyr, or with the factor data type of city.output? I am pretty new to R.
city.output is a factor, which ifelse() coerces to its underlying integer codes. Instead, convert it to character with as.character():
dplyr::mutate(df, pm.city = ifelse(!is.na(city.output), as.character(city.output), pm.city))
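A tidier alternative, assuming pm.city is stored as character, is dplyr::coalesce(), which keeps the first non-NA value per row:

library(dplyr)

resultdf <- df %>%
  mutate(pm.city = coalesce(as.character(city.output), pm.city))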

How to transpose panel data into correct form in R

So I am struggling to transform my data into panel data form so that I can start analysing it. So far I have imported and merged my Excel files, so my data looks something like this (bear in mind the real data has far more rows and far more variables):
Company Name Date Market Share ...5.x ...6.x ...7.x ...8.x
<chr> <dttm> <chr> <chr> <chr> <chr> <chr>
1 NA NA FY0 FY-1 FY-2 FY-3 FY-4
2 Kimball Elect 2020-06-29 23:00:00 4020 4422 4232 4111 4003
3 Mercadolibre 2019-12-31 00:00:00 8357 2843 2653 2222 2134
4 Lazard Ltd 2019-12-31 00:00:00 47700 45061 45050 43280 42281
As you can see, row 1 exists to specify the lags in time for the market share variable, where FY0 is equal to the date in the date column and then FY-1 is the year before that, FY-2 is two years before etc. In the original excel files, the market share column was multi-index so all the lags were associated with the market share column, however when importing to R only FY0 remained associated with the market share column and all the other columns were auto-filled with '...5.x ...6.x ...7.x ...8.x'.
I essentially want to transform my data to look like this:
Company Name Date Market Share
1 Kimball Elect 2020 4020
2 Kimball Elect 2019 4422
3 Kimball Elect 2018 4232
4 Kimball Elect 2017 4111
5 Kimball Elect 2016 4003
6 Mercadolibre 2019 8357
7 Mercadolibre 2018 2843
8 Mercadolibre 2017 2653
9 Mercadolibre 2016 2222
10 Mercadolibre 2015 2134
11 Lazard Ltd 2019 47700
12 Lazard Ltd 2018 45061
13 Lazard Ltd 2017 45050
14 Lazard Ltd 2016 43280
15 Lazard Ltd 2015 42281
So basically I want to transpose the data in a way that turns the time lags into rows, and then associate each lag (FY0, FY-1, FY-2, ...) with a date/year determined by the date column minus the lag, e.g. FY0 = 2020-06-29, so FY-1 = 2019-06-29.
Thanks in advance for anyone who is able to help as I feel this is quite tricky to do in R!
One solution is the following
Data
> example <- data.frame(Company = "Kimball", date = "2020", FY0 = 4200, FY1 = 4210)
> example
Company date FY0 FY1
1 Kimball 2020 4200 4210
Code
example %>%
  tidyr::pivot_longer(., c("FY0", "FY1")) %>%
  dplyr::group_by(Company) %>%
  dplyr::mutate(Years = as.numeric(date) - (row_number() - 1)) %>%
  dplyr::select(-date, -name)
Output
# A tibble: 2 x 3
# Groups: Company [1]
Company value Years
<chr> <dbl> <dbl>
1 Kimball 4200 2020
2 Kimball 4210 2019
EDIT
To address your concerns:
(1) The first row contains the variable labels FY0, FY-1, ... . Hence you just need to replace the names of the third through last columns with the values of the first row (excluding its first two columns), i.e. colnames(df)[3:ncol(df)] <- unlist(df[1, 3:ncol(df)]).
(2) The row_number() pertains to the grouping! Hence, for each group, i.e. firm, the numbering will start again at 1! No worries there.
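Putting both points together on data shaped like the question's extract (a sketch: the column names and the FY labels in the first row are assumed from the example above):

library(dplyr)
library(tidyr)
library(lubridate)

# promote the FY labels stored in row 1 to column names, then drop that row
colnames(df)[3:ncol(df)] <- unlist(df[1, 3:ncol(df)])
df <- df[-1, ]

panel <- df %>%
  pivot_longer(matches("^FY"), names_to = "Lag", values_to = "Market Share") %>%
  group_by(`Company Name`) %>%
  # FY0 = year of Date, FY-1 = one year earlier, and so on
  mutate(Date = year(Date) - (row_number() - 1)) %>%
  ungroup() %>%
  select(`Company Name`, Date, `Market Share`)
# Market Share is still character after the Excel import; wrap it in as.numeric() if needed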

Assigning coordinates to data of a different year in R?

I have got a data frame of Germany from 2012 with 8187 rows for 8187 postal codes (and about 10 variables listed as columns), but with no coordinates. Additionally, I have got coordinates of a different shapefile with 8203 rows (also including mostly the same postal codes).
I need the correct coordinates of the 8203 cases to be assigned to the 8187 cases of the initial data frame.
The problem: the number of cases that cannot be matched is not simply the 16-row difference (8203 - 8187 = 16); it is more. There are some towns (with postal codes) from 2012 which are not listed in the more recent shapefile, and vice versa.
(I) Perhaps the easiest solution would be to obtain coordinates from 2012 (unprojected: CRS("+init=epsg:4326")). Does anybody know an open-source platform for this purpose? And does it cover exactly 8187 postal codes?
(II) Or: does anybody have experience with assigning coordinates to a data set from a different year? Or should this be avoided altogether because of slightly changing borders and coordinates (especially when the data is to be mapped and visualized in polygons from 2012), and because some towns are missing from either the older or the newer data set?
I would appreciate your expert advice on how to approach (and hopefully solve) this issue!
EDIT - MWE:
# data set from 2012
> df1
# A tibble: 9 x 4
ID PLZ5 Name Var1
<dbl> <dbl> <chr> <dbl>
1 1 1067 Dresden 01067 40
2 2 1069 Dresden 01069 110
3 224 4571 Rötha 0
4 225 4574 Deutzen 120
5 226 4575 Neukieritzsch 144
6 262 4860 Torgau 23
7 263 4862 Mockrehna 57
8 8186 99996 Menteroda 0
9 8187 99998 Körner 26
# coordinates of recent shapefile
> df2
# A tibble: 9 x 5
ID PLZ5 Name Longitude Latitude
<dbl> <dbl> <chr> <dbl> <dbl>
1 1 1067 Dresden-01067 13.71832 51.06018
2 2 1069 Dresden-01069 13.73655 51.03994
3 224 4571 Roetha 12.47311 51.20390
4 225 4575 Neukieritzsch 12.41355 51.15278
5 260 4860 Torgau 12.94737 51.55790
6 261 4861 Bennewitz 13.00145 51.51125
7 262 4862 Mockrehna 12.83097 51.51125
8 8202 99996 Obermehler 10.59146 51.28864
9 8203 99998 Koerner 10.55294 51.21257
Hence,
4 225 4574 Deutzen 120
--> is not listed in df2 and:
6 261 4861 Bennewitz 13.00145 51.51125
--> is not listed in df1.
Any ideas concerning (I) and (II)?
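As a starting point for the matching itself, the overlap and the mismatches between the two data sets can be inspected with dplyr joins on the postal code — a sketch:

library(dplyr)

# attach coordinates where the postal code exists in both data sets
df_matched <- left_join(df1, df2[, c("PLZ5", "Longitude", "Latitude")], by = "PLZ5")

# 2012 postal codes missing from the recent shapefile (e.g. Deutzen)
anti_join(df1, df2, by = "PLZ5")

# recent postal codes missing from the 2012 data (e.g. Bennewitz)
anti_join(df2, df1, by = "PLZ5")

This does not answer where to obtain matching 2012 coordinates, but it quantifies exactly which cases need manual treatment.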
