I have the following dataset:
df <- structure(list(Data = structure(c(1623888000, 1629158400, 1629158400
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Client = c("Client1",
"Client1", "Client1"), Fund = c("Fund1", "Fund1", "Fund2"), Nature = c("Application",
"Rescue", "Application"), Quantity = c(433.059697, 0, 171.546757
), Value = c(69800, -70305.67, 24875), `NAV Yesterday` = c(162.40991399996,
162.40991399996, 145.044589000056), `NAV in Application Date` = c(161.178702344125,
162.346370458944, 145.004198476337), `Var NAV` = c(0.00763879866215962,
0.00039140721678275, 0.000278547270652531), `Var * Value` = c(533.188146618741,
-27.5181466187465, 6.92886335748171), FinalValue = c(70333.1881466187,
-70333.1881466187, 24881.9288633575), `Rentability WRONG` = c(0.0210345899274819,
0.0210345899274819, 0.0210345899274819)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
What I need to do is:
If Quantity = 0, then remove all rows with the same Fund name as that one, but remove only the rows that have Date <= the Date of the Quantity = 0 Fund
What I did here is:
I grouped the data by Fund
Arranged each group by Data
Created a column zero_point that assigns 1 to the row where Quantity == 0 and NA otherwise
Filled the fields in zero_point that come before the actual "zero point" with the same value.
Filtered those rows out.
library(dplyr)
library(tidyr)

output <- df %>%
  group_by(Fund) %>%
  arrange(Data) %>%
  mutate(zero_point = case_when(Quantity == 0 ~ 1)) %>%
  fill(zero_point, .direction = "up") %>%
  filter(is.na(zero_point))
(On the condition that there is only one instance where Quantity is 0 per Fund group)
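If a Fund could ever contain more than one zero-quantity row, here is a sketch (my own variant of the same idea, not from the original post) that keys off the latest zero-quantity date instead:
df %>%
  group_by(Fund) %>%
  # Keep a group untouched when it has no zero; otherwise keep only rows
  # strictly after the latest date at which Quantity == 0
  filter(if (any(Quantity == 0)) Data > max(Data[Quantity == 0]) else TRUE) %>%
  ungroup()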
You can try -
library(dplyr)
df %>%
  filter({
    # Row index where Quantity = 0
    inds = which(Quantity == 0)
    # Drop rows where the Data value is less than or equal to the Data value at
    # Quantity = 0 and the Fund is the same as the Fund at Quantity = 0.
    !(Data <= Data[inds] & Fund %in% Fund[inds])
  })
Here's a thought:
df %>%
  group_by(Fund) %>%
  filter(!any(Quantity == 0) | Data > Data[which.min(Quantity)])
# # A tibble: 1 x 12
# # Groups: Fund [1]
# Data Client Fund Nature Quantity Value `NAV Yesterday` `NAV in Applica~ `Var NAV` `Var * Value` FinalValue `Rentability WR~
# <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2021-08-17 00:00:00 Clien~ Fund2 Appli~ 172. 24875 145. 145. 0.000279 6.93 24882. 0.0210
I'm assuming you meant "Data <= Data of the Quantity = 0 Fund", therefore using Data instead of Date (not found) or NAV in Application Date.
With this sample data, that removes both Fund1 rows (each falls on or before the zero-quantity date) and keeps only the Fund2 row; I'm hoping the logic is correct.
Testing for equality with floating-point (numeric) can be problematic at times (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). If you have some small near-zero numbers, then this will silently produce counter-intuitive results without warning or error. You might be more defensive and use something like:
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) | Data > Data[which.min(Quantity)])
or even
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) |
           (row_number() != which.min(Quantity) &
              Data > Data[which.min(Quantity)]))
While the latter is a bit paranoid (and double-calculates which.min(.)), it explicitly drops the zero-quantity row itself, so it should not succumb to problems with equality tests.
The only time this will fail is if all(is.na(Quantity)); that is, which.min(c(NA, NA)) returns integer(0), which will cause an error in dplyr::filter. One might choose to add a safeguard with something like filter(any(!is.na(Quantity)) & (...)).
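A minimal sketch of that safeguard (my variant, not from the original answer); note that a vectorized & would still evaluate the zero-length right-hand side, so an if/else inside filter() is the safer construction:
df %>%
  group_by(Fund) %>%
  filter(
    # Groups with no usable Quantity are dropped outright instead of erroring
    if (all(is.na(Quantity))) rep(FALSE, n())
    else all(abs(Quantity) > 0, na.rm = TRUE) | Data > Data[which.min(Quantity)]
  )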
I'd like to assign different values to several columns, based on the value in another column, i.e. do a multiple mutate based on a single condition.
For example, I would have a dataframe like this:
library(tibble)

df <- tibble(cfr = c("IRL000I12572", "ESP000023522", "ESP000023194"),
             vessel_name = c("RACHEL JAY", "ALAKRANTXU", "DONIENE"),
             length = c(NA, NA, 109.30),
             tonnage = c(NA, NA, 3507.00),
             power = c(NA, NA, 7149.05))
I'd like to manually assign a set of values to length, tonnage, and power when cfr == IRL000I12572, another set of values when cfr == ESP000023522, and keep the given values when cfr == ESP000023194.
Right now, I'm doing it with either an ifelse or a case_when statement inside my mutate, but I end up with three statements per cfr (and I have many)...
For example:
df <- df %>%
  mutate(length = ifelse(cfr == "IRL000I12572", 22.5, length),
         tonnage = ifelse(cfr == "IRL000I12572", 153.00, tonnage),
         power = ifelse(cfr == "IRL000I12572", 370, power))
Is there a way to 'condense' the statement and have only one per cfr value, to assign the three different length, tonnage, and power values in one row?
Thanks!
You can use rows_update() from dplyr. Note that this is marked as an experimental function, so use at your own risk!
library(dplyr)
df <- tibble(cfr = c("IRL000I12572", "ESP000023522", "ESP000023194"),
             vessel_name = c("RACHEL JAY", "ALAKRANTXU", "DONIENE"),
             length = c(NA, NA, 109.30),
             tonnage = c(NA, NA, 3507.00),
             power = c(NA, NA, 7149.05))
df_update <- tibble(cfr = "IRL000I12572",
                    length = 22.5,
                    tonnage = 153.00,
                    power = 370)
df %>%
  rows_update(df_update, by = "cfr")
# A tibble: 3 x 5
cfr vessel_name length tonnage power
<chr> <chr> <dbl> <dbl> <dbl>
1 IRL000I12572 RACHEL JAY 22.5 153 370
2 ESP000023522 ALAKRANTXU NA NA NA
3 ESP000023194 DONIENE 109. 3507 7149.
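Because rows_update() matches on the key column, the update table can carry one row per cfr, which answers the "one statement per cfr value" part. The ESP000023522 values below are placeholders, since the question doesn't specify them:
df_update <- tibble(cfr = c("IRL000I12572", "ESP000023522"),
                    length = c(22.5, 30.0),        # 30.0, 250.00, 500 are made-up example values
                    tonnage = c(153.00, 250.00),
                    power = c(370, 500))

df %>%
  rows_update(df_update, by = "cfr")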
You can also make use of across to pull from a reference list (or vector). But this would require a different reference table or some other code feature per lookup ID.
x <- list(length = 22.5,
          tonnage = 153.00,
          power = 370)
df %>%
  mutate(across(names(x), ~ ifelse(cfr == "IRL000I12572", x[[cur_column()]], .)))
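With current dplyr/tidyselect, wrapping the names in all_of() avoids the warning about using an external vector in selections; a small tweak of the same idea:
df %>%
  mutate(across(all_of(names(x)), ~ ifelse(cfr == "IRL000I12572", x[[cur_column()]], .)))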
In base R you could do:
df[df$cfr == "IRL000I12572", -c(1:2)] <- list(22.5, 153.00, 370)
So that
df
#> # A tibble: 3 x 5
#> cfr vessel_name length tonnage power
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 IRL000I12572 RACHEL JAY 22.5 153 370
#> 2 ESP000023522 ALAKRANTXU NA NA NA
#> 3 ESP000023194 DONIENE 109. 3507 7149.
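A name-based variant of the same base R assignment avoids relying on column positions (my tweak, same values):
df[df$cfr == "IRL000I12572", c("length", "tonnage", "power")] <- list(22.5, 153.00, 370)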
The background
Question edited heavily for clarity
I have data like this:
df<-structure(list(fname = c("Linda", "Bob"), employee_number = c("00000123456",
"654321"), Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0,
0), CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1), ResearchNurse = c(0,
0)), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"
))
In a previous question I asked on here, I mentioned that I needed to pivot this data from wide to long in order to export it elsewhere. The answers worked great!
Problem is, I discovered that some of the people in my dataset didn't fill out their surveys correctly and have all zeros in certain problematic columns, i.e. when the data gets pivoted and filtered to "1" values, those people get dropped.
Luckily (depending on how you think about it), I can fix their mistakes. If they left those columns blank, I can populate what they should have entered based on their other columns. That is, what they filled out under "CRA", "Regulatory", "Finance" or "ResearchNurse" will determine whether they get 1's or 0's in "Calendar", "Protocol" or "Subject".
To figure out what goes in those columns, we created this matrix of job responsibilities:
jobs<-structure(list(`Roles (existing)` = c("Calendar Build", "Protocol Management",
"Subject Management"), `CRA/ Manager/ Senior` = c(1, 1, 0), Regulatory = c(0,
1, 1), Finance = c(0, 0, 0), `Research Nurse` = c(1, 0, 1)), row.names = c(NA,
-3L), class = c("tbl_df", "tbl", "data.frame"))
So if you're following so far: no matter what Bob put in his columns for "Calendar", "Protocol" or "Subject" (he currently has zeros), they will be overwritten based on what he put in the other columns. So if Bob put a "1" in his "Regulatory" column, based on that matrix he should get a 1 in both the Protocol and Subject columns.
The specific question
So how do I tell R: "Look at Bob's CRA, Regulatory, Finance, and ResearchNurse columns, cross-reference the jobs data frame, and overwrite his Calendar, Protocol, and Subject columns"?
My expected output in this particular case would be:
fname  employee_number  Calendar  Protocol  Subject  CRA  Regulatory  Finance  ResearchNurse
Linda  00000123456             0         1        1    0           1        0              0
Bob    654321                  0         1        1    0           1        1              0
One last little detail: I could see instances where (depending on the order) numbers would overwrite each other, i.e. Bob should get a 1 in Protocol because he's got a 1 in Regulatory, but he's also got a 1 in Finance, which on its own would imply a 0 in Protocol...
When in doubt, if a column is overwritten with a 1, it should never be turned back into a zero. I hope that makes sense.
I'd suggest converting your logic to ifelse statement(s):
df$Calendar <- ifelse(df$CRA == 1 | df$ResearchNurse == 1, 1, df$Calendar)
df$Protocol <- ifelse(df$CRA == 1 | df$Regulatory == 1, 1, df$Protocol)
df$Subject <- ifelse(df$Regulatory == 1 | df$ResearchNurse == 1, 1, df$Subject)
df
#> fname employee_number Calendar Protocol Subject CRA Regulatory Finance
#> 1 Linda 00000123456 0 1 1 0 1 0
#> 2 Bob 654321 0 1 1 0 1 1
#> ResearchNurse
#> 1 0
#> 2 0
data:
df <- structure(list(
fname = c("Linda", "Bob"),
employee_number = c("00000123456", "654321"),
Calendar = c(0, 0), Protocol = c(0, 0), Subject = c(0, 0),
CRA = c(0, 0), Regulatory = c(1, 1), Finance = c(0, 1),
ResearchNurse = c(0, 0)), row.names = c(NA, -2L), class = c("data.frame"))
Created on 2022-03-28 by the reprex package (v2.0.1)
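A more general take that also bakes in the "never turn a 1 back into a zero" rule from the question: cross-reference the jobs matrix with a matrix product and combine with pmax(). This is my own sketch against the dput() data above, not part of the original answer:
role_cols <- c("CRA", "Regulatory", "Finance", "ResearchNurse")
fill_cols <- c("Calendar", "Protocol", "Subject")

m <- as.matrix(jobs[, -1])  # 3 x 4: rows = Calendar/Protocol/Subject, cols = roles in role_cols order
implied <- (as.matrix(df[role_cols]) %*% t(m) > 0) + 0  # 1 wherever any held role implies the duty
df[fill_cols] <- as.data.frame(pmax(as.matrix(df[fill_cols]), implied))  # never downgrade an existing 1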
Both tables need a common lookup value.
So, for example, in your df table there is an employee_number column. Do you have the same field in the jobs table? If so, this is easy to do with left_join() and then a case_when().
You will need to simplify your current jobs table to some summary value of the logic you put in your post (e.g. if Bob has a 1 in Regulatory, then he should get a 1 in the Protocol and Subject columns). This can be done with some table manipulation functions; I can't tell you exactly which ones because I don't fully understand the logic.
Assuming that is clear to you and you know how to summarize that jobs table (and you have a unique employee_number for each row), the below should work.
left_join(x = df, y = jobs, by = "employee_number") %>%
  mutate(new_col1 = case_when(logic_1 ~ value1,
                              logic_2 ~ value2,
                              logic_3 ~ value3,
                              TRUE ~ default_value))
You can repeat the new-column logic for additional columns as required.
library(tidyverse)
First, by pivoting both df and jobs, the task becomes much easier:
(df_long <- df %>%
  pivot_longer(
    cols = -c(fname, employee_number), names_to = "term"
  ) %>%
  filter(value == 1) %>%
  select(-value))
#> # A tibble: 3 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Bob 654321 Regulatory
#> 3 Bob 654321 Finance
Now, if I understand your question correctly, Bob should have added “Protocol”
and “Subject” in his survey because he ticked “Regulatory”. Luckily, we can add
that information for him automatically. We pivot jobs and clean up the
names/terms to match those in df. This can be done like this:
(jobs_long <- jobs %>%
  rename(
    CRA = `CRA/ Manager/ Senior`, ResearchNurse = `Research Nurse`
  ) %>%
  mutate(
    roles = `Roles (existing)` %>% str_extract("^\\w+"),
    .keep = "unused"
  ) %>%
  pivot_longer(-roles, names_to = "term") %>%
  filter(value == 1) %>%
  select(-value))
#> # A tibble: 6 x 2
#> roles term
#> <chr> <chr>
#> 1 Calendar CRA
#> 2 Calendar ResearchNurse
#> 3 Protocol CRA
#> 4 Protocol Regulatory
#> 5 Subject Regulatory
#> 6 Subject ResearchNurse
Once in this shape, we can join the two tables, do some tidying, and then we
end up with the correct information. We could continue from here and wrangle
the data back into the wide shape, but it’s probably more useful like this
so that’s where I would stop.
df_long %>%
  left_join(jobs_long, by = "term") %>%
  pivot_longer(cols = c(term, roles), values_drop_na = TRUE) %>%
  distinct(fname, employee_number, term = value)
#> # A tibble: 7 x 3
#> fname employee_number term
#> <chr> <chr> <chr>
#> 1 Linda 00000123456 Regulatory
#> 2 Linda 00000123456 Protocol
#> 3 Linda 00000123456 Subject
#> 4 Bob 654321 Regulatory
#> 5 Bob 654321 Protocol
#> 6 Bob 654321 Subject
#> 7 Bob 654321 Finance
Created on 2022-03-31 by the reprex package (v1.0.0)
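Should the wide shape be needed after all, a sketch of that last step (my addition, continuing from the pipeline above):
df_long %>%
  left_join(jobs_long, by = "term") %>%
  pivot_longer(cols = c(term, roles), values_drop_na = TRUE) %>%
  distinct(fname, employee_number, term = value) %>%
  mutate(value = 1) %>%                                    # indicator for each held term
  pivot_wider(names_from = term, values_from = value, values_fill = 0)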
I have daily panel data with four variables: date, cusip (id identifier), PD (probability of default), and price. PD is only available on a quarterly basis, for the first day of January, April, July, and October. I want to generate daily data for PD using Chow-Lin frequency conversion from the tempdisagg package. I know how to apply the td() function to a time series, but I didn't find examples with panel data frames. Here are my code and sample data (created with dput(), so only a few sample days are included instead of full quarters). Running td() reports an error:
Error in td(PD ~ price, conversion = "first", method = "chow-lin-fixed", fixed.rho
= 0.5) : In numeric mode, 'to' must be an integer number.
I know that both price and PD are high-frequency daily indicators in mydata, so I guess I need to use the to.quarterly() function on PD or something similar.
library(dplyr)
library(zoo)
library(tempdisagg)
library(tsbox)
mydata <- structure(list(
  date = structure(c(13516, 13516, 13517, 13517, 13518, 13518, 13521, 13605, 13605, 13606), class = "Date"),
  cusip = c("31677310", "66585910", "31677310", "66585910", "31677310", "66585910", "31677310", "66585910", "31677310", "66585910"),
  PD = c(0.076891, 0.096, NA, NA, NA, NA, NA, 0.094341, 0.08867, NA),
  price = c(40.98, 61.31, 40.99, 60.77, 40.18, 59.97, 39.92, 59.96, 38.6, 60.69)),
  row.names = c(6L, 13L, 36L, 43L, 66L, 73L, 96L, 1843L, 1866L, 1873L), class = "data.frame")
mydata <- mydata %>%
  group_by(cusip) %>%
  arrange(cusip, date) %>%
  mutate(PDdaily = td(PD ~ price, conversion = "first",
                      method = "chow-lin-fixed", fixed.rho = 0.5))
Your example is not sufficient: for each disaggregation, we need at least 3 low-frequency values to be able to perform a regression.
Here is an alternative example, with 3 pairs of low- and high-frequency series:
library(tidyverse)
library(tempdisagg)
library(tsbox)
mydata <- ts_c(
  low_freq = ts_frequency(fdeaths, "year"),
  high_freq = mdeaths
) %>%
  ts_tbl() %>%
  ts_wide() %>%
  crossing(id = 1:3) %>%
  arrange(id)
Applying td multiple times on data in a data frame will be cumbersome.
It is easier to extract the data into two lists, one with the low- and one with the high-frequency series:
list_lf <- group_split(ts_na_omit(select(mydata, time, value = low_freq, id)), id, keep = FALSE)
list_hf <- group_split(select(mydata, time, value = high_freq, id), id, keep = FALSE)
Now you can use Map() or map2() to apply the function to each pair of elements:
ans <- map2(list_lf, list_hf, ~ predict(td(.x ~ .y)))
Transforming the disaggregated data back to a data frame:
bind_rows(ans, .id = "id")
#> # A tibble: 216 x 3
#> id time value
#> <chr> <date> <dbl>
#> 1 1 1974-01-01 59.2
#> 2 1 1974-02-01 54.2
#> 3 1 1974-03-01 54.4
#> 4 1 1974-04-01 54.4
#> 5 1 1974-05-01 47.3
#> 6 1 1974-06-01 42.8
#> 7 1 1974-07-01 43.3
#> 8 1 1974-08-01 40.6
#> 9 1 1974-09-01 42.0
#> 10 1 1974-10-01 47.3
#> # … with 206 more rows
Created on 2020-06-03 by the reprex package (v0.3.0)
I am currently importing two tables (in the most basic form) that appear as follows:
Table 1
State  Month  Account            Value
NY     Jan    Expected Sales     1.04
NY     Jan    Expected Expenses  1.02
Table 2
State  Month  Account     Value
NY     Jan    Sales       1,000
NY     Jan    Customers   500
NY     Jan    F Expenses  1,000
NY     Jan    V Expenses  100
And my end goal is to create a third data frame that includes the values from the first two tables and calculates a fourth column based on these formulas:
NextYearExpenses = (t2 F Expenses + t2 V Expenses) * t1 Expected Expenses
NextYearSales = t2 Sales * t1 Expected Sales
So my desired output is as follows:
State  Month  New Account  Value
NY     Jan    Sales        1,040
NY     Jan    Expenses     1,122
I am relatively new to R and I think ifelse statements might be my best bet. I have tried merging the tables and calculating with simple column functions but with no real progress.
Any suggestions?
You may need to do some data wrangling, but nothing out of the ordinary.
require(dplyr)
Table1 <- tibble(State = c("NY", "NY"),
                 Month = c("Jan", "Jan"),
                 Account = c("Expected Sales", "Expected Expenses"),
                 Value = c(1.04, 1.02))
Table2 <- tibble(State = c("NY", "NY", "NY", "NY"),
                 Month = c("Jan", "Jan", "Jan", "Jan"),
                 Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
                 Value = c(1000, 500, 1000, 100))
The first thing I do is rename the accounts so they share a common name, i.e. "Expenses"; this is going to help me merge onto Table1 later on.
Table2$Account[Table2$Account=="F Expenses"]<-"Expenses"
Table2$Account[Table2$Account=="V Expenses"]<-"Expenses"
Then I use the group_by function to group by State, Month and Account, and do the sum:
Table2 <- Table2 %>%
  group_by(State, Month, Account) %>%
  summarise(Tot_Value = sum(Value)) %>%
  ungroup()
head(Table2)
##   State Month Account   Tot_Value
##   <chr> <chr> <chr>         <dbl>
## 1 NY    Jan   Customers       500
## 2 NY    Jan   Expenses       1100
## 3 NY    Jan   Sales          1000
Then something similar with the renaming for the accounts in Table 1:
Table1$Account[Table1$Account=="Expected Sales"]<-"Sales"
Table1$Account[Table1$Account=="Expected Expenses"]<-"Expenses"
Merge into a third table, Table3:
Table3<- left_join(Table1,Table2)
Use mutate to do the needed operation:
Table3 <- Table3 %>% mutate(Value2 = Value * Tot_Value)
head(Table3)
## # A tibble: 2 x 6
## State Month Account Value Tot_Value Value2
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 NY Jan Sales 1.04 1000 1040
## 2 NY Jan Expenses 1.02 1100 1122
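To land on exactly the four-column layout from the question, one more select() would finish it off (my addition):
Table3 %>%
  select(State, Month, `New Account` = Account, Value = Value2)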
Here's what I did with dplyr and tidyr.
First, I combined your initial tables with rbind into a single long-format table; since you have unique identifiers for each of the Account values, these don't need to be separate tables. Next, I group_by State and Month, assuming you'll eventually have a variety of states/months. Then I summarise based on the Account values you specified, creating two new columns. Finally, to get the long format you want, I used gather from tidyr to go from wide format to long format. You can split these commands into smaller chunks by deleting after each %>% to get a better idea of what every step does.
library(dplyr)
library(tidyr)
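The pipe below assumes df and df2 hold Table 1 and Table 2; a reconstruction from the question's tables:
df <- tibble(State = "NY", Month = "Jan",
             Account = c("Expected Sales", "Expected Expenses"),
             Value = c(1.04, 1.02))
df2 <- tibble(State = "NY", Month = "Jan",
              Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
              Value = c(1000, 500, 1000, 100))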
rbind(df, df2) %>%
  group_by(State, Month) %>%
  summarise(Expenses = (Value[which(Account == "F Expenses")] + Value[which(Account == "V Expenses")]) * Value[which(Account == "Expected Expenses")],
            Sales = Value[which(Account == "Sales")] * Value[which(Account == "Expected Sales")]) %>%
  gather(New_Account, Value, c(Expenses, Sales))
# A tibble: 2 x 4
# Groups: State [1]
# State Month New_Account Value
# <chr> <chr> <chr> <dbl>
#1 NY Jan Expenses 1122
#2 NY Jan Sales 1040
I'd recommend checking out the concept of "tidy data", as there are some real challenges with working on data structured the way yours currently is. E.g. creating t3 should only take 2-3 lines of code; all of the following is just to work around your data architecture:
library(tidyverse)
t1 <- data.frame(State = rep("NY", 2),
                 Month = rep(as.Date("2018-01-01"), 2),
                 Account = c("Expected Sales", "Expected Expenses"),
                 Value = c(1.04, 1.02),
                 stringsAsFactors = FALSE)
t2 <- data.frame(State = rep("NY", 4),
                 Month = rep(as.Date("2018-01-01"), 4),
                 Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
                 Value = c(1000, 500, 1000, 100),
                 stringsAsFactors = FALSE)
t3 <- t2 %>%
  spread(Account, Value) %>%
  inner_join({
    t1 %>%
      spread(Account, Value)
  }, by = c("State" = "State", "Month" = "Month")) %>%
  mutate(NewExpenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`,
         NewSales = Sales * `Expected Sales`) %>%
  select(State, Month, Sales = NewSales, Expenses = NewExpenses) %>%
  gather(Sales, Expenses, key = `New Account`, value = Value)
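For reference, the result should match the desired output, assuming I've traced the wrangling correctly:
t3
#>   State      Month New Account Value
#> 1    NY 2018-01-01       Sales  1040
#> 2    NY 2018-01-01    Expenses  1122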
I am trying to apply a function to convert financial accounts from a number of companies to USD.
The firms relating to each currency can be found below.
USDfirms <- c("GOOG", "AMZN", "AAPL", "CSCO", "FB", "HP", "IBM", "0992.HK",
              "MSFT", "CRM", "TWTR", "WB", "ZTE.CN")
CNYfirms <- c("BABA", "BIDU", "1169.HK", "HMI", "3888.HK", "1357.HK", "NTES",
              "TCEHY", "1810.HK", "0763.HK")
TWDfirms <- c("2357.TW", "2324.TW", "2356.TW", "2498.TW", "3231.TW")
KRWfirms <- c("003550.KS", "005930.KS")
JPYfirms <- c("5563.T", "7752.T")
EURfirms <- "NOK"
So CNYfirms corresponds to Chinese firms whose financial accounts are in CNY. The dput() is a dump of approximately 30 companies' finances and can be found here.
EDIT: link 2 here http://s000.tinyupload.com/download.php?file_id=06545415747486823455&t=0654541574748682345555828
The data frame from the dput() is called BSISCF.
I also have a currency conversion table, which looks like the following:
# A tibble: 1 x 6
date cny_usd twd_usd krw_usd jpy_usd eur_usd
<date> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2018-04-01 0.159 0.0344 0.000943 0.00941 1.23
Data
fx <- structure(list(date = structure(17622, class = "Date"), cny_usd = 0.159228,
                     twd_usd = 0.03442, krw_usd = 0.000943, jpy_usd = 0.009408,
                     eur_usd = 1.232305),
                row.names = c(NA, -1L), class = c("tbl_df", "tbl", "data.frame"),
                .Names = c("date", "cny_usd", "twd_usd", "krw_usd", "jpy_usd", "eur_usd"))
I have the following code, and what I am trying to do is take the symbol column and use ifelse statements: if the symbol column matches a symbol in CNYfirms, then put the cny_usd exchange rate from the fx table in the exchange.rates column, and so on for all the currencies. NA values will correspond to USD firms, since there is no USD exchange rate in the fx table.
BSISCF <- BSISCF %>%
  separate(symbol, into = c("ticker", "country"),
           sep = "[.]", convert = TRUE, remove = FALSE) %>% # The NA values just correspond to US data
  mutate(exchange.rates = ifelse(symbol %in% CNYfirms, fx$cny_usd,   # %in% tests membership;
                          ifelse(symbol %in% TWDfirms, fx$twd_usd,   # == would recycle and misalign
                          ifelse(symbol %in% KRWfirms, fx$krw_usd,
                          ifelse(symbol %in% JPYfirms, fx$jpy_usd,
                          ifelse(symbol %in% EURfirms, fx$eur_usd, 0)))))) %>%
  select(exchange.rates, everything())
Okay, this is the next step I will take once the exchange rates have been collected:
mutate(exchange.rates = ifelse(is.na(country), 1, exchange.rates)) %>%
  mutate_at(.funs = funs(fx = . * exchange.rates),
            .vars = vars(Cash.And.Cash.Equivalents:Change.In.Cash.and.Cash.Equivalents)) %>%
  mutate(adjusted_fx = adjusted * exchange.rates) # All financial statements and the stock's adjusted price converted into USD
If there is something I did not explain well, let me know.
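One alternative worth sketching (my own suggestion, not part of the original post): build a lookup table mapping each symbol to its rate and left_join() it, which avoids the nested ifelse entirely. Everything below reuses the objects defined above:
library(dplyr)

# One row per symbol; USD firms get 1 since their accounts need no conversion
rate_lookup <- bind_rows(
  tibble(symbol = CNYfirms, exchange.rates = fx$cny_usd),
  tibble(symbol = TWDfirms, exchange.rates = fx$twd_usd),
  tibble(symbol = KRWfirms, exchange.rates = fx$krw_usd),
  tibble(symbol = JPYfirms, exchange.rates = fx$jpy_usd),
  tibble(symbol = EURfirms, exchange.rates = fx$eur_usd),
  tibble(symbol = USDfirms, exchange.rates = 1)
)

BSISCF <- BSISCF %>%
  left_join(rate_lookup, by = "symbol") %>%
  select(exchange.rates, everything())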