I am trying to create the following formula:
Interest expense / (sum of Total Debt for all years / number of years)
The data look like the following:
GE2017 GE2016 GE2015 GE2014
Interest Expense -2753000 -2026000 -1706000 -1579000
Long Term Debt 108575000 105080000 144659000 186596000
Short/Current Long Term Debt 134591000 136211000 197602000 261424000
Total_Debt 243166000 241291000 342261000 448020000
GOOG2017 GOOG2016 GOOG2015 GOOG2014
Interest Expense -109000 -124000 -104000 -101000
Long Term Debt 3943000 3935000 1995000 2992000
Short/Current Long Term Debt 3969000 3935000 7648000 8015000
Total_Debt 7912000 7870000 9643000 11007000
NVDA2018 NVDA2017 NVDA2016 NVDA2015
Interest Expense -61000 -58000 -47000 -46000
Long Term Debt 1985000 1985000 7000 1384000
Short/Current Long Term Debt 2000000 2791000 1434000 1398000
Total_Debt 3985000 4776000 1441000 2782000
That is, for GE, I am trying to take the interest expense for the latest year (-2753000) and divide it by the average of Total_Debt over all 4 years for GE.
So:
-2753000 / AVERAGE(243166000, 241291000, 342261000, 448020000) ≈ -0.0086
However, I am running into problems with group_by() when taking the average, since GE and the other firms have different column names due to the different years.
cost_of_debt %>%
t() %>%
data.frame() %>%
rownames_to_column('rn') %>%
group_by(rn)
# Calculation here
Secondly, if possible, I would like to do the same calculation as above but using only the last two years of each firm.
-2753000 / AVERAGE(243166000, 241291000) ≈ -0.0114
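In base R, the two target numbers I expect are simply:
-2753000 / mean(c(243166000, 241291000, 342261000, 448020000))  # roughly -0.0086
-2753000 / mean(c(243166000, 241291000))                        # roughly -0.0114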
Would perhaps a grepl function work here?
I have a vector called symbols.
symbols <- c("NVDA", "GOOG", "GE")
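For instance, I can already see that grepl() picks out one firm's columns by name (a minimal sketch using the data below):
grepl("^GE", names(cost_of_debt))                  # TRUE for the four GE columns only
cost_of_debt[, grepl("^GE", names(cost_of_debt))]  # just the GE columns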
Data:
cost_of_debt <- structure(list(GE2017 = c(-2753000, 108575000, 134591000, 243166000
), GE2016 = c(-2026000, 105080000, 136211000, 241291000), GE2015 = c(-1706000,
144659000, 197602000, 342261000), GE2014 = c(-1579000, 186596000,
261424000, 448020000), GOOG2017 = c(-109000, 3943000, 3969000,
7912000), GOOG2016 = c(-124000, 3935000, 3935000, 7870000), GOOG2015 = c(-104000,
1995000, 7648000, 9643000), GOOG2014 = c(-101000, 2992000, 8015000,
11007000), NVDA2018 = c(-61000, 1985000, 2e+06, 3985000), NVDA2017 = c(-58000,
1985000, 2791000, 4776000), NVDA2016 = c(-47000, 7000, 1434000,
1441000), NVDA2015 = c(-46000, 1384000, 1398000, 2782000)), .Names = c("GE2017",
"GE2016", "GE2015", "GE2014", "GOOG2017", "GOOG2016", "GOOG2015",
"GOOG2014", "NVDA2018", "NVDA2017", "NVDA2016", "NVDA2015"), row.names = c("Interest Expense",
"Long Term Debt", "Short/Current Long Term Debt", "Total_Debt"
), class = "data.frame")
For the first case, after converting the row names to a column (rownames_to_column from tibble), separate that column into 'firm' and 'year' by splitting at the junction between the end of the firm name and the start of the year. Then, grouped by 'firm', create a 'New' column by dividing 'Interest.Expense' by the mean of 'Total_Debt'. For the second case, arrange by 'year', take the mean of the last two 'Total_Debt' values for each 'firm', and divide 'Interest.Expense' by that.
library(dplyr)
library(tidyr)
library(tibble)
cost_of_debt %>%
  t() %>%                                   # transpose: firm-years become rows
  data.frame() %>%
  rownames_to_column('rn') %>%
  separate(rn, into = c("firm", "year"),
           "(?<=[A-Z])(?=[0-9])", convert = TRUE) %>%  # split "GE2017" -> "GE", 2017
  group_by(firm) %>%
  mutate(New = Interest.Expense/mean(Total_Debt)) %>%  # vs. all-years average
  arrange(firm, year) %>%
  mutate(NewLast = Interest.Expense/mean(tail(Total_Debt, 2)))  # vs. last two years
I think you need to clean your data first so that it is easier to understand what is an observation and what is a variable. Google tidy data :) Here is my solution. First I make the data tidy, then the calculations are straightforward.
library(tidyverse)
library(stringr)
# Clean and make the data tidy
cost_of_debt <- cost_of_debt %>%
  rownames_to_column(var = "indicator") %>%  # grab the row names before as_tibble() drops them
  as_tibble() %>%
  mutate(indicator = str_replace_all(indicator, regex("\\s|\\/"), "_")) %>%
  gather(k, value, -indicator) %>%
  separate(k, into = c("company", "year"), -4) %>%
  spread(indicator, value) %>%
  rename_all(tolower)
Results in the data looking like this:
company year interest_expense long_term_debt short_current_long_term_debt total_debt
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 GE 2014 -1579000 186596000 261424000 448020000
2 GE 2015 -1706000 144659000 197602000 342261000
3 GE 2016 -2026000 105080000 136211000 241291000
4 GE 2017 -2753000 108575000 134591000 243166000
5 GOOG 2014 -101000 2992000 8015000 11007000
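One detail worth noting: sep = -4 in separate() splits four characters from the end of the string, which works here because the year part is always four digits. A minimal illustration with a made-up tibble:
library(tidyr)
separate(tibble::tibble(k = c("GE2017", "NVDA2018")), k,
         into = c("company", "year"), sep = -4)
# company  year
# GE       2017
# NVDA     2018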
Then we can answer your question:
cost_of_debt <- cost_of_debt %>%
  group_by(company) %>%
  mutate(int_over_totdebt4 = interest_expense / mean(total_debt),
         int_over_totdebt2 = interest_expense /
           mean(total_debt[year %in% tail(sort(year), 2)]))  # last two years per firm (NVDA's are 2018/2017)
Which gives a data frame (with your new variables furthest to the right):
company year interest_expense long_term_debt short_current_long_term_debt total_debt int_over_totdebt4 int_over_totdebt2
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 GE 2014 -1579000 186596000 261424000 448020000 -0.00495 -0.00652
2 GE 2015 -1706000 144659000 197602000 342261000 -0.00535 -0.00704
3 GE 2016 -2026000 105080000 136211000 241291000 -0.00636 -0.00836
4 GE 2017 -2753000 108575000 134591000 243166000 -0.00864 -0.0114
5 GOOG 2014 -101000 2992000 8015000 11007000 -0.0111 -0.0128
And if you want the answers to your questions in summarized form:
# First question:
cost_of_debt %>% filter(company == "GE", year == "2017") %>% select(company, year, int_over_totdebt4)
# Second question:
cost_of_debt %>% filter(year == max(year)) %>% select(company, year, int_over_totdebt2)  # latest year per firm (data still grouped by company)
Related
I'm calculating the percent change in enrollment from academic year to academic year, but some academic years are missing data. In those cases I don't want to calculate the change (it would be a two-year difference); I'd rather keep it blank. I'm doing this by multiple years, schools, and groups. Example data frame and my current code are below. In this example, 2016-17 is missing, so I don't want to calculate the change for 2017-18.
School Academic Year Group Enrollment pct_change
1 School 1 2018-19 Overall 450 ANSWER
2 School 1 2017-18 Overall 630 NA
3 School 1 2015-16 Overall 635 ANSWER
4 School 1 2014-15 Overall 750 ANSWER
5 School 1 2013-14 Overall 704 ANSWER
data <- data %>%
group_by(School, Group) %>%
mutate(pct_change = (((Enrollment-lead(Enrollment, order_by = `Academic Year`))/Enrollment)) * 100) %>%
ungroup()
An option may be to expand the data to a complete sequence of years
library(dplyr)
library(tidyr)
data %>%
  separate(`Academic Year`, into = c("Year", "Year2"),
           remove = FALSE, convert = TRUE) %>%  # "2018-19" -> Year = 2018, Year2 = 19
  group_by(School, Group) %>%
  complete(Year = full_seq(Year, period = 1)) %>%  # add rows for the missing years
  mutate(pct_change = (((Enrollment - lead(Enrollment,
                         order_by = Year))/Enrollment)) * 100) %>%
  ungroup() %>%
  filter(complete.cases(Enrollment)) %>%  # drop the filler rows again
  select(-Year, -Year2)
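For reference, full_seq() from tidyr is what fills the gap: it expands a numeric vector into the complete regular sequence, so complete() can add rows for the missing years. A minimal illustration:
library(tidyr)
full_seq(c(2014, 2015, 2018), period = 1)
# [1] 2014 2015 2016 2017 2018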
I have a data file that looks like this. The question: at what date does the total incidence (cumulative sum) for Venus and Mar exceed 2000?
I've created a simple example using an index instead of a Date column:
df <- data.frame(country = c(rep("Mar",10), rep("Venus",10)),
incidence = runif(20,0,30),
index=seq(1,20,1))
library(dplyr)
df %>%
group_by(country) %>%
mutate(cumInc = cumsum(incidence)) %>%
filter(cumInc > 100) %>%
filter(index==min(index))
country incidence index cumInc
<fct> <dbl> <dbl> <dbl>
1 Mar 29.2 10 108.
2 Venus 22.5 16 110.
You can just change 100 to your threshold and change index to date to get the first Date for Venus and for Mar when the cumulative sum exceeds the given threshold. So e.g.:
df %>%
group_by(country) %>%
mutate(cumInc = cumsum(incidence)) %>%
filter(cumInc > **Your Threshold**) %>%
filter(date==min(date))
If you want to obtain a data.frame afterwards, you can simply add %>% as.data.frame().
If you want to save your results, just use something like:
result <- df %>%
group_by(...
I am trying to create an R dataset including some numerical variables.
While doing this, I made a typing mistake whose result looks weird to me, and I would like to understand why (for sure I am missing something here).
I have tried to look around for a possible explanation but haven't found what I am looking for.
library("dplyr")
library("tidyr")
data <-
  data.frame(FS = 1,
             FS_name = "Armenia",
             Year = 2015,
             class = "class190",
             area_1000ha = 66.447) %>%
  mutate(FS_name = as.character(FS_name)) %>%
  mutate(Year = as.integer(Year)) %>%
  mutate(class = as.character(class)) %>%
  tbl_df()
data
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = TRUE)) %>%
ungroup()
As you can see, the mistake is
rm.na=
rather than
na.rm=
When I type it correctly, I get the right result for the area_1000ha variable (10.5).
If I don't, i.e. keeping rm.na=, I get 11.5 instead (+1, in fact).
What am I missing?
I think rm.na = TRUE is simply added to the sum: sum() has no rm.na argument, so the value is captured by ... and, since TRUE is coerced to 1, it adds 1 to your initial sum.
If you change TRUE to 2, for example:
x <- data %>%
group_by(FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, rm.na = 2)) %>%
ungroup()
The result is
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 12.5
There is no rm.na argument to sum() in R, so R treats it as just another value to be summed, and TRUE counts as 1.
Keep it as na.rm = TRUE and you will get the right result.
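A minimal base-R demonstration of what happens:
sum(10, rm.na = TRUE)  # 11 -- rm.na is not an argument of sum(), so TRUE (= 1) is summed in
sum(10, na.rm = TRUE)  # 10 -- na.rm is matched as the real argument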
Even if you change the name of the argument:
x <- data %>%
group_by(FS, FS_name, Year, class) %>%
dplyr::summarise(area_1000ha = sum(area_1000ha, tester = TRUE)) %>%
ungroup()
Here I have replaced rm.na with a tester argument.
# A tibble: 1 x 4
FS_name Year class area_1000ha
<chr> <int> <chr> <dbl>
1 Rome 2018 class190 11.5
I'm a relative newbie to dplyr. I have a data.frame organized with each store name and source (made up of the results for 2018) making up the observations. The variables are total revenue, quantity, customer experience score, and a few others.
I'd like to rank each category in the data.frame and create new observations. All variables would be ranked in descending order, except customer experience and one additional column, which would be ranked in ascending order. The source for these new observations would be called "ranks".
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
table <- cbind(store, source, revenue, quantity, satisfaction)
I was able to get what I needed using mutate, but I had to manually name each new column. I'm sure there is a more efficient way to rank these values out there!
Here is what I originally did:
table <- table %>%
mutate(revenue_rank = rank(-revenue), quantity_rank = rank(-quantity), satisfaction_rank = rank(satisfaction))
In general, if you're having to do something repeatedly in a data frame, such as calculating ranks, you probably want to reshape to long data. Also note that what you got from cbind is a matrix, not a data frame--probably not what you want, since it means the numeric variables actually come through as characters. Instead of cbind, use data.frame or data_frame (for a tibble).
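A quick demonstration of that pitfall, with made-up values:
m <- cbind(store = "NYC", revenue = 10000)
class(m)            # "matrix" (plus "array" in recent R) -- not a data frame
typeof(m)           # "character" -- revenue was coerced to a string
m[, "revenue"] + 1  # error: non-numeric argument to binary operator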
What I did here is gathered into a long data frame, grouped by the measures (revenue, quantity, or satisfaction), then gave ranks based on the value, keeping in mind that you wanted different orders for satisfaction and the other measures.
library(tidyverse)
store <- c("NYC", "Chicago", "Boston")
source <- c("2018", "2018", "2018")
revenue <- c(10000, 50000, 2000)
quantity <- c(100, 50, 20)
satisfaction <- c(3, 2, 5)
df <- data_frame(store, source, revenue, quantity, satisfaction)
df %>%
gather(key = measure, value = value, revenue:satisfaction) %>%
group_by(measure) %>%
mutate(rank = ifelse(measure == "satisfaction", rank(value), rank(-value))) %>%
ungroup() %>%
select(-value) %>%
mutate(measure = paste(measure, "rank", sep = "_")) %>%
spread(key = measure, value = rank)
#> # A tibble: 3 x 5
#> store source quantity_rank revenue_rank satisfaction_rank
#> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 Boston 2018 3 3 3
#> 2 Chicago 2018 2 1 1
#> 3 NYC 2018 1 2 2
Created on 2018-05-04 by the reprex package (v0.2.0).
I've got a data frame (dfdat) with two categorical variables, location and employmentstatus.
I'd like to generate a data frame with the proportions of employment status for each location.
mydf_wide (achieved outcome) is almost what I'm looking for. The problem is that employmentstatus is a variable with two levels, yet there are three rows in mydf_wide. I don't understand why, because I'd have expected something similar to mytable (expected outcome).
Any help would be much appreciated.
Starting point (df):
dfdat <- data.frame(location=c("GA","GA","MA","OH","RI","GA","AZ","MA","OH","RI"),employmentstatus=c(1,2,1,2,1,1,1,2,1,1))
Expected outcome (table):
mytable <- table(dfdat$employmentstatus,dfdat$location)
mytable <- round(100*(prop.table(mytable, 2)),1)
Achieved outcome (df):
library(dplyr)
mydf <- dfdat %>%
group_by(location,employmentstatus) %>%
summarise (n = n()) %>%
mutate(freq = round((n / sum(n)*100),1))
library(tidyr)
mydf_wide <- spread(mydf, location, freq)
mydf_wide <- as.data.frame(mydf_wide)
We need to do a second group_by with 'location' to get the sum. Also, instead of grouping and then creating the 'n', the count() function can be used:
dfdat %>%
count(location, employmentstatus) %>%
group_by(location) %>%
mutate(n = round(100*n/sum(n), 2)) %>%
spread(location, n, fill = 0)
# A tibble: 2 x 6
# employmentstatus AZ GA MA OH RI
#* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1 100 66.67 50 50 100
#2 2 0 33.33 50 50 0
If we are using the OP's code, then remove the 'n' column before doing the spread:
dfdat %>%
group_by(location,employmentstatus) %>%
summarise (n = n()) %>%
mutate(freq = round((n / sum(n)*100),1)) %>%
select(-n) %>%
spread(location, freq, fill =0)
Or update the 'n' column with the output of round() and then spread, as in the sketch below; keeping the counts in 'n' while reshaping (together with fill = 0) makes sure the missing location/status combinations still show up in the result.
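A sketch of that last variant, reusing the OP's pipeline but overwriting 'n' in place:
dfdat %>%
  group_by(location, employmentstatus) %>%
  summarise(n = n()) %>%
  mutate(n = round(100 * n / sum(n), 1)) %>%  # summarise() left the data grouped by location
  spread(location, n, fill = 0)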