dplyr, purrr, dynamically generate/calculate new columns in R - r

I have the following problem. I have a data frame/tibble with a lot of columns, each holding a value for a different year, e.g. the number of inhabitants of a city at different points in time. I now want to generate columns that give me the growth rate (see pictures attached). It should be something like using mutate() while looping over the columns. I think this should be a common task, but I can't find any hint on how to do it.
Edit:
A minimal example could look like this:
## Minimal example
library(tidyverse)
## Given data frame
df <- tibble(
City = c("Melbourne", "Sydney", "Adelaide"),
year_2000 = c(100, 100, 205),
year_2001 = c(101, 100, 207),
year_2002 = c(102, 100, 209)
)
## Result
df <- df %>%
mutate(
gr_2000_2001 = year_2001/year_2000*100 - 100,
gr_2001_2002 = year_2002/year_2001*100 - 100
)
I want to find a way to automate/do the mutate command in a smart way, as I have to do it for 150 years.

The easiest way in this example would probably be to make your data tidy and then apply whatever growth-rate formula you are using with dplyr's lag() function on a data frame grouped by City:
## Minimal example
library(tidyverse)
df <- data.frame(City = c("Melbourne", "Sydney"),
year_2000 = c(100, 100),
year_2001 = c(101,100),
year_2002 = c(102, 102))
df %>%
gather(year, value, 2:4) %>%
group_by(City) %>%
mutate(growth = value/dplyr::lag(value,n=1))
The result is this:
# A tibble: 6 x 4
# Groups: City [2]
City year value growth
<fct> <chr> <dbl> <dbl>
1 Melbourne year_2000 100 NA
2 Sydney year_2000 100 NA
3 Melbourne year_2001 101 1.01
4 Sydney year_2001 100 1
5 Melbourne year_2002 102 1.01
6 Sydney year_2002 102 1.02
If you absolutely need the data in the format you provided in the screenshots, you can then apply spread() to reshape it into the original format. This is not generally recommended, however.
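As a rough sketch of the full round trip, assuming a recent tidyr that provides pivot_longer() and pivot_wider() (the successors of gather() and spread()), the percentage growth rates from the question could be computed in long format and then joined back as gr_* columns. The names growth_long, growth_wide, growth_pct and period are made up for this illustration.

```r
library(dplyr)
library(tidyr)

# Data from the question
df <- tibble(
  City = c("Melbourne", "Sydney", "Adelaide"),
  year_2000 = c(100, 100, 205),
  year_2001 = c(101, 100, 207),
  year_2002 = c(102, 100, 209)
)

# Long format: one row per city and year, then growth versus the previous year
growth_long <- df %>%
  pivot_longer(starts_with("year_"), names_to = "year",
               names_prefix = "year_", values_to = "value") %>%
  mutate(year = as.integer(year)) %>%
  group_by(City) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(growth_pct = value / lag(value) * 100 - 100,
         period = paste0("gr_", lag(year), "_", year)) %>%
  ungroup() %>%
  filter(!is.na(growth_pct))   # the first year of each city has no predecessor

# Back to wide format: one gr_<from>_<to> column per pair of consecutive years
growth_wide <- growth_long %>%
  select(City, period, growth_pct) %>%
  pivot_wider(names_from = period, values_from = growth_pct)

result <- left_join(df, growth_wide, by = "City")
```

Because the column names are built from the data, the same code should work unchanged for 150 year columns, provided they all follow the year_<number> naming pattern.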

Related

I want to use group by and summarize in R, but keep some values from the observation

I have been trying to figure out which party won in each region of my country using the number of votes per location. The problem is that when I group by region (DEPARTAMENTO), I cannot keep the name of the party, only the votes.
When I group by region and party (DEPARTAMENTO, AGRUPACION), I get 68 rows instead of 25 because of the different denominations for political parties.
I hope this is not too confusing. Thanks.
ERM2002ganador <-
ERMfinalt2002 %>%
group_by(DEPARTAMENTO)%>%
summarize(max(VTOTAL,na.rm = FALSE))
I am trying to get something like the following
DEPARTAMENTO VOTES(VTOTAL) AGRUPACION TYPE
LAMBAYEQUE 250000 PERU POSIBLE PP
What I get now is only
DEPARTAMENTO VOTES
Lambayeque 250000
And if I also group by TYPE, I get the following
DEPARTAMENTO VOTES TYPE
LAMBAYEQUE 25000 PP
LAMBAYEQUE 20000 MR
You can use dplyr::slice_max() instead of summarize(). This will keep the row with the highest VTOTAL for each group.
library(dplyr)
ERMfinalt2002 %>%
group_by(DEPARTAMENTO) %>%
slice_max(VTOTAL) %>%
ungroup()
# A tibble: 3 × 4
DEPARTAMENTO AGRUPACION TYPE VTOTAL
<fct> <fct> <fct> <dbl>
1 AREQUIPA PERU POSIBLE PP 227581
2 LAMBAYEQUE PARTIDO APRISTA PERUANO PP 290516
3 LIMA PERU POSIBLE PP 147409
Example data:
set.seed(13)
ERMfinalt2002 <- expand.grid(
DEPARTAMENTO = c("AREQUIPA", "LAMBAYEQUE", "LIMA"),
AGRUPACION = c("PERU POSIBLE", "PARTIDO APRISTA PERUANO"),
TYPE = "PP"
)
ERMfinalt2002$VTOTAL = round(runif(6, 50000, 300000))
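One caveat worth illustrating: by default slice_max() keeps all tied rows, so a region where two parties share the top vote count would return two rows. A minimal sketch with made-up data (the votes tibble and the party labels "A", "B", "C" are hypothetical) shows the with_ties argument:

```r
library(dplyr)

# Hypothetical data: LIMA has a tie for first place
votes <- tibble(
  DEPARTAMENTO = c("LIMA", "LIMA", "LIMA", "AREQUIPA", "AREQUIPA"),
  AGRUPACION   = c("A", "B", "C", "A", "B"),
  VTOTAL       = c(500, 500, 100, 300, 200)
)

# Default behaviour: tied winners are all kept
winners_all <- votes %>%
  group_by(DEPARTAMENTO) %>%
  slice_max(VTOTAL, n = 1) %>%
  ungroup()

# with_ties = FALSE guarantees exactly one row per region
winners_one <- votes %>%
  group_by(DEPARTAMENTO) %>%
  slice_max(VTOTAL, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(winners_all)  # 3 rows: both tied LIMA parties plus AREQUIPA
nrow(winners_one)  # 2 rows: one per region
```

Whether dropping a tied winner is acceptable depends on the analysis; for election results it may be safer to keep the ties and inspect them.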

Can't figure out how to remove column name from dummy variable heading

I wrote this code using library('fastDummies'):
New_Data <- dummy_cols(New_Curve_Data, select_columns = 'CountyName')
I just want the actual county name, e.g. Banks, to be displayed, not CountyName_Banks etc.
There are about 100 dummy variables, so I can't change the names manually.
The prefix in the column names can be removed with sub() by matching 'CountyName_' as the pattern, replacing it with an empty string ("") in the names of New_Data, and assigning the result back:
names(New_Data) <- sub("CountyName_", "", names(New_Data))
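A slightly more defensive variant, shown here on a small made-up data frame (the county names and values are hypothetical), anchors the pattern with ^ so that only a leading prefix is stripped:

```r
# Hypothetical stand-in for the output of dummy_cols()
New_Data <- data.frame(
  CountyName        = c("Banks", "Clarke"),
  CountyName_Banks  = c(1L, 0L),
  CountyName_Clarke = c(0L, 1L)
)

# "^" restricts the match to the start of each column name
names(New_Data) <- sub("^CountyName_", "", names(New_Data))

names(New_Data)  # "CountyName" "Banks" "Clarke"
```

The original CountyName column is left alone because its name does not contain the trailing underscore.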
This can also be done in base R with table():
as.data.frame.matrix(table(seq_len(nrow(New_Curve_Data)),
New_Curve_Data$CountyName))
You can use pivot_wider. Since you did not share example data, this uses the example from the ?fastDummies::dummy_cols help page.
crime <- data.frame(city = c("SF", "SF", "NYC"),
year = c(1990, 2000, 1990),
crime = 1:3)
tidyr::pivot_wider(crime, names_from = city, values_from = city,
values_fn = length, values_fill = 0)
# year crime SF NYC
# <dbl> <int> <int> <int>
#1 1990 1 1 0
#2 2000 2 1 0
#3 1990 3 0 1

How can we split string and extract the text between round brackets

I need to split the string in dataframe to two columns, the first one contains the value before the round brackets and the second column contains the value inside the round brackets.
This is an example:
study_name = c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)" )
results= c("Untested", "tested")
df = data_frame(study_name,results)
This is how I tried to do it:
df <- df %>%
mutate(reference = str_extract_all(study_name, "\\([^()]+\\)")) %>%
rename(~gsub("\\([^()]+\\)", "", study_name))
This is the expected dataframe:
reference = c("(tcga, raw 2018)", "(mskk2 nature, 2000)")
study = c("apple bannan", "frame shift")
expexted_df = data_frame(study, reference)
You can use separate() and set the separator as "\\s(?=\\()".
library(tidyr)
df %>%
separate(study_name, c("study", "reference"), sep = "\\s(?=\\()")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan (tcga, raw 2018) Untested
# 2 frame shift (mskk2 nature, 2000) tested
If you want to extract the text in the parentheses, using extract() is a suitable choice.
df %>%
extract(study_name, c("study", "reference"), regex = "(.+)\\s\\((.+)\\)")
# # A tibble: 2 x 3
# study reference results
# <chr> <chr> <chr>
# 1 apple bannan tcga, raw 2018 Untested
# 2 frame shift mskk2 nature, 2000 tested
We can use str_extract thus:
library(stringr)
df$reference <- str_extract(df$study_name, "\\(.*\\)")
df$study <- str_extract(df$study_name, ".*(?= \\(.*\\))")
Result:
df
study_name results reference study
1 apple bannan (tcga, raw 2018) Untested (tcga, raw 2018) apple bannan
2 frame shift (mskk2 nature, 2000) tested (mskk2 nature, 2000) frame shift
If you no longer want the study_name column, remove it thus:
df$study_name <- NULL
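As a further sketch, stringr::str_match() can capture both pieces in a single pass, since it returns one matrix column per capture group. The regular expression below is an assumption that each string ends with exactly one parenthesised block; parts and df_out are names made up for this illustration.

```r
library(stringr)

study_name <- c("apple bannan (tcga, raw 2018)", "frame shift (mskk2 nature, 2000)")

# Column 1 is the full match, column 2 the text before the brackets,
# column 3 the bracketed reference (brackets included)
parts <- str_match(study_name, "^(.*?)\\s*(\\([^()]*\\))$")

df_out <- data.frame(
  study     = parts[, 2],
  reference = parts[, 3]
)
```

Strings that do not match the pattern yield NA in both columns, which makes malformed entries easy to spot.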

Low to high frequency conversion in panel data in R using tempdisagg

I have daily panel data with four variables: date, cusip (the identifier), PD (probability of default), and price. PD is only available quarterly, on the first day of January, April, July, and October. I want to generate daily PD data using Chow-Lin frequency conversion from the tempdisagg package. I know how to apply the td() function to a time series, but I couldn't find examples with panel data frames. Below are my code and sample data, generated with reproduce() from the devtools package, so only a few sample days are included instead of full quarters. Running td() reports an error:
Error in td(PD ~ price, conversion = "first", method = "chow-lin-fixed", fixed.rho
= 0.5) : In numeric mode, 'to' must be an integer number.
I know that both price and PD are high-frequency daily indicators in mydata, so I guess I need to use the to.quarterly() function on PD or something similar.
library(dplyr)
library(zoo)
library(tempdisagg)
library(tsbox)
mydata <- structure(list(date = structure(c(13516, 13516, 13517, 13517,13518, 13518, 13521, 13605, 13605, 13606), class = "Date"), cusip = c("31677310","66585910", "31677310", "66585910", "31677310", "66585910", "31677310","66585910", "31677310", "66585910"), PD = c(0.076891, 0.096,NA, NA, NA, NA, NA, 0.094341, 0.08867, NA), price = c(40.98, 61.31,40.99, 60.77, 40.18, 59.97, 39.92, 59.96, 38.6, 60.69)), row.names = c(6L,13L, 36L, 43L, 66L, 73L, 96L, 1843L, 1866L, 1873L), class = "data.frame")
mydata <- mydata%>%
group_by(cusip) %>%
arrange(cusip,date) %>%
mutate(PDdaily = td(PD ~ price, conversion = "first",method = "chow-lin-fixed", fixed.rho = 0.5))
Your example is not sufficient. For each disaggregation, we need at least 3 low frequency values to be able to perform a regression.
Here is an alternative example, with 3 pairs of low and high frequency series:
library(tidyverse)
library(tempdisagg)
library(tsbox)
mydata <- ts_c(
low_freq = ts_frequency(fdeaths, "year"),
high_freq = mdeaths
) %>%
ts_tbl() %>%
ts_wide() %>%
crossing(id = 1:3) %>%
arrange(id)
Applying td multiple times on data in a data frame will be cumbersome.
It is easier to extract the data into two lists, one with the low and one with high frequency series:
list_lf <- group_split(ts_na_omit(select(mydata, time, value = low_freq, id)), id, keep = FALSE)
list_hf <- group_split(select(mydata, time, value = high_freq, id), id, keep = FALSE)
Now you can use Map() or map2() to apply the function to each pair of elements:
ans <- map2(list_lf, list_hf, ~ predict(td(.x ~ .y)))
Transforming the disaggregated data back to a data frame:
bind_rows(ans, .id = "id")
#> # A tibble: 216 x 3
#> id time value
#> <chr> <date> <dbl>
#> 1 1 1974-01-01 59.2
#> 2 1 1974-02-01 54.2
#> 3 1 1974-03-01 54.4
#> 4 1 1974-04-01 54.4
#> 5 1 1974-05-01 47.3
#> 6 1 1974-06-01 42.8
#> 7 1 1974-07-01 43.3
#> 8 1 1974-08-01 40.6
#> 9 1 1974-09-01 42.0
#> 10 1 1974-10-01 47.3
#> # … with 206 more rows
Created on 2020-06-03 by the reprex package (v0.3.0)
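The answer mentions Map() as a base R alternative to map2() but only shows the latter. The pairwise pattern can be sketched without tempdisagg by using a simple stand-in function: prorate() below is hypothetical and just spreads a low-frequency total across a high-frequency indicator in proportion to that indicator. It is not a substitute for Chow-Lin.

```r
# Hypothetical stand-in for predict(td(...)): naive pro-rata disaggregation
prorate <- function(lf_total, hf_indicator) {
  lf_total * hf_indicator / sum(hf_indicator)
}

# One low-frequency total and one high-frequency indicator per id
list_lf <- list(100, 60)
list_hf <- list(c(1, 1, 2), c(1, 2, 3))

# Map() walks both lists in parallel, like purrr::map2(list_lf, list_hf, prorate)
ans_base <- Map(prorate, list_lf, list_hf)

ans_base[[1]]  # 25 25 50
ans_base[[2]]  # 10 20 30
```

Note that Map() takes the function first and the lists afterwards, whereas map2() takes the two lists first.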

Complex Data Frame calculations in R

I am currently importing two tables that (in their most basic form) look like this:
Table 1
State Month Account Value
NY Jan Expected Sales 1.04
NY Jan Expected Expenses 1.02
Table 2
State Month Account Value
NY Jan Sales 1,000
NY Jan Customers 500
NY Jan F Expenses 1,000
NY Jan V Expenses 100
My end goal is to create a third data frame that includes the values of the first two rows and calculates a fourth column based on the following formulas:
NextYearExpenses = (t2 F Expenses + t2 V Expenses)* t1 Expected Expenses
NextYearSales = (t2 sales) * t1 Expected Sales
So my desired output is as follows:
State Month New Account Value
NY Jan Sales 1,040
NY Jan Expenses 1,122
I am relatively new to R and I think ifelse statements might be my best bet. I have tried merging the tables and calculating with simple column functions but with no real progress.
Any suggestions?
You may need to do some data wrangling, but nothing out of the ordinary:
require(dplyr)
Table1<-tibble(State=c("NY","NY"), Month=c("Jan","Jan"), Account=c("Expected Sales", "Expected Expenses"), Value=c(1.04,1.02))
Table2<-tibble(State=c("NY","NY","NY","NY"), Month=c("Jan","Jan","Jan","Jan"), Account=c("Sales", "Customers", "F Expenses","V Expenses"), Value=c(1000,500,1000,100))
First, I rename the accounts so that they share a common name (Expenses). This will help when merging with Table1 later on:
Table2$Account[Table2$Account=="F Expenses"]<-"Expenses"
Table2$Account[Table2$Account=="V Expenses"]<-"Expenses"
Then I use group_by() to group by State, Month and Account, and sum the values:
Table2 <- Table2 %>% group_by(State, Month,Account) %>%
summarise(Tot_Value=sum(Value)) %>% ungroup()
head(Table2)
## State Month Account Tot_Value
## <chr> <chr> <chr> <dbl>
## 1 NY Jan Customers 500
## 2 NY Jan Expenses 1100
## 3 NY Jan Sales 1000
Then I do a similar renaming for the accounts in Table1:
Table1$Account[Table1$Account=="Expected Sales"]<-"Sales"
Table1$Account[Table1$Account=="Expected Expenses"]<-"Expenses"
Merge both into a third table, Table3:
Table3<- left_join(Table1,Table2)
Use mutate() to do the needed operation:
Table3 <- Table3 %>% mutate(Value2=Value*Tot_Value)
head(Table3)
## # A tibble: 2 x 6
## State Month Account Value Tot_Value Value2
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 NY Jan Sales 1.04 1000 1040
## 2 NY Jan Expenses 1.02 1100 1122
Here's what I did with dplyr and tidyr.
First, I combined your two initial tables (called df and df2 below) with rbind into a single long-format table. Since each Account value is a unique identifier, these don't need to be separate tables. Next, I group_by State and Month, assuming you'll eventually have a variety of states and months. Then I summarise based on the Account values you specified, creating two new columns. Finally, to get the long format you want, I used gather from tidyr to go from wide to long. You can break these commands into smaller chunks by deleting everything after a given %>% to get a better idea of what each step does.
library(dplyr)
library(tidyr)
rbind(df,df2) %>%
group_by(State,Month) %>%
summarise(Expenses = (Value[which(Account == "F Expenses")] + Value[which(Account == "V Expenses")]) * Value[which(Account == "Expected Expenses")],
Sales = Value[which(Account == "Sales")] * Value[which(Account == "Expected Sales")]) %>%
gather(New_Account,Value, c(Expenses,Sales))
# A tibble: 2 x 4
# Groups: State [1]
# State Month New_Account Value
# <chr> <chr> <chr> <dbl>
#1 NY Jan Expenses 1122
#2 NY Jan Sales 1040
I'd recommend checking out the concept of "tidy data", as there are some real challenges in working with data structured the way yours currently is. For example, creating t3 should only take 2-3 lines of code; everything else below is just there to work around your data architecture:
library(tidyverse)
t1 <- data.frame(State = rep("NY", 2),
Month = rep(as.Date("2018-01-01"), 2),
Account = c("Expected Sales", "Expected Expenses"),
Value = c(1.04, 1.02),
stringsAsFactors = FALSE)
t2 <- data.frame(State = rep("NY", 4),
Month = rep(as.Date("2018-01-01"), 4),
Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
Value = c(1000, 500, 1000, 100),
stringsAsFactors = FALSE)
t3 <- t2 %>%
spread(Account, Value) %>%
inner_join({
t1 %>%
spread(Account, Value)
}, by = c("State" = "State", "Month" = "Month")) %>%
mutate(NewExpenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`,
NewSales = Sales * `Expected Sales`) %>%
select(State, Month, Sales = NewSales, Expenses = NewExpenses) %>%
gather(Sales, Expenses, key = `New Account`, value = Value)
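For comparison, here is a sketch of the same calculation with pivot_wider() and pivot_longer(), the newer tidyr replacements for spread() and gather(). It assumes the two tables share the State and Month keys and that every Account name is unique across both tables, as in the question; Month is kept as plain text here rather than a Date.

```r
library(dplyr)
library(tidyr)

# Tables from the question
t1 <- tibble(State = "NY", Month = "Jan",
             Account = c("Expected Sales", "Expected Expenses"),
             Value = c(1.04, 1.02))
t2 <- tibble(State = "NY", Month = "Jan",
             Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
             Value = c(1000, 500, 1000, 100))

t3 <- bind_rows(t1, t2) %>%
  # One row per State/Month, one column per Account
  pivot_wider(names_from = Account, values_from = Value) %>%
  # Apply the two formulas and drop the input columns
  transmute(State, Month,
            Sales    = Sales * `Expected Sales`,
            Expenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`) %>%
  # Back to one row per account
  pivot_longer(c(Sales, Expenses), names_to = "New Account", values_to = "Value")
```

Stacking the tables before widening avoids the explicit join, at the cost of relying on the Account names not clashing between the two sources.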

Resources