Complex Data Frame calculations in R - r

I am currently importing two tables that, in their most basic form, look like this:
Table 1
State  Month  Account            Value
NY     Jan    Expected Sales     1.04
NY     Jan    Expected Expenses  1.02
Table 2
State  Month  Account     Value
NY     Jan    Sales       1,000
NY     Jan    Customers   500
NY     Jan    F Expenses  1,000
NY     Jan    V Expenses  100
My end goal is to create a third data frame that keeps the first two columns (State and Month) and calculates the Value column from these formulas:
NextYearExpenses = (t2 F Expenses + t2 V Expenses) * t1 Expected Expenses
NextYearSales = (t2 Sales) * t1 Expected Sales
So my desired output is as follows:
State  Month  New Account  Value
NY     Jan    Sales        1,040
NY     Jan    Expenses     1,122
I am relatively new to R and I think ifelse statements might be my best bet. I have tried merging the tables and calculating with simple column operations, but made no real progress.
Any suggestions?

You may need to do some data wrangling, but nothing out of the ordinary.
library(dplyr)
Table1<-tibble(State=c("NY","NY"), Month=c("Jan","Jan"), Account=c("Expected Sales", "Expected Expenses"), Value=c(1.04,1.02))
Table2<-tibble(State=c("NY","NY","NY","NY"), Month=c("Jan","Jan","Jan","Jan"), Account=c("Sales", "Customers", "F Expenses","V Expenses"), Value=c(1000,500,1000,100))
First, I rename the accounts so that both expense types share a common name, Expenses. This makes it possible to join to Table1 later on.
Table2$Account[Table2$Account=="F Expenses"]<-"Expenses"
Table2$Account[Table2$Account=="V Expenses"]<-"Expenses"
Then I group by State, Month and Account and sum the values:
Table2 <- Table2 %>%
  group_by(State, Month, Account) %>%
  summarise(Tot_Value = sum(Value)) %>%
  ungroup()
head(Table2)
## State Month Account Tot_Value
## <chr> <chr> <chr> <dbl>
## 1 NY Jan Customers 500
## 2 NY Jan Expenses 1100
## 3 NY Jan Sales 1000
Then I do something similar, renaming the accounts in Table1:
Table1$Account[Table1$Account == "Expected Sales"] <- "Sales"
Table1$Account[Table1$Account == "Expected Expenses"] <- "Expenses"
Then merge into a third table, Table3, joining on the shared key columns:
Table3 <- left_join(Table1, Table2, by = c("State", "Month", "Account"))
Finally, use mutate to do the needed calculation:
Table3 <- Table3 %>% mutate(Value2 = Value * Tot_Value)
head(Table3)
## # A tibble: 2 x 6
## State Month Account Value Tot_Value Value2
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 NY Jan Sales 1.04 1000 1040
## 2 NY Jan Expenses 1.02 1100 1122

Here's what I did with dplyr and tidyr.
First, I combined your initial tables with rbind into a single long-format table. Since each Account value is a unique identifier, these don't need to be separate tables.
Next, I group_by State and Month, assuming you'll eventually have a variety of states and months.
Then I summarise using the Account values you specified, which creates two new columns.
Finally, to get the long format you want, I used gather from tidyr to go from wide to long.
You can run the pipeline in smaller chunks by deleting everything after a given %>% to get a better idea of what each step does.
library(dplyr)
library(tidyr)
# df and df2 are the two tables from the question (Table 1 and Table 2)
rbind(df, df2) %>%
  group_by(State, Month) %>%
  summarise(Expenses = (Value[which(Account == "F Expenses")] +
                          Value[which(Account == "V Expenses")]) *
                       Value[which(Account == "Expected Expenses")],
            Sales = Value[which(Account == "Sales")] *
                    Value[which(Account == "Expected Sales")]) %>%
  gather(New_Account, Value, c(Expenses, Sales))
# A tibble: 2 x 4
# Groups: State [1]
# State Month New_Account Value
# <chr> <chr> <chr> <dbl>
#1 NY Jan Expenses 1122
#2 NY Jan Sales 1040

I'd recommend checking out the concept of "tidy data", as there are some real challenges with working on data with the structure you currently have. E.g. creating t3 should only take 2-3 lines of code, all of this is just to work around your data architecture:
library(tidyverse)
t1 <- data.frame(State = rep("NY", 2),
                 Month = rep(as.Date("2018-01-01"), 2),
                 Account = c("Expected Sales", "Expected Expenses"),
                 Value = c(1.04, 1.02),
                 stringsAsFactors = FALSE)
t2 <- data.frame(State = rep("NY", 4),
                 Month = rep(as.Date("2018-01-01"), 4),
                 Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
                 Value = c(1000, 500, 1000, 100),
                 stringsAsFactors = FALSE)
t3 <- t2 %>%
  spread(Account, Value) %>%
  inner_join({
    t1 %>%
      spread(Account, Value)
  }, by = c("State" = "State", "Month" = "Month")) %>%
  mutate(NewExpenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`,
         NewSales = Sales * `Expected Sales`) %>%
  select(State, Month, Sales = NewSales, Expenses = NewExpenses) %>%
  gather(Sales, Expenses, key = `New Account`, value = Value)
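As a rough sketch of the same reshape-then-calculate idea with the newer tidyr verbs (pivot_wider() and pivot_longer(), which supersede spread() and gather()), something like the following should work. The tables are rebuilt from the question's sample values, with Month kept as plain text; this is one possible arrangement, not the only one.

```r
library(dplyr)
library(tidyr)

# Sample tables rebuilt from the question (values only; Month kept as text)
t1 <- tibble(State = "NY", Month = "Jan",
             Account = c("Expected Sales", "Expected Expenses"),
             Value = c(1.04, 1.02))
t2 <- tibble(State = "NY", Month = "Jan",
             Account = c("Sales", "Customers", "F Expenses", "V Expenses"),
             Value = c(1000, 500, 1000, 100))

t3 <- bind_rows(t1, t2) %>%
  # one row per State/Month, one column per Account
  pivot_wider(names_from = Account, values_from = Value) %>%
  # apply the two formulas from the question, dropping all other columns
  transmute(State, Month,
            Sales = Sales * `Expected Sales`,
            Expenses = (`F Expenses` + `V Expenses`) * `Expected Expenses`) %>%
  # back to long format: one row per new account
  pivot_longer(c(Sales, Expenses),
               names_to = "New Account", values_to = "Value")

t3
```

Because every account becomes its own column after pivot_wider(), each formula reads almost exactly as it was written in the question.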

Related

Remove rows with two conditions in R

I have this following dataset:
df <- structure(list(Data = structure(c(1623888000, 1629158400, 1629158400
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), Client = c("Client1",
"Client1", "Client1"), Fund = c("Fund1", "Fund1", "Fund2"), Nature = c("Application",
"Rescue", "Application"), Quantity = c(433.059697, 0, 171.546757
), Value = c(69800, -70305.67, 24875), `NAV Yesterday` = c(162.40991399996,
162.40991399996, 145.044589000056), `NAV in Application Date` = c(161.178702344125,
162.346370458944, 145.004198476337), `Var NAV` = c(0.00763879866215962,
0.00039140721678275, 0.000278547270652531), `Var * Value` = c(533.188146618741,
-27.5181466187465, 6.92886335748171), FinalValue = c(70333.1881466187,
-70333.1881466187, 24881.9288633575), `Rentability WRONG` = c(0.0210345899274819,
0.0210345899274819, 0.0210345899274819)), row.names = c(NA, -3L
), class = c("tbl_df", "tbl", "data.frame"))
What I need to do is:
If a row has Quantity = 0, remove the rows with the same Fund name as that row, but only those whose date (the Data column) is on or before the date of the Quantity = 0 row.
What I did here:
Grouped the data by Fund.
Arranged the rows by Data, so that each group is in date order.
Created a column zero_point that is 1 in the row where Quantity == 0 and NA otherwise.
Filled zero_point upwards, so that the rows before the actual "zero point" get the same value.
Filtered those rows out.
library(dplyr)
library(tidyr)  # for fill()
output <- df %>%
  group_by(Fund) %>%
  arrange(Data) %>%
  mutate(zero_point = case_when(Quantity == 0 ~ 1)) %>%
  fill(zero_point, .direction = "up") %>%
  filter(is.na(zero_point))
(This assumes there is only one row with Quantity == 0 per Fund group.)
You can try:
library(dplyr)
df %>%
  filter({
    # Row index where Quantity = 0
    inds = which(Quantity == 0)
    # Drop rows where the Data value is on or before the Data value at Quantity = 0
    # and Fund is the same as the Fund at Quantity = 0.
    !(Data <= Data[inds] & Fund %in% Fund[inds])
  })
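Both approaches above rely on there being a single Quantity == 0 row. As a hedged sketch of one way to relax that, the cutoff can be taken per Fund as the latest date with a zero quantity. The funds table below is hypothetical sample data, not the data from the question, and "latest zero date" is an assumption about what should happen when a fund has several zero rows.

```r
library(dplyr)

# Hypothetical data: Fund1 has two zero-quantity rows, Fund2 has none
funds <- tibble(
  Data = as.Date(c("2021-06-17", "2021-08-17", "2021-09-01",
                   "2021-10-01", "2021-11-01", "2021-08-17")),
  Fund = c("Fund1", "Fund1", "Fund1", "Fund1", "Fund1", "Fund2"),
  Quantity = c(433, 0, 50, 0, 20, 171)
)

kept <- funds %>%
  group_by(Fund) %>%
  filter(
    # funds without any zero row are kept whole;
    # otherwise keep only rows dated after the latest zero row
    if (any(Quantity == 0)) Data > max(Data[Quantity == 0]) else TRUE
  ) %>%
  ungroup()

kept
```

The if/else avoids calling max() on an empty vector for funds that never reach zero.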
Here's a thought:
df %>%
group_by(Fund) %>%
filter(!any(Quantity == 0) | Data <= Data[which.min(Quantity)])
# # A tibble: 3 x 12
# # Groups: Fund [2]
# Data Client Fund Nature Quantity Value `NAV Yesterday` `NAV in Applica~ `Var NAV` `Var * Value` FinalValue `Rentability WR~
# <dttm> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 2021-06-17 00:00:00 Clien~ Fund1 Appli~ 433. 69800 162. 161. 0.00764 533. 70333. 0.0210
# 2 2021-08-17 00:00:00 Clien~ Fund1 Rescue 0 -70306. 162. 162. 0.000391 -27.5 -70333. 0.0210
# 3 2021-08-17 00:00:00 Clien~ Fund2 Appli~ 172. 24875 145. 145. 0.000279 6.93 24882. 0.0210
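I'm assuming you meant "Data <= Data of the Quantity = 0 Fund", so I'm using Data rather than Date (which does not exist) or NAV in Application Date.
This filters nothing in this sample data, because the condition as written keeps the rows on or before the Quantity = 0 date. To remove those rows instead, as the question asks, flip the comparison to Data > Data[which.min(Quantity)]; on this sample data that leaves only the Fund2 row.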
Testing for equality with floating-point (numeric) can be problematic at times (see Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). If you have some small near-zero numbers, then this will silently produce counter-intuitive results without warning or error. You might be more defensive to use something like:
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) | Data <= Data[which.min(Quantity)])
or even
df %>%
  group_by(Fund) %>%
  filter(all(abs(Quantity) > 0) |
           row_number() == which.min(Quantity) |
           Data < Data[which.min(Quantity)])
While the latter is a bit paranoid (and calculates which.min(.) twice), it should not succumb to problems with equality tests.
The only time this will fail is if all(is.na(Quantity)): which.min(c(NA,NA)) returns integer(0), which will cause an error in dplyr::filter. One might add a safeguard with something like filter(any(!is.na(Quantity)) & (...)).
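To make the floating-point concern concrete, here is a minimal sketch that treats a quantity as zero when it falls within a tolerance, and that removes the rows dated on or before that point, as the question asks. The tolerance tol and the trades table are assumptions made up for illustration; a sensible tolerance depends on how the quantities are computed.

```r
library(dplyr)

tol <- 1e-8  # assumed tolerance for "effectively zero"

# Hypothetical data: Fund1's second row is a near-zero rounding residue
trades <- tibble(
  Data = as.Date(c("2021-06-17", "2021-08-17", "2021-09-01", "2021-08-17")),
  Fund = c("Fund1", "Fund1", "Fund1", "Fund2"),
  Quantity = c(433.059697, 1e-12, 10, 171.546757)
)

result <- trades %>%
  group_by(Fund) %>%
  filter(
    # keep the whole fund if nothing is near zero; otherwise keep only
    # rows dated after the row whose quantity is closest to zero
    !any(abs(Quantity) < tol) | Data > Data[which.min(abs(Quantity))]
  ) %>%
  ungroup()

result
```

A strict Quantity == 0 test would miss the 1e-12 row here and leave Fund1 untouched.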

Condense dataframe with conditions to certain columns

I've got a dataset that looks like this
df = data.frame(Site = c(rep('w',4),rep('x',5),rep('y',2),rep('z',1)),
Parent = c(rep('W Inc.',4),rep('X Inc.',5),rep('Y Inc.',2),rep('Z Inc.',1)),
Status = c(rep('Prospect',4),rep('Client',5),rep('Client',2),rep('Prospect',1)),
Country = c(rep('USA',4),rep('Canada',5),rep('Mexico',2),rep('China',1)),
ProductID = c('XP10','XP11','XP18','XP19','XP4','XP5','XP6','XP7','XP8','XP10','XP18','XP6'),
ProductName = c('10Rockets','11Rockets','18Rockets','19Rockets','4Rockets','5Rockets','6Rockets','7Rockets','8Rockets','10Rockets','18Rockets','6Rockets'),
ProductProvider= c('Provider A','Provider B','Provider A','Provider A',rep('Provider A',5),'Provider A','Provider B','Provider B'))
I'd like to condense it such that each Site is a unique row, and the last 3 columns are concatenated.
Also, I'd like to concatenate the last column such that if there are any repetitions, it takes only the unique values per Site and separates them with commas.
My attempt
library(dplyr)
output2 = df %>% group_by(Site,Parent,Status,Country) %>%
mutate(ProductID = paste(ProductID, collapse=",")) %>%
mutate(ProductName = paste(ProductName, collapse=",")) %>%
mutate(ProductProvider = unique(paste(ProductProvider, collapse=","))) %>%
distinct()
I'm almost there, but the last column seems to have repetitions of ProductProvider which I do not want.
Target Output
I'm looking for a target data set like this, with the last column concatenated and free of any repetitions. Any inputs would be appreciated.
output = data.frame(Site = c(rep('w',1),rep('x',1),rep('y',1),rep('z',1)),
Parent = c(rep('W Inc.',1),rep('X Inc.',1),rep('Y Inc.',1),rep('Z Inc.',1)),
Status = c(rep('Prospect',1),rep('Client',1),rep('Client',1),rep('Prospect',1)),
Country = c(rep('USA',1),rep('Canada',1),rep('Mexico',1),rep('China',1)),
ProductID = c('XP10,XP11,XP18,XP19','XP4,XP5,XP6,XP7,XP8','XP10,XP18','XP6'),
ProductName = c('10Rockets,11Rockets,18Rockets,19Rockets','4Rockets,5Rockets,6Rockets,7Rockets,8Rockets','10Rockets,18Rockets','6Rockets'),
ProductProvider= c('Provider A,Provider B','Provider A','Provider A,Provider B','Provider B'))
With dplyr:
library(dplyr)
result = df %>%
  group_by(Site, Parent, Status, Country) %>%
  summarize(across(ProductProvider, ~ paste(unique(.x), collapse = ", ")),
            across(everything(), ~ paste(.x, collapse = ", ")))
result
# # A tibble: 4 x 7
# # Groups: Site, Parent, Status [4]
# Site Parent Status Country ProductProvider ProductID ProductName
# <chr> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 w W Inc. Prospect USA Provider A, Provider B XP10, XP11, XP18, XP19 10Rockets, 11Rockets, 18Rockets, 19Rockets
# 2 x X Inc. Client Canada Provider A XP4, XP5, XP6, XP7, XP8 4Rockets, 5Rockets, 6Rockets, 7Rockets, 8Rockets
# 3 y Y Inc. Client Mexico Provider A, Provider B XP10, XP18 10Rockets, 18Rockets
# 4 z Z Inc. Prospect China Provider B XP6 6Rockets
Short and sweet with aggregate.
aggregate(. ~ Site, df, unique)
# Site Parent Status Country ProductID ProductName ProductProvider
# 1 w W Inc. Prospect USA XP10, XP11, XP18, XP19 10Rockets, 11Rockets, 18Rockets, 19Rockets Provider A, Provider B
# 2 x X Inc. Client Canada XP4, XP5, XP6, XP7, XP8 4Rockets, 5Rockets, 6Rockets, 7Rockets, 8Rockets Provider A
# 3 y Y Inc. Client Mexico XP10, XP18 10Rockets, 18Rockets Provider A, Provider B
# 4 z Z Inc. Prospect China XP6 6Rockets Provider B
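The attempt in the question keeps the repetitions because unique() is applied after paste(): by then there is a single collapsed string, so there is nothing left to deduplicate. A minimal sketch of the corrected order, on a cut-down hypothetical table called products (only three of the original columns), could look like this:

```r
library(dplyr)

# Cut-down, hypothetical version of the question's data
products <- data.frame(
  Site = c("w", "w", "x", "x", "x"),
  ProductID = c("XP10", "XP11", "XP4", "XP5", "XP6"),
  ProductProvider = c("Provider A", "Provider B",
                      "Provider A", "Provider A", "Provider A")
)

condensed <- products %>%
  group_by(Site) %>%
  summarise(
    # keep every ID, in order
    ProductID = paste(ProductID, collapse = ","),
    # deduplicate first, then collapse
    ProductProvider = paste(unique(ProductProvider), collapse = ","),
    .groups = "drop"
  )

condensed
```

Using summarise() instead of mutate() followed by distinct() also returns one row per group directly.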

dplyr, purrr, dynamically generate/calculate new columns in R

I have the following problem. I have a data frame/tibble with a lot of columns that each hold a value for a different year, e.g. the number of inhabitants in a city at different points in time. I now want to generate columns that give me the growth rate (see pictures attached). It should be something like using mutate() while looping over the columns. I think this should be a common task, but I can't find any hint on how to do it.
Edit:
A minimal example could look like this:
## Minimal example
library(tidyverse)
## Given data frame
df <- tibble(
City = c("Melbourne", "Sydney", "Adelaide"),
year_2000 = c(100, 100, 205),
year_2001 = c(101, 100, 207),
year_2002 = c(102, 100, 209)
)
## Result
df <- df %>%
mutate(
gr_2000_2001 = year_2001/year_2000*100 - 100,
gr_2001_2002 = year_2002/year_2001*100 - 100
)
I want to find a way to automate the mutate step, as I have to do it for 150 years.
The easiest way in this example would probably be to make your data tidy and then calculate growth rates with dplyr's lag() function on a data frame grouped by City. The example below computes the ratio of each value to the previous one; substitute whatever growth formula you are using:
## Minimal example
library(tidyverse)
df <- data.frame(City = c("Melbourne", "Sydney"),
year_2000 = c(100, 100),
year_2001 = c(101,100),
year_2002 = c(102, 102))
df %>%
  gather(year, value, 2:4) %>%
  group_by(City) %>%
  mutate(growth = value / dplyr::lag(value, n = 1))
The result is this:
# A tibble: 6 x 4
# Groups: City [2]
City year value growth
<fct> <chr> <dbl> <dbl>
1 Melbourne year_2000 100 NA
2 Sydney year_2000 100 NA
3 Melbourne year_2001 101 1.01
4 Sydney year_2001 100 1
5 Melbourne year_2002 102 1.01
6 Sydney year_2002 102 1.02
If you absolutely need the data in the format you provided in the screenshots, you can then apply spread() to reshape it into the original format. This is not generally recommended, however.
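If the wide layout with gr_2000_2001-style columns is needed after all, a sketch along the following lines may help. It uses the percentage formula and the column naming from the question; the helper names pair, gr and growth are made up for illustration, and the year columns are assumed to be in chronological order from left to right.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  City = c("Melbourne", "Sydney", "Adelaide"),
  year_2000 = c(100, 100, 205),
  year_2001 = c(101, 100, 207),
  year_2002 = c(102, 100, 209)
)

growth <- df %>%
  # long format: one row per City and year
  pivot_longer(starts_with("year_"), names_to = "year",
               names_prefix = "year_", values_to = "value") %>%
  group_by(City) %>%
  mutate(gr = value / lag(value) * 100 - 100,          # growth in percent
         pair = paste0("gr_", lag(year), "_", year)) %>%
  ungroup() %>%
  filter(!is.na(gr)) %>%                               # first year has no growth
  select(City, pair, gr) %>%
  # back to wide format: one growth column per pair of years
  pivot_wider(names_from = pair, values_from = gr)

result <- left_join(df, growth, by = "City")
result
```

The same code should work unchanged for 150 year columns, since nothing refers to individual years by name.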

Identify change in categorical data across datapoints in R

I need help with data manipulation in R.
My dataset looks something like this:
Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Now I want to identify the number of changes that “Smith” has gone through (two), based on the combination of Name and country only.
Basically, I want to compare each data point with the others and count the changes across the entire dataset for each combination of Name and country.
You can do this by comparing the lagged value of the combination with its current value, by group, using dplyr:
library(dplyr)
df %>%
  group_by(Name) %>%
  mutate(combination = paste(country, age),
         lag_combination = lag(combination, default = ""),
         Changes = cumsum(combination != lag_combination)) %>%
  slice(n()) %>%
  select(Name, Changes)
Result:
# A tibble: 3 x 2
# Groups: Name [3]
Name Changes
<fctr> <int>
1 Avin 2
2 Robin 1
3 Smith 3
Notes:
Inside a grouped mutate(), lag() works within each Name group, so no order_by argument is needed; rows are compared in the order in which they appear in the data.
I'm setting default = "" (a character value, matching the type of combination) so that the first entry of each group is not NA. As a consequence, the first row of each group counts as one change.
Data:
df = read.table(text="Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Smith, Canada, 27
Robin, France, 28
Avin, France, 26", header = TRUE, sep = ',')
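If only the country should matter, and the first row of each person should not count as a change, a minimal sketch is to count transitions between consecutive rows. This assumes that row order reflects the order of events; the table is named people here only to keep the example separate from the code above.

```r
library(dplyr)

people <- read.table(text = "Name, country, age
Smith, Canada, 27
Avin, India, 25
Smith, India, 27
Robin, France, 28
Smith, Canada, 27
Robin, France, 28
Avin, France, 26", header = TRUE, sep = ",",
                     strip.white = TRUE, stringsAsFactors = FALSE)

changes <- people %>%
  group_by(Name) %>%
  # compare each country with the previous one for the same Name;
  # the first row of a group compares against NA and is ignored
  summarise(Changes = sum(country != lag(country), na.rm = TRUE))

changes
```

With this definition Smith (Canada, India, Canada) has two changes, Avin has one and Robin has none.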

r aggregate function -- how to display additional columns [duplicate]

I have two data frames: City and Country. I am trying to find the most populous city per country. City and Country share a common field, City.CountryCode and Country.Code. These two data frames were merged into one called CityCountry. I have tried the aggregate command like so:
aggregate(Population.x~CountryCode, CityCountry, max)
This aggregate command only shows the CountryCode and Population.x columns. How would I show the name of the Country and the name of the City? Is aggregate the wrong command to use here?
You could also use dplyr to group by Country and then filter for the rows where Population.x equals the group maximum.
library(dplyr)
set.seed(123)
CityCountry <- data.frame(Population.x = sample(1000:2000, 10, replace = TRUE),
CountryCode = rep(LETTERS[1:5], 2),
Country = rep(letters[1:5], 2),
City = letters[11:20],
stringsAsFactors = FALSE)
CityCountry %>%
group_by(Country) %>%
filter(Population.x == max(Population.x)) %>%
ungroup()
# A tibble: 5 x 4
Population.x CountryCode Country City
<int> <chr> <chr> <chr>
1 1287 A a k
2 1789 B b l
3 1883 D d n
4 1941 E e o
5 1893 C c r
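With dplyr 1.0.0 or later, slice_max() expresses the same idea a little more directly. The sketch below uses a small hypothetical cities table rather than the merged CityCountry data; with_ties = FALSE is a choice made here so that exactly one row per country is returned even when two cities share the maximum.

```r
library(dplyr)

# Hypothetical data: two cities in each of two countries
cities <- data.frame(
  Country = c("a", "a", "b", "b"),
  City = c("k", "p", "l", "q"),
  Population.x = c(1287, 1040, 1500, 1789)
)

top <- cities %>%
  group_by(Country) %>%
  # keep the single row with the largest population in each country
  slice_max(Population.x, n = 1, with_ties = FALSE) %>%
  ungroup()

top
```

All columns of the winning row are kept, which is what aggregate() on its own does not give you.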
