Confused on percent difference calculations in R using dplyr::mutate

I'm attempting to find the percent difference in state characteristics (using an index created with factor analysis) between 2012 and 2017. However, some states begin with a negative score, e.g. -0.617 (2012), and end with a more negative one, -1.25 (2017), which produces a positive percent difference rather than a negative one.
The only other thing I've tried is subtracting 1 from the ratio factor1/lag(factor1). Below is the code I'm currently working with:
STFACTOR %>>%
dplyr::select(FIPSst, Geography, Year, factor1) %>>%
filter(Year %in% c(2012, 2017)) %>>%
group_by(Geography) %>>%
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
These are the changes I tried and the result of each:
mutate(pct_change = (1-factor1/lag(factor1)) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.
mutate(pct_change = (factor1/lag(factor1)-1) * 100)
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 102.
I would expect the final result to look like this:
FIPSst Geography Year factor1[,1] pct_change
<chr> <fct> <int> <dbl> <dbl>
1 01 Alabama 2012 1.82 NA
2 01 Alabama 2017 0.945 -47.9
3 04 Arizona 2012 0.813 NA
4 04 Arizona 2017 0.108 -86.7
5 05 Arkansas 2012 1.52 NA
6 05 Arkansas 2017 0.626 -58.8
7 06 California 2012 1.04 NA
8 06 California 2017 0.0828 -92.1
9 08 Colorado 2012 -0.617 NA
10 08 Colorado 2017 -1.25 -102.

mutate(pct_change = (factor1-lag(factor1))/lag(abs(factor1)) * 100)
Above is the final solution to the problem: I subtracted the old number from the new one and then divided by the absolute value of the old number.

We can use:
mutate(pct_change =(factor1 - lag(factor1))/abs(lag(factor1)) * 100)
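To see why dividing by the absolute value matters, here is a minimal base R sketch using the Alabama and Colorado scores as printed in the question. The helper name pct_change_signed is made up for illustration; inside a grouped dplyr pipeline the same expression would use lag(factor1) as the old value.

```r
# Hypothetical helper (the name is illustrative, not part of dplyr):
# percent change from `old` to `new`, scaled by the magnitude of `old`,
# so that a decrease stays negative even when `old` is itself negative.
pct_change_signed <- function(old, new) {
  (new - old) / abs(old) * 100
}

# Scores copied from the question's table (rounded as printed there)
old <- c(Alabama = 1.82, Colorado = -0.617)
new <- c(Alabama = 0.945, Colorado = -1.25)

naive  <- (new / old - 1) * 100        # sign flips when `old` is negative
signed <- pct_change_signed(old, new)  # both states show a decrease

print(round(naive, 1))
print(round(signed, 1))
```

For positive starting values the two formulas agree; they differ only in sign when the starting value is negative. Percent changes relative to a base that is close to zero, or that crosses zero, remain hard to interpret either way.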

Related

How can I merge variables to my dataframe from another dataframe if the year is the same?

I have the dataframe assets_year:
fiscalyear countryname Assets net_margin
<int> <chr> <dbl> <dbl>
1 2010 Austria 1602544072. 1.72
2 2010 Belgium 2534519957. 0.974
3 2010 Estonia 33248259. 1.31
4 2010 Finland 1490200498. 1.42
5 2010 France 17137601040. 1.51
6 2010 Germany 11553780086. 2.32
tail
fiscalyear countryname Assets net_margin
<int> <chr> <dbl> <dbl>
1 2017 Luxembourg 503785373. 0.730
2 2017 Netherlands 3810079489. 1.40
3 2017 Portugal 504072448. 1.73
4 2017 Slovakia 61735274. 2.49
5 2017 Slovenia 41642423. 1.96
6 2017 Spain 4397884239. 1.39
Additionally, I summed up the asset values per year in another DF:
fiscalyear `sum(Assets)`
<int> <dbl>
1 2010 52192928317.
2 2011 55914561036.
3 2012 52202110772.
4 2013 42418952433.
5 2014 53001352848.
6 2015 43550880007.
In order to scale net margin by total asset value, I would like to cbind(...) the sum(Assets) column onto my existing dataframe, which is in panel format: every country has an entry for each year 2010, 2011, ..., 2017.
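One possible approach, sketched here with base R only: cbind() binds by position and will not line the yearly totals up with the panel rows, whereas merge() matches on the shared fiscalyear key. The toy values, the column name total_assets and the asset_share scaling are all assumptions made for illustration.

```r
# Toy panel with made-up values (NOT the figures from the question)
assets_year <- data.frame(
  fiscalyear  = c(2010L, 2010L, 2011L, 2011L),
  countryname = c("Austria", "Belgium", "Austria", "Belgium"),
  Assets      = c(100, 300, 150, 350),
  net_margin  = c(1.72, 0.974, 1.60, 1.10)
)

# Yearly totals, one row per fiscalyear
totals <- aggregate(Assets ~ fiscalyear, data = assets_year, FUN = sum)
names(totals)[names(totals) == "Assets"] <- "total_assets"

# Join on the shared key so each country-year row receives its year's total
merged <- merge(assets_year, totals, by = "fiscalyear")

# One possible scaling (an assumption): each country's share of that year's assets
merged$asset_share <- merged$Assets / merged$total_assets
print(merged)
```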

Deleting specific column/row values with if conditions

This is probably straightforward, but I am struggling big time.
I have a data frame with different industries between 1999 and 2019.
fyear industry employees
1 1999 Agriculture 132.260
2 2000 Agriculture 154.590
3 2001 Agriculture 147.725
4 2002 Agriculture 142.098
5 2003 Agriculture 77.169
6 2004 Agriculture 82.979
7 2005 Agriculture 99.625
8 2006 Agriculture 98.195
9 2007 Agriculture 95.193
10 2008 Agriculture 104.459
11 2009 Agriculture 182.930
12 2010 Agriculture 180.648
13 2011 Agriculture 173.408
14 2012 Agriculture 181.483
15 2013 Agriculture 109.842
16 2014 Agriculture 90.177
17 2015 Agriculture 92.067
18 2016 Agriculture 83.568
19 2017 Agriculture 70.251
20 2018 Agriculture 65.082
21 2019 Agriculture 82.754
22 1999 Aircraft 653.194
23 2000 Aircraft 692.918
24 2001 Aircraft 666.751
25 2002 Aircraft 633.565
26 2003 Aircraft 687.611
27 2004 Aircraft 701.827
28 2005 Aircraft 725.825
29 2006 Aircraft 751.171
30 2007 Aircraft 744.060
31 2008 Aircraft 750.319
32 2009 Aircraft 677.598
33 2010 Aircraft 690.605
34 2011 Aircraft 712.501
35 2012 Aircraft 716.985
36 2013 Aircraft 709.918
I am trying to create some growth variables
df$employeegrowth <- df$employees / lag(df$employees) - 1
This naturally causes some issues for every "1999" row, which I would like to replace with NA.
I am trying to solve this issue with an if statement:
df$employeegrowth <- if(df$fyear == "1999") {
df$employeegrowth <- "NA"
}
But this replaces every value in the employeegrowth column with NA.
I do not want to delete the entire row, as the other columns contain valuable information.
Could someone point me in the right direction on this?

Use lag() by group:
library(dplyr)
df %>%
group_by(industry) %>%
mutate(employeegrowth = employees/lag(employees) - 1)
# fyear industry employees employeegrowth
# <int> <chr> <dbl> <dbl>
# 1 1999 Agriculture 132. NA
# 2 2000 Agriculture 155. 0.169
# 3 2001 Agriculture 148. -0.0444
# 4 2002 Agriculture 142. -0.0381
# 5 2003 Agriculture 77.2 -0.457
# 6 2004 Agriculture 83.0 0.0753
# 7 2005 Agriculture 99.6 0.201
# 8 2006 Agriculture 98.2 -0.0144
# 9 2007 Agriculture 95.2 -0.0306
#10 2008 Agriculture 104. 0.0973
# … with 26 more rows
This gives NA for the first fyear of each industry, because lag() has no previous value to look at within the group.
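For comparison, the same grouped lag can be sketched without dplyr using ave() from base R. This is only an illustration on the first three years of each industry from the question, and it assumes the rows are already sorted by fyear within each industry.

```r
# First three years of each industry, copied from the question
df <- data.frame(
  fyear     = rep(1999:2001, times = 2),
  industry  = rep(c("Agriculture", "Aircraft"), each = 3),
  employees = c(132.260, 154.590, 147.725, 653.194, 692.918, 666.751)
)

# Previous value within each industry; the first row of each group gets NA
prev <- ave(df$employees, df$industry,
            FUN = function(x) c(NA, head(x, -1)))

df$employeegrowth <- df$employees / prev - 1
print(df)
```

If individual values ever need to be blanked out by condition, logical indexing such as df$employeegrowth[df$fyear == 1999] <- NA changes only the matching rows, unlike the if() attempt in the question.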

Linear model/lmList with nested/multiple group categories?

I am trying to build a model for monthly energy consumption based on weather, grouped by location (there are ~1100) AND year (I would like to do it from 2011-2014). The data is called factin and looks like this:
Store Month Days UPD HD CD Year
1 August, 2013 31 6478.27 0.06 10.03 2013
1 September, 2013 30 6015.38 0.50 5.67 2013
1 October, 2013 31 5478.21 5.29 1.48 2013
1 November, 2013 30 5223.78 18.60 0.00 2013
1 December, 2013 31 5115.80 20.52 0.23 2013
6 January, 2011 31 4517.56 27.45 0.00 2011
6 February, 2011 28 4116.07 16.75 0.07 2011
6 March, 2011 31 3981.78 12.68 0.39 2011
6 April, 2011 30 4041.68 3.83 2.53 2011
6 May, 2011 31 4287.23 1.61 6.58 2011
And my model code, which just spits out 1 set of coefficients for all the years of each store, looks like this:
factout <- lmList(UPD ~ HD + CD | Store, factin)
My question is: is there any way I can get coefficients for each store AND year without creating a separate data frame for each year?

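library(nlme)  # assumed source of lmList(); the summary(...)$coef indexing below matches nlme's output
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "Store Month year Days UPD HD CD Year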
1 August 2013 31 6478.27 0.06 10.03 2013
1 September 2013 30 6015.38 0.50 5.67 2013
1 October 2013 31 5478.21 5.29 1.48 2013
1 November 2013 30 5223.78 18.60 0.00 2013
1 December 2013 31 5115.80 20.52 0.23 2013
6 January 2011 31 4517.56 27.45 0.00 2011
6 February 2011 28 4116.07 16.75 0.07 2011
6 March 2011 31 3981.78 12.68 0.39 2011
6 April 2011 30 4041.68 3.83 2.53 2011
6 May 2011 31 4287.23 1.61 6.58 2011")
factout <- lmList(UPD ~ HD + CD | Store, dat)
data.frame(Store = unique(dat$Store), summary(factout)$coef[1:2,1,1:3])
(Intercept) HD CD
1 5405.108 -12.90986 107.2061
6 3581.307 32.93137 102.9780
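The output above has one row per store. To get one set of coefficients per store AND year without building separate data frames by hand, one option is to split on both columns and fit lm() to each piece. The following is a base R sketch on the ten rows from the question, not necessarily what the answerer intended.

```r
# The ten rows from the question (Month and Days omitted; the model does not use them)
factin <- data.frame(
  Store = rep(c(1, 6), each = 5),
  Year  = rep(c(2013, 2011), each = 5),
  UPD   = c(6478.27, 6015.38, 5478.21, 5223.78, 5115.80,
            4517.56, 4116.07, 3981.78, 4041.68, 4287.23),
  HD    = c(0.06, 0.50, 5.29, 18.60, 20.52, 27.45, 16.75, 12.68, 3.83, 1.61),
  CD    = c(10.03, 5.67, 1.48, 0.00, 0.23, 0.00, 0.07, 0.39, 2.53, 6.58)
)

# One piece per Store/Year combination that actually occurs in the data
groups <- split(factin, list(factin$Store, factin$Year), drop = TRUE)

# Fit the same model to each piece; rows of `coefs` are named "Store.Year"
coefs <- t(sapply(groups, function(d) coef(lm(UPD ~ HD + CD, data = d))))
print(coefs)
```

In this small sample each store appears in only one year, so there are just two rows. With lmList, the same grouping could presumably be obtained by first adding a combined column, for example interaction(Store, Year, drop = TRUE), and using that after the | in the formula; that variant is not tested here.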

Mix values from dataframes with different formats

I have a database with the columns: "Year", "Month", "T1",......"T31":
For example df_0 is the original format and I want to convert it in the new_df (second part)
id0 <- c ("Year", "Month", "T_day1", "T_day2", "T_day3", "T_day4", "T_day5")
id1 <- c ("2010", "January", 10, 5, 2,3,3)
id2 <- c ("2010", "February", 20,36,5,8,1)
id3 <- c ("2010", "March", 12,23,23,5,25)
df_0 <- rbind (id1, id2, id3)
colnames (df_0)<- id0
head(df_0)
I would like to create a new dataframe in which the data from T1....T31 for each month and year will join to a column with all dates for example from 1st January 2010 to 4th January 2012:
date<-seq(as.Date("2010-01-01"), as.Date("2012-01-04"), by="days")
or join the value in a new column of a dataframe based on the values of other three columns (year, month and day):
year <- lapply(strsplit(as.character(date), "\\-"), "[", 1)
month <- lapply(strsplit(as.character(date), "\\-"), "[", 2)
day <- lapply(strsplit(as.character(date), "\\-"), "[", 3)
df <- cbind (year, month, day)
I would like to have a data frame with the information in this way:
Year <- rep(2010,15)
Month <- c(rep("January", 5), rep("February",5), rep("March",5))
Day<- rep(c(1,2,3,4,5))
Value <- c(10,5,2,3,3,20,36,5,8,1,12,23,23,5,25)
new_df <- cbind (Year, Month, Day, Value)
head(new_df)
Thanks in advance

What you're looking for is to reshape your data from wide to long format. One package you can use is reshape2, which provides the melt function:
library(reshape2)
melt(data.frame(df_0), id.vars=c("Year", "Month"))
Based on the data you have, the output would be:
Year Month variable value
1 2010 January T_day1 10
2 2010 February T_day1 20
3 2010 March T_day1 12
4 2010 January T_day2 5
5 2010 February T_day2 36
6 2010 March T_day2 23
7 2010 January T_day3 2
8 2010 February T_day3 5
9 2010 March T_day3 23
10 2010 January T_day4 3
11 2010 February T_day4 8
12 2010 March T_day4 5
13 2010 January T_day5 3
14 2010 February T_day5 1
15 2010 March T_day5 25
You can then convert the variable column to day numbers, depending on how that column is formatted.
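As a sketch of that last step, assuming the variable column holds names of the form T_day1, T_day2, ... as in the output above, the day number can be extracted with sub() and combined with Year and Month into a proper Date. The data frame below is typed in by hand from the first six rows of that output so that the snippet runs without reshape2.

```r
# First six rows of the melted output shown above, entered by hand
long <- data.frame(
  Year     = "2010",
  Month    = rep(c("January", "February", "March"), times = 2),
  variable = rep(c("T_day1", "T_day2"), each = 3),
  value    = c(10, 20, 12, 5, 36, 23),
  stringsAsFactors = FALSE
)

# Strip the "T_day" prefix to obtain the day of the month
long$Day <- as.integer(sub("^T_day", "", long$variable))

# Build a Date; match() against month.name avoids locale-dependent parsing
long$Date <- as.Date(sprintf("%s-%02d-%02d",
                             long$Year, match(long$Month, month.name), long$Day))

print(long[order(long$Date), ])
```

If melt() returns variable as a factor, wrapping it in as.character() before sub() keeps the conversion explicit.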

Firstly, I generated my own test data. I used a reduced date vector for easier demonstration: 2010-01-01 to 2010-03-04. In my df_0 I generated a value for each date in my reduced date vector not including the last date, and including one additional date not in my date vector: 2010-03-05. It will become clear later why I did this.
set.seed(1);
date <- seq(as.Date('2010-01-01'),as.Date('2010-03-04'),by='day');
df_0 <- reshape(setNames(as.data.frame(cbind(do.call(rbind,strsplit(strftime(c(date[-length(date)],as.Date('2010-03-05')),'%Y %B %d'),' ')),round(rnorm(length(date)),3))),c('Year','Month','Day','T_day')),dir='w',idvar=c('Year','Month'),timevar='Day');
attr(df_0,'reshapeWide') <- NULL;
df_0;
## Year Month T_day.01 T_day.02 T_day.03 T_day.04 T_day.05 T_day.06 T_day.07 T_day.08 T_day.09 T_day.10 T_day.11 T_day.12 T_day.13 T_day.14 T_day.15 T_day.16 T_day.17 T_day.18 T_day.19 T_day.20 T_day.21 T_day.22 T_day.23 T_day.24 T_day.25 T_day.26 T_day.27 T_day.28 T_day.29 T_day.30 T_day.31
## 1 2010 January -0.626 0.184 -0.836 1.595 0.33 -0.82 0.487 0.738 0.576 -0.305 1.512 0.39 -0.621 -2.215 1.125 -0.045 -0.016 0.944 0.821 0.594 0.919 0.782 0.075 -1.989 0.62 -0.056 -0.156 -1.471 -0.478 0.418 1.359
## 32 2010 February -0.103 0.388 -0.054 -1.377 -0.415 -0.394 -0.059 1.1 0.763 -0.165 -0.253 0.697 0.557 -0.689 -0.707 0.365 0.769 -0.112 0.881 0.398 -0.612 0.341 -1.129 1.433 1.98 -0.367 -1.044 0.57 <NA> <NA> <NA>
## 60 2010 March -0.135 2.402 -0.039 <NA> 0.69 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
The first half of the solution is a reshaping from wide format to long, and can be done with a single call to reshape(). Additionally, I wrapped it in a call to na.omit() to prevent NA values from being generated from the unavoidable NA cells in df_0:
df_1 <- na.omit(reshape(df_0,dir='l',idvar=c('Year','Month'),timevar='Day',varying=grep('^T_day\\.',names(df_0)),v.names='Value'));
rownames(df_1) <- NULL;
df_1[order(match(df_1$Month,month.name),df_1$Day),];
## Year Month Day Value
## 1 2010 January 1 -0.626
## 4 2010 January 2 0.184
## 7 2010 January 3 -0.836
## 10 2010 January 4 1.595
## 12 2010 January 5 0.33
## 15 2010 January 6 -0.82
## 17 2010 January 7 0.487
## 19 2010 January 8 0.738
## 21 2010 January 9 0.576
## 23 2010 January 10 -0.305
## 25 2010 January 11 1.512
## 27 2010 January 12 0.39
## 29 2010 January 13 -0.621
## 31 2010 January 14 -2.215
## 33 2010 January 15 1.125
## 35 2010 January 16 -0.045
## 37 2010 January 17 -0.016
## 39 2010 January 18 0.944
## 41 2010 January 19 0.821
## 43 2010 January 20 0.594
## 45 2010 January 21 0.919
## 47 2010 January 22 0.782
## 49 2010 January 23 0.075
## 51 2010 January 24 -1.989
## 53 2010 January 25 0.62
## 55 2010 January 26 -0.056
## 57 2010 January 27 -0.156
## 59 2010 January 28 -1.471
## 61 2010 January 29 -0.478
## 62 2010 January 30 0.418
## 63 2010 January 31 1.359
## 2 2010 February 1 -0.103
## 5 2010 February 2 0.388
## 8 2010 February 3 -0.054
## 11 2010 February 4 -1.377
## 13 2010 February 5 -0.415
## 16 2010 February 6 -0.394
## 18 2010 February 7 -0.059
## 20 2010 February 8 1.1
## 22 2010 February 9 0.763
## 24 2010 February 10 -0.165
## 26 2010 February 11 -0.253
## 28 2010 February 12 0.697
## 30 2010 February 13 0.557
## 32 2010 February 14 -0.689
## 34 2010 February 15 -0.707
## 36 2010 February 16 0.365
## 38 2010 February 17 0.769
## 40 2010 February 18 -0.112
## 42 2010 February 19 0.881
## 44 2010 February 20 0.398
## 46 2010 February 21 -0.612
## 48 2010 February 22 0.341
## 50 2010 February 23 -1.129
## 52 2010 February 24 1.433
## 54 2010 February 25 1.98
## 56 2010 February 26 -0.367
## 58 2010 February 27 -1.044
## 60 2010 February 28 0.57
## 3 2010 March 1 -0.135
## 6 2010 March 2 2.402
## 9 2010 March 3 -0.039
## 14 2010 March 5 0.69
The second part of the solution requires merging the above long-format data.frame with the exact dates you stated you want in the resulting data.frame. This requires a fair amount of scaffolding code to transform the date vector into a data.frame with Year Month Day columns, but once that's done, you can simply call merge() with all.x=T to preserve every date in the date vector whether or not it was present in df_1, and to exclude any date in df_1 that is not also present in the date vector:
df_2 <- merge(transform(setNames(as.data.frame(do.call(rbind,strsplit(strftime(date,'%Y %B %d'),' '))),c('Year','Month','Day')),Day=as.integer(Day)),df_1,all.x=T);
df_2[order(match(df_2$Month,month.name),df_2$Day),];
## Year Month Day Value
## 29 2010 January 1 -0.626
## 30 2010 January 2 0.184
## 31 2010 January 3 -0.836
## 32 2010 January 4 1.595
## 33 2010 January 5 0.33
## 34 2010 January 6 -0.82
## 35 2010 January 7 0.487
## 36 2010 January 8 0.738
## 37 2010 January 9 0.576
## 38 2010 January 10 -0.305
## 39 2010 January 11 1.512
## 40 2010 January 12 0.39
## 41 2010 January 13 -0.621
## 42 2010 January 14 -2.215
## 43 2010 January 15 1.125
## 44 2010 January 16 -0.045
## 45 2010 January 17 -0.016
## 46 2010 January 18 0.944
## 47 2010 January 19 0.821
## 48 2010 January 20 0.594
## 49 2010 January 21 0.919
## 50 2010 January 22 0.782
## 51 2010 January 23 0.075
## 52 2010 January 24 -1.989
## 53 2010 January 25 0.62
## 54 2010 January 26 -0.056
## 55 2010 January 27 -0.156
## 56 2010 January 28 -1.471
## 57 2010 January 29 -0.478
## 58 2010 January 30 0.418
## 59 2010 January 31 1.359
## 1 2010 February 1 -0.103
## 2 2010 February 2 0.388
## 3 2010 February 3 -0.054
## 4 2010 February 4 -1.377
## 5 2010 February 5 -0.415
## 6 2010 February 6 -0.394
## 7 2010 February 7 -0.059
## 8 2010 February 8 1.1
## 9 2010 February 9 0.763
## 10 2010 February 10 -0.165
## 11 2010 February 11 -0.253
## 12 2010 February 12 0.697
## 13 2010 February 13 0.557
## 14 2010 February 14 -0.689
## 15 2010 February 15 -0.707
## 16 2010 February 16 0.365
## 17 2010 February 17 0.769
## 18 2010 February 18 -0.112
## 19 2010 February 19 0.881
## 20 2010 February 20 0.398
## 21 2010 February 21 -0.612
## 22 2010 February 22 0.341
## 23 2010 February 23 -1.129
## 24 2010 February 24 1.433
## 25 2010 February 25 1.98
## 26 2010 February 26 -0.367
## 27 2010 February 27 -1.044
## 28 2010 February 28 0.57
## 60 2010 March 1 -0.135
## 61 2010 March 2 2.402
## 62 2010 March 3 -0.039
## 63 2010 March 4 <NA>
Notice how 2010-03-04 is included, even though I didn't generate a value for it in df_0, and 2010-03-05 is excluded, even though I did.
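The effect of all.x can be seen in isolation with a much smaller example. This is a sketch using only the four March values from the output above; the data frame names wanted and have are made up for illustration.

```r
# Days we want in the result (stands in for the full date vector)
wanted <- data.frame(Day = 1:4)

# Days we actually have values for: day 4 is missing, day 5 is extra
have <- data.frame(Day = c(1L, 2L, 3L, 5L),
                   Value = c(-0.135, 2.402, -0.039, 0.69))

# all.x = TRUE keeps every row of `wanted`; unmatched rows of `have` are dropped
merged <- merge(wanted, have, all.x = TRUE)
print(merged)
```

Day 4 appears with an NA value and day 5 is absent, which mirrors the behaviour described for 2010-03-04 and 2010-03-05.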

correlation between two data frames in R

I have one data frame which has sales values for the time period Oct. 2000 to Dec. 2001 (15 months). Also I have profit values for the same time period as above and I want to find the correlation between these two data frames month wise for these 15 months in R. My data frame sales is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
Now I know that for the first two months I cannot get a correlation, as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation using all months up to and including the current one. So for Dec. 2000 I will use the values for Oct. 2000, Nov. 2000 and Dec. 2000, which gives me 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will use Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, giving 4 sales values and 4 profit values. In other words, each month's correlation includes all previous months, and my output should look something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that in R there is a function cor(sales, profit) but how can I find out the correlation for my scenario?

Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note that I call this runcor because it's a "running correlation", like a "running sum", which is the sum of all elements so far. This is the correlation of all pairs so far.
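The same idea can be wrapped in a small reusable function. This is only a sketch: the name running_cor and the min_n parameter are made up here, and the input is the first five months of sales and profit from the question.

```r
# Hypothetical helper: correlation of x[1:i] and y[1:i] for each i,
# with NA until at least `min_n` pairs are available.
running_cor <- function(x, y, min_n = 3) {
  stopifnot(length(x) == length(y))
  vapply(seq_along(x), function(i) {
    if (i < min_n) NA_real_ else cor(x[1:i], y[1:i])
  }, numeric(1))
}

# Oct. 2000 to Feb. 2001, copied from the question
sales  <- c(24.1, 23.3, 43.9, 53.8, 74.9)
profit <- c(14.1,  3.3, 13.9, 23.8, 44.9)

runcor <- running_cor(sales, profit)
print(runcor)
```

Using vapply() with numeric(1) rather than sapply() guarantees a numeric vector of the same length as the input, so the result can be assigned directly to a new column.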

Another possibility (if dat1 and dat2 are the initial datasets):
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort=FALSE, by.x="Month", by.y="period")
indx <- sequence(3:nrow(datN)) #create index to replicate the rows
indx1 <- cumsum(c(TRUE,diff(indx) <0)) #create another index to group the rows
#calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,by(datN[indx,-1],
list(indx1), FUN=function(x) cor(x$sales, x$profit) )), NULL)
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334
