Combining unequal data frames and applying a calculation - r

I've been doing some data cleaning and regressions, and now I would like to apply the output, but I'm stuck on the following problem.
One data frame is called "Historical" and looks like this:
Year Value
2014 5
2015 7.5
2016 11
The other data frame is called "forecast" and looks like this (new years in the future):
Year Growth
2017 0.05
2018 0.11
etc
So I would like one data frame that shows the historical values followed by the forecasted values, starting in 2017 (11 * 1.05).
How can I go about this?
Much appreciated

Given
a <- read.table(header=T, text="Year Value
2014 5
2015 7.5
2016 11")
b <- read.table(header=T, text="
Year Growth
2017 0.05
2018 0.11")
You could e.g. do
rbind(a, cbind(
Year=b$Year,
Value=cumprod(c(tail(a$Value, 1), 1+b$Growth))[-1])
)
# Year Value
# 1 2014 5.0000
# 2 2015 7.5000
# 3 2016 11.0000
# 4 2017 11.5500
# 5 2018 12.8205
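To make the compounding step explicit, here is a minimal sketch of the same calculation split into named pieces. The intermediate names (last_value, factors, forecast, combined) are only illustrative and are not part of the answer above.

```r
# Data as given in the question
a <- data.frame(Year = 2014:2016, Value = c(5, 7.5, 11))
b <- data.frame(Year = 2017:2018, Growth = c(0.05, 0.11))

# Last historical value, used as the base for the forecast
last_value <- tail(a$Value, 1)

# Cumulative growth factors: 1.05, then 1.05 * 1.11
factors <- cumprod(1 + b$Growth)

# Forecast values compound from the last historical value
forecast <- data.frame(Year = b$Year, Value = last_value * factors)

# Historical rows followed by forecast rows
combined <- rbind(a, forecast)
combined
```

Because each forecast value is the last historical value times the running product of (1 + Growth), the growth rates compound year over year rather than each being applied to 2016 separately.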

Related

Is it possible in R to group by variable and then create correlation inside column?

my problem is that I would like to do group by and then correlation inside column in data frame.
For example:
Year Var
2000 10
2010 15
2010 13
2000 11
And I want to group by year, so it would look like this:
Year Var
2000 10
2000 11
2010 15
2010 13
And then compute the correlation between the values for year 2000 and year 2010.
Does that make sense?
Thank you for any advice.
Richard
So far I have this:
prep_cor <- all_species_df %>% group_by(Sheet) %>% slice_sample(n=600) %>% arrange(Sheet, desc(BLUE))
I expect the result to be a table.
Turn data into wide format and then compute correlation:
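library(dplyr)
library(tidyr)
df %>%
  group_by(Year) %>%
  mutate(ID = row_number()) %>%  # row index within each year, used to pair values
  ungroup() %>%
  spread(Year, Var) %>%          # one column per year
  select(-ID) %>%
  cor()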
2000 2010
2000 1 -1
2010 -1 1
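The same reshaping can also be sketched in base R, without dplyr or tidyr. This assumes, as in the example, that every year has the same number of rows and that values are paired by their order within each year. The data frame df is rebuilt here from the question's table, and the names wide and cor_mat are only illustrative.

```r
# Example data from the question
df <- data.frame(Year = c(2000, 2010, 2010, 2000),
                 Var  = c(10, 15, 13, 11))

# One column per year; values keep their original order within each year
wide <- do.call(cbind, split(df$Var, df$Year))

# Correlation matrix between the year columns
cor_mat <- cor(wide)
cor_mat
```

With only two observations per year the correlation can only be 1 or -1, so the toy result says little by itself; the point is the wide layout that cor() expects.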

Aggregating based on previous year and this year

I have this data set:
month Year Rain
10 2010 376.8
11 2010 282.78
12 2010 324.58
1 2011 73.51
2 2011 225.89
3 2011 22.96
I used
df2prnext<-
aggregate(Rain~Year, data = subdataprnext, mean)
but I need the mean value of 217.53.
I am not getting the expected result. Thank you for your help.
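No answer is recorded for this question here. One possible reading, going by the title, is that October to December of one year should be averaged together with January to March of the next. The sketch below assumes exactly that and adds a hypothetical Season column to group on; the cut-off month (10) is an assumption, not something stated in the question. For the six rows shown, the combined mean comes to about 217.75.

```r
# Data from the question
subdataprnext <- data.frame(
  month = c(10, 11, 12, 1, 2, 3),
  Year  = c(2010, 2010, 2010, 2011, 2011, 2011),
  Rain  = c(376.8, 282.78, 324.58, 73.51, 225.89, 22.96)
)

# Hypothetical season label: October-December count toward the following year
subdataprnext$Season <- ifelse(subdataprnext$month >= 10,
                               subdataprnext$Year + 1,
                               subdataprnext$Year)

# One mean per season instead of one mean per calendar year
df2prnext <- aggregate(Rain ~ Season, data = subdataprnext, FUN = mean)
df2prnext
```

Grouping on Year directly gives two separate means (one for 2010, one for 2011), which is why the original call does not produce a single value.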

Compute factor data between two data frames in R

I have not found a solution for this, and I think it should be very simple but now I can't think right.
I have two data frames, monthly traffic volume averages, and yearly traffic volume averages. I need to divide yearly averages by monthly averages.
ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
So the first row of the TPDA table should be divided by each of the first 12 rows of the TPDM table, creating a new data frame that contains the monthly factors.
Something like:
ano mes dias FA_Au
2012 Ene 31 4271.096/4288.323
2012 Feb 29 4271.096/3268.862
(Don't need to show the computation, just the result)
I am sure that selecting the data by year would do that but haven't found the right way to do it.
Merge by year and find columns to divide by position
As already mentioned by zx8754 this can be done by merging on year and dividing the corresponding columns in base R:
merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")
FA <- cbind(merged[, 1:3], merged[, 10:15]/merged[, 4:9])
# rename columns
names(FA) <- sub("TPDA_", "FA_", names(FA))
FA
ano mes dias FA_Au FA_Bu FA_CU FA_CAI FA_CAII FA_TOTAL
1 2012 Ene 31 0.9959828 0.9951086 1.0779532 1.1044977 1.1757530 1.705872
2 2012 Feb 29 1.3066003 1.0412831 1.0360781 0.9862042 1.1050245 1.663675
3 2012 Mar 31 1.1644517 0.9884285 0.8829349 0.9583809 1.0566337 1.546941
4 2012 Abr 30 0.9148231 0.9540314 0.9073122 1.0383376 1.0328838 1.440892
5 2012 May 31 1.3353096 1.0309085 0.9963600 0.9507334 0.8084003 1.576802
6 2012 Jun 30 1.1834349 1.0191696 0.9103332 0.9642064 0.9149720 1.534471
Caveat:
This approach works as long as the positions, i.e., column numbers, of the corresponding columns are known. With the given datasets, the columns are ordered in the same way. Therefore, only an offset has to be considered to match corresponding columns.
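One way to guard against a silent mismatch is to check that the column names line up before dividing by position. The following is a minimal sketch on a reduced version of the data (two months, two vehicle classes); the variables monthly_pos, annual_pos, monthly_keys and annual_keys are illustrative and not part of the answer.

```r
# Reduced example data (subset of the question's tables)
TPDM <- data.frame(ano = c(2012, 2012), mes = c("Ene", "Feb"), dias = c(31, 29),
                   Au_TPDM = c(4288.323, 3268.862),
                   Bu_TPDM = c(620.5161, 593.0000))
TPDA <- data.frame(anosDB = c(2012, 2013),
                   TPDA_Au = c(4271.096, 4685.079),
                   TPDA_Bu = c(617.4809, 638.5616))

merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")

# Positions of the monthly and annual columns in this reduced example
monthly_pos <- 4:5
annual_pos  <- 6:7

# Strip suffix/prefix and verify the two blocks are in the same order
monthly_keys <- sub("_TPDM$", "", names(merged)[monthly_pos])
annual_keys  <- sub("^TPDA_", "", names(merged)[annual_pos])
stopifnot(identical(monthly_keys, annual_keys))

# Safe to divide by position now
FA <- cbind(merged[, 1:3], merged[, annual_pos] / merged[, monthly_pos])
names(FA) <- sub("TPDA_", "FA_", names(FA))
FA
```

If the column order ever differs between the two tables, stopifnot() halts with an error instead of producing factors for the wrong vehicle class.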
Merge by year and find columns to divide by name
If, for some reason, the positions are not known in advance we can find corresponding columns by matching the column names.
For this, both datasets are reshaped from wide to long format. In long format, the column names (now stored in a column called variable) are treated as data. We can then join monthly and annual values on year and column name, divide the annual values by the corresponding monthly values, and finally reshape back to wide format:
library(data.table)
# reshape and prepare monthly data
longM <- melt(setDT(TPDM), id.vars = 1:3)
longM[, variable := stringr::str_replace(variable, "_TPDM", "")]
longM[, mes := forcats::fct_inorder(mes)]
# reshape and prepare annual data
longA <- melt(setDT(TPDA), id.vars = 1)
longA[, variable := stringr::str_replace(variable, "TPDA_", "")]
setnames(longA, "anosDB", "ano")
# join
long_FA <- longA[longM, on = .(ano, variable),
                 .(ano, mes, dias, variable, FA = value / i.value)]
# reshape back to wide format
dcast(long_FA, ano + mes + dias ~ paste0("FA_", variable), value.var = "FA")
ano mes dias FA_Au FA_Bu FA_CAI FA_CAII FA_CU FA_TOTAL
1: 2012 Ene 31 0.9959828 0.9951086 1.1044977 1.1757530 1.0779532 1.705872
2: 2012 Feb 29 1.3066003 1.0412831 0.9862042 1.1050245 1.0360781 1.663675
3: 2012 Mar 31 1.1644517 0.9884285 0.9583809 1.0566337 0.8829349 1.546941
4: 2012 Abr 30 0.9148231 0.9540314 1.0383376 1.0328838 0.9073122 1.440892
5: 2012 May 31 1.3353096 1.0309085 0.9507334 0.8084003 0.9963600 1.576802
6: 2012 Jun 30 1.1834349 1.0191696 0.9642064 0.9149720 0.9103332 1.534471
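Matching by name can also be sketched in base R without reshaping, by deriving each annual column name from its monthly counterpart. This is an alternative to the data.table approach above, not a restatement of it. The reduced data below deliberately lists the annual columns in a different order; monthly_cols and annual_cols are illustrative names. Note that the TOTAL column in the question's data has no _TPDM suffix, so it would need separate handling.

```r
# Reduced example data; annual columns intentionally in a different order
TPDM <- data.frame(ano = c(2012, 2012), mes = c("Ene", "Feb"), dias = c(31, 29),
                   Au_TPDM = c(4288.323, 3268.862),
                   Bu_TPDM = c(620.5161, 593.0000))
TPDA <- data.frame(anosDB = c(2012, 2013),
                   TPDA_Bu = c(617.4809, 638.5616),
                   TPDA_Au = c(4271.096, 4685.079))

merged <- merge(TPDM, TPDA, by.x = "ano", by.y = "anosDB")

# Monthly columns, and the annual column name each one corresponds to
monthly_cols <- grep("_TPDM$", names(merged), value = TRUE)
annual_cols  <- paste0("TPDA_", sub("_TPDM$", "", monthly_cols))
stopifnot(all(annual_cols %in% names(merged)))

# Divide annual by monthly, selecting columns by name rather than position
FA <- merged[annual_cols] / merged[monthly_cols]
names(FA) <- sub("^TPDA_", "FA_", names(FA))
FA <- cbind(merged[c("ano", "mes", "dias")], FA)
FA
```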
Data
TPDM <- read.table(text = "
i ano mes dias Au_TPDM Bu_TPDM CU_TPDM CAI_TPDM CAII_TPDM TOTAL
1 2012 Ene 31 4288.323 620.5161 236.7419 4635.097 139.0645 6112.258
7 2012 Feb 29 3268.862 593.0000 246.3103 5191.069 147.9655 6267.286
13 2012 Mar 31 3667.903 624.7097 289.0323 5341.774 154.7419 6740.226
19 2012 Abr 30 4668.767 647.2333 281.2667 4930.433 158.3000 7236.300
25 2012 May 31 3198.581 598.9677 256.1290 5384.742 202.2581 6612.581
31 2012 Jun 30 3609.067 605.8667 280.3333 5309.500 178.7000 6795.000
", header = TRUE)[, -1L]
TPDA <- read.table(text = "
i anosDB TPDA_Au TPDA_Bu TPDA_CU TPDA_CAI TPDA_CAII TPDA_TOTAL
1 2012 4271.096 617.4809 255.1967 5119.454 163.5055 10426.73
2 2013 4685.079 638.5616 259.8877 5287.822 154.0110 11025.36
3 2014 4969.277 656.3918 266.8986 5407.800 177.0932 11477.46
4 2015 5184.953 541.8822 400.2137 4941.422 271.6877 11340.16
5 2016 5220.872 408.6967 541.0519 5584.492 182.4399 11937.55
6 2017 5298.852 408.7562 556.5644 6033.652 266.1644 12563.99
", header = TRUE)[, -1L]

Boxplot not plotting all data

I'm trying to plot a boxplot for a time series (e.g. http://www.r-graph-gallery.com/146-boxplot-for-time-series/) and can get every other example to work, bar my last one. I have averages per month for six years (2011 to 2016) and have data for 2014 and 2015 (albeit in small quantities), but for some reason, boxes aren't being shown for the 2014 and 2015 data.
My input data has three columns: year, month and residency index (a value between 0 and 1). There are multiple individuals (in this example, 37) each with an average residency index per month per year (including 2014 and 2015).
For example:
year month RI
2015 1 NA
2015 2 NA
2015 3 NA
2015 4 NA
2015 5 NA
2015 6 NA
2015 7 0.387096774
2015 8 0.580645161
2015 9 0.3
2015 10 0.225806452
2015 11 0.3
2015 12 0.161290323
2016 1 0.096774194
2016 2 0.103448276
2016 3 0.161290323
2016 4 0.366666667
2016 5 0.258064516
2016 6 0.266666667
2016 7 0.387096774
2016 8 0.129032258
2016 9 0.133333333
2016 10 0.032258065
2016 11 0.133333333
2016 12 0.129032258
which is repeated for each individual fish.
My code:
#make boxplot
boxplot(RI$RI~RI$month+RI$year,
xaxt="n",xlab="",col=my_colours,pch=20,cex=0.3,ylab="Residency Index (RI)", ylim=c(0,1))
abline(v=seq(0,12*6,12)+0.5,col="grey")
axis(1,labels=unique(RI$year),at=seq(6,12*6,12))
The average trend line works as per the other examples.
a=aggregate(RI$RI,by=list(RI$month,RI$year),mean, na.rm=TRUE)
lines(a[,3],type="l",col="red",lwd=2)
Any help on this matter would be greatly appreciated.
Your problem seems to be the presence of missing values (NA) in your data; the other values are plotted correctly. I've simplified your code a bit.
boxplot(RI$RI ~ RI$month + RI$year,
ylab="Residency Index (RI)")
a <- aggregate(RI ~ month + year, data = RI, FUN = mean, na.rm = TRUE)
lines(c(rep(NA, 6), a[,3]), type="l", col="red", lwd=2)
Also, a boxplot may not be the best way to depict your data. The sample you posted has only one value per year/month, while a boxplot needs several. A simple scatter plot may do better.
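To check which boxes have data behind them, one option is to ask boxplot() for its statistics without drawing anything. The sketch below rebuilds the 2015/2016 sample from the question (a single individual, so one value per month at most) and counts the non-missing observations per month/year group; bp is an illustrative name.

```r
# Sample data from the question (one individual)
RI <- data.frame(
  year  = rep(c(2015, 2016), each = 12),
  month = rep(1:12, times = 2),
  RI    = c(rep(NA, 6),
            0.387096774, 0.580645161, 0.3, 0.225806452, 0.3, 0.161290323,
            0.096774194, 0.103448276, 0.161290323, 0.366666667,
            0.258064516, 0.266666667, 0.387096774, 0.129032258,
            0.133333333, 0.032258065, 0.133333333, 0.129032258)
)

# Compute the boxplot statistics without plotting
bp <- boxplot(RI ~ month + year, data = RI, plot = FALSE)

# Number of non-missing observations behind each box; 0 means nothing to draw
data.frame(group = bp$names, n = bp$n)

# The formula interface of aggregate() drops rows with NA by default,
# so the result has fewer rows than there are month/year slots
a <- aggregate(RI ~ month + year, data = RI, FUN = mean)
nrow(a)
```

The difference between the number of boxplot slots and the number of rows in a is why the answer pads the trend line with rep(NA, 6) to keep it aligned with the boxes.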

Sum column values that match year in another column in R

I have the following dataframe
y<-data.frame(c(2007,2008,2009,2009,2010,2010),c(10,13,10,11,9,10),c(5,6,5,7,4,7))
colnames(y)<-c("year","a","b")
I want a final data.frame that sums, within each year, the values of y$a into a new "a" column and the values of y$b into a new "b" column, so that it looks like this:
year a b
2007 10 5
2008 13 6
2009 21 12
2010 19 11
The following loop has done it for me,
years<- as.numeric(levels(factor(y$year)))
add.a<- numeric(length(y[,1]))
add.b<- numeric(length(y[,1]))
for(i in years){
ind<- which(y$year==i)
add.a[ind]<- sum(as.numeric(as.character(y[ind,"a"])))
add.b[ind]<- sum(as.numeric(as.character(y[ind,"b"])))
}
y.final<-data.frame(y$year,add.a,add.b)
colnames(y.final)<-c("year","a","b")
y.final<-subset(y.final,!duplicated(y.final$year))
but I just think there must be a faster command. Any ideas?
Kindest regards,
Marco
The aggregate function is a good choice for this sort of operation. Type ?aggregate for more information about it.
aggregate(cbind(a,b) ~ year, data = y, sum)
# year a b
#1 2007 10 5
#2 2008 13 6
#3 2009 21 12
#4 2010 19 11
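As an alternative sketch, base R's rowsum() computes column sums per group directly. It returns the groups as row names, so the year has to be moved back into a column afterwards; sums is an illustrative name.

```r
# Data from the question
y <- data.frame(year = c(2007, 2008, 2009, 2009, 2010, 2010),
                a = c(10, 13, 10, 11, 9, 10),
                b = c(5, 6, 5, 7, 4, 7))

# Sum columns a and b within each year; groups become row names
sums <- rowsum(y[, c("a", "b")], group = y$year)

# Move the year from the row names back into a regular column
y.final <- data.frame(year = as.numeric(rownames(sums)), sums, row.names = NULL)
y.final
```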
