I have a report that I need to produce on a quarterly basis that involves adding various components of revenue together to compute trailing 12-month and trailing 24-month totals.
Rather than retyping a bunch of column names to add each column together on a rolling basis, I was hoping to create a function where I could declare variables for the trailing months so I can sum them together more easily.
My data frame all_rel contains all the data I need to sum together. It contains the following fields (unfortunately I just inherited this report and it isn't exactly in tidy format):
Total_Processing_Revenue
Ancillary_Revenue
In the data frame I have the trailing 24 months of these data points, each in a separate column.
The script that I inherited uses the following to add the columns together:
all_rel$anci_rev_cy_ytd = all_rel$X201701Ancillary_Revenue+all_rel$X201702Ancillary_Revenue+all_rel$X201703Ancillary_Revenue+...+all_rel$X201712Ancillary_Revenue
I was hoping to do something with paste but can't seem to get it to work:
dfname <- 'all_rel$X'
revmonth1 <- '01'
revmonth2 <- '02'
revmonth3 <- '03'
revmonth4 <- '04'
revmonth5 <- '05'
revmonth6 <- '06'
revmonth7 <- '07'
revmonth8 <- '08'
revmonth9 <- '09'
revmonth10 <- '10'
revmonth11 <- '11'
revmonth12 <- '12'
cy <- '2017'
py <- '2016'
rev1 <- 'Total_Processing_Revenue'
rev2 <- 'Ancillary_Revenue'
all_rel$anci_rev_py_ytd = paste(dfname,py,revmonth1,rev2, sep ='')+paste(dfname,py,revmonth2,rev2, sep ='')+...paste(dfname,py,revmonth12,rev2, sep ='')
When I try to sum these fields together I get a "non-numeric argument to binary operator" error. Is there something else I can do instead of what I've been trying to do?
paste(dfname,py,revmonth1,rev2, sep ='') returns "all_rel$X201601Ancillary_Revenue"
Is there a way that I can tell R that the reason why I'm pasting these names is to reference the data within them rather than the text I'm pasting?
I'm fairly new to R (I've been learning on the fly to try to make my life easier).
Ultimately I need to figure out how to convert this mess to a tidy data format where each of the revenue columns has a month and year, but I was hoping to use this issue to understand how to use substitution logic to better automate processes. Maybe I just worded my searches incorrectly, but I was struggling to find the exact issue I'm trying to solve.
Any help is greatly appreciated.
::edit::
added dput(head)
structure(list(Chain = c("000001", "000029", "000060", "000064","000076", "000079"), X201601Net_Revenue = c(-2.92, 25005.14,55787.59, 3996.69, 14229.41, 3455.85),X201601Total_Processing_Revenue = c(0,16140.48, 23238.89, 3574.17, 4093.51, 641.1), X201601Ancillary_Revenue = c(-2.92,8864.66, 32548.7, 422.52, 10135.9, 2814.75), X201602Net_Revenue = c(0,41918.84, 56696.34, 4789.57, 13113.2, 5211.27), X201602Total_Processing_Revenue = c(0,13253.19, 24733.04, 4395.69, 4102.79, 546.68), X201602Ancillary_Revenue = c(0,28665.65, 31963.3, 393.88, 9010.41, 4664.59), X201603Net_Revenue = c(0,23843.76, 62494.51, 5262.87, 20551.79, 7646.75), X201603Total_Processing_Revenue = c(0,15037.39, 27523.19,4792.63,4805.61,2134.72)),.Names=c("Chain","X201601Net_Revenue","X201601Total_Processing_Revenue","X201601Ancillary_Revenue","X201602Net_Revenue","X201602Total_Processing_Revenue","X201602Ancillary_Revenue","X201603Net_Revenue", "X201603Total_Processing_Revenue"), row.names = c(NA,6L), class = "data.frame")
Here's how to tidy your data (calling your data dd):
library(tidyr)
library(dplyr)
gather(dd, key = key, value = value, -Chain) %>%
  mutate(year = substr(key, start = 2, 5),
         month = substr(key, 6, 7),
         metric = substr(key, 8, nchar(key))) %>%
  select(-key) %>%
  spread(key = metric, value = value)
# Chain year month Ancillary_Revenue Net_Revenue Total_Processing_Revenue
# 1 000001 2016 01 -2.92 -2.92 0.00
# 2 000001 2016 02 0.00 0.00 0.00
# 3 000001 2016 03 NA 0.00 0.00
# 4 000029 2016 01 8864.66 25005.14 16140.48
# 5 000029 2016 02 28665.65 41918.84 13253.19
# 6 000029 2016 03 NA 23843.76 15037.39
# 7 000060 2016 01 32548.70 55787.59 23238.89
# 8 000060 2016 02 31963.30 56696.34 24733.04
# 9 000060 2016 03 NA 62494.51 27523.19
# 10 000064 2016 01 422.52 3996.69 3574.17
# 11 000064 2016 02 393.88 4789.57 4395.69
# 12 000064 2016 03 NA 5262.87 4792.63
# 13 000076 2016 01 10135.90 14229.41 4093.51
# 14 000076 2016 02 9010.41 13113.20 4102.79
# 15 000076 2016 03 NA 20551.79 4805.61
# 16 000079 2016 01 2814.75 3455.85 641.10
# 17 000079 2016 02 4664.59 5211.27 546.68
# 18 000079 2016 03 NA 7646.75 2134.72
With that done, you can use whatever grouped operations you want - sums, rolling sums or averages, etc. You might be interested in the yearmon class provided in the zoo package, this question on rolling sums by group, and of course the R-FAQ on grouped sums.
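On the narrower question of referring to columns by pasted names: paste() only builds character strings, so adding them fails. A minimal sketch, assuming the column naming pattern from the dput above, is to build the names as a character vector and index the data frame with it (all_rel[cols] or all_rel[[name]]) rather than pasting "all_rel$" into the string. The small all_rel below reuses a few values from the question; months is a name introduced here for illustration.

```r
# Small stand-in for all_rel, using values from the dput in the question.
all_rel <- data.frame(
  Chain = c("000001", "000029", "000060"),
  X201601Ancillary_Revenue = c(-2.92, 8864.66, 32548.7),
  X201602Ancillary_Revenue = c(0, 28665.65, 31963.3),
  stringsAsFactors = FALSE
)

py     <- "2016"
rev2   <- "Ancillary_Revenue"
months <- sprintf("%02d", 1:2)   # use 1:12 once all twelve columns exist

# Build the column names as strings, without the "all_rel$" prefix.
cols <- paste0("X", py, months, rev2)

# Index by name: all_rel[cols] selects the columns, rowSums() adds them per row.
all_rel$anci_rev_py_ytd <- unname(rowSums(all_rel[cols]))

# A single column can be fetched by name with [[ ]].
first_month <- all_rel[[cols[1]]]
```

The same cols vector can be rebuilt for any year or revenue type by changing py and rev2, which avoids the twelve revmonth variables.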
I'm not even sure how to approach this situation, I'm probably blocked. I have a wide dataframe, something like this
Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16
I would like to export the data to a csv file with the following format
Date Amy John
X Y X Y
March 14 15 10.5 14.5
April 10 11 15 16
My first question is what is the best approach to achieve this. Should I separate Amy_X into Amy and X and then create a repeat of the vector of names Amy, Amy, John, John and use that as another header? What's the best solution for this scenario?
The question says to output the file to csv but the output shown is not comma-separated values (csv). We show both.
Using input data frame DF defined reproducibly in the Note at the end, create a data frame from the headers and use separate_rows on it and rbind that to DF. Then do any remaining fix ups. Write it out without the row and column names and without quotes. Replace stdout() with your file name.
library(dplyr)
library(tidyr)
DF2 <- DF %>%
names %>%
as.list %>%
as.data.frame %>%
separate_rows(everything()) %>%
setNames(names(DF)) %>%
rbind(DF)
DF2[2, 1] <- DF2[1, duplicated(unlist(DF2[1, ]))] <- ""
output <- capture.output(prmatrix(DF2, quote = FALSE,
rowlab = rep("", nrow(DF2)), collab = rep("", ncol(DF2))))[-1]
writeLines(output, stdout())
giving the following which reproduces the output shown in the question:
Date Amy John
X Y X Y
March 14 15 10.5 14.5
April 10 11 15 16
If you really did want csv, then use this in place of the writeLines call and the capture.output statement before it:
write.table(DF2, stdout(), sep = ",", quote = FALSE, row.names = FALSE,
col.names = FALSE)
giving:
Date,Amy,,John,
,X,Y,X,Y
March,14,15,10.5,14.5
April,10,11,15,16
Note
Lines <- "Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)
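As an alternative sketch that avoids tidyr, the two header rows can also be built in base R by splitting the column names on "_". This assumes every name has at most one underscore; top, sub, body and out are names introduced here for illustration.

```r
Lines <- "Date Amy_X Amy_Y John_X John_Y
March 14 15 10.5 14.5
April 10 11 15 16"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE, as.is = TRUE)

# Split "Amy_X" into c("Amy", "X"); "Date" has no second part.
parts <- strsplit(names(DF), "_")
top <- sapply(parts, `[`, 1)
sub <- sapply(parts, function(p) if (length(p) > 1) p[2] else "")

# Blank out repeated group names so each appears once in the first header row.
top[duplicated(top)] <- ""

# Convert every column to character so numbers are written as they were read.
body <- sapply(DF, as.character)
out <- rbind(top, sub, body)

# Replace stdout() with a file name to write the csv to disk.
write.table(out, stdout(), sep = ",", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```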
I'm new to R programming, so this question might be simple.
Anyway, I've tried to find an answer to this specific thing I'm trying to do and didn't find one.
So, I'm trying to import new data into my old data.frame.
The problem is that this data has to replace previous NA values in variables that already exist.
Also, my data has different individuals (companies) in different periods (years), and my new data set only has the companies and years that were missing, plus some observations that I already had.
I tried to simulate the problem with the data frames below:
Data frame with NAs:
df1 <- data.frame( company = c(rep("A",3), rep("B",3), rep("C",3)),
year = c(rep(2016:2018,each=1)),
income = c(95,87,93,NA,NA,58,102,80,NA),
debt = c(43,50,51,NA,37,37,53,NA,NA),
stringsAsFactors= F )
To search for new data, I created a data set with only the missing data, as my data had too many observations:
df_NA <- data.frame(df1[is.na(df1$income & df1$debt),])
So after searching, I was able to find the missing data, and now I have something like this:
df2 <- data.frame( company = c("A", "B" , "C" , "C"),
year = c(2018, 2016, 2017, 2018),
income = c(60,55, 80, 82),
debt = c(32,37, 53,48),
stringsAsFactors= F )
Now, I'm trying to put this data together, so I have a complete data.frame to work with.
The problem is that I couldn't find a way to do it yet. I've tried merge and join, indexing by company and year, but the variables that have the same name in both data.frames get duplicated with a suffix.
In my real data I have many more observations and variables to fill, so I want to find a way to do it with a single command. Also, this is going to happen again in the future, so it would be very helpful.
I'm sorry if this was already answered. Thank you!
Here is an option using data.table:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on=c("company", "year"), c('income', 'debt') := { list(i.income, i.debt)}]
# company year income debt
#1: A 2016 95 43
#2: A 2017 87 50
#3: A 2018 60 32
#4: B 2016 55 37
#5: B 2017 NA 37
#6: B 2018 58 37
#7: C 2016 102 53
#8: C 2017 80 53
#9: C 2018 82 48
Or another option using dplyr:
library(dplyr)
full_join(df1, df2, by = c("year", "company")) %>%
  mutate(
    income = coalesce(income.x, income.y),
    debt = coalesce(debt.x, debt.y)
  ) %>%
  select(company, year, income, debt)
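Note that the two options differ: the data.table update overwrites every matched row (A 2018 becomes 60 / 32 in the output above), while coalesce() keeps existing values and only fills NAs. A minimal base R sketch of the fill-only-NA behaviour, assuming company and year together identify a row, could look like this; key1, key2, idx and vars are names introduced here for illustration.

```r
df1 <- data.frame(company = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                  year = rep(2016:2018, 3),
                  income = c(95, 87, 93, NA, NA, 58, 102, 80, NA),
                  debt = c(43, 50, 51, NA, 37, 37, 53, NA, NA),
                  stringsAsFactors = FALSE)
df2 <- data.frame(company = c("A", "B", "C", "C"),
                  year = c(2018, 2016, 2017, 2018),
                  income = c(60, 55, 80, 82),
                  debt = c(32, 37, 53, 48),
                  stringsAsFactors = FALSE)

# Match rows of df1 to rows of df2 on a combined company/year key.
key1 <- paste(df1$company, df1$year)
key2 <- paste(df2$company, df2$year)
idx  <- match(key1, key2)          # NA where df2 has no matching row

# For each variable, fill only the cells that are NA and have a match.
vars <- c("income", "debt")
for (v in vars) {
  fill <- is.na(df1[[v]]) & !is.na(idx)
  df1[[v]][fill] <- df2[[v]][idx[fill]]
}
```

Adding more variables only requires extending vars; rows with no match in df2 (B 2017 here) keep their NA.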
I am trying to build a regression model with calibration periods. For that I want to split my time series into 4 roughly equal parts.
library(lubridate)
date_list = seq(ymd('2000-12-01'),ymd('2018-01-28'),by='day')
date_list = date_list[which(month(date_list) %in% c(12,1,2))]
testframe = as.data.frame(date_list)
testframe$values = seq (1, 120, length = nrow(testframe))
The testframe above is 18 seasons long and I want to divide it into 4 parts, meaning 2 periods of 4 winter seasons and 2 periods of 5 winter seasons.
My attempt was:
library(lubridate)
aj = year(testframe[1,1])
ej = year(testframe[nrow(testframe),1])
diff = ej - aj
But when I divide diff by 4, I get a fraction rather than whole seasons; I would need something like 4, 4, 5, 5 that I can use to extract the seasons. Any idea how to do that automatically?
You can start with something like this:
library(lubridate)
testframe$year_ <- year(testframe$date_list)
testframe$season <- getSeason(testframe$date_list)
If you're wondering about the origin of the getSeason() function, read this; it is not part of lubridate and has to be defined before running the line above. Now you can split the dataset into the four periods:
by4_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[1:4],]
by4_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[5:8],]
by5_1 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[9:13],]
by5_2 <- testframe[testframe$year_ %in% as.data.frame(table(testframe$year_))$Var1[14:18],]
Now you can test it, for example:
table(by4_1$year_, by4_1$season)
Fall Winter
2000 14 17
2001 14 76
2002 14 76
2003 14 76
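To get the 4, 4, 5, 5 split automatically instead of hard-coding the index ranges, one possible sketch in base R is shown below. It assumes a winter season is labelled by the year of its December (so January and February 2001 belong to season 2000); season, sizes and period are names introduced here for illustration.

```r
# Rebuild the winter-only dates from the question with base R.
dates <- seq(as.Date("2000-12-01"), as.Date("2018-01-28"), by = "day")
dates <- dates[as.integer(format(dates, "%m")) %in% c(12, 1, 2)]
mon   <- as.integer(format(dates, "%m"))
yr    <- as.integer(format(dates, "%Y"))

# Assumption: a season is named after the year of its December.
season  <- ifelse(mon == 12, yr, yr - 1L)
seasons <- sort(unique(season))

n <- length(seasons)   # 18 seasons
k <- 4                 # number of periods wanted

# Every period gets n %/% k seasons; the last n %% k periods get one extra.
sizes <- n %/% k + (seq_len(k) > k - n %% k)   # 4 4 5 5

# Map each season to its period, then each date to its season's period.
period_of_season <- rep(seq_len(k), times = sizes)
period <- period_of_season[match(season, seasons)]

# One data frame per calibration period.
parts <- split(data.frame(date = dates, season = season), period)
```

Because the split works on whole seasons, a December is never separated from the January and February that follow it.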
I am trying to read some data from the Roper Center into R to do some analysis with it. The older data sometimes comes only in ASCII format: it is just a data file of numbers, sometimes with no spaces or delimiters. Also, every person has several rows. Here is an example:
0001 01 06722121 101632 3113581R50106 050110M323
0001 0202089917300208991744 100154109020B73013.22 1O
0001 039049MON FEB 8 1999 05:30pm 1 8 0208991830 6:30PM 05071
0001 04 5 51
0001 052206 32 1 21 111
0001 06 1122223413323 1122160921080711122112 11
0001 0722221205111223241121212220612111111122 21 2222
0002 01 09318035 001582 2123551R00106 0501I333
0002 0202089917320208991746 50074616080B42014.20 1O
0002 039039MON FEB 8 1999 05:31pm 1 8 0208991831 6:31PM 05041
0002 04 2 61
0002 05 206 32 3 11 121
0002 06 1245545554555 1152080614031221121131 11
0002 0752321202112112322112434410722131242122 21 122222
I changed some numbers in there; hopefully I didn't mess it up, but I think you need a subscription to the Roper Center to get this data.
I need to extract several elements for each respondent and put them into columns. I'll be doing this many times, so code that only works for this case is not practical.
I have been using the readr package in R so far, but now that there are many rows per person it's becoming more complicated, and I wondered if anyone knew of a fast way to handle this with an R package or a simple function.
A good example would be to get all of the weights in this sample. Those occur in columns 13-15 and are found in the first row for each person.
Cool solution: your files come with a dictionary of fixed widths, right? In that case, use readr::read_fwf
Ugly solution below. Will probably choke if you have a lot of data, and might (no, will) fail to separate some variables.
x designates your ASCII file.
library(dplyr)
library(readr)
library(stringr)  # provides str_sub()
x <- read_lines(x)
x <- tibble(
  uid = str_sub(x, 1, 4), # careful here, assuming UIDs are 4-length
  txt = str_sub(x, 8) # careful here too
)
x <- lapply(unique(x$uid), function(y) {
  # join all rows of one respondent, then split on whitespace into one wide row
  paste0(x$txt[ x$uid == y], collapse = " ") %>%
    strsplit("\\s+") %>%
    unlist %>%
    matrix(ncol = length(.)) %>%
    as.data.frame(stringsAsFactors = FALSE)
}) %>%
  bind_rows %>%
  write_csv("whatever.csv")
You can now reimport the data with neat variable names and set the correct column types:
x <- read_csv("whatever.csv",
  skip = 1, # skip the V1, V2, ... header row written by write_csv
  col_names = c(
    # column names
  ),
  col_types = "cccciiii -- etc.")
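For the specific example in the question (weights in columns 13-15 of each respondent's first row), a small base R sketch using fixed character positions may be enough. It assumes the respondent ID sits in columns 1-4 and the record number in columns 6-7, as the sample suggests; the three lines below are copied from the sample, whose spacing may not match the real file exactly, so the positions should be checked against the codebook.

```r
# Three lines copied from the sample in the question (two respondents).
# In practice: lines <- readLines("your_file.dat")  # hypothetical file name
lines <- c(
  "0001 01 06722121 101632 3113581R50106 050110M323",
  "0001 0202089917300208991744 100154109020B73013.22 1O",
  "0002 01 09318035 001582 2123551R00106 0501I333"
)

uid    <- substr(lines, 1, 4)   # respondent ID (assumed columns 1-4)
record <- substr(lines, 6, 7)   # record number within respondent (assumed columns 6-7)

# Keep only the first record of each respondent and cut out columns 13-15.
first <- record == "01"
weights <- data.frame(
  uid    = uid[first],
  weight = substr(lines[first], 13, 15),
  stringsAsFactors = FALSE
)
```

The same pattern (filter on the record number, then substr() with the codebook positions) extends to fields on other records, and readr::read_fwf with fwf_positions() does the same job for many fields at once.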
So initially I had the following object:
> head(gs)
year disturbance lek_id complex tot_male
1 2006 N 3T Diamond 3
2 2007 N 3T Diamond 17
3 1981 N bare 3corners 4
4 1982 N bare 3corners 7
5 1983 N bare 3corners 2
6 1985 N bare 3corners 5
With that I computed general statistics (min, max, mean, and sd) of tot_male for each year within complex. I used R's data splitting functions, assigned logical column names where it seemed appropriate, and ultimately stored the results as separate objects.
> tyc_min = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=min)
> names(tyc_min) = c("year", "complex", "tot_male_min")
> tyc_max = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=max)
> names(tyc_max) = c("year", "complex", "tot_male_max")
> tyc_mean = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=mean)
> names(tyc_mean) = c("year", "complex", "tot_male_mean")
> tyc_sd = aggregate(gs$tot_male, by=list(gs$year, gs$complex), FUN=sd)
> names(tyc_sd) = c("year", "complex", "tot_male_sd")
Example output (2nd object, tyc_max):
year complex tot_male_max
1 2003 0
2 1970 3corners 26
3 1971 3corners 22
4 1972 3corners 26
5 1973 3corners 32
6 1974 3corners 18
Now I need to add the number of samples per year/complex combination as well. Then I need to merge these into a single data object and export it as a .csv file.
I know I need to use the merge() function along with all.y, but I have no idea how to handle this error:
Error in fix.by(by.x, x) :
'by' must specify one or more columns as numbers, names or logical
I also don't know how to add the number of samples per year and complex. Any suggestions?
This might work (but hard to check without a reproducible example):
gsnew <- Reduce(function(...) merge(..., all = TRUE, by = c("year","complex")),
list(tyc_min, tyc_max, tyc_mean, tyc_sd))
But instead of aggregating for the separate statistics and then merging, you can also aggregate everything at once into a new dataframe / datatable with for example data.table, dplyr or base R. Then you don't have to merge afterwards (for a base R solution see the other answer):
# data.table
library(data.table)
gsnew <- setDT(gs)[, .(male_min = min(tot_male),
                       male_max = max(tot_male),
                       male_mean = mean(tot_male),
                       male_sd = sd(tot_male)), by = .(year, complex)]
# dplyr
library(dplyr)
gsnew <- gs %>% group_by(year, complex) %>%
  summarise(male_min = min(tot_male),
            male_max = max(tot_male),
            male_mean = mean(tot_male),
            male_sd = sd(tot_male))
mystat <- function(x) c(mi=min(x), ma=max(x))
aggregate(Sepal.Length~Species, FUN=mystat, data=iris)
For your data (l = length(x) gives the number of samples per group):
mystat <- function(x) c(mi=min(x), ma=max(x), m=mean(x), s=sd(x), l=length(x))
aggregate(tot_male~year+complex, FUN=mystat, data=gs)
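One thing to be aware of with this approach: when FUN returns a vector, aggregate() stores the statistics in a single matrix column rather than in separate columns. A common idiom to flatten it before exporting is do.call(data.frame, ...). The sketch below uses the built-in iris data, as above, because gs is not available here; res and flat are names introduced for illustration.

```r
mystat <- function(x) c(mi = min(x), ma = max(x), m = mean(x), s = sd(x), l = length(x))
res <- aggregate(Sepal.Length ~ Species, FUN = mystat, data = iris)

# res has only two columns: Species and a 5-column matrix called Sepal.Length.
ncol(res)

# Flatten the matrix column into ordinary columns named Sepal.Length.mi, etc.
flat <- do.call(data.frame, res)
names(flat)

# Export; replace the temporary file with your own path.
write.csv(flat, file = tempfile(fileext = ".csv"), row.names = FALSE)
```

Applied to the question's data, the same two steps on aggregate(tot_male~year+complex, FUN=mystat, data=gs) should give one row per year/complex combination with the count in the column ending in .l.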