icc on dataframe with row for each rater - r

Let me start off by saying I'm completely new to R and trying to figure out how to run icc on my specific dataset, which might be a bit different than usual.
The dataset looks as follows:
+------------+------------------+--------------+--------------+--------------+
| date       | measurement_type | measurement1 | measurement2 | measurement3 |
+------------+------------------+--------------+--------------+--------------+
| 25-04-2020 | 1                | 15.5         | 34.3         | 43.2         |
| 25-04-2020 | 2                | 21.2         | 12.3         | 2.2          |
| 25-04-2020 | 3                | 16.2         | 9.6          | 43.3         |
| 25-04-2020 | 4                | 27           | 1            | 6            |
+------------+------------------+--------------+--------------+--------------+
Now I want to run icc on all of those rows, since each row stands for a different rater; the date and measurement_type columns should be left out.
Can someone point me in the right direction? I have absolutely no idea how to go about this.
------- EDIT -------
I exported the actual dataset, filled with some test data, which is available here.
The two important sheets are the first and the third.
The first contains all the participants of the research and the third contains all four different reports for each participant. The code I have so far, just to tie each report to the correct participant:
library("XLConnect")
library("sqldf")
library("irr")
library("dplyr")
library("tidyr")
# Load in Workbook
wb = loadWorkbook("Measuring.xlsx")
# Load in Worksheet
# Sheet 1 = Study Results
# Sheet 3 = Meetpunten
records = readWorksheet(wb, sheet=1)
reports = readWorksheet(wb, sheet=3)
for (record in 1:nrow(records)) {
  recordId = records[record, 'Record.Id']
  participantReports = sqldf(sprintf("select * from reports where `Record.Id` = '%s'", recordId))
  baselineReport = sqldf("select * from participantReports where measurement_type = '1'")
  drinkReport = sqldf("select * from participantReports where measurement_type = '2'")
  regularReport = sqldf("select * from participantReports where measurement_type = '3'")
  exerciseReport = sqldf("select * from participantReports where measurement_type = '4'")
}

Since in your data each row stands for a different rater, but the icc function in the irr package needs the raters to be columns, you can ignore the first two columns of your table, transpose it, and run icc.
So, assuming this table:
+------------+------------------+--------------+--------------+--------------+
| date       | measurement_type | measurement1 | measurement2 | measurement3 |
+------------+------------------+--------------+--------------+--------------+
| 25-04-2020 | 1                | 15.5         | 34.3         | 43.2         |
| 25-04-2020 | 2                | 21.2         | 12.3         | 2.2          |
| 25-04-2020 | 3                | 16.2         | 9.6          | 43.3         |
| 25-04-2020 | 4                | 27           | 1            | 6            |
+------------+------------------+--------------+--------------+--------------+
is stored in a variable called data, I would do it like this:
data2 = data.matrix(data[,-c(1,2)]) # generates the dataset without the first two columns
data2 is this table:
+--------------+--------------+--------------+
| measurement1 | measurement2 | measurement3 |
+--------------+--------------+--------------+
| 15.5         | 34.3         | 43.2         |
| 21.2         | 12.3         | 2.2          |
| 16.2         | 9.6          | 43.3         |
| 27           | 1            | 6            |
+--------------+--------------+--------------+
Then:
data2 = t(data2) # transpose data2 so that the raters are in the columns and their ratings in the rows
icc(data2) # I'm not bothering with the parameters here, but you should explore the appropriate icc parameters for your needs
should generate a correct run.
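For completeness, here is a minimal end-to-end sketch of that recipe on the example table above; the icc() arguments shown (model, type, unit) are only illustrative choices, not a recommendation for your study design:

library(irr)

data <- data.frame(
  date = rep("25-04-2020", 4),
  measurement_type = 1:4,
  measurement1 = c(15.5, 21.2, 16.2, 27),
  measurement2 = c(34.3, 12.3, 9.6, 1),
  measurement3 = c(43.2, 2.2, 43.3, 6)
)

ratings <- t(data.matrix(data[, -c(1, 2)]))  # drop date/measurement_type, then transpose
icc(ratings, model = "twoway", type = "agreement", unit = "single")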


How to summarize data in R (dplyr) and avoid duplicate identifiers? [duplicate]

I'm trying to identify the lowest rate over a range of years for a number of items (ID).
In addition, I would like to know the Year the lowest rate was pulled from.
I'm grouping by ID, but I run into an issue when rates are duplicated across years.
sample data
df <- data.frame(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
                 Year = rep(2010:2012, 4),
                 Rate = c(0.3, 0.6, 0.9,
                          0.8, 0.5, 0.2,
                          0.8, 0.4, 0.9,
                          0.7, 0.7, 0.7))
sample data as table
| ID | Year | Rate |
|:--:|:----:|:----:|
| 1  | 2010 | 0.3  |
| 1  | 2011 | 0.6  |
| 1  | 2012 | 0.9  |
| 2  | 2010 | 0.8  |
| 2  | 2011 | 0.5  |
| 2  | 2012 | 0.2  |
| 3  | 2010 | 0.8  |
| 3  | 2011 | 0.4  |
| 3  | 2012 | 0.9  |
| 4  | 2010 | 0.7  |
| 4  | 2011 | 0.7  |
| 4  | 2012 | 0.7  |
Using dplyr I grouped by ID, then found the lowest rate.
df.Summarise <- df %>%
  group_by(ID) %>%
  summarise(LowestRate = min(Rate))
This gives me the following
| ID | LowestRate |
| --- | --- |
| 1 | 0.3 |
| 2 | 0.2 |
| 3 | 0.4 |
| 4 | 0.7 |
However, I also need to know the year that data was pulled from.
This is what I would like my final result to look like:
| ID | Rate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2012 |
Here's where I ran into some issues.
Attempt #1: Include "Year" in the original dplyr code
df.Summarise2 <- df %>%
  group_by(ID) %>%
  summarise(LowestRate = min(Rate),
            Year = Year)
Error: Column `Year` must be length 1 (a summary value), not 3
Makes sense. I'm not summarizing "Year" at all. I just want to include that row's value for Year!
Attempt #2: Use mutate instead of summarise
df.Mutate <- df %>%
  group_by(ID) %>%
  mutate(LowestRate = min(Rate))
So that essentially returns my original dataframe, but with an extra column for LowestRate attached.
How would I go from this to what I want?
I tried to left_join / merge based on ID and Lowest Rate, but there's multiple matches for ID #4. Is there any way to only pick one match (row)?
df.joined <- left_join(df.Summarise,df,by = c("ID","LowestRate" = "Rate"))
df.joined as table
| ID | Rate | Year |
| --- | --- | --- |
| 1 | 0.3 | 2010 |
| 2 | 0.2 | 2012 |
| 3 | 0.4 | 2011 |
| 4 | 0.7 | 2010 |
| 4 | 0.7 | 2011 |
| 4 | 0.7 | 2012 |
I've tried looking online, but I can't really find anything that addresses this exactly.
Using ".drop = FALSE" for group_by() didn't help, as it seems to be intended for empty groups?
The dataset I'm working with is large, so I'd really like to find how to make this work and avoid hard-coding anything :)
Thanks for any help!
You can group by ID and then filter without summarizing, and that way you'll preserve all columns but still only keep the min value:
df %>%
  group_by(ID) %>%
  filter(Rate == min(Rate))
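One caveat, visible in the sample data above: when several years tie for the minimum (like ID 4 here), filter() keeps all of them. A hedged variant that keeps exactly one row per ID, assuming dplyr >= 1.0 for slice_min():

library(dplyr)

df %>%
  group_by(ID) %>%
  slice_min(Rate, n = 1, with_ties = FALSE) %>%  # with_ties = FALSE keeps a single row per group
  ungroup()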

R - Join two dataframes based on date difference

Let's consider two dataframes, df1 and df2. I would like to join the dataframes based on the date difference only. For example:
Dataframe 1: (df1)
| version_id | date_invoiced | product_id |
|------------|---------------|------------|
| 1          | 03-07-2020    | 201        |
| 1          | 02-07-2020    | 2013       |
| 3          | 02-07-2020    | 2011       |
| 6          | 01-07-2020    | 2018       |
| 7          | 01-07-2020    | 201        |
Dataframe 2: (df2)
| validfrom  | pricelist | pricelist_id |
|------------|-----------|--------------|
| 02-07-2020 | 10        | 101          |
| 01-07-2020 | 20        | 102          |
| 29-06-2020 | 30        | 103          |
| 28-07-2020 | 10        | 104          |
| 25-07-2020 | 5         | 105          |
I need to map the pricelist_id and the pricelist based on the validfrom column present in df2: the row with the least difference between date_invoiced (df1) and validfrom (df2) should be mapped.
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
|------------|---------------|------------|-----------|--------------|-----------|
| 1          | 03-07-2020    | 201        | 1         | 101          | 10        |
| 1          | 02-07-2020    | 2013       | 1         | 102          | 20        |
| 3          | 02-07-2020    | 2011       | 1         | 102          | 20        |
| 6          | 01-07-2020    | 2018       | 1         | 103          | 30        |
| 7          | 01-07-2020    | 201        | 1         | 103          | 30        |
I need to map purely based on the date difference, and that difference should be the smallest: each date_invoiced (df1) should be matched to the closest validfrom (df2). Thanks
Perhaps you might want to try using data.table and a rolling join with roll = "nearest". Here, the join is made on DATE, which is DATEINVOICED from df1 and VALIDFROM from df2.
library(data.table)
setDT(df1)
setDT(df2)
df1$DATEINVOICED <- as.Date(df1$DATEINVOICED, format = "%d-%m-%Y")  # four-digit years, so %Y rather than %y
df2$VALIDFROM <- as.Date(df2$VALIDFROM, format = "%d-%m-%Y")
setkey(df1, DATEINVOICED)[, DATE := DATEINVOICED]
setkey(df2, VALIDFROM)[, DATE := VALIDFROM]
df2[df1, on = "DATE", roll='nearest']
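To make this concrete, here is a self-contained sketch on the question's data; note it uses the question's lower-case column names (date_invoiced, validfrom) rather than the upper-case names assumed above:

library(data.table)

df1 <- data.table(version_id = c(1, 1, 3, 6, 7),
                  date_invoiced = as.Date(c("03-07-2020", "02-07-2020", "02-07-2020",
                                            "01-07-2020", "01-07-2020"), "%d-%m-%Y"),
                  product_id = c(201, 2013, 2011, 2018, 201))
df2 <- data.table(validfrom = as.Date(c("02-07-2020", "01-07-2020", "29-06-2020",
                                        "28-07-2020", "25-07-2020"), "%d-%m-%Y"),
                  pricelist = c(10, 20, 30, 10, 5),
                  pricelist_id = 101:105)

df1[, DATE := date_invoiced]                    # join helper column
df2[, DATE := validfrom]
res <- df2[df1, on = "DATE", roll = "nearest"]  # each invoice gets the nearest price list row
res[, date_diff := abs(as.integer(date_invoiced - validfrom))]
res[]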

Creating a new table that shows the percent change between two different categories from a single column in R

I'm trying to learn how to use some of the functions in the R "reshape2" package, specifically dcast. I'm trying to create a table that shows the aggregate sum (the sum of one category of data for all files divided by the max "RepNum" in one "Case") for two software versions and the percent change between the two.
Here's what my data set looks like (example data):
| FileName | Version | Category | Value | TestNum | RepNum | Case |
|:--------:|:-------:|:---------:|:-----:|:-------:|:------:|:-----:|
| File1 | 1.0.18 | Category1 | 32.5 | 11 | 1 | Case1 |
| File1 | 1.0.18 | Category1 | 31.5 | 11 | 2 | Case1 |
| File1 | 1.0.18 | Category2 | 32.3 | 11 | 1 | Case1 |
| File1 | 1.0.18 | Category2 | 31.4 | 11 | 2 | Case1 |
| File2 | 1.0.18 | Category1 | 34.6 | 11 | 1 | Case1 |
| File2 | 1.0.18 | Category1 | 34.7 | 11 | 2 | Case1 |
| File2 | 1.0.18 | Category2 | 34.5 | 11 | 1 | Case1 |
| File2 | 1.0.18 | Category2 | 34.6 | 11 | 2 | Case1 |
| File1 | 1.0.21 | Category1 | 31.7 | 12 | 1 | Case1 |
| File1 | 1.0.21 | Category1 | 32.0 | 12 | 2 | Case1 |
| File1 | 1.0.21 | Category2 | 31.5 | 12 | 1 | Case1 |
| File1 | 1.0.21 | Category2 | 32.4 | 12 | 2 | Case1 |
| File2 | 1.0.21 | Category1 | 31.5 | 12 | 1 | Case1 |
| File2 | 1.0.21 | Category1 | 34.6 | 12 | 2 | Case1 |
| File2 | 1.0.21 | Category2 | 31.7 | 12 | 1 | Case1 |
| File2 | 1.0.21 | Category2 | 32.4 | 12 | 2 | Case1 |
| File1 | 1.0.18 | Category1 | 32.0 | 11 | 1 | Case2 |
| File1 | 1.0.18 | Category1 | 34.6 | 11 | 2 | Case2 |
| File1 | 1.0.18 | Category2 | 34.6 | 11 | 1 | Case2 |
| File1 | 1.0.18 | Category2 | 34.7 | 11 | 2 | Case2 |
| File2 | 1.0.18 | Category1 | 32.3 | 11 | 1 | Case2 |
| File2 | 1.0.18 | Category1 | 34.7 | 11 | 2 | Case2 |
| File2 | 1.0.18 | Category2 | 31.4 | 11 | 1 | Case2 |
| File2 | 1.0.18 | Category2 | 32.3 | 11 | 2 | Case2 |
| File1 | 1.0.21 | Category1 | 32.4 | 12 | 1 | Case2 |
| File1 | 1.0.21 | Category1 | 34.7 | 12 | 2 | Case2 |
| File1 | 1.0.21 | Category2 | 31.5 | 12 | 1 | Case2 |
| File1 | 1.0.21 | Category2 | 34.6 | 12 | 2 | Case2 |
| File2 | 1.0.21 | Category1 | 31.7 | 12 | 1 | Case2 |
| File2 | 1.0.21 | Category1 | 31.4 | 12 | 2 | Case2 |
| File2 | 1.0.21 | Category2 | 34.5 | 12 | 1 | Case2 |
| File2 | 1.0.21 | Category2 | 31.5 | 12 | 2 | Case2 |
The actual data set has 6 unique files, the two most recent "TestNums & Versions", 2 unique categories, and 4 unique cases.
Using the magic of the internet, I was able to cobble together a table that looks like this for a different need (but the code should be similarish):
| FileName | Category | 1.0.1 | 1.0.2 | PercentChange |
|:--------:|:---------:|:-----:|:-----:|:-------------:|
| File1 | Category1 | 18.19 | 18.18 | -0.0045808520 |
| File1 | Category2 | 18.05 | 18.06 | -0.0005075721 |
| File2 | Category1 | 19.27 | 18.83 | -0.0224913494 |
| File2 | Category2 | 19.13 | 18.69 | -0.0231780146 |
| File3 | Category1 | 26.02 | 26.91 | 0.0342729019 |
| File3 | Category2 | 25.88 | 26.75 | 0.0335598775 |
| File4 | Category1 | 31.28 | 28.70 | -0.0823371327 |
| File4 | Category2 | 31.13 | 28.56 | -0.0826670833 |
| File5 | Category1 | 31.77 | 25.45 | -0.1999731215 |
| File5 | Category2 | 31.62 | 25.30 | -0.0117180458 |
| File6 | Category1 | 46.23 | 45.68 | -0.0119578545 |
| File6 | Category2 | 46.08 | 45.53 | -0.0045808520 |
This is the code that made that table:
# vLatest and vPrevious are variables holding the latest and second-latest version numbers
deviations <- subset(df, df$Version %in% c(vLatest, vPrevious))
deviationsCast <- dcast(deviations[, 1:4], FileName + Category ~ Version, value.var = "Value", fun.aggregate = mean)
deviationsCast$PercentChange <- (deviationsCast[, dim(deviationsCast)[2]] - deviationsCast[, dim(deviationsCast)[2] - 1]) / deviationsCast[, dim(deviationsCast)[2] - 1]
I'm really just hoping someone can help me understand the syntax of dcast. The initial generation of deviationsCast is where I'm most fuzzy on how everything works together. Instead of getting this per file, I really want the sum of all files for each category within a unique "Case", and to show the percent change between them.
| Case | Measure | 1.0.18 | 1.0.21 | PercentChange |
|:------:|:----------:|:------:|:------:|:-------------:|
| Case 1 | Category 1 | 110 | 100 | 9.09% |
| Case 2 | Category 1 | 95 | 89 | 9.32% |
| Case 3 | Category 1 | 92 | 84 | 8.70% |
| Case 4 | Category 1 | 83 | 75 | 9.64% |
| Case 1 | Category 2 | 112 | 101 | 9.82% |
| Case 2 | Category 2 | 96 | 89 | 7.29% |
| Case 3 | Category 2 | 94 | 86 | 8.51% |
| Case 4 | Category 2 | 83 | 76 | 8.43% |
Note: The rounding and the percent sign are a plus, but a very preferred plus.
The numbers do not reflect actual math done correctly; they're just random numbers I put in to show an example. I hopefully explained the math that I'm trying to do sufficiently.
Example dataset to test with
FileName <- rep(c("File1", "File2", "File3", "File4", "File5", "File6"), times = 8, each = 6)
Version <- rep(c("1.0.18", "1.0.21"), times = 4, each = 36)
Category <- rep(c("Category1", "Category2"), times = 48, each = 3)
Value <- rpois(n = 288, lambda = 32)
TestNum <- rep(11:12, times = 4, each = 36)
RepNum <- rep(1:3, times = 96)
Case <- rep(c("Case1", "Case2", "Case3", "Case4"), each = 72)
df <- data.frame(FileName, Version, Category, Value, TestNum, RepNum, Case)
It's worth noting that the df here is essentially what the deviations data frame is in the code above (with vLatest and vPrevious).
EDIT:
MrFlick's answer is almost perfect, but when I try to implement it on my actual dataset I run into problems. The issue is due to using vLatest and vPrevious as my Versions instead of just writing the strings. Here's the code I use to get those two variables:
vLatest <- unique(df[df[, "TestNum"] == max(df$TestNum), "Version"])
vPrevious <- unique(df[df[, "TestNum"] == sort(unique(df$TestNum), decreasing = TRUE)[2], "Version"])
And when I tried this:
pc <- function(a, b) (b - a) / a
summary <- df %>%
  group_by(Case, Category, Version) %>%
  summarize(Value = mean(Value)) %>%
  spread(Version, Value) %>%
  mutate(Change = scales::percent(pc(vPrevious, vLatest)))
I received this error: Error: non-numeric argument to binary operator
2nd EDIT:
I tried creating new variables for the two TestNum values (since they are numeric and wouldn't need to be factors).
maxTestNum <- max(df$TestNum)
prevTestNum <- sort(unique(df$TestNum), decreasing = TRUE)[2]
(The reason I don't use "prevTestNum<-maxTestNum-1" is because sometimes versions are omitted from the data results)
However, when I put those two variables into the code, the "Change" column is all the same value.
With the sample data set supplied by the OP, and from analysing the edits, I believe the following code might produce the desired result even with OP's production data set.
My understanding is that the OP has a data.frame with many test results but he wants only to show the relative change of the two most recent versions.
The OP has asked for help in using the dcast() function. This function is available from two packages, reshape2 and data.table. Here the data.table version is used for speed and concise code. In addition, functions from the forcats and formattable packages are used.
library(data.table)  # CRAN version 1.10.4 used

# coerce to data.table object
DT <- data.table(df)

# reorder factor levels of Version according to TestNum
DT[, Version := forcats::fct_reorder(Version, TestNum)]

# determine the two most recent Versions
# trick: pick the 1st and 2nd entries of the _reversed_ levels
vLatest <- DT[, rev(levels(Version))[1L]]
vPrevious <- DT[, rev(levels(Version))[2L]]

# filter DT, reshape from long to wide format,
# compute the change for the selected columns using get(),
# use the formattable package for pretty printing
summary <- dcast(
  DT[Version %in% c(vLatest, vPrevious)],
  Case + Category ~ Version, mean, value.var = "Value")[
    , PercentChange := formattable::percent(get(vLatest) / get(vPrevious) - 1.0)]
summary
Case Category 1.0.18 1.0.21 PercentChange
1: Case1 Category1 33.00000 31.94444 -3.20%
2: Case1 Category2 31.83333 31.83333 0.00%
3: Case2 Category1 33.05556 33.61111 1.68%
4: Case2 Category2 30.77778 32.94444 7.04%
5: Case3 Category1 33.16667 31.94444 -3.69%
6: Case3 Category2 33.44444 33.72222 0.83%
7: Case4 Category1 30.83333 34.66667 12.43%
8: Case4 Category2 32.27778 33.44444 3.61%
Explanations
Sorting Version
The OP has recognized that simply sorting Version alphabetically doesn't ensure the proper order. This can be demonstrated by
sort(paste0("0.0.", 0:12))
[1] "0.0.0" "0.0.1" "0.0.10" "0.0.11" "0.0.12" "0.0.2" "0.0.3" "0.0.4" "0.0.5"
[10] "0.0.6" "0.0.7" "0.0.8" "0.0.9"
where 0.0.10 comes before 0.0.2.
This is crucial as data.frame() turns character variables into factors by default (the default behaviour in R versions before 4.0).
Fortunately, TestNum is associated with Version, so TestNum is used to reorder the factor levels of Version with the help of the fct_reorder() function from the forcats package.
This also ensures that dcast() creates the new columns in the appropriate order.
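A small sketch of what the releveling buys you (the version strings here are made up for illustration):

v <- factor(c("1.0.9", "1.0.10"))
levels(v)                                   # alphabetical: "1.0.10" "1.0.9" -- wrong order
levels(forcats::fct_reorder(v, c(11, 12)))  # reordered by TestNum: "1.0.9" "1.0.10"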
Accessing columns through variables
Using vLatest / vPrevious directly in an expression returns the error message
Error in vLatest/vPrevious : non-numeric argument to binary operator
This is to be expected because vLatest and vPrevious contain the character values "1.0.21" and "1.0.18", respectively, which can't be divided. What is meant here is: take the values of the columns whose names are given by vLatest and vPrevious, and divide those. This is achieved by using get().
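A minimal illustration of get() inside a data.table expression (the two-column table here is hypothetical):

library(data.table)
dt <- data.table(`1.0.18` = c(32, 30), `1.0.21` = c(31, 33))
v1 <- "1.0.18"
v2 <- "1.0.21"
dt[, change := get(v2) / get(v1) - 1.0][]  # get() resolves the column names stored in v1/v2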
Formatting as percent
While scales::percent() returns a character vector, formattable::percent() returns a numeric vector with a percent representation, i.e., we're still able to do numeric calculations with it.
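A quick demonstration (values made up):

library(formattable)
p <- percent(c(-0.032, 0.0704))
p              # prints as -3.20% 7.04%
as.numeric(p)  # the plain numbers are still there: -0.032 0.0704
p * 2          # numeric calculations remain possible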
Data
As given by the OP:
FileName <- rep(c("File1", "File2", "File3", "File4", "File5", "File6"),
times = 8, each = 6)
Version <- rep(c("1.0.18", "1.0.21"), times = 4, each = 36)
Category <- rep(c("Category1", "Category2"), times = 48, each = 3)
Value <- rpois(n = 288, lambda = 32)
TestNum <- rep(11:12, times = 4, each = 36)
RepNum <- rep(1:3, times = 96)
Case <- rep(c("Case1", "Case2", "Case3", "Case4"), each = 72)
df <- data.frame(FileName, Version, Category, Value, TestNum, RepNum, Case)

Generate a table in R

I'm learning R and have this practice project.
I have a table like this (read from a csv file), but with a lot more lines:
+-----------+-----------------+
| Home type | Gas consumption |
+-----------+-----------------+
| 1         | 31,2            |
| 2         | 51,3            |
| 3         | 40,4            |
| 3         | 100,0           |
| 2         | 34,6            |
| 1         | 16,0            |
+-----------+-----------------+
I want to create and exhibit a table like this:
+----------+----------+----------+----------+
| Measures | 1        | 2        | 3        |
+----------+----------+----------+----------+
| Mean     |          |          |          |
| Median   |          |          |          |
| Min      |          |          |          |
| Max      |          |          |          |
| Q1       |          |          |          |
| Q3       |          |          |          |
+----------+----------+----------+----------+
In other words, I'd like to sort my data into columns, where column1 represents the gas consumption of type 1 houses, column2 represents the gas consumption of type 2 houses and so on.
Then I want to compute the mean, median, min, max, Q1 and Q3 of each column and display them as shown above.
Could you at least guide me?
First some dummy data:
d <- data.frame("Home Type" = c(1, 2, 3, 3, 2, 1),
                "Gas Consumption" = c(31.2, 51.3, 40.4, 100.0, 34.6, 16.0))
Create a function that summarizes a vector with your requested metrics:
stats <- function(x) c(Mean = mean(x), Median = median(x), Min = min(x), Max = max(x),
                       Q1 = quantile(x, 0.25), Q3 = quantile(x, 0.75))
Split the variable of interest by Home Type and apply the function to each group:
> data.frame(lapply(split(d$Gas.Consumption, d$Home.Type), stats), check.names = FALSE)
1 2 3
Mean 23.6 42.950 70.2
Median 23.6 42.950 70.2
Min 16.0 34.600 40.4
Max 31.2 51.300 100.0
Q1.25% 19.8 38.775 55.3
Q3.75% 27.4 47.125 85.1
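The "Q1.25%" / "Q3.75%" row labels come from quantile() attaching its own names; if that bothers you, a small tweak to the helper, stripping them with unname(), keeps the labels clean:

stats <- function(x) c(Mean = mean(x), Median = median(x), Min = min(x), Max = max(x),
                       Q1 = unname(quantile(x, 0.25)), Q3 = unname(quantile(x, 0.75)))
data.frame(lapply(split(d$Gas.Consumption, d$Home.Type), stats), check.names = FALSE)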

Copy column data when function unaggregates a single row into multiple in R

I need help taking an annual total (for each of many initiatives) and breaking it down into each month using a simple division formula. I need to do this for each distinct combination of a few columns, while copying down the columns whose values are broken out from an annual total to monthly totals. The loop will apply the formula to two columns and loop through each distinct group in a vector. I've tried to explain with the example below, as it's somewhat complex.
What I have :
| Init | Name | Date | Total Savings | Total Costs |
|------|------|------|---------------|-------------|
| A    | John | 2015 | TotalD        | TotalD      |
| A    | Mike | 2015 | TotalE        | TotalE      |
| A    | Rob  | 2015 | TotalF        | TotalF      |
| B    | John | 2015 | TotalG        | TotalG      |
| B    | Mike | 2015 | TotalH        | TotalH      |

......

| Init | Name | Date | Total Savings | Total Costs |
|------|------|------|---------------|-------------|
| A    | John | 2016 | TotalI        | TotalI      |
| A    | Mike | 2016 | TotalJ        | TotalJ      |
| A    | Rob  | 2016 | TotalK        | TotalK      |
| B    | John | 2016 | TotalL        | TotalL      |
| B    | Mike | 2016 | TotalM        | TotalM      |
I'm going to loop a function over each row to take the "Total Savings" and "Total Costs" and divide by 12 where Date = 2015 and by 9 where Date = 2016 (YTD to September), creating an individual row for each month. I'm essentially breaking an annual total in one row out into a row for each month of the year. I need help making that loop also copy the "Init" and "Name" columns for each distinct "Init"/"Name" combination. Also note that the division formula differs by year. I suppose I could separate the datasets for 2015 and 2016, use two different functions, and merge, if that would be easier. Below is the desired output:
| Init | Name | Date       | Monthly Savings | Monthly Costs |
|------|------|------------|-----------------|---------------|
| A    | John | 01-01-2015 | TotalD/12*      | MonthD        |
| A    | John | 02-01-2015 | MonthD          | MonthD        |
| A    | John | 03-01-2015 | MonthD          | MonthD        |
...
| A    | Mike | 01-01-2016 | TotalE/9*       | MonthE        |
| A    | Mike | 02-01-2016 | MonthE          | MonthE        |
| A    | Mike | 03-01-2016 | MonthE          | MonthE        |
...
| B    | John | 01-01-2015 | TotalG/12*      | MonthG        |
| B    | John | 02-01-2015 | MonthG          | MonthG        |
| B    | John | 03-01-2015 | MonthG          | MonthG        |
TotalD/12* = MonthD - this is the formula for 2015
TotalE/9* = MonthE - this is the formula for 2016
Any help would be appreciated...
As a start, here are some reproducible data, with the columns described:
myData <-
  data.frame(
    Init = rep(LETTERS[1:3], each = 4)
    , Name = rep(c("John", "Mike"), each = 2)
    , Date = 2015:2016
    , Savings = (1:12) * 1200
    , Cost = (1:12) * 2400
  )
Next, set the divisor to be used for each year:
toDivide <-
c("2015" = 12, "2016" = 9)
Then, I am using the magrittr pipe: I split the data up into single rows, loop through them with lapply to expand each row into the appropriate number of rows (9 or 12) with the savings and costs divided by the number of months, and finally stitch the rows back together with dplyr's bind_rows.
library(dplyr)  # provides the pipe and bind_rows()

myData %>%
  split(1:nrow(.)) %>%
  lapply(function(x){
    data.frame(
      Init = x$Init
      , Name = x$Name
      , Date = as.Date(paste(x$Date
                             , formatC(1:toDivide[as.character(x$Date)]
                                       , width = 2, flag = "0")
                             , "01"
                             , sep = "-"))
      , Savings = x$Savings / toDivide[as.character(x$Date)]
      , Cost = x$Cost / toDivide[as.character(x$Date)]
    )
  }) %>%
  bind_rows()
The head of this looks like:
Init Name Date Savings Cost
1 A John 2015-01-01 100.0000 200.0000
2 A John 2015-02-01 100.0000 200.0000
3 A John 2015-03-01 100.0000 200.0000
4 A John 2015-04-01 100.0000 200.0000
5 A John 2015-05-01 100.0000 200.0000
6 A John 2015-06-01 100.0000 200.0000
with similar entries for each expanded row.
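For comparison, a more compact alternative sketch using tidyr::uncount() (assuming tidyr >= 0.8 for the .id argument), which replicates each row by its month count and numbers the copies; myData and toDivide are as defined above:

library(dplyr)
library(tidyr)

myData %>%
  mutate(months = toDivide[as.character(Date)],  # 12 for 2015, 9 for 2016
         Savings = Savings / months,
         Cost = Cost / months) %>%
  uncount(months, .id = "month") %>%             # one row per month; month = 1, 2, ...
  mutate(Date = as.Date(sprintf("%d-%02d-01", as.integer(Date), month))) %>%
  select(-month)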
