Creating a new table that shows the percent change between two different categories from a single column in R

I'm trying to learn how to use some of the functions in the R "reshape2" package, specifically dcast. I'm trying to create a table that shows the aggregate sum (the sum of one category of data for all files divided by the max "RepNum" in one "Case") for two software versions and the percent change between the two.
Here's what my data set looks like (example data):
| FileName | Version | Category | Value | TestNum | RepNum | Case |
|:--------:|:-------:|:---------:|:-----:|:-------:|:------:|:-----:|
| File1 | 1.0.18 | Category1 | 32.5 | 11 | 1 | Case1 |
| File1 | 1.0.18 | Category1 | 31.5 | 11 | 2 | Case1 |
| File1 | 1.0.18 | Category2 | 32.3 | 11 | 1 | Case1 |
| File1 | 1.0.18 | Category2 | 31.4 | 11 | 2 | Case1 |
| File2 | 1.0.18 | Category1 | 34.6 | 11 | 1 | Case1 |
| File2 | 1.0.18 | Category1 | 34.7 | 11 | 2 | Case1 |
| File2 | 1.0.18 | Category2 | 34.5 | 11 | 1 | Case1 |
| File2 | 1.0.18 | Category2 | 34.6 | 11 | 2 | Case1 |
| File1 | 1.0.21 | Category1 | 31.7 | 12 | 1 | Case1 |
| File1 | 1.0.21 | Category1 | 32.0 | 12 | 2 | Case1 |
| File1 | 1.0.21 | Category2 | 31.5 | 12 | 1 | Case1 |
| File1 | 1.0.21 | Category2 | 32.4 | 12 | 2 | Case1 |
| File2 | 1.0.21 | Category1 | 31.5 | 12 | 1 | Case1 |
| File2 | 1.0.21 | Category1 | 34.6 | 12 | 2 | Case1 |
| File2 | 1.0.21 | Category2 | 31.7 | 12 | 1 | Case1 |
| File2 | 1.0.21 | Category2 | 32.4 | 12 | 2 | Case1 |
| File1 | 1.0.18 | Category1 | 32.0 | 11 | 1 | Case2 |
| File1 | 1.0.18 | Category1 | 34.6 | 11 | 2 | Case2 |
| File1 | 1.0.18 | Category2 | 34.6 | 11 | 1 | Case2 |
| File1 | 1.0.18 | Category2 | 34.7 | 11 | 2 | Case2 |
| File2 | 1.0.18 | Category1 | 32.3 | 11 | 1 | Case2 |
| File2 | 1.0.18 | Category1 | 34.7 | 11 | 2 | Case2 |
| File2 | 1.0.18 | Category2 | 31.4 | 11 | 1 | Case2 |
| File2 | 1.0.18 | Category2 | 32.3 | 11 | 2 | Case2 |
| File1 | 1.0.21 | Category1 | 32.4 | 12 | 1 | Case2 |
| File1 | 1.0.21 | Category1 | 34.7 | 12 | 2 | Case2 |
| File1 | 1.0.21 | Category2 | 31.5 | 12 | 1 | Case2 |
| File1 | 1.0.21 | Category2 | 34.6 | 12 | 2 | Case2 |
| File2 | 1.0.21 | Category1 | 31.7 | 12 | 1 | Case2 |
| File2 | 1.0.21 | Category1 | 31.4 | 12 | 2 | Case2 |
| File2 | 1.0.21 | Category2 | 34.5 | 12 | 1 | Case2 |
| File2 | 1.0.21 | Category2 | 31.5 | 12 | 2 | Case2 |
The actual data set has 6 unique files, the two most recent "TestNums" and "Versions", 2 unique categories, and 4 unique cases.
Using the magic of the internet, I was able to cobble together a table that looks like this for a different need (but the code should be similar):
| FileName | Category | 1.0.1 | 1.0.2 | PercentChange |
|:--------:|:---------:|:-----:|:-----:|:-------------:|
| File1 | Category1 | 18.19 | 18.18 | -0.0045808520 |
| File1 | Category2 | 18.05 | 18.06 | -0.0005075721 |
| File2 | Category1 | 19.27 | 18.83 | -0.0224913494 |
| File2 | Category2 | 19.13 | 18.69 | -0.0231780146 |
| File3 | Category1 | 26.02 | 26.91 | 0.0342729019 |
| File3 | Category2 | 25.88 | 26.75 | 0.0335598775 |
| File4 | Category1 | 31.28 | 28.70 | -0.0823371327 |
| File4 | Category2 | 31.13 | 28.56 | -0.0826670833 |
| File5 | Category1 | 31.77 | 25.45 | -0.1999731215 |
| File5 | Category2 | 31.62 | 25.30 | -0.0117180458 |
| File6 | Category1 | 46.23 | 45.68 | -0.0119578545 |
| File6 | Category2 | 46.08 | 45.53 | -0.0045808520 |
This is the code that made that table:
vLatest and vPrevious are variables holding the latest and second-latest version numbers:
deviations <- subset(df, Version %in% c(vLatest, vPrevious))
deviationsCast <- dcast(deviations[, 1:4], FileName + Category ~ Version,
                        value.var = "Value", fun.aggregate = mean)
lastCol <- ncol(deviationsCast)
deviationsCast$PercentChange <- (deviationsCast[, lastCol] - deviationsCast[, lastCol - 1]) /
  deviationsCast[, lastCol - 1]
I'm really just hoping someone can help me understand the syntax of dcast. The initial generation of deviationsCast is where I'm fuzziest on how everything works together. Instead of getting this per file, I really want the sum over all files for each category within a unique "Case", and the percent change between the two versions.
| Case | Measure | 1.0.18 | 1.0.21 | PercentChange |
|:------:|:----------:|:------:|:------:|:-------------:|
| Case 1 | Category 1 | 110 | 100 | 9.09% |
| Case 2 | Category 1 | 95 | 89 | 9.32% |
| Case 3 | Category 1 | 92 | 84 | 8.70% |
| Case 4 | Category 1 | 83 | 75 | 9.64% |
| Case 1 | Category 2 | 112 | 101 | 9.82% |
| Case 2 | Category 2 | 96 | 89 | 7.29% |
| Case 3 | Category 2 | 94 | 86 | 8.51% |
| Case 4 | Category 2 | 83 | 76 | 8.43% |
Note: the rounding and the percent sign are a bonus, but a strongly preferred bonus.
The numbers do not reflect the actual math; they are just random numbers I put in as an example. Hopefully I've explained the math I'm trying to do sufficiently.
Example dataset to test with
FileName<-rep(c("File1","File2","File3","File4","File5","File6"),times=8,each=6)
Version<-rep(c("1.0.18","1.0.21"),times=4,each=36)
Category<-rep(c("Category1","Category2"),times=48,each=3)
Value<-rpois(n=288,lambda=32)
TestNum<-rep(11:12,times=4,each=36)
RepNum<-rep(1:3,times=96)
Case<-rep(c("Case1","Case2","Case3","Case4"),each=72)
df<-data.frame(FileName,Version,Category,Value,TestNum,RepNum,Case)
It's worth noting that the df here is essentially what the deviations data frame is in the code above (already filtered to vLatest and vPrevious).
EDIT:
MrFlick's answer is almost perfect, but when I try to implement it on my actual dataset I run into problems. The issue is due to using vLatest and vPrevious as my versions instead of writing the strings directly. Here's the code I use to get those two variables:
vLatest <- unique(df[df$TestNum == max(df$TestNum), "Version"])
vPrevious <- unique(df[df$TestNum == sort(unique(df$TestNum), decreasing = TRUE)[2], "Version"])
And when I tried this:
pc <- function(a, b) (b - a) / a
summary <- df %>%
  group_by(Case, Category, Version) %>%
  summarize(Value = mean(Value)) %>%
  spread(Version, Value) %>%
  mutate(Change = scales::percent(pc(vPrevious, vLatest)))
I received this error: Error: non-numeric argument to binary operator
2nd EDIT:
I tried creating new variables for the two TestNum values (since those are numeric and wouldn't involve factors).
maxTestNum <- max(df$TestNum)
prevTestNum <- sort(unique(df$TestNum), decreasing = TRUE)[2]
(The reason I don't use "prevTestNum<-maxTestNum-1" is because sometimes versions are omitted from the data results)
However, when I plug those two variables into the code, the "Change" column is all the same value.

With the sample data set supplied by the OP, and from analysing the edits, I believe the following code should produce the desired result even with the OP's production data set.
My understanding is that the OP has a data.frame with many test results but wants to show only the relative change between the two most recent versions.
The OP has asked for help in using the dcast() function. This function is available from two packages, reshape2 and data.table. Here the data.table version is used for speed and concise code. In addition, functions from the forcats and formattable packages are used.
library(data.table) # CRAN version 1.10.4 used
# coerce to data.table object
DT <- data.table(df)
# reorder factor levels of Version according to TestNum
DT[, Version := forcats::fct_reorder(Version, TestNum)]
# determine the two most recent Versions
# trick: pick 1st and 2nd entry of the _reversed_ levels
vLatest <- DT[, rev(levels(Version))[1L]]
vPrevious <- DT[, rev(levels(Version))[2L]]
# filter DT, reshape from long to wide format,
# compute change for the selected columns using get(),
# use formattable package for pretty printing
summary <- dcast(
  DT[Version %in% c(vLatest, vPrevious)],
  Case + Category ~ Version, mean, value.var = "Value")[
    , PercentChange := formattable::percent(get(vLatest) / get(vPrevious) - 1.0)]
summary
Case Category 1.0.18 1.0.21 PercentChange
1: Case1 Category1 33.00000 31.94444 -3.20%
2: Case1 Category2 31.83333 31.83333 0.00%
3: Case2 Category1 33.05556 33.61111 1.68%
4: Case2 Category2 30.77778 32.94444 7.04%
5: Case3 Category1 33.16667 31.94444 -3.69%
6: Case3 Category2 33.44444 33.72222 0.83%
7: Case4 Category1 30.83333 34.66667 12.43%
8: Case4 Category2 32.27778 33.44444 3.61%
Explanations
Sorting Version
The OP has recognized that simply sorting Version alphabetically doesn't ensure the proper order. This can be demonstrated by
sort(paste0("0.0.", 0:12))
[1] "0.0.0" "0.0.1" "0.0.10" "0.0.11" "0.0.12" "0.0.2" "0.0.3" "0.0.4" "0.0.5"
[10] "0.0.6" "0.0.7" "0.0.8" "0.0.9"
where 0.0.10 comes before 0.0.2.
This is crucial because data.frame() converts character variables to factors by default.
Fortunately, TestNum is associated with Version. So, TestNum is used to reorder the factor levels of Version with help of the fct_reorder() function from the forcats package.
This also ensures that dcast() creates the new columns in the appropriate order.
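A minimal sketch of the reordering (toy version strings and TestNum values, not the OP's data):

```r
library(forcats)

# alphabetical factor levels put "1.0.10" before "1.0.2"
v <- factor(c("1.0.9", "1.0.10", "1.0.2"))
levels(v)
#> [1] "1.0.10" "1.0.2"  "1.0.9"

# reorder the levels by the associated TestNum instead
testnum <- c(9, 10, 2)
v2 <- fct_reorder(v, testnum)
levels(v2)
#> [1] "1.0.2"  "1.0.9"  "1.0.10"
```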
Accessing columns through variables
Using vLatest / vPrevious in an expression returns the error message
Error in vLatest/vPrevious : non-numeric argument to binary operator
This is to be expected because vLatest and vPrevious contain the character values "1.0.21" and "1.0.18", respectively, which can't be divided. What is meant here is: take the values of the columns whose names are given by vLatest and vPrevious, and divide those. This is achieved by using get().
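A small sketch of the difference, using a hypothetical two-column table:

```r
library(data.table)

dt <- data.table(`1.0.18` = c(10, 20), `1.0.21` = c(11, 18))
vPrevious <- "1.0.18"
vLatest   <- "1.0.21"

# vLatest / vPrevious            # fails: divides two character strings
# get() looks up the columns named by the variables instead
dt[, Change := get(vLatest) / get(vPrevious) - 1]
dt$Change
#> [1]  0.1 -0.1
```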
Formatting as percent
While scales::percent() returns a character vector, formattable::percent() returns a numeric vector with a percent representation, i.e., we're still able to do numeric calculations with it.
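For example (the exact printed strings may vary between package versions):

```r
p1 <- scales::percent(0.25)       # a character string such as "25%"
p2 <- formattable::percent(0.25)  # prints as "25.00%" but stays numeric

is.character(p1)  # TRUE
is.numeric(p2)    # TRUE
p2 * 2            # arithmetic still works
```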
Data
As given by the OP:
FileName <- rep(c("File1", "File2", "File3", "File4", "File5", "File6"),
times = 8, each = 6)
Version <- rep(c("1.0.18", "1.0.21"), times = 4, each = 36)
Category <- rep(c("Category1", "Category2"), times = 48, each = 3)
Value <- rpois(n = 288, lambda = 32)
TestNum <- rep(11:12, times = 4, each = 36)
RepNum <- rep(1:3, times = 96)
Case <- rep(c("Case1", "Case2", "Case3", "Case4"), each = 72)
df <- data.frame(FileName, Version, Category, Value, TestNum, RepNum, Case)

Related

frequency table for banner list

I am trying to create a function to generate a frequency table (showing count, valid percent, and percent) for a list of banners.
I want to export the tables to xlsx files.
For example, for the variable "gear", I want to calculate the table for the banner below:
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
banner <- list(df$all, df$small, df$large)
data <- df
var <- "gear"
var1 <- rlang::parse_expr(var)
expss::var_lab(data[[var]])
#tab1 <- expss::fre(data[[var1]])
table1 <- expss::fre(data[[var1]],
                     stat_lab = getOption("expss.fre_stat_lab",
                                          c("Count N", "Valid percent", "Percent",
                                            "Responses, %", "Cumulative responses, %")))
table1
The output table should look like the one below.
You need to make a custom function around fre():
library(expss)
df <- mtcars
df$all<- 1
df$small<-ifelse(df$vs==1,1,NA)
df$large<-ifelse(df$am ==1,1,NA)
val_lab(df$all)<-c("Total"=1)
val_lab(df$small)<-c("Small"=1)
val_lab(df$large)<-c("Large"=1)
my_fre <- function(curr_var) setNames(expss::fre(curr_var)[, 1:3],
                                      c("row_labels", "Count N", "Valid percent"))
cross_fun_df(df, gear, list(all, small, large), fun = my_fre)
# | | Total | | Small | | Large | |
# | | Count N | Valid percent | Count N | Valid percent | Count N | Valid percent |
# | ------ | ------- | ------------- | ------- | ------------- | ------- | ------------- |
# | 3 | 15 | 46.88 | 3 | 21.43 | | |
# | 4 | 12 | 37.50 | 10 | 71.43 | 8 | 61.54 |
# | 5 | 5 | 15.62 | 1 | 7.14 | 5 | 38.46 |
# | #Total | 32 | 100.00 | 14 | 100.00 | 13 | 100.00 |
# | <NA> | 0 | | 0 | | 0 | |

R - Join two dataframes based on date difference

Let's consider two dataframes df1 and df2. I would like to join dataframes based on the date difference only. For Example;
Dataframe 1: (df1)
| version_id | date_invoiced | product_id |
-------------------------------------------
| 1 | 03-07-2020 | 201 |
| 1 | 02-07-2020 | 2013 |
| 3 | 02-07-2020 | 2011 |
| 6 | 01-07-2020 | 2018 |
| 7 | 01-07-2020 | 201 |
Dataframe 2: (df2)
| validfrom | pricelist| pricelist_id |
------------------------------------------
|02-07-2020 | 10 | 101 |
|01-07-2020 | 20 | 102 |
|29-06-2020 | 30 | 103 |
|28-07-2020 | 10 | 104 |
|25-07-2020 | 5 | 105 |
I need to map the pricelist_id and the pricelist based on the the validfrom column present in df2. Say that, based on the least difference between the date_invoiced (df1) and validfrom (df2), the row should be mapped.
Expected Outcome:
| version_id | date_invoiced | product_id | date_diff | pricelist_id | pricelist |
----------------------------------------------------------------------------------
| 1 | 03-07-2020 | 201 | 1 | 101 | 10 |
| 1 | 02-07-2020 | 2013 | 1 | 102 | 20 |
| 3 | 02-07-2020 | 2011 | 1 | 102 | 20 |
| 6 | 01-07-2020 | 2018 | 1 | 103 | 30 |
| 7 | 01-07-2020 | 201 | 1 | 103 | 30 |
I need to map purely based on the difference and the difference should be the least. Always, the date_invoiced (df1), should have closest difference comparing to validfrom (df2). Thanks
Perhaps you might want to try using data.table and a rolling join with roll = "nearest". Here, the join is made on DATE, which is DATEINVOICED from df1 and VALIDFROM from df2.
library(data.table)
setDT(df1)
setDT(df2)
df1$DATEINVOICED <- as.Date(df1$DATEINVOICED, format = "%d-%m-%Y")
df2$VALIDFROM <- as.Date(df2$VALIDFROM, format = "%d-%m-%Y")
setkey(df1, DATEINVOICED)[, DATE := DATEINVOICED]
setkey(df2, VALIDFROM)[, DATE := VALIDFROM]
df2[df1, on = "DATE", roll='nearest']
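A self-contained sketch with toy data (simplified columns, not the OP's full tables):

```r
library(data.table)

df1 <- data.table(version_id    = c(1, 6),
                  date_invoiced = as.Date(c("2020-07-03", "2020-06-30")))
df2 <- data.table(validfrom    = as.Date(c("2020-07-02", "2020-07-01")),
                  pricelist_id = c(101, 102))

df1[, DATE := date_invoiced]
df2[, DATE := validfrom]
setkey(df1, DATE)
setkey(df2, DATE)

# each invoice is matched to the price list with the closest validfrom date:
# 2020-06-30 -> 102 (nearest is 2020-07-01), 2020-07-03 -> 101
res <- df2[df1, on = "DATE", roll = "nearest"]
res[, .(version_id, DATE, pricelist_id)]
```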

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
| V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+
We can use parse_number to extract the first occurrence of numbers
library(dplyr)
df1 %>%
  mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))
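A quick check of that pattern on row names like the ones above:

```r
rn <- c("[1,]", "[2,]1", "[5,]2")
# capture the digits right after "[" and drop everything else
sub("^\\[(\\d+).*", "\\1", rn)
#> [1] "1" "2" "5"
```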

How to remove empty cells and reduce columns

I have a table, that looks roughly like this:
| variable | observer1 | observer2 | observer3 | final |
| -------- | --------- | --------- | --------- | ----- |
| case1 | | | | |
| var1 | 1 | 1 | | |
| var2 | 3 | 3 | | |
| var3 | 4 | 5 | | 5 |
| case2 | | | | |
| var1 | 2 | | 2 | |
| var2 | 5 | | 5 | |
| var3 | 1 | | 1 | |
| case3 | | | | |
| var1 | | 2 | 3 | 2 |
| var2 | | 2 | 2 | |
| var3 | | 1 | 1 | |
| case4 | | | | |
| var1 | 1 | | 1 | |
| var2 | 5 | | 5 | |
| var3 | 3 | | 3 | |
Three columns for the observers, but only two are ever filled.
First I want to compute the IRR, so I need a table that has two columns without the empty cells like this:
| variable | observer1 | observer2 |
| -------- | --------- | --------- |
| case1 | | |
| var1 | 1 | 1 |
| var2 | 3 | 3 |
| var3 | 4 | 5 |
| case2 | | |
| var1 | 2 | 2 |
| var2 | 5 | 5 |
| var3 | 1 | 1 |
| case3 | | |
| var1 | 2 | 3 |
| var2 | 2 | 2 |
| var3 | 1 | 1 |
| case4 | | |
| var1 | 1 | 1 |
| var2 | 5 | 5 |
| var3 | 3 | 3 |
I'm trying to use the tidyverse packages, but I'm not sure how. Some ifelse() magic may be easier.
Is there a clean and easy method for something like this? Can anybody point me to the right function, or just to a keyword to search for on Stack Overflow? I have found plenty of methods for removing entirely empty columns or rows.
Edit: I removed the link to the original data. It was unnecessary. Thanks to Lamia for his working answer.
In your 3 columns observer1, observer2 and observer3, each row has either 2 non-NA values, 1 non-NA value, or all NAs.
If you want to merge your 3 columns, you could do:
res <- data.frame(df$coding,
                  t(apply(df[paste0("observer", 1:3)], 1,
                          function(x) x[!is.na(x)][1:2])))
The apply function returns, for each row, the two non-NA values if there are two, one non-NA value and one NA if there is only one, and two NAs if the row has no data.
We then put this result in a data frame together with the first column (coding).
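A reduced sketch of that apply() step on toy observer columns (an unname() added so the printed matrix has plain indices):

```r
df <- data.frame(observer1 = c(1, 2, NA),
                 observer2 = c(1, NA, 2),
                 observer3 = c(NA, 2, 3))

# keep the (up to) two non-NA values of each row
t(apply(df, 1, function(x) unname(x[!is.na(x)][1:2])))
#>      [,1] [,2]
#> [1,]    1    1
#> [2,]    2    2
#> [3,]    2    3
```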

Hmisc Table Creation

Just starting out with R and trying to figure out what works for my needs when it comes to creating "summary tables." I am used to Custom Tables in SPSS, and the CrossTable function in the package gmodels gets me almost where I need to be; not to mention it is easy to navigate for someone just starting out in R.
That said, it seems like the Hmisc table is very good at creating various summaries and exporting to LaTex (ultimately what I need to do).
My questions are: 1) can you create the table below easily with Hmisc? 2) If so, can I interact variables (2 in the column)? And finally 3) can I access the p-values of significance tests (chi-square)?
Thanks in advance,
Brock
Cell Contents
|-------------------------|
| Count |
| Row Percent |
| Column Percent |
|-------------------------|
Total Observations in Table: 524
| asq[, 23]
asq[, 4] | 1 | 2 | 3 | 4 | 5 | Row Total |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
0 | 76 | 54 | 93 | 46 | 54 | 323 |
| 23.529% | 16.718% | 28.793% | 14.241% | 16.718% | 61.641% |
| 54.286% | 56.250% | 63.265% | 63.889% | 78.261% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
1 | 64 | 42 | 54 | 26 | 15 | 201 |
| 31.841% | 20.896% | 26.866% | 12.935% | 7.463% | 38.359% |
| 45.714% | 43.750% | 36.735% | 36.111% | 21.739% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
Column Total | 140 | 96 | 147 | 72 | 69 | 524 |
| 26.718% | 18.321% | 28.053% | 13.740% | 13.168% | |
-------------|-----------|-----------|-----------|-----------|-----------|-----------|
The gmodels package has a function called CrossTable, which is very nice for those used to SPSS and SAS output. Try this example:
library(gmodels) # run install.packages("gmodels") if you haven't installed the package yet
x <- sample(c("up", "down"), 100, replace = TRUE)
y <- sample(c("left", "right"), 100, replace = TRUE)
CrossTable(x, y, format = "SPSS")
This should provide you with an output just like the one you displayed on your question, very SPSS-y. :)
If you are coming from SPSS, you may be interested in the package Deducer ( http://www.deducer.org ). It has a contingency table function:
> library(Deducer)
> data(tips)
> tables<-contingency.tables(
+ row.vars=d(smoker),
+ col.vars=d(day),data=tips)
> tables<-add.chi.squared(tables)
> print(tables,prop.r=T,prop.c=T,prop.t=F)
================================================================================================================
==================================================================================
========== Table: smoker by day ==========
| day
smoker | Fri | Sat | Sun | Thur | Row Total |
-----------------------|-----------|-----------|-----------|-----------|-----------|
No Count | 4 | 45 | 57 | 45 | 151 |
Row % | 2.649% | 29.801% | 37.748% | 29.801% | 61.885% |
Column % | 21.053% | 51.724% | 75.000% | 72.581% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Yes Count | 15 | 42 | 19 | 17 | 93 |
Row % | 16.129% | 45.161% | 20.430% | 18.280% | 38.115% |
Column % | 78.947% | 48.276% | 25.000% | 27.419% | |
-----------------------|-----------|-----------|-----------|-----------|-----------|
Column Total | 19 | 87 | 76 | 62 | 244 |
Column % | 7.787% | 35.656% | 31.148% | 25.410% | |
Large Sample
Test Statistic DF p-value | Effect Size est. Lower (%) Upper (%)
Chi Squared 25.787 3 <0.001 | Cramer's V 0.325 0.183 (2.5) 0.44 (97.5)
-----------
================================================================================================================
You can get the counts and test to latex or html using the xtable package:
> library(xtable)
> xtable(drop(extract.counts(tables)[[1]]))
> test <- contin.tests.to.table((tables[[1]]$tests))
> xtable(test)
