Knitting concatenated tables in R

I want to concatenate my two tables so that they appear well aligned with each other.
Here's an example:
[Table 1]
ID Nb   Name
1  2500 Alex
2  3001 Sam
[Table 2]
ID Nb2     Name
1  1201445 Lea
2  14120   Remy
When I try to knit my tables I get this (each table is sized to its own data, so the columns of the two tables don't line up):
ID Nb   Name
1  2500 Alex
2  3001 Sam
ID Nb2     Name
1  1201445 Lea
2  14120   Remy
But I want to have this (both tables share the same column widths):
ID  Nb       Name
1   2500     Alex
2   3001     Sam
ID  Nb2      Name
1   1201445  Lea
2   14120    Remy

Analyzing the code you provided (I have added the packages that are needed):
require(dplyr)
require(flextable)
regulartable(mydataframe) %>% theme_zebra() %>% autofit()
I can see that you are using the autofit() function, which automatically sets the column heights and widths to best fit the data in the dataframe; as I understand it, you do this separately for each table. To solve this formatting issue you don't need to concatenate the dataframes; instead, you can set the same column width for each table, which would look like this:
regulartable(mydataframe) %>% theme_zebra() %>% width(width = 1)
Depending on what kind of report you are creating, consider what type of data will be displayed in your tables and adjust the width parameter of the width() function accordingly. It is worth mentioning that the width parameter is given as a number of inches per column.
I strongly recommend reading the documentation of the flextable package for further information on formatting these tables of yours.
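A minimal sketch of the full fix, assuming the two example tables above are stored in data frames named table1 and table2 (hypothetical names):
library(dplyr)
library(flextable)
table1 <- data.frame(ID = 1:2, Nb = c(2500, 3001), Name = c("Alex", "Sam"))
table2 <- data.frame(ID = 1:2, Nb2 = c(1201445, 14120), Name = c("Lea", "Remy"))
# a fixed width (in inches) instead of autofit() keeps both tables aligned when knitted
regulartable(table1) %>% theme_zebra() %>% width(width = 1)
regulartable(table2) %>% theme_zebra() %>% width(width = 1)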

Related

Can we 'replace' a column from a pivot table created using 'pivottabler' package of r?

I am using the 'pivottabler' package to create some pivot tables in R.
Basically, the pivot tables I create have similar structure, only the column header changes.
For example, I have a data set containing the prices of fruits based on region and month.
So I will create one pivot that will look like this:
Fruits  Nigeria  Laos    England
        Prices   Prices  Prices
Apple   1$       2$      3$
Mango   4$       5$      6$
Orange  7$       8$      9$
And another pivot table that will look like this:
Fruits  Jan      Feb     March
        Prices   Prices  Prices
Apple   1$       1.5$    2$
Mango   4$       4.5$    5$
Orange  7$       7.5$    8$
Right now I am using two separate blocks of code to create the two pivots.
library(pivottabler)
pt_country <- PivotTable$new()
pt_country$addData(Fruit_Prices)  # Fruit_Prices is the data frame containing the data
pt_country$addColumnDataGroups("Countries")
pt_country$addRowDataGroups("Fruits")
pt_country$defineCalculation(calculationName = "Prices")
pt_country$renderPivot()

pt_month <- PivotTable$new()
pt_month$addData(Fruit_Prices)
pt_month$addColumnDataGroups("Months")
pt_month$addRowDataGroups("Fruits")
pt_month$defineCalculation(calculationName = "Prices")
pt_month$renderPivot()
I want to shorten the code, since there will be multiple such pivot tables.
Ideally, I am looking for a solution that allows me to replace one column group with another without changing other parts of the code.
Any help will be appreciated.
I am the author of the pivottabler package.
There are currently only limited options to amend a pivot table after it has been calculated.
More detail
In your example, removing the columns would also remove the calculations, since in your R script the calculations are added after the columns. Reapplying the calculations is then not possible, because the pivot table recognises that the calculations were added already (you get an error). I will look at options to add flexibility in the future.
Alternative approach
One option to reduce the amount of code is to create a function which takes as a parameter the variable to show on the columns. This function can then be easily called to create different variations of the pivot table:
library(pivottabler)

createPivot <- function(columnGroupVariableName)
{
  pt <- PivotTable$new()
  pt$addData(bhmtrains)  # bhmtrains is a sample data set shipped with pivottabler
  pt$addColumnDataGroups(columnGroupVariableName)
  pt$addRowDataGroups("TOC")
  pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
  pt$renderPivot()
}
# create pivot tables with different variables on the columns
createPivot("TrainCategory")
createPivot("PowerType")
I had found a sort of workaround for this problem and thought I would mention it here for completeness.
The answer provided by @cbailiss is better, in that it achieves the desired result in fewer lines of code and is easier to follow, so it is marked as the accepted answer.
library(pivottabler)

pt <- PivotTable$new()
pt$addData(bhmtrains)
pt$defineCalculation(calculationName="TotalTrains", summariseExpression="n()")
pt$addRowDataGroups("TOC")
## Adding the required column group for the pivot
Col <- pt$addColumnGroup()                      # Step 1
Col_Data <- Col$addDataGroups("TrainCategory")  # Step 2
pt$renderPivot()
## Removing the 'Col' group, thus deleting the added columns
Col$removeGroup()
## Repeating Step 1 and Step 2 for another column variable
Col <- pt$addColumnGroup()                      # Step 1
Col_Data <- Col$addDataGroups("PowerType")      # Step 2
pt$renderPivot()
The above code worked for me; the method is described in the 'Irregular Layout' vignette at:
http://www.pivottabler.org.uk/articles/v11-irregularlayout.html

R: Merging two data.tables while filtering for unique IDs: only NA as answers

My problem is as follows:
I need to analyse data from several different files with a lot of entries (up to 500,000 per column, 10 columns in total).
The files are connected through the use of IDs, e.g. ORDER_IDs.
However, the IDs can appear multiple times, e.g. when an order contains multiple order lines. It is also possible that an ID doesn't appear in one of the files, e.g. because a file with sales data only has information on the orders that have been shipped, not on those that haven't been shipped yet.
So I have files of different lengths, with IDs identifying positions that may or may not appear in each of the data files.
What I want now is to filter one file by ID so that it only shows the IDs listed in another file. The additional columns from the first file should also be carried over.
Example of what I have:
dt1:
ORDER_ID  SKU_ID  Quantity_Shipped
12345     678910  100
12346     648392  30
64739     648392  20
dt2:
ORDER_ID  Country
12345     DE
12346     DE
55430     SE
90632     JPN
76543     ARG
64739     CH
What I want:
ORDER_ID  SKU_ID  Quantity_Shipped  Country
12345     678910  100               DE
12346     648392  30                DE
64739     648392  20                CH
Originally, the data was imported from a csv file.
The approach I used so far has worked when merging two files.
When trying to add the information from a third file, however, I get only NA as results. What can I do to fix this?
This is the approach I used so far.
library(data.table)

df2 <- data.frame(ORDER_ID = sales[["ORDER_ID"]])
df1 <- data.frame(ORDER_ID = OL[["ORDER_ID"]],
                  SKU_ID = OL[["SKU_ID"]],
                  QTY_SHIPPED = OL[["QTY_SHIPPED"]],
                  EXPECTED_VOLUME = OL[["EXPECTED_VOLUME"]])
dt2 <- data.table(df1)
dt1 <- data.table(df2)
dt3 <- dt2[match(dt1$ORDER_ID, dt2$ORDER_ID), ]
You can use either the join syntax inherent to data.table, or the explicit merge command (which also dispatches to a data.table S3 method, but that doesn't entirely matter here).
dt2[dt1, on = "ORDER_ID"]
#    ORDER_ID Country SKU_ID Quantity_Shipped
# 1:    12345      DE 678910              100
# 2:    12346      DE 648392               30
# 3:    64739      CH 648392               20
merge(dt1, dt2, by = "ORDER_ID")
Sometimes I prefer the clarity of the merge call, in that I control left/right joins and other aspects (if the default join above doesn't do what I want). I found https://rstudio-pubs-static.s3.amazonaws.com/52230_5ae0d25125b544caab32f75f0360e775.html to be a good reference for left/right and other join types.
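For reference, a self-contained sketch reproducing the join with the sample data from the question (column names as shown above):
library(data.table)
dt1 <- data.table(ORDER_ID = c(12345, 12346, 64739),
                  SKU_ID = c(678910, 648392, 648392),
                  Quantity_Shipped = c(100, 30, 20))
dt2 <- data.table(ORDER_ID = c(12345, 12346, 55430, 90632, 76543, 64739),
                  Country = c("DE", "DE", "SE", "JPN", "ARG", "CH"))
# keep all rows of dt1 and carry Country over from dt2
dt2[dt1, on = "ORDER_ID"]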
One case where merge will not suffice is range joins, which use either data.table::foverlaps or an extension of the inherent method:
# with mythical data, joining where dat1$val falls between dat2$val1 and dat2$val2
dat1[dat2, on = .(val >= val1, val <= val2)]
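And a runnable sketch of such a range join with hypothetical data (the x. and i. prefixes in j retrieve the original values of each table, since the join columns are otherwise reported with the join bounds):
library(data.table)
dat1 <- data.table(id = c("a", "b", "c"), val = c(5, 15, 25))
dat2 <- data.table(band = c("low", "high"), val1 = c(0, 10), val2 = c(9, 30))
# match each row of dat1 to the dat2 band whose [val1, val2] range contains its val
dat1[dat2, .(band = i.band, id = x.id, val = x.val), on = .(val >= val1, val <= val2)]
#    band id val
# 1:  low  a   5
# 2: high  b  15
# 3: high  c  25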

R Question - Trying to use separate to split data with a non-constant delimiter

One of the variables is participant age groups; an example of one of the records is shown below:
0::Adult 18+||1:: Adult 18+||2::Adult 18+||3::Child 0-11
How do you best split this out so that it gives Adult 18+ with a count of 3 and Child 0-11 with a count of 1?
I tried using separate(), but as the delimiter is not constant it was omitting a lot of the records. Any suggestions would be helpful, thank you! As this is my first post, let me know if I need to add more information.
Here is one way:
library(magrittr)
vals <- "0::Adult 18+||1:: Adult 18+||2::Adult 18+||3::Child 0-11"
# keep only letters and spaces, split on whitespace, then tabulate the groups
strsplit(gsub("[^[:alpha:][:space:]]", "", vals), "\\s+") %>%
  as.data.frame() %>%
  table()
# Adult Child
#     3     1
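If the full labels (including the "18+" and "0-11" parts) should be kept, here is a sketch using tidyr and dplyr, assuming the same vals string as above; the regex strips the numeric "N::" prefixes, whose spacing is inconsistent:
library(dplyr)
library(tidyr)
data.frame(raw = vals) %>%
  separate_rows(raw, sep = "\\|\\|") %>%                     # one row per entry
  mutate(group = trimws(sub("^[0-9]+::\\s*", "", raw))) %>%  # drop the "N::" prefix
  count(group)
#        group n
# 1  Adult 18+ 3
# 2 Child 0-11 1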

R: how to apply a complicated function to each column in a dataframe, to output a new dataframe for each column?

I've read a few posts but none of them solves my problem so far. Here's what I am trying to do:
My input data:
I have a dataframe where all the rows are products and all the columns are the standards the products are reviewed against. The goal is to look at the pass/fail rate for each standard, to see which one was passed the most and which standard all products found hard to meet.
(It seems I can't copy-paste a table here, so I've typed out below what the dataframe looks like.)
Source    sample_ID  Standard_a  Standard_amn  Standard_df
Product1  1          Yes         Yes           NA
Product1  2          Yes         NO            Yes
...       ...        Yes         Yes           NA
Product1  10         Yes         Yes           NO
Product2  1          Yes         NO            YES
...       ...        ...         ...           ...
Product2  8          Yes         Yes           Yes
Product3  1          Yes         NA            NA
...       ...        Yes         Yes           YES
There are some rules about sampling the products and about valid entries for each column (Yes means the product passes the standard, NO means it fails, NA means not available, etc.). I've created a function to take care of this.
But my struggle is how to apply my function to all the columns.
Here is my function for one column (the 14th column, named um9d_f6), which I am trying to turn into a loop or something that can be applied automatically to all columns:
library(dplyr)

# take the product info columns 1:3 and the 14th standard column (named um9d_f6),
# and create a clean dataframe for this column, which I call x
x <- appeals_clean %>%
  # select column um9d_f6 by its position, which is 14
  select(1:3, 14) %>%
  # group by source (which is the product)
  group_by(source) %>%
  # code NA and missing as 0: these are the specific coding rules for pass and fail
  mutate(convert_num = ifelse(um9d_f6 %in% c('YES', 'NO', 'AC'), 1, 0),
         cum_elig = cumsum(convert_num),
         max_cum_elig = max(cum_elig)) %>%
  # get rid of: 1. empty entries after the max eligible count is reached,
  # 2. sources without enough eligible files, 3. files after the eligible count reaches 10
  filter(!((max_cum_elig == cum_elig & um9d_f6 == '') | max_cum_elig < 8 | (max_cum_elig > 10 & cum_elig > 10)))
# now, with the clean data from column 14 (again, named um9d_f6), calculate the pass rate
y <- x %>%
  group_by(source) %>%
  summarise(um9d_f6_n_Pass = sum(um9d_f6 %in% c('YES', 'AC')),
            um9d_f6_Rate = um9d_f6_n_Pass / 10)
The y output dataframe looks like this:
Source    um9d_f6_n_Pass  um9d_f6_Rate
Product1  10              100
Product2  8               80
Product3  8               80
Product2  8               80
Product3  10              100
My struggle is: how can I use a loop or other functions to get this y summary dataframe for each of the standard columns in the dataframe, so that I don't need to manually adjust the column position and column name each time I use the function? (There are about 60 standards, so hopefully I don't need to do this manually.)
I've tried loops but haven't figured out:
1. how to refer to the column inside the function that builds x, and
2. how to name the new columns in y according to the input column name.
Eventually I will left join all of those output dataframes y, which is why the naming needs to reflect the source column names.
Any suggestions/recommendations would be greatly appreciated!
Thank you,
Leo
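For what it's worth, here is a sketch of how the column could be parameterized using dplyr's .data pronoun and glue-style names in summarise() (this assumes dplyr >= 1.0; summarise_standard is a hypothetical helper name, and the cleaning rules are copied from the question's code):
library(dplyr)

summarise_standard <- function(df, col) {
  df %>%
    select(1:3, all_of(col)) %>%
    group_by(source) %>%
    # same eligibility rules as above, but referring to the column by name
    mutate(convert_num = ifelse(.data[[col]] %in% c("YES", "NO", "AC"), 1, 0),
           cum_elig = cumsum(convert_num),
           max_cum_elig = max(cum_elig)) %>%
    filter(!((max_cum_elig == cum_elig & .data[[col]] == "") |
               max_cum_elig < 8 |
               (max_cum_elig > 10 & cum_elig > 10))) %>%
    # glue-style names make the output columns reflect the input column name
    summarise("{col}_n_Pass" := sum(.data[[col]] %in% c("YES", "AC")),
              "{col}_Rate" := sum(.data[[col]] %in% c("YES", "AC")) / 10)
}

# one summary per standard column, then left join them all by source
standard_cols <- names(appeals_clean)[-(1:3)]
summaries <- lapply(standard_cols, function(col) summarise_standard(appeals_clean, col))
result <- Reduce(function(a, b) left_join(a, b, by = "source"), summaries)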

Simple lookup to insert values in an R data frame

This is a seemingly simple R question, but I don't see an exact answer here. I have a data frame (alldata) that looks like this:
Case  zip    market
1     44485  NA
2     44488  NA
3     43210  NA
There are over 3.5 million records.
Then, I have a second data frame, 'zipcodes'.
market  zip
1       44485
1       44486
1       44488
...     ...    (100 zips in market 1)
2       43210
2       43211
...     ...    (100 zips in market 2, etc.)
I want to find the correct value for alldata$market for each case based on alldata$zip matching the appropriate value in the zipcode data frame. I'm just looking for the right syntax, and assistance is much appreciated, as usual.
Since you don't care about the existing market column in alldata, you can first strip it off and then merge alldata and zipcodes on the zip column using merge:
merge(alldata[, c("Case", "zip")], zipcodes, by="zip")
The by parameter specifies the key criteria, so if you have a compound key, you could do something like by=c("zip", "otherfield").
Another option that worked for me and is very simple:
alldata$market <- with(zipcodes, market[match(alldata$zip, zip)])
With such a large data set you may want the speed of an environment lookup. You can use the lookup function from the qdapTools package as follows:
library(qdapTools)
alldata$market <- lookup(alldata$zip, zipcodes[, 2:1])
Or
alldata$zip %l% zipcodes[, 2:1]
Here's the dplyr way of doing it:
library(tidyverse)
alldata %>%
  select(-market) %>%
  left_join(zipcodes, by = "zip")
which, on my machine, is roughly the same performance as lookup.
The syntax of match is a bit clumsy. You might find the lookup package easier to use.
library(lookup)
alldata <- data.frame(Case = 1:3, zip = c(44485, 44488, 43210), market = c(NA, NA, NA))
zipcodes <- data.frame(market = c(1, 1, 1, 2, 2), zip = c(44485, 44486, 44488, 43210, 43211))
alldata$market <- lookup(alldata$zip, zipcodes$zip, zipcodes$market)
alldata
## Case zip market
## 1 1 44485 1
## 2 2 44488 1
## 3 3 43210 2
