How can I produce report-quality tables from R?

If I have the following dataframe called result
> result
Name CV LCB UCB
1 within 2.768443 1.869964 5.303702
2 between 4.733483 2.123816 18.551051
3 total 5.483625 3.590745 18.772389
> dput(result,"")
structure(list(Name = structure(c("within", "between", "total"
), .rk.invalid.fields = list(), .Label = character(0)), CV = c(2.768443,
4.733483, 5.483625), LCB = c(1.869964, 2.123816, 3.590745), UCB = c(5.303702,
18.551051, 18.772389)), .Names = c("Name", "CV", "LCB", "UCB"
), row.names = c(NA, 3L), class = "data.frame")
What is the best way to present this data nicely? Ideally I'd like an image file that can be pasted into a report, or possibly an HTML file representing the table.
Extra points for setting the number of significant figures.

I would use xtable. I usually use it with Sweave.
library(xtable)
d <- data.frame(letter=LETTERS, index=rnorm(52))
d.table <- xtable(d[1:5,])
print(d.table,type="html")
If you want to use it in a Sweave document, you would use it like so:
<<label=tab1, echo=FALSE, results=tex>>=
print(xtable(d, caption = "Here is my caption", label = "tab:one"),
      caption.placement = "top")
@
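To control the number of digits shown (the significant-figures part of the question), xtable also takes a digits argument; a minimal sketch:
# digits = 2 formats each numeric column with two decimal places
print(xtable(d[1:5, ], digits = 2), type = "html")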

For the table aspect, the xtable package comes to mind, as it can produce LaTeX output (which you can use via Sweave for professional reports) as well as HTML.
If you combine that in Sweave with fancy graphs (see other questions for ggplot examples) you are almost there.
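For example, a minimal sketch that writes the question's result data frame straight to an HTML file:
library(xtable)
# type = "html" plus a file argument produces an HTML table you can
# embed in or link from a report
print(xtable(result, digits = 2), type = "html", file = "result.html",
      include.rownames = FALSE)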

library(ggplot2)
# CV as a point with an error bar spanning the confidence bounds
ggplot(result, aes(x = Name, y = CV, ymin = LCB, ymax = UCB)) +
  geom_errorbar() + geom_point()
# or, more compactly, as a point range
ggplot(result, aes(x = Name, y = CV, ymin = LCB, ymax = UCB)) +
  geom_pointrange()
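Since the question asks for an image file, either plot can be saved with ggsave; a small sketch (the filename is just an example):
p <- ggplot(result, aes(x = Name, y = CV, ymin = LCB, ymax = UCB)) +
  geom_pointrange()
# width/height are in inches; dpi controls the resolution of the PNG
ggsave("cv_plot.png", p, width = 6, height = 4, dpi = 300)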

To set the significant figures, the easiest thing to do (for this sample data, mind you) would be to move Name to rownames and round the whole thing.
#Set the rownames equal to Name - assuming all unique
rownames(result) <- result$Name
#Drop the Name column so that round() can coerce
#result.mat to a matrix
result.mat <- result[ , -1]
round(result.mat, 2) # 2 = number of decimal places; use signif() instead for significant digits
This is not a terribly robust solution: non-unique Name values would break it, I think, as would other non-numeric columns. But for producing a table like your example, it does the trick.
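A slightly more robust sketch, keeping non-numeric columns intact and using signif() for true significant digits:
result.sig <- result
# round only the numeric columns, leaving Name (and any other
# non-numeric columns) untouched
num <- vapply(result.sig, is.numeric, logical(1))
result.sig[num] <- lapply(result.sig[num], signif, digits = 3)
result.sig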

Related

How to get part of column in df in italics?

I have a dataframe as follows:
number <- c(34,36,67,87,99)
mz <- c("m/z 565.45","m/z 577.45","m/z 65.49","m/z 394.22","m/z 732.43")
df <- data.frame(number, mz)
However, I want the m/z part in italics, and I cannot figure out how to do it.
df$mz <- gsub('m/z', italic('m/z'), df$mz)
This does not work; I get the error:
Error in italic("m/z") : could not find function "italic"
This also does not work:
df$mz <- gsub('m/z', expression(italic('m/z')), df$mz)
I don't get an error, but I get the literal text italic('m/z') in my data frame.
Is there any way around this?
EDIT:
The reason I want to do this is that I'm going to use the df to make a plot, and I need the m/z to be in italics in the plot.
You should store mz as character strings, then parse at the time of plotting:
df$mz <- sub('m/z', 'italic(m/z)~', df$mz)
A base R plot would then look like this:
plot(1:5, df$number, xaxt = 'n', xlab = 'mz')
axis(1, at = 1:5, labels = parse(text = df$mz))
And a ggplot like this:
ggplot(df, aes(factor(1:5), number)) +
  geom_point() +
  scale_x_discrete(labels = parse(text = df$mz), name = 'mz')
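The same plotmath trick works for other labels too, e.g. an italic axis title; a small variant of the ggplot above (assuming ggplot2 is loaded):
ggplot(df, aes(factor(1:5), number)) +
  geom_point() +
  scale_x_discrete(labels = parse(text = df$mz)) +
  xlab(expression(italic("m/z")))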

How to get the grouping right in R with Plotly

I have a problem grouping my data in Plotly under R. To start with, I was using local data from a CSV file, reading it with:
geogrid_data <- read.delim('geogrid.csv', row.names = NULL, stringsAsFactors = TRUE)
and the plotting went well, using the following:
library(plotly)
library(RColorBrewer)
x <- list(
  title = 'Date'
)
p <- plotly::plot_ly(geogrid_data,
  type = 'scatter',
  x = ~ts_now,
  y = ~absolute_v_sum,
  text = paste('Table: ', geogrid_data$table_name,
               '<br>Absolute_v_Sum: ', geogrid_data$absolute_v_sum),
  hoverinfo = 'text',
  mode = 'lines',
  color = list(
    color = colorRampPalette(RColorBrewer::brewer.pal(11, 'Spectral'))(
      length(unique(geogrid_data$table_name))
    )
  ),
  transforms = list(
    list(
      type = 'groupby',
      groups = ~table_name
    )
  )
) %>% layout(showlegend = TRUE, xaxis = x)
Here is the output:
Then I switched the data source to an Oracle database table, reading the data as follows using the ROracle package:
# retrieve data into resultSet object
rs <- dbSendQuery(con, "SELECT * FROM GEOGRID_STATS")
# fetch records from the resultSet into a data.frame
geogrid_data <- fetch(rs)
# free resources occupied by resultSet
dbClearResult(rs)
dbUnloadDriver(drv)
# remove duplicates from dataframe (based on TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM)
geogrid_data <- geogrid_data %>% distinct(TABLE_NAME, TS_BEFORE, TS_NOW, NOW_SUM, .keep_all = TRUE)
# alter date columns in place
geogrid_data$TS_BEFORE <- as.Date(geogrid_data$TS_BEFORE, format='%d-%m-%Y')
geogrid_data$TS_NOW <- as.Date(geogrid_data$TS_NOW, format='%d-%m-%Y')
and adjusted the plotting to:
p <- plotly::plot_ly(
  type = 'scatter',
  x = geogrid_data$TS_NOW,
  y = geogrid_data$ABSOLUTE_V_SUM,
  text = paste('Table: ', geogrid_data$TABLE_NAME,
               '<br>Absolute_v_Sum: ', geogrid_data$ABSOLUTE_V_SUM,
               '<br>Date: ', geogrid_data$TS_NOW),
  hoverinfo = 'text',
  mode = 'lines',
  color = list(
    color = colorRampPalette(RColorBrewer::brewer.pal(11, 'Spectral'))(
      length(unique(geogrid_data$TABLE_NAME))
    )
  ),
  transforms = list(
    list(
      type = 'groupby',
      groups = geogrid_data$TABLE_NAME
    )
  )
) %>% layout(showlegend = TRUE, xaxis = x)
Unfortunately, this seems to lead to a problem with the grouping:
As you can see from the label text when hovering over the data point, the point represents data from NY_SKOV_PLANTEB_MW_POLY while the legend is set to show data from NY_BYGN_MW_POLY. Looking at other data points, I found a wild mix of points of all sorts in this graph: some represent data from NY_BYGN_MW_POLY, but most do not.
Also, the plotting along the timeline no longer works; e.g. data are plotted starting on Dec. 11, then Dec. 10, Dec. 10, Dec. 12, Dec. 20, Dec. 17, Dec. 16, Dec. 15.
Where do I go wrong in handling the data, and what do I have to do to get it right?
Of course, one should look at the data... thanks Marco, after your question I did look at my data.
There are some points where I simply assumed things.
The reason all the data plotted fine from the CSV file is simple. The information compiled manually into the CSV file came from emails that were ordered by date; hence I compiled the CSV data ordered by date, and Plotly had no problem grouping the data by table_name.
After looking at my data, I tidied it up, keeping only the data I need to show in the plot, and used dplyr to sort the data by time.
geogrid_data <- dplyr::arrange(geogrid_data, TS_NOW)
The sort is only by time, not by time and table name, because the grouping by table name is handled anyway by Plotly's groupby transform.
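Putting it together, a sketch of the fixed flow (using the upper-case column names as they come from Oracle; otherwise as in the question):
library(dplyr)
library(plotly)
# sort by time so the line traces are drawn in chronological order
geogrid_data <- arrange(geogrid_data, TS_NOW)
plot_ly(geogrid_data,
        type = 'scatter', mode = 'lines',
        x = ~TS_NOW, y = ~ABSOLUTE_V_SUM,
        transforms = list(list(type = 'groupby', groups = ~TABLE_NAME))) %>%
  layout(showlegend = TRUE)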

How can I select the X and Y coordinates of an R plot from a Column Filter (R/KNIME)?

So, I have this workflow:
I selected two columns (Day and Temperature) from my file using 'Column Filter' and connected them to 'R Plot', which I configured, but I obtain this:
The Day column is not used as the X axis (Row ID is used instead); the Y axis is OK.
This is my code in R plot:
# Library
library(qcc)
library(readr)
library(Rserve)
Rserve(args = "--vanilla")
# Data column filter from CSV file imported
Test <- kIn
#Background color
qcc.options(bg.margin = "white", bg.figure = "gray95")
#R graph ranges of a continuous process variable
qcc(data = Test,
    type = "R",
    sizes = 5,
    title = "Sample R Chart Title",
    digits = 2,
    plot = TRUE)
Here is my try (using KNIME's R, not the community contribution):
#install.packages("qcc")
library(qcc)
data <- knime.in
#Change the names to use Day instead of row keys
row.names(data) <- data$Day
#Using the updated data
plot(qcc(data = data,
         type = "R",
         sizes = 5,
         title = "Sample R Chart Title",
         digits = 2,
         plot = TRUE))
With results like:
If you want to select the column for the X axis, just change the row.names assignment. (It can also come from knime.flow.in in case the column name is coming from a flow variable, but as I understand it, that is not the case for you.)
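For instance, a hypothetical sketch where the X-axis column name arrives as a variable (x_col is made up for illustration; in a real workflow it could be read from knime.flow.in):
x_col <- "Day"  # hypothetical; could come from knime.flow.in
row.names(data) <- data[[x_col]]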

Produce pretty table for print that shows which point estimates differ significantly using R

I want to create a table of point estimates from a sample for print in the following format:
variable   group1   group2   group3   etc
age          18.2     18.5     23.2
weight      125.4    130.1    117.1
etc
I also have confidence intervals for each point estimate, but displaying them will cause too much clutter. Instead, I'd like to use text attributes (italics, bold, underline, font) to signal which point estimates in a row differ significantly. So, in the first row, if 23.2 differed significantly from the other two, it would be displayed in bold (for example). I'm not sure if such a display would appear bewildering, but I'd like to try.
Could anyone suggest a table formatting library in R that would allow me to accomplish this? Perhaps one that allows me to supply text attributes in the data table for each point estimate?
Another solution is to use the ReporteRs package with its FlexTable API and send the object to a docx document:
library(ReporteRs)
data = iris[45:55, ]
MyFTable = FlexTable(data = data)
MyFTable[data$Petal.Length < 3, "Species"] = textProperties(color = "red",
                                                            font.style = "italic")
MyFTable[data$Sepal.Length < 5, 1:4] = cellProperties(background.color = "#999999")
MyFTable[, 1:4] = parProperties(text.align = "right")
doc.filename = "test.docx"
doc = docx()
doc = addFlexTable(doc, MyFTable)
writeDoc(doc, file = doc.filename)
I believe you can do something like this with the xtable package: if you have xtable output your table, you can use knitr/pandoc to convert it to Word, HTML, etc., or you can just paste the LaTeX output into a document and compile it.
Here's a demo:
library(xtable)
# original data frame
df <- data.frame(var = c("age", "weight", "etc"),
                 group1 = c("18.2", "125.4", "3"),
                 group2 = c("18.5", "130.1", "3"),
                 group3 = c("23.2", "117.1", "3"),
                 etc = c("1", "2", "3"))
# data frame in similar format indicating significance
significant <- data.frame(var = c("age", "weight", "etc"),
                          group1 = c(F, T, F),
                          group2 = c(T, F, T),
                          group3 = c(F, T, F))
library(reshape2)
# transform everything into long form to apply text formatting
df.melt <- melt(df, id.vars = 1, variable.name="group", value.name="value")
sig.melt <- melt(significant, id.vars=1, variable.name = "group", value.name="sig")
# merge datasets together
tmp <- merge(df.melt, sig.melt)
tmp$ans <- tmp$value
# apply text formatting using LaTeX functions
tmp$ans[tmp$sig] <- paste0("\\textit{", tmp$ans, "}")[tmp$sig]
# transform dataset back to "wide form" for table output
df2 <- dcast(tmp, var~group, value.var="ans")
# output table in LaTeX format
print(xtable(df2), include.rownames=FALSE, sanitize.text.function=identity)
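If you want HTML instead of LaTeX, the same approach works with HTML tags in place of \textit{} (a hedged variant of the demo above):
# wrap significant cells in <i> tags instead of \textit{}
tmp$ans <- tmp$value
tmp$ans[tmp$sig] <- paste0("<i>", tmp$value, "</i>")[tmp$sig]
df2 <- dcast(tmp, var ~ group, value.var = "ans")
print(xtable(df2), type = "html", include.rownames = FALSE,
      sanitize.text.function = identity)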
A quick demo based on the OP-mentioned pander package:
Load it:
library(pander)
Create some dummy data, which I will import from the rapport package this time:
df <- rapport::ius2008
Compute a basic cross table:
t <- table(df$dwell, df$net.pay)
Identify those cells with high standardized residuals and emphasize those:
emphasize.cells(which(abs(chisq.test(t)$stdres) > 2, arr.ind = TRUE))
Do not split the markdown table:
panderOptions('table.split.table', Inf)
Print the markdown table:
pander(t)
Resulting in:
----------------------------------------------------------------------------
parents school/faculty employer self-funded other
---------------- --------- ---------------- ---------- ------------- -------
**city** 276 14 26 229 *20*
**small town** 14 1 1 11 *4*
**village** 13 1 0 13 2
----------------------------------------------------------------------------
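If you would rather write the markdown to a file than print it, something like this should work (a small sketch; pander_return captures the rendered lines as a character vector):
md <- pander::pander_return(t)
cat(md, sep = "\n", file = "table.md")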

R + match values at scale (using apply?)

Is there a way to make matching values at scale more programmatic? Basically what I want to do is add a bunch of lookup columns onto a data frame, but I don't want to write the match() call every time. It seems like this would be a use case for mapply, but I can't quite figure out how to use it here. Any suggestions?
Here's the data:
data <- data.frame(
  region = sample(c("northeast", "midwest", "west"), 50, replace = TRUE),
  climate = sample(c("dry", "cold", "arid"), 50, replace = TRUE),
  industry = sample(c("tech", "energy", "manuf"), 50, replace = TRUE))
And the corresponding lookup tables:
lookups <- data.frame(
  orig_val = c("northeast", "midwest", "west", "dry", "cold", "arid",
               "tech", "energy", "manuf"),
  look_val = c("dir1", "dir2", "dir3", "temp1", "temp2", "temp3",
               "job1", "job2", "job3")
)
So now what I want to do is: first, add a column to data called reg_lookup that matches the region to its appropriate value in lookups; then do the same for clim_lookup, and so on.
Right now, I've got this mess:
data$reg_lookup <- lookups$look_val[match(data$region, lookups$orig_val)]
data$clim_lookup <- lookups$look_val[match(data$climate, lookups$orig_val)]
data$indus_lookup <- lookups$look_val[match(data$industry, lookups$orig_val)]
I've tried using a function to do this, but the function doesn't seem to work, so then applying that to mapply is a no-go (plus I'm confused about how the mapply syntax would work here):
match_fun <- function(df, newval, df_look, lookup_val, var, ref_val) {
  df$newval <- df_look$lookup_val[match(df$var, df_look$ref_val)]
  return(df)
}
data2 <- match_fun(data, reg_2, lookups, look_val, region, orig_val)
I think you're just trying to do this:
data <- merge(data,lookups[1:3,],by.x = "region",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[4:6,],by.x = "climate",by.y = "orig_val",all.x = TRUE)
data <- merge(data,lookups[7:9,],by.x = "industry",by.y = "orig_val",all.x = TRUE)
But it would be much better to store the lookups in separate data frames. That way you can control the names of the new columns more easily. It would also allow you to do something like this:
lookups1 <- split(lookups,rep(1:3,each = 3))
colnames(lookups1[[1]]) <- c('region','reg_lookup')
colnames(lookups1[[2]]) <- c('climate','clim_lookup')
colnames(lookups1[[3]]) <- c('industry','indus_lookup')
do.call(cbind, mapply(merge,
                      x = list(data[, 1, drop = FALSE],
                               data[, 2, drop = FALSE],
                               data[, 3, drop = FALSE]),
                      y = lookups1,
                      MoreArgs = list(all.x = TRUE),
                      SIMPLIFY = FALSE))
and you should be able to wrap that do.call bit in a function.
I used data[, 1, drop = FALSE] in order to preserve each key as a one-column data frame.
The way you structure mapply calls is to pass the vectorised arguments as lists (the x = and y = parts). I wanted to be sure to preserve all the rows from data, so I passed all.x = TRUE via MoreArgs, so it gets passed on each call to merge. Finally, I need to stitch the results together myself, so I turned off SIMPLIFY.
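For completeness, a sketch of such a wrapper (the name merge_lookups is made up; it assumes one key column per lookup table, matched by position, as above):
merge_lookups <- function(df, lookup_list) {
  # merge the i-th column of df with the i-th lookup table
  pieces <- mapply(merge,
                   x = lapply(seq_along(lookup_list),
                              function(i) df[, i, drop = FALSE]),
                   y = lookup_list,
                   MoreArgs = list(all.x = TRUE),
                   SIMPLIFY = FALSE)
  do.call(cbind, pieces)
}
data2 <- merge_lookups(data, lookups1)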
