Convert column values to percentages in R

I am reading multiple Excel files into lists using read.xlsx from the openxlsx package. I combine the lists with rbind and perform some data manipulation.
What I need to do is convert the values in columns 18 and 19 to percentages. Currently the values show as .90, .85, etc. (I could also force the user to enter them as 90, 85, etc.), but I need them to display as 90%, 85%. I have tried to do this inside the data.frame and also using createStyle. So far, nothing has worked; each attempt either corrupts my data or simply does nothing.
Here is what I have tried...
openxlsx Style
# Create percent style
pct = createStyle(numFmt = "0%")
# Apply style
addStyle(wb, sheet = "filename", style = pct, cols = 18, rows = 102, gridExpand = TRUE)
str_replace
allData <- str_replace(allData$'Content', pattern = "%", "")
allData$'Content' <- as.numeric(allData)/100
sapply (even just trying to convert the data type to numeric didn't work; the cells were still set to General)
allData[, c(18)] <- sapply(allData[, c(18)], as.numeric)
Any help would be greatly appreciated!

Figured this out some time ago but forgot to post the answer. For those who are interested...
# Create a percent style
pct = createStyle(numFmt = "0%")
# Add percent style
addStyle(wb, sheet = "my_filename", style = pct, cols = c(18, 19), rows = 2:(nrow(allData)+1), gridExpand = TRUE)

Related

R: Conditional Formatting across Excel files

I am trying to highlight rows of an Excel file based on a match from the columns in a separate Excel file. Pretty much, I want to highlight a row in file1 if a cell in that row matches a cell in file2.
I saw that openxlsx's conditionalFormatting() function has some of this functionality, but I cannot figure out how to use it.
The pseudo-code, I think, would look something like this:
file1 <- read_excel("file1")
file2 <- read_excel("file2")
conditionalFormatting(file1, sheet = 1, cols = 1:end, rows = 1:22,
                      rule = "number in file1 is found in a specific column of file2")
Please let me know if this makes sense or if I need to clarify something.
Thanks!
The conditionalFormatting() function embeds active conditional formatting into the Excel document, but that is likely more complicated than you need for a one-time highlight. I'd suggest loading each file into a dataframe, determining which rows contain a matching cell, creating a highlight style (yellow background), loading the file as a workbook object, setting the appropriate rows to the highlight style, and saving the updated workbook object.
The following function is used to determine which rows have a match. The magrittr package provides the %>% pipes and the data.table package provides the transpose() function.
find_matched_rows <- function(df1, df2) {
  require(magrittr)
  require(data.table)
  # the dataframe object treats each column as a list, making it much easier
  # and faster to search by column than by row. Transpose the original file1
  # dataframe to treat the rows as columns.
  df1_transposed <- data.table::transpose(df1)
  # assuming that the location of the match in the second file is irrelevant,
  # unlist the file2 dataframe so that each value in file1 can be searched in
  # a vector
  df2_as_vector <- unlist(df2)
  # determine which columns contain a match. If one or more matches are found,
  # mark the row as TRUE in the output vector, which is then used to subset
  # the row numbers
  match_map <- lapply(df1_transposed, FUN = `%in%`, df2_as_vector) %>%
    as.data.frame(stringsAsFactors = FALSE) %>%
    sapply(function(x) sum(x) > 0)
  # make a vector of row numbers using the logical match_map vector to subset
  matched_rows <- seq_len(nrow(df1))[match_map]
  return(matched_rows)
}
The following code loads the data, finds the matched rows, applies the highlight, and saves over the original file1.xlsx. The second tst_df1 and tst_df2 definitions provide an easy way of testing the find_matched_rows() function. As expected, it finds that the 1st and 3rd rows of the first dataframe contain a cell that matches a cell in the second dataframe.
# used to ensure that the correct rows are highlighted. the dataframe does not
# include the header as an independent row, unlike Excel.
file1_header_row <- 1
file2_header_row <- 1
tst_df1 <- openxlsx::read.xlsx("./file1.xlsx", startRow = file1_header_row)
tst_df2 <- openxlsx::read.xlsx("./file2.xlsx", startRow = file2_header_row)
# example data for testing; overwrites the dataframes read above
tst_df1 <- data.frame(fname = c("John", "Bob", "Bill"),
                      lname = c("Smith", "Johnson", "Samson"),
                      wage = c(10, 15.23, 137.38),
                      stringsAsFactors = FALSE)
tst_df2 <- data.frame(a = c(10, 34, 284.2),
                      b = c("Billy", "Bill", "Billy-Bob"),
                      c = c("Samson", "Johansson", NA),
                      stringsAsFactors = FALSE)
df_matched_rows <- find_matched_rows(tst_df1, tst_df2)
# any color found in colours() can be used here, or a hex color beginning with "#"
highlight_style <- openxlsx::createStyle(fgFill = "yellow")
file1_wb <- openxlsx::loadWorkbook(file = "./file1.xlsx")
openxlsx::addStyle(wb = file1_wb,
                   sheet = 1,
                   style = highlight_style,
                   rows = file1_header_row + df_matched_rows,
                   cols = 1:ncol(tst_df1),
                   stack = TRUE,
                   gridExpand = TRUE)
openxlsx::saveWorkbook(wb = file1_wb,
                       file = "./file1.xlsx",
                       overwrite = TRUE)
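If you only need a one-off check, a rough base-R sketch of the same row matching (it coerces each row to character with unlist(), which is fine for equality matching but slower on large data) is:
which(sapply(seq_len(nrow(tst_df1)),
             function(i) any(unlist(tst_df1[i, ]) %in% unlist(tst_df2))))
# returns 1 3 for the example data above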

Using mutate in R to rename items in a column

I am trying to name a column and rename all items within the column of a dataset:
dataSet <- read.csv(url) %>%
  rename("newColumn1" = V1) %>%
  mutate(newColumn1 = recode(newColumn1, "oldEntryX" = "newEntryX") %>%
  select(dataSet, newColumn1)
And I get this error:
Error in recode(newColumn1, oldEntryX = "newEntryX" :
object 'newColumn1' not found
What am I missing?
The code runs correctly through the rename function and displays the renamed column correctly, but as soon as I include mutate it throws an error.
I have no problem sharing the real code but wanted to generalize it for the crowd.
The source data is from https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data
In the mutate step, you don't need quotes for column names on the lhs of =. Also, there are a couple of case mismatches.
Assuming the dataset is read correctly, we can do:
df1 %>%
  rename(newColumn1 = V1, newColumn2 = V2) %>%
  mutate(newColumn1 = recode(newColumn1, oldEntryX = "newEntryX"),
         newColumn2 = recode(newColumn2, oldEntryY = "newEntryY"))
Based on the OP's code, the mutate() call is also missing its closing parenthesis (the parenthesis after recode(...) closes recode, not mutate), and select() inside a pipe should not repeat the dataset: use select(newColumn1) rather than select(dataSet, newColumn1).
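Putting those fixes together, a corrected version of the OP's pipeline might look like this (a sketch, assuming the file is read with header = FALSE so the columns come in as V1, V2, ...):
library(dplyr)
dataSet <- read.csv(url, header = FALSE, stringsAsFactors = FALSE) %>%
  rename(newColumn1 = V1) %>%
  mutate(newColumn1 = recode(newColumn1, oldEntryX = "newEntryX")) %>%
  select(newColumn1)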
data
set.seed(24)
df1 <- data.frame(V1 = sample(c("oldEntryX", "x", "y"), 10, replace = TRUE),
                  V2 = sample(c("oldEntryY", "x", "y"), 10, replace = TRUE),
                  stringsAsFactors = FALSE)
You can also do this with a few lines of base R.
How to read a csv file:
data <- read.csv("filename.csv")
By default, the first row of the file is used as the header. If your file has no header row, write:
data <- read.csv("datafile.csv", header = FALSE)
How to rename the header/column names:
names(data) <- c("Column1", "Column2", "Column3")
Now your headers are replaced by Column1, Column2 and Column3.
To change the Column1 data, assign a new set of values:
data$Column1 <- c(...)  # replace ... with the set of values you want
To see the output, type:
data
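For instance, a concrete sketch with made-up entries (the file name and values here are hypothetical):
data <- read.csv("datafile.csv", header = FALSE)
names(data) <- c("Column1", "Column2", "Column3")
# replace a single value in the column rather than rewriting the whole column
data$Column1[data$Column1 == "oldEntryX"] <- "newEntryX"
data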

Dynamic Reporting in R

I am looking for help generating an 'rtf' report from R (a dataframe).
I am trying to output data with many columns into an 'rtf' file using the following code:
library(rtf)
inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)
outputFileName = "test.out"
rtf <- RTF(paste(".../", outputFileName, ".rtf"), width = 11, height = 8.5,
           font.size = 10, omi = c(.5, .5, .5, .5))
addTable(rtf, inp.data, row.names = F, NA.string = "-",
         col.widths = rep(1, 12), header.col.justify = rep("C", 12))
done(rtf)
The problem I face is that some of the columns are getting hidden (as you can see, the last 2 columns are cut off). I would like these columns to print on the next page (without reducing the column width).
Can anyone suggest packages/techniques for this scenario?
Thanks
Six years later, there is finally a package that can do exactly what you wanted. It is called reporter (small "r", no "s"). It will wrap columns to the next page if they exceed the available content width.
library(reporter)
library(magrittr)
# Prepare sample data
inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)
# Make unique column names
nm <- c("weight", "Time", "Chick", "Diet")
nms <- paste0(nm, c(rep(1, 4), rep(2, 4), rep(3, 4)))
names(inp.data) <- nms
# Create table
tbl <- create_table(inp.data) %>%
  column_defaults(width = 1, align = "center")
# Create report and add table to report
rpt <- create_report("test.rtf", output_type = "RTF", missing = "-") %>%
  set_margins(left = .5, right = .5) %>%
  add_content(tbl)
# Write the report
write_report(rpt)
The only thing is that you need unique column names, so I added a bit of code to do that.
If the docx format can replace rtf, use the ReporteRs package.
library(ReporteRs)
inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)
doc = docx()
# uncomment the addSection blocks if you want to change the page
# orientation to landscape
# doc = addSection(doc, landscape = TRUE)
doc = addFlexTable(doc, vanilla.table(inp.data))
# doc = addSection(doc, landscape = FALSE)
writeDoc(doc, file = "inp.data.docx")
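Note that ReporteRs has since been removed from CRAN. A rough equivalent using its successor, officer, is sketched below (unlike reporter, officer will not wrap a too-wide table onto the next page by itself):
library(officer)
inp.data <- cbind(ChickWeight, ChickWeight, ChickWeight)
# body_add_table() expects a plain data.frame with unique column names
names(inp.data) <- make.unique(names(inp.data))
doc <- read_docx()
doc <- body_add_table(doc, as.data.frame(inp.data))
print(doc, target = "inp.data.docx")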

R + match values at scale (using apply?)

Is there a way to make matching values at scale more programmatic? Basically, what I want to do is add a bunch of lookup columns onto a dataframe, but I don't want to write the match() call every time. It seems like this would be a use case for mapply, but I can't quite figure out how to use it here. Any suggestions?
Here's the data:
data <- data.frame(
  region = sample(c("northeast", "midwest", "west"), 50, replace = TRUE),
  climate = sample(c("dry", "cold", "arid"), 50, replace = TRUE),
  industry = sample(c("tech", "energy", "manuf"), 50, replace = TRUE))
And the corresponding lookup tables:
lookups <- data.frame(
  orig_val = c("northeast", "midwest", "west", "dry", "cold", "arid", "tech", "energy", "manuf"),
  look_val = c("dir1", "dir2", "dir3", "temp1", "temp2", "temp3", "job1", "job2", "job3")
)
So now what I want to do is: first add a column to "data" called "reg_lookup" that matches the region to its appropriate value in "lookups", then do the same for "clim_lookup" and so on.
Right now, I've got this mess:
data$reg_lookup <- lookups$look_val[match(data$region, lookups$orig_val)]
data$clim_lookup <- lookups$look_val[match(data$climate, lookups$orig_val)]
data$indus_lookup <- lookups$look_val[match(data$industry, lookups$orig_val)]
I've tried using a function to do this, but the function doesn't seem to work, so then applying that to mapply is a no-go (plus I'm confused about how the mapply syntax would work here):
match_fun <- function(df, newval, df_look, lookup_val, var, ref_val) {
  df$newval <- df_look$lookup_val[match(df$var, df_look$ref_val)]
  return(df)
}
data2 <- match_fun(data, reg_2, lookups, look_val, region, orig_val)
I think you're just trying to do this:
data <- merge(data, lookups[1:3, ], by.x = "region", by.y = "orig_val", all.x = TRUE)
data <- merge(data, lookups[4:6, ], by.x = "climate", by.y = "orig_val", all.x = TRUE)
data <- merge(data, lookups[7:9, ], by.x = "industry", by.y = "orig_val", all.x = TRUE)
But it would be much better to store the lookups in separate data frames. That way you can control the names of the new columns more easily. It would also allow you to do something like this:
lookups1 <- split(lookups, rep(1:3, each = 3))
colnames(lookups1[[1]]) <- c('region', 'reg_lookup')
colnames(lookups1[[2]]) <- c('climate', 'clim_lookup')
colnames(lookups1[[3]]) <- c('industry', 'indus_lookup')
do.call(cbind, mapply(merge,
                      x = list(data[, 1, drop = FALSE],
                               data[, 2, drop = FALSE],
                               data[, 3, drop = FALSE]),
                      y = lookups1,
                      MoreArgs = list(all.x = TRUE),
                      SIMPLIFY = FALSE))
and you should be able to wrap that do.call bit in a function.
I used data[, 1, drop = FALSE] in order to preserve the columns as one-column data frames.
The way you structure mapply calls is to pass named arguments as lists (the x = and y = parts). I wanted to be sure to preserve all the rows from data, so I passed all.x = TRUE via MoreArgs (note the capital M), so that it gets passed each time merge is called. Finally, I need to stitch the results together myself, so I turned off SIMPLIFY.
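As for why the OP's match_fun failed: $ does not evaluate its argument, so df$var looks for a column literally named "var". A working sketch passes the column names as strings and uses [[ instead:
match_fun <- function(df, newval, df_look, lookup_val, var, ref_val) {
  # [[ evaluates its argument, unlike $, so names can be passed as strings
  df[[newval]] <- df_look[[lookup_val]][match(df[[var]], df_look[[ref_val]])]
  df
}
data2 <- match_fun(data, "reg_lookup", lookups, "look_val", "region", "orig_val")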

How can I produce report quality tables from R?

If I have the following dataframe called result
> result
Name CV LCB UCB
1 within 2.768443 1.869964 5.303702
2 between 4.733483 2.123816 18.551051
3 total 5.483625 3.590745 18.772389
> dput(result,"")
structure(list(Name = structure(c("within", "between", "total"
), .rk.invalid.fields = list(), .Label = character(0)), CV = c(2.768443,
4.733483, 5.483625), LCB = c(1.869964, 2.123816, 3.590745), UCB = c(5.303702,
18.551051, 18.772389)), .Names = c("Name", "CV", "LCB", "UCB"
), row.names = c(NA, 3L), class = "data.frame")
What is the best way to present this data nicely? Ideally I'd like an image file that can be pasted into a report, or possibly an HTML file representing the table.
Extra points for setting number of significant figures.
I would use xtable. I usually use it with Sweave.
library(xtable)
d <- data.frame(letter=LETTERS, index=rnorm(52))
d.table <- xtable(d[1:5,])
print(d.table,type="html")
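print.xtable also accepts a file argument, and xtable() accepts digits, which together cover the "HTML file" and precision requests (a small sketch reusing d from above):
d.table <- xtable(d[1:5, ], digits = 3)
print(d.table, type = "html", file = "d_table.html")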
If you want to use it in a Sweave document, you would use it like so:
<<label=tab1, echo=FALSE, results=tex>>=
print(xtable(d, caption = "Here is my caption", label = "tab:one"),
      caption.placement = "top")
@
(Note that caption.placement is an argument of print.xtable(), not xtable(), and a Sweave chunk ends with @.)
For the table aspect, the xtable package comes to mind as it can produce LaTeX output (which you can use via Sweave for professional reports) as well as html.
If you combine that in Sweave with fancy graphs (see other questions for ggplot examples) you are almost there.
library(ggplot2)
ggplot(result, aes(x = Name, y = CV, ymin = LCB, ymax = UCB)) + geom_errorbar() + geom_point()
ggplot(result, aes(x = Name, y = CV, ymin = LCB, ymax = UCB)) + geom_pointrange()
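Since the asker wanted an image file to paste into a report, ggsave() writes the most recent plot to disk; the file name and dimensions here are just examples:
ggsave("cv_pointrange.png", width = 6, height = 4, dpi = 300)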
To control the displayed precision, the easiest thing to do (for this sample data, mind you) would be to move Name to the rownames and round the whole thing.
#Set the rownames equal to Name - assuming all unique
rownames(result) <- result$Name
#Drop the Name column so that round() can coerce
#result.mat to a matrix
result.mat <- result[ , -1]
round(result.mat, 2) # where 2 = however many decimal places you want
This is not a terribly robust solution - non-unique Name values would break it, I think, as would other non-numeric columns. But for producing a table like your example, it does the trick.
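If you want true significant figures rather than a fixed number of decimal places, base R's signif() does that (reusing result.mat from above):
signif(result.mat, 3)  # e.g. 2.768443 becomes 2.77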
