I am reading in data from individual xlsx files, with the data stored in 10-20 thousand individual tabs in each workbook file. The first sheet contains a master data table, including links to the individual tabs with further data.
The column-based 'tabbed' data is summarized and transposed before being appended to the master data.
The master data table is large in its own right (tens of thousands of rows by hundreds of columns); the individual data tabs are small (a few columns by tens to a few hundred rows).
Using the XLConnect package crashed out of memory already on the loadWorkbook() call (R 3.4.0, RStudio 1.1.383, 64-bit, 8 GB machine), otherwise I could have worked along the lines of this.
Because I need to load from individual tabs, I am currently using a nested for() loop to load each individual tab's data. However, with my number of tabs this takes nearly a minute per iteration, putting the total execution time at nearly a week! Using a nested for() loop is also decidedly non-tidy, so I suspect there is a neater and (much) faster way to achieve this, but I can't see it.
I have read in the links into a dedicated df (linkReferences) in R.
The data source is not mine, so I am stuck with the input as provided.
The problem is purely related to the speed of reading the sheets, which appears to grow as the number of sheets in a file (and thus the file size) grows.
I am looking for any solution to speed this up. (Updated with a self-contained minimal example below.)
On my PC: n = 10 gives ~0.16 sec/sheet, n = 100 ~0.56 sec/sheet and n = 1000 ~3 sec/sheet, which is similar to what I'm seeing in my real data (<10 sec/sheet for 16k sheets).
library(tidyverse)
number_of_sheets= 100
# =========================================================================
# CREATE SAMPLE FILE . Layout similar to actual data
library(openxlsx)
my.sheets.file <- "sampleXLSX.xlsx"
linkReferences <- data_frame( sheet = str_c("Data ",seq(1:number_of_sheets)) )
wb <- write.xlsx(linkReferences, file=my.sheets.file)
sample_header <-data.frame( head_name = c("head1", "head2","head3","head4","head5") ,
head_text = c("text1", "text2","text3","text4","text5") )
set.seed(31415)
for (i in 1:number_of_sheets) {
cat(i,"..")
sheet_name_i <- paste0("Data ",i)
addWorksheet(wb, sheetName = sheet_name_i)
writeData(wb, sheet=sheet_name_i, sample_header, startCol = "B", startRow=2)
n = ceiling( runif(1)*200 )
sample_data <- data_frame(A=seq(1:n),
B= runif(n),
C= sample(seq(1:5),n,replace=TRUE))
writeData(wb, sheet=sheet_name_i, sample_data, startCol = "B", startRow=10)
}
saveWorkbook(wb, file=my.sheets.file, overwrite=TRUE)
#===========================================================================
# THIS IS THE ACTUAL QUESTION
# Read from file with many tabs
library(readxl)
library(stringr)
linkReferences <- linkReferences %>%
mutate( Head1 = NA, Head2 = NA, Head3 = NA, Head4 = NA, Head5 = NA,
A.1 = NA, B.1 = NA, C.1 = NA,
A.2 = NA, B.2 = NA, C.2 = NA,
A.3 = NA, B.3 = NA, C.3 = NA,
A.4 = NA, B.4 = NA, C.4 = NA,
A.5 = NA, B.5 = NA, C.5 = NA
)
linkReferences.nrows = nrow(linkReferences)
lRnames <- names(linkReferences)
start.row=1
start_time <- Sys.time()
for (i in start.row:linkReferences.nrows){
cat("i=",i, " / ",linkReferences.nrows,"\n")
start_time_i=Sys.time()
linked_data <- read_xlsx(my.sheets.file,
sheet=as.character(linkReferences[i,"sheet"]),
skip=2,
col_types = c("text","text","text"),
col_names=FALSE)
print(Sys.time()-start_time_i) # This takes 99% of the loop time
linkReferences[i,2:6] <- unlist( linked_data[1:5,2])
data_head_row <- which( linked_data[,1]=="A")
names(linked_data) <- c("A","B","C")
linked_data <- linked_data[ (data_head_row+1):(nrow(linked_data)),]
# create a (rather random) sample summary
summary_linked_data <- linked_data%>%
group_by(C) %>%
summarise(B=last(B), A=last(A)) %>%
arrange(desc(C))
# not all data has the full range of options, so use actual number
summary_linked_data_nrows <- nrow(summary_linked_data)
#start_time_i2 <- Sys.time()
for( ii in 1:summary_linked_data_nrows) {
linkReferences[i, match(str_c("A.",ii),lRnames):match(str_c("C.",ii),lRnames)] <-
summary_linked_data[ii,]
}
#print(Sys.time()-start_time_i2)
print(linkReferences[i,2:20])
# ________________________________________________________
# BELOW IS ONLY FOR TEST LOOP TIMING STATS IN THIS EXAMPLE
delta_time <- Sys.time() - start_time
delta_time_attr <- attr(delta_time, "units")
row_time <- delta_time/(i-start.row+1)
if (delta_time_attr =="mins") {
row_time <- row_time*60
} else if( delta_time_attr == "hours") {
row_time <- row_time*3600
}
total_time <- row_time*(linkReferences.nrows-start.row-1)/3600
cat( "Passed time: ", delta_time, attr(delta_time, "units"),
" | time/row: ", round(row_time,2), "secs.",
" | Est total time:",
round(total_time*60,2), "mins (=",
round(total_time,2), "hours)",
"\n---------------\n")
}
# Conversion of data loaded as character to numeric can all happen outside loop once all data is loaded.
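A minimal sketch of that post-loop conversion (with the toy data above the Head columns are plain text, so they would simply become NA; with real numeric data they convert in one pass):
linkReferences <- linkReferences %>% mutate_at(vars(-sheet), as.numeric)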
After some digging: XLConnect, with its vectorised sheet-reading capability (see here), is the clear winner, provided you can hold your workbook in memory.
I had to (a) reduce the size of my workbook, and (b) set XLConnect's Java memory to 4 GB as per @Joshua's link here.
For the 1000 sheets example as per the question above:
wb <- loadWorkbook() took 15 seconds,
linked_data_lst = readWorksheet() took 34 seconds,
and the data extraction for (i in 1:nr_linked_data){...} from the now in-memory list, took 86 seconds.
Giving a total time of 0.135 sec/sheet (22x faster than the code above)
#============================================================================
# now read it again
library(stringr)
options(java.parameters = "-Xmx4g" )
library(XLConnect)
linkReferences <- linkReferences %>%
mutate( Head1 = NA, Head2 = NA, Head3 = NA, Head4 = NA, Head5 = NA,
A.1 = NA, B.1 = NA, C.1 = NA,
A.2 = NA, B.2 = NA, C.2 = NA,
A.3 = NA, B.3 = NA, C.3 = NA,
A.4 = NA, B.4 = NA, C.4 = NA,
A.5 = NA, B.5 = NA, C.5 = NA
)
linkReferences.nrows = nrow(linkReferences)
lRnames <- names(linkReferences)
lRcols <- c(match(str_c("A.1"),lRnames):match(str_c("C.5"),lRnames))
lRheadCols <- c((lRcols[1]-5):(lRcols[1]-1))
start_time <- Sys.time()
wb <- loadWorkbook(my.sheets.file)
Sys.time() - start_time
start.row=1
end.row = linkReferences.nrows
start_time0 <- Sys.time()
linked_data_lst = readWorksheet(wb,
sheet=linkReferences[start.row:end.row,][["sheet"]],
startCol = 2,
endCol = 4,
startRow = 3,
header = FALSE)
delta_time <- (Sys.time() - start_time0) %>% print()
nr_linked_data <- length(linked_data_lst)
start_time <- Sys.time()
for (i in 1:nr_linked_data ) {
cat("i=",i, " / ",nr_linked_data,"\n")
linked_data <- as_tibble(linked_data_lst[[i]])
# EVERYTHING BELOW HERE IS EXACTLY SAME AS IN QUESTION CODE
# =========================================================
linkReferences[i,lRheadCols] <- unlist( linked_data[1:5,2])
data_head_row <- which( linked_data[,1]=="A")
names(linked_data) <- c("A","B","C")
linked_data <- linked_data[ (data_head_row+1):(nrow(linked_data)),]
linked_data <- linked_data %>% mutate_all( funs(as.numeric) )
# create a (rather random) sample summary
summary_linked_data <- linked_data%>%
group_by(C) %>%
summarise(B=last(B), A=last(A)) %>%
arrange(desc(C))
# not all data has the full range of options, so use actual number
summary_linked_data_nrows <- nrow(summary_linked_data)
#start_time_i2 <- Sys.time()
for( ii in 1:summary_linked_data_nrows) {
linkReferences[i, match(str_c("A.",ii),lRnames):match(str_c("C.",ii),lRnames)] <-
summary_linked_data[ii,]
}
#print(Sys.time()-start_time_i2)
print(linkReferences[i,lRheadCols[1]:max(lRcols)])
delta_time <- Sys.time() - start_time
delta_time_attr <- attr(delta_time, "units")
row_time <- delta_time/(i-start.row+1)
if (delta_time_attr =="mins") {
row_time <- row_time*60
} else if( delta_time_attr == "hours") {
row_time <- row_time*3600
}
total_time <- row_time*(linkReferences.nrows-start.row-1)/3600
cat( "Passed time: ", delta_time, attr(delta_time, "units"),
" | time/row: ", round(row_time,2), "secs.",
" | Est total time:",
round(total_time*60,2), "mins (=",
round(total_time,2), "hours)",
"\n---------------\n")
}
I'm looking for a function/method to extrapolate (linearly) for an x number of values beyond the original values.
Let's say I start with:
a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
And I want to extrapolate two values beyond, I would end up with:
[1] NA NA NA NA -1 0 1 2 3 4 5 NA NA NA NA
What I have found so far is the approxExtrap function from Hmisc (https://rdrr.io/cran/Hmisc/man/approxExtrap.html). But since you have to define 'xout', I feel I would have to write a loop and each time select the piece I want to extrapolate on. This is possible of course, but ultimately I expect to have sequences of millions of datapoints with a lot of gaps, so I feel this may be too time-consuming. I hope I'm overlooking a simpler solution.
Added: There are no small gaps in the data, but typically ~ 100 NA's and then ~ 40 datapoints. I would like to extrapolate/extend the 40 datapoints with 5 new datapoints before the start and after the end of the 40 datapoints and replace 5 NA's at both locations. It is not possible to interpolate between two sequences of 40 datapoints.
I managed to solve the problem by:
Determining the ranges of the different series of data
Defining the range I want to extrapolate to
Doing the actual extrapolation through the Hmisc package
Initially, I thought I could only manage this by some loops that had to go through the raw data row by row, and was hoping for an existing function.
I'm sure many of you would have coded this way more efficient and nicer. But wanted to post my script anyway for people with a similar problem.
require(Hmisc)
extrapol.length <- 5
test <- data.frame('Time' = c(1:100), # I didn't use this as my data was equally spread in time, if you want to use it, see the first argument in the approxExtrap-function in the secondlast line
'x' = c(rep(NA, 10), 1:30, rep(NA, 30), 1:10, rep(NA, 20)))
## Determine start and end of the continuous (non-NA) data streams
length.values <- diff(c(0, which(is.na(test[,2]))))-2 # length non-NA's
length.values <- length.values[length.values > -1]
length.nas <- diff(c(0, which(!is.na(test[,2])))) # length NA's
length.nas <- length.nas[length.nas > 1]
if(is.na(test[1,2])){
# data starts with NA
length.nas <- data.frame('Order' = seq(1, length(length.nas)*2, by = 2),
'Length' = length.nas, 'Type' = 'na')
length.values <- data.frame('Order' = seq(2, length(length.values)*2, by = 2),
'Length' = length.values, 'Type' = 'value')
start.end <- rbind(length.nas, length.values)
start.end <- start.end[order(start.end$Order),]
value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'na'])),
'start' = NA, 'end' = NA)
for(a in value.seqs$no){
value.seqs$start[a] <- sum(start.end$Length[1:((a*2)-1)])
value.seqs$end[a] <- sum(start.end$Length[1:(a*2)])
}
}else{
# Data starts with actual values
length.nas <- data.frame('Order' = seq(2, length(length.nas)*2, by = 2),
'Length' = length.nas, 'Type' = 'na')
length.values <- data.frame('Order' = seq(1, length(length.values)*2, by = 2),
'Length' = length.values, 'Type' = 'value')
start.end <- rbind(length.nas, length.values)
start.end <- start.end[order(start.end$Order),]
value.seqs <- data.frame('no' = c(1:length(start.end$Type[start.end$Type == 'value'])),
'start' = c(1,rep(NA, (length(start.end$Type[start.end$Type == 'value'])-1))), 'end' = NA)
for(a in value.seqs$no){
value.seqs$end[a] <- sum(start.end$Length[1:((a*2)-1)])+1
if(a < max(value.seqs$no))
value.seqs$start[a+1] <- sum(start.end$Length[1:(a*2)])+1
}
}
## Do not extrapolate outside of the time-range of the original dataframe
value.seqs$start.extr <- value.seqs$start - extrapol.length
value.seqs$start.extr[value.seqs$start.extr < 1] <- 1 # do not extrapolate below time < 1
value.seqs$end.extr <- value.seqs$end + extrapol.length
value.seqs$end.extr[value.seqs$end.extr > nrow(test) | is.na(value.seqs$end.extr)] <- nrow(test)
value.seqs$end[is.na(value.seqs$end)] <- max(which(!is.na(test[,2])))
## Extrapolate
for(b in value.seqs$no){
test[c(value.seqs$start.extr[b]:value.seqs$end.extr[b]), 3] <-
  approxExtrap(value.seqs$start[b]:value.seqs$end[b],
               test[c(value.seqs$start[b]:value.seqs$end[b]), 2],
               xout = c(value.seqs$start.extr[b]:value.seqs$end.extr[b]))[2]
}
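For comparison, the same run-detection idea can be written more compactly with rle(); this is only a sketch, and it assumes every non-NA run has at least two points (as the ~40-point runs described above do):
library(Hmisc)
extrapolate_runs <- function(x, k = 5) {
  r <- rle(!is.na(x))
  ends   <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1
  out <- x
  for (j in which(r$values)) {                 # loop over the non-NA runs only
    idx  <- starts[j]:ends[j]                  # positions of the observed run
    xout <- max(1, starts[j] - k):min(length(x), ends[j] + k)
    out[xout] <- approxExtrap(idx, x[idx], xout = xout)$y
  }
  out
}
a <- c(NA, NA, NA, NA, NA, NA, 1, 2, 3, NA, NA, NA, NA, NA, NA)
extrapolate_runs(a, k = 2)
# NA NA NA NA -1  0  1  2  3  4  5 NA NA NA NA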
Thanks for thinking along!
I have about 977 observations in top500Stocks, which contains the names of 977 stocks.
head(top500Stocks,10)
ï..Symbol
1 RELIANCE
2 TCS
3 HDFCBANK
4 INFY
5 HINDUNILVR
6 HDFC
7 ICICIBANK
8 KOTAKBANK
9 SBIN
10 BAJFINANCE
and in stocksRetData I have the Date, OHLC, Adj.Close, Volume and Ret of each stock from top500Stocks
head(stocksRetData[[1]],3)
Date Open High Low Close Adj.Close Volume Ret
1 20000103 28.18423 29.86935 28.18423 38.94457 29.86935 28802010 0.000
2 20000104 30.66445 32.26056 29.82188 42.06230 32.26056 61320457 0.080
3 20000105 30.45677 34.16522 30.45677 43.71014 33.52440 173426953 0.039
Now, for a given lookbackPeriod and holdPeriod, I am trying to run the function below, but it takes about a minute. How can I make it faster? Because I have to run it for multiple lookbackPeriod and holdPeriod values, it will take forever to complete.
CalC.MOD_MScore.Ret.High <- function(lookbackPeriod, holdPeriod, fnoStocks,
stocksRetData, totalTestPeriod) {
#We go through each stock and calculate Modified mscores where we give more importance to recent data
WeeklyData <- list()
wmean <- function(x, k) mean(seq(k)/k * x)
for (i in 1:nrow(fnoStocks)){
out <- stocksRetData[[i]]
out <- tail(out,totalTestPeriod)
if (nrow(out)==totalTestPeriod){
tempDF <- transform(out, wtMean = rollapply(Ret, lookbackPeriod, wmean,
k = lookbackPeriod, align = "right",
fill = NA))
tempDF <- transform(tempDF, ExitVal = rollapply(lead(High, holdPeriod),
holdPeriod, max,
align = "right",
fill = NA))
tempDF$NWeekRet <- (tempDF$ExitVal - tempDF$Adj.Close ) / tempDF$Adj.Close
tempDF <- tempDF[!is.na(tempDF$wtMean),]
tempDF <- tempDF[!is.na(tempDF$ExitVal),]
tempDF$StockName = fnoStocks[i,1]
tempDF$WeekNum = c((lookbackPeriod):(nrow(tempDF)+lookbackPeriod-1))
WeeklyData[[i]] <- data.frame(
StockName = tempDF$StockName,
WeekNum = tempDF$WeekNum,
M_Score = tempDF$wtMean,
NWeekRet = tempDF$NWeekRet,
stringsAsFactors = FALSE
)
}
}# i ends here
return(bind_rows(WeeklyData))
}
This takes more than a minute to complete.
a <- CalC.MOD_MScore.Ret.High(4,14,fnoStocks = top500Stocks, stocksRetData = stocksRetData, 2000)
First of all, I wouldn't suggest using for-loops in R. I would rewrite your loop with an lapply, like this:
CalC.MOD_MScore.Ret.High <- function(lookbackPeriod, holdPeriod, fnoStocks,
stocksRetData, totalTestPeriod) {
#We go through each stock and calculate Modified mscores where we give more importance to recent data
wmean <- function(x, k) mean(seq(k)/k * x)
WeeklyData <- lapply(1:nrow(fnoStocks), function(i) {
out <- stocksRetData[[i]]
out <- tail(out,totalTestPeriod)
if(nrow(out)!=totalTestPeriod) return(NULL)
tempDF <- transform(out, wtMean = rollapply(Ret, lookbackPeriod, wmean,
k = lookbackPeriod, align = "right",
fill = NA))
tempDF <- transform(tempDF, ExitVal = rollapply(lead(High, holdPeriod),
holdPeriod, max,
align = "right",
fill = NA))
tempDF$NWeekRet <- (tempDF$ExitVal - tempDF$Adj.Close ) / tempDF$Adj.Close
tempDF <- tempDF[!is.na(tempDF$wtMean),]
tempDF <- tempDF[!is.na(tempDF$ExitVal),]
tempDF$StockName = fnoStocks[i,1]
tempDF$WeekNum = c((lookbackPeriod):(nrow(tempDF)+lookbackPeriod-1))
data.frame(
StockName = tempDF$StockName,
WeekNum = tempDF$WeekNum,
M_Score = tempDF$wtMean,
NWeekRet = tempDF$NWeekRet,
stringsAsFactors = FALSE
)
})
return(bind_rows(WeeklyData))
}
Having an lapply makes it easier to apply parallelization tools to it.
You can have a look at the parallel package. With it, you can parallelize and make use of multiple cores on your machine. To do so, you need to set up a cluster, which produces some overhead, but I think it will pay off in your case. Set up a cluster via cl <- parallel::makeCluster(parallel::detectCores()). The detectCores() function gets the number of available cores on your machine. Then, you can edit the lapply to
WeeklyData <- parallel::parLapply(cl = cl, 1:nrow(fnoStocks), function(i) {
...
})
After all your calculations have finished, call parallel::stopCluster(cl) to stop the cluster.
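Putting the pieces together, here is a hedged sketch of a fully parallel version (it simply wraps the lapply body from above; the zoo/dplyr loads on the workers and the core count are assumptions to adjust to your setup):
library(parallel)
CalC.MOD_MScore.Ret.High.par <- function(lookbackPeriod, holdPeriod, fnoStocks,
                                         stocksRetData, totalTestPeriod) {
  wmean <- function(x, k) mean(seq(k)/k * x)
  cl <- makeCluster(detectCores() - 1)                 # leave one core free
  on.exit(stopCluster(cl), add = TRUE)                 # always shut the cluster down
  clusterEvalQ(cl, { library(zoo); library(dplyr) })   # rollapply() and lead() on the workers
  WeeklyData <- parLapply(cl, 1:nrow(fnoStocks), function(i) {
    out <- tail(stocksRetData[[i]], totalTestPeriod)
    if (nrow(out) != totalTestPeriod) return(NULL)
    tempDF <- transform(out, wtMean = rollapply(Ret, lookbackPeriod, wmean,
                                                k = lookbackPeriod,
                                                align = "right", fill = NA))
    tempDF <- transform(tempDF, ExitVal = rollapply(lead(High, holdPeriod),
                                                    holdPeriod, max,
                                                    align = "right", fill = NA))
    tempDF$NWeekRet <- (tempDF$ExitVal - tempDF$Adj.Close) / tempDF$Adj.Close
    tempDF <- tempDF[!is.na(tempDF$wtMean) & !is.na(tempDF$ExitVal), ]
    data.frame(StockName = fnoStocks[i, 1],
               WeekNum   = lookbackPeriod:(nrow(tempDF) + lookbackPeriod - 1),
               M_Score   = tempDF$wtMean,
               NWeekRet  = tempDF$NWeekRet,
               stringsAsFactors = FALSE)
  })
  dplyr::bind_rows(WeeklyData)
}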
I have a dataframe e.g.
df_reprex <- data.frame(id = rep(paste0("S",round(runif(100, 1000000, 9999999),0)), each=10),
date = rep(seq.Date(Sys.Date(), by=-7, length.out = 10), 100), # Sys.Date() keeps the reprex base-R only (the original today() needs lubridate)
var1 = runif(1000, 10, 20),
var2 = runif(1000, 20, 50),
var3 = runif(1000, 2, 5),
var250 = runif(1000, 100, 200),
var1_baseline = rep(runif(100, 5, 10), each=10),
var2_baseline = rep(runif(100, 50, 80), each=10),
var3_baseline = rep(runif(100, 1, 3), each=10),
var250_baseline = rep(runif(100, 20, 70), each=10))
I want to write a function containing a for loop that for each row in the dataframe will subtract every "_baseline" column from the non-baseline column with the same name.
I have created a script that automatically creates a character string containing the code I would like to run:
df <- df_reprex
# get only numeric columns
df_num <- df %>% dplyr::select_if(., is.numeric)
# create a version with no baselines
df_nobaselines <- df_num %>% select(-contains("baseline"))
#extract names of non-baseline columns
numeric_cols <- names(df_nobaselines)
#initialise empty string
mutatestring <- ""
#write loop to fill in string:
for (colname in numeric_cols) {
mutatestring <- paste(mutatestring, ",", paste0(colname, "_change"), "=", colname, "-", paste0(colname, "_baseline"))
# df_num <- df_num %>%
# mutate(paste0(col, "_change") = col - paste0(col, "_baseline"))
}
mutatestring <- substr(mutatestring, 4, 9999999) # remove stuff at start (I know it's inefficient)
mutatestring2 <- paste("df %>% mutate(", mutatestring, ")") # add mutate call
But when I try to call "mutatestring2", it just prints the character string, e.g.:
[1] "df %>% mutate( var1_change = var1 - var1_baseline , var2_change = var2 - var2_baseline , var3_change = var3 - var3_baseline , var250_change = var250 - var250_baseline )"
I thought that this part would be relatively easy and I'm sure I've missed something obvious, but I just can't get the text inside that string to run!
I've tried various slightly ridiculous methods but none of them return the desired output (i.e. the result returned by the character string if it was entered into the console as a command):
call(mutatestring2)
eval(mutatestring2)
parse(mutatestring2)
str2lang(mutatestring2)
mget(mutatestring2)
diff_func <- function() {mutatestring2}
diff_func1 <- function() {
a <-mutatestring2
return(a)}
diff_func2 <- function() {str2lang(mutatestring2)}
diff_func3 <- function() {eval(mutatestring2)}
diff_func4 <- function() {parse(mutatestring2)}
diff_func5 <- function() {call(mutatestring2)}
diff_func()
diff_func1()
diff_func2()
diff_func3()
diff_func4()
diff_func5()
I'm sure there must be a very straightforward way of doing this, but I just can't work it out!
How do I convert a character string to something that I can run or pass to a magrittr pipe?
You need to use the text parameter in parse, then eval the result. For example, you can do:
eval(parse(text = "print(5)"))
#> [1] 5
However, using eval(parse()) is normally a very bad idea, and there is usually a more sensible alternative.
In your case you can do this without resorting to eval(parse()), for example in base R you could subtract all the appropriate variables from each other like this:
baseline <- grep("_baseline$", names(df_reprex), value = TRUE)
non_baseline <- gsub("_baseline", "", baseline)
df_new <- cbind(df_reprex, as.data.frame(setNames(mapply(
function(i, j) df_reprex[[i]] - df_reprex[[j]],
baseline, non_baseline, SIMPLIFY = FALSE),
paste0(non_baseline, "_corrected"))))
Or if you want to keep the whole thing in a single pipe without storing intermediate variables, you could do:
mapply(function(i, j) df_reprex[[i]] - df_reprex[[j]],
grep("_baseline$", names(df_reprex), value = TRUE),
gsub("_baseline", "", grep("_baseline$", names(df_reprex), value = TRUE)),
SIMPLIFY = FALSE) %>%
setNames(gsub("_baseline", "_corrected",
grep("_baseline$", names(df_reprex), value = TRUE))) %>%
as.data.frame() %>%
{cbind(df_reprex, .)}
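For completeness, the same subtraction can also be written as a tidyverse sketch (this assumes dplyr >= 1.0 for across() and that each pair of columns differs only by the _baseline suffix, as in df_reprex):
library(dplyr)
df_new <- df_reprex %>%
  mutate(across(ends_with("_baseline"),
                ~ get(sub("_baseline$", "", cur_column())) - .x,   # look up the matching non-baseline column
                .names = "{.col}_change")) %>%
  rename_with(~ sub("_baseline_change$", "_change", .x), ends_with("_baseline_change"))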
The datasets package and various other packages come with a fair number of useful datasets; however, there seems to be no easy way to find the perfect dataset when you need one for your package examples, for teaching purposes, or to ask/answer a question here on SO.
Say for instance I want a dataset that is a data.frame, has at least 2 character columns, and is less than 100 rows long.
How can I explore EVERY dataset available and see a maximum of relevant information to make my choice ?
My past attempts were messy, took time, and crashed with some packages which have an unusual object structure, like caret.
I've packaged a solution in a one-function GitHub package.
I'm copying the whole code at the bottom, but the simplest way is:
remotes::install_github("moodymudskipper/datasearch")
library(datasearch)
All data sets from package "dplyr"
dplyr_all <-
datasearch("dplyr")
View(dplyr_all)
Datasets from package "datasets" restricted by condition
datasets_ncol5 <-
datasearch("datasets", filter = ~is.data.frame(.) && ncol(.) == 5)
View(datasets_ncol5)
All datasets from all installed packages, no restriction
# might take more or less time, depends what you have installed
all_datasets <- datasearch()
View(all_datasets)
# subsetting the output
my_subset <- subset(
all_datasets,
class1 == "data.frame" &
grepl("treatment", names_collapsed) &
nrow < 100
)
View(my_subset)
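The question's own criteria (a data.frame with at least 2 character columns and fewer than 100 rows) can also be expressed directly as a filter, using the formula interface shown above (a sketch; scanning every installed package may take a while):
small_char_dfs <- datasearch(
  filter = ~ is.data.frame(.) && nrow(.) < 100 && sum(sapply(., is.character)) >= 2)
View(small_char_dfs)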
datasearch <- function(pkgs = NULL, filter = NULL){
# make function silent
w <- options()$warn
options(warn = -1)
search_ <- search()
file_ <- tempfile()
file_ <- file(file_, "w")
on.exit({
options(warn = w)
to_detach <- setdiff(search(), search_)
for(pkg in to_detach) eval(bquote(detach(.(pkg))))
# note : we still have loaded namespaces, we could unload those that we didn't
# have in the beginning but I'm worried about surprising effects, I think
# the S3 method tables should be cleaned too, and maybe other things
# note2 : tracing library and require didn't work
})
# convert formula to function
if(inherits(filter, "formula")) {
filter <- as.function(c(alist(.=), filter[[length(filter)]]))
}
## by default fetch all available packages in .libPaths()
if(is.null(pkgs)) pkgs <- .packages(all.available = TRUE)
## fetch all data sets description
df <- as.data.frame(data(package = pkgs, verbose = FALSE)$results)
names(df) <- tolower(names(df))
item <- NULL # for cmd check note
df <- transform(
df,
data_name = sub('.*\\((.*)\\)', '\\1', item),
dataset = sub(' \\(.*', '', item),
libpath = NULL,
item = NULL
)
df <- df[order(df$package, df$data_name),]
pkg_data_names <- aggregate(dataset ~ package + data_name, df, c)
pkg_data_names <- pkg_data_names[order(pkg_data_names$package, pkg_data_names$data_name),]
env <- new.env()
n <- nrow(pkg_data_names)
pb <- progress::progress_bar$new(
format = "[:bar] :percent :pkg",
total = n)
row_dfs <- vector("list", n)
for(i in seq(nrow(pkg_data_names))) {
pkg <- pkg_data_names$package[i]
data_name <- pkg_data_names$data_name[i]
datasets <- pkg_data_names$dataset[[i]]
pb$tick(tokens = list(pkg = format(pkg, width = 12)))
sink(file_, type = "message")
data(list=data_name, package = pkg, envir = env)
row_dfs_i <- lapply(datasets, function(dataset) {
dat <- get(dataset, envir = env)
if(!is.null(filter) && !filter(dat)) return(NULL)
cl <- class(dat)
nms <- names(dat)
nc <- ncol(dat)
if (is.null(nc)) nc <- NA
nr <- nrow(dat)
if (is.null(nr)) nr <- NA
out <- data.frame(
package = pkg,
data_name = data_name,
dataset = dataset,
class = I(list(cl)),
class1 = cl[1],
type = typeof(dat),
names = I(list(nms)),
names_collapsed = paste(nms, collapse = "/"),
nrow = nr,
ncol = nc,
length = length(dat))
if("data.frame" %in% cl) {
classes <- lapply(dat, class)
cl_flat <- unlist(classes)
out <- transform(
out,
classes = I(list(classes)),
types = I(list(vapply(dat, typeof, character(1)))),
logical = sum(cl_flat == 'logical'),
integer = sum(cl_flat == 'integer'),
numeric = sum(cl_flat == 'numeric'),
complex = sum(cl_flat == 'complex'),
character = sum(cl_flat == 'character'),
raw = sum(cl_flat == 'raw'),
list = sum(cl_flat == 'list'),
data.frame = sum(cl_flat == 'data.frame'),
factor = sum(cl_flat == 'factor'),
ordered = sum(cl_flat == 'ordered'),
Date = sum(cl_flat == 'Date'),
POSIXt = sum(cl_flat == 'POSIXt'),
POSIXct = sum(cl_flat == 'POSIXct'),
POSIXlt = sum(cl_flat == 'POSIXlt'))
} else {
out <- transform(
out,
nrow = NA,
ncol = NA,
classes = NA,
types = NA,
logical = NA,
integer = NA,
numeric = NA,
complex = NA,
character = NA,
raw = NA,
list = NA,
data.frame = NA,
factor = NA,
ordered = NA,
Date = NA,
POSIXt = NA,
POSIXct = NA,
POSIXlt = NA)
}
if(is.matrix(dat)) {
out$names <- list(colnames(dat))
out$names_collapsed = paste(out$names, collapse = "/")
}
out
})
row_dfs_i <- do.call(rbind, row_dfs_i)
if(!is.null(row_dfs_i)) row_dfs[[i]] <- row_dfs_i
sink(type = "message")
}
df2 <- do.call(rbind, row_dfs)
df <- merge(df, df2)
df
}
Extend/modify to your liking.
library(data.table)
dt = as.data.table(data(package = .packages(all.available = TRUE))$results)
dt = dt[, `:=`(Item = sub(' \\(.*', '', Item),
Object = sub('.*\\((.*)\\)', '\\1', Item))]
dt[, {
data(list = Object, package = Package)
d = eval(parse(text = Item))
classes = if (sum(class(d) %in% c('data.frame')) > 0) unlist(lapply(d, class))
else NA_integer_
.(class = paste(class(d), collapse = ","),
nrow = if (!is.null(nrow(d))) nrow(d) else NA_integer_,
ncol = if (!is.null(ncol(d))) ncol(d) else NA_integer_,
charCols = sum(classes == 'character'),
facCols = sum(classes == 'factor'))
}
, by = .(Package, Item)]
# Package Item class nrow ncol charCols facCols
# 1: datasets AirPassengers ts NA NA NA NA
# 2: datasets BJsales ts NA NA NA NA
# 3: datasets BJsales.lead ts NA NA NA NA
# 4: datasets BOD data.frame 6 2 0 0
# 5: datasets CO2 nfnGroupedData,nfGroupedData,groupedData,data.frame 84 5 0 3
# ---
#492: survival transplant data.frame 815 6 0 3
#493: survival uspop2 array 101 2 NA NA
#494: survival veteran data.frame 137 8 0 1
#495: viridis viridis.map data.frame 1024 4 1 0
#496: xtable tli data.frame 100 5 0 3
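If the summary above is assigned to a variable first (say res, a name chosen here only for illustration), the question's criteria can then be filtered with ordinary data.table syntax:
# res <- dt[, { ... }, by = .(Package, Item)]   # i.e. the expression above
res[grepl("data.frame", class) & charCols >= 2 & nrow < 100]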
In the datasets package there is no dataset of class data.frame that fulfills your conditions; more exactly, if they are of class data.frame and have at most 100 rows, then none of them has two or more columns of class character. I found that out with a first version of the following code.
library(datasets)
res <- library(help = "datasets")
dat <- unlist(lapply(strsplit(res$info[[2]], " "), '[[', 1))
dat <- dat[dat != ""]
df_names <- NULL
for(i in seq_along(dat)){
d <- tryCatch(get(dat[i]), error = function(e) e)
if(inherits(d, "data.frame")){
if(nrow(d) <= 100){
char <- sum(sapply(d, is.character))
fact <- sum(sapply(d, is.factor))
if(char >= 2 || fact >= 2){
print(dat[i])
df_names <- c(df_names, dat[i])
}
}
}
}
df_names
[1] "CO2" "esoph" "npk" "sleep" "warpbreaks"
So I had to include extra instructions to handle columns of class factor: by default, data frames are created with stringsAsFactors = TRUE. If you can make do with those, there you have it; their names are in the vector df_names. To make one available in the global environment, just get the one you want.
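For example (a minimal sketch), to pull them into the global environment:
data(list = df_names)    # load all five at once
esoph_df <- get("esoph") # or grab a single one by name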
The table returned by myfun() can be filtered with appropriate conditions, and the columns of a dataset can be identified by their classes given in the classes column.
The problem with the caret package is that it does not have any data frame or matrix objects in it. The datasets may be present inside list objects in caret; I am not sure about that, as some list objects in the caret package contain lists of functions.
Also, if interested, you can make the myfun() function more specific, returning information about data frame or matrix objects only.
myfun <- function( package )
{
t( sapply( ls( paste0( 'package:', package ) ), function(x){
y <- eval(parse(text = paste0( package, "::`", x, "`")))
data.frame( data_class = paste0(class(y), collapse = ","),
nrow = ifelse( any(class(y) %in% c( "data.frame", "matrix" ) ),
nrow(y),
NA_integer_ ),
ncol = ifelse( any(class(y) %in% c( "data.frame", "matrix" ) ),
ncol(y),
NA_integer_),
classes = ifelse( any(class(y) %in% c( "data.frame", "matrix" ) ),
paste0( unlist(lapply(y, class)), collapse = "," ),
NA),
stringsAsFactors = FALSE )
} ) )
}
library( datasets )
meta_data <- myfun( package = "datasets")
head(meta_data)
# data_class nrow ncol classes
# ability.cov "list" NA NA NA
# airmiles "ts" NA NA NA
# AirPassengers "ts" NA NA NA
# airquality "data.frame" 153 6 "integer,integer,numeric,integer,integer,integer"
# anscombe "data.frame" 11 8 "numeric,numeric,numeric,numeric,numeric,numeric,numeric,numeric"
# attenu "data.frame" 182 5 "numeric,numeric,factor,numeric,numeric"
meta_data[ "ChickWeight", ]
# $data_class
# [1] "nfnGroupedData,nfGroupedData,groupedData,data.frame"
#
# $nrow
# [1] 578
#
# $ncol
# [1] 4
#
# $classes
# [1] "numeric,numeric,ordered,factor,factor"
library( 'caret' )
meta_data <- myfun( package = "caret")
# data_class nrow ncol classes
# anovaScores "function" NA NA NA
# avNNet "function" NA NA NA
# bag "function" NA NA NA
# bagControl "function" NA NA NA
# bagEarth "function" NA NA NA
# bagEarthStats "function" NA NA NA
If the loaded packages need to be unloaded after applying the myfun() function on a package, try this:
loaded_pkgs <- search()
library( 'caret' )
meta_data <- myfun( package = "caret")
unload_pkgs <- setdiff( search(), loaded_pkgs )
for( i in unload_pkgs ) {
detach( pos = which( search() %in% i ) )
}
Edit: Building off of aL3xa's answer below, I've modified his syntax. Not perfect, but getting closer. I still haven't found a way to make xtable accept \multicolumn{} arguments for columns or rows. It also appears that Hmisc handles some of these types of tasks behind the scenes, but it looks like a bit of an undertaking to understand what's going on there. Does anyone have experience with the latex function in Hmisc?
ctab <- function(tab, dec = 2, margin = NULL) {
tab <- as.table(tab)
ptab <- paste(round(prop.table(tab, margin = margin) * 100, dec), "%", sep = "")
res <- matrix(NA, nrow = nrow(tab) , ncol = ncol(tab) * 2, byrow = TRUE)
oddc <- 1:ncol(tab) %% 2 == 1
evenc <- 1:ncol(tab) %% 2 == 0
res[,oddc ] <- tab
res[,evenc ] <- ptab
res <- as.table(res)
colnames(res) <- rep(colnames(tab), each = 2)
rownames(res) <- rownames(tab)
return(res)
}
I would like to create a table formatted for LaTeX output that contains both the counts and percentages for each column or variable. I have not found a ready-made solution to this problem, but feel I must be reinventing the wheel to some extent.
I have developed a solution for straight tabulations, but am struggling with adapting something for a cross tabulation.
First some sample data:
#Generate sample data
dow <- sample(1:7, 100, replace=TRUE)
purp <- sample(1:4, 100, replace=TRUE)
dow <- factor(dow, 1:7, c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"))
purp <- factor(purp, 1:4, c("Business", "Commute", "Vacation", "Other"))
And now the working straight tab function:
customTable <- function(var, capt = NULL){
counts <- table(var)
percs <- 100 * prop.table(counts)
print(
xtable(
cbind(
Count = counts
, Percent = percs
)
, caption = capt
, digits = c(0,0,2)
)
, caption.placement="top"
)
}
#Usage
customTable(dow, capt="Day of Week")
customTable(purp, capt="Trip Pupose")
Does anyone have any suggestions for adapting this for cross tabulations (i.e. day of week BY trip purpose)? Here is what I've currently written, which does NOT use the xtable library and ALMOST works, but is not dynamic and is quite ugly to work with:
#Create table and percentages
a <- table(dow, purp)
b <- round(prop.table(a, 1),2)
#Column bind all of the counts & percentages together, this SHOULD become dynamic in future
d <- cbind( cbind(Count = a[,1],Percent = b[,1])
, cbind(Count = a[,2], Percent = b[,2])
, cbind(Count = a[,3], Percent = b[,3])
, cbind(Count = a[,4], Percent = b[,4])
)
#Ugly function that needs help, or scrapped for something else
crossTab <- function(title){
cat("\\begin{table}[ht]\n")
cat("\\begin{center}\n")
cat("\\caption{", title, "}\n", sep="")
cat("\\begin{tabular}{rllllllll}\n")
cat("\\hline\n")
cat("", cat("", paste("&\\multicolumn{2}{c}{",colnames(a), "}"), sep = ""), "\\\\\n", sep="")
c("&", cat("", colnames(d), "\\\\\n", sep=" & "))
cat("\\hline\n")
c("&", write.table(d, sep = " & ", eol="\\\\\n", quote=FALSE, col.names=FALSE))
cat("\\hline\n")
cat("\\end{tabular}\n")
cat("\\end{center}\n")
cat("\\end{table}\n")
}
crossTab(title = "Day of week BY Trip Purpose")
In the tables package it is one line:
# data:
dow <- sample(1:7, 100, replace=TRUE)
purp <- sample(1:4, 100, replace=TRUE)
dow <- factor(dow, 1:7, c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"))
purp <- factor(purp, 1:4, c("Business", "Commute", "Vacation", "Other"))
dataframe <- data.frame( dow, purp)
# The packages
library(tables)
library(Hmisc)
# The table
tabular( (Weekday=dow) ~ (Purpose=purp)*(Percent("row")+ 1) ,data=dataframe )
# The latex table
latex( tabular( (Weekday=dow) ~ (Purpose=purp)*(Percent("col")+ 1) ,data=dataframe ))
Using booktabs, you get this (can be further customised):
Great question, this one's been bothering me for a while (it's not that hard, it's just me being lazy as hell... as usual). However... though the question's great, your approach, I'm afraid, isn't. There's a priceless package called xtable that you can (mis)use. Besides, this issue is too common - there's a great chance that there's already some ready-made solution sitting somewhere on the Internets.
One of these days I'm going to work it out once and for all (I'll post the code on GitHub). The main idea goes a little bit like this: would you like frequency and/or percentage values within one cell (separated by \), or rows with absolute and relative frequencies (or %) in succession? I'd go with the 2nd one, so I'll post a "first-aid" solution for now:
ctab <- function(tab, dec = 2, ...) {
tab <- as.table(tab)
ptab <- paste(round(prop.table(tab) * 100, dec), "%", sep = "")
res <- matrix(NA, nrow = nrow(tab) * 2, ncol = ncol(tab), byrow = TRUE)
oddr <- 1:nrow(tab) %% 2 == 1
evenr <- 1:nrow(tab) %% 2 == 0
res[oddr, ] <- tab
res[evenr, ] <- ptab
res <- as.table(res)
colnames(res) <- colnames(tab)
rownames(res) <- rep(rownames(tab), each = 2)
return(res)
}
Now try something like:
data(HairEyeColor) # load an appropriate dataset
tb <- HairEyeColor[, , 1] # choose only male respondents
ctab(tb)
Brown Blue Hazel Green
Black 32 11 10 3
Black 11.47% 3.94% 3.58% 1.08%
Brown 53 50 25 15
Brown 19% 17.92% 8.96% 5.38%
Red 10 10 7 7
Red 3.58% 3.58% 2.51% 2.51%
Blond 3 30 5 8
Blond 1.08% 10.75% 1.79% 2.87%
Make sure you have loaded the xtable package and use print (it's a generic function, so you must pass an xtable-classed object). It's important that you suppress the row names. I'll optimize this one tomorrow - it should be xtable compatible. It's 3AM in my time zone, so with these lines I'll end my answer:
print(xtable(ctab(tb)), include.rownames = FALSE)
Cheers!
I wasn't able to figure out how to generate a multi-column header using xtable, but I did realize that I could concatenate my counts & percentages into the same column for printing purposes. Not ideal, but it seems to get the job done. Here's the function I've written:
ctab3 <- function(row, col, margin = 1, dec = 2, percs = FALSE, total = FALSE, tex = FALSE, caption = NULL){
tab <- as.table(table(row,col))
ptab <- signif(prop.table(tab, margin = margin), dec)
if (percs){
z <- matrix(NA, nrow = nrow(tab), ncol = ncol(tab), byrow = TRUE)
for (i in 1:ncol(tab)) z[,i] <- paste(tab[,i], ptab[,i], sep = " ")
rownames(z) <- rownames(tab)
colnames(z) <- colnames(tab)
if (margin == 1 & total){
rowTot <- paste(apply(tab, 1, sum), apply(ptab, 1, sum), sep = " ")
z <- cbind(z, Total = rowTot)
} else if (margin == 2 & total) {
colTot <- paste(apply(tab, 2, sum), apply(ptab, 2, sum), sep = " ")
z <- rbind(z,Total = colTot)
}
} else {
z <- table(row, col)
}
ifelse(tex, return(xtable(z, caption)), return(z))
}
Probably not the final product, but it does allow for some flexibility in the parameters. At the most basic level, it is only a wrapper around table(), but it can also generate LaTeX-formatted output. Here is what I ended up using in a Sweave document:
<<echo = FALSE>>=
for (i in 1:ncol(df)){
print(ctab3(
col = df[,1]
, row = df[,i]
, margin = 2
, total = TRUE
, tex = TRUE
, caption = paste("Dow by", colnames(df[i]), sep = " ")
))
}
@
Using multicolumn with latex from the Hmisc package isn't too bad. This minimal Sweave document:
\documentclass{article}
\begin{document}
<<echo = FALSE,results = tex>>=
library(Hmisc)
dow <- sample(1:7, 100, replace=TRUE)
purp <- sample(1:4, 100, replace=TRUE)
dow <- factor(dow, 1:7, c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"))
purp <- factor(purp, 1:4, c("Business", "Commute", "Vacation", "Other"))
tbl <- table(dow,purp)
tbl_prop <- round(100 * prop.table(tbl,1),2)
tbl_df <- as.data.frame.matrix(tbl)
tbl_prop_df <- as.data.frame.matrix(tbl_prop)
colnames(tbl_prop_df) <- paste(colnames(tbl_prop_df),"1",sep = "")
df <- cbind(tbl_df,tbl_prop_df)[,ggplot2:::interleave(1:4,5:8)]
colnames(df) <- rep(c('n','\\%'),times = 4)
latex(object=df,file="",cgroup = colnames(tbl_df),
colheads = NULL,rowlabel = "",
center = "centering",collabel.just = rep("r",8))
@
\end{document}
Produces this for me:
Obviously, I've hard-coded a fair bit of stuff, and there could be slicker ways to produce the data frame that you end up passing to latex, but this should at least give a start using multicolumn.
Also, a slight gotcha: I've used ggplot2's interleave function when combining the counts and percentages to alternate the columns. That's just because I'm lazy.
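If you'd rather not reach into ggplot2's internals, the same interleaving can be done in base R (a one-line sketch):
as.vector(rbind(1:4, 5:8))   # 1 5 2 6 3 7 4 8, same order as ggplot2:::interleave(1:4, 5:8)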
How would this work for you?
library(reshape)
library(plyr)
df <- data.frame(dow = dow, purp = purp)
df.count <- count(df)
df.count <- ddply(df.count, .(dow), transform, p = round(freq / sum(freq),2))
df.m <- melt(df.count)
df.print <- cast(df.m, dow ~ purp + variable)
library(xtable)
xtable(df.print)
It doesn't give you nice multicolumns, and I don't have enough experience with xtable to figure out if that's possible. However, if you're going to be writing custom functions, you might try one which operates over the column names of df.print. You might even be able to write one sufficiently general to take all manner of recast data frames as input.
Edit:
Just thought of a good solution to get you closer. After creating df.m
df.preprint <- ddply(df.m, .(dow, purp), function(x){
x <- cast(x, dow ~ variable)
x$value <- paste(x$freq, x$p, sep = " / ")
return(c(value = x$value))
}
)
df.print <- cast(df.preprint, dow ~ purp)
print(xtable(df.print), include.rownames = F)
Now, every cell will contain N / percent values
I realize this thread is a bit old, but the tableNominal() function in the reporttools package may provide the functionality you are looking for.
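A minimal sketch with the question's factors (the argument names here are from memory, so check ?tableNominal for the exact interface):
library(reporttools)
tableNominal(vars = list(dow), group = purp,
             cap = "Day of week by trip purpose", lab = "tab:dow_by_purp")
Alternatively, you can build the combined count (percent) cells by hand and pass them to xtable, as below: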
tab<-table(row, col)
ctab<-round(100*prop.table(tab,2), 2) # for column percents (see the args for prop.table)
for (i in 1:length(tab)) {
ctab[i]<-paste(tab[i]," (", ctab[i], "%", ")", sep="")
}
require(xtable);
k<-xtable(ctab,digits=1) # make latex table