Reading Excel file: How to find the start cell in messy spreadsheets?

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5:
How do I automate the search of Excel cells for my "Monthly return" string using R?
At the moment, the best idea I can come up with is to load everything into R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution.

I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.
For a slightly different flavor of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file), but it works. I like reading the file twice: it makes sure the columns come in with the correct types and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up the columns myself, and the read_excel defaults, if you start at the right row, are pretty good.
Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken.
library(readxl)
library(stringr)
library(dplyr)
f_path <- file.path("whatever.xlsx")
if (!file.exists(f_path)) {
  f_path <- file.choose()
}
# I read this twice, temp_read to figure out where the data actually starts...
# Maybe you need something like this -
# excel_sheets <- readxl::excel_sheets(f_path)
# desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)
skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10
# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
  col_skip <- col_skip + 1
  if (col_skip == max_cols_to_search) break
  skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search, col_skip][[1]], search_string)) - 0
}
# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
  f_path,
  sheet = desired_sheet,
  skip = skip_rows
)
# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
#   select(-(1:col_skip))

Okay, since the format was specified as xls, I've updated this from a csv-based approach to loading the xls file directly, as suggested.
library(readxl)
data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)  # read the header row as data rather than as column names
You would get something similar to:
data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("",
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund",
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("",
"0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
Then you can dynamically filter on the "Monthly return" cell and identify your matrix.
targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
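Since everything comes in as text this way, here is a small follow-up sketch (not part of the original answer; the column names are assumptions) to make the extracted block usable:
names(returns) <- c("Date", "MonthlyReturn")
# values like "0.907%" are stored as text, so strip the percent sign and rescale
returns$MonthlyReturn <- as.numeric(sub("%", "", as.character(returns$MonthlyReturn))) / 100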

With a general-purpose package like readxl, you'll have to read twice if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook, then use lapply() or purrr::map(). This function encapsulates the skip-learning read and the "real" read.
library(readxl)
two_passes <- function(path, sheet = NULL, n_max = 10) {
  first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
  skip <- which(first_pass[[2]] == "Monthly return")
  message("For sheet '", if (is.null(sheet)) 1 else sheet,
          "' we'll skip ", skip, " rows.")
  read_excel(path, sheet = sheet, skip = skip)
}
(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
#>
#> $sheet_two
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
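The purrr variant mentioned above is a drop-in replacement for the lapply() call; extra arguments are passed through to two_passes():
purrr::map(sheets, two_passes, path = "so.xlsx")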

In these cases it's important to know the possible conditions of your data. I'm going to assume that you only want to remove the columns and rows that don't conform to your table.
I have this Excel book:
I added 3 blank columns on the left because when I loaded the file in R with a single blank column, R omitted it. That confirms that R drops empty columns on the left.
First: load the data
library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Second: I added some columns with NA and '' values, in case your data contains some
dat$x2 <- NA
dat$x4 <- NA
head(dat)
MY.COMPANY.PTY.LTD NA. x2 x4
1 MC Pension Fund <NA> NA NA
2 GROSS PERFORMANCE DETAILS <NA> NA NA
3 updated by IG on 20/04/2017 <NA> NA NA
4 <NA> Monthly return NA NA
5 Mar-14 0.0097 NA NA
6 Apr-14 6e-04 NA NA
Third: Remove columns where all values are NA or ''. I have had to deal with that kind of problem in the past
colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
head(dat2)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Fourth: Keep only rows with complete observations (which is what I suppose you want, judging from your example)
rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)
MY.COMPANY.PTY.LTD NA.
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Finally, if you want to keep the header, you can do something like this:
colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
or
colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
dat3
Month Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
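One last sketch, not in the original answer: because read.xlsx brought everything in as factors, you may still want to convert the columns to usable types (assuming the Month / Monthly return names set in the last step):
# convert via character first; going straight from factor to numeric would return the level codes
dat3$`Monthly return` <- as.numeric(as.character(dat3$`Monthly return`))
dat3$Month <- as.character(dat3$Month)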

Here is how I would tackle it.
STEP 1
Read the excel spreadsheet in without the headers.
STEP 2
Find the row index of your string, "Monthly return" in this case
STEP 3
Filter from the identified row (or column or both), prettify a little and done.
Here is what a sample function looks like. It works for your example no matter where it is in the spreadsheet. You can play around with regex to make it more robust.
Function Definition:
library(xlsx)
extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1){
  filepath = paste(path, "/", filename, sep = "")
  input = read.xlsx(filepath, sheetnum, header = FALSE)
  start_idx = which(input == "Monthly return", arr.ind = TRUE)[1]
  output = input[start_idx:dim(input)[1], ]
  rownames(output) <- NULL
  colnames(output) <- c("Date", "Monthly Return")
  output = output[-1, ]
  return(output)
}
Example:
final_df <- extract_return(
  path = "~/Desktop",
  filename = "Apr2017.xlsx",
  sheetnum = 2)
No matter how many rows or columns you may have, the idea remains the same. Give it a try and let me know.
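If the workbook holds several sheets with the same general layout, the function can be mapped over them; the sheet count below is just a placeholder:
# assumes 3 sheets; adjust to however many your workbook actually has
all_returns <- lapply(1:3, function(s)
  extract_return(path = "~/Desktop", filename = "Apr2017.xlsx", sheetnum = s))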

This is a tidy alternative that avoids the multiple reads issue discussed above. However, when doing benchmarks, Rafael Zayas's answer still wins out.
library("tidyxl")
library("unpivotr")
library("tidyr")
library("dplyr")
tidy_solution <- function() {
  raw <- xlsx_cells("messyExcel.xlsx")
  start <- raw %>%
    filter_all(any_vars(. %in% c("Monthly return"))) %>%
    select(row, col)
  month.col <- raw %>%
    filter(row >= start$row + 1, col == start$col - 1) %>%
    pivot_wider(date, col)
  return.col <- raw %>%
    filter(row >= start$row + 1, col == start$col) %>%
    pivot_wider(numeric, col)
  output <- cbind(month.col, return.col)
}
# My Solution
expr min lq mean median uq max neval
tidy_solution() 29.0372 30.40305 32.13793 31.36925 32.9812 56.6455 100
# Rafael's
expr min lq mean median uq max neval
original_solution() 21.4405 23.8009 25.86874 25.10865 26.99945 59.4128 100

grep("2014",dat)[1]
This gives you the first column containing a year. Or use "-14" or whatever you have for years.
In a similar way, grep("Monthly", dat)[1] gives you the column with "Monthly return".
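Putting the two indices together might look like the sketch below (the object names are assumptions, and it presumes dat was read in without headers):
date_col   <- grep("-14", dat)[1]                      # first column containing the dates
return_col <- grep("Monthly", dat)[1]                  # column holding "Monthly return"
start_row  <- which(dat[[return_col]] == "Monthly return") + 1
returns    <- dat[start_row:nrow(dat), c(date_col, return_col)]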

Related

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this, on how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit=0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "logreg"
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
  add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
  c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
                    idvar = 'id',
                    timevar = "measurement_number",
                    v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in three example traceplots (obtainable with plot(imp); the figure is not reproduced here), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
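For reference, the convergence checks were along these lines (densityplot() is an optional extra, not mentioned in the original text):
plot(imp)          # traceplots of chain means and standard deviations per imputed variable
densityplot(imp)   # compare the distributions of observed vs. imputed values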
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
                   reshape, direction = 'long',
                   varying = as.list(v_names),
                   v.names = names(v_names),
                   idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
                     varying = as.list(v_names),
                     v.names = names(v_names),
                     idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
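As a purely illustrative next step (not part of the original answer, and ignoring the repeated-measures structure for brevity), the resulting mids object can be analysed and pooled in the usual mice way:
# fit the same model on each imputed dataset and pool the estimates with Rubin's rules
fit <- with(imp, lm(serBilir ~ alkaline + years))
summary(pool(fit))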

R calculate a new column for multiple dataframes with a map-function or a loop

I have a problem with creating a new column for multiple dataframes with a map function or a for-loop. I have 25 dataframes with cryptocurrency time series data:
ls(pattern="USD")
[1] "ADA.USD" "BCH.USD" "BNB.USD" "BTC.USD" "BTG.USD" "DASH.USD" "DOGE.USD" "EOS.USD" "ETC.USD" "ETH.USD" "IOT.USD"
[12] "LINK.USD" "LTC.USD" "NEO.USD" "OMG.USD" "QTUM.USD" "TRX.USD" "USDT.USD" "WAVES.USD" "XEM.USD" "XLM.USD" "XMR.USD"
[23] "XRP.USD" "ZEC.USD" "ZRX.USD"
Every object is a dataframe which stands for a cryptocurrency expressed in USD. And every dataframe has 2 columns: Date and Close (Closing price). For example: the dataframe "BTC.USD" stands for Bitcoin in USD:
head(BTC.USD)
# A tibble: 6 x 2
Date Close
1 2015-12-31 430.
2 2016-01-01 434.
3 2016-01-02 434.
4 2016-01-03 431.
5 2016-01-04 433.
Now I want to add a third column, which represents the daily return.
require(quantmod)
BTC.USD <- BTC.USD%>%mutate(Return= Delt(Close)*100)
For a single object (in this case Bitcoin [BTC.USD]) this code works as intended:
> head(BTC.USD)
# A tibble: 6 x 3
Date Close Return[,"Delt.1.arithmetic"]
<date> <dbl> <dbl>
1 2015-12-31 430. NA
2 2016-01-01 434. 0.940
3 2016-01-02 434. -0.0622
4 2016-01-03 431. -0.696
5 2016-01-04 433. 0.608
6 2016-01-05 431. -0.489
Now I want to calculate the return for all 25 dataframes (or cryptocurrencies) with a map-function or a for-loop, but my code doesn't work:
temp = ls(pattern=".USD")
map(.x= temp,.f = mutate(Return= Delt(Close)*100))
Error in is.data.frame(.data) || is.list(.data) || is.environment(.data) :
argument ".data" is missing, with no default
for (i in seq_along(temp)) {mutate(Return= Delt(Close)*100)}
Error in is.data.frame(.data) || is.list(.data) || is.environment(.data) :
argument ".data" is missing, with no default
Can someone help me?
First, we need to actually get the data as a list (each data.frame will get its own entry in the list). Then, we can use any of our favorite list iterating functions to get the desired result.
temp_data <- lapply(ls(pattern = "USD"), get) # get data into a list
temp_data2 <- lapply(temp_data, function(x) mutate(x, Return = Delt(Close)*100))
As #akrun noted, there is a more compact way to do this:
lapply(mget(ls(pattern = "USD")), transform, Return = Delt(Close) * 100)
If you want to stick with tidyverse verbs, that would be:
lapply(mget(ls(pattern = "USD")), function(x) x %>% mutate(Return = Delt(Close) * 100))
I could get the code to work using your sample data:
sdat1 <- structure(list(
  Date = c("2015-12-31", "2016-01-01",
           "2016-01-02", "2016-01-03", "2016-01-04"),
  Close = c(430, 434, 434, 431, 433)),
  class = "data.frame",
  row.names = c("1", "2", "3", "4", "5"))
sdat4 <- sdat3 <- sdat2 <- sdat1
lapply(mget(ls(pattern = 'sdat')),
       FUN = function(x) x %>% mutate(Return = Delt(Close)))

Equivalent of SAS format (in R)

Suppose I have a dataframe:
sick <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")
code <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
sick_code <- data.frame(sick, code)
And another:
pid <- abs(round(rnorm(6)*1000,0))
sick <- c("-" , "-", "-", "-", "daa16", "SO")
p_sick <- data.frame(pid, sick)
Now I would like to add a new variable to p_sick that "translates" p_sick$sick to sick_code$code. The value in p_sick$sick is a string which may or may not appear in sick_code$sick; if it does not, NA should be returned.
Now, I could write a for loop with a simple ifelse statement. But the data I have is 150 million rows long, and the translation table is 15,000 rows long.
I have googled that this is the equivalent of a "proc format" in SAS (but I do not have access to SAS, nor do I have any idea how it works).
Perhaps some variant of merge in plyr, or an apply function?
EDIT: I have accepted both answers, since they work.
I will try and look into the difference (in speed) between the two. Since merge is a built-in function, I am guessing it does lots of checking.
EDIT2: To people getting here via Google: merge has a sort = FALSE argument, which will speed things up. Note that the order is not preserved in any way.
data.table will be suitable in your example:
library(data.table)
setkey(setDT(p_sick),sick)
p_sick[setDT(sick_code),code := i.code][]
pid sick code
1: 3137 - NA
2: 755 - NA
3: 1327 - NA
4: 929 - NA
5: 939 daa16 cancer
6: 906 SO NA
Please see here for a detailed explanation.
You could use merge with all.x = TRUE (to keep values from p_sick with no match in sick_code):
merge(p_sick, sick_code, all.x = TRUE)
An equivalent is using left_join from dplyr:
library(dplyr)
left_join(p_sick, sick_code)
# pid sick code
# 1 212 - <NA>
# 2 2366 - <NA>
# 3 325 - <NA>
# 4 269 - <NA>
# 5 501 daa16 cancer
# 6 1352 SO <NA>
Note that each of these solutions works only because the name sick is shared between the two data frames. Suppose they had different names, say the column was called sickness in sick_code. You could accommodate this with, respectively:
merge(p_sick, sick_code, by.x = "sick", by.y = "sickness", all.x = TRUE)
# or
left_join(p_sick, sick_code, c(sick = "sickness"))
A simple named vector will also work. The named vector can act as a lookup, so instead of defining sick and code as a data frame, define them as a named vector and use it as a decode, like this:
# Set up named vector
sick_decode <- c("heart", "heart", "lung", "lung", "cancer", "cancer")
names(sick_decode) <- c("daa12", "daa13", "daa14", "daa15", "daa16", "daa17")
# Prepare data
pid <- abs(round(rnorm(6)*1000,0))
sick <- c("-" , "-", "-", "-", "daa16", "SO")
p_sick <- data.frame(pid, sick)
# Create new variable using decode
p_sick$sick_decode <- sick_decode[p_sick$sick]
# Results
#> pid sick sick_decode
#> 1 511 - <NA>
#> 2 1619 - <NA>
#> 3 394 - <NA>
#> 4 641 - <NA>
#> 5 53 daa16 cancer
#> 6 244 SO <NA>
I suspect this method will also be fast, but have not benchmarked it.
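For what it's worth, a quick comparison could be run along these lines (a sketch only, not run here; the merge call includes sort = FALSE as noted in the question's edit):
library(microbenchmark)
microbenchmark(
  lookup = sick_decode[p_sick$sick],
  merge  = merge(p_sick, sick_code, all.x = TRUE, sort = FALSE),
  times  = 100
)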
Also, there is now an R package specifically for replicating SAS format functionality in R. It is called fmtr.

Reformatting an excel sheet in R

I have an Excel file that has multiple sheets. Each sheet looks like this, with some excess data at the bottom:
A B C D....
1 time USA USA USA
2 MD CA PX
3 pork peas nuts
4 jan-11 4 2 2
5 feb-11 4 9 3
6 mar-11 8 8 3
.
.
workbook1|workbook2.....
The file is 11 MB, but when I try to use
sheet<-readWorksheetFromFile("excelfile.xlsx", sheet = 1)
I get
Error: OutOfMemoryError (Java): Java heap space
For each worksheet the data takes up a different number of rows and columns. I want to write something that produces this for each sheet.
I am trying to convert each column into
country state product unit time
USA MD pork 3 jan-11
USA MD pork 3 feb-11
USA MD pork 3 mar-11
...
..
.
Is there any way to do this in R?
If your spreadsheet is full of formulas, you might need to convert those to values to get them to be read in easily. Otherwise, I would suggest using a tool like this one (among others out there) to convert all the sheets in a workbook to CSV files and work from there.
If you've gotten that far, here's something that can be tried for the "reshaping" part of your question. Here, we'll assume that "A" actually represents a CSV file, the contents of which are the six lines shown as sample data in your question:
## Create some sample data
A <- tempfile()
writeLines(sep="\n", con = A,
text = c("time, USA, USA, USA",
", MD, CA, PX",
", pork, peas, nuts",
"jan-11, 4, 2, 2",
"feb-11, 4, 9, 3",
"mar-11, 8, 8, 3"))
The first thing I would do is read in the headers and the data separately. To read the headers separately, use nrows to specify the number of rows that contain the header information. To read the data separately, specify skip to skip the header rows.
B <- read.csv(A, header = FALSE, skip = 3, strip.white = TRUE)
Bnames <- read.csv(A, header = FALSE, nrows = 3, strip.white = TRUE)
Use apply to paste the header rows together to form the names for the resulting data.frame:
names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
B
# time USA_MD_pork USA_CA_peas USA_PX_nuts
# 1 jan-11 4 2 2
# 2 feb-11 4 9 3
# 3 mar-11 8 8 3
Now comes the part of converting the data from a "wide" to a "long" format. There are many ways to do this, some using base R too, but the most direct is to use melt and colsplit from the "reshape2" package:
library(reshape2)
BL <- melt(B, id.vars="time")
cbind(BL[c("time", "value")],
colsplit(BL$variable, "_",
c("country", "state", "product")))
# time value country state product
# 1 jan-11 4 USA MD pork
# 2 feb-11 4 USA MD pork
# 3 mar-11 8 USA MD pork
# 4 jan-11 2 USA CA peas
# 5 feb-11 9 USA CA peas
# 6 mar-11 8 USA CA peas
# 7 jan-11 2 USA PX nuts
# 8 feb-11 3 USA PX nuts
# 9 mar-11 3 USA PX nuts
Unfortunately, XLConnect is unlikely to work in your application. I can confirm that on a system with 8GB RAM, running Win 7 64bit and 64bit R 3.0.2, XLConnect fails with a 22MB .xlsx file, with the same error that you are getting. As #Ista pointed out, and as explained here, after restarting R and before doing anything else:
options(java.parameters = "-Xmx4096m")
library(XLConnect)
wb <- loadWorkbook("myWorkBook.xlsx")
sheet <- readWorksheet(wb,"Data")
avoids the error. However, the import still takes more than an hour(!!).
In contrast, as #Gaffi pointed out, once the sheet "Data" is saved to a csv file (~7MB), it can be imported as follows:
library(data.table)
system.time(sheet <- fread("Data.csv"))
user system elapsed
0.84 0.00 0.86
in less than 1 second. In my test case sheet has 6 columns and ~376,000 rows.
Sorry about this "second answer", but you really had two questions... #Ananda's solution for reshaping your data is extremely elegant. This is just another way to think about it.
If you transpose the input matrix you get a new matrix, where the first column is country, the second column is city, the third column is "type" (for lack of a better term), and the actual data is in the other columns (so, there is one additional column for every "time").
So a different approach is to transpose first and then melt the new matrix. This avoids creating all the concatenated column names and splitting them back later. The problem is that melt.data.frame is exceptionally inefficient with a very large number of columns (which you would have here). So doing it this way would be 10X slower than #Ananda's approach.
A solution is to use melt.array (just call melt(...) with an array rather than a data frame). As shown below, this approach is ~20X faster with larger datasets (yours was 11MB).
library(reshape) # for melt(...)
library(microbenchmark) # for microbenchmark(...)
# this is just to model your situation with more realistic size
# create a large data frame (250 columns of country, city, type; 1000 rows of time)
df <- rep(c("USA","UK","FR","CHN","GER"),each=50) # time + 250 columns
df <- rbind(df,rep(c(c("NY","SF","CHI","BOS","LA")),each=10))
df <- rbind(df,rep(c("pork","peas","nuts","fruit","other")))
df <- rbind(df,matrix(sample(1:1000,250*1000,replace=T),ncol=250))
df <- cbind(c("time","","",
as.character(as.Date(1:1000,origin="2010-01-01"))),df)
df <- data.frame(df) # big warning here about duplicated row names; not important
# #Ananda'a approach:
transform.orig <- function(df){
  B <- df[-(1:3), ]
  Bnames <- df[1:3, ]
  names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
  BL <- melt(B, id.vars = "time")
  final <- cbind(BL[c("time", "value")],
                 colsplit(BL$variable, "_",
                          c("country", "state", "product")))
  return(final)
}
# transpose approach:
transform.new <- function(df) {
  zz <- t(df)
  times <- t(zz[1, 4:ncol(zz)])
  colnames(zz) <- c("country", "city", "type", times)
  data <- melt(zz[-1, -(1:3)], varnames = c("id", "time"))
  final <- cbind(country = rep(zz[-1, 1], each = ncol(zz) - 3),
                 city    = rep(zz[-1, 2], each = ncol(zz) - 3),
                 type    = rep(zz[-1, 3], each = ncol(zz) - 3),
                 data[, -1])
  return(final)
}
# benchmark
microbenchmark(transform.orig(df),transform.new(df), times=5, unit="s")
Unit: seconds
expr min lq median uq max neval
transform.orig(df) 9.2511679 9.6986330 9.889457 10.1518191 10.3354328 5
transform.new(df) 0.4383197 0.4724145 0.474212 0.5815531 0.6886383 5
For reading the data from Excel, try the openxlsx package. It uses C++ instead of Java and handles larger Excel files better.
To reshape your data, look at the tidyr package. The gather function could help you out.
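A rough sketch of how those two pieces could fit together for the layout shown in the question (the file name is a placeholder, and the three header rows are pasted into one name before reshaping):
library(openxlsx)
library(tidyr)
raw <- read.xlsx("excelfile.xlsx", sheet = 1, colNames = FALSE)  # header spans three rows, so read without names
hdr <- raw[1:3, -1]                                              # country / state / product rows
dat <- raw[-(1:3), ]
names(dat) <- c("time", apply(hdr, 2, paste, collapse = "_"))
long <- gather(dat, key = "series", value = "unit", -time)       # wide -> long
long <- separate(long, series, into = c("country", "state", "product"), sep = "_")
long$unit <- as.numeric(long$unit)                               # values come in as text here, so convert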

Read all files in a folder and apply a function to each data frame

I am doing a relatively simple piece of analysis, which I have put into a function, on all the files in a particular folder. I was wondering whether anyone had any tips to help me automate the process on a number of different folders.
Firstly, I was wondering whether there was a way of reading all the files in a particular folder straight into R. I believe the following command will list all the files:
files <- (Sys.glob("*.csv"))
...which I found from Using R to list all files with a specified extension
And then the following code reads all those files into R.
listOfFiles <- lapply(files, function(x) read.table(x, header = FALSE))
…from Manipulating multiple files in R
But the files seem to be read in as one continuous list and not individual files… how can I change the script to open all the csv files in a particular folder as individual dataframes?
Secondly, assuming that I can read all the files in separately, how do I run a function on all these dataframes in one go? For example, I have created four small dataframes so I can illustrate what I want:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)))
Df.2 <- data.frame(A = c(1:6),B = (c(2,3,4,5,1,1)))
Df.3 <- data.frame(A = c(4,6,8,0,1,11),B = (c(7,6,5,9,1,15)))
Df.4 <- data.frame(A = c(4,2,6,8,1,0),B = (c(3,1,9,11,2,16)))
I have also made up an example function:
Summary <- function(dfile){
  SumA    <- sum(dfile$A)
  MinA    <- min(dfile$A)
  MeanA   <- mean(dfile$A)
  MedianA <- median(dfile$A)
  MaxA    <- max(dfile$A)
  SumB    <- sum(dfile$B)
  MinB    <- min(dfile$B)
  MeanB   <- mean(dfile$B)
  MedianB <- median(dfile$B)
  MaxB    <- max(dfile$B)
  Sum    <- c(SumA, SumB)
  Min    <- c(MinA, MinB)
  Mean   <- c(MeanA, MeanB)
  Median <- c(MedianA, MedianB)
  Max    <- c(MaxA, MaxB)
  rm(SumA, SumB, MinA, MinB, MeanA, MeanB, MedianA, MedianB, MaxA, MaxB)
  Label <- c("A", "B")
  dfile_summary <- data.frame(Label, Sum, Min, Mean, Median, Max)
  return(dfile_summary)
}
I would ordinarily use the following command to apply the function to an individual dataframe.
Df1.summary <- Summary(Df.1)
Is there a way to instead apply the function to all the dataframes at once, and to use the titles of the dataframes in the names of the summary tables (i.e. Df1.summary)?
Many thanks,
Katie
On the contrary, I do think working with lists makes it easy to automate such things.
Here is one solution (I stored your four dataframes in folder temp/).
filenames <- list.files("temp", pattern="*.csv", full.names=TRUE)
ldf <- lapply(filenames, read.csv)
res <- lapply(ldf, summary)
names(res) <- substr(filenames, 6, 30)
It is important to store the full path for your files (as I did with full.names), otherwise you have to paste the working directory, e.g.
filenames <- list.files("temp", pattern="*.csv")
paste("temp", filenames, sep="/")
will work too. Note that I used substr to extract the file names while discarding the full path.
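If the directory name ever changes length, basename() avoids hard-coding the substr() positions (a small variation, not from the original answer):
names(res) <- basename(filenames)   # same result, e.g. "df4.csv", without counting characters
# tools::file_path_sans_ext() could additionally strip the ".csv" suffix if preferred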
You can access your summary tables as follows:
> res$`df4.csv`
A B
Min. :0.00 Min. : 1.00
1st Qu.:1.25 1st Qu.: 2.25
Median :3.00 Median : 6.00
Mean :3.50 Mean : 7.00
3rd Qu.:5.50 3rd Qu.:10.50
Max. :8.00 Max. :16.00
If you really want to get individual summary tables, you can extract them afterwards. E.g.,
for (i in 1:length(res))
  assign(paste(paste("df", i, sep = ""), "summary", sep = "."), res[[i]])
Usually I don't use for loops in R, but here is my solution using for loops and two packages: plyr and dostats.
plyr is on CRAN and you can download dostats from https://github.com/halpo/dostats (perhaps using install_github from Hadley's devtools package).
Assuming that I have your first two data.frames (Df.1 and Df.2) in csv files, you can do something like this.
require(plyr)
require(dostats)
files <- list.files(pattern = ".csv")
for (i in seq_along(files)) {
  assign(paste("Df", i, sep = "."), read.csv(files[i]))
  assign(paste(paste("Df", i, sep = ""), "summary", sep = "."),
         ldply(get(paste("Df", i, sep = ".")), dostats, sum, min, mean, median, max))
}
Here is the output
R> Df1.summary
.id sum min mean median max
1 A 34 4 5.6667 5.5 8
2 B 22 1 3.6667 3.0 9
R> Df2.summary
.id sum min mean median max
1 A 21 1 3.5000 3.5 6
2 B 16 1 2.6667 2.5 5
Here is a tidyverse option that might not be the most elegant, but offers some flexibility in terms of what is included in the summary:
library(tidyverse)
dir_path <- '~/path/to/data/directory/'
file_pattern <- 'Df\\.[0-9]\\.csv' # regex pattern to match the file name format
read_dir <- function(dir_path, file_name){
  read_csv(paste0(dir_path, file_name)) %>%
    mutate(file_name = file_name) %>%          # add the file name as a column
    gather(variable, value, A:B) %>%           # convert the data from wide to long
    group_by(file_name, variable) %>%
    summarize(sum = sum(value, na.rm = TRUE),
              min = min(value, na.rm = TRUE),
              mean = mean(value, na.rm = TRUE),
              median = median(value, na.rm = TRUE),
              max = max(value, na.rm = TRUE))
}
df_summary <-
  list.files(dir_path, pattern = file_pattern) %>%
  map_df(~ read_dir(dir_path, .))
df_summary
# A tibble: 8 x 7
# Groups: file_name [?]
file_name variable sum min mean median max
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Df.1.csv A 34 4 5.67 5.5 8
2 Df.1.csv B 22 1 3.67 3 9
3 Df.2.csv A 21 1 3.5 3.5 6
4 Df.2.csv B 16 1 2.67 2.5 5
5 Df.3.csv A 30 0 5 5 11
6 Df.3.csv B 43 1 7.17 6.5 15
7 Df.4.csv A 21 0 3.5 3 8
8 Df.4.csv B 42 1 7 6 16
