Combine columns on factor - r

I have data with factors lang and alg, and I would like to compare, for a selected lang pair, the difference across all alg:
> perf[perf$lang == "java", c("alg", "cpu")]
alg cpu
173 binarytrees 0.196
174 chameneosredux 0.404
175 fannkuchredux 0.648
> perf[perf$lang == "python3", c("alg", "cpu")]
alg cpu
246 binarytrees 0.972
248 fannkuchredux 13.752
249 fasta 1.152
For binarytrees I expect to get 0.196/0.972, but for chameneosredux it is NA, for fannkuchredux it is 0.648/13.752, for fasta it is NA, and so on.
One way is to sort the rows on alg, but I don't understand how to inject rows with NA for the missing factor levels (all levels are available in unique(perf$alg)).
UPDATE: Despite the original question, I think what I really want is to combine the columns of two data frames into a single data frame on the same factor:
binarytrees 0.196 0.972
chameneosredux 0.404 NA
fasta NA 1.152
fannkuchredux 0.648 13.752

What you are looking for is essentially a FULL OUTER JOIN in SQL, which can be done with base::merge using all = TRUE.
Here is a comparable data set to demonstrate with:
Df <- data.frame(
  Lang = rep(LETTERS[1:5], rep(3, 5)),
  Alg = c(replicate(5, sample(letters[1:4], 3))),
  Cpu = rnorm(15),
  stringsAsFactors = FALSE
)
Note that I'm using stringsAsFactors = FALSE. I would suggest you convert your columns to character vectors as well; I don't see any need for using factors here.
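For example, a minimal sketch of that conversion (assuming your real data frame is called perf, as in the question):
# convert the factor columns to plain character vectors
perf$lang <- as.character(perf$lang)
perf$alg <- as.character(perf$alg)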
This is the merge operation in a light wrapper function, just to make the presentation a little cleaner:
compare <- function(x, y, data) {
  merge(x = data[data$Lang == x[1], 2:3],
        y = data[data$Lang == y[1], 2:3],
        by = "Alg", all = TRUE,
        suffixes = c(paste0(".", x[1]),
                     paste0(".", y[1])))
}
And here it is in use:
compare("A", "D", Df)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596
compare("A", "C", Df)
# Alg Cpu.A Cpu.C
#1 b 1.0587151 -1.0230431
#2 c -2.0390119 -0.7691048
#3 d -0.8574474 -1.2421078
Regarding my comment, this can also be achieved using sqldf. SQLite does not support FULL OUTER JOIN, but this shouldn't be too much of an issue if you are comfortable with SQL, as there are probably a dozen or so ways to work around that:
library(sqldf)
sqldf(
  "select x.Alg
         ,lhs.Cpu as 'Cpu.A'
         ,rhs.Cpu as 'Cpu.D'
   from (
         select distinct d.Alg
         from Df d
        ) x
   left join Df lhs on lhs.Alg = x.Alg and lhs.Lang = 'A'
   left join Df rhs on rhs.Alg = x.Alg and rhs.Lang = 'D'
   order by x.Alg"
)
# Alg Cpu.A Cpu.D
#1 a NA -0.06520117
#2 b 1.0587151 0.08379303
#3 c -2.0390119 NA
#4 d -0.8574474 1.27865596

How to use mice for multiple imputation of missing values in longitudinal data?

I have a dataset with a repeatedly measured continuous outcome and some covariates of different classes, like in the example below.
Id y Date Soda Team
1 -0.4521 1999-02-07 Coke Eagles
1 0.2863 1999-04-15 Pepsi Raiders
2 0.7956 1999-07-07 Coke Raiders
2 -0.8248 1999-07-26 NA Raiders
3 0.8830 1999-05-29 Pepsi Eagles
4 0.1303 2005-03-04 NA Cowboys
5 0.1375 2013-11-02 Coke Cowboys
5 0.2851 2015-06-23 Coke Eagles
5 -0.3538 2015-07-29 Pepsi NA
6 0.3349 2002-10-11 NA NA
7 -0.1756 2005-01-11 Pepsi Eagles
7 0.5507 2007-10-16 Pepsi Cowboys
7 0.5132 2012-07-13 NA Cowboys
7 -0.5776 2017-11-25 Coke Cowboys
8 0.5486 2009-02-08 Coke Cowboys
I am trying to multiply impute the missing values in Soda and Team using the mice package. As I understand it, because MI is not a causal model, there is no concept of dependent and independent variables. I am not sure how to set up this MI process using mice. I would like some suggestions or advice from others who have encountered missing data in a repeated-measures setting like this, and how they used mice to tackle the problem. Thanks in advance.
Edit
This is what I have tried so far, but this does not capture the repeated measure part of the dataset.
library(mice)
init = mice(dat, maxit=0)
methd = init$method
predM = init$predictorMatrix
methd[c("Soda")] = "logreg"
methd[c("Team")] = "logreg"
imputed = mice(dat, method = methd, predictorMatrix = predM, m = 5)
There are several options to accomplish what you are asking for. I have decided to impute missing values in covariates in the so-called 'wide' format. I will illustrate this with the following worked example, which you can easily apply to your own data.
Let's first make a reprex. Here, I use the longitudinal Mayo Clinic Primary Biliary Cirrhosis Data (pbc2), which comes with the JM package. This data is organized in the so-called 'long' format, meaning that each patient i has multiple rows and each row contains a measurement of variable x measured on time j. Your dataset is also in the long format. In this example, I assume that pbc2$serBilir is our outcome variable.
# install.packages('JM')
library(JM)
# note: use function(x) instead of \(x) if you use a version of R <4.1.0
# missing values per column
miss_abs <- \(x) sum(is.na(x))
miss_perc <- \(x) round(sum(is.na(x)) / length(x) * 100, 1L)
miss <- cbind('Number' = apply(pbc2, 2, miss_abs), '%' = apply(pbc2, 2, miss_perc))
# --------------------------------
> miss[which(miss[, 'Number'] > 0),]
Number %
ascites 60 3.1
hepatomegaly 61 3.1
spiders 58 3.0
serChol 821 42.2
alkaline 60 3.1
platelets 73 3.8
According to this output, 6 variables in pbc2 contain at least one missing value. Let's pick alkaline from these. We also need patient id and the time variable years.
# subset
pbc_long <- subset(pbc2, select = c('id', 'years', 'alkaline', 'serBilir'))
# sort ascending based on id and, within each id, years
pbc_long <- with(pbc_long, pbc_long[order(id, years), ])
# ------------------------------------------------------
> head(pbc_long, 5)
id years alkaline serBilir
1 1 1.09517 1718 14.5
2 1 1.09517 1612 21.3
3 2 14.15234 7395 1.1
4 2 14.15234 2107 0.8
5 2 14.15234 1711 1.0
Just by quickly eyeballing, we observe that years do not seem to differ within subjects, even though variables were repeatedly measured. For the sake of this example, let's add a little bit of time to all rows of years but the first measurement.
set.seed(1)
# add little bit of time to each row of 'years' but the first row
new_years <- lapply(split(pbc_long, pbc_long$id), \(x) {
add_time <- 1:(length(x$years) - 1L) + rnorm(length(x$years) - 1L, sd = 0.25)
c(x$years[1L], x$years[-1L] + add_time)
})
# replace the original 'years' variable
pbc_long$years <- unlist(new_years)
# integer time variable needed to store repeated measurements as separate columns
pbc_long$measurement_number <- unlist(sapply(split(pbc_long, pbc_long$id), \(x) 1:nrow(x)))
# only keep the first 4 repeated measurements per patient
pbc_long <- subset(pbc_long, measurement_number %in% 1:4)
Since we will perform our multiple imputation in wide format (meaning that each participant i has one row and repeated measurements on x are stored in j different columns, so xj columns in total), we have to convert the data from long to wide. Now that we have prepared our data, we can use reshape to do this for us.
# convert long format into wide format
v_names <- c('years', 'alkaline', 'serBilir')
pbc_wide <- reshape(pbc_long,
idvar = 'id',
timevar = "measurement_number",
v.names = v_names, direction = "wide")
# -----------------------------------------------------------------
> head(pbc_wide, 4)[, 1:9]
id years.1 alkaline.1 serBilir.1 years.2 alkaline.2 serBilir.2 years.3 alkaline.3
1 1 1.095170 1718 14.5 1.938557 1612 21.3 NA NA
3 2 14.152338 7395 1.1 15.198249 2107 0.8 15.943431 1711
12 3 2.770781 516 1.4 3.694434 353 1.1 5.148726 218
16 4 5.270507 6122 1.8 6.115197 1175 1.6 6.716832 1157
Now let's multiply impute the missing values in our covariates.
library(mice)
# Setup-run
ini <- mice(pbc_wide, maxit = 0)
meth <- ini$method
pred <- ini$predictorMatrix
visSeq <- ini$visitSequence
# avoid collinearity issues by letting only variables measured
# at the same point in time predict each other
pred[grep("1", rownames(pred), value = TRUE),
grep("2|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("2", rownames(pred), value = TRUE),
grep("1|3|4", colnames(pred), value = TRUE)] <- 0
pred[grep("3", rownames(pred), value = TRUE),
grep("1|2|4", colnames(pred), value = TRUE)] <- 0
pred[grep("4", rownames(pred), value = TRUE),
grep("1|2|3", colnames(pred), value = TRUE)] <- 0
# variables that should not be imputed
pred[c("id", grep('^year', names(pbc_wide), value = TRUE)), ] <- 0
# variables should not serve as predictors
pred[, c("id", grep('^year', names(pbc_wide), value = TRUE))] <- 0
# multiply imputed missing values ------------------------------
imp <- mice(pbc_wide, pred = pred, m = 10, maxit = 20, seed = 1)
# Time difference of 2.899244 secs
As can be seen in example traceplots (which can be obtained with plot(imp)), the algorithm has converged nicely. Refer to this section of Stef van Buuren's book for more info on convergence.
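If you want to reproduce those diagnostics yourself, the calls are one line each (densityplot is an extra check I would suggest, not something shown in the original answer):
# convergence traceplots and a comparison of imputed vs. observed densities
plot(imp)
densityplot(imp)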
Now we need to convert back the multiply imputed data (which is in wide format) to long format, so that we can use it for analyses. We also need to make sure that we exclude all rows that had missing values for our outcome variable serBilir, because we do not want to use imputed values of the outcome.
# need unlisted data
implong <- complete(imp, 'long', include = FALSE)
# 'smart' way of getting all the names of the repeated variables in a usable format
v_names <- as.data.frame(matrix(apply(
expand.grid(grep('ye|alk|ser', names(implong), value = TRUE)),
1, paste0, collapse = ''), nrow = 4, byrow = TRUE), stringsAsFactors = FALSE)
names(v_names) <- names(pbc_long)[2:4]
# convert back to long format
longlist <- lapply(split(implong, implong$.imp),
reshape, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
# logical that is TRUE if our outcome was not observed
# which should be based on the original, unimputed data
orig_data <- reshape(imp$data, direction = 'long',
varying = as.list(v_names),
v.names = names(v_names),
idvar = 'id', times = 1:4)
orig_data$logical <- is.na(orig_data$serBilir)
# merge into the list of imputed long-format datasets:
longlist <- lapply(longlist, merge, y = subset(orig_data, select = c(id, time, logical)))
# exclude rows for which logical == TRUE
longlist <- lapply(longlist, \(x) subset(x, !logical))
Finally, convert longlist back into a mids using datalist2mids from the miceadds package.
imp <- miceadds::datalist2mids(longlist)
# ----------------
> imp$loggedEvents
NULL
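From here the usual mice workflow applies: fit the substantive model on each imputed data set and pool the results with Rubin's rules. A minimal sketch (the model formula below is only illustrative, not part of the original answer):
# fit a (hypothetical) analysis model on each imputation and pool the estimates
fit <- with(imp, lm(serBilir ~ alkaline + time))
summary(pool(fit))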

Reading Excel file: How to find the start cell in messy spreadsheets?

I'm trying to write R code to read data from a mess of old spreadsheets. The exact location of the data varies from sheet to sheet: the only constant is that the first column is a date and the second column has "Monthly return" as the header. In this example, the data starts in cell B5:
How do I automate the search of Excel cells for my "Monthly return" string using R?
At the moment, the best idea I can come up with is to read everything into R starting at cell A1 and sort out the mess in the resulting (huge) matrices. I'm hoping for a more elegant solution.
I haven't found a way to do this elegantly, but I'm very familiar with this problem (getting data from FactSet PA reports -> Excel -> R, right?). I understand different reports have different formats, and this can be a pain.
For a slightly different version of annoyingly formatted spreadsheets, I do the following. It's not the most elegant (it requires two reads of the file) but it works. I like reading the file twice, to make sure the columns are of the correct type, and with good headers. It's easy to mess up column imports, so I'd rather have my code read the file twice than go through and clean up columns myself, and the read_excel defaults, if you start at the right row, are pretty good.
Also, it's worth noting that as of today (2017-04-20), readxl had an update. I installed the new version to see if that would make this very easy, but I don't believe that's the case, although I could be mistaken.
library(readxl)
library(stringr)
library(dplyr)
f_path <- file.path("whatever.xlsx")
if (!file.exists(f_path)) {
f_path <- file.choose()
}
# I read this twice, temp_read to figure out where the data actually starts...
# Maybe you need something like this -
# excel_sheets <- readxl::excel_sheets(f_path)
# desired_sheet <- which(stringr::str_detect(excel_sheets,"2 Factor Brinson Attribution"))
desired_sheet <- 1
temp_read <- readxl::read_excel(f_path,sheet = desired_sheet)
skip_rows <- NULL
col_skip <- 0
search_string <- "Monthly Returns"
max_cols_to_search <- 10
max_rows_to_search <- 10
# Note, for the - 0, you may need to add/subtract a row if you end up skipping too far later.
while (length(skip_rows) == 0) {
col_skip <- col_skip + 1
if (col_skip == max_cols_to_search) break
skip_rows <- which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0
}
# ... now we re-read from the known good starting point.
real_data <- readxl::read_excel(
f_path,
sheet = desired_sheet,
skip = skip_rows
)
# You likely don't need this if you start at the right row
# But given that all weird spreadsheets are weird in their own way
# You may want to operate on the col_skip, maybe like so:
# real_data <- real_data %>%
# select(-(1:col_skip))
Okay, since the format was specified as xls, here is an update from the csv approach to the suggested xls loading.
library(readxl)
data <- readxl::read_excel(".../sampleData.xls", col_names = FALSE)
You would get something similar to:
data <- structure(list(V1 = structure(c(6L, 5L, 3L, 7L, 1L, 4L, 2L), .Label = c("",
"Apr 14", "GROSS PERFROANCE DETAILS", "Mar-14", "MC Pension Fund",
"MY COMPANY PTY LTD", "updated by JS on 6/4/2017"), class = "factor"),
V2 = structure(c(1L, 1L, 1L, 1L, 4L, 3L, 2L), .Label = c("",
"0.069%", "0.907%", "Monthly return"), class = "factor")), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -7L))
then you can dynamically filter on the "Monthly return" cell and identify your matrix.
targetCell <- which(data == "Monthly return", arr.ind = T)
returns <- data[(targetCell[1] + 1):nrow(data), (targetCell[2] - 1):targetCell[2]]
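As a small follow-up (my addition), the extracted columns are still factors/character, so you will probably want to name and convert them; the "%" handling below is an assumption based on the sample data shown above:
names(returns) <- c("Date", "MonthlyReturn")
# strip the "%" and convert to a numeric fraction
returns$MonthlyReturn <- as.numeric(sub("%", "", as.character(returns$MonthlyReturn))) / 100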
With a general purpose package like readxl, you'll have to read twice, if you want to enjoy automatic type conversion. I assume you have some sort of upper bound on the number of junk rows at the front? Here I assumed that was 10. I'm iterating over worksheets in one workbook, but the code would look pretty similar if iterating over workbooks. I'd write one function to handle a single worksheet or workbook then use lapply() or purrr::map(). This function will encapsulate the skip-learning read and the "real" read.
library(readxl)
two_passes <- function(path, sheet = NULL, n_max = 10) {
first_pass <- read_excel(path = path, sheet = sheet, n_max = n_max)
skip <- which(first_pass[[2]] == "Monthly return")
message("For sheet '", if (is.null(sheet)) 1 else sheet,
"' we'll skip ", skip, " rows.")
read_excel(path, sheet = sheet, skip = skip)
}
(sheets <- excel_sheets("so.xlsx"))
#> [1] "sheet_one" "sheet_two"
sheets <- setNames(sheets, sheets)
lapply(sheets, two_passes, path = "so.xlsx")
#> For sheet 'sheet_one' we'll skip 4 rows.
#> For sheet 'sheet_two' we'll skip 6 rows.
#> $sheet_one
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
#>
#> $sheet_two
#> # A tibble: 6 × 2
#> X__1 `Monthly return`
#> <dttm> <dbl>
#> 1 2017-03-14 0.00907
#> 2 2017-04-14 0.00069
#> 3 2017-05-14 0.01890
#> 4 2017-06-14 0.00803
#> 5 2017-07-14 -0.01998
#> 6 2017-08-14 0.00697
In these cases it's important to know the possible conditions of your data. I'm going to assume that you only want to remove the columns and rows that don't conform to your table.
I have this Excel book:
I added 3 blank columns at the left because when I loaded it into R with a single blank column the program omitted it. That confirms that R omits empty columns at the left.
First: load data
library(xlsx)
dat <- read.xlsx('book.xlsx', sheetIndex = 1)
head(dat)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Second: I added some columns with NA and '' values, in case your data contain some
dat$x2 <- NA
dat$x4 <- NA
head(dat)
MY.COMPANY.PTY.LTD NA. x2 x4
1 MC Pension Fund <NA> NA NA
2 GROSS PERFORMANCE DETAILS <NA> NA NA
3 updated by IG on 20/04/2017 <NA> NA NA
4 <NA> Monthly return NA NA
5 Mar-14 0.0097 NA NA
6 Apr-14 6e-04 NA NA
Third: Remove columns where all values are NA or ''. I have had to deal with that kind of problem in the past.
colSelect <- apply(dat, 2, function(x) !(length(x) == length(which(x == '' | is.na(x)))))
dat2 <- dat[, colSelect]
head(dat2)
MY.COMPANY.PTY.LTD NA.
1 MC Pension Fund <NA>
2 GROSS PERFORMANCE DETAILS <NA>
3 updated by IG on 20/04/2017 <NA>
4 <NA> Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
Fourth: Keep only rows with complete observations (which is what I suppose you want, from your example)
rowSelect <- apply(dat2, 1, function(x) !any(is.na(x)))
dat3 <- dat2[rowSelect, ]
head(dat3)
MY.COMPANY.PTY.LTD NA.
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Finally, if you want to keep the header, you can do something like this:
colnames(dat3) <- as.matrix(dat2[which(rowSelect)[1] - 1, ])
or
colnames(dat3) <- c('Month', as.character(dat2[which(rowSelect)[1] - 1, 2]))
dat3
Month Monthly return
5 Mar-14 0.0097
6 Apr-14 6e-04
7 May-14 0.0189
8 Jun-14 0.008
9 Jul-14 -0.0199
10 Ago-14 0.00697
Here is how I would tackle it.
STEP 1
Read the excel spreadsheet in without the headers.
STEP 2
Find the row index for your string Monthly return in this case
STEP 3
Filter from the identified row (or column or both), prettify a little and done.
Here is what a sample function looks like. It works for your example no matter where it is in the spreadsheet. You can play around with regex to make it more robust.
Function Definition:
library(xlsx)
extract_return <- function(path = getwd(), filename = "Mysheet.xlsx", sheetnum = 1){
filepath = paste(path, "/", filename, sep = "")
input = read.xlsx(filepath, sheetnum, header = FALSE)
start_idx = which(input == "Monthly return", arr.ind = TRUE)[1]
output = input[start_idx:dim(input)[1],]
rownames(output) <- NULL
colnames(output) <- c("Date","Monthly Return")
output = output[-1, ]
return(output)
}
Example:
final_df <- extract_return(
path = "~/Desktop",
filename = "Apr2017.xlsx",
sheetnum = 2)
No matter how many rows or columns you may have, the idea remains the same. Give it a try and let me know.
This is a tidy alternative that avoids the multiple reads issue discussed above. However, when doing benchmarks, Rafael Zayas's answer still wins out.
library("tidyxl")
library("unpivotr")
library("tidyr")
library("dplyr")
tidy_solution <- function() {
raw <- xlsx_cells("messyExcel.xlsx")
start <- raw %>%
filter_all(any_vars(. %in% c("Monthly return"))) %>%
select(row, col)
month.col <- raw %>%
filter(row >= start$row + 1, col == start$col - 1) %>%
pivot_wider(date, col)
return.col <- raw %>%
filter(row >= start$row + 1, col == start$col) %>%
pivot_wider(numeric, col)
output <- cbind(month.col, return.col)
}
# My Solution
expr min lq mean median uq max neval
tidy_solution() 29.0372 30.40305 32.13793 31.36925 32.9812 56.6455 100
# Rafael's
expr min lq mean median uq max neval
original_solution() 21.4405 23.8009 25.86874 25.10865 26.99945 59.4128 100
grep("2014",dat)[1]
This gives you the first column containing the year. Or use "-14" or whatever you have for your years.
In the same way, grep("Monthly",dat)[1] gives you the second column.
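Putting those two together, a rough sketch (my addition, assuming dat was read without headers as in the answers above):
year_col <- grep("-14", dat)[1]               # column holding the dates
ret_col <- grep("Monthly", dat)[1]            # column holding "Monthly return"
start_row <- which(dat[[ret_col]] == "Monthly return")
returns <- dat[(start_row + 1):nrow(dat), c(year_col, ret_col)]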

R function to calculate nearest neighbor distance given [inconsistent] constraint?

I have data consisting of tree growth measurements (diameter and height) for trees at known X & Y coordinates. I'd like to determine the distance to each tree's nearest neighbor of equal or greater size.
I've seen other SE questions asking about nearest neighbor calculations (e.g., see here, here, here, here, etc.), but none specify constraints on the nearest neighbor to be searched.
Is there a function (or other work around) that would allow me to determine the distance of a point's nearest neighbor given that nearest point meets some criteria (e.g., must be equal to or greater in size than the point of interest)?
[An even more complex set of constraints would be even more helpful...]
For my example: specifying that the neighbor must also be in the same plot as the tree of interest, or be the same species as the tree of interest.
I'd do it with non-equijoins and data.table
EDIT: (fyi, this requires data.table 1.9.7, which you can get from github)
EDIT2: did it with a copy of the data.table, since it seems like it was joining on its own thresholds. I'll fix that in future, but this works for now.
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1),
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
The final dataset contains each pair, according to the given thresholds.
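One small note (my addition): dist above is the squared Euclidean distance, so take the square root if you need the actual distance:
final[, nn_dist := sqrt(dist)]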
EDIT:
With Additional variables:
If you want to join on additional parameters, this allows you to do it. (It's probably even faster if you additionally join on things like plot or species, since the cartesian join will be smaller.)
Here's an example joining on two additional categorical variables, species and plot:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# Join on a range, must be a cartesian join, since there are many candidates
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species == species,
plot == plot),
nomatch = NA,
allow.cartesian = TRUE]
# Calculate the distance
test[, dist := (x - i.x)**2 + (y - i.y)**2]
# Exclude identical matches and
# Take the minimum distance grouped by id
final <- test[id != i.id, .SD[which.min(dist)],by = id]
final
> final
id x y height species plot height.1 i.id i.x i.y i.height dist
1: 3 0.4837348 0.4325731 91.53387 C 2 111.53387 486 0.5549221 0.4395687 101.53387 0.005116568
2: 13 0.8267298 0.3137061 94.58949 C 2 114.58949 754 0.8408547 0.2305702 104.58949 0.007111079
3: 29 0.2905729 0.4952757 89.52128 C 2 109.52128 333 0.2536760 0.5707272 99.52128 0.007054301
4: 37 0.4534841 0.5249862 89.95493 C 2 109.95493 72 0.4807242 0.6056771 99.95493 0.007253044
5: 63 0.1678515 0.8814829 84.77450 C 2 104.77450 289 0.1151764 0.9728488 94.77450 0.011122404
---
994: 137 0.8696393 0.2226888 66.57792 C 2 86.57792 473 0.4467795 0.6881008 76.57792 0.395418724
995: 348 0.3606249 0.1245749 110.14466 A 2 130.14466 338 0.1394011 0.1200064 120.14466 0.048960849
996: 572 0.6562758 0.1387882 113.61821 A 2 133.61821 348 0.3606249 0.1245749 123.61821 0.087611511
997: 143 0.9170504 0.1171652 71.39953 C 3 91.39953 904 0.6954973 0.3690599 81.39953 0.112536771
998: 172 0.6834473 0.6221259 65.52187 A 2 85.52187 783 0.4400028 0.9526355 75.52187 0.168501816
>
NOTE: in the final answer there are columns height and height.1; these result from data.table's non-equi (range) join and represent the lower and upper boundary, respectively.
A Mem-efficient solution
One of the issues here for @theforestecologist was that this requires a lot of memory
(in that case, there were an additional 42 columns being multiplied by the cartesian join, which caused memory issues).
However, we can do this in a more memory-efficient way by using .EACHI (I believe), since we then do not load the full join result into memory. That solution follows:
library(data.table)
dtree <- data.table(id = 1:1000,
x = runif(1000),
y = runif(1000),
height = rnorm(1000,mean = 100,sd = 10),
species = sample(LETTERS[1:3],1000,replace = TRUE),
plot = sample(1:3,1000, replace = TRUE))
dtree_self <- copy(dtree)
dtree_self[,thresh1 := height + 10]
dtree_self[,thresh2 := height - 10]
# In order to navigate the sometimes unusual nature of scoping inside a
# data.table join, I set the second table to have its own uniquely named id
dtree_self[,id2 := id]
dtree_self[,id := NULL]
# for clarity inside the brackets,
# I define the squared euclid distance
eucdist <- function(x,xx,y,yy) (x - xx)**2 + (y - yy)**2
# Join on a range, must be a cartesian join, since there are many candidates
# Return a table of matches, using .EACHI to keep from loading too much into mem
test <- dtree[dtree_self, on = .(height >= thresh2,
height <= thresh1,
species,
plot),
.(id2, id[{z = eucdist(x,i.x,y,i.y); mz <- min(z[id2 != id]); mz == z}]),
by = .EACHI,
nomatch = NA,
allow.cartesian = TRUE]
# join the metadata back onto each id
test <- dtree[test, on = .(id = V2), nomatch = NA]
test <- dtree[test, on = .(id = id2), nomatch = NA]
> test
id x y height species plot i.id i.x i.y i.height i.species i.plot i.height.2 i.height.1 i.species.1 i.plot.1
1: 1 0.17622235 0.66547312 84.68450 B 2 965 0.17410840 0.63219350 93.60226 B 2 74.68450 94.68450 B 2
2: 2 0.04523011 0.33813054 89.46288 B 2 457 0.07267547 0.35725229 88.42827 B 2 79.46288 99.46288 B 2
3: 3 0.24096368 0.32649256 103.85870 C 3 202 0.20782303 0.38422814 94.35898 C 3 93.85870 113.85870 C 3
4: 4 0.53160655 0.06636979 101.50614 B 1 248 0.47382417 0.01535036 103.74101 B 1 91.50614 111.50614 B 1
5: 5 0.83426727 0.55380451 101.93408 C 3 861 0.78210747 0.52812487 96.71422 C 3 91.93408 111.93408 C 3
This way we should keep total memory usage low.
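If you also want the actual nearest-neighbour distance in the output, a small sketch (my addition) using the coordinates joined back on in the two steps above:
test[, nn_dist := sqrt((x - i.x)^2 + (y - i.y)^2)]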

Aggregate across a list of dataframes

I have a fitting process that runs 100 times. Each time the output is a dataframe -- which I capture by using lapply to create a list of 100 dataframes.
The first two dataframes might look something like this (I have more than 1 column):
n1 = c(4, 5, 6)
df1 = data.frame(n1, row.names = c("height", "weight", "favcolor"))
n2 = c(2, 3, 5, 7)
df2 = data.frame(n2, row.names = c("height", "weight", "inseam", "favcolor"))
I would like to combine these dataframes (take the average height value, or the standard deviation of the weights, for example).
My first thought was to turn this list into a dataframe -- not going to work because arguments imply differing number of rows.
My second thought was to insert NAs for each rowname that didn't appear (so, I would be adding:
new_row <- c(NA, NA)
row.names(new_row) <- "inseam"
But I can't get that to work for a myriad of reasons, and I'm unable even to test whether, if I add an "inseam" row containing NAs to the end of df1, I get the correct answer when I take the average of "inseam".
The correct average in this case being:
height 3
weight 4
inseam 5
favcolor 6.5
So, my question:
If you had a list of dataframes, where the row names are meaningful and need to be aggregated across like the above example, what is the best way to do that?
################
EDIT
Here is the full data from 3 of my datasets:
> mega_df[1]
[[1]]
coeff error pval
rf2 -1.15099200 0.5752430 4.540538e-02
rf3 -0.53430218 0.4928974 2.783635e-01
rf4 0.08784138 0.4933079 8.586711e-01
rf5 0.96002601 0.5070957 5.833327e-02
rm2 -0.36188368 0.4626464 4.340949e-01
rm3 0.01805873 0.4355164 9.669251e-01
rm4 0.45008373 0.4319557 2.974268e-01
rm5 1.04056503 0.4441024 1.912556e-02
rc_cat1 0.86231928 0.2827566 2.290799e-03
rc_cat2 1.21335473 0.2448206 7.192321e-07
rc_cat3 0.96196637 0.2044198 2.528247e-06
rc_cat4 1.04477290 0.3302644 1.559142e-03
rc_cat5 -0.58902623 1.5893867 7.109357e-01
rc_cat6 0.62569607 0.1720676 2.765407e-04
rc_cat7 0.29381724 0.4115594 4.752815e-01
rs2 0.12333678 0.7186019 8.637250e-01
rs3 1.22018613 0.6423970 5.750837e-02
rs4 1.96075220 0.6454184 2.381892e-03
rs5 2.58404946 0.6543862 7.853927e-05
1|3 0.01561497 0.4851330 9.743229e-01
3|4 1.82853786 0.4937675 2.128663e-04
4|5 3.73480100 0.5023435 1.047616e-13
> mega_df[2]
[[1]]
coeff error pval
rf2 -0.23364248 5.849338e-01 6.895734e-01
rf3 0.24054894 5.219730e-01 6.449094e-01
rf4 0.84072979 5.208259e-01 1.064788e-01
rf5 1.47867154 5.346970e-01 5.684640e-03
rm2 -0.29555400 4.465509e-01 5.080612e-01
rm3 0.31147504 4.131024e-01 4.508553e-01
rm4 0.73696523 4.141224e-01 7.514424e-02
rm5 1.14273148 4.271863e-01 7.472508e-03
rc_cat1 1.27479299 3.094432e-01 3.794740e-05
rc_cat2 1.10917318 2.619011e-01 2.284654e-05
rc_cat3 0.65782540 2.161602e-01 2.340525e-03
rc_cat4 0.40512225 3.301662e-01 2.198131e-01
rc_cat5 12.78797722 5.612311e-08 0.000000e+00
rc_cat6 0.41622889 1.677804e-01 1.310894e-02
rc_cat7 0.16833629 3.806498e-01 6.583198e-01
rs2 -0.02279305 7.225878e-01 9.748360e-01
rs3 0.68299485 6.759050e-01 3.122608e-01
rs4 1.36149302 6.780722e-01 4.465519e-02
rs5 2.18484594 6.863696e-01 1.456612e-03
1|3 0.35419237 5.844931e-01 5.445266e-01
3|4 2.12603072 5.928308e-01 3.354863e-04
4|5 3.97564508 5.999369e-01 3.431064e-11
> mega_df[3]
[[1]]
coeff error pval
rf2 -0.2733408 0.5884741 6.422961e-01
rf3 0.1764257 0.5257697 7.372050e-01
rf4 0.6504428 0.5248386 2.152271e-01
rf5 1.3967757 0.5356706 9.119879e-03
rm2 -0.2361284 0.4870015 6.277745e-01
rm3 0.2078729 0.4609270 6.519977e-01
rm4 0.6390950 0.4622065 1.667555e-01
rm5 1.1738653 0.4730686 1.308730e-02
rc_cat1 0.9337627 0.2958630 1.599133e-03
rc_cat2 1.0292916 0.2493133 3.651281e-05
rc_cat3 0.7088285 0.2012026 4.267587e-04
rc_cat4 0.6296966 0.3664883 8.576193e-02
rc_cat6 0.5475018 0.1720841 1.464662e-03
rc_cat7 0.4521113 0.3588440 2.077017e-01
rs2 -0.4663666 0.7031265 5.071541e-01
rs3 0.7810059 0.6489673 2.287985e-01
rs4 1.5178641 0.6522175 1.995271e-02
rs5 2.1916080 0.6578769 8.643075e-04
1|3 0.2569225 0.4659460 5.813597e-01
3|4 2.0648302 0.4769118 1.493906e-05
4|5 3.9312070 0.4855339 5.648509e-16
And I'm hoping to do some basic aggregations that end up returning:
avcoef averror avpval std(coef)
rf2 W X Y Z
rf3 ...
.
.
.
I guess you could just create a new column in each data set which will contain the row names and then merge accordingly, something like:
l <- lapply(list(df1, df2), function(x) {x$New <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "New", all = TRUE), l)
cbind(Res[1], Means = rowMeans(Res[-1], na.rm = TRUE))
# Row.names Means
# 1 favcolor 6.5
# 2 height 3.0
# 3 inseam 5.0
# 4 weight 4.0
This is probably highly related to this
Edit: For the new data set
l <- lapply(list(mega_df1, mega_df2, mega_df3), function(x) {x$RowName <- row.names(x) ; x})
Res <- Reduce(function(...) merge(..., by = "RowName", all = TRUE), l)
library(data.table) ## v1.9.6+
dcast(melt(setDT(Res), "RowName"),
RowName ~ sub("\\..*", "", variable),
mean,
na.rm = TRUE,
value.var = "value")
# RowName coeff error pval
# 1: cat1 1.0236250 0.2960209 1.309293e-03
# 2: cat2 1.1172732 0.2520117 2.002619e-05
# 3: cat3 0.7762068 0.2072609 9.232706e-04
# 4: cat4 0.6931972 0.3423063 1.023781e-01
# 5: cat5 6.0994755 0.7946934 3.554678e-01
# 6: cat6 0.5298089 0.1706440 4.950048e-03
# 7: cat7 0.3047549 0.3836844 4.471010e-01
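If you also want the standard deviation of the coefficients that the question asks for, here is a sketch (my addition, assuming Res from the merge above) that computes the mean of each statistic plus sd(coeff) per row name:
long <- melt(setDT(Res), id.vars = "RowName")
long[, stat := sub("\\..*", "", variable)]           # coeff / error / pval
means <- dcast(long, RowName ~ stat, fun.aggregate = mean, na.rm = TRUE, value.var = "value")
sd_coef <- long[stat == "coeff", .(sd_coeff = sd(value, na.rm = TRUE)), by = RowName]
means[sd_coef, on = "RowName"]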

Group/bin/bucket data in R and get count per bucket and sum of values per bucket

I wish to bucket/group/bin data:
C1 C2 C3
49488.01172 0.0512 54000
268221.1563 0.0128 34399
34775.96094 0.0128 54444
13046.98047 0.07241 61000
2121699.75 0.00453 78921
71155.09375 0.0181 13794
1369809.875 0.00453 12312
750 0.2048 43451
44943.82813 0.0362 49871
85585.04688 0.0362 18947
31090.10938 0.0362 13401
68550.40625 0.0181 14345
I want to bucket it by the C2 values, but I wish to define the buckets myself, e.g. <=0.005, <=0.010, <=0.014, etc. As you can see, the buckets will be uneven intervals. I want the count of C1 per bucket as well as the total sum of C1 for every bucket.
I don't know where to begin as I am fairly new a user of R. Is there anyone willing to help me figure out the code or direct to me to an example that will work for my needs?
EDIT: added another column C3. I need sum of C3 per bucket as well at the same time as sum and count of C1 per bucket
From the comments, "C2" seems to be a "character" column with % as a suffix. Before creating the groups, remove the % using sub and convert the column to "numeric" (as.numeric). The variable "group" is created (transform(df, ...)) by using the function cut with the breaks (bucket boundaries) and labels (desired group labels) arguments. Once the group variable is created, the sum of "C1" by "group" and the count of elements within each "group" can be computed using aggregate from base R.
df1 <- transform(df, group=cut(as.numeric(sub('[%]', '', C2)),
breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(C1~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
dNew <- data.frame(group=levels(df1$group))
merge(res, dNew, all=TRUE)
# group C1.Count C1.Sum
#1 <0.005 2 3491509.6
#2 0.005 NA NA
#3 0.01 2 302997.1
#4 0.014 8 364609.5
or you can use data.table. setDT converts the data.frame to a data.table. Specify the grouping variable with by= and summarize/create the two variables "Count" and "Sum" within the list(...). .N gives the count of elements within each "group".
library(data.table)
setDT(df1)[, list(Count=.N, Sum=sum(C1)), by=group][]
Or using dplyr. The %>% operator connects the LHS with the RHS arguments and chains them together. Use group_by to specify the "group" variable, and then use summarise_each or summarise to get the count and sum of the column concerned. summarise_each is useful if there is more than one column.
library(dplyr)
df1 %>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1)
Update
Using the new dataset df
df1 <- transform(df, group=cut(C2, breaks=c(-Inf,0.005, 0.010, 0.014, Inf),
labels=c('<0.005', 0.005, 0.01, 0.014)))
res <- do.call(data.frame,aggregate(cbind(C1,C3)~group, df1,
FUN=function(x) c(Count=length(x), Sum=sum(x))))
res
# group C1.Count C1.Sum C3.Count C3.Sum
#1 <0.005 2 3491509.6 2 91233
#2 0.01 2 302997.1 2 88843
#3 0.014 8 364609.5 8 268809
and you can do the merge as detailed above.
The dplyr approach would be the same except specifying the additional variable
df1%>%
group_by(group) %>%
summarise_each(funs(n(), Sum=sum(.)), C1, C3)
#Source: local data frame [3 x 5]
# group C1_n C3_n C1_Sum C3_Sum
#1 <0.005 2 2 3491509.6 91233
#2 0.01 2 2 302997.1 88843
#3 0.014 8 8 364609.5 268809
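For completeness, the data.table version of the same summary on the updated data set is also a one-liner (a sketch, assuming df1 from the transform() call in the update, i.e. with a numeric C2 and the extra C3 column):
setDT(df1)[, .(C1_Count = .N, C1_Sum = sum(C1), C3_Sum = sum(C3)), by = group]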
data
df <-structure(list(C1 = c(49488.01172, 268221.1563, 34775.96094,
13046.98047, 2121699.75, 71155.09375, 1369809.875, 750, 44943.82813,
85585.04688, 31090.10938, 68550.40625), C2 = c("0.0512%", "0.0128%",
"0.0128%", "0.07241%", "0.00453%", "0.0181%", "0.00453%", "0.2048%",
"0.0362%", "0.0362%", "0.0362%", "0.0181%")), .Names = c("C1",
"C2"), row.names = c(NA, -12L), class = "data.frame")
