recursive error in dplyr mutate - r

Just learning dplyr (and R) and I do not understand why this fails or what the correct approach to this is. I am looking for a general explanation rather than something specific to this contrived dataset.
Assume I have 3 files sizes with multipliers and I'd like to combine them into a single numeric column.
require(dplyr)
m <- data.frame(
K = 1E3,
M = 1E6,
G = 1E9
)
s <- data.frame(
size = 1:3,
mult = c('K', 'M', 'G')
)
Now I want to multiply the size by it's multiplier so I tried:
mutate(s, total = size * m[[mult]])
#Error in .subset2(x, i, exact = exact) :
# recursive indexing failed at level 2
which throws an error. I also tried:
mutate(s, total = size * as.numeric(m[mult]))
#1 1 K 1e+06
#2 2 M 2e+09
#3 3 G 3e+03
which is worse than an error (wrong answer)!
I tried a lot of other permutations but could not find the answer.
Thanks in advance!
Edit:
(or should this be another question)
akrun's answer worked great and I thought I understood but if I
rbind(s, c(4, NA))
then update the mutate to
mutate(s, total = size *
ifelse(is.na(mult), 1,
unlist(m[as.character(mult)])
it falls apart again with an "undefined columns selected"

The 'mult' column is 'factor' class. Convert it to 'character' for subsetting the 'm', `unlist' and then multiply with 'size'
mutate(s, new= size*unlist(m[as.character(mult)]))
# size mult new
#1 1 K 1e+03
#2 2 M 2e+06
#3 3 G 3e+09
If we look at how the 'factor' columns act based on the 'levels'
m[s$mult]
# M G K
#1 1e+06 1e+09 1000
We get the same order of output by using match between the names(m) and levels(s$mult)
m[match(names(m), levels(s$mult))]
# M G K
#1 1e+06 1e+09 1000
So, this might be the reason why you got a different result

If you don't mind changing the data structure of m, you could use
# change m to a table
m = as.data.frame(t(m))
m$mult = rownames(m)
colnames(m)[which(colnames(m) == "V1")] = "value"
# to avoid indexing
s %>%
inner_join(m) %>%
mutate(total = size*value) %>%
select(size, mult, total)
to keep things more dplyr based.
EDIT: Though it works, you may need to be a little bit careful about the data types in the columns though

Related

Calculate all possible product combinations between variables

I have a df containing 3 variables, and I want to create an extra variable for each possible product combination.
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
I want to create a new df (output) containing the result of a*b, a*c, b*c.
output <- data.frame(d = test$a * test$b
, e = test$a * test$c
, f = test$b * test$c)
This is easily doable (manually) with a small number of columns, but even above 5 columns, this activity can get very lengthy - and error-prone, when column names contain prefix, suffix or codes inside.
It would be extra if I could also control the maximum number of columns to consider at the same time (in the example above, I only considered 2 columns, but it would be great to select that parameter too, so to add an extra variable a*b*c - if needed)
My initial idea was to use expand.grid() with column names and then somehow do a lookup to select the whole columns values for the product - but I hope there's an easier way to do it that I am not aware of.
You can use combn to create combination of column names taken 2 at a time and multiply them to create new columns.
cbind(test, do.call(cbind, combn(names(test), 2, function(x) {
setNames(data.frame(do.call(`*`, test[x])), paste0(x, collapse = '-'))
}, simplify = FALSE)))
#. a b c a-b a-c b-c
#1 0.4098568 -0.3514020 2.5508854 -0.1440245 1.045498 -0.8963863
#2 1.4066395 0.6693990 0.1858557 0.9416031 0.261432 0.1244116
#3 0.7150305 -1.1247699 2.8347166 -0.8042448 2.026909 -3.1884040
#4 0.8932950 1.6330398 0.3731903 1.4587864 0.333369 0.6094346
#5 -1.4895243 1.4124826 1.0092224 -2.1039271 -1.503261 1.4255091
#6 0.8239685 0.1347528 1.4274288 0.1110321 1.176156 0.1923501
#7 0.7803712 0.8685688 -0.5676055 0.6778060 -0.442943 -0.4930044
#8 -1.5760181 2.0014636 1.1844449 -3.1543428 -1.866707 2.3706233
#9 1.4414434 1.1134435 -1.4500410 1.6049658 -2.090152 -1.6145388
#10 0.3526583 -0.1238261 0.8949428 -0.0436683 0.315609 -0.1108172
Could this one also be a solution. Ronak's solution is more elegant!
library(dplyr)
# your data
test <- data.frame(a = rnorm(10,0,1)
, b = rnorm(10,0,1)
, c = rnorm(10,0,1))
# new dataframe output
output <- test %>%
mutate(a_b= prod(a,b),
a_c= prod(a,c),
b_c= prod(b,c)
) %>%
select(-a,-b,-c)

How to select rows that match the values in a vector

This is likely a very simple question, maybe even already answered, but I couldn't find it.
I have two data frames. For simplicity, I'll call them all and index. all will have many columns, one of which I'll call match. index also has many (other) columns, including match. I want to have the rows from all that have a match value that matches match in index. Therefore, as a dummy data:
all <- data.frame(match = sample(LETTERS, 20),
otherStuff = rnorm(20))
index <- data.frame(match = sample(LETTERS, 20),
moreStuff = rnorm(20))
I've tried the following:
all[all$match == index$match,]
But this works very badly:
> all[all$match == index$match,]
match otherStuff
3 C -0.6030772
Warning message:
In all$match == index$match :
longer object length is not a multiple of shorter object length
As you can see, it gives this warning and, it seems, only considers matches in the same position (in this case, C was the third element in both cases).
After that, I tried something similar with filter(), from dplyr, just to give it a chance. As I expected, it didn't work. I tried some other things (less logical, actually a little to desperate and nonsense to even comment here), all of which seemed crazy from the getgo. I really don't know what to do here...
Another solution is by using match:
Data:
set.seed(123)
all <- data.frame(match = sample(LETTERS, 10),
otherStuff = rnorm(10))
index <- data.frame(match = sample(LETTERS, 10),
moreStuff = rnorm(10))
Solution:
all[match(index$match, all$match, nomatch = 0),]
match otherStuff
10 Z -0.5558411
5 W -0.4456620
8 Q 0.4007715
6 A 1.2240818
7 K 0.3598138

Tables and bins from two vectors in R

As an exercise I was given two samples from a seed called u and v and asked to show how many values are in v but not in u fell into the bins [1,50] and [51,100]. Then I am asked to add a line of code in to confirm my answer using a relational operator (like >) and sum().
I solved the first part:
table(findInterval(setdiff(v,u),c(50))
But for the second part, i don't really get what I need to do; any help is appreciated!
Example:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),c(50)))
Output:
0 1
12 12
If we want to use comparative operators and sum, create a logical vector and get the sum of logical vector
i1 <- v[!v %in% u] > 50
sum(i1)
sum(!i1)
Note: If the OP intended to use only unique values (as in setdiff), then get the unique
i1 <- unique(v[!v %in% u]) > 50
out1 <- sum(i1)
out2 <- sum(!i1)
-checking with the output of table
tbl1 <- table(findInterval(setdiff(v,u),c(50)))
all.equal(as.numeric(tbl1), c(out1, out2), check.attributes = FALSE)
#[1] TRUE
Since there is only one number that you are cutting the intervals in, you can verify your answer using > directly.
This is your code
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),50))
#0 1
#9 9
Without findInterval
table(setdiff(v,u) > 50)
#FALSE TRUE
# 9 9

I want to split data in R by varying block sizes but each observation is unique

I have managed to read in a data file, and subset out the 2 columns of info that I want to work with. I am now stuck because I need to split the data into chunks of varying sizes and apply a function (mean, sd) to them, save the chunks and plot the sd from each. Otherwise known generally as block averaging. Right now I have a data frame with 2 columns and 10005 rows. The head of it looks like this:
Frame CA
1 0.773
Is there an efficient way that I could subset pieces of the data from a:b so that I can dictate how the data is broken up by the "Frame" column? I have found really good answers on here but I am not sure what they mean fully or if they would work.
chunk <- function(x, n)
(mapply(function(a, b) (x[a:b]), seq.int(from=1, to=length(x), by=n),
pmin(seq.int(from=1, to=length(x), by=n)+(n-1),
length(x)), SIMPLIFY=FALSE))
I'm not sure if it is what you're looking for but with closure, a data frame can be subsetted by arbitrary indices.
(If Frame can be subsetted by a:b, it is likely to be a sequence and thus a subset may be made by row index?)
df <- data.frame(group = sample(c("a", "b"), 20, replace = T),
val = rnorm(20))
# closure - returns a function that accepts from and to
subsetter <- function(from, to) {
function(x) {
x[from:to, ]
}
}
# from and to are specified
sub1 <- subsetter(2, 4)
sub2 <- subsetter(1, 5)
# data is split from to to
sub1(df)
#group val
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
sub2(df)
# group val
#1 b 0.4780080
#2 a 0.5518802
#3 b 1.5955093
#4 a -0.8132578
#5 b 0.4449554

apply a function over groups of columns

How can I use apply or a related function to create a new data frame that contains the results of the row averages of each pair of columns in a very large data frame?
I have an instrument that outputs n replicate measurements on a large number of samples, where each single measurement is a vector (all measurements are the same length vectors). I'd like to calculate the average (and other stats) on all replicate measurements of each sample. This means I need to group n consecutive columns together and do row-wise calculations.
For a simple example, with three replicate measurements on two samples, how can I end up with a data frame that has two columns (one per sample), one that is the average each row of the replicates in dat$a, dat$b and dat$c and one that is the average of each row for dat$d, dat$e and dat$f.
Here's some example data
dat <- data.frame( a = rnorm(16), b = rnorm(16), c = rnorm(16), d = rnorm(16), e = rnorm(16), f = rnorm(16))
a b c d e f
1 -0.9089594 -0.8144765 0.872691548 0.4051094 -0.09705234 -1.5100709
2 0.7993102 0.3243804 0.394560355 0.6646588 0.91033497 2.2504104
3 0.2963102 -0.2911078 -0.243723116 1.0661698 -0.89747522 -0.8455833
4 -0.4311512 -0.5997466 -0.545381175 0.3495578 0.38359390 0.4999425
5 -0.4955802 1.8949285 -0.266580411 1.2773987 -0.79373386 -1.8664651
6 1.0957793 -0.3326867 -1.116623982 -0.8584253 0.83704172 1.8368212
7 -0.2529444 0.5792413 -0.001950741 0.2661068 1.17515099 0.4875377
8 1.2560402 0.1354533 1.440160168 -2.1295397 2.05025701 1.0377283
9 0.8123061 0.4453768 1.598246016 0.7146553 -1.09476532 0.0600665
10 0.1084029 -0.4934862 -0.584671816 -0.8096653 1.54466019 -1.8117459
11 -0.8152812 0.9494620 0.100909570 1.5944528 1.56724269 0.6839954
12 0.3130357 2.6245864 1.750448404 -0.7494403 1.06055267 1.0358267
13 1.1976817 -1.2110708 0.719397607 -0.2690107 0.83364274 -0.6895936
14 -2.1860098 -0.8488031 -0.302743475 -0.7348443 0.34302096 -0.8024803
15 0.2361756 0.6773727 1.279737692 0.8742478 -0.03064782 -0.4874172
16 -1.5634527 -0.8276335 0.753090683 2.0394865 0.79006103 0.5704210
I'm after something like this
X1 X2
1 -0.28358147 -0.40067128
2 0.50608365 1.27513471
3 -0.07950691 -0.22562957
4 -0.52542633 0.41103139
5 0.37758930 -0.46093340
6 -0.11784382 0.60514586
7 0.10811540 0.64293184
8 0.94388455 0.31948189
9 0.95197629 -0.10668118
10 -0.32325169 -0.35891702
11 0.07836345 1.28189698
12 1.56269017 0.44897971
13 0.23533617 -0.04165384
14 -1.11251880 -0.39810121
15 0.73109533 0.11872758
16 -0.54599850 1.13332286
which I did with this, but is obviously no good for my much larger data frame...
data.frame(cbind(
apply(cbind(dat$a, dat$b, dat$c), 1, mean),
apply(cbind(dat$d, dat$e, dat$f), 1, mean)
))
I've tried apply and loops and can't quite get it together. My actual data has some hundreds of columns.
This may be more generalizable to your situation in that you pass a list of indices. If speed is an issue (large data frame) I'd opt for lapply with do.call rather than sapply:
x <- list(1:3, 4:6)
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
Works if you just have col names too:
x <- list(c('a','b','c'), c('d', 'e', 'f'))
do.call(cbind, lapply(x, function(i) rowMeans(dat[, i])))
EDIT
Just happened to think maybe you want to automate this to do every three columns. I know there's a better way but here it is on a 100 column data set:
dat <- data.frame(matrix(rnorm(16*100), ncol=100))
n <- 1:ncol(dat)
ind <- matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=TRUE, ncol=3)
ind <- data.frame(t(na.omit(ind)))
do.call(cbind, lapply(ind, function(i) rowMeans(dat[, i])))
EDIT 2
Still not happy with the indexing. I think there's a better/faster way to pass the indexes. here's a second though not satisfying method:
n <- 1:ncol(dat)
ind <- data.frame(matrix(c(n, rep(NA, 3 - ncol(dat)%%3)), byrow=F, nrow=3))
nonna <- sapply(ind, function(x) all(!is.na(x)))
ind <- ind[, nonna]
do.call(cbind, lapply(ind, function(i)rowMeans(dat[, i])))
A similar question was asked here by #david: averaging every 16 columns in r (now closed), which I answered by adapting #TylerRinker's answer above, following a suggestion by #joran and #Ben. Because the resulting function might be of help to OP or future readers, I am copying that function here, along with an example for OP's data.
# Function to apply 'fun' to object 'x' over every 'by' columns
# Alternatively, 'by' may be a vector of groups
byapply <- function(x, by, fun, ...)
{
# Create index list
if (length(by) == 1)
{
nc <- ncol(x)
split.index <- rep(1:ceiling(nc / by), each = by, length.out = nc)
} else # 'by' is a vector of groups
{
nc <- length(by)
split.index <- by
}
index.list <- split(seq(from = 1, to = nc), split.index)
# Pass index list to fun using sapply() and return object
sapply(index.list, function(i)
{
do.call(fun, list(x[, i], ...))
})
}
Then, to find the mean of the replicates:
byapply(dat, 3, rowMeans)
Or, perhaps the standard deviation of the replicates:
byapply(dat, 3, apply, 1, sd)
Update
by can also be specified as a vector of groups:
byapply(dat, c(1,1,1,2,2,2), rowMeans)
mean for rows from vectors a,b,c
rowMeans(dat[1:3])
means for rows from vectors d,e,f
rowMeans(dat[4:6])
all in one call you get
results<-cbind(rowMeans(dat[1:3]),rowMeans(dat[4:6]))
if you only know the names of the columns and not the order then you can use:
rowMeans(cbind(dat["a"],dat["b"],dat["c"]))
rowMeans(cbind(dat["d"],dat["e"],dat["f"]))
#I dont know how much damage this does to speed but should still be quick
The rowMeans solution will be faster, but for completeness here's how you might do this with apply:
t(apply(dat,1,function(x){ c(mean(x[1:3]),mean(x[4:6])) }))
Inspired by #joran's suggestion I came up with this (actually a bit different from what he suggested, though the transposing suggestion was especially useful):
Make a data frame of example data with p cols to simulate a realistic data set (following #TylerRinker's answer above and unlike my poor example in the question)
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
Rename the columns in this data frame to create groups of n consecutive columns, so that if I'm interested in the groups of three columns I get column names like 1,1,1,2,2,2,3,3,3, etc or if I wanted groups of four columns it would be 1,1,1,1,2,2,2,2,3,3,3,3, etc. I'm going with three for now (I guess this is a kind of indexing for people like me who don't know much about indexing)
n <- 3 # how many consecutive columns in the groups of interest?
names(dat) <- rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat)))
Now use apply and tapply to get row means for each of the groups
dat.avs <- data.frame(t(apply(dat, 1, tapply, names(dat), mean)))
The main downsides are that the column names in the original data are replaced (though this could be overcome by putting the grouping numbers in a new row rather than the colnames) and that the column names are returned by the apply-tapply function in an unhelpful order.
Further to #joran's suggestion, here's a data.table solution:
p <- 99 # how many columns?
dat <- data.frame(matrix(rnorm(4*p), ncol = p))
dat.t <- data.frame(t(dat))
n <- 3 # how many consecutive columns in the groups of interest?
dat.t$groups <- as.character(rep(seq(1:(ncol(dat)/n)), each = n, len = (ncol(dat))))
library(data.table)
DT <- data.table(dat.t)
setkey(DT, groups)
dat.av <- DT[, lapply(.SD,mean), by=groups]
Thanks everyone for your quick and patient efforts!
There is a beautifully simple solution if you are interested in applying a function to each unique combination of columns, in what known as combinatorics.
combinations <- combn(colnames(df),2,function(x) rowMeans(df[x]))
To calculate statistics for every unique combination of three columns, etc., just change the 2 to a 3. The operation is vectorized and thus faster than loops, such as the apply family functions used above. If the order of the columns matters, then you instead need a permutation algorithm designed to reproduce ordered sets: combinat::permn

Resources