Take for example:
s <- c("1y1","2y1","1.5y2","1.8y2")
Here y1 and y2 mean multiplying by 2 and 1.8, respectively:
1y1 = 1*2=2
1.5y2 = 1.5*1.8 = 2.7
So, the result would be c(2,4,2.7,3.24).
How can I simply get it?
I can do it with c(1,2,1.5,1.8)*c(2,2,1.8,1.8), but that is not elegant if the string is very long.
Use gsub to replace y1 and y2 with their respective numeric factors; then the classical eval(parse(...)) can be used to evaluate the resulting expressions:
s <- c("1y1","2y1","1.5y2","1.8y2")
s <- gsub('y1', '*2', s)
s <- gsub('y2', '*1.8', s)
sapply(s, function(x) eval(parse(text = x)))
1*2 2*2 1.5*1.8 1.8*1.8
2.00 4.00 2.70 3.24
EDIT: a more elegant option uses gsubfn and its flexible function-substitution form.
> library(gsubfn)
> sub.f <- function(x,y) as.numeric(x) * ifelse(y == 1, 2, 1.8)
> as.numeric(gsubfn('(.*)y(.*)', sub.f, s))
[1] 2.00 4.00 2.70 3.24
We split each expression into two components, x and y, and use them (once coerced to numeric) to compute the result manually.
Alternatively, as a one-liner:
> strapply(s, '(.*)y(.*)', ~ as.numeric(x) * ifelse(y == 1, 2, 1.8), simplify = TRUE)
[1] 2.00 4.00 2.70 3.24
You can try this:
s = sub(pattern="y", replacement="*y", x=s)
sub replaces the first occurrence of y with *y in each string, so the result is syntactically valid R code. However, it is not yet an expression, only a character vector. Note that sub replaces only the first occurrence per string; if there can be more than one "y" per string, use gsub instead (a short sketch follows the code below). For more information see ?sub.
y1 = 2
y2 = 1.8
s = parse(text=s) # parse will convert s to an R expression
s.num = sapply(s, eval) # and this will evaluate the expression to give the results
[1] 2.00 4.00 2.70 3.24
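To see the sub versus gsub difference mentioned above, here is a minimal sketch (the string "1y1y2" is a made-up example, not one of the question's inputs):
s_multi <- "1y1y2"        # hypothetical string containing two "y"s
sub("y", "*y", s_multi)   # "1*y1y2"  -- only the first "y" is rewritten
gsub("y", "*y", s_multi)  # "1*y1*y2" -- every "y" is rewritten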
Here is another idea:
s <- c("1y1", "2y1", "1.5y2", "1.8y2")
s2 <- as.numeric(sub("y[0-9]", "", s))
mult <- rep(NA, len = length(s))
mult[grep("y1", s)] <- 2
mult[grep("y2", s)] <- 1.8
s2 * mult
#[1] 2.00 4.00 2.70 3.24
On a larger scale:
S = sample(s, 1e4, T)
f1 = function() {
  S <- gsub('y1', '*2', S)
  S <- gsub('y2', '*1.8', S)
  sapply(S, function(x) eval(parse(text = x)))
}
y1 = 2; y2 = 1.8
f2 = function() {
  S = sub(pattern = "y", replacement = "*y", x = S)
  S = parse(text = S)
  sapply(S, eval)
}
f3 = function() {
  s2 <- as.numeric(sub("y[0-9]", "", S))
  mult <- rep(NA, length(S))
  mult[grep("y1", S)] <- 2
  mult[grep("y2", S)] <- 1.8
  s2 * mult
}
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), times = 10)
#Unit: milliseconds
# expr min lq median uq max neval
# f1() 1940.63876 2037.03500 2064.4605 2072.98875 2101.73358 10
# f2() 93.43298 98.69724 115.8716 125.07774 153.43385 10
# f3() 12.91724 13.01781 13.1805 13.30283 17.94973 10
sum(f1() != f2())
#[1] 0
sum(f2() != f3())
#[1] 0
I have a data.frame which I would like to convert to a list by rows, meaning each row would correspond to its own list element. In other words, I would like a list that is as long as the data.frame has rows.
So far, I've tackled this problem in the following manner, but I was wondering if there's a better way to approach this.
xy.df <- data.frame(x = runif(10), y = runif(10))
# pre-allocate a list and fill it with a loop
xy.list <- vector("list", nrow(xy.df))
for (i in 1:nrow(xy.df)) {
xy.list[[i]] <- xy.df[i,]
}
Like this:
xy.list <- split(xy.df, seq(nrow(xy.df)))
And if you want the rownames of xy.df to be the names of the output list, you can do:
xy.list <- setNames(split(xy.df, seq(nrow(xy.df))), rownames(xy.df))
Eureka!
xy.list <- as.list(as.data.frame(t(xy.df)))
A more modern solution uses only purrr::transpose:
library(purrr)
iris[1:2,] %>% purrr::transpose()
#> [[1]]
#> [[1]]$Sepal.Length
#> [1] 5.1
#>
#> [[1]]$Sepal.Width
#> [1] 3.5
#>
#> [[1]]$Petal.Length
#> [1] 1.4
#>
#> [[1]]$Petal.Width
#> [1] 0.2
#>
#> [[1]]$Species
#> [1] 1
#>
#>
#> [[2]]
#> [[2]]$Sepal.Length
#> [1] 4.9
#>
#> [[2]]$Sepal.Width
#> [1] 3
#>
#> [[2]]$Petal.Length
#> [1] 1.4
#>
#> [[2]]$Petal.Width
#> [1] 0.2
#>
#> [[2]]$Species
#> [1] 1
If you want to completely abuse the data.frame (as I do) and keep the $ functionality, one way is to split your data.frame into one-row data.frames gathered in a list:
> df = data.frame(x=c('a','b','c'), y=3:1)
> df
x y
1 a 3
2 b 2
3 c 1
# 'convert' into a list of data.frames
ldf = lapply(as.list(1:dim(df)[1]), function(x) df[x[1],])
> ldf
[[1]]
x y
1 a 3
[[2]]
x y
2 b 2
[[3]]
x y
3 c 1
# and the 'coolest'
> ldf[[2]]$y
[1] 2
This is not just an intellectual exercise: it 'transforms' the data.frame into a list of its rows, keeping the $ indexing, which can be useful for further use with lapply (assuming the function you pass to lapply uses that $ indexing).
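As a small illustration of that last point, a sketch of such an lapply call (the anonymous function is just an example):
# every element of ldf is a one-row data.frame, so $ still works inside the function
lapply(ldf, function(row) row$y * 2)
# returns a list containing 6, 4 and 2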
A couple more options:
With asplit
asplit(xy.df, 1)
#[[1]]
# x y
#0.1137 0.6936
#[[2]]
# x y
#0.6223 0.5450
#[[3]]
# x y
#0.6093 0.2827
#....
With split and row
split(xy.df, row(xy.df)[, 1])
#$`1`
# x y
#1 0.1137 0.6936
#$`2`
# x y
#2 0.6223 0.545
#$`3`
# x y
#3 0.6093 0.2827
#....
data
set.seed(1234)
xy.df <- data.frame(x = runif(10), y = runif(10))
I was working on this today for a data.frame (really a data.table) with millions of observations and 35 columns. My goal was to return a list of data.frames (data.tables) each with a single row. That is, I wanted to split each row into a separate data.frame and store these in a list.
Here are two methods I came up with that were roughly 3 times faster than split(dat, seq_len(nrow(dat))) for that data set. Below, I benchmark these methods (plus a fourth, discussed afterwards) on a 7500-row, 5-column data set (iris repeated 50 times).
library(data.table)
library(microbenchmark)
microbenchmark(
split={dat1 <- split(dat, seq_len(nrow(dat)))},
setDF={dat2 <- lapply(seq_len(nrow(dat)),
function(i) setDF(lapply(dat, "[", i)))},
attrDT={dat3 <- lapply(seq_len(nrow(dat)),
function(i) {
tmp <- lapply(dat, "[", i)
attr(tmp, "class") <- c("data.table", "data.frame")
setDF(tmp)
})},
datList = {datL <- lapply(seq_len(nrow(dat)),
function(i) lapply(dat, "[", i))},
times=20
)
This returns
Unit: milliseconds
expr min lq mean median uq max neval
split 861.8126 889.1849 973.5294 943.2288 1041.7206 1250.6150 20
setDF 459.0577 466.3432 511.2656 482.1943 500.6958 750.6635 20
attrDT 399.1999 409.6316 461.6454 422.5436 490.5620 717.6355 20
datList 192.1175 201.9896 241.4726 208.4535 246.4299 411.2097 20
While the differences are not as large as in my previous test, the straight setDF method is significantly faster at all levels of the distribution of runs, with max(setDF) < min(split), and the attrDT method is typically more than twice as fast as split.
The fourth method is the extreme champion: a simple nested lapply returning a nested list. It exemplifies the cost of constructing a data.frame from a list. Moreover, all methods I tried with the data.frame function were roughly an order of magnitude slower than the data.table techniques.
data
dat <- vector("list", 50)
for(i in 1:50) dat[[i]] <- iris
dat <- setDF(rbindlist(dat))
It seems a current version of the purrr package (0.2.2) offers the fastest solution:
by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
Let's compare the most interesting solutions:
data("Batting", package = "Lahman")
x <- Batting[1:10000, 1:10]
library(benchr)
library(purrr)
benchmark(
split = split(x, seq_len(.row_names_info(x, 2L))),
mapply = .mapply(function(...) structure(list(...), class = "data.frame", row.names = 1L), x, NULL),
purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out
)
Results:
Benchmark summary:
Time units : milliseconds
expr n.eval min lw.qu median mean up.qu max total relative
split 100 983.0 1060.0 1130.0 1130.0 1180.0 1450 113000 34.3
mapply 100 826.0 894.0 963.0 972.0 1030.0 1320 97200 29.3
purrr 100 24.1 28.6 32.9 44.9 40.5 183 4490 1.0
Also we can get the same result with Rcpp:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List df2list(const DataFrame& x) {
  std::size_t nrows = x.rows();
  std::size_t ncols = x.cols();
  CharacterVector nms = x.names();
  List res(no_init(nrows));
  for (std::size_t i = 0; i < nrows; ++i) {
    List tmp(no_init(ncols));
    for (std::size_t j = 0; j < ncols; ++j) {
      switch(TYPEOF(x[j])) {
        case INTSXP: {
          if (Rf_isFactor(x[j])) {
            IntegerVector t = as<IntegerVector>(x[j]);
            RObject t2 = wrap(t[i]);
            t2.attr("class") = "factor";
            t2.attr("levels") = t.attr("levels");
            tmp[j] = t2;
          } else {
            tmp[j] = as<IntegerVector>(x[j])[i];
          }
          break;
        }
        case LGLSXP: {
          tmp[j] = as<LogicalVector>(x[j])[i];
          break;
        }
        case CPLXSXP: {
          tmp[j] = as<ComplexVector>(x[j])[i];
          break;
        }
        case REALSXP: {
          tmp[j] = as<NumericVector>(x[j])[i];
          break;
        }
        case STRSXP: {
          tmp[j] = as<std::string>(as<CharacterVector>(x[j])[i]);
          break;
        }
        default: stop("Unsupported type '%s'.", type2name(x));
      }
    }
    tmp.attr("class") = "data.frame";
    tmp.attr("row.names") = 1;
    tmp.attr("names") = nms;
    res[i] = tmp;
  }
  res.attr("names") = x.attr("row.names");
  return res;
}
Now compare with purrr:
benchmark(
purrr = by_row(x, function(v) list(v)[[1L]], .collate = "list")$.out,
rcpp = df2list(x)
)
Results:
Benchmark summary:
Time units : milliseconds
expr n.eval min lw.qu median mean up.qu max total relative
purrr 100 25.2 29.8 37.5 43.4 44.2 159.0 4340 1.1
rcpp 100 19.0 27.9 34.3 35.8 37.2 93.8 3580 1.0
An alternative way is to convert the df to a matrix and then apply lapply over it: ldf <- lapply(as.matrix(myDF), function(x) x)
The best way for me was:
Example data:
Var1 <- c("X1","X4","X7")
Var2 <- c("X2","X5","X8")
Var3 <- c("X3","X6","X9")
data <- data.frame(ID = 1:3, Var1, Var2, Var3)
ID Var1 Var2 Var3
1 X1 X2 X3
2 X4 X5 X6
3 X7 X8 X9
Then we load the BBmisc library:
library(BBmisc)
data$lists<-convertRowsToList(data[,2:4])
And the result will be:
ID Var1 Var2 Var3                  lists
 1   X1   X2   X3 list("X1", "X2", "X3")
 2   X4   X5   X6 list("X4", "X5", "X6")
 3   X7   X8   X9 list("X7", "X8", "X9")
Like #flodel wrote:
This converts your dataframe into a list that has the same number of elements as the number of rows in the dataframe:
NewList <- split(df, f = seq(nrow(df)))
You can additionally add a function to select only those columns that are not NA in each element of the list:
NewList2 <- lapply(NewList, function(x) x[,!is.na(x)])
Another alternative using library(purrr) (that seems to be a bit quicker on large data.frames)
flatten(by_row(xy.df, ..f = function(x) flatten_chr(x), .labels = FALSE))
The by_row function from the purrrlyr package will do this for you.
This example demonstrates how:
myfn <- function(row) {
#row is a tibble with one row, and the same number of columns as the original df
l <- as.list(row)
return(l)
}
list_of_lists <- purrrlyr::by_row(df, myfn, .labels=FALSE)$.out
By default, the returned value from myfn is put into a new list column in the df called .out. The $.out at the end of the above statement immediately selects this column, returning a list of lists.
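For instance, applied to the xy.df from the original question, a sketch of the call and of how to inspect the result:
xy.df <- data.frame(x = runif(10), y = runif(10))
list_of_lists <- purrrlyr::by_row(xy.df, myfn, .labels = FALSE)$.out
length(list_of_lists)    # 10, one element per row
str(list_of_lists[[1]])  # a named list holding the x and y values of row 1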
I have a vector of values (x).
I would like to determine the length of its overlap with each of the sets sitting in a list (y) - but without running a loop or lapply. Is it possible?
I am really interested in accelerating the execution.
Thank you very much!
Below is an example with an implementation using a loop:
x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)
overlaps <- rep(0, length(y))
for (i in seq(length(y))) { #i=1
# overlaps[i] <- length(intersect(x, y[[i]])) # it is slower than %in%
overlaps[i] <- sum(x %in% y[[i]])
}
overlaps
And below is a comparison of some of the methods suggested in the answers. As you can see, the loop is still the fastest, but I'd love to find something faster:
# Function with the loop:
myloop <- function(x, y) {
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) overlaps[i] <- sum(x %in% y[[i]])
  overlaps
}
# Function with sapply:
mysapply <- function(x, y) sapply(y, function(e) sum(e %in% x))
# Function with map_dbl:
library(purrr)
mymap <- function(x, y) {
  map_dbl(y, ~sum(. %in% x))
}
library(microbenchmark)
microbenchmark(myloop(x, y), mysapply(x, y), mymap(x, y), times = 30000)
# Unit: microseconds
# expr min lq mean median uq max neval
# myloop(x, y) 17.2 19.4 26.64801 21.2 22.6 9348.6 30000
# mysapply(x, y) 27.1 29.5 39.19692 31.0 32.9 20176.2 30000
# mymap(x, y) 59.8 64.1 88.40618 66.0 70.5 114776.7 30000
Use sapply for code compactness.
Even if sapply doesn't bring much performance benefit compared to a for loop, at least the code is far more compact. This is the sapply equivalent of your code:
x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)
res <- sapply(y, function(e) length(intersect(e, x)))
> res
[1] 5 4 3 2 1 0
Performance gains
As correctly stated by #StupidWolf, it's not sapply that is slowing down the execution, but rather length and intersect. Here is my test with 100,000 executions:
B <- 100000
system.time(replicate(B, sapply(y, function(e) length(intersect(e, x)))))
user system elapsed
9.79 0.01 9.79
system.time(replicate(B, sapply(y, function(e) sum(e %in% x))))
user system elapsed
2 0 2
# Using microbenchmark for more precise results:
library(microbenchmark)
microbenchmark(expr1 = sapply(y, function(e) length(intersect(e, x))), times = B)
expr min lq mean median uq max neval
expr1 81.4 84.9 91.87689 86.5 88.2 7368.7 1e+05
microbenchmark(expr2 = sapply(y, function(e) sum(e %in% x)), times = B)
expr min lq mean median uq max neval
expr2 15.4 16.1 17.68144 16.4 17 7567.9 1e+05
As we can see, the second approach is by far the performance winner.
Hope this helps.
You can use map from purrr: it goes through every element of the list y and applies a function. Below I use map_dbl, which returns a numeric vector.
library(purrr)
map_dbl(y, ~sum(. %in% x))
[1] 5 4 3 2 1 0
To see the time:
f1 = function(){
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  map_dbl(y, ~sum(. %in% x))
}
f2 = function(){
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) {
    overlaps[i] <- length(intersect(x, y[[i]]))
  }
  overlaps
}
f3 = function(){
  x <- c(1:5)
  y <- lapply(1:5, function(i) sample(1:10, 5, replace = TRUE))
  sapply(y, function(i) sum(i %in% x))
}
Let's put it to the test:
system.time(replicate(10000,f1()))
user system elapsed
1.27 0.02 1.35
system.time(replicate(10000,f2()))
user system elapsed
1.72 0.00 1.72
system.time(replicate(10000,f3()))
user system elapsed
0.97 0.00 0.97
So if you want speed, use something like sapply + %in%; if you want something easily readable, use purrr.
Here is an option using data.table which should be fast if you have a long list of vectors in y.
library(data.table)
DT <- data.table(ID=rep(seq_along(y), lengths(y)), Y=unlist(y))
DT[.(Y=x), on=.(Y)][, .N, ID]
In addition, if you need to run this for multiple x, I would suggest creating a data.table that combines all of the x before running the code (see the sketch after the output below).
output:
ID N
1: 1 5
2: 2 4
3: 3 3
4: 4 2
5: 5 1
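A sketch of that multiple-x idea (the second query vector x2 below is hypothetical, just to illustrate the shape of the join):
library(data.table)
xs <- list(x1 = 1:5, x2 = 3:9)  # hypothetical set of query vectors
DT <- data.table(ID  = rep(seq_along(y),  lengths(y)),  Y = unlist(y))
QX <- data.table(QID = rep(seq_along(xs), lengths(xs)), Y = unlist(xs))
# a single join counts, for every query QID, its overlap with every set ID
DT[QX, on = .(Y), nomatch = 0L, allow.cartesian = TRUE][, .N, by = .(QID, ID)]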
R has a great function, match, to find values in vectors (and %in% to test membership). But what if I want to find a short vector inside a big vector? That is, to test whether a given vector is contained (in order!) in another vector? What if I want to find whether a given vector is a prefix/suffix of another vector? Are there such functions in R?
Example of what I would like:
x=c(1,3,4)
y=c(4,1,3,4,5)
z=c(3,1)
v_contains(x,y) # return TRUE x is contained in y
v_contains(z,y) # FALSE the values of z are in y, but not in the right order
v_match(x,y) # returns 2 because x appears in y starting at position 2
is there anything like it? how would you approach it efficiently?
A recent post uncovered this solution by Jonathan Carroll. I doubt a faster solution exists in R.
v_match <- function(needle, haystack, nomatch = 0L) {
  sieved <- which(haystack == needle[1L])
  for (i in seq.int(1L, length(needle) - 1L)) {
    sieved <- sieved[haystack[sieved + i] == needle[i + 1L]]
  }
  sieved
}
v_contains <- function(needle, haystack) {
  sieved <- which(haystack == needle[1L])
  for (i in seq.int(1L, length(needle) - 1L)) {
    sieved <- sieved[haystack[sieved + i] == needle[i + 1L]]
  }
  length(sieved) && !anyNA(sieved)
}
Tests and benchmarks:
library(testthat)
x=c(1,3,4)
y=c(4,1,3,4,5)
z=c(3,1)
expect_true(v_contains(x,y)) # return TRUE x is contained in y
expect_false(v_contains(z,y)) # FALSE the values of z are in y, but not in order
expect_equal(v_match(x,y), 2) # returns 2 because x appears in y starting at position 2
x <- c(5, 1, 3)
yes <- c(sample(5:1e6), c(5, 1, 3))
no <- c(sample(5:1e6), c(4, 1, 3))
expect_true(v_contains(x, yes))
expect_false(v_contains(x, no))
expect_equal(v_match(x, yes), 1e6 - 3)
v_contains_roll <- function(x, y) {
any(zoo::rollapply(y, length(x), identical, x))
}
v_contains_stri <- function(x, y) {
stringr::str_detect(paste(y, collapse = "_"),
paste(x, collapse = "_"))
}
options(digits = 2)
options(scipen = 99)
library(microbenchmark)
gc(0, 1, 1)
#> used (Mb) gc trigger (Mb) max used (Mb)
#> Ncells 527502 28 1180915 63 527502 28
#> Vcells 3010073 23 8388608 64 3010073 23
microbenchmark(v_contains(x, yes),
v_contains(x, no),
v_contains_stri(x, yes),
v_contains_stri(x, no),
v_contains_roll(x, yes),
v_contains_roll(x, no),
times = 2L,
control = list(order = "block"))
#> Unit: milliseconds
#>                     expr    min     lq   mean median     uq    max neval cld
#>       v_contains(x, yes)    3.8    3.8    3.8    3.8    3.9    3.9     2  a
#>        v_contains(x, no)    3.7    3.7    3.7    3.7    3.8    3.8     2  a
#>  v_contains_stri(x, yes) 1658.4 1658.4 1676.7 1676.7 1695.0 1695.0     2  b
#>   v_contains_stri(x, no) 1632.3 1632.3 1770.0 1770.0 1907.8 1907.8     2  b
#>  v_contains_roll(x, yes) 5447.4 5447.4 5666.1 5666.1 5884.7 5884.7     2  c
#>   v_contains_roll(x, no) 5458.8 5458.8 5521.7 5521.7 5584.6 5584.6     2  c
Created on 2018-08-18 by the reprex package (v0.2.0).
x=c(1,3,4)
y=c(4,1,3,4,5)
z=c(3,1)
# 1. return TRUE x is contained in y
stringr::str_detect(paste(y, collapse = "_"), paste(x, collapse = "_"))
# 2. FALSE the values of z are in y, but not in the right order
all(z %in% y) & stringr::str_detect(paste(y, collapse = "_"), paste(z, collapse = "_"))
# 3. returns 2 because x appears in y starting at position 2
stringr::str_locate(paste(y, collapse = "_"), paste(x, collapse = "_"))[1] - 1
If x and y are as in the question then here are some alternatives.
1) rollapply This checks if x is contained in y.
library(zoo)
any(rollapply(y, length(x), identical, x))
## [1] TRUE
2) embed Slightly more complex but still one line and no package dependencies.
any(apply(t(embed(y, length(x))) == rev(x), 2, all))
## [1] TRUE
2a) or this variation:
any(apply(embed(y, length(x)), 1, identical, rev(x)))
## [1] TRUE
3) strings Convert both x and y into character strings and use grepl. The comments to the question already point to code for such an approach.
4) Rcpp If speed is important we can code it in C++; the C++ standard library already provides this via std::search. Place this in a file Search.cpp in the current directory and from within R run library(Rcpp); sourceCpp("Search.cpp"). Then the R call Search(x, y) will invoke it.
#include <Rcpp.h>
using namespace Rcpp;
using namespace std;
// [[Rcpp::export]]
bool Search(NumericVector x, NumericVector y) {
return search(begin(y), end(y), begin(x), end(x)) != end(y);
}
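A quick usage sketch, assuming the file name Search.cpp described above:
library(Rcpp)
sourceCpp("Search.cpp")               # compiles and exports Search() into R
Search(c(1, 3, 4), c(4, 1, 3, 4, 5))  # TRUE:  x occurs in y, in order
Search(c(3, 1),    c(4, 1, 3, 4, 5))  # FALSE: values present but not as a contiguous run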
library(data.table)
foo <- data.table(x  = 1:5/sum(1:5),
                  y  = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))
Suppose I have the foo data table (as specified above):
x y z1 z2
1: 0.06666667 0.4 0.10 0.10
2: 0.13333333 0.3 0.15 0.15
3: 0.20000000 0.2 0.20 0.20
4: 0.26666667 0.1 0.25 0.25
5: 0.33333333 0.0 0.30 0.30
How can I efficiently count the unique columns? In this case there are only 3.
Please assume that in general:
foo is always a data.table and not a matrix, though the columns are always numeric.
foo in reality is big, nrow > 20k and ncol > 100
Is it possible to do this without making extra copies of the data?
My current approach is to apply over columns with paste to get a single value for each column and then do length(unique(.)) on the result...
Is there any magic with data.table::transpose(), data.table::uniqueN, and maybe some other friends?
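For reference, the baseline approach described in the question might look roughly like this (a sketch; the collapse separator is arbitrary):
# collapse each column into a single string, then count the distinct strings
length(unique(sapply(foo, paste, collapse = "\r")))
# [1] 3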
Another possibility:
length(unique(as.list(foo)))
Which gives the expected result:
> length(unique(as.list(foo)))
[1] 3
NOTE: the use of length(unique()) is necessary as uniqueN() will return an error.
Per the comment of #Ryan, you can also do:
length(unique.default(foo))
With regard to speed, both methods are comparable (when measured on a larger dataset of 5M rows):
> fooLarge <- foo[rep(1:nrow(foo),1e6)]
> microbenchmark(length(unique.default(fooLarge)), length(unique(as.list(fooLarge))))
Unit: milliseconds
expr min lq mean median uq max neval cld
length(unique.default(fooLarge)) 94.0433 94.56920 95.24076 95.01492 95.67131 103.15433 100 a
length(unique(as.list(fooLarge))) 94.0254 94.68187 95.17648 95.02672 95.49857 99.19411 100 a
If you want to retain only the unique columns, you could use:
# option 1
cols <- !duplicated(as.list(foo))
foo[, ..cols]
# option 2 (doesn't retain the column names)
as.data.table(unique.default(foo))
which gives (output option 1 shown):
x y z1
1: 0.06666667 0.4 0.10
2: 0.13333333 0.3 0.15
3: 0.20000000 0.2 0.20
4: 0.26666667 0.1 0.25
5: 0.33333333 0.0 0.30
Transpose and check for non-duplicates:
ncol( foo[ , which( !duplicated( t( foo ) ) ), with = FALSE ])
3
Another method which may be faster if you expect a large number of duplicates:
n_unique_cols <- function(foo) {
  K <- seq_along(foo)
  for (j in seq_along(foo)) {
    if (j %in% K) {
      foo_j <- .subset2(foo, j)
      for (k in K) {
        if (j < k) {
          foo_k <- .subset2(foo, k)
          if (foo_j[1] == foo_k[1] && identical(foo_j, foo_k)) {
            K <- K[K != k]
          }
          rm(foo_k)
        }
      }
    }
  }
  length(K)
}
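A quick sanity check on the question's small foo (a sketch):
n_unique_cols(foo)
# [1] 3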
Timings:
library(data.table)
create_foo <- function(row, col) {
  foo <- data.table(x = rnorm(row),
                    y = seq_len(row) - 2L)
  set.seed(1)
  for (k in seq_len(col %/% 2L)) {
    foo[, (paste0('x', k)) := x + sample(-4:4, size = 1)]
    foo[, (paste0('y', k)) := y + sample(-2:2, size = 1)]
  }
  foo
}
library(bench)
res <-
press(rows = c(1e5, 1e6, 1e7),
cols = c(10, 50, 100),
{
foorc <- create_foo(rows, cols)
bench::mark(n_unique_cols(foorc),
length(unique(as.list(foorc))))
})
plot(res)
For this family of data, this function is twice as fast, but its memory consumption grows faster than unique(as.list(.)).
I've found polynomial coefficients from my data:
R <- c(0.256,0.512,0.768,1.024,1.28,1.437,1.594,1.72,1.846,1.972,2.098,2.4029)
Ic <- c(1.78,1.71,1.57,1.44,1.25,1.02,0.87,0.68,0.54,0.38,0.26,0.17)
NN <- 3
ft <- lm(Ic ~ poly(R, NN, raw = TRUE))
pc <- coef(ft)
So I can create a polynomial function:
f1 <- function(x) pc[1] + pc[2] * x + pc[3] * x ^ 2 + pc[4] * x ^ 3
And for example, take a derivative:
library(Deriv)
g1 <- Deriv(f1)
How can I create a universal function so that it doesn't have to be rewritten for every new polynomial degree NN?
My original answer may not be what you really want, as it was numerical rather than symbolic. Here is the symbolic solution.
## use `"x"` as variable name
## taking polynomial coefficient vector `pc`
## can return a string, or an expression by further parsing (mandatory for `D`)
f <- function (pc, expr = TRUE) {
  stringexpr <- paste("x", seq_along(pc) - 1, sep = " ^ ")
  stringexpr <- paste(stringexpr, pc, sep = " * ")
  stringexpr <- paste(stringexpr, collapse = " + ")
  if (expr) return(parse(text = stringexpr))
  else return(stringexpr)
}
## an example cubic polynomial with coefficients 0.1, 0.2, 0.3, 0.4
cubic <- f(pc = 1:4 / 10, TRUE)
## using R base's `D` (requiring expression)
dcubic <- D(cubic, name = "x")
# 0.2 + 2 * x * 0.3 + 3 * x^2 * 0.4
## using `Deriv::Deriv`
library(Deriv)
dcubic <- Deriv(cubic, x = "x", nderiv = 1L)
# expression(0.2 + x * (0.6 + 1.2 * x))
Deriv(f(1:4 / 10, FALSE), x = "x", nderiv = 1L) ## use string, get string
# [1] "0.2 + x * (0.6 + 1.2 * x)"
Of course, Deriv makes higher-order derivatives easier to get: we can simply set nderiv. For D, however, we have to use recursion (see the examples in ?D, and the sketch after the next code block).
Deriv(cubic, x = "x", nderiv = 2L)
# expression(0.6 + 2.4 * x)
Deriv(cubic, x = "x", nderiv = 3L)
# expression(2.4)
Deriv(cubic, x = "x", nderiv = 4L)
# expression(0)
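For completeness, the recursion route with base R's D mentioned above could be wrapped like this (a sketch along the lines of the example in ?D):
## D only takes one derivative at a time, so recurse for higher orders
DD <- function(expr, name = "x", order = 1L) {
  if (order == 1L) D(expr, name)
  else DD(D(expr, name), name, order - 1L)
}
DD(cubic, "x", 2L)  # algebraically the same as 0.6 + 2.4 * x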
If we use expression, we will be able to evaluate the result later. For example,
eval(cubic, envir = list(x = 1:4)) ## cubic polynomial
# [1] 1.0 4.9 14.2 31.3
eval(dcubic, envir = list(x = 1:4)) ## its first derivative
# [1] 2.0 6.2 12.8 21.8
The above implies that we can wrap an expression up in a function. Using a function has several advantages, one being that we can plot it using curve or plot.function.
fun <- function(x, expr) eval.parent(expr, n = 0L)
Note, the success of fun requires expr to be an expression in terms of the symbol x. If expr were defined in terms of y, for example, we would need to define fun as function (y, expr). Now let's use curve to plot cubic and dcubic over the range 0 < x < 5:
curve(fun(x, cubic), from = 0, to = 5) ## colour "black"
curve(fun(x, dcubic), add = TRUE, col = 2) ## colour "red"
The most convenient way is of course to define a single function FUN rather than using the f + fun combination. This way, we also don't need to worry about keeping the variable name consistent between f and fun.
FUN <- function (x, pc, nderiv = 0L) {
  ## check missing arguments
  if (missing(x) || missing(pc)) stop ("arguments missing with no default!")
  ## expression of polynomial
  stringexpr <- paste("x", seq_along(pc) - 1, sep = " ^ ")
  stringexpr <- paste(stringexpr, pc, sep = " * ")
  stringexpr <- paste(stringexpr, collapse = " + ")
  expr <- parse(text = stringexpr)
  ## taking derivatives
  dexpr <- Deriv::Deriv(expr, x = "x", nderiv = nderiv)
  ## evaluation
  val <- eval.parent(dexpr, n = 0L)
  ## note, if we take too many derivatives so that `dexpr` becomes constant,
  ## `val` is free of `x` so it will only be of length 1;
  ## we need to repeat this constant to match `length(x)`
  if (length(val) == 1L) val <- rep.int(val, length(x))
  ## now we return
  val
}
Suppose we want to evaluate a cubic polynomial with coefficients pc <- c(0.1, 0.2, 0.3, 0.4) and its derivatives on x <- seq(0, 1, 0.2); we can simply do:
FUN(x, pc)
# [1] 0.1000 0.1552 0.2536 0.4144 0.6568 1.0000
FUN(x, pc, nderiv = 1L)
# [1] 0.200 0.368 0.632 0.992 1.448 2.000
FUN(x, pc, nderiv = 2L)
# [1] 0.60 1.08 1.56 2.04 2.52 3.00
FUN(x, pc, nderiv = 3L)
# [1] 2.4 2.4 2.4 2.4 2.4 2.4
FUN(x, pc, nderiv = 4L)
# [1] 0 0 0 0 0 0
Now plotting is also easy:
curve(FUN(x, pc), from = 0, to = 5)
curve(FUN(x, pc, 1), from = 0, to = 5, add = TRUE, col = 2)
curve(FUN(x, pc, 2), from = 0, to = 5, add = TRUE, col = 3)
curve(FUN(x, pc, 3), from = 0, to = 5, add = TRUE, col = 4)
Since my symbolic-derivatives solution above eventually gets rather long, here is a separate, purely numerical approach. We can do this because, for polynomials, derivatives are explicitly known, so we can code them directly. Note, there is no use of R expressions here; everything is done directly with functions.
So we first generate the polynomial basis from degree 0 to degree p - n, then multiply by the coefficients and their factorial multipliers. It is more convenient to use outer than poly here.
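To see where the factorial multiplier comes from: for $f(x) = \sum_{j=0}^{p} \beta_j x^j$, the $n$-th derivative is $f^{(n)}(x) = \sum_{j=n}^{p} \beta_j \frac{j!}{(j-n)!} x^{j-n}$, which is exactly the basis-times-adjusted-coefficient product computed below.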
## use `outer`
g <- function (x, pc, nderiv = 0L) {
  ## check missing arguments
  if (missing(x) || missing(pc)) stop ("arguments missing with no default!")
  ## polynomial order p
  p <- length(pc) - 1L
  ## number of derivatives
  n <- nderiv
  ## early return?
  if (n > p) return(rep.int(0, length(x)))
  ## polynomial basis from degree 0 to degree `(p - n)`
  X <- outer(x, 0:(p - n), FUN = "^")
  ## initial coefficients
  ## the additional `+ 1L` is because R vectors start from index 1, not 0
  beta <- pc[n:p + 1L]
  ## factorial multiplier
  beta <- beta * factorial(n:p) / factorial(0:(p - n))
  ## matrix-vector multiplication
  drop(X %*% beta)
}
We still use the example x and pc defined in the symbolic solution:
x <- seq(0, 1, by = 0.2)
pc <- 1:4 / 10
g(x, pc, 0)
# [1] 0.1000 0.1552 0.2536 0.4144 0.6568 1.0000
g(x, pc, 1)
# [1] 0.200 0.368 0.632 0.992 1.448 2.000
g(x, pc, 2)
# [1] 0.60 1.08 1.56 2.04 2.52 3.00
g(x, pc, 3)
# [1] 2.4 2.4 2.4 2.4 2.4 2.4
g(x, pc, 4)
# [1] 0 0 0 0 0 0
The result is consistent with what we have with FUN in the symbolic solution.
Similarly, we can plot g using curve:
curve(g(x, pc), from = 0, to = 5)
curve(g(x, pc, 1), from = 0, to = 5, col = 2, add = TRUE)
curve(g(x, pc, 2), from = 0, to = 5, col = 3, add = TRUE)
curve(g(x, pc, 3), from = 0, to = 5, col = 4, add = TRUE)
Now, after quite a lot of effort demonstrating how we can work this out ourselves, consider using the R package polynom. As a small package, it implements construction, derivatives, integration, arithmetic, and root finding for univariate polynomials. This package is written entirely in R, without any compiled code.
## install.packages("polynom")
library(polynom)
We still consider the cubic polynomial example used before.
pc <- 1:4 / 10
## step 1: making a "polynomial" object as preparation
pcpoly <- polynomial(pc)
#0.1 + 0.2*x + 0.3*x^2 + 0.4*x^3
## step 2: compute derivative
expr <- deriv(pcpoly)
## step 3: convert to function
g1 <- as.function(expr)
#function (x)
#{
# w <- 0
# w <- 1.2 + x * w
# w <- 0.6 + x * w
# w <- 0.2 + x * w
# w
#}
#<environment: 0x9f4867c>
Note, by this step-by-step construction, the resulting function has all parameters baked in. It only requires a single argument for the x value. In contrast, the functions in the other two answers also take coefficients and derivative order as mandatory arguments. We can call this function:
g1(seq(0, 1, 0.2))
# [1] 0.200 0.368 0.632 0.992 1.448 2.000
To produce the same graph we see in other two answers, we get other derivatives as well:
g0 <- as.function(pcpoly) ## original polynomial
## second derivative
expr <- deriv(expr)
g2 <- as.function(expr)
#function (x)
#{
# w <- 0
# w <- 2.4 + x * w
# w <- 0.6 + x * w
# w
#}
#<environment: 0x9f07c68>
## third derivative
expr <- deriv(expr)
g3 <- as.function(expr)
#function (x)
#{
# w <- 0
# w <- 2.4 + x * w
# w
#}
#<environment: 0x9efd740>
Perhaps you have already noticed that I did not specify nderiv, but instead took one derivative at a time, recursively. This may be a disadvantage of this package: it does not facilitate higher-order derivatives directly.
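If the repeated calls get tedious, a tiny wrapper can hide the recursion (a sketch; deriv_n is a made-up helper name):
## k-th derivative of a polynom "polynomial" object by calling deriv() k times
deriv_n <- function(p, k = 1L) {
  for (i in seq_len(k)) p <- deriv(p)
  p
}
as.function(deriv_n(pcpoly, 2))(seq(0, 1, 0.2))
# [1] 0.60 1.08 1.56 2.04 2.52 3.00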
Now we can make a plot
## As mentioned, `g0` to `g3` are parameter-free
curve(g0(x), from = 0, to = 5)
curve(g1(x), add = TRUE, col = 2)
curve(g2(x), add = TRUE, col = 3)
curve(g3(x), add = TRUE, col = 4)