I have a dataset that needs to be capped/trimmed etc. based on values from another dataset. Both datasets have the same structure (column names, etc.).
What is a quick way to apply the transformations stored in another dataset to the current dataset?
Sample data:
#generate sample data & set some values to NA
#this is the dataset that has variables that need to be trimmed
x1 <- data.frame(a=rep(11:20), b=rep(41:50))
x1[2,1] <- NA
x1
#a vector containing values to trim to (in this case, say 75th percentile)
y1 <- apply(x1, 2, function(x) quantile(x, 0.75, na.rm=T))
y1
#I am doing this inside a loop
for (i in 1:ncol(x1)){
x1[is.na(x1[[i]]), i] <- y1[i] #if missing, set to some value
x1[x1[[i]] > y1[i], i] <- y1[i] #if larger than 75th pctl, set to some value
}
x1
I am pretty sure there is a faster vectorized way to do this. I'd greatly appreciate any inputs.
One option: write your logic as a function that takes a vector and a value:
myfun <- function(x, y) {
x[is.na(x)] <- y
x[x > y] <- y
return (x)
}
Then use mapply which will treat x1 as a list of columns (which it sort of is):
mapply(myfun, x1, y1)
And you can coerce it back to a data.frame by wrapping it:
data.frame(mapply(myfun, x1, y1))
You could also pass SIMPLIFY=FALSE if you want mapply to return a list of columns instead of a matrix.
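For example, a minimal sketch of the same call with SIMPLIFY=FALSE (res_list is just an illustrative name); as.data.frame rebuilds the data frame from the list:
# Keep the result as a list of columns rather than a matrix
res_list <- mapply(myfun, x1, y1, SIMPLIFY = FALSE)
as.data.frame(res_list)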
As per the comments, Map is a better choice here since it avoids some typing and probably some overhead:
as.data.frame(Map(myfun, x1, y1))
Here is another option using the data.table package. data.table is very fast and has a powerful syntax, but the disadvantage is having to learn that syntax.
library(data.table)
x1 <- data.frame(a=rep(11:20), b=rep(41:50))
x1[2,1] <- NA
# Convert data.frame to data.table.
DT <- data.table(x1)
# Put your desired operations into a function, for clarity/tidiness.
update_vals <- function(x, prob=0.75) {
xcut <- quantile(x, probs=prob, na.rm=TRUE)
x[is.na(x) | x > xcut] <- xcut
return(x)
}
# Use lapply and data.table syntax to 'loop' over columns.
DT2 = DT[, lapply(.SD, update_vals)]
DT2
# a b
# 1: 11 41.00
# 2: 18 42.00
# 3: 13 43.00
# 4: 14 44.00
# 5: 15 45.00
# 6: 16 46.00
# 7: 17 47.00
# 8: 18 47.75
# 9: 18 47.75
# 10: 18 47.75
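If you would rather update DT in place than create a new DT2, a minimal sketch using data.table's := assignment (same update_vals function as above):
# Update every column by reference instead of copying into a new table
DT[, (names(DT)) := lapply(.SD, update_vals)]
DT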
I would like to write a function which takes a list of variables out of a dataframe, say:
df <- data.frame(a = c(1,2,3,4,5), b = c(6,7,8,9,10))
And always perform the same calculation, say computing the standard deviation, like:
test.function <- function(var){
for (i in var) {
paste0(i, "_per_sd") <- i / sd(i)
}
}
The goal is to create a new variable a_per_sd, which is a divided by its standard deviation. Unfortunately, I am stuck and get the error Error in paste0(i, "_per_sd") <- i/sd(i) : could not find function "paste0<-".
The expected usage should be:
test.function(df$a, df$b)
The expected result should be:
> df$a_per_sd
[1] 0.6324555 1.2649111 1.8973666 2.5298221 3.1622777
And likewise for every other variable that was given.
Somehow I think I should use as.formula and/or eval, but I may be thinking about this the wrong way.
Thank you very much for your attention and help.
Is this what you are after?
df <- data.frame(a = c(1,2,3,4,5), b = c(6,7,8,9,10))
test.function <- function(...){
x <- list(...)
xn <- paste0(unlist(eval(substitute(alist(...)))),
"_per_sd")
setNames(lapply(x, function(y) y/sd(y)), xn)
}
cbind(df, test.function(df$a, df$b))
#> a b df$a_per_sd df$b_per_sd
#> 1 1 6 0.6324555 3.794733
#> 2 2 7 1.2649111 4.427189
#> 3 3 8 1.8973666 5.059644
#> 4 4 9 2.5298221 5.692100
#> 5 5 10 3.1622777 6.324555
Created on 2020-07-23 by the reprex package (v0.3.0)
The question is not completely clear to me, but you can get the sd of rows, columns, or a list of vectors with these approaches:
apply(as.matrix(df), MARGIN = 1, FUN = sd) #across rows
#[1] 3.535534 3.535534 3.535534 3.535534 3.535534
apply(as.matrix(df), MARGIN = 2, FUN = sd) #across columns
# a b
#1.581139 1.581139
lapply(df, sd) #if you provide list of vectors (columns of `df` in this case)
#$a
#[1] 1.581139
#
#$b
#[1] 1.581139
I got this far. Is this what you are looking for?
test.function <- function(var)
{
newvar = paste(var, "_per_sd")
assign(newvar, var/sd(var))
get(newvar)
}
Input:
test.function(df$a)
Result:
[1] 0.6324555 1.2649111 1.8973666 2.5298221 3.1622777
I got the idea from here: Assignment using get() and paste()
In the end, this is what my code looks like:
test.function <- function(...){
x <- list(...)
xn <- paste0(unlist(eval(substitute(alist(...)))),
"_per_sd")
setNames(lapply(x, function(y) y/sd(y, na.rm = TRUE)), xn)
}
test.function.wrap <- function(..., dataframe) {
assign(deparse(substitute(dataframe)), cbind(dataframe, test.function(...)) , envir=.GlobalEnv)
}
test.function.wrap(df$a, df$b , dataframe = df)
To be able to assign the new variables to the existing dataframe, I put the (absolutely genius) tips together and wrapped the function in another function to do the trick. I am aware it might not be the most elegant solution, but it does the job!
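For comparison, a base-R sketch that adds the new columns without assigning into the global environment; add_per_sd is just an illustrative helper name, and the columns are passed by name:
add_per_sd <- function(df, cols) {
  # Divide each requested column by its standard deviation and append as new columns
  df[paste0(cols, "_per_sd")] <- lapply(df[cols], function(y) y / sd(y, na.rm = TRUE))
  df
}
df <- add_per_sd(df, c("a", "b"))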
Hopefully I get this right this time around; I posted previously (although that was years ago) and I remember that I didn't have a good example or much detail in my question.
So I'm using the quakes dataset in R for this example, which will hopefully make it easier to follow.
I have a function myfunc:
myfunc <- function(x,y){
z <- (x - y)^2
return(z)
}
What I'm trying to do is apply this function to every pair of rows in the quakes dataset. For example, using the head of the dataset:
> library(datasets)
> data(quakes)
> head(quakes)
lat long depth mag stations
1 -20.42 181.62 562 4.8 41
2 -20.62 181.03 650 4.2 15
3 -26.00 184.10 42 5.4 43
4 -17.97 181.66 626 4.1 19
5 -20.42 181.96 649 4.0 11
6 -19.68 184.31 195 4.0 12
>
The first row would be passed to myfunc together with every other row in the dataset, then the same would happen with the second row against every other row, and so on.
I'm currently using the following nested for loop, appending to vectors and then cbinding them all together.
lat <- vector()
long <- vector()
depth <- vector()
mag <- vector()
stations <- vector()
for (i in 1:6){
for (j in 1:6){
lat <- append(lat,(myfunc(quakes$lat[i], quakes$lat[j])))
long <- append(long,(myfunc(quakes$long[i], quakes$long[j])))
depth <- append(depth,(myfunc(quakes$depth[i], quakes$depth[j])))
mag <- append(mag,(myfunc(quakes$mag[i], quakes$mag[j])))
stations <- append(stations,(myfunc(quakes$stations[i], quakes$stations[j])))
}
}
final <- as.data.frame(cbind(lat, long, depth, mag, stations))
The actual data I'm doing this on has 1244 rows and 13 columns, and the full code doesn't seem to finish (or takes too long; I usually stop it when it's nearing 1 hour). I have tried the same code on 191 rows and that runs fine, usually within 1 minute.
I've read up on this online, and it's clear that calling append inside for loops is a bad idea. I've come across sapply, vectorisation and some examples, but I'm really struggling to get them to work and produce exactly the same output as my current code.
I was wondering whether anyone can help me out with this or has useful advice?
Thank you.
Update: just to add that I'm going to use cbind to attach two ID columns to the results. For example, if the quakes data had a letter assigned to each row (i.e. A, B, C, ...), I would want the final output after the cbind to go from this
ID lat long depth mag stations
1 A -20.42 181.62 562 4.8 41
2 B -20.62 181.03 650 4.2 15
3 C -26.00 184.10 42 5.4 43
4 D -17.97 181.66 626 4.1 19
5 E -20.42 181.96 649 4.0 11
6 F -19.68 184.31 195 4.0 12
to
ID1 ID2 lat long depth mag stations
1 A A (row from final)
2 A B (row from final)
3 A C (row from final)
4 B A (row from final)
5 B B (row from final)
6 B C (row from final)
etc.
Currently I'm using something similar to this:
ID1 <- vector()
ID2 <- vector()
for (i in 1:1244){
for (j in 1:1244){
ID1 <- append(ID1,quakes$ID[i])
ID2 <- append(ID2,quakes$ID[j])
}
}
It currently returns large character vectors. Do you have any suggestions on how this could be improved?
Apologies for not mentioning this in my original post.
Here are two functions.
The first is my original answer made into a function. According to a comment it's already faster than the original in the question, but the second function is around twice as fast. It is also more memory efficient.
myfunc <- function(x, y){
z <- (x - y)^2
return(z)
}
slower <- function(X, fun = myfunc){
fun <- match.fun(fun)
res <- sapply(X, function(x) {
o <- outer(x, x, fun)
o[row(o) != col(o)]
})
as.data.frame(res)
}
faster <- function(X, fun){
f <- function(x, fun = myfunc){
y <- lapply(seq_along(x), function(i){
fun(x[i], x[-i])
})
unlist(y)
}
fun <- match.fun(fun)
res <- sapply(X, f, fun = fun)
as.data.frame(res)
}
Test both; the results are identical.
res1 <- slower(quakes, myfunc)
res2 <- faster(quakes, myfunc)
identical(res1, res2)
#[1] TRUE
Now for the timings with package microbenchmark.
library(microbenchmark)
mb <- microbenchmark(
outer = slower(quakes, myfunc),
fastr = faster(quakes, myfunc),
times = 10
)
print(mb, unit = "relative", order = "median")
#Unit: relative
# expr min lq mean median uq max neval cld
# fastr 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
# outer 1.545283 1.650968 1.970562 2.159856 2.762724 1.332896 10 b
ggplot2::autoplot(mb)
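Regarding the ID columns from the question's update, the nested append loop can be replaced with rep(); a minimal sketch, assuming a character column quakes$ID exists (this reproduces the ordering of the question's loop, including the i == j pairs, which the two functions above drop):
n <- nrow(quakes)
# Outer index: each ID repeated n times; inner index: the full ID vector recycled n times
ID1 <- rep(quakes$ID, each = n)
ID2 <- rep(quakes$ID, times = n)
ids <- data.frame(ID1, ID2, stringsAsFactors = FALSE)
cbind(ids, final) then attaches them to the result of the original nested loop.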
Good afternoon, R wizards.
I searched through a few posts on replacing outliers in a dataset; the two that came closest to answering my question were Changing outliers for NA in all columns in a dataset in R and Replace outliers by quantiles in R.
The code in the 2nd reference works great if you want to update a column or two, but I have 40+ and would like to use an apply-type function to hit all the columns at once.
I want to set a threshold "max" at quantile(probs = .75) for each column, and replace any x > "max" with "max".
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
#apply(x,2,quantile,probs = .75)
Winsor75 <- function(x) {
Max <- quantile(x, probs = .75)
return(Max)
}
y <- as.data.frame(x)
y$a[y$a > Winsor75(x)] <- Winsor75(x)
The last line of code effectively replaces any defined outliers (in my case, values above the 75th percentile), but it uses the 75th percentile of the entire matrix "x", whereas I would like (a) the quantile to be computed for each column and (b) to be able to use the function in apply/sapply etc. so I can perform the operation on all columns efficiently.
Any suggestions?
Thanks!
as.data.frame(lapply(y, function(x) pmin(x, quantile(x, 0.75, na.rm = TRUE))))
As a function:
df_winsor <- function(df, p) {
as.data.frame(lapply(df,
function(x) pmin(x, quantile(x, probs = p, na.rm = TRUE))))
}
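Hypothetical usage with the y data frame built in the question:
y_capped <- df_winsor(y, p = 0.75)
head(y_capped)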
Statistician's Disclaimer: I've solved the programming problem you asked. This should not be taken as an endorsement of the idea of automatically checking for, or doing anything with, so-called "outliers".
One option is to use mutate_all with a custom function and apply the rule to all columns.
Approach:
I have created a replaceOutlier function (based on the OP's function) which calculates Max and then replaces any item greater than Max before returning the vector. replaceOutlier is applied over all columns using dplyr::mutate_all.
library(tidyverse)
replaceOutlier <- function(x) {
Max <- quantile(x, probs = .75)
x[x>Max] <- Max
return(x)
}
x %>% as_tibble() %>% mutate_all(funs(replaceOutlier))
#Results
# # A tibble: 10 x 2
# a b
# <dbl> <dbl>
# 1 -0.626 1.08
# 2 0.698 0.390
# 3 -0.836 -0.621
# 4 0.698 1.08
# 5 0.330 1.08
# 6 -0.820 -0.0449
# 7 0.487 -0.0162
# 8 0.698 0.944
# 9 0.576 0.821
# 10 -0.305 0.594
#
Data
set.seed(1)
x = matrix(rnorm(20), ncol = 2)
x[2, 1] = 100
x[4, 2] = 200
colnames(x) <- c("a","b")
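In current dplyr versions funs() is deprecated and mutate_all() is superseded, so the same idea can be written with across(); a sketch assuming the same replaceOutlier function:
library(dplyr)
x %>% as_tibble() %>% mutate(across(everything(), replaceOutlier))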
Sometimes I want to transform several data columns (usually character or factor) into one new column (usually a number). I'm trying to do this using a lookup matrix. For example, my dataset is
dset <- data.frame(
x=c("a", "a", "b"),
y=c("v", "w", "w"),
stringsAsFactors=FALSE
)
lookup <- matrix(c(1:4), ncol=2)
rownames(lookup) <- c("a", "b")
colnames(lookup) <- c("v", "w")
Ideally (for my purpose here), I would now do
transform(dset, z=lookup[x,y])
and get my new data column. While this works in the one-dimensional case, this fails here, as lookup[x,y] returns a matrix. I came up with this function, which looks rather slow:
fill_from_matrix <- function(m, ...) {
arg <- list(...)
len <- sapply(arg, length)
if(sum(diff(len))!=0) stop("differing lengths in fill_from_matrix")
if(length(arg)!=length(dim(m))) stop("differing dimensions in fill_from_matrix")
n <- len[[1]]
dims <- length(dim(m))
res <- rep(NA, n)
for (i in seq(1,n)) {
one_arg <- list(m)
for (j in seq(1,dims)) one_arg[[j+1]] <- arg[[j]][[i]]
res[i] <- do.call("[", one_arg)
}
return(res)
}
With this function, I can call transform and get the result I wanted:
transform(dset, z=fill_from_matrix(lookup,x,y))
# x y z
# 1 a v 1
# 2 a w 3
# 3 b w 4
However, I am not satisfied with the code and wonder if there is a more elegant (and faster) way to perform this kind of transformation. How do I get rid of the for loops?
This is really quite easy and I suspect fast with base R indexing because the "[" function accepts a two-column matrix for this precise purpose:
> dset$z <- lookup[ with(dset, cbind(x,y)) ]
> dset
x y z
1 a v 1
2 a w 3
3 b w 4
If you needed it as a specific function then:
lkup <- function(tbl, rowidx, colidx){ tbl[ cbind(rowidx, colidx)]}
zvals <- lkup(lookup, dset$x, dset$y)
zvals
#[1] 1 3 4
(I'm pretty sure you can also use three and four column matrices if you have arrays of those dimensions.)
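A quick sketch to back that up for a three-dimensional array (the array and index values are made up for illustration):
arr <- array(1:24, dim = c(2, 3, 4),
             dimnames = list(c("a", "b"), c("v", "w", "x"), c("p", "q", "r", "s")))
# A three-column character matrix picks one cell per row
idx <- cbind(c("a", "b"), c("w", "x"), c("q", "s"))
arr[idx]
#[1]  9 24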
You can use the dplyr library for inner_join, and use a data.frame instead of a matrix as the lookup table:
library(dplyr)
lookup = transform(expand.grid(c('a','b'),c('v','w')), v=1:4) %>%
setNames(c('x','y','val'))
inner_join(dset, lookup, by=c('x','y'))
# x y val
#1 a v 1
#2 a w 3
#3 b w 4
Another fast way is to use the data.table package, with my definition of lookup:
library(data.table)
setDT(lookup)
setDT(dset)
setkey(lookup, x, y)[dset]
# x y val
#1: a v 1
#2: a w 3
#3: b w 4
If for any reason you have your matrix lookup as input, convert it into a data frame:
lookup = transform(expand.grid(rownames(lookup), colnames(lookup)), v=c(lookup))
names(lookup) = c('x','y','val')
Does the by function make a list that grows one element at a time?
I need to process a data frame with about 4M observations grouped by a factor column. The situation is similar to the example below:
> # Make 4M rows of data
> x = data.frame(col1=1:4000000, col2=10000001:14000000)
> # Make a factor
> x[,"f"] = x[,"col1"] - x[,"col1"] %% 5
>
> head(x)
col1 col2 f
1 1 10000001 0
2 2 10000002 0
3 3 10000003 0
4 4 10000004 0
5 5 10000005 5
6 6 10000006 5
Now, a tapply on one of the columns takes a reasonable amount of time:
> t1 = Sys.time()
> z = tapply(x[, 1], x[, "f"], mean)
> Sys.time() - t1
Time difference of 22.14491 secs
But if I do this:
z = by(x[, 1], x[, "f"], mean)
That doesn't finish anywhere near the same time (I gave up after a minute).
Of course, in the above example, tapply could be used, but I actually need to process multiple columns together. What is a better way to do this?
by is slower than tapply because by is a wrapper around tapply.
Let's take a look at some benchmarks: tapply in this situation is more than 3x faster than using by.
UPDATED to include @Roland's great recommendation:
library(rbenchmark)
library(data.table)
dt <- data.table(x,key="f")
using.tapply <- quote(tapply(x[, 1], x[, "f"], mean))
using.by <- quote(by(x[, 1], x[, "f"], mean))
using.dtable <- quote(dt[,mean(col1),by=key(dt)])
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
#------------------------#
# RESULTS #
#------------------------#
# COMPARING tapply VS by #
#-----------------------------------
# test elapsed relative
# 1 using.tapply 2.453 1.000
# 2 using.by 8.889 3.624
# COMPARING data.table VS tapply VS by #
#------------------------------------------#
# test elapsed relative
# 2 using.dtable 0.168 1.000
# 1 using.tapply 2.396 14.262
# 3 using.by 8.566 50.988
If x$f is a factor, the loss in efficiency between tapply and by is even greater!
Note, though, that both improve relative to the non-factor input, while data.table remains approximately the same or gets slightly worse:
x[, "f"] <- as.factor(x[, "f"])
dt <- data.table(x,key="f")
times <- benchmark(using.tapply, using.dtable, using.by, replications=10, order="relative")
times[,c("test", "elapsed", "relative")]
# test elapsed relative
# 2 using.dtable 0.175 1.000
# 1 using.tapply 1.803 10.303
# 3 using.by 7.854 44.880
As for the why, the short answer is in the documentation itself.
?by :
Description
Function by is an object-oriented wrapper for tapply applied to data frames.
Let's take a look at the source of by (or, more specifically, by.data.frame):
by.data.frame
function (data, INDICES, FUN, ..., simplify = TRUE)
{
if (!is.list(INDICES)) {
IND <- vector("list", 1L)
IND[[1L]] <- INDICES
names(IND) <- deparse(substitute(INDICES))[1L]
}
else IND <- INDICES
FUNx <- function(x) FUN(data[x, , drop = FALSE], ...)
nd <- nrow(data)
ans <- eval(substitute(tapply(seq_len(nd), IND, FUNx, simplify = simplify)),
data)
attr(ans, "call") <- match.call()
class(ans) <- "by"
ans
}
We see immediately that there is still a call to tapply, plus a lot of extras (including calls to deparse(substitute(.)) and eval(substitute(.)), both of which are relatively slow). Therefore it makes sense that tapply will be relatively faster than a similar call to by.
Regarding a better way to do this: With 4M rows you should use data.table.
library(data.table)
dt <- data.table(x,key="f")
dt[,mean(col1),by=key(dt)]
dt[,list(mean1=mean(col1),mean2=mean(col2)),by=key(dt)]
dt[,lapply(.SD,mean),by=key(dt)]
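If your real data has columns you don't want to aggregate, .SDcols restricts which columns .SD contains; a small sketch using the example's col1 and col2:
# Only compute group means for the named columns
dt[, lapply(.SD, mean), by = key(dt), .SDcols = c("col1", "col2")]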