R code to 'tidy' the values of a discrete variable - r

Consider this data:
set.seed(200914)
y <- round(runif(20, 5, 15))
y
table(y)
In the real application y is a categorical variable such as "outcome code". I want to recode y so that its values are 1:n, while preserving order (sometimes the variable may be ordinal).
One answer is:
(ya <- y - min(y) + 1)
table(ya)
But this solution does not have minimal range (values that never occur leave gaps in the codes), which may make subsequent code inefficient. Trying again...
(suy <- sort(unique(y)))
(n <- length(suy))
yb <- y
for (i in 1:n) yb[which(y == suy[i])] <- i
table(yb)
yb is exactly what I want, but I wonder if I am computing it in the most efficient way?

Try
yc <- as.numeric(factor(y))
table(yc)
#yc
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
since essentially what you're looking for are the factor codes (I think).
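To make the "factor codes" idea concrete, here is a minimal sketch on a small hand-made vector (the values are illustrative, not the question's data). It assumes y is numeric, so that factor() orders its levels numerically; as.integer is used in place of as.numeric only because the codes are whole numbers.

```r
# Illustrative data (not from the question): four distinct values with gaps
y <- c(7, 12, 7, 9, 15, 9, 9)

f <- factor(y)            # levels are the sorted unique values of y
codes <- as.integer(f)    # internal integer codes, i.e. 1:n in level order
codes
# [1] 1 3 1 2 4 2 2

# The levels act as a lookup table from code back to the original value
lookup <- as.numeric(levels(f))
stopifnot(identical(lookup[codes], y))
```

Keeping the levels around is handy if the recoded values later need to be mapped back to the original outcome codes.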

Try:
yc <- match(y, sort(unique(y)))
table(yc)
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
all.equal(yb,yc)
#[1] TRUE
Another option might be findInterval
table(findInterval(y, sort( unique(y))))
#1 2 3 4 5 6 7 8
#1 4 1 1 6 3 3 1
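Both approaches work because every element of y is itself one of the sorted unique values. The sketch below, on illustrative data that is not from the question, checks that the two agree and shows one practical difference: match() also accepts character codes, whereas findInterval() needs numeric input. The outcome vector is a hypothetical example.

```r
# Illustrative data (not from the question)
y <- c(7, 12, 7, 9, 15, 9, 9)
breaks <- sort(unique(y))               # 7 9 12 15

by_match    <- match(y, breaks)         # position of each value in breaks
by_interval <- findInterval(y, breaks)  # number of breaks <= each value
stopifnot(identical(by_match, by_interval))

# match() also handles character codes; findInterval() needs numeric input
outcome <- c("died", "cured", "improved", "cured")
outcome_codes <- match(outcome, sort(unique(outcome)))
outcome_codes
# [1] 2 1 3 1
```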
Benchmarks
set.seed(25)
y <- sample(1:20, 1e6,replace=TRUE)
f1 <- function() {suy <- sort(unique(y))
n <- length(suy)
yb <- y
for (i in 1:n) yb[which(y == suy[i])] <- i
table(yb)}
f2 <- function() {yc <- as.numeric(factor(y))
table(yc)}
f3 <- function() {yd <- match(y, sort(unique(y)))
table(yd)}
f4 <- function() {ye <- findInterval(y, sort(unique(y)))
table(ye)}
library(microbenchmark)
microbenchmark(f1(), f2(), f3(), f4(), unit="relative", times=25L)
# Unit: relative
# expr min lq median uq max neval
# f1() 1.198901 1.208551 1.235237 1.242697 1.600400 25
# f2() 3.745317 3.593736 3.593330 3.596990 3.488292 25
# f3() 1.000000 1.000000 1.000000 1.000000 1.000000 25
# f4() 1.017857 1.038056 1.047112 1.038731 1.014825 25

Related

Return vector position in list r

I am trying to determine the vector where an element is coming from in a list I have created. I'll give a repeatable example here:
set.seed(101)
a <- runif(10, min=0, max=100)
b <- runif(10, min=0, max=100)
c <- runif(10, min=0, max=100)
d <- runif(10, min=0, max=100)
information <- list(a, b, c, d)
information.wanted <- do.call(pmax, information)
The code to get the information.wanted works just fine. What I am now trying to find is the individual vector in the list where each of the maximum values comes from. For example, value 1 in information.wanted (87.97...) comes from vector b in the information list. I would like to create another piece of code that gives the vector where the information.wanted comes from.
> information.wanted
[1] 87.97957 95.68375 73.19726 93.16344 92.33189 91.34787 82.04361 81.42830 62.20120
[10] 92.48044
I have no idea how to do this though. None of the code that I've tried has gotten me anywhere close.
position.of.information.wanted <- ??
I'm looking to get something like this. A numeric vector is fine. I can supplement the values in later.
> position.of.information.wanted
[1] 2 3 ...
Any help would be greatly appreciated. Thanks.
You need to apply which.max to each "i" index of each element in "information":
f1 = function(x)
sapply(seq_along(x[[1]]), function(i) which.max(sapply(x, "[[", i)))
f1(information)
# [1] 2 3 2 2 3 4 2 4 1 4
mapply already provides that kind of "parallel" functionality:
f2 = function(x)
unlist(.mapply(function(...) which.max(c(...)), x, NULL))
f2(information)
# [1] 2 3 2 2 3 4 2 4 1 4
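.mapply is a bare-bones variant of mapply. As a rough sketch of the same idea with the ordinary mapply, the example below uses a small unnamed list with illustrative values (not the question's data) and passes the list elements as separate arguments through do.call.

```r
# Illustrative list of three equal-length vectors (not the question's data)
info <- list(c(1, 9, 3), c(5, 2, 3), c(4, 8, 7))

# Equivalent to mapply(FUN, info[[1]], info[[2]], info[[3]]):
# FUN receives the i-th element of every vector and returns the winner's index
pos <- do.call(mapply, c(list(FUN = function(...) which.max(c(...))), info))
pos
# [1] 2 1 3
```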
Or, instead of concatenating "information" in chunks, convert it to a "matrix" at the start (as David Arenburg notes in the comments) and apply which.max to its rows:
f3a = function(x)
apply(do.call(cbind, x), 1, which.max)
f3a(information)
# [1] 2 3 2 2 3 4 2 4 1 4
or its columns:
f3b = function(x)
apply(do.call(rbind, x), 2, which.max)
f3b(information)
# [1] 2 3 2 2 3 4 2 4 1 4
also, max.col is convenient for a "matrix":
f4 = function(x)
max.col(do.call(cbind, x), "first")
f4(information)
# [1] 2 3 2 2 3 4 2 4 1 4
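The "first" argument matters because max.col breaks ties at random by default. A minimal sketch on a small named list (illustrative values, with a deliberate tie in the fourth position) shows the tie handling and how the indices can be mapped back to the vector names:

```r
# Illustrative named list (not the question's data); row 4 is a tie between a and b
info <- list(a = c(1, 9, 3, 6), b = c(5, 2, 3, 6), c = c(4, 8, 7, 0))

m <- do.call(cbind, info)                  # one column per vector
idx <- max.col(m, ties.method = "first")   # column index of each row maximum
idx
# [1] 2 1 3 1
winner <- names(info)[idx]                 # map indices back to vector names
winner
# [1] "b" "a" "c" "a"
```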
In another language, a simple loop over the elements would provide both which.max and max. In R the same loop can run over the list while operating on whole vectors at a time:
f5 = function(x)
{
ans = rep_len(1L, length(x[[1]]))
maxs = x[[1]]
for(i in 2:length(x)) {
wh = x[[i]] > maxs
maxs[wh] = x[[i]][wh]
ans[wh] = i
}
ans # or 'data.frame(i = ans, val = maxs)' for both
}
f5(information)
# [1] 2 3 2 2 3 4 2 4 1 4
It had to end with a benchmark:
set.seed(007)
dat = replicate(13, runif(1e4), FALSE)
identical(f1(dat), f2(dat))
#[1] TRUE
identical(f2(dat), f3a(dat))
#[1] TRUE
identical(f3a(dat), f3b(dat))
#[1] TRUE
identical(f3b(dat), f4(dat))
#[1] TRUE
identical(f4(dat), f5(dat))
#[1] TRUE
microbenchmark::microbenchmark(f1(dat), f2(dat), f3a(dat), f3b(dat), f4(dat), f5(dat), do.call(pmax, dat), times = 50)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# f1(dat) 274995.963 298662.210 339279.948 318937.172 350822.539 723673.972 50 d
# f2(dat) 94619.397 100079.205 114664.776 107479.127 114619.439 226733.260 50 c
# f3a(dat) 19767.925 23423.688 26382.919 25795.499 29215.839 40100.656 50 b
# f3b(dat) 20351.872 22829.997 28889.845 25090.446 30503.100 140311.058 50 b
# f4(dat) 975.102 1109.431 1546.571 1169.462 1361.733 8954.100 50 a
# f5(dat) 2427.665 2470.816 5299.386 2520.755 3197.793 112986.612 50 a
# do.call(pmax, dat) 1477.618 1530.166 1627.934 1551.046 1602.898 2814.295 50 a

Resampling from subject id's in R

Assume we have the following data
set.seed(123)
dat <- data.frame(var1=c(10,35,13,19,15,20,19), id=c(1,1,2,2,2,3,4))
(sampledIDs <- sample(min(dat$id):max(dat$id), size=3, replace=TRUE))
> [1] 2 4 2
The sampledIDs is a vector of id's that is sampled (with replacement) from dat$id.
I need the code that results in (and works also for a large dataset with more variables):
var1 id
13 2
19 2
15 2
19 4
13 2
19 2
15 2
The code dat[which(dat$id%in%sampledIDs),] does not give me what I want, since the result of this code is
var1 id
13 2
19 2
15 2
19 4
where the subject with dat$id==2 appears only once in this data (I understand why this is the result, but don't know how to get what I want). Can someone please help?
EDIT: Thank you for the answers. Here are the runtimes of all answers (for those who are interested):
test replications elapsed relative user.self
3 dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ] 1000 0.67 1.000 0.64
1 dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[, 1], ] 1000 0.67 1.000 0.67
2 do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)]) 1000 1.83 2.731 1.83
4 setkey(setDT(dat), id)[J(sampledIDs)] 1000 1.33 1.985 1.33
This is probably the fastest approach for a big data set, using data.table's binary search:
library(data.table)
setkey(setDT(dat), id)[J(sampledIDs)]
# var1 id
# 1: 13 2
# 2: 19 2
# 3: 15 2
# 4: 19 4
# 5: 13 2
# 6: 19 2
# 7: 15 2
Edit:
Here's a benchmark on a not-so-big data set (1e+05 rows) that shows the clear winner:
library(data.table)
library(microbenchmark)
set.seed(123)
n <- 1e5
dat <- data.frame(var1 = sample(seq_len(100), n, replace = TRUE), id = sample(seq_len(10), n, replace = TRUE))
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 3, replace = TRUE))
dat2 <- copy(dat)
Sven1 <- function(dat) dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
Sven2 <- function(dat) dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
flodel <- function(dat) do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
David <- function(dat2) setkey(setDT(dat2), id)[J(sampledIDs)]
Res <- microbenchmark(Sven1(dat),
Sven2(dat),
flodel(dat),
David(dat2))
Res
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 4.356151 4.817557 6.715533 7.313877 45.407768 100
# Sven2(dat) 9.750984 12.385677 14.324671 16.655005 54.797096 100
# flodel(dat) 36.097602 39.680006 42.236017 44.314981 82.261879 100
# David(dat2) 1.813387 2.068749 2.154774 2.335442 8.665379 100
boxplot(Res)
If, for example, we would like to sample more than just 3 ids, let's say 10, the gap in the benchmark becomes ridiculous:
(sampledIDs <- sample(min(dat$id) : max(dat$id), size = 10, replace = TRUE))
[1] 7 6 10 9 5 9 5 3 7 3
# Unit: milliseconds
# expr min lq median uq max neval
# Sven1(dat) 80.124502 89.141162 97.908365 104.111738 175.40919 100
# Sven2(dat) 99.010410 127.797966 159.404395 170.751069 209.96887 100
# flodel(dat) 129.722435 144.847505 157.737362 178.242103 232.41293 100
# David(dat2) 2.431682 2.721038 2.855103 3.057796 19.60826 100
You can do:
do.call(rbind, split(dat, dat$id)[as.character(sampledIDs)])
One approach:
dat[unlist(lapply(sampledIDs, function(x) which(x == dat$id))), ]
# var1 id
# 3 13 2
# 4 19 2
# 5 15 2
# 7 19 4
# 3.1 13 2
# 4.1 19 2
# 5.1 15 2
An alternative approach:
dat[which(sapply(sampledIDs, "==", dat$id), arr.ind = TRUE)[ , 1], ]
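As a variation on the base R answers above (not one of the benchmarked solutions), the row numbers can be grouped by id once and then looked up by name, so that a repeated id simply repeats its rows. The sketch fixes sampledIDs by hand instead of calling sample(), and the name rows_by_id is only illustrative.

```r
dat <- data.frame(var1 = c(10, 35, 13, 19, 15, 20, 19),
                  id   = c(1, 1, 2, 2, 2, 3, 4))
sampledIDs <- c(2, 4, 2)   # fixed here instead of calling sample()

# Row numbers grouped by id, computed once
rows_by_id <- split(seq_len(nrow(dat)), dat$id)

# Look up each sampled id by name, so a repeated id repeats its rows
picked <- unlist(rows_by_id[as.character(sampledIDs)], use.names = FALSE)
res <- dat[picked, ]
res$var1
# [1] 13 19 15 19 13 19 15
```

Unlike the do.call(rbind, split(...)) version, this indexes the data frame only once, which may help when dat has many columns; that has not been benchmarked here.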

Numeric comparison during merge in R

Dataframe d1:
x y
4 10
6 20
7 30
Dataframe d2:
x z
3 100
6 200
9 300
How do I merge d1 and d2 by "x" where d1$x should be matched against exact match or the next higher number in d2$x. Output should look like:
x y z
4 10 200 # (4 is matched against next higher value that is 6)
6 20 200 # (6 is matched against 6)
7 30 300 # (7 is matched against next higher value that is 9)
If merge() cannot do this, then is there any other way to do this? For loops are painfully slow.
This is pretty straightforward using rolling joins with data.table:
require(data.table) ## >= 1.9.2
setkey(setDT(d1), x) ## convert to data.table, set key for the column to join on
setkey(setDT(d2), x) ## same as above
d2[d1, roll=-Inf]
# x z y
# 1: 4 200 10
# 2: 6 200 20
# 3: 7 300 30
Input data:
d1 <- data.frame(x=c(4,6,7), y=c(10,20,30))
d2 <- data.frame(x=c(3,6,9), z=c(100,200,300))
You basically wish to extend d1 by a new column. So let's copy it.
d3 <- d1
Next I assume that d2$x is sorted nondecreasingly and that max(d1$x) <= max(d2$x).
d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]])
This reads: for each x in d1$x, find the smallest value in d2$x that is not smaller than x, and take the corresponding d2$z.
Under these assumptions, the above may also be written as (and should be a bit faster):
d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)])
As a result we get:
d3
## x y z
## 1 4 10 200
## 2 6 20 200
## 3 7 30 300
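The assumption max(d1$x) <= max(d2$x) matters more for the which.max variant than for the which variant. A small sketch with a deliberately out-of-range value (x_too_big is an illustrative name) shows the difference:

```r
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))
x_too_big <- 10                       # larger than every value in d2$x

# which() finds no match, so [1] yields NA and the lookup returns NA
via_which <- d2$z[which(x_too_big <= d2$x)[1]]
# which.max() on an all-FALSE vector returns 1, silently picking the first row
via_which_max <- d2$z[which.max(x_too_big <= d2$x)]

via_which      # NA
via_which_max  # 100
```

So if out-of-range values are possible, the which version fails more visibly.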
EDIT1: Inspired by @MatthewLundberg's cut-based solution, here's another one using findInterval:
d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1]
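Note that shifting the breaks by 1 relies on x being integer-valued. A possible alternative, sketched here under the assumption that your R version's findInterval has the left.open argument, is to use left-open intervals instead; the value 6.5 is added purely for illustration.

```r
d2 <- data.frame(x = c(3, 6, 9), z = c(100, 200, 300))
x <- c(4, 6, 6.5, 7)   # 6.5 is an illustrative non-integer value

# Original trick: shifting the breaks by 1 only works for integer-valued x
z_shifted <- d2$z[findInterval(x, d2$x + 1) + 1]
# left.open = TRUE uses intervals (lo, hi], so exact matches stay in place
z_open <- d2$z[findInterval(x, d2$x, left.open = TRUE) + 1]

z_shifted  # 200 200 200 300  (6.5 is wrongly matched to 6)
z_open     # 200 200 300 300
```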
EDIT2: (Benchmark)
Example data:
set.seed(123)
d1 <- data.frame(x=sort(sample(1:10000, 1000)), y=sort(sample(1:10000, 1000)))
d2 <- data.frame(x=sort(c(sample(1:10000, 999), 10000)), z=sort(sample(1:10000, 1000)))
Results:
microbenchmark::microbenchmark(
{d3 <- d1; d3$z <- d2$z[findInterval(d1$x, d2$x+1)+1] },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which(x <= d2$x)[1]]) },
{d3 <- d1; d3$z <- sapply(d1$x, function(x) d2$z[which.max(x <= d2$x)]) },
{d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]; merge(d1, d2, by.x='x2', by.y='x')},
{d1a <- d1; setkey(setDT(d1a), x); d2a <- d2; setkey(setDT(d2a), x); d2a[d1a, roll=-Inf] }
)
## Unit: microseconds
## expr min lq median uq max neval
## findInterval 221.102 1357.558 1394.246 1429.767 17810.55 100
## which 66311.738 70619.518 85170.175 87674.762 220613.09 100
## which.max 69832.069 73225.755 83347.842 89549.326 118266.20 100
## cut 8095.411 8347.841 8498.486 8798.226 25531.58 100
## data.table 1668.998 1774.442 1878.028 1954.583 17974.10 100
cut can be used to find the appropriate matches in d2$x for the values in d1$x.
The computation to find the matches with cut is as follows:
as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))
## [1] 2 2 3
These are the values:
d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
## [1] 6 6 9
These can be added to d1 and the merge performed:
d1$x2 <- d2$x[as.numeric(cut(d1$x, c(-Inf, d2$x, Inf)))]
merge(d1, d2, by.x='x2', by.y='x')
## x2 x y z
## 1 6 4 10 200
## 2 6 6 20 200
## 3 9 7 30 300
The added column may then be removed, if desired.
Try: sapply(d1$x, function(y) d2$z[d2$x >= y][which.min(abs(y - d2$x[d2$x >= y]))])

Piecewise linear transformation without for loop or nested ifelse

I'm trying to perform a piecewise linear transformation of my data. Here's an example table describing a transformation:
dat <- data.frame(x.low = 0:2, x.high = 1:3, y.low=c(0, 2, 3), y.high=c(2, 3, 10))
dat
# x.low x.high y.low y.high
# 1 0 1 0 2
# 2 1 2 2 3
# 3 2 3 3 10
If I defined x <- c(1.75, 2.5), I would expect transformed values 2.75 and 6.5 (my elements would be matched by rows 2 and 3 of dat, respectively).
I know how to solve this problem with a for loop, iterating through the rows of dat and transforming the corresponding values:
pw.lin.trans <- function(x, m) {
out <- rep(NA, length(x))
for (i in seq(nrow(m))) {
matching <- x >= m$x.low[i] & x <= m$x.high[i]
out[matching] <- m$y.low[i] + (x[matching] - m$x.low[i]) /
(m$x.high[i] - m$x.low[i]) * (m$y.high[i] - m$y.low[i])
}
out
}
pw.lin.trans(x, dat)
# [1] 2.75 6.50
While this works, it strikes me there should be a better approach that matches x values to rows of dat and then performs all the interpolations in a single computation. Could somebody point me to a non-for-loop solution for this problem?
Try approx:
(xp <- unique(c(dat$x.low, dat$x.high)))
## [1] 0 1 2 3
(yp <- unique(c(dat$y.low, dat$y.high)))
## [1] 0 2 3 10
x <- c(1.75, 2.5)
approx(xp, yp, x)
## $x
## [1] 1.75 2.50
##
## $y
## [1] 2.75 6.50
or approxfun (which returns a new function):
f <- approxfun(xp, yp)
f(x)
## [1] 2.75 6.50
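For readers who want to see what approx is doing, the following sketch vectorises the question's loop by hand: findInterval picks the segment for each x, and the interpolation formula is then applied to all elements at once. It assumes contiguous segments with strictly increasing breakpoints, as in the example table, and adds x = 3 to exercise the right edge.

```r
xp <- c(0, 1, 2, 3)     # breakpoints on the x axis
yp <- c(0, 2, 3, 10)    # values at those breakpoints
x  <- c(1.75, 2.5, 3)

# Segment index for each x; rightmost.closed puts x == 3 in the last segment
i <- findInterval(x, xp, rightmost.closed = TRUE)

# Linear interpolation within segment i, done for all x at once
y <- yp[i] + (x - xp[i]) / (xp[i + 1] - xp[i]) * (yp[i + 1] - yp[i])
y
# [1]  2.75  6.50 10.00
stopifnot(isTRUE(all.equal(y, approx(xp, yp, x)$y)))
```

Note that building yp with unique() relies on all breakpoint values being distinct; if two segments shared a y value, xp and yp would end up with different lengths.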
Some benchmarks:
set.seed(123L)
x <- runif(10000, min(xp), max(xp))
library(microbenchmark)
microbenchmark(
pw.lin.trans(x, dat),
approx(xp, yp, x)$y,
f(x)
)
## Unit: microseconds
## expr min lq median uq max neval
## pw.lin.trans(x, dat) 3364.241 3395.244 3614.0375 3641.7365 6170.268 100
## approx(xp, yp, x)$y 359.080 379.669 424.0895 453.6800 522.756 100
## f(x) 202.899 209.168 217.8715 232.3555 293.499 100

Apply a correction factor to one column based on the value of a second column

Example Data
A<-c(1,4,5,6,2,3,4,5,6,7,8,7)
B<-c(4,6,7,8,2,2,2,3,8,8,7,8)
DF<-data.frame(A,B)
What I would like to do is apply a correction factor to column A, based on the values of column B. The rules would be something like this
If B is less than 4 <- multiply A by 1
If B is greater than or equal to 4 and less than 6 <- multiply A by 2
If B is greater than or equal to 6 <- multiply A by 4
I suppose I could write an "if" statement (and I'd be glad to see a good example), but I'd also be interested in using square bracket indexing to speed things up.
The end result would look like this
A B
2 4
16 6
20 7
24 8
etc.
Use this:
within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
Or this (corrected by @agstudy):
within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
Benchmarking:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
identical(a(DF), b(DF))
#[1] TRUE
microbenchmark(a(DF), b(DF), times=1000)
#Unit: milliseconds
# expr min lq median uq max neval
# a(DF) 8.603778 10.253799 11.07999 11.923116 53.91140 1000
# b(DF) 3.763470 3.889065 5.34851 5.480294 39.72503 1000
Similar to @Ferdinand's solution, but using transform:
transform(DF, newcol = ifelse(B<4, A,
ifelse(B>=6,4*A,2*A)))
A B newcol
1 1 4 2
2 4 6 16
3 5 7 20
4 6 8 24
5 2 2 2
6 3 2 3
7 4 2 4
8 5 3 5
9 6 8 24
10 7 8 28
11 8 7 32
12 7 8 28
I prefer to use findInterval as an index into a set of factors for such operations. The proliferation of nested test-conditional and consequent vectors with multiple ifelse calls offends my efficiency sensibilities:
DF$A <- DF$A * c(1,2,4)[findInterval(DF$B, c(-Inf,4,6,Inf) ) ]
DF
A B
1 2 4
2 16 6
3 20 7
4 24 8
snipped ....
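Broken into steps on the question's example data, the lookup works roughly as follows (the names breaks, factors and bin are only illustrative):

```r
A <- c(1, 4, 5, 6, 2, 3, 4, 5, 6, 7, 8, 7)
B <- c(4, 6, 7, 8, 2, 2, 2, 3, 8, 8, 7, 8)

breaks  <- c(-Inf, 4, 6, Inf)   # intervals [-Inf,4), [4,6), [6,Inf)
factors <- c(1, 2, 4)           # one correction factor per interval

bin <- findInterval(B, breaks)  # which interval each B falls into
bin
# [1] 2 3 3 3 1 1 1 1 3 3 3 3
corrected <- A * factors[bin]
corrected
# [1]  2 16 20 24  2  3  4  5 24 28 32 28
```

Adding another rule then only means adding one break and one factor, rather than another level of nested ifelse.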
Benchmark:
DF <- data.frame(A=rpois(1e4, 5), B=rpois(1e4, 5))
a <- function(DF) within(DF, A <- ifelse(B>=6, 4, ifelse(B<4, 1, 2)) * A)
b <- function(DF) within(DF, {A[B>=6] <- A[B>=6]*4; A[B>=4 & B<6] <- A[B>=4 & B<6]*2})
ccc <- function(DF) within(DF, {A <- A * c(1,2,4)[findInterval(B, c(-Inf,4,6,Inf) ) ]})
microbenchmark(a(DF), b(DF), ccc(DF), times=1000)
#-----------
Unit: microseconds
expr min lq median uq max neval
a(DF) 7616.107 7843.6320 8105.0340 8322.5620 93549.85 1000
b(DF) 2638.507 2789.7330 2813.8540 3072.0785 92389.57 1000
ccc(DF) 604.555 662.5335 676.0645 698.8665 85375.14 1000
Note: I would not have done this using within if I were coding my own function, but thought for fairness to the earlier effort, I would make it apples <-> apples.
