Function with random experiment related to value pairs - r

I have two vectors x1 and p:
x1 <- c(1,2,3,1,2,3)
p <- c(0.1,0.9,0.9,0.1,0.5,0.7)
Both vectors form pairs of values, see df1:
df1 <- data.frame(x1,p)
> df1
x1 p
1 1 0.1
2 2 0.9
3 3 0.9
4 1 0.1
5 2 0.5
6 3 0.7
The following function is used to update the vector df1$x1 to a vector df1$x2, depending on a random experiment and a probability p:
rexp <- function(x, p) {
  if (runif(1) <= p) return(x + 1)
  return(x)
}
Using lapply, the function "rexp" is applied to every df1$x1 value. Depending on the random experiment, the value for x2 either remains equal to x1 or increases by 1.
In the following example, p equals 0.5:
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,rexp,0.5))
> df1
x1 p x2
1 1 0.1 2
2 2 0.9 2
3 3 0.9 4
4 1 0.1 1
5 2 0.5 2
6 3 0.7 4
Now to my problem: I want the argument "p" in "rexp" to refer to the vector df1$p.
For example, p for df1$x1[1] should be 0.1 (as can be seen in df1$p[1]): unlist(lapply(df1$x1[1],rexp,df1$p[1])).
p for df1$x1[5] should be df1$p[5], which is 0.5: unlist(lapply(df1$x1[5],rexp,df1$p[5]))
Desired output should be something like:
> unlist(lapply(df1$x1,rexp,df1$p))
[1] 1 3 4 1 2 4
#where 1 refers to rexp(df1$x1[1],df1$p[1]),
#3 refers to rexp(df1$x1[2],df1$p[2]),
#4 refers to rexp(df1$x1[3],df1$p[3]) and so on...
Doing that "manually" leads to:
set.seed(123)
> unlist(lapply(df1$x1[1],rexp,df1$p[1]))
[1] 1
> unlist(lapply(df1$x1[2],rexp,df1$p[2]))
[1] 3
> unlist(lapply(df1$x1[3],rexp,df1$p[3]))
[1] 4
> unlist(lapply(df1$x1[4],rexp,df1$p[4]))
[1] 1
> unlist(lapply(df1$x1[5],rexp,df1$p[5]))
[1] 2
> unlist(lapply(df1$x1[6],rexp,df1$p[6]))
[1] 4
How can "rexp" be adjusted so that the function uses the specific df1$p-value for each df1$x1-value?
Note: At this point, using "lapply" is important, because a new random number should be drawn inside "rexp" for every df1$x1 value.
I am happy about any help!

Using your defined function (as an aside, the name rexp masks base R's exponential RNG, stats::rexp, so you may want to rename it), you may do
df1$x2 <- mapply(rexp, df1$x1, df1$p)
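For reproducibility you can set the seed first; with the same seed as your manual runs above, this should reproduce them (a sketch):
set.seed(123)
df1$x2 <- mapply(rexp, df1$x1, df1$p)
df1$x2
# [1] 1 3 4 1 2 4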
However, you may also exploit vectorization and use simply
df1$x2 <- df1$x1 + (runif(nrow(df1)) <= df1$p)
In this manner we element-wise sum the vector df1$x1 with the logical vector runif(nrow(df1)) <= df1$p, which is coerced to binary (TRUE becomes 1 and FALSE becomes 0). The comparison <= is done element-wise as well, and we draw exactly as many values from the uniform distribution as there are rows.
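To see the coercion at work, here is a minimal sketch that decomposes the one-liner:
set.seed(123)
u <- runif(nrow(df1))   # one uniform draw per row
u <= df1$p              # logical vector: TRUE where the experiment succeeds
df1$x1 + (u <= df1$p)   # TRUE/FALSE coerced to 1/0 in the sum
# [1] 1 3 4 1 2 4       # (with this seed; other seeds give other draws)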
Regarding your approach, note that when p is fixed there is no need for lapply, which returns a list that you then have to unlist; you may instead use
df1$x2 <- sapply(df1$x1, rexp, 0.5)


R data.table sum number of columns exceeding threshold

I would like to count the number of columns whose values exceed a threshold in each observation. Additionally, I would like to specify those column names and thresholds as vectors (cols, th).
Take the example data set:
x <- data.table(x1=c(1,2,3),x2=c(3,2,1))
The goal is to create a new column exceed.count with the number of columns in which x1 and x2 exceed their respective thresholds. Assume the thresholds for both x1 and x2 are 2:
th <- c(2,2)
The function could be defined as:
fn <- function(z,th) (sum(z[,x1]>th[1],z[,x2]>th[2]))
And the number of columns exceeding the thresholds calculated by:
x[,exceed.count:=fn(.SD,th),by=seq_len(nrow(x))]
The results are:
x1 x2 exceed.count
1: 1 3 1
2: 2 2 0
3: 3 1 1
What I would like to do is be able to specify the column names as vector, e.g.
cols <- c("x1","x2")
I was playing around with a function of the form:
fn.i <- function(z,i) (sum(z[,cols[i],with=FALSE] > th[i]))
which works for a single i, but how do I vectorize this across elements of cols? (cols and th will always be the same length)
I think there is an easier way to solve your problem: grouping by seq_len(nrow(x)) makes .SD a one-row subset per group, so .SD > th compares each column with the corresponding element of th, and sum() counts the TRUEs:
x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
th <- c(2,2)
x[, exceed.count := sum(.SD>th), by=seq_len(nrow(x))]
Or, taking into account your input (only a subset of columns):
x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
sd.cols = c("x1")
th <- c(2)
x[, exceed.count := sum(.SD>th), by=seq_len(nrow(x)), .SDcols=sd.cols]
Or
x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
sd.cols = c("x1")
th <- c(2,2)
x[, exceed.count := sum(.SD>th[1]), by=seq_len(nrow(x)), .SDcols=sd.cols]
@JonnyCrunch's approach, specifying a subset of columns with .SDcols=sd.cols, works fine (as long as you ensure length(th) matches the number of selected columns; otherwise vector recycling will mess things up).
Here's an alternative with shorter syntax (but it will be less performant for very wide tables):
x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
There's no need to explicitly specify .SDcols; let it default to all columns. Instead, define the threshold vector th across all columns, using the don't-care value +Inf for the columns you don't want counted.
> x <- data.table(x0=4:6, x1=1:3, x2=3:1, x3=7:5)
x0 x1 x2 x3
1: 4 1 3 7
2: 5 2 2 6
3: 6 3 1 5
> th <- c(+Inf, 2, +Inf, 2)
> fn <- function(z,th) (z>th)
> x[,exceed.count:=sum(.SD>th), by=seq_len(nrow(x)) ]
x0 x1 x2 x3 exceed.count
1: 4 1 3 7 1
2: 5 2 2 6 1
3: 6 3 1 5 2
Here's one way to get around iteration over rows:
x <- data.table(x1=c(1,2,3), x2=c(3,2,1))
thL <- list(x1 = 2, x2 = 2)
nm = names(thL)
x[, n := 0L]
for (i in seq_along(thL)) x[thL[i], on=sprintf("%s>%s", nm[i], nm[i]), n := n + 1L][]
x1 x2 n
1: 1 3 1
2: 2 2 0
3: 3 1 1
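To unpack the loop: sprintf() builds one non-equi join condition per column, so the i = 1 iteration is equivalent to (written out as a sketch):
x[list(x1 = 2), on = "x1>x1", n := n + 1L]
i.e. a non-equi self-join that matches every row of x whose x1 exceeds the threshold 2 and increments its counter n; looping over the list does this once per column, accumulating the per-row count.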

for loop with decimals and store results in a vector

I'm working on a for loop in R, and I need to store the results in a vector. I know this is a commonly answered question, but my problem lies elsewhere, so let's proceed in order.
I got those data:
# here the data
alpha <- c(1,2,3,4,5,6)
beta <- c(0.1,0.5,0.3,0.4,0.5,0.6)
data <- data.frame(alpha, beta)
And I wrote a simple function that selects the rows above a certain threshold:
# here the function
funny <- function(x, k) {
  x[x[,2] >= k, ]
}
# here an example of the function
funny(data,0.5)
alpha beta
2 2 0.5
5 5 0.5
6 6 0.6
But what I want is the number of the rows that go over the threshold, so:
# here the result wanted
nrow(funny(data,0.5))
[1] 3
So here is my question: how many rows go over the threshold as k, the function's parameter, varies? I would like to have the result in a vector. I created a for loop, after looking at
For loop in R with increments
Saving results from for loop as a vector in r
And I came up with this. First of all, let's check that everything is all right:
# here the sequence
s <-seq(0.1,0.6, by = 0.1)
# here the I loop
for(i in s) {print(nrow(funny(data,i)))}
[1] 6
[1] 5
[1] 4
[1] 4
[1] 3
[1] 1
But clearly this is not stored in a vector. The problem is here. I tried:
# already written sequence
s <-seq(0.1,0.6, by = 0.1)
# here the empty vector
vec <- vector("numeric")
# here the II problematic loop
for(i in s) {vec[i]<-(nrow(funny(data,i)))}
vec
And here is the result, which is not what I want; I expected something like [1] 6 5 4 4 3 1 but got:
[1] 0 0 0 0 0 0
Further info:
I tried something like this:
# sequence * 10
s <-seq(1,6, by = 1)
# here the vector
vec <- vector("numeric")
# and the III loop, which now works
for(i in s) {vec[i]<-(nrow(funny(data,i/10)))}
vec
[1] 6 5 5 4 3 1
But I do not like this, because I do not understand why loop III works and loop II does not.
What am I missing?
We can try with sapply, which will return a vector:
sapply(s, function(x) nrow(funny(data, x)))
#[1] 6 5 4 4 3 1
As for why your loop II is not working: if you do
for(i in s) {
print(i)
}
You'll get
[1] 0.1
[1] 0.2
[1] 0.3
[1] 0.4
[1] 0.5
[1] 0.6
So when you are trying to store with vec[i] <- in your loop II, you are actually doing vec[0.1] <- ... in the first iteration. Fractional indices are truncated towards zero, so this becomes an assignment to index 0, which stores nothing.
To correct your loop, try
for(i in seq_along(s)) {vec[i]<-(nrow(funny(data,s[i])))}
vec
#[1] 6 5 4 4 3 1
Where seq_along(s) would return #[1] 1 2 3 4 5 6.
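As an aside, notice that loop III's output (6 5 5 4 3 1) differs from loop I's (6 5 4 4 3 1) at the third position. That is floating point at work: seq() builds its third element as 0.1 + 2*0.1, which is slightly larger than the literal 0.3, so the row with beta == 0.3 fails the >= test, while i/10 with i = 3 gives exactly the double closest to 0.3, so that row passes. A quick check (a sketch; printed digits may vary by platform):
s <- seq(0.1, 0.6, by = 0.1)
s[3] == 0.3               # FALSE
print(s[3], digits = 17)  # 0.30000000000000004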

Compute similarity percentage OR Compute correlation between more than 2 objects

Suppose I have four objects (a, b, c, d) and I ask five people to label each of them (category 1 or 2) according to their physical appearance or some other criterion. The labels provided by the five people are shown below:
df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))
In tabular format,
---------
a b c d
---------
1 1 2 1
2 2 1 2
1 2 2 1
2 1 2 2
1 1 2 1
---------
Now I want to calculate the percentage of times a group of objects was given the same label (either 1 or 2). For example, objects a, b, and d were given the same label by 3 out of 5 people, so the percentage is 3/5 (= 60%), whereas objects a and d were given the same label by all 5 people, so the percentage is 5/5 (= 100%).
I can calculate this statistic manually, but my original dataset has 50 such objects, 30 people, and 4 labels (1, 2, 3, and 4). How can I compute such statistics for this bigger dataset automatically? Are there any existing packages/tools in R which can calculate them?
Note: a group can be of any size. In the first example the group consists of a, b, and d, while in the second it consists of a and d.
There are two tasks here: firstly, making a list of all the relevant combinations, and secondly, evaluating and aggregating rowwise similarity. combn can start the first task, but it takes a little massaging to arrange the results into a neat list. The second task could be handled with prop.table, but here it's simpler to calculate directly.
Here I've used tidyverse grammar (primarily purrr, which is helpful for handling lists), but convert into base if you like.
library(tidyverse)
map(2:length(df), ~combn(names(df), .x, simplify = FALSE)) %>% # get combinations
  flatten() %>%                                    # eliminate nesting
  set_names(map_chr(., paste0, collapse = '')) %>% # add useful names
  # subset df with each combination; check if each row has only one unique value
  map(~apply(df[.x], 1, function(x){n_distinct(x) == 1})) %>%
  map_dbl(~sum(.x) / length(.x))                   # calculate proportion of TRUEs
## ab ac ad bc bd cd abc abd acd bcd abcd
## 0.6 0.2 1.0 0.2 0.6 0.2 0.0 0.6 0.2 0.0 0.0
If you have numeric ratings, you could use diff to check that there is consistently 0 difference between consecutive objects: after transposing, each column of diff(t(data[cols])) holds one rater's differences between consecutive objects of the group, and a rater whose differences are all zero gave every object the same label:
f <- function(cols, data) {
  sum(colSums(diff(t(data[cols])) == 0) == (length(cols) - 1)) / nrow(data)
}
Results are as expected when applying the function to example groups:
f(c("a","b","d"), df)
#[1] 0.6
f(c("a","d"), df)
#[1] 1
With base R functions you could do:
groupVec = c("a","b","d")
transDF = t(as.matrix(df))
subDF = transDF[rownames(transDF) %in% groupVec,]
subDF
# [,1] [,2] [,3] [,4] [,5]
# a 1 2 1 2 1
# b 1 2 2 1 1
# d 1 2 1 2 1
# if a person used exactly 1 unique label within the group, that person labeled
# all objects the same; matching persons / total persons = match pct
match_pct = sum(sapply(as.data.frame(subDF), function(x) length(unique(x)) == 1)) / ncol(subDF)
match_pct
# [1] 0.6
Wrapping it in a custom function:
fn_matchPercent = function(groupVec = c("a","d")) {
  transDF = t(as.matrix(df))
  subDF = transDF[rownames(transDF) %in% groupVec,]
  match_pct = sum(sapply(as.data.frame(subDF), function(x) length(unique(x)) == 1)) / ncol(subDF)
  outputDF = data.frame(groups = paste0(groupVec, collapse=","), match_pct = match_pct)
  return(outputDF)
}
fn_matchPercent(c("a","d"))
# groups match_pct
# 1 a,d 1
fn_matchPercent(c("a","b","d"))
# groups match_pct
# 1 a,b,d 0.6
Try this:
find.unanimous.percentage <- function(df, at.a.time) {
  cols <- as.data.frame(t(combn(names(df), at.a.time)))
  names(cols) <- paste('O', 1:at.a.time, sep='')
  cols$percent.unanimous <- 100 * colMeans(apply(cols, 1, function(x) apply(df[x], 1, function(y) length(unique(y)) == 1)))
  return(cols)
}
find.unanimous.percentage(df, 2) # take 2 at a time
O1 O2 percent.unanimous
1 a b 60
2 a c 20
3 a d 100
4 b c 20
5 b d 60
6 c d 20
find.unanimous.percentage(df, 3) # take 3 at a time
O1 O2 O3 percent.unanimous
1 a b c 0
2 a b d 60
3 a c d 20
4 b c d 0
find.unanimous.percentage(df, 4)
O1 O2 O3 O4 percent.unanimous
1 a b c d 0
Clustering similarity metrics
It seems that you might want to calculate a substantially different (better?) metric than the one you propose now, if your actual problem requires evaluating various ways of clustering the same data.
This paper is a good overview of the problem: http://cs.utsa.edu/~qitian/seminar/Spring11/03_11_11/IR2009.pdf, and the BCubed precision/recall metrics are commonly used for similar problems in NLP (e.g. http://alias-i.com/lingpipe/docs/api/com/aliasi/cluster/ClusterScore.html).
Try this code. It works for your example and should hold for the extended case.
df <- data.frame(a = c(1,2,1,2,1), b=c(1,2,2,1,1), c= c(2,1,2,2,2), d=c(1,2,1,2,1))
# Find all unique combinations of the column names
group_pairs <- data.frame(t(combn(colnames(df), 2)))
# For each combination calculate the similarity
group_pairs$similarities <- apply(group_pairs, 1, function(x) {
  sum(df[x["X1"]] == df[x["X2"]]) / nrow(df)
})
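For the example df, the result should line up with the pairwise figures from the other answers (sketched output):
group_pairs
#   X1 X2 similarities
# 1  a  b          0.6
# 2  a  c          0.2
# 3  a  d          1.0
# 4  b  c          0.2
# 5  b  d          0.6
# 6  c  d          0.2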

Multiply various subsets of a data frame by different vectors

I would like to multiply several columns in my data frame by a vector of values. The specific vector of values changes depending on the value in another column.
--EDIT--
What if I make the data set more complicated, i.e., more than 2 conditions and the conditions are randomly shuffled around the data set?
Here is an example of my data set:
df = data.frame(
  Treatment = rep(LETTERS[1:4], each=2),
  Species = rep(1:4, each=2),
  Value1 = c(0,0,1,3,4,2,0,0),
  Value2 = c(0,0,3,4,2,1,4,5),
  Value3 = c(0,2,4,5,2,1,4,5),
  Condition = c("A","B","A","C","B","A","B","C")
)
Which looks like:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 2 B
B 2 1 3 4 A
B 2 3 4 5 C
C 3 4 2 2 B
C 3 2 1 1 A
D 4 0 4 4 B
D 4 0 5 5 C
If Condition=="A", I would like to multiply columns 3-5 by the vector c(1,2,3). If Condition=="B", I would like to multiply columns 3-5 by the vector c(4,5,6). If Condition=="C", I would like to multiply columns 3-5 by the vector c(0,1,0). The resulting data frame would therefore look like this:
Treatment Species Value1 Value2 Value3 Condition
A 1 0 0 0 A
A 1 0 0 12 B
B 2 1 6 12 A
B 2 0 4 0 C
C 3 16 10 12 B
C 3 2 2 3 A
D 4 0 20 24 B
D 4 0 5 0 C
I have tried subsetting the data frame and multiplying by the vector:
t(t(subset(df[,3:5],df[,6]=="A")) * c(1,2,3))
But I can't get the subsetted data frame back into the original. Is there any way to perform this operation without subsetting the data frame, so that other columns (e.g., Treatment, Species) are preserved?
Here's a fairly general solution that you should be able to adapt to fit your needs.
Note the first argument in the outer call is a logical vector and the second is numeric, so before multiplication TRUE and FALSE are converted to 1 and 0, respectively. We can add the outer results because the conditions are non-overlapping and the FALSE elements will be zero.
multiples <-
outer(df$Condition=="A",c(1,2,3)) +
outer(df$Condition=="B",c(4,5,6)) +
outer(df$Condition=="C",c(0,1,0))
df[,3:5] <- df[,3:5] * multiples
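For the example data, multiples should look like this, one row per row of df, selected by its Condition (a sketch of the first rows):
head(multiples, 4)
#      [,1] [,2] [,3]
# [1,]    1    2    3   # Condition A
# [2,]    4    5    6   # Condition B
# [3,]    1    2    3   # Condition A
# [4,]    0    1    0   # Condition C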
Here's a non-vectorized, but easy to understand solution (with a branch added for "C" so it matches the edited question):
replaceFunction <- function(v){
  m <- as.numeric(v[3:5])
  if (v[6]=="A")
    out <- m * c(1,2,3)
  else if (v[6]=="B")
    out <- m * c(4,5,6)
  else if (v[6]=="C")
    out <- m * c(0,1,0)
  else
    out <- m
  return(out)
}
g <- apply(df, 1, replaceFunction)
df[3:5] <- t(g)
df
Edited to reflect some notes from the comments
Assuming that Condition is a factor, you could do this:
#Modified to reflect OP's edit - the same solution works just fine
m <- matrix(c(1:6,0,1,0),3,3,byrow = TRUE)
df[,3:5] <- with(df,df[,3:5] * m[Condition,])
which makes use of fairly quick vectorized multiplication. And obviously, wrapping this in with isn't strictly necessary, it's just what popped out of my brain. Also note the subsetting comment below by Backlin.
More globally, remember that every subsetting you can do with subset you can also do with [, and crucially, [ support assignment via [<-. So if you want to alter a portion of a data frame or matrix, you can always use this type of idiom:
df[rowCondition,colCondition] <- <replacement values>
assuming of course that <replacement values> is the same dimension as your subset of df. It may work otherwise, but you will run afoul of R's recycling rules and R may kick back a warning.
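For instance, a minimal sketch of that idiom applied to the "C" rows of this problem (the vector is expanded to a full matrix so it multiplies row-wise instead of being recycled down the columns):
idx <- df$Condition == "C"
df[idx, 3:5] <- df[idx, 3:5] * matrix(c(0,1,0), sum(idx), 3, byrow = TRUE)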
The next two one-liners were written against the original two-condition version of the question, so they only distinguish "B" from everything else:
df[3:5] <- df[3:5] * t(sapply(df$Condition, function(x) if(x=="B") 4:6 else 1:3))
Or by vector multiplication
df[3:5] <- df[3:5] * (3*(df$Condition == "B") %*% matrix(1, 1, 3)
+ matrix(1:3, nrow(df), 3, byrow=T))

What is the most efficient way to return ranks of a vector within levels of a factor, as a vector having the same order/length as the original vector?

With one more requirement - that the resulting vector is in the same order as the original.
I have a very basic function that computes percentile ranks of a vector, and it works just the way I want it to:
ptile <- function(x) {
  p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  p
}
data <- c(1, 2, 3, 100, 200, 300)
For example, ptile(data) generates:
[1] 0.0 0.2 0.4 0.6 0.8 1.0
What I'd really like to be able to do is use this same function (ptile) and have it work within levels of a factor. So suppose I have a "factor" f as follows:
f <- as.factor(c("a", "a", "b", "a", "b", "b"))
I'd like to be able to transform "data" into a vector that tells me, for each observation, what its corresponding percentile is relative to other observations within its same level, like this:
0.0 0.5 0.0 1.0 0.5 1.0
As a shot in the dark, I tried:
tapply(data,f,ptile)
and see that it does, in fact, succeed at doing the ranking/percentiling, but does so in a way that I have no idea which observations match up to their indices in the original vector:
[1] a a b a b b
Levels: a b
> tapply(data,f,ptile)
$a
[1] 0.0 0.5 1.0
$b
[1] 0.0 0.5 1.0
This matters because the actual data I'm working with can have 1000-3000 observations (stocks) and 10-55 levels (things like sectors, groupings by other stock characteristics, etc), and I need the resulting vector to come out in the same order as it went in, so that everything lines up row by row in my matrix.
Is there some "apply" variant that would do what I am seeking? Or a few quick lines that would do the trick? I've written this functionality in C# and F# with a lot more lines of code, but had figured that in R there must be some really direct, elegant solution. Is there?
Thanks in advance!
The ave function is very useful. The main gotcha is to remember that you always need to name the function with FUN= (it has to be passed by name because it comes after ... in ave's signature):
dt <- data.frame(data, f)
dt$rank <- with(dt, ave(data, list(f), FUN=rank))
dt
#---
data f rank
1 1 a 1
2 2 a 2
3 3 b 1
4 100 a 3
5 200 b 2
6 300 b 3
Edit: I thought I was answering the question in the title but have been asked to include the code that uses the "ptile" function:
> dt$ptile <- with(dt, ave(data, list(f), FUN=ptile))
> dt
data f rank ptile
1 1 a 1 0.0
2 2 a 2 0.5
3 3 b 1 0.0
4 100 a 3 1.0
5 200 b 2 0.5
6 300 b 3 1.0
For what you are trying to do, I would first put the stock, sector, and value as columns in a data frame. E.g. with some made-up data:
> set.seed(1)
> df <- data.frame(stock = 1:10,
+ sector = sample(letters[1:2], 10, repl = TRUE),
+ val = sample(1:10))
> df
stock sector val
1 1 a 3
2 2 a 2
3 3 b 6
4 4 b 10
5 5 a 5
6 6 b 7
7 7 b 8
8 8 b 4
9 9 b 1
10 10 a 9
Then you can use the ddply function from the plyr package to compute the "sectorwise" percentile (there are other ways, but I find plyr very useful and would recommend taking a look at it):
require(plyr)
df.p <- ddply(df, .(sector), transform, pct = ptile(val))
Now of course in df.p the rows will be arranged by the factor (i.e. sector), and it's a simple matter to restore it to the original order, e.g.:
> df.p[ order(df.p$stock),]
stock sector val pct
1 1 a 3 0.3333333
2 2 a 2 0.0000000
5 3 b 6 0.4000000
6 4 b 10 1.0000000
3 5 a 5 0.6666667
7 6 b 7 0.6000000
8 7 b 8 0.8000000
9 8 b 4 0.2000000
10 9 b 1 0.0000000
4 10 a 9 1.0000000
In particular the pct column is the final vector you are seeking in your original question.
When you call tapply() with INDEX=f you get a result that is subsetted by f and broken into a list in the order of the levels of f. To reverse that process, note that order(f) is the permutation that sorts the data into that grouped order, and order(order(f)) is its inverse, so simply:
unlist(tapply(data, f, ptile))[order(order(f))]
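A quick illustration of the inverse-permutation trick with the example factor:
f <- as.factor(c("a", "a", "b", "a", "b", "b"))
order(f)         # 1 2 4 3 5 6: the positions that sort f (all a's, then all b's)
order(order(f))  # 1 2 4 3 5 6: where each original element lands in that sorted layout
# (in this example the permutation happens to be its own inverse)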
Your example data vector happened to be in numeric order already, but this works even if the data is in random order...
ptile <- function(x) {
  p <- (rank(x) - 1)/(length(which(!is.na(x))) - 1)
  p[p > 1] <- NA
  # concatenated with the original data to make the match clear
  paste(round(p * 100, 2), x, sep="% ")
}
data <- sample(c(1:5, (1:5)*100), 10)
f <- sample(letters[1:2], 10, replace=TRUE)
result <- unlist(tapply(data, f, ptile))[order(order(f))]
data.frame(result, data, f)
