Updating a vector within a dataframe using a random experiment - r

I have the following dataframes "df1" and "df2":
x1 <- c(1,1,1,2,2,3)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
y <- c(0,1,2)
p <- c(0.1,0.6,0.9)
df2 <- data.frame(y,p)
What I want to do is to update df1$x1 to a new vector df1$x2, based on a random experiment. This can be manually done using the following function and "lapply" on vector df1$x1:
example_function <- function(x,p){
if(runif(1) <= p) return(x + 1)
return(x)
}
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,example_function,0.5))
The function draws a uniform random number and compares it with a given probability p. Depending on the result, the value in df1$x2 is either x unchanged or x increased by 1.
In the procedure described above, p was set manually in the function call (here 0.5 for all x-values in df1). However, I want p to be chosen automatically depending on the combination of df1$x1 and df1$y1. This is where df2 comes into play: it shows which p-value belongs to which y-value. For example, df1$x1[3] equals 1 and the corresponding y-value df1$y1[3] also equals 1. df2 shows that the associated p-value has to be 0.6 (the p-value for y equal to 1), so p = 0.6 should be used in "example_function" to determine df1$x2[3]. Depending on df1$y1, p should be 0.1 for df1$x1[1] and df1$x1[2], 0.6 for df1$x1[3] and df1$x1[4], and 0.9 for df1$x1[5] and df1$x1[6].
The following example is one approach, but it only works if the vector df1$x1 contains no duplicate values:
x1 <- c(1,2,3,4,5,6)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,
function(z) {
example_function(z, df2$p[df2$y == df1$y1[df1$x1 == z]])
}))
df1
x1 y1 x2
#1 1 0 1
#2 2 0 2
#3 3 1 4
#4 4 1 4
#5 5 2 5
#6 6 2 7
Using x1 <- c(1,1,1,2,2,3), as mentioned above, leads to warnings and errors:
x1 <- c(1,1,1,2,2,3)
y1 <- c(0,0,1,1,2,2)
df1 <- data.frame(x1,y1)
set.seed(123)
df1$x2 <- unlist(lapply(df1$x1,
function(z) {
example_function(z, df2$p[df2$y == df1$y1[df1$x1 == z]])
}))
Error in if (runif(1) <= p) return(x + 1) : argument is of length zero
In addition: Warning message:
In df2$y == df1$y1[df1$x1 == z] :
longer object length is not a multiple of shorter object length
Is there anyone who has an idea how to fix that problem? I am very grateful for any help.
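The failure can be reproduced in isolation. The following is a minimal sketch (the names y_for_2 and p_for_2 are introduced here only for illustration): because x1 contains duplicates, df1$x1 == z selects several rows, so the inner lookup returns more than one y-value and the comparison against df2$y no longer picks out a single p.

```r
df1 <- data.frame(x1 = c(1, 1, 1, 2, 2, 3), y1 = c(0, 0, 1, 1, 2, 2))
df2 <- data.frame(y = c(0, 1, 2), p = c(0.1, 0.6, 0.9))

# x1 == 2 occurs in two rows, so two y-values come back instead of one
y_for_2 <- df1$y1[df1$x1 == 2]
print(y_for_2)          # 1 2

# Comparing the three values of df2$y with two values recycles the
# shorter vector (hence the warning); no position matches, so p is empty
p_for_2 <- suppressWarnings(df2$p[df2$y == y_for_2])
print(length(p_for_2))  # 0, which is what triggers "argument is of length zero"
```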

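Working with "merge" seems to be one solution (note that merge sorts its result by the key column by default, so pairing df_new$p with df1$x1 relies on df1 already being ordered by y1, as it is here):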
df_new <- merge(df1, df2, by.x = 'y1', by.y = 'y')
set.seed(123)
df1$x2 <- mapply(example_function,df1$x1,df_new$p)
> df1
x1 y1 x2
1 1 0 1
2 1 0 1
3 1 1 2
4 2 1 2
5 2 2 2
6 3 2 4
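A sketch of an alternative that does not depend on row order: look up p for each row with match() on y1, then run the random experiment for all rows at once. The extra column p and the vectorised comparison are choices made for this illustration, not part of the answer above.

```r
df1 <- data.frame(x1 = c(1, 1, 1, 2, 2, 3), y1 = c(0, 0, 1, 1, 2, 2))
df2 <- data.frame(y = c(0, 1, 2), p = c(0.1, 0.6, 0.9))

# p for every row of df1, found by matching y1 against df2$y
df1$p <- df2$p[match(df1$y1, df2$y)]

set.seed(123)
# One uniform draw per row; a draw <= p counts as a success and adds 1
df1$x2 <- df1$x1 + (runif(nrow(df1)) <= df1$p)
print(df1)
```

With this seed, x2 comes out as 1 1 2 2 2 4, the same values as in the merge-based output, because runif(6) consumes the random stream in the same order as six calls to runif(1).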

Related

Aggregate a table by applying a function of multiple columns

Considering the following table df, with categorical variables noted x1 and x2 and numerical measurements noted y1, y2 and y3:
df <- data.frame(x1=sample(letters[1:3], 20, replace=TRUE),
x2=sample(letters[4:6], 20, replace=TRUE),
y1=rnorm(20), y2=rnorm(20), y3=rnorm(20))
I'd like to apply on it a function of the 3 numerical measurements y with respect to the categorical variables x. For example the following function, where the input y is a table of 3 columns, which should output one new column:
f <- function(y){ sum((y[,1] - y[,2]) / y[,3]) }
I tried aggregate, dplyr, summarizeBy, etc. without success, as it seems that none of these methods lets the function mix several input columns. Any idea how to do this with such functions (i.e. taking advantage of aggregation)? My attempt:
aggregate(data = df, y1 + y2 + y3 ~ x1 + x2, FUN = f)
To clarify, the expected result can be obtained with something like:
groups <- unique(df[,c("x1", "x2")]) # co-occurrences of explanatory variables
res <- c()
for (i in 1:nrow(groups)){ # get the subtables
temp <- df[df$x1 == groups[i,1] & df$x2 == groups[i,2], c("y1", "y2", "y3")]
res <- c(res, f(temp)) # apply function on subtables
}
groups$res <- res # aggregate results
Which is not that bad for this simple toy example but very impractical with more complex data.
The problem is on the input side of your function. The way you specified it, it expects a data frame.
A possible solution is to feed the function a list of columns. With a small change to your function:
f <- function(y) sum((y[[1]] - y[[2]]) / y[[3]])
You can now use it in a dplyr-chain:
library(dplyr)
df %>%
group_by(x1, x2) %>%
summarise(sum_y = f(list(y1, y2, y3)))
which gives:
# A tibble: 9 x 3
# Groups: x1 [?]
x1 x2 sum_y
<fct> <fct> <dbl>
1 a d 1.20
2 a e 0.457
3 a f -9.46
4 b d -1.11
5 b e -0.176
6 b f -1.34
7 c d -0.994
8 c e 3.38
9 c f -2.63
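For comparison, here is a base R sketch that keeps the original f (the one taking a three-column table) unchanged: split() cuts the measurement columns into one sub-table per observed (x1, x2) combination and sapply() applies f to each. The seed and the names pieces and res are assumptions added for this illustration.

```r
set.seed(1)  # seed added here only so the sketch is reproducible
df <- data.frame(x1 = sample(letters[1:3], 20, replace = TRUE),
                 x2 = sample(letters[4:6], 20, replace = TRUE),
                 y1 = rnorm(20), y2 = rnorm(20), y3 = rnorm(20))

# Original function from the question: expects a table with 3 columns
f <- function(y) sum((y[, 1] - y[, 2]) / y[, 3])

# One sub-table per observed (x1, x2) pair; drop = TRUE skips empty pairs
pieces <- split(df[c("y1", "y2", "y3")], list(df$x1, df$x2), drop = TRUE)

# Apply f to each sub-table; the result is a named vector ("a.d", "b.d", ...)
res <- sapply(pieces, f)
print(res)
```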

How do I pass multiple columns to a function within dplyr::summarize

I am trying to pass all columns of a data.frame that match a criterion to a function within the summarize function of dplyr, as follows:
df %>% group_by(Version, Type) %>%
summarize(mcll(TrueClass, starts_with("pred")))
Error: argument is of length zero
Is there a way to do this? A working example follows:
Build a simulated data.frame of sample predictions. These are interpreted as the output of a classification algorithm.
library(dplyr)
nrow <- 40
ncol <- 4
set.seed(567879)
getProbs <- function(i) {
p <- runif(i)
return(p / sum(p))
}
df <- data.frame(matrix(NA, nrow, ncol))
for (i in seq(nrow)) df[i, ] <- getProbs(ncol)
names(df) <- paste0("pred.", seq(ncol))
Add a column indicating the true class:
df$TrueClass <- factor(ceiling(runif(nrow, min = 0, max = ncol)))
Add categorical columns for sub-setting:
df$Type <- c(rep("a", nrow / 2), rep("b", nrow / 2))
df$Version <- rep(1:4, times = nrow / 4)
Now I want to calculate the multiclass log loss for these predictions using the function below:
mcll <- function (act, pred)
{
if (class(act) != "factor") {
stop("act must be a factor")
}
pred[pred == 0] <- 1e-15
pred[pred == 1] <- 1 - 1e-15
dummies <- model.matrix(~act - 1)
if (nrow(dummies) != nrow(pred)) {
return(0)
}
return(-1 * (sum(dummies * log(pred)))/length(act))
}
This is easily done with the entire data set:
act <- df$TrueClass
pred <- df %>% select(starts_with("pred"))
mcll(act, pred)
But I want to use dplyr group_by to calculate mcll for each subset of the data:
df %>% group_by(Version, Type) %>%
summarize(mcll(TrueClass, starts_with("pred")))
Ideally I could do this without changing the mcll() function, but I am open to doing that if it simplifies the other code.
Thanks!
EDIT: Note that the input to mcll is a vector of true values and a matrix of probabilities with one column for each "pred" column. For each subset of data, mcll should return a scalar. I can get exactly what I want with the code below, but I was hoping for something within the context of dplyr.
mcll_df <- data.frame(matrix(ncol = 3, nrow = 8))
names(mcll_df) <- c("Type", "Version", "mcll")
count = 1
for (ver in unique(df$Version)) {
for (type in unique(df$Type)) {
subdat <- df %>% filter(Type == type & Version == ver)
val <- mcll(subdat$TrueClass, subdat %>% select(starts_with("pred")))
mcll_df[count, ] <- c(Type = type, Version = ver, mcll = val)
count = count + 1
}
}
head(mcll_df)
Type Version mcll
1 a 1 1.42972507510096
2 b 1 1.97189000832723
3 a 2 1.97988830406062
4 b 2 1.21387875938737
5 a 3 1.30629638026735
6 b 3 1.48799237895462
This is easy to do using data.table:
library(data.table)
setDT(df)[, mcll(TrueClass, .SD), by = .(Version, Type), .SDcols = grep("^pred", names(df))]
# Version Type V1
#1: 1 a 1.429725
#2: 2 a 1.979888
#3: 3 a 1.306296
#4: 4 a 1.668330
#5: 1 b 1.971890
#6: 2 b 1.213879
#7: 3 b 1.487992
#8: 4 b 1.171286
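The same per-group call can also be sketched within dplyr. This assumes a dplyr version that provides group_modify(); the data setup repeats the question's simulation with renamed helpers (n_row, n_col, get_probs), and mcll() is left as in the question. Inside group_modify(), .x is the sub-table for one group without the grouping columns, so the whole block of "pred" columns can be handed to mcll() at once.

```r
library(dplyr)

# Simulated predictions, built as in the question
n_row <- 40
n_col <- 4
set.seed(567879)
get_probs <- function(k) { p <- runif(k); p / sum(p) }
df <- data.frame(matrix(NA, n_row, n_col))
for (i in seq(n_row)) df[i, ] <- get_probs(n_col)
names(df) <- paste0("pred.", seq(n_col))
df$TrueClass <- factor(ceiling(runif(n_row, min = 0, max = n_col)))
df$Type <- c(rep("a", n_row / 2), rep("b", n_row / 2))
df$Version <- rep(1:4, times = n_row / 4)

# mcll() exactly as in the question
mcll <- function(act, pred) {
  if (class(act) != "factor") stop("act must be a factor")
  pred[pred == 0] <- 1e-15
  pred[pred == 1] <- 1 - 1e-15
  dummies <- model.matrix(~act - 1)
  if (nrow(dummies) != nrow(pred)) return(0)
  -1 * sum(dummies * log(pred)) / length(act)
}

# One call to mcll() per (Version, Type) group; the pred columns are
# passed together as a plain data.frame, one scalar comes back per group
res <- df %>%
  group_by(Version, Type) %>%
  group_modify(~ data.frame(
    mcll = mcll(.x$TrueClass, as.data.frame(select(.x, starts_with("pred"))))
  ))
print(res)
```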
I had to change the mcll function a little bit but then it worked. The problem is occurring with the second if statement. You are telling the function to get nrow(pred), but if you are summarizing over multiple columns you are actually only supplying a vector each time (because each column gets analyzed separately). Additionally, I switched the order of the arguments being entered into the function.
mcll <- function (pred, act)
{
if (class(act) != "factor") {
stop("act must be a factor")
}
pred[pred == 0] <- 1e-15
pred[pred == 1] <- 1 - 1e-15
dummies <- model.matrix(~act - 1)
if (nrow(dummies) != length(pred)) { # the main change is here
return(0)
}
return(-1 * (sum(dummies * log(pred)))/length(act))
}
From there we can use the summarise_each function.
df %>% group_by(Version,Type) %>% summarise_each(funs(mcll(., TrueClass)), matches("pred"))
Version Type pred.1 pred.2 pred.3 pred.4
(int) (chr) (dbl) (dbl) (dbl) (dbl)
1 1 a 1.475232 1.972779 1.743491 1.161984
2 1 b 2.030829 1.331629 1.397577 1.484865
3 2 a 1.589256 1.740858 1.898906 2.005511
I checked this against a subset of the data and it looks like it works.
mcll(df$pred.1[which(df$Type=="a" & df$Version==1)],
df$TrueClass[which(df$Type=="a" & df$Version==1)])
[1] 1.475232 #pred.1 mcll when Version equals 1 and Type equals a.

Standard deviation for each row value and constant value

N <- c(1,3,4,6)
a <- c(3,4,5,6)
b <- c(4,5,6,7)
w <- c(5,6,7,6)
dat1 <- data.frame(N,May = a, April = b,June = w)
N May April June
1 1 3 4 5
2 3 4 5 6
3 4 5 6 7
4 6 6 7 6
I need a data frame in which each value is the standard deviation of the row's N value and the corresponding month value in that row:
sd(c(1,3)) sd(c(1,4)) sd(c(1,5)) # for the first row
sd(c(3,4)) sd(c(3,5)) sd(c(3,6)) # for the second row, and so on
Try this:
The data:
Norm <- c(1,3,4,6)
a <- c(3,4,5,6)
b <- c(4,5,6,7)
w <- c(5,6,7,6)
mydata <- data.frame(Norm=Norm,May = a, April = b,June = w)
Solution:
finaldata <- do.call('cbind',lapply(names(mydata)[2:4], function(x) apply(mydata[c("Norm",x)],1,sd)))
I hope it helps.
Piece of advice:
Please refrain from using names like data and norm for your variable names. They can easily conflict with things that are native to R. For example norm is a function in R, and so is data.
I think I got it
x <- matrix(NA, nrow = 4, ncol = 3)
for (j in 1:3) {
for (i in 1:4) {
# sd of N (column 1) and the month value in column j + 1, both taken from row i
x[i, j] <- sd(unlist(dat1[i, c(1, j + 1)]))
}
}
x
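Since each standard deviation is taken over exactly two values, a loop is not strictly needed: for two numbers a and b, sd(c(a, b)) equals abs(a - b) / sqrt(2). A sketch using that identity (the names sd_tab and check are introduced here for illustration):

```r
dat1 <- data.frame(N = c(1, 3, 4, 6), May = c(3, 4, 5, 6),
                   April = c(4, 5, 6, 7), June = c(5, 6, 7, 6))

# Subtract N from every month column (N is recycled down each column),
# then rescale: for two values, sd = |a - b| / sqrt(2)
sd_tab <- abs(dat1[-1] - dat1$N) / sqrt(2)
print(sd_tab)

# Cross-check against sd() computed pair by pair
check <- sapply(dat1[-1], function(col) {
  mapply(function(a, b) sd(c(a, b)), dat1$N, col)
})
print(all.equal(as.matrix(sd_tab), check, check.attributes = FALSE))
```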

Assign an element value based on element adjacencies in R

I have a data frame with {0,1} indicating whether a product was Small, Medium or Large.
dat <- data.frame(Sm = c(1,0,0), Med = c(0,1,0), Lg = c(0,0,1))
Sm Med Lg
1 1 0 0
2 0 1 0
3 0 0 1
I'm looking to assign 1's to the 0's leading up to a 1 in a given row. For example in row 2 the product is a "Med", so I'm looking to assign a 1 to the 0 in the "Sm" column.
Allocation size is a consideration so I'm looking for a vectorized approach without using a for loop please. The final solution should output the following:
Sm Med Lg
1 1 0 0
2 1 1 0
3 1 1 1
I've tried several variations of the code below, but the closest I can get is a ragged array which assigns all of the 1's correctly while dropping the elements that have legitimate 0's.
apply(dat, 1, function(x) {
x[1:which.max(x)] <- 1
})
[1] 1 1 1
And below, which gets close but without the needed trailing 0's
apply(dat, 1, function(x) {
temp <- x[1:which.max(x)]
unlist(lapply(temp, function(y) {
y <- 1
}))
})
[[1]]
Sm
1
[[2]]
Sm Med
1 1
[[3]]
Sm Med Lg
1 1 1
First, convert to matrix and use max.col to get the index of the 1 in each row:
mat <- as.matrix(dat)
mc <- max.col(mat)
logical construction
Overwrite the matrix:
mat = +(col(mat) <= mc)
or construct an index of matrix positions to change and change 'em:
logical indexing
mat[col(mat) < mc] <- 1L
# or
mat[which(col(mat) < mc)] <- 1L
matrix indexing
idx <- do.call( rbind, lapply( seq_along(mc), function(i)
if (mc[i]==1L) NULL # nothing to fill when the 1 is already in the first column
else cbind(i,seq_len(mc[i]-1))
))
mat[idx] <- 1L
vector indexing
nr <- nrow(mat)
idx <- unlist( lapply( seq_along(mc), function(i)
if (mc[i]==1L) NULL
else seq(from = i, by = nr, length.out = mc[i]-1L)
))
mat[idx] <- 1L
The help for all three indexing methods can be found at help("[<-").
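Putting the logical-construction variant together end to end, as a sketch: ties.method = "first" is an assumption added here only to make max.col deterministic, and mat[] <- is used instead of plain assignment so that the column names survive.

```r
dat <- data.frame(Sm = c(1, 0, 0), Med = c(0, 1, 0), Lg = c(0, 0, 1))
mat <- as.matrix(dat)

# Column index of the 1 in each row
mc <- max.col(mat, ties.method = "first")

# col(mat) holds each cell's column number; mc is recycled down the rows,
# so cell [i, j] is compared with mc[i]. Unary + turns TRUE/FALSE into 1/0.
mat[] <- +(col(mat) <= mc)
print(mat)
```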
This will do what you want.
dat[which(dat$Med==1),]$Sm = 1
dat[which(dat$Lg==1),]$Med = 1
dat[which(dat$Lg==1),]$Sm = 1

number elements in a vector with constraints

Given x and y I wish to create the desired.result below:
x <- 1:10
y <- c(2:4,6:7,8:9)
desired.result <- c(1,2,2,2,3,4,4,5,5,6)
where, in effect, each sequence in y is replaced in x by the first element of that sequence, and then the elements of the new x are numbered.
The intermediate step for x would be:
x.intermediate <- c(1,2,2,2,5,6,6,8,8,10)
Below is code that does this. However, the code is not general and is overly complex:
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
unique.x <- 1:(length(x[-unlist(y)]) + length(y))
y1 <- rep(min(unlist(y[1])), length(unlist(y[1])))
y2 <- rep(min(unlist(y[2])), length(unlist(y[2])))
y3 <- rep(min(unlist(y[3])), length(unlist(y[3])))
new.x <- x
new.x[unlist(y[1])] <- y1
new.x[unlist(y[2])] <- y2
new.x[unlist(y[3])] <- y3
rep(unique.x, rle(new.x)$lengths)
[1] 1 2 2 2 3 4 4 5 5 6
Below is my attempt to generalize the code. However, I am stuck on the second lapply.
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
unique.x <- 1:(length(x[-unlist(y)]) + length(y))
y2 <- lapply(y, function(i) rep(min(i), length(i)))
new.x <- x
lapply(y2, function(i) new.x[i[1]:(i[1]-1+length(i))] = i)
rep(unique.x, rle(new.x)$lengths)
Thank you for any advice. I suspect there is a much simpler solution I am overlooking. I prefer a solution in base R.
A solution like this should work:
x <- 1:10
y <- list(c(2:4),(6:7),(8:9))
x[unlist(y)] <- rep(sapply(y, '[', 1), lengths(y))
rep(seq_along(rle(x)$lengths), rle(x)$lengths)
## [1] 1 2 2 2 3 4 4 5 5 6
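Another base R sketch that skips the intermediate vector: every element of x starts a new number unless it is a non-first member of one of the sequences in y, so a cumulative sum over that condition gives the numbering directly. The names followers and result are introduced here for illustration.

```r
x <- 1:10
y <- list(2:4, 6:7, 8:9)

# All members of each sequence except its first element
followers <- unlist(lapply(y, function(s) s[-1]))

# TRUE where a new number starts; cumsum turns that into a running index
result <- cumsum(!(x %in% followers))
print(result)
## [1] 1 2 2 2 3 4 4 5 5 6
```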
