R Version 2.11.1 32-bit on Windows 7
I have two data sets as shown below:
data_set_A:
USER_B ACTION
10 0.1
11 0.3
12 0.1
data_set_B:
USER_A USER_B ACTION
1 10 0.2
1 11 0.1
1 15 0.1
2 12 0.2
How can I add the ACTION value of each USER_B in data_set_A to the matching rows of data_set_B? The USER_B values in data_set_A are a subset of the USER_B values in data_set_B.
For the example above, the result would be:
USER_A USER_B ACTION
1 10 0.2+0.1
1 11 0.1+0.3
1 15 0.1
2 12 0.2+0.1
In data_set_B I don't need to consider USER_A; only the USER_B values that also appear in data_set_A matter.
Can this be achieved without processing the rows one by one?
dfa <- data.frame(
user_b = 10:12,
action = c(0.1, 0.3, 0.1)
)
dfb <- data.frame(
user_a = c(1, 1, 1, 2),
user_b = c(10, 11, 15, 12),
action = c(0.2, 0.1, 0.1, 0.2)
)
action <- dfa$action[match(dfb$user_b, dfa$user_b)]
action[is.na(action)] <- 0
dfb$action <- dfb$action + action
dfb
user_a user_b action
1 1 10 0.3
2 1 11 0.4
3 1 15 0.1
4 2 12 0.3
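The key step above is match, which returns, for each user_b in dfb, the position of that value in dfa$user_b, or NA when there is no counterpart. A minimal sketch of the intermediate values, rebuilding the same dfa and dfb (idx and extra are illustrative names, not part of the answer):

```r
dfa <- data.frame(user_b = 10:12, action = c(0.1, 0.3, 0.1))
dfb <- data.frame(user_a = c(1, 1, 1, 2),
                  user_b = c(10, 11, 15, 12),
                  action = c(0.2, 0.1, 0.1, 0.2))

# Position of each dfb$user_b within dfa$user_b; 15 has no match, so NA
idx <- match(dfb$user_b, dfa$user_b)
idx
# [1]  1  2 NA  3

# Indexing with NA yields NA, which is why the NAs are set to 0 before adding
extra <- dfa$action[idx]
extra
# [1] 0.1 0.3  NA 0.1
```

Because the lookup is a single vectorised call, no row-by-row loop is needed.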
One way is to do the equivalent of a database merge on the two data sets to form the action pairs you want and then sum those. Using @Andrie's example code:
dfa <- data.frame(
user_b = 10:12,
action = c(0.1, 0.3, 0.1)
)
dfb <- data.frame(
user_a = c(1, 1, 1, 2),
user_b = c(10, 11, 15, 12),
action = c(0.2, 0.1, 0.1, 0.2)
)
Solution Code
I'll first present the full solution and then explain the steps:
mdat <- merge(dfb, dfa, by = "user_b", all.x = TRUE)
res <- data.frame(mdat[,c(2,1)],
action = rowSums(mdat[, c("action.x", "action.y")],
na.rm = TRUE))
res <- res[order(res$user_a, res$user_b),]
res now contains the results.
Explanation
We first merge the two data frames, matching on user_b:
## merge the data
mdat <- merge(dfb, dfa, by = "user_b", all.x = TRUE)
mdat
giving:
> mdat
user_b user_a action.x action.y
1 10 1 0.2 0.1
2 11 1 0.1 0.3
3 12 2 0.2 0.1
4 15 1 0.1 NA
Then we just use this object to create the result data frame, and sum the two action columns (action.x and action.y) row-wise:
## format the merged data with summed `action`
res <- data.frame(mdat[,c(2,1)],
action = rowSums(mdat[, c("action.x", "action.y")],
na.rm = TRUE))
## reorder
res <- res[order(res$user_a, res$user_b),]
res
resulting in
> res
user_a user_b action
1 1 10 0.3
2 1 11 0.4
4 1 15 0.1
3 2 12 0.3
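The all.x = TRUE argument is what keeps the unmatched user_b = 15 row in the result. A small sketch of the difference, using the same dfa and dfb as above (inner and left are illustrative names):

```r
dfa <- data.frame(user_b = 10:12, action = c(0.1, 0.3, 0.1))
dfb <- data.frame(user_a = c(1, 1, 1, 2),
                  user_b = c(10, 11, 15, 12),
                  action = c(0.2, 0.1, 0.1, 0.2))

# Default merge keeps only rows whose user_b occurs in both frames
inner <- merge(dfb, dfa, by = "user_b")
# all.x = TRUE keeps every row of dfb and fills the gaps with NA
left <- merge(dfb, dfa, by = "user_b", all.x = TRUE)

nrow(inner)
# [1] 3
nrow(left)
# [1] 4
left$action.y[left$user_b == 15]
# [1] NA
```

The NA in action.y is then ignored by rowSums(..., na.rm = TRUE), so user_b = 15 keeps its original action.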
For MICE imputations I need to constrain the predictions so that the predicted values have a given mean (which is a measured value). We are dealing with pooled blood serum samples (individual blood samples are pooled together), so each measured value represents the mean of the individuals in that pool. I am trying to predict what the concentration of x was in those individuals based on the measured mean and covariate data. You'll notice in my dummy dataset that there are 3 individuals (Individual_id) for each pool (Pool_id). So when imputing these values for the individuals, the average of those 3 individuals needs to equal the Pool_mean.
How can we constrain the mice algorithm to still predict based on covariate data, but have the means match exactly (any method can be chosen; "cart" in this case)? Could this conceptually be done through a mice squeeze constraint with inputs from the mean?
The code is below:
library(mice)
library(dplyr)
#create demo data table as an example
Pool_id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
Pool_mean <- c(15, 15, 15, 35, 35, 35, 42, 42, 42)
Individual_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
concentration <- c(10, 20, NA, 30, NA,NA, NA, NA, 70)
co_variate <- c(0.1, 0.2, 0.1, 0.2, 0.3, 0.1, 0.1, 0.2, 0.3)
df <- data.frame(Pool_id, Pool_mean, Individual_id, concentration, co_variate)
#run mice to impute missing data
initial_imputed <- mice(df, m = 5, maxit = 10, meth = "cart", seed = 3985)
completed <- complete(initial_imputed)
I know that we can constrain mice using post-processing and maybe a custom function like the vec_squeeze below. However, I need to constrain the values based on a mean. How could I update this function to do this?
vec_squeeze <- function(x, bounds) {
stopifnot(length(x) == nrow(bounds))
pmin(pmax(x, bounds[,1]), bounds[,2])
}
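For reference, a minimal sketch of what vec_squeeze does on its own, outside of mice. The values in x and bounds are made up for illustration. It clamps each element to its own lower and upper bound, which is why it cannot by itself fix the mean of a group:

```r
vec_squeeze <- function(x, bounds) {
  stopifnot(length(x) == nrow(bounds))
  pmin(pmax(x, bounds[, 1]), bounds[, 2])
}

# Hypothetical imputed values and per-element bounds (lower, upper)
x <- c(5, 25, 80)
bounds <- cbind(lower = c(10, 10, 10), upper = c(60, 60, 60))

squeezed <- vec_squeeze(x, bounds)
squeezed
# [1] 10 25 60
```

Each value is adjusted independently of the others, so the mean of squeezed is whatever the clamped values happen to average to.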
Here's an example of how to use passive imputation on the 3rd variable to force the mean of the imputations to be equal to pool_mean from the data. First generate some data in 'wide' format.
set.seed(123)
# Using larger example data to avoid issues with imputation models
n <- 20
pool_id <- rep(1:n, each = 3)
ind_id <- rep(1:3, times = n)
cov_1 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_2 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
cov_3 <- sample(c(0.1, 0.2, 0.3), n*3, replace = TRUE)
conc_1 <- round(rnorm(n*3, mean = 20 + 5*cov_1, sd = 5))
conc_2 <- round(rnorm(n*3, mean = 20 + 5*cov_2, sd = 5))
conc_3 <- round(rnorm(n*3, mean = 20 + 5*cov_3, sd = 5))
pool_mean <- apply(cbind(conc_1, conc_2, conc_3), FUN = mean, MARGIN = 1)
df <- data.frame(pool_id, ind_id, pool_mean, conc_1, conc_2,
conc_3, cov_1, cov_2, cov_3)
df[which(rbinom(n*3, 1, prob = 0.5) == 1), "conc_3"] <- NA
df[which(rbinom(n*3, 1, prob = 0.2) == 1), "conc_2"] <- NA
df[which(is.na(df$conc_2)),"conc_3"] <- NA
head(df)
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 NA NA 0.3 0.1 0.2
#> 4 2 1 25.00000 25 NA NA 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 NA 0.2 0.3 0.3
I forced missing values into the third position to avoid re-arranging. I also have ind_id repeated within each pool_id instead of unique, but that's not important for what follows.
The key part of the passive imputation is meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)". If (A+B+C)/3 = D, then 3D - A - B = C.
library(mice)
ini <- mice(df, maxit = 0, printFlag = FALSE)
# Limit the variables used in prediction, to avoid co-linearity
pred <- ini$predictorMatrix
pred[,] <- 0
pred["conc_1", c("pool_mean","cov_1")] <- 1
pred["conc_2", c("pool_mean","conc_1","cov_2")] <- 1
# Set the imputation methods. Use passive imputation for conc_3
meth <- ini$method
meth["conc_2"] <- "pmm"
meth["conc_3"] <- "~ I((3*pool_mean) - conc_1 - conc_2)"
# Control the visit sequence to ensure that conc_3 is updated
# after conc_2. Add other missing variables if needed.
visit_seq <- c("conc_2", "conc_3")
imps <- mice(df, method = meth,
predictorMatrix = pred,
visitSequence = visit_seq,
printFlag = FALSE)
head(complete(imps, action = 1))
#> pool_id ind_id pool_mean conc_1 conc_2 conc_3 cov_1 cov_2 cov_3
#> 1 1 1 18.00000 14 16 24 0.3 0.1 0.2
#> 2 1 2 24.33333 20 32 21 0.3 0.3 0.3
#> 3 1 3 16.33333 26 18 5 0.3 0.1 0.2
#> 4 2 1 25.00000 25 23 27 0.2 0.3 0.3
#> 5 2 2 22.00000 24 17 25 0.3 0.2 0.1
#> 6 2 3 22.00000 23 19 24 0.2 0.3 0.3
Created on 2022-11-20 with reprex v2.0.2
The imputation procedure has correctly replaced row 6's conc_3 value with 24. The other rows have received a random imputation for conc_2 and then had conc_3 passively imputed. In general conc_3 has no other restrictions on it. In row 3 conc_3 = 5, which may be questionably low. It could even become negative in some situations. Better modelling of conc_2 would help.
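The passive formula can be checked by hand against the completed rows shown above. This sketch copies the values of rows 3, 4 and 6 from the printed output, assuming the displayed pool_mean of 16.33333 is 49/3:

```r
# Values taken from rows 3, 4 and 6 of the completed data above
pool_mean <- c(49 / 3, 25, 22)   # 16.33333 assumed to be 49/3
conc_1    <- c(26, 25, 23)
conc_2    <- c(18, 23, 19)       # row 6 was observed, rows 3 and 4 imputed

# Passive step: if (A + B + C) / 3 = D, then C = 3D - A - B
conc_3 <- 3 * pool_mean - conc_1 - conc_2
conc_3
# [1]  5 27 24

# The three concentrations now average to pool_mean in every row
isTRUE(all.equal((conc_1 + conc_2 + conc_3) / 3, pool_mean))
# [1] TRUE
```

This also shows why conc_3 can end up very low: it simply absorbs whatever is left after conc_1 and the imputed conc_2.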
I have a dataframe (example data):
id <- c(1, 2, 3)
ex1 <- c(0.8, 0.2, 0.3)
ex2 <- c(0.1, 0.4, 0.04)
ex3 <- c(0.04, 0.3, 0.5)
ex <- c(1, 1, 1)
ran <- c(0.5, 0.7, 0.6)
dat <- data.frame(id, ex1, ex2, ex3, ex, ran)
dat
id ex1 ex2 ex3 ex ran
1 1 0.8 0.10 0.04 1 0.5
2 2 0.2 0.40 0.30 1 0.7
3 3 0.3 0.04 0.50 1 0.6
I want to change the values of "ex" with an if-else condition. "ex" should change to 5 (arbitrary) when "ran" is smaller than or equal to the highest value of the ex1, ex2 and ex3 variables and, at the same time, greater than the sum of the other two (the second largest value added to the smallest value). Here are examples for all ids, beginning with id 1:
dat$ex <- ifelse(dat$ran <= dat$ex1 & dat$ran > dat$ex2 + dat$ex3, 5, dat$ex)
Here, ex1 is the largest value, followed by ex2 and then ex3.
For id 2, it should be:
dat$ex <- ifelse(dat$ran <= dat$ex2 & dat$ran > dat$ex3 + dat$ex1, 5, dat$ex)
Here, ex2 is the largest value, followed by ex3 and then ex1.
For id 3:
dat$ex <- ifelse(dat$ran <= dat$ex3 & dat$ran > dat$ex1 + dat$ex2, 5, dat$ex)
Here, ex3 is the largest value, followed by ex1 and then ex2.
Now to the problem: How to generalize the ifelse-statement? Note: It is important that the summation of the two smaller values is performed as implemented in the examples. I need to identify the sorted values for ex1, ex2 and ex3 within ifelse by id.
Here is one way we could achieve the task using dplyr and tidyr:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(
cols = ex1:ex3
) %>%
arrange(id, desc(value)) %>%
group_by(id) %>%
mutate(ex = ifelse(ran <= value[1] & ran > sum(value[2], value[3]), 5, ex)) %>%
pivot_wider(
names_from=name
)
output:
id ex ran ex1 ex2 ex3
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 5 0.5 0.8 0.1 0.04
2 2 1 0.7 0.2 0.4 0.3
3 3 1 0.6 0.3 0.04 0.5
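We may use pmax to get the row-wise maximum of the ex1, ex2 and ex3 columns (note that this only tests the first condition, ran <= maximum):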
library(dplyr)
library(purrr)
dat %>%
mutate(ex= case_when(ran <=invoke(pmax, across(matches('^ex\\d+'))) ~ 5,
TRUE ~ ex))
id ex1 ex2 ex3 ex ran
1 1 0.8 0.10 0.04 5 0.5
2 2 0.2 0.40 0.30 1 0.7
3 3 0.3 0.04 0.50 1 0.6
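Both conditions from the question can be expressed in base R with the same pmax idea. This is only a sketch: ex_cols, row_max and rest_sum are illustrative names, and the sum of the two smaller values is obtained as the row total minus the row maximum:

```r
dat <- data.frame(id  = c(1, 2, 3),
                  ex1 = c(0.8, 0.2, 0.3),
                  ex2 = c(0.1, 0.4, 0.04),
                  ex3 = c(0.04, 0.3, 0.5),
                  ex  = c(1, 1, 1),
                  ran = c(0.5, 0.7, 0.6))

ex_cols <- c("ex1", "ex2", "ex3")

# Row-wise maximum of the ex columns
row_max <- do.call(pmax, dat[ex_cols])
# Sum of the two non-maximum values: row total minus the maximum
rest_sum <- unname(rowSums(dat[ex_cols])) - row_max

dat$ex <- ifelse(dat$ran <= row_max & dat$ran > rest_sum, 5, dat$ex)
dat$ex
# [1] 5 1 1
```

Only id 1 satisfies both conditions (0.14 < 0.5 <= 0.8), so only its ex changes to 5.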
exes <- t(apply(subset(dat, select = grep("^ex.+", names(dat))), 1, function(z) c(max(z), sum(z[-which.max(z)]))))
exes
# [,1] [,2]
# 1 0.8 0.14
# 2 0.4 0.50
# 3 0.5 0.34
ifelse(dat$ran <= exes[,1] & dat$ran > exes[,2], 5, dat$ex)
# 1 2 3
# 5 1 1
Walk-through:
subset(dat, ...) is a way to dynamically extract columns from a frame regardless of its type (e.g., data.frame, tbl_df, or data.table), and without risk of dropping the frame to a column (i.e., see that mtcars[,2] is no longer a frame); there are other ways to do this, some in base R, some in other packages like dplyr or data.table
apply(dat, 1, ..) operates on the rows of the respective columns; with MARGIN=1 (second arg), apply returns each row's result as a column, so we need to t(.) it back into the right shape;
exes is now a matrix whose first column contains the max of the ex# variables, and the second column contains the sum of the non-max ex# variables
From here, I think the use of exes is the "general" solution you were looking for.
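The two points in the walk-through (dropping to a vector, and the transposed result of apply) can be seen in a short sketch. The matrix m is a made-up example:

```r
# Single-column indexing with [ drops the data.frame to a plain vector ...
is.data.frame(mtcars[, 2])
# [1] FALSE
# ... whereas subset(..., select = ) always returns a data.frame
is.data.frame(subset(mtcars, select = 2))
# [1] TRUE

# apply() over rows returns one column per row when the function
# returns more than one value, hence the t() in the answer
m <- matrix(1:6, nrow = 3)                       # 3 rows, 2 columns
res <- apply(m, 1, function(z) c(max(z), sum(z)))
dim(res)
# [1] 2 3
dim(t(res))
# [1] 3 2
```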
I have two data frames both with 220 obs and 80 variables. The first data frame, df1, has only the data points 1, 2, and 3. The second data frame, df2, has different numeric values consisting of decimals, such as 0.12, -0.03, 0.01 etc. (supposed to portray market cap weighted stock returns for a given month). PS: The length of the original data set is 80.
For example
df1 = data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
a b c d
1 2 3 1 3
2 2 2 1 3
3 1 3 2 1
df2 = data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6), c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))
a b c d
1 0.1 0.3 0.2 0.1
2 0.1 0.4 0.3 0.5
3 0.2 0.6 0.5 0.6
How can I sum the values in each row of df2, grouped by the corresponding values in df1, to get a result with 220 obs and 3 variables? Note that df1 and df2 have the same column names in the same order. In other words, I want to create a third data frame df3 whose columns hold, for each row, the sums of the df2 values whose df1 indicator is 1, 2 and 3 respectively:
df3 =
X1 X2 X3
1 0.2 0.1 0.4
2 0.3 0.5 0.5
3 0.8 0.5 0.6
Let's first look at (X1,1). Row 1 in df1 contains only one data point with value 1, which is (c,1). Thus the sum for row 1 of df2 is just that one value, 0.2. Now look at (X1,3) (the last value of column X1). Row 3 in df1 has two data points with value 1. In df2 the corresponding values are 0.2 (a,3) and 0.6 (d,3), which sum to 0.8.
Here is how each entry of df3 is calculated:
calculation = data.frame("1" = c("0+0+0.2+0", "0+0+0.3+0", "0.2+0+0+0.6"), "2" = c("0.1+0+0+0", "0.1+0.4+0+0", "0+0+0.5+0"), "3" = c("0+0.3+0+0.1", "0+0+0+0.5", "0+0.6+0+0"))
X1 X2 X3
1 0 + 0 + 0.2 + 0 0.1 + 0 + 0 + 0 0 + 0.3 + 0 + 0.1
2 0 + 0 + 0.3 + 0 0.1 + 0.4 + 0 + 0 0 + 0 + 0 + 0.5
3 0.2 + 0 + 0 + 0.6 0 + 0 + 0.5 + 0 0 + 0.6 + 0 + 0
More practical explanation based on stocks. Assume df1 is a matrix that describes buy, hold, and sell recommendations. df2 describes the market weighted stock returns. All variables/columns are different stocks. df3 creates a matrix with three different portfolios. If the stock is "buy", I want to put it in a "buy" portfolio. If the stock is "hold", I want to put it in a "hold" portfolio, etc. This is easily done in Excel with nested IF,AND,OR functions, but I do not know how to do it in R.
We could use tapply after converting the datasets to matrix, with the row index of the data and the values of 'df1' as the grouping variables
tapply(as.matrix(df2), list(row(df2), as.matrix(df1)), FUN = sum)
# 1 2 3
#[1,] 0.2 0.1 0.4
#[2,] 0.3 0.5 0.5
#[3,] 0.8 0.5 0.6
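The grouping can also be spelled out one indicator value at a time, which mirrors the nested-IF logic from Excel. This is only a sketch of an alternative, not the tapply approach itself; m1, m2 and df3 are illustrative names, and it assumes the indicator values are exactly 1, 2 and 3:

```r
df1 <- data.frame(a = c(2, 2, 1), b = c(3, 2, 3), c = c(1, 1, 2), d = c(3, 3, 1))
df2 <- data.frame(a = c(0.1, 0.1, 0.2), b = c(0.3, 0.4, 0.6),
                  c = c(0.2, 0.3, 0.5), d = c(0.1, 0.5, 0.6))

m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# For each indicator value k, zero out the cells of m2 whose indicator
# differs from k (TRUE/FALSE act as 1/0), then sum across each row
df3 <- sapply(1:3, function(k) rowSums(m2 * (m1 == k)))
colnames(df3) <- c("X1", "X2", "X3")
df3
#       X1  X2  X3
# [1,] 0.2 0.1 0.4
# [2,] 0.3 0.5 0.5
# [3,] 0.8 0.5 0.6
```

Unlike tapply, this version returns 0 rather than NA for a row in which some indicator value never occurs.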
Or with tidyverse, bind the datasets after gathering both into 'long' format, and then do a grouped sum
library(tidyverse)
gather(df1) %>%
bind_cols(gather(df2)) %>%
group_by(key) %>%
group_by(rn = row_number(), value) %>%
summarise(value1 = sum(value1)) %>%
spread(value, value1) %>%
ungroup %>%
select(-rn)
# A tibble: 3 x 3
# `1` `2` `3`
# <dbl> <dbl> <dbl>
#1 0.2 0.1 0.4
#2 0.3 0.5 0.5
#3 0.8 0.5 0.6
Here is another base R method that uses rowsum to perform group sums and loops through the rows with mapply.
t(mapply(rowsum, as.data.frame(t(df2)), as.data.frame(t(df1))))
[,1] [,2] [,3]
V1 0.2 0.1 0.4
V2 0.3 0.5 0.5
V3 0.8 0.5 0.6
Note that I am using R 3.4.4. I believe that as.data.frame is not necessary with R 3.5.0+, since t should return a data.frame when it is fed a data.frame.
For a specific task, I have written the following R script:
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp <- as.factor(c(1, 1, 2, 2, 1, 1, 1))
cut <- unique(pred)
cut_n <- length(cut)
n <- length(pred)
class_1 <- numeric(cut_n)
class_2 <- numeric(cut_n)
curr_cut <- cut[1]
class_1_c <- 0
class_2_c <- 0
j <- 1
for (i in 1:n){
if (curr_cut != pred[i]) {
j <- j + 1
curr_cut <- pred[i]
}
if (grp[i] == levels(grp)[1])
class_1_c <- class_1_c + 1
else
class_2_c <- class_2_c + 1
class_1[j] <- class_1_c
class_2[j] <- class_2_c
}
cat("index:", cut, "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
My goal above was to compute the cumulative number of times the factors in grp appear for each unique value in pred. For example, I get the following output for above:
index: 0.1 0.2 0.3
class1: 2 3 5
class2: 1 2 2
I am a beginner in R and I have a few questions about this:
How can I make this code faster and simpler?
Is it possible to vectorize this and avoid the for loop?
Is there a different "R-esque" way of doing this?
Any help would be greatly appreciated. Thanks!
You can start by getting the unique group/pred counts using a table
table(grp, pred)
# pred
# grp 0.1 0.2 0.3
# 1 2 1 2
# 2 1 1 0
Of course this isn't exactly what you wanted. You want cumulative totals, so we can adjust this result by applying a cumulative sum across each row (transposed to better match your data layout)
t(apply(table(grp, pred), 1, cumsum))
# grp 0.1 0.2 0.3
# 1 2 3 5
# 2 1 2 2
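To tie this back to the loop in the question, the rows of that result can be pulled out as the class_1 and class_2 vectors. A self-contained sketch (cum_counts is an illustrative name); note that table orders the columns by the sorted unique values of pred, which matches the loop only when pred is sorted, as it is in the example:

```r
pred <- c(0.1, 0.1, 0.1, 0.2, 0.2, 0.3, 0.3)
grp  <- as.factor(c(1, 1, 2, 2, 1, 1, 1))

# Counts per group and pred value, accumulated along each row
cum_counts <- t(apply(table(grp, pred), 1, cumsum))

# Rows are named after the levels of grp
class_1 <- unname(cum_counts["1", ])
class_2 <- unname(cum_counts["2", ])

cat("index:", colnames(cum_counts), "\n")
cat("class1:", class_1, "\n")
cat("class2:", class_2, "\n")
```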
I have two large matrices with this format:
row.names 1 2 3 ... row.names 1 2 3 ....
A 0.1 0.2 0.3 A 1 1 1
B 0.4 0.9 0.3 B 2 3 1
C 0.9 0.9 0.4 C 1 3 1
.
And I want to obtain something like this:
X S CONF P
1 A 0.1 1
1 B 0.4 2
1 C 0.9 1
2 A 0.2 1
2 B ......
That is, I want the column names in one column, with the row names and the values from both matrices repeated for each column name.
Thank you so much
You can do this pretty easily with some rep and c work:
out <- data.frame(X = rep(colnames(conf), each = nrow(conf)),
S = rep(rownames(conf), ncol(conf)),
CONF = c(conf), P = c(P))
out
# X S CONF P
# 1 1 A 0.1 1
# 2 1 B 0.2 1
# 3 1 C 0.3 1
# 4 2 A 0.4 2
# 5 2 B 0.9 3
# 6 2 C 0.3 1
# 7 3 A 0.9 1
# 8 3 B 0.9 3
# 9 3 C 0.4 1
@Thomas had a similar approach (but one which matches the answer you show in your question). His answer looked like this:
cbind.data.frame(X = rep(colnames(conf), each=nrow(conf)),
S = rep(rownames(conf), times=ncol(conf)),
CONF = matrix(t(conf), ncol=1),
P = matrix(t(P), ncol=1))
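Both versions rely on how R flattens a matrix: c() reads it column by column, so the labels must be built with rep(..., each = nrow) for the column names and rep(..., times = ncol) for the row names. A small sketch with a made-up 2 x 2 matrix (m and long are illustrative names):

```r
m <- matrix(c(0.1, 0.4,
              0.2, 0.9),
            ncol = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("1", "2")))

# c() flattens column by column (column-major order)
c(m)
# [1] 0.1 0.2 0.4 0.9

# Flattening the transpose gives row-by-row order instead
c(t(m))
# [1] 0.1 0.4 0.2 0.9

# Labels that line up with the column-major order of c(m)
long <- data.frame(X = rep(colnames(m), each = nrow(m)),
                   S = rep(rownames(m), times = ncol(m)),
                   VALUE = c(m))
long
```

Each row of long pairs a value with the column name (X) and row name (S) it came from, for example X = "2", S = "A" holds m["A", "2"].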
Assuming we're talking about matrices, I would convert to a data frame, add the rownames as a column and then "melt" each data.frame...
conf <- matrix(
c(0.1, 0.4, 0.9,
0.2, 0.9, 0.9,
0.3, 0.3, 0.4),
ncol=3, byrow=TRUE
)
rownames(conf) <- c("A", "B", "C")
colnames(conf) <- 1:3
P <- matrix(
c(1, 2, 1,
1, 3, 3,
1, 1, 1),
ncol=3, byrow=TRUE
)
rownames(P) <- c("A", "B", "C")
colnames(P) <- 1:3
library(reshape)
conf <- cbind(as.data.frame(conf), "S"=rownames(conf))
P <- cbind(as.data.frame(P), "S"=rownames(P))
out <- merge(melt(conf, id="S"), melt(P, id="S"), by=c("variable", "S"))
colnames(out) <- c("X", "S", "CONF", "P")
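If installing reshape is not an option, a similar long format can be sketched in base R with as.table and as.data.frame. This is an alternative to melt, not what the answer above uses. The names conf_m, P_m, conf_long and p_long are illustrative, and the matrices are rebuilt here so that the block runs on its own:

```r
conf_m <- matrix(c(0.1, 0.4, 0.9,
                   0.2, 0.9, 0.9,
                   0.3, 0.3, 0.4),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("A", "B", "C"), as.character(1:3)))
P_m <- matrix(c(1, 2, 1,
                1, 3, 3,
                1, 1, 1),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), as.character(1:3)))

# as.data.frame() on a table gives one row per cell:
# Var1 = row name, Var2 = column name, plus the value column
conf_long <- as.data.frame(as.table(conf_m), responseName = "CONF")
p_long    <- as.data.frame(as.table(P_m), responseName = "P")
names(conf_long)[1:2] <- c("S", "X")
names(p_long)[1:2]    <- c("S", "X")

# Join the two long tables on column name and row name
out <- merge(conf_long, p_long, by = c("X", "S"))
out
```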