Select rows in a data.frame when some rows repeat

I have the following toy dataset
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
                 value = sample(LETTERS, 15, replace = TRUE),
                 weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
df
   ID value weight
1   1     I    0.1
2   1     G    0.1
3   1     O    0.1
4   2     B    0.1
5   2     M    0.1
6   2     M    0.1
7   3     V    0.5
8   3     J    0.5
9   3     O    0.5
10  4     E    0.2
11  4     Q    0.2
12  4     W    0.2
13  5     H    0.1
14  5     K    0.1
15  5     T    0.1
where each ID is an individual respondent, answering 3 questions (in the actual dataset, the number of questions answered is variable, so I can't rely on a certain number of rows per ID).
I want to create a new (larger) dataset which samples from the individual IDs based on the weights in weight.
probs <- data.frame(ID = unique(df$ID))
probs$prob <- NA
for (i in 1:nrow(probs)) {
  probs$prob[i] <- df[df$ID %in% probs$ID[i], ]$weight[1]
}
probs$prob <- probs$prob / sum(probs$prob)
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
head(sampledIDs,10)
[1] 4 3 3 3 4 4 2 4 2 3
Moving from the probabilistic sampling of IDs to the actual creation of the new data.frame is stumping me. I've tried
dfW <- df[df$ID %in% sampledIDs,]
but that obviously doesn't take into account the fact that IDs repeat. I've also tried a loop:
dfW <- df[df$ID == sampledIDs[1],]
for (i in 2:length(sampledIDs)) {
  dfW <- rbind(dfW, df[df$ID == sampledIDs[i], ])
}
but that's painfully slow with a large dataset.
Any help would be very appreciated.
(Also, if there are simpler ways of doing the probabilistic selection of IDs, that would be great to hear too!)

Your code is slow because you resize the data frame on every iteration of the for loop. Here is my suggestion: before the loop, create a data frame dfW with the final size it will have, then assign the values from df into dfW inside the loop. You can replace the last part of your code with this:
dfW <- as.data.frame(matrix(nrow = 3 * length(sampledIDs), ncol = 3))
colnames(dfW) <- colnames(df)  # make the column names the same
for (i in 1:length(sampledIDs)) {  # notice the start index is changed from 2 to 1
  # dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
  dfW[(3*i-2):(3*i), ] <- df[df$ID == sampledIDs[i], ]
}
Your code should run much faster with this change. Let me know how it goes!

If you don't know the final size, you can resize the data frame whenever needed, at the cost of an extra if condition inside the for loop. First, define a function that doubles the data frame's row count:
double_rowsize <- function(df) {
  mdf <- as.data.frame(matrix(NA, nrow = nrow(df), ncol = ncol(df)))
  colnames(mdf) <- colnames(df)
  df <- rbind(df, mdf)
  return(df)
}
Then start dfW with an initial size, say 12 rows (3 times 4):
dfW <- as.data.frame(matrix(nrow = 12, ncol = 3))
colnames(dfW) <- colnames(df)
And finally add an if condition in the for loop to resize the dataframe whenever needed:
for (i in 1:length(sampledIDs)) {
  if (3 * i > nrow(dfW))
    dfW <- double_rowsize(dfW)
  dfW[(3*i-2):(3*i), ] <- df[df$ID == sampledIDs[i], ]
}
You can change double_rowsize to grow the data frame by a factor other than 2 if something else works better for your data; doubling is the common choice because it gives good amortized performance when resizing arrays.
Good luck!
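A further thought, since the question notes that the real data has a variable number of rows per ID and also asks about a simpler way to select the IDs: here is a sketch (not part of the answer above, so treat it as untested against your data) that vectorizes both steps, using the df and sampledIDs objects from the question.
# Simpler weighted selection: one weight per ID; sample() normalizes prob itself
w <- tapply(df$weight, df$ID, `[`, 1)
sampledIDs <- sample(names(w), size = 10000, replace = TRUE, prob = w)
# Vectorized expansion: map each ID to its row numbers once, then subset df a
# single time -- no loop, and a variable number of rows per ID is fine
rows_by_id <- split(seq_len(nrow(df)), df$ID)
dfW <- df[unlist(rows_by_id[as.character(sampledIDs)]), ]
rownames(dfW) <- NULL  # drop the duplicated row names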

Related

Is there an R function for touch point attribution in a customer journey?

My dataframe contains a column with various touch points, numbered 1 to 18. I want to know which touch point leads to touch point 10, so I want to create a new column showing the touch point that occurred right before touch point 10 in each customer journey (PurchaseId). If touch point 10 doesn't occur in a customer journey, the value can be NULL or 0.
So for example:
dd <- read.table(text="
PurchaseId TouchPoint DesiredOutcome
1 8 6
1 6 6
1 10 6
2 12 0
2 8 0
3 17 4
3 3 4
3 4 4
3 10 4", header=TRUE)
The complete dataset contains 2,500,000 observations. Does anyone know how to solve my problem? Thanks in advance.
First, it is better to give complete, reproducible sample code. I suggest you look at the data.table library, which is nice for handling large datasets.
library(data.table)
mdata <- matrix(sample(x = c(1:20, 21), size = 15 * 10, replace = TRUE), ncol = 10)
mdata[mdata == 21] <- NA
mdata <- data.frame(mdata)
names(mdata) <- paste0("cj", 1:10)
df_touch <- data.table(mdata)
# -- using for
res <- rep(0, nrow(df_touch))
for (i in 1:10) {
  cat(i, "\n")
  res[i] <- i * df_touch[, (10 %in% get(paste0("cj", i)))]
  cat(res[i], "\n")
}
# -- using lapply
dfun <- function(x, k = 10){ return( k %in% x ) }
df_touch[, lapply(.SD, dfun)]
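The sample code above only checks whether touch point 10 occurs at all. For the desired outcome stated in the question (the touch point immediately before 10 within each PurchaseId), here is a sketch using data.table on the question's dd, assuming the rows are already in journey order; the outcome column name is mine:
library(data.table)
dd <- data.table(dd)
dd[, outcome := {
  k <- match(10, TouchPoint)  # position of touch point 10, NA if absent
  if (is.na(k) || k == 1) 0L else TouchPoint[k - 1L]
}, by = PurchaseId]
# outcome should now reproduce the DesiredOutcome column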

sample of a subsample

I would like to select a sample from a dataset twice. Actually, I don't want to select it, but to create a new variable sampleNo that indicates which sample (one or two) a case belongs to.
Let's suppose I have a dataset containing 40 cases:
data <- data.frame(var1=seq(1:40), var2=seq(40,1))
The first sample (n=10) I drew like this:
data$sampleNo <- 0
idx <- sample(seq(1,nrow(data)), size=10, replace=F)
data[idx,]$sampleNo <- 1
Now (and here my problems start) I'd like to draw a second sample (n=10). But this sample should be drawn only from the cases that don't belong to the first sample. Additionally, var1 should be an even number.
So sampleNo should be 0 for cases that were not drawn at all, 1 for cases that belong to the first sample and 2 for cases belonging to the second sample (= sampleNo equals 0 and var1 is even).
I was trying to solve it like this:
idx2<-data$var1%%2 & data$sampleNo==0
sample(data[idx2,], size=10, replace=F)
But how can I set sampleNo to 2?
We can use the setdiff function as follows:
sample(setdiff(1:nrow(data), idx), 3, replace = F)
setdiff(x, y) will select the elements of x that are not in y:
setdiff(x = 1:20, y = seq(2,20,2))
[1] 1 3 5 7 9 11 13 15 17 19
So, applying this to the example above:
data$sampleNo2 <- 0
idx2 <- sample(setdiff(1:nrow(data), idx), 3, replace = F)
data[idx2,]$sampleNo2 <- 1
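To match the question exactly (n = 10, var1 even, and writing 2 into the existing sampleNo column), the same setdiff idea can be combined with the evenness condition; a sketch, assuming idx from the question:
pool <- setdiff(which(data$var1 %% 2 == 0), idx)  # even var1 and not in sample 1
data$sampleNo[sample(pool, size = 10)] <- 2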
Here is a complete solution to your problem, more along the lines of your original idea. The code can be shortened, but for now I tried to make it as transparent as I could.
# Data
data <- data.frame(var1 = 1:40, var2 = 40:1)
# Add SampleNo column
data$sampleNo <- 0L
# Randomly select 10 rows as sample 1
pool_idx1 <- 1:nrow(data)
idx1 <- sample(pool_idx1, size = 10)
data[idx1, ]$sampleNo <- 1L
# Draw a second sample from cases where sampleNo != 1 & var1 is even
pool_idx2 <- pool_idx1[data$var1 %% 2 == 0 & data$sampleNo != 1]
idx2 <- sample(pool_idx2, size = 10)
data[idx2, ]$sampleNo <- 2L
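As a quick check of the result (the counts are deterministic here, since the 40 cases split into two samples of 10 plus the rest):
table(data$sampleNo)
#  0  1  2
# 20 10 10
all(data$var1[data$sampleNo == 2] %% 2 == 0)
# [1] TRUE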

Is there a way to get dplyr's bind_cols to expand number of rows like in cbind?

From ?dplyr::bind_cols:
This is an efficient implementation of the common pattern of do.call(rbind, dfs) or do.call(cbind, dfs) for binding many data frames into one
However, with example data:
tmp_df1 <- data.frame(a = 1)
tmp_df2 <- data.frame(b = c(-2, 2))
tmp_df3 <- data.frame(c = runif(10))
The command do.call(cbind, list(tmp_df1, tmp_df2, tmp_df3)) produces:
   a  b         c
1  1 -2 0.8473307
2  1  2 0.8031552
3  1 -2 0.3057430
4  1  2 0.6344999
5  1 -2 0.7870753
6  1  2 0.9453199
7  1 -2 0.6642231
8  1  2 0.9708049
9  1 -2 0.7189576
10 1  2 0.9217087
That is, rows of tmp_df1 and tmp_df2 are recycled to match the number of rows in tmp_df3.
In dplyr:
> bind_cols(tmp_df1, tmp_df2, tmp_df3)
Error in eval(substitute(expr), envir, enclos) :
incompatible number of rows (2, expecting 1)
The reason why I want to do something like this is because I am in a situation similar to below:
df_normal_param <- data.frame(mu = rnorm(10), sigma = runif(10))
df_normal_sample_list <- lapply(1:10, function(i)
  with(df_normal_param,
       data.frame(sam = rnorm(100, mu[i], sigma[i]))))
and I wish to attach the arguments used to create each entry of df_normal_sample_list to the outputs, e.g.
df_normal_sample_list <- lapply(1:10, function(i)
  cbind(df_normal_param[i, ], df_normal_sample_list[[i]]))
You argue in a comment that this behavior is safe; I strongly disagree. It seems safe for this very particular case, but it is likely to cause you problems somewhere down the road. That is why I believe the answer to your stated question ("Is there a way to get dplyr's bind_cols to expand number of rows like in cbind?") is a simple no, and they probably built it that way intentionally.
Instead, I would suggest that you be more explicit in your approach, and just add the columns you want right as you build the data you are creating. For example, you could include that step right in your call (here using apply to clarify what is going where)
df <- data.frame(mu = rnorm(3), sigma = runif(3))
df_normal_sample_list <- apply(df, 1, function(x) {
  data.frame(
    mu = x["mu"],
    sigma = x["sigma"],
    sam = rnorm(3, x["mu"], x["sigma"])
  )
})
Returns
[[1]]
          mu     sigma       sam
1 -0.6982395 0.1690402 -0.592286
2 -0.6982395 0.1690402 -0.516948
3 -0.6982395 0.1690402 -0.804366
[[2]]
         mu     sigma       sam
1 -1.698747 0.2597186 -1.830950
2 -1.698747 0.2597186 -2.087393
3 -1.698747 0.2597186 -1.961376
[[3]]
         mu     sigma       sam
1 0.9913492 0.3069877 0.9629801
2 0.9913492 0.3069877 1.2279697
3 0.9913492 0.3069877 1.1222780
Then, instead of binding the columns and then the rows, you can just bind the rows at the end (also from dplyr):
bind_rows(df_normal_sample_list)
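If you really do want cbind-style recycling despite the above, one option is to make it explicit, so a reader can see the repetition is intended. A sketch (the recycle helper is mine, not a dplyr function):
library(dplyr)
# Repeat each short frame's rows out to the longest frame's length
recycle <- function(d, n) d[rep(seq_len(nrow(d)), length.out = n), , drop = FALSE]
n <- max(nrow(tmp_df1), nrow(tmp_df2), nrow(tmp_df3))
bind_cols(recycle(tmp_df1, n), recycle(tmp_df2, n), tmp_df3)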

Loop a sequence in R (standardize and winsorize dataframe)

I'm trying to loop this sequence of steps in r for a data frame.
Here is my data:
ID Height Weight
a     100     80
b      80     90
c      NA     70
d     120     NA
....
Here is my code so far
winsorize2 <- function(x) {
  Min <- which(x == min(x))
  Max <- which(x == max(x))
  ord <- order(x)
  x[Min] <- x[ord][length(Min) + 1]
  x[Max] <- x[ord][length(x) - length(Max)]
  x
}
df<-read.csv("data.csv")
df2 <- scale(df[,-1], center = TRUE, scale = TRUE)
id<-df$Type
full<-data.frame(id,df2)
full[is.na(full)] <- 0
full[, -1] <- sapply(full[,-1], winsorize2)
What I'm trying to do is this: standardize the data frame, then winsorize the standardized data frame with winsorize2, i.e. replace the most extreme values with the next most extreme values. This whole sequence is then repeated 10 times. How do I write a loop for that? I'm also confused because I've already replaced the NAs with 0s in the sequence, so should that step be left out of the loop?
Edit: after discussion with @ekstroem, we decided to change the code to introduce boundaries:
df<-read.csv("data.csv")
id<-df$Type
df2<- scale(df[,-1], center = TRUE, scale = TRUE)
df2[is.na(df2)] <- 0
df2[df2<=-3] = -3
df2[df2>=3] = 3
df3<-df2 #trying to loop again
df3<- scale(df3, center = TRUE, scale = TRUE)
df3[is.na(df3)] <- 0
df3[df3<=-3] = -3
df3[df3>=3] = 3
There are some boundary issues that are not fully specified in your code, but maybe the following can be used (base R, and not super efficient):
wins2 <- function(x, n = 1) {
  xx <- sort(unique(x))
  x[x <= xx[n]] <- xx[n + 1]
  x[x >= xx[length(xx) - n]] <- xx[length(xx) - n]
  x
}
This yields:
x <- 1:11
wins2(x, 1)
[1]  2  2  3  4  5  6  7  8  9 10 10
wins2(x, 3)
[1] 4 4 4 4 5 6 7 8 8 8 8
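To answer the looping part of the question directly: the edited sequence can be wrapped in a for loop as is. A sketch, assuming the data.csv and ID column from the question, and that the missing values are read in as real NAs:
df <- read.csv("data.csv")
id <- df$ID
df2 <- df[, -1]
for (k in 1:10) {
  df2 <- scale(df2, center = TRUE, scale = TRUE)  # standardize
  df2[is.na(df2)] <- 0                            # only does anything on the first pass
  df2[df2 <= -3] <- -3                            # winsorize at the +/-3 boundaries
  df2[df2 >= 3] <- 3
}
full <- data.frame(id, df2)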

Find top deciles from dataframe by group

I am attempting to create new variables using a function and lapply rather than working right in the data with loops. I used to use Stata and would have solved this problem with a method similar to that discussed here.
Since naming variables programmatically is so difficult or at least awkward in R (and it seems you can't use indexing with assign), I have left the naming process until after the lapply. I am then using a for loop to do the renaming prior to merging and again for the merging. Are there more efficient ways of doing this? How would I replace the loops? Should I be doing some sort of reshaping?
#Reproducible data
data <- data.frame("custID" = c(1:10, 1:20),
"v1" = rep(c("A", "B"), c(10,20)),
"v2" = c(30:21, 20:19, 1:3, 20:6), stringsAsFactors = TRUE)
#Function to analyze customer distribution for each category (v1)
pf <- function(cat, df) {
  df <- df[df$v1 == cat, ]
  df <- df[order(-df$v2), ]
  #Divide the customers into top percents
  nr <- nrow(df)
  p10 <- round(nr * .10, 0)
  cat("Number of people in the Top 10% :", p10, "\n")
  p20 <- round(nr * .20, 0)
  p11_20 <- p20 - p10
  cat("Number of people in the 11-20% :", p11_20, "\n")
  #Keep only those customers in the top groups
  df <- df[1:p20, ]
  #Create a variable to identify the percent group the customer is in
  top_pct <- integer(length = p10 + p11_20)
  #Identify those in each group
  top_pct[1:p10] <- 10
  top_pct[(p10 + 1):p20] <- 20
  #Add this variable to the data frame
  df$top_pct <- top_pct
  #Keep only custID and the new variable
  df <- subset(df, select = c(custID, top_pct))
  return(df)
}
##Run the customer distribution function
v1Levels <- levels(data$v1)
res <- lapply(v1Levels, pf, df = data)
#Explore the results
summary(res)
# Length Class Mode
# [1,] 2 data.frame list
# [2,] 2 data.frame list
print(res)
# [[1]]
# custID top_pct
# 1 1 10
# 2 2 20
#
# [[2]]
# custID top_pct
# 11 1 10
# 16 6 10
# 12 2 20
# 17 7 20
##Merge the two data frames but with top_pct as a different variable for each category
#Change the new variable name
for (i in 1:length(res)) {
  names(res[[i]])[2] <- paste0(v1Levels[i], "_top_pct")
}
#Merge the results
res_m <- res[[1]]
for (i in 2:length(res)) {
  res_m <- merge(res_m, res[[i]], by = "custID", all = TRUE)
}
print(res_m)
# custID A_top_pct B_top_pct
# 1 1 10 10
# 2 2 20 20
# 3 6 NA 10
# 4 7 NA 20
Stick to your Stata instincts and use a single data set:
require(data.table)
DT <- data.table(data)
DT[,r:=rank(v2)/.N,by=v1]
You can see the result by typing DT.
From here, you can group the within-v1 rank, r, if you want to. Following Stata idioms...
DT[, g := {
  x = rep(0, .N)
  x[r > .8] = 20
  x[r > .9] = 10
  x
}]
This is like gen and then two replace ... if statements. Again, you can see the result with DT.
Finally, you can subset with
DT[g>0]
which gives
custID v1 v2 r g
1: 1 A 30 1.000 10
2: 2 A 29 0.900 20
3: 1 B 20 0.975 10
4: 2 B 19 0.875 20
5: 6 B 20 0.975 10
6: 7 B 19 0.875 20
These steps can also be chained together:
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0]
(Thanks to @ExperimenteR.)
To rearrange for the desired output in the OP, with values of v1 in columns, use dcast:
dcast(
DT[,r:=rank(v2)/.N,by=v1][,g:={x = rep(0,.N);x[r>.8] = 20;x[r>.9] = 10;x}][g>0],
custID~v1)
Currently, dcast requires the latest version of data.table, available (I think) from Github.
You don't need the function pf to achieve what you want. Try a dplyr/tidyr combo:
library(dplyr)
library(tidyr)
data %>%
  group_by(v1) %>%
  arrange(desc(v2)) %>%
  mutate(n = n()) %>%
  filter(row_number() <= round(n * .2)) %>%
  mutate(top_pct = ifelse(row_number() <= round(n * .1), 10, 20)) %>%
  select(custID, top_pct) %>%
  spread(v1, top_pct)
# custID A B
#1 1 10 10
#2 2 20 20
#3 6 NA 10
#4 7 NA 20
The idiomatic way to do this kind of thing in R would be to use a combination of split and lapply. You're halfway there with your use of lapply; you just need to use split as well.
lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(id = df$custID, top_pct))
})
Finding quantiles is done with quantile.
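To get from this list back to the wide layout of res_m in the question, the renaming and merging loops collapse to a couple of lines. A sketch (res2 is my name; it reruns the lapply above but keeps custID as the column name so the merge key matches the question's):
res2 <- lapply(split(data, data$v1), function(df) {
  cutoff <- quantile(df$v2, c(0.8, 0.9))
  top_pct <- ifelse(df$v2 > cutoff[2], 10, ifelse(df$v2 > cutoff[1], 20, NA))
  na.omit(data.frame(custID = df$custID, top_pct))
})
# Rename top_pct per level, then merge everything on custID
for (v in names(res2)) names(res2[[v]])[2] <- paste0(v, "_top_pct")
Reduce(function(a, b) merge(a, b, by = "custID", all = TRUE), res2)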
