sample of a subsample - r

I would like to select a sample from a dataset twice. Actually, I don't want to select it, but to create a new variable sampleNo that indicates which sample (one or two) a case belongs to.
Let's suppose I have a dataset containing 40 cases:
data <- data.frame(var1=seq(1:40), var2=seq(40,1))
The first sample (n=10) I drew like this:
data$sampleNo <- 0
idx <- sample(seq(1,nrow(data)), size=10, replace=F)
data[idx,]$sampleNo <- 1
Now (and here my problems start) I'd like to draw a second sample (n=10). But this sample should be drawn only from the cases that don't belong to the first sample. Additionally, var1 should be an even number.
So sampleNo should be 0 for cases that were not drawn at all, 1 for cases that belong to the first sample and 2 for cases belonging to the second sample (= sampleNo equals 0 and var1 is even).
I was trying to solve it like this:
idx2 <- data$var1 %% 2 & data$sampleNo == 0
sample(data[idx2,], size=10, replace=F)
But how can I set sampleNo to 2?

We can use the setdiff function as follows:
sample(setdiff(1:nrow(data), idx), 3, replace = F)
setdiff(x, y) will select the elements of x that are not in y:
setdiff(x = 1:20, y = seq(2,20,2))
[1] 1 3 5 7 9 11 13 15 17 19
So, applied to the example above:
data$sampleNo2 <- 0
idx2 <- sample(setdiff(1:nrow(data), idx), 3, replace = F)
data[idx2,]$sampleNo2 <- 1
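To match the question exactly (drawing only rows where var1 is even and that are not in the first sample, and marking them with sampleNo = 2), a minimal sketch building on the idx from above:
pool <- setdiff(which(data$var1 %% 2 == 0), idx) # even var1, not in sample 1
idx2 <- sample(pool, size = 10)
data$sampleNo[idx2] <- 2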

Here is a complete solution to your problem, more along the lines of your original idea. The code can be shortened, but for now I tried to make it as transparent as I could.
# Data
data <- data.frame(var1 = 1:40, var2 = 40:1)
# Add SampleNo column
data$sampleNo <- 0L
# Randomly select 10 rows as sample 1
pool_idx1 <- 1:nrow(data)
idx1 <- sample(pool_idx1, size = 10)
data[idx1, ]$sampleNo <- 1L
# Draw a second sample from cases where sampleNo != 1 & var1 is even
pool_idx2 <- pool_idx1[data$var1 %% 2 == 0 & data$sampleNo != 1]
idx2 <- sample(pool_idx2, size = 10)
data[idx2, ]$sampleNo <- 2L
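A quick sanity check on the result (my addition, not part of the original answer):
table(data$sampleNo)
#>  0  1  2
#> 20 10 10
all(data$var1[data$sampleNo == 2] %% 2 == 0) # sample 2 contains only even var1
#> [1] TRUE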

Related

Is there a R function for touch point attribution in a customer journey?

My dataframe contains a column with various touch points, numbered 1 to 18. I want to know which touch point leads to touch point 10. Therefore I want to create a new column which shows the touch point that occurred before touch point 10 in each customer journey (PurchaseId). If touch point 10 doesn't occur in a customer journey, the value can be NULL or 0.
So for example:
dd <- read.table(text="
PurchaseId TouchPoint DesiredOutcome
1 8 6
1 6 6
1 10 6
2 12 0
2 8 0
3 17 4
3 3 4
3 4 4
3 10 4", header=TRUE)
The complete dataset contains 2,500,000 observations. Does anyone know how to solve my problem? Thanks in advance.
Firstly, it is better to give complete, reproducible sample code. I suggest you look at the data.table library, which is nice for handling large datasets.
library(data.table)
mdata <- matrix(sample(x = c(1:20, 21), size = 15*10, replace = TRUE), ncol = 10)
mdata[mdata==21] <- NA
mdata <- data.frame(mdata)
names(mdata) <- paste0("cj", 1:10)
df_touch <- data.table(mdata)
# -- using a for loop over the 10 journey columns
res <- rep(0, 10) # one slot per journey column (rep(0, nrow(df_touch)) was a bug)
for (i in 1:10) {
  # i if touch point 10 occurs in column cj_i, otherwise 0
  res[i] <- i * df_touch[, 10 %in% get(paste0("cj", i))]
}
# -- using lapply
dfun <- function(x, k = 10){ return( k %in% x ) }
df_touch[, lapply(.SD, dfun)]
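For the question as actually posed (long format, one row per touch point), here is a hedged data.table sketch that produces the desired column, assuming rows are already ordered chronologically within each PurchaseId:
setDT(dd)
dd[, Outcome := {
  pos <- match(10, TouchPoint) # position of the first touch point 10
  if (is.na(pos) || pos == 1L) 0L else TouchPoint[pos - 1L]
}, by = PurchaseId]
This reproduces the DesiredOutcome column of the example (6, 0, 4) and needs only a single grouped pass, which should scale to 2,500,000 rows.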

What is an alternative for this slow for-loop to fill in single days between dates?

For a project I'm working on, I need to have a dataframe to indicate whether a person was absent (0) or not (1) on a particular day.
The problem is: my data is in a format where it gives the starting date of absenteeism and then the number of days the person was absent.
Example of my dataframe:
df1 <- data.frame(Person = c(1,1,1,1,1),
StartDate = c("01-01","02-01","03-01","04-01","05-01"),
DAYS = c(3,NA,NA,NA,1))
Instead of the "Start date" and "number of days absent" per person, it should look like this instead:
df2 <- data.frame(Person = c(1,1,1,1,1),
Date = c("01-01","02-01","03-01","04-01","05-01"),
Absent = c(1,1,1,0,1))
For now I solved it with this for loop with two if-conditions:
for (i in 1:nrow(df1)) {
  if (!is.na(df1$DAYS[i])) {
    var <- df1$DAYS[i]
  }
  if (var > 0) {
    var <- var - 1
    df1$DAYS[i] <- 1
  }
}
This works, however I have thousands of persons with a full year of dates each, meaning that I have more than 5 million rows in my dataframe. You can imagine how slow the loop is.
Does anyone know a quicker way to solve my problem?
I tried looking at the lubridate package to work with periods and dates, but I don't see a solution there.
Here is an approach based upon generating all the indices of observations that should be set to 1, and then filling in the values.
# The data
df1 <- data.frame(Person = c(1,1,1,1,1),
StartDate = c("01-01","02-01","03-01","04-01","05-01"),
DAYS = c(3,NA,NA,NA,1))
# Initialize the vector we want with zeros
df1$Absent <- 0
# we get the indices of all the non-zero day entries
inds <- which(!is.na(df1$DAYS))
# Now we are going to build a list of all the indices that should be
# set to one. These are the intervals from when absence starts to
# the number of days absent - 1
inds_to_change <- mapply(function(i,d){i:(i+d-1)}, inds, na.omit(df1$DAYS))
df1$Absent[unlist(inds_to_change)] <- 1
df1
#> Person StartDate DAYS Absent
#> 1 1 01-01 3 1
#> 2 1 02-01 NA 1
#> 3 1 03-01 NA 1
#> 4 1 04-01 NA 0
#> 5 1 05-01 1 1
Created on 2019-02-20 by the reprex package (v0.2.1)
A faster solution can be found by using built-in R functions.
The general idea:
For each person, find the position for absent days greater than 1. Let the number of absent days be a and the position be p.
In every position defined by the sequence p:(p + a - 1) insert the value 1.
Return the redefined vector, in place of the old vector.
This can all be implemented in a function and then applied across all the subgroups. For this specific case, using mapply (as the previous answer suggests) works, but using data.table will in general be faster for larger data sets. This is utilized below.
RelocateAbsentees <- function(x) {
  # Find the positions in x for which the value is greater than 1
  pos <- which(x > 1)
  # Fill in the vector with the absent days
  for (i in pos) {
    val <- x[i]
    x[i:(i + val - 1)] <- 1
  }
  # Return the redefined vector (the originally posted version returned pos, a bug)
  x
}
df1 <- data.frame(Person = c(1,1,1,1,1),
StartDate = c("01-01","02-01","03-01","04-01","05-01"),
DAYS = c(3,NA,NA,NA,1))
library(data.table)
setDT(df1)
df2 <- copy(df1)[,Person := 2]
df3 <- rbind(df1,df2)
#Using data.table package (faster)
df1[, newDays := RelocateAbsentees(DAYS), by = Person]
df3[, newDays := RelocateAbsentees(DAYS), by = Person]
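With the return value fixed, the grouped call on df1 yields (my addition, not part of the original answer):
df1$newDays
#> [1]  1  1  1 NA  1
The non-absent day comes out as NA rather than 0; recode it with df1[is.na(newDays), newDays := 0] if a 0 is required.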
I found a pretty neat solution using the tidyverse:
library(tidyverse)
df1 %>%
  group_by(Person) %>%
  mutate(Abs = map_dbl(DAYS, ~ {
    if (!is.na(.x)) {
      d <<- .x
      +(d > 0)
    } else {
      d <<- d - 1
      +(d > 0)
    }
  }))
Firstly, your original approach was not so bad. Some minor improvements can make it faster than gfgm's (as of my testing; I do not know your exact data structure):
improvedOP <- function(d) {
  days <- d$DAYS      # so we do not repeatedly index the data.frame column
  ii <- !is.na(days)  # this can be calculated outside the loop
  for (i in 1:nrow(d)) {
    if (ii[i]) var <- days[i]
    if (var > 0) {
      var <- var - 1
      days[i] <- 1
    }
  }
  return(days)
}
I came up with this approach:
minem <- function(d) {
  require(zoo)
  rn <- 1:nrow(d)               # row numbers
  ii <- rn + d$DAYS - 1L        # last row number that should be set to 1
  ii <- na.locf(ii, na.rm = F)  # fill NA forward
  ii <- rn <= ii                # row is absent if its number is <= that filled value
  ii[ii == 0] <- NA             # set 0 to NA to match original results
  as.integer(ii)
}
all.equal(minem(d), improvedOP(d))
# TRUE
The idea is that we calculate row numbers which need to be 1 (current row + DAYS - 1). Then fill the NAs with this value and if row matches our condition set to 1. This should be faster than any other approach, that involves creating sequences.
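For intuition, here is a trace of those steps on the small df1 from above (an illustrative aside; assumes the zoo package is installed):
rn <- 1:5                              # row numbers
ii <- rn + c(3, NA, NA, NA, 1) - 1     # 3 NA NA NA 5
ii <- zoo::na.locf(ii, na.rm = FALSE)  # 3  3  3  3 5
as.integer(rn <= ii)                   # 1  1  1  0 1 -- the desired Absent column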
Benchmark on larger (7.3 mil rows) simulated data:
gfgm <- function(d) {
  days <- rep(0, nrow(d))
  inds <- which(!is.na(d$DAYS))
  inds_to_change <- mapply(function(i, d) {i:(i + d - 1)}, inds, na.omit(d$DAYS))
  days[unlist(inds_to_change)] <- 1
  days
}
nrow(d)/1e6 # 7.3 mil
require(bench)
require(data.table)
bm <- bench::mark(minem(d), improvedOP(d), gfgm(d), iterations = 2, check = F)
as.data.table(bm[, 1:7])
# expression min mean median max itr/sec mem_alloc
# 1: minem(d) 281.34ms 302.85ms 302.85ms 324.35ms 3.3019990 408MB
# 2: improvedOP(d) 747.45ms 754.55ms 754.55ms 761.65ms 1.3252907 139MB
# 3: gfgm(d) 3.23s 3.27s 3.27s 3.31s 0.3056558 410MB
P.S. The real results probably depend on the distribution of the DAYS values.

Reading and counting of consecutive points

I have problems reading coordinates of a 2D space from a data.table like the following, and reading out different qualities from it:
DT <- data.table(
A = c(rep("aa",2),rep("bb",2)),
B = c(rep("H",2),rep("Na",2)),
Low = c(0,3,1,1),
High = c(8,10,9,8),
Time =c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0")
)
The "Time" and "Intensity" columns are referring to x and y values of a 2D space. The "Low" and "High" columns are referring to to boundaries on the x-axis ("Time").
Now I would like to check different qualities of the y ("Intensity") dimension within (< >) those boarders:
The highest number of consecutive points > 0: (row1: 1, row 2: 2,..)
The total number of points > 0: (row1: 1, row2: 3,..)
The highest number of consecutive points > baseline (the baseline value should be taken from the Intensity value at the Low or High boundary, whichever is lower; so for row 3 it would be 12, for the others 0): (row3: 4, for all other rows it is the same as in 1.)
So the output should be a table like that:
DT <- data.table(
A =c(rep("aa",2),rep("bb",2)),
B =c(rep("H",2),rep("Na",2)),
Low = c(0,3,1,1),
High = c(8,10,9,8),
Time = c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0"),
First = c(1,2,7,0),
Second= c(1,3,7,0),
Third = c(1,2,4,0)
)
Has anyone an idea how that task could be handled? I was trying with data.table until now but if someone knows a better package for such tasks I would also be happy.
Thank you a lot in advance!
Yasel
Here is one method with base R. We split the 'Intensity' and 'Time' columns by , into lists, then loop through the corresponding elements of those lists along with the elements of the 'High' and 'Low' columns. For each row we extract the 'Intensity' values whose 'Time' lies strictly between 'Low' and 'High', check whether they are greater than 0 (or greater than the baseline derived from the 'Low'/'High' values), and use rle to find the length of the longest run of consecutive elements passing the check. Finally, we build a one-row data.frame per element, rbind the contents of the list, and cbind the result with the original dataset.
newCols <- do.call(rbind, Map(function(u, v, x, y) {
  u1 <- as.numeric(u)
  v1 <- as.numeric(v)
  v2 <- v1[u1 > x & u1 < y]
  i1 <- with(rle(v2 > 0), pmax(max(lengths[values]), 0))
  i2 <- sum(v2 > 0)
  lb <- match(x, u1)
  ub <- match(y, u1)
  v3 <- as.numeric(v[(lb + 1):(ub - 1)])
  i3 <- with(rle(v3 > min(as.numeric(v[c(lb, ub)]))),
             pmax(max(lengths[values]), 0))
  data.frame(First = i1, Second = i2, Third = i3)
}, strsplit(DT$Time, ","), strsplit(DT$Intensity, ","), DT$Low, DT$High))
cbind(DT, newCols)
# A B Low High Time Intensity First Second Third
#1: aa H 0 8 0,1,2,3,4,5,6,7,8,9,10 0,0,0,0,561464,0,0,0,0,0,0 1 1 1
#2: aa H 3 10 0,1,2,3,4,5,6,7,8,9,10 0,0,0,6548,5464,5616,0,0,0,68716,0 2 3 2
#3: bb Na 1 9 0,1,2,3,4,5,6,7,8,9,10 5658,12,6548,6541,8,5646854,54565,56465,546,65,0 7 7 4
#4: bb Na 1 8 0,1,2,3,4,5,6,7,8,9,10 0,561464,0,0,0,0,0,0,0,0,0 0 0 0
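To see the rle step in isolation, here is row 2's window between its bounds (an illustrative aside, not part of the original answer):
v <- c(5464, 5616, 0, 0, 0, 68716) # Intensity at Time 4..9, strictly inside Low = 3, High = 10
r <- rle(v > 0)                    # runs: TRUE x 2, FALSE x 3, TRUE x 1
max(r$lengths[r$values])           # longest consecutive run > 0, i.e. First
#> [1] 2
sum(v > 0)                         # total points > 0, i.e. Second
#> [1] 3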

Select rows in a data.frame when some rows repeat

I have the following toy dataset
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
value = sample(LETTERS, 15, replace = TRUE),
weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
df
ID value weight
1 1 I 0.1
2 1 G 0.1
3 1 O 0.1
4 2 B 0.1
5 2 M 0.1
6 2 M 0.1
7 3 V 0.5
8 3 J 0.5
9 3 O 0.5
10 4 E 0.2
11 4 Q 0.2
12 4 W 0.2
13 5 H 0.1
14 5 K 0.1
15 5 T 0.1
where each ID is an individual respondent, answering 3 questions (in the actual dataset, the number of questions answered is variable, so I can't rely on a certain number of rows per ID).
I want to create a new (larger) dataset which samples from the individual IDs based on the weights in weight.
probs <- data.frame(ID = unique(df$ID))
probs$prob <- NA
for (i in 1:nrow(probs)) {
  probs$prob[i] <- df[df$ID %in% probs$ID[i], ]$weight[1]
}
probs$prob <- probs$prob / sum(probs$prob)
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
head(sampledIDs,10)
[1] 4 3 3 3 4 4 2 4 2 3
Moving from the probabilistic sampling of IDs to the actual creation of the new data.frame is stumping me. I've tried
dfW <- df[df$ID %in% sampledIDs,]
but that obviously doesn't take into account the fact that IDs repeat. I've also tried a loop:
dfW <- df[df$ID == sampledIDs[1],]
for(i in 2:length(sampledIDs)){
dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
}
but that's painfully slow with a large dataset.
Any help would be very appreciated.
(Also, if there are simpler ways of doing the probabilistic selection of IDs, that would be great to hear too!)
The code speed is low because you resize the data frame in every cycle of the for loop. Here is my suggestion: before the for loop, create a data frame with the final size that dfW will have, then assign the values from df to dfW inside the loop. You may change the last part of your code to this:
dfW <- as.data.frame(matrix(nrow = 3 * length(sampledIDs), ncol = 3))
colnames(dfW) <- colnames(df) # make the column names the same
for (i in 1:length(sampledIDs)) { # notice the start index is changed from 2 to 1
  # dfW <- rbind(dfW, df[df$ID == sampledIDs[i],]) # the old, slow approach
  dfW[(3*i-2):(3*i), ] <- df[df$ID == sampledIDs[i], ]
}
Your code should run much faster with this change. Let me know how it goes!
If you don't know the final size you can resize it whenever needed, but a new if condition should be added in the for loop. First define the function to resize the dataframe as follow:
double_rowsize <- function(df) {
  mdf <- as.data.frame(matrix(NA, nrow = nrow(df), ncol = ncol(df)))
  colnames(mdf) <- colnames(df)
  df <- rbind(df, mdf)
  return(df)
}
Then start the dfW with an initial size like 12 (3 times 4):
dfW <- as.data.frame(matrix(nrow = 12, ncol = 3))
colnames(dfW) <- colnames(df)
And finally add an if condition in the for loop to resize the dataframe whenever needed:
for (i in 1:length(sampledIDs)) {
  if (3*i > nrow(dfW))
    dfW <- double_rowsize(dfW)
  dfW[(3*i-2):(3*i), ] <- df[df$ID == sampledIDs[i], ]
}
You can change the details of double_rowsize to grow the data frame by a factor other than 2 if something else works better; 2 is common because it works well for array resizing. Note that after the loop, dfW may still contain unused preallocated rows; drop them with dfW <- dfW[1:(3 * length(sampledIDs)), ].
Good luck!
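On the parenthetical question about a simpler probabilistic selection: here is a vectorized sketch (my addition, not from the answer above) that handles a variable number of rows per ID and avoids the loop entirely:
probs <- unique(df[, c("ID", "weight")])   # one weight per ID
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$weight)
row_ids <- split(seq_len(nrow(df)), df$ID) # row indices per ID
dfW <- df[unlist(row_ids[as.character(sampledIDs)]), ]
Note that sample() normalizes prob internally, so the weights do not need to sum to 1.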

R Matrix process with conditional additions

I have to pre-process a big matrix. To make my example easier to understand I will use the following matrix:
Raw data (shown as an image in the original post), where columns = people and rows = skills.
In R my matrix is:
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
Aim
In my case I need to process row by row, so there are 3 steps. For each row I have to:
Put 0 if i = j (so all diagonal entries are zero)
Put 0 if either of the two values is 0
Otherwise, add the two values together (value_i + value_j)
I will show the 3 steps to be more clear.
Step 1 (row 1)
The data are row 1, c(18, 0, 14). The resulting pairwise sums (shown as an image in the original post) are: P1+P2 = 0 (P2 is 0), P1+P3 = 18 + 14 = 32, P2+P3 = 0.
Step 2 (row 2)
The data are row 2, c(12, 13, 0). The result is: P1+P2 = 12 + 13 = 25, P1+P3 = 0, P2+P3 = 0.
Step 3 (row 3)
The data are row 3, c(15, 0, 12). The result is: P1+P2 = 0, P1+P3 = 15 + 12 = 27, P2+P3 = 0.
Create a maximum matrix
Taking, for each pair of people, the maximum match across the three steps gives the final matrix: P1/P2 = 25, P1/P3 = 32, P2/P3 = 0.
Question
Can someone tell me how to succeed to achieve this in R?
And of course the same process should work if my matrix has more row and columns...
Thanks a lot :)
Here is my implementation in R. The code doesn't execute the steps exactly in the way you specified them. I focused on your final matrix and assumed that this is the main result you're interested in.
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
rownames(test) <- paste("Skill", 1:dim(test)[1], sep="")
colnames(test) <- paste("People", 1:dim(test)[2], sep="")
test
# Pairwise combinations
comb.mat <- combn(1:dim(test)[2], 2)
pairwise.mat <- data.frame(matrix(t(comb.mat), ncol=2))
pairwise.mat$max.score <- 0
names(pairwise.mat) <- c("Person1", "Person2", "Max.Score")
for (i in 1:dim(comb.mat)[2]) { # Loop over the pairwise combinations of people
  first.person <- comb.mat[1, i]
  second.person <- comb.mat[2, i]
  temp.mat <- test[, c(first.person, second.person)]
  temp.mat[temp.mat == 0] <- NA                    # a 0 on either side disqualifies the skill
  temp.rowSums <- rowSums(temp.mat, na.rm = FALSE) # NA wherever either value was 0
  temp.rowSums[is.na(temp.rowSums)] <- 0
  max.sum <- max(temp.rowSums)
  previous.val <- pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person]
  pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person] <- max.sum*(max.sum > previous.val)
}
pairwise.mat
Person1 Person2 Max.Score
1 1 2 25
2 1 3 32
3 2 3 0
person.mat <- matrix(NA, nrow=dim(test)[2], ncol=dim(test)[2])
rownames(person.mat) <- colnames(person.mat) <- paste("People", 1:dim(test)[2], sep="")
diag(person.mat) <- 0
person.mat[cbind(pairwise.mat[,1], pairwise.mat[,2])] <- pairwise.mat$Max.Score
person.mat[lower.tri(person.mat, diag=F)] <- t(person.mat)[lower.tri(person.mat, diag=F)]
person.mat
People1 People2 People3
People1 0 25 32
People2 25 0 0
People3 32 0 0
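For comparison, a compact sketch of the same computation with outer (my addition; assumes all scores are non-negative):
pair_max <- function(a, b) {
  s <- a + b
  s[a == 0 | b == 0] <- 0 # rule 2: a 0 on either side knocks the skill out
  max(s)
}
n <- ncol(test)
person.mat2 <- outer(1:n, 1:n, Vectorize(function(i, j)
  if (i == j) 0 else pair_max(test[, i], test[, j]))) # rule 1: zero diagonal
dimnames(person.mat2) <- list(colnames(test), colnames(test))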
