For a project I'm working on, I need a dataframe that indicates whether a person was absent (1) or not (0) on a particular day.
The problem is that my data is in a format that gives the starting date of the absenteeism and then the number of days the person was absent.
Example of my dataframe:
df1 <- data.frame(Person = c(1,1,1,1,1),
                  StartDate = c("01-01","02-01","03-01","04-01","05-01"),
                  DAYS = c(3,NA,NA,NA,1))
Instead of the start date and number of days absent per person, it should look like this:
df2 <- data.frame(Person = c(1,1,1,1,1),
                  Date = c("01-01","02-01","03-01","04-01","05-01"),
                  Absent = c(1,1,1,0,1))
For now, I have solved it with this for loop containing two if-conditions:
for (i in 1:nrow(df1)) {
  # A non-NA DAYS entry starts a new absence spell of that length
  if (!is.na(df1$DAYS[i])) {
    var <- df1$DAYS[i]
  }
  # While days remain in the current spell, mark this day as absent
  if (var > 0) {
    var <- var - 1
    df1$DAYS[i] <- 1
  }
}
This works; however, I have thousands of persons with a full year of dates each, which means my dataframe has more than 5 million rows. You can imagine how slow the loop is.
Does anyone know a quicker way to solve my problem?
I tried looking at the lubridate package to work with periods and dates, but I don't see a solution there.
Here is an approach based upon generating all the indices of observations that should be set to 1, and then filling in the values.
# The data
df1 <- data.frame(Person = c(1,1,1,1,1),
                  StartDate = c("01-01","02-01","03-01","04-01","05-01"),
                  DAYS = c(3,NA,NA,NA,1))
# Initialize the vector we want with zeros
df1$Absent <- 0
# we get the indices of all the non-zero day entries
inds <- which(!is.na(df1$DAYS))
# Now we build the list of all the indices that should be set to one.
# These are the runs from each start index i through i + DAYS - 1.
inds_to_change <- mapply(function(i, d) {i:(i + d - 1)}, inds, na.omit(df1$DAYS))
df1$Absent[unlist(inds_to_change)] <- 1
df1
#> Person StartDate DAYS Absent
#> 1 1 01-01 3 1
#> 2 1 02-01 NA 1
#> 3 1 03-01 NA 1
#> 4 1 04-01 NA 0
#> 5 1 05-01 1 1
Created on 2019-02-20 by the reprex package (v0.2.1)
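As a quick check of the intermediate object: for the toy data, inds is c(1, 5), and inds_to_change comes back as a list because the index runs have different lengths, so mapply cannot simplify them:
inds_to_change
#> [[1]]
#> [1] 1 2 3
#>
#> [[2]]
#> [1] 5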
A faster solution can be found by using built-in R functions.
The general idea:
For each person, find the positions where the number of absent days is greater than 1. Let the number of absent days be a and the position be p.
In every position defined by the sequence p:(p + a - 1), insert the value 1.
Return the redefined vector in place of the old vector.
This can all be implemented in a function and then applied across all the subgroups. Using mapply (as the previous answer suggests) works for the specific case, but using data.table will in general be faster for larger data sets. This is utilized below.
RelocateAbsentees <- function(x) {
  # Find the positions in x where the value is greater than 1
  pos <- which(x > 1)
  # Fill in the vector with the absent days
  for (i in pos) {
    val <- x[i]
    x[i:(i + val - 1)] <- 1
  }
  # Return the redefined vector
  x
}
df1 <- data.frame(Person = c(1,1,1,1,1),
                  StartDate = c("01-01","02-01","03-01","04-01","05-01"),
                  DAYS = c(3,NA,NA,NA,1))
library(data.table)
setDT(df1)
df2 <- copy(df1)[,Person := 2]
df3 <- rbind(df1,df2)
#Using data.table package (faster)
df1[, newDays := RelocateAbsentees(DAYS), by = Person]
df3[, newDays := RelocateAbsentees(DAYS), by = Person]
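For the toy data, df1 then looks like this (note that non-absent days come out as NA rather than 0 with this approach):
df1
#>    Person StartDate DAYS newDays
#> 1:      1     01-01    3       1
#> 2:      1     02-01   NA       1
#> 3:      1     03-01   NA       1
#> 4:      1     04-01   NA      NA
#> 5:      1     05-01    1       1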
I found a pretty neat solution using the tidyverse:
library(tidyverse)
df1 %>%
  group_by(Person) %>%
  mutate(Abs = map_dbl(DAYS, ~ {
    if (!is.na(.x)) {
      # a new absence spell starts: remember its length in d
      # (d is carried between iterations via the global assignment <<-)
      d <<- .x
      +(d > 0)
    } else {
      # no new spell: count down the remaining absence days
      d <<- d - 1
      +(d > 0)
    }
  }))
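For the toy data this yields Abs = 1, 1, 1, 0, 1, matching df2$Absent. Note that it relies on d persisting in the global environment between calls, which assumes each person's first row has a non-NA DAYS value.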
Firstly, your original approach was not so bad. Some minor improvements can make it faster than gfgm's (in my testing; I do not know your exact data structure):
improvedOP <- function(d) {
days <- d$DAYS # so we do not repeatedly change data.frames column
ii <- !is.na(days) # this can be calculated outside the loop
for (i in 1:nrow(d)) {
if (ii[i]) var <- days[i]
if (var > 0) {
var <- var - 1
days[i] <- 1
}
}
return(days)
}
I came up with this approach:
minem <- function(d) {
require(zoo)
rn <- 1:nrow(d) # row numbers
ii <- rn + d$DAYS - 1L # get row numbers which set to 1
ii <- na.locf(ii, na.rm = F) # fill NA forward
ii <- rn <= ii # if row number less or equal than interested row is 1
ii[ii == 0] <- NA # set 0 to NA to match original results
as.integer(ii)
}
all.equal(minem(d), improvedOP(d))
# TRUE
The idea is that we calculate the row numbers that need to be 1 (current row + DAYS - 1), fill the NAs forward with that value, and then mark a row as 1 if its row number is less than or equal to the filled value. This should be faster than any approach that involves creating index sequences.
Benchmark on larger (7.3 mil rows) simulated data:
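The exact construction of d is not shown here; a hypothetical sketch (my assumption, not the original code) that produces data of roughly this shape and size:
# Hypothetical simulation: 20,000 persons x 365 days = 7.3 million rows.
# Absence spells start at random rows; all other DAYS entries are NA.
set.seed(42)
n_persons <- 20000
n_days <- 365
d <- data.frame(Person = rep(seq_len(n_persons), each = n_days),
                DAYS = NA_real_)
starts <- which(runif(nrow(d)) < 0.02)   # ~2% of rows start an absence spell
d$DAYS[starts] <- sample(1:5, length(starts), replace = TRUE)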
gfgm <- function(d) {
days <- rep(0, nrow(d))
inds <- which(!is.na(d$DAYS))
inds_to_change <- mapply(function(i, d) {i:(i + d - 1)}, inds, na.omit(d$DAYS))
days[unlist(inds_to_change)] <- 1
days
}
nrow(d)/1e6 # 7.3 mil
require(bench)
require(data.table)
bm <- bench::mark(minem(d), improvedOP(d), gfgm(d), iterations = 2, check = F)
as.data.table(bm[, 1:7])
# expression min mean median max itr/sec mem_alloc
# 1: minem(d) 281.34ms 302.85ms 302.85ms 324.35ms 3.3019990 408MB
# 2: improvedOP(d) 747.45ms 754.55ms 754.55ms 761.65ms 1.3252907 139MB
# 3: gfgm(d) 3.23s 3.27s 3.27s 3.31s 0.3056558 410MB
P.S. The real results probably depend on the distribution of the DAYS values.
I have problems reading coordinates of a 2D space from a data.table like the following and deriving different properties from them:
DT <- data.table(
  A = c(rep("aa",2), rep("bb",2)),
  B = c(rep("H",2), rep("Na",2)),
  Low = c(0,3,1,1),
  High = c(8,10,9,8),
  Time = c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
  Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0")
)
The "Time" and "Intensity" columns are referring to x and y values of a 2D space. The "Low" and "High" columns are referring to to boundaries on the x-axis ("Time").
Now I would like to check different qualities of the y ("Intensity") dimension within (< >) those boarders:
The highest number of consecutive points > 0: (row1: 1, row 2: 2,..)
The total number of points > 0: (row1: 1, row2: 3,..)
The highest number of consecutive points > baseline (the baseline value should be taken from the Intensity value of the Low or High boundary, which ever is lower (so for row3 it would be 12, for the others 0)): (row3: 4, for all other rows it is the same as in 1.)
So the output should be a table like that:
DT <- data.table(
  A = c(rep("aa",2), rep("bb",2)),
  B = c(rep("H",2), rep("Na",2)),
  Low = c(0,3,1,1),
  High = c(8,10,9,8),
  Time = c("0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10","0,1,2,3,4,5,6,7,8,9,10"),
  Intensity = c("0,0,0,0,561464,0,0,0,0,0,0","0,0,0,6548,5464,5616,0,0,0,68716,0","5658,12,6548,6541,8,5646854,54565,56465,546,65,0","0,561464,0,0,0,0,0,0,0,0,0"),
  First = c(1,2,7,0),
  Second = c(1,3,7,0),
  Third = c(1,2,4,0)
)
Does anyone have an idea how this task could be handled? I have been trying with data.table so far, but if someone knows a better package for such tasks I would be happy to hear about it as well.
Thanks a lot in advance!
Yasel
Here is one method with base R. We split the 'Intensity' and 'Time' columns on "," into lists, then loop through the corresponding elements of those lists along with the elements of the 'High' and 'Low' columns. For each row, we extract the 'Intensity' values whose 'Time' lies strictly between 'Low' and 'High' and check whether they are greater than 0 (or greater than the baseline derived from the boundary values in 'Low'/'High'). We use rle to find the length of the longest run of elements satisfying each condition, create a data.frame per row, rbind the contents of the list together, and cbind the result with the original dataset.
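As a quick illustration of the rle step on a toy vector:
with(rle(c(0, 5, 7, 0, 3) > 0), max(lengths[values]))
#[1] 2   # the longest run of values > 0 has length 2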
newCols <- do.call(rbind, Map(function(u, v, x, y) {
  u1 <- as.numeric(u)                       # Time values
  v1 <- as.numeric(v)                       # Intensity values
  v2 <- v1[u1 > x & u1 < y]                 # Intensity strictly between Low and High
  i1 <- with(rle(v2 > 0), pmax(max(lengths[values]), 0))   # longest run > 0
  i2 <- sum(v2 > 0)                         # total number of points > 0
  lb <- match(x, u1)                        # index of the Low boundary
  ub <- match(y, u1)                        # index of the High boundary
  v3 <- as.numeric(v[(lb + 1):(ub - 1)])    # Intensity strictly within the boundaries
  baseline <- min(as.numeric(v[c(lb, ub)])) # lower of the two boundary Intensities
  i3 <- with(rle(v3 > baseline), pmax(max(lengths[values]), 0))  # longest run > baseline
  data.frame(First = i1, Second = i2, Third = i3)
},
strsplit(DT$Time, ","), strsplit(DT$Intensity, ","), DT$Low, DT$High))
cbind(DT, newCols)
# A B Low High Time Intensity First Second Third
#1: aa H 0 8 0,1,2,3,4,5,6,7,8,9,10 0,0,0,0,561464,0,0,0,0,0,0 1 1 1
#2: aa H 3 10 0,1,2,3,4,5,6,7,8,9,10 0,0,0,6548,5464,5616,0,0,0,68716,0 2 3 2
#3: bb Na 1 9 0,1,2,3,4,5,6,7,8,9,10 5658,12,6548,6541,8,5646854,54565,56465,546,65,0 7 7 4
#4: bb Na 1 8 0,1,2,3,4,5,6,7,8,9,10 0,561464,0,0,0,0,0,0,0,0,0 0 0 0
I have the following toy dataset
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
                 value = sample(LETTERS, 15, replace = TRUE),
                 weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
df
ID value weight
1 1 I 0.1
2 1 G 0.1
3 1 O 0.1
4 2 B 0.1
5 2 M 0.1
6 2 M 0.1
7 3 V 0.5
8 3 J 0.5
9 3 O 0.5
10 4 E 0.2
11 4 Q 0.2
12 4 W 0.2
13 5 H 0.1
14 5 K 0.1
15 5 T 0.1
where each ID is an individual respondent, answering 3 questions (in the actual dataset, the number of questions answered is variable, so I can't rely on a certain number of rows per ID).
I want to create a new (larger) dataset which samples from the individual IDs based on the weights in weight.
probs <- data.frame(ID = unique(df$ID))
probs$prob <- NA
for (i in 1:nrow(probs)) {
  probs$prob[i] <- df[df$ID %in% probs$ID[i], ]$weight[1]
}
probs$prob <- probs$prob / sum(probs$prob)
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
head(sampledIDs,10)
[1] 4 3 3 3 4 4 2 4 2 3
Moving from the probabilistic sampling of IDs to the actual creation of the new data.frame is stumping me. I've tried
dfW <- df[df$ID %in% sampledIDs,]
but that obviously doesn't take into account the fact that IDs repeat. I've also tried a loop:
dfW <- df[df$ID == sampledIDs[1], ]
for (i in 2:length(sampledIDs)) {
  dfW <- rbind(dfW, df[df$ID == sampledIDs[i], ])
}
but that's painfully slow with a large dataset.
Any help would be very appreciated.
(Also, if there are simpler ways of doing the probabilistic selection of IDs, that would be great to hear too!)
The code is slow because you resize the data frame in every iteration of the for loop. Here is my suggestion: create a data frame with the final size that dfW will have before the for loop, then assign the values from df to dfW inside the loop. You may change the last part of your code to this:
dfW <- as.data.frame(matrix(nrow = 3 * length(sampledIDs), ncol = 3)) # preallocate; assumes 3 rows per ID as in the toy data
colnames(dfW) <- colnames(df) # make the column names the same
for (i in 1:length(sampledIDs)) { # notice the start index is changed from 2 to 1
  #dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
  dfW[(3*i-2):(3*i), ] <- df[df$ID == sampledIDs[i], ]
}
Your code should run much faster with this change. Let me know how it goes!
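As an aside, if every ID really does have exactly three rows stored in ID order (as in the toy data), the loop can be avoided entirely with vectorized indexing; a sketch under that assumption:
# Assumes IDs 1..N each occupy exactly 3 consecutive rows, ordered by ID
idx <- rep((sampledIDs - 1) * 3, each = 3) + 1:3   # row indices for each sampled ID
dfW <- df[idx, ]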
If you don't know the final size, you can resize the data frame whenever needed, but a new if-condition should be added in the for loop. First define a function to resize the data frame as follows:
double_rowsize <- function(df) {
  mdf <- as.data.frame(matrix(NA, nrow = nrow(df), ncol = ncol(df)))
  colnames(mdf) <- colnames(df)
  df <- rbind(df, mdf)
  return(df)
}
Then start dfW with an initial size, say 12 rows (3 times 4):
dfW <- as.data.frame(matrix(nrow = 12, ncol = 3))
colnames(dfW) <- colnames(df)
And finally add an if condition in the for loop to resize the dataframe whenever needed:
for (i in 1:length(sampledIDs)) {
  if (3*i > nrow(dfW))
    dfW <- double_rowsize(dfW)
  dfW[(3*i-2):(3*i), ] <- df[df$ID == sampledIDs[i], ]
}
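One detail not covered above: after the loop, dfW may still contain unused NA rows from the last doubling, which can be trimmed off:
dfW <- dfW[seq_len(3 * length(sampledIDs)), ]   # drop the leftover NA rows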
You can change the details of double_rowsize to grow the data frame by a factor other than 2 if something else works better. A factor of 2 is common because it works well for array resizing: the amortized cost per appended row stays constant.
Good luck!
I have to pre-process a big matrix. To make my example easier to understand I will use the following matrix:
Raw data: columns are people, rows are skills.
In R my matrix is:
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
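This matrix looks like:
test
#      [,1] [,2] [,3]
# [1,]   18    0   14
# [2,]   12   13    0
# [3,]   15    0   12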
Aim
In my case I need to process the matrix row by row, so there are 3 steps. For each row (skill) I have to build a people-by-people matrix where I:
Put 0 at position (i, j) if i = j (so all diagonal entries equal zero)
Put 0 at position (i, j) if one of the two persons has a 0 for that skill
Otherwise add the two persons' values together
I will show the 3 steps to make this clearer.
Step 1 (row 1)
The data are row 1: (18, 0, 14). Every pair involving People2 gets 0 because People2's value is 0, so the only non-zero entry is People1 + People3 = 18 + 14 = 32.
Step 2 (row 2)
The data are row 2: (12, 13, 0). Every pair involving People3 gets 0, leaving People1 + People2 = 12 + 13 = 25.
Step 3 (row 3)
The data are row 3: (15, 0, 12). Again only People1 + People3 is non-zero: 15 + 12 = 27.
Create a maximum matrix
Then the maximum matchings are the element-wise maxima over the three steps: People1-People2 = 25, People1-People3 = max(32, 27) = 32, People2-People3 = 0.
So my final matrix should be:
        People1 People2 People3
People1       0      25      32
People2      25       0       0
People3      32       0       0
Question
Can someone tell me how to achieve this in R?
And of course the same process should work if my matrix has more rows and columns...
Thanks a lot :)
Here is my implementation in R. The code doesn't execute the steps exactly in the way you specified them. I focused on your final matrix and assumed that this is the main result you're interested in.
test <- matrix(c(18,12,15,0,13,0,14,0,12),ncol=3, nrow=3)
rownames(test) <- paste("Skill", 1:dim(test)[1], sep="")
colnames(test) <- paste("People", 1:dim(test)[2], sep="")
test
# Pairwise combinations
comb.mat <- combn(1:dim(test)[2], 2)
pairwise.mat <- data.frame(matrix(t(comb.mat), ncol=2))
pairwise.mat$max.score <- 0
names(pairwise.mat) <- c("Person1", "Person2", "Max.Score")
for (i in 1:dim(comb.mat)[2]) { # Loop over the pairs (columns of comb.mat)
  first.person <- comb.mat[1, i]
  second.person <- comb.mat[2, i]
  temp.mat <- test[, c(first.person, second.person)]
  temp.mat[temp.mat == 0] <- NA                    # a 0 disqualifies the pair for that skill
  temp.rowSums <- rowSums(temp.mat, na.rm=FALSE)   # NA propagates for disqualified skills
  temp.rowSums[is.na(temp.rowSums)] <- 0
  max.sum <- max(temp.rowSums)
  previous.val <- pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person]
  pairwise.mat$Max.Score[pairwise.mat$Person1 == first.person & pairwise.mat$Person2 == second.person] <- max.sum*(max.sum > previous.val)
}
pairwise.mat
Person1 Person2 Max.Score
1 1 2 25
2 1 3 32
3 2 3 0
person.mat <- matrix(NA, nrow=dim(test)[2], ncol=dim(test)[2])
rownames(person.mat) <- colnames(person.mat) <- paste("People", 1:dim(test)[2], sep="")
diag(person.mat) <- 0
person.mat[cbind(pairwise.mat[,1], pairwise.mat[,2])] <- pairwise.mat$Max.Score
person.mat[lower.tri(person.mat, diag=F)] <- t(person.mat)[lower.tri(person.mat, diag=F)]
person.mat
People1 People2 People3
People1 0 25 32
People2 25 0 0
People3 32 0 0
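For comparison, the same final matrix can be built more compactly by constructing each skill's pairwise-sum matrix with outer() and reducing with pmax(); this is an alternative sketch, not part of the implementation above:
# Alternative sketch: per-skill pairwise sums, invalid pairs zeroed, element-wise maximum
step_matrix <- function(row) {
  m <- outer(row, row, "+")                 # all pairwise sums for this skill
  m[outer(row == 0, row == 0, "|")] <- 0    # 0 if either person scored 0
  diag(m) <- 0                              # nobody is paired with themselves
  m
}
person.mat2 <- Reduce(pmax, lapply(seq_len(nrow(test)), function(r) step_matrix(test[r, ])))
person.mat2
#         People1 People2 People3
# People1       0      25      32
# People2      25       0       0
# People3      32       0       0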