I'm trying to convert some code to use data.table. I need to create a graph structure from columns in a data.frame/data.table where each row contains an id and its depth in the tree. My usual approach is split/apply/combine, so I feel like this should be possible using by and some expression in data.table, but I can't get it.
Here is an example,
## A data.table like this with ids and levels
dat <- data.table(level = rep(1:4, times=2^(0:3)), id = 1:15)
## my normal way, not using data table would involve a split and rep
levs <- split(dat$id, dat$level)
nodes <- unlist(mapply(function(a,b) rep(a, length.out=b), head(levs, -1L),
tail(lengths(levs), -1L)), use.names = FALSE)
## Desired result
res <- cbind(nodes, dat$id[-1L])
## To visualize
library(igraph)
plot(graph_from_edgelist(cbind(nodes, dat$id[-1L])), layout=layout.reingold.tilford,
asp=0.6)
Edit
I think the problem I'm having is that when I do a by=level, I need information from two levels to get the proper repeat length.
Here's another way of getting your nodes column:
dat[, .N, by = .(level = level - 1)][
dat, on = 'level', nomatch = 0][
, .(nodes = rep(id, length.out = N[1])), by = level]
# level nodes
# 1: 1 1
# 2: 1 1
# 3: 2 2
# 4: 2 3
# 5: 2 2
# 6: 2 3
# 7: 3 4
# 8: 3 5
# 9: 3 6
#10: 3 7
#11: 3 4
#12: 3 5
#13: 3 6
#14: 3 7
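If the full edge list is needed rather than just the nodes column, one possible follow-up (a sketch that relies on dat being ordered by level, as in the example, so that dat$id[-1L] lists the children in the same order as the repeated parents) is to append the child ids to the result of the chain above:
library(data.table)
dat <- data.table(level = rep(1:4, times = 2^(0:3)), id = 1:15)
## same chain as above, then attach each child id to its repeated parent
edges <- dat[, .N, by = .(level = level - 1)][
  dat, on = 'level', nomatch = 0][
  , .(parent = rep(id, length.out = N[1])), by = level][
  , child := dat$id[-1L]][]
as.matrix(edges[, .(parent, child)]) can then be passed to graph_from_edgelist() as in the question.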
Related
I have two dataframes: one (called df_persons) with records that have unique person_id's but stratum_id's that are not unique, and one (called df_population) with those same stratum_id's, including multiple duplicate rows of them. Code to recreate them below:
df_persons = data.frame(person_id=c(101, 102, 103), stratum_id=c(1,2,1))
df_population = data.frame(stratum_id=c(1,1,1,1,2,2,2,2,3,3))
Now I would like a way to merge the data from df_persons into df_population, so that every row from df_persons gets merged with the first matching (key = stratum_id) row of df_population that has not been previously matched. See the desired result below:
# manual way to merge first available match
df_population$person = c(101, 103, NA, NA, 102, NA, NA, NA, NA, NA)
I wrote a loop for this that works (see below). The problem is that df_persons is 83,000 records long and df_population is 13 million records long, so the loop takes far too long and my PC cannot handle it.
# create empty person column in df_population
df_population$person = NA
# order both df's to speed up
df_population = df_population[order(df_population$stratum_id),]
df_persons = df_persons[order(df_persons$stratum_id),]
# loop through all persons in df_person, and for each find the first available match
for(i_person in 1:nrow(df_persons))
{
  match = F
  i_pop = 0
  while(!match)
  {
    i_pop = i_pop + 1
    if(df_population$stratum_id[i_pop] == df_persons$stratum_id[i_person] & is.na(df_population$person[i_pop]))
    {
      match = T
      df_population$person[i_pop] = df_persons$person[i_person]
    }
  }
}
Any help to make this a lot faster would be much appreciated. I have looked into the data.table package, to no avail so far, but I do think I will need to move away from looping in order to get the code to run.
1) dplyr Using dplyr, add a sequence number to each data frame and then merge them:
library(dplyr)
df_population %>%
  group_by(stratum_id) %>%
  mutate(seq = 1:n()) %>%
  ungroup %>%
  left_join(df_persons %>% group_by(stratum_id) %>% mutate(seq = 1:n()))
giving:
Joining, by = c("stratum_id", "seq")
# A tibble: 10 x 3
stratum_id seq person_id
<dbl> <int> <dbl>
1 1 1 101
2 1 2 103
3 1 3 NA
4 1 4 NA
5 2 1 102
6 2 2 NA
7 2 3 NA
8 2 4 NA
9 3 1 NA
10 3 2 NA
2) Base R Or, in base R:
p1 <- transform(df_population, seq = ave(stratum_id, stratum_id, FUN = seq_along))
p2 <- transform(df_persons, seq = ave(stratum_id, stratum_id, FUN = seq_along))
merge(p1, p2, all.x = TRUE, all.y = FALSE)
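An optional follow-up (an assumption about the desired presentation, not part of the answer above): the merged result still carries the helper seq column, so you may want to sort by the merge keys and keep just the two columns of interest:
out <- merge(p1, p2, all.x = TRUE, all.y = FALSE)
out <- out[order(out$stratum_id, out$seq), setdiff(names(out), "seq")]
out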
3) sqldf In SQL we have the following. The dbname= argument causes the processing to be done outside of R, but if you have sufficient memory it can be omitted and the work will be done in memory within R.
library(sqldf)
seqno <- "sum(1) over (partition by stratum_id rows unbounded preceding)"
fn$sqldf("
with
p1 as (select *, $seqno seq from df_population),
p2 as (select *, $seqno seq from df_persons)
select * from p1 left join p2 using (stratum_id, seq)
", dbname = tempfile())
Here is a data.table approach. More explanation in the code's comments.
library(data.table)
# make them data.table
setDT(df_persons)
setDT(df_population)
# create dummy values to join on
df_persons[, id := rowid(stratum_id)]
df_population[, id := rowid(stratum_id)]
# join by reference
df_population[df_persons, person_id := i.person_id, on = .(stratum_id, id)][]
# drop the dummy id column
df_population[, id := NULL][]
# stratum_id person_id
# 1: 1 101
# 2: 1 103
# 3: 1 NA
# 4: 1 NA
# 5: 2 102
# 6: 2 NA
# 7: 2 NA
# 8: 2 NA
# 9: 3 NA
#10: 3 NA
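The key step above is rowid(), which numbers the rows within each stratum_id so that the join pairs them off one by one. A tiny illustration of what it returns:
library(data.table)
rowid(c(1, 1, 1, 2, 2))
# [1] 1 2 3 1 2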
Simply use pmatch as shown below:
df_population$person_id <- df_persons$person_id[pmatch(df_population$stratum_id, df_persons$stratum_id)]
df_population
stratum_id person_id
1 1 101
2 1 103
3 1 NA
4 1 NA
5 2 102
6 2 NA
7 2 NA
8 2 NA
9 3 NA
10 3 NA
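A brief note on why this works: pmatch() uses each element of its table argument at most once (the values are compared as character strings), so repeated stratum_id's in df_population get matched to successive rows of df_persons and the leftovers become NA. A small illustration of that behaviour:
pmatch(c(1, 1, 1, 2), c(1, 2, 1))
# [1]  1  3 NA  2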
Here is my simplified data.table:
Individual   time_alive (day)
ID1          1
ID2          5
ID3          7
ID4          5
I need to calculate the number of individuals alive on each day.
I achieved this with a loop:
for (i in c(-1:600)) {
  y <- summarise(DT, time_alive > i)
  Alive[i + 2, ] <- length(y[y == TRUE])
}
However, this is really slow with a data.frame of more than 2B observations.
I wanted to try an alternative with data.table, but I am stuck at calculating the number alive for a single day:
DT[,.N,time_alive> i][time_alive==TRUE,2]
Here, i cannot be replaced by a vector, only by a single number. I want to calculate the number of individuals alive for more than i days, without doing a loop.
My expected result for the simplified data would be:
Day   Number of individuals alive
1     4
2     3
3     3
4     3
5     3
6     1
7     1
8     0
Best solution in one line, with data.table, which is much faster than looping:
DT[, .(Day = seq_len(1 + max(time_alive)))][DT[,.(time_alive)], .(.N), on = .(Day <= time_alive), by = Day]
# @r2evans' suggestion about making it a one-liner
# replaces res = data.table('day' = 1:max(DT$time_alive))
DT[, .(day = seq_len(1 + max(time_alive)))][
  # my original solution
  DT, .(.N), on = .(day <= time_alive), by = day, allow.cartesian = T]
# or
DT[, time_alive > TARGET_NUMBER, by = individual]
I have two solutions based on what you have provided. One or both of them should be what you're looking for. See below for details/explanation
# load in data
DT = data.table('individual' = 1:4, 'time_alive' = c(1,5,7,5))
# set your target number
TARGET_NUMBER = 5
# group by individual,
# then check if the number of days they were alive is greater than your target
# this answers "i want to calculate the number of
# individual with more than "i" days of life
DT[,time_alive > TARGET_NUMBER, by = individual]
individual V1
1: 1 FALSE
2: 2 FALSE
3: 3 TRUE
4: 4 FALSE
# if the result you want is the table you created, that is a little different:
# create a table with days ranging from 1 to the maximum survivor
res = data.table('day' = 1:max(DT$time_alive))
day
1: 1
2: 2
3: 3
4: 4
5: 5
6: 6
7: 7
# use joins
# join by time alive being greater than or equal to the day
# group by the specific day, and count how many observations we have
# allow.cartesian because the mapping isn't one-to-one
res[DT, .(.N) ,on = .(day <= time_alive),by = day, allow.cartesian = T]
day N
1: 1 4
2: 2 3
3: 3 3
4: 4 3
5: 5 3
6: 6 1
7: 7 1
I would approach the problem in a different way.
If you run data.frame(Alive = cumsum(rev(table(c(1,5,7,5))))) (or, in your general case, data.frame(Alive = cumsum(rev(table(DT$time_alive))))), you will have the information you need, with the only caveat that if any day had no deaths, you will end up with gaps in the data.
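To avoid those gaps, here is a minimal sketch of the same counting idea (assuming time_alive holds whole days starting at 1): tabulate() counts the deaths on every day of the full 1..max range, and a cumulative sum turns that into the number still alive on each day.
time_alive <- c(1, 5, 7, 5)    # or DT$time_alive
# deaths per day over the complete range, so days with no deaths are kept
deaths <- tabulate(time_alive, nbins = max(time_alive))
# alive on day d = total individuals minus those who died before day d
data.frame(Day   = seq_len(max(time_alive) + 1),
           Alive = sum(deaths) - c(0, cumsum(deaths)))
#   Day Alive
# 1   1     4
# 2   2     3
# ...
# 8   8     0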
data.table
library(data.table)
DT[, .(Day = seq_len(max(time_alive) + 1))
][, Number := rowSums(outer(Day, DT$time_alive, `<=`))]
# Day Number
# <int> <num>
# 1: 1 4
# 2: 2 3
# 3: 3 3
# 4: 4 3
# 5: 5 3
# 6: 6 1
# 7: 7 1
# 8: 8 0
(I'm assuming that DT will have no more than 1 row per Individual.)
Data
DT <- setDT(structure(list(Individual = c("ID1", "ID2", "ID3", "ID4"), time_alive = c(1L, 5L, 7L, 5L)), class = c("data.table", "data.frame"), row.names = c(NA, -4L)))
Hello everyone!
Being a beginner with R (I think my request is feasible in this software), I would like to ask you a question.
In a large Excel-type file, I have a column where the values I am interested in appear only every 193 rows. So I would like the previous 192 rows to be set equal to the value in the 193rd position ... and so on for every block of 193 rows, until the end of the column.
Concretely, here is what I would like to get for this little example:
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 3 2 NA
2 3 NA NA
3 3 NA NA
4 3 NA NA
1 8 4 5
2 8 5 5
3 8 5 5
4 8 5 5
The column "Cluster_ref_INPUT" is partitioned according to the column "Fund_number" (one observation for each fund every month for 4 months). The values that interest me in the INPUT column appear every 4 observations (the value in the 4th month).
Thus, we can see that for each fund number, the column "Expected_output" contains the value found in the last row of the column "Cluster_ref_INPUT" for that fund (every 4 rows). I think we should partition by "Fund_number" and set all the rows equal to the last one... something like that?
Do you have any idea what code I should use to make this work?
I hope that's clear enough. Do not hesitate to ask if anything needs clarifying.
Thank you very much in advance,
Vanie
Here's a one-line solution using data.table:
library(data.table)
exdata <- fread(text = "
Month Fund_number Cluster_ref_INPUT Expected_output
1 1 1 1
2 1 1 1
3 1 3 1
4 1 1 1
1 2 2 NA
2 2 NA NA
3 2 NA NA
4 2 NA NA
1 3 4 5
2 3 5 5
3 3 5 5
4 3 5 5")
# You can read your data directly as a data.table using fread, or convert an existing data.frame using setDT(exdata)
exdata[, newvar := Cluster_ref_INPUT[.N], by = Fund_number]
> exdata
Month Fund_number Cluster_ref_INPUT Expected_output newvar
1: 1 1 1 1 1
2: 2 1 1 1 1
3: 3 1 3 1 1
4: 4 1 1 1 1
5: 1 2 2 NA NA
6: 2 2 NA NA NA
7: 3 2 NA NA NA
8: 4 2 NA NA NA
9: 1 3 4 5 5
10: 2 3 5 5 5
11: 3 3 5 5 5
12: 4 3 5 5 5
There are probably solutions using tidyverse that'll be a lot faster, but here's a solution in base R.
#Your data
df <- data.frame(Month = rep_len(c(1:4), 12),
Fund_number = rep(c(1:3), each = 4),
Cluster_ref_INPUT = c(1, 1, 3, 1, 2, NA, NA, NA, 4, 5, 5, 5),
stringsAsFactors = FALSE)
#Create an empty data frame in which the results will be stored
outdat <- data.frame(Month = c(), Fund_number = c(), Cluster_ref_INPUT = c(), expected_input = c(), stringsAsFactors = FALSE)
#Using a for loop
#Iterate through the list of unique Fund_number values
for(i in 1:length(unique(df$Fund_number))){
#Subset data pertaining to each unique Fund_number
curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
#Take the value of Cluster_ref_Input from the last row
#And set it as the value for expected_input column for all rows
curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
#Append this modified subset to the output container data frame
outdat <- rbind(outdat, curdat)
#Go to next iteration
}
#Remove non-essential looping variables
rm(curdat, i)
outdat
# Month Fund_number Cluster_ref_INPUT expected_input
# 1 1 1 1 1
# 2 2 1 1 1
# 3 3 1 3 1
# 4 4 1 1 1
# 5 1 2 2 NA
# 6 2 2 NA NA
# 7 3 2 NA NA
# 8 4 2 NA NA
# 9 1 3 4 5
# 10 2 3 5 5
# 11 3 3 5 5
# 12 4 3 5 5
EDIT: additional solutions + benchmarking
Per OP's comment on this answer, I've presented some faster solutions (dplyr and the data.table solution from the other answer) and also benchmarked them on a 950,004 row simulated dataset similar to the one in OP's example. Code and results below; the entire code-block can be copy-pasted and run directly as long as the necessary libraries (microbenchmark, dplyr, data.table) and their dependencies are installed. (If someone knows a solution based on apply() they're welcome to add it here.)
rm(list = ls())
#Library for benchmarking
library(microbenchmark)
#Dplyr
library(dplyr)
#Data.table
library(data.table)
#Your data
df <- data.frame(Month = rep_len(c(1:12), 79167),
Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
stringsAsFactors = FALSE)
#Data in format for data.table
df_t <- data.table(Month = rep_len(c(1:12), 79167),
Fund_number = rep(c(1, 2, 5, 6, 8, 22), each = 158334),
Cluster_ref_INPUT = sample(letters, size = 950004, replace = TRUE),
stringsAsFactors = FALSE)
#----------------
#Base R solution
#Using a for loop
#Iterate through the list of unique Fund_number values
base_r_func <- function(df) {
#Create an empty data frame in which the results will be stored
outdat <- data.frame(Month = c(),
Fund_number = c(),
Cluster_ref_INPUT = c(),
expected_input = c(),
stringsAsFactors = FALSE)
for(i in 1:length(unique(df$Fund_number))){
#Subset data pertaining to each unique Fund_number
curdat <- subset(df, df$Fund_number == unique(df$Fund_number)[i])
#Take the value of Cluster_ref_Input from the last row
#And set it as the value for expected_input column for all rows
curdat$expected_input <- curdat$Cluster_ref_INPUT[nrow(curdat)]
#Append this modified subset to the output container data frame
outdat <- rbind(outdat, curdat)
#Go to next iteration
}
#Remove non-essential looping variables
rm(curdat, i)
#This return is needed for the base_r_func function wrapper
#this code is enclosed in (not necessary otherwise)
return(outdat)
}
#----------------
#Tidyverse solution
dplyr_func <- function(df){
df %>% #For actual use, replace this %>% with %<>%
#and it will write the output back to the input object
#Group the data by Fund_number
group_by(Fund_number) %>%
#Create a new column populated w/ last value from Cluster_ref_INPUT
mutate(expected_input = last(Cluster_ref_INPUT))
}
#----------------
#Data table solution
dt_func <- function(df_t){
#For this function, we are using
#dt_t (created above)
#Logic similar to dplyr solution
df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
dt_func_conv <- function(df){
#Converting data.frame to data.table format
df_t <- data.table(df)
#Logic similar to dplyr solution
df_t <- df_t[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
#----------------
#Benchmarks
bm_vals <- microbenchmark(base_r_func(df),
dplyr_func(df),
dt_func(df_t),
dt_func_conv(df), times = 8)
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 618.58202 702.30019 721.90643 743.02018 754.87397 756.28077 8
# dplyr_func(df) 119.18264 123.26038 128.04438 125.64418 133.37712 140.60905 8
# dt_func(df_t) 38.06384 38.27545 40.94850 38.88269 43.58225 48.04335 8
# dt_func_conv(df) 48.87009 51.13212 69.62772 54.36058 57.68829 181.78970 8
#----------------
As can be seen, using data.table would be the way to go if speed is a necessity. data.table is faster than dplyr and base R even when the overhead of converting a regular data.frame to a data.table is considered (see results of dt_func_conv()).
Edit: following up on Carlos Eduardo Lagosta's comments, using setDT() to coerce the df from a data.frame to a data.table makes the overhead of said coercion close to nil. Code snippet and benchmark values below.
#This version includes the time taken
#to coerce a data.frame to a data.table
dt_func_conv <- function(df){
#Logic similar to dplyr solution
#setDT() coerces data.frames to the data.table format
setDT(df)[ , expected_output := Cluster_ref_INPUT[.N], by = Fund_number]
}
bm_vals
# Unit: milliseconds
# expr min lq mean median uq max neval
# base_r_func(df) 271.60196 344.47280 353.76204 348.53663 368.65696 435.16163 8
# dplyr_func(df) 121.31239 122.67096 138.54481 128.78134 138.72509 206.69133 8
# dt_func(df_t) 38.21601 38.57787 40.79427 39.53428 43.14732 45.61921 8
# dt_func_conv(df) 41.11210 43.28519 46.72589 46.74063 50.16052 52.32235 8
For the OP specifically: whatever solution you wish to use, the code you're looking for is within the body of the corresponding function. So, for instance, if you want to use the dplyr solution, you would need to take this code and tailor it to your data objects:
df %>% #For actual use, replace this %>% with %<>%
#and it will write the output back to the input object
#Group the data by Fund_number
group_by(Fund_number) %>%
#Create a new column populated w/ last value from Cluster_ref_INPUT
mutate(expected_input = last(Cluster_ref_INPUT))
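On the note above inviting an apply()-style solution: below is a hedged base R sketch using ave() (which iterates over the groups internally via split()/lapply()); it has not been benchmarked here, but it avoids the explicit for loop and the repeated rbind() of the base R solution.
# assumes df has the Fund_number and Cluster_ref_INPUT columns used above
df$expected_input <- ave(df$Cluster_ref_INPUT, df$Fund_number,
                         FUN = function(x) rep(x[length(x)], length(x)))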
I have a dataset of prescription records. Each row is a prescription for a single drug on a particular day. I have divided the drugs into two groups with partial overlap. I would like to identify where prescriptions have been issued from both drug groups within 3 days of each other, but exclude cases where the same drug has been issued from both group 1 and group 2, identifying the date of the later drug of the pair.
An example of my data:
library(data.table)
set.seed(10)
DT <- data.table(day = sample(c(1:31), 30, replace = TRUE),
drug_group = sample(c(1, 2), 30, replace = TRUE),
drug_1 = sample(c("A", "B", "C"), 30, replace = TRUE),
drug_2 = sample(c("A", "D", "E"), 30, replace = TRUE))
DT[drug_group == 1, drug := drug_1]
DT[drug_group == 2, drug := drug_2]
DT[, c("drug_1", "drug_2") := NULL]
setkey(DT, day)
so the following:
day drug_group drug
1: 2 1 B
2: 3 1 C
3: 4 1 B
4: 7 2 E
5: 8 1 A
6: 9 2 A
7: 9 2 D
8: 9 1 C
9: 10 1 A
10: 12 1 A
...
24: 22 2 D
25: 22 2 E
26: 24 1 A
27: 25 1 A
28: 26 2 D
29: 26 1 C
30: 27 1 C
I would like to obtain a result like this:
day interaction_present
1: 1 FALSE
2: 2 FALSE
3: 3 FALSE
4: 4 FALSE
5: 5 FALSE
6: 6 FALSE
...
26: 26 TRUE
29: 29 FALSE
30: 30 FALSE
I am pretty sure I could do this by looping over each row in turn, but I have been admonished repeatedly for using loops instead of vectorising, so I wondered whether this type of task is feasible without a loop. I have looked at using the data.table shift() function to set up lags, but I am wary of creating too many new columns since my actual data.table is over a million rows.
Sorry if this is a trivial issue or if it has been asked before, but I have been stuck on it all afternoon and I am giving up for the day!
If I am understanding your question correctly, the following should work. Changes can easily be made to the logic decisions within the function, as well as to the time-lag start and end variables.
timelagadj <- function(i){
  ## this should be changed depending on what you mean by "within"
  ## 3 days. This currently goes i-3,i+3
  ## but if i-3 or i+3 doesn't exist in dayDT$day
  ## then we pull the value one above/below i-3/i+3 respectively
  start <- max(dayDT$day[i] - 3, dayDT$day[1])
  start <- ifelse(dayDT$day[findInterval(start, dayDT$day)] < start,
                  findInterval(start, dayDT$day) + 1,
                  findInterval(start, dayDT$day))
  end <- min(dayDT$day[i] + 3, dayDT$day[nrow(dayDT)])
  end <- findInterval(end, dayDT$day)
  ## now we pull the relevant group ID and drug ID
  gIDs <- dayDT$groupID[start:end]
  dIDs <- dayDT$drugID[start:end]
  ## here we unlist the paste made before
  ## to group by days
  gIDs <- unlist(strsplit(gIDs, "_"))
  dIDs <- unlist(strsplit(dIDs, "_"))
  ## now we can apply our logic rule based
  ## on the criteria you mentioned
  if(length(unique(gIDs)) > 1){
    tmp <- unique(data.frame(gIDs, dIDs))
    if(length(unique(tmp$gIDs)) != length(unique(tmp$dIDs))) T else F
  } else F
}
dayDT <- DT[, list("drugID" = paste(drug, collapse = "_"),
                   "groupID" = paste(drug_group, collapse = "_")), by = day]
res <- sapply(1:nrow(dayDT), function(m) timelagadj(m))
res <- dayDT[, list(day, "interaction_present" = res)]
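One caveat worth noting (an assumption about the desired output, since the example result lists every day): dayDT, and therefore res, only contains days that actually appear in DT. If a row is wanted for every day up to the maximum, res can be joined onto a complete day sequence, treating the missing days as FALSE:
# data.table is already loaded above
all_days <- data.table(day = seq_len(max(DT$day)))
out <- res[all_days, on = "day"]                       # keep every day
out[is.na(interaction_present), interaction_present := FALSE][]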
I have a data frame with coordinates ("start","end") and labels ("group"):
a <- data.frame(start=1:4, end=3:6, group=c("A","B","C","D"))
a
start end group
1 1 3 A
2 2 4 B
3 3 5 C
4 4 6 D
I want to create a new data frame in which labels are assigned to every element of the sequence spanned by the range of coordinates:
V1 V2
1 1 A
2 2 A
3 3 A
4 2 B
5 3 B
6 4 B
7 3 C
8 4 C
9 5 C
10 4 D
11 5 D
12 6 D
The following code works but it is extremely slow with wide ranges:
df <- data.frame()
for(i in 1:dim(a)[1]){
  s <- seq(a[i, 1], a[i, 2])
  df <- rbind(df, data.frame(s, rep(a[i, 3], length(s))))
}
colnames(df) <- c("V1", "V2")
How can I speed this up?
You can try data.table
library(data.table)
setDT(a)[, start:end, by = group]
which gives
group V1
1: A 1
2: A 2
3: A 3
4: B 2
5: B 3
6: B 4
7: C 3
8: C 4
9: C 5
10: D 4
11: D 5
12: D 6
Obviously this would only work if you have one row per group, which it seems you have here.
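If group labels can repeat, a possible variant (a sketch, not tested on such data) is to group by row number instead and then drop the helper column:
setDT(a)[, .(group, V1 = start:end), by = .(rn = seq_along(start))][, rn := NULL][]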
If you want a very fast solution in base R, you can manually create the data.frame in two steps:
Use mapply to create a list of your ranges from "start" to "end".
Use rep + lengths to repeat the "group" column to the expected number of rows.
The base R approach shared here won't depend on having only one row per group.
Try:
temp <- mapply(":", a[["start"]], a[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(a[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
If you're doing this a lot, just put it in a function:
myFun <- function(indf) {
temp <- mapply(":", indf[["start"]], indf[["end"]], SIMPLIFY = FALSE)
data.frame(group = rep(indf[["group"]], lengths(temp)),
values = unlist(temp, use.names = FALSE))
}
Then, if you want some sample data to try it with, you can use the following:
set.seed(1)
a <- data.frame(start=1:4, end=sample(5:10, 4, TRUE), group=c("A","B","C","D"))
x <- do.call(rbind, replicate(1000, a, FALSE))
y <- do.call(rbind, replicate(100, x, FALSE))
Note that this does seem to slow down as the number of different unique values in "group" increases.
(In other words, the "data.table" approach will make the most sense in general. I'm just sharing a possible base R alternative that should be considerably faster than your existing approach.)