My question is essentially a generalisation of this SO post but with a rolling component.
I have a dataset of people, jobs and the dates on which they held said jobs (whilst the specifics aren't important, they make the prose easier). Each person can hold multiple jobs on a given date, or they can have no job, which shows up as a missing person-date-job row.
I want to create a summarised table with only one row for each person-date combination, which requires a list-column holding the IDs of the jobs held by that person on that date. I have managed to do this for contemporaneous person-job-dates following the SO post linked above. The complicating factor is that I want to look backwards by 3 periods, i.e. the job_id list-column for date t needs to contain all the jobs held by a person at dates t, t-1 and t-2.
Below is some code to produce a toy input table and the desired output.
library(data.table)
# Input data
data <- data.table(
  ind_id = c(rep(1, 3), rep(2, 4), rep(3, 2), rep(4, 5)),
  date = c(1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 1, 2, 2, 3),
  job_id = c("A", "A", "A", "B", "B", "C", "B", "D", "E", "F", "G", "F", "G", "G")
)
# Desired output
output <- data.table(
  ind_id = c(rep(1, 3), rep(2, 3), rep(3, 3), rep(4, 3)),
  date = rep(1:3, 4),
  job_id = list("A", "A", "A", "B", c("B", "C"), c("B", "C"), "D", c("D"), c("D", "E"), c("F", "G"), c("F", "G"), c("F", "G"))
)
And here is the code that works to make a table of contemporaneous person-job-date rows.
data_contemp <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
Something I tried was frollapply, but unfortunately it doesn't work when the output is not numeric:
data[, all_jobs := frollapply(job_id, 3, list), by = ind_id]
Appreciate everyone's help on this!
EDIT: I should add that a data.table solution is highly preferred because the actual dataset is 607 million rows, data.table is faster and more memory efficient, and the syntax is better.
EDIT 2: Added some code to generate an arbitrarily large input table.
n <- 600e6
n <- round(n / 15)
t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
data <- rbind(t1, t2, t3, t4, t5)
data <- data[rep(seq_len(nrow(data)), n)]
data[, ind_id := rleid(ind_id)]
You could use self non-equijoins:
data[, start := date - 2]
data[data, .(ind_id, date = x.date, job_id = i.job_id), on = .(ind_id, start <= date, date >= date)][
  , .(job_id = list(unique(job_id))), by = .(ind_id, date)]
ind_id date job_id
<num> <num> <list>
1: 1 1 A
2: 1 2 A
3: 1 3 A
4: 2 1 B
5: 2 2 B,C
6: 2 3 B,C
7: 3 1 D
8: 3 3 D,E
9: 4 1 F,G
10: 4 2 F,G
11: 4 3 F,G
One slight difference compared to your expected output: date = 2 isn't present for ind_id = 3 because it isn't present in the initial data.
ind_id date job_id
<num> <int> <list>
8: 3 2 D
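If the missing person-date rows are needed, one variation (a sketch reusing the toy data from the question; `targets` and `full` are names I've introduced) is to build the full person-date grid with CJ() and use it as the i table of the non-equi join, so person-dates with an empty window come back as empty list elements rather than being dropped:

```r
library(data.table)

# Toy input from the question
data <- data.table(
  ind_id = c(rep(1, 3), rep(2, 4), rep(3, 2), rep(4, 5)),
  date = c(1, 2, 3, 1, 2, 2, 3, 1, 3, 1, 1, 2, 2, 3),
  job_id = c("A", "A", "A", "B", "B", "C", "B", "D", "E", "F", "G", "F", "G", "G")
)

# Every person-date combination, each with a 3-period look-back window
targets <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
targets[, start := date - 2]

# Right join onto the grid: all targets are kept; job rows whose date falls
# inside [start, date] are pulled in, and unmatched targets get NA job_id,
# which the aggregation turns into an empty character vector
full <- data[targets, .(ind_id, date = i.date, job_id = x.job_id),
             on = .(ind_id, date >= start, date <= date),
             allow.cartesian = TRUE][
  , .(job_id = list(sort(unique(job_id[!is.na(job_id)])))), by = .(ind_id, date)]
```

With the toy data this also fills ind_id = 3, date = 2 with "D", matching the expected output in the question.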
Thanks Waldi for your solution. I actually managed to figure out my own solution with a combination of helper columns and mapply, so I've included my method as an answer and benchmarked both. Waldi's non-equi-join solution is about 20 per cent faster than my mapply method but uses about 40 per cent more memory, and this differential appears to remain constant as the number of rows scales.
Given memory is cheap and time is not, Waldi's solution works best here.
Thanks everyone for contributing!
library(data.table)
library(collapse)
## Input data
# Create three types of people with different employment histories:
# Type 1: same job over time
# Type 2: changes to a new job in t2
# Type 3: picks up a new job in t2
# Type 4: employed in t1, unemployed in t2, employed in t3
# Type 5: loses a second job in t2
make_data <- function(n) {
  n <- round(n / 15)
  t1 <- data.table(ind_id = rep(1, 3), date = 1:3, job_id = rep("A", 3))
  t2 <- data.table(ind_id = rep(2, 3), date = 1:3, job_id = c("A", "B", "B"))
  t3 <- data.table(ind_id = rep(3, 5), date = c(1, 2, 2, 3, 3), job_id = c("A", "A", "B", "A", "B"))
  t4 <- data.table(ind_id = rep(4, 2), date = c(1, 3), job_id = c("A", "B"))
  t5 <- data.table(ind_id = rep(5, 4), date = c(1, 1, 2, 3), job_id = c("A", "B", "A", "A"))
  data <- rbind(t1, t2, t3, t4, t5)
  data <- data[rep(seq_len(nrow(data)), n)]
  data[, ind_id := rleid(ind_id)]
  data <- data[, .(job_id = list(job_id)), by = .(date, ind_id)]
  # Add back missing person-date rows to create a balanced panel
  date_person_rows <- CJ(ind_id = unique(data$ind_id), date = unique(data$date))
  data <- date_person_rows[data, job_id := i.job_id, on = .(date, ind_id)]
  return(data)
}
method_1 <- function(data) {
  data[, paste0("jobs_", 0:2) := shift(.(job_id), 0:2), by = ind_id]
  # SIMPLIFY = FALSE keeps job_id a list-column even when every window
  # happens to contain exactly one job
  data[, job_id := mapply(jobs_0, jobs_1, jobs_2, SIMPLIFY = FALSE,
                          FUN = function(a, b, c) sort(na_rm(unique(c(a, b, c)))))]
  data[, c("jobs_0", "jobs_1", "jobs_2") := NULL]
  setkey(data, NULL) # For some reason this dt has a key set but the method 2 one doesn't
  return(data)
}
method_2 <- function(data) {
  data[, start := date - 2]
  data <- data[data, .(ind_id, date = x.date, job_id = i.job_id),
               on = .(ind_id, start <= date, date >= date)][
    , .(job_id = list(unique(job_id))), by = .(ind_id, date)]
  data[, job_id := lapply(job_id, function(x) unique(unlist(x)))]
}
# Benchmark
bench::mark(
method_1(make_data(10e4)),
method_2(make_data(10e4)),
iterations = 1L
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 method_1(make_data(1e+05)) 2.7s 2.7s 0.370 27.4MB 17.8
#> 2 method_2(make_data(1e+05)) 2.08s 2.08s 0.481 43MB 14.9
Created on 2022-10-11 with reprex v2.0.2
I'd like to be able to merge two categories in a categorical raster. The only solution I've figured out so far uses the level index number, not the name of the category. How could I do this using the name of the category?
library(terra)
m <- matrix(rep(c("a", "b", "c"), each = 3), nrow = 3, ncol = 3)
x <- rast(m)
x[x$lyr.1 == "c"]
m2 <- matrix(c("a", "a", "b", "b", "c", "b"), nrow = 3, ncol = 2, byrow = TRUE)
test <- classify(x, m2)     # doesn't work with category names
test <- subst(x, "c", "b")  # doesn't work with category names
test <- subst(x, 2, 1)      # works with the category index
Example data
library(terra)
m <- matrix(rep(c("a", "b", "c"), each = 3), nrow = 3, ncol = 3)
x <- rast(m)
m2 <- matrix(c("a", "a", "b", "b", "c", "b"), nrow = 3, ncol = 2, byrow = TRUE)
With the current version of terra you can do either
test1 <- subst(x, "c", "b")
or
test2 <- subst(x, 2, 1, raw=TRUE)
library(terra)
library(tidyverse)
m <- matrix(rep(c("a", "b", "c"), each = 3), nrow = 3, ncol = 3)
x <- rast(m)
plot(x)
reclassified <- cats(x)[[1]] %>%
  mutate(label_reclass = forcats::fct_collapse(cats(x)[[1]]$label, c = "b"))
x <- categories(x, layer = 1, value = reclassified, active = 2)
plot(x)
levels(x)
[[1]]
value label_reclass
1 1 a
2 2 c
3 3 c
Let's assume that we have following toy data:
library(tidyverse)
data <- tibble(
  subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
  id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)
which represent network relationships for each subject. For example, there are three unique subjects in the data and the network for the first subject could be represented as sequence of relations:
a -- b, a -- c, b -- c
The task is to compute centralities for each network. Using for loop this is straightforward:
library(igraph)
# Get unique subjects
subjects_uniq <- unique(data$subject)
# Compute centrality of nodes for each graph
centrality <- vector("list", length(subjects_uniq))
for (i in seq_along(subjects_uniq)) {
  current_data <- data %>% filter(subject == subjects_uniq[i]) %>% select(-subject)
  current_graph <- current_data %>% graph_from_data_frame(directed = FALSE)
  centrality[[i]] <- eigen_centrality(current_graph)$vector
}
Question: My dataset is huge, so I wonder how to avoid the explicit for loop. Should I use apply() or one of its modern cousins (maybe map() from the purrr package)? Any suggestions are greatly appreciated.
Here is an option using map
library(tidyverse)
library(igraph)
map(subjects_uniq, ~data %>%
filter(subject == .x) %>%
select(-subject) %>%
graph_from_data_frame(directed = FALSE) %>%
{eigen_centrality(.)$vector})
#[[1]]
#a b c
#1 1 1
#[[2]]
# a b c d
#1.0000000 0.8546377 0.8546377 0.4608111
#[[3]]
#a b c
#1 1 1
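Since the dataset is huge, note that filtering inside map() rescans the whole table once per subject. Splitting the edge columns once and then iterating over the pieces touches the data only a single time. A sketch of that idea (`res` is a name I've introduced; I use a base data.frame here to keep the example dependency-free, but a tibble works the same way):

```r
library(igraph)

# Toy data from the question
data <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
  id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)

# Split the edge list by subject in a single pass, then build one graph
# per piece and compute its eigenvector centralities
res <- lapply(split(data[c("id1", "id2")], data$subject), function(edges) {
  eigen_centrality(graph_from_data_frame(edges, directed = FALSE))$vector
})
```

The result is a named list keyed by subject, with the same centrality vectors as the map() answer above.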
suppose I have something like this:
dat <- data.frame(ID = c("A", "B", "C"),
                  value = c(1, 2, 3))
I would like to add an extra column in which I have a value like this:
[A,1]
It would be a column in which each value is the concatenation of "[" + the value in the first column (e.g. A) + "," + the value in the second column (e.g. 1) + "]".
How can I do it? I tried with paste but I am doing something wrong.
Here's an approach that works consistently for any number of columns:
dat$conc <- paste0("[", apply(dat, 1, paste, collapse = ","), "]")
Using your example:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
dat$conc <- paste0("[", apply(dat, 1, paste, collapse = ","), "]")
Gives:
ID value conc
1 A 1 [A,1]
2 B 2 [B,2]
3 C 3 [C,3]
Or if we have a dataframe with more columns:
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3), value2 = c(4, 5, 6))
dat$conc <- paste0("[", apply(dat, 1, paste, collapse = ","), "]")
Gives:
ID value value2 conc
1 A 1 4 [A,1,4]
2 B 2 5 [B,2,5]
3 C 3 6 [C,3,6]
Assuming this is your data
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))
This would work
dat$concat <- paste0("[", dat$ID, ", ", dat$value, "]")
ID value concat
1 A 1 [A, 1]
2 B 2 [B, 2]
3 C 3 [C, 3]
Or if you did not want the space after the comma:
dat$concat <- paste0("[", dat$ID, ",", dat$value, "]")
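Another vectorised option is sprintf, which keeps the whole bracket template in a single format string (reusing the example data and the `conc` column name from above):

```r
# Example data from the question
dat <- data.frame(ID = c("A", "B", "C"), value = c(1, 2, 3))

# %s coerces each argument to character, so numerics work too
dat$conc <- sprintf("[%s,%s]", dat$ID, dat$value)
```

This gives the same `[A,1]`-style strings as the paste0 answers.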
bar is the requested output, with the colnames as desired. Is there any way to do this in the one-liner construction of foo, without using a second statement, i.e. renaming rn and V1 to c and d respectively during construction?
Please note that the list("s1" = 1, "s2" = 2) has to stay as it is.
bar is the desired output:
a b c d
1: abc bcd s1 1
2: abc bcd s2 2
foo, which should mimic bar:
a b rn V1
1: abc bcd s1 1
2: abc bcd s2 2
MWE script:
library(data.table)
bar <- data.table(a = "abc", b = "bcd", c = c("s1", "s2"), d = 1:2)
print("bar:")
print(bar)
foo <- data.table(a = "abc", b = "bcd",
data.matrix(list("s1" = 1,
"s2" = 2)), keep.rownames = T)
# colnames(foo) <- c("a", "b", "c", "d") # without using a second statement like this
print("foo:")
print(foo)
PS: A workaround I did was to define a reformat function as follows for example
reformat <- function(dt) {
colnames(dt) <- c("a", "b", "c", "d")
return(dt)
}
foo <- reformat(data.table(a = "abc", b = "bcd",
data.matrix(list("s1" = 1,
"s2" = 2)), keep.rownames = T))
print(foo)
but I am wondering if there is any way to do it without needing the function.
Am I missing something?
setnames(foo, old = c("rn", "V1"), new = c("c", "d"))
data.table already has a function to rename columns without copying the data.table.
Are you looking for something like this?
setnames(foo <- data.table(a = "abc", b = "bcd",
data.matrix(list("s1" = 1, "s2" = 2)),
keep.rownames = TRUE),
c("a", "b", "c", "d"))
print(foo)
a b c d
1: abc bcd s1 1
2: abc bcd s2 2
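For completeness, since setnames() returns the data.table, the rename can also be folded into the construction in a single statement using old/new, touching only the two generated names and leaving a and b alone. A sketch:

```r
library(data.table)

# Build and rename in one statement; only rn and V1 are renamed
foo <- setnames(data.table(a = "abc", b = "bcd",
                           data.matrix(list("s1" = 1, "s2" = 2)),
                           keep.rownames = TRUE),
                old = c("rn", "V1"), new = c("c", "d"))
```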