R purrr row-wise lookups from two lists

Here’s a simplified version of a problem that involves larger, more complex inputs. First, I create data:
library(tibble)
input <- tibble(
  person = rep(101:103, each = 12),
  item = rep(1:12, 3),
  response = sample(1:4, 36, replace = TRUE)
)
These data are responses from three persons on a 12-item test. input is a multilevel table in which the test items are nested within each person. The columns of input are:
person: ID numbers for persons 101, 102, and 103 (12 rows for each person)
item: test items 1-12 for each person. Note how the items are nested within each person
response: score for each item
The test is divided into four subscales consisting of three items each.
scale_assign <- list(1:3, 4:6, 7:9, 10:12)
scale_num <- 1:4
scale_assign is a four-element list containing four item sets (expressed as four numeric ranges): items 1-3 (subscale 1), items 4-6 (subscale 2), items 7-9 (subscale 3), and items 10-12 (subscale 4). scale_num is a four-element numeric vector containing the numbers (1-4) that label the four subscales.
What I want R to do is process input row-wise, creating a new column scale, and filling it with the correct value of scale_num for each item (that is, each item's subscale assignment). In each row, R needs to check the value of item against the ranges in scale_assign and fill in scale with the value of scale_num that corresponds to the scale_assign range for that item.
The desired output looks like this:
# A tibble: 36 x 4
# person item response scale
# 1 101 1 4 1
# 2 101 2 2 1
# 3 101 3 4 1
# 4 101 4 4 2
# 5 101 5 4 2
# 6 101 6 4 2
# 7 101 7 3 3
# 8 101 8 2 3
# 9 101 9 4 3
# 10 101 10 1 4
# 11 101 11 1 4
# 12 101 12 4 4
# 13 102 1 1 1
# 14 102 2 3 1
# 15 102 3 1 1
# 16 102 4 1 2
# 17 102 5 3 2
# 18 102 6 3 2
# 19 102 7 4 3
# 20 102 8 1 3
# 21 102 9 3 3
# 22 102 10 4 4
# 23 102 11 3 4
# 24 102 12 3 4
# 25 103 1 4 1
# 26 103 2 1 1
# 27 103 3 2 1
# 28 103 4 2 2
# 29 103 5 4 2
# 30 103 6 1 2
# 31 103 7 4 3
# 32 103 8 4 3
# 33 103 9 1 3
# 34 103 10 4 4
# 35 103 11 1 4
# 36 103 12 2 4
Preferring a tidyverse solution, I thought this might be a job for purrr::map2(), because it seems to involve simultaneous iteration over the four-element list scale_assign and the four-element vector scale_num. I tried to implement the scale coding within a map2() call, using mutate() and case_when(), but could not get it to work.
Thanks in advance for any help!

Instead of performing this operation row-wise and checking every value, it is easier to perform a join: convert scale_assign to a named list, turn it into a data frame, and right_join it with the input data frame.
scale_assign <- list(1:3, 4:6, 7:9, 10:12)
names(scale_assign) <- 1:4

library(tidyverse)
enframe(scale_assign) %>%
  unnest(cols = value) %>%
  mutate_all(as.integer) %>%
  right_join(input, by = c("value" = "item"))
# A tibble: 36 x 4
# name value person response
# <int> <int> <int> <int>
# 1 1 1 101 4
# 2 1 2 101 4
# 3 1 3 101 2
# 4 2 4 101 2
# 5 2 5 101 1
# 6 2 6 101 4
# 7 3 7 101 3
# 8 3 8 101 1
# 9 3 9 101 1
#10 4 10 101 2
# … with 26 more rows
In base R, the same can be done with stack() and merge(). Note that merge() sorts the result by the join column, so the original person/item row order is not preserved:
merge(input, stack(scale_assign), all.x = TRUE, by.x = "item", by.y = "values")
data
set.seed(1234)
input <- tibble(
  person = rep(101:103, each = 12),
  item = rep(1:12, 3),
  response = sample(1:4, 36, replace = TRUE))
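For completeness, the map2() route the asker originally had in mind can also work if it is used to build a named lookup vector first, rather than to fill the column row-wise. This is only a sketch, not part of either answer above:
library(tidyverse)

# Pair each item range with its subscale number, giving four named
# vectors (names = items, values = subscale), then concatenate them
lookup <- map2(scale_assign, scale_num,
               ~ set_names(rep(.y, length(.x)), .x)) %>%
  reduce(c)

input %>% mutate(scale = unname(lookup[as.character(item)]))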

Here is a data.table solution, using an update-join.
Basically this is @Ronak Shah's base-R answer, but using the data.table package (i.e. fast performance on large data sets).
library(data.table)
# 1. convert input to a data.table
# 2. create a lookup table using stack(scale_assign),
#    and make that a data.table as well (using setDT())
# 3. left update join on item; the i. prefix refers to columns of the
#    lookup table (stack() returns the columns values and ind)
setDT(input)[setDT(stack(scale_assign)),
             scale := i.ind,
             on = .(item = values)][]
output
# person item response scale
# 1: 101 1 3 1
# 2: 101 2 4 1
# 3: 101 3 3 1
# 4: 101 4 2 2
# 5: 101 5 3 2
# 6: 101 6 4 2
# 7: 101 7 1 3
# 8: 101 8 3 3
# 9: 101 9 4 3
# 10: 101 10 2 4
# 11: 101 11 3 4
# 12: 101 12 4 4
# 13: 102 1 4 1
# 14: 102 2 2 1
# 15: 102 3 3 1
# 16: 102 4 2 2
# 17: 102 5 1 2
# 18: 102 6 4 2
# 19: 102 7 1 3
# 20: 102 8 3 3
# 21: 102 9 2 3
# 22: 102 10 1 4
# 23: 102 11 4 4
# 24: 102 12 3 4
# 25: 103 1 1 1
# 26: 103 2 1 1
# 27: 103 3 2 1
# 28: 103 4 1 2
# 29: 103 5 2 2
# 30: 103 6 4 2
# 31: 103 7 4 3
# 32: 103 8 2 3
# 33: 103 9 3 3
# 34: 103 10 2 4
# 35: 103 11 2 4
# 36: 103 12 2 4
# person item response scale
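One caveat, not mentioned in the answer: stack() returns the ind column as a factor, so the new scale column is a factor too. If an integer scale is preferred, the assignment can convert it on the fly (a sketch):
setDT(input)[setDT(stack(scale_assign)),
             # as.character() first, so the label (not the level code) is used
             scale := as.integer(as.character(i.ind)),
             on = .(item = values)][]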

Related

How to replace single occurrences with the previous status

I have a data table like below :
table <- data.table(x = 1:15, y = c(1,1,1,3,1,1,2,1,2,2,3,3,3,3,3), z = (1:15)*3)
I have to clean this data table where there are single occurrences, like the 3 in between the 1s and the 1 in between the 2s. It doesn't have to be a 3: any value that occurs only once should be replaced by the previous value.
table <- data.table(x = 1:15, y = c(1,1,1,1,1,1,2,2,2,2,3,3,3,3,3), z = (1:15)*3)
This is the expected data table.
Any help is appreciated.
Here's one way :
library(data.table)
#Count number of rows for each group
table[, N := .N, rleid(y)]
#Change `y` value which have only one row
table[, y := replace(y, N ==1, NA)]
#Replace NA with last non-NA value
table[, y := zoo::na.locf(y)][, N := NULL]
table
# x y z
# 1: 1 1 3
# 2: 2 1 6
# 3: 3 1 9
# 4: 4 1 12
# 5: 5 1 15
# 6: 6 1 18
# 7: 7 2 21
# 8: 8 2 24
# 9: 9 2 27
#10: 10 2 30
#11: 11 3 33
#12: 12 3 36
#13: 13 3 39
#14: 14 3 42
#15: 15 3 45
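If you'd rather avoid the zoo dependency, data.table's own nafill() can do the last-observation-carried-forward step instead (a sketch; nafill() needs a reasonably recent data.table, 1.12.4 or later):
library(data.table)

# Count the rows in each run of consecutive identical y values
table[, N := .N, by = rleid(y)]
# Blank out the single-row runs and carry the previous value forward
table[, y := nafill(replace(y, N == 1, NA), type = "locf")][, N := NULL]
table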
Here is a base R option
# Positions where y differs from both neighbours in opposite directions,
# i.e. isolated single values (the padded backward and forward differences
# have a negative product only at such spikes, and never at the first row)
inds <- which(diff(c(head(table$y, 1), table$y)) * diff(c(table$y, tail(table$y, 1))) < 0)
table$y <- replace(table$y, inds, table$y[inds - 1])
such that
> table
x y z
1: 1 1 3
2: 2 1 6
3: 3 1 9
4: 4 1 12
5: 5 1 15
6: 6 1 18
7: 7 2 21
8: 8 2 24
9: 9 2 27
10: 10 2 30
11: 11 3 33
12: 12 3 36
13: 13 3 39
14: 14 3 42
15: 15 3 45

How to create a conditionally increasing sequence within a group?

I have a dataframe like the following:
df <- data.frame("id" = c(111,111,111,111,222,222,222,222,222,333,333,333),
                 "Encounter" = c(1,2,3,4,1,2,3,4,5,1,2,3),
                 "Level" = c(1,1,2,3,3,4,1,2,3,3,4,4),
                 "Gap_Days" = c(NA,3,2,15,NA,1,18,3,2,NA,77,1))
df
id Encounter Level Gap_Days
1 111 1 1 NA
2 111 2 1 3
3 111 3 2 2
4 111 4 3 15
5 222 1 3 NA
6 222 2 4 1
7 222 3 1 18
8 222 4 2 3
9 222 5 3 2
10 333 1 3 NA
11 333 2 4 77
12 333 3 4 1
Where Level is a numeric signaling the type of encounter, and Gap_Days is the number of days since the previous encounter (thus NA for the first encounter in each id group).
I'm looking to create a variable, "Session", that starts at 1 for the first Encounter within an id group and increases sequentially whenever a Level fails to increase from the previous encounter, or whenever more than 3 days pass between encounters. Basically, a new "Session" begins each time either condition occurs for an Encounter. I'd like to do this within each group, ideally resulting in something like:
df2 <- data.frame("id" = c(111,111,111,111,222,222,222,222,222,333,333,333),
                  "Encounter" = c(1,2,3,4,1,2,3,4,5,1,2,3),
                  "Level" = c(1,1,2,3,3,4,1,2,3,3,4,4),
                  "Gap_Days" = c(NA,3,2,15,NA,1,18,3,2,NA,77,1),
                  "Session" = c(1,2,2,3,1,1,2,2,2,1,2,3))
"Session" = c(1,2,2,3,1,1,2,2,2,1,2,3))
df2
id Encounter Level Gap_Days Session
1 111 1 1 NA 1
2 111 2 1 3 2
3 111 3 2 2 2
4 111 4 3 15 3
5 222 1 3 NA 1
6 222 2 4 1 1
7 222 3 1 18 2
8 222 4 2 3 2
9 222 5 3 2 2
10 333 1 3 NA 1
11 333 2 4 77 2
12 333 3 4 1 3
In the actual data there are no strict limits to the number of Encounters or Sessions within each group. The first encounter can begin at any Level, and the Level need not increase by exactly 1; e.g., if the Level increased from 1 to 4 between encounters, that could still be considered the same Session.
I'd prefer a dplyr solution, but am open to any ideas to help accomplish this!
You can do the following
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(Session = cumsum(c(TRUE, diff(Level) == 0) | Gap_Days > 3))
## A tibble: 12 x 5
## Groups: id [3]
# id Encounter Level Gap_Days Session
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 111 1 1 NA 1
# 2 111 2 1 3 2
# 3 111 3 2 2 2
# 4 111 4 3 15 3
# 5 222 1 3 NA 1
# 6 222 2 4 1 1
# 7 222 3 1 18 2
# 8 222 4 2 3 2
# 9 222 5 3 2 2
#10 333 1 3 NA 1
#11 333 2 4 77 2
#12 333 3 4 1 3
You probably want to ungroup afterwards. Note also that diff(Level) == 0 only catches a Level that stays the same; if a Level can also drop between encounters that are 3 or fewer days apart, diff(Level) <= 0 matches the stated rule more closely (on this example the output is identical either way).
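Spelled out with lag(), the same computation looks like this (a sketch, using the <= form):
library(dplyr)

df %>%
  group_by(id) %>%
  mutate(Session = cumsum(
    row_number() == 1 |      # the first encounter in an id opens session 1
      Level <= lag(Level) |  # Level failed to increase from the previous row
      Gap_Days > 3           # more than 3 days since the previous encounter
  )) %>%
  ungroup()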

applying cumsum() from different starting points

I have data
library(data.table)
set.seed(42)
t <- data.table(time=1:1000, value=runif(100,0,1))
p <- data.table(id=1:10, cut=sample(1:100,5))
vals <- 1:5
> head(t)
time value
1: 1 0.9148060
2: 2 0.9370754
3: 3 0.2861395
4: 4 0.8304476
5: 5 0.6417455
6: 6 0.5190959
> head(p)
id cut
1: 1 63
2: 2 22
3: 3 99
4: 4 38
5: 5 91
6: 6 63
> vals
[1] 1 2 3 4 5
where t gives some vector of values associated with time points, and p gives for each person a cutoff in time.
I would like to get for each person the time units it takes to accumulate each of the values in vals.
My approach now is to use a for-loop that computes for each person a temporary vector of cumulative sums, starting at its specific cutoff in time. Next, I use findInterval() to obtain the positions at which cumsum reaches each of the levels in vals.
out <- matrix(NA, nrow = nrow(p), ncol = length(vals)); colnames(out) <- vals
for (i in 1:nrow(p)) {
  temp <- cumsum(t$value[t$time > p$cut[i]]); temp <- temp[!is.na(temp)]
  out[i, ] <- findInterval(vals, temp)
}
which should yield
1 2 3 4 5
[1,] 1 4 5 9 12
[2,] 1 2 5 6 7
[3,] 1 2 4 5 7
[4,] 1 3 5 7 8
[5,] 2 3 5 7 8
[6,] 1 4 5 9 12
[7,] 1 2 5 6 7
[8,] 1 2 4 5 7
[9,] 1 3 5 7 8
[10,] 2 3 5 7 8
This is of course heavily inefficient and doesn't do justice to the powers of R. Is there a way of speeding this up?
I'd do
# precompute cumsum on full table
t[, cs := cumsum(value)]
# compute one time per unique cut value, not per id
cuts = unique(p[, .(t_cut = cut)])
# look up value at cut time
cuts[t, on=.(t_cut = time), v_cut := i.cs]
# look up the time at every cut/value combo; roll=TRUE picks the last
# row of t whose cumsum is <= the target, like findInterval()
cutres = cuts[, .(pt = vals + v_cut), by=t_cut][, .(
  t_cut,
  v = vals,
  t_plus = t[.SD, on=.(cs = pt), roll=TRUE, x.time] - t_cut
)]
which gives
t_cut v t_plus
1: 63 1 1
2: 63 2 4
3: 63 3 5
4: 63 4 9
5: 63 5 12
6: 22 1 1
7: 22 2 2
8: 22 3 5
9: 22 4 6
10: 22 5 7
11: 99 1 1
12: 99 2 2
13: 99 3 4
14: 99 4 5
15: 99 5 7
16: 38 1 1
17: 38 2 3
18: 38 3 5
19: 38 4 7
20: 38 5 8
21: 91 1 2
22: 91 2 3
23: 91 3 5
24: 91 4 7
25: 91 5 8
t_cut v t_plus
If you want to map this back to id and get it in an id x vals table...
cutres[p, on=.(t_cut = cut), allow.cartesian=TRUE,
dcast(.SD, id ~ v, value.var = "t_plus")]
id 1 2 3 4 5
1: 1 1 4 5 9 12
2: 2 1 2 5 6 7
3: 3 1 2 4 5 7
4: 4 1 3 5 7 8
5: 5 2 3 5 7 8
6: 6 1 4 5 9 12
7: 7 1 2 5 6 7
8: 8 1 2 4 5 7
9: 9 1 3 5 7 8
10: 10 2 3 5 7 8
(Alternatively, the key part can be done as t_plus = t[.SD, on=.(cs = pt), roll=TRUE, which=TRUE] - t_cut, since t$time is just the row number.)
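If you want to keep the shape of the original loop, note that one cumulative sum over the full value vector can be shared by all persons: the cutoff only shifts the starting total, so findInterval() can search the global cumsum directly (a sketch; base::t() here is the transpose function, despite the data also being named t):
# Running total over all time points, computed once
cs <- cumsum(t$value)
# For cutoff ct, the post-cutoff cumsum is cs - cs[ct], so the position
# where it reaches val is findInterval(val + cs[ct], cs) - ct
out <- t(vapply(p$cut,
                function(ct) findInterval(vals + cs[ct], cs) - ct,
                integer(length(vals))))
colnames(out) <- vals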

Merge 2 dataframes based on condition in R

I have the following 2 data frames that I want to merge:
x <- data.frame(a = 1:11, b = 3:13, c = 2:12, d = 7:17, invoice = 1:11)
x =
a b c d invoice
1 3 2 7 1
2 4 3 8 2
3 5 4 9 3
4 6 5 10 4
5 7 6 11 5
6 8 7 12 6
7 9 8 13 7
8 10 9 14 8
9 11 10 15 9
10 12 11 16 10
11 13 12 17 11
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
> y
nr invoice
100 1
101 1
102 1
103 1
104 1
105 1
106 1
107 1
108 1
109 1
110 1
111 2
112 2
113 2
114 2
115 2
116 2
117 2
I want to merge the letter columns from data frame x into data frame y where the invoice number is the same. Within an invoice, the merged values should cycle through column a, then b, etc. This continues until the invoice number changes, at which point the values from invoice nr 2 are used.
The output should look like this:
> output
nr invoice letter_count
100 1 1
101 1 3
102 1 2
103 1 7
104 1 1
105 1 3
106 1 2
107 1 7
108 1 1
109 1 2
110 1 7
111 2 2
112 2 4
113 2 3
114 2 8
115 2 2
116 2 4
I tried to use the merge function with the by argument, but this produced an error because the number of rows is not the same. Any help is appreciated.
Here is a solution using the purrr package.
# Prepare the data frames
x <- data.frame(a = 1:11, b = 3:13, c = 2:12, d = 7:17, invoice = 1:11)
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
# Load package
library(purrr)
# Split the data based on invoice
y_list <- split(y, f = y$invoice)
# Design a function to transfer data
trans_fun <- function(main_df, letter_df = x){
  # Get the invoice number
  temp_num <- unique(main_df$invoice)
  # Extract letter_count information from x
  add_vec <- unlist(letter_df[letter_df$invoice == temp_num, 1:4])
  # Get the remainder of nrow(main_df) and length(add_vec)
  remain_num <- nrow(main_df) %% length(add_vec)
  # Get the multiple of nrow(main_df) and length(add_vec)
  multiple_num <- nrow(main_df) %/% length(add_vec)
  # Create the entire sequence to add
  add_seq <- rep(add_vec, multiple_num + 1)
  add_seq2 <- add_seq[1:(length(add_seq) - (length(add_vec) - remain_num))]
  # Add new column, add_seq2, to main_df
  main_df$letter_count <- add_seq2
  return(main_df)
}
# Apply the trans_fun function using map_df
output <- map_df(y_list, .f = trans_fun)
# See the result
output
nr invoice letter_count
1 100 1 1
2 101 1 3
3 102 1 2
4 103 1 7
5 104 1 1
6 105 1 3
7 106 1 2
8 107 1 7
9 108 1 1
10 109 1 3
11 110 1 2
12 111 2 2
13 112 2 4
14 113 2 3
15 114 2 8
16 115 2 2
17 116 2 4
18 117 2 3
19 118 2 8
20 119 2 2
21 120 2 4
22 121 2 3
23 122 2 8
24 123 2 2
25 124 2 4
26 125 2 3
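The same cyclic fill can also be written more compactly with rep_len() inside a grouped mutate (a sketch of an alternative, not the answer above):
library(dplyr)

y %>%
  group_by(invoice) %>%
  mutate(letter_count = rep_len(
    # the a:d values of the matching invoice row, recycled down the group
    unlist(x[x$invoice == first(invoice), c("a", "b", "c", "d")]),
    n()
  )) %>%
  ungroup()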

How to best join one column of a data.table with another column of the same data.table?

My data
I have a data.table DT with the current (F0YR) and the next (F1YR) fiscal year-end (FYE) encoded as integers. Since every next FYE will eventually become a current FYE, the same integer appears in both the F1YR and the F0YR column. Also, my data contains monthly observations, so the same FYE is in the data set multiple times:
library(data.table)
DT <- data.table(ID = rep(c("A", "B"), each = 9),
                 MONTH = rep(100L:108L, times = 2),
                 F0YR = rep(c(1L, 4L, 7L), each = 3, times = 2),
                 F1YR = rep(c(4L, 7L, 9L), each = 3, times = 2),
                 value = c(rep(1:5, each = 3), 6, 6, 7),
                 key = "ID,F0YR")
DT
ID MONTH F0YR F1YR value
[1,] A 100 1 4 1
[2,] A 101 1 4 1
[3,] A 102 1 4 1
[4,] A 103 4 7 2
[5,] A 104 4 7 2
[6,] A 105 4 7 2
[7,] A 106 7 9 3
[8,] A 107 7 9 3
[9,] A 108 7 9 3
[10,] B 100 1 4 4
[11,] B 101 1 4 4
...
What I want to do
For every ID and F1YR combination, I want to get the value for the ID and F0YR combination. As an example: Company A had a value of 2 for F0YR == 4. Now, I want an additional column for all rows with ID == "A" and F1YR == 4 which is set to 2, next to the already existing value of 1.
What I tried
intDT <- DT[CJ(unique(ID), unique(F0YR)), list(ID, F0YR, valueNew = value), mult="last"]
setkey(intDT, ID, F0YR)
setkey(DT, ID, F1YR)
DT <- intDT[DT]
setnames(DT, c("F0YR.1", "F0YR"), c("F0YR", "F1YR"))
DT
ID F1YR valueNew MONTH F0YR value
[1,] A 4 2 100 1 1
[2,] A 4 2 101 1 1
[3,] A 4 2 102 1 1
[4,] A 7 3 103 4 2
[5,] A 7 3 104 4 2
[6,] A 7 3 105 4 2
[7,] A 9 NA 106 7 3
[8,] A 9 NA 107 7 3
[9,] A 9 NA 108 7 3
[10,] B 4 5 100 1 4
[11,] B 4 5 101 1 4
...
(Note that I use mult="last" here because, although the values should only change when F0YR or F1YR changes, sometimes they don't, and this is just my tie breaker.)
What I want
This looks improvable. First, I have to make a copy of my DT. Second, since I basically join the same data.table with itself, all the columns have the same names and I have to rename them. I thought a self join would be the way forward, but I tried and tried and couldn't get a nice solution. I have the hope that there is something easy out there which I just don't see... Does anyone have a clue? Or is my data set up in such a way that this is actually hard (maybe because I have monthly observations, but want to join on values that only change quarterly or yearly)?
In use cases like this, the mantra "aggregate first, then join with that" often helps. So, starting with your DT, and using v1.8.1:
> agg = DT[,last(value),by=list(ID,F0YR)]
> agg
ID F0YR V1
1: A 1 1
2: A 4 2
3: A 7 3
4: B 1 4
5: B 4 5
6: B 7 7
I called it agg because I couldn't think of a better name. In this case you wanted last which isn't really an aggregate as such, but you know what I mean.
Then update DT by reference by group. Here we're grouping by i.
setkey(DT, ID, F1YR)
DT[agg, newcol := V1]
ID MONTH F0YR F1YR value newcol
1: A 100 1 4 1 2
2: A 101 1 4 1 2
3: A 102 1 4 1 2
4: A 103 4 7 2 3
5: A 104 4 7 2 3
6: A 105 4 7 2 3
7: A 106 7 9 3 NA
8: A 107 7 9 3 NA
9: A 108 7 9 3 NA
10: B 100 1 4 4 5
11: B 101 1 4 4 5
12: B 102 1 4 4 5
13: B 103 4 7 5 7
14: B 104 4 7 5 7
15: B 105 4 7 5 7
16: B 106 7 9 6 NA
17: B 107 7 9 6 NA
18: B 108 7 9 7 NA
Is that right? Not sure I fully followed. Those ops should be very fast, without any copies, and should scale to large data. At least, that's the intention.
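For reference, on current data.table versions the same update join can be written without setting keys at all, using on= (a sketch; on= was added well after the v1.8.1 this answer uses):
# Match DT's F1YR against agg's F0YR and add agg's V1 by reference
DT[agg, on = .(ID, F1YR = F0YR), newcol := i.V1]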
