Collapsing levels of a variable by another variable with certain conditions - r

I have the following example data table:
library(data.table)
exdt <- structure(list(domain = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
L1 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L, 10L,
10L, 11L, 12L, 12L, 13L, 13L, 14L, 15L, 15L, 16L, 16L, 17L, 17L,
18L, 18L, 19L, 19L, 20L, 21L, 22L, 22L, 23L, 23L, 23L, 24L, 25L,
25L, 25L, 25L, 26L, 26L, 26L),
L2 = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 7L, 7L,
7L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 11L, 11L,
11L, 12L, 12L, 12L, 13L, 13L, 13L, 13L, 14L, 14L, 14L, 14L, 14L,
14L, 14L)),
row.names = c(NA, -51L), class = c("data.table", "data.frame"))
I'd like to create a new variable L2, which is a grouping of two consecutive, unique levels of L1 within levels of domain. However, when I get to the end of a domain, I sometimes have a level of L1 that is stand-alone. In that case, I'd like to merge it with the two unique levels before it. This means that at the end of a domain, I may have merged together 3 consecutive, unique levels of L1 instead of 2 unique levels. The desired output is shown below.
exdt_L2_desired <- structure(list(domain = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
L1 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 5L, 6L, 7L, 8L, 8L, 9L, 9L, 10L,
10L, 11L, 12L, 12L, 13L, 13L, 14L, 15L, 15L, 16L, 16L, 17L, 17L,
18L, 18L, 19L, 19L, 20L, 21L, 22L, 22L, 23L, 23L, 23L, 24L, 25L,
25L, 25L, 25L, 26L, 26L, 26L),
L2 = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L,3L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 6L,
6L, 6L, 6L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 10L, 10L, 11L, 11L,
11L, 11L, 11L, 12L, 12L, 12L, 12L, 12L, 12L, 12L, 12L)), row.names = c(NA, -51L),
class = c("data.table","data.frame"))
domain L1 L2
1 1 1
1 1 1
1 2 1
1 2 1
1 3 1
1 3 1
2 4 2
2 4 2
2 5 2
2 5 2
2 5 2
2 5 2
2 6 3
2 7 3
2 8 4
2 8 4
2 9 4
2 9 4
2 10 5
2 10 5
2 11 5
2 12 6
2 12 6
2 13 6
2 13 6
2 14 7
2 15 7
2 15 7
2 16 8
2 16 8
2 17 8
2 17 8
2 18 9
2 18 9
2 19 9
2 19 9
2 20 10
2 21 10
2 22 11
2 22 11
2 23 11
2 23 11
2 23 11
2 24 12
2 25 12
2 25 12
2 25 12
2 25 12
2 26 12
2 26 12
2 26 12
You can check that this has the right grouping L2 by:
#Check
exdt_L2_desired[, .(numL1_lev = uniqueN(L1)), by = list(domain,L2)]
domain L2 numL1_lev
1: 1 1 3
2: 2 2 2
3: 2 3 2
4: 2 4 2
5: 2 5 2
6: 2 6 2
7: 2 7 2
8: 2 8 2
9: 2 9 2
10: 2 10 2
11: 2 11 2
12: 2 12 3
As you can see each level of L2 has 2 or 3 levels of L1. For domain=1, numL1_lev=3 because there were only 3 unique L1 values, which were lumped into a single group. For domain=2, only the last level of L2 had the numL1_lev=3.
Attempt
I tried the following, but I seem to still have trouble getting the stand-alone levels of L1 within a given domain:
exdt_L2 <- exdt[, L2 :=
exdt[, {x <- ceiling(L1/2) #Group 2 consecutive, unique L1 levels by domain
#If the number of unique L1 levels at the end is stand-alone, then replace with previous group
if (length(unique(L1[x==x[.N]])) == 1) x[x==x[.N]] <- x[.N]-1
x
}, domain][, rleid(domain, V1)]
]
domain L1 L2
1 1 1
1 1 1
1 2 1
1 2 1
1 3 1
1 3 1
2 4 2
2 4 2
2 5 3
2 5 3
2 5 3
2 5 3
2 6 3
2 7 4
2 8 4
2 8 4
2 9 5
2 9 5
2 10 5
2 10 5
2 11 6
2 12 6
2 12 6
2 13 7
2 13 7
2 14 7
2 15 8
2 15 8
2 16 8
2 16 8
2 17 9
2 17 9
2 18 9
2 18 9
2 19 10
2 19 10
2 20 10
2 21 11
2 22 11
2 22 11
2 23 12
2 23 12
2 23 12
2 24 12
2 25 13
2 25 13
2 25 13
2 25 13
2 26 13
2 26 13
2 26 13

Using just ceiling(L1 / 2) will not work, because it assigns e.g. L1 = 4 and L1 = 5 to different bins when they should share one L2 bin. Below is an updated version in the same spirit as OP's attempt, using ceiling(rleid(L1) / 2) instead:
library(data.table)
exdt[, L2 := {
## modify rle values
x <- ceiling(rleid(L1) / 2)
n <- length(unique(L1))
## if n is odd update last bin values
if(n > 1 && n %% 2 == 1) {
x[x == x[.N]] <- x[.N] - 1
}
x
}, by = "domain"][, L2 := rleid(domain, L2)]
all.equal(exdt, exdt_L2_desired)
#> [1] TRUE
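To see why the two binning rules differ, here is a minimal sketch on a toy vector (not the question's data) that starts at an even L1 value, as domain 2 does. The names raw_bins and run_bins are only illustrative.

```r
library(data.table)

# Toy L1 values that start at an even number, as in domain 2 (illustrative only)
L1 <- c(4L, 4L, 5L, 5L, 6L, 7L)

# Binning on the raw values puts 4 and 5 into different bins
raw_bins <- ceiling(L1 / 2)

# Binning on the run id pairs consecutive unique values wherever L1 starts
run_bins <- ceiling(rleid(L1) / 2)

print(raw_bins)  # 2 2 3 3 3 4
print(run_bins)  # 1 1 1 1 2 2
```

Because rleid() numbers runs from 1 within each by-group, the pairing no longer depends on whether the first L1 value in a domain is odd or even.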


Using sample() to sample from nested lists in R

I'm looking for a way to use sample() to sample values from different lists based on a value in another column of a data.table - at the moment I'm getting a recursive indexing failed error - code and more explanation below:
First set up example data:
library(stats)
library(data.table)
# list of three different nest survival rates
survival<-list(0.91,0.95,0.99)
# incubation period
inc.period<-28
# then set up function to use the geometric distribution to generate 3 lists of incubation outcomes based on the nest survivals and incubation period above.
# e.g. less than 28 is a nest failure, 28 is a successful nest.
create.sample <- function(survival){
outcome<-rgeom(100,1-survival)
fifelse(outcome > inc.period, inc.period, outcome)
}
# then create list of 100 nest outcomes with 3 different survival values using lapply
inc.outcomes <- lapply(survival,create.sample)
# set up a data.table - each row of data will be a nest.
index<-c(1:3)
iteration<-1:20
dt = CJ(index,iteration)
Then I want to make a new column 'inc.period' which samples from the 'inc.outcomes' list using the index column of the dt to select which of the three 'inc.outcomes' lists to sample from (with a different sample for each row of data).
So e.g. when index = 1, the sampled value comes from inc.outcomes[[1]] - which is the low nest survival list, when index = 2 I sample from inc.outcomes[[2]] etc.
The code would look something like this but this doesn't work (I get a recursive indexing failed error):
dt[,inc.period:= sample(inc.outcomes[[index]],nrow(dt),replace = TRUE)]
Any help or advice gratefully received, also suggestions for different approaches to this problem - this is for an update to code that runs in a shiny simulation so quicker options preferred!
Two problems:
inc.outcomes[[index]] is a problem since index is 60-long here, meaning you are ultimately trying inc.outcomes[[ c(1,1,...,2,2,...,3,3) ]], which is incorrect. [[-indexing takes either a length-1 index (for most uses) or a vector as long as the list is deeply nested. For example, in list(list(1,2),list(3,4))[[ c(1,2) ]] the length-2 index works because we have 2-deep nested lists. Since inc.outcomes is only 1-deep, we can only use a length-1 index inside [[.
This means we need to do this by index. (And from this, we need to change from nrow(dt) to .N, but frankly we should be using that anyway even without by=.)
dt[, inc.period := sample(inc.outcomes[[ index[1] ]], .N, replace = TRUE), by = index]
# index iteration inc.period
# <int> <int> <num>
# 1: 1 1 17
# 2: 1 2 17
# 3: 1 3 21
# 4: 1 4 24
# 5: 1 5 3
# 6: 1 6 1
# 7: 1 7 17
# 8: 1 8 0
# 9: 1 9 1
# 10: 1 10 0
# ---
# 51: 3 11 0
# 52: 3 12 0
# 53: 3 13 28
# 54: 3 14 28
# 55: 3 15 9
# 56: 3 16 28
# 57: 3 17 7
# 58: 3 18 28
# 59: 3 19 28
# 60: 3 20 28
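The [[ behaviour described above can be checked with a small base R sketch. The object names (nested, flat, picked, single, failed) are made up for illustration, and the error is only detected rather than quoted, since its exact wording may differ.

```r
# Two-level nesting: a length-2 index walks down one level per element
nested <- list(list(1, 2), list(3, 4))
picked <- nested[[c(1, 2)]]   # same as nested[[1]][[2]]
print(picked)  # 2

# One-level list (values taken from the question's survival list)
flat <- list(0.91, 0.95, 0.99)
single <- flat[[2]]           # a length-1 index is fine
print(single)  # 0.95

# A longer index tries to descend into a plain number and raises an error
failed <- inherits(try(flat[[c(1, 2, 3)]], silent = TRUE), "try-error")
print(failed)  # TRUE
```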
My data:
dt <- setDT(structure(list(index = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), iteration = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L)), row.names = c(NA, -60L), class = c("data.table", "data.frame"), sorted = c("index", "iteration")))

Create unique ID based on repeated IDs [duplicate]

I received some data from a colleague who is working with animal observations recorded in several transects. However, my colleague reused the same four ID codes to identify the transects: 1, 7, 13 and 19. I would like to replace the repeated IDs with unique IDs, as in the transect_id column of the example data.
Here's the corresponding code:
example_data<-structure(list(ID_Transect = c(1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L,
7L, 7L, 7L, 7L, 13L, 13L, 13L, 13L, 13L, 13L, 19L, 19L, 19L,
19L, 19L, 19L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 7L, 7L, 7L, 7L),
transect_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-36L))
We can also do
library(data.table)
setDT(example_data)[, transect_id := rleid(ID_Transect)]
You can use data.table rleid -
example_data$transect_id <- data.table::rleid(example_data$ID_Transect)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6
In base R we can use rle -
with(rle(example_data$ID_Transect), rep(seq_along(values), lengths))
Or diff + cumsum -
cumsum(c(TRUE, diff(example_data$ID_Transect) != 0))
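As a quick sanity check, the two base R variants can be compared on a short toy vector. This is only a sketch: ids is made-up data in the same style as ID_Transect, and via_rle and via_diff are illustrative names.

```r
# Toy IDs where the codes 1 and 7 repeat in a later run (illustrative only)
ids <- c(1L, 1L, 7L, 7L, 13L, 19L, 1L, 1L, 7L)

# Run-length encoding: one group number per run, repeated by run length
via_rle <- with(rle(ids), rep(seq_along(values), lengths))

# Start a new group whenever the value changes from the previous row
via_diff <- cumsum(c(TRUE, diff(ids) != 0))

print(via_rle)                       # 1 1 2 2 3 4 5 5 6
print(identical(via_rle, via_diff))  # TRUE
```

Both approaches number contiguous runs, so a code that reappears later gets a new ID instead of being merged with its earlier run.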

Function for creating contingency table with two categorical variables in R

I am creating a crosstab of two categorical variables with the code below:
library(dplyr)
library(reshape2)
T1.1<-table(data$Q7_1,data$Q9,exclude = NULL)
T1.1<-data.frame(T1.1)
T1.2<-dcast(T1.1, Var1~Var2)
T1.2<-T1.2%>%mutate(Industry=as.character(Var1),Total_responses=A+B+C)%>%select(Industry,A,B,C,Total_responses)
C<-c("Industry"="ALL", colSums(T1.2[,2:5]))
T1.2<-rbind(C,T1.2)
This gives output:
Industry A B C Total_responses
1 ALL 20 18 18 56
2 Banking/Financial Services 2 2 2 6
3 Chemicals 0 1 2 3
4 Consumer Goods 1 1 1 3
5 Energy 2 1 0 3
6 High Tech 6 0 2 8
7 Insurance/Reinsurance 0 2 0 2
8 Life Sciences 0 0 0 0
9 Logistics 0 0 2 2
10 Mining & Metals 1 1 1 3
11 Other Manufacturing 1 2 0 3
12 Other Non-Manufacturing 3 2 2 7
13 Retail & Wholesale 1 1 0 2
14 Services (Non-Financial) 2 4 5 11
15 Transportation Equipment 1 1 1 3
16 <NA> 0 0 0 0
This output is OK, but the issue is that after I use the table() function, I convert the result to a data frame and then use dcast to get the desired look of the table. After dcast, it creates another column, NA, that I don't want.
I also want to wrap this entire calculation in a function that I can use for other factors with more levels.
Q9 has 3 levels (A, B and C), but I don't want to calculate Total_responses by naming them like this. I want a function that works with any other factor, whatever its number of levels. Please suggest any more efficient ways.
> dput(data)
structure(list(Q7_1 = structure(c(5L, 5L, 14L, 1L, 9L, 13L, 1L,
3L, 13L, 13L, 13L, 12L, 2L, 11L, 13L, 10L, 11L, 1L, 4L, 5L, 5L,
4L, 5L, 9L, 2L, 4L, 13L, 10L, 13L, 13L, 11L, 1L, 11L, 5L, NA,
1L, 9L, 3L, 1L, 5L, NA, 2L, NA, 6L, 14L, NA, NA, 14L, 8L, 11L,
8L, 12L, 13L, NA, 3L, 11L, 11L, NA, 10L, 6L, 5L, 13L, 13L), .Label = c("Banking/Financial Services",
"Chemicals", "Consumer Goods", "Energy", "High Tech", "Insurance/Reinsurance",
"Life Sciences", "Logistics", "Mining & Metals", "Other Manufacturing",
"Other Non-Manufacturing", "Retail & Wholesale", "Services (Non-Financial)",
"Transportation Equipment"), class = "factor"), Q9 = structure(c(1L,
3L, 3L, 3L, 3L, 1L, 1L, 3L, 2L, 2L, 3L, 2L, 3L, 2L, 2L, 2L, 1L,
3L, 1L, 1L, 1L, 2L, 1L, 2L, 3L, 1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L,
1L, NA, 1L, 1L, 2L, 2L, 1L, NA, 2L, NA, 2L, 2L, NA, NA, 1L, 3L,
1L, 3L, 1L, 3L, NA, 1L, 3L, 1L, NA, 2L, 2L, 3L, 3L, 2L), .Label = c("A",
"B", "C"), class = "factor")), class = c("data.table", "data.frame"
), row.names = c(NA, -63L), .Names = c("Q7_1",
"Q9"))
>
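To convert a table into a data frame we can use as.data.frame.matrix(). Calling table() without exclude = NULL also leaves NA out of the counts, so no NA row or column appears.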
crossCalc <- function(data){
t <- table(data$Q7_1, data$Q9)
t <- as.data.frame.matrix(t)
Total_responses <- with(t, A + B + C)
t <- cbind(t, Total_responses)
t <- rbind(ALL = colSums(t), t)  # prepend column totals of the table built above
return(t)
}
crossCalc(data)
# A B C Total_responses
# ALL 20 18 18 56
# Banking/Financial Services 2 2 2 6
# Chemicals 0 1 2 3
# Consumer Goods 1 1 1 3
# Energy 2 1 0 3
# High Tech 6 0 2 8
# Insurance/Reinsurance 0 2 0 2
# Life Sciences 0 0 0 0
# Logistics 0 0 2 2
# Mining & Metals 1 1 1 3
# Other Manufacturing 1 2 0 3
# Other Non-Manufacturing 3 2 2 7
# Retail & Wholesale 1 1 0 2
# Services (Non-Financial) 2 4 5 11
# Transportation Equipment 1 1 1 3
Maybe this is what you want?
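Since the question asks for something that works with any number of levels, here is a hedged sketch of a more general variant. The function name cross_calc, its arguments row_var and col_var, and the toy data are assumptions for illustration; the only change in technique is rowSums() in place of the hard-coded A + B + C.

```r
# Hypothetical generalisation: row_var and col_var are column names given as strings
cross_calc <- function(data, row_var, col_var) {
  # table() drops NA by default, so no NA row or column is produced
  tab <- as.data.frame.matrix(table(data[[row_var]], data[[col_var]]))
  # Row totals over however many levels col_var has
  tab$Total_responses <- rowSums(tab)
  # Prepend the column totals as an "ALL" row
  rbind(ALL = colSums(tab), tab)
}

# Toy data (made up) with an unused level "C" and some missing values
toy <- data.frame(
  industry = factor(c("Energy", "Energy", "Chemicals", NA, "Energy")),
  answer   = factor(c("A", "B", "B", "A", NA), levels = c("A", "B", "C"))
)
out <- cross_calc(toy, "industry", "answer")
print(out)
```

With the question's data the call would presumably be cross_calc(data, "Q7_1", "Q9").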

R ggplot2 - How to plot 2 boxplots on the same x value

Suppose I have two boxplots.
trial1 <- ggplot(completionTime, aes(fill=Condition, x=Scenario, y=Trial1))
trial1 + geom_boxplot()+geom_point(position=position_dodge(width=0.75)) + ylim(0, 160)
trial2 <- ggplot(completionTime, aes(fill=Condition, x=Scenario, y=Trial2))
trial2 + geom_boxplot()+geom_point(position=position_dodge(width=0.75)) + ylim(0, 160)
How can I plot Trial1 and Trial2 on the same plot, each at its respective x value? They have the same range of y.
I looked at geom_boxplot(position="identity"), but that plots the two conditions (fill) on the same x.
I want to plot two y columns on the same x.
Edit: the dataset
User Condition Scenario Trial1 Trial2
1 1 ME a 67 41
2 1 ME b 70 42
3 1 ME c 40 15
4 1 ME d 65 23
5 1 ME e 45 45
6 1 SE a 100 34
7 1 SE b 54 23
8 1 SE c 70 23
9 1 SE d 56 15
10 1 SE e 30 20
11 2 ME a 42 23
12 2 ME b 22 12
13 2 ME c 28 8
14 2 ME d 22 8
15 2 ME e 38 37
16 2 SE a 59 18
17 2 SE b 65 14
18 2 SE c 75 7
19 2 SE d 37 9
20 2 SE e 31 7
dput()
structure(list(User = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), Condition = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L), .Label = c("ME", "SE"), class = "factor"), Scenario =
structure(c(1L,
2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L, 3L, 4L, 5L, 1L, 2L,
3L, 4L, 5L), .Label = c("a", "b", "c", "d", "e"), class = "factor"),
Trial1 = c(67L, 70L, 40L, 65L, 45L, 100L, 54L, 70L, 56L,
30L, 42L, 22L, 28L, 22L, 38L, 59L, 65L, 75L, 37L, 31L), Trial2 = c(41L,
42L, 15L, 23L, 45L, 34L, 23L, 23L, 15L, 20L, 23L, 12L, 8L,
8L, 37L, 18L, 14L, 7L, 9L, 7L)), .Names = c("User", "Condition",
"Scenario", "Trial1", "Trial2"), class = "data.frame", row.names = c(NA,
-20L))
You could try using interaction to combine two of your factors and plot against a third. For example, assuming you want to fill by condition as in your original code:
library(tidyr)
library(ggplot2)
completionTime %>%
gather(trial, value, -Scenario, -Condition, -User) %>%
ggplot(aes(interaction(Scenario, trial), value)) + geom_boxplot(aes(fill = Condition))
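If Scenario should stay alone on the x axis, one possible alternative is to reshape to long format and let the fill carry both Condition and trial. This is only a sketch under assumptions: it swaps in tidyr::pivot_longer() for gather(), uses a four-row toy subset of the question's data, and the names long and p are illustrative.

```r
library(tidyr)
library(ggplot2)

# Toy subset in the question's layout (values taken from rows 1, 2, 6 and 7)
completionTime <- data.frame(
  User      = c(1L, 1L, 1L, 1L),
  Condition = factor(c("ME", "ME", "SE", "SE")),
  Scenario  = factor(c("a", "b", "a", "b")),
  Trial1    = c(67L, 70L, 100L, 54L),
  Trial2    = c(41L, 42L, 34L, 23L)
)

# Stack Trial1 and Trial2 into one value column with a 'trial' label
long <- pivot_longer(completionTime, c(Trial1, Trial2),
                     names_to = "trial", values_to = "value")

# Scenario stays on the x axis; each Condition-by-trial combination gets its own dodged box
p <- ggplot(long, aes(Scenario, value, fill = interaction(Condition, trial))) +
  geom_boxplot() +
  labs(fill = "Condition.trial")
```

Printing p draws up to four boxes per Scenario, one for each Condition and trial pair.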

array manipulation: calculate odds ratios for a layer in a 3-way table

This is a question about array and data frame manipulation and calculation, in the
context of models for log odds in contingency tables. The closest question I've found to this is How can i calculate odds ratio in many table, but mine is more general.
I have a data frame representing a 3-way frequency table, of size 5 (litter) x 2 (treatment) x 3 (deaths).
"Freq" is the frequency in each cell, and deaths is the response variable.
Mice <-
structure(list(litter = c(7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L,
11L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 7L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L), treatment = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), deaths = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0", "1",
"2+"), class = "factor"), Freq = c(58L, 75L, 49L, 58L, 33L, 45L,
15L, 39L, 4L, 5L, 11L, 19L, 14L, 17L, 18L, 22L, 13L, 22L, 12L,
15L, 5L, 7L, 10L, 8L, 15L, 10L, 15L, 18L, 17L, 8L)), .Names = c("litter",
"treatment", "deaths", "Freq"), row.names = c(NA, 30L), class = "data.frame")
From this, I want to calculate the log odds for adjacent categories of the last variable (deaths)
and have this value in a data frame with factors litter (5), treatment (2), and contrast (2), as detailed below.
The data can be seen in xtabs() form:
mice.tab <- xtabs(Freq ~ litter + treatment + deaths, data=Mice)
ftable(mice.tab)
                 deaths  0  1 2+
litter treatment                
7      A                58 11  5
       B                75 19  7
8      A                49 14 10
       B                58 17  8
9      A                33 18 15
       B                45 22 10
10     A                15 13 15
       B                39 22 18
11     A                 4 12 17
       B                 5 15  8
From this, I want to calculate the (adjacent) log odds of 0 vs. 1 and 1 vs.2+ deaths, which is easy in
array format,
odds1 <- log(mice.tab[,,1]/mice.tab[,,2]) # contrast 0:1
odds2 <- log(mice.tab[,,2]/mice.tab[,,3]) # contrast 1:2+
odds1
      treatment
litter          A          B
    7   1.6625477  1.3730491
    8   1.2527630  1.2272297
    9   0.6061358  0.7156200
    10  0.1431008  0.5725192
    11 -1.0986123 -1.0986123
But, for analysis, I want to have these in a data frame, with factors litter, treatment and contrast
and a column, 'logodds' containing the entries in the odds1 and odds2 tables, suitably strung out.
More generally, for an I x J x K table, where the last factor is the response, my desired result
is a data frame of IJ(K-1) rows, with adjacent log odds in a 'logodds' column, and ideally, I'd like
to have a general function to do this.
Note that if T is the 10 x 3 matrix of frequencies shown by ftable(), the calculation is essentially
log(T) %*% matrix(c(1, -1, 0,
                    0, 1, -1), nrow = 3)
followed by reshaping and labeling.
Can anyone help with this?
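One possible shape for such a general function is sketched below. It is not a definitive solution: the name adjacent_logodds is made up, the table is assumed to be 3-way with named dimnames and the response as its last dimension, and the Mice frequencies are retyped from the question so the block runs on its own.

```r
# Hypothetical helper: adjacent log odds along the last dimension of a 3-way table
adjacent_logodds <- function(tab) {
  stopifnot(length(dim(tab)) == 3)
  arr <- unclass(tab)                 # plain array, keeps dim and dimnames
  K   <- dim(arr)[3]
  lev <- dimnames(arr)[[3]]
  # log odds of category k versus k + 1, for k = 1, ..., K - 1
  lo <- log(arr[, , -K, drop = FALSE] / arr[, , -1, drop = FALSE])
  dimnames(lo)[[3]] <- paste(lev[-K], lev[-1], sep = ":")
  names(dimnames(lo))[3] <- "contrast"
  # String the array out into I * J * (K - 1) rows
  as.data.frame(as.table(lo), responseName = "logodds")
}

# The question's data, rebuilt compactly (treatment varies fastest, then litter, then deaths)
Mice <- data.frame(
  litter    = rep(rep(7:11, each = 2), times = 3),
  treatment = rep(c("A", "B"), times = 15),
  deaths    = rep(c("0", "1", "2+"), each = 10),
  Freq = c(58, 75, 49, 58, 33, 45, 15, 39, 4, 5,
           11, 19, 14, 17, 18, 22, 13, 22, 12, 15,
           5, 7, 10, 8, 15, 10, 15, 18, 17, 8)
)
mice.tab <- xtabs(Freq ~ litter + treatment + deaths, data = Mice)

res <- adjacent_logodds(mice.tab)
head(res)
```

The result has one row per litter, treatment and contrast combination, with contrasts labelled 0:1 and 1:2+. Cells with zero counts would give infinite or undefined log odds, which this sketch does not handle.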
