Create unique ID based on repeated IDs [duplicate]

This question already has answers here: Create group number for contiguous runs of equal values (4 answers). Closed 4 days ago.
I received some data from a colleague who is working with animal observations recorded in several transects. However, my colleague reused the same four ID codes to identify the transects: 1, 7, 13 and 19. I would like to replace the repeated IDs with unique IDs, as in the transect_id column of the example data below:
example_data<-structure(list(ID_Transect = c(1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L,
7L, 7L, 7L, 7L, 13L, 13L, 13L, 13L, 13L, 13L, 19L, 19L, 19L,
19L, 19L, 19L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 7L, 7L, 7L, 7L),
transect_id = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L,
5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-36L))

We can also do
library(data.table)
setDT(example_data)[, transect_id := rleid(ID_Transect)]

You can use data.table rleid -
example_data$transect_id <- data.table::rleid(example_data$ID_Transect)
#[1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 4 5 5 5 5 5 5 6 6 6 6 6 6
In base R we can use rle -
with(rle(example_data$ID_Transect), rep(seq_along(values), lengths))
Or diff + cumsum -
cumsum(c(TRUE, diff(example_data$ID_Transect) != 0))
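The two base R variants above can be checked on a small vector. This is only a minimal sketch: run_id is a hypothetical helper name and ids is made-up illustration data, not part of the question.

```r
# Hypothetical helper: number contiguous runs of equal values (base R only).
run_id <- function(x) {
  r <- rle(x)                           # run lengths and run values
  rep(seq_along(r$values), r$lengths)   # repeat each run's index over its length
}

# Made-up IDs in which codes 1 and 7 are reused for later transects
ids <- c(1L, 1L, 7L, 7L, 13L, 1L, 1L, 7L)

by_rle  <- run_id(ids)
by_diff <- cumsum(c(TRUE, diff(ids) != 0))  # a new group starts wherever the value changes

by_rle
# [1] 1 1 2 2 3 4 4 5
```

Both variants give the same numbering for this input. The diff() variant assumes a numeric ID column, while rle() also accepts character vectors.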

Related

Using sample() to sample from nested lists in R

I'm looking for a way to use sample() to sample values from different lists based on a value in another column of a data.table - at the moment I'm getting a recursive indexing failed error - code and more explanation below:
First set up example data:
library(stats)
library(data.table)
# list of three different nest survival rates
survival<-list(0.91,0.95,0.99)
# incubation period
inc.period<-28
# then set up function to use the geometric distribution to generate 3 lists of incubation outcomes based on the nest survivals and incubation period above.
# e.g. less than 28 is a nest failure, 28 is a successful nest.
create.sample <- function(survival){
outcome<-rgeom(100,1-survival)
fifelse(outcome > inc.period, inc.period, outcome)
}
# then create list of 100 nest outcomes with 3 different survival values using lapply
inc.outcomes <- lapply(survival,create.sample)
# set up a data.table - each row of data will be a nest.
index<-c(1:3)
iteration<-1:20
dt = CJ(index,iteration)
Then I want to make a new column 'inc.period' which samples from the 'inc.outcomes' list using the index column of the dt to select which of the three 'inc.outcomes' lists to sample from (with a different sample for each row of data).
So e.g. when index = 1, the sampled value comes from inc.outcomes[[1]] - which is the low nest survival list, when index = 2 I sample from inc.outcomes[[2]] etc.
The code would look something like this but this doesn't work (I get a recursive indexing failed error):
dt[,inc.period:= sample(inc.outcomes[[index]],nrow(dt),replace = TRUE)]
Any help or advice gratefully received, also suggestions for different approaches to this problem - this is for an update to code that runs in a shiny simulation so quicker options preferred!

Two problems:
inc.outcomes[[index]] is a problem because index has length 60 here, so you are effectively trying inc.outcomes[[ c(1,1,...,2,2,...,3,3) ]], which is incorrect. [[-indexing takes either a length-1 index (for most uses) or a vector that walks down a nested list one level per element. For example, in list(list(1,2),list(3,4))[[ c(1,2) ]] the length-2 index c(1,2) works because we have 2-deep nested lists. Since inc.outcomes is only 1-deep, we can only use a length-1 index inside [[.
This means we need to do this by index. (And because of this, we need to change from nrow(dt) to .N, but frankly we should be using that anyway even without by=.)
dt[, inc.period := sample(inc.outcomes[[ index[1] ]], .N, replace = TRUE), by = index]
# index iteration inc.period
# <int> <int> <num>
# 1: 1 1 17
# 2: 1 2 17
# 3: 1 3 21
# 4: 1 4 24
# 5: 1 5 3
# 6: 1 6 1
# 7: 1 7 17
# 8: 1 8 0
# 9: 1 9 1
# 10: 1 10 0
# ---
# 51: 3 11 0
# 52: 3 12 0
# 53: 3 13 28
# 54: 3 14 28
# 55: 3 15 9
# 56: 3 16 28
# 57: 3 17 7
# 58: 3 18 28
# 59: 3 19 28
# 60: 3 20 28
My data:
dt <- setDT(structure(list(index = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), iteration = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 19L, 20L)), row.names = c(NA, -60L), class = c("data.table", "data.frame"), sorted = c("index", "iteration")))
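The [[ behaviour described in this answer can be reproduced in base R. A minimal sketch, where nested and flat are made-up lists used only for illustration:

```r
# A 2-deep list: a length-2 index walks one level per element
nested <- list(list(1, 2), list(3, 4))
second_of_first <- nested[[c(1, 2)]]   # same as nested[[1]][[2]]
second_of_first
# [1] 2

# A 1-deep list of numeric vectors, similar in shape to inc.outcomes
flat <- list(c(10, 20), c(30, 40), c(50, 60))
flat[[2]]                              # a length-1 index selects one element
# [1] 30 40

# A longer index vector on the 1-deep list fails; the error is caught here
# so that the sketch runs to completion
res <- tryCatch(flat[[c(1, 1, 2)]], error = function(e) "error")
res
# [1] "error"
```

This is why the answer picks a single list element per group with index[1] instead of passing the whole index column to [[.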

How to reorder x-axis based on y-axis values in R ggplot2

I am trying to reorder (I don't mind whether it is ascending or descending order) the x-axis on my errorplot based on the mean values of the y-axis. I have applied a solution based on this post, however for some reason it seems to be ignoring the reorder command. Any ideas what is happening here?
#Import data.
df <- structure(list(X_Variable = c(4L, 4L, 13L, 18L, 12L, 3L, 15L,
NA, 18L, 4L, 17L, NA, 3L, 15L, 4L, 6L, 12L, NA, 2L, 1L, NA, 15L,
1L, 6L, 1L, 12L, NA, 6L, NA, 15L, NA, 1L, 7L, 15L, 11L, NA, NA,
1L, 1L, 7L, 2L, 2L, 12L, 11L, 15L, 17L, 1L, 4L, 11L, 15L, 2L,
3L, 13L, 17L, 15L, 6L, 3L, 14L, 12L, 8L, 12L, 11L, NA, 2L, 11L,
NA, 4L, 8L, 15L, 4L, 7L, 8L, 15L, 15L, 15L, 6L, 3L, 6L, 8L, 15L,
4L, 2L, 1L, 1L, 7L, 17L, 15L, 1L, NA, 5L, 13L, 1L, 15L, 4L, 15L,
13L, 18L, 1L, 15L, 6L, NA, 6L, NA, 6L, 1L, 16L, 4L, 1L, NA, 2L,
12L, NA, 7L, 2L, 15L, 13L, 13L, 16L, NA, 7L, 2L, 4L, 15L, 11L,
15L, 2L, 5L, 13L, 2L, 9L, 7L, 6L, 15L, 15L, 11L, 3L, 15L, 13L,
NA, 4L, 8L, NA, 4L, 8L, 18L, 4L, 1L, 8L, 5L, 18L), Y_Variable = c(6L,
4L, 5L, 4L, 4L, 3L, 7L, 1L, 1L, 7L, 4L, NA, 5L, 1L, 6L, 1L, 6L,
3L, 6L, 4L, NA, 4L, 6L, 5L, 1L, 4L, 1L, 1L, 6L, 3L, 4L, 1L, 1L,
2L, 3L, 4L, 4L, 2L, 2L, 2L, 4L, 1L, 1L, 5L, 4L, 1L, 4L, 4L, 3L,
3L, 2L, 2L, 1L, 3L, NA, 2L, 4L, 1L, 2L, 2L, 6L, 3L, NA, 2L, 2L,
NA, 4L, 2L, 3L, 6L, 5L, 4L, 1L, 5L, 3L, 1L, 4L, 6L, 1L, 5L, 4L,
2L, 1L, 5L, 4L, 3L, 2L, NA, 4L, 2L, NA, 4L, 5L, 5L, 4L, 2L, 1L,
5L, 2L, 2L, 4L, 1L, 4L, 1L, 5L, 2L, 1L, 3L, NA, 2L, 2L, 2L, 5L,
1L, 1L, 4L, 2L, 2L, NA, 3L, 5L, 7L, 1L, 1L, 1L, 1L, 4L, 1L, 2L,
2L, 3L, 3L, 3L, 4L, 1L, 4L, 3L, 4L, 3L, 6L, 1L, 5L, 4L, 2L, 5L,
2L, 3L, 1L, 1L, 2L)), row.names = c(NA, -150L), class = "data.frame")
#Error plot ordered by Y-Variable.
library(ggplot2)
ggplot(df, aes(x=reorder(X_Variable, Y_Variable, FUN=mean), y=Y_Variable))+
geom_point(stat="summary", fun="mean")+ # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
geom_errorbar(stat="summary", fun.data="mean_se", fun.args=list(mult=1.96), width=0.1)

I only removed missing values first. The minus sign works fine on my machine.
library(dplyr)
library(ggplot2)
# drop rows with a missing X or Y value before plotting
df1<-df %>%
filter(!is.na(X_Variable), !is.na(Y_Variable))
ggplot(df1, aes(x=reorder(X_Variable, -Y_Variable, FUN=mean), y=Y_Variable))+
geom_point(stat="summary", fun="mean")+ # fun replaces the deprecated fun.y (ggplot2 >= 3.3.0)
geom_errorbar(stat="summary", fun.data="mean_se", fun.args=list(mult=1.96), width=0.1)
Edit: Because of missing values, X_Variable 1, 13, and 15 ranked last. Hope this helps.
df %>% group_by(X_Variable) %>%
summarise(
Y_Variable = mean(Y_Variable)) %>%
arrange(desc(Y_Variable))
# A tibble: 18 x 2
X_Variable Y_Variable
<int> <dbl>
1 4 4.71
2 3 3.67
3 12 3.57
4 7 3.29
5 17 2.75
6 18 2.6
7 11 2.57
8 2 2.55
9 5 2.33
10 6 2.3
11 9 2
12 16 2
13 8 1.86
14 14 1
15 1 NA
16 13 NA
17 15 NA
18 NA NA
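The effect of missing values on the ordering can also be seen without ggplot2. The sketch below uses made-up vectors x and y, and relies on reorder() passing extra arguments such as na.rm = TRUE on to FUN; it is one possible alternative to filtering the rows first, not the approach used in this answer.

```r
# Made-up data: group "a" contains a missing y value
x <- c("a", "a", "b", "b", "c", "c")
y <- c(1, NA, 2, 4, 5, 3)

# Without na.rm the mean of group "a" is NA, so "a" cannot be ranked by its value
tapply(y, x, mean)

# Extra arguments to reorder() are passed on to FUN, so na.rm = TRUE
# lets every group be ranked by the mean of its observed values
ordered_x <- reorder(x, y, FUN = mean, na.rm = TRUE)
levels(ordered_x)
# [1] "a" "b" "c"
```

With na.rm = TRUE the group means are 1, 3 and 4, so the levels come out in ascending order of the mean.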

Sampling distribution and sum of tables

I ran a few experiments, and each experiment led to the appearance of a color.
As I can't run more experiments, I want to draw samples of size 30 and see what frequency table (of colors) I could obtain over 1000 samples. The resulting frequency table should be the sum of the 1000 individual frequency tables.
I thought about concatenating the tables as follows and then aggregating, but it did not work:
mydata=structure(list(Date = structure(c(11L, 1L, 9L, 9L, 10L, 1L, 2L,
3L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 6L, 7L, 4L, 4L, 4L, 6L, 6L, 11L,
5L, 4L, 7L, 10L, 6L, 6L, 2L, 5L, 7L, 11L, 1L, 9L, 11L, 11L, 11L,
1L, 1L), .Label = c("01/02/2016", "02/02/2016", "03/02/2016",
"08/02/2016", "10/02/2016", "11/02/2016", "16/02/2016", "22/02/2016",
"26/01/2016", "27/01/2016", "28/01/2016"), class = "factor"),
Color = structure(c(30L, 33L, 11L, 1L, 18L, 18L, 11L,
16L, 19L, 19L, 22L, 1L, 18L, 18L, 13L, 14L, 13L, 18L, 24L,
24L, 11L, 24L, 2L, 33L, 25L, 1L, 30L, 5L, 24L, 18L, 13L,
35L, 19L, 19L, 18L, 23L, 19L, 8L, 19L, 14L), .Label = c("ARD",
"ARP", "BBB", "BIE", "CFX", "CHR", "DDD", "DOO", "EAU", "ELY",
"EPI", "ETR", "GEN", "GER", "GGG", "GIS", "ISE", "JUV", "LER",
"LES", "LON", "LYR", "MON", "NER", "NGY", "NOJ", "NYO", "ORI",
"PEO", "RAY", "RRR", "RSI", "SEI", "SEP", "VIL", "XQU", "YYY",
"ZYZ"), class = "factor"), Categorie = structure(c(1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("1", "1,2", "1,2,3",
"1,3", "2", "2,3", "3", "4", "5"), class = "factor"), Portion_Longueur = c(3L,
4L, 1L, 1L, 2L, 4L, 5L, 6L, 7L, 7L, 8L, 8L, 9L, 8L, 8L, 9L,
11L, 7L, 7L, 7L, 9L, 8L, 3L, 8L, 7L, 11L, 2L, 9L, 8L, 5L,
8L, 12L, 3L, 4L, 1L, 3L, 3L, 3L, 4L, 5L)), .Names = c("Date",
"Color", "Categorie", "Portion_Longueur"), row.names = c(NA,
40L), class = "data.frame")
for (i in 1:1000) {
mysamp= sample(mydata$Color,size=30)
x=data.frame(table(mysamp))
if (i==1) w=x
else w <- c(w, x)
}
aggregate(w$Freq, by=list(Color=w$mysamp), FUN=sum)
For example, with 3 samples (for (i in 1:3)) I expect to get the sum of the three frequency tables.
But I do not get the sum; instead I get:
Color x
1 ARD 2
2 ARP 1
3 BBB 0
4 BIE 0
5 CFX 0
6 CHR 0
7 DDD 0
8 DOO 1
9 EAU 0
10 ELY 0
11 EPI 3
12 ETR 0
13 GEN 2
14 GER 2
15 GGG 0
16 GIS 1
17 ISE 0
18 JUV 4
19 LER 5
20 LES 0
21 LON 0
22 LYR 1
23 MON 1
24 NER 2
25 NGY 1
26 NOJ 0
27 NYO 0
28 ORI 0
29 PEO 0
30 RAY 1
31 RRR 0
32 RSI 0
33 SEI 2
34 SEP 0
35 VIL 1
36 XQU 0
37 YYY 0
38 ZYZ 0
How can I do this?
Thanks a lot

Your for loop is what's causing your issues. You end up creating a big list that is somewhat difficult to perform calculations on (check out names(w) to see what I mean). A better data structure would allow for easier calculations:
x = NULL #initialize
for (i in 1:1000) {
mysamp = sample(mydata$Color,size=30) #sample
mysamp = data.frame(table(mysamp)) #frequency
x = rbind(x, mysamp) #bind to x
}
aggregate(Freq~mysamp, data = x, FUN = sum) #perform calculation
Note that this loop runs a bit slower than your loop. This is because of the rbind() function. See this post. Maybe someone will come along with a more efficient solution.
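One way to avoid growing a data frame with rbind() is to collect the counts in a matrix and sum its rows. This is a sketch of a different technique from the loop above, not a benchmarked replacement; colors, n_rep and size are made-up stand-ins for mydata$Color, 1000 and 30.

```r
set.seed(1)

# Made-up factor standing in for mydata$Color; "ZYZ" is a level with no observations
colors <- factor(c("ARD", "EPI", "JUV", "JUV", "LER", "LER", "LER", "GEN"),
                 levels = c("ARD", "EPI", "GEN", "JUV", "LER", "ZYZ"))
n_rep <- 1000   # number of samples
size  <- 5      # observations per sample (30 in the question)

# table() on a factor keeps every level, so each replicate returns a count
# vector of the same length and replicate() binds them into a matrix
# with one row per level and one column per sample
counts <- replicate(n_rep, as.vector(table(sample(colors, size = size))))

# Summing across columns gives the total frequency per color
total <- setNames(rowSums(counts), levels(colors))
total
```

The totals add up to n_rep * size, and levels that never occur in the data stay at 0.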

array manipulation: calculate odds ratios for a layer in a 3-way table

This is a question about array and data frame manipulation and calculation, in the
context of models for log odds in contingency tables. The closest question I've found to this is How can i calculate odds ratio in many table, but mine is more general.
I have a data frame representing a 3-way frequency table, of size 5 (litter) x 2 (treatment) x 3 (deaths).
"Freq" is the frequency in each cell, and deaths is the response variable.
Mice <-
structure(list(litter = c(7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L,
11L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 7L, 7L, 8L,
8L, 9L, 9L, 10L, 10L, 11L, 11L), treatment = structure(c(1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L), .Label = c("A",
"B"), class = "factor"), deaths = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("0", "1",
"2+"), class = "factor"), Freq = c(58L, 75L, 49L, 58L, 33L, 45L,
15L, 39L, 4L, 5L, 11L, 19L, 14L, 17L, 18L, 22L, 13L, 22L, 12L,
15L, 5L, 7L, 10L, 8L, 15L, 10L, 15L, 18L, 17L, 8L)), .Names = c("litter",
"treatment", "deaths", "Freq"), row.names = c(NA, 30L), class = "data.frame")
From this, I want to calculate the log odds for adjacent categories of the last variable (deaths)
and have this value in a data frame with factors litter (5), treatment (2), and contrast (2), as detailed below.
The data can be seen in xtabs() form:
mice.tab <- xtabs(Freq ~ litter + treatment + deaths, data=Mice)
ftable(mice.tab)
                 deaths  0  1 2+
litter treatment
7      A                58 11  5
       B                75 19  7
8      A                49 14 10
       B                58 17  8
9      A                33 18 15
       B                45 22 10
10     A                15 13 15
       B                39 22 18
11     A                 4 12 17
       B                 5 15  8
From this, I want to calculate the (adjacent) log odds of 0 vs. 1 and 1 vs.2+ deaths, which is easy in
array format,
odds1 <- log(mice.tab[,,1]/mice.tab[,,2]) # contrast 0:1
odds2 <- log(mice.tab[,,2]/mice.tab[,,3]) # contrast 1:2+
odds1
      treatment
litter          A          B
    7   1.6625477  1.3730491
    8   1.2527630  1.2272297
    9   0.6061358  0.7156200
    10  0.1431008  0.5725192
    11 -1.0986123 -1.0986123
But, for analysis, I want to have these in a data frame, with factors litter, treatment and contrast
and a column, 'logodds' containing the entries in the odds1 and odds2 tables, suitably strung out.
More generally, for an I x J x K table, where the last factor is the response, my desired result
is a data frame of IJ(K-1) rows, with adjacent log odds in a 'logodds' column, and ideally, I'd like
to have a general function to do this.
Note that if T is the 10 x 3 matrix of frequencies shown by ftable(), the calculation is essentially
log(T) %*% matrix(c(1, -1, 0,
0, 1, -1), nrow = 3)
followed by reshaping and labeling.
Can anyone help with this?
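One possible sketch of such a general function, in base R. The name adjacent_logodds is hypothetical, and the small array tab reuses only the litter 7 and 8 counts from the table above; the sketch assumes a 3-way array with named dimnames, the response as the last dimension, and at least two levels in each of the first two dimensions.

```r
# Hypothetical helper: adjacent log odds over the last dimension of an I x J x K array
adjacent_logodds <- function(tab) {
  K    <- dim(tab)[3]
  resp <- dimnames(tab)[[3]]
  out <- lapply(seq_len(K - 1), function(k) {
    lo <- log(tab[, , k] / tab[, , k + 1])        # I x J matrix for contrast k vs. k + 1
    df <- as.data.frame(as.table(lo), responseName = "logodds")
    df$contrast <- paste(resp[k], resp[k + 1], sep = ":")
    df
  })
  do.call(rbind, out)                             # I * J * (K - 1) rows
}

# Litters 7 and 8 only, filled with litter varying fastest, then treatment, then deaths
tab <- array(c(58, 49, 75, 58,  11, 14, 19, 17,  5, 10, 7, 8),
             dim = c(2, 2, 3),
             dimnames = list(litter = c("7", "8"),
                             treatment = c("A", "B"),
                             deaths = c("0", "1", "2+")))

res <- adjacent_logodds(tab)
res
```

The result has columns litter, treatment, logodds and contrast. Cells with zero frequency would produce infinite or undefined log odds, which this sketch does not handle.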

Stacked bar graph with fill ggplot2

I've read through the ggplot2 docs website and other question but I couldn't find a solution. I'm trying to visualize some data for varying age groups. I have sort of managed to do it but it does not look like I would intend it to.
Here is the code for my plot
p <- ggplot(suggestion, aes(interaction(Age,variable), value, color = Age, fill = factor(variable), group = Age))
p + geom_bar(stat = "identity")+
facet_grid(.~Age)
[Image: the facetting separates the age variables]
My ultimate goal is to create a stacked bar graph, which is why I used fill, but it does not put the TDX values in their corresponding Age group and Year. (Sometimes the TDX values equal the DX values, but I want to visualize the cases where they don't.)
Here's the dput(suggestion)
structure(list(Age = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 7L,
1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L,
3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 1L, 2L, 3L, 4L, 5L, 6L,
7L), .Label = c("0-2", "3-9", "10-19", "20-39", "40-59", "60-64",
"65+", "UNSP", "(all)"), class = "factor"), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L,
5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 7L, 7L, 7L, 7L, 7L,
8L, 8L, 8L, 8L, 8L, 8L, 8L), .Label = c("Year.10.DX", "Year.11.DX",
"Year.12.DX", "Year.13.DX", "Year.10.TDX", "Year.11.TDX", "Year.12.TDX",
"Year.13.TDX"), class = "factor"), value = c(26.8648932910636,
30.487741796656, 31.9938838749782, 62.8189679326958, 72.8480838120064,
69.3044125928752, 36.9789457527416, 21.808001825378, 24.1073451428435,
40.3305134762935, 70.4486116545885, 68.8342676191755, 63.9227718107745,
34.6086468618636, 8.84033719571875, 13.2807072303835, 28.4781516422802,
55.139497471546, 59.7230544500003, 67.9448927372699, 37.7293286937066,
6.9507024051526, 17.4393054963572, 33.1485743479821, 61.198647580693,
58.6845873573852, 48.0073013177248, 28.4455801248562, 26.8648932910636,
19.8044453272475, 23.0189084635948, 53.7037832071889, 60.6516550126422,
58.1573725886767, 27.0791868812255, 21.808001825378, 19.8146296425633,
35.0587750051557, 62.3308555053346, 59.3299998610862, 56.5341245769817,
27.7229319271878, 8.84033719571875, 13.2807072303835, 22.4081606349585,
48.0252683906252, 52.7560684009579, 65.2890977685045, 32.4142337849399,
6.9507024051526, 15.2833655677215, 24.5268503180754, 52.536784326675,
51.4100599515986, 40.9609231655724, 18.1306673637441)), row.names = c(NA,
-56L), .Names = c("Age", "variable", "value"), class = "data.frame")

It's unclear what you need, but perhaps this.
# a is the data frame from the question (suggestion)
library(ggplot2)
ggplot(a,aes(x=variable,y=value,fill=Age)) + geom_bar(stat='identity') +
facet_wrap(~Age)
If you want to visualize separately the TDX and the DX entries, we'll need to change the dataframe a bit.
> head(a)
Age variable value
1 0-2 Year.10.DX 26.86489
2 3-9 Year.10.DX 30.48774
3 10-19 Year.10.DX 31.99388
4 20-39 Year.10.DX 62.81897
5 40-59 Year.10.DX 72.84808
6 60-64 Year.10.DX 69.30441
The column of interest variable is a combination of year and of TDX/DX value. We'll use the tidyr package to separate this into two columns.
library(tidyr)
library(dplyr)
tidy_a<- a %>% separate(variable, into = c( 'nothing',"year",'label'), sep = "\\.")
This actually splits the levels of column variable into three components, since we split on . and the character . appears twice in each entry.
> head(tidy_a)
Age nothing year label value
1 0-2 Year 10 DX 26.86489
2 3-9 Year 10 DX 30.48774
3 10-19 Year 10 DX 31.99388
4 20-39 Year 10 DX 62.81897
5 40-59 Year 10 DX 72.84808
6 60-64 Year 10 DX 69.30441
So the column nothing is rather useless, just a necessary result of using separate and separating on .. Now this will allow us to visualize TDX/DX separately.
ggplot(tidy_a,aes(x=year,y=value,fill=label)) + geom_bar(stat='identity') + facet_wrap(~Age)
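The same split can be sketched in base R with strsplit(), which may help when tidyr is not available. The vector variable below is a made-up subset of the labels in the question, and split_df is a hypothetical name.

```r
# Made-up subset of the variable labels from the question
variable <- c("Year.10.DX", "Year.11.TDX", "Year.13.DX")

# fixed = TRUE splits on a literal dot instead of treating "." as a regex wildcard
parts <- do.call(rbind, strsplit(variable, ".", fixed = TRUE))

# Column 1 is the constant "Year" prefix; keep only the year and the DX/TDX label
split_df <- data.frame(year  = parts[, 2],
                       label = parts[, 3],
                       stringsAsFactors = FALSE)
split_df
#   year label
# 1   10    DX
# 2   11   TDX
# 3   13    DX
```

This mirrors what separate() does with sep = "\\." but drops the unused prefix column directly.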
