dplyr append group id sequence? - r

I have a dataset like the one below. It was created with dplyr and is currently grouped by 'Stage'. How do I generate a sequence based on the unique, increasing values of Stage, starting from 1 (e.g. row #4 should be 1, and rows #1 and #8 should be 4)?
X Y Stage Count
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
I tried the approach in the post below, but it didn't work.
how to mutate a column with ID in group
Thanks.

Here is another dplyr solution:
> df
# A tibble: 11 × 4
X Y Stage Count
<dbl> <dbl> <dbl> <dbl>
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
To create the group ids, use dplyr's group_indices:
i <- df %>% group_indices(Stage)
df %>% mutate(group = i)
# A tibble: 11 × 5
X Y Stage Count group
<dbl> <dbl> <dbl> <dbl> <int>
1 61 74 1 2 4
2 58 56 2 1 5
3 78 76 0 1 3
4 100 100 -2 1 1
5 89 88 -1 1 2
6 47 44 3 1 6
7 36 32 4 1 7
8 75 58 1 2 4
9 24 21 5 1 8
10 12 11 6 1 9
11 0 0 10 1 10
It would be great if you could pipe both commands together. But, as of this writing, it doesn't appear to be possible.
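One way that appears to work in a single pipe is dplyr's dense_rank(), which ranks values without gaps and gives tied values the same integer. The sketch below rebuilds the example data by hand (the tibble() call is only there to make it self-contained) and assumes that ordering the ids by the value of Stage is what is wanted. In dplyr 1.0.0 and later, group_by(Stage) followed by mutate(group = cur_group_id()) should give the same ids.

```r
library(dplyr)

# Example data from the question, rebuilt by hand
df <- tibble(
  X     = c(61, 58, 78, 100, 89, 47, 36, 75, 24, 12, 0),
  Y     = c(74, 56, 76, 100, 88, 44, 32, 58, 21, 11, 0),
  Stage = c(1, 2, 0, -2, -1, 3, 4, 1, 5, 6, 10),
  Count = c(2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1)
)

res <- df %>%
  ungroup() %>%                        # drop any existing grouping first
  mutate(group = dense_rank(Stage))    # ties share an id, ids have no gaps

res$group
# 4 5 3 1 2 6 7 4 8 9 10
```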

After some experimenting, I used %>% ungroup() %>% mutate(test = rank(Stage)), which yields the following result.
X Y Stage Count test
1 100 100 -2 1 1.0
2 89 88 -1 1 2.0
3 78 76 0 1 3.0
4 61 74 1 2 4.5
5 75 58 1 2 4.5
6 58 56 2 1 6.0
7 47 44 3 1 7.0
8 36 32 4 1 8.0
9 24 21 5 1 9.0
10 12 11 6 1 10.0
11 0 0 10 1 11.0
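I don't know whether this is the best approach; feel free to comment. Note that rank() averages ties, so the two rows with Stage 1 get 4.5 instead of a shared integer id.
Update:
Another approach, assuming the data is called Node: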
lvs <- levels(as.factor(Node$Stage))
Node %>% mutate(Rank = match(Stage,lvs))

Related

How to get p values for odds ratios from an ordinal regression in r

I am trying to get the p values for my odds ratios from an ordinal regression in R.
I previously computed the p values on the log-odds scale like this:
library(MASS)
scm <- polr(finaloutcome ~ Size_no + Hegemony + Committee, data = data3, Hess = TRUE)
(ctable <- coef(summary(scm)))
## calculate and store p value
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable <- cbind(ctable, "p value" = p))
I created my odds ratios like this:
ci <- confint.default(scm)
exp(coef(scm))
## OR and CI
exp(cbind(OR = coef(scm), ci))
However, I am now unsure how to create the p values for the odds ratio. Using the previous method I got:
(ctable1 <- exp(coef(scm)))
p1 <- pnorm(abs(ctable1[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p1))
However, I get the error: Error in ctable1[, "t value"] : incorrect number of dimensions
Odds ratio output sample:
Size Hegem Committee
9.992240e-01 6.957805e-02 1.204437e-01
Data sample:
finaloutcome Size_no Committee Hegemony
1 3 54 2 0
2 2 127 3 0
3 2 127 3 0
4 2 22 1 1
5 2 193 4 1
6 2 54 2 0
7 NA 11 1 1
8 3 54 2 0
9 3 22 1 1
10 2 53 3 1
11 2 53 3 1
12 2 53 3 1
13 2 53 3 1
14 2 53 3 1
15 2 53 3 1
16 2 120 3 0
17 2 120 3 0
18 1 22 1 1
19 1 22 1 1
20 2 193 4 1
21 2 193 4 1
22 2 193 4 1
23 2 12 4 1
24 2 35 1 1
25 1 193 4 1
26 1 164 4 1
27 1 12 4 1
28 2 12 4 1
29 2 193 4 1
30 2 54 2 0
31 2 193 4 1
32 2 193 4 1
33 2 54 2 0
34 2 12 4 1
35 2 22 1 1
36 4 53 3 1
37 2 35 1 1
38 1 193 4 1
39 5 54 2 0
40 7 164 4 1
41 5 54 2 0
42 1 12 4 1
43 7 193 4 1
44 2 193 4 1
45 2 193 4 1
46 2 193 4 1
47 2 193 4 1
48 2 193 4 1
49 2 12 4 1
50 2 22 1 1
51 2 12 4 1
52 2 12 4 1
53 6 13 1 1
54 6 13 1 1
55 6 13 1 1
56 6 12 4 1
57 2 193 4 1
58 3 12 4 1
59 1 12 4 1
60 1 12 4 1
61 8 35 1 1
62 2 193 4 1
63 8 35 1 1
64 6 30 2 1
65 8 12 4 1
66 4 12 4 1
67 5 30 2 1
68 5 54 2 0
69 7 12 4 1
70 5 12 4 1
71 5 54 2 0
72 5 193 4 1
73 5 193 4 1
74 5 54 2 0
75 5 54 2 0
76 1 11 1 1
77 3 22 1 1
78 3 12 4 1
79 6 12 4 1
80 2 22 1 1
81 8 193 4 1
82 8 193 4 1
83 4 193 4 1
84 2 193 4 1
85 2 193 4 1
86 2 193 4 1
87 2 193 4 1
88 2 193 4 1
89 2 193 4 1
90 2 193 4 1
91 2 193 4 1
92 2 193 4 1
93 8 193 4 1
94 6 12 4 1
95 5 12 4 1
96 5 12 4 1
97 5 12 4 1
98 5 12 4 1
99 5 12 4 1
100 5 12 4 1
I usually use lm or glm to create my model (mdl <- lm(…) or mdl <- glm(…)) and then call summary on the object to see these values. Beyond that, you can use the yardstick and broom packages. I recommend the book R for Data Science; it has a great explanation of modeling and of using the tidymodels packages.
I ran into the same difficulty.
I finally used the function tidy from the broom package: https://broom.tidymodels.org/reference/tidy.polr.html
library(broom)
tidy(scm, p.values = TRUE)
This does not yet work if you have categorical variables with more than two levels, or missing values.
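A point that may help here: the Wald test of a coefficient being 0 on the log-odds scale is the same test as the odds ratio being 1, so the p values computed on the log odds can be attached to the exponentiated estimates unchanged. The error in the question arises because exp(coef(scm)) is a plain named vector with no "t value" column. The sketch below uses the housing data shipped with MASS rather than the asker's data3, so the model and variable names are illustrative only.

```r
library(MASS)  # polr() and the built-in housing data

# Illustrative model on MASS::housing (an assumption, not the asker's data)
fit <- polr(Sat ~ Infl + Type + Cont, weights = Freq,
            data = housing, Hess = TRUE)

ctable <- coef(summary(fit))

# Keep only the slope rows; the remaining rows are the cut points (intercepts)
slopes <- ctable[names(coef(fit)), , drop = FALSE]

# Two-sided Wald p values, computed on the log-odds scale
p <- 2 * pnorm(abs(slopes[, "t value"]), lower.tail = FALSE)

# Exponentiate estimates and Wald intervals; the p values carry over as they are
or_table <- cbind(OR = exp(coef(fit)),
                  exp(confint.default(fit)),
                  "p value" = p)
or_table
```

Because the p values are taken from the table before exponentiating, nothing has to be indexed by "t value" on the odds-ratio vector.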

Add rows to dataframe in R based on values in column

I have a dataframe with 2 columns: time and day. There are 3 days, and for each day, time runs from 1 to 12. I want to add new rows for each day with times: -2, 1 and 0. How do I do this?
I have tried using add_row and specifying the row number to add to, but the row numbers change each time a new row is added, which makes the process tedious. Thanks in advance.
picture of the dataframe
We could use add_row, then slice the desired sequence, and bind everything into one dataframe:
library(tibble)
library(dplyr)
df1 <- df %>%
  add_row(time = -2:0, Day = c(1,1,1), .before = 1) %>%
  slice(1:15)
df2 <- bind_rows(df1, df1, df1) %>%
  mutate(Day = rep(row_number(), each=15, length.out = n()))
Output:
# A tibble: 45 x 2
time Day
<dbl> <int>
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
Here's a fast way to create the desired dataframe from scratch using expand.grid(), rather than adding individual rows:
df <- expand.grid(-2:12,1:3)
colnames(df) <- c("time","day")
Results:
df
time day
1 -2 1
2 -1 1
3 0 1
4 1 1
5 2 1
6 3 1
7 4 1
8 5 1
9 6 1
10 7 1
11 8 1
12 9 1
13 10 1
14 11 1
15 12 1
16 -2 2
17 -1 2
18 0 2
19 1 2
20 2 2
21 3 2
22 4 2
23 5 2
24 6 2
25 7 2
26 8 2
27 9 2
28 10 2
29 11 2
30 12 2
31 -2 3
32 -1 3
33 0 3
34 1 3
35 2 3
36 3 3
37 4 3
38 5 3
39 6 3
40 7 3
41 8 3
42 9 3
43 10 3
44 11 3
45 12 3
You can use tidyr::crossing
library(dplyr)
library(tidyr)
add_values <- c(-2, 1, 0)
crossing(time = add_values, Day = unique(day$Day)) %>%
bind_rows(day) %>%
arrange(Day, time)
# A tibble: 45 x 2
# time Day
# <dbl> <int>
# 1 -2 1
# 2 0 1
# 3 1 1
# 4 1 1
# 5 2 1
# 6 3 1
# 7 4 1
# 8 5 1
# 9 6 1
#10 7 1
# … with 35 more rows
If you meant -2, -1 and 0 you can also use complete.
tidyr::complete(day, Day, time = -2:0)

Fill zeros with previous value +1

I have records grouped by user. The variable "day" contains some 0s, which I would like to replace so that the sequence continues (= previous value + 1).
data <- data.frame(user = c(1,1,1,2,2,2,2,2), day = c(170,0,172,34,35,0,0,38))
data
user day
1 1 170
2 1 0
3 1 172
4 2 34
5 2 35
6 2 0
7 2 0
8 2 38
I want to have the following:
data_new
user day
1 1 170
2 1 171
3 1 172
4 2 34
5 2 35
6 2 36
7 2 37
8 2 38
I've tried the following (really inefficient, and it doesn't work for all cases, e.g. consecutive zeros):
library(dplyr)
data <- group_by(data, user) %>%
  mutate(lead_day = lead(day),
         day_new = case_when(day == 0 ~ lead_day - 1,
                             day > 0 ~ day))
data
# A tibble: 8 x 4
# Groups: user [2]
user day lead_day day_new
<dbl> <dbl> <dbl> <dbl>
1 1 170 0 170
2 1 0 172 171
3 1 172 NA 172
4 2 34 35 34
5 2 35 0 35
6 2 0 0 -1
7 2 0 38 37
8 2 38 NA 38
You could use Reduce:
data$day <-Reduce(function(x,y) if(y==0) x+1 else y, data$day,accumulate = TRUE)
data
# user day
# 1 1 170
# 2 1 171
# 3 1 172
# 4 2 34
# 5 2 35
# 6 2 36
# 7 2 37
# 8 2 38
Or, since you already use the tidyverse (accumulate comes from purrr):
library(tidyverse)
data %>% mutate(day = accumulate(day, ~ if (.y == 0) .x + 1 else .y))
# user day
# 1 1 170
# 2 1 171
# 3 1 172
# 4 2 34
# 5 2 35
# 6 2 36
# 7 2 37
# 8 2 38
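Both versions above run over the whole day column, so a zero in the first row of one user would be filled from the last value of the previous user. If that matters, the same Reduce step can be applied per user. A minimal base R sketch with ave() follows; fill_zeros is a helper name made up for this example, and a zero that comes first within a user is left as 0 because there is no previous value to build on.

```r
data <- data.frame(user = c(1, 1, 1, 2, 2, 2, 2, 2),
                   day  = c(170, 0, 172, 34, 35, 0, 0, 38))

# Hypothetical helper: replace each 0 with (previous filled value + 1)
fill_zeros <- function(x) {
  Reduce(function(prev, cur) if (cur == 0) prev + 1 else cur,
         x, accumulate = TRUE)
}

# ave() applies the helper within each user and keeps the original row order
data$day <- ave(data$day, data$user, FUN = fill_zeros)
data$day
# 170 171 172 34 35 36 37 38
```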

add values of one group into another group in R

I have a question on how to add the value from one row of a group to the rest of the rows in that group and then delete that row. For example:
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
Day=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
value=c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
In the above example, my data is grouped by Year, Cluster, Seed and Day. The Seed = 99 values need to be added to the other rows of the same (Year, Cluster, Day) group, and the Seed = 99 row should then be deleted. For example, row #16 belongs to the group (Year = 1, Cluster = a, Day = 1, Seed = 99); its value, 55, should be added to row #1 (5 + 55), row #6 (6 + 55) and row #11 (2 + 55), and row #16 should be deleted. Row #21, however, which is in Cluster = c with Seed = 99, should remain in the data as is, because it has no match on the Year + Cluster + Day combination.
My actual data has 1 million records with 10 years, 80 clusters, 500 days and 10 + 1 seeds (1 to 10, plus 99), so I am looking for an efficient solution. Expected output:
Year Cluster Seed Day value
1 1 a 1 1 60
2 1 a 1 2 68
3 1 a 1 3 78
4 1 a 1 4 90
5 1 a 1 5 107
6 1 a 2 1 61
7 1 a 2 2 73
8 1 a 2 3 86
9 1 a 2 4 91
10 1 a 2 5 104
11 1 a 3 1 57
12 1 a 3 2 67
13 1 a 3 3 79
14 1 a 3 4 96
15 1 a 3 5 105
16 1 c 99 1 10
17 2 b 1 1 60
18 2 b 1 2 68
19 2 b 1 3 78
20 2 b 1 4 90
21 2 b 1 5 107
22 2 b 2 1 61
23 2 b 2 2 73
24 2 b 2 3 86
25 2 b 2 4 91
26 2 b 2 5 104
27 2 b 3 1 57
28 2 b 3 2 67
29 2 b 3 3 79
30 2 b 3 4 96
31 2 b 3 5 105
32 2 d 99 1 10
A data.table approach:
library(data.table)
df <- setDT(df)[, `:=` (value = ifelse(Seed != 99, value + value[Seed == 99], value),
flag = Seed == 99 & .N == 1), by = .(Year, Cluster, Day)][!(Seed == 99 & flag == FALSE),][, "flag" := NULL]
Output:
df[]
Year Cluster Seed Day value
1: 1 a 1 1 60
2: 1 a 1 2 68
3: 1 a 1 3 78
4: 1 a 1 4 90
5: 1 a 1 5 107
6: 1 a 2 1 61
7: 1 a 2 2 73
8: 1 a 2 3 86
9: 1 a 2 4 91
10: 1 a 2 5 104
11: 1 a 3 1 57
12: 1 a 3 2 67
13: 1 a 3 3 79
14: 1 a 3 4 96
15: 1 a 3 5 105
16: 1 c 99 1 10
17: 2 b 1 1 60
18: 2 b 1 2 68
19: 2 b 1 3 78
20: 2 b 1 4 90
21: 2 b 1 5 107
22: 2 b 2 1 61
23: 2 b 2 2 73
24: 2 b 2 3 86
25: 2 b 2 4 91
26: 2 b 2 5 104
27: 2 b 3 1 57
28: 2 b 3 2 67
29: 2 b 3 3 79
30: 2 b 3 4 96
31: 2 b 3 5 105
32: 2 d 99 1 10
Here's an approach using the tidyverse. If you're looking for speed with a million rows, a data.table solution will probably perform better.
library(tidyverse)
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
Day=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
value=c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
seeds <- df %>%
filter(Seed == 99)
matches <- df %>%
filter(Seed != 99) %>%
inner_join(select(seeds, -Seed), by = c("Year", "Cluster", "Day")) %>%
mutate(value = value.x + value.y) %>%
select(Year, Cluster, Seed, Day, value)
no_matches <- anti_join(seeds, matches, by = c("Year", "Cluster", "Day"))
bind_rows(matches, no_matches) %>%
arrange(Year, Cluster, Seed, Day)
#> Year Cluster Seed Day value
#> 1 1 a 1 1 60
#> 2 1 a 1 2 68
#> 3 1 a 1 3 78
#> 4 1 a 1 4 90
#> 5 1 a 1 5 107
#> 6 1 a 2 1 61
#> 7 1 a 2 2 73
#> 8 1 a 2 3 86
#> 9 1 a 2 4 91
#> 10 1 a 2 5 104
#> 11 1 a 3 1 57
#> 12 1 a 3 2 67
#> 13 1 a 3 3 79
#> 14 1 a 3 4 96
#> 15 1 a 3 5 105
#> 16 1 c 99 1 10
#> 17 2 b 1 1 60
#> 18 2 b 1 2 68
#> 19 2 b 1 3 78
#> 20 2 b 1 4 90
#> 21 2 b 1 5 107
#> 22 2 b 2 1 61
#> 23 2 b 2 2 73
#> 24 2 b 2 3 86
#> 25 2 b 2 4 91
#> 26 2 b 2 5 104
#> 27 2 b 3 1 57
#> 28 2 b 3 2 67
#> 29 2 b 3 3 79
#> 30 2 b 3 4 96
#> 31 2 b 3 5 105
#> 32 2 d 99 1 10
Created on 2018-11-23 by the reprex package (v0.2.1)
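For comparison, here is a base R sketch of the same idea using a vectorised lookup with match() on a combined key. It assumes at most one Seed = 99 row per (Year, Cluster, Day) combination; the helper key() and the intermediate names are made up for this example. Unlike the inner join above, rows whose group has no Seed = 99 row are kept unchanged rather than dropped.

```r
df <- data.frame(Year    = rep(c(1, 2), each = 21),
                 Cluster = c(rep("a", 20), "c", rep("b", 20), "d"),
                 Seed    = c(rep(c(1, 2, 3, 99), each = 5), 99),
                 Day     = c(rep(1:5, 4), 1),
                 value   = c(5, 2, 1, 2, 8, 6, 7, 9, 3, 5, 2, 1, 2, 8, 6,
                             55, 66, 77, 88, 99, 10))

# Hypothetical helper: one string key per (Year, Cluster, Day) combination
key <- function(d) paste(d$Year, d$Cluster, d$Day, sep = "|")

seeds <- df[df$Seed == 99, ]
rest  <- df[df$Seed != 99, ]

# Look up the matching Seed = 99 value for every other row (NA if none)
idx <- match(key(rest), key(seeds))
rest$value <- rest$value + ifelse(is.na(idx), 0, seeds$value[idx])

# Keep only the Seed = 99 rows that matched nothing
unmatched <- seeds[!key(seeds) %in% key(rest), ]

out <- rbind(rest, unmatched)
out <- out[order(out$Year, out$Cluster, out$Seed, out$Day), ]
rownames(out) <- NULL
out
```

The data frame is rebuilt with rep() only to keep the example short; it has the same 42 rows as the original definition.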

Create Customized weighted variable in R

My data set looks like this
set.seed(1)
data <- data.frame(ITEMID = 101:120,DEPT = c(rep(1,10),rep(2,10)),
CLASS = c(1,1,1,1,1,2,2,2,2,2,1,1,1,1,1,2,2,2,2,2),
SUBCLASS = c(3,3,3,3,4,4,4,4,4,3,3,3,3,3,3,4,4,4,4,4),
PRICE = sample(1:20,20),UNITS = sample(1:100,20)
)
> data
ITEMID DEPT CLASS SUBCLASS PRICE UNITS
1 101 1 1 3 6 94
2 102 1 1 3 8 22
3 103 1 1 3 11 64
4 104 1 1 3 16 13
5 105 1 1 4 4 26
6 106 1 2 4 14 37
7 107 1 2 4 15 2
8 108 1 2 4 9 36
9 109 1 2 4 19 81
10 110 1 2 3 1 31
11 111 2 1 3 3 44
12 112 2 1 3 2 54
13 113 2 1 3 20 90
14 114 2 1 3 10 17
15 115 2 1 3 5 72
16 116 2 2 4 7 57
17 117 2 2 4 12 67
18 118 2 2 4 17 9
19 119 2 2 4 18 60
20 120 2 2 4 13 34
Now I want to add another column called PRICE_RATIO using the following logic.
Taking ITEMID 101 and grouping by DEPT, CLASS and SUBCLASS yields prices c(6,8,11,16) and UNITS c(94,22,64,13) for ITEMIDs c(101,102,103,104) respectively.
For each item, PRICE_RATIO is the ratio of that item's price to the weighted price of all other items in the group. For example:
For ITEMID 101 the other items are c(102,103,104), whose total UNITS is (22 + 64 + 13) = 99, so the weights are (22/99, 64/99, 13/99). The weighted price of all other items is (22/99)*8 + (64/99)*11 + (13/99)*16 = 10.9899. Hence the value of PRICE_RATIO is 6/10.9899 ≈ 0.546.
Similarly for all other items.
Any help in creating the code for this will be greatly appreciated.
One solution to this problem, and to such problems in general, is the dplyr package and its data-munging capabilities. The logic is as you describe: group by the desired columns, then use mutate to compute the desired value, i.e. the ratio of the price to the weighted price of the other items (the sum product of price and units, excluding the product for that specific row, divided by the units of the other rows). You can execute every step of this computation separately (I encourage that, so you can learn) and see exactly what it does.
library(dplyr)
data %>%
group_by(DEPT, CLASS, SUBCLASS) %>%
mutate(price_ratio = round(PRICE /
((sum(UNITS * PRICE) - UNITS * PRICE) /
(sum(UNITS) - UNITS)),
2))
Output is as follows:
Source: local data frame [20 x 7]
Groups: DEPT, CLASS, SUBCLASS [6]
ITEMID DEPT CLASS SUBCLASS PRICE UNITS price_ratio
<int> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 101 1 1 3 6 94 0.55
2 102 1 1 3 8 22 0.93
3 103 1 1 3 11 64 1.50
4 104 1 1 3 16 13 1.99
5 105 1 1 4 4 26 NaN
6 106 1 2 4 14 37 0.88
7 107 1 2 4 15 2 0.97
8 108 1 2 4 9 36 0.52
9 109 1 2 4 19 81 1.63
10 110 1 2 3 1 31 NaN
11 111 2 1 3 3 44 0.29
12 112 2 1 3 2 54 0.18
13 113 2 1 3 20 90 4.86
14 114 2 1 3 10 17 1.08
15 115 2 1 3 5 72 0.46
16 116 2 2 4 7 57 0.48
17 117 2 2 4 12 67 0.93
18 118 2 2 4 17 9 1.36
19 119 2 2 4 18 60 1.67
20 120 2 2 4 13 34 1.03
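To see why the expression inside mutate gives the leave-one-out weighted price, it can help to compare it against the explicit calculation for one group. The base R sketch below uses the prices and units of items 101 to 104 from the question; the variable names are chosen for this example only.

```r
price <- c(6, 8, 11, 16)     # PRICE for ITEMIDs 101..104
units <- c(94, 22, 64, 13)   # UNITS for ITEMIDs 101..104

# Explicit version: for each item, weight the other items' prices by their units
explicit <- sapply(seq_along(price), function(i) {
  price[i] / weighted.mean(price[-i], units[-i])
})

# Shortcut used in the answer: subtract the item's own contribution from the totals
shortcut <- price / ((sum(units * price) - units * price) /
                     (sum(units) - units))

all.equal(explicit, shortcut)
# TRUE
round(shortcut, 2)
# 0.55 0.93 1.50 1.99
```

A group with a single item has no other items, so the shortcut evaluates 0/0 and the ratio is NaN, as in the rows for items 105 and 110 of the output above.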
