Create a customized weighted variable in R

My data set looks like this:
set.seed(1)
data <- data.frame(ITEMID = 101:120, DEPT = c(rep(1, 10), rep(2, 10)),
                   CLASS = c(1,1,1,1,1,2,2,2,2,2,1,1,1,1,1,2,2,2,2,2),
                   SUBCLASS = c(3,3,3,3,4,4,4,4,4,3,3,3,3,3,3,4,4,4,4,4),
                   PRICE = sample(1:20, 20), UNITS = sample(1:100, 20))
> data
ITEMID DEPT CLASS SUBCLASS PRICE UNITS
1 101 1 1 3 6 94
2 102 1 1 3 8 22
3 103 1 1 3 11 64
4 104 1 1 3 16 13
5 105 1 1 4 4 26
6 106 1 2 4 14 37
7 107 1 2 4 15 2
8 108 1 2 4 9 36
9 109 1 2 4 19 81
10 110 1 2 3 1 31
11 111 2 1 3 3 44
12 112 2 1 3 2 54
13 113 2 1 3 20 90
14 114 2 1 3 10 17
15 115 2 1 3 5 72
16 116 2 2 4 7 57
17 117 2 2 4 12 67
18 118 2 2 4 17 9
19 119 2 2 4 18 60
20 120 2 2 4 13 34
Now I want to add another column called PRICE_RATIO using the following logic:
Taking ITEMID 101 and grouping by DEPT, CLASS and SUBCLASS yields prices c(6,8,11,16) and UNITS c(94,22,64,13) for ITEMIDs c(101,102,103,104) respectively.
Now, for each item ID, the variable PRICE_RATIO will be the ratio of that item's price to the weighted average price of all other ITEMIDs in the group. For example:
For item ID 101 the other items are c(102,103,104), whose total UNITS is (22 + 64 + 13) = 99, giving weights (22/99, 64/99, 13/99). So the weighted price of all other items is (22/99)*8 + (64/99)*11 + (13/99)*16 = 10.9899, and the value of PRICE_RATIO will be 6/10.9899 ≈ 0.55.
Similarly for all other items.
Any help in creating the code for this will be greatly appreciated.

One solution to your problem, and to such problems generally, is the dplyr package and its data-munging capabilities. The logic is exactly as you describe: group by the desired columns, then mutate the desired value (the sum-product of PRICE and UNITS excluding the current row, and the ratio of PRICE to that weighted average). You can execute every step of this computation separately (I encourage that, so you can learn) and see exactly what it does.
library(dplyr)
data %>%
  group_by(DEPT, CLASS, SUBCLASS) %>%
  mutate(price_ratio = round(PRICE /
                               ((sum(UNITS * PRICE) - UNITS * PRICE) /
                                  (sum(UNITS) - UNITS)),
                             2))
Within each group, sum(UNITS * PRICE) - UNITS * PRICE is, for every row, the sum-product over all the other rows, and dividing by sum(UNITS) - UNITS gives the weighted average price of the other items. Output is as follows (price_ratio is NaN for items that are alone in their group, since sum(UNITS) - UNITS is zero there):
Source: local data frame [20 x 7]
Groups: DEPT, CLASS, SUBCLASS [6]
ITEMID DEPT CLASS SUBCLASS PRICE UNITS price_ratio
<int> <dbl> <dbl> <dbl> <int> <int> <dbl>
1 101 1 1 3 6 94 0.55
2 102 1 1 3 8 22 0.93
3 103 1 1 3 11 64 1.50
4 104 1 1 3 16 13 1.99
5 105 1 1 4 4 26 NaN
6 106 1 2 4 14 37 0.88
7 107 1 2 4 15 2 0.97
8 108 1 2 4 9 36 0.52
9 109 1 2 4 19 81 1.63
10 110 1 2 3 1 31 NaN
11 111 2 1 3 3 44 0.29
12 112 2 1 3 2 54 0.18
13 113 2 1 3 20 90 4.86
14 114 2 1 3 10 17 1.08
15 115 2 1 3 5 72 0.46
16 116 2 2 4 7 57 0.48
17 117 2 2 4 12 67 0.93
18 118 2 2 4 17 9 1.36
19 119 2 2 4 18 60 1.67
20 120 2 2 4 13 34 1.03
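As a quick sanity check, the item-101 computation from the question can be reproduced by hand (a minimal sketch using only the values quoted above):
# Weighted average price of items 102-104, then the ratio for item 101
w <- c(22, 64, 13) / sum(c(22, 64, 13))  # weights from UNITS
weighted_price <- sum(w * c(8, 11, 16))  # 10.9899
6 / weighted_price                       # 0.546, which rounds to 0.55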

Related

How to get p values for odds ratios from an ordinal regression in R

I am trying to get the p values for my odds ratios from an ordinal regression using R.
I previously constructed the p values on the log odds like this:
scm <- polr(finaloutcome ~ Size_no + Hegemony + Committee, data = data3, Hess = TRUE)
(ctable <- coef(summary(scm)))
## Calculate and store p value
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
## combined table
(ctable <- cbind(ctable, "p value" = p))
I created my odds ratios like this:
ci <- confint.default(scm)
exp(coef(scm))
## OR and CI
exp(cbind(OR = coef(scm), ci))
However, I am now unsure how to create the p values for the odds ratio. Using the previous method I got:
(ctable1 <- exp(coef(scm)))
p1 <- pnorm(abs(ctable1[, "t value"]), lower.tail = FALSE) * 2
(ctable <- cbind(ctable, "p value" = p1))
However, I get the error: Error in ctable1[, "t value"] : incorrect number of dimensions
Odds ratio output sample:
        Size        Hegem    Committee
9.992240e-01 6.957805e-02 1.204437e-01
Data sample:
    finaloutcome Size_no Committee Hegemony
1              3      54         2        0
2              2     127         3        0
3              2     127         3        0
4              2      22         1        1
5              2     193         4        1
6              2      54         2        0
7             NA      11         1        1
8              3      54         2        0
9              3      22         1        1
10             2      53         3        1
11             2      53         3        1
12             2      53         3        1
13             2      53         3        1
14             2      53         3        1
15             2      53         3        1
16             2     120         3        0
17             2     120         3        0
18             1      22         1        1
19             1      22         1        1
20             2     193         4        1
21             2     193         4        1
22             2     193         4        1
23             2      12         4        1
24             2      35         1        1
25             1     193         4        1
26             1     164         4        1
27             1      12         4        1
28             2      12         4        1
29             2     193         4        1
30             2      54         2        0
31             2     193         4        1
32             2     193         4        1
33             2      54         2        0
34             2      12         4        1
35             2      22         1        1
36             4      53         3        1
37             2      35         1        1
38             1     193         4        1
39             5      54         2        0
40             7     164         4        1
41             5      54         2        0
42             1      12         4        1
43             7     193         4        1
44             2     193         4        1
45             2     193         4        1
46             2     193         4        1
47             2     193         4        1
48             2     193         4        1
49             2      12         4        1
50             2      22         1        1
51             2      12         4        1
52             2      12         4        1
53             6      13         1        1
54             6      13         1        1
55             6      13         1        1
56             6      12         4        1
57             2     193         4        1
58             3      12         4        1
59             1      12         4        1
60             1      12         4        1
61             8      35         1        1
62             2     193         4        1
63             8      35         1        1
64             6      30         2        1
65             8      12         4        1
66             4      12         4        1
67             5      30         2        1
68             5      54         2        0
69             7      12         4        1
70             5      12         4        1
71             5      54         2        0
72             5     193         4        1
73             5     193         4        1
74             5      54         2        0
75             5      54         2        0
76             1      11         1        1
77             3      22         1        1
78             3      12         4        1
79             6      12         4        1
80             2      22         1        1
81             8     193         4        1
82             8     193         4        1
83             4     193         4        1
84             2     193         4        1
85             2     193         4        1
86             2     193         4        1
87             2     193         4        1
88             2     193         4        1
89             2     193         4        1
90             2     193         4        1
91             2     193         4        1
92             2     193         4        1
93             8     193         4        1
94             6      12         4        1
95             5      12         4        1
96             5      12         4        1
97             5      12         4        1
98             5      12         4        1
99             5      12         4        1
100            5      12         4        1
I usually use lm or glm to create my model (mdl <- lm(…) or mdl <- glm(…)) and then call summary on the object to see these values. Beyond that, you can use the yardstick and broom packages. I recommend the book R for Data Science; it has a great explanation of modeling with the tidymodels packages.
I went through the same difficulty.
I finally used the tidy function from the broom package: https://broom.tidymodels.org/reference/tidy.polr.html
library(broom)
tidy(scm, p.values = TRUE)
This does not yet work if you have categorical variables with more than two levels, or missing values.
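A further note on the original error: exp() is a monotone transformation, so the p values computed on the log-odds scale apply unchanged to the odds ratios; exp(coef(scm)) is just a named vector, which is why indexing it with [, "t value"] fails. A minimal sketch (assuming the scm model and data3 data from the question):
library(MASS)  # for polr
# finaloutcome must be an (ordered) factor for polr
scm <- polr(finaloutcome ~ Size_no + Hegemony + Committee, data = data3, Hess = TRUE)
ctable <- coef(summary(scm))
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ci <- confint.default(scm)
# Combine the odds ratios, their CIs, and the unchanged p values
cbind(exp(cbind(OR = coef(scm), ci)), "p value" = p[names(coef(scm))])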

Calculate counts of observations for all variables at once in R

numbers1 <- c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
and
numbers2 <- c(4,23,4,23,5,44,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435)
To perform the counting I do it manually:
as.data.frame(table(numbers1))
as.data.frame(table(numbers2))
But I can have 100 variables, from mydat$x1 to mydat$x100, and I don't want to type this 100 times.
How can I do the counting for all variables at once?
as.data.frame(table(mydat$x1-mydat$x100))
is not working.
We can make a list of all variables in the environment whose names match a pattern like "numbers" followed by a digit, and then loop through the elements of that list:
number_lst <- mget(ls(pattern = 'numbers\\d'), envir = .GlobalEnv) #thanks NelsonGon
lapply(number_lst, function(x) as.data.frame(table(x)))
$numbers1
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
$numbers2
x Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
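If the 100 variables are instead columns x1 to x100 of a data frame such as mydat, as the question suggests, note that a data frame is itself a list of its columns, so the same lapply call works on it directly (a sketch assuming mydat exists):
# One frequency table per column of mydat
count_tables <- lapply(mydat, function(x) as.data.frame(table(x)))
count_tables$x1  # counts for mydat$x1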
As I read your question, you want to count the number of times each unique element occurs in a set, with minimal re-typing across many sets.
To do this, you'll first need to put the sets into a single object, e.g. into a list:
list_of_sets <- list(numbers1 = c(4,23,4,23,5,43,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435),
numbers2 = c(4,23,4,23,5,44,54,56,657,67,67,435,
453,435,324,34,456,56,567,65,34,435))
Then you loop over each list element, e.g. using a for loop:
list_of_counts <- list()
for(i in seq_along(list_of_sets)){
list_of_counts[[i]] <- as.data.frame(table(list_of_sets[[i]]))
}
list_of_counts then contains the results:
[[1]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 43 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1
[[2]]
Var1 Freq
1 4 2
2 5 1
3 23 2
4 34 2
5 44 1
6 54 1
7 56 2
8 65 1
9 67 2
10 324 1
11 435 3
12 453 1
13 456 1
14 567 1
15 657 1

Add values of one group into another group in R

I have a question on how to add the value from one row of a group to the rest of the rows in the group and then delete that row. For example:
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
Day=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
value=c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
In the above example, my data is grouped by Year, Cluster, Seed and Day. Each Seed = 99 value needs to be added to the rows that share its (Year, Cluster, Day) group, and the Seed = 99 row then deleted. For example, row 16 is part of the (Year = 1, Cluster = a, Day = 1, Seed = 99) group, so its value of 55 should be added to row 1 (5 + 55), row 6 (6 + 55) and row 11 (2 + 55), and row 16 should then be deleted. Row 21, however, which is in Cluster c with Seed = 99, should remain in the data as-is because there is no matching Year + Cluster + Day combination.
My actual data has about 1 million records, with 10 years, 80 clusters, 500 days and 10 + 1 seeds (1 to 10, plus 99), so I am looking for an efficient solution. The expected output is:
Year Cluster Seed Day value
1 1 a 1 1 60
2 1 a 1 2 68
3 1 a 1 3 78
4 1 a 1 4 90
5 1 a 1 5 107
6 1 a 2 1 61
7 1 a 2 2 73
8 1 a 2 3 86
9 1 a 2 4 91
10 1 a 2 5 104
11 1 a 3 1 57
12 1 a 3 2 67
13 1 a 3 3 79
14 1 a 3 4 96
15 1 a 3 5 105
16 1 c 99 1 10
17 2 b 1 1 60
18 2 b 1 2 68
19 2 b 1 3 78
20 2 b 1 4 90
21 2 b 1 5 107
22 2 b 2 1 61
23 2 b 2 2 73
24 2 b 2 3 86
25 2 b 2 4 91
26 2 b 2 5 104
27 2 b 3 1 57
28 2 b 3 2 67
29 2 b 3 3 79
30 2 b 3 4 96
31 2 b 3 5 105
32 2 d 99 1 10
A data.table approach: within each (Year, Cluster, Day) group, add the Seed-99 value to every other row; the flag column marks Seed-99 rows that are alone in their group, so those are kept while the matched Seed-99 rows are dropped:
library(data.table)
df <- setDT(df)[, `:=` (value = ifelse(Seed != 99, value + value[Seed == 99], value),
                        flag = Seed == 99 & .N == 1),
                by = .(Year, Cluster, Day)][!(Seed == 99 & flag == FALSE), ][, "flag" := NULL]
Output:
df[]
Year Cluster Seed Day value
1: 1 a 1 1 60
2: 1 a 1 2 68
3: 1 a 1 3 78
4: 1 a 1 4 90
5: 1 a 1 5 107
6: 1 a 2 1 61
7: 1 a 2 2 73
8: 1 a 2 3 86
9: 1 a 2 4 91
10: 1 a 2 5 104
11: 1 a 3 1 57
12: 1 a 3 2 67
13: 1 a 3 3 79
14: 1 a 3 4 96
15: 1 a 3 5 105
16: 1 c 99 1 10
17: 2 b 1 1 60
18: 2 b 1 2 68
19: 2 b 1 3 78
20: 2 b 1 4 90
21: 2 b 1 5 107
22: 2 b 2 1 61
23: 2 b 2 2 73
24: 2 b 2 3 86
25: 2 b 2 4 91
26: 2 b 2 5 104
27: 2 b 3 1 57
28: 2 b 3 2 67
29: 2 b 3 3 79
30: 2 b 3 4 96
31: 2 b 3 5 105
32: 2 d 99 1 10
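Given the million-row requirement, an update join is another data.table option worth trying (a sketch, not part of the original answer; it adds each group's Seed-99 value to the other rows in place and then drops only the Seed-99 rows that found a match):
library(data.table)
setDT(df)
seeds <- df[Seed == 99]
# Update join: rows matching a seed row on (Year, Cluster, Day) get the
# seed's value added; the seed row itself adds 0 and stays unchanged
df[seeds, on = .(Year, Cluster, Day), value := value + (Seed != 99) * i.value]
# Keep Seed-99 rows only when they are alone in their group
df[, single := .N == 1L, by = .(Year, Cluster, Day)]
df <- df[Seed != 99 | single == TRUE][, single := NULL]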
Here's an approach using the tidyverse. If you're looking for speed with a million rows, a data.table solution will probably perform better.
library(tidyverse)
df <- data.frame(Year=c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2),
Cluster=c("a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","a","c","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","b","d"),
Seed=c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,99,99,99,99,99,99),
Day=c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1),
value=c(5,2,1,2,8,6,7,9,3,5,2,1,2,8,6,55,66,77,88,99,10))
seeds <- df %>%
  filter(Seed == 99)
matches <- df %>%
  filter(Seed != 99) %>%
  inner_join(select(seeds, -Seed), by = c("Year", "Cluster", "Day")) %>%
  mutate(value = value.x + value.y) %>%
  select(Year, Cluster, Seed, Day, value)
no_matches <- anti_join(seeds, matches, by = c("Year", "Cluster", "Day"))
bind_rows(matches, no_matches) %>%
  arrange(Year, Cluster, Seed, Day)
#> Year Cluster Seed Day value
#> 1 1 a 1 1 60
#> 2 1 a 1 2 68
#> 3 1 a 1 3 78
#> 4 1 a 1 4 90
#> 5 1 a 1 5 107
#> 6 1 a 2 1 61
#> 7 1 a 2 2 73
#> 8 1 a 2 3 86
#> 9 1 a 2 4 91
#> 10 1 a 2 5 104
#> 11 1 a 3 1 57
#> 12 1 a 3 2 67
#> 13 1 a 3 3 79
#> 14 1 a 3 4 96
#> 15 1 a 3 5 105
#> 16 1 c 99 1 10
#> 17 2 b 1 1 60
#> 18 2 b 1 2 68
#> 19 2 b 1 3 78
#> 20 2 b 1 4 90
#> 21 2 b 1 5 107
#> 22 2 b 2 1 61
#> 23 2 b 2 2 73
#> 24 2 b 2 3 86
#> 25 2 b 2 4 91
#> 26 2 b 2 5 104
#> 27 2 b 3 1 57
#> 28 2 b 3 2 67
#> 29 2 b 3 3 79
#> 30 2 b 3 4 96
#> 31 2 b 3 5 105
#> 32 2 d 99 1 10
Created on 2018-11-23 by the reprex package (v0.2.1)

Merge two data frames based on a condition in R

I have the following 2 data frames that I want to merge:
x <- data.frame(a= 1:11, b =3:13, c=2:12, d=7:17, invoice = 1:11)
> x
a b c d invoice
1 3 2 7 1
2 4 3 8 2
3 5 4 9 3
4 6 5 10 4
5 7 6 11 5
6 8 7 12 6
7 9 8 13 7
8 10 9 14 8
9 11 10 15 9
10 12 11 16 10
11 13 12 17 11
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
> y
nr invoice
100 1
101 1
102 1
103 1
104 1
105 1
106 1
107 1
108 1
109 1
110 1
111 2
112 2
113 2
114 2
115 2
116 2
117 2
I want to merge the letters from data frame x into data frame y where the invoice number is the same. It should start with the value from column a, then b, etc., cycling through the columns until the invoice number changes, and then switch to the values for invoice number 2.
The output should look like this:
> output
nr invoice letter_count
100 1 1
101 1 3
102 1 2
103 1 7
104 1 1
105 1 3
106 1 2
107 1 7
108 1 1
109 1 2
110 1 7
111 2 2
112 2 4
113 2 3
114 2 8
115 2 2
116 2 4
I tried to use the merge function with the by argument, but this produced an error that the number of rows is not the same. Any help would be appreciated.
Here is a solution using the purrr package.
# Prepare the data frames
x <- data.frame(a = 1:11, b = 3:13, c = 2:12, d = 7:17, invoice = 1:11)
y <- data.frame(nr = 100:125, invoice = 1)
y$invoice[12:26] <- 2
# Load package
library(purrr)
# Split the data based on invoice
y_list <- split(y, f = y$invoice)
# Design a function to transfer data
trans_fun <- function(main_df, letter_df = x){
  # Get the invoice number
  temp_num <- unique(main_df$invoice)
  # Extract letter_count information from x
  add_vec <- unlist(letter_df[letter_df$invoice == temp_num, 1:4])
  # Get the remainder of nrow(main_df) divided by length(add_vec)
  remain_num <- nrow(main_df) %% length(add_vec)
  # Get the integer multiple of length(add_vec) in nrow(main_df)
  multiple_num <- nrow(main_df) %/% length(add_vec)
  # Create the entire sequence to add
  add_seq <- rep(add_vec, multiple_num + 1)
  add_seq2 <- add_seq[1:(length(add_seq) - (length(add_vec) - remain_num))]
  # Add the new column, add_seq2, to main_df
  main_df$letter_count <- add_seq2
  return(main_df)
}
# Apply the trans_fun function using map_df
output <- map_df(y_list, .f = trans_fun)
# See the result
output
nr invoice letter_count
1 100 1 1
2 101 1 3
3 102 1 2
4 103 1 7
5 104 1 1
6 105 1 3
7 106 1 2
8 107 1 7
9 108 1 1
10 109 1 3
11 110 1 2
12 111 2 2
13 112 2 4
14 113 2 3
15 114 2 8
16 115 2 2
17 116 2 4
18 117 2 3
19 118 2 8
20 119 2 2
21 120 2 4
22 121 2 3
23 122 2 8
24 123 2 2
25 124 2 4
26 125 2 3
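The same recycling logic can also be written compactly in base R with rep(length.out = ...) (a sketch offered as an alternative, not part of the original answer):
# For each invoice, recycle the four letter values over the matching rows
y$letter_count <- unlist(lapply(split(y, y$invoice), function(d) {
  vals <- unlist(x[x$invoice == unique(d$invoice), c("a", "b", "c", "d")])
  rep(vals, length.out = nrow(d))
}))
y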

dplyr append group id sequence?

I have a dataset like the one below; it was created by dplyr and is currently grouped by 'Stage'. How do I generate a sequence based on the unique, incremental values of Stage, starting from 1? (E.g., row #4 should be 1, and rows #1 and #8 should both be 4.)
X Y Stage Count
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
I tried the approach in the post below, but it didn't work:
how to mutate a column with ID in group
Thanks.
Here is another dplyr solution:
> df
# A tibble: 11 × 4
X Y Stage Count
<dbl> <dbl> <dbl> <dbl>
1 61 74 1 2
2 58 56 2 1
3 78 76 0 1
4 100 100 -2 1
5 89 88 -1 1
6 47 44 3 1
7 36 32 4 1
8 75 58 1 2
9 24 21 5 1
10 12 11 6 1
11 0 0 10 1
To create the group IDs, use dplyr's group_indices:
i <- df %>% group_indices(Stage)
df %>% mutate(group = i)
# A tibble: 11 × 5
X Y Stage Count group
<dbl> <dbl> <dbl> <dbl> <int>
1 61 74 1 2 4
2 58 56 2 1 5
3 78 76 0 1 3
4 100 100 -2 1 1
5 89 88 -1 1 2
6 47 44 3 1 6
7 36 32 4 1 7
8 75 58 1 2 4
9 24 21 5 1 8
10 12 11 6 1 9
11 0 0 10 1 10
It would be great if you could pipe both commands together. But, as of this writing, it doesn't appear to be possible.
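(Since this answer was written, dplyr 1.0.0 added cur_group_id(), which does make a single pipe possible; a sketch assuming dplyr >= 1.0.0:)
library(dplyr)
df %>%
  group_by(Stage) %>%
  mutate(group = cur_group_id()) %>%
  ungroup()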
After some experimenting, I did %>% ungroup() %>% mutate(test = rank(Stage)), which yields the following result.
X Y Stage Count test
1 100 100 -2 1 1.0
2 89 88 -1 1 2.0
3 78 76 0 1 3.0
4 61 74 1 2 4.5
5 75 58 1 2 4.5
6 58 56 2 1 6.0
7 47 44 3 1 7.0
8 36 32 4 1 8.0
9 24 21 5 1 9.0
10 12 11 6 1 10.0
11 0 0 10 1 11.0
I don't know whether this is the best approach, feel free to comment....
Update
Another approach, assuming the data is called Node:
lvs <- levels(as.factor(Node$Stage))
Node %>% mutate(Rank = match(Stage,lvs))
