Why decile values are incorrect using the cut function - r

I tried to attach a decile value to each observation using the code below. However, it seems that the values are not correct. What could be the reason for that?
df<-read.table(text="pregnant glucose blood skin INSULIN MASS DIAB AGE CLASS predict_probability
1 106 70 28 135 34.2 0.142 22 0 0.15316285
1 91 54 25 100 25.2 0.234 23 0 0.05613959
4 136 70 0 0 31.2 1.182 22 1 0.54034794
9 164 78 0 0 32.8 0.148 45 1 0.64361578
3 173 78 39 185 33.8 0.970 31 1 0.79185196
11 136 84 35 130 28.3 0.260 42 1 0.31927737
0 141 84 26 0 32.4 0.433 22 0 0.41609308
3 106 72 0 0 25.8 0.207 27 0 0.10460090
9 145 80 46 130 37.9 0.637 40 1 0.67061324
10 111 70 27 0 27.5 0.141 40 1 0.16152296
",header=T)
deciles <- cut(df$predict_probability,
               breaks = quantile(df$predict_probability, probs = seq(0, 1, by = 0.10)),
               labels = 1:10, include.lowest = TRUE)
df1 <- cbind(df,deciles)
head(df1,10)
pregnant glucose blood skin INSULIN MASS DIAB AGE CLASS predict_probability deciles
1 1 106 70 28 135 34.2 0.142 22 0 0.15316285 3
2 1 91 54 25 100 25.2 0.234 23 0 0.05613959 1
3 4 136 70 0 0 31.2 1.182 22 1 0.54034794 7
4 9 164 78 0 0 32.8 0.148 45 1 0.64361578 8
5 3 173 78 39 185 33.8 0.970 31 1 0.79185196 10
6 11 136 84 35 130 28.3 0.260 42 1 0.31927737 5
7 0 141 84 26 0 32.4 0.433 22 0 0.41609308 6
8 3 106 72 0 0 25.8 0.207 27 0 0.10460090 2
9 9 145 80 46 130 37.9 0.637 40 1 0.67061324 9
10 10 111 70 27 0 27.5 0.141 40 1 0.16152296 4

Per Dason's proposal, here is the full answer to the question.
The quantile() call should be removed, so that seq(0, 1, by = 0.1) is passed directly to cut() as the breaks.
deciles <- cut(df$predict_probability, seq(0, 1, by = 0.1), labels = 1:10, include.lowest = TRUE)
df1 <- cbind(df,deciles)
head(df1,10)
pregnant glucose blood skin INSULIN MASS DIAB AGE CLASS predict_probability deciles
1 1 106 70 28 135 34.2 0.142 22 0 0.15316285 2
2 1 91 54 25 100 25.2 0.234 23 0 0.05613959 1
3 4 136 70 0 0 31.2 1.182 22 1 0.54034794 6
4 9 164 78 0 0 32.8 0.148 45 1 0.64361578 7
5 3 173 78 39 185 33.8 0.970 31 1 0.79185196 8
6 11 136 84 35 130 28.3 0.260 42 1 0.31927737 4
7 0 141 84 26 0 32.4 0.433 22 0 0.41609308 5
8 3 106 72 0 0 25.8 0.207 27 0 0.10460090 2
9 9 145 80 46 130 37.9 0.637 40 1 0.67061324 7
10 10 111 70 27 0 27.5 0.141 40 1 0.16152296 2
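Note that the two approaches answer different questions: quantile-based breaks give rank-based deciles (roughly equal-sized groups), while seq(0, 1, by = 0.1) bins on the absolute probability scale, so several observations can share a bin. Below is a minimal sketch contrasting the two, assuming the df above and the dplyr package (the decile_rank column is only an illustration, not part of the original question).
library(dplyr)

# Rank-based deciles: ten (approximately) equal-sized groups by predict_probability.
df$decile_rank <- ntile(df$predict_probability, 10)

# Fixed-width bins on the 0-1 probability scale, as in the answer above.
df$decile_fixed <- cut(df$predict_probability, seq(0, 1, by = 0.1),
                       labels = 1:10, include.lowest = TRUE)

head(df, 10)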

Related

Mean imputation method: graph representation?

Good morning,
I'm dealing with the graphical representation of missing data imputed via the mean imputation method. This is the dataset I'm working on:
> data2
age fev ht sex smoke
1 9 1.708 145 1 1
2 8 1.724 171 1 1
3 7 1.720 138 1 1
4 9 1.558 135 2 1
5 9 1.895 145 2 1
6 8 2.336 155 1 1
7 6 1.919 147 1 1
8 6 1.415 142 1 1
9 8 1.987 149 1 1
10 9 1.942 152 1 1
11 6 1.602 135 1 1
12 8 1.735 137 2 1
13 8 2.193 149 1 1
14 8 2.118 154 2 1
15 8 2.258 147 2 1
16 7 1.932 135 2 1
17 5 1.472 127 2 1
18 6 1.878 NA 1 1
19 9 2.352 150 2 1
20 9 2.604 156 2 1
21 5 1.400 124 1 1
22 5 1.256 133 1 1
23 4 0.839 122 1 1
24 7 2.578 159 2 1
25 9 2.988 165 1 1
26 3 1.404 131 2 1
27 9 2.348 152 2 1
28 5 1.755 132 2 1
29 8 2.980 152 1 1
30 9 2.100 152 1 1
31 5 1.282 124 1 1
32 9 3.000 166 2 1
33 8 2.673 152 1 1
34 7 2.093 146 1 1
35 5 1.612 132 1 1
36 8 2.175 150 1 1
37 9 2.725 150 2 1
38 8 2.071 140 2 1
39 8 1.547 145 2 1
40 8 2.004 145 2 1
41 9 3.135 152 1 1
42 8 2.420 150 2 1
43 5 1.776 130 2 1
44 8 1.931 145 1 1
45 5 1.343 127 1 1
46 9 2.076 145 1 1
47 7 1.624 137 2 1
48 8 1.344 133 1 1
49 6 1.650 140 2 1
50 8 2.732 154 2 1
51 5 2.017 138 2 1
52 9 2.797 156 1 1
53 9 NA 157 2 1
54 8 1.703 138 2 1
55 6 1.634 137 2 1
56 9 2.570 145 2 1
57 9 3.016 159 1 1
58 7 2.419 152 1 1
59 4 1.569 127 1 1
60 8 1.698 146 1 1
61 8 2.123 152 2 1
62 8 2.481 152 1 1
63 6 1.481 130 1 1
64 4 1.577 124 1 1
65 8 1.940 150 2 1
66 6 1.747 146 2 1
67 9 2.069 147 2 1
68 7 1.631 141 1 1
69 5 1.536 132 1 1
70 9 2.560 154 1 1
71 8 1.962 145 2 1
72 8 2.531 147 1 1
73 9 2.715 152 2 1
74 9 2.457 150 2 1
75 9 2.090 151 2 1
76 7 1.789 142 2 1
77 5 1.858 135 2 1
78 5 1.452 130 2 1
79 9 NA 175 2 1
80 6 1.719 135 1 1
81 7 2.111 145 1 1
82 6 1.695 135 1 1
83 8 2.211 160 2 1
84 8 1.794 138 2 1
85 7 1.917 147 1 1
86 8 2.144 NA 1 1
87 7 1.253 132 2 1
88 9 2.659 156 2 1
89 5 1.580 133 2 1
90 9 2.126 157 2 1
91 9 3.029 156 1 1
92 9 2.964 164 2 1
93 7 1.611 NA 2 1
94 8 2.215 152 1 1
95 8 2.388 152 1 1
96 9 2.196 155 2 1
97 9 1.751 147 2 1
98 9 2.165 156 2 1
99 7 1.682 140 2 1
100 8 1.523 140 2 1
101 8 1.292 132 1 1
102 7 1.649 137 2 1
103 9 2.588 160 2 1
104 4 0.796 119 2 1
105 9 2.574 154 1 1
106 6 1.979 142 2 1
107 8 2.354 149 2 1
108 6 1.718 140 2 1
109 7 1.742 149 1 1
110 7 1.603 130 1 1
111 8 2.639 151 1 1
112 7 1.829 137 1 1
113 7 2.084 147 2 1
114 7 2.220 147 2 1
115 7 1.473 133 1 1
116 8 2.341 154 1 1
117 7 1.698 138 1 1
118 5 1.196 118 1 1
119 8 1.872 144 1 1
120 7 2.219 140 2 1
121 9 2.420 145 2 1
122 7 1.827 138 1 1
123 7 1.461 137 1 1
124 6 1.338 NA 2 1
125 8 2.090 145 2 1
126 8 1.697 150 1 1
127 8 1.562 140 2 1
128 9 2.040 141 1 1
129 7 1.609 131 1 1
130 8 2.458 155 1 1
131 9 2.650 161 2 1
132 8 1.429 146 2 1
133 8 1.675 135 2 1
134 9 1.947 144 1 1
135 8 2.069 137 2 1
136 6 1.572 132 2 1
137 6 1.348 135 2 1
138 8 2.288 156 1 1
139 9 1.773 149 2 1
140 5 0.791 132 1 1
141 7 1.905 147 2 1
142 9 2.463 155 1 1
143 6 1.431 130 2 1
144 9 2.631 157 1 1
145 9 3.114 164 2 1
146 9 2.135 149 2 1
147 6 1.527 133 2 1
148 8 2.293 147 1 1
149 9 3.042 168 1 1
150 8 2.927 161 2 1
151 8 2.665 163 1 1
152 9 2.301 149 2 1
153 9 2.460 163 2 1
154 9 2.592 154 1 1
155 7 1.750 140 1 1
156 8 1.759 135 2 1
157 6 1.536 122 2 1
158 9 2.259 149 1 1
159 9 2.048 164 1 1
160 9 2.571 154 2 1
161 7 2.046 142 2 1
162 8 1.780 149 1 1
163 5 1.552 137 1 1
164 8 1.953 147 1 1
165 9 2.893 164 2 1
166 6 1.713 128 2 1
167 9 2.851 152 1 1
168 6 1.624 131 2 1
169 8 2.631 150 2 1
170 5 1.819 135 2 1
171 7 1.658 135 2 1
172 7 2.158 136 2 1
173 4 1.789 132 2 1
174 9 3.004 163 1 1
175 8 2.503 160 2 1
176 9 1.933 147 1 1
177 9 2.091 149 1 1
178 9 2.316 NA 1 1
179 5 1.704 NA 1 1
180 9 1.606 146 1 1
181 7 1.165 119 2 1
182 6 2.102 141 1 1
183 9 2.320 145 1 1
184 9 2.230 155 2 1
185 9 1.716 141 2 1
186 7 1.790 136 2 1
187 5 1.146 127 1 1
188 8 2.187 156 1 1
189 9 2.717 156 2 1
190 7 1.796 140 2 1
191 9 1.953 147 2 2
192 8 1.335 144 1 1
193 9 2.119 145 2 1
194 6 1.666 132 2 1
195 6 1.826 133 2 1
196 8 2.709 159 1 1
197 9 2.871 165 2 1
198 5 1.092 127 1 1
199 6 2.262 146 2 1
200 6 2.104 144 2 1
I've used the following code to plot the observed vs. the imputed data and, beside it, a scatterplot of Y = "fev" versus X = "age".
1. FIRST GRAPH
library(lattice)
par(mfrow=c(1,2))
breaks <- seq(-20, 200, 10)
nudge <- 1
lwd <- 1.5
x <- matrix(c(breaks-nudge, breaks+nudge), ncol=2, nrow = 46)
obs <- data2[,"fev"]
mis <- imp$imp$fev[,1]
fobs <- c(hist(obs, breaks, plot=FALSE)$fev, 0)
fmis <- c(hist(mis, breaks, plot=FALSE)$fev, 0)
y <- matrix(c(fobs, fmis), ncol=2, nrow = 46)
matplot(x, y, type="s",
col=c(mdc(4),mdc(5)), lwd=2, lty=1,
xlim = c(0, 150), ylim = c(0,40), yaxs = "i",
xlab="fev",
ylab="Frequency")
box()
2. SECOND GRAPH
tp <- xyplot(imp, fev ~ age, na.groups=ici(imp),
ylab="fev", xlab="age",
cex = 0.75, lex=lwd, pch=19,
ylim = c(-20, 180), xlim = c(0,350))
print(tp, newpage = FALSE, position = c(0.48,0.08,1,0.92))
Although the code runs, I'm not sure about its validity: I'm supposed to get back graphs like those in the first attached image, whereas I keep getting graphs like those in the second attached image.
What do you think? Any clue as to how to get the right code?
Thanks for helping.
You didn't post the complete code, so it isn't entirely clear what your imp data, which you are trying to plot, looks like. The data you posted is named data2, but I don't really know at which point in your code it is used.
As for reasons why your code might not show anything: fev ranges roughly from 0 to 3, and age roughly from 1 to 10.
But the axis limits in the first plot are:
xlim = c(0, 150), ylim = c(0,40)
and in the second plot:
ylim = c(-20, 180), xlim = c(0,350)
This means the actual data you want to plot occupies only a small area of the plot (as you can see).
You have to adjust your axis limits to the range of your data.
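For example, the limits can be derived from the data rather than hard-coded. Below is a minimal sketch of that idea, under the assumption that imp is a mids object produced by mice::mice(data2), so that mdc(), ici() and the xyplot() method for mids objects from the mice package are available; it only illustrates data-driven limits and is not the exact code from the question.
library(mice)
library(lattice)

# First graph: base the histogram breaks and axis limits on the observed
# fev values (roughly 0.8 to 3.1) instead of a -20 to 200 scale.
obs    <- data2$fev
breaks <- seq(0, 3.5, by = 0.25)
fobs   <- c(hist(obs, breaks, plot = FALSE)$counts, 0)

plot(breaks, fobs, type = "s", col = mdc(4), lwd = 2,
     xlim = range(breaks), ylim = c(0, max(fobs) + 2),
     xlab = "fev", ylab = "Frequency")

# Second graph: let the scatterplot limits follow the range of the data.
xyplot(imp, fev ~ age, na.groups = ici(imp),
       xlab = "age", ylab = "fev",
       xlim = range(data2$age, na.rm = TRUE) + c(-1, 1),
       ylim = range(data2$fev, na.rm = TRUE) + c(-0.5, 0.5),
       pch = 19, cex = 0.75)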

How to test for p-value with groups/filters in dplyr

My data looks like the example below. (sorry if it's too long, not sure what's acceptable/needed).
I have used the following code to calculate the median and IQR of each time difference (tdif) between tests (testno):
data %>% group_by(testno) %>% filter(type == 1) %>%
  summarise(Median = median(tdif), IQR = IQR(tdif), n = n(), .groups = 'keep') -> result
I have done this for each category of 'type' (coded 1-10), which produced the added table (bottom).
My question is whether it is possible to:
1. Do this in an easier way (without the filters, so I can do it all in one run), and
2. Run a test for a p-value across all the groups/filters?
data <- read.table(header=T, text= '
PID time tdif testno type
3 205 0 1 1
4 77 0 1 1
4 85 8 2 1
4 126 41 3 1
4 165 39 4 1
4 202 37 5 1
4 238 36 6 1
4 272 34 7 1
4 277 5 8 1
4 370 93 9 1
4 397 27 10 1
4 452 55 11 1
4 522 70 12 1
4 529 7 13 1
4 608 79 14 1
4 651 43 15 1
4 655 4 16 1
4 713 58 17 1
4 804 91 18 1
4 900 96 19 1
4 944 44 20 1
4 979 35 21 1
4 1015 36 22 1
4 1051 36 23 1
4 1077 26 24 1
4 1124 47 25 1
4 1162 38 26 1
4 1222 60 27 1
4 1334 112 28 1
4 1383 49 29 1
4 1457 74 30 1
4 1506 49 31 1
4 1590 84 32 1
4 1768 178 33 1
4 1838 70 34 1
4 1880 42 35 1
4 1915 35 36 1
4 1973 58 37 1
4 2017 44 38 1
4 2090 73 39 1
4 2314 224 40 1
4 2381 67 41 1
4 2433 52 42 1
4 2484 51 43 1
4 2694 210 44 1
4 2731 37 45 1
4 2792 61 46 1
4 2958 166 47 1
5 48 0 1 3
5 111 63 2 3
5 699 588 3 3
5 1077 378 4 3
6 -43 0 1 3
8 67 0 1 1
8 168 101 2 1
8 314 146 3 1
8 368 54 4 1
8 586 218 5 1
10 639 0 1 6
13 -454 0 1 3
13 -384 70 2 3
13 -185 199 3 3
13 193 378 4 3
13 375 182 5 3
13 564 189 6 3
13 652 88 7 3
13 669 17 8 3
13 718 49 9 3
14 704 0 1 8
15 -165 0 1 3
15 -138 27 2 3
15 1335 1473 3 3
16 168 0 1 6
18 -1329 0 1 3
18 -1177 152 2 3
18 -1071 106 3 3
18 -945 126 4 3
18 -834 111 5 3
18 -719 115 6 3
18 -631 88 7 3
18 -497 134 8 3
18 -376 121 9 3
18 -193 183 10 3
18 -78 115 11 3
18 -13 65 12 3
18 100 113 13 3
18 196 96 14 3
18 552 356 15 3
18 650 98 16 3
18 737 87 17 3
18 804 67 18 3
18 902 98 19 3
18 983 81 20 3
18 1119 136 21 3
19 802 0 1 1
19 1593 791 2 1
26 314 0 1 8
26 389 75 2 8
26 597 208 3 8
33 639 0 1 6
Added table (values differ from the example data, because this isn't the complete set).
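For what it's worth, here is a minimal sketch of how both points might be approached, assuming the data frame above. Grouping by type and testno together removes the need for one filtered run per type, and a Kruskal-Wallis test is shown as one possible choice (an assumption, not the only option) for an overall p-value for differences in tdif between the type groups.
library(dplyr)

# Point 1: one run instead of ten filtered runs, by grouping on type as well.
result <- data %>%
  group_by(type, testno) %>%
  summarise(Median = median(tdif), IQR = IQR(tdif), n = n(), .groups = "keep")

# Point 2: one possible overall test - does tdif differ between the type groups?
kruskal.test(tdif ~ factor(type), data = data)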

Selecting closest subset of variables in R based on the highest and lowest values from another variable

I have this file:
id egs aol pH marm fat PCQ
1 12 59 5.52 420 3 216
2 9 70 5.38 330 2 225
3 8 50 5.54 360 2 225
4 10 58 5.48 340 3 227
5 12 57 5.41 400 2 227
6 10 62 5.32 350 3 231
7 18 66 5.49 350 2 234
8 13 57 5.49 320 3 241
9 11 61 5.65 400 3 242
10 9 69 5.43 320 2 246
11 11 60 5.52 390 2 248
12 19 66 5.48 330 2 281
13 11 73 5.53 380 2 283
14 20 72 5.46 370 3 284
15 9 71 5.36 500 3 284
16 11 63 5.50 380 3 285
17 15 68 5.33 NA NA 286
18 17 58 5.43 340 3 288
19 17 61 5.44 350 2 291
20 11 73 6.15 340 3 292
21 19 67 5.67 390 2 296
22 13 76 5.53 360 3 297
23 15 64 5.75 320 3 300
24 11 69 5.75 NA NA 300
25 15 68 5.47 390 3 317
And I need to select the three highest and three lowest values of PCQ, plus the closest values within egs, aol, pH, marm, fat. So, I would select this:
id egs aol pH marm fat PCQ
1 12 59 5.52 420 3 216
5 12 57 5.41 400 2 227
7 18 66 5.49 350 2 234
21 19 67 5.67 390 2 296
22 13 76 5.53 360 3 297
25 15 68 5.47 390 3 317
OBS: this is just an example; I don't need to select exactly these values, as long as the logic remains.
I tried commands based on dplyr like this:
library(dplyr)
date %>%
arrange((PCQ)) %>% slice(1:3) %>%
filter(PCQ - lag(PCQ, default = 0))
but I was unable to move forward. Any suggestions?
OBS: a Unix solution would work too. Thanks
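As a starting point, here is a minimal sketch of the first half of the task (the three lowest and three highest PCQ rows), assuming the file has been read into a data frame called df and that dplyr (version 1.0 or later, for slice_min()/slice_max()) is available. The "closest values within egs, aol, pH, marm, fat" criterion would still need a precise definition (for example, smallest distance to the column means) before it can be coded.
library(dplyr)

# Three rows with the lowest PCQ and three rows with the highest PCQ.
extremes <- bind_rows(
  df %>% slice_min(PCQ, n = 3),
  df %>% slice_max(PCQ, n = 3)
) %>%
  arrange(PCQ)

extremes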

dividing all values in one column by values in a separate data frame (colnames match)

I need to divide every value in a given column of the first data frame by the value under the corresponding column name in the second data frame. For example, I need to divide every value in the 0 column of demand_copy by 25.5, every value in the 1 column by 13.0, etc., and get an output with the same structure as the first data frame.
How would one do this in R?
> head(demand_copy)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
1 25 9 14 3 10 10 28 175 406 230 155 151 202 167 179 185 275 298 280 185 110 84 93 51
2 36 17 9 3 2 7 32 88 110 131 89 125 149 165 161 147 178 309 339 201 115 78 67 39
3 10 3 5 10 0 11 15 58 129 110 49 62 62 100 70 73 72 86 116 61 49 37 26 22
4 24 15 10 5 3 4 39 53 108 98 80 118 116 110 135 158 157 196 176 132 118 94 91 102
5 40 45 15 9 16 37 75 205 497 527 362 287 316 353 359 309 365 653 598 468 328 242 168 102
6 0 0 1 2 0 0 11 56 26 12 21 6 27 15 18 5 14 19 25 6 4 0 1 0
> medians
medians
0 25.5
1 13.0
2 8.0
3 4.0
4 4.0
5 10.0
6 38.5
7 106.5
8 205.5
9 164.0
10 111.5
11 130.5
12 160.0
13 164.5
14 170.0
15 183.0
16 202.0
17 282.0
18 256.5
19 178.0
20 109.0
21 80.0
22 60.5
23 41.0
You could use
t(t(demand_copy) / medians[, 1])
or
sweep(demand_copy, 2, medians[, 1], "/")
Notice that the first approach returns a matrix, whereas the second one returns a data frame.
I also suggest storing medians as a vector rather than as a data frame with a single column. Then you could use medians instead of medians[, 1] in the two lines above.
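To illustrate that suggestion, here is a small sketch that turns the one-column medians data frame into a named vector and matches divisors to columns by name, assuming the column names of demand_copy really are the "0" to "23" labels shown above.
# Named vector of divisors, keyed by the original row names ("0" ... "23").
med <- setNames(medians[, 1], rownames(medians))

# Divide each column of demand_copy by the divisor with the matching name.
scaled <- sweep(demand_copy, 2, med[colnames(demand_copy)], "/")

head(scaled)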

R, correlation in p-values

I'm quite new to R and spending a lot of time solving issues...
I have a big table (named mydata) containing more than 14k columns. This is a short view...
Latitude comp48109 comp48326 comp48827 comp49708 comp48407 comp48912
59.8 21 29 129 440 23 13
59.8 18 23 32 129 19 34
59.8 19 27 63 178 23 27
53.1 21 28 0 0 26 10
53.1 15 21 129 423 25 36
53.1 18 44 44 192 26 42
48.7 14 32 0 0 17 42
48.7 11 26 0 0 20 33
48.7 24 37 0 0 26 20
43.6 34 40 1 3 23 4
43.6 19 28 0 1 26 33
43.6 19 35 0 0 14 3
41.4 22 67 253 1322 15 4
41.4 44 39 0 0 11 14
41.4 24 41 63 174 12 4
39.5 21 45 102 291 12 17
39.5 17 26 69 300 16 79
39.5 13 46 151 526 14 14
Although I managed to get the correlation scores for the first column ("Latitude") against the others with
corrScores <- cor(Latitude, mydata[2:14429])
I need to get a list of the p-values by applying the function cor.test(x, y, ...)$p.value.
How can I do that without getting the error 'x' and 'y' must have the same length?
You can use sapply:
sapply(mydata[-1], function(y) cor.test(mydata$Latitude, y)$p.value)
# comp48109 comp48326 comp48827 comp49708 comp48407 comp48912
# 0.331584624 0.020971913 0.663194866 0.544407919 0.005375973 0.656831836
Here, mydata[-1] means: All columns of mydata except the first one.
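If the correlation estimates and the p-values are wanted side by side, a minimal sketch building on the same sapply() idea (assuming mydata as above) is:
# One row per comp column: correlation estimate and its p-value against Latitude.
res <- t(sapply(mydata[-1], function(y) {
  ct <- cor.test(mydata$Latitude, y)
  c(estimate = unname(ct$estimate), p.value = ct$p.value)
}))

head(res)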