Filter rows having duplicate IDs [duplicate]


This question already has answers here:
Finding ALL duplicate rows, including "elements with smaller subscripts"
(9 answers)
Closed 5 years ago.
My data is like this:
dat <- read.table(header=TRUE, text="
ID Veh oct nov dec jan feb
1120 1 7 47 152 259 140
2000 1 5 88 236 251 145
2000 2 14 72 263 331 147
1133 1 6 71 207 290 242
2000 3 7 47 152 259 140
2002 1 5 88 236 251 145
2006 1 14 72 263 331 147
2002 2 6 71 207 290 242
")
dat
ID Veh oct nov dec jan feb
1 1120 1 7 47 152 259 140
2 2000 1 5 88 236 251 145
3 2000 2 14 72 263 331 147
4 1133 1 6 71 207 290 242
5 2000 3 7 47 152 259 140
6 2002 1 5 88 236 251 145
7 2006 1 14 72 263 331 147
8 2002 2 6 71 207 290 242
Using the duplicated function, I can get the following.
Rows with the first occurrence of each value in column 1:
dat[!duplicated(dat[,1]),]
ID Veh oct nov dec jan feb
1 1120 1 7 47 152 259 140
2 2000 1 5 88 236 251 145
4 1133 1 6 71 207 290 242
6 2002 1 5 88 236 251 145
7 2006 1 14 72 263 331 147
Rows that repeat an earlier value in column 1:
dat[duplicated(dat[,1]),]
ID Veh oct nov dec jan feb
3 2000 2 14 72 263 331 147
5 2000 3 7 47 152 259 140
8 2002 2 6 71 207 290 242
But I want to keep every row whose ID occurs more than once, including the first occurrence, like the following (which I am struggling to code):
ID Veh oct nov dec jan feb
2000 1 5 88 236 251 145
2000 2 14 72 263 331 147
2000 3 7 47 152 259 140
2002 1 5 88 236 251 145
2002 2 6 71 207 290 242

Try
dat[duplicated(dat[,1])|duplicated(dat[,1],fromLast=TRUE),]
# ID Veh oct nov dec jan feb
#2 2000 1 5 88 236 251 145
#3 2000 2 14 72 263 331 147
#5 2000 3 7 47 152 259 140
#6 2002 1 5 88 236 251 145
#8 2002 2 6 71 207 290 242
Or
library(data.table)
setDT(dat)[, .SD[.N>1], ID]
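
To see why the first expression works, here is a minimal sketch on a cut-down copy of the question's data (only the ID and Veh columns; the intermediate names later_copy, earlier_copy and in_dup_group are made up for illustration). duplicated() flags every occurrence after the first, duplicated(..., fromLast=TRUE) flags every occurrence before the last, and combining them with | flags every member of a repeated group.

```r
# Toy data mirroring the ID and Veh columns from the question
dat <- data.frame(
  ID  = c(1120, 2000, 2000, 1133, 2000, 2002, 2006, 2002),
  Veh = c(1, 1, 2, 1, 3, 1, 1, 2)
)

# TRUE for every occurrence of an ID after its first one
later_copy <- duplicated(dat$ID)

# TRUE for every occurrence of an ID before its last one
earlier_copy <- duplicated(dat$ID, fromLast = TRUE)

# A row belongs to a repeated group if either flag is set
in_dup_group <- later_copy | earlier_copy
dup_rows <- dat[in_dup_group, ]

# An equivalent formulation: keep rows whose ID appears among the repeated IDs
dup_rows2 <- dat[dat$ID %in% dat$ID[duplicated(dat$ID)], ]

print(dup_rows)
```

Both formulations keep the original row order and row names, so they can be compared directly with identical().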

Related

Is there an R function that turns a frequency table into a prop table?

What is the simplest way of turning a frequency data table into a prop table in R?
This is the data:
Time Total Blog News Social.Network Microblog Other Forums Pictures Video
1 15.KW 2022 1816 23 326 39 678 99 27 523 0
2 16.KW 2022 2535 32 690 42 815 135 26 644 1
3 17.KW 2022 2181 20 362 79 805 110 14 634 1
4 18.KW 2022 2583 19 895 25 692 127 6 658 0
5 19.KW 2022 2337 21 555 22 908 148 8 599 0
6 20.KW 2022 2091 23 392 18 851 119 5 554 0
7 21.KW 2022 1658 17 344 16 650 129 1 417 0
8 22.KW 2022 2476 24 798 24 937 150 7 443 0
9 23.KW 2022 1687 14 341 17 691 102 9 400 0
10 24.KW 2022 2476 21 521 29 984 110 19 509 0
11 25.KW 2022 2412 22 696 31 845 115 29 561 0
12 26.KW 2022 2197 22 715 13 709 128 59 445 0
13 27.KW 2022 2111 20 429 10 937 86 28 474 1
14 28.KW 2022 752 5 121 4 373 42 3 172 0

Your data frame df has a 2nd column called Total. It seems that you want to divide subsequent columns by this one.
df[-1] <- df[-1] / df$Total
After this, the 1st column Time does not change. 2nd column Total becomes 1. Other columns become proportions.
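
As a rough illustration of that one-liner, the sketch below uses a two-row excerpt with only some of the columns (the excerpt and the name prop are assumptions for the example, not part of the original answer). Dividing a data frame by a vector with one value per row divides each column element by element, so every row is scaled by its own Total.

```r
# Two-row excerpt of the question's data with a subset of the columns
df <- data.frame(
  Time  = c("15.KW 2022", "16.KW 2022"),
  Total = c(1816, 2535),
  Blog  = c(23, 32),
  News  = c(326, 690)
)

# Work on a copy so the original counts stay available
prop <- df

# Drop the first column (Time) and divide every remaining column by Total;
# the vector is applied down each column, i.e. one divisor per row
prop[-1] <- prop[-1] / prop$Total

print(prop)
```

If the Total column should keep its raw counts, one option is to select only the category columns on both sides of the assignment, for example prop[c("Blog", "News")] <- df[c("Blog", "News")] / df$Total.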

Pivot / Reshape data [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 2 years ago.
My sample data looks like this:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time (days) Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked into converting from wide to long format, but it doesn't seem to do the trick.
Kind regards, and thank you in advance.

Here is a base R option
u <- cbind(
  data[1],
  do.call(
    rbind,
    lapply(
      split.default(data[-1], ceiling(seq_along(data[-1]) / 2)),
      setNames,
      c("Value", "Time")
    )
  )
)
out <- `row.names<-`(
  subset(
    x <- u[order(u$pid), ],
    complete.cases(x)
  ), NULL
)
such that
> out
pid Value Time
1 1 1356 1435
2 1 1483 1405
3 1 1563 1374
4 2 943 1848
5 2 1173 1818
6 2 1300 1785
7 3 1590 185
8 3 1585 294
9 4 130 72
10 4 140 82
11 4 220 126
12 4 166 159
13 4 380 189
14 4 353 231
15 4 180 268
16 4 571 334
17 4 443 70
18 4 266 124
19 4 213 156
20 4 583 173
21 4 510 222
22 4 596 303
23 4 476 145
24 4 656 217
25 4 816 289
26 4 136 79
27 4 756 89
28 4 703 128
29 4 776 166
30 4 586 203
31 4 526 240
32 4 580 278
33 4 483 371
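
The same idea can be written out more explicitly. The sketch below is a simplified variant, not the answer's code: it assumes a two-row, two-pair excerpt of the data (named wide here) and builds one small data frame per measurement/Tdays pair by column name, then stacks the pieces, drops incomplete rows and sorts by pid.

```r
# Small excerpt of the wide data: two measurement/Tdays pairs only
wide <- data.frame(
  pid          = c(1, 3),
  measurement1 = c(1356, 1590),
  Tdays1       = c(1435, 185),
  measurement2 = c(1483, NA),
  Tdays2       = c(1405, NA)
)

# One piece per pair k, picking the columns by their numbered names
pieces <- lapply(1:2, function(k) {
  data.frame(
    pid   = wide$pid,
    Time  = wide[[paste0("Tdays", k)]],
    Value = wide[[paste0("measurement", k)]]
  )
})

# Stack the pieces, drop rows with missing values, and sort by pid
long <- do.call(rbind, pieces)
long <- long[complete.cases(long), ]
long <- long[order(long$pid), ]
rownames(long) <- NULL

print(long)
```

Selecting columns by name avoids relying on the pairs sitting next to each other in the data frame, at the cost of requiring consistently spelled column names (the full data would need the measurment4 typo fixed first).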

An option with pivot_longer
library(dplyr)
library(tidyr)
names(data)[8] <- "measurement4"
data %>%
  pivot_longer(cols = -pid, names_to = c('.value', 'grp'),
               names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>%
  select(-grp)
# A tibble: 33 x 3
# pid measurement Tdays
# <int> <int> <int>
# 1 1 1356 1435
# 2 1 1483 1405
# 3 1 1563 1374
# 4 2 943 1848
# 5 2 1173 1818
# 6 2 1300 1785
# 7 3 1590 185
# 8 3 1585 294
# 9 4 130 72
#10 4 443 70
# … with 23 more rows

Reindexing a column in R

I'm dealing with the following dataset
animal protein herd sire dam
6 416 189.29 2 15 236
7 417 183.27 2 6 295
9 419 193.24 3 11 268
10 420 198.84 2 12 295
11 421 205.25 3 3 251
12 422 204.15 2 2 281
13 423 200.20 2 3 248
14 424 197.22 2 11 222
15 425 201.14 1 10 262
17 427 196.20 1 11 290
18 428 208.13 3 9 294
19 429 213.01 3 14 254
21 431 203.38 2 4 273
22 432 190.56 2 8 248
25 435 196.59 3 9 226
26 436 193.31 3 10 249
27 437 207.89 3 7 272
29 439 202.98 2 10 260
30 440 177.28 2 4 291
31 441 182.04 1 6 282
32 442 217.50 2 3 265
33 443 190.43 2 11 248
35 445 197.24 2 4 256
37 447 197.16 3 5 240
42 452 183.07 3 5 293
43 453 197.99 2 6 293
44 454 208.27 2 6 254
45 455 187.61 3 12 271
46 456 173.18 2 6 280
47 457 187.89 2 6 235
48 458 191.96 1 7 286
49 459 196.39 1 4 275
50 460 178.51 2 13 262
52 462 204.17 1 6 253
53 463 203.77 2 11 273
54 464 206.25 1 13 249
55 465 211.63 2 13 222
56 466 211.34 1 6 228
57 467 194.34 2 1 217
58 468 201.53 2 12 247
59 469 198.01 2 3 251
60 470 188.94 2 7 290
61 471 190.49 3 2 220
62 472 197.34 2 3 224
63 473 194.04 1 15 229
64 474 202.74 2 1 287
67 477 189.98 1 6 300
69 479 206.37 3 2 293
70 480 183.81 2 10 274
72 482 190.70 2 12 265
74 484 194.25 3 2 262
75 485 191.15 3 10 297
76 486 193.23 3 15 255
77 487 193.29 2 4 266
78 488 182.20 1 15 260
81 491 195.89 2 12 294
82 492 200.77 1 8 278
83 493 179.12 2 7 281
85 495 172.14 3 13 252
86 496 183.82 1 4 264
88 498 195.32 1 6 249
89 499 197.19 1 13 274
90 500 178.07 1 8 293
92 502 209.65 2 7 241
95 505 199.66 3 5 220
96 506 190.96 2 11 259
98 508 206.58 3 3 230
100 510 196.60 2 5 231
103 513 193.25 2 15 280
104 514 181.34 2 3 227
I'm interested in the animals' indexes and the corresponding dams' indexes. Using the table function I was able to check that some dams are matched to more than one animal. In fact, I got the following output:
217 220 222 224 226 227 228 229 230 231 235 236 240 241 247 248 249 251 252 253 254 255 256 259 260 262
1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 1 2 1 1 1 2 3
264 265 266 268 271 272 273 274 275 278 280 281 282 286 287 290 291 293 294 295 297 300
1 2 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 4 2 2 1 1
Using the length function I checked that there are only 48 dams in this dataset.
I would like to 'reindex' them with the integers 1, ..., 48 instead of the values given in my set. Is there a way to do this?

You can use match and unique.
df$index <- match(df$dam, unique(df$dam))
Or convert to factor and then integer
df$index <- as.integer(factor(df$dam))
Another option is group_indices from dplyr.
df$index <- dplyr::group_indices(df, dam)
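
One thing worth knowing is that the first two options number the groups differently. The sketch below uses the first five dam values from the question (the vector names by_appearance and by_sorted are made up for illustration): match with unique numbers the dams in order of first appearance, while the factor route numbers them in sorted order.

```r
# First five dam values from the question's data
dam <- c(236, 295, 268, 295, 251)

# Index by order of first appearance: 236 -> 1, 295 -> 2, 268 -> 3, 251 -> 4
by_appearance <- match(dam, unique(dam))

# Index by sorted value: 236 -> 1, 251 -> 2, 268 -> 3, 295 -> 4
by_sorted <- as.integer(factor(dam))

print(data.frame(dam, by_appearance, by_sorted))
```

Either way, equal dams get equal indexes and the indexes run from 1 to the number of distinct dams; only the assignment of numbers to dams differs.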

We can use .GRP in data.table
library(data.table)
setDT(df)[, index := .GRP, dam]

Running a function for each group

I have the following Gompertz parameters:
A <- 100 # A is always 100
mu <- 35
lambda <- 265 # day of the year. Also the start day
I can use the above parameters to evaluate a Gompertz curve with the following call:
grofit::gompertz(time,A,mu,lambda)
time is basically a vector of lambda:end.day.
Now the issue is that I know the lambda (start day) but not the end day. I want to find the end day on which the curve reaches 100.
For example, if I supply lambda:end.day as 265:270, I do not reach 100.
time <- 265:270
x <- round(grofit::gompertz(time,A,mu,lambda),2)
x
6.60 35.00 66.67 85.51 94.13 97.69
By trial and error, I know that if I give a vector of 265:277, I will reach 100.
time <- 265:277
x <- round(grofit::gompertz(time,A,mu,lambda),2)
x
[1] 6.60 35.00 66.67 85.51 94.13 97.69
[7] 99.10 99.65 99.87 99.95 99.98 99.99
[13] 100.00
I have a data frame that has the lambda (same as start day) and mu.
df <- data.frame(id = c(1,1,2,2), year = c(1981,1982,1981,1982), mu= c(35,32,33,28), lambda = c(275,278,284,296))
For each id and year, I want two columns: one called day, whose first value equals lambda, and a second that gives the value of x for each day until it reaches 100 (the end day).
How do I apply the above equation to each id and year so that I get a data frame something like this:
id year day x
1 1981 275 6.6
1 1981 276 35
1 1981 277 66.67
1 1981 278 85.51
1 1981 279 94.13
1 1981 280 97.69
1 1981 281 99.1
1 1981 282 99.65
1 1981 283 99.87
1 1981 284 99.95
1 1981 285 99.98
1 1981 286 99.99
1 1981 287 100
. . . .
. . . .
2 1982 296 8
2 1982 297 33
2 1982 298 45
2 1982 299 63
2 1982 300 61
2 1982 301 73
2 1982 302 81
2 1982 303 91
2 1982 304 94
2 1982 305 98
2 1982 306 99
2 1982 307 100

Using dplyr and tidyr:
library(dplyr)
library(tidyr)
A <- 100 # A is always 100
df <-
  data.frame(
    id = c(1, 1, 2, 2),
    year = c(1981, 1982, 1981, 1982),
    mu = c(35, 32, 33, 28),
    lambda = c(275, 278, 284, 296)
  )
df2 <- df %>%
  crossing(day = 1:365) %>%
  group_by(id, year) %>%
  filter(day >= lambda) %>%
  mutate(x = round(grofit::gompertz(day, A, mu, lambda), 2)) %>%
  group_by(id, year, x) %>%
  filter(x != 100 | row_number() == 1)
df2 %>%
  as.data.frame()
Result:
id year mu lambda day x
1 1 1981 35 275 275 6.60
2 1 1981 35 275 276 35.00
3 1 1981 35 275 277 66.67
4 1 1981 35 275 278 85.51
5 1 1981 35 275 279 94.13
6 1 1981 35 275 280 97.69
7 1 1981 35 275 281 99.10
8 1 1981 35 275 282 99.65
9 1 1981 35 275 283 99.87
10 1 1981 35 275 284 99.95
11 1 1981 35 275 285 99.98
12 1 1981 35 275 286 99.99
13 1 1981 35 275 287 100.00
14 1 1982 32 278 278 6.60
15 1 1982 32 278 279 32.01
16 1 1982 32 278 280 62.05
17 1 1982 32 278 281 81.87
18 1 1982 32 278 282 91.96
19 1 1982 32 278 283 96.55
20 1 1982 32 278 284 98.54
21 1 1982 32 278 285 99.39
22 1 1982 32 278 286 99.74
23 1 1982 32 278 287 99.89
24 1 1982 32 278 288 99.95
25 1 1982 32 278 289 99.98
26 1 1982 32 278 290 99.99
27 1 1982 32 278 291 100.00
28 2 1981 33 284 284 6.60
29 2 1981 33 284 285 33.01
30 2 1981 33 284 286 63.64
31 2 1981 33 284 287 83.17
32 2 1981 33 284 288 92.76
33 2 1981 33 284 289 96.98
34 2 1981 33 284 290 98.76
35 2 1981 33 284 291 99.49
36 2 1981 33 284 292 99.79
37 2 1981 33 284 293 99.92
38 2 1981 33 284 294 99.97
39 2 1981 33 284 295 99.99
40 2 1981 33 284 296 99.99
41 2 1981 33 284 297 100.00
42 2 1982 28 296 296 6.60
43 2 1982 28 296 297 28.09
44 2 1982 28 296 298 55.26
45 2 1982 28 296 299 75.80
46 2 1982 28 296 300 87.86
47 2 1982 28 296 301 94.13
48 2 1982 28 296 302 97.21
49 2 1982 28 296 303 98.69
50 2 1982 28 296 304 99.39
51 2 1982 28 296 305 99.71
52 2 1982 28 296 306 99.87
53 2 1982 28 296 307 99.94
54 2 1982 28 296 308 99.97
55 2 1982 28 296 309 99.99
56 2 1982 28 296 310 99.99
57 2 1982 28 296 311 100.00
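
For comparison, here is a base R sketch of the same computation that does not cap the day range at 365. It is not the answer's method: gompertz_curve is a hand-written stand-in that assumes the modified Gompertz form A * exp(-exp(mu * e * (lambda - time) / A + 1)), which reproduces the values quoted in the question (6.60, 35.00, 66.67, ...), and days_until_full and its max_days argument are hypothetical helpers introduced for illustration.

```r
# Stand-in for grofit::gompertz (assumed modified Gompertz form)
gompertz_curve <- function(time, A, mu, lambda) {
  A * exp(-exp(mu * exp(1) * (lambda - time) / A + 1))
}

# Evaluate the curve day by day from lambda and stop at the first day on
# which the rounded value reaches A (or after max_days if it never does)
days_until_full <- function(mu, lambda, A = 100, max_days = 365) {
  day <- lambda:(lambda + max_days)
  x <- round(gompertz_curve(day, A, mu, lambda), 2)
  end <- which(x >= A)[1]
  if (is.na(end)) end <- length(day)
  data.frame(day = day[seq_len(end)], x = x[seq_len(end)])
}

params <- data.frame(
  id = c(1, 1, 2, 2),
  year = c(1981, 1982, 1981, 1982),
  mu = c(35, 32, 33, 28),
  lambda = c(275, 278, 284, 296)
)

# One block of rows per id/year combination, stacked into a single data frame
pieces <- lapply(seq_len(nrow(params)), function(i) {
  curve <- days_until_full(params$mu[i], params$lambda[i])
  data.frame(id = params$id[i], year = params$year[i], curve)
})
out <- do.call(rbind, pieces)

head(out)
```

Because each group is evaluated on its own day range, a start day late in the year simply runs past day 365 instead of being cut off; whether that is desirable depends on how the days should wrap around the calendar.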

how to make stacked ggplot

I want to make a stacked ggplot for a time series.
>air2
dayofmonth total dept
1 1 1107 380
2 2 918 92
3 3 1089 113
4 4 1086 235
5 5 1063 218
6 6 1084 325
7 7 1129 180
8 8 1133 166
9 9 918 72
10 10 1088 214
11 11 1114 180
12 12 1047 195
13 13 1110 421
14 14 1165 216
15 15 1174 228
16 16 1010 115
I tried this but didn't get the expected graph:
mdata <- melt(air2,id=c("dayofmonth"))
ggplot(mdata, aes(x=Time,y=value,group=variable,fill=variable)) +
geom_area(position="fill")
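
One likely issue in the attempt above is that aes(x=Time) refers to a column that does not exist in mdata; the melted data only has dayofmonth, variable and value. The sketch below is one possible fix under that assumption, using the first four rows of air2. It builds the long data in base R (standing in for melt) and uses position = "stack" for absolute stacked areas, whereas position = "fill" would rescale each day to proportions. The plotting step is guarded so the data preparation runs even when ggplot2 is not installed.

```r
# First four rows of the question's data
air2 <- data.frame(
  dayofmonth = 1:4,
  total = c(1107, 918, 1089, 1086),
  dept  = c(380, 92, 113, 235)
)

# Long format with one row per day and series (base R stand-in for melt)
mdata <- data.frame(
  dayofmonth = rep(air2$dayofmonth, times = 2),
  variable   = rep(c("total", "dept"), each = nrow(air2)),
  value      = c(air2$total, air2$dept)
)

# Stacked area chart; x is mapped to the column that actually exists
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(
    mdata,
    ggplot2::aes(x = dayofmonth, y = value, group = variable, fill = variable)
  ) +
    ggplot2::geom_area(position = "stack")
  print(p)
}
```

If total already includes dept, stacking the two series double counts dept; in that case it may make more sense to plot dept and total - dept instead.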
