Splitting a data frame based on column-wise NA value occurrence in R

Sample data
set.seed(16)
aaa <- 1:1000
aaa[round(runif(100,1,1000))] <- NA
aaa.df <- as.data.frame(matrix(aaa, ncol=5))
I want aaa.df to be split into multiple groups based on which column(s) contain NA value(s). For example, if the 10th, 16th, and 200th rows have an NA value in the same column, I want those rows to be in one group, and so on. It should also work when (a) there are no NA values in a row and (b) there are multiple NA values in a row.
I also want to keep the original row number when grouping.
Edit: To make it clearer, this is the expected output (obtained using Taufi's answer, but I am still looking for a more elegant way):
[[1]]
# A tibble: 119 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 1 201 401 601 801 1
2 2 202 402 602 802 2
3 3 203 403 603 803 3
4 4 204 404 604 804 4
5 5 205 405 605 805 5
6 6 206 406 606 806 6
7 7 207 407 607 807 7
8 8 208 408 608 808 8
9 9 209 409 609 809 9
10 10 210 410 610 810 10
# ... with 109 more rows
[[2]]
# A tibble: 14 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 20 220 420 620 NA 20
2 32 232 432 632 NA 32
3 47 247 447 647 NA 47
4 70 270 470 670 NA 70
5 85 285 485 685 NA 85
6 92 292 492 692 NA 92
7 129 329 529 729 NA 129
8 132 332 532 732 NA 132
9 137 337 537 737 NA 137
10 151 351 551 751 NA 151
11 152 352 552 752 NA 152
12 168 368 568 768 NA 168
13 178 378 578 778 NA 178
14 181 381 581 781 NA 181
[[3]]
# A tibble: 15 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 11 211 411 NA 811 11
2 37 237 437 NA 837 37
3 62 262 462 NA 862 62
4 82 282 482 NA 882 82
5 83 283 483 NA 883 83
6 89 289 489 NA 889 89
7 107 307 507 NA 907 107
8 115 315 515 NA 915 115
9 116 316 516 NA 916 116
10 117 317 517 NA 917 117
11 118 318 518 NA 918 118
12 165 365 565 NA 965 165
13 176 376 576 NA 976 176
14 189 389 589 NA 989 189
15 200 400 600 NA 1000 200
[[4]]
# A tibble: 1 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 12 212 412 NA NA 12
[[5]]
# A tibble: 16 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 17 217 NA 617 817 17
2 28 228 NA 628 828 28
3 31 231 NA 631 831 31
4 48 248 NA 648 848 48
5 58 258 NA 658 858 58
6 72 272 NA 672 872 72
7 80 280 NA 680 880 80
8 126 326 NA 726 926 126
9 144 344 NA 744 944 144
10 145 345 NA 745 945 145
11 149 349 NA 749 949 149
12 153 353 NA 753 953 153
13 186 386 NA 786 986 186
14 190 390 NA 790 990 190
15 192 392 NA 792 992 192
16 196 396 NA 796 996 196
and so on..

In addition to my previous, more brute-force answer, I came up with the following, considerably more elegant one-liner that avoids any unnecessary joins or intermediate assignment steps. Since you already accepted my previous answer, I'll leave that one as it stands and add the conceptually different one-liner below. The idea is to split() the data.frame based on the pasted column numbers from which() that indicate the presence of NA.
split(aaa.df,
      apply(aaa.df, 1,
            function(x) paste(which(is.na(x)), collapse = ",")))
Output
$`1`
V1 V2 V3 V4 V5
77 NA 277 477 677 877
93 NA 293 493 693 893
97 NA 297 497 697 897
109 NA 309 509 709 909
119 NA 319 519 719 919
140 NA 340 540 740 940
154 NA 354 554 754 954
158 NA 358 558 758 958
171 NA 371 571 771 971
172 NA 372 572 772 972
$`1,2,3`
V1 V2 V3 V4 V5
51 NA NA NA 651 851
$`1,3,5`
V1 V2 V3 V4 V5
75 NA 275 NA 675 NA
$`1,4`
V1 V2 V3 V4 V5
194 NA 394 594 NA 994
$`1,4,5`
V1 V2 V3 V4 V5
49 NA 249 449 NA NA
...
and so on ...
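Note that split() keeps the original row names (77, 93, 97, ... in the output above), so the original row numbers are preserved automatically. If you prefer them as an explicit column, a small variation of the same one-liner (my own sketch, not part of the original answer) is to add the row number before splitting:
# keep the original row number as a column; Rownum is never NA,
# so it does not change the grouping key
aaa.df$Rownum <- seq_len(nrow(aaa.df))
split(aaa.df,
      apply(aaa.df[, 1:5], 1,
            function(x) paste(which(is.na(x)), collapse = ",")))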

A quick, but not very elegant, solution would be as follows. Note that the original row number ends up in V6.
library(dplyr)
library(magrittr)  # for the %<>% assignment pipe

aaa.df %<>% mutate(Rownum = 1:nrow(aaa.df))
Aux.df <- cbind(is.na(aaa.df[, 1:(ncol(aaa.df) - 1)]), 1:nrow(aaa.df)) %>%
  as.data.frame %>%
  group_by(V1, V2, V3, V4, V5) %>%
  group_split
Sol <- lapply(Aux.df, function(x) inner_join(x, aaa.df, by = c("V6" = "Rownum")) %>%
  select(V1.y, V2.y, V3.y, V4.y, V5.y, V6))
Output
> Sol
[[1]]
# A tibble: 119 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 1 201 401 601 801 1
2 2 202 402 602 802 2
3 3 203 403 603 803 3
4 4 204 404 604 804 4
5 5 205 405 605 805 5
6 6 206 406 606 806 6
7 7 207 407 607 807 7
8 8 208 408 608 808 8
9 9 209 409 609 809 9
10 10 210 410 610 810 10
# ... with 109 more rows
[[2]]
# A tibble: 14 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 20 220 420 620 NA 20
2 32 232 432 632 NA 32
3 47 247 447 647 NA 47
4 70 270 470 670 NA 70
5 85 285 485 685 NA 85
6 92 292 492 692 NA 92
7 129 329 529 729 NA 129
8 132 332 532 732 NA 132
9 137 337 537 737 NA 137
10 151 351 551 751 NA 151
11 152 352 552 752 NA 152
12 168 368 568 768 NA 168
13 178 378 578 778 NA 178
14 181 381 581 781 NA 181
....
and so on ...
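As an aside (my own sketch, not part of the answer above, assuming dplyr >= 1.0): the join can be avoided entirely by keying each row on its NA pattern and splitting on that key, while carrying the row number along.
library(dplyr)
aaa.df %>%
  mutate(Rownum = row_number(),
         na_pattern = apply(is.na(across(V1:V5)), 1, paste, collapse = "")) %>%
  group_split(na_pattern)   # na_pattern stays as a column in each piece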


Pivot / Reshape data [closed]

My sample data looks like this:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time (days) Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I go about getting there? I have looked into reshaping from wide to long format, but it doesn't seem to do the trick.
Kind regards, and thank you in advance.
Here is a base R option: split.default() pairs up the measurement/Tdays columns, setNames() gives each pair the common names Value and Time, do.call(rbind, ...) stacks the pairs (cbind() recycles pid to match), and complete.cases() drops the rows where a measurement is missing.
u <- cbind(
  data[1],
  do.call(
    rbind,
    lapply(
      split.default(data[-1], ceiling(seq_along(data[-1]) / 2)),
      setNames,
      c("Value", "Time")
    )
  )
)
out <- `row.names<-`(
  subset(
    x <- u[order(u$pid), ],
    complete.cases(x)
  ), NULL
)
such that
> out
pid Value Time
1 1 1356 1435
2 1 1483 1405
3 1 1563 1374
4 2 943 1848
5 2 1173 1818
6 2 1300 1785
7 3 1590 185
8 3 1585 294
9 4 130 72
10 4 140 82
11 4 220 126
12 4 166 159
13 4 380 189
14 4 353 231
15 4 180 268
16 4 571 334
17 4 443 70
18 4 266 124
19 4 213 156
20 4 583 173
21 4 510 222
22 4 596 303
23 4 476 145
24 4 656 217
25 4 816 289
26 4 136 79
27 4 756 89
28 4 703 128
29 4 776 166
30 4 586 203
31 4 526 240
32 4 580 278
33 4 483 371
An option with pivot_longer
library(dplyr)
library(tidyr)
names(data)[8] <- "measurement4"
data %>%
  pivot_longer(cols = -pid, names_to = c('.value', 'grp'),
               names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>%
  select(-grp)
# A tibble: 33 x 3
# pid measurement Tdays
# <int> <int> <int>
# 1 1 1356 1435
# 2 1 1483 1405
# 3 1 1563 1374
# 4 2 943 1848
# 5 2 1173 1818
# 6 2 1300 1785
# 7 3 1590 185
# 8 3 1585 294
# 9 4 130 72
#10 4 443 70
# … with 23 more rows
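The names_sep regex "(?<=[a-z])(?=[0-9])" splits each column name at the boundary between a letter and a digit. For reference, the same split can be expressed with names_pattern instead (a sketch on my part, assuming the 8th column has already been renamed to measurement4 as above):
data %>%
  pivot_longer(cols = -pid, names_to = c('.value', 'grp'),
               names_pattern = "([A-Za-z]+)(\\d+)",
               values_drop_na = TRUE) %>%
  select(-grp)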

Extract rows from matrix by rownames (dates) in R?

Assume I have the following inputs:
Date <- seq.Date(as.Date("2000-01-01"),as.Date("2006-01-01"), by = "quarter")
mat <- matrix(1:730,73,10)
mat <- data.frame(mat)
mat$Time <- c(seq.Date(as.Date("2000-01-01"),as.Date("2002-12-01"), by= "month"),as.Date("2003-01-03") ,seq.Date(as.Date("2003-02-01"),as.Date("2004-12-01"),by ="month"),as.Date("2005-01-02"),seq(as.Date("2005-02-01"),as.Date("2006-01-01"), by ="month"))
mat
And now I would like to get the rows in the matrix that have the same dates as the Date vector. However, some of the dates in the Date vector don't exist there, so I would like to get the closest date instead. Therefore I tried this:
for(i in 1:length(Date)){
  if(Date[i] == mat$Time){
    Date[i] <- Date[i]
  } else {
    Date_Row <- which(abs(mat$Time - Date[i]) == min(abs(mat$Time - Date[i])))
    Date[i] <- mat[Date_Row, ]
  }
}
Date
But it doesn't work. How can I fix this? Thanks!
We can extract the row names and subset the data frame by assigning year and quarter values to the input data, then merging with the reference data that has one observation per quarter.
aFile <- " rowName X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
2000-01-01 1 40 79 118 157 196 235 274 313 352
2000-02-01 2 41 80 119 158 197 236 275 314 353
2000-03-01 3 42 81 120 159 198 237 276 315 354
2000-04-01 4 43 82 121 160 199 238 277 316 355
2000-05-01 5 44 83 122 161 200 239 278 317 356
2000-06-01 6 45 84 123 162 201 240 279 318 357
2000-07-01 7 46 85 124 163 202 241 280 319 358
2000-08-01 8 47 86 125 164 203 242 281 320 359
2000-09-01 9 48 87 126 165 204 243 282 321 360
2000-10-01 10 49 88 127 166 205 244 283 322 361
2000-11-01 11 50 89 128 167 206 245 284 323 362
2000-12-01 12 51 90 129 168 207 246 285 324 363
2001-01-01 13 52 91 130 169 208 247 286 325 364
2002-11-01 35 74 113 152 191 230 269 308 347 386
2002-12-01 36 75 114 153 192 231 270 309 348 387
2003-01-03 37 76 115 154 193 232 271 310 349 388"
df <- read.table(text = aFile, header = TRUE, row.names = "rowName")
referenceDate <- seq.Date(as.Date("2000-01-01"), as.Date("2006-01-01"),
                          by = "quarter")
library(lubridate)
quarterData <- data.frame(referenceDate,
                          year = year(referenceDate),
                          qtr = quarter(referenceDate))
library(dplyr)
df %>% mutate(date = ymd(rownames(df)),
              year = year(date),
              qtr = quarter(date)) %>%
  left_join(., quarterData)
...and the output:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 date year qtr referenceDate
1 1 40 79 118 157 196 235 274 313 352 2000-01-01 2000 1 2000-01-01
2 2 41 80 119 158 197 236 275 314 353 2000-02-01 2000 1 2000-01-01
3 3 42 81 120 159 198 237 276 315 354 2000-03-01 2000 1 2000-01-01
4 4 43 82 121 160 199 238 277 316 355 2000-04-01 2000 2 2000-04-01
5 5 44 83 122 161 200 239 278 317 356 2000-05-01 2000 2 2000-04-01
6 6 45 84 123 162 201 240 279 318 357 2000-06-01 2000 2 2000-04-01
7 7 46 85 124 163 202 241 280 319 358 2000-07-01 2000 3 2000-07-01
8 8 47 86 125 164 203 242 281 320 359 2000-08-01 2000 3 2000-07-01
9 9 48 87 126 165 204 243 282 321 360 2000-09-01 2000 3 2000-07-01
10 10 49 88 127 166 205 244 283 322 361 2000-10-01 2000 4 2000-10-01
11 11 50 89 128 167 206 245 284 323 362 2000-11-01 2000 4 2000-10-01
12 12 51 90 129 168 207 246 285 324 363 2000-12-01 2000 4 2000-10-01
13 13 52 91 130 169 208 247 286 325 364 2001-01-01 2001 1 2001-01-01
14 35 74 113 152 191 230 269 308 347 386 2002-11-01 2002 4 2002-10-01
15 36 75 114 153 192 231 270 309 348 387 2002-12-01 2002 4 2002-10-01
16 37 76 115 154 193 232 271 310 349 388 2003-01-03 2003 1 2003-01-01
Filter to dates near start of quarter
The reference dates in the OP are at the start of each quarter. Solutions for subsetting the joined data rely on this assumption.
Now that we've joined the data, if we want to subset to only the items early in the quarter, we can filter() based on the difference between date and referenceDate to keep those rows that are within the first 5 days of the quarter.
df %>% mutate(date = ymd(rownames(df)),
              year = year(date),
              qtr = quarter(date)) %>%
  left_join(., quarterData) %>%
  filter(., (date - referenceDate) < 5)
...and the output:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 date year qtr referenceDate
1 1 40 79 118 157 196 235 274 313 352 2000-01-01 2000 1 2000-01-01
2 4 43 82 121 160 199 238 277 316 355 2000-04-01 2000 2 2000-04-01
3 7 46 85 124 163 202 241 280 319 358 2000-07-01 2000 3 2000-07-01
4 10 49 88 127 166 205 244 283 322 361 2000-10-01 2000 4 2000-10-01
5 13 52 91 130 169 208 247 286 325 364 2001-01-01 2001 1 2001-01-01
6 37 76 115 154 193 232 271 310 349 388 2003-01-03 2003 1 2003-01-01
Filtering to a date beyond the first few days of quarter
If the first day in a quarter falls outside the criteria above, or if the input data includes multiple days that meet the filter criteria, another approach is to create a unique sequential number representing the sorted dates within a year and quarter, and select the first item in the sequence.
# filter first obs in quarter
df %>% mutate(date = ymd(rownames(df)),
              year = year(date),
              qtr = quarter(date)) %>%
  left_join(., quarterData) %>%
  arrange(., year, qtr, date) %>%
  group_by(year, qtr) %>%
  mutate(quarterSequence = seq_along(qtr)) %>%
  filter(quarterSequence == 1)
...and the output:
# A tibble: 7 x 15
# Groups: year, qtr [7]
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 date year
<int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <date> <dbl>
1 1 40 79 118 157 196 235 274 313 352 2000-01-01 2000
2 4 43 82 121 160 199 238 277 316 355 2000-04-01 2000
3 7 46 85 124 163 202 241 280 319 358 2000-07-01 2000
4 10 49 88 127 166 205 244 283 322 361 2000-10-01 2000
5 13 52 91 130 169 208 247 286 325 364 2001-01-01 2001
6 35 74 113 152 191 230 269 308 347 386 2002-11-01 2002
7 37 76 115 154 193 232 271 310 349 388 2003-01-03 2003
# … with 3 more variables: qtr <int>, referenceDate <date>, quarterSequence <int>
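Side note (not from the original answer): because the rows are already arranged by year, qtr, and date, the seq_along()/filter() pair can be replaced with slice(1), which keeps the first row of each group. A sketch reusing the df and quarterData objects defined above:
df %>% mutate(date = ymd(rownames(df)),
              year = year(date),
              qtr = quarter(date)) %>%
  left_join(., quarterData) %>%
  arrange(., year, qtr, date) %>%
  group_by(year, qtr) %>%
  slice(1)   # first observation per year/quarter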
A simpler approach: use the original data to create reference dates
We can solve the problem posed in the original post without joining one set of dates to another. How? We use lubridate functions to create the first day of the quarter for each row by parsing the year and quarter values from the dates provided in the row names of the original data frame.
# read same data file as top of this answer
df <- read.table(text = aFile, header = TRUE, row.names = "rowName")
library(lubridate)
library(dplyr)
df %>%
  mutate(date = ymd(rownames(.)),
         referenceDate = ymd(sprintf("%4d-%02d-%02d", year(date),
                                     (quarter(date) - 1) * 3 + 1, 1))) %>%
  filter(., (date - referenceDate) < 5)
...and the output:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 date referenceDate
1 1 40 79 118 157 196 235 274 313 352 2000-01-01 2000-01-01
2 4 43 82 121 160 199 238 277 316 355 2000-04-01 2000-04-01
3 7 46 85 124 163 202 241 280 319 358 2000-07-01 2000-07-01
4 10 49 88 127 166 205 244 283 322 361 2000-10-01 2000-10-01
5 13 52 91 130 169 208 247 286 325 364 2001-01-01 2001-01-01
6 37 76 115 154 193 232 271 310 349 388 2003-01-03 2003-01-01
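As a further simplification (my own suggestion, not part of the original answer), lubridate's floor_date() builds the quarter-start date directly, which avoids the sprintf() construction:
df %>%
  mutate(date = ymd(rownames(.)),
         referenceDate = floor_date(date, unit = "quarter")) %>%  # first day of the quarter
  filter(., (date - referenceDate) < 5)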
I copied and pasted the top few rows of your data into an Excel spreadsheet, then exported it to a CSV to read into R as the variable Book1.
I used your same code but changed the variable name for clarity:
Datetofind <- seq.Date(as.Date("2000-01-01"),as.Date("2006-01-01"), by = "quarter")
I got the dataset into a tibble to use lubridate and the tidyverse; the code below converts the column into Date format:
Book1$Date <- ymd(Book1$Date)
Now I just used dplyr to filter the dates in your original dataset and return only the rows that match the quarters.
Book1 %>%
filter(Date %in% Datetofind)
That got me the data below
Date X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
<date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
2000-01-01 1 40 79 118 157 196 235 274 313 352
2000-04-01 4 43 82 121 160 199 238 277 316 355
2000-07-01 7 46 85 124 163 202 241 280 319 358
2000-10-01 10 49 88 127 166 205 244 283 322 361
2001-01-01 13 52 91 130 169 208 247 286 325 364

Reindexing a column in R

I'm dealing with the following dataset
animal protein herd sire dam
6 416 189.29 2 15 236
7 417 183.27 2 6 295
9 419 193.24 3 11 268
10 420 198.84 2 12 295
11 421 205.25 3 3 251
12 422 204.15 2 2 281
13 423 200.20 2 3 248
14 424 197.22 2 11 222
15 425 201.14 1 10 262
17 427 196.20 1 11 290
18 428 208.13 3 9 294
19 429 213.01 3 14 254
21 431 203.38 2 4 273
22 432 190.56 2 8 248
25 435 196.59 3 9 226
26 436 193.31 3 10 249
27 437 207.89 3 7 272
29 439 202.98 2 10 260
30 440 177.28 2 4 291
31 441 182.04 1 6 282
32 442 217.50 2 3 265
33 443 190.43 2 11 248
35 445 197.24 2 4 256
37 447 197.16 3 5 240
42 452 183.07 3 5 293
43 453 197.99 2 6 293
44 454 208.27 2 6 254
45 455 187.61 3 12 271
46 456 173.18 2 6 280
47 457 187.89 2 6 235
48 458 191.96 1 7 286
49 459 196.39 1 4 275
50 460 178.51 2 13 262
52 462 204.17 1 6 253
53 463 203.77 2 11 273
54 464 206.25 1 13 249
55 465 211.63 2 13 222
56 466 211.34 1 6 228
57 467 194.34 2 1 217
58 468 201.53 2 12 247
59 469 198.01 2 3 251
60 470 188.94 2 7 290
61 471 190.49 3 2 220
62 472 197.34 2 3 224
63 473 194.04 1 15 229
64 474 202.74 2 1 287
67 477 189.98 1 6 300
69 479 206.37 3 2 293
70 480 183.81 2 10 274
72 482 190.70 2 12 265
74 484 194.25 3 2 262
75 485 191.15 3 10 297
76 486 193.23 3 15 255
77 487 193.29 2 4 266
78 488 182.20 1 15 260
81 491 195.89 2 12 294
82 492 200.77 1 8 278
83 493 179.12 2 7 281
85 495 172.14 3 13 252
86 496 183.82 1 4 264
88 498 195.32 1 6 249
89 499 197.19 1 13 274
90 500 178.07 1 8 293
92 502 209.65 2 7 241
95 505 199.66 3 5 220
96 506 190.96 2 11 259
98 508 206.58 3 3 230
100 510 196.60 2 5 231
103 513 193.25 2 15 280
104 514 181.34 2 3 227
I'm interested in the animal indexes and the dam indexes corresponding to them. Using the table function I was able to check that some dams are matched to multiple animals. In fact I got the following output:
217 220 222 224 226 227 228 229 230 231 235 236 240 241 247 248 249 251 252 253 254 255 256 259 260 262
1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 1 2 1 1 1 2 3
264 265 266 268 271 272 273 274 275 278 280 281 282 286 287 290 291 293 294 295 297 300
1 2 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 4 2 2 1 1
Using the length function I checked that there are only 48 dams in this dataset.
I would like to 'reindex' them with the integers 1, ..., 48 instead of those given in my set. Is there any method for doing this?
You can use match and unique.
df$index <- match(df$dam, unique(df$dam))
Or convert to factor and then integer
df$index <- as.integer(factor(df$dam))
Another option is group_indices from dplyr.
df$index <- dplyr::group_indices(df, dam)
We can use .GRP in data.table
library(data.table)
setDT(df)[, index := .GRP, dam]
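Note that match()/unique() numbers the dams in order of first appearance, whereas factor() numbers them in sorted order of the dam IDs; either way each of the 48 distinct dams gets exactly one integer from 1 to 48. A quick sanity check (a sketch):
df$index <- match(df$dam, unique(df$dam))
length(unique(df$index))  # 48
range(df$index)           # 1 48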

R, remove rows based on the values from multiple columns

Suppose I have a data frame with 100 rows and 100 columns.
For each row, if any 2 columns have the same value, then this row should be removed.
For example, if column 1 and 2 are equal, then this row should be removed.
Another example, if column 10 and column 47 are equal, then this row should be removed as well.
Example:
test <- data.frame(x1 = c('a', 'a', 'c', 'd'),
x2 = c('a', 'x', 'f', 'h'),
x3 = c('s', 'a', 'f', 'g'),
x4 = c('a', 'x', 'u', 'a'))
test
x1 x2 x3 x4
1 a a s a
2 a x a x
3 c f f u
4 d h g a
Only the 4th row should be kept.
How to do this in a quick and concise way? Not using for loops....
Use apply to look for duplicates in each row. (Note that this internally converts your data to a matrix for the comparison. If you are doing a lot of row-wise operations I would recommend either keeping it as a matrix or converting it to a long format as in Jack Brookes's answer.)
# sample data
set.seed(47)
dd = data.frame(matrix(sample(1:5000, size = 100^2, replace = TRUE), nrow = 100))
# remove rows with duplicate entries
result = dd[apply(dd, MARGIN = 1, FUN = function(x) !any(duplicated(x))), ]
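Applied to the test data frame from the question, the same expression keeps only the 4th row:
test[apply(test, MARGIN = 1, FUN = function(x) !any(duplicated(x))), ]
#   x1 x2 x3 x4
# 4  d  h  g  a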
Tested on this 20x20 dataframe
library(tidyverse)
N <- 20
df <- matrix(as.integer(runif(N^2, 1, 500)), nrow = N, ncol = N) %>%
as.tibble()
df
# # A tibble: 20 x 20
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 350 278 256 484 486 249 35 308 248 66 493 130 149 2 374 51 370 423 165 388
# 2 368 448 441 62 304 373 38 375 406 463 412 95 174 365 170 113 459 369 62 21
# 3 250 459 416 128 372 67 281 450 48 122 308 56 121 497 498 220 34 4 126 411
# 4 171 306 390 13 395 160 256 258 76 131 471 487 190 492 21 237 380 129 5 30
# 5 402 421 6 401 50 292 470 319 283 178 234 46 176 178 288 499 7 221 123 268
# 6 415 342 132 379 150 35 323 225 246 496 460 478 205 255 460 62 78 207 82 118
# 7 207 52 420 216 9 366 390 382 304 63 427 425 350 112 488 400 328 239 148 40
# 8 392 455 156 386 478 3 359 184 420 138 29 434 31 279 87 233 455 21 181 437
# 9 349 460 498 278 104 93 253 287 124 351 60 333 321 116 19 156 372 168 95 169
# 10 386 73 362 127 313 93 427 81 188 366 418 115 353 412 483 147 295 53 82 188
# 11 272 480 168 306 359 75 436 228 187 279 410 388 62 227 415 374 366 313 187 49
# 12 177 382 233 146 338 76 390 232 336 448 175 79 202 230 317 296 410 90 102 465
# 13 108 433 59 151 8 138 464 458 183 316 481 153 403 193 71 136 27 454 62 439
# 14 421 72 106 442 338 440 476 357 74 108 94 407 453 262 355 356 27 217 243 455
# 15 325 449 151 473 241 11 154 52 77 489 137 279 420 120 165 289 70 128 384 53
# 16 126 189 43 354 233 168 48 285 175 348 404 254 168 126 95 65 493 493 187 228
# 17 26 143 112 107 350 198 353 439 192 158 151 23 326 4 304 162 84 412 499 170
# 18 88 156 222 227 452 233 397 203 478 73 483 241 151 38 176 77 244 396 9 393
# 19 361 486 423 310 153 235 274 204 399 493 422 374 399 10 215 468 322 38 395 390
# 20 417 124 21 220 123 399 354 182 233 24 397 263 182 211 360 419 202 240 363 187
Removing rows with any duplicates
df %>%
  group_by(id = row_number()) %>%
  gather(col, value, -id) %>%
  filter(!any(duplicated(value))) %>%
  spread(col, value)
# # A tibble: 11 x 21
# # Groups: id [11]
# id V1 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V2 V20 V3 V4 V5 V6 V7 V8 V9
# <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
# 1 1 350 66 493 130 149 2 374 51 370 423 165 278 388 256 484 486 249 35 308 248
# 2 3 250 122 308 56 121 497 498 220 34 4 126 459 411 416 128 372 67 281 450 48
# 3 4 171 131 471 487 190 492 21 237 380 129 5 306 30 390 13 395 160 256 258 76
# 4 7 207 63 427 425 350 112 488 400 328 239 148 52 40 420 216 9 366 390 382 304
# 5 9 349 351 60 333 321 116 19 156 372 168 95 460 169 498 278 104 93 253 287 124
# 6 12 177 448 175 79 202 230 317 296 410 90 102 382 465 233 146 338 76 390 232 336
# 7 13 108 316 481 153 403 193 71 136 27 454 62 433 439 59 151 8 138 464 458 183
# 8 14 421 108 94 407 453 262 355 356 27 217 243 72 455 106 442 338 440 476 357 74
# 9 15 325 489 137 279 420 120 165 289 70 128 384 449 53 151 473 241 11 154 52 77
# 10 17 26 158 151 23 326 4 304 162 84 412 499 143 170 112 107 350 198 353 439 192
# 11 18 88 73 483 241 151 38 176 77 244 396 9 156 393 222 227 452 233 397 203 478
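As an aside (not part of the original answer): gather()/spread() are superseded, and the same long-format round trip can be written with pivot_longer()/pivot_wider(), reusing the df built above:
df %>%
  mutate(id = row_number()) %>%
  pivot_longer(-id, names_to = "col") %>%         # one row per (row id, column)
  group_by(id) %>%
  filter(!any(duplicated(value))) %>%             # drop row ids with any repeated value
  pivot_wider(names_from = col, values_from = value) %>%
  ungroup() %>%
  select(-id)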
You can try a series of filters from dplyr. I cooked up some sample data here. If your variables are named, then you can use something like the first example; otherwise the second should work.
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.5.1
data <- data_frame(
A = c(1,2,3,4,5,6),
B= c(1,3,5,7,9,11),
C = c(2,2,6,8,10,12)
)
data %>%
filter(A != B) %>% # This removed the first row
filter(A != C) # This removed the second row
#> # A tibble: 4 x 3
#> A B C
#> <dbl> <dbl> <dbl>
#> 1 3 5 6
#> 2 4 7 8
#> 3 5 9 10
#> 4 6 11 12
data %>%
filter(.[1] != .[2]) %>%
filter(.[1] != .[3])
#> # A tibble: 4 x 3
#> A B C
#> <dbl> <dbl> <dbl>
#> 1 3 5 6
#> 2 4 7 8
#> 3 5 9 10
#> 4 6 11 12

R map() 2 levels into list

I am stuck on doing a nested map(), or maybe a map() pipe.
I have a list of 4 outputs in the object "output". In each of the four outputs there is an element "parameters" that is a list of 3 elements. The 1st element is "unstandardized".
From the View tool I can see the code to get the unstandardized parameters from any one output
output[["ar.4g_gm.pr.dual..semi.inv..phantom.out"]][["parameters"]][["unstandardized"]])
I have tried to use map over outputs extracting parameters piped into map_dfr to extract and rbind the unstandardized parameters, which does the job ...
x<- map(output,"parameters") %>% map_dfr("unstandardized")
but I want to have the top-level list element name (i.e., the output file) in a column of my result.
Is there a way to nest the map functions or some other syntax to get the 4 top-level list element names into a column?
Here are statements with dummy data. It works, but I need to cbind rep(c("out1","out2","out3"), each=5) to the result, and I want it to happen without cbind.
output <- list(out1=list(e1=c(1,2,3),
e2=c(T,F,T),
parm=list(a = as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),
b = as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),
stand = cbind(as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),grp=rep(1,times=5)))),
out2=list(e1=c(3,4,5),
e2=c(T,F,T),
parm=list(a = as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),
b = as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),
stand = cbind(as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),grp=rep(2,times=5)))),
out3=list(e1=c(1,2,3),
e2=c(T,F,T),
parm=list(a = as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),
b = as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),
stand = cbind(as.data.frame(matrix(sample(101:999,size=40,replace=TRUE),nrow=5)),grp=rep(3,times=5)))) )
output[["out1"]][["parm"]][["stand"]]
map(output,"parm") %>% map_dfr("stand")
library(purrr)
library(dplyr)
map(output, pluck, "parm", "stand") %>%
bind_rows(.id = "foo")
# foo V1 V2 V3 V4 V5 V6 V7 V8 grp
# 1 out1 845 527 296 902 358 447 317 347 1
# 2 out1 679 473 290 482 349 691 144 731 1
# 3 out1 842 574 135 894 628 542 757 174 1
# 4 out1 379 548 836 176 796 744 889 922 1
# 5 out1 498 837 492 965 255 508 138 689 1
# 6 out2 203 599 158 355 793 884 722 210 2
# 7 out2 543 693 484 195 511 174 793 654 2
# 8 out2 593 839 296 926 387 788 260 143 2
# 9 out2 373 363 323 939 416 348 792 211 2
# 10 out2 773 218 616 806 119 304 775 775 2
# 11 out3 171 217 859 899 664 737 114 837 3
# 12 out3 953 225 600 581 528 388 714 899 3
# 13 out3 615 550 860 134 667 136 987 993 3
# 14 out3 494 407 726 128 559 418 782 832 3
# 15 out3 729 734 432 354 716 288 734 264 3
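The same result can be produced in a single call, since map_dfr() accepts the same .id argument (a sketch, using the purrr/dplyr packages loaded above):
map_dfr(output, ~ .x[["parm"]][["stand"]], .id = "foo")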
library(tidyverse)
map(output,"parm") %>%
map("stand") %>%
map2(names(output), ~ cbind(.x, df_name=.y))
# $out1
# V1 V2 V3 V4 V5 V6 V7 V8 grp df_name
# 1 695 356 109 463 688 496 842 310 1 out1
# 2 922 450 680 170 567 921 530 419 1 out1
# 3 568 604 626 446 364 206 541 644 1 out1
# 4 210 237 300 432 366 945 413 368 1 out1
# 5 529 224 392 181 156 126 255 283 1 out1
#
# $out2
# V1 V2 V3 V4 V5 V6 V7 V8 grp df_name
# 1 320 429 109 749 394 657 690 764 2 out2
# 2 580 296 755 101 385 582 956 547 2 out2
# 3 939 122 697 146 747 108 672 836 2 out2
# 4 550 972 128 396 874 224 158 133 2 out2
# 5 923 650 888 895 742 166 533 225 2 out2
#
# $out3
# V1 V2 V3 V4 V5 V6 V7 V8 grp df_name
# 1 347 928 777 656 503 783 847 620 3 out3
# 2 496 586 919 991 810 797 779 202 3 out3
# 3 644 731 441 896 284 514 954 981 3 out3
# 4 303 803 945 806 938 692 587 775 3 out3
# 5 243 666 719 823 133 773 585 461 3 out3
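A closely related variation (my own, not from the answer above): imap() passes each element's name as .y, so the separate map2(names(output), ...) step can be folded in. With the tidyverse loaded as above:
map(output, "parm") %>%
  map("stand") %>%
  imap(~ cbind(.x, df_name = .y))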
