My sample data looks like this:
data <- read.table(header=T, text='
pid measurement1 Tdays1 measurement2 Tdays2 measurement3 Tdays3 measurment4 Tdays4
1 1356 1435 1483 1405 1563 1374 NA NA
2 943 1848 1173 1818 1300 1785 NA NA
3 1590 185 NA NA NA NA 1585 294
4 130 72 443 70 NA NA 136 79
4 140 82 NA NA NA NA 756 89
4 220 126 266 124 NA NA 703 128
4 166 159 213 156 476 145 776 166
4 380 189 583 173 NA NA 586 203
4 353 231 510 222 656 217 526 240
4 180 268 NA NA NA NA NA NA
4 NA NA NA NA NA NA 580 278
4 571 334 596 303 816 289 483 371
')
Now I would like it to look something like this:
PID Time (days) Value
1 1435 1356
1 1405 1483
1 1374 1563
2 1848 943
2 1818 1173
2 1785 1300
3 185 1590
... ... ...
How would I get there? I have looked into converting from wide to long format, but that doesn't seem to do the trick.
Kind regards, and thank you in advance.
Here is a base R option
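# Pair up the measurement/Tdays columns, stack the pairs, and prepend pid
u <- cbind(
  data[1],
  do.call(
    rbind,
    lapply(
      split.default(data[-1], ceiling(seq_along(data[-1]) / 2)),
      setNames,
      c("Value", "Time")
    )
  )
)
# Sort by pid, drop rows with missing values, and reset the row names
out <- `row.names<-`(
  subset(
    x <- u[order(u$pid), ],
    complete.cases(x)
  ), NULL
)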
such that
> out
pid Value Time
1 1 1356 1435
2 1 1483 1405
3 1 1563 1374
4 2 943 1848
5 2 1173 1818
6 2 1300 1785
7 3 1590 185
8 3 1585 294
9 4 130 72
10 4 140 82
11 4 220 126
12 4 166 159
13 4 380 189
14 4 353 231
15 4 180 268
16 4 571 334
17 4 443 70
18 4 266 124
19 4 213 156
20 4 583 173
21 4 510 222
22 4 596 303
23 4 476 145
24 4 656 217
25 4 816 289
26 4 136 79
27 4 756 89
28 4 703 128
29 4 776 166
30 4 586 203
31 4 526 240
32 4 580 278
33 4 483 371
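For readers who find the nested call hard to follow, the same stacking idea can be written out more explicitly. This is only a minimal sketch on a hypothetical two-visit table called `wide` (made-up values, not the asker's data), and it assumes the columns follow the `measurementK` / `TdaysK` naming pattern:

```r
# Hypothetical miniature of the wide layout (values are made up)
wide <- data.frame(
  pid = c(1, 2),
  measurement1 = c(10, 20), Tdays1 = c(100, 200),
  measurement2 = c(11, NA), Tdays2 = c(101, NA)
)

# Build one (pid, Time, Value) block per visit number k
pieces <- lapply(1:2, function(k) {
  data.frame(
    pid   = wide$pid,
    Time  = wide[[paste0("Tdays", k)]],
    Value = wide[[paste0("measurement", k)]]
  )
})

# Stack the blocks, drop the NA pairs, and sort by pid
long <- do.call(rbind, pieces)
long <- long[complete.cases(long), ]
long <- long[order(long$pid), ]
rownames(long) <- NULL
long
```

With the real data you would loop over `1:4` instead of `1:2`, after making sure the fourth measurement column is spelled like the others.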
An option with pivot_longer
library(dplyr)
library(tidyr)
names(data)[8] <- "measurement4"  # fix the "measurment4" typo so all names share one pattern
data %>%
  pivot_longer(cols = -pid, names_to = c(".value", "grp"),
               names_sep = "(?<=[a-z])(?=[0-9])", values_drop_na = TRUE) %>%
  select(-grp)
# A tibble: 33 x 3
# pid measurement Tdays
# <int> <int> <int>
# 1 1 1356 1435
# 2 1 1483 1405
# 3 1 1563 1374
# 4 2 943 1848
# 5 2 1173 1818
# 6 2 1300 1785
# 7 3 1590 185
# 8 3 1585 294
# 9 4 130 72
#10 4 443 70
# … with 23 more rows
Sample data
set.seed(16)
aaa <- 1:1000
aaa[round(runif(100,1,1000))] <- NA
aaa.df <- as.data.frame(matrix(aaa, ncol=5))
I want aaa.df to be split into multiple groups based on which column(s) contain NA value(s). For example, if the 10th, 16th and 200th rows have an NA value in the same column, I want these rows to be in one group, and so on. It should also work when (a) there are no NA values in a row and (b) there are multiple NA values in a row.
I also want to keep the original row number when grouping.
Edit: To make it clearer, this is the expected output (obtained using Taufi's answer, but I am still looking for a more elegant way):
[[1]]
# A tibble: 119 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 1 201 401 601 801 1
2 2 202 402 602 802 2
3 3 203 403 603 803 3
4 4 204 404 604 804 4
5 5 205 405 605 805 5
6 6 206 406 606 806 6
7 7 207 407 607 807 7
8 8 208 408 608 808 8
9 9 209 409 609 809 9
10 10 210 410 610 810 10
# ... with 109 more rows
[[2]]
# A tibble: 14 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 20 220 420 620 NA 20
2 32 232 432 632 NA 32
3 47 247 447 647 NA 47
4 70 270 470 670 NA 70
5 85 285 485 685 NA 85
6 92 292 492 692 NA 92
7 129 329 529 729 NA 129
8 132 332 532 732 NA 132
9 137 337 537 737 NA 137
10 151 351 551 751 NA 151
11 152 352 552 752 NA 152
12 168 368 568 768 NA 168
13 178 378 578 778 NA 178
14 181 381 581 781 NA 181
[[3]]
# A tibble: 15 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 11 211 411 NA 811 11
2 37 237 437 NA 837 37
3 62 262 462 NA 862 62
4 82 282 482 NA 882 82
5 83 283 483 NA 883 83
6 89 289 489 NA 889 89
7 107 307 507 NA 907 107
8 115 315 515 NA 915 115
9 116 316 516 NA 916 116
10 117 317 517 NA 917 117
11 118 318 518 NA 918 118
12 165 365 565 NA 965 165
13 176 376 576 NA 976 176
14 189 389 589 NA 989 189
15 200 400 600 NA 1000 200
[[4]]
# A tibble: 1 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 12 212 412 NA NA 12
[[5]]
# A tibble: 16 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 17 217 NA 617 817 17
2 28 228 NA 628 828 28
3 31 231 NA 631 831 31
4 48 248 NA 648 848 48
5 58 258 NA 658 858 58
6 72 272 NA 672 872 72
7 80 280 NA 680 880 80
8 126 326 NA 726 926 126
9 144 344 NA 744 944 144
10 145 345 NA 745 945 145
11 149 349 NA 749 949 149
12 153 353 NA 753 953 153
13 186 386 NA 786 986 186
14 190 390 NA 790 990 190
15 192 392 NA 792 992 192
16 196 396 NA 796 996 196
and so on..
In addition to my previous, more brute-force answer, I came up with the following, far more elegant one-liner that avoids any unnecessary joins or intermediate assignment steps. Since you already accepted my previous answer, I will leave it as it stands and add the conceptually different one-liner below. The idea is to split() the data.frame based on the pasted column numbers from which() that indicate the presence of NA.
split(aaa.df,
apply(aaa.df, 1,
function(x) paste(which(is.na(x)), collapse = ",")))
Output
$`1`
V1 V2 V3 V4 V5
77 NA 277 477 677 877
93 NA 293 493 693 893
97 NA 297 497 697 897
109 NA 309 509 709 909
119 NA 319 519 719 919
140 NA 340 540 740 940
154 NA 354 554 754 954
158 NA 358 558 758 958
171 NA 371 571 771 971
172 NA 372 572 772 972
$`1,2,3`
V1 V2 V3 V4 V5
51 NA NA NA 651 851
$`1,3,5`
V1 V2 V3 V4 V5
75 NA 275 NA 675 NA
$`1,4`
V1 V2 V3 V4 V5
194 NA 394 594 NA 994
$`1,4,5`
V1 V2 V3 V4 V5
49 NA 249 449 NA NA
...
and so on ...
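To see on a small scale that the original row numbers survive as row names, here is a minimal sketch on a hypothetical four-row data frame `toy` (not the asker's `aaa.df`). Relabelling the empty key as "none" is my own addition, since rows without any NA otherwise end up under an empty name that is awkward to index by:

```r
# Hypothetical toy data: rows 2 and 4 share the same NA pattern
toy <- data.frame(
  V1 = c(1, NA, 3, NA),
  V2 = c(5, 6, NA, 8)
)

# One key per row: the comma-separated column positions that are NA
key <- apply(toy, 1, function(x) paste(which(is.na(x)), collapse = ","))

# Rows without NA get the key ""; give them a readable label (assumption)
key[key == ""] <- "none"

groups <- split(toy, key)

# Row names of each piece are the original row numbers
rownames(groups[["1"]])
rownames(groups[["none"]])
```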
A quick, but not very elegant, solution would be as follows. Note that the original row number ends up in V6.
library(dplyr)
library(magrittr)  # for %<>%
aaa.df %<>% mutate(Rownum = 1:nrow(aaa.df))
Aux.df <- cbind(is.na(aaa.df[, 1:(ncol(aaa.df) - 1)]), 1:nrow(aaa.df)) %>%
  as.data.frame %>%
  group_by(V1, V2, V3, V4, V5) %>%
  group_split
Sol <- lapply(Aux.df, function(x) inner_join(x, aaa.df, by = c("V6"="Rownum")) %>%
  select(V1.y, V2.y, V3.y, V4.y, V5.y, V6))
Output
> Sol
[[1]]
# A tibble: 119 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 1 201 401 601 801 1
2 2 202 402 602 802 2
3 3 203 403 603 803 3
4 4 204 404 604 804 4
5 5 205 405 605 805 5
6 6 206 406 606 806 6
7 7 207 407 607 807 7
8 8 208 408 608 808 8
9 9 209 409 609 809 9
10 10 210 410 610 810 10
# ... with 109 more rows
[[2]]
# A tibble: 14 x 6
V1.y V2.y V3.y V4.y V5.y V6
<int> <int> <int> <int> <int> <int>
1 20 220 420 620 NA 20
2 32 232 432 632 NA 32
3 47 247 447 647 NA 47
4 70 270 470 670 NA 70
5 85 285 485 685 NA 85
6 92 292 492 692 NA 92
7 129 329 529 729 NA 129
8 132 332 532 732 NA 132
9 137 337 537 737 NA 137
10 151 351 551 751 NA 151
11 152 352 552 752 NA 152
12 168 368 568 768 NA 168
13 178 378 578 778 NA 178
14 181 381 581 781 NA 181
....
and so on ...
I am trying to run a time series analysis on the following data set:
Year 1771 1772 1773 1774 1775 1776 1777 1778 1779 1780
Number 101 82 66 35 31 7 20 92 154 125
Year 1781 1782 1783 1784 1785 1786 1787 1788 1789 1790
Number 85 68 38 23 10 24 83 132 131 118
Year 1791 1792 1793 1794 1795 1796 1797 1798 1799 1800
Number 90 67 60 47 41 21 16 6 4 7
Year 1801 1802 1803 1804 1805 1806 1807 1808 1809 1810
Number 14 34 45 43 48 42 28 10 8 2
Year 1811 1812 1813 1814 1815 1816 1817 1818 1819 1820
Number 0 1 5 12 14 35 46 41 30 24
Year 1821 1822 1823 1824 1825 1826 1827 1828 1829 1830
Number 16 7 4 2 8 17 36 50 62 67
Year 1831 1832 1833 1834 1835 1836 1837 1838 1839 1840
Number 71 48 28 8 13 57 122 138 103 86
Year 1841 1842 1843 1844 1845 1846 1847 1848 1849 1850
Number 63 37 24 11 15 40 62 98 124 96
Year 1851 1852 1853 1854 1855 1856 1857 1858 1859 1860
Number 66 64 54 39 21 7 4 23 55 94
Year 1861 1862 1863 1864 1865 1866 1867 1868 1869 1870
Number 96 77 59 44 47 30 16 7 37 74
My problem is that the data is placed in multiple rows. I am trying to make two columns from the data. One for Year and one for Number, so that it is easily readable in R. I have tried
> library(tidyverse)
> sun.df = data.frame(sunspots)
> Year = filter(sun.df, sunspots == "Year")
to isolate the Year data, and it works, but I am unsure of how to then place it in a column.
Any suggestions?
Try this:
library(tidyverse)
df <- read_csv("test.csv",col_names = FALSE)
df
# A tibble: 6 x 4
# X1 X2 X3 X4
# <chr> <dbl> <dbl> <dbl>
# 1 Year 123 124 125
# 2 Number 1 2 3
# 3 Year 126 127 128
# 4 Number 4 5 6
# 5 Year 129 130 131
# 6 Number 7 8 9
# Removing first column and transpose it to get a dataframe of numbers
df_number <- as.data.frame(as.matrix(t(df[,-1])),row.names = FALSE)
df_number
# V1 V2 V3 V4 V5 V6
# 1 123 1 126 4 129 7
# 2 124 2 127 5 130 8
# 3 125 3 128 6 131 9
# Keep the first two column (V1,V2) and assign column names
df_new <- df_number[1:2]
colnames(df_new) <- c("Year","Number")
# Iterate and rbind with subsequent columns (2 by 2) to df_new
# seq_len() skips the loop cleanly when there are only two columns
for (i in seq_len((ncol(df_number) - 2) / 2)) {
  df_mini <- df_number[(i*2+1):(i*2+2)]
  colnames(df_mini) <- c("Year","Number")
  df_new <- rbind(df_new, df_mini)
}
df_new
# Year Number
# 1 123 1
# 2 124 2
# 3 125 3
# 4 126 4
# 5 127 5
# 6 128 6
# 7 129 7
# 8 130 8
# 9 131 9
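If the loop feels heavy, the same reshaping can be sketched without it by separating the "Year" rows from the "Number" rows and flattening each set row by row. This is a minimal sketch on a hypothetical data frame `raw` that mimics the layout after reading (label in the first column, values in the rest); the column names `X1`, `X2`, `X3` are assumptions:

```r
# Hypothetical miniature of the alternating Year/Number layout
raw <- data.frame(
  X1 = c("Year", "Number", "Year", "Number"),
  X2 = c(1771, 101, 1781, 85),
  X3 = c(1772, 82, 1782, 68)
)

vals    <- as.matrix(raw[, -1])   # numeric part only
is_year <- raw$X1 == "Year"       # TRUE for the Year rows

# t() + as.vector() reads each selected row left to right, one row after another
tidy <- data.frame(
  Year   = as.vector(t(vals[is_year, , drop = FALSE])),
  Number = as.vector(t(vals[!is_year, , drop = FALSE]))
)
tidy
```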
I'm dealing with the following dataset
animal protein herd sire dam
6 416 189.29 2 15 236
7 417 183.27 2 6 295
9 419 193.24 3 11 268
10 420 198.84 2 12 295
11 421 205.25 3 3 251
12 422 204.15 2 2 281
13 423 200.20 2 3 248
14 424 197.22 2 11 222
15 425 201.14 1 10 262
17 427 196.20 1 11 290
18 428 208.13 3 9 294
19 429 213.01 3 14 254
21 431 203.38 2 4 273
22 432 190.56 2 8 248
25 435 196.59 3 9 226
26 436 193.31 3 10 249
27 437 207.89 3 7 272
29 439 202.98 2 10 260
30 440 177.28 2 4 291
31 441 182.04 1 6 282
32 442 217.50 2 3 265
33 443 190.43 2 11 248
35 445 197.24 2 4 256
37 447 197.16 3 5 240
42 452 183.07 3 5 293
43 453 197.99 2 6 293
44 454 208.27 2 6 254
45 455 187.61 3 12 271
46 456 173.18 2 6 280
47 457 187.89 2 6 235
48 458 191.96 1 7 286
49 459 196.39 1 4 275
50 460 178.51 2 13 262
52 462 204.17 1 6 253
53 463 203.77 2 11 273
54 464 206.25 1 13 249
55 465 211.63 2 13 222
56 466 211.34 1 6 228
57 467 194.34 2 1 217
58 468 201.53 2 12 247
59 469 198.01 2 3 251
60 470 188.94 2 7 290
61 471 190.49 3 2 220
62 472 197.34 2 3 224
63 473 194.04 1 15 229
64 474 202.74 2 1 287
67 477 189.98 1 6 300
69 479 206.37 3 2 293
70 480 183.81 2 10 274
72 482 190.70 2 12 265
74 484 194.25 3 2 262
75 485 191.15 3 10 297
76 486 193.23 3 15 255
77 487 193.29 2 4 266
78 488 182.20 1 15 260
81 491 195.89 2 12 294
82 492 200.77 1 8 278
83 493 179.12 2 7 281
85 495 172.14 3 13 252
86 496 183.82 1 4 264
88 498 195.32 1 6 249
89 499 197.19 1 13 274
90 500 178.07 1 8 293
92 502 209.65 2 7 241
95 505 199.66 3 5 220
96 506 190.96 2 11 259
98 508 206.58 3 3 230
100 510 196.60 2 5 231
103 513 193.25 2 15 280
104 514 181.34 2 3 227
I'm interested in the animal indices and the dam indices that correspond to them. Using the table function, I was able to check that some dams are matched to several animals. In fact, I got the following output:
217 220 222 224 226 227 228 229 230 231 235 236 240 241 247 248 249 251 252 253 254 255 256 259 260 262
1 2 2 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 1 1 2 1 1 1 2 3
264 265 266 268 271 272 273 274 275 278 280 281 282 286 287 290 291 293 294 295 297 300
1 2 1 1 1 1 2 2 1 1 2 2 1 1 1 2 1 4 2 2 1 1
Using the length function, I checked that there are only 48 dams in this dataset.
I would like to reindex them with the integers 1, ..., 48 instead of the ones given in my set. Is there a way to do this?
You can use match and unique.
df$index <- match(df$dam, unique(df$dam))
Or convert to factor and then integer
df$index <- as.integer(factor(df$dam))
Another option is group_indices from dplyr.
df$index <- dplyr::group_indices(df, dam)
We can use .GRP in data.table
library(data.table)
setDT(df)[, index := .GRP, dam]
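One thing worth knowing before picking between the first two options: they number the dams differently. match() with unique() numbers them in order of first appearance, while the factor route numbers them in sorted order. A minimal sketch using the first five dam values from the question (the vector name `dam` is just for illustration):

```r
# First five dam ids from the question's data
dam <- c(236, 295, 268, 295, 251)

# Numbered by first appearance: 236 -> 1, 295 -> 2, 268 -> 3, 251 -> 4
by_appearance <- match(dam, unique(dam))

# Numbered by sorted value: 236 -> 1, 251 -> 2, 268 -> 3, 295 -> 4
by_value <- as.integer(factor(dam))

by_appearance
by_value
```

Either way the same dam always gets the same index, so the choice only matters if the order of the new labels is meaningful to you.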
I have the question: Construct a list of all twin primes less than 1000
So far my code is:
isPrime <- function (n ) n==2L || all (n %% 2L:max (2, floor(sqrt(n)))!=0)
I'm having trouble constructing the actual list itself. Any suggestions?
You could use sapply to get your primes and then diff to find the pairs.
(Thanks Rui for pointing out that sapply is more suited than lapply here!)
testThese <- 2:1000  # start at 2: isPrime(1) as written returns TRUE
primes <- testThese[sapply(testThese,isPrime)]
pairs.temp <- which(diff(primes)==2)
pairs <- sort(c(pairs.temp, pairs.temp+1))
matrix(primes[pairs], ncol=2, byrow=TRUE)
[,1] [,2]
[1,] 3 5
[2,] 5 7
[3,] 11 13
[4,] 17 19
[5,] 29 31
... ... ...
Here is a solution using the Sieve of Eratosthenes:
E <- rep(TRUE, 1000)
E[1] <- FALSE
for (i in 2:33) {
if (!E[i]) next
E[seq(i+i, 1000, i)] <- FALSE
}
P <- which(E) ## primes
pp <- which(diff(P)==2) ## index of the first twin
cbind(P[pp], P[pp+1]) ## the twins
If you need a function isPrime() you can do:
isPrime <- function(i) E[i]
isPrime(c(1,2,4,5)) ## Test
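If you want the limit to be a parameter rather than the hard-coded 1000 and 33, the sieve can be wrapped in a function. This is a minimal sketch; `twin_primes` is a hypothetical name, and starting the crossing-out at `i * i` instead of `i + i` is a small variation of mine (smaller multiples were already removed by smaller primes):

```r
# Hypothetical helper: sieve up to `limit`, then keep primes that differ by 2
twin_primes <- function(limit) {
  stopifnot(limit >= 2)
  is_prime <- rep(TRUE, limit)
  is_prime[1] <- FALSE
  for (i in seq_len(floor(sqrt(limit)))) {
    if (!is_prime[i]) next
    # i <= sqrt(limit), so i * i <= limit and the sequence is valid
    is_prime[seq(i * i, limit, by = i)] <- FALSE
  }
  P <- which(is_prime)              # all primes up to limit
  first <- which(diff(P) == 2)      # positions of the lower twin
  cbind(lower = P[first], upper = P[first + 1])
}

tp <- twin_primes(1000)
nrow(tp)   # number of twin prime pairs found
head(tp)
```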
Here is how you can construct (not very efficiently though) a list of primes using your function:
primes_list <- vector(length = 0, mode = "integer")
for (i in 1:1000) {
if (isPrime(i)) primes_list <- c(primes_list, i)
}
You should be able to extend that to sorting out the twin primes.
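One possible way to make that extension, as a sketch: a prime p starts a twin pair exactly when p + 2 is also in the list. The `isPrime` below is a restatement of the asker's function with an added guard for n < 2 (an assumption on my part, since the original returns TRUE for 1), and `Filter()` stands in for the loop:

```r
# Restated trial-division test with a guard so that 0 and 1 are not prime
isPrime <- function(n) {
  n >= 2L && (n < 4L || all(n %% 2L:floor(sqrt(n)) != 0))
}

# Same result as growing the vector in a loop, without repeated c()
primes_list <- Filter(isPrime, 1:1000)

# p is the lower twin exactly when p + 2 is also prime
lower <- primes_list[(primes_list + 2L) %in% primes_list]
twins <- cbind(lower = lower, upper = lower + 2L)
head(twins)
```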
How about the following?
library(gmp)
library(dplyr)
df <- expand.grid(x = 1:1000)
df$y <- isprime(df$x)
df <- df[df$y == 2,]
df[c(0,diff(df$x)) == 2 | lead(c(0,diff(df$x)) == 2, 1, FALSE),]
x y
3 3 2
5 5 2
7 7 2
11 11 2
13 13 2
17 17 2
19 19 2
29 29 2
31 31 2
41 41 2
43 43 2
59 59 2
61 61 2
71 71 2
73 73 2
101 101 2
103 103 2
107 107 2
109 109 2
137 137 2
139 139 2
149 149 2
151 151 2
179 179 2
181 181 2
191 191 2
193 193 2
197 197 2
199 199 2
227 227 2
229 229 2
239 239 2
241 241 2
269 269 2
271 271 2
281 281 2
283 283 2
311 311 2
313 313 2
347 347 2
349 349 2
419 419 2
421 421 2
431 431 2
433 433 2
461 461 2
463 463 2
521 521 2
523 523 2
569 569 2
571 571 2
599 599 2
601 601 2
617 617 2
619 619 2
641 641 2
643 643 2
659 659 2
661 661 2
809 809 2
811 811 2
821 821 2
823 823 2
827 827 2
829 829 2
857 857 2
859 859 2
881 881 2
883 883 2
I'm getting this error, but the fixes in related posts don't seem to apply. I'm using ungroup, though it's no longer needed (see "can I switch the grouping variable in a single dplyr statement?", but also "Format column within dplyr chain"). I also have no quotes in my group_by call, and I'm not applying any functions that act on the grouped-by columns (see "R dplyr summarize_each --> 'Error: cannot modify grouping variable'"), but I'm still getting this error:
> games2 = baseball %>%
+ ungroup %>%
+ group_by(id, year) %>%
+ summarize(total=g+ab, a = ab+1, id = id)%>%
+ arrange(desc(total)) %>%
+ head(10)
Error: cannot modify grouping variable
This is the baseball set that comes with plyr:
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
4 ansonca01 1871 1 RC1 25 120 29 39 11 3 0 16 6 2 2 1 NA NA NA NA NA
44 forceda01 1871 1 WS3 32 162 45 45 9 4 0 29 8 0 4 0 NA NA NA NA NA
68 mathebo01 1871 1 FW1 19 89 15 24 3 1 0 10 2 1 2 0 NA NA NA NA NA
99 startjo01 1871 1 NY2 33 161 35 58 5 1 1 34 4 2 3 0 NA NA NA NA NA
102 suttoez01 1871 1 CL1 29 128 35 45 3 7 3 23 3 1 1 0 NA NA NA NA NA
106 whitede01 1871 1 CL1 29 146 40 47 6 5 1 21 2 2 4 1 NA NA NA NA NA
I loaded plyr before dplyr. Other bugs to check for? Thanks for any corrections/suggestions.
It is not clear what you are doing. I think the following is what you are looking for:
games2 = baseball %>%
  group_by(id, year) %>%
  mutate(total=g+ab, a = ab+1)%>%
  arrange(desc(total)) %>%
  head(10)
> games2
Source: local data frame [10 x 24]
Groups: id, year
id year stint team lg g ab r h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp total a
1 aaronha01 1954 1 ML1 NL 122 468 58 131 27 6 13 69 2 2 28 39 NA 3 6 4 13 590 469
2 aaronha01 1955 1 ML1 NL 153 602 105 189 37 9 27 106 3 1 49 61 5 3 7 4 20 755 603
3 aaronha01 1956 1 ML1 NL 153 609 106 200 34 14 26 92 2 4 37 54 6 2 5 7 21 762 610
4 aaronha01 1957 1 ML1 NL 151 615 118 198 27 6 44 132 1 1 57 58 15 0 0 3 13 766 616
5 aaronha01 1958 1 ML1 NL 153 601 109 196 34 4 30 95 4 1 59 49 16 1 0 3 21 754 602
6 aaronha01 1959 1 ML1 NL 154 629 116 223 46 7 39 123 8 0 51 54 17 4 0 9 19 783 630
7 aaronha01 1960 1 ML1 NL 153 590 102 172 20 11 40 126 16 7 60 63 13 2 0 12 8 743 591
8 aaronha01 1961 1 ML1 NL 155 603 115 197 39 10 34 120 21 9 56 64 20 2 1 9 16 758 604
9 aaronha01 1962 1 ML1 NL 156 592 127 191 28 6 45 128 15 7 66 73 14 3 0 6 14 748 593
10 aaronha01 1963 1 ML1 NL 161 631 121 201 29 4 44 130 31 5 78 94 18 0 0 5 11 792 632
The problem is that you are trying to modify id in the summarize call (the id = id argument), but you have grouped on id.
From your example, it looks like you want mutate anyway. You would use summarize if you were looking to apply a function that would return a single value like sum or mean.
games2 = baseball %>%
dplyr::group_by(id, year) %>%
dplyr::mutate(
total = g + ab,
a = ab + 1
) %>%
dplyr::select(id, year, total, a) %>%
dplyr::arrange(desc(total)) %>%
head(10)
Source: local data frame [10 x 4]
Groups: id, year
id year total a
1 aaronha01 1954 590 469
2 aaronha01 1955 755 603
3 aaronha01 1956 762 610
4 aaronha01 1957 766 616
5 aaronha01 1958 754 602
6 aaronha01 1959 783 630
7 aaronha01 1960 743 591
8 aaronha01 1961 758 604
9 aaronha01 1962 748 593
10 aaronha01 1963 792 632
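To make the mutate versus summarize distinction concrete, here is a minimal sketch on a hypothetical three-row data frame `toy` (made-up values, not the baseball set). It assumes a reasonably recent dplyr and only uses group_by(), mutate(), summarise() and ungroup():

```r
library(dplyr)

# Hypothetical toy data: player "a" has two rows, player "b" has one
toy <- data.frame(
  id = c("a", "a", "b"),
  g  = c(10, 20, 30),
  ab = c(1, 2, 3)
)

# mutate() keeps every row and adds a column computed row by row
per_row <- toy %>%
  group_by(id) %>%
  mutate(total = g + ab) %>%
  ungroup()

# summarise() collapses each group to a single row, so it wants an aggregate
per_group <- toy %>%
  group_by(id) %>%
  summarise(total = sum(g + ab))

per_row
per_group
```

Note that neither call mentions id on the left-hand side of an assignment; the grouping column is carried along automatically.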