I have a dataset comparing 15 hybrids, each with 5 separate measurements. I am trying to spread the data into a wider dataset using pivot_wider for a regression analysis, since spread() would not work (probably because of the repeated observations).
The dataset I am working with is below:
data <- structure(list(hybrid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 4,
4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 7, 7, 7,
7, 7, 7, 7, 7, 7, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,
8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10,
10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11,
11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12,
12, 12, 12, 12, 12, 12, 12, 13, 13, 13, 13, 13, 13, 13, 13, 13,
13, 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14,
14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15, 15,
15, 15, 15), measurement = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4,
4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 1, 1,
1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3,
3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5,
5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2, 2,
2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4,
4, 5, 5, 5, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 1, 1, 1, 2, 2, 2, 3,
3, 3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5,
5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2,
2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4,
4, 4, 5, 5, 5, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5), value = c(245,
889, 450, 45, 515, 318, 956, 434, 29, 740, 156, 516, 767, 292,
753, 573, 636, 611, 777, 557, 408, 95, 482, 227, 495, 360, 55,
76, 393, 37, 667, 802, 724, 900, 885, 191, 79, 143, 531, 398,
324, 129, 172, 467, 25, 101, 476, 629, 915, 122, 498, 649, 354,
527, 920, 788, 565, 552, 586, 127, 461, 307, 77, 552, 198, 240,
816, 144, 136, 781, 593, 421, 233, 264, 812, 407, 492, 932, 940,
139, 764, 200, 352, 754, 271, 506, 381, 973, 678, 848, 432, 358,
218, 736, 287, 411, 220, 264, 531, 669, 666, 727, 841, 792, 79,
460, 159, 426, 90, 395, 793, 507, 262, 814, 157, 641, 230, 870,
304, 591, 636, 277, 534, 783, 562, 938, 889, 68, 557, 892, 809,
157, 71, 54, 256, 246, 301, 823, 622, 953, 6, 66, 556, 902, 207,
832, 248, 540, 192, 65, 381, 712, 15, 323, 1, 193, 146, 637,
488, 158, 289, 839, 229, 237, 273, 978, 560, 969, 898, 204, 335,
930, 444, 968, 920, 398, 303, 318, 975, 182, 630, 4, 624, 271,
272, 438, 661, 728, 32, 106, 473, 465, 498, 33, 189, 918, 704,
605, 867, 240, 833, 497, 514, 241, 860, 228, 643, 791, 4, 898,
574, 225, 339, 365, 387, 548, 88, 604, 283)), class = "data.frame", row.names = c(NA,
-219L))
I'm new to the pivot_wider function, so when I run my code, I get an error:
data%>%
pivot_wider(cols = -hybrid, names_to = c("1","2","3","4","5"))
Error in pivot_wider(., cols = -hybrid, names_to = c("1", "2", "3", "4", :
unused arguments (cols = -hybrid, names_to = c("1", "2", "3", "4", "5"))
How can I spread this data so that I have 5 columns? Hybrid, 1, 2, 3, 4, 5 (with the values under the columns entitled 1:5).
My guess is that you are looking for this:
library(tidyr)
pivot_wider(data, id_cols = hybrid, names_from = measurement, values_from = "value", values_fn = sum)
# # A tibble: 15 x 6
# hybrid `1` `2` `3` `4` `5`
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1584 878 1419 1412 1812
# 2 2 1820 1742 804 910 506
# 3 3 2193 1976 753 851 664
# 4 4 1206 1535 1530 2273 1265
# 5 5 845 990 1096 1795 1309
# 6 6 1831 1843 1306 1158 2499
# 7 7 1008 1434 1015 2062 1712
# 8 8 1045 1278 1583 1028 1765
# 9 9 913 1317 1500 957 1449
# 10 10 1037 556 1746 1025 1665
# 11 11 1620 638 1050 340 1283
# 12 12 1357 1488 2427 1469 2332
# 13 13 1019 1787 899 1371 866
# 14 14 1436 1140 2176 1570 1615
# 15 15 1662 1476 929 1023 887
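Summing collapses the repeated observations. If the regression needs each replicate kept as its own row, one option is to add a replicate index before widening. The following is a minimal sketch on a small made-up data frame; `toy` and the `rep` column are hypothetical names, not part of the original data.

```r
library(dplyr)
library(tidyr)

# Toy data (made-up values): 2 hybrids x 2 measurements x 2 replicates
toy <- data.frame(
  hybrid      = c(1, 1, 1, 1, 2, 2, 2, 2),
  measurement = c(1, 1, 2, 2, 1, 1, 2, 2),
  value       = c(10, 20, 30, 40, 50, 60, 70, 80)
)

wide <- toy %>%
  group_by(hybrid, measurement) %>%
  mutate(rep = row_number()) %>%  # replicate index within each hybrid/measurement cell
  ungroup() %>%
  pivot_wider(id_cols = c(hybrid, rep),
              names_from = measurement,
              values_from = value)

wide  # one row per hybrid/replicate, one column per measurement
```

Cells with fewer replicates (as for some hybrids in the real data) would simply be filled with NA.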
Using dcast from data.table
library(data.table)
dcast(setDT(data), hybrid ~ measurement, fun.aggregate = sum, value.var = "value")
# hybrid 1 2 3 4 5
# 1: 1 1584 878 1419 1412 1812
# 2: 2 1820 1742 804 910 506
# 3: 3 2193 1976 753 851 664
# 4: 4 1206 1535 1530 2273 1265
# 5: 5 845 990 1096 1795 1309
# 6: 6 1831 1843 1306 1158 2499
# 7: 7 1008 1434 1015 2062 1712
# 8: 8 1045 1278 1583 1028 1765
# 9: 9 913 1317 1500 957 1449
#10: 10 1037 556 1746 1025 1665
#11: 11 1620 638 1050 340 1283
#12: 12 1357 1488 2427 1469 2332
#13: 13 1019 1787 899 1371 866
#14: 14 1436 1140 2176 1570 1615
#15: 15 1662 1476 929 1023 887
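For completeness, a base R sketch of the same sum-per-cell table, assuming no extra packages are wanted: `xtabs()` sums the left-hand side of its formula over each combination of the right-hand side factors. The `toy` data frame below is made up for illustration.

```r
# Toy data (made-up values): 2 hybrids x 2 measurements x 2 replicates
toy <- data.frame(
  hybrid      = c(1, 1, 1, 1, 2, 2, 2, 2),
  measurement = c(1, 1, 2, 2, 1, 1, 2, 2),
  value       = c(10, 20, 30, 40, 50, 60, 70, 80)
)

# Sum of value for every hybrid/measurement combination
tab <- xtabs(value ~ hybrid + measurement, data = toy)
tab

# Convert the contingency table to a wide data frame (hybrid ends up in the row names)
wide <- as.data.frame.matrix(tab)
wide
```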
I have a dataset consisting of 20 plant genotypes with measurements of LAI, V1, V2, V3, V4, V5 being taken at three growth stages (1, 2, 3).
I need to separate the data in R (using the tidyverse package) into columns of genotype, stage, and measurement (consisting of LAI:V5). The code that I have tried does not work; how could I go about doing this? Here is what I have tried:
#Open packages
library(readr)
library(tidyr)
library(dplyr)
#Dataset:
dataset <- structure(list(plot = c(101, 102, 103, 104, 105, 106, 107, 108,
109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 101,
102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
115, 116, 117, 118, 119, 120, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
), genotype = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20), stage = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), LAI = c(822, 763,
551, 251, 800, 761, 343, 593, 997, 261, 19, 429, 566, 574, 174,
356, 891, 918, 948, 782, 902, 383, 704, 157, 358, 453, 723, 644,
308, 149, 504, 437, 348, 165, 128, 305, 778, 516, 347, 212, 792,
423, 565, 828, 106, 605, 603, 551, 145, 393, 914, 919, 672, 628,
143, 103, 906, 717, 18, 324), V1 = c(52, 556, 222, 534, 953,
346, 635, 84, 592, 444, 34, 340, 343, 188, 554, 397, 315, 643,
376, 101, 663, 42, 360, 645, 718, 883, 266, 225, 674, 797, 726,
259, 829, 701, 601, 206, 325, 963, 292, 985, 954, 828, 839, 541,
301, 312, 187, 59, 563, 577, 961, 239, 147, 203, 421, 690, 542,
412, 812, 19), V2 = c(354, 719, 45, 376, 921, 243, 256, 316,
384, 450, 166, 850, 784, 291, 889, 389, 925, 157, 37, 528, 847,
942, 624, 387, 680, 380, 848, 745, 49, 69, 864, 649, 125, 117,
911, 947, 212, 628, 162, 165, 395, 437, 102, 136, 446, 51, 106,
141, 886, 373, 113, 186, 233, 937, 698, 202, 89, 623, 731, 474
), V3 = c(18, 87, 692, 888, 681, 134, 774, 619, 544, 32, 804,
993, 147, 352, 825, 490, 196, 794, 900, 796, 617, 160, 688, 947,
665, 122, 386, 968, 772, 836, 696, 806, 925, 410, 949, 546, 303,
550, 359, 285, 167, 605, 780, 419, 925, 822, 142, 4, 648, 18,
867, 204, 617, 5, 251, 198, 316, 205, 660, 680), V4 = c(728,
266, 678, 958, 946, 248, 425, 777, 86, 340, 527, 766, 161, 187,
129, 881, 149, 888, 811, 118, 379, 22, 953, 940, 520, 200, 557,
438, 401, 25, 55, 155, 73, 834, 614, 933, 235, 759, 852, 29,
475, 356, 992, 765, 593, 703, 929, 823, 466, 717, 86, 607, 730,
7, 416, 727, 400, 904, 503, 881), V5 = c(550, 785, 954, 852,
718, 295, 208, 2, 36, 185, 726, 540, 476, 994, 720, 532, 401,
525, 504, 868, 414, 878, 808, 550, 740, 9, 936, 570, 477, 516,
561, 648, 686, 906, 387, 621, 461, 323, 829, 948, 964, 853, 943,
805, 349, 254, 979, 784, 246, 444, 71, 883, 345, 973, 546, 120,
310, 347, 732, 308)), class = "data.frame", row.names = c(NA,
-60L))
Code I have tried....
data <- gather(dataset, LAI, V1, V2, V3, V4, V5, -plot)
....provides these results (a sample of the resulting dataset):
plot genotype stage LAI V1
1 101 1 1 V2 354
2 102 2 1 V2 719
3 103 3 1 V2 45
4 104 4 1 V2 376
5 105 5 1 V2 921
6 106 6 1 V2 243
7 107 7 1 V2 256
8 108 8 1 V2 316
9 109 9 1 V2 384
10 110 10 1 V2 450
11 111 11 1 V2 166
12 112 12 1 V2 850
13 113 13 1 V2 784
14 114 14 1 V2 291
The outcome needs to be like this:
correct_format <- data.frame(genotype = c(1,
2,
3,
4,
5,
6),
stage = c(1,
1,
1,
1,
1,
1),
measurement = c("LAI",
"LAI",
"LAI",
"LAI",
"LAI",
"LAI"),
value = c(822,
763,
551,
251,
800,
761))
Perhaps, we need
library(dplyr)
library(tidyr)
dataset %>%
select(-plot) %>%
pivot_longer(cols = LAI:V5, names_to = 'measurement') %>%
arrange(measurement)
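To see the resulting shape on something small, here is a sketch using only the first two rows and two of the measurement columns from the question. The name `toy` is made up; `values_to = "value"` is written out explicitly, although `"value"` is already the default.

```r
library(tidyr)

# First two rows of the question's data, restricted to LAI and V1
toy <- data.frame(
  genotype = c(1, 2),
  stage    = c(1, 1),
  LAI      = c(822, 763),
  V1       = c(52, 556)
)

long <- pivot_longer(toy,
                     cols = LAI:V1,
                     names_to = "measurement",
                     values_to = "value")

# Same ordering as arrange(measurement): all LAI rows first, then V1
long <- long[order(long$measurement), ]
long
```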
dat <- structure(list(doy = c(274, 275, 276, 277, 278, 279, 280, 281, 282, 283, 284, 285, 286, 287, 288, 289, 290, 291, 292, 293, 294,
295, 296, 297, 298, 299, 300, 301, 302, 303, 304, 305, 306, 307, 308, 309, 310, 311, 312, 313, 314, 315,
316, 317, 318, 319, 320, 321, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 332, 333, 334, 335, 336,
337, 338, 339, 340, 341, 342, 343, 344, 345, 346, 347, 348, 349, 350, 351, 352, 353, 354, 355, 356, 357,
358, 359, 360, 361, 362, 363, 364, 365),
no.plant = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
cum.value = c(0, 1.34973713866726e-05, 0.000107973870218436, 0.000364365089792096, 0.000863464598244823, 0.00168576031682954,
0.00291120609231443, 0.00291120609231443, 0.0046189294134239, 0.00688687680728461, 0.00688687680728461,
0.00979139917551386, 0.0134067801825104, 0.0178047117788614, 0.0230537220148601, 0.0292185614529241,
0.0292185614529241, 0.0363595556987137, 0.0363595556987137, 0.0445319328097977, 0.0537851355741434,
0.0641621298405947, 0.0756987211882645, 0.0884228931969177, 0.102354181379628, 0.102354181379628, 0.117503097415405,
0.133870618627253, 0.151447757647197, 0.151447757647197, 0.170215226855778, 0.170215226855778,
0.190143211447851, 0.211191263836225, 0.233308330547831, 0.256432920794094, 0.280493423522773, 0.305408577012532,
0.331088091999851, 0.357433425992349, 0.384338702900249, 0.411691768499651, 0.439375368630229, 0.467268433537531,
0.495247448513112, 0.523187888081939, 0.550965688550059, 0.578458731861707, 0.605548312515632, 0.632120558828558,
0.658067780159839, 0.683289712849355, 0.707694639565394, 0.731200359474982, 0.753734990069534, 0.753734990069534,
0.753734990069534, 0.753734990069534, 0.775237585508182, 0.795658560857758, 0.814959916467899, 0.833115261761304,
0.850109642771837, 0.865939182653005, 0.865939182653005, 0.880610548937487, 0.894140265397845, 0.906553889802375,
0.917885081566473, 0.928174585188328, 0.93746915638157, 0.945820457966355, 0.95328395187962, 0.959917812174526,
0.965781881688334, 0.970936692282333, 0.975442565331355, 0.97935880560985, 0.97935880560985, 0.982742998037354,
0.985650413056059, 0.988133522855331, 0.990241627354782, 0.992020585910824, 0.993512648199701, 0.994756375705273,
0.995786643728671, 0.996634712840931, 0.997328358197721, 0.997892045086969, 0.998347139430071, 0.998347139430071)),
class = "data.frame", row.names = c(NA, -92L))
delta <- 0.04991736
I need to select those doy where cum.value reaches 1*delta, 2*delta, 3*delta, 4*delta, ..., n*delta, and also
include the last doy, which is 365, if n*delta does not reach doy 365.
At the moment I am selecting n by trial and error, by first creating a sequence 1:n. For example, 1:19:
qt.vec.19 <- 1:19 * delta
max(qt.vec.19) >= max(dat$cum.value)
FALSE
If I change qt.vec to 1:20
qt.vec.20 <- 1:20 * delta
max(qt.vec.20) >= max(dat$cum.value)
TRUE
This means that I can do 1*delta, 2*delta....19*delta and then also select the last doy.
sample.dat <- dat %>% dplyr::slice(unique(c(which.max(cum.value > qt.vec.19[1]),
which.max(cum.value > qt.vec.19[2]),
which.max(cum.value > qt.vec.19[3]),
which.max(cum.value > qt.vec.19[4]),
which.max(cum.value > qt.vec.19[5]),
which.max(cum.value > qt.vec.19[6]),
which.max(cum.value > qt.vec.19[7]),
which.max(cum.value > qt.vec.19[8]),
which.max(cum.value > qt.vec.19[9]),
which.max(cum.value > qt.vec.19[10]),
which.max(cum.value > qt.vec.19[11]),
which.max(cum.value > qt.vec.19[12]),
which.max(cum.value > qt.vec.19[13]),
which.max(cum.value > qt.vec.19[14]),
which.max(cum.value > qt.vec.19[15]),
which.max(cum.value > qt.vec.19[16]),
which.max(cum.value > qt.vec.19[17]),
which.max(cum.value > qt.vec.19[18]),
which.max(cum.value > qt.vec.19[19]))))
last.doy <- dat %>% dplyr::filter(doy == 365)
all.doy <- as.data.frame(rbind(sample.dat, last.doy))
doy no.plant cum.value
294 0 0.05378514
298 0 0.10235418
302 0 0.15144776
307 0 0.21119126
309 0 0.25643292
311 0 0.30540858
313 0 0.35743343
315 0 0.41169177
317 0 0.46726843
319 0 0.52318789
320 0 0.55096569
322 0 0.60554831
324 0 0.65806778
326 0 0.70769464
328 0 0.75373499
334 0 0.81495992
336 0 0.85010964
341 0 0.90655389
346 0 0.95328395
365 1 0.99834714
I was wondering if there's any better way to do this like selecting what my n value should be or avoid the long slice(unique(... part?
This is a matter of taste and context. You read a lot that loops are frowned upon in R, but they deliver results, are easy to read, and are base R, so there are no extra packages or new syntax to learn:
options( scipen = 10, digits = 15 ) # display all digits
dat <- read.csv( "crop89.csv" ) # load your data from a file
delta <- 0.04991736 # selected threshold
n <- 1 # initiate multiplier variable
all.doy <- dat[ 1, ] # initiate receiving data.frame
for( i in seq_len( nrow( dat ) ) ){ # loop through dat rows
if( dat[ i, "cum.value"] >= n * delta ){ # as soon as threshold is passed
all.doy[ n, ] <- dat[ i, ] # write the line to the target data.frame
n <- n + 1 # increment multiplier
}
}
all.doy[ n, ] <- dat[ i, ] # add the last row anyway
all.doy
> all.doy
doy no.plant cum.value
1 294 0 0.0537851355741434
25 298 0 0.1023541813796280
29 302 0 0.1514477576471970
34 307 0 0.2111912638362250
36 309 0 0.2564329207940940
38 311 0 0.3054085770125320
40 313 0 0.3574334259923490
42 315 0 0.4116917684996510
44 317 0 0.4672684335375310
46 319 0 0.5231878880819389
47 320 0 0.5509656885500590
49 322 0 0.6055483125156320
51 324 0 0.6580677801598390
53 326 0 0.7076946395653940
55 328 0 0.7537349900695340
61 334 0 0.8149599164678990
63 336 0 0.8501096427718370
68 341 0 0.9065538898023749
73 346 0 0.9532839518796200
92 365 1 0.9983471394300710
The main point is the cut function here:
library(data.table)
DT<-as.data.table(dat)
DT[,group:=as.numeric(cut(cum.value,c(-Inf,qt.vec.19,Inf),ordered_result = T))-1]
DT[,position:=frank(cum.value,ties.method = "first" ),by=group]
DT<-DT[position==1 & group>0]
DT[,position:=NULL]
DT[,group:=NULL]
if (max(DT$cum.value)!=max(dat$cum.value)) DT<-rbind(DT,dat[dat$doy==max(dat$doy),])
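Another possibility, sketched here in base R under the assumption that cum.value is non-decreasing: derive n from the maximum of cum.value instead of by trial and error, and replace the repeated `which.max()` calls with a single `vapply()` over the thresholds. The data frame `toy` and its values are made up for illustration.

```r
# Toy data (made-up values); cum.value is assumed to be non-decreasing
toy <- data.frame(
  doy       = 1:6,
  cum.value = c(0, 0.1, 0.25, 0.25, 0.5, 0.9)
)
delta <- 0.2

# Largest n whose threshold can still be exceeded by cum.value
n <- floor(max(toy$cum.value) / delta)
thresholds <- seq_len(n) * delta
# Guard against a threshold that is never strictly exceeded
thresholds <- thresholds[thresholds < max(toy$cum.value)]

# First row at which cum.value exceeds each threshold
idx <- vapply(thresholds,
              function(q) which.max(toy$cum.value > q),
              integer(1))

# Drop repeated rows and always keep the last row
idx <- unique(c(idx, nrow(toy)))
all.doy <- toy[idx, ]
all.doy
```

With the question's data, `toy` would be replaced by `dat` and `delta` by 0.04991736.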
I currently have data spread out across multiple columns in R. I am looking for a way to put this information into the one column as a vector for each of the individual rows.
Is there a function to do this?
For example, the data looks like this:
DF <- data.frame(id=rep(LETTERS, each=1)[1:26], replicate(26, sample(1001, 26)), Class=sample(c("Yes", "No"), 26, TRUE))
select(DF, cols=c("id", "X1","X2", "X23", "Class"))
How can I merge the columns "X1","X2", "X23" into a vector containing numeric type variables for each of the IDs?
Like this?
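library(reshape2)
library(magrittr)  # for %>%
df <- DF[, c("id", "X1", "X2", "X23", "Class")]  # the subset from the question
melt(df) %>% dcast(id ~ ., fun.aggregate = list)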
Using id, Class as id variables
id .
1 A 422, 74, 439
2 B 879, 443, 923
3 C 575, 901, 749
4 D 813, 747, 21
5 E 438, 526, 675
6 F 863, 562, 474
7 G 103, 713, 918
8 H 585, 294, 525
9 I 115, 76, 175
10 J 953, 379, 926
11 K 679, 439, 377
12 L 816, 624, 538
13 M 678, 226, 142
14 N 667, 369, 586
15 O 795, 422, 248
16 P 165, 22, 612
17 Q 294, 476, 746
18 R 968, 368, 290
19 S 238, 481, 980
20 T 921, 482, 741
21 U 550, 15, 296
22 V 121, 358, 625
23 W 213, 313, 242
24 X 92, 77, 58
25 Y 607, 936, 350
26 Z 660, 42, 275
A note though: I do not know your final use case, but this strikes me as something you probably do not want to have. It is often more advisable to stick to tidy data, see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
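If a list column really is what is needed, a base R sketch without reshaping would be to combine the columns row by row with `Map()`. The data frame below is a made-up two-row stand-in for `DF`, and the column name `merged` is hypothetical.

```r
# Made-up stand-in for DF with the three columns of interest
DF <- data.frame(
  id  = c("A", "B"),
  X1  = c(1, 2),
  X2  = c(3, 4),
  X23 = c(5, 6)
)

# Map() calls c() on the i-th element of each column,
# giving one numeric vector per row, stored as a list column
DF$merged <- Map(c, DF$X1, DF$X2, DF$X23)

DF$merged[[1]]  # numeric vector for id "A"
```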
For example, I have a data frame with one column containing numbers. This is how it looks.
head(c1)
c
1 300
2 302
3 304
4 306
5 308
6 310
Here is the sample data frame.
c1 <- structure(list(c = c(300, 302, 304, 306, 308, 310, 312, 314,
316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340,
342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366,
368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392,
394, 396, 398, 400)), .Names = "c", row.names = c(NA, -51L), class = "data.frame")
I want to delete the rows between 300 and 310, between 310 and 320, and so on.
I want to end up with a data frame like this:
300
310
320
330
340
350
.
.
.
400
Any ideas how to do this? I found how to remove every nth row, but not how to remove every four rows between two numbers.
You can make use of the modulo operator %%. If you want the result as an atomic vector, you can run
c1$c[c1$c %% 10 == 0]
or if you want it as a data.frame with 1 column, you can use
c1[c1$c %% 10 == 0, , drop=FALSE]
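As a small check, the value-based filter with `%%` can be compared with a position-based one that keeps every fifth row. This is only a sketch: the position-based version assumes the column is evenly spaced in steps of 2 and starts on a multiple of 10, as in the sample data, whereas the modulo version does not rely on row order.

```r
# Same values as the sample data: 300, 302, ..., 400
c1 <- data.frame(c = seq(300, 400, by = 2))

# Value-based: keep rows whose value is a multiple of 10
by_value <- c1[c1$c %% 10 == 0, , drop = FALSE]

# Position-based: keep rows 1, 6, 11, ... (every fifth row)
by_position <- c1[seq(1, nrow(c1), by = 5), , drop = FALSE]

by_value$c
```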