remove every 4 rows between two numbers - r

For example, I have a data frame with one column containing numbers. This is how it looks:
head(c1)
c
1 300
2 302
3 304
4 306
5 308
6 310
Here is the sample data frame.
c1 <- structure(list(c = c(300, 302, 304, 306, 308, 310, 312, 314,
316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340,
342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366,
368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392,
394, 396, 398, 400)), .Names = "c", row.names = c(NA, -51L), class = "data.frame")
I want to delete the rows between 300 and 310, between 310 and 320, and so on.
I want to end up with a data frame like this:
300
310
320
330
340
350
.
.
.
400
Any ideas how to do this? I found how to remove every nth row, but not how to remove every four rows between two numbers.

You can make use of the modulo operator %%. If you want the result as an atomic vector, you can run
c1$c[c1$c %% 10 == 0]
or if you want it as a data.frame with 1 column, you can use
c1[c1$c %% 10 == 0, , drop=FALSE]
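If the values are evenly spaced, as in the sample, selecting by position is another option. This is only a sketch under that assumption: it rebuilds c1 with seq() so that it runs on its own, and kept is a name made up for the illustration.

```r
# Same values as the question's c1: even numbers from 300 to 400
c1 <- data.frame(c = seq(300, 400, by = 2))

# Keep rows 1, 6, 11, ...: every fifth row, which drops the four rows in between
kept <- c1[seq(1, nrow(c1), by = 5), , drop = FALSE]
kept$c
```

Unlike the modulo version, this depends on row order rather than on the values, so it would give a different result if a row were missing or the first value were not a multiple of 10.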

Related

How to manipulate data frame to better format which is usable in ggplot?

I am working through some messy data where, after reading it in, it appears as the following:
> glimpse(il_births)
Rows: 106
Columns: 22
$ x1989 <dbl> 190247, 928, 175, 187, 445, 57, 425, 41, 207, 166, 2662, 48…
$ x1990 <dbl> 195499, 960, 192, 195, 462, 68, 449, 53, 222, 187, 2574, 47…
$ x1991 <dbl> 194066, 971, 164, 195, 464, 72, 448, 54, 179, 211, 2562, 49…
$ x1992 <dbl> 190923, 881, 189, 185, 462, 72, 414, 55, 201, 161, 2426, 46…
$ x1993 <dbl> 190709, 893, 152, 206, 497, 50, 389, 75, 202, 183, 2337, 43…
$ x1994 <dbl> 189182, 865, 158, 200, 538, 58, 429, 48, 189, 171, 2240, 41…
$ x1995 <dbl> 185801, 828, 140, 202, 566, 58, 417, 48, 173, 166, 2117, 43…
$ x1996 <dbl> 183079, 830, 147, 194, 529, 58, 417, 49, 175, 150, 2270, 41…
$ x1997 <dbl> 180649, 812, 132, 193, 531, 64, 389, 37, 163, 185, 2175, 43…
$ x1998 <dbl> 182503, 862, 140, 201, 545, 41, 417, 57, 185, 188, 2128, 41…
$ x1999 <dbl> 182027, 843, 117, 188, 595, 51, 396, 47, 193, 191, 2194, 39…
$ x2000 <dbl> 185003, 825, 132, 184, 587, 63, 434, 51, 170, 181, 2260, 40…
$ x2001 <dbl> 184022, 866, 138, 196, 629, 57, 420, 49, 147, 215, 2312, 39…
$ x2002 <dbl> 180555, 760, 129, 172, 629, 54, 434, 48, 191, 185, 2226, 39…
$ x2003 <dbl> 182393, 794, 141, 239, 668, 76, 458, 58, 154, 208, 2288, 39…
$ x2004 <dbl> 180665, 802, 126, 209, 646, 56, 396, 51, 151, 181, 2291, 42…
$ x2005 <dbl> 178872, 883, 122, 189, 744, 54, 409, 58, 160, 199, 2490, 40…
$ x2006 <dbl> 180503, 805, 112, 215, 737, 57, 392, 55, 140, 177, 2455, 41…
$ x2007 <dbl> 180530, 890, 136, 185, 736, 60, 413, 49, 163, 195, 2508, 44…
$ x2008 <dbl> 176634, 817, 120, 173, 676, 64, 409, 59, 142, 200, 2482, 40…
$ x2009 <dbl> 171077, 804, 114, 198, 622, 65, 381, 53, 123, 164, 2407, 40…
$ county_name <chr> "ILLINOIS TOTAL", "ADAMS", "ALEXANDER", "BOND", "BOONE", "B…
The data comes from All Live Births In Illinois, 1989-2009. The data frame is difficult to work with, as the years are the column headers in addition to a column with all of the counties. I would prefer if the table were formatted such that there is a year column and a county column, and each row contains an observation for one year and one county. This would make it easier to work with in ggplot such that I can make some quick visualizations of the data.
I first tried transposing the data frame, but that leaves counties as rows so that does not help much.
I also tried using the pivot_longer() function but was not sure how to set my parameters based on my issue.
Any help or suggestions are appreciated!
I suspect a reading of pivot_longer's help page would have done the trick:
data - A data frame to pivot.
cols - Columns to pivot into longer format.
names_to - A character vector specifying the new column or columns to
create from the information stored in the column names of data
specified by cols.
values_to - A string specifying the name of the column to create from
the data stored in cell values.
The other arguments are for more complex operations. To solve your case:
data should be il_births
cols should be all the year column names. You can use any tidy-select method to get them; the easiest in this case is to say "everything except county_name", so -county_name
names_to is the name of the column that will have the years, by default "name", but you can change it to "year" or anything else.
values_to is the name of the column that will have the values, by default "value", but you can change it here.
pivot_longer(il_births, -county_name, names_to = "year")
Additionally, you can remove the "x"'s from the column names, and format the year column as numeric:
pivot_longer(il_births, -county_name, names_to = "year",
names_prefix = "x", names_transform = list(year = as.numeric))
Here's a full reprex of how you might read in and tidy the data. A plot is going to look very messy if you include all counties, so I have used slice_max to include only the five most populous counties. This line could be removed if you want to retain all the data:
library(tidyverse)
data <- "https://data.illinois.gov/dataset/" %>%
paste0("ac7f40df-b256-4867-9953-78c8c4a52590/",
"resource/d7ec861b-6b7c-4260-82d8-3f05f49053f9/",
"download/data.csv") %>%
read.csv(check.names = FALSE) %>%
filter(row_number() != 1) %>%
slice_max(`_2009`, n = 5) %>% # Remove this line to keep all data
mutate(county_name = str_to_title(county_name)) %>%
mutate(county_name = reorder(county_name, -`_2009`)) %>%
pivot_longer(-county_name, names_to = "Year", values_to = "Births") %>%
mutate(Year = as.numeric(substr(Year, 2, 5)))
This results in:
data
#> # A tibble: 105 x 3
#> county_name Year Births
#> <fct> <dbl> <dbl>
#> 1 "Cook " 1989 94096
#> 2 "Cook " 1990 97005
#> 3 "Cook " 1991 96387
#> 4 "Cook " 1992 95140
#> 5 "Cook " 1993 94614
#> 6 "Cook " 1994 92881
#> 7 "Cook " 1995 90029
#> 8 "Cook " 1996 87747
#> 9 "Cook " 1997 85589
#> 10 "Cook " 1998 85970
#> # ... with 95 more rows
Which we could plot like this:
ggplot(data, aes(Year, Births, color = county_name)) +
geom_line(alpha = 0.5) +
scale_y_continuous(labels = scales::comma) +
geom_point() +
theme_minimal(base_size = 16) +
scale_color_brewer(palette = "Set1", name = "County") +
ggtitle("Live births in five most populous Illinois counties, 1989-2009") +
labs(caption = "Source: Illinois Department of Public Health")
Created on 2022-11-20 with reprex v2.0.2

How to generate a frequency table from an array

I have an array like the following:
tdc <- structure(c(0, 191, 235, 196, 311, 240, 246, 236, 222, 345, 369,
447, 289, 274, 331, 368, 371, 344, 335, 403, 378, 367, 384, 364,
191, 0, 230, 207, 336, 151, 291, 324, 306, 340, 389, 461, 345,
322, 341, 367, 369, 356, 334, 395, 396, 392, 414, 387, 235, 230,
0, 254, 309, 253, 300, 346, 305, 375, 400, 466, 379, 372, 367,
387, 382, 370, 363, 445, 384, 361, 386, 356, 196, 207, 254, 0,
298, 195, 263, 244, 223, 352, 377, 444, 348, 316, 333, 356, 367,
347, 322, 400, 378, 357, 370, 367, 311, 336, 309, 298, 0, 326,
257, 240, 259, 205, 320, 357, 331, 339, 191, 298, 262, 223, 220,
311, 273, 216, 256, 317, 240, 151, 253, 195, 326, 0, 263, 308,
277, 303, 382, 457, 347, 294, 321, 358, 374, 340, 302, 376, 386,
373, 399, 379, 246, 291, 300, 263, 257, 263, 0, 264, 240, 265,
283, 368, 263, 265, 268, 336, 324, 292, 262, 372, 354, 345, 359,
355, 236, 324, 346, 244, 240, 308, 264, 0, 116, 307, 343, 412,
296, 266, 299, 313, 320, 312, 308, 356, 320, 341, 353, 324, 222,
306, 305, 223, 259, 277, 240, 116, 0, 280, 351, 428, 305, 263,
308, 335, 350, 332, 314, 392, 362, 352, 372, 362, 345, 340, 375,
352, 205, 303, 265, 307, 280, 0, 226, 303, 278, 287, 229, 349,
312, 289, 238, 280, 361, 290, 326, 353, 369, 389, 400, 377, 320,
382, 283, 343, 351, 226, 0, 290, 277, 265, 280, 344, 332, 360,
332, 379, 391, 365, 370, 412, 447, 461, 466, 444, 357, 457, 368,
412, 428, 303, 290, 0, 309, 346, 281, 345, 313, 336, 354, 334,
349, 348, 337, 382, 289, 345, 379, 348, 331, 347, 263, 296, 305,
278, 277, 309, 0, 147, 305, 333, 329, 355, 316, 339, 365, 358,
385, 371, 274, 322, 372, 316, 339, 294, 265, 266, 263, 287, 265,
346, 147, 0, 279, 303, 320, 346, 299, 318, 366, 358, 373, 378,
331, 341, 367, 333, 191, 321, 268, 299, 308, 229, 280, 281, 305,
279, 0, 201, 153, 172, 185, 254, 261, 228, 252, 316, 368, 367,
387, 356, 298, 358, 336, 313, 335, 349, 344, 345, 333, 303, 201,
0, 146, 235, 278, 287, 228, 299, 279, 235, 371, 369, 382, 367,
262, 374, 324, 320, 350, 312, 332, 313, 329, 320, 153, 146, 0,
184, 251, 233, 229, 264, 241, 273, 344, 356, 370, 347, 223, 340,
292, 312, 332, 289, 360, 336, 355, 346, 172, 235, 184, 0, 157,
202, 183, 193, 181, 249, 335, 334, 363, 322, 220, 302, 262, 308,
314, 238, 332, 354, 316, 299, 185, 278, 251, 157, 0, 171, 221,
178, 220, 262, 403, 395, 445, 400, 311, 376, 372, 356, 392, 280,
379, 334, 339, 318, 254, 287, 233, 202, 171, 0, 149, 195, 168,
218, 378, 396, 384, 378, 273, 386, 354, 320, 362, 361, 391, 349,
365, 366, 261, 228, 229, 183, 221, 149, 0, 130, 127, 136, 367,
392, 361, 357, 216, 373, 345, 341, 352, 290, 365, 348, 358, 358,
228, 299, 264, 193, 178, 195, 130, 0, 98, 175, 384, 414, 386,
370, 256, 399, 359, 353, 372, 326, 370, 337, 385, 373, 252, 279,
241, 181, 220, 168, 127, 98, 0, 146, 364, 387, 356, 367, 317,
379, 355, 324, 362, 353, 412, 382, 371, 378, 316, 235, 273, 249,
262, 218, 136, 175, 146, 0), .Dim = c(24L, 24L))
and want to create a frequency table.
The code I have used:
ages <- c(-0.01,100,200,300,400,500,600)
factorx <- factor(cut(tdc,breaks=ages,include.lowest=TRUE))
xout <- as.data.frame(table(factorx))
gives
factorx Freq
[-0.01,100] 26
(100,200] 52
(200,300] 168
(300,400] 308
(400,500] 22
which is right, except that it excludes the interval 500 to 600. But the real problem is that the absolute frequencies look wrong: the sum should be equal to 25*25=625, but in xout it is 576.
I guess there is a problem in the code.
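A possible explanation, offered as a sketch rather than a definitive answer: the dput output above ends in .Dim = c(24L, 24L), so the array has 24 * 24 = 576 cells, and the frequencies shown sum to exactly 576. The missing 500 to 600 row appears to come from wrapping cut() in factor(), which drops levels that have no observations. The small matrix m below is a hypothetical stand-in for tdc, and binned is a name chosen for the example.

```r
# Hypothetical 3 x 3 symmetric matrix standing in for the 24 x 24 array `tdc`
m <- matrix(c(0, 150, 450,
              150, 0, 250,
              450, 250, 0), nrow = 3)
ages <- c(-0.01, 100, 200, 300, 400, 500, 600)

# cut() already returns a factor that carries every interval as a level;
# wrapping it in factor() again would drop the empty ones
binned <- cut(m, breaks = ages, include.lowest = TRUE)
xout <- as.data.frame(table(binned))

sum(xout$Freq) == length(m)  # the counts add up to the number of cells
nrow(xout)                   # one row per interval, including empty ones
```

The same check on the real data would be sum(xout$Freq) == length(tdc), together with dim(tdc) to confirm the size of the array.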

Count the number of consecutive occurrences in a vector

I have the following vector:
vec <- c(28, 44, 45, 46, 47, 48, 61, 62, 70, 71, 82, 83, 104, 105, 111, 115, 125, 136, 137, 138, 146, 147, 158, 159, 160, 185, 186, 187, 188, 189, 190, 191, 192, 193, 209, 263, 264, 265, 266, 267, 268, 280, 283, 284, 308, 309, 318, 319, 324, 333, 334, 335, 347, 354)
Now I would like to count the runs of consecutive numbers in the vector that have a minimum length of two.
So here this would be valid for the following cases:
44, 45, 46, 47, 48
61, 62
70, 71
82, 83
104, 105
136, 137, 138
146, 147
158, 159, 160
185, 186, 187, 188, 189, 190, 191, 192, 193
263, 264, 265, 266, 267, 268
283, 284
308, 309
318, 319
333, 334, 335
So there are 14 cases of consecutive numbers, and I just need the integer 14 as output.
Anybody with an idea how to do that?
We can use the rle and diff functions:
a=rle(diff(vec))
sum(a$values==1)
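To see why this works, here is the same idea on a short vector. The vector x and the names steps, r and n_runs are made up for the illustration, and the vector is assumed to be sorted with no duplicates.

```r
x <- c(1, 2, 3, 7, 8, 10)   # hypothetical short vector with two runs: 1-3 and 7-8

steps <- diff(x)            # 1 1 4 1 2
r <- rle(steps)             # values: 1 4 1 2, lengths: 2 1 1 1

# each block of 1s in the differences is one run of consecutive numbers
n_runs <- sum(r$values == 1)
n_runs                      # 2
```

Because rle collapses each block of identical differences into a single entry, a run of any length counts once, which is what the question asks for.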
diff and split will help
vec2 <- split(vec, cumsum(c(1, diff(vec) != 1)))
vec2[(sapply(vec2, function(x) length(x))>1)]
$`2`
[1] 44 45 46 47 48
$`3`
[1] 61 62
$`4`
[1] 70 71
$`5`
[1] 82 83
$`6`
[1] 104 105
$`10`
[1] 136 137 138
$`11`
[1] 146 147
$`12`
[1] 158 159 160
$`13`
[1] 185 186 187 188 189 190 191 192 193
$`15`
[1] 263 264 265 266 267 268
$`17`
[1] 283 284
$`18`
[1] 308 309
$`19`
[1] 318 319
$`21`
[1] 333 334 335
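Brute force:
vec <- sort(vec)
nconsecutive <- 0  # number of runs found so far
consecutive <- 0   # number of +1 steps in the current run
p <- length(vec) - 1
for (i in 1:p) {
  if ((vec[i + 1] - vec[i]) == 1) {
    consecutive <- consecutive + 1
  } else {
    # A run just ended: count it if it had at least one +1 step
    if (consecutive > 0) {
      nconsecutive <- nconsecutive + 1
    }
    # Reset the step counter for the next run
    consecutive <- 0
  }
}
# Count a run that reaches the end of the vector
if (consecutive > 0) nconsecutive <- nconsecutive + 1
nconsecutive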
Here's another base R one-liner using tapply -
sum(tapply(vec, cumsum(c(TRUE, diff(vec) != 1)), length) > 1)
#[1] 14

Having trouble separating data using Tidyverse in R

I have a dataset consisting of 20 plant genotypes with measurements of LAI, V1, V2, V3, V4, V5 being taken at three growth stages (1, 2, 3).
I need to separate the data in R (using the tidyverse package) into columns of genotype, stage, and measurement (consisting of LAI:V5). The code that I have tried does not work; how could I go about doing this? Here is what I have tried:
#Open packages
library(readr)
library(tidyr)
library(dplyr)
#Dataset:
dataset <- structure(list(plot = c(101, 102, 103, 104, 105, 106, 107, 108,
109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 101,
102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114,
115, 116, 117, 118, 119, 120, 101, 102, 103, 104, 105, 106, 107,
108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120
), genotype = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,
15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
11, 12, 13, 14, 15, 16, 17, 18, 19, 20), stage = c(1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3), LAI = c(822, 763,
551, 251, 800, 761, 343, 593, 997, 261, 19, 429, 566, 574, 174,
356, 891, 918, 948, 782, 902, 383, 704, 157, 358, 453, 723, 644,
308, 149, 504, 437, 348, 165, 128, 305, 778, 516, 347, 212, 792,
423, 565, 828, 106, 605, 603, 551, 145, 393, 914, 919, 672, 628,
143, 103, 906, 717, 18, 324), V1 = c(52, 556, 222, 534, 953,
346, 635, 84, 592, 444, 34, 340, 343, 188, 554, 397, 315, 643,
376, 101, 663, 42, 360, 645, 718, 883, 266, 225, 674, 797, 726,
259, 829, 701, 601, 206, 325, 963, 292, 985, 954, 828, 839, 541,
301, 312, 187, 59, 563, 577, 961, 239, 147, 203, 421, 690, 542,
412, 812, 19), V2 = c(354, 719, 45, 376, 921, 243, 256, 316,
384, 450, 166, 850, 784, 291, 889, 389, 925, 157, 37, 528, 847,
942, 624, 387, 680, 380, 848, 745, 49, 69, 864, 649, 125, 117,
911, 947, 212, 628, 162, 165, 395, 437, 102, 136, 446, 51, 106,
141, 886, 373, 113, 186, 233, 937, 698, 202, 89, 623, 731, 474
), V3 = c(18, 87, 692, 888, 681, 134, 774, 619, 544, 32, 804,
993, 147, 352, 825, 490, 196, 794, 900, 796, 617, 160, 688, 947,
665, 122, 386, 968, 772, 836, 696, 806, 925, 410, 949, 546, 303,
550, 359, 285, 167, 605, 780, 419, 925, 822, 142, 4, 648, 18,
867, 204, 617, 5, 251, 198, 316, 205, 660, 680), V4 = c(728,
266, 678, 958, 946, 248, 425, 777, 86, 340, 527, 766, 161, 187,
129, 881, 149, 888, 811, 118, 379, 22, 953, 940, 520, 200, 557,
438, 401, 25, 55, 155, 73, 834, 614, 933, 235, 759, 852, 29,
475, 356, 992, 765, 593, 703, 929, 823, 466, 717, 86, 607, 730,
7, 416, 727, 400, 904, 503, 881), V5 = c(550, 785, 954, 852,
718, 295, 208, 2, 36, 185, 726, 540, 476, 994, 720, 532, 401,
525, 504, 868, 414, 878, 808, 550, 740, 9, 936, 570, 477, 516,
561, 648, 686, 906, 387, 621, 461, 323, 829, 948, 964, 853, 943,
805, 349, 254, 979, 784, 246, 444, 71, 883, 345, 973, 546, 120,
310, 347, 732, 308)), class = "data.frame", row.names = c(NA,
-60L))
Code I have tried....
data <- gather(dataset, LAI, V1, V2, V3, V4, V5, -plot)
....provides these results (a sample of the resulting dataset):
plot genotype stage LAI V1
1 101 1 1 V2 354
2 102 2 1 V2 719
3 103 3 1 V2 45
4 104 4 1 V2 376
5 105 5 1 V2 921
6 106 6 1 V2 243
7 107 7 1 V2 256
8 108 8 1 V2 316
9 109 9 1 V2 384
10 110 10 1 V2 450
11 111 11 1 V2 166
12 112 12 1 V2 850
13 113 13 1 V2 784
14 114 14 1 V2 291
The outcome needs to be like this:
correct_format <- data.frame(genotype = c(1, 2, 3, 4, 5, 6),
stage = c(1, 1, 1, 1, 1, 1),
measurement = c("LAI", "LAI", "LAI", "LAI", "LAI", "LAI"),
value = c(822, 763, 551, 251, 800, 761))
Perhaps, we need
library(dplyr)
library(tidyr)
dataset %>%
select(-plot) %>%
pivot_longer(cols = LAI:V5, names_to = 'measurement') %>%
arrange(measurement)
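As for why the original attempt misbehaved: in gather(), the first two arguments after the data are the names of the new key and value columns, so LAI and V1 were taken as those names rather than as columns to stack. A minimal sketch of a gather() call that names them explicitly, using a hypothetical two-row stand-in for dataset (small and long are names made up here):

```r
library(tidyr)

# Hypothetical two-row stand-in for `dataset`
small <- data.frame(genotype = c(1, 2), stage = c(1, 1),
                    LAI = c(822, 763), V1 = c(52, 556))

# key and value are named explicitly; LAI:V1 selects the columns to stack
long <- gather(small, key = "measurement", value = "value", LAI:V1)
long
```

On the full data the selection would be LAI:V5, after dropping plot. pivot_longer() as shown above is the newer interface and is probably the better choice.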

How to merge multiple columns into one column?

I currently have data spread out across multiple columns in R. I am looking for a way to put this information into the one column as a vector for each of the individual rows.
Is there a function to do this?
For example, the data looks like this:
DF <- data.frame(id=rep(LETTERS, each=1)[1:26], replicate(26, sample(1001, 26)), Class=sample(c("Yes", "No"), 26, TRUE))
select(DF, cols=c("id", "X1","X2", "X23", "Class"))
How can I merge the columns "X1","X2", "X23" into a vector containing numeric type variables for each of the IDs?
Like this?
library(reshape2)
melt(df) %>% dcast(id ~ ., fun.aggregate = list)
Using id, Class as id variables
id .
1 A 422, 74, 439
2 B 879, 443, 923
3 C 575, 901, 749
4 D 813, 747, 21
5 E 438, 526, 675
6 F 863, 562, 474
7 G 103, 713, 918
8 H 585, 294, 525
9 I 115, 76, 175
10 J 953, 379, 926
11 K 679, 439, 377
12 L 816, 624, 538
13 M 678, 226, 142
14 N 667, 369, 586
15 O 795, 422, 248
16 P 165, 22, 612
17 Q 294, 476, 746
18 R 968, 368, 290
19 S 238, 481, 980
20 T 921, 482, 741
21 U 550, 15, 296
22 V 121, 358, 625
23 W 213, 313, 242
24 X 92, 77, 58
25 Y 607, 936, 350
26 Z 660, 42, 275
A note though: I do not know your final use case, but this strikes me as something you probably do not want to have. It is often more advisable to stick to tidy data, see e.g. https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
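If a list column is really what is needed, base R can build one without reshaping. This is a sketch on a hypothetical three-row stand-in for the selected columns; the column name merged is made up for the example.

```r
# Hypothetical three-row stand-in for the selected columns of DF
df <- data.frame(id = c("A", "B", "C"),
                 X1 = c(422, 879, 575),
                 X2 = c(74, 443, 901),
                 X23 = c(439, 923, 749))

# Map() walks the three columns in parallel and combines each row's
# values into one numeric vector, giving a list column
df$merged <- Map(c, df$X1, df$X2, df$X23)

df$merged[[1]]
```

Each element of df$merged is a plain numeric vector, so something like sapply(df$merged, mean) works on it directly.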
