I am working through some messy data that, after reading it in, looks like this:
> glimpse(il_births)
Rows: 106
Columns: 22
$ x1989 <dbl> 190247, 928, 175, 187, 445, 57, 425, 41, 207, 166, 2662, 48…
$ x1990 <dbl> 195499, 960, 192, 195, 462, 68, 449, 53, 222, 187, 2574, 47…
$ x1991 <dbl> 194066, 971, 164, 195, 464, 72, 448, 54, 179, 211, 2562, 49…
$ x1992 <dbl> 190923, 881, 189, 185, 462, 72, 414, 55, 201, 161, 2426, 46…
$ x1993 <dbl> 190709, 893, 152, 206, 497, 50, 389, 75, 202, 183, 2337, 43…
$ x1994 <dbl> 189182, 865, 158, 200, 538, 58, 429, 48, 189, 171, 2240, 41…
$ x1995 <dbl> 185801, 828, 140, 202, 566, 58, 417, 48, 173, 166, 2117, 43…
$ x1996 <dbl> 183079, 830, 147, 194, 529, 58, 417, 49, 175, 150, 2270, 41…
$ x1997 <dbl> 180649, 812, 132, 193, 531, 64, 389, 37, 163, 185, 2175, 43…
$ x1998 <dbl> 182503, 862, 140, 201, 545, 41, 417, 57, 185, 188, 2128, 41…
$ x1999 <dbl> 182027, 843, 117, 188, 595, 51, 396, 47, 193, 191, 2194, 39…
$ x2000 <dbl> 185003, 825, 132, 184, 587, 63, 434, 51, 170, 181, 2260, 40…
$ x2001 <dbl> 184022, 866, 138, 196, 629, 57, 420, 49, 147, 215, 2312, 39…
$ x2002 <dbl> 180555, 760, 129, 172, 629, 54, 434, 48, 191, 185, 2226, 39…
$ x2003 <dbl> 182393, 794, 141, 239, 668, 76, 458, 58, 154, 208, 2288, 39…
$ x2004 <dbl> 180665, 802, 126, 209, 646, 56, 396, 51, 151, 181, 2291, 42…
$ x2005 <dbl> 178872, 883, 122, 189, 744, 54, 409, 58, 160, 199, 2490, 40…
$ x2006 <dbl> 180503, 805, 112, 215, 737, 57, 392, 55, 140, 177, 2455, 41…
$ x2007 <dbl> 180530, 890, 136, 185, 736, 60, 413, 49, 163, 195, 2508, 44…
$ x2008 <dbl> 176634, 817, 120, 173, 676, 64, 409, 59, 142, 200, 2482, 40…
$ x2009 <dbl> 171077, 804, 114, 198, 622, 65, 381, 53, 123, 164, 2407, 40…
$ county_name <chr> "ILLINOIS TOTAL", "ADAMS", "ALEXANDER", "BOND", "BOONE", "B…
The data comes from All Live Births In Illinois, 1989-2009. The data frame is difficult to work with, as the years are the column headers, alongside a column with all of the counties. I would prefer the table to be formatted so that there is a year column and a county column, and each row contains the observation for one year and one county. That would make the data easier to work with in ggplot, so I can make some quick visualizations.
I first tried transposing the data frame, but that leaves counties as rows so that does not help much.
I also tried using the pivot_longer() function but was not sure how to set my parameters based on my issue.
Any help or suggestions are appreciated!
I suspect a reading of pivot_longer's help page would have done the trick:
data - A data frame to pivot.
cols - Columns to pivot into longer format.
names_to - A character vector specifying the new column or columns to create from the information stored in the column names of data specified by cols.
values_to - A string specifying the name of the column to create from the data stored in cell values.
The other arguments are for more complex operations. To solve your case:
data should be il_births
cols should be all the year columns; you can use any tidyselect method to get them, and the easiest in this case is to say "everything except county_name", i.e. -county_name
names_to is the name of the column that will hold the years, "name" by default, but you can change it to "year" or anything else.
values_to is the name of the column that will hold the values, "value" by default, but you can change it here as well.
pivot_longer(il_births, -county_name, names_to = "year")
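If you also want to rename the value column as described above, the same call with values_to spelled out might look like this (the column name "births" is just an illustrative choice):
pivot_longer(il_births, -county_name, names_to = "year", values_to = "births")
An equivalent tidyselect specification for cols here would be starts_with("x"), since all the year columns share that prefix.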
Additionally, you can strip the "x" prefix from the column names and convert the year column to numeric:
pivot_longer(il_births, -county_name, names_to = "year",
             names_prefix = "x", names_transform = list(year = as.numeric))
Here's a full reprex of how you might read in and tidy the data. A plot is going to look very messy if you include all counties, so I have used slice_max to include only the five most populous counties. This line could be removed if you want to retain all the data:
library(tidyverse)
data <- "https://data.illinois.gov/dataset/" %>%
  paste0("ac7f40df-b256-4867-9953-78c8c4a52590/",
         "resource/d7ec861b-6b7c-4260-82d8-3f05f49053f9/",
         "download/data.csv") %>%
  read.csv(check.names = FALSE) %>%
  filter(row_number() != 1) %>%
  slice_max(`_2009`, n = 5) %>% # Remove this line to keep all data
  mutate(county_name = str_to_title(county_name)) %>%
  mutate(county_name = reorder(county_name, -`_2009`)) %>%
  pivot_longer(-county_name, names_to = "Year", values_to = "Births") %>%
  mutate(Year = as.numeric(substr(Year, 2, 5)))
This results in:
data
#> # A tibble: 105 x 3
#> county_name Year Births
#> <fct> <dbl> <dbl>
#> 1 "Cook " 1989 94096
#> 2 "Cook " 1990 97005
#> 3 "Cook " 1991 96387
#> 4 "Cook " 1992 95140
#> 5 "Cook " 1993 94614
#> 6 "Cook " 1994 92881
#> 7 "Cook " 1995 90029
#> 8 "Cook " 1996 87747
#> 9 "Cook " 1997 85589
#> 10 "Cook " 1998 85970
#> # ... with 95 more rows
Which we could plot like this:
ggplot(data, aes(Year, Births, color = county_name)) +
  geom_line(alpha = 0.5) +
  scale_y_continuous(labels = scales::comma) +
  geom_point() +
  theme_minimal(base_size = 16) +
  scale_color_brewer(palette = "Set1", name = "County") +
  ggtitle("Live births in five most populous Illinois counties, 1989-2009") +
  labs(caption = "Source: Illinois Department of Public Health")
Created on 2022-11-20 with reprex v2.0.2
Here I wish to divide the value in each row by the corresponding daily (each row is a day) maximum value. How would I do this? I'm finding it tricky as I don't want to divide the first 2 columns of the data by this. Data snippet below;
so I want to divide each of the values in each row by the max in that row, but avoiding the Date and Substation columns.
Date Substation `00:00` `00:10` `00:20` `00:30` `00:40` `00:50` `01:00` `01:10` `01:20` max
<date> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2013-01-03 511016 257. 259. 264. 263. 248. 239. 243. 230. 228. 264.
2 2013-01-03 511029 96 111. 116. 99.0 123. 128. 130. 126. 116. 130.
3 2013-01-03 511030 138. 129. 127. 124. 119. 126. 125. 121. 112. 138.
4 2013-01-03 511033 172. 165. 167. 170. 171. 173. 173. 166. 157. 173.
5 2013-01-03 511034 302. 298. 302. 290. 291. 287. 280. 291. 277. 302.
6 2013-01-03 511035 116. 131. 130. 121. 116. 108. 106. 112. 109. 131.
With pmax and do.call:
df[-c(1, 2)] / do.call(pmax, df[-c(1, 2)]) # do.call feeds the value columns to pmax, giving the row-wise maximum
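To illustrate what do.call(pmax, ...) does here, on a tiny made-up data frame it returns the largest value within each row:
do.call(pmax, data.frame(a = c(1, 5), b = c(4, 2)))
#[1] 4 5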
Or, you can apply a function, x / max(x), to each row (MARGIN = 1) of your selected columns (df[-c(1, 2)]) using apply.
t(apply(df[-c(1, 2)], 1, function(x) x / max(x)))
Actually, since you already have the max in your last column you just need to do
cc <- grep('^Date|^Subst|^max', names(dat)) ## storing columns NOT to be divided
dat[-cc] <- dat[-cc]/dat$max ## divide matrix by vector
If you didn't already have the max column, you could use matrixStats::rowMaxs to divide by, and rowMeans for averages. There's also matrixStats::rowMeans2, which might be a little faster.
dat[-cc]/matrixStats::rowMaxs(as.matrix(dat[-cc]))
dat[-cc]/rowMeans(as.matrix(dat[-cc]))
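The rowMeans2 variant mentioned above follows the same pattern (shown only as a sketch of that alternative):
dat[-cc]/matrixStats::rowMeans2(as.matrix(dat[-cc]))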
Data:
dat <- structure(list(Date = c("2013-01-03", "2013-01-03", "2013-01-03",
"2013-01-03", "2013-01-03", "2013-01-03"), Substation = c(511016L,
511029L, 511030L, 511033L, 511034L, 511035L), X.00.00. = c(257,
96, 138, 172, 302, 116), X.00.10. = c(259, 111, 129, 165, 298,
131), X.00.20. = c(264, 116, 127, 167, 302, 130), X.00.30. = c(263,
99, 124, 170, 290, 121), X.00.40. = c(248, 123, 119, 171, 291,
116), X.00.50. = c(239, 128, 126, 173, 287, 108), X.01.00. = c(243,
130, 125, 173, 280, 106), X.01.10. = c(230, 126, 121, 166, 291,
112), X.01.20. = c(228, 116, 112, 157, 277, 109), max = c(264,
130, 138, 173, 302, 131)), row.names = c("1", "2", "3", "4",
"5", "6"), class = "data.frame")
I have the following vector:
vec <- c(28, 44, 45, 46, 47, 48, 61, 62, 70, 71, 82, 83, 104, 105, 111, 115, 125, 136, 137, 138, 146, 147, 158, 159, 160, 185, 186, 187, 188, 189, 190, 191, 192, 193, 209, 263, 264, 265, 266, 267, 268, 280, 283, 284, 308, 309, 318, 319, 324, 333, 334, 335, 347, 354)
Now I would like to count the runs of consecutive numbers in the vector that have a minimum length of two.
So here this would be valid for the following cases:
44, 45, 46, 47, 48
61, 62
70, 71
82, 83
104, 105
136, 137, 138
146, 147
158, 159, 160
185, 186, 187, 188, 189, 190, 191, 192, 193
263, 264, 265, 266, 267, 268
283, 284
308, 309
318, 319
333, 334, 335
So there are 14 such runs of consecutive numbers, and I just need the integer 14 as output.
Anybody with an idea how to do that?
We can use the rle and diff functions: every run of at least two consecutive numbers shows up as a run of 1s in diff(vec), so we just count those runs.
a <- rle(diff(vec))
sum(a$values == 1)
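As a quick illustration with a made-up vector: c(1, 2, 3, 7, 9, 10) contains two runs of consecutive numbers, and the same counting gives
sum(rle(diff(c(1, 2, 3, 7, 9, 10)))$values == 1)
#[1] 2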
diff and split will help
vec2 <- split(vec, cumsum(c(1, diff(vec) != 1)))
vec2[sapply(vec2, length) > 1]
$`2`
[1] 44 45 46 47 48
$`3`
[1] 61 62
$`4`
[1] 70 71
$`5`
[1] 82 83
$`6`
[1] 104 105
$`10`
[1] 136 137 138
$`11`
[1] 146 147
$`12`
[1] 158 159 160
$`13`
[1] 185 186 187 188 189 190 191 192 193
$`15`
[1] 263 264 265 266 267 268
$`17`
[1] 283 284
$`18`
[1] 308 309
$`19`
[1] 318 319
$`21`
[1] 333 334 335
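If you only need the count of 14 rather than the runs themselves, one small addition to the above would be:
sum(lengths(vec2) > 1)
#[1] 14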
Brute force:
vec <- sort(vec)
nconsecutive <- 0
consecutive <- 0
p <- length(vec) - 1
for (i in 1:p) {
  if ((vec[i + 1] - vec[i]) == 1) {
    consecutive <- consecutive + 1
  } else {
    # If there was at least one consecutive pair
    if (consecutive > 0) {
      # when the run ends, add one to the counter
      nconsecutive <- nconsecutive + 1
    }
    # reset the run length to 0
    consecutive <- 0
  }
}
# count a run that reaches the end of the vector
if (consecutive > 0) nconsecutive <- nconsecutive + 1
Here's another base R one-liner using tapply -
sum(tapply(vec, cumsum(c(TRUE, diff(vec) != 1)), length) > 1)
#[1] 14
I have 2 vectors of values that I extracted from a .pdf file; they represent the locations of two different keywords. vector.1 is the first keyword while vector.2 is the second keyword. So I need to extract all the rows in between using something like below.
df.1 <- df[vector.1[1]:vector.2[1], ]
I managed to use a loop to go through the vectors for other documents, but this particular file has unevenly matched locations due to its structure.
vector.1 <- c(12, 85, 144, 188, 233, 285, 338, 384, 426, 469, 512, 558, 613, 669, 713, 758, 808, 859, 908, 964, 1046, 1090, 1126, 1149, 1216, 1267, 1346, 1423, 1464, 1513, 1560, 1607, 1665, 1718, 1763, 1810, 1856, 1908, 1938)
vector.2 <- c(48, 53, 111, 116, 155, 160, 198, 203, 250, 255, 303, 308, 350, 355, 392, 397, 435, 440, 478, 483, 523, 528, 578, 583, 635, 640, 679, 684, 723, 728, 773, 778, 824, 829, 871, 876, 929, 934, 1008, 1017, 1091, 1096, 1182, 1187, 1232, 1237, 1308, 1313, 1385, 1390, 1430, 1435, 1478, 1483, 1525, 1530, 1572, 1577, 1629, 1634, 1683, 1688, 1729, 1734, 1776, 1781, 1821, 1826, 1874, 1879, 1967, 1972)
As you can see, vector.1[2] is bigger than vector.2[2], and the actual location is supposed to be vector.2[3]. Is there any way to write the code to match each vector.1[i] so that the desired result is something like below:
vector.3 <- c(48, 111, 155, 198, 250, 303, 392, 435, ....)
Thank you!
For each element of vector.1, count how many elements of vector.2 are smaller than it (rowSums of the outer comparison); adding 1 then indexes the next element of vector.2:
vector.2[rowSums(outer(vector.1, vector.2, ">")) + 1]
[1] 48 111 155 198 250 303 350 392 435 478 523 578 635 679 723 773 824 871 929
[20] 1008 1091 1091 1182 1182 1232 1308 1385 1430 1478 1525 1572 1629 1683 1729 1776 1821 1874 1967
[39] 1967
You can also do vector.2[colSums(sapply(vector.1, ">", vector.2)) + 1]
If you are trying to extract particular values from vector.2 into another vector.3, this should do, provided the indices you want follow a sequence separated by 2. You can generate that sequence of indices using seq by supplying the start index, the end index, and the by (increment) argument accordingly.
vector.3 <- vector.2[seq(1, length(vector.2), by = 2)]
[1] 48 111 155 198 250 303 350 392 435 478
[11] 523 578 635 679 723 773 824 871 929 1008
[21] 1091 1182 1232 1308 1385 1430 1478 1525 1572 1629
[31] 1683 1729 1776 1821 1874 1967
Or do you want the smallest value of vector.2 that is larger than each value of vector.1?...
sapply(vector.1, function(x) min(vector.2[vector.2>x]))
[1] 48 111 155 198 250 303 350 392 435 478 523 578 635 679 723 773 824 871 929 1008 1091 1091 1182
[24] 1182 1232 1308 1385 1430 1478 1525 1572 1629 1683 1729 1776 1821 1874 1967 1967
For example, I have a data frame with one column containing numbers. This is how it looks:
head(c1)
c
1 300
2 302
3 304
4 306
5 308
6 310
Here is the sample data frame.
c1 <- structure(list(c = c(300, 302, 304, 306, 308, 310, 312, 314,
316, 318, 320, 322, 324, 326, 328, 330, 332, 334, 336, 338, 340,
342, 344, 346, 348, 350, 352, 354, 356, 358, 360, 362, 364, 366,
368, 370, 372, 374, 376, 378, 380, 382, 384, 386, 388, 390, 392,
394, 396, 398, 400)), .Names = "c", row.names = c(NA, -51L), class = "data.frame")
I want to delete the rows with values between 300 and 310, between 310 and 320, and so on.
I want to have a data frame like this:
300
310
320
330
340
350
.
.
.
400
Any ideas how to do this? I found how to remove every nth row, but not how to remove the four rows between two numbers.
You can make use of the modulo operator %%. If you want the result as an atomic vector, you can run
c1$c[c1$c %% 10 == 0]
or if you want it as a data.frame with 1 column, you can use
c1[c1$c %% 10 == 0, , drop=FALSE]
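For the sample c1 above, either form keeps only the multiples of 10; the vector version, for instance, gives:
c1$c[c1$c %% 10 == 0]
#[1] 300 310 320 330 340 350 360 370 380 390 400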