I am trying to create weekly cumulative results from the daily data shown below.
date A B C D E F G H
16-Jan-22 227 441 3593 2467 9 6 31 2196
17-Jan-22 224 353 3555 2162 31 5 39 2388
18-Jan-22 181 144 2734 2916 0 0 14 1753
19-Jan-22 95 433 3610 3084 42 19 10 2862
20-Jan-22 141 222 3693 3149 183 19 23 2176
21-Jan-22 247 426 3455 4016 68 0 1 2759
22-Jan-22 413 931 4435 4922 184 2 39 3993
23-Jan-22 389 1340 5433 5071 200 48 27 4495
24-Jan-22 281 940 6875 5009 343 47 71 3713
25-Jan-22 314 454 5167 4555 127 1 68 3554
26-Jan-22 315 973 5789 3809 203 1 105 4456
27-Jan-22 269 1217 6776 4578 227 91 17 5373
28-Jan-22 248 1320 5942 3569 271 91 156 4260
29-Jan-22 155 1406 6771 4328 426 44 109 4566
Solution using data.table and lubridate
library(lubridate)
library(data.table)

setDT(df)
# sum every remaining column within each ISO week of the parsed date
df[, lapply(.SD, sum), by = isoweek(dmy(date))]
# isoweek A B C D E F G H
# 1: 2 227 441 3593 2467 9 6 31 2196
# 2: 3 1690 3849 26915 25320 708 93 153 20426
# 3: 4 1582 6310 37320 25848 1597 275 526 25922
I wanted to provide a solution using tidyverse principles. You will need the group_by() and summarize() functions, together with the very useful across() function.
# recreate the data in a tribble
library(tidyverse)
library(lubridate)

df <- tribble(
~"date", ~"A", ~"B", ~"C", ~"D", ~"E", ~"F", ~"G", ~H,
"16-Jan-22",227, 441, 3593, 2467, 9, 6, 31, 2196,
"17-Jan-22",224, 353, 3555, 2162, 31, 5, 39, 2388,
"18-Jan-22",181, 144, 2734, 2916, 0, 0, 14, 1753,
"19-Jan-22",95, 433, 3610, 3084, 42, 19, 10, 2862,
"20-Jan-22",141, 222, 3693, 3149, 183, 19, 23, 2176,
"21-Jan-22",247, 426, 3455, 4016, 68, 0, 1, 2759,
"22-Jan-22",413, 931, 4435, 4922, 184, 2, 39, 3993,
"23-Jan-22",389, 1340, 5433, 5071, 200, 48, 27, 4495,
"24-Jan-22",281, 940, 6875, 5009, 343, 47, 71, 3713,
"25-Jan-22",314, 454, 5167, 4555, 127, 1, 68, 3554,
"26-Jan-22",315, 973, 5789, 3809, 203, 1, 105, 4456,
"27-Jan-22",269, 1217, 6776, 4578, 227, 91, 17, 5373,
"28-Jan-22",248, 1320, 5942, 3569, 271, 91, 156, 4260,
"29-Jan-22",155, 1406, 6771, 4328, 426, 44, 109, 4566)
# parse the day-month-year strings into proper Date objects
df$date <- dmy(df$date)
# Create a new column "week_num" and group by it;
# then summarize every column except "date" with its sum
df %>%
  group_by(week_num = lubridate::isoweek(date)) %>%
  summarize(across(-c("date"), sum))
# The result (matching the data.table output above):
  week_num     A     B     C     D     E     F     G     H
     <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1        2   227   441  3593  2467     9     6    31  2196
2        3  1690  3849 26915 25320   708    93   153 20426
3        4  1582  6310 37320 25848  1597   275   526 25922
group_by() and summarize() are relatively straightforward. across() is a fairly new verb that is super powerful: it lets you reference columns using tidy selection principles (e.g. starts_with(), c(1:9), etc.), and if you give it a function, it will apply that function to each of the selected columns. Less typing!
Alternatively, you would have to sum each column individually (A = sum(A), B = sum(B), and so on), which is much more typing.
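For instance, any selection helper works inside across(); here is a hypothetical variation on the summary above (same df, summing only columns A through D):

df %>%
  group_by(week_num = lubridate::isoweek(date)) %>%
  summarize(across(A:D, sum))   # A:D selects the columns by name range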
I have a data frame that corresponds to the path taken by a river, describing elevation and distance. I need to evaluate each different ground path traveled by the river and extract this information.
Example:
df = data.frame(Soil = c("Forest", "Forest",
"Grass", "Grass","Grass",
"Scrub", "Scrub","Scrub","Scrub",
"Grass", "Grass","Grass","Grass",
"Forest","Forest","Forest","Forest","Forest","Forest"),
Distance = c(1, 5,
10, 15, 56,
59, 67, 89, 99,
102, 105, 130, 139,
143, 145, 167, 189, 190, 230),
Elevation = c(1500, 1499,
1470, 1467, 1456,
1450, 1445, 1440, 1435,
1430, 1420, 1412, 1400,
1390, 1387, 1384, 1380, 1376, 1370))
Soil Distance Elevation
1 Forest 1 1500
2 Forest 5 1499
3 Grass 10 1470
4 Grass 15 1467
5 Grass 56 1456
6 Scrub 59 1450
7 Scrub 67 1445
8 Scrub 89 1440
9 Scrub 99 1435
10 Grass 102 1430
11 Grass 105 1420
12 Grass 130 1412
13 Grass 139 1400
14 Forest 143 1390
15 Forest 145 1387
16 Forest 167 1384
17 Forest 189 1380
18 Forest 190 1376
19 Forest 230 1370
But I need something like this:
Soil Distance.Min Distance.Max Elevation.Min Elevation.Max
1 Forest 1 5 1499 1500
2 Grass 10 56 1456 1470
3 Scrub 59 99 1435 1450
4 Grass 102 139 1400 1430
5 Forest 143 230 1370 1390
I tried to use group_by() and which.min(Soil), but that takes into account the whole df, not each path.
We need a run-length encoding to track runs of consecutive Soil values. Using this function (fashioned to mimic data.table::rleid):
myrleid <- function(x) {
  r <- rle(x)
  # one id per run, repeated once for each element of that run
  rep(seq_along(r$lengths), times = r$lengths)
}
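Applied to the Soil column of the example data, it labels the five consecutive runs:

myrleid(df$Soil)
# [1] 1 1 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 5 5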
We can do
df %>%
  group_by(grp = myrleid(Soil)) %>%
  summarize(Soil = Soil[1], across(c(Distance, Elevation), list(min = min, max = max))) %>%
  select(-grp)
# # A tibble: 5 x 5
# Soil Distance_min Distance_max Elevation_min Elevation_max
# <chr> <dbl> <dbl> <dbl> <dbl>
# 1 Forest 1 5 1499 1500
# 2 Grass 10 56 1456 1470
# 3 Scrub 59 99 1435 1450
# 4 Grass 102 139 1400 1430
# 5 Forest 143 230 1370 1390
You can try this:
df <- df %>% mutate(id = data.table::rleid(Soil))
inner_join(
  distinct(df %>% select(Soil, id)),
  df %>%
    group_by(id) %>%
    summarize(across(Distance:Elevation, .fns = list(min = min, max = max))),
  by = "id"
) %>% select(!id)
Output:
Soil Distance_min Distance_max Elevation_min Elevation_max
1 Forest 1 5 1499 1500
2 Grass 10 56 1456 1470
3 Scrub 59 99 1435 1450
4 Grass 102 139 1400 1430
5 Forest 143 230 1370 1390
Or, even more concisely (thanks to r2evans):
df %>%
  group_by(id = data.table::rleid(Soil)) %>%
  summarize(Soil = first(Soil), across(Distance:Elevation, .fns = list(min = min, max = max))) %>%
  select(!id)
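For completeness, the same run-length grouping can be written entirely in data.table; this is my own sketch rather than part of the answers above, assuming the same df:

library(data.table)

setDT(df)[, .(Distance_min = min(Distance), Distance_max = max(Distance),
              Elevation_min = min(Elevation), Elevation_max = max(Elevation)),
          by = .(grp = rleid(Soil), Soil)][, grp := NULL][]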
My goal is to perform multiple column operations in one line of code without hard coding the variable names.
structure(list(
  Subject = 1:6,
  Congruent_1 = c(359, 391, 384, 316, 287, 403),
  Congruent_2 = c(361, 378, 322, 286, 276, 363),
  Congruent_3 = c(342, 355, 334, 274, 297, 335),
  Congruent_4 = c(365, 503, 324, 256, 266, 388),
  Congruent_5 = c(335, 354, 320, 272, 260, 337),
  Incongruent_1 = c(336, 390, 402, 305, 310, 400),
  Incongruent_2 = c(366, 407, 386, 280, 243, 393),
  Incongruent_3 = c(323, 455, 317, 308, 259, 325),
  Incongruent_4 = c(361, 392, 357, 274, 342, 350),
  Incongruent_5 = c(300, 366, 378, 263, 258, 349)),
  row.names = c(NA, 6L), class = "data.frame")
That is what my data looks like.
I need to do column subtraction and save those new values into new columns. For example, a new column named selhist_1 should be computed as Incongruent_1 - Congruent_1. I tried to write a for loop that indexes the existing columns by their names and creates new columns using the same indexing variable:
for(i in 1:5)(
DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) = as.name(paste("Incongruent_",i,sep="")) - as.name(paste("Congruent_",i,sep="")))
)
but I received this error:
Error: unexpected '=' in: "for(i in 1:5)( DP4 = mutate(DP4, as.name(paste("selhistB_",i,sep="")) ="
I would rather use this modular approach than hard-code and write out selhistB_1 = Incongruent_1 - Congruent_1 five times inside mutate().
I also wonder if I could achieve the same goal on the long version of this data; maybe that would make more sense.
library(dplyr)
library(tidyr)
d %>%
  pivot_longer(-Subject,
               names_to = c(".value", "id"),
               names_sep = "_") %>%
  mutate(selhistB = Incongruent - Congruent) %>%
  pivot_wider(names_from = id, values_from = c(Congruent, Incongruent, selhistB))
Or just skip the last pivot and keep everything long, as sketched below.
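That is, the same pipe truncated before the final pivot_wider(), leaving one row per Subject and id:

d %>%
  pivot_longer(-Subject,
               names_to = c(".value", "id"),
               names_sep = "_") %>%
  mutate(selhistB = Incongruent - Congruent)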
As long as you are already using tidyverse packages, the following code will do exactly what you need:
library(dplyr)
for (i in 1:5) {
  DP4 <- DP4 %>%
    mutate(!!sym(paste0("selhistB_", i)) :=
             !!sym(paste0("Incongruent_", i)) - !!sym(paste0("Congruent_", i)))
}
DP4
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
  selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1        -23          5        -19         -4        -35
2         -1         29        100       -111         12
3         18         64        -17         33         58
4        -11         -6         34         18         -9
5         23        -33        -38         76         -2
6         -3         30        -10        -38         12
You can use split.default to split on the column-name suffix, then loop over the list and subtract the first column (Congruent) from the second (Incongruent), i.e.
lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))), function(i) i[2] - i[1])
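To attach those differences back as selhistB_ columns (my own follow-up sketch, not part of the original answer):

res <- lapply(split.default(df[-1], sub('.*_', '', names(df[-1]))),
              function(i) i[2] - i[1])
# each list element is a one-column data frame; bind them and name by suffix
df[paste0("selhistB_", names(res))] <- do.call(cbind, res)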
Subtracting all matching columns at once (Incongruent minus Congruent), then binding the result back with cbind, try:
x <- df1[, grepl("^I", colnames(df1)) ] - df1[, grepl("^C", colnames(df1)) ]
names(x) <- paste0("selhistB_", seq_along(names(x)))
res <- cbind(df1, x)
res
Subject Congruent_1 Congruent_2 Congruent_3 Congruent_4 Congruent_5
1 1 359 361 342 365 335
2 2 391 378 355 503 354
3 3 384 322 334 324 320
4 4 316 286 274 256 272
5 5 287 276 297 266 260
6 6 403 363 335 388 337
Incongruent_1 Incongruent_2 Incongruent_3 Incongruent_4 Incongruent_5
1 336 366 323 361 300
2 390 407 455 392 366
3 402 386 317 357 378
4 305 280 308 274 263
5 310 243 259 342 258
6 400 393 325 350 349
  selhistB_1 selhistB_2 selhistB_3 selhistB_4 selhistB_5
1        -23          5        -19         -4        -35
2         -1         29        100       -111         12
3         18         64        -17         33         58
4        -11         -6         34         18         -9
5         23        -33        -38         76         -2
6         -3         30        -10        -38         12
How do I create a loop and/or a function that divides 200 columns by another column (creating 200 new columns) to get percentages?
How do I write the loop so it scales to 200 columns, and how do I name the new columns so that each one is the old column name with "p_" in front of it?
Is this possible?
For example, I'm trying to do something like this, but with 200 columns:
fans <- data.frame(
population = c(1234, 5678, 2345, 6789, 3456, 7890,
4567, 8901, 5678, 9012, 6789),
bearsfans = c(123, 234, 345, 456, 567,678, 789, 890, 901, 135, 246),
packersfans = c(11,22,33,44,55,66,77,88,99,100,122),
vikingsfans = c(39, 49, 59, 61, 32, 22, 31, 92, 52, 10, 122))
print(fans)
attach(fans)
## create new columns which are the ratio of fans to population
fans$p_bearsfan = bearsfans/population
print(fans)
Output:
## population bearsfans packersfans vikingsfans p_bearsfan
## 1 1234 123 11 39 0.09967585
## 2 5678 234 22 49 0.04121169
# divide every fan column by population; sapply returns a matrix
temp <- sapply(fans[-1], function(x) x / fans$population)
# prefix the new names with "p_" and bind back onto the data frame
colnames(temp) <- paste0("p_", colnames(temp))
cbind(fans, temp)
population bearsfans packersfans vikingsfans p_bearsfans p_packersfans p_vikingsfans
1 1234 123 11 39 0.09967585 0.008914100 0.031604538
2 5678 234 22 49 0.04121169 0.003874604 0.008629799
3 2345 345 33 59 0.14712154 0.014072495 0.025159915
4 6789 456 44 61 0.06716748 0.006481072 0.008985123
5 3456 567 55 32 0.16406250 0.015914352 0.009259259
6 7890 678 66 22 0.08593156 0.008365019 0.002788340
7 4567 789 77 31 0.17276111 0.016860083 0.006787826
8 8901 890 88 92 0.09998877 0.009886530 0.010335917
9 5678 901 99 52 0.15868263 0.017435717 0.009158154
10 9012 135 100 10 0.01498003 0.011096316 0.001109632
11 6789 246 122 122 0.03623509 0.017970246 0.017970246
If you're happy with a suffixed new column name (instead of a prefix), this is a one-liner using dplyr::mutate_at. I assume here that all relevant columns end with the word "fans".
With suffixes
fans %>% mutate_at(vars(ends_with("fans")), list(percent = ~.x / population))
# population bearsfans packersfans vikingsfans bearsfans_percent
#1 1234 123 11 39 0.09967585
#2 5678 234 22 49 0.04121169
#3 2345 345 33 59 0.14712154
#4 6789 456 44 61 0.06716748
#5 3456 567 55 32 0.16406250
#6 7890 678 66 22 0.08593156
#7 4567 789 77 31 0.17276111
#8 8901 890 88 92 0.09998877
#9 5678 901 99 52 0.15868263
#10 9012 135 100 10 0.01498003
#11 6789 246 122 122 0.03623509
# packersfans_percent vikingsfans_percent
#1 0.008914100 0.031604538
#2 0.003874604 0.008629799
#3 0.014072495 0.025159915
#4 0.006481072 0.008985123
#5 0.015914352 0.009259259
#6 0.008365019 0.002788340
#7 0.016860083 0.006787826
#8 0.009886530 0.010335917
#9 0.017435717 0.009158154
#10 0.011096316 0.001109632
#11 0.017970246 0.017970246
With prefixes
Turning the suffixes into prefixes requires one more step:
fans %>%
  mutate_at(vars(ends_with("fans")), list(percent = ~ .x / population)) %>%
  rename_at(vars(ends_with("percent")), ~ sub("(.+)_percent", "p_\\1", .x))
# population bearsfans packersfans vikingsfans p_bearsfans p_packersfans
#1 1234 123 11 39 0.09967585 0.008914100
#2 5678 234 22 49 0.04121169 0.003874604
#3 2345 345 33 59 0.14712154 0.014072495
#4 6789 456 44 61 0.06716748 0.006481072
#5 3456 567 55 32 0.16406250 0.015914352
#6 7890 678 66 22 0.08593156 0.008365019
#7 4567 789 77 31 0.17276111 0.016860083
#8 8901 890 88 92 0.09998877 0.009886530
#9 5678 901 99 52 0.15868263 0.017435717
#10 9012 135 100 10 0.01498003 0.011096316
#11 6789 246 122 122 0.03623509 0.017970246
# p_vikingsfans
#1 0.031604538
#2 0.008629799
#3 0.025159915
#4 0.008985123
#5 0.009259259
#6 0.002788340
#7 0.006787826
#8 0.010335917
#9 0.009158154
#10 0.001109632
#11 0.017970246
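With dplyr 1.0 or later, across() and its .names argument can produce the prefix directly, with no rename step; a sketch on the same fans data (mutate_at is superseded by this style):

library(dplyr)

fans %>%
  mutate(across(ends_with("fans"), ~ .x / population, .names = "p_{.col}"))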
We can directly divide multiple columns by one column. Here grep selects the columns whose names end with "fans", and those names are reused to assign the new columns.
cols <- grep("fans$", names(fans), value = TRUE)
fans[paste0("p_", cols)] <- fans[cols]/fans$population
fans
# population bearsfans packersfans vikingsfans p_bearsfans p_packersfans p_vikingsfans
#1 1234 123 11 39 0.09968 0.008914 0.031605
#2 5678 234 22 49 0.04121 0.003875 0.008630
#3 2345 345 33 59 0.14712 0.014072 0.025160
#4 6789 456 44 61 0.06717 0.006481 0.008985
#5 3456 567 55 32 0.16406 0.015914 0.009259
#6 7890 678 66 22 0.08593 0.008365 0.002788
#7 4567 789 77 31 0.17276 0.016860 0.006788
#8 8901 890 88 92 0.09999 0.009887 0.010336
#9 5678 901 99 52 0.15868 0.017436 0.009158
#10 9012 135 100 10 0.01498 0.011096 0.001110
#11 6789 246 122 122 0.03624 0.017970 0.017970
Also, as a side note: why is it not advisable to use attach() in R, and what should I use instead?
Data
fans <- data.frame(
population = c(1234, 5678, 2345, 6789, 3456, 7890,
4567, 8901, 5678, 9012, 6789),
bearsfans = c(123, 234, 345, 456, 567,678, 789, 890, 901, 135, 246),
packersfans = c(11,22,33,44,55,66,77,88,99,100,122),
vikingsfans = c(39, 49, 59, 61, 32, 22, 31, 92, 52, 10, 122))
Naive Solution
for (c in names(fans)[-1]) {
  fans[[paste0("p_", c)]] <- fans[[c]] / fans[["population"]]
}
I have a simple dataset, attached below. I see a clear outlier (Qty = 6), which should get corrected after processing it through tsclean.
c(6, 187, 323, 256, 289, 387, 335, 320, 362, 359, 426, 481,
356, 408, 497, 263, 330, 521, 406, 350, 478, 320, 339)
What I have:
library(forecast)
library(readr)

data1 <- read_csv("sample.csv", col_names = FALSE)
count_qty <- ts(data1, frequency = 12)
data1$clean_qty <- tsclean(count_qty)
and the data returns
X1 clean_qty[,"X1"]
<dbl> <dbl>
1 6 6
2 187 187
3 323 323
4 256 256
5 289 289
6 387 387
7 335 335
8 320 320
9 362 362
10 359 359
# ... with 13 more rows
The first item should be removed.
You can remove outliers using boxplot:
# vec1 is the data vector from the question
vec1 <- c(6, 187, 323, 256, 289, 387, 335, 320, 362, 359, 426, 481,
          356, 408, 497, 263, 330, 521, 406, 350, 478, 320, 339)
vec1[! vec1 %in% boxplot(vec1, plot = FALSE)$out]
# [1] 323 256 289 387 335 320 362 359 426 481 356 408 497 263 330 521 406 350 478 320 339
Note that 187 is also flagged as an outlier; as you said, 6 is the obvious one.
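If you want the same filtering without drawing a plot at all, boxplot.stats() exposes the same $out component:

vec1[! vec1 %in% boxplot.stats(vec1)$out]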
DF:
Year:        1901   1901   1903   1968   1978    2002   2006   2010
species:        1      1      2     65      1      82      3      1
lat:           49     46     47     47     48    43.1  44.23  47.11
long:      -79.22  -79.5 -78.22 -79.84 -78.11 -77.114 -76.33  -76.2
Julian_Day:    79    125    165    178    193      68     90    230
Land:          16     24     25     30     34      34     39     41
There are more variables, but that's an example of the matrix. For each year AND each species, I only want to keep the row with the lowest value of Julian_Day. E.g. the second row would be omitted here, because 79 is less than 125 for species 1 in 1901.
First of all, I would suggest providing your data.frame in a format that is easy for people to use; we'll be able to help you better and faster.
df <- structure(list(
  Year = c(1901, 1901, 1903, 1968, 1978, 2002, 2006, 2010),
  species = c(1, 1, 2, 65, 1, 82, 3, 1),
  lat = c(49, 46, 47, 47, 48, 43.1, 44.23, 47.11),
  long = c(-79.22, -79.5, -78.22, -79.84, -78.11, -77.114, -76.33, -76.2),
  Julian_Day = c(79, 125, 165, 178, 193, 68, 90, 230),
  Land = c(16, 24, 25, 30, 34, 34, 39, 41)),
  row.names = c(NA, -8L), class = "data.frame")
Here is your data.frame
df
# Year species lat long Julian_Day Land
#1: 1901 1 49.00 -79.220 79 16
#2: 1901 1 46.00 -79.500 125 24
#3: 1903 2 47.00 -78.220 165 25
#4: 1968 65 47.00 -79.840 178 30
#5: 1978 1 48.00 -78.110 193 34
#6: 2002 82 43.10 -77.114 68 34
#7: 2006 3 44.23 -76.330 90 39
#8: 2010 1 47.11 -76.200 230 41
Generally, you just have to run dput(head(your_dataframe)). But you can build a small fake data frame to illustrate your point if you cannot reveal your data.
Here's a possible solution using the data.table package:
library(data.table)
setDT(df)[, .SD[which.min(Julian_Day)], by = .(species, Year)]
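On the example data this keeps, for each species/Year pair, the row with the smallest Julian_Day (so the second row of the input is dropped):

#    species Year   lat    long Julian_Day Land
# 1:       1 1901 49.00 -79.220         79   16
# 2:       2 1903 47.00 -78.220        165   25
# 3:      65 1968 47.00 -79.840        178   30
# 4:       1 1978 48.00 -78.110        193   34
# 5:      82 2002 43.10 -77.114         68   34
# 6:       3 2006 44.23 -76.330         90   39
# 7:       1 2010 47.11 -76.200        230   41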