How to properly sum rows based on a specific date column range? - r

The idea is to get the sum based on the column names that are
between 01/01/2021 and 01/08/2021:
# define range parameters {start-end}
library(lubridate)   # needed for the %m-% operator
first_date <- format(Sys.Date(), "01/01/%Y")
actual_date <- format(Sys.Date() %m-% months(1), "01/%m/%Y")
# get the sum of the rows between first_date and actual_date
df$ytd <- rowSums(df[as.character(seq(first_date, actual_date))])
However, when applied, the following error arises:
Error in seq.default(first_date, actual_date) :
  'from' must be a finite number
The expected output is a new column holding the row sums over the specified range.
Data:
df <- structure(list(country = c("Mexico", "Mexico", "Mexico", "Mexico"
), `01/01/2021` = c(12, 23, 13, 12), `01/02/2021` = c(12, 23,
13, 12), `01/03/2021` = c(12, 23, 13, 12), `01/04/2021` = c(12,
23, 13, 12), `01/05/2021` = c(12, 23, 13, 12), `01/06/2021` = c(12,
23, 13, 12), `01/07/2021` = c(12, 23, 13, 12), `01/08/2021` = c(12,
23, 13, 12), `01/09/2021` = c(12, 23, 13, 12), `01/10/2021` = c(12,
23, 13, 12), `01/11/2021` = c(12, 23, 13, 12), `01/12/2021` = c(12,
23, 13, 12)), row.names = c(NA, -4L), class = c("tbl_df", "tbl",
"data.frame"))
How could I properly apply a function to get this output?

The format and seq combination doesn't work, i.e. seq expects a Date class whereas format returns a character class. Instead, make use of the range operator (:) inside across or select:
library(dplyr)
out <- df %>%
  mutate(ytd = rowSums(across(all_of(first_date):all_of(actual_date))))
Output:
> out$ytd
[1] 96 184 104 96

A base R approach using match -
df$ytd <- rowSums(df[match(first_date, names(df)):match(actual_date, names(df))])
df$ytd
#[1] 96 184 104 96
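Since the underlying issue is that the column names are only date-like strings, another option is to parse them as actual Date values and select by comparison. This is just a sketch, assuming lubridate is available and that first_date and actual_date are the dd/mm/yyyy strings defined above (col_dates and keep are made-up helper names):
library(lubridate)
# parse the dd/mm/yyyy column names (all columns except 'country') as Dates
col_dates <- dmy(names(df)[-1])
# keep only the columns whose date falls inside the requested range
keep <- names(df)[-1][col_dates >= dmy(first_date) & col_dates <= dmy(actual_date)]
df$ytd <- rowSums(df[keep])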

Related

Subsetting within a function using data.table and is.na(x) in R

I am trying to subset a data.table within a function, but subsetting with !is.na(x) is not working. I know it should work, because while I was building my example on a still simpler problem, the subset call worked fine.
library(data.table)
library(ggpubr)
tj = as.data.table(cbind(Name = c("Tom", "Tom", "Tim", "Jerry", NA, "Jerry", "Tim", NA),
                         var1 = c(12, 12, 20, 30, 31, 21, 21, 31),
                         var2 = c(12, 11, 27, 32, 31, 11, 21, 41),
                         var3 = c(10, 10, 11, 13, 12, 12, 11, 10),
                         time = as.numeric(c(1, 2, 1, 1, 1, 2, 2, 2))))
plot.tj <- function(dat = tj, color = NULL) {
  name <- names(dat)[2:4] # a vector of column names to loop over
  for (i in seq_along(name)) {
    plotms <- ggline(dat[!is.na(color), ], x = "time", y = name[i], color = color)
    print(plotms)
  }
}
plot.tj(color = "Name")
The expected output is the 3 var graphs, but without the NA group.
The thing is that your variable color is a character string, so you must wrap it in get() to use it for subsetting your data.table. This works:
plot.tj <- function(dat = tj, color = NULL) {
  name <- names(dat)[2:4] # a vector of column names to loop over
  for (i in seq_along(name)) {
    plotms <- ggline(dat[!is.na(get(color)), ], x = "time", y = name[i], color = color)
    print(plotms)
  }
}
plot.tj(color = "Name")
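A sketch of an equivalent approach that avoids get() by looking the column up with [[ (same data and arguments as above; plot.tj2 is just a variant name):
plot.tj2 <- function(dat = tj, color = NULL) {
  name <- names(dat)[2:4]
  for (i in seq_along(name)) {
    # dat[[color]] fetches the column named by the string in 'color'
    plotms <- ggline(dat[!is.na(dat[[color]]), ], x = "time", y = name[i], color = color)
    print(plotms)
  }
}
plot.tj2(color = "Name")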

ggplot formula for a bar graph

I am looking to get a bar graph of medals in R. I have 3 distinct columns (gold, silver, bronze). The column for gold medals has a total of 8, the silver has 10, and the bronze has 13.
For the code, I started writing: ggplot(data, aes(x=?)) + geom_bar()
I am not sure how to include all 3 medal columns in the call where it shows x=?
Thanks
For plotting purposes, it is "easier" to work with long data instead of wide. Below I converted the data you mentioned in your comment to long and plotted the data as a grouped bar.
library(tidyverse)
# load data
raw_data <- structure(list(Rank = c(1, 2, 3, 4, 5, 6),
`Team/Noc` = c("United States of America", "People's Republic of China", "Japan", "Great Britain", "ROC", "Australia"),
Gold = c(39, 38, 27, 22, 20, 17),
Silver = c(41,32, 14, 21, 28, 7),
Bronze = c(33, 18, 17, 22, 23, 22),
Total = c(113, 88, 58, 65, 71, 46),
`Rank by Total` = c(1, 2, 5, 4, 3, 6)),
row.names = c(NA,-6L),
class = c("tbl_df", "tbl", "data.frame"))
# convert wide data to long
long_data <- raw_data %>%
  pivot_longer(cols = -`Team/Noc`, names_to = 'Medal') %>% # convert wide data to long format
  filter(Medal %in% c("Gold", "Silver", "Bronze"))         # only keep the medal columns
# plot
ggplot(long_data) +
  geom_col(aes(x = `Team/Noc`,
               y = value,
               fill = Medal),
           position = "dodge" # grouped bars
  )
Hope this gets you started!
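And if you literally want only three bars (one total per medal type, like the 8/10/13 counts in your own data), a sketch that sums the long data first; medal_totals and total are made-up names:
medal_totals <- long_data %>%
  group_by(Medal) %>%
  summarise(total = sum(value))

ggplot(medal_totals, aes(x = Medal, y = total)) +
  geom_col()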

Split Column into 3 Columns with R

I'm trying to separate a column into 3 columns.
My code:
library(dplyr)
library(tidyr)
table1 <- read.csv("tablepartipants.csv")
table2 <- tidyr::separate(table1, col = unique_participant, into = c("uID", "gender", "employment"), sep='.')
I always get this error: Expected 3 pieces. Additional pieces discarded in 80 rows [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
The column holds values like 30.male.Student (see the example data reconstructed in the answer below), yet all 3 "new" columns come out empty...
Remove the sep part of your command. By default, separate() splits on any non-alphanumeric character (which includes the period), whereas a bare . in sep is a regular-expression wildcard that matches every character, so nothing useful is left in the pieces.
# Example data.frame
table1 <- data.frame(unique_participant = paste0(30:33, c('.male.Student', '.female.Student')))
One option:
separate(table1, unique_participant, into = c("uID", "gender", "employment"))
Or use \\. to specify a literal period:
separate(table1, unique_participant, into = c("uID", "gender", "employment"), sep = '\\.')

mapping over lists with `ends_with` to apply a custom error function

I have a list of data frames (see the Data below).
I am trying to map over it and use the mutate function to apply a custom function. The list is called results and I want to compute an error between preds and another column in each data frame. The common feature of that column across the list elements is that its name ends in 1.
How can I compute my custom function using contains, ends_with or something similar? The column preds is the same in all data frames.
rse <- function(x, y){
  sqrt((x - y)**2)
}
x <- map(results, ~ mutate(
  error = rse(ends_with("1"), preds)
))
Data:
list(`c(5, 19)` = structure(list(date = structure(c(16801, 16802,
16803, 16804, 16805, 16806), class = "Date"), year = c(2016,
2016, 2016, 2016, 2016, 2016), c_farolillo = c(17, 9, 8, 3, 4,
4), plaza_eliptica = c(25, 29, 18, 11, 13, 9), c_farolillo1 = c(17,
9, 8, 3, 4, 4), preds = c(7.08282661437988, 9.66606140136719,
5.95918273925781, 3.81649804115295, 4.26900291442871, 3.38829565048218
)), row.names = c(NA, 6L), class = "data.frame"), `c(7, 1, 2, 18)` = structure(list(
date = structure(c(16801, 16802, 16803, 16804, 16805, 16806
), class = "Date"), year = c(2016, 2016, 2016, 2016, 2016,
2016), pza_del_carmen = c(12, 10, 10, 6, 8, 4), pza_de_espana = c(28,
21, 14, 8, 10, 6), escuelas_aguirre = c(17, 24, 19, 20, 22,
16), retiro = c(6, 5, 7, 3, 2, 2), pza_del_carmen1 = c(12,
10, 10, 6, 8, 4), preds = c(15.3020477294922, 16.007848739624,
15.3953952789307, 9.59985256195068, 9.85349082946777, 8.42792892456055
)), row.names = c(NA, 6L), class = "data.frame"))
We loop over the list of data.frames ('results') with map, then use mutate_at to modify the columns whose names end with "1", applying the rse function and specifying the 'preds' column as its 'y' argument.
library(dplyr)
library(purrr)
results <- map(results, ~ .x %>%
  mutate_at(vars(ends_with("1")), list(new = ~ rse(., y = preds))))
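In current dplyr, mutate_at is superseded by across(); a sketch of the same computation with across (results2 is just an assumed name for the output):
library(dplyr)
library(purrr)
results2 <- map(results, ~ .x %>%
  mutate(across(ends_with("1"),
                function(col) rse(col, y = preds),
                .names = "{.col}_new")))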

15 Minute Period for Time Series

I have the chunk of code below where I am trying to fill the missing minutes in my data df_stuff by joining it to a time series that has all minutes for an entire year. I would actually like to aggregate this data at 15-minute intervals instead of by minute. Does anyone know a simple way of doing this? I was looking at to.minutes15 from the xts package but it seems to have problems with my POSIXct time series.
Code:
library("sqldf")
##Filling Gaps in time by minute
myTZ <- "America/Los_Angeles"
tseries <- seq(as.POSIXct("2015-01-01 00:00:00", tz=myTZ),
as.POSIXct("2015-12-31 23:59:00", tz=myTZ), by="min")
df2 <- data.frame(SeqDateTime=tseries)
finaldf <- sqldf("select df2.SeqDateTime,
median(df_stuff.brooms) as broomsTot
from df2
left outer join df_stuff on df2.SeqDateTime = df_stuff.broomTime
group by df2.SeqDateTime
order by df2.SeqDateTime asc")
Data:
df_stuff <- structure(list(brooms = c(27, 53, 10, 55, 14, 49, 26,
13, 12, NA, NA, 23, 28, 31, NA, 46, NA, 13, NA, 33, 12, 4, 28,
34, 0, 24, 7, 31, 33, 37, 56, 41, 50, 55, 41, 15, 23, 26, 14,
27, 22, 41, 48, 19, 28, 11, 11, NA, 49, NA), broomTime = structure(c(1423970100,
1424122200, 1424136180, 1424035260, 1424141580, 1424122440, 1423274580,
1424129580, 1424146320, 1429129320, 1429032060, 1429142940, 1428705000,
1429142460, 1429128720, 1429204560, 1422909480, 1424137200, 1424042100,
1424149620, 1424131920, 1424108940, 1424144820, 1424040600, 1424119620,
1424148660, 1443593040, 1443657120, 1424125860, 1424223120, 1424235240,
1424232720, 1424234940, 1424234640, 1424230440, 1424115300, 1429208280,
1429131720, 1429148460, 1429151040, 1424129760, 1424125380, 1424123220,
1424137380, 1424115780, 1424219340, 1424131560, 1424233560, 1424224920,
1443640800), class = c("POSIXct", "POSIXt"), tzone = "")), .Names = c("brooms",
"broomTime"), row.names = c(NA, 50L), class = "data.frame")
You can summarize over any time interval by using cut within the group_by function in dplyr.
library(dplyr)
ans <- finaldf %>%
  group_by(SeqDateTime = cut(SeqDateTime, breaks = "15 min")) %>%
  summarize(broomsTot = sum(as.numeric(broomsTot), na.rm = TRUE))
head(ans)
Source: local data frame [6 x 2]

          SeqDateTime broomsTot
               (fctr)     (dbl)
1 2015-01-01 02:00:00         0
2 2015-01-01 02:15:00         0
3 2015-01-01 02:30:00         0
4 2015-01-01 02:45:00         0
5 2015-01-01 03:00:00         0
6 2015-01-01 03:15:00         0
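Note that cut() turns SeqDateTime into a factor; if you need it back as a time, a sketch of the conversion (the tz argument is an assumption and should match whatever timezone the printed labels are in):
ans$SeqDateTime <- as.POSIXct(as.character(ans$SeqDateTime), tz = myTZ)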
I can assure you that xts does not have a problem with your POSIXct time series. xts uses POSIXct for its internal time index.
Here's how to join df_stuff with a 1-minute series and then aggregate that result to a 15-minute series.
library(xts)
# create xts object
xts_stuff <- with(df_stuff, xts(brooms, broomTime))
# merge with empty xts object that contains a regular 1-minute index
xts_stuff_1min <- merge(xts_stuff, xts(,tseries))
# aggregate to 15-minutes
ep15 <- endpoints(xts_stuff_1min, "minutes", 15)
final_df <- period.apply(xts_stuff_1min, ep15, median, na.rm=TRUE)
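If a plain data.frame is the desired end result, a sketch of the conversion from the xts object (final_out and the column name are assumptions):
final_out <- data.frame(SeqDateTime = index(final_df),
                        broomsTot   = as.numeric(coredata(final_df)))
head(final_out)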
