Convert multiple character columns to as.Date and time in R

We have an arbitrary dataset, called df:
enter <- c("2017-01-01", "2018-02-02", "2018-03-03")
guest <- c("Foxtrot","Uniform","Charlie")
disposal <- c("2017-01-05", "2018-02-05", "2018-03-09")
rating <- c("60","50","50")
date <- c("2017-04-10", "2018-04-15", "2018-04-20")
clock <- c("16:02:00", "17:02:00", "18:02:00")
rolex <- c("20:10:00", "20:49:00", "17:44:00")
df <- data.frame(enter,guest,disposal,rating,date,clock,rolex, stringsAsFactors = FALSE)
What I am trying to accomplish is to convert the columns enter, disposal, and date from character to Date, using the dplyr package.
So, I came up with the following, simply by chaining it together:
library(dplyr)
library(chron)
df2 <- df %>% mutate(enter = as.Date(enter, format = "%Y-%m-%d")) %>%
mutate(disposal = as.Date(disposal, format = "%Y-%m-%d")) %>%
mutate(date = as.Date(date, format = "%Y-%m-%d"))
What I am after is this: which dplyr mutate function is needed to get rid of the repeated chaining, e.g. when we have lots of columns with arbitrary names that hold dates? I want to specify the columns by name, and then apply the as.Date function to change them from character to Date.
Some solutions to different operations that are not applicable to this case:
1: convert column in data.frame to date
2: convert multiple columns to dates with lubridate and dplyr
3: change multiple character columns to date
For example, I've tried, but with no luck:
df2 <- df %>% mutate_at(data = df, each_of(c(enter, disposal, date)) = as.Date(format = "%Y-%m-%d"))
as given here: dplyr change many data types
As a bonus
Notice the clock and rolex columns. Using the chron package simply converts them to the right format, i.e. time instead of character
df2 <- df %>% mutate(clock = chron(times = clock)) %>% mutate(rolex = chron(times = rolex))
As suggested here:
convert character to time in r
Now, is the same solution available without all the chaining, especially when we have an arbitrary number of columns with different names?

You just need to tweak the arguments of mutate_at. Any additional arguments to as.Date are specified as arguments to mutate_at.
df2 <- df %>% mutate_at(vars(enter,disposal,date), as.Date, format="%Y-%m-%d")
The second part of your question has a similar solution.
df2 <- df2 %>% mutate_at(vars(clock, rolex), function(x) chron(times. = x))
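For what it's worth, mutate_at has since been superseded in dplyr (1.0.0 and later) by across(). Below is a minimal sketch of the same date conversion, assuming a dplyr version that provides across() and all_of(). The vector date_cols is a name made up here for illustration, and the data frame is a trimmed copy of the one in the question.

```r
library(dplyr)

# Trimmed copy of the question's data: three date-like columns plus one other
df <- data.frame(
  enter    = c("2017-01-01", "2018-02-02", "2018-03-03"),
  disposal = c("2017-01-05", "2018-02-05", "2018-03-09"),
  date     = c("2017-04-10", "2018-04-15", "2018-04-20"),
  rating   = c("60", "50", "50"),
  stringsAsFactors = FALSE
)

# date_cols is a hypothetical helper: the names of the columns to convert
date_cols <- c("enter", "disposal", "date")

# Apply as.Date to every column named in date_cols, leave the rest untouched
df2 <- df %>%
  mutate(across(all_of(date_cols), ~ as.Date(.x, format = "%Y-%m-%d")))

str(df2)
```

Keeping the column names in a character vector means the same call works no matter how many columns need converting.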

Related

How to add a column that identifies groups of consecutive days

In a data.frame, I would like to add a column that identifies groups of consecutive days.
I think I need to start by converting my strings to date format...
Here's my example :
mydf <- data.frame(
var_name = c(rep("toto",6),rep("titi",5)),
date_collection = c("09/12/2022","10/12/2022","13/12/2022","16/12/2022","16/12/2022","17/12/2022",
"01/12/2022","03/11/2022","04/11/2022","05/11/2022","08/11/2022")
)
Expected output :
Convert to Date class, take the absolute difference between adjacent dates to create a logical vector (TRUE where the gap exceeds one day), and take the cumulative sum
library(dplyr)
library(lubridate)
mydf %>%
mutate(id = cumsum(c(0, abs(diff(dmy(date_collection)))) > 1)+1)
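The same idea can be sketched in base R, without lubridate or dplyr. This assumes, as the answer above does, that groups are numbered across the whole data frame and not restarted for each var_name.

```r
mydf <- data.frame(
  var_name = c(rep("toto", 6), rep("titi", 5)),
  date_collection = c("09/12/2022", "10/12/2022", "13/12/2022", "16/12/2022",
                      "16/12/2022", "17/12/2022", "01/12/2022", "03/11/2022",
                      "04/11/2022", "05/11/2022", "08/11/2022"),
  stringsAsFactors = FALSE
)

# Parse day/month/year strings into Date objects
d <- as.Date(mydf$date_collection, format = "%d/%m/%Y")

# Gap in days to the previous row (0 for the first row)
gap <- c(0, abs(as.numeric(diff(d))))

# A new group starts whenever the gap exceeds one day
mydf$id <- cumsum(gap > 1) + 1

mydf
```

Rows with the same date or consecutive dates share an id, and any larger jump (in either direction) opens a new one.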

How to construct a data.frame with name for rows and columns?

I can construct a data.frame using the following code.
library(tidyverse)
library(lubridate)
DF <- data.frame(Date = seq(as.Date("2001-03-01"), to= as.Date("2001-05-31"), by="day"),
A = runif(92, 0,10),
D = runif(92,5,15),
Z = runif(92,3,15))
I would, however, like to construct a data.frame like the one in the figure below, where the column names (i.e., 1:2 or 1:5) and row names (i.e., A, Z, etc.) should be like what I have, but the values can be random. I am trying to put together a reproducible question but wanted to get my data.frame right first.
If we want to transpose the dataset, in tidyverse, we reshape into 'long' format and reshape back to 'wide' with a different name column
library(dplyr)
library(tidyr)
DF %>%
pivot_longer(cols = -Date) %>%
pivot_wider(names_from = Date, values_from = value)
We can try reshaping using reshape2::recast.
library(reshape2)
recast(DF, id.var = 1, variable ~ Date)
However, this will give us each date as a separate column; that is as far as we can help without a reproducible example.
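If the goal is simply rows named A, D, Z and one column per date, a base R transpose may be enough. This is only a sketch under that assumption, since the figure from the question is not available. The name wide is made up here, and set.seed is there just to make the random values repeatable.

```r
set.seed(1)  # only so the random values are repeatable

DF <- data.frame(
  Date = seq(as.Date("2001-03-01"), to = as.Date("2001-05-31"), by = "day"),
  A = runif(92, 0, 10),
  D = runif(92, 5, 15),
  Z = runif(92, 3, 15)
)

# Drop the Date column and transpose, so A, D, Z become row names
wide <- t(as.matrix(DF[, -1]))

# Use the dates as column names
colnames(wide) <- as.character(DF$Date)

dim(wide)
wide[, 1:3]
```

The result is a numeric matrix. Wrapping it in as.data.frame(wide) gives a data.frame with the same row and column names.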

How to loop through date variable names and sum by group?

I have some time series data where there are a few region variables and the rest of the variable names are all dates. I am trying to loop through the entire list of date variables and sum each of them, but I am unsure how to do it using dplyr syntax. This is what I have so far:
library(dplyr)
library(lubridate)
library(data.table)
library(curl)
# county level
covid_jhu <- as.data.frame(fread(paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")))
# remove territories and assign the correct FIPS code
covid_jhu <- covid_jhu %>%
filter(Admin2 != "") %>%
mutate(FIPS = substr(as.character(UID), 4, 8))
jhu_state <- covid_jhu %>%
group_by(Province_State) %>%
mutate(`1/22/20` = sum(`1/22/20`))
I can't seem to figure out the loop here even though I seem to be able to get it right for 1 variable.
Here is a potential method to perform the desired grouping. The key is to convert the wide data frame from the source into a long format.
library(dplyr)
library(tidyr)
# county level
covid_jhu <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
# remove territories and assign the correct FIPS code
covid_jhu <- covid_jhu %>%
filter(Admin2 != "") %>%
mutate(FIPS = substr(as.character(UID), 4, 8))
#convert from wide to long
long_covid_jhu<-pivot_longer(covid_jhu, cols=starts_with("X"), names_to = "Date")
long_covid_jhu$Date <- as.Date(long_covid_jhu$Date, format="X%m.%d.%y")
#grouping by state
long_covid_jhu %>%
group_by(Province_State) %>% summarize(TotalCases=sum(value))
#grouping by date
long_covid_jhu %>%
group_by(Date) %>% summarize(TotalCases=sum(value))
#grouping by state & date
long_covid_jhu %>%
group_by(Province_State, Date) %>% summarize(TotalCases=sum(value))
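If the wide layout should be kept, the date columns can also be summed in place with across(). The sketch below uses a small made-up table (toy) instead of the real download, and assumes the column names are still in the m/d/yy form that fread leaves them in (read.csv would rename them to X1.22.20 and so on).

```r
library(dplyr)

# Toy stand-in for the JHU table: one region column plus date-named columns
toy <- data.frame(
  Province_State = c("Alabama", "Alabama", "Alaska"),
  `1/22/20` = c(1, 2, 3),
  `1/23/20` = c(4, 5, 6),
  check.names = FALSE  # keep the date-like column names unchanged
)

# Sum every column whose name looks like m/d/yy within each state
jhu_state <- toy %>%
  group_by(Province_State) %>%
  summarise(across(matches("^[0-9]+/[0-9]+/[0-9]+$"), sum))

jhu_state
```

No explicit loop is needed: across() applies sum to each matching column, one result row per state.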
You may want to try functions such as
group_by_all,
group_by_ (this takes a variable name as input rather than a hard-coded column name, so you can keep passing column names as input in a loop).
Similarly, there are mutate_ and summarise_ functions as well. Note that these underscore-suffixed verbs are deprecated in current dplyr releases.
As I understand the question, reading a little about these should serve your purpose.

R: reshape dataframe from wide to long format based on compound column names

I have a dataframe containing observations for two sets of data (A,B), with dataset and observation type given by the column names :
mydf <- data.frame(meta1=paste0("a",1:2), meta2=paste0("b",1:2),
A_var1 = c(11:12), A_var2 = c("p","r"),
B_var1 = c(21:22), B_var2 = c("x","z"))
I would like to reshape this dataframe so that each row contains observations on one set only. In this long format, set and column names should be given by splitting the original column names at the '_':
mydf2 <- data.frame(meta1=rep(paste0("a",1:2),2),
meta2=rep(paste0("b",1:2),2),
set=c("A","A","B","B"),
var1 = c(11:12, 21:22),
var2 = c("p","r","x","z"))
I have tried using 'gather' in combination with 'str_split' and 'sub', but unfortunately without success. Could this be done using tidyverse functions?
Yes, you can do this with the tidyverse!
You were close: you need to gather, then separate, then spread.
new_df <- mydf %>%
gather(set, vars, 3:6) %>%
separate(set, into = c('set', 'var'), sep = "_") %>%
spread(var, vars)
hope this helps!
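In newer tidyr versions (1.0.0 and later), the gather/separate/spread chain can probably be collapsed into a single pivot_longer call using the special ".value" name. This is a sketch assuming such a tidyr version is installed; long_df is just an illustrative name.

```r
library(tidyr)

mydf <- data.frame(
  meta1 = paste0("a", 1:2), meta2 = paste0("b", 1:2),
  A_var1 = 11:12, A_var2 = c("p", "r"),
  B_var1 = 21:22, B_var2 = c("x", "z"),
  stringsAsFactors = FALSE
)

# Split each column name at "_": the first part goes into the "set" column,
# and ".value" makes the second part (var1, var2) the output column name
long_df <- pivot_longer(
  mydf,
  cols = -c(meta1, meta2),
  names_to = c("set", ".value"),
  names_sep = "_"
)

long_df
```

One difference from the gather-based version: var1 stays integer here, because the numeric and character columns are never stacked into one shared column.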

Problems with dplyr and POSIXlt data

I have a problem. I downloaded data and transformed the dates into POSIXlt format
df<-read.csv("007.csv", header=T, sep=";")
df$transaction_date<-strptime(df$transaction_date, "%d.%m.%Y")
df$install_date<-strptime(df$install_date, "%d.%m.%Y")
df$days<- as.numeric(difftime(df$transaction_date,df$install_date, units = "days"))
The data frame is about transactions in an online game. It contains value (the payment), transaction_date, install_date and ID. I added a new column, which shows the days since installation. I tried to summarise the data using dplyr
df2<-df %>%
group_by(days) %>%
summarise(sum=sum(value))
And I've got an error:
Error: column 'transaction_date' has unsupported type : POSIXlt, POSIXt
How can I fix it?
UPD. I changed the classes of the date columns to character, which solved the problem. But can I use dplyr without changing the classes in my dataset?
You could use as.POSIXct as recommended in the comments, but if the hours, minutes, and seconds don't matter then you should just use as.Date:
df <- read.csv("007.csv", header=T, sep=";")
df2 <- df %>%
mutate(
transaction_date = as.Date(transaction_date, "%d.%m.%Y")
,install_date = as.Date(install_date, "%d.%m.%Y")
) %>%
group_by(days = transaction_date - install_date) %>%
summarise(sum=sum(value))
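If the time of day does matter, the as.POSIXct route mentioned above could look roughly like this. The data frame is a made-up stand-in for 007.csv (the real file is not available here), and the UTC time zone is an assumption.

```r
library(dplyr)

# Toy stand-in for 007.csv with the same column names as in the question
df <- data.frame(
  value = c(10, 20, 5),
  transaction_date = c("03.01.2017", "03.01.2017", "05.01.2017"),
  install_date = c("01.01.2017", "02.01.2017", "04.01.2017"),
  stringsAsFactors = FALSE
)

df2 <- df %>%
  mutate(
    # POSIXct is a plain numeric vector underneath, which dplyr accepts
    transaction_date = as.POSIXct(transaction_date, format = "%d.%m.%Y", tz = "UTC"),
    install_date = as.POSIXct(install_date, format = "%d.%m.%Y", tz = "UTC"),
    days = as.numeric(difftime(transaction_date, install_date, units = "days"))
  ) %>%
  group_by(days) %>%
  summarise(sum = sum(value))

df2
```

Unlike strptime, which returns POSIXlt, as.POSIXct gives a column type that group_by and summarise can work with directly.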
As noted here, this is a "feature" of the tidyverse. They don't want to handle POSIXlt objects because a POSIXlt value is stored as a list of component vectors rather than a plain vector. However, using as.POSIXct isn't always an option. In my case I really needed the POSIXlt class to handle some uncleaned data. In that case, just go back to good old stable base R. In your case:
df2 <- aggregate(df$value, by=list(df$days), sum)
One trick I use often is the following:
1: Convert POSIXt columns (in the example below, eventDate) to character
2: Perform the dplyr operations you need (in the example below, we bind the rows of two data frames)
3: Convert back from character to POSIXt, not forgetting to set the right format (format) and timezone (tz) as they were before performing step 1.
Example:
# step 1
df1$eventDate <- as.character(df1$eventDate)
df2$eventDate <- as.character(df2$eventDate)
#step 2
merged_df <- bind_rows(df1, df2)
#step 3
merged_df$eventDate <- strptime(merged_df$eventDate, format = "%Y-%m-%d", tz = "UTC")
