I am currently struggling with reshaping my dataset into the desired form. Let's say I have the following dataset to start with:
library(tsbox)
library(dplyr)
library(tidyr)
# create df that matches my format
df1 <- ts_wide(ts_df(ts_c(mdeaths)))
df1$id <- 1
df2 <- ts_wide(ts_df(ts_c(mdeaths)))
df2$id <- 2
df <- rbind(df1, df2)
Now this dataset has a date column, a value column and an "id" column, which specifies which date/value points belong to the same observation object. I would now like to reshape my dataset into a 2x2 data frame, where the first column is the id and the second column is a time series object (built from the date/value pairs corresponding to that id). To do so, I tried the following:
# create a new df, with two cols (id and ts)
df_ts <- df %>%
  group_by(id) %>%
  nest()
The nest command creates "a list-column of data frames", which is not exactly what I wanted. I know that a ts can be defined via ts(data$value, data$date), but I do not know how to integrate it after the group_by(id) call. Can anyone help me turn this column into a ts object instead of a data frame? I am new to R and grateful for any form of help.
Thanks in advance
If you have a non-atomic data type, it will have to be a list column of something.
If you want a list-column of ts objects, you can do:
df %>%
  group_by(id) %>%
  summarize(ts = list(ts(value, time)))
Continuing your pipe, you could do:
df %>%
  group_by(id) %>%
  nest() %>%
  mutate(data = purrr::map(data, with, ts(value, time)))
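One thing to keep in mind: the second positional argument of ts() is `start`, not a vector of dates, so the calendar information is only preserved if `start` and `frequency` are set explicitly. A minimal sketch of that idea follows. It assumes each id holds one gap-free monthly series sorted by time, rebuilds the example data in base R instead of tsbox, and uses a hypothetical helper `to_monthly_ts()` that is not part of any package.

```r
library(dplyr)

# Assumption: columns are named time and value, as in the question's df.
# The data is rebuilt from mdeaths (monthly, starting January 1974).
make_df <- function(id) {
  data.frame(
    time  = seq(as.Date("1974-01-01"), by = "month", length.out = length(mdeaths)),
    value = as.numeric(mdeaths),
    id    = id
  )
}
df <- rbind(make_df(1), make_df(2))

# Hypothetical helper: derive start = c(year, month) from the first date
# and declare the series as monthly.
to_monthly_ts <- function(value, time) {
  first <- as.POSIXlt(time[1])
  ts(value, start = c(first$year + 1900, first$mon + 1), frequency = 12)
}

# One row per id, with a list-column holding one ts object per id
df_ts <- df %>%
  group_by(id) %>%
  summarise(ts = list(to_monthly_ts(value, time)))
```

Each element can then be pulled out with `df_ts$ts[[1]]` and passed to functions that expect a ts object. For data that is not monthly, the frequency and the start calculation would need to be adapted.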
Related
For example, my df now is:
person <- c("a","a","a","b","b","b","c","c","c")
score <- c(31,2,13,5,6,7,8,9,4)
df <- data.frame(person,score)
What I want to get is a two-column table with three rows:
[1,1]="a", [1,2]= a vector of c(31,2,13)
[2,1]="b", [2,2]= a vector of c(5,6,7)
[3,1]="c", [3,2]= a vector of c(8,9,4)
Actually, I just want the three vectors so that I can apply another function to them. I tried something like the following code, but it didn't work (the actual function is much more complex, but it takes two vectors of the same length, one of which is provided).
f <- function(x,y){x-y}
df <- df %>%
  group_by(person) %>%
  summarise(diff = f(c(1,2,3), score))
Thanks so much in advance!
Base R solution:
aggregate(
  score ~ person,
  df,
  list
)
Tidyverse solution:
library(dplyr)
df %>%
  group_by(person) %>%
  summarise(score = list(score))
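If the goal is to apply the function to each group's vector rather than just collect the vectors, the same list() wrapper can go around the function call. This is only a sketch using the toy `f` from the question; the real function is assumed to return a vector per group in the same way.

```r
library(dplyr)

person <- c("a", "a", "a", "b", "b", "b", "c", "c", "c")
score <- c(31, 2, 13, 5, 6, 7, 8, 9, 4)
df <- data.frame(person, score)

# Toy function from the question: takes two vectors of equal length
f <- function(x, y) x - y

# Wrapping the result in list() keeps one row per person,
# with the whole result vector stored in a list-column
result <- df %>%
  group_by(person) %>%
  summarise(diff = list(f(c(1, 2, 3), score)))

result$diff[[1]]  # vector for person "a": -30, 0, -10
```

Without list(), summarise() tries to spread the three values per group over several rows, which is usually not what is wanted here.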
I would like to create a new column that extracts the hour from a timestamp as a numeric data type. If I had one data frame or tibble, I would do it as follows:
calories_hourly$activity_hour_num <- calories_hourly$activity_hour %>% mdy_hms() %>% format(format = ('%H')) %>% as.numeric()
However, I have a list of 18 tibbles called "fitbit_data", and I would like to perform the operation above for tibbles 6-16. In each of these tibbles, the timestamp to convert is in the second column. The beginning of a failed attempt is shown below:
fitbit_data[6:16] <- fitbit_data[6:16] %>% mutate(activity_hour_num=map(.x=fitbit_data[6:16], .f=~mdy(.x[2])))
Can you please help me code a tidy solution for this R task?
Thank you so much!
You can use map like this:
library(purrr)
library(lubridate)
library(dplyr)
k <- 6:16
fitbit_data[k] <- map(fitbit_data[k], ~{.x[[2]] <- lubridate::mdy(.x[[2]]);.x})
Based on your first attempt, you can do:
fitbit_data[k] <- map(fitbit_data[k], ~.x %>%
  mutate(activity_hour = mdy_hms(activity_hour) %>%
    format('%H') %>% as.numeric()))
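If the second column has a different name in each tibble and the original timestamp should be kept, a variant that selects the column by position and writes to a new column may fit better. This is a sketch only: the two small tibbles are a made-up stand-in for fitbit_data, the timestamp strings are assumed to be in month/day/year hour:minute:second form, and lubridate::hour() replaces the format('%H') step.

```r
library(dplyr)
library(purrr)
library(lubridate)

# Hypothetical stand-in for fitbit_data: the timestamp is always the
# second column, but the column name differs between tibbles
fitbit_data <- list(
  tibble(id = 1:2, activity_hour = c("04/12/2016 01:00:00", "04/12/2016 13:00:00")),
  tibble(id = 1:2, activity_minute = c("04/13/2016 00:30:00", "04/13/2016 23:15:00"))
)
k <- 1:2  # would be 6:16 for the real list

# Add a numeric hour column computed from column 2 of each tibble
fitbit_data[k] <- map(fitbit_data[k], function(tbl) {
  tbl %>%
    mutate(activity_hour_num = as.numeric(hour(mdy_hms(tbl[[2]]))))
})
```

Tibbles outside the index range `k` are left untouched because only `fitbit_data[k]` is reassigned.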
I am trying to subset/filter a data frame according to the corresponding column elements from another data frame.
Here is what I used to do this
df <- df1[df1$col1 %in% df2$col2,]
And then I am going to set the column as row names
df <- df %>% remove_rownames %>% column_to_rownames('col1')
However, I have no idea how to combine these two steps into one pipeline using %>%.
df1 %>% filter(col1 %in% df2$col2) %>% remove_rownames %>% column_to_rownames('col1')
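For reference, filter() comes from dplyr, while remove_rownames() and column_to_rownames() come from tibble, so both packages need to be attached. A self-contained sketch with made-up data (the contents of `df1` and `df2` here are assumptions for illustration):

```r
library(dplyr)
library(tibble)

# Hypothetical example data
df1 <- data.frame(col1 = c("a", "b", "c"), x = 1:3)
df2 <- data.frame(col2 = c("a", "c"))

# Keep rows of df1 whose col1 appears in df2$col2,
# then move col1 into the row names
df <- df1 %>%
  filter(col1 %in% df2$col2) %>%
  remove_rownames() %>%
  column_to_rownames("col1")
```

After this, `rownames(df)` holds the matched keys and `col1` is no longer a regular column.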
I have some time series data where there are a few region variables and the rest of the variable names are all dates. I am trying to loop through the entire list of date variables and sum each of them, but I am unsure how to do it using dplyr syntax. This is what I have so far:
library(dplyr)
library(lubridate)
library(data.table)
library(curl)
# county level
covid_jhu <- as.data.frame(fread(paste0("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")))
# remove territories and assign the correct FIPS code
covid_jhu <- covid_jhu %>%
  filter(Admin2 != "") %>%
  mutate(FIPS = substr(as.character(UID), 4, 8))
jhu_state <- covid_jhu %>%
  group_by(Province_State) %>%
  mutate(`1/22/20` = sum(`1/22/20`))
I can't seem to figure out the loop here even though I seem to be able to get it right for 1 variable.
Here is a potential method to perform the desired grouping. The key is to convert the wide data frame from the source into a long format.
library(dplyr)
library(tidyr)
# county level
covid_jhu <- read.csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
# remove territories and assign the correct FIPS code
covid_jhu <- covid_jhu %>%
  filter(Admin2 != "") %>%
  mutate(FIPS = substr(as.character(UID), 4, 8))
#convert from wide to long
long_covid_jhu<-pivot_longer(covid_jhu, cols=starts_with("X"), names_to = "Date")
long_covid_jhu$Date <- as.Date(long_covid_jhu$Date, format="X%m.%d.%y")
#grouping by state
long_covid_jhu %>%
  group_by(Province_State) %>%
  summarize(TotalCases = sum(value))
#grouping by date
long_covid_jhu %>%
  group_by(Date) %>%
  summarize(TotalCases = sum(value))
#grouping by state & date
long_covid_jhu %>%
  group_by(Province_State, Date) %>%
  summarize(TotalCases = sum(value))
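If the wide layout should be kept, dplyr's across() can also sum every date column per state without an explicit loop. The sketch below uses a tiny made-up data frame in place of the JHU file and assumes the date columns keep their original names such as 1/22/20 (as they do with fread(), or with read.csv(check.names = FALSE)).

```r
library(dplyr)

# Hypothetical miniature of the wide JHU table
covid_toy <- data.frame(
  Province_State = c("Alabama", "Alabama", "Alaska"),
  Admin2 = c("Autauga", "Baldwin", "Anchorage"),
  `1/22/20` = c(0, 1, 2),
  `1/23/20` = c(3, 4, 5),
  check.names = FALSE
)

# Sum every column whose name looks like m/d/yy within each state
jhu_state <- covid_toy %>%
  group_by(Province_State) %>%
  summarise(across(matches("^[0-9]+/[0-9]+/[0-9]+$"), sum))
```

The result has one row per state and one summed column per date. With the read.csv() import used in the answer, the selector would need to match the X1.22.20 style names instead, for example starts_with("X").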
You may want to try functions like:
group_by_all,
group_by_ (this takes a variable name as input rather than a hard-coded column name, so you can keep passing column names as input in a loop).
Similarly, there are mutate_ and summarise_ functions as well. Note that the underscore variants are deprecated in current dplyr releases.
As I understand the question, reading a little about these should solve your problem.
Background
I am working with a large dataset from a repeated measures clinical trial in R, where I want to do some data manipulations for each subject. This could be extraction of the max value in column x for each subject or the mean of column y for each subject.
Problem
I am fond of using the dplyr package and pipes, which led me to the group_by function. But when I try to apply it, the data that I want to extract does not seem to group by subject as it is supposed to, but rather extracts data based on the entire dataset.
Code
This is what I have done so far:
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
library(dplyr)
library(plyr)
data <- tbl_df(data)
test <- data %>%
  filter(!is.na(wght)) %>%
  dplyr::group_by(subject_id) %>%
  mutate(maxwght=max(wght),meanwght=mean(wght)) %>%
  ungroup()
Sample of the test dataframe:
Find a .csv sample of my dataset here:
https://drive.google.com/file/d/1wGkSQyJXqSswThiNsqC26qaP7d3catyX/view?usp=sharing
Is this what you want? In my example below, the output shows the max and the mean of the wght column for each subject id. Note that group_by() has to come before the step that computes the statistics; otherwise max() and mean() are evaluated over the entire dataset.
library(dplyr)
data <- read.csv(file="group_by_question.csv", header=TRUE, sep=",")
test <- data %>%
  filter(!is.na(wght)) %>%
  group_by(subject_id) %>%
  summarise(maxwght = max(wght), meanwght = mean(wght))
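One thing worth checking in the question's code: it attaches plyr after dplyr. plyr exports its own mutate() and summarise(), which then mask the dplyr versions and do not take group_by() into account, and that would explain statistics computed over the whole dataset. A minimal sketch of a workaround is to namespace the dplyr verbs explicitly (or not attach plyr at all). The small data frame is a made-up stand-in for group_by_question.csv.

```r
library(dplyr)

# Hypothetical stand-in for the CSV from the question
data <- data.frame(
  subject_id = c(1, 1, 2, 2, 2),
  wght = c(70, 72, NA, 80, 84)
)

# Explicit dplyr:: prefixes make sure the grouped versions are used,
# even if plyr has been attached later in the session
test <- data %>%
  dplyr::filter(!is.na(wght)) %>%
  dplyr::group_by(subject_id) %>%
  dplyr::mutate(maxwght = max(wght), meanwght = mean(wght)) %>%
  dplyr::ungroup()
```

Here every row keeps its original columns and gains the per-subject maximum and mean, which is the shape the question's pipeline was aiming for.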