Transformation from long to wide with multiple observations in R

I want to transform a data set from long to wide.
The data contains multiple observations for each time point.
To illustrate, consider the following two examples.
In EXAMPLE 1 below, the data does not contain multiple observations and can be transformed from long to wide.
In EXAMPLE 2 below, the data does contain multiple observations (n=3 per time point) and cannot be transformed from long to wide; I tested this with both dcast and pivot_wider.
Can anyone suggest a method to transform the test data from EXAMPLE 2 into a valid format?
Code to reproduce the problem:
library(ggplot2)
library(ggcorrplot)
library(reshape2)
library(tidyr)
library(data.table)
# EXAMPLE 1 (does work)
# Test data
set.seed(5)
time <- rep(c(0,10), 1, each = 2)
feature <- rep(c("feat1", "feat2"), 2)
values <- runif(4, min=0, max=1)
# Concatenate test data
# test has non-unique values in time column
test <- data.table(time, feature, values)
# Transform data into wide format
test_wide <- dcast(test, time ~ feature, value.var = 'values')
# EXAMPLE 2 (does not work)
# Test data
set.seed(5)
time <- rep(c(0,10), 2, each = 6)
feature <- c(rep("feat1", 12), rep("feat2", 12))
values <- runif(24, min=0, max=1)
# Concatenate test data
# test has non-unique values in time column
test <- data.table(time, feature, values)
# Transform data into wide format
test_wide <- dcast(test, time ~ feature, value.var = 'values')
Warning:
Aggregate function missing, defaulting to 'length'
Problem:
Non-unique values in first column (time) are not preserved/allowed.
# Testing with pivot_wider
test_wider <- pivot_wider(test, names_from = feature, values_from = values)
Warning:
Warning message:
Values are not uniquely identified; output will contain list-cols.
Problem:
Non-unique values in first column (time) are not preserved/allowed.
For lack of a better idea, a possible output could look like this:
time  feat1      feat2
0     0.1046501  0.5279600
0     0.7010575  0.8079352
0     0.2002145  0.9565001
etc.

Since there are multiple values, it is not obvious how these should be treated when converting to a wide format. That's why you get the warning messages. This is one way of handling them. If you want something else, then please give a specific example of what the output should look like.
pivot_wider(test, names_from = feature, values_from = values) %>%
unnest(c(feat1, feat2))
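If the goal is one row per replicate, as in the table sketched in the question, another option is to number the observations within each time/feature combination first, so that every row is uniquely identified before pivoting. The following is a minimal sketch under that assumption; the column name obs is introduced here for illustration and is not part of the original data.

```r
library(dplyr)
library(tidyr)

# Rebuild the EXAMPLE 2 data from the question
set.seed(5)
test <- data.frame(
  time    = rep(c(0, 10), 2, each = 6),
  feature = c(rep("feat1", 12), rep("feat2", 12)),
  values  = runif(24, min = 0, max = 1)
)

# 'obs' is a hypothetical helper column: the running index of each
# observation within its time/feature group
test_wide <- test %>%
  group_by(time, feature) %>%
  mutate(obs = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = feature, values_from = values)
```

This pairs the i-th feat1 value with the i-th feat2 value at each time point, which is only meaningful if the row order reflects real replicates.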

You may want something like this:
library(dplyr)
test %>%
pivot_wider(names_from = c(feature, time),
values_from = values)
where c(feature, time) accounts for the multiple-variable case. But, as was already pointed out in the comments, please indicate if you want something else.
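If a single summary value per time point and feature is enough, the duplicates can also be collapsed while pivoting. This is a minimal sketch, assuming the mean is an acceptable summary (the question does not say so), using the values_fn argument of pivot_wider:

```r
library(tidyr)

# Rebuild the EXAMPLE 2 data from the question
set.seed(5)
test <- data.frame(
  time    = rep(c(0, 10), 2, each = 6),
  feature = c(rep("feat1", 12), rep("feat2", 12)),
  values  = runif(24, min = 0, max = 1)
)

# Collapse the replicates of each time/feature cell to their mean
test_mean <- pivot_wider(
  test,
  names_from  = feature,
  values_from = values,
  values_fn   = mean
)
```

The result has one row per time point, so the individual replicates are no longer visible.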

Related

How can I re-write code that applies a function on subset of rows based on another vector in different R ecosystems?

In my problem, I have to apply a function to a subset of each individual's time series, based on a set of dates extracted from the original data.
So, I have a data.frame with a time series for each individual between 2005-01-01 and 2010-12-31 (test_final_ind_series, called exp_series in the sample code below) and a sample of individual-date pairs (sample_events, called sample_dates below), ideally extracted from the same data.
With these, in my example I attempt to calculate an average over a subset of the time-series values exp, conditional on the individual and date in sample_events.
I did this in 2 different ways:
1: a simple but effective code that gets the job done very quickly
I simply ask the user to input the data for a specific individual and define a lag of time and a window width (like a rolling average). The function exp_summary then outputs the requested average.
To repeat the operation for each row in sample_events I decided to nest the individual series by ID of the individuals and then attach the sample of dates. Eventually, I just run a loop that applies the function to each individual nested dataframe.
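library(data.table) # as.data.table, %between%
library(dplyr)      # %>%, group_by
library(tidyr)      # nest
#Sample data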
set.seed(111)
exp_series <- data.frame(
id = as.character(rep(1:10000, each=2191)),
date = rep(seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),times=10000),
exp = rep(rnorm(n=10000, mean=10, sd=5),times=2191)
)
sample_dates <- data.frame(
Event_id = as.character(replicate(10000,sample(1:10000,size = 1,replace = TRUE))),
Event_date = sample(
seq(as.Date('2005-01-01'),
as.Date('2010-12-31'), by = 'day'),
size =10000,replace = TRUE)
)
#This function, given a dataframe with dates and exposure series (df)
#an event_date
#a lag value
#a width of the window
#Outputs the average for a user-defined time window
exp_summary<- function(df, event_date, lag=0,width=0){
df<-as.data.table(df)
end<-as.character(as.Date(event_date)-lag)
start<-as.character(max(as.Date(end)-width, min(df$date)))# I need this in case the time window goes beyond the time limits (earliest date)
return(mean(df[date %between% c(start,end)]$exp))
}
#Nest dataframes
exp_series_nest <- exp_series %>%
group_by(id) %>%
nest()
#Merge with sample events, including only the necessary dates
full_data<-merge(exp_series_nest,sample_dates, by.x="id", by.y="Event_id",all.x = FALSE, all.y=TRUE)
#Initialize dataframe in advance
summaries1<-setNames(data.frame(matrix(ncol = 2, nrow = nrow(full_data))), c("id", "mean"))
summaries1$id<-full_data$id
#Loop over each id, which is a nested data.frame
system.time(for (i in 1:nrow(full_data)){
summaries1$mean[i]<-exp_summary(full_data$data[[i]], full_data$Event_date[i], lag=1, width=365)
})
2: using the highly flexible package runner
With the same data, I need to specify the arguments properly. I have also opened an issue on the GitHub repository to speed up this code with parallelization.
library(runner)
system.time(summaries2 <- sample_dates %>%
group_by(Event_id) %>%
mutate(
mean = runner(
x = exp_series[exp_series$id == Event_id[1],],
k = "365 days",
lag = "1 days",
idx =exp_series$date[exp_series$id == Event_id[1]],
at = Event_date,
f = function(x) {mean(x$exp)},
na_pad=FALSE
)
)
)
They give the same results up to the second decimal, but method 1 is much faster than method 2, and the difference becomes noticeable with very large datasets.
My question is: for method 1, how can I write the last loop more concisely within the data.table and/or tidyverse ecosystems? I really struggle to make nested lists and "normal" columns work together in the same data frame.
Also, if you have any other recommendation, I am open to hearing it! I am here more out of curiosity than need, as method 1 already solves my problem acceptably.
With data.table, you could join exp_series with the range you wish in sample_dates and calculate mean by=.EACHI:
library(data.table)
setDT(exp_series)
setDT(sample_dates)
lag <- 1
width <- 365
# Define range
sample_dates[,':='(begin=Event_date-width-lag,end=Event_date-lag)]
# Calculate mean by .EACHI
summariesDT <- exp_series[sample_dates,.(id,mean=mean(exp))
,on=.(id=Event_id,date>=begin,date<=end),by=.EACHI][
,.(id,mean)]
Note that this returns the same results as summaries1 only for Event_id without duplicates in sample_dates.
The results are different in case of duplicates, for instance Event_id==1002:
sample_dates[Event_id==1002]
Event_id Event_date begin end
<char> <Date> <Date> <Date>
1: 1002 2010-08-17 2009-08-16 2010-08-16
2: 1002 2010-06-23 2009-06-22 2010-06-22
If you don't have duplicates in your real data, this shouldn't be a problem.
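For the tidyverse half of the question, one possible way to replace the explicit loop is to join the nested series to the events and evaluate the window function row by row with rowwise(). This is only a sketch on a deliberately small stand-in data set: exp_small, events_small, window_mean and the 30-day width are names and values assumed here for illustration, and window_mean is a base R rewrite of exp_summary rather than the original function.

```r
library(dplyr)
library(tidyr)

# Small stand-in data (hypothetical; far smaller than the original example)
set.seed(111)
dates <- seq(as.Date("2005-01-01"), as.Date("2005-12-31"), by = "day")
exp_small <- data.frame(
  id   = as.character(rep(1:5, each = length(dates))),
  date = rep(dates, times = 5),
  exp  = rnorm(5 * length(dates), mean = 10, sd = 5)
)
events_small <- data.frame(
  Event_id   = as.character(sample(1:5, size = 8, replace = TRUE)),
  Event_date = sample(dates, size = 8, replace = TRUE)
)

# Mean of exp over [event_date - lag - width, event_date - lag],
# clamped to the earliest available date
window_mean <- function(df, event_date, lag = 0, width = 0) {
  end   <- event_date - lag
  start <- max(end - width, min(df$date))
  mean(df$exp[df$date >= start & df$date <= end])
}

summaries_tidy <- exp_small %>%
  group_by(id) %>%
  nest() %>%
  ungroup() %>%
  inner_join(events_small, by = c("id" = "Event_id")) %>%
  rowwise() %>%   # inside rowwise(), 'data' is the nested data frame of that row
  mutate(mean = window_mean(data, Event_date, lag = 1, width = 30)) %>%
  ungroup() %>%
  select(id, Event_date, mean)
```

Because each event keeps its own row after the join, repeated ids in the events table each get their own window. Whether this is faster than the loop on the full data is not something this sketch establishes.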

Adding a column to a data frame using mutate in R

I am working with the OJ data set in the ISLR package. I need to add two columns to the data frame. One column is the product of two numerical variables. The second column is the product of a numerical and a categorical variable.
I added the first column (numerical*numerical) using the mutate function from the dplyr package as follows:
library(ISLR)
library(dplyr)
OJ %>%
mutate(`StoreID:PriceCH` = StoreID*PriceCH)
I was able to add this column successfully. But when I tried to do the same for the categorical*numeric column, I got the following warning.
OJ %>%
mutate(`Store7:PriceCH` = Store7*PriceCH)
Warning message:
In Ops.factor(Store7, PriceCH) : ‘*’ not meaningful for factors
Can anyone suggest what I can do if I need to add a column which is a product of categorical*numerical?
My output should be something like this,
Thank you
Apply one-hot encoding to Store7 first:
OJ <- cbind(OJ, sapply("Yes", function(x) as.integer(x == OJ$Store7)))
names(OJ)[ncol(OJ)] <- "Store7_Yes"
Conceptually, it does not make much sense (in most cases) to multiply categorical variables.
Though if you want to do so, you have to transform your data to a numeric scale. Be aware that this is not always straightforward.
This could be a starting point:
library(tidyverse)
Result <- OJ %>%
mutate(`StoreID:PriceCH` = StoreID*PriceCH) %>%
mutate(Store7Numeric = as.character(Store7)) #To avoid possible mistakes
Result <- Result %>%
mutate(Store7Numeric = ifelse(Store7Numeric == "No", 0, 1)) #Check this
Result <- Result %>% mutate(Store7Numeric = as.numeric(Store7Numeric)) %>% #To numeric
mutate(`Store7:PriceCH` = Store7Numeric*PriceCH) %>% #Your calculation
select(-Store7Numeric) #Remove, if you want. the created numeric variable
The error message is due to variable Store7 being a factor (See in str(OJ)), so you must make it numeric:
OJ$Store7 <- as.numeric(OJ$Store7)
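One caveat with calling as.numeric on a factor: it returns the internal level codes, which start at 1, so a No/Yes factor becomes 1/2 rather than 0/1. The sketch below uses a small made-up factor and price vector (not the OJ data) to show the difference and a 0/1 indicator built from a comparison instead.

```r
# Hypothetical stand-ins for Store7 and PriceCH
store7 <- factor(c("No", "Yes", "Yes", "No"))
price  <- c(1.75, 1.86, 1.99, 1.69)

# Level codes: "No" is level 1, "Yes" is level 2
codes <- as.numeric(store7)

# 0/1 indicator: 1 where the store is "Yes", 0 otherwise
indicator <- as.integer(store7 == "Yes")

# Product of the indicator and the numeric variable
store7_price <- indicator * price
```

With the indicator, the product is zero for "No" rows and equal to the price for "Yes" rows, which is what an interaction term in a regression would normally use.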

How can I create subsets from this data frame?

I want to aggregate my data. The goal is to have one point per time interval in a diagram. I have a data frame with 2 columns: the first column is a timestamp, the second is a value. I want to evaluate each time period, meaning that all values within a time period (for example, 1 second) are added together.
I don't know how to do this with the aggregate function, because it does not seem to support time intervals.
0.000180 8
0.000185 8
0.000474 32
It is not easy to tell from your question what you're specifically trying to do. Your data has no column headings, we do not know the data types, you did not include the error message, and you contradicted yourself between your original question and your comment (is the first column the timestamp, or the second?).
I'm trying to understand. Are you trying to:
Split your original data.frame into multiple data.frames?
View a specific subset of your data, i.e. filter it?
Group your data.frame into increments of a set time interval and then aggregate the results?
Assuming that you have named the variables in your data frame time and value, I've addressed these three cases below.
#Set Data
num <- 100
set.seed(4444)
tempdf <- data.frame(time = sample(seq(0.000180,0.000500,0.000005),num,TRUE),
value = sample(1:100,num,TRUE))
#Example 1: Split your data in to multiple dataframes (using base functions)
temp1 <- tempdf[ tempdf$time>0.0003 , ]
temp2 <- tempdf[ tempdf$time>0.0003 & tempdf$time<0.0004 , ]
#Example 2: Filter your data (using dplyr::filter() function)
dplyr::filter(tempdf, time>0.0003 & time<0.0004)
#Example 3: Chain the functions together using dplyr to group and summarise your data
library(dplyr)
tempdf %>%
mutate(group = floor(time*10000)/10000) %>%
group_by(group) %>%
summarise(avg = mean(value),
num = n())
I hope that helps?
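Since the question asks for values to be added up within each period, the same binning idea also works with base aggregate and sum. This is a minimal sketch, assuming the columns are named time and value and the bin width is 1 second; the first three rows are taken from the question, the remaining rows and the interval column are made up for illustration.

```r
# First three rows are from the question; the rest are hypothetical
df <- data.frame(
  time  = c(0.000180, 0.000185, 0.000474, 1.2, 1.7, 2.05),
  value = c(8, 8, 32, 5, 7, 1)
)

bin_width <- 1  # seconds (assumed)

# Label each row with the start of the interval it falls into
df$interval <- floor(df$time / bin_width) * bin_width

# Sum the values within each interval
sums <- aggregate(value ~ interval, data = df, FUN = sum)
```

Changing bin_width (for example to 0.0001) gives finer intervals without touching the rest of the code.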

Variance of a complete group of a dataframe in R

Let's say I have a dataframe with 10+1 columns and 10 rows, and every value has the same units except for one column (the "grouping" column A).
I'm trying to accomplish the following: given a grouping of the data frame based on the last column, how do I compute the standard deviation of the whole block as a single, monolithic variable?
Let's say I do the grouping (in reality it's a cut in intervals):
df %>% group_by(A)
From what I have gathered throughout this site, you can use aggregate or other dplyr methods to calculate the variance per column, i.e.:
this (SO won't let me embed if I have <10 rep).
In that picture we can see the grouping as colors, but by using aggregate I would get 1 standard deviation per specified column (I know you can use cbind to get more than 1 variable, for example aggregate(cbind(V1,V2)~A, df, sd)) and per group (and similar methods using dplyr and %>%, with summarise(..., FUN=sd) appended at the end).
However what I want is this: just like in Matlab when you do
group1 = df(row_group,:) % row_group would be df(:,end)==1 in this case
stdev(group1(:)) % operator (:) is key here
% iterate for every group
I have my reasons for wanting it that specific way, and of course the real dataframe is bigger than this mock example.
Minimum working example:
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
df %>% group_by(A) %>% summarise_at(vars(V1), funs(sd(.))) # no good
aggregate(V1~A, data=df, sd) # no good
aggregate(cbind(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10)~A, data=df, sd) # nope
df %>% group_by(A) %>% summarise_at(vars(V1,V2,V3,V4,V5,V6,V7,V8,V9,V10), funs(sd(.))) # same as above...
Result should be 3 doubles, each with the sd of the group (which should be close to 1 if enough columns are added).
If you want a base R solution, try the following.
sp <- split(df[-1], cut(df$A, breaks=c(2.1)))
lapply(sp, function(x) var(unlist(x)))
#$`(0.998,2]`
#[1] 0.848707
#
#$`(2,3]`
#[1] 1.80633
I have coded it in two lines to make it clearer but you can avoid the creation of sp and write the one-liner
lapply(split(df[-1], cut(df$A, breaks=c(2.1))), function(x) var(unlist(x)))
Or, for a result in another form,
sapply(sp, function(x) var(unlist(x)))
#(0.998,2] (2,3]
# 0.848707 1.806330
DATA
set.seed(6322) # make the results reproducible
df <- data.frame(cbind(matrix(rnorm(100),10,10),c(1,2,1,1,2,2,3,3,3,1)))
colnames(df) <- c(paste0("V",seq(1,10)),"A")
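Two details of the code above are worth checking against your own data: df[-1] drops the first column (V1) rather than the grouping column A, and cut() bins A into two intervals instead of using its three values. If the intent is the standard deviation of all V columns per value of A, a variant along the following lines may be closer. It is a sketch, not the answerer's code, and value_cols and group_sd are names introduced here.

```r
set.seed(6322) # same seed as the DATA block above
df <- data.frame(cbind(matrix(rnorm(100), 10, 10), c(1, 2, 1, 1, 2, 2, 3, 3, 3, 1)))
colnames(df) <- c(paste0("V", seq(1, 10)), "A")

# All columns except the grouping column
value_cols <- setdiff(names(df), "A")

# Split the value columns by A, flatten each block, and take one sd per block
group_sd <- sapply(
  split(df[value_cols], df$A),
  function(block) sd(unlist(block))
)
```

unlist() plays the role of the Matlab (:) operator here: it turns each block of rows and columns into a single vector before sd() is applied.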

Replacing values from a column using a condition in R

I have a very basic R question but I am having a hard time trying to get the right answer. I have a data frame that looks like this:
species <- "ABC"
ind <- rep(1:4, each = 24)
hour <- rep(seq(0, 23, by = 1), 4)
depth <- runif(length(ind), 1, 50)
df <- data.frame(cbind(species, ind, hour, depth))
df$depth <- as.numeric(df$depth)
What I would like is to select AND replace the depth value with zero in all rows where depth < 10 (for example), while keeping all the other information in those rows and the original dimensions of the data frame.
I have tried the following, but it does not work.
df[df$depth<10] <- 0
Any suggestions?
# reassign depth values under 10 to zero
df$depth[df$depth<10] <- 0
For columns that are factors, you can only assign values that are existing factor levels. If you want to assign a value that isn't currently a factor level, you need to create the additional level first:
levels(df$species) <- c(levels(df$species), "unknown")
df$species[df$depth<10] <- "unknown"
I arrived here from a Google search, and since my other code is 'tidy', I'm leaving the 'tidy' way here for anyone else who may find it useful:
library(dplyr)
iris %>%
mutate(Species = ifelse(as.character(Species) == "virginica", "newValue", as.character(Species)))
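Applied to the question's own columns, the same idea can be sketched with if_else on depth. A single logical index such as df[df$depth<10] selects columns of a data frame rather than rows, which is why the original attempt fails; replacing values inside one column avoids that. The data frame below is rebuilt with data.frame() directly (instead of cbind) so that depth stays numeric, and the seed is an assumption added for reproducibility.

```r
library(dplyr)

set.seed(1) # assumed seed, not in the original question
df <- data.frame(
  species = "ABC",
  ind     = rep(1:4, each = 24),
  hour    = rep(seq(0, 23, by = 1), 4),
  depth   = runif(96, 1, 50)
)

# Replace depths below 10 with zero; all other columns and rows are kept
df_zeroed <- df %>%
  mutate(depth = if_else(depth < 10, 0, depth))
```

The result has the same dimensions as df, with only the depth column changed.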
