Simple pivot table type transformation in R statistics

I've been trying to learn R for a while but haven't got my knowledge up to even a decent level yet. I'll get there in the end, but I'm in a pinch at the moment and was wondering if you could help me with a quick "transformation" type piece.
I have a csv data file with 18 million rows with the following data fields: Person ID, Date and Value. It's basically from a simulation model and is simulating the contributions a person makes into their savings accounts, e.g.:
1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
So, as you can see, there can be multiple IDs in the data but each date and contribution amount for a person is unique.
I would like to transform this so I have a contribution history by year for each person. So for example the above would become:
ID,2013,2014
1,40.00,68.46
2,16.68,28.43
3,20.34,38.13
I have a rough idea how I could approach the problem: create another column of data with just the years and then summarise by ID and year to add up all contributions that fit into each ID/year bucket. I just have no clue how to even begin translating that into an R script.
Any pointers/guidance would be most appreciated.
Many Thanks and Kind Regards.

Here are a few possibilities:
zoo package
read.zoo in the zoo package can produce a multivariate time series, one column per series, i.e. one column per ID. We define yr to get the year from the index column and then split on the ID using the split= argument as we read the data in. We use aggregate = sum to aggregate over the remaining columns -- here just one. We use text = Lines to keep the code below self-contained, but with a real file we would replace that with "myfile", say. The final line transposes the result; we could drop that line if it were OK to have persons in columns instead of rows.
Lines <- "1,28/02/2013,19.49
2,13/03/2013,16.68
3,15/03/2013,20.34
2,10/01/2014,28.43
3,12/06/2014,38.13
1,29/08/2014,68.46
1,20/12/2013,20.51
"
library(zoo)
# given a Date string, x, output the year
yr <- function(x) floor(as.numeric(as.yearmon(x, "%d/%m/%Y")))
# read in data, reshape & aggregate
z <- read.zoo(text = Lines, sep = ",", index = 2, FUN = yr,
aggregate = sum, split = 1)
# transpose (optional)
tz <- data.frame(ID = colnames(z), t(z), check.names = FALSE)
With the posted data we get the following result:
> tz
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
See ?read.zoo and also the zoo-read vignette.
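As noted above, reading from an actual file just swaps the text = Lines argument for a file name; a sketch (the file name "myfile.csv" is hypothetical):
# same call as above, but against a real CSV file
z <- read.zoo("myfile.csv", sep = ",", index = 2, FUN = yr,
              aggregate = sum, split = 1)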
reshape2 package
Here is a second solution using the reshape2 package:
library(reshape2)
# read in and fix up column names and Year
DF <- read.table(text = Lines, sep = ",") ##
colnames(DF) <- c("ID", "Year", "Value") ##
DF$Year <- sub(".*/", "", DF$Year) ##
dcast(DF, ID ~ Year, fun.aggregate = sum, value.var = "Value")
The result is:
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
reshape function
Here is a solution that does not use any add-on packages. First read in the data using the three lines marked ## in the last solution. This will produce DF. Then aggregate the data, reshape it from long to wide form and finally fix up the column names:
Ag <- aggregate(Value ~., DF, sum)
res <- reshape(Ag, direction = "wide", idvar = "ID", timevar = "Year")
colnames(res) <- sub("Value.", "", colnames(res))
which produces this:
> res
ID 2013 2014
1 1 40.00 68.46
2 2 16.68 28.43
3 3 20.34 38.13
tapply function
This solution does not use add-on packages either. Using Ag from the last solution, try this:
tapply(Ag$Value, Ag[1:2], sum)
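tapply returns a matrix with IDs as row names (and NA for any ID/year combination that never occurs). If a data frame like the earlier results is wanted, a small conversion along these lines should work (a sketch; res2 is a made-up name):
tab <- tapply(Ag$Value, Ag[1:2], sum)
# turn the row names back into an ID column
res2 <- data.frame(ID = rownames(tab), tab, check.names = FALSE)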

The approach you describe is a sound one. Converting between date strings and Date objects can be done using strptime and strftime (or possibly as.POSIXct). Once you have the year column, you can use a number of tools available in R, e.g. data.table, by, or ddply. I like the syntax of the last one:
library(plyr)
ddply(df, .(ID, year), summarise, total_per_year = sum(value))
This assumes that your data is in df, and that the columns in your data are called year, ID and value. Do note that for large datasets ddply can become quite slow. If you really need raw performance, you definitely want to start working with data.table.
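For completeness, a minimal data.table sketch of the same aggregation; fread and the file name "myfile.csv" are assumptions, not from the question:
library(data.table)
# fast read of the 18-million-row file; the column names are made up for this sketch
dt <- fread("myfile.csv", header = FALSE, col.names = c("ID", "Date", "Value"))
# extract the year from the dd/mm/yyyy string
dt[, Year := as.integer(sub(".*/", "", Date))]
# total contribution per person per year
dt[, .(total_per_year = sum(Value)), by = .(ID, Year)]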

Related

R calculating time differences in a (layered) long dataset

I've been struggling with a bit of timestamp data (haven't had to work with dates much until now, and it shows). Hope you can help out.
I'm working with data from a website showing for each customer (ID) their respective visits and the timestamp for those visits. It's grouped in the sense that one customer might have multiple visits/timestamps.
The df is structured as follows, in a long format:
df <- data.frame("Customer" = c(1, 1, 1, 2, 3, 3),
"Visit" =c(1, 2, 3, 1, 1, 2), # e.g. customer ID #1 has visited the site three times.
"Timestamp" = c("2019-12-31 12:13:25", "2019-12-31 16:13:25", "2020-01-05 10:13:25", "2019-11-12 15:18:42", "2019-11-13 19:22:35", "2019-12-10 19:43:55"))
Note: In the real dataset the timestamp isn't a factor but some other haggard character-type abomination which I should probably first try to convert into a POSIXct format somehow.
What I would like to do here is to create a df that displays per customer their average time between visits (let's say in minutes, or hours). Visitors with only a single visit (e.g., second customer in my example) could be filtered out in advance or should display a 0. My final goal is to visualize that distribution, and possibly calculate a grand mean across all customers.
Because the number of visits can vary drastically (e.g. one or 256 visits) I can't just use a 'wide' version of the dataset where a fixed number of visits are the columns which I could then subtract and average.
I'm at a bit of a loss how to best approach this type of problem, thanks a bunch!
Using dplyr:
library(dplyr)
df %>%
  mutate(Timestamp = as.POSIXct(Timestamp)) %>%  # character timestamps must be converted first
  arrange(Customer, Timestamp) %>%
  group_by(Customer) %>%
  mutate(Difference = Timestamp - lag(Timestamp)) %>%
  summarise(Avg_Interval = mean(Difference, na.rm = TRUE))
Due to the grouping, the first value of Difference for any customer is NA (including customers with only one visit), so those are dropped by na.rm = TRUE when taking the mean.
Using base R (no extra packages):
1. Sort the data, ordering by customer ID, then by timestamp.
2. Calculate the time difference between consecutive rows (using the diff() function), grouping by customer ID (tapply() does the grouping).
3. Find the averages.
4. Squish that into a data.frame.
# 1 sort the data
df$Timestamp <- as.POSIXct(df$Timestamp)
# not debugged
df <- df[order(df$Customer, df$Timestamp),]
# 2 apply a diff.
# if you want to force the time units to seconds, convert
# the timestamp to numeric first.
# without conversion
diffs <- tapply(df$Timestamp, df$Customer, diff)
# ======OR======
# convert to seconds
diffs <- tapply(as.numeric(df$Timestamp), df$Customer, diff)
# 3 find the averages
diffs.mean <- lapply(diffs, mean)
# 4 squish that into a data.frame
diffs.df <- data.frame(do.call(rbind, diffs.mean))
diffs.df$Customer <- names(diffs.mean)
# 4a tidy up the data.frame names
names(diffs.df)[1] <- "Avg_Interval"
diffs.df
You haven't shown your timestamp strings, but when you need to wrangle them, the lubridate package is your friend.
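For what it's worth, a hedged sketch of how lubridate could handle messy strings; the orders vector here is a guess at formats the real data might contain:
library(lubridate)
# try several candidate formats in turn; anything unparseable becomes NA with a warning
df$Timestamp <- parse_date_time(df$Timestamp,
                                orders = c("Ymd HMS", "dmy HMS", "mdy HMS"))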

Aggregating data on a monthly basis

I have the following data in R:
date category1 category2 category3 category4
1 2012-04-01 7496.00 77288.37 224099.15 700050.04
2 2012-04-02 24541.00 59103.94 138408.65 625006.84
3 2012-04-03 1249.00 15951.50 574170.30 249390.53
4 2012-04-04 5205.00 10866.00 0.00 358703.88
5 2012-04-05 10398.00 0.00 119745.17 270585.46
And use following script to aggregate data on monthly basis:
data <- as.xts(data$category1,order.by=as.Date(data$date))
monthly <- apply.monthly(data,sum)
monthly
Question: Instead of repeating this step for every category and then joining each monthly dataframe, how can I apply as.xts(...) to all columns? I tried
as.xts(c("data$category1","data$category1"),order.by=as.Date(data$date))
which did not work.
Also: Is there a better way to aggregate on a monthly basis?
Use xts instead of as.xts.
apply.monthly(xts(data[-1], order.by = as.Date(data$date)), mean)
However, this seems to only work for mean, not for sum. You can always use sapply to iterate through the columns
sapply(colnames(data[, -1]), function(x)
  apply.monthly(as.xts(data[, x], order.by = as.Date(data$date)), sum))
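Another option that keeps everything in one multivariate xts object is a column-wise aggregator; colSums sums each category separately within each month (a sketch, untested against this exact data):
library(xts)
# one xts holding all categories; colSums is applied per month, per column
monthly <- apply.monthly(xts(data[-1], order.by = as.Date(data$date)), colSums)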
You can use the daily2monthly function in the hydroTSM package. It can handle more than just xts objects as arguments, including multiple columns. FUN can be sum or mean.
monthly <- daily2monthly(data, FUN=sum, na.rm=TRUE)

How to calculate the difference between dates in R for each unique id

I am new to R and have the following data of user name and their usage date for a product (truncated output):
Name, Date
Jane, 01-24-2016 10:02:00
Mary, 01-01-2016 12:18:00
Mary, 01-01-2016 13:18:00
Mary, 01-02-2016 13:18:00
Jane, 01-23-2016 10:02:00
I would like to do some analysis on difference between Date, in particular the number of days between usage for each user. I'd like to plot a histogram to determine if there is a pattern among users.
How do I compute the difference between dates for each user in R?
Are there any other visualizations besides histograms I should explore?
Thanks
Try this, assuming your data frame is df:
## in case you have different column names
colnames(df) <- c("Name", "Date")
## you might also have Date as factors when reading in data
## the following ensures it is character string
df$Date <- as.character(df$Date)
## convert to Date object
## see ?strptime for various available format
## see ?as.Date for Date object
df$Date <- as.Date(df$Date, format = "%m-%d-%Y %H:%M:%S")
## reorder, so that date are ascending (see Jane)
## this is necessary, otherwise negative number occur after differencing
## see ?order on ordering
df <- df[order(df$Name, df$Date), ]
## take day lags per person
## see ?diff for taking difference
## see ?tapply for applying FUN on grouped data
## as.integer() makes output clean
## if unsure, compare with: lags <- with(df, tapply(Date, Name, FUN = diff))
lags <- with(df, tapply(Date, Name, FUN = function (x) as.integer(diff(x))))
For your truncated data (with 5 rows), I get:
> lags
$Jane
[1] 1
$Mary
[1] 0 1
lags is a list. If you want to get Jane's information, use lags$Jane. To get a histogram, use hist(lags$Jane). Furthermore, if you want to produce a single histogram for all clients, ignoring individual differences, use hist(unlist(lags)): unlist() collapses a list into a single vector.
Comments:
Regarding your request for good references on R, see CRAN's An Introduction to R, and Advanced R.
Using tapply with multiple indices? You can try the trick of using paste to first construct an auxiliary index; see the sketch below.
Er, it looks like I made things more complicated than necessary by using density and the central limit theorem, etc., for visualization, so I removed my other answer.
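A sketch of that paste trick, with a hypothetical second index column Group:
# paste the two indices into one auxiliary key, then tapply as before
key <- paste(df$Name, df$Group, sep = "_")  # Group is made up for this sketch
lags <- tapply(df$Date, key, FUN = function(x) as.integer(diff(x)))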
We can use data.table with lubridate
library(lubridate)
library(data.table)
setDT(df1)[order(mdy_hms(Date)), .(Diff=as.integer(diff(as.Date(mdy_hms(Date))))), Name]
# Name Diff
#1: Mary 0
#2: Mary 1
#3: Jane 1
If there are several grouping variables, e.g. "ID", we can place them in the by:
setDT(df1)[order(mdy_hms(Date)), .(Diff=as.integer(diff(as.Date(mdy_hms(Date))))),
by = .(Name, ID)]

List of dataframes: filter rows based on date

I am currently working with a list of dataframes.
Actually, I have about a hundred csv files representing forecasts of some kind, where the date on which the forecast was made is in the first line; the lines thereafter contain the predicted values. The data might look like this:
2010/04/15 10:12:51 #Date of the forecast
2010/05/02 2372 #Date for which the forecast was made and the value assigned
2010/05/09 2298
2009/04/15 10:09:13 #another forecast
....
2010/05/02 2298 #also predicts for 2010/05/02
As you might guess, the forecasts do predict values quite some time ahead (e.g. 5 years), which means predictions for the date 2010/05/02 were not only made on 2010/04/15 but also 2009/04/15 and so on (actually, forecasts are done weekly).
I would like to compare how the predicted value for a specified date (for example 2010/05/02) has changed over time.
Right now, I read in all the .csv files as dataframes and save each of the resulting dataframes in a list.
(Sadly, the date on which the prediction was made got lost. I had hoped to be able to name the list elements with the respective date but have not yet figured out how to do this; still, I am pretty sure I'll find something somewhere, so that is not the main problem here.)
That's where the question title comes in: I would like to know how to filter a list of dataframes by row value.
So, I'd like to be able to call something like function(2010/05/02) and get as a result the rows of each element of the list (each dataframe in the list) where Date is 2010/05/02.
In this case I'd like to get:
2010/05/02 2372
2010/05/02 2298
I know how to do this using a for loop, but it takes far too long.
I am happy for any suggestions.
(This example might show why it is important to know when the prediction was made, which I currently don't have. I was thinking about adding a new column containing the date on which the prediction was made to each dataframe.)
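For what it's worth, a sketch of one way to keep that date when reading the files, assuming the first line of each file holds it (the read.csv arguments are guesses about the real format):
files <- list.files(pattern = "\\.csv$")
# the first line of each file carries the forecast date; read it separately
madeOn <- vapply(files, function(f) readLines(f, n = 1), character(1))
# skip that first line when reading the forecast rows themselves
l <- lapply(files, read.csv, skip = 1, header = FALSE)
names(l) <- madeOn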
Threads visited until now include:
get column from list of dataframes R
convert a row of a data frame to a simple vector in R
How to get the name of a data.frame within a list? (which more or less adresses the name problem)
As you can see, no thread was particularly helpful.
As requested, a small reproducible example:
dateList <- as.Date(seq(0,100,5),origin="2010-01-01")
forecasts <- seq(2000,3000,50)
df1 <- data.frame(dateList,forecasts)
df2 <- data.frame(dateList-50,forecasts)
l <- list(df1,df2)
We have dates from 2010-01-01 in 5-day steps. I would, for example, like to know the predicted values for 2010-01-01 in both dataframes.
The first dataframe looks like this:
dateList forecasts
1 2010-01-01 2000
2 2010-01-06 2050
3 2010-01-11 2100
while the second looks like this:
10 2009-12-27 2450
11 2010-01-01 2500
12 2010-01-06 2550
I was hoping to find out for example the predicted values for 2010-01-01.
So, for example:
function(2010-01-01):
2000
2500
Couldn't wait for your example so I made a small one. Let me know if this is in the general direction of what you're after.
xy <- list(df1 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3)),
df2 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3)),
df3 = data.frame(dates = as.Date(c("2016-01-01", "2016-01-02", "2016-01-03")), value = runif(3))
)
getValueOnDate <- function(x, list.all) {
lapply(list.all, FUN = function(m) m[m$dates %in% x, ])
}
out <- getValueOnDate(as.Date("2016-01-02"), list.all = xy)
do.call("rbind", out)
dates value
df1 2016-01-02 0.7665590
df2 2016-01-02 0.9907976
df3 2016-01-02 0.4909025
You can obviously modify the function to return just the values.
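For instance, a variant that returns just the predicted values as a plain vector (same assumptions as above; getValuesOnDate is a made-up name):
getValuesOnDate <- function(x, list.all) {
  # keep only the value column of each matching row, flattened to one vector
  unlist(lapply(list.all, function(m) m$value[m$dates %in% x]))
}
getValuesOnDate(as.Date("2016-01-02"), list.all = xy)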
You could alternatively use the following approach, given your list is called ls (note that this shadows the base function ls, so another name is safer) and the date column is called date in all data.frames:
my.ls <- lapply(ls, subset, date == "2010/05/02")
df <- do.call("rbind", my.ls)

Grouping by date and treatment in R

I have a time series that looks at how caffeine impacts test scores. On each day, the first test is used to measure a baseline score for the day, and the second score is the effect of a treatment.
Post Caffeine Score Time/Date
yes 10 3/17/2014 17:58:28
no 9 3/17/2014 23:55:47
no 7 3/18/2014 18:50:50
no 10 3/18/2014 23:09:03
Some days have a caffeine treatment, others not. Here's the question: how do I group observations by day and create a measure of impact by taking the difference between the day's two scores?
I'm going to be using these groupings for later graphs and analysis, so I think it's most efficient if there's a way to create objects that capture the improvement in score each day and group by whether caffeine (treatment) was used.
Thank you for your help!
First make a column for the day:
# parse the m/d/Y timestamp first, then keep just the day
df$day = strftime(strptime(df$'Time/Date', "%m/%d/%Y %H:%M:%S"), format="%Y-%m-%d")
then I think what you're after is two aggregates:
1) To find if the day had caffeine
dayCaf = aggregate(df$Caffeine~df$day, FUN=function(x) ifelse(length(which(grepl("yes",x)))>0,1,0))
2) To calculate the difference in scores
dayDiff = aggregate(df$Score~df$day, FUN=function(x) x[2]-x[1])
Now put the two together
out = merge(dayCaf, dayDiff, by='df$day')
That gives:
df$day df$Caffeine df$Score
1 2014-03-17 1 -1
2 2014-03-18 0 3
The whole code is:
df$day = strftime(strptime(df$'Time/Date', "%m/%d/%Y %H:%M:%S"), format="%Y-%m-%d")
dayCaf = aggregate(df$Caffeine~df$day, FUN=function(x) ifelse(length(which(grepl("yes",x)))>0,1,0))
dayDiff = aggregate(df$Score~df$day, FUN=function(x) x[2]-x[1])
out = merge(dayCaf, dayDiff, by='df$day')
Just replace "df" with the name of your frame and it should work.
Alternatively:
DF <- data.frame(Post.Caffeine = c("Yes","No","No","No"),Score=c(10,9,7,10),Time.Date=c("3/17/2014 17:58:28","3/17/2014 23:55:47","3/18/2014 18:50:50", "3/18/2014 23:09:03"))
DF$Time.Date <- as.Date(DF$Time.Date,format="%m/%d/%Y")
DF2 <- setNames(aggregate(Score~Time.Date,DF,diff),c("Date","Diff"))
DF2$PC <- DF2$Date %in% DF$Time.Date[DF$Post.Caffeine=="Yes"]
DF2
Note: this assumes that your data is in the order that you demonstrate.
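If the rows might not already be in time order, sorting by the full timestamp before the as.Date conversion above removes that assumption; a sketch:
# run this on the raw strings, before Time.Date is collapsed to a Date
DF <- DF[order(as.POSIXct(DF$Time.Date, format = "%m/%d/%Y %H:%M:%S")), ]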
data.table solution
The order part sorts your data first (if it is already sorted, you can remove the order call; just leave the comma in place). The advantage of this approach is that you do the whole process in one line, and it will be fast too.
library(data.table)
setDT(temp)[order(as.POSIXct(strptime(`Time/Date`, "%m/%d/%Y %H:%M:%S"))),
            list(HadCaffeine = if (any(PostCaffeine == "yes")) "yes" else "no",
                 Score = diff(Score)),
            by = as.Date(strptime(`Time/Date`, "%m/%d/%Y"))]
## as.Date HadCaffeine Score
## 1: 2014-03-17 yes -1
## 2: 2014-03-18 no 3
This solution assumes temp as your data set and PostCaffeine instead of Post Caffeine as the variable name (it is bad practice in R to put spaces or / into variable names, as it limits your ability to work with them).
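As a practical aside, renaming the offending column once up front keeps the rest of the code clean; a sketch with data.table's setnames, assuming the original name really is "Post Caffeine":
library(data.table)
setDT(temp)
# rename in place, once, instead of backtick-quoting everywhere
setnames(temp, "Post Caffeine", "PostCaffeine")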
