Creating averaged time-bins from an existing dataframe - r

I have the following dataframe called 'EasyScaled';
str(EasyScaled)
'data.frame': 675045 obs. of 3 variables:
$ Trial : chr "1_easy.wav" "1_easy.wav" "1_easy.wav" "1_easy.wav" ...
$ TrialTime : num 3000 3001 3002 3003 3004 ...
$ PupilBaseCorrect: num 0.784 0.781 0.78 0.778 0.777 ...
The 'TrialTime' numeric variable denotes the time of each data point (3000 = 3000ms, 3001 = 3001 ms, etc.), 'PupilBaseCorrect' is my dependent variable, and the 'Trial' variable refers to the experimental trial.
I would like to create a new object which firstly divides my data into 3 time-bins (TimeBin1 = 3000-8000ms, TimeBin2 = 8001-13000ms, TimeBin3 = 13001 - 18000ms) and then calculate an average value for each timebin (for each trial) so that I would end up with something that looks like this (with the value given reflecting 'PupilBaseCorrect');
Trial TimeBin1 TimeBin2 TimeBin3
1_easy 0.784 0.876 0.767
34_easy 0.781 0.872 0.765
35_easy 0.78 0.871 0.762
...etc ...etc ...etc ....etc
I have tried using cut(), ddply() and some of the suggestions on this blog http://lamages.blogspot.co.uk/2012/01/say-it-in-r-with-by-apply-and-friends.html but haven't been able to find the correct code. I also tried this;
EasyTimeBin <- aggregate(PupilBaseCorrect ~ Trial + TrialTime[3000:8000, 8001:1300,1301:1800], data=EasyScaled, mean)
But got the following error;
Error in TrialTime[3000:8000, 8001:1300, 1301:1800] :
incorrect number of dimensions
Any suggestions or advice would be much appreciated.

Using cut and ddply is the right idea, but here's some quick-and-dirty vanilla R that will do what you need.
# Generate example data
EasyScaled <- data.frame(
Trial = paste0(c(sapply(1:3, function(x) rep(x, 9))), "_easy.wav"),
TrialTime = c(sapply(seq_len(9)-1, function(x) (floor(x/3))*5000 + x%%3 + 3000)),
PupilBaseCorrect = rnorm(27, 0.78, 0.1)
)
# group means of PupilBaseCorrect by Trial + time bin
tmp <- tapply(EasyScaled$PupilBaseCorrect,
paste0(EasyScaled$Trial, ',',
as.integer((EasyScaled$TrialTime - 3000)/5000)+1), mean)
# melt & recast the array manually into a dataframe
EasyTimeBin <- do.call(data.frame,
append(list(row.names = NULL,
Trial = gsub('.wav,.*','',names(tmp)[3*seq_len(length(tmp)/3)])),
structure(lapply(seq_len(3),
function(x) tmp[3*(seq_len(length(tmp)/3)-1) + x]
), .Names = paste0("TimeBin", seq_len(3))
)
)
)
print(EasyTimeBin)
# Trial TimeBin1 TimeBin2 TimeBin3
# 1 1_easy 0.7471973 0.7850524 0.8939581
# 2 2_easy 0.8096973 0.8390587 0.7757359
# 3 3_easy 0.8151430 0.7855042 0.8081268
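Note that the integer-division binning above puts a time of exactly 8000 ms into bin 2, whereas the question asks for 3000-8000 ms to form bin 1. A minimal sketch of the same aggregation with cut() and explicit break points follows; the toy values below are made up for illustration and are not from the original data.

```r
# Hypothetical toy data: two trials, time points on both sides of each bin edge
EasyScaled <- data.frame(
  Trial = rep(c("1_easy.wav", "2_easy.wav"), each = 6),
  TrialTime = rep(c(3000, 8000, 8001, 13000, 13001, 18000), times = 2),
  PupilBaseCorrect = c(1, 3, 2, 4, 5, 7, 10, 30, 20, 40, 50, 70)
)

# Bins: [3000, 8000], (8000, 13000], (13000, 18000]
EasyScaled$TimeBin <- cut(EasyScaled$TrialTime,
                          breaks = c(3000, 8000, 13000, 18000),
                          labels = paste0("TimeBin", 1:3),
                          include.lowest = TRUE)

# Mean per Trial x TimeBin as a matrix (rows = trials, columns = bins)
wide <- tapply(EasyScaled$PupilBaseCorrect,
               list(EasyScaled$Trial, EasyScaled$TimeBin), mean)

# Turn the matrix into the requested wide data frame
EasyTimeBin <- data.frame(Trial = sub("\\.wav$", "", rownames(wide)),
                          wide, row.names = NULL)
print(EasyTimeBin)
```

With include.lowest = TRUE the first bin is closed on both ends, so 3000 and 8000 both land in TimeBin1.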


looping with iterations over two lists of variables for a multiple regression in R

I want to write a loop in R to run multiple regressions with one dependent variable and two lists of independent variables (all continuous variables). The model is additive, and the loop should iterate through the two lists of variables in parallel, so that it takes the first column from the first list + the first column from the second list, then the second column from each list, and so on. The problem is that I can't get it to iterate through the lists properly; instead, my loop runs more models than it should.
The dataframe I am describing here is just a subset. I will actually have to run this 3772 times (I am working on RNA-seq transcript expression).
My dataframe is called dry, and contains 22 variables (columns) and 87 observations (rows). Column 1 contains genotype IDs, columns 2:11 contain one set of independent variables to iterate through, columns 12:21 contain a second set of independent variables to iterate through, and column 22 contains my dependent variable, called FITNESS_DRY. This is what the structure looks like:
str(dry)
'data.frame': 87 obs. of 22 variables:
$ geneID : Factor w/ 87 levels "e10","e101","e102",..: 12 15 17 24 25 30 35 36 38 39 ...
$ RDPI_T1 : num 1.671 -0.983 -0.776 -0.345 0.313 ...
$ RDPI_T2 : num -0.976 -0.774 -0.532 -1.137 1.602 ...
$ RDPI_T3 : num -0.197 -0.324 0.805 -0.701 -0.566 ...
$ RDPI_T4 : num 0.289 -0.92 1.117 -1.214 -0.447 ...
$ RDPI_T5 : num -0.671 1.963 NA -1.024 -0.295 ...
$ RDPI_T6 : num 2.606 -1.116 -0.383 -0.893 0.119 ...
$ RDPI_T7 : num -0.843 -0.229 -0.297 0.504 -0.712 ...
$ RDPI_T8 : num -0.227 NA NA -0.816 -0.761 ...
$ RDPI_T9 : num 0.754 -1.304 1.867 -0.514 -1.377 ...
$ RDPI_T10 : num 1.1352 -0.1028 -0.69 2.0242 -0.0925 ...
$ DRY_T1 : num 0.6636 -0.64508 -0.24643 -1.43231 -0.00855 ...
$ DRY_T2 : num 1.008 0.823 -0.658 -0.148 0.272 ...
$ DRY_T3 : num -0.518 -0.357 1.294 0.408 0.771 ...
$ DRY_T4 : num 0.0723 0.2834 0.5198 1.6527 0.4259 ...
$ DRY_T5 : num 0.1831 1.9984 NA 0.0923 0.1232 ...
$ DRY_T6 : num -1.55 0.366 0.692 0.902 -0.993 ...
$ DRY_T7 : num -2.483 -0.334 -1.077 -1.537 0.393 ...
$ DRY_T8 : num 0.396 NA NA -0.146 -0.468 ...
$ DRY_T9 : num -0.694 0.353 2.384 0.665 0.937 ...
$ DRY_T10 : num -1.24 -1.57 -1.36 -3.88 -1.4 ...
$ FITNESS_DRY: num 1.301 3.365 0.458 0.346 1.983 ...
The goal is to run 10 multiple regressions looking like this:
lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
and so forth iterating through all ten columns for both lists
In terms of column indexing, this is equivalent to the following:
lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
lm2<-lm(FITNESS_DRY~dry[,13]+dry[,3])
etc.
My loop should then calculate summaries for each model, and combine all the pvalues (4th column of the lm summary) in an output object.
I first defined my variable lists
var_list<-list(
var1=dry[,12:21],
var2=dry[,2:11]
)
This is the loop I tried which doesn't work properly:
lm.test1<-name<-vector()
for (i in 12:length(var_list$var1)){
for (j in 2:length(var_list$var2)){
lm.tmp<-lm(FITNESS_DRY~dry[,i]+dry[,j], na.action=na.omit, data=dry)
sum.tmp<-summary(lm.tmp)
lm.test1<-rbind(lm.test1,sum.tmp$coefficients[,4]) }
}
The loop returns this error message:
Warning message:
In rbind(lm.test6, sum.tmp$coefficients[, 4]) :
number of columns of result is not a multiple of vector length (arg 2)
I can call up the object "lm.test1", but that object has 27 rows instead of the 10 I want, so the iterations are not working properly here. Can anyone help with this please? Also, it would be great if I could add the names of my columns for each list of variables into the summary. I have tried using this for each variable list, but without success:
name<-append(name, as.character(colnames(var_list$var1)))
Any ideas? Thanks in advance for any help!
UPDATE1: More information on the full data set:
My actual data will still contain a first column "geneID", then 3772 columns named DRY_T1...DRY_T3772, then another 3772 columns named RDPI_T1...RDPI_T3772, and finally my dependent variable "FITNESS_DRY". I still want to run all additive models as such:
lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
lm3772<-lm(FITNESS_DRY~DRY_T3772+RDPI_T3772)
I simulated a dataset as such:
set.seed(2)
dat3 = as.data.frame(replicate(7544, runif(20)))
names(dat3) = paste0(rep(c("DRY_T","RDPI_T"),each=3772), 1:3772)
dat3 = cbind(dat3, FITNESS_DRY=runif(20))
I then run the for loop:
models = list()
for(i in 1:3772) {
  vars = names(dat3)[grepl(paste0(i,"$"), names(dat3))]
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dat3)
}
This works fine on the data simulation, but when I try it on my real dataset that is set up exactly in the same way it doesn't work. The loop is probably having issues handling numbers with two or more digits. I get this error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
UPDATE 2: Indeed, the loop had issues handling numbers with two or more digits. To see how things go wrong in the original version, I ran this (my dataset is called "dry2"):
names(dry2)[grepl("2$", names(dry2))]
This returned all DRY_T and RDPI_T variables whose number ends in "2" (2, 12, 22, ...) instead of just the one pair DRY_T2 and RDPI_T2.
To fix the issue this new code works:
models = list()
for(i in 1:3772) {
vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
data = dry2)
}
There are a number of ways to set up the model formulas for iteration. Here's one approach, which we demonstrate using a for loop or map from the purrr package for the iteration. Then we use tidy from the broom package to get the coefficients and p-values.
library(tidyverse)
library(broom)
# Fake data
set.seed(2)
dat = as.data.frame(replicate(20, runif(20)))
names(dat) = paste0(rep(c("DRY_T","RDPI_T"),each=10), 0:9)
dat = cbind(dat, FITNESS_DRY=runif(20))
# Generate list of models
# Using for loop
models = list()
for(i in 0:9) {
# Get the two column names to use for this iteration of the model
vars = names(dat)[grepl(paste0(i,"$"), names(dat))]
# Fit the model and add results to the output list
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
data = dat)
}
# Same idea using purrr::map to iterate
models = map(0:9 %>% set_names(),
~ {
vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
lm(form, data = dat)
})
# Check first two models
models[1:2]
#> $`0`
#>
#> Call:
#> lm(formula = form, data = dat)
#>
#> Coefficients:
#> (Intercept) DRY_T0 RDPI_T0
#> 0.4543 0.3025 -0.1624
#>
#>
#> $`1`
#>
#> Call:
#> lm(formula = form, data = dat)
#>
#> Coefficients:
#> (Intercept) DRY_T1 RDPI_T1
#> 0.64511 -0.33293 0.06698
# Get coefficients and p-values for each model in a single data frame
results = map_df(models, tidy, .id="run_number")
results
#> # A tibble: 30 x 6
#> run_number term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 0 (Intercept) 0.454 0.153 2.96 0.00872
#> 2 0 DRY_T0 0.303 0.197 1.53 0.143
#> 3 0 RDPI_T0 -0.162 0.186 -0.873 0.395
#> 4 1 (Intercept) 0.645 0.185 3.49 0.00279
#> 5 1 DRY_T1 -0.333 0.204 -1.63 0.122
#> 6 1 RDPI_T1 0.0670 0.236 0.284 0.780
#> 7 2 (Intercept) 0.290 0.147 1.97 0.0650
#> 8 2 DRY_T2 0.270 0.176 1.53 0.144
#> 9 2 RDPI_T2 0.180 0.185 0.972 0.345
#> 10 3 (Intercept) 0.273 0.187 1.46 0.162
#> # … with 20 more rows
Created on 2019-06-28 by the reprex package (v0.2.1)
If you don't need to save the model objects, you can just return the data frame of coefficients and p-values:
results = map_df(0:9 %>% set_names(),
~ {
vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
tidy(lm(form, data = dat))
}, .id="run_number")
UPDATE: In answer to your comment, if you replace all instances of 0:9 with 1:10 (sorry, didn't notice that your column suffixes went from 1:10 rather than 0:9), and all instances of dat (my fake data) with dry2 (or whatever name you're using for your data frame), the code will run with your data, so long as the column names are the same as the ones you used in your question. If you're using different column names, you'll need to adapt the code, either by hard-coding the new names or by creating a function that can accept whatever column names you're using for the model(s) you're generating.
To explain what the code is doing: First, we need to get the names of the columns we want to use in each iteration of the model. For example, in the for-loop version:
vars = names(dry2)[grepl(paste0(i,"$"), names(dry2))]
When, for example, i=2, this resolves to:
vars = names(dry2)[grepl("2$", names(dry2))]
vars
[1] "RDPI_T2" "DRY_T2"
So those are the two columns we want to use to generate a regression formula. "2$" is a regular expression (regular expressions are a string-matching language) that means: match values in names(dry2) that end with the character '2'.
To create our formula we do:
paste(vars, collapse=" + ")
[1] "RDPI_T2 + DRY_T2"
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
form
[1] "FITNESS_DRY ~ RDPI_T2 + DRY_T2"
And now we have our regression formula which we use inside lm.
Each iteration (with for, map, or, as @RomanLuštrik suggested, mapply) generates the next model in the sequence.
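The question also asks for the p-values of all models to be combined into one output object with readable names. A minimal base R sketch of that step is below, assuming toy data with three variable pairs (the data, the object name pvals and the column labels are illustrative choices, not from the question). It builds each formula with reformulate() instead of paste().

```r
set.seed(1)
# Hypothetical toy data: 3 DRY_T columns, 3 RDPI_T columns, 1 response
dat <- as.data.frame(replicate(6, runif(20)))
names(dat) <- paste0(rep(c("DRY_T", "RDPI_T"), each = 3), 1:3)
dat$FITNESS_DRY <- runif(20)

# One row of p-values per model: intercept, DRY_Ti, RDPI_Ti
pvals <- t(sapply(1:3, function(i) {
  vars <- paste0(c("DRY_T", "RDPI_T"), i)          # exact column names for pair i
  fit <- lm(reformulate(vars, response = "FITNESS_DRY"), data = dat)
  p <- summary(fit)$coefficients[, 4]              # 4th column holds Pr(>|t|)
  names(p) <- c("Intercept", "DRY", "RDPI")
  p
}))
rownames(pvals) <- paste0("T", 1:3)
pvals
```

Because each model contributes exactly three p-values, the result is a clean matrix with one row per variable pair, which avoids the rbind warning from the nested loop in the question.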
UPDATE 2: As I noted in the comment, I realized that the regular expression paste0(i, "$") will fail (by matching more than one of each type of independent variable column) when the column numbers have more than one digit. So, try this instead (and similarly for the map version):
models = list()
for(i in 1:3772) {
# Get the two column names to use for this iteration of the model
vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
# Fit the model and add results to the output list
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
data = dry2)
}
To see how things go wrong in the original version, run, for example, names(dry2)[grepl("2$", names(dry2))]
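The difference between the two selection strategies can be checked on a small vector of column names. This is only an illustration with made-up names (12 pairs instead of 3772); the third variant, a regular expression anchored on the "_T" prefix, is an alternative not used in the answer above.

```r
# Hypothetical column names: DRY_T1..DRY_T12, RDPI_T1..RDPI_T12, plus the response
nms <- c(paste0("DRY_T", 1:12), paste0("RDPI_T", 1:12), "FITNESS_DRY")

# Loose pattern: anything ending in "2", so T2 and T12 both match
loose <- nms[grepl("2$", nms)]

# Exact matching against the two expected names
exact <- nms[nms %in% paste0(c("DRY_T", "RDPI_T"), 2)]

# Regular expression that also pins down what comes before the number
anchored <- nms[grepl("_T2$", nms)]

loose     # four names: the T2 and T12 columns of both sets
exact     # two names: DRY_T2 and RDPI_T2
anchored  # same two names as exact
```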
Consider reshaping your very wide data frame to long format with reshape. Long format is usually the preferred data format of practically any data science application.
For your needs, this requires two reshapes, one for each _T metric. After reshaping, create a T_NUM indicator (i.e., the number stripped from DRY_T## and RDPI_T##) and use it, along with the corresponding FITNESS_DRY, to merge the two metrics.
Finally, use by to slice your large data frame by T_NUM groupings to build a list of models. The code below uses the dat3 you simulated above. Altogether, all with base R: reshape -> T_NUM <- ... -> merge -> by -> lm. The other functions (lapply, within, and Reduce) are helpers for DRY-er code.
# TWO DATA FRAMES OF FOUR COLUMNS
df_list <- lapply(c("DRY_T", "RDPI_T"), function(i)
within(reshape(dat3[c(grep(i, names(dat3)), ncol(dat3))],
varying = list(names(dat3)[grep(i, names(dat3))]),
v.names = i,
times = names(dat3)[grep(i, names(dat3))],
timevar = "T_NUM",
direction = "long"), {
T_NUM <- as.integer(gsub(i, "", as.character(T_NUM)))
id <- NULL
})
)
# MERGE BOTH DFs
long_df <- Reduce(function(x, y) merge(x, y, by=c("T_NUM", "FITNESS_DRY")), df_list)
head(long_df, 10)
# T_NUM FITNESS_DRY DRY_T RDPI_T
# 1 1 0.1528837 0.9438393 0.87948274
# 2 1 0.1925344 0.7023740 0.65120186
# 3 1 0.2193480 0.2388948 0.29875871
# 4 1 0.2743660 0.1291590 0.60097630
# 5 1 0.2877732 0.9763985 0.66921847
# 6 1 0.3082835 0.7605133 0.22456361
# 7 1 0.5196165 0.1848823 0.79543965
# 8 1 0.5603618 0.1680519 0.08759412
# 9 1 0.5789254 0.8535485 0.37942053
# 10 1 0.6291315 0.5526741 0.43043940
# NAMED LIST OF 3,772 MODELS
model_list <- by(long_df, long_df$T_NUM, function(sub)
lm(FITNESS_DRY ~ DRY_T + RDPI_T, sub))
Output
summary(model_list$`1`)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.7085512 0.1415849 5.0044269 0.0001085681
# DRY_T -0.1423601 0.1985256 -0.7170867 0.4830577281
# RDPI_T -0.1273237 0.2179249 -0.5842551 0.5667218157
summary(model_list$`2`)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.3907525 0.1524423 2.5632809 0.02015115
# DRY_T 0.1952963 0.1990449 0.9811672 0.34026853
# RDPI_T 0.1979513 0.1884085 1.0506492 0.30812662
summary(model_list$`3`)$coefficients
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 0.38836708 0.2076638 1.870172 0.07878049
# DRY_T 0.06995811 0.1965336 0.355960 0.72624947
# RDPI_T 0.27144752 0.2115787 1.282962 0.21672143
...

Plotting when time series is in rows not columns- using R

Excel allows you to switch rows and columns in its Chart functionality.
I am trying to replicate this in R. My data (shown below) has the production for each company in rows. I am unable to figure out how to show Month-1, Month-2, etc. on the x-axis, with a series for each company in the same graph. Any help appreciated.
Data:
tibble::tribble(
  ~Company.Name, ~`Month-1`, ~`Month-2`, ~`Month-3`, ~`Month-4`,
  "Comp-1", 945.5438986, 1081.417009, 976.7388701, 864.309703,
  "Comp-2", 16448.87, 13913.19, 12005.28, 10605.32,
  "Comp-3", 346.9689321, 398.2297592, 549.1282647, 550.4207169,
  "Comp-4", 748.8806367, 949.463941, 1018.877481, 932.3773791
)
I'm going to skip the part where you want to transpose, and infer that your purpose for that was solely to help with plotting. The part I'm focusing on here is "show the Month-1, Month-2 etc in x-axis, and the series for each company in the same graph".
This is doable in base graphics, but I highly recommend using ggplot2 (or plotly or similar), due to its ease of dealing with dimensional plots like this. The "grammar of graphics" (which both tend to implement) really prefers data like this be in a "long" format, so part of what I'll do is convert to this format.
First, some data:
set.seed(2)
months <- paste0("Month", 1:30)
companies <- paste0("Comp", 1:5)
m <- matrix(abs(rnorm(length(months)*length(companies), sd=1e3)),
nrow = length(companies))
d <- cbind.data.frame(
Company = companies,
m,
stringsAsFactors = FALSE
)
colnames(d)[-1] <- months
str(d)
# 'data.frame': 5 obs. of 31 variables:
# $ Company: chr "Comp1" "Comp2" "Comp3" "Comp4" ...
# $ Month1 : num 896.9 184.8 1587.8 1130.4 80.3
# $ Month2 : num 132 708 240 1984 139
# $ Month3 : num 418 982 393 1040 1782
# $ Month4 : num 2311.1 878.6 35.8 1012.8 432.3
# (truncated)
Reshaping can be done with multiple libraries, including base R. Here are two techniques:
library(data.table)
d2 <- melt(as.data.table(d), id = 1, variable.name = "Month", value.name = "Cost")
d2[,Month := as.integer(gsub("[^0-9]", "", Month)),]
d2
# Company Month Cost
# 1: Comp1 1 896.91455
# 2: Comp2 1 184.84918
# 3: Comp3 1 1587.84533
# 4: Comp4 1 1130.37567
# 5: Comp5 1 80.25176
# ---
# 146: Comp1 30 653.67306
# 147: Comp2 30 657.10598
# 148: Comp3 30 549.90924
# 149: Comp4 30 806.72936
# 150: Comp5 30 997.37972
library(dplyr)
# library(tidyr)
d2 <- as_tibble(d) %>% # as_tibble() replaces the deprecated tbl_df()
tidyr::gather(Month, Cost, -Company) %>%
mutate(Month = as.integer(gsub("[^0-9]", "", Month)))
I also converted Month to an integer, since that makes sense for an ordinal variable. This isn't strictly necessary; without it, the plot would just treat the months as discrete values.
The plot is anti-climactically simple:
library(ggplot2)
ggplot(d2, aes(Month, Cost, group=Company)) +
geom_line(aes(color = Company))
Bottom line: I don't think you need to worry about transposing your data: doing so has many complications that can just confuse things. Reshaping is a good thing (in my opinion), and with this kind of data it is fast enough that, if your data is stored in the wide format, you can reshape it on the fly without too much difficulty. (If you are thinking about putting this in a database, however, I'd strongly recommend you re-think "wide"; your db schema will be challenging if you keep it.)
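For completeness, here is a rough sketch of the base graphics route mentioned earlier, using the four companies from the question. Here transposing does the work of Excel's "switch rows and columns": matplot() draws one line per matrix column, so the months need to be in rows. The axis labels and the legend position are arbitrary choices.

```r
# Data from the question, typed in as a plain data frame
d <- data.frame(
  Company.Name = paste0("Comp-", 1:4),
  `Month-1` = c(945.5438986, 16448.87, 346.9689321, 748.8806367),
  `Month-2` = c(1081.417009, 13913.19, 398.2297592, 949.463941),
  `Month-3` = c(976.7388701, 12005.28, 549.1282647, 1018.877481),
  `Month-4` = c(864.309703, 10605.32, 550.4207169, 932.3773791),
  check.names = FALSE
)

# Transpose the numeric part: rows become months, columns become companies
m <- t(as.matrix(d[, -1]))
colnames(m) <- d$Company.Name

# One line per company, months on the x-axis
matplot(m, type = "l", lty = 1, col = seq_len(ncol(m)),
        xaxt = "n", xlab = "Month", ylab = "Production")
axis(1, at = seq_len(nrow(m)), labels = rownames(m))
legend("topright", legend = colnames(m), col = seq_len(ncol(m)), lty = 1)
```

Comp-2 is an order of magnitude larger than the others, so a log scale (log = "y" in matplot) may make the smaller series easier to read.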

Finding number of unique values (quickly) across a large data.table

I have a 1.5Mx7 data.table that I need to process through. The code I have written is running very slowly (.18s per row, estimated 75 hours to complete), and I'm hoping I can optimize it.
I'll put the pseudo-example code at the end, because it's long.
str(review)
Classes ‘data.table’ and 'data.frame': 1500000 obs. of 7 variables:
$ user_id : Factor w/ 375000 levels "aA1aJ9lJ1lB5yH5uR6jR7",..: 275929 313114 99332 277686 57473 31780 236964 44371 210127 217770 ...
$ stars : int 2 1 3 3 1 1 2 1 2 2 ...
$ business_id : Factor w/ 60000 levels "aA1kR2bK6nH8yQ9gU2uI9",..: 40806 29885 43018 58297 58444 31626 26018 2493 37883 34204 ...
$ votes.funny : int 3 0 0 7 2 9 6 8 2 7 ...
$ votes.useful: int 4 1 0 5 9 2 4 7 4 9 ...
$ votes.cool : int 5 3 6 8 3 2 0 8 10 9 ...
$ IDate : IDate, format: "2012-01-01" "2012-01-01" "2012-01-01" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "IDate"
I need to subset the dataset by date, and then compute several columns by business_id.
setkey(review, IDate)
system.time(
review[
#(IDate >= window.start) & (IDate <= window.end),
1:10,
.SD,
keyby = business_id
][
,
list(
review.num = .N,
review.users = length(unique(user_id)),
review.stars = mean(stars),
review.votes.funny = sum(votes.funny),
review.votes.useful = sum(votes.useful),
review.votes.cool = sum(votes.cool)
),
by = business_id
]
)
user system elapsed
1.534 0.000 1.534
Timing for smaller versions of the example dataset is
# 1% of original size - 15000 rows
user system elapsed
0.02 0.00 0.02
# 10% of original size - 150000 rows
user system elapsed
0.25 0.00 0.25
So, even though I'm only processing 10 rows, the time increases with the size of the original dataset.
I tried commenting out the review.users variable above, and the computation time on the original dataset fell tremendously:
user system elapsed
0 0 0
So, my challenge is making unique() work more quickly.
I need to count the unique values in user_id for each grouping of business_id.
Not sure what else to specify, but I'm happy to answer questions.
Here is some code to create a pseudo-example dataset. I'm not sure exactly what is the cause of the slowdown, so I've tried to recreate the data as specifically as possible, but because the processing time for the random variables is so long I've reduced the size by ~90%.
library(data.table)
z <- c()
x <- c()
# 6,000 pseudo business IDs
for (i in 1:6000) {
  z <- c(z, paste0(
    letters[floor(runif(7, min = 1, max = 26))],
    LETTERS[floor(runif(7, min = 1, max = 26))],
    floor(runif(7, min = 1, max = 10)),
    collapse = ""
  ))
}
z <- rep(z, 25)
# 37,500 pseudo user IDs
for (i in 1:37500) {
  x <- c(x, paste0(
    letters[floor(runif(7, min = 1, max = 26))],
    LETTERS[floor(runif(7, min = 1, max = 26))],
    floor(runif(7, min = 1, max = 10)),
    collapse = ""
  ))
}
x <- rep(x, 4)
review2 <- data.table(
user_id = factor(x),
stars = as.integer(round(runif(150000) * 5, digits = 0)),
business_id = factor(z),
votes.funny = as.integer(round(runif(150000) * 10, digits = 0)),
votes.useful = as.integer(round(runif(150000) * 10, digits = 0)),
votes.cool = as.integer(round(runif(150000) * 10, digits = 0)),
IDate = rep(as.IDate("2012-01-01"), 150000)
)
setkey(review2, IDate)
How about this - an alternative to unique using an extra data.table within an anonymous function:
review2[,{
uid <- data.table(user_id)
rev_user <- uid[, .N, by = user_id][, .N]
#browser()
list(
review.num = .N,
review.users = rev_user,
review.stars = mean(stars),
review.votes.funny = sum(votes.funny),
review.votes.useful = sum(votes.useful),
review.votes.cool = sum(votes.cool)
)}, by = business_id]
It seems that length(unique()) is inefficient at counting the distinct values of a factor when the number of levels becomes very large.
Using uniqueN() instead (thanks @Frank):
user system elapsed
0.12 0.00 0.12
Using set(review, NULL, "user_id", as.character(review$user_id)) and length(unique()):
user system elapsed
0.11 0.00 0.11
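As a small illustration of the uniqueN() variant, the sketch below runs the grouped aggregation on a made-up five-row table (the data and the reduced set of columns are assumptions for the example; only the uniqueN() call replaces length(unique()) from the question).

```r
library(data.table)

# Hypothetical miniature version of the review table
review <- data.table(
  business_id = c("b1", "b1", "b1", "b2", "b2"),
  user_id     = factor(c("u1", "u1", "u2", "u3", "u3")),
  stars       = c(1L, 2L, 3L, 4L, 5L)
)

# uniqueN() counts distinct values directly, without building the unique vector
res <- review[, .(
  review.num   = .N,
  review.users = uniqueN(user_id),
  review.stars = mean(stars)
), by = business_id]
print(res)
```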

Merge plm fitted values to dataset

I'm working with a fixed effects regression model using plm.
The model looks like this:
FE.model <-plm(fml, data = data.reg2,
index=c('Site.ID','date.hour'), # cross section ID and time series ID
model='within', #coefficients are fixed
effect='individual')
summary(FE.model)
"fml" is a formula I defined previously. I have many independent variables, so this made it more efficient.
What I want to do is get my fitted values (my yhats) and join them to my base dataset; data.reg2
I was able to get the fitted values using this code:
Fe.model.fitted <- FE.model$model[[1]] - FE.model$residuals
However, this only gives me a one column vector of fitted values only - I have no way of joining it to my base dataset.
Alternatively, I've tried something like this:
Fe.model.fitted <- cbind(data.reg2, resid=resid(FE.model), fitted=fitted(FE.model))
However, I get this error with that:
Error in as.data.frame.default(x[[i]], optional = TRUE) : cannot coerce class ""pseries"" to a data.frame
Are there any other ways to get my fitted values in my base dataset? Or can someone explain the error I'm getting and maybe a way to fix it?
I should note that I don't want to manually compute the yhats based on my betas. I have way too many independent variables for that option and my defined formula (fml) may change so that option would not be efficient.
Many thanks!!
Merging plm fitted values back into the original dataset requires some intermediate steps -- plm drops any rows with missing data, and as far as I can tell, a plm object does not contain the index info. The order of the data is not preserved -- see what Giovanni Millo, one of plm's authors, commented in this thread:
"...the input order is not always preserved: observations are always reordered by (individual, time) internally, so that the output you get is ordered accordingly..."
The steps in short:
1) Get fitted values from the estimated plm object. It is a single vector but the entries are named. The names correspond to the position in the index.
2) Get the index, using the index() function. It can return both individual and time indices. Note the index may contain more rows than the fitted values, in case rows were removed for missing data. (It is also possible to generate an index directly from the original data, but I did not see a promise that the original order of the data is preserved in what plm returns.)
3) Merge into the original data, looking up the id and time values from the index.
Sample code is provided below. Kind of long but I've tried to comment. The code is not optimized, my intention was to list the steps explicitly. Also, I am using data.tables rather than data.frames.
library(data.table); library(plm)
### Generate dummy data. This way we know the "true" coefficients
set.seed(100)
n <- 500 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.table(CJ(id = c("a","b","c","d","e"), time = c(1:(n / 5))))
DT[, x1 := rnorm(n)]
DT[, x2 := rnorm(n)]
DT[, y := x1 + 2 * x2 + rnorm(n) / 10]
setkey(DT, id, time)
# # Make it an unbalanced panel & put in some NAs
DT <- DT[!(id == "a" & time == 4)]
DT[.("a", 3), x2 := as.numeric(NA)]
DT[.("d", 2), x2 := as.numeric(NA)]
str(DT)
### Run the model -- both individual and time effects; "within" model
summary(PLM <- plm(data = DT, index = c("id", "time"), formula = y ~ x1 + x2, model = "within", effect = "twoways", na.action = "na.omit"))
### Merge the fitted values back into the data.table DT
# Note that PLM$model$y is shorter than the data, i.e. the row(s) with NA have been dropped
cat("\nRows omitted (due to NA): ", nrow(DT) - length(PLM$model$y))
# Since the objects returned by plm() do not contain the index, need to generate it from the data
# The object returned by plm(), i.e. PLM$model$y, has names that point to the place in the index
# Note: The index can also be done as INDEX <- DT[, j = .(id, time)], but use the longer way with index() in case plm does not preserve the order
INDEX <- data.table(index(x = pdata.frame(x = DT, index = c("id", "time")), which = NULL)) # which = NULL extracts both the individual and time indexes
INDEX[, id := as.character(id)]
INDEX[, time := as.integer(time)] # it is returned as a factor, convert back to integer to match the variable type in DT
# Generate the fitted values as the difference between the y values and the residuals
if (all(names(PLM$residuals) == names(PLM$model$y))) { # this should not be needed, but just in case...
FIT <- data.table(
index = as.integer(names(PLM$model$y)), # this index corresponds to the position in the INDEX, from where we get the "id" and "time" below
fit.plm = as.numeric(PLM$model$y) - as.numeric(PLM$residuals)
)
}
FIT[, id := INDEX[index]$id]
FIT[, time := INDEX[index]$time]
# Now FIT has both the id and time variables, can match it back into the original dataset (i.e. we have the missing data accounted for)
DT <- merge(x = DT, y = FIT[, j = .(id, time, fit.plm)], by = c("id", "time"), all = TRUE) # Need all = TRUE, or some data from DT will be dropped!
I have a simplified method. The main problem here is twofold:
1) pdata.frames sort your input alphabetically by name, then year. This can be addressed by sorting your data frame first before running plm.
2) rows with NA in variables included in the formula are dropped. I handle this problem by creating a second formula including my id and time variable, and then use model.frame to extract the data used in the regression (excluding NAs but now also includes id and time)
library(plm)
set.seed(100)
n <- 10 # Run with more data if you want to get closer to the "true" coefficients
DT <- data.frame(id = c("a","c","b","d","e"), time = c(1:(n / 5)),x1 = rnorm(n),x2= rnorm(n),x3=rnorm(n))
DT$Y = DT$x2 + 2 * DT$x3 + rnorm(n) / 10 # make Y a function of x2 and x3
DT$x3[3]=NA # add an NA to show this works with missing data
DT
# now can add drop.index = F, but note that DT is now sorted by order(id,time)
pdata.frame(DT,index=c('id','time'),drop.index = F)
# order DT to match pdata.frame that will be used for plm
DT=DT[order(DT$id,DT$time),]
# formulas
formulas =Y~x1+x2+x3
formulas_dataframe = Y~x1+x2+x3 +id+time # add id and time for model.frame
# estimate
random <- plm(formulas, data=DT, index=c("id", "time"), model="random",na.action = 'na.omit')
summary(random)
# merge prediction and model.frame
fitted = data.frame(fitted = random$model[[1]] - random$residuals)
model_data = cbind(as.data.frame(as.matrix(random$model)),fitted) # this isn't really needed but shows that input and model.frame are same
model_data = cbind(model_data,na.omit(model.frame(formulas_dataframe,DT)))
model_data
I wrote a function (predict.out.plm) to do out of sample predictions after estimating First Differences or Fixed-Effects models with plm.
The function further adds the predicted values to the indices of the original data. This is done by using the rownames saved within the plm - attributes(plmobject)$index and the rownames within the model.matrix
for more details see the function posted here:
https://stackoverflow.com/a/44185441/2409896
It's been a while for this post, but I believe the easiest way to do this now would be:
Fe.model.fitted <- cbind(FE.model$model,
resid=FE.model$residuals,
fitted=plm:::fitted_exp.plm(FE.model))
The function fitted_exp.plm is not exported by the plm package but we can use the ::: to extract it.
The residuals are the deviations of the model from the value on the LHS of the formula... which you have not shown us. There is a fitted.panelmodel function in the 'plm' package, but it appears to expect a fitted value that the plm function does not return by default, nor is it documented to do so, nor is there a way that I can see to make it cough one up.
library(plm)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
data = Produc, index = c("state","year"))
summary(zz) # the example on the plm page:
> str(fitted(zz))
NULL
> names(zz$model)
[1] "log(gsp)" "log(pcap)" "log(pc)" "log(emp)" "unemp"
> Produc[ , c("Yvar", "Fitted")] <- cbind( zz$model[ ,"log(gsp)", drop=FALSE], zz$residuals)
> str(Produc)
'data.frame': 816 obs. of 12 variables:
$ state : Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 ...
$ pcap : num 15033 15502 15972 16406 16763 ...
$ hwy : num 7326 7526 7765 7908 8026 ...
$ water : num 1656 1721 1765 1742 1735 ...
$ util : num 6051 6255 6442 6756 7002 ...
$ pc : num 35794 37300 38670 40084 42057 ...
$ gsp : int 28418 29375 31303 33430 33749 33604 35764 37463 39964 40979 ...
$ emp : num 1010 1022 1072 1136 1170 ...
$ unemp : num 4.7 5.2 4.7 3.9 5.5 7.7 6.8 7.4 6.3 7.1 ...
$ Yvar :Classes 'pseries', 'pseries', 'integer' atomic [1:816] 10.3 10.3 10.4 10.4 10.4 ...
.. ..- attr(*, "index")='data.frame': 816 obs. of 2 variables:
.. .. ..$ state: Factor w/ 48 levels "ALABAMA","ARIZONA",..: 1 1 1 1 1 1 1 1 1 1 ...
.. .. ..$ year : Factor w/ 17 levels "1970","1971",..: 1 2 3 4 5 6 7 8 9 10 ...
$ Fitted: num -0.04656 -0.03064 -0.01645 -0.00873 -0.02708 ...
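Note that the column named Fitted above actually holds zz$residuals. A rough sketch of joining fitted values by key rather than by position is below. It follows the definition used in the question (response minus residuals) and assumes that index() applied to the fitted model returns the state and year of each row used in the fit; the names fit_df and Produc2 are made up for the example.

```r
library(plm)
data("Produc", package = "plm")
zz <- plm(log(gsp) ~ log(pcap) + log(pc) + log(emp) + unemp,
          data = Produc, index = c("state", "year"))

# Individual and time identifiers for the rows that entered the model
idx <- index(zz)

# Residuals and (response - residual) keyed by state and year
fit_df <- data.frame(
  state  = as.character(idx$state),
  year   = as.integer(as.character(idx$year)),  # year comes back as a factor
  resid  = as.numeric(zz$residuals),
  fitted = as.numeric(zz$model[[1]]) - as.numeric(zz$residuals)
)

# Left join so rows dropped by plm (e.g. for NAs) would simply get NA
Produc2 <- merge(Produc, fit_df, by = c("state", "year"), all.x = TRUE)
head(Produc2[, c("state", "year", "gsp", "resid", "fitted")])
```

Merging on the keys avoids relying on plm preserving the input row order.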

How to read csv file in R where some values contain the percent symbol (%)

Is there a clean/automatic way to convert CSV values formatted as percentages (with a trailing % symbol) to numbers in R?
Here is some example data:
actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%
Which can be read using:
junk = read.csv("Example.csv")
But all of the % columns are read as strings and converted to factors:
> str(junk)
'data.frame': 4 obs. of 3 variables:
$ actual : num 2.15 0.917 7.941 4.964
$ simulated : num 8.607 8.027 0.215 3.524
$ percent.error: Factor w/ 4 levels "-300%","-775%",..: 1 2 4 3
but I would like them to be numeric values.
Is there an additional parameter for read.csv? Is there a way to easily post process the needed columns to convert to numeric values? Other solutions?
Note: of course in this example I could simply recompute the values, but in my real application with a larger data file this is not practical.
There is no "percentage" type in R. So you need to do some post-processing:
DF <- read.table(text="actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%", sep=",", header=TRUE)
DF[,3] <- as.numeric(gsub("%", "",DF[,3]))/100
# actual simulated percent.error
#1 2.1496 8.6066 -3.00
#2 0.9170 8.0266 -7.75
#3 7.9406 0.2152 0.97
#4 4.9637 3.5237 0.29
This is the same as Roland's solution, except that it uses the stringr package. I'd recommend stringr when working with strings, as its interface is more intuitive.
library(stringr)
d <- str_replace(junk$percent.error, pattern="%", "")
junk$percent.error <- as.numeric(d)/100
With data.table you can achieve it as
a <- fread("file.csv")[,`percent error` := as.numeric(sub('%', '', `percent error`))/100]
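The tidyverse has multiple ways of solving such issues. With readr you can use the col_number() column specification (the column-level counterpart of parse_number()), which strips any symbols, text, etc. from around a number. Note that the result is -300 rather than -3, so divide by 100 afterwards if you want proportions:
library(readr)
sample_data = "actual,simulated,percent error\n 2.1496,8.6066,-300%\n 0.9170,8.0266,-775%\n7.9406,0.2152,97%\n4.9637,3.5237,29%"
DF <- read_csv(sample_data, col_types = cols(`percent error` = col_number()))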
# A tibble: 4 x 3
# actual simulated `percent error`
# <chr> <dbl> <dbl>
# 1 2.1496 8.61 -300
# 2 0.9170 8.03 -775
# 3 7.9406 0.215 97.0
# 4 4.9637 3.52 29.0
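If several columns may carry a % suffix, the post-processing can be applied to all of them at once. The following is a base R sketch under the assumption that a column counts as a percent column when it is read as character and every value ends in "%"; the helper name is_pct is made up for the example.

```r
csv_text <- "actual,simulated,percent error
2.1496,8.6066,-300%
0.9170,8.0266,-775%
7.9406,0.2152,97%
4.9637,3.5237,29%"

# Read percent columns as character rather than factor
junk <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Flag character columns in which every value ends with "%"
is_pct <- vapply(junk,
                 function(col) is.character(col) && all(grepl("%$", col)),
                 logical(1))

# Strip the trailing "%" and rescale to a proportion
junk[is_pct] <- lapply(junk[is_pct],
                       function(col) as.numeric(sub("%$", "", col)) / 100)
str(junk)
```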
