I have a large dataset (~800M rows) as a data.table. It consists of equidistant time-series data for thousands of IDs. My problem is that missing values were originally not encoded but are genuinely absent from the dataset, so I would like to add the rows with the missing data. I know that the same timestamps should be present for each ID.
Given the size of the dataset, my initial idea was to create one data.table that includes every timestep the data should contain, and then merge it with all = TRUE for each ID of the main data.table. However, so far I have only managed to do that if my data.table with all timesteps (complete_dt) also includes the ID column. This creates a lot of redundant information, as each ID should have the same timesteps.
I made an MWE - for simplicity, as my data is equidistant, I have replaced the POSIXct column with a simple integer column:
library(data.table)
# My main dataset
set.seed(123)
main_dt <- data.table(id = as.factor(rep(1:3, c(5, 4, 3))),
                      pseudo_time = c(1, 3, 4, 6, 7,  1, 3, 4, 5,  3, 5, 6),
                      value = runif(12))
# Assuming that I should have the pseudo timesteps 1:7 for each ID
# Given the size of my real data I would like to create the pseudo time not for each ID but only once
complete_dt <- main_dt[, list(pseudo_time = 1:7), by = id]
#The dt I need to get in the end
result_dt <- merge.data.table(main_dt, complete_dt, all = TRUE)
I have seen this somewhat similar question, Merge (full join) recursively one data.table with each group of another data.table, but I have not managed to apply it to my problem.
Any help towards a more efficient solution than mine would be much appreciated.
Here is an alternative, though probably not much more efficient:
setkey(main_dt, id, pseudo_time)
main_dt[CJ(id, pseudo_time = 1:7, unique = TRUE)]
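If you'd rather not key the table, the same rows can be produced with an explicit on= join; a minimal sketch of the same idea:
# Build the complete id x timestep grid once, then join it onto the data;
# combinations missing from main_dt come back as NA rows
all_steps <- CJ(id = unique(main_dt$id), pseudo_time = 1:7)
main_dt[all_steps, on = c("id", "pseudo_time")]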
I have got a large dataframe containing medical data (my.medical.data).
A number of columns contain dates (e.g. hospital admission date), the names of each of these columns end in "_date".
I would like to apply the lubridate::dmy() function to the columns that contain dates and overwrite my original dataframe with the output of this function.
It would be great to have a general solution that can be applied using any function, not just my dmy() example.
Essentially, I want to apply the following to all of my date columns:
my.medical.data$admission_date <- lubridate::dmy(my.medical.data$admission_date)
my.medical.data$operation_date <- lubridate::dmy(my.medical.data$operation_date)
etc.
I've tried this:
date.columns <- select(my.medical.data, ends_with("_date"))
date.names <- names(date.columns)
date.columns <- transmute_at(my.medical.data, date.names, lubridate::dmy)
date.columns now contains my date columns in Date format, rather than the original factors. I now want to replace the date columns in my.medical.data with the new columns in the correct format.
my.medical.data.new <- full_join(x = my.medical.data, y = date.columns)
Now I get:
Error: cannot join a Date object with an object that is not a Date object
I'm a bit of an R novice, but I suspect that there is an easier way to do this (e.g. process the original dataframe directly), or maybe a correct way to join / merge the two dataframes.
As usual, it's difficult to answer without an example dataset, but this should do the job:
library(dplyr)
my.medical.data <- my.medical.data %>%
  mutate_at(vars(ends_with('_date')), lubridate::dmy)
This will mutate in place each variable whose name ends with '_date', applying the function. It can also apply multiple functions. See ?mutate_at (which is also the help page for mutate_if).
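In dplyr 1.0.0 and later, mutate_at() is superseded by across(); the same operation would be written as follows (a sketch of the equivalent call):
library(dplyr)
# Apply lubridate::dmy to every column whose name ends in '_date'
my.medical.data <- my.medical.data %>%
  mutate(across(ends_with('_date'), lubridate::dmy))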
There are several ways to do that. If you work with voluminous data, I think data.table is the best approach (it will give you flexibility, speed and memory efficiency).
data.table
You can use := (the update-by-reference operator) together with lapply to apply lubridate::dmy to all the columns named in .SDcols:
library(data.table)
setDT(my.medical.data)
cols_to_change <- colnames(my.medical.data)[endsWith(colnames(my.medical.data), "_date")]
my.medical.data[, (cols_to_change) := lapply(.SD, lubridate::dmy), .SDcols = cols_to_change]
base R
A standard lapply can also help. You could try something like the following (I have not tested it):
my.medical.data[, cols_to_change] <- lapply(cols_to_change,
                                            function(d) lubridate::dmy(my.medical.data[, d]))
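A minimal made-up dataset to sanity-check either approach (the column names and values here are assumptions, not from the original question):
# Hypothetical example data: two date columns plus one non-date column
my.medical.data <- data.frame(admission_date = c("01/02/2019", "15/03/2019"),
                              operation_date = c("03/02/2019", "17/03/2019"),
                              age = c(54, 67))
cols_to_change <- colnames(my.medical.data)[endsWith(colnames(my.medical.data), "_date")]
my.medical.data[, cols_to_change] <- lapply(cols_to_change,
                                            function(d) lubridate::dmy(my.medical.data[, d]))
str(my.medical.data)  # both *_date columns are now Date; age is untouched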
I have a data.frame in R that consists of two columns: a Sample_ID variable and a value variable. Each sample (of which there are 1971) has 132 individual points. The entire object is only ~3,000,000 bytes, or about 0.003 gigabytes (according to object.size()). For some reason, when I try to dcast the object into wide format, it throws an error saying it can't allocate vectors of size 3.3 GB, which is more than three orders of magnitude larger than the original object.
The output I'm hoping for is 1 column for each sample, with 132 rows of data for each column.
The dcast code I am using is the following:
df_dcast = dcast(df, value.var = "Vals", Vals~Sample_ID)
I would provide the dataset for reproducibility but because this problem has to do with object size, I don't think a subset of it would help and I'm not sure how to easily post the full dataset. If you know how to post the full dataset or think that a subset would be helpful, let me know.
Thanks
OK, I figured out what was going wrong. dcast was attempting to use each unique value in the Vals column as an individual row, producing far more rows than the 132 I wanted. I needed to add a column holding a value index running from 1:132, so the dataframe has three columns: Sample_ID, Vals, ValsNumber.
The dcast code then looks like the following:
df_wide = dcast(df, value.var = "Vals", ValsNumber ~ Sample_ID)
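For reference, one way to build that index column is a per-group row counter; a sketch, assuming the rows within each sample are already in the desired order:
library(data.table)
setDT(df)[, ValsNumber := seq_len(.N), by = Sample_ID]  # 1:132 within each sample
df_wide <- dcast(df, ValsNumber ~ Sample_ID, value.var = "Vals")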
I searched for this and found the entries HERE, HERE and HERE, but they don't answer my question.
Code:
for (i in 1:nrow(files.df)) {
  Final <- parser(as.character(files.df[i, ]))
  Final2 <- rbind(Final2, Final)
}
files.df contains over 30 filenames (read from a directory using list.files), each of which is passed to a custom function parser that returns a dataframe holding over 100 lines (the number varies from one file to the next). Both Final and Final2 are initialised with NA outside the for loop. The script runs fine with rbind, but there is a semantic issue: the resulting output is not what I expect, as the resulting dataframe is a lot smaller than the files combined.
I am certain it has to do with the rbind bit.
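For comparison, a common pattern that sidesteps growing a data frame inside a loop is to collect the pieces in a list and bind them once at the end; a sketch, assuming parser returns a data.frame as above:
pieces <- lapply(seq_len(nrow(files.df)),
                 function(i) parser(as.character(files.df[i, ])))
Final2 <- do.call(rbind, pieces)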
Secondly, I am looking to mimic the pivot functionality of Excel: I have four columns, where the first column is repeated across rows and the second, third and fourth columns are distinct. The final dataframe should be pivoted around the first column. Any idea how I can achieve this? I had a go at cast and melt, but to no avail.
Any thoughts would be great! It would be good if I could stick to the data frame structure.
(Pictures attached for reference: the data with the pivot applied and the ideal output.)
For your pivot functionality, which essentially requires transforming a dataframe from long to wide format while aggregating on the Value column, you can use base R's reshape():
reshapedf <- reshape(df, v.names = c("Value"),
                     timevar = c("Identifier"),
                     idvar = c("Date"),
                     direction = "wide")
# RENAME COLUMNS
names(reshapedf) <- c('Date', 'A', 'B', 'C')
# CONVERT NAs TO ZEROS
reshapedf[, c(2:4)] <- data.frame(sapply(reshapedf[, c(2:4)],
                                         function(x) ifelse(is.na(x), 0, x)))
# RESET ROW.NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)
Alternatively, there is the dedicated package reshape2, which, as seen below, tends to be less wordy and to need less post-formatting; hence many prefer this route. Plus, like Excel pivot tables, other aggregate functions are available (sum, mean, length, etc.):
library(reshape2)
reshape2df <- dcast(df, Date~Identifier, sum, value.var="Value")
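For reference, a minimal made-up df matching the column names assumed in both snippets (Date, Identifier, Value):
# Hypothetical long-format input: one row per Date/Identifier pair
df <- data.frame(Date = rep(as.Date("2019-01-01") + 0:2, each = 3),
                 Identifier = rep(c("A", "B", "C"), times = 3),
                 Value = 1:9)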
I would like to create a data.table in tidy form containing the columns articleID, period and demand (with articleID and period as key). The demand is subject to a random function with input data from another data.frame (params). It is created at runtime for differing numbers of periods.
It is easy to do this in "non-tidy" form:
#example data
params <- data.frame(shape=runif(10), rate=runif(10)*2)
rownames(params) <- letters[1:10]
periods <- 10
# create non-tidy data with one column for each period
df <- replicate(periods,
                rgamma(nrow(params), shape = params[, "shape"], rate = params[, "rate"]))
rownames(df) <- rownames(params)
Is there a "tidy" way to do this creation? I would need to replicate the rgamma(), but I am not sure how to make it use the parameters of the corresponding article. I tried starting with a Cross Join from data.table:
dt <- CJ(articleID=rownames(params), per=1:periods, demand=0)
but I don't know how to pass rgamma to dt[, demand] directly and correctly at creation, nor how to change the values afterwards without some ugly for loop. I also considered using gather() from the tidyr package, but as far as I can see I would need a for loop there as well.
It does not really matter to me whether I use data.frame or data.table for my current use case. Solutions for any (or both!) would be highly appreciated.
This'll do (note that it assumes params is sorted by row names; if not, you can convert it to a data.table and merge the two):
CJ(articleID = rownames(params), per = 1:periods)[,
  demand := rgamma(.N, shape = params[, "shape"], rate = params[, "rate"]), by = per]
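Following the parenthetical above, if params is not sorted by row names, a merge-based variant looks roughly like this (a sketch):
# Carry the parameters into the grid via a join, then draw row-wise;
# rgamma is vectorised over shape and rate, so no by= is needed
paramsDT <- data.table(articleID = rownames(params), params)
dt <- paramsDT[CJ(articleID = rownames(params), per = 1:periods), on = "articleID"]
dt[, demand := rgamma(.N, shape = shape, rate = rate)]
dt[, c("shape", "rate") := NULL]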
This question is about selecting a different number of columns on every row of a data frame. I have a data frame:
df = data.frame(
  START = sample(1:2, 10, replace = TRUE), END = sample(2:4, 10, replace = TRUE),
  X1 = rnorm(10), X2 = rnorm(10), X3 = rnorm(10), X4 = rnorm(10)
)
I would like a way, without loops, to select columns (START[i]:END[i])+2 on row i, for all rows of my data frame.
Base R solution
lapply(split(df, 1:nrow(df)), function(row) row[(row$START + 2):(row$END + 2)])
Or something similar to what was given in the comment above (I would store the output in a list):
library(plyr)
alply(df, 1, function(row) row[(row$START + 2):(row$END + 2)])
Edit per request of OP:
To get a TRUE/FALSE index matrix, use the following base R solution:
idx_matrix <- col(df) >= df$START + 2 & col(df) <= df$END + 2
df[idx_matrix]
Note, however, that you lose some information here (compared with the list-based solutions).
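If the row and column positions of the selected cells matter, which() with arr.ind = TRUE recovers them in the same column-major order as the extracted values; a sketch:
pos <- which(idx_matrix, arr.ind = TRUE)  # row/column of each TRUE cell
data.frame(pos, value = df[idx_matrix])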