Change data.table column classes to integer64, double respectively - r

Given data.table dt:
dt <- structure(list(V1 = c("1544018118438041139", "1544018118466235879",
"1544018118586849680", "1544018118601169211", "1544018118612947335",
"1544018118614422179"), V2 = c("162", "162", "161.05167", "158.01309",
"157", "157"), V3 = c("38", "38", "36.051697", "33.01306", "32",
"32"), V4 = c("0.023529414", "0.025490198", "0.023529414", "0.027450982",
"0.03137255", "0.03137255"), V5 = c("1", "1", "1", "1", "1",
"1"), V6 = c("2131230815", "2131230815", "2131230815", "2131230815",
"2131230815", "2131230815"), V7 = c("1", "0", "0", "0", "0",
"-1")), class = c("data.table", "data.frame"), row.names = c(NA,
-6L))
data.table::setDT(dt) # dput() cannot serialise .internal.selfref, so rebuild it here
I want the first column to be bit64::as.integer64() and the rest of the columns as.numeric()
I am trying to do this:
dt <- dt[ ,V1 := bit64::as.integer64(V1)]
dt[, lapply(.SD, as.numeric), .SDcols = -c("V1")]
But it doesn't seem to do what I want, please advise how to change specific columns to class A(integer64) and the rest to another class B (say as.numeric())?
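One thing worth noting about the attempt above: the second line computes the converted columns but never assigns them back, so dt itself is unchanged. A minimal two-step sketch, assuming the data.table and bit64 packages are installed and using a shortened stand-in for dt (the name num_cols is made up for illustration):

```r
library(data.table)

# Shortened stand-in for the table in the question (all columns are character).
dt <- data.table(V1 = c("1544018118438041139", "1544018118466235879"),
                 V2 = c("162", "161.05167"),
                 V7 = c("1", "-1"))

# Step 1: convert the timestamp column by reference.
dt[, V1 := bit64::as.integer64(V1)]

# Step 2: convert every other column by reference; (num_cols) on the
# left-hand side of := means "the columns named in this vector".
num_cols <- setdiff(names(dt), "V1")
dt[, (num_cols) := lapply(.SD, as.numeric), .SDcols = num_cols]

sapply(dt, class)
```

Because := modifies dt in place, no reassignment with dt <- is needed.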

From the comments above it seems like you want to be able to do this all in one step rather than convert the first to integer64 and then the rest to double. One way you can do this is with:
dt[, names(dt) := Map(function(fun, x) fun(x), rep(list(bit64::as.integer64, as.numeric), times = c(1,length(.SD) - 1)), .SD), .SDcols = names(dt)]
The Map function iterates through its inputs in parallel. That is, it takes the first elements of the first and second lists and passes them as arguments to the function. Then it takes the second elements of both lists and passes those to the function, and so on.
In our Map call we have:
A main function to apply. This is an anonymous function which takes two arguments: (1) fun and (2) x. The result of the function is the result of applying fun to x, that is, fun(x). For a concrete example, try:
myfun <- function(fun, x){
fun(x)
}
res<-myfun(as.numeric, c("1","1")); class(res)
A list of functions to pass to our main function. These will be used as fun in our main function. In this case it's list(as.integer64, as.numeric, as.numeric, ...).
A list of vectors to pass to our main function. These will be used as x in our main function. In this case each column of our dt.
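The pairing described above can be tried on a small scale in base R. This is only a sketch: the lists converters and columns are made up for illustration, and as.integer stands in for bit64::as.integer64 so that no extra package is needed.

```r
# One converter per column, in the same order as the columns.
converters <- list(as.integer, as.numeric, as.numeric)

# Stand-in for the columns of dt (all stored as character).
columns <- list(c("1", "2"), c("1.5", "2.5"), c("3", "4"))

# Map calls converters[[1]](columns[[1]]), then converters[[2]](columns[[2]]), ...
res <- Map(function(fun, x) fun(x), converters, columns)

vapply(res, class, character(1))
```

The i-th function is applied to the i-th column only, which is why the list of functions has to be as long as the list of columns (or be recyclable to that length).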

It looks to me that you have a data.table object with integer64 nanosecond timestamps since the epoch. I use the same at work to represent high-resolution timestamps.
The good news is that data.table supports this -- by relying on our package nanotime which itself uses bit64 for the integer64 type. However, I create my timestamps differently, typically from compiled code where I retrieve the data.
I described this in some detail at the Rcpp Gallery in this post. So some good news: this can be done. Some bad news: I don't think you can do it the way you want, because we can only go via double, which has only about 16 significant decimal digits of precision, not 19. But maybe I am missing a trick, so if a simpler solution exists I'd be all ears. (And I keep forgetting whether there is a 'parse int64 from string' approach. I never went that route because you can't do that at scale, and I deal with pretty sizeable data sets too.)
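The precision limit is easy to see in base R. A small sketch using one of the timestamps from the question (the variable names are made up for illustration):

```r
ts_chr <- "1544018118438041139"          # 19 significant digits

# Going through double cannot keep all 19 digits.
ts_dbl <- as.numeric(ts_chr)
round_trip <- sprintf("%.0f", ts_dbl)

identical(round_trip, ts_chr)            # FALSE: the low-order digits changed
```

Doubles of this magnitude are spaced 256 apart, so an odd nanosecond count like this one cannot be represented exactly.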

Thanks guys, @dirk_eddelbuettel. I managed to do this:
1) Load all the JSON files (in my case) with bigint_as_char=TRUE in the fromJSON call.
2) Now you have a big table with all columns as characters.
3) Convert the timestamp column with bit64::as.integer64(); this gives the numbers I want.
4) Convert the rest to the desired types.
5) When I want to perform calculations, for example timestamp - lag(timestamp), I add lag_timestamp = lag(timestamp) as a new column (with dplyr::mutate) and then add a diff column, storing the difference with as.character().
6) You are almost done: the new diff column stores the value I want as a character string, and you can now convert it with as.numeric() where needed or apply ifelse() to deal with irrelevant values.
7) That's all. It works perfectly for me and doesn't crash RStudio.
Before applying my solution, RStudio crashed.
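Steps 3 to 6 can be sketched outside of a data pipeline as follows. This is a simplified variation, assuming the bit64 package is installed: it subtracts shifted copies of the vector directly instead of going through dplyr::mutate and lag(), and the names ts and gap_ns are made up for illustration.

```r
library(bit64)

# Step 3: parse the nanosecond timestamps from character without losing digits.
ts <- as.integer64(c("1544018118438041139",
                     "1544018118466235879",
                     "1544018118586849680"))

# Step 5: difference between each timestamp and the previous one,
# computed in 64-bit integer arithmetic.
gap_ns <- ts[-1] - ts[-length(ts)]

# Step 6: the gaps are small, so they survive conversion to character or double.
as.character(gap_ns)   # "28194740" "120613801"
as.numeric(gap_ns)
```

Only the raw timestamps need the 64-bit type; the differences fit comfortably into a double.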

Related

New columns in data.frame don't retain POSIXct class

I spent almost two days trying to find the reason for an error. It is probably trivial for many, but I cannot figure it out and would be thankful for help:
When I create a new data.frame and add columns with a specific class (POSIXct) using ...$... syntax, it works nicely ("p" columns in code below, they become class POSIXct as intended).
However, if I do the same using the ...[..., ...] syntax, the POSIXct class is lost upon assignment ("n" columns in code below, which unintentionally become class numeric).
Even after setting the class explicitly, it remains numeric using the ...[..., ...] syntax, but not using the ...$... syntax.
What is the reasoning behind this behaviour? Obviously I have found a workaround, but it is more convenient to use vectors of column names, and I am afraid that I am missing something very basic but cannot figure out what, or which keywords to search for.
Basically I need to access the columns by a variable and then assign class and data.
if (exists("dfDummy")) rm(dfDummy) # just make sure there is no residual old data/columns leftover
dfDummy <- data.frame(a = 1:10, dummy = NA) # the original used an undefined object here; NA is a placeholder
dfDummy$p <- as.POSIXct(NA)
dfDummy$p.rep <- as.POSIXct(rep(NA, 10))
dfDummy[ , c("n1", "n2")] <- as.POSIXct(NA)
dfDummy[ , c("n1.rep", "n2.rep")] <- as.POSIXct(rep(NA, 10))
sapply(X = c("p", "p.rep", "n1", "n2", "n1.rep", "n2.rep"), function(x) class(dfDummy[, x]))
# even after setting the class explicitly, it remains "numeric" - what is wrong?
class(dfDummy[ , c("n1", "n2", "n1.rep", "n2.rep")]) <- c("POSIXct", "POSIXt")
sapply(X = c("p", "p.rep", "n1", "n2", "n1.rep", "n2.rep"), function(x) class(dfDummy[, x]))
The issue has nothing really to do with using $ or [, except that with $ a single column is being assigned and with [ multiple columns are.
Rather, when you assign into multiple columns, the POSIXct vector is recycled and simplified into a matrix, and matrices can't hold class POSIXct.
If you instead pass a list, it will work:
dfDummy[ , c("n1.rep", "n2.rep")] <- list(as.POSIXct(NA))
lapply(dfDummy[ , c("n1.rep", "n2.rep")], class)
$n1.rep
[1] "POSIXct" "POSIXt"
$n2.rep
[1] "POSIXct" "POSIXt"
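The class loss can be reproduced without a data.frame at all, and the same idea gives a way to fill columns from a vector of names. A minimal base R sketch (the objects x, m and df are made up for illustration):

```r
x <- as.POSIXct(c("2020-01-01", "2020-01-02"), tz = "UTC")

# Reshaping into a matrix strips the attributes, including the class.
m <- matrix(x, nrow = 2, ncol = 2)
inherits(x, "POSIXct")   # TRUE
inherits(m, "POSIXct")   # FALSE: only the underlying numbers remain

# Assigning one column at a time by name keeps the class.
df <- data.frame(a = 1:2)
for (nm in c("n1", "n2")) {
  df[[nm]] <- as.POSIXct(rep(NA, nrow(df)))
}
sapply(df[c("n1", "n2")], function(col) class(col)[1])
```

The loop with [[ is an alternative to the list() form above when the columns need different values.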

Creating a simple for loop in R

I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the relative Volume of every column that corresponds to Column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relatively new to R, so it's not obvious to me. What I tried is something like this, but I can't actually run it:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide some data, I've followed the description you provided to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volume_OD
rel.Volume_unmod <- Volume %>%
mutate(across(all_of(cols), ~ . / Volume$Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably they correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you want to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or apply operations or functions to existing columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to apply a function over multiple columns. In this case I want to divide by Volume_OD. The denominator is read from the original data as Volume$Volume_OD, which keeps the result correct even though Volume_OD is itself one of the columns being overwritten.
~ is a tidyverse way to create anonymous functions. ~ . / Volume$Volume_OD is equivalent to function(x) x / Volume$Volume_OD.
all_of is necessary because in this specific case I'm providing across with a character vector. Without it, the code still works, but you will receive a warning because the selection is ambiguous and may work incorrectly in some cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by column 3
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[[3]])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
In this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] at a time and applies a function.
The function is /. You usually see / used like this: A / B, but actually / is a primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we pass rel.Volume_unmod[[3]] (the third column as a plain vector) as the additional parameter.
lapply always returns a list. But, since we are assigning the result of lapply to a "fraction of a dataframe", we just edit the columns of the dataframe and, as a result, we get a dataframe instead of a list. Let me rephrase in a more technical way. When you write rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using the assignment function [<-. This function lets you edit the items in a list/vector/dataframe. Specifically, [<- allows you to assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. So when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
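The difference between assigning the lapply result into a slice of the dataframe and assigning it to a fresh name can be checked on a tiny example. A sketch with made-up data (df and as_list are illustrative names):

```r
df <- data.frame(a = c(2, 4), b = c(1, 8))

# On its own, lapply gives back a plain list.
as_list <- lapply(df[1:2], `/`, df[[1]])
class(as_list)   # "list"

# Assigned through [<-, only the columns are replaced; df stays a data.frame.
df[1:2] <- lapply(df[1:2], `/`, df[[1]])
class(df)        # "data.frame"
df
```

Note that df[[1]] is evaluated once, before any column is overwritten, so every column is divided by the original first column.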
Without a minimal working example it's hard to guess what the variable Volume actually refers to. Apart from that, there seems to be a problem with your for loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, double brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
Last but not least, initialising rel.Volume with NULL causes trouble when you later assign into it, since you haven't told R what rel.Volume should be.
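To see the difference between looping over columns and looping over column indices, here is a small base R sketch with made-up data (df, by_column and by_index are illustrative names):

```r
df <- data.frame(id = c("a", "b"), x = c(2, 4), y = c(1, 8))

# Looping over the data.frame itself hands you each column vector in turn.
by_column <- list()
for (col in df[, 2:3]) {
  by_column[[length(by_column) + 1]] <- col
}

# Looping over 2:3 hands you the column positions instead.
by_index <- integer(0)
for (i in 2:3) {
  by_index <- c(by_index, i)
}

by_column   # the values of x, then the values of y
by_index    # 2 3
```

Only the second form gives a value that can be used inside df[[i]].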
Try this, if you like (thanks @Edo for the example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse packages while you are still learning the basics. From my experience, in the long run you're better off learning base R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse functions later (but then, why would anybody? ;-) ).

Identify sustained peaks using pracma::findpeaks

I am having problems with the syntax of the peakpat option of the findpeaks function in the pracma R package (v. 2.1.1). I am using R 3.4.3 x64 on Windows.
I would like the function to identify peaks that may have two repeated values, and I believe the option peakpat is how I can do this.
This question has been asked before, however I haven't been able to come across an example of how to implement the option Hans is referring to. This seems very basic and I am also quite a beginner when it comes to coding. In the help file online, it says the following about peakpat:
"define a peak as a regular pattern, such as the default pattern "[+]1,[-]1,"; if a pattern is provided, the parameters nups and ndowns are not taken into account."
I'm having problems interpreting what "[+]1,[-]1" means. Any ideas? I've tried variations of what I think this means, but each attempt results in NULL. Please see my example below, any help/insight is greatly appreciated.
# Example:
install.packages("pracma")
library(pracma)
subset = c(570,584,500,310,261,265,272,313,314,315,330,360,410,410,360,365,368,391,390,414)
# Plots
plot(subset)
lines(subset)
# findpeaks without defining repeated values;
# the result does not identify the peak at subset[13:14] (repeated 'peak' values)
result = findpeaks(subset)
pks1 = data.matrix(result[,1])
locs1 = data.matrix(result[,2])
# findpeaks with my futile attempt at defining peakpat
result = findpeaks(subset, nups=2, ndowns=nups, zero = "0", peakpat="[+]2,[-]2,")
result = findpeaks(subset, nups=1, ndowns=nups, zero = "0", peakpat="[+]1,[-]1,")
result = findpeaks(subset, nups=1, ndowns=nups, zero = "0", peakpat="[+]{,1},[-]{,1}")
result = findpeaks(subset, nups=1, ndowns=nups, zero = "0", peakpat="[+]{1,},[-]{1,}")
result = findpeaks(subset, nups=2, ndowns=nups, zero = "0", peakpat="[2],[2]")
result = findpeaks(subset, nups=2, ndowns=nups, zero = "0", peakpat="[1],[1]")
# all of the above results in NULL
Thank you!
The documentation isn't too helpful in this case, but you can get some clues by inspecting the function body.
Typing the function name into the console lets you inspect its source. Without going into complete detail, this line is helpful:
peakpat <- sprintf("[+]{%d,}[-]{%d,}", nups, ndowns)
This shows us that the default arguments correspond to a peakpat of "[+]{1,}[-]{1,}".
This should also reinforce why if you specify peakpat, you don't need to specify anything for nups and ndowns.
A pattern that does what you're after, for peaks of two or more repeated values:
result <- findpeaks(subset, peakpat = "[+]{1,}[0]{1,}[-]{1,}")
The commas specify an interval. So if you wanted to limit your search to peaks that have a repeated value of at most length 3:
result <- findpeaks(subset, peakpat = "[+]{1,}[0]{1,2}[-]{1,}")
The function works by turning your data into a string and applying a regular expression, so the usual rules for regex should apply.
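The idea can be imitated in base R to see what the patterns match. This is only an illustration of the approach described above, not pracma's actual implementation; the names steps, default_hits and sustained_hits are made up.

```r
x <- c(570, 584, 500, 310, 261, 265, 272, 313, 314, 315,
       330, 360, 410, 410, 360, 365, 368, 391, 390, 414)

# Encode each step between neighbours as "+" (up), "0" (flat) or "-" (down).
steps <- paste(c("-", "0", "+")[sign(diff(x)) + 2], collapse = "")
steps

# Default-style pattern: at least one rise followed by at least one fall.
default_hits <- gregexpr("[+]{1,}[-]{1,}", steps)[[1]]

# Sustained peak: rise, at least one flat step, then fall.
sustained_hits <- gregexpr("[+]{1,}[0]{1,}[-]{1,}", steps)[[1]]

as.integer(default_hits)     # start positions of ordinary peaks
as.integer(sustained_hits)   # start position of the plateau peak
```

The flat step between x[13] and x[14] shows up as the single "0" in steps, which is why the default pattern skips that peak and the second pattern finds it.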
This is my final solution for finding the peaks with the traditional peakpat definition and defining sustained peaks (the answer by Callum):
# EXAMPLE: findpeaks
library(pracma)
subset = c(570,584,500,310,261,265,272,313,314,315,330,360,410,410,360,365,368,391,390,414)
# Plots
plot(subset)
lines(subset)
# findpeaks without sustained peaks:
# the result does not identify the peak at subset[13:14] (repeated 'peak' values)
result = findpeaks(subset)
pks1 = result[,1]
locs1 = result[,2]
# findpeaks by defining sustained peaks (2 or more consecutive values):
result = findpeaks(subset, peakpat = "[+]{1,}[0]{1,}[-]{1,}")
pks2 = result[,1]
locs2 = result[,2]

Recoding data.table values in a loop using a vector of column names

I have a large data table that contains some categorical variables, where missing values have been coded as blank strings. I would like to recode them to NA.
I have a vector storing the names of the categorical variables:
categorical_variables = c("v3", etc.
The vector is definitely set up correctly - I have successfully used it to loop through plots of each column. However when I try to recode using this...
for (v in categorical_variables) myDataTable[get(v)=="",get(v):=NA]
...I get the following error:
Error in get(v) : object 'v3' not found
Yet this works OK:
myDataTable[v3=="",v3:=NA]
And this also works OK:
myDataTable[get("v3")=="",get("v3")]
So it's when I try to do the assignment using get() combined with := it throws up the error. What am I doing wrong?
The data.table is very large (hence my preference for using data.table), so ideally I don't want to convert to data.frame and use a base R approach. I feel like this should be a very straightforward procedure in data.table, but I've really struggled to find anything conclusive in the documentation, on Google, or on here! Is this a bug or am I missing something obvious?
We can use set. According to ?set, it is very fast because the overhead of [.data.table is avoided.
library(data.table)
for (v in categorical_variables){
set(myDataTable, i=which(myDataTable[[v]]==""), j=v, value=NA)
}
However, this can be avoided at read time, as fread has the na.strings option (just like read.csv/read.table). We can specify the strings that need to be read as NA, e.g. if we want "" and $ to be read as NA:
myDataTable <- fread("yourfile.csv", na.strings=c("", "$"))
data
myDataTable <- data.table(v3=c(letters[1:3], ''),
v5 = 1:4, v7 = c('', '', letters[1:2]))
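If you prefer to stay with the [ ... := ... ] form from the question, one option is to keep get(v) in i but wrap the column name in parentheses on the left-hand side of :=. A minimal sketch, assuming the data.table package is installed and that all the listed columns are character (dt is a stand-in for myDataTable):

```r
library(data.table)

dt <- data.table(v3 = c(letters[1:3], ""),
                 v5 = 1:4,
                 v7 = c("", "", letters[1:2]))
categorical_variables <- c("v3", "v7")

for (v in categorical_variables) {
  # (v) tells := to use the column whose name is stored in v.
  dt[get(v) == "", (v) := NA_character_]
}
dt
```

For very wide tables the set() loop above is likely to be faster, since it skips the overhead of [.data.table on every iteration.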

What you can do with a data.frame that you can't with a data.table?

I just started using R, and came across data.table. I found it brilliant.
A very naive question: Can I ignore data.frame to use data.table to avoid syntax confusion between two packages?
From the data.table FAQ
FAQ 1.8 OK, I'm starting to see what data.table is about, but why didn't you enhance data.frame in R? Why does it have to be a new package?
As FAQ 1.1 highlights, j in [.data.table is fundamentally
different from j in [.data.frame. Even something as simple as
DF[,1] would break existing code in many packages and user code.
This is by design, and we want it to work this way for more
complicated syntax to work. There are other differences, too (see FAQ
2.17).
Furthermore, data.table inherits from data.frame. It is a
data.frame, too. A data.table can be passed to any package that
only accepts data.frame and that package can use [.data.frame
syntax on the data.table.
We have proposed enhancements to R wherever possible, too. One of
these was accepted as a new feature in R 2.12.0 :
unique() and match() are now faster on character vectors where all elements are in the global CHARSXP cache and have unmarked
encoding (ASCII). Thanks to Matthew Dowle for suggesting improvements
to the way the hash code is generated in unique.c.
A second proposal was to use memcpy in duplicate.c, which is much
faster than a for loop in C. This would improve the way that R copies
data internally (on some measures by 13 times). The thread on r-devel
is here : http://tolstoy.newcastle.edu.au/R/e10/devel/10/04/0148.html.
What are the smaller syntax differences between data.frame and data.table
DT[3] refers to the 3rd row, but DF[3] refers to the 3rd column
DT[3, ] == DT[3], but DF[ , 3] == DF[3] (somewhat confusingly in data.frame, whereas data.table is consistent)
For this reason we say the comma is optional in DT, but not optional in DF
DT[[3]] == DF[, 3] == DF[[3]]
DT[i, ], where i is a single integer, returns a single row, just like DF[i, ], but unlike a matrix single-row subset which returns a vector.
DT[ , j] where j is a single integer returns a one-column data.table, unlike DF[, j] which returns a vector by default
DT[ , "colA"][[1]] == DF[ , "colA"].
DT[ , colA] == DF[ , "colA"] (currently in data.table v1.9.8 but is about to change, see release notes)
DT[ , list(colA)] == DF[ , "colA", drop = FALSE]
DT[NA] returns 1 row of NA, but DF[NA] returns an entire copy of DF containing NA throughout. The symbol NA is type logical in R and is therefore recycled by [.data.frame. The user's intention was probably DF[NA_integer_]. [.data.table diverts to this probable intention automatically, for convenience.
DT[c(TRUE, NA, FALSE)] treats the NA as FALSE, but DF[c(TRUE, NA, FALSE)] returns NA rows for each NA
DT[ColA == ColB] is simpler than DF[!is.na(ColA) & !is.na(ColB) & ColA == ColB, ]
data.frame(list(1:2, "k", 1:4)) creates 3 columns, data.table creates one list column.
check.names is by default TRUE in data.frame but FALSE in data.table, for convenience.
stringsAsFactors is by default TRUE in data.frame but FALSE in data.table, for efficiency. Since a global string cache was added to R, character items are a pointer to the single cached string and there is no longer a performance benefit of converting to factor.
Atomic vectors in list columns are collapsed when printed using ", " in data.frame, but "," in data.table with a trailing comma after the 6th item to avoid accidental printing of large embedded objects.
In [.data.frame we very often set drop = FALSE. When we forget, bugs can arise in edge cases where single columns are selected and all of a sudden a vector is returned rather than a single column data.frame. In [.data.table we took the opportunity to make it consistent and dropped drop.
When a data.table is passed to a data.table-unaware package, that package is not concerned with any of these differences; it just works.
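A few of the differences listed above can be checked directly. A small sketch, assuming the data.table package is installed (DF, DT and keep are made-up names):

```r
library(data.table)

DF <- data.frame(a = 1:4, b = letters[1:4], c = c(10, 20, 30, 40))
DT <- as.data.table(DF)

dim(DF[3])   # 4 rows, 1 column: DF[3] is the third column
dim(DT[3])   # 1 row, 3 columns: DT[3] is the third row

# Double brackets extract a column as a vector in both.
identical(DT[[3]], DF[, 3])

# An NA in a logical row index yields an NA row in DF but is dropped in DT.
keep <- c(TRUE, NA, FALSE, FALSE)
nrow(DF[keep, ])   # 2
nrow(DT[keep])     # 1
```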
Small caveat
There may be cases where some packages use code that falls down when given a data.table. However, given that data.table is constantly being maintained to avoid such problems, any problems that may arise will be fixed promptly.
For example, see this question and prompt response.
From the NEWS for v 1.8.2
base::unname(DT) now works again, as needed by plyr::melt(). Thanks to
Christoph Jaeckel for reporting. Test added.
An as.data.frame method has been added for ITime, so that ITime can be passed to ggplot2
without error, #1713. Thanks to Farrel Buchinsky for reporting. Tests added.
ITime axis labels are still displayed as integer seconds from midnight; we don't know why ggplot2
doesn't invoke ITime's as.character method. Convert ITime to POSIXct for ggplot2, is one approach.

Resources