How to use apply function over character vectors inside data.table - r

I'm trying to get an idea of the availability of my data, which might look like this:
library(data.table)
DT <- data.table(id = rep(c("a","b"), each = 20),
                 time = rep(1991:2010, 2),
                 x = rbeta(40, shape1 = 1, shape2 = 2),
                 y = rnorm(40))
# I have some NA's (no gaps):
DT[id == "a" & time < 2000, x := NA]
DT[id == "b" & time > 2005, y := NA]
but it is much larger, of course. Ideally, I'd like to see a table like this:
     a          b
x    2000-2010  1991-2010
y    1991-2010  1991-2005
i.e. the period from the first to the last non-missing time point. I can get that for one variable:
DT[, availability_x := paste0(
       as.character(min(ifelse(!is.na(x), time, NA), na.rm = TRUE)),
       "-",
       as.character(max(ifelse(!is.na(x), time, NA), na.rm = TRUE))),
   by = id]
But in reality I want to do this for many variables, and all my attempts fail because I'm having a hard time passing a vector of column names to the data.table. My guess is that the answer goes in the direction of this or this, but my attempts to adapt those solutions to a vector of columns failed.
An apply function, for example, doesn't seem to evaluate the elements of a character vector:
cols <- c("x", "y")
availabilityfunction <- function(i) {
  DT[, paste0("avail_", i) := paste0(
         as.character(min(ifelse(!is.na(i), time, NA), na.rm = TRUE)),
         "-",
         as.character(max(ifelse(!is.na(i), time, NA), na.rm = TRUE))),
     by = id]
}
lapply(cols, availabilityfunction)

We can loop (lapply) through the columns of interest specified in .SDcols after grouping by 'id', create a logical index of non-NA elements (!is.na), find the numeric positions (which), take their range (i.e. the min and max), use that to subset the 'time' column, and paste the two time values together.
DT[, lapply(.SD, function(x) paste(time[range(which(!is.na(x)))],
                                   collapse = "-")),
   by = id, .SDcols = x:y]
# id x y
#1: a 2000-2010 1991-2010
#2: b 1991-2010 1991-2005
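As an aside, the lapply attempt in the question fails because is.na(i) tests the character string itself (e.g. "x"), never the column it names. A minimal fix, as a sketch, is to look the column up with get():
availabilityfunction <- function(i) {
  DT[, paste0("avail_", i) := {
       v <- get(i)  # look up the column named by the string i
       paste0(min(time[!is.na(v)]), "-", max(time[!is.na(v)]))
     }, by = id]
}
invisible(lapply(cols, availabilityfunction))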

Related

How to subset a data.table in i (eg finding NAs) based on a character vector of column names

This should be easy, but Google and I are failing. Say I have this data:
library(data.table)
mydata <- data.table(a = c(1, NA),
                     b = c(NA, NA),
                     pointer = c(1, 2))
and I want to get the rows where both a and b are NA. Of course I can do this manually, like:
mydata[is.na(a) & is.na(b)]
but the issue arises deep inside other code, and I want to do this based on a character vector (or list, or whatever; this is flexible) of the column names, such as:
myvector <- c("a","b")
Again, I can do this manually if I know how many elements the vector has:
mydata[is.na(get(myvector[1])) & is.na(get(myvector[2]))]
But in my application I don't know how many elements myvector has. How can I do this without specifying the number of entries in myvector? Essentially, I'm looking for something like with = FALSE, but for i in data.table. So I want to use myvector like this:
mydata[is.na(somefunction(myvector))]
I tried all kinds of paste0(myvector, collapse = " & ") combinations with get() or as.formula(), but it is getting me nowhere.
We can specify .SDcols with the vector of column names, loop over .SD (the Subset of Data.table), create a list of logical vectors with is.na, Reduce that list to a single logical vector with & (which combines the corresponding elements of the columns elementwise), and use that vector to subset the rows of the data:
library(data.table)
mydata[mydata[, Reduce(`&`, lapply(.SD, is.na)), .SDcols = myvector]]
Output:
# a b pointer
#1: NA NA 2
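To see what Reduce is doing here, a minimal walk-through (sketch) on this data:
# list of logical vectors, one per column in myvector
na_flags <- lapply(mydata[, ..myvector], is.na)
# na_flags$a is FALSE TRUE; na_flags$b is TRUE TRUE
Reduce(`&`, na_flags)  # FALSE TRUE -> only row 2 is NA in both columns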
Or use mget
mydata[mydata[, Reduce(`&`, lapply(mget(myvector), is.na))]]
Here is another solution assuming that myvector is a character vector:
library(data.table)
mydata[rowSums(!is.na(mydata[, ..myvector])) == 0]
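This keeps the rows where the count of non-NA values across the myvector columns is zero, i.e. all of them are NA, and should return the same row as above:
#     a  b pointer
# 1: NA NA       2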

How to separate a column of data.table given conditions

After reading some XML files, I am to create a data.table with specific column names, e.g. Name, Score, Medal, etc. However, I am confused about how I should separate the single column (see the code and results) into many, given certain criteria.
In my opinion, we either need a loop with a step, or a special function, but I do not know which function exactly :/
library(xml2)
library(data.table)

stage1 <- read_html("1973.html")
stage2 <- xml_find_all(stage1, ".//tr")
xml_text(stage2)
stage3 <- xml_text(xml_find_all(stage2, ".//td"))
stage3
DT <- data.table(stage3, keep.rownames = TRUE, check.names = TRUE, key = NULL,
                 stringsAsFactors = TRUE)
for (i in seq(from = 1, to = 1375, by = 11)) {
  if (is.numeric(DT[i, stage3]) == FALSE) {
    DT$Name <- DT[i, stage3]
  }
}
https://pp.userapi.com/c845220/v845220632/1678a5/IRykEniYiiA.jpg
(Screenshot: the first 20 of the 1375 rows, showing how the data.table looks now.)
What I need is to separate these results into columns: "Name" (e.g. Sergei Konyagin), "Country" (e.g. USSR), the scores for problems 1-8 (8 columns, respectively), and the medal. The loop I have written should, I think, step through the existing column in increments of 11 (since every name, country, etc. repeats every 11 rows) and transfer each value into a new column. Unfortunately, it doesn't work :/
Thanks in advance for your help!
Give this a shot.
First, load the required packages:
library(data.table)
library(stringr)  # just for the piping operator %>%
You would read in your own data table here, I am creating one as an example:
dat = c("Sergey", "USSR", 1, 2, 3, 4, 5, 6, 7, 8, "silver") %>% rep(125) %>% data.table
setnames(dat, "stage3")
As a quick note: I would not read your strings in as factors, as you do in your own code, because that can interfere with the conversion to numeric later.
The following will repeat itself to fill out the table. This only works if your table doesn't skip values. Also, it's not advisable to have column names that are numbers; it's better to give them proper names like "test1", "test2", etc.:
dat[, metadata := c("name", "country", 1:8, "medal")]  # whatever you want to name your future 11 columns
dat[, participant := 1:(.N / 11) %>% rep(each = 11)]   # same idea; can't have missing rows
Now, reshape and convert from strings to numeric where possible:
new.dat =
  dcast(dat, participant ~ metadata, value.var = "stage3")[, lapply(.SD, type.convert, as.is = TRUE)]
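Following the note above about numeric column names, the score columns ("1" through "8") could be renamed afterwards; a small sketch, where the "problem" prefix is just an assumed naming scheme:
# rename the columns "1".."8" produced by dcast (hypothetical names)
setnames(new.dat, as.character(1:8), paste0("problem", 1:8))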

Assign a vector to a specific existing row of data table in R

I've been looking through tutorials and documentation but have not figured out how to assign a vector of values, covering all columns, to one existing row of a data.table.
I start with an empty data.table that already has the correct number of columns and rows:
dt <- data.table(matrix(nrow=10, ncol=5))
Now I calculate some values for one row outside of the data.table and place them in a vector vec, e.g.:
vec <- rnorm(5)
How can I assign the values of vec to, e.g., the first row of the data.table while achieving good performance (since I also want to fill the other rows step by step)?
First you need to get the correct column types, as the NA matrix you've created is logical. The column types won't be magically changed by assigning numerics to them.
dt[, names(dt) := lapply(.SD, as.numeric)]
Then you can change the first row's values with
dt[1, names(dt) := as.list(vec)]
That said, if you begin with a numeric matrix you wouldn't have to change the column types.
dt <- data.table(matrix(numeric(), 10, 5))
dt[1, names(dt) := as.list(vec)]
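Since the question mentions filling the other rows step by step, note that inside a loop set() avoids the small per-call overhead of [.data.table; a minimal sketch under that assumption:
# fill each row in turn; rnorm(5) stands in for the real per-row computation
for (r in 1:nrow(dt)) {
  vec <- rnorm(5)
  set(dt, i = r, j = seq_along(dt), value = as.list(vec))
}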

data.table: transforming subset of columns with a function, row by row

How can one, given a data.table with mostly numeric values, transform just a subset of the columns and put them back into the original data.table? Generally, I don't want to add any summary statistic as a separate column; I just want to replace the original columns with their transformed versions.
Assume we have a DT with 1 column of names and 10 columns of numeric values. I am interested in applying base R's scale function to each row of that data table, but only to those 10 numeric columns.
And to expand on this: what if I have a data table with more columns and I need to use column names to tell the scale function which data points to apply the function to?
With regular data.frame I would just do:
df[, grep("keyword", colnames(df))] <- t(apply(df[, grep("keyword", colnames(df))], 1, scale))
I know this looks cumbersome, but it has always worked for me. However, I can't figure out a simple way to do it with data.tables.
I would imagine something like this working for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example of updating columns with their per-row-scaled versions:
# dt is a data.table
dt[, grep("keyword", colnames(dt), value = TRUE) :=
     as.data.table(t(apply(dt[, grep("keyword", colnames(dt)), with = FALSE], 1, scale)))]
Too bad it needs the as.data.table part inside, as the transposed value returned by apply is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute per-row mean/sd across the keyword columns:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))), by = 1:nrow(DT),
              .SDcols = grep("keyword", colnames(DT))]
# scale:
DT[, grep("keyword", colnames(DT), value = TRUE) :=
     lapply(.SD, function(x) (x - mean_sd$V1) / mean_sd$V2),
   .SDcols = grep("keyword", colnames(DT))]
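As a quick sanity check (a sketch, assuming the keyword columns are now scaled in place), each row should have mean ~0 and sd ~1:
scaled <- as.matrix(DT[, grep("keyword", colnames(DT)), with = FALSE])
summary(apply(scaled, 1, mean))  # should be centered near 0
summary(apply(scaled, 1, sd))    # should be near 1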
PART 1: The one-line solution you requested:
# First let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line solution, version 1: use magrittr and the pipe operator:
library(magrittr)  # for the . %>% functional sequence
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = FALSE))),
   .SDcols = grep("keyword", colnames(DT))]
One-line solution, version 2: explicitly define the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
     (lapply(.SD, function(x) { scale(x, center = FALSE) })),
   .SDcols = grep("keyword", colnames(DT))]
Modification: if you want to do it by group, just add by =:
DT[, (grep("keyword", colnames(DT))) :=
     (lapply(.SD, function(x) { scale(x, center = FALSE) })),
   .SDcols = grep("keyword", colnames(DT)),
   by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A step-by-step solution (more general and easier to follow)
The above solution clearly works for the narrow example given. As a public service, I am posting this for anyone who is still searching for a way that:
- feels a bit less condensed;
- is easier to understand;
- is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly well here).
Here is the step-by-step way of doing the same:
Get the data into data.table format:
# Suppose you start with a data.frame df; convert it to a data.table called DT
DT <- as.data.table(df)
Then, handle the column names:
# Get the vector of column names matching the keyword
# (value = TRUE returns the names themselves rather than their positions)
Reference.Cols <- grep("keyword", colnames(df), value = TRUE)
# For people who want to store both transformed and untransformed values:
# create a parallel set of new column names
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
Define the function you want to apply:
# Here, normalize is just a plain z-score function:
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
After that, it is trivial in data.table syntax:
# Voila: a newly created set of columns containing the transformed values
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
# New values are stored in the columns named in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
# Untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at your code after some interval, this more step-by-step, more general approach can be helpful.

Find and replace values with data.table in R?

After reading about benchmarks and speed comparisons of R methods, I am in the process of converting to the speedy data.table package for data manipulation on my large data sets.
I am having trouble with a particular task:
For a certain observed variable, I want to check, for each station, if the absolute lagged difference (with lag 1) is greater than a certain threshold. If it is, I want to replace it with NA, else do nothing.
I can do this for the entire data.table using the set command, but I need to do this operation by station.
Example:
# Example data. Assume the columns are ordered by date.
set.seed(1)
DT <- data.table(station = sample.int(n = 3, size = 1e6, replace = TRUE),
                 wind = rgamma(n = 1e6, shape = 1.5, rate = 1/10),
                 other = rnorm(n = 1e6),
                 key = "station")
# My attempt
max_rate <- 35
set(DT, i = which(c(NA, abs(diff(DT[['wind']]))) > max_rate),
    j = which(names(DT) == 'wind'), value = NA)
# The results
summary(DT)
The trouble with my implementation is that I need to do this by station, and I do not want to take the lagged difference between the last reading of station 1 and the first reading of station 2.
I tried to use the by=station operator within the [ ], but I am not sure how to do this.
One way is to get the row numbers you have to replace using the special variable .I, and then assign NA to those rows by reference using the := operator (or set).
# get the row numbers (with abs(), since the question asks for the absolute lagged difference)
idx = DT[, .I[which(abs(c(NA, diff(wind))) > 35)], by = station][, V1]
# then assign by reference
DT[idx, wind := NA_real_]
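On recent data.table versions (shift() was added in 1.9.6), the same lag can be written without the manual c(NA, diff(...)); a sketch:
# shift(wind) is the previous value within each station; which() drops the leading NA
idx = DT[, .I[which(abs(wind - shift(wind)) > 35)], by = station]$V1
DT[idx, wind := NA_real_]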
This FR #2793, filed by @eddi, will when/if implemented provide a much more natural way to accomplish this task, by allowing the expression that yields the corresponding indices on the LHS and the replacement value on the RHS. That is, in the future, we should be able to do:
# in the future - a more natural way of doing the same operation shown above
DT[, wind[which(abs(c(NA, diff(wind))) > 35)] := NA_real_, by = station]
