data.table in R: ",", keyby, and "." confusion

I'm rewriting someone else's R code into python, and I don't know R.
So I'm trying to decipher what things mean.
What does this line mean?
kable(DT[, .N, keyby=.(target=get(y))], format="html")
So DT is the data.table itself, and y is a column within DT. But I think it's trying to create a table wherever y exists?
There's also this follow up line:
id_bady1= DT[! get(y) %in% c(0,1), get(id)]
The R documentation says that get returns the object that matches the input, but how does that work when there are multiple matches?

The content of y is the name of a column of the data.table; see:
library("data.table")
DT <- mtcars
setDT(DT)
y <- "cyl"
DT[, .N, keyby=.(target=get(y))]
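For reference, on mtcars this returns one row per distinct value of cyl together with its count:
   target  N
1:      4 11
2:      6  7
3:      8 14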
IMHO it uses complete matching here (not partial matching):
DT[, cylA:=7] # construct a second column that begins with "cyl"
DT[, .N, keyby=.(target=get(y))]
y <- "cy" ## no complete matching possible
DT[, .N, keyby=.(target=get(y))]
### Error in get(y) : object 'cy' not found
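To connect this back to the question: DT[, .N, keyby=.(target=get(y))] counts rows per distinct value of the column named in y (.N is the per-group row count; keyby groups by that column, labels it target, and sorts/keys the result), and kable(..., format="html") merely renders that summary as an HTML table; in pandas terms it is roughly DT.groupby(y).size(). The follow-up line then selects the id column for rows where the y column is neither 0 nor 1. A minimal sketch, assuming id (like y) holds a column name as a string; it is not defined in the question:
# hypothetical: assume id names a column, here "gear"
id <- "gear"
y <- "cyl"
id_bady1 <- DT[!get(y) %in% c(0, 1), get(id)]
# roughly equivalent to base R: DT$gear[!(DT$cyl %in% c(0, 1))]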


Understanding the meaning of [..., with=F][[1]]

I am doing a practice exercise based on the problems and solutions for data.table in R. The problem was: get the row and column positions of missing values in a data table. The solution code uses [..., with=F][[1]]. I am not understanding this part of the code and am hoping for an expert opinion to make my concept clear.
for (i in 1:NROW(DT)) {
  for (j in 1:NCOL(DT)) {
    curr_value <- DT[i, j, with = F][[1]]
    # ... (rest of the loop)
  }
}
I can understand the first two lines, but not the ,with=F and the [[1]] part.
What does with=F mean, and why is [[1]] used after it? Why the double bracket with 1?
Generally in data.table, with = FALSE allows you to select columns named in a variable.
Consider the following minimal example,
library(data.table)
dt <- data.table(mtcars)
Let's select the following columns from dt
cols <- c(1, 7)
The following command will produce an error
dt[, cols]
Instead you can use with = F
dt[, cols, with = F]
From ?data.table
When with=TRUE (default), j is evaluated within the frame of the data.table;
i.e., it sees column names as if they are variables.
A shorter alternative is to use
dt[, ..cols]
See also Why does “..” work to pass column names in a character vector variable?
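As for the [[1]]: dt[i, j, with = FALSE] returns a one-column data.table, not a bare value, and [[1]] extracts that first (and only) column as a plain vector. A minimal sketch using the dt defined above:
dt[1, 1, with = FALSE]       # a 1x1 data.table (column mpg)
dt[1, 1, with = FALSE][[1]]  # the bare value: 21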

Mean imputation issue with data.table

Trying to impute missing values in all numeric columns using this loop:
for (i in 1:ncol(df)) {
  if (is.numeric(df[, i])) {
    df[is.na(df[, i]), i] <- mean(df[, i], na.rm = TRUE)
  }
}
When data.table package is not attached then code above is working as it should. Once I attach data.table package, then the behaviour changes and it shows me the error:
Error in `[.data.table`(df, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i'
is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This
difference to data.frame is deliberate and explained in FAQ 1.1.
I tried '..i' and 'with=FALSE' everywhere, but with no success. Actually it doesn't even get past the first is.numeric condition.
The data.table syntax is a little different in such a case. You can do it as follows:
num_cols <- names(df)[sapply(df, is.numeric)]
for (col in num_cols) {
  set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm = TRUE))
}
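A minimal, self-contained demo of this set() approach (the sample data here is hypothetical):
library(data.table)
df <- data.table(a = c(1, NA, 3), b = c("x", "y", NA))
num_cols <- names(df)[sapply(df, is.numeric)]
for (col in num_cols) {
  set(df, i = which(is.na(df[[col]])), j = col, value = mean(df[[col]], na.rm = TRUE))
}
df  # the NA in `a` becomes 2; the character column `b` is left alone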
Or, if you want to keep using your existing loop, you can just turn the data back to data.frame using
setDF(df)
An alternative answer to this question, which I came up with while working on a similar problem at a larger scale: one might be interested in avoiding for loops by using the [.data.table method.
DF[i, j, by, on, ...]
First we'll create a function that can perform the imputation
impute_na <- function(x, val = mean, ...) {
  if (!is.numeric(x)) return(x)          # leave non-numeric columns untouched
  na <- is.na(x)
  if (is.function(val))
    val <- val(x[!na])                   # e.g. the mean of the non-missing values
  if (!is.numeric(val) || length(val) > 1)
    stop("'val' needs to be either a function or a single numeric value!")
  x[na] <- val
  x
}
To perform the imputation on the data frame, one could create and evaluate an expression in the data.table environment, but for simplicity of the example here we'll overwrite using <-:
DF <- DF[, lapply(.SD, impute_na)]
This will impute the mean across all numeric columns and keep any non-numeric columns as is. If we wished to impute another value (like... 42 or whatever), or we have some grouping variable within which we want the mean computed, these can be included as well:
DF <- DF[, lapply(.SD, impute_na, val = 42)]
DF <- DF[, lapply(.SD, impute_na), by = group]
Which would impute 42 and the within-group mean, respectively.
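A small hypothetical demo of the grouped version (column and group names are made up):
library(data.table)
DF <- data.table(group = c("a", "a", "b", "b"), x = c(1, NA, 10, NA))
DF[, lapply(.SD, impute_na), by = group]
# group a: NA replaced by mean(1) = 1; group b: NA replaced by mean(10) = 10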

R: .N in data.table returning 0 columns

I have searched and can't find a similar question. I'm trying to count the rows in a data.frame in which the value of the VAL variable is equal to 24.
I downloaded the data from https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and read it to R using read.table:
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(fileUrl, destfile = "./Housing_Data/Housingdata.csv", method = "curl")
DT <- read.table("./Housing_Data/Housingdata.csv", sep = ",", header = TRUE)
I tried
DT[, .N, by=VAL]
which returned:
Error in `[.data.frame`(DT, , .N, by = VAL) : unused argument (by = VAL)
DT[, .N]
returns:
data frame with 0 columns and 6496 rows
However, when I run head(DT) it returns as if the columns are loaded correctly.
I'm really not sure where I'm going wrong here, can anyone point me in the right direction?
Looks like you're trying to use data.table operations on a data.frame. And your syntax looks a little off for the data.table as well.
This is how you would count the rows where VAL == 24 using data.frame syntax:
nrow(DT[DT$VAL == 24, ])
If you want to do this with a data.table, you'll first have to convert DT to a data.table. Run this:
library(data.table)
setDT(DT)
DT[, .(Count = .N), by = .(VAL)]
I am using the iris data set in R as an example.
Say you want to retain only those records which have Sepal.Length equal to 5.1.
So you would have
nrow(iris[iris$Sepal.Length == 5.1, ])
or
dim(iris[iris$Sepal.Length == 5.1, ])[1]
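A simpler base-R idiom for the same count, assuming the column contains no NAs, is to sum the logical vector directly:
sum(iris$Sepal.Length == 5.1)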
I don't know what the full URL looks like, but here is one option for you.
df <- read.csv("http://www.football-data.co.uk/mmz4281/1516/E0.csv",
header = TRUE, stringsAsFactors = TRUE)[1:6]
Here is another way to do it.
library(dplyr)
MyData2 <- read.csv(file="http://www.grex.org/~ev/breweries_geocode.csv", header=TRUE, sep=",")
I just realised I never concluded this by posting the solution. Kristofersen correctly pointed out that I was attempting to use a data.table command on a data.frame. The simple solution he suggested was to convert it:
library(data.table)
setDT(DT)
DT[, .N, by=VAL]
The other option also works - using fread to load the data as a data.table in the first place. This is probably preferable, as it's more scalable.
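A sketch of that fread route, using the path from the question:
library(data.table)
DT <- fread("./Housing_Data/Housingdata.csv")  # returns a data.table directly
DT[, .N, by = VAL]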
Drj also provided a good answer allowing me to perform the same operation with data.frame commands; however, I neglected to specify that I'm using data.table, as I need to create new columns a lot in this project, and data.table makes that really easy using the := operator.
Thanks all for the answers.

data.table: transforming subset of columns with a function, row by row

How can one, having a data.table with mostly numeric values, transform just a subset of columns and put them back to the original data table? Generally, I don't want to add any summary statistic as a separate column, just exchange the transformed ones.
Assume we have a DT. It has 1 column with names and 10 columns with numeric values. I am interested in using the "scale" function of base R on each row of that data.table, but only applied to those 10 numeric columns.
And to expand on this. What if I have a data table with more columns and I need to use column names to tell the scale function on which datapoints to apply the function?
With regular data.frame I would just do:
df[,grep("keyword",colnames(df))] <- t(apply(df[,grep("keyword",colnames(df))],1,scale))
I know this looks cumbersome but always worked for me. However, I can't figure out a simple way to do it in data.tables.
I would imagine something like this working for data.tables:
dt[,grep("keyword",colnames(dt)) := scale(grep("keyword",colnames(dt)),center=F)]
But it doesn't.
EDIT:
Another example that updates the columns with their per-row-scaled versions:
# dt is a data.table object
dt[,grep("keyword",colnames(dt),value=T) := as.data.table(t(apply(dt[,grep("keyword",colnames(dt)),with=F],1,scale)))]
Too bad it needs the "as.data.table" part inside, as the transposed value from the apply function is a matrix. Maybe data.table should automatically coerce matrices into data.tables upon updating of columns?
If what you need is really to scale by row, you can try doing it in 2 steps:
# compute the per-row mean and sd across the matching columns:
mean_sd <- DT[, .(mean(unlist(.SD)), sd(unlist(.SD))),
              by = 1:nrow(DT), .SDcols = grep("keyword", colnames(DT))]
# scale each matching column using those per-row statistics:
DT[, grep("keyword", colnames(DT), value = TRUE) :=
     lapply(.SD, function(x) (x - mean_sd$V1) / mean_sd$V2),
   .SDcols = grep("keyword", colnames(DT))]
PART 1: The one line solution you requested:
# First, let's take a look at the data in the columns:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 1: Use magrittr and the pipe operator:
DT[, (grep("keyword", colnames(DT))) := (lapply(.SD, . %>% scale(., center = F))),
   .SDcols = grep("keyword", colnames(DT))]
One-line Solution Version 2: Explicitly defines the function for the lapply:
DT[, (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)})),
.SDcols = grep("corrupt", colnames(DT))]
Modification - If you want to do it by group, just use the by =
DT[ , (grep("keyword", colnames(DT))) :=
(lapply(.SD, function(x){scale(x, center = F)}))
, .SDcols = grep("keyword", colnames(DT))
, by = Grouping.Variable]
You can verify:
# Verify that the columns have updated values:
DT[, .SD, .SDcols = grep("keyword", colnames(DT))]
PART 2: A Step-by-Step Solution: (more general and easier to follow)
The above solution works clearly for the narrow example given.
As a public service, I am posting this for anyone who is still searching for a way that
feels a bit less condensed;
is easier to understand;
is more general, in the sense that you can apply any function you wish without having to compute the values into a separate data table first (which, n.b., does work perfectly here).
Here's the step-by-step way of doing the same:
Get the data into Data.Table format:
# You get a data.table called DT
DT <- as.data.table(df)
Then, Handle the Column Names:
# Get the list of names
Reference.Cols <- grep("keyword", colnames(df), value = TRUE) # value = TRUE returns names, not indices
# FOR PEOPLE who want to store both transformed and untransformed values.
# Create new column names
Reference.Cols.normalized <- Reference.Cols %>% paste(., ".normalized", sep = "")
Define the function you want to apply
# Define the function you wish to apply;
# here, normalize is just a standard z-scoring function:
normalize <- function(X,
                      X.mean = mean(X, na.rm = TRUE),
                      X.sd = sd(X, na.rm = TRUE)) {
  X <- (X - X.mean) / X.sd
  return(X)
}
After that, it is trivial in Data.Table syntax:
# Voila, the newly created set of columns that contain the transformed values:
DT[, (Reference.Cols.normalized) := lapply(.SD, normalize), .SDcols = Reference.Cols]
Verify:
The new values are stored in the columns named in Reference.Cols.normalized:
DT[, .SD, .SDcols = Reference.Cols.normalized]
The untransformed values are left unharmed:
DT[, .SD, .SDcols = Reference.Cols]
Hopefully, for those of you who return to look at code after some interval, this more step-by-step / general approach can be helpful.

Remove multiple columns from data.table

What's the correct way to remove multiple columns from a data.table? I'm currently using the code below, but was getting unexpected behavior when I accidentally repeated one of the column names. I wasn't sure if this was a bug, or if I shouldn't be removing columns this way.
library(data.table)
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","y") := NULL]
names(DT)
[1] "z"
The above works fine, but
DT <- data.table(x = letters, y = letters, z = letters)
DT[ ,c("x","x") := NULL]
names(DT)
[1] "z"
This looks like a solid, reproducible bug. It's been filed as Bug #2791.
It appears that repeating the column attempts to delete the subsequent columns.
If no columns remain, then R crashes.
UPDATE : Now fixed in v1.8.11. From NEWS :
Assigning to the same column twice in the same query is now an error rather than a crash in some circumstances; e.g., DT[,c("B","B"):=NULL] (delete by reference the same column twice). Thanks to Ricardo (#2751) and matt_k (#2791) for reporting. Tests added.
This Q has been answered, but regard this as a side note.
I prefer the following syntax to drop multiple columns
DT[ ,`:=`(x = NULL, y = NULL)]
because it matches the one to add multiple columns (variables)
DT[ ,`:=`(x = letters, y = "Male")]
This also checks for duplicated column names, so trying to drop x twice will throw an error message.
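A quick hypothetical check of that behaviour (the exact error wording may vary across data.table versions):
DT <- data.table(x = letters, y = letters, z = letters)
DT[, `:=`(x = NULL, x = NULL)]
# errors, complaining about assigning to the same column twice, instead of silently deleting y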
