How to transform data.table columns, indexed by position, by reference? - r

I have a data.table that houses several columns of factors. I'd like to convert 2 columns originally read as factors to their original numeric values. Here's what I've tried:
data[, c(4,5):=c(as.numeric(as.character(4)), as.numeric(as.character(5))), with=FALSE]
This gives me the following warnings:
Warning messages:
1: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Supplied 2 items to be assigned to 7 items of column 'Bentley (R)' (recycled leaving remainder of 1 items).
2: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Supplied 2 items to be assigned to 7 items of column 'Sparks (D)' (recycled leaving remainder of 1 items).
3: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Coerced 'double' RHS to 'integer' to match the factor column's underlying type. Character columns are now recommended (can be in keys), or coerce RHS to integer or character first.
4: In `[.data.table`(data, , `:=`(c(4, 5), c(as.numeric(as.character(4)), :
Coerced 'double' RHS to 'integer' to match the factor column's underlying type. Character columns are now recommended (can be in keys), or coerce RHS to integer or character first.
Also, I can tell the conversion has not succeeded because the 4th and 5th columns are still factors after this code has run.
As an alternative, I tried this code, which won't run at all:
data[, ':=' (4=c(as.numeric(as.character(4)), 5 = as.numeric(as.character(5)))), with=FALSE]
Finally, I tried referencing the column names via colnames:
data[ , (colnames(data)[4]) := as.numeric(as.character(colnames(data)[4]))]
This runs but results in a row of NAs as well as the following errors:
Warning messages:
1: In eval(expr, envir, enclos) : NAs introduced by coercion
2: In `[.data.table`(data, , `:=`((colnames(data)[4]), as.numeric(as.character(colnames(data)[4])))) :
Coerced 'double' RHS to 'integer' to match the factor column's underlying type. Character columns are now recommended (can be in keys), or coerce RHS to integer or character first.
3: In `[.data.table`(data, , `:=`((colnames(data)[4]), as.numeric(as.character(colnames(data)[4])))) :
RHS contains -2147483648 which is outside the levels range ([1,6]) of column 1, NAs generated
I need to do this by position and not by column name, since the column name will depend on the URL. What's the proper way to transform columns by position using data.table?
I also have a related query: how do I transform numbered columns relative to other numbered columns? For example, if I want to set the 3rd column to be equal to 45 minus the value of the 3rd column plus the value of the 4th column, how would I do that? Is there some way to distinguish between a literal number and a column number? I know something like this is not the way to go:
dt[ , .(4) = 45 - .(3) + .(4), with = FALSE]
So then how can this be done?

If you want to assign by reference and position, you need to get the column names to assign to as a character vector or the column numbers as an integer vector and use .SDcols (at least in data.table 1.9.4).
First a reproducible example:
library(data.table)
DT <- data.table(iris)
DT[, c("Sepal.Length", "Petal.Length") := list(factor(Sepal.Length), factor(Petal.Length))]
str(DT)
Now let's convert the columns:
DT[, names(DT)[c(1, 3)] := lapply(.SD, function(x) as.numeric(as.character(x))),
.SDcols = c(1, 3)]
str(DT)
Alternatively:
DT[, c(1,3) := lapply(.SD, function(x) as.numeric(as.character(x))), .SDcols=c(1,3)]
str(DT)
Note that := expects a vector of column names or positions on the left side and a list on the right side.
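The question's second part, arithmetic on columns referenced by position, can be handled the same way: translate the positions to names and reach the columns through .SD. A minimal sketch (not part of the original answer; it assumes the columns involved are already numeric):
# set column 3 to 45 minus column 3 plus column 4, by position
cols <- names(DT)
DT[, (cols[3]) := 45 - .SD[[1]] + .SD[[2]], .SDcols = cols[3:4]]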

Related

Finding max value for multiple columns, removing the duplicate ids in R

[tab1 (original table) and tab2 (expected result) were posted as images and are not reproduced here]
library(data.table)
setDT(df00)
df00[, lapply(.SD, max), by=.("study_id")]
Getting the following error:
Error in `[.data.table`(df00, , lapply(.SD, max), by = .("study_id")) :
The items in the 'by' or 'keyby' list are length(s) (1). Each must be length 1638; the same length as there are rows in x (after subsetting if i is provided).
source: Taking only the maximum values of duplicate IDs for all columns of a data frame in R
Please help me find the solution. Thank you.
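The error comes from by = .("study_id"): inside .() a quoted string is a length-1 character constant, not a grouping column. A likely fix (a sketch, assuming df00 really has a study_id column and the remaining columns are of types max() can handle):
library(data.table)
setDT(df00)
# group by the column itself (unquoted inside .()), or pass its name as a string
df00[, lapply(.SD, max, na.rm = TRUE), by = .(study_id)]
df00[, lapply(.SD, max, na.rm = TRUE), by = "study_id"]   # equivalent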

Select odd rows from a specific column in a dataframe

I have a large df with a specific numeric column named Amount.
df = data.frame(Amount = c(as.numeric(1:14)), stringsAsFactors = FALSE)
I want to select the odd rows. So far I have tried the syntax below, but I always get these error messages:
df$Amount[c(FALSE, TRUE),]
Error in df$Amount[c(FALSE, TRUE), ] : incorrect number of dimensions
seq_len(ncol(df$Amount)) %% 2
Error in seq_len(ncol(df$Amount)) :
argument must be coercible to non-negative integer
In addition: Warning message:
In seq_len(ncol(df$Amount)) :
first element used of 'length.out' argument
odd = seq(1,14,1)
df$Amount[odd,1]
Error in P20$Journal.Amount[even, 1] : incorrect number of dimensions
P20$Journal.Amount[seq(2,length(14), 2),]
Error in seq.default(2, length(14), 2) : wrong sign in 'by' argument
My question is: is there a way I can do this directly? I tried the solutions from previously posted questions, but so far I keep getting these error messages.
Base R preferably.
The row/column index is only used when the object has dim attributes; a vector doesn't have them.
is.vector(df$Amount)
If we extract the vector, then just use the row index
df$Amount[c(FALSE, TRUE)]
If we want to subset the rows of the dataset,
df[c(FALSE, TRUE), 'Amount', drop = FALSE]
In the above code we specify the row index (i), the column index or name (j), and drop (see ?Extract: drop = TRUE is the default for data.frame, so we need drop = FALSE to keep the dimensions rather than coerce to a vector).
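An index-based alternative (a sketch, not from the original answer) that targets the odd-numbered rows explicitly:
# positions 1, 3, 5, ... up to the last row
odd <- seq(1, nrow(df), by = 2)
df$Amount[odd]                    # values from the odd rows as a vector
df[odd, "Amount", drop = FALSE]   # same rows kept as a one-column data.frame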

How to properly apply RowMeans()? "X is not numeric" error

I have two columns within OtherIncludedClean, and I would like to add a third column, OtherIncludedClean$Mean, containing the row means; however, my efforts have been in vain.
I have tried:
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = FALSE, dims = 1)
But, the above reports the error:
"Error in base::rowMeans(x, na.rm = na.rm, dims = dims, ...) :
'x' must be numeric"
I have also attempted:
OtherIncludedClean$mean <- apply(OtherIncludedClean, 1, function(x) { mean(x, na.rm=TRUE) })
Which reports this error:
"1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA"
For all 141 rows.
Any and all help appreciated. Thank you.
My columns are "X__1" and "X__2".
When we get the error "'x' must be numeric", it is better to check the column types. An easy way to do that is
str(OtherIncludedClean)
If we find that the columns are not numeric/integer but character/factor, we need to convert them to numeric (assuming most of the values in a column are numeric and only one or two non-numeric elements changed the type).
The way to convert is as.numeric. For a single column of character class, use as.numeric(data$columnname); for factor class, use
as.numeric(as.character(data$columnname))
Here, we need to change all the columns to numeric (assuming they are of character class). To do that, loop through the columns with lapply and assign the output back to the dataset:
OtherIncludedClean[] <- lapply(OtherIncludedClean, as.numeric)
and then apply rowMeans.
If only a subset of the columns are character, then we only need to loop through those columns:
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1], as.numeric)
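Putting the pieces together, a minimal sketch using the X__1 and X__2 column names from the question (the sample values here are made up):
OtherIncludedClean <- data.frame(X__1 = c("1", "2", "3"),
                                 X__2 = c("4", "5", "6"),
                                 stringsAsFactors = FALSE)
# convert only the non-numeric columns, then take the row means
i1 <- !sapply(OtherIncludedClean, is.numeric)
OtherIncludedClean[i1] <- lapply(OtherIncludedClean[i1], as.numeric)
OtherIncludedClean$mean <- rowMeans(OtherIncludedClean, na.rm = TRUE)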

How to use apply function over character vectors inside data.table

I'm trying to get an idea of the availability of my data which might look like:
library(data.table)
DT <- data.table(id = rep(c("a", "b"), each = 20),
                 time = rep(1991:2010, 2),
                 x = rbeta(40, shape1 = 1, shape2 = 2),
                 y = rnorm(40))
#I have some NA's (no gaps):
DT[id=="a"&time<2000,x:=NA]
DT[id=="b"&time>2005,y:=NA]
but is much larger of course. Ideally, I'd like to see a table like this:
  a          b
x 2000-2010  1991-2010
y 1991-2010  1991-2005
so the non-missing minimum to the non-missing maximum time period. I can get that for one variable:
DT[,availability_x:=paste0(
as.character(min(ifelse(!is.na(x),time,NA),na.rm=T)),
"-",
as.character(max(ifelse(!is.na(x),time,NA),na.rm=T))),
by=id]
But in reality, I want to do that for many variables. All my attempts to do that fail, however, because I'm having a hard time communicating a vector of columns to the data table. My guess is that it goes in the direction of this or this but my attempts to adapt these solutions to a vector of columns failed.
An apply function, for example, doesn't seem to evaluate the elements of a character vector:
cols <- c("x","y")
availabilityfunction <- function(i){
  DT[, paste0("avail_", i) := paste0(
    as.character(min(ifelse(!is.na(i), time, NA), na.rm = T)),
    "-",
    as.character(max(ifelse(!is.na(i), time, NA), na.rm = T))),
    by = id]
}
lapply(cols,availabilityfunction)
We can loop (lapply) through the columns of interest specified in .SDcols after grouping by 'id': create a logical index of non-NA elements (!is.na), find the numeric positions (which), take the range (i.e. the min and max), use that to subset the 'time' column, and paste the two time values together.
DT[, lapply(.SD, function(x) paste(time[range(which(!is.na(x)))],
collapse="-")), by = id, .SDcols = x:y]
# id x y
#1: a 2000-2010 1991-2010
#2: b 1991-2010 1991-2005
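If the goal is to add avail_x/avail_y columns to DT by reference, as in the original attempt, the same idea works with := (a sketch, not part of the original answer):
cols <- c("x", "y")
DT[, (paste0("avail_", cols)) := lapply(.SD, function(x)
     paste(time[range(which(!is.na(x)))], collapse = "-")),
   by = id, .SDcols = cols]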

Error deleting factor column in empty data.table

If I have an empty data.table with a factor column, the factor column can't be removed with the := NULL operator. Integer and character columns have no problems.
library(data.table)
DT <- data.table(numbers = integer(0),
                 char.letters = character(0),
                 factor.letters = factor(character(0)))
DT[, factor.letters := NULL]
I get the following error:
Error in `[.data.table`(DT, , `:=`(factor.letters, NULL)) :
Can't assign to column 'factor.letters' (type 'factor') a value of type 'NULL' (not character, factor, integer or numeric)
Note that DT[, char.letters := NULL] and DT[, numbers := NULL] do not produce errors.
Since factor columns behave differently from character and integer columns, I suspect this is a problem with data.table, but am I doing anything incorrectly?
Edit: Previous example used join to create the empty data.table (which was then called join), but it can be reproduced just as easily by creating it directly.
Thanks for reporting. Now fixed in v1.8.9:
Deleting a (0-length) factor column using := NULL on an empty data.table now works, #4809. Thanks to Frank Pinter for reporting. Test added.
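On data.table versions predating that fix, one workaround (a sketch, not from the original thread) is to rebuild the table without the factor column instead of deleting it by reference:
# keep every column except factor.letters (copies instead of deleting in place)
DT <- DT[, setdiff(names(DT), "factor.letters"), with = FALSE]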
