subsetting columns in a datatable [duplicate] - r

Example R data frame:
n = c(2, 3, 5)
s = c("aa", "bb", "cc")
b = c(TRUE, FALSE, TRUE)
df = data.frame(n, s, b)
I can easily subset columns in a dataframe like this:
df.smaller <- df[c(1,2)]
n s
1 2 aa
2 3 bb
3 5 cc
Very handy!
However, with data.table (which I expected to make this easier) I have not found an equally quick way to do the same. How can I quickly and easily do the same with a data.table?
dt = data.table(df)
dt.smaller <- dt[c(1,2)]
n s b
1: 2 aa TRUE
2: 3 bb FALSE
This returns the first two rows instead. Probably it is just a comma or something I have to change, but I can't figure it out.

We need to use with = FALSE:
dt[, 1:2, with = FALSE]
This is explained in ?data.table:
with: By default with=TRUE and j is evaluated within the frame of x;
column names can be used as variables.
When with=FALSE j is a character vector of column names, a numeric
vector of column positions to select or of the form startcol:endcol,
and the value returned is always a data.table. with=FALSE is often
useful in data.table to select columns dynamically
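As a rough illustration of the documentation excerpt above, the sketch below rebuilds the question's table and selects the first two columns in a few equivalent ways. The variable name cols is only an illustrative choice. Recent data.table releases reportedly also accept dt[, 1:2] and dt[, ..cols] without with = FALSE, but that depends on the installed version, so the code sticks to forms that do not rely on it.

```r
library(data.table)

# Same data as in the question
dt <- data.table(n = c(2, 3, 5),
                 s = c("aa", "bb", "cc"),
                 b = c(TRUE, FALSE, TRUE))

# By position, as in the answer: j is read as column numbers, not an expression
by_position <- dt[, 1:2, with = FALSE]

# By name, using a character vector
by_name <- dt[, c("n", "s"), with = FALSE]

# By unquoted name, using the .() alias for list()
by_list <- dt[, .(n, s)]

# Dynamically, from a variable holding the column names (illustrative name)
cols <- c("n", "s")
by_sdcols <- dt[, .SD, .SDcols = cols]
```

All four results are data.tables with the columns n and s and all three rows.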

Related

R: How to use row-wise apply with individual columns as input variables?

I create a list with all unique combinations of two columns in a data.table.
Based on all unique combinations in this list I want to take samples from a data.table.
I already wrote a function for this and I know that I could use a for-loop or a foreach-loop.
How could the function below be used with "apply" or one of its variants?
Thank you very much :-)
MWE:
library(data.table)
library(dplyr)  # provides %>% and sample_n()
dt <- data.table(filename = c("a", "b", "c", "c", "a"), class = c(1,2,1,1,4), var = c(1,2,3,4,5))
unique_combinations <- unique(dt[, c("filename", "class")])
# Arguments are named fname/cls so that they do not shadow the columns inside dt[...]
take_samples <- function(dt, fname, cls, n) {
  dt %>%
    .[filename == fname & class == cls] %>%
    sample_n(size = n, replace = FALSE)
  # TBD: append result to other data.table
}
# How to do the following call automatically for every unique combination using apply?
# (R vectors are indexed from 1, not 0)
take_samples(dt, unique_combinations$filename[1], unique_combinations$class[1], 1)
I think you need to group with by:
n <- 1
dt[, .SD[sample(.N, size = n, replace = TRUE)], by = .(filename, class)]
Explanation
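Grouping by .(filename, class) evaluates the expression once for each unique combination of the two columns.
.SD holds the subset of rows belonging to the current group.
Here's what the output looks like (the sample is random, so the row for filename c may show var 3 or 4):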
filename class var
1: a 1 1
2: b 2 2
3: c 1 4
4: a 4 5
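The question asked for sampling without replacement, which fails when a group has fewer than n rows. A minimal sketch of one way around that, assuming it is acceptable to return the whole group in that case, caps the sample size at the group size. The names sampled and the fixed seed are illustrative only.

```r
library(data.table)

dt <- data.table(filename = c("a", "b", "c", "c", "a"),
                 class    = c(1, 2, 1, 1, 4),
                 var      = c(1, 2, 3, 4, 5))

n <- 1
set.seed(42)  # only to make the illustration repeatable

# For each (filename, class) group, draw min(n, .N) distinct row indices
# from 1..N and use them to subset the group's rows held in .SD
sampled <- dt[, .SD[sample(.N, size = min(n, .N))], by = .(filename, class)]
```

With n = 1 this yields one row per unique combination, so four rows for the data above.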

Creating several new columns in a data frame using the same function

I'm sorry for the basic question. I'm just struggling with something that should be simple. Say I have the data frame "Test" that originally has three fields: Col1, Col2, Col3.
I want to create new columns based on each of the original columns. The values in each row of the new columns would specify whether the corresponding value in the matching row on the original column is above or below the initial column's median. So, for example, in the image attached, Col4 is based on Col1. Col5 is based on Col2. Col6 based on Col3.
test dataframe example:
It's quite easy to perform this function on a single column and output a single column:
# derivedFactor() comes from the mosaic package
Test <- Test %>% mutate(Col4 = derivedFactor(
  "below" = Col1 < median(Test$Col1),
  "at" = Col1 == median(Test$Col1),
  "above" = Col1 > median(Test$Col1),
  .default = NA)
)
But if I'm performing this same operation over 50 columns, writing out/copy-paste and editing the code can be tedious and inefficient. I should mention that I am hoping to add the new columns to the data frame, not create another data frame. Additionally, there are about 200 other fields in the data frame that will not have this function performed on them (so I can't just use a mutate_all). And the columns are not uniformly named (my examples above are just examples, not the actual dataset) so I'm not able to find a pattern for mutate_at. Maybe there is a way to manually pass a list of column names to the mutate command?
There must be an easy and elegant way to do this. If anyone could help, that would be amazing.
You can do the following using data.table.
First, I define a function that takes a numeric vector and returns, for each element, its position relative to the vector's median:
med_fn = function(x){
  med = median(x)
  unlist(sapply(x, function(x){
    if(x > med) {'Above'}
    else if(x < med) {'Below'}
    else {'At'}
  }))
}
> med_fn(c(1,2,3))
[1] "Below" "At" "Above"
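The function above loops over the elements one at a time. As a possible alternative, here is a vectorized sketch that should give the same labels for numeric input without missing values. The name med_fn_vec is made up for this illustration.

```r
# Vectorized variant of med_fn (illustrative name, not from the answer)
med_fn_vec <- function(x) {
  med <- median(x)
  # sign() returns -1, 0 or 1; adding 2 maps that to positions 1, 2, 3
  c("Below", "At", "Above")[sign(x - med) + 2]
}

res <- med_fn_vec(c(1, 2, 3))
res
```

If x contains NA, median(x) is NA and every label comes out as NA, so missing values would need separate handling.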
Let us examine some sample data:
dt = data.table(
C1 = c(1, 2, 3),
C2 = c(2, 1, 3),
C3 = c(3, 2, 1)
)
old = c('C1', 'C2', 'C3') # Name of columns I want to perform operation on
new = paste0(old, '_medfn') # Name of new columns following operation
Using the .SD and .SDcols arguments from data.table, I apply med_fn across the columns old, in my case columns C1, C2 and C3. I call the new columns C#_medfn:
dt[, (new) := lapply(.SD, med_fn), .SDcols = old]
Result:
> dt
C1 C2 C3 C1_medfn C2_medfn C3_medfn
1: 1 2 3 Below At Above
2: 2 1 2 At Below At
3: 3 3 1 Above Above Below
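Since the question asked about passing a list of column names to mutate, here is a hedged dplyr sketch of the same idea. It assumes dplyr 1.0.0 or later (for across()). The data values mirror the sample above, and the names cols, classify, Other and the "_pos" suffix are illustrative choices, not from the original post.

```r
library(dplyr)

Test <- data.frame(Col1  = c(1, 2, 3),
                   Col2  = c(2, 1, 3),
                   Col3  = c(3, 2, 1),
                   Other = c("x", "y", "z"))  # stands in for the untouched fields

cols <- c("Col1", "Col2", "Col3")  # columns to classify, listed by hand

# Label each value relative to the median of its own column
classify <- function(x) {
  med <- median(x)
  case_when(x > med ~ "above",
            x < med ~ "below",
            x == med ~ "at")
}

# New columns are appended to the same data frame as Col1_pos, Col2_pos, ...
Test <- Test %>%
  mutate(across(all_of(cols), classify, .names = "{.col}_pos"))
```

Columns not named in cols, such as Other here, are left unchanged.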

setting multiple columns NA's to value --R [duplicate]

Quite new to R, I am trying to select certain columns in order to set their NAs to 0.
So far I have:
col_names1 <- c('a','b','c')
col_names2 <- c('e','f','g')
col_names <- c(col_names1, col_names2)
data = fread('data.tsv', sep="\t", header= FALSE,na.strings="NA",
stringsAsFactors=TRUE,
colClasses=my_col_Classes
)
setnames(data, col_names)
data[col_names2][is.na(data[col_names2])] <- 0
But I keep getting the error
Error in `[.data.table`(`*tmp*`, column_names2): When i is a data.table (or character vector), x must be keyed (i.e. sorted, and, marked as sorted) so data.table knows which columns to join to and take advantage of x being sorted. Call setkey(x,...) first, see ?setkey.
I believe this error is saying my data is in the wrong order, but I am not sure how to fix it.
You can do it with data.table's := assignment:
data <- data.table(a = c(2, NA, 3, 5), b = c(NA,2,3,4), c = c(2,5,NA, 6))
fix_columns <- c('a','b')
fix_fun <- function(x) ifelse(is.na(x), 0 , x)
data[,(fix_columns):=lapply(.SD, fix_fun), .SDcols=fix_columns]
P.S. You can't select columns from a data.table with data[col_names2], because a character vector in i is treated as a join on the key. To select them by character vector, one approach is: data[, col_names2, with = FALSE]
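Another common pattern for the same task is a loop over set(), which updates only the NA cells by reference instead of rebuilding each column with ifelse(). This is a minimal sketch on the same sample data as the answer, not the answerer's own method.

```r
library(data.table)

data <- data.table(a = c(2, NA, 3, 5),
                   b = c(NA, 2, 3, 4),
                   c = c(2, 5, NA, 6))
fix_columns <- c("a", "b")

# For each chosen column, overwrite only the rows that are NA, in place
for (col in fix_columns) {
  set(data, i = which(is.na(data[[col]])), j = col, value = 0)
}
```

Column c is not listed in fix_columns, so its NA is left as it is.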

R: show ALL rows with duplicated elements in a column [duplicate]

Does a function like this exist in any package?
isdup <- function (x) duplicated (x) | duplicated (x, fromLast = TRUE)
My intention is to use it with dplyr to display all rows with duplicated values in a given column. I need the first occurrence of the duplicated element to be shown as well.
In this data.frame for instance
dat <- as.data.frame (list (l = c ("A", "A", "B", "C"), n = 1:4))
dat
> dat
l n
1 A 1
2 A 2
3 B 3
4 C 4
I would like to display the rows where column l is duplicated, i.e., those with an A value, by doing:
library (dplyr)
dat %>% filter (isdup (l))
returns
l n
1 A 1
2 A 2
dat %>% group_by(l) %>% filter(n() > 1)
I don't know if it exists in any package, but since you can implement it easily, I'd say just go ahead and implement it yourself.
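To make the two suggestions concrete, the sketch below applies the question's isdup() helper with base R subsetting and, as an assumed data.table equivalent of the group_by/filter answer, keeps only groups with more than one row. The names dups_base and dups_dt are illustrative.

```r
library(data.table)

# Helper from the question: TRUE for every occurrence of a repeated value
isdup <- function(x) duplicated(x) | duplicated(x, fromLast = TRUE)

dat <- data.frame(l = c("A", "A", "B", "C"), n = 1:4)

# Base R: keep all rows whose l value occurs more than once
dups_base <- dat[isdup(dat$l), ]

# data.table: return a group's rows only when the group has more than one row
dt <- as.data.table(dat)
dups_dt <- dt[, if (.N > 1L) .SD, by = l]
```

Both results contain the two rows with l equal to "A", including the first occurrence.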

R reshaping melted data.table with list column

I have a large (millions of rows) melted data.table with the usual melt-style unrolling in the variable and value columns. I need to cast the table in wide form (rolling the variables up). The problem is that the data table also has a list column called data, which I need to preserve. This makes it impossible to use reshape2 because dcast cannot deal with non-atomic columns. Therefore, I need to do the rolling up myself.
The answer from a previous question about working with melted data tables does not apply here because of the list column.
I am not satisfied with the solution I've come up with. I'm looking for suggestions for a simpler/faster implementation.
x <- LETTERS[1:3]
dt <- data.table(
x=rep(x, each=2),
y='d',
data=list(list(), list(), list(), list(), list(), list()),
variable=rep(c('var.1', 'var.2'), 3),
value=seq(1,6)
)
# Column template set up
list_template <- Reduce(
function(l, col) { l[[col]] <- col; l },
unique(dt$variable),
list())
# Expression set up
q <- substitute({
l <- lapply(
list_template,
function(col) .SD[variable==as.character(col)]$value)
l$data = .SD[1,]$data
l
}, list(list_template=list_template))
# Roll up
dt[, eval(q), by=list(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>
This old question piqued my curiosity as data.table has been improved significantly since 2013.
However, even with data.table version 1.11.4
dcast(dt, x + y + data ~ variable)
still returns an error
Columns specified in formula can not be of type list
The workaround follows the general outline of jonsedar's answer:
1. Reshape the non-list columns from long to wide format.
2. Aggregate the list column data, grouped by x and y.
3. Join the two partial results on x and y.
It uses current data.table syntax, however, e.g., the on parameter:
dcast(dt, x + y ~ variable)[
dt[, .(data = .(first(data))), by = .(x, y)], on = .(x, y)]
x y var.1 var.2 data
1: A d 1 2 <list>
2: B d 3 4 <list>
3: C d 5 6 <list>
The list column data is aggregated by taking the first element. This is in line with OP's code line
l$data = .SD[1,]$data
which also picks the first element.
I have a somewhat cheating method that might do the trick. Importantly, I assume that each x, y, list combination is unique! If not, please disregard.
I'm going to create two separate data.tables: the first is dcast without the data list objects, and the second holds only the unique data list objects and a key. Then I merge them to get the desired result.
require(data.table)
require(stringr)
require(reshape2)
x <- LETTERS[1:3]
dt <- data.table(
x=rep(x, each=2),
y='d',
data=list(list("a","b"), list("c","d")),
variable=rep(c('var.1', 'var.2'), 3),
value=seq(1,6)
)
# First create the dcasted datatable without the pesky list objects:
dt_nolist <- dt[,list(x,y,variable,value)]
dt_dcast <- data.table(dcast(dt_nolist,x+y~variable,value.var="value")
,key=c("x","y"))
# Second: create a datatable with only unique "groups" of x,y, list
dt_list <- dt[,list(x,y,data)]
# Rows are duplicated, so I'd like to use unique() to get rid of them, but
# unique() doesn't work when there are list objects in the data.table.
# So instead I cheat by assigning each row within an x,y "group" a value
# that is unique within EACH group, but present within EVERY group.
# Then simply subset on that value.
# I've chosen rank(), but no doubt there are other options.
dt_list <- dt_list[,rank:=rank(str_c(x,y),ties.method="first"),by=str_c(x,y)]
# now keep only one row per x,y "group"
dt_list <- dt_list[rank==1]
setkeyv(dt_list,c("x","y"))
# drop the rank since we no longer need it
dt_list[,rank:=NULL]
# Finally just merge back together
dt_final <- merge(dt_dcast,dt_list)
