count of numeric values by column with data.table (R)

Hello,
how can I get a count of a particular value in a column using data.table? For example, with
DC <- data.table(l=c(0,0,1,4,5), d=c(1,2,0,0,1), y=c(0,1,0,1,7))
I tried the following:
DC[, lapply(.SD, function(x) length(which(DC==0)))]
But this returns the count of zeros in the entire dataset rather than counting column by column. So, how do I index by column?
Thanks

The question is not very well formulated, but I think @Sathish answered it perfectly in the comments.
Let's write it here one more time: to me, colSums(DC == 0) is one answer to the question.
All credit goes to @Sathish. Very helpful.
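For reference, a minimal sketch of both forms, using the DC from the question; the lapply call is the question's own attempt with its bug fixed (the anonymous function should reference x, the current column, rather than the whole table DC):
library(data.table)
DC <- data.table(l=c(0,0,1,4,5), d=c(1,2,0,0,1), y=c(0,1,0,1,7))
# per-column count of zeros, as suggested in the comments
colSums(DC == 0)
# l d y
# 2 2 2
# the same idea, keeping the lapply-over-.SD structure from the question
DC[, lapply(.SD, function(x) sum(x == 0))]
#    l d y
# 1: 2 2 2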

If I understand your question, you'd like to form a frequency count of the values that comprise a given data table column. If that is true, let's say that you want to do so on column d of the data table you supplied:
> DC <- data.table(l=c(0,0,1,4,5), d=c(1,2,0,0,1), y=c(0,1,0,1,7))
> DC[, .N, by = d]
d N
1: 1 2
2: 2 1
3: 0 2
Then, if you want the count of a particular value in d, you would do so by accessing the corresponding row of the above aggregation as follows:
> DC[, .N, by = d][d == 0, N]
[1] 2
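A side note (my own addition, not from the original answer): if you only need the count of one specific value, you can also filter in i and use .N directly, without building the full frequency table:
DC[d == 0, .N]
# [1] 2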

Related

Generate group by condition on row value in column R data.table

I want to split a data.table in R into groups based on a condition in the value of a row. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to loop over rows with a for loop).
I have data like this:
library(data.table)
dt1 <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)))
I'd like to group at the large numbers (over a settable value) and come up with the example below:
dt.desired <- data.table( x=1:139, t=c(rep(c(1:5),10),120928,rep(c(6:10),9), 10400,rep(c(13:19),6)), group=c(rep(1,50),rep(2,46),rep(3,43)))
A single cumsum() over the threshold test produces the desired grouping directly:
dt1[ , group := cumsum(t > 200) + 1]
Checking the rows with large values against the desired output:
dt1[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
dt.desired[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
You can use a test like t>100 to find the large values. You can then use cumsum() to get a running integer for each set of rows up to (but not including) the large number.
# assuming you can define "large" as >100
dt1[ , islarge := t>100]
dt1[ , group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run).
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
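As a small follow-up (my own sketch, not part of the original answer), the cleanup steps can be folded into the shift() call via its fill argument, keeping the threshold in a variable so it is settable:
# assuming the same >100 definition of "large" as above
big <- 100
dt1[ , group := shift(cumsum(t > big), fill = 0) + 1]
This reproduces the same grouping as above, with each large number attached to the group that precedes it.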

R - Add columns with almost same name and save it using the correct column name

I have multiple large data tables in R. Some columns appear twice under nearly duplicate names: the names are identical except for the last character.
For example:
[1] "Genre_Romance" (correct)
[2] "Genre_Sciencefiction" (correct)
[3] "Genre_Sciencefictio" (wrong)
[4] "Genre_Fables" (correct)
[5] "Genre_Fable" (wrong)
Genre_Romance <- c(1, 0, 1, 0, 1)
Genre_Sciencefiction <- c(0, 1, 0, 0, 0)
Genre_Sciencefictio <- c(1, 0, 1, 1, 0)
Genre_Fables <- c(0, 0, 1, 0, 0)
Genre_Fable <- c(0, 0, 0, 0, 1)
dt <- data.table(Genre_Romance, Genre_Sciencefiction, Genre_Sciencefictio, Genre_Fables, Genre_Fable)
Now I want to add the column values with nearly the same column name. I want to save this sum under the correct column name while removing the incorrect column. The solution here would be:
dt[,"Genre_Sciencefiction"] <- dt[,2] + dt[, 3]
dt[,"Genre_Fables"] <- dt[,4] + dt[, 5]
dt[,"Genre_Sciencefictio"] <- NULL
dt[,"Genre_Fable"] <- NULL
dt
Genre_Romance Genre_Sciencefiction Genre_Fables
1: 1 1 0
2: 0 1 0
3: 1 1 1
4: 0 1 0
5: 1 0 1
As you can see, not every column name has a nearly duplicate one (such as "Genre_Romance"). So we just keep the first column like that.
I tried to solve this with a for loop that compares column names one by one, using substr() to match the longer name against the shorter one and summing the columns when they agree. But it does not work correctly and is not very R-friendly.
The post below also helped me a bit further, but I cannot use 'duplicated' since the column names are not exactly the same.
how do I search for columns with same name, add the column values and replace these columns with same name by their sum? Using R
Thanks in advance.
Here is a more-or-less base R solution that relies on agrep to find similar names. agrep allows for close string matches, based on the "generalized Levenshtein edit distance."
# find groups of similar names
groups <- unique(lapply(names(dt), function(i) agrep(i, names(dt), fixed=TRUE, value=TRUE)))
# choose the final names as those that are longest
finalNames <- sapply(groups, function(i) i[which.max(nchar(i))])
I chose to keep the longest variable name in each group, which matches the example; you could easily switch to the shortest with which.min, or do some hard-coding depending on what you want.
Next, Reduce() is given the "+" operator and is fed each group's matching columns via lapply(). To calculate the maximum instead, use max in place of "+". The columns are selected using .SDcols from data.table; with a data.frame, you could feed it the group vectors directly.
# produce a new data frame
setNames(data.frame(lapply(groups, function(x) Reduce("+", dt[, .SD, .SDcols=x]))),
         finalNames)
@Frank's comment notes that this can be simplified in newer (1.10+, I believe) versions of data.table to avoid .SD and .SDcols with
# produce a new data frame
setNames(data.frame(lapply(groups, function(x) Reduce("+", dt[, ..x]))), finalNames)
To make this a data.table, just replace data.frame with as.data.table or wrap the output in setDT.
To turn the final line into a data.table solution, you could use
dtFinal <- setnames(dt[, lapply(groups, function(x) Reduce("+", dt[, .SD, .SDcols=x]))],
                    finalNames)
or, following @Frank's comment
dtFinal <- setnames(dt[, lapply(groups, function(x) Reduce("+", dt[, ..x]))], finalNames)
which both return
dtFinal
Genre_Romance Genre_Sciencefiction Genre_Fables
1: 1 1 0
2: 0 1 0
3: 1 1 1
4: 0 1 0
5: 1 0 1
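An equivalent way to write the summation step (my own variation, not from the original answer) is to use rowSums() on each group of columns instead of Reduce("+", ...):
dtFinal2 <- setnames(dt[, lapply(groups, function(x) rowSums(dt[, ..x]))], finalNames)
This gives the same result as dtFinal above.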

data.frame filtering function too slow

I am trying to filter a data frame of mine, with about 200 thousand rows, using R.
The dataframe is structured as follow:
testdf<- data.frame("CHROM"='CHR8', "POS"=c(500,510), "ID"='Some_value',
"REF"=c('A','C'), "ALT"=c('C','T,G'), "Some_more_stuff"='More_info')
I am trying to keep the rows whose 'ALT' column has a number of letters equal to or less than a custom threshold. In the example above, if my threshold is 1, only the first row would be retained (the ALT column of the second row has 2 letters, which is > 1).
I have written a couple of functions which do the job. The only problem is that they take several seconds on a test dataframe with just 14 rows. On the real dataframe (200,000 rows) it takes forever. I am looking for advice on how to write better syntax and get faster results.
Here are my functions:
# Function no. 1:
allele_number_filtering <- function(snp_table, max_alleles=1, ALT_column=5) {
  # here I calculate how many letters are in the ALT column
  alt_allele_list_length <- function(ALT_field) {
    alt_length <- length(strsplit(as.character(ALT_field), split = ',')[[1]])
    return(alt_length)}
  # Create an empty dataframe with same columns as the input df
  final_table <- snp_table[0,]
  # Now only retain the rows that are <= max_alleles
  for (i in 1:nrow(snp_table)) {
    if (alt_allele_list_length(snp_table[i, ALT_column]) <= max_alleles) {
      final_table <- rbind(final_table, snp_table[i,])}}
  return(final_table)}
# Function no. 2:
allele_number_filtering <- function(snp_table, max_alleles=1, ALT_column=5) {
  final_table <- snp_table[0,]
  for (i in 1:nrow(snp_table)) {
    if (length(strsplit(as.character(snp_table[i, ALT_column]), split = ',')[[1]]) <= max_alleles) {
      final_table <- rbind(final_table, snp_table[i,])
    }}
  return(final_table)}
I would be thankful for any advice :)
Max
EDIT: I realized I also have values such as 'ALT' = 'at' (still to be counted as 1) or 'ALT' = 'aa,at' (to be counted as 2).
You can use lengths() for this:
testdf[lengths(strsplit(as.character(testdf$ALT), ',', fixed = TRUE)) <= 1, ]
Thanks to @docendodiscimus for the strsplit(..., fixed = TRUE) option for the speed-up, and to @joran for his perspicacity.
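If you want to keep the function interface from the question, a vectorized version could look like this (a sketch only, reusing the question's argument names):
allele_number_filtering <- function(snp_table, max_alleles = 1, ALT_column = 5) {
  # number of comma-separated alleles per row; 'at' counts as 1, 'aa,at' as 2
  n_alleles <- lengths(strsplit(as.character(snp_table[[ALT_column]]), ',', fixed = TRUE))
  snp_table[n_alleles <= max_alleles, , drop = FALSE]
}
allele_number_filtering(testdf)  # keeps only the first row with the default max_alleles = 1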
I would use nchar for this (after removing the commas via gsub):
nchar(gsub(",", "", as.character(testdf$ALT)))
# [1] 1 2
threshold <- 1
testdf[nchar(gsub(",", "", as.character(testdf$ALT))) > threshold, ]
# CHROM POS ID REF ALT Some_more_stuff
# 2 CHR8 510 Some_value C T,G More_info
Note that this selects the rows above the threshold; flip the comparison to <= threshold to keep the rows you want to retain. Also, nchar counts characters rather than comma-separated values, so it will over-count multi-letter alleles such as 'at' from the edit.

How do I subset rows in data.table that meet some minimum criteria?

I'm a noob to data.table in R and I'd like to skip the last group (where z=3) in this example:
> DT = data.table(y=c(1,1,2,2,2,3,3,3,4,4),x=1:10,z=c(1,1,1,1,2,2,2,2,3,3))
> DT[,list(list(predict(smooth.spline(x,y),c(4,5,6))$y)),by=z]
Error in smooth.spline(x, y) : need at least four unique 'x' values
If I simply delete z=3 I get the answer I want:
> DT = data.table(y=c(1,1,2,2,2,3,3,3),x=1:8,z=c(1,1,1,1,2,2,2,2))
> DT[,list(list(predict(smooth.spline(x,y),c(4,5,6))$y)),by=z]
z V1
1: 1 2.09999998977689,2.49999997903384,2.89999996829078
2: 2 0.999895853971133,2.04533519691888,2.90932467439562
What a great package!
Omitting the rows where z is 3 is as simple as
DT[z!=3, <whatever expression you'd like>]
If your data.table is keyed by z, then you can use
DT[!.(3), .....]
If you simply want to omit the results when .N < 4, then you can use if (not ifelse): when the if condition is FALSE and there is no else branch, nothing is returned for that group.
DT[,if(.N>=4){ list(list(predict(smooth.spline(x,y),c(4,5,6))$y))},by=z]
# z V1
# 1: 1 2.1000000266026,2.50000003412706,2.90000004165153
# 2: 2 0.999895884129996,2.04533520266699,2.90932466433092
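For reference, a sketch of the keyed variant mentioned above (assuming you key DT by z first):
setkey(DT, z)
DT[!.(3), list(list(predict(smooth.spline(x, y), c(4, 5, 6))$y)), by = z]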

Efficiently select subset of rows by rank

How do you use data.table to select a subset of rows by rank? I have a large data set and I hope to do so efficiently.
> dt <- data.table(id=1:200, category=sample(LETTERS, 200, replace=T))
> dt[,count:=length(id), by=category]
> dt
id category count
1: 1 O 13
2: 2 O 13
---
199: 170 N 3
200: 171 H 3
What I want to do is to efficiently change the category to 'OTHER' for any category not in the k most common ones. Something along the lines of:
dt[rank > 5,category:="OTHER", by=category]
I'm new to data.table and I'm not quite sure how to get the rank in an efficient way. Here's a way that works, but seems clunky.
counts <- unique(dt$count)
decision <- max(counts[rank(-counts)>5])
dt[count<=decision, category:='OTHER']
I would appreciate any advice. To be honest, I don't even need the 'count' column if there is a way to avoid it.
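One possible approach (my own sketch, not from the original thread): compute the k most common categories once, then update everything else by reference:
k <- 5
top_k <- dt[, .N, by = category][order(-N)][seq_len(k), category]
dt[!category %in% top_k, category := "OTHER"]
This avoids storing a count column on dt entirely.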

Resources