Finding duplicate columns in a data.table - r

I have a pretty big data.table (500 x 2000), and I need to find out if any of the columns are duplicates, i.e., have the same values for all rows. Is there a way to efficiently do this within the data.table structure?
I have tried a naive two loop approach with all(col1 == col2) for each pair of columns, but it takes too long. I have also tried converting it to a data.frame and using the above approach, and it still takes quite a long time.
My current solution is to convert the data.table to a matrix and use the apply() function as:
similarity.matrix <- apply(m, 2, function(x) colSums(x == m)) / nrow(m)
However, the approach forces the modes of all elements to be the same, and I'd rather not have that happen. What other options do I have?
Here is a sample construction for the data.table:
m = matrix(sample(1:10, size=1000000, replace=TRUE), nrow=500, ncol=2000)
DF = as.data.frame(m)
DT = as.data.table(m)

Following the suggestion of @Haboryme*, you can use duplicated to find duplicated vectors. duplicated usually works row-wise, but you can transpose the data with t() so that it checks the columns instead.
DF <- DF[ , which( !duplicated( t( DF ) ) ) ]
With a data.table, you may need to add with = FALSE (I think this depends on the version of data.table you're using).
DT <- DT[ , which( !duplicated( t( DT ) ) ), with = FALSE ]
*@Haboryme, if you were going to turn your comment into an answer, please do and I'll remove this one.
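One caveat with the transpose: t() converts the table to a matrix, which coerces every column to a single common type. As a minimal base R sketch of an alternative, assuming a small made-up data frame (df and its columns are hypothetical), duplicated can be called on the columns as a list, which compares them without transposing:

```r
# Hypothetical data: columns c and d repeat columns a and b
df <- data.frame(a = c(1, 2, 3),
                 b = c("x", "y", "z"),
                 c = c(1, 2, 3),
                 d = c("x", "y", "z"))

# as.list() keeps each column as its own vector, so no common type is forced
dup <- duplicated(as.list(df))

# Keep only the first occurrence of each distinct column
df_unique <- df[, !dup, drop = FALSE]
names(df_unique)
```

The same logical vector should be usable to subset a data.table, though the exact subsetting syntax may depend on the data.table version.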

Here's a different approach, where you hash each column first and then call duplicated.
library(digest)
dups <- duplicated(sapply(DF, digest))
DF <- DF[,which(!dups)]
Depending on your data this might be a faster way.

I am using mtcars for a reproducible result:
library(data.table)
library(digest)
# Create data
data <- as.data.table(mtcars)
data[, car.name := rownames(mtcars)]
data[, car.name.dup := car.name] # create a duplicated column
data[, car.name.not.dup := car.name] # create a second duplicated column...
data[1, car.name.not.dup := "Moon walker"] # ... but change a value so that it is no longer a duplicated column
data contains now:
> head(data)
mpg cyl disp hp drat wt qsec vs am gear carb car.name car.name.dup car.name.not.dup
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Mazda RX4 Moon walker
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag Mazda RX4 Wag Mazda RX4 Wag
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Datsun 710 Datsun 710 Datsun 710
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet 4 Drive Hornet 4 Drive Hornet 4 Drive
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Hornet Sportabout Hornet Sportabout Hornet Sportabout
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 Valiant Valiant Valiant
Now find the duplicated columns:
# create a named vector with the checksum for each column (the names are the column names)
col.checksums <- sapply(data, function(x) digest(x, "md5"), USE.NAMES = TRUE)
# make a data table with one row per column name and hash value
dup.cols <- data.table(col.name = names(col.checksums), hash.value = col.checksums)
# self join using the hash values and filter out all column name pairs that were joined to themselves
dup.cols[dup.cols, on = "hash.value"][col.name != i.col.name]
Results in:
col.name hash.value i.col.name
1: car.name.dup 58fed3da6bbae3976b5a0fd97840591d car.name
2: car.name 58fed3da6bbae3976b5a0fd97840591d car.name.dup
Note: The result still contains both directions (col1 == col2 and col2 == col1) and should be deduplicated ;-)
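One way to avoid the mirrored pairs is to group the column names by checksum instead of joining. The following is only a base R sketch: the checksum strings are short placeholders standing in for the digest() output above, not real MD5 values.

```r
# Placeholder checksums (hypothetical values) in the same shape as col.checksums
col.checksums <- c(car.name         = "h1",
                   car.name.dup     = "h1",
                   car.name.not.dup = "h2",
                   mpg              = "h3")

# Group column names by checksum; each group holds columns with identical content
groups <- split(names(col.checksums), col.checksums)

# Keep only the groups that contain more than one column
dup.groups <- Filter(function(g) length(g) > 1, groups)
dup.groups
```

Each group lists a set of identical columns exactly once, so no further deduplication is needed.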

Related

Loop within Loop in R

I am trying to figure out how to run two different loops on the same code. I am trying to create a matrix where I am filling a column with the mean of a variable for each year.
Here's the code I am using to do it right now:
matplot2 = as.data.frame(matrix(NA, nrow=16, ncol=4))
matplot2[1,1] = mean(matplot[matplot$Year==2003, 'TotalTime'])
matplot2[2,1] = mean(matplot[matplot$Year==2004, 'TotalTime'])
matplot2[3,1] = mean(matplot[matplot$Year==2005, 'TotalTime'])
matplot2[4,1] = mean(matplot[matplot$Year==2006, 'TotalTime'])
matplot2[5,1] = mean(matplot[matplot$Year==2007, 'TotalTime'])
matplot2[6,1] = mean(matplot[matplot$Year==2008, 'TotalTime'])
matplot2[7,1] = mean(matplot[matplot$Year==2009, 'TotalTime'])
matplot2[8,1] = mean(matplot[matplot$Year==2010, 'TotalTime'])
matplot2[9,1] = mean(matplot[matplot$Year==2011, 'TotalTime'])
matplot2[10,1] = mean(matplot[matplot$Year==2012, 'TotalTime'])
matplot2[11,1] = mean(matplot[matplot$Year==2013, 'TotalTime'])
matplot2[12,1] = mean(matplot[matplot$Year==2014, 'TotalTime'])
matplot2[13,1] = mean(matplot[matplot$Year==2015, 'TotalTime'])
matplot2[14,1] = mean(matplot[matplot$Year==2016, 'TotalTime'])
matplot2[15,1] = mean(matplot[matplot$Year==2017, 'TotalTime'])
matplot2[16,1] = mean(matplot[matplot$Year==2018, 'TotalTime'])
If it were just the year changing, I would write the loop like this:
for(i in 2003:2018) {
matplot2[1,1] = mean(matplot[matplot$Year==i, 'TotalTime'])
}
But, I need the row number in the matrix I'm printing the results into to change as well. How can I write a loop where I am printing the results of all these means into one column of a matrix?
In other words, I need to be able to have it loop matplot2[j,1] in addition to the matplot$Year==i.
Any suggestions would be greatly appreciated!
Your literal calculations of the mean(TotalTime) can all be reduced to a single command (with no for loop required):
matplot2 <- aggregate(TotalTime ~ Year, data = matplot, FUN = mean)
That should return a two-column frame with the unique values of Year in the first column, and the respective means in the second column.
Demonstrated with data I have:
head(mtcars)
# mpg cyl disp hp drat wt qsec vs am gear carb
# Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
# Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
# Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
# Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
# Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
# Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
res <- aggregate(disp ~ cyl, data = mtcars, FUN = mean)
res
# cyl disp
# 1 4 105.1364
# 2 6 183.3143
# 3 8 353.1000
This and more can be seen in summarize by group (of which this question is essentially a dupe, even if you didn't know to ask it that way).
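Indexing with a vector lets you fill all 16 rows in one assignment, but mean() collapses its input to a single value, so each year's mean has to be computed separately, for example with sapply:
i <- 1:16
matplot2[i, 1] <- sapply(2002 + i, function(y) mean(matplot[matplot$Year == y, 'TotalTime']))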
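If you do want an explicit loop, the row index and the year can be driven by the same counter. This is a minimal sketch with a small made-up matplot (three years instead of sixteen, invented TotalTime values), looping over positions with seq_along:

```r
# Hypothetical stand-in for the asker's data
matplot <- data.frame(Year      = rep(2003:2005, each = 2),
                      TotalTime = c(10, 20, 30, 50, 5, 15))

years <- sort(unique(matplot$Year))
matplot2 <- as.data.frame(matrix(NA_real_, nrow = length(years), ncol = 4))

# j is the row to fill, years[j] is the year whose mean goes into that row
for (j in seq_along(years)) {
  matplot2[j, 1] <- mean(matplot[matplot$Year == years[j], "TotalTime"])
}
matplot2[, 1]
```

Deriving the years from the data rather than hard-coding 2003:2018 keeps the number of rows and the number of years in step.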

Order columns from a list of pre-defined names and ignore column names which don't exist in the list

I want to order a data.table by using a set of predefined names available in a list.
For example:
library(data.table)
dt <- as.data.table(mtcars)
list_name <-c("mpg", "disp", "xyz")
#Order columns
setcolorder(dt, list_name) #requirement: if "xyz" column doesn't exist it should ignore and take the rest
The use case is that multiple data.tables are being created, and all of them take their column names from a list of names. Some column names may be missing in some of the data, but the columns still need to be ordered according to the list.
output:
dt
disp wt mpg cyl hp drat qsec vs am gear carb
1: 160.0 2.620 21.0 6 110 3.90 16.46 0 1 4 4
2: 160.0 2.875 21.0 6 110 3.90 17.02 0 1 4 4
3: 108.0 2.320 22.8 4 93 3.85 18.61 1 1 4 1
One option is to put all of the tables in a list and loop over it with lapply, calling setcolorder on each table and using intersect to keep only the names that actually exist in it:
lst1 <- list(dt, dt)
lst1 <- lapply(lst1, function(x) setcolorder(x, intersect(list_name, names(x))))
If this needs to be reused, wrap it in a function:
f1 <- function(dat, nm1) {
setcolorder(dat, intersect(nm1, names(dat)))
}
f1(dt, list_name)
f1(dt2, list_name)
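To see what intersect contributes here, the following base R sketch applies the same idea to a plain data.frame (it is not the data.table code above, and front, new_order, and reordered are names introduced only for illustration):

```r
list_name <- c("mpg", "disp", "xyz")
cols <- names(mtcars)

# intersect() keeps the order of list_name and silently drops "xyz"
front <- intersect(list_name, cols)

# Put the requested columns first, then everything else in its original order
new_order <- c(front, setdiff(cols, front))
reordered <- mtcars[, new_order]
names(reordered)
```

setcolorder does the second step on its own: columns not named in the new order are left after the ones that are.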

Use a character vector in the `by` argument

Within the data.table package in R, is there a way to pass a character vector of column names to the by argument?
Here is an example of what would be the desired output from this using mtcars:
mtcars <- data.table(mtcars)
ColSelect <- 'cyl' # One Column Option
mtcars[,.( AveMpg = mean(mpg)), by = .(ColSelect)] # Doesn't work
# Desired Output
cyl AveMpg
1: 6 19.74286
2: 4 26.66364
3: 8 15.10000
I know this is possible when assigning column names in j, by wrapping the vector in parentheses.
ColSelect <- 'AveMpg' # Column to be assigned for average mpg value
mtcars[,(ColSelect):= mean(mpg), by = .(cyl)]
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb AveMpg
1: 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 19.74286
2: 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 19.74286
3: 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 26.66364
4: 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 19.74286
5: 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 15.10000
6: 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 19.74286
Is there a suggestion as to what to put in the by argument in order to achieve this?
From ?data.table in the by section it says that by accepts:
a single character string containing comma separated column names (where spaces are significant since column names may contain spaces even at the start or end): e.g., DT[, sum(a), by="x,y,z"]
a character vector of column names: e.g., DT[, sum(a), by=c("x", "y")]
So yes, you can use the answer in @cccmir's response. You can also use c() as @akrun mentioned, but that seems slightly extraneous unless you want multiple columns.
The reason you cannot use .() syntax is that in data.table .() is an alias for list(). And according to the same help for by the list() syntax requires an expression of column names - not a character string.
Going off the examples in the by help if you wanted to use multiple variables and pass the names as characters you could do:
mtcars[,.( AveMpg = mean(mpg)), by = "cyl,am"]
mtcars[,.( AveMpg = mean(mpg)), by = c("cyl","am")]
Try using it like this:
mtcars <- data.table(mtcars)
ColSelect <- 'cyl' # One Column Option
mtcars[, AveMpg := mean(mpg), by = ColSelect] # Should work
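Putting the two answers together, here is a short sketch that produces the summarised output the question asked for, with the grouping columns held in a variable. The two-column ColSelect and the name dt are assumptions made for illustration:

```r
library(data.table)

dt <- as.data.table(mtcars)

# Grouping columns held as a character vector (hypothetical selection)
ColSelect <- c("cyl", "am")

# Pass the variable to by directly, without wrapping it in .()
res <- dt[, .(AveMpg = mean(mpg)), by = ColSelect]
res
```

Using .() in j returns one row per group, whereas := as in the snippet above adds the group mean as a new column to every row.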

Remove columns with dplyr [duplicate]

I'm interested in simplifying the way that I can remove columns with dplyr (version >= 0.7). Let's say that I have a character vector of names.
drop <- c("disp", "drat", "gear", "am")
Selecting Columns
With the current version of dplyr, you can perform a selection with:
dplyr::select(mtcars, !! rlang::quo(drop))
Or even easier with base R:
mtcars[, drop]
Removing Columns
Removing columns by name is another matter. We could list each unquoted column name with a minus sign:
dplyr::select(mtcars, -disp, -drat, -gear, -am)
But, if you have a data.frame with several hundred columns, this isn't a great solution. The best solution I know of is to use:
dplyr::select(mtcars, -which(names(mtcars) %in% drop))
which is fairly simple and works for both dplyr and base R. However, I wonder if there's an approach which doesn't involve finding the integer positions for each column name in the data.frame.
Use purrr's modify_at and set the columns to NULL, which removes them:
mtcars %>% modify_at(drop,~NULL)
# mpg cyl hp wt qsec vs carb
# Mazda RX4 21.0 6 110 2.620 16.46 0 4
# Mazda RX4 Wag 21.0 6 110 2.875 17.02 0 4
# Datsun 710 22.8 4 93 2.320 18.61 1 1
# Hornet 4 Drive 21.4 6 110 3.215 19.44 1 1
# Hornet Sportabout 18.7 8 175 3.440 17.02 0 2
# Valiant 18.1 6 105 3.460 20.22 1 1
# ...
Closer to what you were trying, you could have tried magrittr::extract instead of dplyr::select
extract(mtcars,!names(mtcars) %in% drop) # same output
You can use -one_of(drop) with select:
drop <- c("disp", "drat", "gear", "am")
select(mtcars, -one_of(drop)) %>% names()
# [1] "mpg" "cyl" "hp" "wt" "qsec" "vs" "carb"
one_of converts the column names in the character vector to integer positions, much as which(... %in% ...) does:
one_of(drop, vars = names(mtcars))
# [1] 3 5 10 9
which(names(mtcars) %in% drop)
# [1] 3 5 9 10
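For comparison, base R can also drop columns by name without computing integer positions, by selecting the names that are not in the drop list. In this sketch the vector is called drop_cols (a renamed copy of drop, to avoid confusion with the drop argument of `[`):

```r
# Same names as the question's drop vector, renamed for clarity
drop_cols <- c("disp", "drat", "gear", "am")

# setdiff() returns the column names to keep, in their original order
kept <- mtcars[, setdiff(names(mtcars), drop_cols), drop = FALSE]
names(kept)
```

Names in drop_cols that do not exist in the data are simply ignored by setdiff.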

Is it possible to manipulate data frame by column name and number?

I'm working on a large data set with about 900 columns. I have something like this:
B <- c(1)
A_1 <- c(2)
A_2 <- c(3)
A_3 <- c(7)
A_4 <- c(9)
df <- data.frame(B,A_1,A_2,A_3,A_4)
I would like to be able to do something like this :
df[,A_1:A_1+3]
Do you know if it's possible?
I'm also working with data.table so if there is a way with data.table it could be good.
Base R's subset will let you do this.
subset(mtcars, , mpg:(mpg + 1))
# mpg cyl
#Mazda RX4 21.0 6
#Mazda RX4 Wag 21.0 6
#Datsun 710 22.8 4
#Hornet 4 Drive 21.4 6
#Hornet Sportabout 18.7 8
#...
dplyr's select works the same way.
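Outside of subset, the same selection can be written by looking up the position of the starting column with match. A minimal sketch using the question's own df (start and sel are names introduced for illustration):

```r
df <- data.frame(B = 1, A_1 = 2, A_2 = 3, A_3 = 7, A_4 = 9)

# Position of the starting column, found by name
start <- match("A_1", names(df))

# Parentheses matter: start:start + 3 would be parsed as (start:start) + 3
sel <- df[, start:(start + 3)]
names(sel)
```

The parentheses are needed in the subset version for the same reason, because : binds more tightly than + in R.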

Resources