I'm migrating from data frames and matrices to data tables, but haven't found a solution for extracting the unique rows from a data table. I presume there's something I'm missing about the [i, j] notation, though I haven't yet found an answer in the FAQ or the intro vignettes. How can I extract the unique rows without converting back to a data frame?
Here is an example:
library(data.table)
set.seed(123)
a <- matrix(sample(2, 120, replace = TRUE), ncol = 3)
a <- as.data.frame(a)
b <- as.data.table(a)
# Confirm dimensionality
dim(a) # 40 3
dim(b) # 40 3
# Unique rows using all columns
dim(unique(a)) # 8 3
dim(unique(b)) # 34 3
# Unique rows using only a subset of columns
dim(unique(a[,c("V1","V2")])) # 4 2
dim(unique(b[,list(V1,V2)])) # 29 2
Related question: Is this behavior a result of the data being unsorted, as with the Unix uniq function?
Before data.table v1.9.8, the default behavior of the unique.data.table method was to use the key to determine the columns by which the unique combinations should be returned. If the key was NULL (the default), you got the original data set back (as in the OP's situation).
As of data.table v1.9.8, the unique.data.table method uses all columns by default, which is consistent with unique.data.frame in base R. To have it use the key columns, explicitly pass by = key(DT) into unique (replacing DT in the call to key with the name of your data.table).
Hence, old behavior would be something like
library(data.table) # v1.9.7 or earlier
set.seed(123)
a <- as.data.frame(matrix(sample(2, 120, replace = TRUE), ncol = 3))
b <- data.table(a, key = names(a))
## key(b)
## [1] "V1" "V2" "V3"
dim(unique(b))
## [1] 8 3
While for data.table v1.9.8+, just
b <- data.table(a)
dim(unique(b))
## [1] 8 3
## or dim(unique(b, by = key(b))) # in case you have keys and want to use them
Or without a copy
setDT(a)
dim(unique(a))
## [1] 8 3
As mentioned by Seth, the data.table package has evolved and now offers optimized functions for this.
For those who don't want to dig into the documentation, here is the fastest and most memory-efficient way to do what you want:
uniqueN(a)
And if you only want to use a subset of the columns, you can use the by argument:
uniqueN(a, by = c("V1", "V2"))
EDIT: As mentioned in the comments, this only gives the count of unique rows. To get the unique values themselves, use unique instead:
unique(a)
And for a subset:
unique(a[, c("V1", "V2")])
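Putting the two answers together, here is a self-contained sketch using the seeded data from the original question (the expected counts, 8 and 4, match the dimensions reported there):

```r
library(data.table)

set.seed(123)
a <- as.data.table(matrix(sample(2, 120, replace = TRUE), ncol = 3))

# Count distinct rows without materialising them
uniqueN(a)                      # 8 (over all columns)
uniqueN(a, by = c("V1", "V2"))  # 4 (over a subset of columns)

# Materialise the distinct rows themselves
u   <- unique(a)                        # all columns
u12 <- unique(a, by = c("V1", "V2"))    # first row of each V1/V2 combination
dim(u)    # 8 3
dim(u12)  # 4 3
```

Note that unique(a, by = ...) keeps all columns and only deduplicates on the given ones; subset the columns first if you want just V1 and V2 back.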
I have been struggling with this question for a couple of days.
I need to scan every row of a data frame and then assign a unique identifier to each row based on values found in a second data frame. Here is a toy example.
df1<-data.frame(c(99443975,558,99009680,99044573,599,99172478))
names(df1)<-"Building"
V1<-c(558,134917,599,120384)
V2<-c(4400796,14400095,99044573,4500481)
V3<-c(NA,99009680,99340705,99132792)
V4<-c(NA,99156365,NA,99132794)
V5<-c(NA,99172478,NA, 99181273)
V6<-c(NA, NA, NA,99443975)
row_number<-1:4
df2<-data.frame(cbind(V1, V2,V3,V4,V5,V6, row_number))
The output I expect is what follows.
row_number_assigned<-c(4,1,2,3,3,2)
output<-data.frame(cbind(df1, row_number_assigned))
Any hints?
Here's an efficient method using the arr.ind feature of the which function:
sapply( df1$Building,  # will send Building entries one-by-one
function(inp){ which(inp == df2,  # find matching values
arr.ind = TRUE)[1]})  # return only the row, not the column
[1] 4 1 2 3 3 2
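For a self-contained check, here is the same lookup wrapped around a tidied construction of the question's toy data (same column names and values as in the question):

```r
df1 <- data.frame(Building = c(99443975, 558, 99009680, 99044573, 599, 99172478))
df2 <- data.frame(V1 = c(558, 134917, 599, 120384),
                  V2 = c(4400796, 14400095, 99044573, 4500481),
                  V3 = c(NA, 99009680, 99340705, 99132792),
                  V4 = c(NA, 99156365, NA, 99132794),
                  V5 = c(NA, 99172478, NA, 99181273),
                  V6 = c(NA, NA, NA, 99443975),
                  row_number = 1:4)

# Comparing a scalar against a data.frame yields a logical matrix;
# which(..., arr.ind = TRUE) turns it into (row, col) positions, and
# [1] keeps just the row index of the first match.
row_number_assigned <- sapply(df1$Building,
                              function(inp) which(inp == df2, arr.ind = TRUE)[1])
row_number_assigned
# [1] 4 1 2 3 3 2
```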
Incidentally, your use of the data.frame(cbind(.)) construction is very dangerous. A much safer method of dataframe construction, which also needs fewer keystrokes, would be:
df2<-data.frame( V1=c(558,134917,599,120384),
V2=c(4400796,14400095,99044573,4500481),
V3=c(NA,99009680,99340705,99132792),
V4=c(NA,99156365,NA,99132794),
V5=c(NA,99172478,NA, 99181273),
V6=c(NA, NA, NA,99443975) )
(It didn't cause coding errors this time, but if there were any character columns it would have changed all the numbers to character values.) If you learned this from a teacher, can you somehow approach them gently and do their future students a favor by letting them know that cbind() will coerce all of its arguments to the "lowest common denominator"?
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
left_join(df2 %>%
pivot_longer(-row_number) %>%
select(-name),
by = c("Building" = "value"))
This returns
Building row_number
1 99443975 4
2 558 1
3 99009680 2
4 99044573 3
5 599 3
6 99172478 2
So I have a dataframe column of userIDs, among which there are duplicates. I was asked to find the userID that appears least frequently. What are the possible methods to achieve this, using only base R or dplyr?
Something like this
userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3)
Expected Output would be 3 in this case.
If this is based on the lengths of runs of adjacent equal values
with(rle(userID), values[which.min(lengths)])
#[1] 3
Or if it is based on the full data values
names(which.min(table(userID)))
#[1] "3"
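The two one-liners answer subtly different questions: rle() looks at runs of adjacent equal values, so it is order-sensitive, while table() counts occurrences over the whole vector. A small made-up vector shows where they diverge:

```r
# Value 1 appears three times but is split into runs of lengths 1 and 2;
# value 2 appears only twice, but in a single run of length 2.
x <- c(1, 2, 2, 1, 1)

with(rle(x), values[which.min(lengths)])  # 1   (shortest adjacent run)
names(which.min(table(x)))                # "2" (least frequent overall)
```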
Another possibility is to get the least frequent value via a Mode-style table function:
# example dataframe
df <- data.frame(userID = c(1,1,1,1,2,2,1,1,4,4,4,4,3))
# define Mode function
Mode <- function(x){
  a <- table(x)            # tabulate the values in the column x
  return(a[which.min(a)])  # least frequent value, with its count
}
Mode(df$userID)
# Output:
3 #value
1 #count
Gives the value 3 and the count 1
I am using the R data.table package. I just want to access a specific column of a data table using a variable.
But when I put the variable inside the data table's brackets, the variable just comes out as it is, not the column.
for example,
df <- matrix(1:12,nrow=4,ncol=3)
df <- as.data.table(df)
colnames(df) <- c("A","B","C")
list <- c("A","B","C")
df[, list[3]]
The result of the above code is just "C", not (9, 10, 11, 12).
Other results from my attempts to figure out this problem are below.
df[,list[3]]
[1] "C"
df[,"C"]
C
1: 9
2: 10
3: 11
4: 12
list[3] == "C"
[1] TRUE
Why does this problem happen? How do I get specific column from data table using variable?
Thank you.
What version of data.table are you using? They changed that behavior a while ago (in v1.9.8).
For your version, try with = FALSE. From help(data.table):
By default with=TRUE and j is evaluated within the frame of x; column names can be used as variables. In case of overlapping variable names inside the dataset and in the parent scope, you can use the double dot prefix ..cols to explicitly refer to the cols variable in the parent scope and not in your dataset.
When j is a character vector of column names, a numeric vector of column positions to select, or of the form startcol:endcol, the value returned is always a data.table. with=FALSE is not necessary anymore to select columns dynamically.
library(data.table)
df[,"C",with=FALSE]
C
1: 9
2: 10
3: 11
4: 12
This also works.
df[,list[3],with=FALSE]
C
1: 9
2: 10
3: 11
4: 12
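The help text quoted above also mentions the double dot prefix, which since v1.9.8 is the usual way to select columns held in a variable. A minimal sketch (using the variable name cols rather than the question's list, since list shadows the base function):

```r
library(data.table)
df <- as.data.table(matrix(1:12, nrow = 4, ncol = 3))
setnames(df, c("A", "B", "C"))

cols <- "C"   # the column name, held in a variable
df[, ..cols]  # the .. prefix says: look cols up in the calling scope
# C
# 1: 9
# 2: 10
# 3: 11
# 4: 12
```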
The more data.table way to do it would be C without quotes.
df[,C]
[1] 9 10 11 12
You might wonder why df[,"C"] returns a data.table, but df[,C] returns a vector. And that, I would argue, is a good question. I guess it is because df[,"C"] mirrors how one would subset a data.frame, which would also return a data.frame.
sessionInfo()
R version 3.6.3 (2020-02-29)
data.table_1.12.8
My data looks like
Name Pd1 Pd2 Pd3 Pd4
A 2 6 8 9
B 6 3 7 1
I want to collect, for each row, the column names ordered from the highest value to the lowest.
I wish to see my data like
Name pdts
A c(pd4,pd3,pd2,pd1)
B c(Pd3,Pd1,Pd2,Pd4)
Kindly help me to do this in R.
You can use data.table with the apply function and sort.list to do this:
library(data.table)
setDT(df)
df <- df[, list(list(colnames(.SD)[c(t(apply(.SD, 1, function(x) sort.list(x, decreasing = TRUE))))])), Name]
print(df)
Name V1
1: A Pd4,Pd3,Pd2,Pd1
2: B Pd3,Pd1,Pd2,Pd4
Explanation:
1. apply(.SD, 1, function(x) sort.list(x, decreasing = TRUE)) - gives the indexes of the columns, row-wise.
2. t - we transpose the result to get a row-wise vector.
3. c(t(apply(.SD, 1, function(x) sort.list(x, decreasing = TRUE)))) - the complete expression returns the sorted column indexes, which is exactly what we need to solve this problem.
4. colnames(.SD) - .SD is a special symbol used in data.table. It refers to the grouped data, and here we get the column names from it.
5. Finally, we subset the column names by the indexes we got in step 3.
6. And we group by the Name column to get the solution for each Name.
7. You might find this overwhelming; to understand it, run it step by step and see how the solution evolves.
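For a self-contained check, here is the same idea run on the question's two-row table, using a per-group order() call in place of the apply()/sort.list() combination (a sketch; it relies on each Name forming its own one-row group):

```r
library(data.table)
df <- data.table(Name = c("A", "B"),
                 Pd1 = c(2, 6), Pd2 = c(6, 3), Pd3 = c(8, 7), Pd4 = c(9, 1))

# For each one-row group, unlist(.SD) is a named numeric vector of the Pd
# columns; order(-...) gives the indexes from highest to lowest value.
res <- df[, list(list(colnames(.SD)[order(-unlist(.SD))])), by = Name]
res
#    Name              V1
# 1:    A Pd4,Pd3,Pd2,Pd1
# 2:    B Pd3,Pd1,Pd2,Pd4
```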
I have a data.table with two of its columns (V1,V2) set as a key. A third column contains multiple values.
I have another data.table with pairs I want to look up in the first table's key, and I want to return a single combined list of the lists in V3.
locations<-data.table(c(-159.58,0.2,345.1),c(21.901,22.221,66.5),list(c(10,20),c(11),c(12,33)))
setkey(locations,V1,V2)
searchfor<-data.table(lon=c(-159.58,345.1,11),lat=c(21.901,66.5,0))
The final result should look like this:
[1] 10 20 12 33
The following works when searching for just one item.
locations[.(-159.58,21.901),V3][[1]]
[1] 10 20
I don't know how to generalize this and use the table searchfor as the source for the searched indices (BTW, the searchfor table can be changed to a different format if that makes the solution easier).
Also, how do I unite the different V3 values I'll (hopefully) receive into one list?
You can use the same syntax as before, passing the data.table itself as the index:
lst <- locations[ .(searchfor), V3]
You probably want to keep only the matched (non-null) elements. If so, you can use the nomatch = 0L argument directly:
locations[ .(searchfor), V3, nomatch=0L]
# [[1]]
# [1] 10 20
#
# [[2]]
# [1] 12 33
This will return a list. If you want to return a vector instead, use the base function unlist() as follows:
locations[ .(searchfor), unlist(V3), nomatch=0L]
# [1] 10 20 12 33