Count Distinct values from 2 columns

Count Distinct values from 2 columns - r

I need to get only distinct values that are spread over two columns and return the distinct values into one column.
Example:
colA colB
---- --------
darcy elizabeth
elizabeth darcy
jon doe
doe joe
It should return:
resultCol
darcy
elizabeth
jon
doe
Is there any builtin function or library that can do that more efficiently?
I tried a workaround to get the results but it is extremely slow for more than 100 thousands observations.
#First i create a sample dataframe
col1<-c("darcy","elizabeth","elizabeth","darcy","john","doe")
col2<-c("elizabeth","darcy","darcy","elizabeth","doe","john")
dfSample<-data.frame(col1,col2)
#Then i create an empty dataframe to store all values in a single column
emptyDataframe<-data.frame(resultColumn=character())
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,1])),stringsAsFactors=FALSE)
}
for(i in 1:nrow(dfSample)){
emptyDataframe<-rbind(emptyDataframe,c(toString(dfSample[i,2])),stringsAsFactors=FALSE)
}
emptyDataframe
#Finally i get the distinct values using dplyr
var_distinct_values<-distinct(emptyDataframe)

I use union to get unique values across specific columns:
with(dfSample, union(col1, col2))
PS: The answer from d.b in the comments is also another way.
You can improvise his answer if you have extra columns but want to run it only over specific columns:
unique(unlist(dfSample[1:2]))
This gets the unique values from first two columns.

Here is a general purpose solution.
It's based on this answer but can be extended to any number of columns as long as the object is a data.frame or list.
Reduce(union, dfSample)
[1] "darcy" "elizabeth" "john" "doe"
Now with 100K observations in each of 10 columns.
set.seed(1234)
n <- 1e5
bigger <- replicate(n, sample(c(col1, col2), 10, TRUE))
bigger <- as.data.frame(bigger)
system.time(Reduce(union, bigger))
# user system ellapsed
# 3.769 0.000 3.772
Edit.
After a second thought, I realized that the test above is run with a dataframe with a very small number of different values. A test with a larger number does not necessarily give the same results.
set.seed(1234)
s <- sprintf("%05d", 1:5000)
big2 <- replicate(n, sample(s, 10, TRUE))
big2 <- as.data.frame(big2)
rm(s)
microbenchmark::microbenchmark(
red = Reduce(union, big2),
uniq = unique(unlist(big2)),
times = 10
)
#Unit: seconds
# expr min lq mean median uq max neval cld
# red 26.021855 26.42693 27.470746 27.198807 28.56720 29.022047 10 b
# uniq 1.405091 1.42978 1.632265 1.548753 1.56691 2.693431 10 a
The unique/unlist solution is now clearly better.

Related

equivalent of melt+reshape that splits on column names

Point: if you are going to vote to close, it is poor form not to give a reason why. If it can be improved without requiring a close, take the 10 seconds it takes to write a brief comment.
Question:
How do I do the following "partial melt" in a way that memory can support?
Details:
I have a few million rows and around 1000 columns. The names of the columns have 2 pieces of information in them.
Normally I would melt to a data frame (or table) comprised of a pair of columns, then I would split on the variable name to create two new columns, then I would cast using one of the new splits for new column names, and one for row names.
This isn't working. My billion or so rows of data are making the additional columns overwhelm my memory.
Outside the "iterative force" (as opposed to brute force) of a for-loop, is there a clean and effective way to do this?
Thoughts:
this is a little like melt-colsplit-cast
libraries common for this seem to be "dplyr", "tidyr", "reshape2", and "data.table".
tidyr's gather+separate+spread looks good, but doesn't like not having a unique row identifier
reshape2's dcast (I'm looking for 2d output) wants to aggregate
brute force loses the labels. By brute force I mean df <- rbind(df[,block1],...) where block is the first 200 column indices, block2 is the second, etcetera.
Update (dummy code):
#libraries
library(stringr)
#reproducibility
set.seed(56873504)
#geometry
Ncol <- 2e3
Nrow <- 1e6
#column names
namelist <- numeric(length=Ncol)
for(i in 1:(Ncol/200)){
col_idx <- 1:200+200*(i-1)
if(i<26){
namelist[col_idx] <- paste0(intToUtf8(64+i),str_pad(string=1:200,width=3,pad="0"))
} else {
namelist[col_idx] <- paste0(intToUtf8(96+i),str_pad(string=1:200,width=3,pad="0"))
}
}
#random data
df <- as.data.frame(matrix(runif(n=Nrow*Ncol,min=0, max=16384),nrow=Nrow,ncol=Ncol))
names(df) <- namelist
The output that I would be looking for would have a column with the first character of the current name (single alphabet character) and colnames would be 1 to 200. It would be much less wide than "df" but not fully melted. It would also not kill my cpu or memory.
(Ugly/Manual) Brute force version:
(working on it... )

Here are two options both using data.table.
If you know that each column string always has 200 (or n) fields associated with it (i.e., A001 - A200), you can use melt() and make a list of measurement variables.
melt(dt
, measure.vars = lapply(seq_len(Ncol_p_grp), seq.int, to = Ncol_p_grp * n_grp, by = Ncol_p_grp)
, value.name = as.character(seq_len(Ncol_p_grp))
)[, variable := rep(namelist_letters, each = Nrow)][]
#this data set used Ncol_p_grp <- 5 to help condense the data.
variable 1 2 3 4 5
1: A 0.2655087 0.06471249 0.2106027 0.41530902 0.59303088
2: A 0.3721239 0.67661240 0.1147864 0.14097138 0.55288322
3: A 0.5728534 0.73537169 0.1453641 0.45750426 0.59670404
4: A 0.9082078 0.11129967 0.3099322 0.80301300 0.39263068
5: A 0.2016819 0.04665462 0.1502421 0.32111280 0.26037592
---
259996: Z 0.5215874 0.78318812 0.7857528 0.61409610 0.67813484
259997: Z 0.6841282 0.99271480 0.7106837 0.82174887 0.92676493
259998: Z 0.1698301 0.70759513 0.5345685 0.09007727 0.77255570
259999: Z 0.2190295 0.14661878 0.1041779 0.96782695 0.99447460
260000: Z 0.4364768 0.06679642 0.6148842 0.91976255 0.08949571
Alternatively, we can use rbindlist(lapply(...)) to go through the data set and subset it based on the letter within the columns.
rbindlist(
lapply(namelist_letters,
function(x) setnames(
dt[, grep(x, names(dt), value = T), with = F]
, as.character(seq_len(Ncol_p_grp)))
)
, idcol = 'ID'
, use.names = F)[, ID := rep(namelist_letters, each = Nrow)][]
With 78 million elements in this dataset, it takes around a quarter of a second. I tried to up it to 780 million, but I just don't really have the RAM to generate the data that quickly in the first place.
#78 million elements - 10,000 rows * 26 grps * 200 cols_per_group
Unit: milliseconds
expr min lq mean median uq max neval
melt_option 134.0395 135.5959 137.3480 137.1523 139.0022 140.8521 3
rbindlist_option 290.2455 323.4414 350.1658 356.6373 380.1260 403.6147 3
Data: Run this before everything above:
#packages ----
library(data.table)
library(stringr)
#data info
Nrow <- 10000
Ncol_p_grp <- 200
n_grp <- 26
#generate data
set.seed(1)
dt <- data.table(replicate(Ncol_p_grp * n_grp, runif(n = Nrow)))
names(dt) <- paste0(rep(LETTERS[1:n_grp], each = Ncol_p_grp)
, str_pad(rep(seq_len(Ncol_p_grp), n_grp), width = 3, pad = '0'))
#first letter
namelist_letters <- unique(substr(names(dt), 1, 1))

Difference between 'select' and '$' in R

I want to understand the speed difference between select and $ to subset columns in R (whilst appreciating that they do not return exactly the same things, rather both perform the conceptual get-me-a-column operation). I would like to understand when either is most appropriate.
Specifically, under what conditions would the following select statement be faster than the corresponding $ statement?
Syntax is:
select(df, colName1, colName2, ...)
df$colName

In summary, you should use dplyr when speed of development, ease of understanding or ease of maintenance is most important.
Benchmarks below show that the operation takes longer with dplyr than base R equivalents.
dplyr returns a different (more complex) object.
Base R $ and similar operations can be faster to execute, but come with additional risks (e.g. partial matching behaviour); may be harder to read and/to maintain; return a (minimal) vector object, which might be missing some of the contextual richness of a data frame.
This might also help tease out (if one is wont to avoid looking at source code of packages) that dplyr is doing alot of work under the hood to target columns. It's also an unfair test since we get back different things, but all the ops are "give me this column" ops, so read it with that context:
library(dplyr)
microbenchmark::microbenchmark(
base1 = mtcars$cyl, # returns a vector
base2 = mtcars[['cyl', exact = TRUE]], # returns a vector
base2a = mtcars[['cyl', exact = FALSE]], # returns a vector
base3 = mtcars[,"cyl"], # returns a vector
base4 = subset(mtcars, select = cyl), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars, cyl), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars, "cyl"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars, cyl), # returns a vector
dplyr4 = dplyr::pull(mtcars, "cyl") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.682 6.3860 9.23727 7.7125 10.6050 25.397 100
## base2 4.224 5.9905 9.53136 7.7590 11.1095 27.329 100
## base2a 3.710 5.5380 7.92479 7.0845 10.1045 16.026 100
## base3 6.312 10.9935 13.99914 13.1740 16.2715 37.765 100
## base4 51.084 70.3740 92.03134 76.7350 95.9365 662.395 100
## dplyr1 698.954 742.9615 978.71306 784.8050 1154.6750 3568.188 100
## dplyr2 711.925 749.2365 1076.32244 808.9615 1146.1705 7875.388 100
## dplyr3 64.299 78.3745 126.97205 85.3110 112.1000 2383.731 100
## dplyr4 63.235 73.0450 99.28021 85.1080 114.8465 263.219 100
But, what if we have alot of columns:
# Make a wider version of mtcars
do.call(
cbind.data.frame,
lapply(1:20, function(i) setNames(mtcars, sprintf("%s_%d", colnames(mtcars), i)))
) -> mtcars_manycols
# I randomly chose to get "cyl_4"
microbenchmark::microbenchmark(
base1 = mtcars_manycols$cyl_4, # returns a vector
base2 = mtcars_manycols[['cyl_4', exact = TRUE]], # returns a vector
base2a = mtcars_manycols[['cyl_4', exact = FALSE]], # returns a vector
base3 = mtcars_manycols[,"cyl_4"], # returns a vector
base4 = subset(mtcars_manycols, select = cyl_4), # returns a 1 column data frame
dplyr1 = dplyr::select(mtcars_manycols, cyl_4), # returns a 1 column data frame
dplyr2 = dplyr::select(mtcars_manycols, "cyl_4"), # returns a 1 column data frame
dplyr3 = dplyr::pull(mtcars_manycols, cyl_4), # returns a vector
dplyr4 = dplyr::pull(mtcars_manycols, "cyl_4") # returns a vector
)
## Unit: microseconds
## expr min lq mean median uq max neval
## base1 4.534 6.8535 12.15802 8.7865 13.1775 75.095 100
## base2 4.150 6.5390 11.59937 9.3005 13.2220 73.332 100
## base2a 3.904 5.9755 10.73095 7.5820 11.2715 61.687 100
## base3 6.255 11.5270 16.42439 13.6385 18.6910 70.106 100
## base4 66.175 89.8560 118.37694 99.6480 122.9650 340.653 100
## dplyr1 1970.706 2155.4170 3051.18823 2443.1130 3656.1705 9354.698 100
## dplyr2 1995.165 2169.9520 3191.28939 2554.2680 3765.9420 11550.716 100
## dplyr3 124.295 142.9535 216.89692 166.7115 209.1550 1138.368 100
## dplyr4 127.280 150.0575 195.21398 169.5285 209.0480 488.199 100
For a ton of projects, dplyr is a great choice. Speed of execution, however, is very often not an attribute of the "tidyverse" but the speed of development and expressiveness usually outweigh the speed difference.
NOTE: dplyr verbs are likely better candidates than subset() and — while I lazily use $ it's also a tad dangerous due to default partial matching behaviour as is [[]] without exact=TRUE. A good habit (IMO) to get into is setting options(warnPartialMatchDollar = TRUE) in all your projects where you aren't knowingly counting on this behaviour.

It is not the same. If you're looking for the same functionality you could consider pull() from the same dplyr package.
Dollarsign returns a vector 'build' from the dataframe, pull does the same.

select is in the dplyr package, part of the tidyverse. https://dplyr.tidyverse.org/
you might do something like
df %>%
select(colName1, colName2)
Which would select those columns from df. These statements are written like verbs (e.g. select, arrange, group_by, etc.) and makes it much easier to work with data.
$ is from base r. It would show you only that column from df.

How to make factor names appear in ifelse statement in R?

I have to following dataset. I want to create a column so that if there is a number in the unid column then in dat$identification I want it to say "unidentified" otherwise I want it to print whatever is there in the species column. So the final output should look like dat$identificaiton x,y,unidentified,unidentified. With this code it shows 1,2,unidentified,unidentified.
Please note, for other purposes I want to use only the unid column for the !(is.na) part of the ifelse statement and not the species.
unid <- c(NA,NA,1,4)
species <- c("x","y",NA,NA)
df <- data.frame(unid, species)
df$identification <- ifelse(!is.na(unid), "unidentified", df$species)
#Current Output of df$identification:
1,2,unidentified,unidentified
#Needed Output
x,y,unidentified,unidentified

You can coerce the column of class 'factorto classcharacterin theifelse`.
df$identification <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df
# unid species identification
#1 NA x x
#2 NA y y
#3 1 <NA> unidentified
#4 4 <NA> unidentified
Edit.
After the OP accepted the answer, I reminded myself that ifelse is slow and indexing fast, so I tested both using a larger dataset.
First of all, see if both solutions produce the same results:
df$id1 <- ifelse(!is.na(unid), "unidentified", as.character(df$species))
df$id2 <- "unidentified"
df$id2[is.na(unid)] <- species[is.na(unid)]
identical(df$id1, df$id2)
#[1] TRUE
The results are the same.
Now time them both using package microbenchmark.
n <- 1e4
df1 <- data.frame(unid = rep(unid, n), species = rep(species, n))
microbenchmark::microbenchmark(
ifelse = {df1$id1 <- ifelse(!is.na(df1$unid), "unidentified", as.character(df1$species))},
index = {df1$id2 <- "unidentified"
df1$id2[is.na(df1$unid)] <- species[is.na(df1$unid)]
},
relative = TRUE
)
#Unit: nanoseconds
# expr min lq mean median uq max neval cld
# ifelse 12502465 12749881 16080160.39 14365841 14507468.5 85836870 100 c
# index 3243697 3299628 4575818.33 3326692 4983170.0 74526390 100 b
#relative 67 68 208.89 228 316.5 540 100 a
On average, indexing is 200 times faster. More than worth the trouble to write two lines of code instead of just one for ifelse.

How can I add a column with the names of the nth list element to each nth element of the list?

Say I have
library(dplyr)
a <- list(a=tbl_df(cars), b=tbl_df(iris))
How can I add to each element of this list a column name whose values are the name of the named element of the list?
For instance, this how the output should look like for the first element
Source: local data frame [50 x 3]
speed dist name
(dbl) (dbl) (chr)
1 4 2 a
2 4 10 a
3 7 4 a
4 7 22 a
5 8 16 a
6 9 10 a
7 10 18 a
8 10 26 a
9 10 34 a
10 11 17 a

After all this commenting, guess I'll write an answer.
You should use a for loop for this: it's quick to code, quick to execute, readable and straightforward:
for (i in seq_along(a)) a[[i]]$name = names(a)[i]
You could use map or mapply or lapply instead of the for loop. In this case, I would think it will be less readable.
You could also use mutate instead of [ for adding the column. This will be slower:
library(microbenchmark)
library(dplyr)
cars_tbl = tbl_df(cars)
mbm = microbenchmark
mbm(
mutate = {cars_tbl = mutate(cars_tbl, name = 'a')},
base = {cars_tbl['name'] = 'a'}
)
# Unit: microseconds
# expr min lq mean median uq max neval cld
# mutate 240.617 262.4730 293.29001 276.158 299.7255 813.078 100 b
# base 34.971 42.1935 55.46356 53.407 57.3980 226.932 100 a
For such a simple operation, [<- is going to be hard to beat. data.table will probably be faster, but only if the object is already a data.table. If the object is a data.frame rather than a tbl_df, then the mutate is about twice as slow. But these differences are in microseconds. Unless you are repeatedly doing this operation to lists of at least hundreds of thousands of data frames it won't matter.
This is not to say dplyr has poor performance - when you are using the grouping operations, relying on the NSE built in to dplyr, it's excellent. This is just a simple case where the simple base solution is easiest and also quickest.
If we increase the size of the data enough so that it takes a noticeable amount of time to do these operations (10 million rows, here), the differences essentially go away:
df = tbl_df(data.frame(x = rep(1, 1e7)))
mbm(
mutate = {df = mutate(df, name = 'a')},
base = {df['name'] = 'a'}
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# mutate 58.08095 59.87531 132.3180 105.22507 207.6439 261.8121 100 a
# base 52.09899 53.96386 129.9304 99.96153 203.8581 237.0084 100 a
Implementing with for loops and with map, comparing [<- and mutate
# base for loop
for (i in seq_along(a)) {
a[[i]]$name = names(a)[i]
}
# dplyr in for loop
for (i in seq_along(a)) {
a[[i]] = mutate(a[[i]], name = names(a)[i])
}
# dplyr hiding the loop in Map()
a = Map(function(x, y) mutate(x, name = y), x = a, y = names(a))
We could benchmark these (I did -- see the edit history if you want the results), but the differences are less than 1 millisecond so it shouldn't matter. Go with whatever is easiest for you to read, write, and understand.
All this comes with the caveat that if your eventual goal is to bind these data frames together and you want the name column to see what list element the data came from, then that is implemented directly in dplyr::bind_rows.

Use Factor Vector to Lookup Value in Data Frame

I have a vector
> head(gbmPred)
[1] COMPLETED DEAD COMPLETED COMPLETED COMPLETED LOW
I also have a data frame
> head(gbmPredProb)
COLLECTION COMPLETED DEAD LOW
1 0.04535981 0.8639282 0.07698963 0.01372232
2 0.19031127 0.6680874 0.11708416 0.02451713
3 0.25004446 0.6789679 0.04827067 0.02271702
4 0.09625138 0.7877128 0.09906595 0.01696983
5 0.15696875 0.7617585 0.04441733 0.03685539
6 0.14157307 0.7690410 0.06057754 0.02880836
I want to be create a vector by using the levels in gbmPred to lookup the values in gbmPredProb:
0.8639282
0.1170841
0.6789679
0.7877128
0.7617585
0.02880836
Does anyone know how to do this in R? Appreciate the help.
EDIT *** Sorry copy and paste error. Fixed above
The first value .86 matches COMPLETED
the second value .11 matches DEAD
WHat I am looking for is to loop through the vector gbmPred to get the value (COMPLETED,etc), then search gbmPredProb data frame for the value matching the column with the same name as well as the index of the vector.
So, the first value is COMPLETED. Look at gbmPredProb and get .863
The second value of gbmPred is DEAD. Look at gbmPredProb and get .117
the thrid value of gbmPred is COMPLETED. Look at gbmPredProb and get .678

If you have a bunch of (row, col) pairs that you want to grab out of a matrix, a good way to get them is to index by a 2-column matrix where the first column is all the row numbers of the elements you want and the second column is all the column numbers of the elements you want:
gbmPredProb[cbind(1:length(gbmPred), match(gbmPred, names(gbmPredProb)))]
# [1] 0.86392820 0.11708416 0.67896790 0.78771280 0.76175850
# [6] 0.02880836
One advantage of this sort of an approach is that it will be a good deal quicker than a row-by-row approach on larger data frames:
gbmPredProb <- gbmPredProb[rep(1:6, each=1000),] # 6000x4
gbmPred <- rep(gbmPred, each=1000) # Length 6000
josilber <- function(mat, vec) mat[cbind(1:length(vec), match(vec, names(mat)))]
rscriven <- function(mat, vec) sapply(seq_along(vec), function(i) mat[i, as.character(vec[i])])
all.equal(josilber(gbmPredProb, gbmPred), rscriven(gbmPredProb, gbmPred))
# [1] TRUE
library(microbenchmark)
microbenchmark(josilber(gbmPredProb, gbmPred), rscriven(gbmPredProb, gbmPred))
# Unit: microseconds
# expr min lq median uq max neval
# josilber(gbmPredProb, gbmPred) 328.524 398.8545 442.065 512.949 766.082 100
# rscriven(gbmPredProb, gbmPred) 97843.015 111478.4360 117294.079 123901.368 254645.966 100

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

Count Distinct values from 2 columns - r

Related

equivalent of melt+reshape that splits on column names

Difference between 'select' and '$' in R

How to make factor names appear in ifelse statement in R?

How can I add a column with the names of the nth list element to each nth element of the list?

Use Factor Vector to Lookup Value in Data Frame

Categories

Resources