Finding distinct values across multiple columns in R - r

The distinct function in R returns unique values within a single column. However, I would like to have unique values regardless of which column the value appears in.
The sample data is shown below.
10A appears in the second row under var 1. It appears again in the third row, although it is in var 2 this time. I would like to remove the third row since at least one of the values (10A) is a duplicate.
In the 5th row, 10B has already appeared in row 2, so I would also like to remove the 5th row, as at least one of its values is a duplicate.
In the 6th row, although 7A has appeared before in rows 3 and 5, rows 3 and 5 will be removed, therefore 7A is not considered a duplicate and I would like to retain row 6.
Rows 7 and 8 have NA values. NA should not be considered as duplicate so rows 7 and 8 should be retained.
How do I do it in R?
Sample data
var 1   var 2
5A      5B
10A     10B
7A      10A
6B      5C
10B     7A
10C     7A
99A     NA
NA      99B
Required Result
var 1   var 2
5A      5B
10A     10B
6B      5C
10C     7A
99A     NA
NA      99B

If df has the variables var1 and var2, and you want to keep only the rows whose var2 value does not occur anywhere in var1:
library(dplyr)
df |>
  filter(!var2 %in% unique(var1))

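You can easily use recursion to accomplish this:
relation <- function(dat){
  # stop when at most one row is left (zero rows remain if every later row was a duplicate)
  if(nrow(dat) <= 1) dat
  else
    # keep the first row, drop every row sharing a value with it, recurse on the rest
    rbind(dat[1,],
          relation(dat[!tapply(unlist(dat) %in% dat[1,], row(dat), sum),]))
}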
relation(df)
var.1 var.2
1 5A 5B
2 10A 10B
4 6B 5C
6 10C 7A

Update:
The data:
dt <- read.table(header=TRUE, text = "
'var1' 'var2'
'5A' '5B2' #1
'10A' '10B' #2
'7A' '10A' #3 - to be removed
'6B' '5C' #4
'10B' '7A' #5 - to be removed
'10C' '7A' #6
'99A' 'NA' #7 - keep
'2A' '3B' #8
'NA' '99B' #9 - keep
'3A' '11B' #10")
Loop solution
looper <- function(dt) {
  # values seen so far in kept rows, starting with the first row (NAs ignored)
  uniqstock <- na.omit(unlist(dt[1, ]))
  rows2keep <- TRUE # the first row is always kept
  # loop through the remaining rows of the data frame
  for (r in seq_len(nrow(dt))[-1]) {
    rowdat <- na.omit(unlist(dt[r, ])) # na.omit to ignore NAs in row
    # Is *any* value in the current row a duplicate of a previously kept value?
    dupl <- any(rowdat %in% uniqstock)
    rows2keep <- c(rows2keep, !dupl)
    # Only kept rows add their values to the stock used for future duplicate checks
    if (!dupl)
      uniqstock <- c(uniqstock, rowdat)
  }
  dt[rows2keep, ] # return only the rows that were kept
}
Applier solution
applier <- function(dt) {
  uniqstock <- character()
  unqrows <- function(x) {
    # Is *any* value in the current row a duplicate of a previously kept value?
    dupl <- any(x %in% uniqstock)
    # A duplicated row is dropped and adds nothing to future duplicate checks
    if (dupl) return(FALSE)
    uniqstock <<- c(uniqstock, na.omit(x))
    return(TRUE)
  }
  rows2keep <- apply(dt, 1, unqrows)
  dt[rows2keep, ]
}
Recursion solution by @onyambu, with small modifications to handle NAs correctly
recursor <- function(dt) {
  relation <- function(dat){
    # stop when at most one row is left
    if(nrow(dat) <= 1) dat
    else
    {
      # Original version: Include <- unlist(dat) %in% dat[1,]
      # incomparables = NA prevents NA from matching NA
      Include <- match(unlist(dat), dat[1,], nomatch = 0, incomparables = NA) > 0
      rbind(dat[1,],
            relation(dat[!tapply(Include, row(dat), sum),]))
    }
  }
  relation(dt)
}
Benchmark. Compare the solutions’ performance, since speed is important here:
library(microbenchmark)
microbenchmark(
looper(dt), recursor(dt), applier(dt)#, check = "equivalent"
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> looper(dt) 369.8 393.00 512.402 404.5 420.60 9094.1 100
#> recursor(dt) 1408.5 1446.75 1725.944 1459.6 1492.75 16427.9 100
#> applier(dt) 144.8 155.60 260.622 164.3 173.65 6143.0 100
Check the results of the solutions
looper(dt)
#> var1 var2
#> 1 5A 5B2
#> 2 10A 10B
#> 4 6B 5C
#> 6 10C 7A
#> 7 99A <NA>
#> 8 2A 3B
#> 9 <NA> 99B
#> 10 3A 11B
recursor(dt) # considers NA as duplicates
#> var1 var2
#> 1 5A 5B2
#> 2 10A 10B
#> 4 6B 5C
#> 6 10C 7A
#> 7 99A <NA>
#> 8 2A 3B
#> 9 <NA> 99B
#> 10 3A 11B
applier(dt) # loses column labels
#> var1 var2
#> 1 5A 5B2
#> 2 10A 10B
#> 4 6B 5C
#> 6 10C 7A
#> 7 99A <NA>
#> 8 2A 3B
#> 9 <NA> 99B
#> 10 3A 11B
Created on 2022-06-04 by the reprex package (v2.0.1)
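For reference, the row-by-row rule can also be checked against the question's original eight-row sample. The following is only a minimal base R sketch, not one of the benchmarked solutions; the function name keep_rows_with_new_values and the data frame df are hypothetical names chosen for illustration, and the data are retyped from the question's table.

```r
# Sample data retyped from the question (var1/var2 stand in for "var 1"/"var 2")
df <- data.frame(
  var1 = c("5A", "10A", "7A", "6B", "10B", "10C", "99A", NA),
  var2 = c("5B", "10B", "10A", "5C", "7A", "7A", NA, "99B"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep a row only if none of its non-NA values
# has already been seen in a previously kept row
keep_rows_with_new_values <- function(dat) {
  seen <- character()
  keep <- logical(nrow(dat))
  for (i in seq_len(nrow(dat))) {
    vals <- unlist(dat[i, ], use.names = FALSE)
    vals <- as.character(vals[!is.na(vals)])  # NA never counts as a duplicate
    keep[i] <- !any(vals %in% seen)
    if (keep[i]) seen <- c(seen, vals)        # dropped rows add nothing
  }
  dat[keep, , drop = FALSE]
}

result <- keep_rows_with_new_values(df)
rownames(result)
# "1" "2" "4" "6" "7" "8"
```

Rows 3 and 5 are dropped, and row 6 survives because 7A was only ever seen in dropped rows, which matches the required result in the question.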

Related

How to use Logical indexing and min function to find the row which has min value?

So, I know how to find it using the subset function. Is there any way not to use subset function?
Example dataset:
Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75
So, let's say I need to write code to find the min value in column B.
My code: subset(df, B == min(B))
My question:
How can I use logical indexing and the min function for this dataset? I don't want to use subset.
You can use which to find the positions.
x <- c(2,1,3,1)
which(x == min(x))
#[1] 2 4
To get the first hit which.min could be used.
which.min(x)
#[1] 2
With the given data set.
x <- read.table(header=TRUE, text="Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75")
which(x$B == min(x$B))
#[1] 2
which(x[2:3] == min(x[2:3]), TRUE)
# row col
#[1,] 2 1
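Since the question asks for logical indexing specifically, the comparison can also be used directly as a row index without which or subset. This is a small sketch on the same data; min_rows is just an illustrative name.

```r
x <- read.table(header = TRUE, text = "Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75")

# The logical vector x$B == min(x$B) selects every row that attains the minimum of B
min_rows <- x[x$B == min(x$B), ]
min_rows
```

If B could contain NA values, min(x$B, na.rm = TRUE) combined with which(...) would avoid NA rows in the result.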

Select rows based on multi-column attributes in R

I have to merge two datasets (spatial) for which I need to keep the row (polyline) with the most information (i.e. the longest line).
I can select the rows with the same ID as the other dataframe or not select rows with the same ID (see sample below). And reverse that operation. But I can't figure out how to choose the row of the two dataframes that has the bigger length value.
#set up sample data
sample = data.frame(x=c(1:5), length=c(1.2,1.3,1.5,7.2,36.1), ID_obj=c("a3", "4a", "5b", "8b", "a7"))
sample2 = data.frame(x=c(1:5), length=c(15.1,1.3,1.5,17.2,6.1), ID_obj=c("a3", "k6", "9c", "8b", "a7"))
#select the rows with identical values
sample[sample$ID_obj %in% sample2$ID_obj,]
sample2[sample2$ID_obj %in% sample$ID_obj,]
#select rows without duplicates in ID
'%!in%' <- function(x,y)!('%in%'(x,y))
sample[sample$ID_obj %!in% sample2$ID_obj,]
sample2[sample2$ID_obj %!in% sample$ID_obj,]
error<-rbind(sample, sample2[sample2$ID_obj %!in% sample$ID_obj,])
# x length ID_obj
#1 1 1.2 a3
#2 2 1.3 4a
#3 3 1.5 5b
#4 4 7.2 8b#8b from the first set should not have been kept because its length is shorter
#5 5 36.1 a7
#21 2 1.3 k6
#31 3 1.5 9c
#this is the result I want to get automatically
final<-rbind(sample[c(2, 3, 5),], sample2[c(1, 2, 3, 4),])
# x length ID_obj
#2 2 1.3 4a
#3 3 1.5 5b
#5 5 36.1 a7#keep a7 from the first set because length is longer
#1 1 15.1 a3
#21 2 1.3 k6
#31 3 1.5 9c
#4 4 17.2 8b#keep 8b from the second set because length is longer
Use the data.table package for a simplified syntax (and better performance than data.frame):
sample = data.frame(x=c(1:5), length=c(1.2,1.3,1.5,7.2,36.1), ID_obj=c("a3", "4a", "5b", "8b", "a7"))
sample2 = data.frame(x=c(1:5), length=c(15.1,1.3,1.5,17.2,6.1), ID_obj=c("a3", "k6", "9c", "8b", "a7"))
library(data.table)
setDT(sample) # convert data.frame to data.table "in-place"
setDT(sample2)
x <- rbind(sample, sample2) # combine rows vertically
setorder(x, -length) # order by length descending
x[, head(.SD, 1), by = ID_obj] # output the first row ("head") per ID_obj group
To get the result (in a different order than your expected result):
ID_obj x length
1: a7 5 36.1
2: 8b 4 17.2
3: a3 1 15.1
4: 5b 3 1.5
5: 9c 3 1.5
6: 4a 2 1.3
7: k6 2 1.3
A bit more cryptic with base functions, but just as an exercise:
x <- rbind(sample, sample2)
x <- x[order(x$length), ]
x <- do.call(rbind, lapply(split(x, x$ID_obj), tail, n=1))
x
# x length ID_obj
# 4a 2 1.3 4a
# 5b 3 1.5 5b
# 8b 4 17.2 8b
# a3 1 15.1 a3
# a7 5 36.1 a7
# 9c 3 1.5 9c
# k6 2 1.3 k6
Add rownames(x) <- NULL if you don't want to use ID_obj as row names.
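Another base R variant, shown here only as a sketch, sorts by descending length and then drops every repeated ID with duplicated, so the first (longest) row per ID survives. The names s1 and s2 are renamed copies of the question's sample and sample2, chosen to avoid masking base::sample.

```r
s1 <- data.frame(x = 1:5, length = c(1.2, 1.3, 1.5, 7.2, 36.1),
                 ID_obj = c("a3", "4a", "5b", "8b", "a7"),
                 stringsAsFactors = FALSE)
s2 <- data.frame(x = 1:5, length = c(15.1, 1.3, 1.5, 17.2, 6.1),
                 ID_obj = c("a3", "k6", "9c", "8b", "a7"),
                 stringsAsFactors = FALSE)

combined <- rbind(s1, s2)
combined <- combined[order(-combined$length), ]        # longest lines first
longest  <- combined[!duplicated(combined$ID_obj), ]   # first row per ID is the longest
longest
```

When two rows with the same ID have equal length, this keeps whichever comes first after sorting.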

return a vector in a column in data.table

I have a data.table in R, and I'm looking to create a vector based on .SDcols row by row.
library("data.table")
dt = data.table(
id=1:6,
A1=sample(100,6),
A2=sample(100,6),
A3=sample(100,6),
B1=sample(100,6),
B2=sample(100,6),
B3=sample(100,6)
)
dt[,x1:=paste(.SD,collapse = ","),.SDcols=A1:B3,by=id]
dt[,x2:=strsplit(x1,",")] # x2 vector of characters
Now x2 holds a vector of characters in each row.
However, I expected x2 to hold a vector of integers.
R > dt
id A1 A2 A3 B1 B2 B3 x2
1: 1 72 23 76 10 35 14 c(72,23,76,10,35,14)
2: 2 44 28 77 29 20 63 c(44,28,77,29,20,63)
3: 3 18 34 43 77 76 100 c(18,34,43,77,76,100)
4: 4 15 33 50 87 86 86 c(15,33,50,87,86,86)
5: 5 71 71 41 75 8 3 c(71,71,41,75,8,3)
6: 6 11 89 98 42 72 27 c(11,89,98,42,72,27)
I tried several solutions; all of them failed.
dt[,x2:=.(list(.SD)),.SDcols=A1:B3,by=id] #x2 is <data.table>
dt[,x2:=.(lapply(.SD,c)),.SDcols=A1:B3,by=id]
dt[,x2:=.(c(.SD)), .SDcols=A1:B3,by=id] #RHS 1 is length 6 (greater than the size (1) of group 1). The last 5 element(s) will be discarded.
dt[,x2:=c(.SD),.SDcols=A1:B3,by=id] # x2 equals A1
dt[,x2:=lapply(.SD,c),.SDcols=A1:B3,by=id] # x2 equals A1
dt[,x2:=sapply(.SD,c),.SDcols=A1:B3,by=id] # x2 equals A1
Any suggestion?
Thanks in advance
=====================================================================
edit: thanks Jaap,
dt[, x2 := lapply(strsplit(x1, ","), as.integer)] # it works
Still, I wonder whether there is a more elegant solution.
=====================================================================
edit2:
New solutions; base functions are much more useful than I thought.
dt[,ABC0:=apply(rbind(.SD), 1, list),.SDcols=A1:B3,by=id]
dt[,ABC1:=apply(cbind(.SD), 1, list),.SDcols=A1:B3,by=id]
or, more simply,
dt[,ABC2:=lapply(.SD,rbind),.SDcols=A1:B3]
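A commonly used idiom for this kind of list column skips the detour through strings: unlist the .SD of each group and wrap it in a list. The sketch below assumes the data.table package is installed and uses a smaller table than the question's; the column set cols and the seed are illustrative choices.

```r
library(data.table)

set.seed(42)
dt <- data.table(
  id = 1:6,
  A1 = sample(100, 6),
  A2 = sample(100, 6),
  B1 = sample(100, 6)
)
cols <- c("A1", "A2", "B1")

# One group per id; unlist(.SD) flattens that row into an integer vector,
# and the outer list(list(...)) stores it as a single list-column cell
dt[, x2 := list(list(unlist(.SD, use.names = FALSE))), by = id, .SDcols = cols]

dt$x2[[1]]  # integer vector holding the first row's A1, A2, B1
```

Grouping by id makes each .SD a one-row table, so this relies on id being unique per row.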

r count cells with missing values across each row [duplicate]

I have a dataframe as shown below
Id Date Col1 Col2 Col3 Col4
30 2012-03-31 A42.2 20.46 NA
36 1996-11-15 NA V73 55
96 2010-02-07 X48 Z16 13
40 2010-03-18 AD14 20.12 36
69 2012-02-21 22.45
11 2013-07-03 81 V017 TCG11
22 2001-06-01 67
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12
34 2014-03-10 82.12 N72.22 V45.44
I am trying to count the number of NA or Empty cells across each row and the final expected output is as follows
Id Date Col1 Col2 Col3 Col4 MissCount
30 2012-03-31 A42.2 20.46 NA 2
36 1996-11-15 NA V73 55 2
96 2010-02-07 X48 Z16 13 1
40 2010-03-18 AD14 20.12 36 1
69 2012-02-21 22.45 3
11 2013-07-03 81 V017 TCG11 1
22 2001-06-01 67 3
83 2005-03-16 80.45 V22.15 46.52 X29.11 0
92 2012-02-12 4
34 2014-03-10 82.12 N72.22 V45.44 1
The last column MissCount will store the number of NAs or empty cells for each row. Any help is much appreciated.
The one-liner
rowSums(is.na(df) | df == "")
given by @DavidArenburg in his comment is definitely the way to go, assuming that you don't mind checking every column in the data frame. If you really only want to check Col1 through Col4, then using an apply function might make more sense.
apply(df, 1, function(x) {
sum(is.na(x[c("Col1", "Col2", "Col3", "Col4")])) +
sum(x[c("Col1", "Col2", "Col3", "Col4")] == "", na.rm=TRUE)
})
Edit: Shortened code
apply(df[c("Col1", "Col2", "Col3", "Col4")], 1, function(x) {
sum(is.na(x)) +
sum(x == "", na.rm=TRUE)
})
or if data columns are exactly like the example data:
apply(df[3:6], 1, function(x) {
sum(is.na(x)) +
sum(x == "", na.rm=TRUE)
})
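The rowSums one-liner can also be restricted to the four data columns, which avoids the row-wise apply. This is a sketch on a small made-up data frame (three rows loosely modelled on the question's table, with empty strings standing in for the blank cells); cols is an illustrative name.

```r
df <- data.frame(
  Id   = c(30, 36, 92),
  Col1 = c("A42.2", NA, ""),
  Col2 = c("20.46", "V73", ""),
  Col3 = c(NA, "55", ""),
  Col4 = c("", "", ""),
  stringsAsFactors = FALSE
)
cols <- c("Col1", "Col2", "Col3", "Col4")

# is.na(...) is TRUE for NA cells; the == "" comparison is NA there,
# but TRUE | NA evaluates to TRUE, so no NA reaches rowSums
df$MissCount <- rowSums(is.na(df[cols]) | df[cols] == "")
df$MissCount
# 2 2 4
```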
This should do it.
yourframe$MissCount = rowSums(is.na(yourframe) | yourframe == "" | yourframe == " ")
You can use by_row, which was originally part of purrr and now lives in the purrrlyr package:
library(purrrlyr) # by_row was moved here from purrr
#sample data frame
x <- data.frame(A1=c(1,NA,3,NA),
A2=c("A","B"," ","C"),
A3=c(" "," ",NA,"t"))
Here you apply a function on each row, you can edit it according to your condition. And you can use whatever function you want.
In the following example, I counted empty or NA entries in each row by using sum(...):
by_row(x, function(y) sum(y==" "| (is.na(y))),
.to="MissCount",
.collate = "cols"
)
You will get:
# A tibble: 4 x 4
A1 A2 A3 MissCount
<dbl> <fctr> <fctr> <int>
1 1 A 1
2 NA B 2
3 3 NA 2
4 NA C t 1
We can use
Reduce(`+`, lapply(df, function(x) is.na(x)|!nzchar(as.character(x))))

outcome variable as argument in regression function

I have a datasetup function which currently has 2 arguments: testData and ID1. I want to include outcome variable as an argument.
Suppose outcomevar=c(y1,y2,y3) then the function should create the lagged and differenced variable of my outcome variable.
preparedata<-function(testData,ID1,outcomevar){
#Order temp data by firm and date
testData <- testData[order(testData$firm,testData$date),]
#Create lagged outcomevar for each firm
testData <- ddply(testData, .(firm), transform,
ly1 = c( NA, y1[-length(y1)] ) )
#Create differenced variable
testData$dy1<-(testData$y1-testData$ly1)
}
where the "l" and "d" in front of y1 stand for lagged and differenced.
How can I include the outcome variable?
Thanks
T
Here's a solution using data tables:
# create sample dataset
set.seed(1)
df <- data.frame(firm=rep(LETTERS[1:5],each=10),
date=as.Date("2014-01-01")+1:10,
y1=sample(1:100,50),y2=sample(1:100,50),y3=sample(1:100,50))
preparedata<-function(testData,ID1,outcomevar){
require(data.table)
DT <- as.data.table(testData)
setkey(DT,firm,date)
DT[,lag := c(NA,unlist(.SD)[-.N]), by=firm, .SDcols=outcomevar]
DT[,diff := c(NA,diff(unlist(.SD))), by=firm, .SDcols=outcomevar]
setnames(DT,c("lag","diff"),paste0(c("l","d"),outcomevar))
return(DT)
}
result <- preparedata(df,1,outcomevar="y1")
head(result)
# firm date y1 y2 y3 ly1 dy1
# 1: A 2014-01-02 27 48 66 NA NA
# 2: A 2014-01-03 37 86 35 27 10
# 3: A 2014-01-04 57 43 27 37 20
# 4: A 2014-01-05 89 24 97 57 32
# 5: A 2014-01-06 20 7 61 89 -69
# 6: A 2014-01-07 86 10 21 20 66
This assumes you pass the name of the column containing the "outcomevar", not the column itself.
You should read the documentation on data tables (?data.table), but in brief this code converts the input data frame to a data table, orders the data table (using setkey(...)), and adds two new columns by reference: lag and diff. .SD is a special variable in the data table framework which is an alias for "the subset of the original DT containing the rows specified in by=...". You can specify which columns to include using .SDcols=.... The diff(...) function calculates lagged differences, which is the same thing you were doing. Finally, we rename the columns lag and diff to, e.g. ly1 and dy1.
Here is an outline of a function that relies more heavily on your example:
preparedata<-function(testData,outcomevar){
require(plyr)
testData <- testData[order(testData$firm,testData$date),]
testData$tmp.var <- with(testData, eval(parse(text=outcomevar)))
testData <- ddply(testData, .(firm), transform,
lvar = c( NA, tmp.var[-length(tmp.var)]))
testData$tmp.var <- NULL
testData <- within(testData, assign(paste("d", outcomevar, sep=""),
testData[,outcomevar]-testData$lvar))
colnames(testData)[grep("lvar", colnames(testData))] <- paste("l", outcomevar, sep="")
return(testData)
}
Using the df defined in jihoward's answer, we get
> head(preparedata(df,"y1"))
firm date y1 y2 y3 lvar dy1
1 A 2014-01-02 27 48 66 NA NA
2 A 2014-01-03 37 86 35 27 10
3 A 2014-01-04 57 43 27 37 20
4 A 2014-01-05 89 24 97 57 32
5 A 2014-01-06 20 7 61 89 -69
6 A 2014-01-07 86 10 21 20 66
This function returns a dataframe where ly1 is the lagged variable, and dy1 is the differenced variable that was specified with the second argument outcomevar. Note that in this function, you pass the name (i.e. a character) to the function. That is, do not write y1, but "y1" when you call the function.
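The same idea can be written without plyr or data.table and extended to several outcome variables at once. The following is only a base R sketch under the same assumption that column names are passed as character strings; the function name add_lag_diff and the toy data are hypothetical.

```r
# Hypothetical helper: for each name in outcomevar, add a lagged column ("l" prefix)
# and a differenced column ("d" prefix), computed within each firm
add_lag_diff <- function(testData, outcomevar) {
  testData <- testData[order(testData$firm, testData$date), ]
  for (v in outcomevar) {
    # ave() applies the lag separately to each firm's values
    lagged <- ave(testData[[v]], testData$firm,
                  FUN = function(z) c(NA, z[-length(z)]))
    testData[[paste0("l", v)]] <- lagged
    testData[[paste0("d", v)]] <- testData[[v]] - lagged
  }
  testData
}

toy <- data.frame(
  firm = c("A", "A", "A", "B", "B"),
  date = as.Date("2014-01-01") + c(0, 1, 2, 0, 1),
  y1   = c(27, 37, 57, 5, 9),
  y2   = c(1, 4, 9, 16, 25)
)
res <- add_lag_diff(toy, c("y1", "y2"))
res$dy1
# NA 10 20 NA  4
```

Using [[v]] with a character name avoids the eval(parse(...)) step while keeping the same calling convention.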
You could process all outcome variables simultaneously by first gathering them into a key-value column pair:
set.seed(1)
df <- data.frame(
firm = rep(LETTERS[1:5], each = 10),
date = as.Date("2014-01-01") + 1:10,
y1 = sample(100, 50),
y2 = sample(100, 50),
y3 = sample(100, 50)
)
library(dplyr)
library(tidyr)
df %>%
gather(key, value, y1:y3) %>%
group_by(firm, key) %>%
mutate(lag = lag(value), diff = value - lag)
#> Source: local data frame [150 x 6]
#> Groups: firm, key
#>
#> firm date key value lag diff
#> 1 A 2014-01-02 y1 27 NA NA
#> 2 A 2014-01-03 y1 37 27 10
#> 3 A 2014-01-04 y1 57 37 20
#> 4 A 2014-01-05 y1 89 57 32
#> 5 A 2014-01-06 y1 20 89 -69
#> 6 A 2014-01-07 y1 86 20 66
#> 7 A 2014-01-08 y1 97 86 11
#> 8 A 2014-01-09 y1 62 97 -35
#> 9 A 2014-01-10 y1 58 62 -4
#> 10 A 2014-01-11 y1 6 58 -52
#> .. ... ... ... ... ... ...
