R - find rows with at least n distinct elements - r

I have a data frame of arbitrary but non-trivial size. Each entry has one of three distinct values 0, 1, or 2 randomly distributed. For example:
col.1 col.2 col.3 col.4 ...
0 0 1 0 ...
0 2 2 1 ...
2 2 2 2 ...
0 0 0 0 ...
0 1 1 1 ...
... ... ... ... ...
My goal is to remove any row that contains only one unique element, that is, to select only those rows with at least two distinct elements. Originally I selected those rows where the row mean was not a whole number, but I realized that could eliminate rows containing equal amounts of 0 and 2, which I want to keep.
My current thought process is to use unique on each row of the data frame, followed by length to determine how many unique elements each contains but I can't seem to get the syntax right. I'm looking for something like this
DataFrame[length(unique(DataFrame)) != 1, ]

Try any of these:
nuniq <- function(x) length(unique(x))
subset(dd, apply(dd, 1, nuniq) >= 2)
subset(dd, apply(dd, 1, sd) > 0)
subset(dd, apply(dd[-1] != dd[[1]], 1, any))
subset(dd, rowSums(dd[-1] != dd[[1]]) > 0)
subset(dd, lengths(lapply(as.data.frame(t(dd)), unique)) >= 2)
subset(dd, lengths(apply(dd, 1, table)) >= 2)
# nuniq is from above
subset(dd, tapply(as.matrix(dd), row(dd), nuniq) >= 2)
giving:
col.1 col.2 col.3 col.4
1 0 0 1 0
2 0 2 2 1
5 0 1 1 1
Alternatives to nuniq
In the above nuniq could be replaced with any of these:
function(x) nlevels(factor(x))
function(x) sum(!duplicated(x))
function(x) length(table(x))
dplyr::n_distinct
Note
dd in reproducible form is:
dd <- structure(list(col.1 = c(0L, 0L, 2L, 0L, 0L), col.2 = c(0L, 2L,
2L, 0L, 1L), col.3 = c(1L, 2L, 2L, 0L, 1L), col.4 = c(0L, 1L,
2L, 0L, 1L)), class = "data.frame", row.names = c(NA, -5L))
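Since the title asks for "at least n distinct elements", the first variant above can be wrapped in a small helper that takes the threshold as an argument. This is only a sketch: the name keep_rows_with_distinct and its n argument are made up for illustration, and dd is rebuilt here so the block runs on its own.

```r
# Same data as dd in the Note above, rebuilt so the block is self-contained
dd <- data.frame(col.1 = c(0L, 0L, 2L, 0L, 0L),
                 col.2 = c(0L, 2L, 2L, 0L, 1L),
                 col.3 = c(1L, 2L, 2L, 0L, 1L),
                 col.4 = c(0L, 1L, 2L, 0L, 1L))

# Hypothetical helper: keep rows that hold at least n distinct values
keep_rows_with_distinct <- function(data, n = 2) {
  n_distinct_per_row <- apply(data, 1, function(x) length(unique(x)))
  data[n_distinct_per_row >= n, , drop = FALSE]
}

at_least_two <- keep_rows_with_distinct(dd, 2)    # rows 1, 2 and 5
at_least_three <- keep_rows_with_distinct(dd, 3)  # row 2 only
```

drop = FALSE keeps the result a data frame even when a single column or row is left.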

What about something like this:
# some fake data
df<-data.frame(col1 = c(2,2,1,1),
col2 = c(1,0,2,0),col3 = c(0,0,0,0))
col1 col2 col3
1 2 1 0
2 2 0 0
3 1 2 0
4 1 0 0
# first we can convert 0 to NA
df[df == 0] <- NA
# a function that calculates the length of uniques, not counting NA as levels
fun <- function(x){
res <- unique(x[!is.na(x)])
length(res)
}
# apply it: not counting na, we can use 2 as threshold
df <- df[apply(df,1,fun)>=2,]
# convert the na to 0 as original
df[is.na(df)] <- 0
df
col1 col2 col3
1 2 1 0
3 1 2 0

Related

How to group by a variable and see if they have another observation within a given time frame, R

I have something like the following:
person_ID visit date
1 2/25/2001
1 2/30/2001
1 4/2/2001
2 3/18/2004
3 9/22/2004
3 10/27/2004
3 5/15/2008
I want to add another column to see if the person has a reoccurring observation within 90 days, like:
person_ID visit date reoccurrence
1 2/25/2001 1
1 2/30/2001 1
1 4/2/2001 0
2 3/18/2004 0
3 9/22/2004 1
3 10/27/2004 0
3 5/15/2008 0
any help is appreciated, thank you!
Assuming the second date is a valid one (2/30/2001 does not exist), convert 'visit_date' to Date class, group by 'person_ID', take the difference in days between the current and the next 'visit_date', check whether it is less than 90, and replace the NA with 0
library(dplyr)
library(lubridate)
library(tidyr)
df1 <- df1 %>%
mutate(visit_date = mdy(visit_date)) %>%
group_by(person_ID) %>%
mutate(reoccurrence = replace_na(+(difftime(lead(visit_date),
visit_date, units = 'day') < 90), 0)) %>%
ungroup
-output
# A tibble: 7 x 3
# person_ID visit_date reoccurrence
# <int> <date> <dbl>
#1 1 2001-02-25 1
#2 1 2001-02-28 1
#3 1 2001-04-02 0
#4 2 2004-03-18 0
#5 3 2004-09-22 1
#6 3 2004-10-27 0
#7 3 2008-05-15 0
Or using data.table
library(data.table)
setDT(df1)[, visit_date := as.IDate(visit_date, '%m/%d/%Y')
][, reoccurrence := +(difftime(shift(visit_date, type = 'lead'),
visit_date, units = 'day') < 90), by = person_ID
][is.na(reoccurrence), reoccurrence := 0]
Or with base R
df1$visit_date <- as.Date(df1$visit_date, '%m/%d/%Y')
with(df1, ave(as.integer(visit_date), person_ID, FUN =
function(x) c(+(diff(x) < 90), 0)))
#[1] 1 1 0 0 1 0 0
data
df1 <- structure(list(person_ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L), visit_date = c("2/25/2001",
"2/28/2001", "4/2/2001", "3/18/2004", "9/22/2004", "10/27/2004",
"5/15/2008")), row.names = c(NA, -7L), class = "data.frame")
Base R variant:
reoccur <- function(x, lim=90) {
m <- outer(x, x, `-`)
m[upper.tri(m, diag=TRUE)] <- NA
colSums(!is.na(m) & m >= 0 & m <= lim) > 0
}
### make your dates *dates*
dat$visit <- as.Date(dat$visit, format="%m/%d/%Y")
### calculate if you have reoccurrences
ave(as.numeric(dat$visit), dat$person_ID, FUN=reoccur)
# [1] 1 1 0 0 1 0 0
Data:
dat <- structure(list(person_ID = c(1L, 1L, 1L, 2L, 3L, 3L, 3L), visit = c("2/25/2001", "2/27/2001", "4/2/2001", "3/18/2004", "9/22/2004", "10/27/2004", "5/15/2008")), class = "data.frame", row.names = c(NA, -7L))
(I changed "2/30/2001" to "2/27/2001" to get a real Date out of it.)
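Both answers had to change 2/30/2001 because it is not a real calendar date. A minimal sketch of how such entries could be spotted before computing any differences, using only base R (the vector name visits is made up for illustration):

```r
# The first person's visits, exactly as given in the question
visits <- c("2/25/2001", "2/30/2001", "4/2/2001")

# With an explicit format, as.Date() returns NA for dates that do not exist
parsed <- as.Date(visits, format = "%m/%d/%Y")

# The original strings that failed to parse
invalid <- visits[is.na(parsed)]
invalid
# [1] "2/30/2001"
```

Whether to drop, correct, or keep such rows as NA depends on the data; the answers above simply substituted a nearby valid date.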

R - Function to make a binary variable

I have some variables which take value between 1 and 5. I would like to code them 0 if they take the value between 1 and 3 (included) and 1 if they take the value 4 or 5.
My dataset looks like this
var1 var2 var3
1 1 NA
4 3 4
3 4 5
2 5 3
So I would like it to be like this:
var1 var2 var3
0 0 NA
1 0 1
0 1 1
0 1 0
I tried to do a function and to call it
making_binary <- function (var){
var <- factor(var >= 4, labels = c(0, 1))
return(var)
}
df <- lapply(df, making_binary)
But I had an error : incorrect labels : length 2 must be 1 or 1
Where did I go wrong?
Thank you very much for your answers!
You can use :
df[] <- +(df == 4 | df == 5)
df
# var1 var2 var3
#1 0 0 NA
#2 1 0 1
#3 0 1 1
#4 0 1 0
The comparison df == 4 | df == 5 returns logical values (TRUE/FALSE); the unary + then turns those logical values into integers (1/0), and NA stays NA.
If you want to apply this for selected columns you can subset the columns by position or by name.
cols <- 1:3 #Position
#cols <- grep('var', names(df)) #Name
df[cols] <- +(df[cols] == 4 | df[cols] == 5)
As far as your function is concerned you can do :
making_binary <- function (var){
var <- as.integer(var >= 4)
#which is faster version of
#var <- ifelse(var >= 4, 1, 0)
return(var)
}
df[] <- lapply(df, making_binary)
data
df <- structure(list(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L,
5L), var3 = c(NA, 4L, 5L, 3L)), class = "data.frame", row.names = c(NA, -4L))
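As for where the original function went wrong: a likely cause, assuming at least one column in the real data never reaches 4 (or always does), is that var >= 4 then has a single level while two labels are supplied. A sketch of the failure and of one way around it by fixing the levels explicitly (the object names are made up for illustration):

```r
# A column whose values are all below 4: the comparison is all FALSE
one_level <- c(1, 2, 3) >= 4

# factor() finds one level but gets two labels, so it stops with an error;
# tryCatch() captures the message instead of aborting
failed <- tryCatch(factor(one_level, labels = c(0, 1)),
                   error = function(e) conditionMessage(e))

# Stating both levels up front makes the two labels valid
fixed <- factor(one_level, levels = c(FALSE, TRUE), labels = c(0, 1))
levels(fixed)
# [1] "0" "1"
```

If a numeric 0/1 column is wanted rather than a factor, as.integer(var >= 4) as shown above avoids the issue altogether.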
I think ifelse would fit the problem well:
df[] <- lapply(df, function(x) ifelse(x >=1 & x <=3, 0, x))
df
var1 var2 var3
1 0 0 NA
2 4 0 4
3 0 4 5
4 0 5 0
df[] <- lapply(df, function(x) ifelse(x >=4 & x <=5, 1, x))
df
var1 var2 var3
1 0 0 NA
2 1 0 1
3 0 1 1
4 0 1 0
If you need to do the two steps at once, you can look at dplyr::case_when() or data.table::fcase().
You can simply test if the value is larger than 3, which will return TRUE and FALSE and cast this to a number:
+(x>3)
# var1 var2 var3
#[1,] 0 0 NA
#[2,] 1 0 1
#[3,] 0 1 1
#[4,] 0 1 0
In case you want this only for some columns, you have to subset them. E.g. for column 1 and 2:
+(x[1:2]>3)
#+(x[c("var1","var2")]>3) #Alternative
# var1 var2
#[1,] 0 0
#[2,] 1 0
#[3,] 0 1
#[4,] 0 1
Data:
x <- data.frame(var1 = c(1L, 4L, 3L, 2L), var2 = c(1L, 3L, 4L, 5L)
, var3 = c(NA, 4L, 5L, 3L))

R-Loop through data frame and count values greater than a value and removing rows

I want to loop through a large dataframe counting in the first column how many values >0, removing those rows that were counted.... then moving on to column 2 counting the number of values>0 and removing those rows etc...
the data frame
taxonomy A B C
1 cat 0 2 0
2 dog 5 1 0
3 horse 3 0 0
4 mouse 0 0 4
5 frog 0 2 4
6 lion 0 0 2
can be generated with
DF1 = structure(list(taxonomy = c("cat", "dog","horse","mouse","frog", "lion"),
A = c(0L, 5L, 3L, 0L, 0L, 0L), B = c(2L, 1L, 0L, 0L, 2L, 0L), C = c(0L, 0L, 0L, 4L, 4L, 2L)),
.Names = c("taxonomy", "A", "B", "C"),
row.names = c(NA, -6L), class = "data.frame")
and i expect the outcome to be
A B C
count 2 2 2
i wrote this loop but it does not remove the rows as it goes
res <- data.frame(DF1[1,], row.names = c('count'))
for(n in 1:ncol(DF1)) {
res[colnames(DF1)[n]] <- sum(DF1[n])
DF1[!DF1[n]==1]
}
it gives this incorrect result
A B C
count 2 3 3
You could do ...
DF = DF1[, -1]
cond = DF != 0
p = max.col(cond, ties="first")
fp = factor(p, levels = seq_along(DF), labels = names(DF))
table(fp)
# A B C
# 2 2 2
To account for rows that are all zeros, I think this works:
fp[rowSums(cond) == 0] <- NA
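A small sketch of why that last line matters, on made-up data with one all-zero row. With ties resolved as "first", max.col reports column 1 for a row with no non-zero entry, so without the NA step that row would be counted under A. The logical matrix is converted with + here only to hand max.col a numeric matrix explicitly.

```r
# Illustrative data (not from the question): the third row is all zeros
DF <- data.frame(A = c(0L, 5L, 0L),
                 B = c(2L, 1L, 0L),
                 C = c(0L, 0L, 0L))

cond <- DF != 0
p <- max.col(+cond, ties.method = "first")   # first non-zero column per row
fp <- factor(p, levels = seq_along(DF), labels = names(DF))

# Rows without any non-zero value must not be attributed to column A
fp[rowSums(cond) == 0] <- NA

counts <- table(fp)   # A = 1, B = 1, C = 0
```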
We can update the dataset in each run. Create a temporary dataset without the 'taxonomy' column ('tmp'). Initiate a named vector ('n'), loop through the columns of 'tmp', get a logical index based on whether the column is greater than 0 ('i1'), get the sum of TRUE values, update the 'n' for the corresponding column, then update the 'tmp' by removing those rows using 'i1' as row index
tmp <- DF1[-1]
n <- setNames(numeric(ncol(tmp)), names(tmp))
for(i in seq_len(ncol(tmp))) {
i1 <- tmp[[i]] > 0
n[i] <- sum(i1)
tmp <- tmp[!i1, ]}
n
# A B C
# 2 2 2
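The same idea can be written without shrinking the data frame, by keeping a logical mask of the rows that are still available. This is a sketch under that design choice; the function name count_first_positive is made up for illustration, and DF1 is rebuilt from the question so the block runs on its own.

```r
DF1 <- data.frame(taxonomy = c("cat", "dog", "horse", "mouse", "frog", "lion"),
                  A = c(0L, 5L, 3L, 0L, 0L, 0L),
                  B = c(2L, 1L, 0L, 0L, 2L, 0L),
                  C = c(0L, 0L, 0L, 4L, 4L, 2L))

# Hypothetical helper: for each column, count rows > 0 among rows not yet used
count_first_positive <- function(values) {
  remaining <- rep(TRUE, nrow(values))
  counts <- setNames(integer(ncol(values)), names(values))
  for (j in seq_along(values)) {
    hit <- remaining & values[[j]] > 0   # positive and not counted earlier
    counts[j] <- sum(hit)
    remaining <- remaining & !hit        # drop the rows just counted
  }
  counts
}

counts <- count_first_positive(DF1[-1])
counts
# A B C
# 2 2 2
```

Because the mask always has one entry per original row, each column is indexed in its original row order, and rows that are zero everywhere are simply never counted.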
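It can also be done with Reduce, accumulating a logical mask of the rows removed so far (always the full number of rows, so every column stays aligned) and differencing the running totals
removed <- Reduce(function(x, y) x | y > 0, DF1[2:4],
init = logical(nrow(DF1)), accumulate = TRUE)
diff(sapply(removed, sum))
#[1] 2 2 2
Or using accumulate from purrr
library(purrr)
accumulate(DF1[2:4], ~ .x | .y > 0, .init = logical(nrow(DF1))) %>%
map_int(sum) %>%
diff()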
This is easy with Reduce and sapply:
> first <- Reduce(function(a,b) b[a==0], df[-1], accumulate=TRUE)
> first
[[1]]
[1] 0 5 3 0 0 0
[[2]]
[1] 2 0 2 0
[[3]]
[1] 0 4 2
> then <- sapply(setNames(first, names(df[-1])), function(x) length(x[x>0]))
> then
A B C
2 2 2

Find column index where row value is greater than zero in R

I have data set as follows:
A B C
R1 1 0 1
R2 0 1 0
R3 0 0 0
I want to add another column in data set named index such that it gives column names for each row where the column value is greater than zero. The result I want is as follows:
A B C Index
R1 1 0 1 A,C
R2 0 1 0 B
R3 0 0 0 NA
Here is one approach using base:
use apply to go over rows, find elements that are equal to one and paste together the corresponding column names:
df$Index <- apply(df, 1, function(x) paste(colnames(df)[which(x == 1)], collapse = ", "))
df$Index <- creates a new column called Index where the result of the operation will be held
apply - applies a function over rows and/or columns of a matrix/data frame
1 - specifies that the function should be applied to rows (2 means over columns)
function(x) - an unnamed function which is further defined; x corresponds to each row
which(x == 1) - returns the positions of the elements of a row that are equal to 1
colnames(df) - names of the columns of the data frame
colnames(df)[which(x == 1)] - subsets the column names at the positions returned by which(x == 1)
paste with collapse = ", " - collapses a character vector (in this case the vector of column names acquired before) into a single string where the elements are separated by ", ".
now replace empty entries with NA
df$Index[df$Index == ""] <- NA_character_
here is what the output looks like
#output
sample A B C Index
1 R1 1 0 1 A, C
2 R2 0 1 0 B
3 R3 0 0 0 <NA>
data:
structure(list(sample = structure(1:3, .Label = c("R1", "R2",
"R3"), class = "factor"), A = c(1L, 0L, 0L), B = c(0L, 1L, 0L
), C = c(1L, 0L, 0L)), .Names = c("sample", "A", "B", "C"), class = "data.frame", row.names = c(NA,
-3L))
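The question asks for values greater than zero rather than exactly one. A sketch of the same idea with that condition, restricted to the value columns so that an identifier column cannot interfere; the vector value_cols is made up for illustration and the separator follows the question's "A,C" format.

```r
# Data as in the question, with R1..R3 as row names
df <- data.frame(A = c(1L, 0L, 0L),
                 B = c(0L, 1L, 0L),
                 C = c(1L, 0L, 0L),
                 row.names = c("R1", "R2", "R3"))

value_cols <- c("A", "B", "C")      # columns to test (assumed)
positive <- df[value_cols] > 0      # logical matrix, one row per record

# Paste the names of the positive columns per row; NA when there are none
df$Index <- apply(positive, 1, function(hit) {
  if (any(hit)) paste(value_cols[hit], collapse = ",") else NA_character_
})
df
#    A B C Index
# R1 1 0 1   A,C
# R2 0 1 0     B
# R3 0 0 0  <NA>
```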
Slightly different flavored apply() solution:
df$index <- apply(df, 1, function(x) ifelse(any(x), toString(names(df)[x == 1]), NA))
A B C index
R1 1 0 1 A, C
R2 0 1 0 B
R3 0 0 0 <NA>
data:
df <- structure(
list(
A = c(1L, 0L, 0L),
B = c(0L, 1L, 0L),
C = c(1L, 0L, 0L)
),
row.names = paste0('R', 1:3),
class = "data.frame"
)

How can i pull out the column name in a dataframe when two conditions are met?

Here is my dataframe.
Name Column_1 Column_2 Column_3 Column_4
A 4 1 0 1
B 5 0 0 1
C 2 0 1 0
D 1 0 1 1
I want to extract the name of the column when there is a 1 EXCLUSIVELY in a row where column_1 <=2.
In this example the only column that would work is column_3.
I had 2 theories about what was asked.
Theory 1: If we can assume that only 1's and 0's will be in the numbered columns then perhaps:
colSums( dat[ dat$Column_1 >=2, # reduce the dataframe to only qualifying rows
-1 ]) == 1 # remove letter column and make test.
#-------
Column_1 Column_2 Column_3 Column_4
FALSE TRUE TRUE FALSE
You can use that to select from names(dat)[-1]
dput(dat)
structure(list(Name = structure(1:4, .Label = c("A", "B", "C",
"D"), class = "factor"), Column_1 = c(4L, 5L, 2L, 1L), Column_2 = c(1L,
0L, 0L, 0L), Column_3 = c(0L, 0L, 1L, 1L), Column_4 = c(1L, 1L,
0L, 1L)), .Names = c("Name", "Column_1", "Column_2", "Column_3",
"Column_4"), class = "data.frame", row.names = c(NA, -4L))
Theory 2 (this also gives a different answer from the one you say is correct):
sdat <- dat[ dat$Column_1 >=2,
-1 ]
sdat[ rowSums(sdat[-1]) == 1, ]
#-------
Column_1 Column_2 Column_3 Column_4
2 5 0 0 1
3 2 0 1 0
> names(sdat)[colSums( sdat[ rowSums(sdat[-1]) == 1, ]) == 1]
[1] "Column_3" "Column_4"
First the question said Column_1 needed to be >= 2, and now it reads <= 2. So use the code for the second theory after simply reversing the inequality for row selection. When I do that, I do get just "Column_3".
sdat <- dat[ dat$Column_1 <= 2,
-1 ]
sdat[ rowSums(sdat[-1]) == 1, ]
names(sdat)[colSums( sdat[ rowSums(sdat[-1]) == 1, ]) == 1]
#[1] "Column_3"
Maybe you want something like this:
a <- ...
w <- apply(a[,-1], 2, FUN= function(x) {all(x[a$Column_1 > 2] == 0) & any(x == 1)})
The result is:
Column_1 Column_2 Column_3 Column_4
FALSE FALSE TRUE FALSE
We apply that function to all the columns of a (except the first column), and check that the column is always 0 when Column_1 > 2, but that it contains at least one 1.
The column name(s) then is
n <- names(a)[which(w)+1]
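A self-contained sketch of one reading of the question, namely that a column qualifies when it has a 1 in some row with Column_1 <= 2 and no 1 in any other row. That reading is an assumption, and the data frame is rebuilt from the table in the question rather than taken from the elided assignment above.

```r
# Data rebuilt from the table in the question
a <- data.frame(Name = c("A", "B", "C", "D"),
                Column_1 = c(4L, 5L, 2L, 1L),
                Column_2 = c(1L, 0L, 0L, 0L),
                Column_3 = c(0L, 0L, 1L, 1L),
                Column_4 = c(1L, 1L, 0L, 1L))

candidates <- a[-(1:2)]     # Column_2 .. Column_4
low <- a$Column_1 <= 2      # rows that satisfy the Column_1 condition

# A column qualifies if it has a 1 in a low row and no 1 anywhere else
only_in_low <- vapply(candidates,
                      function(x) any(x[low] == 1) && all(x[!low] != 1),
                      logical(1))

result <- names(candidates)[only_in_low]
result
# [1] "Column_3"
```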
With data.table:
library(data.table)
dt <- fread('Name Column_1 Column_2 Column_3 Column_4
A 4 1 0 1
B 5 0 0 1
C 2 0 1 0
D 1 0 1 1')
melt(dt[Column_1<=2 & Column_2+Column_3+Column_4==1], id = "Name")[value==1, .(variable)]
variable
1: Column_3
