Row-wise test if multiple (not all) columns are equal - R

I want to do a row-wise check of whether multiple columns are all equal or not. I came up with a convoluted approach that counts the occurrences of each value per group, but it seems somewhat... cumbersome.
sample data
sample_df <- data.frame(id = letters[1:6], group = rep(c('r', 'l'), 3), stringsAsFactors = FALSE)
set.seed(4)
for (i in 3:5) {
  sample_df[i] <- sample(1:4, 6, replace = TRUE)
}
sample_df
desired output
library(tidyverse)

sample_df %>%
  gather(var, value, V3:V5) %>%
  mutate(n_var = n_distinct(var)) %>%  # total number of value columns
  group_by(id, group, value) %>%
  mutate(test = n_distinct(var) == n_var) %>%  # TRUE when this value fills every column
  spread(var, value) %>%
  select(-n_var)
#> # A tibble: 6 x 6
#> # Groups: id, group [6]
#> id group test V3 V4 V5
#> <chr> <chr> <lgl> <int> <int> <int>
#> 1 a r FALSE 3 3 1
#> 2 b l FALSE 1 4 4
#> 3 c r FALSE 2 4 2
#> 4 d l FALSE 2 1 2
#> 5 e r TRUE 4 4 4
#> 6 f l FALSE 2 2 3
Created on 2019-02-27 by the reprex package (v0.2.1)
This does not need to be dplyr; I just used it to show what I want to achieve.
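For what it's worth, the check can also be written compactly in base R; a minimal sketch (not from the original post), assuming the value columns V3:V5 created by the loop above:
vars <- c("V3", "V4", "V5")
sample_df$test <- rowSums(sample_df[vars] == sample_df[[vars[1]]]) == length(vars)
sample_df  # with the seed above, only row e should come out TRUE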

There are a bunch of ways to check for equality row-wise. Two good ones:
# test that all values equal the first column
rowSums(df == df[, 1]) == ncol(df)
# count the unique values, see if there is just 1
apply(df, 1, function(x) length(unique(x)) == 1)
If you only want to test some columns, then use a subset of columns rather than the whole data frame:
cols_to_test = c(3, 4, 5)
rowSums(df[cols_to_test] == df[, cols_to_test[1]]) == length(cols_to_test)
# count the unique values, see if there is just 1
apply(df[cols_to_test], 1, function(x) length(unique(x)) == 1)
Note I use df[cols_to_test] instead of df[, cols_to_test] when I want to be sure the result is a data.frame even if cols_to_test has length 1.
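A quick illustration of that drop behavior, on a throwaway data frame:
d <- data.frame(a = 1:3, b = 4:6)
class(d[, "a"])                # "integer": matrix-style indexing drops to a vector
class(d["a"])                  # "data.frame": list-style indexing keeps the data.frame
class(d[, "a", drop = FALSE])  # "data.frame": drop = FALSE achieves the same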

Related

Count occurrences in multiple columns with conditions in RStudio

I want to get the count of occurrences where the value in column 1 is 1 and the value in column 2 is also 1.
You can use filter and summarize from the dplyr package:
library(dplyr)

df1 <- df %>%
  filter(col1 == 1 & col2 == 1) %>%
  summarize(Freq_col1andcol2_equalto1 = n())
data:
df <- tribble(
  ~col1, ~col2,
  5, 5,
  5, 5,
  1, 1,
  5, 5,
  5, 5)
set.seed(323)
df <- tibble(x = round(runif(100, 0, 10)),
             y = round(runif(100, 0, 10)))
df %>% count(x == 1, y == 1)
# A tibble: 4 x 3
# `x == 1` `y == 1` n
# <lgl> <lgl> <int>
# 1 FALSE FALSE 80
# 2 FALSE TRUE 10
# 3 TRUE FALSE 8
# 4 TRUE TRUE 2
In this scenario, the first column contains the value 1 ten times (8 alone and 2 together with the second column), and the second column contains it twelve times (10 alone and 2 together with the first column).
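For comparison, the same tallies can be reproduced in base R on the df defined just above:
sum(df$x == 1 & df$y == 1)   # rows where both columns are 1
table(df$x == 1, df$y == 1)  # the full 2x2 breakdown, like count() above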

Filter rows based on multiple conditions using dplyr

df <- data.frame(loc.id = rep(1:2,each = 10), threshold = rep(1:10,times = 2))
For each loc.id, I want to keep the first row where threshold >= 2 and the first row where threshold >= 4. I tried this:
df %>% group_by(loc.id) %>% dplyr::filter(row_number() == which.max(threshold >= 2),row_number() == which.max(threshold >= 4))
I expected a dataframe like this:
loc.id threshold
1 2
1 4
2 2
2 4
But it returns an empty data frame.
Based on the condition, we can slice the rows by concatenating the two which.max indices and taking the unique values (if every threshold is already at least 4, both conditions return the same index).
df %>%
  group_by(loc.id) %>%
  filter(any(threshold >= 2)) %>%  # additional check
  # slice(unique(c(which.max(threshold > 2), which.max(threshold > 4))))
  # based on the expected output:
  slice(unique(c(which.max(threshold >= 2), which.max(threshold >= 4))))
# A tibble: 4 x 2
# Groups: loc.id [2]
# loc.id threshold
# <int> <int>
#1 1 2
#2 1 4
#3 2 2
#4 2 4
Note that there can be groups with no threshold values greater than or equal to 2; the filter(any(threshold >= 2)) step keeps only the groups that have at least one.
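A quick sketch of that edge case, adding a hypothetical third group whose threshold never reaches 2:
df2 <- rbind(df, data.frame(loc.id = 3, threshold = rep(1, 3)))
df2 %>%
  group_by(loc.id) %>%
  filter(any(threshold >= 2)) %>%  # loc.id 3 is dropped here
  slice(unique(c(which.max(threshold >= 2), which.max(threshold >= 4))))
# without the filter, which.max() on an all-FALSE vector returns 1,
# so group 3 would wrongly contribute its first row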
If this isn't what you want, assign the df below a name and use it to filter your dataset.
df %>%
  distinct() %>%
  filter(threshold == 2 | threshold == 4)
#> loc.id threshold
#> 1 1 2
#> 2 1 4
#> 3 2 2
#> 4 2 4

how to count repetitions of the first occurring value with dplyr

I have a dataframe with groups that essentially looks like this:
DF <- data.frame(state = c(rep("A", 3), rep("B",2), rep("A",2)))
DF
state
1 A
2 A
3 A
4 B
5 B
6 A
7 A
My question is how to count the number of consecutive rows where the first value is repeated in its first "block". So for DF above, the result should be 3. The first value can appear any number of times, with other values in between, or it may be the only value appearing.
The following naive attempt fails in general, as it counts all occurrences of the first value.
DF %>%
  mutate(is_first = as.integer(state == first(state))) %>%
  summarize(count = sum(is_first))
The result in this case is 5. So, hints on a (preferably) dplyr solution to this would be appreciated.
You can try:
rle(as.character(DF$state))$lengths[1]
[1] 3
In your dplyr chain that would just be:
DF %>% summarize(count_first = rle(as.character(state))$lengths[1])
# count_first
# 1 3
Or to be overzealous with piping, using dplyr and magrittr:
library(dplyr)
library(magrittr)
DF %>% summarize(count_first = state %>%
                   as.character %>%
                   rle %$%
                   lengths %>%
                   first)
# count_first
# 1 3
Works also for grouped data:
DF <- data.frame(group = c(rep(1,4),rep(2,3)),state = c(rep("A", 3), rep("B",2), rep("A",2)))
# group state
# 1 1 A
# 2 1 A
# 3 1 A
# 4 1 B
# 5 2 B
# 6 2 A
# 7 2 A
DF %>% group_by(group) %>% summarize(count_first = rle(as.character(state))$lengths[1])
# # A tibble: 2 x 2
# group count_first
# <dbl> <int>
# 1 1 3
# 2 2 1
No need for dplyr here, but you can modify this example to use it with dplyr. The key is the function rle:
state <- c(rep("A", 3), rep("B", 2), rep("A", 2))
x <- rle(state)
DF <- data.frame(len = x$lengths, state = x$values)
DF

# get the longest run of consecutive "A"s
max(DF[DF$state == "A", ]$len)

Make multiple random number of copies of rows in a dataframe

I have a dataframe in R with 100 rows of unique first and last names and addresses. I also have columns weather1 and weather2. I want to make a random number of copies, between 50 and 100, of each row. How would I do that?
df$fname df$lname df$street df$town df$state df$weather1 df$weather2
Using iris and base R:
#example data
iris2 <- iris[1:100, ]
#replicate rows at random
iris2[rep(1:100, times = sample(50:100, 100, replace = TRUE)), ]
Each row of iris2 will be replicated between 50 and 100 times at random.
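If a tidyverse version is preferred, an alternative sketch (not from the original answer) uses tidyr::uncount, available in tidyr >= 0.8:
library(dplyr)
library(tidyr)
iris2 %>%
  mutate(n = sample(50:100, n(), replace = TRUE)) %>%  # draw a copy count per row
  uncount(n)  # repeat each row n times; the helper column is dropped automatically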
This is probably not the easiest way to do it, but...
What I've done here is, for each row of the data set, select just that row, make 1-3 copies of it (substitute 50-100 for the real case), and finally stack all the results together.
library(dplyr)
library(purrr)

df <- tibble(foo = 1:3, bar = letters[1:3])

map_dfr(seq_len(nrow(df)), ~{
  df %>%
    slice(.x) %>%
    sample_n(size = sample(1:3, 1), replace = TRUE)
})
#> # A tibble: 7 x 2
#> foo bar
#> <int> <chr>
#> 1 1 a
#> 2 1 a
#> 3 1 a
#> 4 2 b
#> 5 2 b
#> 6 3 c
#> 7 3 c

dplyr: adding qualifying column names while doing scoped filtering ("filter_all", ...)

I have a very wide and long data set from which I need to pick out rows where any of a selection of variables meets certain conditions. So far, scoped filtering in dplyr along with any_vars is very close to what I need. To illustrate:
x <- tibble(v1 = c(1, 1, 5, 3, 4), v2 = c(3, 1, 2, 1, 2))
filter_all(x, any_vars(. == min(.)))
produces
# A tibble: 3 x 2
v1 v2
<dbl> <dbl>
1 1 3
2 1 1
3 3 1
I want to add the name of the "filtering variable" to the resulting rows as shown below:
v1 v2 var
<dbl> <dbl> <chr>
1 1 3 v1
2 1 1 v1
3 1 1 v2
4 3 1 v2
Any suggestions? I suspect that one of the map functions in purrr may work, doing the filtering one variable at a time and then combining the results afterwards.
When a row qualifies for multiple variables (thanks to @Moody_Mudskipper for pointing this out), I'd like to show the row multiple times: once with v1 and once with v2 in this case.
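For reference, a minimal sketch of the purrr idea mentioned in the question (it loops over the column names and binds the results; assumes dplyr >= 0.7 for the .data pronoun):
library(dplyr)
library(purrr)
map_dfr(names(x), function(nm) {
  x %>%
    filter(.data[[nm]] == min(.data[[nm]])) %>%
    mutate(var = nm)
})
# yields the four desired rows, each tagged with its filtering variable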
There you go; this should scale to a wide dataset.
x <- tibble(v1 = c(1, 1, 5, 3, 4), v2 = c(3, 1, 2, 1, 2))
library(dplyr)
library(tidyr)
x %>%
  mutate_all(rank, ties.method = "min") %>%
  gather(var, val) %>%
  cbind(x, .) %>%
  filter(val == 1) %>%
  select(-val)
# v1 v2 var
# 1 1 3 v1
# 2 1 1 v1
# 3 1 1 v2
# 4 3 1 v2
To avoid building a big temporary table:
gathered <- x %>%
  mutate_all(rank, ties.method = "min") %>%
  gather(var, val)
rows_to_keep <- which(gathered$val == 1)
cbind(x[(rows_to_keep - 1) %% nrow(x) + 1, ], gathered[rows_to_keep, ])
This is uglier but I think it's the most efficient I could come up with:
log_df <- mutate_all(x, function(col) col == min(col))  # flag cells equal to their column minimum (no time wasted sorting here)
filter1 <- rowSums(log_df) > 0  # drop uninteresting rows that contain no minimum
x2 <- x[filter1, ]
log_df2 <- log_df[filter1, ]
gathered <- gather(log_df2, var, val)  # put in long format
rows_to_keep <- which(gathered$val)
cbind(x2[(rows_to_keep - 1) %% nrow(x2) + 1, ], gathered[rows_to_keep, ]) %>% select(-val)
Try this code:
x %>%
  filter_all(any_vars(. == min(.))) %>%
  data.frame(., var = apply(., 1, function(i) names(.)[i == sapply(x, min)]))
If this helps please let us know. Thank you.
This code will fail in one condition: if more than one variable in a row equals its minimum. For example, a row with 1 in both columns would make it fail.
Thanks for the idea of creating new columns; my solution below stores the variable names prior to filtering. Let me know if you can improve on it:
x %>%
  mutate_all(funs(qual = . == min(.))) %>%
  filter_at(vars(ends_with("_qual")), any_vars(. == TRUE)) %>%
  gather(var, qual, ends_with("_qual")) %>%
  filter(qual == TRUE) %>%
  select(-qual) %>%
  extract(var, "var")
The intermediate table after the first step:
v1 v2 v1_qual v2_qual
1 1 3 TRUE FALSE
2 1 1 TRUE TRUE
3 5 2 FALSE FALSE
4 3 1 FALSE TRUE
5 4 2 FALSE FALSE
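As a side note, funs() was deprecated in dplyr 0.8; in current dplyr (>= 1.0) the first step above can be written with across() instead, e.g.:
x %>%
  mutate(across(everything(), ~ .x == min(.x), .names = "{.col}_qual"))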
