New column with the count of columns that meet certain criteria - r

I searched a lot and could not find a good solution to this simple problem. I tried rowSums, but with no success.
I have a df like the example below. I want to create a new column (V4), preferably using tidyverse, with the count of values in each row that meet a certain condition. In this example, the condition would be . == 5: how many times the number 5 appears across the other columns of that row.
Example df
df <- data.frame(V1 = c(1,2,5,5,3),
                 V2 = c(1,5,5,5,5),
                 V3 = c(1,3,4,5,1))

We could use rowSums() on a logical matrix:
df$V4 <- rowSums(df == 5)
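To see what rowSums() adds up, df == 5 returns a logical matrix in which TRUE counts as 1 (values shown for the example data above):
df == 5
#      V1    V2    V3
# 1 FALSE FALSE FALSE
# 2 FALSE  TRUE FALSE
# 3  TRUE  TRUE FALSE
# 4  TRUE  TRUE  TRUE
# 5 FALSE  TRUE FALSE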
If we want a dplyr solution:
library(dplyr)
df <- df %>%
  mutate(V4 = rowSums(cur_data() == 5))
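Note that cur_data() was deprecated in dplyr 1.1.0 in favor of pick(); a sketch of the equivalent on newer versions:
df %>%
  mutate(V4 = rowSums(pick(everything()) == 5))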
Or we may also use reduce():
library(purrr)
df %>%
  mutate(V4 = across(everything(), ~ .x == 5) %>%
           reduce(`+`))

Here is another dplyr option:
library(dplyr)
df %>%
  rowwise() %>%
  mutate(V4 = sum(c_across(V1:V3) == 5, na.rm = TRUE))
Output
V1 V2 V3 V4
<dbl> <dbl> <dbl> <int>
1 1 1 1 0
2 2 5 3 1
3 5 5 4 2
4 5 5 5 3
5 3 5 1 1
Or another option using purrr:
library(tidyverse)
df %>%
  mutate(V4 = pmap_int(select(., everything()), ~ sum(c(...) == 5, na.rm = TRUE)))

Related

Is there a way to create a for loop to give the sum of values in a range of cells in a column based on the values from another column

I want to regroup cells from V1 based on a condition: if values in V1 are between 0 and 3, calculate the sum of the corresponding rows in V3. In the data shown, the first values of V1 are between 0 and 3, so we can sum the first 3 cells of V3 (sum = 12).
Next, I want to do the same for the interval [3,6], then [6,9] ... [i, i+3].
I tried to make a for loop but I couldn't figure out how to specify my arguments.
We could use dplyr:
library(dplyr)
df %>%
  mutate(interval = rep(row_number(), each = 3, length.out = n())) %>%
  group_by(interval) %>%
  summarise(sum = sum(V3))
interval sum
<int> <int>
1 1 12
2 2 21
3 3 30
4 4 12
data:
V1 <- 1:10
V2 <- 2:11
V3 <- 3:12
df <- tibble(V1,V2,V3)
Or another option is gl() to create the grouping column:
library(dplyr)
df %>%
  group_by(interval = as.integer(gl(n(), 3, n()))) %>%
  summarise(V3 = sum(V3))
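For reference, gl() here builds the same block grouping vector as the rep() approach; with 10 rows in blocks of 3:
as.integer(gl(10, 3, 10))
# [1] 1 1 1 2 2 2 3 3 3 4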
NOTE: data from #TarJae

Combining two variables to create new variable

I would like to combine two variables that have only one answer each into a single variable that has both answers.
Example
IPV_YES only has answers that are 1
IPV_NO only has answers that are 2
I would like to combine them into a single variable named IPV that would have the 1 and 2 results from both individual categories.
I have tried using ifelse command but it only shows me the value of IPV_YES.
(The dataset and desired outcome were shown as images in the original post; reproducible data appears below.)
My answer:
df %>%
  mutate(across(everything(), ~ ifelse(. == "", NA, as.numeric(.)))) %>%
  group_by(ID) %>%
  rowwise() %>%
  transmute(IPV = sum(c_across(everything()), na.rm = TRUE))
# A tibble: 4 x 2
# Rowwise: ID
ID IPV
<dbl> <dbl>
1 1 1
2 2 2
3 3 1
4 4 2
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
We can use coalesce() after converting the '' to NA:
library(dplyr)
df <- df %>%
  transmute(ID, IPV = coalesce(na_if(IPV_YES, ""), na_if(IPV_NO, ""))) %>%
  type.convert(as.is = TRUE)
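As a quick illustration of the na_if() step on the IPV_YES column:
na_if(c("1", "", "1", ""), "")
# [1] "1" NA  "1" NA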
data
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
df$IPV <- ifelse(df$IPV_YES != "", df$IPV_YES, df$IPV_NO)
Here, we specify an ifelse statement; it can be glossed thus: if the value in df$IPV_YES is not blank, give the value in df$IPV_YES, else give the corresponding value from df$IPV_NO.
If you want to remove the IPV_* columns:
df[,2:3] <- NULL
Result:
df
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2
Data:
df <- data.frame(ID = 1:4, IPV_YES = c(1,"",1,""), IPV_NO = c("",2,"",2))
Maybe you can try the code below:
library(dplyr)
replace(df, df == "", NA) %>%
  mutate(IPV = coalesce(IPV_YES, IPV_NO)) %>%
  select(ID, IPV) %>%
  type.convert(as.is = TRUE)
which gives
ID IPV
1 1 1
2 2 2
3 3 1
4 4 2

filter() but keep groups without value

I am trying to condense a grouped df, pulling out only rows that contain a certain value, but that value isn't reflected in all groups. I want to find a way to pull out all rows with that value, but also make a NA or 0 row for groups not containing that value.
Ex:
x1 <- c('1','1','1','1','1','2','2','2','2','2','3','3','3','3','3')
x2 <- c('a','b','c','d','e','b','c','d','e','f','a','b','d','e','f')
df <- data.frame(x1, x2)

library(dplyr)
df %>%
  group_by(x1) %>%
  filter(x2 == "a")
this returns:
x1 x2
<fct> <fct>
1 1 a
2 3 a
but I want it to return:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
Obviously the real code is much more complicated, so I'm looking for the best way to keep these empty groups in a reproducible way.
PS: I would like to stay in dplyr to keep things smooth within a function chain.
Thanks!
One dplyr option could be:
df %>%
  group_by(x1) %>%
  slice(which.max(x2 == "a")) %>%
  mutate(x2 = replace(x2, x2 != "a", NA_character_))
x1 x2
<fct> <fct>
1 1 a
2 2 <NA>
3 3 a
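The trick here: which.max() on a logical vector returns the index of the first TRUE, and falls back to 1 when there is no TRUE, so slice() keeps exactly one row per group either way:
which.max(c(FALSE, TRUE, TRUE))  # 2: position of the first TRUE
which.max(c(FALSE, FALSE))       # 1: no TRUE, falls back to the first element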
If it's relevant to have multiple target values per group:
df %>%
  group_by(x1) %>%
  filter(x2 == "a") %>%
  bind_rows(df %>%
              group_by(x1) %>%
              filter(all(x2 != "a")) %>%
              slice(1) %>%
              mutate(x2 = replace(x2, x2 != "a", NA_character_)))
As you did not specify dplyr solutions only, here's one option with data.table:
library(data.table)
setDT(df)
df[, .(x2 = x2[match('a', x2)]), x1]
# x1 x2
# 1: 1 a
# 2: 2 <NA>
# 3: 3 a
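This works because match('a', x2) returns the position of the first 'a' within each group, or NA when the group has none, and indexing with NA yields an NA value:
c("b", "a")[match("a", c("b", "a"))]  # "a"
c("b", "c")[match("a", c("b", "c"))]  # NA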
This happens because of the way dplyr was written. According to Hadley Wickham (the package creator), to keep NA values you have to ask for them explicitly. As he said in this issue on GitHub, you should filter(a == x | is.na(a)). In your case you would use the following:
df %>%
  group_by(x1) %>%
  filter(x2 == "a" | is.na(x2))
That will return this as a result:
x1 x2
<fct> <fct>
1 1 a
2 2 NA
3 3 a
In this code you're asking R for all rows in which x2 equals "a", as well as those in which x2 is NA.
We can use complete() after the filter step to get the missing combinations. By default, all the other columns will be filled with NA (a custom value can be supplied via the fill argument).
library(dplyr)
library(tidyr)
df %>%
  filter(x2 == 'a') %>%
  complete(x1 = unique(df$x1))
# A tibble: 3 x 2
# x1 x2
# <fct> <fct>
#1 1 a
#2 2 <NA>
#3 3 a
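If NA is not the placeholder you want, fill takes a named list; a sketch with an arbitrary placeholder string (x2 is converted to character first, since a factor can only be filled with an existing level):
df %>%
  mutate(x2 = as.character(x2)) %>%
  filter(x2 == 'a') %>%
  complete(x1 = unique(df$x1), fill = list(x2 = "none"))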
Another option is match
df %>%
  group_by(x1) %>%
  summarise(x2 = x2[match('a', x2)])
If there are many columns, then mutate 'x2' with match and then slice the first row
df %>%
  group_by(x1) %>%
  mutate(x2 = x2[match('a', x2)]) %>%
  slice(1)
How about the base R solution using aggregate(), like below?
dfout <- aggregate(x2 ~ x1, df, function(v) ifelse("a" %in% v, "a", NA))
or
dfout <- aggregate(x2 ~ x1, df, function(v) v[match("a", v)])
such that
> dfout
x1 x2
1 1 a
2 2 <NA>
3 3 a

dplyr: adding qualifying column names while doing scoped filtering ("filter_all", ...)

I have a very wide & long data set from which I need to pick out rows where any of a selection of variables meet certain conditions. So far, scoped filtering in dplyr along with any_vars are very close to what I need. To illustrate:
x <- tibble(v1 = c(1, 1, 5, 3, 4), v2 = c(3, 1, 2, 1, 2))
filter_all(x, any_vars(. == min(.)))
produces
# A tibble: 3 x 2
v1 v2
<dbl> <dbl>
1 1 3
2 1 1
3 3 1
I want to add the name of the "filtering variable" to the resulting rows as shown below:
v1 v2 var
<dbl> <dbl> <chr>
1 1 3 v1
2 1 1 v1
3 1 1 v2
4 3 1 v2
Any suggestions? I suspect that one of the map functions in purrr may work to do the filtering one by one and then combine the results afterwards.
When a row qualifies for multiple variables (thanks to #Moody_Mudskipper), I'd like to show the row multiple times, once with v1 and once with v2 in this case.
There you go, this should scale for a wide dataset.
x <- tibble(v1 = c(1, 1, 5, 3, 4), v2 = c(3, 1, 2, 1, 2))
library(dplyr)
library(tidyr)
x %>%
  mutate_all(rank, ties.method = "min") %>%
  gather(var, val) %>%
  cbind(x, .) %>%
  filter(val == 1) %>%
  select(-val)
# v1 v2 var
# 1 1 3 v1
# 2 1 1 v1
# 3 1 1 v2
# 4 3 1 v2
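The rank(., ties.method = "min") step assigns rank 1 to every occurrence of a column's minimum, which is why filtering on val == 1 recovers all qualifying rows:
rank(c(1, 1, 5, 3, 4), ties.method = "min")
# [1] 1 1 5 3 4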
To avoid building a big temporary table:
gathered <- x %>%
  mutate_all(rank, ties.method = "min") %>%
  gather(var, val)
rows_to_keep <- which(gathered$val == 1)
# gather() stacks the columns, so long-table row i maps back to
# original row (i - 1) %% nrow(x) + 1
cbind(x[(rows_to_keep - 1) %% nrow(x) + 1, ], gathered[rows_to_keep, ])
This is uglier but I think it's the most efficient I could come up with:
log_df <- mutate_all(x,function(x){x==min(x)}) # identify rows that contain min (no time wasted sorting here)
filter1 <- rowSums(log_df)>0 # to get rid of uninteresting rows
x2 <- x[filter1,]
log_df2 <- log_df[filter1,]
gathered <- gather(log_df2,var,val) # put in long format
rows_to_keep <- which(gathered$val)
cbind(x2[(rows_to_keep-1) %% nrow(x2) + 1,],gathered[rows_to_keep,]) %>% select(-val)
Try this code:
x %>%
  filter_all(any_vars(. == min(.))) %>%
  data.frame(., var = apply(., 1, function(i) names(.)[i == sapply(x, min)]))
If this helps, please let us know.
Note that this code will fail in one case: if more than one variable in a row is at its minimum (for example, a row in the posted data containing two 1s), apply() returns more than one name for that row and the data.frame() construction fails.
Thanks for the idea to create new columns; my solution below stores the variable names first, prior to the filtering. Let me know if you can improve upon it:
x %>%
  mutate_all(funs(qual = . == min(.))) %>%
  filter_at(vars(ends_with("_qual")), any_vars(. == TRUE)) %>%
  gather(var, qual, ends_with("_qual")) %>%
  filter(qual == TRUE) %>%
  select(-qual) %>%
  extract(var, "var")
The intermediate table after the first step:
v1 v2 v1_qual v2_qual
1 1 3 TRUE FALSE
2 1 1 TRUE TRUE
3 5 2 FALSE FALSE
4 3 1 FALSE TRUE
5 4 2 FALSE FALSE
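Note that funs() was deprecated in dplyr 0.8.0; on current dplyr the first step can be sketched with across() and a .names spec instead:
x %>%
  mutate(across(everything(), ~ .x == min(.x), .names = "{.col}_qual"))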

Remove duplicated rows using dplyr

I have a data.frame like this:
set.seed(123)
df <- data.frame(x = sample(0:1, 10, replace = T),
                 y = sample(0:1, 10, replace = T),
                 z = 1:10)
> df
x y z
1 0 1 1
2 1 0 2
3 0 1 3
4 1 1 4
5 1 0 5
6 0 1 6
7 1 0 7
8 1 0 8
9 1 0 9
10 0 1 10
I would like to remove duplicate rows based on the first two columns. Expected output:
df[!duplicated(df[,1:2]),]
x y z
1 0 1 1
2 1 0 2
4 1 1 4
I am specifically looking for a solution using dplyr package.
Here is a solution using dplyr >= 0.5.
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)
> df %>% distinct(x, y, .keep_all = TRUE)
x y z
1 0 1 1
2 1 0 2
3 1 1 4
Note: dplyr now contains the distinct function for this purpose.
Original answer below:
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)
One approach would be to group, and then only keep the first row:
df %>% group_by(x, y) %>% filter(row_number(z) == 1)
## Source: local data frame [3 x 3]
## Groups: x, y
##
## x y z
## 1 0 1 1
## 2 1 0 2
## 3 1 1 4
(In dplyr 0.2 you won't need the dummy z variable and will just be able to write row_number() == 1.)
I've also been thinking about adding a slice() function that would work like:
df %>% group_by(x, y) %>% slice(from = 1, to = 1)
Or maybe a variation of unique() that would let you select which variables to use:
df %>% unique(x, y)
For completeness’ sake, the following also works:
df %>% group_by(x) %>% filter(!duplicated(y))
However, I prefer the solution using distinct, and I suspect it’s faster, too.
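One way to check that suspicion is a quick benchmark; a sketch using the bench package (an assumption here, not part of the original answer), with check = FALSE because the two results hold the same rows but differ in grouping and row order:
library(bench)
bench::mark(
  distinct   = df %>% distinct(x, y, .keep_all = TRUE),
  filter_dup = df %>% group_by(x) %>% filter(!duplicated(y)) %>% ungroup(),
  check = FALSE
)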
Most of the time, the best solution is using distinct() from dplyr, as has already been suggested.
However, here's another approach that uses the slice() function from dplyr.
# Generate fake data for the example
library(dplyr)
set.seed(123)
df <- data.frame(
  x = sample(0:1, 10, replace = T),
  y = sample(0:1, 10, replace = T),
  z = 1:10
)
# In each group of rows formed by combinations of x and y
# retain only the first row
df %>%
  group_by(x, y) %>%
  slice(1)
Difference from using the distinct() function
The advantage of this solution is that it makes it explicit which rows are retained from the original dataframe, and it can pair nicely with the arrange() function.
Let's say you had customer sales data and you wanted to retain one record per customer, and you want that record to be the one from their latest purchase. Then you could write:
customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  group_by(Customer_ID) %>%
  slice(1)
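For comparison, distinct() also keeps the first row among duplicates, so after arranging by descending date it likewise retains the latest purchase:
customer_purchase_data %>%
  arrange(desc(Purchase_Date)) %>%
  distinct(Customer_ID, .keep_all = TRUE)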
When selecting columns in R for a reduced data-set you can often end up with duplicates.
These two lines give the same result. Each outputs a unique data-set with two selected columns only:
distinct(mtcars, cyl, hp)
summarise(group_by(mtcars, cyl, hp))
If you want to find the rows that are duplicated you can use find_duplicates from hablar:
library(dplyr)
library(hablar)
df <- tibble(a = c(1, 2, 2, 4),
             b = c(5, 2, 2, 8))
df %>% find_duplicates()
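With this data, find_duplicates() should return both copies of the duplicated row:
# A tibble: 2 x 2
#       a     b
#   <dbl> <dbl>
# 1     2     2
# 2     2     2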
