I have the following table:
col1 col2 col3 col4
1    2    1    4
5    6    6    3
My goal is to find the maximum value in each row, and then count how many times it occurs in that same row.
The resulting table should look like this:
col1 col2 col3 col4 max_val repetition
1    2    1    4    4       1
5    6    6    3    6       2
Now to achieve this, I am doing the following for Max:
df%>% rowwise%>%
mutate(max=max(col1:col4))
However, I am struggling to find the repetition. My idea is to use this pseudo code in mutate:
sum("select the current row entirely, or only some of its columns" == max). But I don't know how to select the entire row, or only some of its columns, and check whether its contents are equal to the max. How can we do this in dplyr?
A dplyr approach:
library(dplyr)
df %>%
rowwise() %>%
mutate(max_val = max(across(everything())),
repetition = sum(across(col1:col4) == max_val))
# A tibble: 2 × 6
# Rowwise:
col1 col2 col3 col4 max_val repetition
<int> <int> <int> <int> <int> <int>
1 1 2 1 4 4 1
2 5 6 6 3 6 2
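A variant of the same idea, as a minimal sketch: c_across() is the dplyr helper meant for use inside rowwise(), and it returns the selected columns of the current row as a plain vector. The sample data is re-created from the question, and the name res is only illustrative.

```r
library(dplyr)

# Sample data re-created from the question
df <- tibble(col1 = c(1L, 5L), col2 = c(2L, 6L),
             col3 = c(1L, 6L), col4 = c(4L, 3L))

res <- df %>%
  rowwise() %>%
  mutate(max_val = max(c_across(col1:col4)),
         repetition = sum(c_across(col1:col4) == max_val)) %>%
  ungroup()  # drop the rowwise grouping once the row-level work is done
```

Naming the columns explicitly in both calls keeps max_val from being picked up by the second selection.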
A base R approach:
df$max_val <- apply(df,1,max)
df$repetition <- rowSums(df[, 1:4] == df[, 5])
For other (non-tidyverse) readers, a base R approach could be:
df$max_val <- apply(df, 1, max)
df$repetition <- apply(df, 1, function(x) sum(x[1:4] == x[5]))
Output:
# col1 col2 col3 col4 max_val repetition
# 1 1 2 1 4 4 1
# 2 5 6 6 3 6 2
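If the per-row apply() call becomes slow on a large table, a fully vectorised sketch is possible, assuming the four columns are all numeric. pmax() computes the element-wise maximum across the columns and rowSums() counts the matches; the vector cols is just a hypothetical helper holding the column names.

```r
# Sample data re-created from the question
df <- data.frame(col1 = c(1L, 5L), col2 = c(2L, 6L),
                 col3 = c(1L, 6L), col4 = c(4L, 3L))

cols <- c("col1", "col2", "col3", "col4")  # hypothetical helper: columns to compare

# Element-wise maximum across the selected columns, one value per row
df$max_val <- do.call(pmax, df[cols])

# TRUE/FALSE matrix of "equals the row maximum", summed per row
df$repetition <- rowSums(df[cols] == df$max_val)
```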
Although dplyr has added many tools for working across rows of data, it remains, in my mind at least, much easier to adhere to tidy principles and always convert the data to "long" format for these kinds of operations.
Thus, here is a tidy approach:
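library(dplyr)
library(tidyr) # for pivot_longer() and pivot_wider()
df %>%
mutate(row = row_number()) %>%
pivot_longer(cols = -row) %>%
group_by(row) %>%
mutate(max_val = max(value), repetitions = sum(value == max(value))) %>%
ungroup() %>% # drop the grouping so that select() does not keep row
pivot_wider(id_cols = c(row, max_val, repetitions)) %>%
select(col1:col4, max_val, repetitions)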
The last select() is just to get the columns in the order you want.
I am trying to make a new variable using mutate(). In df1, I have ranges of values in col1, col2, col3, and col4. I would like to create a new binary variable in df1 that is "1" if any of the col1-4 values are found in a specific df2 column (let's say col10).
Thanks!
This is what I have tried so far, but I don't think it returns a value of "1" for all matching values, only for some of them.
df1 %>%
mutate(newvar = case_when(
col1 == df2$col10 | col2 == df2$col10 | col3 == df2$col10 | col4 == df2$col10 ~ 1
))
We could use if_any here. If df1 and df2 have the same number of rows and you want to compare them row by row, use == for elementwise comparison instead of %in%.
library(dplyr)
df1 %>%
mutate(newvar = +(if_any(col1:col4, ~.x %in% df2$col10)))
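The difference between == and %in% is likely why the attempt in the question matched only some rows. A small base R sketch, with made-up vectors standing in for one column of df1 and for df2$col10:

```r
x      <- c(3, 1, 2, 9)    # hypothetical values from one column of df1
lookup <- c(1, 2, 3, 14)   # hypothetical values standing in for df2$col10

# == compares position by position: x[1] with lookup[1], x[2] with lookup[2], ...
by_position <- x == lookup

# %in% asks, for each element of x, whether it occurs anywhere in lookup
by_membership <- x %in% lookup
```

Here by_position is FALSE everywhere even though three of the four values do occur in lookup, which by_membership picks up.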
First, let's make some dummy data. df1 has 4 columns and df2 has one column named col10. In the dummy data, rows 1,2,3 and 5 have matches in df2$col10.
library(dplyr)
df1 <- data.frame(col1 = 1:5, col2=3:7, col3=5:9, col4=10:14)
df2 <- data.frame(col10 = c(1,2,3,14))
We can use rowwise() to do computations within each row and then c_across() to identify the variables of interest. The code checks whether any of the values in the four columns are in df2$col10 and returns a logical value. The as.numeric() turns that logical value into 0 (FALSE) or 1 (TRUE).
df1 %>%
rowwise() %>%
mutate(newvar = as.numeric(any(c_across(col1:col4) %in% df2$col10)))
#> # A tibble: 5 × 5
#> # Rowwise:
#> col1 col2 col3 col4 newvar
#> <int> <int> <int> <int> <dbl>
#> 1 1 3 5 10 1
#> 2 2 4 6 11 1
#> 3 3 5 7 12 1
#> 4 4 6 8 13 0
#> 5 5 7 9 14 1
Created on 2023-02-09 by the reprex package (v2.0.1)
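For comparison, the if_any() approach can be run on the same dummy data. This is a sketch that assumes a dplyr version in which if_any() is available inside mutate(); the name res is only illustrative.

```r
library(dplyr)

df1 <- data.frame(col1 = 1:5, col2 = 3:7, col3 = 5:9, col4 = 10:14)
df2 <- data.frame(col10 = c(1, 2, 3, 14))

res <- df1 %>%
  # if_any() is TRUE for a row when at least one selected column matches
  mutate(newvar = as.integer(if_any(col1:col4, ~ .x %in% df2$col10)))
```

This flags the same rows (1, 2, 3 and 5) without needing rowwise(), because %in% is already vectorised over each column.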
I've got a dataframe such as this:
df = data.frame(col1=c(1,1,1,2,2,2,3,3,3),
col2=as.factor(c('a','b','b','a','a','a','b','a','b')))
Then I extract all the categories (levels) related to each column:
levels_df = expand.grid(unique(df$col1), unique(df$col2))
colnames(levels_df)=c('col1','col2')
My objective now is to apply a function to the rows belonging to each pair of levels. How can I do that?
sapply(levels, FUN, dataset=df)
Any other strategy that performs the same task is welcome. The function could be anything, for example a counting function (how many rows belong to each pair of levels), in which case the output would look like this:
In conclusion, I want to subset rows of a data frame by each pair of levels, so that I can apply a function (such as nrow()) to those rows.
You can skip the levels part, and just use dplyr to group by col1 and col2, then count the rows. Finally, we use complete to add in any combinations that don't appear in our dataset:
library(tidyverse)
df %>%
group_by(col1, col2) %>% # group df by col1 and col2
summarise(n = n(), .groups = "drop") %>% # count rows per pair, then drop the grouping
complete(col1, col2, fill=list(n=0)) # fill in missing pairs with 0; recent tidyr does not allow completing on a grouping column
The output matches what you expected:
# A tibble: 6 x 3
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
I'm not sure if this specific count example will help you, but here's what you could do in the tidyverse:
library(tidyverse)
df %>%
group_by(col1, col2) %>%
count() %>%
ungroup() %>%
complete(col1, col2, fill = list(n = 0))
which gives:
# A tibble: 6 x 3
col1 col2 n
<dbl> <fct> <dbl>
1 1 a 1
2 1 b 2
3 2 a 3
4 2 b 0
5 3 a 1
6 3 b 2
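Both answers above only count rows. For the more general request in the question (run an arbitrary function on the rows of each pair of levels), a base R sketch with split() could look like this; nrow() is used as the example function, and the names pieces and counts are illustrative.

```r
df <- data.frame(col1 = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 col2 = as.factor(c('a', 'b', 'b', 'a', 'a', 'a', 'b', 'a', 'b')))

# One data frame per (col1, col2) pair; pairs with no rows are kept
# as zero-row data frames
pieces <- split(df, list(df$col1, df$col2))

# Apply any function to each piece; nrow() is used here as the example
counts <- sapply(pieces, nrow)
```

The elements are named after the pairs (for example "2.b"), so any other function of a data frame can be swapped in for nrow().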
Below I create a function that deletes a specific column if there is only one unique value in it. Can I somehow use lapply within %>% to avoid calling the function three times? Or even call the function for all columns?
df <- tibble(col1 = sample(1:6), col2 = sample(1:6), col3 = 3, col4 = 4)
condDelCol <- function(mycolumn, mydataframe) {
if(length(unique(mydataframe[[mycolumn]])) == 1) { mydataframe[[mycolumn]] = NULL }
mydataframe
}
df %>%
condDelCol("col2", .) %>%
condDelCol("col3", .) %>%
condDelCol("col4", .)
With dplyr, an option is select_if
library(dplyr)
df %>%
select_if(~ n_distinct(.) > 1)
# A tibble: 6 x 2
# col1 col2
# <int> <int>
#1 1 6
#2 6 1
#3 5 5
#4 3 4
#5 4 2
#6 2 3
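In current dplyr the scoped verbs such as select_if() are superseded, so the same selection can also be sketched with select() and where(). The name kept is only illustrative.

```r
library(dplyr)

df <- tibble(col1 = sample(1:6), col2 = sample(1:6), col3 = 3, col4 = 4)

kept <- df %>%
  # keep only the columns that have more than one distinct value
  select(where(~ n_distinct(.x) > 1))
```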
Another way, in base R, is to loop over the columns with sapply to count the unique values in each, extract the names of the columns that have only a single unique value, and assign (<-) NULL to them:
i1 <- sapply(df, function(x) length(unique(x)))
df[names(which(i1 == 1))] <- NULL
Or with Filter, which keeps the numeric columns whose variance is non-zero (a variance of 0 is coerced to FALSE):
Filter(var, df)
You could use this one as well. It drops the columns whose standard deviation is 0.
df[, sapply(df, sd) != 0]
# A tibble: 6 x 2
col1 col2
<int> <int>
1 1 3
2 5 6
3 6 1
4 2 2
5 3 4
6 4 5
or if you want to use the pipe operator
df %>%
select(which(sapply(df, sd) != 0))
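The sd() and var() variants assume that every column is numeric. A type-agnostic sketch, using hypothetical data with a constant character column, counts unique values instead:

```r
# Hypothetical data: col2 and col4 are constant, col2 and col3 are character
df <- data.frame(col1 = 1:6, col2 = "x", col3 = letters[1:6], col4 = 4)

# TRUE for columns with more than one unique value, whatever their type
keep <- vapply(df, function(x) length(unique(x)) > 1, logical(1))

res <- df[keep]
```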
I recently had to compile a data frame of student scores (one row per student, id column and several integer-valued columns, one per score component). I had to combine a "master" data frame and several "correction" data frames (containing mostly NA and some updates to the master), so that the result contains the maximum values from the master, and all corrections.
I succeeded by copy-pasting a sequence of mutate() calls, which works (see example below), but is not elegant in my opinion. Instead of copying and pasting, I would have liked to use something along the lines of map2 with two lists of columns, comparing the columns pair-wise. Something like this (which obviously does not work as such):
list_of_cols1 <- list(col1.x, col2.x, col3.x)
list_of_cols2 <- list(col1.y, col2.y, col3.y)
map2(list_of_cols1, list_of_cols2, ~ column = pmax(.x, .y, na.rm=T))
I can't seem to figure out how to do it. My question is: how do I specify such lists of columns and mutate them in one map2() call in a dplyr pipe? Is it even possible, or have I gotten it all wrong?
Minimum working example
library(tidyverse)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- reduce(
# Ultimately there would be several correction data frames
list(master, correction1),
function(x,y) {
x <- x %>%
left_join(
y,
by = c("id")
) %>%
# Wish I knew how to do this mutate call with map2
mutate(
col1 = pmax(col1.x, col1.y, na.rm=T),
col2 = pmax(col2.x, col2.y, na.rm=T),
col3 = pmax(col3.x, col3.y, na.rm=T)
) %>%
select(id, col1:col3)
}
)
The result is
> result
# A tibble: 3 x 4
id col1 col2 col3
<int> <dbl> <dbl> <dbl>
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
Rather than do a left_join, just bind the rows and then summarize. For example:
result <- reduce(
list(master, correction1),
function(x,y) {
bind_rows(x, y) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
}
)
result
# id col1 col2 col3
# <dbl> <dbl> <dbl> <dbl>
# 1 1 1 2 3
# 2 2 1 2 3
# 3 3 2 3 3
Actually, you don't even need reduce, since bind_rows can take several data frames (or a list of them) at once.
Adding another table:
correction2 <- tibble(id=2,col1=NA,col2=8,col3=NA)
bind_rows(master, correction1, correction2) %>%
group_by(id) %>%
summarize_all(max, na.rm=T)
Sorry, this doesn't answer your question about map2, but I find it easier to aggregate over rows than over columns in tidy R:
library(dplyr)
master <- tibble(
id=c(1,2,3),
col1=c(1,1,1),
col2=c(2,2,2),
col3=c(3,3,3)
)
correction1 <- tibble(
id=seq(1,3),
col1=c(NA, NA, 2 ),
col2=c( 1, NA, 3 ),
col3=c(NA, NA, NA)
)
result <- list(master, correction1) %>%
bind_rows() %>%
group_by(id) %>%
summarise_all(max, na.rm = TRUE)
result
#> # A tibble: 3 x 4
#> id col1 col2 col3
#> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 2 3
#> 2 2 1 2 3
#> 3 3 2 3 3
If correction tables will always have the same structure as master, you can do something like the following:
library(dplyr)
library(purrr)
update_master = function(...){
map(list(...), as.matrix) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame()
}
update_master(master, correction1)
To allow id to take character values, make the following modification:
update_master = function(x, ...){
map(list(x, ...), function(x) as.matrix(x[-1])) %>%
reduce(pmax, na.rm = TRUE) %>%
data.frame(id = x[[1]], .)
}
update_master(master, correction1)
Result:
id col1 col2 col3
1 1 1 2 3
2 2 1 2 3
3 3 2 3 3
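For the map2() formulation the question originally asked about, one possible sketch is to index the joined data frame by column name and let map2() walk over the .x and .y columns in pairs. The vector cols and the intermediate names joined and merged are illustrative only.

```r
library(dplyr)
library(purrr)

master <- tibble(id = c(1, 2, 3), col1 = c(1, 1, 1),
                 col2 = c(2, 2, 2), col3 = c(3, 3, 3))
correction1 <- tibble(id = c(1, 2, 3), col1 = c(NA, NA, 2),
                      col2 = c(1, NA, 3), col3 = c(NA, NA, NA))

cols <- c("col1", "col2", "col3")  # score columns to reconcile
joined <- left_join(master, correction1, by = "id")

# Pair each col*.x column with its col*.y counterpart and take the row-wise max
merged <- map2(joined[paste0(cols, ".x")], joined[paste0(cols, ".y")],
               ~ pmax(.x, .y, na.rm = TRUE))
names(merged) <- cols  # map2() would otherwise keep the ".x" names

result <- bind_cols(joined["id"], as_tibble(merged))
```

Wrapped in a function of two data frames, this could replace the hand-written mutate() inside the reduce() call from the question.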
I can't find an exact answer to this problem, so I hope I'm not duplicating a question.
I have a dataframe as follows
groupid col1 col2 col3 col4
1 0 n NA 2
1 NA NA 2 2
What I'm trying to convey with this is that there are duplicate IDs where the total information is spread across both rows and I want to combine these rows to get all the information into one row. How do I go about this?
I've tried to play around with group_by and paste, but that ends up making the data messier (getting 22 instead of 2 in col4, for example). sum() does not work either, because some columns are strings and the rest are categorical variables, so summing them would change the information.
Is there something I can do to collapse the rows and leave consistent data unchanged while filling in NAs?
EDIT:
Sorry desired output is as follows:
groupid col1 col2 col3 col4
1 0 n 2 2
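Is this what you want? It uses zoo together with dplyr:
df %>%
group_by(groupid) %>%
# carry the last non-NA value forward within each group
mutate_all(~ na.locf(., na.rm = FALSE, fromLast = FALSE)) %>%
# keep the last row of each group, which is now fully filled in
filter(row_number() == n())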
# A tibble: 1 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n 2 2
EDIT1
Without the filter, the whole data frame is returned:
df %>%
group_by(groupid) %>%
mutate_all(~ na.locf(., na.rm = FALSE, fromLast = FALSE))
# A tibble: 2 x 5
# Groups: groupid [1]
groupid col1 col2 col3 col4
<int> <int> <chr> <int> <int>
1 1 0 n NA 2
2 1 0 n 2 2
The filter just keeps the last row of each group: na.locf carries the previous non-NA value forward, which means the last row in each group is the one you want.
Alternatively, based on @thelatemail's recommendation, you can do the following, which gives the same answer:
df %>% group_by(groupid) %>% summarise_all(~ .[!is.na(.)][1])
EDIT2
Assuming you have conflicting values and want to show them all:
df <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 1 NA 2 2",
header=TRUE,stringsAsFactors=FALSE)
df
groupid col1 col2 col3 col4
1 1 0 n NA 2
2 1 1(#)<NA> 2 2(#)
df %>%
group_by(groupid) %>%
summarise_all(~ toString(unique(na.omit(.)))) # unique() collapses duplicates such as those in col4
groupid col1 col2 col3 col4
<int> <chr> <chr> <chr> <chr>
1 1 0, 1 n 2 2
Another option, using only dplyr, is to take the first non-NA value when available. You can do:
dd <- read.table(text="groupid col1 col2 col3 col4
1 0 n NA 2
1 NA NA 2 2", header=T)
dd %>%
group_by(groupid) %>%
summarise_all(~first(na.omit(.)))
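Since summarise_all() is superseded in current dplyr, the same "first non-NA value per group" idea can be sketched with across(). The data is re-created from the question, and the name res is only illustrative.

```r
library(dplyr)

dd <- data.frame(groupid = c(1, 1), col1 = c(0, NA), col2 = c("n", NA),
                 col3 = c(NA, 2), col4 = c(2, 2))

res <- dd %>%
  group_by(groupid) %>%
  # for every non-grouping column, keep the first value that is not NA
  summarise(across(everything(), ~ .x[!is.na(.x)][1]))
```

A column that is entirely NA within a group comes out as NA, so no information is invented.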
Would you be able to draw the desired output in this case? Converting the data.frame into another type with as.vector() or as.matrix(), and grouping/factoring, might help.
UPDATE:
Finding the unique elements of each column and omitting NAs:
df<-data.frame(groupid=c(1,1), col1=c(0,NA), col2=c('n', NA), col3=c(NA,2), col4=c(2,2)) # your input
out<-data.frame(df[1,]) # where the output is stored, duplicate retaining 1 row
for(i in 1:ncol(df)) out[,i]<-na.omit(unique(df[,i]))
print(out)