tidyr spread subset of key-value pairs - r

Given the example data, I'd like to spread a subset of the key-value pairs. In this case it is just one pair. However there are other cases where the subset to be spread is more than one pair.
library(tidyr)
# dummy data
> df1 <- data.frame(e = c(1, 1, 1, 1),
n = c("a", "b", "c", "d") ,
s = c(1, 2, 5, 7))
> df1
e n s
1 1 a 1
2 1 b 2
3 1 c 5
4 1 d 7
Classical spread of all key-value pairs:
> df1 %>% spread(n,s)
e a b c d
1 1 1 2 5 7
Desired output, spread only n=c
e c n s
1 1 5 a 1
2 1 5 b 2
3 1 5 d 7

We can do a gather after the spread
df1 %>%
spread(n, s) %>%
gather(n, s, -c, -e)
# e c n s
#1 1 5 a 1
#2 1 5 b 2
#3 1 5 d 7
Or, instead of spread/gather, we can filter out the 'c' row and then use mutate to create the 'c' column from the value of 's' that corresponds to 'c'
library(dplyr)
df1 %>%
filter(n != "c") %>%
mutate(c = df1$s[df1$n == "c"])
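For the case where more than one pair should be spread, the same filter-then-attach idea can be sketched with pivot_wider() and a join. This is only a sketch: the keys vector and the wide_part/result names are made up for illustration, and it assumes that e identifies the rows to join on.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(e = c(1, 1, 1, 1),
                  n = c("a", "b", "c", "d"),
                  s = c(1, 2, 5, 7),
                  stringsAsFactors = FALSE)

# Hypothetical: the keys that should become columns
keys <- c("c")

# Spread only the selected keys into wide form (one row per e)
wide_part <- df1 %>%
  filter(n %in% keys) %>%
  pivot_wider(names_from = n, values_from = s)

# Keep the remaining keys in long form and attach the wide columns
result <- df1 %>%
  filter(!n %in% keys) %>%
  left_join(wide_part, by = "e")

result
```

With keys <- c("c", "d") the result would carry both a c and a d column next to the remaining n/s pairs.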

Related

Filter groups based on difference two highest values

I have the following dataframe called df (dput below):
> df
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 B 8
6 B 2
7 B 2
8 B 3
9 C 10
10 C 1
11 C 1
12 C 8
I would like to filter groups based on the difference between their highest value (max) and second-highest value. The difference should be less than or equal to 2 (<= 2). This means that group B should be removed, because its highest value is 8 and its second-highest value is 3, which is a difference of 5. The desired output should look like this:
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
So I was wondering if anyone knows how to filter groups based on the difference between their highest and second-highest value?
dput of df:
df<-structure(list(group = c("A", "A", "A", "A", "B", "B", "B", "B",
"C", "C", "C", "C"), value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1,
1, 8)), class = "data.frame", row.names = c(NA, -12L))
Using dplyr
library(dplyr)
df %>%
group_by(group) %>%
filter(abs(diff(sort(value, decreasing = TRUE)[1:2])) <= 2) %>%
ungroup()
# A tibble: 8 × 2
group value
<chr> <dbl>
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
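To see why group B is the one that gets dropped, it can help to compute the gap per group before filtering. A minimal sketch, using the same data as the dput above; the gaps and keep names are only for illustration.

```r
library(dplyr)

df <- data.frame(group = rep(c("A", "B", "C"), each = 4),
                 value = c(5, 1, 1, 5, 8, 2, 2, 3, 10, 1, 1, 8),
                 stringsAsFactors = FALSE)

# Gap between the largest and second-largest value in each group
gaps <- df %>%
  group_by(group) %>%
  summarise(gap = abs(diff(sort(value, decreasing = TRUE)[1:2])))

# Groups whose gap satisfies the condition
keep <- gaps$group[gaps$gap <= 2]

gaps
keep
```

Here the gaps are 0 for A (the two largest values are both 5), 5 for B and 2 for C, so only A and C pass the <= 2 test.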
A base R alternative
grp <- na.omit(aggregate(. ~ group, df, function(x)
abs(diff(sort(x, decreasing = TRUE)[1:2])) <= 2))
do.call(rbind, c(mapply(function(g, v)
list(df[df$group == g & v, ]), grp$group, grp$value), make.row.names = FALSE))
group value
1 A 5
2 A 1
3 A 1
4 A 5
5 C 10
6 C 1
7 C 1
8 C 8
One possibility would be to first create a vector with the groups that meet your condition and then filter the original data.frame. Here is what I had in mind:
library(dplyr)
group_to_keep <-
df %>%
group_by(group) %>%
slice_max(value, n = 2, with_ties = FALSE) %>%
filter(abs(diff(value)) <= 2) %>%
pull(group) %>%
unique()
df %>%
filter(group %in% group_to_keep)
You can use ave.
df[ave(df$value, df$group, FUN=\(x) diff(sort(c(-x, Inf)))[1]) <= 2,]
# group value
#1 A 5
#2 A 1
#3 A 1
#4 A 5
#9 C 10
#10 C 1
#11 C 1
#12 C 8
If you can be sure that every group always has at least two values, you can use either of the following.
df[ave(df$value, df$group, FUN=\(x) diff(tail(sort(x), 2))) <= 2,]
df[ave(df$value, df$group, FUN=\(x) diff(sort(-x)[1:2])) <= 2,]
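The reason for padding with Inf in the first ave() version can be checked on a single-value group. A small sketch, where x_single is a hypothetical group that contains only one value:

```r
x_single <- c(7)  # hypothetical group with a single value

# Padded version: the "second highest" is treated as infinitely far away,
# so the gap is Inf and the group simply fails the `<= 2` test
gap_padded <- diff(sort(c(-x_single, Inf)))[1]

# Unpadded version: indexing [1:2] introduces an NA, so the gap is NA
gap_unpadded <- diff(sort(-x_single)[1:2])

gap_padded
gap_unpadded
```

An NA in the logical row index would select a row of NAs instead of dropping the group, which is why the unpadded variants are only safe when every group has at least two values.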

Relocate rows with tidyverse

Is it possible to relocate rows in the tidyverse framework, the way dplyr's relocate() does for columns?
In this example I would like to relocate row 1 to position 5 (end of dataframe)
My dataframe:
df <- structure(list(ID = c(1, 2, 3, 4, 5), var1 = c("a", "b", "c",
"d", "e"), var2 = c(1, 1, 0, 0, 1)), class = "data.frame", row.names = c(NA,
-5L))
df
ID var1 var2
1 1 a 1
2 2 b 1
3 3 c 0
4 4 d 0
5 5 e 1
Desired output:
ID var1 var2
1 2 b 1
2 3 c 0
3 4 d 0
4 5 e 1
5 1 a 1
Note: it should be a 'pipe friendly' solution. I tried a lot but found nothing. Thank you.
arrange() is the tidyverse verb for reordering rows. It can be (ab)used as follows:
dplyr::arrange(df, ID==1)
(ID==1 is logical; when it is ordered FALSE values come before TRUE values ...)
This isn't as flexible as relocate() (e.g. it's not immediately obvious how to say "move rows 100-200 so they are immediately after row 1000"), but you can probably find a way to do most tasks.
Another option (less idiomatic in my opinion) is slice():
dplyr::slice(df, order(ID==1))
(this is a tidyverse translation of @akrun's base-R answer). Either of these solutions can also be written with pipes (e.g. df %>% arrange(ID==1)).
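For the more general "move rows 100-200 so they are immediately after row 1000" case mentioned above, one way is to build the new row order explicitly and pass it to slice(). This is a sketch only: move_rows() is a hypothetical helper, not a dplyr function, and it assumes rows and after are positions in the original data frame.

```r
library(dplyr)

# Hypothetical helper: move the rows at positions `rows` so that they come
# directly after original row `after` (use after = 0 to move them to the top)
move_rows <- function(data, rows, after) {
  rest <- setdiff(seq_len(nrow(data)), rows)
  # number of remaining rows that should precede the moved block
  k <- sum(rest <= after)
  slice(data, c(head(rest, k), rows, tail(rest, length(rest) - k)))
}

df <- data.frame(ID = c(1, 2, 3, 4, 5),
                 var1 = c("a", "b", "c", "d", "e"),
                 var2 = c(1, 1, 0, 0, 1),
                 stringsAsFactors = FALSE)

to_end   <- df %>% move_rows(1, after = 5)    # row 1 to the end
to_front <- df %>% move_rows(4:5, after = 1)  # rows 4-5 directly after row 1
```

Because the data frame is the first argument, the helper works in a pipe like any other dplyr verb.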
Just to be silly:
df %>% `[`(order(.$ID==1),)
Using base R
df[order(df$ID == 1), ]
Or with slice
library(dplyr)
df %>%
slice(2:n(), 1)
Or specify with row_number()
df %>%
slice(lead(row_number(), default = 1))
Maybe this is not so elegant but here is a way:
library(dplyr)
df %>%
filter(between(row_number(), 2, nrow(df))) %>%
bind_rows(df[1, ])
ID var1 var2
1 2 b 1
2 3 c 0
3 4 d 0
4 5 e 1
5 1 a 1
Let's play a math trick
> df[order((seq(nrow(df)) -2) %% nrow(df)), ]
ID var1 var2
2 2 b 1
3 3 c 0
4 4 d 0
5 5 e 1
1 1 a 1
or
> df %>%
+ arrange(replace(row_number(), 1, n() + 1))
ID var1 var2
1 2 b 1
2 3 c 0
3 4 d 0
4 5 e 1
5 1 a 1
After the whole day of trial and error and your wonderful answers:
library(dplyr)
df %>%
slice(-1) %>% bind_rows(df %>% slice(1))
Output:
ID var1 var2
1 2 b 1
2 3 c 0
3 4 d 0
4 5 e 1
5 1 a 1

Group by count NAs as zeros [duplicate]

I am trying to count values per group with group_by when one column of the data frame contains NA. I have data like this:
> df <- data.frame(id = c(1, 2, 3, NA, 4, NA),
group = c("A", "A", "B", "C", "D", "E"))
> df
id group
1 1 A
2 2 A
3 3 B
4 NA C
5 4 D
6 NA E
I want to count groups having NA in first column as 0, but with an approach like this
> df %>% group_by(group) %>% summarise(n = n())
# A tibble: 5 x 2
group n
* <chr> <int>
1 A 2
2 B 1
3 C 1
4 D 1
5 E 1
I get 1 in rows C and E instead of the 0 that I want.
The expected result looks like this:
# A tibble: 5 x 2
group n
* <chr> <int>
1 A 2
2 B 1
3 C 0
4 D 1
5 E 0
How can I do this?
We can take the sum of a logical vector created with !is.na: since TRUE counts as 1 and FALSE as 0, the sum returns the number of non-NA elements
library(dplyr)
df %>%
group_by(group) %>%
summarise(n = sum(!is.na(id)))
# A tibble: 5 x 2
# group n
# * <chr> <int>
#1 A 2
#2 B 1
#3 C 0
#4 D 1
#5 E 0
Or use length after subsetting
df %>%
group_by(group) %>%
summarise(n = length(id[!is.na(id)]))
n() returns the total number of rows, including those with missing values, which is why it reports 1 for groups C and E.
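The difference between the two counts can be seen side by side. A minimal sketch on the question's data; the rows and non_na column names are just for illustration.

```r
library(dplyr)

df <- data.frame(id = c(1, 2, 3, NA, 4, NA),
                 group = c("A", "A", "B", "C", "D", "E"),
                 stringsAsFactors = FALSE)

counts <- df %>%
  group_by(group) %>%
  summarise(rows   = n(),              # every row, NA or not
            non_na = sum(!is.na(id)))  # only rows with a non-missing id

counts
```

For groups C and E, rows is 1 while non_na is 0.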

Replace all NA values for variable with one row equal to 0

Slightly difficult to phrase, as far as I saw none of the similar questions answered my problem.
I have a data.frame such as:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
id val
1 a NA
2 a NA
3 a NA
4 a NA
5 b 1
6 b 2
7 b 2
8 b 3
9 c NA
10 c 2
11 c NA
12 c 3
and I want to get rid of all the NA values (easy enough using e.g. filter()), but make sure that if this removes every row for one id value (in this case it removes every instance of "a"), one extra row is inserted for that id with val = 0 (e.g. a = 0)
so that:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c 2
7 c 3
obviously easy enough to do this in a roundabout way but I was wondering if there's a tidy/elegant way to do this. I thought tidyr::complete() might help but not entirely sure how to apply it to a case like this
I don't care about the order of the rows
Cheers!
edit: updated with clearer desired output. might make desired answers submitted before that a bit less clear
Another idea using dplyr,
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(val = ifelse(row_number() == 1 & all(is.na(val)), 0, val)) %>%
na.omit()
which gives,
# A tibble: 5 x 2
# Groups: id [2]
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
We may do
df1 %>% group_by(id) %>% do(if(all(is.na(.$val))) replace(.[1, ], 2, 0) else na.omit(.))
# A tibble: 5 x 2
# Groups: id [2]
# id val
# <fct> <dbl>
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
After grouping by id, if everything in val is NA, then we leave only the first row with the second element replaced by 0, otherwise the same data is returned after applying na.omit.
In a more readable format that would be
df1 %>% group_by(id) %>%
do(if(all(is.na(.$val))) data.frame(id = .$id[1], val = 0) else na.omit(.))
(Here I presume that you indeed want to get rid of all NA values; otherwise there is no need for na.omit.)
df1[is.na(df1)] <- 0
df1[!(duplicated(df1$id) & df1$val == 0), ]
id val
1 a 0
5 b 1
6 b 2
7 b 2
8 b 3
Base R option is to find groups with all NAs and transform them by changing their val to 0 and select only unique rows so that there is only one row per group. We rbind this dataframe with the groups which are !all_NA.
all_NA <- with(df1, ave(is.na(val), id, FUN = all))
rbind(unique(transform(df1[all_NA, ], val = 0)), df1[!all_NA, ])
# id val
#1 a 0
#5 b 1
#6 b 2
#7 b 2
#8 b 3
The dplyr option looks ugly, but one way is to build two data frames: one with the groups whose values are all NA and one with the groups whose values are all non-NA. For the groups with all NA values we add a row with its id and val as 0 and bind this to the other data frame.
library(dplyr)
bind_rows(df1 %>%
group_by(id) %>%
filter(all(!is.na(val))),
df1 %>%
group_by(id) %>%
filter(all(is.na(val))) %>%
ungroup() %>%
summarise(id = unique(id),
val = 0)) %>%
arrange(id)
# id val
# <fct> <dbl>
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Changed the df to make example more exhaustive -
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
library(dplyr)
df1 %>%
group_by(id) %>%
mutate(case=sum(is.na(val))==n(), row_num=row_number() ) %>%
mutate(val=ifelse(is.na(val)&case,0,val)) %>%
filter( !(case&row_num!=1) ) %>%
select(id, val)
Output
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Another base approach, one that doesn't maintain the order of the rows and takes advantage of factors remembering lost values:
df1 <- na.omit(df1)
df1 <- rbind(
df1,
data.frame(
id = levels(df1$id)[!levels(df1$id) %in% df1$id],
val = 0)
)
I personally prefer the dplyr approach given by Sotos, as I don't like rbind-ing data.frames back together, so it's a matter of taste, but this isn't unbearably complicated to my eye. It's easy enough to adapt to a character id column with a unique(df1$id) variable.
Here is an option too:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(~ replace(., is.na(.), 0)) %>%
slice(4:nrow(.))
This gives:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
Alternative:
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate_all(~ replace(., is.na(.), 0)) %>%
unique()
UPDATE based on other requirements:
Some users suggested testing on this data frame. Of course, this answer hard-codes the id and the row offset, so it assumes you inspect the data by hand. It is less useful when that is not practical, but here goes:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4), val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1 %>%
mutate_if(is.factor,as.character) %>%
mutate(val=ifelse(id=="a",0,val)) %>%
slice(4:nrow(.))
This yields:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Here is a base R solution.
res <- lapply(split(df1, df1$id), function(DF){
if(anyNA(DF$val)) {
i <- is.na(DF$val)
DF$val[i] <- 0
DF <- rbind(DF[i & !duplicated(DF[i, ]), ], DF[!i, ])
}
DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# id val
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Edit.
A dplyr solution could be the following.
It was tested with the original dataset posted by the OP, with the dataset in Vivek Kalyanarangan's answer and with the dataset in markus' comment, renamed df2 and df3, respectively.
library(dplyr)
na2zero <- function(DF){
DF %>%
group_by(id) %>%
mutate(val = ifelse(is.na(val), 0, val),
crit = val == 0 & duplicated(val)) %>%
filter(!crit) %>%
select(-crit)
}
na2zero(df1)
na2zero(df2)
na2zero(df3)
One may try this :
df1 = data.frame(id = rep(c("a", "b","c"), each = 4),
val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1
# id val
#1 a NA
#2 a NA
#3 a NA
#4 a NA
#5 b 1
#6 b 2
#7 b 2
#8 b 3
#9 c NA
#10 c 2
#11 c NA
#12 c 3
The task is to remove all rows for an id if and only if every val for that id is NA, and then add a new row with this id and val = 0.
In this example, that is id = a.
Note: val for c also contains NAs, but not all of the values for c are NA, so for c we only need to remove the rows where val is NA.
So let's create another column, say val2, which is 0 when all values for the id are NA and 1 otherwise.
library(dplyr)
df1 = df1 %>%
group_by(id) %>%
mutate(val2 = if_else(condition = all(is.na(val)),true = 0, false = 1))
df1
# A tibble: 12 x 3
# Groups: id [3]
# id val val2
# <fct> <dbl> <dbl>
#1 a NA 0
#2 a NA 0
#3 a NA 0
#4 a NA 0
#5 b 1 1
#6 b 2 1
#7 b 2 1
#8 b 3 1
#9 c NA 1
#10 c 2 1
#11 c NA 1
#12 c 3 1
Get the ids for which every val is NA.
all_na = unique(df1$id[df1$val2 == 0])
Then remove the rows with val = NA from the data frame df1.
df1 = na.omit(df1)
df1
# A tibble: 6 x 3
# Groups: id [2]
# id val val2
# <fct> <dbl> <dbl>
# 1 b 1 1
# 2 b 2 1
# 3 b 2 1
# 4 b 3 1
# 5 c 2 1
# 6 c 3 1
And create a new dataframe with ids in all_na and val = 0
all_na_df = data.frame(id = all_na, val = 0)
all_na_df
# id val
# 1 a 0
then combine these two dataframes.
df1 = bind_rows(all_na_df, df1[,c('id', 'val')])
df1
# id val
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
# 6 c 2
# 7 c 3
Hope this helps and Edits are most welcomed :-)
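The question asks whether tidyr::complete() could help. One way the "drop the NAs, then restore lost ids with 0" idea from the answers above might be written with it is sketched below. It assumes the full set of ids can be taken from the original df1, and the result name is only for illustration.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(id = rep(c("a", "b", "c"), each = 4),
                  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3),
                  stringsAsFactors = FALSE)

result <- df1 %>%
  filter(!is.na(val)) %>%                  # drop every NA row
  complete(id = unique(df1$id),            # re-add ids that vanished entirely
           fill = list(val = 0))           # and give them val = 0

result
```

On this data, id a comes back as a single row with val 0, while b keeps its four rows and c keeps only its two non-NA rows.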

R: assignment using values from several rows

Say I have measured some value (value, encoded as H, L or I) in five individuals (id) at two time points (time). Sometimes NAs may occur in value:
require(stringr)
require(dplyr)
set.seed(8)
df1 <- data.frame(
time=rep(c(1,2), 5),
id=rep(c("a", "b", "c", "d", "e"),2),
value=sample(c("H","L","I", NA), replace=T, 10))
How can I make a factor variable (preferably using dplyr::mutate()) that indicates for each id the transition of value from time 1 to time 2 (e.g. "HL" if H at time 1 and L at time 2)?
df1 %>%
group_by(id) %>%
arrange(time)
Gives:
time id value
1 1 a L
2 2 a I
3 1 b L
4 2 b H
5 1 c NA
6 2 c NA
7 1 d NA
8 2 d I
9 1 e L
10 2 e I
And I would need a fourth column indicating time transition, like (made-up):
time id value transition
1 1 a L L-I
2 2 a I L-I
3 1 b L L-H
4 2 b H L-H
5 1 c NA NA-NA
6 2 c NA NA-NA
7 1 d NA NA-I
8 2 d I NA-I
9 1 e L L-I
10 2 e I L-I
Something like (if only the str_c() command could do it):
df1 <-
df1 %>%
group_by(id) %>%
arrange(time) %>%
mutate(transition=str_c(value, sep="-"))
df1 %>%
arrange(id, time) %>%
group_by(id) %>%
mutate(transition = paste0(value[1],"-",value[2]))
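The answer above relies on the rows being sorted so that value[1] is time 1 and value[2] is time 2. A variant that looks the values up by time instead of by position, and converts the result to a factor as the question asks, could be sketched as follows. The data here is typed in by hand to match the table shown in the question rather than generated with sample(), and it assumes every id has exactly one row per time point.

```r
library(dplyr)

# Hand-typed copy of the table shown in the question (not the sample() call)
df1 <- data.frame(time  = rep(c(1, 2), times = 5),
                  id    = rep(c("a", "b", "c", "d", "e"), each = 2),
                  value = c("L", "I", "L", "H", NA, NA, NA, "I", "L", "I"),
                  stringsAsFactors = FALSE)

res <- df1 %>%
  group_by(id) %>%
  # pick the values by time point, so the row order within id does not matter
  mutate(transition = paste(value[time == 1], value[time == 2], sep = "-")) %>%
  ungroup() %>%
  mutate(transition = factor(transition))

res
```

paste() turns a missing value into the string "NA", which gives labels such as "NA-I" as in the made-up example in the question.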
