How to reduce factor levels depending on another attribute? - r

I have a dataframe of two columns, id and result, and I want to assign factor levels to result depending on id, so that for id "1", result c("a","b","c","d") will have factor levels 1,2,3,4.
For id "2", result c("22","23","24") will have factor levels 1,2,3.
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
I tried to group them with split(), but that converts them to a list instead of a data frame, which causes a length problem for modeling. Can you help please?
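For reference, a minimal sketch of what the split() attempt produces (using the id and result vectors above): the groups come back as a list rather than as a column aligned with the original rows.
split(result, id)
# $`1`
# [1] "a" "b" "c" "d"
#
# $`2`
# [1] "22" "23" "24"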

Though the question was closed as a duplicate by user @Ronak Shah, I don't believe it is the same question.
After numbering the rows by group, the new column must be coerced to class "factor".
library(dplyr)
id <- c(1,1,1,1,2,2,2)
result <- c("a","b","c","d","22","23","24")
df <- data.frame(id, result)
df %>%
  group_by(id) %>%
  mutate(fac = row_number()) %>%
  ungroup() %>%
  mutate(fac = factor(fac))
# A tibble: 7 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 23 2
#7 2 24 3
Edit.
If there are repeated values in result, coerce with as.integer(factor(.)) to get the numbers, then coerce those numbers to factor.
id2 <- c(1,1,1,1,2,2,2,2)
result2 <- c("a","b","c","d","22", "22","23","24")
df2 <- data.frame(id = id2, result = result2)
df2 %>%
  group_by(id) %>%
  mutate(fac = as.integer(factor(result))) %>%
  ungroup() %>%
  mutate(fac = factor(fac))
# A tibble: 8 x 3
# id result fac
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3
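For intuition on the as.integer(factor(.)) step (a toy vector of my own, not part of the answer): factor() sorts the unique values and as.integer() returns each element's position in that ordering, so repeated values get the same number.
as.integer(factor(c("22", "22", "23", "24")))
# [1] 1 1 2 3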

After grouping by id, we can use match with unique to assign a unique number to each result. Using @Rui Barradas' dataframe df2:
library(dplyr)
df2 %>%
  group_by(id) %>%
  mutate(ans = match(result, unique(result))) %>%
  ungroup() %>%
  mutate(ans = factor(ans))
# id result ans
# <dbl> <fct> <fct>
#1 1 a 1
#2 1 b 2
#3 1 c 3
#4 1 d 4
#5 2 22 1
#6 2 22 1
#7 2 23 2
#8 2 24 3
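For intuition on match() with unique() (again a toy vector, not from the answer): unlike factor(), which numbers values in sorted order, match(x, unique(x)) numbers them in order of first appearance.
x <- c("b", "a", "b", "c")
match(x, unique(x))
# [1] 1 2 1 3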

How to subtract two columns using index in tidyverse

I have a dataframe
library(dplyr)
df <- tibble(row1 = c(1,2,3,4,5),
             row2 = c(2,3,4,5,6))
How do I subtract the two columns using index (not rownames)? I would like this to work:
df %>% mutate(diff= select(1)-select(2))
But the universe is not on my side....
select() needs a data argument, since its usage is
select(.data, ...)
Also, as select() returns a data.frame/tibble, we can extract the vector with [[:
library(dplyr)
df %>%
  mutate(diff = select(., 1)[[1]] - select(., 2)[[1]])
Output:
# A tibble: 5 x 3
# row1 row2 diff
# <dbl> <dbl> <dbl>
#1 1 2 -1
#2 2 3 -1
#3 3 4 -1
#4 4 5 -1
#5 5 6 -1
Or instead use pull() to return the vector:
df %>%
  mutate(diff = pull(., 1) - pull(., 2))
What about using select like below?
> df %>% mutate(diff = do.call(`-`,select(.,1:2)))
# A tibble: 5 x 3
row1 row2 diff
<dbl> <dbl> <dbl>
1 1 2 -1
2 2 3 -1
3 3 4 -1
4 4 5 -1
5 5 6 -1
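For completeness, a base R sketch of the same operation (assuming the tibble df defined above): [[ extracts each column as a plain vector, so positional indexing works without select() or pull().
df$diff <- df[[1]] - df[[2]]   # row1 minus row2, i.e. -1 in every row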

Flagging row that meets two conditions

For a given ID, I am trying to identify the latest observation (last wave, i.e. highest wave number) that meets a criterion (var = 1 or var = 2).
My data:
data <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3), wave=c(1,2,3, 1,2,3, 1,2,3), var=c(NA,1,2, 1,2,NA, 3,1,3))
Outcome:
outcome <- data.frame(id=c(1,1,1, 2,2,2, 3,3,3), wave=c(1,2,3, 1,2,3, 1,2,3), var=c(NA,1,2, 1,2,NA, 3,1,3), flag=c(0,0,1, 0,1,0, 0,1,0))
I can't seem to figure out how to specify to only flag the latest/last row for a given id
data %>% group_by(id) %>% mutate(flag=if_else(var %in% c(1,2) & ...,1,0))
Subset the 'wave' values where 'var' is 1 or 2, get the max, compare (==) with the 'wave' column, and convert to integer:
library(dplyr)
data %>%
  group_by(id) %>%
  mutate(flag = as.integer(wave == max(wave[var %in% 1:2])))
# A tibble: 9 x 4
# Groups: id [3]
# id wave var flag
# <dbl> <dbl> <dbl> <int>
#1 1 1 NA 0
#2 1 2 1 0
#3 1 3 2 1
#4 2 1 1 0
#5 2 2 2 1
#6 2 3 NA 0
#7 3 1 3 0
#8 3 2 1 1
#9 3 3 3 0
Here, we assume that there are unique 'wave' values for each 'id'
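For comparison, a base R sketch of the same flag (my illustration, assuming the data frame above, not part of the answer): ave() over the row indices lets the function see both wave and var within each id.
data$flag <- with(data, ave(seq_along(id), id, FUN = function(i)
  as.integer(wave[i] == max(wave[i][var[i] %in% 1:2]))))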

Replace all NA values for variable with one row equal to 0

Slightly difficult to phrase; as far as I saw, none of the similar questions answered my problem.
I have a data.frame such as:
df1 <- data.frame(id = rep(c("a", "b", "c"), each = 4),
                  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3))
df1
id val
1 a NA
2 a NA
3 a NA
4 a NA
5 b 1
6 b 2
7 b 2
8 b 3
9 c NA
10 c 2
11 c NA
12 c 3
and I want to get rid of all the NA values (easy enough using e.g. filter()), but make sure that if this removes every row for an id (in this case it removes every instance of "a"), one extra row is inserted with that id and (e.g.) val = 0,
so that:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c 2
7 c 3
Obviously it's easy enough to do this in a roundabout way, but I was wondering if there's a tidy/elegant way to do it. I thought tidyr::complete() might help, but I'm not entirely sure how to apply it to a case like this.
I don't care about the order of the rows
Cheers!
Edit: updated with a clearer desired output; this might make answers submitted before the edit a bit less clear.
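For what it's worth, here is one way the tidyr::complete() idea mentioned in the question could be applied (a sketch of my own, not one of the posted answers, assuming df1 as defined above): drop the NA rows first, then complete() restores any id that disappeared entirely, filling its val with 0.
library(dplyr)
library(tidyr)
df1 %>%
  filter(!is.na(val)) %>%
  complete(id = unique(df1$id), fill = list(val = 0))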
Another idea using dplyr,
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(val = ifelse(row_number() == 1 & all(is.na(val)), 0, val)) %>%
  na.omit()
which gives,
# A tibble: 5 x 2
# Groups: id [2]
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
We may do
df1 %>%
  group_by(id) %>%
  do(if(all(is.na(.$val))) replace(.[1, ], 2, 0) else na.omit(.))
# A tibble: 5 x 2
# Groups: id [2]
# id val
# <fct> <dbl>
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
After grouping by id, if everything in val is NA we keep only the first row, with its second element replaced by 0; otherwise the same data is returned after applying na.omit.
In a more readable format that would be
df1 %>%
  group_by(id) %>%
  do(if(all(is.na(.$val))) data.frame(id = .$id[1], val = 0) else na.omit(.))
(Here I presume that you indeed want to get rid of all NA values; otherwise there is no need for na.omit.)
df1[is.na(df1)] <- 0
df1[!(duplicated(df1$id) & df1$val == 0), ]
id val
1 a 0
5 b 1
6 b 2
7 b 2
8 b 3
A base R option is to find the groups that are all NA, transform them by changing their val to 0, and keep only unique rows so there is one row per such group. We then rbind this dataframe with the groups which are not all NA (!all_NA).
all_NA <- with(df1, ave(is.na(val), id, FUN = all))
rbind(unique(transform(df1[all_NA, ], val = 0)), df1[!all_NA, ])
# id val
#1 a 0
#5 b 1
#6 b 2
#7 b 2
#8 b 3
The dplyr option looks ugly, but one way is to make two dataframes: one with the groups whose values are all NA and another with the groups whose values are all non-NA. For the all-NA groups we add a row with the id and val = 0, and bind this to the other dataframe.
library(dplyr)
bind_rows(df1 %>%
            group_by(id) %>%
            filter(all(!is.na(val))),
          df1 %>%
            group_by(id) %>%
            filter(all(is.na(val))) %>%
            ungroup() %>%
            summarise(id = unique(id),
                      val = 0)) %>%
  arrange(id)
# id val
# <fct> <dbl>
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Changed the df to make the example more exhaustive:
df1 <- data.frame(id = rep(c("a", "b", "c"), each = 4),
                  val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3))
library(dplyr)
df1 %>%
  group_by(id) %>%
  mutate(case = sum(is.na(val)) == n(), row_num = row_number()) %>%
  mutate(val = ifelse(is.na(val) & case, 0, val)) %>%
  filter(!(case & row_num != 1)) %>%
  select(id, val)
Output
id val
<fct> <dbl>
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Another base approach, one that doesn't maintain the order of the rows and takes advantage of factors remembering lost values:
df1 <- na.omit(df1)
df1 <- rbind(
  df1,
  data.frame(
    id = levels(df1$id)[!levels(df1$id) %in% df1$id],
    val = 0)
)
I do personally prefer the dplyr approach given by Sotos, as I don't like rbind-ing data.frames back together, so it's a matter of taste; but this isn't unbearably complicated to my eye. It's easy enough to adapt to a character id column with a unique(df1$id) variable, as in the sketch below.
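A sketch of that adaptation for a character id column (my illustration, assuming the same df1; the ids that vanish after na.omit() are exactly those whose values were all NA):
df1$id <- as.character(df1$id)   # work with a character id column
all_ids <- unique(df1$id)
df1 <- na.omit(df1)
df1 <- rbind(df1, data.frame(id = setdiff(all_ids, df1$id), val = 0))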
Here is an option too:
df1 %>%
  mutate_if(is.factor, as.character) %>%
  mutate_all(funs(replace(., is.na(.), 0))) %>%
  slice(4:nrow(.))
This gives:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
Alternative:
df1 %>%
  mutate_if(is.factor, as.character) %>%
  mutate_all(funs(replace(., is.na(.), 0))) %>%
  unique()
UPDATE based on other requirements:
Some users suggested testing on this dataframe. Of course, this answer assumes you'll look at everything by hand, which makes it less useful at scale, but here goes:
df1 <- data.frame(id = rep(c("a", "b","c"), each = 4), val = c(NA, NA, NA, NA, 1, 2, 2, 3,NA,2,NA,3))
df1 %>%
  mutate_if(is.factor, as.character) %>%
  mutate(val = ifelse(id == "a", 0, val)) %>%
  slice(4:nrow(.))
This yields:
id val
1 a 0
2 b 1
3 b 2
4 b 2
5 b 3
6 c NA
7 c 2
8 c NA
9 c 3
Here is a base R solution.
res <- lapply(split(df1, df1$id), function(DF){
  if(anyNA(DF$val)) {
    i <- is.na(DF$val)
    DF$val[i] <- 0
    DF <- rbind(DF[i & !duplicated(DF[i, ]), ], DF[!i, ])
  }
  DF
})
res <- do.call(rbind, res)
row.names(res) <- NULL
res
# id val
#1 a 0
#2 b 1
#3 b 2
#4 b 2
#5 b 3
Edit.
A dplyr solution could be the following.
It was tested with the original dataset posted by the OP, with the dataset in Vivek Kalyanarangan's answer and with the dataset in markus' comment, renamed df2 and df3, respectively.
library(dplyr)
na2zero <- function(DF){
  DF %>%
    group_by(id) %>%
    mutate(val = ifelse(is.na(val), 0, val),
           crit = val == 0 & duplicated(val)) %>%
    filter(!crit) %>%
    select(-crit)
}
na2zero(df1)
na2zero(df2)
na2zero(df3)
One may try this:
df1 = data.frame(id = rep(c("a", "b", "c"), each = 4),
                 val = c(NA, NA, NA, NA, 1, 2, 2, 3, NA, 2, NA, 3))
df1
# id val
#1 a NA
#2 a NA
#3 a NA
#4 a NA
#5 b 1
#6 b 2
#7 b 2
#8 b 3
#9 c NA
#10 c 2
#11 c NA
#12 c 3
The task is to remove all rows for an id if and only if val for that id is all NAs, and then to add a new row with this id and val = 0.
In this example, id = a.
Note: val for c also has NAs, but not all of the values for c are NA, so we only need to remove the rows for c where val = NA.
So let's create another column, say val2, which is 0 when all values for an id are NA and 1 otherwise.
library(dplyr)
df1 = df1 %>%
  group_by(id) %>%
  mutate(val2 = if_else(condition = all(is.na(val)), true = 0, false = 1))
df1
# A tibble: 12 x 3
# Groups: id [3]
# id val val2
# <fct> <dbl> <dbl>
#1 a NA 0
#2 a NA 0
#3 a NA 0
#4 a NA 0
#5 b 1 1
#6 b 2 1
#7 b 2 1
#8 b 3 1
#9 c NA 1
#10 c 2 1
#11 c NA 1
#12 c 3 1
Get the ids for which all val values are NA.
all_na = unique(df1$id[df1$val2 == 0])
Then remove the rows with val = NA from the dataframe df1.
df1 = na.omit(df1)
df1
# A tibble: 6 x 3
# Groups: id [2]
# id val val2
# <fct> <dbl> <dbl>
# 1 b 1 1
# 2 b 2 1
# 3 b 2 1
# 4 b 3 1
# 5 c 2 1
# 6 c 3 1
And create a new dataframe with ids in all_na and val = 0
all_na_df = data.frame(id = all_na, val = 0)
all_na_df
# id val
# 1 a 0
Then combine these two dataframes.
df1 = bind_rows(all_na_df, df1[,c('id', 'val')])
df1
# id val
# 1 a 0
# 2 b 1
# 3 b 2
# 4 b 2
# 5 b 3
# 6 c 2
# 7 c 3
Hope this helps, and edits are most welcome. :-)

R group by | count distinct values grouping by another column

How can I count the number of distinct visit_ids per pagename?
visit_id post_pagename
1 A
1 B
1 C
1 D
2 A
2 A
3 A
3 B
Result should be:
post_pagename distinct_visit_ids
A 3
B 2
C 1
D 1
I tried it with:
library(dplyr)
test_df <- data.frame(cbind(c(1,1,1,1,2,2,3,3), c("A","B","C","D","A","A","A","B")))
colnames(test_df) <- c("visit_id","post_pagename")
test_df
test_df %>%
  group_by(post_pagename) %>%
  summarize(vis_count = n_distinct(visit_id))
But this gives me only the number of distinct visit_ids in my data set.
One way
test_df |>
  distinct() |>
  count(post_pagename)
# post_pagename n
# <fct> <int>
# 1 A 3
# 2 B 2
# 3 C 1
# 4 D 1
Or another
test_df |>
  group_by(post_pagename) |>
  summarise(distinct_visit_ids = n_distinct(visit_id))
# A tibble: 4 x 2
# post_pagename distinct_visit_ids
# <fct> <int>
#1 A 3
#2 B 2
#3 C 1
#4 D 1
(D has one visit, so it must be counted.)
The function n_distinct() will give you the number of distinct rows in your data; since you have two rows that are "2 A", you should use only n(), which will count the number of times your grouped variable appears.
test_df <- data.frame(cbind(c(1,1,1,1,2,2,3,3), c("A","B","C","D","A","A","A","B")))
colnames(test_df) <- c("visit_id","post_pagename")
test_df
test_df %>%
  unique() %>%
  group_by(post_pagename) %>%
  summarize(vis_count = n())
This should work fine.
Hope it helps :)
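For completeness, a base R sketch of the same count (assuming test_df as built above), using aggregate() to count unique visit_ids per page:
aggregate(visit_id ~ post_pagename, data = test_df,
          FUN = function(x) length(unique(x)))
# counts: A 3, B 2, C 1, D 1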

Reorder a single column in a dataframe within each level of another column

Probably the solution to this problem is really easy but I just can't see it. Here is my sample data frame:
df <- data.frame(id=c(1,1,1,2,2,2), value=rep(1:3,2), level=rep(letters[1:3],2))
df[6,2] <- NA
And here is the desired output that I would like to create:
df$new_value <- c(3,2,1,NA,2,1)
So the order of all columns is the same, and for the new_value column the value column order is reversed within each level of the id column. Any ideas? Thanks!
As I understood your question, it's a coincidence that your data is sorted; if you just want to reverse the order within each group without sorting:
library(dplyr)
df %>% group_by(id) %>% mutate(new_value = rev(value)) %>% ungroup
# A tibble: 6 x 4
id value level new_value
<dbl> <int> <fctr> <int>
1 1 1 a 3
2 1 2 b 2
3 1 3 c 1
4 2 1 a NA
5 2 2 b 2
6 2 NA c 1
A slightly different approach, using the parameters in the sort function:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(value = sort(value, decreasing = TRUE, na.last = FALSE))
Output:
# A tibble: 6 x 3
# Groups: id [2]
id value level
<dbl> <int> <fctr>
1 1.00 3 a
2 1.00 2 b
3 1.00 1 c
4 2.00 NA a
5 2.00 2 b
6 2.00 1 c
Hope this helps!
We can use order on the missing values and on the column itself
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(new_value = value[order(!is.na(value), -value)])
# A tibble: 6 x 4
# Groups: id [2]
# id value level new_value
# <dbl> <int> <fctr> <int>
#1 1.00 1 a 3
#2 1.00 2 b 2
#3 1.00 3 c 1
#4 2.00 1 a NA
#5 2.00 2 b 2
#6 2.00 NA c 1
Or using arrange from dplyr:
df %>%
  arrange(id, !is.na(value), desc(value)) %>%
  transmute(new_value = value) %>%
  bind_cols(df, .)
Or using base R, specifying the na.last option as FALSE in order:
with(df, ave(value, id, FUN = function(x) x[order(-x, na.last = FALSE)]))
#[1] 3 2 1 NA 2 1
