Group cases based on id and other variables - r

I'm having trouble restricting a dataset based on an ifelse condition.
This is an example of my dataframe:
structure(list(id = c(111, 111, 111, 112, 112, 112), se = c(1,
2, 3, 1, 2, 3), pe = c(1, 1, 2, 1, 1, 1)), class = "data.frame", row.names = c(NA,
-6L))
I need to keep only the ids whose rows all share the same pe value.
The resulting table should be:
id se pe
112 1 1
112 2 1
112 3 1

I would suggest the following approach using dplyr. You can compute flags that count the number of unique values per id and then filter on them. The flags are nid and npe. Here is the code, with df holding your dput() data:
library(dplyr)
#Code
df %>% group_by(id) %>% mutate(nid = n_distinct(id),npe = n_distinct(pe)) %>%
filter(nid==1 & npe==1) %>% select(-c(nid,npe))
Output:
# A tibble: 3 x 3
# Groups: id [1]
id se pe
<dbl> <dbl> <dbl>
1 112 1 1
2 112 2 1
3 112 3 1

We could also do this without creating and then deleting helper columns. Only pe needs to be checked, because se differs within every id:
library(dplyr)
df1 %>%
group_by(id) %>%
filter(n_distinct(pe) == 1)
# A tibble: 3 x 3
# Groups: id [1]
# id se pe
# <dbl> <dbl> <dbl>
#1 112 1 1
#2 112 2 1
#3 112 3 1
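For comparison, the same "count distinct values per id, then filter" idea can be sketched in base R with ave(). This is only an illustration under the assumption that the data frame is the one from the question; the names n_pe and result are made up for the example.

```r
# Data from the question
df <- data.frame(id = c(111, 111, 111, 112, 112, 112),
                 se = c(1, 2, 3, 1, 2, 3),
                 pe = c(1, 1, 2, 1, 1, 1))

# For every row, the number of distinct pe values within its id
n_pe <- ave(df$pe, df$id, FUN = function(x) length(unique(x)))

# Keep only the ids where pe never changes
result <- df[n_pe == 1, ]
result
```

ave() returns a vector as long as the input, so it can be used directly as a row filter without adding a column.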

Related

Create a group variable based on different criteria of consecutive scores

I have a dataset that contains just the subject id and scores from different time points. Is there a way for me to create a group variable based on their scores? For example, if a subject has 6 consecutive scores of 1 or 2, I would put them in group "a" | if they had 4 consecutive scores of 3, I would put them in group "b" | if they had 6 consecutive scores of 4 or higher, I would put them in group "c".
Here is an example dataset:
id score1 score2 score3 score4 score5 score6 score7 score8 group
101 2 2 2 2 1 2 2 1 a
102 4 4 3 3 3 3 4 4 b
103 4 5 5 5 5 6 5 5 c
Here is the R code for the above table without the "group" column
structure(list(id = c(101, 102, 103), score1 = c(2, 4, 4), score2 = c(2,
4, 5), score3 = c(2, 3, 5), score4 = c(2, 3, 5), score5 = c(1,
3, 5), score6 = c(2, 3, 6), score7 = c(2, 4, 5), score8 = c(1,
4, 5)), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"
))
Any ideas are appreciated! Thank you so much :)
The function at the heart of all solutions is rle(). How you handle everything around is up to you.
library(tidyverse, quietly = TRUE)
score_df %>%
  pivot_longer(score1:score8) %>%
  # recode so that 1-2 become 1 and 4+ become 4; 3 stays as is
  mutate(value =
           case_when(
             value <= 2 ~ 1,
             value >= 4 ~ 4,
             TRUE ~ value
           )) %>%
  group_by(id) %>%
  group_map(~{
    r <- rle(.x$value)
    # compare the run lengths of each recoded value against the thresholds
    case_when(any(r$lengths[r$values == 1] >= 6) ~ "a",
              any(r$lengths[r$values == 3] >= 4) ~ "b",
              any(r$lengths[r$values == 4] >= 6) ~ "c",
              TRUE ~ NA_character_)
  }) %>%
  unlist()
#> [1] "a" "b" "c"
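To make the role of rle() concrete, here is a small base R sketch on a single subject's scores (the row for id 102 in the question). The variable names are illustrative only.

```r
# Scores for id 102 from the question
scores <- c(4, 4, 3, 3, 3, 3, 4, 4)

# Recode: 1-2 become 1, 4 and above become 4, 3 stays 3
recoded <- ifelse(scores <= 2, 1, ifelse(scores >= 4, 4, scores))

# Run-length encoding: which values occur, and how many times in a row
r <- rle(recoded)
r$values   # 4 3 4
r$lengths  # 2 4 2

# Group "b" requires a run of at least four 3s
has_b <- any(r$values == 3 & r$lengths >= 4)
has_b      # TRUE
```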
Loop over the rows of the numeric columns with apply (MARGIN = 1). Within each row, recode the values 1 and 2 to 1 and any value greater than or equal to 4 to 4, then run rle (run-length encoding) on the recoded values. From its 'values' and 'lengths', build logical conditions matching those in the OP's post and return the corresponding group label when a condition is met.
library(dplyr)
df1$group <- apply(df1[-1], 1, function(x) {
x <- case_when(x %in% 1:2 ~ 1, x >=4 ~ 4, TRUE ~ x)
v1 <- rle(x)
na.omit(case_when(v1$values == 1 & v1$lengths >= 6 ~ 'a',
v1$values == 3 & v1$lengths >=4 ~ 'b',
v1$values ==4 & v1$lengths >= 6 ~ 'c' )) })
df1$group
#[1] "a" "b" "c"
Or using tidyverse, with rleid() from data.table to number the runs:
library(dplyr)
library(tidyr)
library(data.table)
df1 %>%
pivot_longer(cols = -id) %>%
mutate(newvalue = case_when(value %in% 1:2 ~ 1,
value >= 4 ~ 4, TRUE ~ value)) %>%
add_count(id, grp = rleid(newvalue)) %>%
group_by(id) %>%
summarise( group = first(na.omit(case_when(newvalue == 1 & n >= 6 ~ 'a',
newvalue == 3 & n >= 4 ~'b',
newvalue == 4 & n >= 6 ~ 'c'))), .groups = 'drop') %>%
left_join(df1, .)
Output:
# A tibble: 3 x 10
# id score1 score2 score3 score4 score5 score6 score7 score8 group
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#1 101 2 2 2 2 1 2 2 1 a
#2 102 4 4 3 3 3 3 4 4 b
#3 103 4 5 5 5 5 6 5 5 c
Using base R, you could do:
pat <- c(a = "[12]{6}", b="3{4}", c="[4-9]{6}")
cbind(df, group = names(pat)[max.col(sapply(pat, grepl, do.call(paste0, df[-1])))])
id score1 score2 score3 score4 score5 score6 score7 score8 group
1 101 2 2 2 2 1 2 2 1 a
2 102 4 4 3 3 3 3 4 4 b
3 103 4 5 5 5 5 6 5 5 c
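The regex approach works because each row of scores is first pasted into one string, so "consecutive scores" become "consecutive characters". A minimal sketch of that intermediate step, assuming every score is a single digit (as in the example data) and using the illustrative names score_df and score_strings:

```r
# Data from the question
score_df <- data.frame(
  id = c(101, 102, 103),
  score1 = c(2, 4, 4), score2 = c(2, 4, 5), score3 = c(2, 3, 5),
  score4 = c(2, 3, 5), score5 = c(1, 3, 5), score6 = c(2, 3, 6),
  score7 = c(2, 4, 5), score8 = c(1, 4, 5)
)

# One string per row, built from the score columns only
score_strings <- do.call(paste0, score_df[-1])
score_strings
# "22221221" "44333344" "45555655"

# Each pattern is then a simple run test on that string
grepl("[12]{6}", score_strings)   # TRUE FALSE FALSE
grepl("3{4}", score_strings)      # FALSE TRUE FALSE
grepl("[4-9]{6}", score_strings)  # FALSE FALSE TRUE
```

If scores could have two digits, the pasted string would become ambiguous and this trick would no longer be safe.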

Filtering by conditional values in R

So, my data take the general shape of:
library(tidyverse)
id <- c(1, 1, 2, 2, 3, 3)
group <- c("A", "B", "A", "A", "B", "B")
value <- c(34, 12, 56, 78, 90, 91)
df <- tibble(id, group, value)
df
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 1 B 12
3 2 A 56
4 2 A 78
5 3 B 90
6 3 B 91
What I want to do can be described as "for each id, take the maximum value of group A. But, if A is not there, take the maximum value of group B." So my desired output would look something like:
id group value
<dbl> <chr> <dbl>
1 1 A 34
4 2 A 78
6 3 B 91
I tried to do this using the code...
desired <- df %>%
group_by(id) %>%
filter(if (exists(group == "A")) max(value) else if (exists(group == "B")) (max(value)))
...but I received an error. Help?
One option could be:
df %>%
group_by(id) %>%
arrange(group, desc(value), .by_group = TRUE) %>%
slice(which.max(group == "A"))
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 2 A 78
3 3 B 91
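The slice() call relies on how which.max() behaves on a logical vector: it returns the position of the first TRUE, and falls back to position 1 when nothing is TRUE. A small base R sketch of that behaviour (the vectors are made-up group columns, already sorted as in the answer):

```r
# A group that contains "A": the first "A" row is picked
with_a <- which.max(c("A", "B") == "A")
with_a     # 1

# A group with no "A": all comparisons are FALSE, so position 1 is returned,
# which after sorting by descending value is the largest "B" row
without_a <- which.max(c("B", "B") == "A")
without_a  # 1

# The first TRUE wins even when it is not in position 1
later_a <- which.max(c("0", "A", "A") == "A")
later_a    # 2
```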
Here is a base R option
subset(
df[order(id, group, -value), ],
ave(rep(TRUE, nrow(df)), id, FUN = function(x) seq_along(x) == 1)
)
which gives
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 2 A 78
3 3 B 91
The basic idea is:
We reorder the rows of df via df[order(id, group, -value), ]
Then we take the first value in the reordered df by id
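The two steps above can also be sketched with duplicated(), which may be easier to read than ave(). This is an illustrative variant rather than the answer's own code; it assumes the data from the question stored in a plain data frame.

```r
# Data from the question
df <- data.frame(id    = c(1, 1, 2, 2, 3, 3),
                 group = c("A", "B", "A", "A", "B", "B"),
                 value = c(34, 12, 56, 78, 90, 91))

# Step 1: within each id, put "A" before "B" and larger values first
ordered <- df[order(df$id, df$group, -df$value), ]

# Step 2: keep the first row of every id
result <- ordered[!duplicated(ordered$id), ]
result
```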
Using data.table
library(data.table)
setDT(df)[order(id, group, -value), .SD[1], id]
# id group value
#1: 1 A 34
#2: 2 A 78
#3: 3 B 91

Extracting many variables from a single column in R

I'm working on a data cleaning problem where I'm stuck. I've started receiving csv-files in the format shown below and I need to clean it before I can do any analysis. There are several such columns and there can be a couple hundred variables in each cell that need to be extracted.
Original <- structure(list(CustNum = c(0, 1),
Sales = c("[1000, 345, Zero, 56]", "[987, 879, 325, 4568]"),
Amounts = c("[10, 2, 0, 98]", "[57, 25, 52, 75]"),
Number = c("['1', '2', '3', '4']", "['4', '3', '2', '1']"),
Identifier = c("A", "B")),
row.names = c(NA, -2L),
class = c("tbl_df", "tbl", "data.frame"))
What I'm trying to do is wrangle it into this format.
Desired <- tibble(CustNum = c(0, 0, 0, 0, 1, 1, 1, 1),
Sales = c(1000, 345, "Zero", 56, 987, 879, 325, 4568),
Amounts = c(10, 2, 0, 98, 57, 25, 52, 75),
Number = c(1, 2, 3, 4, 4, 3, 2, 1),
Identifier = c("A", "A", "A", "A", "B", "B", "B", "B"))
I've tried a number of different variations of the following type but can't get anywhere.
Original$Sales %>%
str_replace("\\[", "") %>%
str_replace("\\]", "") %>%
str_replace("'", "")
It's easy to do the cleaning in Power Query in Excel, but I would like to find a way to do it in R so I don't have to use several different tools. Can anyone show me how to do this?
Try with this:
library(dplyr) # must be version >= 1.0.0
library(stringr)
Original %>%
mutate(across(everything(), str_remove_all, pattern = "\\[|\\]|\\'")) %>%
mutate(across(everything(), str_split, pattern = ",")) %>%
tidyr::unnest(everything()) %>%
mutate(across(everything(), str_trim)) %>%
mutate(across(c(CustNum, Amounts, Number), as.numeric))
# A tibble: 8 x 5
CustNum Sales Amounts Number Identifier
<dbl> <chr> <dbl> <dbl> <chr>
1 0 1000 10 1 A
2 0 345 2 2 A
3 0 Zero 0 3 A
4 0 56 98 4 A
5 1 987 57 4 B
6 1 879 25 3 B
7 1 325 52 2 B
8 1 4568 75 1 B
Basically:
Remove [ ] '
Split by ,
Unnest the lists
Trim out unnecessary spaces
Set to numeric where necessary
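The steps listed above can be traced on a single cell in base R. This is only a sketch of the idea on one value from the Number column; the variable names are illustrative.

```r
# One cell from the Number column
cell <- "['1', '2', '3', '4']"

# Remove [ ] and '
cleaned <- gsub("\\[|\\]|'", "", cell)

# Split by comma, then trim the leftover spaces
parts <- trimws(strsplit(cleaned, ",")[[1]])

# Convert to numeric
numbers <- as.numeric(parts)
numbers
```

The dplyr pipeline applies the same sequence to every column at once and then unnests the resulting lists into rows.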
You can try this approach
library(tidyverse)
library(stringr)
Original2 <- Original %>%
mutate_at(vars(Sales, Amounts, Number), ~str_replace_all(., "\\[|\\'|\\]|\\s", "")) %>%
separate_rows(c("Sales", "Amounts", "Number"), sep = ",")
# CustNum Sales Amounts Number Identifier
# <dbl> <chr> <chr> <chr> <chr>
# 1 0 1000 10 1 A
# 2 0 345 2 2 A
# 3 0 Zero 0 3 A
# 4 0 56 98 4 A
# 5 1 987 57 4 B
# 6 1 879 25 3 B
# 7 1 325 52 2 B
# 8 1 4568 75 1 B
Here we remove [, ], ' and whitespace, then use separate_rows() from the tidyr package to split each comma-separated value onto its own row, so the whole job takes two steps. Note that the split columns remain character.
I would suggest reshaping your Original data to long format first and then separating the rows with sep = ','. After that, clean the values to remove the special characters. Finally, create an id variable within each group so that the data can be pivoted back to wide, as in Desired:
library(tidyverse)
#Reshape
Original %>%
pivot_longer(cols = -c(CustNum,Identifier)) %>%
separate_rows(value,sep = ',') %>%
mutate(value=trimws(gsub("[[:punct:]]", " ", value))) %>%
group_by(name) %>% mutate(id=1:n()) %>%
pivot_wider(names_from = name,values_from=value) %>%
ungroup() %>%
select(-id)
Output:
# A tibble: 8 x 5
CustNum Identifier Sales Amounts Number
<dbl> <chr> <chr> <chr> <chr>
1 0 A 1000 10 1
2 0 A 345 2 2
3 0 A Zero 0 3
4 0 A 56 98 4
5 1 B 987 57 4
6 1 B 879 25 3
7 1 B 325 52 2
8 1 B 4568 75 1

Dplyr transformation based on string filtering and conditions

I would like to transform a messy dataset in R.
However, I am having trouble figuring out how to do so. Here is an example dataset and the result I need to achieve:
dataset <- tribble(
~ID, ~DESC,
1, "3+1Â 81Â mÂ",
2, "2+1Â 90Â mÂ",
3, "3+KK 28Â mÂ",
4, "3+1 120 m (Mezone)")
dataset
dataset_tranformed <- tribble(
~ID, ~Rooms, ~Meters, ~Mezone, ~KK,
1, 4, 81,0, 0,
2, 3, 90,0,0,
3, 3, 28,0,1,
4, 4, 120,1, 0)
dataset_tranformed
The columns first need to be separated. However, dataset %>% separate(DESC, c("size", "meters_squared", "Mezone"), sep = " ") does not work because (Mezone) is thrown away.
We can extract the components individually, evaluating the room expression (e.g. 3+1) to get the room count:
library(dplyr)
library(stringr)
library(tidyr)
library(purrr) # for map_dbl()
dataset %>%
mutate(Rooms = map_dbl(DESC, ~
str_extract(.x, "^\\d+\\+\\d*") %>%
str_replace("\\+$", "+0") %>%
rlang::parse_expr(.) %>%
eval ),
Meters = str_extract(DESC, "\\d+(?=\\D*m)"), # digits directly before the "m" unit
Mezone = +(str_detect(DESC, "Mezone")),
KK = +(str_detect(DESC, "KK"))) %>%
select(-DESC)
# A tibble: 4 x 5
# ID Rooms Meters Mezone KK
# <dbl> <dbl> <chr> <int> <int>
#1 1 4 81 0 0
#2 2 3 90 0 0
#3 3 3 28 0 1
#4 4 4 120 1 0
Another option is to use extract() and then str_detect():
dataset %>%
extract(DESC, into = c("Rooms1", "Rooms2", "Meters"),
"^(\\d+)\\+(\\d*)[^0-9]+(\\d+)", convert = TRUE, remove = FALSE) %>%
transmute(ID, Mezone = +(str_detect(DESC, "Mezone")),
KK = +(is.na(Rooms2)), Rooms = Rooms1 + replace_na(Rooms2, 0), Meters )
# A tibble: 4 x 5
# ID Mezone KK Rooms Meters
# <dbl> <int> <int> <dbl> <int>
#1 1 0 0 4 81
#2 2 0 0 3 90
#3 3 0 1 3 28
#4 4 1 0 4 120
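If evaluating parsed text feels heavy, the room count can also be computed by splitting on "+" and summing the numeric parts. This is a base R sketch of an alternative, not the approach used in the answers above; the size vector is assumed to hold the part of DESC before the first space.

```r
# The leading "rooms" part of each DESC value
size <- c("3+1", "2+1", "3+KK", "3+1")

# Split on the literal "+"
parts <- strsplit(size, "+", fixed = TRUE)

# Sum the numeric pieces; non-numeric pieces such as "KK" become NA and are dropped
rooms <- vapply(
  parts,
  function(p) sum(suppressWarnings(as.numeric(p)), na.rm = TRUE),
  numeric(1)
)
rooms

# KK flag: 1 when the part after "+" is "KK"
kk <- as.integer(grepl("KK", size, fixed = TRUE))
kk
```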

substitute value in dataframe based on conditional

I have the following data set
library(dplyr)
df<- data.frame(c("a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b"),
c(1, 1, 2, 2, 2, 3, 1, 2, 2, 2, 3, 3),
c(25, 75, 20, 40, 60, 50, 20, 10, 20, 30, 40, 60))
colnames(df)<-c("name", "year", "val")
We summarize it by grouping df by name and year and then computing the average and the number of entries in each group:
asd <- (df %>%
group_by(name,year) %>%
summarize(average = mean(val), `ave_number` = n()))
This gives the following desired output
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 50 1
4 b 1 20 1
5 b 2 20 3
6 b 3 50 2
Now I would like to replace every entry of asd$average where asd$ave_number < 2 with the value for its year from the following data frame:
replacer<- data.frame(c(1,2,3),
c(100,200,300))
colnames(replacer)<-c("year", "average")
In other words, I would like to end up with
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1 #substituted
4 b 1 100 1 #substituted
5 b 2 20 3
6 b 3 50 2
Is there a way to achieve this with dplyr? I guess I have to use the %>%-operator, something like this (not working code)
asd %>%
group_by(name, year) %>%
summarize(average = ifelse(n() < 2, #SOMETHING#, mean(val)))
Here's what I would do:
colnames(replacer) <- c("year", "average_replacer") #To avoid duplicate of variable name
asd <- left_join(asd, replacer, by = "year") %>%
mutate(average = ifelse(ave_number < 2, average_replacer, average)) %>%
select(-average_replacer)
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
Regarding the following:
I guess I have to use the %>%-operator
You don't ever have to use the pipe operator. It is there for convenience because you can string (or "pipe") functions one after another, as you would with a train of thought. It's kind of like having a flow in your code.
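To illustrate that a pipe is only a different way of writing nested calls, here is a small sketch using base R's native pipe |>. It assumes R 4.1 or later; with dplyr or magrittr loaded, %>% could be used in the same positions. The vector vals is a made-up example.

```r
vals <- c(25, 75, 20)

# Nested calls: read from the inside out
nested <- round(mean(sqrt(vals)), 2)

# Piped calls: read from left to right, same computation
piped <- vals |> sqrt() |> mean() |> round(2)

identical(nested, piped)  # TRUE
```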
You can do this easily by using a named vector of replacement values by year instead of a data frame. If you're set on a data frame, you'd be using joins. Indexing the vector with as.character(year) looks the values up by name rather than by position, so it keeps working when the years are not simply 1, 2, 3.
replacer <- setNames(c(100,200,300),c(1,2,3))
asd <- df %>%
group_by(name,year) %>%
summarize(average = mean(val),
ave_number = n()) %>%
mutate(average = if_else(ave_number < 2, unname(replacer[as.character(year)]), average))
Source: local data frame [6 x 4]
Groups: name [2]
name year average ave_number
<fctr> <dbl> <dbl> <int>
1 a 1 50 2
2 a 2 40 3
3 a 3 300 1
4 b 1 100 1
5 b 2 20 3
6 b 3 50 2
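A short sketch of why the lookup by name matters. The years 2018 to 2020 below are hypothetical and chosen only to show the difference between indexing by position and indexing by name.

```r
# Hypothetical replacement values keyed by year
replacer <- setNames(c(100, 200, 300), c(2018, 2019, 2020))
year <- c(2020, 2018)

# Indexing with numbers treats them as positions, which do not exist here
by_position <- unname(replacer[year])
by_position  # NA NA

# Indexing with character strings matches the names
by_name <- unname(replacer[as.character(year)])
by_name      # 300 100
```

With years coded as 1, 2, 3 both forms happen to agree, which is why the positional version works on the example data.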
