combining rows based on a condition in R

I am trying to remove some redundant rows from the df below. Each ID can have several types (1:5), and the yes_no variable indicates whether a value was recorded. As you can see, I would like to remove the 3rd and 5th rows, because other rows with the same ID and type have a recorded value (yes_no = y).
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "3", "4", "4", "4", "5"), yes_no = c("n", "n", "n", "y", "n", "y", "y", "n"), value = c(NA, NA, NA, "2", NA, "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 n <NA>
1 3 y 2
1 4 n <NA>
1 4 y 5
1 4 y 6
1 5 n <NA>
The desired output is as follows:
df2 <- data.frame(ID = c("1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "4", "4", "5"), yes_no = c("n", "n", "y", "y", "y", "n"), value = c(NA, NA, "2", "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 y 2
1 4 y 5
1 4 y 6
1 5 n <NA>
There are IDs other than 1 that also have types 1:5, so it looks like I have to group_by(ID). A dplyr solution would be great too.
Any help would be appreciated, thanks!

You may use an if condition to check if yes_no has any y value.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(if(any(yes_no == 'y')) yes_no == 'y' else TRUE) %>%
ungroup
# ID type yes_no value
# <chr> <chr> <chr> <chr>
#1 1 1 n NA
#2 1 2 n NA
#3 1 3 y 2
#4 1 4 y 5
#5 1 4 y 6
#6 1 5 n NA
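Since the question mentions that the real data has more than one ID, here is a minimal sketch of the same filter on a small made-up data frame. The name df_multi, the second ID "2" and its values are hypothetical and only serve to show that grouping by both ID and type keeps the groups separate.

```r
library(dplyr)

# Hypothetical data: two IDs, invented only for illustration
df_multi <- data.frame(
  ID     = c("1", "1", "1", "2", "2", "2"),
  type   = c("3", "3", "5", "3", "4", "4"),
  yes_no = c("n", "y", "n", "n", "n", "y"),
  value  = c(NA, "2", NA, NA, NA, "7"),
  stringsAsFactors = FALSE
)

res_multi <- df_multi %>%
  group_by(ID, type) %>%
  # Within each ID/type group: keep only the "y" rows if there are any,
  # otherwise keep every row of the group
  filter(if (any(yes_no == "y")) yes_no == "y" else TRUE) %>%
  ungroup()

res_multi
```

ID 2 / type 3 has no "y" row, so its "n" row is kept even though ID 1 / type 3 does have one.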

A base R option using subset + ave
subset(
df,
ave(yes_no == "y", ID, type, FUN = max) == (yes_no == "y")
)
gives
ID type yes_no value
1 1 1 n <NA>
2 1 2 n <NA>
4 1 3 y 2
6 1 4 y 5
7 1 4 y 6
8 1 5 n <NA>
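To see why the comparison inside subset() works, the intermediate vectors can be spelled out. This is only a sketch on the question's df; the names is_y, group_has_y and keep are introduced here for illustration.

```r
df <- data.frame(
  ID     = c("1", "1", "1", "1", "1", "1", "1", "1"),
  type   = c("1", "2", "3", "3", "4", "4", "4", "5"),
  yes_no = c("n", "n", "n", "y", "n", "y", "y", "n"),
  value  = c(NA, NA, NA, "2", NA, "5", "6", NA),
  stringsAsFactors = FALSE
)

# TRUE where the row itself is a "y" row
is_y <- df$yes_no == "y"

# ave() applies max() within each ID/type group and returns a vector
# aligned with the rows: 1 if the group contains any "y", else 0
group_has_y <- ave(is_y, df$ID, df$type, FUN = max)

# A row is kept when its own flag equals the group flag:
# "y" rows in groups with a "y", and "n" rows in groups without one
keep <- group_has_y == is_y

data.frame(df[c("ID", "type", "yes_no")], is_y, group_has_y, keep)
```

Rows 3 and 5 are the only ones where the row is "n" but its group contains a "y", so they are the ones dropped.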

After grouping by 'ID' and 'type', we may use an OR (|) condition in filter to keep the rows where yes_no is 'y', or every row of a group in which no element is 'y'.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(yes_no == 'y'|all(yes_no != 'y')) %>%
ungroup
Output:
# A tibble: 6 x 4
ID type yes_no value
<chr> <chr> <chr> <chr>
1 1 1 n <NA>
2 1 2 n <NA>
3 1 3 y 2
4 1 4 y 5
5 1 4 y 6
6 1 5 n <NA>

Related

Replacing NA values with the next value in a column in R

I'm trying to mutate a column in a Dataframe using the lag() function as a condition without producing NA values. Let me create an example:
df <- data.frame("Score" = as.numeric(c("20", "10", "15", "30", "15", "10")),
"Time" = c("1", "2", "1", "2", "1", "2"),
"Team" = c("A", "A", "B", "B", "C", "C"))
After that, I created a new column named Diff that calculates the difference of the Score of every Team:
df <- df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score))
My problem is that this method creates NA values, obviously:
Score Time Team Diff
20 1 A NA
10 2 A -10
15 1 B NA
30 2 B 15
15 1 C NA
10 2 C -5
My goal is to have this at the end:
Score Time Team Diff
20 1 A -10
10 2 A -10
15 1 B 15
30 2 B 15
15 1 C -5
10 2 C -5
I've tried mutating again using the case_when() function to substitute the NA for the next value, but it also didn't work:
df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score)) %>%
mutate(Diff = case_when(
NA ~ lead(Diff)
))
Anyway, how do I make the NA values be replaced by the next Diff value?
Thanks a lot!

Just use fill() after the fact:
library(tidyverse)
df <- data.frame("Score" = as.numeric(c("20", "10", "15", "30", "15", "10")),
"Time" = c("1", "2", "1", "2", "1", "2"),
"Team" = c("A", "A", "B", "B", "C", "C"))
df <- df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score)) %>%
fill(Diff, .direction = 'up')
df
# output
# Score Time Team Diff
# <dbl> <chr> <chr> <dbl>
#1 20 1 A -10
#2 10 2 A -10
#3 15 1 B 15
#4 30 2 B 15
#5 15 1 C -5
#6 10 2 C -5
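As an alternative closer to the attempt in the question, the NA can be replaced directly with the next value using coalesce() and lead(). This is a sketch, not the answerer's method; the result name res is made up here. Note that it only looks one row ahead, so a team with a single row would keep its NA.

```r
library(dplyr)

df <- data.frame(
  Score = c(20, 10, 15, 30, 15, 10),
  Time  = c("1", "2", "1", "2", "1", "2"),
  Team  = c("A", "A", "B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

res <- df %>%
  group_by(Team) %>%
  mutate(
    Diff = Score - lag(Score),
    # coalesce() keeps Diff where it is not NA and otherwise
    # falls back to the next row's Diff within the same team
    Diff = coalesce(Diff, lead(Diff))
  ) %>%
  ungroup()

res
```

The case_when() version in the question would need a logical condition on the left-hand side, for example is.na(Diff) ~ lead(Diff) together with TRUE ~ Diff.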

How to replace NA values in a column with the next column's values in R

How can I replace the NA values in a column with the values from the next column in R? This has to be done without naming the columns explicitly (no hardcoding).
The column that contained only NA values should then be removed.
library(tidyverse)
df1 <- structure(list(GID = c("1", "2", "3", "4", "5", "NG1", "MG2", "MG3", "NG4"),
ColA = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
ColB = c("2", "4", "4", "5", "5", "", "1", "1", "")),
row.names = c(NA, -9L),
class = "data.frame")
df1 %>%
mutate(across(everything(), ~str_replace(., "^$", "N")),
GID = GID %>% str_remove("N"))
#> GID ColA ColB
#> 1 1 NA 2
#> 2 2 NA 4
#> 3 3 NA 4
#> 4 4 NA 5
#> 5 5 NA 5
#> 6 G1 NA N
#> 7 MG2 NA 1
#> 8 MG3 NA 1
#> 9 G4 NA N
Expected output:
#> GID ColA
#> 1 1 2
#> 2 2 4
#> 3 3 4
#> 4 4 5
#> 5 5 5
#> 6 G1 N
#> 7 MG2 1
#> 8 MG3 1
#> 9 G4 N

I guess you already have an answer to the first part of your question; here is an alternative way using replace. To drop the columns that contain only NA, you can use select with where. rename_with then reuses the original column names, in order, for the columns that remain.
library(dplyr)
df1 %>%
mutate(across(everything(), ~replace(., . == '', 'N')),
GID = sub('N', '', GID)) %>%
select(-where(~all(is.na(.)))) %>%
rename_with(~names(df1)[seq_along(.)])
# GID ColA
#1 1 2
#2 2 4
#3 3 4
#4 4 5
#5 5 5
#6 G1 N
#7 MG2 1
#8 MG3 1
#9 G4 N
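The drop-and-rename step can also be sketched in base R without naming any column. This only covers the second part of the question (removing all-NA columns and shifting the names); the helper names all_na and out are made up for the example, and the data is a trimmed copy of df1.

```r
df1 <- data.frame(
  GID  = c("1", "2", "3", "4", "5", "NG1", "MG2", "MG3", "NG4"),
  ColA = NA,
  ColB = c("2", "4", "4", "5", "5", "", "1", "1", ""),
  stringsAsFactors = FALSE
)

# Flag every column that consists only of NA values
all_na <- vapply(df1, function(col) all(is.na(col)), logical(1))

# Drop those columns, then reuse the first names of the original
# data frame so that ColB's values end up under the name ColA
out <- df1[!all_na]
names(out) <- names(df1)[seq_along(out)]

out
```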

how to split a dataframe by specific rows in r

I have data that looks like this:
data <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))
> data
# A tibble: 8 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 A B C D
5 10 20 30 40
6 10 20 30 40
7 B C D NA
8 200 300 400 NA
The data frames were bound by rows incorrectly, and I want to split the data into three separate data frames (d1, d2 and d3) like this:
NOTE: In my real situation, d1, d2 and d3 have different numbers of rows. I set nrow(d1) = 3, nrow(d2) = 2 and nrow(d3) = 1 just to simplify the question in this example.
d1 <- data.frame(A = rep(1,3), B = rep(2,3), C = rep(3,3), D = rep(4,3))
d2 <- data.frame(A = rep(10,2), B = rep(20,2), C = rep(30,2), D = rep(40,2))
d3 <- data.frame( B = 200, C = 300, D = 400)
> d1
A B C D
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
> d2
A B C D
1 10 20 30 40
2 10 20 30 40
> d3
B C D
1 200 300 400
And then I could bind them correctly using bind_rows from dplyr
bind_rows(d1, d2, d3) %>% as_tibble()
# A tibble: 6 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 10 20 30 40
5 10 20 30 40
6 NA 200 300 400
My problem is how to get d1, d2 and d3 from data.
Any help will be highly appreciated!

Here is a tidyverse solution.
process_df takes a data frame, uses its first row as the column names, drops the columns whose name is NA, and then removes that first row.
process_df <- function(df, ...) {
df %>%
set_names(slice(., 1)) %>%
select(which(!is.na(names(.)))) %>%
slice(-1)
}
Add a header row that just contains the column names.
Use rowwise() and c_across() to get the values of all columns by row. Use this to identify which rows are header rows.
group_map will apply a function over each group and bind_rows will combine the results.
data %>%
add_row(!!!set_names(names(.)), .before = 1) %>%
rowwise() %>%
mutate(
group = all(is.na(c_across()) | c_across() %in% names(.))
) %>%
ungroup() %>%
mutate(group = cumsum(group)) %>%
group_by(group) %>%
group_map(process_df) %>%
bind_rows()
#> # A tibble: 6 x 4
#> A B C D
#> <chr> <chr> <chr> <chr>
#> 1 1 2 3 4
#> 2 1 2 3 4
#> 3 1 2 3 4
#> 4 10 20 30 40
#> 5 10 20 30 40
#> 6 NA 200 300 400
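The header detection and the cumsum() grouping used above can be sketched on their own in base R. This is an illustration of the idea only, not the answer's code: is_header, block and pieces are names introduced here, and unlike the answer no header row is prepended, so the first block carries no header row.

```r
data <- data.frame(
  A = c("1", "1", "1", "A", "10", "10", "B", "200"),
  B = c("2", "2", "2", "B", "20", "20", "C", "300"),
  C = c("3", "3", "3", "C", "30", "30", "D", "400"),
  D = c("4", "4", "4", "D", "40", "40", NA, NA),
  stringsAsFactors = FALSE
)

# A row counts as a header when every cell is either NA
# or one of the column names
is_header <- apply(data, 1, function(row) {
  all(is.na(row) | row %in% names(data))
})

# cumsum() turns the header flags into a running block id:
# 0 before the first header row, 1 from the first header on, and so on
block <- cumsum(is_header)

# One data frame per block
pieces <- split(data, block)
pieces
```

Each later piece starts with its own header row, which is what process_df then promotes to column names.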
Explanation of the usage of !!! in new_row
set_names(names(.)) creates a named vector that represents the row we want to add. However, add_row doesn't accept a named vector - it wants the values to be specified as arguments.
Here is a simplified example.
new_row <- c(speed = 1, dist = 2)
add_row doesn't accept a named vector, so this doesn't work.
cars %>% add_row(new_row, .before = 1)
# (Error)
!!! will unpack the vector as arguments to the function.
cars %>% add_row(!!!new_row, .before = 1)
# (Works)
!!! above essentially results in this:
cars %>% add_row(speed = 1, dist = 2, .before = 1)

Does this work:
data
# A tibble: 5 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 A B C D
3 10 20 30 40
4 B C D NA
5 200 300 400 NA
data <- rbind(LETTERS[1:4],data)
data
# A tibble: 6 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
3 A B C D
4 10 20 30 40
5 B C D NA
6 200 300 400 NA
split(data, rep(1:ceiling(nrow(data)/2), each = 2))
$`1`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
$`2`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 10 20 30 40
$`3`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 B C D NA
2 200 300 400 NA

Base R solution:
Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2)))
Including pushing separate data.frames to Global Environment:
list2env(setNames(Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2))),
paste0('d', seq_len(ceiling(nrow(df) / 2)))), .GlobalEnv)
Tidyverse Solution:
library(tidyverse)
df %>%
rbind(names(df), .) %>%
split(cumsum(seq_len(nrow(.)) %% 2)) %>%
Map(function(x){setNames(x[2,], x[1,])[,complete.cases(t(x))]}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Note solution adjusted to reflect edit to the question:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1]); y[,!is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
New solution including push to Global Env:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
dflist <- Map(function(x) {
y <-
setNames(t(x[, -1, drop = FALSE]), x[, 1])
y[, !is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
list2env(setNames(dflist, paste0('d', names(dflist))), .GlobalEnv)
Adjusted Tidyverse solution:
df %>%
rbind(names(.), .) %>%
t() %>%
data.frame() %>%
type.convert() %>%
split.default(cumsum(!sapply(., is.integer))) %>%
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1])
data.frame(y[,!is.na(colSums(y)), drop = FALSE])}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Data:
df <- structure(list(A = c("1", "A", "10", "B", "200"), B = c("2", "B", "20", "C", "300"), C = c("3", "C", "30", "D", "400"), D = c("4","D", "40", NA, NA)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
Updated Data:
df <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))

Replace "n/a" values with missing values in R / tidyverse

I am looking for an explanation of why this tidy way to replace "n/a" with NA doesn't work:
dta <- tibble(var1 = c("1", "2", "n/a"), var2 = c("n/a", "2", "4"))
> dta %>% mutate(across(everything(), na_if, "n/a"))
# A tibble: 3 x 2
var1 var2
<chr> <chr>
1 1 n/a
2 2 2
3 n/a 4
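No answer is quoted for this question, and the cause cannot be determined from the snippet alone. As a point of comparison, here is a base R sketch of the intended replacement that does not depend on na_if at all; it uses a plain data.frame instead of a tibble to stay self-contained.

```r
dta <- data.frame(
  var1 = c("1", "2", "n/a"),
  var2 = c("n/a", "2", "4"),
  stringsAsFactors = FALSE
)

# dta == "n/a" is a logical matrix with one cell per value;
# assigning NA through it replaces exactly the matching cells
dta[dta == "n/a"] <- NA

dta
```

If this gives the expected result while the dplyr call does not, one thing worth checking (an assumption, not something the question confirms) is whether another attached package masks na_if; writing dplyr::na_if explicitly would rule that out.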

Making explicit implicit missing values in nested levels [duplicate]

I am trying to complete my dataframe with missing levels.
Current output
id foo bar val
1 a x 7
2 a y 9
3 a z 6
4 b x 10
5 b y 4
6 b z 5
7 c y 2
Data
structure(list(id = c("1", "2", "3", "4", "5", "6", "7"), foo = c("a",
"a", "a", "b", "b", "b", "c"), bar = c("x", "y", "z", "x", "y",
"z", "y"), val = c("7", "9", "6", "10", "4", "5", "2")), .Names = c("id",
"foo", "bar", "val"), row.names = c(NA, -7L), class = "data.frame")
I would like to make explicit the missing nested levels of c with 0s for x and z. I could find a workaround with expand.grid but could not manage to obtain the desired output with tidyr.
Desired output :
id foo bar val
1 a x 7
2 a y 9
3 a z 6
4 b x 10
5 b y 4
6 b z 5
7 c x 0
8 c y 2
9 c z 0
Thanks in advance!

Given that you are looking for a tidyr solution, you should check out tidyr::complete (which does exactly what you are after):
library(tidyverse)
complete(df, foo, bar, fill = list(val = "0")) %>% select(-id)
#> # A tibble: 9 x 3
#> foo bar val
#> <chr> <chr> <chr>
#> 1 a x 7
#> 2 a y 9
#> 3 a z 6
#> 4 b x 10
#> 5 b y 4
#> 6 b z 5
#> 7 c x 0
#> 8 c y 2
#> 9 c z 0
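For comparison, the expand.grid workaround mentioned in the question can be sketched in base R as follows. It assumes val is stored as a number rather than as text (the question's data holds it as character), and the names grid and full are made up for the example.

```r
df <- data.frame(
  foo = c("a", "a", "a", "b", "b", "b", "c"),
  bar = c("x", "y", "z", "x", "y", "z", "y"),
  val = c(7, 9, 6, 10, 4, 5, 2),  # assumption: numeric instead of character
  stringsAsFactors = FALSE
)

# Every combination of the observed foo and bar levels
grid <- expand.grid(
  foo = unique(df$foo),
  bar = unique(df$bar),
  stringsAsFactors = FALSE
)

# Left join: combinations missing from df get NA in val
full <- merge(grid, df, by = c("foo", "bar"), all.x = TRUE)

# Make the implicit missing combinations explicit with 0
full$val[is.na(full$val)] <- 0

# Order by foo, then bar, to match the desired output
full <- full[order(full$foo, full$bar), ]
rownames(full) <- NULL
full
```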
