Replacing NA values with the next value in a column in R

I'm trying to mutate a column in a data frame using the lag() function without producing NA values. Here is an example:
df <- data.frame("Score" = as.numeric(c("20", "10", "15", "30", "15", "10")),
"Time" = c("1", "2", "1", "2", "1", "2"),
"Team" = c("A", "A", "B", "B", "C", "C"))
After that, I created a new column named Diff that calculates the difference of the Score of every Team:
df <- df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score))
My problem is that this method creates NA values, obviously:
Score Time Team Diff
20 1 A NA
10 2 A -10
15 1 B NA
30 2 B 15
15 1 C NA
10 2 C -5
My goal is to have this at the end:
Score Time Team Diff
20 1 A -10
10 2 A -10
15 1 B 15
30 2 B 15
15 1 C -5
10 2 C -5
I've tried mutating again using the case_when() function to substitute the NA for the next value, but it also didn't work:
df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score)) %>%
mutate(Diff = case_when(
NA ~ lead(Diff)
))
Anyway, how do I make the NA values be replaced by the next Diff value?
Thanks a lot!

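Just use fill() (from tidyr) after computing Diff. With .direction = 'up', each NA takes the next non-missing value within its group: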
library(tidyverse)
df <- data.frame("Score" = as.numeric(c("20", "10", "15", "30", "15", "10")),
"Time" = c("1", "2", "1", "2", "1", "2"),
"Team" = c("A", "A", "B", "B", "C", "C"))
df <- df %>%
group_by(Team) %>%
mutate(Diff = Score - lag(Score)) %>%
fill(Diff, .direction = 'up')
df
# output
# Score Time Team Diff
# <dbl> <chr> <chr> <dbl>
#1 20 1 A -10
#2 10 2 A -10
#3 15 1 B 15
#4 30 2 B 15
#5 15 1 C -5
#6 10 2 C -5
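The case_when() attempt in the question fails because a bare NA is not a condition that tests for missingness. As a minimal sketch (not the answerer's method), assuming each NA should simply take the value from the row directly after it in the same group, the condition can be written with is.na():

```r
library(dplyr)

df <- data.frame(Score = c(20, 10, 15, 30, 15, 10),
                 Time  = c("1", "2", "1", "2", "1", "2"),
                 Team  = c("A", "A", "B", "B", "C", "C"))

res <- df %>%
  group_by(Team) %>%
  mutate(Diff = Score - lag(Score),
         # is.na(Diff) is the test; lead(Diff) is the next value in the group
         Diff = case_when(is.na(Diff) ~ lead(Diff),
                          TRUE ~ Diff)) %>%
  ungroup()

res$Diff
```

Unlike fill(), this only looks one row ahead, so it would leave an NA in place if the following row were also NA.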

Related

How to add a common number to rows that have same value in another column?

After years of using the advice you have given to other users, here is an issue I have not been able to solve myself...
I have a dataset with thousands of rows and hundreds of columns, in which rows may share a value in one column. Here is a subset of my dataset:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("x1", "x2", "x3", "x2", "x3")
mat <- cbind(ID, Dose, Value)
What I want is to assign the same code to all rows that share a value in the "Value" column, like this:
ID <- c("A", "B", "C", "D", "E")
Dose <- c("1", "5", "3", "4", "5")
Value <- c("153254", "258634", "896411", "258634", "896411")
Code <- c("1", "2", "3", "2", "3")
mat <- cbind(ID, Dose, Value, Code)
Does anyone have an idea that could help me?
Thanks!
We may use match() here; it numbers each Value by the position of its first appearance:
library(dplyr)
mat %>%
mutate(Code = match(Value, unique(Value)))
Output:
ID Dose Value Code
1 A 1 153254 1
2 B 5 258634 2
3 C 3 896411 3
4 D 4 258634 2
5 E 5 896411 3
Data:
mat <- data.frame(ID, Dose, Value)
You should consider using a data.frame:
mat <- data.frame(ID, Dose, Value)
Using dplyr you could create the desired output:
library(dplyr)
mat %>%
group_by(Value) %>%
mutate(Code = cur_group_id()) %>%
ungroup()
This returns
# A tibble: 5 x 4
ID Dose Value Code
<chr> <chr> <chr> <int>
1 A 1 153254 1
2 B 5 258634 2
3 C 3 896411 3
4 D 4 258634 2
5 E 5 896411 3
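The two answers agree on the sample data, but they number groups differently in general: match(Value, unique(Value)) follows the order of first appearance, while group ids follow the sorted order of the grouping values. The base R sketch below uses a made-up vector (not from the question) where the two differ, with as.integer(factor()) standing in for the sorted numbering:

```r
# Hypothetical values, chosen so that appearance order != sorted order
value <- c("x2", "x1", "x2", "x3")

# Numbered by first appearance: x2 -> 1, x1 -> 2, x3 -> 3
by_appearance <- match(value, unique(value))

# Numbered by sorted level: x1 -> 1, x2 -> 2, x3 -> 3
by_sorted <- as.integer(factor(value))

by_appearance  # 1 2 1 3
by_sorted      # 2 1 2 3
```

Either numbering gives rows with the same Value the same code; the choice only matters if the code itself has to be stable or meaningful.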

calculate difference based on id and date

I have a data set with id, date and time columns, and I want to calculate the difference in time between consecutive available dates for each id. I have looked for similar problems on Stack Overflow and tried a few different approaches, but with no luck so far. Any help would be great.
data set:
> dput(mydata)
structure(list(id = c("a", "a", "b", "b", "b", "c"), date = c("2018-04-13",
"2011-11-12", "2019-05-30", "2014-09-13", "2019-06-21", "1998-01-08"
), time = c("50", "40", "30", "20", "10", "30")), class = "data.frame", row.names = c(NA,
-6L))
Desired output:
id date time time_diff
a 2018-04-13 50 10
a 2011-11-12 40 NA/0
b 2019-05-30 30 10
b 2014-09-13 20 NA/0
b 2019-06-21 10 -20
c 1998-01-08 30 NA/0
I understand that the earliest date has nothing to calculate a difference against, so it can be either NA or 0 in this case.
Here is the code that I have tried, but it gives an error:
mydata <- mydata %>%
group_by(id,date) %>%
mutate(time_diff = diff(time))
library(dplyr)
df <- structure(list(
id = c("a", "a", "b", "b", "b", "c"),
date = c("2018-04-13", "2011-11-12", "2019-05-30", "2014-09-13", "2019-06-21", "1998-01-08"),
time = c("50", "40", "30", "20", "10", "30")),
class = "data.frame", row.names = c(NA, -6L))
df %>%
group_by(id) %>%
arrange(id, date) %>%
mutate(
time = as.numeric(time),
time_diff = time - lag(time)
)
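One likely reason the attempt in the question errors is that diff() returns one value fewer than its input, so its result does not match the number of rows in a group. A minimal base R sketch of the same per-id lagged difference, padding each group with a leading NA; it assumes time is numeric and date has been converted with as.Date():

```r
mydata <- data.frame(
  id   = c("a", "a", "b", "b", "b", "c"),
  date = as.Date(c("2018-04-13", "2011-11-12", "2019-05-30",
                   "2014-09-13", "2019-06-21", "1998-01-08")),
  time = c(50, 40, 30, 20, 10, 30)
)

# Sort by id, then chronologically within id
mydata <- mydata[order(mydata$id, mydata$date), ]

# Within each id, prepend NA so the result has one value per row
mydata$time_diff <- ave(mydata$time, mydata$id,
                        FUN = function(x) c(NA_real_, diff(x)))

mydata
```

The earliest date of each id gets NA, and the last row of id b gets -20 (10 minus 30), as in the desired output.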
For each id you may subtract the time corresponding to minimum date.
library(dplyr)
mydata %>%
mutate(time = as.numeric(time),
date = as.Date(date)) %>%
group_by(id) %>%
mutate(time_diff = time - time[which.min(date)]) %>%
ungroup
# id date time time_diff
# <chr> <date> <dbl> <dbl>
#1 a 2018-04-13 50 10
#2 a 2011-11-12 40 0
#3 b 2019-05-30 30 10
#4 b 2014-09-13 20 0
#5 b 2019-06-21 10 -10
#6 c 1998-01-08 30 0
We can use data.table
library(data.table)
mydata <- type.convert(mydata, as.is = TRUE)
setDT(mydata)[, time_diff := time - time[date %in% min(date)], id]
mydata
id date time time_diff
1: a 2018-04-13 50 10
2: a 2011-11-12 40 0
3: b 2019-05-30 30 10
4: b 2014-09-13 20 0
5: b 2019-06-21 10 -10
6: c 1998-01-08 30 0

combining rows based on a condition in R

I am trying to remove some useless rows from the df below. Each ID can have types 1 to 5, and the yes_no variable indicates whether a value is recorded or not. As you can see, I would like to remove the 3rd and 5th rows, because other rows with the same ID and type have a recorded value with yes_no = y.
df <- data.frame(ID = c("1", "1", "1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "3", "4", "4", "4", "5"), yes_no = c("n", "n", "n", "y", "n", "y", "y", "n"), value = c(NA, NA, NA, "2", NA, "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 n <NA>
1 3 y 2
1 4 n <NA>
1 4 y 5
1 4 y 6
1 5 n <NA>
The desired output is as follows:
df2 <- data.frame(ID = c("1", "1", "1", "1", "1", "1"), type = c("1", "2", "3", "4", "4", "5"), yes_no = c("n", "n", "y", "y", "y", "n"), value = c(NA, NA, "2", "5", "6", NA))
ID type yes_no value
1 1 n <NA>
1 2 n <NA>
1 3 y 2
1 4 y 5
1 4 y 6
1 5 n <NA>
There are IDs other than 1 that have types 1:5, so it looks like I have to group_by(ID). A dplyr solution would be great too.
Any help would be appreciated, thanks!
You may use an if condition to check if yes_no has any y value.
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(if(any(yes_no == 'y')) yes_no == 'y' else TRUE) %>%
ungroup
# ID type yes_no value
# <chr> <chr> <chr> <chr>
#1 1 1 n NA
#2 1 2 n NA
#3 1 3 y 2
#4 1 4 y 5
#5 1 4 y 6
#6 1 5 n NA
A base R option using subset + ave
subset(
df,
ave(yes_no == "y", ID, type, FUN = max) == (yes_no == "y")
)
gives
ID type yes_no value
1 1 1 n <NA>
2 1 2 n <NA>
4 1 3 y 2
6 1 4 y 5
7 1 4 y 6
8 1 5 n <NA>
After grouping by 'ID' and 'type', we may use an OR (|) condition in filter to keep the rows where yes_no is 'y', or all rows of a group when none of its elements is 'y':
library(dplyr)
df %>%
group_by(ID, type) %>%
filter(yes_no == 'y'|all(yes_no != 'y')) %>%
ungroup
Output:
# A tibble: 6 x 4
ID type yes_no value
<chr> <chr> <chr> <chr>
1 1 1 n <NA>
2 1 2 n <NA>
3 1 3 y 2
4 1 4 y 5
5 1 4 y 6
6 1 5 n <NA>

how to split a dataframe by specific rows in r

I have data that looks like this:
data <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))
data
> data
# A tibble: 8 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 A B C D
5 10 20 30 40
6 10 20 30 40
7 B C D NA
8 200 300 400 NA
The rows were bound together incorrectly, and I want to split the data into 3 sub-datasets (d1, d2 and d3) like this:
NOTE: In my real situation, d1, d2 and d3 have different nrow(). I set nrow(d1) = 3, nrow(d2) = 2 and nrow(d3) = 1 just to simplify the question in this example.
d1 <- data.frame(A = rep(1,3), B = rep(2,3), C = rep(3,3), D = rep(4,3))
d2 <- data.frame(A = rep(10,2), B = rep(20,2), C = rep(30,2), D = rep(40,2))
d3 <- data.frame( B = 200, C = 300, D = 400)
> d1
A B C D
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
> d2
A B C D
1 10 20 30 40
2 10 20 30 40
> d3
B C D
1 200 300 400
And then I could bind them correctly using bind_rows from dplyr
bind_rows(d1, d2, d3) %>% as_tibble()
# A tibble: 6 x 4
A B C D
<dbl> <dbl> <dbl> <dbl>
1 1 2 3 4
2 1 2 3 4
3 1 2 3 4
4 10 20 30 40
5 10 20 30 40
6 NA 200 300 400
The problem is that I don't know how to get d1, d2 and d3 from data.
Any help will be highly appreciated!
Here is a tidyverse solution.
process_df takes a data frame and sets the column names and removes the first row.
process_df <- function(df, ...) {
df %>%
set_names(slice(., 1)) %>%
select(which(!is.na(names(.)))) %>%
slice(-1)
}
Add a header row that just contains the column names.
Use rowwise() and c_across() to get the values of all columns by row. Use this to identify which rows are header rows.
group_map will apply a function over each group and bind_rows will combine the results.
data %>%
add_row(!!!set_names(names(.)), .before = 1) %>%
rowwise() %>%
mutate(
group = all(is.na(c_across()) | c_across() %in% names(.))
) %>%
ungroup() %>%
mutate(group = cumsum(group)) %>%
group_by(group) %>%
group_map(process_df) %>%
bind_rows()
#> # A tibble: 6 x 4
#> A B C D
#> <chr> <chr> <chr> <chr>
#> 1 1 2 3 4
#> 2 1 2 3 4
#> 3 1 2 3 4
#> 4 10 20 30 40
#> 5 10 20 30 40
#> 6 NA 200 300 400
Explanation of the usage of !!! in add_row
set_names(names(.)) creates a named vector that represents the row we want to add. However, add_row doesn't accept a named vector - it wants the values to be specified as arguments.
Here is a simplified example.
new_row <- c(speed = 1, dist = 2)
add_row doesn't accept a named vector, so this doesn't work.
cars %>% add_row(new_row, .before = 1)
# (Error)
!!! will unpack the vector as arguments to the function.
cars %>% add_row(!!!new_row, .before = 1)
# (Works)
!!! above essentially results in this:
cars %>% add_row(speed = 1, dist = 2, .before = 1)
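The header-detection idea from this answer (flag header rows, take a cumulative sum to label the pieces, then process each piece) can also be sketched in base R. This is only an illustration under one assumption: a row counts as a header when its first cell is not a number. The names is_header and fix_piece are hypothetical helpers, not part of any package.

```r
data <- data.frame(
  A = c("1", "1", "1", "A", "10", "10", "B", "200"),
  B = c("2", "2", "2", "B", "20", "20", "C", "300"),
  C = c("3", "3", "3", "C", "30", "30", "D", "400"),
  D = c("4", "4", "4", "D", "40", "40", NA, NA),
  stringsAsFactors = FALSE
)

# Assumption: a row is a header if its first cell cannot be read as a number
is_header <- is.na(suppressWarnings(as.numeric(data$A)))

# Hypothetical helper: promote a leading header row to column names,
# drop columns whose header is NA, and convert the remaining cells to numeric
fix_piece <- function(p) {
  if (is.na(suppressWarnings(as.numeric(p$A[1])))) {
    hdr  <- unlist(p[1, ], use.names = FALSE)
    keep <- !is.na(hdr)
    p <- p[-1, keep, drop = FALSE]
    names(p) <- hdr[keep]
  }
  p[] <- lapply(p, as.numeric)
  rownames(p) <- NULL
  p
}

# cumsum(is_header) starts a new piece at every header row
pieces <- lapply(split(data, cumsum(is_header)), fix_piece)
names(pieces) <- paste0("d", seq_along(pieces))

pieces$d3
```

The resulting list can then be passed to bind_rows() as in the question. A data row with a missing or non-numeric first cell would be mistaken for a header, so the detection rule needs adjusting for real data.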
Does this work:
data
# A tibble: 5 x 4
A B C D
<chr> <chr> <chr> <chr>
1 1 2 3 4
2 A B C D
3 10 20 30 40
4 B C D NA
5 200 300 400 NA
data <- rbind(LETTERS[1:4],data)
data
# A tibble: 6 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
3 A B C D
4 10 20 30 40
5 B C D NA
6 200 300 400 NA
split(data, rep(1:ceiling(nrow(data)/2), each = 2))
$`1`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 1 2 3 4
$`2`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 A B C D
2 10 20 30 40
$`3`
# A tibble: 2 x 4
A B C D
<chr> <chr> <chr> <chr>
1 B C D NA
2 200 300 400 NA
Base R solution:
Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2)))
Including pushing separate data.frames to Global Environment:
list2env(setNames(Map(function(x){setNames(data.frame(t(x[,2, drop = FALSE])), x[,1])[,!is.na(x[,1])]},
split.default(cbind(X0 = names(df), data.frame(t(df))), c(0, seq_len(nrow(df)) %/% 2))),
paste0('d', seq_len(ceiling(nrow(df) / 2)))), .GlobalEnv)
Tidyverse Solution:
library(tidyverse)
df %>%
rbind(names(df), .) %>%
split(cumsum(seq_len(nrow(.)) %% 2)) %>%
Map(function(x){setNames(x[2,], x[1,])[,complete.cases(t(x))]}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Note solution adjusted to reflect edit to the question:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1]); y[,!is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
New solution including push to Global Env:
rdf <- type.convert(data.frame(t(rbind(names(df), df))))
dflist <- Map(function(x) {
y <-
setNames(t(x[, -1, drop = FALSE]), x[, 1])
y[, !is.na(colSums(y))]
}, split.default(rdf, cumsum(!sapply(rdf, is.integer))))
list2env(setNames(dflist, paste0('d', names(dflist))), .GlobalEnv)
Adjusted Tidyverse solution:
df %>%
rbind(names(.), .) %>%
t() %>%
data.frame() %>%
type.convert() %>%
split.default(cumsum(!sapply(., is.integer))) %>%
Map(function(x){
y <- setNames(t(x[,-1, drop = FALSE]), x[,1])
data.frame(y[,!is.na(colSums(y)), drop = FALSE])}, .) %>%
set_names(str_c('d', names(.))) %>%
list2env(., .GlobalEnv)
Data:
df <- structure(list(A = c("1", "A", "10", "B", "200"), B = c("2", "B", "20", "C", "300"), C = c("3", "C", "30", "D", "400"), D = c("4","D", "40", NA, NA)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", "data.frame"))
Updated Data:
df <- structure(list(A = c("1", "1", "1", "A", "10", "10", "B", "200"), B = c("2", "2", "2", "B", "20", "20", "C", "300"), C = c("3","3", "3", "C", "30", "30", "D", "400"), D = c("4", "4", "4", "D", "40", "40", NA, NA)), row.names = c(NA, -8L), class = c("tbl_df","tbl", "data.frame"))

how can I regroup multiple categorical variables into a new variable

I have a data.frame (df) with 2 columns (A, B):
A B
1 a TCRB
2 a TCRG
3 a TCRB
4 b TCRB
5 b TCRG
6 c TCRB
7 c TCRB
8 c TCRB
9 c TCRB
10 d TCRG
11 d TCRG
12 d TCRG
I want to create a new column "C" as below that tells me whether each unique value in "A" has both TCRB and TCRG or only one of them (0 = TCRB only, 1 = TCRG only, 2 = both):
A: a b c d
C: 2 2 0 1
Greatly appreciate any help!
Here's an approach with dplyr:
library(dplyr)
df %>%
group_by(A) %>%
dplyr::summarise(C = case_when("TCRB" %in% B & "TCRG" %in% B ~ 2,
"TCRB" %in% B ~ 0,
"TCRG" %in% B ~ 1,
TRUE ~ NA_real_))
# A tibble: 4 x 2
A C
<fct> <dbl>
1 a 2
2 b 2
3 c 0
4 d 1
An option with n_distinct
library(dplyr)
df %>%
group_by(A) %>%
summarise(C = n_distinct(B) *!all(B == 'TCRB'))
# A tibble: 4 x 2
# A C
# <chr> <int>
#1 a 2
#2 b 2
#3 c 0
#4 d 1
Data:
df <- structure(list(A = c("a", "a", "a", "b", "b", "c", "c", "c",
"c", "d", "d", "d"), B = c("TCRB", "TCRG", "TCRB", "TCRB", "TCRG",
"TCRB", "TCRB", "TCRB", "TCRB", "TCRG", "TCRG", "TCRG")),
class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))
In base R, we can use aggregate:
aggregate(B~A, df, function(x) {
if(all(c('TCRB', 'TCRG') %in% x)) 2
else if(any(x == 'TCRG')) 1
else if(any(x == 'TCRB')) 0
else NA
})
# A B
#1 a 2
#2 b 2
#3 c 0
#4 d 1
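The answers above return one row per value of A. If the code is wanted as a new column on the original rows, as the question's wording suggests, one possible base R sketch is to compute the code per group and look it up by name. The helper code_for is hypothetical and uses the same rules as the aggregate() answer:

```r
df <- data.frame(
  A = c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d", "d", "d"),
  B = c("TCRB", "TCRG", "TCRB", "TCRB", "TCRG", "TCRB", "TCRB",
        "TCRB", "TCRB", "TCRG", "TCRG", "TCRG"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: 2 = both, 1 = TCRG only, 0 = TCRB only, NA otherwise
code_for <- function(x) {
  if (all(c("TCRB", "TCRG") %in% x)) 2
  else if (all(x == "TCRG")) 1
  else if (all(x == "TCRB")) 0
  else NA_real_
}

# One code per value of A, as a named numeric vector
codes <- vapply(split(df$B, df$A), code_for, numeric(1))

# Look up each row's code by its value of A
df$C <- unname(codes[df$A])

df
```

With dplyr, replacing summarise() by mutate() in the first answer should give the same per-row column.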
