How to recode multiple columns in R efficiently?

I need to recode some data. Firstly,
imagine that the original data looks something like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <chr> <chr> <chr> <chr>
s1 414234 244575 539645 436236
s2 NA 512342 644252 835325
s3 NA NA 816747 475295
s4 NA NA NA 125429
s5 NA NA NA NA
s6 617465 844526 NA 194262
which, secondly, is transformed into
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 4 2 5 4
s2 NA 5 6 8
s3 NA NA 8 4
s4 NA NA NA 1
s5 NA NA NA NA
s6 6 8 NA 1
because I am going to recode everything according to the first digit. When, thirdly, recoded (see recoding pattern in MWE below) it should look like this
A data.frame: 6 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s5 NA NA NA NA
s6 4 5 NA 1
and, fourthly, entire rows should be removed if all columns except the first one are empty, that is
A data.frame: 5 × 5
col1 col2 col3 col4 col5
<chr> <int> <int> <int> <int>
s1 3 1 3 3
s2 NA 3 4 5
s3 NA NA 5 3
s4 NA NA NA 1
s6 4 5 NA 1
which is the final data.
The first and second steps were easy to implement, but I am struggling with the third and fourth steps since I am new to R (see MWE below). For the third step, I tried to use mutate over multiple columns but got Error in UseMethod("mutate"): no applicable method for 'mutate' applied to an object of class "c('integer', 'numeric')". The fourth step is easily done in Python (pandas dropna with thresh), but I am not sure whether there is an equivalent in R (a base R sketch of such a threshold filter follows the MWE below).
How can this be done? Also, I work with huge data, so time-efficient solutions would be highly appreciated.
library(dplyr)
df <- data.frame(
  col1 = c("s1", "s2", "s3", "s4", "s5", "s6"),
  col2 = c("414234", NA, NA, NA, NA, "617465"),
  col3 = c("244575", "512342", NA, NA, NA, "844526"),
  col4 = c("539645", "644252", "816747", NA, NA, NA),
  col5 = c("436236", "835325", "475295", "125429", NA, "194262")
)
n = ncol(df)
for (i in colnames(df[2:n])) {
  df[, i] = strtoi(substr(df[, i], 1, 1))
}
for (i in colnames(df[2:n])) {
  df[, i] %>% mutate(i=recode(i, "0": 1, "1": 1, "2": 1, "3": 2, "4": 3, "5": 3, "6": 4, "7": 5, "8": 5))
}
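For reference on the fourth step: pandas' dropna(thresh = k) keeps rows with at least k non-missing values. A minimal base R sketch of the same idea, with a hypothetical threshold k, is:
# keep rows with at least k non-NA values in the columns after the first
# (k = 1 reproduces the fourth step above)
k <- 1
df <- df[rowSums(!is.na(df[, -1])) >= k, ]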

Base R way:
# cut out just the numeric columns
df2 <- as.matrix(df[, -1])
# first digits
df2[] <- substr(df2, 1, 1)
mode(df2) <- 'numeric'
# recode
df2[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[df2+1]
# write back into the original data frame
df[, -1] <- df2
# remove rows with NAs only
df <- df[apply(df[, -1], 1, \(x) !all(is.na(x))), ]
df
#   col1 col2 col3 col4 col5
# 1 s1 3 1 3 3
# 2 s2 NA 3 4 5
# 3 s3 NA NA 5 3
# 4 s4 NA NA NA 1
# 6 s6 4 5 NA 1
As you can see, it is not necessary to do the operations column-wise as they can be performed en bloc, which will be more efficient.
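To check the efficiency claim on data of a realistic size, here is a rough benchmarking sketch (hypothetical wrapper names recode_matrix()/recode_loop(); assumes the microbenchmark package is installed):
library(microbenchmark)
# en-bloc matrix approach vs. a column-wise loop, both ending with the row filter
recode_matrix <- function(d) {
  m <- as.matrix(d[, -1])
  m[] <- substr(m, 1, 1)
  mode(m) <- 'numeric'
  m[] <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)[m + 1]
  d[, -1] <- m
  d[rowSums(!is.na(d[, -1])) > 0, ]
}
recode_loop <- function(d) {
  map <- c(1, 1, 1, 2, 3, 3, 4, 5, 5)
  for (i in 2:ncol(d)) d[[i]] <- map[strtoi(substr(d[[i]], 1, 1)) + 1]
  d[rowSums(!is.na(d[, -1])) > 0, ]
}
# simulated data: 100,000 rows, ~30% missing values, leading digits 1-8
set.seed(1)
vals <- as.character(sample(100000:899999, 4e5, replace = TRUE))
vals[runif(4e5) < 0.3] <- NA
big <- data.frame(col1 = paste0("s", 1:1e5),
                  matrix(vals, ncol = 4, dimnames = list(NULL, paste0("col", 2:5))))
microbenchmark(matrix = recode_matrix(big), loop = recode_loop(big), times = 10)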

You can do this with a combination of tidyverse packages. We generally avoid for loops in R unless we really need them; it's almost always preferable to vectorise.
library(dplyr)
library(stringr) # for str_sub
library(purrr) # for negate
mat = matrix(c( "s1", "s2", "s3", "s4", "s5", "s6",
"414234", NA, NA, NA, NA, "617465",
"244575", "512342", NA, NA, NA, "844526",
"539645", "644252", "816747", NA, NA, NA,
"436236", "835325", "475295", "125429", NA, "194262"),
nrow=6,
ncol=5
)
df <- as.data.frame(mat)
## Step 1: Extract first character of each element
df <- mutate(df, across(V2:V5, str_sub, 1, 1))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 4 2 5 4
#> 2 s2 <NA> 5 6 8
#> 3 s3 <NA> <NA> 8 4
#> 4 s4 <NA> <NA> <NA> 1
#> 5 s5 <NA> <NA> <NA> <NA>
#> 6 s6 6 8 <NA> 1
## Step 3: Recode (while the values are still character digits)
df <- mutate(df,
across(V2:V5,
recode,
`0` = "1", `1` = "1", `2` = "1", `3` = "2",
`4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5"
))
## Step 2: convert all columns to numeric
df <- mutate(df, across(V2:V5, as.numeric))
head(df)
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s5 NA NA NA NA
#> 6 s6 4 5 NA 1
## Step 4: drop rows where every value is missing
## By purrr::negate()-ing is.na, we keep only the rows where
## at least one value is not NA
df <- filter(df, if_any(V2:V5, negate(is.na)))
df
#> V1 V2 V3 V4 V5
#> 1 s1 3 1 3 3
#> 2 s2 NA 3 4 5
#> 3 s3 NA NA 5 3
#> 4 s4 NA NA NA 1
#> 5 s6 4 5 NA 1
Created on 2022-12-13 with reprex v2.0.2
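A small note on the across() calls above: passing extra arguments through across() (as in across(V2:V5, str_sub, 1, 1)) is deprecated in newer dplyr versions (1.1+); an equivalent sketch using anonymous functions would be
df <- mutate(df, across(V2:V5, \(x) str_sub(x, 1, 1)))
df <- mutate(df, across(V2:V5, \(x) recode(x,
  `0` = "1", `1` = "1", `2` = "1", `3` = "2",
  `4` = "3", `5` = "3", `6` = "4", `7` = "5", `8` = "5")))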

Here is one using some fancy math to pull out the leading digit (the pipeline needs dplyr, tidyr and purrr, e.g. via the tidyverse):
library(tidyverse)
df |>
  pivot_longer(col2:col5, values_to = "val", names_to = "col") |>
  mutate(val = map_dbl(as.integer(val),
                       ~ c(1, 1, 1, 2, 3, 3, 4, 5, 5)[.x %/% 10^trunc(log10(.x)) + 1])) |>
  filter(!is.na(val)) |>
  pivot_wider(values_from = val, names_from = col)
#> # A tibble: 5 × 5
#>   col1   col2  col3  col4  col5
#>   <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 s1        3     1     3     3
#> 2 s2       NA     3     4     5
#> 3 s3       NA    NA     5     3
#> 4 s4       NA    NA    NA     1
#> 5 s6        4     5    NA     1
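The fancy math is just integer division by a power of ten to extract the leading digit; a quick worked example:
x <- 414234
trunc(log10(x))            # 5, i.e. the number of digits minus one
x %/% 10^trunc(log10(x))   # 4, the leading digit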

Related

Join similar observations within a data.frame with R

I want to merge several observations in a data.frame, using one constantly repeated variable (an id) as the reference.
Example:
id var1 var2 var3
a 1 na na
a na 2 na
a na na 3
b 1 na
b na 2 na
b na na na
c na na 3
c na 2 na
c 1 na na
Expected result:
id var1 var2 var3
a 1 2 3
b 1 2 na
c 1 2 3
A possible solution (replacing "na" by NA with na_if):
library(tidyverse)
df %>%
na_if("na") %>%
group_by(id) %>%
summarize(across(var1:var3, ~ sort(.x)[1]))
#> # A tibble: 3 × 4
#> id var1 var2 var3
#> <chr> <chr> <chr> <chr>
#> 1 a 1 2 3
#> 2 b 1 2 <NA>
#> 3 c 1 2 3
Assumptions:
- "na" above is really R's native NA (not a string);
- in b's first row, var2 should be NA instead of an empty string "";
- perhaps, following from the above, var1:var3 should be numbers;
- either you will never have more than one non-NA value per group/column, or you only care about the first one and want the rest discarded (a quick check for this is sketched after the code below).
library(dplyr)
dat %>%
group_by(id) %>%
summarize(across(everything(), ~ na.omit(.)[1]))
# # A tibble: 3 x 4
# id var1 var2 var3
# <chr> <int> <int> <int>
# 1 a 1 2 3
# 2 b 1 2 NA
# 3 c 1 2 3
Data
dat <- structure(list(id = c("a", "a", "a", "b", "b", "b", "c", "c", "c"), var1 = c(1L, NA, NA, 1L, NA, NA, NA, NA, 1L), var2 = c(NA, 2L, NA, NA, 2L, NA, NA, 2L, NA), var3 = c(NA, NA, 3L, NA, NA, NA, 3L, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
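If you want to verify that last assumption (at most one non-NA value per group/column) before collapsing, a quick sketch using the same dat is:
dat %>%
  group_by(id) %>%
  summarize(across(everything(), ~ sum(!is.na(.))))
# any count greater than 1 means a group/column holds conflicting values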
Assuming that your data has NA, you can use the following base R option, using the data from @r2evans (thanks!):
aggregate(.~id, dat, mean, na.rm = TRUE, na.action=NULL)
Output:
id var1 var2 var3
1 a 1 2 3
2 b 1 2 NaN
3 c 1 2 3

How to replace all values in multiple columns that are not among the values in another column

I have a dataset with one variable with participant IDs and several variables with peer-nominations (in form of IDs).
I need to replace all numbers in the peer-nomination variables, that are not among the participant IDs, with NA.
Example: I have
ID PN1 PN2
1 2 5
2 3 4
4 6 2
5 2 7
I need
ID PN1 PN2
1 2 5
2 NA 4
4 NA 2
5 2 NA
Would be great if someone can help! Thank you so much in advance.
An alternative with Base R,
df[,-1][matrix(!(unlist(df[,-1]) %in% df[,1]),nrow(df))] <- NA
df
gives,
ID PN1 PN2
1 1 2 5
2 2 NA 4
3 4 NA 2
4 5 2 NA
library(tidyverse)
df %>%
mutate(across(-ID, ~if_else(. %in% ID, ., NA_real_)))
which gives:
# ID PN1 PN2
# 1 1 2 5
# 2 2 NA 4
# 3 4 NA 2
# 4 5 2 NA
Data used:
df <- data.frame(ID = c(1, 2, 4, 5),
PN1 = c(2, 3, 6, 2),
PN2 = c(5, 4, 2, 7))
Here is a base R way.
The lapply loop over all columns except the id column uses the function is.na<- to assign NA to the vector elements not found in df1[[1]], and then returns the changed vector.
df1[-1] <- lapply(df1[-1], function(x){
is.na(x) <- !x %in% df1[[1]]
x
})
df1
# ID PN1 PN2
#1 1 2 5
#2 2 NA 4
#3 4 NA 2
#4 5 2 NA
Data in dput format
df1 <-
structure(list(ID = c(1L, 2L, 4L, 5L),
PN1 = c(2L, NA, NA, 2L), PN2 = c(5L, 4L, 2L, NA)),
row.names = c(NA, -4L), class = "data.frame")
We could use mutate with case_when:
library(dplyr)
df %>%
mutate(across(starts_with("PN"), ~case_when(!(. %in% ID) ~ NA_real_,
TRUE ~ as.numeric(.))))
Output:
# A tibble: 4 x 3
ID PN1 PN2
<int> <dbl> <dbl>
1 1 2 5
2 2 NA 4
3 4 NA 2
4 5 2 NA
With data.table you can (l)apply the function fifelse() to every column
you have selected with .SD & .SDcols.
require(data.table)
cols = grep('PN', names(df)) # column indices (or names)
df[ , lapply(.SD, function(x) fifelse(!x %in% ID, NA_real_, x)),
.SDcols = cols ]
Data from #deschen:
df = data.frame(ID = c(1, 2, 4, 5),
PN1 = c(2, 3, 6, 2),
PN2 = c(5, 4, 2, 7))
setDT(df)
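If you would rather keep the ID column and update df in place, the same idea can be assigned back by reference with := (a sketch):
df[ , (cols) := lapply(.SD, function(x) fifelse(!x %in% ID, NA_real_, x)),
    .SDcols = cols]
df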

Count non-`NA` of several columns by group using summarize and across from dplyr

I want to use summarize and across from dplyr to count the number of non-NA values by my grouping variable. For example, using these data:
library(tidyverse)
d <- tibble(ID = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
Col1 = c(5, 8, 2, NA, 2, 2, NA, NA, 1),
Col2 = c(NA, 2, 1, NA, NA, NA, 1, NA, NA),
Col3 = c(1, 5, 2, 4, 1, NA, NA, NA, NA))
# A tibble: 9 x 4
ID Col1 Col2 Col3
<dbl> <dbl> <dbl> <dbl>
1 1 5 NA 1
2 1 8 2 5
3 1 2 1 2
4 2 NA NA 4
5 2 2 NA 1
6 2 2 NA NA
7 3 NA 1 NA
8 3 NA NA NA
9 3 1 NA NA
With a solution resembling:
d %>%
group_by(ID) %>%
summarize(across(matches("^Col[1-3]$"),
#function to count non-NA per column per ID
))
With the following result:
# A tibble: 3 x 4
ID Col1 Col2 Col3
<dbl> <dbl> <dbl> <dbl>
1 1 3 2 3
2 2 2 0 2
3 3 1 1 0
I hope this is what you are looking for:
library(dplyr)
d %>%
group_by(ID) %>%
summarise(across(Col1:Col3, ~ sum(!is.na(.x)), .names = "non-{.col}"))
# A tibble: 3 x 4
ID `non-Col1` `non-Col2` `non-Col3`
<dbl> <int> <int> <int>
1 1 3 2 3
2 2 2 0 2
3 3 1 1 0
Or if you would like to select columns by their shared string you can use this:
d %>%
group_by(ID) %>%
summarise(across(contains("Col"), ~ sum(!is.na(.x)), .names = "non-{.col}"))
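For completeness, the same counts can also be produced in base R with aggregate (a sketch; na.action = na.pass is needed so the NAs reach the counting function):
aggregate(cbind(Col1, Col2, Col3) ~ ID, data = d,
          FUN = function(x) sum(!is.na(x)), na.action = na.pass)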

How to replace the NA values in a column with the next column's values in R

How can I replace the NA values in a column with the values from the next column in R? This has to be done without explicitly naming the columns (no hardcoding).
Also, the column that contained only NA values should then be removed.
library(tidyverse)
df1 <- structure(list(GID = c("1", "2", "3", "4", "5", "NG1", "MG2", "MG3", "NG4"),
ColA = c(NA, NA, NA, NA, NA, NA, NA, NA, NA),
ColB = c("2", "4", "4", "5", "5", "", "1", "1", "")),
row.names = c(NA, -9L),
class = "data.frame")
df1 %>%
mutate(across(everything(), ~str_replace(., "^$", "N")),
GID = GID %>% str_remove("N"))
#> GID ColA ColB
#> 1 1 NA 2
#> 2 2 NA 4
#> 3 3 NA 4
#> 4 4 NA 5
#> 5 5 NA 5
#> 6 G1 NA N
#> 7 MG2 NA 1
#> 8 MG3 NA 1
#> 9 G4 NA N
Expected output:
#> GID ColA
#> 1 1 2
#> 2 2 4
#> 3 3 4
#> 4 4 5
#> 5 5 5
#> 6 G1 N
#> 7 MG2 1
#> 8 MG3 1
#> 9 G4 N
I guess you already have an answer to the first part of your question; here is an alternative way using replace. To drop columns that are entirely NA you can use select with where.
library(dplyr)
df1 %>%
mutate(across(.fns = ~replace(., . == '', 'N')),
GID = sub('N', '', GID)) %>%
select(-where(~all(is.na(.)))) %>%
rename_with(~names(df1)[seq_along(.)])
# GID ColA
#1 1 2
#2 2 4
#3 3 4
#4 4 5
#5 5 5
#6 G1 N
#7 MG2 1
#8 MG3 1
#9 G4 N
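Just the drop-columns-that-are-entirely-NA part can also be written in base R (a sketch):
# keep only columns that are not completely NA
df1[, colSums(is.na(df1)) < nrow(df1), drop = FALSE]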

Last observation carried forward conditional on multiple columns

I have a dataset with this structure:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, L40, K50)
# ID L40 K50
# 1 1 1 NA
# 2 1 NA NA
# 3 1 NA NA
# 4 1 NA NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA NA
# 8 3 NA NA
# 9 3 1 NA
# 10 3 NA NA
# 11 3 NA 1
When missing values occur in columns L40 and K50, I want to carry forward the last non-missing value in that column, conditional on ID being the same as the previous ID and the values in L40 and K50 in the current row being empty. I applied the following code:
library(dplyr)
library(tidyr)
df2 <- df %>% group_by(ID) %>% fill(L40:K50)
This does not achieve what I am looking for. I want the previous non-missing value to be carried forward into the next row only when the other columns (except ID) in that row are empty. This is what I want:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, 1, 1, 1, 1, NA, NA, NA, 1, 1, NA)
K50 = c(NA, NA, NA, NA, NA, 1, 1, NA, NA, NA, 1)
df3 = data.frame(ID, L40, K50)
df3
# ID L40 K50
# 1 1 1 NA
# 2 1 1 NA
# 3 1 1 NA
# 4 1 1 NA
# 5 2 1 NA
# 6 2 NA 1
# 7 2 NA 1
# 8 3 NA NA
# 9 3 1 NA
# 10 3 1 NA
# 11 3 NA 1
We can use na.locf
library(data.table)
library(zoo)
setDT(df)[, if(any(is.na(K50[-1]))) lapply(.SD, na.locf) else .SD , by = ID]
# ID L40 K50
#1: 1 1 NA
#2: 1 1 NA
#3: 1 1 NA
#4: 1 1 NA
#5: 2 1 NA
#6: 2 NA 1
#7: 3 NA 1
#8: 3 NA 1
#9: 3 NA 1
An option using dplyr would be
library(dplyr)
df %>%
mutate(ind = rowSums(is.na(.))) %>%
group_by(ID) %>%
mutate_each(funs(if(any(ind>1)) na.locf(., na.rm=FALSE) else .), L40:K50) %>%
select(-ind)
# ID L40 K50
# <dbl> <dbl> <dbl>
#1 1 1 NA
#2 1 1 NA
#3 1 1 NA
#4 1 1 NA
#5 2 1 NA
#6 2 NA 1
#7 3 NA 1
#8 3 NA 1
#9 3 NA 1
I played around with this question for a while, and with my limited knowledge of R I came up with the following work-around. I have added a date column to the original data frame for purpose of illustration:
ID = c(1,1,1,1,2,2,2,3,3,3,3)
date = c(1,2,3,4,1,2,3,1,2,3,4)
L40 = c(1, 1, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, 1, 1, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, date, L40, K50)
Here is what I did:
#gather the diagnosis columns in rows and keep only those rows where the patient has the associated diagnosis.
df1 <- df %>% gather(diagnos, dummy, L40:K50) %>% filter(dummy==1) %>% arrange(ID, date)
#concatenate across rows by ID and date to collect all diagnoses of an ID at a particular date.
df2 <- df1 %>% group_by(ID, date) %>% mutate(diag = paste(diagnos, collapse=" ")) %>% select(-diagnos, -dummy)
#convert into data tables in preparation for join
Dt1 <- data.table(df)
Dt2 <- data.table(df2)
setkey(Dt1, ID, date)
setkey(Dt2, ID, date)
#Each observation in Dt1 is matched with the observation in Dt2 with the same date or, if that particular date is not present,
#by the nearest previous date:
final <- Dt2[Dt1, roll=TRUE] %>% distinct()
This carries forward the name(s) of the diagnosis until the next observed diagnosis.
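For completeness, here is another sketch that reproduces the desired df3 from the top of this question exactly: start a new block at every row that has at least one observed value and fill() only within those blocks (assumes dplyr and tidyr):
library(dplyr)
library(tidyr)
ID = c(1,1,1,1,2,2,2,3,3,3,3)
L40 = c(1, NA, NA, NA, 1, NA, NA, NA, 1, NA, NA)
K50 = c(NA, NA, NA, NA, NA, 1, NA, NA, NA, NA, 1)
df = data.frame(ID, L40, K50)
df %>%
  group_by(ID) %>%
  # all-NA rows join the block started by the last row with an observed value
  mutate(block = cumsum(rowSums(!is.na(cbind(L40, K50))) > 0)) %>%
  group_by(ID, block) %>%
  fill(L40, K50, .direction = "down") %>%
  ungroup() %>%
  select(-block)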
