Summing row values by specific columns using mutate_at and sum function? - r

I have a data table with questionnaire data: the first column holds participant IDs, followed by the columns of each questionnaire, headed by the individual questions. For example, the data table would look like this, where A is one questionnaire and B is a different one:
ID A1 A2 A3 B1 B2
1 3 5 3 4 2
2 2 5 2 2 1
3 4 1 3 4 1
4 3 2 3 3 2
I want to code this using dplyr functions. I'm having trouble using mutate_at from dplyr to compute the summary score of each questionnaire for each ID. I want the sum for questionnaire A (from A1, A2, and A3), for B, and so on. My data table has many questionnaires in it (A, B, C, D, etc.), so my code right now looks like:
data %>%
group_by(ID) %>%
mutate_at(vars(contains("A")), funs(sum)) %>%
ungroup()
However running this always gives me an error of
Error: invalid 'type' (character) of argument
and I can't understand why. Same thing happens when I try mutate_each. How can I solve this?

One way would be the following. I can see that you want to work with the wide-format data using mutate_at, but long format makes this easier. You can use melt (reshape2) or gather (tidyr) to reshape the data into long format. Then strip the digits from the variable column so that only the questionnaire letter remains. Finally, group the data by ID and variable and take the sum.
library(reshape2)
library(dplyr)
melt(mydf, id.vars = "ID") %>%
mutate(variable = gsub(pattern = "[0-9]+", replacement = "", x = variable)) %>%
group_by(ID, variable) %>%
summarise(total = sum(value))
# ID variable total
# <int> <chr> <int>
#1 1 A 11
#2 1 B 6
#3 2 A 9
#4 2 B 3
#5 3 A 8
#6 3 B 5
#7 4 A 8
#8 4 B 5
DATA
mydf <- structure(list(ID = 1:4, A1 = c(3L, 2L, 4L, 3L), A2 = c(5L, 5L,
1L, 2L), A3 = c(3L, 2L, 3L, 3L), B1 = c(4L, 2L, 4L, 3L), B2 = c(2L,
1L, 1L, 2L)), .Names = c("ID", "A1", "A2", "A3", "B1", "B2"), class = "data.frame", row.names = c(NA,
-4L))

This is difficult because the questionnaire type and question number are not coded explicitly; they are packed into the column names, so the data are not "tidy". Jazzurro's approach is right, but here I've used the tidyr package to do this with gather and separate.
library(tidyr)
library(dplyr)
data %>%
gather(test, tot, A1:B2) %>%
separate(test, into=c("Q", "No"), sep=1) %>%
group_by(ID, Q) %>% summarise(totals=sum(tot))
This avoids having to use gsub and the like.
Also, you can add %>% spread(Q, totals) to the end of the pipeline if you want A and B in separate columns.
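If you would rather keep the wide layout, the per-questionnaire sums can also be sketched in base R. This is only a sketch under the assumption that every item column is named as a questionnaire prefix followed by digits (A1, A2, B1, ...); item_cols, prefixes, totals and result are names made up for illustration.

```r
# Sample data from the question
mydf <- data.frame(ID = 1:4,
                   A1 = c(3L, 2L, 4L, 3L), A2 = c(5L, 5L, 1L, 2L),
                   A3 = c(3L, 2L, 3L, 3L), B1 = c(4L, 2L, 4L, 3L),
                   B2 = c(2L, 1L, 1L, 2L))

# Assumption: every non-ID column is "<prefix><digits>"
item_cols <- setdiff(names(mydf), "ID")
prefixes  <- sub("[0-9]+$", "", item_cols)   # "A" "A" "A" "B" "B"

# Split the item columns by prefix and sum each group row-wise
totals <- sapply(split.default(mydf[item_cols], prefixes), rowSums)

# One row per ID, one total column per questionnaire
result <- cbind(mydf["ID"], totals)
result
```

This gives one column per questionnaire next to ID, so no reshaping back to wide format is needed afterwards.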

Related

how to check if there is no ID that is registered in more than one group in R? [duplicate]

I would like to check whether the same identifier (ID) appears in more than one group.
For example:
ID group
A1 1
A1 1
A1 1
A2 2
A2 2
A2 1
A3 3
A3 3
A3 3
Notice that ID A2 is associated with both group 1 and group 2.
My question is: how can I identify this kind of situation in my data using R code?
I'd like to filter the IDs that appear in more than one group.
Thank you!
We could use n_distinct after grouping by 'ID' to get a summarised logical 'Flag' that is TRUE when the ID belongs to exactly one group
library(dplyr)
df1 %>%
group_by(ID) %>%
summarise(Flag = n_distinct(group) == 1, .groups = 'drop')
# A tibble: 3 x 2
# ID Flag
#* <chr> <lgl>
#1 A1 TRUE
#2 A2 FALSE
#3 A3 TRUE
If we need to filter those rows,
df1 %>%
group_by(ID) %>%
filter(n_distinct(group) > 1) %>%
ungroup
data
df1 <- structure(list(ID = c("A1", "A1", "A1", "A2", "A2", "A2", "A3",
"A3", "A3"), group = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L)),
class = "data.frame", row.names = c(NA,
-9L))
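The same check can be sketched in base R, assuming the columns are called ID and group as in the example; n_groups, multi_ids and flagged are illustrative names only.

```r
# Sample data from the question
df1 <- data.frame(ID = rep(c("A1", "A2", "A3"), each = 3),
                  group = c(1L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 3L))

# Number of distinct groups each ID appears in
n_groups <- tapply(df1$group, df1$ID, function(g) length(unique(g)))

# IDs registered in more than one group
multi_ids <- names(n_groups)[n_groups > 1]

# Rows belonging to those IDs
flagged <- df1[df1$ID %in% multi_ids, ]
flagged
```

An empty multi_ids means no ID is registered in more than one group.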

Using a vector as a grep pattern

I am new to R. I am trying to search the column names with grep multiple times inside an apply loop. I use grep to select which columns are summed, based on the vector individuals
individuals <-c("ID1","ID2".....n)
bcdata_total <- sapply(individuals, function(x) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
bcdata is of arbitrary size and contains arbitrary data, but its column names contain the entries of individuals as part of the string
>head(bcdata)
ID1-4 ID1-3 ID2-5
A 3 2 1
B 2 2 3
C 4 5 5
grep(individuals[1],colnames(bcdata_clean)) returns a vector that looks like
[1] 1 2, the indices of the columns whose names contain ID1. That vector is used to select the columns to be summed in bcdata_clean. This should occur n times, once per element of individuals
However this returns the error
In grep(individuals, colnames(bcdata)) :
argument 'pattern' has length > 1 and only the first element will be used
and results in all the columns of bcdata_total being identical
Ideally individuals would increment each time the function is run like this for each iteration
apply(bcdata_clean[,grep(individuals[1,2....n], colnames(bcdata_clean))], 1, sum)
and would result in something like this
>head(bcdata_total)
ID1 ID2
A 5 1
B 4 3
C 9 5
But I'm not sure how to increment individuals. What is the best way to do this within the function?
You can use split.default to split data on similarly named columns and sum them row-wise.
sapply(split.default(df, sub('-.*', '', names(df))), rowSums, na.rm = TRUE)
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))
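Using the anonymous function's argument as the grep pattern fixed my issue. Here the argument is named individuals, so inside the function it refers to the single element being processed rather than the whole vector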
bcdata_total <- sapply(individuals, function(individuals) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
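A slightly more defensive variant of the same loop might look like the sketch below. It assumes each column name has the form "<individual>-<number>"; the argument name id and the anchored pattern are illustrative choices, not part of the original code. rowSums with drop = FALSE is used because apply fails when only one column matches and the selection collapses to a plain vector.

```r
# Sample data from the question
bcdata_clean <- data.frame(`ID1-4` = c(3L, 2L, 4L),
                           `ID1-3` = c(2L, 2L, 5L),
                           `ID2-5` = c(1L, 3L, 5L),
                           row.names = c("A", "B", "C"),
                           check.names = FALSE)
individuals <- c("ID1", "ID2")

bcdata_total <- sapply(individuals, function(id) {
  # id is one element of individuals, so grep gets a length-one pattern
  cols <- grep(paste0("^", id, "-"), colnames(bcdata_clean))
  # drop = FALSE keeps a data frame even when a single column matches
  rowSums(bcdata_clean[, cols, drop = FALSE])
})
bcdata_total
```

Anchoring the pattern with "^" and the trailing "-" also keeps ID1 from matching a column such as ID10-2, should one exist.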
An option with tidyverse
library(dplyr)
library(tidyr)
library(tibble)
df %>%
rownames_to_column('rn') %>%
pivot_longer(cols = -rn, names_to = c(".value", "grp"), names_sep="-") %>%
group_by(rn) %>%
summarise(across(starts_with('ID'), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
column_to_rownames('rn')
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))

Rename group of consecutive columns efficiently in R

I'm looking for an efficient way to rename several columns.
I have a dataframe that looks like the following.
id sdf dir fki
1 3 4 2
2 5 2 1
3 4 1 2
I want to rename columns sdf, dir, and fki.
I know I could do so like this:
df <- df %>%
rename(newname1 = sdf,
newname2 = dir,
newname3 = fki)
With the amount of columns I have, it is taking a long time to type the names of the columns I would like to replace.
Ideally, I would like to create a vector with names:
newcolumns <- c("newname1", "newname2", "newname3")
And then specify that these should replace the column names in the dataframe, starting with column sdf. Is there a way to do this?
We can use rename_at
library(dplyr)
df %>%
rename_at(vars(-id), ~ newcolumns)
-output
# id newname1 newname2 newname3
#1 1 3 4 2
#2 2 5 2 1
#3 3 4 1 2
Or with rename_with
df %>%
rename_with(~ newcolumns, -id)
Or pass a named vector and use !!! in rename
df %>%
rename(!!! setNames(names(df)[-1], newcolumns))
Or using base R
names(df)[-1] <- newcolumns
data
df <- structure(list(id = 1:3, sdf = c(3L, 5L, 4L), dir = c(4L, 2L,
1L), fki = c(2L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
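If the columns to rename are not simply "everything except id", the "starting with column sdf" part of the question can be sketched in base R by looking up the position of the first column. The names start and idx are illustrative, and the sketch assumes the columns to rename are consecutive.

```r
# Sample data from the question
df <- data.frame(id = 1:3, sdf = c(3L, 5L, 4L),
                 dir = c(4L, 2L, 1L), fki = c(2L, 1L, 2L))
newcolumns <- c("newname1", "newname2", "newname3")

# Position of the first column to rename
start <- match("sdf", names(df))

# Consecutive positions, one per new name
idx <- seq(start, length.out = length(newcolumns))

# Guard against running past the last column
stopifnot(!is.na(start), max(idx) <= ncol(df))

names(df)[idx] <- newcolumns
names(df)
```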

Summarise number of specific rows containing string variables in R (dplyr/tidyverse codes are appreciated)

I have a big dataset with a variety of variables concerning infectious complications. Some columns contain symptoms written as strings ("Dysuria", "Fever", etc.), each in its corresponding column. I would like to know the number of positive symptoms in each observation. I have tried different approaches, using rowSums within mutate_at with is.character and !is.na, hoping to do it in a single short line of code, but it did not work.
example:
symps_na %>%
mutate_if(~any(is.character(.), rowSums)) %>%
View()
Then, I wrote a code for each column separately, trying to recode string variables to 1, convert them to numeric and then sum these ones to get the number of symptoms (see the codes below).
symps_na<-
pb_table_ord %>%
select(ID, dysuria:fever)%>%
mutate(dysuria=ifelse(dysuria=="Dysuria", 1, dysuria)) %>%
mutate(frequency=ifelse(frequency=="Frequency", 1, frequency)) %>%
mutate(urgency=ifelse(urgency=="Urgency", 1, urgency)) %>%
mutate(prostatepain=ifelse(prostatepain=="Prostate pain", 1, prostatepain)) %>%
mutate(rigor=ifelse(!is.na(rigor), 1, rigor)) %>%
mutate(loinpain=ifelse(!is.na(loinpain), 1, loinpain)) %>%
mutate(fever=ifelse(!is.na(fever), 1, fever)) %>%
mutate_at(vars(dysuria:fever), as.numeric) %>%
mutate(symptoms.sum=rowSums(select(., dysuria:fever)))
but the column symptoms.sum returns NAs instead of numbers.
Oh, sorry, I have just realized that I missed na.rm=TRUE! But anyway, can anyone suggest a more elegant way to get the number of non-NA/string values for each observation in a separate column?
You can create two sets of columns: one where you check whether the value matches the column name, and another where you check for NA values. I have created sample data, shared at the end of the answer, and two vectors: cols1, the names of the columns whose value should equal the column name, and cols2, the names of the columns to check for NA values. You can change these according to the column names that you have.
library(dplyr)
cols1 <- c('b', 'c')
cols2 <- c('d')
purrr::imap_dfc(df %>% select(all_of(cols1)), `==`) %>% mutate_all(as.numeric) %>%
bind_cols(df %>% transmute_at(vars(all_of(cols2)), ~+(!is.na(.)))) %>%
mutate(symptoms.sum = rowSums(select(., b:d), na.rm = TRUE))
# A tibble: 5 x 4
# b c d symptoms.sum
# <dbl> <dbl> <int> <dbl>
#1 1 1 0 2
#2 0 1 1 2
#3 1 0 1 2
#4 NA NA 1 1
#5 1 NA 0 1
data
Tested on this data which looks like this
df <- structure(list(a = 1:5, b = structure(c(1L, 2L, 1L, NA, 1L), .Label = c("b",
"c"), class = "factor"), c = structure(c(1L, 1L, 2L, NA, NA), .Label = c("c",
"d"), class = "factor"), d = c(NA, 1, 2, 4, NA)), class = "data.frame",
row.names = c(NA, -5L))
df
# a b c d
#1 1 b c NA
#2 2 c c 1
#3 3 b d 2
#4 4 <NA> <NA> 4
#5 5 b <NA> NA
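If a symptom counts as present whenever its cell is not NA, the count can be sketched in one line of base R. The data frame symps below is made-up illustrative data, not the asker's dataset, and the sketch assumes every non-ID column is a symptom column.

```r
# Hypothetical data: each symptom column holds a string or NA
symps <- data.frame(ID = 1:3,
                    dysuria = c("Dysuria", NA, "Dysuria"),
                    fever   = c("Fever", NA, NA),
                    rigor   = c(NA, NA, "Rigor"),
                    stringsAsFactors = FALSE)

# Assumption: all columns other than ID are symptom columns
symptom_cols <- setdiff(names(symps), "ID")

# !is.na() yields a logical matrix; rowSums counts the TRUEs per row
symps$symptoms.sum <- rowSums(!is.na(symps[symptom_cols]))
symps
```

Because the logical matrix contains no NA, na.rm is not needed here.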

Looping through columns and duplicating data in R

I am trying to iterate through columns, and if a column name is a whole year, that column should be duplicated four times and renamed to quarters.
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to look for colnames with 4 digits then take the last two digits from those columns
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
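To get the exact naming asked for in the question (Q1-00 rather than 00.Q1), the same re-indexing idea can be sketched as follows. It assumes year columns are named with exactly four digits and that the other names already have the Qn-yy form; is_year, new_names and out are illustrative names.

```r
# Data from the question; check.names = FALSE keeps names like "2000"
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# TRUE for columns whose name is a four-digit year
is_year <- grepl("^[0-9]{4}$", names(df))

# Year columns expand to four quarter names, others keep their name
new_names <- unlist(lapply(names(df), function(nm) {
  if (grepl("^[0-9]{4}$", nm)) paste0("Q", 1:4, "-", substr(nm, 3, 4)) else nm
}))

# Repeat year columns four times, all other columns once
out <- df[rep(seq_along(df), ifelse(is_year, 4L, 1L))]
names(out) <- new_names
out
```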
