Rename group of consecutive columns efficiently in R

I'm looking for an efficient way to rename several columns.
I have a dataframe that looks like the following.
id sdf dir fki
1 3 4 2
2 5 2 1
3 4 1 2
I want to rename columns sdf, dir, and fki.
I know I could do so like this:
df <- df %>%
rename(newname1 = sdf,
newname2 = dir,
newname3 = fki)
With the number of columns I have, it takes a long time to type out every column name I want to replace.
Ideally, I would like to create a vector with names:
newcolumns <- c("newname1", "newname2", "newname3")
And then specify that these should replace the column names in the dataframe, starting with column sdf. Is there a way to do this?

We can use rename_at (superseded by rename_with since dplyr 1.0.0, but still available)
library(dplyr)
df %>%
rename_at(vars(-id), ~ newcolumns)
-output
# id newname1 newname2 newname3
#1 1 3 4 2
#2 2 5 2 1
#3 3 4 1 2
Or with rename_with
df %>%
rename_with(~ newcolumns, -id)
Or pass a named vector and use !!! in rename
df %>%
rename(!!! setNames(names(df)[-1], newcolumns))
Or using base R
names(df)[-1] <- newcolumns
data
df <- structure(list(id = 1:3, sdf = c(3L, 5L, 4L), dir = c(4L, 2L,
1L), fki = c(2L, 1L, 2L)), class = "data.frame", row.names = c(NA,
-3L))
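The answers above drop id by position or by name. As a minimal base R sketch of the "starting with column sdf" part of the question, the start position can be looked up with match() and the block of names replaced from there. The variable names start and idx are illustrative, not from the original answers.

```r
# Sample data from the question
df <- data.frame(id = 1:3,
                 sdf = c(3L, 5L, 4L),
                 dir = c(4L, 2L, 1L),
                 fki = c(2L, 1L, 2L))
newcolumns <- c("newname1", "newname2", "newname3")

# Position of the first column to rename
start <- match("sdf", names(df))

# Consecutive positions, one per new name
idx <- seq(start, length.out = length(newcolumns))

# Guard against a missing start column or running past the last column
stopifnot(!is.na(start), max(idx) <= ncol(df))

names(df)[idx] <- newcolumns
names(df)
# [1] "id"       "newname1" "newname2" "newname3"
```

This only depends on the name of the first column in the block, so it keeps working if more columns are added before sdf.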

Related

How to merge two data frame which has jumbled column names

I have 2 data frames, df1 and df2, with the same column names but in a different column order. How can I combine them into df3 without creating additional columns or rows?
df1
a b c
1 3 6
df2
b c a
5 6 1
expected df3
a b c
1 3 6
1 5 6
Tried below code but it did not work
df3=merge(df1, df2, by = "col.names")
We may use bind_rows, which matches columns by name automatically; if a column is missing from one of the inputs, it is filled with NA for that input's rows. The order of the columns follows the first dataset passed to bind_rows, i.e. df1
library(dplyr)
bind_rows(df1, df2)
-output
a b c
1 1 3 6
2 1 5 6
data
df1 <- structure(list(a = 1L, b = 3L, c = 6L), class = "data.frame", row.names = c(NA,
-1L))
df2 <- structure(list(b = 5L, c = 6L, a = 1L), class = "data.frame", row.names = c(NA,
-1L))
Rearrange the columns of one data frame to match the other, so both have the same order of column names, and then use rbind.
rbind(df1, df2[names(df1)])
# a b c
#1 1 3 6
#2 1 5 6
In this case, using rbind(df1, df2) should work too.
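As a small sketch of the reordering approach, the version below first checks that both data frames really share the same set of column names before stacking them. The check is an added assumption, not part of the original answers.

```r
df1 <- data.frame(a = 1L, b = 3L, c = 6L)
df2 <- data.frame(b = 5L, c = 6L, a = 1L)

# Stop early if the two data frames do not have the same columns
stopifnot(setequal(names(df1), names(df2)))

# Reorder df2 to df1's column order, then stack the rows
df3 <- rbind(df1, df2[names(df1)])
df3
#   a b c
# 1 1 3 6
# 2 1 5 6
```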

Using a vector as a grep pattern

I am new to R. I am trying to search the column names with grep repeatedly inside an apply loop. I use grep to select which columns are summed, based on the vector individuals
individuals <-c("ID1","ID2".....n)
bcdata_total <- sapply(individuals, function(x) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
bcdata is of random size and contains random data but contains columns that have individuals in part of the string
>head(bcdata)
ID1-4 ID1-3 ID2-5
A 3 2 1
B 2 2 3
C 4 5 5
grep(individuals[1],colnames(bcdata_clean)) returns a vector that looks like
[1] 1 2, the indices of the columns whose names contain ID1. That vector is used to select the columns to be summed in bcdata_clean. This should happen n times, once for each element of individuals
However this returns the error
In grep(individuals, colnames(bcdata)) :
argument 'pattern' has length > 1 and only the first element will be used
And results in all the columns of bcdata being identical
Ideally individuals would increment each time the function is run like this for each iteration
apply(bcdata_clean[,grep(individuals[1,2....n], colnames(bcdata_clean))], 1, sum)
and would result in something like this
>head(bcdata_total)
ID1 ID2
A 5 1
B 4 3
C 9 5
But I'm not sure how to increment individuals. What is the best way to do this within the function?
You can use split.default to split data on similarly named columns and sum them row-wise.
sapply(split.default(df, sub('-.*', '', names(df))), rowSums, na.rm = TRUE)
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))
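Naming the function argument individuals instead of x fixed my issue, because grep then receives one element of the vector per iteration rather than the whole vector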
bcdata_total <- sapply(individuals, function(individuals) {
apply(bcdata_clean[,grep(individuals, colnames(bcdata_clean))], 1, sum)
})
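For reference, here is a self-contained sketch of the same per-individual loop on the sample data. Two things differ from the code above and are assumptions on my part: the pattern is anchored as "^ID1-" so that ID1 would not also match a column such as ID10-2, and drop = FALSE is used so that an individual with a single matching column still yields a data frame for rowSums.

```r
# Sample data from the question; check.names = FALSE keeps the hyphens
bcdata_clean <- data.frame(`ID1-4` = c(3, 2, 4),
                           `ID1-3` = c(2, 2, 5),
                           `ID2-5` = c(1, 3, 5),
                           row.names = c("A", "B", "C"),
                           check.names = FALSE)
individuals <- c("ID1", "ID2")

bcdata_total <- sapply(individuals, function(id) {
  # One pattern per iteration: columns named "<id>-<something>"
  cols <- grep(paste0("^", id, "-"), colnames(bcdata_clean))
  # drop = FALSE keeps a single column as a data frame
  rowSums(bcdata_clean[, cols, drop = FALSE])
})
bcdata_total
#   ID1 ID2
# A   5   1
# B   4   3
# C   9   5
```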
An option with tidyverse
library(dplyr)
library(tidyr)
library(tibble)
df %>%
rownames_to_column('rn') %>%
pivot_longer(cols = -rn, names_to = c(".value", "grp"), names_sep="-") %>%
group_by(rn) %>%
summarise(across(starts_with('ID'), ~ sum(.x, na.rm = TRUE)), .groups = 'drop') %>%
column_to_rownames('rn')
# ID1 ID2
#A 5 1
#B 4 3
#C 9 5
data
df <- structure(list(`ID1-4` = c(3L, 2L, 4L), `ID1-3` = c(2L, 2L, 5L
), `ID2-5` = c(1L, 3L, 5L)), class = "data.frame", row.names = c("A", "B", "C"))

R - how to subtract with rowsum

my dataframe looks like this.
df <- read.table(text="
column1 column2 column3
1 3 2 1
1 3 2 1
", header=TRUE)
I need to subtract the last 2 columns from the first. To add those columns up I'd use rowSums(summary[,1:3]), but I don't know how to subtract them. Note that I can't just write my code like this, since I don't know the column names.
result <- df %>%
mutate(result = rowSums(column1, - column2, - column3))
We can subset the data to remove the first column (.[-1]), get the rowSums and subtract from 'column1'
library(tidyverse)
df %>%
mutate(result = column1 - rowSums(.[-1]))
# column1 column2 column3 result
#1 3 2 1 0
#2 3 2 1 0
If there are more columns and want to select the last two columns
df %>%
mutate(result = column1 - rowSums(.[tail(names(.), 2)]))
If we have only the index of the columns that are involved in the operation
df %>%
mutate(result = .[[1]] - rowSums(.[c(2, 3)]))
data
df <- structure(list(column1 = c(3L, 3L), column2 = c(2L, 2L), column3 = c(1L,
1L)), class = "data.frame", row.names = c(NA, -2L))
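Since the column names are not known in advance, the same calculation can also be sketched in base R using positions only. This is just an illustration of the idea above, with no packages assumed.

```r
df <- data.frame(column1 = c(3L, 3L),
                 column2 = c(2L, 2L),
                 column3 = c(1L, 1L))

# First column minus the row-wise sum of every other column,
# selected by position so no column names are needed
df$result <- df[[1]] - rowSums(df[-1])
df
#   column1 column2 column3 result
# 1       3       2       1      0
# 2       3       2       1      0
```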

Looping through columns and duplicating data in R

I am trying to iterate through columns, and if the column is a whole year, it should be duplicated four times, and renamed to quarters
So this
2000 Q1-01 Q2-01 Q3-01
1 2 3 3
Should become this:
Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
1 1 1 1 2 3 3
Any ideas?
We can use stringr::str_detect to look for colnames with 4 digits then take the last two digits from those columns
library(dplyr)
library(tidyr)
library(stringr)
df %>% gather(key,value) %>% group_by(key) %>%
mutate(key_new = ifelse(str_detect(key,'\\d{4}'),paste0('Q',1:4,'-',str_extract(key,'\\d{2}$'),collapse = ','),key)) %>%
ungroup() %>% select(-key) %>%
separate_rows(key_new,sep = ',') %>% spread(key_new,value)
PS: I hope you don't have a large dataset
Since you want repeated columns, you can just re-index your data frame and then update the column names
df <- structure(list(`2000` = 1L, Q1.01 = 2L, Q2.01 = 3L, Q3.01 = 3L,
`2002` = 1L, Q1.03 = 2L, Q2.03 = 3L, Q3.03 = 3L), row.names = c(NA,
-1L), class = "data.frame")
#> df
#2000 Q1.01 Q2.01 Q3.01 2002 Q1.03 Q2.03 Q3.03
#1 1 2 3 3 1 2 3 3
# Get indices of columns that consist of 4 numbers
col.ids <- grep('^[0-9]{4}$', names(df))
# For each of those, create new names, and for the rest preserve the old names
new.names <- lapply(seq_along(df), function(i) {
if (i %in% col.ids)
return(paste(substr(names(df)[i], 3, 4), c('Q1', 'Q2', 'Q3', 'Q4'), sep = '.'))
return(names(df)[i])
})
# Now repeat each of those columns 4 times
df <- df[rep(seq_along(df), ifelse(seq_along(df) %in% col.ids, 4, 1))]
# ...and finally set the column names to the desired new names
names(df) <- unlist(new.names)
#> df
#00.Q1 00.Q2 00.Q3 00.Q4 Q1.01 Q2.01 Q3.01 02.Q1 02.Q2 02.Q3 02.Q4 Q1.03 Q2.03 Q3.03
#1 1 1 1 1 2 3 3 1 1 1 1 2 3 3
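The re-indexing approach above produces names such as 00.Q1. A hedged variant that yields the Q1-00 style from the question might look like the sketch below. It assumes the data frame is built with check.names = FALSE so the hyphenated names survive, and the helper names is_year and new_names are illustrative.

```r
df <- data.frame(`2000` = 1L, `Q1-01` = 2L, `Q2-01` = 3L, `Q3-01` = 3L,
                 check.names = FALSE)

# Columns whose name is exactly four digits are treated as whole years
is_year <- grepl("^[0-9]{4}$", names(df))

# Year columns expand to Q1..Q4 plus the two-digit year; others keep their name
new_names <- unlist(lapply(names(df), function(nm) {
  if (grepl("^[0-9]{4}$", nm)) paste0("Q", 1:4, "-", substr(nm, 3, 4)) else nm
}))

# Repeat year columns four times, all other columns once
out <- df[rep(seq_along(df), ifelse(is_year, 4L, 1L))]
names(out) <- new_names
out
#   Q1-00 Q2-00 Q3-00 Q4-00 Q1-01 Q2-01 Q3-01
# 1     1     1     1     1     2     3     3
```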

Summing row values by specific columns using mutate_at and sum function?

I have a data table with questionnaire data, so the first column is participant IDs followed by columns of each questionnaire headed with the separate questions. for example, the data table would look like this, where A is one questionnaire and B is a different one:
ID A1 A2 A3 B1 B2
1 3 5 3 4 2
2 2 5 2 2 1
3 4 1 3 4 1
4 3 2 3 3 2
I want to code this using dplyr functions. I'm having trouble using mutate_at from dplyr to find the summary scores of each questionnaire, for each ID. I want to find the sum for questionnaire A (from A1, A2, and A3), and for B...and so on. But my data table has many questionnaires in it (A, B, C, D.....etc) so my code right now looks like:
data %>%
group_by(ID) %>%
mutate_at(vars(contains("A")), funs(sum)) %>%
ungroup()
However running this always gives me an error of
Error: invalid 'type' (character) of argument
and I can't understand why. Same thing happens when I try mutate_each. How can I solve this?
One way is the following. I can see why you want to work with the wide-format data using mutate_at, but long format is easier here. You can use melt (reshape2) or gather (tidyr) to reshape the data to long format. Then strip the digits from the variable column so that only the questionnaire letter remains. Finally, group the data by ID and variable and take the sum.
library(reshape2)
library(dplyr)
melt(mydf, id.var = "ID") %>%
mutate(variable = gsub(pattern = "[0-9]+", replacement = "", x = variable)) %>%
group_by(ID, variable) %>%
summarise(total = sum(value))
# ID variable total
# <int> <chr> <int>
#1 1 A 11
#2 1 B 6
#3 2 A 9
#4 2 B 3
#5 3 A 8
#6 3 B 5
#7 4 A 8
#8 4 B 5
DATA
mydf <- structure(list(ID = 1:4, A1 = c(3L, 2L, 4L, 3L), A2 = c(5L, 5L,
1L, 2L), A3 = c(3L, 2L, 3L, 3L), B1 = c(4L, 2L, 4L, 3L), B2 = c(2L,
1L, 1L, 2L)), .Names = c("ID", "A1", "A2", "A3", "B1", "B2"), class = "data.frame", row.names = c(NA,
-4L))
The reason it's difficult to do is that you haven't explicitly coded the questionnaire type and number and the data are therefore not "tidy". Jazzurro's approach is right but here I've used the tidyr package to do this with gather and separate.
library(tidyr)
library(dplyr)
data %>%
gather(test, tot, A1:B2) %>%
separate(test, into=c("Q", "No"), sep=1) %>%
group_by(ID, Q) %>% summarise(totals=sum(tot))
This avoids having to use gsub and the like.
Also, you can add %>% spread(Q, totals) to the end of the pipeline if you want A and B in separate columns.
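If the wide layout has to be kept, a base R sketch along the lines of the split.default idea is another option. It assumes every questionnaire column is named as a letter prefix followed by digits, as in the sample data, and the names items, prefix, and totals are illustrative.

```r
mydf <- data.frame(ID = 1:4,
                   A1 = c(3L, 2L, 4L, 3L),
                   A2 = c(5L, 5L, 1L, 2L),
                   A3 = c(3L, 2L, 3L, 3L),
                   B1 = c(4L, 2L, 4L, 3L),
                   B2 = c(2L, 1L, 1L, 2L))

# Everything except the ID column
items <- mydf[-1]

# Questionnaire label: the column name with its trailing digits removed
prefix <- sub("[0-9]+$", "", names(items))

# Split the columns by questionnaire and sum each group row-wise
totals <- sapply(split.default(items, prefix), rowSums)

result <- cbind(mydf["ID"], totals)
result
#   ID  A B
# 1  1 11 6
# 2  2  9 3
# 3  3  8 5
# 4  4  8 5
```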
