To rename a specific variable I can do for instance
names(df1)[which(names(df1) == "C")] <- "X"
> df1
A B X
1 1 2 3
I wonder if this is also possible with setNames(), but without repeating the names I don't want to rename as in
df1 <- setNames(df1, c("A", "B", "X"))
I've tried setNames(df1, c(rep(NA, 2), "X")) and setNames(df1[3], "X") but this won't work. The advantage I see in setNames() is that I can set names while doing other stuff in one step.
Data
df1 <- setNames(data.frame(matrix(1:3, 1)), LETTERS[1:3])
> df1
A B C
1 1 2 3
You can use replace,
setNames(df1, replace(names(df1), names(df1) == 'B', 'X'))
# A X C
#1 1 2 3
setNames(df1, replace(names(df1), names(df1) == 'A', 'X'))
# X B C
#1 1 2 3
setNames(df1, replace(names(df1), names(df1) == 'C', 'X'))
# A B X
#1 1 2 3
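The same replace() idea extends to several columns at once if the positions are looked up with match(). A minimal sketch, assuming every old name is actually present in the data frame; old, new and res are illustrative names, not part of the answer above:

```r
df1 <- setNames(data.frame(matrix(1:3, 1)), LETTERS[1:3])

old <- c("A", "C")   # columns to rename (illustrative)
new <- c("X", "Z")   # their replacements, in the same order

# match() gives the positions of the old names; replace() swaps in the new ones
res <- setNames(df1, replace(names(df1), match(old, names(df1)), new))
names(res)
```

Because setNames() returns the renamed copy, the call can still sit inside a larger expression, which is the one-step behaviour the question asks for.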
You can do it using setnames from library(data.table)
library(data.table)
setnames(DF, "oldName", "newName")
dplyr also has a special function for this:
dplyr::rename(df1, X = C)
# A B X
# 1 1 2 3
Because the names of a data frame are a character vector, you can use ifelse() to pick out the elements to change.
setNames(df1, ifelse(names(df1) == "A", "X", names(df1)))
X B C
1 1 2 3
setNames(df1, ifelse(names(df1) == "B", "X", names(df1)))
A X C
1 1 2 3
setNames(df1, ifelse(names(df1) == "C", "X", names(df1)))
A B X
1 1 2 3
The best I can do is this one, which doesn't seem any easier than the other methods. You could also write a function that fits your needs.
df2 <- setNames(df1, c(colnames(df1)[1:2],"test"))
> df2
A B test
1 1 2 3
Edit: to change other names (for example column B), we can define a custom function:
dfrename <- function(mydf, mycolumns=1:ncol(mydf), mynewnames=letters[seq_along(mycolumns)]) {
if(!is.numeric(mycolumns)) {
toreplace <- colnames(mydf) %in% mycolumns
} else {
toreplace <- 1:ncol(mydf) %in% mycolumns
}
mycols <- colnames(mydf)
mycols[toreplace] <- mynewnames
res <- setNames(mydf, mycols)
return(res)
}
You can either use the indexes of the columns to replace or their names.
> dfrename(df1, 2, "test")
A test C
1 1 2 3
Another base R solution if you're ok to repeat the name of the old variable:
res <- transform(iris, a = Species, Species = NULL)
names(res)
# [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "a"
Regarding efficiency I'm not sure if the data is copied or not.
How do I split strings here? For some reason str_split_n is in stringr's repo, but using it is not possible. There is no help file, either.
So I'd like to use this:
x <- c("a", "b[12]", "c[34]", "d")
tibble(x) |>
dplyr::mutate(
y = str_split_n(x, "\\[", 1)
)
To get this:
# A tibble: 4 x 2
x y
<chr> <chr>
1 a a
2 b[12] b
3 c[34] c
4 d d
We could just use sub here, and remove the trailing bracketed item:
df <- data.frame(x=c("a", "b[12]", "c[34]", "d"))
df$y <- sub("\\[.*?\\]$", "", df$x)
df
x y
1 a a
2 b[12] b
3 c[34] c
4 d d
I think what you were looking for is -
tibble(x) |>
dplyr::mutate(
y = stringr::str_split_fixed(x, "\\[", 2)[, 1]
)
# x y
# <chr> <chr>
#1 a a
#2 b[12] b
#3 c[34] c
#4 d d
Using trimws from base R
df$x <- trimws(df$x, whitespace = "\\[.*")
df$x
[1] "a" "b" "c" "d"
data
df <- data.frame(x=c("a", "b[12]", "c[34]", "d"))
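For completeness, the "first piece before the bracket" result can also be sketched in base R with strsplit(). This is only an illustration on the question's vector; pieces and y are names chosen here:

```r
x <- c("a", "b[12]", "c[34]", "d")

# split on a literal "[" (fixed = TRUE avoids regex escaping)
pieces <- strsplit(x, "[", fixed = TRUE)

# keep the first piece of every element
y <- vapply(pieces, function(p) p[1], character(1))
y
```

Elements without a bracket come back unchanged, since their only piece is the whole string.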
Consider the following data frame:
df <- setNames(data.frame(1:5,rep(1,5)), c("id", "value"))
I want to change the names for multiple cells in the column "id". Let's say I want to change the following:
df$id[df$id %in% 2:3] <- 1
df$id[df$id == 4] <- 3
However, instead of using the code above, I want to create a function that makes the transformation smoother, because I have a lot of data frames in which I need to change the names for the cells. I want to create a function:
mapping <- function(...) {
...
}
where I can afterwards create a simple mapping function for my df, in which I only have to specify the "old" and the "new" names for the cells. Something like this:
df_mapping <- function(...) {
2.1
3.1
4.3
}
And then I can apply the function to my data and specify which column it should act on, and it will work in the same way as the code with gsub:
df <- df_mapping(df,id)
Is it possible to create that mapping function?
If we need a function, it can take the data, the column name, the values to replace, and the replacement value as arguments. It builds the logical condition, subsets the column, assigns replacer_val, and returns the dataset after the assignment.
f1 <- function(dat, colnm, values_to_replace, replacer_val) {
dat[[colnm]][dat[[colnm]] %in% values_to_replace] <- replacer_val
return(dat)
}
f1(df, "id", c(2, 3), 1)
Output:
# id value
#1 1 1
#2 1 1
#3 1 1
#4 4 1
#5 5 1
To replace values with corresponding sets of replacers,
f2 <- function(dat, colnm, values_to_replace, replacer_vals) {
nm1 <- setNames(replacer_vals, values_to_replace)
v1 <- nm1[as.character(dat[[colnm]])]
i1 <- !is.na(v1)
dat[[colnm]][i1] <- v1[i1]
return(dat)
}
f2(df, "id", c(2, 3), c(5, 6))
# id value
#1 1 1
#2 5 1
#3 6 1
#4 4 1
#5 5 1
Or another option is to create a key/value dataset and use merge or join
library(data.table)
f3 <- function(dat, colnm, values_to_replace, replacer_vals) {
keydat <- data.frame(key = values_to_replace, val = replacer_vals)
names(keydat)[1] <- colnm
dt <- as.data.table(dat)
dt[keydat, (colnm) := val, on = colnm][]
return(dt)
}
f3(df, "id", c(2, 5), c(3, 6))
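Maybe a mapping like the one below could help. Note that the column name id is hard-coded inside transform(), so the id argument itself is never used: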
mapping <- function(df, id, to_replace, obj_value) {
transform(df, id = replace(id, id %in% to_replace, obj_value))
}
e.g.,
> mapping(df, id, c(2, 3), 1)
id value
1 1 1
2 1 1
3 1 1
4 4 1
5 5 1
You can use dplyr's recode function
mapping <- function(data, col, old, new) {
data[[col]] <- dplyr::recode(data[[col]], !!!setNames(new, old))
data
}
mapping(df, "id", c(2, 3), c(7L, 8L))
# id value
#1 1 1
#2 7 1
#3 8 1
#4 4 1
#5 5 1
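The question's pseudo-mapping (2 to 1, 3 to 1, 4 to 3) could also be kept as a named vector and applied with a small helper, so that each data frame only needs its own lookup. A minimal base R sketch; id_map and apply_mapping are hypothetical names introduced here, and the mapping is assumed to be applied in a single pass rather than step by step:

```r
df <- setNames(data.frame(1:5, rep(1, 5)), c("id", "value"))

# old value (as name) -> new value; illustrative lookup
id_map <- c("2" = 1, "3" = 1, "4" = 3)

apply_mapping <- function(dat, colnm, map) {
  # position of each cell's value in the lookup, NA when it is not listed
  hit <- match(as.character(dat[[colnm]]), names(map))
  keep <- !is.na(hit)
  # overwrite only the cells that have an entry in the lookup
  dat[[colnm]][keep] <- unname(map[hit[keep]])
  dat
}

res <- apply_mapping(df, "id", id_map)
res$id
```

Values that are not listed in the lookup (here 1 and 5) are left untouched.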
Given a dataframe such as,
num <- c(5,10,15,20,25)
letter <- c("A", "B", "A", "C", "B")
thelist <- data.frame(num, letter)
I need to find the indices where the letters are the same.
Output:
A 1 3
B 2 5
C 4
Then, take these indices and find the mean of those indices in num.
Output:
A 10
B 17.5
C 20
I cannot use loops or if statements; I am looking at using something like apply, which, etc.
As the objective is to find the mean for each similar 'letter', it is better to group by 'letter' and get the mean of 'num'
library(dplyr)
thelist %>%
group_by(letter) %>%
summarise(num = mean(num))
# A tibble: 3 x 2
# letter num
# <fct> <dbl>
#1 A 10
#2 B 17.5
#3 C 20
or in base R
aggregate(num ~ letter, thelist, mean)
To find the index of the same 'letter', we can split the sequence of rows by 'letter'
split(seq_len(nrow(thelist)), thelist$letter)
#$A
#[1] 1 3
#$B
#[1] 2 5
#$C
#[1] 4
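The two steps in the question (indices first, then the mean of num at those indices) can be chained without an explicit loop. A small sketch in base R, reusing the split() call above; idx and means are names chosen here:

```r
num <- c(5, 10, 15, 20, 25)
letter <- c("A", "B", "A", "C", "B")
thelist <- data.frame(num, letter)

# step 1: row indices for each letter
idx <- split(seq_len(nrow(thelist)), thelist$letter)

# step 2: mean of num over each set of indices
means <- vapply(idx, function(i) mean(thelist$num[i]), numeric(1))
means
```

If the indices themselves are not needed, tapply(thelist$num, thelist$letter, mean) gives the same means in one call.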
Another option using data.table:
library(data.table)
setDT(thelist)[, .(ind = paste(.I, collapse = " "),
mean_num = mean(num)
),
by = letter]
Output:
letter ind mean_num
1: A 1 3 10.0
2: B 2 5 17.5
3: C 4 20.0
I'd use dplyr/tidyverse for this:
# setup
library(tidyverse)
# group by letters then get mean of num
thelist %>%
group_by(letter) %>%
summarise(mean_num = mean(num))
You could also use base R with a for loop:
lets <- unique(thelist$letter)
x <- rep(NA, length(lets))
for(i in seq_along(lets)){
x[i] <- mean(thelist$num[thelist$letter %in% lets[i]])
}
x
I'm looking for the (1) name and (2) a (cleaner) method in R (base and data.table preferred) of the following.
Input
> d1
id x y
1 1 1 NA
2 2 NA 3
3 3 4 NA
> d2
id x y z
1 4 NA 30 a
2 3 20 2 b
3 2 14 NA c
4 1 15 97 d
(note that the actual data.frames have hundreds of columns)
Expected output:
> d1
id x y z
1 1 1 97 d
2 2 14 3 c
3 3 4 2 b
Data and current solution:
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])
for (col in setdiff(names(d1), "id")) {
# If missing look in d2
missing <- is.na(d1[[col]])
d1[missing, col] <- d2[match(d1$id[missing], d2$id), col]
}
for (col in setdiff(names(d2), names(d1))) {
# If column missing then add
d1[[col]] <- d2[match(d1$id, d2$id), col]
}
PS:
This question has likely been asked before, but I lack the vocabulary to search for it.
Assuming you are working with 2 data.frames, here is a base solution
#expand d1 to have the same columns as d2
d <- merge(d1, d2[, c("id", setdiff(names(d2), names(d1))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#make sure that d2 also has the same columns as d1
d2 <- merge(d2, d1[, c("id", setdiff(names(d1), names(d2))), drop=FALSE],
by="id", all.x=TRUE, all.y=FALSE)
#align rows and columns to match those in d1
mask <- d2[match(d1$id, d2$id), names(d)]
#replace NAs with the corresponding values from mask
replace(d, is.na(d), mask[is.na(d)])
If you don't mind, we can rewrite your question as a general matrix-coalesce question (i.e. any number of matrices, columns, rows), which does not seem to have been asked before.
edit:
Another base R solution is a hack of coalesce1a from How to implement coalesce efficiently in R
coalesce.mat <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
rn <- match(ans$id, elt$id)
ans[is.na(ans)] <- elt[rn, names(ans)][is.na(ans)]
}
ans
}
allcols <- Reduce(union, lapply(list(d1, d2), names))
do.call(coalesce.mat,
lapply(list(d1, d2), function(x) {
x[, setdiff(allcols, names(x))] <- NA
x
}))
edit:
a possible data.table solution using coalesce1a from How to implement coalesce efficiently in R by Martin Morgan.
coalesce1a <- function(...) {
ans <- ..1
for (elt in list(...)[-1]) {
i <- which(is.na(ans))
ans[i] <- elt[i]
}
ans
}
setDT(d1)
setDT(d2)
#melt into long formats and full outer join the 2
mdt <- merge(melt(d1, id.vars="id"), melt(d2, id.vars="id"), by=c("id","variable"), all=TRUE)
#perform a coalesce on vectors
mdt[, value := do.call(coalesce1a, .SD), .SDcols=grep("value", names(mdt), value=TRUE)]
#pivot into original format and subset to those in d1
dcast(mdt, id ~ variable, value.var="value")[
d1, .SD, on=.(id)]
Here is a possibility using dplyr::left_join:
library(dplyr)
left_join(d1, d2, by = "id") %>%
mutate(
x = ifelse(!is.na(x.x), x.x, x.y),
y = ifelse(!is.na(y.x), y.x, y.y)) %>%
select(id, x, y, z)
# id x y z
#1 1 1 97 d
#2 2 14 3 c
#3 3 4 2 b
We can use data.table with coalesce from dplyr. Create vectors of the column names that are common to both datasets ('nm1') and of those found only in the second ('nm2'). Convert the first dataset to a 'data.table' (setDT(d1)), join on the 'id' column, and assign (:=) the coalesced columns of the first and second datasets (the second dataset's common columns carry the i. prefix) to update the values in the first dataset by reference.
library(data.table)
nm1 <- setdiff(intersect(names(d1), names(d2)), 'id')
nm2 <- setdiff(names(d2), names(d1))
setDT(d1)[d2, c(nm1, nm2) := c(Map(dplyr::coalesce, mget(nm1),
mget(paste0("i.", nm1))), mget(nm2)), on = .(id)]
d1
# id x y z
#1: 1 1 97 d
#2: 2 14 3 c
#3: 3 4 2 b
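On the naming part of the question: the answers above treat this as a coalesce (take the first non-missing value) combined with a lookup by id. As a rough base R sketch of that idea without explicit for loops, working column by column so that each column keeps its own type; rows, shared and extra are names chosen here:

```r
d1 <- data.frame(id = 1:3, x = c(1, NA, 4), y = c(NA, 3, NA))
d2 <- data.frame(id = 4:1, x = c(NA, 20, 14, 15), y = c(30, 2, NA, 97), z = letters[1:4])

# row of d2 that corresponds to each row of d1
rows <- match(d1$id, d2$id)

# columns present in both: fill NAs in d1 from the matching d2 row
shared <- setdiff(intersect(names(d1), names(d2)), "id")
d1[shared] <- Map(function(a, b) ifelse(is.na(a), b, a),
                  d1[shared], d2[rows, shared, drop = FALSE])

# columns only in d2: copy them over, aligned by id
extra <- setdiff(names(d2), names(d1))
d1[extra] <- d2[rows, extra, drop = FALSE]
d1
```

This assumes every id in d1 occurs at most once in d2; an id missing from d2 simply leaves the NA in place.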
Say I have the following two data frames:
col1 <- c("a","b","c","d","e")
col2 <- c("A","B","C","D","E")
col1a <- c("a","b","c","d","e")
col2a <- c("A","B","C","D","E")
df1 <- data.frame(col1, col2)
df2 <- data.frame(col1a, col2a)
colnames(df1) <- c("c1","c2")
colnames(df2) <- c("c1","c3")
And I have the following function to rename column headers:
library(dplyr)
col_rename <- function(x) x %>% rename(new_c1 = c1, new_c2 = c2, new_c3 = c3)
When I run this function, I get an error because the columns named in the function do not all exist in the data frame.
df1 <- col_rename(df1)
Error: `c3` contains unknown variables
How can I make the function run only on the present columns, and ignore the ones not present, without removing or changing the column names specified in the function?
EDIT:
I can see how the example was a bit confusing. I have many dataframes with many columns. These columns are shared by some dataframes but not all. However, I want to rename all columns specified by the function, regardless of what is present in the dataframe. It looks something like this:
col1 <- c(1:5)
col2 <- c(1:5)
col3 <- c(1:5)
col4 <- c(1:5)
df1 <- data.frame(col1,col2,col3,col4)
df2 <- data.frame(col1,col2,col3,col4)
colnames(df1) <- c("c1","c2","c6","c8")
colnames(df2) <- c("c1","c3","c2","c8")
AB_rename <- function(x) x %>% rename(aa=c1,bb=c2,
cc=c3,dd=c4,
ee=c5,ff=c6,
gg=c7,hh=c8)
Therefore I cannot follow the example of @Ycw, as the columns do not all follow the same rename rule. How do I make this ignore columns that are not present?
Here is a workaround to use setNames for the col_rename function.
col_rename <- function(x) setNames(x, paste0("new_", names(x)))
col_rename(df1)
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
col_rename(df2)
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
Or use the select_all function from dplyr.
library(dplyr)
df1 %>% select_all(function(x) paste0("new_", x))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
The formula shorthand (~) also works for select_all
df2 %>% select_all(~paste0("new_", .))
new_c1 new_c3
1 a A
2 b B
3 c C
4 d D
5 e E
rename_all also works well
library(dplyr)
df1 %>% rename_all(~paste0("new_", .))
new_c1 new_c2
1 a A
2 b B
3 c C
4 d D
5 e E
Update
This is an update to address OP's updated question.
We can create a named vector showing the relationship between the old column names (the values) and the new column names (the names), and define a function that changes the names with setNames.
# Create name vector
vec <- paste0("c", 1:8)
names(vec) <- c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh")
# Create the function
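AB_rename <- function(x, name_vec){
old_colname <- names(x)
# look up each existing column in name_vec; columns without a match keep their name
idx <- match(old_colname, name_vec)
new_colname <- ifelse(is.na(idx), old_colname, names(name_vec)[idx])
x2 <- setNames(x, new_colname)
return(x2)
}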
AB_rename(df1, vec)
aa bb ff hh
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
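If staying with dplyr is preferred, rename() also accepts a named character vector wrapped in any_of(), which skips names that are not present (this assumes dplyr 1.0.0 or later). A short sketch on a data frame shaped like df2 from the question's edit; lookup and res are names chosen here:

```r
library(dplyr)

df2 <- data.frame(c1 = 1:5, c3 = 1:5, c2 = 1:5, c8 = 1:5)

# new_name = "old_name", the same pairs as vec above
lookup <- setNames(paste0("c", 1:8),
                   c("aa", "bb", "cc", "dd", "ee", "ff", "gg", "hh"))

# any_of() ignores lookup entries whose old name is absent from df2
res <- df2 %>% rename(any_of(lookup))
names(res)
```

Columns keep their original positions, so the order of the lookup does not have to match the order of the columns.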