The first function works and the second one does not, and I am not sure why. I am solely interested in what is happening with the paste() function in this example, as all of the other code works properly. In addition to what is shown below, I have also tried the second function with a comma separator between each value.
Ideally, the argument list inside my function would read as follows, but generated by paste() rather than typed out by hand:
X41262.0.0 = i.X41262.0.0, X41262.0.1 = i.X41262.0.1, etc.
fread("ukb33822.csv", select= c("eid", "X2784.0.0", "X2794.0.0",
"X2804.0.0", "X2814.0.0", "X2834.0.0",
"X3536.0.0", "X3546.0.0", paste("X41262.0.", 0:65, sep = ""),
"X3581.0.0"))
biobank[biobank2, on = .(eid), `:=` (X2784.0.0 = i.X2784.0.0, X2794.0.0 = i.X2794.0.0,
X2804.0.0 = i.X2804.0.0, X2814.0.0 = i.X2814.0.0,
X2834.0.0 = i.X2834.0.0, X3536.0.0 = i.X3536.0.0,
X3546.0.0 = i.X3546.0.0, paste("X41262.0.", 0:65, " = ", "i.X41262.0.", 0:65, sep = ""),
X3581.0.0 = i.X3581.0.0)]
Error in
`[.data.table`(biobank, biobank2, on = .(eid), `:=`(X2784.0.0 = i.X2784.0.0, :
In `:=`(col1=val1, col2=val2, ...) form, all arguments must be named.
Not having your data, it's a little contrived, but this might be enough to show you one option:
DT <- data.table(x=1:3)
DT[, c("a", "b", letters[3:5]) := c(list(1, 2), 3:5) ]
DT
# x a b c d e
# 1: 1 1 2 3 4 5
# 2: 2 1 2 3 4 5
# 3: 3 1 2 3 4 5
In this example:
"a" and "b" are your already-known names, e.g., "X2784.0.0", "X2794.0.0", etc
letters[3:5] are names you need to create programmatically, e.g., paste0("X41262.0.", 0:65)
1 and 2 are your already-known values, e.g., i.X2784.0.0, i.X2794.0.0, etc
3:5 are values you determine programmatically
It is not clear to me where your other values are found ...
If they are in the enclosing environment (and not within the actual table), then perhaps:
x1 <- 3:5
x2 <- 13:15
x3 <- 33:35
e <- environment()
DT[, c("a", "b", paste0("x", 1:3)) := c(list(1, 2), mget(paste0("x", 1:3), envir=e))]
# x a b x1 x2 x3
# 1: 1 1 2 3 13 33
# 2: 2 1 2 4 14 34
# 3: 3 1 2 5 15 35
where paste0("x", 1:3) is forming the variable names, and mget(...) actually retrieves them. You might need to define e as I have here, if they are not visible from data.table's search path.
If they are already in the data.table, then you might be able to do something with this:
DT <- data.table(x1=1:3, x2=11:13, x3=21:23)
DT[, c("a", "b", paste0("y", 1:3)) := c(list(1, 2), DT[, paste0("x", 1:3), with=FALSE]) ]
# x1 x2 x3 a b y1 y2 y3
# 1: 1 11 21 1 2 1 11 21
# 2: 2 12 22 1 2 2 12 22
# 3: 3 13 23 1 2 3 13 23
where paste0("y", 1:3) forms the names you want them to be, and paste0("x", 1:3) forms the other columns' names as they exist before this call.
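For the join in the question itself, the same idea might be sketched as follows. This is only an illustration under stated assumptions: the two tables are made-up miniature stand-ins for biobank and biobank2, and only two of the 66 generated columns are used. Both the target names and the i.-prefixed source names are built with paste0(), and mget() fetches the latter inside the join.

```r
library(data.table)

# Hypothetical miniature versions of the two tables in the question
biobank  <- data.table(eid = 1:3,
                       X2784.0.0  = NA_integer_,
                       X41262.0.0 = NA_character_,
                       X41262.0.1 = NA_character_)
biobank2 <- data.table(eid = c(3L, 1L),
                       X2784.0.0  = c(30L, 10L),
                       X41262.0.0 = c("c0", "a0"),
                       X41262.0.1 = c("c1", "a1"))

# Known names plus programmatically generated ones (0:1 here, 0:65 in the question)
cols <- c("X2784.0.0", paste0("X41262.0.", 0:1))

# (cols) on the left names the columns to update; mget() on the right
# looks up the matching i.-prefixed columns of biobank2 inside the join
biobank[biobank2, on = .(eid), (cols) := mget(paste0("i.", cols))]
biobank
```

Rows of biobank without a partner in biobank2 (eid 2 here) are left untouched.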
I have a question about reshaping a complex data from wide to long format.
"Prim_key" is the unique id. The variables have the following format: "sn016_1_2". I need to pull the first number into a column and name it "S" (For example, here it would be 1) and the second number to a column named "T" (For example, here it would be 2) and then pull the values into other variable names grouped by the unique id. The prefix sn016 is also not the only prefix. Here are the variables:
[1] "prim_key" "sn016_1_2" "sn016_1_3" "sn016_1_4" "sn016_1_5" "sn016_1_6" "sn016_1_7" "sn016_2_3"
[9] "sn016_2_4" "sn016_2_5" "sn016_2_6" "sn016_2_7" "sn016_3_4" "sn016_3_5" "sn016_3_6" "sn016_3_7"
[17] "sn016_4_5" "sn016_4_6" "sn016_4_7" "sn016_5_6" "sn016_5_7" "sn016_6_7" "sn017_1_2" "sn017_1_3"
[25] "sn017_1_4" "sn017_1_5" "sn017_1_6" "sn017_1_7" "sn017_2_3" "sn017_2_4" "sn017_2_5" "sn017_2_6"
[33] "sn017_2_7" "sn017_3_4" "sn017_3_5" "sn017_3_6" "sn017_3_7" "sn017_4_5" "sn017_4_6" "sn017_4_7"
[41] "sn017_5_6" "sn017_5_7" "sn017_6_7"
Any ideas on how to do this? I feel like it shouldn't be terribly hard, but it's evading me.
Here's an example of what I'm looking for:
THESE VARS: "prim_key" "sn016_1_2" "sn016_1_3" "sn016_2_6" "sn016_2_7" "sn016_3_4" "sn016_3_5"
prim_key S T sn016
1 1 2 value
1 1 3 value
1 2 6 value
1 2 7 value
1 3 4 value
1 3 5 value
P.s. The goal long format example is not showing up correctly. So I've attached as an image.
Thanks in advance for any help!!
Perhaps you might try using pivot_longer from tidyr.
You can specify:
Columns to make longer (could select columns that start with "sn", such as starts_with("sn"), or all columns except for prim_key)
Names of the new columns generated, which include the initial letter/number combination (e.g., sn016), S, and T
And a regex pattern to split up into these columns
The code as follows:
library(tidyverse)
df %>%
pivot_longer(cols = -prim_key,
names_to = c(".value", "S", "T"),
names_pattern = "(\\w+)_(\\d+)_(\\d+)")
Output
# A tibble: 10 x 5
prim_key S T sn016 sn017
<dbl> <chr> <chr> <int> <int>
1 1 1 2 5 NA
2 1 1 3 2 NA
3 1 2 6 5 3
4 1 2 7 1 2
5 1 3 5 NA 3
6 1 1 2 2 NA
7 1 1 3 3 NA
8 1 2 6 3 4
9 1 2 7 2 3
10 1 3 5 NA 5
Data
Example data made up:
df <- structure(list(prim_key = c(1, 1), sn016_1_2 = c(5L, 2L), sn016_1_3 = 2:3,
sn016_2_6 = c(5L, 3L), sn016_2_7 = 1:2, sn017_2_6 = 3:4,
sn017_2_7 = 2:3, sn017_3_5 = c(3L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
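To see what the three capture groups in names_pattern pick out of each column name, the same regular expression can be tried on a few names in base R. This is a small sketch using strcapture() (available since R 3.4); the column names prefix, S and T in proto are chosen here for illustration only.

```r
# Column names in the same format as the question
nms <- c("sn016_1_2", "sn016_2_7", "sn017_3_5")

# One capture group per output field; proto gives each field a name and a type
parts <- strcapture("(\\w+)_(\\d+)_(\\d+)", nms,
                    proto = data.frame(prefix = character(),
                                       S = integer(),
                                       T = integer(),
                                       stringsAsFactors = FALSE),
                    perl = TRUE)
parts
```

The first group ends up as the value column name (via ".value" in pivot_longer), the second as S and the third as T.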
We could use melt from data.table
library(data.table)
dcast(melt(setDT(df), id.var = 'prim_key')[, c("nm1", "S", "T")
:= tstrsplit(variable, '_')], rowid(nm1, S, T) + prim_key + S + T
~ nm1, value.var = 'value')[, nm1 := NULL][]
# prim_key S T sn016 sn017
# 1: 1 1 2 5 NA
# 2: 1 1 3 2 NA
# 3: 1 2 6 5 3
# 4: 1 2 7 1 2
# 5: 1 3 5 NA 3
# 6: 1 1 2 2 NA
# 7: 1 1 3 3 NA
# 8: 1 2 6 3 4
# 9: 1 2 7 2 3
#10: 1 3 5 NA 5
data
df <- structure(list(prim_key = c(1, 1), sn016_1_2 = c(5L, 2L), sn016_1_3 = 2:3,
sn016_2_6 = c(5L, 3L), sn016_2_7 = 1:2, sn017_2_6 = 3:4,
sn017_2_7 = 2:3, sn017_3_5 = c(3L, 5L)), class = "data.frame", row.names = c(NA,
-2L))
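The one-liner leans on two data.table helpers that may be easier to follow in isolation. A minimal sketch with made-up inputs:

```r
library(data.table)

# tstrsplit() splits each string and transposes the pieces, so the result
# is a list with one vector per position, ready for `:=` with three names
pieces <- tstrsplit(c("sn016_1_2", "sn017_3_5"), "_", fixed = TRUE)
pieces

# rowid() numbers repeated values within each group; in the answer it keeps
# apart the input rows that share the same prim_key, S and T
ids <- rowid(c("a", "b", "a", "a"))
ids
```

Without the rowid() term in the dcast formula, rows with identical prim_key, S and T would collapse into one cell and dcast would fall back to an aggregate.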
The answers using external packages are probably the way to go in terms of parsimony. It's useful, however, to be able to brute force your desired solution using base R sometimes. Below is an example. One benefit of the following is that the call to lapply can be replaced with the parallel version parLapply or mclapply both from the parallel package that ships with R.
#### First make some example data
# The column names you gave
cnames <- c("prim_key", "sn016_1_2", "sn016_1_3", "sn016_1_4", "sn016_1_5",
"sn016_1_6", "sn016_1_7", "sn016_2_3", "sn016_2_4", "sn016_2_5",
"sn016_2_6", "sn016_2_7", "sn016_3_4", "sn016_3_5", "sn016_3_6",
"sn016_3_7", "sn016_4_5", "sn016_4_6", "sn016_4_7", "sn016_5_6",
"sn016_5_7", "sn016_6_7", "sn017_1_2", "sn017_1_3", "sn017_1_4",
"sn017_1_5", "sn017_1_6", "sn017_1_7", "sn017_2_3", "sn017_2_4",
"sn017_2_5", "sn017_2_6", "sn017_2_7", "sn017_3_4", "sn017_3_5",
"sn017_3_6", "sn017_3_7", "sn017_4_5", "sn017_4_6", "sn017_4_7",
"sn017_5_6", "sn017_5_7", "sn017_6_7")
# An example matrix with random data
mat <- matrix(runif(length(cnames) * 4), nrow = 4)
# Make the column names correct
colnames(mat) <- cnames
### Now pretend we already had the data
# Get the column names of the input matrix
cnames <- colnames(mat)
# The column names that are not your primary key
n_primkey <- cnames[which(cnames != "prim_key")]
# Get the unique set of prefixes for the non-primkey variables
prefix <- strsplit(n_primkey, "_")
prefix <- unique(unlist(lapply(prefix, "[", 1)))
# Go row by row through the original matrix
dat <- lapply(seq_len(nrow(mat)), function(i) {
# The row we're dealing with now
row <- mat[i, ]
# The column names of your output matrix
dcnames <- c("prim_key", "S", "T", prefix)
# A pre-allocated data.frame to hold the reshaped data for this row
dat <- matrix(rep(NA, length(dcnames) * length(n_primkey)), ncol = length(dcnames))
dat <- as.data.frame(dat)
colnames(dat) <- dcnames
# All values for this row have the same prim_key value
dat$prim_key <- row["prim_key"]
# Go through each of the non-prim_key variables, split them, and put the
# values in the correct place
for (j in seq_along(n_primkey)) {
# k has the non-prim_key name we're dealing with
k <- n_primkey[j]
# l splits this name by underscores "_"
l <- strsplit(k, "_")
# The first element gives the prefix
pref <- l[[1]][1]
# The second gives the "S" value
S_val <- l[[1]][2]
# The third gives the "T" value
T_val <- l[[1]][3]
# Allocate these values into the output data.frame we created earlier
dat[j, "S"] <- S_val
dat[j, "T"] <- T_val
dat[j, pref] <- row[k]
}
# Return the data for row i of the input data
dat
})
# dat is a list, so combine each element into a single data.frame
dat <- do.call(rbind, dat)
# Check a few
dat[1:2, ]
mat[1, ]
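As for the parallel variant mentioned at the start of this answer, the swap is mostly mechanical because mclapply() takes the same first two arguments as lapply(). A minimal sketch, with a trivial stand-in function instead of the per-row reshaping function:

```r
library(parallel)

# Stand-in for the per-row function passed to lapply() above
square <- function(i) i^2

serial <- lapply(1:4, square)

# mclapply() has the same calling convention as lapply(); mc.cores = 1L keeps
# this sketch portable (forking with more cores is not available on Windows)
forked <- mclapply(1:4, square, mc.cores = 1L)

identical(serial, forked)
```

On a Unix-alike, raising mc.cores is what actually spreads the rows over several processes.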
I'm trying to merge 2 datasets on a key, but if there is no match then I want to try another key, and so on.
df1 <- data.frame(a=c(5,1,7,3),
b=c("T","T","T","F"),
c=c("F","T","F","F"))
df2 <- data.frame(x1=c(4,5,3,9),
x2=c(7,8,1,2),
x3=c("g","w","t","o"))
df1
a b c
1 5 T F
2 1 T T
3 7 T F
4 3 F F
df2
x1 x2 x3 ..
1 4 7 g ..
2 5 8 w ..
3 3 1 t ..
4 9 2 o ..
The desired output is something like
a b c x3 ..
1 5 T F w ..
2 1 T T t ..
3 7 T F g ..
4 3 F F t ..
I tried something along the lines of
dfm <- merge(df1,df2, by.x = "a", by.y = "x1", all.x = TRUE)
dfm <- merge(dfm,df2, by.x = "a", by.y = "x2", all.x = TRUE)
but that isn't quite right.
This really isn't a standard sort of merge. You can make it more standard by reshaping df2 so you have just one field to merge on
df2long <- rbind(
data.frame(a = df2$x1, df2[,-(1:2), drop=FALSE]),
data.frame(a = df2$x2, df2[,-(1:2), drop=FALSE])
)
dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
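One thing to watch with this approach: if a value of a occurs in both x1 and x2, the long table holds two rows for it and merge() returns both. A possible way to give x1 precedence (an addition of mine, not part of the answer above, and using made-up data in which 5 occurs in both columns) is to drop later duplicates before merging, relying on rbind() having put the x1 rows first.

```r
df1 <- data.frame(a = c(5, 1, 7, 3, 0),
                  b = c("T", "T", "T", "F", "F"),
                  stringsAsFactors = FALSE)
# 5 appears in x1 (row 2) and in x2 (row 5)
df2 <- data.frame(x1 = c(4, 5, 3, 9, 6),
                  x2 = c(7, 8, 1, 2, 5),
                  x3 = c("g", "w", "t", "o", "n"),
                  stringsAsFactors = FALSE)

df2long <- rbind(
  data.frame(a = df2$x1, df2[, -(1:2), drop = FALSE]),
  data.frame(a = df2$x2, df2[, -(1:2), drop = FALSE])
)
# Keep only the first occurrence of each key, i.e. the x1 match when both exist
df2long <- df2long[!duplicated(df2long$a), ]

dfm <- merge(df1, df2long, by = "a", all.x = TRUE)
dfm <- dfm[order(dfm$a), ]   # explicit ordering by key
dfm
```

Note that merge() does not keep the row order of df1; the result here is ordered by a.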
You could do something like this:
matches <- lapply(df2[, c("x1", "x2")], function(x) match(df1$a, x))
# finding matches in df2$x1 and df2$x2
# notice that the code below should work with any number of columns to be matched:
# you just need to add the names here eg. df2[, paste0("x", 1:100)]
matches
$x1
[1] 2 NA NA 3
$x2
[1] NA 3 1 NA
combo <- Reduce(function(a,b) "[<-"(a, is.na(a), b[is.na(a)]), matches)
# combining the matches on "first come first served" basis
combo
[1] 2 3 1 3
cbind(df1, df2[combo,])
a b c x1 x2 x3
2 5 T F 5 8 w
3 1 T T 3 1 t
1 7 T F 4 7 g
3.1 3 F F 3 1 t
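The Reduce() step is compact, so here is what it does for two match vectors, written out with ifelse() next to the original form. The two input vectors are simply the values printed above.

```r
# The two match vectors printed above
m_x1 <- c(2L, NA, NA, 3L)
m_x2 <- c(NA, 3L, 1L, NA)

# Take the x1 match where there is one, otherwise fall back to the x2 match
combo_ifelse <- ifelse(is.na(m_x1), m_x2, m_x1)

# The same step written with Reduce(), as in the answer
combo_reduce <- Reduce(function(a, b) "[<-"(a, is.na(a), b[is.na(a)]),
                       list(m_x1, m_x2))

identical(combo_ifelse, combo_reduce)
```

With more than two columns, Reduce() repeats this fill-in from left to right, which is where the "first come, first served" behaviour comes from.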
If I understand correctly, the OP has requested to try a match of a with x1 first, then - if failed - to try to match a with x2. So any match of a with x1 should take precedence over a match of a with x2.
Unfortunately, the sample data set provided by the OP does not include a use case to prove this. Therefore, I have modified the sample dataset accordingly (see Data section).
The approach suggested here is to reshape df2 from wide to long format (as in MrFlick's answer) but to use a data.table join with parameter mult = "first".
The columns of df2 to be considered as key columns and the precedence can be controlled by the measure.vars parameter to melt(). After reshaping, melt() arranges the rows in the column order given in measure.vars:
library(data.table)
# define cols of df2 to use as key in order of precedence
key_cols <- c("x1", "x2")
# reshape df2 from wide to long format
long <- melt(setDT(df2), measure.vars = key_cols, value.name = "a")
# join long with df1, pick first matches
result <- long[setDT(df1), on = "a", mult = "first"]
# clean up
setcolorder(result, names(df1))
result[, variable := NULL]
result
a b c x3
1: 5 T F w
2: 1 T T t
3: 7 T F g
4: 3 F F t
5: 0 F F <NA>
Please, note that the original row order of df1 has been preserved.
Also, note that the code works for an arbitrary number of key columns. The precedence of key columns can be easily changed. E.g., if the order is reversed, i.e., key_cols <- c("x2", "x1") matches of a with x2 will be picked first.
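A short sketch of the reversed precedence, assuming the enhanced sample data from the Data section (rebuilt here as fresh data.tables with only the columns needed, so the block runs on its own):

```r
library(data.table)

df1 <- data.table(a = c(5, 1, 7, 3, 0))
df2 <- data.table(x1 = c(4, 5, 3, 9, 6),
                  x2 = c(7, 8, 1, 2, 5),
                  x3 = c("g", "w", "t", "o", "n"))

# Reversed precedence: x2 is tried before x1
key_cols <- c("x2", "x1")
long <- melt(df2, measure.vars = key_cols, value.name = "a")

# For each row of df1, keep only the first matching row of long
result <- long[df1, on = "a", mult = "first"]
result[, .(a, x3)]
```

The value 5 now picks up "n" (its x2 match in row 5) instead of "w" (its x1 match in row 2).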
Data
Enhanced sample datasets:
df1 has an additional row with no match in df2.
df1 <- data.frame(a=c(5,1,7,3,0),
b=c("T","T","T","F","F"),
c=c("F","T","F","F","F"))
df1
a b c
1: 5 T F
2: 1 T T
3: 7 T F
4: 3 F F
5: 0 F F
df2 has an additional row to prove that a match in x1 takes precedence over a match in x2. The value 5 appears twice: In row 2 of column x1 and in row 5 of column x2.
df2 <- data.frame(x1=c(4,5,3,9,6),
x2=c(7,8,1,2,5),
x3=c("g","w","t","o","n"))
df2
x1 x2 x3
1: 4 7 g
2: 5 8 w
3: 3 1 t
4: 9 2 o
5: 6 5 n
Not sure I understood your question, but rather than merging repeatedly I'd compare the keys of each potential merge: if the number of shared keys is > 0, then you have a match. If you want to take the first column with a match, you can try this:
library(tidyr)
library(purrr)
(df1 <- data.frame(a=c(5,1,7,3),
b=c("T","T","T","F"),
c=c("F","T","F","F")) )
(df2 <- data.frame(x1=c(4,5,3,9),
x2=c(7,8,1,2),
x3=c("g","w","t","o")) )
FirstColMatch<-1:ncol(df2) %>%
map(~intersect(df1$a, df2[[.x]])) %>%
map(length) %>%
detect_index(function(x)x>0)
NewDF<-merge(df1,df2,by.x="a", by.y =names(df2)[FirstColMatch])
I am writing a script that loads RData files containing the results of earlier experiments and parses data frames saved in them. I've noticed that the names of variables are not consistent; for instance, sometimes symbol is called gene_name or gene_symbol. The order of variables is also different between the different data frames, so I can't just rename them all with colnames(df) <- c('a', 'b', ...)
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the below example, I'd like to find an elegant way to rename the variable b to D that I can use safely on data frames that lack a variable b
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function substituting a list of colnames with another string (both info parameterized):
colnames_rep<-function(df,to_find,to_sub)
{
colnames(df)[which(colnames(df) %in% to_find)]<-to_sub
return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
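Applied to the gene_name / gene_symbol case from the question, the same function could be used like this. The three data frames are hypothetical and only there to exercise the function.

```r
colnames_rep <- function(df, to_find, to_sub) {
  # Replace every column name listed in to_find with to_sub
  colnames(df)[colnames(df) %in% to_find] <- to_sub
  df
}

# Hypothetical data frames with inconsistent names for the same variable
d1 <- data.frame(gene_name = "TP53", score = 1)
d2 <- data.frame(score = 2, gene_symbol = "BRCA1")
d3 <- data.frame(symbol = "MYC", score = 3)

fixed <- lapply(list(d1, d2, d3), colnames_rep,
                to_find = c("gene_name", "gene_symbol"), to_sub = "symbol")
lapply(fixed, names)
```

One caveat: a data frame that contains both gene_name and gene_symbol would end up with two columns called symbol, so that case needs separate handling.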
Thanks to divibisan for the suggestion
We can use rename_at with map
library(dplyr)
library(purrr)
map(dfs, ~ .x %>%
rename_at(vars(matches("^b$")), sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
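With dplyr 1.0.0 or later, a similar effect can probably be had without rename_at by passing a named lookup vector through any_of(), which silently skips names that are not present. A sketch under that version assumption:

```r
library(dplyr)

x <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6))
y <- data.frame(a = c(1, 2, 3), c = c(4, 5, 6))

# Named vector in new_name = "old_name" form; any_of() skips missing columns
lookup <- c(D = "b")

dfs.fixed <- lapply(list(x, y), function(df) rename(df, any_of(lookup)))
lapply(dfs.fixed, names)
```

More pairs, such as symbol = "gene_name", can be added to lookup without touching the rest of the code.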
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
c("symbol", "gene_name", "gene_symbol"),
c("D", "c", "quux"),
c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
c("symbol", "^gene_(name|symbol)$"),
c("D", "^D$", "^c$", "^quux$"),
c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
compare <- if (fixed) `%in%` else grepl
downcase <- if (ignore.case) tolower else c
newcn <- cn <- colnames(df)
newnames <- sapply(namemap, `[`, 1L)
matches <- sapply(namemap, function(nmap) {
apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
}) # dims: 1=cn; 2=map-to
for (j in seq_len(ncol(matches))) {
if (sum(matches[,j]) > 1) {
warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
paste(sQuote(cn[ matches[,j] ]), collapse=","))
matches[,j] <- FALSE
}
}
for (i in seq_len(nrow(matches))) {
rowmatches <- sum(matches[i,])
if (rowmatches == 1) {
newcn[i] <- newnames[ matches[i,] ]
} else if (rowmatches > 1) {
warning("column ", sQuote(cn[i]), " matches multiple rules: ",
paste(sQuote(newnames[ matches[i,]]), collapse=","))
matches[i,] <- FALSE
}
}
if (any(matches)) colnames(df) <- newcn
df
}
(You might extend it to ensure unique-ness, using make.names and/or make.unique. There's also ignore.case, not really tested here but easily done, I believe.)
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where the third data.frame has two columns that match my third non-pattern vector. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call
I have been going crazy with something basic...
I am trying to count and list in a comma separated column each unique ID coming up in a data frame, e.g.:
df<-data.frame(id = as.character(c("a", "a", "a", "b", "c", "d", "d", "e", "f")), x1=c(3,1,1,1,4,2,3,3,3),
x2=c(6,1,1,1,3,2,3,3,1),
x3=c(1,1,1,1,1,2,3,3,2))
> df
id x1 x2 x3
1 a 3 6 1
2 a 1 1 1
3 a 1 1 1
4 b 1 1 1
5 c 4 3 1
6 d 2 2 2
7 d 3 3 3
8 e 3 3 3
9 f 3 1 2
I am trying to get a count of unique id that satisfy a condition, >1:
res = data.frame(x1_counts =5, x1_names="a,c,d,e,f", x2_counts = 4, x2_names="a,c,d,e", x3_counts = 3, x3_names="d,e,f")
> res
x1_counts x1_names x2_counts x2_names x3_counts x3_names
1 5 a,c,d,e,f 4 a,c,d,e 3 d,e,f
I have tried with data.table but it seems very convoluted, i.e.
DT = as.data.table(df)
res <- DT[, list(x1= length(unique(id[which(x1>1)])), x2= length(unique(id[which(x2>1)]))), by=id]
But I can't get it right. I don't see what I need to do with data.table, since it is not really a grouping I am looking for. Can you point me in the right direction, please? Thanks so much!
You can reshape your data to long format and then do the summary:
library(data.table)
(melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable])
# You can replace list() with toString() if you want the names as a single string instead of a list
# variable counts names
#1: x1 5 a,c,d,e,f
#2: x2 4 a,c,d,e
#3: x3 3 d,e,f
To get what you need, reshape it back to wide format:
dcast(1~variable,
data = (melt(setDT(df), id.vars = "id")[value > 1]
[, .(counts = uniqueN(id), names = list(unique(id))), variable]),
value.var = c('counts', 'names'))
# . counts_x1 counts_x2 counts_x3 names_x1 names_x2 names_x3
# 1: . 5 4 3 a,c,d,e,f a,c,d,e d,e,f
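For comparison, the long-format summary can also be approximated in base R without reshaping. This is a different technique from the melt/dcast approach above (a plain lapply() over the value columns), shown only as a sketch on the question's data:

```r
df <- data.frame(id = c("a", "a", "a", "b", "c", "d", "d", "e", "f"),
                 x1 = c(3, 1, 1, 1, 4, 2, 3, 3, 3),
                 x2 = c(6, 1, 1, 1, 3, 2, 3, 3, 1),
                 x3 = c(1, 1, 1, 1, 1, 2, 3, 3, 2),
                 stringsAsFactors = FALSE)

# For each value column, the unique ids with at least one value above 1
ids <- lapply(df[c("x1", "x2", "x3")], function(col) unique(df$id[col > 1]))

# One row per column: how many ids qualified, and which ones
res <- data.frame(variable = names(ids),
                  counts   = unname(lengths(ids)),
                  names    = unname(vapply(ids, paste, character(1), collapse = ",")),
                  stringsAsFactors = FALSE)
res
```

Here the names column is a comma-separated string rather than a list column.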
Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but then why does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3
What I think is happening is that replacement into a data frame ignores the attributes of the data frame that is drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1 names(df[2]) <- "x" translates to:
df <- `[<-`(
df, 2,
value=`names<-`( # `names<-` here returns a re-named one column data frame
`[`(df, 2),
value="x"
) )
Version 2 names(df)[2] <- "x" translates to:
df <- `names<-`(
df,
`[<-`(
names(df), 2, "x"
) )
Also, it turns out this is "documented" in R Inferno Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2