Difference between `names(df[1]) <- ` and `names(df)[1] <- ` in R

Consider the following:
df <- data.frame(a = 1, b = 2, c = 3)
names(df[1]) <- "d" ## First method
## a b c
##1 1 2 3
names(df)[1] <- "d" ## Second method
## d b c
##1 1 2 3
Neither method returned an error, but the first didn't change the column name, while the second did.
I thought it had something to do with the fact that I'm operating only on a subset of df, but then why, for example, does the following work fine?
df[1] <- 2
## a b c
##1 2 2 3

What I think is happening is that replacement into a data frame ignores the attributes of the object the replacement values are drawn from. I am not 100% sure of this, but the following experiments appear to back it up:
df <- data.frame(a = 1:3, b = 5:7)
# a b
# 1 1 5
# 2 2 6
# 3 3 7
df2 <- data.frame(c = 10:12)
# c
# 1 10
# 2 11
# 3 12
df[1] <- df2[1] # in this case `df[1] <- df2` is equivalent
Which produces:
# a b
# 1 10 5
# 2 11 6
# 3 12 7
Notice how the values changed for df, but not the names. Basically the replacement operator `[<-` only replaces the values. This is why the name was not updated. I believe this explains all the issues.
In the scenario:
names(df[2]) <- "x"
You can think of the assignment as follows (this is a simplification, see end of post for more detail):
tmp <- df[2]
# b
# 1 5
# 2 6
# 3 7
names(tmp) <- "x"
# x
# 1 5
# 2 6
# 3 7
df[2] <- tmp # `tmp` has "x" for names, but it is ignored!
# a b
# 1 10 5
# 2 11 6
# 3 12 7
The last step of which is an assignment with `[<-`, which doesn't respect the names attribute of the RHS.
But in the scenario:
names(df)[2] <- "x"
you can think of the assignment as (again, a simplification):
tmp <- names(df)
# [1] "a" "b"
tmp[2] <- "x"
# [1] "a" "x"
names(df) <- tmp
# a x
# 1 10 5
# 2 11 6
# 3 12 7
Notice how we directly assign to names, instead of assigning to df which ignores attributes.
df[2] <- 2
works because we are assigning directly to the values, not the attributes, so there are no problems here.
EDIT: based on some commentary from @AriB.Friedman, here is a more elaborate version of what I think is going on (note I'm omitting the S3 dispatch to `[.data.frame`, etc., for clarity):
Version 1: `names(df[2]) <- "x"` translates to:
df <- `[<-`(
  df, 2,
  value = `names<-`(  # `names<-` here returns a re-named one-column data frame
    `[`(df, 2),
    value = "x"
  )
)
Version 2: `names(df)[2] <- "x"` translates to:
df <- `names<-`(
  df,
  `[<-`(
    names(df), 2, "x"
  )
)
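These expansions can be run directly to confirm they behave like the original replacement forms (a small sketch, re-creating df fresh first):
df <- data.frame(a = 1:3, b = 5:7)
## Version 1 expansion: `[<-` receives a renamed one-column data frame but keeps df's own column name
v1 <- `[<-`(df, 2, value = `names<-`(`[`(df, 2), value = "x"))
names(v1)
# [1] "a" "b"
## Version 2 expansion: the names vector itself is modified and assigned back, so the change sticks
v2 <- `names<-`(df, `[<-`(names(df), 2, "x"))
names(v2)
# [1] "a" "x"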
Also, it turns out this is "documented" in the R Inferno, Section 8.2.34 (thanks @Frank):
right <- wrong <- c(a=1, b=2)
names(wrong[1]) <- 'changed'
wrong
# a b
# 1 2
names(right)[1] <- 'changed'
right
# changed b
# 1 2

Related

How to correct/standardize variable names if their format is not consistent

I am writing a script that loads RData files containing the results of earlier experiments and parses the data frames saved in them. I've noticed that the names of variables are not consistent; for instance, sometimes symbol is called gene_name or gene_symbol. The order of variables also differs between the data frames, so I can't just rename them all with colnames(df) <- c('a', 'b', ...).
I'm looking for a way to rename variables based on their name that won't give an error if that variable isn't found. The below is what I want to do, but (ideally) without needing dozens of conditional statements.
if ('gene_name' %in% colnames(df)) {
df <- df %>% dplyr::rename('symbol' = gene_name)
}
In the example below, I'd like to find an elegant way to rename the variable b to D, in a way that I can use safely on data frames that lack a variable b:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
dfs <- list(x,y)
dfs.fixed <- lapply(dfs, function(x) ?????)
Desired result:
dfs.fixed
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Try this approach:
STEP 1
A function that substitutes a set of column names with another string (both parameterized):
colnames_rep <- function(df, to_find, to_sub) {
  colnames(df)[which(colnames(df) %in% to_find)] <- to_sub
  return(df)
}
STEP 2
Use lapply to apply the function over each data.frame:
lapply(dfs,colnames_rep,to_find=c("b"),to_sub="D")
[[1]]
a D
1 1 4
2 2 5
3 3 6
[[2]]
a c
1 1 4
2 2 5
3 3 6
Thanks to divibisan for the suggestion
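Since the question is about gene columns, here is a small usage sketch (with hypothetical data) showing that to_find can fold several legacy names into the same new name:
genes <- data.frame(gene_name = c("TP53", "BRCA1"), value = c(1, 2)) # hypothetical example data
colnames_rep(genes, to_find = c("gene_name", "gene_symbol"), to_sub = "symbol")
#   symbol value
# 1   TP53     1
# 2  BRCA1     2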
We can use rename_at with map (assuming dplyr and purrr are loaded; selecting every column and applying sub means data frames without a b column are simply left untouched):
library(dplyr)
library(purrr)
map(dfs, ~ .x %>%
      rename_at(vars(everything()), sub, pattern = "^b$", replacement = "D"))
#[[1]]
# a D
#1 1 4
#2 2 5
#3 3 6
#[[2]]
# a c
#1 1 4
#2 2 5
#3 3 6
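A side note: in dplyr 1.0+ the scoped _at/_all verbs are superseded, so the same idea can be sketched with rename_with (again assuming dplyr and purrr are loaded):
map(dfs, function(d) rename_with(d, function(nm) sub("^b$", "D", nm)))
# same output as above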
Here's an approach that is similar in concept to Terru_theTerror's, but extends it by allowing regular expressions. It might be overkill, but ...
First, we define a simple "map" that maps to the desired name (first string in each vector of the list) from any string (remaining strings in each vector). The function that does the matching accepts an argument of fixed=FALSE, in which case the 2nd and remaining strings can be regular expressions, which gives more power and responsibility.
If using fixed=TRUE (the default), then the map might look like this:
colnamemap <- list(
  c("symbol", "gene_name", "gene_symbol"),
  c("D", "c", "quux"),
  c("bbb", "b", "ccc")
)
where "gene_name" and "gene_symbol" will both be changed to "symbol", etc. If you want to use patterns (fixed=FALSE), however, you should be as specific as possible to preclude mis- or multiple-matches (across columns).
colnamemapptn <- list(
  c("symbol", "^gene_(name|symbol)$"),
  c("D", "^D$", "^c$", "^quux$"),
  c("bbb", "^b$", "^ccc$")
)
The function that does the actual remapping:
fixfunc <- function(df, namemap, fixed = TRUE, ignore.case = FALSE) {
  compare <- if (fixed) `%in%` else grepl
  downcase <- if (ignore.case) tolower else c
  newcn <- cn <- colnames(df)
  newnames <- sapply(namemap, `[`, 1L)
  matches <- sapply(namemap, function(nmap) {
    apply(outer(downcase(nmap[-1]), downcase(cn), Vectorize(compare)), 2, any)
  }) # dims: 1=cn; 2=map-to
  for (j in seq_len(ncol(matches))) {
    if (sum(matches[, j]) > 1) {
      warning("rule ", sQuote(newnames[j]), " matches multiple columns: ",
              paste(sQuote(cn[matches[, j]]), collapse = ","))
      matches[, j] <- FALSE
    }
  }
  for (i in seq_len(nrow(matches))) {
    rowmatches <- sum(matches[i, ])
    if (rowmatches == 1) {
      newcn[i] <- newnames[matches[i, ]]
    } else if (rowmatches > 1) {
      warning("column ", sQuote(cn[i]), " matches multiple rules: ",
              paste(sQuote(newnames[matches[i, ]]), collapse = ","))
      matches[i, ] <- FALSE
    }
  }
  if (any(matches)) colnames(df) <- newcn
  df
}
(You might extend it to ensure unique-ness, using make.names and/or make.unique. There's also ignore.case, not really tested here but easily done, I believe.)
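For instance, a tiny sketch of the kind of guard make.unique would provide if two rules ever mapped different columns onto the same new name:
make.unique(c("a", "symbol", "symbol"))
# [1] "a"        "symbol"   "symbol.1"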
I'm going to extend your sample data by including one that will match multiple patterns resulting in ambiguity:
x <- data.frame('a' = c(1,2,3), 'b' = c(4,5,6))
y <- data.frame('a' = c(1,2,3), 'c' = c(4,5,6))
z <- data.frame('cc' = 1:3, 'ccc' = 2:4)
dfs <- list(x,y,z)
where the third data.frame has two columns that both match one of my strings when it is treated as a pattern. When there are multiple matches, I think the safer thing to do is warn about it and change none of them.
This is correct, fixed-strings only:
lapply(dfs, fixfunc, colnamemap, fixed=TRUE)
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
This incorrectly uses the strings as patterns, which causes one of them to warn about multiple matches:
lapply(dfs, fixfunc, colnamemap, fixed=FALSE)
# Warning in FUN(X[[i]], ...) :
# rule 'D' matches multiple columns: 'cc','ccc'
# [[1]]
# a bbb
# 1 1 4
# 2 2 5
# 3 3 6
# [[2]]
# a D
# 1 1 4
# 2 2 5
# 3 3 6
# [[3]]
# cc bbb
# 1 1 2
# 2 2 3
# 3 3 4
A better use of fixed=FALSE, with strict patterns instead:
lapply(dfs, fixfunc, colnamemapptn, fixed=FALSE)
# same output as the first call

Change column names in list of dataframes

I have a list of data frames, and I want to change the name of the 2nd column in each data frame so that it matches the name of the list item that holds it. The code that I have at the moment is:
my_list <- list(one = data.frame(a <- 1:5, b <- 1:5), two = data.frame(a <- 1:5, b <- 1:5))
my_list <- lapply(seq_along(names(my_list)), function(x) names(my_list[[x]])[2] <- names(my_list)[x])
but my code just replaces the data frames with their names, and I don't understand why. Any help would be much appreciated.
I know that I can do this easily with a "for" loop, but I would like to avoid it, hence my question.
setNames can be convenient here:
my_list2 <- lapply(
names(my_list),
function(x) setNames(my_list[[x]], c(names(my_list[[x]])[1], x))
)
Or the same using Map (which I think is easier to read):
my_list2 <- Map(
function(x, n) setNames(x, c(names(x)[1], n)),
my_list, names(my_list)
)
> my_list2
$one
a one
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
$two
a two
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
A problem with the original code is that the replacement assignment returns the assigned value (the new name), not the modified object, and that is what the anonymous function hands back to lapply.
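A minimal sketch of that: the value of a replacement assignment is its right-hand side, which is exactly what lapply collects in the question's code:
x <- data.frame(a = 1:2, b = 3:4)
(names(x)[2] <- "two")  # parentheses make the value of the assignment visible
# [1] "two"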
Try this instead: loop through the data frames and update the column name:
# dummy list
my_list <- list(one = data.frame(a = 1:5, b = 1:5), two = data.frame(a = 1:5, b = 1:5))
my_list_updated <- lapply(names(my_list), function(i) {
  x <- my_list[[i]]
  # set 2nd column to a new name
  names(x)[2] <- i
  # return
  x
})
my_list_updated
# [[1]]
# a one
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
#
# [[2]]
# a two
# 1 1 1
# 2 2 2
# 3 3 3
# 4 4 4
# 5 5 5
You could use the super-assignment operator:
my_list <- list(one = data.frame(a = 1:5, b = 1:5), two = data.frame(a = 1:5, b = 1:5))
lapply(seq_along(names(my_list)), function(x) names(my_list[[x]])[2] <<- names(my_list)[x] )
my_list

Add new column to data.frame through loop in R

I have n data frames and I would like to add a column to all of them.
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
test=ls()
for (j in test){
j = cbind(get(j),IssueType=j)
}
The problem that I'm running into is with
j = cbind(get(j), IssueType = j)
because it assigns the result to j instead of to a and b.
As commented, it's mostly better to keep related data in a list structure. If you already have the data.frames in your global environment and you want to get them into a list, you can use:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
This is from here and makes sure that you only get data.frame objects from your global environment.
You will notice that you now already have a named list:
> dflist
# $a
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
#
# $b
# X1.4 X5.8
# 1 1 5
# 2 2 6
# 3 3 7
# 4 4 8
So you can easily select the data you want by typing for example
dflist[["a"]]
If you still want to create extra columns, you could do it like this:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Now, each data.frame in dflist has a new column called IssueType:
> dflist
# $a
# X1.4 X5.8 IssueType
# 1 1 5 a
# 2 2 6 a
# 3 3 7 a
# 4 4 8 a
#
# $b
# X1.4 X5.8 IssueType
# 1 1 5 b
# 2 2 6 b
# 3 3 7 b
# 4 4 8 b
In the future, you can create the data inside a list from the beginning, i.e.
dflist <- list(
  a = data.frame(1:4, 5:8),
  b = data.frame(1:4, 5:8)
)
To create a list of your data.frames, do this:
a <- data.frame(1:4, 5:8); b <- data.frame(1:4, 5:8); test <- list(a, b)
This allows you to use the lapply function to perform whatever you like on each of the data frames, e.g. adding a column (NA as a placeholder here, since an unnamed list carries no names to draw the issue type from):
out <- lapply(test, function(x) cbind(x, IssueType = NA))
For most data.frame operations I recommend using the packages dplyr and tidyr.
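For example (a hedged sketch, assuming the goal is a source-identifying column like IssueType): dplyr::bind_rows() can add such a column while stacking the data frames in one step:
library(dplyr)
a <- data.frame(1:4, 5:8)
b <- data.frame(1:4, 5:8)
# .id records which list element each row came from
combined <- bind_rows(list(a = a, b = b), .id = "IssueType")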
Here is an answer for the issue, helped by @docendo discimus.
Create the data frames:
a <- data.frame(1:4,5:8)
b <- data.frame(1:4, 5:8)
Group the data frames into a list:
dflist <- Filter(is.data.frame, as.list(.GlobalEnv))
Add the extra column:
dflist <- Map(function(df, x) {df$IssueType <- x; df}, dflist, names(dflist))
Unlist the data frames back into the global environment:
list2env(dflist ,.GlobalEnv)

R: change data frame structure using values from one variable as new variable

df1 <- data.frame(
name = c("a", "b", "b", "c"),
score = c(1, 1, 2, 1)
)
How can I get a new data frame with variables/columns taken from df1$name, each filled with the 'corresponding' df1$score values? I figure that it's actually a two-step problem:
First I would need to make a list of (in this example) unequal length vectors like this:
$a
[1] 1
$b
[1] 1 2
$c
[1] 1
Second, NAs need to be padded so one gets vectors of equal length before making the desired data frame, which would look like:
a b c
1 1 1 1
2 NA 2 NA
I cannot find any simple way to do this; I'm sure there must be one!
If the solution can be delivered using dplyr, that would be fantastic. Thanks!
To split the data:
(s <- split(df1$score, df1$name))
# $a
# [1] 1
#
# $b
# [1] 1 2
#
# $c
# [1] 1
To create the new data frame:
as.data.frame(sapply(s, `length<-`, max(vapply(s, length, 1L))))
# a b c
# 1 1 1 1
# 2 NA 2 NA
Slightly more efficient would be to use vapply in place of sapply
len <- max(vapply(s, length, 1L))
as.data.frame(vapply(s, `length<-`, double(len), len))
# a b c
# 1 1 1 1
# 2 NA 2 NA
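Since the question asks for a dplyr-flavoured solution, here is a hedged sketch with dplyr + tidyr (assuming tidyr >= 1.0) that produces the same NA-padded layout:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(name) %>%
  mutate(row = row_number()) %>%   # within-group index used as the row id
  ungroup() %>%
  pivot_wider(names_from = name, values_from = score) %>%
  select(-row)
# a tibble with columns a, b, c: row 1 is 1, 1, 1 and row 2 is NA, 2, NA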

Complete.cases used on list of data frames

I'm trying to remove all the NA values from a list of data frames. The only way I have got it to work is by cleaning the data with complete.cases in a for loop. Is there another way of doing this with lapply? I had been trying for a while to no avail. Here is the code that works.
I start with
data_in <- lapply (file_name,read.csv)
Then have:
clean_data <- list()
for (i in seq_along(id)) {
clean_data[[i]] <- data_in[[i]][complete.cases(data_in[[i]]), ]
}
But what I tried to get to work was using lapply all the way like this.
comp <- lapply(data_in, complete.cases)
clean_data <- lapply(data_in, data_in[[id]][comp,])
Which returns the error: Error in `[.default`(xj, i) : invalid subscript type 'list'
What I'd like to know is some alternatives, or whether I was going about this the right way. And why didn't the last example work?
Thank you so much for your time. Have a nice day.
I'm not sure what you expected with
clean_data <- lapply(data_in, data_in[[id]][comp,])
The second parameter to lapply should be a proper function to which each member of the data_in list will be passed one at a time. Your expression data_in[[id]][comp,] is not a function. I'm not sure where you expected id to come from, but lapply does not create magic variables for you like that. Also, at this point comp is now a list itself of indices. You are making no attempt to iterate over this list in sync with your data_in list. If you wanted to do it in two separate steps, a more appropriate approach would be
comp <- lapply(data_in, complete.cases)
clean_data <- Map(function(d,c) {d[c,]}, data_in, comp)
Here we use Map to iterate over the data_in and comp lists simultaneously. They each get passed in to the function as a parameter and we can do the proper extraction that way. Otherwise, if we wanted to do it in one step, we could do
clean_data <- lapply(data_in, function(x) x[complete.cases(x),])
Welcome to SO; please provide some working code next time.
Here is how I would do it with na.omit (since complete.cases only returns a logical vector):
(dat.l <- list(dat1 = data.frame(x = 1:2, y = c(1, NA)),
dat2 = data.frame(x = 1:3, y = c(1, NA, 3))))
# $dat1
# x y
# 1 1 1
# 2 2 NA
#
# $dat2
# x y
# 1 1 1
# 2 2 NA
# 3 3 3
Map(na.omit, dat.l)
# $dat1
# x y
# 1 1 1
#
# $dat2
# x y
# 1 1 1
# 3 3 3
Do you mean like the below?
> lst
$a
a
1 1
2 2
3 NA
4 3
5 4
$b
b
1 1
2 NA
3 2
4 3
5 4
$d
d e
1 NA 1
2 NA 2
3 3 3
4 4 NA
5 5 NA
> f <- function(x) x[complete.cases(x),]
> lapply(lst, f)
$a
[1] 1 2 3 4
$b
[1] 1 2 3 4
$d
d e
3 3 3
file_name[complete.cases(file_name), ]
complete.cases() returns a logical vector. This should do the job and return only the rows with no NA values.
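For reference, a minimal sketch on a toy data frame showing what complete.cases() returns and how that logical vector subsets rows:
d <- data.frame(x = c(1, NA, 3), y = c(1, 2, NA))
complete.cases(d)
# [1]  TRUE FALSE FALSE
d[complete.cases(d), ]
#   x y
# 1 1 1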
