I have a script generating a dataframe with multiple columns named with numbers 1, 2, 3 –> n
I want to rename the columns with the following names: "Cluster_1", "Cluster_2", "Cluster_3" –> "Cluster_n" (with incrementation).
As the number of columns in my dataframe can change accordingly to another part of my script, I would like to be able to have a kind of loop structure that would go through my dataframe and change columns accordingly.
I would like to do something like:
for (i in colnames(df)){
an expression that would change the column name to a concatenation of "Cluster_" + i
}
Outside the loop context, I generally use this expression to rename a column:
names(df)[names(df) == '1'] <- 'Cluster_1'
But I struggle to produce an adapted version of this expression that would properly integrate in my for loop with a concatenation of string and variable value.
How can I adjust the expression that renames the column of the dataframe to integrate in my for loop?
Or is there a better way than a for loop to do this?
A tidyverse solution: rename_with()
require(dplyr)
## '~' notation can be used for formulae in this context:
df <- rename_with(df, ~ paste0("Cluster_", .))
Using paste0.
names(df) <- paste0('cluster_', seq_len(length(df)))
If you really need a for loop, try
for (i in seq_along(names(df))) {
names(df)[i] <- paste0('cluster_', i)
}
df
# cluster_1 cluster_2 cluster_3 cluster_4
# 1 1 4 7 10
# 2 2 5 8 11
# 3 3 6 9 12
Note: colnames()/rownames() is designed for class "matrix", for "data.frame"s, you might want to use names()/row.names().
Data:
df <- data.frame(matrix(1:12, 3, 4))
Related
After searching for some time, I cannot find a smooth R-esque solution.
I have a list of vectors that I want to convert to dataframes and add a column with the names of the vectors. I cant do this with cbind() and melt() to a single dataframe b/c there are vectors with different number of rows.
Basic example would be:
list<-list(a=c(1,2,3),b=c(4,5,6,7))
var<-"group"
What I have come up with and works is:
list<-lapply(list, function(x) data.frame(num=x,grp=""))
for (j in 1:length(list)){
list[[j]][,2]<-names(list[j])
names(list[[j]])[2]<-var
}
But I am trying to better use lapply() and have cleaner coding practices. Right now I rely so heavily on for and if statements, which a lot of the base functions do already and much more efficiently than I can code at this point.
The psuedo code I would like is something like:
list<-lapply(list, function(x) data.frame(num=x,get(var)=names(x))
Is there a clean way to get this done?
Second closely related question, if I already have a list of dataframes, why is it so hard to reassign column values and names using lapply()?
So using something like:
list<-list(a=data.frame(num=c(1,2,3),grp=""),b=data.frame(num=c(4,5,6,7),grp=""))
var<-"group"
#pseudo code
list<-lapply(list, function(x) x[,2]<-names(x)) #populate second col with name of df[x]
list<-lapply(list, function(x) names[[x]][2]<-var) #set 2nd col name to 'var'
The first line of pseudo code throws an error about matching row lengths. Why does lapply() not just loop over and repeat names(x) like the same function on a single dataframe does in a for loop?
For the second line, as I understand it I can use setNames() to reassign all the column names, but how do I make this work for just one of the col names?
Many thanks for any ideas or pointing to other threads that cover this and helping me understand the behavior of lapply() in this context.
A full R base approach without using loops
> l<-list(a=c(1,2,3),b=c(4,5,6,7))
> data.frame(grp=rep(names(l), lengths(l)), num=unlist(l), row.names = NULL)
grp num
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
Related to your first/main question you can use the function enframe from package tibble for this purpose
library(tibble)
library(tidyr)
library(dplyr)
l<-list(a=c(1,2,3),b=c(4,5,6,7))
l %>%
enframe(name = "group", value="value") %>%
unnest(value) %>%
group_split(group)
Try this:
library(dplyr)
mylist <- list(a = c(1,2,3), b = c(4,5,6,7))
bind_rows(lapply(names(mylist), function(x) tibble(grp = x, num = mylist[[x]])))
# A tibble: 7 x 2
grp num
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
This is essentially a lapply-based solution where you iterate over the names of your list, and not the individual list elements themselves. If you prefer to do everything in base R, note that the above is equivalent to
do.call(rbind, lapply(names(mylist), function(x) data.frame(grp = x, num = mylist[[x]], stringsAsFactors = F)))
Having said that, tibbles as modern implementation of data.frames are preferred, as is bind_rows over the do.call(rbind... construct.
As to the second question, note the following:
lapply(mylist, function(x) str(x))
num [1:3] 1 2 3
num [1:4] 4 5 6 7
....
lapply(mylist, function(x) names(x))
$a
NULL
$b
NULL
What you see here is that the function inside of lapply gets the elements of mylist. In this case, it get's to work with the numeric vector. This does not have any name as far as the function that is called inside lapply is concerned. To highlight this, consider the following:
names(c(1,2,3))
NULL
Which is the same: the vector c(1,2,3) does not have a name attribute.
I have managed to do chisq-test using loop in R but it is very slow for a large data and I wonder if you could help me out doing it faster with something like dplyr? I've tried with dplyr but I ended up getting an error all the time which I am not sure about the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is to run chisq-test between each row of the df and cs. Then giving me the statistics and p.values as well as row names.
here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.
I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w)) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val =NA, stat =NA, exprs = rownames(df) )
for (i in 1:col(df)) {
# tbl <- table((df[i,]), cs) ### No use seen for this
# I changed the indexing in the next line to compare columsn to the standard `cs`.
tst <- chisq.test(df[ ,i], cs) #chisq.test not vectorized, need some sort of loop
value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name since there is also a df function) to Biobase::exprs(PANCAN_w)
Trying to using %in% operator in r to find an equivalent SAS Code as below:
If weather in (2,5) then new_weather=25;
else if weather in (1,3,4,7) then new_weather=14;
else new_weather=weather;
SAS code will produce variable "new_weather" with values 25, 14 and as defined in variable "weather".
R code:
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[newcol] = df[col]
df[df[newcol] %in% c(2,5)]= 25
df[df[newcol] %in% c(1,3,4,7)] = 14
return(df)
}
Result: output values of "col" and "newcol" are same, when passing a data frame through a function "GS". Syntax is not picking up the second or more values for a variable "newcol"? Appreciated your time explaining the reason and possible fix.
Is this what you are trying to do?
df <- data.frame(A=seq(1:4), B=seq(1:4))
add_and_adjust <- function(df, copy_column, new_column_name) {
df[new_column_name] <- df[copy_column] # make copy of column
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(2,5), 25, df[,new_column_name])
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(1,3,4,7), 14, df[,new_column_name])
return(df)
}
Usage:
add_and_adjust(df, 'B', 'my_new_column')
df[newcol] is a data frame (with one column), df[[newcol]] or df[, newcol] is a vector (just the column). You need to use [[ here.
You also need to be assigning the result to df[[newcol]], not to the whole df. And to be perfectly consistent and safe you should probably test the col values, not the newcol values.
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[[newcol]] = df[[col]]
df[[newcol]][df[[col]] %in% c(2,5)] = 25
df[[newcol]][df[[col]] %in% c(1,3,4,7)] = 14
return(df)
}
GS(data.frame(x = 1:7), "x", "new")
# x new
# 1 1 14
# 2 2 25
# 3 3 14
# 4 4 14
# 5 5 25
# 6 6 6
# 7 7 14
#user9231640 before you invest too much time in writing your own function you may want to explore some of the recode functions that already exist in places like car and Hmisc.
Depending on how complex your recoding gets your function will get longer and longer to check various boundary conditions or to change data types.
Just based upon your example you can do this in base R and it will be more self documenting and transparent at one level:
df <- data.frame(A=seq(1:30), B=seq(1:30))
df$my_new_column <- df$B
df$my_new_column <- ifelse(df$my_new_column %in% c(2,5), 25, df$my_new_column)
df$my_new_column <- ifelse(df$my_new_column %in% c(1,3,4,7), 14, df$my_new_column)
I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapplyor lapply, but I haven't had any progress on it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
I have a list called cols with column names in it:
cols <- c('Column1','Column2','Column3')
I'd like to reproduce this command, but with a call to the list:
data.frame(Column1=rnorm(10))
Here's what happens when I try it:
> data.frame(cols[1]=rnorm(10))
Error: unexpected '=' in "data.frame(I(cols[1])="
The same thing happens if I wrap cols[1] in I() or eval().
How can I feed that item from the vector into the data.frame() command?
Update:
For some background, I have defined a function calc.means() that takes a data frame and a list of variables and performs a large and complicated ddply operation, summarizing at the level specified by the variables.
What I'm trying to do with the data.frame() command is walk back up the aggregation levels to the very top, re-running calc.means() at each step and using rbind() to glue the results onto one another. I need to add dummy columns with 'All' values in order to get the rbind to work properly.
I'm rolling cast-like margin functionality into ddply, basically, and I'd like to not retype the column names for each run. Here's the full code:
cols <- c('Col1','Col2','Col3')
rbind ( calc.means(dat,cols),
data.frame(cols[1]='All', calc.means(dat, cols[2:3])),
data.frame(cols[1]='All', cols[2]='All', calc.means(dat, cols[3]))
)
Use can use structure:
cols <- c("a","b")
foo <- structure(list(c(1, 2 ), c(3, 3)), .Names = cols, row.names = c(NA, -2L), class = "data.frame")
I don't get why you are doing this though!
I'm not sure how to do it directly, but you could simply skip the step of assigning names in the data.frame() command. Assuming you store the result of data.frame() in a variable named foo, you can simply do:
names(foo) <- cols
after the data frame is created
There is one trick. You could mess with lists:
cols_dummy <- setNames(rep(list("All"), 3), cols)
Then if you use call to list with one paren then you should get what you want
data.frame(cols_dummy[1], calc.means(dat, cols[2:3]))
You could use it on-the-fly as setNames(list("All"), cols[1]) but I think it's less elegant.
Example:
some_names <- list(name_A="Dummy 1", name_B="Dummy 2") # equivalent of cols_dummy from above
data.frame(var1=rnorm(3), some_names[1])
# var1 name_A
# 1 -1.940169 Dummy 1
# 2 -0.787107 Dummy 1
# 3 -0.235160 Dummy 1
I believe the assign() function is your answer:
cols <- c('Col1','Col2','Col3')
data.frame(assign(cols[1], rnorm(10)))
Returns:
assign.cols.1...rnorm.10..
1 -0.02056822
2 -0.03675639
3 1.06249599
4 0.41763399
5 0.38873118
6 1.01779018
7 1.01379963
8 1.86119518
9 0.35760039
10 1.14742560
With the lapply() or sapply() function, you should be able to loop the cbind() process. Something like:
operation <- sapply(cols, function(x) data.frame(assign(x, rnorm(10))))
final <- data.frame(lapply(operation, cbind))
Returns:
Col1.assign.x..rnorm.10.. Col2.assign.x..rnorm.10.. Col3.assign.x..rnorm.10..
1 0.001962187 -0.3561499 -0.22783816
2 -0.706804781 -0.4452781 -1.09950505
3 -0.604417525 -0.8425018 -0.73287079
4 -1.287038060 0.2545236 -1.18795684
5 0.232084366 -1.0831463 0.40799046
6 -0.148594144 0.4963714 -1.34938144
7 0.442054119 0.2856748 0.05933736
8 0.984615916 -0.0795147 -1.91165189
9 1.222310749 -0.1743313 0.18256877
10 -0.231885977 -0.2273724 -0.43247570
Then, to clean up the column names:
colnames(final) <- cols
Returns:
Col1 Col2 Col3
1 0.19473248 0.2864232 0.93115072
2 -1.08473526 -1.5653469 0.09967827
3 -1.90968422 -0.9678024 -1.02167873
4 -1.11962371 0.4549290 0.76692067
5 -2.13776949 3.0360777 -1.48515698
6 0.64240694 1.3441656 0.47676056
7 -0.53590163 1.2696336 -1.19845723
8 0.09158526 -1.0966833 0.91856639
9 -0.05018762 1.0472368 0.15475583
10 0.27152070 -0.2148181 -1.00551111
Cheers,
Adam