How to change multiple variable names in R?

I am new to R.
I am trying to create a variable list with variables containing the log of some other variables. I managed to create the list, but I don't know how to rename each variable. Moreover, I don't know how to make these variables be a part of my dataset. Here is what I do:
Imagine somedata is a csv file of the form:
v1, v2, v3, ..., vn
1, 4, 6, ..., 1
...
Then here is my script
##################
## Import Data
###################
lights <- read.csv("somedata.csv")
##################
## Variable Lists
###################
lights.varlist1 <- subset(lights, select=c(v1,v2,...,vJ))
###########
## Logs
###########
lights.logsvarlist1=apply(lights.varlist1, 2, function(y) log(y))
This part seems to be working just fine, as the results of print(lights.logsvarlist1)[1,] make sense.
To change the names of the variables I do:
for (i in 1:length(lights.logsvarlist1[1,]) {
name <-paste("l", names(lights.varlist1)[i], separator="")
names(lights.logsvarlist1)[i]=name
}
I have now two problems.
When I run print(lights.logsvarlist1[1,]), the names don't seem to have changed. I still have my old variable names as headers.
When I print(names(lights)), my newly created variables don't seem to be part of the dataset (they are not in the list).
What am I doing wrong? I am very new to R and I really want to continue, I'd appreciate any help.

DF <- data.frame(a=1:3, v1=4:6, v2=7:9, v3=10:12)
sub <- c("v1", "v2", "v3")
DF[, paste0("l", sub)] <- lapply(DF[, sub], log)
# a v1 v2 v3 lv1 lv2 lv3
# 1 1 4 7 10 1.386294 1.945910 2.302585
# 2 2 5 8 11 1.609438 2.079442 2.397895
# 3 3 6 9 12 1.791759 2.197225 2.484907

This works for me and avoids the for loop:
data = as.data.frame(matrix(abs(rnorm(100)), 10))
ldata = log(data)
names(ldata) = paste('log', names(ldata), sep = '')
Some other tips
apply(lights.varlist1, 2, function(y) log(y))
Can be replaced by
apply(lights.varlist1, 2, log)
since log is already a function, or simply by
log(lights.varlist1)
Instead of the following
for (i in 1:length(lights.logsvarlist1[1,])
use
ncol(lights.logsvarlist1)
Your new variables aren't in the lights data frame. They are in a separate object called
lights.logsvarlist1
To put them in the data frame, use merge or cbind. See ?merge and ?cbind for details.
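A minimal sketch tying these tips together, assuming a small made-up lights data frame and an illustrative vars vector (neither comes from the question). Note that apply() returns a matrix, which is why names() did not change the headers in the question (a matrix needs colnames()). Calling log() on the data frame subset keeps it a data frame, so names() behaves as expected.

```r
# Made-up stand-in for the data read from somedata.csv
lights <- data.frame(v1 = c(1, 4, 6), v2 = c(2, 5, 7), v3 = c(3, 8, 9))
vars <- c("v1", "v2")            # columns to log-transform (illustrative)

logs <- log(lights[vars])        # still a data frame, unlike apply()'s matrix
names(logs) <- paste0("l", vars) # new names: lv1, lv2
lights <- cbind(lights, logs)    # attach the new columns to the dataset
names(lights)
```

After this, the logged columns are part of lights itself, so names(lights) lists both the original and the new variables.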

Related

Why won't R recognize data frame column names within lists?

HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.

You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6

If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(x[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
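As a rough sketch of the [[ approach applied to data frame columns: store only the column names as strings and look them up with [[. The data frame and the cols vector below are made up for illustration.

```r
df <- data.frame(colA = 1:6, colB = letters[1:6])
cols <- c("colA", "colB")        # column names kept as plain strings

# Look up each column by name and record its length
col_lengths <- vapply(cols, function(nm) length(df[[nm]]), integer(1))

# The same lookup works on the left-hand side of an assignment
df[["colA"]] <- df[["colA"]] * 2L
```

Because the strings hold only the column name (not "df$colA"), no parsing or get() call is needed.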

(Pearson's) Correlation loop through the data frame

I have a data frame with 159 observations and 27 variables, and I want to correlate column 4 (variable 4) with each of the following columns in turn: column 4 with 5, then column 4 with 6, and so on. I've been trying unsuccessfully to write a loop, and since I'm a beginner in R, it turned out to be harder than I thought. I want to simplify this because I need to do the same thing for a couple more data frames, and a function that does it would be much easier and less time-consuming. It would be wonderful if anyone could help me.
df <- ZEB1_23genes # CHANGE ZEB1_23genes for df (dataframe)
for (i in colnames(df)){ # Check the class of the variables
print(class(df[[i]]))
}
print(df)
# Correlate ZEB1 with each of the 23 genes accordingly to Pearson's method
cor.test(df$ZEB1, df$PITPNC1, method = "pearson")
### OR ###
cor.test(df[,4], df[,5])
So I can correlate individually but I cannot create a loop to go back to column 4 and correlate it to the next column (5, 6, ..., 27).
Thank you!

If I've understood your question correctly, the solution below should work well.
#Sample data
df <- data.frame(matrix(data = sample(runif(100000), 4293), nrow = 159, ncol = 27))
#Correlation function
#Takes a data.frame whose columns contain the values to be correlated
#The column against which the other columns are correlated can be specified (start_col; default is 4)
#The last column to be correlated against start_col can also be specified (end_col; default 0 means the last column of the data.frame)
#Function returns a data.frame with one row per pair, holding start_col, end_col, and the correlation value.
my_correlator <- function(mydf, start_col = 4, end_col = 0){
if(end_col == 0){
end_col <- ncol(mydf)
}
#out_corr_df <- data.frame(start_col = c(), end_col = c(), corr_val = c())
out_corr <- list()
for(i in (start_col+1):end_col){
out_corr[[i]] <- data.frame(start_col = start_col, end_col = i, corr_val = as.numeric(cor.test(mydf[, start_col], mydf[, i])$estimate))
}
return(do.call("rbind", out_corr))
}
test_run <- my_correlator(df, 4)
head(test_run)
# start_col end_col corr_val
# 1 4 5 -0.027508521
# 2 4 6 0.100414199
# 3 4 7 0.036648608
# 4 4 8 -0.050845418
# 5 4 9 -0.003625019
# 6 4 10 -0.058172227
The function basically takes a data.frame as an input and spits out (as output) another data.frame containing correlations between a given column from the original data.frame against all subsequent columns. I do not know the structure of your data, and obviously, this function will fail if it runs into unexpected conditions (for instance, a column of characters in one of the columns).
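A possible variant of the same idea, sketched with lapply instead of an explicit loop and also keeping the p-value from cor.test. The data, the target index, and the column names of the result are all made up for illustration.

```r
set.seed(1)
df <- as.data.frame(matrix(rnorm(159 * 8), nrow = 159))  # made-up data
target <- 4                         # column to correlate against
others <- (target + 1):ncol(df)     # all columns after it

# One row per pair: column name, Pearson r, and p-value
res <- do.call(rbind, lapply(others, function(j) {
  ct <- cor.test(df[[target]], df[[j]], method = "pearson")
  data.frame(column = names(df)[j],
             r = unname(ct$estimate),
             p_value = ct$p.value)
}))
res
```

If only the coefficients are needed, cor(df[[target]], df[others]) returns them in a single call.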

Using %in% operator in R for categorical variables

I am trying to use the %in% operator in R to reproduce the following SAS code:
If weather in (2,5) then new_weather=25;
else if weather in (1,3,4,7) then new_weather=14;
else new_weather=weather;
The SAS code produces a variable "new_weather" that is 25 or 14 for the listed codes and otherwise keeps the value of "weather".
R code:
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[newcol] = df[col]
df[df[newcol] %in% c(2,5)]= 25
df[df[newcol] %in% c(1,3,4,7)] = 14
return(df)
}
Result: the values of "col" and "newcol" are the same after passing a data frame through the function "GS". The syntax does not seem to pick up the listed values for "newcol". I would appreciate an explanation of the reason and a possible fix.

Is this what you are trying to do?
df <- data.frame(A=seq(1:4), B=seq(1:4))
add_and_adjust <- function(df, copy_column, new_column_name) {
df[new_column_name] <- df[copy_column] # make copy of column
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(2,5), 25, df[,new_column_name])
df[,new_column_name] <- ifelse(df[,new_column_name] %in% c(1,3,4,7), 14, df[,new_column_name])
return(df)
}
Usage:
add_and_adjust(df, 'B', 'my_new_column')

df[newcol] is a data frame (with one column), df[[newcol]] or df[, newcol] is a vector (just the column). You need to use [[ here.
You also need to be assigning the result to df[[newcol]], not to the whole df. And to be perfectly consistent and safe you should probably test the col values, not the newcol values.
GS <- function(df, col, newcol){
# Pass a dataframe, col name, new column name
df[[newcol]] = df[[col]]
df[[newcol]][df[[col]] %in% c(2,5)] = 25
df[[newcol]][df[[col]] %in% c(1,3,4,7)] = 14
return(df)
}
GS(data.frame(x = 1:7), "x", "new")
# x new
# 1 1 14
# 2 2 25
# 3 3 14
# 4 4 14
# 5 5 25
# 6 6 6
# 7 7 14
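To illustrate the difference between [ and [[ described above, here is a small sketch on a made-up data frame. With single brackets, %in% sees a one-column data frame and returns a single value instead of one per row, so it cannot be used as a row mask.

```r
df <- data.frame(x = 1:7)

one_col_df <- df["x"] %in% c(2, 5)    # tested as a one-element list
as_vector  <- df[["x"]] %in% c(2, 5)  # tested element by element

length(one_col_df)   # a single value
length(as_vector)    # one value per row
```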

@user9231640, before you invest too much time in writing your own function, you may want to explore some of the recode functions that already exist in packages like car and Hmisc.
Depending on how complex your recoding gets your function will get longer and longer to check various boundary conditions or to change data types.
Just based upon your example you can do this in base R and it will be more self documenting and transparent at one level:
df <- data.frame(A=seq(1:30), B=seq(1:30))
df$my_new_column <- df$B
df$my_new_column <- ifelse(df$my_new_column %in% c(2,5), 25, df$my_new_column)
df$my_new_column <- ifelse(df$my_new_column %in% c(1,3,4,7), 14, df$my_new_column)

Assigning value to an R object without using its name with get()

I am having a problem with get() in R.
I have a set of data.frames with a common structure in my environment. I want to loop through these data frames and change the name of the 2nd column so that the name of the 2nd column contains a prefix from the 1st column.
For example, if column 1 = A_cat and column 2 is dog, I want column 2 to be changed to A_dog.
Below is an example of the R code I am using:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
for( element in grep('^df$', names(environment()), value=TRUE) ) {
colnames(get(element))[2] <- paste(strsplit(colnames(get(element)) [1], '_')[[1]][1],
colnames(get(element))[2], sep='_')
}
The arguments within the for loop, on either side of the assignment operator, both give the expected result if I run them separately but when run together produce the following error.
Error in colnames(get(element))[2] <- paste(strsplit(colnames(get(element))[1], :
could not find function "get<-"
Any help with this problem would be greatly appreciated.

This does the same thing as the code in the question without using get:
df <- data.frame('A_cat'= 1:10 , 'dog' = 11:20)
e <- environment() ##
df.names <- grep("^df$", names(e), value = TRUE)
# nm is the current data frame name and nms are its column names
for(nm in df.names) {
nms <- names(e[[nm]])
names(e[[nm]])[2] <- paste0(sub("_.*", "_", nms[1]), nms[2])
}
giving:
> df
A_cat A_dog
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
Keeping the data.frames in a named list as suggested in a comment to the question might be even better. For example, if instead of keeping the data.frames in an environment they were in a list called e
e <- list(df = df)
then omit the line marked ## and the rest works as is.
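A minimal sketch of the named-list variant using lapply, assuming two made-up data frames (df1 and df2 are illustrative names). Each data frame is renamed inside the function and returned, so no get() or assignment into an environment is needed.

```r
# Made-up data frames kept in a named list
dfs <- list(
  df1 = data.frame(A_cat = 1:3, dog = 4:6),
  df2 = data.frame(B_cat = 1:3, bird = 4:6)
)

dfs <- lapply(dfs, function(d) {
  prefix <- sub("_.*", "_", names(d)[1])       # "A_cat" becomes "A_"
  names(d)[2] <- paste0(prefix, names(d)[2])   # "dog" becomes "A_dog"
  d
})

lapply(dfs, names)
```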

Here would be one way to accomplish this goal if the data.frames have systematic names (here, df1 df2 df3, etc) and the prefix ends with "_" as in the example:
# as suggested by @roland, roll them up in a list:
myDfList <- mget(ls(pattern="^df"))
# change names
for(dfName in names(myDfList)) {
names(myDfList[[dfName]])[2] <- paste0(gsub("^(.*_).*", "\\1",
names(myDfList[[dfName]])[1]),
names(myDfList[[dfName]])[2])
}

Specifying column names from a list in the data.frame command

I have a list called cols with column names in it:
cols <- c('Column1','Column2','Column3')
I'd like to reproduce this command, but with a call to the list:
data.frame(Column1=rnorm(10))
Here's what happens when I try it:
> data.frame(cols[1]=rnorm(10))
Error: unexpected '=' in "data.frame(I(cols[1])="
The same thing happens if I wrap cols[1] in I() or eval().
How can I feed that item from the vector into the data.frame() command?
Update:
For some background, I have defined a function calc.means() that takes a data frame and a list of variables and performs a large and complicated ddply operation, summarizing at the level specified by the variables.
What I'm trying to do with the data.frame() command is walk back up the aggregation levels to the very top, re-running calc.means() at each step and using rbind() to glue the results onto one another. I need to add dummy columns with 'All' values in order to get the rbind to work properly.
I'm rolling cast-like margin functionality into ddply, basically, and I'd like to not retype the column names for each run. Here's the full code:
cols <- c('Col1','Col2','Col3')
rbind ( calc.means(dat,cols),
data.frame(cols[1]='All', calc.means(dat, cols[2:3])),
data.frame(cols[1]='All', cols[2]='All', calc.means(dat, cols[3]))
)

You can use structure:
cols <- c("a","b")
foo <- structure(list(c(1, 2 ), c(3, 3)), .Names = cols, row.names = c(NA, -2L), class = "data.frame")
I don't get why you are doing this though!

I'm not sure how to do it directly, but you could simply skip the step of assigning names in the data.frame() command. Assuming you store the result of data.frame() in a variable named foo, you can simply do:
names(foo) <- cols
after the data frame is created.

There is one trick. You could mess with lists:
cols_dummy <- setNames(rep(list("All"), 3), cols)
Then, if you index the list with single brackets (which returns a named list rather than the bare element), you should get what you want:
data.frame(cols_dummy[1], calc.means(dat, cols[2:3]))
You could use it on-the-fly as setNames(list("All"), cols[1]) but I think it's less elegant.
Example:
some_names <- list(name_A="Dummy 1", name_B="Dummy 2") # equivalent of cols_dummy from above
data.frame(var1=rnorm(3), some_names[1])
# var1 name_A
# 1 -1.940169 Dummy 1
# 2 -0.787107 Dummy 1
# 3 -0.235160 Dummy 1
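Applied to the original question, the on-the-fly form could look like the sketch below. The value column is a made-up stand-in for the output of calc.means(), which is not defined here.

```r
cols <- c("Column1", "Column2", "Column3")

# Equivalent of data.frame(Column1 = rnorm(10)), with the name taken from cols
d <- data.frame(setNames(list(rnorm(10)), cols[1]))

# Dummy "All" columns named from cols, recycled to match the other column
all_cols <- data.frame(setNames(list("All", "All"), cols[1:2]), value = 1:3)

names(d)
names(all_cols)
```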

I believe the assign() function is your answer:
cols <- c('Col1','Col2','Col3')
data.frame(assign(cols[1], rnorm(10)))
Returns:
assign.cols.1...rnorm.10..
1 -0.02056822
2 -0.03675639
3 1.06249599
4 0.41763399
5 0.38873118
6 1.01779018
7 1.01379963
8 1.86119518
9 0.35760039
10 1.14742560
With the lapply() or sapply() function, you should be able to loop the cbind() process. Something like:
operation <- sapply(cols, function(x) data.frame(assign(x, rnorm(10))))
final <- data.frame(lapply(operation, cbind))
Returns:
Col1.assign.x..rnorm.10.. Col2.assign.x..rnorm.10.. Col3.assign.x..rnorm.10..
1 0.001962187 -0.3561499 -0.22783816
2 -0.706804781 -0.4452781 -1.09950505
3 -0.604417525 -0.8425018 -0.73287079
4 -1.287038060 0.2545236 -1.18795684
5 0.232084366 -1.0831463 0.40799046
6 -0.148594144 0.4963714 -1.34938144
7 0.442054119 0.2856748 0.05933736
8 0.984615916 -0.0795147 -1.91165189
9 1.222310749 -0.1743313 0.18256877
10 -0.231885977 -0.2273724 -0.43247570
Then, to clean up the column names:
colnames(final) <- cols
Returns:
Col1 Col2 Col3
1 0.19473248 0.2864232 0.93115072
2 -1.08473526 -1.5653469 0.09967827
3 -1.90968422 -0.9678024 -1.02167873
4 -1.11962371 0.4549290 0.76692067
5 -2.13776949 3.0360777 -1.48515698
6 0.64240694 1.3441656 0.47676056
7 -0.53590163 1.2696336 -1.19845723
8 0.09158526 -1.0966833 0.91856639
9 -0.05018762 1.0472368 0.15475583
10 0.27152070 -0.2148181 -1.00551111
Cheers,
Adam