R: rename columns of time series data

I am trying to rename the columns of a time series using the assign function as follows:
assign(colnames(paste0(<logic_to_get_dataset>)),
c(<logic_to_get_column_names>))
I am getting a warning : In assign(colnames(get(paste0("xvars_", TopVars[j, 1], "_lag", :
only the first element is used as variable name
Also, the column name assignment does not happen. I think this is happening because of the colnames() function. Is there a workaround?

The issue is that assign only looks at the first element of the vector.
You can try this, for example:
df = data.frame(x = 1:3, y = 4:2)
within(df, assign(colnames(df), c('a','b')))
You'll notice that R only looks at the first variable, and it tries to reassign the values that are described by those column names to the second value. This behavior is obviously not what you're looking for.
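Unfortunately, it's kind of hacky, but you can always use something like this:
data.frame.name = get_df()  # some function that returns text (the name of the data frame)
data.frame.columns = get_cols()  # some function that returns text (the new column names)
# The new names are wrapped in quotes so that they are parsed as strings, not as symbols
eval(parse(text = paste0('colnames(', data.frame.name, ') = c(',
paste0('"', data.frame.columns, '"', collapse = ','), ')')))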
I prefer to avoid doing these kinds of expressions, but it should work as intended.
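If you would rather avoid eval(parse()) altogether, one possible alternative is to fetch the object with get(), rename its columns on the copy, and write it back with assign(). This is only a sketch: rename_by_name and df_a are made-up names for illustration, and the name string stands in for whatever your paste0() logic produces.

```r
# Hypothetical example object; in practice its name comes from paste0(...)
df_a <- data.frame(x = 1:3, y = 4:2)

# Hypothetical helper: rename the columns of an object that is known only by name
rename_by_name <- function(obj_name, new_names, envir = parent.frame()) {
  obj <- get(obj_name, envir = envir)   # fetch the object from its name
  colnames(obj) <- new_names            # rename the columns on the local copy
  assign(obj_name, obj, envir = envir)  # write the modified copy back under the same name
  invisible(obj)
}

rename_by_name(paste0("df_", "a"), c("a", "b"))
colnames(df_a)
# [1] "a" "b"
```

The key point is that assign() takes a single name and a whole object, so the renaming has to happen on the object before it is assigned back.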

Here it goes -
temp_var <- paste0('colnames(var_',TopLines[j,1],'_lag',get(paste0('uniqLg_',TopLines[j,1]))[k,],'_',get(paste0('uniqLg_',TopLines[j,1]))[k,]+12 ,
') <- c(gsub( "xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],'" , "xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],'__',get(paste0('uniqLg_',TopLines[j,1]))[k,]+12,
'", colnames(var_',TopLines[j,1],'_xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],')))')
print(temp_var )
eval(parse( text=temp_var ))
where TopLines is a data frame with one column that contains a list of lines. The only problem with this method is that I can't test the output of eval unless I actually open the dataset and check whether the changes have taken effect.

Related

R Generic References to Data Frames and Variables

I would like to know how to make a reference to a data frame and variable generic, please. Say I have a data frame named 's' and a variable in that data frame named 'Y'.
Regular R code:
look = s$Y
What I would like to do:
data = s
variable = Y
look = data$variable (which functions the same as look = s$Y)
Any thoughts? The reason I would like to do this is that I have s$Y throughout my code, and later I may want to change s for t (or Y for some other variable), and don't want to have to go through all of my code manually replacing s$Y with t$Y where I need it changed.
Thanks!
This is the reason that the $-operator is considered poor practice inside function definitions, i.e. it "locks you in" to a particular spelling of a column name. You are not going to do this, however:
variable = Y
Rather you are going to do this:
variable = "Y"
And that is because the first version would cause the R interpreter to go out and try to find a value for the symbol Y somewhere along what is known as its "search path", which is, roughly speaking, the current environment followed by the global environment and the attached packages. In the second version, "Y" is its own value and no further searching is needed. With that fundamental confusion corrected, you would now do this:
look <- data[[ variable ]] # although using 'data' as a name is another "poor-practice"
Whereupon R will look for a value of variable and find it in the global environment, returning the character "Y" and delivering a column named "Y" from the dataset s. Column names are not considered first-class objects in R, whereas named dataframes are. The "names" of columns are not true R names (even though they are called colnames). The $-operator is just shorthand for "[[" with a character value. Here's a full transcript to test this:
> s <- data.frame(Y=1:10, X=LETTERS[1:10]); data = s
>
> variable <- "Y"
>
> look1 <- data$Y; look2 <- data[["Y"]]
> identical(look1, look2)
[1] TRUE
The confusion that this "non-standard evaluation" (NSE) shorthand feature of R has caused new users appears to be one of the motivations for the creation of, first, the ggplot aes function and, later, the dplyr package and the tidyverse bundle of packages. Those packages allow the use of non-quoted names or tokens to refer to column identities.
In addition to @42-'s answer, you can dynamically reference columns like this:
colName <- "something"
myDataFrame[, colName]
Edit: Since you also asked about dynamically referencing data.frames, @Rich Scriven suggested making a function that takes the data.frame as an argument, which is one working solution. You can also just load the data you need at the top of your script, which is easy to change on the fly if you need to:
fileName <- "file1.csv"
data <- read.table(fileName, header = TRUE, stringsAsFactors = FALSE)
As per @42- above, the best choice seems to be the packages referenced. Using a function is close but doesn't seem to allow 'data' and 'variable' to be generic in 'data$variable'.
Thanks everyone!
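For anyone landing here later, a minimal base R sketch of the generic reference described in the answers: hold the column name as a string and use [[ instead of $. The names dat, t_df and get_col are made up for illustration.

```r
s    <- data.frame(Y = 1:10, X = LETTERS[1:10])
t_df <- data.frame(Y = 11:20)   # hypothetical second data frame to swap in later

dat      <- s     # change this single line to switch data frames
variable <- "Y"   # change this single line to switch columns

look <- dat[[variable]]
identical(look, s$Y)
# [1] TRUE

# The same idea wrapped in a (hypothetical) function, so nothing is hard-coded
get_col <- function(df, col) df[[col]]
get_col(t_df, "Y")
# [1] 11 12 13 14 15 16 17 18 19 20
```

With this pattern, both the data frame and the column are set in one place, so nothing else in the script has to be edited when either changes.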

R create dynamically subset of data.table

I would like to dynamically create subsets of my data.table based on the values in some columns.
In my data.table, I have the following variables: owner,2G,3G,4G.
2G,3G,4G are binary.
I want to create three subsets: one where 2G==1, one where 3G==1, one where 4G==1.
Example:
a=c("Paul",1,1,0)
b=c("George",1,0,0)
x=cbind(a,b)
colnames(x)=c("Owner","2G","3G","4G")
Here is my code:
all_names_df=c()
for(value in 2:4){
techno=paste0(value,"G")
name=paste0("arcep",techno)
all_df=c(all_names_df,name)
df=arcep[techno==1]
assign(name,df)
}
My new data.tables are created, but empty. I have tried several things (with eval, quote, changing the syntax, etc.) but I fail to reference the column properly.
EDIT:
I have tried something else, but it also fails:
techno=c("2G","3G","4G")
for(value in techno){
index=grep(value,colnames(arcep))
print(index)
set1=subset(arcep,arcep[,index]==1)
print(dim(set1))
assign(set1,paste0("ARCEP_",value))}
Error in `[.data.table`(arcep, , index) :
j (the 2nd argument inside [...]) is a single symbol but column name 'index' is not found. Perhaps you intended DT[,..index] or DT[,index,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.
Why does it say "column name 'index' is not found"? Why isn't the value of "index" taken into account? Using eval on index doesn't change anything.
Vectors and matrices cannot contain both numbers and characters; in your case the numbers are converted to characters.
This will work better to define your table. Note that syntactic column names can't start with a digit, so data.frame() would by default rename a column called 2G to X2G; the example below uses G2, G3 and G4 instead:
x <- data.frame(Owner = c("Paul","George"),
G2 = c(1,1),
G3 = c(1,0),
G4 = c(0,0),
stringsAsFactors= FALSE)
then here's your subset
subset(x,G3 == 1)
(also, you've used cbind instead of rbind in your question, you may want to edit it)
I finally found the answer:
for(value in techno){
set1=subset(arcep,arcep[,get(colnames(arcep)[grep(value,colnames(arcep))])]==1)
assign(paste0("ARCEP_",value),set1)
}
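A note on why the first loop produced empty tables: techno==1 compares the string "2G" with the number 1, which is a single FALSE, so no rows are selected. Below is a minimal base R sketch of the same loop that looks the column up by its name and collects the subsets in a named list instead of using assign. It uses a plain data.frame with check.names = FALSE so the 2G/3G/4G names survive; with a data.table, the analogous idiom should be arcep[get(techno) == 1].

```r
# Toy data mirroring the question; check.names = FALSE keeps the "2G" style names
arcep <- data.frame(Owner = c("Paul", "George"),
                    `2G` = c(1, 1), `3G` = c(1, 0), `4G` = c(0, 0),
                    check.names = FALSE, stringsAsFactors = FALSE)

"2G" == 1   # the comparison the first loop actually made
# [1] FALSE

subsets <- list()
for (techno in c("2G", "3G", "4G")) {
  # arcep[[techno]] fetches the column whose name is stored in techno
  subsets[[paste0("ARCEP_", techno)]] <- arcep[arcep[[techno]] == 1, ]
}

sapply(subsets, nrow)
# ARCEP_2G ARCEP_3G ARCEP_4G
#        2        1        0
```

Keeping the results in a list makes it easy to loop over them later, which is harder with separately assigned objects.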

Optimization of R Data.table combination with for loop function

I have an 'Agency_Reference' table containing a column 'agency_lookup', with 200 string entries as below:
alpha
beta
gamma etc..
I have a dataframe 'TEST' with a million rows containing a 'Campaign' column with entries such as :
Alpha_xt2010
alpha_xt2014
Beta_xt2016 etc..
I want to loop through each entry in the reference table, find which string is present within each entry of the Campaign column, and create a new agency_identifier column in the table.
My current code is below and is slow to execute. I would appreciate guidance on how to optimize it. I would like to learn how to do it the data.table way.
Agency_Reference <- data.frame(agency_lookup = c('alpha','beta','gamma','delta','zeta'))
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
TEST$Agency_identifier <- 0
for (agency_lookup in as.vector(Agency_Reference$agency_lookup)) {
TEST$Agency_identifier <- ifelse(grepl(tolower(agency_lookup), tolower(TEST$Campaign)), agency_lookup, TEST$Agency_identifier)}
Expected Output :
Campaign----Agency_identifier
alpha_xt123---alpha
ALPHA345----alpha
Beta_xyz_34----beta
BETa_testing----beta
code_delta_-----delta
Try
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
pattern = tolower(c('alpha','Beta','gamma','delta','zeta'))
TEST$agency_identifier <- sub(pattern = paste0('.*(', paste(pattern, collapse = '|'), ').*'),
replacement = '\\1',
x = tolower(TEST$Campaign))
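One thing to be aware of with the sub() call above: when none of the agencies match, sub() returns the (lower-cased) campaign string unchanged. If you would rather get NA in that case, a possible variant with regexpr() and regmatches() is sketched below. The extra "unknown_99" entry is made up to show the no-match case.

```r
campaign <- c("alpha_xt123", "ALPHA345", "Beta_xyz_34",
              "BETa_testing", "code_delta_", "unknown_99")
agencies <- c("alpha", "beta", "gamma", "delta", "zeta")

# One alternation pattern for all agencies, matched case-insensitively
rx  <- paste(agencies, collapse = "|")
hit <- regexpr(rx, campaign, ignore.case = TRUE)

# regmatches() returns only the successful matches, so fill those positions in
agency_identifier <- rep(NA_character_, length(campaign))
agency_identifier[hit > 0] <- tolower(regmatches(campaign, hit))

agency_identifier
# [1] "alpha" "alpha" "beta"  "beta"  "delta" NA
```

Like the sub() version, this is vectorised over the whole column, so there is no loop over the 200 reference entries.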
This will not answer your question per se, but from what I understand you want to dissect the Campaign column and do something with the values it provides.
Take a look at Tidy data, more specifically the part "Multiple variables stored in one column". I think you'll make some great progress using tidyr::separate. That way you don't have to use a for-loop.

R: add column to dataframe, named based on formula

More 'feels like it should be' simple stuff which seems to be eluding me today. Thanks in advance for assistance.
Within a loop, that's within a function, I'm trying to add a column, and name it based on a formula.
I can bind a column & its name is taken from the bound object: data<-cbind(data,bothdata)
I can bind a column & manually name the bound object: data<-cbind(data,newname=bothdata)
I can bind a column which is the product of an equation & manually name the bound object: data<-cbind(data,newname2=bothdata-1)
Or another way: data <- transform(data, newColumn = bothdata-1)
What I can't do is have the name be the product of a formula. My actual formula-derived example name is paste("E_wgt",rev(which(rev(Esteps) == q))-1,"%") & equation for column: baddata - q.
A simpler one: data<-cbind(data,paste("magic",100,"beans")=bothdata-1). This fails because cbind isn't expecting the = even though it's fine in previous examples. Same fail for transform.
My first thought was assign, but while I've used this successfully for creating formula-named objects, I can't see how to get it to work for formula-named columns.
If I use an intermediary step to put the naming formula in an object container then use that, e.g.:
name <- paste("magic",100,"beans")
data<-cbind(data,name=bothdata-1)
the column name is "name" not "magic100beans". If I assign the equation result to a formula-named object:
assign(paste("magic",100,"beans"),bothdata-1)
Then try to cbind that via get:
data<-cbind(data,get(paste("magic",100,"beans")))
The column is called "get(paste("magic",100,"beans"))". Boo! Any thoughts anyone? It occurs to me that I can do cbind then separately colnames(data)[ncol(data)] <- paste("magic",100,"beans"), which I guess I'll settle for for now, but I would still be interested to find out if there is a direct way.
Thanks.
Chances are that cbind is overkill for your use case. In almost every instance, you can simply mutate the underlying data frame using data$newname2 <- data$bothdata - 1.
In the case where the name of the column is dynamic, you can just refer to it using the [[ operator -- data[["newcol"]] <- data$newname + 1. See ?'[' and ?'[.data.frame' for other tips and usages.
EDIT: Incorporated @Marek's suggestion for [["newcol"]] instead of [, "newcol"]
It may help you to know that data$col1 is the same as data[,"col1"], which is the same as data[,x] if x is "col1". This is how I usually access/set columns programmatically.
So this should work:
name <- paste0("magic",100,"beans")
data[,name] <- bothdata-1
Note that you don't have to use the temporary variable name. This is equivalent to:
data$magic100beans <- bothdata-1
Itself equivalent, for a data.frame, to:
data<-cbind(data, magic100beans=bothdata-1)
Just so you know, you could also set the names afterwards:
old_names <- names(data)
name <- paste("magic",100,"beans")
data <- cbind(data, bothdata-1)
data <- setNames(data, c(old_names, name))
# or
names(data) <- c(old_names, name)
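Putting the [[ suggestion into the loop setting from the question, a minimal sketch might look like this. The data, the values of q and the simplified name formula are all made up for illustration; the real formula from the question would go in place of the paste0() call.

```r
# Toy inputs standing in for the objects in the question
data     <- data.frame(id = 1:3)
bothdata <- c(10, 20, 30)

for (q in c(5, 10)) {
  # Build the column name from a formula, then assign under that name
  col_name <- paste0("E_wgt", q, "%")
  data[[col_name]] <- bothdata - q
}

names(data)
# [1] "id"       "E_wgt5%"  "E_wgt10%"
```

Because the name is just a string passed to [[, it can be the result of any expression, and names that are not syntactic (such as ones ending in %) are kept as they are.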

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE)  # see the note below
}
Note: while calculating the mean, I first tried to subset using the $ sign, new_frame$variable, and the subset function, subset(new_frame, select = variable), but it would only return a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other ways of subsetting didn't work? It took me a really long time to figure it out, and even though I managed to make it work, I still don't know why it didn't work the other ways. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
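new_frame$variable looks for a column in the data frame that is literally named variable. subset(new_frame, select = variable) also checks the column names first and only falls back to the value of variable if there is no such column; in addition, it returns a data frame rather than a vector, which mean() cannot average.
On the other hand, new_frame[, variable] uses the value of the variable argument in f(directory, variable, number_seq) to select the column.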
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted both variables and column names, which would you expect: the dd$var column, or the dd$value column (because var == "value")? The dd[, var] form is different because it evaluates var and uses the resulting character vector, rather than treating var as a literal column name. You will get dd$value with dd[, var].
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
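To make the difference concrete, here is a small sketch with fixed numbers instead of random ones. col_mean is a made-up function name that mirrors the last line of f in the question.

```r
dd <- data.frame(id = 1:4,
                 var = c(0.1, 0.2, 0.3, 0.4),
                 value = c(10, 20, 30, 40))
var <- "value"

identical(dd$var, dd[["var"]])   # $ uses the literal name "var"
# [1] TRUE
identical(dd[, var], dd$value)   # [ uses the value stored in var
# [1] TRUE
is.null(dd$variable)             # no column is literally called "variable"
# [1] TRUE

# Hypothetical function: the column name arrives as a string argument
col_mean <- function(df, variable) mean(df[[variable]], na.rm = TRUE)
col_mean(dd, "value")
# [1] 25
```

Inside a function, df$variable would always look for a column called "variable", whatever string the caller passed in, which is why it comes back as NULL.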
