data table and data frame operations

data table and data frame operations - r

I am using some R code that uses a data table class, instead of a data frame class.
How would I do the following operation in R without having to transform map.dt to a map.df?
map.dt = data.table(chr = c("chr1","chr1","chr1","chr2"), ref = c(1,0,3200,3641), pat = c(1,3020,3022, 3642), mat = c(1,0,3021,0))
parent = "mat"
chrom = "chr1"
map.df<-as.data.frame(map.dt);
parent.block.starts<-map.df[map.df$chr == chrom & map.df[,parent] > 0,parent];
Note: parent needs to be dynamically allocated, its an input from the user. In this example I chose "mat" but it could be any of the columns.
Note1: parent.block.starts should be a vector of integers.
Note2: map.dt is a data table where the column names are c("chr","ref","pat","mat").
The problem is that in data tables I cannot access a given column by name, or at least I couldn't figure out how.
Please let me know if you have some suggestions!
Thanks!

It's a little unclear what the end goal is here, especially without sample data, but if you want to access rows by character name there are two ways to do this:
Columns = c("A", "B")
# .. means "look up one level"
dt[,..Columns]
dt[,get("A")]
dt[,list(get("A"), get("B"))]
But if you find yourself needing to use this technique often, you're probably using data.table poorly.
EDIT
Based on your edit, this line will return the same result, without having to do any as.data.frame conversion:
> map.dt[chr==chrom & get(parent) > 0, get(parent)]

Related

R Generic References to Data Frames and Variables

I would like to know how to make a reference to a data frame and variable generic, please. Say I have a data frame named 's' and a variable in that data frame named 'Y'.
Regular R code:
look = s$Y
What I would like to do:
data = s
variable = Y
look = data$variable (which functions the same as look = s$Y)
Any thoughts? The reason I would like to do this is that I have s$Y throughout my code, and later I may want to change s for t (or Y for some other variable), and don't want to have to go through all of my code manually replacing s$Y with t$Y where I need it changed.
Thanks!

This is the reason that the $-operator is considered poor-practice inside function definitions, i.e. it "locks you in" to a particular spelling of a column name. You are not going to do this, however:
variable = Y
Rather you are going to do this:
variable = "Y"
And that is because the first version would have caused the R-interpreter to go out and try to identify a value for the symbol Y someplace in what is known as its "search path" which is roughly speaking all that functions and values that have been called and are still being processed since code was started. In the case of the second version "Y" is its own value and no further searching is needed. With that fundamental confusion corrected you would now do this
look <- data[[ variable ]] # although using 'data' as a name is another "poor-practice"
Whereupon R will look for a value of variable and find it in the global environment, returning the character "Y" and delivering a column named "Y" from the dataset s. Column names are not considered first-class objects in R, whereas named dataframes are. The "names" of columns are not true R names (even though they are called colnames).. The $-operator is just shorthand for "[[" with a character value. Here's a full transcript to test this:
> s <- data.frame(Y=1:10, X=LETTERS[1:10]); data = s
>
> variable <- "Y"
>
> look1 <- data$Y; look2 <- data[["Y"]]
> identical(look1, look2)
[1] TRUE
The confusion that this "non-standard evaluation" (NSE) shorthand feature of R has caused new users appears to be one of the motivations for the creation of first the ggplot aes function and later the evolution of the package-dplyr and the tidyverse-bundle-of-packages. Those packages allow the use of non-quoted names or tokens to refer to column identities.

In addition to #42-'s answer, you can dynamically reference columns like this:
colName <- "something"
myDataFrame[,colname]
Edit: Since you also asked about dynamically referencing data.frames #Rich Scriven suggested making a function that takes the data.frame as an argument, which is one working solution. You can also just load the data you need at the top of your script, which is easy to change on the fly if you need:
fileName <- "file1.csv"
data <- read.table(fileName, header = TRUE, stringsAsFactors = FALSE)

As per -42 above, the best choice seems to be the packages referenced. Using a function is close but doesn't seem to allow 'data' and 'variable' to be generic in 'data$variable'.
Thanks everyone!

Optimization of R Data.table combination with for loop function

I have a 'Agency_Reference' table containing column 'agency_lookup', with 200 entries of strings as below :
alpha
beta
gamma etc..
I have a dataframe 'TEST' with a million rows containing a 'Campaign' column with entries such as :
Alpha_xt2010
alpha_xt2014
Beta_xt2016 etc..
i want to loop through for each entry in reference table and find which string is present within each campaign column entries and create a new agency_identifier column variable in table.
my current code is as below and is slow to execute. Requesting guidance on how to optimize the same. I would like to learn how to do it in the data.table way
Agency_Reference <- data.frame(agency_lookup = c('alpha','beta','gamma','delta','zeta'))
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
TEST$agency_identifier <- 0
for (agency_lookup in as.vector(Agency_Reference$agency_lookup)) {
TEST$Agency_identifier <- ifelse(grepl(tolower(agency_lookup), tolower(TEST$Campaign)),agency_lookup,TEST$Agency_identifier)}
Expected Output :
Campaign----Agency_identifier
alpha_xt123---alpha
ALPHA34----alpha
Beta_xyz_34----beta
BETa_testing----beta
code_delta_-----delta

Try
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
pattern = tolower(c('alpha','Beta','gamma','delta','zeta'))
TEST$agency_identifier <- sub(pattern = paste0('.*(', paste(pattern, collapse = '|'), ').*'),
replacement = '\\1',
x = tolower(TEST$Campaign))

This will not answer your question per se, but from what I understand you want to dissect the Campaign column and do something with the values it provides.
Take a look at Tidy data, more specifically the part "Multiple variables stored in one column". I think you'll make some great progress using tidyr::separate. That way you don't have to use a for-loop.

R: add column to dataframe, named based on formula

More 'feels like it should be' simple stuff which seems to be eluding me today. Thanks in advance for assistance.
Within a loop, that's within a function, I'm trying to add a column, and name it based on a formula.
I can bind a column & its name is taken from the bound object: data<-cbind(data,bothdata)
I can bind a column & manually name the bound object: data<-cbind(data,newname=bothdata)
I can bind a column which is the product of an equation & manually name the bound object: data<-cbind(data,newname2=bothdata-1)
Or another way: data <- transform(data, newColumn = bothdata-1)
What I can't do is have the name be the product of a formula. My actual formula-derived example name is paste("E_wgt",rev(which(rev(Esteps) == q))-1,"%") & equation for column: baddata - q.
A simpler one: data<-cbind(data,paste("magic",100,"beans")=bothdata-1). This fails because cbind isn't expecting the = even though it's fine in previous examples. Same fail for transform.
My first thought was assign but while I've used this successfully for creating forumla-named objects, I can't see how to get it to work for formula-named columns.
If I use an intermediary step to put the naming formula in an object container then use that, e.g.:
name <- paste("magic",100,"beans")
data<-cbind(data,name=bothdata-1)
the column name is "name" not "magic100beans". If I assign the equation result to an formula-named object:
assign(paste("magic",100,"beans"),bothdata-1)
Then try to cbind that via get:
data<-cbind(data,get(paste("magic",100,"beans")))
The column is called "get(paste("magic",100,"beans"))". Boo! Any thoughts anyone? It occurs to me that I can do cbind then separately colnames(data)[ncol(data)] <- paste("magic",100,"beans")) which I guess I'll settle for for now, but would still be interested to find if there was a direct way.
Thanks.

Chances are that cbind is overkill for your use case. In almost every instance, you can simply mutate the underlying data frame using data$newname2 <- data$bothdata - 1.
In the case where the name of the column is dynamic, you can just refer to it using the [[ operator -- data[["newcol"]] <- data$newname + 1. See ?'[' and ?'[.data.frame' for other tips and usages.
EDIT: Incorporated #Marek's suggestion for [["newcol"]] instead of [, "newcol"]

It may help you to know that data$col1 is the same than data[,"col1"] which is the same than data[,x] if x is "col1". This is how I usually access/set columns programmatically.
So this should work:
name <- paste("magic",100,"beans")
data[,name] <- obsdata-1
Note that you don't have to use the temporary variable name. This is equivalent to:
data$magic100beans <- obsdata-1
Itself equivalent, for a data.frame, to:
data<-cbind(data, magic100beans=bothdata-1)
Just so you know, you could also set the names afterwards:
old_names <- names(data)
name <- paste("magic",100,"beans")
data <- cbind(data, bothdata-1)
data <- setNames(data, c(old_names, name))
# or
names(data) <- c(old_names, name)

R : rename columns time series data

I am trying to rename the columns of a time series using assign function as follows -
assign(colnames(paste0(<logic_to_get_dataset>)),
c(<logic_to_get_column_names>))
I am getting a warning : In assign(colnames(get(paste0("xvars_", TopVars[j, 1], "_lag", :
only the first element is used as variable name
also, the column name assignment does not happen. I think this is happening because of colnames() function. Is there a workaround ?

The issue is that assign only looks at the first element of the vector.
You can try this, for example:
df = data.frame(x = 1:3, y = 4:2)
within(df, assign(colnames(df),c('a','b'))
You'll notice that R only looks at the first variable, and it tries to reassign the values that are described by those column names to the second value. This behavior is obviously not what you're looking for.
Unfortunately, it's kind of hackey, but you can always use something like this
data.frame.name = get_df()#some function that returns text
data.frame.columns = get_cols()#some function that returns text
eval(parse(text = paste0('colnames(',data.frame.name,') = c(',
paste(data.frame.columns,collapse = ','),')')))
I prefer to avoid doing these kinds of expressions, but it should work as intended.

Here it goes -
temp_var <- paste0('colnames(var_',TopLines[j,1],'_lag',get(paste0('uniqLg_',TopLines[j,1]))[k,],'_',get(paste0('uniqLg_',TopLines[j,1]))[k,]+12 ,
') <- c(gsub( "xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],'" , "xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],'__',get(paste0('uniqLg_',TopLines[j,1]))[k,]+12,
'", colnames(var_',TopLines[j,1],'_xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],')))')
print(temp_var )
eval(parse( text=temp_var ))
where TopLines is a data frame with one column and contains a list of lines. The only problem with this method is, I can't test the output of eval unless I actually open the dataset and see if the changes have been affected.

R equivalent to the MATLAB structure?

Is there an R type equivalent to the Matlab structure type?
I have a few named vectors and I try to store them in a data frame. Ideally, I would simply access one element of an object and it would return the named vectors (like a structure in Matlab). I feel that using a data frame is not the right thing to do since it can store the values of the named vectors but not the names when they differ from one vector to the other.
More generally, is it possible to store a bunch of different objects in a single one in R?
Edit: As Joran said I think that list does the job.
l = list()
l$vec1 = namedVector1
l$vec2 = namedVector2
...
If I have a list of names
name1 = 'vec1'
name2 = 'vec2'
is there any way for the interpreter to understand that when I use a variable name like name1, I am not referring to the variable name but to its content? I have tried get(name1) but it does not work.

I could still be wrong about what you're trying to do, but I think this is the best you're going to get in terms of accessing each list element by name:
l <- list(a= 1:3,b = 1:10)
> ind <- "a"
> l[[ind]]
[1] 1 2 3
Namely, you're going to have to use [[ explicitly.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

data table and data frame operations - r

Related

R Generic References to Data Frames and Variables

Optimization of R Data.table combination with for loop function

R: add column to dataframe, named based on formula

R : rename columns time series data

R equivalent to the MATLAB structure?

Categories

Resources