Optimization of R data.table combination with for loop function

I have an 'Agency_Reference' table containing a column 'agency_lookup', with 200 string entries such as:
alpha
beta
gamma etc..
I have a dataframe 'TEST' with a million rows containing a 'Campaign' column with entries such as :
Alpha_xt2010
alpha_xt2014
Beta_xt2016 etc..
I want to loop through each entry in the reference table, find which string is present within each entry of the Campaign column, and create a new Agency_identifier column in the table.
My current code is below and is slow to execute. I am looking for guidance on how to optimize it, and I would like to learn how to do it the data.table way.
Agency_Reference <- data.frame(agency_lookup = c('alpha','beta','gamma','delta','zeta'))
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
TEST$Agency_identifier <- 0
for (agency_lookup in as.vector(Agency_Reference$agency_lookup)) {
  TEST$Agency_identifier <- ifelse(grepl(tolower(agency_lookup), tolower(TEST$Campaign)), agency_lookup, TEST$Agency_identifier)
}
Expected Output :
Campaign----Agency_identifier
alpha_xt123---alpha
ALPHA345----alpha
Beta_xyz_34----beta
BETa_testing----beta
code_delta_-----delta

Try
TEST <- data.frame(Campaign = c('alpha_xt123','ALPHA345','Beta_xyz_34','BETa_testing','code_delta_'))
pattern = tolower(c('alpha','Beta','gamma','delta','zeta'))
TEST$agency_identifier <- sub(pattern = paste0('.*(', paste(pattern, collapse = '|'), ').*'),
                              replacement = '\\1',
                              x = tolower(TEST$Campaign))
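A variation on the same idea, as a minimal base-R sketch: it builds one alternation pattern from the reference table and leaves NA where no agency is found (the sub() call above returns the whole lowercased campaign string in that case). The extra row 'no_agency_here' is made up for illustration. The sketch assumes the lookup strings contain no regex metacharacters, and it keeps the first agency that occurs in each string, which can differ from the original loop when a campaign contains several agencies.

```r
Agency_Reference <- data.frame(agency_lookup = c("alpha", "beta", "gamma", "delta", "zeta"))
TEST <- data.frame(Campaign = c("alpha_xt123", "ALPHA345", "Beta_xyz_34",
                                "BETa_testing", "code_delta_", "no_agency_here"))

# One pattern for all agencies: "alpha|beta|gamma|delta|zeta"
pattern <- paste(tolower(Agency_Reference$agency_lookup), collapse = "|")
campaign_lower <- tolower(TEST$Campaign)

# regexpr() gives the position of the first match per string, or -1 if none
hit <- regexpr(pattern, campaign_lower)
found <- as.integer(hit) > 0

# regmatches() returns the matched text only for the strings that matched
TEST$Agency_identifier <- NA_character_
TEST$Agency_identifier[found] <- regmatches(campaign_lower, hit)

TEST
```

The whole column is processed in a single vectorized pass instead of one pass per reference entry, which is usually where the time goes in the loop version.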

This will not answer your question per se, but from what I understand you want to dissect the Campaign column and do something with the values it provides.
Take a look at Tidy data, more specifically the part "Multiple variables stored in one column". I think you'll make some great progress using tidyr::separate. That way you don't have to use a for-loop.

Related

Creating a simple for loop in R

I have a tibble called 'Volume' in which I store some data (10 columns - the first 2 columns are characters, 30 rows).
Now I want to calculate the relative Volume of every column that corresponds to Column 3 of my tibble.
My current solution looks like this:
rel.Volume_unmod = tibble(
"Volume_OD" = Volume[[3]] / Volume[[3]],
"Volume_Imp" = Volume[[4]] / Volume[[3]],
"Volume_OD_1" = Volume[[5]] / Volume[[3]],
"Volume_WS_1" = Volume[[6]] / Volume[[3]],
"Volume_OD_2" = Volume[[7]] / Volume[[3]],
"Volume_WS_2" = Volume[[8]] / Volume[[3]],
"Volume_OD_3" = Volume[[9]] / Volume[[3]],
"Volume_WS_3" = Volume[[10]] / Volume[[3]])
rel.Volume_unmod
I would like to keep the tibble structure and the labels. I am sure there is a better solution for this, but I am relatively new to R so it's not obvious to me. What I tried is something like this, but I can't actually run it:
rel.Volume = NULL
for(i in Volume[,3:10]){
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
}
Mockup Data
Since you did not provide any data, I've followed your description to create some mockup data. Here:
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE))
Volume[3:10] <- rnorm(30*8)
Solution with Dplyr
library(dplyr)
# rename columns [brute force]
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(Volume)[3:10] <- cols
# divide by Volume_OD
rel.Volume_unmod <- Volume %>%
  mutate(across(all_of(cols), ~ . / Volume_OD))
# result
rel.Volume_unmod
Explanation
I don't know the names of your columns. Probably, the names correspond to the names of the columns you intended to create in rel.Volume_unmod. Anyhow, to avoid any problem I renamed the columns (kinda brutally). You can do it with dplyr::rename if you want to.
There are many ways to select the columns you want to mutate. mutate is a verb from dplyr that allows you to create new columns or perform operations or functions on columns.
across is an adverb from dplyr. Let's simplify by saying that it's a function that allows you to perform a function over multiple columns. In this case I want to perform a division by Volume_OD.
~ is a tidyverse way to create anonymous functions. ~ . / Volume_OD is equivalent to function(x) x / Volume_OD
all_of is necessary because in this specific case I'm providing across with a vector of characters. Without it, it will work anyway, but you will receive a warning because it's ambiguous and it may work incorrectly in some cases.
More info
Check out this book to learn more about data manipulation with tidyverse (which dplyr is part of).
Solution with Base-R
rel.Volume_unmod <- Volume
# rename columns
cols <- c("Volume_OD","Volume_Imp","Volume_OD_1","Volume_WS_1","Volume_OD_2","Volume_WS_2","Volume_OD_3","Volume_WS_3")
colnames(rel.Volume_unmod)[3:10] <- cols
# divide by columns 3
rel.Volume_unmod[3:10] <- lapply(rel.Volume_unmod[3:10], `/`, rel.Volume_unmod[[3]])
rel.Volume_unmod
Explanation
lapply is a base R function that allows you to apply a function to every item of a list or a "listable" object.
In this case rel.Volume_unmod is a listable object: a dataframe is just a list of vectors with the same length. Therefore, lapply takes one column [= one item] at a time and applies a function.
The function is /. You usually see / used like this: A / B, but actually / is a primitive function. You could write the same thing in this way:
`/`(A, B) # same as A / B
lapply can be provided with additional parameters that are passed directly to the function that is being applied over the list (in this case /). Therefore, we are passing rel.Volume_unmod[[3]] as an additional parameter. Note the double brackets: they extract the column as a plain vector, whereas rel.Volume_unmod[3] would return a one-column dataframe and each result would then be a dataframe rather than a vector.
lapply always returns a list. But since we are assigning the result of lapply to a "fraction of a dataframe", we just edit the columns of the dataframe and, as a result, we get a dataframe instead of a list. Let me rephrase in a more technical way. When you write rel.Volume_unmod[3:10] <- lapply(...), you are not simply assigning a list to rel.Volume_unmod[3:10]. You are technically using the assignment function [<-. This function lets you edit the items in a list/vector/dataframe. Specifically, [<- lets you assign new items without modifying the attributes of the list/vector/dataframe. As I said before, a dataframe is just a list with specific attributes. So when you use [<- you modify the columns, but you leave the attributes (the class data.frame in this case) untouched. That's why the magic works.
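The point about [<- can be checked on a tiny example. This is only a sketch with made-up data (df, scaled and out are illustrative names): lapply returns a plain list, yet assigning it into a column range of a dataframe leaves the dataframe class and the column names intact.

```r
df <- data.frame(id = c("a", "b", "c"), x = c(2, 4, 6), y = c(1, 2, 3))

# Divide columns 2 and 3 by column 2; df[[2]] extracts the column as a vector
scaled <- lapply(df[2:3], `/`, df[[2]])
class(scaled)   # a plain list, not a data.frame

# Assigning into a column range goes through `[<-` and keeps the attributes
out <- df
out[2:3] <- scaled
class(out)      # still a data.frame
out
```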
Without a minimal working example it's hard to guess what the variable Volume actually refers to. Apart from that, there seems to be a problem with your for-loop:
for(i in Volume[,3:10]){
Assuming Volume refers to a data.frame or tibble, this causes the actual column-vectors with indices between 3 and 10 to be assigned to i successively. You can verify this by putting print(i) inside the loop. But inside the loop it seems like you actually want to use i as a variable containing just the index of the current column as a number (not the column itself):
rel.Volume[i] = tibble(Volume = Volume[[i]] / Volume[[3]])
Also, two brackets are usually used with lists, not data.frames or tibbles. (You can, however, do so, because data.frames are special cases of lists.)
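Last but not least, initialising the variable rel.Volume with NULL is fragile: assigning into NULL by index does not give you a tibble or data.frame but silently builds a plain list, since you haven't told R what rel.Volume should be. Create it with the structure you want before the loop.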
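To make the difference between looping over columns and looping over indices concrete, here is a small sketch with made-up data (df, seen_values and seen_index are illustrative names):

```r
df <- data.frame(id = c("a", "b"), x = c(1, 2), y = c(3, 4))

# Looping over the data.frame itself: the loop variable is a whole column vector
seen_values <- list()
for (col in df[, 2:3]) {
  seen_values[[length(seen_values) + 1]] <- col
}

# Looping over column positions: the loop variable is just a number
seen_index <- integer(0)
for (i in 2:3) {
  seen_index <- c(seen_index, i)
}

seen_values  # the columns x and y
seen_index   # the positions 2 and 3
```

Only the second form gives a value that can be used as an index in Volume[[i]].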
Try this, if you like (thanks @Edo for example data):
set.seed(1)
Volume <- data.frame(ID = sample(letters, 30, TRUE),
GR = sample(LETTERS, 30, TRUE),
Vol1 = rnorm(30),
Vol2 = rnorm(30),
Vol3 = rnorm(30))
rel.Volume <- Volume[1:2] # Assuming you want to keep the IDs.
# Your data.frame will need to have the correct number of rows here already.
for (i in 3:ncol(Volume)){ # ncol gives the total number of columns in data.frame
rel.Volume[i] = Volume[i]/Volume[3]
}
A more R-like approach would be to avoid using a for-loop altogether, since R's strength is implicit vectorization. These expressions will produce the same result without a loop:
# OK, this one messes up variable names...
rel.V.2 <- data.frame(sapply(X = Volume[3:5], FUN = function(x) x/Volume[3]))
rel.V.3 <- data.frame(Map(`/`, Volume[3:5], Volume[3]))
Since you said you were new to R, frankly I would recommend avoiding the Tidyverse packages while you are still learning the basics. From my experience, in the long run you're better off learning base R first and adding the "sugar" when you're more familiar with the core language. You can still learn to use Tidyverse functions later (but then, why would anybody? ;-) ).

Dynamically call dataframe column & conditional replacement in R

First question post. Please excuse any formatting issues that may be present.
What I'm trying to do is conditionally replace a factor level in a dataframe column. The reason is a Unicode difference between a right single quotation mark (U+2019) and an apostrophe (U+0027).
All of the columns that need this replacement begin with "INN8", so I'm using
grep("INN8", colnames(demoDf)) -> apostropheFixIndices
for(i in apostropheFixIndices) {
levels(demoDfFinal[i]) <- c(levels(demoDf[i]), "I definitely wouldn't")
(insert code here)
}
to get the indices in order to perform the conditional replacement.
I've taken a look at a myriad of questions that involve naming variables on the fly: naming variables on the fly
as well as how to assign values to dynamic variables
and have explored the R-FAQ on turning a string into a variable and looked into Ari Friedman's suggestion that named elements in a list are preferred. However I'm unsure as to the execution as well as the significance of the best practice suggestion.
I know I need to do something along the lines of
demoDf$INN8xx[demoDf$INN8xx=="I definitely wouldn’t"] <- "I definitely wouldn't"
but the iterations I've tried so far haven't worked.
Thank you for your time!
If I understand you correctly, then you don't want to rename the columns. Then this might work:
demoDf <- data.frame(A=rep("I definitely wouldn’t",10) , B=rep("I definitely wouldn’t",10))
newDf <- apply(demoDf, 2, function(col) {
gsub(pattern="’", replacement = "'", x = col)
})
It just checks all columns for the wrong symbol.
Or if you have a vector containing the column indices you want to check then you could go with
# Let's say you identified columns 2, 5 and 8
cols <- c(2,5,8)
sapply(cols, function(col) {
demoDf[,col] <<- gsub(pattern="’", replacement = "'", x = demoDf[,col])
})
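Since the question is about factor columns, another option is to edit the factor levels directly, which keeps the columns as factors. The following is a minimal sketch under assumptions: the column names INN8a, INN8b and other, and the level "Maybe", are made up for illustration, and the target columns are selected by the "INN8" prefix as in the question.

```r
# Made-up data: two INN8 columns containing the typographic quote (U+2019)
demoDf <- data.frame(
  INN8a = factor(c("I definitely wouldn\u2019t", "Maybe")),
  INN8b = factor(c("Maybe", "I definitely wouldn\u2019t")),
  other = factor(c("x", "y"))
)

# Positions of the columns whose names start with "INN8"
fix_cols <- grep("^INN8", colnames(demoDf))

# Rewrite the levels of each selected factor, leaving the data codes untouched
demoDf[fix_cols] <- lapply(demoDf[fix_cols], function(f) {
  levels(f) <- gsub("\u2019", "'", levels(f), fixed = TRUE)
  f
})

levels(demoDf$INN8a)
```

Because only the levels are rewritten, no new level has to be added first, and no global assignment with <<- is needed.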

R : rename columns time series data

I am trying to rename the columns of a time series using assign function as follows -
assign(colnames(paste0(<logic_to_get_dataset>)),
c(<logic_to_get_column_names>))
I am getting a warning : In assign(colnames(get(paste0("xvars_", TopVars[j, 1], "_lag", :
only the first element is used as variable name
also, the column name assignment does not happen. I think this is happening because of colnames() function. Is there a workaround ?
The issue is that assign only looks at the first element of the vector.
You can try this, for example:
df = data.frame(x = 1:3, y = 4:2)
within(df, assign(colnames(df),c('a','b')))
You'll notice that R only looks at the first variable, and it tries to reassign the values that are described by those column names to the second value. This behavior is obviously not what you're looking for.
Unfortunately, it's kind of hacky, but you can always use something like this
data.frame.name = get_df() # some function that returns the data frame's name as text
data.frame.columns = get_cols() # some function that returns the new column names as text
# quote each column name so the generated code reads c("a","b") rather than c(a,b)
eval(parse(text = paste0('colnames(', data.frame.name, ') = c(',
paste0('"', data.frame.columns, '"', collapse = ','), ')')))
I prefer to avoid doing these kinds of expressions, but it should work as intended.
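A possible alternative that avoids building code as text, sketched under the assumption that the dataset's name is available as a string: fetch the object with get(), rename the columns on the copy, and store it back with assign(). The names xvars_demo, dataset_name and new_names are made up for illustration.

```r
# Hypothetical dataset whose name is only known as a string at run time
xvars_demo <- data.frame(x = 1:3, y = 4:2)
dataset_name <- "xvars_demo"
new_names <- c("a", "b")

tmp <- get(dataset_name)      # fetch the object by its name
colnames(tmp) <- new_names    # rename the columns on the local copy
assign(dataset_name, tmp)     # store the renamed copy under the same name

colnames(xvars_demo)
```

Here assign() receives a single name, so the "only the first element is used" warning does not come up, and the result can be checked with colnames(get(dataset_name)).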
Here it goes -
temp_var <- paste0('colnames(var_',TopLines[j,1],'_lag',get(paste0('uniqLg_',TopLines[j,1]))[k,],'_',get(paste0('uniqLg_',TopLines[j,1]))[k,]+12 ,
') <- c(gsub( "xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],'" , "xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],'__',get(paste0('uniqLg_',TopLines[j,1]))[k,]+12,
'", colnames(var_',TopLines[j,1],'_xt',get(paste0('uniqLg_',TopLines[j,1]))[k,],')))')
print(temp_var )
eval(parse( text=temp_var ))
where TopLines is a data frame with one column and contains a list of lines. The only problem with this method is that I can't test the output of eval unless I actually open the dataset and check whether the changes have taken effect.

The way R handles subsetting

I'm having some trouble understanding how R handles subsetting internally and this is causing me some issues while trying to build some functions. Take the following code:
f <- function(directory, variable, number_seq) {
##Create a empty data frame
new_frame <- data.frame()
## Add every data frame in the directory whose name is in the number_seq to new_frame
## the file variable specify the path to the file
for (i in number_seq){
file <- paste("~/", directory, "/",sprintf("%03d", i), ".csv", sep = "")
x <- read.csv(file)
new_frame <- rbind.data.frame(new_frame, x)
}
## calculate and return the mean
mean(new_frame[, variable], na.rm = TRUE) # (*) see the note below the code
}
*While calculating the mean I first tried to subset using the $ sign (new_frame$variable) and the subset function (subset(new_frame, select = variable)), but both only returned a None value. It only worked when I used new_frame[, variable].
Can anyone explain why the other ways of subsetting didn't work? It took me a really long time to figure it out, and even though I managed to make it work I still don't know why the other ways failed. I really want to look inside the black box so I won't have the same issues in the future.
Thanks for the help.
This behavior has to do with the fact that you are subsetting inside a function.
Both new_frame$variable and subset(new_frame, select = variable) look for a column in the dataframe with the name variable.
On the other hand, new_frame[, variable] uses the value of the variable argument of f(directory, variable, number_seq) to select the column.
The dollar sign ($) can only be used with literal column names. That avoids confusion with
dd<-data.frame(
id=1:4,
var=rnorm(4),
value=runif(4)
)
var <- "value"
dd$var
In this case, if $ accepted variables as well as column names, which column would you expect: dd$var, or dd$value (because var == "value")? That ambiguity is why dd[, var] behaves differently: the bracket form evaluates var first and uses the resulting character vector to pick the column, so dd[, var] returns the same column as dd$value.
I'm not quite sure why you got None with subset(); I was unable to replicate that problem.
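A small sketch of the difference, reusing the dd example with fixed numbers instead of random ones so the results are easy to read (the col_mean helper is made up for illustration):

```r
dd <- data.frame(id = 1:4, var = c(10, 20, 30, 40), value = c(1, 2, 3, 4))
var <- "value"

by_dollar  <- dd$var     # literal name: the column called "var"
by_bracket <- dd[, var]  # var is evaluated first: the column called "value"
by_double  <- dd[[var]]  # same column as dd[, var]

missing_col <- dd$variable  # no column has this literal name, so this is NULL

# Inside a function, pass the column name as a string and index with [[ ]]
col_mean <- function(df, variable) {
  mean(df[[variable]], na.rm = TRUE)
}
col_mean(dd, "value")
```

Inside f() from the question, new_frame$variable is the same situation as dd$variable here: the text after $ is taken literally and is never replaced by the value of the function argument.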

data table and data frame operations

I am using some R code that uses a data table class, instead of a data frame class.
How would I do the following operation in R without having to transform map.dt to a map.df?
map.dt = data.table(chr = c("chr1","chr1","chr1","chr2"), ref = c(1,0,3200,3641), pat = c(1,3020,3022, 3642), mat = c(1,0,3021,0))
parent = "mat"
chrom = "chr1"
map.df<-as.data.frame(map.dt);
parent.block.starts<-map.df[map.df$chr == chrom & map.df[,parent] > 0,parent];
Note: parent needs to be dynamically allocated; it's an input from the user. In this example I chose "mat" but it could be any of the columns.
Note1: parent.block.starts should be a vector of integers.
Note2: map.dt is a data table where the column names are c("chr","ref","pat","mat").
The problem is that in data tables I cannot access a given column by name, or at least I couldn't figure out how.
Please let me know if you have some suggestions!
Thanks!
It's a little unclear what the end goal is here, especially without sample data, but if you want to access columns by character name there are two ways to do this:
Columns = c("A", "B")
# .. means "look up one level"
dt[,..Columns]
dt[,get("A")]
dt[,list(get("A"), get("B"))]
But if you find yourself needing to use this technique often, you're probably using data.table poorly.
EDIT
Based on your edit, this line will return the same result, without having to do any as.data.frame conversion:
> map.dt[chr==chrom & get(parent) > 0, get(parent)]

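Putting the question's data and the get(parent) line together gives a runnable sketch. It assumes the data.table package is installed; starts_get and starts_vec are illustrative names. The second form uses [[ ]], which works because a data.table is also a list of columns.

```r
library(data.table)

map.dt <- data.table(chr = c("chr1", "chr1", "chr1", "chr2"),
                     ref = c(1, 0, 3200, 3641),
                     pat = c(1, 3020, 3022, 3642),
                     mat = c(1, 0, 3021, 0))
parent <- "mat"   # column name chosen by the user at run time
chrom  <- "chr1"

# Filter rows and return the chosen column as a plain vector
starts_get <- map.dt[chr == chrom & get(parent) > 0, get(parent)]

# Same result by extracting the column by name first
parent_values <- map.dt[[parent]]
starts_vec <- parent_values[map.dt$chr == chrom & parent_values > 0]

starts_get
```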