Using parameters within data.table column update by reference - r

I have the following data table and function which extracts a parameter and adds it as a column to a data table:
library(stringr)
library(data.table)
pq <- c("cm_mmc=MSN-_-Exact-_-BrandExact-_-CheddarCheese"
,"cm_mmc=Google-_-cheeseTypes-_-cheddar-_-cheesedelivered&gclid=CMXyy2D81MsCFcsW0w0dMVoPGw"
,"cm_mmc=MSN-_-worldcitiesuslocations-_-cheese-_-cheeseshops"
,"cm_mmc=MSN-_-worldcitiesuslocations-_-cheese-_-cheeseshops")
rq <- c("q=cheese&src=IE-SearchBox&FORM=IESR02",
"sa=L",
"q=london+cheese+shop&src=IE-TopResult&FORM=IETR02&pc=WCUG",
"q=london+cheese+shop&src=IE-TopResult&FORM=IETR02&pc=WCUG")
DT = data.table(page_urlquery = pq, refr_urlquery = rq)
# Extracts a parameter from the relevant query and adds it to the table as a column
extract_param <- function(dt, source = c("page_urlquery", "refr_urlquery"), param_name){
source <- match.arg(source)
regexp <- paste("(?i)", param_name, "=([^&]+)", sep="")
col_name <- switch(source
,"page_urlquery" = paste("url_", param_name, sep = "")
,"refr_urlquery" = paste("ref_", param_name, sep = "")
)
dt[,(col_name):= str_match((source), regexp)[,2]]
}
However when I call the function as follows:
extract_param(DT, "page_urlquery", "cm_mmc")
It creates the column, but the contents are blank. I think something is wrong with how I pass the (source) parameter inside the data.table call. What am I missing?

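Inside the data.table call, (source) evaluates to the character string itself (for example "page_urlquery"), not to the column with that name, so str_match never sees the column values. Use get(source) to look the column up by name. Change the line inside the function from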
dt[,(col_name):= str_match((source), regexp)[,2]]
to
dt[,(col_name):= str_match(get(source), regexp)[,2]]
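To see the difference between the two forms, here is a minimal sketch on a two-row table. The table contents and the column names wrong and right are made up for illustration; only str_match, get and := come from the post.

```r
library(data.table)
library(stringr)

# Small illustrative table (values are made up for this sketch)
DT <- data.table(page_urlquery = c("cm_mmc=MSN-_-Exact&x=1", "sa=L"))
source <- "page_urlquery"
regexp <- "(?i)cm_mmc=([^&]+)"

# (source) is only the string "page_urlquery", so the regex is matched
# against that string and never against the column values
DT[, wrong := str_match((source), regexp)[, 2]]

# get(source) fetches the column whose name is stored in `source`
DT[, right := str_match(get(source), regexp)[, 2]]

print(DT)
```

In this sketch the wrong column ends up all NA, while right holds the extracted value for the row that contains cm_mmc.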

Related

Parsing colnames text string as expression in R

I am trying to create a large number of data frames in a for loop using the "assign" function in R. I want to use the colnames function to set the column names in the data frame. The code I am trying to emulate is the following:
county_tmax_min_df <- data.frame(array(NA,c(length(days),67)))
colnames(county_tmax_min_df) <- c('Date',sd_counties$NAME)
county_tmax_min_df$Date <- days
The code I have so far in the loop looks like this:
file_vars = c('file1','file2')
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), "days")
f = 1
for (f in 1:2){
assign(paste0('county_',file_vars[f]),data.frame(array(NA,c(length(days),67))))
}
I need to be able to set the column names similar to how I did in the above statement. How do I do this? I think it needs to be something like this, but I am unsure what goes in the text portion. The end result I need is just a bunch of data frames. Any help would be wonderful. Thank you.
expression(parse(text = ))
You can set the names within assign, like this:
file_vars = c('file1', 'file2')
days <- seq.Date(from = as.Date("1979-01-01"), to = as.Date("1979-01-02"), by = "days")
for (f in seq_along(file_vars)) {
assign(x = paste0('county_', file_vars[f]),
value = {
df <- data.frame(array(NA, c(length(days), 67)))
colnames(df) <- paste0("fancy_column_",
sample(LETTERS, size = ncol(df), replace = TRUE))
df
})
}
Inside the {} block you can use colnames(df) or setNames to assign column names in any way you like. Your first piece of code refers to an sd_counties object that is not available here, but the general idea should work for you.
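A minimal sketch of the setNames variant, closer to the 'Date' plus county names layout from the question. The county_names vector is a hypothetical stand-in for sd_counties$NAME, which is not available in the post.

```r
file_vars <- c("file1", "file2")
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), by = "days")

# Hypothetical stand-in for sd_counties$NAME
county_names <- paste0("county_", 1:3)

for (f in seq_along(file_vars)) {
  # One column for the date plus one column per county
  df <- data.frame(array(NA, c(length(days), length(county_names) + 1)))
  df <- setNames(df, c("Date", county_names))
  df$Date <- days
  # Creates county_file1, county_file2, ... in the current environment
  assign(paste0("county_", file_vars[f]), df)
}

colnames(county_file1)
```

Keeping the data frames in a named list instead of using assign would also work and is often easier to loop over later, but that is a design choice outside what the question asked.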

convert character column and then split it into multiple new boolean columns using r mutate

I am attempting to split a flags column into multiple new columns in R using the mutate_at and separate functions. I have simplified and cleaned up my solution as shown below, but I am getting an error which indicates that the entire column is being passed into my function rather than each row individually. Is this normal behaviour, which just requires me to loop over each element of x inside my function? Or am I calling the mutate_at function incorrectly?
example data:
dataVariable <- data.frame(c_flags = c(".q.q.q","y..i.o","0x5a",".lll.."))
functions:
dataVariable <- read_csv("...",
col_types = cols(
c_date = col_datetime(format = ""),
c_dbl = col_double(),
c_flags = col_character(),
c_class = col_factor(c("a", "b", "c")),
c_skip = col_skip()
))
funTranslateXForNewColumn <- function(x){
binary = ""
if(startsWith(x, "0x")){
binary=hex2bin(x)
} else {
binary = c(0,0,0,0,0,0)
splitFlag = strsplit(x, "")[[1]]
for(i in splitFlag){
flagVal = 1
if(i=="."){
flagVal = 0
}
binary=append(binary, flagVal)
}
}
return(paste(binary[4:12], collapse='' ))
}
mutate_at(dataVariable, vars(c_flags), funs(funTranslateXForNewColumn(.)))
separate(dataVariable, c_flags, c(NA, "flag_1","flag_2","flag_3","flag_4","flag_5","flag_6","flag_7","flag_8","flag_9"), sep="")
The error I am receiving is:
Warning messages:
1: Problem with `mutate()` input `c_flags`.
i the condition has length > 1 and only the first element will be used
After translating the string into an appropriate binary representation of the flags, I will then use the separate function to split it into new columns.
This is similar to the OP's logic but shorter:
dataVariable$binFlags <- sapply(strsplit(dataVariable$c_flags, ''), function(x)
paste(as.integer(x != '.'), collapse = ''))
If you want to do this with dplyr, the same logic can be implemented as:
library(dplyr)
dataVariable %>%
mutate(binFlags = purrr::map_chr(strsplit(c_flags, ''),
~paste(as.integer(. != '.'), collapse = '')))
# c_flags binFlags
#1 .q.q.q 010101
#2 y..i.o 100101
#3 .lll.. 011100
mutate_at/across is used when you want to apply a function to multiple columns. Moreover, as far as I can see, you are creating only one new binary column here, not multiple new columns as mentioned in your post.
I was able to get the outcome I desired by replacing the mutate_at function with:
dataVariable$binFlags <- mapply(funTranslateXForNewColumn, dataVariable$c_flags)
However I want to know how to use the mutate_at function correctly.
Credit to: https://datascience.stackexchange.com/questions/41964/mutate-with-custom-function-in-r-does-not-work
That link also includes the solution for getting the function to work with mutate_at, which is to vectorize it:
v_funTranslateXForNewColumn <- Vectorize(funTranslateXForNewColumn)
mutate_at(dataVariable, vars(c_flags), funs(v_funTranslateXForNewColumn(.)))
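The warning appears because if() needs a single TRUE or FALSE, while mutate passes the whole column to the function at once. As a rough base R sketch of the full flow (translate each string, then split it into one boolean column per flag), the following applies a scalar helper element by element. The helper name to_bits and the flag_ column prefix are illustrative, and the "0x" branch is left out because hex2bin is not defined in the post.

```r
dataVariable <- data.frame(c_flags = c(".q.q.q", "y..i.o", ".lll.."),
                           stringsAsFactors = FALSE)

# Scalar helper (hypothetical name): one flag string in, one bit string out
to_bits <- function(x) {
  paste(as.integer(strsplit(x, "")[[1]] != "."), collapse = "")
}

# vapply calls to_bits once per element, so each call sees a single string
dataVariable$binFlags <- vapply(dataVariable$c_flags, to_bits, character(1),
                                USE.NAMES = FALSE)

# Split each bit string into one logical column per flag position
flag_matrix <- do.call(rbind, strsplit(dataVariable$binFlags, "")) == "1"
colnames(flag_matrix) <- paste0("flag_", seq_len(ncol(flag_matrix)))
dataVariable <- cbind(dataVariable, flag_matrix)

print(dataVariable)
```

Vectorize() from the linked answer does essentially the same thing as the vapply call: it wraps the scalar function so that it is invoked once per element.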

How to create a loop of ppcor function?

I am trying to write a loop that performs a correlation (and, in future, a partial correlation) with the pcor function from the ppcor package on variables stored in a data frame. The first variable (A) stays the same for all correlations, whilst the second variable (B) is the variable in the next column of my data frame. I have around 1000 variables.
I show the mtcars dataset below as an example, as it is in the same layout as my data.
I've been able to complete the operation successfully when performed manually using cbind to bind 2 columns (the 2 variables of interest) prior to running ppcor on the array ("tmp_df"). I have then been able to bind the output from correlation operation ("mpg_cycl"), ("mpg_disp") into a single object. However I can't get any of this operation to work in a loop. Any ideas please?
library("MASS")
install.packages("ppcor")
library("ppcor")
mtcars_df <- as.data.frame(mtcars)
tmp_df = cbind(mtcars_df$mpg, mtcars_df$cycl)
mpg_cycl <- pcor(as.matrix(tmp_df), method = 'spearman')
tmp_df1= cbind(mtcars_df$mpg, mtcars_df$disp)
mpg_disp <- pcor(as.matrix(tmp_df1), method = 'spearman')
combined_table <- do.call(cbind, lapply(list("mpg_cycl" = mpg_cycl,
mpg_disp" = mpg_disp), as.data.frame, USE.NAMES = TRUE))
Attempt at looping the above operation (amended after the last reviewer's comments):
for (i in mtcars_df[2:7]){
tmp_df = (cbind(i, mtcars_df$mpg)
i <- pcor(as.matrix(tmp_df), method = 'spearman')
write.csv(i, file = paste0("MyDataOutput",i[1],".csv")
}
I expected the loop to write the correlation results to MyDataOutput csv files, but it generates an error message. I thought i was in the correct place?
Error: unexpected symbol in:
" tmp_df = (cbind(i, mtcars_df$mpg)
i"
Even adding a curly bracket at the end does not resolve the issue, so I have left it out, as it introduces another error message about '}'.
I have redone some of your code and fixed the missing ), } and ". The for loop now writes one file per variable, named MyDataOutput_ plus the name of the variable. Hope this helps.
library("MASS")
#install.packages("ppcor")
library("ppcor")
mtcars_df <- as.data.frame(mtcars)
tmp_df = cbind(mtcars_df$mpg, mtcars_df$cyl)  # the mtcars column is named cyl, not cycl
mpg_cycl <- pcor(as.matrix(tmp_df), method = 'spearman')
tmp_df1= cbind(mtcars_df$mpg, mtcars_df$disp)
mpg_disp <- pcor(as.matrix(tmp_df1), method = 'spearman')
combined_table <- do.call(cbind, lapply(list("mpg_cycl" = mpg_cycl,
"mpg_disp" = mpg_disp), as.data.frame, USE.NAMES = TRUE))
for(i in colnames(mtcars_df[2:7])){
tmp_df = mtcars_df[c(i,"mpg")]
i_result <- pcor(as.matrix(tmp_df), method = 'spearman')
write.csv(i_result, file = paste0("MyDataOutput_",i,".csv"))
}
To merge the results into one object before saving:
dta <- c()
for(i in colnames(mtcars_df[2:7])){
tmp_df = mtcars_df[c(i,"mpg")]
i_result <- pcor(as.matrix(tmp_df), method = 'spearman')
dta <- rbind(dta,c(i,(unlist( i_result))))
}
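The same "collect, then save once" idea can be sketched without growing an object inside the loop. Note the swap: this sketch uses base R's cor() instead of ppcor::pcor(), on the assumption that with only two columns and no controlling variables the two estimates should coincide. It would need pcor again once a third (control) variable is added. The names vars, rho and results are illustrative.

```r
# Columns 2 to 7 of mtcars, each correlated against mpg
vars <- colnames(mtcars)[2:7]

# One Spearman coefficient per variable; vapply guarantees a numeric vector
rho <- vapply(vars,
              function(v) cor(mtcars[[v]], mtcars$mpg, method = "spearman"),
              numeric(1))

# Collect everything in one table, which can then be written out in one go
results <- data.frame(variable = vars, spearman_rho = rho, row.names = NULL)
print(results)
# write.csv(results, "MyDataOutput_all.csv", row.names = FALSE)
```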

R adding values to vector in for loop

I'm working in R & using a for loop to iterate through my data:
pos = c(1256:1301,6542:6598)
sd_all = null
for (i in pos){
nameA = paste("A", i, sep = "_")
nameC = paste("C", i, sep = "_")
resA = assign(nameA, unlist(lapply(files, function(x) x$percentageA[x$position==i])))
resC = assign(nameC, unlist(lapply(files, function(x) x$percentageC[x$position==i])))
sd_A = sd(resA)
sd_C = sd(resC)
sd_all = ?
}
Now I want to generate a vector called 'sd_all' that contains the standard deviations of resA and resC. I cannot just do 'sd_all = c(sd(resA), sd(resC))', because then I only keep the result for one value in 'pos'. I want to do it for all values in 'pos', of course.
It looks like you'd be best served with sd_all as a list object. That way you can insert each of your 2 values ( sd(resA) and sd(resC) ).
Initialising a list is simple (this would replace the second line of your code):
sd_all <- list()
Then you can insert both values into a single list element like so (this would replace the last line in your for loop):
sd_all[[ i ]] <- c( sd( resA ), sd( resC ) )
After your loop, you can then insert this list as a column in a data.frame if that's what you'd like to do.
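A runnable sketch of the whole loop, under the assumption that files is a list of data frames with position, percentageA and percentageC columns. The two small data frames below are made up for illustration. One deviation from the line above: the list is indexed by as.character(i) rather than by i, since indexing by a position such as 1256 would pad the list with empty elements up to that index.

```r
# Hypothetical stand-in for the `files` list from the question
files <- list(
  data.frame(position = c(1256, 1257), percentageA = c(10, 20), percentageC = c(1, 2)),
  data.frame(position = c(1256, 1257), percentageA = c(14, 26), percentageC = c(3, 6))
)
pos <- c(1256, 1257)

sd_all <- list()
for (i in pos) {
  resA <- unlist(lapply(files, function(x) x$percentageA[x$position == i]))
  resC <- unlist(lapply(files, function(x) x$percentageC[x$position == i]))
  # Index by name so the list is not padded with NULLs up to element i
  sd_all[[as.character(i)]] <- c(sd_A = sd(resA), sd_C = sd(resC))
}

# One row per position, one column per standard deviation
sd_table <- do.call(rbind, sd_all)
print(sd_table)
```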

returned objects within a list while keeping the original data structure in R

In R, I need to return two objects from a function:
myfunction <- function()
{
a.data.frame <- read.csv(file = input.file, header = TRUE, sep = ",", dec = ".")
index.hash <- get_indices_function(colnames(a.data.frame))
alist <- list("a.data.frame" = a.data.frame, "index.hash" = index.hash)
return(alist)
}
But the objects returned from myfunction all seem to become lists, rather than a data.frame and a hash.
Any help would be appreciated.
You can only return one object from an R function; this is consistent with pretty much every other language I've used. However, you'll note that the objects retain their original structure within the list, so alist[[1]] and alist[[2]] should be the data frame and the hash respectively, and are still structured as a data frame and a hash. Once you've returned them from the function, you can split them out into separate objects if you want :).
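A small sketch of that point: the elements keep their classes inside the returned list and can be pulled out again afterwards. The data frame contents are made up, and the named integer vector is only a hypothetical stand-in for whatever get_indices_function returns in the question.

```r
myfunction <- function() {
  # Made-up data in place of read.csv(input.file, ...)
  a.data.frame <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
  # Hypothetical stand-in for get_indices_function(colnames(a.data.frame))
  index.hash <- setNames(seq_along(a.data.frame), colnames(a.data.frame))
  list(a.data.frame = a.data.frame, index.hash = index.hash)
}

res <- myfunction()

# Each element keeps its original class inside the list
class(res$a.data.frame)

# Split the list back into separate objects if that is more convenient
a.data.frame <- res$a.data.frame
index.hash <- res[["index.hash"]]
```

Note that res[1] (single brackets) returns a list of length one, whereas res[[1]] or res$a.data.frame returns the data frame itself. Mixing these up is a common reason for an element to look like a list.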
You can use a structure.
return (structure(class = "myclass",
list(data = data.frame,
type = anytype,
page.content = page.content.as.string.vector,
knitr = knitr)))
Then you can access your data with
values <- myfunction(...)
values$data
values$type
values$page.content
values$knitr
and so on.
A working example from my package:
sju.table.values <- function(tab, digits=2) {
if (!inherits(tab, "ftable")) tab <- ftable(tab)
tab.cell <- round(100*prop.table(tab),digits)
tab.row <- round(100*prop.table(tab,1),digits)
tab.col <- round(100*prop.table(tab,2),digits)
tab.expected <- as.table(round(as.array(margin.table(tab,1)) %*% t(as.array(margin.table(tab,2))) / margin.table(tab)))
# -------------------------------------
# return results
# -------------------------------------
invisible (structure(class = "sjutablevalues",
list(cell = tab.cell,
row = tab.row,
col = tab.col,
expected = tab.expected)))
}
tab <- table(sample(1:2, 30, TRUE), sample(1:3, 30, TRUE))
# show expected values
sju.table.values(tab)$expected
# show cell percentages
sju.table.values(tab)$cell
