Updating dataframe values with functions at scale - r

I've got a data frame that requires some manual overrides of certain values given different conditions. But, these conditions tend to take a consistent form (I'm always going to be replacing one value with another, given a specific date range, etc.).
I'm wondering if there is a more elegant way to do the following; I suspect there is, given that I'm repeating function calls over and over. Here's an example:
set.seed(1234); library(dplyr)
data <- data.frame(
biz = sample(c("telco","shipping","tech"), 50, replace = TRUE),
region = sample(c("mideast","americas","asia"), 50, replace = TRUE),
date = rep(seq(as.Date("2010-02-01"), length=10, by = "1 day"),5)
)
Now, as described above, I want to change values in this dataframe subject to certain conditions about values of the other variables. My first thought is to use a function:
changeFunc <- function(df, daterange, string1, string2) {
  # Work on the df argument throughout, not the global `data`
  df$region <- ifelse(df$date %in% daterange &
                        df$biz == string1, string2, as.character(df$region))
  return(df)
}
And then to actually update the dataframe, call that function:
daterange <- as.Date('2010-02-05') + 0:02; string1 <- 'telco'; string2 <- 'southeast'
data <- changeFunc(data, daterange, string1, string2)
The problem is, I want to do this over and over again for different date ranges and values for "string 1" and "string 2", eg:
daterange <- as.Date('2010-02-07') + 0:03; string1 <- 'shipping'; string2 <- 'northeast'
data <- changeFunc(data, daterange, string1, string2)
data %>% arrange(biz, region, date)
Is there a more automated way to make many changes to dataframe values subject to consistent rules?
EDIT: Update with a bit more clarity:
What I ultimately want to do is not have to define 'daterange','string1','string2' and call the function repeatedly. Maybe there's some way to leverage an "apply" function to have a list of pre-defined values for the date ranges and strings and update the dataframe all at once? Something like:
valList <- list(daterange = as.Date('2010-02-05') + 0:02, string1 = "telco", string2 = "southeast",
daterange2 = as.Date('2010-02-07') + 0:03, string1_2 = "shipping", string2_2 = "northeast")
And apply this over the dataframe with the changeFunc
But as you can see, I'm a bit unclear about how the function would know that string1_2 is the condition that matches with daterange2 and not daterange
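One possible way around the pairing problem is to store each rule as its own small list, so a date range always travels with its two strings, and then fold the rules over the data frame. This is only a sketch under that assumption: `rules`, `applyRule` and `updated` are made-up names, not something from the question.

```r
# Each rule bundles the dates, the biz value to match and the new region,
# so the three pieces cannot get out of step with each other.
rules <- list(
  list(daterange = as.Date("2010-02-05") + 0:2, string1 = "telco",    string2 = "southeast"),
  list(daterange = as.Date("2010-02-07") + 0:3, string1 = "shipping", string2 = "northeast")
)

# Hypothetical helper: apply one rule to a data frame and return the result.
applyRule <- function(df, rule) {
  hit <- df$date %in% rule$daterange & df$biz == rule$string1
  df$region <- ifelse(hit, rule$string2, as.character(df$region))
  df
}

# Same example data as in the question, so the sketch runs on its own.
set.seed(1234)
data <- data.frame(
  biz    = sample(c("telco", "shipping", "tech"), 50, replace = TRUE),
  region = sample(c("mideast", "americas", "asia"), 50, replace = TRUE),
  date   = rep(seq(as.Date("2010-02-01"), length.out = 10, by = "1 day"), 5)
)

# Reduce() feeds the output of one rule into the next, in list order.
updated <- Reduce(applyRule, rules, data)
```

Adding another override then means adding one more element to `rules`; the function call itself does not change.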

Related

Parsing colnames text string as expression in R

I am trying to create a large number of data frames in a for loop using the "assign" function in R. I want to use the colnames function to set the column names in the data frame. The code I am trying to emulate is the following:
county_tmax_min_df <- data.frame(array(NA,c(length(days),67)))
colnames(county_tmax_min_df) <- c('Date',sd_counties$NAME)
county_tmax_min_df$Date <- days
The code I have so far in the loop looks like this:
file_vars = c('file1','file2')
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), "days")
f = 1
for (f in 1:2){
assign(paste0('county_',file_vars[f]),data.frame(array(NA,c(length(days),67))))
}
I need to be able to set the column names similar to how I did in the above statement. How do I do this? I think it needs to be something like this, but I am unsure what goes in the text portion. The end result I need is just a bunch of data frames. Any help would be wonderful. Thank you.
expression(parse(text = ))
You can set the names within assign, like this:
file_vars = c('file1', 'file2')
days <- seq.Date(from = as.Date("1979-01-01"), to = as.Date("1979-01-02"), by = "days")
for (f in seq_along(file_vars)) {
assign(x = paste0('county_', file_vars[f]),
value = {
df <- data.frame(array(NA, c(length(days), 67)))
colnames(df) <- paste0("fancy_column_",
sample(LETTERS, size = ncol(df), replace = TRUE))
df
})
}
Inside the {} block you can use colnames(df) or setNames to assign column names in any manner desired. Your first piece of code refers to an sd_counties object that is not available here, but the general idea should work for you.
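If a named list would do instead of separate global variables, the same idea can be written without assign. This is a sketch, not the asker's code: `county_names` is a stand-in for sd_counties$NAME (which is not available), and `make_df` and `county_dfs` are hypothetical names.

```r
file_vars <- c("file1", "file2")
days <- seq(as.Date("1979-01-01"), as.Date("1979-01-02"), by = "days")

# Assumption: 66 placeholder names standing in for sd_counties$NAME.
county_names <- paste0("county_", 1:66)

# Build one empty data frame with the Date column filled in.
make_df <- function() {
  df <- data.frame(array(NA, c(length(days), 67)))
  colnames(df) <- c("Date", county_names)
  df$Date <- days
  df
}

# One named list element per file instead of one variable per file.
county_dfs <- setNames(lapply(file_vars, function(f) make_df()),
                       paste0("county_", file_vars))
```

Each data frame is then reachable as county_dfs$county_file1 or county_dfs[["county_file2"]], which tends to be easier to loop over later than objects created with assign.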

How to filter for 'any value' in R?

Strange question, but how do I filter such that all rows are returned for a dataframe? For example, say you have the following dataframe:
Pts <- floor(runif(20, 0, 4))
Name <- c(rep("Adam",5), rep("Ben",5), rep("Charlie",5), rep("Daisy",5))
df <- data.frame(Pts, Name)
And say you want to set up a predetermined filter for this dataframe, for example:
Ptsfilter <- c("2", "1")
Which you will then run through the dataframe, to get your new filtered dataframe
dffil <- df[df$Pts %in% Ptsfilter, ]
At times, however, you don't want the dataframe to be filtered at all, and in the interests of automation and minimising workload, you don't want to have to go back and remove/comment-out every instance of this filter. You just want to be able to adjust the Ptsfilter value such that no rows will be filtered out of the dataframe, when that line of code is run.
I have experimented/guessed with things like:
Ptsfilter <- c("")
Ptsfilter <- c(" ")
Ptsfilter <- c()
to no avail.
Is there a value I can enter for Ptsfilter that will achieve this goal?
You might need to define a function to do this for you.
filterDF = function(df,filter){
if(length(filter)>0){
return(df[df$Pts %in% filter, ])
}
else{
return(df)
}
}
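A quick usage sketch for the function above, assuming the same Pts and Name columns as in the question (the seed and the names `some_rows` and `all_rows` are only for illustration). An empty filter such as c() has length 0, so the data frame comes back untouched.

```r
# Same logic as filterDF above, repeated so this snippet runs on its own.
filterDF <- function(df, filter) {
  if (length(filter) > 0) df[df$Pts %in% filter, ] else df
}

set.seed(1)
Pts <- floor(runif(20, 0, 4))
Name <- rep(c("Adam", "Ben", "Charlie", "Daisy"), each = 5)
df <- data.frame(Pts, Name)

some_rows <- filterDF(df, c("2", "1"))  # keeps only rows with Pts of 1 or 2
all_rows  <- filterDF(df, c())          # c() is NULL (length 0): no filtering
```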

R- How to do a loop on a list and output different dataframes

I'm attempting to create a loop in R that will use a vector of dates, run them through a loop that includes a SQL query, and then generate a separate dataframe for each output. Here is as far as I've gotten:
library(RODBC)
dvect <- as.Date("2015-04-13") + 0:2
d <- list()
for(i in list(dvect)){
queryData <- sqlQuery(myconn, paste("SELECT
WQ_hour,
sum(calls) as calls
FROM database
WHERE DDATE = '", i,"'
GROUP BY 1
", sep = ""))
d[i] <- rbind(d, queryData)
}
From what I can tell, the query portion of the code runs fine since I've tested it by itself. Where I'm stumbling is the last line where I try to save the contents of each loop through the query separately with each having a label of the date that was used in the loop.
I'd appreciate any help. I've only been using R consistently for about 2 months now so I'm definitely open to alternative ways of doing this that are cleaner and more efficient.
Thanks.
I'd suggest making the SQL query a function, and use lapply to apply it and return your result as a list.
userSQLquery = function(i) {
sqlQuery(myconn, paste("SELECT
WQ_hour,
sum(calls) as calls
FROM database
WHERE DDATE = '", i,"'
GROUP BY 1
", sep = ""))
}
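dvect = as.Date("2015-04-13") + 0:2
d = as.list(as.character(dvect))  # pass the dates themselves, not their positions
names(d) = as.character(dvect)    # label each result with its date
results = lapply(d, userSQLquery)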
I have very little experience with SQL though, so this may not work. Maybe it could start you off?
Looks like a job for lapply (see the lapply documentation) instead of a for loop. (In R it's often good to avoid a for loop by using vectorization.)
If you want each date to return a separate data frame, and then have each data frame labelled with the original date, try:
dates <- c("Jan 1", "Oct 31", "Dec 25")
queryData <- function(date){
#dummy data
return(runif(5))
}
results <- lapply(dates, queryData)
names(results) <- dates
Either use:
d[[i]] <- queryData
if you want each data.frame (query result) as a separate element in the list output d.
Or use:
d <- rbind(d, queryData)
if you want a single data.frame with all the query outputs combined. In this case you should declare d as a data.frame (i.e. d <- data.frame()).
You can also store each data.frame (i.e. the query result) with its corresponding date in a list as:
d[[i]] <- list(date = dvect[[i]], queryResult = queryData)
I think the last one is what you are looking for.
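To make the options above concrete, here is a sketch that loops over positions rather than over the whole date vector. `fakeQuery` is a made-up stand-in for the sqlQuery() call, since the real database is not available, and `key` and `combined` are illustrative names.

```r
dvect <- as.Date("2015-04-13") + 0:2

# Assumption: stand-in for sqlQuery(); returns a small data frame per date.
fakeQuery <- function(day) {
  data.frame(WQ_hour = 0:2, calls = c(10, 20, 30))
}

# One list element per date, labelled with the date as a string.
d <- list()
for (i in seq_along(dvect)) {
  key <- as.character(dvect[i])
  d[[key]] <- fakeQuery(dvect[i])
}

# Optionally stack everything into one data frame, keeping the date as a column.
combined <- do.call(rbind, Map(function(day, res) cbind(DDATE = day, res),
                               names(d), d))
```

Indexing the list by the date string rather than by the Date value keeps the labels readable, for example d[["2015-04-14"]].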

Using paste to create logical expression for data frame subset

I have two dataframes, remove and dat (the actual dataframe). remove specifies various combinations of the factor variables found in dat, and how many to sample (remove$cases).
Reproducible example:
set.seed(83)
dat <- data.frame(RateeGender=sample(c("Male", "Female"), size = 1500, replace = TRUE),
RateeAgeGroup=sample(c("18-39", "40-49", "50+"), size = 1500, replace = TRUE),
Relationship=sample(c("Direct", "Manager", "Work Peer", "Friend/Family"), size = 1500, replace = TRUE),
X=rnorm(n=1500, mean=0, sd=1),
y=rnorm(n=1500, mean=0, sd=1),
z=rnorm(n=1500, mean=0, sd=1))
What I am trying to accomplish is to read in a row from remove and use it to subset dat. My current approach looks like:
remove <- expand.grid(RateeGender = c("Male", "Female"),
RateeAgeGroup = c("18-39","40-49", "50+"),
Relationship = c("Direct", "Manager", "Work Peer", "Friend/Family"))
remove$cases <- c(36,34,72,58,47,38,18,18,15,22,17,10,24,28,11,27,15,25,72,70,52,43,21,27)
# For each row of remove (combination of factor levels:)
for (i in 1:nrow(remove)) {
selection <- character()
# For each column of remove (particular selection):
for (j in 1:(ncol(remove)-1)){
add <- paste0("dat$", names(remove)[j], ' == "', remove[i,j], '" & ')
selection <- paste0(selection, add)
}
selection <- sub(' & $', '', selection) # Remove trailing ampersand
cat(selection, sep = "\n") # What does selection string look like?
tmp <- sample(dat[selection, ], size = remove$cases[i], replace = TRUE)
}
The output from cat() while the loop runs looks right, for example: dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct". If I paste that into dat[dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39" & dat$Relationship == "Direct" ,], I get the right subset.
However, if I run the loop as written with dat[selection, ], each subset only returns NAs. I get the same outcome if I use subset(). Note, I have replace = TRUE in the above solely because of the random sampling. In the actual application, there will always be more cases per combination than required.
I know I can dynamically construct formulas for lm() and other functions using paste() in this way, but am obviously missing something in translating this into working with [,].
Any advice would be really appreciated!
You cannot use character expressions as you describe to subset either with [ or subset. If you wanted to do that you would have to construct the entire expression, and then use eval. That said, there is a better solution using merge. For example, let's get all the entries in dat that match the first two rows from remove:
merge(dat, remove[1:2,])
If we want all the rows that don't match those two, then:
subset(merge(dat, remove[1:2,], all.x=TRUE), is.na(cases))
This is assuming you want to join on the columns with the same names across the two tables. If you have a lot of data you should consider using data.table as it is very fast for this type of operation.
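For illustration, here is a small sketch of both routes on a cut-down version of the data (two factor columns only, 100 rows; `keep`, `wanted`, `by_eval` and `by_merge` are made-up names). A character string used inside [ is treated as a row name, which is why the original loop returned NAs; parsing and evaluating the string first gives the logical vector that [ needs.

```r
set.seed(83)
dat <- data.frame(
  RateeGender   = sample(c("Male", "Female"), 100, replace = TRUE),
  RateeAgeGroup = sample(c("18-39", "40-49", "50+"), 100, replace = TRUE),
  X             = rnorm(100)
)

selection <- 'dat$RateeGender == "Male" & dat$RateeAgeGroup == "18-39"'

# Route 1: turn the string into an expression, evaluate it, then subset.
keep    <- eval(parse(text = selection))
by_eval <- dat[keep, ]

# Route 2: describe the wanted combination as a data frame and merge on the
# shared column names; no string building is needed.
wanted   <- data.frame(RateeGender = "Male", RateeAgeGroup = "18-39")
by_merge <- merge(dat, wanted)
```

Both routes select the same rows here; the merge version avoids eval(parse()) entirely, which is usually easier to debug.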
I upvoted BrodieG's answer before I realized it doesn't do what you wanted in situations where the size of the category is smaller than the number of samples desired. (In fact his method doesn't really do sampling at all, but I think it is an elegant solution to a different question, so I'm not reversing my vote. You could also use a similar split strategy, as illustrated below, with that data.frame as the input.)
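sub <- lapply(split(dat, with(dat, paste(RateeGender,   # split vector
                                         RateeAgeGroup,
                                         Relationship, sep = "_"))),
              function(d) {
                # Look up how many cases to draw for this combination
                n <- with(remove, cases[
                  as.character(RateeGender) == as.character(d$RateeGender[1]) &
                  as.character(RateeAgeGroup) == as.character(d$RateeAgeGroup[1]) &
                  as.character(Relationship) == as.character(d$Relationship[1])])
                cat(n, "")
                # Sample rows; sample(d, ...) on a data frame would sample columns
                d[sample(nrow(d), n, replace = TRUE), ]
              })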

Map over data frame columns, apply function to data if column meets condition

I'm pulling data from the Google Analytics API, processing it locally, then knitting an .Rmd file into text, tables, and visualisations. As part of the knitting/tabling process, I'm doing some basic formatting (e.g. rounding off percentages and adding % signs).
For this question, I have toPercent(), which works fine if used like this:
toPercent <- function(percentData){
percentData <- round(percentData, 2)
percentData <- mapply(toString, percentData)
percentData <- paste(percentData, "%", sep="")
}
devices <- toPercent(devices$avgSessionDuration)
However, manually setting the function for every table is time-intensive. I created the percentCheck() to look for columns that matched my criteria:
percentCheck <- function(data){
data[,grep("rate|percent", names(data), ignore.case=TRUE)] <- toPercent(data[,grep("rate|percent", names(data), ignore.case=TRUE)])
}
devices <- percentCheck(devices)
But I know this doesn't work on a dataset with multiple matches (e.g. a column for exitRate and a column for bounceRate).
Q1: Have I written toPercent() in a way that won't return multiple values to one entry?
Q2: How can I structure percentCheck() to map over the dataset and only apply toPercent() if the column name includes a given string?
Version/Packages:
R version 3.1.1 (2014-07-10) -- "Sock it to Me"
library(rga)
library(knitr)
library(stargazer)
Data:
> dput(devices)
structure(list(deviceCategory = c("desktop", "mobile", "tablet"
), sessions = c(817, 38, 1540), avgSessionDuration = c(153.424888853179,
101.942758538617, 110.270988142292), bounceRate = c(39.0192297391397,
50.2915625371891, 50.1343873517787), exitRate = c(25.3257456030279,
32.0236280487805, 29.0991902834008)), .Names = c("deviceCategory",
"sessions", "avgSessionDuration", "bounceRate", "exitRate"), row.names = c(NA,
-3L), class = "data.frame")
How about this modification:
percentCheck <- function(data){
idx <- grepl("rate|percent", names(data), ignore.case=TRUE)
data[idx] <- lapply(data[idx], function(x) paste0(sprintf("%.2f", round(x,2)), "%"))
return(data)
}
Here, I first used grepl to create an index of the columns that meet the specified criteria. This index is then used with lapply to apply a function to all of those columns. The function applied is similar to your toPercent function; I just found it a bit more compact this way.
Now you can apply it to your whole data set in one go:
percentCheck(devices)
# deviceCategory sessions avgSessionDuration bounceRate exitRate
#1 desktop 817 153.4249 39.02% 25.33%
#2 mobile 38 101.9428 50.29% 32.02%
#3 tablet 1540 110.2710 50.13% 29.10%

Resources