I'm trying to write a loop that, on each iteration (the number of iterations can vary between source files), constructs a mutate() statement to add a column based on the value of another column.
My programming background is in PHP, so to my mind this should work:
for(i in number){
  colname <- paste("Column", i, sep = "")
  filtercol <- paste("DateDiff_", i, sep = "")
  dataset <- mutate(dataset, a = ifelse(b >= 0 & b <= 364, 1, NA))
}
But, as I've noticed a couple of times now, some R functions outright ignore the fact that you have defined a variable with a given name, and mutate() does exactly that here.
So instead of getting several columns titled "a1", "a2", "a3", etc., I get one column titled "a" that gets overwritten on each iteration.
Firstly, can somebody point out where I'm going wrong here? Secondly, could someone explain under what circumstances R ignores variable names? It has happened a couple of times now and it seems wildly inconsistent at this point. I'm sure it isn't, and that there is logic behind it, but that logic is certainly well obfuscated.
It's also worth mentioning that originally I tried it this way:
just.dates <- just.dates %>%
  for(i in number){
    a <- paste("a", i, sep = "")
    filtercol <- paste("DateDiff_", i, sep = "")
    mutate(a = ifelse(filtercol >= 0 & filtercol <= 364), 1, NA)
  }
But that way R decided I was passing the for() loop four arguments when it only wanted three.
Something like this may work for you. The mutate_() function, as opposed to plain mutate(), is the standard-evaluation version and should help you out here.
library(dplyr)

# Create a data frame for testing
dataset <- data.frame(
  date = as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001",
                   "06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), "%d/%m/%Y"),
  event = c(0,0,1,0,1, 1,0,1,0,1),
  id = c(rep(1,5), rep(2,5)),
  DateDiff_1 = c(-2,0,34,700, rep(5,6)),
  DateDiff_2 = c(20,-12,360,900, rep(5,6))
)
# Set test number vector
number <- c(1:2)
# Begin loop through numbers
for(i in number){
  # Set the name of the new column to be created
  newcolumn <- paste("Column", i, sep = "")
  # Set the name of the column to be filtered on
  filtercolumn <- paste("DateDiff_", i, sep = "")
  # Create the expression to be passed into the mutate_ call
  mutate_function <- lazyeval::interp(~ ifelse(fc >= 0 & fc <= 364, 1, NA), fc = as.name(filtercolumn))
  # Apply the mutate_ call to the data frame
  dataset <- dataset %>%
    mutate_(.dots = setNames(list(mutate_function), newcolumn))
}
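For completeness: mutate_() and the lazyeval interface have since been deprecated in dplyr. A hedged sketch of the same loop using current tidy evaluation (the := operator with a string column name, and .data[[ ]] for the lookup), assuming the same dataset and number objects as above:

library(dplyr)

for (i in number) {
  newcolumn    <- paste0("Column", i)
  filtercolumn <- paste0("DateDiff_", i)
  # !!newcolumn := ... names the new column from the string held in newcolumn,
  # and .data[[filtercolumn]] looks the filter column up by its string name
  dataset <- dataset %>%
    mutate(!!newcolumn := ifelse(.data[[filtercolumn]] >= 0 &
                                   .data[[filtercolumn]] <= 364, 1, NA))
}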
Related
My data is imported into R as a list of 60 tibbles, each with 13 columns and 8 rows. I want to detect outliers (defined as more than 2*sd from the mean) by comparing each value in column "2" to the mean of the values in column "2" at the same row position across the 60 tibbles.
I know that I am on the wrong path with these lines, as I am not comparing the single values:
lapply(list, function(x){
  if(x$"2" > (mean(x$"2")) + (2*sd(x$"2")) || x$"2" < (mean(x$"2")) - (2*sd(x$"2"))) {}
})
Also, I was hoping to replace every value identified as an outlier with the corresponding mean calculated from the 60 values in the same position, while keeping everything else as it is, but I am also quite unsure how to do that.
Thank you!
You haven't added example data, so I've made a quick and simple example to demonstrate my answer. I think the logic is much more straightforward if you first combine the list of tibbles into a single tibble. That lets you do everything you want in a single dplyr pipe, ultimately flagging outliers with a 1 in the 'outlier' column:
library(tidyverse)

tibble1 <- tibble(colA = c(seq(1, 20, 1), 150),
                  colB = seq(0.1, 2.1, 0.1),
                  id = 1:21)

tibble2 <- tibble(colA = c(seq(101, 120, 1), -150),
                  colB = seq(21, 41, 1),
                  id = 1:21)

# N.B. if you don't have an 'id' column or equivalent, it makes things a lot
# easier if you add one; the 'id' column is essentially shorthand for a row index
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')
res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
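To cover the second part of the question (swapping each flagged value for the corresponding mean), one more mutate() on the same result should do it; a minimal sketch assuming the res object above:

res <- res %>%
  mutate(colA = ifelse(outlier == 1, meanA, colA)) %>%  # replace outliers with the group mean
  ungroup()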
I have my code:
new_df = data.frame()
#G = 0
for (i in 1:nrow(furin_data)) {
  frac = furin_data[i,3] / furin_data[i,5]
  #print(frac)
  if (frac > 2 || frac < 0.5) {
    name = furin_data[i,1]
    print(name)
    new_df = furin_data[i,]
    #print(new_df)
    #G = G + 1
  }
  write.csv(new_df, "C:\\User\\Documents\\MyData.csv", row.names = FALSE)
}
It creates a new data file, but only the last row is written, not all of the rows that meet the condition. I cannot seem to figure out where the problem is.
That's because you're assigning the row to new_df, so every assignment overwrites the previous one. What you want is to append rows to it instead:
new_df[nrow(new_df)+1,] = furin_data[i,]
Another thing: you created your new_df data frame without any columns, so none are assigned in the transfer. You should define it with the same column types and names as furin_data so those columns can be copied across. An easy way of initializing it as empty while keeping the same structure is:
new_df = furin_data[F,]
But in R, writing a loop is usually not the best way to do things. R is a vectorized language, meaning it can perform an operation on a whole vector at once, which executes much, much faster. A conversion of your whole code to a vectorized style would be:
library(dplyr)
new_df <- furin_data %>%
  mutate(frac = .[[3]] / .[[5]]) %>%   # column 3 divided by column 5, element-wise
  subset(frac > 2 | frac < 0.5)
write.csv(new_df, "C:\\User\\Documents\\MyData.csv", row.names = FALSE)
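If you prefer to stay in base R, the same vectorized idea works with logical subsetting; a minimal sketch against the same furin_data columns:

# Whole columns at once, no loop
frac   <- furin_data[[3]] / furin_data[[5]]
new_df <- furin_data[frac > 2 | frac < 0.5, ]
write.csv(new_df, "C:\\User\\Documents\\MyData.csv", row.names = FALSE)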
Currently I'm having an issue with computation time, because I run a triple for loop in R to create anomaly thresholds at the day-of-week and hour level for each unique ID.
My original data frame:
Unique ID, Event Date Hour, Event Date, Event Day of Week, Event Hour, Numeric Variable 1, Numeric Variable 2, etc.
df <- read.csv("mm.csv",header=TRUE,sep=",")
for (i in unique(df$customer_id)) {
  # Initialize the output data frame so I can rbind as I loop through the grains.
  # This data frame is emptied out once we move on to the next customer_id.
  output.df <- data_frame(seller_name = factor(), is_anomaly_date = integer(), event_date_hr = double(), event_day_of_wk = integer(), event_day = double(), ...)

  for (k in unique(df$event_day_of_wk)) {
    for (z in unique(df$event_hr)) {
      # Columns 10:19 are the 9 different numeric variables I am creating anomaly thresholds for
      merchant.df <- df[df$customer_id == i & df$event_day_of_wk == k & df$event_hr == z, 10:19]

      # 1st anomaly threshold - I have multiple different anomaly thresholds.
      # TRANSFORM VARIABLES - somewhere within the for loop I run another loop that transforms this subset of data.
      for (j in names(merchant.df)) {
        merchant.df[[paste(j, "_log")]] <- log(merchant.df[[j]] + 1)
        #merchant.df[[paste(j, "_scale")]] <- scale(merchant.df[[j]])
        #merchant.df[[paste(j, "_cube")]] <- merchant.df[[j]]**3
        #merchant.df[[paste(j, "_cos")]] <- cos(merchant.df[[j]])
      }

      mu_vector <- apply(merchant.df, 2, mean)
      sigma_matrix <- cov(merchant.df, use = "complete.obs", method = "pearson")
      inv_sigma_matrix <- ginv(sigma_matrix)
      det_sigma_matrix <- det(sigma_matrix)
      z_probas <- apply(merchant.df, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix)
      eps <- quantile(z_probas, 0.01)
      mv_outliers <- ifelse(z_probas < eps, TRUE, FALSE)

      # 2nd anomaly threshold
      nov <- ncol(merchant.df)
      pca_result <- PCA(merchant.df, graph = F, ncp = nov, scale.unit = T)
      pca.var <- pca_result$eig[['cumulative percentage of variance']] / 100
      lambda <- pca_result$eig[, 'eigenvalue']
      anomaly_score <- (as.matrix(pca_result$ind$coord) ^ 2) %*% (1 / as.matrix(lambda, ncol = 1))
      significance <- c(0.99)
      thresh <- qchisq(significance, nov)
      pca_outliers <- ifelse(anomaly_score > thresh, TRUE, FALSE)

      # Bind the anomaly flags to the subset, then row-bind onto the output data frame.
      # The code then loops back to the top for the next hour and day of the week;
      # temp.output.df is constantly remade and output.df slowly grows bigger.
      temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
      output.df <- rbind(output.df, temp.output.df)
    }
  }
  # Write the output for this particular customer_id; output.df is recreated
  # at the top of the loop for the next customer_id.
  write.csv(output.df, row.names = FALSE)
}
The code above shows the idea of what I'm doing. As you can see, I run three for loops, calculating multiple anomaly detections at the lowest grain (hour within day of week), and once I finish I write out a csv for every unique customer_id.
Each individual iteration runs quickly; however, the triple for loop overall is killing my performance. Does anyone know another way to do an operation like this, given my original data frame and the need to output a csv at every unique customer_id level?
So don't use a triple-loop. Use dplyr::group_by(customer_id, event_day_of_wk, event_hr), or the data.table equivalent. Both should be faster.
No need for explicit appending on every iteration with rbind() and cbind(), which will kill your performance.
Also, there is no need to cbind() your entire input df into your output df; your only actual outputs are mv_outliers and pca_outliers, and you could join() the input and output dfs later on customer_id, event_day_of_wk, event_hr.
EDIT: since you want to collate all results for each customer_id then write.csv() them, that needs to go in an outer level of grouping, and group_by(event_day_of_wk, event_hr) in the inner level.
# Here is pseudocode; you can figure out the rest, do things incrementally
# It looks like seller_name, is_anomaly_date, event_date_hr, event_day_of_wk, event_day, ... are variables from your input
require(dplyr)

output.df <- df %>%
  group_by(customer_id) %>%
  group_by(event_day_of_wk, event_hr) %>%
  # columns 10:19 ('foo','bar','baz'...) are the 9 different numeric variables I am creating anomaly thresholds for
  # Either a) you can hardcode their names in mutate(), summarize() calls
  # or b) you can reference the vars by string in mutate_(), summarize_() calls
  # TRANSFORM VARIABLES
  mutate(foo_log = log1p(foo), bar_log = log1p(bar), ...) %>%
  mutate(mu_vector = c(mean(foo_log), mean(bar_log), ...)) %>%
  # compute sigma_matrix, inv_sigma_matrix, det_sigma_matrix ...
  summarize(
    z_probas = mv_gaussian(mu_vector, det_sigma_matrix, inv_sigma_matrix),
    eps = quantile(z_probas, 0.01),
    mv_outliers = (z_probas < eps)
  ) %>%
  # similarly, use mutate() and do.call() for your PCA invocation...
  # Your outputs are mv_outliers, pca_outliers
  # You don't necessarily need to cbind(merchant.df, mv_outliers, pca_outliers), i.e. cbind all your input data together with your output
  # Now remove all your temporary variables from your output:
  select(-foo_log, -bar_log, ...) %>%
  # or else just select(mv_outliers, pca_outliers), the variables you want to keep
  ungroup() %>%  # (this ends the group_by(event_day_of_wk, event_hr) and combines all the intermediate dataframes for you)
  write.csv(c(.$mv_outliers, .$pca_outliers), file = '<this_customer_id>.csv')
  ungroup()      # group_by(customer_id)
See also "write.csv() in dplyr chain"
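For a more concrete (but still schematic) version of that structure with current dplyr and purrr, something like the sketch below may help. flag_anomalies() is a hypothetical stand-in for the per-group scoring (the question's mv_gaussian() and PCA steps), the numeric-column selection is an assumption, and the grouping columns follow the question:

library(dplyr)
library(purrr)

# Hypothetical stand-in for the per-group work: takes the rows of one
# day-of-week / hour cell and returns them with an outlier flag appended.
flag_anomalies <- function(d) {
  num <- select(d, where(is.numeric))               # assumption: score every numeric column
  scores <- mahalanobis(num, colMeans(num), cov(num))
  d$mv_outliers <- scores > quantile(scores, 0.99)  # stand-in for the density / PCA thresholds
  d
}

df %>%
  group_split(customer_id) %>%                      # one data frame per customer
  walk(function(cust) {
    out <- cust %>%
      group_by(event_day_of_wk, event_hr) %>%
      group_modify(~ flag_anomalies(.x)) %>%
      ungroup()
    # one csv per customer_id, as in the original outer loop
    write.csv(out, paste0(cust$customer_id[1], ".csv"), row.names = FALSE)
  })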
Given a dummy data frame that looks like this:
Data1<-rnorm(20, mean=20)
Data2<-rnorm(20, mean=21)
Data3<-rnorm(20, mean=22)
Data4<-rnorm(20, mean=19)
Data5<-rnorm(20, mean=20)
Data6<-rnorm(20, mean=23)
Data7<-rnorm(20, mean=21)
Data8<-rnorm(20, mean=25)
Index<-rnorm(20,mean=5)
DF<-data.frame(Data1,Data2,Data3,Data4,Data5,Data6,Data7,Data8,Index)
What I'd like to do is remove (make NA) certain columns per row based on the Index column. I took the long way and did this to give you an idea of what I'm trying to do:
DF[DF$Index>5.0,8]<-NA
DF[DF$Index>=4.5 & DF$Index<=5.0,7:8]<-NA
DF[DF$Index>=4.0 & DF$Index<=4.5,6:8]<-NA
DF[DF$Index>=3.5 & DF$Index<=4.0,5:8]<-NA
DF[DF$Index>=3.0 & DF$Index<=3.5,4:8]<-NA
DF[DF$Index>=2.5 & DF$Index<=3.0,3:8]<-NA
DF[DF$Index>=2.0 & DF$Index<=2.5,2:8]<-NA
DF[DF$Index<=2.0,1:8]<-NA
This works fine as is, but it is not very adaptable. If the number of columns changes, or I need to tweak the conditional statements, it's a pain to rewrite the entire block (the actual data set is much larger).
What I would like to do is be able to define a few variables, and then run some sort of loop or apply to do exactly what the lines of code above do.
As an example, in order to replicate my long code, something along the lines of this kind of logic:
NumCol<-8
Max<-5
Min<-2.0
if Index > Max, then drop column NumCol
if Index >= (Max - 0.5) & <= Max, then drop columns (NumCol - 1):NumCol
repeat until Min is reached
I don't know if that's the most logical line of reasoning in R, and I'm pretty bad with looping and apply, so I'm open to any approach that can replicate the long lines of code above while letting me adjust those variables.
If you don't mind changing your data.frame to a matrix, here is a solution that uses indexing by a matrix. The building of the two-column matrix of indices to drop is a nice review of the apply family of functions:
# Interval breakpoints for the Index column
Seq <- seq(Min, Max, by = 0.5)
# For each row, the columns to drop run from findInterval(...) + 1 up to NumCol
col.idx <- lapply(findInterval(DF$Index, Seq) + 1, seq, to = NumCol)
# Repeat each row number once per column it loses
row.idx <- mapply(rep, seq_along(col.idx), sapply(col.idx, length))
# Two-column (row, column) matrix of the cells to blank out
drop.idx <- as.matrix(data.frame(unlist(row.idx), unlist(col.idx)))
M <- as.matrix(DF)
M[drop.idx] <- NA
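If you need a data frame again afterwards, converting back is straightforward here because every column of DF is numeric, so nothing is lost in the round trip through as.matrix() (DF.na below is just a new name for the result):

DF.na <- as.data.frame(M)   # same shape as DF, with the dropped cells now NA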
Here is a memory-efficient (though I can't claim elegant) data.table solution.
It uses the very useful function findInterval() to replace your chain of less-than / greater-than conditions.
library(data.table)
DT <- data.table(DF)

# Create an index column: values 1 to 8 correspond to your greater-than / less-than bands
DT[, IND := findInterval(Index, c(-Inf, seq(2, 5, by = 0.5), Inf))]

# The columns you want to change
changing <- names(DT)[1:8]

setkey(DT, IND)

# Loop through the index values and alter by reference
for (.ind in DT[, unique(IND)]) {
  # The columns to blank out for this band: the higher the Index, the fewer columns dropped
  .which <- tail(changing, length(changing) + 1 - .ind)
  # Create a call of the form `:=`(a = as(NA, class(a)), b = as(NA, class(b)), ...)
  pairlist <- mapply(sprintf, .which, .which, MoreArgs = list(fmt = '%s = as(NA, class(%s))'))
  char_exp <- sprintf('`:=`( %s )', paste(pairlist, collapse = ','))
  .e <- parse(text = char_exp)
  DT[J(.ind), eval(.e)]
}
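A simpler alternative sketch uses data.table::set(), which assigns by reference without building expressions; it assumes the DT, IND, and changing objects created above:

# For each Index band, blank out the trailing columns by reference
for (.ind in DT[, unique(IND)]) {
  rows <- which(DT$IND == .ind)
  for (col in tail(changing, length(changing) + 1 - .ind)) {
    set(DT, i = rows, j = col, value = NA)   # NA is coerced to each column's type
  }
}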
I am trying to compile data from several files using for loops in R. I would like to get all the data into one table. The following calculation is just an example.
library(reshape)

dat1 <- data.frame("Specimen" = paste("sp", 1:10, sep=""),
                   "Density_1" = rnorm(10, 4, 2),
                   "Density_2" = rnorm(10, 4, 2),
                   "Density_3" = rnorm(10, 4, 2))
dat2 <- data.frame("Specimen" = paste("fg", 1:10, sep=""),
                   "Density_1" = rnorm(10, 4, 2),
                   "Density_2" = rnorm(10, 4, 2))
dat <- c("dat1", "dat2")

for(i in 1:length(dat)){
  data <- get(dat[i])
  melt.data <- melt(data, id = 1)
  assign(paste(dat[i], "tbl", sep=""), cast(melt.data, ~ variable, mean))
}

rbind(dat1tbl, dat2tbl)
What is the smoothest way to add an extra column to dat2? I would like it to have the same column name ("Density_3" in this case) and be filled with zeros if it does not already exist. Assume that I have ~100 tables with the number of columns (Density_1, 2, 3, etc.) varying between 5 and 6.
I tried the following, but it didn't work:
if(names(data) %in% "Density_3" == FALSE){
  dat.all$Density_3 <- 0
} else {
  dat.all$Density_3 <- dat.all$Density3
}
Another one: is there a smooth way to rbind() the tables? It seems that rbind(get(dat)) does not work.
After staring at this question for a while, I think its intent may have been obscured by the unnecessary get and assign manipulations. And I think the answer is plyr::rbind.fill.
I would have constructed "dat" not as a character vector but as a list of two data frames, used aggregate(..., FUN = mean) (because I haven't gotten on the reshape2/plyr bus, except for melt and rbind.fill, that is), and then called do.call(rbind.fill, ...) on the resulting list. At any rate, this is what I think you want. I do not think it is a good idea to add in zeros for what are really missing values.
> rbind.fill(dat1tbl, dat2tbl)
value Density_1 Density_2 Density_3
1 (all) 5.006709 4.088988 2.958971
2 (all) 4.178586 3.812362 NA
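A hedged sketch of that list-based approach, using the dat1 and dat2 objects defined above; colMeans() stands in for the aggregate() step since there is no grouping variable here, and rbind.fill() leaves the missing Density_3 as NA rather than 0:

library(plyr)

# Keep the data frames in a named list instead of get()/assign() on their names
dat.list <- list(dat1 = dat1, dat2 = dat2)

# One row of column means per data frame (dropping the Specimen column)
means.list <- lapply(dat.list, function(d) as.data.frame(t(colMeans(d[-1]))))

# rbind.fill() stacks the rows and fills columns absent from a table with NA
do.call(rbind.fill, means.list)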