Nest data inside loop - r

I'm currently having an issue where I'm trying to nest simulated data for an efficient frontier inside a tibble containing all 250 simulations. The tibble will have one column named "sim" indicating the number of the simulation, i.e. the rows in this column run from 1 to 250. The other column should contain the nested simulation data, which is a 3x123 tibble for each simulation. (I really hope this makes sense.)
I've tried to replicate the problem so that you don't need all of the previous code and data to see the issue. The problem is that the nested data is saved as a list:
library(tidyverse)

counter = 0
table <- tibble(sim = 1:250, obs = NA)

for (i in 1:250) {
  counter = counter + 1
  tibble <- tibble(a = NA, b = 1:113, c = 2, d = 3)
  tibble$a <- counter
  nested_tibble <- tibble %>% nest(data = -a) %>% select(-a)
  table$obs[i] <- nested_tibble
}
In this simplified reproducible example the values in the tibble are identical, whereas in the assignment I'm working on the tibble contains values for the efficient frontier. Variable 'a' in the tibble corresponds to the simulation number, and it is the variable I use to nest the efficient frontier. Afterwards I wish to remove this variable 'a' and insert the nested tibble into the corresponding 'obs' field, which is currently NA.
I really hope this makes sense. I'm still very new to R and coding. If you need any additional documentation, please let me know.

Your nested_tibble is a list containing a tibble. To access the tibble inside the list, you can use double bracket notation: nested_tibble[[1]]. So to get the result you want you can change your loop as follows:
counter = 0
table <- tibble(sim = 1:250, obs = NA)

for (i in 1:250) {
  counter = counter + 1
  tibble <- tibble(a = NA, b = 1:113, c = 2, d = 3)
  tibble$a <- counter
  nested_tibble <- tibble %>% nest(data = -a) %>% select(-a)
  table$obs[i] <- nested_tibble[[1]]  # [[1]] extracts the tibble from the list column
}
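As a side note, the loop can be avoided entirely: build all 250 simulations as one long tibble and nest it in a single step. A minimal sketch, assuming the same toy columns as the example above:
library(tidyverse)
table <- map_dfr(1:250, ~ tibble(sim = .x, b = 1:113, c = 2, d = 3)) %>%
  nest(obs = -sim)
# table now has 250 rows: a 'sim' column and an 'obs' list-column of 113x3 tibbles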


Can I use lapply to check for outliers in comparison to values from all listed tibbles?

My data is imported into R as a list of 60 tibbles each with 13 columns and 8 rows. I want to detect outliers defined as 2*sd by comparing each value in column "2" to the mean of all values of column "2" in the same row.
I know that I am on the wrong path with these lines, as I am not comparing the single values:
lapply(list, function(x) {
  if (x$"2" > (mean(x$"2") + 2*sd(x$"2")) || x$"2" < (mean(x$"2") - 2*sd(x$"2"))) {}
})
Also I was hoping to replace all values that are thus identified as outliers by the corresponding mean calculated from the 60 values in the same position as the outlier while keeping everything else, but I am also quite unsure how to do that.
Thank you!
You haven't added an example of your data, so I've made a quick and simple example to demonstrate my answer. I think the logic is much more straightforward if you first combine the list of tibbles into a single tibble. This allows you to do everything you want in a single dplyr pipe, ultimately identifying outliers by 1s in the 'outlier' column:
library(tidyverse)

tibble1 <- tibble(colA = c(seq(1, 20, 1), 150),
                  colB = seq(0.1, 2.1, 0.1),
                  id = 1:21)
tibble2 <- tibble(colA = c(seq(101, 120, 1), -150),
                  colB = seq(21, 41, 1),
                  id = 1:21)

# N.B. if you don't have an 'id' column or equivalent,
# it makes things a lot easier if you add one.
# The 'id' column is essentially shorthand for a row index.
tibbleList <- list(tibble1, tibble2)
joinedTibbles <- bind_rows(tibbleList, .id = 'tbl')

res <- joinedTibbles %>%
  group_by(id) %>%
  mutate(meanA = mean(colA),
         sdA = sd(colA),
         lowThresh = meanA - 2*sdA,
         uppThresh = meanA + 2*sdA,
         outlier = ifelse(colA > uppThresh | colA < lowThresh, 1, 0))
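You also asked about replacing flagged values with the corresponding mean; a minimal follow-up sketch, reusing the columns computed above:
res <- res %>%
  mutate(colA = ifelse(outlier == 1, meanA, colA)) %>%  # swap outliers for the per-id mean
  ungroup()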

Comparing each row of one dataframe with a row in another dataframe using R

I'm relatively new to R and I have looked for an answer for my problem but didn't find one. I want to compare two dataframes.
library(dplyr)
library(gtools)
v1 <- LETTERS[1:10]
combinations_from_4_letters <- as.data.frame(combinations(n = 10, r = 4, v = v1),
                                             stringsAsFactors = FALSE)
combinations_from_4_letters$group <- rep(1:15, each = 14)
combinations_from_2_letters <- as.data.frame(combinations(n = 10, r = 2, v = v1),
                                             stringsAsFactors = FALSE)
Dataframe 'combinations_from_4_letters' contains all combinations that can be made from 10 letters, without repetitions or permutations, binned into groups 1-15. I want to find out how often each pair of the 10 letters (stored in dataframe 'combinations_from_2_letters') occurs in each group (basically a frequency table). I started writing a complicated loop over both dataframes, but I think there must be a more 'R' solution, similar to comparing a dataframe and a vector like:
combinations_from_4_letters %in% combinations_from_2_letters[i, ]
Thank you in advance for your help!
I recommend an approach like the following:
# adding a dummy column for a complete cross-join
combinations_from_4_letters = combinations_from_4_letters %>%
  mutate(ones = 1)
combinations_from_2_letters = combinations_from_2_letters %>%
  mutate(ones = 1)

joined = combinations_from_2_letters %>%
  inner_join(combinations_from_4_letters, by = "ones") %>%
  # comparison goes here; comb2/comb4 are placeholders for your real column names
  mutate(within = ifelse(comb2 %in% comb4, 1, 0)) %>%
  group_by(comb2) %>%
  summarise(freq = sum(within))
You'll probably need to modify it to match your exact column names and comparison condition.
Key ideas:
adding a filler column so we have a complete cross-join
mutating a new indicator column for whether the two-letter pair is within the four-letter combination
summing the indicators per two-letter pair (a concrete sketch follows below)
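A concrete version of the same idea (a sketch; it assumes the default V1..V4 and V1..V2 column names that as.data.frame(combinations(...)) produces, and dplyr >= 1.0 for the .groups argument):
library(dplyr)
pair_counts <- combinations_from_2_letters %>%
  mutate(ones = 1) %>%
  inner_join(mutate(combinations_from_4_letters, ones = 1),
             by = "ones", suffix = c(".p", ".q")) %>%
  rowwise() %>%
  # a pair is "within" a combination when both its letters appear in it
  mutate(within = all(c(V1.p, V2.p) %in% c(V1.q, V2.q, V3, V4))) %>%
  group_by(V1.p, V2.p, group) %>%
  summarise(freq = sum(within), .groups = "drop")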

Putting rows from old data frame into a new data frame

I have my code:
new_df = data.frame()
#G = 0
for (i in 1:nrow(furin_data)) {
  frac = furin_data[i, 3] / furin_data[i, 5]
  #print(frac)
  if (frac > 2 || frac < 0.5) {
    name = furin_data[i, 1]
    print(name)
    new_df = furin_data[i, ]
    #print(new_df)
    #G = G + 1
  }
  write.csv(new_df, "C:\\User\\Documents\\MyData.csv", row.names = FALSE)
}
It creates a new data file, but only the last row is written, not all of the rows matching the condition. I cannot seem to figure out where the problem is.
That's because you're assigning the row to new_df, so every assignment overwrites the previous one. What you want is to add rows to it instead:
new_df[nrow(new_df)+1,] = furin_data[i,]
Another thing is that you created your new_df data frame without any columns, so none are assigned in the transfer. You should define it with the same column types and names as furin_data, so those columns can be copied. An easy way of initializing it as empty but with the same structure is:
new_df = furin_data[F,]
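Putting both fixes together, the corrected loop might look like this (with write.csv moved outside the loop so the file is only written once, after all matching rows are collected):
new_df = furin_data[F, ]  # empty, but with furin_data's columns
for (i in 1:nrow(furin_data)) {
  frac = furin_data[i, 3] / furin_data[i, 5]
  if (frac > 2 || frac < 0.5) {
    new_df[nrow(new_df) + 1, ] = furin_data[i, ]  # append instead of overwrite
  }
}
write.csv(new_df, "C:\\User\\Documents\\MyData.csv", row.names = FALSE)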
Buuut, in the R language, writing a loop is not the best way to do things. R is a vectorized language, meaning it can perform operations on whole vectors at once, which executes much, much faster. So a conversion of your whole code to R style would be:
library(dplyr)

new_df <- furin_data %>%
  mutate(frac = .[[3]] / .[[5]]) %>%  # [[ ]] extracts the columns as vectors
  filter(frac > 2 | frac < 0.5)

write.csv(new_df, "C:\\User\\Documents\\MyData.csv", row.names = FALSE)
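For comparison, the same vectorized filter in base R (a sketch, assuming columns 3 and 5 hold the values being compared, as in the original loop):
frac <- furin_data[[3]] / furin_data[[5]]
new_df <- furin_data[frac > 2 | frac < 0.5, ]
write.csv(new_df, "C:\\User\\Documents\\MyData.csv", row.names = FALSE)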

How to extract rows of a data frame between two characters

I've got some poorly structured data I am trying to clean. I have a list of keywords I can use to extract data frames from a CSV file. My raw data is structured roughly as follows:
There are 7 columns with values. The first column contains string identifiers, like a credit rating or a country symbol (for FX data), while the other 6 columns contain either a header such as a percentage-change string (e.g. +10%) or just a numerical value. Since I have all this data lumped together, I want to be able to extract the data for each category. For instance, I'd like to extract all the rows between the "credit" keyword and the "FX" keyword in the first column. Is there a way to do this easily in either base R or dplyr?
eg.
df %>%
filter(column1 = in_between("credit", "FX"))
Sample dataframe:
row 1: c('random', '-1%', '0%', '1%', '2%')
row 2: c('credit', NA, NA, NA, NA)
row 3: c('AAA', 1, 2, 3, 4)
...
row n: c('FX', '-1%', '0%', '1%', '2%')
And I would want the following output:
row 1: c('credit', '-1%', '0%', '1%', '2%')
row 2: c('AAA', 1, 2, 3, 4)
...
row n-1: ...
If I understand correctly you could do something like
start <- which(df$column1 == "credit")
end <- which(df$column1 == "FX")
df[start:(end-1), ]
Of course this won't work if "credit" or "FX" is in the column more than once.
Using what Brian suggested:
in_between <- function(df, start, end) {
  return(df[start:(end - 1), ])
}
Then loop over the indices in
dividers = which(df$column1 %in% keywords)
And save the function outputs however one would like.
lapply(1:(length(dividers)-1), function(x) in_between(df, start = dividers[x], end = dividers[x+1]))
This works. The data is messy, so I still have the annoying case where I need to keep the offset rows.
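One caveat with the lapply above: it stops at the last divider, so rows after the final keyword are dropped. A small tweak (a sketch, reusing the same in_between helper) captures that last block too:
# treat the end of the data frame as one extra divider
dividers <- c(which(df$column1 %in% keywords), nrow(df) + 1)
sections <- lapply(seq_len(length(dividers) - 1),
                   function(x) in_between(df, start = dividers[x], end = dividers[x + 1]))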
I'm still not 100% sure what you are trying to accomplish but does this do what you need it to?
library(dplyr)

set.seed(1)
df <- data.frame(
  x = sample(LETTERS[1:10]),
  y = rnorm(10),
  z = runif(10)
)
start <- c("C", "E", "F")
df2 <- df %>%
  mutate(start = x %in% start,
         group = cumsum(start))
split(df2, df2$group)
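Note that rows before the first keyword end up in group 0, since cumsum() is zero until the first match; they can be dropped before splitting:
df2 %>% filter(group > 0) %>% split(.$group)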

How to avoid for loop in R [triple-loop aka triple-threat]

Currently, I'm having an issue with computation time because I run a triple for loop in R to create anomaly thresholds at the day-of-week and hour level for each unique ID.
My original data frame:
Unique ID, Event Date Hour, Event Date, Event Day of Week, Event Hour, Numeric Variable 1, Numeric Variable 2, etc.
df <- read.csv("mm.csv", header = TRUE, sep = ",")

for (i in unique(df$customer_id)) {
  # I initialize the output data frame so I can rbind as I loop through the grains.
  # This data frame is always emptied out once we move onto our next customer_id.
  output.df <- data_frame(seller_name = factor(), is_anomaly_date = integer(), event_date_hr = double(), event_day_of_wk = integer(), event_day = double(), ...)

  for (k in unique(df$event_day_of_wk)) {
    for (z in unique(df$event_hr)) {
      # columns 10:19 are the 9 different numeric variables I am creating anomaly thresholds for
      merchant.df = df[df$customer_id == i & df$event_day_of_wk == k & df$event_hr == z, 10:19]

      # 1st anomaly threshold - I have multiple different anomaly thresholds
      # TRANSFORM VARIABLES - sometimes within the for loop I run another loop
      # that transforms the subset of data within it
      for (j in names(merchant.df)) {
        merchant.df[[paste(j, "_log")]] <- log(merchant.df[[j]] + 1)
        #merchant.df[[paste(j, "_scale")]] <- scale(merchant.df[[j]])
        #merchant.df[[paste(j, "_cube")]] <- merchant.df[[j]]**3
        #merchant.df[[paste(j, "_cos")]] <- cos(merchant.df[[j]])
      }

      mu_vector = apply(merchant.df, 2, mean)
      sigma_matrix = cov(merchant.df, use = "complete.obs", method = 'pearson')
      inv_sigma_matrix = ginv(sigma_matrix)
      det_sigma_matrix = det(sigma_matrix)
      z_probas = apply(merchant.df, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix)
      eps = quantile(z_probas, 0.01)
      mv_outliers = ifelse(z_probas < eps, TRUE, FALSE)

      # 2nd anomaly threshold
      nov = ncol(merchant.df)
      pca_result <- PCA(merchant.df, graph = F, ncp = nov, scale.unit = T)
      pca.var <- pca_result$eig[['cumulative percentage of variance']] / 100
      lambda <- pca_result$eig[, 'eigenvalue']
      anomaly_score = (as.matrix(pca_result$ind$coord) ^ 2) %*% (1 / as.matrix(lambda, ncol = 1))
      significance <- c(0.99)
      thresh = qchisq(significance, nov)
      pca_outliers = ifelse(anomaly_score > thresh, TRUE, FALSE)

      # This is where I bind the anomaly points with the original data frame and then
      # row bind to the final output data frame; then the code goes back to the top and
      # loops through the next hour and then day of the week. temp.output.df is constantly
      # remade and output.df slowly grows bigger.
      temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
      output.df <- rbind(output.df, temp.output.df)
    }
  }
  # Again, this is where I write the output for a particular customer_id;
  # output.df is then recreated at the top for the next customer_id
  write.csv(output.df, row.names = FALSE)
}
The code above shows the idea of what I'm doing. As you can see, I run 3 for loops, calculating multiple anomaly detections at the lowest grain, which is the hour level by day of the week; once finished, I output every unique customer_id level into a csv.
Each individual computation runs very fast; however, the triple for loop is killing my performance. Does anyone know another way I can do an operation like this, given my original data frame and the need to output a csv at every unique_id level?
So don't use a triple-loop. Use dplyr::group_by(customer_id, event_day_of_wk, event_hr), or the data.table equivalent. Both should be faster.
No need for explicit appending on every iteration with rbind and cbind which will kill your performance.
Also, no need to cbind() your entire input df into your output df; your only actual outputs are mv_outliers, pca_outliers; you could join() the input and output dfs later on customer_id, event_day_of_wk, event_hr
EDIT: since you want to collate all results for each customer_id then write.csv() them, that needs to go in an outer level of grouping, and group_by(event_day_of_wk, event_hr) in the inner level.
# Here is pseudocode; you can figure out the rest, do things incrementally.
# It looks like seller_name, is_anomaly_date, event_date_hr, event_day_of_wk, event_day, ... are variables from your input
require(dplyr)

output.df <- df %>%
  group_by(customer_id) %>%
  group_by(event_day_of_wk, event_hr, add = TRUE) %>%  # add = TRUE keeps the customer_id grouping
  # columns 10:19 ('foo','bar','baz'...) are the 9 different numeric variables you create anomaly thresholds for
  # Either a) you can hardcode their names in mutate(), summarize() calls
  # or b) you can reference the vars by string in mutate_(), summarize_() calls
  # TRANSFORM VARIABLES
  mutate(foo_log = log1p(foo), bar_log = log1p(bar), ...) %>%
  mutate(mu_vector = c(mean(foo_log), mean(bar_log), ...)) %>%
  # compute sigma_matrix, inv_sigma_matrix, det_sigma_matrix ...
  summarize(
    z_probas = mv_gaussian(mu_vector, det_sigma_matrix, inv_sigma_matrix),
    eps = quantile(z_probas, 0.01),
    mv_outliers = (z_probas < eps)
  ) %>%
  # similarly, use mutate() and do.call() for your PCA invocation...
  # Your outputs are mv_outliers and pca_outliers.
  # You don't necessarily need to cbind(merchant.df, mv_outliers, pca_outliers),
  # i.e. cbind all your input data together with your output.
  # Now remove all your temporary variables from your output:
  select(-foo_log, -bar_log, ...) %>%
  # or else just select(mv_outliers, pca_outliers), the variables you want to keep
  ungroup() %>%  # this ends the group_by(event_day_of_wk, event_hr) and binds the intermediate dataframes for you
  write.csv(c(.$mv_outliers, .$pca_outliers), file = '<this_customer_id>.csv')
  ungroup()  # this ends the group_by(customer_id)
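To make the idea concrete, here is a small runnable sketch (assuming dplyr >= 0.8 for group_walk, and using a single made-up numeric column x in place of the nine real variables and their thresholds):
library(dplyr)
df %>%
  group_by(customer_id, event_day_of_wk, event_hr) %>%
  mutate(z_score = (x - mean(x)) / sd(x),   # stand-in for the real anomaly logic
         mv_outlier = abs(z_score) > 2) %>%
  group_by(customer_id) %>%
  group_walk(~ write.csv(.x, paste0(.y$customer_id, ".csv"), row.names = FALSE))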
See also "write.csv() in dplyr chain"
