How to construct a loop function for a massive data frame in R

I'm trying to find outliers in many time series data frames by writing a loop function, so that the results are returned as a list.
Every row usually contains numbers or just zeros, but there are sometimes N/A values, and these should be excluded from the calculations.
Every row has a name (first column) followed by numeric values (remaining columns). For each row I find the mean, standard deviation, Tmin and Tmax, and then apply a simple filter:
mean <- mean(as.vector(as.numeric(x2[1,])))
sd <- sd(as.vector(as.numeric(x2[1,])))
Tmin <- mean - (3*sd)
Tmax <- mean + (3*sd)
x[which(x < Tmin | x > Tmax)]
Sample table:

Name                 2015Q1    2015Q2     2015Q3     2015Q4     2016Q1    ...
Banking sect.        63.4442   61.73465   67.33452   52.34546   12.3334   ...
Money laund. sect.   0         0          0          2.4        0         ...
Households           N/A       N/A        1.2        N/A        9.6       ...
...                  ...       ...        ...        ...        ...       ...
The result of the loop function should be a table containing only the *Name* column and the outlier values.
I've tried writing a loop function, but I don't fully understand loops. It should be reusable when new data frames with the identical template are added.
for (i in 1:nrow(x)) {
  mean <- mean(as.numeric(data.frame(x[i, 2:50])))
  sd <- sd(as.numeric(data.frame(x[i, 2:50])))
  Tmin <- mean - (3 * sd)
  Tmax <- mean + (3 * sd)
  print(x[which(x < Tmin | x > Tmax)])
  i <- i + 1
}
Would appreciate any advice.
Thank you

In R, for loops should be avoided like the plague. Use the function rowMeans to compute the mean of each row, and use apply to find the sd of each row. Here's some experimental code:
df <- cbind(
  Name = sample(stringi::stri_rand_strings(100, 1, "[A-Z]"), size = 100, replace = TRUE),
  data.frame(matrix(rnorm(n = 1000), nrow = 100))
)
means   <- rowMeans(df[, 2:ncol(df)], na.rm = TRUE)
stdDevs <- apply(X = df[, 2:ncol(df)], MARGIN = 1, FUN = sd, na.rm = TRUE)
Tmin <- means - 3 * stdDevs
Tmax <- means + 3 * stdDevs
outliers <- sapply(df, FUN = function(x) {x < Tmin | x > Tmax})
This yields a logical matrix, where each entry indicates whether the corresponding entry of df is an outlier. NOTE: the sapply function outputs TRUE for every entry in the Name column.
EDIT: details to get the result into the format the OP desires.
outliers <- data.frame(sapply(df, FUN = function(x) {x < Tmin | x > Tmax}))
outliers$Name <- df$Name
outliers <- tidyr::pivot_longer(
  outliers, names_to = "variable", cols = setdiff(colnames(outliers), "Name")
)
outliers <- outliers[outliers$value == TRUE, ]
Now you have a data frame where each row shows a Name-variable combination that is an outlier (according to the +/- three standard deviation rule you set forth).
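Since reusability on new, identically structured data frames was part of the ask, the whole pipeline can also be wrapped in a function. A minimal sketch (the name find_outliers and the assumption that the first column is Name are mine, not from the answer above):
find_outliers <- function(df, k = 3) {
  vals  <- as.matrix(df[, -1])          # numeric value columns
  means <- rowMeans(vals, na.rm = TRUE)
  sds   <- apply(vals, 1, sd, na.rm = TRUE)
  # the comparison recycles down columns, so row i is tested against its own
  # bounds; NA entries are never flagged, which drops them from the result
  flag <- vals < means - k * sds | vals > means + k * sds
  idx  <- which(flag, arr.ind = TRUE)   # row/column positions of outliers
  data.frame(Name     = df[[1]][idx[, "row"]],
             variable = colnames(vals)[idx[, "col"]],
             value    = vals[idx])
}
outliers <- find_outliers(df)  # reuse on any data frame with the same template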

Related

How can I set up a function that works for all columns of a data frame or matrix in R?

I have a data frame with 1000 rows and 10000 columns, and I want to set up a function that estimates v (corresponding to 2pq) for each column.
sample data can be
data = data.frame(replicate(1000, sample(0:2, 100, rep = TRUE))) # 1000 SNPs (columns) and 100 individuals (rows)
I can calculate v for the first column:
a = count(data$X1 == 2)  # total no. of 2s
b = count(data$X1 == 1)  # total no. of 1s
n = nrow(data)           # no. of rows (in the real data there can be NAs)
p = (a + (b * 0.5)) / n
q = 1 - p
v = 2 * p * q
v
I want to estimate v for all the columns.
Thanks in advance
Put the code in a function:
calculate <- function(v) {
  a = sum(v == 2)
  b = sum(v == 1)
  n = length(v)
  p = (a + (b * 0.5)) / n
  q = 1 - p
  v = 2 * p * q
  v
}
and, if you have a data frame, use sapply to calculate v for all columns:
result <- sapply(data, calculate)
You can also use apply with MARGIN = 2, which works for both data frames and matrices.
result <- apply(data, 2, calculate)
We can use dapply:
library(collapse)
dapply(data, calculate)
where
calculate <- function(v) {
  a = sum(v == 2)
  b = sum(v == 1)
  n = length(v)
  p = (a + (b * 0.5)) / n
  q = 1 - p
  v = 2 * p * q
  v
}
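As an aside (my addition, not part of the answers above): because the data are just 0/1/2 counts, the per-column function can be avoided entirely with colSums, which should also scale well to 10,000 columns:
# p per column from the counts of 2s and 1s; v = 2pq follows, fully vectorised
p <- (colSums(data == 2) + 0.5 * colSums(data == 1)) / nrow(data)
v <- 2 * p * (1 - p)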

purrr equivalent of lapply to return multi-element list?

I have a population of individuals that have attributes for whether or not they are alive, their sex, and age:
ind <- vector(mode = "list", 10)
for (i in seq(ind)) {
  ind[[i]]$alive <- 1
  ind[[i]]$sex <- sample(c("female", "male"), 1)
  ind[[i]]$age <- round(runif(1, min = 1, max = 10))
}
ind
Using lapply, I can increase the age of each individual, and get the list of individuals back with all of their attributes:
lapply(ind,function(x){x$age <- x$age+1; x})
Is there a map function from purrr that can do exactly the same thing (give the same output as lapply)? When I use map(), I only get back a list of ages, not all of the attributes for each individual:
map(ind, ~.$age+1)
After reading the help files, it looks like ?update_list is what you want:
map(ind, update_list, age = ~age + 1)
So: apply the update_list function to each item of ind, replacing the age element with the result of the expression age + 1.
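One caveat worth adding (my note, not part of the original answer): update_list has since been deprecated in newer purrr releases, so on a current installation a plain anonymous function is the safer equivalent:
library(purrr)
# same output as the lapply version: each individual keeps all its attributes
map(ind, ~{.x$age <- .x$age + 1; .x})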

How to calculate statistics over sequences of non zeros in a dataframe in R

I have a dataframe containing sequences as follows:
r1=c(0,0,0,1.2,5,0.5,3.3,0,0,2.1,0.7,1,3.3,0,0,0,0,2.5,4.2,1,5.2,0,0,0,0)
r2=c(0,0,3.5,5.1,2.5,0,0,0,0.6,1.7,1.6,1.2,1.6,0,0,0,0,1.5,1.8,1.5,0,0,0,0,0)
r=as.data.frame(cbind(r1,r2))
My actual data contain more columns and rows. For each column, I'd like the basic statistics (minimum/maximum/average) of the maxima of its sequences of non-zero values. That is, for each column I extract the maximum value of each run of successive non-zero values and then compute the statistics over those maxima.
Here I've written some functions to break your vectors into individual runs, extract the values you want (the maxima within the runs), and then apply the basic statistics you asked for. There may be a more elegant or more efficient method.
r1 <- c(0,0,0,1.2,5,0.5,3.3,0,0,2.1,0.7,1,3.3,0,0,0,0,2.5,4.2,1,5.2,0,0,0,0)
r2 <- c(0,0,3.5,5.1,2.5,0,0,0,0.6,1.7,1.6,1.2,1.6,0,0,0,0,1.5,1.8,1.5,0,0,0,0,0)
r <- as.data.frame(cbind(r1, r2))
my.stats.fun <- function(col){
  # sub-functions
  remove.successive.0s <- function(col){
    col <- c(col, 0)                 # ensure the vector ends with a 0
    i0  <- which(col == 0)
    i00 <- i0[which(diff(i0) == 1)]  # zeros immediately followed by another zero
    col2 <- col[-i00]                # collapse each run of zeros to a single 0
    if(col2[1] == 0){ col2 <- col2[-1] } # pops first 0
    return(col2)
  }
  run.indicator <- function(col){
    i0 <- which(col == 0)
    lr <- length(i0)
    runs <- rep(1:lr, times = c(i0 - c(0, i0[-lr]))) # label each run of values
    col  <- col[-i0]
    runs <- runs[-i0]
    return(list(values = col, index = runs))
  }
  basic.stats <- function(maxes){
    return(c(min = min(maxes), ave = mean(maxes), max = max(maxes)))
  }
  # apply functions
  col   <- remove.successive.0s(col)
  runs  <- run.indicator(col)
  maxes <- aggregate(runs$values, by = list(runs$index), max)[, 2]
  stats <- basic.stats(maxes)
  return(stats)
}
sapply(r, my.stats.fun)
#      r1       r2
# min 3.3 1.700000
# ave 4.5 2.866667
# max 5.2 5.100000
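For comparison, here is a more compact alternative built on rle (my sketch, not part of the answer above; seq.max is a name I made up):
seq.max <- function(col) {
  runs <- rle(col != 0)                               # runs of zero / non-zero values
  grp  <- rep(seq_along(runs$lengths), runs$lengths)  # run id for every element
  maxes <- tapply(col[col != 0], grp[col != 0], max)  # max within each non-zero run
  c(min = min(maxes), ave = mean(maxes), max = max(maxes))
}
sapply(r, seq.max)  # matches the output above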

Improve a for loop with apply inside

I have a data frame, ordered by the mean column, that looks like this:
10SE191_2 10SE207 10SE208 mean
7995783 12.64874 13.06391 12.69378 12.73937
8115327 12.69979 12.52285 12.41582 12.50363
8108370 12.58685 12.87818 12.66021 12.45720
7945680 12.46392 12.26087 11.77040 12.36518
7923547 11.98463 11.96649 12.50666 12.33138
8016718 12.81610 12.71548 12.48164 12.32703
I would like to apply a t.test to each row, using as input the intensity values df[i, 1:3] and the mean values from the rows with lower intensities. For example, for the first row I want to compute a t.test for df[1, 1:3] vs the mean values from rows 2 to 6. My code uses a for loop, but my current data frame has more than 20,000 rows and 24 columns, and it takes a long time. Any ideas for improving the code?
Thanks
Code:
temp <- matrix(-9, nrow = dim(matrix.order)[1], ncol = 2) # create a result matrix
l <- dim(matrix.order)[1]
for (i in 1:l) {
  j <- 1 + i
  if (i < l | j + 2 == l) { # avoid not enough y observations
    mean.val <- matrix.order[j:l, 4]
    p <- t.test(matrix.order[i, 1:3], mean.val)
    temp[i, 1] <- p$p.value
  } else {
    temp[i, 1] <- 1
  }
}
dput for my df
structure(list(`10SE191_2` = c(12.6487418898415, 12.6997932097351,12.5868508174491, 12.4639169398277, 11.9846348627906, 12.8160978540904), `10SE207` = c(13.0639063105224, 12.522848114011, 12.8781769160682, 12.260865493177, 11.9664905651469, 12.7154788700468), `10SE208` = c(12.6937808736673, 12.4158248856386, 12.6602128982717, 11.7704045448312, 12.5066604109231, 12.4816357798965), mean = c(12.7393707471856, 12.5036313008127, 12.4572035036992, 12.3651842840775, 12.3313821056582, 12.3270331271091)), .Names = c("10SE191_2", "10SE207", "10SE208", "mean"), row.names = c("7995783", "8115327", "8108370", "7945680", "7923547", "8016718"), class = "data.frame")
You can obtain all p-values (if possible) with this command:
apply(df, 1, function(x) {
  y <- df$mean[df$mean < x[4]]
  if (length(y) > 1)
    t.test(x[1:3], y)$p.value
  else NA
})
The function will return NA if there are not enough values for y.
7995783 8115327 8108370 7945680 7923547 8016718
0.08199794 0.15627947 0.04993244 0.50885253 NA NA
Running 2e4 t.tests probably takes a long time no matter what. Try using Rprof to find the hot spots. You might also want to use mclapply or similar parallel-processing tools, since your analysis of each row is independent of all other data (which makes this a task well suited to multicore parallel processing).
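A minimal sketch of that parallel route with parallel::mclapply (my illustration, not from the answer; mc.cores is arbitrary here, and note that mclapply forks processes, so it falls back to serial execution on Windows):
library(parallel)
pvals <- unlist(mclapply(seq_len(nrow(df)), function(i) {
  y <- df$mean[df$mean < df$mean[i]]  # means of the rows with lower intensity
  if (length(y) > 1) t.test(as.numeric(df[i, 1:3]), y)$p.value else NA
}, mc.cores = 2))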

Efficiency of transforming counts to percentages and index scores

I currently have the following code, which produces the desired results (Data_Index and Data_Percentages):
Input_Data <- read.csv("http://dl.dropbox.com/u/881843/RPubsData/gd/2010_pop_estimates.csv", row.names = 1, stringsAsFactors = FALSE)
Input_Data <- data.frame(head(Input_Data))
Rows <- nrow(Input_Data)
Vars <- ncol(Input_Data) - 1
# Total population column
TotalCount <- Input_Data[1]
# Total population sum
TotalCountSum <- sum(TotalCount)
Input_Data[1] <- NULL
VarNames <- colnames(Input_Data)
Data_Per_Row <- c()
Data_Index_Row <- c()
for (i in 1:Rows) {
  # Proportion of all areas' population found in this row
  OAPer <- TotalCount[i, ] / TotalCountSum * 100
  Data_Per_Col <- c()
  Data_Index_Col <- c()
  for (u in 1:Vars) {
    # For every column value in the selected row,
    # the percentage of that value relative to the
    # total population (TotalCount) for that row is calculated
    VarPer <- Input_Data[i, u] / TotalCount[i, ] * 100
    # Once the percentage is calculated, the index
    # score is calculated by dividing this percentage
    # by the proportion of the total population in that
    # area compared to all areas
    VarIndex <- VarPer / OAPer * 100
    # Binds results for all columns in the row
    Data_Per_Col <- cbind(Data_Per_Col, VarPer)
    Data_Index_Col <- cbind(Data_Index_Col, VarIndex)
  }
  # Binds results for the completed row with previously completed rows
  Data_Per_Row <- rbind(Data_Per_Row, Data_Per_Col)
  Data_Index_Row <- rbind(Data_Index_Row, Data_Index_Col)
}
colnames(Data_Per_Row) <- VarNames
colnames(Data_Index_Row) <- VarNames
# Changes the index scores to range from -1 to 1
OldRange <- max(Data_Index_Row) - min(Data_Index_Row)
NewRange <- (1 - -1)
Data_Index <- (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
# Final outputs
Data_Index
Data_Percentages
The problem I have is that the code is very slow. I want to be able to use it on a dataset with 200,000 rows and 200 columns (which, with the code as it stands, would take around 4 days). I am sure there must be a way of speeding this process up, but I am not sure how exactly.
What the code is doing is taking (in this example) a population counts table, divided into age bands and by different areas, and turning it into percentages and index scores. Currently there are two loops, so every value in every row and column is selected individually and has calculations performed on it. I assume it is these loops that make it run slowly; are there any alternatives that produce the same results more quickly? Thanks for any help you can offer.
Here is your entire code, vectorised. The for-loops are not necessary, and neither is apply: the division can be done on a whole matrix at once.
df <- Input_Data
total_count <- df[, 1]
total_sum <- sum(total_count)
df <- df[, -1]
# equivalent of your for-loop
oa_per <- total_count/total_sum * 100
Data_Per_Row <- df/matrix(rep(total_count, each=5), ncol=5, byrow=T)*100
Data_Index_Row <- Data_Per_Row/oa_per * 100
names(Data_Per_Row) <- names(Data_Index_Row) <- names(df)
# rest of your code: identical
OldRange = max(Data_Index_Row) - min(Data_Index_Row)
NewRange = (1 - -1)
Data_Index = (((Data_Index_Row - min(Data_Index_Row)) * NewRange) / OldRange) + -1
Data_Percentages <- Data_Per_Row
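An aside on the rep()/matrix() line above (my addition, not part of the answer): R recycles a vector down the columns of a data frame, so that construction can be written more simply, assuming the same total_count as above:
# Each row i of df is divided by total_count[i], since recycling runs down
# the columns and length(total_count) == nrow(df):
Data_Per_Row <- df / total_count * 100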
Get rid of the "i" loop; use apply to calculate OAPer:
OAPer <- apply(TotalCount, 1,
               function(x, tcs) x / tcs * 100,
               tcs = TotalCountSum)
Likewise, you can vectorize the work inside the "u" loop as well (a sketch follows). Some comments in your code would also be appreciated.
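A minimal sketch of that "u"-loop vectorisation using sweep (my illustration, not from the answer above; it assumes Input_Data, TotalCount and OAPer as defined in the question, after the total column has been dropped):
# Divide every row of the value columns by that row's total count
# (MARGIN = 1 sweeps the vector across rows), then by that row's OAPer:
VarPer   <- sweep(as.matrix(Input_Data), 1, TotalCount[, 1], "/") * 100
VarIndex <- sweep(VarPer, 1, OAPer, "/") * 100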