I have a dataset that contains both numeric and categorical values. I am trying to create box plots to visually identify outliers for each numeric column in my dataset. The code below does this, but it is very clunky and I would not want to use it with even more variables. I am looking for a way to create these box plots with a loop in R.
Here is the clunky code that works without a loop:
# Using boxplots, check for outliers in each float or integer column.
b <-boxplot(df$item1, main = 'item1')
b <-boxplot(df$item2, main = 'item2')
b <-boxplot(df$item3, main = 'item3')
b <-boxplot(df$item4, main = 'item4')
b <-boxplot(df$item5, main = 'item5')
b <-boxplot(df$item6, main = 'item6')
b <-boxplot(df$item7, main = 'item7')
b <-boxplot(df$item8, main = 'item8')
b <-boxplot(df$item9, main = 'item9')
b <-boxplot(df$item10, main = 'item10')
b <-boxplot(df$item11, main = 'item11')
b <-boxplot(df$item12, main = 'item12')
b <-boxplot(df$item13, main = 'item13')
b <-boxplot(df$item14, main = 'item14')
b <-boxplot(df$item15, main = 'item15')
b <-boxplot(df$item16, main = 'item16')
In Python the code would be:
outliers = ['Item1', 'Item2', 'Item3', 'Item4', 'Item5', 'Item6', 'Item7', 'Item8', 'Item9', 'Item10', 'Item11', 'Item12', 'Item13', 'Item14', 'Item15', 'Item16']
i = 0
while i < len(outliers):
    sns.boxplot(x = outliers[i], data = df)
    plt.show()
    i = i + 1
(I am looking for something similar in R!)
Thank you!
Using a for loop over the column names, and a minimal reprex based on mtcars, you could do:
outliers <- c("mpg", "hp")
for (i in outliers) {
  boxplot(mtcars[i], main = i)
}
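If typing the column names by hand is the clunky part, the numeric columns can also be picked automatically. The following is only a sketch: the data frame `df` and its columns `item1`, `item2` and `group` are made-up stand-ins for the asker's data.

```r
# Hypothetical data frame standing in for the asker's `df`
df <- data.frame(
  item1 = c(1.2, 3.4, 2.2, 40.0),
  item2 = c(5L, 6L, 7L, 8L),
  group = c("a", "b", "a", "b")
)

# Keep only the names of numeric (double or integer) columns
numeric_cols <- names(df)[sapply(df, is.numeric)]

# One box plot per numeric column, titled with the column name
for (col in numeric_cols) {
  boxplot(df[[col]], main = col)
}
```

Here `df[[col]]` extracts the column as a plain vector, so the categorical column `group` is skipped without being listed anywhere.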
I have a dataframe with an id column. I want to filter the dataframe for a specific id and then add data from its columns to a plot (this part is done using a for loop). Finally, I want to combine the plots (3 in this case) into one figure with subplots. I have implemented it like this, but the plot I end up with is incorrect: the top two subplots are empty, and all the data seems to be in the third subplot. Does anyone have an idea what I am doing wrong?
function plotAnom(data::DataFrame)
    data = copy(data)
    uniqIds = unique(data.id)
    # Create empty plots
    p1 = plot()
    p2 = plot()
    p3 = plot()
    for id in uniqIds
        # Filter dataframe based on id
        df = data[data.id .== id,:]
        p1 = plot!(
            df.time,
            df[:,names(df)[3]],
            label = id,
            line = 3,
            legend = :bottomright,
        )
        p1 = plot!(
            df.time,
            df.pre_month_avg,
            label = id,
            line = 2
        )
        p2 = plot!(
            df.time,
            df.diff,
            line = 3
        )
        p3 = plot!(
            df.time,
            df.cumulative_diff,
            line = 3
        )
    end
    p = plot(p1,p2,p3,layout = (3,1), legend = false)
    return p
end
In your example, you always update the latest created plot, which is your p3. When you use plot!, you specify which plot gets updated by putting it in plot!'s arguments (otherwise it updates the latest one). So I think you should do plot!(p1, ...) instead of p1 = plot!(...), and so on.
Suppose I have simulated data, repeated 100 times (sometimes 1000), and I would like to apply one function to each of these data sets. Because there are so many repetitions, I would like to know which data set my code is working on at any given moment. That is, I would like the code to print the number of each data set as it works on it: when it starts on the first data set it should tell me that this is the first one, then the same for the second, and so on. I know that I will get the number of my data in the console as a list. However, my real function is much more complicated; this is just a simple example to explain my problem.
This is my code:
N.a = 186; N.b = 38; N.ab=13; N.o = 284
## 1) numerical optimization
llk = function(xpar){
  tmp = exp(c(xpar,0))
  pr = tmp/sum(tmp) ## A/B/O
  res1 = N.a*log(pr[1]^2+2*pr[1]*pr[3]) + N.b*log(pr[2]^2+2*pr[2]*pr[3])
  res2 = N.ab*log(2*pr[1]*pr[2]) + N.o*log(pr[3]^2)
  -res1-res2
}
pr = rep(1/3,3) ## A/B/O
it = 0; pdiff = 1
while( (it<100)&(pdiff>1e-5) ){
  tmp = c(pr[1]^2, 2*pr[1]*pr[3])
  tmp = tmp/sum(tmp)
  N.aa = N.a*tmp[1]
  N.ao = N.a*tmp[2]
  tmp = c(pr[2]^2, 2*pr[2]*pr[3])
  tmp = tmp/sum(tmp)
  N.bb = N.b*tmp[1]
  N.bo = N.b*tmp[2]
  pr1 = c(2*N.aa+N.ao+N.ab, 2*N.bb+N.bo+N.ab, N.ao+N.bo+2*N.o)
  pr1 = pr1/sum(pr1)
  pdiff = mean(abs(pr1-pr))
  it = it+1
  pr = pr1
  cat(it, pr, "\n")
}
How can I use the cat function for this? For example, how would I use the following in my code:
cat(paste0("data: ", i, "\n"))
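One way to use that `cat()` call is to put it at the top of a loop over the data sets. This is a minimal sketch, not the asker's actual setup: the list `datasets` and the use of `mean()` as the per-data-set function are assumptions made for illustration.

```r
set.seed(1)

# Hypothetical stand-in for the repeated simulated data: 5 data sets
datasets <- lapply(1:5, function(k) rnorm(20))

results <- numeric(length(datasets))
for (i in seq_along(datasets)) {
  # Report which data set is being processed before the work starts
  cat(paste0("data: ", i, " of ", length(datasets), "\n"))
  # Placeholder for the real, more complicated function
  results[i] <- mean(datasets[[i]])
}
```

In consoles that buffer output, calling `flush.console()` right after `cat()` may help the message appear immediately.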
I have a function that uses matplot to plot some data. Data structure is like this:
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
matplot(test[,-1])
matlines(test[,1], test[,-1])
So far so good. However, if there are missing values in the data set, then there are gaps in the resulting plot, and I would like to avoid those by connecting the edges of the gaps.
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
In the real situation this is inside a function, the matrix is bigger, and the number of rows, columns and the positions of the non-overlapping missing values may change between calls, so I'd like to find a solution that handles this in a flexible way. I also need to use matlines.
I was thinking of filling in the gaps with interpolated data, but maybe there is a better solution.
I came across this exact situation today, but I didn't want to interpolate values - I just wanted the lines to "span the gaps", so to speak. I came up with a solution that, in my opinion, is more elegant than interpolating, so I thought I'd post it even though the question is rather old.
The problem causing the gaps is that there are NAs between consecutive values. So my solution is to 'shift' the column values so that there are no NA gaps. For example, a column consisting of c(1,2,NA,NA,5) would become c(1,2,5,NA,NA). I do this with a function called shift_vec_na() in an apply() loop. The x values also need to be adjusted, so we can make the x values into a matrix using the same principle, but using the columns of the y matrix to determine which values to shift.
Here's the code for the functions:
# x -> vector
# bool -> boolean vector; must be same length as x. The values of x where bool
# is TRUE will be 'shifted' to the front of the vector, and the back of the
# vector will be all NA (i.e. the number of NAs in the resulting vector is
# sum(!bool))
# returns the 'shifted' vector (will be the same length as x)
shift_vec_na <- function(x, bool){
  n <- sum(bool)
  if(n < length(x)){
    # seq_len(n) is safe even when n == 0 (all values NA)
    x[seq_len(n)] <- x[bool]
    x[(n + 1):length(x)] <- NA
  }
  return(x)
}
# x -> vector
# y -> matrix, where nrow(y) == length(x)
# returns a list of two elements ('x' and 'y') that contain the 'adjusted'
# values that can be used with 'matplot()'
adj_data_matplot <- function(x, y){
  y2 <- apply(y, 2, function(col_i){
    return(shift_vec_na(col_i, !is.na(col_i)))
  })
  x2 <- apply(y, 2, function(col_i){
    return(shift_vec_na(x, !is.na(col_i)))
  })
  return(list(x = x2, y = y2))
}
Then, using the sample data:
test <- data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] <- NA
test$b[7] <- NA
lst <- adj_data_matplot(test[,1], test[,-1])
matplot(lst$x, lst$y, type = "b")
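To make the shifting concrete, here is a small check of the c(1,2,NA,NA,5) example mentioned above. The helper `shift_na_to_end()` is a hypothetical, compact equivalent of shift_vec_na() written only for this illustration; it is not part of the answer's code.

```r
# Hypothetical helper: move the kept values to the front, pad with NA
shift_na_to_end <- function(x, keep = !is.na(x)) {
  c(x[keep], rep(NA, sum(!keep)))
}

y <- c(1, 2, NA, NA, 5)
x <- 1:5

y_shifted <- shift_na_to_end(y)             # 1 2 5 NA NA
x_shifted <- shift_na_to_end(x, !is.na(y))  # 1 2 5 NA NA
```

Because the x values are shifted with the same mask as the y values, each remaining point keeps its original x position, and the line simply runs from x = 2 to x = 5.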
You could use the na.interpolation function from the imputeTS package (in imputeTS 3.0 and later the function is called na_interpolation):
test = data.frame(x = 1:10, a = 1:10, b = 11:20)
test$a[3:4] = NA
test$b[7] = NA
matplot(test[,-1])
matlines(test[,1], test[,-1])
library('imputeTS')
test <- na.interpolation(test, option = "linear")
matplot(test[,-1])
matlines(test[,1], test[,-1])
I also had the same issue today. In my context I was not permitted to interpolate. Here is a minimal but sufficiently general working example of what I did. I hope it helps someone:
mymatplot <- function(data, main=NULL, xlab=NULL, ylab=NULL,...){
  #graphical set up of the window
  plot.new()
  plot.window(xlim=c(1,ncol(data)), ylim=range(data, na.rm=TRUE))
  mtext(text = xlab,side = 1, line = 3)
  mtext(text = ylab,side = 2, line = 3)
  mtext(text = main,side = 3, line = 0)
  axis(1L)
  axis(2L)
  #plot the data
  for(i in 1:nrow(data)){
    nin.na <- !is.na(data[i,])
    lines(x=which(nin.na), y=data[i,nin.na], col = i,...)
  }
}
The core 'trick' is in x=which(nin.na). It aligns the data points of the line consistently with the indices of the x axis.
The lines
plot.new()
plot.window(xlim=c(1,ncol(data)), ylim=range(data, na.rm=TRUE))
mtext(text = xlab,side = 1, line = 3)
mtext(text = ylab,side = 2, line = 3)
mtext(text = main,side = 3, line = 0)
axis(1L)
axis(2L)
draw the graphical part of the window.
range(data, na.rm=TRUE) sizes the y axis so that all data points fit in the plot.
mtext(...) is used to label the axes and provides the main title. The axes themselves are drawn by the axis(...) command.
The following for-loop plots the data.
The function signature of mymatplot includes the ... argument so that typical plot parameters such as lty, lwd, cex etc. can optionally be supplied. Those are passed on to lines().
A last word on the choice of colors: they are up to your taste.
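The which() trick described above can be seen in isolation on a single series. This is just an illustrative sketch with a made-up vector `y`, not a replacement for mymatplot:

```r
# One series with a gap at positions 3 and 4
y <- c(1, 2, NA, NA, 5)

# Logical mask of the observed values
ok <- !is.na(y)

# Empty plot with the right axes, then one line through the observed points;
# which(ok) gives their original x positions, so the line bridges the gap
plot(seq_along(y), y, type = "n")
lines(x = which(ok), y = y[ok])
```

Since only non-missing points are handed to lines(), there is no NA left to break the line, and nothing is interpolated.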
I have a box plot with multiple levels d$a + d$b
d = data.frame(value = c(1,2,3,100), a = c("A","A","B","A"), b = c("C","D","C","C") )
boxplot(d$value ~ d$a + d$b, horizontal = TRUE)
When you run that code you will see that the B.D combination still shows up, but it is empty. How do I remove it?
This is just a toy example. In reality I have 40+ combinations and do not want to remove the blank one by hand.
You can first use interaction (together with its drop argument) to create a new column of your data.frame, then plot it:
d <- data.frame(value = c(1,2,3,100), a = c("A","A","B","A"), b = c("C","D","C","C"))
d <- within(d, interaction <- interaction(a, b, drop = TRUE))
boxplot(value ~ interaction, data = d, horizontal = TRUE)
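To see what the drop argument changes, the factor levels can be compared directly. This is a small sketch using the toy data from the question; the variable names `all_levels` and `used_levels` are introduced here only for illustration.

```r
d <- data.frame(value = c(1, 2, 3, 100),
                a = c("A", "A", "B", "A"),
                b = c("C", "D", "C", "C"))

# Without drop: every combination of a and b becomes a level,
# including B.D, which never occurs in the data
all_levels <- levels(interaction(d$a, d$b))

# With drop = TRUE: only combinations that actually occur are kept
used_levels <- levels(interaction(d$a, d$b, drop = TRUE))

# The combination(s) removed by drop = TRUE
setdiff(all_levels, used_levels)
```

boxplot() draws one box per factor level, which is why dropping the unused level removes the empty slot.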
I have the following data frame and variables:
u0 <- c(1,1,1,1,1)
df <- data.frame (u0)
a = .793
b = 2.426
r = 0.243
q = 1
w = 2
j = 1
z = .314
Using the following loop I do some calculations and put the results in the first row of my data frame.
while (j<5){
  df[q,w] <- df[q, w-1] * (r+j-1)*(b+j-1)*(z) / ((a+b+j-2)*j)
  j = j + 1
  w = w + 1
}
Now I want to create another loop that does the same calculations for all rows of my data frame (i.e. I need the 'q' variable to vary). I would be thankful for any help.
You could either do this by putting your while loop inside of a for loop that goes over q, but a more R-tastic way would be to simply define q <- 1:5, and leave the rest of your code as-is. Then df will fill up entirely. I take it in this example you want all rows to be identical?
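The `q <- 1:5` suggestion can be sketched as follows, using the variables from the question. Nothing new is assumed here beyond writing it out in full:

```r
u0 <- c(1, 1, 1, 1, 1)
df <- data.frame(u0)
a <- .793
b <- 2.426
r <- 0.243
z <- .314

q <- 1:5   # all rows at once instead of a single row index
w <- 2
j <- 1
while (j < 5) {
  # Same formula as in the question; the assignment now fills a whole column
  df[q, w] <- df[q, w - 1] * (r + j - 1) * (b + j - 1) * z / ((a + b + j - 2) * j)
  j <- j + 1
  w <- w + 1
}
```

Because every row starts from the same value in u0, all rows end up identical, which matches the remark above.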
Can't you just put it in a for loop?
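df <- data.frame(d1=u0, d2=u0+1, d3=u0+2, d4=u0+4, d5=u0+5)
for (q in 2:5) {
  # reset the counters for every row; otherwise the while loop
  # is skipped once j has reached 5
  j <- 1
  w <- 2
  while (j<5){
    df[q,w] <- df[q, w-1] * (r+j-1)*(b+j-1)*(z) / ((a+b+j-2)*j)
    j = j + 1
    w = w + 1
  }
}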
You may want to check the algorithm. It doesn't seem to be doing anything very interesting.