I have a matrix with 2 columns. I would like to draw a boxplot for each of these columns, but they have different numbers of entries.
For example, the first column has 10 entries and the second column has 7 entries. The remaining 3 entries of the second column are set to zero.
I would like to plot these side by side for comparison.
Is there a way to tell R to boxplot the whole of column 1 and only the first 7 entries of column 2?
You could simply index the values you want, for example
## dummy version of your data
mat <- matrix(c(1:17, rep(0, 3)), ncol = 2)
## create object suitable for plotting with boxplot
## I.e. convert to melted or long format
df <- data.frame(values = mat[1:17],
vars = rep(c("Col1","Col2"), times = c(10,7)))
## draw the boxplot
boxplot(values ~ vars, data = df)
In the above I'm taking you at your word that you have a matrix. If you actually have a data frame then you would need
df <- data.frame(values = c(mat[,1], mat[1:7, 2]),
vars = rep(c("Col1","Col2"), times = c(10,7)))
I also assume that the data in the two columns are comparable: having the values in two columns suggests a categorical variable that splits them (like the heights of men and women, with sex as the categorical variable).
The resulting boxplot is shown below
For any number of columns and any number of empty entries, you can do it like this.
## Load data from CSV; first row contains column headers
dat <- read.csv('your-filename.csv', header = TRUE)
## Set up an empty plot region; with header = TRUE the header row is
## not part of 'dat', so the y-range is taken over all rows
plot(
  1, 1, type = 'n',
  xlim = c(0.5, ncol(dat) + 0.5), ylim = range(dat, na.rm = TRUE),
  xaxt = 'n', xlab = '', ylab = ''
)
axis(1, labels=colnames(dat), at=1:ncol(dat))
for(i in 1:ncol(dat)) {
## Get i-th column
p <- dat[,i]
## Remove 0 values from column
p <- p[! p %in% 0]
## Instead of 0 you can use any values
## For example, you can remove 1, 2, 3
## p <- p[! p %in% c(1,2,3)]
## Draw boxplot
boxplot(p, add=T, at=i)
}
This code loads a table from a CSV file, removes the 0 values from each column (or any other values you choose), and draws the boxplots for all columns in one graphic.
Hope this helps.
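A more compact variant of the same idea, as a minimal sketch: it assumes that a zero always means "no entry" and is never a real measurement, and it uses the dummy matrix from the first answer (the column names `Col1` and `Col2` are only illustrative). Replacing the padding zeros with `NA` lets `boxplot()` drop them by itself.

```r
## Dummy matrix: column 1 has 10 values, column 2 has 7 values plus 3 padding zeros
mat <- matrix(c(1:17, rep(0, 3)), ncol = 2,
              dimnames = list(NULL, c("Col1", "Col2")))

## Assumption: a zero always marks a missing entry, never a real value
mat[mat == 0] <- NA

## boxplot() on a data frame draws one box per column and ignores NA values
bp <- boxplot(as.data.frame(mat))

## Number of observations actually used for each box
bp$n
```

If zeros can also be genuine measurements, this shortcut is not safe and the explicit indexing shown earlier is the better choice.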
I'm combining 12 CSV files into one dataframe in R. Before doing this I want to ensure all the column names are an exact match with each other. I've made a dataframe where each column is the column names of the 12 CSV files.
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
sep21_cols <- data.frame(colnames(sep21))
oct21_cols <- data.frame(colnames(oct21))
nov21_cols <- data.frame(colnames(nov21))
dec21_cols <- data.frame(colnames(dec21))
jan22_cols <- data.frame(colnames(jan22))
feb22_cols <- data.frame(colnames(feb22))
mar22_cols <- data.frame(colnames(mar22))
apr22_cols <- data.frame(colnames(apr22))
may22_cols <- data.frame(colnames(may22))
jun22_cols <- data.frame(colnames(jun22))
col_df <- cbind(jul21_cols,aug21_cols,sep21_cols,oct21_cols,nov21_cols,dec21_cols,
jan22_cols,feb22_cols,mar22_cols,apr22_cols,may22_cols,jun22_cols)
I've tried using the identical function to compare 2 columns at a time.
identical(col_df[['jul21']], col_df[['aug21']])
identical(col_df[['aug21']], col_df[['sep21']])
identical(col_df[['sep21']], col_df[['oct21']])
identical(col_df[['oct21']], col_df[['nov21']])
identical(col_df[['nov21']], col_df[['dec21']])
identical(col_df[['dec21']], col_df[['jan22']])
identical(col_df[['jan22']], col_df[['feb22']])
identical(col_df[['feb22']], col_df[['mar22']])
identical(col_df[['mar22']], col_df[['apr22']])
identical(col_df[['apr22']], col_df[['may22']])
identical(col_df[['may22']], col_df[['jun22']])
All of the identical() calls return TRUE.
I'm just trying to verify that this code is telling me all my column names are identical across the CSV files before I move on. I'd also like to know if there is a more efficient way to solve this problem.
First, identical() will only return TRUE if the two dataframes have all the same column names in the same order. If you don’t care about order, just that all the same names are in both dataframes, you can sort() the names before comparing as shown below.
Second, you can often use the base::lapply() or purrr::map() families of functions for operations requiring iteration.
For your case, let’s put your dataframes in a list (which they probably should be to begin with), then use sapply() to compare the column names of the first df in the list to the column names of all other dfs.
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, y = 4)
sep21 <- data.frame(y = 6, x = 5)
dfs <- list(jul21,aug21,sep21)
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# TRUE
And as another test case, we’ll add a df with a non-matching column.
oct22 <- data.frame(x = 1, y = 2, z = 3)
dfs[[4]] <- oct22
all(sapply(
dfs[-1],
\(x) identical(sort(colnames(x)), sort(colnames(dfs[[1]])))
))
# FALSE
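One caveat about the check in the question, sketched here with two small made-up data frames and assuming col_df is built exactly as shown there: data.frame(colnames(jul21)) names its single column after the expression, not after the data frame, so col_df[['jul21']] does not exist and returns NULL. Comparing NULL with NULL is always TRUE, so those identical() calls would pass even when the names differ.

```r
## Two data frames with deliberately different column names
jul21 <- data.frame(x = 1, y = 2)
aug21 <- data.frame(x = 3, z = 4)

## Built the same way as in the question
jul21_cols <- data.frame(colnames(jul21))
aug21_cols <- data.frame(colnames(aug21))
col_df <- cbind(jul21_cols, aug21_cols)

## The columns are named after the expression, e.g. "colnames.jul21."
col_names <- names(col_df)

## So this lookup finds nothing and yields NULL
lookup_is_null <- is.null(col_df[["jul21"]])

## NULL compared with NULL: TRUE, although the names differ
misleading <- identical(col_df[["jul21"]], col_df[["aug21"]])

## Comparing the names directly gives the real answer: FALSE
direct <- identical(colnames(jul21), colnames(aug21))
```

Comparing colnames() directly, as in the sapply() call above, avoids the problem.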
We assume that what is needed is to determine whether the column names are the same and in the same order and, if not, to determine which differ.
First create a character vector, Names, containing the names of the data frames, and from it make a named list L containing the data frames themselves.
Then get a character vector nms whose elements are strings of column names, one for each data frame.
Finally, group the names of the data frames using tapply with nms as the grouping, so we can see which data frames contain which columns. In the example below jul21 and aug21 have one set of columns, i.e. Time and demand, and sep21 has a different set, i.e. Time and DEMAND. If there were only one row, all data frames would have the same column names in the same order.
Names <- c("jul21", "aug21", "sep21") # using example in Note
L <- mget(Names)[Names]
nms <- sapply(names(L), function(x) toString(names(L[[x]])))
tab <- stack(tapply(names(nms), nms, toString))
names(tab) <- c("data.frames", "column.names")
nrow(tab)
## [1] 2
tab
##    data.frames column.names
## 1 jul21, aug21 Time, demand
## 2        sep21 Time, DEMAND
graph
Another approach, which could be used instead of or in conjunction with the one above, is to create a graph in which each vertex is a data frame and each edge means that the two data frames it joins have the same column names in the same order. Each connected component then represents a distinct set or order of column names. From the example below we see that jul21 and aug21 form one connected component and sep21 forms a second connected component.
To investigate how the column names of two data frames differ, note that setdiff(names(jul21), names(sep21)) shows the names that are in jul21 but not in sep21, and swapping the arguments gives the other direction. If the setdiff results in both directions are zero-length vectors and the names vectors are not identical, then they differ only by order.
library(igraph)
set.seed(123)
isSame <- function(x, y) +identical(names(x), names(y))
A <- outer(L, L, Vectorize(isSame))
diag(A) <- 0
g <- graph_from_adjacency_matrix(A, "undirected")
plot(g, vertex.color = "white", vertex.size = 30)
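The setdiff() check described above can be sketched as follows. It uses BOD, which comes with R, in the same way as the test data; rev21 is a hypothetical extra data frame added only to illustrate the "same names, different order" case.

```r
## Test data based on BOD (columns "Time" and "demand")
jul21 <- BOD
sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")

## Names present in one data frame but not in the other
only_in_jul <- setdiff(names(jul21), names(sep21))
only_in_sep <- setdiff(names(sep21), names(jul21))

## Hypothetical data frame with the same columns in reverse order
rev21 <- BOD[, 2:1]

## Both setdiff() results empty: the sets of names agree
same_set <- length(setdiff(names(jul21), names(rev21))) == 0 &&
  length(setdiff(names(rev21), names(jul21))) == 0

## Same set but not identical vectors: the difference is the order only
order_only <- same_set && !identical(names(jul21), names(rev21))
```

Here only_in_jul is "demand" and only_in_sep is "DEMAND", while order_only is TRUE for the reversed copy.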
Note
Test data. BOD comes with R.
jul21 <- aug21 <- sep21 <- BOD
names(sep21) <- c("Time", "DEMAND")
I have a data frame, all values are numeric, I am interested in plotting all the values of the second row against a vector of the same length.
vector = seq(400, 2498, 2)
This vector has length 1050, which is the same as the length of the row.
I want to plot this row against the values of the vector and join the points of the plot with a line.
Based on your answer to my question, the following should work.
vector = seq(400, 2498, 2)
# A dummy data frame with two rows and 1050 columns.
my.df <- as.data.frame(t(
  data.frame(values = runif(1050),
             other.values = runif(1050))))
plot(x = vector,
     y = unlist(my.df[2, ]), # row two, as asked; change the row number as needed
     type = "l")             # join the points with a line
HTH!
I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code:
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, including 2 ID variables) to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a plot like this:
I hope this helps.
You can keep column names by feeding them into an lapply function, here's an example with the iris dataset:
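library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
  # one-column data frame holding the values of the current column
  df <- data.frame(datatoplot = iris[[columntoplot]])
  graphname <- columntoplot
  ggplot(df, aes(x = datatoplot)) +
    geom_histogram() +
    ggtitle(graphname)
  # ggsave() saves the last plot created, i.e. the one just above
  ggsave(filename = paste0(graphname, ".png"), width = 4, height = 4)
})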
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
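For the plain for loop in the question, a minimal sketch of the same idea: iterate over the column names of the chosen range instead of over the numbers. The small data frame and the per-column mean are placeholders for illustration, not the asker's data or computation.

```r
## Placeholder data: 2 ID columns followed by 10 numeric columns
set.seed(1)
num <- matrix(rnorm(5 * 10), nrow = 5)
colnames(num) <- paste0("m", 1:10)
df <- data.frame(GroupID = 1:5, Category = letters[1:5], num)

## Names of the columns in the chosen range (columns 3 to 10)
cols <- names(df)[3:10]

## Loop over the names; each result is stored under its column name
col_means <- numeric(0)
for (nm in cols) {
  col_means[nm] <- mean(df[[nm]])
}

names(col_means)
```

Because the loop variable is the name itself, it can be used directly for titles, file names, or labels, and the data frame stays in wide format.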
I have two datasets: m and s. The first data set includes variables Frequency, p1, p2 and p3.
The second dataset includes the value for type of regression, mean and sample size. Column names are z, mean, and samplesize, respectively.
I need to add four columns to the first dataset m as follows:
The first column m$reg1 should be m$p1 times the value of s$samplesize corresponding to s$z == 'Regression1'.
The second column m$reg2 should be m$p2 times the value of s$samplesize corresponding to s$z == 'Regression2'.
The third column m$reg3 should be m$p3 times the value of s$samplesize corresponding to s$z == 'Regression3'.
I was wondering how I can write a loop to calculate these new columns in the m data set.
See how the datasets are created in the code below:
Frequency<-seq(1,27,1)
p1<-seq(2,28,1)
p2<-seq(10,36,1)
p3<-seq(0,26,1)
m<-data.frame(Frequency,p1,p2,p3)
z<-c('Regression1','Regression2','Regression3','Regression4')
mean<-c(2,28,1,17)
samplesize<-c(10,20,30,40)
s<-data.frame(z,mean,samplesize)
Use the same principle as we applied in this answer. First, define names of columns or row values that would subset tables and then perform the calculation, filling the values into a new, similarly constructed, column.
# custom function that calculates column values
add.col <- function(i){
# name in the s$z that defines the correct row
reg <- paste0("Regression", i)
# name of the m column
p <- paste0("p", i)
# multiply the named column from m with respective samplesize in s
return(m[, p] * s$samplesize[s$z == reg])
}
# loop through all indices
for(i in 1:3){
# create a new column with the compound name and fill it with appropriate values
m[, paste0("reg", i)] <- add.col(i = i)
}
No need for a loop, if I understand your question correctly. Just do:
m$reg1 <- m$p1*s$samplesize[s$z=="Regression1"]
m$reg2 <- m$p2*s$samplesize[s$z=="Regression2"]
m$reg3 <- m$p3*s$samplesize[s$z=="Regression3"]
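If you want to use a for loop, this might work as well:
desired_col <- c(2, 3, 4) # columns of m to use; this can be any selection
k <- seq_along(desired_col) # position in the selection, which is also the row of s
for (j in k) m[[paste0("reg", j)]] <- m[, desired_col[j]] * s$samplesize[j]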
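The same result can also be obtained in one step without an explicit loop. This is only a sketch under the assumption that column p1 pairs with 'Regression1', p2 with 'Regression2', and so on; idx and sizes are illustrative helper names.

```r
## Data as created in the question
Frequency <- seq(1, 27, 1)
p1 <- seq(2, 28, 1)
p2 <- seq(10, 36, 1)
p3 <- seq(0, 26, 1)
m <- data.frame(Frequency, p1, p2, p3)
z <- c('Regression1', 'Regression2', 'Regression3', 'Regression4')
mean <- c(2, 28, 1, 17)
samplesize <- c(10, 20, 30, 40)
s <- data.frame(z, mean, samplesize)

## Which regressions / p columns to combine
idx <- 1:3

## Look up the sample sizes by name rather than by row position
sizes <- s$samplesize[match(paste0("Regression", idx), s$z)]

## Multiply each p column by its sample size and store as reg1, reg2, reg3
m[paste0("reg", idx)] <- Map("*", m[paste0("p", idx)], sizes)
```

Looking the sample sizes up with match() keeps the result correct even if the rows of s are in a different order.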
I have a data frame with 30 columns numbered from 0 to 29.
I call stack on this data frame to plot a series of boxplots, one for each column number.
But instead of getting the boxplots in the sequence 0, 1, 2, 3, ... it prints 0, 1, 10, 11, ..., 19, 2, ..., 7, 8, 9.
In other words, I want the boxplots to appear in the same sequence as the columns, which is the natural order.
I'm using boxplot(values ~ column, data = mydata).
I don't want to fix that by changing the column names.
Is there another solution?
Thanks!
stack stores the column name as a factor, and the default order is alphabetic. You can either fix the order once it has been tampered with, or just use melt instead of stack: the column order will be preserved.
# Sample data
d <- matrix( rnorm(300), ncol=30 )
d <- as.data.frame( d )
colnames(d) <- as.character(0:29)
# Plot
library(reshape2)
boxplot( value ~ variable, melt(d) )
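The first option, fixing the order after stacking, could look like the sketch below. It keeps stack() and only re-levels the factor it produces, using the original column order; the column names values and ind are the ones stack() creates, and long is an illustrative name.

```r
## Sample data with columns named "0" to "29"
set.seed(42)
d <- as.data.frame(matrix(rnorm(300), ncol = 30))
colnames(d) <- as.character(0:29)

## Stack into long format: 'values' holds the data, 'ind' the column name
long <- stack(d)

## Set the factor levels explicitly to the original column order
long$ind <- factor(as.character(long$ind), levels = colnames(d))

## The boxes now appear in the order 0, 1, 2, ..., 29
boxplot(values ~ ind, data = long)
```

Setting the levels explicitly does not depend on how a given R version orders the levels by default, and the column names stay unchanged.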