I have a data frame with 30 columns numbered from 0 to 29.
I call stack on this data frame to plot a series of boxplots, one for each column number.
But instead of getting the boxplots in the sequence 0, 1, 2, 3, ..., I get 0, 1, 10, 11, ..., 19, 2, ..., 7, 8, 9.
In other words, I want the boxplots to appear in the same order as the columns, which
is the natural one.
I'm using boxplot(values ~ column, data = mydata).
I don't want to fix that by changing the column names.
Is there another solution?
Thanks!
stack stores the column names as a factor,
and the default order of the factor levels is alphabetical.
You can either fix the level order after it has been changed,
or just use melt instead of stack:
the column order will be preserved.
# Sample data
d <- matrix( rnorm(300), ncol=30 )
d <- as.data.frame( d )
colnames(d) <- as.character(0:29)
# Plot
library(reshape2)
boxplot( value ~ variable, melt(d) )
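The other option mentioned above, fixing the order after calling stack, might look like the following sketch. It assumes the stacked factor column has the default name ind and rebuilds its levels from the original column names; the variable name s is only illustrative.

```r
# Sample data with column names "0" to "29"
d <- as.data.frame(matrix(rnorm(300), ncol = 30))
colnames(d) <- as.character(0:29)

# Stack, then reset the factor levels to the original column order
s <- stack(d)
s$ind <- factor(s$ind, levels = colnames(d))

# Boxplots now follow the column order 0, 1, 2, ...
boxplot(values ~ ind, data = s)
```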
I have a wide data frame consisting of 1000 rows and over 300 columns. The first 2 columns are GroupID and Categorical fields. The remaining columns are all continuous numeric measurements. What I would like to do is loop through a specific range of these columns in R, beginning with the first numeric column (column #3). For example, loop through columns 3:10. I would also like to retain the column names in the loop. I've started with the following code using
for(i in 3:ncol(df)){
print(i)
}
But this includes all columns to the right of column #3 (not the range 3:10), and this does not identify column names. Can anyone help get me started on this loop so I can specify the column range and also retain column names? TIA!
Side Note: I've used tidyr to gather the data frame in long format. That works, but I've found it makes my data frame very large and therefore eats a lot of time and memory in my loop.
Since you did not include your data, I created similar dummy data (1000 rows and 302 columns, including 2 id variables) to show you how to select columns and prepare them for plotting:
library(reshape2)
library(ggplot2)
set.seed(123)
#Dummy data
Numvars <- as.data.frame(matrix(rnorm(1000*300),nrow = 1000,ncol = 300))
vec1 <- 1:1000
vec2 <- rep(paste0('class',1:5),200)
IDs <- data.frame(vec1,vec2,stringsAsFactors = F)
#Bind data
Data <- cbind(IDs,Numvars)
#Select vars (in your case 10 initial vars)
df <- Data[,1:12]
#Prepare for plot
df.melted <- melt(data = df,id.vars = c('vec1','vec2'))
#Plot
ggplot(df.melted,aes(x=vec1,y=value,group=variable,color=variable))+
geom_line()+
facet_wrap(~vec2)
You will end up with a plot like this:
I hope this helps.
You can keep column names by feeding them into an lapply function. Here's an example with the iris dataset:
library(ggplot2)
lapply(names(iris)[2:4], function(columntoplot){
df <- data.frame(datatoplot=iris[[columntoplot]])
graphname <- columntoplot
ggplot(df, aes(x = datatoplot)) +
geom_histogram() +
ggtitle(graphname)
ggsave(filename = paste0(graphname, ".png"), width = 4, height = 4)
})
In the lapply function, you create a new dataset comprising one column (note the double brackets). You can then plot and optionally save the output within the function (see ggsave line). You're then able to use the column name as the plot title as well as the file name.
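If a plain for loop over a fixed column range is preferred, one possible sketch is to loop over the column names rather than the positions. The data frame below is made-up stand-in data (2 id columns followed by numeric columns), and computing a mean is only a placeholder for whatever is done with each column.

```r
set.seed(1)
# Stand-in data: 2 id columns followed by 10 numeric columns
df <- data.frame(GroupID = 1:5, Category = letters[1:5],
                 matrix(rnorm(5 * 10), nrow = 5))

# Loop over the names of columns 3 to 10 only
col_means <- numeric(0)
for (nm in names(df)[3:10]) {
  # nm is the column name; df[[nm]] is the column as a vector
  col_means[nm] <- mean(df[[nm]])
}
col_means
```

Because the loop variable is the name itself, it can be reused for titles, file names, or result labels without reshaping the data to long format.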
I am having an issue with the mutate function in dplyr.
I am trying to add a new column called state that depends on the change in one of the columns (the V column). The V column repeats itself in a sequence, so each sequence (rep(seq(100,2100,100),each=96)) corresponds to one dataset in my df.
Error: impossible to replicate vector of size 8064
Here is a reproducible example of my df:
df <- data.frame (
No=(No= rep(seq(0,95,1),times=84)),
AC= rep(rep(c(78,110),each=1),times=length(No)/2),
AR = rep(rep(c(256,320,384),each=2),times=length(No)/6),
AM = rep(1,times=length(No)),
DQ = rep(rep(seq(0,15,1),each=6),times=84),
V = rep(rep(seq(100,2100,100),each=96),times=4),
R = sort(replicate(6, sample(5000:6000,96))))
labels <- rep(c("CAP-CAP","CP-CAP","CAP-CP","CP-CP"),each=2016)
I hard-coded the value 2016 here intentionally, since I know the number of rows in each dataset.
But I want to assign these labels automatically whenever the dataset changes, because the total number of rows may differ for each df in my real files. For this question, assume there is only one txt file, but keep in mind that there are many of them with different numbers of rows. The format is always the same.
I use dplyr to arrange my df
library("dplyr")
newdf<-df%>%mutate_each(funs(as.numeric))%>%
mutate(state = labels)
Is there an elegant way to do this?
Iff you know the number of data sets contained in df AND the column you're keying off --- here, V --- is ordered in df like it is in your toy data, then this works. It's pretty clunky, and there should be a way to make it even more efficient, but it produced what I take to be the desired result:
# You'll need dplyr for the lead() part
library(dplyr)
# Make a vector with the labels for your subsets of df
labels <- c("AP-AP","P-AP","AP-P","P-P")
# This line a) produces an index that marks the final row of each subset in df
# with a 1 and then b) produces a vector with the row numbers of the 1s
endrows <- which(grepl(1, with(df, ifelse(lead(V) - V < 0, 1, 0))))
# This line uses those row numbers or the differences between them to tell rep()
# how many times to repeat each label
newdf$state <- c(rep(labels[1], endrows[1]), rep(labels[2], endrows[2] - endrows[1]),
rep(labels[3], endrows[3] - endrows[2]), rep(labels[4], nrow(newdf) - endrows[3]))
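A more general variant, which does not need the number of rows per dataset, is sketched below. It assumes, as above, that V only decreases at the boundary between two datasets and that there are at least as many labels as datasets. The names dataset_id and state are illustrative; the labels are the ones from the question.

```r
# V as in the toy data: 4 datasets, each running from 100 to 2100
V <- rep(rep(seq(100, 2100, 100), each = 96), times = 4)
labels <- c("CAP-CAP", "CP-CAP", "CAP-CP", "CP-CP")

# A drop in V marks the first row of a new dataset;
# cumsum turns those markers into a running dataset number 1, 2, 3, ...
dataset_id <- cumsum(c(TRUE, diff(V) < 0))

# Look up the label for each row by its dataset number
state <- labels[dataset_id]
table(state)
```

The resulting vector has one entry per row, so it can be assigned directly as a new column, for example with df$state <- state.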
I have one data table (tab separated) as follows.
data.csv
A B C
0.0509259 0.0634984 0.0334984
0.12037 0.0599042 0.0299042
0.00925926 0.0109824 0.0599042
0.990741 0.976837 0.059442
0.99537 0.997404 0.0549042
0.99537 0.997404 0.0529042
0.00462963 0.0109824 0.0699042
0.986111 0.975839 0.0999042
0.12963 0.0758786 0.0899042
0.00462963 0.00419329 0.0499042
0.865741 0.876597 0.0519042
0.865741 0.870807 0.0539042
How can I plot multi-series data in one histogram, as explained below?
data<-read.table("C:/Users/User/Desktop/data.csv",header=T)
hist(data$A)
hist(data$B)
hist(data$C)
How can I merge these three histograms so that I can see the three different series together, in different colors, in one plot?
Sample output:
Here are two ways. In base R:
barplot(t(as.matrix(data)),beside=TRUE,
col=c("red","green","blue"),names=rownames(data))
Using ggplot.
library(ggplot2)
library(reshape2)
gg <- melt(data.frame(id=rownames(data),data),id="id")
gg$id <- factor(gg$id,levels=unique(gg$id))
ggplot(gg,aes(x=id,y=value,fill=variable))+geom_bar(stat="identity",position="dodge")
The ggplot approach, which ultimately is much more flexible, is also more work. You have to add a column based on the row names (or a sequence 1:nrow(data), if you prefer), and convert the data from wide to long format (as in the other answer). But you're still not done: ggplot converts the ids to a factor and then orders them alphabetically, so the groups are, e.g., 1, 10, 11, 12, 2, 3, ... You don't want that, so you have to reorder the factor first, and then plot.
If you're ok with ggplot2, you can do it as follows:
library(reshape2)
library(ggplot2)
1: Rearrange the dataframe to change A,B,C, to factors:
dat3 <- melt(dat2, varnames = c('A','B','C'))
2: Plot using the factors:
qplot(data=dat3, value, fill=variable, position = 'dodge')
Can't say too many good things about ggplot2
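If binned counts (a true histogram) are wanted rather than one bar per row, a base R sketch could bin every column on the same breaks and draw the counts side by side. The data here are random stand-ins for columns A, B and C, and the bin width of 0.1 is an arbitrary assumption.

```r
set.seed(42)
# Stand-in data: three series with values between 0 and 1
data <- data.frame(A = runif(12), B = runif(12), C = runif(12))

# Use the same breaks for every series so the bins line up
breaks <- seq(0, 1, by = 0.1)
counts <- sapply(data, function(x) hist(x, breaks = breaks, plot = FALSE)$counts)

# One group of bars per bin, one colored bar per series
barplot(t(counts), beside = TRUE,
        col = c("red", "green", "blue"),
        names.arg = as.character(head(breaks, -1)),
        legend.text = TRUE,
        xlab = "Lower edge of bin", ylab = "Count")
```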
I have a matrix with 2 columns. I would like to boxplot each of these columns, but each has a different number of entries.
For example, the first column has 10 entries and the second column has 7 entries. The remaining 3 entries of the second column are set to zero.
I would like to plot these side by side for comparison reasons.
Is there a way to tell R to boxplot the whole column 1 and only the first 7 entry for column 2?
You could simply index the values you want, for example
## dummy version of your data
mat <- matrix(c(1:17, rep(0, 3)), ncol = 2)
## create object suitable for plotting with boxplot
## I.e. convert to melted or long format
df <- data.frame(values = mat[1:17],
vars = rep(c("Col1","Col2"), times = c(10,7)))
## draw the boxplot
boxplot(values ~ vars, data = df)
In the above I'm taking you at your word that you have a matrix. If you actually have a data frame then you would need
df <- data.frame(values = c(mat[,1], mat[1:7, 2]),
vars = rep(c("Col1","Col2"), times = c(10,7)))
I also assume that the data in the two columns are comparable: having the values in two columns suggests a categorical variable that lets us split them (like the heights of men and women, with sex as the categorical variable).
The resulting boxplot is shown below
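As an alternative sketch, boxplot also accepts a list of vectors of different lengths, so the two columns can be passed directly without building a long-format data frame. The name cols is illustrative, and the dummy matrix is the same as above.

```r
## same dummy data: column 2 is padded with 3 zeros
mat <- matrix(c(1:17, rep(0, 3)), ncol = 2)

## all of column 1, but only the first 7 entries of column 2
cols <- list(Col1 = mat[, 1], Col2 = mat[1:7, 2])

## one box per list element, drawn side by side
boxplot(cols)
```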
For any number of columns and any number of empty entries, you can do it like this.
## Load data from CSV; first row contains column headers
dat <- read.csv( 'your-filename.csv', header = TRUE )
## Set up an empty plot region (read.csv has already consumed the header row,
## so no rows need to be skipped when computing 'ylim')
plot(
1, 1, type='n',
xlim=c(0.5,ncol(dat)+0.5), ylim=range(dat, na.rm=TRUE),
xaxt='n', xlab='', ylab=''
)
axis(1, labels=colnames(dat), at=1:ncol(dat))
for(i in 1:ncol(dat)) {
## Get i-th column
p <- dat[,i]
## Remove 0 values from column
p <- p[! p %in% 0]
## Instead of 0 you can use any values
## For example, you can remove 1, 2, 3
## p <- p[! p %in% c(1,2,3)]
## Draw boxplot
boxplot(p, add=T, at=i)
}
This code loads a table from a CSV file, removes the 0 values from each column (or any other values you choose), and draws the boxplots for all columns in one graphic.
Hope this helps.
I have a data frame with 251 observations and 45 variables. There are 6 observations in the middle of the data frame that I'd like to exclude from my analyses. All 6 belong to one level of a factor. It is easy to generate a new data frame that, when printed, appears to exclude the 6 observations. When I use the new data frame to plot variables by the factor in question, however, the supposedly excluded level is still included in the plot (sans observations). Using str() confirms that the level is still present in some form. Also, the index for the new data frame skips 6 values where the observations formerly resided.
How can I create a new data frame that excludes the 6 observations and does not continue to recognize the excluded factor level when plotting? Can the new data frame be made to "re-index", so that the new index does not skip values formerly assigned to the excluded factor level?
I've provided an example with made up data:
# ---------------------------------------------
# data
char <- c( rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3) )
a <- 1:15 / pi
b <- seq(1, 8, .5)
d <- rep(c(3, 8, 5), 5)
dat <- data.frame(char, a, b, d)
dat
# two ways to remove rows that contain a string
datNew1 <- dat[-which(dat$char == "nam"), ]
datNew1
datNew2 <- dat[grep("nam", dat[ ,"char"], invert=TRUE), ]
datNew2
# plots still contain the factor level that was excluded
boxplot(datNew1$a ~ datNew1$char)
boxplot(datNew2$a ~ datNew2$char)
# str confirms that it's still there
str(datNew1)
str(datNew2)
# ---------------------------------------------
You can use the drop.levels() function from the gdata package to reduce the factor levels down to the actually used ones -- apply it on your column after you created the new data.frame.
Also try a search for r and drop.levels here (but you need to make the search term [r] drop.levels which I can't here as it interferes with the formatting logic).
Starting with R version 2.12.0, there is a function droplevels, which can be applied either to factor columns or to the entire dataframe. When applied to the dataframe, it will remove zero-count levels from all factor columns. So your example will become simply:
# two ways to remove rows that contain a string
datNew1 <- droplevels( dat[-which(dat$char == "nam"), ] )
datNew2 <- droplevels( dat[grep("nam", dat[ ,"char"], invert=TRUE), ] )
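The question also asks about re-indexing. A minimal sketch, assuming that resetting the row names is what is meant: assigning NULL to rownames renumbers the rows from 1. The column char is created explicitly as a factor here because, since R 4.0.0, data.frame no longer converts strings to factors by default.

```r
# Made-up data as in the question, with char stored as a factor
char <- c(rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3))
dat <- data.frame(char = factor(char), a = 1:15 / pi)

# Remove the "nam" rows and drop the now-unused factor level
datNew <- droplevels(dat[dat$char != "nam", ])

# Reset the row names so they run 1, 2, 3, ... without gaps
rownames(datNew) <- NULL

levels(datNew$char)
```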
I have pasted something from my own code. I have an enclosure experiment in a lake, with measurements from the enclosures and from the lake itself, but mostly I don't want to deal with the lake.
My variable is called "t.level" and its levels were control, low, medium, high and lake.
This code makes it possible to use nolk$ or data=nolk to get the data without the "lake" level:
nolk<-subset(mylakedata,t.level == "control" |
t.level == "low" |
t.level == "medium" |
t.level=="high")
# Drop unused levels from every factor column; leave other columns unchanged
nolk[]<-lapply(nolk, function(col) if(is.factor(col))
col[drop=TRUE]
else col)