I am trying to create a histogram of my data.
My dataframe looks like this
x counts
4 78
5 45
... ...
where x is the variable I would like to plot and counts is the number of observations. If I do hist(x) the plot will be misleading because I am not taking into account the count. I have also tried:
hist(do.call("c", (mapply(rep, df$x, df$count))))
Unfortunately, this does not work because the resulting vector would be too big:
sum(df$counts)
[1] 7943571126
Is there any other way I can try?
Thank you
The solution is a barplot, as @Rui Barradas suggested. I use ggplot2 to plot the data.
library(ggplot2)
x <- c(4, 5, 6, 7, 8, 9) # same length as counts, as data.frame() requires
counts <- c(78, 45, 50, 12, 30, 50)
df <- data.frame(x=x, counts=counts)
plt <- ggplot(df) + geom_bar(aes(x=x, y=counts), stat="identity")
print(plt)
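For what it's worth, geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"), so the same bar chart can be sketched as below. The data values are illustrative ones in the spirit of this answer (with x and counts the same length); nothing else is assumed.

```r
library(ggplot2)

# Illustrative data: one row per distinct x value with its precomputed count
df <- data.frame(x = c(4, 5, 6, 7, 8, 9),
                 counts = c(78, 45, 50, 12, 30, 50))

# geom_col() takes bar heights directly from y, so the data is never expanded
plt <- ggplot(df, aes(x = x, y = counts)) + geom_col()
print(plt)
```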
Since creating a new row for each repetition of x was not possible due to the size of the data, you can plot the density with a weight in ggplot2 using geom_histogram.
library(tidyverse)
set.seed(1)
x <- 1:100
counts <- sample(20:200, 100, replace = TRUE)
df <- data.frame(x,counts)
df %>% ggplot() + geom_histogram(aes(x = x, y = after_stat(density), weight = counts))
Compare this with the unweighted histogram, which ignores the counts column:
df %>% ggplot() +geom_histogram(aes(x=x))
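To see what the weighting amounts to, here is a small base-R sketch that sums the counts per bin directly, without ever expanding the data. The toy data and the bin edges (breaks) are assumptions made for illustration; because the toy data is small, the result can be checked against hist() on the expanded vector.

```r
# Toy data (hypothetical): values and how often each occurs
df <- data.frame(x = 1:10, counts = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3))

# Assumed bin edges; intervals are right-closed, matching hist()'s default
breaks <- c(0, 2, 4, 6, 8, 10)
bins <- cut(df$x, breaks = breaks)

# Sum the counts falling in each bin instead of repeating x `counts` times
bin_counts <- tapply(df$counts, bins, sum)

# Weighted density: bin count / (total count * bin width)
bin_density <- bin_counts / (sum(df$counts) * diff(breaks))

# Sanity check, feasible only because the toy data is small
expanded <- rep(df$x, df$counts)
h <- hist(expanded, breaks = breaks, plot = FALSE)
stopifnot(all(as.vector(bin_counts) == h$counts))

barplot(bin_counts, xlab = "x", ylab = "count")
```

The memory cost here depends on the number of rows and bins, not on the total count, which is the point when the total runs into the billions.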
Related
I have this data frame to construct some lines chart using ggplot2. lb is what I want my label to be on x-axis while each other variables (x0.6, x0.8, x0.9, x0.95, x0.99, and x0.999) will be against lb on the y-axis.
# my data
lb <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
x0.6 <- c(0.9200795, 0.9315084, 0.9099002, 0.9160192, 0.9121120, 0.9134098, 0.9130619, 0.9128494, 0.9144164)
x0.8 <- c(0.9804872, 1.0144678, 0.9856382, 0.9730490, 1.0032707, 1.0036311, 0.9726198, 0.9986403, 1.0022643)
x0.9 <- c(1.055256, 1.016159, 1.067242, 1.089894, 1.043502, 1.041497, 1.037738, 1.023274, 1.040536)
x0.95 <- c(1.058024, 1.105353, 1.069076, 1.061077, 1.095764, 1.096789, 1.096670, 1.121497, 1.109918)
x0.99 <- c(1.107258, 1.098061, 1.118248, 1.101253, 1.083208, 1.109715, 1.083704, 1.083704, 1.118057)
x0.999 <- c(1.110732, 1.119625, 1.121221, 1.087423, 1.093228, 1.094003, 1.108910, 1.112413, 1.096734)
# my data frame
pos11 <- data.frame(lb, x0.6, x0.8, x0.9, x0.95, x0.99, x0.999)
#load packages
library("reshape2")
library("ggplot2")
# this `R` CODE reshapes the data
long_pos11 <- melt(pos11, id="lb")
# Here is the `R` code that produces the `line-chart`
pos_line <- ggplot(data = long_pos11,
aes(x=lb, y=value, colour=variable)) +
geom_line()
I want the line chart to show the elements of the vector lb (1, 2, 3, 4, 5, 6, 7, 8, 9) as labels on the x-axis, just as the dates are shown in "Plotting two variables as lines using ggplot2 on the same graph".
Try this. As your variable is numeric, you need to convert it to a factor and also add group to your aes() statement. Here is the code:
library("reshape2")
library("ggplot2")
# this `R` CODE reshapes the data
long_pos11 <- melt(pos11, id="lb")
# Here is the `R` code that produces the `line-chart`
pos_line <- ggplot(data = long_pos11,
aes(x=factor(lb), y=value, colour=variable,group=variable)) +
geom_line()+xlab('lb')
We can also use pivot_longer
library(ggplot2)
library(tidyr)
library(dplyr)
pos11 %>%
pivot_longer(cols = -lb) %>%
mutate(lb = factor(lb)) %>%
ggplot(aes(x = lb, y = value, color = name, group = name)) +
geom_line() +
xlab('lb')
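Both answers convert lb to a factor, which makes the axis discrete. If lb should stay numeric (for example, so that unevenly spaced values keep their spacing), a possible alternative is to leave it continuous and set the breaks explicitly with scale_x_continuous(). This is only a sketch; the shortened pos11 data frame here is a stand-in for the one in the question.

```r
library(ggplot2)
library(tidyr)

# Stand-in for the question's pos11, truncated to two series for brevity
pos11 <- data.frame(
  lb   = 1:9,
  x0.6 = c(0.9200795, 0.9315084, 0.9099002, 0.9160192, 0.9121120,
           0.9134098, 0.9130619, 0.9128494, 0.9144164),
  x0.8 = c(0.9804872, 1.0144678, 0.9856382, 0.9730490, 1.0032707,
           1.0036311, 0.9726198, 0.9986403, 1.0022643)
)

long_pos11 <- pivot_longer(pos11, cols = -lb)

# lb stays numeric; breaks = pos11$lb puts a tick at every observed value
pos_line <- ggplot(long_pos11, aes(x = lb, y = value, colour = name)) +
  geom_line() +
  scale_x_continuous(breaks = pos11$lb) +
  xlab("lb")
print(pos_line)
```

With a numeric x and a discrete colour mapping, the lines are grouped by series automatically, so no explicit group aesthetic is needed in this variant.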
Using matplot, I can plot a line for each row of a dataframe at given x values. For example
set.seed(1)
df <- matrix(runif(20, 0, 1), nrow = 5)
matplot(t(df), type = "l", x = c(1, 3, 7, 9)) # c(1, 3, 7, 9) are the x-axis positions I'd like to plot along
# the line colours are not important
I'd like to use ggplot2 instead, but I'm not sure how best to replicate the outcome. Using melt I can rename the columns to the desired x values, as below. But is there a 'cleaner' approach that I'm missing?
df1 <- as.data.frame(df)
names(df1) <- c(1, 3, 7, 9) # rename columns to the desired x-axis values
df1$id <- 1:nrow(df1)
df1_melt <- melt(df1, id.var = "id")
df1_melt$variable <- as.numeric(as.character(df1_melt$variable)) # convert x-axis values from factor to numeric
ggplot(df1_melt, aes(x = variable, y = value)) + geom_line(aes(group = id))
Any help would be much appreciated. Thanks
Since ggplot2 is increasingly used as part of the tidyverse family of packages, I thought I would post a tidy approach.
library(tidyverse) # provides %>%, rownames_to_column(), gather(), left_join() and ggplot()
# generate data
set.seed(1)
df <- matrix(runif(20, 0, 1), nrow = 5) %>% as.data.frame
# put x-values into a data.frame
x_df <- data.frame(col=c('V1', 'V2', 'V3', 'V4'),
x=c(1, 3, 7, 9))
# make a tidy version of the data and graph
df %>%
rownames_to_column %>%
gather(col, value, -rowname) %>%
left_join(x_df, by='col') %>%
ggplot(aes(x=x, y=value, color=rowname)) +
geom_line()
The key idea is to gather() the data into tidy format, so that instead of being 5 rows × 4 columns, the data is 20 rows × 1 value column along with a few other identifier columns (col, rowname and eventually x, in this particular case).
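As a rough illustration of that reshaping step, the sketch below uses pivot_longer(), which the tidyr documentation recommends over gather() for new code, and checks the resulting shape. The column names V1 to V4 are the defaults produced by as.data.frame() on an unnamed matrix; the lookup table x_df mirrors the one above.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(1)
df <- as.data.frame(matrix(runif(20, 0, 1), nrow = 5))

# Lookup table mapping each column name to its x position
x_df <- data.frame(col = c("V1", "V2", "V3", "V4"),
                   x   = c(1, 3, 7, 9))

tidy_df <- df %>%
  rownames_to_column() %>%   # adds a "rowname" column identifying each line
  pivot_longer(cols = -rowname, names_to = "col", values_to = "value") %>%
  left_join(x_df, by = "col")

dim(tidy_df)  # 20 rows; columns rowname, col, value, x
```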
autoplot.zoo can do ggplot graphics of matrix data. Omit the facet argument if you want separate panels. The inputs are defined in the Note at the end.
library(ggplot2)
library(zoo)
z <- zoo(t(m), x) # use t so that series are columns
autoplot(z, facet = NULL) + xlab("x")
Note: The inputs used:
set.seed(1)
m <- matrix(runif(20, 0, 1), nrow = 5)
rownames(m) <- c("a", "b", "c", "d", "e")
x <- c(1, 3, 7, 9)
I have a dataframe for my variable X, with the corresponding count of how many times each X value appears:
df = data.frame(X, count.X)
I cannot create a frequency vector via
for(i in 1:length(X)) rep(X[i], count.X[i])
since the total count is around 37 million and memory allocation becomes an issue.
I would like to make a histogram with the variable X on the x-axis and count.X as the bar heights. However, I cannot find how to do this, as everything seems geared towards plotting frequency vectors.
Thanks :)
You can use stat="identity" with geom_bar.
e.g.
testdt <- data.frame(x = c(1,2,3,4,5,6), count = c(10, 20, 10, 5, 15, 25))
ggplot(data = testdt) + geom_bar(aes(x = x, y = count), stat = "identity")
Following the very good example provided here, I tried to make the following filled contour plot.
x<-seq(1,11,.03) # note finer grid
y<-seq(1,11,.03)
xyz.func<-function(x,y) {(x^2+y^2)}
gg <- expand.grid(x=x,y=y)
gg$z <- with(gg,xyz.func(x,y)) # need long format for ggplot
brks <- cut(gg$z,breaks=c(1, 2, 5, 10, 30, 50, 100, 200))
brks <- gsub(","," - ",brks,fixed=TRUE)
gg$brks <- gsub("\\(|\\]","",brks) # reformat guide labels
ggplot(gg,aes(x,y)) +
geom_tile(aes(fill=brks))+
scale_fill_manual("Z",values=brewer.pal(7,"YlOrRd"))+
scale_x_continuous(expand=c(0,0))+
scale_y_continuous(expand=c(0,0))+
coord_fixed()
The result looks like this:
The thing is, the contours are sorted by alphabetical order, not by ascending values.
How would you change the order of the colors to be by ascending z values?
At first, I thought about adding "0"s in front of the values. I tried something like:
brks <- gsub(pattern = "(\b[0-9]\b)", replacement = "0$1", x = brks)
But it does not work.
Moreover, it would only add one zero in front of single digits, and 100 would still be before 02.
Actually, I'm not completely satisfied with this workaround, as 001 - 002 does not look beautiful.
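Keep the breaks as a factor (here an ordered one) and relabel its levels, rather than running gsub() on the factor itself, which converts it to character and loses the level order:
library(ggplot2)
library(RColorBrewer) # for brewer.pal()
x<-seq(1,11,.03) # note finer grid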
y<-seq(1,11,.03)
xyz.func<-function(x,y) {(x^2+y^2)}
gg <- expand.grid(x=x,y=y)
gg$z <- with(gg,xyz.func(x,y)) # need long format for ggplot
brks <- cut(gg$z,breaks=c(1, 2, 5, 10, 30, 50, 100, 200), ordered_result = TRUE)
# reformat guide labels by editing the levels, which keeps their order
levels(brks) <- gsub(","," - ", levels(brks), fixed=TRUE)
levels(brks) <- gsub("\\(|\\]","", levels(brks))
gg$brks <- brks
ggplot(gg,aes(x,y)) +
geom_tile(aes(fill=brks))+
scale_fill_manual("Z",values=brewer.pal(7,"YlOrRd"))+
scale_x_continuous(expand=c(0,0))+
scale_y_continuous(expand=c(0,0))+
coord_fixed()
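The difference between the two approaches can be seen in a small base-R sketch: gsub() applied to the factor itself returns a plain character vector, and when that is turned back into a factor the levels are sorted as strings, whereas editing levels() keeps the order that cut() produced. The sample values in z are made up for illustration.

```r
z <- c(1.5, 3, 7, 20, 40, 75, 150)  # made-up values, one per interval
brks <- cut(z, breaks = c(1, 2, 5, 10, 30, 50, 100, 200))

# Approach from the question: gsub() on the factor yields plain character
as_text <- gsub("\\(|\\]", "", gsub(",", " - ", brks, fixed = TRUE))
levels(factor(as_text))   # sorted as strings, so "10 - 30" precedes "2 - 5"

# Approach from the answer: relabel the levels, keeping cut()'s order
relabelled <- brks
levels(relabelled) <- gsub("\\(|\\]", "",
                           gsub(",", " - ", levels(brks), fixed = TRUE))
levels(relabelled)        # "1 - 2", "2 - 5", "5 - 10", ... in ascending order
```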
I used the code provided in "R: How do I display clustered matrix heatmap (similar color patterns are grouped)" successfully. However, I am not able to replace the y-axis with text labels. Is this possible?
library(reshape2)
library(ggplot2)
# Create dummy data
set.seed(123)
df <- data.frame(
a = sample(1:5, 25, replace=TRUE),
b = sample(1:5, 25, replace=TRUE),
c = sample(1:5, 25, replace=TRUE)
)
# Perform clustering
k <- kmeans(df, 3)
# Append id and cluster
dfc <- cbind(df, id=seq(nrow(df)), cluster=k$cluster)
# Add idsort, the id number ordered by cluster
dfc$idsort <- dfc$id[order(dfc$cluster)]
dfc$idsort <- order(dfc$idsort)
# use reshape2::melt to create data.frame in long format
dfm <- melt(dfc, id.vars=c("id", "idsort"))
ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value))
You can use scale_y_continuous() to set breaks= and then provide labels= (this example just uses letters). Adding expand=c(0,0) inside the scale_...() calls removes the grey area around the tiles.
ggplot(dfm, aes(x=variable, y=idsort)) + geom_tile(aes(fill=value))+
scale_x_discrete(expand=c(0,0))+
scale_y_continuous(expand=c(0,0),breaks=1:25,labels=letters[1:25])
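Note that letters[1:25] simply labels the sorted positions from bottom to top. If each tick should instead show which original row sits at that position, the labels can be derived from idsort. This is a sketch under assumptions: the "row" prefix is a hypothetical label format, and the data is regenerated as in the question.

```r
library(reshape2)
library(ggplot2)

set.seed(123)
df <- data.frame(
  a = sample(1:5, 25, replace = TRUE),
  b = sample(1:5, 25, replace = TRUE),
  c = sample(1:5, 25, replace = TRUE)
)
k <- kmeans(df, 3)
dfc <- cbind(df, id = seq(nrow(df)), cluster = k$cluster)

# idsort[i] is the y position of row i once rows are ordered by cluster
dfc$idsort <- order(dfc$id[order(dfc$cluster)])

# order(idsort)[p] is the id of the row drawn at y position p
row_labels <- paste0("row", dfc$id[order(dfc$idsort)])

dfm <- melt(dfc, id.vars = c("id", "idsort"))

p <- ggplot(dfm, aes(x = variable, y = idsort)) +
  geom_tile(aes(fill = value)) +
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_continuous(expand = c(0, 0), breaks = 1:25, labels = row_labels)
print(p)
```

Any character vector of length 25 works in labels=, so real sample names could be substituted for the hypothetical "row" labels in the same way.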