I have a data frame which looks like this (simplified):
data1.time1 data1.time2 data2.time1 data2.time2 data3.time1 group
1 1.53 2.01 6.49 5.22 3.46 A
...
24 2.12 3.14 4.96 4.89 3.81 C
where there are actually dataK.timeT for K in 1..27 and T in some (but maybe not all) of 1..8.
I would like to rearrange the data into K data frames so that I can plot, for each K, the summary data (for now let's say mean and mean ± standard deviation) for each of the three groups A, B, and C. That is, I want 27 graphs with three lines per graph, and also marks for the deviations.
Once I rearrange the data it should be easy enough to collapse by group, compute summary statistics, etc. But I'm not really sure how to get the data into this form. I looked at the reshape package, which suggests melting it into a key-value store format and rearranging from there, but it doesn't seem to support the columns containing the T values as I have here.
Is there a good way to do this? I'm quite willing to use something other than R to do this, since I can just import the results into R after transforming.
After creating fake data with a structure similar to yours, we convert from wide to long format, making a "tidy" data frame that is ready for plotting with ggplot2.
library(reshape2)
library(ggplot2)
library(dplyr)
Create fake data
set.seed(194)
dat = data.frame(replicate(27*8, cumsum(rnorm(24*3))))
names(dat) = paste0(rep(paste0("data",1:27), each=8), ".", rep(paste0("time",1:8), 27))
dat$group = rep(LETTERS[1:3], each=24)
Remove some columns so that number of time points will be different for different data sources:
dat = dat[ , -c(2,4,9,43,56,78,100:103,115:116,134:136,202,205)]
Reshape from wide to long format
datl = melt(dat, id.var="group")
Split data source and time point into separate columns:
datl$source = gsub("(.*)\\..*","\\1", datl$variable)
datl$time = as.numeric(gsub(".*time(.*)","\\1", datl$variable))
# Order data frame names by number (rather than alphabetically)
datl$source = factor(datl$source, levels=paste0("data",1:length(unique(datl$source))))
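To see what the two gsub() calls above extract, here is a minimal sketch on three made-up column names. The vector example_names is hypothetical and not part of the data above; it only follows the same dataK.timeT pattern.

```r
# Hypothetical column names in the same dataK.timeT pattern
example_names <- c("data1.time1", "data12.time8", "data27.time3")

# Everything before the dot is the data source
example_source <- gsub("(.*)\\..*", "\\1", example_names)

# Everything after "time" is the time point, converted to a number
example_time <- as.numeric(gsub(".*time(.*)", "\\1", example_names))

print(example_source)
print(example_time)
```

Because the time point is stored as a number, ggplot2 treats it as a continuous x axis, so missing time points simply leave gaps rather than breaking the plot.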
Plot the data using ggplot2
# Helper function for plotting standard deviation
sdFnc = function(x) {
vals = c(mean(x) - sd(x), mean(x) + sd(x))
names(vals) = c("ymin", "ymax")
vals
}
pd = position_dodge(0.7)
ggplot(datl, aes(time, value, group=group, color=group)) +
stat_summary(fun=mean, geom="line", position=pd) +  # `fun` replaces `fun.y`, deprecated since ggplot2 3.3.0
stat_summary(fun.data=sdFnc, geom="errorbar", width=0.4, position=pd) +
stat_summary(fun=mean, geom="point", position=pd) +
facet_wrap(~source, ncol=3) +
theme_bw()
Original (unnecessarily complicated) reshaping code. (Note, this code will no longer work with the updated (fake) data set, because the number of time columns is no longer uniform):
# Convert data source from wide to long
datl = data.frame()
for (i in seq(1,27*8,8)) {
tmp.dat = dat[, c(i:(i+7),grep("group",names(dat)))]
tmp.dat$source = gsub("(.*)\\..*", "\\1", names(tmp.dat)[1])
names(tmp.dat)[1:8] = 1:8
#datl = rbind(datl, tmp.dat)
datl = bind_rows(datl, tmp.dat) # Updated based on comment
}
datl$source = factor(datl$source, levels=paste0("data",1:27))
# Convert time from wide to long
datl = melt(datl, id.var = c("source","group"), variable.name="time")
Could do something like this with dplyr:
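library(dplyr)   ## across() needs dplyr >= 1.0
summaries <- list()
for (i in 1:K) {                                      ## K = 27 here
my.data.ind <- paste0("^data", i, "\\.|^group$")      ## regex: columns of data i, plus group
summaries[[i]] <- data %>%
select(matches(my.data.ind)) %>%                      ## grab cols matching the regex
group_by(group) %>%                                   ## group by your group
summarise(across(everything(), list(mean = mean, sd = sd)))  ## mean and sd for each col within each group
}
That should leave you with one data frame per data source in summaries, each with one row per group (three rows here) and a mean and an sd column for every time point.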
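If you would rather not depend on dplyr, the same per-group mean and standard deviation can be sketched in base R with aggregate(), assuming the data are already in long format. The data frame long_demo and its columns are made up for illustration and only mimic the structure discussed here.

```r
# Made-up long-format data: 3 groups, 1 source, 2 time points, 2 values per cell
set.seed(1)
long_demo <- data.frame(
  group  = rep(c("A", "B", "C"), each = 4),
  source = "data1",
  time   = rep(1:2, times = 6),
  value  = rnorm(12)
)

# Mean and sd of value for every group/source/time combination
summ <- aggregate(value ~ group + source + time, data = long_demo,
                  FUN = function(x) c(mean = mean(x), sd = sd(x)))

# aggregate() returns the two statistics as a matrix column; flatten it
summ <- do.call(data.frame, summ)
print(summ)
```

The result has one row per group and time point, with columns value.mean and value.sd that could be used directly for lines and error bars.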
Related
I'm trying to draw 18 graphs using R and the ggplot2 package. My data look like this:
v1 v2 v3 ... v18 subject group
534 543 512 ... 410 1 (6.5, 18]
437 576 465 ... 420 2 (0, 6.5]
466 487 492 ... 501 3 (18, 55]
And I need to create a "faceted" histogram showing distributions for all of the groups in one frame (i.e. to conveniently present all of the subgroups' distributions) like this:
I came up with this code for a single plot:
ggplot(data = df, aes (x = v1)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
But since there are 18 variables (v1, v2,...), I'm looking for a way to write an efficient function/loop/command that would draw all the 18 graphs without me having to copy/paste and change the variable name 18 times. Like this:
ggplot(data = df, aes (x = **v1**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = **v2**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
ggplot(data = df, aes (x = **v3**)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
I know that the solution probably lies in looping and it seems like a useful skill to have, so I'm also using this opportunity to learn this right.
Thank you, any help is appreciated! (And thanks to all the suggestions so far!)
This is where I've gotten so far with the kind help of the user below:
for (v in c(v1,v2)) {
pdf("plots.pdf")
histograms <- ggplot(data = data, aes (x = v)) + geom_histogram (boundary = 500) + facet_wrap(~Group, nrow = 2)
print(histograms)
}
dev.off()
EDIT A significantly revised answer is provided having clarified the needs.
The problem presents several common issues, each of which is addressed in other posts. However, perhaps this suggestion allows for a one-stop solution to these common issues.
My first suggestion is to reformat the data into a "long" format. There are many resources describing this and packages to help. Many users embrace the "tidyverse" set of tools and I'll leave that to others. I'll demonstrate a simple approach using base functions. I don't recommend the reshape() function in the stats package. I find it to be useful for repeated measures with time as one of the variables but find it rather complicated for other data.
A large fake data set will be generated in the "wide" format with demographic data (id, sex, weight, age, group) and 18 variables named "v01", "v02", ..., "v18", drawn as random integers between 400 and 600 and then divided by a small random factor to add variation.
# Set random number generator and number of "individuals" in fake data
set.seed(1234) # to ensure reproducibility
N <- 936 # number of "individuals" in the fake data
# Create typical fake demographic data and divide the age into 4 groups
id <- factor(sample(1e4:9e4, N, replace = FALSE))
age <- rpois(N, 36)
sex <- sample(c("F","M"), N, replace = TRUE)
weight <- 16 * log(age)
group <- cut(age, breaks = c(12, 32, 36, 40, 62))
Generate 18 fake values for each individual for the wide format and then create the fake "wide" data.frame.
# 18 variable measurements for wide format
V <- replicate(18, sample(400:600, N, replace = TRUE), simplify = FALSE)
names(V) <- sprintf("v%02d", 1:18)
# Add a little variation to the fake data
adj <- sample(1:6, 18, replace = TRUE)
V <- Map("/", V, adj) # divide each value by the number in 'adj'
V <- lapply(V, round, 1) # simplify
# Create data.frame with variable data in wide format
vars <- as.data.frame(V)
names(vars)
# Assemble demographic and variable data into a typical "wide" data set
wide <- data.frame(id, sex, weight, age, group, vars)
names(wide)
head(wide)
In the "wide" format, each row corresponds to a unique individual with demographic information and 18 values for 18 variables. This is going to be changed into the "long" format with each value represented by a row. The new "long" data frame will have two new variables for the data (values) and a factor indicating the group from which the data came (ind). Typically they get renamed but I will simply work with the default names here.
As noted above, the simple base function stack() will be used to stack the variables into a single vector. In contrast to cbind(), the data.frame() function will replicate values only as long as the lengths are whole multiples of each other. The following code takes advantage of this property to build the "long" data.frame.
# Identify those variables to be stacked (they all start with 'v')
sel <- grepl("^v", names(wide))
long <- data.frame(wide[!sel], stack(wide[sel]))
head(long)
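To make the recycling behaviour concrete, here is a small sketch on a three-row toy data frame. The names wide_demo, sel_demo and long_demo are invented for this illustration; the technique is the same stack() plus data.frame() combination used above.

```r
# Toy wide data: one id column and two measurement columns
wide_demo <- data.frame(id = 1:3,
                        v01 = c(10, 20, 30),
                        v02 = c(11, 21, 31))

# Columns to be stacked all start with 'v'
sel_demo <- grepl("^v", names(wide_demo))

# stack() yields 6 rows; data.frame() recycles the 3 id values to match
long_demo <- data.frame(wide_demo[!sel_demo], stack(wide_demo[sel_demo]))
print(long_demo)
```

Each id appears once per stacked variable, so every row of the long data still carries its original demographic information.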
My second suggestion is to use one of the "apply" functions to create a list of ggplot objects. By storing the plots in this variable, you have the option of plotting them with different formats without running the plotting code each time.
The code creates a plot for each of the 18 different variables, which are identified by the new variable ind. I changed boundary = 500 to bins = 10 since I don't know what your actual data looks like. I also added a "caption" to each plot identifying the original variable.
library(ggplot2) # to use ggplot...
plotList <- lapply(levels(long$ind), function(i)
ggplot(data = subset(long, ind == i), aes(x = values))
+ geom_histogram(bins = 10)
+ facet_wrap(~ group, nrow = 2)
+ labs(caption = paste("Variable", i)))
names(plotList) <- levels(long$ind) # name the list elements for convenience
Now to examine each of the 18 plots (this may not work in RStudio):
opar <- par(ask = TRUE)
plotList # This is the same as print(plotList)
par(opar) # turn off the 'ask' option
To save the plots to file, the advice of Imo is good. But it would be wise to take control of the size and nature of the file output. I suggest you look at the help files for pdf() and dev.print(). The last part of this answer shows one possibility with the pdf() function using a for loop to generate single page plots.
for (v in levels(long$ind)) {
fname <- paste(v, "pdf", sep = ".")
fname <- file.path("~", fname) # change this to specify a directory
pdf(fname, width = 6.5, height = 7, paper = "letter")
print(plotList[[v]])
dev.off()
}
And just to add another possible approach, here's a solution with lattice showing 6 groups of variables per plot. (Personally, I'm a fan of this simpler approach.)
library(lattice)
idx <- split(levels(long$ind), gl(3, 6, 18))
opar <- par(ask = TRUE)
for (i in idx)
plot(histogram(~values | group + ind, data = long,
subset = ind %in% i, as.table = TRUE))
par(opar)
I'm new to R and I'm having some trouble solving this problem.
I have the following table/dataframe:
I am trying to generate a boxplot like this one:
However, I want the x-axis to be scaled according to the labels 1000, 2000, 5000, etc.
So I want the distance between 1000 and 2000 to be different from the distance between 50000 and 100000, since the actual distances are not the same.
Is it possible to do that in R?
Thank you everyone and have a nice day!
Maybe try converting the data set into this format, i.e. with the values as integers in a column rather than as header titles?
# packages
library(ggplot2)
library(reshape2)
# data in ideal format
dt <- data.frame(x=rep(c(1,10,100), each=5),
y=runif(15))
# data that we have. Use reshape2::dcast to get data in to this format
dt$id <- rep(1:5, 3)
dt_orig <- dcast(dt, id~x, value.var = "y")
dt_orig$id <- NULL
names(dt_orig) <- paste0("X", names(dt_orig))
# lets get back to dt, the ideal format :)
# melt puts it in (variable, value) form. Need to configure variable column
dt2 <- melt(dt_orig)
# firstly, remove X from the string
dt2$variable <- gsub("X", "", dt2$variable)
# almost there, can see we have a character, but need it to be an integer
class(dt2$variable)
dt2$variable <- as.integer(dt2$variable)
# boxplot with variable X axis
ggplot(dt2, aes(x=variable, y=value, group=variable)) + geom_boxplot() + theme_minimal()
Base way of re-shaping data: https://www.statmethods.net/management/reshape.html
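For what it's worth, the same header-to-column conversion can be sketched in base R alone with stack(). This assumes, as in the example above, that the headers are numbers prefixed with "X"; the toy data frame dt_wide_demo is made up for illustration.

```r
# Toy wide data whose headers encode the x values
dt_wide_demo <- data.frame(X1 = c(0.1, 0.2),
                           X10 = c(0.3, 0.4),
                           X100 = c(0.5, 0.6))

# stack() gives a 'values' column and an 'ind' factor holding the old headers
dt_long_demo <- stack(dt_wide_demo)

# Strip the leading X and convert the header text to an integer x position
dt_long_demo$x <- as.integer(sub("^X", "", as.character(dt_long_demo$ind)))
print(dt_long_demo)
```

With x stored as an integer, a call like boxplot(values ~ x, data = dt_long_demo, at = sort(unique(dt_long_demo$x))) or the ggplot2 call above can place the boxes at their numeric positions.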
I am using this script to plot chemical elements using ggplot2 in R:
# Load the same data set under a different name, because it is just for plotting elements as a well log:
Core31B1 <- read.csv('OilSandC31B1BatchResultsCr.csv', header = TRUE)
#
# Calculating the ratios of Ca.Ti, Ca.K, Ca.Fe:
C31B1$Ca.Ti.ratio <- (C31B1$Ca/C31B1$Ti)
C31B1$Ca.K.ratio <- (C31B1$Ca/C31B1$K)
C31B1$Ca.Fe.ratio <- (C31B1$Ca/C31B1$Fe)
C31B1$Fe.Ti.ratio <- (C31B1$Fe/C31B1$Ti)
#C31B1$Si.Al.ratio <- (C31B1$Si/C31B1$Al)
#
# Create a subset of ratios and depth
core31B1_ratio <- C31B1[-2:-18]
#
# Removing the totCount column:
Core31B1 <- Core31B1[-9]
#
# Melting the data set based on the depth values, to have only three columns: depth, element and count
C31B1_melted <- melt(Core31B1, id.vars="depth")
#ratio melted
C31B1_ra_melted <- melt(core31B1_ratio, id.vars="depth")
#
# Eliminating the NA data from the data set
C31B1_melted<-na.exclude(C31B1_melted)
# ratios
C31B1_ra_melted <-na.exclude(C31B1_ra_melted)
#
# Rename the columns:
colnames(C31B1_melted) <- c("depth","element","counts")
# ratios
colnames(C31B1_ra_melted) <- c("depth","ratio","percentage")
#
# Plotting the data in well log format using ggplot2:
Core31B1_Sp <- ggplot(C31B1_melted, aes(x=counts, y=depth)) +
theme_bw() +
geom_path(aes(linetype = element))+ geom_path(size = 0.6) +
labs(title='Core 31 Box 1 Bioturbated sediments') +
scale_y_reverse() +
facet_grid(. ~ element, scales='free_x') #rasterImage(Core31Image, 0, 1515.03, 150, 0, interpolate = FALSE)
#
# View the plot:
Core31B1_Sp
I got the following image (as you can see the plot has seven element plots, and each one has its scale. Please ignore the shadings and the image at the far left):
My question is, is there a way to make these scales the same like using log scales? If yes what I should change in my codes to change the scales?
It is not clear what you mean by "the same", because putting every panel on the same scale will not give you the same result as log transforming the values. Here is how to get the log transformation, which, when combined with not using free_x, will give you the plot I think you are asking for.
First, since you didn't provide any reproducible data (see here for more on how to ask good questions), here is some that gives at least some of the features that I think your data has. I am using tidyverse (specifically dplyr and tidyr) to do the construction:
library(dplyr)
library(tidyr)
forRatios <-
names(iris)[1:3] %>%
combn(2, paste, collapse = " / ")
toPlot <-
iris %>%
mutate_(.dots = forRatios) %>%
select(contains("/")) %>%
mutate(yLocation = 1:n()) %>%
gather(Comparison, Ratio, -yLocation) %>%
mutate(logRatio = log2(Ratio))
Note that the last line takes the log base 2 of the ratio. This allows ratios in each direction (above and below 1) to plot meaningfully. I think that step is what you need. You can accomplish something similar with myDF$logRatio <- log2(myDF$ratio) if you don't want to use dplyr.
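A tiny base R sketch of why the log helps, using made-up ratios (the vector demo_ratio is purely illustrative): a ratio and its reciprocal end up the same distance from zero, just with opposite signs.

```r
# Made-up ratios: 1/4, 1/2, 1, 2 and 4
demo_ratio <- c(0.25, 0.5, 1, 2, 4)

# On the log2 scale, halving and doubling are symmetric around 0
demo_log_ratio <- log2(demo_ratio)
print(demo_log_ratio)
```

On the raw scale the ratios below 1 are squeezed between 0 and 1 while those above 1 spread out without bound, which is what makes the untransformed panels hard to compare.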
Then, you can just plot that:
ggplot(
toPlot
, aes(x = logRatio
, y = yLocation) ) +
geom_path() +
facet_wrap(~Comparison)
Gives:
I am looking for a way to transform time series of different length into a unique length.
I think this question has already been asked, but I can't find it. I guess I am just not using the right vocabulary for the question.
Data 1: 20 variables x 250 observations (time points)
Data 2: 20 variables x 50 observations (time points)
I would like to transform these data into 100 observations while keeping the shape of curves for the 20 variables in both cases.
Thanks a lot
Sample data
set.seed(123)
data <- matrix(0, 250, 20)
data[1, ] <- rnorm(20)
for (i in 2:nrow(data)) {
data[i, ] <- data[i - 1, ] + rnorm(20, 0, 0.02)
}
rownames(data) <- 0:249
One way of handling this is with reshape2 and dplyr:
library("reshape2")
library("dplyr")
library("ggplot2")
molten <- melt(data, varnames = c("Time", "Variable"))
Plot of original data:
ggplot(molten, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
Now reduce the data.frame by a factor of 5 using means of the values in each time period:
shorter <- molten %>%
group_by(Variable, Time %/% 5) %>%
summarise(value = mean(value), Time = mean(Time))
Plot new data:
ggplot(shorter, aes(x = Time, y = value, colour = factor(Variable))) + geom_line()
If you want original wide form of data:
shorterWide <- acast(shorter, Time ~ Variable)
I think I found a way using this function
Basic two-dimensional cubic spline fitting in R
I guess the keyword I was missing was cubic spline.
In my case I want to do something like that
spline(Data1, n = 100)
spline(Data2, n = 100)
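As a rough sketch of that idea, assuming Data1 and Data2 are numeric matrices with time in rows and one variable per column, each column can be interpolated onto 100 equally spaced points with stats::spline(). The helper resample_columns is a name invented here, and the random matrices only stand in for the real data.

```r
# Hypothetical helper: resample every column of a matrix to n time points
resample_columns <- function(m, n = 100) {
  apply(m, 2, function(y) spline(x = seq_along(y), y = y, n = n)$y)
}

# Stand-in data with the dimensions from the question
set.seed(123)
Data1 <- matrix(rnorm(250 * 20), nrow = 250, ncol = 20)
Data2 <- matrix(rnorm(50 * 20), nrow = 50, ncol = 20)

res1 <- resample_columns(Data1)  # 250 -> 100 time points
res2 <- resample_columns(Data2)  # 50 -> 100 time points
print(dim(res1))
print(dim(res2))
```

Note that spline() interpolates rather than averages, so shrinking a noisy series this way keeps the curve's values at the new time points but does not smooth it; the binned means in the other answer may suit that case better.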
What is the best way to construct a barplot to compare two sets of data?
e.g. dataset:
Number <- c(1,2,3,4)
Yresult <- c(1233,223,2223,4455)
Xresult <- c(1223,334,4421,0)
nyx <- data.frame(Number, Yresult, Xresult)
What I want is Number across X and bars beside each other representing the individual X and Y values
It is better to reshape your data into long format. You can do that with for example the melt function of the reshape2 package (alternatives are reshape from base R, melt from data.table (which is an extended implementation of the melt function of reshape2) and gather from tidyr).
Using your dataset:
# load needed libraries
library(reshape2)
library(ggplot2)
# reshape your data into long format
nyxlong <- melt(nyx, id=c("Number"))
# make the plot
ggplot(nyxlong) +
geom_bar(aes(x = Number, y = value, fill = variable),
stat="identity", position = "dodge", width = 0.7) +
scale_fill_manual("Result\n", values = c("red","blue"),
labels = c(" Yresult", " Xresult")) +
labs(x="\nNumber",y="Result\n") +
theme_bw(base_size = 14)
which gives the following barchart: