This seems simple, but I've tried multiple variations of matplot, ggplot2, regular old plot...I can't get any to do what I need.
I have a gigantic dataframe of years, months, and observations. I simplified it down to number of observations per month, per year, see below. I'm not sure why it read in with the "X" in front of each column heading, but if it's not going to affect the code, right now I don't care.
head(storms)
X Month X1992 X1993 X1994
1 1 1 2 1
2 2 2 4 1
3 3 3 26 10
4 4 4 47 26
5 5 5 969 615
The full (simplified) set is 10 columns of years (1992-2001), each with 12 months/rows of totals (1 storm in Jan 1992, 26 storms in March 1993...). I need simply to plot these all on an x-axis 120 months long, # of observations per month on the y-axis. It could be a line or bars or vertical lines. I've seen many ways to plot 20 lines with 12 months on the x-axis; that is not what I'm going for. I also need to label the years every 12 months, but I think I can figure that out after I get this block out of the way.
In other words (I hope this is more clear if the previous is not):
y axis: # of storms, ylim=c(0, 1000)
x axis: 10 sets of months (Jan-Dec, 1992-2001, 120 months total). The only labels will be the years, every 12 months of course.
I know I'm just thinking about it wrong, could someone please set my head straight?
(first post; please also tell me if I'm not formatting or inquiring properly!)
Is this what you are looking for? If I am not mistaken, you will want to rearrange your data frame so that it is long rather than wide; then you can draw the figure. Because you have 120 months, you may need to think about plot space. This example should at least let you move forward. I hope this helps.
library(tidyr)
library(ggplot2)
# Create sample data
month <- rep(c(1:12), each = 1, times = 2)
nintytwo <- runif(24, 0, 20)
nintythree <- runif(24, 0, 20)
# Create a data frame
ana <- data.frame(month, nintytwo, nintythree)
# Make the data longer rather than wider.
bob <- gather(ana, year, value, -month)
bob$month <- as.factor(bob$month)
# Draw a figure
cathy <- ggplot(bob, aes(x= year,y = value, fill = month)) + geom_bar(stat="identity", position="dodge")
cathy
Here's an example using base R:
# create example data
set.seed(123)
df <- data.frame(Month=1:12)
for(y in 1992:2001){
  tmp <- data.frame(X=as.integer(abs(rnorm(12,mean=2,sd=10))))
  colnames(tmp) <- paste("X",y,sep="")
  df <- cbind(df,tmp)
}
# reshape to long format (one column with the number of storms, plus period columns)
long <- reshape(df[,-1], idvar="Month", ids=df$Month,
                times=names(df[,-1]), timevar="Year",
                varying = list(names(df[,-1])),
                direction = "long",v.names="Storms")
# remove the "X" from the year
long$Year <- substr(long$Year,2,nchar(long$Year))
nYears <- length(unique(long$Year))
# plot the line
plot(x=1:nrow(long),y=long$Storms,type="l",
     xaxt="n",main="Monthly Storms",
     xlab="Period",ylab="Storms",col="RoyalBlue")
# add custom labels (one per year, centered on its 12 months)
axis(1,at=((1:nYears)*12)-6,labels=unique(long$Year))
# add vertical lines between years
abline(v=c(0.5,((1:nYears)*12)+0.5),col="Gray80",lty=2)
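Result: a single line running across all 120 months, with each year labeled at the middle of its 12-month block and dashed gray vertical lines separating the years.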
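If a full reshape feels like overkill, the wide table can also be flattened directly. The sketch below is one possible approach under stated assumptions: it builds a made-up data frame shaped like storms (a Month column plus X1992 to X2001, filled with random Poisson counts), and the names year_cols, counts, and years are only illustrative. The X prefix itself comes from read.csv()'s default check.names = TRUE, which turns headers such as 1992 into valid R names; it does no harm here.

```r
# Hypothetical data shaped like `storms`: a Month column plus X1992..X2001
set.seed(42)
storms <- data.frame(Month = 1:12)
for (y in 1992:2001) storms[[paste0("X", y)]] <- rpois(12, lambda = 20)

# Pick the year columns and stack them column by column,
# which gives Jan 1992, Feb 1992, ..., Dec 2001 in order
year_cols <- grep("^X[0-9]{4}$", names(storms), value = TRUE)
counts <- unlist(storms[year_cols], use.names = FALSE)
years <- sub("^X", "", year_cols)

# One vertical line per month; label each January with its year
plot(seq_along(counts), counts, type = "h", xaxt = "n",
     xlab = "Year", ylab = "Storms per month")
axis(1, at = seq(1, length(counts), by = 12), labels = years)
```

Because unlist() walks the columns in order, this only works when the year columns are already sorted chronologically and every column has all 12 months.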
I have conducted a study with triplicates (SampleID) for each sample (Sample) on different time points.
Now, I want to plot the means of the triplicates for the characteristic "Aerobic".
For example, I want to plot how the amount of aerobic bacteria develops over time. To do that, I need to calculate the mean (and the standard deviation) of each set of triplicates and then plot these means. I imagine using geom_line or geom_point for this.
    SampleID   Sample    Aerobic      Anaerobic   Day
    [Factor]   [Factor]  [num]        [num]       [num]
1   V1.1.K1    V1.1.K    0.610063430  0.05146154  1
2   V1.1.K2    V1.1.K    0.740887757  0.02115290  1
3   V1.1.K3    V1.1.K    0.683726217  0.04270182  1
4   V1.1.N1    V1.1.N    0.432019752  0.35722350  1
5   V1.1.N2    V1.1.N    0.515792694  0.41357935  1
6   V1.14.K16  V1.14.K   0.038141335  0.84496088  14
7   V1.14.K17  V1.14.K   0.042078682  0.76523093  14
8   V1.14.K18  V1.14.K   0.009594763  0.90767637  14
9   V1.14.N0   V1.14.N   0.513100502  0.10618731  14
10  V1.14.W16  V1.14.W   0.483710571  0.32765968  14
How should I do this?
I tried it with the following code:
plot <- mydata %>%
group_by(Sample) %>%
mutate(Mean=mean(Aerobic)) %>%
ggplot(aes(x=Day, y=Aerobic)) +
geom_point()
If I google the questions I get only information about how to calculate the mean alone, but not to set up a new table with the means for the different variables.
Is there something like
calc_mean_by_group ??
You would help me a lot :)
Simple base-R solution for calculating the means:
tapply(X = foo$Aerobic, INDEX = foo$Sample, FUN = mean)
("foo" being the name of your data.frame)
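To get a new table with both the mean and the standard deviation per group, and then plot it, something along these lines may work. This is a minimal base-R sketch, not the only way to do it: the data frame foo below is a hypothetical six-row subset of the data in the question, and the names stats, m, and s are made up for illustration.

```r
# Hypothetical subset of the data from the question
foo <- data.frame(
  Sample  = c("V1.1.K", "V1.1.K", "V1.1.K", "V1.14.K", "V1.14.K", "V1.14.K"),
  Aerobic = c(0.610063430, 0.740887757, 0.683726217,
              0.038141335, 0.042078682, 0.009594763),
  Day     = c(1, 1, 1, 14, 14, 14)
)

# One row per Sample/Day combination, holding the mean and sd of the triplicates
stats <- aggregate(Aerobic ~ Sample + Day, data = foo,
                   FUN = function(x) c(mean = mean(x), sd = sd(x)))
# aggregate() returns a matrix column; flatten it into Aerobic.mean and Aerobic.sd
stats <- do.call(data.frame, stats)
stats <- stats[order(stats$Day), ]

# Plot the means over time with +/- 1 sd error bars
m <- stats$Aerobic.mean
s <- stats$Aerobic.sd
plot(stats$Day, m, type = "b", ylim = range(c(m - s, m + s)),
     xlab = "Day", ylab = "Mean aerobic fraction")
arrows(stats$Day, m - s, stats$Day, m + s, angle = 90, code = 3, length = 0.05)
```

The resulting stats table can also be handed to ggplot() with geom_point() and geom_line() if you prefer ggplot2.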
This seems so simple. I can easily do it in Excel but I want to automate the process through R. I have installed ggplot2. Using RStudio I have read in my CSV file.
The resulting data frame has over 200 rows, each a town in New Hampshire. The first column is titled "Town" and each row below that has the text name of the town, (e.g., "Concord" or "Lancaster"). Column 2 contains a number for each town (spending per elementary school pupil) and the title of that column in the CSV file is "01/02 Elem PPE" - but it shows as "X01.02.Elem.PPE" when using View(). Column 3 has similar numbers for each town and its title in View() is "X02.03.Elem.PPE". Columns 4 through 11 are similar.
I just want to plot a line graph of the numbers in columns 2-11 for one row (one town). It will show how the spending per pupil has changed in that town over time. There must be a simple way to do this, but I can't find it.
Please help. I am a 77 year old with some programming experience 3-5 decades ago but new to R and Rstudio only yesterday.
First, I'll make some new data that mimics yours. It should have more or less the same properties.
library(glue)
library(tidyverse)
set.seed(4314)
mat <- matrix(rpois(40, 5000), ncol=10)
colnames(mat) <- glue("X{sprintf('%2.0f', 1:10)}.{sprintf('%2.0f', 2:11)}.Elem.PPE", sep="") %>%
gsub(". ", ".0", ., fixed=TRUE) %>%
gsub("X ", "X0", ., fixed=TRUE)
df <- tibble(town = c("Concord", "Lancaster", "Manchester", "Nashua"))
df <- bind_cols(df, as_tibble(mat))
Now, this is where you would start. I'm going to assume that you read your csv into an object called df. The first thing you should do to make plotting easier is to pivot the data from wide-form (one-row and 10 columns per observation) to long-form with 1 column and 10 rows per observation. I'm going to save this in an object called df2. The pivot_longer function is in the tidyr package. The first argument is the columns that you want to change from wide- to long-form, in this case, it's everything except town. Then you tell it a variable name for the column names and a variable name for the values. Then, I'm just using a couple of regular expressions to go from X01.02.Elem.PPE to 01/02 for plotting purposes.
df2 <- df %>%
pivot_longer(-town, names_to="time", values_to="val") %>%
mutate(time = gsub("X(.*)\\.Elem\\.PPE", "\\1", time),
time = gsub("\\.", "/", time))
The resulting data frame looks like this:
# # A tibble: 40 x 3
# town time val
# <chr> <chr> <int>
# 1 Concord 01/02 4965
# 2 Concord 02/03 4953
# 3 Concord 03/04 5066
# 4 Concord 04/05 5100
# 5 Concord 05/06 4979
# 6 Concord 06/07 5090
# 7 Concord 07/08 5136
# 8 Concord 08/09 5076
# 9 Concord 09/10 5079
# 10 Concord 10/11 4945
Next, we can make a plot for a single place (before we think about automation). Let's try Concord. First, we'll save the values that we want to put on the x-axis:
xlabs <- unique(df2$time)
Next, we can use ggplot() to make the plot. In the code below, we're first piping the data frame to a filter that will pull out the values for a single town. The filtered data frame is piped into the ggplot() function. Since time in the data frame is a character vector, we need to turn it into a factor and then into a numeric to make the line plot. We add the line geometry to plot the line. Then we change the x-axis labels with scale_x_continuous(). The labs() function changes the axis labels for the x- and y-axes. Finally, ggtitle() puts the title at the top of the plot. I also like theme_bw() rather than the gray background, but that's entirely a matter of personal preference. The resulting plot looks like this:
df2 %>% filter(town == "Concord") %>%
ggplot(aes(x=as.numeric(as.factor(time)), y=val)) +
geom_line() +
scale_x_continuous(breaks=1:10, labels = xlabs) +
labs(x="Time", y="Spending per Pupil") +
ggtitle("Concord") +
theme_bw()
Now, the next part you mentioned was automation - you want to do this for every row of the original data frame. We could do that as follows. First, untown grabs the unique values of town from the data. The for() loop loops from 1 to the number of values in untown. Then you can see where "Concord" was in the previous plot, we now have untown[i]. We also use ggsave() at the end and we paste together the town name and .png. This will make a different plot for each town in R's working directory.
untown <- unique(df2$town)
for(i in seq_along(untown)){
  # build the plot for one town and keep it in an object
  p <- df2 %>% filter(town == untown[i]) %>%
    ggplot(aes(x=as.numeric(as.factor(time)), y=val)) +
    geom_line() +
    scale_x_continuous(breaks=1:10, labels = xlabs) +
    labs(x="Time", y="Spending per Pupil") +
    ggtitle(untown[i]) +
    theme_bw()
  # pass the plot explicitly instead of relying on the last plot created
  ggsave(glue("{untown[i]}.png"), plot = p, width=9, height=6)
}
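If all you need is a quick look at one town, base R can plot the row directly without reshaping. The following is only a sketch under assumptions: the data frame towns is a made-up stand-in for the CSV (a Town column followed by ten X01.02.Elem.PPE style columns with random values), and the names spend and labels are illustrative.

```r
# Hypothetical stand-in for the CSV: one row per town, ten yearly columns
years <- sprintf("X%02d.%02d.Elem.PPE", 1:10, 2:11)
set.seed(1)
towns <- data.frame(Town = c("Concord", "Lancaster"),
                    matrix(rpois(20, 5000), nrow = 2,
                           dimnames = list(NULL, years)))

# Pull the ten spending values for one town out of its row
spend <- unlist(towns[towns$Town == "Concord", -1])

# Turn "X01.02.Elem.PPE" into "01/02" for the axis labels
labels <- sub("^X([0-9]+)\\.([0-9]+)\\..*$", "\\1/\\2", names(spend))

plot(seq_along(spend), spend, type = "l", xaxt = "n",
     xlab = "School year", ylab = "Spending per pupil", main = "Concord")
axis(1, at = seq_along(spend), labels = labels)
```

With the real data, replace towns with the data frame returned by read.csv() and "Concord" with the town of interest.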
I have a very big dataset that I'd like to illustrate using plotly in R.
A sample of my dataset is shown below:
> new_data_2
# Groups: newdatum [8]
date activity totaal
<date> <fct> <int>
1 2019-11-21 N11 144
2 2019-09-22 N11 129
3 2019-05-15 N22 117
4 2019-01-23 N22 12
5 2019-07-04 N22 12
6 2019-07-18 N22 12
...
For every activity I want to display the amount (totaal) per date (date) in a time series plot.
Somehow I don't get it right in R. Somehow I need to group my activity to display, but I can't figure it out.
new_data_2 %>%
group_by(activity) %>%
plot_ly(x=new_data_2$newdatum) %>%
add_lines(y=~new_data_2$totaal, color = ~factor(newdatum))
It displays an empty plot, without 'activity' on the left side.
You're on the right track, but after the group_by() you need to tell R to do something to the groups.
new_data_2 %>%
group_by(activity, date) %>% # use two groupings since you want by activity & date
summarise(totaal_2 = sum(totaal))
That should get to the dataframe you're looking for. You can use ggplot & plotly on it from there.
I would recommend reshaping the data first (as above), saving it as a new object, and then graphing it. Doing it this way helps you see each step along the way. Pipes %>% are great, but can make each step difficult to see.
This might not be obvious at first, but the structure of your data is ideal for a plot with multiple time series. You don't even need to worry about the group_by function. Your dataset seems to have a long format, where the values in the date column and the names in the activity column are not unique, but there is only one value per activity and date.
Given the correct specifications, plot_ly() will group your data using color = ~activity, like this: p <- plot_ly(new_data_2, x = ~date, y = ~totaal, color = ~activity) %>% add_lines(). Since you haven't provided a data sample that is large enough, I'll use the economics_long dataset that ships with ggplot2 to show you how you can do this. First of all, notice how the structure of my sampled dataset matches yours:
date variable value
1 1967-07-01 psavert 12.5
2 1967-08-01 psavert 12.5
3 1967-09-01 psavert 11.7
4 1967-10-01 psavert 12.5
5 1967-11-01 psavert 12.5
6 1967-12-01 psavert 12.1
...
Code:
library(plotly)
library(dplyr)
# data
data("economics_long")
df <- data.frame(economics_long)
# keep only some variables that have values on a comparable level
df <- df %>% filter(!(variable %in% c('pop', 'pce', 'unemploy')))
# plotly time series
p <- plot_ly(df, x = ~date, y = ~value, color = ~variable) %>%
add_lines()
# show plot
p
I am trying to use the R barplot function to plot the following array on the same graph:
ID 1 2 3 4 5 6 7 8
HeL 0 2 1 4 2 3 2 4
CaC 2 0 0 2 1 5 7 8
NIH 1 2 5 6 3 5 7 9
I would need to have the barplot of each row having its own y-axis, but the x-axis should be common for all rows. What I have achieved so far, is to read the matrix from the file "rna.tab" and then plot each row separately:
dat <- read.table ("rna.tab", row.names=1, header=TRUE)
barplot (as.matrix (dat[,1]))
barplot (as.matrix (dat[,2]))
barplot (as.matrix (dat[,3]))
but I didn't succeed in plotting them all together.
Thanks in advance-
Arturo
Is this what you are looking for? If it isn't, could you please make a manual example of what you want and post the image?
par(mfrow = c(ncol(dat),1), mar = c(2.5,4,1,1))
apply(dat, 2, barplot, beside = TRUE)
par(mfrow = c(1,1))
The first par() call says you want a grid of plots with as many rows as there are columns of dat and a single column, and it adjusts the plot margins to suit. The apply() call makes a barplot for each column of dat, and beside = TRUE puts the bars next to each other. The final par() call resets the plotting grid to a single graph, so the next time you plot something you aren't just making a bunch of tiny plots.
Thanks Barker for the fix and sorry for taking so long to get back to you, but I was sick for almost one week.
Your code works great, the only thing is that, since I need to plot the rows and not the columns, it should be:
apply(dat, 1, barplot, beside = TRUE)
Sorry for not being clear about this point.
I have just one last question, if you don't mind. Usually my real life matrix is 6000*30. This means that I have to plot 30 rows.
Usually I save the image to disk:
png ("plot.png")
par(mfrow = c(ncol(dat),1), mar = c(2.5,4,1,1))
apply(dat, 1, barplot, beside = TRUE)
dev.off ()
When I do this, I get only the plot of the last 4 rows in the file "plot.png", instead of the plot of all rows. Also, since the x-axis is the same for all plots, would it be possible to draw it only at the bottom?
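One likely reason only the last few rows show up is that the grid is sized with ncol(dat) while apply(dat, 1, ...) draws one plot per row; once the grid is full a new page starts, and with a fixed file name each page overwrites the previous one. The sketch below is one way this could be handled, under assumptions: dat is rebuilt from the small example in the question, the file is written to a temporary directory, and the device size (200 pixels per row) is an arbitrary guess. It sizes the grid by nrow(dat) and puts x-axis labels only on the last panel.

```r
# Stand-in for rna.tab, using the small example from the question
dat <- read.table(header = TRUE, row.names = 1, check.names = FALSE, text = "
ID  1 2 3 4 5 6 7 8
HeL 0 2 1 4 2 3 2 4
CaC 2 0 0 2 1 5 7 8
NIH 1 2 5 6 3 5 7 9
")

out <- file.path(tempdir(), "plot.png")   # assumed output location
n <- nrow(dat)

# Make the device taller as the number of rows grows (assumed 200 px per row)
png(out, width = 800, height = 200 * n)
par(mfrow = c(n, 1), mar = c(2.5, 4, 1, 1))  # one grid cell per row of dat
for (i in seq_len(n)) {
  h <- as.numeric(unlist(dat[i, ]))          # unnamed values, so no labels by default
  barplot(h,
          names.arg = if (i == n) colnames(dat) else NULL,  # x labels on last panel only
          ylab = rownames(dat)[i])
}
dev.off()
```

Each panel keeps its own y-axis because barplot() is called separately per row. For 30 rows the image becomes very tall, so the per-row height may need tuning.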
I am a novice R user, hence the question. I refer to the solution on creating stacked barplots from R programming: creating a stacked bar graph, with variable colors for each stacked bar.
My issue is slightly different. I have data with 4 columns. The last column is the sum of the first 3 columns. I want to plot a bar chart in which 1) each bar's height is the summed total (i.e., the 4th column), and 2) each bar is split by the relative contributions of the three columns.
I was hoping someone could help.
Regards,
Bernard
If I understood correctly, this may do the trick.
The following code works well for this example df data frame:
df <- read.table(header = TRUE, text = "
a b c sum
1 9 8 18
3 6 2 11
1 5 4 10
23 4 5 32
5 12 3 20
2 24 1 27
1 2 4 7
")
Because you want to plot the actual values in your data frame rather than a count of observations, you need geom_bar(stat = "identity") in ggplot2. Some data manipulation is necessary too. You also don't need the sum column: ggplot stacks the segments, so each bar's height is the sum.
library(ggplot2)
library(reshape2) # provides melt() (assumed; the original did not name the package)
df <- df[,-ncol(df)] # drop the last column (assumed to be the sum)
df$event <- seq.int(nrow(df)) # add a row index so values from the same row stay together
df <- melt(df, id='event') # reshape to long format so ggplot can read it
px <- ggplot(df, aes(x = event, y = value, fill = variable)) + geom_bar(stat = "identity")
print(px)
This code generates a stacked bar chart with one bar per row and one colored segment per variable.
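A similar chart can also be sketched in base R without any reshaping. This is only an illustration, assuming the same example data as above; the names parts and mids are made up. barplot() stacks the rows of a matrix within each bar, so transposing the three component columns gives one bar per original row.

```r
# Same example data as above, including the sum column
df <- read.table(header = TRUE, text = "
a  b  c sum
1  9  8 18
3  6  2 11
1  5  4 10
23 4  5 32
5  12 3 20
2  24 1 27
1  2  4 7
")

# Rows of `parts` are a, b, c; columns are the original rows of df
parts <- t(as.matrix(df[, c("a", "b", "c")]))

# One stacked bar per column of `parts`; total bar height equals df$sum
mids <- barplot(parts, names.arg = seq_len(nrow(df)), legend.text = TRUE,
                xlab = "event", ylab = "value")
```

The value returned by barplot() holds the bar midpoints, which is handy if you later want to add the totals as text above each bar.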