How to make Violin Plots from a text file - r

I am trying to make an image of multiple violin plots from data that I read in from a text file. The file has a "Count" column, which simply increments by 1 to index the results, and several further columns, each holding the results for a different variable size. Below is an example of a portion of the text file.
Count X1.1 X1.2 X1.3 X1.4
1 174.647 173.368 172.713 172.264
2 169.549 166.791 167.010 165.682
3 174.341 170.821 169.861 169.103
4 178.305 177.736 177.796 176.067
5 160.614 159.842 158.548 157.145
So I would like to create a violin plot for each column (X1.1, X1.2, etc.) using ggplot, with the plots displayed side by side.
library(ggplot2)
myData <- read.csv("E2_1_RingSize.text", sep = "\t", header=TRUE)
I've read in the file and can plot one column at a time by hard-coding the column name, as below:
graph1 <- ggplot(myData, aes(x=Count, y=X1.1)) + geom_violin()
But I'm unsure how to include all of the columns at once. It's most likely an easy fix, only 1-2 lines, but I'm not that experienced in R/RStudio and so I've got no clue.

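What you need to do is pivot your data.frame so it's in long format, with one row per measurement:
library(tidyr)
library(ggplot2)
# every column except Count becomes a (name, value) pair
myData %>%
  pivot_longer(-Count) %>%
  ggplot(aes(x=as.factor(name), y=value)) + geom_violin()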
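
As a sketch of what the pivot does, the five sample rows from the question can be reshaped as below. This assumes whitespace-separated input and the tidyr and ggplot2 packages; `sample_text`, `long` and `p` are names chosen for illustration. By default `pivot_longer()` calls the new columns `name` and `value`.

```r
library(tidyr)
library(ggplot2)

# The five sample rows from the question, inlined for illustration
sample_text <- "Count X1.1 X1.2 X1.3 X1.4
1 174.647 173.368 172.713 172.264
2 169.549 166.791 167.010 165.682
3 174.341 170.821 169.861 169.103
4 178.305 177.736 177.796 176.067
5 160.614 159.842 158.548 157.145"
myData <- read.table(text = sample_text, header = TRUE)

# One row per (Count, column) pair: 5 rows x 4 columns = 20 rows
long <- pivot_longer(myData, -Count)

# One violin per original column, drawn side by side
p <- ggplot(long, aes(x = name, y = value)) + geom_violin()
```

Printing `p` draws the plot. With only five values per column the violins are rough, so the full file should give a better picture.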


Omitting NA values from ggplot when using multiple dataframes to plot multiple lines

My dataframes sometimes contain NA values. These were previously blanks, characters like 'BAD' or actual 'NA' characters from the imported .csv file. I have changed everything in my dataframes to numeric - this changes all non-numeric characters to NA. So far, so good.
I am aware I can use the following using dataframe 'df' to ensure a line is always drawn between data points, ensuring there are no gaps:
ggplot(na.omit(df), aes(x=Time, y=pH)) +
geom_line()
However, sometimes I wish to plot 2 or more dataframes using ggplot2 to get a single plot. I do this because my x axis (Time) is indeed the same for all dataframes, but the specific numbers are different. I was having immense trouble merging these dataframes because the rows are not equal. Otherwise I would merge, melt the data and use ggplot2 as normal to make a multiple-lined line plot.
I have since learnt you can plot multiple dataframes manually on ggplot at the 'geom level':
ggplot() +
  geom_line(data=df1, aes(x=Time1, y=pH1), colour='green') +
  geom_line(data=df2, aes(x=Time2, y=pH2), colour='red') +
  geom_line(data=df3, aes(x=Time3, y=pH3), colour='blue') +
  geom_line(data=df4, aes(x=Time4, y=pH4), colour='yellow')
However, how can I now ensure NA values are omitted and the lines are connected?! It all seems to work, but my 4 plots have gaps in them where the NA values are!
I am new to R, but enjoying it so far and realise there are usually multiple solutions to an issue. Any help or advice appreciated.
EDIT (for anyone who later sees this)
So, after playing around for 30 minutes I realised I could first apply the na.omit function to each dataframe separately, name the resulting objects, and then plot those with ggplot instead. This works fine. Also, the above code was incorrect anyway if I wanted a suitable legend.
New, correct code:
df1.omit <- na.omit(df1)
df2.omit <- na.omit(df2)
df3.omit <- na.omit(df3)
df4.omit <- na.omit(df4)
ggplot() +
  geom_line(data=df1.omit, aes(x=Time1, y=pH1, colour="Variable 1")) +
  geom_line(data=df2.omit, aes(x=Time2, y=pH2, colour="Variable 2")) +
  geom_line(data=df3.omit, aes(x=Time3, y=pH3, colour="Variable 3")) +
  geom_line(data=df4.omit, aes(x=Time4, y=pH4, colour="Variable 4"))
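
Another way to get the same result, sketched under assumptions: stacking the data frames with `rbind()` does not need equal row counts, so they can be combined into one long data frame with a label column and plotted with a single `geom_line()`. The values in `df1` and `df2` below are hypothetical stand-ins, and `s1`, `s2`, `series` and `combined` are illustrative names.

```r
library(ggplot2)

# Hypothetical stand-ins for two of the data frames in the question
df1 <- data.frame(Time1 = c(0, 1, 2, 3), pH1 = c(7.0, NA, 7.2, 7.1))
df2 <- data.frame(Time2 = c(0.5, 1.5, 2.5), pH2 = c(6.8, 6.9, NA))

# Give both the same column names plus a label column
s1 <- data.frame(Time = df1$Time1, pH = df1$pH1, series = "Variable 1")
s2 <- data.frame(Time = df2$Time2, pH = df2$pH2, series = "Variable 2")

# Stack them and drop every row that contains an NA
combined <- na.omit(rbind(s1, s2))

# Mapping colour inside aes() draws one line per series and adds a legend
p <- ggplot(combined, aes(x = Time, y = pH, colour = series)) + geom_line()
```

Note that `na.omit()` drops a row if any of its columns is NA, so columns that are not plotted should be removed first if they contain unrelated gaps.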

Reading and reducing a .kmz

I'm brand new to using R geospatially.
Working .kmz: https://www.cnrfc.noaa.gov/ - from the second drop down right below the map pane titled 'Download Overlay Files', I've downloaded and I'm using the "Drainage Basins" kml that should download as "basins.kml"
library(rgdal)
library(tidyverse)
From looking at the .kml in a text editor, it looks like the KML layer name is
"cnrfc_09122018_basins_thin", so reading it in with:
cnrfc_basins <- readOGR("basins.kml", "cnrfc_09122018_basins_thin")
gives me a "Large SpatialPolygonsDataFrame".
To be able to plot, it looks like I need to "fortify it" (?), and make a more ordinary data.frame, so from some other posts I've come across:
cnrfc_basins_fortify <- merge(broom::tidy(cnrfc_basins),
as.data.frame(cnrfc_basins), by.x="id", by.y=0)
plotting with this:
ggplot() + geom_path(data = cnrfc_basins_fortify, aes(x=long, y=lat, group = group)) +
coord_quickmap()
gives me the data I'm expecting:
But, for these around one hundred polygons or so, I have hundreds of thousands of data.frame rows. How do I reduce these, so I have just one row for each polygon?(Each polygon, which is representative of a particular basin, has a unique five digit ID already, in the 'Name' column). Having fewer rows seems it will make working with the file easier and quicken joins, when I will join data to these unique polygons.
Any advice greatly appreciated.
All you have to do is directly extract the @data slot contained in the SpatialPolygonsDataFrame:
poly = cnrfc_basins@data
That should give you a 339-row data.frame with the unique identifiers you need (without the geometric metadata)
> head(poly)
Name
0 EFBC1
1 CSKC1
2 CMIC1
3 FMDC1
4 NMFC1
5 NFDC1
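
The slot access can be tried on a toy object. The sketch below assumes the sp package and builds two unit squares as stand-in geometry (not the real basins). Only the two `Name` values are taken from the output above; `unit_square`, `shapes`, `attrs` and `basins` are illustrative names.

```r
library(sp)

# Helper (illustrative): a closed, clockwise unit square with its corner at (x0, y0)
unit_square <- function(x0, y0) {
  Polygon(cbind(c(x0, x0, x0 + 1, x0 + 1, x0),
                c(y0, y0 + 1, y0 + 1, y0, y0)), hole = FALSE)
}

# Two polygons with IDs "0" and "1"
shapes <- SpatialPolygons(list(
  Polygons(list(unit_square(0, 0)), ID = "0"),
  Polygons(list(unit_square(2, 0)), ID = "1")
))

# Attribute table: row names must match the polygon IDs
attrs <- data.frame(Name = c("EFBC1", "CSKC1"), row.names = c("0", "1"))
basins <- SpatialPolygonsDataFrame(shapes, attrs)

# The @data slot is an ordinary data.frame with one row per polygon
poly <- basins@data
```

Because `poly` has one row per polygon, extra data can be joined to it on `Name` before any fortified coordinates are involved.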

Iterating over an Excel file and plotting a two-column comparison

First of all, I am a beginner, so I appreciate your patience and the time you take to help me. I have an Excel file with 3 columns: Shopname, 2016 and 2017, which hold the values to compare.
I'd like to iterate over the Excel file and plot two bars for each shop: one with the value for shop X in 2016 and the other with the value for 2017.
Below is what I have written so far. I can see the printed output but not the plots. What could I do better?
> #importing excel file
> #and ploting each line comparison between 2 columns
> library(xlsx)
> xl_data <- read.xlsx("File.xlsx", "Plan1")
> df<- data.frame(xl_data)
> # plot using facets
> ggplot(aes(x=time, y=sold, group=shop)) +geom_bar(stat="identity")+
facet_grid(.~xl_data)
Afonso,
You don't need a loop for that. One way to accomplish it would be with ggplot's facetting capability:
#### load needed libraries
library(tidyr)
library(ggplot2)
### load data -- this is coming from Excel
dt <- tribble(
~LOJAS, ~y2016, ~y2017,
"CD NEREU" , 168459.86, 223637.46,
"LJ CANOINH", 14480.03, 80006.86,
"LJ MAL338" , 21095.07, 62768.54,
"LJ SBENTO" , 43290.47, 43168.34)
### arrange data for plotting
dt %>%
gather(time, sold, y2016, y2017) %>%
# plot using facets
ggplot(aes(x=time, y=sold, group=LOJAS)) +
geom_bar(stat="identity") +
facet_grid(.~LOJAS)
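
The reshaping step can also be written with `pivot_longer()`, which the tidyr documentation recommends over the superseded `gather()`. This is a sketch on two of the shops above; `long` and `p` are illustrative names, and `geom_col()` is used as the equivalent of `geom_bar(stat="identity")`.

```r
library(tidyr)
library(ggplot2)

# Two of the shops from the answer above
dt <- tribble(
  ~LOJAS,       ~y2016,    ~y2017,
  "CD NEREU",   168459.86, 223637.46,
  "LJ CANOINH", 14480.03,  80006.86
)

# One row per shop and year: the column names go to "time", the values to "sold"
long <- pivot_longer(dt, c(y2016, y2017), names_to = "time", values_to = "sold")

# One panel per shop, one bar per year
p <- ggplot(long, aes(x = time, y = sold)) +
  geom_col() +
  facet_grid(. ~ LOJAS)
```

Each shop contributes two rows to `long`, which is what lets `facet_grid()` draw one panel per shop without a loop.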

How to plot several line plots in one

I would like to plot my figure using R (ggplot2). I'd like to have a line graph like image 2.
Here is my data:
B50K,B50K+1000C50K,B50K+2000C50K,B50K+4000C50K,B50K+8000C50K,gen,xaxile
0.3795,0.4192,0.4675,0.5357,0.6217,T18-Yield,B50K
0.3178,0.3758,0.4249,0.5010,0.5870,T20-Yield,B50K+1000C50K
0.2795,0.3266,0.3763,0.4636,0.5583,T21-Yield,B50K+2000C50K
0.2417,0.2599,0.2898,0.3291,0.3736,T18-Fertility,B50K+4000C50K
0.2002,0.2287,0.2531,0.2962,0.3485,T19-Fertility,B50K+8000C50K
0.1642,0.1911,0.2151,0.2544,0.2951,T20-Fertility
Note: the delimiter is ",". I do not have any R script that would be a useful starting point.
The illustrated image shows my figure as drawn in Microsoft Word.
I have tried several scripts from the internet, but none of them worked.
Could you please help me with an R script that reads my data file (as in img1) and plots it like the illustrated figure?
The trick is to reshape your data (using melt from the reshape2 package) so that you can easily map colours and linetypes to gen.
# Your data - note i also added an extra comma after the fifth column in row 6.
# It would be easier if you gave data using dput as described in comments above - thanks
dat <- read.table(text="B50K,B50K+1000C50K,B50K+2000C50K,B50K+4000C50K,B50K+8000C50K,xaxile,gen
0.3795,0.4192,0.4675,0.5357,0.6217,B50K,T18-Yield
0.3178,0.3758,0.4249,0.5010,0.5870,B50K+1000C50K,T20-Yield
0.2795,0.3266,0.3763,0.4636,0.5583,B50K+2000C50K,T21-Yield
0.2417,0.2599,0.2898,0.3291,0.3736,B50K+4000C50K,T18-Fertility
0.2002,0.2287,0.2531,0.2962,0.3485,B50K+8000C50K,T19-Fertility
0.1642,0.1911,0.2151,0.2544,0.2951,,T20-Fertility",
header=T, sep=",", na.strings="")
# load the packages you need
library(ggplot2)
library(reshape2)
# assume xaxile column is unneeded? - did you add this column yourself?
dat$xaxile <- NULL
# reshape data for plotting
dat.m <- melt(dat)
# plot
ggplot(dat.m, aes(x=variable, y=value, colour=gen,
shape=gen, linetype=gen, group=gen)) +
geom_point() +
geom_line()
You can then use scale_linetype_manual and scale_shape_manual to manually specify how you want the plot to look. This post will help, but there are many others as well
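
A minimal sketch of the manual scales, using two of the series from the data above, already in long format. The particular linetypes and shape codes are arbitrary choices for illustration, not the ones in the original figure.

```r
library(ggplot2)

# Two series and three x positions taken from the data above
dat.m <- data.frame(
  gen      = rep(c("T18-Yield", "T18-Fertility"), each = 3),
  variable = rep(c("B50K", "B50K+1000C50K", "B50K+2000C50K"), times = 2),
  value    = c(0.3795, 0.4192, 0.4675, 0.2417, 0.2599, 0.2898)
)

p <- ggplot(dat.m, aes(x = variable, y = value, colour = gen,
                       shape = gen, linetype = gen, group = gen)) +
  geom_point() +
  geom_line() +
  # Named vectors tie each level of gen to a specific linetype and shape
  scale_linetype_manual(values = c("T18-Yield" = "solid",
                                   "T18-Fertility" = "dashed")) +
  scale_shape_manual(values = c("T18-Yield" = 16, "T18-Fertility" = 17))
```

Because colour, shape and linetype are all mapped to `gen`, ggplot2 merges them into a single legend as long as the scales share the same name and labels.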

R: Issues with line graph of germination through time

I'm still in the process of learning R using Swirl and RStudio, and a goal I've set for myself is to recreate this graph. I have a small dataset that I will link below (it's saved as a plain text CSV file that I import into R with headings enabled).
If I try to plot that dataset without changing anything, I get this, which is obviously not the goal.
At first I thought the problem would be in the class of my imported dataset, defined as kt. After class(kt) turned out to be data.frame I figured that wasn't the problem. Should I be trying to rewrite the table to something that R can plot instantly, or should I be trying to extract each species individually, plot them separately and then combining the different plots into one graph? Perhaps there is something wrong with my dates, I know that R handles dates in a specific way. Maybe these solutions are not even needed and I'm just doing something stupidly simple wrong, but I can't find it myself.
Your help is much appreciated.
Dataset:
Species,week 0,week 1,week 2,week 3,week 4,week 5,week 6,week 7,week 8,week 9,week 10,week 11,week 12,week 13,week 14,week 15,week 16,week 17,week 18
Caesalpinia coriaria,0.0%,24.0%,28.0%,28.0%,32.0%,37.0%,40.0%,46.0%,52.0%,56.0%,63.0%,64.0%,68.0%,71.0%,72.0%,,,,
Coccoloba swartzii,0.0%,0.0%,1.0%,10.0%,19.0%,31.0%,33.0%,39.0%,43.0%,48.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,52.0%,55.0%,
Cordia dentata,0.0%,5.0%,18.0%,21.0%,24.0%,26.0%,27.0%,30.0%,32.0%,32.0%,32.0%,32.0%,32.0%,32.0%,33.0%,33.0%,33.0%,34.0%,35.0%
Guaiacum officinale,0.0%,0.0%,0.0%,0.0%,4.0%,5.0%,5.0%,5.0%,7.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,8.0%,,
Randia aculeata,0.0%,0.0%,0.0%,4.0%,13.0%,14.0%,18.0%,19.0%,21.0%,21.0%,21.0%,21.0%,21.0%,22.0%,22.0%,22.0%,22.0%,,
Schoepfia schreberi,0.0%,0.0%,0.0%,0.0%,0.0%,0.0%,1.0%,4.0%,8.0%,11.0%,13.0%,21.0%,21.0%,24.0%,24.0%,25.0%,27.0%,,
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%,,,,,,,,,,,,,,,
Something like this??
# get rid of "%" signs
df <- data.frame(sapply(df,function(x)gsub("%","",x,fixed=T)))
# convert cols 2:20 to numeric
df[,2:20] <- sapply(df[,2:20],function(x)as.numeric(as.character(x)))
library(reshape2)
library(ggplot2)
gg <- melt(df,id="Species")
ggplot(gg,aes(x=variable,y=value,color=Species,group=Species)) +
geom_line()+
theme_bw()+
theme(legend.position="bottom", legend.title=element_blank())
There are lots of problems here.
First, if your dataset really has those % signs, then R interprets the data as character and, with the defaults used before R 4.0.0, imports it as factors. So first we have to get rid of the % (using gsub(...)), and then we have to convert what's left to numeric. With factors, you have to convert to character first, then numeric, so: as.numeric(as.character(...)). All of this could have been avoided if you exported the data without the % signs!
Plotting multiple curves with different colors is something the ggplot package was designed for (among many other things), so we use that. ggplot prefers data in "long" format - all the data in one column, with a second column distinguishing different datasets. Your data is in "wide" format - data in different columns. So we convert to long using melt(...) from the reshape2 package. The result, gg has three columns: Species, variable and value. value contains the actual data and variable contains the week number.
So now we create a ggplot object, setting the x-axis to the variable column, the y-axis to the value column, with color mapped to Species, and we tell ggplot to plot lines (using geom_line(...)).
The rest is to position the legend at the bottom, and turn off some of the ggplot default formatting.
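
A variant of the same steps, sketched on three species and the first five weeks of the dataset above. It assumes tidyr in place of reshape2 and also turns the week labels into numbers, so that the x-axis is numeric rather than categorical. `csv_text`, `long`, `week` and `pct` are illustrative names.

```r
library(tidyr)
library(ggplot2)

# Three species and weeks 0-4 from the dataset above; the trailing comma is a blank cell
csv_text <- "Species,week 0,week 1,week 2,week 3,week 4
Caesalpinia coriaria,0.0%,24.0%,28.0%,28.0%,32.0%
Cordia dentata,0.0%,5.0%,18.0%,21.0%,24.0%
Prosopis juliflora,0.0%,7.5%,31.3%,34.2%,"

# check.names = FALSE keeps the "week 0" style headers unchanged
kt <- read.csv(text = csv_text, check.names = FALSE, stringsAsFactors = FALSE)

# Wide to long: one row per species and week
long <- pivot_longer(kt, -Species, names_to = "week", values_to = "pct")

# Strip the text around the numbers; the blank cell becomes NA
long$week <- as.integer(sub("week ", "", long$week, fixed = TRUE))
long$pct  <- as.numeric(sub("%", "", long$pct, fixed = TRUE))

p <- ggplot(long, aes(x = week, y = pct, colour = Species)) +
  geom_line() +
  theme_bw() +
  theme(legend.position = "bottom", legend.title = element_blank())
```

The blank cells become `NA`, so each line simply stops at the last week that was recorded for that species.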
