R preserving row order ggplot2 geom_tile - r

I am trying to plot some categorical data and this answer is very close to what I am trying to do, however in my case I have dates in the place of countries as seen in this example. How can I create the plot with the original row order from the data.frame? It appears that even though the factors are in the same order in dat and melt.data they are not ordered sequentially on the y axis in the plot.
Here is a reproducible example:
library(reshape)
library(ggplot2)
dat <- data.frame(dates=c("01/01/2002", "09/15/2003", "05/31/2012"), Germany = c(0,1,0), Italy = c(1,0,0))
melt.data<-melt(dat, id.vars="dates", variable_name="country")
qplot(data=melt.data,
      x=country,
      y=dates,
      fill=factor(value),
      geom="tile")

Your problem is that dates is stored as a character string, not as a Date. Run str(dat) to inspect the structure of the data.
By adding
dat$dates <- as.Date(dat$dates,"%m/%d/%Y")
after loading dat, you can get the dates in the original order.

Your problem is that dat$dates is a factor, and by default R has sorted the levels lexicographically. R does not know they are dates.
So
levels(dat$dates)
## [1] "01/01/2002" "05/31/2012" "09/15/2003"
and therefore
order(dat$dates)
## [1] 1 3 2
If you want R to treat these as dates, you can convert them to a Date column:
dat$dates <- as.Date(as.character(dat$dates), format = '%m/%d/%Y')
# now
order(dat$dates)
## [1] 1 2 3
which is what you want.
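If the goal is literally the row order of the data frame rather than chronological order (the two happen to coincide in this example), one possible alternative is to set the factor levels explicitly in order of first appearance. This is a minimal sketch, not the method from either answer: it reshapes with base R instead of reshape, and the name long is purely illustrative.

```r
library(ggplot2)

dat <- data.frame(
  dates   = c("01/01/2002", "09/15/2003", "05/31/2012"),
  Germany = c(0, 1, 0),
  Italy   = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Reshape to long format by hand (one row per date/country pair)
long <- data.frame(
  dates   = rep(dat$dates, times = 2),
  country = rep(c("Germany", "Italy"), each = nrow(dat)),
  value   = c(dat$Germany, dat$Italy)
)

# Fix the level order to the order in which the dates appear in dat,
# so the axis no longer falls back to alphabetical sorting
long$dates <- factor(long$dates, levels = unique(dat$dates))

p <- ggplot(long, aes(x = country, y = dates, fill = factor(value))) +
  geom_tile()
p
```

ggplot2 draws the first level at the bottom of a discrete y axis, so use levels = rev(unique(dat$dates)) if the first row should appear at the top.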

Related

How to Plot line graph in R with the following Data

I want a line graph of around 145 data observations using R, the format of data is as below
Date     Total Confirmed    Total Deceased
3-Mar    6                  0
4-Mar    28                 0
5-Mar    30                 5
.
.
.
141 more obs like this
I'm new to ggplot2 in R, so I don't know how to get the graph. I tried plotting it, but the dates
on the x-axis overlapped and were not readable. I want a line graph of the Total Confirmed column and the Total Deceased column together, with dates on the x-axis. Please also tell me how to colour the lines, as I want a colourful graph. Thank you so much.
Answers to similar questions give me a lot of errors, so I would like an answer for my specific requirements.
There are a lot of resources to help you create what you are looking to do - and even quite a few questions already answered here. However, I understand it's tough starting out, so here's a quick example to get you started.
Sample Data:
df <- data.frame(
  dates=c('2020-01-01','2020-02-01','2020-03-03','2020-03-14','2020-04-01'),
  var1=c(13,15,18,29,40),
  var2=c(5,8,11,13,18)
)
If you are plotting by date on your x axis, you need to ensure that df$dates is formatted as a "Date" class (or one of the other date-like classes). You can do that via:
df$dates <- as.Date(df$dates, format='%Y-%m-%d')
The format= argument of as.Date() should follow the conventions indicated in strptime(). Just type ?strptime in your console and you can see in the help for that function how the various terms are defined for format=.
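For dates written like the "3-Mar" values in the question, the format string needs different codes. The sketch below is only an illustration: the question does not give a year, so 2020 is an assumption, and %b (abbreviated month name) depends on the locale, so it may return NA outside an English locale.

```r
# Dates as they appear in the question; they carry no year
raw <- c("3-Mar", "4-Mar", "5-Mar")

# ASSUMPTION: the year 2020 is appended only so that as.Date() has a full date
# %d = day of month, %b = abbreviated month name (locale-dependent), %Y = 4-digit year
parsed <- as.Date(paste0(raw, "-2020"), format = "%d-%b-%Y")

class(parsed)
```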
The next step is very important: recognize that the data is in "wide" format, not "long" format. You will generally want your data in what is known as Tidy Data format, which is convenient for any analysis and is what ggplot2 and the related packages expect. In your data, the measure is a number of people, and the type of measure is either cases or deaths. So "number of people" is spread over two columns, and the "type of measure" is stuck in the column names when it should be a variable in the dataset. Your goal should be to gather() those two columns together and create two new columns: (1) one to indicate whether the number is "cases" or "deaths", and (2) the number itself. In the example I've shown you can do this via:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- df %>% gather(key='var_name', value='number', -dates)
The result is that the data frame has columns for:
dates: unchanged
var_name: contains either var1 or var2 as a character class
number: the actual number
Finally, for the plot, the code is quite simple. You apply dates to the x aesthetic, number to y, and use var_name to differentiate color for the line geom:
ggplot(df, aes(x=dates, y=number)) +
  geom_line(aes(color=var_name))
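The question also mentions overlapping date labels and wanting specific colours. A possible way to handle both is sketched below. It rebuilds the sample data so it runs on its own, uses tidyr's pivot_longer() in place of gather(), and the break interval, label format, and colour names are arbitrary choices for illustration, not anything prescribed above.

```r
library(tidyr)
library(ggplot2)

# Same sample data as above, with dates already converted to Date
df <- data.frame(
  dates = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03",
                    "2020-03-14", "2020-04-01")),
  var1  = c(13, 15, 18, 29, 40),
  var2  = c(5, 8, 11, 13, 18)
)

# Wide -> long: one row per (date, variable) pair
long <- pivot_longer(df, cols = -dates,
                     names_to = "var_name", values_to = "number")

p <- ggplot(long, aes(x = dates, y = number, color = var_name)) +
  geom_line() +
  # One tick per month, labelled like "01 Jan", to limit overlap
  scale_x_date(date_breaks = "1 month", date_labels = "%d %b") +
  # Pick the line colours explicitly (illustrative choices)
  scale_color_manual(values = c(var1 = "steelblue", var2 = "firebrick")) +
  # Rotate the labels in case they are still crowded
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```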

How to extrapolate/interpolate in R

I am trying to interpolate/extrapolate NA values. The dataset that I have is from a measuring station that measures soil temperature at 4 depths every 5 minutes. In this specific example there are erroneous data (-888.88) at the end of the measurements for the 0 cm depth variable and the 1.5 cm depth variable. I transformed these to NA. Now my professor wants me to interpolate/extrapolate for this and all other datasets that I have. I am aware that extrapolating for so many values after the last observation could be statistically inaccurate, but I am trying to at least come up with working code. So far I have tried to extrapolate for one of the variables (SoilTemp_1.5cm). The final line runs, but when I open the data frame, the NAs are still there.
library(dplyr)
library(Hmisc)
MyD <- read.csv("2319538_Bodentemp_braun_WILDKOGEL_17_18 - Copy.csv",header=TRUE, sep=";")
MyD$date <- as.Date(MyD$Date, "%d.%m.%Y")
MyD$datetime <- as.POSIXct(MyD$Date.Time..GMT.01.00, format = "%d.%m.%Y %H:%M")
MyD[,-c(1,2,3,4,9)][MyD[,-c(1,2,3,4,9)] == -888.88] <- NA #convert erroneous data to NA
MyD %>% mutate(`SoilTemp_1.5cm`=approxExtrap(x=SoilTemp_5cm, y=SoilTemp_1.5cm, xout=SoilTemp_5cm)$y)
I also tried it this way, which gives me a list of 2 that has a lot of columns instead of rows when I convert it to a data frame. I admit that the approxExtrap syntax confuses me a little.
MyD1 <- approxExtrap(MyD$SoilTemp_5cm, MyD$SoilTemp_1.5cm,xout=MyD$SoilTemp_5cm)
MyD1
I am honestly not sure how to reproduce the data so here is pastebin link of a dput() output https://pastebin.com/NFZdmm4L. I tried to include as much output as I could. Have in mind that I excluded some of the columns when running the dput() so the code MyD[,-c(1,2,3,4,9)][MyD[,-c(1,2,3,4,9)] == -888.88] might differ. Anyways, the dput() output already has the NAs included so you might not even need it.
Thanks in advance.
Best regards,
Zorin
na.approx will fill in NAs with interpolated values and rule=2 will extend the first and last values.
library(zoo)
x <- c(NA, 4, NA, 5, NA) # test input
na.approx(x, rule = 2)
## [1] 4.0 4.0 4.5 5.0 5.0
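Applied to a data frame column, the same call could look like the sketch below. The data frame soil and its values are made up for illustration; only the column name comes from the question. Note that the result has to be assigned back: dplyr verbs such as mutate() return a new data frame rather than changing the existing one, so an unassigned MyD %>% mutate(...) leaves MyD untouched.

```r
library(zoo)

# Made-up stand-in for the station data, with NAs in the middle and at the end
soil <- data.frame(
  SoilTemp_5cm   = c(2.1, 2.0, 1.9, 1.8, 1.7),
  SoilTemp_1.5cm = c(1.5, NA, 1.1, NA, NA)
)

# Interior NAs are linearly interpolated; with rule = 2 the trailing NAs
# take the last observed value (no trend is extrapolated)
soil$SoilTemp_1.5cm <- na.approx(soil$SoilTemp_1.5cm, rule = 2)

soil
```

Because rule = 2 only repeats the nearest observed value, a long run of missing readings at the end becomes a flat line, which is worth keeping in mind when interpreting the filled series.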

Combine 2 variables from a 8 variables set, with the difference from each row

Hi, and thanks for all help. I have a dataset with 8 variables and 5 observations. I want to take 2 of the variables, High.price and Low.price, with all 5 observations; they hold the high and low prices from five different days, hence the observations. I want to put High.price and Low.price into a new dataset and plot a geom_line of the difference between them, with the difference as "y" and the date of the 5 observations as "x".
In other words, I want to calculate the difference between High.price and Low.price for each of the five days, and then plot that difference (the "spread").
If I understand correctly, it's a simple case of subsetting. If dataset1 is the first dataset with 8 columns and five rows, you can simply subset using:
dataset2 <- dataset1[c(1,2),] where 1 and 2 are the rows to keep (to keep two columns instead, put the indices after the comma: dataset1[,c(1,2)]). Since date is not in the dataset, you can build the graph using a date vector as x and the data from dataset2 as y.
I made an example:
df <- data.frame(a=c(2,4,6,8,9),
                 b=c(1,5,7,9,10),
                 c=c(6,8,5,7,7),
                 d=c(1,2,3,4,5),
                 e=c(4,5,6,2,1),
                 f=c(2,5,4,7,1),
                 g=c(1,1,2,1,2),
                 h=c(5,6,5,5,5))
Vmin <- unlist(lapply(df, min))
Vmax <- unlist(lapply(df, max))
spread <- Vmax-Vmin
plot(spread, type = "o",pch=20, xaxt="n")
axis(1,1:8,colnames(df)) #or your date
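For the per-day spread described in the question (High.price minus Low.price for each row, plotted against the date), a sketch along the following lines may be closer to what was asked. The data frame prices, its date column, and all the numbers are made up for illustration; only the names High.price and Low.price come from the question.

```r
library(ggplot2)

# Hypothetical data: five days with a high and a low price each
prices <- data.frame(
  date       = as.Date("2020-01-06") + 0:4,
  High.price = c(105, 108, 107, 110, 112),
  Low.price  = c(101, 103, 104, 104, 109)
)

# Row-wise difference: one spread value per day
prices$spread <- prices$High.price - prices$Low.price

p <- ggplot(prices, aes(x = date, y = spread)) +
  geom_line() +
  geom_point()
p
```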

How to generate a plot for reported values and missing values in R - timeseries

Hi I am using R to analyze my data. I have time-series data in following format:
dates ID
2008-02-12 3
2008-03-12 3
2008-05-12 3
2008-09-12 3
2008-02-12 8
2008-04-12 6
I would like to create a plot with dates on the x axis and ID on the y axis, such that it draws a point if an ID is reported for that date and nothing if there is no data for it.
In the original dataset I only have an ID if a value is reported on that date. For example, for 2008-02-12 there is no data reported for ID 6, hence it is missing from my dataset.
I was able to get all the dates with the unique(df$dates) function, but I don't know enough about R data structures to loop through the data, make a matrix with 1/0 for all IDs, and then plot it.
I will be grateful if you can help me with the code or give me some pointers on an effective way to approach this problem.
Thanks in advance.
It seems you want something like a scatter-plot :
# input data
DF <-
read.csv(
text=
'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
colClasses=c('character','integer'))
# convert first column from characters to dates
DF$Year <- as.POSIXct(DF$Year,format='%Y-%m-%d',tz='GMT')
# scatter plot
plot(x=DF$Year,y=DF$ID,type='p',xlab='Date',ylab='ID',
main='Reported Values',pch=19,col='red')
But this approach has a problem. For example if you have unique(ids) = c(1,2,1000) the space on the y axis between id=2 and id=1000 will be very big (the same holds for the dates on the x axis).
Maybe you want a sort of ID-by-date "map", like the following:
# input data
DF <-
read.csv(
text=
'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
colClasses=c('character','integer'))
dates <- as.factor(DF$Year)
ids <- as.factor(DF$ID)
plot(x=as.integer(dates),y=as.integer(ids),type="p",
     xlim=c(0.5,length(levels(dates))+0.5),
     ylim=c(0.5,length(levels(ids))+0.5),
     xaxs="i", yaxs="i",
     xaxt="n",yaxt="n",main="Reported Values",
     xlab="Date",ylab="ID",pch=19,col='red')
axis(1,at=1:length(levels(dates)),labels=levels(dates))
axis(2,at=1:length(levels(ids)),labels=levels(ids))
# add grid
abline(v=(1:(length(levels(dates))-1))+0.5,col="Gray80",lty=2)
abline(h=(1:(length(levels(ids))-1))+0.5,col="Gray80",lty=2)
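The question also asks about building a 1/0 matrix of IDs against dates. No loop is needed for that; one possible approach, sketched here with base R's table(), is to cross-tabulate the two columns. The names counts and presence are illustrative only.

```r
# Same data as in the question
DF <- data.frame(
  dates = c("2008-02-12", "2008-03-12", "2008-05-12",
            "2008-09-12", "2008-02-12", "2008-04-12"),
  ID    = c(3, 3, 3, 3, 8, 6)
)

# Cross-tabulate: rows are IDs, columns are dates, cells are counts
counts <- table(DF$ID, DF$dates)

# Turn counts into a plain 1/0 matrix: 1 = reported, 0 = missing
presence <- 1L * (unclass(counts) > 0)

presence
```

Rows and columns are labelled by ID and date, so a cell can be looked up by name, for example presence["6", "2008-02-12"].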

R plotting dates with apcluster

I'm using the package apcluster to do some clustering on some data. I currently have a large matrix called mat which follows this format:
date A B C
1 2000-01-03 2.00000000 0.300000000 4.00000000
2 2000-01-04 0.20000000 0.000030000 -0.02469136
3 2000-01-05 -0.07692308 -0.02469136 -0.07594937
apcluster has provided functionality to plot the clusters (as scatterboxes) overlaid on your original data. When plotting I do:
plot(cluster, mat)
There is no need to worry about cluster; only mat is giving me problems. The above gives me 9 plots: the diagonals are the column names (except date), and each plot shows the data of one column plotted against another. This means that the X and Y axes are in the range of the data, i.e. for A they would run from -0.08 to 2.0!
So my question is how can I plot each column to date, as in date will act as the X axis, while the data from mat acts as Y and so that all three columns of data will appear on one plot, without modifying the plot command above?
apcluster documentation is located HERE.
Thanks.
I am not 100% sure what you need. Do you want to include the date column into both the clustering procedure and the plot? If you run apcluster() on the data frame you mention above, the date column is simply neglected.
So, if you want to include the date column, my suggestion would be to convert the date column to numeric, e.g. by the following:
x$date <- as.numeric(as.Date(x$date))
The disadvantage is that the result is in days (from 1970-01-01), so (1) the column would be on a completely different scale than the other columns and (2) the axes of the plots would not be labeled in a very interpretable way. So it is probably better to convert the dates to fractions of years, e.g. something like 2013-01-01 = 2013.00; 2013-07-01 ~= 2013.50; 2014-01-01 = 2014.00. Do you know what I mean?
If you choose either of these two options, the dates will be taken into account by apcluster() and the plot() command will also plot the columns A, B, ... against the date column.
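The conversion to fractions of years can be sketched in base R as follows. The helper to_decimal_year() is a hypothetical name introduced here, not part of apcluster or base R, and it assumes the input can be parsed by as.Date().

```r
# Hypothetical helper: convert dates to decimal years,
# e.g. 2013-01-01 -> 2013.00 and 2013-07-01 -> roughly 2013.50
to_decimal_year <- function(d) {
  d     <- as.Date(d)
  year  <- as.integer(format(d, "%Y"))
  start <- as.Date(paste0(year, "-01-01"))       # first day of that year
  end   <- as.Date(paste0(year + 1L, "-01-01"))  # first day of the next year
  # Fraction of the year elapsed (leap years are handled via end - start)
  year + as.numeric(d - start) / as.numeric(end - start)
}

to_decimal_year(c("2013-01-01", "2013-07-01", "2014-01-01"))
```

With such a helper, the date column could be replaced via x$date <- to_decimal_year(x$date) before clustering.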
Cheers,
UBod
