I'm using the package apcluster to do some clustering on some data. I currently have a large matrix called mat which follows this format:
date A B C
1 2000-01-03 2.00000000 0.300000000 4.00000000
2 2000-01-04 0.20000000 0.000030000 -0.02469136
3 2000-01-05 -0.07692308 -0.02469136 -0.07594937
apcluster has provided functionality to plot the clusters (as scatterboxes) overlaid on your original data. When plotting I do:
plot(cluster, mat)
You don't need to worry about cluster; only mat is giving me problems. The above gives me 9 plots: the diagonal panels hold the column names (except date), and each remaining panel plots the data of one column against another. This means that the X and Y axes are in the range of the data, i.e. for A it would be from -0.08 to 2.0!
So my question is how can I plot each column to date, as in date will act as the X axis, while the data from mat acts as Y and so that all three columns of data will appear on one plot, without modifying the plot command above?
apcluster documentation is located HERE.
Thanks.
I am not 100% sure what you need. Do you want to include the date column into both the clustering procedure and the plot? If you run apcluster() on the data frame you mention above, the date column is simply neglected.
So, if you want to include the date column, my suggestion would be to convert the date column to numeric, e.g. by the following:
x$date <- as.numeric(as.Date(x$date))
The disadvantage is that the result is in days (from 1970-01-01), so (1) the column would be on a completely different scale than the other columns and (2) the axes of the plots would not be labeled in a very interpretable way. So it is probably better to convert the dates to fractions of years, e.g. something like 2013-01-01 = 2013.00; 2013-07-01 ~= 2013.50; 2014-01-01 = 2014.00. Do you know what I mean?
If you choose either of these two options, the dates will be taken into account by apcluster() and the plot() command will also plot the columns A, B, ... against the date column.
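For the fractional-year idea, a minimal base-R sketch could look like the following. The helper name frac_year is made up for illustration and is not part of apcluster; the small data frame x is only a stand-in for your data.

```r
# Hypothetical helper: convert dates to fractional years,
# e.g. 2013-01-01 -> 2013.00 and 2013-07-01 -> roughly 2013.50.
frac_year <- function(d) {
  d <- as.Date(d)
  year <- as.numeric(format(d, "%Y"))
  start <- as.Date(paste0(year, "-01-01"))     # first day of the same year
  end <- as.Date(paste0(year + 1, "-01-01"))   # first day of the next year
  # share of the year that has elapsed; uses the real year length,
  # so leap years are handled
  year + as.numeric(d - start) / as.numeric(end - start)
}

# Stand-in data frame with a date column stored as text
x <- data.frame(date = c("2013-01-01", "2013-07-01", "2014-01-01"))
x$date <- frac_year(x$date)
```

After this, x$date is an ordinary numeric column, so it sits on a scale of years rather than days since 1970-01-01.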
Cheers,
UBod
I want a line graph of around 145 observations using R. The data is in the format below:
Date Total Confirmed Total Deceased
3-Mar 6 0
4-Mar 28 0
5-Mar 30 5
.
.
.
141 more obs like this
I'm new to ggplot2 in R, so I don't know how to get the graph. I tried plotting it, but the dates on the x-axis overlapped and were not readable. I want a line graph of the Total Confirmed column and the Total Deceased column together, with dates on the x-axis. Please also tell me how to colour the lines; I want a colourful graph. Thank you so much.
Similar questions gave me a lot of errors when I tried their answers, so I would like an answer for my specific requirements.
There are a lot of resources to help you create what you are looking to do - and even quite a few questions already answered here. However, I understand it's tough starting out, so here's a quick example to get you started.
Sample Data:
df <- data.frame(
dates=c('2020-01-01','2020-02-01','2020-03-03','2020-03-14','2020-04-01'),
var1=c(13,15,18,29,40),
var2=c(5,8,11,13,18)
)
If you are plotting by date on your x axis, you need to ensure that df$dates is formatted as a "Date" class (or one of the other date-like classes). You can do that via:
df$dates <- as.Date(df$dates, format='%Y-%m-%d')
The format= argument of as.Date() should follow the conventions indicated in strptime(). Just type ?strptime in your console and you can see in the help for that function how the various terms are defined for format=.
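As a small illustration of format=, here is a sketch for dates written like the ones in the question (3-Mar). The year 2020 is an assumption, since the question's data carries no year, and %b relies on the month abbreviations of your locale, so the first conversion may give NA outside an English locale.

```r
# Day-month strings as in the question; they carry no year.
raw <- c("3-Mar", "4-Mar", "5-Mar")

# Assumption: all observations are from 2020, so append the year before parsing.
# %d = day of month, %b = abbreviated month name (locale-dependent), %Y = 4-digit year
parsed <- as.Date(paste0(raw, "-2020"), format = "%d-%b-%Y")

# A purely numeric layout avoids the locale dependence of %b.
numeric_style <- as.Date("03/03/2020", format = "%d/%m/%Y")
class(numeric_style)   # "Date"
format(numeric_style)  # "2020-03-03"
```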
The next step is very important: recognize that the data is in "wide" format, not "long" format. You will almost always want your data in what is known as Tidy Data format, which is convenient for any analysis and is what ggplot2 and the related packages expect. In your data, the measure is a number of people, and the type of measure is either cases or deaths. So "number of people" is spread over two columns, and the information on "type of measure" is stuck in the column names when it should be a variable in the dataset. Your goal should be to gather() those two columns together and create two new columns: (1) one to indicate if the number is "cases" or "deaths", and (2) the number itself. In the example I've shown you can do this via:
library(dplyr)
library(tidyr)
library(ggplot2)
df <- df %>% gather(key='var_name', value='number', -dates)
The result is that the data frame has columns for:
dates: unchanged
var_name: contains either var1 or var2 as a character class
number: the actual number
Finally, for the plot, the code is quite simple. You apply dates to the x aesthetic, number to y, and use var_name to differentiate color for the line geom:
ggplot(df, aes(x=dates, y=number)) +
geom_line(aes(color=var_name))
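The steps above can also be sketched end to end with pivot_longer(), which tidyr documents as the newer replacement for gather(). The sketch also touches the two things the question asked about: overlapping date labels and line colours. The monthly breaks, the label format, the 45-degree rotation and the two colour names are illustrative choices, not requirements.

```r
library(tidyr)
library(ggplot2)

# Same sample data as above, with the dates already converted to Date class
df <- data.frame(
  dates = as.Date(c("2020-01-01", "2020-02-01", "2020-03-03",
                    "2020-03-14", "2020-04-01")),
  var1 = c(13, 15, 18, 29, 40),
  var2 = c(5, 8, 11, 13, 18)
)

# Wide -> long: one row per (date, variable) pair
long <- pivot_longer(df, cols = -dates,
                     names_to = "var_name", values_to = "number")

p <- ggplot(long, aes(x = dates, y = number, color = var_name)) +
  geom_line() +
  # fewer, shorter date labels reduce overlap on the x axis
  scale_x_date(date_breaks = "1 month", date_labels = "%d-%b") +
  # pick the line colours by hand (any valid R colour names work)
  scale_color_manual(values = c(var1 = "steelblue", var2 = "firebrick")) +
  # rotate the labels so neighbouring dates do not collide
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

With around 145 daily observations, a wider spacing such as date_breaks = "2 weeks" may read better; that is a matter of taste.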
So I get that the title is terrible and generic. I have no idea how to concisely describe what I am trying to do.
I've got a 2-column data frame in R. Column A has data values; column B has data that has now been binned (it was the year associated with column A, and is now a bin label based on year ranges).
I need to generate a new data frame which uses the bin labels as columns with the associated data values as row entries, preferably sorted, and back-filled with NA so that the columns all have the same length.
Sample data:
df <- data.frame(values=c(1,NA,3,NA,5:6,7:9),
bins=rep(c("yr1_yr2","yr2_yr3","yr3_yr4"),each=3))
SOLUTION EDIT: So after a lot of experimentation I was able to do what I wanted with my data by using the 'cut_width' function from ggplot2 to slice my data into bins then plop it in a distribution graph.
Thank you all for your attempts, sorry again for the vague question and lack of sample data.
Not quite sure if this is getting close to what you want...
library(reshape2)
melt(df, id.vars='bins', measure.vars='values')
returns
bins variable value
1 yr1_yr2 values 1
2 yr1_yr2 values NA
3 yr1_yr2 values 3
4 yr2_yr3 values NA
5 yr2_yr3 values 5
6 yr2_yr3 values 6
7 yr3_yr4 values 7
8 yr3_yr4 values 8
9 yr3_yr4 values 9
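If the goal is the layout described in the question (one column per bin, values sorted, shorter columns padded with NA), a base-R sketch along these lines might get closer. It is a different technique from the melt() call above, and the names by_bin and wide are made up for illustration.

```r
df <- data.frame(values = c(1, NA, 3, NA, 5:6, 7:9),
                 bins = rep(c("yr1_yr2", "yr2_yr3", "yr3_yr4"), each = 3))

# One vector of values per bin label
by_bin <- split(df$values, df$bins)

# Sort each bin; na.last = NA drops the missing values
by_bin <- lapply(by_bin, function(v) sort(v, na.last = NA))

# Pad every bin with NA up to the length of the longest one
n <- max(lengths(by_bin))
wide <- as.data.frame(lapply(by_bin, function(v) {
  length(v) <- n   # extending a vector this way fills with NA
  v
}))
wide
```

Here the NA values end up at the bottom of each column; the original row positions inside a bin are not preserved.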
This is my data: https://www.dropbox.com/s/msf0ro8saav7wbl/data1.txt?dl=0 (dataA). I want to extract "Habitat" into a frequency table so that I can calculate statistics such as the mean and variance, and also draw plots such as a boxplot using ggplot2.
I tried the solution from the duplicate question here, "R: How to get common counts (frequency) of levels of two factor variables by ID Variable (as new data frame)", but I don't think it solves my problem.
Here's the easiest way to get a data.frame with frequencies using table. I'm using t to transpose and as.data.frame.matrix to transform it into a data.frame.
as.data.frame.matrix(t(table(data1)))
A B C
Adult 1 2 1
Juvenile 2 0 0
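To make the call above reproducible without the linked file, here is a sketch on made-up data. The column names site and stage and all of the values are assumptions chosen only so that the result has the same shape as the output shown; they are not taken from the real data1.

```r
# Made-up stand-in for data1 (the real file is not reproduced here)
data1 <- data.frame(
  site  = c("A", "B", "B", "C", "A", "A"),
  stage = c("Adult", "Adult", "Adult", "Adult", "Juvenile", "Juvenile")
)

counts <- table(data1)                   # site x stage contingency table
freq <- as.data.frame.matrix(t(counts))  # stage in rows, site in columns
freq
```

Because freq is an ordinary data frame of counts, functions such as rowSums() or colMeans() can be applied to it directly.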
Hi, I am using R to analyze my data. I have time-series data in the following format:
dates ID
2008-02-12 3
2008-03-12 3
2008-05-12 3
2008-09-12 3
2008-02-12 8
2008-04-12 6
I would like to create a plot with dates on the x axis and ID on the y axis, such that it draws a point if an ID is reported for that date and nothing if there is no data for it.
In the original dataset I only have an ID if a value was reported on that date. E.g. for 2008-02-12 there is no data reported for ID 6, hence it is missing from my dataset.
I was able to get all the dates with unique(df$dates), but I don't know enough about R data structures to loop through the data, build a matrix of 1s and 0s for all IDs, and then plot it.
I would be grateful if you could help me with the code or give me some pointers on an effective way to approach this problem.
Thanks in advance.
It seems you want something like a scatter-plot :
# input data
DF <-
read.csv(
text=
'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
colClasses=c('character','integer'))
# convert first column from characters to dates
DF$Year <- as.POSIXct(DF$Year,format='%Y-%m-%d',tz='GMT')
# scatter plot
plot(x=DF$Year,y=DF$ID,type='p',xlab='Date',ylab='ID',
main='Reported Values',pch=19,col='red')
But this approach has a problem. For example if you have unique(ids) = c(1,2,1000) the space on the y axis between id=2 and id=1000 will be very big (the same holds for the dates on the x axis).
Maybe you want a sort of "map" id-dates, like the following :
# input data
DF <-
read.csv(
text=
'Year,ID
2008-02-12,3
2008-03-12,3
2008-05-12,3
2008-09-12,3
2008-02-12,8
2008-04-12,6',
colClasses=c('character','integer'))
dates <- as.factor(DF$Year)
ids <- as.factor(DF$ID)
plot(x=as.integer(dates),y=as.integer(ids),type="p",
xlim=c(0.5,length(levels(dates))+0.5),
ylim=c(0.5,length(levels(ids))+0.5),
xaxs="i", yaxs="i",
xaxt="n",yaxt="n",main="Reported Values",
xlab="Date",ylab="ID",pch=19,col='red')
axis(1,at=1:length(levels(dates)),labels=levels(dates))
axis(2,at=1:length(levels(ids)),labels=levels(ids))
# add grid
abline(v=(1:(length(levels(dates))-1))+0.5,col="Gray80",lty=2)
abline(h=(1:(length(levels(ids))-1))+0.5,col="Gray80",lty=2)
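The question also mentions building a matrix of 1s and 0s for every ID and date. A minimal sketch of that with table(), as an alternative to looping, might look like this; the names counts and presence are made up for illustration.

```r
# Same data as in the question
DF <- data.frame(
  dates = as.Date(c("2008-02-12", "2008-03-12", "2008-05-12",
                    "2008-09-12", "2008-02-12", "2008-04-12")),
  ID = c(3, 3, 3, 3, 8, 6)
)

# Cross-tabulate: one row per ID, one column per date, cells are counts
counts <- table(ID = DF$ID, date = format(DF$dates))

# Turn counts into a 1/0 presence matrix (1 = reported, 0 = not reported)
presence <- 1L * (unclass(counts) > 0)
presence
```

The row and column names of presence are the IDs and the dates as text, so a cell can be looked up with, for example, presence["6", "2008-04-12"].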
I am trying to plot some categorical data and this answer is very close to what I am trying to do, however in my case I have dates in the place of countries as seen in this example. How can I create the plot with the original row order from the data.frame? It appears that even though the factors are in the same order in dat and melt.data they are not ordered sequentially on the y axis in the plot.
Here is a reproducible example:
library(reshape)
library(ggplot2)
dat <- data.frame(dates=c("01/01/2002", "09/15/2003", "05/31/2012"), Germany = c(0,1,0), Italy = c(1,0,0))
melt.data<-melt(dat, id.vars="dates", variable_name="country")
qplot(data=melt.data,
x=country,
y=dates,
fill=factor(value),
geom="tile")
Your problem is that dates is stored as a character string. See str(dat) for the structure of the data.
By adding
dat$dates <- as.Date(dat$dates,"%m/%d/%Y")
after loading dat, you can get the dates in the original order.
Your problem is that dat$dates is a factor (before R 4.0.0, data.frame() converted strings to factors by default), and by default R has sorted the levels lexicographically. R does not know they are dates.
So
levels(dat$dates)
## [1] "01/01/2002" "05/31/2012" "09/15/2003"
and therefore
order(dat$dates)
## [1] 1 3 2
If you want R to treat these as dates, then you can convert them to a Date column:
dat$dates <- as.Date(as.character(dat$dates), format = '%m/%d/%Y')
# now
order(dat$dates)
## [1] 1 2 3
Which is what you want.
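If you would rather keep the axis categorical (one tile row per date, in the order of the rows of dat) instead of switching to a continuous date scale, another option is to fix the factor levels explicitly. This is an alternative to the conversions shown above, sketched under the assumption that the rows of dat are already in the order you want; it should be done before melting.

```r
dat <- data.frame(dates = c("01/01/2002", "09/15/2003", "05/31/2012"),
                  Germany = c(0, 1, 0), Italy = c(1, 0, 0))

# Use the order of first appearance as the level order,
# instead of the default alphabetical order
dat$dates <- factor(dat$dates, levels = unique(as.character(dat$dates)))

levels(dat$dates)
## [1] "01/01/2002" "09/15/2003" "05/31/2012"
```

ggplot2 draws discrete axes in level order, so the tiles then follow the original row order.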