How to structure data for R?

So... newbie R user here. I have some observations that I'd like to record using R and be able to add to later.
The items are sorted by weight, and the number of items at each weight is recorded. So far what I have looks like this:
weights <- c(rep(171.5, times=1), rep(171.6, times=2), rep(171.7, times=4), rep(171.8, times=18), rep(171.9, times=39), rep(172.0, times=36), rep(172.1, times=34), rep(172.2, times=25))
There will be a total of 500 items being observed.
I'm going to be taking additional observations over time to (hopefully) see how the distribution of weights changes with use/wear. I'd like to be able to make plots showing either stacked histograms or boxplots.
What would be the best way to format / store this data to facilitate this kind of use case? A matrix, dataframe, something else?

As other comments have suggested, the most versatile (and perhaps most useful) container for your data is a data frame. It works directly with ggplot2 for your future plotting and graphing needs, such as box plots and various histograms.
Toy example
The code below takes your weights vector from above, builds a data frame with some dummy IDs, and draws a box-and-whisker plot.
library(ggplot2)
IDs<-sample(LETTERS[1:5],length(weights),TRUE) #dummy ID values
df<-data.frame(ID=IDs,Weights=weights) #make data frame with your
#original `weights` vector
ggplot(data=df,aes(factor(ID),Weights))+geom_boxplot() #box-plot
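To be able to add observations later, one option is to keep the counts in "long" form, with one row per weight and observation round, and expand them only when plotting. The following is a minimal sketch under assumptions: the round labels and the second round's counts are made up for illustration, and the column names round, weight and count are hypothetical.

```r
library(ggplot2)

w <- c(171.5, 171.6, 171.7, 171.8, 171.9, 172.0, 172.1, 172.2)

# One row per (round, weight); 'count' is the number of items at that weight.
# Round 1 uses the counts from the question, round 2 is invented.
obs <- rbind(
  data.frame(round = "round1", weight = w, count = c(1, 2, 4, 18, 39, 36, 34, 25)),
  data.frame(round = "round2", weight = w, count = c(3, 6, 12, 30, 40, 33, 22, 13))
)

# Expand to one row per item so that geoms work on the raw observations
long <- obs[rep(seq_len(nrow(obs)), times = obs$count), c("round", "weight")]

# One boxplot per round
p_box <- ggplot(long, aes(round, weight)) + geom_boxplot()

# Stacked histogram, coloured by round
p_hist <- ggplot(long, aes(weight, fill = round)) + geom_histogram(binwidth = 0.1)
```

A later round can be appended to obs with another rbind() call, and the plots are rebuilt from the expanded data. Use print(p_box) or print(p_hist) to draw them.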

Related

Is there a function in R to quickly calculate the difference between two geom_bin2d maps?

I have a large 2-variable dataset that may be classified into 2 groups using a third variable. Overplotting is an issue, so I've resorted to visualizing my data using bin2d and other similar approaches. I would like to calculate the difference between the binned counts of the two groups and visualize that as well (e.g. subtract one 2d histogram from another).
example code:
library(ggplot2) # diamonds data set and geom_bin2d
library(dplyr) # filter
df <- diamonds
df_color_H <- filter(df,color=="H")
df_color_E <- filter(df,color=="E")
ggplot(df_color_H)+
geom_bin2d(aes(carat,price),bins=40)
ggplot(df_color_E)+
geom_bin2d(aes(carat,price),bins=40)
Ultimately, I want to visualize the difference between overlapping bins. I know the solution is likely a pre-processing step before bringing the data into ggplot2, but I haven't found exactly what I'm looking for. I also don't need a sophisticated solution using KDEs or something like that.
Any suggestions would be welcome!
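One possible pre-processing step, sketched here as an assumption rather than an established recipe: bin both groups on the same grid with cut(), tabulate the counts, subtract the two tables, and draw the result with geom_tile. The helper bin_counts and the column name count_diff are hypothetical names introduced for this illustration.

```r
library(ggplot2)  # provides the diamonds data set

n_bins <- 40
# Shared bin edges so both groups are counted on the same grid
carat_breaks <- seq(min(diamonds$carat), max(diamonds$carat), length.out = n_bins + 1)
price_breaks <- seq(min(diamonds$price), max(diamonds$price), length.out = n_bins + 1)

# Count observations per (carat bin, price bin) cell, keeping empty cells
bin_counts <- function(d) {
  cx <- factor(cut(d$carat, carat_breaks, include.lowest = TRUE, labels = FALSE),
               levels = seq_len(n_bins))
  cy <- factor(cut(d$price, price_breaks, include.lowest = TRUE, labels = FALSE),
               levels = seq_len(n_bins))
  table(carat_bin = cx, price_bin = cy)
}

counts_H <- bin_counts(diamonds[diamonds$color == "H", ])
counts_E <- bin_counts(diamonds[diamonds$color == "E", ])

# Cell-wise difference, reshaped to one row per cell
diff_counts <- as.data.frame(as.table(counts_H - counts_E), responseName = "count_diff")

# Replace bin indices by bin midpoints for plotting
carat_mid <- head(carat_breaks, -1) + diff(carat_breaks) / 2
price_mid <- head(price_breaks, -1) + diff(price_breaks) / 2
diff_counts$carat <- carat_mid[as.integer(diff_counts$carat_bin)]
diff_counts$price <- price_mid[as.integer(diff_counts$price_bin)]

# Diverging colour scale centred on zero difference
p_diff <- ggplot(diff_counts, aes(carat, price, fill = count_diff)) +
  geom_tile() +
  scale_fill_gradient2()
```

Positive cells hold more "H" diamonds, negative cells more "E" diamonds. If the groups differ a lot in size, dividing each table by its total before subtracting may be more informative.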

Creating Histograms with R. Questions regarding possibilities of use and problems with overlapping values

For my thesis I want to create a histogram of standardized earnings. This histogram should ideally have the following properties:
The intervals (bins) of the histogram should be adjustable.
Since my data is in a spreadsheet, it should be possible to consider more than one column.
It should be possible to restrict the range of the data included in the histogram, for example from -50 million to 200 million (though I could also do this in my input).
Sadly, I was not able to do this on my own.
I downloaded the data from Orbis as a spreadsheet (xlsx). Afterwards I cleaned my data of symbols that R can't read, saved everything as a tab-separated .txt file and imported it into RStudio:
setwd("/path")
getwd()
df<- read.table("importFile", header = TRUE)
View(df)
This worked nicely.
Then I tried creating the histogram:
library(ggplot2)
myplot=ggplot(df, aes(JuStandartisiert2007))
myplot+ stat_count(width = 1000)
Then I received the following warning:
position_stack requires non-overlapping x intervals
My histogram looks horrible:
This perplexes me. I tried making a histogram of the airquality dataset and it works without problems.
Also note that I have to use stat_count for my histogram. In a YouTube video I saw, they did it the following way:
myplot+ geom_histogram(binwidth = 10)
My questions are now:
What is wrong with my data, and why do I have overlapping x values? To my naked eye my data looks the same as R's airquality dataset.
How can I separate my x values?
Can I set max and min values for the data that enters my histogram?
Can I consider more than one column in my dataset?
Here is my dataset as a tab-separated txt file.
https://www.dropbox.com/sh/jbscj6cftpcqaxh/AADglvv_xnG2wWN-o2SIrTwpa?dl=0
I would rather begin with base plotting, such as:
hist(df$JuStandartisiert2007,breaks=1000,xlim=c(-2,2))
Note that xlim sets the limits of the x-axis.
To plot two columns against each other, try:
plot(df$JuStandartisiert2007,df$BilanzsummeAktiva2007,xlim = c(-5,5),ylim=c(-1,1000))
Once again observe the x and y limits represented by: xlim and ylim
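On the ggplot2 side, the warning most likely comes from stat_count, which counts every distinct x value separately. With continuous data and width = 1000 the resulting bars overlap. A sketch of the geom_histogram route follows, using simulated data because the real file is not reproduced here. The second column, JuStandartisiert2008, is a hypothetical name added only to show the multi-column case.

```r
library(ggplot2)

set.seed(1)
# Simulated stand-in for the imported table (values are made up)
df <- data.frame(
  JuStandartisiert2007 = rnorm(500),
  JuStandartisiert2008 = rnorm(500, mean = 0.2)  # hypothetical second column
)

# Restrict the range before plotting, then bin with an explicit bin width
in_range <- subset(df, JuStandartisiert2007 >= -2 & JuStandartisiert2007 <= 2)
p_one <- ggplot(in_range, aes(JuStandartisiert2007)) +
  geom_histogram(binwidth = 0.1)

# Several columns: stack them into long form (columns 'values' and 'ind')
long <- stack(df[c("JuStandartisiert2007", "JuStandartisiert2008")])
p_two <- ggplot(long, aes(values, fill = ind)) +
  geom_histogram(binwidth = 0.1, position = "identity", alpha = 0.5) +
  coord_cartesian(xlim = c(-2, 2))  # zooms without dropping data
```

Changing binwidth is how the intervals are "played with". subset() drops values outside the range, whereas coord_cartesian() only zooms the view.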

Plot the relationship of each column to a singular column in a table

I have one table of derived vegetation indices for 63 sample sites from different satellites. This gives me a table with 63 observations (sample sites) and 56 variables (1 sample ID, 50 vegetation indices, 4 biomass and 1 LAI). The last 5 columns of the table are the biomass and LAI, and the first column is the sample ID.
I want to generate a plot showing the relationship between a single vegetation index and one of the biomass parameters.
I am able to do this using the plot function, for one pair of variables at a time.
plot(data$Dry10, data$X8047EVImea)
I don't want to run this code 50 times, and then again for each of the 5 biomass and LAI parameters.
Is there a way to loop or nested loop this plot function so that I can generate 200 graphs at once?
Also, I will place a regression line in each plot to see what vegetation index will best represent the amount of biomass present at the sample site.
This is my first post on stackoverflow, so please don't hesitate to request more information on the problem if I have missed something.
As noted in my comment you can accomplish this with a faceted plot in the ggplot2 package. This does require a little bit of data re-arrangement that can be accomplished with the reshape2 package. Here is some code that will be close to what you want to do but since I don't completely know your data formats it might take some fixes:
library(ggplot2)
library(reshape2)
idCol <- names(data)[1] # name of the sample ID column
## melt the data.frames so the biomass and vegetation headers are now variables,
## keeping the sample ID so that values stay paired by site
vegDatM <- melt(data[,1:51], id.vars=idCol, variable.name='vegInd', value.name='vegVal')
bioDatM <- melt(data[,c(1,52:55)], id.vars=idCol, variable.name='bioInd', value.name='bioVal')
## Join on the sample ID to create all index/biomass comparisons per site
gdat <- merge(vegDatM, bioDatM, by=idCol)
## plot the data in a faceted grid
ggplot(gdat) + geom_point(aes(x=vegVal, y=bioVal)) + facet_grid(vegInd ~ bioInd)
Note that since the grid has 50 rows of plots, you may want to open a device with a large height (or width if you swap the facets), e.g. pdf('foo.pdf', height=20). Hope this gets you on the right track.
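If separate base-graphics plots with a regression line are preferred, a nested loop over column names also works. This is only a sketch on simulated data: apart from Dry10, every column name here (SampleID, VI_a, VI_b, VI_c, Wet10) is made up, and the output file name is an arbitrary choice.

```r
set.seed(42)
# Simulated stand-in: 63 sites, 3 vegetation indices, 2 biomass columns
data <- data.frame(SampleID = 1:63,
                   VI_a = runif(63), VI_b = runif(63), VI_c = runif(63))
data$Dry10 <- 2 * data$VI_a + rnorm(63, sd = 0.2)  # related to VI_a by construction
data$Wet10 <- rnorm(63)                            # unrelated noise

vegCols <- c("VI_a", "VI_b", "VI_c")
bioCols <- c("Dry10", "Wet10")

# R-squared of each linear fit, to compare the indices afterwards
r2 <- matrix(NA_real_, length(vegCols), length(bioCols),
             dimnames = list(vegCols, bioCols))

out_file <- file.path(tempdir(), "veg_vs_biomass.pdf")
pdf(out_file)  # one page per plot
for (v in vegCols) {
  for (b in bioCols) {
    fit <- lm(data[[b]] ~ data[[v]])
    plot(data[[v]], data[[b]], xlab = v, ylab = b)
    abline(fit, col = "red")  # regression line
    r2[v, b] <- summary(fit)$r.squared
  }
}
dev.off()
```

With the real table, vegCols and bioCols would be taken from names(data) by position. The r2 matrix then gives a quick ranking of which index tracks each biomass parameter most closely, at least for a straight-line fit.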

Plot boxplots and line of time series data in R

I want to combine a time series of in situ values (line) with boxplots of estimated values for particular dates. I tried to understand the question "Add a line from different result to boxplot graph in ggplot2", but my dates are driving me crazy. Sometimes I only have in situ values for a date, sometimes only estimated values, and sometimes both.
I uploaded a sample of my data here:
http://www.file-upload.net/download-9942494/estimated.txt.html
http://www.file-upload.net/download-9942495/insitu.txt.html
How can I create a plot with both data sets that looks like this http://www.file-upload.net/download-9942496/desired_outputplot.png.html
in the end?
I got help and have a solution now:
library(lattice) # xyplot, panel.bwplot
library(latticeExtra) # allows adding trellis objects with +
insitu <- read.table("insitu.txt",header=TRUE,colClasses=c("Date","numeric"))
est <- read.table("estimated.txt",header=TRUE,colClasses=c("Date","numeric"))
insitu.plot <- xyplot(insitu~date_fname,data=insitu,type="l",
panel=function(x,y,...){panel.grid(); panel.xyplot(x,y,...)},xlab=list(label="Date",cex=2))
est.plot <- xyplot(estimated~date,data=est,panel=panel.bwplot,horizontal=FALSE)
both <- insitu.plot+est.plot
update(both,xlim=range(c(est$date,insitu$date_fname))+c(-1,1),ylim=range(c(est$estimated,insitu$insitu)))

Plotting distribution of differences in R

I have a dataset with numbers indicating the daily difference in some measure.
https://dl.dropbox.com/u/22681355/diff.csv
I would like to create a plot of the distribution of the differences with special emphasis on the rare large changes.
I tried plotting each column using the hist() function but it doesn't really provide a detailed picture of the data.
For example plotting the first column of the dataset produces the following plot:
https://dl.dropbox.com/u/22681355/Rplot.pdf
My problem is that this shows very little detail for the infrequent large deviations.
What is the easiest way to do this?
Also any suggestions on how to summarize this data in a table? For example besides showing the min, max and mean values, would you look at quantiles? Any other ideas?
You could use boxplots to visualize the distribution of the data:
sdiff <- read.csv("https://dl.dropbox.com/u/22681355/diff.csv")
boxplot(sdiff[,-1])
Outliers are printed as circles.
I back @Sven's suggestion for identifying outliers, but you can get more refinement in your histograms by specifying a denser set of breakpoints than what hist chooses by default.
d <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv', header=TRUE, row.names=1)
with(d, hist(a, breaks=seq(min(a), max(a), length.out=100)))
Violin plots could be useful:
df <- read.csv('https://dl.dropbox.com/u/22681355/diff.csv')
library(vioplot)
with(df,vioplot(a,b,c,d,e,f,g,h,i,j))
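I would use a boxplot on transformed data, e.g. a signed square root (written this way rather than as x/sqrt(abs(x)), which gives NaN for differences of exactly zero):
boxplot(sign(df[,-1])*sqrt(abs(df[,-1])))
Obviously a histogram would also look better after transformation.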
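For the summary table asked about in the question, quantiles with extra weight on the tails are one reasonable choice. The sketch below uses simulated heavy-tailed data in place of diff.csv, and the particular probabilities are an arbitrary assumption.

```r
set.seed(7)
# Simulated heavy-tailed daily differences standing in for the real columns
d <- data.frame(a = rt(1000, df = 3), b = rt(1000, df = 3))

# Probabilities chosen to show the extremes as well as the centre
probs <- c(0, 0.01, 0.05, 0.25, 0.5, 0.75, 0.95, 0.99, 1)

# One row per column of d, one column per quantile
summary_table <- t(sapply(d, quantile, probs = probs))
round(summary_table, 2)
```

The 0% and 100% columns are the minimum and maximum, and the gaps between the 1%/99% and 5%/95% columns give a sense of how far the rare large changes sit from the bulk of the data.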
