I am trying to label the outliers in my boxplot using the text function so I can find out which class each outlier comes from. I've stored the result of names(vehData) in a variable called "rownames" (note that names() actually returns the column names). When I run this, however, I get an error.
ERROR: Error in which(removeOutliers1 == bxpdat$out, arr.ind = TRUE) :
'list' object cannot be coerced to type 'double'
I'm completely new to R programming and not sure how to fix this or what I am doing wrong.
Thanks in advance for any help!
library(reshape2)
vehData <-
structure(
list(
Samples = 1:6,
Comp = c(95L, 91L, 104L, 93L, 85L,
107L),
Circ = c(48L, 41L, 50L, 41L, 44L, 57L),
D.Circ = c(83L,
84L, 106L, 82L, 70L, 106L),
Rad.Ra = c(178L, 141L, 209L, 159L,
205L, 172L),
Pr.Axis.Ra = c(72L, 57L, 66L, 63L, 103L, 50L),
Max.L.Ra = c(10L,
9L, 10L, 9L, 52L, 6L),
Scat.Ra = c(162L, 149L, 207L, 144L, 149L,
255L),
Elong = c(42L, 45L, 32L, 46L, 45L, 26L),
Pr.Axis.Rect = c(20L,
19L, 23L, 19L, 19L, 28L),
Max.L.Rect = c(159L, 143L, 158L, 143L,
144L, 169L),
Sc.Var.Maxis = c(176L, 170L, 223L, 160L, 241L, 280L),
Sc.Var.maxis = c(379L, 330L, 635L, 309L, 325L, 957L),
Ra.Gyr = c(184L,
158L, 220L, 127L, 188L, 264L),
Skew.Maxis = c(70L, 72L, 73L,
63L, 127L, 85L),
Skew.maxis = c(6L, 9L, 14L, 6L, 9L, 5L),
Kurt.maxis = c(16L,
14L, 9L, 10L, 11L, 9L),
Kurt.Maxis = c(187L, 189L, 188L, 199L,
180L, 181L),
Holl.Ra = c(197L, 199L, 196L, 207L, 183L, 183L),
Class = c("van", "van", "saab", "van", "bus", "bus")
),
row.names = c(NA,
6L), class = "data.frame")
#Remove outliers
removeOutliers <- function(data) {
OutVals <- boxplot(data)$out
remOutliers <- sapply(data, function(x) x[!x %in% OutVals])
return (remOutliers)
}
vehDataRemove1 <- vehData[, -1]
vehDataRemove2 <- vehDataRemove1[,-19]
vehData <- vehDataRemove2
vehClass <- vehData$Class
rownames <- names(vehData) #column names
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData)
bxpdat <- boxplot(removeOutliers1)
#Also tried using vehicles$Class instead of rownames but get the same error
text(bxpdat$group, bxpdat$out,
rownames[which(removeOutliers1 == bxpdat$out, arr.ind = TRUE)[,1]],
pos = 4)
The boxplot looks like this. I am trying to label the outliers by the x-axis variable (e.g. "Comp", "Circ", "D.Circ", "Rad.Ra", "Max.L.Ra") and by vehicle class ("van", "bus", etc.).
Crammed text issue when identifying class
If it is the outliers in the 2nd boxplot, it would be:
bxpdat <- boxplot(removeOutliers1)
text(bxpdat$group, bxpdat$out,
bxpdat$names[bxpdat$group],
pos = 4)
Maybe looks better like this, if you adjust the margin and flip the labels:
par(mar=c(8,3.5,3.5,3.5))
bxpdat = boxplot(removeOutliers1,las=2,cex=0.5)
text(bxpdat$group, bxpdat$out,
bxpdat$names[bxpdat$group],
pos = 4,cex=0.5)
I understood the question differently to @StupidWolf. I thought the goal was to replace the points indicating outliers with the text of the vehicle class (bus, van or saab). If you simply print the variable name (e.g. Skew.maxis), then you might as well have plotted the outliers as points. Unless I'm missing something.
Here is code to answer the question as I understood it, for what it's worth (beginning after defining removeOutliers):
# CHANGE: Create vehClass vector before removing Class from the dataframe
vehClass <- vehData$Class
vehDataRemove1 <- vehData[, -1]
vehDataRemove2 <- vehDataRemove1[,-19]
vehData <- vehDataRemove2
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData)
bxpdat <- boxplot(removeOutliers1) # use boxplot(vehData) if you plot all the outliers as points
# loop over columns
n_plot <- 1; set.seed(123) # only plot n_plot randomly-chosen outliers
for(i in 1:ncol(vehData)){
# find out which row indices were removed as outliers
diffInd <- which(vehData[[i]] %in% setdiff(vehData[[i]], removeOutliers1[[i]]))
# if none were, then don't add any outlier text
if(length(diffInd) == 0) next
print(i)
print(paste0("l:", length(diffInd)))
if(length(diffInd) > n_plot){
diffIndPlot <- sample(diffInd, n_plot, replace = FALSE)
} else diffIndPlot <- diffInd
text(x = i, y = vehData[[i]][diffIndPlot],
labels = paste0(vehClass[diffIndPlot], ": ", vehData[[i]][diffIndPlot]))
}
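The error in the question seems to come from comparing removeOutliers1 (a list, once the columns have different lengths) with a numeric vector inside which(). A minimal sketch of a way around that is to look up the outlier rows column by column with boxplot.stats() and then hand the matching classes to text(). The toy data frame and the helper outlier_labels() are made up for illustration and are not from the question.

```r
# Toy data (hypothetical): two numeric columns plus a class label per row
toy <- data.frame(
  Comp  = c(10, 11, 12, 11, 10, 40),
  Circ  = c(5, 6, 5, 6, 5, 6),
  Class = c("van", "van", "saab", "van", "bus", "bus")
)
num_cols <- c("Comp", "Circ")

# Hypothetical helper: one row per outlier, giving the box position,
# the outlying value and the class of the row it came from
outlier_labels <- function(num_data, classes, coef = 1.5) {
  rows <- lapply(seq_along(num_data), function(j) {
    x <- num_data[[j]]
    idx <- which(x %in% boxplot.stats(x, coef = coef)$out)
    data.frame(group = rep(j, length(idx)),
               value = x[idx],
               label = classes[idx],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

labs <- outlier_labels(toy[num_cols], toy$Class)

# Draw only in an interactive session
if (interactive()) {
  boxplot(toy[num_cols], las = 2)
  text(labs$group, labs$value, labs$label, pos = 4, cex = 0.7)
}
```

With the real data, toy[num_cols] would be the numeric columns of vehData and toy$Class the vehClass vector saved before the Class column is dropped.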
I have been working on this for a while now, but I can't seem to figure it out. I'm looking for a solution that can: calculate difference between col1 and col2 and create colA based on this; then calculate difference between col2 and col3 and create colB based on this, etc. etc. I have about 70 rows and 42 of these columns so it's not something I want to do by hand (at this point I am almost desperate enough).
To give a note also, some of the cells in the rows are empty (NA). An emergency solution would be to fill these with zeroes, but I'd rather not.
Also, the dataframe I use is a tibble, however, I am not bound to this so much that I can't change it to a real dataframe.
My data looks like this:
testdata
As you can see, the columns have annoyingly long names I did not know how to change also :). I use the column numbers usually, which are 77:119. I hope this is complete enough. Sorry for the noob-ness and possibly unclear explanation, this is my first question on here and I'm not that craftsy in R!
Finally, to create the 'user/intermittent_answers/n_length' columns I used the following loop, so I thought it'd be possible to reuse this for the calculations that I need now.
#loop through PARTS of testdata to create _length's
for(i in names(testdata[34:76]))
testdata[[paste(i, 'length', sep="_")]] <- str_length(testdata[[i]])
Then I tried something similar which I found here: FOR loop to calculate difference on dates in R
for (j in 2:length(testdata$`user/intermittant_answers/42_length`))
testdata$lag[j] <- as.numeric(difftime(testdata$`user/intermittant_answers/42_length`[j], testdata$`user/intermittant_answers/42_length`[j-1], units=c("difference")), units = "days")
Error in as.POSIXct.numeric(time1) : 'origin' must be supplied
I figured this was because I am not working with anything time related, but I don't know how to find another 'diff'-like function that is not tied to matrices, like the one from the matrixStats package.
I hope someone can push me in the right direction!
Thank you!!
EDIT: @Ben, thank you for responding! If I had known this function I would've used it way sooner :'). I tried to keep a representation of NA values inside the df. Also, some people suggested using a double loop, however, I have not managed to figure this out. I hope this helps!
> dput(testdata[1:10, 95:105])
structure(list(`user/intermittant_answers/18_length` = c(NA,
24L, 34L, 33L, NA, NA, 16L, NA, 25L, 28L), `user/intermittant_answers/19_length` = c(NA,
38L, 68L, 34L, NA, 11L, 20L, 12L, 47L, 52L), `user/intermittant_answers/20_length` = c(NA,
59L, 81L, 42L, 2L, 33L, 20L, 26L, 96L, 78L), `user/intermittant_answers/21_length` = c(6L,
90L, 116L, 42L, 14L, 41L, 20L, NA, 127L, 113L), `user/intermittant_answers/22_length` = c(17L,
115L, 131L, 65L, 20L, 70L, 37L, 11L, 170L, 130L), `user/intermittant_answers/23_length` = c(40L,
138L, 188L, 65L, 38L, 113L, 22L, 24L, 200L, 136L), `user/intermittant_answers/24_length` = c(66L,
155L, 210L, 99L, 49L, 133L, 41L, 49L, 242L, 185L), `user/intermittant_answers/25_length` = c(66L,
158L, 233L, 99L, 65L, 156L, 67L, 70L, 296L, 224L), `user/intermittant_answers/26_length` = c(84L,
201L, 250L, 113L, 84L, 164L, 67L, 78L, 334L, 224L), `user/intermittant_answers/27_length` = c(89L,
237L, 285L, 130L, 97L, 167L, 84L, 86L, 412L, 232L), `user/intermittant_answers/28_length` = c(116L,
284L, 315L, 130L, 97L, 184L, 97L, 108L, 445L, 247L)), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
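One way to get the difference between each pair of neighbouring columns without a loop is to subtract a column-shifted copy of the matrix from itself. This is only a minimal sketch on made-up data; the short column names are placeholders for the long ones above. NA cells simply give NA differences, so nothing has to be filled with zeroes.

```r
# Made-up stand-in for a few of the *_length columns
lens <- data.frame(
  a_18_length = c(NA, 24L, 34L),
  a_19_length = c(NA, 38L, 68L),
  a_20_length = c(2L, 59L, 81L)
)

m <- as.matrix(lens)

# Column k+1 minus column k, for every neighbouring pair of columns
diffs <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]
colnames(diffs) <- paste0(colnames(m)[-1], "_diff")

# Attach the new columns to the original data
result <- cbind(lens, diffs)
```

With the real data this would presumably start from as.matrix(testdata[77:119]), assuming those are the column positions of the length columns.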
I have the following data which contains data from 7 combinations (rows) and 12 methods (columns).
structure(list(Beams = structure(c(1L, 3L, 4L, 5L, 6L, 7L, 2L
), .Label = c("1 – 2", "1 – 2 – 3 – 4", "1 – 3", "1 – 4", "2 – 3",
"2 – 4", "3 – 4"), class = "factor"), Slope...No.weight = c(75L,
65L, 45L, 30L, 95L, 70L, 75L), Slope...W1 = c(85L, 70L, 65L,
55L, 90L, 85L, 75L), Slope...W2 = c(80L, 65L, 65L, 50L, 90L,
90L, 75L), Slope...W3 = c(80L, 75L, 75L, 65L, 90L, 95L, 80L),
Average.Time...No.Weight = c(75L, 65L, 45L, 30L, 95L, 70L,
70L), Average.Time...W1 = c(70L, 60L, 75L, 60L, 75L, 75L,
80L), Average.Time...W2 = c(65L, 40L, 65L, 50L, 75L, 85L,
70L), Average.Time...W3 = c(65L, 40L, 80L, 75L, 65L, 85L,
80L), Momentum...No.weight = c(80L, 60L, 45L, 30L, 95L, 70L,
75L), Momentum...W1 = c(85L, 75L, 60L, 55L, 95L, 90L, 80L
), Momentum...W2 = c(80L, 65L, 70L, 50L, 90L, 90L, 85L),
Momentum...W3 = c(85L, 75L, 75L, 55L, 90L, 95L, 80L)), .Names = c("Beams",
"Slope...No.weight", "Slope...W1", "Slope...W2", "Slope...W3",
"Average.Time...No.Weight", "Average.Time...W1", "Average.Time...W2",
"Average.Time...W3", "Momentum...No.weight", "Momentum...W1",
"Momentum...W2", "Momentum...W3"), class = "data.frame", row.names = c(NA,
-7L))
I would like to get a barplot like the one below:
I've tried with
library(RColorBrewer)
dat<-read.csv("phaser-p13-30dBm-100ms.csv")
names <- c("1-2","1-3","1-4","2-3","2-4","3-4","1-2-3-4")
barx <-
barplot(as.integer(dat2[,2:13]),
beside=TRUE,
col=brewer.pal(12,"Set3"),
names.arg=names,
ylim=c(0,100),
xlab='Combination of beams',
ylab='Correct detection [%]')
box()
par(xpd=TRUE)
legend("top", c("Slope - No weight","Slope - W1","Slope - W2","Slope - W3","Average Time - No weight","Average Time - W1","Average Time - W2","Average Time - W3","Momentum - No weight","Momentum - W1","Momentum - W2","Momentum - W3"), fill = brewer.pal(12,"Set3"),horiz = T)
but I got this error:
Error in barplot.default(as.integer(dat2[, 2:13]), beside = TRUE, col = brewer.pal(12, :
incorrect number of names
Could you find the error?
I've named your dataframe df here and made use of three packages. This is not a base R solution. Given your dataset format, this is the easiest way (IMO) to do this:
library(dplyr)
library(tidyr)
library(ggplot2)
df %>% # dataframe
gather(variable, value, -Beams) %>% # convert to long format excluding beams column
ggplot(aes(x=Beams, y=value, fill=variable)) + # plot the bar plot
geom_bar(stat='identity', position='dodge')
This should get you started, if you wish to use base graphics and not ggplot2:
df <- as.matrix(dat[,-1])
rownames(df) <- dat[, 1]
barplot(df, beside = TRUE, las = 2)
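If the beam combinations should be the groups on the x axis (as the xlab in the question suggests), barplot() needs one row per method and one column per combination, i.e. the transpose of the matrix above. A minimal sketch using a cut-down version of the data (the first three beams and three methods, with plain hyphens in the beam names):

```r
# Cut-down copy of the question's data
dat <- data.frame(
  Beams = c("1-2", "1-3", "1-4"),
  Slope...No.weight = c(75L, 65L, 45L),
  Slope...W1 = c(85L, 70L, 65L),
  Slope...W2 = c(80L, 65L, 65L)
)

# Rows = methods (bars within a group), columns = beams (groups)
mat <- t(as.matrix(dat[, -1]))
colnames(mat) <- as.character(dat$Beams)

# Draw only in an interactive session
if (interactive()) {
  barplot(mat, beside = TRUE, ylim = c(0, 100),
          col = seq_len(nrow(mat)) + 1,
          xlab = "Combination of beams", ylab = "Correct detection [%]",
          legend.text = rownames(mat),
          args.legend = list(x = "top", horiz = TRUE, cex = 0.7))
}
```

The original error is consistent with this: names.arg had 7 entries, one per beam combination, so barplot() needs 7 groups, which is what the transposed matrix provides.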
Use the ggplot2 package and make sure that your data is tidy and ordered.
Something like ggplot(dataframe, aes(colour = some_factor)) + geom_bar(aes(x=Some_variable, y=Some_other_variable), stat="identity")
A more explicit statement of how your data maps onto the image would be useful.
Hi, I am working with a table in R. The first column contains the date (monthly) and the following columns contain return data for several portfolios. I want to use the PerformanceAnalytics package, so I need this data to be read as a time series.
This is what I tried. It has worked with another sheet before, but now I always get this error, even though I only changed the return data and nothing else. I don't understand why it won't read the date correctly.
> library(PerformanceAnalytics)
Loading required package: xts
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
Package PerformanceAnalytics (1.4.3541) loaded.
Copyright (c) 2004-2014 Peter Carl and Brian G. Peterson, GPL-2 | GPL-3
http://r-forge.r-project.org/projects/returnanalytics/
> #load file into R
> FactorR <- read.table("~/Desktop/Rfiles/FactorRegression.csv",header=TRUE,sep=";")
>
> #Time Series (first column date)
> FactorR_xts <- xts(x = FactorR[, -1],order.by = as.Date(FactorR$Date))
Error in charToDate(x) :
character string is not in a standard unambiguous format
I have attached the dput output so you can see what kind of data I am talking about. (I did not include all the data because it would be too much.)
> dput(FactorR)
structure(list(Date = structure(c(203L, 55L, 5L, 142L, 70L, 35L,
85L, 167L, 178L, 102L, 105L, 116L, 204L, 26L, 2L, 143L, 71L,
9L, 145L, 36L, 157L, 169L, 19L, 181L, 107L, 192L, 122L, 7L,
30L, 60L, 146L, 17L, 158L, 90L, 92L, 182L, 49L, 193L, 123L, 8L,
133L, 61L, 72L, 76L, 159L, 41L, 93L, 183L, 22L, 194L, 53L, 3L,
134L, 62L, 147L, 77L, 87L, 170L, 94L, 46L, 108L, 195L, 124L,
9L, 135L, 32L, 148L, 78L, 39L, 171L, 95L, 184L, 109L, 118L, 125L,
10L, 136L, 16L, 149L, 79L, 160L, 172L, 45L, 185L, 110L, 52L,
126L, 11L, 57L, 63L, 150L, 37L, 161L, 173L, 20L, 186L, 111L,
196L, 127L, 28L, 137L, 64L, 73L, 80L, 162L, 42L, 96L, 187L, 23L,
197L, 54L, 4L, 138L, 65L, 34L, 81L, 163L, 174L, 97L, 104L, 112L,
198L, 25L, 1L, 139L, 66L, 151L, 82L, 88L, 175L, 98L, 47L, 113L,
199L, 128L, 12L, 140L, 33L, 152L, 83L, 40L, 176L, 99L, 188L,
114L, 119L, 129L, 29L, 58L, 67L, 153L, 38L, 164L, 177L, 21L,
189L, 115L, 200L, 130L, 13L, 31L, 68L, 154L, 18L, 165L, 91L,
100L, 190L, 50L, 201L, 131L, 14L, 141L, 69L, 74L, 84L, 166L,
43L, 101L, 191L, 24L, 202L), .Label = c("26.02.10", "26.02.99",
"27.02.04", "27.02.09", "27.02.98", "28.02.01", "28.02.02", "28.02.03",
"28.02.05", "28.02.06", "28.02.07", "28.02.11", "28.02.13", "28.02.14",
"28.04.00", "28.04.06", "28.06.02", "28.06.13", "28.09.01", "28.09.07",
"28.09.12", "28.11.03", "28.11.08", "28.11.14", "29.01.10", "29.01.99",
"29.02.00", "29.02.08", "29.02.12", "29.03.02", "29.03.13", "29.04.05",
"29.04.11", "29.05.09", "29.05.98", "29.06.01", "29.06.07", "29.06.12",
"29.07.05", "29.07.11", "29.08.03", "29.08.08", "29.08.14", "29.09.00",
"29.09.06", "29.10.04", "29.10.10", "29.10.99", "29.11.02", "29.11.13",
"29.12.00", "29.12.06", "30.01.04", "30.01.09", "30.01.98", "30.03.01",
"30.03.07", "30.03.12", "30.04.01", "30.04.02", "30.04.03", "30.04.04",
"30.04.07", "30.04.08", "30.04.09", "30.04.10", "30.04.12", "30.04.13",
"30.04.14", "30.04.98", "30.04.99", "30.05.03", "30.05.08", "30.05.14",
"30.06.00", "30.06.03", "30.06.04", "30.06.05", "30.06.06", "30.06.08",
"30.06.09", "30.06.10", "30.06.11", "30.06.14", "30.06.98", "30.06.99",
"30.07.04", "30.07.10", "30.07.99", "30.08.02", "30.08.13", "30.09.02",
"30.09.03", "30.09.04", "30.09.05", "30.09.08", "30.09.09", "30.09.10",
"30.09.11", "30.09.13", "30.09.14", "30.09.98", "30.09.99", "30.10.09",
"30.10.98", "30.11.00", "30.11.01", "30.11.04", "30.11.05", "30.11.06",
"30.11.07", "30.11.09", "30.11.10", "30.11.11", "30.11.12", "30.11.98",
"30.11.99", "30.12.05", "30.12.11", "31.01.00", "31.01.01", "31.01.02",
"31.01.03", "31.01.05", "31.01.06", "31.01.07", "31.01.08", "31.01.11",
"31.01.12", "31.01.13", "31.01.14", "31.03.00", "31.03.03", "31.03.04",
"31.03.05", "31.03.06", "31.03.08", "31.03.09", "31.03.10", "31.03.11",
"31.03.14", "31.03.98", "31.03.99", "31.05.00", "31.05.01", "31.05.02",
"31.05.04", "31.05.05", "31.05.06", "31.05.07", "31.05.10", "31.05.11",
"31.05.12", "31.05.13", "31.05.99", "31.07.00", "31.07.01", "31.07.02",
"31.07.03", "31.07.06", "31.07.07", "31.07.08", "31.07.09", "31.07.12",
"31.07.13", "31.07.14", "31.07.98", "31.08.00", "31.08.01", "31.08.04",
"31.08.05", "31.08.06", "31.08.07", "31.08.09", "31.08.10", "31.08.11",
"31.08.12", "31.08.98", "31.08.99", "31.10.00", "31.10.01", "31.10.02",
"31.10.03", "31.10.05", "31.10.06", "31.10.07", "31.10.08", "31.10.11",
"31.10.12", "31.10.13", "31.10.14", "31.12.01", "31.12.02", "31.12.03",
"31.12.04", "31.12.07", "31.12.08", "31.12.09", "31.12.10", "31.12.12",
"31.12.13", "31.12.14", "31.12.97", "31.12.98", "31.12.99"), class = "factor"),
T1V = c(2.647778077, 2.210168532, 5.184543047, 8.040141376,
1.375197787, 5.254693278, 0.238583717, -0.897572167, -6.812178155,
-4.904778447, 1.445454477, 4.362544312, 0.577758687, -1.049345994,
-0.862978469, 1.496311077, 1.535298083, 0.288034989, 1.002503645,
-0.677737904, 1.148733333, -0.068879397, -0.933636437, 1.952957927,
0.864593373, 0.69587105, 1.566383785, 0.201725025, 0.108433102,
1.121251221, 0.697840536, -0.341798507, 1.750353464, -0.336236355,
-0.173630687, 0.405227621, 0.407442779, 0.301534209, -0.252288427,
-2.197112455, 0.4182172, 2.417270431, -1.777693712, 0.333608117,
-0.963997684, -6.639419411, 0.258711011, 0.186660625, 1.075364953,
-0.260546877, -0.144517713, 2.614703924, 1.592532166, 0.247679225,
-2.45731793, -4.605964615, -0.051317674, -2.162348318, -2.094287999,
1.053871887, 0.775032852, -2.409925349, -1.24731202, 0.20137383,
2.9796142, 1.18379607, 0.530516718, 0.687770774, 2.425813597,
1.070508498, 1.594988715, 2.577337728, 1.735724627, 4.753962343,
1.817757107, 0.287317513, 2.122250222, 0.509726992, 1.623651005,
-0.629218412, 1.413071621, 1.466153048, -0.032322501, 1.570878067,
2.495539535, 4.669928369, 2.540314459, 1.351671444, -0.511289999,
.
..
....
....
1.637709345, 0.949670725, -0.380310863, -1.434786801, 0.546588731,
-1.680930574, -1.497671033, 2.134405674, 0.189844698), T3R = c(0.440505512,
5.325647834, 8.837385281, 21.10071908, 4.5326005, 6.606732343,
-4.488433652, -1.304513421, -27.57526532, -19.22941607, 13.12560656,
10.95535151, -2.960696646, -1.282931055, -4.047714673, 4.325802659,
13.34806221, -3.940632325, 2.668465326, -2.035239493, 2.265868534,
2.901646772, 1.555938816, 8.725598107, 11.1111256, 15.10307892,
10.71764649, -1.860936247, -3.235221339, -0.718662895, 2.928862379,
1.567574208, 0.098434872, -2.639317291, -4.334738565, -7.662240412,
-1.392672778, -0.249440069, -7.519374824, -12.54244192, 3.211494367,
-1.798924417, -9.750103402, -17.47336517, -13.59092267, -30.85835803,
6.627120118, 13.84521564, 1.224167247, -4.282226202, -3.879824851,
11.0002882, -1.633862571, 0.728697276, -15.20216478, -21.43439457,
-9.173494124, -27.72510655, -1.643806123, 15.30080078, -11.42185815,
-10.86780424, -10.08529262, 0.158622664, 19.07560852, 4.410459583,
6.983702045, 9.726738752, 11.96532368, 0.865241128, 10.52710826,
1.824183803, 0.051281172, 7.643560265, 3.857934445, -4.269747269,
0.193491252, -1.127403274, -1.145642636, -4.336023223, -4.750288798,
1.386568693, -3.058304715, 3.87811701, 6.007778471, 6.972611825,
7.139746344, 4.366307305, -4.231872029, 0.465995363, 3.370806119,
6.055047349, 1.589337466, 6.641594709, -5.834167246, 0.500189653,
3.001936466, 5.665564573, 6.219235151, 4.696735739, 3.597032279,
-6.95415108, -2.658694701, 0.700309545, 3.870252718, 4.059903633,
4.129877722, 2.850231626, 6.026897131, 11.42913672, -1.40600749,
4.68987461, 6.138984252, 0.859683472, -0.783511946, -2.061859604,
-7.537614888, -3.971992672, 2.743416779, -13.26388813, 1.902781239,
-21.73358064, 5.433251961, -6.426065721, 5.500056238, 1.813441355,
-11.11515726, -5.234823589, 2.582946217, -16.67855167, -36.66711169,
-12.46637364, -5.211445441, -8.572591139, -17.88276043, 2.956958358,
25.59635755, 9.043196394, -1.052072638, 8.698101054, 11.55426061,
6.544403365, -4.495701412, -3.156245124, -1.293693294, 5.803543849,
-0.762197087, 8.000348105, 2.646959488, -12.09434448, -2.563082034,
1.466128125, -1.863374559, 4.699135454, 3.622459782, -1.706221195,
4.038651722, 2.817603386, 1.027156327, -1.486388335, 0.168641413,
-3.888501653, -9.915080583, -11.88374941, -13.56634471, -10.51374661,
3.846951996, -11.50943308, 2.074359943, 7.548294859, 6.711539857,
1.806850477, -0.576496993, -9.21065397, -4.154519223, 3.525193617,
-0.24777096, 3.601168094, 0.143557195, -6.368196817, 5.231960646,
6.810400741, 3.672507394, -2.556477674, -2.869519924, 4.479135652,
-5.380429829, 1.713023169, 3.396652152, 4.922622663, 4.040155598,
1.512006061, 0.24907751, 4.496251525, 0.92375895, -0.774870584,
-3.784012139, 5.614058853, 5.327086162, -0.706470295, 0.771043886,
-4.377376587, -2.491251246, 3.172560156, -2.082216546)), .Names = c("Date",
"T1V", "T2V", "T3V", "T1MV", "T2MV", "T3MV", "T1BTM", "T2BTM",
"T3BTM", "T1MOM", "T2MOM", "T3MOM", "Rm", "SMB", "HML", "MOM",
"T1R", "T2R", "T3R"), class = "data.frame", row.names = c(NA,
-205L))
I would be very happy if anyone could help me.
You need to specify your date format (see ?as.Date):
dates <- c("26.02.10", "26.02.99", "27.02.04", "27.02.09", "27.02.98", "28.02.01", "28.02.02", "28.02.03")
as.Date(dates, "%d.%m.%y")
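Applied to the data frame from the question, this could look roughly like the sketch below. It uses only the first three rows of Date and T1V from the dput output; the Date column is a factor, so it is converted to character before parsing. The xts call is left as a comment because it needs the xts package.

```r
# First three rows of the question's data (Date is a factor there)
FactorR <- data.frame(
  Date = factor(c("31.12.97", "30.01.98", "27.02.98")),
  T1V  = c(2.647778077, 2.210168532, 5.184543047)
)

# Parse day.month.two-digit-year; %y maps 69-99 to 19xx and 00-68 to 20xx
FactorR$Date <- as.Date(as.character(FactorR$Date), format = "%d.%m.%y")

# With xts loaded, the call from the question would then become:
# FactorR_xts <- xts(x = FactorR[, -1], order.by = FactorR$Date)
```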
I am very very new to R and stats in general, and am having trouble adding multiple confidence ellipses to a PCA plot.
My interest is in highlighting potential groupings/clusters in the PCA plot with 95% confidence ellipses. I have tried using the dataEllipse function in R, however I cannot figure out how to add multiple ellipses with different centers to the PCA plot (the centers would be at various points that appear to contain a cluster, in this case lithic sources and lithic tools likely made from that source).
Thanks for any help with this!
lithic_final <- LITHIC.DATASHEET.FOR.R.COMPLETE.FORMAT
lithic_final
pca1 <- princomp(lithic_final); pca1
lithic_source <- c("A1", "A1", "A1", "A1", "A2","A2", "A2", "A3","A3","A3","B","B","B","B","B","B","C","C","C","C","C","C","C","D","D","D","D","D","D","D","D","E","E","E","E","E","E","E","E","F","F","G","G","G","G","H","H","H","H","H","H","H","I1","I1","I1","I2","I2","I2","I2","I2","J1","J1","J2","J2","J2","J2","J2","J2","J2","J2","J2","K","K","K","K","K","K","K","L","L","L","L","L","L","L","L","L","L","L","L","L","L","BB1","BB1","BB1","FC","FC","FC","JRPP","JRPP","JRPP","BB2","BB2","BB2","BB2","MWP","MWP","MWP","MWP","RPO","RPO","RPO")
lithic_source
summary(pca1)
plot(pca1)
#Plotting the scores with the Lithic Source Info
round(pca1$scores[,1:2], 2)
pca_scores <-round(pca1$scores[,1:2], 2)
plot(pca1$scores[,1], pca1$scores[,2], type="n")
text(pca1$scores[,1], pca1$scores[,2],labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 2 and 3 with Lithic Source Info
round(pca1$scores[,2:3], 2)
pca2_3_scores <-round(pca1$scores[,2:3], 2)
plot(pca1$scores[,2], pca1$scores[,3], type="n")
text(pca1$scores[,2], pca1$scores[,3], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 3 and 4 with Lithic Source Info
round(pca1$scores[,3:4], 2)
pca3_4_scores <-round(pca1$scores[,3:4], 2)
plot(pca1$scores[,3], pca1$scores[,4], type="n")
text(pca1$scores[,3], pca1$scores[,4], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 1 and 3 with Lithic Source Info
round(pca1$scores[,1:3], 2)
pca1_3_scores <-round(pca1$scores[,1:3], 2)
plot(pca1$scores[,1], pca1$scores[,3], type="n")
text(pca1$scores[,1], pca1$scores[,3], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#Plotting PCA Scores of EACH SAMPLE for PCA 1 and 4 with Lithic Source Info
round(pca1$scores[,1:4], 2)
pca1_4_scores <-round(pca1$scores[,1:4], 2)
plot(pca1$scores[,1], pca1$scores[,4], type="n")
text(pca1$scores[,1], pca1$scores[,4], labels=abbreviate(lithic_source, minlength=3), cex=.45)
#TRYING TO GET ELLIPSES ADDED TO PCA 1 and 4 scores
dataEllipse(pca1$scores[,1], pca1$scores[,4],centers=12,add=TRUE,levels=0.9, plot.points=FALSE)
structure(list(Ca.K12 = c(418L, 392L, 341L, 251L, 297L, 238L,
258L, 5L, 2L, 37L), Cr.K12 = c(1L, 12L, 15L, 6L, 9L, 6L, 35L,
7L, 45L, 32L), Cu.K12 = c(89L, 96L, 81L, 63L, 88L, 103L, 104L,
118L, 121L, 90L), Fe.K12 = c(18627L, 18849L, 18413L, 12893L,
17757L, 17270L, 16198L, 2750L, 4026L, 3373L), K.K12 = c(20L,
23L, 28L, 0L, 34L, 17L, 45L, 102L, 150L, 147L), Mn.K12 = c(205L,
212L, 235L, 120L, 216L, 212L, 246L, 121L, 155L, 115L), Nb.K12 = c(139L,
119L, 154L, 91L, 122L, 137L, 137L, 428L, 414L, 428L), Rb.K12 = c(99L,
42L, 79L, 49L, 210L, 243L, 168L, 689L, 767L, 705L), Sr.K12 = c(3509L,
3766L, 3481L, 2715L, 2851L, 2668L, 2695L, 202L, 220L, 217L),
Ti.K12 = c(444L, 520L, 431L, 293L, 542L, 622L, 531L, 82L,
129L, 84L), Y.K12 = c(135L, 121L, 105L, 74L, 144L, 79L, 85L,
301L, 326L, 379L), Zn.K12 = c(131L, 133L, 108L, 78L, 124L,
111L, 114L, 81L, 78L, 59L), Zr.K12 = c(1348L, 1479L, 1333L,
964L, 1506L, 1257L, 1296L, 3967L, 4697L, 4427L)), .Names = c("Ca.K12",
"Cr.K12", "Cu.K12", "Fe.K12", "K.K12", "Mn.K12", "Nb.K12", "Rb.K12",
"Sr.K12", "Ti.K12", "Y.K12", "Zn.K12", "Zr.K12"), row.names = c(NA,
10L), class = "data.frame")
I think you would have received a speedier reply if you had focused on your question instead of all the extraneous stuff. You gave us your commands for plotting a bunch of principal components that had nothing to do with your question. The question is, how do you plot ellipses by group? Your sample data at 10 lines and three groups is not helpful because 3 points is not enough to plot data ellipses. You are using the dataEllipse function in package car which has the simplest answer to your question:
First, a reproducible example:
set.seed(42) # so you can get the same numbers I get
source_a <- data.frame(X1=rnorm(25, 50, 5), X2=rnorm(25, 40, 5))
source_b <- data.frame(X1=rnorm(25, 20, 5), X2=rnorm(25, 40, 5))
source_c <- data.frame(X1=rnorm(25, 35, 5), X2=rnorm(25, 25, 5))
lithic_dat <- rbind(source_a, source_b, source_c)
lithic_source <- c(rep("a", 25), rep("b", 25), rep("c", 25))
Plot ellipses with scatterplot() and add text:
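library(car) # provides scatterplot() and dataEllipse()
# Note: in car 3.0 and later the arguments are, as far as I know,
# regLine=FALSE and ellipse=list(levels=.9) instead of the ones below
scatterplot(X2~X1 | lithic_source, data=lithic_dat, pch="", smooth=FALSE,
reg.line=FALSE, ellipse=TRUE, levels=.9)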
text(lithic_dat$X1, lithic_dat$X2, lithic_source, cex=.75)
Scatterplot can be tweaked to do everything you want, but it is also
possible to plot the ellipses without using it:
sources <- unique(lithic_source) # vector of the different sources
plot(lithic_dat$X1, lithic_dat$X2, type="n")
text(lithic_dat$X1, lithic_dat$X2, lithic_source, cex=.75)
for (i in sources) with(lithic_dat, dataEllipse(X1[lithic_source==i],
X2[lithic_source==i], levels=.9, plot.points=FALSE))
This will work for your principal components and any other data.
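If installing car is not an option, the same kind of ellipse can be sketched in base R on PCA scores. This is an illustration under assumptions, not what dataEllipse() does internally: the helper ellipse_points() is made up, it takes the radius from a chi-square quantile, and it uses the iris data in place of the lithic measurements, so the ellipses may differ slightly from the ones car draws.

```r
# Hypothetical helper: points on the `level` ellipse of a bivariate sample,
# using the sample mean, sample covariance and a chi-square radius
ellipse_points <- function(x, y, level = 0.95, n = 100) {
  centre <- c(mean(x), mean(y))
  S <- cov(cbind(x, y))
  r <- sqrt(qchisq(level, df = 2))
  theta <- seq(0, 2 * pi, length.out = n)
  circle <- cbind(cos(theta), sin(theta))
  # chol(S) is upper triangular R with t(R) %*% R == S,
  # so circle %*% R maps the unit circle onto the covariance ellipse
  pts <- r * circle %*% chol(S)
  sweep(pts, 2, centre, "+")
}

pca <- prcomp(iris[, 1:4])
scores <- pca$x[, 1:2]   # scores on the first two components
groups <- iris$Species   # stands in for lithic_source

# Draw only in an interactive session
if (interactive()) {
  plot(scores, type = "n")
  text(scores, labels = abbreviate(as.character(groups), minlength = 3), cex = 0.45)
  for (g in levels(groups)) {
    sel <- groups == g
    lines(ellipse_points(scores[sel, 1], scores[sel, 2], level = 0.95))
  }
}
```

Each group needs a reasonable number of points (and a non-degenerate covariance matrix) for chol() to succeed and for the ellipse to mean anything.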
Here is a simple solution using a package called ggbiplot (available on github) with Iris data. I hope this is what you were looking for.
library(devtools);install_github('vqv/ggbiplot')
library(ggbiplot)
pca = prcomp(iris[,1:4])
ggbiplot(pca,groups = iris$Species,ellipse = T,ellipse.prob = .95)