melting multiple spans of variables - r

(Still) new to R, and very confused as to how I should accomplish multiple melts of my data. Here is a subset:
df <- structure(list(Subject = c(101L, 101L, 101L, 102L, 102L, 102L
), Condition = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("apass",
"vpas"), class = "factor"), FreqCode = structure(c(1L, 1L, 1L,
2L, 2L, 2L), .Label = c("LessVerbal", "MoreVerbal"), class = "factor"),
Item = c(1L, 4L, 7L, 1L, 4L, 7L), Len = c(80L, 68L, 85L,
68L, 85L, 79L), R1_1.RT = c(237L, 203L, 207L, 336L, 487L,
340L), R1_2.RT = c(177L, 225L, 162L, 634L, 590L, 347L), R1_3.RT = c(200L,
226L, 212L, 707L, 653L, 379L), R1.RT = c(614L, 654L, 581L,
1677L, 1730L, 1066L), R1_1 = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "The", class = "factor"), R1_2 = structure(c(3L,
1L, 2L, 1L, 2L, 4L), .Label = c("antique", "course", "new",
"road"), class = "factor"), R1_3 = structure(c(4L, 1L, 2L,
1L, 2L, 3L), .Label = c("car", "materials", "surfaces", "technology"
), class = "factor"), R1 = structure(c(3L, 1L, 2L, 1L, 2L,
4L), .Label = c("The antique car", "The course materials",
"The new technology", "The road surfaces"), class = "factor")), .Names = c("Subject",
"Condition", "FreqCode", "Item", "Len", "R1_1.RT", "R1_2.RT",
"R1_3.RT", "R1.RT", "R1_1", "R1_2", "R1_3", "R1"), class = "data.frame", row.names =
c(NA,
-6L))
My goal is to get output that (in part) looks like this:
Region RT WordRegion Word
R1_1.RT 237 R1_1 the
...
R1_2.RT 177 R1_2 new
...
EDIT: The variable ending with ".RT" (e.g., R1_1.RT) are Region names and will be melted into a Region column. The variables ending in numbers (e.g., R1_1) correspond exactly to the Region names and their associated values. I want them to be melted alongside the Region names so that I can analyze them in relation to the Region column
In the first part of the code, I melt all of the values into a Region column and change the value to RT. This seems to work fine:
#long transform (with individual regions at end)
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
#change newly created column name to "RT" (note:you have to change the number in [] to match your data)
colnames(SmallMelt1)[11 ] <- "RT"
But I don't get how to simultaneously melt another span of variables such that they will line up vertically with the first span. I want to do something like this, after the first melt, but it does not work:
#Second Melt for region names (doesn't work)
SmallMelt2 = melt(SmallMelt1, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
#Change name to Word
colnames(SmallMelt2)[9] <- "Word" #add col number for "value" here
Please let me know if you need any clarification. I hope someone can help... thanks in advance - DT

So, after consulting with someone off-list, I found the solution. My mistake was that I was trying to run the second step on the output of the first step. By running the two steps independently on the original data and then concatenating, I get the right result.
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
SmallMelt2 = melt(df, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
#column 11 of SmallMelt2 is its "value" column; name it Word when binding
SmallMelt3 = cbind(SmallMelt1, Word = SmallMelt2[, 11])
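The same pairing can also be built in one pass, without two separate melts. The following is only a base-R sketch: it assumes each .RT column pairs with the word column of the same name minus the ".RT" suffix, and the toy data frame (a two-row, two-region stand-in for df) is made up for illustration.

```r
# Toy data (hypothetical cut-down stand-in for df)
toy <- data.frame(
  Subject = c(101L, 102L),
  R1_1.RT = c(237L, 336L),
  R1_2.RT = c(177L, 634L),
  R1_1    = c("The", "The"),
  R1_2    = c("new", "antique"),
  stringsAsFactors = FALSE
)

rt_cols   <- c("R1_1.RT", "R1_2.RT")
word_cols <- sub("\\.RT$", "", rt_cols)   # "R1_1", "R1_2"
n <- nrow(toy)

# unlist() stacks the columns one after another, so repeating the ids
# and the column names in the same order keeps every row aligned
long <- data.frame(
  Subject    = rep(toy$Subject, times = length(rt_cols)),
  Region     = rep(rt_cols, each = n),
  RT         = unlist(toy[rt_cols], use.names = FALSE),
  WordRegion = rep(word_cols, each = n),
  Word       = unlist(toy[word_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)
long
```

If the word columns are factors, as in the original df, convert them with as.character() before stacking; otherwise unlist() has to reconcile different factor levels across columns.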


Conditional updating coordinate column in dataframe

I am attempting to populate two newly empty columns in a data frame with data from other columns in the same data frame in different ways depending on if they are populated.
I am trying to populate HIGH_PRCN_LAT and HIGH_PRCN_LON (previously called F_Lat and F_Lon), which represent the final latitudes and longitudes for those rows. The values will be based on the other columns in the table.
Case 1: Lat/Lon2 are populated (like in IDs 1 & 2), using the great
circle algorithm a midpoint between them should be calculated and
then placed into F_Lat & F_Lon.
Case 2: Lat/Lon2 are empty, then the values of Lat/Lon1 should be put
into F_Lat and F_Lon (like with IDs 3 & 4).
My code is as follows but doesn't work (see previous versions, removed in an edit).
The preparatory code I am using is as follows:
incidents <- structure(list(id = 1:9, StartDate = structure(c(1L, 3L, 2L,
2L, 2L, 3L, 1L, 3L, 1L), .Label = c("02/02/2000 00:34", "02/09/2000 22:13",
"20/01/2000 14:11"), class = "factor"), EndDate = structure(1:9, .Label = c("02/04/2006 20:46",
"02/04/2006 22:38", "02/04/2006 23:21", "02/04/2006 23:59", "03/04/2006 20:12",
"03/04/2006 23:56", "04/04/2006 00:31", "07/04/2006 06:19", "07/04/2006 07:45"
), class = "factor"), Yr.Period = structure(c(1L, 1L, 2L, 2L,
2L, 3L, 3L, 3L, 3L), .Label = c("2000 / 1", "2000 / 2", "2000 /3"
), class = "factor"), Description = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L), .Label = "ENGLISH TEXT", class = "factor"),
Location = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L
), .Label = c("Location 1", "Location 1 : Location 2"), class = "factor"),
Location.1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = "Location 1", class = "factor"), Postcode.1 = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Postcode 1", class = "factor"),
Location.2 = structure(c(2L, 2L, 1L, 2L, 2L, 2L, 2L, 1L,
1L), .Label = c("", "Location 2"), class = "factor"), Postcode.2 = structure(c(2L,
2L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("", "Postcode 2"
), class = "factor"), Section = structure(c(2L, 2L, 3L, 1L,
4L, 4L, 2L, 1L, 4L), .Label = c("East", "North", "South",
"West"), class = "factor"), Weather.Category = structure(c(1L,
2L, 4L, 2L, 2L, 2L, 4L, 1L, 3L), .Label = c("Animals", "Food",
"Humans", "Weather"), class = "factor"), Minutes = c(13L,
55L, 5L, 5L, 5L, 522L, 1L, 11L, 22L), Cost = c(150L, 150L,
150L, 20L, 23L, 32L, 21L, 11L, 23L), Location.1.Lat = c(53.0506727,
53.8721035, 51.0233529, 53.8721035, 53.6988355, 53.4768766,
52.6874562, 51.6638245, 51.4301359), Location.1.Lon = c(-2.9991256,
-2.4004125, -3.0988341, -2.4004125, -1.3031529, -2.2298073,
-1.8023421, -0.3964916, 0.0213837), Location.2.Lat = c(52.7116187,
53.746791, NA, 53.746791, 53.6787167, 53.4527824, 52.5264907,
NA, NA), Location.2.Lon = c(-2.7493169, -2.4777984, NA, -2.4777984,
-1.489026, -2.1247029, -1.4645023, NA, NA)), class = "data.frame", row.names = c(NA, -9L))
#gpsColumns is used as the following line of code is used for several data frames.
gpsColumns <- c("HIGH_PRCN_LAT", "HIGH_PRCN_LON")
incidents [ , gpsColumns] <- NA
#create separate variable(?) containing a list of which rows are complete
ind <- complete.cases(incidents [,17])
#populate rows with a two Lat/Lons with great circle middle of both values
incidents [ind, c("HIGH_PRCN_LON_2","HIGH_PRCN_LAT_2")] <-
with(incidents [ind,,drop=FALSE],
do.call(rbind, geosphere::midPoint(cbind.data.frame(Location.1.Lon, Location.1.Lat), cbind.data.frame(Location.2.Lon, Location.2.Lat))))
#populate rows with one Lat/Lon with those values
incidents[!ind, c("HIGH_PRCN_LAT","HIGH_PRCN_LON")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]
I will use the geosphere::midPoint function based off a recommendation here: http://r.789695.n4.nabble.com/Midpoint-between-coordinates-td2299999.html.
Unfortunately, it doesn't appear that this way of populating the column will work when there are several cases.
The current error that is thrown is:
Error in `$<-.data.frame`(`*tmp*`, F_Lat, value = integer(0)) :
replacement has 0 rows, data has 178012
Edit: also posted to reddit: https://www.reddit.com/r/Rlanguage/comments/bdvavx/conditional_updating_column_in_dataframe/
Edit: Added clarity on the parts of the code I do not understand.
#replaces the F_Lat2/F_Lon2 columns in rows with a both sets of input coordinates
dataframe[ind, c("F_Lat2","F_Lon2")] <-
#I am unclear on what this means, specifically what the "with" function does and what "drop=FALSE" does and also why they were used in this case.
with(dataframe[ind,,drop=FALSE],
#I am unclear on what do.call and rbind are doing here, but the second half (geosphere onwards) is binding the Lats and Lons to make coordinates as inputs for the gcIntermediate function.
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
Though your code doesn't work as written for me, and I cannot calculate precisely the values you expect, I suspect the error you're seeing can be fixed with these steps. (Data is at the bottom.)
Pre-populate the empty columns.
Pre-calculate the complete.cases step, it'll save time.
Use cbind.data.frame for inside gcIntermediate.
I'm inferring from
gcIntermediate([dataframe...
^
this is an error in R
that you are binding those columns together, so I'll use cbind.data.frame. (Using cbind itself produced some ignorable warnings from geosphere, so you can use it instead and perhaps suppressWarnings, but that function is a little strong in that it'll mask other warnings as well.)
Also, since it appears you want one intermediate value for each pair of coordinates, I added the gcIntermediate(..., n=1) argument.
The use of do.call(rbind, ...) is because gcIntermediate returns a list, so we need to bring them together.
dataframe$F_Lon2 <- dataframe$F_Lat2 <- NA_real_
ind <- complete.cases(dataframe[,4])
dataframe[ind, c("F_Lat2","F_Lon2")] <-
with(dataframe[ind,,drop=FALSE],
do.call(rbind, geosphere::gcIntermediate(cbind.data.frame(Lat1, Lon1),
cbind.data.frame(Lat2, Lon2), n = 1)))
dataframe[!ind, c("F_Lat2","F_Lon2")] <- dataframe[!ind, c("Lat1","Lon1")]
dataframe
# ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon F_Lat2 F_Lon2
# 1 1 19.05067 -3.999126 92.71332 -6.759169 55.88200 -5.379147 55.78466 -6.709509
# 2 2 58.87210 -1.400413 54.74679 -4.479840 56.80945 -2.940126 56.81230 -2.942029
# 3 3 33.02335 -5.098834 NA NA 33.02335 -5.098834 33.02335 -5.098834
# 4 4 54.87210 -4.400412 NA NA 54.87210 -4.400412 54.87210 -4.400412
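Since the question asks what with(), drop=FALSE and do.call(rbind, ...) are doing, here is a small base-R illustration of each piece on its own. The data frame d and the list pieces are hypothetical toy objects, not the question's data.

```r
# Hypothetical toy data
d <- data.frame(a = 1:3, b = c(10, 20, 30))
keep <- c(FALSE, TRUE, FALSE)

# with() evaluates an expression using the data frame's columns as variables
sums <- with(d, a + b)                      # same as d$a + d$b

# drop = FALSE keeps a data frame even when the subset is a single column
one_col_vec <- d[keep, "a"]                 # plain vector, dimensions dropped
one_col_df  <- d[keep, "a", drop = FALSE]   # still a data frame

# do.call(rbind, lst) calls rbind(lst[[1]], lst[[2]], ...),
# stacking the list elements into one matrix
pieces  <- list(c(lon = 1, lat = 2), c(lon = 3, lat = 4))
stacked <- do.call(rbind, pieces)           # 2 x 2 matrix
```

In the answer's code, drop=FALSE matters when ind selects only one row, and do.call(rbind, ...) is what turns the list of per-pair results into a single matrix that can be assigned to two columns.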
Update, using your new incidents data and switching to geosphere::midPoint.
Try this:
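incidents$F_Lon2 <- incidents$F_Lat2 <- NA_real_
#rows that have a second coordinate pair
ind <- complete.cases(incidents[, c("Location.2.Lat","Location.2.Lon")])
#geosphere expects longitude first, then latitude; midPoint returns columns lon, lat
incidents[ind, c("F_Lon2","F_Lat2")] <-
with(incidents[ind,,drop=FALSE],
geosphere::midPoint(cbind.data.frame(Location.1.Lon,Location.1.Lat),
cbind.data.frame(Location.2.Lon,Location.2.Lat)))
#rows with only one coordinate pair keep that pair
incidents[!ind, c("F_Lat2","F_Lon2")] <- incidents[!ind, c("Location.1.Lat","Location.1.Lon")]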
One (big) difference is that geosphere::gcIntermediate(..., n=1) returns a list of results, whereas geosphere::midPoint(...) (no n=) returns just a matrix, so no rbinding required.
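To make the "fill from a midpoint where possible, otherwise copy" pattern concrete without any package, here is a base-R sketch. The function gc_midpoint is a hand-rolled stand-in using the standard spherical midpoint formula, not geosphere's implementation, and the data frame pts with its column names is hypothetical. Like geosphere, it takes longitude before latitude.

```r
# Great-circle midpoint on a sphere; inputs and outputs in degrees.
# Hand-rolled for illustration; longitudes are not re-wrapped to [-180, 180].
gc_midpoint <- function(lon1, lat1, lon2, lat2) {
  rad <- pi / 180
  p1 <- lat1 * rad
  p2 <- lat2 * rad
  dl <- (lon2 - lon1) * rad
  bx <- cos(p2) * cos(dl)
  by <- cos(p2) * sin(dl)
  lat <- atan2(sin(p1) + sin(p2), sqrt((cos(p1) + bx)^2 + by^2))
  lon <- lon1 * rad + atan2(by, cos(p1) + bx)
  cbind(lon = lon / rad, lat = lat / rad)
}

# Hypothetical toy rows: row 2 has no second location
pts <- data.frame(lon1 = c(0, -3.1), lat1 = c(0, 51.0),
                  lon2 = c(90, NA),  lat2 = c(0, NA))
has2 <- complete.cases(pts[, c("lon2", "lat2")])

# Default: copy the first location, then overwrite rows that have both
pts$mid_lon <- pts$lon1
pts$mid_lat <- pts$lat1
pts[has2, c("mid_lon", "mid_lat")] <-
  with(pts[has2, , drop = FALSE], gc_midpoint(lon1, lat1, lon2, lat2))
pts
```

Filling the default first and overwriting afterwards means every row ends up populated even if no row has a second location.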
Data:
dataframe <- read.table(header=T, stringsAsFactors=F, text="
ID Lat1 Lon1 Lat2 Lon2 F_Lat F_Lon
1 19.0506727 -3.9991256 92.713318 -6.759169 55.88199535 -5.3791473
2 58.8721035 -1.4004125 54.746791 -4.47984 56.80944725 -2.94012625
3 33.0233529 -5.0988341 NA NA 33.0233529 -5.0988341
4 54.8721035 -4.4004125 NA NA 54.8721035 -4.4004125")

Using a lookup table in R with continuous values

I have a lookup table in R that I am trying to figure out how to implement. The challenge for me is that it involves continuous values or ranges of data. If a value falls in between two bounds, I'd like it to pick the right value.
I want to use the two continuous 'GRADE', 'SAT' variables plus the categorical 'TYPE' value to assign a 'GROUP' value. This big block of code looks intimidating but these are tiny tiny tables.
Any advice is appreciated!!!!
#lookup table code for recreating dataframe
structure(list(Type = structure(c(1L, 2L, 1L, 1L), .Label = c("A",
"B"), class = "factor"), min_grade = c(93L, 85L, 93L, 80L), max_grade = c(100L,
93L, 100L, 92L), min_sat = c(600L, 700L, 400L, 600L), max_sat = c(800L,
800L, 599L, 800L), Group = structure(c(1L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("Type", "min_grade",
"max_grade", "min_sat", "max_sat", "Group"), class = "data.frame", row.names = c(NA,
-4L))
#example ----- desired value is in the 'GROUP' column so this would be NULL before I used the lookup table
structure(list(Name = structure(c(3L, 1L, 2L, 4L), .Label = c("Jack",
"James", "John", "Jordan"), class = "factor"), Grade = c(95L,
95L, 92L, 93L), Sat = c(701L, 500L, 800L, 800L), Type = structure(c(1L,
1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), Group = structure(c(1L,
2L, 3L, 1L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("Name",
"Grade", "Sat", "Type", "Group"), class = "data.frame", row.names = c(NA,
-4L))
How about this?
ltab <- structure(list(Type = structure(c(1L, 2L, 1L, 1L), .Label = c("A",
"B"), class = "factor"), min_grade = c(93L, 85L, 93L, 80L), max_grade = c(100L,
93L, 100L, 92L), min_sat = c(600L, 700L, 400L, 600L), max_sat = c(800L,
800L, 599L, 800L), Group = structure(c(1L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("Type", "min_grade",
"max_grade", "min_sat", "max_sat", "Group"), class = "data.frame", row.names = c(NA,
-4L))
dat <- structure(list(Name = structure(c(3L, 1L, 2L, 4L), .Label = c("Jack",
"James", "John", "Jordan"), class = "factor"), Grade = c(95L,
95L, 92L, 93L), Sat = c(701L, 500L, 800L, 800L), Type = structure(c(1L,
1L, 1L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("Name",
"Grade", "Sat", "Type"), class = "data.frame", row.names = c(NA,
-4L))
library(plyr)
mdat <- adply(merge(dat, ltab, by="Type", all=T), 1, function(x) {
c(FallsIn=x$Grade > x$min_grade & x$Grade <= x$max_grade & x$Sat > x$min_sat & x$Sat <= x$max_sat)
})
mdat[mdat$FallsIn,]
Thinking about generalizing: are there going to be more continuous variables that you need to check?
EDIT: I could not edit the OP's post, so, taking the OP's comment into account, here is how I would tackle an example of "categorizing multidimensional continuous random variables"
(phrased that way so that these keywords will flag up in future searches):
breaks <- list(Var1=c(0, 0.25, 1),
Var2=c(0, 0.5, 1),
Var3=c(0, 0.25, 0.75, 1))
#generate this on the fly
genIntv <- function(x) {
ret <- paste0("(", x[1:(length(x)-1)],", ",x[2:length(x)], "]")
names(ret) <- 1:(length(x)-1)
ret
}
lookupTbl <- data.frame(expand.grid(lapply(breaks, genIntv), stringsAsFactors=F),
Group=LETTERS[1:12])
lookupTbl2 <- data.frame(expand.grid(lapply(breaks, function(x) 1:(length(x)-1)), stringsAsFactors=F),
Group=LETTERS[1:12])
#data set
dat <- data.frame(Var1=c(0.1, 0.76), Var2=c(0.5, 0.75), Var3=c(0.25,0.9))
binDat <- do.call(cbind, setNames(lapply(1:ncol(dat), function(k)
.bincode(dat[,k], breaks[[k]], T, T)),colnames(dat)))
merge(binDat, lookupTbl2, all.x=T, all.y=F)
It would be good to learn if someone else has better approaches.
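For comparison, the binning step can also be sketched with cut(), the documented wrapper around .bincode. This reuses the breaks and data values from the code above; pairing the columns with Map() is my own arrangement, not the answer's.

```r
breaks <- list(Var1 = c(0, 0.25, 1),
               Var2 = c(0, 0.5, 1),
               Var3 = c(0, 0.25, 0.75, 1))
dat <- data.frame(Var1 = c(0.1, 0.76), Var2 = c(0.5, 0.75), Var3 = c(0.25, 0.9))

# For each column, return the index of the (lower, upper] interval the value
# falls in; include.lowest = TRUE also accepts the very first break
bins <- as.data.frame(Map(function(x, b) {
  cut(x, breaks = b, labels = FALSE, right = TRUE, include.lowest = TRUE)
}, dat, breaks))
bins
```

The resulting integer codes can then be merged against a lookup table keyed by the same codes, as lookupTbl2 is above.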
If you have small data, a full join should be fine.
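library(dplyr)
result =
example %>%
select(-Group) %>% #drop the placeholder Group; the lookup table supplies it
full_join(look_up, by = "Type") %>% #pair each row with every range of its Type
filter(min_grade < Grade & Grade <= max_grade &
min_sat < Sat & Sat <= max_sat)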
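The join-then-filter idea also works in base R alone. The sketch below retypes the question's two tables as plain data frames (the names students and look_up are my own) and, as an assumption on my part, treats both ends of every range as inclusive, which differs from the strict lower bound used in the answers above.

```r
look_up <- data.frame(
  Type      = c("A", "B", "A", "A"),
  min_grade = c(93, 85, 93, 80),
  max_grade = c(100, 93, 100, 92),
  min_sat   = c(600, 700, 400, 600),
  max_sat   = c(800, 800, 599, 800),
  Group     = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)
students <- data.frame(
  Name  = c("John", "Jack", "James", "Jordan"),
  Grade = c(95, 95, 92, 93),
  Sat   = c(701, 500, 800, 800),
  Type  = c("A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Pair every student with every range of the same Type ...
cand <- merge(students, look_up, by = "Type")
# ... then keep only the range that contains both values (inclusive bounds)
hit <- subset(cand, Grade >= min_grade & Grade <= max_grade &
                    Sat >= min_sat & Sat <= max_sat)
result <- hit[, c("Name", "Grade", "Sat", "Type", "Group")]
result
```

With inclusive bounds a value sitting exactly on a shared boundary could match two ranges, so ranges in the lookup table should not overlap at their ends.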

Outlier detection for multi column data frame in R

I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with the column 18. The column 18 is a factor and contains data which can be used as indicator of outlier.
My data frame is ufo and I remove column 18 as follows:
ufo2 <- ufo[,1:17]
and then convert 3 non-numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of the outlier.scores are NA!!!
Do I have any mistake in this code?
Is there another way to find outlier for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example, ProduceCode: are you sure this should be part of your similarity measure?
Also note that I found the lofactor implementation of the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First of all, it comes with a much wider choice of algorithms, secondly it is much faster than R, and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
ELKI has a tutorial on implementing a custom distance function.
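As a minimal illustration of the scale problem described above, the base-R sketch below standardizes two of the columns before computing Euclidean distances. The values are rounded from the question's dput(head(ufo2)) output, and choosing only Weight and Score is my own simplification; which columns belong in the distance at all is a separate modelling decision.

```r
x <- data.frame(
  Weight = c(900, 850, 483, 110000, 5900, 1000),
  Score  = c(0.564, 0.338, 0.468, 1.414, 0.691, 0.878)
)

# Raw Euclidean distances are dominated almost entirely by Weight
d_raw <- dist(x)

# scale() centers each column to mean 0 and rescales it to sd 1,
# so both columns contribute on a comparable footing
z <- scale(x)
d_std <- dist(z)

round(as.matrix(d_raw)[1:3, 1:3], 2)
round(as.matrix(d_std)[1:3, 1:3], 2)
```

Standardizing is only one possible preprocessing choice; identifier-like columns such as codes or document numbers usually need to be dropped or handled differently rather than scaled.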

GGPlot geom_text coloring with facets

Hopefully someone here will be able to help me with a problem that I'm having with a ggplot script I'm trying to get right. The script will be used many times with different data, so it needs to be relatively flexible. I've got it almost where I want it, but I've come across a problem I haven't been able to solve.
The script is for a line graph with labels for each line in the right hand margin. Sometimes the graph is faceted, other times it is not.
The piece I'm having trouble with is that I would like to color code the labels in the right margin as black if there was no significant change over time, green if there was positive change, and red if there was negative change. I've got a script that works to carry this out when I only have a single facet, but as soon as I have multiple facets in the graph, the color coding of the labels gives the following error
Error: Incompatible lengths for set aesthetics:
Below is the script with data with multiple facets. The problem seems to be in the way that I'm specifying color in the geom_text line. If I delete the color call in the geom_text line in the script, then I get the attributes printed in the correct place, just not colored. I'm really at a loss on this one. This is my first post here, so let me know if I've done anything wrong with my post.
WITH MULTIPLE FACETS (DOES NOT WORK)
require(ggplot2)
require(grid)
require(zoo)
require(reshape)
require(reshape2)
require(directlabels)
time.data<-structure(list(Attribute = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 5L, 5L, 6L, 6L), .Label = c("Taste 1", "Taste 2", "Taste 3",
"Use 1", "Use 2", "Use 3"), class = "factor"), Attribute.Category = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("Nutritional/Usage",
"Taste/Quality"), class = "factor"), Attribute.Order = c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), Category.Order = c(1L,
1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), Color = structure(c(1L,
1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("#084594",
"#2171B5", "#4292C6", "#6A51A3", "#807DBA", "#9E9AC8"), class = "factor"),
value = c(75L, 78L, 90L, 95L, 82L, 80L, 43L, 40L, 25L, 31L,
84L, 84L), Date2 = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 2L, 1L), .Label = c("1/1/2013", "9/1/2012"), class = "factor")), .Names = c("Attribute",
"Attribute.Category", "Attribute.Order", "Category.Order", "Color",
"value", "Date2"), class = "data.frame", row.names = c(NA, -12L
))
label.data<-structure(list(7:12, Attribute = structure(1:6, .Label = c("Taste 1",
"Taste 2", "Taste 3", "Use 1", "Use 2", "Use 3"), class = "factor"),
Attribute.Category = structure(c(2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Nutritional/Usage",
"Taste/Quality"), class = "factor"), Attribute.Order = 1:6,
Category.Order = c(1L, 1L, 1L, 2L, 2L, 2L), Color = structure(1:6, .Label = c("#084594",
"#2171B5", "#4292C6", "#6A51A3", "#807DBA", "#9E9AC8"), class = "factor"),
Significance = structure(c(2L, 3L, 1L, 1L, 3L, 2L), .Label = c("neg",
"neu", "pos"), class = "factor"), variable = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "1/1/2013", class = "factor"),
value = c(78L, 95L, 80L, 40L, 31L, 84L), Date2 = structure(c(1L,
1L, 1L, 1L, 1L, 1L), .Label = "2013-01-01", class = "factor"),
label.color = structure(c(1L, 2L, 3L, 3L, 2L, 1L), .Label = c("black",
"forestgreen", "red"), class = "factor")), .Names = c("",
"Attribute", "Attribute.Category", "Attribute.Order", "Category.Order",
"Color", "Significance", "variable", "value", "Date2", "label.color"
), class = "data.frame", row.names = c(NA, -6L))
color.palette<-as.character(unique(time.data$Color))
time.data$Date2<-as.Date(time.data$Date2,format="%m/%d/%Y")
plot<-ggplot()+
geom_line(data=time.data,aes(as.numeric(time.data$Date2),time.data$value,group=time.data$Attribute,color=time.data$Color),size=1)+
geom_text(data=label.data,aes(x=Inf, y=label.data$value, label=paste(" ",label.data$Attribute)),
color=label.data$label.color,
size=4,vjust=0, hjust=0,na.rm=T)+
facet_grid(Attribute.Category~.,space="free")+
theme_bw()+
scale_x_continuous(breaks=as.numeric(unique(time.data$Date2)),labels=format(unique(time.data$Date2),format = "%b %Y"))+
theme(strip.background=element_blank(),
strip.text.y=element_blank(),
legend.text=element_blank(),
legend.title=element_blank(),
plot.margin=unit(c(1,5,1,1),"cm"),
legend.position="none")+
scale_colour_manual(values=color.palette)
gt3 <- ggplot_gtable(ggplot_build(plot))
gt3$layout$clip[gt3$layout$name == "panel"] <- "off"
grid.draw(gt3)
Some problems:
Inside your aesthetic declarations, you should not be referencing the data columns as time.data$Date2, but just as Date2. The data argument specifies where to look for that information (which needs to all be in the same data.frame for a given layer, but, as you take advantage of, can vary layer to layer).
In the geom_text call, color was not inside the aes call; if you are mapping it to data which is in the data.frame, you have to have it inside the aes call. This would throw a different error after fixing the first part because then it would not be able to find label.color anywhere because it would not know to look inside label.data.
Fixing those, then the scale_colour_manual complains that there are 9 colors and you have only supplied 6. That is because there are 6 colors from the lines and 3 from the text. Since you specified these as actual color names, you can just use scale_colour_identity.
Putting this all together:
plot <- ggplot()+
geom_line(data=time.data, aes(as.numeric(Date2), value,
group=Attribute, color=Color),
size=1)+
geom_text(data=label.data, aes(x=Inf, y=value,
label=paste(" ",Attribute),
color=label.color),
size=4,vjust=0, hjust=0)+
facet_grid(Attribute.Category~.,space="free") +
scale_x_continuous(breaks=as.numeric(unique(time.data$Date2)),
labels=format(unique(time.data$Date2),format = "%b %Y")) +
scale_colour_identity() +
theme_bw()+
theme(strip.background=element_blank(),
strip.text.y=element_blank(),
legend.text=element_blank(),
legend.title=element_blank(),
plot.margin=unit(c(1,5,1,1),"cm"),
legend.position="none")
gt3 <- ggplot_gtable(ggplot_build(plot))
gt3$layout$clip[gt3$layout$name == "panel"] <- "off"
grid.draw(gt3)
To get an idea how much you can strip down your example, this is much closer to minimal:
time.data <-
structure(list(Attribute = structure(c(1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L), .Label = c("Taste 1", "Taste 2", "Use 1", "Use 2"), class = "factor"),
Attribute.Category = structure(c(2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L), .Label = c("Nutritional/Usage", "Taste/Quality"), class = "factor"),
Color = c("#084594", "#084594", "#2171B5", "#2171B5", "#6A51A3",
"#6A51A3", "#807DBA", "#807DBA"), value = c(75L, 78L, 90L,
95L, 43L, 40L, 25L, 31L), Date2 = structure(c(15584, 15706,
15584, 15706, 15584, 15706, 15584, 15706), class = "Date")), .Names = c("Attribute",
"Attribute.Category", "Color", "value", "Date2"), row.names = c(NA,
-8L), class = "data.frame")
label.data <-
structure(list(value = c(78L, 95L, 40L, 31L), Attribute = structure(1:4, .Label = c("Taste 1",
"Taste 2", "Use 1", "Use 2"), class = "factor"), label.color = c("black",
"forestgreen", "red", "forestgreen"), Attribute.Category = structure(c(2L,
2L, 1L, 1L), .Label = c("Nutritional/Usage", "Taste/Quality"), class = "factor"),
Date2 = structure(c(15706, 15706, 15706, 15706), class = "Date")), .Names = c("value",
"Attribute", "label.color", "Attribute.Category", "Date2"), row.names = c(NA,
-4L), class = "data.frame")
ggplot() +
geom_line(data = time.data,
aes(x=Date2, y=value, group=Attribute, colour=Color)) +
geom_text(data = label.data,
aes(x=Date2, y=value, label=Attribute, colour=label.color),
hjust = 1) +
facet_grid(Attribute.Category~.) +
scale_colour_identity()
The theme settings (and making the labels visible outside the plot) aren't relevant to the question, nor are the x-axis conversions from Date to numeric to handle having Inf. I also trimmed the data to just the needed columns, and reduced the categorical variables to only two categories.

Remove the border from around the legend of a scatterplot

This should be simple, but I can't figure out how to remove the border from around my legend. I would also like to place the legend within the graph and remove the inner grid lines and the top and left side border. I am using the scatterplot function and this is the code I've written thus far:
scatterplot(Comp1~ln1wr|Season, moose,
xlab = "Risk", ylab = "Principal component 1",
labels= row.names(moose), by.groups=T, smooth=F, boxplots=F, legend.plot=F)
legend("bottomleft", moose, fill=0)
Here I was just experimenting to even see if I could get the legend to be placed somewhere else, but each time I run this code, I get an error
Error in as.graphicsAnnot(legend) :
argument "legend" is missing, with no default
I would like to place the legend within the graph, but where it will not conflict with the data being displayed. Here is sample data:
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 32L, 33L,
33L, 34L, 34L, 34L), .Label = c("F07001", "F07002", "F07003",
"F07004", "F07005", "F07006", "F07008", "F07009", "F07010", "F07011",
"F07014", "F07015", "F07017", "F07018", "F07019", "F07020", "F07021",
"F07022", "F07023", "F07024", "F10001", "F10004", "F10008", "F10009",
"F10010", "F10012", "F10013", "F98015", "M07007", "M07012", "M07013",
"M07016", "M10007", "M10011", "M10015"), class = "factor"), Season = structure(c(1L,
1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L), .Label = c("SUM", "WIN"
), class = "factor"), Time = structure(c(1L, 2L, 1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L, 1L), .Label = c("day", "night"), class = "factor"),
Repro = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L), .Label = c("f", "fc", "m"), class = "factor"), Comp1 = c(-0.524557195,
-0.794214153, -0.408247216, -0.621285004, -0.238828585, 0.976634392,
-0.202405922, -0.633821539, -0.306163898, -0.302261589, 1.218779672
), ln1wr = c(0.833126490613386, 0.824526258616325, 0.990730077688989,
0.981816265754353, 0.933462450382474, 1.446048015519, 1.13253050687157,
1.1349442179155, 1.14965388471562, 1.14879830358128, 1.14055365645628
)), .Names = c("ID", "Season", "Time", "Repro", "Comp1",
"ln1wr"), row.names = c(1L, 2L, 3L, 4L, 5L, 220L, 221L, 222L,
223L, 224L, 225L), class = "data.frame")
I would suggest
par(bty="l",las=1)
scatterplot(Comp1~ln1wr|Season, moose,
xlab = "Risk", ylab = "Principal component 1",
labels= row.names(moose),
by.groups=TRUE, smooth=FALSE, boxplots=FALSE,
grid=FALSE,
legend.plot=FALSE)
legend("bottomright", title="Season",
legend=levels(moose$Season), bty="n",
pch=1:2, col=1:2)
As indicated in ?legend, bty controls the legend box -- "n" means "none".
I put the legend in the bottom right rather than in the bottom left because it seems to avoid your data better that way.
I used bty="l" to eliminate the top and right box edges (this means "box type L")
I used las=1 to get the y-axis tick labels horizontal -- you didn't ask for that but I strongly prefer it
grid=FALSE removes the internal grid lines
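The same settings can be tried without the car package. Below is a rough base-graphics sketch; the data frame toy and its values are invented stand-ins for moose, and plot() replaces scatterplot() purely so the example is self-contained.

```r
# Hypothetical toy data standing in for the moose data frame
toy <- data.frame(
  x = c(0.8, 0.9, 1.0, 1.1, 1.2, 1.4),
  y = c(-0.5, -0.8, -0.4, -0.6, 1.0, 1.2),
  Season = factor(c("SUM", "SUM", "SUM", "WIN", "WIN", "WIN"))
)
grp <- as.integer(toy$Season)          # 1 = SUM, 2 = WIN

op <- par(bty = "l", las = 1)          # L-shaped plot box, horizontal tick labels
plot(toy$x, toy$y, pch = grp, col = grp,
     xlab = "Risk", ylab = "Principal component 1")
lg <- legend("bottomright", title = "Season",
             legend = levels(toy$Season),
             bty = "n",                # no box around the legend
             pch = seq_along(levels(toy$Season)),
             col = seq_along(levels(toy$Season)))
par(op)                                # restore the previous graphics settings
```

Saving and restoring par() keeps the bty and las changes from leaking into later plots in the same session.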
You have to take the unique values of your moose ID, as you have more than one point for each moose.
legend("bottomleft", legend=unique(moose$ID))
Then you have to associate a color and a point type with each legend entry (corresponding to your moose ID in your plot). I would also have a look at plot() instead of scatterplot().
