Excel Dates and R?

I have a short data frame that I created at random for practice before I move on to the big data frames. It has the same variables as the original, just far fewer rows.
The problem I'm having is that Excel records dates with the month first, so R gets confused and puts 10/1/2015 first, when it's supposed to be last.
What can I do so that R orders the dates correctly?
I would also like to calculate, for example, the total amount of money (Data$TOTAL) that I made in one month.
What would the script for that be?
And while I'm here, I might as well kill two birds with one stone. I know there is already an answer for this, but the answer I saw uses the directlabels package, which completely messes up the whole graphic.
What would you advise to keep the labels from running over the plot margin?
dput() output:
dput(Data)
structure(list(JOB = structure(c(2L, 3L, 1L, 3L, 3L), .Label = c("JAGER",
"PLAY", "RUGBY"), class = "factor"), AGENCY = structure(c(1L,
1L, 2L, 1L, 1L), .Label = c("LONDON", "WILHEL"), class = "factor"),
DATE = structure(c(4L, 5L, 1L, 2L, 3L), .Label = c("10/1/2015",
"10/3/2015", "10/9/2015", "9/24/2015", "9/26/2015"), class = "factor"),
RATE = c(90L, 90L, 100L, 90L, 90L), HS = c(8L, 6L, 4L, 6L,
4L), TOTAL = c(720L, 540L, 400L, 540L, 360L)), .Names = c("JOB",
"AGENCY", "DATE", "RATE", "HS", "TOTAL"), class = "data.frame", row.names = c(NA,
-5L))

Here is how I went about what you're after:
rugger is the dataset I constructed from your dput()
plot(order(as.Date(rugger$DATE, "%m/%d/%Y")), rugger$TOTAL,
     xaxt = "n", xlab = "", ylab = "Total")  # x positions come from order() on the dates (see note below)
labs <- as.Date(rugger$DATE, "%m/%d/%Y")
axis(side = 1, at = order(as.Date(rugger$DATE, "%m/%d/%Y")), labels = rep("", 5))  # tick marks only
# Rotated date labels drawn below the axis
text(cex = 1, x = order(as.Date(rugger$DATE, "%m/%d/%Y")) + 0.1,
     y = min(rugger$TOTAL) - 25, labs, xpd = TRUE, srt = 45, pos = 2)
The text() call lets you manipulate the labels far more; srt sets the rotation. I used order() to put the days in chronological order, which also turns them into the numbers that represent those positions, since the dates appear to be handled as factors (I'm not positive on that, it's just what I'm seeing).
If you don't want dots, check out the pch argument in plot() for the available point types.
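For the other parts of the question (getting the dates into the right order and totalling a month), here is a minimal base-R sketch, assuming the Data frame from the dput() above; the MONTH column is something I'm adding purely for illustration:
# Convert the factor DATE to Date class so R sorts it chronologically
Data$DATE <- as.Date(as.character(Data$DATE), format = "%m/%d/%Y")
Data <- Data[order(Data$DATE), ]
# Total earned per month: add a year-month label, then sum TOTAL within it
Data$MONTH <- format(Data$DATE, "%Y-%m")
aggregate(TOTAL ~ MONTH, data = Data, FUN = sum)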

Related

Specify end points for different groups when plotting regression output in R

I'm hoping to get some help with presenting regression outputs for my Master's thesis. I am assessing the impacts of elephants on woody vegetation, particularly in relation to artificial waterholes. In addition to generally declining with distance from waterholes, the impacts differ substantially between the two vegetation types involved.
I've figured out what seems to me a satisfactory way of plotting this using visreg. In the model output shown below, both distance to waterhole and veg type explained damage, hence my attempt to show both. However, the issue is that I only have samples at the furthest distances from waterholes (x-axis) for the red vegetation type. As you can see, the regression line for the blue veg type extends beyond the last points for that vegetation type. Is there any way I can get the blue line to stop at a smaller distance from the waterhole (x-axis value) than the red one, to avoid this?
See code for the model and plot below the visreg plot.
Sample data and code
> dput(vegdata[21:52, c(4,7,33)])
structure(list(distance = c(207L, 202L, 501L, 502L, 1001L, 1004L,
2010L, 1997L, 4003L, 3998L, 202L, 194L, 499L, 494L, 1004L, 1000L,
2008L, 1993L, 4008L, 3998L, 493L, 992L, 1941L, 2525L, 485L, 978L,
1941L, 3024L, 495L, 978L, 1977L, 2952L), vegtype = structure(c(1L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("teak",
"term"), class = "factor"), toedl = c(35.48031025, 47.30482718,
25.16709533, 22.29360164, 17.6546533, 12.81605101, 20.34136734,
18.45809334, 11.3578081, 3.490830751, 60.54870317, 44.9863128,
18.81010698, 20.4777188, 30.36994386, 18.7417214, 21.52247156,
18.29685939, 30.26217664, 8.945486104, 43.95749178, 43.54799495,
44.42693993, 50.06207783, 48.05538594, 35.31220933, 52.37339094,
40.51569938, 41.45677007, 58.86629306, 37.80203313, 46.35633342
)), row.names = 21:52, class = "data.frame")
m1<-lm(toedl~vegtype+distance, data=vegdata)
summary(m1)
library(visreg)
visreg(oedl6, 'sexactd', by='vegtype',overlay=TRUE, gg=TRUE, points=list(size=2.5), ylab='% old elephant damage', xlab='distance from waterhole')
Regarding the comments about a reproducible example: you can just make a small data frame with representative data, as below. A general comment as well: avoid giving your variables the names of base functions such as 'all'.
I'm not sure whether it's possible to use visreg to do what you want, but you can extract the information from your model using predict, then use ggplot to plot it, which may be preferable because ggplot is really good for customizing plots.
library(ggplot2)
library(visreg)

# Create reproducible data example
allData <- data.frame(vegtype = rep(c("t1", "t2"), each = 10),
                      oedl = c(seq(from = 35, to = 20, length.out = 10),
                               seq(from = 20, to = 5, length.out = 10)),
                      sexactd = c(seq(from = -1, to = 1, length.out = 10),
                                  seq(from = -1, to = 2, length.out = 10)))

# Make linear model
oedl6 <- lm(formula = oedl ~ sexactd + vegtype, data = allData)

# Predict the data using the linear model
odelPred <- cbind(allData, predict(oedl6, interval = 'confidence'))

ggplot(odelPred, aes(sexactd, oedl, color = vegtype, fill = vegtype)) +
  geom_point() + geom_line(aes(sexactd, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3)
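To connect this back to the question's own sample, here is a sketch along the same lines, assuming vegdata is the 32-row data frame from the dput() and m1 is the model fitted above. Because predict() without newdata only returns fitted values at the observed distances, each line and ribbon stops at the last point for its vegtype:
vegPred <- cbind(vegdata, predict(m1, interval = 'confidence'))
ggplot(vegPred, aes(distance, toedl, color = vegtype, fill = vegtype)) +
  geom_point() + geom_line(aes(distance, fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.3)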
Mr Macarthur's solution is great, and (s)he deserved the accepted answer. Visualising a multiple regression model with several predictors in a two-dimensional graph is... difficult. Basically, you are limited to one predictor plus a grouping variable for the interaction (in your case vegtype), and for that one can simply use geom_smooth.
Using your data:
library(tidyverse)
ggplot(vegdata, aes(distance, toedl, color = vegtype)) +
  geom_point() +
  geom_smooth(method = 'lm')
Created on 2019-12-13 by the reprex package (v0.3.0)

character and date variable chart

I have a data set that looks like this: the first two columns are just IDs and the last is the date.
I need to find a relation between them in R, but I am lost, since my first problem is how to visualize my data correctly. I have the IDs as factors, but each time I make a plot it shows their numeric codes instead.
You might start visualizing the relationships between your variables using pairs.panels() from the psych package. Here is the output using the sample data you shared. Note that the sample data points are sparse, but your full data set has more.
library(psych)
pairs.panels(df)
Output
Data
structure(list(id1 = structure(c(6L, 2L, 2L, 1L, 5L, 4L, 5L,
3L), .Label = c("10017097", "17596277", "20501146", "3603827",
"57106539", "7596227"), class = "factor"), id2 = structure(c(3L,
1L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("10122", "10197", "13840"
), class = "factor"), t_date = structure(c(17966, 17590, 17956,
17984, 17478, 17483, 17513, 17544), class = "Date")), class = "data.frame", row.names = c(NA,
-8L))
The documentation is available at ?pairs.panels.

Scale ggplot window and geom elements without altering text or export image size

I regularly produce ggplot() graphics and have struggled to find a way to scale the plot window without affecting other elements such as titles, axis labels, axis text, and legends. I typically use ggsave() and have pulled from here a bit, but due to scaling and resizing issues the plots never get standardized. I'm trying to get better at streamlining everything but have hit a roadblock. Here is a similar question asking how to do the same thing with pdf(); the central problem appears to be the same.
Question
Is there any way to adjust the scaling of only the plot window and (edit: non-text) geom elements without adjusting text or image size?
Example:
Here's some sample data I'm working with. Percent of jobs filled and offered by day of week:
dput(day_data)
structure(list(day = structure(c(2L, 6L, 7L, 3L, 1L, 2L, 6L,
7L, 3L, 1L), .Label = c("F", "M", "R", "Sa", "Su", "T", "W"), class = "factor"),
order = c(2L, 3L, 4L, 5L, 6L, 2L, 3L, 4L, 5L, 6L), Status = structure(c(1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("Jobs Filled",
"Jobs Offered"), class = "factor"), total = c(13724496, 15298119,
15293656, 16272599, 17652393, 16252141, 17590028, 17549470,
18875899, 21441775), percent = c(17.5, 19.5, 19.5, 20.8,
22.6, 17.7, 19.2, 19.1, 20.6, 23.4)), .Names = c("day", "order",
"Status", "total", "percent"), row.names = c(2L, 3L, 4L, 5L,
6L, 9L, 10L, 11L, 12L, 13L), class = "data.frame")
Here's some sample code to produce a graph. It doesn't look quite like my attached images, but that's because I dropped unnecessary theme and other tidy-up elements to keep this a minimal reproducible example.
ggplot(data = day_data, aes(x = reorder(day, order), y = percent, fill = Status)) +
  geom_bar(position = "dodge", stat = "identity") +
  geom_text(aes(label = percent, y = percent + 0.2),
            position = position_dodge(width = 1),
            family = "serif") +
  labs(x = "Day of Week",
       y = "Percent") +
  theme(text = element_text(size = 12, family = "serif"))
Now, here's the nitty-gritty of the problem. When using ggsave(), we can specify a given height and width, and once we do that in combination with setting element_text(size=12), everything should be set. But sometimes the plot doesn't look quite the same. Attached are two examples showing the difference in scaling: the first was done with scale=1, the default, and the second with scale=4, an extreme example. One might think scale could be used to adjust only the plot window and not the other (text) elements, but all it actually does is act as a multiplier on the height and width arguments. So height/width/scale = 4/4/1 gives a 4"x4" image, while height/width/scale = 4/4/4 gives a 16"x16" image.
ggsave(paste0(getwd(), "/example.jpg"), width=4, height=4, dpi=300)
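For reference, a minimal sketch of the height/width/scale multiplier behaviour described above (file names here are placeholders):
ggsave("example_scale1.jpg", width = 4, height = 4, scale = 1, dpi = 300)  # 4" x 4" image
ggsave("example_scale4.jpg", width = 4, height = 4, scale = 4, dpi = 300)  # 16" x 16" image (scale just multiplies height and width)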
So the question again: is there any way to adjust the scaling of only the plot window and bar/line/etc. elements without adjusting text or image size? Can it be done either within the original ggplot() call, within ggsave(), or with some other export function?

Outlier detection for multi column data frame in R

I have a data frame with 18 columns and about 12000 rows. I want to find the outliers for the first 17 columns and compare the results with column 18. Column 18 is a factor and contains data that can be used as an indicator of outliers.
My data frame is ufo, and I remove column 18 as follows:
ufo2 <- ufo[,1:17]
and then convert 3 non-numeric columns to numeric values:
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
and then use the following command for outlier detection:
outlier.scores <- lofactor(ufo2, k=5)
But all of the elements of outlier.scores are NA!
Is there a mistake in this code?
Is there another way to find outliers for such a data frame?
All of my code:
setwd(datadirectory)
library(doMC)
registerDoMC(cores=8)
library(DMwR)
# load data
load("data_9802-f2.RData")
ufo2 <- ufo[,2:17]
ufo2$Weight <- as.numeric(ufo2$Weight)
ufo2$InvoiceValue <- as.numeric(ufo2$InvoiceValue)
ufo2$Score <- as.numeric(ufo2$Score)
outlier.scores <- lofactor(ufo2, k=5)
The output of the dput(head(ufo2)) is:
structure(list(Origin = c(2L, 2L, 2L, 2L, 2L, 2L), IO = c(2L,
2L, 2L, 2L, 2L, 2L), Lot = c(1003L, 1003L, 1003L, 1012L, 1012L,
1013L), DocNumber = c(10069L, 10069L, 10087L, 10355L, 10355L,
10382L), OperatorID = c(5698L, 5698L, 2015L, 246L, 246L, 4135L
), Month = c(1L, 1L, 1L, 1L, 1L, 1L), LineNo = c(1L, 2L, 1L,
1L, 2L, 1L), Country = c(1L, 1L, 1L, 1L, 11L, 1L), ProduceCode = c(63456227L,
63455714L, 33687427L, 32686627L, 32686627L, 791614L), Weight = c(900,
850, 483, 110000, 5900, 1000), InvoiceValue = c(637, 775, 2896,
48812, 1459, 77), InvoiceValueWeight = c(707L, 912L, 5995L, 444L,
247L, 77L), AvgWeightMonth = c(1194.53, 1175.53, 7607.17, 311.667,
311.667, 363.526), SDWeightMonth = c(864.931, 780.247, 3442.93,
93.5818, 93.5818, 326.238), Score = c(0.56366535234262, 0.33775439984787,
0.46825476121676, 1.414092583904, 0.69101737288291, 0.87827342721894
), TransactionNo = c(47L, 47L, 6L, 3L, 3L, 57L)), .Names = c("Origin",
"IO", "Lot", "DocNumber", "OperatorID", "Month", "LineNo", "Country",
"ProduceCode", "Weight", "InvoiceValue", "InvoiceValueWeight",
"AvgWeightMonth", "SDWeightMonth", "Score", "TransactionNo"), row.names = c(NA,
6L), class = "data.frame")
First of all, you need to spend a lot more time preprocessing your data.
Your axes have completely different meaning and scale. Without care, the outlier detection results will be meaningless, because they are based on a meaningless distance.
For example, ProduceCode: are you sure this should be part of your similarity measure?
Also note that I found the lofactor implementation in the R DMwR package to be really slow. Plus, it seems to be hard-wired to Euclidean distance!
Instead, I recommend using ELKI for outlier detection. First, it comes with a much wider choice of algorithms; second, it is much faster than R; and third, it is very modular and flexible. For your use case, you may need to implement a custom distance function instead of using Euclidean distance.
Here's the link to the ELKI tutorial on implementing a custom distance function.
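If you do stay in R, here is a minimal sketch of the kind of preprocessing meant above, assuming the columns from the dput(); which columns to drop and how to weight the rest is a judgement call for your data:
library(DMwR)
# Drop ID-like columns that have no meaningful distance (an assumption for illustration)
num <- ufo2[, !(names(ufo2) %in% c("ProduceCode", "DocNumber", "OperatorID", "Lot"))]
# Convert any factor columns via character so the values, not the level codes, are kept
num[] <- lapply(num, function(x) as.numeric(as.character(x)))
# Put all axes on a comparable scale before computing distances
num <- as.data.frame(scale(num))
outlier.scores <- lofactor(num, k = 5)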

melting multiple spans of variables

(Still) new to R, and very confused as to how I should accomplish multiple melts of my data. Here is a subset:
df <- structure(list(Subject = c(101L, 101L, 101L, 102L, 102L, 102L
), Condition = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("apass",
"vpas"), class = "factor"), FreqCode = structure(c(1L, 1L, 1L,
2L, 2L, 2L), .Label = c("LessVerbal", "MoreVerbal"), class = "factor"),
Item = c(1L, 4L, 7L, 1L, 4L, 7L), Len = c(80L, 68L, 85L,
68L, 85L, 79L), R1_1.RT = c(237L, 203L, 207L, 336L, 487L,
340L), R1_2.RT = c(177L, 225L, 162L, 634L, 590L, 347L), R1_3.RT = c(200L,
226L, 212L, 707L, 653L, 379L), R1.RT = c(614L, 654L, 581L,
1677L, 1730L, 1066L), R1_1 = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "The", class = "factor"), R1_2 = structure(c(3L,
1L, 2L, 1L, 2L, 4L), .Label = c("antique", "course", "new",
"road"), class = "factor"), R1_3 = structure(c(4L, 1L, 2L,
1L, 2L, 3L), .Label = c("car", "materials", "surfaces", "technology"
), class = "factor"), R1 = structure(c(3L, 1L, 2L, 1L, 2L,
4L), .Label = c("The antique car", "The course materials",
"The new technology", "The road surfaces"), class = "factor")), .Names = c("Subject",
"Condition", "FreqCode", "Item", "Len", "R1_1.RT", "R1_2.RT",
"R1_3.RT", "R1.RT", "R1_1", "R1_2", "R1_3", "R1"), class = "data.frame", row.names =
c(NA,
-6L))
My goal is to get output that (in part) looks like this:
Region RT WordRegion Word
R1_1.RT 237 R1_1 the
...
R1_2.RT 177 R1_2 new
...
EDIT: The variables ending in ".RT" (e.g., R1_1.RT) are region names and will be melted into a Region column. The variables ending in numbers (e.g., R1_1) correspond exactly to the region names and hold their associated values. I want them melted alongside the region names so that I can analyze them in relation to the Region column.
In the first part of the code, I melt all of the .RT values into a Region column and rename the value column to RT. This seems to work fine:
#long transform (with individual regions at end)
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
#change newly created column name to "RT" (note:you have to change the number in [] to match your data)
colnames(SmallMelt1)[11] <- "RT"
But I don't get how to simultaneously melt another span of variables such that they will line up vertically with the first span. I want to do something like this, after the first melt, but it does not work:
#Second Melt for region names (doesn't work)
SmallMelt2 = melt(SmallMelt1, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
#Change name to Word
colnames(SmallMelt2)[9] <- "Word" #add col number for "value" here
Please let me know if you need any clarification. I hope someone can help... thanks in advance - DT
So, after consulting with someone off-list, I found the solution. My mistake was that I was trying to run the second step on the output of the first step. By running the two steps independently on the original data and then concatenating, I get the right result.
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
SmallMelt2 = melt(df, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
# Bind the WordRegion and word-value columns (10:11 of SmallMelt2) onto the RT melt
SmallMelt3 = cbind(SmallMelt1, SmallMelt2[, 10:11])
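As an aside (not part of the original solution), data.table's melt() can handle both spans in one call by passing measure.vars as a named list; a sketch assuming the df above:
library(data.table)
long <- data.table::melt(as.data.table(df),
                         measure.vars = list(RT   = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"),
                                             Word = c("R1_1", "R1_2", "R1_3", "R1")),
                         variable.name = "Region")
# Region comes back as a group index (1-4); relabel it with the region names
long[, Region := factor(Region, labels = c("R1_1", "R1_2", "R1_3", "R1"))]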
