Sorting a dotchart with matrix input in R

How do you generate a grouped Cleveland dot plot (dot chart), where the data is sorted from highest to lowest in each subgroup, when your input is a matrix?
For example, R has a nice built-in example of a dotchart using groups with a matrix as input:
dotchart(VADeaths, main = "Death Rates in Virginia - 1940")
In this particular example, the data is already sorted in each category for each of the groups (Rural Male, Rural Female, etc.). However, if it wasn't, what are the R commands to generate a plot such that the data points in each subgroup are sorted from highest to lowest?

If you do not want to order your data by the column names, as @DWin suggested, but solely on numeric data, you might try:
# get data
data <- VADeaths[sample(1:5), ]
# order data by the first column's numeric values
data <- data[order(data[,1]),]
dotchart(data)
Note: this sorts the matrix by its first column only! A matrix has a single set of row names shared by all columns, so it is not possible to sort every column independently while keeping the data in one matrix.
If you stick to your original question, I would suggest splitting up the data by column, plotting a dotchart for each sorted column, and stacking those in a layout.
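Another option, sketched here under the assumption that flattening the matrix is acceptable: dotchart() also accepts a plain vector together with a groups factor, so every point can carry its own label and each subgroup can be sorted on its own. The names m and long are illustrative only and do not come from the question.

```r
# Shuffle the rows so the input is deliberately unsorted (illustrative data).
set.seed(1)
m <- VADeaths[sample(nrow(VADeaths)), ]

# Flatten to long form: one row per (age band, population group) pair.
long <- data.frame(
  rate  = as.vector(m),
  age   = rownames(m)[as.vector(row(m))],
  group = factor(colnames(m)[as.vector(col(m))], levels = colnames(m))
)

# Sort by group first, then by rate within each group.
long <- long[order(long$group, long$rate), ]

dotchart(long$rate, labels = long$age, groups = long$group,
         main = "Death Rates in Virginia - 1940")
```

dotchart() draws the points of a group from the bottom up, so ascending order here should place the highest value at the top of each group. Use order(long$group, -long$rate) if you want the opposite direction.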

This shows the creation of a matrix with arbitrary row order and how one can restore it to proper order.
> set.seed(123)
> VA2 <- VADeaths[sample(1:5), ]
> VA2
      Rural Male Rural Female Urban Male Urban Female
55-59       18.1         11.7       24.3         13.6
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0
60-64       26.9         20.3       37.0         19.3
50-54       11.7          8.7       15.4          8.4
> VA2[order(rownames(VA2)), ]
      Rural Male Rural Female Urban Male Urban Female
50-54       11.7          8.7       15.4          8.4
55-59       18.1         11.7       24.3         13.6
60-64       26.9         20.3       37.0         19.3
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0
If you were faced with disordered colnames whose desired order is not a simple lexical one, you could just pass a character vector in the proper order to "[":
> c2 <- c("Rural Male", "Rural Female", "Urban Male" , "Urban Female")
> VA3 <- VA2[ , sample(1:4)]
> VA3
      Rural Male Rural Female Urban Male Urban Female
55-59       18.1         11.7       24.3         13.6
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0
60-64       26.9         20.3       37.0         19.3
50-54       11.7          8.7       15.4          8.4
> VA3[ , c2]
      Rural Male Rural Female Urban Male Urban Female
55-59       18.1         11.7       24.3         13.6
65-69       41.0         30.9       54.6         35.1
70-74       66.0         54.3       71.1         50.0
60-64       26.9         20.3       37.0         19.3
50-54       11.7          8.7       15.4          8.4

Related

Gathering multiple data columns currently in factor form

I have a dataset of train carloads. It currently has a number (weekly carload) listed for each company (the row) for each week (the columns) over the course of a couple years (100+ columns). I want to gather this into just two columns: a date and loads.
It currently looks like this:
3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5
I'm looking for:
Date Load
3/29/2017 32.7
3/29/2017 20.5
3/29/2017 24.1
3/29/2017 24.9
4/5/2017 31.6
I've been doing various versions of the following:
rail3 <- rail2 %>%
gather(`3/29/2017`:`1/24/2018`, key = "date", value = "loads")
When I do this it makes a dataset called rail3, but it didn't make the new columns I wanted. It only made the dataset 44 times longer than it was. And it gave me the following message:
Warning message:
attributes are not identical across measure variables;
they will be dropped
I'm assuming this is because the date columns are currently coded as factors. But I'm also not sure how to convert 100+ columns from factors to numeric. I've tried the following and various other methods:
rail2["3/29/2017":"1/24/2018"] <- lapply(rail2["3/29/2017":"1/24/2018"], as.numeric)
None of this has worked. Let me know if you have any advice. Thanks!
If you want to avoid warnings when gathering and want date and numeric output in final df you can do:
library(tidyr)
library(hablar)
# Data from above but with factors
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE) %>%
as_tibble() %>%
convert(fct(everything()))
# Code
rail2 %>%
convert(num(everything())) %>%
gather("date", "load") %>%
convert(dte(date, .args = list(format = "%m/%d/%Y")))
Gives:
# A tibble: 16 x 2
date load
<date> <dbl>
1 2017-03-29 32.7
2 2017-03-29 20.5
3 2017-03-29 24.1
4 2017-03-29 24.9
5 2017-04-05 31.6
Here is a possible solution:
rail2<-read.table(header=TRUE, text="3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5", check.names=FALSE)
library(tidyr)
# gather the data from columns and convert to long format.
rail3 <- rail2 %>% gather(key="date", value="load")
rail3
# date load
#1 3/29/2017 32.7
#2 3/29/2017 20.5
#3 3/29/2017 24.1
#4 3/29/2017 24.9
#5 4/5/2017 31.6
#6 4/5/2017 21.8
#7 ...
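If the columns really are factors, as the question suspects, a rough sketch of the whole round trip could look like the following. It assumes a tidyr version that provides pivot_longer() (the newer counterpart of gather()); the name rail_long is illustrative. Note that calling as.numeric() directly on a factor returns the internal level codes, which is why the conversion goes through as.character() first.

```r
library(tidyr)

rail2 <- read.table(header = TRUE, check.names = FALSE, text = "3/29/2017 4/5/2017 4/12/2017 4/19/2017
32.7 31.6 32.3 32.5
20.5 21.8 22.0 22.3
24.1 24.1 23.6 23.4
24.9 24.7 24.8 26.5")

# Simulate the situation in the question: every column stored as a factor.
rail2[] <- lapply(rail2, factor)

# Convert all columns back to numeric via character, without naming each one.
rail2[] <- lapply(rail2, function(x) as.numeric(as.character(x)))

# Reshape to long format: one row per (date, load) pair.
rail_long <- pivot_longer(rail2, cols = everything(),
                          names_to = "date", values_to = "load")

# Parse the dates and order by date, as in the desired output.
rail_long$date <- as.Date(rail_long$date, format = "%m/%d/%Y")
rail_long <- rail_long[order(rail_long$date), ]
```

With 100+ columns plus an identifier column, you would replace everything() with a selection that excludes the identifier.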

Can't remove a row from a matrix in R

I'm trying to remove an outlier from a data matrix. The original matrix is called Westdata and I want to remove row 51.
I've tried the following line of code but it doesn't remove the outlier and the new matrix is identical to the old one.
Westdata.Outlier<-Westdata[-51,]
Westdata.Outlier
State Region Pay Spend Area
20 Mont. MN 22.5 3.95 West
21 Wyo. MN 27.2 5.44 West
22 N.Mex. MN 22.6 3.40 West
23 Utah MN 22.3 2.30 West
24 Wash. PA 26.0 3.71 West
25 Calif. PA 29.1 3.61 West
26 Hawaii PA 25.8 3.77 West
46 Idaho MN 21.0 2.51 West
47 Colo. MN 25.9 4.04 West
48 Ariz. MN 26.6 2.83 West
49 Nev. MN 25.6 2.93 West
50 Oreg. PA 25.8 4.12 West
51 Alaska PA 41.5 8.35 West
Any suggestions?
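A likely cause, judging from the printout: "51" is the row name of Alaska, not its position. The object shown has only 13 rows, and a negative index beyond the last row is silently ignored, so Westdata[-51,] drops nothing. A minimal sketch with a cut-down, illustrative copy of the data:

```r
# Cut-down copy of the data; the row names are kept as in the printout.
Westdata <- data.frame(
  State = c("Mont.", "Wyo.", "Oreg.", "Alaska"),
  Pay   = c(22.5, 27.2, 25.8, 41.5),
  Spend = c(3.95, 5.44, 4.12, 8.35),
  row.names = c("20", "21", "50", "51")
)

# Positional index: there is no 51st row, so nothing is removed.
by_position <- Westdata[-51, ]

# Select by row name instead.
Westdata.Outlier <- Westdata[rownames(Westdata) != "51", ]
```

Filtering on a value, for example Westdata[Westdata$State != "Alaska", ], does the same thing and does not depend on the row names.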

run function on consecutive vals with specific range in the vector with R

Suppose I have a vector tmp of size 100.
I want to know where the average of each 4 consecutive elements is, for example, greater than 10.
That is,
I want to know which of these: mean(tmp[c(1,2,3,4)]), mean(tmp[c(2,3,4,5)]), mean(tmp[c(3,4,5,6)]), and so on up to mean(tmp[c(97,98,99,100)])
are larger than 10.
How can I do it without a loop?
(A loop takes too long, since I have a table of 500000 rows by 60 columns.)
And I need not only the average but also the difference, the sum and so on.
I have tried splitting the indices like this:
tmp<-seq(1,100,1)
one<-seq(1,97,1)
two<-seq(2,98,1)
tree<-seq(3,99,1)
four<-seq(4,100,1)
aa<-(tmp[one]+tmp[two]+tmp[tree]+tmp[four])/4
which(aa>10)
This works, but it is not practical if you want, for example, the average of 12 elements.
Here is an example of what I do, to be clear:
b12<-seq(1,988,1)
b11<-seq(2,989,1)
b10<-seq(3, 990,1)
b9<-seq(4,991,1)
b8<-seq(5,992,1)
b7<-seq(6,993,1)
b6<-seq(7,994,1)
b5<-seq(8, 995,1)
b4<-seq(9,996,1)
b3<-seq(10,997,1)
b2<-seq(11,998,1)
b1<-seq(12,999,1)
now<-seq(13, 1000,1)
po<-rpois(1000,4)
nor<-rnorm(1000,5,0.2)
uni<-runif(1000,10,75)
chis<-rchisq(1000,3,0)
which((po[now]/nor[now])>1 & (nor[b12]/nor[now])>1 &
((po[now]/po[b4])>1 | (uni[now]-uni[b4])>=0) &
((chis[now]+chis[b1]+chis[b2]+chis[b3])/4)>2 &
(uni[now]/max(uni[b1],uni[b2],uni[b3],uni[b4],
uni[b5],uni[b6],uni[b7],uni[b8]))>0.5)+12
This code gives me the exact indices in the real table
that match all the conditions.
I have 58 variables with 550000 rows.
Thank you.
The question is not very clear. Based on the wording, I guess, this should help:
n <- 100
res <- sapply(1:(n-3), function(i) mean(tmp[i:(i+3)]))
which(res >10)
Also,
m1 <- matrix(tmp[1:4+ rep(0:96,each=4)],ncol=4,byrow=T)
which(rowMeans(m1) >10)
Maybe you should look at the rollapply function from the "zoo" package. You would need to adjust the width argument according to your specific needs.
library(zoo)
tmp <- seq(1, 100, 1)
rollapply(tmp, width = 4, FUN = mean)
# [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5
# [15] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
# [29] 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5
# [43] 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5
# [57] 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5
# [71] 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5
# [85] 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
So, to get the details you want:
aa <- rollapply(tmp, width = 4, FUN = mean)
which(aa > 10)
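If you would rather stay in base R, here is a sketch of two vectorised alternatives. roll_mean is a hypothetical helper (not from any package) that uses a cumulative sum, and embed() lays the windows out as matrix rows so that other statistics can be computed per window.

```r
# Hypothetical helper: rolling mean over `width` consecutive elements,
# computed from differences of the cumulative sum (no explicit loop).
roll_mean <- function(x, width) {
  cs <- cumsum(c(0, x))
  (cs[(width + 1):length(cs)] - cs[1:(length(cs) - width)]) / width
}

tmp <- seq(1, 100, 1)
aa <- roll_mean(tmp, 4)
hits <- which(aa > 10)   # start positions of windows with mean > 10

# embed() puts each window in one row, newest element first:
# row i holds tmp[i+3], tmp[i+2], tmp[i+1], tmp[i].
win <- embed(tmp, 4)
roll_sum  <- rowSums(win)
roll_diff <- win[, 1] - win[, 4]   # last minus first element of each window
roll_max  <- apply(win, 1, max)    # any other per-window function
```

Changing the window to 12 only means changing the width argument. Note that apply() still loops internally, so for very long vectors rowSums(), rowMeans() or the cumulative-sum approach will be faster.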

subset dataframe variables through part of names

Suppose I have a data frame that contains these series and something else.
Where Ru and Uk are country codes.
Date CPI.Ru CPI.g.Ru CPI.s.Ru CPI.Uk CPI.g.Uk CPI.s.Uk
Q4-1990 61.4 66.4 67.5 72.2 68.2 32.4
Q1-1991 61.3 67.0 68.0 72.6 68.8 33.2
Q2-1991 61.4 67.5 68.1 73.2 69.5 35.1
Q3-1991 61.7 68.7 68.9 73.7 70.6 35.9
Q4-1991 62.3 68.4 69.3 74.3 71.9 38.2
Q1-1992 62.3 69.7 69.6 74.7 72.9 39.2
Q2-1992 62.1 70.3 70.0 75.3 73.7 40.6
Q3-1992 62.2 71.4 70.5 75.3 74.1 41.2
Q4-1992 62.5 71.1 70.9 75.7 74.3 44.0
I want to subset dataframe by country and then do something with this series.
For example I want to divide CPI index for each country by its first element.
How can I do it in cycle or maybe with apply function?
countries <- c("Ru","Uk")
for (i in countries)
{dataFrameName$CPI.{i} <- dfName$CPI.{i}/dfName$CPI.{i}[1]}
What should I write instead of {i}?
$ only accepts fixed column names. To select columns based on an expression you can instead use double brackets:
countries <- c("Ru", "Uk")
for (i in countries){
x <- paste0("CPI.", i)
dfName[[x]] <- dfName[[x]]/dfName[[x]][1]
}
This is not a loop, but if your data is always of the same form for each country, so that each country has 3 columns, and you always want to operate on the first column per country, you could try this:
sub <- df[,seq(2,ncol(df), 3)] #create a subsetted data.frame containing the CPI index per country
apply(sub, 2, function(x) x/x[1]) #then use apply to operate on each column
# CPI.Ru CPI.Uk
# [1,] 1.0000000 1.000000
# [2,] 0.9983713 1.005540
# [3,] 1.0000000 1.013850
# [4,] 1.0048860 1.020776
# [5,] 1.0146580 1.029086
# [6,] 1.0146580 1.034626
# [7,] 1.0114007 1.042936
# [8,] 1.0130293 1.042936
# [9,] 1.0179153 1.048476
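If the column positions are not regular, matching on part of the name is another possibility. The sketch below uses grep() on the country suffix and rebases every series of that country; it assumes the naming scheme from the question (country code after the last dot) and uses only the first three rows of the data for brevity.

```r
# First three rows of the example data.
dfName <- data.frame(
  Date     = c("Q4-1990", "Q1-1991", "Q2-1991"),
  CPI.Ru   = c(61.4, 61.3, 61.4),
  CPI.g.Ru = c(66.4, 67.0, 67.5),
  CPI.s.Ru = c(67.5, 68.0, 68.1),
  CPI.Uk   = c(72.2, 72.6, 73.2),
  CPI.g.Uk = c(68.2, 68.8, 69.5),
  CPI.s.Uk = c(32.4, 33.2, 35.1)
)

countries <- c("Ru", "Uk")
for (cc in countries) {
  # All column names that end in ".<country code>".
  cols <- grep(paste0("\\.", cc, "$"), names(dfName), value = TRUE)
  # Divide each selected series by its first element.
  dfName[cols] <- lapply(dfName[cols], function(x) x / x[1])
}
```

To rebase only the plain CPI index, use cols <- paste0("CPI.", cc) inside the loop instead of the grep() call.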

How to create a plot consisting of multiple residuals?

How can I make a residual plot according to the following (what are y_hat and e here)?
Is this a form of residual plot as well?
beeflm=lm(PBE ~ CBE + PPO + CPO + PFO +DINC + CFO+RDINC+RFP+YEAR, data = beef)
summary(beeflm)
qqnorm(residuals(beeflm))
#plot(beeflm) #in manuals I have seen they use this but it gives me multiple plot
or is this one correct?
plot(beeflm$residuals,beeflm$fitted.values)
I know from the comments that plot(beeflm, which=1) is correct, but according to the stated question I should use matplot, and when I do I receive the following error:
matplot(beeflm,which=1,
+ main = "Beef: residual plot",
+ ylab = expression(e[i]), # only 1st is taken
+ xlab = expression(hat(y[i])))
Error in xy.coords(x, y, xlabel, ylabel, log = log) :
(list) object cannot be coerced to type 'double'
And when I use plot I receive the following error:
plot(beeflm,which=1,main="Beef: residual plot",ylab = expression(e[i]),xlab = expression(hat(y[i])))
Error in plot.default(yh, r, xlab = l.fit, ylab = "Residuals", main = main, :
formal argument "xlab" matched by multiple actual arguments
Also, do you know what the following means? Is there an example illustrating it (or an external link)?
Here's the beef data.frame:
YEAR PBE CBE PPO CPO PFO DINC CFO RDINC RFP
1 1925 59.7 58.6 60.5 65.8 65.8 51.4 90.9 68.5 877
2 1926 59.7 59.4 63.3 63.3 68.0 52.6 92.1 69.6 899
3 1927 63.0 53.7 59.9 66.8 65.5 52.1 90.9 70.2 883
4 1928 71.0 48.1 56.3 69.9 64.8 52.7 90.9 71.9 884
5 1929 71.0 49.0 55.0 68.7 65.6 55.1 91.1 75.2 895
6 1930 74.2 48.2 59.6 66.1 62.4 48.8 90.7 68.3 874
7 1931 72.1 47.9 57.0 67.4 51.4 41.5 90.0 64.0 791
8 1932 79.0 46.0 49.5 69.7 42.8 31.4 87.8 53.9 733
9 1933 73.1 50.8 47.3 68.7 41.6 29.4 88.0 53.2 752
10 1934 70.2 55.2 56.6 62.2 46.4 33.2 89.1 58.0 811
11 1935 82.2 52.2 73.9 47.7 49.7 37.0 87.3 63.2 847
12 1936 68.4 57.3 64.4 54.4 50.1 41.8 90.5 70.5 845
13 1937 73.0 54.4 62.2 55.0 52.1 44.5 90.4 72.5 849
14 1938 70.2 53.6 59.9 57.4 48.4 40.8 90.6 67.8 803
15 1939 67.8 53.9 51.0 63.9 47.1 43.5 93.8 73.2 793
16 1940 63.4 54.2 41.5 72.4 47.8 46.5 95.5 77.6 798
17 1941 56.0 60.0 43.9 67.4 52.2 56.3 97.5 89.5 830
Use plot(beeflm, which=1) to get the plot of residuals against fitted values.
require(graphics)
## Annette Dobson (1990) "An Introduction to Generalized Linear Models".
## Page 9: Plant Weight Data.
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
plot(lm.D9, which=1)
You can use matplot as given below:
matplot(
x = lm.D9$fitted.values
, y = lm.D9$residuals
)
An example illustrating this using the mtcars data:
fit <- lm(mpg ~ ., data=mtcars)
plot(x=fitted(fit), y=residuals(fit))
and
par(mfrow=c(3,4)) # or 'layout(matrix(1:12, nrow=3, byrow=TRUE))'
for (coeff in colnames(mtcars)[-1])
plot(x=mtcars[, coeff], residuals(fit), xlab=coeff, ylab=expression(e[i]))
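On the two errors in the question: matplot() expects numeric vectors or matrices rather than the lm object itself, and plot() on an lm object sets its own xlab internally, which is why passing xlab again fails. In the usual notation y_hat is the vector of fitted values and e the vector of residuals, so one way to get the custom axis labels is to extract both and use the default plot method. A sketch using mtcars (the beef model would work the same way):

```r
fit <- lm(mpg ~ ., data = mtcars)

y_hat <- fitted(fit)     # fitted values
e     <- residuals(fit)  # residuals

# Residuals on the y-axis against fitted values on the x-axis.
plot(y_hat, e,
     main = "Residual plot",
     xlab = expression(hat(y)[i]),
     ylab = expression(e[i]))
abline(h = 0, lty = 2)   # reference line at zero
```

Note the argument order: the fitted values go on the x-axis, so plot(beeflm$fitted.values, beeflm$residuals) rather than the other way round.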
