How to take an Average of + or - SD - r

I have data in which [1] a dependent variable is measured against [2] a controlled independent variable. The mean and SD are taken from [1].
(a) and this is the result of SD:
Year Species Pop_Index
1 1994 Corn Bunting 2.082483
5 1998 Corn Bunting 2.048155
10 2004 Corn Bunting 2.061617
15 2009 Corn Bunting 2.497792
20 1994 Goldfinch 1.961236
25 1999 Goldfinch 1.995600
30 2005 Goldfinch 2.101403
35 2010 Goldfinch 2.138496
40 1995 Grey Partridge 2.162136
(b) And the result of mean:
Year Species Pop_Index
1 1994 Corn Bunting 2.821668
5 1998 Corn Bunting 2.916975
10 2004 Corn Bunting 2.662797
15 2009 Corn Bunting 4.171538
20 1994 Goldfinch 3.226108
25 1999 Goldfinch 2.452807
30 2005 Goldfinch 2.954816
35 2010 Goldfinch 3.386772
40 1995 Grey Partridge 2.207708
(c) This is the Code for SD:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.0824833420524, 2.04815530904537,
2.06161673349657, 2.49779159320587, 1.96123572400404, 1.99559986715288,
2.10140285528351, 2.13849611018009, 2.1621364896722)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(d) This is the code for mean:
structure(list(Year = c(1994L, 1998L, 2004L, 2009L, 1994L, 1999L,
2005L, 2010L, 1995L), Species = structure(c(1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L), .Label = c("Corn Bunting", "Goldfinch", "Grey Partridge"
), class = "factor"), Pop_Index = c(2.82166841455814, 2.91697463618566,
2.66279663056763, 4.17153795031277, 3.22610845074252, 2.45280743991572,
2.95481600904799, 3.38677188055508, 2.20770835158744)), row.names = c(1L,
5L, 10L, 15L, 20L, 25L, 30L, 35L, 40L), class = "data.frame")
(e) And this is the code used to take the mean of mean Pop_Index over the years:
df2 <- aggregate(Pop_Index ~ Year, df1, mean)
(f) And this is the result:
Year Pop_Index
1 1994 3.023888
2 1995 2.207708
3 1998 2.916975
4 1999 2.452807
5 2004 2.662797
6 2005 2.954816
7 2009 4.171538
8 2010 3.386772
Now, it wouldn't make sense to get the average of the SDs by repeating the same procedure with mean or sd as the aggregating function.
I have looked online and found someone in a similar predicament with this data:
Month: January
Week 1 Mean: 67.3 Std. Dev: 0.8
Week 2 Mean: 80.5 Std. Dev: 0.6
Week 3 Mean: 82.4 Std. Dev: 0.8
And the response:
"With equal samples size, which is what you have, the standard deviation you are looking for is:
Sqrt [ (.64 + .36 + .64) / 3 ] = 0.739369"
How would I do this in R, or is there another way of doing this? Because I want to plot error bars and the dataset plotted is like that of (f), and it would be absurd to plot the SD of (a) against this because the vector lengths would differ.
Sample from original data.frame with a few columns and many rows not included:
structure(list(GRIDREF = structure(c(1L, 1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 10L), .Label = c("SP8816", "SP9212", "SP9322",
"SP9326", "SP9440", "SP9513", "SP9632", "SP9939", "TF7133", "TF9437"
), class = "factor"), Lat = c(51.83568688, 51.83568688, 51.79908899,
51.88880822, 51.92476157, 52.05042795, 51.80757645, 51.97818159,
52.04057068, 52.86730817, 52.89542895), Long = c(-0.724233561,
-0.724233561, -0.667258035, -0.650074995, -0.648996758, -0.630626734,
-0.62349292, -0.603710436, -0.558026241, 0.538966197, 0.882597783
), Year = c(2006L, 2007L, 1999L, 2004L, 1995L, 2009L, 2011L,
2007L, 2011L, 1996L, 2007L), Species = structure(c(4L, 7L, 5L,
10L, 4L, 6L, 8L, 3L, 2L, 9L, 1L), .Label = c("Blue Tit", "Buzzard",
"Canada Goose", "Collared Dove", "Greenfinch", "Jackdaw", "Linnet",
"Meadow Pipit", "Robin", "Willow Warbler"), class = "factor"),
Pop_Index = c(0L, 0L, 2L, 0L, 1L, 0L, 1L, 4L, 0L, 0L, 8L)), row.names = c(1L,
100L, 1000L, 2000L, 3000L, 4000L, 5000L, 6000L, 10000L, 20213L,
30213L), class = "data.frame")
A look into this data.frame:
GRIDREF Lat Long Year Species Pop_Index TempJanuary
1 SP8816 51.83569 -0.7242336 2006 Collared Dove 0 2.128387
100 SP8816 51.83569 -0.7242336 2007 Linnet 0 4.233226
1000 SP9212 51.79909 -0.6672580 1999 Greenfinch 2 5.270968
2000 SP9322 51.88881 -0.6500750 2004 Willow Warbler 0 4.826452
3000 SP9326 51.92476 -0.6489968 1995 Collared Dove 1 4.390322
4000 SP9440 52.05043 -0.6306267 2009 Jackdaw 0 2.934516
5000 SP9513 51.80758 -0.6234929 2011 Meadow Pipit 1 3.841290
6000 SP9632 51.97818 -0.6037104 2007 Canada Goose 4 7.082580
10000 SP9939 52.04057 -0.5580262 2011 Buzzard 0 3.981290
20213 TF7133 52.86731 0.5389662 1996 Robin 0 3.532903
30213 TF9437 52.89543 0.8825978 2007 Blue Tit 8 7.028710
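One possible way to reproduce the quoted calculation in R is to average the variances within each year and take the square root. This is only a minimal sketch: it assumes every species SD is based on the same sample size (as in the quoted response), it rebuilds the SD table from (a) by hand under the hypothetical name sd_df, and pool_sd is a hypothetical helper, not an existing function.

```r
# SD table from (a), values as printed; sd_df is a hypothetical name
sd_df <- data.frame(
  Year = c(1994, 1998, 2004, 2009, 1994, 1999, 2005, 2010, 1995),
  Species = c(rep("Corn Bunting", 4), rep("Goldfinch", 4), "Grey Partridge"),
  Pop_Index = c(2.082483, 2.048155, 2.061617, 2.497792, 1.961236,
                1.995600, 2.101403, 2.138496, 2.162136)
)

# Pool SDs by averaging the variances (assumption: equal sample sizes)
pool_sd <- function(s) sqrt(mean(s^2))

# Same arithmetic as the quoted example: sqrt((.64 + .36 + .64) / 3)
pool_sd(c(0.8, 0.6, 0.8))

# One pooled SD per year, sorted by Year like the result in (f)
sd_by_year <- aggregate(Pop_Index ~ Year, sd_df, pool_sd)
sd_by_year
```

Since sd_by_year has one row per year, in the same order as (f), its Pop_Index column has the right length for error bars. Years with a single species simply keep that species' SD.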

Related

Cut values to intervals and plot a heatmap in ggplot2

Given a dataframe as follows:
df <- structure(list(year = c(2001L, 2001L, 2001L, 2001L, 2002L, 2002L,
2002L, 2002L, 2003L, 2003L, 2003L, 2003L), quater = c(1L, 2L,
3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), value = c(4L, 23L, 14L,
12L, 6L, 22L, 45L, 12L, 34L, 15L, 3L, 40L)), class = "data.frame", row.names = c(NA,
-12L))
Out:
year quater value
0 2001 1 4
1 2001 2 23
2 2001 3 14
3 2001 4 12
4 2002 1 6
5 2002 2 22
6 2002 3 45
7 2002 4 12
8 2003 1 34
9 2003 2 15
10 2003 3 3
11 2003 4 40
How could I plot a chart similar to the plot below:
Please note that year and quater in this dataset correspond to year and week in the plot above.
I first need to cut the value column into the intervals (0, 10], (10, 20], (20, 30], (30, 40], (40, 50], and then plot them.
The code I have tried:
library(ggplot2)
ggplot(df, aes(quater, year, fill = value)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "red")
Out:
As you can see, the legend is different to what I need.
Thanks for your help.
You should first use cut to get the classes (as Ronak Shah already mentioned) and then you can use scale_fill_brewer to change the color of the tiles.
library(tidyverse)
df %>%
mutate(class = cut(value, seq(0, 50, 10))) %>%
ggplot(aes(quater, year, fill = class) ) +
geom_tile() +
scale_fill_brewer(type = "seq",
direction = 1,
palette = "RdPu")
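To see what cut produces before it reaches the plot, the binning step can be checked on its own in base R. This is a small sketch that rebuilds the question's data by hand; the column name class follows the answer above, and breaks is just an illustrative variable name.

```r
# Same data as in the question, rebuilt by hand
df <- data.frame(
  year = rep(2001:2003, each = 4),
  quater = rep(1:4, times = 3),
  value = c(4, 23, 14, 12, 6, 22, 45, 12, 34, 15, 3, 40)
)

# Break points 0, 10, ..., 50 give right-closed intervals (0,10], (10,20], ...
breaks <- seq(0, 50, 10)
df$class <- cut(df$value, breaks)

# The factor levels become the legend entries of the heatmap
levels(df$class)

# How many cells fall into each interval
table(df$class)
```

Because class is a factor, ggplot2 treats the fill as discrete, which is why a discrete scale such as scale_fill_brewer is used instead of scale_fill_gradient.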

Regression for multiple countries over time

My data set looks as follows:
country year Var1 Var2 Var3 Var4
1 AT 2010 0.27246094 15 0 0
2 BE 2010 0.14729459 53 0 1
3 BG 2010 0.08744856 3 0 0
4 CY 2010 0.15369261 6 0 0
5 CZ 2010 0.20284360 6 0 1
6 DE 2010 0.12541694 37 0 0
7 AT 2011 0.35370741 16 0 0
8 BE 2011 0.14572864 54 0 0
9 BG 2011 0.11929461 4 0 0
10 CY 2011 0.24550898 7 0 1
11 CZ 2011 0.23333333 7 0 0
12 DE 2011 0.21943574 38 0 0
13 AT 2012 0.35073780 17 0 0
14 BE 2012 0.19700000 55 0 0
15 BG 2012 0.08472803 5 0 0
16 CY 2012 0.16949153 8 0 0
17 CZ 2012 0.26914661 8 0 0
18 DE 2012 0.22037422 39 0 0
19 AT 2013 0.34716599 18 0 1
20 BE 2013 0.28906250 56 0 0
21 BG 2013 0.14602216 6 0 1
22 CY 2013 0.44023904 9 0 0
23 CZ 2013 0.35146022 9 0 1
24 DE 2013 0.25500323 40 0 1
It covers 4 years for each of the 6 countries.
What I want to do is run a regression Var2 ~ Var1.
Since I have multiple years I considered using time series. So, first I changed the year column from character to date:
library(dplyr)
mutate(testdf, year = as.Date(year, format= "%Y"))
Then, I tried to run my regression and received this error:
library(plm)
reg1 <- plm(Var2 ~ Var1 + Var3 + Var4, data = df)
summary(reg1)
Error in pdim.default(index[[1]], index[[2]]) : duplicate couples (id-time)
Did I miss a step before running the regression or am I just using the wrong function?
I also tried to run the regression using the lmer function (using time and controlling for country differences):
library(lme4)
library(lmerTest)
reg2 <- lmer(Var2 ~ time(Var1) + Var3 + Var4 + (1 | country), data = df, REML = F)
summary(reg2)
Here I got a result, but I am completely unsure whether this is the way it should be done. Would this be a possibility or is it something different?
A date requires a month and a day; I suggest using the beginning of the year via ISOdate.
## Note: transform is from base R
testdf <- transform(testdf, year=as.Date(ISOdate(year, 1, 1)))
head(testdf, 3)
# country year Var1 Var2 Var3 Var4
# 1 AT 2010-01-01 0.27246094 15 0 0
# 2 BE 2010-01-01 0.14729459 53 0 1
# 3 BG 2010-01-01 0.08744856 3 0 0
In the plm call you probably want to define the index= and select a model=, see ?plm.
library(plm)
reg1 <- plm(Var2 ~ Var1 + Var3 + Var4, data=testdf, index=c("country", "year"),
model="random")
Result:
summary(reg1)
# Oneway (individual) effect Random Effect Model
# (Swamy-Arora's transformation)
#
# Call:
# plm(formula = Var2 ~ Var1 + Var3 + Var4, data = testdf, model = "random",
# index = c("country", "year"))
#
# Balanced Panel: n = 6, T = 4, N = 24
#
# Effects:
# var std.dev share
# idiosyncratic 0.8135 0.9019 0.001
# individual 615.6029 24.8113 0.999
# theta: 0.9818
#
# Residuals:
# Min. 1st Qu. Median 3rd Qu. Max.
# -1.416570 -0.789216 -0.064901 0.728004 1.392325
#
# Coefficients:
# Estimate Std. Error z-value Pr(>|z|)
# (Intercept) 18.47629 9.76600 1.8919 0.0585 .
# Var1 12.95722 2.84290 4.5577 5.171e-06 ***
# Var4 0.32221 0.40056 0.8044 0.4212
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Total Sum of Squares: 32.753
# Residual Sum of Squares: 15.806
# R-Squared: 0.5174
# Adj. R-Squared: 0.47144
# Chisq: 22.5147 on 2 DF, p-value: 1.2912e-05
Data:
testdf <- structure(list(country = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L,
5L, 6L), .Label = c("AT", "BE", "BG", "CY", "CZ", "DE"), class = "factor"),
year = c(2010L, 2010L, 2010L, 2010L, 2010L, 2010L, 2011L,
2011L, 2011L, 2011L, 2011L, 2011L, 2012L, 2012L, 2012L, 2012L,
2012L, 2012L, 2013L, 2013L, 2013L, 2013L, 2013L, 2013L),
Var1 = c(0.27246094, 0.14729459, 0.08744856, 0.15369261,
0.2028436, 0.12541694, 0.35370741, 0.14572864, 0.11929461,
0.24550898, 0.23333333, 0.21943574, 0.3507378, 0.197, 0.08472803,
0.16949153, 0.26914661, 0.22037422, 0.34716599, 0.2890625,
0.14602216, 0.44023904, 0.35146022, 0.25500323), Var2 = c(15L,
53L, 3L, 6L, 6L, 37L, 16L, 54L, 4L, 7L, 7L, 38L, 17L, 55L,
5L, 8L, 8L, 39L, 18L, 56L, 6L, 9L, 9L, 40L), Var3 = c(0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Var4 = c(0L, 1L, 0L, 0L,
1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 1L, 0L, 1L, 1L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13",
"14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24"
))
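As far as I understand the "duplicate couples (id-time)" message, it is raised when the two columns that plm treats as the individual and time index do not identify each row uniquely. A quick base R check along these lines may help before calling plm. It is only a sketch: panel is a hypothetical miniature of testdf with three countries, not the real data.

```r
# Hypothetical miniature of testdf: 3 countries x 4 years
panel <- expand.grid(country = c("AT", "BE", "BG"), year = 2010:2013,
                     stringsAsFactors = FALSE)

# Each (country, year) pair should occur exactly once
n_dup_pairs <- anyDuplicated(panel[c("country", "year")])
n_dup_pairs  # 0 means the pair identifies every row

# year on its own repeats across countries, so it cannot be an id by itself
year_repeats <- any(duplicated(panel$year))
year_repeats

# Converting the year to a Date (start of year) keeps the pairs unique
panel$year <- as.Date(ISOdate(panel$year, 1, 1))
any(table(panel$country, panel$year) > 1)
```

If the check reports duplicates for the columns passed to index=, the panel structure itself needs fixing before any model is fitted.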

Delete rows if column values are equal

I want to delete rows whose values in the columns (YEAR, POL, CTY, ID, AMOUNT) are equal to those of another row. Please see the output table below.
Table:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 36 3500 RANGE1 L0015N20W23
2017 30408 11 36 3500 RANGE1 L00210N20W24
2017 30408 11 36 3500 RANGE1 L00310N20W25
2017 30409 11 36 3500 RANGE1 L0015N20W23
2017 30409 11 35 3500 RANGE2 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
2017 30409 11 35 3500 RANGE3 NANANA
Output:
YEAR POL CTY ID AMOUNT RAN LEGAL
2017 30408 11 35 3500 RANGE1 L0015N20W23
You can try this:
no_duplicate_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")
new_df <- df[!duplicated(df[, no_duplicate_cols]), ]
The data frame new_df keeps the first row of each (YEAR, POL, CTY, ID, AMOUNT) combination in df and drops the later repeats.
If I understood the question correctly, then I think you can try this:
library(dplyr)
df %>%
group_by(YEAR, POL, CTY, ID, AMOUNT) %>%
filter(n() == 1)
Output (it seems that the output provided in the original question has a bit of a typo!):
# A tibble: 1 x 7
# Groups: YEAR, POL, CTY, ID, AMOUNT [1]
YEAR POL CTY ID AMOUNT RAN LEGAL
1 2017 30409 11 36 3500 RANGE1 L0015N20W23
#sample data
> dput(df)
structure(list(YEAR = c(2017L, 2017L, 2017L, 2017L, 2017L, 2017L,
2017L), POL = c(30408L, 30408L, 30408L, 30409L, 30409L, 30409L,
30409L), CTY = c(11L, 11L, 11L, 11L, 11L, 11L, 11L), ID = c(36L,
36L, 36L, 36L, 35L, 35L, 35L), AMOUNT = c(3500L, 3500L, 3500L,
3500L, 3500L, 3500L, 3500L), RAN = structure(c(1L, 1L, 1L, 1L,
2L, 3L, 3L), .Label = c("RANGE1", "RANGE2", "RANGE3"), class = "factor"),
LEGAL = structure(c(1L, 2L, 3L, 1L, 4L, 4L, 4L), .Label = c("L0015N20W23",
"L00210N20W24", "L00310N20W25", "NANANA"), class = "factor")), .Names = c("YEAR",
"POL", "CTY", "ID", "AMOUNT", "RAN", "LEGAL"), class = "data.frame", row.names = c(NA,
-7L))
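The two answers do different things, which a base R comparison makes visible. The sketch below rebuilds the sample data by hand (with plain character columns instead of factors). The names key_cols, first_of_each, repeated and unique_only are illustrative only.

```r
# Sample data from the question, rebuilt by hand
df <- data.frame(
  YEAR = rep(2017, 7),
  POL = c(30408, 30408, 30408, 30409, 30409, 30409, 30409),
  CTY = rep(11, 7),
  ID = c(36, 36, 36, 36, 35, 35, 35),
  AMOUNT = rep(3500, 7),
  RAN = c("RANGE1", "RANGE1", "RANGE1", "RANGE1", "RANGE2", "RANGE3", "RANGE3"),
  LEGAL = c("L0015N20W23", "L00210N20W24", "L00310N20W25", "L0015N20W23",
            "NANANA", "NANANA", "NANANA")
)
key_cols <- c("YEAR", "POL", "CTY", "ID", "AMOUNT")

# Variant 1: keep the first row of every key combination
first_of_each <- df[!duplicated(df[key_cols]), ]

# Variant 2: drop every row whose key occurs more than once
# (scan from both ends so the first occurrence is flagged too)
repeated <- duplicated(df[key_cols]) | duplicated(df[key_cols], fromLast = TRUE)
unique_only <- df[!repeated, ]

nrow(first_of_each)
unique_only
```

Variant 1 corresponds to the !duplicated() answer and variant 2 to the group_by() plus filter(n() == 1) answer. Only variant 2 leaves a single row, as in the requested output.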

Find out the item first time shows in a data set

I have a data set ProductTable, and I want to return the dates on which each ProductsFamily was ordered for the first time and for the last time. Example:
ProductTable
OrderPostingYear OrderPostingMonth OrderPostingDate ProductsFamily Sales QTY
2008 1 20 R1 5234 1
2008 1 12 R2 223 2
2009 1 30 R3 34 1
2008 2 1 R1 1634 3
2010 4 23 R3 224 1
2009 3 20 R1 5234 1
2010 7 12 R2 223 2
The result should look as follows:
OrderTime
ProductsFamily OrderStart OrderEnd SumSales
R1 2008/1/20 2009/3/20 12102
R2 2008/1/12 2010/7/12 446
R3 2009/1/30 2010/4/23 258
I have no idea how to do it. Any suggestions?
ProductTable <- structure(list(OrderPostingYear = c(2008L, 2008L, 2009L, 2008L,
2010L, 2009L, 2010L), OrderPostingMonth = c(1L, 1L, 1L, 2L, 4L,
3L, 7L), OrderPostingDate = c(20L, 12L, 30L, 1L, 23L, 20L, 12L
), ProductsFamily = structure(c(1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("R1",
"R2", "R3"), class = "factor"), Sales = c(5234L, 223L, 34L, 1634L,
224L, 5234L, 223L), QTY = c(1L, 2L, 1L, 3L, 1L, 1L, 2L)), .Names = c("OrderPostingYear",
"OrderPostingMonth", "OrderPostingDate", "ProductsFamily", "Sales",
"QTY"), class = "data.frame", row.names = c(NA, -7L))
We can also use dplyr/tidyr to do this. We arrange the columns, concatenate the 'Year:Date' columns with unite, group by 'ProductsFamily', get the first, last of 'Date' column and sum of 'Sales' within summarise.
library(dplyr)
library(tidyr)
ProductTable %>%
arrange(ProductsFamily, OrderPostingYear, OrderPostingMonth, OrderPostingDate) %>%
unite(Date,OrderPostingYear:OrderPostingDate, sep='/') %>%
group_by(ProductsFamily) %>%
summarise(OrderStart=first(Date), OrderEnd=last(Date), SumSales=sum(Sales))
# Source: local data frame [3 x 4]
# ProductsFamily OrderStart OrderEnd SumSales
# (fctr) (chr) (chr) (int)
# 1 R1 2008/1/20 2009/3/20 12102
# 2 R2 2008/1/12 2010/7/12 446
# 3 R3 2009/1/30 2010/4/23 258
You can first set up the date in a new column, and then aggregate your data using data.table package (you take the first and last date by ID, as well as the sum of sales):
library(data.table)
# First build up the date
ProductTable$date = with(ProductTable,
as.Date(paste(OrderPostingYear,
OrderPostingMonth,
OrderPostingDate, sep = "." ),
format = "%Y.%m.%d"))
# In a second step, aggregate your data
setDT(ProductTable)[,list(OrderStart = sort(date)[1],
OrderEnd = sort(date)[.N],
SumSales = sum(Sales))
,ProductsFamily]
# ProductsFamily OrderStart OrderEnd SumSales
#1: R1 2008-01-20 2009-03-20 12102
#2: R2 2008-01-12 2010-07-12 446
#3: R3 2009-01-30 2010-04-23 258

R: Assign colors to values/color gradient palette

I have a sample dataframe which looks like this:
reg1 <- structure(list(REGION = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("REG1", "REG2"), class = "factor"),STARTYEAR = c(1959L, 1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1945L, 1946L, 1947L, 1948L, 1949L), ENDYEAR = c(1960L, 1961L, 1962L, 1963L, 1964L, 1965L, 1966L, 1967L, 1968L, 1946L, 1947L, 1948L, 1949L, 1950L), Y_START = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 2L, 2L, 2L, 2L), Y_END = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L), COLOR_VALUE = c(-969L, -712L, -574L, -312L, -12L, 1L, 0L, -782L, -999L, -100L, 23L, 45L, NA, 999L)), .Names = c("REGION", "STARTYEAR", "ENDYEAR", "Y_START", "Y_END", "COLOR_VALUE"), class = "data.frame", row.names = c(NA, -14L))
REGION STARTYEAR ENDYEAR Y_START Y_END COLOR_VALUE
1 REG1 1959 1960 0 1 -969
2 REG1 1960 1961 0 1 -712
3 REG1 1961 1962 0 1 -574
4 REG1 1962 1963 0 1 -312
5 REG1 1963 1964 0 1 -12
6 REG1 1964 1965 0 1 1
7 REG1 1965 1966 0 1 0
8 REG1 1966 1967 0 1 -782
9 REG1 1967 1968 0 1 -999
10 REG2 1945 1946 2 3 -100
11 REG2 1946 1947 2 3 23
12 REG2 1947 1948 2 3 45
13 REG2 1948 1949 2 3 NA
14 REG2 1949 1950 2 3 999
I am creating a plot with the rect() function which works fine.
xx = unlist(reg1[, c(2, 3)])
yy = unlist(reg1[, c(4, 5)])
png(width=1679, height=1165, res=150)
if(any(xx < 1946)) {my_x_lim <- c(min(xx), 2014)} else {my_x_lim <- c(1946, 2014)}
plot(xx, yy, type='n', xlim = my_x_lim)
apply(reg1, 1, function(y)
rect(y[2], y[4], y[3], y[5]))
dev.off()
In my reg1 data I have a 6th column which contains values between +1000 and -1000. What I was wondering is if there is a method that I could colour the rectangles in my plot according to my color values. Low values should be blue, values around 0 should result in white and high values in red (if no value is present or NA, then grey should be plotted).
My question: How could I create a color palette that ranges from values 1000 to -1000 (from red over white to blue) and apply it to my plot so that each rectangle gets coloured according to the color value?
Here is how you get a color ramp and match it to the data frame.
my.colors<-colorRampPalette(c("blue", "white", "red")) #creates a function my.colors which interpolates n colors between blue, white and red
color.df<-data.frame(COLOR_VALUE=seq(-1000,1000,1), color.name=my.colors(2001)) #generates 2001 colors from the color ramp
reg1.with.color<-merge(reg1, color.df, by="COLOR_VALUE")
I can't help you with the rect() plotting, I've never used it
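For the rect() part, one possible approach is to skip the merge and index the palette directly, because rect() accepts vectors for its coordinates and for col. This is a sketch under a few assumptions: the data frame is a trimmed-down copy of reg1 with only the plotting columns, COLOR_VALUE stays within -1000 to 1000, and pal, idx and fill are illustrative names.

```r
# Trimmed-down copy of reg1 with only the columns needed here
reg1 <- data.frame(
  STARTYEAR = c(1959:1967, 1945:1949),
  ENDYEAR = c(1960:1968, 1946:1950),
  Y_START = c(rep(0, 9), rep(2, 5)),
  Y_END = c(rep(1, 9), rep(3, 5)),
  COLOR_VALUE = c(-969, -712, -574, -312, -12, 1, 0, -782, -999,
                  -100, 23, 45, NA, 999)
)

# One colour per integer from -1000 to 1000 (blue -> white -> red)
pal <- colorRampPalette(c("blue", "white", "red"))(2001)

# Shift values so that -1000 maps to index 1 and 1000 to index 2001
idx <- round(reg1$COLOR_VALUE) + 1001

# Missing values get grey, everything else its palette colour
fill <- ifelse(is.na(idx), "grey", pal[idx])

# Empty plot, then all rectangles in one vectorised rect() call
plot(range(reg1$STARTYEAR, reg1$ENDYEAR), range(reg1$Y_START, reg1$Y_END),
     type = "n", xlab = "Year", ylab = "")
rect(reg1$STARTYEAR, reg1$Y_START, reg1$ENDYEAR, reg1$Y_END, col = fill)
```

Indexing the palette avoids the merge, which would drop the row with the NA colour value unless all.x = TRUE is set, and it keeps the rows in their original order.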
