R Merging Boxplots - r

I am trying to use R to show a merged boxplot. I am sure this is easy; I am just missing something:
boxplot(WHO$Male, WHO$Female, ylim=c(0,100))
boxplot(WHO$Female ~ WHO$Year, ylim=c(0,100))
boxplot(WHO$Male ~ WHO$Year, ylim=c(0,100))
All three work, but when I try:
boxplot(WHO$Male ~ WHO$Year, WHO$Female ~ WHO$Year, ylim=c(0,100))
It returns:
Error in as.data.frame.default(data) :
cannot coerce class "formula" to a data.frame
Note that Year contains only three values: 1990, 2000 and 2010.
> head(WHO)
Year WHO.region Country Male Female
1 1990 Africa Algeria 66 68
2 1990 Africa Angola 39 43
3 1990 Africa Benin 45 50
4 1990 Africa Botswana 63 66
5 1990 Africa Burkina Faso 45 49
6 1990 Africa Burundi 47 50

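boxplot() accepts only one formula; its second positional argument is data, which is why the second formula triggers the coercion error. The reshape2 package can put the data into a shape that works with a single formula. There is also a very similar question, "Plot multiple boxplot in one graph", which may be helpful.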
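A minimal base R sketch of the single-formula approach follows. The WHO values here are made up for illustration, and the stacked data frame `long` with its `Sex` and `Value` columns is a hypothetical name, not something from the question.

```r
# Toy stand-in for the WHO data frame (values are invented for illustration)
WHO <- data.frame(
  Year   = rep(c(1990, 2000, 2010), each = 4),
  Male   = c(66, 39, 45, 63, 68, 41, 50, 52, 70, 49, 57, 62),
  Female = c(68, 43, 50, 66, 71, 44, 53, 51, 73, 52, 60, 61)
)

# Stack Male and Female into one value column with a Sex label
long <- data.frame(
  Year  = rep(WHO$Year, times = 2),
  Sex   = rep(c("Male", "Female"), each = nrow(WHO)),
  Value = c(WHO$Male, WHO$Female)
)

# One formula with two grouping variables draws one box per Sex-Year pair
bp <- boxplot(Value ~ Sex + Year, data = long, ylim = c(0, 100),
              col = c("lightpink", "lightblue"), las = 2)
```

The same long layout is what reshape2::melt() would produce, so either route should lead to the same plot.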

How to filter values within a threshold in R

I have a data set whose first 10 rows look like this:
country freq
Albania 2
Argentina 4
Australia 26
Austria 14
Belgium 22
Brazil 46
Bulgaria 2
Cambodia 2
Canada 37
Chile 19
I want to filter out counts (frequency) that are less than 30.
I tried this code:
dd %>%
group_by(freq) %>%
filter(n()<30)
The output was the same as the dataset. I did not get what I want.
How do I resolve this?
Thanks in advance
Use simple indexing. There is no need to group: n() counts the rows in each group, it does not test the value of freq.
dd <- dd[dd$freq >= 30, ]
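As a quick check, here is the indexing approach applied to the ten rows shown in the question. The name `kept` is just an illustrative choice.

```r
# The ten rows from the question
dd <- data.frame(
  country = c("Albania", "Argentina", "Australia", "Austria", "Belgium",
              "Brazil", "Bulgaria", "Cambodia", "Canada", "Chile"),
  freq    = c(2, 4, 26, 14, 22, 46, 2, 2, 37, 19)
)

# Keep only rows whose freq is at least 30
kept <- dd[dd$freq >= 30, ]
kept

# With dplyr loaded, the equivalent would be: dd %>% filter(freq >= 30)
```

Only Brazil (46) and Canada (37) survive the filter in this sample.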

ggplot2 + Date structure using scale X

I really need help here because I am way beyond lost.
I am trying to create a line chart showing several teams' performance over a year. I divided the year into quarters (1/1/12, 4/1/12, 8/1/12, 12/1/12) and loaded the CSV into R as a data frame.
Month Team Position
1 1/1/12 South Africa 56
2 1/1/12 Angola 85
3 1/1/12 Morocco 61
4 1/1/12 Cape Verde Islands 58
5 4/1/12 South Africa 71
6 4/1/12 Angola 78
7 4/1/12 Morocco 62
8 4/1/12 Cape Verde Islands 76
9 8/1/12 South Africa 67
10 8/1/12 Angola 85
11 8/1/12 Morocco 68
12 8/1/12 Cape Verde Islands 78
13 12/1/12 South Africa 87
14 12/1/12 Angola 84
15 12/1/12 Morocco 72
16 12/1/12 Cape Verde Islands 69
When I try using ggplot2 to generate the graph, the fourth quarter, 12/1/12, inexplicably moves to the second spot.
ggplot(groupA, aes(x=Month, y=Position, colour=Team, group=Team)) + geom_line()
I then put this plot into a variable GA in order to try to use scale_x to format the date:
GA + scale_x_date(labels = date_format("%m/%d"))
But I keep getting this Error:
Error in structure(list(call = match.call(), aesthetics = aesthetics, :
could not find function "date_format"
And if I run this code:
GA + scale_x_date()
I get this error:
Error: Invalid input: date_trans works with objects of class Date only
I am using a Mac OS X running R 2.15.2
Please help.
It's because df$Month (assuming your data.frame is df), which is a factor, has its levels in this order:
> levels(df$Month)
# [1] "1/1/12" "12/1/12" "4/1/12" "8/1/12"
The solution is to re-order the levels of your factor.
df$Month <- factor(df$Month, levels=df$Month[!duplicated(df$Month)])
> levels(df$Month)
# [1] "1/1/12" "4/1/12" "8/1/12" "12/1/12"
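Edit: Alternate solution using strptime
# You could convert Month first. scale_x_date() needs class Date, and
# strptime() alone returns POSIXlt, so wrap it in as.Date():
df$Month <- as.Date(strptime(as.character(df$Month), '%m/%d/%y'))
Then your code should work. Note that date_format() comes from the scales package, so load it with library(scales) before calling it.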
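To see both fixes side by side without plotting, here is a small base R sketch using the four date strings from the question. The variable names `month`, `ordered` and `dates` are illustrative only.

```r
# The four date strings from the question, in their intended order
month <- c("1/1/12", "4/1/12", "8/1/12", "12/1/12")

# Default factor levels are sorted as text, which moves 12/1/12 forward
levels(factor(month))

# Fix 1: set the levels explicitly in order of first appearance
ordered <- factor(month, levels = unique(month))
levels(ordered)

# Fix 2: convert to real dates, which sort chronologically
dates <- as.Date(month, format = "%m/%d/%y")
dates
```

The Date version is the one that scale_x_date() can work with; the factor version only fixes the ordering on a discrete axis.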

How to reverse coordinates on a line graph ggplot2 R

I'm working on a data visualization project and am making some line graphs. This is my data set:
groupA <- read.csv("afcongroupA.csv", header=T, row.names=NULL)
groupA
Date Team Position
1 1/12 South Africa 56
2 1/12 Angola 85
3 1/12 Morocco 61
4 1/12 Cape Verde Islands 58
5 4/12 South Africa 71
6 4/12 Angola 78
7 4/12 Morocco 62
8 4/12 Cape Verde Islands 76
9 8/12 South Africa 67
10 8/12 Angola 85
11 8/12 Morocco 68
12 8/12 Cape Verde Islands 78
13 12/12 South Africa 87
14 12/12 Angola 84
15 12/12 Morocco 72
16 12/12 Cape Verde Islands 69
I then plotted them on a line graph to show the rise or decline in position standings:
groupA$Date <- factor(groupA$Date, levels=groupA$Date[!duplicated(groupA$Date)])
ggplot(groupA, aes(x=Date, y=Position, colour=Team, group=Team)) + geom_line()
What I want to do is reverse the y-axis so that the largest number is at the bottom. I tried this bit of code:
groupA <- coord_flip() + scale_x_reverse()
But I get this error message:
Error in coord_flip() + scale_x_reverse() :
non-numeric argument to binary operator
I'm using R 2.15.2 on a Mac running OS X.
As your column Date is a factor, scale_x_reverse() won't work. One solution is to reverse the order of the factor levels in the data frame:
groupA$Date <- factor(groupA$Date, levels=rev(unique(groupA$Date)))
Then just use your code to make the plot and flip the axes.
ggplot(groupA, aes(x=Date, y=Position, colour=Team, group=Team)) +
geom_line()+coord_flip()
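If the goal is only to put the largest Position at the bottom, a possible alternative is to reverse the numeric y-axis directly with scale_y_reverse(). This is a sketch and not the answer's method; it rebuilds groupA from the values in the question. The original error most likely arose because coord_flip() and scale_x_reverse() were added to each other rather than to a ggplot object.

```r
library(ggplot2)

# Rebuild the data from the question, keeping dates in their original order
dates <- c("1/12", "4/12", "8/12", "12/12")
teams <- c("South Africa", "Angola", "Morocco", "Cape Verde Islands")
groupA <- data.frame(
  Date     = factor(rep(dates, each = 4), levels = dates),
  Team     = rep(teams, times = 4),
  Position = c(56, 85, 61, 58, 71, 78, 62, 76,
               67, 85, 68, 78, 87, 84, 72, 69)
)

# Position is numeric, so its axis can be reversed without touching the factor
p <- ggplot(groupA, aes(x = Date, y = Position, colour = Team, group = Team)) +
  geom_line() +
  scale_y_reverse()
p
```

Scale and coordinate components are always added to the plot itself, never assigned over the data frame.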

Calculate Concentration Index by Region and Year (panel data)

This is my first post, and I am stuck trying to build my first function, which calculates Herfindahl measures on firm gross output. I am using panel data (years 1998-2007) with firms as observations by year and region ("West", "Central", "East", "NE"), and I am having problems passing arguments through the function. I think I need to use two loops (one for time and one for region). Any help would be useful. I really don't want to have to subset my data 400+ times to get Herfindahl measures one at a time. Thanks in advance!
Below I provide: 1) my starter code (only returns one value); 2) desired output (two bins that contain the Herfindahl measures by 1) year and by 2) year-region); and 3) original data
1) My starter Code
myherf<- function (x, time, region){
time = year # variable is defined in my data and includes c(1998:2007)
region = region # Variable is defined in my data, c("West", "Central","East","NE")
for (i in 1:length(time)) {
for (j in 1:length(region)) {
herf[i,j] <- x/sum(x)
herf[i,j] <- herf[i,j]^2
herf[i,j] <- sum(herf[i,j])^1/2
}
}
return(herf[i,j])
}
myherf(extractiveoutput$x, i, j)
Error in herf[i, j] <- x/sum(x) : object 'herf' not found
2) My desired outcome is the following two vectors:
A. (1x10 vector)
Year herfindahl(yr)
1998 x
1999 x
...
2007 x
B. (1x40 vector)
Year Region herfindahl(yr-region)
1998 West x
1998 Central x
1998 East x
1998 NE x
...
2007 West x
2007 Central x
2007 East x
2007 NE x
3) Original Data
Obs. industry year region grossoutput
1 06 1998 Central 0.048804830
2 07 1998 Central 0.011222478
3 08 1998 Central 0.002851575
4 09 1998 Central 0.009515881
5 10 1998 Central 0.0067931
...
12 06 1999 Central 0.050861447
13 07 1999 Central 0.008421093
14 08 1999 Central 0.002034649
15 09 1999 Central 0.010651283
16 10 1999 Central 0.007766118
...
111 06 1998 East 0.036787413
112 07 1998 East 0.054958377
113 08 1998 East 0.007390260
114 09 1998 East 0.010766598
115 10 1998 East 0.015843418
...
436 31 2007 West 0.166044176
437 32 2007 West 0.400031011
438 33 2007 West 0.133472059
439 34 2007 West 0.043669662
440 45 2007 West 0.017904620
You can use the conc function from the ineq package. The solution becomes simple and fast with data.table.
library(ineq)
library(data.table)
# convert your data.frame into a data.table
setDT(df)
# calculate inequality of grossoutput by region and year
df[, .(inequality = conc(grossoutput, type = "Herfindahl")), by=.(region, year) ]
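For readers without ineq or data.table, here is a base R sketch of the same grouping idea. It assumes the Herfindahl measure is the sum of squared shares; the helper `herfindahl` and the toy `firms` data are hypothetical and invented for illustration. As an aside, in the starter code `sum(...)^1/2` is parsed by R as `(sum(...)^1)/2`, so it halves the sum and does not take a root.

```r
# Hypothetical helper: Herfindahl index as the sum of squared shares
herfindahl <- function(x) sum((x / sum(x))^2)

# Toy panel (values invented for illustration)
firms <- data.frame(
  year        = rep(c(1998, 1999), each = 4),
  region      = rep(c("Central", "East"), times = 4),
  grossoutput = c(1, 1, 1, 3, 2, 2, 2, 2)
)

# A. One value per year
by_year <- aggregate(grossoutput ~ year, data = firms, FUN = herfindahl)

# B. One value per year-region pair
by_year_region <- aggregate(grossoutput ~ year + region, data = firms,
                            FUN = herfindahl)
by_year
by_year_region
```

aggregate() does the looping over groups, so no explicit for loops or manual subsetting are needed.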

R: Calculating 5 year averages in panel data

I have a balanced panel by country from 1951 to 2007 in a data frame. I'd like to transform it into a new data frame of five year averages of my other variables. When I sat down to do this I realized the only way I could think to do this involved a for loop and then decided that it was time to come to stackoverflow for help.
So, is there an easy way to turn data that looks like this:
country country.isocode year POP ci grgdpch
Argentina ARG 1951 17517.34 18.445022145 3.4602044759
Argentina ARG 1952 17876.96 17.76066507 -7.887407586
Argentina ARG 1953 18230.82 18.365255769 2.3118720688
Argentina ARG 1954 18580.56 16.982113434 1.5693778844
Argentina ARG 1955 18927.82 17.488907008 5.3690276523
Argentina ARG 1956 19271.51 15.907756547 0.3125559183
Argentina ARG 1957 19610.54 17.028450999 2.4896639667
Argentina ARG 1958 19946.54 17.541597134 5.0025894968
Argentina ARG 1959 20281.15 16.137310492 -6.763501447
Argentina ARG 1960 20616.01 20.519539628 8.481742144
...
Venezuela VEN 1997 22361.80 21.923577413 5.603872759
Venezuela VEN 1998 22751.36 24.451736863 -0.781844721
Venezuela VEN 1999 23128.64 21.585034168 -8.728234466
Venezuela VEN 2000 23492.75 20.224310777 2.6828641218
Venezuela VEN 2001 23843.87 23.480311721 0.2476965412
Venezuela VEN 2002 24191.77 16.290691319 -8.02535946
Venezuela VEN 2003 24545.43 10.972153646 -8.341989049
Venezuela VEN 2004 24904.62 17.147693312 14.644028806
Venezuela VEN 2005 25269.18 18.805970212 7.3156977879
Venezuela VEN 2006 25641.46 22.191098769 5.2737381326
Venezuela VEN 2007 26023.53 26.518210052 4.1367897561
into something like this:
country country.isocode period AvPOP Avci Avgrgdpch
Argentina ARG 1 18230 17.38474 1.423454
...
Venezuela VEN 12 25274 21.45343 5.454334
Do I need to transform this data frame using a specific panel data package? Or is there another easy way to do this that I'm missing?
This is the stuff aggregate is made for:
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Level <- cut(Df$year, seq(1951, 1971, by=5), right=FALSE)
id <- c("var1","var2")
> aggregate(Df[id],list(Df$country,Level),mean)
Group.1 Group.2 var1 var2
1 Arg [1951,1956) 3 18
2 Ven [1951,1956) 53 68
3 Arg [1956,1961) 8 13
4 Ven [1956,1961) 58 63
5 Arg [1961,1966) 13 8
6 Ven [1961,1966) 63 58
7 Arg [1966,1971) 18 3
8 Ven [1966,1971) 68 53
The only thing you might want to do is rename the categories and the variable names.
For this type of problem, the plyr package is truly phenomenal. Here is some code that gives you what you want in essentially a single line of code plus a small helper function.
library(plyr)
library(zoo)
library(pwt)
# First recreate dataset, using package pwt
data(pwt6.3)
pwt <- pwt6.3[
pwt6.3$country %in% c("Argentina", "Venezuela"),
c("country", "isocode", "year", "pop", "ci", "rgdpch")
]
# Use rollmean() in zoo as the basis for a 5-period rolling mean
rollmean5 <- function(x){
rollmean(x, 5)
}
# Use ddply() in plyr package to create rolling average per country
pwt.ma <- ddply(pwt, .(country), numcolwise(rollmean5))
Here is the output from this:
> head(pwt, 10)
country isocode year pop ci rgdpch
ARG-1950 Argentina ARG 1950 17150.34 13.29214 7736.338
ARG-1951 Argentina ARG 1951 17517.34 18.44502 8004.031
ARG-1952 Argentina ARG 1952 17876.96 17.76067 7372.721
ARG-1953 Argentina ARG 1953 18230.82 18.36526 7543.169
ARG-1954 Argentina ARG 1954 18580.56 16.98211 7661.550
ARG-1955 Argentina ARG 1955 18927.82 17.48891 8072.900
ARG-1956 Argentina ARG 1956 19271.51 15.90776 8098.133
ARG-1957 Argentina ARG 1957 19610.54 17.02845 8299.749
ARG-1958 Argentina ARG 1958 19946.54 17.54160 8714.951
ARG-1959 Argentina ARG 1959 20281.15 16.13731 8125.515
> head(pwt.ma)
country year pop ci rgdpch
1 Argentina 1952 17871.20 16.96904 7663.562
2 Argentina 1953 18226.70 17.80839 7730.874
3 Argentina 1954 18577.53 17.30094 7749.694
4 Argentina 1955 18924.25 17.15450 7935.100
5 Argentina 1956 19267.39 16.98977 8169.456
6 Argentina 1957 19607.51 16.82080 8262.250
Note that rollmean(), by default, calculates the centred moving mean. You can modify this behaviour to get the left or right moving mean by passing this parameter to the helper function.
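A short sketch of what that parameter looks like in practice: rollmean() in zoo takes an `align` argument, and `fill = NA` pads the result back to the input length. The vector `x` and the result names are illustrative only.

```r
library(zoo)

x <- 1:10

# Default: centred window, result is shorter than the input
centred <- rollmean(x, 5)

# Right-aligned (trailing) window, padded with NA to keep the length
trailing <- rollmean(x, 5, align = "right", fill = NA)

# Left-aligned (leading) window, padded the same way
leading <- rollmean(x, 5, align = "left", fill = NA)

centred
trailing
leading
```

To use this with the helper above, rollmean5 could take an extra argument and pass it through to rollmean().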
EDIT:
@Joris Meys gently pointed out that you might in fact be after the average for five-year periods.
Here is the modified code to do this:
pwt$period <- cut(pwt$year, seq(1900, 2100, 5))
pwt.ma <- ddply(pwt, .(country, period), numcolwise(mean))
pwt.ma
And the output:
> pwt.ma
country period year pop ci rgdpch
1 Argentina (1945,1950] 1950.0 17150.336 13.29214 7736.338
2 Argentina (1950,1955] 1953.0 18226.699 17.80839 7730.874
3 Argentina (1955,1960] 1958.0 19945.149 17.42693 8410.610
4 Argentina (1960,1965] 1963.0 21616.623 19.09067 9000.918
5 Argentina (1965,1970] 1968.0 23273.736 18.89005 10202.665
6 Argentina (1970,1975] 1973.0 25216.339 19.70203 11348.321
7 Argentina (1975,1980] 1978.0 27445.430 23.34439 11907.939
8 Argentina (1980,1985] 1983.0 29774.778 17.58909 10987.538
9 Argentina (1985,1990] 1988.0 32095.227 15.17531 10313.375
10 Argentina (1990,1995] 1993.0 34399.829 17.96758 11221.807
11 Argentina (1995,2000] 1998.0 36512.422 19.03551 12652.849
12 Argentina (2000,2005] 2003.0 38390.719 15.22084 12308.493
13 Argentina (2005,2010] 2006.5 39831.625 21.11783 14885.227
14 Venezuela (1945,1950] 1950.0 5009.006 41.07972 7067.947
15 Venezuela (1950,1955] 1953.0 5684.009 44.60849 8132.041
16 Venezuela (1955,1960] 1958.0 6988.078 37.87946 9468.001
17 Venezuela (1960,1965] 1963.0 8451.073 26.93877 9958.935
18 Venezuela (1965,1970] 1968.0 10056.910 28.66512 11083.242
19 Venezuela (1970,1975] 1973.0 11903.185 32.02671 12862.966
20 Venezuela (1975,1980] 1978.0 13927.882 36.35687 13530.556
21 Venezuela (1980,1985] 1983.0 16082.694 22.21093 10762.718
22 Venezuela (1985,1990] 1988.0 18382.964 19.48447 10376.123
23 Venezuela (1990,1995] 1993.0 20680.645 19.82371 10988.096
24 Venezuela (1995,2000] 1998.0 22739.062 20.93509 10837.580
25 Venezuela (2000,2005] 2003.0 24550.973 17.33936 10085.322
26 Venezuela (2005,2010] 2006.5 25832.495 24.35465 11790.497
Use cut on your year variable to make the period variable, then use melt and cast from the reshape package to get the averages. Many other answers show how; see https://stackoverflow.com/questions/tagged/r+reshape
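A minimal sketch of the cut, melt and cast route, assuming the newer reshape2 package (where the cast step is dcast()) and reusing the toy data from the aggregate answer. The names `molten` and `avg` are illustrative.

```r
library(reshape2)

# Toy data, same shape as in the aggregate answer
Df <- data.frame(
  year    = rep(1951:1970, 2),
  country = rep(c("Arg", "Ven"), each = 20),
  var1    = c(1:20, 51:70),
  var2    = c(20:1, 70:51)
)

# Five-year periods, closed on the left
Df$period <- cut(Df$year, seq(1951, 1971, by = 5), right = FALSE)

# Melt to long form, then cast back with mean as the aggregation function
molten <- melt(Df, id.vars = c("country", "period"),
               measure.vars = c("var1", "var2"))
avg <- dcast(molten, country + period ~ variable, fun.aggregate = mean)
avg
```

With the original reshape package the last step would be cast() in place of dcast(), with the same formula.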
There is a base stats and a plyr answer, so for completeness, here is a dplyr based answer. Using the toy data given by Joris, we have
Df <- data.frame(
year=rep(1951:1970,2),
country=rep(c("Arg","Ven"),each=20),
var1 = c(1:20,51:70),
var2 = c(20:1,70:51)
)
Now, using cut to create the periods, we can then group on them and get the means:
library(dplyr)
Df %>% mutate(period = cut(year, seq(1951, 1971, by = 5), right = FALSE)) %>%
group_by(country, period) %>% summarise(V1 = mean(var1), V2 = mean(var2))
Source: local data frame [8 x 4]
Groups: country
country period V1 V2
1 Arg [1951,1956) 3 18
2 Arg [1956,1961) 8 13
3 Arg [1961,1966) 13 8
4 Arg [1966,1971) 18 3
5 Ven [1951,1956) 53 68
6 Ven [1956,1961) 58 63
7 Ven [1961,1966) 63 58
8 Ven [1966,1971) 68 53
