Very simple bar graph - r

First, I have to apologize for my ignorance, as I'm sure this is a very simple question, but I am very new to R. My question is that I have a data frame that looks like this:
countrydes  Insured
USA            4500
CANADA         4500
USA            7500
CANADA         7600
All I want to do is aggregate the sum of the insured value by country and produce a bar graph, e.g.:

countrydes  Insured
USA           12000
CANADA        12100
Many thanks.

This will do the trick:
# Define your data
dfr <- data.frame(
  countrydes = rep(c("USA", "CANADA"), 2),
  Insured = c(4500, 4500, 7500, 7600)
)
# Sum by country
totals <- with(dfr, by(Insured, countrydes, sum))
# Plot the answer
barplot(totals)
(As Etiennebr mentioned, you could use aggregate instead of by, in which case you need to coerce countrydes to be a list.)
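For reference, a minimal sketch of that variant; the list() wrapper is the coercion mentioned above, and the country label is my own:
totals2 <- aggregate(dfr$Insured, by = list(country = dfr$countrydes), FUN = sum)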

You could simply sum each one separately. Let's call your data frame df:
USA <- sum(df[df$countrydes == "USA", ]$Insured)
CANADA <- sum(df[df$countrydes == "CANADA", ]$Insured)
But with aggregate() you can handle all the countries in one line.
aggsumcount <- aggregate(x = df$Insured, by = list(df$countrydes), FUN = sum)
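Since the question also asked for a bar graph, here is a minimal sketch of plotting the aggregate result (Group.1 and x are the default column names that aggregate() produces):
barplot(aggsumcount$x, names.arg = aggsumcount$Group.1, ylab = "Insured")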

Related

How to calculate growth rate in R? [duplicate]

I have a data frame and would like to calculate the growth rate of nominal GDP in R. I know how to do it in Excel with the formula ((GDP of this year - GDP of last year) / (GDP of last year)) * 100. What command could be used in R to calculate it?
year  nominal gdp
2003    7696034.9
2004    8690254.3
2005    9424601.9
2006   10520792.8
2007   11399472.2
2008   12256863.6
2009   12072541.6
2010   13266857.9
2011   14527336.9
2012   15599270.7
2013   16078959.8
You can also use the lag() function from dplyr. It gives the previous values of a vector. Here is an example:
library(dplyr)

data <- data.frame(year = c(2003:2013),
                   gdp = c(7696034.9, 8690254.3, 9424601.9, 10520792.8,
                           11399472.2, 12256863.6, 12072541.6, 13266857.9,
                           14527336.9, 15599270.7, 16078959.8))

growth_rate <- function(x) (x / lag(x) - 1) * 100
data$growth_rate <- growth_rate(data$gdp)
It's probably best for you to get familiar with the data.table package, and do something like this:
library(data.table)
dt_gdp <- data.table(df)
dt_gdp[, growth_rate_of_gdp := 100 * (Producto.interno.bruto..PIB. - shift(Producto.interno.bruto..PIB.)) /
         shift(Producto.interno.bruto..PIB.)]
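(The Producto.interno.bruto..PIB. column name comes from the answerer's own dataset. Adapted to the example data frame defined above, the same idea would look roughly like this:)
library(data.table)
dt_gdp <- data.table(data)  # 'data' as defined in the previous answer
dt_gdp[, growth_rate := 100 * (gdp - shift(gdp)) / shift(gdp)]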
A base-R solution:
with(data,
     c(NA,                ## pad with NA: growth rate unknown in year 1
       diff(gdp) /        ## this is gdp(t) - gdp(t-1)
         head(gdp, -1))   ## gdp(t-1)
     * 100)               ## scale to percentage growth
head(gdp, -1) is perhaps a little too clever. gdp[-length(gdp)] (i.e. "gdp, excluding the last value") would be slightly more idiomatic.
Or
(gdp/c(NA,gdp[-length(gdp)])-1)*100

How to reproduce this graph?

Here is my code:
library(rvest)
library(dplyr)
library(tidyr)

col_link <- "https://ourworldindata.org/famines#famines-by-world-region-since-1860"
col_page <- read_html(col_link)
col_table <- col_page %>%
  html_nodes("table#tablepress-73") %>%
  html_table() %>%
  .[[1]]

new_data <- col_table %>%
  select(Year, Country, `Excess Mortality midpoint`)
new_data
I would like to arrange the years and countries in such a way that I can use them in a graph, but I can't. My objective is to reproduce this graph:
[figure from the linked Our World in Data page: famine deaths by world region since 1860]
My problem is that in the "Year" column, some entries span several years for a country. For example, to show that the famine lasted from 1846 to 1852 in Ireland, it says "1846-52", and I cannot use the data in this form for a graph.
   Year       Country            `Excess Mortality midpoint`
   <chr>      <chr>              <chr>
 1 1846–52    Ireland            1,000,000
 2 1860-1     India              2,000,000
 3 1863-67    Cape Verde         30,000
 4 1866-7     India              961,043
 5 1868       Finland            100,000
 6 1868-70    India              1,500,000
 7 1870–1871  Persia (now Iran)  1,000,000
 8 1876–79    Brazil             750,000
 9 1876–79    India              7,176,346
10 1877–79    China              11,000,000
# ... with 67 more rows
I think this is more a question of data than of R programming. You could try matching the year periods to decades; however, if a year range spans several decades, the data should be split up in some way (e.g. a simple proportional split) to accommodate that. If the chart you linked to was made from this data, some assumptions had to be made to adjust it, and without knowing those assumptions you won't be able to reproduce the chart exactly. As a starting point, the sketch below parses the year ranges into numeric start and end years.
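Here is a rough sketch (using the new_data tibble from the question; the start, end, and decade column names are my own) of one way to parse ranges like "1846-52" into numbers. It handles both "-" and the en dash, and end years abbreviated to one or two digits:

library(dplyr)
library(tidyr)

new_data_years <- new_data %>%
  # split "1846-52" into start = "1846", end = "52"; single years get end = NA
  separate(Year, into = c("start", "end"), sep = "-|\u2013", fill = "right") %>%
  mutate(start = as.numeric(start),
         # single years: end = start; abbreviated ends ("52" -> 1852, "1" -> 1861):
         # borrow the missing leading digits from the start year
         end = ifelse(is.na(end), start,
                      ifelse(nchar(end) < 4,
                             floor(start / 10^nchar(end)) * 10^nchar(end) + as.numeric(end),
                             as.numeric(end))),
         decade = floor(start / 10) * 10)

From there, each famine can be assigned to a decade (or split proportionally across the decades it spans, per the caveat above) and summed by region.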

approach to cut dataset to make a new factor variable

Currently, I am trying to cut the dataset into three parts: developed, developing, and under-developed. The cut points are quantiles: developed would be countries above the 75% quantile, developing between the 50% and 75% quantiles, and under-developed below the 50% quantile. However, the quantiles differ by year.
data = data.frame("country" = c("U.S.A","U.S.A","Jamaica","Jamaica","Congo","Congo"),
"year" = c(2000,2001,2000,2001,2000,2001),
"gdp_per_capita" = c(30000,40000,100,200,50,60))
quantiles = do.call("data.frame",
tapply(data$gdp_per_capita, data$year, quantile))
What I did was calculate the quantiles by year, which gave me a data frame with just that information. Now I am trying to use this information to apply the above criteria for each year. For example (note the cut points change by year):

2000: (50% = 3000, 75% = 15999)
2001: (50% = 5000, 75% = 18000)
Possible results:

year  country   gdp_per_capita  status
2000  U.S.      1800000         "developed"
2000  France    200000          "developed"
...   (more than 500 obs.)
2000  Kenya     300             "under-developed"
2000  Malaysia  1500            "developing"
2001  Malaysia  3000            "developing"
2001  Kenya     500             "under-developed"
2001  Spain     30000           "developed"
2000  India     300             "under-developed"
2001  India     5100            "developing"
What would be the most efficient way to resolve this?
I tried using ifelse and doing it one by one. That seems like too much work, and I felt there was no reason to use a computer if I was going to iterate through them one by one.
Instead of data.frame, consider rbind in do.call to create the quantile percents as columns, then merge that result to the original dataset by year. Finally, calculate status with nested ifelse conditional logic.
### QUANTILES
quantiles_matrix <- do.call("rbind", tapply(data$gdp_per_capita, data$year, quantile))
quantiles_df <- transform(data.frame(quantiles_matrix),
                          year = row.names(quantiles_matrix))

### MERGE
mdf <- merge(data, quantiles_df, by = "year")

### STATUS COLUMN ASSIGNMENT
final_df <- transform(mdf,
                      status = ifelse(gdp_per_capita > X75., "developed",
                              ifelse(gdp_per_capita >= X50. & gdp_per_capita <= X75., "developing",
                              ifelse(gdp_per_capita < X50., "under-developed", NA))))
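As an alternative sketch (not part of the answer above), base R's ave() can apply cut() within each year, using that year's own quantiles as break points. This assumes the 50% and 75% quantiles are distinct within each year, and the handling of values exactly at a cut point differs slightly from the ifelse version:
data$status <- ave(data$gdp_per_capita, data$year, FUN = function(x) {
  q <- quantile(x, c(0.5, 0.75))
  # breaks are -Inf, 50% quantile, 75% quantile, Inf for this year's values
  as.character(cut(x, breaks = c(-Inf, q, Inf),
                   labels = c("under-developed", "developing", "developed")))
})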

Averaging rows based upon a known, irregular relationship using R

I have data on energy companies whose jurisdiction overlaps in places. I want to be able to compute an average of sales for the places where these companies overlap. These companies will always overlap - so how can I use this information to calculate the averages just for those pairs? There are about 20 pairs of companies.
data <- data.frame(Company = c("Energy USA", "Good Energy", "Hydropower 4 U", "Coal Town",
                               "Energy USA/Good Energy", "Good Energy/Coal Town"),
                   Sales = c(100, 2500, 550, 6000, "?", "?"))
                 Company  Sales
1             Energy USA    100
2            Good Energy   2500
3         Hydropower 4 U    550
4              Coal Town   6000
5 Energy USA/Good Energy      ?   (Answer: 1300)
6  Good Energy/Coal Town      ?   (Answer: 4250)
We use grepl to get the index of 'Company' elements that have more than one entry, i.e. separated by '/'. Then we split those elements by the delimiter (the output will be a list), loop through the list with sapply, match the elements against the 'Company' column to get their positions, and use those to get the corresponding 'Sales' elements. As the 'Sales' column is a factor, we need to convert it to numeric to take the mean. When we convert a factor to numeric, all non-numeric elements (i.e. ?) become NA. Finally, we replace those NA elements with the mean values.
# flag rows whose Company contains more than one company (separated by '/')
i1 <- grepl('/', data$Company)
# for each combined row, look up the individual companies' Sales and average them
v1 <- sapply(strsplit(as.character(data$Company[i1]), '/'),
             function(x) mean(as.numeric(as.character(data$Sales[match(x, data$Company)]))))
# coerce Sales to numeric; the "?" entries become NA ...
data$Sales <- as.numeric(as.character(data$Sales))
# ... and are filled with the computed means
data$Sales[is.na(data$Sales)] <- v1
data
# Company Sales
#1 Energy USA 100
#2 Good Energy 2500
#3 Hydropower 4 U 550
#4 Coal Town 6000
#5 Energy USA/Good Energy 1300
#6 Good Energy/Coal Town 4250
Without knowing what your original data looks like, it is hard to give a working answer. However, assuming your data has Company and Sales columns with multiple rows for each company, you can do something like this:
mean(data$Sales[data$Company %in% c('Energy USA', 'Good Energy')])
mean(data$Sales[data$Company %in% c('Good Energy', 'Coal Town')])
You could create a new column "jurisdiction" in data, if your dataset is rather small:
MeansByJurisdiction <- tapply(data$Sales, data$jurisdiction, mean)
Then you could convert the vector to a data frame:
MeansByJurisdiction <- data.frame(MeansByJurisdiction)
The row names of the MeansByJurisdiction data frame will be populated with the jurisdictions, and you can extract them with a simple line of code:
MeansByJurisdiction$jurisdictions <- row.names(MeansByJurisdiction)
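For that to work, the data would need to be in long format with an explicit jurisdiction label. Here is a minimal hypothetical sketch (the data_long name and the "A"/"B" jurisdiction labels are made up for illustration):
# Hypothetical long format: one row per company per shared jurisdiction
data_long <- data.frame(jurisdiction = c("A", "A", "B", "B"),
                        Sales = c(100, 2500, 2500, 6000))
tapply(data_long$Sales, data_long$jurisdiction, mean)
#    A    B
# 1300 4250
This reproduces the 1300 and 4250 averages expected in the question.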

R: Percentile calculations on subsets of data

I have a data set which contains the following identifiers, an rscore, gvkey, sic2, year, and cdom. What I am looking to do is calculate percentile ranks based on summed rscores for all temporal spans (~1500) for a given gvkey, and then calculate percentile ranks in a given temporal time span and sic2 based on gvkey.
Calculating the percentiles for all temporal time spans is a fairly quick process; however, once I add in calculating the sic2 percentile ranks it becomes fairly slow, and we are likely looking at about 65,000 subsets in total. I'm wondering if there is a way to speed up this process.
The data for one temporal time span looks like the following
gvkey  sic2  cdom   rscoreSum  pct
 1187    10  USA    8.00E-02   0.942268617
 1265    10  USA   -1.98E-01   0.142334654
 1266    10  USA    4.97E-02   0.88565478
 1464    10  USA   -1.56E-02   0.445748247
 1484    10  USA    1.40E-01   0.979807985
 1856    10  USA   -2.23E-02   0.398252565
 1867    10  USA    4.69E-02   0.8791019
 2047    10  USA   -5.00E-02   0.286701209
 2099    10  USA   -1.78E-02   0.430915371
 2127    10  USA   -4.24E-02   0.309255308
 2187    10  USA    5.07E-02   0.893020421
The code to calculate the industry ranks is below, and fairly straightforward.
library(plyr)  # for ddply

# generate 2-digit industry SIC percentile ranks
dout <- ddply(dfSum, .(sic2), function(x) {
  indPct <- rank(x$rscoreSum) / nrow(x)
  data.frame(gvkey = x$gvkey, indPct)
})

# merge 2-digit industry SIC percentile ranks with market percentile ranks
dfSum <- merge(dfSum, dout, by = "gvkey")
names(dfSum)[2] <- 'sic2'
Any suggestions to speed the process would be appreciated!
You might try the data.table package for fast operations across relatively large datasets like yours. For example, my machine has no problem working through this:
library(data.table)

# Create a dataset like yours, but bigger
n.rows <- 2e6
n.sic2 <- 1e4
dfSum <- data.frame(gvkey = seq_len(n.rows),
                    sic2 = sample.int(n.sic2, n.rows, replace = TRUE),
                    cdom = "USA",
                    rscoreSum = rnorm(n.rows))

# Now make your dataset into a data.table
dfSum <- data.table(dfSum)

# Calculate the percentiles
# Note that there is no need to re-assign the result
dfSum[, indPct := rank(rscoreSum) / length(rscoreSum), by = "sic2"]
whereas the plyr equivalent takes a while.
If you like the plyr syntax (I do), you may also be interested in the dplyr package, which is billed as "the next generation of plyr", with support for faster data stores in the backend.
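For reference, a rough sketch of what the grouped calculation might look like in dplyr (same rank-percentile logic; the timing comments above refer to the data.table version):
library(dplyr)
dfSum <- dfSum %>%
  group_by(sic2) %>%
  mutate(indPct = rank(rscoreSum) / n()) %>%
  ungroup()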
