Ordering a Data Frame By 2 Parameters, Then Plotting - r

I have a data frame with GDP values for 12 South American countries over ~40 years. A snippet of the frame is as follows:
168 Chile 1244.1799 1972
169 Chile 4076.3207 1994
170 Chile 3474.7172 1992
171 Chile 2928.1562 1991
172 Chile 6143.7276 2004
173 Colombia 882.5687 1976
174 Colombia 1094.8795 1977
175 Colombia 5403.4557 2008
176 Colombia 2376.8022 2002
177 Colombia 2047.9784 1993
1) I want to order the data frame by country. The first ~40 values should pertain to Argentina, then next ~40 to Bolivia, etc.
2) Within each country grouping, I want to order by year. The first 3 rows should pertain to Argentina 2012, Argentina 2011, Argentina 2010, etc.
I can grab the data for each country individually using subset(), and then order it with order(). Surely I don't have to do this for every country and then use rbind()? How do I do it in one fell swoop?
3) Once I have the final product, I'd like to create 12 small, individual line graphs stacked vertically, each pertaining to a different country, showing the trend of that country's GDP over the ~40 years. How do I create such a plot?
I'm sure I could find info on the 3rd question myself, but, well, I don't even know what such a graph is called in the first place.

Here is a solution with ggplot2. Assuming your data is in df:
library(ggplot2)
df$year.as.date <- as.Date(paste0(df$year, "-01-01")) # convert year to date
ggplot(df, aes(x=year.as.date, y=gdp)) +
geom_line() + facet_grid(country ~ .)
You don't actually need to sort by year and country for the plot; ggplot will handle that for you. Here is the data (only 5 countries and 12 years, but the same code will work for your data). The third line also shows how to sort by two columns:
countries <- c("ARG", "BRA", "CHI", "PER", "URU")
df <- data.frame(country=rep(countries, 12), year=rep(2001:2012, each=5), gdp=runif(60))
df <- df[order(df$country, df$year),] # <- we sort here
df$gdp <- df$gdp + 1:12 / 2
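The question asks for the years in descending order within each country (2012, 2011, 2010, ...), whereas order() sorts ascending by default. A minimal sketch of that variant, assuming year is numeric so it can be negated; the toy rows are copied from the question's snippet and the name toy is only for illustration:

```r
# Toy rows taken from the question's snippet (Chile and Colombia only)
toy <- data.frame(
  country = c("Colombia", "Chile", "Colombia", "Chile"),
  gdp     = c(882.5687, 1244.1799, 5403.4557, 4076.3207),
  year    = c(1976, 1972, 2008, 1994)
)

# Country ascending, then year descending within each country.
# Negating the key reverses the order for numeric columns.
toy_sorted <- toy[order(toy$country, -toy$year), ]
print(toy_sorted)
```

An equivalent spelling is order(toy$country, toy$year, decreasing = c(FALSE, TRUE), method = "radix"), which also works when the second key is not numeric.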

Related

ggplot doesn't arrange my graph as expected [duplicate]

Good morning,
I'm trying to use ggplot with a data frame, but I've run into an issue: the plot ignores the arrange() call I applied to the data frame.
Here is my code:
library(tidyverse) # dplyr (%>%, arrange), ggplot2, and tidyr's `population` dataset
pop <- population[population$year == 1995, ]
pop <- pop[1:10, ]
pop %>%
ggplot(aes(x = country, y = population)) +
geom_point()
pop <- pop %>%
arrange(population)
pop %>%
ggplot(aes(x = country, y = population)) +
geom_point()
I would like the graph to be ordered by population: the country with the lowest population first, the country with the second lowest population second, and so on. But ggplot doesn't order the graph as I expected.
I have this data frame :
country year population
<chr> <int> <int>
1 Anguilla 1995 9807
2 American Samoa 1995 52874
3 Andorra 1995 63854
4 Antigua and Barbuda 1995 68349
5 Armenia 1995 3223173
6 Albania 1995 3357858
7 Angola 1995 12104952
8 Afghanistan 1995 17586073
9 Algeria 1995 29315463
10 Argentina 1995 34833168
But the points on my graph are in alphabetical order by country.
Do you have any idea how to order them by population instead?
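ggplot2 places a discrete x axis in the order of the factor levels (alphabetical for a character column), not in row order, which is why arrange() has no visible effect. One common approach, sketched here on three rows from the table above, is to reorder the levels by population with reorder(); the object names pop3 and p are only for illustration:

```r
library(ggplot2)

# Three rows from the question's table, deliberately not in alphabetical order
pop3 <- data.frame(
  country    = c("Anguilla", "American Samoa", "Andorra"),
  population = c(9807, 52874, 63854)
)

# reorder() returns a factor whose levels are sorted by population,
# so the x axis follows population rather than the alphabet
p <- ggplot(pop3, aes(x = reorder(country, population), y = population)) +
  geom_point() +
  labs(x = "country")
print(p)
```

Because the ordering is carried by the factor levels inside aes(), no prior arrange() call is needed.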

How can I choose the countries that have number of points in the top 25% of the distribution of number of datapoints with subset?

I have to select the countries that have a number of points in the top 25% of the distribution of number of datapoints using function subset & quantiles with the %in% operator.
My dataset has this form
head(drugs1)
LOCATION TIME PC_HEALTHXP PC_GDP USD_CAP TOTAL_SPEND
1 AUS 1971 15.992 0.727 35.720 462.11
2 AUS 1972 15.091 0.686 36.056 475.11
3 AUS 1973 15.117 0.681 39.871 533.47
4 AUS 1974 14.771 0.755 47.559 652.65
5 AUS 1975 11.849 0.682 47.561 660.76
6 AUS 1976 10.920 0.630 46.908 658.26
where the first column is the country and the second is the year of each data point, so each country has one row per year in which it appears.
I tried to apply the command
a<-subset(drugs1, quantile(drugs1$TIME, 0.25),1)
but the results are NULL.
Can you help me with this?
Start by figuring out the number of datapoints for each country using table().
n <- table(drugs1$LOCATION) # column names are case-sensitive
Find the 75th percentile of the number of datapoints (the cutoff for the top 25%).
q <- quantile(n, .75)
Find the countries that have more than q datapoints.
countries <- names(n)[n > q]
Subset the original data to only include countries in countries.
drugs2 <- subset(drugs1, LOCATION %in% countries)

I want to use R to sample my timestamped dataframe

I want to use R to sample my dataframe. My data is timestamped epidemiological data, and I want to randomly sample at least 1 and as many as 10 records for each year, preferably in a manner that is scaled to the number of records for each year. I would like to export the results as a csv.
Here are a few lines of my dataset, where I've left off the long genetic sequence field for each record.
year matrix USD clade
1958 W mG018U UP
1958 W mG018U UP
1958 W mG018U UP
1966 UN mG140L LL
1969 UN mG207L LL
1969 UN mG013L LL
1971 UN mG208L LL
1972 HA mG129M MN
1973 C1 mG018U UP
1973 NA mG001U UC
1973 NA mG001U UC
All I've learned to do is
sample(mydata, size = 600, replace = FALSE)
which doesn't of course take the year into account.
There are many ways to run sample per group (for example, sample_n in the dplyr package). Here's an illustration using the data.table package.
You can sample a fixed fraction of each year's records, say 0.1, so that the sample size is relative to the number of records in that year. Wrap the result in ceiling so that you get at least one record when the fraction comes out below 1, and cap it at 10 per group with the min function, for example
library(data.table)
setDT(df)[, .SD[sample(.N, min(10, ceiling(.N*.1)))], year]
# year matrix USD clade
#1: 1958 W mG018U UP
#2: 1966 UN mG140L LL
#3: 1969 UN mG013L LL
#4: 1971 UN mG208L LL
#5: 1972 HA mG129M MN
#6: 1973 NA mG001U UC
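The question also asks for a CSV export. A minimal base R sketch of the same per-year rule (10% of the records, at least 1, at most 10) followed by write.csv; the rows are those from the question, while the helper name sample_year and the output path are assumptions for illustration:

```r
set.seed(1)  # only to make the illustration reproducible

# Rows from the question ("NA" in the matrix column is a label, not a missing value)
mydata <- data.frame(
  year   = c(1958, 1958, 1958, 1966, 1969, 1969, 1971, 1972, 1973, 1973, 1973),
  matrix = c("W", "W", "W", "UN", "UN", "UN", "UN", "HA", "C1", "NA", "NA"),
  USD    = c("mG018U", "mG018U", "mG018U", "mG140L", "mG207L", "mG013L",
             "mG208L", "mG129M", "mG018U", "mG001U", "mG001U"),
  clade  = c("UP", "UP", "UP", "LL", "LL", "LL", "LL", "MN", "UP", "UC", "UC")
)

# Hypothetical helper: pick between 1 and 10 row indices from one year's rows
sample_year <- function(rows) {
  n <- length(rows)
  k <- min(10, ceiling(n * 0.1))
  rows[sample.int(n, k)]  # index by position so a single row is handled safely
}

keep    <- unlist(lapply(split(seq_len(nrow(mydata)), mydata$year), sample_year))
sampled <- mydata[keep, ]

out_file <- file.path(tempdir(), "sampled.csv")  # assumed output location
write.csv(sampled, out_file, row.names = FALSE)
```

With the data.table result, fwrite() from the same package would do the export in one call.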

How to get column mean for specific rows only?

I need to get the mean of one column (here: score) for specific rows (here: years). Specifically, I would like to know the average score for three periods:
period 1: year <= 1983
period 2: year >= 1984 & year <= 1990
period 3: year >= 1991
This is the structure of my data:
country year score
Algeria 1980 -1.1201501
Algeria 1981 -1.0526943
Algeria 1982 -1.0561565
Algeria 1983 -1.1274560
Algeria 1984 -1.1353926
Algeria 1985 -1.1734330
Algeria 1986 -1.1327666
Algeria 1987 -1.1263586
Algeria 1988 -0.8529455
Algeria 1989 -0.2930265
Algeria 1990 -0.1564207
Algeria 1991 -0.1526328
Algeria 1992 -0.9757842
Algeria 1993 -0.9714060
Algeria 1994 -1.1422258
Algeria 1995 -0.3675797
...
The calculated mean values should be added to the df in an additional column ("mean"), i.e. same mean value for years of period 1, for those of period 2 etc.
This is how it should look:
country year score mean
Algeria 1980 -1.1201501 -1.089
Algeria 1981 -1.0526943 -1.089
Algeria 1982 -1.0561565 -1.089
Algeria 1983 -1.1274560 -1.089
Algeria 1984 -1.1353926 -0.839
Algeria 1985 -1.1734330 -0.839
Algeria 1986 -1.1327666 -0.839
Algeria 1987 -1.1263586 -0.839
Algeria 1988 -0.8529455 -0.839
Algeria 1989 -0.2930265 -0.839
Algeria 1990 -0.1564207 -0.839
...
Every possible path I tried got easily super complicated - and I have to calculate the mean scores for different periods of time for over 90 countries ...
Many many thanks for your help!
datfrm$mean <- with(datfrm, ave(score, findInterval(year, c(-Inf, 1984, 1991, Inf)), FUN = mean))
The title question is a bit different than the real question and would be answered by using logical indexing. If one wanted only the mean for a particular subset say year >= 1984 & year <= 1990 it would be done via:
mn84_90 <- with(datfrm, mean(score[year >= 1984 & year <= 1990]) )
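The question mentions over 90 countries, and the expected output shows means computed within a single country. ave() accepts several grouping variables, so a per-country, per-period mean can be sketched as below. The Algeria scores come from the question; "Examplia" and its scores are made up purely to show the grouping:

```r
datfrm <- data.frame(
  country = rep(c("Algeria", "Examplia"), each = 4),  # "Examplia" is hypothetical
  year    = rep(c(1983, 1984, 1990, 1991), times = 2),
  score   = c(-1.1274560, -1.1353926, -0.1564207, -0.1526328, 1, 2, 4, 8)
)

# Period index: 1 for year <= 1983, 2 for 1984-1990, 3 for year >= 1991
period <- findInterval(datfrm$year, c(-Inf, 1984, 1991, Inf))

# Mean of score within each country-by-period cell, repeated on every row
datfrm$mean <- ave(datfrm$score, datfrm$country, period, FUN = mean)
print(datfrm)
```

Dropping datfrm$country from the ave() call gives back the pooled means across all countries.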
I first thought findInterval required year to be sorted (as it is in your example) and so was tempted to use cut in case it isn't; @DWin proved that wrong, since only the vector of break points has to be sorted. For completeness, the data.table equivalent (which scales to large data) is:
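library(data.table)
DT = as.data.table(DF) # or just start with a data.table in the first place
# right = FALSE gives [-Inf,1984), [1984,1991), [1991,Inf); the default would put 1984 in period 1
DT[, mean := mean(score), by = cut(year, c(-Inf, 1984, 1991, Inf), right = FALSE)]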
or with findInterval, as DWin used, which is likely faster:
DT[, mean:=mean(score), by=findInterval(year,c(-Inf,1984,1991,Inf))]
If the rows are ordered by year, I think the easiest way to accomplish this would be:
m80_83 <- mean(dataframe[1:4,3]) # mean of column 3 (score) for rows 1 through 4, i.e. 1980-1983
m84_90 <- mean(dataframe[5:11,3]) # rows 5 through 11, i.e. the seven years 1984-1990
#etc.
If the rows are not ordered by year, I would use tapply like this.
list.of.means <- tapply(dataframe$score, cut(dataframe$year, c(0, 1983.5, 1990.5, 3000)), mean)
Here, tapply takes three parameters:
First, the data you want to do stuff with (in this case, dataframe$score).
Second, a grouping factor, produced here by cut, which splits that data into groups. In this case, it will cut the data into three groups based on the dataframe$year values. Group 1 will include all rows with dataframe$year values from 0 to 1983.5, Group 2 will include all rows with dataframe$year values from 1983.5 to 1990.5, and Group 3 will include all rows with dataframe$year values from 1990.5 to 3000.
Third, a function that is applied to each group. This function will apply to the data you selected as your first parameter.
So, list.of.means should be a named vector of the 3 values you are looking for.

Create a moving sum of past levels of a variable, summed over for each level of 3 other variables, in R

I have a data.frame with the following structure (panel data), with 16 levels of time (quarters), 14 levels of geo (countries), and 20 levels of citizen, each of them repeating accordingly in the data frame.
time geo citizen X
2008Q1 Belgium Afghanistan 22
2008Q1 Belgium Armenia 10
2008Q1 Belgium Bangladesh 25
2008Q1 Belgium Democratic Republic of the Congo 55
2008Q1 Belgium China (including Hong Kong) 5
2008Q1 Belgium Eritrea 8
I would like to create a new column, let's say MOVSUM, that sums X over the previous 4 quarters for each combination of citizen and geo, so that for each quarter t I would have how many X's of each citizen in each geo were available during quarters t-4 to t-1.
Thanks in advance
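One way to approach this in base R, offered as a sketch rather than a solution tested on the full panel: sort by geo, citizen, and time, then compute a lagged 4-quarter sum within each geo/citizen pair using ave(). It assumes every pair has exactly one row per quarter with no gaps, and that time labels such as 2008Q1 sort chronologically as text. The helper name prev4_sum and the X values below are made up for illustration:

```r
# Toy panel: one geo/citizen pair over six quarters (hypothetical X values)
panel <- data.frame(
  time    = c("2008Q3", "2008Q1", "2008Q2", "2008Q4", "2009Q2", "2009Q1"),
  geo     = "Belgium",
  citizen = "Armenia",
  X       = c(30, 10, 20, 40, 60, 50)
)

# Hypothetical helper: for position i, the sum of x[(i-4):(i-1)];
# NA where fewer than 4 earlier quarters exist
prev4_sum <- function(x) {
  n   <- length(x)
  cs  <- c(0, cumsum(x))  # cs[i] is the sum of x[1:(i-1)]
  out <- rep(NA_real_, n)
  if (n > 4) {
    idx <- 5:n
    out[idx] <- cs[idx] - cs[idx - 4]
  }
  out
}

# Sort so that quarters are in chronological order within each pair
panel <- panel[order(panel$geo, panel$citizen, panel$time), ]
panel$MOVSUM <- ave(panel$X, panel$geo, panel$citizen, FUN = prev4_sum)
print(panel)
```

If some quarters are missing for a pair, the rows would first need to be completed (for example by merging against the full grid of time, geo, and citizen) before a positional window like this is meaningful.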
