Create a moving sum of past levels of a variable, summed over for each level of 3 other variables, in R - r

I have a data.frame of the following structure (panel data), with 16 levels of time(quarters) 14 levels of geo (countries) and 20 levels of citizen, each of them repeating accordingly in the dataframe.
time geo citizen X
2008Q1 Belgium Afghanistan 22
2008Q1 Belgium Armenia 10
2008Q1 Belgium Bangladesh 25
2008Q1 Belgium Democratic Republic of the Congo 55
2008Q1 Belgium China (including Hong Kong) 5
2008Q1 Belgium Eritrea 8
I would like to create a new column lets say MOVSUM where it will sum variable X for each level of citizen and geo and time for the previous 4 quarters, so that I would have for each quarter, t, how many X's of each citizen in each geo were available during t-4 to t-1 quarters.
Thanks in advance

Related

Combine rows with two matching columns in R [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have a df that resembles this:
Year Country Sales($M)
2013 Australia 120
2013 Australia 450
2013 Armenia 80
2013 Armenia 175
2013 Armenia 0
2014 Australia 500
2014 Australia 170
2014 Armenia 0
2014 Armenia 100
I'd like to combine the rows that match Year and Country, adding the Sales column. The result should be:
Year Country Sales($M)
2013 Australia 570
2013 Armenia 255
2014 Australia 670
2014 Armenia 100
I'm sure I could write a long loop to check whether Year and Country are the same and then add the Sales from those rows, but this is R so there must be a simple function that I'm totally missing.
Many thanks in advance.
library(tidyverse)
df %>%
group_by(Year,Country) %>%
summarise(Sales = sum(Sales))

How do I get the sum of frequency count based on two columns?

Assuming that the dataframe is stored as someData, and is in the following format:
ID Team Games Medal
1 Australia 1992 Summer NA
2 Australia 1994 Summer Gold
3 Australia 1992 Summer Silver
4 United States 1991 Winter Gold
5 United States 1992 Summer Bronze
6 Singapore 1991 Summer NA
How would I count the frequencies of the medal, based on the Team - while excluding NA as an variable. But at the same time, the total frequency of each country should be summed, rather than displayed separately for Gold, Silver and Bronze.
In other words, I am trying to display the total number of medals PER country, with the exception of NA.
I have tried something like this:
library(plyr)
counts <- ddply(olympics, .(olympics$Team, olympics$Medal), nrow)
names(counts) <- c("Country", "Medal", "Freq")
counts
But this just gives me a massive table of every medal for every country separately, including NA.
What I would like to do is the following:
Australia 2
United States 2
Any help would be greatly appreciated.
Thank you!
We can use count
library(dplyr)
df1 %>%
filter(!is.na(Medal)) %>%
count(Team)
# A tibble: 2 x 2
# Team n
# <fct> <int>
#1 Australia 2
#2 United States 2
You can do that in base R with table and colSums
colSums(table(someData$Medal, someData$Team))
Australia Singapore United States
2 0 2
Data
someData = read.table(text="ID Team Games Medal
1 Australia '1992 Summer' NA
2 Australia '1994 Summer' Gold
3 Australia '1992 Summer' Silver
4 'United States' '1991 Winter' Gold
5 'United States' '1992 Summer' Bronze
6 Singapore '1991 Summer' NA",
header=TRUE)

Removing certain rows whose column value does not match another column (all within the same data frame)

I'm attempting to remove all of the rows (cases) within a data frame in which a certain column's value does not match another column value.
The data frame bilat_total contains these 10 columns/variables:
bilat_total[,c("year", "importer1", "importer2", "flow1",
"flow2", "country", "imports", "exports", "bi_tot",
"mother")]
Thus the table's head is:
year importer1 importer2 flow1 flow2 country
6 2009 Afghanistan Bhutan NA NA Afghanistan
11 2009 Afghanistan Solomon Islands NA NA Afghanistan
12 2009 Afghanistan India 516.13 120.70 Afghanistan
13 2009 Afghanistan Japan 124.21 0.46 Afghanistan
15 2009 Afghanistan Maldives NA NA Afghanistan
19 2009 Afghanistan Bangladesh 4.56 1.09 Afghanistan
imports exports bi_tot mother
6 6689.35 448.25 NA United Kingdom
11 6689.35 448.25 NA United Kingdom
12 6689.35 448.25 1.804361e-02 United Kingdom
13 6689.35 448.25 6.876602e-05 United Kingdom
15 6689.35 448.25 NA United Kingdom
19 6689.35 448.25 1.629456e-04 United Kingdom
I've attempted to remove all the cases in which importer2 do not match mother by making a subset:
subset(bilat_total, importer2 == mother)
But each time I do, I get the error:
Error in Ops.factor(importer2, mother) : level sets of factors are different
How would I go about dropping all the rows/cases in which importer 2 and mother don't match?
The error may be because the columns are factor class. We can convert the columns to character class and then compare to subset the rows.
subset(bilat_total, as.character(importer2) == as.character(mother))
Based on the data example showed
subset(bilat_total, importer2 == mother)
# Error in Ops.factor(importer2, mother) :
# level sets of factors are different

R t-test of mean vs observations for multiple factor levels

I have a dataset of some 39k rows of data, an excerpt is below:
'Country', 'Group', 'Item', 'Year' are categorical
'Production' and 'Waste' are numerical
'LF' is also numerical, but is the result of 'Waste'/'Production
Region Country Group Item Year Production Waste LF
Europe Bulgaria Cereals Wheat 1961 2040 274 0.134313725
Europe Bulgaria Cereals Wheat 1962 2090 262 0.125358852
Europe Bulgaria Cereals Wheat 1963 1894 277 0.14625132
Europe Bulgaria Cereals Wheat 1964 2121 286 0.134842056
Europe Bulgaria Cereals Wheat 1965 2923 341 0.116660965
Europe Bulgaria Cereals Wheat 1966 3193 385 0.120576261
Europe Bulgaria Cereals Barley 1961 612 15 0.024509804
Europe Bulgaria Cereals Barley 1962 599 16 0.026711185
Europe Bulgaria Cereals Barley 1963 618 16 0.025889968
Europe Bulgaria Cereals Barley 1964 764 21 0.027486911
Europe Bulgaria Cereals Barley 1965 876 22 0.025114155
Europe Bulgaria Cereals Barley 1966 1064 24 0.022556391
I have used the following code to generate 991 different means by Item and Group
df2 <- aggregate(LF ~ Country + Item, data=df1, FUN='mean')
The results of this function look ok.
I would like to test whether the respective means of LF in df2 are different to the underlying annual observations in df1 for each Country-Item combination (ie. if FALSE, then LF is really just a static ratio, if TRUE then 'Waste' is independent from 'Production').
How might this best be done? There seem to be 991 tests to conduct for this dataset alone and I don't know how to mix the apply and t.test functions in this manner.
Thanks!
t.test requires two groups to compare on a numeric/scale dependent output variable. Here, it seems to me that for each combination of country and item you want to compare all different year averages/means. In other words, you are trying to investigate if year is influencing the LF averages, for each combination of country and item.
The easiest way to do this is to create a linear model (LF ~ Year) for each combination of country and item and interpret the coefficient and p value of the variable year.
library(dplyr)
library(broom)
set.seed(115)
# example dataset
dt = data.frame(Country = rep("country1",12),
Item = c(rep("item1",6), rep("item2",6)),
Year = rep(1961:1966,2),
LF = runif(12,0,1))
# general means by country and item
dt %>% group_by(Country,Item) %>% summarise(Mean_LF = mean(LF))
# each years means by country and item
dt %>% group_by(Country,Item,Year) %>% summarise(Mean_LF = mean(LF))
# does year influence the means for each country and item?
dt %>% group_by(Country,Item) %>% do(tidy(lm(LF~Year, data=.)))
Hope this helps. Let me know if I'm missing something and I'll update my code.

Mean of time - hh:mm:ss - group by a variable

Need to calculate the mean of Time by Country. Time is a Date variable - hh:mm:ss.
This command with(df,tapply(as.numeric(times(df$Time)),Country,mean))
is not returning the correct mean in hh:mm:ss.
Country Time
1 Germany 2:26:21
2 Germany 2:19:19
3 Brazil 2:06:34
4 USA 2:06:17
5 Eth 2:18:58
6 Japan 2:08:35
7 Morocco 2:05:27
8 Germany 2:13:57
9 Romania 2:21:30
10 Spain 2:07:23
Output:
>with(df,tapply(as.numeric(times(df$Time)),Country,mean))
Andorra Australia Brazil Canada China
0.09334491 0.09634259 0.09578125 0.09634645 0.09481192
Eritrea Ethiopia France Germany Great Britain
0.09709491 0.09010031 0.10025463 0.09713349 0.09524306
Ireland Italy Japan Kenya Morocco
0.09593750 0.09520255 0.09579630 0.08934854 0.09400463
New Zeland Peru Poland Romania Russia
0.09664931 0.09809606 0.09638889 0.09875000 0.09327932
Spain Switzerland Uganda United States Zimbabwe
0.09314236 0.09620949 0.10068287 0.09399016 0.09892940
I see you've discovered the agony of working with date and time values in R...
Is this what you had in mind?
df$nTime <- difftime(strptime(df$Time,"%H:%M:%S"),
strptime("00:00:00","%H:%M:%S"),
units="secs")
df.means <- aggregate(df$nTime,by=list(df$Country),mean)
df.means$Time <- format(.POSIXct(df.means$x,tz="GMT"), "%H:%M:%S")
df.means
Group.1 x Time
# 1 Brazil 7594.000 02:06:34
# 2 Eth 8338.000 02:18:58
# 3 Germany 8392.333 02:19:52
# 4 Japan 7715.000 02:08:35
# 5 Morocco 7527.000 02:05:27
# 6 Romania 8490.000 02:21:30
# 7 Spain 7643.000 02:07:23
# 8 USA 7577.000 02:06:17
The first line adds a column nTime which is the time, in seconds, since midnight.
The second line calculates the means.
The third line converts back to H:M:S.
The problem you were having is the strptime(...), when forced to convert to numeric, returns the number of second between 1970-01-01 and the indicated time today. So, a really big number. This code just subtracts out the number of second from 1970-01-01 and 00:00:00 today.
Are you trying to do this -
dades$Time <- strptime(dades$Time,'%H:%M:%S')
by(dades$Time, dades$Country, mean)
If I didn't understand your question, can you please post sample output.

Resources