Averaging by column for a set number of rows in R

I have a panel dataset where I want to average over a specified number of time periods (t) by variable (column).
An example:
Country Year Var 1 Var 2 Var 3
Austria 1984 1 3.6 95
Austria 1985 2 4.1 94.6
Austria 1986 1 2.6 93.6
Austria 1987 1 3 94.4
Austria 1988 1 3.9 95.2
What I want is a new column (or a new data frame) with a variable holding the average of Var 1 over the 5-year period (1984-1988), another for the average of Var 2, another for Var 3, and so on.
I also want to loop the function over the other countries in my dataset. It would be great if I could keep the averaging from mixing up countries, so I was thinking of adding some string matching (for instance code %in% "AUT" in this case; I have a variable with country codes), but I couldn't figure out how to do it.
Thank you very much in advance

1) Using the sample input in the Note at the end, read the country and year in from the row names and round each year up to the end of its 5-year period, so that 1984 through 1988 all become 1988, and so on. Then use aggregate to calculate the means of each column by both country and period-end year. No packages are used.
# split row names such as "Austria 1984" into Country and Year
By0 <- read.table(text = rownames(DF), col.names = c("Country", "Year"))
# round each Year up to the end of its 5-year period (1984-1988 -> 1988)
By <- transform(By0, Year = 5 * ((Year - min(Year)) %/% 5) + min(Year) + 4)
# mean of every column by Country and period-end Year
aggregate(DF, By, mean)
giving the following:
Country Year Var 1 Var 2 Var 3
1 Australia 1988 1.6 18.46 95.52
2 Austria 1988 1.2 3.44 94.56
2) Alternatively, if what is wanted is to append the means as columns to the original data frame, lapply over the columns, using ave to take the mean by Country and Year for each:
out <- cbind(DF, lapply(DF, function(x) with(By, ave(x, Country, Year, FUN = mean))))
names(out) <- c(names(DF), paste("Mean", names(DF)))
giving:
> out
Var 1 Var 2 Var 3 Mean Var 1 Mean Var 2 Mean Var 3
Austria 1984 1 3.6 95.0 1.2 3.44 94.56
Austria 1985 2 4.1 94.6 1.2 3.44 94.56
Austria 1986 1 2.6 93.6 1.2 3.44 94.56
Austria 1987 1 3.0 94.4 1.2 3.44 94.56
Austria 1988 1 3.9 95.2 1.2 3.44 94.56
Australia 1984 1 3.6 95.0 1.6 18.46 95.52
Australia 1985 2 4.1 94.6 1.6 18.46 95.52
Australia 1986 1 2.6 93.6 1.6 18.46 95.52
Australia 1987 1 3.0 94.4 1.6 18.46 95.52
Australia 1988 3 79.0 100.0 1.6 18.46 95.52
Note
The input used, shown reproducibly, is:
Lines <- "
Var 1,Var 2,Var 3
Austria 1984,1,3.6,95
Austria 1985,2,4.1,94.6
Austria 1986,1,2.6,93.6
Austria 1987,1,3,94.4
Austria 1988,1,3.9,95.2
Australia 1984,1,3.6,95
Australia 1985,2,4.1,94.6
Australia 1986,1,2.6,93.6
Australia 1987,1,3,94.4
Australia 1988,3,79,100"
DF <- read.csv(text = Lines, check.names = FALSE)
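For completeness, a dplyr sketch of the same per-country averaging; grouping by Country is what keeps the averages from mixing countries, which the question asked about (the Period column, and reusing By0 from answer 1, are my choices):
library(dplyr)
# a minimal sketch: carry Country and Year as ordinary columns (By0 from
# answer 1) and group by country and 5-year period so countries never mix
cbind(By0, DF) %>%
  mutate(Period = 5 * ((Year - min(Year)) %/% 5) + min(Year) + 4) %>%
  group_by(Country, Period) %>%
  summarise(across(starts_with("Var"), mean), .groups = "drop")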

Related

Data transformation in R - columns

My data frame has n dates:
Date team_home team_away prob_home draw prob_away
01/01/2021 Brazil Germany 95.0 5.0 0.0
01/01/2021 England Belgium 50.0 10.0 40.0
02/01/2021 Belgium Canada 90.0 7.0 3.0
02/01/2021 Germany France 60.0 10.0 30.0
... .... ... ... ... ...
DESIRED DATA FRAME (important: only one date per row):
Date prob_Brazil draw_Brazil_Germany prob_Germany prob_England draw_England_Belgium prob_Belgium ....
01/01/2021 95.0 5.0 0.0 50.0 10.0 40.0
02/01/2021 NA NA 60.0 NA NA 90.0
Thank you for your help!
You can use pivot_wider, though the output may not be quite what you want:
library(tidyr)
df %>%
  pivot_wider(names_from = c(team_home, team_away),
              values_from = c(prob_home, draw, prob_away))
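If more control over the generated column names is needed, pivot_wider's names_glue argument can help; a sketch (the glue pattern below is my assumption about the naming you want, not what the question shows):
library(tidyr)
# name each new column value_homeTeam_awayTeam, e.g. prob_home_Brazil_Germany
df %>%
  pivot_wider(names_from = c(team_home, team_away),
              values_from = c(prob_home, draw, prob_away),
              names_glue = "{.value}_{team_home}_{team_away}")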

Reshape wide data in R: Converting two rows into columns

How do I transpose this dataset in R? See below:
I downloaded a dataset that looks like this (the dates go backward from 2016 to 1975):
V1 V2 V3 V4 V5
1 2016 2016 2016 2015
4 Country Both-sexes Male Female Both-sexes
5 Afghanistan 23.4 [22.0-24.8] 22.6 [20.1-25.1] 24.1 [23.0-25.3] 23.3 [21.9-24.6]
6 Albania 26.7 [25.8-27.5] 27.0 [25.8-28.2] 26.3 [25.0-27.6] 26.6 [25.8-27.4]
7 Algeria 25.5 [24.5-26.5] 24.7 [23.4-26.1] 26.4 [24.9-27.8] 25.5 [24.5-26.4]
8 Andorra 26.7 [24.6-28.7] 27.3 [24.8-29.8] 26.1 [22.8-29.5] 26.7 [24.7-28.7]
I need to make the year and sex rows (currently numbered rows 1 & 4) into columns. Here's what I want:
1 Country Year Sex Rate
2 Afghanistan 2016 Both-sexes 23.4
3 Afghanistan 2016 Male 22.6
3 Afghanistan 2016 Female 24.1
4 Afghanistan 2015 Both-sexes 23.3
...and the rows continue on through all of the years for all of the countries in the dataset.
Here's what I have done trying to get there:
library(reshape2)
cfile <- read.csv(file = "countries-BMI.csv", header = FALSE)
# remove rows 2 and 3, which have unnecessary info
countries_data <- cfile[-c(2, 3), ]
molten_countries_data <- melt(countries_data, id = c("V1"))
...and here's my result, head(molten_countries_data):
V1 variable value
1 V2 2016
2 Country V2 Both-sexes
3 Afghanistan V2 23.4 [22.0-24.8]
4 Albania V2 26.7 [25.8-27.5]
5 Algeria V2 25.5 [24.5-26.5]
6 Andorra V2 26.7 [24.6-28.7]
Not what I wanted! Please help.
I figured it out thanks to the tip from @Dave2e to merge the first two rows first. Here's what I ended up doing:
library(reshape2)
library(tidyr)
# load the data frame without the first two rows
cdata <- read.csv("countries-BMI.csv", skip = 2, header = FALSE)
# create the header by combining the top two rows
headers <- read.csv("countries-BMI.csv", nrows = 2, header = FALSE)
headers_names <- sapply(headers, paste, collapse = "_")
# add the new header to the data frame
names(cdata) <- headers_names
# melt the wide data to make it tidy/long
longdata <- melt(cdata, id.vars = c("_Country"))
# separate the year and sex columns
countriesBMI2 <- separate(data = longdata, col = variable, into = c("Year", "Sex"), sep = "_")
My result, head(countriesBMI2):
_Country Year Sex value
1 Afghanistan 2016 Both-sexes 23.4 [22.0-24.8]
2 Albania 2016 Both-sexes 26.7 [25.8-27.5]
3 Algeria 2016 Both-sexes 25.5 [24.5-26.5]
4 Andorra 2016 Both-sexes 26.7 [24.6-28.7]
5 Angola 2016 Both-sexes 23.3 [21.2-25.6]
6 Antigua and Barbuda 2016 Both-sexes 26.7 [24.6-28.8]
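To fully match the desired output, one remaining step is extracting the numeric rate from strings like "23.4 [22.0-24.8]"; a sketch (the Rate name comes from the desired output above):
# drop the bracketed interval and convert the rest to numeric,
# e.g. "23.4 [22.0-24.8]" -> 23.4
countriesBMI2$Rate <- as.numeric(sub(" \\[.*$", "", countriesBMI2$value))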

Calculate the percent occurrence of a variable in multiple groups

Sample data
set.seed(123)
df <- data.frame(loc.id = rep(1:1000, each = 35),
                 year = rep(1980:2014, times = 1000),
                 month.id = sample(c(1:4, 8:10, 12), 35 * 1000, replace = TRUE))
The data frame has 1000 locations × 35 years of data for a variable called month.id, which is basically the month of a year. For each year, I want to calculate the percent occurrence of each month. For example, for 1980:
month.vec <- df[df$year == 1980,]
table(month.vec$month.id)
1 2 3 4 8 9 10 12
106 132 116 122 114 130 141 139
To calculate the percent occurrence of months:
table(month.vec$month.id)/length(month.vec$month.id) * 100
1 2 3 4 8 9 10 12
10.6 13.2 11.6 12.2 11.4 13.0 14.1 13.9
I want to have a table something like this:
year month percent
1980 1 10.6
1980 2 13.2
1980 3 11.6
1980 4 12.2
1980 5 NA
1980 6 NA
1980 7 NA
1980 8 11.4
1980 9 13
1980 10 14.1
1980 11 NA
1980 12 13.9
Since months 5, 6, 7 and 11 are missing, I just want to add those additional rows with NA for those months. If possible, I would like a dplyr solution, something like this:
library(dplyr)
df %>% group_by(year) %>% summarise(percentage.contri = table(month.id)/length(month.id)*100)
Solution using dplyr and tidyr
# To get month.id as an integer, run this first (or add as.integer inside mutate):
# df$month.id <- as.integer(df$month.id)
library(dplyr)
library(tidyr)
df %>%
  group_by(year, month.id) %>%
  # count occurrences per year & month
  summarise(n = n()) %>%
  # percent per month (sum(n) is the yearly total, since summarise
  # left the data grouped by year)
  mutate(percent = n / sum(n) * 100) %>%
  # fill in missing months
  complete(year, month.id = 1:12, fill = list(percent = 0)) %>%
  select(year, month.id, percent)
year month.id percent
<int> <dbl> <dbl>
1 1980 1.00 10.6
2 1980 2.00 13.2
3 1980 3.00 11.6
4 1980 4.00 12.2
5 1980 5.00 0
6 1980 6.00 0
7 1980 7.00 0
8 1980 8.00 11.4
9 1980 9.00 13.0
10 1980 10.0 14.1
11 1980 11.0 0
12 1980 12.0 13.9
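The question actually asked for NA rather than 0 in the missing months; since complete() fills with NA by default, simply dropping the fill argument gives that:
# same pipeline, but missing months come out as NA (complete()'s default fill)
df %>%
  group_by(year, month.id) %>%
  summarise(n = n()) %>%
  mutate(percent = n / sum(n) * 100) %>%
  complete(year, month.id = 1:12) %>%
  select(year, month.id, percent)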
A base R solution:
tab <- table(month.vec$year,
             factor(month.vec$month.id, levels = 1:12)) / length(month.vec$month.id) * 100
dfnew <- as.data.frame(tab)
which gives:
> dfnew
Var1 Var2 Freq
1 1980 1 10.6
2 1980 2 13.2
3 1980 3 11.6
4 1980 4 12.2
5 1980 5 0.0
6 1980 6 0.0
7 1980 7 0.0
8 1980 8 11.4
9 1980 9 13.0
10 1980 10 14.1
11 1980 11 0.0
12 1980 12 13.9
Or with data.table:
library(data.table)
setDT(month.vec)[, .N, by = .(year, month.id)
][.(year = 1980, month.id = 1:12), on = .(year, month.id)
][, N := 100 * N/sum(N, na.rm = TRUE)][]
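The join above handles 1980 only, because month.vec was subset to that year. A sketch extending the idea to every year in the full df with a cross join (CJ); the intermediate names counts and full are mine:
library(data.table)
# count per year/month, then left-join onto all year x month combinations
counts <- as.data.table(df)[, .N, by = .(year, month.id)]
full <- counts[CJ(year = unique(counts$year), month.id = 1:12),
               on = .(year, month.id)]
# percent within each year; missing months keep NA
full[, percent := 100 * N / sum(N, na.rm = TRUE), by = year][]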

correlation between two data frames in R

I have one data frame with sales values for the period Oct. 2000 to Dec. 2001 (15 months). I also have profit values for the same period, and I want to find the correlation between these two data frames month-wise over these 15 months in R. My data frame sales is:
Month sales
Oct. 2000 24.1
Nov. 2000 23.3
Dec. 2000 43.9
Jan. 2001 53.8
Feb. 2001 74.9
Mar. 2001 25
Apr. 2001 48.5
May. 2001 18
Jun. 2001 68.1
Jul. 2001 78
Aug. 2001 48.8
Sep. 2001 48.9
Oct. 2001 34.3
Nov. 2001 54.1
Dec. 2001 29.3
My second data frame profit is:
period profit
Oct 2000 14.1
Nov 2000 3.3
Dec 2000 13.9
Jan 2001 23.8
Feb 2001 44.9
Mar 2001 15
Apr 2001 58.5
May 2001 18
Jun 2001 58.1
Jul 2001 38
Aug 2001 28.8
Sep 2001 18.9
Oct 2001 24.3
Nov 2001 24.1
Dec 2001 19.3
Now I know that for the initial two months I cannot get the correlation, as there are not enough values, but from Dec. 2000 onwards I want to calculate the correlation taking the previous months' values into consideration. So for Dec. 2000 I will consider the values of Oct. 2000, Nov. 2000 and Dec. 2000, which gives me 3 sales values and 3 profit values. Similarly, for Jan. 2001 I will consider the values of Oct. 2000, Nov. 2000, Dec. 2000 and Jan. 2001, thus having 4 sales values and 4 profit values. For every month I will thus also consider the previous months' values when calculating the correlation, and my output should be something like this:
Month Correlation
Oct. 2000 NA or Empty
Nov. 2000 NA or Empty
Dec. 2000 x
Jan. 2001 y
. .
. .
Dec. 2001 a
I know that R has the function cor(sales, profit), but how can I compute the correlation for my scenario?
Make some sample data:
> sales = c(1,4,3,2,3,4,5,6,7,6,7,5)
> profit = c(4,3,2,3,4,5,6,7,7,7,6,5)
> data = data.frame(sales=sales,profit=profit)
> head(data)
sales profit
1 1 4
2 4 3
3 3 2
4 2 3
5 3 4
6 4 5
Here's the beef:
> data$runcor = c(NA,NA,
sapply(3:nrow(data),
function(i){
cor(data$sales[1:i],data$profit[1:i])
}))
> data
sales profit runcor
1 1 4 NA
2 4 3 NA
3 3 2 -0.65465367
4 2 3 -0.63245553
5 3 4 -0.41931393
6 4 5 0.08155909
7 5 6 0.47368421
8 6 7 0.69388867
9 7 7 0.78317543
10 6 7 0.81256816
11 7 6 0.80386072
12 5 5 0.80155885
So now data$runcor[3] is the correlation of the first 3 sales and profit numbers.
Note that I call this runcor, as it's a "running correlation": like a "running sum" (the sum of all elements so far), it is the correlation of all pairs so far.
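If you already use zoo, the same expanding window can be written with rollapplyr over row indices; a sketch, assuming zoo's support for a per-point width vector (window i then covers rows 1:i):
library(zoo)
# expanding-window correlation: at row i the window is rows 1:i;
# windows shorter than 3 rows are left NA, matching the above
n <- nrow(data)
data$runcor <- rollapplyr(seq_len(n), width = seq_len(n),
                          FUN = function(ix)
                            if (length(ix) < 3) NA_real_
                            else cor(data$sales[ix], data$profit[ix]))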
Another possibility would be the following (if dat1 and dat2 are the initial datasets):
Update
dat1$Month <- gsub("\\.", "", dat1$Month)
datN <- merge(dat1, dat2, sort = FALSE, by.x = "Month", by.y = "period")
indx <- sequence(3:nrow(datN))  # create index to replicate the rows
indx1 <- cumsum(c(TRUE, diff(indx) < 0))  # create another index to group the rows
# calculate the correlation grouped by `indx1`
datN$runcor <- setNames(c(NA, NA,
                          by(datN[indx, -1], list(indx1),
                             FUN = function(x) cor(x$sales, x$profit))), NULL)
datN
datN
# Month sales profit runcor
#1 Oct 2000 24.1 14.1 NA
#2 Nov 2000 23.3 3.3 NA
#3 Dec 2000 43.9 13.9 0.5155911
#4 Jan 2001 53.8 23.8 0.8148546
#5 Feb 2001 74.9 44.9 0.9345166
#6 Mar 2001 25.0 15.0 0.9119941
#7 Apr 2001 48.5 58.5 0.7056301
#8 May 2001 18.0 18.0 0.6879528
#9 Jun 2001 68.1 58.1 0.7647177
#10 Jul 2001 78.0 38.0 0.7357748
#11 Aug 2001 48.8 28.8 0.7351366
#12 Sep 2001 48.9 18.9 0.7190413
#13 Oct 2001 34.3 24.3 0.7175138
#14 Nov 2001 54.1 24.1 0.7041889
#15 Dec 2001 29.3 19.3 0.7094334

Calculation of anomalies on a time series

I'd like to calculate monthly temperature anomalies on a time-series with several stations.
Here I call "anomaly" the difference between a single value and a mean calculated over a reference period.
My data frame looks like this (let's call it "data"):
Station Year Month Temp
A 1950 1 15.6
A 1980 1 12.3
A 1990 2 11.4
A 1950 1 15.6
B 1970 1 12.3
B 1977 2 11.4
B 1977 4 18.6
B 1980 1 12.3
B 1990 11 7.4
First, I made a subset with the years between 1980 and 1990:
data2 <- subset(data, Year >= 1980 & Year <= 1990)
Second, I used plyr to calculate the monthly means (let's call this "MeanBase") between 1980 and 1990 for each station:
library(plyr)
data3 <- ddply(data2, .(Station, Month), summarise,
               MeanBase = mean(Temp, na.rm = TRUE))
Now I'd like to calculate, for each row of data, the difference between the Temp value and the corresponding MeanBase, but I'm not sure I'm going about it the right way (I don't see how to use data3).
You can use ave in base R to get this.
transform(data,
          Demeaned = Temp - ave(replace(Temp, Year < 1980 | Year > 1990, NA),
                                Station, Month,
                                FUN = function(t) mean(t, na.rm = TRUE)))
# Station Year Month Temp Demeaned
# 1 A 1950 1 15.6 3.3
# 2 A 1980 1 12.3 0.0
# 3 A 1990 2 11.4 0.0
# 4 A 1950 1 15.6 3.3
# 5 B 1970 1 12.3 0.0
# 6 B 1977 2 11.4 NaN
# 7 B 1977 4 18.6 NaN
# 8 B 1980 1 12.3 0.0
# 9 B 1990 11 7.4 0.0
The result column will have NaNs for Month-Station combinations that have no years in your specified range.
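For completeness, the merge route the question was heading toward also works; a sketch that joins the ddply means (data3) back onto every row (the Anomaly column name is mine):
# left-join the 1980-1990 monthly means onto the full data, then subtract;
# Station/Month combinations with no in-range years get NA
data4 <- merge(data, data3, by = c("Station", "Month"), all.x = TRUE)
data4$Anomaly <- data4$Temp - data4$MeanBase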

Resources