Splitting a data frame sequentially in R

I have a data frame like this
V1 V2 V3 V4 month year
1 -1 9 1 1 1989
1 -1 9 1 1 1989
4 -1 9 1 2 1989
3 2 7 3 1 1990
4 4 8 2 2 1990
3 6 9 2 2 1990
4 7 0 2 2 1990
5 8 4 2 2 1990
where the first 4 columns give the value of the quantity A in cells 1, 2, 3 and 4, and the last two columns give the month and the year. What I would like to do is calculate the monthly average of A for every cell, so as to end up with a list like this:
V1
1989
<A>jan <A>feb ..
1 4
1990
<A>jan <A>feb ..
3 4
V2
V3
Many thanks

I was still hoping for something a little bit more precise in your question as to what your desired output is exactly, but since you haven't updated that part, I'll provide two options.
Option 1
aggregate seems to be a pretty straightforward tool for this task, particularly if sticking with a "wide" format would be fine for your needs.
aggregate(. ~ year + month, mydf, mean)
# year month V1 V2 V3 V4
# 1 1989 1 1 -1.00 9.00 1
# 2 1990 1 3 2.00 7.00 3
# 3 1989 2 4 -1.00 9.00 1
# 4 1990 2 4 6.25 5.25 2
Option 2
If you prefer your data in a "long" format, you should explore the "reshape2" package which can handle the reshaping and aggregating in just a few steps.
library(reshape2)
mydfL <- melt(mydf, id.vars = c("year", "month"))
## The next step is purely cosmetic...
mydfL$month <- factor(month.abb[mydfL$month], month.abb, ordered = TRUE)
head(mydfL)
# year month variable value
# 1 1989 Jan V1 1
# 2 1989 Jan V1 1
# 3 1989 Feb V1 4
# 4 1990 Jan V1 3
# 5 1990 Feb V1 4
# 6 1990 Feb V1 3
## This is the actual aggregation and reshaping step...
out <- dcast(mydfL, variable + year ~ month,
value.var = "value", fun.aggregate = mean)
out
# variable year Jan Feb
# 1 V1 1989 1 4.00
# 2 V1 1990 3 4.00
# 3 V2 1989 -1 -1.00
# 4 V2 1990 2 6.25
# 5 V3 1989 9 9.00
# 6 V3 1990 7 5.25
# 7 V4 1989 1 1.00
# 8 V4 1990 3 2.00
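If a nested structure closer to the list sketched in the question is preferred, base R's tapply can build one year-by-month matrix of means per variable. This is only a sketch: mydf is rebuilt here from the sample rows in the question, and the names value_cols and monthly_means are made up for illustration.

```r
# Sample data copied from the question
mydf <- data.frame(
  V1 = c(1, 1, 4, 3, 4, 3, 4, 5),
  V2 = c(-1, -1, -1, 2, 4, 6, 7, 8),
  V3 = c(9, 9, 9, 7, 8, 9, 0, 4),
  V4 = c(1, 1, 1, 3, 2, 2, 2, 2),
  month = c(1, 1, 2, 1, 2, 2, 2, 2),
  year = c(1989, 1989, 1989, 1990, 1990, 1990, 1990, 1990)
)

# One year-by-month matrix of means for each of V1..V4
value_cols <- c("V1", "V2", "V3", "V4")
monthly_means <- lapply(mydf[value_cols], function(v) {
  tapply(v, list(year = mydf$year, month = mydf$month), mean)
})

monthly_means$V1   # rows are years, columns are months
```

Individual cells can then be pulled out by name, for example monthly_means$V2["1990", "2"].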

Related

R: Cumulative Mean Excluding Current Value?

I am working with the R programming language.
I have a dataset that looks something like this:
id = c(1,1,1,1,2,2,2)
year = c(2010,2011,2012,2013, 2012, 2013, 2014)
var = rnorm(7,7,7)
my_data = data.frame(id, year,var)
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658
For each "group" within the ID column - at each row, I want to take the CUMULATIVE MEAN of the "var" column but EXCLUDE the value of "var" within that row (i.e. most recent).
As an example:
row 1: NA
row 2: 12.186300/1
row 3: (12.186300 + 19.069836)/2
row 4: (12.186300 + 19.069836 + 7.456078)/3
row 5: NA
row 6: 20.827933
row 7: (20.827933 + 5.029625)/2
I found this post here (Cumsum excluding current value) which (I think) shows how to do this for the "cumulative sum" - I tried to apply the logic here to my question:
transform(my_data, cmean = ave(var, id, FUN = cummean) - var)
id year var cmean
1 1 2010 12.186300 0.000000
2 1 2011 19.069836 -3.441768
3 1 2012 7.456078 5.447994
4 1 2013 14.875019 -1.478211
5 2 2012 20.827933 0.000000
6 2 2013 5.029625 7.899154
7 2 2014 -2.260658 10.126291
The code appears to have run - but I don't think I have done this correctly (i.e. the numbers produced don't match up with the numbers I had anticipated).
I then tried an answer provided here (Compute mean excluding current value):
my_data %>%
group_by(id) %>%
mutate(avg = (sum(var) - var)/(n() - 1))
# A tibble: 7 x 4
# Groups: id [2]
id year var avg
<dbl> <dbl> <dbl> <dbl>
1 1 2010 12.2 13.8
2 1 2011 19.1 11.5
3 1 2012 7.46 15.4
4 1 2013 14.9 12.9
5 2 2012 20.8 1.38
6 2 2013 5.03 9.28
But it is still not working.
Can someone please show me what I am doing wrong and what I can do to fix this problem?
Thanks!
df %>%
group_by(id)%>%
mutate(avg = lag(cummean(var)))
# A tibble: 7 × 4
# Groups: id [2]
id year var avg
<int> <int> <dbl> <dbl>
1 1 2010 12.2 NA
2 1 2011 19.1 12.2
3 1 2012 7.46 15.6
4 1 2013 14.9 12.9
5 2 2012 20.8 NA
6 2 2013 5.03 20.8
7 2 2014 -2.26 12.9
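For readers without dplyr, the same idea can be sketched in base R with ave. The data frame is rebuilt from the values printed in the question, and prev_mean is a hypothetical helper name, not something from the thread.

```r
my_data <- data.frame(
  id = c(1, 1, 1, 1, 2, 2, 2),
  year = c(2010, 2011, 2012, 2013, 2012, 2013, 2014),
  var = c(12.186300, 19.069836, 7.456078, 14.875019,
          20.827933, 5.029625, -2.260658)
)

# Mean of all earlier values in the group; NA for the first row of a group
prev_mean <- function(x) {
  n_prev <- seq_along(x) - 1      # how many rows come before each row
  sum_prev <- cumsum(x) - x       # running sum without the current value
  ifelse(n_prev == 0, NA_real_, sum_prev / n_prev)
}

my_data$avg <- ave(my_data$var, my_data$id, FUN = prev_mean)
my_data
```

This assumes the rows are already ordered by year within each id, as they are in the question.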
With the help of some intermediate variables you can do it like so:
library(dplyr)
df <- read.table(text = "
id year var
1 1 2010 12.186300
2 1 2011 19.069836
3 1 2012 7.456078
4 1 2013 14.875019
5 2 2012 20.827933
6 2 2013 5.029625
7 2 2014 -2.260658", header=T)
df |>
group_by(id) |>
  # Equivalent one-liner: mutate(cm = lag(cummean(var)))
  mutate(n_prev = row_number() - 1,    # rows before the current one in the group
         s_prev = cumsum(var) - var,   # sum of those earlier rows only
         cm = ifelse(n_prev == 0, NA_real_, s_prev / n_prev)) |>
  select(-n_prev, -s_prev)
#> # A tibble: 7 × 4
#> # Groups: id [2]
#> id year var cm
#> <int> <int> <dbl> <dbl>
#> 1 1 2010 12.2 NA
#> 2 1 2011 19.1 12.2
#> 3 1 2012 7.46 15.6
#> 4 1 2013 14.9 12.9
#> 5 2 2012 20.8 NA
#> 6 2 2013 5.03 20.8
#> 7 2 2014 -2.26 12.9

Average percentage change over different years in R

I have a data frame from which I created a reproducible example:
country <- c('A','A','A','B','B','C','C','C','C')
year <- c(2010,2011,2015,2008,2009,2008,2009,2011,2015)
score <- c(1,2,2,1,4,1,1,3,2)
country year score
1 A 2010 1
2 A 2011 2
3 A 2015 2
4 B 2008 1
5 B 2009 4
6 C 2008 1
7 C 2009 1
8 C 2011 3
9 C 2015 2
And I am trying to calculate the average percentage increase (or decrease) in the score for each country by calculating [(final score - initial score) ÷ (initial score)] for each year and averaging it over the number of years.
country year score change
1 A 2010 1 NA
2 A 2011 2 1
3 A 2015 2 0
4 B 2008 1 NA
5 B 2009 4 3
6 C 2008 1 NA
7 C 2009 1 0
8 C 2011 3 2
9 C 2015 2 -0.33
The final result I am hoping to obtain:
country avg_change
1 A 0.5
2 B 3
3 C 0.55
As you can see, the trick is that countries span different years, sometimes with a missing year in between. I tried different ways of doing it manually, but I am struggling. If someone could hint at a solution, that would be great. Many thanks.
With dplyr, we can group_by country and get mean of difference between scores.
library(dplyr)
df %>%
group_by(country) %>%
summarise(avg_change = mean(c(NA, diff(score)), na.rm = TRUE))
# country avg_change
# <fct> <dbl>
#1 A 0.500
#2 B 3.00
#3 C 0.333
Using base R aggregate with same logic
aggregate(score~country, df, function(x) mean(c(NA, diff(x)), na.rm = TRUE))
We can use data.table to group by 'country' and take the mean of the difference between the 'score' and the lag of 'score'
library(data.table)
setDT(df1)[, .(avg_change = mean(score - shift(score), na.rm = TRUE)), .(country)]
# country avg_change
#1: A 0.5000000
#2: B 3.0000000
#3: C 0.3333333
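Note that diff(score) averages absolute differences, which is why country C comes out as 0.333 rather than the 0.55 shown in the question. If the relative change (current - previous) / previous is what is wanted, a minimal base R sketch could look like the following. The data frame is rebuilt from the question, mean_rel_change is a made-up helper name, and rows are assumed to be sorted by year within each country.

```r
df <- data.frame(
  country = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  year = c(2010, 2011, 2015, 2008, 2009, 2008, 2009, 2011, 2015),
  score = c(1, 2, 2, 1, 4, 1, 1, 3, 2)
)

# Mean of (current - previous) / previous within one country
mean_rel_change <- function(x) mean(diff(x) / head(x, -1))

res <- aggregate(score ~ country, df, mean_rel_change)
names(res)[2] <- "avg_change"
res
```

A country with a single row has no previous score, so its result would be NaN, and a previous score of 0 would give an infinite or undefined change.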

Combine data in many rows into a column

I have data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?
One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)
One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)
1. If every year has the same number of rows, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2. If the years do not all have the same number of rows, you could use rbind.fill.matrix from plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3. Yet another way is to use dcast to convert data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
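The unequal-length case can also be handled without extra packages. The sketch below relies on the fact that indexing a vector past its end yields NA; the data frame repeats the question's values plus the single 2015 row added above, and the names parts, n_max and wide are illustrative only.

```r
df <- data.frame(
  year = c(rep(2011, 3), rep(2012, 3), rep(2013, 3), 2015),
  Male = c(8, 1, 4, 3, 12, 9, 4, 3, 3, 5)
)

parts <- split(df$Male, df$year)   # one vector per year
n_max <- max(lengths(parts))       # length of the longest group

# Indexing past the end of a vector yields NA, which pads the short groups
wide <- sapply(parts, function(v) v[seq_len(n_max)])
wide
```

The result is a matrix whose column names are the years; wrap it in as.data.frame() if a data frame is needed.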

How to replace missing values with previous year's binned mean

I have the data frame shown below.
The Bin_p1 and Bin_f1 columns were calculated with the cut function, like this:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now, how can I calculate the mean of the previous two years and use it to replace the NA values within that bin?
For example, in row 5 p1 is NA and f1 is 30, with corresponding bin 7. The NA should be replaced with the mean of p1 over the previous two years for the same bin (7), so that df below becomes df1:
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
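The pipeline above takes the last two lagged values in each bin group rather than checking the years explicitly. A base R sketch that spells out the "previous two years, same bin" rule might look like this. The function name fill_from_bin and its arguments are hypothetical, and the data frame is rebuilt from the question.

```r
df <- data.frame(
  ID = 1:6,
  year = c(2013, 2013, 2014, 2014, 2015, 2016),
  p1 = c(20, 24, 10, 11, NA, 10),
  f1 = c(30, 29, 16, 17, 30, NA),
  Bin_p1 = c(5, 5, 2, 2, NA, 2),
  Bin_f1 = c(7, 7, 3, 3, 7, NA)
)

# For every NA in `target`, average `target` over rows from the two
# preceding years that share the same value in the `bin` column
fill_from_bin <- function(data, target, bin) {
  out <- data[[target]]
  for (i in which(is.na(out))) {
    prev <- which(data$year < data$year[i] &
                  data$year >= data$year[i] - 2 &
                  data[[bin]] == data[[bin]][i])   # which() drops NA matches
    out[i] <- mean(data[[target]][prev], na.rm = TRUE)
  }
  out
}

df$p1 <- fill_from_bin(df, "p1", "Bin_f1")
df$f1 <- fill_from_bin(df, "f1", "Bin_p1")
df
```

If no earlier row matches, the mean of an empty set is NaN, so such cells would need separate handling.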

If function on a for loop

I have two dataframes with different number of lines and columns, such as:
a (12981 lines and 3 columns)
Year Month Day
1980 1 1
1980 1 2
1980 1 3
1980 1 4
1980 1 5
...
1980 1 31
1980 2 1
1980 2 2
1980 2 3
1980 2 4
1980 2 5
...
b (426 lines and 3 columns)
Year Month Value
1980 1 356
1980 2 389
1980 3 378
1980 4 450
1980 5 500
...
1981 2 450
I want to add "Value" column (from b ) to a to get something like this:
a_withValues (12981 lines with 4 columns)
Year Month Day Value
1980 1 1 356
1980 1 2 356
1980 1 3 356
1980 1 4 356
1980 1 5 356
...
1980 1 31 356
1980 2 1 389
1980 2 2 389
1980 2 3 389
1980 2 4 389
1980 2 5 389
...
In other words, if a$Year and a$Month are equal to b$Year and b$Month, I want to add (as a new column in a) the corresponding value from b$Value.
There is a base R solution to this: just use the function merge. By default it joins on the columns whose names appear in both data frames, so in your case it will work out of the box:
a <- expand.grid(year=1980, month=1:2, day=1:30)
b <- data.frame(year=1980, month=1:2, value=c(356,389))
a_with_b <- merge(a,b)
Here:
> head(a)
year month day
1 1980 1 1
2 1980 2 1
3 1980 1 2
4 1980 2 2
5 1980 1 3
6 1980 2 3
> head(b)
year month value
1 1980 1 356
2 1980 2 389
> head(a_with_b)
year month day value
1 1980 1 1 356
2 1980 1 8 356
3 1980 1 2 356
4 1980 1 9 356
5 1980 1 3 356
6 1980 1 10 356
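As the output shows, merge may reorder the rows. If the original row order of a has to be kept, one alternative is a lookup with match on a combined key. This is a sketch using the question's column names on small made-up data; key_a and key_b are illustrative names.

```r
a <- expand.grid(Year = 1980, Month = 1:2, Day = 1:30)
b <- data.frame(Year = 1980, Month = 1:2, Value = c(356, 389))

# Build a combined Year/Month key and look up each row of a in b
key_a <- paste(a$Year, a$Month)
key_b <- paste(b$Year, b$Month)
a$Value <- b$Value[match(key_a, key_b)]

head(a)
```

Rows of a whose Year/Month pair does not occur in b get NA in the new column.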
What you are looking for is a join of the data.frames (at least to my understanding). That includes matching keys of the two items and then adding the values as another column.
You can achieve merging the two datasets like this, using data.table:
library(data.table)
dt1 <- data.table(Year = 1980,
Month = 1:3,
Day = 1)
dt1
# Year Month Day
# 1: 1980 1 1
# 2: 1980 2 1
# 3: 1980 3 1
dt2 <- data.table(Year = 1980,
Month = 1:3,
Value = runif(3, 100, 1000))
dt2
# Year Month Value
# 1: 1980 1 389.7436
# 2: 1980 2 902.0029
# 3: 1980 3 663.6313
merge(dt1, dt2, by = c("Year", "Month"), all.x = TRUE)[order(Year, Month)]
# Year Month Day Value
# 1: 1980 1 1 389.7436
# 2: 1980 2 1 902.0029
# 3: 1980 3 1 663.6313
If you just want to create another column in one data.table (note that data.tables are similar to data.frames in many respects) without any matching, you can do it like this:
dt1$Value <- dt2$Value
