Combine data in many row into a columnn - r

I have a data like this:
year Male
1 2011 8
2 2011 1
3 2011 4
4 2012 3
5 2012 12
6 2012 9
7 2013 4
8 2013 3
9 2013 3
and I need to group the data for the year 2011 in one column, 2012 in the next column and so on.
2011 2012 2013
1 8 3 4
2 1 12 3
3 4 9 3
How do I achieve this?

One option is unstack if the number of rows per 'year' is the same
unstack(df1, Male ~ year)

One option is to use functions from dplyr and tidyr.
library(dplyr)
library(tidyr)
dt2 <- dt %>%
group_by(year) %>%
mutate(ID = 1:n()) %>%
spread(year, Male) %>%
select(-ID)

1
If every year has the same number of data, you could split the data and cbind it using base R
do.call(cbind, split(df$Male, df$year))
# 2011 2012 2013
#[1,] 8 3 4
#[2,] 1 12 3
#[3,] 4 9 3
2
If every year does not have the same number of data, you could use rbind.fill of plyr
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(plyr)
setNames(object = data.frame(t(rbind.fill.matrix(lapply(split(df$Male, df$year), t)))),
nm = unique(df$year))
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA
3
Yet another way is to use dcast to convert data from long to wide format
df[10,] = c(2015, 5) #Add only one data for the year 2015
library(reshape2)
dcast(df, ave(df$Male, df$year, FUN = seq_along) ~ year, value.var = "Male")[,-1]
# 2011 2012 2013 2015
#1 8 3 4 5
#2 1 12 3 NA
#3 4 9 3 NA

Related

Remove duplicate year rows by groups [duplicate]

This question already has answers here:
get rows of unique values by group
(4 answers)
Closed 1 year ago.
I have a data.table of the following form:-
data <- data.table(group = rep(1:3, each = 4),
year = c(2011:2014, rep(2011:2012, each = 2),
2012, 2012, 2013, 2014), value = 1:12)
This is only an abstract of my data.
So group 2 has 2 values for 2011 and 2012. And group 3 has 2 values for the year 2012. I want to just keep the first row for all the duplicated years.
So, in effect, my data.table will become the following:-
data <- data.table(group = c(rep(1, 4), rep(2, 2), rep(3, 3)),
year = c(2011:2014, 2011, 2012, 2012, 2013, 2014),
value = c(1:5, 7, 9, 11, 12))
How can I achieve this? Thanks in advance.
Try this data.table option with duplicated
> data[!duplicated(cbind(group, year))]
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
For data.tables you can pass by argument to unique -
library(data.table)
unique(data, by = c('group', 'year'))
# group year value
#1: 1 2011 1
#2: 1 2012 2
#3: 1 2013 3
#4: 1 2014 4
#5: 2 2011 5
#6: 2 2012 7
#7: 3 2012 9
#8: 3 2013 11
#9: 3 2014 12
Using base R
subset(data, !duplicated(cbind(group, year)))
One solution would be to use distinct from dplyr like so:
library(dplyr)
data %>%
distinct(group, year, .keep_all = TRUE)
Output:
group year value
1: 1 2011 1
2: 1 2012 2
3: 1 2013 3
4: 1 2014 4
5: 2 2011 5
6: 2 2012 7
7: 3 2012 9
8: 3 2013 11
9: 3 2014 12
This should do the trick:
library(tidyverse)
data %>%
group_by(group, year) %>%
filter(!duplicated(group, year))

Calculate average of values in R and add result as new rows instead of as a new column

I have a dataframe like the following one:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
and I want to calculate the average value by day for the three year period (2014, 2015, 2016). The following code works for this purpose:
data %>%
group_by(day) %>%
mutate(MEAN = mean(value))
and produces this output:
day year value MEAN
1 2014 5 7
1 2015 16 7
1 2016 0 7
2 2014 3 3
2 2015 1 3
2 2016 4 3
but I want to add the average values as new rows in the same dataframe as follows:
day year value
1 2014 5
1 2015 16
1 2016 0
2 2014 3
2 2015 1
2 2016 4
1 avg 7 <--
2 avg 3 <--
Any suggestions about how can I possibly do this? Thanks!
We can use summarise (instead of mutate - which adds a new column in the original dataset) to calculate the mean and then with bind_rows can bind with original data. The tidyverse functions are very particular about type, so make sure the class are the same before we do the binding
library(dplyr)
data %>%
group_by(day) %>%
summarise(year = 'avg', value = mean(value)) %>%
bind_rows(data %>%
mutate(year = as.character(year)), .)
# day year value
#1 1 2014 5.00
#2 1 2015 16.00
#3 1 2016 0.00
#4 2 2014 3.00
#5 2 2015 1.00
#6 2 2016 4.00
#7 1 avg 7.00
#8 2 avg 2.67
Another option is to split by the 'day' and then with add_row (from tibble) create a new row on each of the list elements
library(tibble)
library(purrr)
data %>%
mutate(year = as.character(year)) %>%
group_split(day) %>%
map_dfr(~ .x %>% add_row(day = first(.$day),
year = 'avg', value = mean(.$value)))
Here is a base R option using aggregate
rbind(df,cbind(aggregate(value~day,df,mean),year = "avg")[c(1,3,2)])
or a variation (by #thelatemail from comments)
rbind(df, aggregate(df["value"], cbind(df["day"], year="avg"), FUN=mean))
which gives
day year value
1 1 2014 5.000000
2 1 2015 16.000000
3 1 2016 0.000000
4 2 2014 3.000000
5 2 2015 1.000000
6 2 2016 4.000000
7 1 avg 7.000000
8 2 avg 2.666667

How do you select a max of one column and not NA's in another column in R?

I'm looking for a way in R where I can select the max(col1) where col2 is not NA?
Example datafame named df1
#df1
Year col1 col2
2016 4 NA # has NA
2016 2 NA # has NA
2016 1 3 # this is the max for 2016
2017 3 NA
2017 2 3 # this is the max for 2017
2017 1 3
2018 2 4 # this is the max for 2018
2018 1 NA
I would like the new dataset to only return
Year col1 col2
2016 1 3
2017 2 3
2018 2 4
If any one can help, it would be very appreciated?
In base R
out <- na.omit(df1)
merge(aggregate(col1 ~ Year, out, max), out) # thanks to Rui
# Year col1 col2
#1 2016 1 3
#2 2017 2 3
#3 2018 2 4
Using dplyr:
library(dplyr)
df1 %>% filter(!is.na(col2)) %>%
group_by(year) %>%
arrange(desc(col1)) %>%
slice(1)
Using data.table:
library(data.table)
setDT(df1)
df1[!is.na(col2), .SD[which.max(col1)], by = Year]
This works in a fresh R session:
library(data.table)
dt = fread("Year col1 col2
2016 4 NA
2016 2 NA
2016 1 3
2017 3 NA
2017 2 3
2017 1 3
2018 2 4
2018 1 NA")
dt[!is.na(col2), .SD[which.max(col1)], by = Year]
# Year col1 col2
# 1: 2016 1 3
# 2: 2017 2 3
# 3: 2018 2 4

Appending and overwriting when joining dataframes

I have the following three dataframes:
prim <- data.frame("t"=2007:2012,
"a"=1:6,
"b"=7:12)
secnd <- data.frame("t"=2012:2013,
"a"=c(5, 7))
third <- data.frame("t"=2012:2013,
"b"=c(11, 13))
I want to join secnd and third to prim in two steps. In the first step I join prim and secnd, where any existing elements in prim are overwritten by those in secnd, so we end up with:
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 12
7 2013 7 NA
After this I want to join with third, where again existing elements are overwritten by those in third:
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 11
7 2013 7 13
Is there a way to achieve this using dplyr or base R?
By using dplyr you can do:
require(dplyr)
prim %>% full_join(secnd, by = 't') %>%
full_join(third, by = 't') %>%
mutate(a = coalesce(as.integer(a.y),a.x),
b = coalesce(as.integer(b.y),b.x)) %>%
select(t,a,b)
I added the as.integer function since you have different data types in your dataframes.
Consider base R with a chain merge and ifelse calls, followed by final column cleanup:
final_df <- Reduce(function(x, y) merge(x, y, by="t", all=TRUE), list(prim, secnd, third))
final_df <- within(final_df, {
a.x <- ifelse(is.na(a.y), a.x, a.y)
b.x <- ifelse(is.na(b.y), b.x, b.y)
})
final_df <- setNames(final_df[,1:3], c("t", "a", "b"))
final_df
# t a b
# 1 2007 1 7
# 2 2008 2 8
# 3 2009 3 9
# 4 2010 4 10
# 5 2011 5 11
# 6 2012 5 11
# 7 2013 7 13
Not very pretty. But seems to do the job
prim %>%
anti_join(secnd, by = "t") %>%
full_join(secnd, by = c("t", "a")) %>%
select(-b) %>%
left_join(prim %>%
anti_join(third, by = "t") %>%
full_join(third, by = c("t", "b")) %>%
select(-a))
gives
t a b
1 2007 1 7
2 2008 2 8
3 2009 3 9
4 2010 4 10
5 2011 5 11
6 2012 5 11
7 2013 7 13

how to replace missing values with previous year's binned mean

I have a data frame as below
p1_bin and f1_bin are calculated by cut function by me with
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame (sapply(df[,-1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df<- cbind(df, binned)
Now how to calculate mean/avg for previous two years and replace in NA values with in that bin
for example : at row-5 value is NA for p1 and f1 is 30 with corresponding bin 7.. now replace NA with previous 2 years mean for same bin (7) ,i.e
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
arrange(year) %>%
mutate_at(c("p1", "f1"), "as.double") %>%
group_by(Bin_p1) %>%
mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
group_by(Bin_f1) %>%
mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA

Resources