Pivot/merge some columns in a dataset while keeping remaining columns - r

I have a dataset that resembles this (but with more columns):
table <- "year site square triangle circle
1 2019 A 3 9 5
2 2019 A 5 NA 34
3 2019 B 0 0 69
4 2019 B NA 111 2
5 2020 C 0 45 3
6 2020 C 29 0 NA
7 2020 D NA 0 1
8 2021 D 3 NA 4
9 2021 D 158 5 0
10 2021 D 2 9 0"
df <- read.table(text=table, header = TRUE)
df
I want to pivot a portion of the table so that it resembles this:
year site type count
1 2019 A square 3
2 2019 A triangle 9
3 2019 A circle 5
4 2019 A square 5
5 2019 A triangle NA
6 2019 A circle 34
7 2019 B square 0
8 2019 B triangle 0
9 2019 B circle 69
(and so on)
I've tried solutions from here, but they don't deal with the counts, so I lose those values.
For example, the code below leaves me with NAs in each column and drops the count values:
df2 <- df[1:2]
df2$type <- apply(df[3:5], 1, function(k) names(df[3:5])[k])
df2
year site type
1 2019 A circle, NA, NA
2 2019 A NA, NA, NA
3 2019 B NA
4 2019 B NA, NA, triangle
5 2020 C NA, circle
6 2020 C NA, NA
7 2020 D NA, square
8 2021 D circle, NA, NA
9 2021 D NA, NA
10 2021 D triangle, NA
I've also tried tidyr's gather() function, but it won't let me keep multiple columns.
library(tidyr)
df3 <- gather(df, year, site, `square`:`circle`)
head(df3)
year site
1 square 3
2 square 5
3 square 0
4 square NA
5 square 0
6 square 29
My only idea is to add a column of unique IDs (1-X) to my dataframe, use that with gather(), then merge the original and reshaped dataframes by that ID and remove the unwanted columns. That would work, but is there a better, cleaner solution?

How about tidyr::pivot_longer:
library(tidyr)
tidyr::pivot_longer(df, -c(year, site))
#> # A tibble: 30 x 4
#> year site name value
#> <int> <chr> <chr> <int>
#> 1 2019 A square 3
#> 2 2019 A triangle 9
#> 3 2019 A circle 5
#> 4 2019 A square 5
#> 5 2019 A triangle NA
#> 6 2019 A circle 34
#> 7 2019 B square 0
#> 8 2019 B triangle 0
#> 9 2019 B circle 69
#> 10 2019 B square NA
#> # … with 20 more rows
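If you also want the type and count column names from the desired output, pivot_longer takes names_to and values_to arguments; a small addition to the call above:
# Same reshape, but naming the new columns to match the question's desired output
tidyr::pivot_longer(df, -c(year, site), names_to = "type", values_to = "count")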

Related

How to compare two or more lines in a long dataset to create a new variable?

I have a long-format dataset like this:
ID  year  Address  Classification
 1  2020  A        NA
 1  2021  A        NA
 1  2022  B        B_
 2  2020  C        NA
 2  2021  D        NA
 2  2022  F        F_
 3  2020  G        NA
 3  2021  G        NA
 3  2022  G        G_
 4  2020  H        NA
 4  2021  I        NA
 4  2022  H        H_
Each subject has a Classification in 2022, based on their address in 2022; this classification was not made in the other years. I would like to extend it to those years: whenever a subject's address in another year is the same as the address they hold in 2022, the NA value of 'Classification' in that year should be replaced with the 2022 value.
I have been trying to convert to wide data and compare the rows more directly with dplyr, but it is not working properly because of the NA values, and it doesn't seem like a smart way to get the final dataset I want. I would like to end up with the 'Aim' column shown below:
ID  year  Address  Classification  Aim
 1  2020  A        NA              NA
 1  2021  A        NA              NA
 1  2022  B        B_              B_
 2  2020  C        NA              NA
 2  2021  D        NA              NA
 2  2022  F        F_              F_
 3  2020  G        NA              G_
 3  2021  G        NA              G_
 3  2022  G        G_              G_
 4  2020  H        NA              H_
 4  2021  I        NA              NA
 4  2022  H        H_              H_
I use tidyr::fill with dplyr::group_by for this. Note that you need to specify the direction: the default is "down", which would leave the NAs in place here because NA is the first value in each group.
library(dplyr)
library(tidyr)
df %>%
  group_by(ID, Address) %>%
  tidyr::fill(Classification, .direction = "up")
Output:
# ID year Address Classification
# <int> <int> <chr> <chr>
# 1 1 2020 A NA
# 2 1 2021 A NA
# 3 1 2022 B B_
# 4 2 2020 C NA
# 5 2 2021 D NA
# 6 2 2022 F F_
# 7 3 2020 G G_
# 8 3 2021 G G_
# 9 3 2022 G G_
#10 4 2020 H H_
#11 4 2021 I NA
#12 4 2022 H H_
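If you also want to keep the original Classification column and put the filled values in a separate Aim column, as in the desired output, one option (a minimal sketch of the same idea) is to copy the column first and fill the copy:
library(dplyr)
library(tidyr)

df %>%
  group_by(ID, Address) %>%
  mutate(Aim = Classification) %>%   # copy, so Classification itself stays untouched
  fill(Aim, .direction = "up") %>%   # fill the copy within each ID/Address group
  ungroup()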
Data
df <- read.table(text = "ID year Address Classification
1 2020 A NA
1 2021 A NA
1 2022 B B_
2 2020 C NA
2 2021 D NA
2 2022 F F_
3 2020 G NA
3 2021 G NA
3 2022 G G_
4 2020 H NA
4 2021 I NA
4 2022 H H_", header = TRUE)

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new one), keeping the same column names but repeating each row 12 times (once per month of that year), so that each yearly value appears 12 times.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
Convert df to a matrix, take the kronecker product with a vector of 12 ones and then convert back to a data.frame. The as.data.frame can be omitted if a matrix result is ok.
as.data.frame(as.matrix(df) %x% rep(1, 12))
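Note that %x% drops the column names, so the result has columns V1, V2, V3 unless you reassign names(df2) <- names(df). An alternative base-R sketch that keeps the names and column classes is to repeat the row indices:
df2 <- df[rep(seq_len(nrow(df)), each = 12), ]   # repeat every row 12 times
rownames(df2) <- NULL                            # reset the duplicated row names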

Merging two dataframes creates new missing observations

I have two dataframes with the following matching keys: year, region and province. They each have a set of variables (in this illustrative example I use x1 for df1 and x2 for df2) and both variables have several missing values on their own.
df1                               df2
year region province x1 ... xn    year region province x2 ... xn
2019      1        5 NA           2019      1        5 NA
2019      2        4 NA           2019      2        4 NA
2019      2        4 NA           2019      2        4 NA
2018      3        7 13           2018      3        7 13
2018      3        7 15           2018      3        7 15
2018      3        7 17           2018      3        7 17
I want to merge both dataframes such that they end up like this:
year region province x1 x2
2019 1 5 3 NA
2019 2 4 27 NA
2019 2 4 15 NA
2018 3 7 12 13
2018 3 7 NA 15
2018 3 7 NA 17
2017 4 9 NA 12
2017 4 9 19 30
2017 4 9 20 10
However, when doing so with merged_df <- merge(df1, df2, by=c("year","region","province"), all.x=TRUE), R seems to create a lot of additional missing values in each of the variable columns (x1 and x2) that were not there before. What is happening here? I have tried sorting both with df1 %>% arrange(province, -year) and df2 %>% arrange(province, -year), which is enough to give both dataframes a matching order, only to find the same issue when running the merge command. I've tried a bunch of other things too, but nothing seems to work. R's output looks roughly like this:
year region province x1 x2
2019 1 5 NA NA
2019 2 4 NA NA
2019 2 4 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2018 3 7 NA NA
2017 4 9 15 NA
2017 4 9 19 30
2017 4 9 20 10
I have done this before; in fact, one of the dataframes is an already merged dataframe in which I did not encounter this issue.
Maybe the concept of merge() is not clear. I include two examples with example data; I hope they help.
#Data
set.seed(123)
DF1 <- data.frame(year = rep(c(2017, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x1 = rnorm(9, 3, 1.5))
DF2 <- data.frame(year = rep(c(2016, 2018, 2019), 3),
                  region = rep(c(1, 2, 3), 3),
                  province = round(runif(9, 1, 5), 0),
                  x2 = rnorm(9, 3, 1.5))
#Merge based only on DF1 (keep all rows of DF1)
Merged1 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all.x=T)
Merged1
year region province x1 x2
1 2017 1 2 2.8365510 NA
2 2017 1 3 3.7557187 NA
3 2017 1 5 4.9208323 NA
4 2018 2 4 2.8241371 NA
5 2018 2 5 6.7925048 1.460993
6 2018 2 5 0.4090941 1.460993
7 2019 3 1 5.5352765 NA
8 2019 3 3 3.8236451 4.256681
9 2019 3 3 3.2746239 4.256681
#Merge keeping all rows, even those without a match in the other dataframe
Merged2 <- merge(DF1,DF2,by=intersect(names(DF1),names(DF2)),all = T)
Merged2
year region province x1 x2
1 2016 1 3 NA 4.052034
2 2016 1 4 NA 2.062441
3 2016 1 5 NA 2.673038
4 2017 1 2 2.8365510 NA
5 2017 1 3 3.7557187 NA
6 2017 1 5 4.9208323 NA
7 2018 2 1 NA 0.469960
8 2018 2 2 NA 2.290813
9 2018 2 4 2.8241371 NA
10 2018 2 5 6.7925048 1.460993
11 2018 2 5 0.4090941 1.460993
12 2019 3 1 5.5352765 NA
13 2019 3 2 NA 1.398264
14 2019 3 3 3.8236451 4.256681
15 2019 3 3 3.2746239 4.256681
16 2019 3 4 NA 1.906663
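For reference, the dplyr joins behave the same way; a quick sketch with the same DF1 and DF2:
library(dplyr)

# Equivalent to merge(..., all.x = TRUE): keep every row of DF1
left_join(DF1, DF2, by = c("year", "region", "province"))

# Equivalent to merge(..., all = TRUE): keep all rows from both data frames
full_join(DF1, DF2, by = c("year", "region", "province"))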

How to add columns for months in a dataframe at specific locations

I have a dataframe that looks like this:
CONTRACT_ID START_DATE SERVICE VALUE year month
1 01-01-2018 A 10 2018 1
2 01-01-2018 B 20 2018 1
3 01-01-2018 C 30 2018 1
4 01-03-2018 B 40 2018 3
5 01-03-2018 C 50 2018 3
6 01-03-2018 A 60 2018 3
And I have converted it to a form like this:
CONTRACT_ID year SERVICE 1 3
1 2018 A 10 NA
2 2018 B 20 NA
3 2018 C 30 NA
4 2018 B NA 40
5 2018 C NA 50
6 2018 A NA 60
Using reshape function like this:
reshape(df, idvar = c("year","CONTRACT_ID","SERVICE"), timevar = "month", direction = "wide")
The problem is that my current dataframe doesn't have data for some of the months, as we see here for 2 (Feb). But I would like to add columns for all the missing months, like:
CONTRACT_ID year SERVICE 1 2 3
1 2018 A 10 NA NA
2 2018 B 20 NA NA
3 2018 C 30 NA NA
4 2018 B NA NA 40
5 2018 C NA NA 50
6 2018 A NA NA 60
How do I achieve that? I know I can add columns in between and at the end, but that doesn't seem efficient. I am writing a script and I want it to be efficient and not time-consuming.
EDIT:
As per the suggestion in the comment below, I used the spread function to widen the data.
But if I keep drop = FALSE, the code outputs every combination, which significantly increases the table size. If I set it to TRUE, it doesn't create the combinations, but it also removes the month columns for which I have no data in the current data. I want to keep those columns without creating combinations of CONTRACT_ID, DATE and SERVICE that don't exist. Initially I removed those rows in subsequent steps, but the table has now grown substantially, and I need to handle this while spreading the data.
Any suggestions?
Try this.
library(tidyr)
long_data <- read.table(header=TRUE, text='
CONTRACT_ID START_DATE SERVICE VALUE year month
1 01-01-2018 A 10 2018 1
2 01-01-2018 B 20 2018 1
3 01-01-2018 C 30 2018 1
4 01-03-2018 B 40 2018 3
5 01-03-2018 C 50 2018 3
6 01-03-2018 A 60 2018 3
')
long_data
long_data$month <- factor(long_data$month, levels = 1:12, ordered = TRUE)
spread(long_data, key = month, value = VALUE, fill = NA, drop = FALSE)
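If you would rather stay with reshape() from the question, another option is to widen first and then add the missing month columns afterwards. This is only a sketch: it assumes START_DATE is dropped (as in the question's wide output) and that the widened columns come out named VALUE.1, VALUE.3, and so on.
# Widen with reshape(), then add NA columns for the months that never occur
keep <- c("CONTRACT_ID", "year", "SERVICE", "month", "VALUE")
wide <- reshape(long_data[, keep],
                idvar = c("year", "CONTRACT_ID", "SERVICE"),
                timevar = "month", direction = "wide")
wide[setdiff(paste0("VALUE.", 1:12), names(wide))] <- NA
# Put the month columns in calendar order
wide <- wide[, c("CONTRACT_ID", "year", "SERVICE", paste0("VALUE.", 1:12))]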

How to replace missing values with previous year's binned mean

I have a data frame as below.
The Bin_p1 and Bin_f1 columns are calculated with the cut function:
Bins <- function(x) cut(x, breaks = c(0, seq(1, 1000, by = 5)), labels = 1:200)
binned <- as.data.frame(sapply(df[, -1], Bins))
colnames(binned) <- paste("Bin", colnames(binned), sep = "_")
df <- cbind(df, binned)
Now, how do I replace the NA values with the mean of the previous two years within the same bin?
For example, at row 5 the value of p1 is NA while f1 is 30 with corresponding bin 7; the NA should be replaced with the mean of the previous two years for that bin (7), i.e.
df
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 NA 30 NA 7
6 2016 10 NA 2 NA
df1
ID year p1 f1 Bin_p1 Bin_f1
1 2013 20 30 5 7
2 2013 24 29 5 7
3 2014 10 16 2 3
4 2014 11 17 2 3
5 2015 **22** 30 NA 7
6 2016 10 **16.5** 2 NA
Thanks in advance
I believe the following code produces the desired output. There's probably a much more elegant way than using mean(rev(lag(f1))[1:2]) to get the average of the last two values of f1 but this should do the trick anyway.
library(dplyr)
df %>%
  arrange(year) %>%
  mutate_at(c("p1", "f1"), "as.double") %>%
  group_by(Bin_p1) %>%
  mutate(f1 = ifelse(is.na(f1), mean(rev(lag(f1))[1:2]), f1)) %>%
  group_by(Bin_f1) %>%
  mutate(p1 = ifelse(is.na(p1), mean(rev(lag(p1))[1:2]), p1)) %>%
  ungroup
and the output is:
# A tibble: 6 x 6
ID year p1 f1 Bin_p1 Bin_f1
<int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 2013 20 30.0 5 7
2 2 2013 24 29.0 5 7
3 3 2014 10 16.0 2 3
4 4 2014 11 17.0 2 3
5 5 2015 22 30.0 NA 7
6 6 2016 10 16.5 2 NA
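To unpack the mean(rev(lag(f1))[1:2]) idiom, here is a small illustration using the f1 values from the Bin_p1 == 2 group (rows 3, 4 and 6 of df), which is where the 16.5 above comes from:
library(dplyr)

f1 <- c(16, 17, NA)       # f1 within the Bin_p1 == 2 group, ordered by year
lag(f1)                   # NA 16 17 (dplyr::lag shifts the values down by one)
rev(lag(f1))[1:2]         # 17 16   (the two most recent earlier values)
mean(rev(lag(f1))[1:2])   # 16.5    (the value that replaces the NA)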
