Convert semi-long data into wide data - r

I'm quite sure there should be a simple alternative, but I'm not able to figure it out. I'm currently using a for loop, which is not optimal.
My dataframe is like this:
NAME <- c("ABC", "ABC", "ABC", "DEF", "GHI", "GHI", "JKL", "JKL", "JKL", "MNO")
YEAR <- c(2012, 2013, 2014, 2012, 2012, 2013, 2012, 2014, 2016, 2013)
MARKS <- c(45, 75, 95, 91, 75, 76, 85, 88, 89, 77)
MAXIMUM <- c(95, NA, NA, 91, 76, NA, 89, NA, NA, 77)
DF <- data.frame(
NAME,
YEAR,
MARKS,
MAXIMUM
)
> DF
NAME YEAR MARKS MAXIMUM
1 ABC 2012 45 95
2 ABC 2013 75 NA
3 ABC 2014 95 NA
4 DEF 2012 91 91
5 GHI 2012 75 76
6 GHI 2013 76 NA
7 JKL 2012 85 89
8 JKL 2014 88 NA
9 JKL 2016 89 NA
10 MNO 2013 77 77
I want to have only one name per row, and the year-wise details (the YEAR, MARKS and MAXIMUM columns) should be spread out as individual column headers. I have tried the tidyr::pivot_wider function but was not successful.
I have given the sample output here:
Required output

Perhaps you could enumerate the rows within each NAME first using row_number(). Then, use pivot_wider:
library(tidyverse)
DF %>%
  group_by(NAME) %>%
  mutate(n = row_number()) %>%
  pivot_wider(NAME, names_from = n, values_from = c(YEAR, MARKS, MAXIMUM))
Output
NAME YEAR_1 YEAR_2 YEAR_3 MARKS_1 MARKS_2 MARKS_3 MAXIMUM_1 MAXIMUM_2 MAXIMUM_3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 2012 2013 2014 45 75 95 95 NA NA
2 DEF 2012 NA NA 91 NA NA 91 NA NA
3 GHI 2012 2013 NA 75 76 NA 76 NA NA
4 JKL 2012 2014 2016 85 88 89 89 NA NA
5 MNO 2013 NA NA 77 NA NA 77 NA NA
Or, as mentioned by @RobertoT, you could make YEAR a factor and then line up your YEAR values. Using complete you can fill in NA for missing YEAR values. The final select will order your columns.
DF$YEAR_FAC = factor(DF$YEAR)
DF %>%
  group_by(NAME) %>%
  complete(YEAR_FAC, fill = list(YEAR = NA)) %>%
  mutate(n = row_number()) %>%
  pivot_wider(NAME, names_from = n, values_from = c(YEAR, MARKS, MAXIMUM)) %>%
  select(NAME, ends_with(as.character(1:nlevels(DF$YEAR_FAC))))
Output
NAME YEAR_1 MARKS_1 MAXIMUM_1 YEAR_2 MARKS_2 MAXIMUM_2 YEAR_3 MARKS_3 MAXIMUM_3 YEAR_4 MARKS_4 MAXIMUM_4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 2012 45 95 2013 75 NA 2014 95 NA NA NA NA
2 DEF 2012 91 91 NA NA NA NA NA NA NA NA NA
3 GHI 2012 75 76 2013 76 NA NA NA NA NA NA NA
4 JKL 2012 85 89 NA NA NA 2014 88 NA 2016 89 NA
5 MNO NA NA NA 2013 77 77 NA NA NA NA NA NA

In addition to @Ben's (+1) solution, we could use a trick I recently learned to order the columns (see Combining two dataframes with alternating column position):
DF %>%
  group_by(NAME) %>%
  mutate(n = row_number()) %>%
  pivot_wider(NAME, names_from = n, values_from = c(YEAR, MARKS, MAXIMUM)) %>%
  ungroup() %>%
  select(NAME, all_of(c(matrix(names(.)[-1], ncol = 3, byrow = TRUE))))
  NAME  YEAR_1 MARKS_1 MAXIMUM_1 YEAR_2 MARKS_2 MAXIMUM_2 YEAR_3 MARKS_3 MAXIMUM_3
  <chr>  <dbl>   <dbl>     <dbl>  <dbl>   <dbl>     <dbl>  <dbl>   <dbl>     <dbl>
1 ABC     2012      45        95   2013      75        NA   2014      95        NA
2 DEF     2012      91        91     NA      NA        NA     NA      NA        NA
3 GHI     2012      75        76   2013      76        NA     NA      NA        NA
4 JKL     2012      85        89   2014      88        NA   2016      89        NA
5 MNO     2013      77        77     NA      NA        NA     NA      NA        NA
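Side note: with tidyr >= 1.2.0, pivot_wider() also has a names_vary argument, and names_vary = "slowest" produces the interleaved YEAR_1, MARKS_1, MAXIMUM_1, ... order directly. A minimal sketch, assuming a recent tidyr:
library(dplyr)
library(tidyr)
DF %>%
  group_by(NAME) %>%
  mutate(n = row_number()) %>%
  ungroup() %>%
  # names_vary = "slowest" keeps all columns for n = 1 together, then n = 2, ...
  pivot_wider(id_cols = NAME, names_from = n,
              values_from = c(YEAR, MARKS, MAXIMUM),
              names_vary = "slowest")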

I think all the previous answers have overlooked that the expected output treats YEAR as a factor: it has four groups of columns per row, not three, so different years are never mixed in the same column.
You can assign a number to every row (grp) based on the level of YEAR as a factor. Also, if you pivot longer first, you can arrange the values the way you want and then pivot everything wider so the columns come out in the expected order:
library(tidyverse)
DF %>%
  mutate(grp = as.integer(factor(YEAR, unique(YEAR)))) %>%
  pivot_longer(cols = c('YEAR', 'MARKS', 'MAXIMUM'), names_to = 'COLNAMES', values_to = 'COL_VALUES') %>%
  arrange(NAME, grp) %>%
  pivot_wider(names_from = c(COLNAMES, grp), values_from = COL_VALUES, names_sep = '')
Output:
# A tibble: 5 x 13
NAME YEAR1 MARKS1 MAXIMUM1 YEAR2 MARKS2 MAXIMUM2 YEAR3 MARKS3 MAXIMUM3 YEAR4 MARKS4 MAXIMUM4
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 2012 45 95 2013 75 NA 2014 95 NA NA NA NA
2 DEF 2012 91 91 NA NA NA NA NA NA NA NA NA
3 GHI 2012 75 76 2013 76 NA NA NA NA NA NA NA
4 JKL 2012 85 89 NA NA NA 2014 88 NA 2016 89 NA
5 MNO NA NA NA 2013 77 77 NA NA NA NA NA NA
However, I suggest keeping track of the years in the column names so the tibble is less confusing:
DF$YEAR = factor(DF$YEAR)
DF %>%
  pivot_longer(cols = c('MARKS', 'MAXIMUM'), names_to = 'COLNAMES', values_to = 'COL_VALUES') %>%
  arrange(NAME, YEAR) %>%
  pivot_wider(names_from = c(COLNAMES, YEAR), values_from = COL_VALUES)
# A tibble: 5 x 9
NAME MARKS_2012 MAXIMUM_2012 MARKS_2013 MAXIMUM_2013 MARKS_2014 MAXIMUM_2014 MARKS_2016 MAXIMUM_2016
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ABC 45 95 75 NA 95 NA NA NA
2 DEF 91 91 NA NA NA NA NA NA
3 GHI 75 76 76 NA NA NA NA NA
4 JKL 85 89 NA NA 88 NA 89 NA
5 MNO NA NA 77 77 NA NA NA NA

Here is a version with data.table:
library(data.table)
DT <- setDT(DF)
# number the rows within each NAME
DT[, I := .I - .I[1] + 1, by = NAME]
# melt YEAR, MARKS and MAXIMUM into long format
tmp <- melt(DT, measure.vars = c("YEAR", "MARKS", "MAXIMUM"))
# cast back to wide, pasting the variable name and the within-group index
dcast(tmp,
      NAME ~ paste0(variable, I),
      value.var = "value")
NAME MARKS1 MARKS2 MARKS3 MAXIMUM1 MAXIMUM2 MAXIMUM3 YEAR1 YEAR2 YEAR3
1: ABC 45 75 95 95 NA NA 2012 2013 2014
2: DEF 91 NA NA 91 NA NA 2012 NA NA
3: GHI 75 76 NA 76 NA NA 2012 2013 NA
4: JKL 85 88 89 89 NA NA 2012 2014 2016
5: MNO 77 NA NA 77 NA NA 2013 NA NA
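As a side note, data.table's rowid() helper computes the same within-group counter as the .I arithmetic above; a minimal sketch:
library(data.table)
# rowid(NAME) returns 1, 2, 3, ... within each NAME group,
# matching the I column built manually above.
DT[, I := rowid(NAME)]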

Related

How to subtract data value using latest and earliest date (value is in another column) using R

I am struggling to understand how I can subtract the blood pressure data if the patient had anywhere from 1 measurement to 5 measurements. For example, my data
ID  Date1       Value1  Date2       Value2  Date3       Value3  Date4       Value4  Date5       Value5
1   01/01/2022  160     01/02/2022  161     01/04/2022  159     01/05/2022  159     01/06/2022  130
2   08/02/2022  130     01/07/2022  120     NA          NA      NA          NA      NA          NA
3   01/04/2022  112     29/09/2022  161     10/10/2022  159     NA          NA      NA          NA
4   01/10/2022  182     NA          NA      NA          NA      NA          NA      NA          NA
So some patients will have all 5 measurements (e.g. ID 1) while some patients will have only 1 measurement (e.g. ID 4).
I want to make a new variable that subtracts the earliest value from the latest value. If the patient only has 1 measurement, the new variable will be NA. For example, like this:
ID  Date1       Value1  Date2       Value2  Date3       Value3  Date4       Value4  Date5       Value5  NewVariable
1   01/01/2022  160     01/02/2022  161     01/04/2022  159     01/05/2022  159     01/06/2022  130     -30
2   08/02/2022  130     01/07/2022  120     NA          NA      NA          NA      NA          NA      -10
3   01/04/2022  112     29/09/2022  161     10/10/2022  159     NA          NA      NA          NA      47
4   01/10/2022  182     NA          NA      NA          NA      NA          NA      NA          NA      NA
I am using RStudio for this. I would appreciate any coding help to achieve this!
Assuming that the first value is always located in Value1 and that the dates are sorted correctly, the dplyr package makes this straightforward.
Use coalesce to find the first non-missing value among Value5 down to Value2 (i.e. the latest available measurement), and subtract Value1 from it.
library(dplyr)
mutate(df, NewVariable = coalesce(Value5, Value4, Value3, Value2) - Value1)
#> # A tibble: 4 × 12
#> ID Date1 Value1 Date2 Value2 Date3 Value3 Date4 Value4 Date5 Value5 NewVariable
#> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 1 01/01/2022 160 01/02/2022 161 01/04/2022 159 01/05/2022 159 01/06/2022 130 -30
#> 2 2 08/02/2022 130 01/07/2022 120 <NA> NA <NA> NA <NA> NA -10
#> 3 3 01/04/2022 112 29/09/2022 161 10/10/2022 159 <NA> NA <NA> NA 47
#> 4 4 01/10/2022 182 <NA> NA <NA> NA <NA> NA <NA> NA NA
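If the dates are not guaranteed to be in chronological order, a longer-format sketch can compute latest minus earliest explicitly. This assumes the columns are named Date1..Date5 / Value1..Value5 as in the question, that missing entries are real NAs, and that the dates are dd/mm/yyyy strings:
library(dplyr)
library(tidyr)
df %>%
  # one row per ID and visit, with Date and Value columns
  pivot_longer(-ID,
               names_to = c(".value", "visit"),
               names_pattern = "([A-Za-z]+)(\\d)") %>%
  filter(!is.na(Date), !is.na(Value)) %>%
  mutate(Date = as.Date(Date, format = "%d/%m/%Y")) %>%
  group_by(ID) %>%
  summarise(NewVariable = if (n() > 1)
              Value[which.max(Date)] - Value[which.min(Date)]
            else NA_real_,
            .groups = "drop") %>%
  # attach the result back to the original wide data
  left_join(df, ., by = "ID")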

Collapse data frame so NAs are removed

I want to collapse this data frame so the NAs are removed. How do I accomplish this? Thanks!!
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id,q1)
df$row <- 1:nrow(df)
spread(df, id, q1)
row 1 2 3 4 5
1 23 NA NA NA NA
2 55 NA NA NA NA
3 7 NA NA NA NA
4 NA 88 NA NA NA
5 NA 90 NA NA NA
6 NA NA 34 NA NA
7 NA NA NA 11 NA
8 NA NA NA NA 22
9 NA NA NA NA 99
I want it to look like this:
1 2 3 4 5
23 88 34 11 22
55 90 NA NA 99
7 NA NA NA NA
The row index should be created based on the sequence within each 'id'. In addition, pivot_wider is a more general function than the now-superseded spread:
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  mutate(row = row_number()) %>%
  ungroup %>%
  pivot_wider(names_from = id, values_from = q1) %>%
  select(-row)
-output
# A tibble: 3 × 5
`1` `2` `3` `4` `5`
<dbl> <dbl> <dbl> <dbl> <dbl>
1 23 88 34 11 22
2 55 90 NA NA 99
3 7 NA NA NA NA
Or use dcast
library(data.table)
dcast(setDT(df), rowid(id) ~ id, value.var = 'q1')[, id := NULL][]
1 2 3 4 5
<num> <num> <num> <num> <num>
1: 23 88 34 11 22
2: 55 90 NA NA 99
3: 7 NA NA NA NA
Here's a base R solution. I sort each column so the non-NA values are at the top, find the number of non-NA values in the column with the most non-NA values (n), and return the top n rows from the data frame.
library(tidyr)
id <- c(1,1,1,2,2,3,4,5,5)
q1 <- c(23,55,7,88,90,34,11,22,99)
df <- data.frame(id, q1)
df$row <- 1:nrow(df)
df <- spread(df, id, q1)
collapse_df <- function(df) {
  move_na_to_bottom <- function(x) x[order(is.na(x))]
  sorted <- sapply(df, move_na_to_bottom)
  count_non_na <- function(x) sum(!is.na(x))
  n <- max(apply(df, 2, count_non_na))
  sorted[1:n, ]
}
collapse_df(df[, -1])
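For reference, the call returns a matrix (sapply simplifies the sorted columns), and with the data above the result should look roughly like this:
      1  2  3  4  5
[1,] 23 88 34 11 22
[2,] 55 90 NA NA 99
[3,]  7 NA NA NA NA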

R: Pivot_Wider/spread by obtaining average sorted by year

I have the following dataset
Pet Shop  Year  Item    Price
A         2021  dog     300
A         2021  dog     250
A         2021  fish    20
A         2020  turtle  50
A         2020  dog     250
A         2020  cat     280
A         2019  rabbit  180
A         2019  cat     165
A         2019  cat     270
B         2021  dog     350
B         2021  fish    80
B         2021  fish    70
B         2020  cat     220
B         2020  turtle  90
B         2020  turtle  80
B         2020  fish    55
B         2019  fish    75
C         2021  dog     280
C         2020  cat     260
C         2020  cat     270
C         2019  fish    65
C         2019  cat     270
The code for the data is as follows
Pet_Shop = c(rep("A", 9), rep("B", 8), rep("C", 5))
Year = c(2021, 2021, 2021, 2020, 2020, 2020, 2019, 2019, 2019,
         2021, 2021, 2021, 2020, 2020, 2020, 2020, 2019,
         2021, 2020, 2020, 2019, 2019)
Item = c("dog", "dog", "fish", "turtle", "dog", "cat", "rabbit", "cat", "cat",
         "dog", "fish", "fish", "cat", "turtle", "turtle", "fish", "fish",
         "dog", "cat", "cat", "fish", "cat")
Price = c(300, 250, 20, 50, 250, 280, 180, 165, 270,
          350, 80, 70, 220, 90, 80, 55, 75,
          280, 260, 270, 65, 270)
Data = data.frame(Pet_Shop, Year, Item, Price)
Does anyone here know how I can use pivot_wider or spread (or any other method) to achieve the following table? It groups each shop by year and takes the average price of identical items for that shop and year. I'm having trouble incorporating the year.
Pet Shop  Year  dog                     fish  turtle  cat    rabbit
A         2021  Average(300,250) = 275  20    NA      NA     NA
A         2020  250                     NA    50      280    NA
A         2019  NA                      NA    NA      217.5  NA
B         2021  350                     75    NA      NA     NA
B         2020  NA                      55    85      220    NA
B         2019  NA                      75    NA      NA     NA
C         2021  280                     NA    NA      NA     NA
C         2020  NA                      NA    NA      265    NA
C         2019  NA                      60    NA      270    NA
In pivot_wider you may pass a function (values_fn) that is applied to the values collected for each combination of Pet_Shop, Year and Item.
result <- tidyr::pivot_wider(Data, names_from = Item,
                             values_from = Price, values_fn = mean)
result
# Pet_Shop Year dog fish turtle cat rabbit
# <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 A 2021 275 20 NA NA NA
#2 A 2020 250 NA 50 280 NA
#3 A 2019 NA NA NA 218. 180
#4 B 2021 350 75 NA NA NA
#5 B 2020 NA 55 85 220 NA
#6 B 2019 NA 75 NA NA NA
#7 C 2021 280 NA NA NA NA
#8 C 2020 NA NA NA 265 NA
#9 C 2019 NA 65 NA 270 NA
The same can also be done with data.table dcast -
library(data.table)
dcast(setDT(Data), Pet_Shop + Year ~ Item,
value.var = "Price", fun.aggregate = mean)
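For completeness, a dplyr/tidyr sketch that aggregates first and then pivots (assuming Data includes the Year column as constructed above) should give the same result:
library(dplyr)
library(tidyr)
Data %>%
  group_by(Pet_Shop, Year, Item) %>%
  summarise(Price = mean(Price), .groups = "drop") %>%
  pivot_wider(names_from = Item, values_from = Price)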

How to delete missing observations for a subset of columns: the R equivalent of dropna(subset) from python pandas

Consider a dataframe in R where I want to drop row 6 because it has missing observations for the variables var1:var3. But the dataframe has valid observations for id and year. See code below.
In Python, this can be done in two ways:
use df.dropna(subset = ['var1', 'var2', 'var3'], inplace=True)
use df.set_index(['id', 'year']).dropna()
How can I do this in R with the tidyverse?
library(tidyverse)
df <- tibble(id = c(seq(1,10)), year=c(seq(2001,2010)),
var1 = c(sample(1:100, 10, replace=TRUE)),
var2 = c(sample(1:100, 10, replace=TRUE)),
var3 = c(sample(1:100, 10, replace=TRUE)))
df[3,4] = NA
df[6,3:5] = NA
df[8,3:4] = NA
df[10,4:5] = NA
We may use complete.cases
library(dplyr)
df %>%
filter(if_any(var1:var3, complete.cases))
-output
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 48 55 82
2 2 2002 22 83 67
3 3 2003 89 NA 19
4 4 2004 56 1 38
5 5 2005 17 58 35
6 7 2007 4 30 94
7 8 2008 NA NA 36
8 9 2009 97 100 80
9 10 2010 37 NA NA
We can use pmap for this case also:
library(dplyr)
library(purrr)
df %>%
filter(!pmap_lgl(., ~ {x <- c(...)[-c(1, 2)];
all(is.na(x))}))
# A tibble: 9 x 5
id year var1 var2 var3
<int> <int> <int> <int> <int>
1 1 2001 90 55 77
2 2 2002 77 5 18
3 3 2003 17 NA 70
4 4 2004 72 33 33
5 5 2005 10 55 77
6 7 2007 22 81 17
7 8 2008 NA NA 46
8 9 2009 93 28 100
9 10 2010 50 NA NA
Or we could use the complete.cases function in pmap, as suggested by dear @akrun:
df %>%
filter(pmap_lgl(select(., 3:5), ~ any(complete.cases(c(...)))))
You can use if_any in filter -
library(dplyr)
df %>% filter(if_any(var1:var3, Negate(is.na)))
# id year var1 var2 var3
# <int> <int> <int> <int> <int>
#1 1 2001 14 99 43
#2 2 2002 25 72 76
#3 3 2003 90 NA 15
#4 4 2004 91 7 32
#5 5 2005 69 42 7
#6 7 2007 57 83 41
#7 8 2008 NA NA 74
#8 9 2009 9 78 23
#9 10 2010 93 NA NA
In base R, we can use rowSums to select rows which have at least one non-NA value.
cols <- grep('var', names(df))
df[rowSums(!is.na(df[cols])) > 0, ]
If looking for complete cases, use the following (kernel of this is based on other answers):
library(tidyverse)
df <- tibble(id = c(seq(1,10)), year=c(seq(2001,2010)),
var1 = c(sample(1:100, 10, replace=TRUE)),
var2 = c(sample(1:100, 10, replace=TRUE)),
var3 = c(sample(1:100, 10, replace=TRUE)))
df[3,4] = NA
df[6,3:5] = NA
df[8,3:4] = NA
df[10,4:5] = NA
df %>% filter(!if_any(var1:var3, is.na))
#> # A tibble: 6 x 5
#> id year var1 var2 var3
#> <int> <int> <int> <int> <int>
#> 1 1 2001 13 28 26
#> 2 2 2002 61 77 58
#> 3 4 2004 95 38 58
#> 4 5 2005 38 34 91
#> 5 7 2007 85 46 14
#> 6 9 2009 45 60 40
Created on 2021-06-24 by the reprex package (v2.0.0)
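Side note: the closest one-liner to pandas' dropna(subset = ...) (which by default drops rows with an NA in any of the listed columns) is arguably tidyr::drop_na() with a column selection; a minimal sketch, equivalent to the complete-cases filter above:
library(tidyr)
df %>% drop_na(var1:var3)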

Spread and Gather table return duplicated rows with NA values

I have a table with categories and subcategories encoded in this format of column names:
Date | Admissions__0 | Attendance__0 | Tri_1__0 | Tri_2__0 | ... | Tri_1__1 | Tri_2__1 | ...
and I would like to change it to this column format using the spread and gather functions from the tidyverse:
Date | Country code | Admissions | Attendance | Tri_1 | Tri_2 | ...
I tried a previously posted solution, but the outcome returns multiple rows with NAs rather than a single row per date and country code.
My code used:
temp <- data %>% gather(key = "columns", value = "dt", -Date)
temp <- temp %>% mutate(category = gsub(".*__", "", columns)) %>% mutate(columns = gsub("__\\d", "", columns))
temp %>% mutate(row = row_number()) %>% spread(key = "columns", value = "dt")
And my result is:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 NA 209 NA NA NA NA NA
2 01-APR-2014 0 640 84 NA NA NA NA NA NA
3 01-APR-2014 0 1005 NA NA 5 NA NA NA NA
4 01-APR-2014 0 1370 NA NA NA 33 NA NA NA
5 01-APR-2014 0 1735 NA NA NA NA 62 NA NA
6 01-APR-2014 0 2100 NA NA NA NA NA 80 NA
7 01-APR-2014 0 2465 NA NA NA NA NA NA 29
8 01-APR-2014 1 2830 NA 138 NA NA NA NA NA
9 01-APR-2014 1 3195 66 NA NA NA NA NA NA
10 01-APR-2014 1 3560 NA NA N/A NA NA NA NA
My expected results:
Date country_code row admissions attendance Tri_1 Tri_2 Tri_3 Tri_4 Tri_5
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 01-APR-2014 0 275 84 209 5 33 62 80 29
8 01-APR-2014 1 2830 66 138 66 ... ... ... ...
We can do a summarise_at with coalesce to remove the NA elements after the spread:
library(tidyverse)
data %>%
  gather(key = "columns", value = "dt", -Date, na.rm = TRUE) %>%
  mutate(category = gsub(".*__", "", columns)) %>%
  mutate(columns = gsub("__\\d", "", columns)) %>%
  group_by(Date, dt, columns, category) %>%
  mutate(rn = row_number()) %>%
  spread(columns, dt) %>%
  select(-V1) %>%
  summarise_at(vars(Admissions:Tri_5), list(~ coalesce(!!! .))) # %>%
  # filter if needed
  # filter_at(vars(Admissions:Tri_5), all_vars(!is.na(.)))
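With tidyr >= 1.0, the ".value" sentinel in pivot_longer can do this reshape in one step; a minimal sketch, assuming every non-Date column name follows the <measure>__<country code> pattern shown above:
library(tidyr)
data %>%
  # ".value" keeps Admissions/Attendance/Tri_* as columns,
  # while the part after "__" becomes the country_code column
  pivot_longer(-Date,
               names_to = c(".value", "country_code"),
               names_sep = "__")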
