Advanced reshaping / pivoting of a data frame in R

I am struggling to reshape a dataframe in R. My starting point is a dataframe, which has the following structure:
df_given <- data.frame(
first_column = c("NA", "NA", "NA", "Country1", "Country2", "Country3"),
second_column = c("Consumption", "real", "2021", 10, 11, 23),
third_column = c("Consumption", "real", "2022", 20, 22, 12),
fourth_column = c("Inflation", "expected", "2021", 1, 1.2, 2.5),
fifth_column = c("Inflation", "expected", "2022", 5, 3, 2)
)
My problem is the following: I would like 2021 and 2022 to appear only once, as two columns, instead of the year sequence being repeated for each series. This involves moving the "description" of each time series (e.g. consumption real and inflation expected) from the header rows into a column. My target dataframe would therefore look something like this:
df_target <- data.frame(
first_column = c("type", "Country1 Consumption real", "Country2 Consumption real",
"Country3 Consumption real", "Country1 Inflation expected",
"Country2 Inflation expected","Country3 Inflation expected"),
second_column = c(2021, 10, 11, 23, 1, 1.2, 2.5),
third_column = c(2022, 20, 22, 12, 5, 3, 2)
)
I assume that pivoting to wider or longer would do the trick. However, I can't really tell whether my current dataframe is in long or wide format, because I think it is a bit of both. Can anyone tell me how to approach this problem? Thanks in advance.

You can use data.table, after dropping the extra information in the first three rows, which aren't really data.
names(df_given) <- c("country","Real C 2021", "Real C 2022", "Inf 2021", "Inf 2022")
df_given <- df_given[-c(1:3),]
library(data.table)
setDT(df_given)
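# Group the measure columns by year so that each year becomes one value column;
# `variable` then indexes the series (1 = Real C, 2 = Inf).
melt(df_given, measure.vars = patterns("2021$", "2022$"), value.name = c("2021", "2022"))
    country variable 2021 2022
1: Country1        1   10   20
2: Country2        1   11   22
3: Country3        1   23   12
4: Country1        2    1    5
5: Country2        2  1.2    3
6: Country3        2  2.5    2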

The easiest way is to do it manually, using base R (starting from the original df_given):
# Transpose: ir => data.frame
ir <- data.frame(t(df_given))
# Derive metrics: ir2 => character vector
ir2 <- apply(ir[,1:3], 1, paste, collapse = " ")[-1]
# Derive countries: ir3 => character vector
ir3 <- unlist(ir[1,4:ncol(ir), drop = TRUE])
# Derive values: ir4 => data.frame
ir4 <- unlist(ir[2:nrow(ir), 4:ncol(ir)])
# Reshape into long df: ir5 => data.frame
ir5 <- within(
  data.frame(
    cbind(
      stat = ir2,
      country = rep(ir3, each = length(ir2)),
      val = ir4
    ),
    row.names = NULL
  ),
  {
    # The year is the last four characters of the description, e.g. "2021"
    year <- substring(stat, nchar(stat) - 3)
    stat <- trimws(gsub(paste0(year, collapse = "|"), "", stat))
  }
)
# Pivot: data.frame => stdout(console)
reshape(
ir5,
idvar=c("country", "stat"),
timevar="year",
v.names="val",
direction="wide"
)

Thanks a lot for your input; your solutions were very helpful in finding my own. For me, the most important takeaway was to merge all "identifying" rows into the header, which makes all of the following operations a lot easier.
library(dplyr)
library(tidyr)
library(stringr)
# merge row 1 with row 2
df_given[1,] <- paste(df_given[1,], df_given[2,])
df_given <- df_given[-c(2),]
# merge the merged row with the date row, using a unique separator, into a header
names(df_given) <- as.character(paste(df_given[1,], df_given[2,], sep = "-x-"))
df_given <- df_given[-c(1,2),]
names(df_given)[1] <- 'country'
# pivot longer
df_given_new <- df_given %>%
  pivot_longer(!country, names_to = "identifier", values_to = "obs")
# split the identifier into type and year
df_given_new[c('type', 'year')] <- str_split_fixed(df_given_new$identifier, '-x-', 2)
df_given_new <- subset(df_given_new, select = -c(identifier))
# back to wide: one column per year
dfFinal <- df_given_new %>%
  pivot_wider(names_from = year, values_from = obs)
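The same idea (fold the identifying rows into the header, then pivot) can be sketched more compactly. This is only an illustration, assuming tidyr 1.0.0 or later; the "|" separator and the names type, year, body, long and result are choices made for this example and do not come from the posts above.

```r
library(tidyr)

df_given <- data.frame(
  first_column  = c("NA", "NA", "NA", "Country1", "Country2", "Country3"),
  second_column = c("Consumption", "real", "2021", 10, 11, 23),
  third_column  = c("Consumption", "real", "2022", 20, 22, 12),
  fourth_column = c("Inflation", "expected", "2021", 1, 1.2, 2.5),
  fifth_column  = c("Inflation", "expected", "2022", 5, 3, 2)
)

# Rows 1-2 describe the series, row 3 holds the year (first column excluded).
type <- paste(unlist(df_given[1, -1]), unlist(df_given[2, -1]))
year <- unlist(df_given[3, -1])

# Keep only the data rows and encode "type|year" in the column names.
body <- df_given[-(1:3), ]
names(body) <- c("country", paste(type, year, sep = "|"))
body[-1] <- lapply(body[-1], as.numeric)

# Split the header into two columns while pivoting, then spread the years.
long <- pivot_longer(body, -country,
                     names_to = c("type", "year"), names_sep = "\\|",
                     values_to = "obs")
result <- pivot_wider(long, names_from = year, values_from = obs)
result
```

Splitting the header inside pivot_longer() via names_sep avoids the separate string-splitting step, at the cost of requiring a separator that never occurs in the descriptions.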

Related

How to identify the name of the column containing the maximum of the dataset in R? [duplicate]

I can only find information for finding the max value for each row.
But I need the max value among multiple rows and columns and to find the column name corresponding to it.
e.g if my dataset looks like:
data <- data.frame(Year = c(2001, 2002, 2003),
X = c(3, 2, 45),
Y = c(6, 20, 23),
Z = c(10, 4, 4))
I want my code to return "X" because 45 is the maximum.
I suppose one way to approach this is to turn your wide dataset into a long (tidy) table and then filter for the max value and extract that value name.
library(tidyverse)
df <- read.table(text = "Year X Y Z
2001 3 6 10
2002 2 20 4
2003 45 23 4", header = TRUE)
df %>%
pivot_longer(cols = c("X", "Y", "Z"), names_to = "column") %>%
filter(max(value) == value) %>%
pull(column)
# [1] "X"
If you have a large number of columns, you can pivot from wide to long without listing every column name (as I do in the pivot_longer(...) call above) by running this instead:
df %>%
pivot_longer(cols = setdiff(names(.), "Year"), names_to = "column") %>%
filter(max(value) == value) %>%
pull(column)
A base R solution:
Assuming that you want to exclude the Year variable from this analysis:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 4, 5))
dat_ex_year <- dat[, !names(dat) %in% c("Year")]
names(dat_ex_year)[which(dat_ex_year == max(dat_ex_year), arr.ind = TRUE)[,2]]
which gives:
[1] "X"
EDIT: I slightly adjusted the code so that it returns all column names when the maximum value is found in several columns, e.g. with:
dat <- data.frame(Year = c(2000, 2001, 2002),
X = c(1, 2, 45),
Y = c(3, 45, 5))
the code gives:
[1] "X" "Y"
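To make the arr.ind lookup easier to follow, here is a step-by-step sketch on the data from the question. Converting to a matrix first is a choice made for this illustration, and the names m, pos and max_cols are hypothetical.

```r
dat <- data.frame(Year = c(2001, 2002, 2003),
                  X = c(3, 2, 45),
                  Y = c(6, 20, 23),
                  Z = c(10, 4, 4))

# Drop Year and work on a plain numeric matrix.
m <- as.matrix(dat[setdiff(names(dat), "Year")])

# One row per matching cell, with columns "row" and "col".
pos <- which(m == max(m), arr.ind = TRUE)

# Translate column positions into names; unique() guards against
# the maximum occurring twice in the same column.
max_cols <- unique(colnames(m)[pos[, "col"]])
max_cols
```

The pos matrix also carries the row index, so dat$Year[pos[, "row"]] would give the year(s) in which the maximum occurs.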

Variable offset and matching in a dataframe

Good day, I would like to achieve something with a data frame, and I think it’s a combination of doing a variable offset and matching, but I am not quite sure how to do it in R.
Sample data to replicate the original and desired output:
original = data.frame(
ID = c(1, 2, 3, 2, 2, 2),
Type = c("Live", "Live", "Live", "Live", "Live", "Dead"),
Number = c(100, 20, 30, 40, 50, NA))
desired = data.frame(
ID = c(1, 2, 3, 2, 2, 2),
Type = c("Live", "Live", "Live", "Live", "Live", "Dead"),
Number = c(100, 20, 30, 40, 50, NA),
Number2 = c(NA, NA, NA, NA, NA, 50))
Essentially what I would like to achieve is that when Type = “Dead”, then I want to get the last Number in the series when that ID was “Live.” It’s possible that the same ID can be live across a number of rows (e.g. ID = 2), but when an ID has Type = “Dead”, then I want to extract the last number at which it was live. The challenge is that it’s not the case that the preceding row always has the same ID so there needs to be some sort of search that I would like to generalise for all IDs.
Thanks!
Here's one way
library(dplyr)
original %>%
group_by(ID) %>%
mutate(
Number2 = if_else(Type=="Dead", last(Number[Type=="Live"]), NA_real_))
Here we group_by the ID; then, for each "Dead" row, we take the last value of Number within that ID where Type is "Live", and return NA for every row that is not "Dead".
Here is a base R option
(u <- Reduce(
rbind,
lapply(
split(original, original$ID),
function(v) {
within(v, Number2 <- ifelse(Type == "Dead",
tail(Number[Type == "Live"], 1), NA
))
}
)
))[order(as.numeric(row.names(u))), ]
which gives
ID Type Number Number2
1 1 Live 100 NA
2 2 Live 20 NA
3 3 Live 30 NA
4 2 Live 40 NA
5 2 Live 50 NA
6 2 Dead NA 50
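Both answers take the last "Live" value anywhere in the ID's group. If "last" should instead mean the most recent "Live" row above the "Dead" row, a minimal base R sketch could look like the following. That reading of the question is an assumption, and the names dead_rows and earlier are hypothetical.

```r
original <- data.frame(
  ID = c(1, 2, 3, 2, 2, 2),
  Type = c("Live", "Live", "Live", "Live", "Live", "Dead"),
  Number = c(100, 20, 30, 40, 50, NA))

original$Number2 <- NA_real_
dead_rows <- which(original$Type == "Dead")

for (i in dead_rows) {
  # Rows above row i with the same ID that are still "Live".
  earlier <- which(original$ID == original$ID[i] &
                   original$Type == "Live" &
                   seq_len(nrow(original)) < i)
  # Leave NA when the ID was never "Live" before this row.
  if (length(earlier) > 0) {
    original$Number2[i] <- original$Number[max(earlier)]
  }
}
original
```

On the sample data this gives the same result as the grouped versions; the two approaches only differ when an ID becomes "Live" again after a "Dead" row.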

How do I cast a data frame with more than 3 columns in R?

Importing from an Access database, I have data that look similar to this:
p <- data.frame(SurvDate = as.Date(c('2018-11-1','2018-11-1','2018-11-1',
'2018-11-3', '2018-11-3')),
Area = c('AF','BB','CT', 'DF', 'BB'),
pCount = c(6, 3, 0, 12, 32),
ObsTime = c('8:51','8:59','9:13', '9:24', '9:30'),
stringsAsFactors = FALSE)
I want to cast my data so that each row is a SurvDate and each Area becomes a column holding pCount, with an ObsTime column next to each Area column.
Example:
n <- data.frame(SurvDate = as.Date(c('2018-11-1','2018-11-3')),
AF = c(6, NA),
TimeAF = c('8:51', NA),
BB = c(3, 32),
TimeBB = c('8:59', '9:30'),
CT = c(0, NA),
TimeCT = c(NA, '9:13'),
DF = c(NA,12),
TimeDF = c(NA, '9:24'))
I've tried variations on this theme, but can't get time to work.
library(reshape2)
dcast(p, SurvDate+ObsTime ~ Area)
Here is one way using tidyverse tools. Note that the output is not the same as your expected output, because it seems you didn't put the values for CT in the right place (they are spread across two dates). The approach is to unite the values so that there is a single key-value pair to spread, and then to separate the columns again with mutate_at. We could also have used separate several times, though this would become unwieldy with many Areas.
SurvDate <- as.Date(c('2018-11-1','2018-11-1','2018-11-1', '2018-11-3', '2018-11-3'))
Area <- c('AF','BB','CT', 'DF', 'BB')
People <- c(6, 3, 0, 12, 32)
ObsTime <- (c('8:51','8:59','9:13', '9:24', '9:30'))
p <- data.frame(SurvDate, Area, People, ObsTime, stringsAsFactors = FALSE)
library(tidyverse)
p %>%
unite(vals, People, ObsTime) %>%
spread(Area, vals) %>%
mutate_at(
.vars = vars(-SurvDate),
.funs = list(
Time = ~ str_extract(., "(?<=_).*$"),
Area = ~ str_extract(., "^.*(?=_)")
)
) %>%
filter(!is.na(SurvDate)) %>%
select(SurvDate, matches("_")) %>%
select(SurvDate, order(colnames(.)))
#> SurvDate AF_Area AF_Time BB_Area BB_Time CT_Area CT_Time DF_Area
#> 1 2018-11-01 6 8:51 3 8:59 0 9:13 <NA>
#> 2 2018-11-03 <NA> <NA> 32 9:30 <NA> <NA> 12
#> DF_Time
#> 1 <NA>
#> 2 9:24
Created on 2018-04-30 by the reprex package (v0.2.0).
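With more recent versions of tidyr, the unite/spread/extract round trip may not be needed, because pivot_wider() accepts several value columns at once. The following is a sketch under that assumption (tidyr 1.0.0 or later, for values_from with multiple columns and names_glue); the resulting column names such as AF_pCount are a choice made for this example.

```r
library(tidyr)

p <- data.frame(SurvDate = as.Date(c('2018-11-1', '2018-11-1', '2018-11-1',
                                     '2018-11-3', '2018-11-3')),
                Area = c('AF', 'BB', 'CT', 'DF', 'BB'),
                pCount = c(6, 3, 0, 12, 32),
                ObsTime = c('8:51', '8:59', '9:13', '9:24', '9:30'),
                stringsAsFactors = FALSE)

# One row per date; each Area contributes a count column and a time column.
wide <- pivot_wider(p,
                    id_cols = SurvDate,
                    names_from = Area,
                    values_from = c(pCount, ObsTime),
                    names_glue = "{Area}_{.value}")

# Put the two columns of each Area next to each other.
wide <- wide[c("SurvDate", sort(setdiff(names(wide), "SurvDate")))]
wide
```

Because the two value columns are never pasted together, pCount stays numeric and ObsTime stays character.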

Calculation between groups in one column in tidy data

I have data like this:
df <- (
tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
)
I want to calculate the ratio between "Height" and "Waist" and between "Waist" and "Hip".
I have the following solution. But my solution requires using spread() and delivers only the calculation for "Waist-to-hip".
df <- rbind(df,
spread(df, Parameter, Value)
%>% transmute(ID = ID,
Group = Group,
Parameter = "Ratio.Height-to-Hip",
Value = Height / Hip,
Parameter = "Ratio.Waist-to-Hip",
Value = Waist / Hip))
Is it possible to stay in the tidy (long) format and avoid switching to the wide format? And why is the calculation for "Height-to-Hip" missing?
Here is one possible solution:
# Calculate ratios "Height" vs "Waist" and "Waist" vs "Hip"
# 1. Load packages
library(tidyr)
library(dplyr)
# 2. Data set
df <- tibble(
id = rep(1:2, 4),
group = c("A", "B", "A", "B","A", "B", "A", "B"),
parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# 3. Filter and transform data set
df <- df %>%
filter(parameter %in% c("Height", "Waist", "Hip")) %>%
spread(parameter, value)
# 4. Convert column names to lower case
colnames(df) <- tolower(colnames(df))
# 5. Calculate ratios
df <- df %>%
mutate(
ratio_height_vs_waist = round(height / waist, 2),
ratio_waist_vs_hip = round(waist / hip, 2))
The main problem is that the data are not in a tidy format.
Two key features of the tidy format are (Wickham, 2013):
Each variable forms a column;
Each observation forms a row.
In its original format, your data violates these two rules. For example, the Parameter column contains four variables (Blood, Height, Waist, and Hip). The knock-on effect of grouping several variables within Parameter is that each observation has to be repeated across several rows. In general, repeated rows of an identifier (ID in this case) in the absence of repeated measures is a sign that two or more variables have been grouped under a single column.
Anyway, here's my attempt to clean the data (I have used mutate and not transmute for illustrative purposes).
# Load packages
library(dplyr)
library(tidyr)
library(magrittr) # For the %<>% function, which I love
# Make data frame, df
df <- tibble(
ID = rep(1:2, 4),
Group = c("A", "B", "A", "B","A", "B", "A", "B"),
Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))
# Wrangle df
df %<>%
# ID and Group appear to be repeated, so use them to group_by
group_by(ID, Group) %>%
# Spread the Value column by the Parameter column
spread(key = Parameter,
value = Value) %>%
# Ungroup, just because it's a good habit
ungroup() %>%
# Generate new columns.
mutate(Ratio_height_to_hip = Height / Hip,
Ratio_waist_to_hip = Waist / Hip)
# Print df
df
#> # A tibble: 2 x 8
#> ID Group Blood Height Hip Waist Ratio_height_to_hip
#> <int> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 A 6.3 180 60 90 3.000000
#> 2 2 B 6.0 170 65 102 2.615385
#> # ... with 1 more variables: Ratio_waist_to_hip <dbl>
df <- df %>%
spread(Parameter, Value) %>%
mutate("Ratio.Height-to-Hip" = Height / Hip) %>%
mutate("Ratio.Waist-to-Hip" = Waist / Hip) %>%
gather("Parameter", "Value", -c("ID", "Group"))
Your data is not in tidy format ;) If you want your data in tidy format remove the last step.
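On the original question of whether the ratios can be computed without spreading first: one possible sketch is to summarise within each ID/Group and append the results as new Parameter rows. This assumes dplyr 1.0.0 or later (for .groups) and tidyr 1.0.0 or later, and that each ID/Group has exactly one row per Parameter; the ratio names follow the pattern used in the question.

```r
library(dplyr)
library(tidyr)

df <- tibble(
  ID = rep(1:2, 4),
  Group = c("A", "B", "A", "B", "A", "B", "A", "B"),
  Parameter = c("Blood", "Blood", "Height", "Height", "Waist", "Waist", "Hip", "Hip"),
  Value = c(6.3, 6.0, 180, 170, 90, 102, 60, 65))

# Pick the needed values by Parameter inside each ID/Group.
ratios <- df %>%
  group_by(ID, Group) %>%
  summarise(
    `Ratio.Height-to-Waist` = Value[Parameter == "Height"] / Value[Parameter == "Waist"],
    `Ratio.Waist-to-Hip`    = Value[Parameter == "Waist"] / Value[Parameter == "Hip"],
    .groups = "drop") %>%
  # Bring the ratios back into the Parameter/Value layout.
  pivot_longer(-c(ID, Group), names_to = "Parameter", values_to = "Value")

out <- bind_rows(df, ratios)
out
```

Each ratio needs its own column name inside summarise(); reusing one name, as in the transmute() call in the question, silently overwrites the earlier definition, which is why only one ratio appeared there.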
