Converting wide format flat file to long in R

I have a household and member dataset in one flat file in wide format. The number of members is fixed, and each member has its own set of columns. For simplicity, assume 2 members per household and 2 questions asked of each member: age (Q1) and gender (Q2).
The file format looks as given below:
HHID MEM_ID_1 MEM_ID_2 AGE_1 AGE_2 GENDER_1 GENDER_2
1 1 2 50 45 M F
And I want to convert it to the following format:
HHID MEM_ID AGE GENDER
1 1 50 M
1 2 45 F

Let's say our data frame is test
dput(test)
structure(list(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L, AGE_1 = 50L,
AGE_2 = 45L, GENDER_1 = structure(1L, .Label = "Male", class = "factor"),
GENDER_2 = structure(1L, .Label = "Female", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))
You could try the reshape function on this data frame as below:
reshape(test, direction = "long",
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2")),
v.names = c("MEM_ID","AGE","GENDER"),
idvar = 'HHID')
The reshape() function comes from base R. Broadly speaking, it can melt several sets of variables at once when you pass them through the varying parameter and set direction to "long".
In your case, for example, we pass a list of three vectors of variable names to the varying argument:
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2"))
The output is below:
HHID time MEM_ID AGE GENDER
1.1 1 1 1 50 Male
1.2 1 2 2 45 Female

You can use tidyr::gather(), tidyr::separate(), and tidyr::spread() in order. Here household is the name of your data frame.
library(tidyverse)
1. gather
First, apply tidyr::gather() to stack every column except HHID into key/value pairs. That gives the result below.
household %>%
gather(-HHID, key = domestic, value = value)
#> HHID domestic value
#> 1 1 MEM_ID_1 1
#> 2 1 MEM_ID_2 2
#> 3 1 AGE_1 50
#> 4 1 AGE_2 45
#> 5 1 GENDER_1 M
#> 6 1 GENDER_2 F
Now all you have to do is:
separate the domestic column at the underscore that precedes the trailing digit; as a regular expression, _(?=[0-9])
spread the result back into wide format, which gives the output you want.
2. Conclusion: entire code
household %>%
gather(-HHID, key = domestic, value = value) %>% # long data
separate(domestic, into = c("domestic", "vals"), sep = "_(?=[0-9])") %>% # separate the digit
spread(domestic, value) %>% # wide format
select(HHID, MEM_ID, AGE, GENDER, -vals) # just arranging columns, and excluding needless column
#> HHID MEM_ID AGE GENDER
#> 1 1 1 50 M
#> 2 1 2 45 F
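gather() and spread() are superseded in current tidyr, and gathering mixed columns turns every value into a character. As a hedged alternative, here is a minimal sketch with tidyr::pivot_longer(), assuming tidyr 1.0.0 or later. The household data frame below is rebuilt by hand from the question, and the column name member is my own choice, not something from the original post.

```r
library(tidyr)

# Rebuilt by hand from the question's example row
household <- data.frame(
  HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L,
  AGE_1 = 50L, AGE_2 = 45L,
  GENDER_1 = "M", GENDER_2 = "F",
  stringsAsFactors = FALSE
)

# ".value" keeps the part before the last underscore as the output column name;
# the trailing digits go into a helper column (called "member" here)
long <- pivot_longer(
  household,
  cols = -HHID,
  names_to = c(".value", "member"),
  names_pattern = "(.*)_(\\d+)$"
)

long$member <- NULL  # drop the helper column once it is no longer needed
long
```

Because each set of columns is stacked separately, AGE stays an integer here instead of being converted to character.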

Related

Customizing a pivot_wider and collapsing rows

I have a data frame that looks like this:
ID  Feature  Quality  Quantity  Condition
21  Shed     A        1         AV
72  Masonry           1
72  Shed     D        1         AV
Currently the data frame has the unit of observation as the feature, not the ID number. I would like to pivot this to a data frame that looks like this :
ID  ShedQuant  ShedQual  ShedCond  MasonryQuant  MasonryQual  MasonryCond
21  1          A         AV
72  1          D         AV        1
In the new data frame, the unit of observation should be the ID number (that is, each ID number is one row that lists all features associated with that ID, along with their quantities, qualities, and conditions).
I tried to combine several pivot_widers but it did not give me the intended result. Any help is appreciated!
Note: If the quantity of a certain feature is more than 1 for a certain ID, I want a sum for the quantity column and blanks for quality and condition.
library(tidyr)
data.frame(
stringsAsFactors = FALSE,
ID = c(21L, 72L, 72L),
Feature = c("Shed", "Masonry", "Shed"),
Quality = c("A", NA, "D"),
Quantity = c(1L, 1L, 1L),
Condition = c("AV", NA, "AV")
) %>%
pivot_wider(id_cols = ID, names_from = Feature, names_glue = "{Feature}_{.value}",
values_from = Quality:Condition, names_vary = "slowest")
Result
# A tibble: 2 × 7
ID Shed_Quality Shed_Quantity Shed_Condition Masonry_Quality Masonry_Quantity Masonry_Condition
<int> <chr> <int> <chr> <chr> <int> <chr>
1 21 A 1 AV NA NA NA
2 72 D 1 AV NA 1 NA
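The note in the question (sum the quantity and blank out quality and condition when a feature occurs more than once for an ID) could be handled by collapsing the rows before pivoting. This is only a sketch under my own reading of the note: "more than 1" is taken to mean more than one row per ID and Feature, and the fourth row of the example data is invented purely to trigger that case.

```r
library(dplyr)
library(tidyr)

feat <- data.frame(
  ID = c(21L, 72L, 72L, 72L),
  Feature = c("Shed", "Masonry", "Shed", "Shed"),
  Quality = c("A", NA, "D", "B"),        # last row is hypothetical
  Quantity = c(1L, 1L, 1L, 2L),
  Condition = c("AV", NA, "AV", "GD"),
  stringsAsFactors = FALSE
)

# One row per ID/Feature: keep Quality and Condition only when the
# combination is unique, otherwise leave them as NA; always sum Quantity
collapsed <- feat %>%
  group_by(ID, Feature) %>%
  summarise(
    Quality   = if (n() == 1) Quality else NA_character_,
    Condition = if (n() == 1) Condition else NA_character_,
    Quantity  = sum(Quantity),
    .groups = "drop"
  )

wide <- collapsed %>%
  pivot_wider(id_cols = ID, names_from = Feature,
              names_glue = "{Feature}_{.value}",
              values_from = c(Quantity, Quality, Condition),
              names_vary = "slowest")
wide
```

With this input, ID 72 ends up with Shed_Quantity equal to 3 and missing values for Shed_Quality and Shed_Condition.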

Reshape dataframe that has years in column names

I am trying to reshape a wide data frame in R into a long one. From what I have read, the functions in reshape2 and tidyr seem to handle only the case where a single variable is split, whereas I have about 10. Each column name combines a variable name and a year; I would like to split them so that the year becomes a column (a factor) in each row, leaving significantly fewer columns and a data set that is easier to work with.
Currently the table looks something like this.
State Rank Name V1_2016 V1_2017 V1_2018 V2_2016 V2_2017 V2_2018
TX 1 Company 1 2 3 4 5 6
I have tried to melt the data with reshape2 but it came out looking like garbage and being 127k rows when it should only be about 10k.
I am trying to get the data to look something like this.
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6
An option is melt from data.table, which can take multiple measure columns based on patterns in the column names:
library(data.table)
nm1 <- unique(sub(".*_", "", names(df)[-(1:3)]))
melt(setDT(df), measure = patterns("V1", "V2"),
value.name = c("V1", "V2"), variable.name = "Year")[,
Year := nm1[Year]][]
# State Rank Name Year V1 V2
#1: TX 1 Company 2016 1 4
#2: TX 1 Company 2017 2 5
#3: TX 1 Company 2018 3 6
data
df <- structure(list(State = "TX", Rank = 1L, Name = "Company", V1_2016 = 1L,
V1_2017 = 2L, V1_2018 = 3L, V2_2016 = 4L, V2_2017 = 5L, V2_2018 = 6L),
class = "data.frame", row.names = c(NA,
-1L))
One dplyr and tidyr possibility could be:
library(dplyr)
library(tidyr)
df %>%
gather(var, val, -c(1:3)) %>%
separate(var, c("var", "Year")) %>%
spread(var, val)
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6
First, it transforms the data from wide to long format, excluding the first three columns. Second, it separates the original variable names into two new variables: one containing the variable prefix, the other containing the year. Finally, it spreads the data back out so that each prefix gets its own column.
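The same gather/separate/spread sequence can be written as a single step with tidyr::pivot_longer(), assuming tidyr 1.0.0 or later. This is a sketch rather than part of the original answers; the name wide_df is mine, and the data is the one-row example from the question.

```r
library(tidyr)

wide_df <- data.frame(
  State = "TX", Rank = 1L, Name = "Company",
  V1_2016 = 1L, V1_2017 = 2L, V1_2018 = 3L,
  V2_2016 = 4L, V2_2017 = 5L, V2_2018 = 6L,
  stringsAsFactors = FALSE
)

# Split each column name at "_": the first piece (".value") names the
# output column, the second piece becomes the Year column
long_df <- pivot_longer(
  wide_df,
  cols = -c(State, Rank, Name),
  names_to = c(".value", "Year"),
  names_sep = "_"
)
long_df
```

Year comes back as a character column; wrap it in as.integer() or factor() if you need another type.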

Start from a specific column and traverse upto a specific number of columns based on conditions in R

I have the below data in wide format where each row represents a showroom, Quarter is from which quarter the showroom started selling and Starting Year is the Financial Year of Start.
Code Quarter StartingYear Quarter1_Num.FY16-17 Quarter2_Num.FY16-17 Quarter3_Num.FY16-17 Quarter4_Num.FY16-17 Quarter1_Num.FY17-18 Quarter2_Num.FY17-18 Quarter3_Num.FY17-18 Quarter4_Num.FY17-18
S2249 2 FY16-17 0 23 0 0 2 0 6 0
S463 3 FY17-18 0 0 4 0 0 4 90 8
For each showroom, I have to start from the column given by Quarter and StartingYear (Quarter2_Num.FY16-17 for row 1) and cover a period of a year, which in this case means going up to Quarter2_Num.FY17-18.
As can be seen, the column names are based on the quarter and the financial year.
Output I am trying to get:
Code Quarter1_Starting_Num Quarter2_Starting_Num Quarter3_Starting_Num Quarter4_Starting_Num Quarter5_Starting_Num
S2249 23 0 0 2 0
S463 4 0 0 4 90
The columns capture data for a year across the quarters after the showroom started.
I know that using gsub I can get the columns containing FY16-17 or FY17-18.
But I am not sure how to specify the starting column for each row and then traverse the next N columns.
Can anyone please help me with this?
First, we transform the data set from wide to long, then do our calculations and filters, and finally transform it back to wide format.
library(dplyr)
library(tidyr)
gather(df, k, val, -c(Code, Quarter, StartingYear)) %>%
  mutate(Quar = gsub('Quarter(\\d)_.*', '\\1', k),
         year = gsub('Quarter\\d_Num\\.(.*)\\.(.*)', '\\1-\\2', k)) %>%
  arrange(Code) %>%
  group_by(Code) %>%
  mutate(flag = cumsum(cumsum(Quarter == Quar & StartingYear == year)),
         Quarter1 = paste0('Quarter', flag, '_Starting_Num')) %>%
  filter(between(flag, 1, 5)) %>%
  select(Code, Quarter1, val) %>%
  spread(Quarter1, val)
# A tibble: 2 x 6
# Groups: Code [2]
Code Quarter1_Starting_Num Quarter2_Starting_Num Quarter3_Starting_Num Quarter4_Starting_Num Quarter5_Starting_Num
<fct> <int> <int> <int> <int> <int>
1 S2249 23 0 0 2 0
2 S463 4 0 0 4 90
Data
df <- structure(list(Code = structure(1:2, .Label = c("S2249", "S463"
), class = "factor"), Quarter = 2:3, StartingYear = structure(c(1L,
1L), .Label = "FY16-17", class = "factor"), Quarter1_Num.FY16.17 = c(0L,
0L), Quarter2_Num.FY16.17 = c(23L, 0L), Quarter3_Num.FY16.17 = c(0L,
4L), Quarter4_Num.FY16.17 = c(0L, 0L), Quarter1_Num.FY17.18 = c(2L,
0L), Quarter2_Num.FY17.18 = c(0L, 4L), Quarter3_Num.FY17.18 = c(6L,
90L), Quarter4_Num.FY17.18 = c(0L, 8L)), class = "data.frame", row.names = c(NA,
-2L))
PS: I changed S463 3 FY17-18 to S463 3 FY16-17 to match the expected output. You can keep S463 3 FY17-18, but then you will get NAs for Q3 to Q5.
gsub('Quarter(\\d)_.*','\\1',c('Quarter1_Num.FY16.17','Quarter4_Num.FY17.18'))
[1] "1" "4"
'Quarter(\\d)_.*' captures the single digit (0-9) between Quarter and _ and returns that group using \\1
gsub('Quarter\\d_Num\\.(.*)\\.(.*)','\\1-\\2',c('Quarter1_Num.FY16.17','Quarter4_Num.FY17.18'))
[1] "FY16-17" "FY17-18"
\\. matches the literal dot that follows Quarter, a digit, and _Num. In a regular expression, special characters such as . are escaped with \\
(.*) captures everything after that dot and before the next dot, i.e. FY16 and FY17; gsub treats this as group 1
\\. matches a literal dot
(.*) captures everything after that dot, i.e. 17 and 18; gsub treats this as group 2
\\1-\\2 returns group 1 and group 2 separated by -, i.e. FY16-17
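The cumsum(cumsum(...)) step in the pipeline is what numbers the quarters from the starting column onward. Here is a small standalone illustration on a made-up logical vector (the names hit, once, and flag are only for this sketch):

```r
# TRUE marks the single row where Quarter and StartingYear match the column
hit <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)

once <- cumsum(hit)   # 0 before the match, 1 from the match onward
flag <- cumsum(once)  # 0 before the match, then 1, 2, 3, ... counting from it

once
flag
```

Rows before the starting quarter keep flag 0 and are dropped by filter(between(flag, 1, 5)); the starting quarter gets 1, the next one 2, and so on.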

Transform binary columns to 1 row in R

I am quite new to R and have only basic skills so far. I have looked at functions like melt() and gather(), but I could not get them to work for me.
What I want to do is transform data like the following. All of the housing options (Has Own House, Renting, Homeless) are only 1 or 0, and at most one of them can be 1 (you cannot be Renting and Homeless at the same time).
E.g.
Passenger ID /// Has Own House /// Renting /// Homeless /// Age /// Gender
1 1 0 0 21 Male
2 0 1 0 24 Female
I want this data to look like this:
Passenger ID /// Housing /// Age /// Gender
1 Has own house 21 Male
2 Renting 24 Female
And when it comes to forecasting, can you please advise whether the layout above (with the binary columns) will work better in terms of speed, or whether having everything in one column is the better solution?
Try this:
library(tidyverse)
# importing your data
df <- read_table("Passenger_ID Has_Own_House Renting Homeless Age Gender
1 1 0 0 21 Male
2 0 1 0 24 Female")
Then run:
df %>%
gather(Housing, value, -Passenger_ID, -Age, -Gender) %>%
filter(value==1) %>%
select(-value)
The output is:
# A tibble: 2 x 4
# Passenger_ID Age Gender Housing
# <int> <int> <chr> <chr>
# 1 1 21 Male Has_Own_House
# 2 2 24 Female Renting
In base R with ifelse:
# Load Data
dat <- structure(list(Passenger_ID = 1:2, Has_Own_House = c(1L, 0L),
Renting = 0:1, Homeless = c(0L, 0L), Age = c(21L, 24L), Gender = structure(c(2L,
1L), .Label = c("Female", "Male"), class = "factor")), .Names = c("Passenger_ID",
"Has_Own_House", "Renting", "Homeless", "Age", "Gender"), class = "data.frame", row.names = c(NA,
-2L))
# Assign new column "Housing" based on testing nested ifelse statements:
dat2 <- within(dat, Housing <- ifelse(Has_Own_House==1, "Has_Own_House",
ifelse(Renting==1, "Renting",
ifelse(Homeless==1, "Homeless", NA))))
# Remove extra columns
dat2$Has_Own_House <- NULL
dat2$Renting <- NULL
dat2$Homeless <- NULL
Yielding
>dat2
Passenger_ID Age Gender Housing
1 21 Male Has_Own_House
2 24 Female Renting
In base R, you can assign the new column in one line: apply a function to every row of the data frame (MARGIN = 1) that returns the name of the column whose value is 1 (found with which):
df = data.frame('Passenger ID' = 1:5,
'Has Own House' = c(1,0,0,1,0),
'Renting' = c(0,1,0,0,0),
'Homeless' = c(0,0,1,0,1),
'Age'=21:25,
'Gender' = c('Male', 'Female', 'Male', 'Female', 'Male'))
df$HOUSING = apply(df[, 2:4], 1, function(x) names(df)[2:4][which(x==1)])
df
# Passenger.ID Has.Own.House Renting Homeless Age Gender HOUSING
# 1 1 1 0 0 21 Male Has.Own.House
# 2 2 0 1 0 22 Female Renting
# 3 3 0 0 1 23 Male Homeless
# 4 4 1 0 0 24 Female Has.Own.House
# 5 5 0 0 1 25 Male Homeless
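A vectorised base R variant of the same idea uses max.col(), which returns the position of the largest value in each row. This is a sketch, and it assumes that every row has exactly one 1 among the housing columns; a row of all zeros would silently be labelled with the first column. The flags vector is a helper name of my own.

```r
df <- data.frame(
  Passenger.ID  = 1:5,
  Has.Own.House = c(1, 0, 0, 1, 0),
  Renting       = c(0, 1, 0, 0, 0),
  Homeless      = c(0, 0, 1, 0, 1)
)

flags <- c("Has.Own.House", "Renting", "Homeless")

# max.col() gives, per row, the index of the column holding the maximum (the 1)
df$HOUSING <- flags[max.col(df[, flags], ties.method = "first")]
df
```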

calculate timeline for different subjects in dataframe

I have data like
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/1/01 3
I want to get R to work out the number of days since the first sample for each subject. eg:
Subject days
1 0
1 2
1 9
2 0
2 3
How can I do this? I have converted the dates using lubridate.
Something like:
for(i in 1:nrow(data)){
if(data$date[i] != data$date[i -1]) {
data$timeline <- data$date[i] - data$date[i-1]
}
}
I get the error:
argument is of length 0 - I think the problem is the first line where there is no preceding row..?
I would use dplyr to do some grouping and data manipulation. Note that we first have to convert your date into something R will recognize as a date.
library(dplyr)
dat$Date <- as.Date(dat$date, '%d/%m/%y')
dat %>%
group_by(subject) %>%
mutate(days = Date - min(Date))
# subject date number Date days
# <int> <chr> <int> <date> <time>
# 1 1 1/2/01 4 2001-02-01 0
# 2 1 3/2/01 6 2001-02-03 2
# 3 1 10/2/01 7 2001-02-10 9
# 4 2 1/1/01 2 2001-01-01 0
# 5 2 4/3/01 3 2001-03-04 62
Here's the data:
dat <- structure(list(subject = c(1L, 1L, 1L, 2L, 2L), date = c("1/2/01",
"3/2/01", "10/2/01", "1/1/01", "4/3/01"), number = c(4L, 6L,
7L, 2L, 3L), Date = structure(c(11354, 11356, 11363, 11323, 11385
), class = "Date")), .Names = c("subject", "date", "number",
"Date"), row.names = c(NA, -5L), class = "data.frame")
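In the output above, days is a difftime column. If you would rather have a plain number, one option (a sketch on the same five rows, with the data frame rebuilt by hand) is to convert the difference with as.numeric() and an explicit unit:

```r
library(dplyr)

dat <- data.frame(
  subject = c(1L, 1L, 1L, 2L, 2L),
  date = c("1/2/01", "3/2/01", "10/2/01", "1/1/01", "4/3/01"),
  stringsAsFactors = FALSE
)

res <- dat %>%
  mutate(Date = as.Date(date, "%d/%m/%y")) %>%   # dates assumed to be dd/mm/yy
  group_by(subject) %>%
  mutate(days = as.numeric(Date - min(Date), units = "days")) %>%
  ungroup()

res$days
```

Stating units = "days" avoids surprises if the same code is later used on date-times, where the difference could otherwise be reported in hours or seconds.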
Using the input shown in the Note, convert the date column to Date class (assuming that it is in the form dd/mm/yy) and then use ave to subtract the earliest date from all the dates for each subject. If the input is sorted as in the question, we could optionally use x[1] instead of min(x). No packages are used.
data$date <- as.Date(data$date, "%d/%m/%y")
diff1 <- function(x) x - min(x)
with(data, data.frame(subject, days = ave(as.numeric(date), subject, FUN = diff1)))
giving:
subject days
1 1 0
2 1 2
3 1 9
4 2 0
5 2 62
Note
The input used, in reproducible form, is:
Lines <- "
subject date number
1 1/2/01 4
1 3/2/01 6
1 10/2/01 7
2 1/1/01 2
2 4/3/01 3"
data <- read.table(text = Lines, header = TRUE)
