Reshape dataframe that has years in column names - r

I am trying to reshape a wide dataframe in R into a long dataframe. The functions I have read about in reshape2 and tidyr all seem to handle only a single variable being split, whereas I have about 10. Each column name contains both the variable name and the year. I would like to split them so that the year becomes a column of its own, leaving far fewer columns and a data set that is easier to work with.
Currently the table looks something like this.
State Rank Name V1_2016 V1_2017 V1_2018 V2_2016 V2_2017 V2_2018
TX 1 Company 1 2 3 4 5 6
I have tried to melt the data with reshape2 but it came out looking like garbage and being 127k rows when it should only be about 10k.
I am trying to get the data to look something like this.
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6

An option is melt from data.table, which can take multiple sets of measure columns based on patterns in the column names:
library(data.table)
nm1 <- unique(sub(".*_", "", names(df)[-(1:3)]))
melt(setDT(df), measure = patterns("V1", "V2"),
value.name = c("V1", "V2"), variable.name = "Year")[,
Year := nm1[Year]][]
# State Rank Name Year V1 V2
#1: TX 1 Company 2016 1 4
#2: TX 1 Company 2017 2 5
#3: TX 1 Company 2018 3 6
data
df <- structure(list(State = "TX", Rank = 1L, Name = "Company", V1_2016 = 1L,
V1_2017 = 2L, V1_2018 = 3L, V2_2016 = 4L, V2_2017 = 5L, V2_2018 = 6L),
class = "data.frame", row.names = c(NA,
-1L))

One dplyr and tidyr possibility could be:
df %>%
gather(var, val, -c(1:3)) %>%
separate(var, c("var", "Year")) %>%
spread(var, val)
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6
First, it transforms the data from wide to long format, excluding the first three columns. Second, it separates the original variable names into two new variables: one containing the variable prefix, the other containing the year. Finally, it spreads the data back out so that each prefix gets its own column.
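The same three steps can also be written as a single call in newer tidyr. This is only a sketch, assuming tidyr 1.0.0 or later is installed; the special ".value" entry in names_to tells pivot_longer that the part of each name before the underscore should become a column name, and the part after it goes into Year.

```r
library(tidyr)

# Same one-row example as in the question
df <- data.frame(State = "TX", Rank = 1L, Name = "Company",
                 V1_2016 = 1L, V1_2017 = 2L, V1_2018 = 3L,
                 V2_2016 = 4L, V2_2017 = 5L, V2_2018 = 6L)

# Split each column name at "_": the prefix (V1, V2) names the output column,
# the suffix (2016, 2017, 2018) is stored in Year
long <- pivot_longer(df,
                     cols = -c(State, Rank, Name),
                     names_to = c(".value", "Year"),
                     names_sep = "_")
long
```

Year comes back as character here; wrap it in as.integer() if a numeric year is needed.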

Related

How do I merge multiple contingency tables into one using R?

I have multiple columns that I need to merge and return a contingency table counting each number.
Example of an ordinal data set:
df <- data.frame(ab = c(1,2,3,4,5),
ba = c(1,3,3,3,5))
>ab ba
1 1
2 3
3 3
4 3
5 5
I would like to be able to return a contingency table showing like this:
>1 2 3 4 5
2 1 4 1 2
I've attempted examples featured on here for similar issues, but I get the sums returned instead of a count:
library(plyr)
colSums(rbind.fill(data.frame(t(unclass(df$ab))), data.frame(t(unclass(df$ba)))),
na.rm = TRUE)
Any help is greatly appreciated
We unlist the data.frame into a vector and apply table in base R
table(unlist(df))
# 1 2 3 4 5
# 2 1 4 1 2
Or with the tidyverse, by reshaping the data into 'long' format with pivot_longer and then counting each value:
library(dplyr)
library(tidyr)
df %>%
pivot_longer(cols = everything()) %>%
count(value)
data
df <- structure(list(ab = 1:5, ba = c(1L, 3L, 3L, 3L, 5L)),
class = "data.frame", row.names = c(NA,
-5L))
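One thing to watch with table(unlist(df)) is that a value which never occurs is dropped from the result rather than reported as zero. A minimal sketch of one way around that, assuming the scale is known in advance (the 1 to 6 range below is made up for illustration):

```r
# Same example data as in the question
df <- data.frame(ab = c(1, 2, 3, 4, 5),
                 ba = c(1, 3, 3, 3, 5))

# Flatten all columns into one vector
values <- unlist(df, use.names = FALSE)

# Declare the full scale as factor levels so unused values are counted as 0
counts <- table(factor(values, levels = 1:6))
counts
```

Here the value 6 never occurs, so it appears in the table with a count of 0 instead of being left out.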

Converting long format flat file to wide in R

I have a household and member dataset in one wide flat format. There is a fixed number of members and each corresponds to a set of columns. For simplicity, assume 2 members per household and assume 2 questions are asked of the members: age (Q1) and gender (Q2).
The file format looks as given below:
HHID, MEM_ID_1, MEM_ID_2, AGE_1, AGE_2, GENDER_1, GENDER_2
1 1 2 50 45 M F
And I want to convert it to the following format:
HHID MEM_ID AGE GENDER
1 1 50 M
1 2 45 F
Let's say our data frame is test
dput(test)
structure(list(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L, AGE_1 = 50L,
AGE_2 = 45L, GENDER_1 = structure(1L, .Label = "Male", class = "factor"),
GENDER_2 = structure(1L, .Label = "Female", class = "factor")), class = "data.frame", row.names = c(NA,
-1L))
You could try the reshape function on this data frame as below:
reshape(test, direction = "long",
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2")),
v.names = c("MEM_ID","AGE","GENDER"),
idvar = 'HHID')
The reshape() function is part of base R (the stats package). Broadly speaking, it can melt several sets of variables at once when you set direction to "long" and describe the sets through the varying argument.
In your case we pass a list of three vectors of variable names to the varying argument:
varying = list(c("MEM_ID_1","MEM_ID_2"), c("AGE_1","AGE_2"), c( "GENDER_1","GENDER_2"))
The output is below:
HHID time MEM_ID AGE GENDER
1.1 1 1 1 50 Male
1.2 1 2 2 45 Female
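The time column and the 1.1, 1.2 row names are by-products of reshape(). A minimal sketch of tidying them away, assuming the question's M/F coding stored as plain character columns; the timevar name member is an arbitrary choice, not something from the original answer:

```r
# Question's example, with gender as character rather than factor
test <- data.frame(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L,
                   AGE_1 = 50L, AGE_2 = 45L,
                   GENDER_1 = "M", GENDER_2 = "F",
                   stringsAsFactors = FALSE)

long <- reshape(test, direction = "long",
                varying = list(c("MEM_ID_1", "MEM_ID_2"),
                               c("AGE_1", "AGE_2"),
                               c("GENDER_1", "GENDER_2")),
                v.names = c("MEM_ID", "AGE", "GENDER"),
                timevar = "member",   # name for the index column reshape() adds
                idvar = "HHID")

long$member <- NULL      # MEM_ID already identifies the member
rownames(long) <- NULL   # replace the "1.1", "1.2" row names with 1, 2
long
```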
You can use tidyr::gather(), tidyr::separate(), and tidyr::spread(), in that order. Here household is the name of your data frame.
library(tidyverse)
1. gather
First apply tidyr::gather(), which gives the result below.
household %>%
gather(-HHID, key = domestic, value = value)
#> HHID domestic value
#> 1 1 MEM_ID_1 1
#> 2 1 MEM_ID_2 2
#> 3 1 AGE_1 50
#> 4 1 AGE_2 45
#> 5 1 GENDER_1 M
#> 6 1 GENDER_2 F
Now all that is left is to:
separate the domestic column at the underscore that precedes a digit, which as a regular expression is _(?=[0-9]);
spread the result back into wide format, which gives the output you want.
2. Conclusion: entire code
household %>%
gather(-HHID, key = domestic, value = value) %>% # long data
separate(domestic, into = c("domestic", "vals"), sep = "_(?=[0-9])") %>% # separate the digit
spread(domestic, value) %>% # wide format
select(HHID, MEM_ID, AGE, GENDER, -vals) # just arranging columns, and excluding needless column
#> HHID MEM_ID AGE GENDER
#> 1 1 1 50 M
#> 2 1 2 45 F
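Because gather() stacks numbers and text into one value column, AGE and MEM_ID come out of the pipeline above as character. As a sketch of an alternative that keeps the column types, assuming tidyr 1.0.0 or later: pivot_longer with a ".value" entry does the split in one call. The name member is a made-up label, and names_pattern is used instead of names_sep because MEM_ID_1 contains two underscores.

```r
library(tidyr)

household <- data.frame(HHID = 1L, MEM_ID_1 = 1L, MEM_ID_2 = 2L,
                        AGE_1 = 50L, AGE_2 = 45L,
                        GENDER_1 = "M", GENDER_2 = "F",
                        stringsAsFactors = FALSE)

# First capture group becomes the output column name (MEM_ID, AGE, GENDER),
# second capture group (the trailing digits) goes into the member column
long <- pivot_longer(household,
                     cols = -HHID,
                     names_to = c(".value", "member"),
                     names_pattern = "(.*)_([0-9]+)$")
long
```

Drop the member column afterwards if MEM_ID is enough to identify each person.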

Add a column for rank, then ranking by group

I have managed to add a column for ranking for my data frame
lowest.mortality.upper<-nrow(lowest.mortality)
## Add a ranking column
lowest.mortality$ranking<-c(1:lowest.mortality.upper)
However, now I have to rank a bigger dataset within each value of another column, state, so it would read
AK 1
AK 2
TX 1
TX 2
TX 3
I could use a for loop but that's so 1980s. I'm sure that subset or lapply should work, but I can't figure out how.
Thanks
Seems like you want to add a sequence column by group. There are several options.
A base R solution using ave is
df1$indx <- with(df1, ave(seq_along(grp), grp, FUN=seq_along))
Or this can be done with getanID from splitstackshape
library(splitstackshape)
getanID(df1, 'grp')[]
# grp .id
#1: AK 1
#2: AK 2
#3: TX 1
#4: TX 2
#5: TX 3
Or
library(dplyr)
df1 %>%
group_by(grp) %>%
mutate(indx = row_number())
data
df1 <- structure(list(grp = c("AK", "AK", "TX", "TX", "TX")), .Names =
"grp", class = "data.frame", row.names = c(NA, -5L))
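The options above number the rows in the order they appear, which equals a rank only if the data are already sorted within each group. If they are not, a rank can be computed from the values directly. A minimal sketch, where the rate column and its numbers are invented for illustration:

```r
# Hypothetical data: rate is a made-up measure to rank by within each state
df1 <- data.frame(grp  = c("AK", "AK", "TX", "TX", "TX"),
                  rate = c(12.1, 9.8, 10.4, 8.7, 11.3))

# Rank rate within each group; lowest rate gets rank 1,
# ties are broken by order of appearance
df1$ranking <- ave(df1$rate, df1$grp,
                   FUN = function(x) rank(x, ties.method = "first"))
df1
```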

For i in loops in R

I have been really struggling to grasp a basic programming concept - the for loop. I typically deal with hierarchically structured data such that measurements repeat within levels of unique identifiers, like this:
ID Measure
1 2
1 3
1 3
2 4
2 1
...
Very often I need to create a new column that either aggregates within ID or produces a value for each row within each level of ID. For the former I use pretty basic functions from either base or dplyr, but for the latter case I'd like to get in the habit of creating for loops.
So for this example, I would like a column added to my hypothetical df such that the new column begins with one for each ID and adds 1 to each subsequent row, until a new ID occurs.
So, this:
ID Measure NewVal
1 2 1
1 3 2
1 3 3
2 4 1
2 1 2
...
I would love to learn how to do this with a for loop, but if there are other ways, I would like to hear those too.
One way is to use the splitstackshape package, which has a function called getanID. This is your friend here. If your df is called mydf, you would do the following. Please note that the result is a data.table; convert it to a data.frame if necessary.
library(splitstackshape)
getanID(mydf, "ID")
# ID Measure .id
#1: 1 2 1
#2: 1 3 2
#3: 1 3 3
#4: 2 4 1
#5: 2 1 2
DATA
mydf <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Measure = c(2L, 3L,
3L, 4L, 1L)), .Names = c("ID", "Measure"), class = "data.frame", row.names = c(NA,
-5L))
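seq_along gives an increasing sequence starting at 1, with the same length as its input. tapply applies a function to each group of values defined by a grouping vector. Here the values themselves don't matter, only how many there are in each group, so you can group the ID column by itself (the data frame is called d here). Note that unlist concatenates the groups in order of ID, so this assumes the rows are already sorted by ID: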
> d$NewVal <- unlist(tapply(d$ID, d$ID, FUN=seq_along))
> d
ID Measure NewVal
1 1 2 1
2 1 3 2
3 1 3 3
4 2 4 1
5 2 1 2
You could also use data.table to assign the sequence by reference.
# library(data.table)
setDT(mydf) ## convert to data table
mydf[,NewVal := seq(.N), by=ID] ## .N contains number of rows in each ID group
# ID Measure NewVal
# 1: 1 2 1
# 2: 1 3 2
# 3: 1 3 3
# 4: 2 4 1
# 5: 2 1 2
setDF(mydf) ## convert easily to data frame if you wish.
Or you could use ave. The advantage is that it will give the sequence in the same order as that in the original dataset, which may be beneficial in unordered datasets.
transform(df, NewVal=ave(ID, ID, FUN=seq_along))
# ID Measure NewVal
#1 1 2 1
#2 1 3 2
#3 1 3 3
#4 2 4 1
#5 2 1 2
For a more general case (for example, if the ID column is a factor):
transform(df, NewVal=ave(seq_along(ID), ID, FUN=seq_along))
Or, if the rows are already sorted by ID and the IDs are positive integers:
df$NewVal <- sequence(tabulate(df$ID))
Or using dplyr
library(dplyr)
df %>%
group_by(ID) %>%
mutate(NewVal=row_number())
data
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L), Measure = c(2L, 3L,
3L, 4L, 1L)), .Names = c("ID", "Measure"), class = "data.frame",
row.names = c(NA, -5L))
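The point about ave keeping the original row order is easiest to see on data that is not sorted by ID. A small sketch with a made-up, deliberately shuffled ID column, comparing ave with the unlist(tapply(...)) approach:

```r
# Hypothetical unsorted IDs
d <- data.frame(ID = c(2L, 1L, 2L, 1L, 1L))

# ave returns one value per row, aligned with the original rows
by_ave <- ave(seq_along(d$ID), d$ID, FUN = seq_along)

# tapply returns one sequence per ID level; unlist glues them together
# in level order (all of ID 1 first, then ID 2), not in row order
by_tapply <- unlist(tapply(d$ID, d$ID, FUN = seq_along), use.names = FALSE)

by_ave      # 1 1 2 2 3
by_tapply   # 1 2 3 1 2
```

Only by_ave can be assigned back as a column here; by_tapply would put the counters on the wrong rows.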
I'd recommend you don't use a for loop for this. It's not a good place for one. You can do this pretty easily in plyr (or dplyr, if you prefer):
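library(plyr)
# example data: two numeric columns plus a random ID from 1 to 10
x <- data.frame(cbind(rnorm(100), rnorm(100)))
x$ID <- sample(1:10, 100, replace = TRUE)
# number the rows of one ID group, after sorting the group by its first column
new_col <- function(x) {
  x <- x[order(x[, 1]), ]
  x$NewVal <- seq_len(nrow(x))
  return(x)
}
# apply new_col to each ID group and stack the results
x <- ddply(.data = x, .variables = "ID", .fun = new_col)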
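Since the question asks specifically about for loops, here is a sketch of what one could look like, mainly for comparison with the vectorised answers. It assumes that rows with the same ID sit next to each other, as in the example data; it is not meant as the recommended approach.

```r
mydf <- data.frame(ID = c(1L, 1L, 1L, 2L, 2L),
                   Measure = c(2L, 3L, 3L, 4L, 1L))

# Pre-allocate the new column, then fill it one row at a time
mydf$NewVal <- integer(nrow(mydf))
for (i in seq_len(nrow(mydf))) {
  if (i > 1 && mydf$ID[i] == mydf$ID[i - 1]) {
    # same ID as the previous row: continue counting
    mydf$NewVal[i] <- mydf$NewVal[i - 1] + 1L
  } else {
    # first row, or a new ID: restart at 1
    mydf$NewVal[i] <- 1L
  }
}
mydf
```

seq_len(nrow(mydf)) is used instead of 1:nrow(mydf) so the loop body is skipped cleanly when the data frame has zero rows.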

restructure data frame in R

I'm wondering if there is an easy way to restructure some data I have. I currently have a data frame that looks like this...
Year Cat Number
2001 A 15
2001 B 2
2002 A 4
2002 B 12
But what I ultimately want is to have it in this shape...
Year Cat Number Cat Number
2001 A 15 B 2
2002 A 4 B 12
Is there a simple way to do this?
Thanks in advance
:)
One way would be to use dcast/melt from reshape2. In the code below, I first create a sequence of numbers (the indx column) within each Year using transform and ave. Then I melt the transformed dataset, keeping Year and indx as the id variables. The long format dataset is then reshaped to wide format using dcast. If you don't need the numeric suffix (_1, _2) in the column names, you can use gsub to remove that part.
library(reshape2)
res <- dcast(melt(transform(df, indx=ave(seq_along(Year), Year, FUN=seq_along)),
id.var=c("Year", "indx")), Year~variable+indx, value.var="value")
colnames(res) <- gsub("\\_.*", "", colnames(res))
res
# Year Cat Cat Number Number
#1 2001 A B 15 2
#2 2002 A B 4 12
Or using dplyr/tidyr. Here, the idea is similar to the above. After grouping by the Year column, generate an indx column using mutate, reshape to long format with gather, unite the two columns into a single column VarIndx, and then reshape back to wide format with spread. In the last step, mutate_each converts the columns whose names start with Number to numeric. (mutate_each and funs are deprecated in later dplyr releases, where mutate(across(starts_with("Number"), as.numeric)) does the same job.)
library(dplyr)
library(tidyr)
res1 <- df %>%
group_by(Year) %>%
mutate(indx=row_number()) %>%
gather("Var", "Val", Cat:Number) %>%
unite(VarIndx, Var, indx) %>%
spread(VarIndx, Val) %>%
mutate_each(funs(as.numeric), starts_with("Number"))
res1
# Source: local data frame [2 x 5]
# Year Cat_1 Cat_2 Number_1 Number_2
#1 2001 A B 15 2
#2 2002 A B 4 12
Or you can create an indx variable .id using getanID from splitstackshape (from comments made by @Ananda Mahto, the author of splitstackshape) and use reshape from base R:
library(splitstackshape)
reshape(getanID(df, "Year"), direction="wide", idvar="Year", timevar=".id")
# Year Cat.1 Number.1 Cat.2 Number.2
#1: 2001 A 15 B 2
#2: 2002 A 4 B 12
data
df <- structure(list(Year = c(2001L, 2001L, 2002L, 2002L), Cat = c("A",
"B", "A", "B"), Number = c(15L, 2L, 4L, 12L)), .Names = c("Year",
"Cat", "Number"), class = "data.frame", row.names = c(NA, -4L
))
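With newer tidyr the long detour through gather, unite and spread is not needed, because pivot_wider can spread several value columns at once. A sketch under the assumption that dplyr and tidyr 1.0.0 or later are available; it reuses the indx idea from the answers above:

```r
library(dplyr)
library(tidyr)

df <- data.frame(Year = c(2001L, 2001L, 2002L, 2002L),
                 Cat = c("A", "B", "A", "B"),
                 Number = c(15L, 2L, 4L, 12L))

wide <- df %>%
  group_by(Year) %>%
  mutate(indx = row_number()) %>%   # 1, 2 within each Year
  ungroup() %>%
  # one output column per combination of value column and indx
  pivot_wider(names_from = indx, values_from = c(Cat, Number))
wide
```

The columns come out as Year, Cat_1, Cat_2, Number_1, Number_2, and Number stays integer because the values are never mixed with Cat in a single column.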
