Restructure data frame in R

I'm wondering if there is an easy way to restructure some data I have. I currently have a data frame that looks like this...
Year Cat Number
2001 A 15
2001 B 2
2002 A 4
2002 B 12
But what I ultimately want is to have it in this shape...
Year Cat Number Cat Number
2001 A 15 B 2
2002 A 4 B 12
Is there a simple way to do this?
Thanks in advance
:)

One way would be to use dcast/melt from reshape2. In the code below, first create a sequence of numbers (the indx column) within each Year using transform and ave. Then melt the transformed dataset, keeping Year and indx as the id variables. The long-format dataset is then reshaped to wide format with dcast. If you don't want the _1/_2 suffix on the column names, gsub can remove that part.
library(reshape2)
res <- dcast(melt(transform(df, indx = ave(seq_along(Year), Year, FUN = seq_along)),
                  id.var = c("Year", "indx")),
             Year ~ variable + indx, value.var = "value")
colnames(res) <- gsub("\\_.*", "", colnames(res))
res
# Year Cat Cat Number Number
#1 2001 A B 15 2
#2 2002 A B 4 12
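To see what the transform/ave step does on its own, here is a quick illustration (a sketch run against the df in the data section below):
# the intermediate step: ave() just numbers the rows within each Year
transform(df, indx = ave(seq_along(Year), Year, FUN = seq_along))
#   Year Cat Number indx
# 1 2001   A     15    1
# 2 2001   B      2    2
# 3 2002   A      4    1
# 4 2002   B     12    2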
Or using dplyr/tidyr. Here the idea is similar to the above. After grouping by the Year column, generate an indx column with mutate, reshape to long format with gather, unite the two columns into a single VarIndx column, and then reshape back to wide format with spread. In the last step, mutate_each converts the columns whose names start with Number to numeric.
library(dplyr)
library(tidyr)
res1 <- df %>%
  group_by(Year) %>%
  mutate(indx = row_number()) %>%
  gather("Var", "Val", Cat:Number) %>%
  unite(VarIndx, Var, indx) %>%
  spread(VarIndx, Val) %>%
  mutate_each(funs(as.numeric), starts_with("Number"))
res1
# Source: local data frame [2 x 5]
# Year Cat_1 Cat_2 Number_1 Number_2
#1 2001 A B 15 2
#2 2002 A B 4 12
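If you have tidyr 1.0.0 or later, the gather/unite/spread steps can be collapsed into a single pivot_wider call. This is a sketch of that idea (not from the original answer), reusing the same per-Year indx:
library(dplyr)
library(tidyr)  # pivot_wider needs tidyr >= 1.0.0
df %>%
  group_by(Year) %>%
  mutate(indx = row_number()) %>%   # per-Year sequence, as above
  ungroup() %>%
  pivot_wider(names_from = indx, values_from = c(Cat, Number))
# columns: Year, Cat_1, Cat_2, Number_1, Number_2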
Or you can create an indx variable, .id, using getanID from splitstackshape (from comments by @Ananda Mahto, the author of splitstackshape) and use reshape from base R:
library(splitstackshape)
reshape(getanID(df, "Year"), direction="wide", idvar="Year", timevar=".id")
# Year Cat.1 Number.1 Cat.2 Number.2
#1: 2001 A 15 B 2
#2: 2002 A 4 B 12
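A similar result can also be had with data.table alone, via rowid and dcast with multiple value.var columns; a sketch (assuming data.table >= 1.9.8 and the df in the data section below):
library(data.table)
dcast(setDT(df), Year ~ rowid(Year), value.var = c("Cat", "Number"))
#    Year Cat_1 Cat_2 Number_1 Number_2
# 1: 2001     A     B       15        2
# 2: 2002     A     B        4       12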
data
df <- structure(list(Year = c(2001L, 2001L, 2002L, 2002L), Cat = c("A",
"B", "A", "B"), Number = c(15L, 2L, 4L, 12L)), .Names = c("Year",
"Cat", "Number"), class = "data.frame", row.names = c(NA, -4L
))

Related

Inner_join two dataframes when year on the RHS is 1-3 years after the Year on the LHS

I want to join two dataframes on index and Year, where the Year on the RHS is 1-3 years after the Year on the LHS. For example, dataframe df_lhs is
A index Year
1 A 12/31/2012
3 B 12/31/2011
5 C 12/31/2009
the df_rhs is
B index Year
5 A 12/31/2001
6 B 12/31/2010
2 C 12/31/2011
I would like the resulting inner_join to contain:
A index Year_left Year_right
5 C 12/31/2009 12/31/2011
This is what I tried
df = inner_join(df_lhs, df_rhs, by = c('index','Year'), suffix = c(".left", ".right"))
The code doesn't work. Maybe I should not think about using inner_join at all?
library(dplyr)
library(tidyr)
df_lhs %>%
  separate(Year, sep = "/", into = c("m", "d", "y"), remove = F) %>%
  inner_join(., {df_rhs %>%
      separate(Year, sep = "/", into = c("m", "d", "y"), remove = F)},
    by = c('index', 'm', 'd'), suffix = c(".left", ".right")) %>%
  filter((as.numeric(y.right) - as.numeric(y.left)) %in% 1:3) %>%
  select(A, B, index, Year.left, Year.right)
#> A B index Year.left Year.right
#> 1 5 2 C 12/31/2009 12/31/2011
What you can do is a simple join/merge, and then keep only the rows that satisfy your condition (here, a difference of 1-3 years).
Below is code for merging the two data frames on their shared index column.
merge(df_lhs, df_rhs, by = "index")
After this you will have the merged data and can filter on a condition such as the difference between the two years being between 1 and 3.
This is just a suggestion. I hope this helps.
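A minimal sketch of that merge-then-filter idea (assuming df_lhs and df_rhs hold the data shown above, with Year stored as text):
m <- merge(df_lhs, df_rhs, by = "index", suffixes = c(".left", ".right"))
yr <- function(d) as.integer(sub(".*/", "", as.character(d)))  # pull the year out of "m/d/yyyy"
m[(yr(m$Year.right) - yr(m$Year.left)) %in% 1:3, ]
#   index A  Year.left B Year.right
# 3     C 5 12/31/2009 2 12/31/2011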

Reshape dataframe that has years in column names

I am trying to reshape a wide dataframe in R into a long dataframe. Reading over some of the functions in reshape2 and tidyr, they all seem to handle only the case where you split a single variable, whereas I have ~10. Each column name contains the variable name and the year, and I would like it split so that the year becomes a factor in each row, leaving significantly fewer columns and an easier data set to work with.
Currently the table looks something like this.
State Rank Name V1_2016 V1_2017 V1_2018 V2_2016 V2_2017 V2_2018
TX 1 Company 1 2 3 4 5 6
I have tried to melt the data with reshape2 but it came out looking like garbage and was 127k rows when it should only be about 10k.
I am trying to get the data to look something like this.
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6
An option is melt from data.table, which can take multiple measure columns based on patterns in the column names:
library(data.table)
nm1 <- unique(sub(".*_", "", names(df)[-(1:3)]))
melt(setDT(df), measure = patterns("V1", "V2"),
     value.name = c("V1", "V2"), variable.name = "Year")[,
       Year := nm1[Year]][]
# State Rank Name Year V1 V2
#1: TX 1 Company 2016 1 4
#2: TX 1 Company 2017 2 5
#3: TX 1 Company 2018 3 6
data
df <- structure(list(State = "TX", Rank = 1L, Name = "Company", V1_2016 = 1L,
V1_2017 = 2L, V1_2018 = 3L, V2_2016 = 4L, V2_2017 = 5L, V2_2018 = 6L),
class = "data.frame", row.names = c(NA,
-1L))
One dplyr and tidyr possibility could be:
df %>%
  gather(var, val, -c(1:3)) %>%
  separate(var, c("var", "Year")) %>%
  spread(var, val)
State Rank Name Year V1 V2
1 TX 1 Company 2016 1 4
2 TX 1 Company 2017 2 5
3 TX 1 Company 2018 3 6
It first transforms the data from wide to long format, excluding the first three columns. Then it separates the original column names into two new variables: one containing the variable prefix and one containing the year. Finally, it spreads the data back to wide format.
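With tidyr 1.0.0 or later, the same reshape can be done in one step with pivot_longer and the ".value" sentinel. A sketch, assuming the df from the data block above:
library(tidyr)
pivot_longer(df, cols = -c(State, Rank, Name),
             names_to = c(".value", "Year"), names_sep = "_")
# columns: State, Rank, Name, Year, V1, V2 -- one row per year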

Extract a value in column A given the maximum value in column B within each group (column C) in an R dataframe?

My question is related to Extract the maximum value within each group in a dataframe.
The question there is essentially: how do you select the max value in one column based on repeated groups in a separate column in the same data frame?
In that post, user EDi provides a ton of examples for how to accomplish this task.
My question: how do I accomplish the same task, but instead of reporting the max value, I instead report a value in a third column associated with that max value?
For example:
Assume I have a data.frame:
Group Value Year
A 12 1933
A 10 2010
B 3 1935
B 5 1978
B 6 2011
C 1 1954
D 3 1933
D 4 1978
For each level of my grouping variable, I wish to extract the year that the maximum value occurred. The result should thus be a data frame with one row per level of the grouping variable:
Group Year
A 1933
B 2011
C 1954
D 1978
I know I could use any of the answers from EDi's post mentioned above and then just use something like which, match, or sapply to figure out the year, but that seems too sloppy.
Is there a quick way to extract a value in column A given the maximum value in column B within each group (column C) in a dataframe?
Update: Could someone please provide a base R solution?
library(dplyr)
df %>% group_by(Group) %>% slice(which.max(Value)) %>% select(-Value)
#Source: local data frame [4 x 2]
#Groups: Group [4]
# Group Year
# <fctr> <int>
#1 A 1933
#2 B 2011
#3 C 1954
#4 D 1978
Note this only keeps one max value per group if ties exist.
A method that keeps tied max values:
library(dplyr)
df %>% group_by(Group) %>% filter(Value == max(Value)) %>% select(-Value)
#Source: local data frame [4 x 2]
#Groups: Group [4]
# Group Year
# <fctr> <int>
#1 A 1933
#2 B 2011
#3 C 1954
#4 D 1978
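With dplyr 1.0.0 or later, slice_max is a near-equivalent shortcut (it also keeps tied maxima by default); a sketch:
library(dplyr)
df %>% group_by(Group) %>% slice_max(Value, n = 1) %>% select(-Value)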
Here is a base R and a data.table solution:
df <- structure(list(Group = c("A", "A", "B", "B", "B", "C", "D", "D"
), Value = c(12L, 10L, 3L, 5L, 6L, 1L, 3L, 4L), Year = c(1933L,
2010L, 1935L, 1978L, 2011L, 1954L, 1933L, 1978L)), .Names = c("Group",
"Value", "Year"), row.names = c(NA, -8L), class = "data.frame")
# Base R - use aggregate to get max Value per group, then merge with df
merge(df, aggregate(Value ~ Group, df, max), by = c("Group", "Value"))[
  , c("Group", "Year")]
# Group Year
# 1 A 1933
# 2 B 2011
# 3 C 1954
# 4 D 1978
# data.table solution
library(data.table)
dt <- data.table(df)
dt[, .SD[which.max(Value), .(Year)], by = Group]
# Group Year
# 1: A 1933
# 2: B 2011
# 3: C 1954
# 4: D 1978
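Another base R option, a one-liner sketch with ave (like the filter approach, it keeps tied maxima):
df[with(df, Value == ave(Value, Group, FUN = max)), c("Group", "Year")]
#   Group Year
# 1     A 1933
# 5     B 2011
# 6     C 1954
# 8     D 1978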

R code to generate numbers in sequence and insert rows [duplicate]

This question already has answers here:
R code to insert rows based on a column's value and increment it by 1
(3 answers)
I have a dataset with 2 columns. The first column is an ID and the second column is the total number of quarters. If column B (Quarters) has the value 8, then 8 rows should be created, numbered 1 to 8. The ID in column A should be the same for all of those rows. The dataset shown below is an example.
ID Quarters
A 5
B 2
C 1
Expected output
ID Quarters
A 1
A 2
A 3
A 4
A 5
B 1
B 2
C 1
Here is what I tried.
library(data.table)
setDT(df.WQuarter)[, (Quarters=1:Quarters), ID]
I get an error. Can you please help? I have been stuck on this for the whole day and am just learning the basics of R.
We can use base R to replicate the 'ID' by 'Quarters' and create the 'Quarters' by taking the sequence of that column.
with(df1, data.frame(ID= rep(ID, Quarters), Quarters = sequence(Quarters)))
# ID Quarters
#1 A 1
#2 A 2
#3 A 3
#4 A 4
#5 A 5
#6 B 1
#7 B 2
#8 C 1
If we are using data.table, convert the 'data.frame' to a 'data.table' (setDT(df1)), group by 'ID', and take sequence(Quarters) (or just seq(Quarters)).
library(data.table)
setDT(df1)[, .(Quarters=sequence(Quarters)) , by = ID]
As @PierreLaFortune commented on the post, if we have NA values we need to remove them:
setDT(df1)[, .(Quarters = seq_len(Quarters[!is.na(Quarters)])), by = ID]
Or using dplyr/tidyr:
library(dplyr)
library(tidyr)
df1 %>%
  group_by(ID) %>%
  mutate(Quarters = list(seq(Quarters))) %>%
  ungroup() %>%
  unnest(Quarters)
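With tidyr 0.8.0 or later, the list/unnest step can be replaced by uncount; a sketch (not part of the original answer):
library(dplyr)
library(tidyr)
df1 %>%
  uncount(Quarters) %>%                 # repeat each ID row 'Quarters' times
  group_by(ID) %>%
  mutate(Quarters = row_number()) %>%   # renumber 1..n within each ID
  ungroup()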
If the OP's "Quarters" column is non-numeric, it should be converted to 'numeric' before proceeding
df1$Quarters <- as.numeric(as.character(df1$Quarters))
The as.character is needed in case the column is a factor; if it is already of character class, as.numeric alone is enough.
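A quick illustration of why the as.character step matters for factors:
f <- factor(c("5", "2", "1"))
as.numeric(f)                # 3 2 1 -- the underlying level codes, not the values
as.numeric(as.character(f))  # 5 2 1 -- the intended numbers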
data
df1 <- structure(list(ID = c("A", "B", "C"), Quarters = c(5L, 2L, 1L
)), .Names = c("ID", "Quarters"), class = "data.frame", row.names = c(NA,
-3L))

Convert R dataframe from long to wide format, but with unequal group sizes, for use with qcc

I would like to convert a dataframe from long format to a wide format, but with unequal group sizes.
The eventual use will be in 'qcc', which requires a data frame or a matrix with each row consisting of one group, using NA's in groups which have fewer samples.
The following code will create an example dataset, as well as show manual conversion to the desired format.
# This is an example of the initial data that I have
# * 10 sample measurements, over 3 groups with 3, 2, and 5 elements respectively
x <- rnorm(10)
x_df <- data.frame( time = c( rep('2001 Q1',3), rep('2001 Q2',2), rep('2001 Q3',5) ), measure = x )
x_df
# This is a manual conversion into the desired format
x_pad <- c( x[1:3], NA, NA, x[4:5], NA, NA, NA, x[6:10] )
x_matrix <- matrix( x_pad, nrow = 3, ncol = 5, byrow = TRUE, dimnames = list(c('2001 Q1','2001 Q2','2001 Q3')) )
x_matrix # desired format
# An example of how it will be used
library(qcc)
plot(qcc(x_matrix, type = 'xbar', plot = FALSE))
So, I'd like to convert this:
time measure
1 2001 Q1 0.14680685
2 2001 Q1 0.53593193
3 2001 Q1 0.56097974
4 2001 Q2 -1.48102689
5 2001 Q2 0.18150972
6 2001 Q3 1.72018147
7 2001 Q3 -0.08480855
8 2001 Q3 -2.23208877
9 2001 Q3 -1.15269107
10 2001 Q3 0.57975023
... to this ...
[,1] [,2] [,3] [,4] [,5]
2001 Q1 0.1468068 0.53593193 0.5609797 NA NA
2001 Q2 -1.4810269 0.18150972 NA NA NA
2001 Q3 1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502
There is probably an easy way (perhaps some usage of reshape or reshape2 casting that I'm not familiar with?), but a bunch of searching hasn't helped me so far.
Thanks for any help!
==========
From one of the solutions below, the following will generate the final qcc xbar plot, including group labels:
library(splitstackshape)
out_df <- dcast( getanID( x_df, 'time' ), time~.id, value.var='measure' )
qcc( out_df[,-1], type = 'xbar', labels = out_df[,1] )
You can create a sequence column ('.id') using getanID from splitstackshape and use dcast from data.table to convert the long format to wide format. The output of getanID is a data.table, and loading splitstackshape also loads data.table; if you have the devel version of data.table, its dcast method can be used as well.
library(splitstackshape)
dcast(getanID(df1, 'time'), time~.id, value.var='measure')
# time 1 2 3 4 5
#1: 2001 Q1 0.1468068 0.53593193 0.5609797 NA NA
#2: 2001 Q2 -1.4810269 0.18150972 NA NA NA
#3: 2001 Q3 1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502
Update
As @snoram mentioned in the comments, the rowid function from data.table makes it possible to do this with data.table alone:
library(data.table)
dcast(setDT(df1), time ~ rowid(time), value.var = "measure")
You'll need an intermediate variable that gives a "within-time" id. You can create it and reshape like this
library(tidyr)
library(dplyr)
group_by(x_df, time) %>%
  mutate(seq = 1:n()) %>%
  ungroup() %>%
  spread(seq, measure)
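With tidyr 1.0.0 or later, spread can be swapped for pivot_wider; a sketch using the question's x_df:
library(dplyr)
library(tidyr)
x_df %>%
  group_by(time) %>%
  mutate(seq = row_number()) %>%
  ungroup() %>%
  pivot_wider(names_from = seq, values_from = measure)
# one row per time; columns 1..5, padded with NA for the shorter groups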
Another splitstackshape approach
cSplit(setDT(df)[, toString(measure), by='time'], 'V1', ',')
# time V1_1 V1_2 V1_3 V1_4 V1_5
#1: 2001 Q1 0.1468068 0.53593193 0.5609797 NA NA
#2: 2001 Q2 -1.4810269 0.18150972 NA NA NA
#3: 2001 Q3 1.7201815 -0.08480855 -2.2320888 -1.152691 0.5797502
Or, using the devel version of data.table, a similar approach: paste the 'measure' values together by the grouping column 'time', then use tstrsplit to split the 'V1' column generated by toString(measure).
setDT(df)[, toString(measure), by ='time'][, c(list(time), tstrsplit(V1, ', '))]
Also, we can add type.convert=TRUE in tstrsplit to convert the class of the split columns. By default it is FALSE.
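For instance, a sketch of the same pipeline with type.convert = TRUE so the split columns come back numeric (here 'df' stands for the question's x_df, as in the answer above):
library(data.table)
setDT(df)[, toString(measure), by = 'time'][, c(list(time = time), tstrsplit(V1, ', ', type.convert = TRUE))]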
