How to reshape data (with column name parsing) in R

need to reshape a data.frame from this
TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1 10006 11 14 16 24
2 10007 23 27 32 35
To this:
TestID Machine Measure Count
1 10006 1 11 14
2 10006 2 16 24
3 10007 1 23 27
4 10007 2 32 35
Below is code to create each data frame. I looked at reshape() in R but couldn't figure out how to split the column names.
Note: this is a subset of the columns - there are 70-140 machines. How can I make this simpler?
b <- data.frame(10006:10007, matrix(c(11,23,14,27,16,32,24,35), 2, 4))
colnames(b) <- c("TestID", "Machine1Measure", "Machine1Count", "Machine2Measure", "Machine2Count")
a <- data.frame(matrix(c(10006,10006,10007,10007,1,2,1,2,11,16,23,32,14,24,27,35), 4, 4))
colnames(a) <- c("TestID", "Machine", "Measure", "Count")
b
a

The following reproduces your expected output:
library(dplyr)
library(tidyr)
df %>%
gather(key, value, -TestID) %>%
separate(key, into = c("tmp", "what"), sep = "(?<=\\d)") %>%
separate(tmp, into = c("tmp", "Machine"), sep = "(?=\\d+)") %>%
spread(what, value) %>%
select(-tmp)
# TestID Machine Count Measure
#1 10006 1 14 11
#2 10006 2 24 16
#3 10007 1 27 23
#4 10007 2 35 32
Explanation: We reshape the data from wide to long and use two separate() calls to split the keys into the id and value fields before reshaping again from long to wide. (We use a positive look-ahead and a positive look-behind to split the keys into the required parts.)
Sample data
df <- read.table(text =
" TestID Machine1Measure Machine1Count Machine2Measure Machine2Count
1 10006 11 14 16 24
2 10007 23 27 32 35", header = T)
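As an addendum (my own sketch, not part of the original answer): on tidyr 1.0 or later, the whole gather/separate/spread pipeline collapses into a single pivot_longer() call, using names_pattern to split the column names and the special ".value" sentinel to spread "Measure"/"Count" back out:

```r
library(tidyr)  # assumes tidyr >= 1.0 for pivot_longer()

# Sample data from the question
df <- data.frame(TestID = 10006:10007,
                 Machine1Measure = c(11, 23), Machine1Count = c(14, 27),
                 Machine2Measure = c(16, 32), Machine2Count = c(24, 35))

# names_pattern captures the machine number and the measurement kind;
# ".value" turns "Measure"/"Count" into their own columns.
out <- pivot_longer(df, cols = -TestID,
                    names_pattern = "Machine(\\d+)(Measure|Count)",
                    names_to = c("Machine", ".value"))
out
```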

data.table can do all this within one melt, which is almost 30x faster than the (perfectly working) tidyverse solution provided by MauritsEvers.
It uses patterns to select the columns with 'Measure' and 'Count' in their names, and then melts these columns into the column names given in value.name.
library( data.table )
melt( setDT( b),
id.vars = c("TestID"),
measure.vars = patterns( ".*Measure", ".*Count"),
variable.name = "Machine",
value.name = c("Measure", "Count") )
# TestID Machine Measure Count
# 1: 10006 1 11 14
# 2: 10007 1 23 27
# 3: 10006 2 16 24
# 4: 10007 2 32 35
Benchmarking
# Unit: microseconds
# expr min lq mean median uq max neval
# data.table 182.265 200.3405 245.0403 234.0825 264.6605 3137.967 1000
# reshape 1757.575 1840.7240 2180.4957 1938.3335 2011.3895 100429.392 1000
# tidyverse 6173.203 6430.7830 6925.6034 6569.9670 6763.9810 29722.714 1000

And since nobody else likes reshape() any longer, I'll add an answer:
reshape(
setNames(b, sub("^.+(\\d+)(.+)$", "\\2.\\1", names(b))),
idvar="TestID", direction="long", varying=-1, timevar="Machine"
)
# TestID Machine Measure Count
#10006.1 10006 1 11 14
#10007.1 10007 1 23 27
#10006.2 10006 2 16 24
#10007.2 10007 2 32 35
It'll never compete with data.table for pure speed, but brief testing on 2M rows using:
bbig <- b[rep(1:2,each=1e6),]
bbig$TestID <- make.unique(as.character(bbig$TestID))
#data.table - 0.06 secs
#reshape - 2.30 secs
#tidyverse - 56.60 secs
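For reference, a minimal sketch of how such a timing could be reproduced with base system.time() (absolute numbers will vary by machine; the 2M-row setup follows the answer above):

```r
library(data.table)

b <- data.frame(TestID = 10006:10007,
                Machine1Measure = c(11, 23), Machine1Count = c(14, 27),
                Machine2Measure = c(16, 32), Machine2Count = c(24, 35))

# Blow the two rows up to 2M, as in the answer
bbig <- b[rep(1:2, each = 1e6), ]
bbig$TestID <- make.unique(as.character(bbig$TestID))

system.time(
  res <- melt(setDT(bbig), id.vars = "TestID",
              measure.vars = patterns(".*Measure", ".*Count"),
              variable.name = "Machine",
              value.name = c("Measure", "Count"))
)
```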

Related

Using purrr to apply filter and mutate based on another data set

I am trying to use purrr to filter rows and to mutate a variable, both based on values from another data frame.
# This is the original table
set.seed(100)
dfOriginal <- data.table(age = sample(10:60, 10))
# Following is the second data frame containing one variable which
# I would like to filter by - age criterion
# and then to mutate with - age band
dfAgeBands <- data.table(ageCriterion = c("age > 0 & age <= 20", "age > 20 & age <= 30"),
ageBand = c("Young Adults", "Adults"))
finalDf <- map2(dfAgeBands$ageCriterion, dfAgeBands$ageBand, function(x,y){dfOriginal[.x, ageBands := .y]})
Edit: Just corrected the code (which was built for a different dataset!)
But it still does not work.
The expected output would be like the below, as per the rules defined by ageCriterion in the dfAgeBands dataframe.
age ageBand
1: 56 <NA>
2: 51 <NA>
3: 41 <NA>
4: 36 <NA>
5: 44 <NA>
6: 32 <NA>
7: 19 Young Adults
8: 53 <NA>
9: 28 Adults
10: 29 Adults
A solution using a non-equi join from data.table.
First, get the min and max age per group, extracted from the criterion string:
library(dplyr)
library(stringr)
#get minimum and maximum age from the group criterion
dfAgebands <- dfAgeBands %>% mutate( minAge = stringr::str_extract( ageCriterion, "(?<=\\> )[0-9]+(?= &)") %>% as.numeric(),
maxAge = stringr::str_extract( ageCriterion, "(?<=\\<= )[0-9]+(?=$)") %>% as.numeric() )
ageCriterion ageBand minAge maxAge
1 age > 0 & age <= 20 Young Adults 0 20
2 age > 20 & age <= 30 Adults 20 30
now, you can easily perform a non-equi join
library(data.table)
dfOriginal[ dfAgebands, ageBand := i.ageBand, on = c("age > minAge", "age <= maxAge")]
# age ageBand
# 1: 55 <NA>
# 2: 40 <NA>
# 3: 41 <NA>
# 4: 33 <NA>
# 5: 56 <NA>
# 6: 25 Adults
# 7: 11 Young Adults
# 8: 13 Young Adults
# 9: 28 Adults
# 10: 27 Adults
It is usually better not to go through eval(parse(...)), but the expressions here make it tempting. One option is to evaluate the expression in the i by looping through each element of 'ageCriterion' and assigning (:=) the corresponding 'ageBand' value to the rows that satisfy the condition in i.
library(data.table)
for(i in seq_len(nrow(dfAgeBands))) {
dfOriginal[eval(parse(text = dfAgeBands$ageCriterion[i])),
ageBand := dfAgeBands$ageBand[i]]
}
dfOriginal[]
Or using purrr
library(purrr)
pwalk(dfAgeBands, ~ dfOriginal[eval(parse(text = .x)), ageBand := .y])
dfOriginal[]
# age ageBand
# 1: 25 Adults
# 2: 22 Adults
# 3: 37 <NA>
# 4: 12 Young Adults
# 5: 32 <NA>
# 6: 56 <NA>
# 7: 46 <NA>
# 8: 26 Adults
# 9: 33 <NA>
#10: 17 Young Adults
For what it is worth --- i.e., my solution in addition to the solution of giants like akrun and other geniuses like Wimpel --- here is a solution with map2:
library(rlang)  # for parse_expr()
map2(dfAgeBands$ageCriterion, dfAgeBands$ageBand,
function(x, y) { dfOriginal[eval(parse_expr(x)), ageBand := y] })
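As an aside (a sketch I'm adding, not part of any answer above): when the criteria are simple closed ranges like these, base cut() can assign the bands without parsing expressions at all:

```r
set.seed(100)
dfOriginal <- data.frame(age = sample(10:60, 10))

# Bounds implied by "age > 0 & age <= 20" and "age > 20 & age <= 30";
# ages outside all bands become NA, matching the expected output.
dfOriginal$ageBand <- as.character(
  cut(dfOriginal$age,
      breaks = c(0, 20, 30),
      labels = c("Young Adults", "Adults"))
)
dfOriginal
```

By default cut() uses right-closed intervals (0,20] and (20,30], which is exactly what the "> a & <= b" criteria describe.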

Splitting a column with multiple and unevenly distributed delimiters in R

I have a column/vector of character data that I need to separate into different columns. The problem? There are different delimiters (which mean different things), and different lengths between each delimiter. For example:
column_name
akjhaa 1-29 y 12-30
bsd, 14-20
asdf asdf del 2-5 y 6
dkljwv 3-31
joikb 6-22
sqwzsxcryvyde jd de 1-2
pk, ehde 1-2
jsd 1-15
asdasd asedd 1,3
The numbers need to be separated into columns apart from the characters. However, the numbers can be separated by a comma or dash or 'y'. Moreover, the numbers separated by dash should be somehow designated, as eventually, I need to make a document/vector where each of the numbers in that range is in their own column also (such that the split aaa column would become aaa 1 2 3 4 5 .... 29 12 13 ... 30).
So far, I have tried separating into columns based on the different delimiters, but because sometimes the values have more than one '-','y', or the 'y' falls as a word in one of the first character parts, it is starting to get a bit complicated...is there an easier way?
For clarification, in the particular "column_name" I gave, the final output would be such that i would have n columns, where n = (the highest number of numbers + 1 (the character string of the column name)). So, in the example of the provided "column_name," it would look like:
column_name n1 n2 n3 n4 n5 n6 n7 n8 n9 n10 n11 n12 n13 n14 n15 n16 n17 n18 n19 n20 n21 n22 n23 n24 n25 n26 n27 n28 n29 n30 n31 n32 n33 n34 n35 n36 n37 n38 n39 n40 n41 n42 n43 n44 n45 n46 n47 n48 n49 n50 n51 n52 n53 n54 n55 n56 n57 n58
akjhaa 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
bsd 14 15 16 17 18 19 20
asdf asdf del 2 3 4 5 6
dkljwv 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
joikb 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
sqwzsxcryvyde jd de 1 2
pk ehde 1 2
jsd 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
asdasd asedd 1 3
This isn't pretty, but it works. The result is a list column with the relevant values.
library(magrittr)
library(data.table)  # for setDT(), rowid(), transpose()
library(splitstackshape)
setDT(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
cSplit("CN", "SPLIT") %>%
cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
cSplit("CN_2", "-") %>%
.[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
.(CN_1, rowid(CN_1))] %>%
.[, list(values = list(unlist(values))), .(CN_1)]
# CN_1 values
# 1: akjhaa 1,2,3,4,5,6,...
# 2: bsd, 14,15,16,17,18,19,...
# 3: asdf asdf del 2,3,4,5,6
# 4: dkljwv 3,4,5,6,7,8,...
# 5: joikb 6, 7, 8, 9,10,11,...
# 6: sqwzsxcryvyde jd de 1,2
# 7: pk, ehde 1,2
# 8: jsd 1,2,3,4,5,6,...
# 9: asdasd asedd 1,3
To get the extra columns instead of a list, you would need one more line: cbind(., .[, data.table::transpose(values)]):
as.data.table(mydf)[, CN := gsub("(.*?) ([0-9].*)", "\\1SPLIT\\2", column_name)] %>%
cSplit("CN", "SPLIT") %>%
cSplit("CN_2", "[y,]", "long", fixed = FALSE) %>%
cSplit("CN_2", "-") %>%
.[, list(values = list(if (is.na(CN_2_2)) CN_2_1 else CN_2_1:CN_2_2)),
.(CN_1, rowid(CN_1))] %>%
.[, list(values = list(unlist(values))), .(CN_1)] %>%
cbind(., .[, data.table::transpose(values)])
The basic idea is to do the following steps:
1. Split the column names from the values.
2. Split values separated by "y" or by a "," into new rows.
3. Split values separated by "-" into multiple columns.
4. Create your list of vectors according to the rule: if the second split column is NA, return just the value from the first column; otherwise, create the sequence from the value in the first column to the value in the second column. Since you have duplicated "id" values after converting the data into a longer form, use rowid() to help with the grouping.
5. Consolidate the values in the list column according to the actual IDs.
6. (Optionally, in my opinion) transform the list data into multiple columns.
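Steps 2-4 above can also be sketched in base R with a small helper; expand_nums is a hypothetical name introduced here for illustration, not part of splitstackshape:

```r
expand_nums <- function(x) {
  # Split the leading text from the numeric part (first digit onward)
  name <- sub("^(.*?)\\s*\\d.*$", "\\1", x, perl = TRUE)
  nums <- sub("^.*?(\\d.*)$", "\\1", x, perl = TRUE)
  # Split on "y" or "," into pieces, then expand "a-b" ranges with seq()
  pieces <- strsplit(nums, "\\s*(y|,)\\s*")[[1]]
  vals <- unlist(lapply(pieces, function(p) {
    bounds <- as.numeric(strsplit(p, "-")[[1]])
    if (length(bounds) == 2) seq(bounds[1], bounds[2]) else bounds
  }))
  list(name = name, values = vals)
}

res <- expand_nums("akjhaa 1-29 y 12-30")
res
```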

How to gather series of columns with data into rows [duplicate]

This question already has answers here:
Reshaping multiple sets of measurement columns (wide format) into single columns (long format)
(8 answers)
Closed 5 years ago.
I'm just trying to get my head around tidying my data and I have this problem:
I have data as follows:
ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx1Date Tx1Details
1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
I want the data to be in the format
ID Tx TxDate TxDetails
1 14 12/3/14 blabla
1 1e 12/5/14 morebla
1 r 14/2/14 grrr
2 23 14/5/16 albalb
2 342 1/4/5 teeee
2 s 5/6/17 purrr
I have used
library(tidyr)
library(dplyr)
NewData<-mydata %>% gather(key, value, "ID", 2:10)
but I'm not sure how to rename the columns as per the intended output to see if this will work
You can rename your data frame columns to more conventional, separable names and then use the base reshape() function. Assume your initial data frame looks like this (I changed the last two column names to Tx3Date and Tx3Details, as otherwise they duplicate columns 4 and 5):
df
# ID Tx1 Tx1Date Tx1Details Tx2 Tx2Date Tx2Details Tx3 Tx3Date Tx3Details
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
names(df) <- gsub("(\\d)(\\w*)", "\\2\\.\\1", names(df))
df
# ID Tx.1 TxDate.1 TxDetails.1 Tx.2 TxDate.2 TxDetails.2 Tx.3 TxDate.3 TxDetails.3
#1 1 14 12/3/14 blabla 1e 12/5/14 morebla r 14/2/14 grrr
#2 2 23 14/5/16 albalb 342 1/4/5 teeee s 5/6/17 purrr
reshape(df, varying = 2:10, idvar = "ID", dir = "long")
# ID time Tx TxDate TxDetails
#1.1 1 1 14 12/3/14 blabla
#2.1 2 1 23 14/5/16 albalb
#1.2 1 2 1e 12/5/14 morebla
#2.2 2 2 342 1/4/5 teeee
#1.3 1 3 r 14/2/14 grrr
#2.3 2 3 s 5/6/17 purrr
Drop the redundant time variable if you don't need it.
The data.table package handles this pretty well.
library(data.table)
setDT(df)
melt(df, measure = list(Tx = grep("^Tx[0-3]$", names(df)),
Date = grep("Date", names(df)),
Details = grep("Details", names(df))),
value.name = c("Tx", "TxDate", "TxDetails"))
Or more concisely
melt(df, measure = patterns("^Tx[0-3]$", "Date", "Details"),
value.name = c("Tx", "TxDate", "TxDetails"))
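The same rename-then-reshape idea also works with tidyr's pivot_longer() (a sketch, assuming tidyr >= 1.0 and the corrected Tx3 column names):

```r
library(tidyr)  # assumes tidyr >= 1.0

df <- data.frame(ID = 1:2,
                 Tx1 = c("14", "23"), Tx1Date = c("12/3/14", "14/5/16"),
                 Tx1Details = c("blabla", "albalb"),
                 Tx2 = c("1e", "342"), Tx2Date = c("12/5/14", "1/4/5"),
                 Tx2Details = c("morebla", "teeee"),
                 Tx3 = c("r", "s"), Tx3Date = c("14/2/14", "5/6/17"),
                 Tx3Details = c("grrr", "purrr"),
                 stringsAsFactors = FALSE)

# Move the digit to the end ("Tx1Date" -> "TxDate.1"), then split on "."
names(df) <- sub("(\\d)(\\w*)", "\\2.\\1", names(df))
out <- pivot_longer(df, -ID, names_to = c(".value", "set"), names_sep = "\\.")
out
```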

How to refer to the grouped data frame itself in the function in ddply()

It is possible to apply a function to a data frame grouped by certain variables with ddply(), but how do you refer to the grouped data frame itself as the argument of that function?
Take min() as an EXAMPLE:
What I have:
> BodyWeight
Treatment day1 day2 day3
1 a 32 33 36
2 a 35 35 26
3 a 33 38 46
4 b 23 24 25
5 b 22 16 34
6 b 36 35 37
7 c 45 45 39
8 c 29 26 12
9 c 43 27 36
What I want:
Treatment min
1 a 26
2 b 16
3 c 12
What I did and what I got:
> ddply(BodyWeight, .(Treatment), summarize, min= min(BodyWeight[,-1]))
Treatment min
1 a 12
2 b 12
3 c 12
The min() is just an example; general solutions are desired.
What you want is to summarize by Treatment and day. The issue is that you have days in multiple columns. You need to convert your data from the wide format it's in (multiple columns) into a long format (key-value pairs).
library(tidyr)
library(plyr)
bw_long <- gather(BodyWeight, day, value, day1:day3)
ddply(bw_long, .(Treatment, day), summarize, min = min(value))
p.s. Check out the successor to plyr, dplyr
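Following that suggestion, a pure dplyr/tidyr version might look like this (a sketch; pivot_longer() assumes tidyr >= 1.0):

```r
library(dplyr)
library(tidyr)

# Sample data matching the question
BodyWeight <- data.frame(
  Treatment = rep(c("a", "b", "c"), each = 3),
  day1 = c(32, 35, 33, 23, 22, 36, 45, 29, 43),
  day2 = c(33, 35, 38, 24, 16, 35, 45, 26, 27),
  day3 = c(36, 26, 46, 25, 34, 37, 39, 12, 36)
)

# Wide -> long, then one min per Treatment across all days
out <- BodyWeight %>%
  pivot_longer(day1:day3, names_to = "day", values_to = "value") %>%
  group_by(Treatment) %>%
  summarise(min = min(value))
out
```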
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(BodyWeight)), then, grouped by 'Treatment', unlist the Subset of Data.table (.SD) and get the min value.
library(data.table)
setDT(BodyWeight)[, .(min = min(unlist(.SD))) , by = Treatment]
# Treatment min
#1: a 26
#2: b 16
#3: c 12

How to group in R with partial match and assign a column with the aggregated value?

Below is the data frame I have:
Quarter Revenue
1 2014-Q1 10
2 2014-Q2 20
3 2014-Q3 30
4 2014-Q4 40
5 2015-Q1 50
6 2015-Q2 60
7 2015-Q3 70
8 2015-Q4 80
I want to find the mean over the quarters containing Q1, Q2, Q3, Q4 separately (e.g. for rows containing Q1, I have two revenue values, 10 and 50, whose mean is 30) and insert a column holding that mean. The output should look like this:
Quarter Revenue Aggregate
1 2014-Q1 10 30
2 2014-Q2 20 40
3 2014-Q3 30 50
4 2014-Q4 40 60
5 2015-Q1 50 30
6 2015-Q2 60 40
7 2015-Q3 70 50
8 2015-Q4 80 60
Could you please show approaches both with and without the popular packages?
Thanks!
We can separate the "Quarter" into "Year", "Quart", group by "Quart", and get the mean of "Revenue"
library(dplyr)
library(tidyr)
separate(df1, Quarter, into = c("Year", "Quart"), remove = FALSE) %>%
group_by(Quart) %>%
mutate(Aggregate = mean(Revenue)) %>%
ungroup() %>%
select(-Quart, -Year)
# Quarter Revenue Aggregate
# <chr> <int> <dbl>
#1 2014-Q1 10 30
#2 2014-Q2 20 40
#3 2014-Q3 30 50
#4 2014-Q4 40 60
#5 2015-Q1 50 30
#6 2015-Q2 60 40
#7 2015-Q3 70 50
#8 2015-Q4 80 60
Or we can do this compactly with data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by the substring of 'Quarter' (with the year and "-" removed), we assign (:=) the mean of 'Revenue' to create the 'Aggregate' column.
library(data.table)
setDT(df1)[, Aggregate := mean(Revenue) ,.(sub(".*-", "", Quarter))]
One possible solution using functions from the base package.
qtr <- c("Q1", "Q2", "Q3", "Q4")
avg <- numeric()
for (n in seq_along(qtr)) {
ind <- grep(qtr[n], df1$Quarter)
avg[length(avg) + 1] <- mean(df1$Revenue[ind])
}
df1 <- transform(df1, Aggregate = avg)
Using functions from other packages (e.g., dplyr) apparently makes the code less verbose.
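A more compact base-only variant (my own sketch) uses ave(), whose default aggregation is mean and which returns group means already aligned with the original rows:

```r
# Sample data matching the question
df1 <- data.frame(Quarter = paste(rep(2014:2015, each = 4),
                                  paste0("Q", 1:4), sep = "-"),
                  Revenue = seq(10, 80, by = 10),
                  stringsAsFactors = FALSE)

# Group by the quarter substring; ave() returns one mean per group,
# repeated for each row of that group.
df1$Aggregate <- ave(df1$Revenue, sub(".*-", "", df1$Quarter))
df1
```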
