melt multiple groups of measure.vars - r

I have a data.table containing a number of variables across multiple years, i.e:
> dt <- data.table(id=1:3, A_2011=rnorm(3), A_2012=rnorm(3),
B_2011=rnorm(3), B_2012=rnorm(3),
C_2011=rnorm(3), C_2012=rnorm(3))
> dt
id A_2011 A_2012 B_2011 B_2012 C_2011 C_2012
1: 1 -0.8262134 0.832013744 -2.320136 0.1275409 -0.1344309 0.7360329
2: 2 0.9350433 0.279966534 -0.725613 0.2514631 1.0246772 -0.2009985
3: 3 1.1520396 -0.005775964 1.376447 -1.2826486 -0.8941282 0.7513872
I would like to melt this table into variable groups by year, i.e:
> dtLong <- data.table(id=rep(dt[,id], 2), year=c(rep(2011, 3), rep(2012, 3)),
A=c(dt[,A_2011], dt[,A_2012]),
B=c(dt[,B_2011], dt[,B_2012]),
C=c(dt[,C_2011], dt[,C_2012]))
> dtLong
id year A B C
1: 1 2011 -0.826213405 -2.3201355 -0.1344309
2: 2 2011 0.935043336 -0.7256130 1.0246772
3: 3 2011 1.152039595 1.3764468 -0.8941282
4: 1 2012 0.832013744 0.1275409 0.7360329
5: 2 2012 0.279966534 0.2514631 -0.2009985
6: 3 2012 -0.005775964 -1.2826486 0.7513872
I can easily do this for one set of variables using melt.data.frame from the reshape2 package:
> melt(dt[,list(id, A_2011, A_2012)], measure.vars=c("A_2011", "A_2012"))
But haven't been able to achieve this for multiple measure.vars with a common "factor".

You can do this easily with reshape from base R
reshape(dt, varying = 2:7, sep = "_", direction = 'long')
This will give you the following output
id time A B C
1.2011 1 2011 -0.1602428 0.428154271 0.384892382
2.2011 2 2011 1.4493949 0.178833067 2.404267878
3.2011 3 2011 -0.1952697 1.072979813 -0.653812311
1.2012 1 2012 1.7151334 0.007261567 1.521799983
2.2012 2 2012 1.0866426 0.060728118 -1.158503305
3.2012 3 2012 1.0584738 -0.508854175 -0.008505982

From ?melt samples:
melt(DT, id=1:2, measure=patterns("^f_", "^d_"), value.factor=TRUE)
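Applied to the dt from the question, the patterns() approach might look like the following. This is a minimal sketch, assuming a reasonably recent data.table in which melt() accepts patterns() for measure.vars; the object name long and the mapping back to years are illustrative and not part of the original answer.

```r
library(data.table)

set.seed(1)
dt <- data.table(id = 1:3,
                 A_2011 = rnorm(3), A_2012 = rnorm(3),
                 B_2011 = rnorm(3), B_2012 = rnorm(3),
                 C_2011 = rnorm(3), C_2012 = rnorm(3))

# One pattern per output column; each pattern matches that variable's yearly columns
long <- melt(dt, id.vars = "id",
             measure.vars = patterns("^A_", "^B_", "^C_"),
             variable.name = "year",
             value.name = c("A", "B", "C"))

# With several patterns, `year` is a factor holding the position within each
# group (1, 2), not the column suffix, so map it back to the actual years.
# Assumption: the columns appear in year order (2011, then 2012) in every group.
long[, year := c(2011L, 2012L)[year]]
long
```

The result has one row per id and year, with A, B and C as separate columns, which is the layout of dtLong in the question.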

Related

data table lapply and additional columns in output

I am just hoping there is a more convenient way. Imagine I would like to run a model with different transformations of some of the columns, e.g. winsorizing. I would like to provide the model with the transformed data set plus some additional columns that do not need to be transformed. Is there a practical way to do this in one line? I do not want to replace the data using := because I am planning to run the model with different specifications of the transformation.
dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
sel.col<-c("x","y")
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
I would need to call data.table again to merge the original dt with the transformed data, and pay attention to the order.
data.table(dt[,.(id,Country),by=factor],
dt[,lapply(.SD,Winsorize),.SDcols=sel.col,by=factor])
I was hoping that I could include the additional columns with the lapply call
dt[,.(lapply(.SD,Winsorize), id, Country),.SDcols=sel.col,by=factor]
Are there any other solutions?
Do you just need this?
dt[, c(lapply(.SD,Winsorize), list(id = id, Country = Country)), .SDcols=sel.col,by=factor]
Unfortunately, this method gets slow with big data. Apparently it was optimised in a recent update, but it is still very slow.
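The c(lapply(.SD, ...), list(...)) idiom suggested above can be tried without DescTools using the sketch below. The winsorize() function is a hypothetical stand-in for DescTools::Winsorize (it simply clamps values to the 5th and 95th percentiles), and res is an illustrative name; only the shape of the j expression is taken from the answer.

```r
library(data.table)

# Hypothetical stand-in for DescTools::Winsorize: clamp to the 5th/95th percentiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, q[1]), q[2])
}

set.seed(42)
dt <- data.table(id = 1:10,
                 Country = sample(c("Germany", "USA"), 10, replace = TRUE),
                 x = rnorm(10, 1, 10), y = rnorm(10, 1, 10),
                 factor = factor(sample(LETTERS[1:2], 10, replace = TRUE)))
sel.col <- c("x", "y")

# .SDcols only restricts .SD; id and Country are still visible in j,
# so they can be appended to the list of transformed columns.
res <- dt[, c(lapply(.SD, winsorize), list(id = id, Country = Country)),
          .SDcols = sel.col, by = factor]
res
```

Because nothing is assigned with :=, dt itself is left untouched, so the same call can be repeated with a different transformation.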
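There is no need to merge; you can assign the result of the lapply call back to the columns with := (note that this modifies dt by reference, so work on copy(dt) if you need to keep the original values):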
> library(DescTools)
> library(data.table)
> dt<-data.table(id=1:10, Country=sample(c("Germany", "USA"),10, replace=TRUE), x=rnorm(10,1,10),y=rnorm(10,1,10),factor=factor(sample(LETTERS[1:2],10,replace=TRUE)))
> sel.col<-c("x","y")
> dt
id Country x y factor
1: 1 Germany 13.116248 -0.4609152 B
2: 2 Germany -6.623404 -3.7048052 A
3: 3 USA -18.027532 22.2946805 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -12.585897 0.8255081 B
6: 6 Germany -8.816252 -12.1218135 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 6.3262951 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 15.857069 8.6422997 A
> # Notice an assignment `(sel.col) :=` here:
> dt[,(sel.col) := lapply(.SD,Winsorize),.SDcols=sel.col,by=factor]
> dt
id Country x y factor
1: 1 Germany 11.129140 -0.4609152 B
2: 2 Germany -6.623404 -1.7234191 A
3: 3 USA -17.097573 19.5642043 A
4: 4 USA -13.377736 6.2021252 A
5: 5 Germany -11.831968 0.8255081 B
6: 6 Germany -8.816252 -12.0116571 B
7: 7 USA -3.459926 -11.5710316 B
8: 8 USA 3.180706 5.2261377 B
9: 9 Germany -5.520637 7.2877123 A
10: 10 Germany 11.581528 8.6422997 A

How can I aggregate data.table in quarterly frequency?

My data is available at monthly frequency and I'm trying to aggregate it to quarterly frequency. I'm working with data.table, a package I don't understand very well, to be honest.
X.DATA_BASE NOME_INSTITUICAO SALDO.x SALDO.y
1: 199407 ASB S/A - CFI 1694581 1124580
2: 199407 BANCO ARAUCARIA S.A. 40079517 6314782
3: 199407 BANCO ATLANTIS S.A. 200463907 9356445
4: 199407 BANCO BANKPAR 1078342 5770046
5: 199407 BANCO BBI 97812975 31112289
Each date is defined by X.DATA_BASE, e.g. 199407 = July 1994. I have several institutions with SALDO.x and SALDO.y values. I want to sum SALDO.x and SALDO.y for each institution in each quarter. One of the problems is that some institutions enter and leave over time. In the end I want mydata to have the same columns but at quarterly frequency.
How could I do that?
Here's an example of how to group and sum by quarter (with thanks to @eddi for his suggested improvement). First let's create some fake data:
library(data.table)
set.seed(1485)
dat = data.table(date=rep(c(199401:199412,199501:199512),2),
firm=rep(c("A","B"), each=24),
value1=rnorm(48,1000,10),
value2=rnorm(48,2000,100))
dat
date firm value1 value2
1: 199401 A 1009.8620 2054.251
2: 199402 A 1009.7180 2124.202
3: 199403 A 1014.3421 1919.251
...
46: 199510 B 992.9961 2079.517
47: 199511 B 997.9147 1968.676
48: 199512 B 1002.5993 2006.231
Now, summarize by firm, year, and quarter. To do this, we create year and quarter grouping variables from date (we use integer division (%/%) to create the years and mod (%%) plus integer division to create the quarters), and calculate the sum of value1 and value2 for each sub-group. This all assumes date is numeric. If you have it stored as character or factor, convert to numeric first:
dat.summary = dat[ , list(valueByQuarter = sum(value1) + sum(value2)),
by=list(firm,
year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1)]
dat.summary
firm year quarter valueByQuarter
1: A 1994 1 9131.626
2: A 1994 2 8953.116
3: A 1994 3 8981.407
4: A 1994 4 9175.959
5: A 1995 1 9003.225
6: A 1995 2 8962.690
7: A 1995 3 8809.256
8: A 1995 4 8885.264
9: B 1994 1 9000.791
10: B 1994 2 8936.356
11: B 1994 3 8905.789
12: B 1994 4 8951.369
13: B 1995 1 8922.716
14: B 1995 2 9097.134
15: B 1995 3 8724.188
16: B 1995 4 9047.934
For dplyr fans, here's a dplyr approach:
library(dplyr)
dat %>%
group_by(firm, year=date %/% 100,
quarter=(date %% 100 - 1) %/% 3 + 1) %>%
summarise(valueByQuarter = sum(value1 + value2))
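Since the question asks for the same columns at quarterly frequency, the two value columns can also be summed separately rather than combined. A minimal sketch using the same fake data, where value1 and value2 stand in for SALDO.x and SALDO.y and by.quarter is an illustrative name:

```r
library(data.table)

set.seed(1485)
dat <- data.table(date = rep(c(199401:199412, 199501:199512), 2),
                  firm = rep(c("A", "B"), each = 24),
                  value1 = rnorm(48, 1000, 10),
                  value2 = rnorm(48, 2000, 100))

# Sum each value column on its own within firm, year and quarter.
# A firm that is missing in some quarter simply produces no row for it.
by.quarter <- dat[, lapply(.SD, sum),
                  by = .(firm,
                         year = date %/% 100,
                         quarter = (date %% 100 - 1) %/% 3 + 1),
                  .SDcols = c("value1", "value2")]
by.quarter
```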

R Cleaning and reordering names/serial numbers in data frame

Let's say I have a data frame as follows in R:
Data <- data.frame("SerialNum" = character(), "Year" = integer(), "Name" = character(), stringsAsFactors = F)
Data[1,] <- c("983\n837\n424\n ", 2015, "Michael\nLewis\nPaul\n ")
Data[2,] <- c("123\n456\n789\n136", 2014, "Elaine\nJerry\nGeorge\nKramer")
Data[3,] <- c("987\n654\n321\n975\n ", 2010, "John\nPaul\nGeorge\nRingo\nNA")
Data[4,] <- c("424\n983\n837", 2015, "Paul\nMichael\nLewis")
Data[5,] <- c("456\n789\n123\n136", 2014, "Jerry\nGeorge\nElaine\nKramer")
What I want to do is the following:
1. Split up each string of names and each string of serial numbers so that they are their own vectors (or a list of string vectors).
2. Eliminate any character "NA" in either set of vectors or any blank spaces denoted by "...\n ".
3. Reorder each list of names alphabetically and reorder the corresponding serial numbers according to the same permutation.
4. Concatenate each vector in the same fashion it was originally (I usually do this with paste(., collapse = "\n")).
My issue is how to do this without using a for loop. What is an object-oriented way to do this? As a first attempt in this direction I originally made a list by the command LIST <- strsplit(Data$Name, split = "\n") and from here I need a for loop in order to find the permutations of the names, which seems like a process that won't scale according to my actual data. Additionally, once I make the list LIST I'm not sure how I go about removing NA symbols or blank spaces. Any help is appreciated!
Using lapply I take each row of the data frame and turn it into a new data frame with one name per row. This creates a list of 5 data frames, one for each row of the original data frame.
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Year=Data[i,"Year"],
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
})
UPDATE: Based on your comment, let me know if this is the result you're trying to achieve:
seinfeld = lapply(1:nrow(Data), function(i) {
# Turn strings into data frame with one name per row
dat = data.frame(SerialNum=unlist(strsplit(Data[i,"SerialNum"], split="\n")),
Name=unlist(strsplit(Data[i,"Name"], split="\n")))
# Get rid of empty strings and NA values
dat = dat[!(dat$Name %in% c(""," ","NA")), ]
# Order alphabetically
dat = dat[order(dat$Name), ]
# Collapse back into a single row with the new sort order
dat = data.frame(SerialNum=paste(dat[, "SerialNum"], collapse="\n"),
Year=Data[i, "Year"],
Name=paste(dat[, "Name"], collapse="\n"))
})
do.call(rbind, seinfeld)
SerialNum Year Name
1 837\n983\n424 2015 Lewis\nMichael\nPaul
2 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
3 321\n987\n654\n975 2010 George\nJohn\nPaul\nRingo
4 837\n983\n424 2015 Lewis\nMichael\nPaul
5 123\n789\n456\n136 2014 Elaine\nGeorge\nJerry\nKramer
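The same split, clean, reorder and collapse steps can also be written without intermediate data frames. This is a rough base R sketch, not part of the answer above: reorder_pair() is a hypothetical helper, and it assumes each row's SerialNum and Name strings split into the same number of pieces, as they do in the example data.

```r
Data <- data.frame(
  SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136", "987\n654\n321\n975\n ",
                "424\n983\n837", "456\n789\n123\n136"),
  Year = c(2015, 2014, 2010, 2015, 2014),
  Name = c("Michael\nLewis\nPaul\n ", "Elaine\nJerry\nGeorge\nKramer",
           "John\nPaul\nGeorge\nRingo\nNA", "Paul\nMichael\nLewis",
           "Jerry\nGeorge\nElaine\nKramer"),
  stringsAsFactors = FALSE)

# Hypothetical helper: clean and reorder one row's pair of strings
reorder_pair <- function(serial, name) {
  s <- strsplit(serial, "\n", fixed = TRUE)[[1]]
  n <- strsplit(name, "\n", fixed = TRUE)[[1]]
  keep <- !(trimws(n) %in% c("", "NA"))  # drop blank and literal "NA" names
  s <- s[keep]
  n <- n[keep]
  ord <- order(n)                         # alphabetical permutation of the names
  c(SerialNum = paste(s[ord], collapse = "\n"),
    Name = paste(n[ord], collapse = "\n"))
}

# mapply walks both columns in parallel; t() gives one row per original row
res <- t(mapply(reorder_pair, Data$SerialNum, Data$Name, USE.NAMES = FALSE))
out <- data.frame(SerialNum = res[, "SerialNum"], Year = Data$Year,
                  Name = res[, "Name"], stringsAsFactors = FALSE)
out
```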
eipi10 offered a great answer. In addition to that, I'd like to share what I tried, mainly with data.table. First, I split the two columns (SerialNum and Name) with cSplit(), added an index with add_rownames(), and split the data by that index. In the first lapply(), I used Stacked() from the splitstackshape package to stack SerialNum and Name; the separated SerialNum and Name values become two columns, as you can see in the excerpt of temp2. In the second lapply(), I used merge() from the data.table package. Then I removed rows with NAs (lapply(na.omit)), combined all the data tables (rbindlist), and ordered the rows by rowname (the row number in the original data) and Name (setorder(rowname, Name)).
library(data.table)
library(splitstackshape)
library(dplyr)
cSplit(mydf, c("SerialNum", "Name"), direction = "wide",
type.convert = FALSE, sep = "\n") %>%
add_rownames %>%
split(f = .$rowname) -> temp
#a part of temp
#$`1`
#Source: local data frame [1 x 12]
#
#rowname Year SerialNum_1 SerialNum_2 SerialNum_3 SerialNum_4 SerialNum_5 Name_1 Name_2
#(chr) (dbl) (chr) (chr) (chr) (chr) (chr) (chr) (chr)
#1 1 2015 983 837 424 NA NA Michael Lewis
#Variables not shown: Name_3 (chr), Name_4 (chr), Name_5 (chr)
lapply(temp, function(x){
Stacked(x, var.stubs = c("SerialNum", "Name"), sep = "_")
}) -> temp2
# A part of temp2
#$`1`
#$`1`$SerialNum
# rowname Year .time_1 SerialNum
#1: 1 2015 1 983
#2: 1 2015 2 837
#3: 1 2015 3 424
#4: 1 2015 4 NA
#5: 1 2015 5 NA
#
#$`1`$Name
# rowname Year .time_1 Name
#1: 1 2015 1 Michael
#2: 1 2015 2 Lewis
#3: 1 2015 3 Paul
#4: 1 2015 4 NA
#5: 1 2015 5 NA
lapply(1:nrow(mydf), function(x){
merge(temp2[[x]]$SerialNum, temp2[[x]]$Name, by = c("rowname", "Year", ".time_1"))
}) %>%
lapply(na.omit) %>%
rbindlist %>%
setorder(rowname, Name) -> out
print(out)
# rowname Year .time_1 SerialNum Name
# 1: 1 2015 2 837 Lewis
# 2: 1 2015 1 983 Michael
# 3: 1 2015 3 424 Paul
# 4: 2 2014 1 123 Elaine
# 5: 2 2014 3 789 George
# 6: 2 2014 2 456 Jerry
# 7: 2 2014 4 136 Kramer
# 8: 3 2010 3 321 George
# 9: 3 2010 1 987 John
#10: 3 2010 2 654 Paul
#11: 3 2010 4 975 Ringo
#12: 4 2015 3 837 Lewis
#13: 4 2015 2 983 Michael
#14: 4 2015 1 424 Paul
#15: 5 2014 3 123 Elaine
#16: 5 2014 2 789 George
#17: 5 2014 1 456 Jerry
#18: 5 2014 4 136 Kramer
DATA
mydf <- structure(list(SerialNum = c("983\n837\n424\n ", "123\n456\n789\n136",
"987\n654\n321\n975\n ", "424\n983\n837", "456\n789\n123\n136"
), Year = c(2015, 2014, 2010, 2015, 2014), Name = c("Michael\nLewis\nPaul\n ",
"Elaine\nJerry\nGeorge\nKramer", "John\nPaul\nGeorge\nRingo\nNA",
"Paul\nMichael\nLewis", "Jerry\nGeorge\nElaine\nKramer")), .Names = c("SerialNum",
"Year", "Name"), row.names = c(NA, -5L), class = "data.frame")

getting the row name of which.max when doing summarise in ddply

I have the following data and want to get the dates when Close is at its max for each Year.
> str(ndvdf)
'data.frame': 1374 obs. of 2 variables:
$ Close: num 150 150 150 150 150 ...
$ Year : num 2009 2009 2009 2009 2009 ...
> head(ndvdf)
Close Year
2010-01-04 150.34 2009
2010-01-05 150.34 2009
2010-01-06 150.34 2009
I tried the following, but row indices are returned rather than dates, and the indices are relative to each yearly subset, so it's difficult to use rownames to get the dates either.
> ddply(ndvdf, .(Year), summarise, MaxDate=which.max(Close))
Year MaxDate
1 2009 60
2 2010 244
3 2011 245
How can I get the dates from my data?
Thanks.
Here is some reproducible sample data:
set.seed(19)
df <- data.frame(Close = sample(150, 10), Year = sample(2000:2003, 10, TRUE))
rownames(df) <- Sys.Date() + 1:10
I prefer to use the data.table package here. We can use as.data.table with keep.rownames = TRUE and use that to easily get the row names (dates) for when "Close" is at its max for each "Year".
library(data.table)
as.data.table(df, keep.rownames = TRUE)[, rn[which.max(Close)], keyby = Year]
# Year V1
# 1: 2000 2015-08-13
# 2: 2001 2015-08-17
# 3: 2002 2015-08-16
# 4: 2003 2015-08-18
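For comparison, the same lookup can be sketched in base R by splitting on Year, which keeps the row names within each subset. The fixed start date and the name max.dates are assumptions made here so the example is reproducible; they are not from the answer above.

```r
set.seed(19)
df <- data.frame(Close = sample(150, 10), Year = sample(2000:2003, 10, TRUE))
# Fixed dates instead of Sys.Date() so the row names do not depend on the run date
rownames(df) <- as.character(as.Date("2015-08-10") + 1:10)

# split() keeps the original row names, so which.max() on each yearly
# subset can index straight into that subset's row names.
max.dates <- sapply(split(df, df$Year),
                    function(d) rownames(d)[which.max(d$Close)])
max.dates
```

The result is a character vector of dates named by year.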

Creating a long table from a wide table using merged.stack (or reshape)

I have a data frame that looks like this:
ID rd_test_2011 rd_score_2011 mt_test_2011 mt_score_2011 rd_test_2012 rd_score_2012 mt_test_2012 mt_score_2012
1 A 80 XX 100 NA NA BB 45
2 XX 90 NA NA AA 80 XX 80
I want to write a script that would, for IDs that don't have NA's in the yy_test_20xx columns, create a new data frame with the subject taken from the column title, the test name, the test score and year taken from the column title. So, in this example ID 1 would have three entries. Expected output would look like this:
ID Subject Test Score Year
1 rd A 80 2011
1 mt XX 100 2011
1 mt BB 45 2012
2 rd XX 90 2011
2 rd AA 80 2012
2 mt XX 80 2012
I've tried both reshape and various forms of merged.stack, which work in the sense that I get output that is on the way to being right, but I don't understand the inputs well enough to get all the way there:
library(splitstackshape)
merged.stack(x, id.vars='id', var.stubs=c("rd_test","mt_test"), sep="_")
I've had more success (gotten closer) with reshape:
y<- reshape(x, idvar="id", ids=1:nrow(x), times=grep("test", names(x), value=TRUE),
timevar="year", varying=list(grep("test", names(x), value=TRUE), grep("score",
names(x), value=TRUE)), direction="long", v.names=c("test", "score"),
new.row.names=NULL)
This will get your data into the right format:
df.long = reshape(df, idvar="ID", ids=1:nrow(df), times=grep("Test", names(df), value=TRUE),
timevar="Year", varying=list(grep("Test", names(df), value=TRUE),
grep("Score", names(df), value=TRUE)), direction="long", v.names=c("Test", "Score"),
new.row.names=NULL)
Then omitting NA:
df.long = df.long[!is.na(df.long$Test),]
Then splitting Year to remove Test_:
df.long$Year = sapply(strsplit(df.long$Year, "_"), `[`, 2)
And ordering by ID:
df.long[order(df.long$ID),]
ID Year Test Score
1 1 2011 A 80
5 1 2012 XX 100
2 2 2011 XX 90
9 2 2013 AA 80
6 3 2012 A 10
3 4 2011 A 50
7 4 2012 XX 60
10 4 2013 AA 99
4 5 2011 C 50
8 5 2012 A 75
Using reshape:
dat.long <- reshape(dat, direction="long", varying=list(c(2, 4,6), c(3, 5,7)),
times=2011:2013,timevar='Year',
sep="_", v.names=c("Test", "Score"))
dat.long[complete.cases(dat.long),]
ID Year Test Score id
1.2011 1 2011 A 80 1
2.2011 2 2011 XX 90 2
4.2011 4 2011 A 50 4
5.2011 5 2011 C 50 5
1.2012 1 2012 XX 100 1
3.2012 3 2012 A 10 3
4.2012 4 2012 XX 60 4
5.2012 5 2012 A 75 5
2.2013 2 2013 AA 80 2
4.2013 4 2013 AA 99 4
Considering your update, I've entirely rewritten this answer. View the history if you want to see the old version.
The main problem is that your data is "double wide" in a way. Thus, you can actually solve your problem by reshaping in the "long" direction twice. Alternatively, use melt and *cast to melt your data into a very long format and then convert it to a semi-wide format.
However, I would still suggest "splitstackshape" (and not just because I wrote it). It can handle this problem fine, but it needs you to rearrange the names of your data. The part of the name that will result in the names of the new columns should come first. In your example, that means "test" and "score" should be the first part of the variable name.
For this, we can use some gsub to rearrange the existing names.
library(data.table)  # provides setnames(), in case splitstackshape does not attach it
library(splitstackshape)
setnames(mydf, gsub("(rd|mt)_(score|test)_(.*)", "\\2_\\1_\\3", names(mydf)))
names(mydf)
# [1] "ID" "test_rd_2011" "score_rd_2011" "test_mt_2011"
# [5] "score_mt_2011" "test_rd_2012" "score_rd_2012" "test_mt_2012"
# [9] "score_mt_2012"
out <- merged.stack(mydf, "ID", var.stubs=c("test", "score"), sep="_")
setnames(out, c(".time_1", ".time_2"), c("Subject", "Year"))
out[complete.cases(out), ]
# ID Subject Year test score
# 1: 1 mt 2011 XX 100
# 2: 1 mt 2012 BB 45
# 3: 1 rd 2011 A 80
# 4: 2 mt 2012 XX 80
# 5: 2 rd 2011 XX 90
# 6: 2 rd 2012 AA 80
For the benefit of others, "mydf" in this answer is defined as:
mydf <- structure(list(ID = 1:2, rd_test_2011 = c("A", "XX"),
rd_score_2011 = c(80L, 90L), mt_test_2011 = c("XX", NA),
mt_score_2011 = c(100L, NA), rd_test_2012 = c(NA, "AA"),
rd_score_2012 = c(NA, 80L), mt_test_2012 = c("BB", "XX"),
mt_score_2012 = c(45L, 80L)),
.Names = c("ID", "rd_test_2011", "rd_score_2011", "mt_test_2011",
"mt_score_2011", "rd_test_2012", "rd_score_2012", "mt_test_2012",
"mt_score_2012"), class = "data.frame", row.names = c(NA, -2L))
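The melt and *cast alternative mentioned earlier in this answer can be sketched with data.table as follows. This is one possible reading of that suggestion, not code from the original answer; the intermediate names dt, cols and long are illustrative, and the measure columns are converted to character first so that melt does not have to coerce mixed types.

```r
library(data.table)

mydf <- data.frame(ID = 1:2,
  rd_test_2011 = c("A", "XX"),  rd_score_2011 = c(80L, 90L),
  mt_test_2011 = c("XX", NA),   mt_score_2011 = c(100L, NA),
  rd_test_2012 = c(NA, "AA"),   rd_score_2012 = c(NA, 80L),
  mt_test_2012 = c("BB", "XX"), mt_score_2012 = c(45L, 80L),
  stringsAsFactors = FALSE)

dt <- as.data.table(mydf)
cols <- setdiff(names(dt), "ID")
dt[, (cols) := lapply(.SD, as.character), .SDcols = cols]

# Step 1: melt to a very long format, one row per ID and original column
long <- melt(dt, id.vars = "ID", variable.factor = FALSE)
# Step 2: split e.g. "rd_test_2011" into subject, type and year
long[, c("Subject", "Type", "Year") := tstrsplit(variable, "_", fixed = TRUE)]
# Step 3: cast "test" and "score" back out into their own columns
out <- dcast(long, ID + Subject + Year ~ Type, value.var = "value")

out <- out[!is.na(test)]              # drop subject/year combinations with no test
out[, score := as.integer(score)]     # scores were melted as character
setcolorder(out, c("ID", "Subject", "Year", "test", "score"))
out
```

Note that Year stays a character column here; convert it with as.integer() if numeric years are needed.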

Resources