R data.table reshape chunks of columns at once

Let's say I have a data.table with these columns:
nodeID
hour1aaa
hour1bbb
hour1ccc
hour2aaa
hour2bbb
hour2ccc
...
hour24aaa
hour24bbb
hour24ccc
for a total of 72 hour columns (plus nodeID). Let's call it rawtable.
I want to reshape it so I have
nodeID
hour
aaa
bbb
ccc
for a total of just these 5 columns
where the hour column records which hour (1 through 24) each row came from.
Let's call it newshape
The way I'm doing it now is to use rbindlist with 24 items where each item is the proper subset of the bigger data.table. Like this (except I'm leaving out most of the hours in my example)
newshape <- rbindlist(list(
  rawtable[, list(nodeID, Hour=1, aaa=hour1aaa, bbb=hour1bbb, ccc=hour1ccc)],
  rawtable[, list(nodeID, Hour=2, aaa=hour2aaa, bbb=hour2bbb, ccc=hour2ccc)],
  rawtable[, list(nodeID, Hour=24, aaa=hour24aaa, bbb=hour24bbb, ccc=hour24ccc)]))
Here is some sample data to play with:
rawtable <- data.table(
  nodeID=c(1,2),
  hour1aaa=c(12.4,32), hour1bbb=c(61.1,65.33), hour1ccc=c(-4.2,54),
  hour2aaa=c(12.2,1.2), hour2bbb=c(12.2,5.7), hour2ccc=c(5.6,101.9),
  hour24aaa=c(45.2,8.5), hour24bbb=c(23,7.9), hour24ccc=c(98,32.3))
Using my rbindlist approach gives the desired result but, as with most things I do in R, there is probably a better way. By better I mean more memory-efficient, faster, and/or fewer lines of code. Does anyone have a better way to achieve this?

This is a classic reshape problem if you get your names in the standard convention it expects, though I'm not sure this really harnesses the efficiency of the data.table structure:
reshape(
  setNames(rawtable, gsub("(\\D+)(\\d+)(\\D+)", "\\3.\\2", names(rawtable))),
  idvar="nodeID", direction="long", varying=-1
)
Result:
nodeID hour aaa bbb ccc
1: 1 1 12.4 61.10 -4.2
2: 2 1 32.0 65.33 54.0
3: 1 2 12.2 12.20 5.6
4: 2 2 1.2 5.70 101.9
5: 1 24 45.2 23.00 98.0
6: 2 24 8.5 7.90 32.3
@Arun's answer here (https://stackoverflow.com/a/15510828/496803) may also be useful if you can adapt it to your data.

One option is to use merged.stack from my package "splitstackshape". This function stacks groups of columns and then merges the output together. Because of how the function creates the "time" variable, you can specify whatever you want to strip out of the column names. In this case, we want to strip out "hour", "aaa", "bbb", and "ccc" and keep just the numbers.
library(splitstackshape)
## Make sure you're using at least 1.2.0
packageVersion("splitstackshape")
# [1] ‘1.2.0’
merged.stack(rawtable, id.vars="nodeID",
             var.stubs=c("aaa", "bbb", "ccc"),
             sep="hour|aaa|bbb|ccc")
# nodeID .time_1 aaa bbb ccc
# 1: 1 1 12.4 61.10 -4.2
# 2: 1 2 12.2 12.20 5.6
# 3: 1 24 45.2 23.00 98.0
# 4: 2 1 32.0 65.33 54.0
# 5: 2 2 1.2 5.70 101.9
# 6: 2 24 8.5 7.90 32.3
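For reference, newer versions of data.table (1.14.2 or later, if memory serves; treat the exact version as an assumption) can also do this directly with melt() and the measure() helper, which splits the column names with a regular expression. A minimal sketch:
library(data.table)
# a sketch assuming a data.table version where measure() is available:
# the first capture group becomes an integer hour column, and the second
# group routes each input column into the aaa/bbb/ccc output columns
newshape <- melt(rawtable, id.vars = "nodeID",
                 measure.vars = measure(hour = as.integer, value.name,
                                        pattern = "^hour([0-9]+)(aaa|bbb|ccc)$"))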

Related

Pull subset of rows of dataframe based on conditions from other columns

I have a dataframe like the one below:
x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
Type=c("put","call","put","call","call","put","call","put","call","put","call"),
Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
Other=sample(20,11))
Tickers Type Strike Other
1: A put 35.0 6
2: A call 37.5 5
3: A put 37.5 13
4: B call 10.0 15
5: B call 11.0 12
6: B put 11.0 4
7: B call 12.0 20
8: D put 40.0 7
9: D call 40.0 11
10: D put 42.0 10
11: D call 42.0 1
I am trying to analyze a subset of the data. The subset I would like to take is rows where the ticker and strike are the same. But I also only want to grab these rows if both a put and a call exist under Type. With the data above, for example, I would like to return the following result:
x[c(2,3,5,6,8:11),]
Tickers Type Strike Other
1: A call 37.5 5
2: A put 37.5 13
3: B call 11.0 12
4: B put 11.0 4
5: D put 40.0 7
6: D call 40.0 11
7: D put 42.0 10
8: D call 42.0 1
I'm not sure of the best way to go about doing this. My thought is to create another id column, like
x$id <- paste(x$Tickers,x$Strike,sep="_")
Then use this column to pull only the rows whose id appears more than once.
x[x$id %in% x$id[duplicated(x$id)],]
Tickers Type Strike Other id
1: A call 37.5 5 A_37.5
2: A put 37.5 13 A_37.5
3: B call 11.0 12 B_11
4: B put 11.0 4 B_11
5: D put 40.0 7 D_40
6: D call 40.0 11 D_40
7: D put 42.0 10 D_42
8: D call 42.0 1 D_42
I'm not sure how efficient this is, as my actual data consists of a lot more rows.
Also, this solution does not check for the type condition of there being one put and one call.
Also, the wording of the title could be a lot better; I apologize.
EDIT: Having checked out the post Finding ALL duplicate rows, including "elements with smaller subscripts", I could also use this solution:
x$id <- paste(x$Tickers,x$Strike,sep="_")
x[duplicated(x$id) | duplicated(x$id, fromLast=TRUE),]
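Since x is already a data.table, a similar idea works without building a pasted id, using the by argument of data.table's duplicated method (a sketch; like the id approach, it does not check the put/call condition):
# flag rows whose (Tickers, Strike) combination occurs more than once
dup <- duplicated(x, by = c("Tickers", "Strike")) |
       duplicated(x, by = c("Tickers", "Strike"), fromLast = TRUE)
x[dup]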
You could try something like:
x[, select := (.N >= 2 & all(c("put", "call") %in% unique(Type))),
  by = .(Tickers, Strike)][which(select)]
# Tickers Type Strike Other select
#1: A call 37.5 17 TRUE
#2: A put 37.5 16 TRUE
#3: B call 11.0 11 TRUE
#4: B put 11.0 20 TRUE
#5: D put 40.0 1 TRUE
#6: D call 40.0 12 TRUE
#7: D put 42.0 6 TRUE
#8: D call 42.0 2 TRUE
Another idea might be a merge:
x[x, on = .(Tickers, Strike),
  select := (length(Type) >= 2 & all(c("put", "call") %in% Type)),
  by = .EACHI][which(select)]
I'm not entirely sure how to get around the group-by operations since you want to make sure for each group they have both "call" and "put". I was thinking about using keys, but haven't been able to incorporate the "call"/"put" aspect.
An edit to your data to give a case where both put and call do not exist (I changed the very last "call" to "put"):
x <- data.table(Tickers=c("A","A","A","B","B","B","B","D","D","D","D"),
Type=c("put","call","put","call","call","put","call","put","call","put","put"),
Strike=c(35,37.5,37.5,10,11,11,12,40,40,42,42),
Other=sample(20,11))
Since you are using data.table, you can use the built-in counter .N along with by variables to count group sizes and subset on that. If counting distinct Type values reliably tells you that both put and call are present, this could work:
x[, `:=`(n = .N, types = uniqueN(Type)), by = c('Tickers', 'Strike')][n > 1 & types == 2]
The part enclosed in the first set of [] does the counting, and then the [n > 1 & types == 2] does the subsetting.
I am not a user of package data.table so this code is base R only.
agg <- aggregate(Type ~ Tickers + Strike, data = x, length)
result <- merge(x, subset(agg, Type > 1)[1:2], by = c("Tickers", "Strike"))[, c(1, 3, 2, 4)]
result
# Tickers Type Strike Other
#1: A call 37.5 17
#2: A put 37.5 7
#3: B call 11.0 14
#4: B put 11.0 20
#5: D put 40.0 15
#6: D call 40.0 2
#7: D put 42.0 8
#8: D call 42.0 1
rm(agg) # final clean up
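For comparison, a dplyr spelling of the same filter is compact (a sketch, not from the original answers; it assumes the column names from the sample data):
library(dplyr)
# keep only the Tickers/Strike groups that contain both a put and a call
x %>%
  group_by(Tickers, Strike) %>%
  filter(all(c("put", "call") %in% Type)) %>%
  ungroup()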

Filling in a ton of NA data in R by indices?

I have Price data indexed according to three things:
State, Date, and UPC (that is the Product Code).
I have a bunch of prices that are NA.
I am trying to fill in the NAs in the following way: for a given missing Price with index (S, D, UPC), fill it in with the average Price of all the data points with the same S and UPC, i.e., take the average over Date.
There must be an easy way to do this. I have been using for loops, but I now realize that is inefficient, and I would like to use a function, such as one from plyr or dplyr, that does it all in as few steps as possible.
upc=c(1153801013,1153801013,1153801013,1153801013,1153801013,1153801013,2105900750,2105900750,2105900750,2105900750,2105900750,2173300001,2173300001,2173300001,2173300001)
date=c(200601,200602,200603,200604,200601,200602,200601,200602,200603,200601,200602,200603,200604,200605,200606)
price=c(26,28,NA,NA,23,24,85,84,NA,81,78,24,19,98,NA)
state=c(1,1,1,1,2,2,1,1,2,2,2,1,1,1,1)
# This is what I have:
data <- data.frame(upc,date,state,price)
# This is what I want:
price=c(26,28,27,27,23,24,85,84,79.5,81,78,24,19,98,47)
data2 <- data.frame(upc,date,state,price)
Any advice? Thanks.
Use ave with multiple grouping variables, and then replace NA values with the means:
with(data,
     ave(price, list(upc, state),
         FUN = function(x) replace(x, is.na(x), mean(x, na.rm=TRUE))))
# [1] 26.0 28.0 27.0 27.0 23.0 24.0 85.0 84.0 79.5 81.0 78.0 24.0 19.0 98.0 47.0
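To keep the result, assign it back into the data frame (a small usage note added here, not part of the original answer):
data$price <- with(data,
                   ave(price, list(upc, state),
                       FUN = function(x) replace(x, is.na(x), mean(x, na.rm=TRUE))))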
You can construct a matrix of means by upc and state (note: this answer calls the sample data dat):
meanmtx <- tapply(dat$price, dat[c('upc','state')], mean, na.rm=TRUE)
That matrix has character indices that can be matched to values in upc and state, so use two-column character indexing to fill the empty "slots":
dat$price[is.na(dat$price)] <-
meanmtx[ cbind( as.character(dat[ is.na(dat$price), 'upc']),
as.character(dat[ is.na(dat$price),'state']) ) ]
> dat
upc date state price
1 1153801013 200601 1 26.0
2 1153801013 200602 1 28.0
3 1153801013 200603 1 27.0
4 1153801013 200604 1 27.0
5 1153801013 200601 2 23.0
6 1153801013 200602 2 24.0
7 2105900750 200601 1 85.0
8 2105900750 200602 1 84.0
9 2105900750 200603 2 79.5
10 2105900750 200601 2 81.0
11 2105900750 200602 2 78.0
12 2173300001 200603 1 24.0
13 2173300001 200604 1 19.0
14 2173300001 200605 1 98.0
15 2173300001 200606 1 47.0
Here is another compact option using na.aggregate (from zoo) with data.table. By default, na.aggregate replaces the NA values with the mean of the column of interest. It also has a FUN argument in case we want to replace the NAs with the median, min, max, or whatever we wish. The group-by operation can be done with dplyr/data.table/base R methods. With data.table, we convert the 'data.frame' to a 'data.table' (setDT(data)) and, grouped by 'upc' and 'state', assign (:=) 'price' to the na.aggregate of 'price'.
library(data.table)
library(zoo)
setDT(data)[, price := na.aggregate(price), .(upc, state)]
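Since the question mentions dplyr, here is a hedged dplyr equivalent (a sketch starting again from the original data frame, using the sample data's column names):
library(dplyr)
# replace each NA price with the mean price of its upc/state group
data %>%
  group_by(upc, state) %>%
  mutate(price = ifelse(is.na(price), mean(price, na.rm = TRUE), price)) %>%
  ungroup()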

R Programming Calculate Rows Average

How do you use R to calculate row means?
Sample data:
f<- data.frame(
name=c("apple","orange","banana"),
day1sales=c(2,5,4),
day1sales=c(2,8,6),
day1sales=c(2,15,24),
day1sales=c(22,51,13),
day1sales=c(5,8,7)
)
Expected result: an AverageSales column holding the mean of the sales columns for each row. The table will keep growing: right now the columns only run up to day1sales.4, but after more data come in it will gain day1sales.6 and so on. So how can I compute the average across all the sales columns for each row?
With rowMeans:
> rowMeans(f[-1])
## [1] 6.6 17.4 10.8
You can also add a column of means to the data set:
> f$AvgSales <- rowMeans(f[-1])
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AvgSales
## 1 apple 2 2 2 22 5 6.6
## 2 orange 5 8 15 51 8 17.4
## 3 banana 4 6 24 13 7 10.8
rowMeans is the simplest way. Alternatively, the function apply will apply a function along the rows or columns of a data frame. In this case you want to apply the mean function across the rows:
f$AverageSales <- apply(f[, 2:length(f)], 1, mean)
(changed 6 to length(f) since you say you may add more columns).
This will add an AverageSales column to the data frame f with the values that you want:
> f
## name day1sales day1sales.1 day1sales.2 day1sales.3 day1sales.4 AverageSales
##1 apple 2 2 2 22 5 6.6
##2 orange 5 8 15 51 8 17.4
##3 banana 4 6 24 13 7 10.8
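If columns keep being added, it may also be safer to select the sales columns by name pattern rather than by position (a sketch assuming every sales column name starts with "day"):
# pick the sales columns by name so newly added dayNsales columns are
# included automatically and non-numeric columns are never swept in
sales_cols <- grep("^day", names(f), value = TRUE)
f$AvgSales <- rowMeans(f[sales_cols], na.rm = TRUE)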

element-by-element multiplication within a data table - Multiple variables at the same time (R)

Let's say I have the following data table in R:
L3 <- LETTERS[1:3]
(d <- data.table(cbind(x = 1, y = 1:10), fac = sample(L3, 10, replace = TRUE)))
vecfx=c(5.3,2.8)
and I would like to compute two new variables, dot1 and dot2, that are:
d[,dot1:=5.3*x]
d[,dot2:=2.8*y]
But I don't want to compute them this way, as this is a simplified version of my problem. In my original problem, vecfx consists of 12 elements and my data table has twelve columns, so I want to avoid writing that out twelve times.
I tried this: vecfx*d[,list(x,y)] but I'm not getting the desired result (it seems like the product is done by rows instead of by columns). Also, I want to create those two new variables within my data table d.
This is also useful when one wants to create several columns at the same time within a data table in R.
Any help is appreciated.
Update: In v1.8.11, FR #2077 is now implemented: set() can now add columns by reference. From NEWS:
set() is able to add new columns by reference now. For example, set(DT, i=3:5, j="bla", 5L) is equivalent to DT[3:5, bla := 5L]. This was FR #2077. Tests added.
with which one would then be able to do (as @MatthewDowle suggests in the comments):
for (j in seq_along(vecfx))
  set(d, i=NULL, j=paste0("dot", j), vecfx[j]*d[[j]])
I think you're looking for ?set. Note that set() also adds by reference and is very fast! Pasting the relevant section from ?set:
Since [.data.table incurs overhead to check the existence and type of arguments (for example), set() provides direct (but less flexible) assignment by reference with low overhead, appropriate for use inside a for loop. See examples. := is more flexible than set() because := is intended to be combined with i and by in single queries on large datasets.
for (j in seq_along(vecfx))
  set(d, i=NULL, j=j, vecfx[j]*d[[j]])
x y fac
1: 5.3 2.8 B
2: 5.3 5.6 C
3: 5.3 8.4 C
4: 5.3 11.2 C
5: 5.3 14.0 B
6: 5.3 16.8 B
7: 5.3 19.6 C
8: 5.3 22.4 C
9: 5.3 25.2 C
10: 5.3 28.0 C
It's just a matter of providing the right indices to set.
Arun's answer is good.
The LHS and RHS of := accept multiple items, so another way is:
d[, paste0("dot",1:2) := mapply("*", vecfx, list(x,y), SIMPLIFY=FALSE)]
d
x y fac dot1 dot2
1: 1 1 C 5.3 2.8
2: 1 2 B 5.3 5.6
3: 1 3 C 5.3 8.4
4: 1 4 C 5.3 11.2
5: 1 5 B 5.3 14.0
6: 1 6 A 5.3 16.8
7: 1 7 A 5.3 19.6
8: 1 8 B 5.3 22.4
9: 1 9 A 5.3 25.2
10: 1 10 A 5.3 28.0
Maybe there's a better way than that. I think Arun's for loop should be faster, though, and maybe easier to read.
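Another compact spelling of the same multi-column assignment uses Map() over .SD (a sketch assuming a fresh d as built in the question; it behaves like the mapply version above):
# multiply each column in .SD by the matching element of vecfx and
# assign the results as dot1/dot2 in a single := call
d[, paste0("dot", seq_along(vecfx)) := Map("*", vecfx, .SD),
  .SDcols = c("x", "y")]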

R merge with itself

Can I merge data like
name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70
according to the second column, and take the first column as column names?
name at_rank to_center predicted
#797 "Stachy, Poland" 1 4.70 4.70
Upon request, the whole set of data: http://sprunge.us/cYSJ
The first step, reading the data in, should not be a problem if your strings with commas are quoted (which they seem to be). Using read.csv with the header=FALSE argument does the trick with the data you shared. (Of course, if the data file had headers, delete that argument.)
From there, you have several options. Here are two.
reshape (base R) works fine for this:
myDF <- read.csv("http://sprunge.us/cYSJ", header=FALSE)
myDF2 <- reshape(myDF, direction="wide", idvar="V2", timevar="V1")
head(myDF2)
# V2 V3.name V3.at_rank V3.to_center V3.predicted
# 1 #1 Kitoman 1 2.41 2.41
# 5 #2 Hosaena 2 4.23 9.25
# 9 #3 Vinzelles, Puy-de-Dôme 1 5.20 5.20
# 13 #4 Whitelee Wind Farm 6 3.29 8.07
# 17 #5 Steveville, Alberta 1 9.59 9.59
# 21 #6 Rocher, Ardèche 1 0.13 0.13
The reshape2 package is also useful in these cases. It has simpler syntax and the output is also a little "cleaner" (at least in terms of variable names).
library(reshape2)
myDFw_2 <- dcast(myDF, V2 ~ V1)
# Using V3 as value column: use value.var to override.
head(myDFw_2)
# V2 at_rank name predicted to_center
# 1 #1 1 Kitoman 2.41 2.41
# 2 #10 4 Icaraí de Minas 6.07 8.19
# 3 #100 2 Scranton High School (Pennsylvania) 5.78 7.63
# 4 #1000 1 Bat & Ball Inn, Clanfield 2.17 2.17
# 5 #10000 3 Tăuteu 1.87 5.87
# 6 #10001 1 Oak Grove, Northumberland County, Virginia 5.84 5.84
Look at the reshape package from Hadley. If I understand correctly, you are just pivoting your data from long to wide.
I think in this case all you really need to do is transpose, cast to data.frame, set the colnames to the first row and then remove the first row. It might be possible to skip the last step through some combination of arguments to data.frame but I don't know what they are right now.
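For the four example rows (a single id), that transpose idea might look like this sketch (the read.csv call just recreates the records shown in the question):
# V1 = field name, V2 = id, V3 = value
myDF <- read.csv(text = 'name,#797,"Stachy, Poland"
at_rank,#797,1
to_center,#797,4.70
predicted,#797,4.70', header = FALSE, stringsAsFactors = FALSE)
# turn the value column into one wide row, labelled by the field names
wide <- as.data.frame(t(myDF$V3), stringsAsFactors = FALSE)
names(wide) <- myDF$V1
wide
#             name at_rank to_center predicted
# 1 Stachy, Poland       1      4.70      4.70
Note that this only handles one id at a time; with many ids, the reshape/dcast answers above scale better.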
