cut scoping link between key and data.frame for data.table

Using data.table as a lookup table is very fast. But one behavior doesn't fit my current workflow, and I'm sure there's a better way that I'm missing. The issue is modification in place: if the lookup table was built from columns of a parent data.frame, data.table will act on the parent data.frame in ways that may not always be desirable.
Here's an example since I lack the language to express it properly:
library(data.table)
set.seed(123)
N <- 100
key <- data.frame(x = sample.int(N, N), y = 1:N, z = 1:N)
key$w <- key$x
head(key)
## x y z w
## 1 29 1 1 29
## 2 79 2 2 79
## 3 41 3 3 41
## 4 86 4 4 86
## 5 91 5 5 91
## 6 5 6 6 5
set.seed(1)
terms <- data.frame(z = sample.int(2 * N, 1e2, replace = TRUE))
subkey <- key[c("x", "y")]
setDT(subkey)
setDT(terms)
setkey(subkey, x)
subkey[terms][[2]]
head(key)
## x y z w
## 1 1 74 1 1
## 2 2 35 2 2
## 3 3 51 3 3
## 4 4 18 4 4
## 5 5 6 5 5
## 6 6 54 6 6
Notice that the order of key is affected by the use of data.table, even though key itself wasn't used in the lookup?
I know data.table is avoiding making copies but is there a way to cut this link to key and force data.table to act on subkey without modifying key?

Rather than
subkey <- key[c("x", "y")]
setDT(subkey)
just do
subkey <- as.data.table(key[c("x", "y")])
That will force a copy and sever the connection.
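If you prefer the two-step setDT() style, an explicit deep copy with data.table's copy() before converting should sever the link the same way (a minimal sketch of the same idea, here applying copy() to the data.frame subset):
subkey <- copy(key[c("x", "y")])  # deep copy: columns no longer share memory with key
setDT(subkey)
setkey(subkey, x)                 # reorders subkey only
head(key)                         # key keeps its original order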


R: bind columns after lapply() the poly() function

I want to add columns containing polynomials to a dataframe (DF).
Background: I need to use polynomials in a glmnet setting. I cannot call poly() directly in the glmnet() estimation command. I get an error, likely because my “Xtrain” data contain factors.
My workaround is to slice my Xtrain DF in two pieces, one containing all factors (for which no transformation is needed) and one containing the rest, viz. the numeric columns.
Now I want to add columns with polynomials to my numeric DF.
Here is a minimal example of my problem.
# Some data
x <- 1:10
y <- 11:20
df = as.data.frame(cbind(x,y))
# Looks like this
x y
1 1 11
2 2 12
3 3 13
# Now I generate polys
lapply(df, function(i) poly(i, 2, raw=T)[,1:2])
However, I cannot figure out how to "cbind" the results. What I want in the end is a DF containing x, x^2, y, and y^2. Order does not matter. However, ideally I would also have column labels (to identify the polys). For instance like this:
x x2 y y2
1 1 1 11 121
2 2 4 12 144
3 3 9 13 169
Thank you...
Cheers!
Another option is
as.data.frame(lapply(df, function(i) poly(i, 2, raw=T)[,1:2]))
# x.1 x.2 y.1 y.2
#1 1 1 11 121
#2 2 4 12 144
#3 3 9 13 169
# ...
As mentioned by @gpier and @akrun already, you might use ^ instead of poly
n <- 2
df[paste(names(df), n, sep = "_")] <- df^n
df
We can use do.call
do.call(cbind, lapply(df, function(i) poly(i, 2, raw=T)[,1:2]))
If we just need squares
cbind(df, as.matrix(df)^2)
poly is not the right function if you just need squares. Try
cbind(df,lapply(df, function(x) x^2))
x y x y
1 1 11 1 121
2 2 12 4 144
3 3 13 9 169
4 4 14 16 196
5 5 15 25 225
6 6 16 36 256
7 7 17 49 289
8 8 18 64 324
9 9 19 81 361
10 10 20 100 400
EDIT: indeed you don't even need lapply, you could just use cbind(df, df^2)
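If you also want the exact column labels from the question (x, x2, y, y2), here is a small sketch combining the suggestions above (assuming all columns of df are numeric):
squares <- df^2
names(squares) <- paste0(names(df), 2)  # x -> x2, y -> y2
out <- cbind(df, squares)
out <- out[, as.vector(rbind(names(df), names(squares)))]  # interleave: x, x2, y, y2
head(out, 3)
#   x x2  y  y2
# 1 1  1 11 121
# 2 2  4 12 144
# 3 3  9 13 169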

randomly select rows based on limited random numbers

Seems simple but I can't figure it out.
I have a bunch of animal location data (217 individuals) as a single dataframe. I'm trying to randomly select X locations per individual for further analysis with the caveat that X is within the range of 6-156.
So I'm trying to set up a loop that first randomly selects a value within the range of 6-156 then use that value (say 56) to randomly extract 56 locations from the first individual animal and so on.
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
sub2<-i([sample(nrow(i),sub),])
}
This approach didn't seem to work so I tried tweaking it...
for(i in unique(ANIMALS$ID)){
sub<-sample(6:156,1)
rand<-i[sample(1:nrow(i),sub,replace=FALSE),]
}
This did not work either. Any suggestions or previous postings would be helpful!
Head of the datafile...ANIMALS is the name of the df, ID indicates unique individuals
> FID X Y MONTH DAY YEAR HOUR MINUTE SECOND ELKYR SOURCE ID animalid
1 0 510313 4813290 9 5 2008 22 30 0 342008 FG 1 1
2 1 510382 4813296 9 6 2008 1 30 0 342008 FG 1 1
3 2 510385 4813311 9 6 2008 2 0 0 342008 FG 1 1
4 3 510385 4813394 9 6 2008 3 30 0 342008 FG 1 1
5 4 510386 4813292 9 6 2008 2 30 0 342008 FG 1 1
6 5 510386 4813431 9 6 2008 4 1 0 342008 FG 1 1
Here's one way using mapply. This function takes two lists (or something that can be coerced into a list) and applies function FUN to corresponding elements.
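As a tiny standalone illustration of that pairwise behavior (a sketch, not from the original answer):
mapply(function(a, b) a + b, 1:3, c(10, 20, 30))
# [1] 11 22 33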
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
# calculate number of samples for individual animal
num.samples.per.animal <- sample(3:6, length(unique(xy$animal)), replace = TRUE)
num.samples.per.animal
[1] 6 3 4 4 6 3 3 6 3 5
# subset random x number of rows from each animal
result <- do.call("rbind",
mapply(num.samples.per.animal, split(xy, f = xy$animal), FUN = function(x, y) {
y[sample(1:nrow(y), x),]
}, SIMPLIFY = FALSE)
)
result
animal loc
7 1 0.99483999
1 1 0.50951321
10 1 0.36505294
6 1 0.34058842
8 1 0.26489107
9 1 0.47418823
13 2 0.27213396
12 2 0.28087775
15 2 0.22130069
23 3 0.33646632
21 3 0.02395097
28 3 0.53079981
29 3 0.85287600
35 4 0.84534073
33 4 0.87370167
31 4 0.85646813
34 4 0.11642335
46 5 0.59624723
48 5 0.15379729
45 5 0.57046122
42 5 0.88799675
44 5 0.62171858
49 5 0.75014593
60 6 0.86915983
54 6 0.03152932
56 6 0.66128549
64 7 0.85420774
70 7 0.89262455
68 7 0.40829671
78 8 0.19073661
72 8 0.20648832
80 8 0.71778913
73 8 0.77883677
75 8 0.37647108
74 8 0.65339300
82 9 0.39957202
85 9 0.31188471
88 9 0.10900795
100 10 0.55282999
95 10 0.10145296
96 10 0.09713218
93 10 0.64900866
94 10 0.76099256
EDIT
Here is another (more straightforward) approach that also handles cases where the number of rows is less than the number of samples to be drawn.
set.seed(357)
result <- do.call("rbind",
by(xy, INDICES = xy$animal, FUN = function(x) {
avail.obs <- nrow(x)
num.rows <- sample(3:15, 1)
while (num.rows > avail.obs) {
message("Sample to be larger than available data points, repeating sampling.")
num.rows <- sample(3:15, 1)
}
x[sample(1:avail.obs, num.rows), ]
}))
result
I like Stack Overflow because I learn so much. @RomanLustrik provided a simple solution; mine is straightforward as well:
# simulate some data
xy <- data.frame(animal = rep(1:10, each = 10), loc = runif(100))
newVec <- NULL # Start empty; sampled rows are accumulated here
for(i in unique(xy$animal)){
# Sample a number between 1 and 10 (or 6 and 156, if you need)
samp <- sample(1:10, 1)
# Determine which rows of data frame xy belong to animal i
rows <- which(xy$animal == i)
# From xy, draw samp of the rows associated with animal i
# (replace = TRUE allows the same row to be picked more than once)
newVec1 <- xy[sample(rows, samp, replace = TRUE), ]
# Append everything to the same new data frame
newVec <- rbind(newVec, newVec1)
}
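The same idea applied to the original data in one pass (a sketch assuming the ANIMALS data frame and ID column from the question, capping each draw at the rows actually available):
set.seed(42)
result <- do.call(rbind, lapply(split(ANIMALS, ANIMALS$ID), function(d) {
n <- sample(6:156, 1) # random target sample size in the requested range
d[sample(nrow(d), min(n, nrow(d))), ] # never request more rows than exist
}))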

R ffdf sorted data

I want to sort the data
z=as.ffdf(data.frame(w=c(4,1,2,5,7,8,65,3,2,9), x=c(12,1,3,5,65,3,2,45,34,11),y=1:10))
I need the data sorted on columns w and x. This would be a simple task if it were a plain data frame.
Thanks.
Use ffdforder from package ffbase; it returns an ff vector, which you can use to index your ffdf without RAM issues.
require(ffbase)
z=as.ffdf(data.frame(w=c(4,1,2,5,7,8,65,3,2,9), x=c(12,1,3,5,65,3,2,45,34,11),y=1:10))
idx <- ffdforder(z[c("w","x")])
zordered <- z[idx, ]
zordered
You can try something like this
require(ffbase)
z <- as.ffdf(data.frame(w=c(4,1,2,5,7,8,65,3,2,9),
x=c(12,1,3,5,65,3,2,45,34,11),y=1:10))
z[order(z$w[], z$x[]), ]
## w x y
## 2 1 1 2
## 3 2 3 3
## 9 2 34 9
## 8 3 45 8
## 1 4 12 1
## 4 5 5 4
## 5 7 65 5
## 6 8 3 6
## 10 9 11 10
## 7 65 2 7
You can use fforder to order your ffdf without using your RAM. Credit to @jwijffels
z[fforder(z$w, z$x), ]

Remove rows based on factor-levels

I have a data.frame df in format "long".
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows whose df$time value is not present for every level of df$site?
In this case I want to remove df[3, ], because the timestamp 33 is only present for site A and not for sites B and C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[time %in% unTime[-notInAll]]
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way:
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
In case no time is present in all sites, where the final result should be empty (as Ben pointed out in the comments), the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, time != rle(df$time)$value[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution above. Rather than look for a run-length of "1", it deletes rows that do not match certain conditions based on the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(14,14,15,15,16,16,16),
value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
>= length(levels(df$site)))))
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in keeping timestamps present in at least two sites, we could change the line >= length(levels(df$site)) (in this example, 3) to >= length(levels(df$site)) - 1 (here, 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.
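For completeness, the same table() idea compresses into a couple of base R lines (a sketch assuming each site records a given timestamp at most once):
keep <- names(which(table(df$time) == length(unique(df$site))))
df.trimmed <- df[df$time %in% keep, ]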

R self reference

In R I find myself doing something like this a lot:
adataframe[adataframe$col==something] <- adataframe[adataframe$col==something] + 1
This way is kind of long and tedious. Is there some way for me
to reference the object I am trying to change such as
adataframe[adataframe$col==something]<-$self+1
?
Try package data.table and its := operator. It's very fast and very short.
DT[col1==something, col2:=col3+1]
The first part, col1==something, is the subset. You can put anything here and use the column names as if they were variables; i.e., no need to use $. Then the second part, col2:=col3+1, assigns the RHS to the LHS within that subset, where the column names can be assigned to as if they were variables. := is assignment by reference. No copies of any object are taken, so it is faster than <-, =, within and transform.
Also, soon to be implemented in v1.8.1, one end goal of j's syntax allowing := like that is combining it with by; see the question: when should I use the := operator in data.table.
UPDATE: That was indeed released (:= by group) in July 2012.
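For illustration, a minimal sketch of := combined with by (grouped assignment by reference; the table and column names here are made up):
library(data.table)
DT <- data.table(col1 = c("a", "a", "b"), col3 = 1:3)
DT[, col2 := mean(col3), by = col1] # adds col2 by reference, one mean per group
DT
#    col1 col3 col2
# 1:    a    1  1.5
# 2:    a    2  1.5
# 3:    b    3  3.0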
You should be paying more attention to Gabor Grothendieck (and not just in this instance). The cited inc function on Matt Asher's blog does all of what you are asking:
(And the obvious extension works as well.)
add <- function(x, inc=1) {
eval.parent(substitute(x <- x + inc))
}
EDIT: After my temporary annoyance at the lack of approval in the first comment, I took the challenge of adding yet another function argument. Supplied with a single argument that is a portion of a dataframe, it will still increment that range of values by one. Up to this point it has only been very lightly tested on infix dyadic operators, but I see no reason it wouldn't work with any function which accepts only two arguments:
transfn <- function(x, func = "+", inc = 1) {
eval.parent(substitute(x <- do.call(func, list(x, inc))))
}
(Guilty admission: This somehow "feels wrong" from the traditional R perspective of returning values for assignment.) The earlier testing on the inc function is below:
df <- data.frame(a1 =1:10, a2=21:30, b=1:2)
inc <- function(x) {
eval.parent(substitute(x <- x + 1))
}
#---- examples===============>
> inc(df$a1) # works on whole columns
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 6 25 1
6 7 26 2
7 8 27 1
8 9 28 2
9 10 29 1
10 11 30 2
> inc(df$a1[df$a1>5]) # testing on a restricted range of one column
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 7 25 1
6 8 26 2
7 9 27 1
8 10 28 2
9 11 29 1
10 12 30 2
> inc(df[ df$a1>5, ]) #testing on a range of rows for all columns being transformed
> df
a1 a2 b
1 2 21 1
2 3 22 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
# and even in selected rows and grepped names of columns meeting a criterion
> inc(df[ df$a1 <= 3, grep("a", names(df)) ])
> df
a1 a2 b
1 3 22 1
2 4 23 2
3 4 23 1
4 5 24 2
5 8 26 2
6 9 27 3
7 10 28 2
8 11 29 3
9 12 30 2
10 13 31 3
Here is what you can do. Let us say you have a dataframe
df = data.frame(x = 1:10, y = rnorm(10))
And you want to increment all the y by 1. You can do this easily by using transform
df = transform(df, y = y + 1)
I'd be partial to (presumably the subset is on rows)
ridx <- adataframe$col==something
adataframe[ridx,] <- adataframe[ridx,] + 1
which doesn't rely on any fancy or fragile parsing, is reasonably expressive about the operation being performed, and is not too verbose. It also tends to break lines into nicely human-parseable units, and there is something appealing about using standard idioms -- R's vocabulary and idiosyncrasies are already large enough for my taste.
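A quick usage sketch of that idiom on made-up data (note that every column in the selected rows is incremented, as above):
adataframe <- data.frame(col = c(1, 2, 1), val = c(10, 20, 30))
ridx <- adataframe$col == 1
adataframe[ridx, ] <- adataframe[ridx, ] + 1
adataframe
#   col val
# 1   2  11
# 2   2  20
# 3   2  31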
