Here is the code to rank based on column v2:
x <- data.frame(v1 = c(2,1,1,2), v2 = c(1,1,3,2))
x$rank1 <- rank(x$v2, ties.method='first')
But I really want to rank by v2 first and then by v1, since there are ties in v2. How can I do that without using RPostgreSQL?
How about:
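# order(order(v2, v1)) inverts the sort permutation, which gives each row's rank;
# rank(order(v2, v1)) would merely return the permutation itself.
within(x, rank2 <- order(order(v2, v1)))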
#   v1 v2 rank1 rank2
# 1  2  1     1     2
# 2  1  1     2     1
# 3  1  3     4     4
# 4  2  2     3     3
order works, but for manipulating data frames, also check out the plyr and dplyr packages, both of which provide arrange:
library(dplyr)
arranged_x <- arrange(x, v2, v1) # rows sorted by v2, then by v1
Here we create a sequence of numbers and then reorder it to match each row's position in the ordered data:
x$rank <- seq.int(nrow(x))[match(rownames(x),rownames(x[order(x$v2,x$v1),]))]
Or:
x$rank <- (1:nrow(x))[order(order(x$v2,x$v1))]
Or even:
x$rank <- rank(order(order(x$v2,x$v1)))
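A small check of why applying order twice works, on made-up data where the sort permutation is not its own inverse (the vectors below are illustrative, not from the question). order() returns the row indices in sorted order; ordering that result again inverts the permutation, which gives each row's rank. Taking rank() of a single order() call would just return the permutation unchanged.

```r
# Illustrative data (hypothetical), chosen so the permutation is not self-inverse
y <- data.frame(v1 = c(1, 1, 1), v2 = c(3, 1, 2))

perm <- order(y$v2, y$v1)   # row indices in sorted order
rnk  <- order(perm)         # inverse permutation: the rank of each row

perm
rnk

# rank() of a permutation returns the same values, so it is not the row rank
rank(perm)
```

On the question's data the two happen to coincide, because that particular permutation swaps rows in pairs.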
Try this:
x <- data.frame(v1 = c(2,1,1,2), v2 = c(1,1,3,2))
# The order function returns the index (address) of the desired order
# of the examined object rows
orderlist<- order(x$v2, x$v1)
# So to get the position of each row in the index, you can do a grep
x$rank<-sapply(1:nrow(x), function(x) grep(paste0("^",x,"$"), orderlist ) )
x
# For a little bit more general case
# With one tie
x <- data.frame(v1 = c(2,1,1,2,2), v2 = c(1,1,3,2,2))
x$rankv2<-rank(x$v2)
x$rankv1<-rank(x$v1)
orderlist<- order(x$rankv2, x$rankv1)
orderlist
#This rank would not be appropriate
x$rank<-sapply(1:nrow(x), function(x) grep(paste0("^",x,"$"), orderlist ) )
# There are ties. Note that duplicated(a, b) would treat b as `incomparables`,
# so the two rank columns are checked together as a data frame instead.
dup <- duplicated(x[c("rankv2", "rankv1")])
which(dup)
# Example for only one tie: give the tied rows the mean of their ranks
tied <- x$rankv2 %in% x$rankv2[dup] & x$rankv1 %in% x$rankv1[dup]
x$rank[tied] <- mean(x$rank[tied])
x
I have a huge dataset in which several mini datasets were merged. I want to split them into separate data frames and save them. Each mini dataset is identified by a variable name (which always includes the string "-gram") on a given row.
I have been trying to construct a for loop, but with no luck.
grams <- read.delim("grams.tsv", header=FALSE) #read dataset
index <- which(grepl("-gram", grams$V1), arr.ind=TRUE) # identify the row positions where each mini dataset starts
index[10] <- nrow(grams) # add the total number of rows as last variable of the vector
start <- c() # initialize vector
end <- c() # initialize vector
for (i in 1:length(index)-1) for ( k in 2:length(index)) {
start[i] <- index[i] # add value to the vector start
if (k != 10) { end[k-1] <- index[k]-1 } else { end[k-1] <- index[k] } # add value to the vector end
gram <- grams[start[i]:end[i],] #subset the dataset grams so that the split mini dataset has start and end that correspond to the index in the vector
write.csv(gram, file=paste0("grams_", i, ".csv"), row.names=FALSE) # save dataset
}
I get an error when I try to subset the dataset:
Error in start[i]:end[i] : argument of length 0
...and I do not understand why! Can anyone help me?
Thanks!
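One likely cause, judging from the loop header: in R the `:` operator binds more tightly than `-`, so 1:length(index)-1 is read as (1:length(index)) - 1 and starts at 0. On the first pass i is 0, start[0] and end[0] are empty, and start[i]:end[i] fails with "argument of length 0". A minimal sketch of the difference, using a made-up index vector:

```r
index <- c(1L, 5L, 9L)           # hypothetical start rows

i_wrong <- 1:length(index) - 1   # parsed as (1:3) - 1
i_right <- seq_len(length(index) - 1)

i_wrong   # 0 1 2
i_right   # 1 2

start <- c()
length(start[0])   # 0: indexing with 0 yields nothing, so start[0]:end[0] has no operands
```

The nested loop over k is probably not needed either; a single loop over i that takes the end row from index[i + 1] would cover the same ground.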
You can cumsum and split:
dat <- data.frame(V1 = c("foo", "bar", "quux-gram", "bar-gram", "something", "nothing"),
V2 = 1:6, stringsAsFactors = FALSE)
dat
#          V1 V2
# 1       foo  1
# 2       bar  2
# 3 quux-gram  3
# 4  bar-gram  4
# 5 something  5
# 6   nothing  6
grepl("-gram$", dat$V1)
# [1] FALSE FALSE TRUE TRUE FALSE FALSE
cumsum(grepl("-gram$", dat$V1))
# [1] 0 0 1 2 2 2
spl_dat <- split(dat, cumsum(grepl("-gram$", dat$V1)))
spl_dat
# $`0`
#    V1 V2
# 1 foo  1
# 2 bar  2
#
# $`1`
#          V1 V2
# 3 quux-gram  3
#
# $`2`
#          V1 V2
# 4  bar-gram  4
# 5 something  5
# 6   nothing  6
With that, you can write them to files with:
# MoreArgs passes row.names=FALSE by name to every write.csv call
ign <- Map(write.csv, spl_dat, sprintf("gram-%03d.csv", seq_along(spl_dat)),
MoreArgs = list(row.names=FALSE))
An option with group_split and endsWith:
library(dplyr)
# endsWith() is base R, so stringr is not needed here
dat %>%
group_split(grp = cumsum(endsWith(V1, '-gram')), .keep = FALSE)
I have a data set like this:
df <- data.frame(v1 = rnorm(12),
v2 = rnorm(12),
v3 = rnorm(12),
time = rep(1:3,4))
It looks like this:
> head(df)
v1 v2 v3 time
1 0.1462583 -1.1536425 3.0319594 1
2 1.4017828 -1.2532555 -0.1707027 2
3 0.3767506 0.2462661 -1.1279605 3
4 -0.8060311 -0.1794444 0.1616582 1
5 -0.6395198 0.4637165 -0.9608578 2
6 -1.6584524 -0.5872627 0.5359896 3
I now want to stack rows 1-3 in a new column, then rows 4-6, then 7-9 and so on.
This is my naive way to do it, but there must be a faster way that doesn't use so many helper variables and loops:
l <- list()
for(i in 1:length(df)) {
l[[i]] <- stack(df[i:(i+2), -4])$values #time column is removed, was just for illustration
}
result <- do.call(cbind, l)
Only base R should be used.
We can use split on the 'time' column
sapply(split(df[-4], cumsum(df$time == 1)), function(x) stack(x)$values)
Or instead of stack, unlist could be faster
sapply(split(df[-4], cumsum(df$time == 1)), unlist)
Based on the OP's code, it seems to be subsetting the rows by the sequence of column indices:
sapply(1:length(df), function(i) unlist(df[i:(i+2), -4]))
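A small deterministic check of the split approach, using made-up integer data in place of rnorm so the result can be read off directly (the values are illustrative only):

```r
df <- data.frame(v1 = 1:6, v2 = 11:16, v3 = 21:26, time = rep(1:3, 2))

grp <- cumsum(df$time == 1)   # 1 1 1 2 2 2: a new group starts whenever time is 1
res <- sapply(split(df[-4], grp), unlist)

unname(res[, 1])   # 1 2 3 11 12 13 21 22 23
unname(res[, 2])   # 4 5 6 14 15 16 24 25 26
```

Each column of res holds one block of three rows, with v1, v2 and v3 stacked on top of each other.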
I have a data.table with multiple categorical variables for which I would like to create contrast (or "dummy") variables along with many more numerical variables which I would like to simply pass by reference.
Example dataset:
library('data.table')
d <- data.table(1:3, # there are lots of numerics, so I want to avoid copying
letters[1:3], # convert these to factor then dummy variable
10:12,
LETTERS[24:26])
# >d
# V1 V2 V3 V4
# 1: 1 a 10 X
# 2: 2 b 11 Y
# 3: 3 c 12 Z
The desired result looks like:
> dummyDT(d)
   V1 V3 V2.b V2.c V4.Y V4.Z
1:  1 10    0    0    0    0
2:  2 11    1    0    1    0
3:  3 12    0    1    0    1
which can be produced with:
# this does what I want but is slow and inelegant and not idiomatic data.table
categorToMatrix <- function(x, name_prefix='Var'){
# set levels in order of appearance to avoid default re-sort by alpha
m <- contrasts(factor(x, levels=unique(x)))
dimnames(m) <- list(NULL, paste(name_prefix, colnames(m), sep='.') )
m
}
dummyDT <- function(d){
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <-
data.table(
do.call(cbind, lapply(toDummy, function(j) {
categorToMatrix(d[[j]], name_prefix = names(d)[j])
} )
)
)
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
(I do not care about maintaining original column ordering.)
I have tried in addition to the above, the approach of splitting each matrix into a list of columns, as in:
# split a matrix into list of columns and keep track of column names
# expanded from @Tommy's answer at: https://stackoverflow.com/a/6821395/2573061
splitMatrix <- function(m){
setNames( lapply(seq_len(ncol(m)), function(j) m[,j]), colnames(m) )
}
# Example:
splitMatrix(categorToMatrix(d$V2, name_prefix='V2'))
# $V2.b
# [1] 0 1 0
#
# $V2.c
# [1] 0 0 1
which works for an individual column, but then when I try to lapply to multiple columns, these lists get somehow coerced into string-rows and recycled, which is baffling me:
dummyDT2 <- function(d){
stopifnot(inherits(d,'data.table'))
toDummy <- which(sapply(d, function(x) is.factor(x) | is.character(x)))
if(length(toDummy)>0){
dummyComponent <- d[, lapply(.SD, function(x) splitMatrix( categorToMatrix(x) ) ) ,
.SDcols=toDummy]
asIs <- (1:ncol(d))[-toDummy]
if(length(asIs)>0) {
allCols <- cbind(d[,asIs,with=FALSE], dummyComponent)
} else allCols <- dummyComponent
} else allCols <- d
return(allCols)
}
dummyDT2(d)
# V1 V3 V2
# 1: 1 10 0,1,0
# 2: 2 11 0,0,1
# 3: 3 12 0,1,0
# Warning message:
# In data.table::data.table(...) :
# Item 2 is of size 2 but maximum size is 3 (recycled leaving remainder of 1 items)
I then tried wrapping splitMatrix with data.table() and got an amusingly laconic error message.
I know that functions like caret::dummyVars exist for data.frame. I am trying to create a data.table optimized version.
Closely related question: How to one-hot-encode factor variables with data.table?
But there are two differences: I do not want full-rank dummy variables (because I'm using this for regression) but rather contrast variables (n-1 of these for n levels) and I have multiple numeric variables that I do not want to OHE.
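As a side note on contrast variables: contrasts() returns one row per factor level, not one row per observation, so categorToMatrix above lines up with the data only when every value is distinct, as in the toy example. A minimal base R sketch of expanding the contrast matrix to one row per observation by indexing it with the factor's integer codes (x is made-up data, the default treatment contrasts are assumed, and this is one possible approach rather than a data.table-optimized one):

```r
x <- c("a", "b", "a", "c")                   # hypothetical column with a repeated value
f <- factor(x, levels = unique(x))           # levels in order of appearance

m <- contrasts(f)                            # 3 levels give a 3 x 2 contrast matrix
per_obs <- m[as.integer(f), , drop = FALSE]  # one row per observation: 4 x 2
colnames(per_obs) <- paste("V2", colnames(m), sep = ".")
rownames(per_obs) <- NULL

per_obs
```

The columns of per_obs could then be added to the data.table one at a time, which leaves the numeric columns untouched.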
For example, I have a dataframe like this (the content of V1 is not the same as line number):
V1 V2 V3
1 cat animal
3 dog animal
4 apple fruit
And a vector like this:
c(4,1,3)
Is there an easy way to get a vector like this in R?
c("fruit:apple", "animal:cat", "animal:dog")
I tried my_frame$V1 == my_vector, but that compares the two vectors element by element instead of looking up each value.
A slightly more concise version:
my_charvec <- as.character(my_vector)
rownames(my_frame) <- my_frame$V1
apply(my_frame[my_charvec,c(3,2)],1,paste,collapse=":")
Here's the output:
            4             1             3
"fruit:apple"  "animal:cat"  "animal:dog"
The 4,1,3 on the output are just names; you can ignore them if you want to.
Something like this works:
## my_dat <- read.table(text="V1 V2 V3
## 1 cat animal
## 2 dog animal
## 3 apple fruit", header=T)
##
## my_vect <- c(3,1,2)
library(qdap) #for paste2 function
paste2(my_dat[sapply(my_vect, function(x) which(x == my_dat[, 1])), 3:2], sep=":")
## [1] "fruit:apple" "animal:cat" "animal:dog"
First I match the my_vect with the column 1 of my_dat using sapply, which and ==. This tells the order to grab in:
sapply(my_vect, function(x) which(x == my_dat[, 1]))
## [1] 3 1 2
Then I index and grab only the last two columns in the order you requested (3rd column, then 2nd):
my_dat[sapply(my_vect, function(x) which(x == my_dat[, 1])), 3:2]
## V3 V2
## 3 fruit apple
## 1 animal cat
## 2 animal dog
Then I use paste2 from the qdap package to bind the columns together without naming them individually (just being lazy; you could accomplish this with base paste by explicitly stating the vectors).
@Firegun, how about using 'merge', like so:
#original data frames
df1=data.frame(V1=c(1,3,4),V2=c("cat","dog","apple"),V3=c("animal","animal","fruit"))
df2=data.frame(V1=c(4,1,3))
# just merge and don't sort (which is the default)
df3=merge(df2,df1,by.x="V1",sort=FALSE)
vec=as.vector(paste0(df3$V3,":",df3$V2))
> vec
[1] "fruit:apple" "animal:cat" "animal:dog"
Another base R solution, using match and paste:
d <- read.table(text='V1 V2 V3
1 cat animal
3 dog animal
4 apple fruit', header=TRUE)
v <- c(4,1,3)
with(d[match(v, d$V1), ], paste(V3, V2, sep=':'))
# [1] "fruit:apple" "animal:cat" "animal:dog"
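To make the match step explicit: match(v, d$V1) returns, for each element of v, the row position of the first matching id in d$V1, or NA when there is no match. A short sketch with the same data plus one id that is absent (the value 7 is made up for illustration):

```r
d <- data.frame(V1 = c(1, 3, 4),
                V2 = c("cat", "dog", "apple"),
                V3 = c("animal", "animal", "fruit"))
v <- c(4, 1, 3, 7)       # 7 is not present in d$V1

pos <- match(v, d$V1)    # 3 1 2 NA
found <- !is.na(pos)

# Build labels only for ids that were found; the rest stay NA
labels <- rep(NA_character_, length(v))
labels[found] <- paste(d$V3[pos[found]], d$V2[pos[found]], sep = ":")
labels
```

Handling the NA positions separately avoids labels such as "NA:NA" for ids that do not occur in the data frame.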
Maybe something like this?
> df<-data.frame(V1=c(1,2,3), V2=c("a","b","c"), V3=c("v", "nv", "nv"))
> v <- c(3,1,2)
> df[v,]
## V1 V2 V3
## 3 3 c nv
## 1 1 a v
## 2 2 b nv
> res <- as.character(apply(df[v,], 1, function(r) paste(r[3],r[2],sep=":")))
> res
## [1] "nv:c" "v:a" "nv:b"
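Here is a base solution (note that a[v, ] selects by row position, so this assumes the ids are the row numbers):
a <- data.frame(c("cat", "dog", "apple"), c("animal", "animal", "fruit"))
v <- c(3,1,2)
# columns are reversed (2:1) so the category comes first, as in "fruit:apple"
apply(a[v, 2:1], 1, paste0, collapse=":")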
With the sample data
> df1 <- data.frame(x=c(1,1,2,3), y=c("a","b","a","b"))
> df1
x y
1 1 a
2 1 b
3 2 a
4 3 b
> df2 <- data.frame(x=c(1,3), y=c("a","b"))
> df2
x y
1 1 a
2 3 b
I want to remove all the value pairs (x,y) of df2 from df1. I can do it using a for loop over each row in df2 but I'm sure there is a better and simpler way that I just can't think of at the moment. I've been trying to do something starting with the following:
> df1$x %in% df2$x & df1$y %in% df2$y
[1] TRUE TRUE FALSE TRUE
But this isn't what I want as df1[2,] = (1,b) is pulled out for removal. Thank you very much in advance for your help.
Build a set of pairs from df2:
prs <- with(df2, paste(x,y,sep="."))
Test each row of df1, processed the same way, for membership in the pair set, and keep the rows that are not in it:
df1[ !(paste(df1$x, df1$y, sep=".") %in% prs) , ]
You could go the other way around: rbind everything and remove duplicates
out <-rbind(df1,df2)
out[!duplicated(out, fromLast=TRUE) & !duplicated(out),]
x y
2 1 b
3 2 a
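The paste-based membership test can be wrapped into a small helper that behaves like an anti-join on any set of key columns. This is a sketch only: anti_rows is a hypothetical name, and it assumes the separator character never occurs in the data. (With dplyr, anti_join(df1, df2, by = c("x", "y")) does the same job.)

```r
# Hypothetical helper: drop the rows of a whose key columns match some row of b
anti_rows <- function(a, b, by = intersect(names(a), names(b)), sep = "\r") {
  key_a <- do.call(paste, c(unname(as.list(a[by])), sep = sep))
  key_b <- do.call(paste, c(unname(as.list(b[by])), sep = sep))
  a[!(key_a %in% key_b), , drop = FALSE]
}

df1 <- data.frame(x = c(1, 1, 2, 3), y = c("a", "b", "a", "b"))
df2 <- data.frame(x = c(1, 3), y = c("a", "b"))

res <- anti_rows(df1, df2)
res   # keeps rows 2 and 3: (1, b) and (2, a)
```

Unlike the rbind and duplicated approach, this leaves rows that are repeated within df1 itself in place.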