Reshaping 2 column data.table from long to wide - r

This is my data.frame:
library(data.table)
df<- fread('
predictions Label
3 A
4 B
5 C
1 A
2 B
3 C
')
Desired Output:
A B C
3 4 5
1 2 3
I am trying DesiredOutput<-dcast(df, Label+predictions ~ Label, value.var = "predictions") with no success. Your help is appreciated!

df[, idx := 1:.N, by = Label]
dcast(df, idx ~ Label, value.var = 'predictions')
# idx A B C
#1: 1 3 4 5
#2: 2 1 2 3
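The helper index does not have to be stored in the table first. As a minimal sketch, assuming a data.table version that provides rowid() (1.9.8 or later), the per-label counter can be computed directly in the dcast formula. The name `wide` is only illustrative.

```r
library(data.table)

# Same toy data as in the question
df <- fread(text = "predictions Label
3 A
4 B
5 C
1 A
2 B
3 C")

# rowid(Label) numbers the rows within each Label (1, 2, ...),
# which gives dcast a unique row identifier for the wide table
wide <- dcast(df, rowid(Label) ~ Label, value.var = "predictions")

# Keep only the value columns
wide[, .(A, B, C)]
```

This leaves df itself unchanged, whereas the := approach adds an idx column to it by reference.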

Maybe the base R function unstack is the cleanest solution:
unstack(df)
A B C
1 3 4 5
2 1 2 3
Note that this returns a data.frame rather than a data.table, so if you want a data.table at the end:
df2 <- setDT(unstack(df))
will return a data.table.

Related

How to make a true ranking by giving same ranking to same score while the rest follow the true order?

Here is a toy sample of 8 students with grades from A to D. I would like to assign a ranking that reflects the true order, with students who have the same grade sharing the same rank.
.GRP seems close to what I need, but it numbers the groups consecutively. How can I skip the positions occupied by students with the same grade, using data.table? Thanks.
DT <- data.table(GRADE = c("A","B","B","C",rep("D",4)))
DT[, GRP:=.GRP, by = GRADE][, RANK:= c(1,2,2,4,5,5,5,5)]
# GRADE GRP RANK
#1: A 1 1
#2: B 2 2
#3: B 2 2
#4: C 3 4
#5: D 4 5
#6: D 4 5
#7: D 4 5
#8: D 4 5
An option is frank
DT[, RANK := frank(GRADE, ties.method = 'min')]
DT$RANK
#[1] 1 2 2 4 5 5 5 5
Or in dplyr with min_rank
library(dplyr)
DT %>%
mutate(RANK = min_rank(GRADE))
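To make the difference between the two numbering schemes concrete, here is a small sketch comparing two ties.method values of frank on the same grades. The variable names `r_min` and `r_dense` are only illustrative.

```r
library(data.table)

GRADE <- c("A", "B", "B", "C", rep("D", 4))

# "min": tied values share the lowest rank and the following ranks are skipped
r_min <- frank(GRADE, ties.method = "min")
r_min
# [1] 1 2 2 4 5 5 5 5

# "dense": tied values share a rank and no ranks are skipped,
# which matches what .GRP gives on this sorted example
r_dense <- frank(GRADE, ties.method = "dense")
r_dense
# [1] 1 2 2 3 4 4 4 4
```

Base R's rank(GRADE, ties.method = "min") should give the same result as the "min" variant if data.table is not available.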

Add max column using variable

I am trying to do the same thing as in this question: Add max value to a new column in R. However, I want to pass in a variable instead of the column name directly, so that I don't hard-code the column name into the formula.
Sample code:
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
d
a b
1 1
1 3
2 5
2 9
3 4
3 NA
I can get this:
a b max_b
1 1 3
1 3 3
2 5 9
2 9 9
3 4 4
3 NA 4
By hard coding it: setDT(d)[, max_b:= max(b, na.rm = T), a] but I would like to do something like this instead:
cn <- "b"
setDT(d)[, paste0("max_", cn):= max(cn, na.rm = T), a]
However, this does not work, because inside max() cn evaluates to the character string rather than the column. The result is a column named max_b that contains the value "b", because max("b") = "b". I understand why this happens; I just do not know a workaround.
What is a solution to this?
Note: the question I linked above was marked as a duplicate and closed, but I refer to it because I am using its accepted answer in my code. I also do not fully agree that it is a duplicate.
Try parsing the column name and evaluating it inside max():
setDT(d)[, paste0("max_", cn) := max(eval(parse(text = cn))), a]
# output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
# example with missing values
a <- c(1,1,2,2,3,3)
b <- c(1,3,5,9,4,NA)
d <- data.table(a, b)
cn <- "b"
setDT(d)[, paste0("max_", cn) := max(eval(parse(text = cn)), na.rm = TRUE), a]
#output
a b max_b
1: 1 1 3
2: 1 3 3
3: 2 5 9
4: 2 9 9
5: 3 4 4
6: 3 NA 4
One option is to specify the variable in .SDcols and then apply the function on .SD (Subset of Data.table).
d[, paste0("max_", cn) := lapply(.SD, max, na.rm = TRUE), by = a, .SDcols = cn]
d
# a b max_b
#1: 1 1 3
#2: 1 3 3
#3: 2 5 9
#4: 2 9 9
#5: 3 4 4
#6: 3 NA 4
Another option is to convert the name to a symbol and then evaluate it:
d[, paste0("max_", cn) := max(eval(as.symbol(cn)), na.rm = TRUE), by = a]
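A further variation on the same idea, shown here only as a sketch, is get(), which looks up the column by its name inside j. The helper variable `new_col` is an assumption introduced for readability; wrapping it in parentheses on the left of := tells data.table to use its value as the new column name.

```r
library(data.table)

d <- data.table(a = c(1, 1, 2, 2, 3, 3),
                b = c(1, 3, 5, 9, 4, NA))
cn <- "b"

# Name of the column to create (illustrative helper, not in the original)
new_col <- paste0("max_", cn)

# get(cn) returns the column called "b" within each group of a
d[, (new_col) := max(get(cn), na.rm = TRUE), by = a]
d$max_b
# [1] 3 3 9 9 4 4
```

The .SDcols approach tends to scale better when several columns need the same treatment, while get() is convenient for a single column.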

How to keep rows with the same values in two variables in r?

I have a dataset with several variables, but I want to keep the rows that are the same based on two columns. Here is an example of what I want to do:
a <- c(rep('A',3), rep('B', 3), rep('C',3))
b <- c(1,1,2,4,4,4,5,5,5)
df <- data.frame(a,b)
a b
1 A 1
2 A 1
3 A 2
4 B 4
5 B 4
6 B 4
7 C 5
8 C 5
9 C 5
I know that if I use the duplicated function I can get:
df[!duplicated(df),]
a b
1 A 1
3 A 2
4 B 4
7 C 5
But since the level 'A' in column a does not have a unique value in b, I want to drop all of its observations and get a new data.frame like this:
a b
4 B 4
7 C 5
I don't mind having repeated values across b, as long as every level of a has a single value in b.
Is there a way to do this? Thanks!
This one maybe?
ag <- aggregate(b~a, df, unique)
ag[lengths(ag$b)==1,]
# a b
#2 B 4
#3 C 5
Maybe something like this:
# TRUE for each level of a whose b values are all identical
ind <- sapply(split(df$b, df$a), function(x) all(diff(x) == 0))
out <- df[!duplicated(df), ]
out[out$a %in% names(ind)[ind], ]
a b
4 B 4
7 C 5
Here is another option with data.table
library(data.table)
setDT(df)[, if(uniqueN(b)==1) .SD[1L], by = a]
# a b
#1: B 4
#2: C 5
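For completeness, the same filter can be sketched in base R with ave, which avoids both the aggregate list column and the data.table dependency. The names `n_distinct_b` and `out` are only illustrative.

```r
df <- data.frame(a = c(rep("A", 3), rep("B", 3), rep("C", 3)),
                 b = c(1, 1, 2, 4, 4, 4, 5, 5, 5))

# For every row, count the distinct b values within its level of a
n_distinct_b <- ave(df$b, df$a, FUN = function(x) length(unique(x)))

# Keep only levels with a single distinct b, then drop the repeated rows
out <- unique(df[n_distinct_b == 1, ])
out
#   a b
# 4 B 4
# 7 C 5
```

Because ave returns one value per row, the original row names (4 and 7) are preserved, matching the output requested in the question.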

R delete non max values in redundant rows

I have a matrix that contains following:
A B C D
a 1 3 2 5
b 3 2 5 8
a 2 1 0 9
a 4 2 1 3
c 4 3 1 1
b 2 5 1 9
A, B, C, D are the column names and a, b, c are the row names.
I want to make it look like
A B C D
a 4 3 2 9
b 3 5 5 9
c 4 3 1 1
using R. That is:
1) order the rows alphabetically by row name,
2) if there are redundant rows (i.e. other rows with the same row name), keep the maximum value among them for each column and delete the others.
I first did this in Python, but I was wondering whether there is a more convenient way to do it in R.
I would appreciate any help.
You can use data.table
dt_in <- data.table(matrix_in)
dt_in[, name := rownames(matrix_in)]
dt_max <- dt_in[, list(A = max(A), B = max(B), C = max(C), D = max(D)), by = "name"]
as.matrix(data.frame(dt_max))
Here's a one-liner using data.table: keep the row names while converting to a data.table (keep.rownames = TRUE stores them in a column named rn), then apply max over all columns with lapply(.SD, ...) grouped by rn.
library(data.table)
data.table(m, keep.rownames = TRUE)[, lapply(.SD, max), by = rn]
# rn A B C D
# 1: a 4 3 2 9
# 2: b 3 5 5 9
# 3: c 4 3 1 1
You can simply use the aggregate function, grouping by the row names (here m is the matrix):
aggregate(m, by = list(rn = rownames(m)), FUN = max)
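If the result should stay a matrix with row names, a base R sketch is to split the row indices by row name and take column-wise maxima per group. The matrix `m` below is rebuilt from the question's example; `groups` and `res` are illustrative names.

```r
# Example matrix from the question, with repeated row names
m <- matrix(c(1, 3, 2, 5,
              3, 2, 5, 8,
              2, 1, 0, 9,
              4, 2, 1, 3,
              4, 3, 1, 1,
              2, 5, 1, 9),
            ncol = 4, byrow = TRUE,
            dimnames = list(c("a", "b", "a", "a", "c", "b"),
                            c("A", "B", "C", "D")))

# Row indices grouped by row name; split() orders the groups alphabetically
groups <- split(seq_len(nrow(m)), rownames(m))

# Column-wise maximum within each group; sapply returns one column per group,
# so transpose to get one row per row name
res <- t(sapply(groups, function(i) apply(m[i, , drop = FALSE], 2, max)))
res
#   A B C D
# a 4 3 2 9
# b 3 5 5 9
# c 4 3 1 1
```

The drop = FALSE keeps single-row groups (such as c) as matrices, so apply works the same way for every group.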

Create counter with multiple variables [duplicate]

I have my data that looks like below:
CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013
I need to create a counter variable, which will be like below:
CustomerID TripDate TripCounter
1 1/3/2013 1
1 1/4/2013 2
1 1/9/2013 3
2 2/1/2013 1
2 2/4/2013 2
3 1/2/2013 1
The TripCounter should restart at 1 for each customer.
Use ave. Assuming your data.frame is called "mydf":
mydf$counter <- with(mydf, ave(CustomerID, CustomerID, FUN = seq_along))
mydf
# CustomerID TripDate counter
# 1 1 1/3/2013 1
# 2 1 1/4/2013 2
# 3 1 1/9/2013 3
# 4 2 2/1/2013 1
# 5 2 2/4/2013 2
# 6 3 1/2/2013 1
For what it's worth, I also implemented a version of this approach in a function included in my "splitstackshape" package. The function is called getanID:
mydf <- data.frame(IDA = c("a", "a", "a", "b", "b", "b", "b"),
IDB = c(1, 2, 1, 1, 2, 2, 2), values = 1:7)
mydf
# install.packages("splitstackshape")
library(splitstackshape)
# getanID(mydf, id.vars = c("IDA", "IDB"))
getanID(mydf, id.vars = 1:2)
# IDA IDB values .id
# 1 a 1 1 1
# 2 a 2 2 1
# 3 a 1 3 2
# 4 b 1 4 1
# 5 b 2 5 1
# 6 b 2 6 2
# 7 b 2 7 3
As you can see from the example above, I've written the function in such a way that you can specify one or more columns that should be treated as ID columns. It checks to see if any of the id.vars are duplicated, and if they are, then it generates a new ID variable for you.
You can also use plyr for this (using @AnandaMahto's example data):
> ddply(mydf, .(IDA), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 2 2 2
3 a 1 3 3
4 b 1 4 1
5 b 2 5 2
6 b 2 6 3
7 b 2 7 4
or even:
> ddply(mydf, .(IDA, IDB), transform, .id = seq_along(IDA))
IDA IDB values .id
1 a 1 1 1
2 a 1 3 2
3 a 2 2 1
4 b 1 4 1
5 b 2 5 1
6 b 2 6 2
7 b 2 7 3
Note that plyr does not have a reputation for being the quickest solution. For speed, take a look at data.table.
Here's a data.table approach:
library(data.table)
DT <- data.table(mydf)
DT[, .id := sequence(.N), by = "IDA,IDB"]
DT
# IDA IDB values .id
# 1: a 1 1 1
# 2: a 2 2 1
# 3: a 1 3 2
# 4: b 1 4 1
# 5: b 2 5 1
# 6: b 2 6 2
# 7: b 2 7 3
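The same counter can also be written without by, as a sketch using data.table's rowid(), which numbers rows within each combination of the columns passed to it (this assumes data.table 1.9.8 or later).

```r
library(data.table)

DT <- data.table(IDA = c("a", "a", "a", "b", "b", "b", "b"),
                 IDB = c(1, 2, 1, 1, 2, 2, 2),
                 values = 1:7)

# rowid(IDA, IDB) counts occurrences of each (IDA, IDB) pair in row order
DT[, .id := rowid(IDA, IDB)]
DT$.id
# [1] 1 1 2 1 1 2 3
```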
Meanwhile, you can also use dplyr. If your data.frame is called mydata:
library(dplyr)
mydata %>% group_by(CustomerID) %>% mutate(TripCounter = row_number())
I need to do this often, and wrote a function that accomplishes it differently than the previous answers. I am not sure which solution is most efficient.
idCounter <- function(x) {
unlist(lapply(rle(x)$lengths, seq_len))
}
mydf$TripCounter <- idCounter(mydf$CustomerID)
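One caveat worth illustrating: because rle counts runs of consecutive equal values, this function assumes that rows with the same ID are contiguous. The sketch below uses a made-up, unsorted ID vector (`ids` is hypothetical) to show how the result differs from ave, which groups by value regardless of position.

```r
idCounter <- function(x) {
  unlist(lapply(rle(x)$lengths, seq_len))
}

# Hypothetical unsorted IDs: customer 1 appears before and after customer 2
ids <- c(1, 2, 1, 1)

# rle sees three runs (1 | 2 | 1 1), so the counter restarts for the second run of 1
by_runs <- idCounter(ids)
by_runs
# [1] 1 1 1 2

# ave groups by value, so customer 1 is counted 1, 2, 3 across the whole vector
by_group <- ave(ids, ids, FUN = seq_along)
by_group
# [1] 1 1 2 3
```

Sorting by the ID column first makes the two approaches agree.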
Here's procedural-style code. I don't believe in claims like "if you are using a loop in R, you are probably doing something wrong".
x <- dataframe$CustomerID
# every counter starts at 1; rows must already be ordered by CustomerID
y <- rep(1, length(x))
# start at the second row so that x[i - 1] always exists
for (i in seq_along(x)[-1]) {
  # same customer as the previous row: increment; otherwise restart at 1
  y[i] <- if (x[i] == x[i - 1]) y[i - 1] + 1 else 1
}
dataframe$counter <- y
This does not give the right answer, but it shows something interesting compared with for loops: a vectorized call such as sapply is fast, but it does not perform sequential updating. Each call sees the original res, not the values computed for earlier rows, so the third trip of customer 1 gets 2 instead of 3.
a<-read.table(textConnection(
"CustomerID TripDate
1 1/3/2013
1 1/4/2013
1 1/9/2013
2 2/1/2013
2 2/4/2013
3 1/2/2013 "), header=TRUE)
library(dplyr)
a <- a %>%
  arrange(CustomerID) # rows must be ordered by customer
res <- rep(1, nrow(a)) # base value: 1
# each call reads the original res (all 1s), never an updated value
res[2:6] <- sapply(2:6, function(i) {
  if (a$CustomerID[i] == a$CustomerID[i - 1]) res[i - 1] + 1 else res[i]
})
a$TripCounter <- res
