I have this data:
df = data.frame(x = c(1,2,3), y = c(5,1,4))
> x y
> 1 1 5
> 2 2 1
> 3 3 4
But I want a new column containing the name of the column that holds the max value in each row,
like this:
> x y max.col
> 1 1 5 y
> 2 2 1 x
> 3 3 4 y
I've tried a lot of code, but without success. Extra points if the solution can be used with %>%.
Edit1: I have a lot of NAs and I want to skip them.
Edit2: I have 30 different columns in the real df.
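We can use max.col to return the column index of the max value in each row and use that to subset the column names. If there are NAs, replace them with a value lower than any real value in the data (here -999) so they can never be picked as the maximum.
If a row is all NA, we can identify it with rowSums on the logical matrix and set its result back to NA.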
i1 <- !rowSums(!is.na(df))
df$max.col <- names(df)[max.col(replace(df, is.na(df), -999), 'first')]
df$max.col[i1] <- NA
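As a minimal sketch of the same idea on made-up data (the three-column data frame and the names `vals`, `all_na`, `sentinel` and `filled` are illustrative assumptions, not part of the question), the sentinel can be derived from the data instead of hard-coding -999:

```r
# Hypothetical data with NAs, a tie (row 3) and an all-NA row (row 4)
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(5, 1, NA, NA),
                 z = c(2, 7, 3, NA))

vals <- as.matrix(df)

# Rows where every value is missing
all_na <- rowSums(!is.na(vals)) == 0

# Sentinel strictly below every observed value, so an NA can never win
sentinel <- min(vals, na.rm = TRUE) - 1
filled <- replace(vals, is.na(vals), sentinel)

# Index of the row maximum (first column wins ties), mapped to column names
df$max.col <- colnames(vals)[max.col(filled, ties.method = "first")]
df$max.col[all_na] <- NA

df$max.col
# [1] "y" "z" "x" NA
```

Because the work is done on a matrix of all value columns, the same lines should apply unchanged to a data frame with 30 columns, assuming they are all numeric.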
Here is a solution for the two-column example in the question:
library(dplyr)
df2 <- df %>%
  mutate(max.col = ifelse(x > y, "x", "y"))
# x y max.col
# 1 1 5 y
# 2 2 1 x
# 3 3 4 y
I have a data frame in R that I want to aggregate. The summary function that I want to apply to each subset is a custom function that takes several variables (columns) as input, and returns a vector or list of variable length. As an output, I would like to have a data frame with a column of the grouping variable, and a single other column containing the output vector (of varying length).
To give a mock example, suppose I have the following dataframe:
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
> df
particle time state energy
1 X 1 A 9
2 X 2 A 8
3 X 3 B 7
4 X 4 C 5
5 X 5 A 0
6 Y 1 A 1
7 Y 2 B 7
8 Y 3 B 7
9 Z 1 B 3
10 Z 2 C 9
11 Z 3 A 5
12 Z 4 A 6
I would like to obtain for each particle a list of the energy they had every time they changed state. The output I'm looking for is something like this:
>
particle energy
1 X c(9,7,5,0)
2 Y c(1,7)
3 Z c(3,9,5)
To do so, I would define a function like the following:
myfun <- function(state, energy){
  tempstate <- state[1]
  energyvec <- energy[1]
  # seq_along(state)[-1] is empty for a single-row group, unlike 2:length(state)
  for(i in seq_along(state)[-1]){
    if(state[i] != tempstate){
      energyvec <- c(energyvec, energy[i])
      tempstate <- state[i]
    }
  }
  return(energyvec)
}
and then try to pass it to aggregate somehow.
The two data structures I tried for this are data.frame and data.table.
In data.frame, using a custom function that returns a vector seems to give the correct output format I am looking for, that is where the output column is really a list, and each row contains a list with the output of the function. However, I can't seem to pass several columns to the function when aggregating this way.
With a data.table, the aggregation is easier to do when considering a function of several variables. However, I can't seem to obtain the output I'm looking for. Indeed,
library(data.table)
dt <- data.table(df)
dt[, myfun(state, energy), by = particle]
only returns the first element of energyvec (instead of a vector), and
dt <- data.table(df)
dt[, as.list(myfun(state, energy)), by = particle]
doesn't work as the outputs don't all have the same length.
Is there an alternative way to go to accomplish this?
Thank you very much in advance for all your help!
Here's a tidyverse approach:
library(tidyverse)
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
# Hard-code energy to make this reproducible
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
df %>%
  group_by(particle) %>%
  mutate(
    changed_state = coalesce(state != lag(state, 1), TRUE)
  ) %>%
  filter(changed_state) %>%
  summarise(
    string = toString(energy)
  )
#> # A tibble: 3 x 2
#> particle string
#> <fct> <chr>
#> 1 X 9, 7, 5, 0
#> 2 Y 1, 7
#> 3 Z 3, 9, 5
I'd run each line of the pipe individually to see what it does. Basically, we create a changed_state variable by checking whether the current state differs from the previous state, lag(state, 1); coalesce turns the NA on the first row of each group into TRUE. Since we only care about rows where the state changed, we filter where this is TRUE (a more verbose line would be filter(changed_state == TRUE)). The toString function collapses the remaining rows of energy as desired, and we are already "grouped" by particle.
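The question asked for a list column rather than a string. As a rough base R sketch of the same change-detection idea (the names `changed`, `l` and `out` are illustrative, and it assumes the rows of each particle are contiguous and already in time order):

```r
df <- data.frame(
  particle = c(rep("X", 5), rep("Y", 3), rep("Z", 4)),
  state    = c("A", "A", "B", "C", "A", "A", "B", "B", "B", "C", "A", "A"),
  energy   = c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
)

n <- nrow(df)

# TRUE on the first row, and wherever the particle or the state differs
# from the previous row
changed <- c(TRUE,
             df$particle[-1] != df$particle[-n] | df$state[-1] != df$state[-n])

# Energies at the change points, split into one vector per particle
l <- split(df$energy[changed], df$particle[changed])

# One row per particle, with the vectors stored in a list column
out <- data.frame(particle = names(l))
out$energy <- unname(l)

out$energy[[1]]
# [1] 9 7 5 0
```

Keeping `energy` as a list column means each element is still a numeric vector, so it can be processed further with lapply or sapply.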
data.table approach
sample data
#stolen from JasonAizkalns's answer
df <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
df$energy <- c(9, 8, 7, 5, 0, 1, 7, 7, 3, 9, 5, 6)
code
library( data.table )
#create data.table
dt <- as.data.table(df)
#use `uniqlist` to get rownumbers where the value of `state` changes,
# then get these rows into a subset
result <- dt[ data.table:::uniqlist(dt[, c("particle", "state")]), ]
#split the resulting `energy`-column by the contents of the `particle`-column
l <- split( result$energy, result$particle)
# $X
# [1] 9 7 5 0
#
# $Y
# [1] 1 7
#
# $Z
# [1] 3 9 5
#create final output
data.table( particle = names(l), energy = l )
# particle energy
# 1: X 9,7,5,0
# 2: Y 1,7
# 3: Z 3,9,5
Another possible data.table approach
library(data.table)
setDT(DF)[, .(energy=.(.SD[, first(energy), by=.(rleid(state))]$V1)), by=.(particle)]
output:
particle energy
1: X 9,4,6,9
2: Y 2,9
3: Z 7,6,1
data:
set.seed(0L)
DF <- data.frame( particle = c(rep("X",5),rep("Y",3),rep("Z",4)),
time = c(1:5,1:3,1:4), state = c(c("A","A","B","C","A"),c("A","B","B"),
c("B","C","A","A")), energy = round(runif(12,0,10)))
DF
# particle time state energy
# 1 X 1 A 9
# 2 X 2 A 3
# 3 X 3 B 4
# 4 X 4 C 6
# 5 X 5 A 9
# 6 Y 1 A 2
# 7 Y 2 B 9
# 8 Y 3 B 9
# 9 Z 1 B 7
# 10 Z 2 C 6
# 11 Z 3 A 1
# 12 Z 4 A 2
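To unpack the one-liner: rleid gives every run of identical consecutive values its own id, and first(energy) is then taken per run. A small base R sketch of that idea for particle X from the data above (the `run_id` and `first_energy` names are illustrative, and the cumsum construction is a stand-in for rleid, not data.table's implementation):

```r
# Particle X from the set.seed(0L) data
state  <- c("A", "A", "B", "C", "A")
energy <- c(9, 3, 4, 6, 9)

# A new run starts on the first element and wherever the state changes
run_id <- cumsum(c(TRUE, state[-1] != state[-length(state)]))
run_id
# [1] 1 1 2 3 4

# First energy of each run
first_energy <- energy[!duplicated(run_id)]
first_energy
# [1] 9 4 6 9
```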
I would like to do calculations across columns in my data, by row. The calculations are "moving" in that I would like to know the difference between two numbers in column 1 and 2, then columns 3 and 4, and so on. I have looked at "loops" and "rollapply" functions, but could not figure this out. Below are three options of what was attempted. Only the third option gives me the result I am after, but it is very lengthy code and also does not allow for automation (the input data will be a much larger matrix, so typing out the calculation for each row won't work).
Please advise on how to make this code shorter, and/or on any other packages/functions worth checking out that will do the job. THANK YOU!
MY TEST SCRIPT IN R + errors/results
Sample data set
a<- c(1,2,3, 4, 5)
b<- c(1,2,3, 4, 5)
c<- c(1,2,3, 4, 5)
test.data <- data.frame(cbind(a,b*2,c*10))
names(test.data) <- c("a", "b", "c")
Sample of calculations attempted:
OPTION 1
require(zoo)
rollapply(test.data, 2, diff, fill = NA, align = "right", by.column=FALSE)
RESULT 1 (not what we're after. What we need is at the bottom of Option 3)
# a b c
#[1,] NA NA NA
#[2,] 1 2 10
#[3,] 1 2 10
#[4,] 1 2 10
#[5,] 1 2 10
OPTION 2:
results <- for (i in 1:length(nrow(test.data))) {
diff(as.numeric(test.data[i,]), lag=1)
print(results)}
RESULT 2: (again not what we're after)
# NULL
OPTION 3: works, but it is long-winded, so I would like to simplify the code and make it generic for any number of observations in my dataframe and any number of columns (i.e. more than 3). I would like to "automate" the steps below for a known number of observations (i.e. rows).
row1=diff(as.numeric(test.data[1,]), lag=1)
row2=diff(as.numeric(test.data[2,]), lag=1)
row3=diff(as.numeric(test.data[3,]), lag=1)
row4=diff(as.numeric(test.data[4,]), lag=1)
row5=diff(as.numeric(test.data[5,]), lag=1)
results.OK=cbind.data.frame(row1, row2, row3, row4, row5)
transpose.results.OK=data.frame(t(as.matrix(results.OK)))
names(transpose.results.OK)=c("diff.ab", "diff.bc")
Final.data = transpose.results.OK
print(Final.data)
RESULT 3: (THIS IS WHAT I WOULD LIKE TO GET, "row1" can be "obs1" etc)
# diff.ab diff.bc
#row1 1 8
#row2 2 16
#row3 3 24
#row4 4 32
#row5 5 40
THE END
Here are the 3 options redone plus a 4th option:
# 1
library(zoo)
d <- t(rollapplyr(t(test.data), 2, diff, by.column = FALSE))
# 2
d <- test.data[-1]
for (i in 1:nrow(test.data)) d[i, ] <- diff(unlist(test.data[i, ]))
# 3
d <- t(diff(t(test.data)))
# 4 - also this works
nc <- ncol(test.data)
d <- test.data[-1] - test.data[-nc]
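To set the names for any of them (taking both halves of each name from test.data, so it does not matter whether d already has column names):
colnames(d) <- paste0("diff.", head(names(test.data), -1), tail(names(test.data), -1))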
(2) and (4) give this data.frame and (1) and (3) give the corresponding matrix:
> d
diff.ab diff.bc
1 1 8
2 2 16
3 3 24
4 4 32
5 5 40
Use as.matrix or as.data.frame if you want the other.
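As a quick check on the sample data, here is a sketch comparing option 4 with option 3 (the names `d3` and `d4` are introduced here only for illustration):

```r
a <- c(1, 2, 3, 4, 5)
test.data <- data.frame(a = a, b = a * 2, c = a * 10)

nc <- ncol(test.data)

# Option 4: every column except the first, minus every column except the last
d4 <- test.data[-1] - test.data[-nc]
names(d4) <- paste0("diff.", names(test.data)[-nc], names(test.data)[-1])

# Option 3: diff() works down the rows of a matrix, so transpose, diff, transpose back
d3 <- t(diff(t(test.data)))

d4$diff.bc
# [1]  8 16 24 32 40

# Same numbers either way
all(as.matrix(d4) == d3)
# [1] TRUE
```

Neither version refers to the number of rows or columns explicitly, so both should carry over to a wider or longer data frame as long as all columns are numeric.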
An apply-based solution that runs diff on each row:
# Result
res <- t(apply(test.data, 1, diff)) #One can change it to data.frame
# Name of the columns
colnames(res) <- paste0("diff.", head(names(test.data), -1),
tail(names(test.data), -1))
res
# diff.ab diff.bc
# [1,] 1 8
# [2,] 2 16
# [3,] 3 24
# [4,] 4 32
# [5,] 5 40
I was trying to subset a data frame based on whether the values in one vector are in another vector:
x <- c( 1,2,3,1,2,3 )
df <- data.frame(x=x,y=x)
df[ df$x == c(1,2), ]
expecting to get this:
x y
1 1 1
2 2 2
4 1 1
5 2 2
but I didn't, I got this:
x y
1 1 1
2 2 2
Disregarding the fact that I really wanted this (occurred to me a minute later):
df[ df$x %in% c(1,2), ]
What is the logic behind the result of this:
x == c(1,2)
being this:
[1] TRUE TRUE FALSE FALSE FALSE FALSE
I don't really get it. I am aware that this is likely a duplicate, but I couldn't find one.
It is based on the recycling of c(1,2) to the length of 'x', i.e. we are comparing df$x with
rep(c(1,2),length.out= nrow(df))
#[1] 1 2 1 2 1 2
df$x ==rep(c(1,2),length.out= nrow(df))
#[1] TRUE TRUE FALSE FALSE FALSE FALSE
That is, each element of 'x' is compared with the element at the same position in the recycled c(1,2), rather than checking whether each element of 'x' is contained in c(1,2).
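A short sketch contrasting the two behaviours on the vector from the question (the names `recycled`, `elementwise` and `membership` are illustrative):

```r
x <- c(1, 2, 3, 1, 2, 3)

# What == effectively compares against: c(1, 2) repeated to the length of x
recycled <- rep(c(1, 2), length.out = length(x))
recycled
# [1] 1 2 1 2 1 2

# Position-by-position comparison
elementwise <- x == c(1, 2)
elementwise
# [1]  TRUE  TRUE FALSE FALSE FALSE FALSE

# Set membership: is each element of x one of 1 or 2?
membership <- x %in% c(1, 2)
membership
# [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
```

Here the length of x is a multiple of 2, so the recycling is silent; when the longer length is not a multiple of the shorter one, R still recycles but issues a warning.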
I have an R dataframe that I need to subset data from. The subsetting will be based on two columns in the dataframe. For example:
A <- c(1,2,3,3,5,1)
B <- c(6,7,8,9,8,8)
Value <- c(9,5,2,1,2,2)
DATA <- data.frame(A,B,Value)
This is how DATA looks
A B Value
1 6 9
2 7 5
3 8 2
3 9 1
5 8 2
1 8 2
I want those rows of data for which (A,B) combination is (1,6) and (3,8). These pairs are stored as individual (ordered) vectors of A and B:
AList <- c(1,3)
BList <- c(6,8)
Now, I am trying to subset the data basically by comparing if A column is present in AList AND B column is present in BList
DATA[(DATA$A %in% AList & DATA$B %in% BList),]
The subsetted result is shown below. In addition to the value pairs (1,6) and (3,8) I am also getting (1,8). Basically, this filter has given me value pairs for all combinations in AList and BList. How do I restrict it to just (1,6) and (3,8)?
A B Value
1 6 9
3 8 2
1 8 2
This is my desired result:
A B Value
1 6 9
3 8 2
This is a job for merge:
KEYS <- data.frame(A = AList, B = BList)
merge(DATA, KEYS)
# A B Value
# 1 1 6 9
# 2 3 8 2
Edit: after the OP expressed his preference for a logical vector in the comments below, I would suggest one of the following.
Use merge:
df.in.df <- function(x, y) {
common.names <- intersect(names(x), names(y))
idx <- seq_len(nrow(x))
x <- x[common.names]
y <- y[common.names]
x <- transform(x, .row.idx = idx)
idx %in% merge(x, y)$.row.idx
}
or interaction:
df.in.df <- function(x, y) {
common.names <- intersect(names(x), names(y))
interaction(x[common.names]) %in% interaction(y[common.names])
}
In both cases:
df.in.df(DATA, KEYS)
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
You could try match with an appropriate nomatch argument:
sub <- match(DATA$A, AList, nomatch=-1) == match(DATA$B, BList, nomatch=-2)
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2
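To see why the two nomatch values differ, it may help to look at the intermediate positions. This sketch uses the data from the question; `pos_a`, `pos_b` and `keep` are names introduced only for illustration:

```r
A <- c(1, 2, 3, 3, 5, 1)
B <- c(6, 7, 8, 9, 8, 8)
AList <- c(1, 3)
BList <- c(6, 8)

# Position of each A value in AList (-1 when absent)
pos_a <- match(A, AList, nomatch = -1)
pos_a
# [1]  1 -1  2  2 -1  1

# Position of each B value in BList (-2 when absent)
pos_b <- match(B, BList, nomatch = -2)
pos_b
# [1]  1 -2  2 -2  2  2

# A row is kept only when both values sit at the same position, i.e. form a listed pair.
# Using different nomatch values means a row missing from both lists (row 2) compares
# -1 with -2 and gives FALSE, rather than NA (the default nomatch) or TRUE.
keep <- pos_a == pos_b
keep
# [1]  TRUE FALSE  TRUE FALSE FALSE FALSE
```

One caveat: match returns only the first position of a value, so this approach assumes no value is repeated within AList or within BList.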
A paste based approach would also be possible:
sub <- paste(DATA$A, DATA$B, sep=":") %in% paste(AList, BList, sep=":")
sub
# [1] TRUE FALSE TRUE FALSE FALSE FALSE
DATA[sub,]
# A B Value
#1 1 6 9
#3 3 8 2