Difference of maximum and minimum by group - r

I have the following data frame
v1 v2 v3
a 2 5
b 5 3
c 2 1
d 2 1
e 1 2
a 2 4
a 8 1
e 1 6
b 0 1
c 2 8
d 1 5
Using R, I want to compute, for every unique value of V1, the difference between the max V3 and the min V3.
Expected :
Val max_min
a “5-1”
b “3-1”
c “8-1”
d “5-1”
e “6-2”
I tried:
ddply(fil1, c("V1"), summarise, max(V3) - min(V1))
but I don't get the expected result. It gives the same value in max_min for every group: the max(V3) - min(V3) of the whole data frame rather than of each group.
I have also tried average, with no success.

Or in base R,
MAX = aggregate(df$v3, list(df$v1), max)
MIN = aggregate(df$v3, list(df$v1), min)
MAX[,2] - MIN[,2]
[1] 4 2 7 4 4
A one liner of the above would be,
aggregate(v3 ~ v1, df, FUN = function(i)max(i) - min(i))
# v1 v3
#1 a 4
#2 b 2
#3 c 7
#4 d 4
#5 e 4
We can also use tapply which will display the output as follows,
with(df, tapply(v3, list(v1), function(i) max(i)-min(i)))
#a b c d e
#4 2 7 4 4

You could also go for split:
lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
# $a
# [1] 4
# $b
# [1] 2
# $c
# [1] 7
# $d
# [1] 4
# $e
# [1] 4
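If you insist on getting the output in the format you defined (the result is stored in res rather than ls, so the base function ls() is not masked):
res <- lapply(split(df$v3, df$v1), function(a) max(a)-min(a))
data.frame(Val=names(res), max_min=unlist(res))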
# Val max_min
#a a 4
#b b 2
#c c 7
#d d 4
#e e 4

If you're using dplyr you can use the summarise function. In base R, range returns a vector containing the min and max values, and diff finds the difference. So a one-liner is:
df %>% group_by(V1) %>% summarise(max_min=diff(range(V3)))
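
As a minimal sketch of the same diff(range(...)) idea without any packages, the snippet below rebuilds the question's data (assuming lowercase column names v1, v2, v3, as in the table shown) and applies it per group with tapply. The names spread and result are only illustrative.

```r
# Data from the question, assuming lowercase column names
df <- data.frame(
  v1 = c("a", "b", "c", "d", "e", "a", "a", "e", "b", "c", "d"),
  v2 = c(2, 5, 2, 2, 1, 2, 8, 1, 0, 2, 1),
  v3 = c(5, 3, 1, 1, 2, 4, 1, 6, 1, 8, 5)
)

# range() returns c(min, max); diff() of that pair is max - min
spread <- tapply(df$v3, df$v1, function(x) diff(range(x)))

# Reshape the named result into the two-column layout the question asked for
result <- data.frame(Val = names(spread), max_min = as.vector(spread))
result
```

Group a has v3 values 5, 4 and 1, so its entry is 5 - 1 = 4, which agrees with the aggregate output above.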

Related

R function to replace tricky merge in Excel (vlookup + hlookup)

I have a tricky merge that I usually do in Excel via various formulas and I want to automate with R.
I have 2 dataframes, one called inputs looks like this:
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
And another called df
id v
1 1
1 2
1 3
2 2
3 1
I would like to combine them based on the id and v values such that I get
id v key
1 1 A
1 2 A
1 3 C
2 2 D
3 1 T
So I'm matching on id and then on the column from v1 through v3: in the first row of the result, id = 1 and v = 1, so the key comes from column v1. In Excel I do this by creatively combining VLOOKUP and HLOOKUP, but I want to make this simpler in R. The data frames shown are simplified versions: I have more records, and the columns go from v1 up to v50.
Thanks!
You could use pivot_longer:
library(tidyr)
library(dplyr)
key %>% pivot_longer(!id,names_prefix='v',names_to = 'v') %>%
mutate(v=as.numeric(v)) %>%
inner_join(df)
Joining, by = c("id", "v")
# A tibble: 5 × 3
id v value
<int> <dbl> <chr>
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
Data:
key <- read.table(text="
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F",header=T)
df <- read.table(text="
id v
1 1
1 2
1 3
2 2
3 1 ",header=T)
You can use two-column matrices as index arguments to "[", so this is a one-liner. (Note that the names of the data objects are d1 and d2. I'm opposed to using df as a data object name.)
d1[-1][ data.matrix(d2)] # returns [1] "A" "A" "C" "D" "T"
So full solution is:
cbind( d2, key= d1[-1][ data.matrix(d2)] )
id v key
1 1 1 A
2 1 2 A
3 1 3 C
4 2 2 D
5 3 1 T
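
The matrix-indexing answer does not show how d1 and d2 were built, so here is a hedged, self-contained sketch. It assumes d1 holds the inputs table and d2 the (id, v) table from the question. It also uses match() to turn each id into a row position, which is an addition of this sketch: the one-liner above relies on id being equal to the row number.

```r
# Assumed definitions: d1 is the wide "inputs" table, d2 the (id, v) pairs
d1 <- data.frame(
  id = 1:5,
  v1 = c("A", "B", "T", "A", "F"),
  v2 = c("A", "D", "T", "F", "F"),
  v3 = c("C", "F", "A", "C", "F")
)
d2 <- data.frame(id = c(1, 1, 1, 2, 3), v = c(1, 2, 3, 2, 1))

# Character matrix holding only the value columns v1..v3
m <- as.matrix(d1[-1])

# Each row of idx is one (row, column) pair to look up in m
idx <- cbind(match(d2$id, d1$id), d2$v)

d2$key <- m[idx]
d2
```

With match(), the lookup should still work if the ids in d1 are not 1, 2, 3, ... in order, as long as each id appears once.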
Try this:
x <- "
id v1 v2 v3
1 A A C
2 B D F
3 T T A
4 A F C
5 F F F
"
y <- "
id v
1 1
1 2
1 3
2 2
3 1
"
df <- read.table(textConnection(x) , header = TRUE)
df2 <- read.table(textConnection(y) , header = TRUE)
key <- c()
for (i in 1:nrow(df2)) {
key <- append(df[df2$id[i],(df2$v[i] + 1L)] , key)
}
df2$key <- rev(key)
df2
># id v key
># 1 1 1 A
># 2 1 2 A
># 3 1 3 C
># 4 2 2 D
># 5 3 1 T
Created on 2022-06-06 by the reprex package (v2.0.1)

Compare values in a grouped data frame with corresponding value in a vector

Let's say I got a data.frame like the following:
u <- as.numeric(rep(rep(1:5,3)))
w <- as.factor(c(rep("a",5), rep("b",5), rep("c",5)))
q <- data.frame(w,u)
q
w u
1 a 1
2 a 2
3 a 3
4 a 4
5 a 5
6 b 1
7 b 2
8 b 3
9 b 4
10 b 5
11 c 1
12 c 2
13 c 3
14 c 4
15 c 5
and the vector:
v <- c(2,3,1)
Now I want to find the first row in the respective group [i] where the value in column "u" is bigger than the value [i] from vector "v".
The result should look like this:
1 a 3
2 b 4
3 c 2
I tried:
fun <- function (m) {
first(which(m[,2]>v))
}
ddply(q, .(w), summarise, fun(q))
and got as a result:
w fun(q)
1 a 3
2 b 3
3 c 3
Thus it seems like, ddply is only taking the first value from the vector "v".
Does anyone know how to solve this?
We can join the vector by creating a data.frame with 'w' as the unique values from 'w' column of 'q', then do a group_by 'w' and get the first row index where u is greater than the corresponding 'vector' column value
library(dplyr)
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
summarise(n = which(u > new)[1])
# // or use findInterval
#summarise(n = findInterval(new[1], u)+1)
-output
# A tibble: 3 x 2
# w n
#* <fct> <int>
#1 a 3
#2 b 4
#3 c 2
or use Map after splitting the data by 'w' column
Map(function(x, y) which(x$u > y)[1], split(q,q$w), v)
#$a
#[1] 3
#$b
#[1] 4
#$c
#[1] 2
The OP mentioned that the comparison starts from the beginning of the data. That is not the case here, because of the group_by operation: if we create a sequence column, it resets at each group
q %>%
left_join(data.frame(w = unique(q$w), new = v)) %>%
group_by(w) %>%
mutate(rn = row_number())
Joining, by = "w"
# A tibble: 15 x 4
# Groups: w [3]
w u new rn
<fct> <dbl> <dbl> <int>
1 a 1 2 1
2 a 2 2 2
3 a 3 2 3
4 a 4 2 4
5 a 5 2 5
6 b 1 3 1
7 b 2 3 2
8 b 3 3 3
9 b 4 3 4
10 b 5 3 5
11 c 1 1 1
12 c 2 1 2
13 c 3 1 3
14 c 4 1 4
15 c 5 1 5
Using data.table: for each 'w' (by = w), subset 'v' with the group index .GRP. Compare the value with 'u' (v[.GRP] < u). Get the index for the first TRUE (which.max):
library(data.table)
setDT(q)[ , which.max(v[.GRP] < u), by = w]
# w V1
# 1: a 3
# 2: b 4
# 3: c 2
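
For comparison, a base R sketch of the same per-group lookup, pairing each group's u values with the matching element of v through mapply. The name first_above is only illustrative. The last two lines illustrate one difference between the two idioms used in the answers: when no value qualifies, which(...)[1] gives NA, whereas which.max() on an all-FALSE vector gives 1.

```r
q <- data.frame(
  w = factor(rep(c("a", "b", "c"), each = 5)),
  u = rep(1:5, 3)
)
v <- c(2, 3, 1)

# split() yields one vector of u per level of w; mapply pairs group i with v[i]
first_above <- mapply(
  function(u, limit) which(u > limit)[1],
  split(q$u, q$w),
  v
)
first_above

# Behaviour when nothing exceeds the limit
which(c(FALSE, FALSE))[1]   # NA
which.max(c(FALSE, FALSE))  # 1, which could be mistaken for a real match
```

This assumes, as the answers above do, that v is ordered the same way as the levels of w.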

How to custom flatten a data frame? [duplicate]

I have a data frame as follows:
df <- data.frame(x=c('a,b,c','d,e','f'),y=c(1,2,3))
df
> df
x y
1 a,b,c 1
2 d,e 2
3 f 3
I can get the flattened df$x like this:
unique(unlist(strsplit(as.character(df$x), ",")))
[1] "a" "b" "c" "d" "e" "f"
What would be the best way to transform my input df into:
x y
a 1
b 1
c 1
d 2
e 2
f 3
Basically flatten df$x and individually assign its corresponding y
If you are working on data.frame, I recommend using tidyr
df <- data.frame(x=c('a,b,c','d,e','f'),y=c(1,2,3),stringsAsFactors = F)
library(tidyr)
df %>%
transform(x= strsplit(x, ",")) %>%
unnest(x)
y x
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 3 f
sapply(unlist(strsplit(as.character(df$x), ",")), function(ss)
df$y[which(grepl(pattern = ss, x = df$x))])
#a b c d e f
#1 1 1 2 2 3
If you want a dataframe
do.call(rbind, lapply(1:NROW(df), function(i)
setNames(data.frame(unlist(strsplit(as.character(df$x[i]), ",")), df$y[i]),
names(df))))
# x y
#1 a 1
#2 b 1
#3 c 1
#4 d 2
#5 e 2
#6 f 3
FWIW, you could also repeat the row indices according to how many elements each x value has:
df <- data.frame(x=c('a,b,c','d,e','f'),y=c(1,2,3),stringsAsFactors = F)
df[,1] <- strsplit(df[,1],",")
cbind(x=unlist(df[,1]),df[rep(1:nrow(df), lengths(df[,1])),-1,F])
# x y
# 1 a 1
# 1.1 b 1
# 1.2 c 1
# 2 d 2
# 2.1 e 2
# 3 f 3
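
The repeat-by-lengths idea can also be written without modifying df in place. A minimal sketch, where parts and flat are illustrative names:

```r
df <- data.frame(x = c("a,b,c", "d,e", "f"), y = c(1, 2, 3),
                 stringsAsFactors = FALSE)

# One character vector per row of df
parts <- strsplit(df$x, ",", fixed = TRUE)

# Repeat each y as many times as its row has pieces
flat <- data.frame(
  x = unlist(parts),
  y = rep(df$y, lengths(parts)),
  stringsAsFactors = FALSE
)
flat
```

Because a new data frame is built from scratch, the row names come out as 1 to 6 rather than 1, 1.1, 1.2 and so on.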

Formatting the output in R

I have a set of data which shows the visit ID and the subject name
visit<-c(1,2,3,1,2,1,1,2,3,1,2,3)
subject<-c("A","A","A","B","B","C","D","D","D","E","E","E")
data<-data.frame(visit=visit,subject=subject)
I attempted to work out the latest visit ID for each subject:
tapply(visit,subject,max)
And I get this output:
A B C D E
3 2 1 3 3
I am wondering if there is any way that I can change the output such that it becomes:
A 3
B 2
C 1
D 3
E 3
Thank you
You can try aggregate
aggregate(visit~subject, data, max)
# subject visit
#1 A 3
#2 B 2
#3 C 1
#4 D 3
#5 E 3
Or from tapply
res <- tapply(visit,subject,max)
data.frame(subject=names(res), visit=res)
Or data.table
library(data.table)
setDT(data)[, list(visit=max(visit)), by=subject]
And a dplyr solution would be:
library(dplyr)
data %>% group_by(subject) %>% summarize(max = max(visit))
## Source: local data frame [5 x 2]
## subject max
## 1 A 3
## 2 B 2
## 3 C 1
## 4 D 3
## 5 E 3
It may feel dirty, but using the base function as.matrix (or matrix for that matter) will give you what you need.
> as.matrix(tapply(visit,subject,max))
[,1]
A 3
B 2
C 1
D 3
E 3
You can easily do this in base R with stack:
stack(tapply(visit, subject, max))
# values ind
# 1 3 A
# 2 2 B
# 3 1 C
# 4 3 D
# 5 3 E
(Note: In this case, the values for "visit" and "subject" aren't actually coming from your data.frame. Just thought you should know!)
(Second note: You could also do data.frame(as.table(tapply(visit, subject, max))) but that is more deceptive than using stack so may lead to less readable code later on.)
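
Following up on the first note, here is a small sketch that takes the values from the data frame itself via with() and names the output columns explicitly. The data frame is called visits here only to avoid masking the base function data(); that name is not from the question.

```r
visits <- data.frame(
  visit   = c(1, 2, 3, 1, 2, 1, 1, 2, 3, 1, 2, 3),
  subject = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "E", "E", "E")
)

# with() makes tapply read the columns of visits, not free-standing vectors
res <- with(visits, tapply(visit, subject, max))

# names(res) carries the subjects; as.vector() drops the array attributes
out <- data.frame(subject = names(res), visit = as.vector(res))
out
```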

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for...to loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet)
Edit: For those of you that would also happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal<-function(data,factors,responses,responsename) {
library(reshape2)
data$ID=1:nrow(data) # add an ID to preserve row order
dL=melt(data, id.vars=c("ID", factors)) # `melt` the data
dL=dL[order(dL$ID), ] # sort the molten data
dL[,responsename]=match(dL$variable,responses) # convert responses to ordinal scores
dL[,responsename]=factor(dL[,responsename],ordered=T)
dL=dL[dL$value != 0, ] # drop rows where `value == 0`
out=dL[rep(rownames(dL), dL$value), c(factors, responsename)] # use `rep` to "expand" `data.frame` & drop unwanted columns
rownames(out) <- NULL
return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
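
If reshape2 is not available, the same expansion can be sketched in base R alone. The snippet below types in the six rows shown in the question rather than reading the CSV (an assumption made so that it runs offline) and uses rep() twice: once to repeat the row index and once to repeat the ordinal level. The names wide, counts, times, row_id, level and long are illustrative.

```r
# The six rows shown in the question, typed in by hand
wide <- data.frame(
  factor1 = c("a", "a", "b", "b", "c", "c"),
  factor2 = c("a", "b", "a", "b", "a", "b"),
  count_1 = c(1, 3, 1, 2, 3, 1),
  count_2 = c(2, 0, 2, 2, 4, 1),
  count_3 = c(0, 0, 3, 0, 0, 0)
)

counts <- as.matrix(wide[, c("count_1", "count_2", "count_3")])

# Flatten row by row: all counts of row 1, then all counts of row 2, ...
times  <- as.vector(t(counts))
row_id <- rep(seq_len(nrow(wide)), each = ncol(counts))
level  <- rep(seq_len(ncol(counts)), times = nrow(wide))

# Repeat each (row, level) pair as often as its count; zero counts vanish
long <- data.frame(
  wide[rep(row_id, times), c("factor1", "factor2")],
  aggression = factor(rep(level, times), levels = 1:3, ordered = TRUE)
)
rownames(long) <- NULL
long
```

The counts in the example sum to 25, so long should have 25 rows, the same number as the expected output in the question.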

Resources