Enumerate quantiles in reverse order - R

I'm trying to get the quantile number of a column in a data frame, but in reverse order. I want the highest number to be in quantile number 1.
Here is what I have so far:
> x <- c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
> x <- data.frame(x)
> within(x, Q <- as.integer(cut(x, quantile(x, probs = 0:5/5, na.rm = TRUE),
+                               include.lowest = TRUE)))
       x Q
1   10.0 1
2   12.0 1
3   75.0 3
4   89.0 4
5   25.0 2
6  100.0 4
7   67.0 2
8   89.0 4
9    4.0 1
10  67.0 2
11 120.2 5
12 140.5 5
13 170.5 5
14  78.1 3
And what I want to get is:
       x Q
1   10.0 5
2   12.0 5
3   75.0 3
4   89.0 2
5   25.0 4
6  100.0 2
7   67.0 4
8   89.0 2
9    4.0 5
10  67.0 4
11 120.2 1
12 140.5 1
13 170.5 1
14  78.1 3

One way to do this is to specify the reversed labels in the cut() call. If you want Q to be an integer, you then need to coerce the factor labels to character and then to integer; calling as.integer() directly on the factor would return the underlying level codes rather than the labels.
result <- within(x, Q <- as.integer(as.character(cut(x,
                     quantile(x, probs = 0:5/5, na.rm = TRUE),
                     labels = c(5, 4, 3, 2, 1),
                     include.lowest = TRUE))))
head(result)
    x Q
1  10 5
2  12 5
3  75 3
4  89 2
5  25 4
6 100 2
Your data:
x <- c(10, 12, 75, 89, 25, 100, 67, 89, 4, 67, 120.2, 140.5, 170.5, 78.1)
x <- data.frame(x)
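Alternatively, since cut() here always yields bins numbered 1 through 5, a minimal sketch (same data and quantile call as above) keeps the integer result and simply flips it arithmetically:
# Reverse the bin index directly: the bins run 1..5, so 6 - Q flips the order
result2 <- within(x, Q <- 6L - as.integer(cut(x, quantile(x, probs = 0:5/5, na.rm = TRUE),
                                              include.lowest = TRUE)))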

Related

Merge overlapping datasets by column identifier?

I am trying to merge/join two datasets that share some column names but have no rows in common. I would like to merge them by column name and have the rows from the smaller dataset appended to the larger one, filling in NA for every column the smaller dataset doesn't have. I feel like this is something super easy that I'm just somehow not able to figure out.
2 tiny sample datasets:
df1 <- data.frame(team = c('A', 'B', 'C', 'D'),
                  points = c(88, 98, 104, 100),
                  league = c('Alpha', 'Beta', 'Gamma', 'Delta'))
  team points league
1    A     88  Alpha
2    B     98   Beta
3    C    104  Gamma
4    D    100  Delta
df2 <- data.frame(team = c('L', 'M', 'N', 'O', 'P', 'Q'),
                  points = c(43, 66, 77, 83, 12, 12),
                  league = c('Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
                  rebounds = c(22, 31, 29, 20, 33, 44),
                  fouls = c(1, 3, 2, 4, 5, 1))
  team points  league rebounds fouls
1    L     43 Epsilon       22     1
2    M     66    Zeta       31     3
3    N     77     Eta       29     2
4    O     83   Theta       20     4
5    P     12    Iota       33     5
6    Q     12   Kappa       44     1
the output I would like to get would be:
df3 <- data.frame(team = c('A', 'B', 'C', 'D', 'L', 'M', 'N', 'O', 'P', 'Q'),
                  points = c(88, 98, 104, 100, 43, 66, 77, 83, 12, 12),
                  league = c('Alpha', 'Beta', 'Gamma', 'Delta', 'Epsilon', 'Zeta', 'Eta', 'Theta', 'Iota', 'Kappa'),
                  rebounds = c(NA, NA, NA, NA, 22, 31, 29, 20, 33, 44),
                  fouls = c(NA, NA, NA, NA, 1, 3, 2, 4, 5, 1))
   team points  league rebounds fouls
1     A     88   Alpha       NA    NA
2     B     98    Beta       NA    NA
3     C    104   Gamma       NA    NA
4     D    100   Delta       NA    NA
5     L     43 Epsilon       22     1
6     M     66    Zeta       31     3
7     N     77     Eta       29     2
8     O     83   Theta       20     4
9     P     12    Iota       33     5
10    Q     12   Kappa       44     1
I tried transposing the dfs, but because they have no rows in common that does not work either. I thought about making an index, but I'm just learning about those and I'm not sure how I would do it or if that's the correct move.
Use full_join and arrange
library(dplyr)
full_join(df2, df1) %>%
  arrange(team)
Output:
   team points  league rebounds fouls
1     A     88   Alpha       NA    NA
2     B     98    Beta       NA    NA
3     C    104   Gamma       NA    NA
4     D    100   Delta       NA    NA
5     L     43 Epsilon       22     1
6     M     66    Zeta       31     3
7     N     77     Eta       29     2
8     O     83   Theta       20     4
9     P     12    Iota       33     5
10    Q     12   Kappa       44     1
Or with rows_upsert
rows_upsert(df2, df1, by = c("team", "points", "league"))
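rows_upsert() updates the rows of df2 that match df1 on the key columns and inserts the ones that don't; since none of df1's teams appear in df2, all four rows are simply appended, with NA filled in for rebounds and fouls.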
We could use bind_rows()
When row-binding, columns are matched by name, and any missing columns will be filled with NA:
library(dplyr)
bind_rows(df1, df2)
   team points  league rebounds fouls
1     A     88   Alpha       NA    NA
2     B     98    Beta       NA    NA
3     C    104   Gamma       NA    NA
4     D    100   Delta       NA    NA
5     L     43 Epsilon       22     1
6     M     66    Zeta       31     3
7     N     77     Eta       29     2
8     O     83   Theta       20     4
9     P     12    Iota       33     5
10    Q     12   Kappa       44     1
Using base R, you could add the missing columns to df1 using setdiff() and then rbind the two together:
df1[, setdiff(names(df2), names(df1))] <- NA
rbind(df1, df2)
Output:
#    team points  league rebounds fouls
# 1     A     88   Alpha       NA    NA
# 2     B     98    Beta       NA    NA
# 3     C    104   Gamma       NA    NA
# 4     D    100   Delta       NA    NA
# 5     L     43 Epsilon       22     1
# 6     M     66    Zeta       31     3
# 7     N     77     Eta       29     2
# 8     O     83   Theta       20     4
# 9     P     12    Iota       33     5
# 10    Q     12   Kappa       44     1
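Another base R sketch (assuming the original, unmodified df1 and df2 from the question): merge() with all = TRUE performs a full outer join on the shared columns and fills the missing ones with NA:
# Full outer join on the common columns (team, points, league);
# df1's rows get NA for rebounds and fouls, and the result sorts by team
merge(df1, df2, all = TRUE)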

Adjusting the value of a column based on duplicate rows - iteratively in R

Say I have this dataset:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1))
df
   time val
1   100   1
2   101   3
3   101   1
4   101   2
5   102   3
6   102   1
7   103   2
8   105   3
9   109   1
10  109   2
11  109   1
We can identify duplicate times in the 'time' column like this:
df[duplicated(df$time),]
What I want to do is adjust the value of time (add 0.1) if it's a duplicate. I could do it like this:
df$time <- ifelse(duplicated(df$time),df$time+.1,df$time)
    time val
1  100.0   1
2  101.0   3
3  101.1   1
4  101.1   2
5  102.0   3
6  102.1   1
7  103.0   2
8  105.0   3
9  109.0   1
10 109.1   2
11 109.1   1
The issue here is that we still have duplicate values, e.g. rows 3 and 4 (that they differ in the 'val' column is irrelevant). Rows 10 and 11 have the same problem; rows 5 and 6 are fine.
Is there a way of doing this iteratively, i.e. adding 0.1 to the first duplicate, 0.2 to the second duplicate (of the same time value), and so on? That way row 4 would become 101.2 and row 11 would become 109.2. The number of duplicates per value is unknown but will never reach 10 (usually at most 4).
As in the top answer to the related question linked by @Henrik, this uses data.table::rowid:
library(data.table)
setDT(df)
df[, time := time + 0.1*(rowid(time) - 1)]
#     time val
#  1: 100.0   1
#  2: 101.0   3
#  3: 101.1   1
#  4: 101.2   2
#  5: 102.0   3
#  6: 102.1   1
#  7: 103.0   2
#  8: 105.0   3
#  9: 109.0   1
# 10: 109.1   2
# 11: 109.2   1
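rowid(time) numbers the rows within each group of identical time values 1, 2, 3, ..., so rowid(time) - 1 is 0 for the first occurrence and the added offsets become 0, 0.1, 0.2, ... within each run of duplicates.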
Here's a one-line solution using base R:
df <- data.frame(time = c(100, 101, 101, 101, 102, 102, 103, 105, 109, 109, 109),
                 val = c(1, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1))
df$new_time <- df$time + duplicated(df$time) * 0.1 * (ave(seq_len(nrow(df)), df$time, FUN = seq_along) - 1)
df
#    time val new_time
# 1   100   1    100.0
# 2   101   3    101.0
# 3   101   1    101.1
# 4   101   2    101.2
# 5   102   3    102.0
# 6   102   1    102.1
# 7   103   2    103.0
# 8   105   3    105.0
# 9   109   1    109.0
# 10  109   2    109.1
# 11  109   1    109.2
With dplyr:
library(dplyr)
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (0:(n() - 1)) * 0.1) %>%
  ungroup() %>%
  select(-time1)
or with row_number() (suggested by Henrik):
df %>%
  group_by(time1 = time) %>%
  mutate(time = time + (row_number() - 1) * 0.1) %>%
  ungroup() %>%
  select(-time1)
Output:
    time val
1  100.0   1
2  101.0   3
3  101.1   1
4  101.2   2
5  102.0   3
6  102.1   1
7  103.0   2
8  105.0   3
9  109.0   1
10 109.1   2
11 109.2   1
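To confirm the offsets removed every tie, a quick check (a small addition, not from the original answers):
# anyDuplicated() returns 0 when every time value is unique
anyDuplicated(df$time)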

Vector of SumIfs() in R

I am looking to mimic Excel's SumIfs() function in R by creating a mean-if vector of conditional averages for each observation. I've seen a lot of examples that use aggregate() or setDT() to summarize a data frame based on fixed quantities. However, I'd like to create a vector of these summaries based on the variable inputs of each row in my data frame.
Here is an example of my data:
> a <- c('c', 'a', 'b', 'a', 'c', 'b', 'c', 'a', 'b', 'c', 'a', 'b', 'c', 'b', 'a')
> b <- c(6, 1, 1, 2, 1, 2, 2, 4, 3, 3, 5, 5, 4, 6, 6)
> c <- c(69.9, 21.2, 37, 25, 65.9, 33.1, 67, 28.4, 36, 67, 22, 37.9, 62.3, 30, 25)
> df <- data.frame(cbind(a, b, c))
> df$b <- as.numeric(as.character(df$b))
> df$c <- as.numeric(as.character(df$c))
> df
   a b    c
1  c 6 69.9
2  a 1 21.2
3  b 1 37.0
4  a 2 25.0
5  c 1 65.9
6  b 2 33.1
7  c 2 67.0
8  a 4 28.4
9  b 3 36.0
10 c 3 67.0
11 a 5 22.0
12 b 5 37.9
13 c 4 62.3
14 b 6 30.0
15 a 6 25.0
I would like to add a fourth column, df$d, that takes the average of df$c for those observations where df$a == x & y - 2 <= df$b < y where x and y are df$a and df$b, respectively, for the observation being calculated.
Doing this by hand, df$d looks like:
> df$d <- c(62.3, NA, NA, 21.2, NA, 37, 65.9, 25, 35.05, 66.45, 28.4, 36, 67, 37.9, 25.2)
> df
   a b    c     d
1  c 6 69.9 62.30
2  a 1 21.2    NA
3  b 1 37.0    NA
4  a 2 25.0 21.20
5  c 1 65.9    NA
6  b 2 33.1 37.00
7  c 2 67.0 65.90
8  a 4 28.4 25.00
9  b 3 36.0 35.05
10 c 3 67.0 66.45
11 a 5 22.0 28.40
12 b 5 37.9 36.00
13 c 4 62.3 67.00
14 b 6 30.0 37.90
15 a 6 25.0 25.20
Is there a function I can use to do this automatically? Thanks for your help!
This can be done in a straightforward way using a left self-join in SQL: join to each row of the u instance of df those rows of the v instance of df that satisfy the on condition, then average over their c values. Grouping by u.rowid (the implicit row number SQLite keeps for each row of df) ensures one output row per input row.
library(sqldf)
sqldf("select u.*, avg(v.c) as d
from df u left join df v
on u.a = v.a and v.b between u.b-2 and u.b-1
group by u.rowid")
giving:
   a b    c     d
1  c 6 69.9 62.30
2  a 1 21.2    NA
3  b 1 37.0    NA
4  a 2 25.0 21.20
5  c 1 65.9    NA
6  b 2 33.1 37.00
7  c 2 67.0 65.90
8  a 4 28.4 25.00
9  b 3 36.0 35.05
10 c 3 67.0 66.45
11 a 5 22.0 28.40
12 b 5 37.9 36.00
13 c 4 62.3 67.00
14 b 6 30.0 37.90
15 a 6 25.0 25.20
You could also just use a loop, essentially writing out exactly how you described the problem:
n <- nrow(df)
d <- numeric(n)
for (i in seq_len(n)) {
  x <- df$a[i]
  y <- df$b[i]
  d[i] <- with(df, mean(c[a == x & y - 2 <= b & b < y]))
}
all.equal(d, df$d)
#> [1] TRUE
I don't love this solution, but I couldn't think of a simpler way, because the required grouping isn't disjoint due to the condition on b. I'm very curious to see whether somebody comes up with a neater approach.
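A compact variant of the same idea (a sketch, not from the original answers) replaces the explicit loop with sapply(); note that, like the loop, it yields NaN rather than NA for rows with no matching observations:
# Same per-row conditional mean, written with sapply(); assumes df as defined above
df$d2 <- sapply(seq_len(nrow(df)), function(i) {
  mean(df$c[df$a == df$a[i] & df$b >= df$b[i] - 2 & df$b < df$b[i]])
})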

dplyr - renaming a sequence of columns with the select function

I'm trying to rename my columns in dplyr. I found that I can do it with the select() function; however, when I try to rename a sequence of selected columns, I can't get the names into the format that I want.
test = data.frame(x = rep(1:3, each = 2),
                  group = rep(c("Group 1", "Group 2"), 3),
                  y1 = c(22, 8, 11, 4, 7, 5),
                  y2 = c(22, 18, 21, 14, 17, 15),
                  y3 = c(23, 18, 51, 44, 27, 35),
                  y4 = c(21, 28, 311, 24, 227, 225))
CC <- paste("CC", seq(0, 3, 1), sep = "")
aa <- test %>%
  select(AC = x, AR = group, CC = y1:y4)
head(aa)
  AC      AR CC1 CC2 CC3 CC4
1  1 Group 1  22  22  23  21
2  1 Group 2   8  18  18  28
3  2 Group 1  11  21  51 311
4  2 Group 2   4  14  44  24
5  3 Group 1   7  17  27 227
6  3 Group 2   5  15  35 225
The problem is that even though I set the CC values to CC0, CC1, CC2, CC3, the output automatically numbers the column names starting from CC1.
How can I solve this issue?
I think you'll have an easier time creating such an expression with the select_ function:
library(dplyr)
test <- data.frame(x = rep(1:3, each = 2),
                   group = rep(c("Group 1", "Group 2"), 3),
                   y1 = c(22, 8, 11, 4, 7, 5),
                   y2 = c(22, 18, 21, 14, 17, 15),
                   y3 = c(23, 18, 51, 44, 27, 35),
                   y4 = c(21, 28, 311, 24, 227, 225))
# build out our select "translation" named vector
DQ <- paste0("y", 1:4)
names(DQ) <- paste0("DQ", seq(0, 3, 1))
# take a look
DQ
##  DQ0  DQ1  DQ2  DQ3
## "y1" "y2" "y3" "y4"
test %>%
  select_("AC" = "x", "AR" = "group", .dots = DQ)
##   AC      AR DQ0 DQ1 DQ2 DQ3
## 1  1 Group 1  22  22  23  21
## 2  1 Group 2   8  18  18  28
## 3  2 Group 1  11  21  51 311
## 4  2 Group 2   4  14  44  24
## 5  3 Group 1   7  17  27 227
## 6  3 Group 2   5  15  35 225
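Worth noting: select_() has since been deprecated in dplyr. A possible modern equivalent (a sketch, assuming a current dplyr with tidyselect) passes a named vector to all_of(), which renames columns as it selects them:
library(dplyr)
# names(DQ) become the new column names; the values are the old names
DQ <- setNames(paste0("y", 1:4), paste0("DQ", 0:3))
test %>%
  select(AC = x, AR = group, all_of(DQ))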

Extracting baseline values from long format data frame

I have the following data frame, representing longitudinal data:
df <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
                     AGE = c(59, 59, 59, 59, 59, 69, 69, 69, 69, 69),
                     BMI = c(23.8, 23.8, 23.8, 23.8, 23.8, 29.8, 29.8, 29.8, 29.8, 29.8),
                     time = c(0, 1, 3, 5, 6, 0, 1, 3, 5, 6),
                     var = c(5, 6, 1, 6, 2, 3, 2, NA, 10, 1)),
                .Names = c("ID", "AGE", "BMI", "time", "var"),
                row.names = c(NA, 10L), class = "data.frame")
> df
   ID AGE  BMI time var
1   1  59 23.8    0   5
2   1  59 23.8    1   6
3   1  59 23.8    3   1
4   1  59 23.8    5   6
5   1  59 23.8    6   2
6   2  69 29.8    0   3
7   2  69 29.8    1   2
8   2  69 29.8    3  NA
9   2  69 29.8    5  10
10  2  69 29.8    6   1
AGE and BMI are baseline variables; var is a longitudinal variable measured at different time points (time).
I would like to extract the baseline (time = 0) values of var and create a new baseline variable, var.baseline. My data frame would then look like:
> df
   ID AGE  BMI time var var.baseline
1   1  59 23.8    0   5            5
2   1  59 23.8    1   6            5
3   1  59 23.8    3   1            5
4   1  59 23.8    5   6            5
5   1  59 23.8    6   2            5
6   2  69 29.8    0   3            3
7   2  69 29.8    1   2            3
8   2  69 29.8    3  NA            3
9   2  69 29.8    5  10            3
10  2  69 29.8    6   1            3
Of course, I could transform the data to wide format, create var.baseline from the var.0 column, and then transform back to long format. However, as my real data set is significantly larger and has many more variables, this becomes cumbersome. Could you please suggest an easier way of extracting baseline data from a long-format data frame?
You can try
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(var.baseline = var[time == 0])
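One caveat: var[time == 0] assumes every ID has exactly one time == 0 row; a group with no baseline row makes mutate() fail with a size error. A slightly more defensive sketch (my variant, not from the original answer):
# match() returns NA when a group has no time == 0 row, so var.baseline becomes NA there
df %>%
  group_by(ID) %>%
  mutate(var.baseline = var[match(0, time)]) %>%
  ungroup()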
Or
library(data.table)
setDT(df)[, var.baseline := var[time == 0], by = ID]
Or using base R
merge(df, setNames(subset(df, time == 0, select = c("ID", "var")),
                   c("ID", "var.baseline")), by = "ID")
Or
library(zoo)
df$var.baseline <- with(df, na.locf(var * NA^(time != 0)))
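Here NA^(time != 0) is 1 on baseline rows (any value raised to the power 0 is 1) and NA elsewhere, so var * NA^(time != 0) keeps only the baseline values; na.locf() then carries each one forward through the rows of its ID. This assumes the rows are ordered so that the time == 0 observation comes first within each ID.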
