Reshaping a data frame and setting flag variables - r

I want to reshape my data frame from df1 to df2 as shown below:
df1 <- read.table(header = TRUE, text = "
ID TIME RATEALL CL V1 Q V2
1 0 0 2.4 10 6 20
1 1 2 0.6 10 6 25
2 0 0 3.0 15 7 30
2 5 3 3.0 16 8 15")
into a long format like this:
df2 <-
ID var TIME value
1 1 0 0
1 1 1 2
1 2 0 2.4
1 2 1 10
1 3 0 6
1 3 1 6
1 4 0 20
1 4 1 20
2 1 0 3.0
2 1 1 3.0
and so on ...
Basically, I want to create a flag variable (1 for RATEALL, 2 for CL, 3 for V1, 4 for Q, and 5 for V2) and then melt the values for each subject ID. Is there an easy way to do this in R?

You can try
df2 <- reshape2::melt(df1, id.vars = c("ID", "TIME"))
flags <- c(RATEALL = 1, CL = 2, V1 = 3, Q = 4, V2 = 5)
df2$variable <- flags[as.character(df2$variable)]
Indexing the lookup vector by the character values matters here: melt returns variable as a factor, and subsetting a named vector by a factor would use its integer codes rather than its labels.

You could use tidyr/dplyr, setting the factor levels explicitly so the flags follow the column order (RATEALL = 1, ..., V2 = 5); the default alphabetical levels would number CL as 1.
library(tidyr)
library(dplyr)
res <- gather(df1, var, value, RATEALL:V2) %>%
  mutate(var = as.numeric(factor(var, levels = unique(var))))
head(res)
# ID TIME var value
#1 1 0 1 0.0
#2 1 1 1 2.0
#3 2 0 1 0.0
#4 2 5 1 3.0
#5 1 0 2 2.4
#6 1 1 2 0.6
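With tidyr 1.0+, where gather() is superseded, pivot_longer() does the same job; a minimal sketch (flag_levels is an illustrative name):
library(dplyr)
library(tidyr)
flag_levels <- c("RATEALL", "CL", "V1", "Q", "V2")  # desired flag order
df2 <- df1 %>%
  pivot_longer(all_of(flag_levels), names_to = "var", values_to = "value") %>%
  mutate(var = match(var, flag_levels)) %>%  # 1 = RATEALL, ..., 5 = V2
  select(ID, var, TIME, value) %>%
  arrange(ID, var, TIME)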

Count number of values which are less than current value

I'd like to count, for each row, how many values in the column input are smaller than that row's value (please see the desired results below). The difficulty is that the condition depends on the current row's value, unlike the usual case where the condition is a fixed number.
data <- data.frame(input = c(1,1,1,1,2,2,3,5,5,5,5,6))
input
1 1
2 1
3 1
4 1
5 2
6 2
7 3
8 5
9 5
10 5
11 5
12 6
The results I expect are shown below. For example, for observations 5 and 6 (with value 2), there are 4 observations whose value 1 is less than 2, hence their count is 4.
input count
1 1 0
2 1 0
3 1 0
4 1 0
5 2 4
6 2 4
7 3 6
8 5 7
9 5 7
10 5 7
11 5 7
12 6 11
Edit: as I am dealing with grouped data in dplyr, the ultimate result I wish to get is shown below; that is, I want the condition to be evaluated dynamically within each group.
data <- data.frame(id = c(1,1,2,2,2,3,3,4,4,4,4,4),
                   input = c(1,1,1,1,2,2,3,5,5,5,5,6),
                   count = c(0,0,0,0,2,0,1,0,0,0,0,4))
id input count
1 1 1 0
2 1 1 0
3 2 1 0
4 2 1 0
5 2 2 2
6 3 2 0
7 3 3 1
8 4 5 0
9 4 5 0
10 4 5 0
11 4 5 0
12 4 6 4
Here is an option with the tidyverse
library(tidyverse)
data %>%
  mutate(count = map_int(input, ~ sum(.x > input)))
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
Update
With the updated data, add a group_by(id) step to the code above
data %>%
  group_by(id) %>%
  mutate(count1 = map_int(input, ~ sum(.x > input)))
# A tibble: 12 x 4
# Groups: id [4]
# id input count count1
# <dbl> <dbl> <dbl> <int>
# 1 1 1 0 0
# 2 1 1 0 0
# 3 2 1 0 0
# 4 2 1 0 0
# 5 2 2 2 2
# 6 3 2 0 0
# 7 3 3 1 1
# 8 4 5 0 0
# 9 4 5 0 0
#10 4 5 0 0
#11 4 5 0 0
#12 4 6 4 4
In base R, we can use sapply and, for each input value, count how many values are smaller than it.
data$count <- sapply(data$input, function(x) sum(x > data$input))
data
# input count
#1 1 0
#2 1 0
#3 1 0
#4 1 0
#5 2 4
#6 2 4
#7 3 6
#8 5 7
#9 5 7
#10 5 7
#11 5 7
#12 6 11
With dplyr, one way would be to use the rowwise() function, following the same logic.
library(dplyr)
data %>%
  rowwise() %>%
  mutate(count = sum(input > data$input))
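For the grouped variant from the edit, ave() can apply the same per-element count within each id in base R (a sketch, assuming the updated data with the id column):
# within each id, count the values strictly smaller than each element
data$count <- ave(data$input, data$id,
                  FUN = function(v) vapply(v, function(x) sum(v < x), numeric(1)))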
1. outer and rowSums
data$count <- with(data, rowSums(outer(input, input, `>`)))
2. table and cumsum
# cumulative count of rows per unique value
tt <- cumsum(table(data$input))
# shift down by one unique value, so each value maps to the number of strictly smaller rows
v <- setNames(c(0, head(tt, -1)), c(head(names(tt), -1), tail(names(tt), 1)))
data$count <- v[match(data$input, names(v))]
3. data.table non-equi join
A non-equi join in data.table is perhaps more efficient: count the number of rows (.N) for each match (by = .EACHI).
library(data.table)
setDT(data)
data[data, on = .(input < input), .N, by = .EACHI]
If your data is grouped by 'id', as in your update, join on that variable as well:
data[data, on = .(id, input < input), .N, by = .EACHI]
# id input N
# 1: 1 1 0
# 2: 1 1 0
# 3: 2 1 0
# 4: 2 1 0
# 5: 2 2 2
# 6: 3 2 0
# 7: 3 3 1
# 8: 4 5 0
# 9: 4 5 0
# 10: 4 5 0
# 11: 4 5 0
# 12: 4 6 4
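Since by = .EACHI returns one row per row of the inner data (the i argument) in its original order, the counts can be attached back to the table by assignment; a sketch for the grouped case:
data[, count := data[data, on = .(id, input < input), .N, by = .EACHI]$N]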

Split up grouped binomial data in R

I have data that looks like this
samplesize <- 6
group <- c(1,2,3)
total <- rep(samplesize,length(group))
outcomeTrue <- c(2,1,3)
df <- data.frame(group,total,outcomeTrue)
and would like my data to look like this
group2 <- c(rep(1,6),rep(2,6),rep(3,6))
outcomeTrue2 <- c(rep(1,2),rep(0,6-2),rep(1,1),rep(0,6-1),rep(1,3),rep(0,6-3))
df2 <- data.frame(group2,outcomeTrue2)
That is to say, I have binary data where I am told the total observations and the successful observations, but would prefer it to be organised as individual observations with their explicit outcome as 0 or 1.
Is there an easy way to do this in R, or will I need to write a loop to automate this myself?
Here is one option with the tidyverse. We uncount to expand the rows using the 'total' column, then, grouped by 'group', create a binary outcome from a logical condition on row_number() and the value of 'outcomeTrue':
library(tidyverse)
df %>%
  uncount(total) %>%
  group_by(group) %>%
  mutate(outcomeTrue = as.integer(row_number() <= outcomeTrue[1]))
# A tibble: 18 x 2
# Groups: group [3]
# group outcomeTrue
# <dbl> <int>
# 1 1 1
# 2 1 1
# 3 1 0
# 4 1 0
# 5 1 0
# 6 1 0
# 7 2 1
# 8 2 0
# 9 2 0
#10 2 0
#11 2 0
#12 2 0
#13 3 1
#14 3 1
#15 3 1
#16 3 0
#17 3 0
#18 3 0
You are almost there: just use the group2 variable with the "[" function in the x (row) position:
df[ group2 , ]
group total outcomeTrue
1 1 6 2
1.1 1 6 2
1.2 1 6 2
1.3 1 6 2
1.4 1 6 2
1.5 1 6 2
2 2 6 1
2.1 2 6 1
2.2 2 6 1
2.3 2 6 1
2.4 2 6 1
2.5 2 6 1
3 3 6 3
3.1 3 6 3
3.2 3 6 3
3.3 3 6 3
3.4 3 6 3
3.5 3 6 3
When a number or character value that matches a row name is put in the x position of "[", it replicates the entire row.
Here is a base R solution.
do.call(rbind, lapply(split(df, df$group), function(x) {
  data.frame(group2 = x$group,
             outcome2 = rep(c(1, 0), times = c(x$outcomeTrue, x$total - x$outcomeTrue)))
}))
# group2 outcome2
# 1.1 1 1
# 1.2 1 1
# 1.3 1 0
# 1.4 1 0
# 1.5 1 0
# 1.6 1 0
# 2.1 2 1
# 2.2 2 0
# 2.3 2 0
# 2.4 2 0
# 2.5 2 0
# 2.6 2 0
# 3.1 3 1
# 3.2 3 1
# 3.3 3 1
# 3.4 3 0
# 3.5 3 0
# 3.6 3 0
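The same expansion can also be fully vectorised in base R; a sketch of the idea (sequence() counts 1..total within each group, and comparing against the repeated success count yields the run of 1s followed by 0s):
df2 <- data.frame(
  group2       = rep(df$group, df$total),
  outcomeTrue2 = as.integer(sequence(df$total) <= rep(df$outcomeTrue, df$total))
)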

R - Match values using multiple identifiers (when the order of lookup IDs are random)

My question is a follow-up to this question. I am opening a new question here, since it is very different from the last one.
Suppose I have the following two datasets:
df1 = data.frame(PersonId1 = c(1,2,3,4,5,6,7,8,9,10,1),
                 PersonId2 = c(11,12,13,14,15,16,17,18,19,20,11),
                 Played_together = c(1,0,0,1,1,0,0,0,1,0,1),
                 Event = c(1,1,1,1,2,2,2,2,2,2,2),
                 Utility = c(20,-2,-5,10,30,2,1,.5,50,-1,60))
This looks like:
PersonId1 PersonId2 Played_together Event Utility
1 1 11 1 1 20.0
2 2 12 0 1 -2.0
3 3 13 0 1 -5.0
4 4 14 1 1 10.0
5 5 15 1 2 30.0
6 6 16 0 2 2.0
7 7 17 0 2 1.0
8 8 18 0 2 0.5
9 9 19 1 2 50.0
10 10 20 0 2 -1.0
11 1 11 1 2 60.0
df2 = data.frame(PersonId1 = c(11,15,9,1),
                 PersonId2 = c(1,5,19,11),
                 Played_together = c(1,1,1,1),
                 Event = c(1,2,2,2),
                 Utility = c(25,36,51,64))
This looks like:
PersonId1 PersonId2 Played_together Event Utility
1 11 1 1 1 25
2 15 5 1 2 36
3 9 19 1 2 51
4 1 11 1 2 64
I would like to do the following: look up each pair (within each event, for played_together == 1) from df1 in df2. If there is a match, take the utility value from df2 and put it in a new column in df1, called Utility_from_df2; if there is no match, put 0.
The challenge comes from the fact that the order of the persons is not consistent across df1 and df2. For example, in row 1 of df1 (event == 1, played_together == 1) we have personid1 = 1 and personid2 = 11, whereas row 1 of df2 has personid1 = 11 and personid2 = 1 for event == 1 and played_together == 1; the two refer to the same pair.
The final dataframe should look as follows:
PersonId1 PersonId2 Played_together Event Utility Utility_from_df2
1 1 11 1 1 20.0 25
2 2 12 0 1 -2.0 0
3 3 13 0 1 -5.0 0
4 4 14 1 1 10.0 0
5 5 15 1 2 30.0 36
6 6 16 0 2 2.0 0
7 7 17 0 2 1.0 0
8 8 18 0 2 0.5 0
9 9 19 1 2 50.0 51
10 10 20 0 2 -1.0 0
11 1 11 1 2 60.0 64
Thanks a lot in advance.
Using dplyr and data.table:
df2 = data.frame(PersonId1 = c(11,15,9,1),
                 PersonId2 = c(1,5,19,11),
                 Played_together = c(1,1,1,1),
                 Event = c(1,2,2,2),
                 Utility = c(25,36,51,64)) # you had missed adding Utility in your question
library(data.table)
library(dplyr)
df3 <- copy(df2)
# swap the id column names in df2 so the reversed ordering can be matched
colnames(df2) <- c("PersonId2", "PersonId1", "Played_together", "Event", "Utility")
setDT(df2)
df2 <- df2[, c("PersonId2", "PersonId1", "Utility", "Event")]
df3 <- df3[, c("PersonId2", "PersonId1", "Utility", "Event")]
# join once on the swapped ids and once on the original ids
df <- left_join(df1, df2, by = c("PersonId2", "PersonId1", "Event"))
df <- left_join(df, df3, by = c("PersonId2", "PersonId1", "Event"))
setDT(df)
# take whichever of the two join results is non-missing
df[, Utility_from_df2 := ifelse(is.na(Utility), Utility.y, ifelse(is.na(Utility.y), Utility, 0))]
df[is.na(df)] <- 0
df[, c("Utility.y", "Utility") := NULL]
setnames(df, "Utility.x", "Utility")
Output:
PersonId1 PersonId2 Played_together Event Utility Utility_from_df2
1: 1 11 1 1 20.0 25
2: 2 12 0 1 -2.0 0
3: 3 13 0 1 -5.0 0
4: 4 14 1 1 10.0 0
5: 5 15 1 2 30.0 36
6: 6 16 0 2 2.0 0
7: 7 17 0 2 1.0 0
8: 8 18 0 2 0.5 0
9: 9 19 1 2 50.0 51
10: 10 20 0 2 -1.0 0
11: 1 11 1 2 60.0 64
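An alternative that avoids the double join is to put each pair into a canonical order with pmin/pmax and join once. A dplyr sketch, assuming df1 and df2 as defined in the question; lo, hi, and canon are illustrative names, not part of the answer above:
library(dplyr)
# put the smaller id in 'lo' and the larger in 'hi' so (1, 11) and (11, 1) share one key
canon <- function(d) mutate(d, lo = pmin(PersonId1, PersonId2), hi = pmax(PersonId1, PersonId2))
res <- canon(df1) %>%
  left_join(canon(df2) %>% select(lo, hi, Event, Utility_from_df2 = Utility),
            by = c("lo", "hi", "Event")) %>%
  mutate(Utility_from_df2 = coalesce(Utility_from_df2, 0)) %>%
  select(-lo, -hi)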

Restructuring and formatting data frame columns

dfin <- read.table(header = TRUE, text = "
ID SEQ GRP C1 C2 C3 T1 T2 T3
1 1 1 0 5 8 0 1 2
1 2 1 5 10 15 5 6 7
2 1 2 20 25 30 0 1 2")
C1 is the concentration (CONC) at T1 (TIME), and so on. This is what I want as the output:
dfout <-
ID SEQ GRP CONC TIME
1 1 1 0 0
1 1 1 5 1
1 1 1 8 2
1 2 1 5 5
1 2 1 10 6
1 2 1 15 7
2 1 2 20 0
2 1 2 25 1
2 1 2 30 2
dfin has many more Cx and Tx columns, where x indexes the concentration readings.
You can do this with data.table::melt, using its ability to melt into multiple value columns based on patterns in the column names:
library(data.table)
melt(
  setDT(dfin),
  id.vars = c("ID", "SEQ", "GRP"),
  # columns starting with C and with T are melted into two separate value columns
  measure.vars = patterns("^C", "^T"),
  value.name = c("CONC", "TIME")
)[order(ID, SEQ)][, variable := NULL][]
# ID SEQ GRP CONC TIME
#1: 1 1 1 0 0
#2: 1 1 1 5 1
#3: 1 1 1 8 2
#4: 1 2 1 5 5
#5: 1 2 1 10 6
#6: 1 2 1 15 7
#7: 2 1 2 20 0
#8: 2 1 2 25 1
#9: 2 1 2 30 2
Or, if the value column names follow the pattern [CT][0-9], you can use reshape from base R with sep = "", which splits the value column names at the letter/digit boundary because of this default setting (from ?reshape):
split = if (sep == "") {
    list(regexp = "[A-Za-z][0-9]", include = TRUE)
} else {
    list(regexp = sep, include = FALSE, fixed = TRUE)
}
One caveat: supplying v.names together with an atomic varying vector makes reshape pair the columns by recycling v.names in column order (C1/C2 would become the first CONC/TIME pair), which scrambles the result. Let reshape guess the names instead and rename the C and T columns afterwards:
res <- reshape(dfin, varying = -(1:3), idvar = c("ID", "SEQ", "GRP"),
               dir = "long", sep = "")
names(res)[names(res) == "C"] <- "CONC"
names(res)[names(res) == "T"] <- "TIME"
res[order(res$ID, res$SEQ), ]
#        ID SEQ GRP time CONC TIME
#1.1.1.1  1   1   1    1    0    0
#1.1.1.2  1   1   1    2    5    1
#1.1.1.3  1   1   1    3    8    2
#1.2.1.1  1   2   1    1    5    5
#1.2.1.2  1   2   1    2   10    6
#1.2.1.3  1   2   1    3   15    7
#2.1.2.1  2   1   2    1   20    0
#2.1.2.2  2   1   2    2   25    1
#2.1.2.3  2   1   2    3   30    2
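For completeness, with tidyr 1.0+ the same multi-column melt can be written with pivot_longer() and a ".value" spec; a minimal sketch:
library(dplyr)
library(tidyr)
dfin %>%
  pivot_longer(-(1:3),
               # the letter part (C or T) names the output column, the digit indexes the reading
               names_to = c(".value", "reading"),
               names_pattern = "([A-Za-z])([0-9]+)") %>%
  rename(CONC = C, TIME = T) %>%
  arrange(ID, SEQ) %>%
  select(-reading)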

Applying a function for calculating AUC for each subject

I want to calculate the area under the curve (AUC) of concentration-time profiles for many subjects (~200). I am using the package MESS, where:
AUC = auc(data$TIME, data$CONC, type = "spline")
How can I apply it to each unique ID in the data set, and retain the results by adding a new "AUC" column to the original data set?
The data has the following columns:
ID TIME CONC
1 0 0
1 2 4
1 3 7
2 0 0
2 1 NA
2 3 5
2 4 10
One way would be like this, where foo is your data.
library(MESS)
library(dplyr)
foo %>%
  group_by(ID) %>%
  summarize(AUC = auc(TIME, CONC, type = "spline"))
# ID AUC
#1 1 9.12500
#2 2 12.08335
If you want to keep all data, you could do this.
foo %>%
  group_by(ID) %>%
  mutate(AUC = auc(TIME, CONC, type = "spline"))
# ID TIME CONC AUC
#1 1 0 0 9.12500
#2 1 2 4 9.12500
#3 1 3 7 9.12500
#4 2 0 0 12.08335
#5 2 1 NA 12.08335
#6 2 3 5 12.08335
#7 2 4 10 12.08335
In my opinion, the dplyr solution provided by @jazzurro is the way to go, but here's a base approach for good measure.
d <- read.table(text='ID TIME CONC
1 0 0
1 2 4
1 3 7
2 0 0
2 1 NA
2 3 5
2 4 10', header=TRUE)
library(MESS)
# name the result auc_tbl so it does not mask MESS::auc
auc_tbl <- do.call(rbind, lapply(split(d, d$ID), function(x) {
  data.frame(ID = x$ID[1], auc = auc(x$TIME, x$CONC, type = "spline"))
}))
merge(d, auc_tbl)
# ID TIME CONC auc
# 1 1 0 0 9.125
# 2 1 2 4 9.125
# 3 1 3 7 9.125
# 4 2 0 0 12.08335
# 5 2 1 NA 12.08335
# 6 2 3 5 12.08335
# 7 2 4 10 12.08335
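The same grouped calculation also works with data.table, which recycles the scalar AUC across each ID's rows; a short sketch reusing d from above:
library(MESS)
library(data.table)
setDT(d)
d[, AUC := auc(TIME, CONC, type = "spline"), by = ID]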
