R: finding absolute difference with dplyr and group_by

I have the following example. I want to create a new column with the absolute difference in Age compared to the Treat == 1 row within the same PairID. The desired output is shown below.
Complete data:
Treat <- c(1,0,0,1,0,0,1,0)
PairID <- c(1,1,1,2,2,2,3,3)
Age <- c(30,60,31,20,20,40,50,52)
D <- data.frame(Treat,PairID,Age)
D
I have tried using dplyr with:
D %>%
  group_by(PairID) %>%
  abs(Age - Age[Treat == 1])

In base R:
D$absD <- unlist(lapply(split(D, D$PairID), function(x) abs(x$Age - x$Age[x$Treat == 1])))
> D
Treat PairID Age absD
1 1 1 30 0
2 0 1 60 30
3 0 1 31 1
4 1 2 20 0
5 0 2 20 0
6 0 2 40 20
7 1 3 50 0
8 0 3 52 2
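The dplyr attempt in the question fails because the result is piped straight into abs() instead of being computed inside mutate(). A minimal sketch of the working dplyr version, assuming exactly one Treat == 1 row per PairID (as in the sample data):
library(dplyr)
D %>%
  group_by(PairID) %>%
  mutate(absD = abs(Age - Age[Treat == 1])) %>%
  ungroup()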


Conditional Statements: selecting/assigning a variable per row

I have a data set with 2 VPs and 350 interval values for each. I am writing a loop to flag when the minimum value of VP1 overlaps with the maximum value of VP2.
The data usually sorts by VP, but I arranged it to sort by minimum since the values form a timeframe.
I ran the following code, which works to assign 0 or 1 when a value overlaps the previous row, but it does not account for what the previous row is (i.e. whether it belongs to VP1 or VP2).
for (i in 2:length(df$newvariable)) {
  if (df$minimum[i] < df$maximum[i - 1]) {
    df$newvariable[i] <- 0
  } else {
    df$newvariable[i] <- 1
  }
}
I want to say: if df$minimum[i] of VP1 < df$maximum[i] of VP2, then df$newvariable = 0; otherwise, df$newvariable = 1.
I have not been able to work out how to make the condition depend on which VP each row belongs to and then loop again. Does anyone have any recommendations?
Many thanks.
Sample Data:
VP xmin xmax
1 0 6
2 0 2
2 6 14
1 14 24
2 20 30
1 30 36
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax newvariable
1 0 6 -
2 0 2 0
2 6 14 1
1 14 24 1
2 20 30 0
1 30 36 1
A follow-up: suppose the dataframe has another variable, talking, with assignments 1 (yes) or 0 (no). I originally subsetted to just the talking == 0 rows and created new variables there, like quiet_together. However, I now want to put these dataframes back together, even though I have added different columns to each of them.
If I want exactly the same thing as described above, but on the combined dataframe (instead of two separate ones), how would I specify it for each assigned value? I want to end up with two new columns based on the xmin and xmax values while accounting for the value of the talking variable: talk_together (for the 1 value of talking) and quiet_together (for the 0 value of talking), set when xmin <= xmax of the previous line.
For example:
Sample Data:
VP xmin xmax talking
1 0 6 0
2 0 2 0
2 2 6 1
2 6 14 0
1 6 14 1
2 14 24 1
1 14 20 0
1 20 30 1
2 24 32 0
1 30 32 0
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax talking talk_together quiet_together
1 0 6 0 0 0
2 0 2 0 0 0
2 2 6 1 0 0
2 6 14 0 0 0
1 6 14 1 0 0
1 14 20 0 0 0
2 14 24 1 1 0
1 20 30 1 1 0
2 24 32 0 0 1
1 30 32 0 0 1
You could use lag from dplyr to compare with the previous xmax value.
library(dplyr)
df %>% mutate(newvariable = as.integer(xmin >= lag(xmax)))
# VP xmin xmax newvariable
#1 1 0 6 NA
#2 2 0 2 0
#3 2 6 14 1
#4 1 14 24 1
#5 2 20 30 0
#6 1 30 36 1
Or shift with data.table
library(data.table)
setDT(df)[, newvariable := +(xmin >= shift(xmax))]
Base R alternatives are:
df$newvariable <- as.integer(c(NA, df$xmin[-1] >= df$xmax[-nrow(df)]))
and
df$newvariable <- +c(NA, tail(df$xmin, -1) >= head(df$xmax, -1))
With data.table, we can do
library(data.table)
setDT(df)[, newvariable := as.integer(xmin >= shift(xmax))]
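The talking follow-up in the question never received an answer, and its desired output is not fully consistent with any single rule. As an assumption-laden starting point only: one reading is to flag rows that overlap the previous row with the same talking value, i.e. the same lag idea applied within each level of talking. A sketch (the column names together, talk_together, and quiet_together are chosen here for illustration):
library(dplyr)
df %>%
  group_by(talking) %>%
  # 1 when this interval overlaps the previous interval with the same talking value
  mutate(together = as.integer(xmin < lag(xmax, default = -Inf))) %>%
  ungroup() %>%
  # split the flag into the two requested columns
  mutate(talk_together = if_else(talking == 1, together, 0L),
         quiet_together = if_else(talking == 0, together, 0L)) %>%
  select(-together)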

R dplyr::mutate - add all elements in a returned list

I have a function returning a list. I'm using mutate to add columns in a data frame that correspond to the output. The calculation is rather involved, so I would prefer to call the function only once. I'm rather new to R and dplyr and cannot figure out a more efficient way of doing this.
Here is a very simple example of what I am doing now.
library(dplyr)
testFun <- function(x, z)
{
  list(x2 = x*x + z, x3 = x*x*x + z)
}
have <- data.frame(x=seq(1:10),y=1,z=0)
want <- have %>%
  dplyr::mutate(x2 = testFun(x, z)$x2,
                x3 = testFun(x, z)$x3)
How can I do this more efficiently?
With the purrr package you can solve this problem like this:
library(purrr)
library(dplyr)
library(tidyr)  # unnest() comes from tidyr
testFun <- function(x, z) {
  tibble(x2 = x*x + z, x3 = x*x*x + z)
}
have %>%
  mutate(new_x = map2(x, z, testFun)) %>%
  unnest(new_x)
# x y z x2 x3
# 1 1 1 0 1 1
# 2 2 1 0 4 8
# 3 3 1 0 9 27
# 4 4 1 0 16 64
# 5 5 1 0 25 125
# 6 6 1 0 36 216
# 7 7 1 0 49 343
# 8 8 1 0 64 512
# 9 9 1 0 81 729
# 10 10 1 0 100 1000
Note that I changed the output of your function from a list to a tibble.
We can use pmap
library(purrr)
library(dplyr)
pmap_dfr(have %>% select(x, z), testFun) %>%
  bind_cols(have, .)
# x y z x2 x3
#1 1 1 0 1 1
#2 2 1 0 4 8
#3 3 1 0 9 27
#4 4 1 0 16 64
#5 5 1 0 25 125
#6 6 1 0 36 216
#7 7 1 0 49 343
#8 8 1 0 64 512
#9 9 1 0 81 729
#10 10 1 0 100 1000
Or, if we can change the function by quoting its expressions (quote or quo), this becomes easier:
testFun <- function(x, z) {
  list(x2 = quo(x*x + z), x3 = quo(x*x*x + z))
}
have %>%
  mutate(!!! testFun(x, z))
# x y z x2 x3
#1 1 1 0 1 1
#2 2 1 0 4 8
#3 3 1 0 9 27
#4 4 1 0 16 64
#5 5 1 0 25 125
#6 6 1 0 36 216
#7 7 1 0 49 343
#8 8 1 0 64 512
#9 9 1 0 81 729
#10 10 1 0 100 1000
I might have missed something really obvious here, but you seem to be running the function twice to produce two answers. To keep things really simple to begin with, try:
library(dplyr)
have <- data.frame(x=seq(1:10),y=1,z=0)
want <- have %>%
  dplyr::mutate(x2 = (x * 2 + z),
                x3 = (x * 3 + z))
Does that help? Or has your example simplified out what you were trying to achieve?
Using a different function for the mutate you should be able to do:
library(dplyr)
createMultiX <- function(inputX, inputZ, multiplier) {
  inputX * multiplier + inputZ
}
have <- data.frame(x=seq(1:10),y=1,z=0)
want <- have %>%
  dplyr::mutate(x2 = createMultiX(x, z, 2),
                x3 = createMultiX(x, z, 3))
(Apologies in advance as I've written this blindly without access to an R terminal so hope it works first time without typos!)

r: find lowest value matching criteria over columns

My data frame looks like this
personID t1 t2 t3
1 0 11 0
1 0 11 0
2 0 11 13
2 0 11 13
3 0 0 0
3 0 0 0
I need to make sure that each person has at least one test score above 10. If they do not, they have to be removed from the data frame. I also want to keep track of the lowest score above 10 and add it to a new column.
Thus, the result would look like this:
personID t1 t2 t3 new
1 0 11 0 11
1 0 11 0 11
2 0 11 13 11
2 0 11 13 11
If I were to go the data.table route, I think you could do it with a melt and a join:
library(data.table)
setDT(dat)
dat[
  melt(dat, id.vars = "personID")[value > 10, .(new = min(value)), by = personID],
  on = "personID"
]
# personID t1 t2 t3 new
#1: 1 0 11 0 11
#2: 1 0 11 0 11
#3: 2 0 11 13 11
#4: 2 0 11 13 11
Using data.table:
library(data.table)
# convert your data (named DF here) to use data.table syntax
setDT(DF)
DF[, {
  # all scores above 10 for this person
  v = unlist(.SD)[unlist(.SD) > 10]
  # keep the person only if at least one score is above 10
  if (length(v) > 0)
    # append the lowest qualifying score as a new column to the current data
    c(.SD, list(new = min(v)))
}, by = personID]
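For completeness, the same logic as a dplyr/tidyr sketch (not from the original answers; it assumes the data frame is called dat, as above):
library(dplyr)
library(tidyr)
dat %>%
  pivot_longer(t1:t3, names_to = "test", values_to = "score") %>%
  group_by(personID) %>%
  filter(any(score > 10)) %>%                  # drop persons with no score above 10
  summarise(new = min(score[score > 10])) %>%  # lowest score above 10 per person
  inner_join(dat, by = "personID") %>%
  select(personID, t1, t2, t3, new)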

Building a contingency table

I have a data like this:
A B
1 10
1 20
1 30
2 10
2 30
2 40
3 20
3 10
3 30
4 20
4 10
5 10
5 10
and I want to build a contingency table like this:
10 20 30 40
10 1 3 2 0
20 3 0 2 0
30 2 2 0 0
40 0 0 0 0
Meaning: within each value of column A, for each pair of consecutive values in column B, add 1 to the corresponding cell of the contingency table.
Can you help me do this?
Here is a very ugly answer, using the data from the image, because I already spent too much time on your problem. In general, it's not practical to have your result depend on the order of the rows.
A <- rep(1:4, c(3, 2, 3, 3))
B <- c(10, 10, 30, 10, 20, 30, 20, 10, 10, 20, 30)
data <- data.frame(A, B)
# split by A
library(plyr)
data2 <- ddply(data, .(A), function(x) {
  combined_pairs <- cbind(x$B[-nrow(x)], x$B[-1])
  # return pairs where the first value is always the lowest
  smallest <- apply(combined_pairs, MARGIN = 1, FUN = min)
  largest <- apply(combined_pairs, MARGIN = 1, FUN = max)
  return(data.frame(small = smallest, large = largest))
})
library(reshape2)
result <- dcast(small ~ large, data = data2, fun.aggregate = length)
> result
small 10 20 30
1 10 1 3 1
2 20 0 0 2
I think you can add the empty rows yourself if you still need them.
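For reference, the same consecutive-pairs idea can be written in current dplyr/tidyr (plyr and reshape2 are retired); a sketch using the answer's data, assuming dplyr >= 1.1.0 for reframe():
library(dplyr)
library(tidyr)
data %>%
  group_by(A) %>%
  # consecutive pairs within each A, ordered so that small <= large
  reframe(small = pmin(head(B, -1), tail(B, -1)),
          large = pmax(head(B, -1), tail(B, -1))) %>%
  count(small, large) %>%
  pivot_wider(names_from = large, values_from = n, values_fill = 0)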

r how to get total count of repeated values

I have a dataframe with person_id, study_id columns like below:
person_id study_id
10 1
11 2
10 3
10 4
11 5
I want to get the count of persons (unique by person_id) with 1 study, 2 studies, and so on; not the count for a particular value of study_id, but:
2 persons with 1 study
3 persons with 2 studies
1 person with 3 studies
etc.
How can I do this? I could count through a loop, but I wonder if there is a package that makes it easier?
To get a sample data set that better matches your expected output, I'll use this:
dd <- data.frame(
  person_id = c(10, 11, 15, 12, 10, 13, 10, 11, 12, 14, 15),
  study_id = 1:11
)
Now I can count the number of people with a given number of studies with:
table(rowSums(with(dd, table(person_id, study_id))>0))
# 1 2 3
# 2 3 1
Where the top line is the number of studies, and the bottom line is the number of people with that number of studies.
This works because
with(dd, table(person_id, study_id))
returns
study_id
person_id 1 2 3 4 5 6 7 8 9 10 11
10 1 0 0 0 1 0 1 0 0 0 0
11 0 1 0 0 0 0 0 1 0 0 0
12 0 0 0 1 0 0 0 0 1 0 0
13 0 0 0 0 0 1 0 0 0 0 0
14 0 0 0 0 0 0 0 0 0 1 0
15 0 0 1 0 0 0 0 0 0 0 1
and then we use >0 and rowSums to get a count of unique studies for each person. Then we use table again to summarize the results.
If creating the table for your data takes up too much RAM, you can try
table(with(dd, tapply(study_id, person_id, function(x) length(unique(x)))))
which is a slightly different way to get at the same thing.
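For the dd above, this returns the same counts:
# 1 2 3
# 2 3 1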
You can use the aggregate function to get counts per user, then use it again to get counts of counts.
For example, assume your data is called "test":
person_id study_id
10 1
11 2
10 3
10 4
11 5
12 NA
You can set your NAs to a number such as zero so they are not ignored, i.e.
test$study_id[is.na(test$study_id)] = 0
Then you can run the same function, but with a condition that study_id has to be greater than zero:
stg <- setNames(
  aggregate(study_id ~ person_id, data = test, function(x) sum(x > 0)),
  c("person_id", "num_studies"))
Output:
stg
person_id num_studies
10 3
11 2
12 0
Then do the same to get counts of counts:
setNames(
  aggregate(person_id ~ num_studies, data = stg, length),
  c("num_studies", "num_users"))
Output:
num_studies num_users
0 1
2 1
3 1
Here's a solution using dplyr:
library(dplyr)
tmp <- df %>%
  group_by(person_id) %>%
  summarise(num.studies = n()) %>%
  group_by(num.studies) %>%
  summarise(num.persons = n())
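With the question's sample data (person 10 in three studies, person 11 in two), tmp comes out as:
# num.studies num.persons
#           2           1
#           3           1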
> dat <- read.table(h=T, text = "person_id study_id
10 1
11 2
10 3
10 4
11 5
12 6")
I think you can just use xtabs for this. I may have misunderstood the question, but it seems like that's what you want.
> xtabs(~ person_id, dat)
# 10 11 12
#  3  2  1
Wrapping that in table() then gives the number of persons per study count.
df <- data.frame(
  person_id = c(10, 11, 10, 10, 11, 11, 11),
  study_id = c(1, 2, 3, 4, 5, 5, 1))
# remove replicated rows (the duplicated person 11 / study 5 entry)
df <- unique(df)
# number of studies each person has been in:
summary(as.factor(df$person_id))
# 10 11
#  3  3
# number of people in each study:
summary(as.factor(df$study_id))
# 1 2 3 4 5
# 2 1 1 1 1
