Clustering rows by ID based on a column value condition multiple times - r

Some time ago I opened a related question in this post.
Suppose I have the following df:
data <- data.frame(ID          = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                   Obs1        = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1,1,1,0,1),
                   Control     = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48,24,20,21,10,10),
                   ClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,6))
And I want to obtain:
data <- data.frame(ID                       = c(1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1),
                   Obs1                     = c(1,1,0,1,0,1,1,0,1,0,0,0,1,1,1,1,1,1,1,1,0,1),
                   Control                  = c(0,3,3,1,12,1,1,1,36,13,1,1,2,24,2,2,48,24,20,21,10,10),
                   ClusterObs1              = c(1,1,1,2,2,3,3,3,4,4,4,4,5,5,5,5,5,5,5,5,5,6),
                   DesiredResultClusterObs1 = c(1,1,1,2,2,3,3,3,4,4,4,4,5,6,6,6,7,8,9,10,10,11))
The conditions are:
If the value of 'Control' is higher than 12 and the current 'Obs1' value equals 1 and also equals the previous 'Obs1' value, then the 'DesiredResultClusterObs1' value should increase by 1 (the main difference from the other question is that consecutive 'Control' values above 12 must be considered).
Any idea how I can achieve the desired result?

I don't know much about how to use the with() and rle() functions, but I've arrived at a solution to the problem using ifelse().
library(dplyr)

data <- data %>%
  mutate(aux = ifelse(Control > 12 & Obs1 == 1 & lag(Obs1) == 1, 1, 0),
         DesiredResultClusterObs1 = ClusterObs1 + cumsum(aux))
The aux variable is not necessary; it just helps to see it step by step. You can do the following too:
data <- data %>%
  mutate(DesiredResultClusterObs1 =
           ClusterObs1 +
           cumsum(ifelse(Control > 12 & Obs1 == 1 & lag(Obs1) == 1, 1, 0)))
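A slightly more defensive variant (an untested sketch, assuming dplyr is loaded as above) gives lag() a default so the first row never produces an NA, and feeds the logical vector straight into cumsum():

data <- data %>%
  mutate(DesiredResultClusterObs1 =
           ClusterObs1 +
           cumsum(Control > 12 & Obs1 == 1 & lag(Obs1, default = 0) == 1))

If the real data held more than one ID, a group_by(ID) before the mutate() would keep the cumulative sum from leaking across IDs.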

Related

Loop only running for the last iteration in R - Looping over participants

I am very new to R and I am trying to run a loop, so any help is greatly appreciated.
I have longitudinal data with multiple timepoints for each participant, which looks like the attached picture.
I need to replace the NA values with the values from when the Years variable is equal to 0, and I want to write a loop to do this for each participant. I have written some code which seems to work; however, it only gives output for the last iteration of the loop (the last participant). This is the code I am using:
x <- c(1:4)
n <- length(x)

for (i in 1:n) {
  data <- subset(df, ID %in% c(x[i]))
  data$outcome <- ifelse(is.na(data$outcome),
                         data[1, 3],
                         data$outcome)
}
Using this code, the output gives only the last iteration (i.e. in this case, ID 4). I need to complete this for all IDs.
Any help is much appreciated! Thank you.
I'm not 100% clear on your intent, but this will, within each ID, fill all missing outcome values with the (first) outcome value from a row where Years == 0.
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(outcome = coalesce(outcome, first(outcome[Years == 0])))
Obviously untested, but if you provide some sample data I'll happily help debug.
Your loop overwrites data (and therefore data$outcome) on each iteration; that is why you only get the last result.
Here's my inelegant solution:
Make some sample data to match yours (not including the unused column):
my_dat <- data.frame("years" = sample(c(0, 1.5, 3), 30, replace = T),
"outcome" = as.numeric(sample(c("", 1, 2), 30, replace = T)))
Find which rows are both 0 for years and missing outcome
my_index <- my_dat$years == 0 & is.na(my_dat$outcome)
Assign 0 to replace NA:
my_dat$outcome[my_index] <- 0
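If you do want to keep the explicit loop from the question, here is a rough, untested sketch (assuming df has the columns ID, Years and outcome described above) that collects each iteration's result in a list and combines everything at the end instead of overwriting data:

results <- vector("list", length(x))

for (i in seq_along(x)) {
  d <- subset(df, ID %in% x[i])
  # fill missing outcomes with the outcome recorded where Years == 0
  d$outcome <- ifelse(is.na(d$outcome), d$outcome[d$Years == 0][1], d$outcome)
  results[[i]] <- d
}

df_filled <- do.call(rbind, results)   # one data frame with all participants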
A simpler tidyverse method:
library(tidyverse)

df %>%
  filter(ID %in% x) %>%
  mutate(outcome = ifelse(is.na(outcome), Years, outcome))
Your question could do with some clarification and a reproducible example. As I understand it from "I need to replace the NA values with the values from when the Years variable is equal to 0": if outcome equals NA and Years equals 0, you want outcome to equal 0?
set.seed(1984) # set the seed so that my_dat is the same each time

# using a modified df from markhogue's answer...
my_dat <- data.frame(
  ID      = 1:30,
  years   = sample(c(0, 1.5, 3), 30, replace = TRUE),
  outcome = as.numeric(sample(c("", 1, 2), 30, replace = TRUE))
)

my_dat # have a look at rows 9 and 22

# ifelse given two conditions: years == 0 and is.na(outcome)
my_dat$outcome <- ifelse(my_dat$years == 0 & is.na(my_dat$outcome),
                         my_dat$years, my_dat$outcome)

my_dat # have a look at rows 9 and 22
Let me know if this is what you need :)

generate variable based on first occurrence of a value

I have 5 repeated measures called pub1:pub5, each taking a value of 1 to 4. Each was measured at a different age, age1:age5; that is, pub1 was measured at age1, ..., pub5 at age5, and so on.
I would like to create a new variable age_pb2 that shows the age at which a value of 2 first occurred in pub. For example, for individual x, age_pb2 will equal age3 if the first time a value of 2 is scored is in pub3.
I have tried modifying previous code but not had much luck.
library(tidyverse)

# Example data
N <- 2000
data <- data.frame(id = 1:2000,
                   age1 = rnorm(N, 6:8),  age2 = rnorm(N, 7:9),
                   age3 = rnorm(N, 8:10), age4 = rnorm(N, 9:11),
                   age5 = rnorm(N, 10:12),
                   pub1 = rnorm(N, 1:2), pub2 = rnorm(N, 1:2),
                   pub3 = rnorm(N, 1:2), pub4 = rnorm(N, 1:2),
                   pub5 = rnorm(N, 1:2))
data <- data %>% mutate_at(vars(starts_with("pub")), funs(round(replace(., . < 0, NA), 0)))

# New variable showing first age at getting a score of 2 (doesn't work)
i1 <- grepl('^pub', names(data)) # index for pub columns
i2 <- grepl('^age', names(data)) # index for age columns
data[paste0("age_pb2")] <- lapply(2, function(i) {
  j1 <- max.col(data[i1] == i, 'first')
  j2 <- rowSums(data[i1] == i) == 0
  data[i2][cbind(seq_len(nrow(data)), j1 * (NA^j2))]
})
set.seed(1)
N <- 2000

data <- data.frame(id = 1:2000,
                   age1 = rnorm(N, 6:8),  age2 = rnorm(N, 7:9),
                   age3 = rnorm(N, 8:10), age4 = rnorm(N, 9:11),
                   age5 = rnorm(N, 10:12),
                   pub1 = rnorm(N, 1:2), pub2 = rnorm(N, 1:2),
                   pub3 = rnorm(N, 1:2), pub4 = rnorm(N, 1:2),
                   pub5 = rnorm(N, 1:2)) %>%
  mutate_at(vars(starts_with("pub")), funs(round(replace(., . < 0, NA), 0))) %>%
  mutate(age_pb2 = eval(parse(text = paste0("age",
    which.min(apply(select(., starts_with("pub")), 2,
                    function(x) which(x == 2)[1]))))))
The way it works: you apply over the pub columns and, with which(x == 2)[1], take the first matching row per column; then which.min gives the index of the pub (and hence age) column, which you paste onto "age" and assign as the respective column using eval(parse(text = ...)).
E.g. here, after the apply you get
[pub1 = 2, pub2 = 1, pub3 = 2, pub4 = 4, pub5 = 2]
which is the row of the first occurrence of 2 per column. The earliest occurrence (which.min) is in the second pub column, so the index is 2. This is pasted onto "age" and eval-parsed inside mutate().
EDIT
It is probably more convenient to do it in a for loop for all age_pbi, unless there is an easy dplyr solution that I am not aware of.
for (i in 1:5) {
  index <- which.min(apply(select(data, starts_with("pub")), 2,
                           function(x) which(x == i)[1]))
  data[, paste0("age_pb", i)] <- data[, paste0("age", index)]
}
Note, however, that which.min takes the first minimum. E.g. pub1 and pub2 both have a 1 in the first row, so the above approach assigns age1 to age_pb1, although it could just as well be age2. I don't know what you want to do with this, so I can't say which is the better option.
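For completeness, here is a rough, untested per-row sketch (reusing the index idea from the question's own attempt; the column name age_pb2_rowwise is just illustrative) that looks up, for every row separately, the age column matching the first pub column equal to 2:

pub_mat <- as.matrix(data[grepl("^pub[0-9]+$", names(data))]) == 2
pub_mat[is.na(pub_mat)] <- FALSE                          # treat NA pub scores as "not 2"
first_hit <- apply(pub_mat, 1, function(r) which(r)[1])   # NA when no pub equals 2 in a row
age_mat <- as.matrix(data[grepl("^age[0-9]+$", names(data))])
data$age_pb2_rowwise <- age_mat[cbind(seq_len(nrow(data)), first_hit)]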

R limit output of dataframe?

I have a data frame of transactions.
I am using dplyr to filter the transactions by gender.
Gender in my case is 0 or 1.
I want to filter 2 rows: one with Gender == 0 and the second with Gender == 1.
The closest I got was to do it like this:
df %>% arrange(Gender)
and then select 2 transactions in the middle where one is 1 and the second is 0.
Please advise.
To randomly sample a row/cell where a condition on another cell is satisfied, you can use sample() like this:
# Dummy data: X = value of interest, G = Gender (0, 1)
df1 <- data.frame("X" = rnorm(10, 0, 1),
                  "G" = sample(c(0, 1), replace = TRUE, size = 10))

# Sampling
sample(df1[, 'X'][df1[, 'G'] == 1], size = 1)
sample(df1[, 'X'][df1[, 'G'] == 0], size = 1)
This takes one value of X for each gender (the condition on G being set by [df1[, 'G'] == 1]).
Building on the comment by docendo discimus, you can use the popular dplyr package with the script below, but note that this runs considerably slower (about 5 times slower with 3M rows and 1000 iterations) than the sample approach I offered above:
library(dplyr)

pull(df1 %>% group_by(G) %>% sample_n(1), X)
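If you are on dplyr 1.0.0 or later, the same one-row-per-gender idea can also be written with slice_sample() (an untested sketch, reusing the df1 from above):

library(dplyr)

df1 %>%
  group_by(G) %>%         # one group per gender (0 and 1)
  slice_sample(n = 1) %>% # randomly keep a single transaction per group
  ungroup()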

create unique values in each cells for a given range in excel or R

Apologies if the example is not formatted properly.
I have a data set with one sample per row; the data contain two columns with reference numbers for the start value and end value.
cell A1 = Sample #1
cell B1 = 101-263 (start value)
cell C1 = 101-266 (end value)
cell A2 = Sample #2
cell B2 = 162-186 (start value)
cell C2 = 162-187 (end value)
The range of values is a different length for each row of data, with a maximum range of 8 values. I need to fill in the values in the range, with each value in a cell along the row.
So for sample #1 above I need to create the cell values D1 = 101-264 and E1 = 101-265, while for sample #2 no extra cells are needed.
Is there a formula (using Vlookup and If perhaps?) that I can create and drag across all rows and over the 8 needed columns to fill in this data? (I don't mind if there are N/A in the shorter-range rows)
If there is an easier way using R, that's also fine with me.
Thanks for any advice.
Please try in D1 copied across eight columns and then D1:K1 copied down to suit:
=IF(1*RIGHT($C1,3)>RIGHT($B1,3)+COLUMN()-3,LEFT($B1,4)&RIGHT($B1,3)+COLUMN()-3,"")
The condition (IF) checks whether or not to display a result (or a 'blank' "", for neater presentation), depending on whether the result equals or exceeds the upper limit specified in Column C.
There is some text manipulation (RIGHT and LEFT) to get at the part that is to be integer-incremented, or to add back the static part.
COLUMN() returns the column number (A→1, B→2, etc.), so it is useful as a kind of stepping function. In D1, COLUMN()-3 is 4-3, or 1, so 1 is added to the start of the range (shown on the right of B1). When copied across to Column E, COLUMN()-3 becomes 5-3, so 2 is added to the start of the range.
The following code:
library(magrittr)
library(plyr)
library(reshape2)

# Create input example
dat = data.frame(
  sample = c("Sample #1", "Sample #2"),
  start = c("101-263", "162-186"),
  end = c("101-266", "162-187"),
  stringsAsFactors = FALSE
)

# Extract 'start' and 'end' values
dat$num1   = dat$start %>% strsplit("-") %>% sapply("[", 1)
dat$start2 = dat$start %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$end2   = dat$end   %>% strsplit("-") %>% sapply("[", 2) %>% as.numeric
dat$start = NULL
dat$end = NULL

# For each row
for (i in 1:nrow(dat)) {
  # Check if there is any need to add entries
  if ((dat$end2[i] - dat$start2[i]) > 1) {
    # For each entry
    for (j in seq(dat$start2[i], dat$end2[i] - 1)) {
      # Create entry
      new_entry = data.frame(
        sample = dat$sample[i],
        num1 = dat$num1[i],
        start2 = dat$start2[i],
        end2 = j,
        stringsAsFactors = FALSE
      )
      # Add to table
      dat = rbind(dat, new_entry)
    }
  }
}

# Calculate all values
dat$value = paste0(dat$num1, "-", dat$end2)
dat = dat[, c("sample", "value")]

# Create column labels
dat = ddply(
  dat,
  "sample",
  transform,
  var = paste0("col", rank(value))
)

# Reshape to required format
dat = dcast(dat, sample ~ var, value.var = "value")
It does what you asked on the provided example.
It transforms this table -
     sample   start     end
1 Sample #1 101-263 101-266
2 Sample #2 162-186 162-187
Into this one -
     sample    col1    col2    col3    col4
1 Sample #1 101-263 101-264 101-265 101-266
2 Sample #2 162-187    <NA>    <NA>    <NA>
If there is a larger example for testing, I'll be happy to do so. :)
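A shorter base-R take on the same idea (a rough, untested sketch; it rebuilds the small input frame so it is self-contained, and pads every row to the stated maximum of 8 values):

dat2 <- data.frame(
  sample = c("Sample #1", "Sample #2"),
  start  = c("101-263", "162-186"),
  end    = c("101-266", "162-187"),
  stringsAsFactors = FALSE
)

expand_range <- function(start, end) {
  prefix <- sub("-.*", "", start)           # static part, e.g. "101"
  lo <- as.numeric(sub(".*-", "", start))   # numeric lower bound
  hi <- as.numeric(sub(".*-", "", end))     # numeric upper bound
  paste(prefix, seq(lo, hi), sep = "-")     # full sequence, e.g. "101-263" ... "101-266"
}

vals <- Map(expand_range, dat2$start, dat2$end)

# pad each expanded range to 8 values with NA and bind back to the sample names
wide <- t(sapply(vals, function(v) c(v, rep(NA, 8 - length(v)))))
out  <- data.frame(sample = dat2$sample, wide, stringsAsFactors = FALSE)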

Grouping data into ranges in R

Suppose I have a data frame in R that has names of students in one column and their marks in another column. These marks range from 20 to 100.
> mydata
  id name marks gender
   1   a1    56 female
   2   a2    37   male
I want to divide the students into groups based on their obtained marks, so that the difference between marks in each group is no more than 10. I tried to use the function table(), which gives the number of students in each range, say 20-30, 30-40, but I want it to pick those students who have marks in a given range and put all their information together in a group. Any help is appreciated.
I am not sure what you mean by "put all their information together in a group", but here is a way to split your original data frame into a list of data frames, where each element is a data frame of the students within a mark range of 10:
mydata <- data.frame(
  id     = 1:100,
  name   = paste0("a", 1:100),
  marks  = sample(20:100, 100, TRUE),
  gender = sample(c("female", "male"), 100, TRUE))

split(mydata, cut(mydata$marks, seq(20, 100, by = 10)))
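The result is a named list keyed by the cut() intervals; a quick, illustrative way to pull out a single group afterwards:

groups <- split(mydata, cut(mydata$marks, seq(20, 100, by = 10)))
groups[["(50,60]"]]   # all students with marks in (50, 60]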
I think that @Sacha's answer should suffice for what you need to do, even if you have more than one set.
You haven't explicitly said how you want to "group" the data in your original post, and in your comment, where you've added a second dataset, you haven't explicitly said whether you plan to "merge" these first (rbind would suffice, as recommended in the comment).
So, with that, here are several options, each with different levels of detail or utility in the output. Hopefully one of them suits your needs.
First, here's some sample data.
# Two data.frames (myData1 and myData2)
set.seed(1)
myData1 <- data.frame(id = 1:20,
                      name = paste("a", 1:20, sep = ""),
                      marks = sample(20:100, 20, replace = TRUE),
                      gender = sample(c("F", "M"), 20, replace = TRUE))
myData2 <- data.frame(id = 1:17,
                      name = paste("b", 1:17, sep = ""),
                      marks = sample(30:100, 17, replace = TRUE),
                      gender = sample(c("F", "M"), 17, replace = TRUE))
Second, different options for "grouping".
Option 1: Return (in a list) the values from myData1 and myData2 which match a given condition. For this example, you'll end up with a list of two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
       function(x) x[x$marks >= 30 & x$marks <= 50, ])
Option 2: Return (in a list) each dataset split into two, one for FALSE (doesn't match the stated condition) and one for TRUE (does match the stated condition). In other words, creates four groups. For this example, you'll end up with a nested list with two list items, each with two data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
       function(x) split(x, x$marks >= 30 & x$marks <= 50))
Option 3: More flexible than the first. This is essentially #Sacha's example extended to a list. You can set your breaks wherever you would like, making this, in my mind, a really convenient option. For this example, you'll end up with a nested list with two list items, each with multiple data.frames.
lapply(list(myData1 = myData1, myData2 = myData2),
       function(x) split(x, cut(x$marks,
                                breaks = c(0, 30, 50, 75, 100),
                                include.lowest = TRUE)))
Option 4: Combine the data first and use the grouping method described in Option 1. For this example, you will end up with a single data.frame containing only values which match the given condition.
# Combine the data. Assumes the column names are the same in both sets
myDataALL <- rbind(myData1, myData2)

# Extract just the group of scores you're interested in
myDataALL[myDataALL$marks >= 30 & myDataALL$marks <= 50, ]
Option 5: Using the combined data, split the data into two groups: one group which matches the stated condition, one which doesn't. For this example, you will end up with a list with two data.frames.
split(myDataALL, myDataALL$marks >= 30 & myDataALL$marks <= 50)
I hope one of these options serves your needs!
I had the same kind of issue, and after researching some answers on Stack Overflow I came up with the following solution:
Step 1 : Define range
Step 2 : Find the elements that fall in the range
Step 3 : Plot
Sample code is shown below:
range <- NULL
for (i in seq(0, max(all$downlink), 2000)) {
  range <- c(range, i)
}

counts <- numeric(length(range) - 1)
for (i in 1:length(counts)) {
  counts[i] <- length(which(all$downlink >= range[i] & all$downlink < range[i + 1]))
}

countmax <- max(counts)
a <- round(countmax / 1000) * 1000
barplot(counts, col = rainbow(16), ylim = c(0, a))
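The same three steps can be written more compactly with cut() and table() (a rough, untested sketch; it assumes the same all data frame with a numeric downlink column as above):

breaks <- seq(0, max(all$downlink), by = 2000)            # step 1: define the range
counts <- table(cut(all$downlink, breaks, right = FALSE)) # step 2: count elements per bin
barplot(counts, col = rainbow(length(counts)))            # step 3: plot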
