I have a silly function that walks along the vector ACC and, at each step, updates the value of S by either delta1 or delta2.
Sstart=0 #a starting value for S
ACC=c(1,1,0,1,1) #accuracy: 0 or 1
f=c(1,1,1,1,0) #feedback: 0 or 1
ID=rep(1,5) #ID of the participant
delta1=seq(1,5,1)
delta2=seq(1,5,1)
m<-as.matrix(expand.grid(delta1=delta1, delta2=delta2)) #all the possible combination of delta1 and delta2
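The function is the following. When the response is accurate (ACC is 1) and the feedback (f) is 1, it increases S by delta1; when the response is accurate and the feedback is 0, it increases S by delta2; when ACC is 0, S is left unchanged. delta1 and delta2 each range from 1 to 5, and I vary them independently.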
silly_function <- function(delta1, delta2, ACC, f, Sstart) {
  S <- Sstart
  for (i in seq_along(ACC)) {
    if (ACC[i] == 1 && f[i] == 1) {
      S[i + 1] <- S[i] + delta1
    } else if (ACC[i] == 1 && f[i] == 0) {
      S[i + 1] <- S[i] + delta2
    } else if (ACC[i] == 0) {
      S[i + 1] <- S[i]
    }
  }
  return(S)
}
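As a side note, the update rule described above is just a running sum, so the loop can probably be replaced by cumsum(). The sketch below assumes ACC and f contain only 0 and 1; silly_vectorised is a made-up name, not something from the original code.

```r
# Hypothetical vectorised variant of silly_function (assumes ACC and f are 0/1 only)
silly_vectorised <- function(delta1, delta2, ACC, f, Sstart) {
  # Step size per trial: delta1 if f == 1, delta2 if f == 0, and 0 when ACC == 0
  step <- ACC * ifelse(f == 1, delta1, delta2)
  # Prepend the starting value, then accumulate the steps
  c(Sstart, Sstart + cumsum(step))
}

ACC <- c(1, 1, 0, 1, 1)
f   <- c(1, 1, 1, 1, 0)
S_a <- silly_vectorised(1, 1, ACC, f, 0)
S_b <- silly_vectorised(2, 1, ACC, f, 0)
```

With the example vectors this gives the same values as the first two rows of the output shown further down.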
I call the function
N=length(delta1)*length(delta2)
SMat<-matrix(data=NA, nrow=N, ncol=(length(ACC)+1)) #matrix for the data
for (i in 1:N){
SMat[i,] <- silly_function(m[i,1],m[i,2],ACC,f,Sstart)}
My problem:
The function works perfectly for one subject, but I cannot find a clever way to apply it to each of my subjects separately (the data from all participants are in a single data frame) and combine the results into one matrix or data frame. I wanted to use ddply from the plyr package, but I couldn't find an example close enough to mine to adapt.
Thank you very much for your comments/hints in advance!
My input for two participants
ID Feedback ACC
1 1 1
1 1 1
1 1 0
1 1 1
1 0 1
2 1 1
2 1 1
2 0 1
2 1 0
2 1 1
Actual output
V1 V2 V3 V4 V5 V6
0 1 2 2 3 4 #row1
0 2 4 4 6 7
.
.
0 5 10 10 15 20 #row25
For subject 1 the output is:
A 25 * 6 (rows * columns) matrix: 25 rows because I update S by all possible combinations of delta1 and delta2. The first column is always 0 because Sstart is 0.
Desired output
Basically the same as for subject 1 but with all the subjects data
1-25 rows for subject 1
26-50 rows for subject 2...
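One possible way to get this layout, sketched in base R: split the data frame by ID, run the function over the delta grid for each piece, and stack the pieces. The data frame dat and the helper run_subject are assumptions made for illustration (dat just mirrors the two-participant input above), and the sketch assumes every subject has the same number of trials so that the pieces can be row-bound.

```r
# Hypothetical data frame mirroring the two-participant input shown above
dat <- data.frame(
  ID       = rep(1:2, each = 5),
  Feedback = c(1, 1, 1, 1, 0,  1, 1, 0, 1, 1),
  ACC      = c(1, 1, 0, 1, 1,  1, 1, 1, 0, 1)
)

silly_function <- function(delta1, delta2, ACC, f, Sstart) {
  S <- Sstart
  for (i in seq_along(ACC)) {
    if (ACC[i] == 1 && f[i] == 1) {
      S[i + 1] <- S[i] + delta1
    } else if (ACC[i] == 1 && f[i] == 0) {
      S[i + 1] <- S[i] + delta2
    } else if (ACC[i] == 0) {
      S[i + 1] <- S[i]
    }
  }
  S
}

# All combinations of delta1 and delta2
m <- as.matrix(expand.grid(delta1 = 1:5, delta2 = 1:5))

# run_subject() is an illustrative helper: one row of S per delta combination
run_subject <- function(d, Sstart = 0) {
  S <- t(mapply(silly_function, m[, 1], m[, 2],
                MoreArgs = list(ACC = d$ACC, f = d$Feedback, Sstart = Sstart)))
  data.frame(ID = d$ID[1], m, S)
}

# Rows 1-25 belong to subject 1, rows 26-50 to subject 2
res <- do.call(rbind, lapply(split(dat, dat$ID), run_subject))
```

With plyr loaded, ddply(dat, "ID", run_subject) should give a similarly stacked result, since ddply does the split and the row-binding itself.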
I am trying to combine several binary variables into one categorical variable. I have ten such variables, each describing a task of a job.
Data looks something like this:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
# etc.
My goal is to combine them into one variable, where the value 1 (=Yes) of each binary variable will be a separate level of the categorical variable.
To illustrate what I imagine (wrong code obviously):
If Personal_Help = 1 -> Jobcontent = 1
If PR = 1 -> Jobcontent = 2
If Fundraising = 1 -> Jobcontent = 3
etc.
Thank you very much in advance!
Edit:
Thanks for your answers, and apologies for my late reply. I think more context from my side is needed. The goal of combining the binary variables into a categorical variable is to show them in one graphic (using ggplot). The graphic should display how many respondents report the above-mentioned tasks as part of their work.
If you're interested only in the first occurrence of 1 among your variables:
df <- data.frame(t(data.frame(Personal_Help, PR,Fundraising)))
result <- sapply(df, function(x) which(x==1)[1])
X1 X2 X3 X4 X5 X6
1 1 2 1 2 1
Of course, this will depend on what you want to do when multiple values are 1 as asked in comments.
Since there are three different variables, and each variable can take either of 2 values, there are 2^3 = 8 possible unique combinations of the three variables, each of which should have a unique number associated.
One way to do this is to imagine each column as being a digit in a three digit binary number. If we subtract 1 from each column, we get a 1 for "no" and a 0 for "yes". This means that our eight possible unique values, and the binary numbers associated with each would be:
binary decimal
0 0 0 = 0
0 0 1 = 1
0 1 0 = 2
0 1 1 = 3
1 0 0 = 4
1 0 1 = 5
1 1 0 = 6
1 1 1 = 7
This system will work for any number of columns, and can be achieved as follows:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
df <- data.frame(Personal_Help, PR, Fundraising)
New_var <- 0
for(i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)
df$New_var <- New_var
The end result would then be:
df
#> Personal_Help PR Fundraising New_var
#> 1 1 2 1 2
#> 2 1 1 2 4
#> 3 2 1 1 1
#> 4 1 2 2 6
#> 5 2 1 2 5
#> 6 1 2 1 2
In your actual data, there will be 1024 possible combinations of tasks, so this will generate numbers for New_var between 0 and 1023. Because of how it is generated, you can actually use this single number to reverse engineer the entire row as long as you know the original column order.
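To illustrate the reverse-engineering claim, here is a rough sketch that decodes New_var back into the original 1/2 values. decode_row is a hypothetical helper name, and the sketch assumes the same column order and the same 2^(i - 1) weighting as the loop above.

```r
Personal_Help <- c(1, 1, 2, 1, 2, 1)
PR            <- c(2, 1, 1, 2, 1, 2)
Fundraising   <- c(1, 2, 1, 2, 2, 1)
df <- data.frame(Personal_Help, PR, Fundraising)

# Encode exactly as in the answer above
New_var <- 0
for (i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)

# decode_row() is a hypothetical helper: extract bit i of each code, then add 1
# to get back to the original coding (1 = yes, 2 = no)
decode_row <- function(code, n_cols) {
  bits <- sapply(seq_len(n_cols), function(i) (code %/% 2^(i - 1)) %% 2)
  bits + 1
}

decoded <- decode_row(New_var, ncol(df))
```

Here decoded is a matrix with one row per respondent and one column per task, in the same column order as df.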
As @ulfelder commented, you need to clarify how you want to handle cases where more than one column is 1.
Assuming you want to use the first column equal to 1, you can use which.min(), applied by row:
data <- data.frame(Personal_Help, PR, Fundraising)
data$Jobcontent <- apply(data, MARGIN = 1, which.min)
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 1
2 1 1 2 1
3 2 1 1 2
4 1 2 2 1
5 2 1 2 2
6 1 2 1 1
If you’d like Jobcontent to include the name of each job, you can index into names(data):
data$Jobcontent <- names(data)[apply(data, MARGIN = 1, which.min)]
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 Personal_Help
2 1 1 2 Personal_Help
3 2 1 1 PR
4 1 2 2 Personal_Help
5 2 1 2 PR
6 1 2 1 Personal_Help
max.col may help here:
Jobcontent <- max.col(-data.frame(Personal_Help, PR, Fundraising), "first")
Jobcontent
#> [1] 1 1 2 1 2 1
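Given the edit to the question (the plot should show how many respondents report each task), it may be enough to count the 1s per column instead of building a single categorical variable, which also sidesteps the question of what to do when several columns are 1. A minimal sketch, assuming 1 means "yes" as in the question; the object names counts and plot_data are made up here.

```r
Personal_Help <- c(1, 1, 2, 1, 2, 1)
PR            <- c(2, 1, 1, 2, 1, 2)
Fundraising   <- c(1, 2, 1, 2, 2, 1)
df <- data.frame(Personal_Help, PR, Fundraising)

# Number of respondents answering 1 (= yes) for each task
counts <- colSums(df == 1)

# Long format: one row per task, which is the shape ggplot usually expects
plot_data <- data.frame(task = names(counts), n = as.vector(counts))
```

plot_data can then be passed to something like ggplot(plot_data, aes(task, n)) + geom_col(), or counts can go straight into barplot().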
I am trying to reformat longitudinal data for a time to event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to handle patients who don't convert to 0 and had only 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but I realize that if someone does not convert to 0, this excludes them (in this example, patient J is excluded). Instead, J should be present with a 1 in the second column and an 8 in the third column.
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
Pursuant to guidelines on posting here, I did work to solve this, am able to get what I need very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the desired dataset, but the code is clumsy and not sustainable as I work with other datasets (i.e. datasets with different timeframes, since I have to manually put in maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present; otherwise it returns the index of the maximum value of week.
We can then apply it to each group.
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
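If dplyr is not an option, the same helper can probably be reused with base R by splitting on patient and binding the selected rows back together. This is only a sketch of that idea, using the sample data from the question.

```r
have <- data.frame(patient = rep(LETTERS[1:10], each = 9),
                   week = rep(0:8, times = 10),
                   result = c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
                              1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
                              1,rep(0,8),rep(1,9)))

# Same helper as above: index of the first 0 if there is one, else of the last week
select_row <- function(result, week) {
  if (any(result == 0)) which.max(result == 0) else which.max(week)
}

# Pick one row per patient and stack the results
want <- do.call(rbind, lapply(split(have, have$patient), function(d) {
  d[select_row(d$result, d$week), ]
}))
```

Because the fallback uses which.max(week) rather than a hard-coded 8, this should carry over to datasets with a different follow-up length.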
I am trying to efficiently assign a value to a column, based on another column, but without a for loop as this takes too long.
I'm doing something like this: If the reference column value is greater than a certain random number, I assign 1 to the new column. Otherwise, assign 0. Can't figure out the best way to do this without a loop. I tried dplyr and case_when, but that wasn't iterating over each row.
Thanks!
for (i in 1:nrow(data)) {
if (data$value[i] > runif(1, 0, 1.7)) {
temp$newValue[i] <- 1
} else{
temp$newValue[i] <- 0
}
}
c0=data.frame(c(1,4,6,3,7,3),c(2,8,2,4,9,4))
names(c0)=c("A","B")
c0$C=ifelse(c0[,"A"]>runif(1,0,1.7),1,0)
c0
I'm not so sure if I understand you well. Please comment if I have any misunderstanding.
A B C
1 2 0
4 8 1
6 2 1
3 4 1
7 9 1
3 4 1
Here is how I use A to generate C
Does this solve your problem?
DATA:
set.seed(1)
df <- data.frame(
refcol = rnorm(10)
)
randvalue <- 0
SOLUTION:
df$newcol <- ifelse(df$refcol > randvalue, 1, 0)
RESULT:
df
refcol newcol
1 0.2352207 1
2 -0.3307359 0
3 -0.3116238 0
4 -2.3023457 0
5 -0.1708760 0
6 0.1402782 1
7 -1.4974267 0
8 -1.0101884 0
9 -0.9484756 0
10 -0.4939622 0
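One thing to watch: the loop in the question draws a fresh random number for every row, whereas runif(1, ...) or a fixed randvalue compares every row against the same threshold. If a per-row draw is what is wanted, passing the row count to runif() should reproduce the loop without looping. A small sketch with made-up values in data$value:

```r
set.seed(123)

# Made-up reference values for illustration
data <- data.frame(value = c(0.2, 0.8, 1.1, 1.5, 2.0, -0.3))

# One threshold per row, all drawn in a single call
thresholds <- runif(nrow(data), min = 0, max = 1.7)

# Vectorised comparison: 1 where value exceeds its own threshold, else 0
data$newValue <- as.integer(data$value > thresholds)
```

With the same seed, runif(n) returns the same sequence as n successive runif(1) calls, so this gives the same result as the original loop.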
This question already has an answer here:
R: apply simple function to specific columns by grouped variable
(1 answer)
Closed 5 years ago.
I'm trying to convert a dataset that has multiple observations per person over a period of time. For example, person 1 can be obese and not obese (just overweight) during this time. Here's an example with persons 1 and 2:
ID Obese Overweight
1 NA NA
1 NA NA
1 0 1
1 1 0
1 0 0
2 NA 0
2 0 1
2 0 NA
I need to replace the values in each column to "1" if a 1 appears at all WITHIN THAT COLUMN, across a specified number of columns (there are 700+; e.g. c(5:749)) BY "ID". Ideally, the output would look like:
ID Obese Overweight
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
2 0 1
2 0 1
2 0 1
First I changed all the NAs to 0's; I then figured I could take the maximum along each column and replace (by ID), but can't find documentation on how to do this by group ("ID") AND a given set of columns (i.e. c(5:749)). Also I would not want to create new columns, but rather just replace values within columns already existing within the data frame.
I got it to work for a single variable, but couldn't translate this into a loop to go through a set of variables...
dat2 <- dat[, Obese:= max(Obese), by=ID]
Also I think a loop would take too long given the data size. Any other recommendations? Thanks in advance. Here's an example dataset:
dat <- as.data.frame(matrix(NA,18))
dat$id <- as.character(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3))
dat$ob1 <- as.character(c(NA,NA,0,1,0,NA,0,1,0,0,0,0,0,0,0,0,0,0))
dat$ob2 <- as.character(c(NA,NA,1,0,0,NA,0,0,1,0,0,0,0,1,0,0,0,0))
dat <- dat[,-1]
As for the linked page using "lapply": it doesn't seem to work when all values are NA (or 0) for a given individual. In this scenario, it seems to "fill in" / impute values taken from other columns (values which never appeared in that column in the original dataset); I spotted this when a binary variable was replaced with a continuous value. Any idea why this may be happening?
I think tapply is helpful for this case.
You can find the max for each id by
with(dat, tapply(ob1, id, max))
My solution is:
dat$ob1 <- as.numeric(dat$ob1)
dat$ob2 <- as.numeric(dat$ob2)
dat[is.na(dat)] <- 0
dat$ob1 <- with(dat,tapply(ob1,id,max)[id])
dat$ob2 <- with(dat,tapply(ob2,id,max)[id])
dat
id ob1 ob2
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
7 2 1 1
8 2 1 1
9 2 1 1
10 2 1 1
11 2 1 1
12 2 1 1
13 3 0 1
14 3 0 1
15 3 0 1
16 3 0 1
17 3 0 1
18 3 0 1
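Since the question mentions 700+ columns, repeating the tapply line per column does not scale well. One way to generalise, sketched here with ave() and lapply(), is to run the same per-group maximum over a vector of column names. The sketch assumes the columns are already numeric (convert with as.numeric first if they are character, as above), and cols is a placeholder for the real column set.

```r
dat <- data.frame(
  id  = as.character(rep(1:3, each = 6)),
  ob1 = c(NA, NA, 0, 1, 0, NA, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  ob2 = c(NA, NA, 1, 0, 0, NA, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0)
)

# Placeholder column set; in the real data this might be names(dat)[5:749]
cols <- c("ob1", "ob2")

# Overwrite each column in place with its per-id maximum, treating NA as 0
dat[cols] <- lapply(dat[cols], function(v) {
  v[is.na(v)] <- 0
  ave(v, dat$id, FUN = max)
})
```

ave() returns a vector of the same length as its input, so the columns are replaced in place and no new columns are created.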
I have quite a big data frame (a few million records).
I need to filter it according to the following rule:
- For each product, delete all records that come before the fifth record after the first record with x>0.
So we are interested in only two columns: ID and x. The data frame is sorted by ID.
It is fairly easy to do this using loops, but loops don't perform well on such a big data frame.
How can I do it in 'vector style'?
Example:
BEFORE FILTERING
ID x
1 0
1 0
1 5 # First record with x>0
1 0
1 3
1 4
1 0
1 9 # Delete all earlier records of that product
1 0
1 0
1 6
2 0
2 1 # First record with x>0
2 0
2 4
2 5
2 8
2 0 # Delete all earlier records of that product
2 1
2 3
After filtering:
ID x
1 9
1 0
1 0
1 6
2 0
2 1
2 3
For split-apply-combine problems like this, I like using plyr. There are alternatives if speed becomes an issue, but for most things plyr is easy to understand and use. I wrote a function that implements the logic you described above and then fed it to ddply() to operate on each chunk of the data based on ID.
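fun <- function(x, column, threshold, numplus){
  # row index of the first value above the threshold (NA if there is none)
  first <- which(x[[column]] > threshold)[1]
  start <- first + numplus
  # assumption: if there is no such row, or too few rows follow it, drop the whole group
  if (is.na(start) || start > nrow(x)) return(x[0, , drop = FALSE])
  # keep everything from the numplus-th record after the first match onwards
  x[seq(from = start, to = nrow(x)), , drop = FALSE]
}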
And then feed this to ddply()
require(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
ID x
1 1 9
2 1 0
3 1 0
4 1 6
5 2 0
6 2 1
7 2 3
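For a few million rows the per-group function calls may become slow. A loop-free alternative in base R, sketched under the assumption that the rows are in the intended order within each ID, is to count rows since the first positive x with two grouped cumulative sums. The names seen and since are made up for this sketch.

```r
dat <- data.frame(
  ID = rep(1:2, times = c(11, 9)),
  x  = c(0, 0, 5, 0, 3, 4, 0, 9, 0, 0, 6,
         0, 1, 0, 4, 5, 8, 0, 1, 3)
)

# TRUE from the first record with x > 0 onwards, within each ID
seen <- ave(as.integer(dat$x > 0), dat$ID, FUN = cumsum) > 0

# 1 at the first positive record, 2 at the next record, and so on (0 before it)
since <- ave(as.integer(seen), dat$ID, FUN = cumsum)

# The fifth record after the first positive one has since == 6; keep it and all later rows
res <- dat[since > 5, ]
```

IDs that never have x > 0, or that have fewer than five records after the first positive one, end up with no rows in this version.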