I have a data frame with 7320 observations of 3 variables: two age-group columns and the contact number between them. For example:
ageGroup ageGroup1 mij
0 0 0.012093847617507
0 1 0.00510485237464309
0 2 0.00374919082969427
0 3 0.00307241431437433
0 4 0.00254487083293498
0 5 0.00213734013959765
0 6 0.00182565778959543
0 7 0.00159036659169942
1 0 0.00475097494199872
1 1 0.00748329237103462
1 2 0.00427123298868537
1 3 0.00319622224196792
1 4 0.00287522072903812
1 5 0.00257773394696414
1 6 0.00230322568677366
1 7 0.00205265986733139
and so on up to age group 86. I need to calculate the mean contact number (mij) between each pair of age groups. For example, ageGroup = 0 contacts ageGroup1 = 1 with one mij value, and ageGroup = 1 contacts ageGroup1 = 0 with another. I need to sum these two values and divide by 2 to get the average between them. Could you give me a hint on how to do that across the whole data set?
Use ddply from the plyr package (assuming your data frame is called data):
ddply(data,.(ageGroup,ageGroup1),summarize,sum.mij=sum(mij))
ageGroup ageGroup1 sum.mij
1 0 0 0.012093848
2 0 1 0.005104852
3 0 2 0.003749191
4 0 3 0.003072414
5 0 4 0.002544871
6 0 5 0.002137340
7 0 6 0.001825658
8 0 7 0.001590367
9 1 0 0.004750975
10 1 1 0.007483292
11 1 2 0.004271233
12 1 3 0.003196222
13 1 4 0.002875221
14 1 5 0.002577734
15 1 6 0.002303226
16 1 7 0.002052660
I think I see what you're trying to do here. You want to treat interactions between the two ageGroup columns as being non-directional and get the mean interaction? The code below should do this using base R functions.
Note that since the example dataset is truncated, it will only give a correct answer for the group with index 01. However, if you run it on the full dataset, it should work for all interactions.
# Create the data frame
df=read.table(header=T,text="
ageGroup,ageGroup1,mij
0,0,0.012093848
0,1,0.005104852
0,2,0.003749191
0,3,0.003072414
0,4,0.002544871
0,5,0.00213734
0,6,0.001825658
0,7,0.001590367
1,0,0.004750975
1,1,0.007483292
1,2,0.004271233
1,3,0.003196222
1,4,0.002875221
1,5,0.002577734
1,6,0.002303226
1,7,0.00205266
",sep=",")
df
# Using the strSort function from this SO answer:
# http://stackoverflow.com/questions/5904797/how-to-sort-letters-in-a-string-in-r
strSort <- function(x)
sapply(lapply(strsplit(x, NULL), sort), paste, collapse="")
# Label each of the i-j interactions and j-i interactions with an index ij
# e.g. anything in ageGroup=1 interacting with ageGroup1=0 OR ageGroup=0 interacting with ageGroup1=1
# are labelled with index 01
df$ind=strSort(paste(df$ageGroup,df$ageGroup1,sep=""))
# Use the tapply function to get mean interactions for each group as suggested by Paul
tapply(df$mij,df$ind,mean)
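One caveat with the pasted, character-sorted index: once the groups have more than one digit (the question says they run up to 86), different pairs can collapse to the same label. For instance, the pairs (1, 12) and (11, 2) both become "112". A minimal sketch that avoids this keeps the pair numeric and orders it with pmin() and pmax(). The first two mij values come from the question; the multi-digit rows are hypothetical values added only to illustrate the point.

```r
# Rows 1-2 use mij values from the question; rows 3-6 are hypothetical
# multi-digit groups, included only to show that the pairs stay distinct.
df <- data.frame(
  ageGroup  = c(0, 1, 1, 12, 11, 2),
  ageGroup1 = c(1, 0, 12, 1, 2, 11),
  mij       = c(0.005104852, 0.004750975, 0.003, 0.002, 0.001, 0.004)
)

# Order each pair so that (i, j) and (j, i) share the same key
df$lo <- pmin(df$ageGroup, df$ageGroup1)
df$hi <- pmax(df$ageGroup, df$ageGroup1)

# Mean contact number per unordered pair
sym <- aggregate(mij ~ lo + hi, data = df, FUN = mean)
sym
```

If the averaged value is needed on every original row rather than as a summary table, ave(df$mij, df$lo, df$hi) returns it aligned with df.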
I am trying to reformat longitudinal data for a time to event analysis. In the example data below, I simply want to find the earliest week that the result was “0” for each ID.
The specific issue I am having is how to handle patients who never convert to 0 and have only 1's or 2's. In the example data, patient J has all 1's.
#Sample data
have<-data.frame(patient=rep(LETTERS[1:10], each=9),
week=rep(0:8,times=10),
result=c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
1,rep(0,8),rep(1,9)))
patient week result
A 0 1
A 1 0
A 2 2
A 3 0
A 4 0
A 5 0
A 6 0
A 7 0
A 8 0
B 0 1
B 1 0
... .....
J 6 1
J 7 1
J 8 1
I am able to do this relatively straightforward process with the following code:
want<-aggregate(have$week, by=list(have$patient,have$result), min)
want<-want[which(want[2]==0),]
but I realize that if someone does not convert to 0, this excludes them (in this example, patient J is excluded). Instead, J should be present with a 1 in the second column and an 8 in the third column.
print(want)
Group.1 Group.2 x
A 0 1
B 0 4
C 0 2
D 0 1
E 0 6
F 0 3
G 0 3
H 0 2
I 0 1
#But also need
J 1 8
Pursuant to the guidelines on posting here, I did work to solve this and am able to get what I need, though very inelegantly:
mins<-aggregate(have$week, by=list(have$patient,have$result), min)
maxs<-aggregate(have$week, by=list(have$patient,have$result), max)
want<-rbind(mins[which(mins[2]==0),],maxs[which(maxs[2]==1&maxs[3]==8),])
This returns the desired dataset, but the coding is terrible and not sustainable as I work with other datasets (i.e. datasets with different timeframes, since I have to manually put in maxs[3]==8, etc.).
Is there a more elegant or systematic way to approach this data manipulation issue?
We can write a function to select a row from the group.
select_row <- function(result, week) {
if(any(result == 0)) which.max(result == 0) else which.max(week)
}
This function returns the index of the first 0 value if one is present; otherwise it returns the index of the maximum value of week.
We then apply it to every group:
library(dplyr)
have %>% group_by(patient) %>% slice(select_row(result, week))
# patient week result
# <fct> <int> <dbl>
# 1 A 1 0
# 2 B 4 0
# 3 C 2 0
# 4 D 1 0
# 5 E 6 0
# 6 F 3 0
# 7 G 3 0
# 8 H 2 0
# 9 I 1 0
#10 J 8 1
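For readers who prefer to avoid a package dependency, the same row-selection idea can be sketched in base R with split() and lapply(). This is only an illustration under the same assumption as the answer above, namely that rows are ordered by week within each patient; the helper name pick_row is hypothetical.

```r
# Sample data from the question
have <- data.frame(patient = rep(LETTERS[1:10], each = 9),
                   week = rep(0:8, times = 10),
                   result = c(1,0,2,rep(0,6),1,1,2,1,rep(0,5),1,1,rep(0,7),1,rep(0,8),
                              1,1,1,1,2,1,0,0,0,1,1,1,rep(0,6),1,2,1,rep(0,6),1,2,rep(0,7),
                              1,rep(0,8),rep(1,9)))

# Hypothetical helper: first row with result == 0 if there is one,
# otherwise the row with the latest week
pick_row <- function(d) {
  i <- if (any(d$result == 0)) which.max(d$result == 0) else which.max(d$week)
  d[i, ]
}

# Split by patient, pick one row per patient, and bind the rows back together
want <- do.call(rbind, lapply(split(have, have$patient), pick_row))
want
```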
This question already has an answer here:
R: apply simple function to specific columns by grouped variable
I'm trying to convert a dataset that has multiple observations per person over a period of time. For example, person 1 can be obese and not obese (just overweight) during this time. Here's an example from person 1:
ID Obese Overweight
1 NA NA
1 NA NA
1 0 1
1 1 0
1 0 0
2 NA 0
2 0 1
2 0 NA
I need to replace the values in each column to "1" if a 1 appears at all WITHIN THAT COLUMN, across a specified number of columns (there are 700+; e.g. c(5:749)) BY "ID". Ideally, the output would look like:
ID Obese Overweight
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
2 0 1
2 0 1
2 0 1
First I changed all the NAs to 0's; I then figured I could take the maximum along each column and replace (by ID), but can't find documentation on how to do this by group ("ID") AND a given set of columns (i.e. c(5:749)). Also I would not want to create new columns, but rather just replace values within columns already existing within the data frame.
I got it to work for a single variable, but couldn't translate this into a loop to go through a set of variables...
dat2 <- dat[, Obese:= max(Obese), by=ID]
Also I think a loop would take too long given the data size. Any other recommendations? Thanks in advance. Here's an example dataset:
dat <- as.data.frame(matrix(NA,18))
dat$id <- as.character(c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3))
dat$ob1 <- as.character(c(NA,NA,0,1,0,NA,0,1,0,0,0,0,0,0,0,0,0,0))
dat$ob2 <- as.character(c(NA,NA,1,0,0,NA,0,0,1,0,0,0,0,1,0,0,0,0))
dat <- dat[,-1]
As for the linked page using "lapply", it doesn't seem to work when all values are NA (or 0) for a given individual. In that scenario, it seems to "fill in" / impute with values from other columns (values which never appeared in that column in the original dataset); this was clearly spotted when a binary variable was imputed/replaced with a continuous value. Any idea why this may be happening?
I think tapply is helpful for this case.
You can find the max for each id by
with(dat, tapply(ob1, id, max))
My solution is:
dat$ob1 <- as.numeric(dat$ob1)
dat$ob2 <- as.numeric(dat$ob2)
dat[is.na(dat)] <- 0
dat$ob1 <- with(dat,tapply(ob1,id,max)[id])
dat$ob2 <- with(dat,tapply(ob2,id,max)[id])
dat
id ob1 ob2
1 1 1 1
2 1 1 1
3 1 1 1
4 1 1 1
5 1 1 1
6 1 1 1
7 2 1 1
8 2 1 1
9 2 1 1
10 2 1 1
11 2 1 1
12 2 1 1
13 3 0 1
14 3 0 1
15 3 0 1
16 3 0 1
17 3 0 1
18 3 0 1
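Since the question mentions 700+ columns, repeating the tapply line per column does not scale. One possible sketch is to loop over a vector of column names with lapply() and take the per-id maximum with ave(), which already returns a vector aligned with the rows. The vector cols below stands in for the real column selection (such as the c(5:749) mentioned in the question) and is an assumption; the data are the example from the question, entered as numbers rather than strings.

```r
# Example data from the question, with ob1/ob2 as numeric columns
dat <- data.frame(
  id  = as.character(rep(1:3, each = 6)),
  ob1 = c(NA, NA, 0, 1, 0, NA, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0),
  ob2 = c(NA, NA, 1, 0, 0, NA, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0)
)

# Columns to overwrite in place (assumed selection; replace with your own)
cols <- c("ob1", "ob2")

dat[cols] <- lapply(dat[cols], function(v) {
  v[is.na(v)] <- 0              # treat NA as 0, as in the answer above
  ave(v, dat$id, FUN = max)     # per-id maximum, repeated on every row
})
dat
```

Because each column is processed on its own, a value can only come from the same column and the same id, which sidesteps the cross-column "fill in" effect described in the question.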
I would like to recenter an unbalanced time predictor in a mixed model so that the intercept reflects end of treatment.
For example:
library(data.table)
ID <- c(1,1,2,2,2,3,3,3,3)
Time <- c(0,1,0,1,2,0,1,2,3)
Before <- data.table(ID,Time)
Before
ID Time
1 0
1 1
2 0
2 1
2 2
3 0
3 1
3 2
3 3
I would like to get this:
Recenter <- c(1,0,2,1,0,3,2,1,0)
After <- data.table(ID,Time, Recenter)
After
ID Time Recenter
1 0 1
1 1 0
2 0 2
2 1 1
2 2 0
3 0 3
3 1 2
3 2 1
3 3 0
Looks like you want to reverse Time within each ID. This is what you need:
Recenter <- unlist(with(Before, tapply(Time, ID, rev)), use.names = FALSE)
This applies the rev function to each group of the unbalanced (ragged) array using tapply.
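Reversing works here because Time counts up from 0 in steps of 1 within each ID and the rows are sorted by ID. If that cannot be guaranteed, a sketch that subtracts each value from the per-ID maximum gives the same result on this example and keeps the output aligned with the original rows:

```r
ID   <- c(1, 1, 2, 2, 2, 3, 3, 3, 3)
Time <- c(0, 1, 0, 1, 2, 0, 1, 2, 3)

# Time remaining until the last observation of each ID,
# so 0 marks the end of treatment
Recenter <- ave(Time, ID, FUN = function(t) max(t) - t)
Recenter
```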
I have a fairly large data frame (a few million records).
I need to filter it according to the following rule:
- For each product, delete all records that come before the fifth record after the first record with x>0.
So we are interested in only two columns, ID and x. The data frame is sorted by ID.
It is fairly easy to do with loops, but loops don't perform well on such a large data frame.
How can I do it in a vectorized style?
Example:
Before filtering:
ID x
1 0
1 0
1 5 # First record with x>0
1 0
1 3
1 4
1 0
1 9
1 0 # Delete all earlier records of that product
1 0
1 6
2 0
2 1 # First record with x>0
2 0
2 4
2 5
2 8
2 0 # Delete all earlier records of that product
2 1
2 3
After filtering:
ID x
1 9
1 0
1 0
1 6
2 0
2 1
2 3
For these split-apply-combine problems, I like using plyr. There are alternatives if speed becomes an issue, but for most things plyr is easy to understand and use. I wrote a function that implements the logic you described above and then fed it to ddply() to operate on each chunk of the data based on ID.
fun <- function(x, column, threshold, numplus){
whichcol <- which(x[column] > threshold)[1]
rows <- seq(from = (whichcol + numplus), to = nrow(x))
return(x[rows,])
}
And then feed this to ddply()
require(plyr)
ddply(dat, "ID", fun, column = "x", threshold = 0, numplus = 5)
#-----
ID x
1 1 9
2 1 0
3 1 0
4 1 6
5 2 0
6 2 1
7 2 3
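If plyr is too slow on a few million rows, one base R alternative is to build a logical keep-mask per ID with ave() and subset once. The sketch below rebuilds the example data from the question. The helper name keep_from is hypothetical, and dropping every row of an ID that has no record with x > 0 (or fewer than five records after the first one) is an assumption, since the question does not cover those cases.

```r
# Example data from the question
dat <- data.frame(
  ID = rep(1:2, times = c(11, 9)),
  x  = c(0, 0, 5, 0, 3, 4, 0, 9, 0, 0, 6,
         0, 1, 0, 4, 5, 8, 0, 1, 3)
)

# Hypothetical helper: TRUE from the numplus-th record after the first x > threshold
keep_from <- function(x, threshold = 0, numplus = 5) {
  first <- which(x > threshold)[1]
  if (is.na(first)) return(rep(FALSE, length(x)))  # assumption: drop the whole group
  seq_along(x) >= first + numplus
}

# ave() applies the helper per ID and returns 0/1 aligned with the rows
keep <- as.logical(ave(dat$x, dat$ID, FUN = keep_from))
filtered <- dat[keep, ]
filtered
```

Comparing positions with >= also means a group that is too short simply yields no rows, whereas seq(from, to) with from greater than to counts backwards.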
I am trying to create a new column (variable) based on the values in an existing column: if the existing column contains NA, the corresponding value in the new column should be 0 (zero); otherwise it should be 1 (one). Example data are given below:
aid=c(1,2,3,4,5,6,7,8,9,10)
age=c(2,14,NA,0,NA,1,6,9,NA,15)
data=data.frame(aid,age)
My new data frame should look like this:
aid=c(1,2,3,4,5,6,7,8,9,10)
age=c(2,14,NA,0,NA,1,6,9,NA,15)
surv=c(1,1,0,1,0,1,1,1,0,1)
data<-data.frame(aid,age,surv)
data
I hope that my question is clear enough.
The R community's help is highly appreciated!
Baz
surv = 1 - is.na(age)
> data
aid age surv
1 1 2 1
2 2 14 1
3 3 NA 0
4 4 0 1
5 5 NA 0
6 6 1 1
7 7 6 1
8 8 9 1
9 9 NA 0
10 10 15 1
>
If I'm understanding correctly:
data$surv <- 1
data$surv[is.na(data$age)] <- 0
or
data$surv <- ifelse(is.na(data$age), 0, 1)
An alternative to @mod's 1-is.na(foo) solution is to invert the TRUE/FALSE with ! and call as.numeric(). This involves more typing, but the intention and the explicit coercion to numeric are apparent.
> as.numeric(!is.na(c(2,14,NA,0,NA,1,6,9,NA,15)))
[1] 1 1 0 1 0 1 1 1 0 1