Chaining together sequential observations with only current and immediately prior ID values in R

Say I have some data on traits of individuals measured over time, that looks like this:
present <- c(1:4)
pre.1 <- c(5:8)
pre.2 <- c(9:12)
present2 <- c(13:16)
id <- c(present,pre.1,pre.2,present2)
prev.id <- c(pre.1,pre.2,rep(NA,8))
trait <- rnorm(16,10,3)
d <- data.frame(id,prev.id,trait)
print d:
id prev.id trait
1 1 5 10.693266
2 2 6 12.059654
3 3 7 3.594182
4 4 8 14.411477
5 5 9 10.840814
6 6 10 13.712924
7 7 11 11.258689
8 8 12 10.920899
9 9 NA 14.663039
10 10 NA 5.117289
11 11 NA 8.866973
12 12 NA 15.508879
13 13 NA 14.307738
14 14 NA 15.616640
15 15 NA 10.275843
16 16 NA 12.443139
Every observation has a unique value of id. However, some individuals have been observed in the past, so I also have an observation of prev.id. This allows me to connect an individual with its current and past values of trait. Some individuals have been remeasured multiple times: observations 1-4 have previous IDs of 5-8, and observations 5-8 have previous IDs of 9-12. Observations 9-12 have no previous ID because this is the first time these individuals were measured. Furthermore, observations 13-16 have never been measured before. So observations 1-4 are unique individuals, observations 5-12 are prior observations of individuals 1-4, and observations 13-16 are another set of unique individuals, distinct from 1-4. I would like to write code to generate a table that has every unique individual, as well as every past observation of that individual's trait. The final output would look like:
id <- c(1:4,13:16)
prev.id <- c(5:8, rep(NA,4))
trait <- d$trait[c(1:4,13:16)]
prev.trait.1 <- d$trait[c(5:8 ,rep(NA,4))]
prev.trait.2 <- d$trait[c(9:12,rep(NA,4))]
output<- data.frame(id,prev.id,trait,prev.trait.1,prev.trait.2)
> output
id prev.id trait prev.trait.1 prev.trait.2
1 1 5 10.693266 10.84081 14.663039
2 2 6 12.059654 13.71292 5.117289
3 3 7 3.594182 11.25869 8.866973
4 4 8 14.411477 10.92090 15.508879
5 13 NA 14.307738 NA NA
6 14 NA 15.616640 NA NA
7 15 NA 10.275843 NA NA
8 16 NA 12.443139 NA NA
I can accomplish this in a straightforward manner, but it requires me to code an additional pairing for each previous observation, so the number of code groups I need to write equals the maximum number of times any individual has been recorded. This is a pain, as in the data set I am applying this to, there may be anywhere from 0 to 100 previous observations of an individual.
#first pairing
d.prev <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev) <- c('prev.id','prev.trait.1','prev.id.2')
d <- merge(d,d.prev, by = 'prev.id',all.x=T)
#second pairing
d.prev2 <- data.frame(d$id,d$trait,d$prev.id)
colnames(d.prev2) <- c('prev.id.2','prev.trait.2','prev.id.3')
d<- merge(d,d.prev2,by='prev.id.2',all.x=T)
#remove observations that are another individual's previous observation
d <- d[!(d$id %in% d$prev.id),]
How can I go about doing this in fewer lines, so I don't need 100 code chunks to cover individuals that have been remeasured 100 times?

What you have is a forest of linear lists. We'll start at the roots, i.e. the first observations, which have no previous ID:
roots<-d$id[is.na(d$prev.id)]
And determine the paths backwards
path <- function(node) {
a <- integer(nrow(d))
i <- 0
while(!is.na(node)) {
i <- i+1
a[i] <- node
node <- d$id[match(node,d$prev.id)]
}
return(rev(a[1:i]))
}
Then we can get a 'stacked' representation of your desired output with
x<-do.call(rbind,lapply(roots,
function(r) {p<-path(r); data.frame(id=p[[1]],seq=seq_along(p),traits=d$trait[match(p,d$id)])}))
And then use reshape2::dcast to get it in the desired shape
library(reshape2)
dcast(x,id~seq,fill=NA,value.var='traits')
id 1 2 3
1 1 10.693266 10.84081 14.663039
2 2 12.059654 13.71292 5.117289
3 3 3.594182 11.25869 8.866973
4 4 14.411477 10.92090 15.508879
5 13 14.307738 NA NA
6 14 15.616640 NA NA
7 15 10.275843 NA NA
8 16 12.443139 NA NA
I leave it to you to adapt column names.
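As a rough base-R variant of the same idea, the sketch below walks each chain from the most recent observation backwards and names the columns along the way. The toy data, the helper chain_traits and the names heads, chains and wide are made up for illustration; ids are looked up by value with match(), so they do not have to coincide with row numbers.

```r
# Toy data in the same layout as the question; ids are deliberately NOT row numbers
d <- data.frame(
  id      = c(101, 102, 201, 202, 301, 401),
  prev.id = c(201, 202, 301,  NA,  NA,  NA),
  trait   = c(1.5, 2.5, 3.5, 4.5, 5.5, 6.5)
)

# The most recent observation of each individual is an id that never
# appears as anybody's prev.id
heads <- d$id[!(d$id %in% d$prev.id)]

# Follow the prev.id links from one head back to the first observation
# (hypothetical helper, not part of the answer above)
chain_traits <- function(id) {
  out <- numeric(0)
  while (!is.na(id)) {
    row <- match(id, d$id)        # look the id up by value, not by position
    out <- c(out, d$trait[row])
    id  <- d$prev.id[row]
  }
  out
}

chains <- lapply(heads, chain_traits)
depth  <- max(lengths(chains))

# Pad every chain with NA to a common length and bind into one wide table
wide <- do.call(rbind, lapply(chains, function(x) c(x, rep(NA, depth - length(x)))))
colnames(wide) <- c("trait", paste0("prev.trait.", seq_len(depth - 1)))
output <- data.frame(id = heads, wide)
print(output)
```

The while loop runs once per observation in a chain, so the same code should cover individuals with 0 or 100 previous measurements without extra merge steps.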

Related

how to filter for a vector within a data frame that contains some but not all of the elements within the vector

I have a large data set that contains a lot of information about departure times of bus stops. I have a main data set that contains information regarding Trip_ID, Bus_sign as well as stop_ID. I further have an index by which I would like to filter the df by.
df <- data.frame(c(10,10,10,10,10,10,10,10,10,10),
c(8,10,12,15,22,26,27,40,45,50),
c("0000001","0000002","0000003","0000004","0000005","0000006","0000007", "0000008","0000009","0000010"))
names <- c("trip_ID", "Bus_sign", "stop_ID")
colnames(df) <- names
index <- c("0000001", "0000002", "0000003", "0000011","00000013")
the data frame would look something like this
trip_ID Bus_sign stop_ID
1 10 8 0000001
2 10 10 0000002
3 10 12 0000003
4 10 15 0000004
5 10 22 0000005
6 10 26 0000006
7 10 27 0000007
8 10 40 0000008
9 10 45 0000009
10 10 50 0000010
The index contains some of the stop_ID values in df, but it also contains some that are not in df. I would like to keep only the rows of df whose stop_ID matches an entry in index.
The result should look like this:
trip_ID Bus_sign stop_ID
1 10 8 0000001
2 10 10 0000002
3 10 12 0000003
I have tried the subset function, but it did not work:
subset(df, stop_ID %in% index)
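For what it is worth, on the sample data as posted, filtering with %in% appears to return the three expected rows. A minimal sketch follows, rebuilding the sample data with sprintf() for brevity; the names matched and unmatched are made up here. If the same call fails on the real data, one possible cause (an assumption, since the real file is not shown) is that stop_ID was read in as a number and lost its leading zeros, so it is worth checking that the column is character.

```r
# Rebuild the sample data; stop_ID is kept as character so leading zeros survive
df <- data.frame(
  trip_ID  = rep(10, 10),
  Bus_sign = c(8, 10, 12, 15, 22, 26, 27, 40, 45, 50),
  stop_ID  = sprintf("%07d", 1:10),
  stringsAsFactors = FALSE
)
index <- c("0000001", "0000002", "0000003", "0000011", "00000013")

# Keep only rows whose stop_ID occurs in index; index entries without a
# counterpart in df are simply ignored
matched <- df[df$stop_ID %in% index, ]
print(matched)

# Optional check: which index entries have no match in df?
unmatched <- setdiff(index, df$stop_ID)
print(unmatched)
```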

Mapping a dataframe (with NA) to an n by n adjacency matrix (as a data.frame object)

I have a three-column dataframe object recording the bilateral trade data between 161 countries, the data are of dyadic format containing 19687 rows, three columns (reporter (rid), partner (pid), and their bilateral trade flow (TradeValue) in a given year). rid or pid takes a value from 1 to 161, and a country is assigned the same rid and pid. For any given pair of (rid, pid) in which rid =/= pid, TradeValue(rid, pid) = TradeValue(pid, rid).
The data (run in R) look like this:
#load the data from dropbox folder
library(foreign)
example_data <- read.csv("https://www.dropbox.com/s/hf0ga22tdjlvdvr/example_data.csv?dl=1")
head(example_data, n = 10)
rid pid TradeValue
1 2 3 500
2 2 7 2328
3 2 8 2233465
4 2 9 81470
5 2 12 572893
6 2 17 488374
7 2 19 3314932
8 2 23 20323
9 2 25 10
10 2 29 9026220
The data were sourced from the UN Comtrade database. Each rid is paired with multiple pid values to get their bilateral trade data. Not every pid has a numeric id, because I only assigned a rid or pid to a country if a list of relevant economic indicators is available for it; that is why there are NAs in pid even though a TradeValue exists between that country and the reporting country (rid). Something similar happens on the reporter side: a country that did not report any TradeValue with partners is absent from the rid column. (Hence the rid column begins with 2, because country 1, i.e. Afghanistan, did not report any bilateral trade data with partners.) A quick check with summary statistics confirms this:
length(unique(example_data$rid))
[1] 139
# only 139 countries reported bilateral trade statistics with partners
length(unique(example_data$pid))
[1] 162
# that extra pid is NA (161 + NA = 162)
Most countries report bilateral trade data with partners, and those that don't tend to be small economies. Hence, I want to preserve the complete list of 161 countries and transform this example_data dataframe into a 161 x 161 adjacency matrix in which:
for those countries that are absent from the rid column (e.g., rid == 1), create each of them a row and set the entire row (in the 161 x 161 matrix) to 0.
for those countries (pid) that do not share TradeValue entries with a particular rid, set those cells to 0.
For example, suppose that in a 5 x 5 adjacency matrix country 1 did not report any trade statistics with partners, while the other four reported their bilateral trade statistics with each other (but not with country 1). The original dataframe looks like this:
rid pid TradeValue
2 3 223
2 4 13
2 5 9
3 2 223
3 4 57
3 5 28
4 2 13
4 3 57
4 5 82
5 2 9
5 3 28
5 4 82
I want to convert it to a 5 x 5 adjacency matrix (in data.frame format). The desired output should look like this:
V1 V2 V3 V4 V5
1 0 0 0 0 0
2 0 0 223 13 9
3 0 223 0 57 28
4 0 13 57 0 82
5 0 9 28 82 0
I then want to use the same method on example_data to create a 161 x 161 adjacency matrix. However, after some trial and error with reshape and other methods, I still could not get the conversion to work, not even the first step.
I would really appreciate it if anyone could enlighten me on this.
I cannot read the Dropbox file, but I have tried to work from your 5-country example data frame:
library(reshape2) # for dcast
country_num = 5
# check countries missing in rid and pid
rid_miss = setdiff(1:country_num, example_data$rid)
# if no pid is missing, fall back to 1 so the dummy row still has a partner
pid_miss = ifelse(length(setdiff(1:country_num, example_data$pid)) == 0,
1, setdiff(1:country_num, example_data$pid))
# create dummy dataframe with missing rid and pid
add_data = as.data.frame(do.call(cbind, list(rid_miss, pid_miss, NA)))
colnames(add_data) = colnames(example_data)
# add dummy dataframe to original
example_data = rbind(example_data, add_data)
# the dcast now takes missing rid and pid into account
mat = dcast(example_data, rid ~ pid, value.var = "TradeValue")
# can remove first column without setting colnames but this is more failproof
rownames(mat) = mat[, 1]
mat = as.matrix(mat[, -1])
# fill in upper triangular matrix with missing values of lower triangular matrix
# and vice-versa since TradeValue(rid, pid) = TradeValue(pid, rid)
mat[is.na(mat)] = t(mat)[is.na(mat)]
# change NAs to 0 according to preference - would keep as NA to differentiate
# from actual zeros
mat[is.na(mat)] = 0
Does this help?
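An alternative that avoids reshaping altogether is to start from an all-zero matrix and fill it by (rid, pid) position. The sketch below uses the 5-country example from the question plus one extra row with a missing pid; the names trade, n, adj and adj_df are made up here, and dropping rows with an NA id is an assumption about how such rows should be treated.

```r
n <- 5  # number of countries; 161 for the full data

# The 5-country example, plus one row whose partner has no id
trade <- data.frame(
  rid        = c(rep(2:5, each = 3), 2),
  pid        = c(3, 4, 5,  2, 4, 5,  2, 3, 5,  2, 3, 4,  NA),
  TradeValue = c(223, 13, 9,  223, 57, 28,  13, 57, 82,  9, 28, 82,  100)
)

# Assumption: rows where either id is NA cannot be placed and are dropped
ok <- !is.na(trade$rid) & !is.na(trade$pid)

# Start from zeros so countries that never report, and pairs without an
# entry, stay at 0
adj <- matrix(0, nrow = n, ncol = n,
              dimnames = list(as.character(1:n), paste0("V", 1:n)))

# A two-column index matrix addresses individual (row, column) cells
adj[cbind(trade$rid[ok], trade$pid[ok])] <- trade$TradeValue[ok]

adj_df <- as.data.frame(adj)
print(adj_df)
```

Because the matrix is preallocated with n rows and columns, no dummy rows are needed for countries that are absent from rid or pid.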

How do I create a column using values of a second column that meet the conditions of a third in R?

I have a dataset Comorbidity in RStudio, where I have added columns such as MDDOnset, and if the age at onset of MDD < the onset of OUD, it equals 1, and if the opposite is true, then it equals 2. I also have another column PhysDis that has values 0-100 (numeric in nature).
What I want to do is make a new column that includes the values of PhysDis, but only if MDDOnset == 1, and another if MDDOnset==2. I want to make these columns so that I can run a t-test on them and compare the two groups (those with MDD prior OUD, and those who had MDD after OUD with regards to which group has a greater physical disability score). I want any case where MDDOnset is not 1 to be NA.
ttest1 <-t.test(Comorbidity$MDDOnset==1, Comorbidity$PhysDis)
ttest2 <-t.test(Comorbidity$MDDOnset==2, Comorbidity$PhysDis)
When I ran the t-test twice, once with MDDOnset == 1 and once with MDDOnset == 2, the mean for y (Comorbidity$PhysDis) was the same both times. When I looked at the original csv file, it turned out that this was the mean of the entire column, not just of the cases where MDDOnset was 1 or 2. If there is a different way to run the t-tests that uses the mean of PhysDis only where MDDOnset == 1, and another only where MDDOnset == 2, without making new columns, then please tell me. Sorry if there are similar questions or if my approach is way off; I'm new to R and programming in general. Thanks in advance.
Here's a smaller data frame where I tried to replicate the error where the new columns have switched lengths. The issue would be that the length of C would be 4, and the length of D would be 6 if I could replicate the error.
> A <- sample(1:10)
> B <-c(25,34,14,76,56,34,23,12,89,56)
> alphabet <-data.frame(A,B)
> alphabet$C <-ifelse(alphabet$A<7, alphabet$B, NA)
> alphabet$D <-ifelse(alphabet$A>6, alphabet$B, NA)
> print(alphabet)
A B C D
1 7 25 NA 25
2 9 34 NA 34
3 4 14 14 NA
4 2 76 76 NA
5 5 56 56 NA
6 10 34 NA 34
7 8 23 NA 23
8 6 12 12 NA
9 1 89 89 NA
10 3 56 56 NA
> length(which(alphabet$C>0))
[1] 6
> length(which(alphabet$D>0))
[1] 4
I would use the mutate function from the dplyr package, filling the non-matching rows with NA so that the new columns stay numeric:
library(dplyr)
Comorbidity <- mutate(Comorbidity, newColumn = ifelse(MDDOnset == 1, PhysDis, NA), newColumn2 = ifelse(MDDOnset == 2, PhysDis, NA))
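If the goal is only the two-group comparison, the new columns may not be needed at all: t.test() has a formula interface that splits one numeric column by a grouping column. A minimal sketch on made-up data (the values below are random and purely illustrative; only the column names come from the question):

```r
set.seed(1)
# Made-up stand-in for the real Comorbidity data
Comorbidity <- data.frame(
  MDDOnset = rep(c(1, 2, NA), times = c(6, 4, 2)),
  PhysDis  = round(runif(12, 0, 100))
)

# Keep only the rows that belong to one of the two groups
both <- subset(Comorbidity, MDDOnset %in% c(1, 2))

# PhysDis is the outcome, MDDOnset defines the two groups
res <- t.test(PhysDis ~ MDDOnset, data = both)
print(res$estimate)  # one mean per group, not the mean of the whole column

# The same comparison written with two explicit vectors
res2 <- t.test(both$PhysDis[both$MDDOnset == 1],
               both$PhysDis[both$MDDOnset == 2])
```

The calls in the question compared a TRUE/FALSE vector (MDDOnset == 1) with the whole PhysDis column, which would explain why the reported mean of y was the mean of the entire column.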

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid having to specify each subset individually. I think that subset is probably not flexible enough for me to ask it to subset by a list of names (or at least not with my current knowledge of R, which is growing but still in its infancy). Is there another command I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data you can reference it like
splitdat[["alice"]]
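Once the data are split, the same operation can be applied to every site without naming the sites one by one. A small sketch on a cut-down, made-up version of the data (only the site and rate columns are kept; n_dives and mean_rate are names invented here):

```r
# Cut-down stand-in for reefdata
reefdata <- data.frame(
  site = c("alice", "alice", "steps", "steps", "andrea1"),
  rate = c(9, 9, 9, 9, 5)
)

# One data frame per site, stored in a named list
splitdat <- split(reefdata, reefdata$site)
print(names(splitdat))

# Apply the same function to every per-site data frame
n_dives   <- sapply(splitdat, nrow)
mean_rate <- sapply(splitdat, function(x) mean(x$rate))
print(n_dives)
print(mean_rate)
```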
I would use the plyr package.
library(plyr)
ll <- dlply(reefdata,.variables = c("site"))
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create vector with site names (as.character() covers the case where reefdata$site is a factor)
sites <- as.character( unique( reefdata$site ) )
# create empty list to take dive data per site
dives <- list()
# collect data per site into the list
for( i in seq_along( sites ) )
{
# subset
dive <- reefdata[ reefdata$site == sites[ i ] , ]
# add resulting data.frame to the list
dives[[ i ]] <- dive
# name the list element
names( dives )[ i ] <- sites[ i ]
}

looping over the name of the columns in R for creating new columns

I am trying to loop over the column names of an existing data frame and create new columns based on one of the old columns. Here is my sample data:
sample<-list(c(10,12,17,7,9,10),c(NA,NA,NA,10,12,13),c(1,1,1,0,0,0))
sample<-as.data.frame(sample)
colnames(sample)<-c("x1","x2","D")
>sample
x1 x2 D
10 NA 1
12 NA 1
17 NA 1
7 10 0
9 12 0
10 13 0
Now, I am trying to use a for loop to generate two variables, x1.imp and x2.imp, that take the values from the D=0 rows when D=1 and the values from the D=1 rows when D=0 (here I don't actually need a for loop, but my original dataset has many columns (variables), so I really need the loop), based on the following condition:
for (i in names(sample[,1:2])){
sample$i.imp<-with (sample, ifelse (D==1, i[D==0],i[D==1]))
i=i+1
return(sample)
}
Error in i + 1 : non-numeric argument to binary operator
However, the following works, but it doesn't name the new columns x1.imp and x2.imp:
for(i in sample[,1:2]){
impt.i<-with(sample,ifelse(D==1,i[D==0],i[D==1]))
i=i+1
print(as.data.frame(impt.i))
}
impt.i
1 7
2 9
3 10
4 10
5 12
6 17
impt.i
1 10
2 12
3 13
4 NA
5 NA
6 NA
Note that I already know the solution without loop [here]. I want with loop.
Expected output:
x1 x2 D x1.imp x2.imp
10 NA 1 7 10
12 NA 1 9 12
17 NA 1 10 13
7 10 0 10 NA
9 12 0 12 NA
10 13 0 17 NA
I would greatly appreciate your valuable input in this regard.
This is nuts, but since you are asking for it... Your code with minimum changes would be:
for (i in colnames(sample)[1:2]){
sample[[paste0(i, '.impt')]] <- with(sample, ifelse(D==1, get(i)[D==0],get(i)[D==1]))
}
A few comments:
replaced names(sample[,1:2]) with the more elegant colnames(sample)[1:2]
the $ is for interactive usage. Instead, when programming, i.e. when the column name is to be interpreted, you need to use [ or [[, hence I replaced sample$i.imp with sample[[paste0(i, '.impt')]]
inside with, i[D==0] will not give you x1[D==0] when i is "x1", hence the need to dereference it using get.
you should not name your data.frame sample as it is also the name of a pretty common function
This should work,
test <- sample[,"D"] == 1
for (.name in names(sample)[1:2]){
newvar <- paste(.name, "impt", sep=".")
sample[[newvar]] <- ifelse(test, sample[!test, .name],
sample[test, .name])
}
sample

Resources