Like dcast but without sum of data - r

I have data organized for the R survival package, but I want to export it to work in GraphPad Prism, which uses a different structure.
#Example data
Treatment<-c("A","A","A","A","A","B","B","B","B","B")
Time<-c(3,4,5,5,5,1,2,2,3,5)
Status<-c(1,1,0,0,0,1,1,1,1,1)
df<-data.frame(Treatment,Time,Status)
The data structure for the R survival package looks like this:
Treatment Time Status
A 3 1
A 4 1
A 5 0
A 5 0
A 5 0
B 1 1
B 2 1
B 2 1
B 3 1
B 5 1
The output I need organizes each treatment as one column, and then sorts by time. Each individual is then recorded as a 1 or 0 according to its Status. The output should look like this:
Time  A  B
   1     1
   2     1
   2     1
   3  1  1
   4  1
   5  0  1
   5  0
   5  0
dcast() does something similar to what I want, but it sums up the Status values and merges them into one cell for all individuals with matching Time values.
Thanks for any help!
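Sotos' answer is not reproduced in this thread, so purely as a hedged sketch of one approach: you can give each individual a row index within its Treatment/Time combination, so that dcast() has exactly one Status value per cell and nothing left to sum. This assumes the example df above, with columns Treatment, Time and Status, and the reshape2 package:
library(reshape2)
# Index each individual within its Treatment/Time combination so dcast()
# spreads single Status values instead of aggregating them.
df$obs <- with(df, ave(Status, Treatment, Time, FUN = seq_along))
out <- dcast(df, Time + obs ~ Treatment, value.var = "Status")
out[, c("Time", "A", "B")]
Cells where a treatment has no individual at that Time come back as NA, which corresponds to the blank cells in the desired output.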

I ran into a weird issue when trying to apply Sotos' code to my actual data. I got the error:
Error in Math.factor(var) : ‘abs’ not meaningful for factors
This is strange, because Sotos' code works for the example. When I checked the example data frame using sapply(), it gave me the result:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "numeric"
My issue, as far as I could tell, was that my Status variable was read as numeric in my example but as an integer in my real data:
> sapply(df,class)
Treatment Time Status
"factor" "numeric" "integer"
I loaded my data from a .csv, so maybe that's what caused the difference in how the variable was typed. I ended up converting my Status variable with as.numeric() and then re-generating the data frame.
Status<-as.numeric(df$Status)
df<-data.frame(Treatment, Time, Status)
I was then able to apply Sotos' code to the new data frame.

Related

Procedural way to generate signal combinations and their output in r

I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me.
I have a large data set (100K+ rows) and several columns that I could generate a signal from; each value in those columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals from the state of each cell in the four columns, then see what each composite signal tells me about the returns in a time series. For this question the scope is only generating the combinations.
So, for example, one composite signal would be when all four cells in the vectors equal 0. I could generate a new column that reads TRUE when that case holds and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and so on, which is quite a few. With my current knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn these into dummy variables:
library(dplyr)
library(dummies)

# Create sample data
data <- data.frame(sig1 = c(1, 1, 1, 1, 0, 0, 0),
                   sig2 = c(1, 1, 0, 0, 0, 1, 1),
                   sig3 = c(2, 2, 0, 1, 1, 2, 1))

# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1, sig2, sig3))

# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))

# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
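If the dummies package isn't available (it has at times been archived on CRAN), a hedged base-R alternative is model.matrix(), which builds the same kind of indicator columns from the pasted signal. The column prefix sig_ below is just an illustrative choice, and the sketch assumes the data frame with the sig_tot column created above:
# Base-R sketch: one indicator column per observed combination in sig_tot.
combo <- factor(data$sig_tot)
ind <- model.matrix(~ combo - 1)               # one 0/1 column per combination
colnames(ind) <- paste0("sig_", levels(combo))
data <- cbind(data, ind == 1)                  # append as logical columns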

Displaying of factor levels and labels in R

I am having an issue with displaying the correct grouping of a factor variable after using MICE. I believe this is an R thing, but I included it with mice just to be sure.
So, I run my mice algorithm; here is a snippet of how I format the variable before the mice call. Note, I want it to be 0 for no drug and 1 for yes drug, so I coerce it to a factor with levels 0 and 1 before I run it:
mydat$drug=factor(mydat$drug,levels=c(0,1),labels=c(0,1))
I then run mice and it runs logistic regression (this is the default) on drug, along with my other variables to be imputed.
I can extract the results of one of the imputations when it is complete by
drug=complete(imp,1)$drug
We can view it
> head(drug)
[1] 0 0 1 0 1 1
attr(,"contrasts")
2
0 0
1 1
Levels: 0 1
So the data is certainly 0,1.
However, when I do something with it, like cbind, it changes to 1's and 2's
> head(cbind(drug))
drug
[1,] 1
[2,] 1
[3,] 2
[4,] 1
[5,] 2
[6,] 2
Even when I coerce it to a numeric
> head(as.numeric(drug))
[1] 1 1 2 1 2 2
I want to say it has something to do with the contrasts, but even when I delete the contrasts with
attr(drug,"contrasts")=NULL
it still shows up as 1's and 2's when used and printed by other functions.
I am able to get it to print correctly by using I()
> head(I(drug))
[1] 0 0 1 0 1 1
Levels: 0 1
So, I believe that this is an R issue, but I don't know how to remedy it. Is using I() the correct solution, or is it just a workaround that happens to work here? What is actually happening behind the scenes that is making the output display as 1's and 2's?
Thanks
Factors start with the first level being represented internally by 1.
Your two options:
1) Adjust for 1-based index of levels:
as.numeric(drug) - 1
2) Take the labels of the factors and convert to numeric:
as.numeric(as.character(drug))
Some people will point you in the direction of the faster option that does the same thing:
as.numeric(levels(drug))[drug]
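To make the difference concrete, here is a tiny hedged example on a throwaway factor (not your actual drug variable):
f <- factor(c(0, 0, 1))
as.numeric(f)                # 1 1 2  -- the internal level codes
as.numeric(f) - 1            # 0 0 1  -- option 1
as.numeric(as.character(f))  # 0 0 1  -- option 2
as.numeric(levels(f))[f]     # 0 0 1  -- the faster variant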
I'd also consider using logical values instead of factor in the first place.
mydat$drug = as.logical(mydat$drug)
The 0s and 1s are the names of your levels. The underlying integers corresponding to those names are 1 and 2. You can see this with str():
str(drug)
# Factor w/ 2 levels "0","1": 2 2 2 2 2 2 1 1 2 2
When you coerce the factor to numeric, you drop the names and get the integer representation.
This is how R encodes factors. The underlying numeric representation of a factor always starts with 1, as you can see in the following two examples:
as.numeric(factor(c(0,1)))
as.numeric(factor(c("A","B")))
I'm not sure about the specifics of how MICE works, but if it requires a factor instead of a simple 0/1 numeric variable to use logistic regression, you can always hack the results with something like the following:
as.numeric(as.character(factor(c(0,1))))
or in your specific case
drug <- as.numeric(as.character(drug))

ChoiceModelR - Hierarchical Bayes Multinomial Logit Model

I hope that some of you have experience with the R package ChoiceModelR by Sermas and Colias for estimating a Hierarchical Bayes Multinomial Logit Model. Actually, I am quite a newbie at both R and Hierarchical Bayes. However, I tried to get some estimates by using the script provided by Sermas and Colias in the help file. I have a data set in the same structure as they use (ID, choice set, alternative, independent variables, and choice variable). I have four independent variables, all of them binary coded as categorical variables, none of them restricted. I have eight choice sets with three alternatives within each set, as well as a no-choice option as the fourth alternative. I tried the following script:
library(ChoiceModelR)
data <- read.delim("Z:/KLU/CSR/CBC/mp3_vio.txt")
xcoding=c(0,0,0,0)
mcmc = list(R = 10, use = 10)
options = list(none=FALSE, save=TRUE, keep=1)
attlevels=c(2,2,2,2)
c1=matrix(c(0,0,0,0),2,2)
c2=matrix(c(0,0,0,0),2,2)
c3=matrix(c(0,0,0,0),2,2)
c4=matrix(c(0,0,0,0),2,2)
constraints = list(c1, c2, c3, c4)
out = choicemodelr(data, xcoding, mcmc = mcmc, options = options, constraints = constraints)
and have got the following error message:
Error in 1:nalts[i] : result would be too long a vector
In addition: There were 50 or more warnings (use warnings() to see the first 50). The warnings were of the following form:
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
In max(temp[temp[, 2] == j, 3]) : no non-missing arguments to max; returning -Inf
Actually, I have no idea what went wrong, since I used the same data structure, even though I have more independent variables, more choice sets, and more alternatives within each choice set. It would be fantastic if anybody could shed some light into the darkness.
I know that this may not be helpful since you posted so long ago, but if it comes up again in the future, this could prove useful.
One of the most common reasons for this error (in my experience) has been that either the scenario variable or the alternative variable is not in ascending order within your data.
id scenario alt x1 ... y
 1        1   1  4     1
 1        1   2  1     0
 1        3   1  4     2
 1        3   2  5     0
 2        1   4  3     1
 2        1   5  1     0
 2        2   1  4     2
 2        2   2  3     0
This dataset will give you errors since the scenario and alternative variables must be ascending, and they must not skip any values. Just to fully reiterate what I mean, the scenario and alt variables must be reordered as follows in order to work:
id scenario alt x1 ... y
 1        1   1  4     1
 1        1   2  1     0
 1        2   1  4     2
 1        2   2  5     0
 2        1   1  3     1
 2        1   2  1     0
 2        2   1  4     2
 2        2   2  3     0
I work with ChoiceModelR quite frequently, and this is what has caused these errors for me in the past. If you have a github account, you can also post your data (or modified data) there if you end up wanting to have other users take a look.
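If your data shows the out-of-order pattern above, here is a hedged sketch of one way to renumber the columns, assuming a data frame called data with id, scenario, and alt columns and the dplyr package (not something ChoiceModelR itself requires):
library(dplyr)
# Renumber scenarios consecutively within each id, and alternatives within
# each id/scenario, so the values run 1, 2, 3, ... with no gaps.
data <- data %>%
  group_by(id) %>%
  mutate(scenario = dense_rank(scenario)) %>%
  group_by(id, scenario) %>%
  mutate(alt = dense_rank(alt)) %>%
  ungroup()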

Not sure why dcast() this data set results in dropping variables

I have a data frame that looks like:
id fromuserid touserid from_country to_country length
1 1 54525953 47195889 US US 2
2 2 54525953 54361607 US US 1
3 3 54525953 53571081 US US 2
4 4 41943048 55379244 US US 1
5 5 47185938 53140304 US PR 1
6 6 47185938 54121387 US US 1
7 7 54525974 50928645 GB GB 1
8 8 54525974 53495302 GB GB 1
9 9 51380247 45214216 SG SG 2
10 10 51380247 43972484 SG US 2
Each row describes a number of messages (length) sent from one user to another user.
What I would like to do is create a visualization (via a chord diagram in D3) of the messages sent between each country.
There are almost 200 countries. I use the function dcast as follows:
countries <- dcast(chats,from_country ~ to_country,drop=FALSE,fill=0)
This worked before for me when I had a smaller data set and fewer variables, but this data set is over 3M rows, and not easy to debug, so to speak.
At any rate, what I am getting now is a matrix that is not square, and I can't figure out why not. What I am expecting to get is essentially a matrix where the (i,j)th cell represents the messages sent from country i to country j. What I end up with is something very close to this, but with some rows and columns obviously missing, which is easy to spot because US->US messages show up shifted by one row or column.
So here's my question. Is there anything I'm doing that is obviously wrong? If not, is there something "strange" I should be looking for in the data set to sort this out?
Be sure that your "from_country" and "to_country" variables are factors, and that they share the same levels. Using the example data you shared:
chats$from_country <- factor(chats$from_country,
                             levels = unique(c(chats$from_country,
                                               chats$to_country)))
chats$to_country <- factor(chats$to_country,
                           levels = levels(chats$from_country))
dcast(chats, from_country ~ to_country, drop = FALSE, fill = 0)
# Using length as value column: use value.var to override.
# Aggregation function missing: defaulting to length
# from_country US GB SG PR
# 1 US 5 0 0 1
# 2 GB 0 2 0 0
# 3 SG 1 0 1 0
# 4 PR 0 0 0 0
If your "from_country" and "to_country" variables are already factors, but not with the same levels, you can do something like this for the first step:
chats$from_country <- factor(chats$from_country,
                             levels = unique(c(levels(chats$from_country),
                                               levels(chats$to_country))))
Why is this necessary? If they are already factors, then c(chats$from_country, chats$to_country) would coerce the factors to their underlying integer codes, and since those codes don't match any of the character values of the factors, using them as levels would result in <NA>.
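One further hedged note: in the output above, dcast() is counting rows ("defaulting to length"). If the chord diagram should instead reflect the total number of messages, i.e. the sum of the length column, you could state the aggregation explicitly:
dcast(chats, from_country ~ to_country,
      value.var = "length", fun.aggregate = sum,
      drop = FALSE, fill = 0)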

Doubts about ddply function in R

I'm trying to do the equivalent of a GROUP BY summary in R with the plyr function ddply. I have a data frame which has three columns (say id, period and event). I'd like to count the number of times each id appears in the data frame (COUNT(*) ... GROUP BY id in SQL) and get, for each id, the last element of the event column.
Here is an example of what I have and what I'm trying to obtain:
id period event #original data frame
1 1 1
2 1 0
2 2 1
3 1 1
4 1 1
4 1 0
id t x #what I want to obtain
1 1 1
2 2 1
3 1 1
4 2 0
This is the simple code I've been using for that:
teachers.pp<-read.table("http://www.ats.ucla.edu/stat/examples/alda/teachers_pp.csv", sep=",", header=T) # whole data frame
datos=ddply(teachers.pp,.(id),function(x) c(t=length(x$id), x=x[length(x$id),3])) #This is working fine.
Now, I've been reading The Split-Apply-Combine Strategy for Data Analysis, which gives an example that uses syntax equivalent to the one I put below:
datos2=ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3]) #using summarise but the result is not what I want.
This is the data frame I get using datos2
id t x
1 1 1
2 2 0
3 1 1
4 1 1
So, my question is: why is this result different from the one I get using the first piece of code, i.e. datos? What am I doing wrong?
It is also not clear to me when I have to use summarise or transform. Could you tell me the correct syntax for the ddply function?
When you use summarise, stop referencing the original data frame. Instead, just write expressions in terms of the column names.
You tried this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=teachers.pp[length(id),3])
when what you probably wanted was something more like this:
ddply(teachers.pp,.(id), summarise, t=length(id), x=tail(event,1))
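On the summarise vs. transform part of the question: summarise collapses each group to a single row containing only the new columns, while transform keeps every original row and appends the new columns. A hedged side-by-side, reusing the column names from the answer above:
library(plyr)
# One row per id, containing only the new columns:
ddply(teachers.pp, .(id), summarise, t = length(id), x = tail(event, 1))
# Every original row kept, with t and x appended within each id:
ddply(teachers.pp, .(id), transform, t = length(id), x = tail(event, 1))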
