I am using the LCMM package with the function hlme.
I want to know which IDs are classified into each class. Some of my cases are dropped from the model, so when I use $pprob the id variable just starts at 1 and counts up; it is not the original study ID, so I am not able to merge it with the original dataset.
Does anyone know how to do that?
Here is the code:
q4r.fv <- hlme(i_rest~days+days2, subject = 'ninclu', ng=4, idiag=T, mixture=~days+days2,
data=patch, maxiter = 100, returndata = TRUE,
classmb=~tttantal+sexe+atchroniq+hads_anxiI+hads_depI+pcs_totI)
summary(q4r.fv)
fmq4r.fv <-q4r.fv$pprob
write.csv(fmq4r.fv,file="fmq4r.fv",row.names=F)
With the csv file, I obtain the following. The ninclu column should be my ID, but it no longer matches my original minclu variable, which is a string participant ID.
> print.data.frame(fmq4r.fv)
ninclu class prob1 prob2 prob3 prob4
1 1 2 7.416779e-09 9.635142e-01 3.630078e-02 1.850563e-04
2 2 2 5.479232e-02 8.710804e-01 7.412726e-02 9.118352e-16
3 3 1 9.933911e-01 6.607882e-03 9.830514e-07 1.110920e-23
4 4 2 2.620132e-07 9.991825e-01 8.155809e-04 1.631318e-06
5 5 2 4.382259e-04 9.877001e-01 1.186168e-02 1.166050e-11
6 6 3 4.239271e-09 2.361263e-01 7.634902e-01 3.835313e-04
You mention that your subject ID is a string participant ID variable.
If you look at the source code for lcmm, you can see that lcmm requires a numeric subject:
if(!is.numeric(data[,subject])) stop("The argument subject must be numeric")
I suspect the character string participant ID is coerced to a factor, and it is the default factor coding (1, 2, 3, ...) that is appearing as the subject ID in the model output.
Try using a numeric participant ID and see if your results make more sense.
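If it helps, here is a minimal sketch of that idea (it assumes patch is the modelling data frame and minclu is the original string participant ID, as described above; ninclu_num and id_map are names I made up, and I have not run this):
library(lcmm)

# build a numeric subject ID from the string participant ID (assumption: the
# original string ID column is called minclu)
patch$ninclu_num <- as.integer(factor(patch$minclu))

# lookup table linking the numeric ID back to the original string ID
id_map <- unique(patch[, c("ninclu_num", "minclu")])

q4r.fv <- hlme(i_rest ~ days + days2, subject = "ninclu_num", ng = 4, idiag = TRUE,
               mixture = ~ days + days2, data = patch, maxiter = 100, returndata = TRUE,
               classmb = ~ tttantal + sexe + atchroniq + hads_anxiI + hads_depI + pcs_totI)

# merge the posterior class memberships back onto the original participant IDs
fmq4r.fv <- merge(q4r.fv$pprob, id_map, by = "ninclu_num")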
I have a data frame of names (df) as follows.
ID name
1 Xiaoao
2 Yukata
3 Kim
4 ...
Examples of the API output look like this:
European-SouthSlavs,0.2244 Muslim-Pakistanis-Bangladesh,0.0000 European-Italian-Italy,0.0061 ...
I would like to add new columns using the API, which returns nationality scores for up to 39 nationalities, and list up to the top 3 scores per name. My desired outcome is as follows.
ID name score nat
1 Xiaoao 0.7361 Chinese
1 Xiaoao 0.1721 Korean
1 Xiaoao 0.0721 Japanese
2 Yukata 0.8121 Japanese
2 Yukata 0.0811 Chinese
2 Yukata 0.0122 Korean
3 Kim 0.6532 Korean
3 Kim 0.2182 Chinese
3 Kim 0.0981 Japanese
4 ... ... ...
Below is some of my scratch code for getting it done, but I failed to get the desired outcome due to a number of errors.
df_result <- purrr::map_dfr(df$name, function(name) {
  result <- GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
                       "API TOKEN", "/", URLencode(df$name)))
  if(http_error(result)){
    NULL
  }else{
    nat <- content(result, "text")
    nat <- do.call(rbind, strsplit(strsplit(nat, split = "(?<=\\d)\n", perl=T)[[1]], ","))
    #first three nationalities
    top_nat <- nat[order(as.numeric(nat[,2]), decreasing = T)[1:3],]
    c(df$name, as.vector(t(top_nat)))
  }
})
First, the top scores in the results were based on the entire data rather than computed per name.
Second, I faced an error saying "Error in dplyr::bind_rows(): ! Argument 1 must have names."
If you can add any comments on my code, I would appreciate it!
Thank you in advance.
The output of each iteration of map_dfr should be a data frame whose rows can then be bound together:
library(tidyverse)
library(httr)
df <- data.frame(name = c("Xiaoao", "Yukata", "Kim"))
map_dfr(df$name, function(name) {
data.frame(name = df$name, score = sample(1:10, 1))
})
Instead of concatenating name with top_nat at the end of your function, you should be making it a data.frame!
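For example, a hedged sketch of the same function adjusted so that each iteration returns a small data frame for one name (the endpoint and the "API TOKEN" placeholder are copied from the question; I have not run this against the real service):
library(httr)
library(purrr)

df_result <- map_dfr(df$name, function(nm) {
  # query the API for the single name passed in (nm), not the whole df$name vector
  result <- GET(paste0("http://www.name-prism.com/api_token/nat/csv/",
                       "API TOKEN", "/", URLencode(nm)))
  if (http_error(result)) return(NULL)

  nat <- content(result, "text")
  nat <- do.call(rbind, strsplit(strsplit(nat, split = "(?<=\\d)\n", perl = TRUE)[[1]], ","))

  # keep the top three nationalities for this one name and return a data frame,
  # so map_dfr has something it can row-bind
  top_nat <- nat[order(as.numeric(nat[, 2]), decreasing = TRUE)[1:3], , drop = FALSE]
  data.frame(name = nm, nat = top_nat[, 1], score = as.numeric(top_nat[, 2]))
})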
This is a follow-up from a previous post (R: Running multiple tests by selecting (and increasing) number of fixed data points selected):
I have a dataframe (saved as data.csv) that looks something like this:
person outcome baseline_post time
1      0       baseline      BL_1
1      1       baseline      BL_2
1      0       baseline      BL_3
1      2       baseline      BL_4
1      4       post          post_1
1      3       post          post_2
1      4       post          post_3
1      6       post          post_4
2      1       baseline      BL_1
2      2       baseline      BL_2
2      0       baseline      BL_3
2      1       baseline      BL_4
2      3       post          post_1
2      2       post          post_2
2      4       post          post_3
2      3       post          post_4
As in the previous post, the purpose is to iterate the same test (it can be any test) over the desired fixed combinations arranged across time,
i.e., for each participant, compare the outcome(s) at BL_1 against post_1, then BL_1 and BL_2 against post_1 ... then BL_1, BL_2, BL_3 and BL_4 against post_1, etc.
Basically, all combinations increasing in the number of weeks tested before (BL_1 to 4) and after (post_1 to 4) treatment.
I tried modifying #Caspar V.'s code (thanks #Caspar V. for your previous response):
#creating pre/post data frames for later use
library(dplyr)
df <- read.csv("C:/Users/data.csv")
df_baseline <- filter(df, baseline_post == "baseline") %>%
rename(baseline = baseline_post) %>%
rename(time_baseline = time)
df_post <- filter(df, baseline_post == "post") %>%
rename(post = baseline_post) %>%
rename(time_post = time)
#generate a list of desired comparisons
comparisons = list()
for(a_len in seq_along(df_baseline$baseline)) for(b_len in seq_along(df_post$post)){
comp = list(baseline = head(df_baseline$time_baseline, a_len), post = head(df_post$time_post, b_len))
comparisons = append(comparisons, list(comp))
}
#KIV create combined df for time if required
df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
colnames(df_baseline_post) = c("time_baseline", "time_post")
#iterate through list of comparisons
for(df_baseline_post in comparisons) {
cat(df_baseline_post$time_baseline, 'versus', df_baseline_post$time_post, '\n')
#this is where your analysis goes, poisson_frequencies being a test function I created
poisson_frequencies(df)
}
This is unfortunately my output, which is 16 "versus-es", because there are 16 possible combinations based on the above data:
versus
versus
versus
versus
versus
versus
...
versus
I am not sure what went wrong. Appreciate any input. I am new when it comes to programming in R.
There are a number of problems; the following should get you back on track. Good luck!
1)
You're getting 64 comparisons in comparisons, not 16; if you look at the contents of comparisons you'll see that. It's because the time values are duplicated across participants, so you need to reduce them to unique values first:
#generate a list of desired comparisons
groupA = unique(df_baseline$time_baseline)
groupB = unique(df_post$time_post)
comparisons = list()
for(a_len in seq_along(groupA)) for(b_len in seq_along(groupB)) {
comp = list(baseline = head(groupA, a_len), post = head(groupB, b_len))
comparisons = append(comparisons, list(comp))
}
2)
The following block is not used, and the variable df_baseline_post is overwritten in the for-loop after it, so you can just remove this:
#KIV create combined df for time if required
# df_baseline_post <- cbind(df_baseline$time_baseline, df_post$time_post)
# colnames(df_baseline_post) = c("time_baseline", "time_post")
3)
You're executing poisson_frequencies(df) every time, but not doing anything with the output. That's why you're not seeing anything. You'll need to put a print() around it: print(poisson_frequencies(df)). Of course df is also not the data you want to work with, but I hope you already knew that.
4)
Inside the loop, df_baseline_post is one element of comparisons, which only has the elements baseline and post, so df_baseline_post$time_baseline and df_baseline_post$time_post don't exist. The loop should be:
for(df_baseline_post in comparisons) {
cat(df_baseline_post$baseline, 'versus', df_baseline_post$post, '\n')
print(poisson_frequencies(df))
}
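Putting the four points together, a rough sketch (poisson_frequencies is the asker's own test function; subsetting df to the time points in each comparison is my assumption about what the test should receive):
groupA <- unique(df_baseline$time_baseline)
groupB <- unique(df_post$time_post)

comparisons <- list()
for (a_len in seq_along(groupA)) for (b_len in seq_along(groupB)) {
  comp <- list(baseline = head(groupA, a_len), post = head(groupB, b_len))
  comparisons <- append(comparisons, list(comp))
}

for (comp in comparisons) {
  cat(comp$baseline, 'versus', comp$post, '\n')
  # restrict the data to the time points involved in this comparison before testing
  df_sub <- df[df$time %in% c(comp$baseline, comp$post), ]
  print(poisson_frequencies(df_sub))
}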
I want to simulate a time series data frame that contains observations of 5 variables that were taken on 10 individuals. I want the number of rows (observations) to be different between each individual. For instance, I could start with something like this:
ID = rep(c("alp", "bet", "char", "delta", "echo"), times = c(1000,1200,1234,980,1300))
in which case ID represents each unique individual (I would later turn this into a factor), and the number of times each ID was repeated would represent the length of measurements for that factor. I would next need to create a column called Time with sequences from 1:1000, 1:1200, 1:1234, 1:980, and 1:1300 (to represent the length of measurements for each individual). Lastly I would need to generate 5 columns of random numbers for each of the 5 variables.
There are tons of ways to go about generating this data set, but what would be the most practical way to do it?
You can do:
ID = c("alp", "bet", "char", "delta", "echo")
num = c(1000,1200,1234,980,1300)
df <- data.frame(ID = rep(ID, num), num = sequence(num))
df[paste0('rand', seq_along(ID))] <- rnorm(length(ID) * sum(num))
head(df)
# ID num rand1 rand2 rand3 rand4 rand5
#1 alp 1 0.1340386 0.95900538 0.84573154 0.7151784 -0.07921171
#2 alp 2 0.2210195 1.67105483 -1.26068288 0.9171749 -0.09736927
#3 alp 3 1.6408462 0.05601673 -0.35454240 -2.6609228 0.21615254
#4 alp 4 -0.2190504 -0.05198191 -0.07355602 1.1102771 0.88246516
#5 alp 5 0.1680654 -1.75323736 -1.16865142 -0.4849876 0.20559750
#6 alp 6 1.1683839 0.09932759 -0.63474826 0.2306168 -0.61643584
I have used rnorm here; you can use any other distribution to generate the random numbers.
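If you want the column names mentioned in the question (my assumption about the intended naming), you can rename num to Time and turn ID into a factor afterwards:
names(df)[names(df) == "num"] <- "Time"
df$ID <- factor(df$ID)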
I've been building simulators in Excel with VBA to understand the distribution of outcomes a player may experience as they open up collectible card packs. These were largely built with nested for loops, and as you can imagine...were slow as molasses.
I've been spinning up on R over the last couple of months, and have come up with a function that handles a particular definition of a pack (i.e., two cards with particular drop rates for n characters on either card). Now I am trying to abstract the function so that it can take any number of cards of whatever type of thing you want to throw at it (e.g., currency, gear, materials, etc.).
What this simulator is basically doing is saying "I want to watch 10,000 people open up 250 packs of 2 cards" and then I perform some analysis after the results are generated to ask questions like "How many $ will you need to spend to acquire character x?" or "What's the distribution of outcomes for getting x, y or z pieces of a character?"
Here's my generic function and then I'll provide some inputs that the function operates on:
mySimAnyCard <- function(observations, packs, lookup, droptable, cardNum){
  obvs <- rep(1:observations, each = packs)
  pks <- rep(1:packs, times = observations)
  crd <- rep(cardNum, length.out = length(obvs))
  if("prob" %in% colnames(lookup)){
    awrd = sample(lookup[,"award"], length(obvs), replace = TRUE, prob = lookup[,"prob"])
  } else {
    awrd = sample(unique(lookup[,"award"]), length(obvs), replace = TRUE)
  }
  qty = sample(droptable[,"qty"], length(obvs), prob = droptable[,"prob"], replace = TRUE)
  df <- data.frame(observation = obvs, pack = pks, card = cardNum, award = awrd, quantity = qty)
  df  # return the simulated pack results
}
observations and packs are set to an integer.
lookup takes a dataframe:
award prob
1 Nick 0.5
2 Alex 0.4
3 Sam 0.1
and droptable takes a similar dataframe:
qty prob
1 10 0.1355
2 12 0.3500
3 15 0.2500
4 20 0.1500
5 25 0.1000
6 50 0.0080
... continued
cardNum also takes an integer.
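For reference, a call with the example inputs above might look like this (a sketch only; the droptable probabilities are truncated as in the excerpt, and sample() normalises whatever probabilities it is given):
lookup_ex <- data.frame(award = c("Nick", "Alex", "Sam"), prob = c(0.5, 0.4, 0.1))
droptable_ex <- data.frame(qty = c(10, 12, 15, 20, 25, 50),
                           prob = c(0.1355, 0.3500, 0.2500, 0.1500, 0.1000, 0.0080))
# 10,000 observers each opening 250 packs, drawing card number 1
sim <- mySimAnyCard(10000, 250, lookup_ex, droptable_ex, 1)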
It's fine to run this multiple times and assign the output to a variable and then rbind and order, but what I'd really like to do is feed a master function a dataframe that contains which cards it needs to provision and which lookup and droptables it should pull against for each card a la:
card lookup droptable
1 1 char1 chardrops
2 2 char1 chardrops
3 3 char2 <NA>
4 4 credits <NA>
5 5 credits creditdrops
6 6 abilityMats abilityMatDrops
7 7 abilityMats abilityMatDrops
It's probably never going to be more than 20 cards...so I'm willing to take the speed of a for loop, but I'm curious how the SO community would approach this problem.
Here's what I put together thus far:
mySimAllCards <- function(observations, packs, cards){
  full <- data.frame()
  for(i in i:length(cards$card)){
    tmp <- mySimAnyCard(observations, packs, cards[i,2], cards[i,3], i)
    full <- rbind(full, tmp)
  }
}
which trips over
Error in `[.default`(lookup, , "award") : incorrect number of dimensions
I can work through the issues above, but is there a better approach to consider?
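One hedged sketch of a workaround for that error: the lookup and droptable columns of cards hold names (strings or factors), not the tables themselves, so lookup[, "award"] inside mySimAnyCard has nothing to index. Keeping the actual data frames in a named list and fetching them by name avoids this; the list names below mirror the example table and are assumptions, and seq_len() replaces the i:length(...) typo in the loop:
# named list of the actual lookup/droptable data frames (names are assumptions)
tables <- list(char1 = char1, char2 = char2, credits = credits,
               abilityMats = abilityMats, chardrops = chardrops,
               creditdrops = creditdrops, abilityMatDrops = abilityMatDrops)

mySimAllCards <- function(observations, packs, cards) {
  out <- lapply(seq_len(nrow(cards)), function(i) {
    mySimAnyCard(observations, packs,
                 tables[[as.character(cards$lookup[i])]],
                 tables[[as.character(cards$droptable[i])]],
                 cards$card[i])
  })
  # rows with an <NA> droptable would still need their own handling
  do.call(rbind, out)
}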
I'm struggling with a reshape in R. I have 2 types of error (err and rel_err) that have been calculated for 3 different models, giving me a total of 6 error variables (i.e. err_1, err_2, err_3, rel_err_1, rel_err_2, and rel_err_3). For each of these types of error I have 3 different types of predictive validity tests (i.e. random holdouts, backcast, forecast). I would like to make my data set long so that the types of test stay long while also making the two error measurements long. So in the end I will have one variable called err and one called rel_err, as well as an id variable for which model the error corresponds to (1, 2, or 3).
Here is my data right now:
iter err_1 rel_err_1 err_2 rel_err_2 err_3 rel_err_3 test_type
1 -0.09385732 -0.2235443 -0.1216982 -0.2898543 -0.1058366 -0.2520759 random
1 0.16141630 0.8575728 0.1418732 0.7537442 0.1584816 0.8419816 back
1 0.16376930 0.8700738 0.1431505 0.7605302 0.1596502 0.8481901 front
1 0.14345986 0.6765194 0.1213689 0.5723444 0.1374676 0.6482615 random
1 0.15890059 0.7435382 0.1589823 0.7439204 0.1608709 0.7527580 back
1 0.14412360 0.6743928 0.1442039 0.6747684 0.1463520 0.6848202 front
and here is what I would like it to look like:
iter model err rel_err test_type
1 1 -0.09385732 (#'s) random
1 2 -0.1216982 (#'s) random
1 3 -0.1058366 (#'s) random
and on...
I've tried playing around with the syntax but can't quite figure out what to put for the varying and timevar arguments.
Thanks very much for any help you can offer.
You could do it the "hard" way. For transparency you can use names.
with(dat, data.frame(iter = rep(iter, 3),
                     model = rep(1:3, each = nrow(dat)),
                     err = c(err_1, err_2, err_3),
                     rel_err = c(rel_err_1, rel_err_2, rel_err_3),
                     test_type = rep(test_type, 3)))
Or, for conciseness, indexes.
data.frame(iter = dat[,1], model = rep(1:3, each = nrow(dat)), err = unlist(dat[,c(2, 4, 6)]),
           rel_err = unlist(dat[,c(3, 5, 7)]), test_type = dat[,8])
If you had a LOT of columns the hard way might involve grepping the column names.
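For example, a small sketch (assuming the wide columns follow the err_* / rel_err_* naming used above):
err_cols     <- grep("^err_",     names(dat), value = TRUE)
rel_err_cols <- grep("^rel_err_", names(dat), value = TRUE)
data.frame(iter = dat$iter, model = rep(seq_along(err_cols), each = nrow(dat)),
           err = unlist(dat[err_cols]), rel_err = unlist(dat[rel_err_cols]),
           test_type = dat$test_type)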
This "hard" way was about as concise as reshape and required less thinking about how to use the commands. Sometimes I just skip thinking about reshape.
The base function reshape will let you do this:
reshape(DT, direction = 'long',
        varying = list(paste('err', 1:3, sep = '_'), paste('rel_err', 1:3, sep = '_')),
        v.names = c('err', 'rel_err'), timevar = 'model')
iter test_type model err rel_err id
1.1 1 random 1 -0.09385732 -0.2235443 1
2.1 1 back 1 0.16141630 0.8575728 2
3.1 1 front 1 0.16376930 0.8700738 3
4.1 1 random 1 0.14345986 0.6765194 4
5.1 1 back 1 0.15890059 0.7435382 5
6.1 1 front 1 0.14412360 0.6743928 6
1.2 1 random 2 -0.12169820 -0.2898543 1
2.2 1 back 2 0.14187320 0.7537442 2
3.2 1 front 2 0.14315050 0.7605302 3
4.2 1 random 2 0.12136890 0.5723444 4
5.2 1 back 2 0.15898230 0.7439204 5
6.2 1 front 2 0.14420390 0.6747684 6
1.3 1 random 3 -0.10583660 -0.2520759 1
2.3 1 back 3 0.15848160 0.8419816 2
3.3 1 front 3 0.15965020 0.8481901 3
4.3 1 random 3 0.13746760 0.6482615 4
5.3 1 back 3 0.16087090 0.7527580 5
6.3 1 front 3 0.14635200 0.6848202 6
I agree that the syntax for reshape is hard to get your head around sometimes. I will spell out how this call works:
direction = 'long' -- reshaping to long format.
varying = list(paste('err',1:3,sep ='_'), paste('rel_err',1:3,sep ='_')) -- we pass a list of length 2 because we are stacking into two different variables. The columns paste('err',1:3,sep ='_') will become the first new variable in long format and paste('rel_err',1:3,sep ='_') will become the second new variable in long format.
v.names = c('err','rel_err') -- sets the names of the two new variables in long format.
timevar = 'model' -- sets the name of the time identifier (here the _1, _2, _3 suffixes of the columns in wide format).
I hope this is somewhat clearer.
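As an aside (not part of the original answer), the same reshape can be written with tidyr's pivot_longer, which some people find easier to read; a sketch assuming the wide data frame is DT as above:
library(tidyr)

long <- pivot_longer(DT,
                     cols = matches("err_\\d"),
                     names_to = c(".value", "model"),
                     names_pattern = "(.*)_(\\d)")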