How to make a normally distributed variable depend on entries and time in R?

I'm trying to generate a cross-sectional time-series dataset to evaluate different models.
In this dataset, I have an ID variable and a time variable. I'm trying to add a normally distributed variable that depends on both identifiers. In other words, how do I create a variable that recognizes both ID and time in R?
If my question is unclear, feel free to ask for clarification.
Thanks in advance.
df2 <- read.table(
text =
"Year,ID,H,
1,1,N(2.3),
2,1,N(2.3),
3,1,N(2.3),
1,2,N(0.1),
2,2,N(0.1),
3,2,N(0.1),
", sep = ",", header = TRUE)

Assuming that the data in the data frame df looks like
ID Time
1 1
1 2
1 3
1 4
2 1
2 2
2 3
2 4
3 1
3 2
3 3
3 4
you can generate a variable y that depends on ID and time as the sum of two normally distributed random variables (the sum of independent normals is again normal), one depending on ID and the other on time:
set.seed(42)
df = data.frame(
ID = rep(1:3, each=4), # 3 IDs, each observed at 4 time points
time = rep(1:4, times=3)
)
df$y = rnorm(nrow(df), mean=df$ID, sd=1+0.1*df$ID) +
rnorm(nrow(df), mean=df$time, sd=0.05*df$time)
# Output:
ID time y
1 1 1 3.438611
2 1 2 2.350953
3 1 3 4.379443
4 1 4 5.823339
5 2 1 3.470909
6 2 2 3.607005
7 2 3 6.447756
8 2 4 6.150432
9 3 1 6.608619
10 3 2 4.740341
11 3 3 7.670543
12 3 4 10.215574
Note that the underlying normal distributions depend on both ID and time. That is in contrast to your example table above where it looks like it solely depends on ID -- namely resulting in a single normal distribution per ID that is independent of the time variable.
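If instead each ID should have its own mean that stays the same across years, as in the table in the question, a minimal sketch could look like the following. The names n_id, n_time, id_mean and panel, the standard deviation of 1, and the reading of N(2.3) and N(0.1) as per-ID means are all assumptions made for illustration.

```r
set.seed(1)
n_id   <- 2   # hypothetical number of IDs
n_time <- 3   # hypothetical number of years

# One row per ID-year combination; Year varies fastest, as in the question
panel <- expand.grid(Year = seq_len(n_time), ID = seq_len(n_id))

# Assumed per-ID means, echoing N(2.3) and N(0.1) from the question
id_mean <- c(2.3, 0.1)

# H is normal with a mean that depends on ID only; sd = 1 is an assumption
panel$H <- rnorm(nrow(panel), mean = id_mean[panel$ID], sd = 1)
panel
```

Adding a second term that depends on Year, as in the answer above, would make the distribution vary with time as well.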

Related

How to extract a list of columns name based on the means of their data?

I'm pretty new to R and hope I'll make myself clear enough.
I have a table of several columns which are factors. I want to compute a score for each of these columns. Then I want to calculate the mean of each score and display the list of columns ranked by their mean scores. Is that possible?
Table would be:
head(musico[,69:73])
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4
I want to make a score for each:
musico$score1<-0
musico$score1[musico$AVIS1==1]<-1
musico$score1[musico$AVIS1==2]<-0.5
then do the mean of each column score: mean of score1, mean of score2, ...:
mean(musico$score1), mean(musico$score2), ...
My goal is to have a list of titles (avis1, avis2,...) ranked by their mean score.
Any advice appreciated !
Here's one way using base although it is somewhat unclear what you want. What does score1 have to do with AVIS1? I think you may be missing some of the data from musico.
Based on the example provided, here's a base R solution. vapply loops through the data.frame and produces the mean for each column. Then the stack and order are only there to make the output a dataframe that looks nice.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE)
means <- vapply(music, mean, 1)
stack(means[order(means, decreasing = TRUE)])
values ind
4 4.000000 AVIS4
3 3.166667 AVIS3
2 2.666667 AVIS2
5 2.500000 AVIS5
1 2.166667 AVIS1
This is how I would do it: first introduce a scores vector to be used as a lookup. I assume that scores decrease by 0.5 and that the number of scores needed equals the maximum number of levels found in your columns (i.e. 6 seen in AVIS1).
Then, using tidyr, you can organise your data set so that you have two variables (i.e. AVIS and Value), holding the column name and its value respectively. Next, add a score variable with the mutate function from dplyr, in which the position of the score in the scores vector matches the value in the Value variable. From here you can find the mean scores corresponding to the AVIS levels, arrange them accordingly and put them in a list.
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE) # your data
scores <- seq(1, by = -0.5, length.out = 6) # vector of scores
library(tidyr)
library(dplyr)
music2 <- music %>%
gather(AVIS, Value) %>% # here you tidy the data
mutate(score = scores[Value]) %>% # match score to value
group_by(AVIS) %>% # group AVIS levels
summarise(score.mean = mean(score)) %>% # find mean scores for AVIS levels
arrange(desc(score.mean))
list <- list(AVIS = music2$AVIS) # here is the list
> list$AVIS
[1] "AVIS1" "AVIS5" "AVIS2" "AVIS3" "AVIS4"
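The same lookup idea can also be sketched in base R without reshaping. This is only an illustration under the same assumption as above (scores decreasing by 0.5 from 1); the names score_means and ranked are made up here.

```r
music <- read.table(text = "
AVIS1 AVIS2 AVIS3 AVIS4 AVIS5
1 2 1 2 3 2
2 2 5 2 3 2
3 3 2 5 5 1
4 1 2 5 5 5
5 1 5 1 3 1
6 4 1 4 5 4", header = TRUE)

# Assumed lookup: value 1 -> 1, value 2 -> 0.5, value 3 -> 0, and so on
scores <- seq(1, by = -0.5, length.out = 6)

# For each column, translate values to scores and average them
score_means <- sapply(music, function(col) mean(scores[col]))

# Column names ordered from highest to lowest mean score
ranked <- names(sort(score_means, decreasing = TRUE))
ranked
```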

what is this function doing? replication [closed]

library(dplyr) # provides %>%, bind_rows(), mutate(), select(), group_by()
rep_sample_n <- function(tbl, size, replace = FALSE, reps = 1)
{
  rep_tbl = replicate(reps, tbl[sample(1:nrow(tbl), size, replace = replace), ],
                      simplify = FALSE) %>%
    bind_rows() %>%
    mutate(replicate = rep(1:reps, each = size)) %>%
    select(replicate, everything()) %>%
    group_by(replicate)
  return(rep_tbl)
}
Hey, can anyone help me there? What is this function doing? Is the first line setting the variables of the function? And then what is this "replicate" doing? Thanks!
This function replicates your data. Let's say we have a dataset of 10 observations. To come up with additional datasets similar to your current one, you can replicate it by randomly sampling from your dataset.
You can check out the Wikipedia page on statistical replication if you're more curious.
Let's take a simple data frame:
df <- data.frame(x = 1:10, y = 1:10)
df
x y
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 10 10
If we want to take a random sample of this, we can use the function rep_sample_n, which takes two required arguments, tbl and size, and two optional arguments, replace = FALSE and reps = 1.
Here is an example of us just taking 4 randomly selected rows from our data.
rep_sample_n(df, 4)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 1 1
2 1 3 3
3 1 4 4
4 1 10 10
Now if we want to randomly sample 15 observations from a 10-observation dataset, it will throw an error. With replace = FALSE, each row that is sampled is removed from the pool before the next draw, so at most 10 rows can be drawn. In the example above, once the 1st observation was chosen, only observations 2 through 10 were left for the remaining draws, and the same applies after the 3rd, 4th and 10th were chosen. With replace = TRUE, each draw is made from the full dataset.
Notice how in this example the 5th observation was chosen twice. That wouldn't happen with replace = FALSE.
rep_sample_n(df, 4, replace = TRUE)
# A tibble: 4 x 3
# Groups: replicate [1]
replicate x y
<int> <int> <int>
1 1 5 5
2 1 3 3
3 1 2 2
4 1 5 5
Lastly and most importantly, we have the reps argument, which is really the basis for this function. It allows you to randomly sample your dataset multiple times and then combine all those samples together.
Below, we have sampled our original dataset of 10 observations by selecting 4 of them in a sample, and replicated that 5 times. That gives 5 different sample data frames of 4 observations each, combined into one data frame of 20 observations, with each of the 5 samples tagged with a replicate number. The replicate column shows which 4 observations go with which replicated data frame.
rep_sample_n(df, 4, reps = 5)
# A tibble: 20 x 3
# Groups: replicate [5]
replicate x y
<int> <int> <int>
1 1 8 8
2 1 4 4
3 1 3 3
4 1 1 1
5 2 4 4
6 2 5 5
7 2 8 8
8 2 3 3
9 3 6 6
10 3 1 1
11 3 3 3
12 3 2 2
13 4 5 5
14 4 7 7
15 4 10 10
16 4 3 3
17 5 7 7
18 5 10 10
19 5 3 3
20 5 9 9
I hope this provided some clarity
This function takes a data frame as input (and several input preferences). It takes a random sample of size rows from the table, with or without replacement as set by the replace input. It repeats that random sampling reps times.
Then, it binds all the samples together into a single data frame, adding a new column called "replicate" indicating which repetition of the sampling produced each row.
Finally, it "groups" the resulting table, preparing it for future group-wise operations with dplyr.
For general questions about specific functions, like "What is this "replicate" doing?", you should look at the function's help page: type ?replicate or help("replicate") to get there. It includes a description of the function and examples of how to use it. If you read the description, run the examples, and are still confused, feel free to come back with a specific question and example illustrating what you are confused by.
Similarly, for "Is the first line setting the variables of the function?", the arguments to function() are the inputs to the function. If you have basic questions about R like "How do functions work", have a look at An Introduction to R, or one of the other sources in the R Tag Wiki.
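As a small illustration of what replicate() does on its own, the sketch below repeats a random draw three times and keeps the results as a list. The name draws and the numbers are made up for this example.

```r
set.seed(123)

# Evaluate sample(1:10, 4) three times; simplify = FALSE keeps a list
# with one element per repetition instead of collapsing to a matrix
draws <- replicate(3, sample(1:10, 4), simplify = FALSE)

length(draws)    # number of repetitions
lengths(draws)   # number of values drawn in each repetition
```

In rep_sample_n the repeated expression is a row subset of tbl rather than a plain vector, which is why the list can then be passed to bind_rows().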

Assign ID across 2 columns of variable

I have a data frame in which each individual (row) has two data points per variable.
Example data:
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)
I have many more columns than this in the full dataset and would like to recode these values to label unique identifiers across the two columns. I know how to get identifiers and relabel a single column from previous questions (Creating a unique ID and How to assign a unique ID number to each group of identical values in a column) but I don't know how to include the information for two columns, as R identifies and labels factors per column.
Ultimately I want something that would look like this for the above data:
(df2)
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
It doesn't really matter what the numbers are, as long as they indicate unique values across both columns. I've tried creating a function based on the output from:
unique(df1[,1:2])
but am struggling as this still looks at unique entries per column, not across the two.
Something like this would work...
pairs <- (ncol(df1)-1)/2
for(i in 1:pairs){
refs <- unique(c(df1[,2*i],df1[,2*i+1]))
df1[,2*i] <- match(df1[,2*i],refs)
df1[,2*i+1] <- match(df1[,2*i+1],refs)
}
df1
IID L1.1 L1.2 L2.1 L2.2
1 1 1 1 1 4
2 2 2 4 2 5
3 3 3 2 3 1
4 4 4 5 4 3
You could reshape it to long format, assign the groups and then recast it to wide:
library(data.table)
df_m <- melt(as.data.table(df1), id.vars = "IID") # df1 is the data frame from the question
setDT(df_m)[, id := .GRP, by = .(gsub("(.*).","\\1", df_m$variable), value)]
dcast(df_m, IID ~ variable, value.var = "id")
# IID L1.1 L1.2 L2.1 L2.2
#1 1 1 1 6 9
#2 2 2 4 7 10
#3 3 3 2 8 6
#4 4 4 5 9 8
This should also be easily expandable to multiple groups of columns. I.e. if you have L3. it should work with that as well.
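For a base R variant that handles any number of column groups, one possible sketch is below. It assumes that columns belonging together share the part of the name before the last dot (L1.1 and L1.2 share L1); the function name recode_groups and its id_col argument are hypothetical.

```r
df1 <- read.table(text = "IID L1.1 L1.2 L2.1 L2.2
1 1 38V1 38V1 48V1 52V1
2 2 36V1 38V2 50V1 48Y1
3 3 37Y1 36V1 50V2 48V1
4 4 38V2 36V2 52V1 50V2",
stringsAsFactors = FALSE, header = TRUE)

recode_groups <- function(d, id_col = "IID") {
  value_cols <- setdiff(names(d), id_col)
  # Group columns by the name before the last dot, e.g. "L1.1" -> "L1"
  prefix <- sub("\\.[^.]*$", "", value_cols)
  for (p in unique(prefix)) {
    cols <- value_cols[prefix == p]
    # Unique values across all columns of the group, in order of appearance
    refs <- unique(unlist(d[cols], use.names = FALSE))
    # Replace each value by its position in the shared reference vector
    d[cols] <- lapply(d[cols], match, table = refs)
  }
  d
}

df2 <- recode_groups(df1)
df2
```

IDs restart at 1 within each group here, as in the loop-based answer above.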

loop ordinal regression statistical analysis and save the data R

Could you please help me with a loop? I am relatively new to R.
The short version of the data looks like this:
sNumber blockNo running TrialNo wordTar wordTar1 Freq Len code code2
1 1 1 5 spouse violent 5011 6 1 2
1 1 1 5 violent spouse 17873 7 2 1
1 1 1 5 spouse aviator 5011 6 1 1
1 1 1 5 aviator wife 515 7 1 1
1 1 1 5 wife aviator 87205 4 1 1
1 1 1 5 aviator spouse 515 7 1 1
1 1 1 9 stability usually 12642 9 1 3
1 1 1 9 usually requires 60074 7 3 4
1 1 1 9 requires client 25949 8 4 1
1 1 1 9 client requires 16964 6 1 4
2 2 1 5 grimy cloth 757 5 2 1
2 2 1 5 cloth eats 8693 5 1 4
2 2 1 5 eats whitens 3494 4 4 4
2 2 1 5 whitens woman 18 7 4 1
2 2 1 5 woman penguin 162541 5 1 1
2 2 1 9 pie customer 8909 3 1 1
2 2 1 9 customer sometimes 13399 8 1 3
2 2 1 9 sometimes reimburses 96341 9 3 4
2 2 1 9 reimburses sometimes 65 10 4 3
2 2 1 9 sometimes gangster 96341 9 3 1
I have a code for ordinal regression analysis for one participant for one trial (eye-tracking data - eyeData) that looks like this:
#------------set the path and import the library-----------------
setwd("/AscTask-3/Data")
library(ordinal)
#-------------read the data----------------
read.delim(file.choose(), header=TRUE) -> eyeData
#-------------extract 1 trial from one participant---------------
ss <- subset(eyeData, sNumber == 6 & runningTrialNo == 21)
#-------------delete duplicates = refixations-----------------
ss.s <- ss[!duplicated(ss$wordTar), ]
#-------------change the raw frequencies to log freq--------------
ss.s$lFreq <- log(ss.s$Freq)
#-------------add a new column with sequential numbers as a factor ------------------
ss.s$rankF <- as.factor(seq(nrow(ss.s)))
#------------ estimate an ordered logistic regression model - fit ordered logit model----------
m <- clm(rankF~lFreq*Len, data=ss.s, link='probit')
summary(m)
#---------------get confidence intervals (CI)------------------
(ci <- confint(m))
#----------odd ratios (OR)--------------
exp(coef(m))
The eyeData file is a very large dataset of 91832 observations on 11 variables. In total there are 41 participants with 78 trials each. In my code I extract the data for one trial from one participant to run the analysis. However, running the analysis manually for all trials and all participants takes a long time. Could you please help me create a loop that goes through all 78 trials for all 41 participants and saves the statistics (I want to save summary(m), ci, and coef(m)) in one file?
Thank you in advance!
You could generate a unique identifier for every trial of every participant. Then you could loop over all unique values of this identifier and subset the data accordingly. Inside the loop you run the regression and save the output as an R object:
eyeData$uniqueIdent <- paste(eyeData$sNumber, eyeData$runningTrialNo, sep = "-")
uniqueID <- unique(eyeData$uniqueIdent)
for (un in uniqueID) {
ss <- eyeData[eyeData$uniqueIdent == un,]
ss <- ss[!duplicated(ss$wordTar), ] #maybe do this outside the loop
ss$lFreq <- log(ss$Freq) #you could do this outside the loop too
#create DV
ss$rankF <- as.factor(seq(nrow(ss)))
m <- clm(rankF~lFreq*Len, data=ss, link='probit')
seeSumm <- summary(m)
ci <- confint(m)
oddsR <- exp(coef(m))
save(seeSumm, ci, oddsR, file = paste("toSave_", un, ".Rdata", sep = ""))
# add -un- to the output file to be able identify where it came from
}
Variations of this could include collecting the output of every iteration in a list: create an empty list "gatherRes" before the loop, and at the end of each iteration (after the estimation and post-estimation commands) store that iteration's results in it:
gatherRes <- vector(mode = "list", length = length(unique(eyeData$uniqueIdent))) ##before the loop
names(gatherRes) <- unique(eyeData$uniqueIdent) ##so that gatherRes[[un]] fills an existing slot
gatherRes[[un]] <- list(seeSumm, ci, oddsR) ##last line inside the loop
If you're concerned with speed, you could consider writing a function that does all this and use lapply (or mclapply).
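A rough sketch of the function-plus-lapply idea follows. To stay runnable without the eye-tracking file or the ordinal package, it uses the built-in mtcars data and lm() as stand-ins; the grouping columns am and vs play the role of sNumber and runningTrialNo, and the names groups, fit_one, results and out_file are made up for illustration.

```r
# Stand-in for eyeData: split mtcars by two grouping columns
# (assumption: am and vs take the place of participant and trial)
groups <- split(mtcars, list(mtcars$am, mtcars$vs), drop = TRUE)

# One function that fits the model and returns everything to keep
fit_one <- function(d) {
  m <- lm(mpg ~ wt, data = d)   # swap in the clm() call from the question
  list(summary = summary(m), ci = confint(m), coef = coef(m))
}

# Apply the function to every subset; the result is a named list
results <- lapply(groups, fit_one)
names(results)

# Save all results in a single file
out_file <- file.path(tempdir(), "results.rds")
saveRDS(results, out_file)
```

readRDS(out_file) restores the list later, and an individual fit can be pulled out by name, for example results[["0.0"]]$coef.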
Here is a solution using the plyr package (it should be faster than a for loop).
Since you don't provide a reproducible example, I'll use the iris data as an example.
First make a function to calculate your statistics of interest and return them as a list. For example:
# Function to return summary, confidence intervals and coefficients from lm
lm_stats = function(x){
m = lm(Sepal.Width ~ Sepal.Length, data = x)
return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
Then use the dlply function, using your variables of interest as grouping
data(iris)
library(plyr) #if not installed do install.packages("plyr")
#Using "Species" as grouping variable
results = dlply(iris, c("Species"), lm_stats)
This will return a list of lists, containing output of summary, confint and coef for each species.
For your specific case, the function could look like (not tested):
ordFit_stats = function(x){
#Remove duplicates
x = x[!duplicated(x$wordTar), ]
# Make log frequencies
x$lFreq <- log(x$Freq)
# Make ranks
x$rankF <- as.factor(seq(nrow(x)))
# Fit model
m <- clm(rankF~lFreq*Len, data=x, link='probit')
# Return list of statistics
return(list(summary = summary(m), confint = confint(m), coef = coef(m)))
}
And then:
results = dlply(eyeData, c("sNumber", "TrialNo"), ordFit_stats)

How to find the final value from repeated measures in R?

I have data arranged like this in R:
indv time mass
1 10 7
2 5 3
1 5 1
2 4 4
2 14 14
1 15 15
where indv is individual in a population. I want to add columns for initial mass (mass_i) and final mass (mass_f). I learned yesterday that I can add a column for initial mass using ddply in plyr:
sorted <- ddply(test, .(indv, time), sort)
sorted2 <- ddply(sorted, .(indv), transform, mass_i = mass[1])
which gives a table like:
indv mass time mass_i
1 1 1 5 1
2 1 7 10 1
3 1 10 15 1
4 2 4 4 4
5 2 3 5 4
6 2 8 14 4
7 2 9 20 4
However, this same method will not work for finding the final mass (mass_f), as I have a different number of observations for each individual. Can anyone suggest a method for finding the final mass, when the number of observations may vary?
You can simply use length(mass) as the index of the last element:
sorted2 <- ddply(sorted, .(indv), transform,
mass_i = mass[1], mass_f = mass[length(mass)])
As suggested by mb3041023 and discussed in the comments below, you can achieve similar results without sorting your data frame:
ddply(test, .(indv), transform,
mass_i = mass[which.min(time)], mass_f = mass[which.max(time)])
Except for the order of rows, this is the same as sorted2.
You can use head(mass, 1) and tail(mass, 1) in place of mass[1] and mass[length(mass)].
sorted2 <- ddply(sorted, .(indv), transform, mass_i = head(mass, 1), mass_f=tail(mass, 1))
Once you have this table, it's pretty simple:
t <- tapply(test$mass, test$indv, max)
This will give you an array with indv as the names and the maximum mass as the values, which equals mass_f only if mass never ends below an earlier value.
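For comparison, the same two columns can be sketched in base R with ave(), which applies a function within each group and returns a vector aligned with the original rows. The data frame below copies the example from the question; the approach assumes the rows are first sorted by indv and time.

```r
test <- data.frame(indv = c(1, 2, 1, 2, 2, 1),
                   time = c(10, 5, 5, 4, 14, 15),
                   mass = c(7, 3, 1, 4, 14, 15))

# Sort so that the first/last row of each individual is the earliest/latest
test <- test[order(test$indv, test$time), ]

# The single value returned per group is recycled over that group's rows
test$mass_i <- ave(test$mass, test$indv, FUN = function(m) m[1])
test$mass_f <- ave(test$mass, test$indv, FUN = function(m) m[length(m)])
test
```

This works for any number of observations per individual, since the last element is taken by position within each group.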
