Create a new variable from the values of another variable (multilevel regression) - r

I am setting up a multilevel analysis (and I am a total newbie).
In this analysis I want to test whether a high value of one predictor (here: senseofhumor, a numeric value transformed into "high", "low", "medium") predicts the (numeric) outcome better than the other (numeric) predictors (senseofhumor, seriousness, friendliness).
I have a dataset with many people and groups and want to compare the outcome between the groups regarding the influence of SenseofhumorHIGH.
The code for that might look like this:
library(nlme)  # lme() comes from the nlme package
RandomslopeEC <- lme(criteria(timepoint1) ~ senseofhumor + seriousness + friendlyness,
                     data = DATA, random = ~ SenseofhumorHIGH | group)
For that reason I created values "high" "low" "medium" for my numeric predictor via
library(tidyverse)
DATA <- DATA %>%
  mutate(predictorNew = case_when(
    senseofhumor < quantile(senseofhumor, 0.5)  ~ 'low',
    senseofhumor > quantile(senseofhumor, 0.75) ~ 'high',
    TRUE                                        ~ 'med'
  ))
Now they look like this:
Person  Group  senseofhumor
     1     56  low
     7      1  high
    87      7  low
   764     45  high
Now I realized I might need to split these variable values into separate variables if I want to test my idea.
Do any of you know how to generate variables so that they look like this?
Person  Group  senseofhumorHIGH  senseofhumorMED  senseofhumorLOW
     1     56                 0                0                1
     7      1                 1                0                0
    87      7                 0                0                1
   764     45                 1                0                0
    51      3                 1                0                0
   362      9                 1                0                0
    87     27                 0                0                1
Does this make any sense to you regarding my approach? Or do you have a better idea?
Thanks a lot in advance

Welcome to learning R. You will want to convert these types of variables to "factors," and R will be able to process them accordingly. To do that, use as.factor(variable) - so for you it may be DATA$senseofhumor <- as.factor(DATA$senseofhumor). If you need to convert multiple columns, you can use:
factor_cols <- c("Var1","Var2","Var3") # list columns you want as factors
DATA[factor_cols] <- lapply(DATA[factor_cols], as.factor)
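If you also want the explicit 0/1 indicator columns from your example table, here is a minimal sketch (assuming the recoded column is predictorNew, as in your mutate() call):
library(dplyr)
DATA <- DATA %>%
  mutate(senseofhumorHIGH = as.integer(predictorNew == 'high'),
         senseofhumorMED  = as.integer(predictorNew == 'med'),
         senseofhumorLOW  = as.integer(predictorNew == 'low'))
That said, if you convert predictorNew to a factor as above, lme() will dummy-code it for you internally, so the separate columns are mainly useful for inspection or export.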
Since you are new, note that this forum is typically for questions that can't be easily answered online. This question is relatively routine, and more details can be found with a quick Google search. While SO is a great place to learn R, you may be penalized by the SO community in the future for routine questions like this. Just trying to help ensure you keep learning!

Related

Procedural way to generate signal combinations and their output in r

I have been continuing to learn R to transition away from Excel, and I am wondering what the best way to approach the following problem is, or at least what tools are available to me:
I have a large data set (100K+ rows) and several columns that I could generate a signal from, and each value in those columns can range between 0 and 3.
sig1 sig2 sig3 sig4
1 1 1 1
1 1 1 1
1 0 1 1
1 0 1 1
0 0 1 1
0 1 2 2
0 1 2 2
0 1 1 2
0 1 1 2
I want to generate composite signals using the state of each cell in the four columns, then see what each of the composite signals tells me about the returns in a time series. For this question the scope is only generating the combinations.
So for example, one composite signal would be when all four cells in the vectors = 0. I could generate a new column that reads TRUE when that case is true and FALSE in every other case, then go on to figure out how that affects the returns from the rest of the data frame.
The thing is, I want to check all combinations of the four columns, so 0000, 0001, 0002, 0003 and on and on, which is quite a few. With the extent of my knowledge of R, I only know how to do that by using mutate() for each combination and explicitly entering the condition to check. I assume there is a better way to do this, but I haven't found it yet.
Thanks for the help!
I think that you could paste the columns together to get unique combinations, then just turn those into dummy variables:
library(dplyr)
library(dummies)
# Create sample data
data <- data.frame(sig1 = c(1,1,1,1,0,0,0),
                   sig2 = c(1,1,0,0,0,1,1),
                   sig3 = c(2,2,0,1,1,2,1))
# Paste together
data <- data %>% mutate(sig_tot = paste0(sig1, sig2, sig3))
# Generate dummies
data <- cbind(data, dummy(data$sig_tot, sep = "_"))
# Turn to logical if needed
data <- data %>% mutate_at(vars(contains("data_")), as.logical)
data
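If you would rather avoid the dummies package (it is, as far as I know, no longer actively maintained on CRAN), a base-R sketch of the same idea using model.matrix() on the pasted column:
data$sig_tot <- factor(data$sig_tot)
m <- model.matrix(~ sig_tot - 1, data = data)  # one 0/1 column per observed combination
data <- cbind(data, m == 1)                    # bind as logical columns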

Imputing NAs for factor variables & converting them to dummy variables

I have a dataframe in which some of the variables (columns) are factors, and for some records I have missing values (NA).
Questions are:
What is the correct approach to replacing/imputing NAs in factor variables?
e.g. VarX with 4 levels {"A", "B", "C", "D"} - what would be the preferred value to replace NAs with? A/B/C/D? Maybe just 0? Maybe impute with the level that is the majority of this variable's observations?
How to implement such imputation, based on answer to 1?
Once 1 & 2 are resolved, I'll use the following to create dummy variables for the factor variables:
library(dummies)  # provides dummy.data.frame()
is.fact <- sapply(my_data, is.factor)
my_data.dummy_vars <- dummy.data.frame(my_data[, is.fact], sep = ".")
Afterwards, how do I replace all the factor variables in my_data with the dummy variables I've extracted into my_data.dummy_vars?
My use case is to calculate principal components afterwards (which needs all variables to have numerical values, hence the dummy variables).
Thanks
Thanks for clarifying your intentions - that really helps! Here are my thoughts:
Imputing missing data is a non-trivial problem, and maybe a good question for the fine folks at Cross Validated. This is a problem that can only really be addressed in the context of the project, by you (the subject-matter expert). A big question is whether missing values are missing at random, or as a function of some other variables, and whether those are observed or unobserved. If you conclude that they're missing as a function of other (observed) variables, you might even consider a model-based approach, perhaps using a GLM. The easiest approach by far (and fine if you don't have many missing values) is to just delete those rows with something like mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion),]. I'll say it again: imputation of missing data is a non-trivial problem that should be considered carefully and in context. Perhaps a good approach is to try a few methods of imputation and see if (and how) your inferences change. If they don't change (much), you'll know you don't need to worry.
Dropping rows instead could be done with a fairly simple mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion),]. If you do any other form of imputation (in a sense, "making up" data), I'd advocate thinking long and hard before concluding that it's the right decision. And, of course, it might be.
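If you do settle on your majority-level idea (imputing with the most frequent level), a minimal base-R sketch, using the VarX column from your question:
# most frequent level, ignoring NAs; ties go to whichever level tabulates first
mode_level <- names(which.max(table(my_data$VarX)))
my_data$VarX[is.na(my_data$VarX)] <- mode_level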
Joining two data.frames is pretty straightforward using cbind, something like my_data2 <- cbind(my_data, my_data.dummy_vars). If you need to remove the column with your factor data, my_data3 <- my_data2[,-5] if, for example, the factor data is in column 5.
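To drop all the factor columns at once and swap in the dummies, a sketch reusing the is.fact vector from your question:
my_data_numeric <- cbind(my_data[, !is.fact], my_data.dummy_vars)  # all-numeric, ready for PCA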
By dummy variables, do you mean zeroes and ones? This is how I'd structure it:
# first building a fake data frame
x <- 1:10
y <- as.factor(c("A","A","B","B","C","C",NA,"A","B","C"))
df <- data.frame(x,y)
# creating dummy variables
df$dummy_A <- 1*(df$y == "A")
df$dummy_B <- 1*(df$y == "B")
df$dummy_c <- 1*(df$y == "C")
# did it work?
df
x y dummy_A dummy_B dummy_c
1 1 A 1 0 0
2 2 A 1 0 0
3 3 B 0 1 0
4 4 B 0 1 0
5 5 C 0 0 1
6 6 C 0 0 1
7 7 <NA> NA NA NA
8 8 A 1 0 0
9 9 B 0 1 0
10 10 C 0 0 1

Cluster a big data set (quantitative/qualitative values)

I have a dataset composed of 54,000 rows and a few columns (7). My values are both numeric and alphanumeric (quantitative and qualitative variables). I want to cluster it using the hclust function in R.
Let's take an example:
X <- data.frame(rnorm(54000, sd = 0.3),
                rnorm(54000, mean = 1, sd = 0.3),
                sample(LETTERS[1:24], 54000, replace = TRUE),
                sample(letters[1:10], 54000, replace = TRUE),
                round(rnorm(54000, mean = 25, sd = 3)),
                round(runif(n = 54000, min = 1000, max = 25000)),
                round(runif(54000, 0, 200000)))
colnames(X) <- c("A","B","C","D","E","F","G")
If I use the hclust function like this:
hclust(dist(X), method = "ward.D")
I get this error message:
Error: cannot allocate vector of size 10.9 Gb
What is the problem? I'm trying to create a 54k * 54k matrix, which is too big for my PC to compute (4 GB of RAM). I've read that since R 3.0.0 the software is 64-bit (able to work with a 2.916e+09-cell matrix like in my example), so the limitation comes from my computer. I've tried hclust from stats / fastcluster / flashClust and get the same problem.
In these packages, hclust is described like this:
hclust(d, method="complete", members=NULL)
flashClust(d, method = "complete", members=NULL)
d a dissimilarity structure as produced by dist.
We always need a dist matrix to make this function work. I've also tried to raise my computer's memory limits for the R session using this:
memory.limit(size = 4014)
memory.size(max = TRUE)
Question:
Is it possible to use hierarchical clustering (or a similar way to cluster data) without this dist() matrix, for a quantitative/qualitative dataset in R?
Edit:
About k-means:
The k-means method works great for a big dataset composed of numerical values. In my example, I have both numeric and alphanumeric values. I've tried to transform my qualitative variables into binary numerical variables to run k-means:
First dataframe (example):
Col1 Col2 Col3
1 12 43.93145 Alpha
2 45 44.76081 Beta
3 48 45.09708 Gamma
4 31 45.42278 Alpha
5 12 46.53709 Delta
6 7 39.07841 Beta
7 78 49.60947 Alpha
If I transform this into binary variables, I get this:
Col1 Col2 Alpha Beta Gamma Delta
1 12 44.29369 1 0 0 0
2 45 43.90610 0 1 0 0
3 48 44.82659 0 0 1 0
4 31 43.09096 1 0 0 0
5 12 42.71190 0 0 0 1
6 7 43.71710 0 1 0 0
7 78 42.24293 1 0 0 0
It's OK if I only have a few modalities, but in a real dataset we could get about 10,000 modalities for a 50k-row base. I don't think k-means is the solution to this type of problem.
From reading your question, it seems there are 2 problems:
1. You have a fairly large number of observations to cluster
2. The categorical variables have high cardinality
My advice:
1) You can just take a sample and use fastcluster::hclust, or use clara.
Probably after sorting out 2) you can use more observations; in any case it's potentially OK to use a sample. Try to take a stratified sample across the categories.
2) You basically need to represent these categories in a numeric format without adding 10,000 more columns. You could use PCA or a discrete version of it.
A few questions deal with this problem:
q1, q2
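As a sketch of the sample-then-cluster route for mixed data, daisy() from the cluster package builds a Gower dissimilarity directly from numeric and factor columns together (the sample size and seed below are arbitrary choices):
library(cluster)                                   # provides daisy()
set.seed(1)
Xs <- X[sample(nrow(X), 2000), ]                   # stratify by category in practice
Xs[] <- lapply(Xs, function(v) if (is.character(v)) factor(v) else v)
d  <- daisy(Xs, metric = "gower")                  # handles mixed variable types
hc <- hclust(d, method = "ward.D")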

Force model.matrix() in R to use a given set of levels

I need to convert a small number of categorical variables in a survey dataframe to dummy variables. The variables are grouped by type (e.g. food type), and within each type survey respondents ranked their 1st, 2nd and 3rd preferences. The list of choices available for each type is similar but not identical. My problem is that I want to force the superset of category choices to be dummy-coded in every case.
set.seed(1)
d <- data.frame(foodtype1rank1 = sample(c('noodles','rice','cabbage','pork'), 5, replace = TRUE),
                foodtype1rank2 = sample(c('noodles','rice','cabbage','pork'), 5, replace = TRUE),
                foodtype1rank3 = sample(c('noodles','rice','cabbage','pork'), 5, replace = TRUE),
                foodtype2rank1 = sample(c('noodles','rice','cabbage','tuna'), 5, replace = TRUE),
                foodtype2rank2 = sample(c('noodles','rice','cabbage','tuna'), 5, replace = TRUE),
                foodtype2rank3 = sample(c('noodles','rice','cabbage','tuna'), 5, replace = TRUE),
                foodtype3rank1 = sample(c('noodles','rice','cabbage','pork','mackerel'), 5, replace = TRUE),
                foodtype3rank2 = sample(c('noodles','rice','cabbage','pork','mackerel'), 5, replace = TRUE),
                foodtype3rank3 = sample(c('noodles','rice','cabbage','pork','mackerel'), 5, replace = TRUE))
To recap, model.matrix() will create dummy variables for any individual variable:
model.matrix(~d[,1]-1)
d[, 1]cabbage d[, 1]noodles d[, 1]pork d[, 1]rice
1 0 0 0 1
2 0 0 0 1
3 1 0 0 0
4 0 0 1 0
5 0 1 0 0
Or via sapply() for all variables:
sapply(d,function(x) model.matrix(~x-1))
Naturally, model.matrix() will only consider the levels that are present in each factor separately. But I want to force the complete set of foodtypes to be included for each type: noodles, rice, cabbage, pork, tuna, mackerel. In this example that would generate 54 dummy variables (3 types x 3 ranks x 6 categories). I assume I would pass the complete set explicitly to model.matrix() in some way, but can't see how.
Finally, I know R models automatically dummy-code factors internally but I still need to do it, including for exporting outside R.
The best way to achieve this is to explicitly specify the levels of each factor:
d$foodtype1rank1 <- factor(sample(c('noodles','rice','cabbage','pork'), 5, replace = TRUE),
                           levels = c('noodles','rice','cabbage','pork','mackerel'))
When you know the data, this is always good practice.
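To force the full six-category superset on every column of d at once and then build all 54 dummies, a sketch (the column naming scheme here is my own choice):
all_levels <- c('noodles','rice','cabbage','pork','tuna','mackerel')
d[] <- lapply(d, factor, levels = all_levels)        # unused levels are kept
dummies <- do.call(cbind, lapply(names(d), function(nm) {
  m <- model.matrix(~ d[[nm]] - 1)                   # one column per level, including empty ones
  colnames(m) <- paste(nm, all_levels, sep = ".")    # e.g. foodtype1rank1.noodles
  m
}))
dim(dummies)                                         # 5 rows, 54 columns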

Descriptive statistics in a table in R for multiple variables

I am totally new to R and I'll appreciate the time anyone takes to help me with these probably simple tasks. I'm just at a loss with all the resources available and am not sure where to start.
My data looks something like this:
subject sex age nR medL medR meanL meanR pL ageBin
1 0146si 1 67 26 1 1 1.882353 1.5294118 0.5517241 1
2 0162le 1 72 5 2 1 2 1.25 0.6153846 1
3 0323er 1 54 30 2.5 3 2.416667 2.5 0.4915254 0
4 0811ne 0 41 21 2 2 2 1.75 0.5333333 0
5 0825en 1 44 31 2 2 2.588235 1.8235294 0.5866667 0
Though the actual data has many, many more subjects and variables.
The first thing I need to do is compare the 'ageBin' values. 0 = under age 60, 1 = over age 60. I want to compare stats between these two groups. So I guess the first thing I need is the ability to recognize the different ageBin values and make those the two rows.
Then I need to do things like calculate the frequency of the values in the two groups (i.e. how many instances of 1 and 0), the mean of the 'age' variable, the median of the age variable, the number of males (i.e. sex = 1), the mean of meanL, etc. Simple things like that. I just want them all in one table.
So an example of a potential table might be:
            n  nMale  mAge
ageBin 0   14      x     x
ageBin 1   14      x     x
I could easily do this stuff in SPSS or even Excel...I just really want to get started with R. So any resource or advice someone could offer to point me in the right direction would be so, so helpful. Sorry if this sounds unclear...I can try to clarify if necessary.
Thanks in advance, anyone.
Use the plyr package to split up the data structure and then apply a function that combines all the results back together.
install.packages("plyr") # install package from CRAN
library(plyr) # load the package into R
dd <- list(subject=c("0146si", "0162le", "1323er", "0811ne", "0825en"),
sex = c(1,1,1,0,1),
age = c(67,72,54,41,44),
nR = c(26,5,30,21,31),
medL = c(1,2,2.5,2,2),
medR = c(1,1,3,2,2),
meanL = c(1.882352,2,2.416667,2,2.588235),
meanR = c(1.5294118,1.25,2.5,1.75,1.8235294),
pL = c(0.5517241,0.6153846,0.4915254,0.5333333,0.5866667),
ageBin = c(1,1,0,0,0))
dd <- data.frame(dd) # convert to data.frame
Using the ddply function, you can do things like calculate the frequency of the values in the two groups
ddply(dd, .(ageBin), summarise, nMale = sum(sex), mAge = mean(age))
ageBin nMale mAge
0 2 46.33333
1 2 69.50000
The following is a very useful resource by Sean Anderson for getting up to speed with the plyr package.
A more comprehensive resource by Hadley Wickham, the package author, can be found here.
Try the by function:
if your data frame is named df:
by(data = df, INDICES = df$ageBin, FUN = summary)
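For completeness, a dplyr sketch of the same kind of summary table (reusing the dd frame built above; the columns mirror the ones you listed):
library(dplyr)
dd %>%
  group_by(ageBin) %>%
  summarise(n      = n(),
            nMale  = sum(sex),
            mAge   = mean(age),
            mdAge  = median(age),
            mMeanL = mean(meanL))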
