What is a random intercept model? - r

I'm new to R. I have a data set
df <- structure(list(schoolid = c(1L, 1L, 1L, 1L, 1L, 1L),
score = c(0L, 10L, 0L, 40L, 42L, 4L),
gender = c(1L, 1L, 1L, 1L, 1L, 1L)),
.Names = c("schoolid", "score", "gender"),
row.names = c(NA, 6L),
class = "data.frame")
for which I have to run a random intercept model to see if there is an impact of gender on the score across schools. Can anyone kindly explain what is expected of me?
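A random intercept model is a mixed-effects model in which each group (here, each school) gets its own baseline level of the outcome, while the effect of the predictors is estimated across all groups. A minimal sketch with the lme4 package, assuming your full data set contains many schools and both genders (the six rows shown have only one school and one gender, so the model cannot be fitted on the excerpt alone):
library(lme4)
# score explained by gender (fixed effect), with a separate intercept for each school
m <- lmer(score ~ gender + (1 | schoolid), data = df)
summary(m)  # the fixed-effect coefficient for gender estimates its impact on score across schools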

Related

How to calculate the average value in one column for the 10 maximum values in another column?

I have a dataset and the task: "Average number of major credit cards held for the people with the top 10 incomes".
dput(head(creditcard))
structure(list(card = structure(c(2L, 2L, 2L, 2L, 2L, 2L), levels = c("no","yes"), class = "factor"), reports = c(0L, 0L, 0L, 0L, 0L, 0L), age = c(37.66667, 33.25, 33.66667, 30.5, 32.16667, 23.25), income = c(4.52, 2.42, 4.5, 2.54, 9.7867, 2.5), share = c(0.03326991, 0.005216942, 0.004155556, 0.06521378, 0.06705059, 0.0444384), expenditure = c(124.9833, 9.854167, 15, 137.8692, 546.5033, 91.99667), owner = structure(c(2L, 1L, 2L, 1L, 2L, 1L), levels = c("no", "yes"), class = "factor"), selfemp = structure(c(1L, 1L, 1L, 1L, 1L, 1L), levels = c("no", "yes"), class = "factor"),
dependents = c(3L, 3L, 4L, 0L, 2L, 0L), days = c(54L, 34L,58L, 25L, 64L, 54L), majorcards = c(1L, 1L, 1L, 1L, 1L, 1L), active = c(12L, 13L, 5L, 7L, 5L, 1L), income_fam = c(1.13, 0.605, 0.9, 2.54, 3.26223333333333, 2.5)), row.names = c("1","2", "3", "4", "5", "6"), class = "data.frame")
I tried to do the task like this:
round(mean(creditcard[order(creditcard$income, decreasing = TRUE), ]$majorcards[1:10]))
But my solution turned out to be suboptimal and I do not understand how to correct it.
You can get the 10 observations with the highest income using slice_max, then create a new dataset with the mean of majorcards:
library(dplyr)
creditcard %>%
  slice_max(income, n = 10) %>%
  summarise(mean(majorcards))
If your dataset is one row per person, then you can do this:
library(dplyr)
creditcard %>%
  arrange(desc(income)) %>%
  slice_head(n = 10) %>%
  summarize(mean_cards = mean(majorcards, na.rm = TRUE))
Maybe something like:
mean(creditcard$majorcards[which(creditcard$income %in% sort(creditcard$income, decreasing = TRUE)[1:10])])
Using base R
with(creditcard, mean(head(majorcards[order(-income)], 10)))
Or in data.table
library(data.table)
setDT(creditcard)[order(-income), mean(head(majorcards, 10))]

Debugging AUC error in R. A data table is provided. Thanks

I am trying to calculate an AUC for the following predictions and outcomes using the AUC function, but I keep getting an error. I am supposed to do this by tomorrow. Any help would be much appreciated! Thanks!
structure(list(POD1HemoglobinCut = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 1L
), .Label = c("[10,Inf)", "[0,10)"), class = "factor"), pred = c(0.0044102752927413,
0.00782725095161221, 0.210140717409347, 0.066525545459026, 0.0666137804946143,
0.0125809431305506, 0.0107560804580978, 0.829245110498723, 0.759165998590355,
0.0128042545229042, 0.738354081921031, 0.00287448336844446, 0.0448026818172726,
0.0162243785121634, 0.0687716959646373, 0.0724616690876388, 0.005033110699528,
0.893314696161109, 0.883299551200163, 0.189696433058773)), row.names = c(NA,
-20L), class = c("data.table", "data.frame"))
roc <- roc(test, x = test$pred, class = test$POD1HemoglobinCut)
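A hedged sketch, assuming the pROC package is an acceptable choice (the question does not say which package its roc()/AUC function comes from) and that test holds the data.table shown above; pROC's roc() takes the observed classes and the predicted probabilities as its first two arguments:
library(pROC)
roc_obj <- roc(test$POD1HemoglobinCut, test$pred)  # observed class first, predicted probability second
auc(roc_obj)                                       # area under the ROC curve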

Turning a Presence/Absence Matrix into a Cluster Analysis in R Studio

Okay, I have a presence/absence matrix of 6 samples across 25 presence/absence variables.
I've been able to make a cluster dendrogram with the data, but I'd rather plot the distance matrix in a way that looks better and is easier to analyse (maybe a cluster plot or something similar?).
I'm really stuck with figuring out the next part - I've spent days searching on here and various other Google searches but nothing is turning up!
Here's the code I've got for the cluster dendrogram:
matrix<-read.csv("Horizontal.csv")
distance<-dist(matrix)
hc.m<-hclust(distance)
plot(hc.m, labels=matrix$Sample, main ="", cex.main=0.8, cex.lab= 1.1)
Help!
> dput(head(matrix, 20))
structure(list(Sample = structure(1:6, .Label = c("CL1", "CL2",
"CL3", "COL1", "COL2", "COL3"), class = "factor"), X = c(0L,
0L, 0L, 1L, 1L, 1L), X.1 = c(1L, 0L, 0L, 1L, 1L, 1L), X.2 = c(1L,
1L, 1L, 0L, 0L, 0L), X.3 = c(1L, 1L, 1L, 1L, 1L, 1L), X.4 = c(1L,
1L, 1L, 0L, 0L, 0L), X.5 = c(0L, 0L, 0L, 1L, 1L, 0L), X.6 = c(1L,
1L, 1L, 1L, 1L, 1L), X.7 = c(1L, 1L, 1L, 1L, 1L, 1L), X.8 = c(0L,
0L, 0L, 1L, 1L, 1L), X.9 = c(0L, 0L, 0L, 1L, 1L, 1L), X.10 = c(1L,
1L, 1L, 1L, 1L, 1L), X.11 = c(1L, 1L, 1L, 1L, 1L, 1L), X.12 = c(1L,
1L, 1L, 1L, 1L, 1L), X.13 = c(1L, 0L, 0L, 0L, 0L, 0L), X.14 = c(0L,
0L, 0L, 1L, 1L, 1L), X.15 = c(0L, 0L, 0L, 1L, 1L, 1L), X.16 = c(1L,
1L, 1L, 1L, 0L, 0L), X.17 = c(1L, 1L, 1L, 1L, 1L, 1L), X.18 = c(1L,
1L, 1L, 1L, 1L, 1L), X.19 = c(1L, 1L, 1L, 1L, 1L, 1L), X.20 = c(1L,
1L, 1L, 1L, 1L, 1L), X.21 = c(1L, 1L, 1L, 1L, 0L, 0L), X.22 = c(0L,
0L, 0L, 0L, 1L, 1L), X.23 = c(1L, 1L, 1L, 1L, 1L, 1L), X.24 = c(0L,
1L, 1L, 1L, 1L, 1L)), .Names = c("Sample", "X", "X.1", "X.2",
"X.3", "X.4", "X.5", "X.6", "X.7", "X.8", "X.9", "X.10", "X.11",
"X.12", "X.13", "X.14", "X.15", "X.16", "X.17", "X.18", "X.19",
"X.20", "X.21", "X.22", "X.23", "X.24"), row.names = c(NA, 6L
), class = "data.frame")
Okay, with this code:
library(vegan)
library(ggplot2)
library(tidyverse)
library(MASS)
#set working directory
setwd("~/Documents/Masters/BS707/Metagenomics")
#read csv file
cookie<-read.csv("Horizontal.csv")
data.frame(cookie, row.names = c("CL1", "CL2", "CL3", "COL1", "COL2", "COL3"))
df = subset(cookie)
data.frame(df, row.names = c("CL1", "CL2", "CL3", "COL1", "COL2", "COL3"))
dm<- dist(df, method = "binary") #calculate the distance matrix
cmdscale(dm, eig = TRUE, k=2) -> mds
as.tibble(mds$points) #mds coordinates
bind_cols(df, Sample = df$Sample) #bind sample names
mutate(df,group = gsub("\\d$", "", "Sample1"))#remove last digit from sample names to form groups
ggplot(df)+
geom_point (aes(x = "V1",y = "V2", color = "group")) #plot
as.tibble(mds$points) %>% ggplot() + geom_point (aes(x = V1, y = V2))
I get the plot, but each group is named 'Sample' rather than CL1, CL2, CL3, COL1, COL2, COL3. I had to remove the %>% because my R didn't recognise it as a command and gave an error every time (I switched it to + or deleted it, and then it ran).
Here is a way to visualize your data in 2 dimensions:
library(tidyverse)
df %>%
  dplyr::select(-1) %>%                          # remove the first (Sample) column
  dist(method = "binary") %>%                    # calculate the distance matrix
  cmdscale(eig = TRUE, k = 2) -> mds             # do MDS, also known as principal coordinates analysis
as.tibble(mds$points) %>%                        # mds coordinates
  bind_cols(Sample = df$Sample) %>%              # bind sample names
  mutate(group = gsub("\\d$", "", Sample)) %>%   # remove last digit from sample names to form groups
  ggplot() +
  geom_point(aes(x = V1, y = V2, color = group)) # plot
Or without tidyverse (ggplot2 is still needed for the plot):
library(ggplot2)
df_dist <- dist(df[, -1], method = "binary")
mds <- cmdscale(df_dist, eig = TRUE, k = 2)
for_plot <- data.frame(mds$points, group = gsub("\\d$", "", df$Sample))
ggplot(for_plot) +
  geom_point(aes(x = X1, y = X2, color = group))
Other options include isoMDS from the MASS library, which performs Kruskal's non-metric multidimensional scaling, and metaMDS from the vegan library, which performs non-metric multidimensional scaling with stable solutions from random starts, axis scaling and species scores; both are sketched below.
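A sketch of both alternatives, assuming df still holds the presence/absence data with the Sample column first (note that isoMDS will stop if any two samples are at distance exactly zero):
library(MASS)
library(vegan)
d <- dist(df[, -1], method = "binary")
# Kruskal's non-metric MDS from MASS
nmds_iso <- isoMDS(d, k = 2)
plot(nmds_iso$points, pch = 19)
# metaMDS from vegan on a binary Jaccard dissimilarity
nmds_meta <- metaMDS(vegdist(df[, -1], method = "jaccard", binary = TRUE))
plot(nmds_meta, display = "sites")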

chi squared and basic statistics on multiple columns of a data frame

I would like to compute a chi-squared test for each column in a data frame, grouping by the variable Project.
Basically, I would like to compute a two-by-two table for each column and then store the resulting p-value in a new table.
Here is an example of my data frame:
structure(list(Project = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("discovery", "validation"), class = "factor"), MLL = c(1L, 1L, 1L, 1L, 1L, 1L), CREB = c(0L, 1L, 1L, 1L, 1L, 0L), TNR = c(1L, 1L, 0L, 0L, 1L, 1L)), .Names = c("Project", "MLL", "CREB", "TNR"), row.names = c(1L, 2L, 3L, 300L, 301L, 302L), class = "data.frame")
Following Jaap's comment, I have tried:
pvalue <- data.frame(apply(cast_subset[-1], 2,
                           function(i) chisq.test(table(cast_subset$Project, i))$p.value))
colnames(pvalue) <- "p.value"
but I cannot access the column with the gene names for merging with another data set.
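One possible fix (a sketch, assuming cast_subset is the data frame shown above with Project as its first column): build the p-values with sapply and keep the gene names as an ordinary column rather than as row names, so that merging works.
pvals <- sapply(cast_subset[-1],
                function(i) chisq.test(table(cast_subset$Project, i))$p.value)
pvalue <- data.frame(gene = names(pvals), p.value = pvals, row.names = NULL)
# 'gene' is a regular column, so merge(pvalue, other_data, by = "gene") works
# (other_data is a placeholder for your second data set)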

Bin data by (x,y) and summarize

These are the first 10 lines of a huge file I have (note that there is only one user in these 10 lines, but I've got thousands of users):
dput(testd)
structure(list(user = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L
), otime = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L
), .Label = c("2010-10-12T19:56:49Z", "2010-10-13T03:57:23Z",
"2010-10-13T16:41:35Z", "2010-10-13T20:05:43Z", "2010-10-13T23:31:51Z",
"2010-10-14T00:21:47Z", "2010-10-14T18:25:51Z", "2010-10-16T03:48:54Z",
"2010-10-16T06:02:04Z", "2010-10-17T01:48:53Z"), class = "factor"),
lat = c(39.747652, 39.891383, 39.891077, 39.750469, 39.752713,
39.752508, 39.7513, 39.758974, 39.827022, 39.749934),
long = c(-104.99251, -105.070814, -105.068532, -104.999073,
-104.996337, -104.996637, -105.000121, -105.010853,
-105.143191, -105.000017),
locid = structure(c(5L, 4L, 9L, 6L, 1L, 2L, 8L, 3L, 10L, 7L),
.Label = c("2ef143e12038c870038df53e0478cefc",
"424eb3dd143292f9e013efa00486c907", "6f5b96170b7744af3c7577fa35ed0b8f",
"7a0f88982aa015062b95e3b4843f9ca2", "88c46bf20db295831bd2d1718ad7e6f5",
"9848afcc62e500a01cf6fbf24b797732f8963683", "b3d356765cc8a4aa7ac5cd18caafd393",
"d268093afe06bd7d37d91c4d436e0c40d217b20a", "dd7cd3d264c2d063832db506fba8bf79",
"f6f52a75fd80e27e3770cd3a87054f27"), class = "factor"),
dnt = structure(c(10L, 9L, 8L, 7L, 6L, 5L, 4L, 3L, 2L, 1L),
.Label = c("2010-10-12 19:56:49",
"2010-10-13 03:57:23", "2010-10-13 16:41:35", "2010-10-13 20:05:43",
"2010-10-13 23:31:51", "2010-10-14 00:21:47", "2010-10-14 18:25:51",
"2010-10-16 03:48:54", "2010-10-16 06:02:04", "2010-10-17 01:48:53"
), class = "factor"),
x = c(-11674.6344476781, -11683.3414552141,
-11683.0877083915, -11675.3642199817, -11675.0599906624,
-11675.0933491404, -11675.4807522648, -11676.6740962175,
-11691.3894104198, -11675.4691879924),
y = c(4419.73724843345, 4435.719406435, 4435.68538078744,
4420.05048454181, 4420.3000059572, 4420.27721099723,
4420.14288752585, 4420.99619739292, 4428.56278976123,
4419.99099525605),
cellx = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L),
.Label = c("[-11682,-11672)", "[-11692,-11682)"
), class = "factor"),
celly = structure(c(1L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L), .Label = c("[4419,4429)", "[4429,4439)"
), class = "factor"),
cellxy = structure(c(1L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 2L, 1L), .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"
), class = "factor")), .Names = c("user", "otime", "lat",
"long", "locid", "dnt", "x", "y", "cellx", "celly", "cellxy"), class = "data.frame", row.names = c(NA,
-10L))
A bit of explanation of the data to simplify understanding: the x and y columns are transformations of the lat and long coordinates. I have discretised the x, y locations into bins using cut. I want to get the most visited bin per user, so I use ddply as follows:
cells = ddply(testd, .(user, cellxy), summarise, length(cellxy))
Obtaining:
dput(cells)
structure(list(user = c(0, 0, 0), cellxy = structure(1:3, .Label = c("[-11682,-11672)[4419,4429)",
"[-11692,-11682)[4419,4429)", "[-11692,-11682)[4429,4439)"), class = "factor"),
count = c(7L, 1L, 2L)), .Names = c("user", "cellxy", "count"
), row.names = c(NA, -3L), class = "data.frame")
Now what I want to do is calculate the average x, y from the first dataset for the most visited bin per user, as obtained from the previous calculation. I have no idea how to do this efficiently, and given that my dataset is really big, I would appreciate some guidance. Thanks!
Here is a two-stage approach. First, modify your original code for cells: for each combination of cellxy and user, calculate the mean x and y values.
cells = ddply(testd, .(user, cellxy), summarise,
              cellcount = length(cellxy), meanx = mean(x), meany = mean(y))
cells
  user                     cellxy cellcount     meanx    meany
1    0 [-11682,-11672)[4419,4429)         7 -11675.40 4420.214
2    0 [-11692,-11682)[4419,4429)         1 -11691.39 4428.563
3    0 [-11692,-11682)[4429,4439)         2 -11683.21 4435.702
Then use another call to ddply() to subset, for each user, the cellxy with the highest cellcount.
cells2 = ddply(cells,.(user),subset,cellcount==max(cellcount))
cells2
  user                     cellxy cellcount    meanx    meany
1    0 [-11682,-11672)[4419,4429)         7 -11675.4 4420.214
Since your data set is large, you might want to consider data.table, which will not only be blazing fast but will also make the data munging a bit easier.
Converting to a data.table is straightforward:
library(data.table)
DT <- data.table(testd)  # grouping is specified later with by= inside DT[...]
Then determining the most visited cell, by user, is just one line:
# Determine which cell is the most visited, by user (which.max keeps the first cell in case of ties)
DT[, MostVisited := {counts <- table(cellxy); names(counts)[which.max(counts)]}, by = user]
I'm not sure how exactly you want to calculate the average x, y relative to MostVisited, but that should also be relatively straightforward with data.table.
## But perhaps something like this
DT[, c("AvgX", "AvgY") := list(mean(x), mean(y)), by=list(user, MostVisited)]
