I have 15 columns of categorical variables and I want the correlation among them. The data set has 20,000+ rows and looks like this:
state | job | hair_color | car_color | marital_status
NY | cs | brown | blue | s
FL | mt | black | blue | d
NY | md | blond | white | m
NY | cs | brown | red | s
Notice that the 1st and last rows both contain NY, cs, and s. I want to find that kind of pattern: here NY and cs are highly correlated. I need to rank the combinations of values across the columns. Please note that this is NOT about counting NY or cs on their own. It is about finding out how many times, say, NY and blond appear together in the same row. I need to do that for all values, row by row. I hope the question makes sense now.
I tried to utilize cor() with R but since these are categorical variables the function doesn't work. How can I work with this data set to find the correlation among them?
You may wish to refer to Ways to calculate similarity. Suppose your data is
d <- structure(list(state = structure(c(2L, 1L, 1L, 2L, 2L), .Label = c("FL",
"NY"), class = "factor"), job = structure(c(2L, 1L, 4L, 3L, 2L
), .Label = c("bs", "cs", "md", "mt"), class = "factor"), hair_color = structure(c(3L,
3L, 1L, 2L, 3L), .Label = c("black", "blond", "brown"), class = "factor"),
car_color = structure(c(1L, 2L, 1L, 3L, 2L), .Label = c("blue",
"red", "white"), class = "factor"), marital_status = structure(c(3L,
1L, 1L, 2L, 3L), .Label = c("d", "m", "s"), class = "factor")), .Names = c("state",
"job", "hair_color", "car_color", "marital_status"), class = "data.frame", row.names = c(NA,
-5L))
Data:
> d
state job hair_color car_color marital_status
1 NY cs brown blue s
2 FL bs brown red d
3 FL mt black blue d
4 NY md blond white m
5 NY cs brown red s
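We can calculate the "dissimilarities" between observations. Because every column is a factor, daisy() computes Gower's coefficient for this data (the output below reports the metric as "mixed"; here it is simply the share of columns on which two rows differ), so it is clearer to request it explicitly:
library(cluster)
daisy(d, metric = "gower")
Output:
> daisy(d, metric = "gower")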
Dissimilarities :
1 2 3 4
2 0.8
3 0.8 0.6
4 0.8 1.0 1.0
5 0.2 0.6 1.0 0.8
Metric : mixed ; Types = N, N, N, N, N
Number of objects : 5
which tells us that observations 1 and 5 are least dissimilar. With many observations, it is obviously impossible to visually inspect the dissimilarity matrix, but we can filter out the pairs that fall below a certain threshold, e.g.
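out <- daisy(d, metric = "gower")
n <- nrow(d)
pairs <- expand.grid(2:n, 1:(n - 1))
# keep only the lower triangle (Var1 > Var2) so that the rows line up
# one-to-one with the entries of the dist-like object `out`
pairs <- pairs[pairs[,1] > pairs[,2],]
similars <- pairs[which(out < .8),]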
Given a threshold of 0.8,
> similars
Var1 Var2
4 5 1
6 3 2
8 5 2
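The question also asks how often specific values appear together in the same row (e.g. NY with cs). The following is only a sketch of one way to get that with base R, not part of the answer above: it cross-tabulates every pair of columns and ranks the value combinations by count. The data frame is the toy data from this answer retyped as character columns, and the names col_pairs and counts are made up for illustration.

```r
# Toy data from the answer above, retyped as plain character columns
d <- data.frame(
  state          = c("NY", "FL", "FL", "NY", "NY"),
  job            = c("cs", "bs", "mt", "md", "cs"),
  hair_color     = c("brown", "brown", "black", "blond", "brown"),
  car_color      = c("blue", "red", "blue", "white", "red"),
  marital_status = c("s", "d", "d", "m", "s"),
  stringsAsFactors = FALSE
)

# Every unordered pair of column names
col_pairs <- combn(names(d), 2, simplify = FALSE)

# For each pair of columns, count how often each pair of values shares a row
counts <- do.call(rbind, lapply(col_pairs, function(p) {
  tab <- as.data.frame(table(d[[p[1]]], d[[p[2]]]), stringsAsFactors = FALSE)
  names(tab) <- c("value1", "value2", "n")
  cbind(col1 = p[1], col2 = p[2], tab[tab$n > 0, ], stringsAsFactors = FALSE)
}))

# Most frequent combinations first
counts <- counts[order(-counts$n), ]
head(counts)
```

With 15 columns this loops over 105 column pairs, which should be manageable for 20,000 rows, but the size of the result depends on how many distinct values each column has.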
I have a dataset composed of more than 100 columns and all columns are of type factor. Ex:
animal fruit vehicle color
cat orange car blue
dog apple bus green
dog apple car green
dog orange bus green
In my dataset I need to remove all factor columns that have fewer than 5 observations in any level. In this example, if I want to remove all columns that have a level with 1 observation or fewer, like blue or cat, the algorithm will remove the columns animal and color. What is the most elegant way to do this?
We can use Filter with table
Filter(function(x) !any(table(x) < 2), df1)
# fruit vehicle
#1 orange car
#2 apple bus
#3 apple car
#4 orange bus
data
df1 <- structure(list(animal = structure(c(1L, 2L, 2L, 2L), .Label = c("cat",
"dog"), class = "factor"), fruit = structure(c(2L, 1L, 1L, 2L
), .Label = c("apple", "orange"), class = "factor"), vehicle = structure(c(2L,
1L, 2L, 1L), .Label = c("bus", "car"), class = "factor"), color = structure(c(1L,
2L, 2L, 2L), .Label = c("blue", "green"), class = "factor")),
row.names = c(NA,
-4L), class = "data.frame")
Alternatively, we can use select_if from dplyr (superseded by select(where(...)) since dplyr 1.0.0):
library(dplyr)
df1 %>% select_if(~all(table(.) > 1))
# fruit vehicle
#1 orange car
#2 apple bus
#3 apple car
#4 orange bus
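Both answers hard-code the cutoff of 2, while the question mentions a minimum of 5 observations per level. Here is a small sketch with the cutoff as a parameter. The function name keep_dense_columns and its argument min_n are made up for illustration. One assumption differs from the Filter version above: table(factor(col)) ignores unused factor levels instead of counting them as zero.

```r
# Example data from the question
df1 <- data.frame(
  animal  = c("cat", "dog", "dog", "dog"),
  fruit   = c("orange", "apple", "apple", "orange"),
  vehicle = c("car", "bus", "car", "bus"),
  color   = c("blue", "green", "green", "green"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: keep only the columns in which every level that
# actually occurs has at least `min_n` observations
keep_dense_columns <- function(data, min_n) {
  keep <- vapply(data, function(col) all(table(factor(col)) >= min_n), logical(1))
  data[, keep, drop = FALSE]
}

res2 <- keep_dense_columns(df1, min_n = 2)
names(res2)
```

Calling keep_dense_columns(df1, min_n = 5) on this tiny example drops every column, since no level occurs five times.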
Given
Group ss
B male
B male
B female
A male
A female
X male
Then
tab <- table(res$Group, res$ss)
I want the Group column to be in the order B, A, X, as it is in the data. Currently it's in alphabetical order, which is not what I want. This is what I want:
MALE FEMALE
B 5 5
A 5 10
X 10 12
If you arrange the factor levels based on the order you want, you'll get the desired result.
res$Group <- factor(res$Group, levels = c('B', 'A', 'X'))
#If it is based on occurrence in Group column we can use
#res$Group <- factor(res$Group, levels = unique(res$Group))
table(res$Group, res$ss)
#Or just
#table(res)
# female male
# B 1 2
# A 1 1
# X 0 1
data
res <- structure(list(Group = structure(c(2L, 2L, 2L, 1L, 1L, 3L),
.Label = c("A", "B", "X"), class = "factor"), ss = structure(c(2L, 2L, 1L, 2L,
1L, 2L), .Label = c("female", "male"), class = "factor")),
class = "data.frame", row.names = c(NA, -6L))
unique returns the unique elements of a vector in the order they occur. A table can be ordered like any other structure by extracting its elements in the order you want. So if you pass the output of unique to [,] then you'll get the table sorted in the order of occurrence of the vector.
tab <- table(res$Group, res$ss)[unique(res$Group),]
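The same indexing idea should also work for the columns, which matters here because the desired output lists male before female. A minimal sketch, assuming the two columns are character vectors so that the table is indexed by name (with factor columns, wrapping unique() in as.character() would do the same); tab_ordered is just an illustrative name:

```r
# Data from the question, as character columns
res <- data.frame(
  Group = c("B", "B", "B", "A", "A", "X"),
  ss    = c("male", "male", "female", "male", "female", "male"),
  stringsAsFactors = FALSE
)

tab <- table(res$Group, res$ss)

# Reorder rows and columns by first appearance in the data
tab_ordered <- tab[unique(res$Group), unique(res$ss)]
tab_ordered
```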
My data has about 270 columns and 160,000 mainly non-numeric observations.
I need to find patterns and dependencies between the columns.
For example, I need the correlation of the column "Material" with the other columns.
Material | Name | Country | Vehicle
----------------------------------------------
Bricks | John | A | Car
Bricks | John | A | Car
Bricks | John | A | Motorcycles
Bricks | John | B | Motorcycles
Concrete | Bill | B | Car
Concrete | Bill | B | Car
Concrete | Bill | B | Car
Concrete | Bill | A | Car
My desired result is:
Name - 100%
Country - 75%
Vehicle - 50%
I tried:
library("GoodmanKruskal")
Cor_matrix<- GKtauDataframe(df)
plot(Cor_matrix)
but got: Error in table(x, y, useNA = includeNA) : attempt to make a table with >= 2^31 elements
or:
library("corrr")
df %>% correlate() %>% focus(Material)
Error in stats::cor(x = x, y = y, use = use, method = method) : 'x' must be numeric
So I am searching for a package and a code example which can handle non-numerics. Many thanks in advance.
If columns in your df are of factor type, then you need to transform to numeric first.
df[] <- Map(as.numeric,df)
otherwise
df[] <- Map(function(v) as.numeric(factor(v)),df)
Then, you can run the following code
df %>% correlate() %>% focus(Material)
# A tibble: 3 x 2
rowname Material
<chr> <dbl>
1 Name -1
2 Country 0.5
3 Vehicle -0.577
DATA
df <- structure(list(Material = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("Bricks", "Concrete"), class = "factor"),
Name = structure(c(2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Bill",
"John"), class = "factor"), Country = structure(c(1L, 1L,
1L, 2L, 2L, 2L, 2L, 1L), .Label = c("A", "B"), class = "factor"),
Vehicle = structure(c(1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L), .Label = c("Car",
"Motorcycles"), class = "factor")), class = "data.frame", row.names = c(NA,
-8L))
Your code uses the function GKtauDataframe, which tries to calculate the metrics for all 270 x 270 combinations simultaneously. That is too much.
However, as you mentioned, you want rather to compare one column against all others. This should be feasible, and not need that much memory. The function GKtau does this between a pair of columns:
GKtau(df[, 1], df[, 2])
To get the values for the 1st column against all others, simply call:
lapply(df[, -1], GKtau, df[, 1])
Sure, you can refine your output using something like:
sapply(df[, -1], function(di) GKtau(df[, 1], di)$tauxy)
which makes the output way more compact.
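To make it clearer what is being measured, here is a sketch that computes the Goodman-Kruskal tau from x to y by hand in base R, using the textbook definition (the proportional reduction in the variability of y once x is known). The function gk_tau is a made-up helper, not part of the GoodmanKruskal package. I would expect it to agree with the forward association reported by GKtau, but that is not checked here.

```r
# Example data from the question
df <- data.frame(
  Material = rep(c("Bricks", "Concrete"), each = 4),
  Name     = rep(c("John", "Bill"), each = 4),
  Country  = c("A", "A", "A", "B", "B", "B", "B", "A"),
  Vehicle  = c("Car", "Car", "Motorcycles", "Motorcycles",
               "Car", "Car", "Car", "Car"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: Goodman-Kruskal tau for predicting y from x
gk_tau <- function(x, y) {
  p  <- prop.table(table(factor(x), factor(y)))  # joint proportions, unused levels dropped
  px <- rowSums(p)                               # marginal distribution of x
  py <- colSums(p)                               # marginal distribution of y
  v_y   <- 1 - sum(py^2)                         # variability of y on its own
  v_y_x <- 1 - sum(p^2 / px)                     # expected variability of y given x
  (v_y - v_y_x) / v_y
}

tau <- sapply(df[-1], function(col) gk_tau(df$Material, col))
round(tau, 3)
```

For this data the values come out as 1 for Name, 0.25 for Country and about 0.333 for Vehicle. For two binary columns, tau equals the squared Pearson correlation of the codes, which is consistent with the 0.5 and -0.577 shown in the other answer.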
I am interested in testing some network visualization techniques but before trying those functions I want to build an adjacency matrix (from, to) using the dataframe which is as follows.
Id  Gender  Col_Cold_1  Col_Cold_2  Col_Cold_3  Col_Hot_1  Col_Hot_2   Col_Hot_3
10  F       pain        sleep       NA          infection  medication  walking
14  F       Bump        NA          muscle      NA         twitching   flutter
17  M                   pain        hemaloma    Callus     infection
18  F       muscle                  pain                   twitching   medication
My goal is to create an adjacency matrix as follows
1) All values in columns with keyword Cold will contribute to the rows
2) All values in columns with keyword Hot will contribute to the columns
For example, pain, sleep, Bump, muscle and hemaloma are cell values under the columns with keyword Cold, so they will form the rows. Cell values such as infection, medication, Callus, walking, twitching and flutter are under the columns with keyword Hot, so they will form the columns of the association matrix.
The final desired output should appear like this:
infection medication walking twitching flutter Callus
pain 2 2 1 1 1
sleep 1 1 1
Bump 1 1
muscle 1 1
hemaloma 1 1
[pain, infection] = 2 because the association between pain and infection occurs twice in the original dataframe: once in row 1 and again in row 3.
[pain, medication] = 2 because the association between pain and medication occurs twice: once in row 1 and again in row 4.
Any suggestions or advice on producing such an association matrix is much appreciated thanks.
Reproducible Dataset
df = structure(list(id = c(10, 14, 17, 18), Gender = structure(c(1L, 1L, 2L, 1L), .Label = c("F", "M"), class = "factor"), Col_Cold_1 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "Bump", "muscle", "pain"), class = "factor"), Col_Cold_2 = structure(c(4L, 2L, 3L, 1L), .Label = c("", "NA", "pain", "sleep"), class = "factor"), Col_Cold_3 = structure(c(1L, 3L, 2L, 4L), .Label = c("NA", "hemaloma", "muscle", "pain" ), class = "factor"), Col_Hot_1 = structure(c(4L, 3L, 2L, 1L), .Label = c("", "Callus", "NA", "infection"), class = "factor"), Col_Hot_2 = structure(c(2L, 3L, 1L, 3L), .Label = c("infection", "medication", "twitching"), class = "factor"), Col_Hot_3 = structure(c(4L, 2L, 1L, 3L), .Label = c("", "flutter", "medication", "walking" ), class = "factor")), .Names = c("id", "Gender", "Col_Cold_1", "Col_Cold_2", "Col_Cold_3", "Col_Hot_1", "Col_Hot_2", "Col_Hot_3" ), row.names = c(NA, -4L), class = "data.frame")
One way is to make the dataset into a "tidy" form, then use xtabs. First, some cleaning up:
df[] <- lapply(df, as.character) # Convert factors to characters
df[df == "NA" | df == "" | is.na(df)] <- NA # Make all blanks NAs
Now, tidy the dataset:
library(tidyr)
library(dplyr)
out <- do.call(rbind, sapply(grep("^Col_Cold", names(df), value = TRUE), function(x){
  vars <- c(x, grep("^Col_Hot", names(df), value = TRUE))
  setNames(gather_(select(df, one_of(vars)),
                   key_col = x,
                   value_col = "value",
                   gather_cols = vars[-1])[, c(1, 3)], c("cold", "hot"))
}, simplify = FALSE))
The idea is to "pair" each of the "cold" columns with each of the "hot" columns to make a long dataset. out looks like this:
out
# cold hot
# 1 pain infection
# 2 Bump <NA>
# 3 <NA> Callus
# 4 muscle <NA>
# 5 pain medication
# ...
Finally, use xtabs to make the desired output:
xtabs(~ cold + hot, na.omit(out))
# hot
# cold Callus flutter infection medication twitching walking
# Bump 0 1 0 0 1 0
# hemaloma 1 0 1 0 0 0
# muscle 0 1 0 1 2 0
# pain 1 0 2 2 1 1
# sleep 0 0 1 1 0 1
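Since gather_ is deprecated in current tidyr, here is a sketch of the same pairing idea in base R only. It is an alternative under my own assumptions, not the code from the answer: the data is retyped with real NA values in place of blanks and "NA" strings, and the names combos, pairs_long and adj are made up.

```r
# Cold/Hot columns from the question, with blanks and "NA" strings as real NAs
df <- data.frame(
  Col_Cold_1 = c("pain", "Bump", NA, "muscle"),
  Col_Cold_2 = c("sleep", NA, "pain", NA),
  Col_Cold_3 = c(NA, "muscle", "hemaloma", "pain"),
  Col_Hot_1  = c("infection", NA, "Callus", NA),
  Col_Hot_2  = c("medication", "twitching", "infection", "twitching"),
  Col_Hot_3  = c("walking", "flutter", NA, "medication"),
  stringsAsFactors = FALSE
)

cold_cols <- grep("^Col_Cold", names(df), value = TRUE)
hot_cols  <- grep("^Col_Hot",  names(df), value = TRUE)

# Every (cold column, hot column) combination
combos <- expand.grid(cold = cold_cols, hot = hot_cols, stringsAsFactors = FALSE)

# Stack the two chosen columns of each combination into one long data frame
pairs_long <- do.call(rbind, Map(function(cc, hc) {
  data.frame(cold = df[[cc]], hot = df[[hc]], stringsAsFactors = FALSE)
}, combos$cold, combos$hot))

# Cross-tabulate, ignoring rows where either side is missing
adj <- xtabs(~ cold + hot, data = na.omit(pairs_long))
adj
```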
I'm trying to make a simple bar chart that first distinguishes between two groups, say by sex (male or female). Then, after running the stats, each sample/individual has a P-value that is either significant or not. I know how to color code the bars for male and female, but I want R to automatically put a star above each sample/individual with a P-value less than 0.05, say.
I'm currently just using the simple barplot(x) function.
I've tried to look around for answers but haven't found anything for this yet.
Below is a link to my example data set:
DivShare File - test.csv: http://www.divshare.com/download/22797284-187
I'd like to put the time on the y axis, color code the bars to distinguish between Male and Female, and then, for individuals in either group who have a 1 under significance, put a star above their corresponding bar.
Thanks for any suggestions in advance.
I messed with your data a bit to make it friendlier:
## dput(read.csv("barcharttest.csv"))
x <- structure(list(ID = 1:7,
sex = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 2L), .Label = c("female", "male"),
class = "factor"),
val = c(309L, 192L, 384L, 27L, 28L, 245L, 183L),
stat = structure(c(1L, 2L, 2L, 1L, 2L, 1L, 1L), .Label = c("NS", "sig"),
class = "factor")),
.Names = c("ID", "sex", "val", "stat"),
class = "data.frame", row.names = c(NA, -7L))
Which looks like this:
ID sex val stat
1 1 female 309 NS
2 2 female 192 sig
3 3 female 384 sig
4 4 male 27 NS
5 5 male 28 sig
6 6 female 245 NS
7 7 male 183 NS
Now the plot:
sexcols <- c("pink","blue")
## png("barplot.png") ## for output graph
par(las=1,bty="l") ## I prefer these settings; see ?par
b <- with(x,barplot(val,col=sexcols[sex])) ## b saves x coords of bars
legend("topright",levels(x$sex),fill=sexcols,bty="n")
## use xpd=NA to make sure that star on tallest bar doesn't get clipped;
## pos=3 puts the text above the (x,y) location specified
text(b,x$val,ifelse(x$stat=="sig","*",""),pos=3,cex=2,xpd=NA)
axis(side=1,at=b,labels=x$ID)
## dev.off()
I should also add "Time" and "ID" labels on the relevant axes.
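If the data holds raw p-values rather than a sig/NS flag, the stars can be derived from the 0.05 cutoff directly. A sketch along the lines of the plot above, with axis labels added. Note that the p column below is MADE UP purely for illustration (chosen so that the stars land on the same bars as the stat column above); the real file may look different.

```r
# Same IDs, sexes and values as above; `p` holds hypothetical p-values
x <- data.frame(
  ID  = 1:7,
  sex = factor(c("female", "female", "female", "male", "male", "female", "male")),
  val = c(309, 192, 384, 27, 28, 245, 183),
  p   = c(0.20, 0.01, 0.03, 0.50, 0.04, 0.07, 0.30)  # made-up values
)

# A star for every individual below the significance cutoff
stars <- ifelse(x$p < 0.05, "*", "")

sexcols <- c(female = "pink", male = "blue")
b <- barplot(x$val, col = sexcols[as.character(x$sex)],
             names.arg = x$ID, xlab = "ID", ylab = "Time")
## pos = 3 puts the star above each bar; xpd = NA avoids clipping on the tallest bar
text(b, x$val, stars, pos = 3, cex = 2, xpd = NA)
legend("topright", legend = names(sexcols), fill = sexcols, bty = "n")
```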