I am new to R, and I am trying to figure out how to create a clustered bar chart the mean interest in a film, but separated by gender.
Here is my dataframe:
i gender film interest
1 male 1 22
2 male 1 13
3 male 1 16
4 male 1 10
5 male 1 18
6 male 1 24
7 male 1 13
8 male 1 14
9 male 1 19
10 male 1 23
11 male 2 37
12 male 2 20
13 male 2 16
14 male 2 28
15 male 2 27
16 male 2 18
17 male 2 32
18 male 2 24
19 male 2 21
20 male 2 35
21 female 1 3
22 female 1 15
23 female 1 5
24 female 1 16
25 female 1 13
26 female 1 20
27 female 1 11
28 female 1 19
29 female 1 15
30 female 1 7
31 female 2 30
32 female 2 25
33 female 2 31
34 female 2 36
35 female 2 23
36 female 2 14
37 female 2 21
38 female 2 31
39 female 2 22
40 female 2 14
Here is the script that is used:
movies<-read.csv(file.choose())
t.test(interest~film, data = movies)
names(movies)
str(movies)
movies$ï..gender = factor(movies$ï..gender, levels=c(1,2),
labels=c("male","female"))
with(movies, table(film,interest))
summary(movies)
movie.types<-split(movies$interest, movies$film)
boxplot(movie.types)
movie.mean<-sapply(movie.types,mean)
barplot(movie.mean, col = "red", main = "Mean Interest by Film",
ylim=c(0,30), names.arg = c("Bridget Jones Diary","Memento"))
And here is the barplot I made, which I need to make a cluster plot to divide out by gender:
Related
I have below client dataset includes client_id, birth_number and district_id. The birth number is in the form YYMMDD, here is twist - The value is in the form: YYMMDD(for Men) and the value is in the form: YY(+50MM)DD(for Women). I want your help to develop the script in R where we can split the YYMMDD and set condition. based on condition if MM>12 then that row belong to women and the actual month value subtracted by 15 else Men with the same birth number.
please help
The value is in the form: YYMMDD (for men)
The value is in the form: YY(+50MM)DD (for women)
"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73
An option is to use substring along with ifelse as:
# Get the 3rd and 4th character from "birth_number". If it is > 12
# that row is for Female, otherwise Male
df$Gender <- ifelse(as.numeric(substring(df$birth_number,3,4)) > 12, "Female", "Male")
# Now correct the "birth_number". Subtract 50 form middle 2 digits.
# Updated based on feedback from #RuiBarradas to use df$Gender == "Female"
# to subtract 50 from month number
df$birth_number <- ifelse(df$Gender == "Female",
as.character(as.numeric(df$birth_number)-5000), df$birth_number)
df
# client_id birth_number district_id Gender
# 1 1 701213 18 Female
# 2 2 450204 1 Male
# 3 3 401009 1 Female
# 4 4 561201 5 Male
# 5 5 600703 5 Female
# 6 6 190922 12 Male
# so on
#
Data:
df <- read.table(text =
'"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73',
header = TRUE, stringsAsFactors = FALSE, sep = ";")
Using the same commands as #MKR, I just prefer the tidyverse approach.
require(tidyverse)
df %>%
mutate(Gender = ifelse(substr(birth_number, 3, 4) > 12,
"Female", "Male"),
birth_number = ifelse(Gender == "Female",
birth_number - 5000,
birth_number))
client_id birth_number district_id Gender
1 1 701213 18 Female
2 2 450204 1 Male
3 3 401009 1 Female
4 4 561201 5 Male
5 5 600703 5 Female
6 6 190922 12 Male
7 7 290125 15 Male
8 8 380221 51 Female
9 9 351016 60 Male
10 10 430501 57 Male
11 11 500822 57 Female
12 12 810220 40 Male
13 13 740529 54 Female
14 14 420622 76 Female
15 15 180828 21 Female
16 16 190225 21 Male
17 17 341013 76 Male
18 18 310405 76 Female
19 19 421228 47 Male
20 20 790104 46 Male
21 21 521029 12 Female
22 22 691011 1 Female
23 23 730529 1 Male
24 24 390729 43 Female
25 25 390423 21 Female
26 26 690420 74 Female
27 27 660326 54 Female
28 28 450929 1 Male
29 29 510911 30 Female
30 30 571009 74 Female
31 31 620209 68 Male
32 32 800728 52 Male
33 33 481204 73 Female
I have a data set with two outcome variables, case1 and case2. Case1 has 4 levels, while case2 has 50 (levels in case2 could increase later). I would like to create data partition for train and test keeping the ratio in both cases. The real data is imbalanced for both case1 and case2. As an example,
library(caret)
set.seed(123)
matris=matrix(rnorm(10),1000,20)
case1 <- as.factor(ceiling(runif(1000, 0, 4)))
case2 <- as.factor(ceiling(runif(1000, 0, 50)))
df <- as.data.frame(matris)
df$case1 <- case1
df$case2 <- case2
split1 <- createDataPartition(df$case1, p=0.2)[[1]]
train1 <- df[-split1,]
test1 <- df[split1,]
length(split1)
201
split2 <- createDataPartition(df$case2, p=0.2)[[1]]
train2 <- df[-split2,]
test2 <- df[split2,]
length(split2)
220
If I do separate splitting, I get different length for the data frame. If I do one splitting based on case2 (one with more classes), I lose the ratio of classes for case1.
I will be predicting the two cases separately, but at the end my accuracy will be given by having the exact match for both cases (e.g., ix = which(pred1 == case1 & pred2 == case2), so I need the arrays to be the same size.
Is there a smart way to do this?
Thank you!
If I understand correctly (which I do not guarantee) I can offer the following approach:
Group by case1 and case2 and get the group indices
library(tidyverse)
df %>%
select(case1, case2) %>%
group_by(case1, case2) %>%
group_indices() -> indeces
use these indeces as the outcome variable in create data partition:
split1 <- createDataPartition(as.factor(indeces), p=0.2)[[1]]
check if satisfactory:
table(df[split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
5 6 5 8 5 5 6 6 4 6 6 6 6 6 5 5 5 4 4 7 5 6 5 6 7 5 5 8 6 7 6 6 7
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
4 5 6 6 6 5 5 6 5 6 6 5 4 5 6 4 6
table(df[-split1,22])
#output
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
15 19 13 18 12 13 16 15 8 13 13 15 21 14 11 13 12 9 12 20 17 15 16 19 16 11 14 21 13 20 18 13 16
34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
9 6 12 19 14 10 16 19 17 17 16 14 4 15 14 9 19
table(df[split1,21])
#output
1 2 3 4
71 70 71 67
table(df[-split1,21])
1 2 3 4
176 193 174 178
In my dataframe, how would I create a new variable with the median of Adv. (Advertising) amounts for each SIC group?
As an example:
SIC Adv.
1 65
1 96
1 NA
1 23
2 45
2 23
2 12
3 45
3 NA
3 35
3 6
3 888
4 23
5 656
5 547
6 12
6 32
6 1
Should become:
SIC Adv. SIC.Adv.Median
1 65 65
1 96 65
1 NA 65
1 23 65
2 45 23
2 23 23
2 12 23
3 45 40
3 NA 40
3 35 40
3 6 40
3 888 40
4 23 23
5 656 601.5
5 547 601.5
6 12 12
6 32 12
6 1 12
Any help would be greatly appreciated.
Thank you!
I have a data.frame named final that looks like:
labels gvs order color f3
1 Adygei -2.3321916 1 1 353.0184
2 Basque -0.8519079 2 1 368.1515
3 French -0.9298674 3 1 365.2545
4 Italian -2.8859587 4 1 354.4481
5 Orcadian -1.4996229 5 1 350.9650
6 Russian -1.5597359 6 1 358.9736
7 Sardinian -1.4494841 7 1 355.1171
8 Tuscan -2.4279528 8 1 362.4717
9 Bedouin -3.1717421 9 2 319.3706
10 Druze -0.5058627 10 2 346.2211
11 Mozabite -2.6491331 11 2 299.5014
12 Palestinian -0.7819299 12 2 330.4576
13 Balochi -1.4095947 13 3 327.1238
14 Brahui -1.2534511 14 3 331.0927
15 Burusho 1.7958170 15 3 335.0919
16 Hazara 2.2810477 16 3 325.2444
17 Kalash -0.9258497 17 3 337.7116
18 Makrani -0.9007551 18 3 321.5726
19 Pathan 2.5543214 19 3 326.1923
20 Sindhi 2.6614486 20 3 318.7025
21 Uygur -1.2207974 21 3 322.0286
22 Cambodian 2.3706977 22 4 310.8989
23 Dai -0.9441980 23 4 305.5687
24 Daur -1.0325107 24 4 309.0984
25 Han -0.7381369 25 4 309.1198
26 Hezhen -2.7590587 26 4 296.9128
27 Japanese -0.5644325 27 4 297.9313
28 Lahu -0.8449225 28 4 307.0776
29 Miao -0.7237586 29 4 303.6593
30 Mongola -0.9452944 30 4 302.1380
31 Naxi -0.1625003 31 4 311.8019
32 Oroqen -1.2035258 32 4 308.7219
33 She -2.7758460 33 4 302.1271
34 Tu -0.7703779 34 4 307.3750
35 Tujia -1.0265275 35 4 303.5923
36 Xibo -1.1163019 36 4 295.5764
37 Yakut -3.2102686 37 4 315.0111
38 Yi -0.9614190 38 4 296.8134
39 Colombian -1.9659984 39 5 311.3134
40 Karitiana -0.9195156 40 5 300.8539
41 Maya 2.1239768 41 5 333.8995
42 Pima -3.0895998 42 5 325.3484
43 Surui -0.9377928 43 5 313.8505
44 Melanesian -1.6961014 44 6 294.5214
45 Papuan -0.7037952 45 6 286.7389
46 BantuKenya -1.9311354 46 7 152.9971
47 BantuSouthAfrica -1.8515908 47 7 133.6722
48 BiakaPygmy -1.7657017 48 7 117.5555
49 Mandenka -0.5423822 49 7 152.8525
50 MbutiPygmy -1.6244801 50 7 114.1691
51 San -0.9049735 51 7 0.0000
52 Yoruba 2.0949378 52 7 154.4460
I'm using the following code to make a graph
jpeg("F3.SCZ.Jul_22.jpg", 700,700)
final$color <- as.factor(final$color)
levels(final$color) <- c("blue","yellow3","red","pink","purple","green","orange")
plot(final$gvs, final$f3, cex=2,pch = 21, bg = as.character(final$color), xaxt="n", xlab="Genetic Values", ylab="F3", main="SCZ")
dev.off()
that looks like:
I would like to split the y-axis at 200, to have the y-values range from 0 to 200 to take up only 10% of the graph, while 200 to 400 to take up 90% of the y-axis. Is that possible?
EDIT:
Here is the data that is running into issues:
labels gvs order color f3
1 Adygei -2.3321916 1 1 0.09862109
2 Basque -0.8519079 2 1 0.09942770
3 French -0.9298674 3 1 0.10357547
4 Italian -2.8859587 4 1 0.09960179
5 Orcadian -1.4996229 5 1 0.10244666
6 Russian -1.5597359 6 1 0.10097691
7 Sardinian -1.4494841 7 1 0.10189642
8 Tuscan -2.4279528 8 1 0.09794686
9 Bedouin -3.1717421 9 2 0.09272493
10 Druze -0.5058627 10 2 0.09682272
11 Mozabite -2.6491331 11 2 0.08563901
12 Palestinian -0.7819299 12 2 0.09331649
13 Balochi -1.4095947 13 3 0.09227273
14 Brahui -1.2534511 14 3 0.09328593
15 Burusho 1.7958170 15 3 0.09396032
16 Hazara 2.2810477 16 3 0.09342432
17 Kalash -0.9258497 17 3 0.09666599
18 Makrani -0.9007551 18 3 0.09222257
19 Pathan 2.5543214 19 3 0.09468376
20 Sindhi 2.6614486 20 3 0.09172395
21 Uygur -1.2207974 21 3 0.09140727
22 Cambodian 2.3706977 22 4 0.08655821
23 Dai -0.9441980 23 4 0.08739080
24 Daur -1.0325107 24 4 0.08656669
25 Han -0.7381369 25 4 0.08764395
26 Hezhen -2.7590587 26 4 0.08802065
27 Japanese -0.5644325 27 4 0.08810874
28 Lahu -0.8449225 28 4 0.08609791
29 Miao -0.7237586 29 4 0.08700414
30 Mongola -0.9452944 30 4 0.08921706
31 Naxi -0.1625003 31 4 0.08646436
32 Oroqen -1.2035258 32 4 0.08719536
33 She -2.7758460 33 4 0.08656100
34 Tu -0.7703779 34 4 0.08818588
35 Tujia -1.0265275 35 4 0.08737680
36 Xibo -1.1163019 36 4 0.08806230
37 Yakut -3.2102686 37 4 0.08965344
38 Yi -0.9614190 38 4 0.08593454
39 Colombian -1.9659984 39 5 0.09114697
40 Karitiana -0.9195156 40 5 0.09040477
41 Maya 2.1239768 41 5 0.09068139
42 Pima -3.0895998 42 5 0.09084750
43 Surui -0.9377928 43 5 0.08925535
44 Melanesian -1.6961014 44 6 0.08430903
45 Papuan -0.7037952 45 6 0.08272786
46 BantuKenya -1.9311354 46 7 0.04668356
47 BantuSouthAfrica -1.8515908 47 7 0.03914248
48 BiakaPygmy -1.7657017 48 7 0.03546243
49 Mandenka -0.5423822 49 7 0.04612336
50 MbutiPygmy -1.6244801 50 7 0.03098719
51 San -0.9049735 51 7 0.00000000
52 Yoruba 2.0949378 52 7 0.04561542
You can do:
my_color <- as.factor(final$color)
levels(my_color) <- c("blue","yellow3","red","pink","purple","green","orange")
par(mfrow = c(1,2))
# original plot
pos <- seq(min(final$f3), max(final$f3), by = 25) ## y-axis tick marks position.
plot(final$gvs, final$f3, cex=2, pch=21, bg = as.character(my_color),
xaxt="n", yaxt="n", xlab="Genetic Values", ylab="F3", main="SCZ")
axis(2, at = pos, labels = pos) ## add y-axis
# new plot
threshold <- 260 ## cut off threshold
## some rescaling
## if f3 < threshold, we take new_f3 <- 0.1 * f3
## if f3 > threshold, we take new_f3 <- f3 - 0.9 * threshold
new_f3 <- ifelse(final$f3 < threshold, 0.1 * final$f3, final$f3 - threshold * 0.9)
## we apply the same transform to `pos` to get `new_pos`
new_pos <- ifelse(pos < threshold, 0.1 * pos, pos - threshold * 0.9)
plot(final$gvs, new_f3, cex=2, pch=21, bg = as.character(my_color),
xaxt="n", yaxt="n", xlab="Genetic Values", ylab="F3", main="SCZ")
abline(h = threshold * 0.1, lty = 3) # threshold line
axis(2, at = new_pos, labels = pos)
I would use trans_new() from scales package to transform the y-axis. This should get you close. I prefer the continuously differentiable transform (first), but you can also do a step change in scale (second). H/T to Gregor for pointing out that pmin and pmax handle vectors and are correct here.
setwd("C:/Users/rherron1/Desktop/")
final <- read.table("Scratch2.txt", header=TRUE)
final$id <- NULL
# default y-scale
require(ggplot2)
a <- ggplot(final, aes(gvs, f3, color=factor(color)))
a <- a + geom_point()
a
# transform y-axis
require(scales)
skew <- function(x) x^2
iskew <- function(x) x^(1/2)
skew_trans <- function() trans_new("skew", "skew", "iskew")
b <- a + coord_trans(y="skew")
b
# transform y-axis
require(scales)
sku <- function(x) pmin(x, 200) + 9*pmax(x-200, 0)
isku <- function(x) pmax((x-200)/9, 0) + pmin(x, 200)
sku_trans <- function() trans_new("sku", "sku", "isku")
c <- a + coord_trans(y="sku")
c
I have some gamete data in the following format:
Ind Letter Place Position
1 A 19 23
2 B 19 23
3 B 19 23
4 B 19 23
1 B 19 34
2 A 19 34
3 B 19 34
4 B 19 34
1 C 19 52
2 T 19 52
3 C 19 52
4 T 19 52
1 T 33 15
2 T 33 15
3 T 33 15
4 C 33 15
1 C 33 26
2 T 33 26
3 T 33 26
4 C 33 26
dput of data:
structure(list(Ind = c(1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L,1L,2L,3L,4L),
Letter = structure(c(1L,2L,2L,2L,2L,1L,2L,2L,3L,4L,3L,4L,4L,4L,4L,3L,3L,4L,4L,3L),
.Label = c("A","B","C","T"), class="factor"),
Place = c(19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,19L,33L,33L,33L,33L,33L,33L,33L,33L),
Position = c(23L,23L,23L,23L,34L,34L,34L,34L,52L,52L,52L,52L,15L,15L,15L,15L,26L,26L,26L,26L)),
.Names = c("Ind","Letter","Place","Position"),
class="data.frame", row.names = c(NA,-20L))
I need to pair and combine them, so I get all possible unique combinations with reference to Position within a pair. I have another data-file, that contains information on the pairs, and they are paired with reference to Place. So in this file I may see, that Place 19+Place 33 is a pair, and I want the following result:
Ind Letter Place Position Ind Letter Place Position
1 A 19 23 1 T 33 15
2 B 19 23 2 T 33 15
3 B 19 23 3 T 33 15
4 B 19 23 4 C 33 15
1 A 19 23 1 C 33 26
2 B 19 23 2 T 33 26
3 B 19 23 3 T 33 26
4 B 19 23 4 C 33 26
1 B 19 34 1 T 33 15
2 A 19 34 2 T 33 15
3 B 19 34 3 T 33 15
4 B 19 34 4 C 33 15
1 B 19 34 1 C 33 26
2 A 19 34 2 T 33 26
3 B 19 34 3 T 33 26
4 B 19 34 4 C 33 26
1 C 19 52 1 T 33 15
2 T 19 52 2 T 33 15
3 C 19 52 3 T 33 15
4 T 19 52 4 C 33 15
1 C 19 52 1 C 33 26
2 T 19 52 2 T 33 26
3 C 19 52 3 T 33 26
4 T 19 52 4 C 33 26
In this case unique means that A1:A2 is equal to A2:A1.
The reason I want to do this, is because I want to do a Four-Gamete-Test on the pairs, to the see if all possible combinations of Letter is existent. So e.g. for the last combined pair above, we have the letter-pairs CC, TT, CT, TC, so this combined pair will pass the FGT.
I have tried to do the combining with expand.grid, as it seems this is quite close to what I want. However, when I require all combination of data$Position, I lose the information for Ind, Letter, and Place. Also the output includes non-unique pairs.
Can anyone point me to a tool, that is closer to what I want? Or give me some guidelines on how to modify expand.grid, to get what I need.
Should you be aware of a tool, that actually does the Four-Gamete-Test, or something similar, then that would of course also be interesting for me to look at.
You can use expand.grid but not directly on the Position column. The idea is to find all combinations of the "quartets" (unique Positions):
pair <- c(19, 33)
df1 <- df1[df1$Place %in% pair, ]
split1 <- split( df1, df1$Position)
vec1 <- unique(df1$Position[df1$Place == pair[1]])
vec2 <- unique(df1$Position[df1$Place == pair[2]])
combin_num <- expand.grid(vec2, vec1)[,2:1]
do.call(
rbind,
lapply(seq_len(nrow(combin_num)), function(i){
cbind( split1[[as.character(combin_num[i,1])]],
split1[[as.character(combin_num[i,2])]] )
})
)[,]
Result:
# Ind Letter Place Position Ind.1 Letter.1 Place.1 Position.1
# 1 1 A 19 23 1 T 33 15
# 2 2 B 19 23 2 T 33 15
# 3 3 B 19 23 3 T 33 15
# 4 4 B 19 23 4 C 33 15
# 5 1 A 19 23 1 C 33 26
# 6 2 B 19 23 2 T 33 26
# 7 3 B 19 23 3 T 33 26
# 8 4 B 19 23 4 C 33 26
# 51 1 B 19 34 1 T 33 15
# 61 2 A 19 34 2 T 33 15
# 71 3 B 19 34 3 T 33 15
# 81 4 B 19 34 4 C 33 15
# 52 1 B 19 34 1 C 33 26
# 62 2 A 19 34 2 T 33 26
# 72 3 B 19 34 3 T 33 26
# 82 4 B 19 34 4 C 33 26
# 9 1 C 19 52 1 T 33 15
# 10 2 T 19 52 2 T 33 15
# 11 3 C 19 52 3 T 33 15
# 12 4 T 19 52 4 C 33 15
# 91 1 C 19 52 1 C 33 26
# 101 2 T 19 52 2 T 33 26
# 111 3 C 19 52 3 T 33 26
# 121 4 T 19 52 4 C 33 26