I am new to R, and I am trying to figure out how to create a clustered bar chart the mean interest in a film, but separated by gender.
Here is my dataframe:
i gender film interest
1 male 1 22
2 male 1 13
3 male 1 16
4 male 1 10
5 male 1 18
6 male 1 24
7 male 1 13
8 male 1 14
9 male 1 19
10 male 1 23
11 male 2 37
12 male 2 20
13 male 2 16
14 male 2 28
15 male 2 27
16 male 2 18
17 male 2 32
18 male 2 24
19 male 2 21
20 male 2 35
21 female 1 3
22 female 1 15
23 female 1 5
24 female 1 16
25 female 1 13
26 female 1 20
27 female 1 11
28 female 1 19
29 female 1 15
30 female 1 7
31 female 2 30
32 female 2 25
33 female 2 31
34 female 2 36
35 female 2 23
36 female 2 14
37 female 2 21
38 female 2 31
39 female 2 22
40 female 2 14
Here is the script that is used:
movies<-read.csv(file.choose())
t.test(interest~film, data = movies)
names(movies)
str(movies)
movies$ï..gender = factor(movies$ï..gender, levels=c(1,2),
labels=c("male","female"))
with(movies, table(film,interest))
summary(movies)
movie.types<-split(movies$interest, movies$film)
boxplot(movie.types)
movie.mean<-sapply(movie.types,mean)
barplot(movie.mean, col = "red", main = "Mean Interest by Film",
ylim=c(0,30), names.arg = c("Bridget Jones Diary","Memento"))
And here is the barplot I made, which I need to make a cluster plot to divide out by gender:
So I have a csv file with column headers ID, Score, and Age.
So in R I have,
data <- read.csv(file.choose(), header=T)
attach(data)
I would like to create two new vectors with people's scores whos age are below 70 and above 70 years old. I thought there was a nice a quick way to do this but I cant find it any where. Thanks for any help
Example of what data looks like
ID, Score, Age
1, 20, 77
2, 32, 65
.... etc
And I am trying to make 2 vectors where it consists of all peoples scores who are younger than 70 and all peoples scores who are older than 70
Assuming data looks like this:
Score Age
1 1 29
2 5 39
3 8 40
4 3 89
5 5 31
6 6 23
7 7 75
8 3 3
9 2 23
10 6 54
.. . ..
you can use
df_old <- data[data$Age >= 70,]
df_young <- data[data$Age < 70,]
which gives you
> df_old
Score Age
4 3 89
7 7 75
11 7 97
13 3 101
16 5 89
18 5 89
19 4 96
20 3 97
21 8 75
and
> df_young
Score Age
1 1 29
2 5 39
3 8 40
5 5 31
6 6 23
8 3 3
9 2 23
10 6 54
12 4 23
14 2 23
15 4 45
17 7 53
PS: if you only want the scores and not the age, you could use this:
df_old <- data[data$Age >= 70, "Score"]
df_young <- data[data$Age < 70, "Score"]
I have a big data set of 72 columns and I want to gather each 3 of columns into a new column and thus getting 24 columns in the end.
I tried using gather() function but it works good for one time only t=i.e., it gather only 3 columns at a time.
Can I use this function in a for loop?
I tried this:
j=0
k=1
l=2
for (i in 2:24){
neww <- gather(columns, "KEy", "Proteins H/L", c((i+j), (i+k), (i+l)), na.rm = TRUE)
j=j+2;
k=k+2;
l=l+2;
}
I need to gather first 3 columns in a single column and then next 3 in another column and so on.
You can use the to_long function from the sjmisc-package for this purpose. This function is a convenient for-loop, which calls multiple gather() calls.
# create sample
mydat <- data.frame(age = c(20, 30, 40),
sex = c("Female", "Male", "Male"),
score_t1 = c(30, 35, 32),
score_t2 = c(33, 34, 37),
score_t3 = c(36, 35, 38),
speed_t1 = c(2, 3, 1),
speed_t2 = c(3, 4, 5),
speed_t3 = c(1, 8, 6))
# check tidyr. score is gathered, however, speed is not
tidyr::gather(mydat, "time", "score", score_t1, score_t2, score_t3)
> age sex speed_t1 speed_t2 speed_t3 time score
> 1 20 Female 2 3 1 score_t1 30
> 2 30 Male 3 4 8 score_t1 35
> 3 40 Male 1 5 6 score_t1 32
> 4 20 Female 2 3 1 score_t2 33
> 5 30 Male 3 4 8 score_t2 34
> 6 40 Male 1 5 6 score_t2 37
> 7 20 Female 2 3 1 score_t3 36
> 8 30 Male 3 4 8 score_t3 35
> 9 40 Male 1 5 6 score_t3 38
# gather multiple columns. both time and speed are gathered.
to_long(mydat, "time", c("score", "speed"),
c("score_t1", "score_t2", "score_t3"),
c("speed_t1", "speed_t2", "speed_t3"))
> age sex time score speed
> (dbl) (fctr) (chr) (dbl) (dbl)
> 1 20 Female score_t1 30 2
> 2 30 Male score_t1 35 3
> 3 40 Male score_t1 32 1
> 4 20 Female score_t2 33 3
> 5 30 Male score_t2 34 4
> 6 40 Male score_t2 37 5
> 7 20 Female score_t3 36 1
> 8 30 Male score_t3 35 8
> 9 40 Male score_t3 38 6
In this case, the time vector (indicating the gathered groups) just takes one of the multiple gathered column name. If this is too confusing, you can also just number the ID variable:
to_long(mydat, "time", c("score", "speed"),
c("score_t1", "score_t2", "score_t3"),
c("speed_t1", "speed_t2", "speed_t3"),
recode.key = TRUE)
> age sex time score speed
> (dbl) (fctr) (dbl) (dbl) (dbl)
> 1 20 Female 1 30 2
> 2 30 Male 1 35 3
> 3 40 Male 1 32 1
> 4 20 Female 2 33 3
> 5 30 Male 2 34 4
> 6 40 Male 2 37 5
> 7 20 Female 3 36 1
> 8 30 Male 3 35 8
> 9 40 Male 3 38 6
See ?to_long for more examples.
I'm not sure, but I think I read something on GitHub that "multiple column gathering" is also planned for tidyr somewhen...
I need some help in data cleaning using R.
my CSV file looks as follows.
"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
...................................
5 Female 22 movies NA
5 Female 22 music NA
5 Female 22 travel NA
5 Female 22 cloths NA
5 Female 22 grocery NA
5 Female 22 books NA
5 Female 22 rent 1
5 Female 22 fuel NA
5 Female 22 utility NA
5 Female 22 online-shopping 2
So far My efforts are as follows.
mini <- read.csv("~/MS/coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(df_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know what is the best way to fill all missing categories and also how do I rank the categories as per their position in CSV file.
For more clarity please refer the above CSV file and expected output.
text1='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,'
d1 <- read.table(text=text1, sep=",", head=T, as.is=T)
library(reshape2)
d2 <- melt(d1, id.vars=c("id","gender","age"))
names(d2)[5] <- "category"
names(d2)[4] <- "rank"
d2$rank <- gsub("category", "", d2$rank)
head(d2)
# id gender age rank category
# 1 1 Male 22 1 movies
# 2 2 Male 28 1 travel
# 3 3 Female 27 1 rent
# 4 4 Female 22 1 rent
# 5 5 Female 22 1 rent
# 6 1 Male 22 2 music
We can use gather from tidyr
library(tidyr)
d2 <- gather(d1, rank, category, -(1:3)) %>%
extract(rank, into='rank', '.*(\\d+)')
head(d2)
# id gender age rank category
#1 1 Male 22 1 movies
#2 2 Male 28 1 travel
#3 3 Female 27 1 rent
#4 4 Female 22 1 rent
#5 5 Female 22 1 rent
#6 1 Male 22 2 music
I am using the dcast function in R to turn a long-format dataset into a wide-format dataset. I have an ID number, a categorical variable (CAT), and a continuous variable (AMT). However, I also have a variable SEX, which is the same for all rows of a given ID number. This code works to create the wide-format dataset, but I lose SEX. How can I retain it?
PC1cast <- dcast(PC1, ID~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
If I add SEX to the ID~CAT line, it gives me SEX-CAT combinations. I want SEX to just be one value for each row.
Sample data:
ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
For that, you need to add SEX to the ID side of your formula:
dcast(PC1, ID + SEX~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
# results in:
ID SEX A B C
1 1 Female 46 22 31
2 2 Male 17 25 44
Things on the left hand side of the formula are kept as-is, things on the right-hand side are cast.
I added some extra data lines to clarify some parts of this. But the gist is that you just need to put SEX on the left hand side (i.e., of ~):
PC2 <- read.table(text="ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
3 A 47 Female
3 B 27 Female
3 C 37 Female
4 A 17 Male
4 A 17 Male
4 B 22 Male
4 B NA Male
4 C 44 Male", header=T)
library(reshape2)
PC1cast2 <- dcast(PC2, ID+SEX~CAT, value.var='AMT', fun.aggregate=sum,
na.rm=TRUE)
PC1cast2
# ID SEX A B C
# 1 1 Female 46 22 31
# 2 2 Male 17 25 44
# 3 3 Female 47 27 37
# 4 4 Male 34 22 44
In your example data, you only have one instance of each combination and no NAs, so the fun.aggregate=sum, na.rm=TRUE doesn't do anything. When some are duplicated (e.g., there are two 4 As and two 4 Bs), the values are summed, but the NAs are dropped first. Make sure that is what you want.