How do I split a particular column's data in a dataset in R?

I have the client dataset below, which includes client_id, birth_number and district_id. The birth number is in the form YYMMDD, but here is the twist: the value is in the form YYMMDD for men and YY(+50MM)DD for women, i.e. 50 is added to the month. I would like help developing an R script that splits the YYMMDD value and applies a condition: if MM > 12, the row belongs to a woman and the actual month is the stored value minus 50; otherwise the row belongs to a man and the birth number stays the same.
"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73

An option is to use substring along with ifelse as:
# Get the 3rd and 4th characters of "birth_number". If the value is > 12,
# the row is for a female client; otherwise it is male.
df$Gender <- ifelse(as.numeric(substring(df$birth_number, 3, 4)) > 12, "Female", "Male")
# Now correct "birth_number": subtract 50 from the middle two digits.
# Updated based on feedback from @RuiBarradas to use df$Gender == "Female"
# to subtract 50 from the month number.
df$birth_number <- ifelse(df$Gender == "Female",
                          as.character(as.numeric(df$birth_number) - 5000), df$birth_number)
df
# client_id birth_number district_id Gender
# 1 1 701213 18 Female
# 2 2 450204 1 Male
# 3 3 401009 1 Female
# 4 4 561201 5 Male
# 5 5 600703 5 Female
# 6 6 190922 12 Male
# ... and so on
Data:
df <- read.table(text =
'"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73',
header = TRUE, stringsAsFactors = FALSE, sep = ";")
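
As a variation on the approach above, here is a minimal sketch that works on the digits as text, so a birth year with a leading zero (for example "05") is not lost in numeric arithmetic. The helper name decode_birth_number is hypothetical, not part of the original answer, and the sketch assumes that any month above 50 marks a female client (which matches the MM > 12 test for valid data).

```r
# Hypothetical helper (not from the original answer): decode YYMMDD /
# YY(+50MM)DD birth numbers while keeping them as six-character strings.
decode_birth_number <- function(x) {
  # Restore leading zeros in case the column was read as an integer
  x <- sprintf("%06d", as.integer(x))
  mm <- as.integer(substr(x, 3, 4))
  # Assumption: months above 50 encode female clients
  female <- mm > 50
  mm <- ifelse(female, mm - 50L, mm)
  data.frame(
    gender = ifelse(female, "Female", "Male"),
    birth_number = paste0(substr(x, 1, 2), sprintf("%02d", mm), substr(x, 5, 6)),
    stringsAsFactors = FALSE
  )
}

decoded <- decode_birth_number(c("706213", "450204", "056009"))
decoded
```

The result can be attached to the original data with cbind(df["client_id"], decode_birth_number(df$birth_number)) if that layout is preferred.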

Using the same logic as @MKR; I just prefer the tidyverse approach.
library(tidyverse)
df %>%
  # Convert the month digits to a number before comparing with 12
  mutate(Gender = ifelse(as.numeric(substr(birth_number, 3, 4)) > 12,
                         "Female", "Male"),
         birth_number = ifelse(Gender == "Female",
                               birth_number - 5000,
                               birth_number))
client_id birth_number district_id Gender
1 1 701213 18 Female
2 2 450204 1 Male
3 3 401009 1 Female
4 4 561201 5 Male
5 5 600703 5 Female
6 6 190922 12 Male
7 7 290125 15 Male
8 8 380221 51 Female
9 9 351016 60 Male
10 10 430501 57 Male
11 11 500822 57 Female
12 12 810220 40 Male
13 13 740529 54 Female
14 14 420622 76 Female
15 15 180828 21 Female
16 16 190225 21 Male
17 17 341013 76 Male
18 18 310405 76 Female
19 19 421228 47 Male
20 20 790104 46 Male
21 21 521029 12 Female
22 22 691011 1 Female
23 23 730529 1 Male
24 24 390729 43 Female
25 25 390423 21 Female
26 26 690420 74 Female
27 27 660326 54 Female
28 28 450929 1 Male
29 29 510911 30 Female
30 30 571009 74 Female
31 31 620209 68 Male
32 32 800728 52 Male
33 33 481204 73 Female

Related

Clustered barplot using the mean

I am new to R, and I am trying to figure out how to create a clustered bar chart of the mean interest in a film, separated by gender.
Here is my dataframe:
i gender film interest
1 male 1 22
2 male 1 13
3 male 1 16
4 male 1 10
5 male 1 18
6 male 1 24
7 male 1 13
8 male 1 14
9 male 1 19
10 male 1 23
11 male 2 37
12 male 2 20
13 male 2 16
14 male 2 28
15 male 2 27
16 male 2 18
17 male 2 32
18 male 2 24
19 male 2 21
20 male 2 35
21 female 1 3
22 female 1 15
23 female 1 5
24 female 1 16
25 female 1 13
26 female 1 20
27 female 1 11
28 female 1 19
29 female 1 15
30 female 1 7
31 female 2 30
32 female 2 25
33 female 2 31
34 female 2 36
35 female 2 23
36 female 2 14
37 female 2 21
38 female 2 31
39 female 2 22
40 female 2 14
Here is the script that is used:
movies<-read.csv(file.choose())
t.test(interest~film, data = movies)
names(movies)
str(movies)
movies$ï..gender = factor(movies$ï..gender, levels=c(1,2),
labels=c("male","female"))
with(movies, table(film,interest))
summary(movies)
movie.types<-split(movies$interest, movies$film)
boxplot(movie.types)
movie.mean<-sapply(movie.types,mean)
barplot(movie.mean, col = "red", main = "Mean Interest by Film",
ylim=c(0,30), names.arg = c("Bridget Jones Diary","Memento"))
And here is the barplot I made, which I need to turn into a clustered plot split by gender:
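
One possible way to get a clustered plot, sketched here under the assumption that the data frame has plain columns named gender, film and interest (the posted script uses a differently encoded gender column name), is to compute the mean for each gender/film combination with tapply() and pass the resulting matrix to barplot() with beside = TRUE. The data frame below is rebuilt by hand from the values listed in the question.

```r
# Rebuild the posted data; the column names are an assumption
movies <- data.frame(
  gender = rep(c("male", "female"), each = 20),
  film = rep(rep(c(1, 2), each = 10), times = 2),
  interest = c(22, 13, 16, 10, 18, 24, 13, 14, 19, 23,
               37, 20, 16, 28, 27, 18, 32, 24, 21, 35,
               3, 15, 5, 16, 13, 20, 11, 19, 15, 7,
               30, 25, 31, 36, 23, 14, 21, 31, 22, 14)
)

# Matrix of means: one row per gender, one column per film
means <- tapply(movies$interest, list(movies$gender, movies$film), mean)

# beside = TRUE draws the rows (genders) side by side within each film
barplot(means, beside = TRUE, col = c("orange", "steelblue"),
        legend.text = rownames(means), ylim = c(0, 30),
        names.arg = c("Bridget Jones Diary", "Memento"),
        main = "Mean Interest by Film and Gender")
```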

R: Creating a vector with certain values from another vector

So I have a csv file with column headers ID, Score, and Age.
So in R I have,
data <- read.csv(file.choose(), header=T)
attach(data)
I would like to create two new vectors containing the scores of people whose age is below 70 and of people whose age is 70 or above. I thought there was a nice, quick way to do this, but I can't find it anywhere. Thanks for any help.
Example of what data looks like
ID, Score, Age
1, 20, 77
2, 32, 65
.... etc
I am trying to make two vectors: one with the scores of everyone younger than 70 and one with the scores of everyone older than 70.
Assuming data looks like this:
Score Age
1 1 29
2 5 39
3 8 40
4 3 89
5 5 31
6 6 23
7 7 75
8 3 3
9 2 23
10 6 54
.. . ..
you can use
df_old <- data[data$Age >= 70,]
df_young <- data[data$Age < 70,]
which gives you
> df_old
Score Age
4 3 89
7 7 75
11 7 97
13 3 101
16 5 89
18 5 89
19 4 96
20 3 97
21 8 75
and
> df_young
Score Age
1 1 29
2 5 39
3 8 40
5 5 31
6 6 23
8 3 3
9 2 23
10 6 54
12 4 23
14 2 23
15 4 45
17 7 53
PS: if you only want the scores and not the age, you could use this:
df_old <- data[data$Age >= 70, "Score"]
df_young <- data[data$Age < 70, "Score"]
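
As an alternative sketch, base R's split() can produce both score vectors in one call. The small data frame below only reuses the first ten rows shown above, and the group labels "old" and "young" are names chosen for illustration.

```r
# First ten rows of the example data shown above
data <- data.frame(
  Score = c(1, 5, 8, 3, 5, 6, 7, 3, 2, 6),
  Age = c(29, 39, 40, 89, 31, 23, 75, 3, 23, 54)
)

# split() returns a named list with one vector of scores per group
groups <- split(data$Score, ifelse(data$Age >= 70, "old", "young"))
groups$old
groups$young
```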

how to use gather() in a for loop in R

I have a big data set of 72 columns and I want to gather each 3 of columns into a new column and thus getting 24 columns in the end.
I tried using the gather() function, but it only works once, i.e., it gathers only 3 columns at a time.
Can I use this function in a for loop?
I tried this:
j=0
k=1
l=2
for (i in 2:24){
neww <- gather(columns, "KEy", "Proteins H/L", c((i+j), (i+k), (i+l)), na.rm = TRUE)
j=j+2;
k=k+2;
l=l+2;
}
I need to gather first 3 columns in a single column and then next 3 in another column and so on.
You can use the to_long function from the sjmisc package for this purpose. This function is a convenience wrapper that loops over multiple gather() calls.
# create sample
mydat <- data.frame(age = c(20, 30, 40),
sex = c("Female", "Male", "Male"),
score_t1 = c(30, 35, 32),
score_t2 = c(33, 34, 37),
score_t3 = c(36, 35, 38),
speed_t1 = c(2, 3, 1),
speed_t2 = c(3, 4, 5),
speed_t3 = c(1, 8, 6))
# check tidyr: score is gathered, but speed is not
tidyr::gather(mydat, "time", "score", score_t1, score_t2, score_t3)
> age sex speed_t1 speed_t2 speed_t3 time score
> 1 20 Female 2 3 1 score_t1 30
> 2 30 Male 3 4 8 score_t1 35
> 3 40 Male 1 5 6 score_t1 32
> 4 20 Female 2 3 1 score_t2 33
> 5 30 Male 3 4 8 score_t2 34
> 6 40 Male 1 5 6 score_t2 37
> 7 20 Female 2 3 1 score_t3 36
> 8 30 Male 3 4 8 score_t3 35
> 9 40 Male 1 5 6 score_t3 38
# gather multiple columns: both score and speed are gathered.
library(sjmisc)
to_long(mydat, "time", c("score", "speed"),
        c("score_t1", "score_t2", "score_t3"),
        c("speed_t1", "speed_t2", "speed_t3"))
> age sex time score speed
> (dbl) (fctr) (chr) (dbl) (dbl)
> 1 20 Female score_t1 30 2
> 2 30 Male score_t1 35 3
> 3 40 Male score_t1 32 1
> 4 20 Female score_t2 33 3
> 5 30 Male score_t2 34 4
> 6 40 Male score_t2 37 5
> 7 20 Female score_t3 36 1
> 8 30 Male score_t3 35 8
> 9 40 Male score_t3 38 6
In this case, the time vector (indicating the gathered groups) just takes one of the multiple gathered column name. If this is too confusing, you can also just number the ID variable:
to_long(mydat, "time", c("score", "speed"),
c("score_t1", "score_t2", "score_t3"),
c("speed_t1", "speed_t2", "speed_t3"),
recode.key = TRUE)
> age sex time score speed
> (dbl) (fctr) (dbl) (dbl) (dbl)
> 1 20 Female 1 30 2
> 2 30 Male 1 35 3
> 3 40 Male 1 32 1
> 4 20 Female 2 33 3
> 5 30 Male 2 34 4
> 6 40 Male 2 37 5
> 7 20 Female 3 36 1
> 8 30 Male 3 35 8
> 9 40 Male 3 38 6
See ?to_long for more examples.
I'm not sure, but I think I read something on GitHub saying that "multiple column gathering" is also planned for tidyr at some point.
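
For reference, tidyr 1.0.0 and later provide pivot_longer(), which can gather several groups of columns in one call. The following is a minimal sketch on the same sample data, assuming the column names follow the measure_time pattern used above; the special ".value" entry in names_to tells pivot_longer() to create one output column per measure.

```r
library(tidyr)

# Same sample data as above
mydat <- data.frame(age = c(20, 30, 40),
                    sex = c("Female", "Male", "Male"),
                    score_t1 = c(30, 35, 32),
                    score_t2 = c(33, 34, 37),
                    score_t3 = c(36, 35, 38),
                    speed_t1 = c(2, 3, 1),
                    speed_t2 = c(3, 4, 5),
                    speed_t3 = c(1, 8, 6))

# Split each column name at "_": the first part becomes the name of a
# value column (score, speed), the second part goes into "time"
long <- pivot_longer(mydat,
                     cols = -c(age, sex),
                     names_to = c(".value", "time"),
                     names_sep = "_")
long
```

For the 72-column case in the question, the same call would apply as long as the column names share a consistent pattern; otherwise the columns would need renaming first.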

Need help in data cleaning using R

I need some help in data cleaning using R.
my CSV file looks as follows.
"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
...................................
5 Female 22 movies NA
5 Female 22 music NA
5 Female 22 travel NA
5 Female 22 cloths NA
5 Female 22 grocery NA
5 Female 22 books NA
5 Female 22 rent 1
5 Female 22 fuel NA
5 Female 22 utility NA
5 Female 22 online-shopping 2
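So far, my efforts are as follows.
library(reshape2)
library(sqldf)
mini <- read.csv("~/MS/coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
# melt the cleaned data frame (the original line referred to an undefined df_clean)
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')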
Now I want to know what is the best way to fill all missing categories and also how do I rank the categories as per their position in CSV file.
For more clarity please refer the above CSV file and expected output.
text1='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,'
d1 <- read.table(text=text1, sep=",", head=T, as.is=T)
library(reshape2)
d2 <- melt(d1, id.vars=c("id","gender","age"))
names(d2)[5] <- "category"
names(d2)[4] <- "rank"
d2$rank <- gsub("category", "", d2$rank)
head(d2)
# id gender age rank category
# 1 1 Male 22 1 movies
# 2 2 Male 28 1 travel
# 3 3 Female 27 1 rent
# 4 4 Female 22 1 rent
# 5 5 Female 22 1 rent
# 6 1 Male 22 2 music
We can use gather from tidyr
library(tidyr)
d2 <- gather(d1, rank, category, -(1:3)) %>%
  # capture all digits after "category"; a greedy '.*(\\d+)' would turn "category10" into "0"
  extract(rank, into='rank', 'category(\\d+)')
head(d2)
# id gender age rank category
#1 1 Male 22 1 movies
#2 2 Male 28 1 travel
#3 3 Female 27 1 rent
#4 4 Female 22 1 rent
#5 5 Female 22 1 rent
#6 1 Male 22 2 music
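
Neither snippet above fills in the categories that a person did not list. A minimal base R sketch of that step follows, assuming the rank is simply the position of the category among the category1 to category10 columns. The data is used as posted, so the spelling "utiliy" from the CSV is kept.

```r
text1 <- '"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,'

# Read empty fields as NA
d1 <- read.csv(text = text1, stringsAsFactors = FALSE, na.strings = "")
cat_cols <- grep("^category", names(d1), value = TRUE)

# Every category that appears anywhere in the file
all_cats <- unique(na.omit(unlist(d1[cat_cols])))

# For each person, list every category with its position (rank), or NA
out <- do.call(rbind, lapply(seq_len(nrow(d1)), function(i) {
  ranked <- unlist(d1[i, cat_cols])
  data.frame(id = d1$id[i], gender = d1$gender[i], age = d1$age[i],
             category = all_cats,
             rank = match(all_cats, ranked),
             stringsAsFactors = FALSE)
}))
head(out, 10)
```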

Retaining variables in dcast in R

I am using the dcast function in R to turn a long-format dataset into a wide-format dataset. I have an ID number, a categorical variable (CAT), and a continuous variable (AMT). However, I also have a variable SEX, which is the same for all rows of a given ID number. This code works to create the wide-format dataset, but I lose SEX. How can I retain it?
PC1cast <- dcast(PC1, ID~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
If I add SEX to the ID~CAT line, it gives me SEX-CAT combinations. I want SEX to just be one value for each row.
Sample data:
ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
For that, you need to add SEX to the ID side of your formula:
dcast(PC1, ID + SEX~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
# results in:
ID SEX A B C
1 1 Female 46 22 31
2 2 Male 17 25 44
Things on the left hand side of the formula are kept as-is, things on the right-hand side are cast.
I added some extra data lines to clarify some parts of this. But the gist is that you just need to put SEX on the left hand side (i.e., of ~):
PC2 <- read.table(text="ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
3 A 47 Female
3 B 27 Female
3 C 37 Female
4 A 17 Male
4 A 17 Male
4 B 22 Male
4 B NA Male
4 C 44 Male", header=T)
library(reshape2)
PC1cast2 <- dcast(PC2, ID+SEX~CAT, value.var='AMT', fun.aggregate=sum,
na.rm=TRUE)
PC1cast2
# ID SEX A B C
# 1 1 Female 46 22 31
# 2 2 Male 17 25 44
# 3 3 Female 47 27 37
# 4 4 Male 34 22 44
In your example data, you only have one instance of each combination and no NAs, so the fun.aggregate=sum, na.rm=TRUE doesn't do anything. When some are duplicated (e.g., there are two 4 As and two 4 Bs), the values are summed, but the NAs are dropped first. Make sure that is what you want.

Resources