Retaining variables in dcast in R

I am using the dcast function in R to turn a long-format dataset into a wide-format dataset. I have an ID number, a categorical variable (CAT), and a continuous variable (AMT). However, I also have a variable SEX, which is the same for all rows of a given ID number. This code works to create the wide-format dataset, but I lose SEX. How can I retain it?
PC1cast <- dcast(PC1, ID~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
If I add SEX to the ID~CAT line, it gives me SEX-CAT combinations. I want SEX to just be one value for each row.
Sample data:
ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male

For that, you need to add SEX to the ID side of your formula:
dcast(PC1, ID + SEX~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
# results in:
ID SEX A B C
1 1 Female 46 22 31
2 2 Male 17 25 44
Variables on the left-hand side of the formula are kept as identifier columns, one row per unique combination; the variable on the right-hand side is spread into new columns.
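Putting SEX on the left-hand side only keeps one row per ID if SEX really is constant within each ID. A minimal base R sketch for checking that before casting, using the sample data from the question (the name sex_per_id is just illustrative):

```r
# Sample data from the question
PC1 <- data.frame(
  ID  = c(1, 1, 1, 2, 2, 2),
  CAT = c("A", "B", "C", "A", "B", "C"),
  AMT = c(46, 22, 31, 17, 25, 44),
  SEX = c("Female", "Female", "Female", "Male", "Male", "Male")
)

# Number of distinct SEX values per ID; every entry should be 1
sex_per_id <- tapply(PC1$SEX, PC1$ID, function(x) length(unique(x)))

# TRUE when adding SEX to the left-hand side will not create extra rows
all(sex_per_id == 1)
```

If this returns FALSE, ID + SEX ~ CAT would give more than one row for some IDs.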

I added some extra data lines to clarify some parts of this. But the gist is that you just need to put SEX on the left hand side (i.e., of ~):
PC2 <- read.table(text="ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
3 A 47 Female
3 B 27 Female
3 C 37 Female
4 A 17 Male
4 A 17 Male
4 B 22 Male
4 B NA Male
4 C 44 Male", header=T)
library(reshape2)
PC1cast2 <- dcast(PC2, ID+SEX~CAT, value.var='AMT', fun.aggregate=sum,
na.rm=TRUE)
PC1cast2
# ID SEX A B C
# 1 1 Female 46 22 31
# 2 2 Male 17 25 44
# 3 3 Female 47 27 37
# 4 4 Male 34 22 44
In your example data, you only have one instance of each combination and no NAs, so the fun.aggregate=sum, na.rm=TRUE doesn't do anything. When some are duplicated (e.g., there are two 4 As and two 4 Bs), the values are summed, but the NAs are dropped first. Make sure that is what you want.
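One way to see in advance whether fun.aggregate will actually combine anything is to count the rows in each ID/CAT cell and the missing values. This is only a sketch on a cut-down copy of the data above (IDs 1 and 4), in base R:

```r
# Cut-down version of PC2: ID 1 has no duplicates, ID 4 has two A rows and two B rows
PC2_small <- data.frame(
  ID  = c(1, 1, 1, 4, 4, 4, 4, 4),
  CAT = c("A", "B", "C", "A", "A", "B", "B", "C"),
  AMT = c(46, 22, 31, 17, 17, 22, NA, 44)
)

# Rows per ID/CAT cell; any count above 1 will be summed by dcast
cell_counts <- table(PC2_small$ID, PC2_small$CAT)
cell_counts

# TRUE if at least one cell has more than one row
any(cell_counts > 1)

# Number of AMT values that na.rm = TRUE would silently drop
sum(is.na(PC2_small$AMT))
```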


Converting categorical columns to numerical values

I want to convert categorical columns in the dataset to be numerical values (1,2,3, etc).
How can I do this in R?
## Load vcd package
library(vcd)
## Load Arthritis dataset (data frame)
data(Arthritis)
Arthritis <- Arthritis[,2:5]
head(Arthritis)
Treatment Sex Age Improved
1 Treated Male 27 Some
2 Treated Male 29 None
3 Treated Male 30 None
4 Treated Male 32 Marked
5 Treated Male 46 Marked
6 Treated Male 58 Marked
Resulting dataset would look like this:
Treatment Sex Age Improved
[1,] 1 1 27 1
[2,] 1 1 29 0
[3,] 1 1 30 0
[4,] 1 1 32 2
[5,] 1 1 46 2
[6,] 1 1 58 2
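If the number of variables is large, you can automate the conversion:
Arthritis2 <- sapply(Arthritis, unclass)
Edit: to get zero-based codes, subtract 1. Note that this subtracts 1 from every column of the matrix, including the numeric Age column:
Arthritis2 <- sapply(Arthritis, unclass) - 1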
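To leave numeric columns such as Age untouched, the conversion can be restricted to factor columns. A minimal sketch, assuming a small stand-in data frame with the same column types as Arthritis (the rows are made up for illustration, and is_fac and df_num are illustrative names):

```r
# Stand-in for a few rows of Arthritis (made-up rows, same column types)
df <- data.frame(
  Treatment = factor(c("Treated", "Placebo", "Treated"),
                     levels = c("Placebo", "Treated")),
  Sex       = factor(c("Male", "Female", "Male"),
                     levels = c("Female", "Male")),
  Age       = c(27, 29, 30),
  Improved  = factor(c("Some", "None", "Marked"),
                     levels = c("None", "Some", "Marked"), ordered = TRUE)
)

# Flag the factor columns only
is_fac <- sapply(df, is.factor)

# Replace each factor by its zero-based integer code; other columns stay as they are
df_num <- df
df_num[is_fac] <- lapply(df[is_fac], function(x) as.integer(x) - 1L)
df_num
```

The codes follow the order of the factor levels, so it is worth checking levels() first if a particular numbering is expected.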
Solution using named list and match function:
scores <- list("0" = "None", "1" = "Some", "2" = "Marked" )
Arthritis$Scores <- names(scores)[match(Arthritis$Improved, scores)]
head(Arthritis)
Sex Age Improved Scores
1 Male 27 Some 1
2 Male 29 None 0
3 Male 30 None 0
4 Male 32 Marked 2
5 Male 46 Marked 2
6 Male 58 Marked 2
If you don't want to keep the Improved column, simply do this instead:
Arthritis$Improved <- names(scores)[match(Arthritis$Improved, scores)]
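Note that names(scores)[...] returns character values ("0", "1", "2"), not numbers. If numeric scores are needed, a named numeric vector can serve as the lookup table instead. A small sketch, with score_lookup and the short improved vector made up for illustration:

```r
# A few values standing in for Arthritis$Improved
improved <- factor(c("Some", "None", "None", "Marked"),
                   levels = c("None", "Some", "Marked"))

# Named numeric vector: names are the categories, values are the scores
score_lookup <- c(None = 0, Some = 1, Marked = 2)

# Index by the category label; unname() drops the labels from the result
scores_num <- unname(score_lookup[as.character(improved)])
scores_num
```

Indexing by as.character(improved) matters here: indexing by the factor itself would use its integer codes rather than its labels.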

summarise count of options of variables

I have a large data frame with many variables, and I want a count of every value ("option") that occurs in each variable. An example data frame is given below.
I also have a second, similar data frame. Before merging the two, I want to check whether their column names are the same and, if not, get the names of the columns that differ.
The uniqueid and name columns should be excluded.
The objective is to use the counts to find misspelled words or words that contain accents.
df11 <- data.frame(uniqueid=c(9143,2357,4339,8927,9149,4285,2683,8217,3702,7857,3255,4262,8501,7111,2681,6970),
name=c("xly,mnn","xab,Lan","mhy,mun","vgtu,mmc","ftu,sdh","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","sghu,njui","sgyu,hytb","vdti,kula","mftyu,huta","mhuk,ghul","cday,bhsue","ajtu,nudj"),
city=c("A","B","C","C","D","F","S","C","E","S","A","B","W","S","C","A"),
age=c(22,45,67,34,43,22,34,43,34,52,37,44,41,40,39,30),
country=c("usa","USA","AUSTRALI","AUSTRALIA","uk","UK","SPAIN","SPAIN","CHINA","CHINA","BRAZIL","BRASIL","CHILE","USA","CANADA","UK"),
language=c("ENGLISH(US)","ENGLISH(US)","ENGLISH","ENGLISH","ENGLISH(UK)","ENGLISH(UK)","SPANISH","SPANISH","CHINESE","CHINESE","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH(US)"),
gender=c("MALE","FEMALE","male","m","f","MALE","FEMALE","f","FEMALE","MALE","MALE","MALE","FEMALE","FEMALE","MALE","MALE"))
The output should be a summary of counts for each variable and its options, similar to a pivot table (for example, for city).
It should cover all available columns in the data frame and give the count of every option found in each column.
I am quite confused with what you call "option" but here is something to start with using only base R functions.
Note: it only refers to the 1st part of the question "I want the count of all variables and their options".
res <- do.call(rbind, lapply(df11[, 3:ncol(df11)], function(option) as.data.frame(table(option)))) # apply table() to the selected columns and gather the output in a dataframe
res$variable <- gsub("[.](.*)","", rownames(res)) # recover the name of the variable from the row names with a regular expression
rownames(res) <- NULL # just to clean
res <- res[, c(3,1,2)] # ordering columns
res <- res[order(-res$Freq), ] # sorting by decreasing Freq
The output:
> res
variable option Freq
34 language ENGLISH 7
42 gender MALE 7
39 gender FEMALE 5
3 city C 4
1 city A 3
7 city S 3
11 age 34 3
36 language ENGLISH(US) 3
2 city B 2
9 age 22 2
16 age 43 2
27 country CHINA 2
28 country SPAIN 2
30 country UK 2
32 country USA 2
33 language CHINESE 2
35 language ENGLISH(UK) 2
37 language SPANISH 2
38 gender f 2
4 city D 1
5 city E 1
6 city F 1
8 city W 1
10 age 30 1
12 age 37 1
13 age 39 1
14 age 40 1
15 age 41 1
17 age 44 1
18 age 45 1
19 age 52 1
20 age 67 1
21 country AUSTRALI 1
22 country AUSTRALIA 1
23 country BRASIL 1
24 country BRAZIL 1
25 country CANADA 1
26 country CHILE 1
29 country uk 1
31 country usa 1
40 gender m 1
41 gender male 1
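For the second part of the question (comparing the column names of two data frames before merging), a minimal base R sketch using setdiff could look like this. The two small data frames df_a and df_b are hypothetical stand-ins, since the second data frame is not shown in the question:

```r
# Hypothetical stand-ins for the two data frames to be merged
df_a <- data.frame(uniqueid = 1:2, name = c("x", "y"),
                   city = c("A", "B"), country = c("usa", "UK"))
df_b <- data.frame(uniqueid = 3:4, name = c("z", "w"),
                   City = c("C", "D"), country = c("SPAIN", "CHINA"))

# Column names to compare, excluding uniqueid and name
cols_a <- setdiff(names(df_a), c("uniqueid", "name"))
cols_b <- setdiff(names(df_b), c("uniqueid", "name"))

# TRUE only if both data frames have exactly the same column names
identical(sort(cols_a), sort(cols_b))

# Names present in one data frame but not in the other
only_in_a <- setdiff(cols_a, cols_b)
only_in_b <- setdiff(cols_b, cols_a)
only_in_a
only_in_b
```

The comparison is case-sensitive, so city and City are reported as different columns.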
You could also calculate the number of unique values of each column per group (here, per city) with aggregate.
res <- aggregate(. ~ city, df11, function(x) length(unique(x)))
res
# city uniqueid name age country language gender
# 1 A 3 3 3 3 2 1
# 2 B 2 2 2 2 2 2
# 3 C 4 4 4 4 2 4
# 4 D 1 1 1 1 1 1
# 5 E 1 1 1 1 1 1
# 6 F 1 1 1 1 1 1
# 7 S 3 3 3 3 3 2
# 8 W 1 1 1 1 1 1
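Since the stated goal is to spot misspellings, it may help to normalise case before counting, so that variants such as usa and USA are merged and only genuinely different spellings remain. This is a heuristic sketch on the country column from the question, not a complete spell check:

```r
# The country column from df11
country <- c("usa", "USA", "AUSTRALI", "AUSTRALIA", "uk", "UK", "SPAIN", "SPAIN",
             "CHINA", "CHINA", "BRAZIL", "BRASIL", "CHILE", "USA", "CANADA", "UK")

# Count values after trimming whitespace and converting to upper case
counts <- table(toupper(trimws(country)))
counts

# Values that occur only once are candidates for a manual check
rare <- names(counts)[counts == 1]
rare
```

Values that occur once are not necessarily wrong (CANADA and CHILE are fine here), so the list still needs to be reviewed by eye.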

How I split particular column data from the dataset in R

I have the client dataset below, which includes client_id, birth_number and district_id. The birth number encodes the date of birth, with a twist:
The value is in the form YYMMDD for men.
The value is in the form YY(+50MM)DD for women, i.e. the month is increased by 50.
I would like help writing an R script that splits the birth number and applies a condition: if MM > 12, the row belongs to a woman and 50 must be subtracted from the month value; otherwise the row belongs to a man and the birth number stays as it is.
"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73
An option is to use substring along with ifelse as:
# Get the 3rd and 4th characters of "birth_number". If the value is > 12,
# the row is for a female; otherwise it is for a male.
df$Gender <- ifelse(as.numeric(substring(df$birth_number,3,4)) > 12, "Female", "Male")
# Now correct "birth_number": subtract 50 from the middle two digits.
# Updated based on feedback from @RuiBarradas to use df$Gender == "Female"
# to subtract 50 from the month number
df$birth_number <- ifelse(df$Gender == "Female",
as.character(as.numeric(df$birth_number)-5000), df$birth_number)
df
# client_id birth_number district_id Gender
# 1 1 701213 18 Female
# 2 2 450204 1 Male
# 3 3 401009 1 Female
# 4 4 561201 5 Male
# 5 5 600703 5 Female
# 6 6 190922 12 Male
# so on
#
Data:
df <- read.table(text =
'"client_id";"birth_number";"district_id"
1;"706213";18
2;"450204";1
3;"406009";1
4;"561201";5
5;"605703";5
6;"190922";12
7;"290125";15
8;"385221";51
9;"351016";60
10;"430501";57
11;"505822";57
12;"810220";40
13;"745529";54
14;"425622";76
15;"185828";21
16;"190225";21
17;"341013";76
18;"315405";76
19;"421228";47
20;"790104";46
21;"526029";12
22;"696011";1
23;"730529";1
24;"395729";43
25;"395423";21
26;"695420";74
27;"665326";54
28;"450929";1
29;"515911";30
30;"576009";74
31;"620209";68
32;"800728";52
33;"486204";73',
header = TRUE, stringsAsFactors = FALSE, sep = ";")
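One caveat: read.table converts the quoted birth numbers to integers, so a birth year from 00 to 09 would lose its leading zero and substring(…, 3, 4) would then pick the wrong digits. None of the sample rows is affected, but a sketch that pads to six digits first could look like this (the third value is a hypothetical "056201" that was read as the number 56201):

```r
# Two values from the data plus one hypothetical value with a lost leading zero
birth_number <- c(706213, 450204, 56201)

# Pad back to six characters so the month is always at positions 3-4
bn <- sprintf("%06d", as.integer(birth_number))

# Month part as a number; values above 12 mark women
mm <- as.integer(substr(bn, 3, 4))
gender <- ifelse(mm > 12, "Female", "Male")

# Remove the +50 offset for women and rebuild the YYMMDD string
mm_fixed <- ifelse(mm > 12, mm - 50L, mm)
decoded <- paste0(substr(bn, 1, 2), sprintf("%02d", mm_fixed), substr(bn, 5, 6))

data.frame(birth_number = bn, gender, decoded)
```

Keeping the result as a character string preserves the leading zero in the decoded value as well.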
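This uses the same commands as @MKR; I just prefer the tidyverse approach. The month substring is converted with as.numeric so that the comparison with 12 is numeric rather than string-based.
library(tidyverse)
df %>%
  mutate(Gender = ifelse(as.numeric(substr(birth_number, 3, 4)) > 12,
                         "Female", "Male"),
         birth_number = ifelse(Gender == "Female",
                               birth_number - 5000,
                               birth_number))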
client_id birth_number district_id Gender
1 1 701213 18 Female
2 2 450204 1 Male
3 3 401009 1 Female
4 4 561201 5 Male
5 5 600703 5 Female
6 6 190922 12 Male
7 7 290125 15 Male
8 8 380221 51 Female
9 9 351016 60 Male
10 10 430501 57 Male
11 11 500822 57 Female
12 12 810220 40 Male
13 13 740529 54 Female
14 14 420622 76 Female
15 15 180828 21 Female
16 16 190225 21 Male
17 17 341013 76 Male
18 18 310405 76 Female
19 19 421228 47 Male
20 20 790104 46 Male
21 21 521029 12 Female
22 22 691011 1 Female
23 23 730529 1 Male
24 24 390729 43 Female
25 25 390423 21 Female
26 26 690420 74 Female
27 27 660326 54 Female
28 28 450929 1 Male
29 29 510911 30 Female
30 30 571009 74 Female
31 31 620209 68 Male
32 32 800728 52 Male
33 33 481204 73 Female

Need help in data cleaning using R

I need some help in data cleaning using R.
my CSV file looks as follows.
"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
...................................
5 Female 22 movies NA
5 Female 22 music NA
5 Female 22 travel NA
5 Female 22 cloths NA
5 Female 22 grocery NA
5 Female 22 books NA
5 Female 22 rent 1
5 Female 22 fuel NA
5 Female 22 utility NA
5 Female 22 online-shopping 2
So far, my efforts are as follows.
library(reshape2)
library(sqldf)
mini <- read.csv("~/MS/coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know the best way to fill in all the missing categories, and how to rank the categories by their position in the CSV file.
For more clarity please refer the above CSV file and expected output.
text1='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,'
d1 <- read.table(text=text1, sep=",", head=T, as.is=T)
library(reshape2)
d2 <- melt(d1, id.vars=c("id","gender","age"))
names(d2)[5] <- "category"
names(d2)[4] <- "rank"
d2$rank <- gsub("category", "", d2$rank)
head(d2)
# id gender age rank category
# 1 1 Male 22 1 movies
# 2 2 Male 28 1 travel
# 3 3 Female 27 1 rent
# 4 4 Female 22 1 rent
# 5 5 Female 22 1 rent
# 6 1 Male 22 2 music
We can use gather from tidyr
library(tidyr)
d2 <- gather(d1, rank, category, -(1:3)) %>%
  # match the "category" prefix explicitly so that category10 yields 10, not 0
  extract(rank, into='rank', 'category(\\d+)')
head(d2)
# id gender age rank category
#1 1 Male 22 1 movies
#2 2 Male 28 1 travel
#3 3 Female 27 1 rent
#4 4 Female 22 1 rent
#5 5 Female 22 1 rent
#6 1 Male 22 2 music
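Neither snippet above adds rows for the categories a user did not list, which the expected output asks for (rank NA). One possible base R approach is to build every id/category combination with expand.grid and left-join the ranked data onto it. This is a sketch on a tiny long-format sample; gender and age are left out for brevity and could be merged back by id:

```r
# Tiny long-format sample: user 1 ranked two categories, user 2 ranked one
long <- data.frame(
  id       = c(1, 1, 2),
  category = c("movies", "music", "travel"),
  rank     = c(1, 2, 1)
)

# Every id paired with every category that appears anywhere in the data
grid <- expand.grid(id = unique(long$id),
                    category = unique(long$category),
                    stringsAsFactors = FALSE)

# Left join: combinations a user did not list get rank NA
full <- merge(grid, long, by = c("id", "category"), all.x = TRUE)

# Sort by id, then rank (NA ranks are placed last)
full <- full[order(full$id, full$rank), ]
full
```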

Help needed in Data cleaning using R

"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
So far, my efforts are as follows.
library(reshape2)
library(sqldf)
mini <- read.csv("coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know the best way to fill in all the missing categories for each user.
Any help in this regard is appreciated.
library(reshape2)
library(tidyr)
mdf <- melt(df, c("id","gender","age"))
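# nesting() keeps only the id/gender/age combinations that occur in the data
# (current tidyr syntax; older releases accepted c(id, gender, age) here)
complete(na.omit(mdf), nesting(id, gender, age), value)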
# Source: local data frame [50 x 5]
#
# id gender age value variable
# (int) (fctr) (int) (chr) (fctr)
# 1 1 Male 22 books NA
# 2 1 Male 22 cloths category4
# 3 1 Male 22 fuel NA
# 4 1 Male 22 grocery category5
# 5 1 Male 22 movies category1
# 6 1 Male 22 music category2
# 7 1 Male 22 online-shopping NA
# 8 1 Male 22 rent NA
# 9 1 Male 22 travel category3
# 10 1 Male 22 utiliy NA
# .. ... ... ... ... ...
Explanation
We can first melt the data.frame specifying the id columns. Next, the new release of tidyr has a helper function complete to expand columns as your output describes.
Data
df <- read.csv(text='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,')
is.na(df) <- is.na(df) | df== ""
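In the complete() output, the variable column holds labels such as category4 rather than a numeric rank. A minimal sketch for turning those labels into ranks, assuming the labels always have the form "category" followed by a number (the short variable vector is made up for illustration):

```r
# Labels as they appear in the variable column, including an NA for an unlisted category
variable <- factor(c("category1", "category4", NA, "category10"))

# Strip the fixed prefix and convert the remainder to an integer rank
rank <- as.integer(sub("category", "", as.character(variable), fixed = TRUE))
rank
```

Unlisted categories stay NA, which matches the rank column in the expected output.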
Consider using the base function reshape as this is the regular example of wide to long dataset reshaping/pivoting:
reshapedf <- reshape(df, varying = c(4:13),
v.names = c("category"),
timevar=c("rank"),
times = c(1:10),
idvar = c("id", "gender", "age"),
new.row.names = 1:1000,
direction = "long")
# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)
OUTPUT
id gender age rank category
1 1 Male 22 1 movies
2 1 Male 22 2 music
3 1 Male 22 3 travel
4 1 Male 22 4 cloths
5 1 Male 22 5 grocery
6 1 Male 22 6 NA
7 1 Male 22 7 NA
8 1 Male 22 8 NA
9 1 Male 22 9 NA
10 1 Male 22 10 NA
...
