I need some help with data cleaning in R.
My CSV file looks as follows.
"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
...................................
5 Female 22 movies NA
5 Female 22 music NA
5 Female 22 travel NA
5 Female 22 cloths NA
5 Female 22 grocery NA
5 Female 22 books NA
5 Female 22 rent 1
5 Female 22 fuel NA
5 Female 22 utility NA
5 Female 22 online-shopping 2
So far, my efforts are as follows.
library(reshape2)
library(sqldf)
mini <- read.csv("~/MS/coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know the best way to fill in all the missing categories, and how to rank the categories by their position in the CSV file.
For more clarity, please refer to the CSV file and expected output above.
text1='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,'
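d1 <- read.table(text=text1, sep=",", header=TRUE, as.is=TRUE)
library(reshape2)
# melt to long format: one row per (id, categoryN) pair
d2 <- melt(d1, id.vars=c("id","gender","age"))
names(d2)[5] <- "category"
names(d2)[4] <- "rank"
# keep only the position number from "category1", "category2", ...
d2$rank <- gsub("category", "", d2$rank)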
head(d2)
# id gender age rank category
# 1 1 Male 22 1 movies
# 2 2 Male 28 1 travel
# 3 3 Female 27 1 rent
# 4 4 Female 22 1 rent
# 5 5 Female 22 1 rent
# 6 1 Male 22 2 music
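The melted result has one row per ranked category, but the question also asks for a row for every category a person did not pick, with rank NA. A minimal base R sketch of that step follows. It assumes a trimmed copy of the data (three ids, three category columns), and the names `long`, `grid` and `filled` are illustrative only.

```r
# Trimmed copy of the question's data; empty fields are read as NA
d1 <- read.csv(text = 'id,gender,age,category1,category2,category3
1,Male,22,movies,music,travel
2,Male,28,travel,books,
3,Female,27,rent,,', stringsAsFactors = FALSE, na.strings = "")

# Wide to long: one row per (id, rank) pair, then drop the empty slots
long <- reshape(d1, direction = "long",
                varying = 4:6, v.names = "category",
                timevar = "rank", times = 1:3, idvar = "id")
long <- long[!is.na(long$category), ]

# Every person crossed with every category seen anywhere in the data
# (by = NULL makes merge() return the Cartesian product)
grid <- merge(d1[, c("id", "gender", "age")],
              data.frame(category = unique(long$category),
                         stringsAsFactors = FALSE),
              by = NULL)

# Left join: categories a person never listed keep rank = NA
filled <- merge(grid, long, all.x = TRUE)
filled <- filled[order(filled$id, filled$rank), ]
```

Here `filled` has one row per id and category, with ranked categories first and the unranked ones (rank NA) after them within each id.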
We can use gather from tidyr:
library(tidyr)
d2 <- gather(d1, rank, category, -(1:3)) %>%
  extract(rank, into='rank', regex='category(\\d+)')
head(d2)
# id gender age rank category
#1 1 Male 22 1 movies
#2 2 Male 28 1 travel
#3 3 Female 27 1 rent
#4 4 Female 22 1 rent
#5 5 Female 22 1 rent
#6 1 Male 22 2 music
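When pulling the rank out of column names such as category10, the capture pattern matters: a greedy `.*` in front of `(\\d+)` swallows every digit except the last. The short check below uses base R `sub` to show the difference. The vector `cols` is made up for illustration.

```r
cols <- c("category1", "category9", "category10")

# Greedy ".*" consumes "category1", so only the final digit is captured
greedy <- sub(".*(\\d+)", "\\1", cols)
# "1" "9" "0"

# Anchoring on the literal prefix captures the whole number
anchored <- sub("category(\\d+)", "\\1", cols)
# "1" "9" "10"

# Convert to integer so that ranks sort numerically (10 after 9)
ranks <- as.integer(anchored)
```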
I have a big data frame with many variables and their options, and I want a count of every option of every variable, for example in the data frame below.
I also have a second, similar data frame. Before merging the two, I want to check whether their column names are the same and, if not, get the names of the columns that differ.
The columns c(uniqueid, name) should be excluded.
The objective is to use the counts to find misspelled words or words with accents.
df11 <- data.frame(uniqueid=c(9143,2357,4339,8927,9149,4285,2683,8217,3702,7857,3255,4262,8501,7111,2681,6970),
name=c("xly,mnn","xab,Lan","mhy,mun","vgtu,mmc","ftu,sdh","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","sghu,njui","sgyu,hytb","vdti,kula","mftyu,huta","mhuk,ghul","cday,bhsue","ajtu,nudj"),
city=c("A","B","C","C","D","F","S","C","E","S","A","B","W","S","C","A"),
age=c(22,45,67,34,43,22,34,43,34,52,37,44,41,40,39,30),
country=c("usa","USA","AUSTRALI","AUSTRALIA","uk","UK","SPAIN","SPAIN","CHINA","CHINA","BRAZIL","BRASIL","CHILE","USA","CANADA","UK"),
language=c("ENGLISH(US)","ENGLISH(US)","ENGLISH","ENGLISH","ENGLISH(UK)","ENGLISH(UK)","SPANISH","SPANISH","CHINESE","CHINESE","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH(US)"),
gender=c("MALE","FEMALE","male","m","f","MALE","FEMALE","f","FEMALE","MALE","MALE","MALE","FEMALE","FEMALE","MALE","MALE"))
The output should be a summary of counts for each variable and its options, a kind of pivot table, e.g. for city.
It should go through all available columns in the data frame and give a summary of counts for every option found in each column.
I am not sure what you mean by "option", but here is something to start with, using only base R functions.
Note: this only addresses the first part of the question, "I want the count of all variables and their options".
res <- do.call(rbind, lapply(df11[, 3:ncol(df11)], function(option) as.data.frame(table(option)))) # apply table() to the selected columns and gather the output in a dataframe
res$variable <- gsub("[.](.*)","", rownames(res)) # recover the name of the variable from the row names with a regular expression
rownames(res) <- NULL # just to clean
res <- res[, c(3,1,2)] # ordering columns
res <- res[order(-res$Freq), ] # sorting by decreasing Freq
The output:
> res
variable option Freq
34 language ENGLISH 7
42 gender MALE 7
39 gender FEMALE 5
3 city C 4
1 city A 3
7 city S 3
11 age 34 3
36 language ENGLISH(US) 3
2 city B 2
9 age 22 2
16 age 43 2
27 country CHINA 2
28 country SPAIN 2
30 country UK 2
32 country USA 2
33 language CHINESE 2
35 language ENGLISH(UK) 2
37 language SPANISH 2
38 gender f 2
4 city D 1
5 city E 1
6 city F 1
8 city W 1
10 age 30 1
12 age 37 1
13 age 39 1
14 age 40 1
15 age 41 1
17 age 44 1
18 age 45 1
19 age 52 1
20 age 67 1
21 country AUSTRALI 1
22 country AUSTRALIA 1
23 country BRASIL 1
24 country BRAZIL 1
25 country CANADA 1
26 country CHILE 1
29 country uk 1
31 country usa 1
40 gender m 1
41 gender male 1
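For the second part of the question (checking whether two data frames have the same column names before merging), a sketch using base R set functions could look like this. The data frames `df_a` and `df_b` are small made-up stand-ins, not the asker's data.

```r
# Two hypothetical data frames whose column names differ only by case
df_a <- data.frame(uniqueid = 1:2, city = c("A", "B"), country = c("USA", "UK"))
df_b <- data.frame(uniqueid = 3:4, City = c("C", "D"), country = c("SPAIN", "CHILE"))

# TRUE only if both have exactly the same set of names (order ignored)
same_names <- setequal(names(df_a), names(df_b))

# Names that appear in one data frame but not in the other
only_in_a <- setdiff(names(df_a), names(df_b))   # "city"
only_in_b <- setdiff(names(df_b), names(df_a))   # "City"
```

Note that the comparison is case-sensitive, so "city" and "City" are reported as different columns.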
You could calculate the number of unique values per group with aggregate.
res <- aggregate(. ~ city, df11, function(x) length(unique(x)))
res
# city uniqueid name age country language gender
# 1 A 3 3 3 3 2 1
# 2 B 2 2 2 2 2 2
# 3 C 4 4 4 4 2 4
# 4 D 1 1 1 1 1 1
# 5 E 1 1 1 1 1 1
# 6 F 1 1 1 1 1 1
# 7 S 3 3 3 3 3 2
# 8 W 1 1 1 1 1 1
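Since the stated objective is to spot misspelled or inconsistently written values, it may help to normalise case before counting, so that "usa" and "USA" collapse into one option and only genuine variants stand out. A small sketch on a subset of the country values from the question (the names `counts` and `rare` are illustrative):

```r
country <- c("usa", "USA", "AUSTRALI", "AUSTRALIA", "uk", "UK")

# Count after upper-casing, so case differences no longer split a value
counts <- table(toupper(country))

# Values that occur only once are candidates for misspellings
rare <- names(counts)[counts == 1]
# "AUSTRALI" "AUSTRALIA"
```

Values that occur once are only candidates; on a small sample a correctly spelled value can also appear just once, as AUSTRALIA does here.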
I know there are already some threads like this, but I could not find any solutions.
I have a dataframe that looks like this:
Name Age Sex Survived
1 Allison 0.17 female 1
2 Leah 0.33 female 0
3 David 0.8 male 1
4 Daniel 0.83 male 1
5 Alex 0.83 male 1
6 Jay 0.92 male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
I want to remove ages that are below 1. I want the data to look like this:
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Or to just remove the rows with ages < 1 altogether.
Following other solutions, I tried this, but it didn't work:
mydata[mydata$Age<"1"&&mydata$Age>"0"] <- NA
Here are three ways to remove the rows:
mydata[mydata$Age >= 1, ]
subset(mydata, Age >= 1)
filter(mydata, Age >= 1)  # requires library(dplyr)
Here is how to make them NA:
mydata$Age[mydata$Age < 1] <- NA
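Your issue is that you are comparing against 1 as a character string (in quotes). Less-than and greater-than comparisons on character strings work differently from numeric comparisons, so be careful. Note also that && is not vectorized; use & for element-wise tests. Finally, make sure your Age column is numeric. The safest way to do that is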
mydata$Age <- as.numeric(as.character(mydata$Age))
so you don't accidentally mess up factor variables.
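A quick illustration of both pitfalls, using made-up values: string comparison is character by character, and calling as.numeric directly on a factor returns its internal level codes rather than the numbers it displays.

```r
# Character comparison goes character by character, so "10" sorts before "2"
chr_cmp <- "10" < "2"    # TRUE
num_cmp <- 10 < 2        # FALSE

# A factor stores integer codes for its (sorted) levels
age <- factor(c("0.17", "16", "8"))
codes  <- as.numeric(age)                 # 1 2 3 (level codes, not ages)
values <- as.numeric(as.character(age))   # 0.17 16 8
```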
Edit: I had the signs the wrong way round; fixed now.
> mydata[mydata$Age<1, "Age"] <- NA
> mydata
Name Age Sex Survived
1 Allison NA female 1
2 Leah NA female 0
3 David NA male 1
4 Daniel NA male 1
5 Alex NA male 1
6 Jay NA male 1
7 Sara 16 female 1
8 Jade 15 female 1
9 Connor 17 male 1
10 Jon 18 male 1
11 Mary 8 female 1
Update
If Age is a factor, you can convert it first:
mydata[as.numeric(as.character(mydata$Age))<1, "Age"] <- NA
"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
So far my efforts are as follows.
mini <- read.csv("coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(mini_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know what is the best way to fill all missing categories for each user.
Any help in this regard is appreciated.
library(reshape2)
library(tidyr)
mdf <- melt(df, c("id","gender","age"))
# nesting() keeps only the id/gender/age combinations present in the data
complete(na.omit(mdf), nesting(id, gender, age), value)
# Source: local data frame [50 x 5]
#
# id gender age value variable
# (int) (fctr) (int) (chr) (fctr)
# 1 1 Male 22 books NA
# 2 1 Male 22 cloths category4
# 3 1 Male 22 fuel NA
# 4 1 Male 22 grocery category5
# 5 1 Male 22 movies category1
# 6 1 Male 22 music category2
# 7 1 Male 22 online-shopping NA
# 8 1 Male 22 rent NA
# 9 1 Male 22 travel category3
# 10 1 Male 22 utiliy NA
# .. ... ... ... ... ...
Explanation
First, melt the data.frame, specifying the id columns. Then tidyr's helper function complete expands the data to every combination of id and category, as your output describes.
Data
df <- read.csv(text='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,')
is.na(df) <- is.na(df) | df== ""
Consider using the base function reshape, since this is a textbook case of wide-to-long reshaping:
reshapedf <- reshape(df, varying = c(4:13),
v.names = c("category"),
timevar=c("rank"),
times = c(1:10),
idvar = c("id", "gender", "age"),
new.row.names = 1:1000,
direction = "long")
# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)
OUTPUT
id gender age rank category
1 1 Male 22 1 movies
2 1 Male 22 2 music
3 1 Male 22 3 travel
4 1 Male 22 4 cloths
5 1 Male 22 5 grocery
6 1 Male 22 6 NA
7 1 Male 22 7 NA
8 1 Male 22 8 NA
9 1 Male 22 9 NA
10 1 Male 22 10 NA
...
I am using the dcast function in R to turn a long-format dataset into a wide-format dataset. I have an ID number, a categorical variable (CAT), and a continuous variable (AMT). However, I also have a variable SEX, which is the same for all rows of a given ID number. This code works to create the wide-format dataset, but I lose SEX. How can I retain it?
PC1cast <- dcast(PC1, ID~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
If I add SEX to the ID~CAT line, it gives me SEX-CAT combinations. I want SEX to just be one value for each row.
Sample data:
ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
For that, you need to add SEX to the ID side of your formula:
dcast(PC1, ID + SEX~CAT, value.var='AMT', fun.aggregate=sum, na.rm=TRUE)
# results in:
ID SEX A B C
1 1 Female 46 22 31
2 2 Male 17 25 44
Things on the left-hand side of the formula are kept as-is; things on the right-hand side are cast.
I added some extra data lines to clarify some parts of this. But the gist is that you just need to put SEX on the left hand side (i.e., of ~):
PC2 <- read.table(text="ID CAT AMT SEX
1 A 46 Female
1 B 22 Female
1 C 31 Female
2 A 17 Male
2 B 25 Male
2 C 44 Male
3 A 47 Female
3 B 27 Female
3 C 37 Female
4 A 17 Male
4 A 17 Male
4 B 22 Male
4 B NA Male
4 C 44 Male", header=T)
library(reshape2)
PC1cast2 <- dcast(PC2, ID+SEX~CAT, value.var='AMT', fun.aggregate=sum,
na.rm=TRUE)
PC1cast2
# ID SEX A B C
# 1 1 Female 46 22 31
# 2 2 Male 17 25 44
# 3 3 Female 47 27 37
# 4 4 Male 34 22 44
In your example data, you only have one instance of each combination and no NAs, so the fun.aggregate=sum, na.rm=TRUE doesn't do anything. When some are duplicated (e.g., there are two 4 As and two 4 Bs), the values are summed, but the NAs are dropped first. Make sure that is what you want.
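To make the effect of fun.aggregate=sum with na.rm=TRUE concrete, here is what base R's sum returns for the kinds of groups described above. The last case is worth noting: a group that contains only NA sums to 0, not NA.

```r
dup_sum <- sum(c(17, 17), na.rm = TRUE)   # 34: duplicated rows are added up
na_sum  <- sum(c(22, NA), na.rm = TRUE)   # 22: the NA is dropped first
all_na  <- sum(NA_real_, na.rm = TRUE)    # 0:  nothing left to sum
```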
I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to follow these rules. The variables below are id, the sex of the household head (1 = male, 2 = female) and the age of the household head. If a household has both a male and a female household head, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so that the desired entry for each id comes first, and then removing the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
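The same order-then-deduplicate idea can be wrapped in a small function and checked on input where the female head is listed before the male one. This is only a sketch: the function name `keep_one_head` and the data frame `hh` are made up for illustration.

```r
# Keep one head per household: male before female, then older before younger
keep_one_head <- function(df) {
  ordered <- df[order(df$id, df$sex, -df$age), ]
  ordered[!duplicated(ordered$id), ]
}

# Household 2 lists the female head first; 5 and 8 have same-sex heads
hh <- data.frame(id  = c(2, 2, 5, 5, 8, 8),
                 sex = c(2, 1, 2, 2, 1, 1),
                 age = c(54, 34, 56, 67, 35, 43))

result <- keep_one_head(hh)
# result keeps (id 2, sex 1, age 34), (id 5, sex 2, age 67), (id 8, sex 1, age 43)
```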
With data.table, this is easy with "compound queries". To order the data as you create the data.table, set the "key" to "id,sex" (required in case any female values would come before male values for a given ID).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45