summarise count of options of variables - r

I have a big data frame of many variables and their options, so I want the count of all variables and their options. for example the data frame below.
also I have same another data frame and if I want to merge these two data frame, to check if the names of column are same , if not the get the names of different column names.
Excluding c(uniqueid,name) column
the objective is to find if we have any misspelled words with the help of count, or the words have any accent.
df11 <- data.frame(uniqueid=c(9143,2357,4339,8927,9149,4285,2683,8217,3702,7857,3255,4262,8501,7111,2681,6970),
name=c("xly,mnn","xab,Lan","mhy,mun","vgtu,mmc","ftu,sdh","kull,nnhu","hula,njam","mund,jiha","htfy,ntha","sghu,njui","sgyu,hytb","vdti,kula","mftyu,huta","mhuk,ghul","cday,bhsue","ajtu,nudj"),
city=c("A","B","C","C","D","F","S","C","E","S","A","B","W","S","C","A"),
age=c(22,45,67,34,43,22,34,43,34,52,37,44,41,40,39,30),
country=c("usa","USA","AUSTRALI","AUSTRALIA","uk","UK","SPAIN","SPAIN","CHINA","CHINA","BRAZIL","BRASIL","CHILE","USA","CANADA","UK"),
language=c("ENGLISH(US)","ENGLISH(US)","ENGLISH","ENGLISH","ENGLISH(UK)","ENGLISH(UK)","SPANISH","SPANISH","CHINESE","CHINESE","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH","ENGLISH(US)"),
gender=c("MALE","FEMALE","male","m","f","MALE","FEMALE","f","FEMALE","MALE","MALE","MALE","FEMALE","FEMALE","MALE","MALE"))
the output should be like a summary of count of group of variables and their options. its a kind of Pivot for Eg: for city
so it should select all available columns in data frame and the give kind of summary of counts for all options available in columns

I am quite confused with what you call "option" but here is something to start with using only base R functions.
Note: it only refers to the 1st part of the question "I want the count of all variables and their options".
res <- do.call(rbind, lapply(df11[, 3:ncol(df11)], function(option) as.data.frame(table(option)))) # apply table() to the selected columns and gather the output in a dataframe
res$variable <- gsub("[.](.*)","", rownames(res)) # recover the name of the variable from the row names with a regular expression
rownames(res) <- NULL # just to clean
res <- res[, c(3,1,2)] # ordering columns
res <- res[order(-res$Freq), ] # sorting by decreasing Freq
The output:
> res
variable option Freq
34 language ENGLISH 7
42 gender MALE 7
39 gender FEMALE 5
3 city C 4
1 city A 3
7 city S 3
11 age 34 3
36 language ENGLISH(US) 3
2 city B 2
9 age 22 2
16 age 43 2
27 country CHINA 2
28 country SPAIN 2
30 country UK 2
32 country USA 2
33 language CHINESE 2
35 language ENGLISH(UK) 2
37 language SPANISH 2
38 gender f 2
4 city D 1
5 city E 1
6 city F 1
8 city W 1
10 age 30 1
12 age 37 1
13 age 39 1
14 age 40 1
15 age 41 1
17 age 44 1
18 age 45 1
19 age 52 1
20 age 67 1
21 country AUSTRALI 1
22 country AUSTRALIA 1
23 country BRASIL 1
24 country BRAZIL 1
25 country CANADA 1
26 country CHILE 1
29 country uk 1
31 country usa 1
40 gender m 1
41 gender male 1

You could count calculate the length of unique values in an aggregate.
res <- aggregate(. ~ city, df11, function(x) length(unique(x)))
res
# city uniqueid name age country language gender
# 1 A 3 3 3 3 2 1
# 2 B 2 2 2 2 2 2
# 3 C 4 4 4 4 2 4
# 4 D 1 1 1 1 1 1
# 5 E 1 1 1 1 1 1
# 6 F 1 1 1 1 1 1
# 7 S 3 3 3 3 3 2
# 8 W 1 1 1 1 1 1

Related

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get your closer to what you need.
First you group by i. Then, you can then create a column that will indicate a 1 for each change in region. This compares the current value for the region with the previous value (using lag). Note if the previous value is NA (when looking at the first value for a given i), it will be considered no change.
Same approach is taken for urban. Then, summarize totaling up all the changes for each i. I left in these temporary variables so you can examine if you are getting the results desired.
Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
drop_na(region, urban) %>%
group_by(i) %>%
mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
summarize(tot_region = sum(reg_change),
tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2

CountIF function in R?

I have the following formulas in excel, but calculation takes forever, so I would like to find a way to calculate these formulas in excel.
I'm counting the number of times an item shows up in a location (Location 1, Location 2, and External) with these formulas:
=SUMPRODUCT(($N:$N=$A2)*(RIGHT(!$C:$C)="1")
=SUMPRODUCT(($N:$N=$A2)*(RIGHT(!$C:$C)="2")
=SUMPRODUCT(($N:$N=$A2)*(LEFT($C:$C)="E"))
Here is the dataframe in which the columns with these values will be added:
> str(FinalPars)
'data.frame': 10038 obs. of 3 variables:
$ ID: int 11 13 18 22 39 181 182 183 191 192 ...
$ Minimum : num 15 6 1.71 1 1 4.39 2.67 5 5 2 ...
$ Maximum : num 15 6 2 1 1 5.48 3.69 6.5 5 2 ...
and here is the dataset to which the ItemID will be matched (This is a master list of all locations each item is stored in):
> str(StorageLocations)
'data.frame': 14080 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ CLASSIFICATION : Factor w/ 3 levels "Central 1","Central 2",..: 3 3 3 1 2 3 3 1 2 3 ...
$ Cart Descr : Factor w/ 145 levels "Closet1",..: 36 41 110 1 99 58 60 14 99 60 ...
Sample of Storage Location Data Frame:
ID Classification Cart Descr
123 Central 1 Main Store Room
123 Central 2 Secondary Store Room
123 External Closet 1
123 External Closet 2
123 External Closet 3
So the output for the above would be added to the data frame total pars as the new colums Central 1, Central 2, and External and count the number of times the item was IDd as in those locations:
ID Minimum Maximum Central 1 Central 2 External
123 10 15 1 1 3
This was my output in Excel - a Count of the # of times an item was identified as Central 1, Central 2, or External
If anyone knows the comparable formula in R it would be great!
It's hard to know what you are really asking for without example data. I produced an example below.
Location <- c(rep(1,4), rep(2,4), rep(3,4))
Item_Id <- c(rep(1,2),rep(2,3),rep(1,2),rep(2,2),rep(1,3))
Item_Id_Want_to_Match <- 1
df <- data.frame(Location, Item_Id)
> df
Location Item_Id
1 1 1
2 1 1
3 1 2
4 1 2
5 2 2
6 2 1
7 2 1
8 2 2
9 3 2
10 3 1
11 3 1
12 3 1
sum(ifelse(df$Location == 1 & df$Item_Id == Item_Id_Want_to_Match, df$Item_Id*df$Location,0))
> sum(ifelse(df$Location == 1 & df$Item_Id == Item_Id_Want_to_Match, df$Item_Id*df$Location,0))
[1] 2
EDIT:
ID <- rep(123,5)
Classification <- c("Central 1", "Central 2", rep("External",3))
df <- data.frame(ID, Classification)
df$count <- 1
ID2 <- 123
Min <- 10
Max <- 15
df2 <- data.frame(ID2, Min, Max)
library(dplyr)
count_df <- df %>%
group_by(ID, Classification) %>%
summarise(count= sum(count))
> count_df
Source: local data frame [3 x 3]
Groups: ID
ID Classification count
1 123 Central 1 1
2 123 Central 2 1
3 123 External 3
library(reshape)
new_df <- recast(count_df, ID~Classification, id.var=c("ID", "Classification"))
> new_df
ID Central 1 Central 2 External
1 123 1 1 3
merge(new_df, df2, by.x="ID", by.y="ID2")
> merge(new_df, df2, by.x="ID", by.y="ID2")
ID Central 1 Central 2 External Min Max
1 123 1 1 3 10 15

Help needed in Data cleaning using R

"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
So far my efforts are as follows.
mini <- read.csv("coding/mini.csv", header=FALSE)
mini_clean <- mini[-1,]
df_mini <- melt(df_clean, id.vars=c("V1","V2","V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know what is the best way to fill all missing categories for each user.
Any help in this regard is appreciated.
library(reshape2)
library(tidyr)
mdf <- melt(df, c("id","gender","age"))
complete(na.omit(mdf), c(id, gender, age), value)
# Source: local data frame [50 x 5]
#
# id gender age value variable
# (int) (fctr) (int) (chr) (fctr)
# 1 1 Male 22 books NA
# 2 1 Male 22 cloths category4
# 3 1 Male 22 fuel NA
# 4 1 Male 22 grocery category5
# 5 1 Male 22 movies category1
# 6 1 Male 22 music category2
# 7 1 Male 22 online-shopping NA
# 8 1 Male 22 rent NA
# 9 1 Male 22 travel category3
# 10 1 Male 22 utiliy NA
# .. ... ... ... ... ...
Explanation
We can first melt the data.frame specifying the id columns. Next, the new release of tidyr has a helper function complete to expand columns as your output describes.
Data
df <- read.csv(text='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,')
is.na(df) <- is.na(df) | df== ""
Consider using the base function reshape as this is the regular example of wide to long dataset reshaping/pivoting:
reshapedf <- reshape(df, varying = c(4:13),
v.names = c("category"),
timevar=c("rank"),
times = c(1:10),
idvar = c("id", "gender", "age"),
new.row.names = 1:1000,
direction = "long")
# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)
OUTPUT
id gender age rank category
1 1 Male 22 1 movies
2 1 Male 22 2 music
3 1 Male 22 3 travel
4 1 Male 22 4 cloths
5 1 Male 22 5 grocery
6 1 Male 22 6 NA
7 1 Male 22 7 NA
8 1 Male 22 8 NA
9 1 Male 22 9 NA
10 1 Male 22 10 NA
...

Remove duplicate observations based on set of rules

I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to be based on the following rules. The variables below are id, the sex of household head (1-male, 2-female) and the age of the household head. The rules are as follows. If a household has both male and female household heads, remove the female household head observation. If a household as either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
With data.table, this is easy with "compound queries". To order the data when you read it in, set the "key" when you read it in as "id,sex" (required in case any female values would come before male values for a given ID).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45

Expand each summary row of a data frame into a type-count

I have a data frame that looks like this, with two key columns and then a count of things that come in three different types.
Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23
I want to expand each row so that it lists the type as a factor and then the number in that type, so something like this:
Year Month Type Number
1 1 1 Urban 11
2 1 1 Suburban 12
3 1 1 Rural 13
4 1 2 Urban 21
5 1 2 Suburban 22
6 1 2 Rural 23
is there a function that does this painlessly?
This is precisely what the reshape and reshape2 packages are designed to do:
require(reshape2)
x <- read.table(text = "Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23")
#Specify the variables that are your ID variables. The others will form your "long" data
x.m <- melt(x, id.vars = c("Year", "Month"))
#-----
Year Month variable value
1 1 1 Urban 11
2 1 2 Urban 21
3 1 1 Suburban 12
...
There is a paper in the journal of statistical software that's a great place to get started.
dat <- read.table(text=" Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23
", header=TRUE)
reshape(dat, direction="long", idvar=1:2, varying=names(dat)[3:5], times=names(dat)[3:5], v.names="Number", timevar="Type")
Year Month Type Number
1.1.Urban 1 1 Urban 11
1.2.Urban 1 2 Urban 21
1.1.Suburban 1 1 Suburban 12
1.2.Suburban 1 2 Suburban 22
1.1.Rural 1 1 Rural 13
1.2.Rural 1 2 Rural 23
(Note that the reshape function is in the standard set of packages and not in the reshape or resshape2 packages.)

Resources