I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. The survey follows a large group of people for 25 years and records changes in 'region' (categorical, 1-4), 'urban' (dummy), 'wage', and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' changed (e.g., from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? I also have some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that flags each change in region with a 1, by comparing the current region with the previous value (using lag). Note that if the previous value is NA (i.e., at the first row for a given i), it is treated as no change.
The same approach is used for urban. Finally, summarize, totaling up the changes for each i. I left in these temporary variables so you can check whether you are getting the results you want.
Edit: If you wish to remove rows that have NA for region or urban, you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change   = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban  = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both the tot_region and tot_urban columns, you can use colSums (store the earlier result as df_tot, as above).
colSums(df_tot[-1])
tot_region tot_urban
6 2
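If you prefer to stay inside dplyr rather than switch to colSums, the same grand totals can be computed with summarise(across(...)). A minimal sketch, using made-up totals in place of the real df_tot:

```r
library(dplyr)

# Hypothetical stand-in for df_tot (same shape as the result above)
df_tot <- tibble(i          = c(1, 4, 45),
                 tot_region = c(1, 3, 2),
                 tot_urban  = c(0, 0, 2))

# Sum every column except the id column
df_tot %>% summarise(across(-i, sum))
# tot_region = 6, tot_urban = 2
```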
I have the following formulas in Excel, but the calculation takes forever, so I would like to find a way to do these calculations in R instead.
I'm counting the number of times an item shows up in a location (Location 1, Location 2, and External) with these formulas:
=SUMPRODUCT(($N:$N=$A2)*(RIGHT($C:$C)="1"))
=SUMPRODUCT(($N:$N=$A2)*(RIGHT($C:$C)="2"))
=SUMPRODUCT(($N:$N=$A2)*(LEFT($C:$C)="E"))
Here is the data frame to which the columns with these values will be added:
> str(FinalPars)
'data.frame': 10038 obs. of 3 variables:
$ ID: int 11 13 18 22 39 181 182 183 191 192 ...
$ Minimum : num 15 6 1.71 1 1 4.39 2.67 5 5 2 ...
$ Maximum : num 15 6 2 1 1 5.48 3.69 6.5 5 2 ...
and here is the dataset to which the ItemID will be matched (This is a master list of all locations each item is stored in):
> str(StorageLocations)
'data.frame': 14080 obs. of 3 variables:
$ ID : int 1 2 3 4 5 6 7 8 9 10 ...
$ CLASSIFICATION : Factor w/ 3 levels "Central 1","Central 2",..: 3 3 3 1 2 3 3 1 2 3 ...
$ Cart Descr : Factor w/ 145 levels "Closet1",..: 36 41 110 1 99 58 60 14 99 60 ...
Sample of Storage Location Data Frame:
ID Classification Cart Descr
123 Central 1 Main Store Room
123 Central 2 Secondary Store Room
123 External Closet 1
123 External Closet 2
123 External Closet 3
So the output for the above would be added to the FinalPars data frame as the new columns Central 1, Central 2, and External, counting the number of times each item was identified as being in those locations:
ID Minimum Maximum Central 1 Central 2 External
123 10 15 1 1 3
This was my output in Excel: a count of the number of times an item was identified as Central 1, Central 2, or External.
If anyone knows the comparable approach in R, it would be great!
It's hard to know what you are really asking for without example data, so I produced an example below.
Location <- c(rep(1,4), rep(2,4), rep(3,4))
Item_Id <- c(rep(1,2),rep(2,3),rep(1,2),rep(2,2),rep(1,3))
Item_Id_Want_to_Match <- 1
df <- data.frame(Location, Item_Id)
> df
Location Item_Id
1 1 1
2 1 1
3 1 2
4 1 2
5 2 2
6 2 1
7 2 1
8 2 2
9 3 2
10 3 1
11 3 1
12 3 1
> sum(ifelse(df$Location == 1 & df$Item_Id == Item_Id_Want_to_Match, df$Item_Id*df$Location,0))
[1] 2
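Since TRUE counts as 1 in arithmetic, the ifelse and the Item_Id*Location product are not needed just to count matching rows; summing the logical condition gives the same count. A minimal sketch with the example data above:

```r
Location <- c(rep(1, 4), rep(2, 4), rep(3, 4))
Item_Id  <- c(rep(1, 2), rep(2, 3), rep(1, 2), rep(2, 2), rep(1, 3))
df <- data.frame(Location, Item_Id)

# TRUE coerces to 1, so summing the condition counts the matching rows
sum(df$Location == 1 & df$Item_Id == 1)
# [1] 2
```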
EDIT:
ID <- rep(123,5)
Classification <- c("Central 1", "Central 2", rep("External",3))
df <- data.frame(ID, Classification)
df$count <- 1
ID2 <- 123
Min <- 10
Max <- 15
df2 <- data.frame(ID2, Min, Max)
library(dplyr)
count_df <- df %>%
group_by(ID, Classification) %>%
summarise(count= sum(count))
> count_df
Source: local data frame [3 x 3]
Groups: ID
ID Classification count
1 123 Central 1 1
2 123 Central 2 1
3 123 External 3
library(reshape)
new_df <- recast(count_df, ID~Classification, id.var=c("ID", "Classification"))
> new_df
ID Central 1 Central 2 External
1 123 1 1 3
> merge(new_df, df2, by.x="ID", by.y="ID2")
ID Central 1 Central 2 External Min Max
1 123 1 1 3 10 15
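A more current tidyverse spelling of the same steps (count replaces the manual count column, and pivot_wider replaces recast); a sketch using the same toy data:

```r
library(dplyr)
library(tidyr)

df  <- data.frame(ID = rep(123, 5),
                  Classification = c("Central 1", "Central 2", rep("External", 3)))
df2 <- data.frame(ID2 = 123, Min = 10, Max = 15)

# Count rows per ID/Classification, then spread classifications into columns
wide <- df %>%
  count(ID, Classification) %>%
  pivot_wider(names_from = Classification, values_from = n, values_fill = 0)

merge(wide, df2, by.x = "ID", by.y = "ID2")
# one row: ID 123, Central 1 = 1, Central 2 = 1, External = 3, Min = 10, Max = 15
```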
"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,
I need to reformat it as follows.
id gender age category rank
1 Male 22 movies 1
1 Male 22 music 2
1 Male 22 travel 3
1 Male 22 cloths 4
1 Male 22 grocery 5
1 Male 22 books NA
1 Male 22 rent NA
1 Male 22 fuel NA
1 Male 22 utility NA
1 Male 22 online-shopping NA
So far my efforts are as follows.
library(reshape2)
library(sqldf)

mini <- read.csv("coding/mini.csv", header = FALSE)
mini_clean <- mini[-1, ]
df_mini <- melt(mini_clean, id.vars = c("V1", "V2", "V3"))
sqldf('select * from df_mini order by "V1"')
Now I want to know what is the best way to fill all missing categories for each user.
Any help in this regard is appreciated.
library(reshape2)
library(tidyr)
mdf <- melt(df, c("id","gender","age"))
complete(na.omit(mdf), c(id, gender, age), value)
# Source: local data frame [50 x 5]
#
# id gender age value variable
# (int) (fctr) (int) (chr) (fctr)
# 1 1 Male 22 books NA
# 2 1 Male 22 cloths category4
# 3 1 Male 22 fuel NA
# 4 1 Male 22 grocery category5
# 5 1 Male 22 movies category1
# 6 1 Male 22 music category2
# 7 1 Male 22 online-shopping NA
# 8 1 Male 22 rent NA
# 9 1 Male 22 travel category3
# 10 1 Male 22 utiliy NA
# .. ... ... ... ... ...
Explanation
We first melt the data.frame, specifying the id columns. Then, the recent release of tidyr has a helper function, complete, that expands the columns as your output describes.
Data
df <- read.csv(text='"id","gender","age","category1","category2","category3","category4","category5","category6","category7","category8","category9","category10"
1,"Male",22,"movies","music","travel","cloths","grocery",,,,,
2,"Male",28,"travel","books","movies",,,,,,,
3,"Female",27,"rent","fuel","grocery","cloths",,,,,,
4,"Female",22,"rent","grocery","travel","movies","cloths",,,,,
5,"Female",22,"rent","online-shopping","utiliy",,,,,,,')
is.na(df) <- is.na(df) | df == ""
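One detail: the completed output keeps the original column name (category1, category2, ...) in variable, while the desired output shows a numeric rank. Stripping the "category" prefix recovers it; rows that complete added have NA in variable, so their rank stays NA. A sketch on a cut-down version of the data (newer tidyr versions spell the grouping as nesting(...)):

```r
library(reshape2)
library(tidyr)

df <- read.csv(text = '"id","gender","age","category1","category2","category3"
1,"Male",22,"movies","music",
2,"Male",28,"travel",,')
is.na(df) <- df == ""

mdf <- melt(df, c("id", "gender", "age"))
out <- complete(na.omit(mdf), nesting(id, gender, age), value)

# "category3" -> 3; NA variable (rows filled in by complete) -> NA rank
out$rank <- as.integer(sub("category", "", out$variable))
```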
Consider using the base function reshape, as this is the standard example of wide-to-long dataset reshaping/pivoting:
reshapedf <- reshape(df, varying = c(4:13),
v.names = c("category"),
timevar=c("rank"),
times = c(1:10),
idvar = c("id", "gender", "age"),
new.row.names = 1:1000,
direction = "long")
# ORDER RESULTING DATA FRAME
reshapedf <- reshapedf[with(reshapedf , order(id, gender, age)), ]
# RESET ROW NAMES
row.names(reshapedf) <- 1:nrow(reshapedf)
OUTPUT
id gender age rank category
1 1 Male 22 1 movies
2 1 Male 22 2 music
3 1 Male 22 3 travel
4 1 Male 22 4 cloths
5 1 Male 22 5 grocery
6 1 Male 22 6 NA
7 1 Male 22 7 NA
8 1 Male 22 8 NA
9 1 Male 22 9 NA
10 1 Male 22 10 NA
...
I am trying to remove duplicate observations from a data set based on my id variable. However, I want the removal to follow these rules. The variables below are id, the sex of the household head (1 = male, 2 = female), and the age of the household head. If a household has both male and female household heads, remove the female observation. If a household has either two male or two female heads, remove the observation with the younger head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
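The same two rules can also be written with dplyr, which may read more directly: within each id, sort so the male head comes first and, within sex, the older head comes first, then keep the first row. A sketch on the question's data:

```r
library(dplyr)

data <- data.frame(id  = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10),
                   sex = c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1),
                   age = c(32, 34, 54, 23, 32, 56, 67, 45, 51, 43, 35, 80, 45))

data %>%
  group_by(id) %>%
  arrange(sex, desc(age), .by_group = TRUE) %>%  # male first, then older first
  slice(1) %>%                                   # keep one row per id
  ungroup()
```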
With data.table, this is easy with "compound queries". Set the key to "id,sex" when creating the data.table, so that rows are ordered with male values before any female values for a given ID.
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
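An equivalent data.table spelling that avoids the intermediate max(age) aggregation: order male-first and oldest-first, then take the first row of each id group. A sketch on the same data:

```r
library(data.table)

DT <- data.table(id  = c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10),
                 sex = c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1),
                 age = c(32, 34, 54, 23, 32, 56, 67, 45, 51, 43, 35, 80, 45))

# Sort so the preferred head is first within each id, then take row 1 per group
DT[order(sex, -age), .SD[1], by = id]
```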
I have a data frame that looks like this, with two key columns and then a count of things that come in three different types.
Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23
I want to expand each row so that it lists the type as a factor and then the number in that type, so something like this:
Year Month Type Number
1 1 1 Urban 11
2 1 1 Suburban 12
3 1 1 Rural 13
4 1 2 Urban 21
5 1 2 Suburban 22
6 1 2 Rural 23
Is there a function that does this painlessly?
This is precisely what the reshape and reshape2 packages are designed to do:
require(reshape2)
x <- read.table(text = "Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23")
#Specify the variables that are your ID variables. The others will form your "long" data
x.m <- melt(x, id.vars = c("Year", "Month"))
#-----
Year Month variable value
1 1 1 Urban 11
2 1 2 Urban 21
3 1 1 Suburban 12
...
There is a paper in the Journal of Statistical Software that is a great place to get started.
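melt can also name the long columns directly via its variable.name and value.name arguments, which gives the exact Type/Number layout from the question without a rename step:

```r
library(reshape2)

x <- read.table(text = "Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23")

# variable.name / value.name set the names of the two long columns
x.m <- melt(x, id.vars = c("Year", "Month"),
            variable.name = "Type", value.name = "Number")
```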
dat <- read.table(text=" Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23
", header=TRUE)
reshape(dat, direction="long", idvar=1:2, varying=names(dat)[3:5], times=names(dat)[3:5], v.names="Number", timevar="Type")
Year Month Type Number
1.1.Urban 1 1 Urban 11
1.2.Urban 1 2 Urban 21
1.1.Suburban 1 1 Suburban 12
1.2.Suburban 1 2 Suburban 22
1.1.Rural 1 1 Rural 13
1.2.Rural 1 2 Rural 23
(Note that the reshape function used here comes with base R's stats package and is not in the reshape or reshape2 packages.)
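For completeness, current tidyr offers pivot_longer, which handles the same pivot with arguably more readable arguments; a sketch:

```r
library(tidyr)

dat <- read.table(text = "Year Month Urban Suburban Rural
1 1 1 11 12 13
2 1 2 21 22 23", header = TRUE)

# cols selects the wide columns; names_to/values_to name the new ones
pivot_longer(dat, cols = Urban:Rural, names_to = "Type", values_to = "Number")
```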