Remove duplicate observations based on a set of rules - r

I am trying to remove duplicate observations from a data set based on the variable id. However, I want the removal of observations to follow these rules. The variables below are id, the sex of the household head (1 = male, 2 = female), and the age of the household head. The rules are as follows: if a household has both a male and a female household head, remove the female observation; if a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id  <- c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10)
sex <- c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1)
age <- c(32, 34, 54, 23, 32, 56, 67, 45, 51, 43, 35, 80, 45)
data <- data.frame(id, sex, age)

You can do this by first ordering the data.frame so that the desired entry for each id comes first, and then removing the rows with duplicated ids.
d <- with(data, data[order(id, sex, -age), ])  # within each id: males (sex = 1) first, then oldest first
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
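For comparison, a dplyr sketch of the same order-then-keep-first idea (assuming the data frame built above):
library(dplyr)
data %>%
  arrange(id, sex, desc(age)) %>%   # males (sex = 1) first, then oldest, within each id
  distinct(id, .keep_all = TRUE)    # keep the first row for each id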

With data.table, this is easy with "compound queries". Set the key to "id,sex" when you create the table so the rows are ordered accordingly (required in case any female values come before male values for a given id).
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45
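More recent data.table idioms can do this in one step as well; a sketch, assuming the same data frame:
library(data.table)
DT2 <- as.data.table(data)
DT2[order(id, sex, -age), .SD[1], by = id]   # first row per id after ordering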

Related

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (a categorical variable, 1-4), 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (e.g. from region 1 to region 3, or from urban 0 to 1) during the 25-year observation period within each subject? I also have some NAs in the data, which should be ignored.
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get you closer to what you need.
First, group by i. Then create a column that indicates a 1 for each change in region, comparing the current region value with the previous one (using lag). Note that if the previous value is NA (as when looking at the first value for a given i), it is treated as no change.
The same approach is taken for urban. Then summarize, totaling up all the changes for each i. I left these temporary variables in so you can check whether you are getting the desired results.
Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.
library(dplyr)
library(tidyr)

df_tot <- df %>%
  drop_na(region, urban) %>%
  group_by(i) %>%
  mutate(reg_change   = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
         urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
  summarize(tot_region = sum(reg_change),
            tot_urban  = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2
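For reference, the same counting logic can be sketched in base R, assuming the data are in df and ordered by year within each i:
count_changes <- function(v) {
  v <- v[!is.na(v)]               # ignore NAs, as the question asks
  sum(diff(as.numeric(v)) != 0)   # count transitions between successive values
}
aggregate(cbind(region, urban) ~ i, df, count_changes, na.action = na.pass)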

Check time series incongruencies

Let's say that we have the following data frame:
x <- data.frame(ID    = c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
                Visit = c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                Age   = c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42))
The first column represents the subject ID, the second a sequence of visits, and the third the age at each of these consecutive visits.
What would be the easiest way of finding visits where the age is wrong according to the age at the previous visit? (E.g. in row 13, subject C is 66 years old when in the previous visit he was already 84; in row 16, subject D is 32 years old when in the previous visit he was already 38.)
What would be the way of highlighting the potential errors and removing rows 13 and 16?
I have tried to aggregate by ID and look at the difference between ages across visits, but it seems hard for me since the error could occur at any visit.
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
  w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]))
df
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<chr> <dbl> <dbl>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
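A variant worth noting (a sketch, assuming rows are already ordered by Visit within each ID and Age is numeric): filtering against the running maximum also drops an age that falls below any earlier visit's age, not just the immediately preceding one:
x %>% group_by(ID) %>% filter(Age >= cummax(Age))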
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE
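If you only want to flag the suspect rows rather than drop them (the question also asks about highlighting), here is a base R sketch along the same lines, again assuming numeric Age ordered by Visit:
x$suspect <- ave(x$Age, x$ID, FUN = function(z) c(FALSE, diff(z) <= 0))
x[x$suspect == 1, ]   # rows 13 and 16 in the example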

R: how to use expand.grid to generate combinations based on group

I am trying to get all combinations of values per group. I want to prevent combinations of values between different groups.
To create all combinations of values (no matter which group a value belongs to) I can use:
expand.grid(value, value)
The expected result is a subset of the result of the previous command.
Example:
#base data
value = c(1,3,5, 1,5,7,9, 2)
group = c("a", "a", "a","b","b","b","b", "c")
base <- data.frame(value, group)
#creating ALL combinations of value
allComb <- expand.grid(base$value, base$value)
#expected result is a subset of allComb.
#Note: the first column shows the row number from allComb.
#Empty rows separate combinations per group and are shown only for clarification.
Var1 Var2
1 1 1
2 3 1
3 5 1
11 1 3
12 3 3
13 5 3
21 1 5
22 3 5
23 5 5

34 1 1
35 5 1
36 7 1
37 9 1
44 1 5
45 5 5
46 7 5
47 9 5
54 1 7
55 5 7
56 7 7
57 9 7
64 1 9
65 5 9
66 7 9
67 9 9

78 2 2
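One way to restrict the combinations to within each group is to apply expand.grid per group and row-bind the results; a base R sketch, using the base data frame from the example:
do.call(rbind, lapply(split(base$value, base$group),
                      function(v) expand.grid(Var1 = v, Var2 = v)))
The row names gain a group prefix (a.1, a.2, ...) rather than the allComb row numbers, but the value pairs match the expected subset shown above.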

Using tidyverse gather() to output multiple value vectors with a single key in a data frame

Despite the conventions of R, data collection and entry is for me most easily done in vertical columns. Therefore, I have a question about efficiently converting to horizontal rows with the gather() function in the tidyverse library. I find myself using gather() over and over, which seems inefficient. Is there a more efficient way? And can an existing vector serve as the key? Here is an example:
Let's say we have the following health metrics on baby birds.
bird day_1_mass day_2_mass day_1_heart_rate day_3_heart_rate
1 1 5 6 60 55
2 2 6 8 62 57
3 3 3 3 45 45
Using the gather function I can reorganize the mass data into rows.
horizontal.data <- gather(vertical.data,
                          key = age,
                          value = mass,
                          day_1_mass:day_2_mass,
                          factor_key = TRUE)
Giving us
bird day_1_heart_rate day_3_heart_rate age mass
1 1 60 55 day_1_mass 5
2 2 62 57 day_1_mass 6
3 3 45 45 day_1_mass 3
4 1 60 55 day_2_mass 6
5 2 62 57 day_2_mass 8
6 3 45 45 day_2_mass 3
And use the same function again to similarly reorganize heart rate data.
horizontal.data.2 <- gather(horizontal.data,
                            key = age2,
                            value = heart_rate,
                            day_1_heart_rate:day_3_heart_rate,
                            factor_key = TRUE)
Producing a new dataframe
bird age mass age2 heart_rate
1 1 day_1_mass 5 day_1_heart_rate 60
2 2 day_1_mass 6 day_1_heart_rate 62
3 3 day_1_mass 3 day_1_heart_rate 45
4 1 day_2_mass 6 day_1_heart_rate 60
5 2 day_2_mass 8 day_1_heart_rate 62
6 3 day_2_mass 3 day_1_heart_rate 45
7 1 day_1_mass 5 day_3_heart_rate 55
8 2 day_1_mass 6 day_3_heart_rate 57
9 3 day_1_mass 3 day_3_heart_rate 45
10 1 day_2_mass 6 day_3_heart_rate 55
11 2 day_2_mass 8 day_3_heart_rate 57
12 3 day_2_mass 3 day_3_heart_rate 45
So it took two steps, but it worked. The questions are 1) Is there a way to do this in one step? and 2) Can it alternatively be done with one key (the "age" vector) that I can then simply replace as numeric data?
If I get the question right, you could do that by first gathering everything together, and then "spreading" on mass and heart rate:
library(forcats)
library(dplyr)
library(tidyr)

mass_levs   <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs <- names(vertical.data)[grep("heart", names(vertical.data))]

horizontal.data <- vertical.data %>%
  gather(variable, value, -bird, factor_key = TRUE) %>%
  mutate(day = stringr::str_sub(variable, 5, 5)) %>%
  mutate(variable = fct_collapse(variable,
                                 "mass" = mass_levs,
                                 "hearth_rate" = hearth_levs)) %>%
  spread(variable, value)
giving:
bird day mass hearth_rate
1 1 1 5 60
2 1 2 6 NA
3 1 3 NA 55
4 2 1 6 62
5 2 2 8 NA
6 2 3 NA 57
7 3 1 3 45
8 3 2 3 NA
9 3 3 NA 45
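As to the OP's second question: day comes out of str_sub as a character column, so it can be replaced with numeric data directly, e.g. horizontal.data$day <- as.numeric(horizontal.data$day).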
We can see how it works by going through the pipe one step at a time.
First, we gather everything into long format:
horizontal.data <- vertical.data %>%
  gather(variable, value, -bird, factor_key = TRUE)
bird variable value
1 1 day_1_mass 5
2 2 day_1_mass 6
3 3 day_1_mass 3
4 1 day_2_mass 6
5 2 day_2_mass 8
6 3 day_2_mass 3
7 1 day_1_heart_rate 60
8 2 day_1_heart_rate 62
9 3 day_1_heart_rate 45
10 1 day_3_heart_rate 55
11 2 day_3_heart_rate 57
12 3 day_3_heart_rate 45
Then, if we want to keep a "proper" long table, as the OP suggested, we have to create a single key variable. In this case, it makes sense to use the day (= age). To create the day variable, we can extract it from the character strings now in variable:
%>% mutate(day = stringr::str_sub(variable, 5, 5))
Here, str_sub gets the substring at position 5, which is the day (note that if the full dataset has multiple-digit days, you'll have to tweak this a bit, probably by splitting on _):
bird variable value day
1 1 day_1_mass 5 1
2 2 day_1_mass 6 1
3 3 day_1_mass 3 1
4 1 day_2_mass 6 2
5 2 day_2_mass 8 2
6 3 day_2_mass 3 2
7 1 day_1_heart_rate 60 1
8 2 day_1_heart_rate 62 1
9 3 day_1_heart_rate 45 1
10 1 day_3_heart_rate 55 3
11 2 day_3_heart_rate 57 3
12 3 day_3_heart_rate 45 3
Now, to finish, we have to "spread" the table to get a mass and a heart rate column.
Here we have a problem, because currently there are two levels each corresponding to mass and heart rate in the variable column. Therefore, applying spread on variable would again give us four columns.
To prevent that, we need to collapse the four levels in variable into two. We can do that by using forcats::fct_collapse, providing the association between the new level names and the "old" ones. Outside of a pipe, that would correspond to:
horizontal.data$variable <- fct_collapse(horizontal.data$variable,
                                         mass = c("day_1_mass", "day_2_mass"),
                                         hearth_rate = c("day_1_heart_rate", "day_3_heart_rate"))
However, if you have many levels, it is cumbersome to write them all out. Therefore, I find the level names corresponding to the two "categories" beforehand using:
mass_levs <- names(vertical.data)[grep("mass", names(vertical.data))]
hearth_levs <- names(vertical.data)[grep("heart", names(vertical.data))]
mass_levs
[1] "day_1_mass" "day_2_mass"
hearth_levs
[1] "day_1_heart_rate" "day_3_heart_rate"
Therefore, the third line of the pipe can be shortened to:
%>% mutate(variable = fct_collapse(variable,
                                   "mass" = mass_levs,
                                   "hearth_rate" = hearth_levs))
After this step, we have:
bird variable value day
1 1 mass 5 1
2 2 mass 6 1
3 3 mass 3 1
4 1 mass 6 2
5 2 mass 8 2
6 3 mass 3 2
7 1 hearth_rate 60 1
8 2 hearth_rate 62 1
9 3 hearth_rate 45 1
10 1 hearth_rate 55 3
11 2 hearth_rate 57 3
12 3 hearth_rate 45 3
We are now in a position to "spread" the table again according to variable, using:
%>% spread(variable, value)
bird day mass hearth_rate
1 1 1 5 60
2 1 2 6 NA
3 1 3 NA 55
4 2 1 6 62
5 2 2 8 NA
6 2 3 NA 57
7 3 1 3 45
8 3 2 3 NA
9 3 3 NA 45
HTH
If you insist on a single command, I can give you one.
Set up the data.frame:
library(data.table)
c1 <- c(1, 2, 3)
c2 <- c(5, 6, 3)
c3 <- c(6, 8, 3)
c4 <- c(60, 62, 45)
c5 <- c(55, 57, 45)
dt <- as.data.table(cbind(c1, c2, c3, c4, c5))
colnames(dt) <- c("bird", "day_1_mass", "day_2_mass", "day_1_heart_rate", "day_3_heart_rate")
Now use this single command to get the final outcome:
merge(melt(dt[, c("bird", "day_1_mass", "day_2_mass")],
           id.vars = "bird", variable.name = "age", value.name = "mass"),
      melt(dt[, c("bird", "day_1_heart_rate", "day_3_heart_rate")],
           id.vars = "bird", variable.name = "age2", value.name = "heart_rate"),
      by = "bird")
The final outcome is
bird age mass age2 heart_rate
1: 1 day_1_mass 5 day_1_heart_rate 60
2: 1 day_1_mass 5 day_3_heart_rate 55
3: 1 day_2_mass 6 day_1_heart_rate 60
4: 1 day_2_mass 6 day_3_heart_rate 55
5: 2 day_1_mass 6 day_1_heart_rate 62
6: 2 day_1_mass 6 day_3_heart_rate 57
7: 2 day_2_mass 8 day_1_heart_rate 62
8: 2 day_2_mass 8 day_3_heart_rate 57
9: 3 day_1_mass 3 day_1_heart_rate 45
10: 3 day_1_mass 3 day_3_heart_rate 45
11: 3 day_2_mass 3 day_1_heart_rate 45
12: 3 day_2_mass 3 day_3_heart_rate 45
Though this has already been answered, I have a different solution: save a list of the gather parameters you would like to run, and then run the gather_() command for each set of parameters in the list.
# Create a list of gather parameters
# Format is key, value, columns_to_gather
gather.list <- list(c("age", "mass", "day_1_mass", "day_2_mass"),
                    c("age2", "heart_rate", "day_1_heart_rate", "day_3_heart_rate"))
# Run the gather command for each list item
for (i in gather.list) {
  df <- gather_(df, key_col = i[1], value_col = i[2],
                gather_cols = i[3:length(i)], factor_key = TRUE)
}
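For completeness, newer versions of tidyr (>= 1.0.0) can do the whole reshape in a single call with pivot_longer; a sketch, assuming vertical.data as above:
library(tidyr)
pivot_longer(vertical.data, cols = -bird,
             names_pattern = "day_(\\d+)_(.*)",
             names_to = c("day", ".value"))
This yields one row per bird and day, with mass and heart_rate as columns (NA where a day was not measured).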

5 nearest neighbors based on given distance in r

I have the following dataset:
id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22
I want to find the 5 closest neighbors for each id based on the distance calculated from the given (x, y). The final output should look like the following:
id n_id dist age age_n_id
1 2 2 30 23
1 5 1.5 30 2
1 3 5 30 10
1 7 3 30 7
1 8 3 30 55
2 1 6 23 30
2 10 1 23 22
2 6 2 23 66
2 7 6 23 7
2 8 9 23 55
3 2 1 10 23
3 1 2 10 30
3 4 1.2 10 5
3 6 1.6 10 66
3 9 2.3 10 26
................................
................................
10 2 1.9 22 23
10 6 2.3 22 66
10 9 2.1 22 26
10 1 2.5 22 30
10 5 1.6 22 2
where n_id is the id of the neighbor, dist is the straight-line distance between id and n_id, age is the age of the id, and age_n_id is the age of the n_id. Also, the maximum distance should be 10 km; if there are fewer than 5 neighbors within 10 km, say 3 neighbors, the corresponding id would be repeated only three times.
I am relatively new to R programming, and any help would be much appreciated.
data.table solution:
library(data.table)
data<-fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
data[, all_x := list(list(x))]     # store all x coordinates in each row
data[, all_y := list(list(y))]     # store all y coordinates in each row
data[, all_age := list(list(age))] # store all ages in each row
data[, seq_nr := seq_len(.N)]

# Distance formula (drops the point itself via -z):
formula_distance <- function(x_1, x_2, y_1, y_2, z) {
  x_2 <- x_2[[1]][-z]
  y_2 <- y_2[[1]][-z]
  sqrt((x_1 - x_2)^2 + (y_1 - y_2)^2)
}

data <- data[, {list(dist = formula_distance(x, all_x, y, all_y, seq_nr),
                     id = seq_len(nrow(data))[-id],
                     age_id = all_age[[1]][-id],
                     age = rep(age, nrow(data) - 1))}, by = 1:nrow(data)]
data <- data[order(nrow, dist)]

# Filter data within threshold (1000 m here; the question's 10 km would be 10000):
threshold <- 1000
# How many nearest neighbors to take:
k <- 5
filtered <- data[dist <= threshold]
filtered <- filtered[, {list(dist = dist[1:k], n_id = id[1:k], n_age = age_id[1:k])},
                     by = c("nrow", "age")]
filtered <- filtered[!is.na(dist)]
setnames(filtered, "nrow", "id")
filtered
id age dist n_id n_age
1: 1 30 25.37893 4 5
2: 1 30 52.27055 5 2
3: 1 30 69.37211 2 23
4: 1 30 213.41050 3 10
5: 2 23 26.31045 5 2
6: 2 23 48.08326 4 5
7: 2 23 69.37211 1 30
8: 2 23 147.12665 3 10
9: 3 10 147.12665 2 23
10: 3 10 172.11243 5 2
11: 3 10 194.93653 4 5
12: 3 10 213.41050 1 30
13: 4 5 25.37893 1 30
14: 4 5 27.32471 5 2
15: 4 5 48.08326 2 23
16: 4 5 194.93653 3 10
17: 5 2 26.31045 2 23
18: 5 2 27.32471 4 5
19: 5 2 52.27055 1 30
20: 5 2 172.11243 3 10
21: 6 66 67.84106 9 26
22: 6 66 209.88273 7 7
23: 7 7 180.54432 9 26
24: 7 7 209.88273 6 66
25: 8 55 805.91482 10 22
26: 9 26 67.84106 6 66
27: 9 26 180.54432 7 7
28: 10 22 805.91482 8 55
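As a quick sanity check of these distances in base R (coordinates assumed to be in meters):
xy <- cbind(c(1745353, 1745317, 1745201, 1745351, 1745342,
              1746619, 1746465, 1746731, 1746629, 1745938),
            c(930284.1, 930343.4, 930433.9, 930309.4, 930335.2,
              929969.7, 929827.1, 928779.5, 929902.6, 928923.2))
d <- as.matrix(dist(xy))
d[1, 4]   # ~25.38, matching the 25.37893 in the first row above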
Assuming that the coordinates are in meters:
# Load packages
library(FNN)
library(tidyverse)
library(data.table)
# Create example data frame
dataset <- fread("id x y age
1 1745353 930284.1 30
2 1745317 930343.4 23
3 1745201 930433.9 10
4 1745351 930309.4 5
5 1745342 930335.2 2
6 1746619 929969.7 66
7 1746465 929827.1 7
8 1746731 928779.5 55
9 1746629 929902.6 26
10 1745938 928923.2 22")
# Calculate the nearest ID and distance
near_data <- get.knn(dataset[, 2:3], k = 5)
# Extract the nearest ID
nn_index <- as.data.frame(near_data$nn.index)
# Extract the nearest Distance
nn_dist <- as.data.frame(near_data$nn.dist)
# Reorganize the data
nn_index2 <- nn_index %>%
  # Add ID column
  mutate(ID = 1:10) %>%
  # Transform the data frame
  gather(Rank, n_id, -ID)

nn_dist2 <- nn_dist %>%
  # Add ID column
  mutate(ID = 1:10) %>%
  # Transform the data frame
  gather(Rank, dist, -ID)
# Remove coordinates in dataset
dataset2 <- dataset %>% select(-x, -y)
# Create the final output
nn_final <- nn_index2 %>%
  # Merge nn_index2 and nn_dist2
  left_join(nn_dist2, by = c("ID", "Rank")) %>%
  # Merge with dataset2 by ID and id
  left_join(dataset2, by = c("ID" = "id")) %>%
  # Merge with dataset2 by n_id and id
  left_join(dataset2, by = c("n_id" = "id")) %>%
  # Remove Rank
  select(-Rank) %>%
  # Rename column names
  rename(id = ID, age = age.x, age_n_id = age.y) %>%
  # Sort the data frame
  arrange(id, dist) %>%
  # Filter the dist < 10000 meters
  filter(dist < 10000)
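For a dataset this small, the neighbor lookup itself can also be sketched from a full distance matrix in base R, without FNN (assuming dataset as read above):
dmat <- as.matrix(dist(dataset[, c("x", "y")]))
diag(dmat) <- Inf                                       # exclude self-matches
nn_ids <- t(apply(dmat, 1, function(d) order(d)[1:5]))  # 5 nearest row ids per point
nn_ids[1, ]   # neighbors of id 1, nearest first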
