Let's say that we have the following data frame:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42)))
colnames(x)<- c("ID","Visit", "Age")
The first column is the subject ID, the second the visit number, and the third the age recorded at each of these consecutive visits.
What would be the easiest way of finding visits where the age is inconsistent with the age at the previous visit? For example, in row 13 subject C is 66 years old although he was already 84 at the previous visit, and in row 16 subject D is 32 years old although he was already 38 at the previous visit.
How can I highlight these potential errors and remove rows 13 and 16?
I have tried to aggregate by IDs and look for the difference between ages across visits, but it seems hard for me since the error could occur in any visit.
How about this in base R?
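df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w) {
  w <- w[order(w$Visit), ]                   # sort each subject's visits first
  age <- as.numeric(as.character(w$Age))     # cbind() left Age as character/factor
  w[c(1, which(diff(age) > 0) + 1), ]        # first row plus rows where Age increased
}))
df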
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
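To highlight the suspicious rows instead of dropping them straight away, a minimal base R sketch follows. It assumes the data is rebuilt with numeric Visit and Age columns (cbind() on mixed vectors coerces everything to character); the column name suspect and the helper vector prev_age are illustrative, not part of the original answer.

```r
# Same data as in the question, but with numeric Visit and Age (assumption)
x <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1:3, 1:5, 1:5, 1:5),
  Age   = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 84, 66, 22, 38, 32, 40, 42)
)

# Make sure visits are in order within each subject
x <- x[order(x$ID, x$Visit), ]

# Age at the previous visit of the same subject (NA for the first visit)
prev_age <- ave(x$Age, x$ID, FUN = function(a) c(NA, head(a, -1)))

# Flag rows whose Age is lower than at the previous visit
x$suspect <- !is.na(prev_age) & x$Age < prev_age

x[x$suspect, ]                                     # rows to inspect
cleaned <- x[!x$suspect, c("ID", "Visit", "Age")]  # rows to keep
```

Keeping the flag as a column makes it possible to review the rows before deciding to delete them.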
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
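One caveat with comparing only against the previous row: after an erroneously high value, only the first following row is flagged. A possible variant (an assumption on my side, not something the answers above do) compares each age with the running maximum per subject instead. The sketch below uses base R and assumes numeric columns already sorted by Visit within ID; below_max is an illustrative column name.

```r
x <- data.frame(
  ID    = rep(c("A", "B", "C", "D"), times = c(3, 5, 5, 5)),
  Visit = c(1:3, 1:5, 1:5, 1:5),
  Age   = c(14, 28, 42, 14, 46, 64, 71, 85, 14, 28, 51, 84, 66, 22, 38, 32, 40, 42)
)

# Highest age seen so far for each subject
running_max <- ave(x$Age, x$ID, FUN = cummax)

# A row is suspicious if its age is below an age recorded earlier
x$below_max <- x$Age < running_max

# Hypothetical series showing where the two rules differ
demo <- c(20, 90, 30, 35)
by_diff   <- c(0, diff(demo)) < 0   # flags only the 30
by_cummax <- demo < cummax(demo)    # flags the 30 and the 35
```

On the example data both rules flag the same two rows; they only differ when several low values follow a high one.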
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE
I want to take the values of a (large) data frame:
library(tidyverse)
df.grid = expand.grid(x = letters, y = 1:60)
head(df.grid)
x y
1 a 1
2 b 1
3 c 1
4 d 1
5 e 1
6 f 1
[...]
Which eventually reaches a 2, a 3, etc.
I also have a second data frame that lists some of the "x" values, each with its own range (min, max) of "y" values that I want to keep:
sub.data = data.frame(x = c("a","c","d"), min = c(2,50,25), max = c(6,53,30))
sub.data
x min max
1 a 2 6
2 c 50 53
3 d 25 30
The output should look like something like this:
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 c 50
7 c 51
8 c 52
9 c 53
10 d 25
11 d 26
12 d 27
13 d 28
14 d 29
15 d 30
I've tried this:
df.grid %>%
group_by(x) %>%
filter_if(y > sub.data$min)
But it doesn't work as the min column has multiple values and the 'if' part complains.
I also found this post, but it doesn't seem to work for me as there is no 'matching' variables to guide the filtering process.
I want to avoid using for loops since I want to apply this to a data frame that is 11GB in size.
We could use a non-equi join
library(data.table)
setDT(df.grid)[, y1 := y][sub.data, .(x, y), on = .(x, y1 >= min, y1 <= max)]
-output
x y
1: a 2
2: a 3
3: a 4
4: a 5
5: a 6
6: c 50
7: c 51
8: c 52
9: c 53
10: d 25
11: d 26
12: d 27
13: d 28
14: d 29
15: d 30
With dplyr version 1.1.0 or later, we could also use non-equi joins with join_by
library(dplyr)
inner_join(df.grid, sub.data, by = join_by(x, y >= min , y <= max)) %>%
select(x, y)
-output
x y
1 a 2
2 a 3
3 a 4
4 a 5
5 a 6
6 d 25
7 d 26
8 d 27
9 d 28
10 d 29
11 d 30
12 c 50
13 c 51
14 c 52
15 c 53
Or, as @Davis Vaughan mentioned, use between() with a left_join
left_join(sub.data, df.grid, by = join_by(x, between(y$y, x$min,
x$max))) %>%
select(names(df.grid))
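For checking the result on a small sample without extra packages, the same range filter can be sketched in base R with merge() followed by a row filter. This is only an illustration: it materialises every row of df.grid whose x appears in sub.data before filtering, so it is not meant for the 11GB case. The names m and res are illustrative.

```r
df.grid  <- expand.grid(x = letters, y = 1:60)
sub.data <- data.frame(x = c("a", "c", "d"), min = c(2, 50, 25), max = c(6, 53, 30))

# Inner join on x: only the x values listed in sub.data survive
m <- merge(df.grid, sub.data, by = "x")

# Keep the rows whose y lies inside that x's own [min, max] range
res <- m[m$y >= m$min & m$y <= m$max, c("x", "y")]
res <- res[order(res$x, res$y), ]
rownames(res) <- NULL
res
```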
I'm trying to create a dataset for each combination of rows from separate groups. Ideally, one row from each group would be selected and there would be a dataset for every combination. I have a dataset that looks similar in structure to the sample below:
Name Group Stat1 Stat2
1 1 a 63 38
2 2 a 33 62
3 3 b 3 66
4 4 b 57 67
5 5 c 42 69
6 6 c 47 14
7 7 c 16 10
8 8 d 21 46
9 9 d 72 1
Trying to get the end result of the first dataset to look like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 8 d 21 46
With the second dataset looking like this:
Name Group Stat1 Stat2
1 1 a 63 38
2 3 b 3 66
3 5 c 42 69
4 9 d 72 1
Until every combination has been exhausted. I've tried strategies using apply functions and combn but cannot seem to get the result I want. This does not seem too challenging to me conceptually, so I'm not sure what I'm missing.
Any help would be greatly appreciated! Thanks in advance!
Lots of ways to approach this. A simple solution is to just generate all 4 row combos, then subset to those with all distinct Group values. I named your data df and assumed Name would be unique row id. If that's not true, you could replace df$Name with 1:nrow(df)
# All 4 row combos of row ids
combs <- combn(df$Name, 4)
# Match group labels to row ids
g <- matrix(df$Group[combs], nrow = 4)
# 4 row combs filtered to all distinct group vals
combs <- combs[,apply(g, 2, function(i) all(!duplicated(i)))]
# For each 4 row combo, extract rows from the dataframe
final_list <- apply(combs, 2, function(i) df[i,])
final_list[1:3]
[[1]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
8 8 d 21 46
[[2]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
5 5 c 42 69
9 9 d 72 1
[[3]]
Name Group Stat1 Stat2
1 1 a 63 38
3 3 b 3 66
6 6 c 47 14
8 8 d 21 46
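An alternative sketch that avoids generating and then discarding invalid combinations: split the row indices by group and let expand.grid() enumerate one index per group. The data frame below is retyped from the question; rows_by_group and idx are illustrative names. Note that expand.grid() varies the first group fastest, so the list order differs from the combn() result above.

```r
df <- data.frame(
  Name  = 1:9,
  Group = c("a", "a", "b", "b", "c", "c", "c", "d", "d"),
  Stat1 = c(63, 33, 3, 57, 42, 47, 16, 21, 72),
  Stat2 = c(38, 62, 66, 67, 69, 14, 10, 46, 1)
)

# Row indices belonging to each group
rows_by_group <- split(seq_len(nrow(df)), df$Group)

# One row of idx per combination: exactly one row index from every group
idx <- expand.grid(rows_by_group)

# Extract the rows of df for each combination
final_list <- lapply(seq_len(nrow(idx)), function(i) df[unlist(idx[i, ]), ])

length(final_list)   # 2 * 2 * 3 * 2 combinations
final_list[[1]]
```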
I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both data frames, keeping only the rows of data.frame a, and sum the columns that have the same name, so at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to summarize the columns.
My dataframe is pretty big so if the solution can speed up things that would even be better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
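If building a query string feels heavy, the same sums can be sketched without eval(parse()) by stacking the matching rows of both data frames and aggregating by ID. This is a base R illustration rather than the data.table approach above, and I have not compared the two for speed; both and z are illustrative names.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"), a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"), a = c(9, 10, 11, 39, 5, 0),
                b = c(4, 2, 7, 54, 12, 34), c = c(12, 0, 34, 23, 13, 14))

# Stack a on top of the rows of b whose ID also occurs in a
both <- rbind(a, b[b$ID %in% a$ID, ])

# Sum every remaining column within each ID
z <- aggregate(. ~ ID, data = both, FUN = sum)

# aggregate() sorts by ID; restore the row order of a
z <- z[match(a$ID, z$ID), ]
rownames(z) <- NULL
z
```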
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get matching rows from a in b and vice versa, so we can be sure that we only have those rows that appear in both data frames (and we now know their row-indices in both data frames). Then, simply use vectorized additions for those matching rows, but omit ID, as factor cannot be summed up; add ID back manually.
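Note that the two match() calls above line up here only because the shared IDs appear in the same relative order in both data frames. A sketch that does not depend on row order uses a single match() from a into b. The shuffle of b below is a hypothetical step added purely to demonstrate this; idx and keep are illustrative names.

```r
a <- data.frame(ID = c("A", "B", "Z", "H"), a = c(0, 1, 2, 45), b = c(3, 4, 5, 22), c = c(6, 7, 8, 3))
b <- data.frame(ID = c("A", "B", "E", "W", "Z", "H"), a = c(9, 10, 11, 39, 5, 0),
                b = c(4, 2, 7, 54, 12, 34), c = c(12, 0, 34, 23, 13, 14))

# Hypothetical: shuffle b to show that row order no longer matters
b <- b[c(6, 3, 1, 5, 2, 4), ]

# For every row of a, the position of the same ID in b (NA if absent)
idx  <- match(a$ID, b$ID)
keep <- !is.na(idx)

# Add the numeric columns row by row, then put ID back in front
df <- cbind(ID = a$ID[keep], a[keep, -1] + b[idx[keep], -1])
df
```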
You cannot add the two data frames directly because they have different numbers of rows. To make them the same size, keep only the rows of b whose ID is present in a and then add them element-wise. This assumes the matching IDs appear in the same order in both data frames.
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17
Here is a problem I am trying to solve. Say, I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting my sampletimes to a wide format (i.e. all the time1 and time2 information in one row per entry for id), merging the two data frames by the id variable, and using conditional statements to take only instances when the time falls between at least one of the time intervals in the row, and then assigning location to the appropriate measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement letters[1:20]
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t
Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# extract valid rows
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t
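merge() by id first builds every observation/interval pair for an id, which can get large with millions of rows. One way to limit the size of the intermediate result, sketched below under the assumption that each id fits comfortably in memory, is to do the merge and filter one id at a time. set.seed() is added only to make the example reproducible; res is an illustrative name.

```r
set.seed(1)  # only for reproducibility of the random measurements
observations <- data.frame(id = rep(rep(c(1, 2, 3, 4), each = 5), 5),
                           time = c(rep(1:5, 4), rep(6:10, 4), rep(11:15, 4),
                                    rep(16:20, 4), rep(21:25, 4)),
                           measurement = rnorm(100, 5, 7))
sampletimes <- data.frame(location = letters[1:20],
                          id = rep(1:4, 5),
                          time1 = rep(c(2, 7, 12, 17, 22), each = 4),
                          time2 = rep(c(4, 9, 14, 19, 24), each = 4))

# Process one id at a time so only that id's pairs exist at once
res <- do.call(rbind, lapply(split(observations, observations$id), function(obs) {
  st <- sampletimes[sampletimes$id == obs$id[1], ]
  m  <- merge(obs, st, by = "id")
  m[m$time > m$time1 & m$time < m$time2, c("id", "time", "measurement", "location")]
}))

res[order(res$time, res$id), ]
```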
Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
EDIT
Since you have millions of rows, you should try a data.table solution:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
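The merge above still creates all pairs per id before filtering. As a hedged alternative, data.table also supports non-equi joins in on=, which apply the interval condition during the join. The sketch below assumes a data.table version with non-equi join support (1.9.8 or later); x.time refers to the time column of OBS, and res is an illustrative name.

```r
library(data.table)

set.seed(1)  # only for reproducibility of the random measurements
observations <- data.frame(id = rep(rep(c(1, 2, 3, 4), each = 5), 5),
                           time = c(rep(1:5, 4), rep(6:10, 4), rep(11:15, 4),
                                    rep(16:20, 4), rep(21:25, 4)),
                           measurement = rnorm(100, 5, 7))
sampletimes <- data.frame(location = letters[1:20],
                          id = rep(1:4, 5),
                          time1 = rep(c(2, 7, 12, 17, 22), each = 4),
                          time2 = rep(c(4, 9, 14, 19, 24), each = 4))

OBS <- as.data.table(observations)
SAM <- as.data.table(sampletimes)

# For each interval in SAM, find the observations with the same id
# whose time lies strictly between time1 and time2
res <- OBS[SAM,
           .(id, time = x.time, measurement, location),
           on = .(id, time > time1, time < time2),
           nomatch = NULL]
res
```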
I am trying to remove duplicate observations from a data set based on my variable, id. However, I want the removal of observations to follow specific rules. The variables below are id, the sex of the household head (1-male, 2-female) and the age of the household head. The rules are as follows. If a household has both a male and a female household head, remove the female household head observation. If a household has either two male or two female heads, remove the observation with the younger household head. An example data set is below.
id = c(1,2,2,3,4,5,5,6,7,8,8,9,10)
sex = c(1,1,2,1,2,2,2,1,1,1,1,2,1)
age = c(32,34,54,23,32,56,67,45,51,43,35,80,45)
data = data.frame(cbind(id,sex,age))
You can do this by first ordering the data.frame so the desired entry for each id is first, and then remove the rows with duplicate ids.
d <- with(data, data[order(id, sex, -age),])
# id sex age
# 1 1 1 32
# 2 2 1 34
# 3 2 2 54
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 6 5 2 56
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 11 8 1 35
# 12 9 2 80
# 13 10 1 45
d[!duplicated(d$id), ]
# id sex age
# 1 1 1 32
# 2 2 1 34
# 4 3 1 23
# 5 4 2 32
# 7 5 2 67
# 8 6 1 45
# 9 7 1 51
# 10 8 1 43
# 12 9 2 80
# 13 10 1 45
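The ordering trick above encodes both rules in the sort. If it helps to see the rules spelled out, here is a sketch that applies them per household with split() and lapply(); pick_head is a hypothetical helper name, and the approach is likely slower than order() plus duplicated() on large data.

```r
id  <- c(1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 8, 9, 10)
sex <- c(1, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 1)
age <- c(32, 34, 54, 23, 32, 56, 67, 45, 51, 43, 35, 80, 45)
data <- data.frame(id, sex, age)

# Hypothetical helper: choose one head per household
pick_head <- function(h) {
  h <- h[h$sex == min(h$sex), ]   # rule 1: prefer male (1) over female (2)
  h[which.max(h$age), ]           # rule 2: among the rest, keep the oldest
}

result <- do.call(rbind, lapply(split(data, data$id), pick_head))
result
```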
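With data.table, this is easy with "compound queries" (chained [ ] calls). Set the key to "id,sex" when you create the data.table so that the rows are sorted by id and then by sex; this is required in case any female rows come before male rows for a given id. The first call takes the maximum age per id and sex, and the second keeps only the first row for each id.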
> library(data.table)
> DT <- data.table(data, key = "id,sex")
> DT[, max(age), by = key(DT)][!duplicated(id)]
id sex V1
1: 1 1 32
2: 2 1 34
3: 3 1 23
4: 4 2 32
5: 5 2 67
6: 6 1 45
7: 7 1 51
8: 8 1 43
9: 9 2 80
10: 10 1 45