R: subset a data frame based on conditions from another data frame

Here is a problem I am trying to solve. Say I have two data frames like the following:
observations <- data.frame(id = rep(rep(c(1,2,3,4), each=5), 5),
time = c(rep(1:5,4), rep(6:10,4), rep(11:15,4), rep(16:20,4), rep(21:25,4)),
measurement = rnorm(100,5,7))
sampletimes <- data.frame(location = letters[1:20],
id = rep(1:4,5),
time1 = rep(c(2,7,12,17,22), each=4),
time2 = rep(c(4,9,14,19,24), each=4))
They both contain a column named id, which links the data frames. I want to have the measurements from observations for which time is between time1 and time2 from the sampletimes data frame. Additionally, I'd like to connect the appropriate location to each measurement.
I have successfully done this by converting sampletimes to a wide format (i.e. all the time1 and time2 information in one row per id), merging the two data frames by the id variable, using conditional statements to keep only instances where time falls inside at least one of the time intervals in the row, and then assigning the appropriate location to each measurement.
However, I have around 2 million rows in observations and doing this takes a really long time. I'm looking for a better way where I can keep the data in long format. The example dataset is very simple, but in reality, my data contains variable numbers of intervals and locations per id.
For our example, the data frame I would hope to get back would be as follows:
id time measurement location
1 3 10.5163892 a
2 3 5.5774119 b
3 3 10.5057060 c
4 3 14.1563179 d
1 8 2.2653761 e
2 8 -1.0905546 f
3 8 12.7434161 g
4 8 17.6129261 h
1 13 10.9234673 i
2 13 1.6974481 j
3 13 -0.3664951 k
4 13 13.8792198 l
1 18 6.5038847 m
2 18 1.2032935 n
3 18 15.0889469 o
4 18 0.8934357 p
1 23 3.6864527 q
2 23 0.2404074 r
3 23 11.6028766 s
4 23 20.7466908 t

Here's a proposal with merge:
# merge both data frames
dat <- merge(observations, sampletimes, by = "id")
# keep rows where time lies strictly between time1 and time2,
# and keep only the first four columns (id, time, measurement, location)
dat2 <- dat[dat$time > dat$time1 & dat$time < dat$time2, seq(4)]
# sort
dat2[order(dat2$time, dat2$id), ]
The result:
id time measurement location
11 1 3 7.086246 a
141 2 3 6.893162 b
251 3 3 16.052627 c
376 4 3 -6.559494 d
47 1 8 11.506810 e
137 2 8 10.959782 f
267 3 8 11.079759 g
402 4 8 11.082015 h
83 1 13 5.584257 i
218 2 13 -1.714845 j
283 3 13 -11.196792 k
418 4 13 8.887907 l
99 1 18 1.656558 m
234 2 18 16.573179 n
364 3 18 6.522298 o
454 4 18 1.005123 p
125 1 23 -1.995719 q
250 2 23 -6.676464 r
360 3 23 10.514282 s
490 4 23 3.863357 t

Not efficient, but it does the job:
subset(merge(observations,sampletimes), time > time1 & time < time2)
id time measurement location time1 time2
11 1 3 3.180321 a 2 4
47 1 8 6.040612 e 7 9
83 1 13 -5.999317 i 12 14
99 1 18 2.689414 m 17 19
125 1 23 12.514722 q 22 24
137 2 8 4.420679 f 7 9
141 2 3 11.492446 b 2 4
218 2 13 6.672506 j 12 14
234 2 18 12.290339 n 17 19
250 2 23 12.610828 r 22 24
251 3 3 8.570984 c 2 4
267 3 8 -7.112291 g 7 9
283 3 13 6.287598 k 12 14
360 3 23 11.941846 s 22 24
364 3 18 -4.199001 o 17 19
376 4 3 7.133370 d 2 4
402 4 8 13.477790 h 7 9
418 4 13 3.967293 l 12 14
454 4 18 12.845535 p 17 19
490 4 23 -1.016839 t 22 24
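To also drop the helper columns time1 and time2 shown above, subset() accepts a select argument; a small sketch:
subset(merge(observations, sampletimes),
       time > time1 & time < time2,
       select = c(id, time, measurement, location))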
EDIT
Since you have around 2 million rows, you should give a data.table solution a try:
library(data.table)
OBS <- data.table(observations)
SAM <- data.table(sampletimes)
merge(OBS,SAM,allow.cartesian=TRUE,by='id')[time > time1 & time < time2]
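With data.table 1.9.8 or later, a non-equi join avoids materializing the cartesian merge entirely; a minimal sketch, assuming the same OBS and SAM tables (the x.time prefix pulls the time column of OBS, since join columns otherwise take their values from SAM):
OBS[SAM,
    on = .(id, time > time1, time < time2),
    nomatch = 0L,
    .(id, time = x.time, measurement, location)]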

Related

add rows to data frame for non-observations

I have a dataframe that summarizes the number of times birds were observed at their breeding site on each day and each hour during daytime (i.e., when the sun was above the horizon). Example:
head(df)
ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7
However, this dataframe does not include hours when the bird was not observed. E.g., there is no line for bird 19 on day 202 at hour 14 with an nObs value of 0.
I'd like to find a way, preferably with dplyr (tidyverse), to add in those rows for when individuals were not observed.
You can use complete from tidyr, i.e.
library(tidyverse)
df %>%
group_by(ID, site) %>%
complete(hr = seq(min(hr), max(hr)))
which gives,
# A tibble: 9 x 5
# Groups: ID, site [2]
ID site hr day nObs
<int> <fct> <int> <int> <int>
1 8 B 8 188 6
2 8 B 9 188 6
3 8 B 10 NA NA
4 8 B 11 188 7
5 19 A 11 202 60
6 19 A 12 NA NA
7 19 A 13 202 18
8 19 A 14 NA NA
9 19 A 15 202 27
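Note that the filled-in rows get NA for day and nObs. If you'd rather record an explicit zero count, complete() takes a fill argument; the same pipeline with that tweak:
df %>%
  group_by(ID, site) %>%
  complete(hr = seq(min(hr), max(hr)), fill = list(nObs = 0))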
One way to do this would be to first build a "template" of all possible combinations where birds can be observed and then merge ("left join") the actual observations onto that template:
a = read.table(text = " ID site day hr nObs
1 19 A 202 11 60
2 19 A 202 13 18
3 19 A 202 15 27
4 8 B 188 8 6
5 8 B 188 9 6
6 8 B 188 11 7")
tpl <- expand.grid(c(unique(a[, 1:3]), list(hr = 1:24)))
merge(tpl, a, all.x = TRUE)
Edit based on comment by #user3220999: in case we want to do the process per ID, we can just use split to get a list of data.frames per ID, get a list of templates and mapply merge on the two lists:
a <- split(a, a$ID)
tpl <- lapply(a, function(ai) {
expand.grid(c(unique(ai[, 1:3]), list(hr = 1:24)))
})
res <- mapply(merge, tpl, a, SIMPLIFY = FALSE, MoreArgs = list(all.x = TRUE))
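To reassemble the per-ID results into one data frame, you can rbind the list back together (assuming the res list from above):
res_all <- do.call(rbind, res)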

R - Issues while calling a user-defined function

I have the following dataframe named "dataset"
> dataset
V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450
I have two helper functions as below
get_A <- function(p){
  return(data.frame(Scorecard = p,
                    Results = dataset[nrow(dataset), (p + 1)]))
} # Pulls the value from the last row for a given value of p (offset by 1 column)
get_P <- function(p){
  return(data.frame(Scorecard = p,
                    Results = dataset[p, ncol(dataset)]))
} # Pulls the value from the last column for a given value of p
I have the following dataframe on which I need to run the above helper functions. There will be NAs because I'm reading this "data_sub" dataframe from an Excel file, which can have unequal numbers of rows for the two columns.
> data_sub
Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA
When I call the helper functions, I get some strange results as shown below:
> get_P(data_sub[complete.cases(data_sub$Key_P),]$Key_P)
Scorecard Results
1 2 1837
2 3 315
3 4 621
> get_A(data_sub[complete.cases(data_sub$Key_A),]$Key_A)
Scorecard Results.V2 Results.V4 Results.V6
1 1 12 8 11
2 3 12 8 11
3 5 12 8 11
Warning message:
In data.frame(Scorecard = p, Results = dataset[nrow(dataset), (p + :
row names were found from a short variable and have been discarded
The call to the get_P() helper function is working the way I want. I'm getting the "Results" for each non-NA value in data_sub$Key_P as a dataframe.
But the call to the get_A() helper function is giving strange results and also a warning. I was expecting it to give a data frame similar to the one returned by the call to get_P(). Why is this happening, and how can I make get_A() give the correct data frame? Basically, the output should be
Scorecard Results
1 1 12
2 3 8
3 5 11
I found this link related to the warning but it's unhelpful in solving my issue.
The following works
get_P <- function(df, data_sub) {
data_sub <- data_sub[complete.cases(data_sub), ]
data.frame(
Scorecard = data_sub$Key_P,
Results = df[data_sub$Key_P, ncol(df)])
}
get_P(df, data_sub)
# Scorecard Results
#1 2 1837
#2 3 315
#3 4 621
get_A <- function(df, data_sub) {
data_sub <- data_sub[complete.cases(data_sub), ]
data.frame(
Scorecard = data_sub$Key_A,
Results = as.numeric(df[nrow(df), data_sub$Key_A + 1]))
}
get_A(df, data_sub)
# Scorecard Results
#1 1 12
#2 3 8
#3 5 11
To avoid the warning, we need to strip row names with as.numeric in get_A. The warning arises because dataset[nrow(dataset), p + 1] with a vector p returns a one-row data frame (one column per element of p) rather than a vector, so data.frame() recycles it into several Results columns and discards its row names; as.numeric() flattens it into a plain vector first.
Another tip: it's better coding practice to make get_P and get_A functions of both df and data_sub, to avoid relying on global variables.
Sample data
df <- read.table(text =
" V1 V2 V3 V4 V5 V6 V7
1 A 29 27 0 14 21 163
2 W 70 40 93 63 44 1837
3 E 11 1 11 49 17 315
4 S 20 59 36 23 14 621
5 C 12 7 48 24 25 706
6 B 14 8 78 27 17 375
7 G 12 7 8 4 4 257
8 T 0 0 0 0 0 0
9 N 32 6 9 14 17 264
10 R 28 46 49 55 38 608
11 O 12 2 8 12 11 450", header = T, row.names = 1)
data_sub <- read.table(text =
" Key_P Key_A
1 2 1
2 3 3
3 4 5
4 NA NA", header = T, row.names = 1)

Check time series incongruencies

Let's say that we have the following data frame:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42)))
colnames(x)<- c("ID","Visit", "Age")
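Note that as.data.frame(cbind(...)) coerces every column to character (and, in older versions of R, to factors), so arithmetic like diff(Age) will not behave as intended. Before running the answers below, convert the numeric columns, e.g.:
x$Visit <- as.numeric(as.character(x$Visit))
x$Age <- as.numeric(as.character(x$Age))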
The first column represents subject ID, the second a list of observations, and the third the age at each of these consecutive observations.
What would be the easiest way of finding visits where the age is wrong given the age at the previous visit? (E.g. in row 13, subject C is 66 years old when in the previous visit he was already 84; or in row 16, subject D is 32 years old when in the previous visit he was already 38.)
What would be the way to highlight the potential errors and remove rows 13 and 16?
I have tried to aggregate by ID and look at the difference between ages across visits, but it seems hard since the error could occur at any visit.
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
  w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]))
df
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
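If you first want to flag the suspect rows instead of dropping them right away, the same diff() logic can fill a marker column; a small sketch along the same lines:
x %>% group_by(ID) %>% mutate(suspect = c(0, diff(Age)) < 0)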
The aggregate() approach is pretty concise: the per-ID logical vectors end up in a list column Age, and do.call(c, ...) flattens them into a single logical index aligned with the rows of x (which is already sorted by ID).
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE

Subset data frame where values are greater than another data frame

Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
a b c class
1 31 73 28 3
2 44 33 57 3
3 19 35 53 0
4 68 70 39 4
5 92 7 57 2
6 13 67 23 3
7 73 50 14 2
8 59 14 91 5
9 37 3 72 5
10 27 3 13 4
11 63 28 0 5
12 51 7 35 4
13 11 36 76 3
14 72 25 8 5
15 23 24 6 3
16 15 1 16 5
17 55 24 5 5
18 2 54 39 1
19 54 95 20 3
20 60 39 65 1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
a b c class
1 8 15 13 0
2 0 3 6 1
3 14 4 0 2
4 7 10 6 3
5 18 18 16 4
6 17 17 11 5
How do I subset the first data frame so that I keep only the rows where a, b, and c are all greater than the corresponding values in the second data frame for that class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame holds the lowest acceptable values for the first data frame.
As commented by Frank, this can be done with non-equi joins.
# coerce to data.table
tmp <- setDT(df1)[
  # non-equi join to find which rows of df1 fulfill the conditions in df2
  setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE]
# return subset in original order of df1
df1[sort(tmp)]
a b c class
1: 31 73 28 3
2: 44 33 57 3
3: 19 35 53 0
4: 68 70 39 4
5: 92 7 57 2
6: 13 67 23 3
7: 73 50 14 2
8: 11 36 76 3
9: 2 54 39 1
10: 54 95 20 3
11: 60 39 65 1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set. This saves us from creating a row id column before the join. (Credit to #Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
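For comparison, the same subset can be written in base R with a merge plus a filter, at the cost of materializing the full equi-join on class first; a sketch (fine for small data; the ".min" suffix is just a label chosen here):
m <- merge(df1, df2, by = "class", suffixes = c("", ".min"))
m[m$a > m$a.min & m$b > m$b.min & m$c > m$c.min, c("a", "b", "c", "class")]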

Merge two datasets in R

I have two different datasets arranged in column format as follows:
Dataset 1:
A B C D E
13 1 1.7 2 1
13 2 5.3 2 1
13 2 2 2 1
13 2 1.8 2 1
1 6 27 9 1
1 6 6.6 9 1
1 7 17 9 1
1 7 7.1 9 1
1 7 8.5 9 1
Dataset 2:
A B F G
13 1 42 1002
13 2 42 1002
13 2 42 1002
13 2 42 1002
13 3 42 1002
13 4 42 1002
13 5 42 1002
1 2 27 650
1 3 27 650
1 4 27 650
1 6 27 650
1 7 27 650
1 7 27 650
1 7 27 650
1 8 27 650
The number of rows in each dataset varies, but both contain data for two samples (for example, column A: 13 and 1 in both datasets). I want the C, D, and E values of dataset 1 to be placed into dataset 2 on rows having the same values of A and B in both datasets. So the join should be based on A and B. I need to do this for about 47560 rows.
I am new to R, so I would be thankful for code to create and save the new merged dataset.
Use the merge function in R.
Reference: http://www.statmethods.net/management/merging.html
Edit:
So first you'd need to read in the datasets; CSV is a good format.
> dataset1 <- read.csv(file="dataset1.csv", header=TRUE, sep=",")
> dataset2 <- read.csv(file="dataset2.csv", header=TRUE, sep=",")
If you just type the variable names now and hit enter you should see a read-out of your datasets. So...
> dataset1
should read out your data above. Then I believe the following should do the merge (I may be wrong):
> dataset1_2 <- merge(dataset1, dataset2, by=c("A","B"))
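Note that merge() with by = c("A","B") keeps only rows that match in both datasets. If you want to keep every row of dataset2 and fill unmatched C, D, and E values with NA, merge() has an all.y argument; a small variant:
> dataset1_2 <- merge(dataset1, dataset2, by=c("A","B"), all.y=TRUE)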
EDIT 2:
> write.table(dataset1_2, "c:/dataset1_2.txt", sep=" ")
Reference: http://www.statmethods.net/input/exportingdata.html
