subset data frame on vector sequence - r

I have the data frame df and I want to subset df based on a number sequence within a categorical.
x <- c(1,2,3,4,5,7,9,11,13)
x2 <- x+77
df <- data.frame(x=c(x,x2),y= c(rep("A",9),rep("B",9)))
df
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 7 A
7 9 A
8 11 A
9 13 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
15 84 B
16 86 B
17 88 B
18 90 B
I want only the rows where x increments by 1 and not the rows where x increases by two: e.g.
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
10 78 B
11 79 B
12 80 B
13 81 B
14 82 B
I figured I have to do some dort of subtraction between elements and check if the difference is >1 and combine this with a ddply but this seems cumbersome. Is there a sort of sequence function I am missing?

using diff
df[which(c(1,diff(df$x))==1),]

Your example seems to behave well and can be nicely handled by #agstudy's answer. Should your data act up one day, though...
myfun <- function(d, whichDiff = 1) {
# d is the data.frame you'd like to subset, containing the variable 'x'
# whichDiff is the difference between values of x you're looking for
theWh <- which(!as.logical(diff(d$x) - whichDiff))
# Take the diff of x, subtract whichDiff to get the desired values equal to 0
# Coerce this to a logical vector and take the inverse (!)
# which() gets the indexes that are TRUE.
# allWh <- sapply(theWh, "+", 1)
# Since the desired rows may be disjoint, use sapply to get each index + 1
# Seriously? sapply to add 1 to a numeric vector? Not even on a Friday.
allWh <- theWh + 1
return(d[sort(unique(c(theWh, allWh))), ])
}
> library(plyr)
>
> ddply(df, .(y), myfun)
x y
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 78 B
7 79 B
8 80 B
9 81 B
10 82 B

Related

Subset and filter a dataframe by logical operators and select the foregoing rows

I have the following "random" Dataframe and want to apply a subset based on logical operators and then want to extract the foregoing rows:
set.seed(3)
Sample_Data <- data.frame(A = c(1:100, 1:100, 1:100), B = c(100:1, 100:1, 100:1))
print(Sample_Data)
Test_subset <- subset(Sample_Data, subset = A == 1 & B == 100)
Test_subset
A B
1 1 100
101 1 100
201 1 100
To make the subset with logical operators is no problem.
But now I want to know whether it is possible to create the following filter in R: "Filter all lines with the following criteria (see above) and also output the 10 lines in front of the correspondingly filtered lines."
Does anyone know a solution for this?
As noted in a comment, it does not make sense to filter rows that do not exist (there are none before row #1). Therefore, here's a solution for a filtering with slightly different parameters. Say, you want to filter target rows where A == 11 & B == 90 (this value combination also occurs 3 times in your data) and you want to get the five rows preceding the target rows. You can first define a function to get the indices of the rows in question:
Sequ <- function(col1, col2) {
# get row indices of target row with function `which`
inds <- which(col1 == 11 & col2 == 90)
# sort row indices of the rows before target row AND target row itself
sort(unique(c(inds-5, inds-4, inds-3,inds-2, inds-1, inds)))
}
Next you can use this function as input for slice:
library(dplyr)
Sample_Data %>%
slice(Sequ(col1 = A, col2 = B))
A B
1 6 95
2 7 94
3 8 93
4 9 92
5 10 91
6 11 90
7 6 95
8 7 94
9 8 93
10 9 92
11 10 91
12 11 90
13 6 95
14 7 94
15 8 93
16 9 92
17 10 91
18 11 90
You can add a column with row number to make this process easier.
Sample_Data$row <- seq(nrow(Sample_Data))
Test_subset <- subset(Sample_Data, subset = A == 1 & B == 100)
Test_subset
# A B row
#1 1 100 1
#101 1 100 101
#201 1 100 201
For every row in above subset select the next 10 rows.
result <- Sample_Data[unique(c(t(outer(Test_subset$row, 0:10, `+`)))), ]
result
# A B row
#1 1 100 1
#2 2 99 2
#3 3 98 3
#4 4 97 4
#5 5 96 5
#6 6 95 6
#7 7 94 7
#8 8 93 8
#9 9 92 9
#10 10 91 10
#11 11 90 11
#101 1 100 101
#102 2 99 102
#...
#...

Check time series incongruencies

Let's say that we have the following matrix:
x<- as.data.frame(cbind(c("A","A","A","B","B","B","B","B","C","C","C","C","C","D","D","D","D","D"),
c(1,2,3,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
c(14,28,42,14,46,64,71,85,14,28,51,84,66,22,38,32,40,42)))
colnames(x)<- c("ID","Visit", "Age")
The first column represents subject ID, the second a list of observations and the third the age at each of this consecutive observations.
Which would be the easiest way of finding visits where the age is wrong according to the previous visit age. (i.e. in row 13, subject C is 66 years old, when in the previous visit he was already 84 or in row 16, subject D is 32 years old, when in the previous visit he was already 38).
Which would be the way of highlighting the potential errors and removing rows 13 and 16?
I have tried to aggregate by IDs and look for the difference between ages across visits, but it seems hard for me since the error could occur in any visit.
How about this in base R?
df <- do.call(rbind.data.frame, lapply(split(x, x$ID), function(w)
w[c(1, which(diff(w[order(w$Visit), "Age"]) > 0) + 1), ]));
df;
# ID Visit Age
#A.1 A 1 14
#A.2 A 2 28
#A.3 A 3 42
#B.4 B 1 14
#B.5 B 2 46
#B.6 B 3 64
#B.7 B 4 71
#B.8 B 5 85
#C.9 C 1 14
#C.10 C 2 28
#C.11 C 3 51
#C.12 C 4 84
#D.14 D 1 22
#D.15 D 2 38
#D.17 D 4 40
#D.18 D 5 42
Explanation: We split the dataframe on column ID, then order every dataframe subset by Visit, calculate differences between successive Age values, and only keep those rows where the difference is > 0 (i.e. Age is increasing); rbinding gives the final dataframe.
You could do it by filtering out the rows where diff(Age) is negative for each ID.
Using the dplyr package:
library(dplyr)
x %>% group_by(ID) %>% filter(c(0,diff(Age))>=0)
# A tibble: 16 x 3
# Groups: ID [4]
ID Visit Age
<fctr> <fctr> <fctr>
1 A 1 14
2 A 2 28
3 A 3 42
4 B 1 14
5 B 2 46
6 B 3 64
7 B 4 71
8 B 5 85
9 C 1 14
10 C 2 28
11 C 3 51
12 C 4 84
13 D 1 22
14 D 2 38
15 D 4 40
16 D 5 42
The aggregate() approach is pretty concise.
Removing bad rows
good <- do.call(c, aggregate(Age ~ ID, x, function(z) c(z[1], diff(z)) > 0)$Age)
x[good,]
# ID Visit Age
# 1 A 1 14
# 2 A 2 28
# 3 A 3 42
# 4 B 1 14
# 5 B 2 46
# 6 B 3 64
# 7 B 4 71
# 8 B 5 85
# 9 C 1 14
# 10 C 2 28
# 11 C 3 51
# 12 C 4 84
# 14 D 1 22
# 15 D 2 38
# 17 D 4 40
# 18 D 5 42
This will only highlight which groups have an inconsistency:
aggregate(Age ~ ID, x, function(z) all(diff(z) > 0))
# ID Age
# 1 A TRUE
# 2 B TRUE
# 3 C FALSE
# 4 D FALSE

Assigning values to a new column based on a condition between two dataframes

I have two dataframes. I need to add the value of one column to every row in the other dataframe where the values of a particular column meet a condition from the first dataframe.
df1:
a b
x 23
s 34
v 15
g 05
k 69
df2:
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
Desired output:
a b n
x 23 3
s 34 4
v 15 2
g 05 1
k 69 7
In my dataset the intervals are large, and it's unlikely that a value from df1 is exactly on the boundary of a df2 interval.
Essentially for every row in df1 I need to assign the number which corresponds to which range it fits into in df2. So if df1$b is between df2$y and df2$z, then assign the value of output$n as the corresponding value of df2$x. This is quite a wordy question, so please ask if I need to clarify.
df1 = read.table(text = "
a b
x 23
s 34
v 15
g 05
k 69
", header=T, stringsAsFactors=F)
df2 = read.table(text = "
x y z
1 0 10
2 10 20
3 20 30
4 30 40
5 40 50
6 50 60
7 60 70
", header=T, stringsAsFactors=F)
# function
f = function(x) min(which(x >= df2$y & x <= df2$z))
f = Vectorize(f)
# apply function
df1$n = f(df1$b)
# check updated dataset
df1
# a b n
# 1 x 23 3
# 2 s 34 4
# 3 v 15 2
# 4 g 5 1
# 5 k 69 7
You can try:
library(tidyverse)
df1 %>%
rowwise() %>%
mutate(n=df2[ b > df2$y & b <= df2$z,1]) %>%
ungroup()
# A tibble: 5 x 3
a b n
<chr> <int> <int>
1 x 23 3
2 s 34 4
3 v 15 2
4 g 5 1
5 k 69 7
as already commented you have to change < or > to <= or >= accordingly to your needs.

Merging and summarizing two dataframes

I have the following data:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
> a
ID a b c
1 A 0 3 6
2 B 1 4 7
3 Z 2 5 8
4 H 45 22 3
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
> b
ID a b c
1: A 9 4 12
2: B 10 2 0
3: E 11 7 34
4: W 39 54 23
5: Z 5 12 13
6: H 0 34 14
I want to merge both dataframes, keeping only rows of data.frame a and summarize the same columns, so at the end I get:
> z
ID a b c
1 A 9 7 18
2 B 11 6 7
3 Z 7 17 21
4 H 45 56 17
So far I have tried the following:
merge(a,b,by="ID",all.x=T,all.y=F)
> merge(a,b,by="ID",all.x=T,all.y=F)
ID a.x b.x c.x a.y b.y c.y
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 H 45 22 3 0 34 14
4 Z 2 5 8 5 12 13
> join(a,b,type="left",by="ID")
ID a b c a b c
1 A 0 3 6 9 4 12
2 B 1 4 7 10 2 0
3 Z 2 5 8 5 12 13
4 H 45 22 3 0 34 14
I cannot manage to summarize the columns.
My dataframe is pretty big so if the solution can speed up things that would even be better.
If your data.frame is very big, then you may consider this option:
library(data.table)
## convert data.frame to data.table
setDT(a)
## convert data.frame to data.table
setDT(b)
## merge the two data.tables
c <- merge(a,b,by='ID')
## extract names of all columns except the first one i.e. ID
col_names <- colnames(a)[-1]
## query building
col_1 <- paste0(col_names,'.x')
col_2 <- paste0(col_names,'.y')
cols <- paste(col_1,col_2,sep=',')
cols_2 <- paste0(col_names," = sum(",cols,")")
cols_3 <- paste(cols_2,collapse=',')
query <- paste0("z <- c[,.(",cols_3,"),by=ID]")
## query execution
eval(parse(text = query))
This works at least for your example:
a <- data.frame(ID=c("A","B","Z","H"), a=c(0,1,2,45), b=c(3,4,5,22), c=c(6,7,8,3))
b <- data.frame(ID=c("A","B","E","W","Z","H"), a=c(9,10,11,39,5,0), b=c(4,2,7,54,12,34), c=c(12,0,34,23,13,14))
match_a <- na.omit(match(b$ID, a$ID))
match_b <- na.omit(match(a$ID, b$ID))
df <- cbind(ID = a$ID[match_a], a[match_a, -1] + b[match_b, -1])
First, get matching rows from a in b and vice versa, so we can be sure that we only have those rows that appear in both data frames (and we now know their row-indices in both data frames). Then, simply use vectorized additions for those matching rows, but omit ID, as factor cannot be summed up; add ID back manually.
You cannot directly add both data frame is because both the data frames are of unequal size. To make them of equal size you can check for IDs in a which are present in b and then add them element wise.
new <- b[b$ID %in% a$ID, ]
cbind(ID = a$ID, a[-1] + new[-1])
# ID a b c
#1 A 9 7 18
#2 B 11 6 7
#3 Z 7 17 21
#4 H 45 56 17

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in comments by ?#thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (#Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance:
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution using bgoldst's definition of df1 and df2
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack(), and aggregate(), although it requires the second data.frame contain character vectors, as opposed to factors (could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2) );
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=F );
aggregate(sum~group,data.frame(sum=stack(df1)[,1],group=stack(df2)[,1]),sum);
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2

Resources