Selecting rows by offsetting - r

I have a data frame; let's call it my_df.
It looks like this:
my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10))
names(my_df) <- c("num","let")
head(my_df)
num let
1 0.01202600 a
2 1.09025768 a
3 -0.08656178 a
4 -0.04847073 a
5 -0.63750258 a
6 0.58846135 a
What I want to do is select all of the rows where my_df$let == "b", as well as the five rows before the first such row and the five rows after the last such row. So basically my_df[6:25,].
The data I'm actually working with is hundreds of thousands of rows long and I don't know which rows are which. Besides that, the data sets don't line up row-wise, and I can't take the time to go through each one individually. I've been using subset to select the data I want, but I don't know how to select the additional rows outside of the subset (1000 rows before and after).
Here's my subset for what I'm doing:
#The following lines separate pXX_NoNegative into individual field sections
p04_HighWeeds <- subset(p04_NoNegative, subset = p04_NoNegative$GS_Field == "High Weeds")
I want to select all of the rows that the above code selects, but I also want 100 rows before that, and 1000 rows after that.
If you need any additional information that may help you please ask.

Here's another idea using dplyr:
library(dplyr)
my_df %>% filter(lead(let == "b", 5) | lag(let == "b", 5))
Or, as per @akrun's suggestion, using the devel version of data.table:
setDT(my_df)[shift(let == "b", 5) | shift(let == "b", type = "lead", 5)]
Which gives:
# num let
#1 0.36723709 a
#2 0.24743170 a
#3 -0.33339924 a
#4 -0.57024317 a
#5 0.03390278 a
#6 -0.43495096 b
#7 -0.85107347 b
#8 0.53048931 b
#9 -0.26739611 b
#10 -0.96029355 b
#11 -0.71737408 b
#12 0.34324685 b
#13 0.12319646 b
#14 0.75207703 b
#15 0.18134006 b
#16 -0.02230777 c
#17 0.42646106 c
#18 -0.11055478 c
#19 0.06013187 c
#20 0.50782158 c
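One caveat worth checking before applying either one-liner to the real data: a single lead/lag offset of n covers the whole window only when the run of matching rows is at least 2n rows long. With a shorter run, rows in the middle are skipped. Below is a minimal base R sketch that does not depend on the run length and also allows different "before" and "after" margins. The helper name within_n_of and its arguments are hypothetical, not part of dplyr or data.table.

```r
# Hypothetical helper: TRUE for every position that lies at most `before` rows
# ahead of, or at most `after` rows behind, some TRUE in `flag`.
within_n_of <- function(flag, before, after = before) {
  len  <- length(flag)
  hits <- which(flag)
  # Each hit opens a window at (hit - before) and closes it after (hit + after).
  opens  <- tabulate(pmax(hits - before, 1), nbins = len + 1)
  closes <- tabulate(pmin(hits + after, len) + 1, nbins = len + 1)
  # The running count of open windows is positive exactly inside a window.
  cumsum(opens - closes)[seq_len(len)] > 0
}

let <- rep(c("a", "b", "c"), each = 10)
which(within_n_of(let == "b", 5))      # rows 6 through 25

# A run of only 3 "b" rows (11 to 13), shorter than 2 * 5
short <- rep(c("a", "b", "c"), times = c(10, 3, 17))
which(within_n_of(short == "b", 5))    # rows 6 through 18
```

With the example data this would be used as my_df[within_n_of(my_df$let == "b", 5), ], and for the larger data set the margins can simply be raised to 1000.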

Normally splitting a data frame into a list of data frames based on some categorization is straightforward -- you would use split(my_df, my_df$let) in your case. However with the added complication that you want some number of rows before or after I would operate over the set of unique categorizations, selecting the rows you want in each case:
before <- 5
after <- 5
ret <- setNames(lapply(unique(my_df$let), function(x) {
  positions <- which(my_df$let == x)
  start.pos <- max(1, min(positions)-before)
  end.pos <- min(nrow(my_df), max(positions)+after)
  my_df[start.pos:end.pos,]
}), unique(my_df$let))
You can grab the observations for any category you want out of the returned list:
ret$b # Also works: ret[["b"]]
# num let
# 6 -0.197901427 a
# 7 0.194607192 a
# 8 -0.107318203 a
# 9 -0.365313233 a
# 10 -0.188926562 a
# 11 0.636272295 b
# 12 -0.058791973 b
# 13 -0.231029510 b
# 14 0.519441716 b
# 15 0.239510912 b
# 16 0.107025658 b
# 17 -0.446644081 b
# 18 0.145052077 b
# 19 -0.426090749 b
# 20 -0.356062993 b
# 21 -0.155012203 c
# 22 -0.007968255 c
# 23 -0.504253089 c
# 24 0.081624303 c
# 25 -0.657008233 c

I recently answered a nearly identical question: Select n rows after specific number. Adapting the single-segment solution to your data:
set.seed(1); my_df <- data.frame(rnorm(n = 30,sd=.5),rep(c("a","b","c"),each=10));
names(my_df) <- c("num","let");
brange <- range(which(my_df$let=='b'));
my_df$offb <- c((1-brange[1]):-1,rep(0,diff(brange)+1),1:(nrow(my_df)-brange[2]));
my_df;
## num let offb
## 1 -0.313226905 a -10
## 2 0.091821662 a -9
## 3 -0.417814306 a -8
## 4 0.797640401 a -7
## 5 0.164753886 a -6
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5
## 26 -0.028064370 c 6
## 27 -0.077897753 c 7
## 28 -0.735376192 c 8
## 29 -0.239075028 c 9
## 30 0.208970780 c 10
subset(my_df,offb>=-5&offb<=5);
## num let offb
## 6 -0.410234192 a -5
## 7 0.243714526 a -4
## 8 0.369162353 a -3
## 9 0.287890676 a -2
## 10 -0.152694194 a -1
## 11 0.755890584 b 0
## 12 0.194921618 b 0
## 13 -0.310620290 b 0
## 14 -1.107349944 b 0
## 15 0.562465459 b 0
## 16 -0.022466805 b 0
## 17 -0.008095132 b 0
## 18 0.471918105 b 0
## 19 0.410610598 b 0
## 20 0.296950661 b 0
## 21 0.459488686 c 1
## 22 0.391068150 c 2
## 23 0.037282492 c 3
## 24 -0.994675848 c 4
## 25 0.309912874 c 5

Related

Labelling rows according to how many times the group appeared in previous rows

Suppose I have the following data.frame object:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'))
From the snapshot above, you can see that there are two groups of rows that have col1=="a": rows 1 through 3 and rows 21 through 23. Similarly, there are three groups of rows that have col1=="e": row 15, rows 19 through 20 and rows 24 through 25 (and so on and so on with "b", "c" and "d").
Here's my main question
Is it possible to label the rows according to what "chunk" we're currently on? More explicitly: since rows 1 through 3 are the first time where we have col1=="a", they should be labelled as 1. Then, rows 21 through 23 should be labelled as 2, because that is the second time that we have a set of rows that have col1=="a". Using the same logic, but for col1=="e", we'd label row 15 as 1, rows 19 and 20 as 2 and rows 24 and 25 as 3 (again, so on and so on with "b", "c" and "d").
Desired output
Here is what the resulting data.frame would look like:
df = data.frame(id=(1:25),
col1=c('a','a','a',
'b','b','b',
'c','c','c',
'd','d',
'b','b','b',
'e',
'c','c','c',
'e','e',
'a','a','a',
'e','e'),
grup=c(1,1,1,
1,1,1,
1,1,1,
1,1,
2,2,2,
1,
2,2,2,
2,2,
2,2,2,
3,3))
My attempt
I tried implementing a solution using a for loop, but that was quite slow (the original data I'm working on has about 500,000 rows), and it just looked a bit sloppy:
my_classifier = function(input_df, ref_column){
  # Keeps a tally of how many times each unique group was "found" before.
  group_counter = list()
  # Dealing with the corner case of the first row
  # (uses input_df rather than the global df)
  group_counter[[input_df[[ref_column]][1]]] = 1
  output_groups = rep(-1, nrow(input_df))
  output_groups[1] = 1
  # The for loop starts at the second row because I've already "dealt" with the
  # first row in the corner cases above
  for(i in 2:nrow(input_df)){
    prev_group = input_df[[ref_column]][i-1]
    this_group = input_df[[ref_column]][i]
    if(is.null(group_counter[[this_group]])){
      this_counter = 0
    }
    else{
      this_counter = group_counter[[this_group]]
    }
    if(prev_group != this_group){
      this_counter = this_counter + 1
    }
    output_groups[i] = this_counter
    group_counter[[this_group]] = this_counter
  }
  return(output_groups)
}
df$grup = my_classifier(df,'col1')
Is there a quicker/more efficient way to solve this problem? Maybe something that relies on vectorized functions or something?
Important notes
Consider that we cannot rely on the number of repetitions of each "block". Sometimes, col1 will have just one single row of a particular group, while other times the "block" will have several rows where col1 share the same value. Also consider that we cannot assume any logic in the "order" or the number of times each group shows up.
So, for example, there might be a stretch of 10 rows where col1=="z", then a stretch of 15 rows where col1=="x", then another single row where col1=="x" and then finally a stretch of 100 rows where col1=="w".
You can use data.table::rleid() twice, like this:
library(data.table)
setDT(df)[,grp:=rleid(col1)][, grp:=rleid(grp), by=col1][order(id)]
Output:
id col1 grp
<int> <char> <int>
1: 1 a 1
2: 2 a 1
3: 3 a 1
4: 4 b 1
5: 5 b 1
6: 6 b 1
7: 7 c 1
8: 8 c 1
9: 9 c 1
10: 10 d 1
11: 11 d 1
12: 12 b 2
13: 13 b 2
14: 14 b 2
15: 15 e 1
16: 16 c 2
17: 17 c 2
18: 18 c 2
19: 19 e 2
20: 20 e 2
21: 21 a 2
22: 22 a 2
23: 23 a 2
24: 24 e 3
25: 25 e 3
id col1 grp
Here is a possible base R solution:
change <- with(rle(df$col1), rep(seq_along(values), lengths))
cbind(df, grp = with(df, ave(
  change,
  col1,
  FUN = function(x)
    inverse.rle(within.list(rle(x), values <- seq_along(values)))
)))
Or another option using a combination of rle and dplyr using the function from here:
rle_new <- function(x) {
  x <- rle(x)$lengths
  rep(seq_along(x), times=x)
}
library(dplyr)
df %>%
  mutate(grp = rle_new(col1)) %>%
  group_by(col1) %>%
  mutate(grp = rle_new(grp))
Output
id col1 grp
1 1 a 1
2 2 a 1
3 3 a 1
4 4 b 1
5 5 b 1
6 6 b 1
7 7 c 1
8 8 c 1
9 9 c 1
10 10 d 1
11 11 d 1
12 12 b 2
13 13 b 2
14 14 b 2
15 15 e 1
16 16 c 2
17 17 c 2
18 18 c 2
19 19 e 2
20 20 e 2
21 21 a 2
22 22 a 2
23 23 a 2
24 24 e 3
25 25 e 3
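For comparison, the same labelling can be sketched in base R without rle() at all: mark the rows where a new run begins, then take a running count of those marks within each value of col1. This is an alternative sketch rather than one of the answers above, and it assumes col1 contains no NA values. The names run_start and expected are introduced here purely for illustration.

```r
col1 <- c('a','a','a', 'b','b','b', 'c','c','c', 'd','d', 'b','b','b', 'e',
          'c','c','c', 'e','e', 'a','a','a', 'e','e')

# TRUE wherever the value differs from the previous row (a new run begins).
run_start <- c(TRUE, col1[-1] != col1[-length(col1)])

# Within each value of col1, count how many runs have started so far.
grp <- ave(as.integer(run_start), col1, FUN = cumsum)

# The labels requested in the question, for comparison.
expected <- c(1,1,1, 1,1,1, 1,1,1, 1,1, 2,2,2, 1, 2,2,2, 2,2, 2,2,2, 3,3)
all(grp == expected)
```

Both steps are vectorized, so the cost should grow roughly linearly with the number of rows.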

Replace NA value with next or previous non-NA value conditional on other column

Below is an example data set similar to what I'm working with.
df<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c("A",rep(NA,8),"B",rep(NA,9),"C"))
In this example, Loc is a string of values running from positive to negative or vice versa. What I am trying to accomplish is to fill the NA values in Reg: B is always associated with negative values of Loc, while positive values take A if the NAs fall between A and B, or C if the NAs fall between B and C.
The desired output should look like the following
df2<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
Reg=c(rep("A",6),rep("B",8),rep("C",6)))
I have looked into na.locf from the zoo package, but I'm not sure how to control which direction the function looks for the non-NA value to get the desired output.
df$Reg2<-ifelse(df$Loc<=0,df$Reg2<-"B",na.locf(df$Reg,fromLast = F))
The above code is only returning the right response for some of the rows depending on the direction (i.e. fromLast = T or F)
Any help on this would be much appreciated.
Use ave splitting by a grouping variable generated from rleid of the sign. Then omit the NAs leaving the single non-NA in each group which ave will copy for all values in that group.
library(data.table)
transform(df, Reg = ave(Reg, rleid(Loc >= 0), FUN = na.omit))
giving:
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C
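If loading data.table only for rleid() is undesirable, the grouping variable can be built with cumsum() instead. The following is a sketch under the same assumption the answer relies on, namely that every run of same-signed Loc values contains exactly one non-NA Reg. The names nonneg and run_id are introduced here for illustration.

```r
df <- data.frame(Loc = c(rev(seq(-4, 5, 1)), seq(-4, 5, 1)),
                 Reg = c("A", rep(NA, 8), "B", rep(NA, 9), "C"),
                 stringsAsFactors = FALSE)

nonneg <- df$Loc >= 0
# Increment the run id every time the sign indicator flips.
run_id <- cumsum(c(TRUE, nonneg[-1] != nonneg[-length(nonneg)]))

# Copy the first non-NA Reg of each run to every row of that run.
df$Reg <- ave(df$Reg, run_id, FUN = function(x) x[!is.na(x)][1])
df
```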
Here is a data.table solution which reproduces OP's expected answer:
library(data.table)
result <- as.data.table(df)[, Reg := first(Reg[!is.na(Reg)]), by = rleid(Loc >= 0)][]
result
Loc Reg
1: 5 A
2: 4 A
3: 3 A
4: 2 A
5: 1 A
6: 0 A
7: -1 B
8: -2 B
9: -3 B
10: -4 B
11: -4 B
12: -3 B
13: -2 B
14: -1 B
15: 0 C
16: 1 C
17: 2 C
18: 3 C
19: 4 C
20: 5 C
identical(as.data.frame(result), df2)
[1] TRUE
Note that this approach is similar to G. Grothendieck's base R solution in that it uses rleid(Loc >= 0) to group the data, but instead of calling transform() and ave() it updates Reg by reference, i.e., without copying the whole object.
Here is a quick solution with dplyr:
library(dplyr)
df<-data.frame(Loc=c(rev(seq(-4,5,1)),seq(-4,5,1)),
               Reg=c("A",rep(NA,8),"B",rep(NA,9),"C"))
# Row positions of the "C" and "A" anchors
c <- match("C",df$Reg)
a <- match("A",df$Reg)
df2 <- df %>%
  mutate(newReg=case_when(Loc < 0 ~ "B",
                          Loc >= 0 & abs(row_number()-c)<abs(row_number()-a)~ "C",
                          TRUE ~ "A"))
Note: This is hideous and I am doubtful this is reproducible for more use cases... this is probably better suited for some type of dplyr::case_when function, but I just couldn't think it through at this point.
lapply(2:nrow(df), function(i){
  this_row <- df[i, ]
  last_row <- i - 1
  if(is.na(this_row[['Reg']])){
    if(this_row[['Loc']] < 0){
      df[i, 'Reg'] <<- "B"
    }else if(df[i - 1, 'Reg'] == "A"){
      df[i, 'Reg'] <<- "A"
    }else {
      df[i, "Reg"] <<- "C"
    }
  }
})
> df
Loc Reg
1 5 A
2 4 A
3 3 A
4 2 A
5 1 A
6 0 A
7 -1 B
8 -2 B
9 -3 B
10 -4 B
11 -4 B
12 -3 B
13 -2 B
14 -1 B
15 0 C
16 1 C
17 2 C
18 3 C
19 4 C
20 5 C

How to match missing IDs?

I have a large table with 50000 obs. The following mimics the structure:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B",NA,"D","E",NA,"G","H","I")
b <- c(11,2233,12,2,22,13,23,23,100)
c <- c(12,10,12,23,16,17,7,9,7)
df <- data.frame(ID ,a,b,c)
Where there are some missing values on the vector "a". However, I have some tables where the ID and the missing strings are included:
ID <- c(1,2,3,4,5,6,7,8,9)
a <- c("A","B","C","D","E","F","G","H","I")
key <- data.frame(ID,a)
Is there a way to include the missing strings from key into the column a using the ID?
Another option is to use data.table's fast binary join and update-by-reference capabilities
library(data.table)
setkey(setDT(df), ID)[key, a := i.a]
df
# ID a b c
# 1: 1 A 11 12
# 2: 2 B 2233 10
# 3: 3 C 12 12
# 4: 4 D 2 23
# 5: 5 E 22 16
# 6: 6 F 13 17
# 7: 7 G 23 7
# 8: 8 H 23 9
# 9: 9 I 100 7
If you want to replace only the NAs (not all the joined cases), a slightly more complicated implementation would be
setkey(setDT(key), ID)
setkey(setDT(df), ID)[is.na(a), a := key[.SD, a]]
You can just use match; however, I would recommend that both your datasets are using characters instead of factors to prevent headaches later on.
key$a <- as.character(key$a)
df$a <- as.character(df$a)
df$a[is.na(df$a)] <- key$a[match(df$ID[is.na(df$a)], key$ID)]
df
# ID a b c
# 1 1 A 11 12
# 2 2 B 2233 10
# 3 3 C 12 12
# 4 4 D 2 23
# 5 5 E 22 16
# 6 6 F 13 17
# 7 7 G 23 7
# 8 8 H 23 9
# 9 9 I 100 7
Of course, you could always stick with factors and factor the entire "ID" column and use the labels to replace the values in column "a"....
factor(df$ID, levels = key$ID, labels = key$a)
## [1] A B C D E F G H I
## Levels: A B C D E F G H I
Assign that to df$a and you're done....
Named vectors make nice lookup tables:
lookup <- a
names(lookup) <- as.character(ID)
lookup is now a named vector; you can access each value by its ID, e.g. lookup["2"] (make sure the ID is a character, not numeric)
## should give you a vector of a as required.
lookup[as.character(ID_from_big_table)]
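Putting the lookup idea together end to end, a minimal sketch that fills only the missing entries of a could look like this. The data is a cut-down copy of the question's example, and the variable name missing is introduced here for illustration.

```r
df <- data.frame(ID = 1:9,
                 a  = c("A", "B", NA, "D", "E", NA, "G", "H", "I"),
                 b  = c(11, 2233, 12, 2, 22, 13, 23, 23, 100),
                 stringsAsFactors = FALSE)
key <- data.frame(ID = 1:9, a = LETTERS[1:9], stringsAsFactors = FALSE)

# Named vector: the values are the strings, the names are the IDs (as character).
lookup <- setNames(key$a, key$ID)

# Fill only the rows where `a` is missing, looking each ID up by name.
missing <- is.na(df$a)
df$a[missing] <- unname(lookup[as.character(df$ID[missing])])
df
```

An ID that does not appear in key would simply stay NA, since indexing a named vector by an unknown name returns NA.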

Convert datafile from wide to long format to fit ordinal mixed model in R

I am dealing with a dataset that is in wide format, as in
> data=read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
> data
factor1 factor2 count_1 count_2 count_3
1 a a 1 2 0
2 a b 3 0 0
3 b a 1 2 3
4 b b 2 2 0
5 c a 3 4 0
6 c b 1 1 0
where factor1 and factor2 are different factors which I would like to take along (in fact I have more than 2, but that shouldn't matter), and count_1 to count_3 are counts of aggressive interactions on an ordinal scale (3>2>1). I would now like to convert this dataset to long format, to get something like
factor1 factor2 aggression
1 a a 1
2 a a 2
3 a a 2
4 a b 1
5 a b 1
6 a b 1
7 b a 1
8 b a 2
9 b a 2
10 b a 3
11 b a 3
12 b a 3
13 b b 1
14 b b 1
15 b b 2
16 b b 2
17 c a 1
18 c a 1
19 c a 1
20 c a 2
21 c a 2
22 c a 2
23 c a 2
24 c b 1
25 c b 2
Would anyone happen to know how to do this without using for...to loops, e.g. using package reshape2? (I realize it should work using melt, but I just haven't been able to figure out the right syntax yet)
Edit: For those of you that would also happen to need this kind of functionality, here is Ananda's answer below wrapped into a little function:
widetolong.ordinal<-function(data,factors,responses,responsename) {
  library(reshape2)
  data$ID=1:nrow(data) # add an ID to preserve row order
  dL=melt(data, id.vars=c("ID", factors)) # `melt` the data
  dL=dL[order(dL$ID), ] # sort the molten data
  dL[,responsename]=match(dL$variable,responses) # convert responses to ordinal scores
  dL[,responsename]=factor(dL[,responsename],ordered=TRUE)
  dL=dL[dL$value != 0, ] # drop rows where `value == 0`
  out=dL[rep(rownames(dL), dL$value), c(factors, responsename)] # use `rep` to "expand" `data.frame` & drop unwanted columns
  rownames(out) <- NULL
  return(out)
}
# example
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
widetolong.ordinal(data,c("factor1","factor2"),c("count_1","count_2","count_3"),"aggression")
melt from "reshape2" will only get you part of the way through this problem. To go the rest of the way, you just need to use rep from base R:
data <- read.csv("http://www.kuleuven.be/bio/ento/temp/data.csv")
library(reshape2)
## Add an ID if the row order is important to you
data$ID <- 1:nrow(data)
## `melt` the data
dL <- melt(data, id.vars=c("ID", "factor1", "factor2"))
## Sort the molten data, if necessary
dL <- dL[order(dL$ID), ]
## Extract the numeric portion of the "variable" variable
dL$aggression <- gsub("count_", "", dL$variable)
## Drop rows where `value == 0`
dL <- dL[dL$value != 0, ]
## Use `rep` to "expand" your `data.frame`.
## Drop any unwanted columns at this point.
out <- dL[rep(rownames(dL), dL$value), c("factor1", "factor2", "aggression")]
This is what the output finally looks like. If you want to remove the funny row names, just use rownames(out) <- NULL.
out
# factor1 factor2 aggression
# 1 a a 1
# 7 a a 2
# 7.1 a a 2
# 2 a b 1
# 2.1 a b 1
# 2.2 a b 1
# 3 b a 1
# 9 b a 2
# 9.1 b a 2
# 15 b a 3
# 15.1 b a 3
# 15.2 b a 3
# 4 b b 1
# 4.1 b b 1
# 10 b b 2
# 10.1 b b 2
# 5 c a 1
# 5.1 c a 1
# 5.2 c a 1
# 11 c a 2
# 11.1 c a 2
# 11.2 c a 2
# 11.3 c a 2
# 6 c b 1
# 12 c b 2
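The expansion step can also be sketched with rep() alone, skipping melt entirely: build one (row, score) pair per cell of the count columns, in row order, and repeat each pair as many times as its count. This is an alternative sketch, not the answer's method. The data frame wide is typed in from the table in the question rather than read from the URL, and the names counts, row_of, score_of and long are introduced here for illustration.

```r
wide <- data.frame(factor1 = c("a", "a", "b", "b", "c", "c"),
                   factor2 = c("a", "b", "a", "b", "a", "b"),
                   count_1 = c(1, 3, 1, 2, 3, 1),
                   count_2 = c(2, 0, 2, 2, 4, 1),
                   count_3 = c(0, 0, 3, 0, 0, 0))

counts <- as.matrix(wide[, c("count_1", "count_2", "count_3")])
n <- nrow(counts)
k <- ncol(counts)

# Walk the cells row by row: transposing makes as.vector() run through
# row 1's counts first, then row 2's, and so on.
times    <- as.vector(t(counts))
row_of   <- rep(seq_len(n), each = k)    # source row of each cell
score_of <- rep(seq_len(k), times = n)   # ordinal score of each cell

long <- data.frame(
  wide[rep(row_of, times = times), c("factor1", "factor2")],
  aggression = factor(rep(score_of, times = times),
                      levels = seq_len(k), ordered = TRUE)
)
rownames(long) <- NULL
long
```

Cells with a count of zero contribute no rows, so no separate filtering step is needed.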

Remove rows based on factor-levels

I have a data.frame df in format "long".
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df <- df[order(df$site), ]
df
site time value
1 A 11 12
2 A 22 -24
3 A 33 -30
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
Question
How do I remove the rows where a unique element of df$time is not present for each of the levels of df$site ?
In this case I want to remove df[3,], because for df$time the timestamp 33 is only present for site A and not for site B and site C.
Desired output:
df.trimmed
site time value
1 A 11 12
2 A 22 -24
4 B 11 3
5 B 22 16
6 C 11 3
7 C 22 9
The data.frame has easily 800k rows and 200k unique timestamps. I don't want to use loops but I don't know how to use vectorized functions like apply() or lapply() for this case.
Here's another possible solution using the data.table package:
unTime <- unique(df$time)
library(data.table)
DT <- data.table(df, key = "site")
(notInAll <- unique(DT[, list(ans = which(!unTime %in% time)), by = key(DT)]$ans))
# [1] 3
DT[!time %in% unTime[notInAll]] # unTime[-notInAll] would select nothing when notInAll is empty
# site time value
# [1,] A 11 3
# [2,] A 22 11
# [3,] B 11 -6
# [4,] B 22 -2
# [5,] C 11 -19
# [6,] C 22 -14
EDIT from Matthew
Nice. Or a slightly more direct way :
DT = as.data.table(df)
tt = DT[,length(unique(site)),by=time]
tt
time V1
1: 11 3
2: 22 3
3: 33 1
tt = tt[V1==max(V1)] # See * below
tt
time V1
1: 11 3
2: 22 3
DT[time %in% tt$time]
site time value
1: A 11 7
2: A 22 -2
3: B 11 8
4: B 22 -10
5: C 11 3
6: C 22 1
If no time is present in all sites, the final result should be empty (as Ben pointed out in the comments). To handle that case, the step marked * above could be:
tt = tt[V1==length(unique(DT$site))]
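The same "count the distinct sites per timestamp" idea can be sketched in base R with ave(), which attaches the count to every row so the filter becomes a plain logical index. This is an alternative sketch, not part of the data.table answer. The values are copied from the printout in the question, and the name sites_per_time is introduced here for illustration.

```r
df <- data.frame(site  = c("A", "A", "A", "B", "B", "C", "C"),
                 time  = c(11, 22, 33, 11, 22, 11, 22),
                 value = c(12, -24, -30, 3, 16, 3, 9))

# For every row, the number of distinct sites that share that row's timestamp.
sites_per_time <- ave(as.integer(factor(df$site)), df$time,
                      FUN = function(s) length(unique(s)))

# Keep only the timestamps that occur at every site.
df.trimmed <- df[sites_per_time == length(unique(df$site)), ]
df.trimmed
```

If no timestamp is shared by all sites, the comparison is FALSE everywhere and the result is an empty data frame, matching the stricter variant of the step marked * above.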
Would rle work for you?
df <- df[order(df$time), ]
df <- subset(df, !time %in% rle(df$time)$values[rle(df$time)$lengths == 1])
df <- df[order(df$site), ]
df
## site time value
## 1 A 11 17
## 4 A 22 -3
## 2 B 11 8
## 5 B 22 5
## 3 C 11 0
## 6 C 22 13
Re-looking at your data, it seems that this solution might be too simple for your needs though....
Update
Here's an approach that should be better than the rle solution that I put above. Rather than look for a run length of 1, it deletes rows that do not meet a condition on the results of table(df$site, df$time). To illustrate, I've also added some more fake data.
df <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(11,11,11,22,22,22,33),
value = ceiling(rnorm(7)*10))
df2 <- data.frame(site = rep(c("A","B","C"), 1, 7),
time = c(14,14,15,15,16,16,16),
value = ceiling(rnorm(7)*10))
df <- rbind(df, df2)
df <- df[order(df$site), ]
temp <- as.numeric(names(which(colSums(with(df, table(site, time)))
                               >= length(unique(df$site))))) # unique() also works when site is character rather than factor
df2 <- merge(df, data.frame(temp), by.x = "time", by.y = "temp")
df2 <- df2[order(df2$site), ]
df2
## time site value
## 3 11 A -2
## 4 16 A -2
## 7 22 A 2
## 1 11 B -16
## 5 16 B 3
## 8 22 B -6
## 2 11 C 8
## 6 16 C 11
## 9 22 C -10
Here's the result of tabulating and summing up the site/time combination:
colSums(with(df, table(site, time)))
## 11 14 15 16 22 33
## 3 2 2 3 3 1
Thus, if we were interested in keeping the timestamps that at least two sites share, we could lower the threshold on the right-hand side of the >= comparison (3 in this example) by one (to 2).
Not sure if this solution is useful to you at all, but I thought I would share it to show the flexibility in solutions we have with R.

Resources