I am looking to select 3 values above every Event=x to view in a table. My data is as follows:
Event
1 a
2 b
3 c
4 a
5 x
6 c
7 a
8 b
9 c
10 x
This is what I would like for a return:
Value
1 b
2 c
3 a
4 a
5 b
6 c
Any help would be appreciated!
library(tidyverse)
library(magrittr)
Event <- as_data_frame(c("a","b","c","a","x","c","a","b","c","x"))
names(Event) <- "Event"
Event %<>%
  filter(lead(Event, 3) == "x" | lead(Event, 2) == "x" | lead(Event, 1) == "x")
Hope this helps. The lead and lag functions are useful, although when you apply them you should be very careful about how the data is grouped and arranged/sorted. Cheers!
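For instance, here is a minimal sketch (with a hypothetical id grouping column) of how grouping keeps the look-ahead from leaking across group boundaries:
library(dplyr)
# Hypothetical grouped data; arrange() and group_by() before using lead()
# so the look-ahead is computed within each group only.
dat <- tibble(id = c(1, 1, 1, 2, 2, 2),
              Event = c("a", "b", "x", "c", "a", "x"))
dat %>%
  arrange(id) %>%
  group_by(id) %>%
  filter(lead(Event, 1) == "x" | lead(Event, 2) == "x") %>%
  ungroup()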
Here's a base R solution
Event = c("a","b","c","a","x","c","a","b","c","x")
xs = which(Event == "x")
lag = sort(c(xs-3,xs-2,xs-1))
Event[lag[lag > 0]]
# [1] "b" "c" "a" "a" "b" "c"
which returns the indices of the TRUE values in a logical vector (in this case, the positions that equal "x"). We can then decrement those indices by 1, 2 and 3 to get the lagged positions.
Now, which can be dangerous: if nothing meets the condition it returns a length-0 vector, which can wreak havoc. In this case, though, it's OK because you'll simply get a length-0 vector as output.
xs_bad = which(Event == "y")
lag_bad = sort(c(xs_bad - 3,xs_bad - 2,xs_bad - 1))
Event[lag_bad]
# character(0)
which can bite you in another way, too.
Event_bad = c("a","x","a","b","c","x")
Now there's an "x" that's not even 3 positions away from the beginning.
xs_bad = which(Event_bad == "x")
lag_bad = sort(c(xs_bad - 3,xs_bad - 2,xs_bad - 1))
lag_bad
# [1] -1 0 1 3 4 5
Negative indexes can't be mixed with positive indexes, so our last step will fail. Defensively, then, we can change the code to remove this chance.
Event_bad[lag_bad[lag_bad > 0]]
# [1] "a" "a" "b" "c"
I have a vector with numbers and a lookup table. I want the numbers replaced by the description from the lookup table.
This is easy when the vectors are straightforward, as in this example:
> variable <- sample(1:5, 10, replace=T)
> variable
[1] 5 4 5 3 2 3 2 3 5 2
>
> lookup <- data.frame(var = 1:5, description=LETTERS[1:5])
> lookup
var description
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
>
> with(lookup, description[match(variable, var)])
[1] E D E C B C B C E B
Levels: A B C D E
However, when single elements of a vector contain multiple outcomes, I get into trouble:
variable <- c("1", "2^3", "1^5", "4", "4")
I would like the vector returned to give:
c("A", "B^C", "A^E", "D", "D")
If your matches and replacements are all single characters, you can use chartr:
chartr(paste0(lookup$var, collapse = ""),
       paste0(lookup$description, collapse = ""), variable)
#[1] "A" "B^C" "A^E" "D" "D"
chartr essentially says: replace
paste0(lookup$var, collapse = "")
#[1] "12345"
with
paste0(lookup$description, collapse = "")
#[1] "ABCDE"
It is also useful because characters that do not match are left unchanged rather than being dropped or turned into NA.
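For example, a quick sketch: a digit that has no entry in the lookup, such as "9", simply passes through untouched.
chartr("12345", "ABCDE", c("1", "9", "2^9"))
# [1] "A"   "9"   "B^9"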
As mentioned in the comments, there are a couple of steps needed to achieve the desired output. The following splits your variable, indexes the results against the description variable and then uses paste to collapse multiple elements.
sapply(strsplit(variable, "\\^"), function(x) paste0(lookup$description[as.numeric(x)], collapse = "^"))
[1] "A" "B^C" "A^E" "D" "D"
You can use scan to parse the text into numerics, which can then be used as indices to pick items, which are then collapsed back together. Add quiet=TRUE to suppress the "Read" messages.
sapply(variable, function(t) {
  paste(lookup$description[scan(text = t, sep = "^")], collapse = "^")
})
Read 1 item
Read 2 items
Read 2 items
Read 1 item
Read 1 item
1 2^3 1^5 4 4
"A" "B^C" "A^E" "D" "D"
It is straightforward to obtain the unique values of a column using unique. However, I am looking to do the same for multiple columns in a data frame and store them in a list, all using base R. Importantly, it is not combinations I need but simply the unique values for each individual column. I currently have the below:
# dummy data
df = data.frame(a = LETTERS[1:4],
                b = 1:4)
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
  x = unique(i)
  unique_values_by_col[[i]] = x
}
The problem comes when displaying unique_values_by_col, as it shows up empty. I believe the problem is that i is being passed to the loop as text, not as a variable.
Any help would be greatly appreciated. Thank you.
Why not avoid the for loop altogether using lapply:
lapply(df, unique)
Resulting in:
$a
[1] A B C D
Levels: A B C D

$b
[1] 1 2 3 4
Alternatively, there is apply, which is designed to be run over rows or columns:
apply(df,2,unique)
result:
> apply(df,2,unique)
a b
[1,] "A" "1"
[2,] "B" "2"
[3,] "C" "3"
[4,] "D" "4"
Though if you want a list, lapply already returns one, so it may be the better choice.
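To illustrate the difference, a small sketch with a hypothetical df2 whose columns have different numbers of unique values: apply() then falls back to returning a list anyway (after coercing everything to character), so lapply() is the more predictable choice.
df2 <- data.frame(a = c("A", "A", "B"), b = 1:3)  # hypothetical example data
apply(df2, 2, unique)
# $a
# [1] "A" "B"
#
# $b
# [1] "1" "2" "3"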
Your for loop is almost right, just needs one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
  x = unique(df[[i]])
  unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string, the name of a column within df, so unique(i) doesn't make sense.
Anyhow, the most standard way for this task is lapply() as shown by demirev.
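To see why, a quick sketch: unique(i) operates on the one-element character vector, not on the column it names.
i <- "a"
unique(i)        # just the string "a"
# [1] "a"
unique(df[[i]])  # the column named "a"
# [1] A B C D
# Levels: A B C D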
Could this be what you're trying to do?
Map(unique,df)
Result:
$a
[1] A B C D
Levels: A B C D
$b
[1] 1 2 3 4
I have a data frame with a single column.
There are 620 rows. The first 31 rows we label class "A", the next 31 rows we label "class B", and so on. There are therefore 20 classes.
What I want to do is quite simple to explain but I need help coding it.
In the first iteration, I want to delete all rows that correspond to the last row for each class. That is, delete the last "A class" row, then delete the last "B class" row, and so on. This iteration, and all others, have to be performed, since I intend to do something else with the newly created dataset.
In the second iteration, I want to delete all rows that correspond to the last TWO rows for each class. So, delete the last two rows for "A class", last two rows for "B class" and so on.
In the third iteration, delete the last three rows for each class. And so on.
In the final iteration, we delete the last 30 rows for each class, meaning we only keep 1 row per class, the first one.
What's a quick way to put this into R code? I know I need to use a for loop and carefully pick some index to remove, but how?
EXAMPLE
column
A1
A2
A3
B1
B2
B3
If above is our original data frame, then in the first iteration, we should be left with
column
A1
A2
B1
B2
and so on.
I'm keeping it simple here and using n = 3 instead of n = 31 with this dummy data set:
n <- 3
dummy <- c(rep("A", n), rep("B", n), rep("C", n))
> dummy
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
Now, the trick is to use boolean indices to pick which values to keep per iteration, combined with the fact that R will recycle an index vector as many times as needed for a shorter vector to match a longer one.
This function creates a mask of which elements in a group should be picked
make_mask <- function(to_keep, n)
  c(rep(TRUE, to_keep), rep(FALSE, n - to_keep))
It just gives you a boolean vector
> make_mask(2, 3)
[1] TRUE TRUE FALSE
We can use it in a function that picks the elements for an iteration:
pick_subset <- function(to_keep) dummy[make_mask(n - to_keep, n)]
Now, you can use this in a loop or an lapply to get the elements you need per iteration.
iterations <- lapply(0:(n-1), pick_subset)
will give you this
> iterations
[[1]]
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
[[2]]
[1] "A" "A" "B" "B" "C" "C"
[[3]]
[1] "A" "B" "C"
If it is more to your taste to use 1:n in the lapply, simply adjust make_mask to compensate.
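For instance, a small sketch of one way to do that: shift the argument inside pick_subset by one so the lapply can run over 1:n directly.
pick_subset <- function(iter) dummy[make_mask(n - iter + 1, n)]
iterations <- lapply(1:n, pick_subset)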
library(dplyr)
dat %>%
  mutate(grp = sub("\\d", "", column)) %>%
  group_by(grp) %>%
  slice(-n()) %>%
  ungroup() %>%
  select(-grp)
# A tibble: 4 x 1
column
<chr>
1 A1
2 A2
3 B1
4 B2
data:
dat=read.table(header = T,stringsAsFactors = F,text="column
A1
A2
A3
B1
B2
B3")
There is still another way. Assuming the codes are all grouped and sorted as you show, use the table function to obtain the number of codes in the column. Each value in the cumsum of that table corresponds to the index of the last item in each sequence. On each pass through the loop, the indexes variable gains another set of row positions, each one position earlier than the set before. The y variable is created by removing the rows indexed by indexes. (It doesn't matter that indexes is unsorted.) You then do whatever you need with y. Here's the code with an example data.frame:
N <- 31
dat <- data.frame(x = c(rep("A", 31), rep("B", 31), rep("C", 31), rep("D", 31), rep("E", 31)))
t.x <- cumsum(table(dat$x))
for (i in 1:(N - 1)) {
  if (i == 1) {
    indexes <- t.x
  } else {
    indexes <- c(indexes, t.x - (i - 1))
  }
  y <- dat$x[-indexes]
  print(table(y))
}
The print(table(y)) will show that the count of each code will decrease as required.
y
A B C D E
30 30 30 30 30
y
A B C D E
29 29 29 29 29
Solution with data.table package
Because you know exactly how many items are in each class as well as how many classes exist in the data, the following simple solution works:
Import packages and generate some test data:
rm(list=ls())
library(data.table)
A = rep('A', 3)
B = rep('B', 3)
C = rep('C', 3)
val = rep(1:3, 3)
DT = data.table(class=c(A,B,C), val=val)
This loop simply iterates as many times as there are items in each of your so-called "classes". With each iteration we subset an increasingly small portion of the original data with the .SD[1:(4-i)] part of the code. Be sure to use a value (4 in this case) that is one more than the number of items in each class so that you don't index past the end of each group. The cool part is that data.table allows us to do this by a grouping vector ("class" in this case).
for (i in 1:3) {
  print(DT[, .SD[1:(4 - i)], by = class])  # edit as needed to save copies
}
Output:
class val
1: A 1
2: A 2
3: A 3
4: B 1
5: B 2
6: B 3
7: C 1
8: C 2
9: C 3
class val
1: A 1
2: A 2
3: B 1
4: B 2
5: C 1
6: C 2
class val
1: A 1
2: B 1
3: C 1
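If the class sizes are not hard-coded, a variation using data.table's built-in group-size symbol .N avoids the magic number 4; a sketch:
for (i in 1:3) {
  print(DT[, head(.SD, .N - (i - 1)), by = class])
}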
I don't find the help page for the replace function from the base package very helpful. Worst of all, it has no examples that would help in understanding how it works.
Could you please explain how to use it? An example or two would be great.
If you look at the function (by typing its name at the console) you will see that it is just a simple functionalized version of the [<- function, which is described at ?"[". [ is a rather basic function in R, so you would be well-advised to look at that page for further details. Especially important is learning that the index argument (the second argument to replace) can be a logical, numeric or character vector. Recycling will occur when the second and third arguments have differing lengths:
You should "read" the function call as" "within the first argument, use the second argument as an index for placing the values of the third argument into the first":
> replace( 1:20, 10:15, 1:2)
[1] 1 2 3 4 5 6 7 8 9 1 2 1 2 1 2 16 17 18 19 20
Character indexing for a named vector:
> replace(c(a=1, b=2, c=3, d=4), "b", 10)
a b c d
1 10 3 4
Logical indexing:
> replace(x <- c(a=1, b=2, c=3, d=4), x>2, 10)
a b c d
1 2 10 10
You can also use logical tests
x <- data.frame(a = c(0,1,2,NA), b = c(0,NA,1,2), c = c(NA, 0, 1, 2))
x
x$a <- replace(x$a, is.na(x$a), 0)
x
x$b <- replace(x$b, x$b==2, 333)
Here are two simple examples:
> x <- letters[1:4]
> replace(x, 3, 'Z') #replacing 'c' by 'Z'
[1] "a" "b" "Z" "d"
>
> y <- 1:10
> replace(y, c(4,5), c(20,30)) # replacing 4th and 5th elements by 20 and 30
[1] 1 2 3 20 30 6 7 8 9 10
Be aware that in the examples given above, the third parameter (value) is a constant (e.g. 'Z' or c(20,30)).
Defining the third parameter using values from the data frame itself can lead to confusion.
E.g. with a simple data frame such as this (using dplyr::data_frame):
tmp <- data_frame(a=1:10, b=sample(LETTERS[24:26], 10, replace=T))
This will create something like this:
a b
(int) (chr)
1 1 X
2 2 Y
3 3 Y
4 4 X
5 5 Z
..etc
Now suppose what you wanted to do was to multiply the values in column 'a' by 2, but only where column 'b' is "X". My immediate thought would be something like this:
with(tmp, replace(a, b=="X", a*2))
That will not provide the desired outcome, however. The a*2 is evaluated up front as a fixed vector rather than as a reference to the 'a' column. The vector a*2 will thus be
[1] 2 4 6 8 10 12 14 16 18 20
at the start of the replace operation. Thus, in the first row where 'b' equals "X", the value in 'a' will be replaced by 2. In the second such row, it will be replaced by 4, and so on; it will not be replaced by two-times-the-value-of-a in that particular row.
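One way to get the intended result, sketched here with plain subset-assignment (and, alternatively, a replace() call whose values are aligned with the index):
# Only rows where b == "X" are touched, each using its own value of a
tmp$a[tmp$b == "X"] <- tmp$a[tmp$b == "X"] * 2
# or, keeping the replace() form:
with(tmp, replace(a, b == "X", a[b == "X"] * 2))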
Here's an example where I found the replace() function helpful for giving me insight. The problem required that a long integer vector be changed into a character vector, with its integers replaced by given character values.
## figuring out replace( )
(test <- c(rep(1,3),rep(2,2),rep(3,1)))
which looks like
[1] 1 1 1 2 2 3
and I want to replace every 1 with an A and 2 with a B and 3 with a C
letts <- c("A","B","C")
so in my own secret little "dirty-verse" I used a loop
for (i in 1:3) {
  test <- replace(test, test == i, letts[i])
}
which did what I wanted
test
[1] "A" "A" "A" "B" "B" "C"
In the first sentence I purposefully left out that the real objective was to make the big vector of integers a factor vector and assign the integer values (levels) some names (labels).
So another way of doing the replace( ) application here would be
(test <- factor(test,labels=letts))
[1] A A A B B C
Levels: A B C
I am using matching operators to grab values that appear in a matrix from a separate data frame. However, the resulting matrix has the values in the order they appear in the data frame, not in the original matrix. Is there any way to preserve the order of the original matrix using the matching operator?
Here is a quick example:
vec=c("b","a","c"); vec
df=data.frame(row.names=letters[1:5],values=1:5); df
df[rownames(df) %in% vec,1]
This produces [1] 1 2 3, which is the order in which "a" "b" "c" appear in the data frame. However, I would like to generate [1] 2 1 3, which is the order they appear in the original vector.
Thanks!
Use match.
df[match(vec, rownames(df)), ]
# [1] 2 1 3
Be aware that if you have duplicate values in either vec or rownames(df), match may not behave as expected.
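For example, a quick sketch: match only ever reports the first hit, so repeated look-up values all point at the same row, and duplicates in the table after the first are never reached.
match(c("b", "b", "a"), rownames(df))  # repeated values in vec both map to row 2
# [1] 2 2 1
match("a", c("a", "x", "a"))           # only the first "a" in the table is found
# [1] 1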
Edit:
I just realized that row name indexing will solve your issue a bit more simply and elegantly:
df[vec, ]
# [1] 2 1 3
Use match (and drop the NA values produced for elements of either vector that have no match in the other):
Filter(function(x) !is.na(x), match(rownames(df), vec))
Since row name indexing also works on vectors, we can take this one step further and define:
'%ino%' <- function(x, table) {
  xSeq <- seq(along = x)
  names(xSeq) <- x
  Out <- xSeq[as.character(table)]
  Out[!is.na(Out)]
}
We now have the desired result:
df[rownames(df) %ino% vec, 1]
[1] 2 1 3
Inside the function, names() auto-converts its values to character, and table is converted with as.character(), so this also works correctly when the inputs to %ino% are numbers:
LETTERS[1:26 %in% 4:1]
[1] "A" "B" "C" "D"
LETTERS[1:26 %ino% 4:1]
[1] "D" "C" "B" "A"
As with %in%, values that have no match are simply dropped:
LETTERS[1:26 %in% 3:-5]
[1] "A" "B" "C"
LETTERS[1:26 %ino% 3:-5]
[1] "C" "B" "A"
With %in%, the logical vector is recycled along the dimension of the object being subsetted; this is not the case with %ino%:
data.frame(letters, LETTERS)[1:5 %in% 3:-5,]
letters LETTERS
1 a A
2 b B
3 c C
6 f F
7 g G
8 h H
11 k K
12 l L
13 m M
16 p P
17 q Q
18 r R
21 u U
22 v V
23 w W
26 z Z
data.frame(letters, LETTERS)[1:5 %ino% 3:-5,]
letters LETTERS
3 c C
2 b B
1 a A