R: find index of minima with gap / distance condition

list <- c(1,1,1,4,5,6,9,9,2)
I want to find the indices of the 3 lowest values, but with the condition that the indices of the found minima are at least 3 apart from each other.
To find the 3 lowest indices I'm using
which(list <= sort(list, decreasing=FALSE)[3], arr.ind=TRUE)
which doesn't apply any distance condition and results in
1,2,3
My desired result is
1,9,4
I want to know if it's possible doing that without any loops since my list is a lot bigger.
Thank you so much in advance.
To clarify what I meant: the indices of the minima must always be a certain distance apart. For example, for the list list <- c(1,3,9,5,9,9,2) the result should be 1,7,4. Not 1,7,2, because the indices 1 and 2 are too close together.
Thank you again for helping me.

Try this using dplyr:
Create a data frame with the sequence in the 2nd column, then sort and find the first occurrence of each value:
library(dplyr)
kk <- data.frame(cbind(list, seq = seq_along(list))) %>%
  arrange(list) %>%             # sort by value
  group_by(list) %>%            # group by value
  summarise(V3 = min(seq)) %>%  # find the first occurrence of each value
  .$V3 %>%                      # get the sequence (index) values
  head(3)                       # get the top 3
[1] 1 9 4
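Note that this keeps the first occurrence of each distinct value; it happens to satisfy the gap condition in both examples above, but it does not enforce a minimum distance in general. A minimal editorial sketch that does enforce the gap is a greedy pass over the value-sorted indices (the helper name min_indices_gap is made up; the loop only runs until n picks are found, so it stays cheap even for long vectors):
min_indices_gap <- function(x, n = 3, gap = 3) {
  ord <- order(x)                  # candidate indices, smallest value first
  picked <- integer(0)
  for (i in ord) {
    # keep an index only if it is at least `gap` away from everything kept so far
    if (all(abs(i - picked) >= gap)) picked <- c(picked, i)
    if (length(picked) == n) break
  }
  picked
}
min_indices_gap(c(1, 1, 1, 4, 5, 6, 9, 9, 2))
# [1] 1 9 4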

Related

How to assign unambiguous values for each row in a data frame based on values found in rows from another data frame using R?

I have been struggling with this question for a couple of days.
I need to scan every row of a data frame and then assign a univocal identifier to each row based on values found in a second data frame. Here is a toy example.
df1 <- data.frame(c(99443975, 558, 99009680, 99044573, 599, 99172478))
names(df1) <- "Building"
V1 <- c(558, 134917, 599, 120384)
V2 <- c(4400796, 14400095, 99044573, 4500481)
V3 <- c(NA, 99009680, 99340705, 99132792)
V4 <- c(NA, 99156365, NA, 99132794)
V5 <- c(NA, 99172478, NA, 99181273)
V6 <- c(NA, NA, NA, 99443975)
row_number <- 1:4
df2 <- data.frame(cbind(V1, V2, V3, V4, V5, V6, row_number))
The output I expect is what follows.
row_number_assigned <- c(4, 1, 2, 3, 3, 2)
output <- data.frame(cbind(df1, row_number_assigned))
Any hints?
Here's an efficient method using the arr.ind feature of the which function:
sapply(df1$Building,                          # send the Building entries one-by-one
       function(inp) which(inp == df2,        # find matching values
                           arr.ind = TRUE)[1])  # return only the row, not the column
[1] 4 1 2 3 3 2
Incidentally, your use of the data.frame(cbind(.)) construction is very dangerous. A much less dangerous method for dataframe construction, which also takes fewer keystrokes, would be:
df2 <- data.frame(V1 = c(558, 134917, 599, 120384),
                  V2 = c(4400796, 14400095, 99044573, 4500481),
                  V3 = c(NA, 99009680, 99340705, 99132792),
                  V4 = c(NA, 99156365, NA, 99132794),
                  V5 = c(NA, 99172478, NA, 99181273),
                  V6 = c(NA, NA, NA, 99443975))
(It didn't cause coding errors this time, but if there were any character columns it would have changed all the numbers to character values.) If you learned this from a teacher, perhaps approach them gently and do their future students a favor: let them know that cbind() will coerce all of its arguments to the "lowest common denominator".
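To see that coercion in action, here is a quick illustration (editorial addition, not part of the original answer):
# cbind() with mixed types builds a character matrix, so the numbers
# silently become strings (or factors in R < 4.0) once wrapped in data.frame():
m <- cbind(id = 1:2, name = c("a", "b"))
str(data.frame(m))
# 'data.frame': 2 obs. of  2 variables:
#  $ id  : chr  "1" "2"
#  $ name: chr  "a" "b"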
You could use a tidyverse approach:
library(dplyr)
library(tidyr)
df1 %>%
  left_join(df2 %>%
              pivot_longer(-row_number) %>%
              select(-name),
            by = c("Building" = "value"))
This returns
Building row_number
1 99443975 4
2 558 1
3 99009680 2
4 99044573 3
5 599 3
6 99172478 2

Find the combinations that add up to a number, followed by the next number until 0 - in R

I am trying to create a matrix of all the combinations that add up to a number, followed by the combinations that add up to the next number, down to 0. For example:
Input: N = 3
Output:
[1,] 3 0
[2,] 2 1
[3,] 1 1 1
[4,] 2 0
[5,] 1 1
[6,] 1 0
[7,] 0
My question is: is there a package in R that can do this, or do I have to write my own function?
I would really appreciate any help or guidance on this. Thank you.
I think I have figured out a solution using the partitions package.
library(partitions)
n <- 7
df <- t(as.matrix(blockparts(rep(n, n), n, include.fewer = FALSE)))  # transposed matrix of the possible combinations
indx <- !duplicated(t(apply(df, 1, sort)))  # find non-duplicates among the sorted rows
df[indx, ]                                  # keep only the non-duplicates
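If you only need the ragged output shown in the question, a hedged alternative sketch using parts() from the same partitions package enumerates the partitions of each total from N down to 0:
library(partitions)
N <- 3
res <- lapply(N:1, function(k) {
  p <- parts(k)  # one partition of k per column, zero-padded
  lapply(seq_len(ncol(p)), function(j) p[p[, j] > 0, j])  # drop the padding zeros
})
c(unlist(res, recursive = FALSE), list(0))
# a list of vectors: 3; 2 1; 1 1 1; 2; 1 1; 1; 0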

R removes more observations than it should with dplyr or base subset

I've got a question regarding the filter() function of dplyr and/or the base subset() function in R. Basically, when I use filter() or subset() I can extract observations based on two conditions, which is what I need.
As an example, this is what I've been using so far:
df %>% filter(Axis_1_1 == "Diagnostic of function on axis1 postponed") %>% filter(is.na(diagnostic_code9))
This gives me the right number of observations that satisfy these two conditions at the same time, i.e. 92 out of the 23992 in total.
However, when I use the negation sign to exclude these observations from my current dataframe, R deletes roughly 8000 extra observations. Thus, the end result is 15992 observations left after filtering with the negation "!" sign. Example:
df %>% filter(Axis_1_1 != "Diagnostic of function on axis1 postponed") %>% filter(!is.na(diagnostic_code9))
Using simple subsetting from base R gives me the same wrong end result, while it manages to find the correct 92 observations that satisfy the condition, as stated in the first example.
subset(df, df$Axis1_1 == "Diagnostic of function on axis1 postponed" & is.na(diagnostic_code9))
My dataframe consists of 112 variables and 23900+ observations in the current setting.
Thus, my questions are:
Could there be something curious going on with my dataframe? (Unfortunately I cannot give you a subset of it.)
Second, is there something wrong with my coding?
Lastly, what exactly is R doing in the background? It is able to filter out these observations based on the exact conditions, matching the string and the is.na() function, yet it does something completely different when the negation sign is used.
Your logic doesn't quite work in this case. Doing two subsequent filter statements is like doing an AND operation. Consider the following example:
df <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 2),
                 b = c(NA, NA, 5, 5, 5, 5, 5, NA))
df %>% filter(a==1) %>% filter(is.na(b))
# a b
# 1 1 NA
# 2 1 NA
df %>% filter(a!=1) %>% filter(!is.na(b))
# a b
# 1 2 5
# 2 2 5
# 3 2 5
Note the rows with a = 1, b = 5 are not returned, even though they were not in the first output either, because your first filter (filter(a != 1)) eliminates them.
So if you consider your two filters as A and B, in the first case you are doing A AND B. It would be the same as
df %>% filter(a==1 & is.na(b))
# a b
# 1 1 NA
# 2 1 NA
But in the second you are doing NOT A AND NOT B. These are not equivalent. By De Morgan's law, you need NOT A OR NOT B. So try
df %>% filter(a!=1 | !is.na(b))
# a b
# 1 1 5
# 2 1 5
# 3 2 5
# 4 2 5
# 5 2 5
# 6 2 NA
or equivalently (note the parentheses applying the NOT (!) to the whole expression):
df %>% filter(!(a==1 & is.na(b)))
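As a quick sanity check (editorial addition), the two formulations return the same rows:
identical(df %>% filter(a != 1 | !is.na(b)),
          df %>% filter(!(a == 1 & is.na(b))))
# [1] TRUE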

Excluding a number of answers from an R dataframe

I'm looking for a way to exclude a number of answers from a length() function.
This is a follow-on question from Getting R Frequency counts for all possible answers. In SQL the syntax could be:
select * from someTable
where variableName not in ( 0, null )
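A rough R analogue of that SQL (editorial sketch; someTable and variableName are the placeholder names from the query above) would be:
subset(someTable, variableName != 0 & !is.na(variableName))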
Given
Id <- c(1, 2, 3, 4, 5)
ClassA <- c(1, NA, 3, 1, 1)
ClassB <- c(2, 1, 1, 3, 3)
R <- c(5, 5, 7, NA, 9)
S <- c(3, 7, NA, 9, 5)
df <- data.frame(Id, ClassA, ClassB, R, S)
ZeroTenNAScale <- c(0:10, NA)
R.freq <- setNames(nm = c('R', 'freq'),
                   data.frame(table(factor(df$R, levels = ZeroTenNAScale, exclude = NULL))))
S.freq <- setNames(nm = c('S', 'freq'),
                   data.frame(table(factor(df$S, levels = ZeroTenNAScale, exclude = NULL))))
length(S.freq$freq[S.freq$freq!=0])
# 5
How would I change
length(S.freq$freq[S.freq$freq!=0])
to get an answer of 4 by excluding 0 and NA?
We can use colSums:
colSums(!is.na(S.freq)[S.freq$freq!=0,])[[1]]
#[1] 4
You can use sum to add up the freq counts. If NAs were present in the column being summed you could use na.rm = TRUE; however, because the NA is located in a different column, you first need to remove the row containing it.
Our solution is as follows: we remove the rows containing NA by subsetting with S.freq[!is.na(S.freq$S), ], keeping only the second column, freq:
sum(S.freq[!is.na(S.freq$S), "freq"])
# 4
You can try na.omit (to remove NAs) and subset (to get rid of all lines where freq equals 0):
subset(na.omit(S.freq), freq != 0)
S freq
4 3 1
6 5 1
8 7 1
10 9 1
From here, that's straightforward:
length(subset(na.omit(S.freq), freq != 0)$freq)
[1] 4
Does it solve your problem?
Just add !is.na(S.freq$S) as a second filter:
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$S)])
If you want to extend it with other conditions, you could make an index vector first for readability:
idx <- S.freq$freq!=0 & !is.na(S.freq$S)
length(S.freq$freq[idx])
You're looking for values with frequency > 0; that means you're looking for unique values. You can get this information directly from the vector S:
length(unique(df$S))
and, leaving NA aside, you get the answer 4 with:
length(unique(df$S[!is.na(df$S)]))
Regarding your question on how to exclude a number of items based on their value: in R this is easily done with logical vectors, as you already used in your code:
length(S.freq$freq[S.freq$freq!=0])
You can combine different conditions into one logical vector and use it for subsetting, e.g.
length(S.freq$freq[S.freq$freq!=0 & !is.na(S.freq$freq)])

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump straight to a for loop, but I know that R isn't great with loops of that type, and in this case I have hundreds of thousands of rows of data to sort through, so I am wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),  # a list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
ID sample_value log_message Current_Event
1 1 34 FIRST_EVENT FIRST
2 2 56 <NA> FIRST
3 3 78 SECOND_EVENT SECOND
4 4 98 <NA> SECOND
5 5 234 <NA> SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of step 1 is then converted to a character vector.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two, and we want the first elements of these vectors.
4. So I use sapply() to run the subsetting function `[` and extract the 1st element from each list component.
The whole thing is wrapped in transform() so that i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
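As an editorial aside, the last-observation-carried-forward step can also be done these days with tidyr::fill(), assuming the gaps are NA as in dat above:
library(dplyr)
library(tidyr)
dat %>%
  mutate(log_message = as.character(log_message)) %>%
  fill(log_message)  # carries the last non-NA value downward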
