So I've been playing around with a data frame in R, although I'm still thinking too much in Python and cannot seem to find a solution for my problem.
I have a data frame and one of the columns is a user id. I would like to remove the first occurrence of each number, for instance:
1,2,3,4,3,4,2,1,3,4,6,7,7
I would like to have an output like this:
3,4,2,1,3,4,7
Where the first time the user_id appears I would remove it but keep all the others even if repeated.
With Python I would probably use enumerate or loop over it. For R, I've seen some functions that seem cool, like rle, but I'm not sure how to use them with the data frame.
Any pointers will be really helpful since right now I'm a bit lost about the best approach for this problem.
Thank you all
The function duplicated() is going to be helpful here:
x <- c(1,2,3,4,3,4,2,1,3,4,6,7,7)
> x[duplicated(x)]
[1] 3 4 2 1 3 4 7
This works because duplicated() returns a logical vector indicating whether that element is, well, duplicated:
duplicated(x)
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE
You then use this logical vector to subset (extract) the values you want from x. But notice that in the extraction I keep all of the duplicated values, not remove them.
To remove all of the duplicated values (not what you want, but I illustrate regardless), try the negation:
x[!duplicated(x)]
[1] 1 2 3 4 6 7
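Since the question is about a data frame column rather than a bare vector, here is a minimal sketch of the same idea applied to rows. The data frame df and its visit column are made up for illustration; only the user_id values come from the question.

```r
# Hypothetical data frame; only the user_id values come from the question.
df <- data.frame(
  user_id = c(1, 2, 3, 4, 3, 4, 2, 1, 3, 4, 6, 7, 7),
  visit   = seq_len(13)
)

# Keep only the rows whose user_id has already appeared earlier,
# i.e. drop the first occurrence of every id.
repeats <- df[duplicated(df$user_id), ]

repeats$user_id
```

The logical vector from duplicated() is used as a row index here, so every other column of the retained rows comes along unchanged.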
I was trying to find a "readily available" function to do the following:
> my_array = c(5,9,11,10,6,5,9,13)
> my_array
[1] 5 9 11 10 6 5 9 13
> my_test <- c(5, 6)
> new_match_function(my_test, my_array)
[1] 1 5 6
# or instead, maybe:
# [[1]]
# [1] 1 6
# [[2]]
# [1] 5
For my purposes, %in% is close enough, since it will return:
> my_array %in% my_test
[1] TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
and I could just do:
> seq(length(my_array))[my_array %in% my_test]
[1] 1 5 6
But it just seems that something like match should provide this capability: a means to return multiple elements from the match.
If I were to create a package simply to provide this solution, it would not be strongly adopted (for good reason... this tiny use case is not worth installing a package).
Is there a solution already available? If not, where is a good place for me to add this? As I showed, it's easy enough to solve without a new function, but for match to not allow for multiple matches seems crazy. I'd ideally like to either:
Find out that I'm wrong and there is a direct function to accomplish this, or
Be able to alter match itself so that it can return multiple occurrences.
But my impression (right or wrong) has been that any adjustments to the base code are more trouble than they are worth.
For simple cases, which(my_array %in% my_test) or lapply(my_test, function(x) which(my_array==x)) works fine, but those are not the most efficient.
For the first case (just knowing which elements match, not which value each one corresponds to), the fastmatch package may help: its %fin% (fast-in) function keeps a hash table of your array so that subsequent lookups are more efficient.
For the second case, there is findMatches in the Bioconductor package S4Vectors. (https://bioconductor.org/packages/release/bioc/html/S4Vectors.html)
Note that this function doesn't return a list, but a Hits object. To get a list, you need the Bioconductor package IRanges as well (and use as.list). (https://bioconductor.org/packages/release/bioc/html/IRanges.html)
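For reference, the two base R one-liners mentioned at the start of this answer can be run on the question's data as follows. This is only a sketch of the simple case; the variable names flat_idx and per_value are made up here, and nothing from fastmatch or Bioconductor is used.

```r
my_array <- c(5, 9, 11, 10, 6, 5, 9, 13)
my_test  <- c(5, 6)

# Case 1: all positions in my_array that match any value of my_test.
flat_idx <- which(my_array %in% my_test)

# Case 2: one vector of positions per value of my_test.
per_value <- lapply(my_test, function(v) which(my_array == v))

flat_idx
per_value
```

The second form scans my_array once per element of my_test, which is why it is fine for small inputs but not the most efficient choice for large ones.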
While trying to find number of TRUE values in a vector, I came across the first Google hit. However, this does not fully meet my requirements. I am interested to find the number of TRUE values in a vector before the first FALSE if any. I have a vector a <- c(TRUE,TRUE,TRUE,FALSE,TRUE, TRUE) and want to find all TRUE values before the FALSE, so output will be three. Kindly note that it should also work if there are only TRUE values in the vector.
Here is a short way:
sum(cumprod(a))
# [1] 3
where cumprod gives a cumulative product (of zeros and ones in this case); so, it eliminates all TRUE's after the first FALSE, as in
cumprod(a)
# [1] 1 1 1 0 0 0
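The question also asks that this work when the vector contains only TRUE values. A quick check, wrapped in a small helper whose name (count_leading_true) is made up for this sketch:

```r
# Hypothetical helper: number of TRUE values before the first FALSE.
count_leading_true <- function(v) sum(cumprod(v))

a        <- c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE)
all_true <- c(TRUE, TRUE, TRUE, TRUE)

count_leading_true(a)         # mixed vector from the question
count_leading_true(all_true)  # no FALSE at all
count_leading_true(c(FALSE, TRUE))
```

With no FALSE present, the cumulative product never drops to zero, so the sum is simply the length of the vector.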
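Using the statement below we can get the result easily. A FALSE is appended so that it also works when the vector contains only TRUE values (plain which.min(a)-1 would return 0 in that case):
which.min(c(a, FALSE))-1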
In a previous question (Setting Value Based on Matching Column) I was trying to take a string, split it into elements and create a column per element with a logical statement. This was answered brilliantly.
But after a couple of months work on things I now need something on the lines of an inverse.
Given...
df <- data.frame(E1=FALSE,E11=TRUE,E20=FALSE,E30=FALSE,E31=TRUE,E100=FALSE,E300=FALSE,E313=TRUE,ECAT=TRUE)
I need to produce a string containing all the column names that have a TRUE match - which would hopefully yield something like...
> df[1,]
E1 E11 E20 E30 E31 E100 E300 E313 ECAT Topics
1 FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE E11,E31,E313,ECAT
In reality I have 3,270 rows and there are actually 102 topics so really I need something that will for each row provide a concatenation of those TRUE topic codes.
My attempts have yielded nothing working, who will volunteer up an answer OR a link to duplicate question/answer (as they probably exist - it is an R question after all)?
You can try
df$Topics <- apply(df, 1, function(x) toString(names(x)[x]))
You can use apply to do this.
df$Topics = apply(df,1,function(x) paste0(colnames(df)[x],collapse=','))
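Because the real data has many rows, here is a small multi-row sketch of the same apply() approach. The three-row data frame and the topic_cols variable are invented for illustration; the sketch assumes every topic column is logical and restricts apply() to those columns so that an existing Topics column is never fed back in.

```r
# Hypothetical three-row version of the question's data.
df <- data.frame(
  E1   = c(FALSE, TRUE,  FALSE),
  E11  = c(TRUE,  TRUE,  FALSE),
  ECAT = c(TRUE,  FALSE, FALSE)
)

# Remember the topic columns before adding the result column.
topic_cols <- names(df)

# For each row, keep the names of the TRUE columns and join them with commas.
df$Topics <- apply(df[topic_cols], 1, function(x) paste(topic_cols[x], collapse = ","))

df
```

A row with no TRUE values ends up with an empty string, since pasting zero names with collapse gives "".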
I am sure this is a simple question that has been asked many times, but this is one of those times when I find it difficult to know which terms to search for in order to find the solution. I have a simple list of lists, such as the one below:
sets <- list(S1=NA, S2=1L, S3=2:5)
> sets
$S1
[1] NA
$S2
[1] 1
$S3
[1] 2 3 4 5
And I have a scalar variable val which can take the value of any integer in sets (but will never be NA). Suppose val <- 4. What is a quick way to return a vector of TRUE/FALSE values, one per element of sets, where TRUE means val is in that element and FALSE means it is not? In this case I would want something like
[1] FALSE FALSE TRUE
I was hoping there would be some recursive form of %in% but I haven't had luck searching for it. Thank you!
Like this:
sapply(sets, `%in%`, x = val)
# S1 S2 S3
# FALSE FALSE TRUE
I had to look at the help page ?"%in%" to find out that the first argument to %in% is named x. And for your curiosity (not needed here), the second one is named table.
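As a variation on the same idea, the sketch below spells the call out with an anonymous function and uses vapply, which checks that each result is a single logical value. The variable name hits is made up; sets and val are as in the question.

```r
sets <- list(S1 = NA, S2 = 1L, S3 = 2:5)
val  <- 4

# One TRUE/FALSE per list element: is val contained in that element?
hits <- vapply(sets, function(s) val %in% s, logical(1))

hits
```

The names of sets carry over to the result, so the output can be indexed by name, for example hits["S3"].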
Suppose I have a vector x<-c(1,2,NA,4,5,NA).
I apply some mythological code to that vector, which results in another vector, y<-c(1,NA,3, 4,10,NA)
Now I wish to find out at which positions my two vectors differ, where I count two NAs as being the same, and one NA and a non-NA as different (e.g. the second element of the two example vectors).
Specifically, for my example, I would like to end up with a vector holding c(2,3,5).
For my use case, I am not content with a vector of logical variables, but obviously I can easily convert (which), so I'll accept that as well.
I have some solutions like:
simplediff<-x!=y
nadiff<-is.na(x)!=is.na(y)
which(simplediff | nadiff)
but it feels like I'm reinventing the wheel here. Any better options?
How about looping and using identical?
!mapply(identical,x,y)
[1] FALSE TRUE TRUE FALSE TRUE FALSE
And for positions:
seq_along(x)[!mapply(identical,x,y)]
[1] 2 3 5
or
which(!mapply(identical,x,y))
[1] 2 3 5
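One caveat with identical() is that it also compares types, so an integer 1L and a double 1 count as different. If that matters, a fully vectorized comparison is possible; the following is only a sketch, and the helper name which_differ is made up here.

```r
# Hypothetical helper: positions where x and y differ, treating NA/NA as equal.
which_differ <- function(x, y) {
  one_na  <- xor(is.na(x), is.na(y))             # exactly one side is NA
  both_ok <- !is.na(x) & !is.na(y)               # neither side is NA
  which(one_na | (both_ok & x != y))
}

x <- c(1, 2, NA, 4, 5, NA)
y <- c(1, NA, 3, 4, 10, NA)

which_differ(x, y)

# identical() is type-sensitive: integer vs double elements never match.
mapply(identical, c(1L, 2L), c(1, 2))
```

In both_ok & x != y, any NA produced by x != y is paired with a FALSE from both_ok, so the result contains no NA values.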
One possible solution (though surely not the best):
setdiff(seq_along(x), which(x == y | (is.na(x) & is.na(y))))
[1] 2 3 5