Extracting consecutive occurrences in R (like unix uniq) - r

I'm beginning to analyse data for my thesis. I first need to count consecutive occurrences of strings as one. Here's a sample vector:
test <- c("vv","vv","vv","bb","bb","bb","","cc","cc","vv","vv")
I would like to simply extract unique values, as in the unix command uniq. So expected output would be a vector as :
"vv","bb","cc","vv"
I looked at the rle function, which seems to be fine, but how would I get the output of rle as a vector? I don't seem to understand the rle class...
> rle(test)
Run Length Encoding
lengths: int [1:5] 3 3 1 2 2
values : chr [1:5] "vv" "bb" "" "cc" "vv"
How to get one vector of the values output by rle and another one for the lengths ? Hope I'm making myself clear...
Thanks again for any help !

rle() returns a two-element list of class "rle"; as @gsk points out, you can use ordinary list-indexing constructs to access the component vectors.
Also, try this, to put the results of rle into a more familiar format:
as.data.frame(rev(unclass(rle(test))))
#   values lengths
# 1     vv       3
# 2     bb       3
# 3              1
# 4     cc       2
# 5     vv       2

Source: http://www.sigmafield.org/2009/09/22/r-function-of-the-day-rle
Solution: rle(test)$values
They use: coin.rle <- rle(coin) and coin.rle$values so, rle(test)$values should work.
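To tie the answers above back to the question, here is a minimal sketch that pulls both components out of the rle object as plain vectors. Dropping the empty-string run to match the expected output is an assumption about what the OP wants; the names keep, uniq_vals and uniq_lens are illustrative only.

```r
test <- c("vv","vv","vv","bb","bb","bb","","cc","cc","vv","vv")

r <- rle(test)        # a list of class "rle" with two components
vals <- r$values      # character vector of run values
lens <- r$lengths     # integer vector of run lengths

# Assumption: the empty-string run should be dropped, as in the expected output
keep <- vals != ""
uniq_vals <- vals[keep]
uniq_lens <- lens[keep]

# inverse.rle() rebuilds the original vector from the rle object
restored <- inverse.rle(r)
```

Because r is an ordinary list underneath, r[["values"]] works just as well as r$values.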

Related

Extract all values from a vector of named numerics with the same name in R

I'm trying to handle a vector of named numerics for the first time in R. The vector itself is named p.values. It consists of p-values which are named after their corresponding variables. Through simulation I obtained a huge number of p-values that are always named after one of the five variables they correspond to. However, I'm interested in the p-values of only one variable and tried to extract them with p.values[["var_a"]], but that gives me only the p-value of var_a's last entry. p.values$var_a is invalid, and as.numeric(p.values) or unname(p.values) obviously gives me only all values without names. Any idea how I can get R to give me the 1/5 of named numerics that are named var_a?
Short example:
p.values <- as.numeric(c(rep(1:5, each = 5)))
names(p.values) <- rep(letters[1:5], 5)
str(p.values)
Named num [1:25] 1 1 1 1 1 2 2 2 2 2 ...
- attr(*, "names")= chr [1:25] "a" "b" "c" "d" ...
I'd like to get R to show me all 5 numbers named "a".
Thanks for reading my first post here and I hope some more experienced R users know how to deal with named numerics and can help me with this issue.
You can subset p.values using [ with names(p.values) == "a" to show all values named a.
p.values[names(p.values) == "a"]
#a a a a a
#1 2 3 4 5
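If the values for every name are needed rather than just one, a possible variation (not part of the original answer) is to split the vector by its names. This sketch rebuilds the example data so it runs on its own; by_name is an illustrative name.

```r
p.values <- as.numeric(rep(1:5, each = 5))
names(p.values) <- rep(letters[1:5], 5)

# Logical subsetting keeps every element whose name is "a"
a_vals <- p.values[names(p.values) == "a"]

# split() groups the values by name, giving a list with one vector per name
by_name <- split(unname(p.values), names(p.values))
by_name$a
```

The list form is convenient when the same summary has to be computed per variable, for example with sapply(by_name, mean).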

How do I get the number of levels of a factor in a tibble?

This seems pretty basic, but the number of verbs in the tidyverse is huge now and I don't know which package to look for this.
Here is the problem. I have a tibble
df <- tibble(f1 = factor(rep(letters[1:3],5)),
c1 = rnorm(15))
Now if I use the $ operator I can easily find out how many levels are in the factor.
nlevels(df$f1)
# [1] 3
But if I use the [] operator it returns an incorrect number of levels.
nlevels(df[,"f1"])
# [1] 0
Now if df is a data.frame and not a tibble the nlevels() function works with both the $ operator and the [] operator.
So does anyone know the tidyverse equivalent of nlevels() that works on both data.frames and tibbles?
Elaborating on the answer from timcdlucas (and the comments from r2evans), the issue here is the behavior of various forms of the extract operator, not the behavior of tibble. Why? A tibble is actually a kind of data.frame, as illustrated when we use the str() function on a tibble.
> library(dplyr)
> aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
+ c1 = rnorm(15))
>
> # illustrate that aTibble is actually a type of data frame
> str(aTibble)
tibble [15 × 2] (S3: tbl_df/tbl/data.frame)
$ f1: Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
$ c1: num [1:15] -0.5829 0.3682 1.1854 -0.6309 -0.0268 ...
There are four forms of the extract operator in R: [, [[, $, and @; as noted in What is the meaning of the dollar sign $ in R function?.
The first form, [, can be used to extract content from vectors, lists, matrices, or data frames. When used with a tibble, it returns a tibble unless the drop = TRUE argument is included, as noted in the question comments by r2evans. A base data.frame behaves differently: selecting a single column with df[,"f1"] drops the result to a vector by default, which is why nlevels() works there.
Since the default setting of drop= in the [ method for tibbles is FALSE, it follows that df[,"f1"] produces an unexpected or "wrong" result for the code posted with the original question.
library(dplyr)
aTibble <- tibble(f1 = factor(rep(letters[1:3],5)),
c1 = rnorm(15))
# produces unexpected answer
nlevels(aTibble[,"f1"])
> nlevels(aTibble[,"f1"])
[1] 0
The drop = argument is used when extracting from matrices or arrays (i.e. any object that has a dim attribute), as explained in the help for the drop() function.
> dim(aTibble)
[1] 15 2
>
When we set drop = TRUE, the extract function returns an object of the lowest type available, that is all extents of length 1 are removed. In the case of the original question, drop = TRUE with the extract operator returns a factor, which is the right type of input for nlevels().
> nlevels(aTibble[,"f1",drop=TRUE])
[1] 3
The [[ and $ forms of the extract operator extract a single object, so they return objects of type factor, the required input to nlevels().
> str(aTibble$f1)
Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble$f1)
[1] 3
>
> # produces expected answer
> str(aTibble[["f1"]])
Factor w/ 3 levels "a","b","c": 1 2 3 1 2 3 1 2 3 1 ...
> nlevels(aTibble[["f1"]])
[1] 3
>
The fourth form of the extract operator, @ (known as the slot operator), is used with formally defined objects built with the S4 object system, and is not relevant for this question.
Conclusion: Base R is still relevant when using the Tidyverse
Per tidyverse.org, the tidyverse is a collection of R packages that share an underlying philosophy, grammar, and data structures. When one becomes familiar with the tidyverse family of packages, it's possible to do many things in R without understanding the fundamentals of how Base R works.
That said, when one incorporates Base R functions or functions from packages outside the tidyverse into tidyverse-style code, it's important to know key Base R concepts.
I think you might need to use [[ rather than [, e.g.,
> nlevels(df[["f1"]])
[1] 3
df[,"f1"] returns a tibble with one column. So you're doing nlevels on an entire tibble which doesn't make sense.
df %>% pull('f1') %>% nlevels
gives you what you want.
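The effect of drop discussed in the answers above can be reproduced in base R alone, without loading tibble. This is a sketch under the assumption that a one-column data.frame obtained with drop = FALSE stands in for what the tibble returns; the variable names are illustrative.

```r
df <- data.frame(f1 = factor(rep(letters[1:3], 5)),
                 c1 = rnorm(15))

# Base data.frame: a single selected column is dropped to a factor by default
n_vec <- nlevels(df[, "f1"])

# drop = FALSE keeps a one-column data.frame, which has no levels attribute
n_df <- nlevels(df[, "f1", drop = FALSE])

# [[ always extracts the column itself, for data.frames and tibbles alike
n_col <- nlevels(df[["f1"]])

# Number of levels for every column at once (0 for non-factor columns)
n_all <- vapply(df, nlevels, integer(1))
```

Using [[ or $ is the form that does not depend on whether the object is a data.frame or a tibble.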

R: Index to unique vector that returns original

I have a vector v <- c(6,8,5,5,8) of which I can obtain the unique values using
> u <- unique(v)
> u
[1] 6 8 5
Now I need an index i = [2,3,1,1,3] that returns the original vector v when indexed into u.
> u[i]
[1] 6,8,5,5,8
I know such an index can be generated automatically in Matlab, the ci index, but does not seem to be part of the standard repertoire in R. Is anyone aware of a function that can do this?
The background is that I have several vectors with anonymized IDs that are long character strings:
ids
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
"PTefkd43fmkl28en==3rnl4"
"PTefkd43fmkl28en==3rnl4"
"cmdREW3rFDS32fDSdd;32FF"
To reduce the file size and simplify the code, I want to transform them into integers of the sort
ids
1
2
1
1
2
and found that the index of the unique vector does just this. Since there are many rows, I am hesitant to write a function that loops over each element of the unique vector and wonder whether there is a more efficient way — or a completely different way to transform the character strings into matching integers.
Try with match
df1$ids <- with(df1, match(ids, unique(ids)) )
df1$ids
#[1] 1 2 1 1 2
Or we can convert to factor and coerce to numeric
with(df1,as.integer(factor(ids, levels=unique(ids))))
#[1] 1 2 1 1 2
Using u and v: judging by the index i expected in the OP's post, 'u' must have been sorted
u <- sort(unique(v))
match(v, u)
#[1] 2 3 1 1 3
Or using findInterval. Make sure that 'u' is sorted.
findInterval(v,u)
#[1] 2 3 1 1 3
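For reference, here is a self-contained sketch of the match() approach on the ID strings from the question, together with a check that indexing the unique vector reproduces the original. The answers above use a data frame df1 that is not shown, so plain vectors are used here instead.

```r
ids <- c("PTefkd43fmkl28en==3rnl4",
         "cmdREW3rFDS32fDSdd;32FF",
         "PTefkd43fmkl28en==3rnl4",
         "PTefkd43fmkl28en==3rnl4",
         "cmdREW3rFDS32fDSdd;32FF")

u <- unique(ids)      # unique values in order of first appearance
i <- match(ids, u)    # position of each element of ids within u

# Indexing u with i gives back the original vector
round_trip <- u[i]

# The same idea on the numeric example, without sorting
v <- c(6, 8, 5, 5, 8)
iv <- match(v, unique(v))
```

match() is vectorised, so no explicit loop over the unique values is needed even for long vectors.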

Data.frame with both characters and numerics in one column

I have a function I'm using in R that requires input to several parameters, each either as a numeric (1) or as a character (NULL). The default is NULL.
I want to apply the function using all possible combinations of parameters, so I used expand.grid to try and create a dataframe which stores these. However, I am running into problems with creating an object that contains both numerics and characters in one column.
This is what I've tried:
comb<-expand.grid(c("NULL",1),c("NULL",1),stringsAsFactors=FALSE), which returns:
comb
  Var1 Var2
1 NULL NULL
2    1 NULL
3 NULL    1
4    1    1
with all entries characters:
class(comb[1,1])
[1] "character"
If I now try and insert a numeric into a specific spot, I still receive a character:
comb[2,1]<-as.numeric(1)
class(comb[2,1])
[1] "character"
I've also tried it using stringsAsFactors=TRUE, or using expand.grid(c(0,1),c(0,1)) and then switching out the 0 for NULL but always have the exact same problem: whenever I do this, I do not get a numeric 1.
Manually creating an object using cbind and then inserting the NULL as a character also does not help. I'd be grateful for a pointer, or a work-around to running the function with all possible combinations of parameters.
As you have been told, generally speaking columns of data frames need to be a single type. It's hard to solve your specific problem, because it is likely that the solution is not really "putting multiple types into a single column" but rather re-organizing your other unseen code to work within this restriction.
As I suggested, it probably will be better to use the built in NA value as expand.grid(c(NA,1),c(NA,1)) and then modify your function to use NA as an input. Or, of course, you could just use some "special" numeric value, like -1, or -99 or something.
The related issue that I mentioned is that you really should avoid using the character string "NULL" to mean anything, since NULL is a special value in R, and confusion will ensue.
These sorts of strategies would all be preferable to mixing types, and using character strings of reserved words like NULL.
All that said, it technically is possible to get around this, but it is awkward, and not a good idea.
> d <- data.frame(x = 1:5)
> d$y <- list("a",1,2,3,"b")
> d
  x y
1 1 a
2 2 1
3 3 2
4 4 3
5 5 b
> str(d)
'data.frame': 5 obs. of 2 variables:
$ x: int 1 2 3 4 5
$ y:List of 5
..$ : chr "a"
..$ : num 1
..$ : num 2
..$ : num 3
..$ : chr "b"
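The NA suggestion earlier in this answer could look roughly like the sketch below. The function f is purely hypothetical and stands in for the OP's unseen function; na_to_null is an illustrative helper, not an existing R function.

```r
# Hypothetical stand-in for the OP's function: each argument is NULL or 1
f <- function(a = NULL, b = NULL) {
  paste0("a=", if (is.null(a)) "NULL" else a,
         ", b=", if (is.null(b)) "NULL" else b)
}

# Keep the grid purely numeric, with NA marking "use the NULL default"
comb <- expand.grid(a = c(NA, 1), b = c(NA, 1))

# Illustrative helper: translate NA back to NULL just before the call
na_to_null <- function(x) if (is.na(x)) NULL else x

res <- vapply(seq_len(nrow(comb)), function(i) {
  f(na_to_null(comb$a[i]), na_to_null(comb$b[i]))
}, character(1))
```

This keeps every column of comb numeric and confines the NULL handling to a single place.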

Reading csv file, having numbers and strings in one column

I am importing a 3 column CSV file. The final column is a series of entries which are either an integer, or a string in quotation marks.
Here are a series of example entries:
1,4,"m"
1,5,20
1,6,"Canada"
1,7,4
1,8,5
When I import this using read.csv, these are all just turned into factors.
How can I set it up such that these are read as integers and strings?
Thank you!
This is not possible, since a given vector can only have a single mode (e.g. character, numeric, or logical).
However, you could split the vector into two separate vectors, one with numeric values and the second with character values:
vec <- c("m", 20, "Canada", 4, 5)
vnum <- as.numeric(vec)
vchar <- ifelse(is.na(vnum), vec, NA)
vnum
[1] NA 20 NA 4 5
vchar
[1] "m" NA "Canada" NA NA
EDIT Despite the OP's decision to accept this answer, @Andrie's answer is the preferred solution. My answer is meant only to inform about some odd features of data frames.
As others have pointed out, the short answer is that this isn't possible. data.frames are intended to contain columns of a single atomic type. @Andrie's suggestion is a good one, but just for kicks I thought I'd point out a way to shoehorn this type of data into a data.frame.
You can convert the offending column to a list (this code assumes you've set options(stringsAsFactors = FALSE)):
dat <- read.table(textConnection("1,4,'m'
1,5,20
1,6,'Canada'
1,7,4
1,8,5"),header = FALSE,sep = ",")
tmp <- as.list(as.numeric(dat$V3))
tmp[c(1,3)] <- dat$V3[c(1,3)]
dat$V3 <- tmp
str(dat)
'data.frame': 5 obs. of 3 variables:
$ V1: int 1 1 1 1 1
$ V2: int 4 5 6 7 8
$ V3:List of 5
..$ : chr "m"
..$ : num 20
..$ : chr "Canada"
..$ : num 4
..$ : num 5
Now, there are all sorts of reasons why this is a bad idea. For one, lots of code that you'd expect to play nicely with data.frames will not like this and either fail, or behave very strangely. But I thought I'd point it out as a curiosity.
No. A data frame is a series of pasted-together vectors (a list of vectors or matrices). Because each column is a vector, it cannot be classified as both integer and factor. It must be one or the other. You could split the vector apart into numeric and factor (a column for each), but I don't believe this is what you want.
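Putting the split-into-two-vectors idea from the first answer together with the import step, one possible sketch reads the third column as character and then derives a numeric and a character column from it. The column names V3_num and V3_chr are made up for illustration.

```r
csv_text <- '1,4,"m"\n1,5,20\n1,6,"Canada"\n1,7,4\n1,8,5'

# stringsAsFactors = FALSE keeps the mixed column as character, not factor
dat <- read.csv(text = csv_text, header = FALSE, stringsAsFactors = FALSE)

# Entries that are not numbers become NA; the coercion warning is expected
num <- suppressWarnings(as.numeric(dat$V3))

dat$V3_num <- num                                      # numeric entries only
dat$V3_chr <- ifelse(is.na(num), dat$V3, NA_character_) # string entries only
```

Each derived column has a single type, so ordinary data frame code keeps working on it.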
