Retrieving minimum non-numeric value - r

This might be too simple a question, but I'm still familiarising myself with R syntax.
I have a data frame with 2 columns and 3 rows:
The first column is a numeric vector from 1 to 3.
The second column is a character vector with values: best, good, worse.
Which function should I be using in order to obtain the minimum non-numeric value (i.e. "worse")?

Another solution would be to use an ordered factor for the character variable. This way min will know what to do:
dat <- data.frame(a=1:3, b=c("worst","good","best"))
dat$b <- ordered(dat$b, levels=c("worst","good","best"))
min(dat$b)
Result:
> min(dat$b)
[1] worst
Levels: worst < good < best
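A minimal sketch of why the ordered factor matters, assuming strings are stored as plain character (forced here with stringsAsFactors = FALSE): on a character vector, min compares strings alphabetically, so the ranking has to be declared through the factor levels. The names alphabetical, lowest and lowest_row are only illustrative.

```r
dat <- data.frame(a = 1:3, b = c("worst", "good", "best"),
                  stringsAsFactors = FALSE)

# On plain character data, min() sorts alphabetically
alphabetical <- min(dat$b)   # "best"

# Declare the ranking explicitly, lowest level first
dat$b <- factor(dat$b, levels = c("worst", "good", "best"), ordered = TRUE)
lowest <- min(dat$b)         # worst

# Retrieve the whole row holding the lowest level
lowest_row <- dat[which.min(as.integer(dat$b)), ]
```

The as.integer() call converts the factor to its level codes, so which.min picks the row whose level comes first in the declared order.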

Related

How to subtract a value from specific values in a column on R

I am working on a data frame with a column that should give hours of sleep per night. However, the difftime() function has produced negative values (the number of hours of sleep) for some rows and positive values (the number of hours awake) for others. I want to subtract 24 from only the values that are above 0, so I have done:
data$Sleep.time <- with(data = data,
difftime(Bed.time, Waking.up.time, units = "hours"))
data$Sleep.time <- as.numeric(data$Sleep.time)
data$subtract <- (24)
data$Sleep.time <- if (data$Sleep.time>0) {data$Sleep.time - data$subtract}
This just takes 24 away from all of the values, so the values that are already negative are completely wrong. I'm not quite sure how to use the if function so that this works properly. Any help would be great!
if is not vectorized, i.e. it expects a logical expression of length 1. The 'Sleep.time' column has more than one element. We can either use ifelse, or create a logical index and use it to subtract and assign:
i1 <- data$Sleep.time> 0
data$Sleep.time[i1] <- data$Sleep.time[i1] - data$subtract[i1]
You could try using ifelse, something like this:
data$Sleep.time <- ifelse(data$Sleep.time > 0, data$Sleep.time - 24, data$Sleep.time)
Syntax: ifelse(condition, value if true, value if false). When the condition is a vector, the result is a vector of the same length.
Hope it helps. This is vectorized, so it is much faster than a loop.
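To check that the two approaches agree, here is a small sketch on made-up numbers. The vector sleep and its values are hypothetical and stand in for data$Sleep.time after as.numeric().

```r
# Hypothetical values: negatives are hours of sleep, positives are hours awake
sleep <- c(-8, 16.5, -7.25, 17)

# Approach 1: logical index, modify only the positive entries
i1 <- sleep > 0
fixed_index <- sleep
fixed_index[i1] <- fixed_index[i1] - 24

# Approach 2: ifelse, choose element by element
fixed_ifelse <- ifelse(sleep > 0, sleep - 24, sleep)

fixed_index   # -8.00 -7.50 -7.25 -7.00
```

Both versions leave the already negative entries untouched, which is the behaviour the original if statement could not provide.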

Counting all the matchings of a pattern in a vector in R

I have a boolean vector in which I want to count the number of occurrences of some patterns.
For example, for the pattern "(1,1)" and the vector "(1,1,1,0,1,1,1)", the answer should be 4.
The only built-in function I found to help is grepRaw, which finds the occurrences of a particular string in a longer string. However, it seems to fail when the sub-strings matching the pattern overlap:
length(grepRaw("11","1110111",all=TRUE))
# [1] 2
Do you have any ideas to obtain the right answer in this case?
Edit 1
I'm afraid that Rich's answer works for the particular example I posted, but fails in a more general setting:
> sum(duplicated(rbind(c(FALSE,FALSE),embed(c(TRUE,TRUE,TRUE,FALSE,TRUE,TRUE,TRUE),2))))
[1] 3
In this other example, the expected answer would be 0.
Using the function rollapply you can apply a moving window of width 2 that sums the values. Then you can count the windows where the result equals 2, i.e. sum(c(1,1)):
library(zoo)
z <- c(1,1,1,0,1,1,1)
sum(rollapply(z, 2, sum) == 2)
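The window-sum trick relies on the pattern consisting only of 1s. For an arbitrary pattern such as (FALSE, FALSE), one possible base R sketch compares every window against the pattern directly. The helper count_pattern is a hypothetical name, not a built-in, and it assumes the vector contains no NA values.

```r
# Count (possibly overlapping) occurrences of `pattern` inside `x`
count_pattern <- function(x, pattern) {
  n <- length(x)
  k <- length(pattern)
  if (k == 0 || k > n) return(0L)
  starts <- seq_len(n - k + 1)   # every possible window start
  sum(vapply(starts,
             function(i) all(x[i:(i + k - 1)] == pattern),
             logical(1)))
}

count_pattern(c(1, 1, 1, 0, 1, 1, 1), c(1, 1))          # 4
count_pattern(c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE),
              c(FALSE, FALSE))                          # 0
```

This checks each window explicitly, so it is slower than a vectorised sum on long vectors but does not depend on the pattern's contents.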

Split vector of floats by whole integer value

Suppose that I have a vector like the following
> head(samp)
[1] 1959.000 1959.083 1959.167 1959.250 1959.333 1959.417
> tail(samp)
[1] 1997.500 1997.583 1997.667 1997.750 1997.833 1997.917
This vector represents x-values for a plot that I am constructing. I want to superimpose each year's values on top of one another for my plot. To do so, I figure that I have to split this samp vector by whole integer value.
What is the easiest way to do so ?
The only solution I have come up with is taking a sequence for all of the years with
years <- seq(floor(min(samp)),
ceiling(max(samp)))
and then looping through the years and indexing to find the values belonging to each year. It feels like there should be some way to cut my vector up by year more easily than with an explicit loop, though.
I'll just turn my comment into an answer:
You are looking for the split function (see ?split to check out some examples).
It takes as arguments your vector and a grouping vector of the same length (a factor, or a numeric vector, which is coerced to a factor) defining how to group the values. The output of split is a list.
samp = c(1959.000 ,1959.083 ,1959.167 ,1959.250, 1960.000 ,1960.083)
split(samp, floor(samp))
#### $`1959`
#### [1] 1959.000 1959.083 1959.167 1959.250
####
#### $`1960`
#### [1] 1960.000 1960.083
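Since the goal in the question is to superimpose the years on one plot, a possible follow-up is to split the within-year offsets as well. This is only a sketch; year, by_year and offsets are illustrative names.

```r
samp <- c(1959.000, 1959.083, 1959.167, 1959.250, 1960.000, 1960.083)

year    <- floor(samp)              # grouping key: the whole-year part
by_year <- split(samp, year)        # list with one element per year
offsets <- split(samp - year, year) # fraction of the year, same grouping

names(by_year)    # "1959" "1960"
lengths(by_year)  # 4 values for 1959, 2 for 1960
```

Each element of offsets then lies in [0, 1), so the years can share one x-axis.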

Why does R need the name of the dataframe?

If you have a dataframe like this
mydf <- data.frame(firstcol = c(1,2,1), secondcol = c(3,4,5))
Why would
mydf[mydf$firstcol,]
work but
mydf[firstcol,]
wouldn't?
You can do this:
mydf[,"firstcol"]
Remember that the column goes second, not first.
In your example, to see what mydf[mydf$firstcol,] gives you, let's break it down:
> mydf$firstcol
[1] 1 2 1
So really mydf[mydf$firstcol,] is the same as
> mydf[c(1,2,1),]
    firstcol secondcol
1          1         3
2          2         4
1.1        1         3
So you are asking for rows 1, 2, and 1. That is, you are asking for your row one to be the same as row 1 of mydf, your row 2 to be the same as row 2 of mydf and your row 3 to be the same as row 1 of mydf; and you are asking for both columns.
Another question is why the following doesn't work:
> mydf[,firstcol]
Error in `[.data.frame`(mydf, , firstcol) : object 'firstcol' not found
That is, why do you have to put quotes around the column name when you ask for it like that, but not when you do mydf$firstcol? The answer is just that the operators you are using require different types of arguments. You can look at ?'$' to see the form x$name, and thus the second argument can be a name, which is not quoted. You can then look up ?'[', which will actually lead you to the same help page. There you will find the following, which explains it. Note that a "character" vector needs to have quoted entries (that is how you enter a character vector in R and many other languages).
i, j, ...: indices specifying elements to extract or replace. Indices
are ‘numeric’ or ‘character’ vectors or empty (missing) or
‘NULL’. Numeric values are coerced to integer as by
‘as.integer’ (and hence truncated towards zero). Character
vectors will be matched to the ‘names’ of the object (or for
matrices/arrays, the ‘dimnames’): see ‘Character indices’
below for further details.
Nothing to add to the very clear explanation of Xu Wang. You might want to note in addition that the package data.table allows you to use notation such as mydf[firstcol==1,] or mydf[,firstcol] (once mydf has been converted to a data.table), which many find more natural.
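For comparison, a short base R sketch of the equivalent ways to refer to the column without hitting the "object not found" error. The result names are illustrative only.

```r
mydf <- data.frame(firstcol = c(1, 2, 1), secondcol = c(3, 4, 5))

# Column access: quoted name inside [ , or unquoted name after $
col_by_name   <- mydf[, "firstcol"]
col_by_dollar <- mydf$firstcol

# Row filtering: the column must be qualified with the data frame...
rows_eq1 <- mydf[mydf$firstcol == 1, ]

# ...unless the expression is evaluated inside the data frame
rows_with   <- with(mydf, mydf[firstcol == 1, ])
rows_subset <- subset(mydf, firstcol == 1)
```

with() and subset() evaluate their expression using the data frame's columns as variables, which is why the bare name firstcol is found there.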

An elegant way to count number of negative elements in a vector?

I have a data vector with 1024 values and need to count the number of negative entries. Is there an elegant way to do this without looping and checking if an element is <0 and incrementing a counter?
You want to read 'An Introduction to R'. Your answer here is simply
sum( x < 0 )
which works thanks to vectorisation. The x < 0 expression returns a vector of booleans over which sum() can operate (by converting the booleans to standard 0/1 values).
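A tiny sketch of the idea on made-up values, including the case where the vector contains missing values (x and y are illustrative):

```r
x <- c(3, -1, 0, -7, 2)
n_neg <- sum(x < 0)                  # 2

# With an NA present, the comparison yields NA, so drop it when summing
y <- c(x, NA)
n_neg_na <- sum(y < 0, na.rm = TRUE) # 2
```

Without na.rm = TRUE, the second sum would itself be NA.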
There is a good answer to a related question from Steve Lianoglou: "How to identify the rows in my dataframe with a negative value in any column?"
Let me just replicate his code with one small addition (the 4th point).
Imagine you had a data.frame like this:
df <- data.frame(a = 1:10, b = c(1:3,-4, 5:10), c = c(-1, 2:10))
This will return you a boolean vector of which rows have negative values:
has.neg <- apply(df, 1, function(row) any(row < 0))
Here are the indexes of the rows that contain negative numbers:
which(has.neg)
Here is a count of the rows that contain negative numbers:
length(which(has.neg))
The solutions above need to be tweaked in order to apply them to a data frame.
The command below gets the count of rows satisfying a negative-value test, or any other logical condition.
Suppose you have a data frame:
df <- data.frame(x=c(2,5,-10,NA,7), y=c(81,-1001,-1,NA,-991))
In order to get the count of negative records in x (which() drops the NA comparison, which would otherwise be returned as an extra all-NA row and counted):
nrow(df[which(df$x < 0), ])
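As a hedged sketch of the NA issue with this data frame: logical row indexing returns an all-NA row for every NA in the condition, so a plain nrow() count can be too high. Summing the logical vector with na.rm = TRUE avoids that, and colSums gives the count for every column at once. The variable names are illustrative.

```r
df <- data.frame(x = c(2, 5, -10, NA, 7), y = c(81, -1001, -1, NA, -991))

# Indexing with a condition that contains NA keeps an extra all-NA row
naive_count <- nrow(df[df$x < 0, ])        # 2, one more than expected

# Counting TRUE values directly, ignoring NA
neg_in_x <- sum(df$x < 0, na.rm = TRUE)    # 1

# Negative counts for all columns in one call
neg_per_col <- colSums(df < 0, na.rm = TRUE)   # x: 1, y: 3
```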
