Categorical to Numeric Variable in R

Create a factor vector v1 using 10 random numbers without decimals.
Convert the factor vector to numeric vector v2.
Compare v1 and v2 element-wise. Store the comparison values (true or false) in a vector, and display it.
I have tried this:
v1<- factor(round(runif(10)),0)
v1
v2<-as.numeric(v1)
v2
comp<-v1==v2
comp

Have a look at the code below.
When v1 is a factor, as.numeric(v1) returns the integer level code of each element of v1, not the number shown in its label. In this example, the first element is 5, which is the third level of the factor, so as.numeric returns 3.
The second element of v1 is 2, which happens to be the second level as well, so as.numeric returns 2 and we get TRUE in the comparison v1 == v2 for that element. Also check the help page ?factor.
Using as.numeric(as.character(v1)) does the expected conversion.
set.seed(2002)
v1 <- factor(round(10*runif(10),0))
v1
# [1] 5 2 9 0 9 8 8 10 10 9
# Levels: 0 2 5 8 9 10
str(v1)
#Factor w/ 6 levels "0","2","5","8",..: 3 2 5 1 5 4 4 6 6 5
v2 <- as.numeric(v1)
v2
# [1] 3 2 5 1 5 4 4 6 6 5
v1 == v2
#[1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
v2 <- as.numeric(as.character(v1))
v1 == v2
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
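The help page ?factor also mentions as.numeric(levels(f))[f] as a slightly more efficient alternative to as.numeric(as.character(f)). A minimal sketch comparing the two, using a made-up factor f (the values are illustrative and are not the v1 from above):

```r
# Illustrative factor; the labels are numbers stored as text.
f <- factor(c("5", "2", "9", "0", "10"))

# Level codes: positions in levels(f), not the numbers themselves.
codes <- as.numeric(f)

# Two ways to recover the numbers that the labels stand for.
via_char   <- as.numeric(as.character(f))
via_levels <- as.numeric(levels(f))[f]

print(levels(f))
print(codes)
print(via_char)
print(identical(via_char, via_levels))
```

Both conversions go through the labels, so they agree with each other, while codes only reflects where each label sits among the levels.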

Related

Find all the duplicate records using duplicated() on R [duplicate]

I have one question in R.
I have the following example code for a question.
> exdata <- data.frame(a = rep(1:4, each = 3),
+ b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))
> exdata
a b
1 1 1
2 1 1
3 1 2
4 2 4
5 2 5
6 2 3
7 3 3
8 3 2
9 3 3
10 4 9
11 4 9
12 4 9
> exdata[duplicated(exdata), ]
a b
2 1 1
9 3 3
11 4 9
12 4 9
I tried to use the duplicated() function to find all the duplicate records in the exdata dataframe, but it only finds a part of the duplicated records, so it is difficult to confirm intuitively whether duplicates exist.
I'm looking for a solution that returns the following results
a b
1 1 1
2 1 1
7 3 3
9 3 3
10 4 9
11 4 9
12 4 9
Can I use the duplicated() function to get this result?
Or is there a way to use another function?
I would appreciate your help.

duplicated returns a logical vector with the same length as its argument, in which an element is TRUE when its value has already occurred earlier in the input. It has a method for data frames, duplicated.data.frame, that looks for duplicated rows (and so returns a logical vector of length nrow(exdata)). Extracting with that logical vector returns exactly those rows that have occurred at least once before. It won't, however, return the first occurrence of those rows.
Look at the index vector you're using:
duplicated(exdata)
# [1] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
But you can combine it with fromLast = TRUE to get all of the occurrences of these rows:
exdata[duplicated(exdata) | duplicated(exdata, fromLast = TRUE),]
# a b
# 1 1 1
# 2 1 1
# 7 3 3
# 9 3 3
# 10 4 9
# 11 4 9
# 12 4 9
Look at the logical vector for duplicated(exdata, fromLast = TRUE), and its combination with duplicated(exdata), to convince yourself:
duplicated(exdata, fromLast = TRUE)
# [1] TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE
duplicated(exdata) | duplicated(exdata, fromLast = TRUE)
# [1] TRUE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE
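The two calls can be wrapped in a small helper. This is only a sketch: the name all_duplicates is made up for illustration and is not part of base R.

```r
# Hypothetical helper: keep every row that has at least one identical twin,
# including the first occurrence of each duplicated row.
all_duplicates <- function(df) {
  dup <- duplicated(df) | duplicated(df, fromLast = TRUE)
  df[dup, , drop = FALSE]
}

# Example data from the question.
exdata <- data.frame(a = rep(1:4, each = 3),
                     b = c(1, 1, 2, 4, 5, 3, 3, 2, 3, 9, 9, 9))

res <- all_duplicates(exdata)
print(res)
```

With the question's exdata this keeps rows 1, 2, 7, 9, 10, 11 and 12, which is the result the asker was looking for.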

Why do I get different results indexing with data.table

Here is a simple example of trying to extract some rows
from a data.table. Indexing with what appear to be the same
logical vector, I get different answers:
a <- data.table(a=1:10, b = 10:1)
a # so here is the data we are working with
a b
1: 1 10
2: 2 9
3: 3 8
4: 4 7
5: 5 6
6: 6 5
7: 7 4
8: 8 3
9: 9 2
10: 10 1
let's extract just the first column since I need to dynamically
specify the column number as part of my processing
col <- 1L # get column 1 ('a')
x <- a[[col]] > 5 # logical vector specifying condition
x
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(x)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
look at the structure of a[[col]] > 5
a[[col]] > 5
[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
str(a[[col]] > 5)
logi [1:10] FALSE FALSE FALSE FALSE FALSE TRUE ...
this looks very much like 'x', so why do these two different ways
of indexing 'a' give different results?
a[x] # using 'x' as the logical vector
a b
1: 6 5
2: 7 4
3: 8 3
4: 9 2
5: 10 1
a[a[[col]] > 5] # using the expression as the logical vector
Empty data.table (0 rows) of 2 cols: a,b
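No answer accompanies this question here, so the following is an assumption about the cause rather than a confirmed explanation. data.table evaluates the expression inside [ ] with the table's columns in scope, and the table and its first column share the name a. Inside a[...], the name a would then refer to the column, so a[[col]] is the first element of that column and the comparison yields a single FALSE. The base R sketch below mimics that scoping with with(); it is an analogy, not data.table's implementation.

```r
# Base R analogy: the data frame and one of its columns share the name `a`.
a <- data.frame(a = 1:10, b = 10:1)
col <- 1L

# Evaluated normally, `a` is the data frame, so there is one value per row.
outside <- a[[col]] > 5

# Evaluated with the columns in scope, `a` is the column 1:10, so
# a[[col]] is its first element (1) and the result is a single FALSE.
inside <- with(a, a[[col]] > 5)

print(length(outside))
print(inside)
```

If this is indeed the cause, computing the condition beforehand (as was done with x) or giving the table a name that differs from its column names would avoid the clash.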

counting lengths between alternating columns

I am trying to figure out how to count the number of rows from when one column says TRUE to when the other column says TRUE. I attempted to use run-length encoding but couldn't figure out how to get the alternating values from each column.
set.seed(42)
s<-sample(c(0,1,2,3),500,replace=T)
isOverbought<-s==1
isOverSold<-s==0
head(cbind(isOverbought,isOverSold),20)
res<-rle(isOverSold)
tt<-res[res$values==0] #getting when Oversold is true
> head(cbind(isOverbought,isOverSold))
[1,] FALSE FALSE
[2,] FALSE FALSE
[3,] TRUE FALSE <-starting condition is overbought
[4,] FALSE FALSE
[5,] FALSE FALSE
[6,] FALSE FALSE
[7,] FALSE FALSE
[8,] FALSE TRUE <-is oversold. length from overbought to oversold = 5
[9,] FALSE FALSE
[10,] FALSE FALSE
[11,] TRUE FALSE <- is overbought. length from oversold to overbought = 3
[12,] FALSE FALSE
[13,] FALSE FALSE
[14,] TRUE FALSE
[15,] TRUE FALSE
[16,] FALSE FALSE
[17,] FALSE FALSE
[18,] FALSE TRUE <-is oversold. length from overbought to oversold = 7
[19,] TRUE FALSE <- is overbought. length from oversold to overbought = 1
[20,] FALSE FALSE
GOAL
overboughtTOoversold oversoldTOoverbought
5 3
7 1

This is sufficient to solve your problem.
## `a` to `b`
a2b <- function (a, b) {
x <- which(a) ## position of `TRUE` in `a`
y <- which(b) ## position of `TRUE` in `b`
z <- which(a | b) ## position of all `TRUE`
end <- match(y, z) ## match for end position
start <- c(1L, end[-length(end)] + 1L) ## start position
valid <- end > start ## remove cases with `end = start`
z[end[valid]] - z[start[valid]]
}
## cross `a` and `b`
axb <- function (a, b) {
if (any(a & b))
stop ("Invalid input! `a` and `b` can't have TRUE at the same time!")
x <- a2b(a, b); y <- a2b(b, a)
if (which(a)[1L] < which(b)[1L]) cbind(a2b = x, b2a = c(NA_integer_, y))
else cbind(a2b = c(NA_integer_, x), b2a = y)
}
For your isOverbought and isOverSold, we obtain:
result <- axb(isOverbought, isOverSold)
head(result)
# a2b b2a
#[1,] 5 NA
#[2,] 7 3
#[3,] 3 1
#[4,] 8 5
#[5,] 2 6
#[6,] 10 2
Since isOverbought has the first TRUE before isOverSold, the first element of the 2nd column is NA.
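To check a2b against the 20 rows printed in the question without depending on the random seed, the two vectors can be rebuilt by hand. The short names ob and os are illustrative; a2b is repeated (minus its unused x) only so the sketch runs on its own.

```r
# Copy of a2b from above so this example is self-contained.
a2b <- function(a, b) {
  y <- which(b)                           # positions of TRUE in `b`
  z <- which(a | b)                       # positions of any TRUE
  end <- match(y, z)                      # where each `b` event sits in z
  start <- c(1L, end[-length(end)] + 1L)  # first event after the previous `b`
  valid <- end > start                    # drop `b` events with no `a` before them
  z[end[valid]] - z[start[valid]]
}

# First 20 rows from the question, typed in by hand.
ob <- rep(FALSE, 20); ob[c(3, 11, 14, 15, 19)] <- TRUE  # isOverbought
os <- rep(FALSE, 20); os[c(8, 18)] <- TRUE              # isOverSold

print(a2b(ob, os))  # overbought to oversold
print(a2b(os, ob))  # oversold to overbought
```

On these rows the first call gives 5 and 7 and the second gives 3 and 1, matching the GOAL table in the question.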

The assumption for this answer is that there is at least one overbought/oversold transition (in either direction) and hence at least two rows in the data. This condition can easily be checked by counting the number of overbought and oversold conditions and making sure that each count is at least one.
The key is to remove the consecutive overbought and oversold conditions so that we only have alternating overbought and oversold conditions. One way to do this is:
## detect where we are overbought and oversold
i1 <- which(isOverbought)
i2 <- which(isOverSold)
## concatenate into one vector
i3 <- c(i1,i2)
## sort these and get the indices from the sort
i4 <- order(i3)
## at this point consecutive overbought or oversold conditions
## will be marked by a difference of 1 in i4 while alternating
## conditions will be marked by something other than 1. So
## filter those out to get i6. BTW, consecutive here does not mean
## consecutive rows in the data but consecutive occurrence of
## either overbought or oversold conditions without an intervening
## condition of the other. The assumption for at least one transition
## in the data is needed for this to work.
i5 <- diff(i4)
i6 <- i4[c(1,which(i5 != 1)+1)]
## then recover the alternating rows of overbought and oversold conditions in i7
i7 <- i3[i6]
## take the difference and format the output
## I need to credit @akrun for this part
i8 <- diff(i7)
## need to determine which is first
if (i1[1] < i2[1]) {
overboughtTOoversold <- i8[c(TRUE, FALSE)]
oversoldTOoverbought <- i8[c(FALSE, TRUE)]
} else {
overboughtTOoversold <- i8[c(FALSE, TRUE)]
oversoldTOoverbought <- i8[c(TRUE, FALSE)]
}
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)
print(head(d1))
## overboughtTOoversold oversoldTOoverbought
##[1,] 5 3
##[2,] 7 1
##[3,] 3 5
##[4,] 8 6
##[5,] 2 2
##[6,] 10 4
The cbind may generate a warning that the columns are not the same length. To get rid of that, just pad with NA at the end as appropriate.
A more compact version of the above is:
i3 <- c(which(isOverbought), which(isOverSold))
i4 <- order(i3)
i8 <- diff(i3[i4[c(1,which(diff(i4) != 1)+1)]])
if (which(isOverbought)[1] < which(isOverSold)[1]) {
overboughtTOoversold <- i8[c(TRUE, FALSE)]
oversoldTOoverbought <- i8[c(FALSE, TRUE)]
} else {
overboughtTOoversold <- i8[c(FALSE, TRUE)]
oversoldTOoverbought <- i8[c(TRUE, FALSE)]
}
d1 <- cbind(overboughtTOoversold, oversoldTOoverbought)

Here is a short version:
Create a vector called mktState. Encode it as 1 if overbought is TRUE, -1 if oversold is TRUE, and NA if both are FALSE. (You are interested only in days where the market state switches.)
Use na.locf (from the zoo package) to fill the NAs with the last observation carried forward.
Now use the rle function.
library(zoo) # provides na.locf
# df is assumed to hold the two logical columns overBought and overSold
mktState <- ifelse(df$overBought == TRUE,1,ifelse(df$overSold == TRUE,-1,NA))
mktState <- na.locf(mktState)
to get 'overbought' runs:
> rle(mktState)$lengths[rle(mktState)$values == 1]
[1] 5 7 3 8 2 10 7 3 1 2 4 2 5 6 3 11 4 1 5 2 4 6 1 1 8
[26] 7 3 1 1 1 1 3 2 3 1 6 1 1 1 3 2 4 2 1 6 8 8 1 5 15
[51] 2 5 4 2 1 1 3 4 7 1 7 11 1 3 4 2 4 1
and this will give you the 'oversold' runs:
> rle(mktState)$lengths[rle(mktState)$values == -1]
[1] 3 1 5 6 2 4 1 4 3 3 3 5 2 4 1 14 2 2 10 3 7 1 13 1 1
[26] 3 3 1 6 5 2 1 8 7 2 3 1 1 3 5 1 1 2 3 1 2 2 3 3 1
[51] 8 9 4 2 1 6 2 1 3 2 4 5 1 3 7 4 2 2
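The same idea can be tried without zoo. The sketch below rebuilds the first 20 rows of the question by hand and swaps in a cumsum-based fill for na.locf; the variable names are illustrative and the fill is a stand-in, not the zoo implementation.

```r
# First 20 rows from the question, typed in by hand.
ob <- rep(FALSE, 20); ob[c(3, 11, 14, 15, 19)] <- TRUE  # isOverbought
os <- rep(FALSE, 20); os[c(8, 18)] <- TRUE              # isOverSold

# 1 = overbought, -1 = oversold, NA = neither.
state <- ifelse(ob, 1, ifelse(os, -1, NA))

# Carry the last known state forward. cumsum(known) is 0 for leading NAs,
# and indexing with 0 drops those positions.
known <- !is.na(state)
filled <- state[known][cumsum(known)]

runs <- rle(filled)
overbought_runs <- runs$lengths[runs$values == 1]
oversold_runs   <- runs$lengths[runs$values == -1]

print(overbought_runs)
print(oversold_runs)
```

On these 20 rows the overbought runs come out as 5, 7, 2 and the oversold runs as 3, 1. The final 2 is a run cut off by the end of the sample rather than a completed transition.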

Here's a [somewhat long] tidyverse version:
library(dplyr)
library(tidyr)
# put vectors in a data.frame
data.frame(isOverbought, isOverSold) %>%
# evaluate each row separately
rowwise() %>%
# add column with name of event for any TRUE, else NA
mutate(change_type = ifelse(isOverbought | isOverSold, names(.)[c(isOverbought, isOverSold)], NA)) %>%
# reset grouping
ungroup() %>%
# replace NA values with last non-NA value
fill(change_type) %>%
# add a column of the cumulative number of changes in change_type
mutate(changes = data.table::rleid(change_type)) %>%
# count number of rows in each changes and change_type grouping
count(changes, change_type) %>%
# remove leading NAs
na.omit() %>%
# reset grouping
ungroup() %>%
# edit change into runs of two with integer division
mutate(changes = changes %/% 2) %>%
# spread to wide form
spread(change_type, n) %>%
# get rid of extra column
select(-changes)
## # A tibble: 68 x 2
## isOverbought isOverSold
## * <int> <int>
## 1 5 3
## 2 7 1
## 3 3 5
## 4 8 6
## 5 2 2
## 6 10 4
## 7 7 1
## 8 3 4
## 9 1 3
## 10 2 3
## # ... with 58 more rows

Compare two dataframe columns and add them to the dataframe

I have a dataframe with two columns. I want to add a new column to df that shows where the value in the second column matches the value in the first column.
I tried:
df<-data.frame(A=c("1","test","2","3",NA,"Test", NA),B=c("1","No Match","No Match","3",NA,"Test", "No Match"))
df[df$A == df$B ]
However, I get:
Error in Ops.factor(df$A, df$B) : level sets of factors are different
Any recommendation on what I am doing wrong?

Deal with the NA values first and then add your column:
> df[is.na(df)]=""
> df$New = with(df, A==B)
> df
A B New
1 1 1 TRUE
2 test No Match FALSE
3 2 No Match FALSE
4 3 3 TRUE
5 TRUE
6 Test Test TRUE
7 No Match FALSE
Or remove NA from your initial data.frame with df = df[complete.cases(df),] and then add the column.
If you really want to have False when there is NA in A or B column:
> transform(df, New=ifelse(is.na(A)|is.na(B), FALSE, df$A==df$B))
A B New
1 1 1 TRUE
2 test No Match FALSE
3 2 No Match FALSE
4 3 3 TRUE
5 <NA> <NA> FALSE
6 Test Test TRUE
7 <NA> No Match FALSE
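The error in the question arises because A and B are factors with different level sets, which is what data.frame() produced by default before R 4.0.0. One way around it, sketched here under the assumption that the asker's columns are factors, is to compare them as character vectors:

```r
# Reproduce the asker's setup with factor columns (the pre-4.0.0 default).
df <- data.frame(A = c("1", "test", "2", "3", NA, "Test", NA),
                 B = c("1", "No Match", "No Match", "3", NA, "Test", "No Match"),
                 stringsAsFactors = TRUE)

# Compare the labels rather than the factors themselves.
a <- as.character(df$A)
b <- as.character(df$B)

# FALSE whenever either side is NA, otherwise the result of the comparison.
df$New <- !is.na(a) & !is.na(b) & a == b
print(df)
```

This gives FALSE for the rows containing NA, the same as the ifelse() variant above.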

Tricky multi-step subset selection

I have a matrix:
1 3 NA
1 2 0
1 7 2
1 5 NA
1 9 5
1 6 3
2 5 2
2 6 1
3 NA 4
4 2 9
...
I would like to select those elements for each number in the first column to which the corresponding value in the second column has an NA in its own second column.
So the search would go the following way:
look up number in the first column: 1.
check corresponding values in second column: 3,2,7,5,9,6...
look up 3,2,7,5,9,6 in first column and see if they have NA in their
second column
The result in the above case would be:
>3 NA 4<
Since this is the only value which has NA in its own second column.
Here's what I want to do in words:
Look at the number in column one, I find '1'.
What numbers does 1 have in its second column: 3,2,7,5,9,6
Do these numbers have NA in their own second column? yes, 3 has an NA
I would like it to return those numbers not row numbers.
the result would be the subset of the original matrix with those rows which satisfy the condition.
This would be the matlab equivalent, where i is the number in column 1:
isnan(matrix(matrix(:,1)==i,2))==1)

Using by, to get the result by group of column 1, assuming dat is your data frame
by(dat,dat$V1,FUN=function(x){
y <- dat[which(dat$V1 %in% x$V2),]
y[is.na(y$V2),]
})
dat$V1: 1
V1 V2 V3
9 3 NA 4
--------------------------------------------------------------------------------
dat$V1: 2
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
--------------------------------------------------------------------------------
dat$V1: 3
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
--------------------------------------------------------------------------------
dat$V1: 4
[1] V1 V2 V3
<0 rows> (or 0-length row.names)
EDIT
Here I try to reproduce the MATLAB command.
Here is the R equivalent of the MATLAB expression:
isnan(matrix(matrix(:,1)==i,2))==1) ## what is i here
is.na(dat[dat[dat[,1]==1,2],]) ## R equivalent , I set i =1
V1 V2 V3
3 FALSE FALSE FALSE
2 FALSE FALSE FALSE
7 FALSE FALSE FALSE
5 FALSE FALSE FALSE
9 FALSE TRUE FALSE
6 FALSE FALSE FALSE

This hopefully reads easily as it follows the steps you described:
idx1 <- m[, 1L] == 1L
idx2 <- m[, 1L] %in% m[idx1, 2L]
idx3 <- idx2 & is.na(m[, 2L])
m[idx3, ]
# V1 V2 V3
# 3 NA 4
It is all vectorized and uses integer comparison so it should not be terribly slow. However, if it is too slow for your needs, you should use a data.table and use your first column as the key.
Note that you don't need any of the assignments, so if you are looking for a one-liner:
m[is.na(m[, 2L]) & m[, 1L] %in% m[m[, 1L] == 1L, 2L], ]
# [1] 3 NA 4
(but definitely harder to read and maintain.)
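For reuse with other keys, the steps can be put into a function. This is a sketch: the name lookup_na is hypothetical, m is rebuilt here from the ten rows shown in the question, and column 1 is assumed to contain no NA.

```r
# The ten rows from the question as a numeric matrix.
m <- matrix(c(1, 3, NA,
              1, 2, 0,
              1, 7, 2,
              1, 5, NA,
              1, 9, 5,
              1, 6, 3,
              2, 5, 2,
              2, 6, 1,
              3, NA, 4,
              4, 2, 9), ncol = 3, byrow = TRUE)

# Hypothetical helper: rows whose column-1 value appears in column 2 of
# key `i` and whose own column 2 is NA.
lookup_na <- function(m, i) {
  targets <- m[m[, 1] == i, 2]                # column-2 values for key i
  hit <- m[, 1] %in% targets & is.na(m[, 2])  # referenced rows with NA in column 2
  m[hit, , drop = FALSE]                      # stay a matrix for 0 or 1 rows
}

print(lookup_na(m, 1))
print(nrow(lookup_na(m, 4)))
```

Using drop = FALSE keeps the result a matrix, so a single match is returned as a one-row matrix rather than collapsing to a plain vector as in the one-liner above.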

I am still not totally clear as to what you want, but maybe this would work?
m<-read.table(
textConnection("1 3 NA
1 2 0
1 7 2
1 5 NA
1 9 5
1 6 3
2 5 2
2 6 1
3 NA 4
4 2 9"))
do.call(rbind,lapply(split(m[,2],m[,1]),function(x) m[x[!is.na(x)][is.na(m[x[!is.na(x)],2])],]))
# V1 V2 V3
# 1 3 NA 4
It would be much nicer if you provided an example whose expected result has more than one row.
