Gather all the unique values from 2 columns into a new column - r

I have a dataframe which includes all the nodes' connections in a network and I want to create a new dataframe named 'nodes' with all the unique nodes. I'm trying to do something like
eids<-as.factor(d$from)
mids<-as.factor(d$to)
nodes<-data.frame(c(eids,mids))
nodes<-unique(nodes)
but when I try to create the graph I get: "Some vertex names in edge list are not listed in vertex data frame", which means that part of my data is lost with this method. My dataset is quite large, so I put a toy dataset here.
from<-c(2,3,4,3,1,2)
to<-c(8,8,7,5,6,5)
d<-data.frame(from,to)

First, to solve your question, you can use unique(stack(d)[1]) to get a data frame with one column with values 1 to 8.
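The same idea can be sketched with plain vectors, assuming the toy d from the question. The column name `name` is only an illustrative choice, and the final check confirms that no endpoint of the edge list is missing from the node table.

```r
from <- c(2, 3, 4, 3, 1, 2)
to   <- c(8, 8, 7, 5, 6, 5)
d <- data.frame(from, to)

# Concatenate both columns as plain numbers (not factors), keep each value once
nodes <- data.frame(name = sort(unique(c(d$from, d$to))))
nodes$name
# [1] 1 2 3 4 5 6 7 8

# Every endpoint of every edge appears in the node table
all(c(d$from, d$to) %in% nodes$name)
# [1] TRUE
```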
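Here I explain why your code doesn't work. Using c() to combine objects of factor class is dangerous: in R versions before 4.1.0 it drops the levels and returns the underlying integer codes (from 4.1.0 onward, c() on factors returns a factor with the combined levels). You can try the following example, whose output is from an older version: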
(x <- factor(c("A", "B", "C", "D")))
# [1] A B C D
# Levels: A B C D
(y <- factor(c("E", "F", "G", "H")))
# [1] E F G H
# Levels: E F G H
c(x, y)
# [1] 1 2 3 4 1 2 3 4
Actually, a factor object is stored as integer codes, not characters. You can strip away its class and see that it is an integer vector with an attribute named levels:
unclass(x)
# [1] 1 2 3 4
# attr(,"levels")
# [1] "A" "B" "C" "D"
The integers are indices into the levels. In other words, a factor records, for each element, the position of its value in the levels vector.
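A minimal sketch of a way to combine two factors that does not depend on the R version, assuming you want the union of their labels: convert each factor to character first, then rebuild the factor.

```r
x <- factor(c("A", "B", "C", "D"))
y <- factor(c("E", "F", "G", "H"))

# Work on the labels, not on the integer codes
combined <- factor(c(as.character(x), as.character(y)))
levels(combined)
# [1] "A" "B" "C" "D" "E" "F" "G" "H"

# The codes index into the levels, so this recovers the labels of x
levels(x)[as.integer(x)]
# [1] "A" "B" "C" "D"
```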

Related

I use as.complex() to convert a string column to a numeric column in r

I have three columns whose values are the characters A, B, and C. I am using as.numeric() to convert them to numeric and then assign them values, e.g. 1, 2 and 3, but as.numeric() returns NAs. In different data frames the order varies, e.g. ABC or ACB, but A=i+0i, B=2+3i and C is also a complex number. I want to first convert the strings to complex numbers and then assign values to them.
LV$phase1 <- as.numeric(LV$phase1)
class(phase1)
A=1
print(phase1)
This is the error:
"Warning message:
NAs introduced by coercion "
It does not usually make sense to convert character data to numeric, but if the letters refer to an ordered sequence of events/phases/periods, then it may be useful. R uses factors for this purpose. For example
set.seed(42)
phase <- sample(LETTERS[1:4], 10, replace=TRUE)
phase
# [1] "A" "A" "A" "A" "B" "D" "B" "B" "A" "D"
factor(phase)
# [1] A A A A B D B B A D
# Levels: A B D
as.numeric(factor(phase))
# [1] 1 1 1 1 2 3 2 2 1 3
If this is what you are trying to do
LV$phase1 <- as.numeric(factor(LV$phase1))
will convert the letters to an ordered sequence and assign numbers to represent those categories.
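If the codes must follow a particular order, or the letters stand for specific values, two small variations may help. This is only a sketch: the phase vector is made up, and the complex values in lookup are hypothetical placeholders (the question does not give the value of C).

```r
phase <- c("B", "A", "C", "A")

# Explicit levels fix which letter gets which integer code
codes <- as.integer(factor(phase, levels = c("A", "B", "C")))
codes
# [1] 2 1 3 1

# A named vector works as a lookup table from letter to value
# (the values here are hypothetical placeholders)
lookup <- c(A = 1+0i, B = 2+3i, C = 0+1i)
values <- unname(lookup[phase])
values
# [1] 2+3i 1+0i 0+1i 1+0i
```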

A more elegant way to combine two vectors as separate columns (or dataframes), match the rows, and have NA where they do not match

I have two vectors of the same 'thing' that I want to combine into a dataframe. Each vector will become its own column, with rows matched up where the values are the same and NA introduced in one column where it has no match in the other. Since the data starts as just two vectors, there are no common id values or anything to match up other than the vector values.
I got this to work in a toy data test using a simple and straightforward approach, but would like to know if there is a more direct and elegant way to do this.
My current approach requires assigning a unique value by which I can then merge the two vectors, but I am curious if I can do this without it and rely instead on the vector values. My other attempts tried to not adopt a new id value, exploring functions like merge and join, cbind, rbind, bind_rows, bind_cols, intersect and union. Perhaps I wasn't using them as well as I could. I found some other useful posts on SO (like this one), but they all already start with a unique identifier.
Here is my toy data test with a final output how I want it to look. It does not matter to me if the final output has an id column or not. Note, my actual data will be character, hence my use of letters here.
# create toy data
x <- letters[1:5]
y <- letters[2:6]
# combine into dataframe, keep only unique values & assign id
xy <- data.frame(xy=unique(c(x,y))); xy
xy$id <- 1:length(xy$xy); xy
# match id back to original toy data as dataframes
x <- data.frame(x)
x$id <- match(x$x, xy$xy)
y <- data.frame(y)
y$id <- match(y$y, xy$xy)
# merge using id
xy2 <- merge(x, y, by="id", all=TRUE)
xy2
# results in
id x y
1 1 a <NA>
2 2 b b
3 3 c c
4 4 d d
5 5 e e
6 6 <NA> f
Using tidyverse you can try using full_join and create keys based on your 2 vectors:
library(tidyverse)
full_join(data.frame(key=x, x),
data.frame(key=y, y), by="key") %>%
select(-key)
Alternatively, you can just use merge in base R:
merge(data.frame('key'=x, x), data.frame('key'=y, y), by='key', all=TRUE)[-1]
Output
x y
1 a <NA>
2 b b
3 c c
4 d d
5 e e
6 <NA> f
Here's an alternative one-liner in base R:
cbind(x[match(unique(c(x, y)), x)], y[match(unique(c(x, y)), y)])
#> [,1] [,2]
#> [1,] "a" NA
#> [2,] "b" "b"
#> [3,] "c" "c"
#> [4,] "d" "d"
#> [5,] "e" "e"
#> [6,] NA "f"
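The one-liner returns a character matrix. If a data frame with named columns is wanted, as in the question, the same match() idea can be sketched like this (the name keys is my own; stringsAsFactors = FALSE keeps the columns as character in older R versions too):

```r
x <- letters[1:5]
y <- letters[2:6]

# All distinct values, in order of first appearance
keys <- union(x, y)

# match() gives the position of each key in x (or y), NA where it is absent
xy <- data.frame(x = x[match(keys, x)],
                 y = y[match(keys, y)],
                 stringsAsFactors = FALSE)
xy
#      x    y
# 1    a <NA>
# 2    b    b
# 3    c    c
# 4    d    d
# 5    e    e
# 6 <NA>    f
```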

R rearrange factor levels [duplicate]

I have a data frame with some numerical variables and some categorical factor variables. The order of levels for those factors is not the way I want them to be.
numbers <- 1:4
letters <- factor(c("a", "b", "c", "d"))
df <- data.frame(numbers, letters)
df
# numbers letters
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
If I change the order of the levels, the letters no longer are with their corresponding numbers (my data is total nonsense from this point on).
levels(df$letters) <- c("d", "c", "b", "a")
df
# numbers letters
# 1 1 d
# 2 2 c
# 3 3 b
# 4 4 a
I simply want to change the level order, so when plotting, the bars are shown in the desired order - which may differ from default alphabetical order.
Use the levels argument of factor:
df <- data.frame(f = 1:4, g = factor(letters[1:4]))
df
# f g
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
levels(df$g)
# [1] "a" "b" "c" "d"
df$g <- factor(df$g, levels = letters[4:1])
levels(df$g)
# [1] "d" "c" "b" "a"
df
# f g
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
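To see why this matters for plots, here is a small sketch with made-up data: functions that work per level, such as table() (and therefore bar plots built from it), follow the level order, while the values themselves stay untouched.

```r
g  <- factor(c("a", "b", "b", "c", "c", "c"))
g2 <- factor(g, levels = c("c", "b", "a"))

# Counts come out in level order
names(table(g))
# [1] "a" "b" "c"
names(table(g2))
# [1] "c" "b" "a"

# The data are unchanged
identical(as.character(g), as.character(g2))
# [1] TRUE
```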
some more, just for the record
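## reorder() is a generic; the new.order argument comes from the
## reorder.factor method in the gdata package, so load gdata first
library(gdata)
df$letters <- reorder(df$letters, new.order=letters[4:1])
df$letters <- reorder.factor(df$letters, new.order=letters[4:1])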
You may also find useful Relevel and combine_factor.
Since this question was last active Hadley has released his new forcats package for manipulating factors and I'm finding it outrageously useful. Examples from the OP's data frame:
levels(df$letters)
# [1] "a" "b" "c" "d"
To reverse levels:
library(forcats)
fct_rev(df$letters) %>% levels
# [1] "d" "c" "b" "a"
To add more levels:
fct_expand(df$letters, "e") %>% levels
# [1] "a" "b" "c" "d" "e"
And many more useful fct_xxx() functions.
so what you want, in R lexicon, is to change only the labels for a given factor variable (i.e., leave the data, as well as the factor levels, unchanged).
df$letters = factor(df$letters, labels=c("d", "c", "b", "a"))
given that you want to change only the datapoint-to-label mapping and not the data or the factor schema (how the datapoints are binned into individual bins or factor values), it might help to know how the mapping is originally set when you initially create the factor.
the rules are simple:
labels are mapped to levels by index value (i.e., the value at levels[2] is given the label labels[2]);
factor levels can be set explicitly by passing them in via the levels argument; or
if no value is supplied for the levels argument, the default value is used, which is the sorted unique values of the data vector passed in;
labels can be set explicitly via the labels argument; or
if no value is supplied for the labels argument, the default value is used, which is just the levels vector
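These rules can be checked with a short sketch. The data and label names below are made up for illustration.

```r
x <- c("lo", "hi", "lo", "mid")

# No levels given: the default is the sorted unique values
levels(factor(x))
# [1] "hi"  "lo"  "mid"

# Explicit levels set the order; labels are attached by position
f <- factor(x,
            levels = c("lo", "mid", "hi"),
            labels = c("Low", "Medium", "High"))
as.character(f)
# [1] "Low"    "High"   "Low"    "Medium"
as.integer(f)
# [1] 1 3 1 2
```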
Dealing with factors in R is quite a peculiar job, I must admit... When you assign new levels, you're not reordering the underlying numerical values. Here's a little demonstration:
> numbers = 1:4
> letters = factor(letters[1:4])
> dtf <- data.frame(numbers, letters)
> dtf
numbers letters
1 1 a
2 2 b
3 3 c
4 4 d
> sapply(dtf, class)
numbers letters
"integer" "factor"
Now, if you convert this factor to numeric, you'll get:
# return underlying numerical values
> with(dtf, as.numeric(letters))
[1] 1 2 3 4
# change levels
> levels(dtf$letters) <- letters[4:1]
> dtf
numbers letters
1 1 d
2 2 c
3 3 b
4 4 a
# return numerical values once again
> with(dtf, as.numeric(letters))
[1] 1 2 3 4
As you can see... by changing levels, you change levels only (who would tell, eh?), not the numerical values! But when you use the factor function as @Jonathan Chang suggested, something different happens: you change the numerical values themselves.
You're getting the error once again because you assign to levels and then try to relevel with factor. Don't do it!!! Do not use levels<- or you'll mess things up (unless you know exactly what you're doing).
One lil' suggestion: avoid naming your objects with the same names as R's own objects (df is the density function of the F distribution, letters gives the lowercase alphabet). In this particular case your code would not be faulty, but it can create confusion, and we don't want that, do we?!? =)
Instead, use something like this (I'll go from the beginning once again):
> dtf <- data.frame(f = 1:4, g = factor(letters[1:4]))
> dtf
f g
1 1 a
2 2 b
3 3 c
4 4 d
> with(dtf, as.numeric(g))
[1] 1 2 3 4
> dtf$g <- factor(dtf$g, levels = letters[4:1])
> dtf
f g
1 1 a
2 2 b
3 3 c
4 4 d
> with(dtf, as.numeric(g))
[1] 4 3 2 1
Note that you can also name your data.frame df and the column letters instead of g, and the result will be OK. Actually, this code is identical to the one you posted, only the names are changed. The part factor(dtf$letters, levels = letters[4:1]) wouldn't throw an error, but it can be confusing!
Read the ?factor manual thoroughly! What's the difference between factor(g, levels = letters[4:1]) and factor(g, labels = letters[4:1])? What's similar in levels(g) <- letters[4:1] and g <- factor(g, labels = letters[4:1])?
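One way to explore those two questions is to run all three variants side by side. This is just a sketch with throwaway names (by_levels, by_labels, by_assign):

```r
g <- factor(letters[1:4])

by_levels <- factor(g, levels = letters[4:1])  # reorder the levels
by_labels <- factor(g, labels = letters[4:1])  # rename the levels
by_assign <- g
levels(by_assign) <- letters[4:1]              # also renames the levels

# levels=: values are kept, integer codes change
as.character(by_levels)
# [1] "a" "b" "c" "d"
as.integer(by_levels)
# [1] 4 3 2 1

# labels= and levels<-: integer codes are kept, values change
as.character(by_labels)
# [1] "d" "c" "b" "a"
as.integer(by_labels)
# [1] 1 2 3 4
as.character(by_assign)
# [1] "d" "c" "b" "a"
```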
You can post your ggplot code, so we can help you more on this one!
Cheers!!!
Edit:
Does ggplot2 actually require changing both levels and values? Hm... I'll dig this one out...
I wish to add another case, where the levels are strings containing numbers along with some special characters, as in the example below:
df <- data.frame(x = c("15-25", "0-4", "5-10", "11-14", "100+"), stringsAsFactors = TRUE)
The default levels of x are:
df$x
# [1] 15-25 0-4 5-10 11-14 100+
# Levels: 0-4 100+ 11-14 15-25 5-10
Here if we want to reorder the factor levels according to the numeric value, without explicitly writing out the levels, what we could do is
library(gtools)
df$x <- factor(df$x, levels = mixedsort(df$x))
df$x
# [1] 15-25 0-4 5-10 11-14 100+
# Levels: 0-4 5-10 11-14 15-25 100+
as.numeric(df$x)
# [1] 4 1 2 3 5
I hope this can be considered as useful information for future readers.
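If installing gtools is not an option, a base R sketch can get the same order for this particular data by sorting on the leading number of each label. It assumes every label starts with digits, which holds here but is not a general replacement for mixedsort.

```r
x <- c("15-25", "0-4", "5-10", "11-14", "100+")

# Extract the leading number of each label (assumes labels start with digits)
lower <- as.numeric(sub("^([0-9]+).*$", "\\1", x))

# Order the distinct labels by that number and use them as levels
f <- factor(x, levels = unique(x[order(lower)]))
levels(f)
# [1] "0-4"   "5-10"  "11-14" "15-25" "100+"
as.integer(f)
# [1] 4 1 2 3 5
```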
Here's my function to reorder factors of a given dataframe:
reorderFactors <- function(df, column = "my_column_name",
                           desired_level_order = c("fac1", "fac2", "fac3")) {
  x = df[[column]]
  lvls_src = levels(x)
  idxs_target <- vector(mode="numeric", length=0)
  for (target in desired_level_order) {
    idxs_target <- c(idxs_target, which(lvls_src == target))
  }
  x_new <- factor(x,levels(x)[idxs_target])
  df[[column]] <- x_new
  return (df)
}
Usage: reorderFactors(df, "my_col", desired_level_order = c("how","I","want"))
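The loop can probably be shortened with intersect(), which keeps the requested names that actually exist as levels, in the requested order. The following is a sketch of that idea under a hypothetical name, reorder_levels, with made-up data; it is not the function above.

```r
# Hypothetical compact variant: keep requested levels that exist, in that order
reorder_levels <- function(df, column, desired_level_order) {
  x <- df[[column]]
  keep <- intersect(desired_level_order, levels(x))
  df[[column]] <- factor(x, levels = keep)
  df
}

dat <- data.frame(size = factor(c("small", "large", "medium", "small")))
levels(dat$size)
# [1] "large"  "medium" "small"

out <- reorder_levels(dat, "size", c("small", "medium", "large"))
levels(out$size)
# [1] "small"  "medium" "large"
as.character(out$size)
# [1] "small"  "large"  "medium" "small"
```

As in the original function, values whose level is not listed in desired_level_order become NA.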
I would simply use the levels argument of factor():
df$letters <- factor(df$letters, levels = levels(df$letters)[4:1])
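To add yet another approach, which is quite useful as it frees us from remembering functions from different packages: the levels of a factor are just attributes, so one can do the following. Note that this relabels the existing values rather than only reordering the levels, as the output shows: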
numbers <- 1:4
letters <- factor(c("a", "b", "c", "d"))
df <- data.frame(numbers, letters)
# Original attributes
> attributes(df$letters)
$levels
[1] "a" "b" "c" "d"
$class
[1] "factor"
# Modify attributes
attr(df$letters,"levels") <- c("d", "c", "b", "a")
> df$letters
[1] d c b a
Levels: d c b a
# New attributes
> attributes(df$letters)
$levels
[1] "d" "c" "b" "a"
$class
[1] "factor"

How does one change the levels of a factor column in a data.table

What is the correct way to change the levels of a factor column in a data.table (note: not data frame)
library(data.table)
mydt <- data.table(id=1:6, value=as.factor(c("A", "A", "B", "B", "B", "C")), key="id")
mydt[, levels(value)]
[1] "A" "B" "C"
I am looking for something like:
mydt[, levels(value) <- c("X", "Y", "Z")]
But of course, the above line does not work.
# Actual # Expected result
> mydt > mydt
id value id value
1: 1 A 1: 1 X
2: 2 A 2: 2 X
3: 3 B 3: 3 Y
4: 4 B 4: 4 Y
5: 5 B 5: 5 Y
6: 6 C 6: 6 Z
You can still set them the traditional way:
levels(mydt$value) <- c(...)
This should be plenty fast unless mydt is very large since that traditional syntax copies the entire object. You could also play the un-factoring and refactoring game... but no one likes that game anyway.
To change the levels by reference with no copy of mydt :
setattr(mydt$value,"levels",c(...))
but be sure to assign a valid levels vector (type character of sufficient length) otherwise you'll end up with an invalid factor (levels<- does some checking as well as copying).
I would rather go the traditional way of re-assignment to the factors
> mydt$value # This is what we had originally
[1] A A B B B C
Levels: A B C
> levels(mydt$value) # just checking the levels
[1] "A" "B" "C"
# Meat of the re-assignment
> levels(mydt$value)[levels(mydt$value)=="A"] <- "X"
> levels(mydt$value)[levels(mydt$value)=="B"] <- "Y"
> levels(mydt$value)[levels(mydt$value)=="C"] <- "Z"
> levels(mydt$value)
[1] "X" "Y" "Z"
> mydt # This is what we wanted
id value
1: 1 X
2: 2 X
3: 3 Y
4: 4 Y
5: 5 Y
6: 6 Z
As you probably noticed, the meat of the re-assignment is very intuitive: it checks for the exact level (use grepl in case you need a fuzzy match, regular expressions or the like)
levels(mydt$value)[levels(mydt$value)=="A"] <- "X"
This explicitly checks the value in the levels of the variable under consideration and then reassigns X (and so on) to it. The advantage: you explicitly KNOW what was relabeled as what.
I find renaming levels as here, levels(mydt$value) <- c("X","Y","Z"), very non-intuitive, since it just assigns X to the first entry of the existing levels vector (so the order really matters)
PPS : In case of too many levels, use looping constructs.
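As an alternative to a loop, a named vector can serve as an explicit old-to-new map. This is a sketch on a plain factor with the same values as mydt$value; the name map is my own.

```r
value <- factor(c("A", "A", "B", "B", "B", "C"))

# Names are the old levels, values are the new ones
map <- c(C = "Z", A = "X", B = "Y")   # order of the map does not matter

# Look up each existing level by name
levels(value) <- unname(map[levels(value)])
levels(value)
# [1] "X" "Y" "Z"
as.character(value)
# [1] "X" "X" "Y" "Y" "Y" "Z"
```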
You can also rename and add to your levels using a related approach, which can be very handy, especially if you are making a plot that needs more informative labels in a particular order (as opposed to the default):
f <- factor(c("a","b"))
levels(f) <- list(C = "C", D = "a", B = "b")
(modified from ?levels)
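In the list form, each name is a new level and each element holds the old level(s) it replaces, so the same syntax can also merge levels. A small sketch with made-up labels:

```r
f <- factor(c("a", "b", "c", "a"))

# "a" and "b" are merged into "first"; "c" becomes "last"
levels(f) <- list(first = c("a", "b"), last = "c")
levels(f)
# [1] "first" "last"
as.character(f)
# [1] "first" "first" "last"  "first"
```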
This is safer than Matt Dowle's suggestion (because it uses the checks skipped by setattr) but won't copy the entire data.table. It will replace the entire column vector (whereas Matt's solution only replaces the attributes of the column vector) , but that seems like an acceptable trade-off in order to reduce the risk of messing up the factor object.
mydt[, value:=`levels<-`(value, c("X", "Y", "Z"))]
Simplest way to change a column's levels:
dat$colname <- as.factor(as.vector(dat$colname));

"replace" function examples

I don't find the help page for the replace function from the base package to be very helpful. Worst part, it has no examples which could help understand how it works.
Could you please explain how to use it? An example or two would be great.
If you look at the function (by typing its name at the console) you will see that it is just a simple functionalized version of the [<- function, which is described at ?"[". [ is a rather basic function in R, so you would be well-advised to look at that page for further details. Especially important is learning that the index argument (the second argument in replace) can be logical, numeric or character classed values. Recycling will occur when there are differing lengths of the second and third arguments.
You should "read" the function call as: "within the first argument, use the second argument as an index for placing the values of the third argument into the first":
> replace( 1:20, 10:15, 1:2)
[1] 1 2 3 4 5 6 7 8 9 1 2 1 2 1 2 16 17 18 19 20
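To make the relation to [<- concrete, here is a short sketch showing that replace() gives the same result as indexed assignment on a copy, and that the original vector is left alone:

```r
x <- c(5L, 6L, 7L, 8L)

via_replace <- replace(x, c(2, 4), 0L)

via_bracket <- x
via_bracket[c(2, 4)] <- 0L

via_replace
# [1] 5 0 7 0
identical(via_replace, via_bracket)
# [1] TRUE

# replace() returns a modified copy; x itself is unchanged
x
# [1] 5 6 7 8
```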
Character indexing for a named vector:
> replace(c(a=1, b=2, c=3, d=4), "b", 10)
a b c d
1 10 3 4
Logical indexing:
> replace(x <- c(a=1, b=2, c=3, d=4), x>2, 10)
a b c d
1 2 10 10
You can also use logical tests:
x <- data.frame(a = c(0,1,2,NA), b = c(0,NA,1,2), c = c(NA, 0, 1, 2))
x
x$a <- replace(x$a, is.na(x$a), 0)
x
x$b <- replace(x$b, x$b==2, 333)
Here are two simple examples:
> x <- letters[1:4]
> replace(x, 3, 'Z') #replacing 'c' by 'Z'
[1] "a" "b" "Z" "d"
>
> y <- 1:10
> replace(y, c(4,5), c(20,30)) # replacing 4th and 5th elements by 20 and 30
[1] 1 2 3 20 30 6 7 8 9 10
Be aware that in the examples given above, the third parameter (values) is a constant (e.g. 'Z' or c(20,30)).
Defining the third parameter using values from the data frame itself can lead to confusion.
E.g. with a simple data frame such as this (using dplyr::data_frame):
tmp <- data_frame(a=1:10, b=sample(LETTERS[24:26], 10, replace=T))
This will create something like this:
a b
(int) (chr)
1 1 X
2 2 Y
3 3 Y
4 4 X
5 5 Z
..etc
Now suppose you wanted to multiply the values in column 'a' by 2, but only where column 'b' is "X". My immediate thought would be something like this:
with(tmp, replace(a, b=="X", a*2))
That will not provide the desired outcome, however. The a*2 is evaluated once, as a full-length vector, rather than row by row against the selected elements of 'a'. The vector 'a*2' will thus be
[1] 2 4 6 8 10 12 14 16 18 20
at the start of the 'replace' operation. Thus, in the first row where 'b' equals "X", the value in 'a' will be replaced by 2. In the second such row it will be replaced by 4, etc. It will not be replaced by two times the value of 'a' in that particular row.
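One way around this, sketched here on a small fixed data frame instead of the random one, is to subset the replacement values with the same condition so that they line up with the selected rows:

```r
tmp <- data.frame(a = 1:6, b = c("X", "Y", "Y", "X", "Z", "X"),
                  stringsAsFactors = FALSE)

# Problematic: a * 2 has 6 values, only the first 3 are used (with a warning)
wrong <- suppressWarnings(with(tmp, replace(a, b == "X", a * 2)))
wrong
# [1] 2 2 3 4 5 6

# Subset the values with the same condition as the index
right <- with(tmp, replace(a, b == "X", a[b == "X"] * 2))
right
# [1]  2  2  3  8  5 12
```

ifelse(b == "X", a * 2, a) would give the same result as the second call, since it picks values position by position.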
Here's an example where I found the replace( ) function helpful for giving me insight. The problem required a long integer vector be changed into a character vector and with its integers replaced by given character values.
## figuring out replace( )
(test <- c(rep(1,3),rep(2,2),rep(3,1)))
which looks like
[1] 1 1 1 2 2 3
and I want to replace every 1 with an A and 2 with a B and 3 with a C
letts <- c("A","B","C")
so in my own secret little "dirty-verse" I used a loop
for(i in 1:3)
{test <- replace(test,test==i,letts[i])}
which did what I wanted
test
[1] "A" "A" "A" "B" "B" "C"
In the first sentence I purposefully left out that the real objective was to make the big vector of integers a factor vector and assign the integer values (levels) some names (labels).
So another way of doing the replace( ) application here would be
(test <- factor(test,labels=letts))
[1] A A A B B C
Levels: A B C
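When the integers are exactly 1, 2, 3, ... as here, plain indexing is another option that needs neither the loop nor a factor. A sketch starting again from the original integer vector:

```r
test  <- c(rep(1, 3), rep(2, 2), rep(3, 1))
letts <- c("A", "B", "C")

# Each integer picks the letter at that position
letts[test]
# [1] "A" "A" "A" "B" "B" "C"
```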

Resources