Changing an ID name by location in R

I have 2 repeated IDs in my data frame, and I want to change the name of one.
When I use the revalue function from plyr, the new name is assigned to both.
That's why I wanted to do it by specifying the location of the ID by column and row and then changing it but I couldn't find how to do it.
In short, my question is how can I change only one ID(element)'s name if it is repeated twice in a data frame?
Edit: To be more specific let's say my data is this-
Value <- c(12,23,4,5)
ID <- c("A", "B", "A", "C")
Score1 <- c(3, 2, 1, 4)
Score2 <- c(4,5,9,10)
mydf <- data.frame(Value, ID, Score1, Score2)
mydf
#   Value ID Score1 Score2
# 1    12  A      3      4
# 2    23  B      2      5
# 3     4  A      1      9
# 4     5  C      4     10
I want to change the Second "A" to "A-1".
(It could be a very basic question, I am new in R, sorry :D)

Assuming you would want subsequent As (if there is more than one duplicate) to be sequentially numbered, then make.unique works well here:
make.unique(mydf$ID)
# [1] "A" "B" "A.1" "C"
If you must have it with a dash instead of a period, then
sub(".", "-", make.unique(mydf$ID), fixed = TRUE)
# [1] "A" "B" "A-1" "C"
Either way, this can be easily reassigned back into mydf$ID.
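A minimal sketch of both routes, assuming the mydf from the question: reassigning the make.unique result, and editing a single cell by row position. The as.character call is a precaution added here, because before R 4.0 data.frame would have stored ID as a factor, which make.unique does not accept.

```r
mydf <- data.frame(Value = c(12, 23, 4, 5),
                   ID = c("A", "B", "A", "C"),
                   Score1 = c(3, 2, 1, 4),
                   Score2 = c(4, 5, 9, 10),
                   stringsAsFactors = FALSE)

# Route 1: number every repeat, then swap the first period for a dash
by_unique <- mydf
by_unique$ID <- sub(".", "-", make.unique(as.character(by_unique$ID)), fixed = TRUE)

# Route 2: change a single cell by location (row 3 of the ID column)
by_position <- mydf
by_position$ID[3] <- "A-1"

by_unique$ID
by_position$ID
```

Route 2 is the literal "by location" answer. Route 1 scales better when there are many repeats and you do not want to look up each row number by hand.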

Related

Split dataframe columns into vectors in R

I have a dataframe as such:
Number <- c(1,2,3)
Number2 <- c(10,12,14)
Letter <- c("A","B","C")
df <- data.frame(Number,Number2,Letter)
I would like to split the df into its respective three columns, each one becoming a vector with the respective column name. In essence, the output should look exactly like the original three input vectors in the above example.
I have tried the split function and also using for loop, but without success.
Any ideas? Thank you.
We may use unclass as data.frame is a list with additional attributes. By unclassing, it removes the data.frame attribute
unclass(df)
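Another option is asplit with MARGIN specified as 2 (note that asplit converts the data frame to a matrix first, so when column types are mixed, as here, every column comes back as character):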
asplit(df, 2)
NOTE: Both of them return a named list. If we intend to create new objects in the global env, use list2env (not recommended though)
We can use c or as.list
> c(df)
$Number
[1] 1 2 3
$Number2
[1] 10 12 14
$Letter
[1] "A" "B" "C"
> as.list(df)
$Number
[1] 1 2 3
$Number2
[1] 10 12 14
$Letter
[1] "A" "B" "C"
Assuming you are trying to create these as vectors in the global environment, use list2env:
df <- data.frame(Number = c(1, 2, 3),
Number2 = c(10, 12, 14),
Letter = c("A", "B", "C"))
list2env(df, .GlobalEnv)
## <environment: R_GlobalEnv>
ls()
## [1] "df" "Letter" "Number" "Number2"
list2env is clearly the easiest way, but if you want to do it with a for loop it can also be achieved.
The "tricky" part is to make a new vector based on the column names inside the for loop. If you just write
names(df[i]) <- input
a vector will not be created.
A workaround is to use paste to build a string containing the new vector name and what should be in it, then use eval(parse(text = ...)) to evaluate this expression.
Maybe not the most elegant solution, but seems to work.
for (i in colnames(df)){
vector_name <- names(df[i])
expression_to_be_evaluated <- paste(vector_name, "<- df[[i]]")
eval(parse(text=expression_to_be_evaluated))
}
> Letter
[1] A B C
Levels: A B C
> Number
[1] 1 2 3
> Number2
[1] 10 12 14
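A sketch of the same loop using assign, which creates a variable from a name held in a string, so nothing has to be pasted together and parsed. Writing into the global environment and the stringsAsFactors setting are choices made here for illustration.

```r
df <- data.frame(Number = c(1, 2, 3),
                 Number2 = c(10, 12, 14),
                 Letter = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

# Create one variable per column, named after the column
for (col in colnames(df)) {
  assign(col, df[[col]], envir = globalenv())
}

Number2
Letter
```

This does what list2env(df, .GlobalEnv) does, only spelled out step by step. In most cases leaving the columns inside the data frame (or a list) is easier to manage than loose variables.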

Why multiple option in vunion function of the vecsets package does not work for character vectors?

When I run the code:
library(vecsets)
p <- c("a","b")
q <- c( "a")
vunion(p,q, multiple = TRUE)
I get the result:
[1] "a" "b"
But I expect the result to be
vunion(p,q, multiple = TRUE)
[1] "a" "b" "a"
I also do not understand the result provided in the example of the vecsets package. The example shows:
x <- c(1:5,3,3,3,2,NA,NA)
y <- c(2:5,4,3,NA)
vunion(x,y,multiple=TRUE)
[1] 2 3 3 4 5 NA 1 3 3 2 NA 4
But if we check
length(x)+length(y); length(vunion(x,y))
[1] 18
[1] 12
we get different lengths, but I think they should be the same. Note, for example, 5 appears only once.
What's going on here? Can someone explain?
I think the vecsets package documentation describes this behavior quite well:
The base::union function removes duplicates per algebraic set theory. vunion does not, and so returns as many duplicate elements as are in either input vector (not the sum of their inputs.) In short, vunion is the same as vintersect(x,y) + vsetdiff(x,y) + vsetdiff(y,x).
You do have to read it carefully, though. The important part is that vunion returns as many duplicates of an element as are in either input vector, not the sum of the counts in both. The issue is not character versus numeric vectors, but whether elements are repeated within the same vector. Consider p1 versus p2 in the following example. The result from vunion will have as many a's as either p or q has, so we expect one "a" in the first call and two a's in the second; both times we expect only one "b":
library(vecsets)
q <- c("a", "b")
p1 <- c("a", "b")
vunion(p1, q, multiple = TRUE)
[1] "a" "b"
p2 <- c("a", "a", "b")
vunion(p2, q, multiple = TRUE)
[1] "a" "b" "a"
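To make the counting rule concrete, here is a base R sketch that does not use vecsets at all. It assumes the quoted description means each value appears max(count in x, count in y) times, which is what intersection plus the two set differences add up to. The helper name is hypothetical, NA values are not handled, and the order of the output may differ from what vunion returns.

```r
# Hypothetical helper (not part of vecsets): multiset union where each value
# appears max(count in x, count in y) times. Assumes no NA values.
multiset_union_sketch <- function(x, y) {
  vals <- union(x, y)
  count_in <- function(v) vapply(vals, function(val) sum(v == val), integer(1))
  rep(vals, times = pmax(count_in(x), count_in(y)))
}

multiset_union_sketch(c("a", "b"), "a")                # one "a", one "b"
multiset_union_sketch(c("a", "a", "b"), c("a", "b"))   # two "a", one "b"
```

The first call mirrors the question: "a" occurs once in each input, so it occurs once in the result rather than twice.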

R: quick way to systematically delete rows like this?

I have a data frame with a single column.
There are 620 rows. The first 31 rows we label class "A", the next 31 rows we label "class B", and so on. There are therefore 20 classes.
What I want to do is quite simple to explain but I need help coding it.
In the first iteration, I want to delete all rows that correspond to the last row for each class. That is, delete the last "A class" row, then delete the last "B class row", and so on. This iteration, and all others, have to be performed, since I intend to do something else with the newly created dataset.
In the second iteration, I want to delete all rows that correspond to the last TWO rows for each class. So, delete the last two rows for "A class", last two rows for "B class" and so on.
In the third iteration, delete the last three rows for each class. And so on.
In the final iteration, we delete the last 30 rows for each class. Meaning basically we only keep 1 row for each class, the first one.
What's a quick way to put this into R code? I know I need to use a for loop and carefully pick some index to remove, but how?
EXAMPLE
column
A1
A2
A3
B1
B2
B3
If above is our original data frame, then in the first iteration, we should be left with
column
A1
A2
B1
B2
and so on.
I'm making it simple here and use n=3 instead of n=31 with this dummy data set
n <- 3
dummy <- c(rep("A", n), rep("B", n), rep("C", n))
> dummy
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
Now, the trick is to use boolean indices to pick which values to keep per iteration and combine this with the feature that R will repeat an index vector as many times as needed for a short vector to match a longer vector.
This function creates a mask of which elements in a group should be picked
make_mask <- function(to_keep, n)
c(rep(TRUE, to_keep), rep(FALSE, n - to_keep))
It just gives you a boolean vector
> make_mask(2, 3)
[1] TRUE TRUE FALSE
We can use it in a function that picks the element for an iteration:
pick_subset <- function(to_keep) dummy[make_mask(n - to_keep, n)]
Now, you can use this in a loop or an lapply to get the elements you need per iteration.
iterations <- lapply(0:(n-1), pick_subset)
will give you this
> iterations
[[1]]
[1] "A" "A" "A" "B" "B" "B" "C" "C" "C"
[[2]]
[1] "A" "A" "B" "B" "C" "C"
[[3]]
[1] "A" "B" "C"
If it is more to your taste to use 1:n in the lapply, simply adjust make_mask to compensate.
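library(dplyr)
dat %>%
  mutate(grp = sub("\\d+$", "", column)) %>%  # class label: strip the trailing digits
  group_by(grp) %>%
  slice(-n()) %>%                             # drop the last row of each class
  ungroup() %>%
  select(-grp)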
# A tibble: 4 x 1
column
<chr>
1 A1
2 A2
3 B1
4 B2
data:
dat=read.table(header = T,stringsAsFactors = F,text="column
A1
A2
A3
B1
B2
B3")
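To go from "drop the last row" to "drop the last k rows" for every iteration, one option is a small base R helper built on ave. This is a sketch: the helper name and the way the class label is derived from the column are assumptions made here, and it relies on the rows of each class being in the order you want to trim from.

```r
dat <- data.frame(column = c("A1", "A2", "A3", "B1", "B2", "B3"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: keep all but the last k rows of each group
drop_last_k <- function(dat, grp, k) {
  pos  <- ave(seq_along(grp), grp, FUN = seq_along)  # position within group
  size <- ave(seq_along(grp), grp, FUN = length)     # size of the row's group
  dat[pos <= size - k, , drop = FALSE]
}

grp <- sub("\\d+$", "", dat$column)  # "A" "A" "A" "B" "B" "B"

# One data frame per iteration: k = 1 drops one row per class, k = 2 drops two
iterations <- lapply(1:2, function(k) drop_last_k(dat, grp, k))
iterations[[1]]$column
iterations[[2]]$column
```

With the real data you would use 1:30 in the lapply call and work on each element of iterations in turn.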
There is still another way. Assuming the codes are all grouped and sorted as you show, use the table function to count the rows for each code. Each value in the cumsum of that table is the index of the last row of the corresponding class. On each pass through the loop, indexes gains one more row index per class (the next one up from the end). The y variable is created by removing the rows indexed by indexes. (It doesn't matter that indexes is unsorted.) You then do what you need to with y. Here's the code with an example data.frame:
N <- 31
dat <-data.frame(x=c(rep("A",31),rep("B",31),rep("C",31),rep("D",31),rep("E",31)))
t.x <- cumsum(table(dat$x))
for (i in 1:(N-1)) {
if (i == 1){
indexes <- t.x
} else {
indexes = c(indexes,t.x-i)
}
y <- dat$x[-indexes]
print(table(y))
}
The print(table(y)) will show that the count of each code will decrease as required.
y
A B C D E
30 30 30 30 30
y
A B C D E
29 29 29 29 29
Solution with data.table package
Because you know exactly how many items are in each class as well as how many classes exist in the data, the following simple solution works:
Import packages and generate some test data:
rm(list=ls())
library(data.table)
A = rep('A', 3)
B = rep('B', 3)
C = rep('C', 3)
val = rep(1:3, 3)
DT = data.table(class=c(A,B,C), val=val)
This loop simply iterates as many times as there are items in each of your so-called "classes". With each iteration we subset an increasingly small portion of the original data with the .SD[1:(4-i)] portion of code. Be sure to set a value (4 in this case) that is one more than the number of items in each class so that you don't receive an "index out of range" error. The cool part is that data.table allows us to do this by a grouping vector ("class" in this case).
for(i in 1:3) {
print(DT[, .SD[1:(4-i)], by = class]) # edit as needed to save copies
}
Output:
class val
1: A 1
2: A 2
3: A 3
4: B 1
5: B 2
6: B 3
7: C 1
8: C 2
9: C 3
class val
1: A 1
2: A 2
3: B 1
4: B 2
5: C 1
6: C 2
class val
1: A 1
2: B 1
3: C 1

R rearrange factor levels [duplicate]

I have a data frame with some numerical variables and some categorical factor variables. The order of levels for those factors is not the way I want them to be.
numbers <- 1:4
letters <- factor(c("a", "b", "c", "d"))
df <- data.frame(numbers, letters)
df
# numbers letters
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
If I change the order of the levels, the letters no longer are with their corresponding numbers (my data is total nonsense from this point on).
levels(df$letters) <- c("d", "c", "b", "a")
df
# numbers letters
# 1 1 d
# 2 2 c
# 3 3 b
# 4 4 a
I simply want to change the level order, so when plotting, the bars are shown in the desired order - which may differ from default alphabetical order.
Use the levels argument of factor:
df <- data.frame(f = 1:4, g = letters[1:4], stringsAsFactors = TRUE)  # needed since R 4.0 so that g is a factor
df
# f g
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
levels(df$g)
# [1] "a" "b" "c" "d"
df$g <- factor(df$g, levels = letters[4:1])
levels(df$g)
# [1] "d" "c" "b" "a"
df
# f g
# 1 1 a
# 2 2 b
# 3 3 c
# 4 4 d
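Some more, just for the record:
## the new.order argument belongs to the reorder.factor method from gdata,
## not to base reorder, so load gdata first
library(gdata)
df$letters <- reorder(df$letters, new.order=letters[4:1])
df$letters <- reorder.factor(df$letters, new.order=letters[4:1])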
You may also find useful Relevel and combine_factor.
Since this question was last active Hadley has released his new forcats package for manipulating factors and I'm finding it outrageously useful. Examples from the OP's data frame:
levels(df$letters)
# [1] "a" "b" "c" "d"
To reverse levels:
library(forcats)
fct_rev(df$letters) %>% levels
# [1] "d" "c" "b" "a"
To add more levels:
fct_expand(df$letters, "e") %>% levels
# [1] "a" "b" "c" "d" "e"
And many more useful fct_xxx() functions.
So what you want, in R lexicon, is to change only the labels for a given factor variable (i.e., leave the data, as well as the factor levels, unchanged).
df$letters = factor(df$letters, labels=c("d", "c", "b", "a"))
Given that you want to change only the datapoint-to-label mapping and not the data or the factor schema (how the datapoints are binned into individual bins or factor values), it might help to know how the mapping is set when you initially create the factor.
The rules are simple:
- labels are mapped to levels by index value (i.e., the value at levels[2] is given the label labels[2]);
- factor levels can be set explicitly by passing them in via the levels argument; or
- if no value is supplied for the levels argument, the default value is used, which is the sorted result of calling unique on the data vector passed in (the x argument);
- labels can be set explicitly via the labels argument; or
- if no value is supplied for the labels argument, the default value is used, which is just the levels vector.
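The rules above can be checked with a short base R sketch. The variable names are made up for illustration. It contrasts passing the reversed vector as levels (which only reorders) with passing it as labels (which renames what each data point displays as).

```r
x <- c("a", "b", "c", "d")

# levels = ...: same data, new level order
f_levels <- factor(x, levels = c("d", "c", "b", "a"))
as.character(f_levels)   # still a b c d
levels(f_levels)         # d c b a
as.integer(f_levels)     # 4 3 2 1

# labels = ...: default level order (a b c d), each level renamed by position
f_labels <- factor(x, labels = c("d", "c", "b", "a"))
as.character(f_labels)   # now d c b a
as.integer(f_labels)     # 1 2 3 4
```

If the goal is only to control the order of bars in a plot while each row keeps its own letter, the first form is the one that leaves the data alone.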
Dealing with factors in R is quite a peculiar job, I must admit... When you rename the factor levels, you're not changing the underlying numerical values. Here's a little demonstration:
> numbers = 1:4
> letters = factor(letters[1:4])
> dtf <- data.frame(numbers, letters)
> dtf
numbers letters
1 1 a
2 2 b
3 3 c
4 4 d
> sapply(dtf, class)
numbers letters
"integer" "factor"
Now, if you convert this factor to numeric, you'll get:
# return underlying numerical values
1> with(dtf, as.numeric(letters))
[1] 1 2 3 4
# change levels
1> levels(dtf$letters) <- letters[4:1]
1> dtf
numbers letters
1 1 d
2 2 c
3 3 b
4 4 a
# return numerical values once again
1> with(dtf, as.numeric(letters))
[1] 1 2 3 4
As you can see... by changing levels, you change levels only (who would tell, eh?), not the numerical values! But, when you use the factor function as @Jonathan Chang suggested, something different happens: you change the numerical values themselves.
You're getting the error once again 'cause you do levels and then try to relevel it with factor. Don't do it!!! Do not use levels or you'll mess things up (unless you know exactly what you're doing).
One lil' suggestion: avoid naming your objects with the same names as R's own objects (df is the density function for the F distribution, letters gives the lowercase alphabet letters). In this particular case your code would not be faulty, but it can create confusion, and we don't want that, do we?!? =)
Instead, use something like this (I'll go from the beginning once again):
> dtf <- data.frame(f = 1:4, g = factor(letters[1:4]))
> dtf
f g
1 1 a
2 2 b
3 3 c
4 4 d
> with(dtf, as.numeric(g))
[1] 1 2 3 4
> dtf$g <- factor(dtf$g, levels = letters[4:1])
> dtf
f g
1 1 a
2 2 b
3 3 c
4 4 d
> with(dtf, as.numeric(g))
[1] 4 3 2 1
Note that you can also name your data.frame df and use letters instead of g, and the result will be OK. Actually, this code is identical to the one you posted, only the names are changed. This part, factor(dtf$letter, levels = letters[4:1]), wouldn't throw an error, but it can be confusing!
Read the ?factor manual thoroughly! What's the difference between factor(g, levels = letters[4:1]) and factor(g, labels = letters[4:1])? What's similar in levels(g) <- letters[4:1] and g <- factor(g, labels = letters[4:1])?
You can put ggplot syntax, so we can help you more on this one!
Cheers!!!
Edit:
ggplot2 actually requires to change both levels and values? Hm... I'll dig this one out...
I wish to add another case, where the levels are strings carrying numbers along with some special characters, as in the example below:
df <- data.frame(x = c("15-25", "0-4", "5-10", "11-14", "100+"), stringsAsFactors = TRUE)
The default levels of x are:
df$x
# [1] 15-25 0-4 5-10 11-14 100+
# Levels: 0-4 100+ 11-14 15-25 5-10
Here if we want to reorder the factor levels according to the numeric value, without explicitly writing out the levels, what we could do is
library(gtools)
df$x <- factor(df$x, levels = mixedsort(df$x))
df$x
# [1] 15-25 0-4 5-10 11-14 100+
# Levels: 0-4 5-10 11-14 15-25 100+
as.numeric(df$x)
# [1] 4 1 2 3 5
I hope this can be considered as useful information for future readers.
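If you would rather not depend on gtools, a base R sketch of the same idea is to order the labels by their leading number. This assumes every label starts with a run of digits, which holds for this example but is not something mixedsort requires.

```r
x <- c("15-25", "0-4", "5-10", "11-14", "100+")

# Leading run of digits in each label, as a number: 15, 0, 5, 11, 100
lead <- as.numeric(sub("^(\\d+).*$", "\\1", x))

# Use the labels, sorted by that number, as the level order
f <- factor(x, levels = unique(x[order(lead)]))
levels(f)
as.integer(f)
```

The data values stay in their original positions; only the level order changes.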
Here's my function to reorder factors of a given dataframe:
reorderFactors <- function(df, column = "my_column_name",
desired_level_order = c("fac1", "fac2", "fac3")) {
x = df[[column]]
lvls_src = levels(x)
idxs_target <- vector(mode="numeric", length=0)
for (target in desired_level_order) {
idxs_target <- c(idxs_target, which(lvls_src == target))
}
x_new <- factor(x,levels(x)[idxs_target])
df[[column]] <- x_new
return (df)
}
Usage: reorderFactors(df, "my_col", desired_level_order = c("how","I","want"))
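I would simply use the levels replacement function (note that this renames the levels in place, so the displayed values change along with them):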
levels(df$letters) <- levels(df$letters)[c(4:1)]
To add yet another approach that is quite useful, as it frees us from remembering functions from different packages: the levels of a factor are just attributes, so one can do the following:
numbers <- 1:4
letters <- factor(c("a", "b", "c", "d"))
df <- data.frame(numbers, letters)
# Original attributes
> attributes(df$letters)
$levels
[1] "a" "b" "c" "d"
$class
[1] "factor"
# Modify attributes
attr(df$letters,"levels") <- c("d", "c", "b", "a")
> df$letters
[1] d c b a
Levels: d c b a
# New attributes
> attributes(df$letters)
$levels
[1] "d" "c" "b" "a"
$class
[1] "factor"

How does one change the levels of a factor column in a data.table

What is the correct way to change the levels of a factor column in a data.table (note: not data frame)
library(data.table)
mydt <- data.table(id=1:6, value=as.factor(c("A", "A", "B", "B", "B", "C")), key="id")
mydt[, levels(value)]
[1] "A" "B" "C"
I am looking for something like:
mydt[, levels(value) <- c("X", "Y", "Z")]
But of course, the above line does not work.
# Actual          # Expected result
> mydt            > mydt
   id value          id value
1:  1     A       1:  1     X
2:  2     A       2:  2     X
3:  3     B       3:  3     Y
4:  4     B       4:  4     Y
5:  5     B       5:  5     Y
6:  6     C       6:  6     Z
You can still set them the traditional way:
levels(mydt$value) <- c(...)
This should be plenty fast unless mydt is very large since that traditional syntax copies the entire object. You could also play the un-factoring and refactoring game... but no one likes that game anyway.
To change the levels by reference with no copy of mydt :
setattr(mydt$value,"levels",c(...))
but be sure to assign a valid levels vector (type character of sufficient length) otherwise you'll end up with an invalid factor (levels<- does some checking as well as copying).
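A sketch of the by-reference route with concrete values filled in, using the mydt from the question. Since setattr skips the checks that levels<- performs, the stopifnot guard is an addition made here. The three conditions are one reasonable reading of "a valid levels vector", not an official list.

```r
library(data.table)

mydt <- data.table(id = 1:6,
                   value = as.factor(c("A", "A", "B", "B", "B", "C")),
                   key = "id")

new_levels <- c("X", "Y", "Z")

# setattr does no validation, so check by hand before calling it
stopifnot(is.character(new_levels),
          length(new_levels) >= nlevels(mydt$value),
          !anyDuplicated(new_levels))

setattr(mydt$value, "levels", new_levels)  # modifies the column in place
levels(mydt$value)
mydt
```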
I would rather go the traditional way of re-assignment to the factors
> mydt$value # This is what we had originally
[1] A A B B B C
Levels: A B C
> levels(mydt$value) # just checking the levels
[1] "A" "B" "C"
# Meat of the re-assignment
> levels(mydt$value)[levels(mydt$value)=="A"] <- "X"
> levels(mydt$value)[levels(mydt$value)=="B"] <- "Y"
> levels(mydt$value)[levels(mydt$value)=="C"] <- "Z"
> levels(mydt$value)
[1] "X" "Y" "Z"
> mydt # This is what we wanted
id value
1: 1 X
2: 2 X
3: 3 Y
4: 4 Y
5: 5 Y
6: 6 Z
As you probably noticed, the meat of the re-assignment is very intuitive: it checks for the exact level (use grepl in case you need a fuzzy match, regular expressions or the like)
levels(mydt$value)[levels(mydt$value)=="A"] <- "X"
This explicitly checks the value in the levels of the variable under consideration and then reassigns X (and so on) to it - The advantage- you explicitly KNOW what labeled what.
I find renaming levels as here levels(mydt$value) <- c("X","Y","Z") very non-intuitive, since it just assigns X to the 1st level it SEES in the data (so the order really matters)
PPS : In case of too many levels, use looping constructs.
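For many levels, one alternative to a loop is a named lookup vector, which keeps the "you explicitly know what was relabeled as what" property. This is a sketch on a plain factor with made-up names. It assumes every existing level has an entry in the lookup, otherwise that level would become NA.

```r
f <- factor(c("A", "A", "B", "B", "B", "C"))

# Old level -> new level, spelled out by name rather than by position
lookup <- c(C = "Z", A = "X", B = "Y")

# Index the lookup by the current levels so each old name picks its own replacement
levels(f) <- unname(lookup[levels(f)])
levels(f)
as.character(f)
```

Because the lookup is indexed by name, the order in which its entries are written does not matter.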
You can also rename and add to your levels using a related approach, which can be very handy, especially if you are making a plot that needs more informative labels in a particular order (as opposed to the default):
f <- factor(c("a","b"))
levels(f) <- list(C = "C", D = "a", B = "b")
(modified from ?levels)
This is safer than Matt Dowle's suggestion (because it uses the checks skipped by setattr) but won't copy the entire data.table. It will replace the entire column vector (whereas Matt's solution only replaces the attributes of the column vector) , but that seems like an acceptable trade-off in order to reduce the risk of messing up the factor object.
mydt[, value:=`levels<-`(value, c("X", "Y", "Z"))]
Simplest way to change a column's levels:
dat$colname <- as.factor(as.vector(dat$colname));

Resources