Recoding an arbitrary grouping variable or factor in R - r

Suppose I have a vector or column of arbitrary length representing some grouping/factor variable with an arbitrary number of groups and arbitrary values for same along the lines of this:
a <- c(2,2,2,2,2,7,7,7,7,10,10,10,10,10)
a
[1] 2 2 2 2 2 7 7 7 7 10 10 10 10 10
How would I most easily turn that into this:
a
[1] 1 1 1 1 1 2 2 2 2 3 3 3 3 3

a <- c(2,2,2,2,2,7,7,7,7,10,10,10,10,10)
c(factor(a))
#[1] 1 1 1 1 1 2 2 2 2 3 3 3 3 3
Explanation:
A factor is just an integer vector with levels attribute and a class attribute. c removes attributes as a side effect. You could use as.numeric or as.integer instead of c with similar or the same results, respectively.

Related

Change the order of numerically named columns in r

If I have a dataframe like the one below which has numerical column names
example = data.frame(1=c(1,8,3,9), 2=c(3,2,3,3), 3=c(5,2,5,4), 4=c(1,2,3,4), 5=c(2,5,7,8))
Which looks like this:
1 2 3 4 5
1 3 5 1 2
8 2 2 2 5
3 3 5 3 7
9 3 4 4 8
And I want to arrange it so that the column names start with three and proceed through five and back to one, like this:
3 4 5 1 2
5 1 2 1 3
2 2 5 8 2
5 3 7 3 3
4 4 8 9 3
I know how to rearrange the position of a single column in a dataset, but I'm not sure how to do this with more than one column in this particular order.
We can use the column index concatenated (c) based on the sequence (:) on a range of values
example[c(3:5, 1:2)]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
As the column names are all numeric, just convert to numeric and use that for ordering
v1 <- as.numeric(names(example))
example[c(v1[3:5], v1[1:2])]
Or simply do
example[c(names(example)[3:5], names(example)[1:2])]
Or another way is with head and tail
example[c(tail(names(example), 3), head(names(example), 2))]
data
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3),
`3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
R will not easily let you create columns with numbers as name. If somehow, you are able to create columns with numbers you can use match to get order in which you want the column names.
example[match(c(3:5, 1:2), names(example))]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3

What does subset(df, !duplicated(x)) do?

Looking for a detailed answer.
When we have a data frame (df) that contains three variables x, y, and z, what does the following command do?
subset(df, !duplicated(x))
The duplicated function traverses its argument(s) sequentially and returns TRUE if there has been a prior value identical to the current value. It is a generic function, so it has a default definition (for vectors) but also a definition for other classes, such as objects of the data.frame class. The subset function treats expressions passed as a second or third argument as though column names are first class objects. This is called "non-standard evaluation". (Notice the negation operator.) So this call to subset will return the rows of a data.frame where only the first instance of the column named "x" is not duplicated. It would probably return a dataframe with only the number of rows that equal the number of unique items in the x column.
> dat <- data.frame( x =sample(1:5, 20, repl=TRUE), y=1:5, z=1:4)
> dat
x y z
1 2 1 1
2 2 2 2
3 2 3 3
4 5 4 4
5 4 5 1
6 1 1 2
7 2 2 3
8 2 3 4
9 5 4 1
10 1 5 2
11 2 1 3
12 4 2 4
13 5 3 1
14 4 4 2
15 3 5 3
16 3 1 4
17 4 2 1
18 4 3 2
19 1 4 3
20 1 5 4
> subset(dat, !duplicated(x))
x y z
1 2 1 1
4 5 4 4
5 4 5 1
6 1 1 2
15 3 5 3

Group values by unique elements [duplicate]

This question already has answers here:
How to create a consecutive group number
(13 answers)
Create group number for contiguous runs of equal values
(4 answers)
Closed 1 year ago.
I have a vector that looks like this:
a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")
I would like to create another another vector, based on a, that should look like:
b <- c(1,1,1,2,2,2,3,4,4,4,4,4,4,5)
In other words, b should assign a value (starting from 1) to each different element of a.
First of all, (I assume) this is your vector
a <- c("A110","A110","A110","B220","B220","C330","D440","D440","D440","D440","D440","D440","E550")
As per possible solutions, here are few (can't find a good dupe right now)
as.integer(factor(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or
cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or
match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Also rle will work the similarly in your specific scenario
with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Or (which is practically the same)
data.table::rleid(a)
# [1] 1 1 1 2 2 3 4 4 4 4 4 4 5
Though be advised that all 4 solutions have their unique behavior in different scenarios, consider the following vector
a <- c("B110","B110","B110","A220","A220","C330","D440","D440","B110","B110","E550")
And the results of the 4 different solutions:
1.
as.integer(factor(a))
# [1] 2 2 2 1 1 3 4 4 2 2 5
The factor solution begins with 2 because a is unsorted and hence the first values are getting higher integer representation within the factor function. Hence, this solution is only valid if your vector is sorted, so don't use it other wise.
2.
cumsum(!duplicated(a))
# [1] 1 1 1 2 2 3 4 4 4 4 5
This cumsum/duplicated solution got confused because of "B110" already been present at the beginning and hence grouped "D440","D440","B110","B110" into the same group.
3.
match(a, unique(a))
# [1] 1 1 1 2 2 3 4 4 1 1 5
This match/unique solution added ones at the end, because it is sensitive to "B110" showing up in more than one sequences (because of unique) and hence grouping them all into same group regardless of where they appear
4.
with(rle(a), rep(seq_along(values), lengths))
# [1] 1 1 1 2 2 3 4 4 5 5 6
This solution only cares about sequences, hence different sequences of "B110" were grouped into different groups

Proportion of dataset equal to a value

I have the following dataset called asteroids
3 4 3 3 1 4 1 3 2 3
1 1 4 2 3 3 2 6 1 1
3 3 2 2 2 2 1 3 2 1
6 1 3 2 2 1 2 2 4 2
I need to find out what proportion of this dataset is 1.
If you have a specific value in mind you can just do an equality comparison and then use mean on the resulting logical vector.
> asteroids <- scan(what=numeric())
1: 3 4 3 3 1 4 1 3 2 3 1 1 4 2 3 3 2 6 1 1 3 3 2 2 2 2 1 3 2 1 6 1 3 2 2 1 2 2 4 2
41:
Read 40 items
> mean(asteroids == 1)
[1] 0.25
This works since the equality comparison will give TRUE and FALSE and when T/F are coerced numerically they become 1s and 0s so mean ends up giving us the proportion of TRUEs.
I assumed asteroids was a vector. You don't specify in your question but if it's a different type of structure you'll probably need to coerce it into a vector in some way or another.
Assuming that 'asteroids' is a data.frame, unlist it, get the table and find the proportion with prop.table.
prop.table(table(unlist(asteroids)==1))
# FALSE TRUE
# 0.75 0.25
Or as #Richard Scriven mentioned, we can convert the data.frame to a logical matrix, and use table directly on it as 'matrix' is a vector with dim attributes.
prop.table(table(asteroids == 1))

changing values in dataframe in R based on criteria

I have a data frame that looks like
> mydata
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I have some code that counts the number of observations per ID, determines which IDs have a number of observations that meet a certain criteria (in this case, >=3 observations), and returns a vector with these IDs:
> vals
[1] 1 3
Now I want to manipulate the X values associated with these IDs, e.g. by adding 1 to each value, giving a data frame like this:
> mydata
ID Observation X
1 1 4
1 2 4
1 3 4
1 4 4
2 1 4
2 2 4
3 1 9
3 2 9
3 3 9
I'm pretty new to R and am uncertain how I might do this. It might help to know that X is constant for each ID.
The call mydata$ID %in% vals returns TRUE or FALSE to indicate whether the ID value for each row is in the vals vector. When you add this to the data currently in mydata$X, the TRUE and FALSE are converted to 1 and 0, respectively, yielding the desired result:
mydata$X <- mydata$X + mydata$ID %in% vals
# mydata
# ID Observation X
# 1 1 1 4
# 2 1 2 4
# 3 1 3 4
# 4 1 4 4
# 5 2 1 4
# 6 2 2 4
# 7 3 1 9
# 8 3 2 9
# 9 3 3 9

Resources