What does subset(df, !duplicated(x)) do? - r

Looking for a detailed answer.
When we have a data frame (df) that contains three variables x, y, and z, what does the following command do?
subset(df, !duplicated(x))

The duplicated function traverses its argument(s) sequentially and returns TRUE if there has been a prior value identical to the current value. It is a generic function, so it has a default definition (for vectors) but also a definition for other classes, such as objects of the data.frame class. The subset function treats expressions passed as a second or third argument as though column names are first class objects. This is called "non-standard evaluation". (Notice the negation operator.) So this call to subset will return the rows of a data.frame where only the first instance of the column named "x" is not duplicated. It would probably return a dataframe with only the number of rows that equal the number of unique items in the x column.
> dat <- data.frame( x =sample(1:5, 20, repl=TRUE), y=1:5, z=1:4)
> dat
x y z
1 2 1 1
2 2 2 2
3 2 3 3
4 5 4 4
5 4 5 1
6 1 1 2
7 2 2 3
8 2 3 4
9 5 4 1
10 1 5 2
11 2 1 3
12 4 2 4
13 5 3 1
14 4 4 2
15 3 5 3
16 3 1 4
17 4 2 1
18 4 3 2
19 1 4 3
20 1 5 4
> subset(dat, !duplicated(x))
x y z
1 2 1 1
4 5 4 4
5 4 5 1
6 1 1 2
15 3 5 3

Related

replace a given value within a column with the next different number in a row in R

I have a data set that will ultimately be about ~30,000 observations. I have formatted a variable in such a way that the numerical values 1:4 are of interest, while the value 5 is a place holder and was not able to be collected by our testing instrument for one reason or another (not worried about why or missingness etc).
I am looking to turn any observation of 5, or series of observations of 5, into the next number in the observations. As can be seen in the example data set below, the first four observations have the number 5 while the next four observations are the number 4. In this situation I would like the first 4 observations to be changed from 5 to 4.
Note that after the 8th observation another series of 5's occur, follow by a series of 3s. In this case the 5s should be changed to 3s.
In the code block below I have provided an example of what the current data look like, delineated by the column "Current." I have also provided a column of the desired output, delineated by the column name "Desired." The obs variable was helpful to create just to show the row number of the changes in values for the case of this post.
df <- data.frame(Current = c(5,5,5,5,4,4,4,4,5,5,3,3,3,5,3,3,5,5,2,5,5,5,1),
Desired = c(4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,2,2,2,1,1,1,1))
df$obs = seq(1,nrow(df), by = 1)
You could use
library(tidyr)
library(dplyr)
df %>%
mutate(new_column = na_if(Current, 5)) %>%
fill(new_column, .direction = "up")
This returns
Current Desired new_column
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1
We use dplyr's na_if function to convert the 5 into missing values.
Next we use tidyr's fill function to replace the NA's by the following values.
You can use the following solution. I made use of zoo::na.locf function which takes the most non-NA value and replace all NAs on the way down. However, to fit this to your data set I first replaced all values equal to 5 with NA and then reverse the vector and after I replaced all the values with the desired values, I again reversed it back to its original order:
library(dplyr)
library(zoo)
library(zoo)
df %>%
mutate(Desired2 = ifelse(Current == 5, NA, Current),
Desired2 = rev(na.locf(rev(Desired2))))
Current Desired Desired2
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1

Count number of rows that span the values in a vector for each item in the vector

I am trying to figure out how to count the number of rows in a dataframe that contain value pairs that span the values in a separate vector for each item in that separate vector.
For instance, If I have vector p:
> p=seq(1,7,1);p
[1] 1 2 3 4 5 6 7
and dataframe se:
> s=c(1,2,2,5)
> e=c(5,7,11,22)
> se=data.frame(s,e);se
s e
1 1 5
2 2 7
3 2 11
4 5 22
I would like to know how many rows in 'se' overlap each item in 'p'. The output I would expect should look similar to this:
1 1
2 3
3 3
4 3
5 4
6 3
7 3
You can do this with sapply to check for each value in p.
result <- data.frame(p, overlap=sapply(p,function(x) sum(x >= se$s & x <= se$e)))
result
# p overlap
#1 1 1
#2 2 3
#3 3 3
#4 4 3
#5 5 4
#6 6 3
#7 7 3

Change the order of numerically named columns in r

If I have a dataframe like the one below which has numerical column names
example = data.frame(1=c(1,8,3,9), 2=c(3,2,3,3), 3=c(5,2,5,4), 4=c(1,2,3,4), 5=c(2,5,7,8))
Which looks like this:
1 2 3 4 5
1 3 5 1 2
8 2 2 2 5
3 3 5 3 7
9 3 4 4 8
And I want to arrange it so that the column names start with three and proceed through five and back to one, like this:
3 4 5 1 2
5 1 2 1 3
2 2 5 8 2
5 3 7 3 3
4 4 8 9 3
I know how to rearrange the position of a single column in a dataset, but I'm not sure how to do this with more than one column in this particular order.
We can use the column index concatenated (c) based on the sequence (:) on a range of values
example[c(3:5, 1:2)]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
As the column names are all numeric, just convert to numeric and use that for ordering
v1 <- as.numeric(names(example))
example[c(v1[3:5], v1[1:2])]
Or simply do
example[c(names(example)[3:5], names(example)[1:2])]
Or another way is with head and tail
example[c(tail(names(example), 3), head(names(example), 2))]
data
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3),
`3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
R will not easily let you create columns with numbers as name. If somehow, you are able to create columns with numbers you can use match to get order in which you want the column names.
example[match(c(3:5, 1:2), names(example))]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3

Transforming a looping factor variable into a sequence of numerics

I have a factor variable with 6 levels, which simplified looks like:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 1 1 1 2 2 2 2... 1 1 1 2 2... (with n = 78)
Note, that each number is repeated mostly but not always three times.
I need to transform this variable into the following pattern:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8...
where each repetition of the 6 levels continuous counting ascending.
Is there any way / any function that lets me do that?
Sorry for my bad description!
Assuming that you have a numerical vector that represents your simplified version you posted. i.e. x = c(1,1,1,2,2,3,3,3,1,1,2,2), you can use this:
library(dplyr)
cumsum(x != lag(x, default = 0))
# [1] 1 1 1 2 2 3 3 3 4 4 5 5
which compares each value to its previous one and if they are different it adds 1 (starting from 1).
Maybe you can try rle, i.e.,
v <- rep(seq_along((v<-rle(x))$values),v$lengths)
Example with dummy data
x = c(1,1,1,2,2,3,3,3,4,4,5,6,1,1,2,2,3,3,3,4,4)
then we can get
> v
[1] 1 1 1 2 2 3 3 3 4 4 5 6 7 7 8 8 9 9
[19] 9 10 10
In base you can use diff and cumsum.
c(1, cumsum(diff(x)!=0)+1)
# [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
Data:
x <- c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2)

changing values in dataframe in R based on criteria

I have a data frame that looks like
> mydata
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I have some code that counts the number of observations per ID, determines which IDs have a number of observations that meet a certain criteria (in this case, >=3 observations), and returns a vector with these IDs:
> vals
[1] 1 3
Now I want to manipulate the X values associated with these IDs, e.g. by adding 1 to each value, giving a data frame like this:
> mydata
ID Observation X
1 1 4
1 2 4
1 3 4
1 4 4
2 1 4
2 2 4
3 1 9
3 2 9
3 3 9
I'm pretty new to R and am uncertain how I might do this. It might help to know that X is constant for each ID.
The call mydata$ID %in% vals returns TRUE or FALSE to indicate whether the ID value for each row is in the vals vector. When you add this to the data currently in mydata$X, the TRUE and FALSE are converted to 1 and 0, respectively, yielding the desired result:
mydata$X <- mydata$X + mydata$ID %in% vals
# mydata
# ID Observation X
# 1 1 1 4
# 2 1 2 4
# 3 1 3 4
# 4 1 4 4
# 5 2 1 4
# 6 2 2 4
# 7 3 1 9
# 8 3 2 9
# 9 3 3 9

Resources