Replace a given value within a column with the next different number in a row in R

I have a data set that will ultimately contain about 30,000 observations. I have coded a variable so that the numerical values 1:4 are of interest, while the value 5 is a placeholder meaning that our testing instrument could not collect a value for one reason or another (I'm not worried about why, or about missingness, etc.).
I am looking to turn any observation of 5, or any run of consecutive 5s, into the next non-5 value that follows it. As can be seen in the example data set below, the first four observations are 5 while the next four observations are 4. In this situation I would like the first four observations to be changed from 5 to 4.
Note that after the 8th observation another run of 5s occurs, followed by a run of 3s. In this case the 5s should be changed to 3s.
In the code block below I have provided an example of what the current data look like, in the column "Current". I have also provided the desired output in the column "Desired". The obs variable is there only to show the row numbers, so the changes in value are easy to locate for this post.
df <- data.frame(Current = c(5,5,5,5,4,4,4,4,5,5,3,3,3,5,3,3,5,5,2,5,5,5,1),
Desired = c(4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,2,2,2,1,1,1,1))
df$obs = seq(1,nrow(df), by = 1)

You could use
library(tidyr)
library(dplyr)
df %>%
  mutate(new_column = na_if(Current, 5)) %>%
  fill(new_column, .direction = "up")
This returns
Current Desired new_column
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1
We use dplyr's na_if function to convert the 5s into missing values.
Next we use tidyr's fill function with .direction = "up" to replace each NA with the next non-missing value below it.
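For reference, the same fill can be sketched in base R without extra packages. This is only an illustration under stated assumptions: `fill_next` and its `placeholder` argument are hypothetical names (not part of dplyr or tidyr), and a trailing run of 5s with no later value is left as NA rather than guessed.

```r
# Example data from the question
df <- data.frame(
  Current = c(5,5,5,5,4,4,4,4,5,5,3,3,3,5,3,3,5,5,2,5,5,5,1),
  Desired = c(4,4,4,4,4,4,4,4,3,3,3,3,3,3,3,3,2,2,2,1,1,1,1)
)

# Hypothetical helper: replace each placeholder with the next non-placeholder value
fill_next <- function(x, placeholder = 5) {
  x[x %in% placeholder] <- NA
  n <- length(x)
  # For every position, the index of the nearest non-NA value at or after it;
  # n + 1 marks "nothing follows", and indexing at n + 1 yields NA
  pos <- ifelse(is.na(x), n + 1L, seq_along(x))
  idx <- rev(cummin(rev(pos)))
  x[idx]
}

df$new_column <- fill_next(df$Current)
```

The reversed `cummin` is what makes this a backward fill: scanning from the end, each position remembers the smallest index of a real value seen so far.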

You can use the following solution. It relies on zoo::na.locf, which carries the most recent non-NA value forward and so replaces NAs on the way down. To fit this to your data set, I first replace every value equal to 5 with NA, then reverse the vector, fill it, and finally reverse the result back to its original order:
library(dplyr)
library(zoo)
df %>%
  mutate(Desired2 = ifelse(Current == 5, NA, Current),
         Desired2 = rev(na.locf(rev(Desired2))))
Current Desired Desired2
1 5 4 4
2 5 4 4
3 5 4 4
4 5 4 4
5 4 4 4
6 4 4 4
7 4 4 4
8 4 4 4
9 5 3 3
10 5 3 3
11 3 3 3
12 3 3 3
13 3 3 3
14 5 3 3
15 3 3 3
16 3 3 3
17 5 2 2
18 5 2 2
19 2 2 2
20 5 1 1
21 5 1 1
22 5 1 1
23 1 1 1
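As a possible simplification, na.locf also accepts a fromLast argument, so the double reversal can be avoided. The sketch below uses a small made-up vector (not the question's data) with a trailing 5 to show one edge case: with na.rm = FALSE the result keeps its original length and the unfillable trailing value stays NA. Treat this as an illustration, not as the answerer's method.

```r
library(zoo)

x <- c(5, 5, 4, 5, 3, 5)   # made-up example; the last 5 has no later value
x[x == 5] <- NA            # mark the placeholder as missing

# Fill from the end of the vector backwards; na.rm = FALSE keeps the length fixed
filled <- na.locf(x, fromLast = TRUE, na.rm = FALSE)
```

Keeping the length fixed matters inside mutate, which expects the new column to have one value per row.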


Change the order of numerically named columns in r

If I have a dataframe like the one below which has numerical column names
example = data.frame(1=c(1,8,3,9), 2=c(3,2,3,3), 3=c(5,2,5,4), 4=c(1,2,3,4), 5=c(2,5,7,8))
Which looks like this:
1 2 3 4 5
1 3 5 1 2
8 2 2 2 5
3 3 5 3 7
9 3 4 4 8
And I want to arrange it so that the column names start with three and proceed through five and back to one, like this:
3 4 5 1 2
5 1 2 1 3
2 2 5 8 2
5 3 7 3 3
4 4 8 9 3
I know how to rearrange the position of a single column in a dataset, but I'm not sure how to do this with more than one column in this particular order.
We can index by column position, using c to concatenate two sequences built with the : operator
example[c(3:5, 1:2)]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
As the column names are all numeric, just convert to numeric and use that for ordering
v1 <- as.numeric(names(example))
example[c(v1[3:5], v1[1:2])]
Or simply do
example[c(names(example)[3:5], names(example)[1:2])]
Or another way is with head and tail
example[c(tail(names(example), 3), head(names(example), 2))]
data
example <- data.frame(`1`=c(1,8,3,9), `2`=c(3,2,3,3),
`3`=c(5,2,5,4), `4`=c(1,2,3,4), `5`=c(2,5,7,8), check.names = FALSE)
R will not easily let you create columns with numbers as names. If you do manage to create such columns, you can use match to get the positions of the column names in the order you want.
example[match(c(3:5, 1:2), names(example))]
# 3 4 5 1 2
#1 5 1 2 1 3
#2 2 2 5 8 2
#3 5 3 7 3 3
#4 4 4 8 9 3
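The difference between indexing by position and matching by name only shows up when the columns are not already in numeric order. A minimal sketch, assuming a deliberately shuffled copy of the data (`shuffled` and `by_name` are names made up for this illustration):

```r
example <- data.frame(`1` = c(1, 8, 3, 9), `2` = c(3, 2, 3, 3),
                      `3` = c(5, 2, 5, 4), `4` = c(1, 2, 3, 4),
                      `5` = c(2, 5, 7, 8), check.names = FALSE)

# Shuffle the columns so that position and name no longer agree
shuffled <- example[c(2, 5, 1, 4, 3)]

# match() looks the wanted names up in names(shuffled), so the current
# column order does not matter
by_name <- shuffled[match(c(3:5, 1:2), names(shuffled))]
names(by_name)
```

Positional indexing such as `shuffled[c(3:5, 1:2)]` would instead pick the third to fifth columns as they currently sit, whatever they are called.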

What does subset(df, !duplicated(x)) do?

Looking for a detailed answer.
When we have a data frame (df) that contains three variables x, y, and z, what does the following command do?
subset(df, !duplicated(x))
The duplicated function traverses its argument sequentially and returns TRUE for each element that is identical to an earlier one. It is a generic function, so it has a default method (for vectors) as well as methods for other classes, such as data.frame.
The subset function evaluates the expressions passed as its second and third arguments inside the data frame, so column names can be used as if they were variables. This is called "non-standard evaluation". Notice the negation operator: !duplicated(x) is TRUE only for the first occurrence of each value of x. So this call to subset returns the rows of df that hold the first occurrence of each distinct value in column x. The result has as many rows as there are unique values in x.
> dat <- data.frame(x = sample(1:5, 20, replace = TRUE), y = 1:5, z = 1:4)
> dat
x y z
1 2 1 1
2 2 2 2
3 2 3 3
4 5 4 4
5 4 5 1
6 1 1 2
7 2 2 3
8 2 3 4
9 5 4 1
10 1 5 2
11 2 1 3
12 4 2 4
13 5 3 1
14 4 4 2
15 3 5 3
16 3 1 4
17 4 2 1
18 4 3 2
19 1 4 3
20 1 5 4
> subset(dat, !duplicated(x))
x y z
1 2 1 1
4 5 4 4
5 4 5 1
6 1 1 2
15 3 5 3
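To make the mechanics visible, the call can be unpacked into its two steps with ordinary bracket indexing. This is a sketch on a small fixed data frame (made up here so the result is reproducible, unlike the random sample above):

```r
dat <- data.frame(x = c(2, 2, 5, 4, 2, 5), y = 1:6, z = letters[1:6])

# Step 1: one logical per row, TRUE where x has not been seen before
keep <- !duplicated(dat$x)

# Step 2: keep only those rows; equivalent in effect to subset(dat, !duplicated(x))
first_rows <- dat[keep, ]
```

Here `keep` is TRUE for rows 1, 3 and 4, which are the first rows holding 2, 5 and 4 respectively.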

R: Return values in a columns when the value in another column becomes negative for the first time

For each ID, I want to return the value in the 'distance' column where the value becomes negative for the first time. If the value does not become negative at all, return the value 99 (or some other random number) for that ID. A sample data frame is given below.
df <- data.frame(ID=c(rep(1, 4),rep(2,4),rep(3,4),rep(4,4),rep(5,4)),distance=rep(1:4,5), value=c(1,4,3,-1,2,1,-4,1,3,2,-1,1,-4,3,2,1,2,3,4,5))
> df
ID distance value
1 1 1 1
2 1 2 4
3 1 3 3
4 1 4 -1
5 2 1 2
6 2 2 1
7 2 3 -4
8 2 4 1
9 3 1 3
10 3 2 2
11 3 3 -1
12 3 4 1
13 4 1 -4
14 4 2 3
15 4 3 2
16 4 4 1
17 5 1 2
18 5 2 3
19 5 3 4
20 5 4 5
The desired output is as follows
> df2
ID first_negative_distance
1 1 4
2 2 3
3 3 3
4 4 1
5 5 99
I tried but couldn't figure out how to do it through dplyr. Any help would be much appreciated. The actual data I'm working on has thousands of ID's with 30 different distance levels for each. Bear in mind that for any ID, there could be multiple instances of negative values. I just need the first one.
Edit:
Tried the solution proposed by AntonoisK.
> df%>%group_by(ID)%>%summarise(first_neg_dist=first(distance[value<0]))
first_neg_dist
1 4
This is the result I am getting. Does not match what Antonois got. Not sure why.
library(dplyr)
df %>%
group_by(ID) %>%
summarise(first_neg_dist = first(distance[value < 0]))
# # A tibble: 5 x 2
# ID first_neg_dist
# <dbl> <int>
# 1 1 4
# 2 2 3
# 3 3 3
# 4 4 1
# 5 5 NA
If you really prefer 99 instead of NA you can use
summarise(first_neg_dist = coalesce(first(distance[value < 0]), 99L))
instead.
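A base R sketch of the same idea is shown below, in case it helps to cross-check the dplyr result. It does not depend on which package's summarise happens to be attached. The 99 fallback comes from the question; the names `first_neg` and `df2` are chosen here for illustration.

```r
df <- data.frame(ID = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(4, 4), rep(5, 4)),
                 distance = rep(1:4, 5),
                 value = c(1, 4, 3, -1, 2, 1, -4, 1, 3, 2, -1, 1, -4, 3, 2, 1, 2, 3, 4, 5))

# For each ID, take the first distance whose value is negative, or 99 if none is
first_neg <- sapply(split(df, df$ID), function(g) {
  d <- g$distance[g$value < 0]
  if (length(d) > 0) d[1] else 99L
})

df2 <- data.frame(ID = as.integer(names(first_neg)),
                  first_negative_distance = unname(first_neg))
```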

R: how to shift columns based on conditions

I have a dataset like the following and, for each row, I want to shift some of the columns based on a condition.
flv1 attr1_1 attr2_1 flv2 atrr2_1 atrr2_2 flv3 atrr3_1 atrr3_2
1 3 4 3 4 2 2 2 5
2 3 4 3 4 2 1 5 5
1 3 4 3 4 2 2 4 5
The result I want is this: whenever the number under flvi is not i, I move that value, along with the values in the two columns that follow it, to the block of columns whose position matches that number. Specifically, the result should look like the following:
flv1 attr1_1 attr2_1 flv2 atrr2_1 atrr2_2 flv3 atrr3_1 atrr3_2
1 3 4 2 2 5 3 4 2
1 5 5 2 3 4 3 4 2
1 3 4 2 4 5 3 4 2
Here's an option which is not terribly clean, but, well, neither is your data's form. If the original data.frame is called df:
library(dplyr)
# clean out asterisks
df %>% mutate_all(tidyr::extract_numeric) %>%
# apply a function to split each row into three groups, order by the flvis, and recombine
apply(1, function(x){split(x, rep(1:3, each = 3))[order(x[c(1,4,7)])] %>% unlist()}) %>%
# clean up matrix back to original data.frame form
t() %>% as.data.frame() %>% setNames(names(df))
## flv1 attr1_1 attr2_1 flv2 atrr2_1 atrr2_2 flv3 atrr3_1 atrr3_2
## 1 1 3 4 2 2 5 3 4 2
## 2 1 5 5 2 3 4 3 4 2
## 3 1 3 4 2 4 5 3 4 2
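The core of that pipeline, the split-order-recombine step, can also be sketched in base R on the data as printed in the question. This assumes the values are already numeric, so the cleaning step is skipped; `reorder_row` and `res` are names made up for this illustration.

```r
df <- data.frame(flv1 = c(1, 2, 1), attr1_1 = c(3, 3, 3), attr2_1 = c(4, 4, 4),
                 flv2 = c(3, 3, 3), atrr2_1 = c(4, 4, 4), atrr2_2 = c(2, 2, 2),
                 flv3 = c(2, 1, 2), atrr3_1 = c(2, 5, 4), atrr3_2 = c(5, 5, 5))

# Split one row into three blocks of three values, then put the blocks in the
# order given by their leading flv value (positions 1, 4 and 7)
reorder_row <- function(x) {
  blocks <- split(x, rep(1:3, each = 3))
  unlist(blocks[order(x[c(1, 4, 7)])], use.names = FALSE)
}

# apply() returns one column per input row, so transpose back
res <- as.data.frame(t(apply(as.matrix(df), 1, reorder_row)))
names(res) <- names(df)
```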

How do I preserve continuous (1,2,3,...n) ranking notation when ranking in R?

If I want to rank a set of numbers using the minimum rank for shared cases (aka ties):
dat <- c(13,13,14,15,15,15,15,15,15,16,17,22,45,46,112)
rank(dat, ties = 'min')
I get the results:
1 1 3 4 4 4 4 4 4 10 11 12 13 14 15
However, I want the rank to be a continuous series consisting of 1,2,3,...n, where n is the number of unique ranks.
Is there a way to make rank (or a similar function) assign ties the lowest rank as above, but then continue from the previous rank instead of skipping ahead by the number of tied values?
For example, I would like the above ranking to result in:
1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
You could do it using dplyr:
library(dplyr)
dense_rank(dat)
[1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
If you don't want to load the whole library, you can do it in base R:
match(dat, sort(unique(dat)))
[1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
Use a factor and then bring it back to numeric format:
as.numeric(factor(rank(dat)))
# [1] 1 1 2 3 3 3 3 3 3 4 5 6 7 8 9
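The example vector happens to be sorted already, so here is a short sketch on unsorted, made-up input to show that the base R approaches do not rely on that. The variable names `dense` and `via_factor` are chosen for this illustration only.

```r
dat <- c(15, 13, 112, 13, 14)   # unsorted, with one tie

# Position of each value among the sorted unique values
dense <- match(dat, sort(unique(dat)))

# Factor levels are the sorted unique values, so the integer codes agree
via_factor <- as.integer(factor(dat))
```

Both give 3 1 4 1 2: the smallest value gets rank 1, ties share a rank, and no rank is skipped.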
