Replacing values in a data.frame that have lost their order - r

In my toy data, within each unique study the numeric variables (sample and group) must be numbered consecutively starting from 1. Some of them are not:
In study 1, there are two unique sample values (1 and 3), so 3 must be replaced with 2.
In study 2, there is only one unique group value (2), so it must be replaced with 1.
In study 3, both sample and group are fine: their unique values are 1 and 2, so no replacement is needed.
My desired output for this toy data is shown below. I would appreciate a functional solution that automatically renumbers any number of numeric variables in a data.frame that have lost their order, as in my toy data.
m="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 3 1 A
1 3 1 B
1 3 2 A
1 3 2 B
2 1 2 A
2 1 2 B
2 2 2 A
2 2 2 B
2 3 2 A
2 3 2 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"
data <- read.table(text=m, h=T)
Desired_output="
study sample group outcome
1 1 1 A
1 1 1 B
1 1 2 A
1 1 2 B
1 2 1 A
1 2 1 B
1 2 2 A
1 2 2 B
2 1 1 A
2 1 1 B
2 2 1 A
2 2 1 B
2 3 1 A
2 3 1 B
3 1 1 A
3 1 1 B
3 1 2 A
3 1 2 B
3 2 1 A
3 2 1 B
3 2 2 A
3 2 2 B"

You can do:
library(dplyr)
data %>%
group_by(study) %>%
mutate(across(tidyselect::vars_select_helpers$where(is.numeric),
function(x) as.numeric(as.factor(x)))) %>%
as.data.frame()
The resultant data frame looks like this:
study sample group outcome
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B
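For readers without dplyr, the same idea can be sketched in base R. This is only an illustration under assumptions: renumber_within is a hypothetical helper name, and the toy data is a shortened version of studies 1 and 2 from the question. Within each study, every value is replaced by its position among the sorted unique values, which is the same thing as.numeric(as.factor(x)) does.

```r
# Shortened toy data based on studies 1 and 2 from the question
data <- data.frame(
  study   = c(1, 1, 1, 1, 2, 2, 2),
  sample  = c(1, 1, 3, 3, 1, 2, 3),
  group   = c(1, 2, 1, 2, 2, 2, 2),
  outcome = c("A", "B", "A", "B", "A", "A", "A")
)

# Hypothetical helper: renumber every numeric column (except the
# grouping column `by`) so that its values run 1, 2, 3, ... within each group
renumber_within <- function(df, by) {
  is_num <- vapply(df, is.numeric, logical(1))
  cols <- setdiff(names(df)[is_num], by)
  for (col in cols) {
    # match() against the sorted unique values gives a dense rank per group
    df[[col]] <- ave(df[[col]], df[[by]],
                     FUN = function(x) match(x, sort(unique(x))))
  }
  df
}

result <- renumber_within(data, "study")
result
```

Because the helper picks up the numeric columns itself, it should carry over to data frames with more than two such variables, as long as a single grouping column is enough.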

Here is an alternative dplyr solution (not as elegant as @Allan Cameron's, +1):
library(dplyr)
data %>%
group_by(study) %>%
mutate(x = n()/length(unique(sample)),
sample = rep(row_number(), each=x, length.out = n()),
y = length(unique(group)),
group = ifelse(y==1, 1, group)) %>%
select(-x, -y)
study sample group outcome
<int> <int> <dbl> <chr>
1 1 1 1 A
2 1 1 1 B
3 1 1 2 A
4 1 1 2 B
5 1 2 1 A
6 1 2 1 B
7 1 2 2 A
8 1 2 2 B
9 2 1 1 A
10 2 1 1 B
11 2 2 1 A
12 2 2 1 B
13 2 3 1 A
14 2 3 1 B
15 3 1 1 A
16 3 1 1 B
17 3 1 2 A
18 3 1 2 B
19 3 2 1 A
20 3 2 1 B
21 3 2 2 A
22 3 2 2 B

Related

How to create two columns based on some criteria in R

The data I have is almost similar to the data below.
A=01-03
B=04-06
C=07-09
D=10-11
data<-read.table (text=" ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header=TRUE)
I want to create two columns right after the Class column, Bin and Zero, based on A, B, C and D and the IDs.
A goes to the first block of IDs 1, 2 and 3; B goes to the next block of IDs 1, 2 and 3; C goes to the block after that, and so on. The Zero column contains only zeros. So the outcome would be:
ID Class Bin Zero Time1 Time2 Time3
1 1 01-03 0 1 3 3
2 1 01-03 0 4 3 2
3 1 01-03 0 2 2 2
1 2 04-06 0 1 4 1
2 3 04-06 0 2 1 1
3 2 04-06 0 3 2 3
1 3 07-09 0 1 1 2
2 2 07-09 0 4 3 1
3 3 07-09 0 3 2 1
1 1 10-11 0 4 3 2
2 1 10-11 0 2 2 2
3 2 10-11 0 1 4 1
Please try the code below:
library(tidyverse)
#use character vector with quotes
A='01-03'
B='04-06'
C='07-09'
D='10-11'
data<-read.table (text=" ID Class Time1 Time2 Time3
1 1 1 3 3
2 1 4 3 2
3 1 2 2 2
1 2 1 4 1
2 3 2 1 1
3 2 3 2 3
1 3 1 1 2
2 2 4 3 1
3 3 3 2 1
1 1 4 3 2
2 1 2 2 2
3 2 1 4 1
", header=TRUE)
#create a separate dataframe with bin column
data2 <- data.frame(bin=c(rep(A,3),rep(B,3),rep(C,3),rep(D,3)))
data3 <- bind_cols(data, data2) %>% mutate(zero=0)
If you are open to a dplyr based solution, you could use
library(dplyr)
data %>%
group_by(ID) %>%
mutate(Bin = c(A, B, C, D),
Zero = 0,
.after = 2) %>%
ungroup()
This returns
# A tibble: 12 × 7
ID Class Bin Zero Time1 Time2 Time3
<int> <int> <chr> <dbl> <int> <int> <int>
1 1 1 01-03 0 1 3 3
2 2 1 01-03 0 4 3 2
3 3 1 01-03 0 2 2 2
4 1 2 04-06 0 1 4 1
5 2 3 04-06 0 2 1 1
6 3 2 04-06 0 3 2 3
7 1 3 07-09 0 1 1 2
8 2 2 07-09 0 4 3 1
9 3 3 07-09 0 3 2 1
10 1 1 10-11 0 4 3 2
11 2 1 10-11 0 2 2 2
12 3 2 10-11 0 1 4 1
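A base R sketch of the same result, in case it is useful. It assumes, as in the sample data, that every block starts with ID 1, so cumsum(ID == 1) can serve as a block counter; the names bins and block are illustrative, and only the Time1 column is kept to keep the example short.

```r
# First three columns of the question's data
data <- data.frame(
  ID    = rep(1:3, times = 4),
  Class = c(1, 1, 1, 2, 3, 2, 3, 2, 3, 1, 1, 2),
  Time1 = c(1, 4, 2, 1, 2, 3, 1, 4, 3, 4, 2, 1)
)

# Bin labels must be character strings, otherwise 01-03 is evaluated as 1 - 3
bins <- c(A = "01-03", B = "04-06", C = "07-09", D = "10-11")

# Assumption: a new block begins whenever ID is 1
block <- cumsum(data$ID == 1)

data$Bin  <- unname(bins[block])
data$Zero <- 0

# Move the two new columns so they sit right after the second column (Class)
cols <- append(setdiff(names(data), c("Bin", "Zero")), c("Bin", "Zero"), after = 2)
data <- data[, cols]
data
```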

identify whenever values repeat in r

I have a dataframe like this.
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3))
I want to populate a new variable, Sequence, that increments whenever Condition starts again from 1. Thanks in advance for the help!
The new dataframe would look like this:
data <- data.frame(Condition = c(1,1,2,3,1,1,2,2,2,3,1,1,2,3,3),
Sequence = c(1,1,1,1,2,2,2,2,2,2,3,3,3,3,3))
base R
data$Sequence2 <- cumsum(c(TRUE, data$Condition[-1] == 1 & data$Condition[-nrow(data)] != 1))
data
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
dplyr
library(dplyr)
data %>%
mutate(
Sequence2 = cumsum(Condition == 1 & lag(Condition != 1, default = TRUE))
)
# Condition Sequence Sequence2
# 1 1 1 1
# 2 1 1 1
# 3 2 1 1
# 4 3 1 1
# 5 1 2 2
# 6 1 2 2
# 7 2 2 2
# 8 2 2 2
# 9 2 2 2
# 10 3 2 2
# 11 1 3 3
# 12 1 3 3
# 13 2 3 3
# 14 3 3 3
# 15 3 3 3
This took a while, but I finally found this solution:
library(dplyr)
data %>%
group_by(Sequence = cumsum(
ifelse(Condition==1, lead(Condition)+1, Condition)
- Condition==1)
)
Condition Sequence
<dbl> <int>
1 1 1
2 1 1
3 2 1
4 3 1
5 1 2
6 1 2
7 2 2
8 2 2
9 2 2
10 3 2
11 1 3
12 1 3
13 2 3
14 3 3
15 3 3
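One caveat worth checking on your own data: the lead()-based grouping above only increments when a 1 is immediately followed by another 1, which happens to hold in this sample. The sketch below wraps the "current value is 1 and the previous one was not" rule from the base R answer in a small function (restart_id is a hypothetical name) and tries it on a second vector where runs start with a single 1 or with three 1s.

```r
# Hypothetical helper: number the runs that begin each time `x`
# returns to `start` after having been at some other value
restart_id <- function(x, start = 1) {
  is_start <- x == start
  # Was the previous element also a start value? (FALSE for the first element)
  prev_is_start <- c(FALSE, head(is_start, -1))
  cumsum(is_start & !prev_is_start)
}

condition <- c(1, 1, 2, 3, 1, 1, 2, 2, 2, 3, 1, 1, 2, 3, 3)
restart_id(condition)
# 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3

# Runs that start with a single 1 or with several 1s
edge <- c(1, 2, 3, 1, 1, 1, 2)
restart_id(edge)
# 1 1 1 2 2 2 2
```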

How to arrange/sort by unique sequences?

A) Here is my data frame arranged by plate:
df <- read.table(header=TRUE, stringsAsFactors=FALSE, text="
plate phase score
A 1 1
A 1 1
A 1 1
A 2 1
A 2 1
A 2 1
A 3 2
A 3 2
A 3 2
B 1 1
B 1 1
B 1 2
B 2 1
B 2 1
B 2 3")
B) Goal: I want to order it by plate first and then by phase, but sequentially (see below how the rows are ordered alphabetically by plate but sequentially by phase):
plate phase score
<chr> <int> <int>
1 A 1 1
2 A 2 1
3 A 3 2
4 A 1 1
5 A 2 1
6 A 3 2
7 A 1 1
8 A 2 1
9 A 3 2
10 B 1 1
11 B 2 1
12 B 1 1
13 B 2 1
14 B 1 2
15 B 2 3
One option is to create a sequence variable grouped by 'plate' and 'phase', and then arrange on it along with 'plate' and 'score':
library(dplyr)
df %>%
group_by(plate, phase) %>%
mutate(rn = row_number()) %>%
ungroup %>%
arrange(plate, rn, score) %>%
select(-rn)
# A tibble: 15 x 3
# plate phase score
# <chr> <int> <int>
# 1 A 1 1
# 2 A 2 1
# 3 A 3 2
# 4 A 1 1
# 5 A 2 1
# 6 A 3 2
# 7 A 1 1
# 8 A 2 1
# 9 A 3 2
#10 B 1 1
#11 B 2 1
#12 B 1 1
#13 B 2 1
#14 B 1 2
#15 B 2 3
Or using data.table
library(data.table)
setDT(df)[order(plate, rowid(phase), score)]
Or in base R:
df[with(df, order(plate, ave(phase, phase, FUN = seq_along), phase)),]
#> plate phase score
#> 1 A 1 1
#> 4 A 2 1
#> 7 A 3 2
#> 2 A 1 1
#> 5 A 2 1
#> 8 A 3 2
#> 3 A 1 1
#> 6 A 2 1
#> 9 A 3 2
#> 10 B 1 1
#> 13 B 2 1
#> 11 B 1 1
#> 14 B 2 1
#> 12 B 1 2
#> 15 B 2 3
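A variant of the base R approach that makes the grouping explicit may be easier to adapt. This is a sketch, not a replacement for the answers above: it counts occurrences within each plate/phase pair (the name occurrence is illustrative) rather than within phase alone, so it should not depend on the rows already being sorted by plate.

```r
# Data from the question
df <- data.frame(
  plate = rep(c("A", "B"), times = c(9, 6)),
  phase = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 1, 1, 1, 2, 2, 2),
  score = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 3)
)

# 1 for the first row of each plate/phase pair, 2 for the second, and so on
occurrence <- ave(seq_len(nrow(df)), df$plate, df$phase, FUN = seq_along)

# Sort by plate, then by occurrence, then by phase within each occurrence
sorted <- df[order(df$plate, occurrence, df$phase), ]
sorted
```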

Using R code to reorganize data frame by randomly selecting one row from each combination

I have a data frame that looks like this:
Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2
As you can see in this data frame there are 6 different combinations in the N and S columns, and 8 consecutive rows of each combination. I want to create a new data frame where one row from each combination (be it 3 & 1 or 1 & 2) is randomly selected and then put into a new data frame so there are 8 consecutive rows of each different combination. That way the entire data frame of all 48 rows is completely reorganized. Is this possible in R code?
Edit: The desired output would be something like this, repeating until all 48 rows are filled. The subject in each row would have to be random, because each row is randomly selected from its N & S combination.
Subject N S
3 1
1 1
3 2
1 2
2 2
2 1
2 2
3 2
2 1
1 1
3 1
1 2
A solution using functions from dplyr.
# Load package
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Process the data
dt2 <- dt %>%
group_by(N, S) %>%
sample_n(size = 1)
# View the result
dt2
## A tibble: 6 x 3
## Groups: N, S [6]
# Subject N S
# <chr> <int> <int>
#1 Sub6-3 1 1
#2 Sub5-1 1 2
#3 Sub1-5 2 1
#4 Sub5-8 2 2
#5 Sub2-4 3 1
#6 Sub3-1 3 2
Update: Reorganize the rows
The following randomizes all rows.
dt3 <- dt %>% slice(sample(1:n(), n()))
Data Preparation
dt <- read.table(text = "Subject N S
Sub1-1 3 1
Sub1-2 3 1
Sub1-3 3 1
Sub1-4 3 1
Sub2-1 3 1
Sub2-2 3 1
Sub2-3 3 1
Sub2-4 3 1
Sub3-1 3 2
Sub3-2 3 2
Sub3-3 3 2
Sub4-1 3 2
Sub4-2 3 2
Sub4-3 3 2
Sub5-1 3 2
Sub5-2 3 2
Sub6-1 1 1
Sub6-2 1 1
Sub6-3 1 1
Sub7-1 1 1
Sub7-2 1 1
Sub7-3 1 1
Sub8-1 1 1
Sub8-2 1 1
Sub8-3 1 2
Sub9-1 1 2
Sub9-2 1 2
Sub1-1 1 2
Sub1-2 1 2
Sub1-3 1 2
Sub5-1 1 2
Sub5-2 1 2
Sub1-5 2 1
Sub1-6 2 1
Sub1-7 2 1
Sub1-5 2 1
Sub2-6 2 1
Sub2-5 2 1
Sub2-6 2 1
Sub2-7 2 1
Sub3-8 2 2
Sub3-5 2 2
Sub3-6 2 2
Sub4-7 2 2
Sub4-5 2 2
Sub4-6 2 2
Sub5-7 2 2
Sub5-8 2 2",
header = TRUE, stringsAsFactors = FALSE)
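If the goal is to repeat the "one random row per combination" draw until all 48 rows are used, one possible base R sketch is below. It is an interpretation of the question's edit, not something the answer above confirms. The data frame toy is a stand-in with made-up Subject labels (row1 to row48) and the same N/S layout as dt; round and tiebreak are illustrative names.

```r
set.seed(42)

# Stand-in data: 6 N/S combinations with 8 rows each (Subject labels are made up)
toy <- data.frame(
  Subject = paste0("row", 1:48),
  N = rep(c(3, 1, 2), each = 16),
  S = rep(rep(1:2, each = 8), times = 3)
)

# Within each N/S combination, give every row a random draw number 1..8
toy$round <- ave(seq_len(nrow(toy)), toy$N, toy$S,
                 FUN = function(i) sample(length(i)))

# Random tiebreaker so the order of combinations also varies inside each round
toy$tiebreak <- runif(nrow(toy))

# Rows 1-6 are round 1 (one row per combination), rows 7-12 are round 2, ...
shuffled <- toy[order(toy$round, toy$tiebreak), c("Subject", "N", "S")]
rownames(shuffled) <- NULL
head(shuffled, 12)
```

Every row is used exactly once, and each consecutive block of six rows contains each combination once, provided all combinations have the same number of rows.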

How to select only the last row among the subset of rows satisfying a condition in R programming

The dataframe looks like this :
Customer_id A B C D E F G
10000001 1 1 2 3 1 3 1
10000001 1 2 3 1 2 1 3
10000002 2 2 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 5 2 4 7 2 4
10000003 1 5 2 6 3 7 2
10000003 1 1 2 2 1 2 1
10000004 1 2 3 1 2 3 1
10000004 1 3 2 3 1 3 2
10000004 1 3 2 1 3 2 1
10000004 1 4 1 4 1 3 1
10000006 1 2 3 4 5 1 2
10000006 1 3 1 4 1 2 1
10000008 2 3 2 3 2 1 2
10000008 2 3 1 1 2 1 2
10000008 1 3 1 1 2 2 1
There are multiple entries for each Customer_id. I need to create another data frame from this existing one, containing only the last row for every Customer_id. It should look like this:
10000001 1 2 3 1 2 1 3
10000002 2 2 1 4 2 3 1
10000003 1 1 2 2 1 2 1
10000004 1 4 1 4 1 3 1
10000006 1 3 1 4 1 2 1
10000008 1 3 1 1 2 2 1
Something like this (hard to code without the data in R format):
dataframe[ rev(!duplicated(rev(dataframe$Customer_id))),]
or better
dataframe[ !duplicated(dataframe$Customer_id,fromLast=TRUE),]
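Since the question's data was not posted in R format, here is a small runnable sketch of the fromLast approach. It uses only the first three customers and the columns A and B from the question, typed in by hand, so treat the data frame as an illustrative subset.

```r
# Hand-typed subset of the question's data (first three customers, columns A and B)
dataframe <- data.frame(
  Customer_id = c(10000001, 10000001, 10000002, 10000002,
                  10000003, 10000003, 10000003),
  A = c(1, 1, 2, 2, 1, 1, 1),
  B = c(1, 2, 2, 2, 5, 5, 1)
)

# duplicated(..., fromLast = TRUE) scans from the bottom, so the last row
# of each Customer_id is the one that is NOT flagged as a duplicate
last_rows <- dataframe[!duplicated(dataframe$Customer_id, fromLast = TRUE), ]
last_rows
```

The original row names (2, 4 and 7 here) are kept, which makes it easy to see which rows were selected.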
You can also use aggregate
aggregate(. ~ Customer_id, data = DF, FUN = tail, 1)
## Customer_id A B C D E F G
## 1 10000001 1 2 3 1 2 1 3
## 2 10000002 2 2 1 4 2 3 1
## 3 10000003 1 1 2 2 1 2 1
## 4 10000004 1 4 1 4 1 3 1
## 5 10000006 1 3 1 4 1 2 1
## 6 10000008 1 3 1 1 2 2 1
Assuming your data is named dat, here's one way using by and rbind, although the other two methods (aggregate and duplicated) are much nicer:
do.call(rbind, by(dat,dat$Customer_id,FUN=tail,1))
## Customer_id A B C D E F G
## 2 10000001 1 2 3 1 2 1 3
## 4 10000002 2 2 1 4 2 3 1
## 7 10000003 1 1 2 2 1 2 1
## 11 10000004 1 4 1 4 1 3 1
## 13 10000006 1 3 1 4 1 2 1
## 16 10000008 1 3 1 1 2 2 1
