R replace specific value in many columns across dataframe - r

I am mainly interested in replacing a specific value (81) in many columns across the dataframe.
For example, if this is my dataset
Id Date Col_01 Col_02 Col_03 Col_04
30 2012-03-31 1 A42.2 20.46 43
36 1996-11-15 42 V73 23 55
96 2010-02-07 X48 81 13 3R
40 2010-03-18 AD14 18.12 20.12 36
69 2012-02-21 8 22.45 12 10
11 2013-07-03 81 V017 78.12 81
22 2001-06-01 11 09 55 12
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12 1 4 67 12
34 2014-03-10 82.12 N72.22 V45.44 10
I would like to replace the value 81 in columns Col_01, Col_02, Col_03 and Col_04 with NA. The expected final dataset looks like this:
Id Date Col_01 Col_02 Col_03 Col_04
30 2012-03-31 1 A42.2 20.46 43
36 1996-11-15 42 V73 23 55
96 2010-02-07 X48 NA 13 3R
40 2010-03-18 AD14 18.12 20.12 36
69 2012-02-21 8 22.45 12 10
11 2013-07-03 NA V017 78.12 NA
22 2001-06-01 11 09 55 12
83 2005-03-16 80.45 V22.15 46.52 X29.11
92 2012-02-12 1 4 67 12
34 2014-03-10 82.12 N72.22 V45.44 10
I tried this approach
df %>% select(matches("^Col_\\d+$"))[ df %>% select(matches("^Col_\\d+$")) == 81 ] <- NA
This is modeled on the solution data[ , 2:3 ][ data[ , 2:3 ] == 4 ] <- 10 from the question "Replacing occurrences of a number in multiple columns of data frame with another value in R".
This did not work.
Any suggestions are much appreciated. Thanks in advance.

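Instead of using select, we can specify the column matches directly in across() inside mutate and use na_if to replace the values that are "81" with NA. The columns are character, so the value is compared as the string "81".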
library(dplyr)
df <- df %>%
mutate(across(matches("^Col_\\d+$"), ~ na_if(., "81")))
-output
df
Id Date Col_01 Col_02 Col_03 Col_04
1 30 2012-03-31 1 A42.2 20.46 43
2 36 1996-11-15 42 V73 23 55
3 96 2010-02-07 X48 <NA> 13 3R
4 40 2010-03-18 AD14 18.12 20.12 36
5 69 2012-02-21 8 22.45 12 10
6 11 2013-07-03 <NA> V017 78.12 <NA>
7 22 2001-06-01 11 09 55 12
8 83 2005-03-16 80.45 V22.15 46.52 X29.11
9 92 2012-02-12 1 4 67 12
10 34 2014-03-10 82.12 N72.22 V45.44 10
Or we can use base R
i1 <- grep("^Col_\\d+$", names(df))
df[i1][df[i1] == "81"] <- NA
The issue in the OP's code is that the assignment is not triggered as we expect. The extraction on its own works, i.e.
(df %>%
select(matches("^Col_\\d+$")))[(df %>%
select(matches("^Col_\\d+$"))) == "81" ]
[1] "81" "81" "81"
which is same as
df[i1][df[i1] == "81"]
[1] "81" "81" "81"
and not the assignment
(df %>%
select(matches("^Col_\\d+$")))[(df %>%
select(matches("^Col_\\d+$"))) == "81" ] <- NA
Error in (df %>% select(matches("^Col_\\d+$")))[(df %>% select(matches("^Col_\\d+$"))) == :
could not find function "(<-"
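In base R, df[i1][df[i1] == "81"] <- NA works because the assignment is carried out by the replacement function [<-. With the parenthesized pipe on the left-hand side, R looks for a replacement function named (<-, which does not exist.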
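As a rough illustration of what the nested base R assignment does, the sketch below spells it out in separate steps on a small toy data frame. The names toy, cols, mask and tmp, and the values, are made up for this example.

```r
# Toy data frame; names and values are illustrative only
toy <- data.frame(Id = c(1L, 2L, 3L),
                  Col_01 = c("81", "X48", "7"),
                  Col_02 = c("A1", "81", "81"),
                  stringsAsFactors = FALSE)

cols <- grep("^Col_\\d+$", names(toy))  # positions of the target columns
mask <- toy[cols] == "81"               # logical matrix, TRUE where the value is "81"

# toy[cols][mask] <- NA behaves roughly like these three steps
tmp <- toy[cols]      # extract the target columns
tmp[mask] <- NA       # `[<-` on the extracted copy
toy[cols] <- tmp      # `[<-` again to write the copy back
```

Only the matched columns are touched, so an Id that happened to equal 81 would be left alone.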
data
df <- structure(list(Id = c(30L, 36L, 96L, 40L, 69L, 11L, 22L, 83L,
92L, 34L), Date = c("2012-03-31", "1996-11-15", "2010-02-07",
"2010-03-18", "2012-02-21", "2013-07-03", "2001-06-01", "2005-03-16",
"2012-02-12", "2014-03-10"), Col_01 = c("1", "42", "X48", "AD14",
"8", "81", "11", "80.45", "1", "82.12"), Col_02 = c("A42.2",
"V73", "81", "18.12", "22.45", "V017", "09", "V22.15", "4", "N72.22"
), Col_03 = c("20.46", "23", "13", "20.12", "12", "78.12", "55",
"46.52", "67", "V45.44"), Col_04 = c("43", "55", "3R", "36",
"10", "81", "12", "X29.11", "12", "10")),
class = "data.frame", row.names = c(NA,
-10L))

We can also use replace:
library(dplyr)
df <- df %>%
mutate(across(matches("^Col_\\d+$"), ~ replace(.x, .x == 81, NA)))

Related

How to calculate the sum of periods over each column for each row in R

I would like to calculate the sum of each flower in each year in R. Below is an example of how the table looks (Table 1) and what I want the outcome to be (Table 2). I know how to do the code calculation in a long table format but I am not sure how to do it in a wide table format. Note: I am using package: dplyr
(Table 1)
flower 1902 1950 2010 2012 2021
lily 23 0 0 8 5
rose 50 60 5 16 0
daisy 30 7 10 2 0
I need to calculate the sum for each flower in each year. The end result should give me:
(Table 2)
flower 1902 1950 2010 2012 2021
lily 23 23 23 31 36
rose 50 110 115 131 131
daisy 30 37 47 49 49
One option involving dplyr and purrr might be:
library(dplyr)
library(purrr)
dat %>%
mutate(pmap_dfr(across(-1), ~ cumsum(c(...))))
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Using rowCumsums from matrixStats
library(matrixStats)
df1[-1] <- rowCumsums(as.matrix(df1[-1]))
-output
df1
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Here is one way of getting your expected result:
Your data frame :
dat <- structure(list(flower = c("lily", "rose", "daisy"), X1902 = c(23L,
50L, 30L), X1950 = c(0L, 60L, 7L), X2010 = c(0L, 5L, 10L), X2012 = c(8L,
16L, 2L), X2021 = c(5L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
Apply a function that calculates the cumulative sum to each row of the data, over columns 2 to 6:
dat[1:nrow(dat), 2:6] <- t(apply(dat[1:nrow(dat), 2:6], 1, function(x) cumsum(c(x))))
# The result
dat
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
@benson23 has kindly suggested the following simpler code to get the same result:
dat[, 2:6] <- t(apply(dat[,2:6], 1, cumsum))
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
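To see why the apply() calls above are wrapped in t(), here is a small sketch on a toy matrix (the object names and the y1..y3 column labels are invented for illustration). apply() over rows returns each row's result as a column, so the result has to be transposed back.

```r
# Two flowers, three periods; values taken from the first rows of the question
m <- matrix(c(23,  0, 0,
              50, 60, 5),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("lily", "rose"), c("y1", "y2", "y3")))

by_row <- apply(m, 1, cumsum)  # 3 x 2: each original row is now a column
res <- t(by_row)               # 2 x 3: back to one row per flower
```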
You can use apply with cumsum, then bind the result back to the first column. Passing df[1] (a one-column data frame) to cbind keeps the cumulative sums numeric rather than coercing them to character.
cbind(df[1], t(apply(df[-1], 1, cumsum)))
flower X1902 X1950 X2010 X2012 X2021
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49
Data
df <- structure(list(flower = c("lily", "rose", "daisy"), X1902 = c(23L,
50L, 30L), X1950 = c(0L, 60L, 7L), X2010 = c(0L, 5L, 10L), X2012 = c(8L,
16L, 2L), X2021 = c(5L, 0L, 0L)), class = "data.frame", row.names = c(NA,
-3L))
Here is an alternative using pivoting:
library(dplyr)
library(tidyr)
dat %>%
pivot_longer(-flower) %>%
group_by(flower) %>%
mutate(value = cumsum(value)) %>%
pivot_wider() %>%
ungroup()
flower X1902 X1950 X2010 X2012 X2021
<chr> <int> <int> <int> <int> <int>
1 lily 23 23 23 31 36
2 rose 50 110 115 131 131
3 daisy 30 37 47 49 49

Filling in multiple columns of missing data from another dataset

I have a data set that contains some missing values, which can be completed by merging with another dataset. My example:
This is the updated data set I am working with.
DF1
Name Paper Book Mug soap computer tablet coffee coupons
1 2 3 4 5 6 7 8 9
2 21 22 23 23 23 7 23 9
3 56 57 58 59 60 7 62 9
4 80.33333 81.33333 82.33333 83 83.66667 7 85 9
5 107.3333 108.3333 109.3333 110 110.6667 7 112 9
6 134.3333 135.3333 136.3333 137 137.6667 7 139 9
7 161.3333 162.3333 163.3333 164 164.6667
8 188.3333 189.3333 190.3333 191 191.6667 7 193 9
9 215.3333 216.3333 217.3333 218 218.6667 7 220 9
10 242.3333 243.3333 244.3333 245 245.6667 7 247 9
11 269.3333 270.3333 271.3333 272 272.6667 7 274 9
12 296.3333 297.3333 298.3333 299 299.6667
13 323.3333 324.3333 325.3333 326 326.6667 7 328 9
14 350.3333 351.3333 352.3333 353 353.6667 7 355 9
15 377.3333 378.3333 379.3333 380 380.6667
16 404.3333 405.3333 406.3333 407 407.6667 7 409 9
17 431.3333 432.3333 433.3333 434 434.6667 7 436 9
18 458.3333 459.3333 460.3333 461 461.6667 7 463 9
19 485.3333 486.3333 487.3333 488 488.6667
DF2
Name Paper Book Mug soap computer tablet coffee coupons
7 161.3333 162.3333 163.3333 164 164.6667 6 6 6
12 296.3333 297.3333 298.3333 299 299.6667 88 96 25
15 377.3333 378.3333 379.3333 380 380.6667 88 62 25
19 485.3333 486.3333 487.3333 488 488.6667 88 88 78
I want to get:
Name Paper Book Mug soap computer tablet coffee coupons
1 2 3 4 5 6 7 8 9
2 21 22 23 23 23 7 23 9
3 56 57 58 59 60 7 62 9
4 80.33333 81.33333 82.33333 83 83.66667 7 85 9
5 107.3333 108.3333 109.3333 110 110.6667 7 112 9
6 134.3333 135.3333 136.3333 137 137.6667 7 139 9
7 161.3333 162.3333 163.3333 164 164.6667 6 6 6
8 188.3333 189.3333 190.3333 191 191.6667 7 193 9
9 215.3333 216.3333 217.3333 218 218.6667 7 220 9
10 242.3333 243.3333 244.3333 245 245.6667 7 247 9
11 269.3333 270.3333 271.3333 272 272.6667 7 274 9
12 296.3333 297.3333 298.3333 299 299.6667 88 96 25
13 323.3333 324.3333 325.3333 326 326.6667 7 328 9
14 350.3333 351.3333 352.3333 353 353.6667 7 355 9
15 377.3333 378.3333 379.3333 380 380.6667 88 62 25
16 404.3333 405.3333 406.3333 407 407.6667 7 409 9
17 431.3333 432.3333 433.3333 434 434.6667 7 436 9
18 458.3333 459.3333 460.3333 461 461.6667 7 463 9
19 485.3333 486.3333 487.3333 488 488.6667 88 88 78
I have tried the following code:
DF1[,c(4:6)][is.na(DF1[,c(4:6)]<-DF2[,c(2:4)][match(DF1[,1],DF2[,1])]
[which(is.na(DF1[,c(4:6)]))]
One of the solutions using dplyr works if I omit the columns that are already complete. I am not sure whether this is due to my version of dplyr, which I updated last week.
Any help is greatly appreciated! Thanks!
We can do a left join and then coalesce the columns
library(dplyr)
DF1 %>%
left_join(DF2, by = c('NameVar')) %>%
transmute(NameVar, Var1, Var2,
Var3 = coalesce(Var3.x, Var3.y),
Var4 = coalesce(Var4.x, Var4.y),
Var5 = coalesce(Var5.x, Var5.y))
-output
# NameVar Var1 Var2 Var3 Var4 Var5
#1 Sub1 30 45 40 34 65
#2 Sub2 25 30 30 45 45
#3 Sub3 74 34 25 30 49
#4 Sub4 30 45 40 34 65
#5 Sub5 25 30 69 56 72
#6 Sub6 74 34 74 34 60
Or using data.table
library(data.table)
nm1 <- setdiff(intersect(names(DF1), names(DF2)), 'NameVar')
setDT(DF1)[DF2, (nm1) := Map(fcoalesce, mget(nm1),
mget(paste0("i.", nm1))), on = .(NameVar)]
data
DF1 <- structure(list(NameVar = c("Sub1", "Sub2", "Sub3", "Sub4", "Sub5",
"Sub6"), Var1 = c(30L, 25L, 74L, 30L, 25L, 74L), Var2 = c(45L,
30L, 34L, 45L, 30L, 34L), Var3 = c(40L, NA, NA, 40L, 69L, NA),
Var4 = c(34L, NA, NA, 34L, 56L, NA), Var5 = c(65L, NA, NA,
65L, 72L, NA)), class = "data.frame", row.names = c(NA, -6L
))
DF2 <- structure(list(NameVar = c("Sub2", "Sub3", "Sub6"), Var3 = c(30L,
25L, 74L), Var4 = c(45L, 30L, 34L), Var5 = c(45L, 49L, 60L)),
class = "data.frame", row.names = c(NA,
-3L))
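For comparison, the same "look up the matching row, then fill only the NAs" idea can be sketched in base R with match(). This is not one of the posted answers; toy1, toy2, idx and shared are hypothetical names, and the values are made up so that one non-missing cell (Var4 for Sub3) shows that existing data is not overwritten.

```r
toy1 <- data.frame(NameVar = c("Sub1", "Sub2", "Sub3"),
                   Var3 = c(40L, NA, NA),
                   Var4 = c(34L, NA, 30L))
toy2 <- data.frame(NameVar = c("Sub2", "Sub3"),
                   Var3 = c(30L, 25L),
                   Var4 = c(45L, 99L))

# Row of toy2 that corresponds to each row of toy1 (NA when there is none)
idx <- match(toy1$NameVar, toy2$NameVar)

# Columns present in both data frames, excluding the key
shared <- setdiff(intersect(names(toy1), names(toy2)), "NameVar")

for (col in shared) {
  is_na <- is.na(toy1[[col]])                       # cells that need filling
  toy1[[col]][is_na] <- toy2[[col]][idx[is_na]]     # take them from toy2
}
```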

Find the most recent value associated with multiple date columns

I have a dataframe that looks like this:
id date1 value1 date2 value2 date3 value3
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65
I want to create a new column that contains whichever value of the three columns "value1", "value2", and "value3" that is the most recent. I don't need to know which date it was associated with.
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Code to create working example:
set.seed(1234)
id <- sample(1000:2000, 6, replace=TRUE)
date1 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value1 <- sample(1:100, 6, replace=TRUE)
date2 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value2 <- sample(1:100, 6, replace=TRUE)
date3 <- sample(seq(as.Date('2012-01-01'), as.Date('2016-01-01'), by="day"), 6)
value3 <- sample(1:100, 6, replace=TRUE)
df <- data.frame(id, date1, value1, date2, value2, date3, value3)
Edit: Per @Pierre Lafortune's answer, you can actually collapse this into one statement.
Edit 2: Added in data with NAs, also changed code to handle NAs.
This should do the trick rather nicely. It does require a loop, and I would be interested to see if someone could come up with a concise vectorized solution.
date_cols <- colnames(df)[grep("date",colnames(df))]
df$value_recent<-df[cbind(1:nrow(df),grep("date",colnames(df))[apply(sapply(df[,date_cols],as.numeric),1,which.max)]+1)]
df
id date1 value1 date2 value2 date3 value3 value_recent
1 1113 <NA> 29 2012-09-29 22 2013-10-28 21 21
2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
3 1609 <NA> 30 2013-04-07 53 2013-03-20 100 53
4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
Data:
df<-structure(list(id = c(1113L, 1622L, 1609L, 1624L, 1861L, 1640L
), date1 = structure(c(NA, 15679, NA, 16090, 16351, 16134), class = "Date"),
value1 = c(29L, 93L, 30L, 84L, 29L, 27L), date2 = structure(c(15612,
15679, 15802, 15781, 15571, 15398), class = "Date"), value2 = c(22L,
82L, 53L, 92L, 84L, 5L), date3 = structure(c(16006, 15727,
15784, 16080, 15604, 16446), class = "Date"), value3 = c(21L,
26L, 100L, 81L, 56L, 65L)), .Names = c("id", "date1", "value1",
"date2", "value2", "date3", "value3"), row.names = c(NA, -6L), class = "data.frame")
I'm using apply to go over the rows looking for the most recent date. Then use that index to find the value that corresponds. We use a matrix subsetting method to keep it concise:
indx <- apply(df[grep("date", names(df))], 1, function(x) which(x == max(x))[1])
df$value_recent <- df[grep("val", names(df))][cbind(1:nrow(df), indx)]
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1113 2012-01-14 29 2012-09-29 22 2013-10-28 21 21
# 2 1622 2012-12-05 93 2012-12-05 82 2013-01-22 26 26
# 3 1609 2014-08-30 30 2013-04-07 53 2013-03-20 100 30
# 4 1624 2014-01-20 84 2013-03-17 92 2014-01-10 81 84
# 5 1861 2014-10-08 29 2012-08-19 84 2012-09-21 56 29
# 6 1640 2014-03-05 27 2012-02-28 5 2015-01-11 65 65
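The matrix subsetting used above relies on indexing with a two-column matrix of (row, column) pairs. A minimal sketch on a toy data frame, with made-up values and the hypothetical names toy, dates, values and latest:

```r
toy <- data.frame(id = 1:3,
                  date1 = as.Date(c("2012-01-14", "2014-08-30", "2013-01-01")),
                  value1 = c(29L, 30L, 7L),
                  date2 = as.Date(c("2013-10-28", "2013-04-07", "2013-01-01")),
                  value2 = c(21L, 53L, 9L))

# Dates as a numeric matrix (days since 1970-01-01), values as an integer matrix
dates <- sapply(toy[grep("^date", names(toy))], as.numeric)
values <- as.matrix(toy[grep("^value", names(toy))])

# Column position of the latest date in each row; ties go to the first column
latest <- apply(dates, 1, which.max)

# One (row, column) pair per row picks exactly one value per row
toy$value_recent <- values[cbind(seq_len(nrow(toy)), latest)]
```

This assumes the date and value columns appear in the same order, so that the n-th date column pairs with the n-th value column.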
(Note: arranging your data this way will create more trouble than good.)
There are probably less verbose ways to do this, but here's one option. First move it to a "long" format, then split it by id, sort, and extract the most recent record and merge that back in with the original data frame.
ld <- reshape(df,
idvar = "id",
varying = list(paste0("date", 1:3),
paste0("value", 1:3)),
v.names = c("date", "value"),
direction = "long")
recent <- split(ld, ld$id)
recent <- lapply(recent, function(x) {
d <- x[order(x$date), ]
d <- d[nrow(d), c(1, 4)]
names(d)[2] <- "value_recent"
d
})
recent <- do.call(rbind, recent)
merge(df, recent, by = "id")
# id date1 value1 date2 value2 date3 value3 value_recent
# 1 1204 2014-10-25 73 2012-12-22 39 2015-07-18 62 62
# 2 1667 2012-01-16 97 2014-02-28 30 2014-12-31 83 83
# 3 1673 2015-01-16 96 2014-12-16 50 2014-08-05 31 96
# 4 1722 2015-02-07 10 2013-12-25 4 2012-08-18 93 10
# 5 1882 2012-10-20 91 2014-12-28 71 2015-09-03 18 18
# 6 1883 2012-03-30 73 2015-04-26 4 2014-12-23 74 4
Here's a similar solution that also starts with reshape (the base R function, so no extra package is needed for it) but then does the rest in a series of pipes:
library(dplyr)
df2 <- reshape(df,
varying = list(names(df)[grep("date", names(df))],
names(df)[grep("value", names(df))]),
v.names = c("date", "value"),
direction = "long") %>%
# order data for step to come
arrange(id, date) %>%
# next two steps cut down to last (ordered) obs for each id
group_by(id) %>%
slice(n()) %>%
# keep only the columns we need and rename the value column for merging
select(id, most.recent = value) %>%
# merge the values back into the original data frame, matching on id
left_join(df, .)

easiest way to add missing rows in R

I have the following table that I generate via a table/cumsum command.
> temp
numCars
18 1
17 2
16 8
15 18
14 25
13 29
12 42
11 55
10 70
9 134
8 160
7 172
6 177
5 180
3 181
2 181
1 181
0 181
temp <- structure(c(1L, 2L, 8L, 18L, 25L, 29L, 42L, 55L, 70L, 134L, 160L,
172L, 177L, 180L, 181L, 181L, 181L, 181L), .Dim = c(18L, 1L), .Dimnames = list(
c("18", "17", "16", "15", "14", "13", "12", "11", "10", "9",
"8", "7", "6", "5", "3", "2", "1", "0"), "numCars"))
As you can see, the row named 4 is missing. What's the easiest way in R to fill it in, where its value should be that of the next lower row name (in this case 181)?
I understand I can do this with a messy for loop where I can go in, size it, create a new DF, then put in any blank values. I'm just wondering if there's a better way?
Here's the table code:
cohortSizeByMileage <- data.matrix(cumsum(rev(table(cleanMileage$OdometerBucket))))
colnames(cohortSizeByMileage) <- "numCars"
We first turn the row names of 'temp' into a column ('rn1'). A second dataset ('df2') holds every value between the maximum and minimum row name of 'temp'. We then merge or left_join the two datasets and fill the resulting NA elements using na.locf from library(zoo). Note that 'temp' is a matrix, so its first column is extracted with temp[, 1].
df1 <- data.frame(numCars=temp[, 1], rn1=as.numeric(row.names(temp)))
df2 <- data.frame(rn1= max(df1$rn1):min(df1$rn1))
library(dplyr)
library(zoo)
left_join(df2, df1, by = "rn1") %>%
mutate(numCars= na.locf(numCars,fromLast=TRUE ))
# rn1 numCars
#1 18 1
#2 17 2
#3 16 8
#4 15 18
#5 14 25
#6 13 29
#7 12 42
#8 11 55
#9 10 70
#10 9 134
#11 8 160
#12 7 172
#13 6 177
#14 5 180
#15 4 181
#16 3 181
#17 2 181
#18 1 181
#19 0 181
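If adding zoo is not an option, the same steps can be sketched in base R: build the full sequence of row names, look up the existing values with match(), and fill each gap from the next lower row name. The shortened matrix and the names small, have, full, vals and filled are assumptions made for this example.

```r
# Shortened version of 'temp' with the row named "4" missing
small <- matrix(c(1L, 180L, 181L, 181L), ncol = 1,
                dimnames = list(c("6", "5", "3", "2"), "numCars"))

have <- as.integer(rownames(small))
full <- max(have):min(have)               # 6 5 4 3 2
vals <- small[match(full, have), 1]       # NA where a row name is missing

# Walk upwards from the bottom so each NA takes the value of the row below it
for (i in rev(seq_along(vals))) {
  if (is.na(vals[i])) vals[i] <- vals[i + 1]
}

filled <- data.frame(rn1 = full, numCars = unname(vals))
```

Because the lowest row name always exists in the original table, the last element is never NA and the loop does not read past the end for a real gap.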

Releveling factor to facilitate use as nested factor in DESeq2 model in R

I am fitting a GLM using the DESeq2 package, and have the situation where individuals (RatIDs) are nested within the treatment (Diet). The author of the package suggests that the individuals be re-leveled from 1:N within each Diet (where N is the number of RatIDs within a specific Diet) rather than their original ID/factor level (DESeq2 vignette, page 35.)
The data looks something like this (there are actually more columns and rows, but omitted for simplicity):
Diet Extraction RatID
199 HAMSP 8 65
74 HAMS 9 108
308 HAMS 18 100
41 HAMSA 3 83
88 HAMSP 12 11
221 HAMSP 14 66
200 HAMSA 8 57
155 HAMSB 1 105
245 HAMSB 19 50
254 HAMS 21 90
182 HAMSB 4 4
283 HAMSA 23 59
180 HAMSP 4 22
71 HAMSP 9 112
212 HAMS 12 63
220 HAMSP 14 54
56 HAMS 7 81
274 HAMSP 1 11
114 HAMS 17 102
143 HAMSP 22 93
And here is a dput() output for the structure:
data = structure(list(Diet = structure(c(4L, 1L, 1L, 2L, 4L, 4L, 2L,
3L, 3L, 1L, 3L, 2L, 4L, 4L, 1L, 4L, 1L, 4L, 1L, 4L), .Label = c("HAMS",
"HAMSA", "HAMSB", "HAMSP", "LAMS"), class = "factor"), Extraction = c(8L,
9L, 18L, 3L, 12L, 14L, 8L, 1L, 19L, 21L, 4L, 23L, 4L, 9L, 12L,
14L, 7L, 1L, 17L, 22L), RatID = structure(c(61L, 7L, 3L, 76L,
9L, 62L, 52L, 6L, 46L, 81L, 37L, 54L, 20L, 12L, 59L, 50L, 74L,
9L, 4L, 84L), .Label = c("1", "10", "100", "102", "103", "105",
"108", "109", "11", "110", "111", "112", "113", "13", "14", "16",
"17", "18", "20", "22", "23", "24", "25", "26", "27", "28", "29",
"3", "30", "31", "32", "34", "35", "36", "37", "39", "4", "40",
"42", "43", "45", "46", "48", "49", "5", "50", "51", "52", "53",
"54", "55", "57", "58", "59", "6", "60", "61", "62", "63", "64",
"65", "66", "67", "68", "69", "70", "71", "73", "77", "78", "79",
"8", "80", "81", "82", "83", "85", "86", "88", "89", "90", "91",
"92", "93", "94", "95", "96", "98", "99"), class = "factor")), .Names = c("Diet",
"Extraction", "RatID"), row.names = c(199L, 74L, 308L, 41L, 88L,
221L, 200L, 155L, 245L, 254L, 182L, 283L, 180L, 71L, 212L, 220L,
56L, 274L, 114L, 143L), class = "data.frame")
Can someone please specify an elegant way to generate the new factor levels for RatIDs within Diet as an additional column of the above data.frame.
Could this be done with the roll function of data.table?
Desired output (done manually):
Diet Extraction RatID newCol
1 HAMSP 8 65 1
2 HAMS 9 108 1
3 HAMS 18 100 2
4 HAMSA 3 83 1
5 HAMSP 12 11 2
6 HAMSP 14 66 3
7 HAMSA 8 57 2
8 HAMSB 1 105 1
9 HAMSB 19 50 2
10 HAMS 21 90 3
11 HAMSB 4 4 3
12 HAMSA 23 59 3
13 HAMSP 4 22 4
14 HAMSP 9 112 5
15 HAMS 12 63 4
16 HAMSP 14 54 6
17 HAMS 7 81 5
18 HAMSP 1 11 2
19 HAMS 17 102 6
20 HAMSP 22 93 7
NOTE: There are not an equal number of Rats in each treatment. I'd also like the solution to not re-order the rows in the data (if possible).
EDIT: There is no 'natural' order to the RatIDs; as long as there is a 1:1 mapping within a diet, it's fine.
You can convert the 'RatID' to 'factor' and coerce it back to 'numeric'
library(data.table)#v1.9.4+
setDT(data)[, newCol:=as.numeric(factor(RatID,
levels=unique(RatID))), Diet]
# Diet Extraction RatID newCol
# 1: HAMSP 8 65 1
# 2: HAMS 9 108 1
# 3: HAMS 18 100 2
# 4: HAMSA 3 83 1
# 5: HAMSP 12 11 2
# 6: HAMSP 14 66 3
# 7: HAMSA 8 57 2
# 8: HAMSB 1 105 1
# 9: HAMSB 19 50 2
#10: HAMS 21 90 3
#11: HAMSB 4 4 3
#12: HAMSA 23 59 3
#13: HAMSP 4 22 4
#14: HAMSP 9 112 5
#15: HAMS 12 63 4
#16: HAMSP 14 54 6
#17: HAMS 7 81 5
#18: HAMSP 1 11 2
#19: HAMS 17 102 6
#20: HAMSP 22 93 7
Or use match
setDT(data)[, newCol:=match(RatID, unique(RatID)), Diet]
Or similar option with base R
data$newCol <- with(data, ave(as.numeric(levels(RatID))[RatID],
Diet, FUN=function(x) match(x, unique(x))))
Here is the as.numeric(factor(.)) trick implemented in dplyr:
require(dplyr)
data %>% group_by(Diet) %>% mutate(RatIDByDiet=as.numeric(factor(RatID)))
## Source: local data frame [20 x 4]
## Groups: Diet
##
## Diet Extraction RatID RatIDByDiet
## 1 HAMSP 8 65 5
## 2 HAMS 9 108 3
## 3 HAMS 18 100 1
## 4 HAMSA 3 83 3
## 5 HAMSP 12 11 1
## 6 HAMSP 14 66 6
## 7 HAMSA 8 57 1
## 8 HAMSB 1 105 1
## 9 HAMSB 19 50 3
## 10 HAMS 21 90 6
## 11 HAMSB 4 4 2
## 12 HAMSA 23 59 2
## 13 HAMSP 4 22 3
## 14 HAMSP 9 112 2
## 15 HAMS 12 63 4
## 16 HAMSP 14 54 4
## 17 HAMS 7 81 5
## 18 HAMSP 1 11 1
## 19 HAMS 17 102 2
## 20 HAMSP 22 93 7
And here is a solution that avoids going through factor(), if you want more control over how the numbering happens:
data %>% group_by(Diet) %>% mutate(RatIDByDiet=match(RatID, unique(RatID)))
## Source: local data frame [20 x 4]
## Groups: Diet
##
## Diet Extraction RatID RatIDByDiet
## 1 HAMSP 8 65 1
## 2 HAMS 9 108 1
## 3 HAMS 18 100 2
## 4 HAMSA 3 83 1
## 5 HAMSP 12 11 2
## 6 HAMSP 14 66 3
## 7 HAMSA 8 57 2
## 8 HAMSB 1 105 1
## 9 HAMSB 19 50 2
## 10 HAMS 21 90 3
## 11 HAMSB 4 4 3
## 12 HAMSA 23 59 3
## 13 HAMSP 4 22 4
## 14 HAMSP 9 112 5
## 15 HAMS 12 63 4
## 16 HAMSP 14 54 6
## 17 HAMS 7 81 5
## 18 HAMSP 1 11 2
## 19 HAMS 17 102 6
## 20 HAMSP 22 93 7
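The difference between the two numbering schemes shown above can be seen on a short vector. In this sketch the ids and diet vectors are made up: match(x, unique(x)) numbers values in order of first appearance, while as.integer(factor(x)) numbers them by sorted factor level, which for character IDs means alphabetical order.

```r
ids <- c("65", "11", "66", "11", "112")

by_appearance <- match(ids, unique(ids))     # 1 2 3 2 4
by_sorted_level <- as.integer(factor(ids))   # levels sort as "11" "112" "65" "66"

# The same first-appearance numbering within groups, using ave() as in the
# base R answer; 'diet' is a hypothetical grouping vector
diet <- c("P", "S", "P", "P", "S")
within_diet <- ave(as.integer(ids), diet,
                   FUN = function(x) match(x, unique(x)))
```

Either scheme gives a valid 1:1 mapping within a group; they only differ in which ID receives which number.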
