Change Column Random Numbers To Serial - r

In a data set, there is a specific column that has random values which repeat at a regular interval. I want to replace these with increasing values, as explained below.
Column_B has random data
Column_A Column_B
1.5 0
0.2 1
0.3 5
4.5 6
12.5 7
1.6 0
7.8 1
1.8 5
6.9 6
11.0 7
After the transformation, Column_B should have:
Column_A Column_B
1.5 0
0.2 1
0.3 2
4.5 3
12.5 4
1.6 0
7.8 1
1.8 2
6.9 3
11.0 4
Is there a faster way to do this than creating a new column and then replacing Column_B with it? Thanks.

You can use recycling to fill the column with a repeating sequence. For example, if you want the sequence to be 64 long before repeating, you can use
df1$Column_B <- 0:(64 - 1L)
(this relies on the number of rows being a multiple of the sequence length, so that recycling fills the column exactly). More generally, for patterns like your example in which each element within the repeating sequence is distinct, you can find how long the sequence is using which, then do the same thing:
seq.length <- which(df1$Column_B == df1$Column_B[1L])[2L] - 1L
df1$Column_B <- 0:(seq.length - 1L)

We group by the cumulative sum of 'Column_B == 0' (a new group starts at each 0, or wherever the next element decreases) and assign the within-group sequence of rows to 'Column_B':
library(data.table)
setDT(df1)[, Column_B := as.integer(seq_len(.N)-1), cumsum(Column_B==0)]
df1
# Column_A Column_B
# 1: 1.5 0
# 2: 0.2 1
# 3: 0.3 2
# 4: 4.5 3
# 5: 12.5 4
# 6: 1.6 0
# 7: 7.8 1
# 8: 1.8 2
# 9: 6.9 3
#10: 11.0 4
Or find the difference between adjacent elements of 'Column_B' and take the cumulative sum of where it drops to create the grouping variable:
setDT(df1)[, Column_B := as.integer(seq_len(.N)-1), cumsum(c(TRUE, diff(Column_B)< 0))]
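The same grouping idea also works in base R with ave(); a minimal sketch, assuming the data frame is named df1 as above:
# group rows by the running count of zeros in Column_B, then number each
# group from 0 upwards (assumes every repeating block starts at 0)
grp <- cumsum(df1$Column_B == 0)
df1$Column_B <- ave(df1$Column_B, grp, FUN = seq_along) - 1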

Related

R Data.Table: Dynamically Update a Different Column for each Row

I'm working on some code where I need to find the maximum value over a set of columns and then update that maximum value. Consider this toy example:
test <- data.table(thing1 = c('AAA','BBB','CCC','DDD','EEE'),
                   A = c(9, 5, 4, 2, 5),
                   B = c(2, 7, 2, 6, 3),
                   C = c(6, 2, 5, 4, 1),
                   ttl = c(1, 1, 3, 2, 1))
where the resulting data.table looks like this:
thing1  A  B  C  ttl
AAA     9  2  6    1
BBB     5  7  2    1
CCC     4  2  5    3
DDD     2  6  4    2
EEE     5  3  1    1
The goal is to find the column (A, B, or C) with the maximum value and replace that value by the current value minus 0.1 times the value in the ttl column (i.e. new_value=old_value - 0.1*ttl). The other columns (not containing the maximum value) should remain the same. The resulting DT should look like this:
thing1    A    B    C  ttl
AAA     8.9    2    6    1
BBB       5  6.9    2    1
CCC       4    2  4.7    3
DDD       2  5.8    4    2
EEE     4.9    3    1    1
The "obvious" way of doing this is to write a for loop and loop through each row of the DT. That's easy enough to do and is what the code I'm adapting this from did. However, the real DT is much larger than my toy example and the for loop takes some time to run, which is why I'm trying to adapt the code to take advantage of vectorization and get rid of the loop.
Here's what I have so far:
test[,max_position:=names(.SD)[apply(.SD,1,function(x) which.max(x))],.SDcols=(2:4)]
test[,newmax:=get(max_position)-ttl*.1,by=1:nrow(test)]
which produces this DT:
thing1  A  B  C  ttl  max_position  newmax
AAA     9  2  6    1  A                8.9
BBB     5  7  2    1  B                6.9
CCC     4  2  5    3  C                4.7
DDD     2  6  4    2  B                5.8
EEE     5  3  1    1  A                4.9
The problem comes in assigning the value of the newmax column back to where it needs to go. I naively tried this, along with some other things, which tells me that "'max_position' not found":
test[,(max_position):=newmax,by=1:nrow(test)]
It's straightforward to solve the problem by reshaping the DT, which is the solution I have in place for now (see below), but I worry that with my full DT two reshapes will be slow as well (though presumably better than the for loop). Any suggestions on how to make this work as intended?
Reshaping solution, for reference:
library(tidyr)  # for gather/spread
test[, max_position := names(.SD)[apply(.SD, 1, which.max)], .SDcols = 2:4]
test[, newmax := get(max_position) - ttl * .1, by = 1:nrow(test)]
test <- setDT(gather(test, idgroup, val, c(A, B, C)))
test[, maxval := max(val), by = 'thing1']
test[val == maxval, val := newmax][, maxval := NULL]
test <- setDT(spread(test, idgroup, val))
Staying with the OP's by-row approach, replace can do the update in one step:
test[, (2:4) := replace(.SD, which.max(.SD), max(.SD, na.rm = TRUE) - 0.1 * ttl),
     by = 1:nrow(test), .SDcols = 2:4]
Output:
> test
thing1 A B C ttl
1: AAA 8.9 2.0 6.0 1
2: BBB 5.0 6.9 2.0 1
3: CCC 4.0 2.0 4.7 3
4: DDD 2.0 5.8 4.0 2
5: EEE 4.9 3.0 1.0 1
In base R, this may be faster with row/column indexing
test1 <- as.data.frame(test)
m1 <- cbind(seq_len(nrow(test1)), max.col(test1[2:4], "first"))
test1[2:4][m1] <- test1[2:4][m1] - 0.1 * test1$ttl
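The same matrix-indexing idea can also stay inside data.table with set(), which avoids the by-row grouping. A sketch, run on a fresh copy of test and assuming columns 2:4 are the candidate columns as above:
cols <- names(test)[2:4]
idx  <- max.col(as.matrix(test[, ..cols]), ties.method = "first")
for (j in seq_along(cols)) {
  rows <- which(idx == j)                     # rows whose maximum sits in column j
  set(test, i = rows, j = cols[j],
      value = test[[cols[j]]][rows] - 0.1 * test$ttl[rows])
}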

Row Means based on Column Substring

I have a dataframe that looks like this:
df <- data.frame("CB_1.1"=c(0,5,6,2), "CB_1.16"=c(1,5,3,6), "HC_2.11"=c(3,3,4,5), "HC_1.12"=c(2,3,4,5), "HC_1.13"=c(1,0,0,5))
> df
CB_1.1 CB_1.16 HC_2.11 HC_1.12 HC_1.13
1 0 1 3 2 1
2 5 5 3 3 0
3 6 3 4 4 0
4 2 6 5 5 5
I would like to take the row-wise mean of columns whose names share the substring before the ".". The result would be a dataframe like this:
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
You'll notice that the column HC_2.11 values remain the same, because no other column has HC_2 in this dataframe.
Any help would be appreciated!
1) apply/tapply For each row, apply tapply to it with the name prefixes as the INDEX and mean as the function, then transpose the result. No packages are used.
prefix <- sub("\\..*", "", names(df))
t(apply(df, 1, tapply, prefix, mean))
giving this matrix (wrap it in data.frame(...) if you need a data frame result):
CB_1 HC_1 HC_2
[1,] 0.5 1.5 3
[2,] 5.0 1.5 3
[3,] 4.5 2.0 4
[4,] 4.0 5.0 5
2) lm Run the indicated regression. The +0 in the formula means don't add on an intercept. The transpose of the coefficients will be the required matrix, m. In the next line make the names nicer. prefix is from (1). No packages are used.
m <- t(coef(lm(t(df) ~ prefix + 0)))
colnames(m) <- sub("prefix", "", colnames(m))
m
giving this matrix
CB_1 HC_1 HC_2
[1,] 0.5 1.5 3
[2,] 5.0 1.5 3
[3,] 4.5 2.0 4
[4,] 4.0 5.0 5
This follows from the facts that (1) the model matrix X contains only ones and zeros and (2) distinct columns of it are orthogonal. The model matrix is shown here:
X <- model.matrix(~ prefix + 0) # model matrix
X
giving:
prefixCB_1 prefixHC_1 prefixHC_2
1 1 0 0
2 1 0 0
3 0 0 1
4 0 1 0
5 0 1 0
attr(,"assign")
[1] 1 1 1
attr(,"contrasts")
attr(,"contrasts")$prefix
[1] "contr.treatment"
Because the columns of the model matrix X are orthogonal the coefficient corresponding to any column for a particular row, y, of df (column of t(df)) is just sum(x * y) / sum(x * x) and since x is a 0/1 vector that equals the mean of the values of y corresponding to the 1's in x.
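To make that concrete, here is a quick numeric check of the identity for the first row of df, using the prefix and X objects defined above:
y <- as.numeric(df[1, ])        # first column of t(df), i.e. row 1 of df
x <- X[, "prefixCB_1"]          # 0/1 indicator for the CB_1 columns
sum(x * y) / sum(x * x)         # 0.5
mean(y[x == 1])                 # the same: mean of row 1's CB_1.* values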
3) stack/tapply Convert to long form inserting an id column at the same time. Then use tapply to convert back to wide form tapply-ing mean. No packages are used.
long <- transform(stack(df), ind = sub("\\..*", "", ind), id = c(row(df)))
with(long, tapply(values, long[c("id", "ind")], mean))
giving this table. Wrap it in as.data.frame.matrix if you want a data.frame.
ind
id CB_1 HC_1 HC_2
1 0.5 1.5 3
2 5.0 1.5 3
3 4.5 2.0 4
4 4.0 5.0 5
Here is a base R solution using rowMeans + split.default, i.e.,
dfout <- as.data.frame(Map(rowMeans, split.default(df,factor(s <- gsub("\\..*$","",names(df)), levels = unique(s)))))
such that
> dfout
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
If you do not mind the order of column names, you can use the shorter code below
dfout <- as.data.frame(Map(rowMeans,split.default(df,gsub("\\..*$","",names(df)))))
such that
> dfout
CB_1 HC_1 HC_2
1 0.5 1.5 3
2 5.0 1.5 3
3 4.5 2.0 4
4 4.0 5.0 5
One option involving dplyr and purrr could be:
library(dplyr)
library(purrr)
map_dfc(.x = unique(sub("\\..*$", "", names(df))),
        ~ df %>%
            transmute(!!.x := rowMeans(select(., starts_with(.x)))))
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
A base option could be:
#find column names splitting on "."
cols <- unique(sapply(strsplit(names(df),".", fixed = T), `[`, 1))
#loop through each column name and find the rowMeans
as.data.frame(sapply(cols, function (x) rowMeans(df[grep(x, names(df))])))
CB_1 HC_2 HC_1
1 0.5 3 1.5
2 5.0 3 1.5
3 4.5 4 2.0
4 4.0 5 5.0
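For completeness, a reshape-based sketch with dplyr and tidyr (assuming tidyr >= 1.0 for pivot_longer/pivot_wider and dplyr >= 1.0 for .groups): pivot to long form, average by prefix, and pivot back.
library(dplyr)
library(tidyr)
df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row, names_to = "name", values_to = "value") %>%
  mutate(prefix = sub("\\..*", "", name)) %>%
  group_by(row, prefix) %>%
  summarise(value = mean(value), .groups = "drop") %>%
  pivot_wider(names_from = prefix, values_from = value) %>%
  select(-row)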

How to subset data.table using keys [duplicate]

This question already has an answer here:
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 3 years ago.
I have a data.table and I'm trying to subset it so I get the lowest TEST_RESULT by ID and LAB_DT:
DT <- data.table::data.table(
  ID = c("1", "1", "1", "2", "2", "3", "3", "3"),
  LAB_DT = lubridate::as_date(c("1992-11-01", "1992-11-01", "1992-11-02",
                                "1992-11-04", "1992-11-04", "1992-11-06",
                                "1992-11-06", "1992-11-08")),
  TEST_RESULT = c(5.4, 5.8, 5.2, 5.6, 6, 6, 7, 8))
setkeyv(DT,c("ID","LAB_DT","TEST_RESULT"))
ID LAB_DT TEST_RESULT
1: 1 1992-11-01 5.4
2: 1 1992-11-01 5.8
3: 1 1992-11-02 5.2
4: 2 1992-11-04 5.6
5: 2 1992-11-04 6.0
6: 3 1992-11-06 6.0
7: 3 1992-11-06 7.0
8: 3 1992-11-08 8.0
I have already successfully done this via my own method:
DT[,FIRST.LAB_DT:=0]
DT[, FIRST.LAB_DT := c(1L, FIRST.LAB_DT[-1]), by = .(ID,LAB_DT)]
ID LAB_DT TEST_RESULT FIRST.LAB_DT
1: 1 1992-11-01 5.4 1
2: 1 1992-11-01 5.8 0
3: 1 1992-11-02 5.2 1
4: 2 1992-11-04 5.6 1
5: 2 1992-11-04 6.0 0
6: 3 1992-11-06 6.0 1
7: 3 1992-11-06 7.0 0
8: 3 1992-11-08 8.0 1
DT[FIRST.LAB_DT==1,]
ID LAB_DT TEST_RESULT FIRST.LAB_DT
1: 1 1992-11-01 5.4 1
2: 1 1992-11-02 5.2 1
3: 2 1992-11-04 5.6 1
4: 3 1992-11-06 6.0 1
5: 3 1992-11-08 8.0 1
However, the actual data.table I'm working with has ~8e6 rows and I recently found that you can subset data.table using the keys in a much faster manner.
Would anyone know how to produce the same final output using keys?
Since you report that your current approach works, your data is apparently sorted by increasing TEST_RESULT already. In this case, you can just drop duplicates to keep the first row per group:
unique(DT, by=c("ID", "LAB_DT"))
In general, if it is not sorted, you could roll from -Inf. Like your code, this selects only one row even if there are ties for the min value:
mDT = unique(DT[, .(ID, LAB_DT)])[, TEST_RESULT := -Inf]
DT[DT[mDT, on=names(mDT), roll=-Inf, which=TRUE]]
Finally, if you only have these three columns, you can just use the min function which is optimized for speed here (see ?GForce):
DT[, .(TEST_RESULT = min(TEST_RESULT)), by=.(ID, LAB_DT)]
Or, with dplyr:
library(dplyr)
DT %>%
group_by(ID,LAB_DT) %>%
slice(which.min(TEST_RESULT))
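For reference, another common data.table idiom for "first row per group" collects the row indices with .I; a sketch that relies on the key set above (rows already sorted by ID, LAB_DT, TEST_RESULT):
DT[DT[, .I[1L], by = .(ID, LAB_DT)]$V1]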

R sum consecutive duplicate rows and remove all but first

I am stuck with a probably simple question: how to sum consecutive duplicate rows and remove all but the first row. And, if there is an NA in between two duplicates (such as 2, NA, 2), also sum them and remove all but the first entry.
So far so good, here is my sample data
ia<-c(1,1,2,NA,2,1,1,1,1,2,1,2)
time<-c(4.5,2.4,3.6,1.5,1.2,4.9,6.4,4.4, 4.7, 7.3,2.3, 4.3)
a<-as.data.frame(cbind(ia, time))
sample output
a
ia time
1 1 4.5
2 1 2.4
3 2 3.6
4 NA 1.5
5 2 1.2
6 1 4.9
7 1 6.4
8 1 4.4
9 1 4.7
10 2 7.3
11 1 2.3
12 2 4.3
Now I want to
1.) sum the "time" column of consecutive ia's - i.e., sum the time if the number 1 occurs twice or more right after each other, in my case here sum first and second row of column time to 4.5+2.4.
2.) if there is a NA in between two numbers (ia column) which are the same (i.e., ia = 2, NA, 2), then also sum all of those times.
3.) keep only first occurence of the ia, and delete the rest.
In the end, I would want to have something like this:
a
ia time
1 1 6.9
3 2 6.3
6 1 20.4
10 2 7.3
11 1 2.3
12 2 4.3
I found this for summing, but it does not take into account the consecutive factor
aggregate(time~ia,data=a,FUN=sum)
and I found this for deleting
a[cumsum(rle(as.numeric(a[,1]))$lengths),]
although the rle approach keeps the last entry, and I would want to keep the first. I also have no idea how to handle the NAs.
If I have a pattern of 1-NA-2, then the NA should NOT be counted with either of them; in this case the NA row should be removed.
With data.table (as RHertel suggested for na.locf):
library(data.table)
library(zoo)
setDT(a)[na.locf(ia, fromLast=T)==na.locf(ia), sum(time), cumsum(c(T,!!diff(na.locf(ia))))]
# id V1
#1: 1 6.9
#2: 2 6.3
#3: 3 20.4
#4: 4 7.3
#5: 5 2.3
#6: 6 4.3
You first need to replace sequences of NAs with the values surrounding them (if they are the same). This answer shows zoo's na.locf function, which fills in NAs with the last observation. By testing whether it's the same when you carry values backwards or forwards, you can filter out the NAs you don't want, then do the carrying forward:
library(dplyr)
library(zoo)
a %>%
filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
mutate(ia = na.locf(ia))
#>    ia time
#> 1   1  4.5
#> 2   1  2.4
#> 3   2  3.6
#> 4   2  1.5
#> 5   2  1.2
#> 6   1  4.9
#> 7   1  6.4
#> 8   1  4.4
#> 9   1  4.7
#> 10  2  7.3
#> 11  1  2.3
#> 12  2  4.3
Now that you've fixed those NAs, you can group consecutive sets of values using cumsum. The full solution is:
result <- a %>%
filter(na.locf(ia) == na.locf(ia, fromLast = TRUE)) %>%
mutate(ia = na.locf(ia)) %>%
mutate(change = ia != lag(ia, default = FALSE)) %>%
group_by(group = cumsum(change), ia) %>%
summarise(time = sum(time))
result
#> Source: local data frame [6 x 3]
#> Groups: group [?]
#>
#> group ia time
#> (int) (dbl) (dbl)
#> 1 1 1 6.9
#> 2 2 2 6.3
#> 3 3 1 20.4
#> 4 4 2 7.3
#> 5 5 1 2.3
#> 6 6 2 4.3
If you want to get rid of the group column, use the additional lines:
result %>%
ungroup() %>%
select(-group)
df <- a   # this answer works on a copy of the question's data frame, named df
nas <- which(is.na(df$ia))
add.index <- sapply(nas, function(x) {
  logi <- which(as.logical(df$ia))
  aft  <- logi[logi > x][1]
  fore <- tail(logi[logi < x], 1)
  if (df$ia[aft] == df$ia[fore]) aft else NA
})
df$ia[nas] <- df$ia[add.index]
df <- df[complete.cases(df), ]
First we determine if the NA values of the column are surrounded by the same value. If yes, the surrounding value replaces the NA. There is no problem if the data has consecutive NA values.
Next we do a standard sum by group operation. cumsum allows us to create a unique group based on changes in the numbers.
df$grps <- cumsum(c(F, !df$ia[-length(df$ia)] == df$ia[-1]))+1
aggregate(time ~ grps, df, sum)
# grps time
# 1 1 6.9
# 2 2 6.3
# 3 3 20.4
# 4 4 7.3
# 5 5 2.3
# 6 6 4.3
This is a base R approach. With packages like dplyr, zoo, or data.table different options are available as they come built with specialized functions to do what we did here.
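A data.table version of the same pipeline can use rleid() to build the run groups; a sketch, assuming the sample data frame a from the question and no leading or trailing NAs in ia:
library(data.table)
library(zoo)
dt <- as.data.table(a)
dt <- dt[na.locf(ia) == na.locf(ia, fromLast = TRUE)]   # drop NAs not bridged by equal values
dt[, ia := na.locf(ia)]                                 # fill the bridging NAs
dt[, .(ia = ia[1L], time = sum(time)), by = rleid(ia)][, rleid := NULL][]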

Is there a way to replace column values on matching rows while joining one data.frame to another?

I'm currently working on incorporating elements of one data table with another "master" set, by some reference column. To make things clearer, I have created some sample data:
This is the dataset I am looking to join onto another "master set".
data.frame(refID = c(1,3,4,5,7,8), value = c(3.3,3.9,4.4,8.0,1.1,2.5))
refID value
1 3.3
3 3.9
4 4.4
5 8.0
7 1.1
8 2.5
The master set:
data.frame(refID = 1:9, value = rep(0,9))
refID value
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
I am basically looking to bring the column of values from the first data set into the second, and where there is no match, leave the value at 0. Ultimately, I am hoping to get:
Resulting Set:
refID value
1 3.3
2 0.0
3 3.9
4 4.4
5 8.0
6 0.0
7 1.1
8 2.5
9 0.0
I've played around with some stuff in the dplyr and data.table packages but can't seem to really pinpoint a good and direct way of doing it. Advice would be greatly appreciated, many thanks.
Using data.table, you can replace values from the first data.table on to the second by reference as follows:
require(data.table)
# data
DT1 = data.table(refID = c(1,3,4,5,7,8), value = c(3.3,3.9,4.4,8.0,1.1,2.5))
DT2 = data.table(refID = 1:9, value = 0)
setkey(DT2, refID)
DT2[DT1, value := i.value]
# refID value
# 1: 1 3.3
# 2: 2 0.0
# 3: 3 3.9
# 4: 4 4.4
# 5: 5 8.0
# 6: 6 0.0
# 7: 7 1.1
# 8: 8 2.5
# 9: 9 0.0
Please refer to this post for explanation.
The way I can think of would be to temporarily name the value columns v1 and v2, do a full_join so non-matching rows get NA, then compute a value column based on where the NAs are, i.e. something like:
library(dplyr)
df_1 <- data.frame(refID = c(1, 3, 4, 5, 7, 8), v1 = c(3.3, 3.9, 4.4, 8.0, 1.1, 2.5))
df_2 <- data.frame(refID = 1:9, v2 = 0)
df_merged <- df_1 %>%
  full_join(df_2, by = "refID") %>%
  mutate(value = ifelse(!is.na(v1), v1, v2)) %>%  # or just 0 in this case
  select(refID, value) %>%
  arrange(refID)  # unmatched refIDs come last after full_join, so sort
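A slightly shorter sketch starting from the master table, using the df_1 and df_2 objects above; coalesce() keeps the matched value and falls back to the master's 0:
df_2 %>%
  left_join(df_1, by = "refID") %>%
  mutate(value = coalesce(v1, v2)) %>%
  select(refID, value)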
