Replace values in a row if they match the last row in R

I have the below data frame in R:
df <- read.table(text = "
A B C D E
14 6 8 16 14
5 6 10 6 4
2 4 6 3 4
26 6 18 39 36
1 2 3 1 2
3 1 1 1 1
3 5 1 4 11
", header = TRUE)
Now, if the values in the last two rows of a column are the same, I need to replace those values with 0. Can anyone help me with this if it is doable in R?
For example:
The values in the last two rows of column 1 are both 3, so I need to replace 3 with 0.
The same applies to column 3: the values in its last two rows are both 1, so I need to replace 1 with 0.

You can compare the last two rows and replace the values in the columns where they are the same:
nr <- nrow(df)
df[(nr-1):nr, df[nr-1, ]==df[nr, ]] <- 0
df
# A B C D E
#1 14 6 8 16 14
#2 5 6 10 6 4
#3 2 4 6 3 4
#4 26 6 18 39 36
#5 1 2 3 1 2
#6 0 1 0 1 1
#7 0 5 0 4 11
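Here the column selector df[nr - 1, ] == df[nr, ] evaluates to the following, so only columns A and C of the last two rows are set to 0:
df[nr - 1, ] == df[nr, ]
#      A     B    C     D     E
# 6 TRUE FALSE TRUE FALSE FALSE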

One option is to loop through the columns, check whether the last two elements (tail(x, 2)) are duplicated, and if so replace them with 0, otherwise return the column unchanged, then assign the output back to the dataset. The [] makes sure that the data.frame structure stays intact.
df[] <- lapply(df, function(x) if (anyDuplicated(tail(x, 2)) > 0)
  replace(x, c(length(x) - 1, length(x)), 0) else x)
df
# A B C D E
#1 14 6 8 16 14
#2 5 6 10 6 4
#3 2 4 6 3 4
#4 26 6 18 39 36
#5 1 2 3 1 2
#6 0 1 0 1 1
#7 0 5 0 4 11

You could also do this:
r <- tail(df, 2)
r[,r[1,]==r[2,]] <- 0
df <- rbind(head(df, -2), r)

Related

How to remove a quantity from the last several rows?

I need to remove a quantity from my last rows until that quantity goes down to 0.
For instance, if the quantity to remove is 20, how can I remove it from my data frame:
data.frame(Time = c(1,2,3,4,5,6,7,8,9,10), Quantity = c(5,9,2,17,23,101,15,21,7,3))
So I have to subtract it from my last rows to obtain:
data.frame(Time = c(1,2,3,4,5,6,7,8), Quantity = c(5,9,2,17,23,101,15,11))
Should I try a while loop?
Using base R you could do something like:
del_val = function(val){
  a = cumsum(rev(df$Quantity))   # running total, counted from the last row backwards
  b = which(a > val)[1]          # first position (from the end) where the total exceeds val
  # drop the fully consumed rows, then put the leftover quantity in the new last row (column 2)
  replace(head(df, -b + 1), cbind(nrow(df) - b + 1, 2), a[b] - val)
}
del_val(20)
Time Quantity
1 1 5
2 2 9
3 3 2
4 4 17
5 5 23
6 6 101
7 7 15
8 8 11
del_val(9)
Time Quantity
1 1 5
2 2 9
3 3 2
4 4 17
5 5 23
6 6 101
7 7 15
8 8 21
9 9 1
We can write a function to remove the rows:
return_rows <- function(df, n) {
  vals <- cumsum(rev(df$Quantity))        # cumulative quantity, counted from the last row backwards
  inds <- nrow(df) - max(which(vals < n)) # index of the last row that is kept
  # subtract whatever still has to be removed after dropping the later rows
  df$Quantity[inds] <- df$Quantity[inds] - (n - vals[nrow(df) - inds])
  df[seq_len(inds), ]                     # keep only the rows up to that index
}
return_rows(df,20)
# Time Quantity
#1 1 5
#2 2 9
#3 3 2
#4 4 17
#5 5 23
#6 6 101
#7 7 15
#8 8 11
return_rows(df,40)
# Time Quantity
#1 1 5
#2 2 9
#3 3 2
#4 4 17
#5 5 23
#6 6 101
#7 7 6

How do I select rows in a data frame before and after a condition is met?

I've been searching the web for a few days now and I can't find a solution to my (probably easy to solve) problem.
I have huge data frames with 4 variables and over a million observations each. I want to select 100 rows before, all rows while, and 1000 rows after a specific condition is met, and fill the rest with NAs. I tried it with a for loop and if/ifelse but it doesn't work so far. It shouldn't be a big thing, but at the moment I just don't get the hang of it.
I create the data using:
foo<-data.frame(t = 1:15, a = sample(1:15), b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1), c = sample(1:15))
My Data looks like this:
ID t a b c
1 1 4 1 7
2 2 7 1 10
3 3 10 1 6
4 4 2 1 4
5 5 13 1 9
6 6 15 4 3
7 7 8 4 15
8 8 3 4 1
9 9 9 4 2
10 10 14 1 8
11 11 5 1 11
12 12 11 1 13
13 13 12 1 5
14 14 6 1 14
15 15 1 1 12
What I want is to pick the value of a (in this example) 2 rows before, all rows while, and 3 rows after the value of b is >1, and fill the rest with NAs. [Because this is just an example, you can imagine that after these 15 rows there are more rows with b switching between 1 and 4 several times; I did not post them, so I won't spam the question with unnecessary data.]
So I want to get something like:
ID t a b c d
1 1 4 1 7 NA
2 2 7 1 10 NA
3 3 10 1 6 NA
4 4 2 1 4 2
5 5 13 1 9 13
6 6 15 4 3 15
7 7 8 4 15 8
8 8 3 4 1 3
9 9 9 4 2 9
10 10 14 1 8 14
11 11 5 1 11 5
12 12 11 1 13 11
13 13 12 1 5 NA
14 14 6 1 14 NA
15 15 1 1 12 NA
I'm thankful for any help.
Here is the same approach as missuse's, but with data.table:
library(data.table)
foo <- data.frame(t = 1:11, a = sample(1:11), b = c(1,1,1,4,4,4,4,1,1,1,1), c = sample(1:11))
DT <- setDT(foo)
DT[unique(c(DT[, .I[b > 1]], DT[, .I[b > 1] + 3], DT[, .I[b > 1] - 2])), d := a]
t a b c d
1: 1 10 1 2 NA
2: 2 6 1 10 6
3: 3 5 1 7 5
4: 4 11 4 4 11
5: 5 4 4 9 4
6: 6 8 4 5 8
7: 7 2 4 8 2
8: 8 3 1 3 3
9: 9 7 1 6 7
10: 10 9 1 1 9
11: 11 1 1 11 NA
Here,
unique(c(DT[, .I[b > 1]], DT[, .I[b > 1] + 3], DT[, .I[b > 1] - 2]))
gives you your desired indices: the unique row indices matching your condition, the same indices + 3, and the same indices - 2.
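For the sample data above (b > 1 in rows 4 to 7), the pieces work out as follows, which is why only rows 1 and 11 keep d = NA:
DT[, .I[b > 1]]        # rows where the condition holds: 4 5 6 7
DT[, .I[b > 1] + 3]    # three rows after each of them:  7 8 9 10
DT[, .I[b > 1] - 2]    # two rows before each of them:   2 3 4 5
# unique(c(...)) therefore selects rows 2 through 10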
Here is an attempt.
Get the indexes that satisfy the condition b > 1:
z <- which(foo$b > 1)
Get the indexes for (z - 2):(z + 3):
ind <- unique(unlist(lapply(z, function(x){
  g <- pmax(x - 2, 1)  # guard against indexes below 1 when x - 2 is negative
  g:(x + 3)
})))
Create the d column filled with NA:
foo$d <- NA
Replace the elements at those indexes with foo$a:
foo$d[ind] <- foo$a[ind]
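With the sample b vector (b > 1 in rows 6 to 9), ind works out to rows 4 through 12, so d gets the values of foo$a there and stays NA elsewhere, matching the expected d column in the question:
ind
# [1]  4  5  6  7  8  9 10 11 12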
library(dplyr)
library(purrr)
# example dataset
foo <- data.frame(t = 1:15,
                  a = sample(1:15),
                  b = c(1,1,1,1,1,4,4,4,4,1,1,1,1,1,1),
                  c = sample(1:15))
# function to get indices of interest
# for a given index x go 2 positions back and 3 forward
# keep only positive indices
GetIDsBeforeAfter = function(x) {
  v = (x - 2):(x + 3)
  v[v > 0]
}
foo %>%                        # from your dataset
  filter(b > 1) %>%            # keep rows where b > 1
  pull(t) %>%                  # get the positions (here t equals the row number)
  map(GetIDsBeforeAfter) %>%   # for each position apply the function
  unlist() %>%                 # unlist all sets of indices
  unique() -> ids_to_remain    # keep unique ones and save them in a vector
foo$d = foo$c                  # copy column c as d
foo$d[-ids_to_remain] = NA     # set NA at all positions not in our vector
foo
# t a b c d
# 1 1 5 1 8 NA
# 2 2 6 1 14 NA
# 3 3 4 1 10 NA
# 4 4 1 1 7 7
# 5 5 10 1 5 5
# 6 6 8 4 9 9
# 7 7 9 4 15 15
# 8 8 3 4 6 6
# 9 9 7 4 2 2
# 10 10 12 1 3 3
# 11 11 11 1 1 1
# 12 12 15 1 4 4
# 13 13 14 1 11 NA
# 14 14 13 1 13 NA
# 15 15 2 1 12 NA

Give unique identifier to consecutive groupings

I'm trying to identify groups based on sequential numbers. For example, I have a dataframe that looks like this (simplified):
UID
1
2
3
4
5
6
7
11
12
13
15
17
20
21
22
And I would like to add a column that identifies groupings of consecutive numbers. For example, 1 to 7 is the first consecutive run, so those rows get 1; the second consecutive set gets 2, and so on.
UID Group
1 1
2 1
3 1
4 1
5 1
6 1
7 1
11 2
12 2
13 2
15 3
17 4
20 5
21 5
22 5
None of the existing code I have found helped me solve this issue.
Here is one base R method that uses diff, a logical check, and cumsum:
cumsum(c(1, diff(df$UID) > 1))
[1] 1 1 1 1 1 1 1 2 2 2 3 4 5 5 5
Adding this onto the data.frame, we get:
df$id <- cumsum(c(1, diff(df$UID) > 1))
df
UID id
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 11 2
9 12 2
10 13 2
11 15 3
12 17 4
13 20 5
14 21 5
15 22 5
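To see why this works, here is a quick look at the intermediate steps for the sample UID vector:
diff(df$UID)
# [1] 1 1 1 1 1 1 4 1 1 2 2 3 1 1
diff(df$UID) > 1   # TRUE wherever a new run starts
# [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE
# prepending 1 and taking cumsum() then numbers the runs 1, 2, 3, ...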
Or you can use dplyr as follows:
library(dplyr)
df %>% mutate(ID = cumsum(c(1, diff(UID) > 1)))
# UID ID
#1 1 1
#2 2 1
#3 3 1
#4 4 1
#5 5 1
#6 6 1
#7 7 1
#8 11 2
#9 12 2
#10 13 2
#11 15 3
#12 17 4
#13 20 5
#14 21 5
#15 22 5
We can also get the difference between the current row and the previous row using the shift function from data.table, take the cumulative sum of the logical vector, and assign (:=) the result to create the 'Group' column. This will be faster.
library(data.table)
setDT(df1)[, Group := cumsum(UID - shift(UID, fill = UID[1]) > 1) + 1]
df1
# UID Group
# 1: 1 1
# 2: 2 1
# 3: 3 1
# 4: 4 1
# 5: 5 1
# 6: 6 1
# 7: 7 1
# 8: 11 2
# 9: 12 2
#10: 13 2
#11: 15 3
#12: 17 4
#13: 20 5
#14: 21 5
#15: 22 5

Flag first by-group in R data frame

I have a data frame which looks like this:
id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21
I'd like to identify a way to flag the first occurrence of id -- similar to first. and last. in SAS. I've tried the !duplicated function, but I need to actually append the "flag" column to my data frame since I'm running it through a loop later on. I'd like to get something like this:
id score first_ind
1 15 1
1 18 0
1 16 0
2 10 1
2 9 0
3 8 1
3 47 0
3 21 0
> df$first_ind <- as.numeric(!duplicated(df$id))
> df
id score first_ind
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
You can find the edges using diff.
x <- read.table(text = "id score
1 15
1 18
1 16
2 10
2 9
3 8
3 47
3 21", header = TRUE)
x$first_id <- c(1, diff(x$id))
x
id score first_id
1 1 15 1
2 1 18 0
3 1 16 0
4 2 10 1
5 2 9 0
6 3 8 1
7 3 47 0
8 3 21 0
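Note that c(1, diff(x$id)) stores the raw difference rather than a strict 0/1 flag, so if id ever jumps by more than 1 between groups you will get values other than 1. A variant of the same idea that always yields 0/1 is:
x$first_id <- as.numeric(c(TRUE, diff(x$id) != 0))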
Using plyr:
library("plyr")
ddply(x,"id",transform,first=as.numeric(seq(length(score))==1))
Or if you prefer dplyr:
library(dplyr)
x %>% group_by(id) %>%
  mutate(first = c(1, rep(0, n() - 1)))
(although if you're operating completely in the plyr/dplyr framework you probably wouldn't need this flag variable anyway ...)
Another base R option:
df$first_ind <- ave(df$id, df$id, FUN = seq_along) == 1
df
# id score first_ind
#1 1 15 TRUE
#2 1 18 FALSE
#3 1 16 FALSE
#4 2 10 TRUE
#5 2 9 FALSE
#6 3 8 TRUE
#7 3 47 FALSE
#8 3 21 FALSE
This also works when the ids are unsorted. If you want 1/0 instead of TRUE/FALSE you can easily wrap it in as.integer(.).
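For example:
df$first_ind <- as.integer(ave(df$id, df$id, FUN = seq_along) == 1)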

Insert new columns based on the union of colnames of two data frames

I want to write an R function to insert many 0 vectors into an existing data.frame. Here is an example:
Data.frame 1
A B C D
1 1 3 4 5
2 4 5 6 7
3 4 5 6 2
4 4 55 2 3
Data.frame 2
A B E X
11 5 1 5 5
22 44 55 9 6
33 12 4 2 4
44 9 7 4 2
Based on the union of the two sets of column names (that is, A, B, C, D, E, X), I want to update the two data frames like this:
Data.frame 1 (new)
A B C D E X
1 1 3 4 5 0 0
2 4 5 6 7 0 0
3 4 5 6 2 0 0
4 4 55 2 3 0 0
Data.frame 2 (new)
A B C D E X
11 5 1 0 0 5 5
22 44 55 0 0 9 6
33 12 4 0 0 2 4
44 9 7 0 0 4 2
Thanks in advance.
Option 1 (Thanks #Jilber for the edits)
I'm assuming the order of the columns doesn't matter:
# columns of df2 that df1 lacks, as zero columns
df2part <- subset(df2, select = setdiff(colnames(df2), colnames(df1))) * 0
df1f <- cbind(df1, df2part)
# columns of df1 that df2 lacks, as zero columns
df1part <- subset(df1, select = setdiff(colnames(df1), colnames(df2))) * 0
df2f <- cbind(df2, df1part)
If the order really matters, then just reorder the columns
df2f <- df2f[, sort(names(df2f))]
Output
> df1f
A B C D E X
1 1 3 4 5 0 0
2 4 5 6 7 0 0
3 4 5 6 2 0 0
4 4 55 2 3 0 0
> df2f
A B C D E X
11 5 1 0 0 5 5
22 44 55 0 0 9 6
33 12 4 0 0 2 4
44 9 7 0 0 4 2
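Since the question asks for a function, the same idea can be wrapped up. Here is a minimal sketch (pad_missing_cols is just a name made up for illustration):
pad_missing_cols <- function(target, other) {
  # add, as zero columns, every column that 'other' has but 'target' lacks
  for (cn in setdiff(colnames(other), colnames(target))) target[[cn]] <- 0
  target
}
df1f <- pad_missing_cols(df1, df2)
df2f <- pad_missing_cols(df2, df1)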
Option 2 -
library(data.table)
df1 <- data.table(df1)
df2 <- data.table(df2)
df1names <- colnames(df1)
df2names <- colnames(df2)
df1[, setdiff(df2names, df1names) := 0]
df2[, setdiff(df1names, df2names) := 0]
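As in Option 1, the new columns are appended at the end of each table; if you also want a common column order (assuming alphabetical order is what you're after), you can reorder by reference:
setcolorder(df1, sort(names(df1)))
setcolorder(df2, sort(names(df2)))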
