How to change values in a dataframe based on a condition in R?

I have a data frame, myData, that looks something like this:
User X Y Similar
A    1 4      100
A    1 2      100
A    1 1      100
A    3 2       80
A    2 1       20
A    2 4      100
B    3 1       50
B    4 2       90
B    1 3      100
and I want it to look like this:
User X Y Similar
A    1 4        0
A    1 2        0
A    1 1        0
A    3 2       80
A    2 1       20
A    2 4      100
B    3 1       50
B    4 2       90
B    1 3        0
Question
I want to set the value in the Similar column to 0 when a condition holds: X equals 1 and Similar equals 100. How can I do that in R?
Thanks

We create a logical vector based on 'X' and 'Similar', and use that index to assign 0 to the matching elements of 'Similar':
i1 <- with(myData, X == 1 & Similar == 100)
myData$Similar[i1] <- 0
Output:
myData
# User X Y Similar
#1 A 1 4 0
#2 A 1 2 0
#3 A 1 1 0
#4 A 3 2 80
#5 A 2 1 20
#6 A 2 4 100
#7 B 3 1 50
#8 B 4 2 90
#9 B 1 3 0
data
myData <- structure(list(User = c("A", "A", "A", "A", "A", "A", "B", "B",
"B"), X = c(1L, 1L, 1L, 3L, 2L, 2L, 3L, 4L, 1L), Y = c(4L, 2L,
1L, 2L, 1L, 4L, 1L, 2L, 3L), Similar = c(100L, 100L, 100L, 80L,
20L, 100L, 50L, 90L, 100L)), class = "data.frame", row.names = c(NA,
-9L))
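If you prefer dplyr, a minimal sketch of the same replacement (assuming dplyr is available) is:
library(dplyr)
# set Similar to 0 wherever X == 1 and Similar == 100, otherwise keep it
myData <- myData %>%
  mutate(Similar = ifelse(X == 1 & Similar == 100, 0, Similar))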

Related

Sum column values over a window and report the values of the previous window

I have a data.frame of the following form:
ID Var1
1 1
1 1
1 3
1 4
1 1
1 0
2 2
2 2
2 6
2 7
2 8
2 0
3 0
3 2
3 1
3 3
3 2
3 4
and I would like to get to this:
ID Var1 X
1 1 0
1 1 0
1 3 0
1 4 5
1 1 5
1 0 5
2 2 0
2 2 0
2 6 0
2 7 10
2 8 10
2 0 10
3 0 0
3 2 0
3 1 0
3 3 3
3 2 3
3 4 3
So, in words: I'd like to calculate the sum of the variable over a window of 3 rows and then report the result obtained in the previous window. This should happen within each ID, and thus the first three observations of every ID should get 0, as there is no previous window that could be reported.
For context: in the actual dataset each row corresponds to one week and the window = 7, so X is supposed to give information on the sum of Var1 in the previous week.
I have tried some rollapply approaches, but I always ended up with an error, and as far as I understand rollapply uses a rolling window, which is specifically not what I need.
Thanks for your answers!
In rollapply, the width argument can be a list which provides the offsets to use. In this case we want to use the points 3, 2 and 1 back for the first point, 4, 3 and 2 back for the second, 5, 4 and 3 back for the third and then recycle. That is, for a window width of k = 3 we would want the following list of offset vectors:
w <- list(-(3:1), -(4:2), -(5:3))
In general we can write w below in terms of the window width k. ave then invokes rollapply with that width list for each ID.
library(zoo)
k <- 3
w <- lapply(1:k, function(x) seq(to = -x, length.out = k))
transform(DF, X = ave(Var1, ID, FUN = function(x) rollapply(x, w, sum, fill = 0)))
giving:
ID Var1 X
1 1 1 0
2 1 1 0
3 1 3 0
4 1 4 5
5 1 1 5
6 1 0 5
7 2 2 0
8 2 2 0
9 2 6 0
10 2 7 10
11 2 8 10
12 2 0 10
13 3 0 0
14 3 2 0
15 3 1 0
16 3 3 3
17 3 2 3
18 3 4 3
Note
The input DF in reproducible form is:
DF <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)),
class = "data.frame", row.names = c(NA, -18L))
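For the real data described in the question, where each row is a week and the window is 7, only k needs to change; a sketch under that assumption:
library(zoo)
k <- 7
w <- lapply(1:k, function(x) seq(to = -x, length.out = k))
transform(DF, X = ave(Var1, ID, FUN = function(x) rollapply(x, w, sum, fill = 0)))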
We could group by 'ID', create another grouping column with a window size of 3 using gl, then get the summarised output by taking the sum of 'Var1' while putting 'Var1' into a list column, take the lag of 'X', and unnest:
library(dplyr) # 1.0.0
library(tidyr)
df1 %>%
  # // grouping by ID
  group_by(ID) %>%
  # // create another group added with gl
  group_by(grp = as.integer(gl(n(), 3, n())), .add = TRUE) %>%
  # // get the sum of Var1, while changing the Var1 to a list
  summarise(X = sum(Var1), Var1 = list(Var1)) %>%
  # // get the lag of X
  mutate(X = lag(X, default = 0)) %>%
  # // unnest the list column
  unnest(c(Var1)) %>%
  select(names(df1), X)
# A tibble: 18 x 3
# Groups: ID [3]
# ID Var1 X
# <int> <int> <dbl>
# 1 1 1 0
# 2 1 1 0
# 3 1 3 0
# 4 1 4 5
# 5 1 1 5
# 6 1 0 5
# 7 2 2 0
# 8 2 2 0
# 9 2 6 0
#10 2 7 10
#11 2 8 10
#12 2 0 10
#13 3 0 0
#14 3 2 0
#15 3 1 0
#16 3 3 3
#17 3 2 3
#18 3 4 3
data
df1 <- structure(list(ID = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L), Var1 = c(1L, 1L, 3L, 4L, 1L,
0L, 2L, 2L, 6L, 7L, 8L, 0L, 0L, 2L, 1L, 3L, 2L, 4L)), class = "data.frame",
row.names = c(NA,
-18L))
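A base R sketch of the same block logic, without zoo (block_prev_sum is a made-up helper name, assuming non-overlapping blocks of size k):
block_prev_sum <- function(v, k = 3) {
  grp  <- ceiling(seq_along(v) / k)   # block index within one ID
  sums <- tapply(v, grp, sum)         # sum of each block
  prev <- c(0, head(sums, -1))        # previous block's sum, 0 for the first block
  rep(prev, times = tabulate(grp))    # expand back to one value per row
}
df1$X <- ave(df1$Var1, df1$ID, FUN = block_prev_sum)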

R aggregate() function: Sum and show missing values = 0

I want to sum the "value" column by group1 and by group2.
group2 can range from 1 to 5.
If there is no entry for group2, the sum should be 0.
Data:
group1 group2 value
a 1 100
a 2 200
a 3 300
b 1 10
b 2 20
I am using
aggregate(data$value, by = list(data$group1, data$group2), FUN = sum)
which gives
group1 group2 value
a 1 100
a 2 200
a 3 300
b 1 10
b 2 20
However, the result should look like
group1 group2 value
a 1 100
a 2 200
a 3 300
a 4 0
a 5 0
b 1 10
b 2 20
b 3 0
b 4 0
b 5 0
How can I address this using the aggregate function in R?
Thank you!
We can use complete from tidyr to complete missing combinations.
library(dplyr)
library(tidyr)
df %>%
  group_by(group1, group2) %>%
  summarise(value = sum(value)) %>%
  complete(group2 = 1:5, fill = list(value = 0))
# group1 group2 value
# <fct> <int> <dbl>
# 1 a 1 100
# 2 a 2 200
# 3 a 3 300
# 4 a 4 0
# 5 a 5 0
# 6 b 1 10
# 7 b 2 20
# 8 b 3 0
# 9 b 4 0
#10 b 5 0
data
df <- structure(list(group1 = structure(c(1L, 1L, 1L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), group2 = c(1L, 2L, 3L, 1L, 2L), value = c(100L,
200L, 300L, 10L, 20L)), class = "data.frame", row.names = c(NA, -5L))
You of course need to tell R that group2 can range from 1 to 5. It is easiest to merge the data with a matching expand.grid and then aggregate, using with:
with(merge(expand.grid(group1 = c("a", "b"), group2 = 1:5, value = 0), data, all = TRUE),
     aggregate(value, by = list(group1, group2), FUN = sum))
# Group.1 Group.2 x
# 1 a 1 100
# 2 b 1 10
# 3 a 2 200
# 4 b 2 20
# 5 a 3 300
# 6 b 3 0
# 7 a 4 0
# 8 b 4 0
# 9 a 5 0
# 10 b 5 0
Data:
data <- structure(list(group1 = c("a", "a", "a", "b", "b"), group2 = c(1L,
2L, 3L, 1L, 2L), value = c(100L, 200L, 300L, 10L, 20L)), row.names = c(NA,
-5L), class = "data.frame")
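Another base R sketch: make group2 a factor with levels 1:5, so xtabs sums value over all combinations and fills the empty cells with 0 (this assumes the data frame from this answer):
data$group2 <- factor(data$group2, levels = 1:5)
out <- as.data.frame(xtabs(value ~ group1 + group2, data = data), responseName = "value")
out[order(out$group1, out$group2), ]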

Subsetting and repetition of rows in a dataframe using R

Suppose we have the following data with column names "id", "time" and "x":
df<-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(20L, 6L, 7L, 11L, 13L, 2L, 6L),
x = c(1L, 1L, 0L, 1L, 1L, 1L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Each id has multiple observations for time and x. I want to extract the last observation for each id and form a new data frame in which these observations are repeated according to the number of observations per id in the original data. I am able to extract the last observation for each id using the following code
library(dplyr)
df <- df %>%
  group_by(id) %>%
  filter((x == 0 & row_number() == n()) | (x == 1 & row_number() == n()))
What is left unresolved is the repetition aspect. The expected output would look like
df <-
structure(
list(
id = c(1L, 1L, 1L, 2L, 2L, 3L, 3L),
time = c(7L, 7L, 7L, 13L, 13L, 6L, 6L),
x = c(0L, 0L, 0L, 1L, 1L, 0L, 0L)
),
.Names = c("id", "time", "x"),
class = "data.frame",
row.names = c(NA,-7L)
)
Thanks for your help in advance.
We can use ave to find the max row number for each ID and subset it from the data frame.
df[ave(1:nrow(df), df$id, FUN = max), ]
# id time x
#3 1 7 0
#3.1 1 7 0
#3.2 1 7 0
#5 2 13 1
#5.1 2 13 1
#7 3 6 0
#7.1 3 6 0
You can do this by using last() to grab the last row within each id.
df %>%
  group_by(id) %>%
  mutate(time = last(time),
         x = last(x))
Because last(x) returns a single value, it gets expanded out to fill all the rows in the mutate() call.
This can also be applied to an arbitrary number of variables using mutate_at:
df %>%
  group_by(id) %>%
  mutate_at(vars(-id), ~ last(.))
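In dplyr 1.0 and later, mutate_at() is superseded by across(); roughly the same idea (grouping columns are excluded automatically) would be:
df %>%
  group_by(id) %>%
  mutate(across(everything(), last))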
slice will be your friend in the tidyverse I reckon:
df %>%
  group_by(id) %>%
  slice(rep(n(), n()))
## A tibble: 7 x 3
## Groups: id [3]
# id time x
# <int> <int> <int>
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
In data.table, you could also use the mult= argument of a join:
library(data.table)
setDT(df)
df[df[,.(id)], on="id", mult="last"]
# id time x
#1: 1 7 0
#2: 1 7 0
#3: 1 7 0
#4: 2 13 1
#5: 2 13 1
#6: 3 6 0
#7: 3 6 0
And in base R, a merge will get you there too:
merge(df["id"], df[!duplicated(df$id, fromLast=TRUE),])
# id time x
#1 1 7 0
#2 1 7 0
#3 1 7 0
#4 2 13 1
#5 2 13 1
#6 3 6 0
#7 3 6 0
Using data.table you can try
library(data.table)
setDT(df)[, .(time = rep(time[.N], .N), x = rep(x[.N], .N)), by = id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0
Following @thelatemai, to avoid naming the columns you can also try
df[, .SD[rep(.N,.N)], by=id]
id time x
1: 1 7 0
2: 1 7 0
3: 1 7 0
4: 2 13 1
5: 2 13 1
6: 3 6 0
7: 3 6 0

Ordering rows and columns of R Matrix by criteria

I have a matrix in R like this:
A B C D E F
A 2 5 0 1 3 6
B 5 0 0 1 5 9
C 0 0 0 0 0 1
D 6 1 1 3 4 4
E 3 1 5 2 1 6
F 0 0 1 1 7 9
mat = structure(c(2L, 5L, 0L, 6L, 3L, 0L, 5L, 0L, 0L, 1L, 1L, 0L, 0L,
0L, 0L, 1L, 5L, 1L, 1L, 1L, 0L, 3L, 2L, 1L, 3L, 5L, 0L, 4L, 1L,
7L, 6L, 9L, 1L, 4L, 6L, 9L), .Dim = c(6L, 6L), .Dimnames = list(
c("A", "B", "C", "D", "E", "F"), c("A", "B", "C", "D", "E",
"F")))
The matrix is not symmetric.
I want to reorder the rows and columns according to the following criteria:
NAME TYPE
A Dog
B Cat
C Cat
D Other
E Cat
F Dog
crit = structure(list(NAME = c("A", "B", "C", "D", "E", "F"), TYPE = c("Dog",
"Cat", "Cat", "Other", "Cat", "Dog")), .Names = c("NAME", "TYPE"
), row.names = c(NA, -6L), class = "data.frame")
I am trying to get the matrix rows and columns to be re-ordered, so that each category is grouped together:
A F B C E D
A
F
B
C
E
D
I am unable to find any reasonable way of doing this.
In case it matters, or makes things simpler, I can get rid of the category 'Other' and just stick with 'Cat' and 'Dog'.
I need this re-ordering to be done programmatically, as the matrix is quite big.
In base R, just index by order:
mat[order(crit$TYPE), order(crit$TYPE)]
#
# B C E A F D
# B 0 0 5 5 9 1
# C 0 0 0 0 1 0
# E 1 5 1 3 6 2
# A 5 0 3 2 6 1
# F 0 1 7 0 9 1
# D 1 1 4 6 4 3
It orders on an alphabetical sort of crit$TYPE, so Cat (B, C, and E) comes before Dog (A and F). If you want to set the order, use factor levels:
mat[order(factor(crit$TYPE, levels = c('Dog', 'Cat', 'Other'))),
order(factor(crit$TYPE, levels = c('Dog', 'Cat', 'Other')))]
#
# A F B C E D
# A 2 6 5 0 3 1
# F 0 9 0 1 7 1
# B 5 9 0 0 5 1
# C 0 1 0 0 0 0
# E 3 6 1 5 1 2
# D 6 4 1 1 4 3
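Since the same permutation is needed for both rows and columns, it may be cleaner to compute it once and reuse it; a small sketch under the same level order as above:
type_order <- factor(crit$TYPE, levels = c("Dog", "Cat", "Other"))
ord <- order(type_order)   # Dogs first, then Cats, then Other
mat[ord, ord]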

Deleting Rows per ID when value gets greater than... minus 2

I have the following data frame
id<-c(1,1,1,1,1,1,1,1,2,2,2,2,3,3,3,3)
time<-c(0,1,2,3,4,5,6,7,0,1,2,3,0,1,2,3)
value<-c(1,1,6,1,2,0,0,1,2,6,2,2,1,1,6,1)
d<-data.frame(id, time, value)
The value 6 appears only once for each id. For every id, I would like to remove all rows after the line with the value 6, except the first two lines coming after it.
I've searched and found a similar problem, but I couldn't adapt the solution myself. I therefore used the code from that thread.
In the above case the final data frame should be
id time value
1 0 1
1 1 1
1 2 6
1 3 1
1 4 2
2 0 2
2 1 6
2 2 2
2 3 2
3 0 1
3 1 1
3 2 6
3 3 1
One of the solutions given seems to get very close to what I need, but I didn't manage to adapt it. Could you help me?
library(plyr)
ddply(d, "id",
      function(x) {
        if (any(x$value == 6)) {
          subset(x, time <= x[x$value == 6, "time"])
        } else {
          x
        }
      }
)
Thank you very much.
We could use data.table. Convert the 'data.frame' to a 'data.table' (setDT(d)). Grouped by the 'id' column, we get the position of the 'value' that is equal to 6 and add 2 to it, take the min of that and the number of rows in the group (.N), build a seq up to that point, and use it to subset the group (.SD). We also add an if/else condition to check whether there is any 6 in the 'value' column, and otherwise return .SD without any subsetting.
library(data.table)
setDT(d)[, if(any(value==6)) .SD[seq(min(c(which(value==6) + 2, .N)))]
else .SD, by = id]
# id time value
# 1: 1 0 1
# 2: 1 1 1
# 3: 1 2 6
# 4: 1 3 1
# 5: 1 4 2
# 6: 2 0 2
# 7: 2 1 6
# 8: 2 2 2
# 9: 2 3 2
#10: 3 0 1
#11: 3 1 1
#12: 3 2 6
#13: 3 3 1
#14: 4 0 1
#15: 4 1 2
#16: 4 2 5
Or, as @Arun mentioned in the comments, we can use head to subset, which would be faster:
setDT(d)[, if(any(value==6)) head(.SD, which(value==6L)+2L) else .SD, by = id]
Or using dplyr, we group by 'id', get the position of the 'value' 6 with which, add 2, build the seq, and use that numeric index within slice to extract the rows:
library(dplyr)
d %>%
  group_by(id) %>%
  slice(seq(which(value == 6) + 2))
# id time value
#1 1 0 1
#2 1 1 1
#3 1 2 6
#4 1 3 1
#5 1 4 2
#6 2 0 2
#7 2 1 6
#8 2 2 2
#9 2 3 2
#10 3 0 1
#11 3 1 1
#12 3 2 6
#13 3 3 1
#14 4 0 1
#15 4 1 2
#16 4 2 5
data
d <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 4L, 4L, 4L), time = c(0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 0L, 1L, 2L, 3L, 0L, 1L, 2L), value = c(1L, 1L, 6L, 1L,
2L, 2L, 6L, 2L, 2L, 1L, 1L, 6L, 1L, 1L, 2L, 5L)), .Names = c("id",
"time", "value"), class = "data.frame", row.names = c(NA, -16L))
