Assign groups with preassigned conditions in R

I have the following data frame (df1):
vid col1
103 9
103 3
103 7
103 6
104 7
104 8
104 9
105 6
105 8
106 8
106 9
106 4
106 6
I have another data frame (df2):
vid col1
103 0
104 1
105 5
106 3
I want to assign groups to df1 so that group IDs are conditional on df2 based on vid.
Namely, I would like the following manipulation on df1; the desired output is:
vid col1 col2
103 9 0
103 3 0
103 7 0
103 6 0
104 7 1
104 8 1
104 9 1
105 6 5
105 8 5
106 8 3
106 9 3
106 4 3
106 6 3
I was trying the following:
df1 <- cbind(df1, 0)
for (i in 1:nrow(df1)) {
  for (j in 1:nrow(df2)) {
    if (df1[i, 1] == df2[j, 1]) {
      df1[i, 3] = df2[j, 2]
    } else {
      df1[i, 3] = NA
    }
  }
}
But this doesn't seem to work. Could anyone help me with this, please? Thanks!
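For what it's worth, the loop fails because the inner loop keeps running after a match: on every later iteration where the vids differ, the else branch overwrites the value just written with NA. A minimal sketch of a fix, using small hypothetical versions of df1 and df2:

```r
df1 <- data.frame(vid = c(103, 103, 104, 105), col1 = c(9, 3, 7, 6))
df2 <- data.frame(vid = c(103, 104, 105), col1 = c(0, 1, 5))

# Drop the else branch and break out of the inner loop at the first
# match, so a later non-matching row of df2 cannot overwrite the value.
df1 <- cbind(df1, NA_real_)
for (i in 1:nrow(df1)) {
  for (j in 1:nrow(df2)) {
    if (df1[i, 1] == df2[j, 1]) {
      df1[i, 3] <- df2[j, 2]
      break
    }
  }
}
```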

You can use merge() to join the two data frames on the vid column:
merge(df1, df2, by = 'vid')
Since both data frames have a column named col1, merge() suffixes the duplicated names with .x and .y; the output would be:
vid col1.x col1.y
103 9 0
103 3 0
103 7 0
103 6 0
104 7 1
104 8 1
104 9 1
105 6 5
105 8 5
106 8 3
106 9 3
106 4 3
106 6 3
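If you want to keep df1's original row order and avoid the suffixed column names, a match()-based lookup is a common alternative; a small sketch with hypothetical data:

```r
df1 <- data.frame(vid = c(103, 103, 104, 106), col1 = c(9, 3, 7, 8))
df2 <- data.frame(vid = c(103, 104, 105, 106), col1 = c(0, 1, 5, 3))

# match() returns, for each vid in df1, the index of that vid in df2;
# indexing df2$col1 with it pulls the group value over row by row.
df1$col2 <- df2$col1[match(df1$vid, df2$vid)]
df1
#   vid col1 col2
# 1 103    9    0
# 2 103    3    0
# 3 104    7    1
# 4 106    8    3
```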

First, create a vector; this will become the third column of your final data frame.
Vec <- vector()
I am assuming here that you have only four IDs:
df2$vid
[1] 103 104 105 106
Now fill the third column while traversing all rows of the first data frame, df1.
for (i in 1:nrow(df1)) {
  if (df1[i, 1] == 103) { Vec[i] <- df2[df2$vid == 103, 2] }
  if (df1[i, 1] == 104) { Vec[i] <- df2[df2$vid == 104, 2] }
  if (df1[i, 1] == 105) { Vec[i] <- df2[df2$vid == 105, 2] }
  if (df1[i, 1] == 106) { Vec[i] <- df2[df2$vid == 106, 2] }
}
Finally, combine the third column with data frame df1.
df1 <- cbind(df1, Vec)

Related

R- Subtract values of all rows in a group from previous row in different group and filter out rows

In R say I had the data frame:
data
frame object x y
1 6 150 100
2 6 149 99
3 6 148 98
3 6 140 90
4 6 148.5 97
4 6 142 93
5 6 147 96
5 6 138 92
5 6 135 90
6 6 146.5 99
1 7 125 200
2 7 126 197
3 7 127 202
3 7 119 185
4 7 117 183
4 7 123 199
5 7 115 190
5 7 124 202
5 7 118 192
6 7 124.5 199
I want to output the object which is the closest in the previous frame based on the (x,y) coordinates and filter out the other objects. I want to find the difference in the x and y between all the objects in a given frame and the single object in the previous frame and keep the closest object while removing the rest. The object that is kept would then serve as reference for the next frame. The frames with only one object would be left as is. The output should be one object per frame:
data
frame object x y
1 6 150 100
2 6 149 99
3 6 148 98
4 6 148.5 97
5 6 147 96
6 6 146.5 99
1 7 125 200
2 7 126 197
3 7 127 202
4 7 123 199
5 7 124 202
6 7 124.5 199
This is a cumulative operation, so it'll take an iterative approach. Here's a simple function to do one operation, assuming it's for only one object.
fun <- function(Z, fr) {
  prevZ <- head(subset(Z, frame == (fr - 1)), 1)
  thisZ <- subset(Z, frame == fr)
  if (nrow(prevZ) < 1 || nrow(thisZ) < 2) return(Z)
  ind <- which.min(abs(thisZ$x - prevZ$x) + abs(thisZ$y - prevZ$y))
  rbind(subset(Z, frame != fr), thisZ[ind, ])
}
fun(subset(dat, object == 6), 3)
# frame object x y
# 1 1 6 150.0 100
# 2 2 6 149.0 99
# 5 4 6 148.5 97
# 6 4 6 142.0 93
# 7 5 6 147.0 96
# 8 5 6 138.0 92
# 9 5 6 135.0 90
# 10 6 6 146.5 99
# 3 3 6 148.0 98
(The order is not maintained, but it can easily be sorted back into place as needed.)
Now we can Reduce this for each object within the data.
out <- do.call(rbind,
  lapply(split(dat, dat$object),
         function(X) Reduce(fun, seq(min(X$frame) + 1, max(X$frame)), init = X)))
out <- out[order(out$object, out$frame),]
out
# frame object x y
# 6.1 1 6 150.0 100
# 6.2 2 6 149.0 99
# 6.3 3 6 148.0 98
# 6.5 4 6 148.5 97
# 6.7 5 6 147.0 96
# 6.10 6 6 146.5 99
# 7.11 1 7 125.0 200
# 7.12 2 7 126.0 197
# 7.13 3 7 127.0 202
# 7.16 4 7 123.0 199
# 7.18 5 7 124.0 202
# 7.20 6 7 124.5 199
We can create a for loop that applies the criteria to a single object, and then use group_by %>% summarize to apply it to every object:
library(dplyr)
keep_closest_frame = function(data) {
  frames = split(data, data$frame)
  for (i in seq_along(frames)) {
    if (nrow(frames[[i]]) != 1 && i == 1) {
      stop("First frame must have exactly 1 row")
    }
    if (nrow(frames[[i]]) == 1) next
    dists = with(frames[[i]],
                 abs(x - frames[[i - 1]][["x"]]) + abs(y - frames[[i - 1]][["y"]]))
    frames[[i]] = frames[[i]][which.min(dists), ]
  }
  bind_rows(frames)
}
data %>%
group_by(object) %>%
summarize(keep_closest_frame(across()))
# # A tibble: 12 × 4
# # Groups: object [2]
# object frame x y
# <int> <int> <dbl> <int>
# 1 6 1 150 100
# 2 6 2 149 99
# 3 6 3 148 98
# 4 6 4 148. 97
# 5 6 5 147 96
# 6 6 6 146. 99
# 7 7 1 125 200
# 8 7 2 126 197
# 9 7 3 127 202
# 10 7 4 123 199
# 11 7 5 124 202
# 12 7 6 124. 199

How to use mutate_at() with two sets of variables, in R

Using dplyr, I want to divide a column by another one, where the two columns have a similar pattern.
I have the following data frame:
My_data = data.frame(
var_a = 101:110,
var_b = 201:210,
number_a = 1:10,
number_b = 21:30)
I would like to create a new variable: var_a_new = var_a/number_a, var_b_new = var_b/number_b and so on if I have c, d etc.
My_data %>%
mutate_at(
.vars = c('var_a', 'var_b'),
.funs = list( new = function(x) x/(.[,paste0('number_a', names(x))]) ))
I did not get an error, but a wrong result. I think that the problem is that I don't understand what the 'x' is. Is it one of the string in .vars? Is it a column in My_data? Something else?
One option could be:
bind_cols(My_data,
My_data %>%
transmute(across(starts_with("var"))/across(starts_with("number"))) %>%
rename_all(~ paste0(., "_new")))
var_a var_b number_a number_b var_a_new var_b_new
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
You can do this directly, provided the columns are ordered consistently, meaning "var_a" is the first column in the "var" group and "number_a" is the first column in the "number" group, and so on for the other pairs.
var_cols <- grep('var', names(My_data), value = TRUE)
number_cols <- grep('number', names(My_data), value = TRUE)
My_data[paste0(var_cols, '_new')] <- My_data[var_cols]/My_data[number_cols]
My_data
# var_a var_b number_a number_b var_a_new var_b_new
#1 101 201 1 21 101.00000 9.571429
#2 102 202 2 22 51.00000 9.181818
#3 103 203 3 23 34.33333 8.826087
#4 104 204 4 24 26.00000 8.500000
#5 105 205 5 25 21.00000 8.200000
#6 106 206 6 26 17.66667 7.923077
#7 107 207 7 27 15.28571 7.666667
#8 108 208 8 28 13.50000 7.428571
#9 109 209 9 29 12.11111 7.206897
#10 110 210 10 30 11.00000 7.000000
The function across() has replaced the scoped variants such as mutate_at(), summarize_at() and others. For more details, see vignette("colwise") or https://cran.r-project.org/web/packages/dplyr/vignettes/colwise.html. Based on tmfmnk's answer, the following works well:
My_data %>%
mutate(
new = across(starts_with("var"))/across(starts_with("number")))
The prefix "new." will be added to the names of the new variables.
var_a var_b number_a number_b new.var_a new.var_b
1 101 201 1 21 101.00000 9.571429
2 102 202 2 22 51.00000 9.181818
3 103 203 3 23 34.33333 8.826087
4 104 204 4 24 26.00000 8.500000
5 105 205 5 25 21.00000 8.200000
6 106 206 6 26 17.66667 7.923077
7 107 207 7 27 15.28571 7.666667
8 108 208 8 28 13.50000 7.428571
9 109 209 9 29 12.11111 7.206897
10 110 210 10 30 11.00000 7.000000
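If the column order cannot be trusted, here is a base-R sketch that pairs the columns by their shared suffix ("a", "b", ...) instead of by position, assuming the var_/number_ naming used above:

```r
My_data <- data.frame(var_a = 101:110, var_b = 201:210,
                      number_a = 1:10, number_b = 21:30)

# Extract the suffixes from the var_* names, then divide each var_<s>
# by its matching number_<s>, regardless of where the columns sit.
suffixes <- sub("^var_", "", grep("^var_", names(My_data), value = TRUE))
for (s in suffixes) {
  My_data[[paste0("var_", s, "_new")]] <-
    My_data[[paste0("var_", s)]] / My_data[[paste0("number_", s)]]
}
```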

correct way to add columns to data frame without loop

I have this "d" data frame that has 2 groups. In real life I have 20 groups.
d= data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d
group value
1 A 1
2 A 2
3 A 3
4 A 4
5 A 5
6 A 6
7 A 7
8 A 8
9 A 9
10 A 10
11 B 101
12 B 102
13 B 103
14 B 104
15 B 105
16 B 106
17 B 107
18 B 108
19 B 109
20 B 110
21 A 10000
I'd like to add two columns, "upper" and "lower", that are calculated at the group level. Since there are only 2 groups I can add the columns manually like this:
d= data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d
d$upper = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]+ 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]+ 2.5*IQR(d$value[d$group=="B"]) )
d$lower = ifelse(d$group=="A", quantile(d$value[d$group=="A"])[4]- 2.5*IQR(d$value[d$group=="A"]), quantile(d$value[d$group=="B"])[4]- 2.5*IQR(d$value[d$group=="B"]) )
group value upper lower
1 A 1 21 -4.0
2 A 2 21 -4.0
3 A 3 21 -4.0
4 A 4 21 -4.0
5 A 5 21 -4.0
6 A 6 21 -4.0
7 A 7 21 -4.0
8 A 8 21 -4.0
9 A 9 21 -4.0
10 A 10 21 -4.0
11 B 101 119 96.5
12 B 102 119 96.5
13 B 103 119 96.5
14 B 104 119 96.5
15 B 105 119 96.5
16 B 106 119 96.5
17 B 107 119 96.5
18 B 108 119 96.5
19 B 109 119 96.5
20 B 110 119 96.5
21 A 10000 21 -4.0
But when I have 20 or 30 groups, what's the best way to add these columns without writing a loop?
Groupwise operations can easily be done using dplyr's group_by function:
library(dplyr)
d <- data.frame(group = c(rep("A",10),rep("B",10),"A"), value = c(seq(1,10,1),seq(101,110,1),10000))
d %>%
group_by(group) %>%
mutate(upper=quantile(value, 0.75) + 2.5*IQR(value),
lower=quantile(value, 0.75) - 2.5*IQR(value))
This splits the data frame by the "group" variable and then computes the "upper" and "lower" columns separately for each group.
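For comparison, the same group-wise bounds can be sketched in base R with ave(), which computes a value within each group and recycles it back to every row of that group:

```r
d <- data.frame(group = c(rep("A", 10), rep("B", 10), "A"),
                value = c(1:10, 101:110, 10000))

# ave() applies FUN within each level of d$group and returns a vector
# aligned with d's rows, so no merge or loop is needed.
d$upper <- ave(d$value, d$group,
               FUN = function(v) quantile(v, 0.75) + 2.5 * IQR(v))
d$lower <- ave(d$value, d$group,
               FUN = function(v) quantile(v, 0.75) - 2.5 * IQR(v))
```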

Calculate mean of respective column values based on condition

I have a data.frame named sampleframe where I have stored all the table values. Inside sampleframe I have columns id, month, sold.
id month SMarch SJanFeb churn
101 1 0.00 0.00 1
101 2 0.00 0.00 1
101 3 0.00 0.00 1
108 2 0.00 6.00 1
103 2 0.00 10.00 1
160 1 0.00 2.00 1
160 2 0.00 3.00 1
160 3 0.50 0.00 0
164 1 0.00 3.00 1
164 2 0.00 6.00 1
I would like to calculate average sold for last three months based on ID. If it is month 3 then it has to consider average sold for the last two months based on ID, if it is month 2 then it has to consider average sold for 1 month based on ID., respectively for all months.
I have used the ifelse and mean functions for this, but some rows are missing when I try to use it for all months.
Query that I have used for execution
sampleframe$Churn <- ifelse(sampleframe$Month==4|sampleframe$Month==5|sampleframe$Month==6, ifelse(sampleframe$Sold<0.7*mean(sampleframe$Sold[sampleframe$ID[sampleframe$Month==-1&sampleframe$Month==-2&sampleframe$Month==-3]]),1,0),0)
According to the logic of the query, it should compare the current value against 70% of the average sold value of the previous months, and return 1 if the current value is higher than that average, else 0.
The expected output is not clear. Based on the description about calculating the average 'sold' for each 3 months, grouped by 'id', we can use roll_mean from library(RcppRoll). We convert the 'data.frame' to a 'data.table' (setDT(df1)) and, grouped by 'id', if the number of rows is greater than 1, we get the roll_mean with n specified as 3 and concatenate it with the averages for fewer than 3 observations; otherwise, i.e. for a single observation, we take the value itself.
library(RcppRoll)
library(data.table)
k <- 3
setDT(df1)[, soldAvg := if(.N>1) c(cumsum(sold[1:(k-1)])/1:(k-1),
roll_mean(sold,n=k, align='right')) else as.numeric(sold), id]
df1
# id month sold soldAvg
#1: 101 1 124 124.0000
#2: 101 2 211 167.5000
#3: 104 3 332 332.0000
#4: 105 4 124 124.0000
#5: 101 5 211 182.0000
#6: 101 6 332 251.3333
#7: 101 7 124 222.3333
#8: 101 8 211 222.3333
#9: 101 9 332 222.3333
#10: 102 10 124 124.0000
#11: 102 12 211 167.5000
#12: 104 3 332 332.0000
#13: 105 4 124 124.0000
#14: 102 5 211 182.0000
#15: 102 6 332 251.3333
#16: 106 7 124 124.0000
#17: 107 8 211 211.0000
#18: 102 9 332 291.6667
#19: 103 11 124 124.0000
#20: 103 2 211 167.5000
#21: 108 3 332 332.0000
#22: 108 4 124 228.0000
#23: 109 5 211 211.0000
#24: 103 6 332 222.3333
#25: 104 7 124 262.6667
#26: 105 8 211 153.0000
#27: 103 10 332 291.6667
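The rolling part of that computation can also be sketched in plain base R, independent of RcppRoll and data.table (roll_avg is a hypothetical helper, not part of the original answer):

```r
# Mean of up to the last k observations (including the current one);
# the first k-1 positions fall back to the mean of what is available.
roll_avg <- function(x, k = 3) {
  sapply(seq_along(x), function(i) mean(x[max(1, i - k + 1):i]))
}
roll_avg(c(124, 211, 332, 124))
# 124.0000 167.5000 222.3333 222.3333
```

Applied per id, this would be something like ave(df1$sold, df1$id, FUN = roll_avg).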
The above question can also be solved using library(dplyr), with this query producing the output:
resultData <- group_by(data, KId) %>%
arrange(sales_month) %>%
mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
group_by(KId, sales_month) %>%
mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
select(KId,sales_month, quantity_sold, result)

Loop in a column with conditions only once

library(xts)
data <- c(100, 101, 102, 103, 104, 99, 98, 97, 94, 93, 103, 90, 104, 105, 110)
date <- Sys.Date() - 15:1
file <- xts(data, date)
colnames(file) <- "CLOSE"
file$high <- cummax(file$CLOSE)
file$trade <- 0
file$trade[file$high * 0.95 >= file$CLOSE] <- 1
file$trade[file$high * 0.90 >= file$CLOSE] <- 2
file$trade[file$high * 0.85 >= file$CLOSE] <- 3
file
CLOSE high trade
2013-07-06 100 100 0
2013-07-07 101 101 0
2013-07-08 102 102 0
2013-07-09 103 103 0
2013-07-10 104 104 0
2013-07-11 99 104 0
2013-07-12 98 104 1
2013-07-13 97 104 1
2013-07-14 94 104 1
2013-07-15 93 104 2
2013-07-16 103 104 0
2013-07-17 90 104 2
2013-07-18 104 104 0
2013-07-19 105 105 0
2013-07-20 110 110 0
I need to modify the trade column so that after I get my first "1", all subsequent elements are zero until I get a 2; then all elements are 0 until I get a 3, and so on.
I think you could simply do:
> file$trade[duplicated(file$trade)] <- 0
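To see why that works, here is a minimal sketch on a plain vector: duplicated() flags every repeat of a value already seen, so only the first occurrence of each nonzero level survives (repeated zeros are simply reset to zero again):

```r
trade <- c(0, 0, 1, 1, 1, 2, 0, 2, 3, 0)

# duplicated() is TRUE for the 2nd and later occurrences of any value,
# so every repeat of 1, 2, 3 (and of 0) gets reset to 0.
trade[duplicated(trade)] <- 0
trade
# 0 0 1 0 0 2 0 0 3 0
```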
You don't need a loop to do this. Indeed, you simply need to find the positions of the first "1", "2", and so on. Try the following code.
rank.trade <- rank(file$trade, ties.method = "first")
marks <- cumsum(head(table(file$trade), -1)) + 1
black.list <- is.na(match(rank.trade, marks))
file$trade[black.list] <- 0
