How can I convert the following tibble to the final result posted below using dplyr?
> group_by(hth, team) %>% arrange(team)
Source: local data frame [26 x 14]
Groups: team [13]
team CSK DC DD GL KKR KTK KXIP MI PW RCB RPSG
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 CSK 0 8 11 0 11 2 9 10 4 10 0
2 CSK 0 2 5 0 5 0 8 12 2 9 0
3 DC 2 0 8 0 2 1 7 5 3 8 0
4 DC 8 0 3 0 7 0 3 5 1 3 0
5 DD 5 3 0 0 7 2 8 5 2 10 2
6 DD 11 8 0 2 10 0 10 13 4 7 0
7 GL 0 0 2 0 0 0 0 0 0 1 0
8 GL 0 0 0 0 2 0 2 2 0 2 2
9 KKR 5 7 10 2 0 0 5 10 3 15 0
10 KKR 11 2 7 0 0 2 14 8 2 3 2
# ... with 16 more rows, and 2 more variables: RR <dbl>, SH <dbl>
>
I used plyr's ddply function and was able to achieve the result.
> ddply(hth, .(team), function(x) colSums(x[,-1], na.rm = TRUE))
team CSK DC DD GL KKR KTK KXIP MI PW RCB RPSG RR SH
1 CSK 0 10 16 0 16 2 17 22 6 19 0 17 6
2 DC 10 0 11 0 9 1 10 10 4 11 0 9 0
3 DD 16 11 0 2 17 2 18 18 6 17 2 16 8
4 GL 0 0 2 0 2 0 2 2 0 3 2 0 3
5 KKR 16 9 17 2 0 2 19 18 5 18 2 15 9
6 KTK 2 1 2 0 2 0 1 1 1 2 0 2 0
7 KXIP 17 10 18 2 19 1 0 18 6 18 2 15 8
8 MI 22 10 18 2 18 1 18 0 6 19 2 16 8
9 PW 6 4 6 0 5 1 6 6 0 5 0 5 2
10 RCB 19 11 17 3 18 2 18 19 5 0 2 16 9
11 RPSG 0 0 2 2 2 0 2 2 0 2 0 0 2
12 RR 17 9 16 0 15 2 15 16 5 16 0 0 7
13 SH 6 0 8 3 9 0 8 8 2 9 2 7 0
>
How can I achieve the same result using just dplyr functions?
It looks like you are grouping by team and summing the remaining columns; in dplyr:
library(dplyr)
hth %>%
group_by(team) %>%
summarise_all(funs(sum), na.rm = TRUE)
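Note that funs() has since been deprecated in dplyr; on recent versions (1.0 and later) the same grouped column sums can be written with across(). A minimal sketch, assuming the same hth tibble:
library(dplyr)
hth %>%
  group_by(team) %>%
  summarise(across(everything(), ~ sum(.x, na.rm = TRUE)))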
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3,3,4,4,4,4,4,4),
                 qresult = c(0,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0),
                 count = c(0,0,0,0,1,2,0,0,1,2,1,2,3,4,5,6))
> df
id qresult count
1 1 0 0
2 1 0 0
3 1 0 0
4 2 0 0
5 2 1 1
6 2 0 2
7 3 0 0
8 3 0 0
9 3 1 1
10 3 0 2
11 4 1 1
12 4 0 2
13 4 0 3
14 4 0 4
15 4 0 5
16 4 0 6
What would be a way to obtain the count column, which begins counting once the condition qresult == 1 is met and resets for each new id?
We can apply a double cumsum on the qresult values after grouping by id:
library(dplyr)
df %>%
group_by(id) %>%
mutate(count2 = cumsum(cumsum(qresult))) %>%
ungroup
Output:
# A tibble: 16 × 4
id qresult count count2
<dbl> <dbl> <dbl> <dbl>
1 1 0 0 0
2 1 0 0 0
3 1 0 0 0
4 2 0 0 0
5 2 1 1 1
6 2 0 2 2
7 3 0 0 0
8 3 0 0 0
9 3 1 1 1
10 3 0 2 2
11 4 1 1 1
12 4 0 2 2
13 4 0 3 3
14 4 0 4 4
15 4 0 5 5
16 4 0 6 6
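To see why the double cumsum works, consider one id in isolation (an illustrative sketch, not part of the original answer; it assumes a single 1 per id, as in the sample data):
x <- c(0, 1, 0, 0)    # qresult values for one id
cumsum(x)             # 0 1 1 1 -> flags every row at or after the first 1
cumsum(cumsum(x))     # 0 1 2 3 -> starts counting once that flag turns on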
I'm having trouble using the row number as an index. For example, I want a new column that gives me the sales over the next 4 days (the current day plus the following 3); I want to call this column sales_next4.
The issue with my code is that I don't know how to use row_number() as an index; what I'm doing fetches the actual value of the column instead.
# here is the code to create the data
df <- read.table(text = "day price price_change sales High_sales_ind
1 5 0 12 1
2 5 0 6 0
3 5 0 5 0
4 5 0 4 0
5 5 0 10 1
6 5 0 10 1
7 5 0 10 1
8 5 0 12 1
9 5 0 14 1
10 7 2 3 0
11 7 0 2 0", header = TRUE)
# my code
df <- df %>% mutate(sales_next4 = sales[row_number():(row_number() + 4)])
What I need:
day price price_change sales High_sales_ind sales_next4
  1     5            0    12              1          27
  2     5            0     6              0          25
  3     5            0     5              0          29
  4     5            0     4              0          34
  5     5            0    10              1          42
  6     5            0    10              1          46
  7     5            0    10              1          39
  8     5            0    12              1          31
  9     5            0    14              1          19
 10     7            2     3              0           5
 11     7            0     2              0           2
Any help would be appreciated.
You can use rollapply from the zoo package for cases like this, assuming that the days are consecutive as in the example data provided.
You'll need to use the partial = and align = arguments to fill the column correctly; see ?rollapply for the details.
library(dplyr)
library(zoo)
df <- df %>%
mutate(sales_next4 = rollapply(sales, 4, sum, partial = TRUE, align = "left"))
Result:
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
You can use map_dbl() from purrr to compute a rolling sum based on the day column.
library(dplyr)
library(purrr)
df %>%
mutate(sales_next4 = map_dbl(day, ~ sum(sales[between(day, .x, .x+3)])))
# day price price_change sales High_sales_ind sales_next4
# 1 1 5 0 12 1 27
# 2 2 5 0 6 0 25
# 3 3 5 0 5 0 29
# 4 4 5 0 4 0 34
# 5 5 5 0 10 1 42
# 6 6 5 0 10 1 46
# 7 7 5 0 10 1 39
# 8 8 5 0 12 1 31
# 9 9 5 0 14 1 19
# 10 10 7 2 3 0 5
# 11 11 7 0 2 0 2
Using slider
library(dplyr)
library(slider)
df %>%
mutate(sales_next4 = slide_dbl(day, ~ sum(sales[.x]), .after = 3))
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
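Since day here is simply 1, 2, ..., n, the window can also be taken directly over sales; this is a hedged simplification of the answer above rather than the original code:
library(dplyr)
library(slider)
df %>%
  mutate(sales_next4 = slide_dbl(sales, sum, .after = 3))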
You can use Reduce() and data.table::shift()
library(data.table)
setDT(df)[, sales_next4 := Reduce(`+`, shift(c(sales, 0, 0, 0), -3:0))[1:.N]]
Output:
day price price_change sales High_sales_ind sales_next4
1 1 5 0 12 1 27
2 2 5 0 6 0 25
3 3 5 0 5 0 29
4 4 5 0 4 0 34
5 5 5 0 10 1 42
6 6 5 0 10 1 46
7 7 5 0 10 1 39
8 8 5 0 12 1 31
9 9 5 0 14 1 19
10 10 7 2 3 0 5
11 11 7 0 2 0 2
or, this could be done as part of a dplyr/mutate pipeline:
mutate(df, sales_next4 = Reduce(`+`, data.table::shift(c(sales,0,0,0),0:-3))[1:nrow(df)])
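For clarity, here is roughly what the shift()/Reduce() combination does on a small vector (an illustration only; the numbers are the first four values of the sales column):
library(data.table)
s <- c(12, 6, 5, 4)
# shift() with negative lags returns a list of lead vectors padded with NA;
# padding the input with three zeros keeps the tail sums free of NA
shift(c(s, 0, 0, 0), -3:0)
# Reduce(`+`, ...) adds the four aligned vectors element-wise,
# giving the forward 4-day sum at each position
Reduce(`+`, shift(c(s, 0, 0, 0), -3:0))[1:length(s)]
# 27 15  9  4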
I use the get.shortest.paths() function to find the shortest path between two vertices, but something odd is happening. Following a comment I received, I am rewriting the entire question body. I produced my graph with g <- sample_smallworld(1, 20, 5, 0.1), and here is its edge list:
*Vertices 20
*Edges
1 2 0
2 3 0
3 4 0
4 5 0
5 6 0
6 7 0
7 8 0
8 9 0
9 10 0
10 11 0
11 12 0
12 13 0
13 14 0
14 15 0
6 15 0
16 17 0
17 18 0
18 19 0
19 20 0
1 20 0
1 11 0
1 19 0
1 4 0
1 18 0
1 5 0
1 17 0
6 17 0
15 16 0
2 20 0
2 4 0
2 19 0
2 5 0
2 18 0
2 9 0
2 17 0
2 13 0
3 5 0
3 20 0
3 6 0
3 19 0
3 7 0
3 18 0
3 8 0
4 6 0
4 7 0
4 20 0
4 8 0
5 19 0
4 9 0
5 7 0
5 8 0
5 9 0
5 20 0
5 10 0
6 8 0
6 9 0
6 10 0
6 11 0
7 9 0
7 10 0
7 11 0
7 12 0
1 10 0
8 11 0
1 12 0
8 13 0
9 11 0
9 12 0
9 13 0
7 14 0
12 19 0
10 13 0
10 14 0
10 15 0
11 13 0
11 14 0
11 15 0
4 16 0
12 14 0
9 15 0
12 16 0
12 17 0
13 15 0
13 16 0
13 17 0
13 18 0
14 16 0
14 17 0
14 18 0
14 19 0
15 17 0
15 18 0
15 19 0
1 15 0
16 18 0
16 19 0
9 20 0
17 19 0
17 20 0
10 18 0
The shortest path reported between 7 and 2 is:
> get.shortest.paths(g,7,2)
$vpath
$vpath[[1]]
+ 4/20 vertices, from c915453:
[1] 7 14 19 2
Here are the nodes adjacent to node 7 and node 2:
> unlist(neighborhood(g, 1, 7, mode="out"))
[1] 7 3 4 5 6 8 9 10 11 12 14
> unlist(neighborhood(g, 1, 2, mode="out"))
[1] 2 1 3 4 5 9 13 17 18 19 20
As you can see, I can go from 7 to 3 and from 3 to 2. It looks like there is a shorter path. What could I be missing?
Yes, the problem is your edge weights of zero. Looking at the help page ?shortest_paths:
weights
Possibly a numeric vector giving edge weights. If this is
NULL and the graph has a weight edge attribute, then the attribute is
used. If this is NA then no weights are used (even if the graph has a
weight attribute).
Note that weights = NULL is the default, so the zero weights are used. The path that was returned therefore has total weight zero, exactly the same weighted distance as the path you expected. If you want the path with the smallest number of hops instead, turn off the use of the weights like this:
get.shortest.paths(g,7,2, weights=NA)$vpath
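Alternatively, if the zero weights carry no meaning for you, you could drop the weight attribute entirely so that every path function ignores it by default. A minimal sketch, assuming the graph stores its weights in the usual weight edge attribute:
library(igraph)
g <- delete_edge_attr(g, "weight")
get.shortest.paths(g, 7, 2)$vpath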
I am trying to calculate consecutive proportions of the target feature.
Data Set
df <- data.frame(ID = c(11, 11, 11, 11, 11, 11, 11, 11, 11, 11,
22, 22, 22, 22, 22, 22, 22, 22, 22, 22),
target = c(0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
0, 0, 1, 1, 1, 0, 1, 0, 1, 1))
ID target
1 11 0
2 11 0
3 11 0
4 11 1
5 11 1
6 11 1
7 11 0
8 11 1
9 11 1
10 11 1
11 22 0
12 22 0
13 22 1
14 22 1
15 22 1
16 22 0
17 22 1
18 22 0
19 22 1
20 22 1
This is what I tried:
df <- df %>%
group_by(ID) %>%
mutate(count_per_ID = row_number(),
consecutive_target = sequence(rle(as.character(target))$lengths),
val = ifelse(target == 0, 0, consecutive_target),
proportion_target_by_ID = val / count_per_ID) %>%
ungroup()
I created count_per_ID, a running row counter within each ID group.
Then consecutive_target counts consecutive observations of target and restarts each time the value changes, i.e. whenever target switches between 0 and 1.
val copies the values from consecutive_target where target is 1 and is 0 otherwise.
proportion_target_by_ID divides val by count_per_ID.
The issue is that whenever val is 0, the resulting proportion of target values by ID is no longer what I want.
ID target count_per_ID consecutive_target val proportion_target_by_ID
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 11 0 1 1 0 0
2 11 0 2 2 0 0
3 11 0 3 3 0 0
4 11 1 4 1 1 0.25
5 11 1 5 2 2 0.4
6 11 1 6 3 3 0.5
7 11 0 7 1 0 0
8 11 1 8 1 1 0.125
9 11 1 9 2 2 0.222
10 11 1 10 3 3 0.3
11 22 0 1 1 0 0
12 22 0 2 2 0 0
13 22 1 3 1 1 0.333
14 22 1 4 2 2 0.5
15 22 1 5 3 3 0.6
16 22 0 6 1 0 0
17 22 1 7 1 1 0.143
18 22 0 8 1 0 0
19 22 1 9 1 1 0.111
20 22 1 10 2 2 0.2
How the result should look:
ID target count_per_ID consecutive_target val proportion_target_by_ID
<dbl> <dbl> <int> <int> <dbl> <dbl>
1 11 0 1 1 0 0
2 11 0 2 2 0 0
3 11 0 3 3 0 0
4 11 1 4 1 1 0.25
5 11 1 5 2 2 0.4
6 11 1 6 3 3 0.5
7 11 0 7 1 3 0.428
8 11 1 8 1 4 0.5
9 11 1 9 2 5 0.555
10 11 1 10 3 6 0.6
11 22 0 1 1 0 0
12 22 0 2 2 0 0
13 22 1 3 1 1 0.333
14 22 1 4 2 2 0.5
15 22 1 5 3 3 0.6
16 22 0 6 1 3 0.5
17 22 1 7 1 4 0.571
18 22 0 8 1 4 0.5
19 22 1 9 1 5 0.55
20 22 1 10 2 6 0.6
An option is to change the code for creating the 'val' from
val = ifelse(target == 0, 0, consecutive_target)
to
val = cumsum(target != 0)
Full code:
df %>%
group_by(ID) %>%
mutate(count_per_ID = row_number(),
consecutive_target = sequence(rle(as.character(target))$lengths),
val = cumsum(target != 0),
proportion_target_by_ID = val / count_per_ID)
# A tibble: 20 x 6
# Groups: ID [2]
# ID target count_per_ID consecutive_target val proportion_target_by_ID
# <dbl> <dbl> <int> <int> <int> <dbl>
# 1 11 0 1 1 0 0
# 2 11 0 2 2 0 0
# 3 11 0 3 3 0 0
# 4 11 1 4 1 1 0.25
# 5 11 1 5 2 2 0.4
# 6 11 1 6 3 3 0.5
# 7 11 0 7 1 3 0.429
# 8 11 1 8 1 4 0.5
# 9 11 1 9 2 5 0.556
#10 11 1 10 3 6 0.6
#11 22 0 1 1 0 0
#12 22 0 2 2 0 0
#13 22 1 3 1 1 0.333
#14 22 1 4 2 2 0.5
#15 22 1 5 3 3 0.6
#16 22 0 6 1 3 0.5
#17 22 1 7 1 4 0.571
#18 22 0 8 1 4 0.5
#19 22 1 9 1 5 0.556
#20 22 1 10 2 6 0.6
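A side note: because val is cumsum(target != 0) and count_per_ID is row_number(), the final column is simply a grouped cumulative mean when target only takes the values 0 and 1. If only that column is needed, a shorter equivalent (a sketch under that assumption) would be:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(proportion_target_by_ID = cummean(target)) %>%
  ungroup()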
I have a file like this in R:
**0 1**
0 2
**0 3**
0 4
0 5
0 6
0 7
0 8
0 9
0 10
**1 0**
1 11
1 12
1 13
1 14
1 15
1 16
1 17
1 18
1 19
**3 0**
As we can see, there are repeated unordered pairs in this file (the marked pairs), like
1 0
and
0 1
I wish to remove these duplicate pairs. I also want to count the number of occurrences of each pair and append that count to the row that is repeated. If a pair is not repeated, then 1 should be written in the third column.
For example (a sample of the output file):
0 1 2
0 2 1
0 3 2
0 4 1
0 5 1
0 6 1
0 7 1
0 8 1
0 9 1
0 10 1
1 11 1
1 12 1
1 13 1
1 14 1
1 15 1
1 16 1
1 17 1
1 18 1
1 19 1
How can I achieve it in R?
Here is a way using transform, pmin and pmax to reorder each pair so the smaller value comes first, and then aggregate to provide a count:
# data
x <- data.frame(a = c(rep(0, 10), rep(1, 10), 3), b = c(1:10, 0, 11:19, 0))
# logic
aggregate(count ~ a + b, transform(x, a = pmin(a, b), b = pmax(a, b), count = 1), sum)
a b count
1 0 1 2
2 0 2 1
3 0 3 2
4 0 4 1
5 0 5 1
6 0 6 1
7 0 7 1
8 0 8 1
9 0 9 1
10 0 10 1
11 1 11 1
12 1 12 1
13 1 13 1
14 1 14 1
15 1 15 1
16 1 16 1
17 1 17 1
18 1 18 1
19 1 19 1
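The same pmin()/pmax() idea can also be written with dplyr's count(); a sketch, assuming the x data frame defined above:
library(dplyr)
x %>%
  # use new names: inside mutate(), a = pmin(a, b) would overwrite a before
  # b = pmax(a, b) is evaluated, unlike transform(), which uses the originals
  mutate(lo = pmin(a, b), hi = pmax(a, b)) %>%
  count(lo, hi, name = "count")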
Here's one approach:
First, create a vector of the columns sorted and then pasted together.
x <- apply(mydf, 1, function(x) paste(sort(x), collapse = " "))
Then, use ave to create the counts you are looking for.
mydf$count <- ave(x, x, FUN = length)
Finally, you can use the "x" vector again, this time to detect and remove duplicated values.
mydf[!duplicated(x), ]
# V1 V2 count
# 1 0 1 2
# 2 0 2 1
# 3 0 3 2
# 4 0 4 1
# 5 0 5 1
# 6 0 6 1
# 7 0 7 1
# 8 0 8 1
# 9 0 9 1
# 10 0 10 1
# 12 1 11 1
# 13 1 12 1
# 14 1 13 1
# 15 1 14 1
# 16 1 15 1
# 17 1 16 1
# 18 1 17 1
# 19 1 18 1
# 20 1 19 1