I have a data frame which holds the times of random events occurring. What I want is to subset the first case where either 'place' or 'Show' appears under Event, combined with either 'kick' or 'Type' appearing under Event 2. So in this case, 'place run' wouldn't satisfy the condition, even though 'place' does appear under Event.
By "the first case", I mean only the first row where either of those combinations occurs before Time resets back to 0. So for the first segment, the output I would want is 27, as this is the first Time value where the condition is satisfied. For the second segment, I would want 16. For the last segment, the output would be 41. (I've put asterisks around the rows which meet the condition so they're easy to locate. These aren't actually present in the data.)
Time Event Event 2
0 Begin NA
23 place run
27 *Show Type*
34 *place kick*
41 good bye
42 *place kick*
0 Begin NA
11 Hat Yellow
13 Show Green
16 *place kick*
20 place hit
29 sign redeem
35 *Show Type*
0 Begin NA
5 Cream Glue
17 Show Green
18 Orange Screen
30 place hit
33 sign redeem
41 *Show Type*
0 Begin NA
...
EDIT: So far, what I'm able to do is subset the rows that have Show Type or place kick with the following code:
Rows <- Data[(Data[,'Event'] == 'Show' & Data[,'Event 2'] == 'Type') |
             (Data[,'Event'] == 'place' & Data[,'Event 2'] == 'kick'), ]
Where I'm struggling is resetting the search for these values each time Time goes back to 0. Any help will be greatly appreciated!
The & infix function can be wrapped in the which function to generate a vector of the row numbers where both conditions are met. Then follow that with [1] to get just the first one.
df[ which(df[ , 'Event'] %in% c('place','Show') & df[ ,'Event.2'] %in% c('kick','Type') )[1], ]
Notice that I didn't leave a space between Event and 2, since that would have been parsed by R as two different symbols. The make.names function is used by all the read.* functions to remove invalid punctuation from column names.
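For instance, a quick check of what the column name becomes:
make.names("Event 2")
# [1] "Event.2"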
To make this process reset at each new segment, you would build a segment vector probably with something like segvec= cumsum(df$Time==0), and then probably use the split-apply-combine approach to get values just within the resulting subsets.
Some lightly tested code:
lapply( split(dat, cumsum(dat[ , 'Time'] == 0)),
        function(df){ df[ which(df[ , 'Event'] %in% c('place','Show') &
                                df[ , 'Event.2'] %in% c('kick','Type'))[1], ] })
#------
$`1`
Time Event Event.2
3 27 Show Type
$`2`
Time Event Event.2
10 16 place kick
$`3`
Time Event Event.2
20 41 Show Type
dput(dat)
structure(list(Time = c(0L, 23L, 27L, 34L, 41L, 42L, 0L, 11L,
13L, 16L, 20L, 29L, 35L, 0L, 5L, 17L, 18L, 30L, 33L, 41L), Event = structure(c(1L,
6L, 7L, 6L, 3L, 6L, 1L, 4L, 7L, 6L, 6L, 8L, 7L, 1L, 2L, 7L, 5L,
6L, 8L, 7L), .Label = c("Begin", "Cream", "good", "Hat", "Orange",
"place", "Show", "sign"), class = "factor"), Event.2 = structure(c(NA,
7L, 9L, 5L, 1L, 5L, NA, 10L, 3L, 5L, 4L, 6L, 9L, NA, 2L, 3L,
8L, 4L, 6L, 9L), .Label = c("bye", "Glue", "Green", "hit", "kick",
"redeem", "run", "Screen", "Type", "Yellow"), class = "factor")), .Names = c("Time",
"Event", "Event.2"), class = "data.frame", row.names = c(NA,
-20L))
Far less succinct (and probably less optimal) than 42-'s, but:
library(stringi)
read.table(text="Time Event Event2
0 Begin NA
23 place run
27 *Show Type*
34 *place kick*
41 good bye
42 *place kick*
0 Begin NA
11 Hat Yellow
13 Show Green
16 *place kick*
20 place hit
29 sign redeem
35 *Show Type*
0 Begin NA
5 Cream Glue
17 Show Green
18 Orange Screen
30 place hit
33 sign redeem
41 *Show Type*
0 Begin NA", header=TRUE, stringsAsFactors=FALSE) -> df
library(dplyr)
df$grp <- 0
df[which(df$Time == 0),]$grp <- 1
df$grp <- cumsum(df$grp)
group_by(df, grp) %>%
  filter(grepl("place|show", Event, ignore.case=TRUE) &
           grepl("kick|type", Event2, ignore.case=TRUE)) %>%
  slice(1) %>%
  select(-grp)
## Source: local data frame [3 x 4]
## Groups: grp [3]
##
## grp Time Event Event2
## <dbl> <int> <chr> <chr>
## 1 1 27 *Show Type*
## 2 2 16 *place kick*
## 3 3 41 *Show Type*
I have a simple Table:
ID|Value
1|10
1|20
1|-5
2|25
3|2
3|15
4|8
5|18
6|33
6|5
6|50
Currently I use this code:
# (PriceRR, oldID, and currentValue are assumed to be initialized before this
# loop, e.g. PriceRR as an empty data frame)
for (row in 1:nrow(Table)) {
  ID <- Table[row, 1]
  Value <- Table[row, 2]
  if (oldID == ID) {
    currentValue <- currentValue * ((100 - Value)/100)
  } else {
    addrow <- data.frame(oldID, currentValue)
    PriceRR <- rbind(PriceRR, addrow)
    oldID <- ID
    currentValue <- 100 - Value
  }
}
This allocates a discount for a later DAX value in Power BI. But it is slow as hell, so I want to parallelize it.
daply might do the job, but I do not know its inner workings.
So basically, what I need is:
Split the table into sets by group of ID:
Set1 1,10 1,20 1,5
Set2 2,25
Set3 3,2 3,15
...
Then apply the function to the sets in parallel.
On the first call of the function on a set, initialize currentValue <- 100; afterwards:
currentValue <- currentValue * ((100 - Value)/100)
For Set1.1: 90 = 100 * ((100 - 10)/100)
For Set1.2: 72 = 90 * ((100 - 20)/100)
For Set1.3: 68.4 = 72 * ((100 - 5)/100)
It should return ID=1, Value=68.4.
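In other words, the recurrence telescopes to a single product, which is easy to check in R (using 5 as in the worked example above):
100 * prod((100 - c(10, 20, 5)) / 100)
# [1] 68.4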
What I need to know is: is it possible to keep a variable persistent in memory for the duration of applying the function to a set, for as long as that set is being processed?
Will daply or another function create a new worker thread to apply it to each set?
I am an R beginner and have to jump right into the inner workings of the R environment. :-)
Sven
An option with reduce from purrr
library(dplyr)
library(purrr)
data %>%
  group_by(ID) %>%
  summarise(Result = reduce(Value, ~ .x * (100 - .y)/100, .init = 100))
# A tibble: 6 x 2
#      ID Result
#*  <int>  <dbl>
#1      1   68.4
#2      2   75
#3      3   83.3
#4      4   92
#5      5   82
#6      6   31.8
data
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
Here's an approach with dplyr and Reduce from base R:
library(dplyr)
data %>%
  group_by(ID) %>%
  summarize(Result = Reduce(function(x, y) x * ((100 - y)/100),
                            Value, init = 100))
# A tibble: 6 x 2
ID Result
<int> <dbl>
1 1 68.4
2 2 75
3 3 83.3
4 4 92
5 5 82
6 6 31.8
Reduce is a tricky function, mostly because the documentation is terrible. Reduce applies a function of two arguments to the elements of a vector in succession, with the accumulated value as the first argument and the current element as the second argument. You can set an initial value with init =.
I notice in your explanation that your expected output for group 1 is 68.4. This is only true if the value in row 3 is 5 rather than the -5 you posted. Since this was the only negative value in your data, I went ahead and changed it to 5.
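To see the intermediate steps, Reduce() also accepts accumulate = TRUE; a quick illustration with the Class 1 values:
Reduce(function(x, y) x * ((100 - y)/100), c(10, 20, 5), init = 100, accumulate = TRUE)
# [1] 100.0  90.0  72.0  68.4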
Data
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
Your original script is slow for a couple of reasons. First, you are looping through every element in your initial table and not taking advantage of the vectorized nature of R. Second, there is an rbind call within the loop; binding is a slow process, especially as the object size grows.
It looks like the objective is a cumulative product of the Value column, grouped by the ID column.
Here is a base R solution using the split, apply and merge strategy.
Table <-structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, -5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
#Create column for the ((100 - Value)/100) factor
Table$factor<- ((100 - Table$Value)/100)
#split by ID
dfs<-split(Table, Table$ID)
currentValue <- sapply(dfs, function(x){
  #find the cumulative product of the factor column
  product <- cumprod(x$factor)
  #return the last value from the cumprod
  return(100 * product[length(product)])
})
#create the final answer
PriceRR<-data.frame(oldID=as.integer(names(dfs)), currentValue)
PriceRR
oldID currentValue
1 1 75.600
2 2 75.000
3 3 83.300
4 4 92.000
5 5 82.000
6 6 31.825
This script uses the cumprod function, which is vectorized and thus very fast. It also avoids the slow operation of repeatedly growing the final data frame.
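For intuition, cumprod returns the running product; with the Class 1 factors from this data (which keeps the -5, so the last factor is 1.05):
cumprod(c(0.90, 0.80, 1.05))
# [1] 0.900 0.720 0.756
and 100 * 0.756 matches the 75.600 in the output above.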
I have a matrix with two-level groupings, as illustrated in the row and column names.
UKC1_SS1 UKC1_SS2 UKC2_SS1 UKC2_SS2
UKC1_SS1 1 2 3 4
UKC1_SS2 5 6 7 8
UKC2_SS1 9 10 11 12
UKC2_SS2 13 14 15 16
I want to create a table with the column and row sums based on the first four characters of the column and row names:
UKC1 UKC2
UKC1 14 22
UKC2 46 54
I tried calculating rowsum and colSums sequentially,
sum.matrix <- rowsum(matrix, substr(rownames(matrix), start = 1, stop = 4))
sum.matrix <- colSums(sum.matrix, substr(colnames(test), start = 1, stop = 4))
but I receive the following error message:
Error in colSums(test, substr(colnames(test), start = 1, stop = 4)) :
invalid 'na.rm' argument
When I run sum(is.na(matrix)) I confirm that there are NA values in matrix.
We can do the sum with xtabs after changing the dimnames to the substr of the first 4 characters:
dimnames(m1) <- lapply(dimnames(m1), substr, 1, 4)
xtabs(Freq~ Var1 + Var2, as.data.frame.table(m1))
# Var2
#Var1 UKC1 UKC2
# UKC1 14 22
# UKC2 46 54
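If you'd rather stay close to the original rowsum attempt, a sketch that applies rowsum twice (transposing in between) gives the same result:
tmp <- rowsum(m1, substr(rownames(m1), 1, 4))
t(rowsum(t(tmp), substr(colnames(tmp), 1, 4)))
#      UKC1 UKC2
# UKC1   14   22
# UKC2   46   54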
data
m1 <- structure(c(1L, 5L, 9L, 13L, 2L, 6L, 10L, 14L, 3L, 7L, 11L, 15L,
4L, 8L, 12L, 16L), .Dim = c(4L, 4L), .Dimnames = list(c("UKC1_SS1",
"UKC1_SS2", "UKC2_SS1", "UKC2_SS2"), c("UKC1_SS1", "UKC1_SS2",
"UKC2_SS1", "UKC2_SS1.1")))
What I am trying to solve is how to calculate the weighted score for each class for each month.
Each class has multiple students and the weight (contribution) of a student's score varies through time.
To be included in the calculation a student must have both score and weight.
I am a bit lost and none of the approaches I have used have worked.
Student Class Jan_18_score Feb_18_score Jan_18_Weight Feb_18_Weight
Adam 1 3 2 150 153
Char 1 5 7 30 60
Fred 1 -7 8 NA 80
Greg 1 2 NA 80 40
Ed 2 1 2 60 80
Mick 2 NA 6 80 30
Dave 3 5 NA 40 25
Nick 3 8 8 12 45
Tim 3 -2 7 23 40
George 3 5 3 65 NA
Tom 3 NA 8 78 50
The overall goal is to calculate the weighted score for each class each month.
Taking Class 1 (first 4 rows) as an example and looking at Jan_18.
- The observations of Adam, Char and Greg are valid since they have both scores and weights, so their scores and weights should be included.
- Fred does not have a Jan_18_Weight, therefore both his Jan_18_score and Jan_18_Weight are excluded from the calculation.
The following calculation should then occur:
= [(3*150)+(5*30)+(2*80)]/ [150+30+80]
= 2.92307
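Equivalently, in R:
weighted.mean(c(3, 5, 2), c(150, 30, 80))
# [1] 2.923077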
This calculation would be repeated for each class and each month.
A new dataframe, something like the following, should be the output:
Class Jan_18_Weight_Score Feb_18_Weight_Score
1 2.92307 etc
2 etc etc
3 etc etc
There are many columns and many rows.
Any help is appreciated.
Here's a way with tidyverse. The main trick is to replace NA with 0 in "weights" columns and then use weighted.mean() with na.rm = T to ignore NA scores. To do so, you can gather the scores and weights into a single column and then group by Class and month_abb (a calculated field for grouping) and then use weighted.mean().
library(tidyverse)
df %>%
  mutate_at(vars(ends_with("Weight")), ~replace_na(., 0)) %>%
  gather(month, value, -Student, -Class) %>%
  group_by(Class, month_abb = paste0(substr(month, 1, 3), "_Weight_Score")) %>%
  summarize(
    weight_score = weighted.mean(value[grepl("score", month)], value[grepl("Weight", month)], na.rm = T)
  ) %>%
  ungroup() %>%
  spread(month_abb, weight_score)
# A tibble: 3 x 3
Class Feb_Weight_Score Jan_Weight_Score
<int> <dbl> <dbl>
1 1 4.66 2.92
2 2 3.09 1
3 3 7.70 4.11
Data -
df <- structure(list(Student = c("Adam", "Char", "Fred", "Greg", "Ed",
"Mick", "Dave", "Nick", "Tim", "George", "Tom"), Class = c(1L,
1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), Jan_18_score = c(3L,
5L, -7L, 2L, 1L, NA, 5L, 8L, -2L, 5L, NA), Feb_18_score = c(2L,
7L, 8L, NA, 2L, 6L, NA, 8L, 7L, 3L, 8L), Jan_18_Weight = c(150L,
30L, NA, 80L, 60L, 80L, 40L, 12L, 23L, 65L, 78L), Feb_18_Weight = c(153L,
60L, 80L, 40L, 80L, 30L, 25L, 45L, 40L, NA, 50L)), class = "data.frame", row.names = c(NA,
-11L))
Maybe this could be solved in a much better way but here is one Base R option where we perform aggregation twice and then combine the results.
#Separate score and weight columns
score_cols <- grep("score$", names(df))
weight_cols <- grep("Weight$", names(df))
#Replace NA's in corresponding score and weight columns to 0
inds <- is.na(df[score_cols]) | is.na(df[weight_cols])
df[score_cols][inds] <- 0
df[weight_cols][inds] <- 0
#Find sum of weight columns for each class
df1 <- aggregate(.~Class, cbind(df["Class"], df[weight_cols]), sum)
#find sum of multiplication of score and weight columns for each class
df2 <- aggregate(.~Class, cbind(df["Class"], df[score_cols] * df[weight_cols]), sum)
#Get the ratio between two dataframes.
cbind(df1[1], df2[-1]/df1[-1])
# Class Jan_18_score Feb_18_score
#1 1 2.92 4.66
#2 2 1.00 3.09
#3 3 4.11 7.70
How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), value = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 14L, 5L, 12L, 11L,
15L)), .Names = c("tag", "value"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.
Is this what you want?
subset(weights_df, tag %in% sample(levels(tag), 2))
If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility... you want 100 distinct tags at random, with the probability not weighted by their frequency:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, use one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]