I have a simple Table:
ID|Value
1|10
1|20
1|-5
2|25
3|2
3|15
4|8
5|18
6|33
6|5
6|50
Currently I use this code:
for (row in 1:nrow(Table)) {
  ID <- Table[row, 1]
  Value <- Table[row, 2]
  if (oldID == ID) {
    currentValue <- currentValue * ((100 - Value)/100)
  } else {
    addrow <- data.frame(oldID, currentValue)
    PriceRR <- rbind(PriceRR, addrow)
    oldID <- ID
    currentValue <- 100 - Value
  }
}
This allocates a discount for a later DAX value in Power BI.
But it is slow as hell, so I want to parallelize it.
daply might do the job, but I do not know its inner workings.
So basically, this is what I need:
Split the table into sets by ID:
Set1: 1,10 1,20 1,5
Set2: 2,25
Set3: 3,2 3,15
...
Apply a function to the sets in parallel.
On the first call of the function within a set, initialize currentValue <- 100; after that:
currentValue <- currentValue * ((100 - Value)/100)
For Set1.1: 90 <- 100 * ((100 - 10)/100)
For Set1.2: 72 <- 90 * ((100 - 20)/100)
For Set1.3: 68.4 <- 72 * ((100 - 5)/100)
It should return ID=1, Value=68.4.
I need to know: is it possible to make a variable persist in memory for the duration of applying a function to a set, for as long as the set lives?
Will daply or another function create a new worker thread to apply the function to each set?
I am an R beginner and must jump right into the inner workings of the R environment. :-)
Sven
An option with reduce from purrr
library(dplyr)
library(purrr)
data %>%
  group_by(ID) %>%
  summarise(Result = reduce(Value, ~ .x * (100 - .y)/100, .init = 100))
# A tibble: 6 x 2
#      ID Result
#   <int>  <dbl>
# 1     1   68.4
# 2     2   75
# 3     3   83.3
# 4     4   92
# 5     5   82
# 6     6   31.8
data
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
Here's an approach with dplyr and Reduce from base R:
library(dplyr)
data %>%
  group_by(ID) %>%
  summarize(Result = Reduce(function(x, y) x * ((100 - y)/100),
                            Value, init = 100))
# A tibble: 6 x 2
ID Result
<int> <dbl>
1 1 68.4
2 2 75
3 3 83.3
4 4 92
5 5 82
6 6 31.8
Reduce is a tricky function, mostly because the documentation is terrible. Reduce applies a function with two arguments to the elements of a vector in succession, with the previous result as the first argument and the current element as the second. You can set an initial value with init =.
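For example, setting accumulate = TRUE lets you watch the intermediate values for group 1 (10, 20, 5) step by step:
Reduce(function(x, y) x * ((100 - y)/100), c(10, 20, 5), init = 100, accumulate = TRUE)
# [1] 100.0  90.0  72.0  68.4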
I notice in your explanation that your expected output for group 1 is 68.4. This is only true if the value in row 3 is 5 rather than the -5 you posted. Since this was the only negative value in your data, I went ahead and changed it to 5.
Data
data <- structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, 5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
Your original script is slow for a couple of reasons. First, you are looping through every element of your initial table instead of taking advantage of the vectorized nature of R. Second, there is an rbind call within the loop; binding is a slow operation, especially as the object grows.
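On the second point, here is a minimal sketch of the usual fix (with a hypothetical n_iterations and a stand-in result): pre-allocate a list, fill it in the loop, and bind once at the end instead of calling rbind on every pass.
n_iterations <- 6                          # hypothetical number of groups
pieces <- vector("list", n_iterations)     # pre-allocate, don't grow
for (i in seq_len(n_iterations)) {
  pieces[[i]] <- data.frame(oldID = i, currentValue = 100)  # stand-in result
}
PriceRR <- do.call(rbind, pieces)          # a single rbind instead of many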
It looks like the objective is a cumulative product of the Value column, grouped by the ID column.
Here is a base R solution using the split, apply and merge strategy.
Table <-structure(list(ID = c(1L, 1L, 1L, 2L, 3L, 3L, 4L, 5L, 6L, 6L,
6L), Value = c(10L, 20L, -5L, 25L, 2L, 15L, 8L, 18L, 33L, 5L,
50L)), class = "data.frame", row.names = c(NA, -11L))
# Create a column for the (100 - Value)/100 factor
Table$factor <- (100 - Table$Value)/100

# Split by ID
dfs <- split(Table, Table$ID)

currentValue <- sapply(dfs, function(x) {
  # find the cumulative product of the factor column
  product <- cumprod(x$factor)
  # return the last value from the cumprod
  return(100 * product[length(product)])
})

# Create the final answer
PriceRR <- data.frame(oldID = as.integer(names(dfs)), currentValue)
PriceRR
oldID currentValue
1 1 75.600
2 2 75.000
3 3 83.300
4 4 92.000
5 5 82.000
6 6 31.825
This script uses the cumprod function, which is vectorized and thus very fast. It also avoids the slow operation of repeatedly growing the final dataframe.
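And since the original question asked about parallelism: the same per-group function can be handed to parallel::mclapply. This is only a sketch; mclapply forks, so on Windows you would need parLapply with a cluster instead, and for a table this small the parallel overhead will outweigh any gain.
library(parallel)
currentValue <- unlist(mclapply(dfs, function(x) {
  100 * prod(x$factor)   # prod() gives the final cumprod value directly
}, mc.cores = 2))
PriceRR <- data.frame(oldID = as.integer(names(dfs)), currentValue)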
Related
I am given a dataframe with 10 students, each having a score for 4 different tests. I must select the 3 best scores and compute their average.
noma interro1 interro2 interro3 interro4
1 836016120449 6 3 NA 3
2 596844884419 1 4 2 8
3 803259953398 2 2 9 1
4 658786759629 3 1 3 2
5 571155022756 4 9 1 4
6 576037886365 8 7 8 7
7 045086625199 9 6 7 6
8 621909979467 5 8 4 5
9 457029205538 7 5 6 9
10 402526220817 NA 10 5 10
This dataframe provides the scores for 4 tests for 10 students.
Write a function that calculates the average score for the 3 best tests.
Calculate this average score for the 10 students.
average <- function(t){
x <- sort(t, decreasing = TRUE)[1:3]
return(mean(x, na.rm=TRUE))
}
apply(interro2, 1, average)
Considering I want the 3 best, I thought that sort() could be useful here; however, what I receive is:
In mean.default(x, na.rm = TRUE) :
argument is not numeric or logical: returning NA
I tried this one too:
average <- function(t){
rowMeans(sort(t, decreasing = TRUE, na.rm=TRUE)[1:3])
}
UPDATE: answered. The dimensions of the dataframe were not correct in the apply line; I had to remove the first column, which contained the names of the students. Thus the version below works:
average <- function(t){
x <- sort(t, decreasing = TRUE)[1:3]
return(mean(x, na.rm=TRUE))
}
apply(interro2[-1], 1, average)
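The underlying reason is worth spelling out: apply() coerces the data frame to a matrix first, so with the noma column included every score becomes a string and mean() returns NA with that warning. A quick check, assuming noma is stored as text as its leading zero suggests:
class(as.matrix(interro2)[1, "interro1"])  # "character" -- noma contaminates every column
class(as.matrix(interro2[-1])[1, 1])       # "integer"   -- numeric once noma is dropped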
Try pivoting the scores to long format, then sort them by name and keep the top 3 scores. Finally, take the average grouped by name:
library(dplyr)
library(tidyr)
data <- data.frame(
stringsAsFactors = FALSE,
noma = c("836016120449","596844884419",
"803259953398","658786759629","571155022756",
"576037886365","045086625199","621909979467","457029205538",
"402526220817"),
interro1 = c(6L, 1L, 2L, 3L, 4L, 8L, 9L, 5L, 7L, NA),
interro2 = c(3L, 4L, 2L, 1L, 9L, 7L, 6L, 8L, 5L, 10L),
interro3 = c(NA, 2L, 9L, 3L, 1L, 8L, 7L, 4L, 6L, 5L),
interro4 = c(3L, 8L, 1L, 2L, 4L, 7L, 6L, 5L, 9L, 10L)
)
data <- data %>%
  pivot_longer(!noma, names_to = "interro", values_to = "value") %>%
  replace_na(list(value = 0))
data_new1 <- data[order(data$noma, data$value, decreasing = TRUE), ]        # order data descending
data_new1 <- Reduce(rbind, by(data_new1, data_new1["noma"], head, n = 3))   # top 3 values per group
data_new1 <- data_new1 %>% group_by(noma) %>% summarise(Value_mean = mean(value))
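A more compact variant of the same idea, assuming dplyr >= 1.0, keeps the top three with slice_max instead of the order/by/Reduce steps:
data %>%
  group_by(noma) %>%
  slice_max(value, n = 3, with_ties = FALSE) %>%
  summarise(Value_mean = mean(value))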
I have a list of nested data frames and I want to extract the observations of the earliest year. My problem is that the first year changes between data frames: it is either 1992 or 2005.
I want to create a list to store them. I tried with which, but since the same years occur in several data frames, observations are repeated, and I want them kept apart:
new_df <- which(df[[i]] == 1992 | df[[i]] == 2005)
I've tried with ifelse(), but I have to run an lm operation afterwards, and it doesn't work. And I can't take only the first rows, because the years are repeated.
My code looks like this:
df<- list(a<-data.frame(a_1<-(1992:2015),
a_2<-sample(1:24)),
b<-data.frame(b_1<-(1992:2015),
b_2<-sample(1:24)),
c<-data.frame(c_1<-(2005:2015),
c_2<-sample(1:11)),
d<-data.frame(d_1<-(2005:2015),
d_2<-sample(1:11)))
You can define a function that gets the data from one data.frame, then loop over the list to extract the values.
Below I use map from the purrr package, but you can also use lapply or for loops.
Please do not use <- when assigning values in a function call (here data.frame()), because it will mess up the column names. = is the operator used for named arguments in function calls, and it's okay to use it. You can read this ;)
df<- list(a<-data.frame(a_1 = (1992:2015),
a_2 = sample(1:24)),
b<-data.frame(b_1 = (1992:2015),
b_2 = sample(1:24)),
c<-data.frame(c_1 = (2005:2015),
c_2 = sample(1:11)),
d<-data.frame(d_1 = (2005:2015),
d_2 = sample(1:11)))
extract_miny <- function(df){
  miny <- min(df[, 1])
  res <- df[df[, 1] == miny, 2]
  names(res) <- miny
  return(res)
}
map(df, extract_miny)
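If you then want a single data frame rather than a list (the exact values vary because of sample()), a small follow-up sketch:
res <- map(df, extract_miny)
data.frame(year = as.integer(names(unlist(res))), value = unname(unlist(res)))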
If the data is sorted as in the example, you can slice() the first row for the information. Notice the use of = rather than <- when creating the nested dataframes.
library(tidyverse)
df <- list(
a = data.frame(a_1 = (1992:2015),
a_2 = sample(1:24)),
b = data.frame(b_1 = (1992:2015),
b_2 = sample(1:24)),
c = data.frame(c_1 = (2005:2015),
c_2 = sample(1:11)),
d = data.frame(d_1 = (2005:2015),
d_2 = sample(1:11))
)
df %>%
  imap_dfr(~ slice(.x, 1) %>%
             set_names(c("year", "value")) %>%
             mutate(dataframe = .y) %>%
             as_tibble())
# A tibble: 4 x 3
year value dataframe
<int> <int> <chr>
1 1992 19 a
2 1992 2 b
3 2005 1 c
4 2005 5 d
You may subset anonymously.
lapply(df, \(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value'))) |>
  do.call(what = rbind)
# year value
# 1 1992 6
# 2 1992 9
# 3 2005 11
# 4 2005 11
Or, maybe better, by creating a variable indicating which sample the value stems from:
Map(`[<-`, df, 'sample', value=letters[seq_along(df)]) |>
lapply(\(x) setNames(x[x[[1]] == min(x[[1]]), ], c('year', 'value', 'sample'))) |>
do.call(what=rbind)
# year value sample
# 1 1992 6 a
# 2 1992 9 b
# 3 2005 11 c
# 4 2005 11 d
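If the backtick-quoted `[<-` looks opaque: it is the subassignment function, so the Map() line above is equivalent to this more explicit sketch:
df2 <- Map(function(x, s) { x$sample <- s; x }, df, letters[seq_along(df)])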
Data:
df <- list(structure(list(a_1.....1992.2015. = 1992:2015, a_2....sample.1.24. = c(6L,
18L, 23L, 5L, 7L, 14L, 4L, 10L, 19L, 17L, 15L, 1L, 11L, 22L,
13L, 8L, 20L, 16L, 2L, 3L, 24L, 21L, 9L, 12L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(b_1.....1992.2015. = 1992:2015, b_2....sample.1.24. = c(9L,
24L, 18L, 8L, 16L, 11L, 13L, 23L, 15L, 20L, 19L, 21L, 12L, 22L,
7L, 3L, 6L, 17L, 2L, 5L, 4L, 10L, 1L, 14L)), class = "data.frame", row.names = c(NA,
-24L)), structure(list(c_1.....2005.2015. = 2005:2015, c_2....sample.1.11. = c(11L,
2L, 5L, 10L, 9L, 6L, 1L, 7L, 3L, 8L, 4L)), class = "data.frame", row.names = c(NA,
-11L)), structure(list(d_1.....2005.2015. = 2005:2015, d_2....sample.1.11. = c(11L,
2L, 5L, 1L, 6L, 9L, 3L, 7L, 10L, 4L, 8L)), class = "data.frame", row.names = c(NA,
-11L)))
I have a dataset with taxonomy assignment and I want to extract the genus in a new column.
library(tidyverse)
library(magrittr)
library(stringr)
df <- structure(list(C043 = c(18361L, 59646L, 27575L, 163L, 863L, 3319L,
0L, 6L), C057 = c(20020L, 97610L, 13427L, 1L, 161L, 237L, 2L,
105L), taxonomy = structure(c(3L, 2L, 1L, 6L, 4L, 4L, 5L, 2L), .Label = c("k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__",
"k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri"
), class = "factor")), .Names = c("C043", "C057", "taxonomy"), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 8L, 10L), class = "data.frame")
So this is my function (it works):
extract_genus <- function(str){
  genus <- str_split(str, pattern = ";")[[1]][6]
  genus %<>% str_sub(start = 4) #%>% as.character
  return(genus)
}
But when I apply it in mutate (with or without as.character), it repeats the first row's value down the new column.
df %>% mutate(genus = extract_genus(taxonomy))
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Escherichia
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Escherichia
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Escherichia
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Escherichia
When I use sapply it works, but I don't want that; I want a solution within the dplyr pipeline.
df_group_gen$genus <- sapply(df_group_gen$taxonomy, extract_genus)
C043 C057 taxonomy genus
1 18361 20020 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Escherichia;s__coli Escherichia
2 59646 97610 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;s__cloacae Enterobacter
3 27575 13427 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Enterobacter;NA Enterobacter
4 163 1 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Pseudomonadales;f__Pseudomonadaceae;g__Pseudomonas;s__stutzeri Pseudomonas
5 863 161 k__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacteriales;f__Enterobacteriaceae;g__Klebsiella;s__ Klebsiella
Why doesn't mutate compute as we would expect? I found this question, but no answer is provided, only ad hoc code.
Thank you :)
You can Vectorize your function to allow mutate to occur on every row:
ex_gen <- Vectorize(extract_genus, vectorize.args='str')
df %>% mutate(genus=ex_gen(taxonomy))
Alternatively, you can use rowwise to mutate each row:
df %>%
rowwise() %>%
mutate(genus = extract_genus(taxonomy))
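The root cause is that extract_genus() is not vectorized: str_split(str, ...)[[1]] keeps only the pieces of the first string, so the function returns a single value and mutate() recycles it down the column. Besides Vectorize or rowwise, a fully vectorized sketch (using purrr's map_chr, which the tidyverse already loads):
df %>%
  mutate(genus = taxonomy %>%
           as.character() %>%
           str_split(";") %>%
           map_chr(6) %>%         # take the 6th field, "g__..."
           str_sub(start = 4))    # drop the "g__" prefix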
I have a data frame which holds the times of random events occurring. What I want is to subset the first case where either 'place' or 'Show' appears under Event, combined with 'kick' or 'Type' appearing under Event 2. So in this case, 'place run' wouldn't satisfy the condition, even though 'place' does appear under Event. By the first case, I mean the first occurrence of either combination before the time resets back to 0. So for the first segment, the output I would want is 27, as this is the first time value where the condition is satisfied. For the second segment, I would want 16. For the last segment, the output would be 41. (I've put asterisks around the rows which meet the condition so they are easy to locate. These aren't actually present in the data.)
Time Event Event 2
0 Begin NA
23 place run
27 *Show Type*
34 *place kick*
41 good bye
42 *place kick*
0 Begin NA
11 Hat Yellow
13 Show Green
16 *place kick*
20 place hit
29 sign redeem
35 *Show Type*
0 Begin NA
5 Cream Glue
17 Show Green
18 Orange Screen
30 place hit
33 sign redeem
41 *Show Type*
0 Begin NA
...
EDIT: So far, what I'm able to do is subset the rows that have Show Type or place kick with the following code:
Rows <- Data[(Data[,'Event'] == 'Show' & Data[,'Event 2']== 'Type') |
(Data[,'Event'] == 'place' & Data[,'Event 2']== 'kick' ),]
Where I'm struggling is resetting the search for these values after Time resets back to 0. Any help will be greatly appreciated!
The &-infix-function can be wrapped with the which function to generate a vector of the row numbers where those conditions are met. Then follow that with [1] to get just the first one.
df[ which(df[ , 'Event'] %in% c('place','Show') & df[ ,'Event.2'] %in% c('kick','Type') )[1], ]
Notice that I didn't leave a space between Event and 2, since that would have been parsed by R as two different symbols. The make.names function is used by all the read.* functions to remove invalid punctuation from column names, which is why the column appears as Event.2.
To make this process reset at each new segment, you would build a segment vector, probably with something like segvec <- cumsum(df$Time == 0), and then use the split-apply-combine approach to get values just within the resulting subsets.
Some lightly tested code:
lapply(split(dat, cumsum(dat[, 'Time'] == 0)),
       function(df) {
         df[which(df[, 'Event'] %in% c('place', 'Show') &
                  df[, 'Event.2'] %in% c('kick', 'Type'))[1], ]
       })
#------
$`1`
Time Event Event.2
3 27 Show Type
$`2`
Time Event Event.2
10 16 place kick
$`3`
Time Event Event.2
20 41 Show Type
dput(dat)
structure(list(Time = c(0L, 23L, 27L, 34L, 41L, 42L, 0L, 11L,
13L, 16L, 20L, 29L, 35L, 0L, 5L, 17L, 18L, 30L, 33L, 41L), Event = structure(c(1L,
6L, 7L, 6L, 3L, 6L, 1L, 4L, 7L, 6L, 6L, 8L, 7L, 1L, 2L, 7L, 5L,
6L, 8L, 7L), .Label = c("Begin", "Cream", "good", "Hat", "Orange",
"place", "Show", "sign"), class = "factor"), Event.2 = structure(c(NA,
7L, 9L, 5L, 1L, 5L, NA, 10L, 3L, 5L, 4L, 6L, 9L, NA, 2L, 3L,
8L, 4L, 6L, 9L), .Label = c("bye", "Glue", "Green", "hit", "kick",
"redeem", "run", "Screen", "Type", "Yellow"), class = "factor")), .Names = c("Time",
"Event", "Event.2"), class = "data.frame", row.names = c(NA,
-20L))
Far less succinct (and probably less optimal) than 42-'s, but:
library(stringi)
read.table(text="Time Event Event2
0 Begin NA
23 place run
27 *Show Type*
34 *place kick*
41 good bye
42 *place kick*
0 Begin NA
11 Hat Yellow
13 Show Green
16 *place kick*
20 place hit
29 sign redeem
35 *Show Type*
0 Begin NA
5 Cream Glue
17 Show Green
18 Orange Screen
30 place hit
33 sign redeem
41 *Show Type*
0 Begin NA", header=TRUE, stringsAsFactors=FALSE) -> df
library(dplyr)
df$grp <- 0
df[which(df$Time == 0),]$grp <- 1
df$grp <- cumsum(df$grp)
group_by(df, grp) %>%
  filter(grepl("place|show", Event, ignore.case = TRUE) &
           grepl("kick|type", Event2, ignore.case = TRUE)) %>%
  slice(1) %>%
  select(-grp)
## Source: local data frame [3 x 4]
## Groups: grp [3]
##
## grp Time Event Event2
## <dbl> <int> <chr> <chr>
## 1 1 27 *Show Type*
## 2 2 16 *place kick*
## 3 3 41 *Show Type*
How can I select all of the rows for a random sample of column values?
I have a dataframe that looks like this:
tag weight
R007 10
R007 11
R007 9
J102 11
J102 9
J102 13
J102 10
M942 3
M054 9
M054 12
V671 12
V671 13
V671 9
V671 12
Z990 10
Z990 11
That you can replicate using...
weights_df <- structure(list(tag = structure(c(4L, 4L, 4L, 1L, 1L, 1L, 1L,
3L, 2L, 2L, 5L, 5L, 5L, 5L, 6L, 6L), .Label = c("J102", "M054",
"M942", "R007", "V671", "Z990"), class = "factor"), weight = c(10L,
11L, 9L, 11L, 9L, 13L, 10L, 3L, 9L, 12L, 12L, 13L, 9L, 12L, 10L,
11L)), .Names = c("tag", "weight"), class = "data.frame", row.names = c(NA,
-16L))
I need to create a dataframe containing all of the rows from the above dataframe for two randomly sampled tags. Let's say tags R007 and M942 get selected at random; my new dataframe needs to look like this:
tag weight
R007 10
R007 11
R007 9
M942 3
How do I do this?
I know I can create a list of two random tags like this:
library(plyr)
tags <- ddply(weights_df, .(tag), summarise, count = length(tag))
set.seed(5464)
tag_sample <- tags[sample(nrow(tags),2),]
tag_sample
Resulting in...
tag count
4 R007 3
3 M942 1
But I just don't know how to use that to subset my original dataframe.
Is this what you want?
subset(weights_df, tag %in% sample(levels(tag), 2))
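For a reproducible draw you can seed first, e.g. with the seed from the question:
set.seed(5464)
keep <- sample(levels(weights_df$tag), 2)   # two tags at random
subset(weights_df, tag %in% keep)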
If your data.frame is named dfrm, then this will select 100 random tags
dfrm[ sample(NROW(dfrm), 100), "tag" ] # possibly with repeats
If, on the other hand, you want a dataframe with the same columns (possibly with repeats):
samp <- dfrm[ sample(NROW(dfrm), 100), ] # leave the col name entry blank to get all
A third possibility... you want 100 distinct tags at random, with the probability not weighted by frequency at all:
samp.tags <- unique(dfrm$tag)[ sample(length(unique(dfrm$tag)), 100) ]
Edit: With the revised question, one of these:
subset(dfrm, tag %in% c("R007", "M942") )
Or:
dfrm[dfrm$tag %in% c("R007", "M942"), ]
Or:
dfrm[grep("R007|M942", dfrm$tag), ]
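One caveat on the grep() version, worth flagging: it does substring/regex matching, so a hypothetical tag such as "R007X" would also be caught, while the %in% forms match exactly.
grepl("R007|M942", c("R007", "R007X"))
# [1] TRUE TRUE   -- both match; %in% would keep only the exact "R007"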