my data is like this
df<- structure(list(label = c("afghanestan", "afghanestan", "afghanestanIndia",
"afghanestanindiaholad", "afghanestanUSA", "USA", "Argentina",
"Brazil", "Argentinabrazil", "Brazil"), Start = c(114, 516, 89,
22, 33, 67, 288, 362, 45, 362), Stop = c(127, 544, 105, 34, 50,
85, 299, 381, 68, 381)), class = "data.frame", .Names = c("label",
"Start", "Stop"), row.names = c(NA, -10L))
when I want to remove the exact duplicate , I simply do this
df[!duplicated(df[,c('label','Start','Stop')]),]
now the problem is that I want to recognize those that are similar in the label but possibly different in the start and stop. so I would like to generate something like this afterwards
label Start Stop NewLab
1 afghanestan 114 127 TRUE
2 afghanestan 516 544 TRUE
3 afghanestanIndia 89 105 FALSE
4 afghanestanindiaholad 22 34 FALSE
5 afghanestanUSA 33 50 FLASE
6 USA 67 85 FALSE
7 Argentina 288 299 FALSE
8 Brazil 362 381 FALSE
9 Argentinabrazil 45 68 FALSE
This would work in a single line of code:
df$NewLab <- df$label %in% df[duplicated(df$label), ]$label
And the output:
> df$NewLab <- df$label %in% df[duplicated(df$label), ]$label
> df
label Start Stop NewLab
1 afghanestan 114 127 TRUE
2 afghanestan 516 544 TRUE
3 afghanestanIndia 89 105 FALSE
4 afghanestanindiaholad 22 34 FALSE
5 afghanestanUSA 33 50 FALSE
6 USA 67 85 FALSE
7 Argentina 288 299 FALSE
8 Brazil 362 381 FALSE
9 Argentinabrazil 45 68 FALSE
Or in dplyr notation:
df <- dplyr::mutate(df, NewLab = label %in% df[duplicated(df$label), ]$label)
Here is a somewhat convoluted methods using dplyr
library(tidyverse)
df %>%
group_by(label) %>%
mutate(n = n()) %>%
group_by(Start, Stop) %>%
mutate(n2 = n()) %>%
mutate(newlabel = ifelse(n>1 & n2==1, TRUE, FALSE)) %>%
dplyr::select(-n, -n2)
First create a grouping variable of labels - take a count, then a grouping variable of start and stop times - take a count, use an ifelse to assign True/False, then remove the intermediate columns.
Related
Table showing correct format of dataI have a data frame with four columns, and I need to find a way to sum the values in the third column. Only if the numbers in the first two columns are different. The only way I can think of is to maybe do an If loop? Is that something can be done or is there a better way?
Genotype summary`
Dnov1a Dnov1b Freq rel_geno_freq
1 220 220 1 0.003367003
7 220 224 4 0.013468013
8 224 224 8 0.026936027
13 220 228 14 0.047138047
This is a portion of the data as an example, I need to sum the third column Freq for rows 7 and 13 because they are different.
Here's a tidyverse way of doing it:
library(tidyverse)
data <- tribble(
~Dnov1a, ~Dnov1b, ~Freq, ~rel_geno_freq,
220, 220, 1, 0.003367003,
220, 224, 4, 0.013468013,
224, 224, 8, 0.026936027,
220, 228, 14, 0.047138047)
data %>%
mutate(filter_column = if_else(Dnov1a != Dnov1b, TRUE, FALSE)) %>%
filter(filter_column == TRUE) %>%
summarise(Total = sum(Freq))
# A tibble: 1 x 1
Total
<dbl>
1 18
data$new = data$Dnov1a!=data$Dnov1b
data
Dnov1a Dnov1b Freq rel_geno_freq new
<int> <int> <int> <dbl> <lgl>
1 220 220 1 0.00337 TRUE
2 220 224 4 0.0135 FALSE
3 224 224 8 0.0269 TRUE
4 220 228 14 0.0471 FALSE
sum(data$Freq[data$new])
28
Is this what you are looking for?
I am not very experienced in if statements and loops in R.
Probably you can help me to solve my problem.
My task is to add +1 to df$fz if sum(df$fz) < 450, but in the same time I have to add +1 only to max values in df$fz till that moment when when sum(df$fz) is lower than 450
Here is my df
ID_PP <- c(3,6, 22, 30, 1234456)
z <- c(12325, 21698, 21725, 8378, 18979)
fz <- c(134, 67, 70, 88, 88)
df <- data.frame(ID_PP,z,fz)
After mutating the new column df$new_value, it should look like 134 68 71 88 89
At this moment I have this code, but it adds +1 to all values.
if (sum(df$fz ) < 450) {
mutate(df, new_value=fz+1)
}
I know that I can pick top_n(3, z) and add +1 only to this top, but it is not what I want, because in that case I have to pick a top manually after checking sum(df$fz)
From what I understood from #Oksana's question and comments, we probably can do it this way:
library(tidyverse)
# data
vru <- data.frame(
id = c(3, 6, 22, 30, 1234456),
z = c(12325, 21698, 21725, 8378, 18979),
fz = c(134, 67, 70, 88, 88)
)
# solution
vru %>% #
top_n(450 - sum(fz), z) %>% # subset by top z, if sum(fz) == 450 -> NULL
mutate(fz = fz + 1) %>% # increase fz by 1 for the subset
bind_rows( #
anti_join(vru, ., by = "id"), # take rows from vru which are not in subset
. # take subset with transformed fz
) %>% # bind thous subsets
arrange(id) # sort rows by id
# output
id z fz
1 3 12325 134
2 6 21698 68
3 22 21725 71
4 30 8378 88
5 1234456 18979 89
The clarifications in the comments helped. Let me know if this works for you. Of course, you can drop the cumsum_fz and leftover columns.
# Making variables to use in the calculation
df <- df %>%
arrange(fz) %>%
mutate(cumsum_fz = cumsum(fz),
leftover = 450 - cumsum_fz)
# Find the minimum, non-negative value to use for select values that need +1
min_pos <- min(df$leftover[df$leftover > 0])
# Creating a vector that adds 1 using the min_pos value and keeps
# the other values the same
df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
# Checking the sum of the new value
> sum(df$new_value)
[1] 450
>
> df
ID_PP z fz cumsum_fz leftover new_value
1 6 21698 67 67 383 68
2 22 21725 70 137 313 71
3 30 8378 88 225 225 89
4 1234456 18979 88 313 137 88
5 3 12325 134 447 3 134
EDIT:
Because utubun already posted a great tidyverse solution, I am going to translate my first one completely to base (it was a bit sloppy to mix the two anyway). Same logic as above, and using the data OP provided.
> # Using base
> df <- df[order(fz),]
>
> leftover <- 450 - cumsum(fz)
> min_pos <- min(leftover[leftover > 0])
> df$new_value <- c((head(sort(df$fz), min_pos) + 1), tail(sort(df$fz), length(df$fz) - min_pos))
>
> sum(df$new_value)
[1] 450
> df
ID_PP z fz new_value
2 6 21698 67 68
3 22 21725 70 71
4 30 8378 88 89
5 1234456 18979 88 88
1 3 12325 134 134
I am trying to generate the below series (see attached image) based on the logic given below. I was able create the series for one product and store (code given below). i am having trouble when i try to generalize this for multiple product store combinations. Could you please advise if there is an easier way to do this.
Logic
a given
b lag of d by 4
c initial c for first week thereafter (c previous row + b current - a current)
d initial d - c current
my code
library(dplyr)
df = structure(list(
Product = c(11078931, 11078931, 11078931, 11078931, 11078931,
11078931, 12021216, 12021216, 12021216, 12021216,
12021216, 12021216, 10932270, 10932270, 10932270,
10932270, 10932270),
STORE = c(90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 90, 547, 547,
547, 547, 547),
WEEK = c(201627, 201628, 201629, 201630, 201631, 201632, 201627, 201628,
201629, 201630, 201631, 201632, 201627, 201628, 201629, 201630,
201631),
WEEK_SEQ = c(914, 915, 916, 917, 918, 919, 914, 915, 916, 917, 918, 919,
914, 915, 916, 917, 918),
a = c(9.161, 9.087, 8.772, 8.698, 7.985, 6.985, 0.945, 0.734, 0.629, 0.599,
0.55, 0.583, 5.789, 5.694, 5.488, 5.47, 5.659),
initial_d = c(179, 179, 179, 179, 179, 179, 18, 18, 18, 18, 18, 18, 37, 37,
37, 37, 37),
Initial_c = c(62, 0, 0, 0, 0, 0, 33, 0, 0, 0, 0, 0, 59, 0, 0, 0, 0)
),
.Names = c("Product", "STORE", "WEEK", "WEEK_SEQ", "a", "initial_d",
"Initial_c"),
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -17L))
# filter to extract one product and store
# df = df %>% filter(Product == 11078931) %>% filter(STORE == 90)
df$b = 0
df$c = 0
df$d = NA
c_init = 62
d_init = 179
df$d <- d_init
df$c[1] <- c_init
RQ <- function(df,...){
for(i in seq_along(df$WEEK_SEQ)){
if(i>4){
df[i, "b"] = round(df[i-4,"d"], digits = 0)# Calculate b with the lag
}
if(i>1){
df[i, "c"] = round(df[i-1, "c"] + df[i, "b"] - df[i, "a"], digits = 0) # calc c
}
df[i, "d"] <- round(d_init - df[i, "c"], digits = 0) # calc d
if(df[i, "d"] < 0) {
df[i, "d"] <- 0 # reset negative d values
}
}
return(df)
}
df = df %>% group_by(SKU_CD, STORE_CD) %>% RQ(df)
Expected output series
could you please advice what is wrong in my code. this code works fine for one product and store combination. but for multiple product and store it doesn't. thanks for your time and input!
Consider base R's by which subsets the input dataframe by each combination of factor types to return a list of the subsetted dataframes. Then, run a do.call(rbind, ...) to row bind the list into one final dataframe.
RQ_dfs <- by(df, df[c("Product", "STORE")], FUN=RQ)
finaldf <- do.call(rbind, RQ_dfs)
While I cannot achieve your outputted series screenshot with posted data, the filtered commented out pairing does show up finaldf:
# # A tibble: 17 × 10
# Product STORE WEEK WEEK_SEQ a initial_d Initial_c b c d
# * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 11078931 90 201627 914 9.161 179 62 0 62 117
# 2 11078931 90 201628 915 9.087 179 0 0 53 126
# 3 11078931 90 201629 916 8.772 179 0 0 44 135
# 4 11078931 90 201630 917 8.698 179 0 0 35 144
# 5 11078931 90 201631 918 7.985 179 0 117 144 35
# 6 11078931 90 201632 919 6.985 179 0 126 263 0
# 7 12021216 90 201627 914 0.945 18 33 0 0 179
# 8 12021216 90 201628 915 0.734 18 0 0 -1 180
# 9 12021216 90 201629 916 0.629 18 0 0 -2 181
# 10 12021216 90 201630 917 0.599 18 0 0 -3 182
# 11 12021216 90 201631 918 0.550 18 0 179 175 4
# 12 12021216 90 201632 919 0.583 18 0 180 354 0
# 13 10932270 547 201627 914 5.789 37 59 0 0 179
# 14 10932270 547 201628 915 5.694 37 0 0 -6 185
# 15 10932270 547 201629 916 5.488 37 0 0 -11 190
# 16 10932270 547 201630 917 5.470 37 0 0 -16 195
# 17 10932270 547 201631 918 5.659 37 0 179 157 22
I'm trying to break down a vector of event times into episodes. An episode must meet 2 criteria. 1) It consists of 3 or more events and 2) those events have inter-event times of 25 time units or less. My data is organized in a data frame as shown below.
So far, I figured out that I can find the difference between events with diff(EventTime). By creating a logical vector that corresponds to events that the 2nd inter-event criterion, I can use rle(EpisodeTimeCriterion) to get a the total number, and length of episodes.
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion
25 NA NA
75 50 TRUE
100 25 TRUE
101 1 TRUE
105 4 TRUE
157 52 FALSE
158 1 TRUE
160 2 TRUE
167 7 TRUE
169 2 TRUE
170 1 TRUE
175 5 TRUE
178 3 TRUE
278 100 FALSE
302 24 TRUE
308 6 TRUE
320 12 TRUE
322 459 FALSE
However, I would like to know the timing of the episodes and 'rle()' doesn’t let me do that.
Ideally I would like to generate a data frame that looks like this:
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
1 4 75 105
2 7 158 178
3 3 302 322
I know that this is probably a simple problem but being new to R, the only solution I can envision is some series of loops. Is there a way of doing this without loops? Or is there package that lends itself to this sort of analysis?
Thanks!
Edited for clarity. Added a desired outcome data fame and expanded the example data to make it clearer.
You've got the pieces you need. You really need to make a variable that gives each episode a number/name so you can group by it. rle(...)$length gives you run lengths, so just use rep to repeat a number that number of times:
runs <- rle(df$EpisodeTimeCriterion)$lengths # You don't need this extra variable, but it makes the code more readable
df$Episode <- rep(1:length(runs), runs)
so df looks like
> head(df)
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion Episode
1 25 NA NA 1
2 75 50 TRUE 2
3 100 25 TRUE 2
4 101 1 TRUE 2
5 105 4 TRUE 2
6 157 52 FALSE 3
Now use dplyr to summarize the data:
library(dplyr)
df2 <- df %>% filter(EpisodeTimeCriterion) %>% group_by(Episode) %>%
summarise(EventsPerEpisode = n(),
EpisodeStartTime = min(EventTime),
EpisodeEndTime = max(EventTime))
which returns
> df2
Source: local data frame [3 x 4]
Episode EventsPerEpisode EpisodeStartTime EpisodeEndTime
(int) (int) (dbl) (dbl)
1 2 4 75 105
2 4 7 158 178
3 6 3 302 320
If you want your episode numbers to be integers starting with one, you can clean up with
df2$Episode <- 1:nrow(df2)
Data
If someone wants to play with the data, the results of dput(df) before running the above code:
df <- structure(list(EventTime = c(25, 75, 100, 101, 105, 157, 158,
160, 167, 169, 170, 175, 178, 278, 302, 308, 320, 322), TimeDifferenceBetweenNextEvent = c(NA,
50, 25, 1, 4, 52, 1, 2, 7, 2, 1, 5, 3, 100, 24, 6, 12, 459),
EpisodeTimeCriterion = c(NA, TRUE, TRUE, TRUE, TRUE, FALSE,
TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
TRUE, FALSE)), .Names = c("EventTime", "TimeDifferenceBetweenNextEvent",
"EpisodeTimeCriterion"), row.names = c(NA, -18L), class = "data.frame")
Here is one approach I came up with using a combination of cut2 from Hmisc package and cumsum to label episodes into numbers:
library(Hmisc)
library(dplyr)
df$episodeCut <- cut2(df$TimeDifferenceBetweenNextEvent, c(26))
df$episode <- cumsum((df$episodeCut == '[ 1,26)' & lag(df$episodeCut) != '[ 1,26)') | df$episodeCut != '[ 1,26)')
Output is as follows:
EventTime TimeDifferenceBetweenNextEvent EpisodeTimeCriterion episodeCut episode
1 25 50 FALSE [26,52] 1
2 75 25 TRUE [ 1,26) 2
3 100 1 TRUE [ 1,26) 2
4 101 4 TRUE [ 1,26) 2
5 105 52 TRUE [26,52] 3
6 157 52 FALSE [26,52] 4
As you can see, it tags rows 2, 3, 4 as belonging to a single episode.
Is this what you are looking for? Not sure from your description. So, my answer may be wrong.
I have a data frame consisting of an "ID" column and a "Diff" column. The ID column is responsible for marking groups of corresponding Diff values.
An example looks like this:
structure(list(ID = c(566, 566, 789, 789, 789, 487, 487, 11,
11, 189, 189), Diff = c(100, 277, 529, 43, NA, 860, 780, 445,
NA, 578, 810)), .Names = c("ID", "Diff"), row.names = c(9L, 10L,
20L, 21L, 22L, 25L, 26L, 51L, 52L, 62L, 63L), class = "data.frame")
My goal is to search each group for NAs in the Diff column and create a new column, that has either a "True" or "False" value for each row, depending if the corresponding group has an NA in Diff.
I tried
x <- aggregate(Diff ~ ID, data, is.na)
and
y <- aggregate(Diff ~ ID, data, function(x) any(is.na(x)))
The idea was to merge the result depending on ID. However, none of the above created a useful result. I know R can do it … and after searching for quite a while I ask you how :)
You can use the plyr and ddply
require(plyr)
ddply(data, .(ID), transform, na_diff = any(is.na(Diff)))
## ID Diff na_diff
## 1 11 445 TRUE
## 2 11 NA TRUE
## 3 189 578 FALSE
## 4 189 810 FALSE
## 5 487 860 FALSE
## 6 487 780 FALSE
## 7 566 100 FALSE
## 8 566 277 FALSE
## 9 789 529 TRUE
## 10 789 43 TRUE
## 11 789 NA TRUE
A very similar solution to #dickoa except in base:
do.call(rbind,by(data,data$ID,function(x)transform(x,na_diff=any(is.na(Diff)))))
# ID Diff na_diff
# 11.51 11 445 TRUE
# 11.52 11 NA TRUE
# 189.62 189 578 FALSE
# 189.63 189 810 FALSE
# 487.25 487 860 FALSE
# 487.26 487 780 FALSE
# 566.9 566 100 FALSE
# 566.10 566 277 FALSE
# 789.20 789 529 TRUE
# 789.21 789 43 TRUE
# 789.22 789 NA TRUE
Similarly, you could avoid transform with:
data$na_diff<-with(data,by(Diff,ID,function(x) any(is.na(x)))[as.character(ID)])
(You have two viable strategies already, but here is another which may be conceptually easier to follow if you are relatively new to R and aren't familiar with the way plyr works.)
I often need to know how many NAs I have in different variables, so here is a convenience function I use standard:
sna <- function(x){
sum(is.na(x))
}
From there, I sometimes use aggregate(), but sometimes I find ?summaryBy in the doBy package to be more convenient. Here's an example:
library(doBy)
z <- summaryBy(Diff~ID, data=my.data, FUN=sna)
z
ID Diff.sna
1 11 1
2 189 0
3 487 0
4 566 0
5 789 1
After this, you just need to use ?merge and convert the count of NAs to a logical to get your final data frame:
my.data <- merge(my.data, z, by="ID")
my.data$Diff.sna <- my.data$Diff.sna>0
my.data
ID Diff Diff.sna
1 11 445 TRUE
2 11 NA TRUE
3 189 578 FALSE
4 189 810 FALSE
5 487 860 FALSE
6 487 780 FALSE
7 566 100 FALSE
8 566 277 FALSE
9 789 529 TRUE
10 789 43 TRUE
11 789 NA TRUE