I'm working with time series data on a variable that generally increases slowly over time. Very simplified example:
df <- data.frame(index=1:8, value = c(rep(0.25, 3),1.95,0.25,rep(0.5,3)))
index value
1 0.25
2 0.25
3 0.25
4 1.95
5 0.25
6 0.50
7 0.50
8 0.50
A recurring feature of the dataset is what happens at index 4: the value spikes upward then immediately comes back down again. I want to remove those values. (There are also points in my dataset where the value makes a small increase followed by a small decrease some time later, but I want to keep those.)
I have found a way of removing the values: use diff to calculate the change from the previous value, turn the data frame upside down, use diff again to calculate the change from the next value, and remove rows where the two diffs are the same. But that seems like the least efficient process ever:
library(dplyr)
df %>%
  mutate(diffprev = diff(value) %>% c(0, .)) %>%
  arrange(desc(index)) %>%
  mutate(diffnext = diff(value) %>% c(0, .)) %>%
  filter(diffprev == 0 | diffprev != diffnext)
I realise that if the spike in value happened at index 5 rather than 4 this wouldn't work, but in the full dataset this is so unlikely that, unless there's a simple fix, I'm not going to worry about it. But what would be a better way of going about this?
You could try:
df %>% filter(lag(value) != lead(value) | (value - lag(value)) %in% c(0, NA))
You might also be interested in the lag and lead functions from dplyr.
Edit: thanks @Frank for a couple of modifications.
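Applied to the example df above, this filter should keep every row except the spike at index 4:
  index value
1     1  0.25
2     2  0.25
3     3  0.25
4     5  0.25
5     6  0.50
6     7  0.50
7     8  0.50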
You don't need to rearrange. The first diff column you make contains all the info you need:
df %>%
  mutate(diffprev = diff(value) %>% c(0, .)) %>%
  filter(diffprev == 0 | diffprev != -lead(diffprev)) %>%
  select(-diffprev)
which gives
index value
1 1 0.25
2 2 0.25
3 3 0.25
4 5 0.25
5 6 0.50
6 7 0.50
7 8 0.50
I'm hoping to clean out a time series dataset so that only the maximum value of each event is retained. To start, I filtered the data so that only values above a certain threshold are kept, but there are still values that, while separated by only a millisecond or two, are effectively duplicates and will throw off later analysis.
My initial dataset has >100,000 rows and a few more columns but here is the top of a smaller version.
head(shortfilter)
Time (Sec) ECG (Channel 6)
1 5534.023 1.371761
2 5534.024 1.232424
3 5534.152 1.414432
4 5534.153 1.359914
5 5534.272 1.639033
6 5534.396 1.476161
To explain: I don't have a concrete time window that readings need to fall within to be considered duplicates, but the rest of the data is similar to this in that duplicates are generally within 0.003 s of each other.
Time (Sec) ECG (Channel 6)
1 5534.023 1.371761 #<-- Higher value (keep)
2 5534.024 1.232424
3 5534.152 1.414432 #<-- Higher value (keep)
4 5534.153 1.359914
5 5534.272 1.639033 #<-- Only value (keep)
6 5534.396 1.476161 #<-- Only value (keep)
Ideal:
Time (Sec) ECG (Channel 6)
1 5534.023 1.371761
2 5534.152 1.414432
3 5534.272 1.639033
4 5534.396 1.476161
5 ____.___ _.______
6 ____.___ _.______
I'll add my initial attempt at some conditionals to do what I was hoping for, but keep in mind that I'm new to coding in general, so I know it isn't remotely correct; I just wanted to get some ideas out there. Hopefully it gives some additional information on what I'm trying to do. I'm positive the formatting and syntax are complete gibberish, but I'm sure many of you will understand what I was going for...
for (i in shortfilter$`Time (Sec)`){
  for (j in shortfilter$`ECG (Channel 6)`){
    if ((i+1)-i > 0.01 && j > j+1){
      remove(j+1)
    } else if ((i+1)-i > 0.01 && j < j+1){
      remove(j)
    }
  }
}
Welcome to StackOverflow! My solution compares each value to the next value and finds the difference, then adjusts the predicted grouping number based on those differences. Currently it can handle up to five consecutive duplicate readings, but you can easily add more steps if you need to.
library(tidyverse)
tibble::tribble(
  ~`Time`, ~`ECG`,
  5534.023, 1.371761,
  5534.024, 1.232424,
  5534.025, 1.27,
  5534.026, 1.28,
  5534.152, 1.414432,
  5534.153, 1.359914,
  5534.272, 1.639033,
  5534.396, 1.476161
) %>%
  arrange(Time) %>%
  mutate(sim_val = if_else(!is.na(lead(Time)), lead(Time) - Time, 5),
         Num = if_else(sim_val <= 0.03, row_number() + 1, as.numeric(row_number())),
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num),
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num),
         Num = if_else(sim_val <= 0.03 & Num < lead(Num), Num + 1, Num)) %>%
  arrange(Num, desc(ECG)) %>%
  group_by(Num) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(Time, ECG)
Also, feel free to fine-tune the threshold of 0.03 to your data. Let me know if this works!
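An alternative sketch (not tested against the full dataset, and assuming the rows are already sorted by time): start a new group whenever the gap to the previous reading exceeds the threshold, then keep the highest ECG value within each group.
library(dplyr)

shortfilter %>%
  arrange(`Time (Sec)`) %>%
  # start a new group whenever the gap to the previous reading exceeds 0.03 s
  mutate(grp = cumsum(c(TRUE, diff(`Time (Sec)`) > 0.03))) %>%
  group_by(grp) %>%
  slice_max(`ECG (Channel 6)`, n = 1) %>%  # keep the highest ECG value per group
  ungroup() %>%
  select(-grp)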
I'm trying to implement a lag function, but it seems I need the column to already exist for it to work.
Let's say I have this data frame:
df <- data.frame(AgeGroup=c("0-4", "5-39", "40-59", "60-69", "70+"),
                 px=c(.99, .97, .95, .96, .94))
I want a column Ix that is lag(Ix)*lag(px), starting from 1000.
The data I want is
df2 <- data.frame(AgeGroup=c("0-4", "5-39", "40-59", "60-69", "70+"),
                  px=c(.99, .97, .95, .96, .94),
                  Ix=c(1000, 990, 960.3, 912.285, 875.7936))
I've tried
library(dplyr)
df2<-mutate(df,Ix = lag(Ix, default = 1000)*lag(px))
ifelse statements don't work either, even after first creating a reference value:
df$Ix2=NA
df[1,3]=1000
df$Ix <- ifelse(df[,3] == 1000, 1000,
                lag(df$Ix, default = 1000) * lag(px, default = 1))
I have also been playing around with creating a separate Ix column with Ix = 1000 and then running the above, but it doesn't seem to work. Does anyone have any ideas how I can create this Ix column?
You could use cumprod() combined with dplyr::lag() for this:
> df$Ix <- 1000*dplyr::lag(cumprod(df$px), default = 1)
> df
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
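As a quick sanity check of why this works: each Ix value is just 1000 times the product of all earlier px values, which is exactly what the lagged cumulative product supplies. Running this should return TRUE:
all.equal(1000 * dplyr::lag(cumprod(df$px), default = 1),
          c(1000, 990, 960.3, 912.285, 875.7936))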
You can also use accumulate from purrr. Using head(px, -1) includes all values in px except the last one, and the initial Ix is set to 1000.
library(tidyverse)
df %>%
  mutate(Ix = accumulate(head(px, -1), prod, .init = 1000))
Output
AgeGroup px Ix
1 0-4 0.99 1000.0000
2 5-39 0.97 990.0000
3 40-59 0.95 960.3000
4 60-69 0.96 912.2850
5 70+ 0.94 875.7936
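For completeness, an equivalent base R sketch uses Reduce() with accumulate = TRUE to build the same running product:
df$Ix <- Reduce(`*`, head(df$px, -1), init = 1000, accumulate = TRUE)
# Ix is then 1000.0000, 990.0000, 960.3000, 912.2850, 875.7936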
Hello coding community,
I have a two-part question that is half answered:
1. Transpose (melt) the data frame to my liking - done
2. Add rows of data based on the results found in the "removed" column, a column created in the transposing step - stuck here
df<- read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t")
df_transformed<-tidyr::gather(df, day, removed, -(1:2), na.rm = TRUE) # melted data
In my example here (df), I have an experiment run over 8 days. On certain days, I remove data points, and I am only interested in these days (hence why I added na.rm = TRUE in the transposing process). I sometimes remove 1 data point, or 4, but this could be any number really.
I would like the removed data points to be called "individuals", and for them to be counted in chronological order. Therefore, I first need to add a column called "individuals"
df_transformed$individual <- ""
I would like to fill in the "individual" column based on the results in the "removed" column.
Example: cage 2 had only 1 data point removed, and it was on day_8; I would therefore like to add a 1 in the "individual" column for that row. Cage 4, on the other hand, had data points removed on day_5 (1 data point) and day_7 (3 data points), for a total of 4 data points, aka 4 "individuals". Therefore, for cage 4, starting with day_5, I would like to add a 1 in the "individual" column, and for day_7, create 3 rows of data and continue my individual count with 2, 3, 4. If day_8 had 3 more data points removed, the individual count would continue with 5, 6, 7.
My desired result for my example data set today would be this:
desired_results <- read.table("https://pastebin.com/raw/r7QrC0y3", header=T, sep="\t") # 68 total rows of data
Interesting piece of information: The total number of rows in my final data set should equal the sum of all removed data points:
sum(df_transformed$removed) # 68
Thank you StackOverflow community. Looking forward to seeing the results.
We can use complete to create a sequence from 1 to the removed count for each cage and day group. We then fill the NA values in the experiment and removed columns.
library(dplyr)
library(tidyr)
df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual)) %>%
  fill(experiment, removed, .direction = "up")
# cage day individual experiment removed
#1 2 day_8 1 sugar 1
#2 3 day_5 1 sugar 1
#3 4 day_5 1 sugar 3
#4 4 day_5 2 sugar 3
#5 4 day_5 3 sugar 3
#6 4 day_7 1 sugar 1
#7 7 day_7 1 sugar 1
#8 7 day_8 1 sugar 1
#9 8 day_5 1 sugar 2
#10 8 day_5 2 sugar 2
# … with 58 more rows
To update individual only based on cage we can do
df_transformed %>%
  mutate(individual = removed) %>%
  group_by(cage, day) %>%
  complete(individual = seq_len(individual)) %>%
  group_by(cage) %>%
  mutate(individual = row_number()) %>%
  fill(experiment, removed, .direction = "up")
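Another option, sketched here but not tested against the linked data, is tidyr::uncount(), which repeats each row according to its removed count; the individuals can then be numbered within each cage (assuming the day labels sort chronologically):
library(dplyr)
library(tidyr)

df_transformed %>%
  arrange(cage, day) %>%                 # keep removals in chronological order within each cage
  uncount(removed, .remove = FALSE) %>%  # one row per removed data point
  group_by(cage) %>%
  mutate(individual = row_number()) %>%
  ungroup()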
I think the following bit of code does what you need:
library(tidyverse)
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE))
I have used the pipe operator (%>%), which enables cleaner syntax. I have also used the newer pivot_longer function instead of gather. Then, by grouping by cage and summing the removed column with summarize, you get how many individuals were removed per cage.
I checked the sum of all the individuals and it seems to work:
read.table("https://pastebin.com/raw/NEPcUG01",header=T, sep="\t") %>%
pivot_longer(starts_with("day_"), names_to = "day", values_to = "removed") %>%
# drop_na() %>%
group_by(cage) %>%
summarize(individual = sum(removed, na.rm = TRUE)) %>%
pull(individual) %>%
sum()
#> [1] 68
The result is slightly different from your desired result. I am not 100% sure your desired result is actually correct... From your question, I understand that cage 4 should have 4 individuals, but in your desired_results it appears 4 times with values 1, 2, 3 and 4. The code I sent you generates a data frame where each cage appears in a single row.
I have a data frame like this:
NUM_TURNO CODIGO_MUNICIPIO SIGLA_PARTIDO SHARE
1 1 81825 PPB 38.713318
2 1 81825 PMDB 61.286682
3 1 09717 PMDB 48.025900
4 1 09717 PL 1.279217
5 1 09717 PFL 50.694883
6 1 61921 PMDB 51.793868
This is a data.frame of elections in Brazil. Grouping by NUM_TURNO and CODIGO_MUNICIPIO, I want to compare the SHARE of the FIRST and SECOND most voted parties in each city and round (1 or 2) and create a new column.
Where am I having trouble? I don't know how to calculate the difference for only the two biggest SHAREs of votes.
For the first case, for example, I want to create something that gives me the difference between 61.286682 and 38.713318 = 22.573364 and so on.
Something like this:
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  mutate(Diff = HIGHER SHARE - 2º HIGHER SHARE)
You can also use top_n from dplyr with grouping and summarizing. Keep in mind that in the data you provided, you will get an error in summarize if you use diff with a single value, hence the use of ifelse.
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  top_n(2, SHARE) %>%
  summarize(Diff = ifelse(n() == 1, NA, diff(SHARE)))
# A tibble: 3 x 3
# Groups: NUM_TURNO [?]
NUM_TURNO CODIGO_MUNICIPIO Diff
<dbl> <dbl> <dbl>
1 1 9717 2.67
2 1 61921 NA
3 1 81825 22.6
You could arrange your dataframe by SHARE and then slice the first two values. Then you could use summarise to get the diff between the values for every group:
library(dplyr)
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  arrange(desc(SHARE)) %>%
  slice(1:2) %>%
  summarise(Diff = -diff(SHARE))
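Note that in the sample data one group (CODIGO_MUNICIPIO 61921) has only a single row, so diff(SHARE) is zero-length there and summarise will warn or error depending on your dplyr version. A sketch that guards against that, along the lines of the ifelse in the previous answer:
df %>%
  group_by(NUM_TURNO, CODIGO_MUNICIPIO) %>%
  arrange(desc(SHARE)) %>%
  slice(1:2) %>%
  summarise(Diff = if (n() == 2) -diff(SHARE) else NA_real_)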
I'm trying to count the number of rows using dplyr after using group_by. I have the following data:
scenario pertubation population
A 1 20
B 1 30
C 1 40
D 1 50
A 2 15
B 2 25
And I'm using the following code to group_by and mutate:
test <- all_scenarios %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population)),
         exceedance_probability = rank / count(pertubation)) %>%
  select(scenario, pertubation, All.ages, rank, exceedance_probability)
But I keep encountering this error message, and I am unsure what it means or why I keep getting it:
Error in mutate_impl(.data, dots) :
Evaluation error: no applicable method for 'groups' applied to an object of class "c('integer', 'numeric')".
I would like my output data to look something like this:
scenario pertubation population rank exceedance_probability
A 1 20 12 0.06
B 1 30 7 0.035
C 1 40 2 0.01
D 1 50 1 0.005
A 2 15 34 0.17
B 2 25 28 0.14
To calculate the exceedance probability I just need to divide the rank by the number of observations, but I've found it hard to do this in dplyr after a group_by statement. Am I ordering the dplyr statements incorrectly?
We can get the count separately and join with the original dataset
all_scenarios %>%
  count(pertubation) %>%
  left_join(all_scenarios, ., by = 'pertubation') %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population)), exceedance_probability = rank / n)
Or instead of using count, we can do a second group_by and get the n()
all_scenarios %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population))) %>%
  group_by(pertubation) %>%
  mutate(exceedance_probability = rank / n())
Your issue comes from the
count(pertubation)
part of the code. count() expects a data frame, not a vector, so it can't be called on a column inside mutate() (that is what the "no applicable method for 'groups'" error is complaining about). Just use
n()
in place of it in the code. Since you're grouping by scenario, and each scenario-pertubation combination is unique in your dataset, counting the number of rows in each scenario effectively counts the number of pertubation values for each scenario.
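Concretely, the original pipeline with n() substituted would look something like this (using population in the select, since the All.ages column isn't part of the posted sample data):
test <- all_scenarios %>%
  group_by(scenario) %>%
  mutate(rank = dense_rank(desc(population)),
         exceedance_probability = rank / n()) %>%
  select(scenario, pertubation, population, rank, exceedance_probability)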