Obtain data from rows based on a condition in R

I have the following dataframe.
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3 VORDEN_PREVENT4 VORDEN_PREVENT5
2484628 1500 1328 2761 3003 2803
2491884 1500 1500 1169 2813 1328
2521158 1500 2813 1328 2761 3003
2548370 1500 1257 2595 1187 1837
2580994 1500 5057 2624 2940 2731
2670164 1500 1874 1218 2791 2892
In this dataframe the VORDEN_PREVENT* columns hold the number of cars sold on consecutive days; for example, VORDEN_PREVENT1 means I sold 1500 cars on that day. What I want is to return, for each row, the column values that add up to a given total, for example 3000 cars.
For the first row that would be 1500 from VORDEN_PREVENT1, 1328 from VORDEN_PREVENT2 and 172 from VORDEN_PREVENT3, where 172 is the difference between the 3000 target and the sum of VORDEN_PREVENT1 and VORDEN_PREVENT2 (1500 + 1328 = 2828).
I don't know how to obtain this row and column data and compute the difference properly.

If I understand correctly, the VORDEN_PREVENT* columns denote sales on subsequent days. The OP asks on which day the cumulative sum of sales reaches a given threshold. In addition, the OP wants to see the sales figures which sum up to the threshold.
I suggest solving this type of question in long format, where columns can be treated as data.
1. melt() / dcast()
library(data.table)
threshold <- 3000L
long <- melt(setDT(DT), id.var = "SEC")
long[, value := c(value[1L], diff(pmin(cumsum(value), threshold))), by = SEC]
dcast(long[value > 0], SEC ~ variable)
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3
1: 2484628 1500 1328 172
2: 2491884 1500 1500 NA
3: 2521158 1500 1500 NA
4: 2548370 1500 1257 243
5: 2580994 1500 1500 NA
6: 2670164 1500 1500 NA
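The pmin()/cumsum()/diff() idiom does the real work here: the running total is capped at the threshold, and differencing the capped series recovers how much each day contributes before the cap is reached. Illustrated on the first row:
x <- c(1500, 1328, 2761, 3003, 2803)
cumsum(x)                              # 1500 2828 5589 8592 11395
pmin(cumsum(x), 3000L)                 # 1500 2828 3000 3000 3000
c(x[1L], diff(pmin(cumsum(x), 3000L))) # 1500 1328 172 0 0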
2. gather() / spread()
library(tidyr)
library(dplyr)
threshold <- 3000L
DT %>%
  gather(key, value, -SEC) %>%
  group_by(SEC) %>%
  mutate(value = c(value[1L], diff(pmin(cumsum(value), threshold)))) %>%
  filter(value > 0) %>%
  spread(key, value)
# A tibble: 6 x 4
# Groups: SEC [6]
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3
<int> <int> <int> <int>
1 2484628 1500 1328 172
2 2491884 1500 1500 NA
3 2521158 1500 1500 NA
4 2548370 1500 1257 243
5 2580994 1500 1500 NA
6 2670164 1500 1500 NA
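As an aside, gather() and spread() are superseded since tidyr 1.0; a sketch of the same pipeline with the newer pivot_longer()/pivot_wider() verbs:
DT %>%
  pivot_longer(-SEC) %>%
  group_by(SEC) %>%
  mutate(value = c(value[1L], diff(pmin(cumsum(value), threshold)))) %>%
  filter(value > 0) %>%
  pivot_wider()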
3. apply()
With base R:
DT[, -1] <- t(apply(DT[, -1], 1, function(x) c(x[1L], diff(pmin(cumsum(x), threshold)))))
DT
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3 VORDEN_PREVENT4 VORDEN_PREVENT5
1 2484628 1500 1328 172 0 0
2 2491884 1500 1500 0 0 0
3 2521158 1500 1500 0 0 0
4 2548370 1500 1257 243 0 0
5 2580994 1500 1500 0 0 0
6 2670164 1500 1500 0 0 0
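If the trailing all-zero columns are unwanted in the base R result, one way (a sketch) is to drop value columns whose column sum is zero after the transformation:
DT[, c(TRUE, colSums(DT[, -1]) > 0)]  # keep SEC plus columns with a non-zero contribution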
Data
library(data.table)
DT <- fread("
SEC VORDEN_PREVENT1 VORDEN_PREVENT2 VORDEN_PREVENT3 VORDEN_PREVENT4 VORDEN_PREVENT5
2484628 1500 1328 2761 3003 2803
2491884 1500 1500 1169 2813 1328
2521158 1500 2813 1328 2761 3003
2548370 1500 1257 2595 1187 1837
2580994 1500 5057 2624 2940 2731
2670164 1500 1874 1218 2791 2892",
data.table = FALSE)

Your question is not very clear to me, so I'll reduce it to what I understand (you want to create a column, then filter rows). Using dplyr this can be done quite easily, but first let's recreate some data.
# recreate some data
df <- data.frame(time = 1:3,
                 sales1 = c(1234, 1567, 2045),
                 sales2 = c(865, 756, 890))
# first create a diff column
library(dplyr)
df <- df %>% mutate(sales_diff = sales1 - sales2)
df
  time sales1 sales2 sales_diff
1    1   1234    865        369
2    2   1567    756        811
3    3   2045    890       1155
# then you can access the rows you're interested in by filtering them
df %>% filter(sales1==1567)
  time sales1 sales2 sales_diff
1    2   1567    756        811
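filter() accepts any logical condition, so a threshold works just as well as an exact match:
df %>% filter(sales_diff > 500)
  time sales1 sales2 sales_diff
1    2   1567    756        811
2    3   2045    890       1155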
You can just replace the object/column names with your own data.
Is that what you were looking for?

Related

How to add columns with particular calculations and make a plot based on that in R?

Say for example I have the following data set:
timestamp open close ID
2000 1000 1100 5
2060 1100 1150 5
2120 1150 1200 5
2180 1200 1150 5
2240 1150 1100 8
2300 1100 1000 8
2360 1000 950 8
2420 950 900 8
2480 900 950 5
2540 950 1000 5
2600 1000 1050 5
2660 1050 1100 4
2720 1100 1150 4
2780 1150 1200 4
How can I add another column which shows how many times a particular ID has shown up so far (shown as Number_ID)? And how can I add another column which gives the percentage change since the beginning of each new ID run? The first open is the start of the run and the closes are used to calculate %_change. So this would look something like this (the "because" calculation doesn't have to be included; I added it so you can see the math):
timestamp open close ID Number_ID %_change
2000 1000 1100 5 1 10 (because (1100-1000)*100/1000)
2060 1100 1150 5 2 15 (because (1150-1000)*100/1000)
2120 1150 1200 5 3 20 (because (1200-1000)*100/1000)
2180 1200 1150 5 4 15 (because (1150-1000)*100/1000)
2240 1150 1100 8 1 -4 (because (1100-1150)*100/1150)
2300 1100 1000 8 2 -13 (because (1000-1150)*100/1150)
2360 1000 950 8 3 -17 (because (950-1150)*100/1150)
2420 950 900 8 4 -21 (because (900-1150)*100/1150)
2480 900 950 5 1 5 (because (950-900)*100/900)
2540 950 1000 5 2 11 (because (1000-900)*100/900)
2600 1000 1050 5 3 16 (because 1050-900)*100/900)
2660 1050 1100 4 1 4 (because (1100-1050)*100/1050)
2720 1100 1150 4 2 9 (because (1150-1050)*100/1050)
2780 1150 1200 4 3 14 (because (1200-1050)*100/1050)
And once I have these 2 columns, how can I make a graph which plots the highest positive and negative % change per ID? I would first need a calculation that gives the percentage difference between the open and the close of an ID run. This would look something like this:
timestamp open close ID Number_ID %_change %_change_opencloseID
2000 1000 1100 5 1 10
2060 1100 1150 5 2 15
2120 1150 1200 5 3 20
2180 1200 1150 5 4 15 15 (because (1150-1000)*100/1000)
2240 1150 1100 8 1 -4
2300 1100 1000 8 2 -13
2360 1000 950 8 3 -17
2420 950 900 8 4 -21 -21 (because (900-1150)*100/1150)
2480 900 950 5 1 5
2540 950 1000 5 2 11
2600 1000 1050 5 3 16 16 (because (1050-900)*100/900)
2660 1050 1100 4 1 4
2720 1100 1150 4 2 9
2780 1150 1200 4 3 14 14 (because (1200-1050)*100/1050)
If I have this, how can I make a graph that plots the 16% change for ID 5 and not the 15% change for ID 5 automatically? With timestamp on the x-axis and %_change on the y-axis.
Thanks!
This is how you can do your first step:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(Number_ID = row_number(),
         perc_change = (close - first(open)) / first(open) * 100)
# timestamp open close ID Number_ID perc_change
# <int> <int> <int> <int> <int> <dbl>
# 1 2000 1000 1100 5 1 10
# 2 2060 1100 1150 5 2 15
# 3 2120 1150 1200 5 3 20
# 4 2180 1200 1150 5 4 15
# 5 2240 1150 1100 8 1 -4.35
# 6 2300 1100 1000 8 2 -13.0
# 7 2360 1000 950 8 3 -17.4
# 8 2420 950 900 8 4 -21.7
# 9 2480 900 950 5 5 -5
#10 2540 950 1000 5 6 0
#11 2600 1000 1050 5 7 5
#12 2660 1050 1100 4 1 4.76
#13 2720 1100 1150 4 2 9.52
#14 2780 1150 1200 4 3 14.3
In data.table:
library(data.table)
setDT(df)[, c("Number_ID", "perc_change") := list(seq_len(.N),
               (close - first(open)) / first(open) * 100), by = ID]
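Note that group_by(ID) pools the two separate runs of ID 5 together, whereas the expected output restarts the count whenever a run ends. A sketch of the run-based variant plus the requested plot, assuming ggplot2 is acceptable (the run helper column is illustrative):
library(dplyr)
library(ggplot2)

res <- df %>%
  mutate(run = cumsum(ID != lag(ID, default = first(ID)))) %>%  # new run whenever ID changes
  group_by(run, ID) %>%
  mutate(Number_ID = row_number(),
         perc_change = (close - first(open)) / first(open) * 100) %>%
  ungroup()

# final open-to-close change of each run, then the largest-magnitude run per ID
run_change <- res %>% group_by(run, ID) %>% slice(n()) %>% ungroup()
top_change <- run_change %>% group_by(ID) %>% slice(which.max(abs(perc_change))) %>% ungroup()

# e.g. the 16% run is kept for ID 5, not the 15% one
ggplot(top_change, aes(timestamp, perc_change)) + geom_point()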

Reshape data where paired measure variables need to be defined by multiple id variables in new columns

I have the following data frame that looks like this:
NA X.nm. X.A. Reaction.Type Trial Actual.Total.Seconds RA
1 300 3.3294 0ng 1 14.784 NaDithio
51 350 0.1779 0ng 1 14.784 NaAsc
81 380 0.1000 50ng 2 14.784 NaAsc
101 400 0.0509 0ng 1 14.784 NaAsc
151 450 0.0125 0ng 2 14.784 NaAsc
201 500 0.0054 0ng 2 14.784 NaDithio
251 550 0.0026 0ng 1 14.784 NaDithio
301 600 0.0010 50ng 1 14.784 NaAsc
351 650 -0.0001 0ng 1 14.784 NaAsc
381 680 -0.0005 0ng 1 14.784 NaAsc
So there are columns for "Reaction.Type", "Trial", "Actual.Total.Seconds", "RA", "X.nm." and "X.A.". Please ignore the NA column.
I want to reformat my data frame so that there is a new X.nm./X.A. pair of columns for every combination of ("Reaction.Type", "Trial", "Actual.Total.Seconds", "RA"), with the combination as each column's title. Every X.nm. has a corresponding X.A. (like a coordinate point: for every X.nm. there is an X.A. that goes with it).
Example:
Column 1 title: X.nm from 0ng, 1, 50seconds, NaAsc (this will have another column that matches the X.nm to the X.A.)
Column 2 title: X.A. from 0ng, 1, 50seconds, NaAsc
*Then do this for every combo like below, so there'd be more columns for each combo
X.nm. from 0ng, 1, 14.784seconds, NaDithio X.A from 0ng, 1, 50seconds, NaDithio
300 3.3294
550 0.0026
When I try to use recast() from the reshape2 package, it doesn't keep the X.A. and X.nm. pairs together.
Please help! Thank you.
library(tidyverse)
df2 <- df %>%
  unite(var, c(Reaction.Type, Trial, Actual.Total.Seconds, RA)) %>%
  gather(var2, val, -var, -NA.) %>%
  unite(var3, c(var2, var)) %>%
  spread(var3, val)
# reorder so each X.nm./X.A. pair sits next to each other
df2[, split(names(df2), c(1, rep(2:7, 2))) %>% unlist]
# NA. X.A._0ng_1_14.784_NaAsc X.nm._0ng_1_14.784_NaAsc
# 1 1 NA NA
# 2 51 0.1779 350
# 3 81 NA NA
# 4 101 0.0509 400
# 5 151 NA NA
# 6 201 NA NA
# 7 251 NA NA
# 8 301 NA NA
# 9 351 -0.0001 650
# 10 381 -0.0005 680
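With tidyr >= 1.0 the pairs can also be kept together with pivot_wider(), which accepts several values_from columns at once. A sketch (the helper column row is just an index within each combination):
library(dplyr)
library(tidyr)
df %>%
  group_by(Reaction.Type, Trial, Actual.Total.Seconds, RA) %>%
  mutate(row = row_number()) %>%
  ungroup() %>%
  pivot_wider(id_cols = row,
              names_from = c(Reaction.Type, Trial, Actual.Total.Seconds, RA),
              values_from = c(X.nm., X.A.))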

R: Aggregating data by column group - mutate column with values for each observation

I'm having a beginner's issue aggregating data by category: I want to create a new column containing the sum of each category's values, repeated for each observation.
I'd like the following data:
PIN Balance
221 5000
221 2000
221 1000
554 4000
554 4500
643 6000
643 4000
To look like:
PIN Balance Total
221 5000 8000
221 2000 8000
221 1000 8000
554 4000 8500
554 4500 8500
643 6000 10000
643 4000 10000
I've tried using aggregate: output <- aggregate(df$Balance ~ df$PIN, data = df, sum), but I haven't been able to get the result back into my original dataset, as the number of observations is off.
You can use dplyr to do what you want. We first group_by PIN and then create a new column Total using mutate that is the sum of the grouped Balance:
library(dplyr)
res <- df %>% group_by(PIN) %>% mutate(Total=sum(Balance))
Using your data as a data frame df:
df <- structure(list(PIN = c(221L, 221L, 221L, 554L, 554L, 643L, 643L
), Balance = c(5000L, 2000L, 1000L, 4000L, 4500L, 6000L, 4000L
)), .Names = c("PIN", "Balance"), class = "data.frame", row.names = c(NA,
-7L))
## PIN Balance
##1 221 5000
##2 221 2000
##3 221 1000
##4 554 4000
##5 554 4500
##6 643 6000
##7 643 4000
We get the expected result:
print(res)
##Source: local data frame [7 x 3]
##Groups: PIN [3]
##
## PIN Balance Total
## <int> <int> <int>
##1 221 5000 8000
##2 221 2000 8000
##3 221 1000 8000
##4 554 4000 8500
##5 554 4500 8500
##6 643 6000 10000
##7 643 4000 10000
Or we can use data.table:
library(data.table)
setDT(df)[, Total := sum(Balance), by = PIN][]
## PIN Balance Total
##1: 221 5000 8000
##2: 221 2000 8000
##3: 221 1000 8000
##4: 554 4000 8500
##5: 554 4500 8500
##6: 643 6000 10000
##7: 643 4000 10000
Consider a base R solution with a sapply() conditional sum approach:
df <- read.table(text="PIN Balance
221 5000
221 2000
221 1000
554 4000
554 4500
643 6000
643 4000", header=TRUE)
df$Total <- sapply(seq(nrow(df)), function(i) {
  sum(df[df$PIN == df$PIN[i], "Balance"])
})
# PIN Balance Total
# 1 221 5000 8000
# 2 221 2000 8000
# 3 221 1000 8000
# 4 554 4000 8500
# 5 554 4500 8500
# 6 643 6000 10000
# 7 643 4000 10000
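For completeness, base R's ave() does the same grouped sum in a single vectorized call and returns a vector of the original length, which is exactly why it slots straight back into the data frame where the aggregate() attempt did not:
df$Total <- ave(df$Balance, df$PIN, FUN = sum)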

R: How to create a conditional column indirectly based on a non-static amount of other columns, in a data.table

I have the following data.table:
Name x y h 120Hz 800Hz 1000Hz 1200Hz
1: Tower1 1354 829 245 0 8 7 0
2: Tower2 2654 234 285 7 0 3 0
3: Tower3 822 3040 256 0 4 0 9
4: Tower4 987 2747 250 0 6 5 3
5: Tower5 1953 1739 301 0 0 8 2
You can create it with:
DT <- data.table(Name = c("Tower1", "Tower2", "Tower3", "Tower4", "Tower5"),
                 x = c(1354, 2654, 822, 987, 1953),
                 y = c(829, 234, 3040, 2747, 1739),
                 h = c(245, 285, 256, 250, 301),
                 `120Hz` = c(0, 7, 0, 0, 0),
                 `800Hz` = c(8, 0, 4, 6, 0),
                 `1000Hz` = c(7, 3, 0, 5, 8),
                 `1200Hz` = c(0, 0, 9, 3, 2))
In reality, it came from a previous, larger data.table. The last four columns were auto-generated from that other data.table using dcast, so there is no way to know beforehand the number or the names of the columns after column h. This is important.
The final goal is to create another column named "Range", whose value for each row depends on the values in the columns after column "h", as follows:
Consider the following associations between frequencies and ranges. These are the only established associations and they are static, so this information can be stored in a pre-defined data.table.
assoc <- data.table(Frq = c("800Hz", "1000Hz", "1200Hz"),
                    Rng = c(750, 850, 950))
For each of the four columns after column "h", the code should check whether the column name exists in assoc. If so, AND if the value in that column for the row in question in DT is not zero, the code considers the corresponding Rng value (from assoc). After checking all four columns, the code should return the MAXIMUM of the ranges considered and store it in the new column "Range".
My approach:
Create one auxiliary column for each frequency column:
DT <- DT[, paste0(colnames(DT)[5:ncol(DT)], '_f') := 0]
Then I could use a conditional structure that implements the algorithm described above. Take column 800Hz_f, for example: it checks the value in column 800Hz, and if that value is not zero for the row in question, it returns 750. At the end, the column Range simply takes the maximum of the previous four columns, the ones ending in _f. That's where I'm stuck: I can't find a useful command to do so, and everything I've tried throws an error.
Finally, the auxiliary _f columns should be deleted. If anyone knows a way to do this without creating auxiliary columns, even better.
This is the expected result (prior to deletion of auxiliary columns):
Name x y h 120Hz 800Hz 1000Hz 1200Hz 120Hz_f 800Hz_f 1000Hz_f 1200Hz_f Range
1: Tower1 1354 829 245 0 8 7 0 0 750 850 0 850
2: Tower2 2654 234 285 7 0 3 0 0 0 850 0 850
3: Tower3 822 3040 256 0 4 0 9 0 750 0 950 950
4: Tower4 987 2747 250 0 6 5 3 0 750 850 950 950
5: Tower5 1953 1739 301 0 0 8 2 0 0 850 950 950
NOTE: The reason why there could be frequency columns that don't appear in assoc is because the original data could have typos. In this example, the column 120Hz would always generate only zeros in column 120Hz_f and thus it can never get to be considered for the maximum Range. That's ok.
A back and forth to long format can make this work:
dcast(melt(DT, measure.vars = patterns("Hz$"))[assoc, on = c(variable = "Frq"),
        Rng := i.Rng * (value != 0)],
      Name + x + y + h ~ variable, max, value.var = "Rng")[,
  do.call(function(...) pmax(..., na.rm = TRUE), .SD), .SDcols = `120Hz`:`1200Hz`]
#[1] 850 850 950 950 950
Or you can avoid creating the intermediate columns if you loop over assoc:
DT[, Range := -Inf]
assoc[, {DT[, Range := pmax(Range, (get(Frq) != 0) * Rng)]; NULL}, by = Frq]
DT
# Name x y h 120Hz 800Hz 1000Hz 1200Hz Range
#1: Tower1 1354 829 245 0 8 7 0 850
#2: Tower2 2654 234 285 7 0 3 0 850
#3: Tower3 822 3040 256 0 4 0 9 950
#4: Tower4 987 2747 250 0 6 5 3 950
#5: Tower5 1953 1739 301 0 0 8 2 950
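For what it's worth, here is a sketch of a loop-free variant: match the column names against assoc once, then take a row-wise pmax() over the matched columns (a zero contribution stands in for frequencies whose count is zero):
freq_cols <- intersect(names(DT), assoc$Frq)   # frequency columns known to assoc
rng <- assoc$Rng[match(freq_cols, assoc$Frq)]  # their ranges, in matching order
DT[, Range := do.call(pmax, Map(function(col, r) (col != 0) * r,
                                .SD, as.list(rng))), .SDcols = freq_cols]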
It is not exactly what you intend, but my motto is: when the algorithm does not fit the data, reshape the data to fit the algorithm.
A bit long, but simple to implement.
I melt DT, drop the zero entries, and convert the frequency labels to numeric by removing "Hz":
a <- melt(DT, id.vars = 1:4)[value > 0][, crit := as.numeric(gsub("Hz", "", variable))]
to get something like:
##> a
## Name x y h variable value crit
## 1: Tower1 1354 829 245 800Hz 8 800
## 2: Tower1 1354 829 245 1000Hz 7 1000
## 3: Tower2 2654 234 285 120Hz 7 120
## 4: Tower2 2654 234 285 1000Hz 3 1000
## 5: Tower3 822 3040 256 800Hz 4 800
## 6: Tower3 822 3040 256 1200Hz 9 1200
## 7: Tower4 987 2747 250 800Hz 6 800
## 8: Tower4 987 2747 250 1000Hz 5 1000
## 9: Tower4 987 2747 250 1200Hz 3 1200
## 10: Tower5 1953 1739 301 1000Hz 8 1000
## 11: Tower5 1953 1739 301 1200Hz 2 1200
Then find the max by Tower.
## > a[,.(crit=max(crit)),by=Name]
## Name crit
## 1: Tower1 1000
## 2: Tower2 1000
## 3: Tower3 1200
## 4: Tower4 1200
## 5: Tower5 1200
Then merge it back with a:
b <- merge(setkey(a, Name, crit), setkey(a[, .(crit = max(crit)), by = Name], Name, crit))
To get something like
## > b
## Name crit x y h variable value
## 1: Tower1 1000 1354 829 245 1000Hz 7
## 2: Tower2 1000 2654 234 285 1000Hz 3
## 3: Tower3 1200 822 3040 256 1200Hz 9
## 4: Tower4 1200 987 2747 250 1200Hz 3
## 5: Tower5 1200 1953 1739 301 1200Hz 2
Then merge b with assoc
## > merge(b,assoc,by.x="variable",by.y="Frq")
## variable Name crit x y h value Rng
## 1: 1000Hz Tower1 1000 1354 829 245 7 850
## 2: 1000Hz Tower2 1000 2654 234 285 3 850
## 3: 1200Hz Tower3 1200 822 3040 256 9 950
## 4: 1200Hz Tower4 1200 987 2747 250 3 950
## 5: 1200Hz Tower5 1200 1953 1739 301 2 950

Checking the value from given threshold in a set of observation and continue till end of vector

Task:
I have to check whether the values in a data vector are above a given threshold.
If I find 5 consecutive values greater than the threshold, I keep these values as they are.
If a run has fewer than 5 consecutive values above the threshold, I replace those values with NAs.
The sample data and required output are shown below. In this example the threshold is 1000; X is the input vector and Y is the desired output (X where the above-threshold runs are long enough, NA otherwise).
X Y
580 580
457 457
980 980
1250 NA
3600 NA
598 598
1200 1200
1345 1345
9658 9658
1253 1253
4500 4500
1150 1150
596 596
594 594
550 550
1450 NA
320 320
1780 NA
592 592
590 590
I have used the following code in R, but I am unable to get the desired output:
for (i in 1:nrow(X)) {  # X is my data vector
  counter <- 0
  if (X[i] > 10000) {
    for (j in i:(i + 4)) {
      if (X[j] > 10000) {
        counter <- counter + 1
      }
    }
    ifelse(counter < 5, NA, X[j])
  }
  X[i] <- NA
}
X
I am sure that the above code contains some errors. I need help in the form of new code, a modification of this code, or any suitable R package.
Here is an approach using dplyr, building group ids from a cumulative sum over abs(diff(x > 1000)) so that each run of values gets its own group.
library(dplyr)
x <- c(580, 457, 980, 1250, 3600, 598, 1200, 1345, 9658, 1253,
       4500, 1150, 596, 594, 550, 1450, 320, 1780, 592, 590)
df <- data.frame(x)
df
# x
# 1 580
# 2 457
# 3 980
# 4 1250
# 5 3600
# 6 598
# 7 1200
# 8 1345
# 9 9658
# 10 1253
# 11 4500
# 12 1150
# 13 596
# 14 594
# 15 550
# 16 1450
# 17 320
# 18 1780
# 19 592
# 20 590
df %>%
  mutate(group = cumsum(c(0, abs(diff(x > 1000))))) %>%
  group_by(group) %>%
  mutate(count = n()) %>%
  ungroup() %>%
  mutate(y = ifelse(x < 1000 | count >= 5, x, NA))
# x group count y
# (int) (dbl) (int) (int)
# 1 580 0 3 580
# 2 457 0 3 457
# 3 980 0 3 980
# 4 1250 1 2 NA
# 5 3600 1 2 NA
# 6 598 2 1 598
# 7 1200 3 6 1200
# 8 1345 3 6 1345
# 9 9658 3 6 9658
# 10 1253 3 6 1253
# 11 4500 3 6 4500
# 12 1150 3 6 1150
# 13 596 4 3 596
# 14 594 4 3 594
# 15 550 4 3 550
# 16 1450 5 1 NA
# 17 320 6 1 320
# 18 1780 7 1 NA
# 19 592 8 2 592
# 20 590 8 2 590
Another approach:
Y <- rep(NA, nrow(X))
for (i in 1:nrow(X)) {
  if (X[i, 1] < 1000) {
    Y[i] <- X[i, 1]
  } else if (sum(X[i:min(i + 4, nrow(X)), 1] > 1000) >= 5) {
    Y[i:min(i + 4, nrow(X))] <- X[i:min(i + 4, nrow(X)), 1]
  }
}
returns
> Y
[1] 580 457 980 NA NA 598 1200 1345 9658 1253 4500 1150 596 594 550 NA 320 NA 592 590
This assumes that the values of X are in the first column of a dataframe named X.
It creates Y filled with NAs and only changes the values where the criteria are met.
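A compact base R alternative (a sketch) uses rle() to encode the run lengths directly, blanking out above-threshold runs shorter than 5:
x <- c(580, 457, 980, 1250, 3600, 598, 1200, 1345, 9658, 1253,
       4500, 1150, 596, 594, 550, 1450, 320, 1780, 592, 590)
r <- rle(x > 1000)                                  # runs of above/below threshold
keep <- rep(!r$values | r$lengths >= 5, r$lengths)  # keep below-threshold values and long runs
ifelse(keep, x, NA)
# [1] 580 457 980 NA NA 598 1200 1345 9658 1253 4500 1150 596 594 550 NA 320 NA 592 590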
