R: Extracting values from a dataframe when sequential row values differ

I have a data.frame df with 2 columns:
A contains positive values.
B contains either zero or positive values.
I wish to generate a new data.frame (or vector) of unknown length containing the values of df[i + 1, A], but ONLY when df[i, B] == 0 & df[i + 1, B] != 0.
I can visualize how to do this by stepping through the data.frame with a loop, but that will take forever with >200,000 rows. What is the vectorized solution to a problem like this, which requires arithmetic on sequential rows of a vector or data.frame?
Data is in this form:
A B
1 5 5
2 10 3
3 15 0
4 20 6
5 25 5
6 30 0
7 35 0
8 40 11
9 45 3
etc etc etc
I'd then like to extract the values of A from row 4 (A = 20) and row 8 (A = 40) etc.

You could use
df$A[-1][diff(df$B != 0) > 0]
[1] 20 40
The idea is as follows. First, given a vector c(1, 2), one way to extract 2 is of course c(1, 2)[2]. Another way is c(1, 2)[c(FALSE, TRUE)], i.e. you might subset a vector by using a logical vector.
After you edited your question, I see that we are no longer interested in the first row of df, which is why I start with df$A[-1]. One way that is longer, and very likely less efficient, but follows more readable logic, is
df$A[-1][df$B[-nrow(df)] == 0 & df$B[-1] != 0]
where df$B[-1] != 0 returns a logical vector corresponding to your condition df[i + 1, B] != 0, and df$B[-nrow(df)] == 0 returns another logical vector corresponding to df[i, B] == 0. The operator & then performs an element-wise AND, producing the final logical vector that gives the result.
Now diff(df$B != 0) > 0 is just a trickier way to write the same thing. df$B != 0 gives a logical vector, and in performing diff(df$B != 0) we take differences of 1's (corresponding to TRUE entries) and 0's (corresponding to FALSE). For example, c(0, 1) != 0 gives c(FALSE, TRUE), which can be seen as c(0, 1), and diff then gives 1. So diff(df$B != 0) has ones exactly where an entry of 0 in B is followed by some nonzero (in your case, positive) number. To use this result for subsetting df$A[-1], we obtain the final logical vector with diff(df$B != 0) > 0.
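To see the whole pipeline end to end, here is a minimal reproduction using the sample data from the question; it returns the expected values 20 and 40:
df <- data.frame(
  A = c(5, 10, 15, 20, 25, 30, 35, 40, 45),
  B = c(5, 3, 0, 6, 5, 0, 0, 11, 3)
)

# TRUE exactly where a zero in B is followed by a nonzero value
rising <- diff(df$B != 0) > 0

df$A[-1][rising]
# [1] 20 40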

Another option comes through 'dplyr' with the following code:
library(dplyr)
df %>% filter(B != 0 & lag(B, 1) == 0)
This keeps the rows where B doesn't equal 0 and the prior B does equal zero. It returns both columns A and B; if you only want certain columns, add %>% select(...) with the desired variables separated by commas.
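For instance, if you only want the A values as a vector, a small sketch using dplyr's pull() (assuming df holds the question's sample data):
library(dplyr)

df %>%
  filter(B != 0 & lag(B, 1) == 0) %>%
  pull(A)
# [1] 20 40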

My example (adds together sequential values from two vectors):
> i1=c(1:100)
> i2=c(100:1)
> i3=i1[-length(i1)]+i2[-1]
> i3
[1] 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
[55] 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100
My example (adds together sequential values from a single vector):
> i1=c(1:100)
> i2=i1[-length(i1)]+i1[-1]
> i2
[1] 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 101 103 105 107 109
[55] 111 113 115 117 119 121 123 125 127 129 131 133 135 137 139 141 143 145 147 149 151 153 155 157 159 161 163 165 167 169 171 173 175 177 179 181 183 185 187 189 191 193 195 197 199
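As an aside, the same offset-indexing pattern with subtraction instead of addition reproduces base R's diff(), which is a handy way to sanity-check the idiom:
i1 <- 1:100
all(i1[-1] - i1[-length(i1)] == diff(i1))
# [1] TRUE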

Related

How to remove outliers by columns in R

I have this data frame:
  IQ sleep  GRE happiness
 105    70  200        15
  40    50  150        15
  70    20   70        10
 150   150   80         6
 148    60  900         7
 115    10 1200        40
 110    90   15         5
 120    40   60        12
  99    30   70        15
1000    15   30        68
  70    60   12        70
I would like to remove the outliers for each variable. I do not want to delete a whole row if one value is identified as an outlier. For example, let's say the outlier for IQ is 40; I just want to delete the 40, I don't want the whole row deleted.
If I define outliers as any values > mean + 3*sd or < mean - 3*sd, what code can I use to remove them?
If I can achieve this using dplyr and subset, that would be great.
I would expect something like this:
  IQ sleep  GRE happiness
 105    70  200        15
        50  150        15
  70    20   70        10
 150          80        6
 148    60  900         7
 115                    40
 110    90              5
 120    40   60        12
  99    30   70        15
        15   30        68
  70    60   12        70
I have tried remove_sd_outlier() (from the dataPreparation package) and it deleted an entire row of data. I do not want this.
You can use scale() to compute z-scores and across() to apply the check to all numeric variables. Note that none of your example values are more than 3 SD from the mean, so I used 2 SD as the threshold for demonstration.
library(dplyr)

df1 %>%
  mutate(across(
    where(is.numeric),
    ~ ifelse(abs(as.numeric(scale(.x))) > 2, NA, .x)
  ))
# A tibble: 11 × 4
      IQ sleep   GRE happiness
   <dbl> <dbl> <dbl>     <dbl>
 1   105    70   200        15
 2    40    50   150        15
 3    70    20    70        10
 4   150    NA    80         6
 5   148    60   900         7
 6   115    10    NA        40
 7   110    90    15         5
 8   120    40    60        12
 9    99    30    70        15
10    NA    15    30        68
11    70    60    12        70
I think you could rephrase the nested ifelse() as case_when() for something easier to read (see the sketch after the code below), but hopefully this works for you:
df %>%
  mutate(across(everything(),
                ~ ifelse(. > (mean(.) + 3 * sd(.)), "",
                         ifelse(. < (mean(.) - 3 * sd(.)), "", 1 * (.)))))

Turning data long to wide with repeating values

fill W id X T
1 403 29730 100 111
1 8395 10766 100 92
1 4170 14291 100 98
1 2768 20506 200 110
1 3581 15603 100 112
6 1 10504 200 87
9 48 29730 100 89
1 4790 10766 200 80
This is a slightly modified random sample from my actual data. I'd like:
id X T 403 8395 ....
29730 100 111 1
10766 100 92 1
14291 100 98
20506 200 110
15603 100 112
10504 200 87
29730 100 89
10766 200 80
Notice that ID 29730 appears at both T 111 and T 89. I think this should just be reshape2::dcast; however,
data_wide <- reshape2::dcast(data_long, id + T + X ~ W, value.var = "fill")
gives an illogical result. Is there generally a way to keep the same ID at T1 and T2 while casting a data frame?
If I understand correctly, this is not a trivial long-to-wide reshape, considering the OP's requirements:
The row order must be maintained.
The columns must be ordered by first appearance of W.
Missing entries should appear blank rather than NA.
This requires us
to add a row number to be included in the reshape formula,
to turn W into a factor whose levels are ordered by appearance, e.g., using forcats::fct_inorder(),
to use an aggregation function that turns NA into "", e.g., toString(),
and to remove the row numbers from the reshaped result.
Here, the data.table implementation of dcast() is used, as data.table appears a bit more convenient, IMHO.
library(data.table)
dcast(setDT(data_long)[, rn := .I],
      rn + id + T + X ~ forcats::fct_inorder(factor(W)),
      toString, value.var = "fill")[, rn := NULL][]
      id   T   X 403 8395 4170 2768 3581    1   48 4790
1: 29730 111 100   1
2: 10766  92 100        1
3: 14291  98 100             1
4: 20506 110 200                  1
5: 15603 112 100                       1
6: 10504  87 200                            6
7: 29730  89 100                                 9
8: 10766  80 200                                      1
Data
library(data.table)
data_long <- fread(" fill W id X T
1 403 29730 100 111
1 8395 10766 100 92
1 4170 14291 100 98
1 2768 20506 200 110
1 3581 15603 100 112
6 1 10504 200 87
9 48 29730 100 89
1 4790 10766 200 80")
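If you prefer the tidyverse, a roughly equivalent sketch with tidyr::pivot_wider() could look like the following. This is an assumption-laden alternative, not the answer above: it relies on pivot_wider's default names_sort = FALSE to order columns by first appearance, and converts fill to character so the gaps can be "" rather than NA.
library(dplyr)
library(tidyr)

data_long %>%
  mutate(rn = row_number(),              # unique key keeps every row separate
         fill = as.character(fill)) %>%  # character values allow "" for gaps
  pivot_wider(names_from = W, values_from = fill, values_fill = "") %>%
  select(-rn)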

How to allocate groups of a data frame based on time in R

Hello, I have a table like so:
Entry TimeOn TimeOff Alarm
1 60 70 355
2 80 85 455
3 100 150 400
4 105 120 320
5 125 130 254
6 135 155 220
7 160 170 850
I would like to understand how I can group these entries so that the ones starting during another alarm, and ending either during or after it (such as entries 4, 5 and 6), can be filtered out of the data frame.
This would be the desired result, a data frame that looks like this:
Entry TimeOn TimeOff Alarm
1 60 70 355
2 80 85 455
3 100 150 400
7 160 170 850
So entries 4, 5 and 6 are removed.
library(dplyr)
library(data.table)

# flag = number of intervals [TimeOn, TimeOff] that each row's TimeOn falls
# into (every row matches its own interval, so flag is at least 1)
df$flag <- apply(df, 1, function(x) {
  nrow(filter(df, data.table::between(x['TimeOn'], df$TimeOn, df$TimeOff)))
})

# rows with flag > 1 start during another alarm; these are the ones to drop
# (keep the rest with df[df$flag == 1, ])
df[df$flag > 1, ]
Entry TimeOn TimeOff Alarm flag
4 4 105 120 320 2
5 5 125 130 254 2
6 6 135 155 220 2
# Same option using base R
df$flag <- apply(df, 1, function(x) {
  nrow(df[x['TimeOn'] >= df$TimeOn & x['TimeOn'] <= df$TimeOff, ])
})
Suggested by @Andre Elrico:
df[apply(df, 1, function(x) {
  nrow(df[between(x[['TimeOn']], df$TimeOn, df$TimeOff), ]) > 1
}), ]
Data
df <- read.table(text="
Entry TimeOn TimeOff Alarm
1 60 70 355
2 80 85 455
3 100 150 400
4 105 120 320
5 125 130 254
6 135 155 220
7 160 170 850
",header=T)

Calculate number of values in vector that exceed values in column of data.frame

I have a long list of numbers, e.g.
set.seed(123)
y<-round(runif(100, 0, 200))
And I would like to store in column y the number of values that exceed each value in column x of a data frame:
df <- data.frame(x=seq(0,200,20))
I can compute the numbers manually, like this:
length(which(y>=20)) #93 values exceed 20
length(which(y>=40)) #81 values exceed 40
etc. I know I can use a for-loop with all values of x, but is there a more elegant way?
I tried this:
df$y <- length(which(y>=df$x))
But this gives a warning and does not give me the desired output.
The data frame should look like this:
df
x y
1 0 100
2 20 93
3 40 81
4 60 70
5 80 61
6 100 47
7 120 40
8 140 29
9 160 19
10 180 8
11 200 0
You can compare each value of df$x against all values of y using sapply:
sapply(df$x, function(a) sum(y>a))
#[1] 99 93 81 70 61 47 40 29 18 6 0
#Looking at your output, maybe you want
sapply(df$x, function(a) sum(y>=a))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Here's another approach using outer, which performs an element-wise comparison of every pair of values from the two vectors:
rowSums(outer(df$x,y, "<="))
#[1] 100 93 81 70 61 47 40 29 19 8 0
Yet one more (from alexis_laz's comment): with left.open = TRUE, findInterval() counts how many sorted values of y fall strictly below each x, so subtracting from length(y) leaves the count of y values >= x.
length(y) - findInterval(df$x, sort(y), left.open = TRUE)
# [1] 100 93 81 70 61 47 40 29 19 8 0
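A quick way to convince yourself that the three approaches agree is to compare them directly, using the question's df and y:
res_sapply   <- sapply(df$x, function(a) sum(y >= a))
res_outer    <- rowSums(outer(df$x, y, "<="))
res_interval <- length(y) - findInterval(df$x, sort(y), left.open = TRUE)

all(res_sapply == res_outer, res_outer == res_interval)
# [1] TRUE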

Summarizing a data frame

I am trying to take the following data and then use it to create a table which has the information broken down by state.
Here's the data:
> head(mydf2, 10)
lead_id buyer_account_id amount state
1 52055267 62 300 CA
2 52055267 64 264 CA
3 52055305 64 152 CA
4 52057682 62 75 NJ
5 52060519 62 750 OR
6 52060519 64 574 OR
15 52065951 64 152 TN
17 52066749 62 600 CO
18 52062751 64 167 OR
20 52071186 64 925 MN
I've already subset the states that I'm interested in and have just the data I need:
mydf2 = subset(mydf, state %in% c("NV","AL","OR","CO","TN","SC","MN","NJ","KY","CA"))
Here's an idea of what I'm looking for:
State Amount Count
NV 1 50
NV 2 35
NV 3 20
NV 4 15
AL 1 10
AL 2 6
AL 3 4
AL 4 1
...
For each state, I'm trying to find a count for each amount "level." I don't necessarily need to group the amount variable, but keep in mind the values are not just 1, 2, 3, etc.:
> mydf$amount
[1] 300 264 152 75 750 574 113 152 750 152 675 489 188 263 152 152 600 167 34 925 375 156 675 152 488 204 152 152
[29] 600 489 488 75 152 152 489 222 563 215 452 152 152 75 100 113 152 150 152 150 152 452 150 152 152 225 600 620
[57] 113 152 150 152 152 152 152 152 152 152 640 236 152 480 152 152 200 152 560 152 240 222 152 152 120 257 152 400
Is there an elegant solution for this in R, or will I be stuck using Excel (yuck!)?
Here's my understanding of what you're trying to do:
Start with a simple data.frame with 26 states and amounts only ranging from 1 to 50 (which is much more restrictive than what you have in your example, where the range is much higher).
set.seed(1)
mydf <- data.frame(
state = sample(letters, 500, replace = TRUE),
amount = sample(1:50, 500, replace = TRUE)
)
head(mydf)
# state amount
# 1 g 28
# 2 j 35
# 3 o 33
# 4 x 34
# 5 f 24
# 6 x 49
Here's some straightforward tabulation. I've also removed any instances where frequency equals zero, and I've reordered the output by state.
temp1 <- data.frame(table(mydf$state, mydf$amount))
temp1 <- temp1[!temp1$Freq == 0, ]
head(temp1[order(temp1$Var1), ])
# Var1 Var2 Freq
# 79 a 4 1
# 157 a 7 2
# 391 a 16 1
# 417 a 17 1
# 521 a 21 1
# 1041 a 41 1
dim(temp1) # How many rows/cols
# [1] 410 3
Here's a slightly different tabulation, where we tabulate after grouping the "amount" values. Here, I've manually specified the breaks, but you could just as easily let R decide what it thinks is best.
temp2 <- data.frame(table(mydf$state,
cut(mydf$amount,
breaks = c(0, 12.5, 25, 37.5, 50),
include.lowest = TRUE)))
temp2 <- temp2[!temp2$Freq == 0, ]
head(temp2[order(temp2$Var1), ])
# Var1 Var2 Freq
# 1 a [0,12.5] 3
# 27 a (12.5,25] 3
# 79 a (37.5,50] 3
# 2 b [0,12.5] 2
# 28 b (12.5,25] 6
# 54 b (25,37.5] 5
dim(temp2)
# [1] 103 3
I am not sure if I understand correctly (you have two data.frames mydf and mydf2). I'll assume your data is in mydf. Using aggregate:
mydf$count <- 1:nrow(mydf)
aggregate(data = mydf, count ~ amount + state, length)
Is this what you are looking for?
Note: here, count is a variable created just so the output's third column is named count directly.
Alternatives with ddply from plyr:
# no need to create a variable called count
ddply(mydf, .(state, amount), summarise, count=length(lead_id))
Here one could use any column that exists in one's data instead of lead_id. Even state:
ddply(mydf, .(state, amount), summarise, count=length(state))
Or equivalently without using summarise:
ddply(mydf, .(state, amount), function(x) c(count=nrow(x)))
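For completeness, a modern one-liner sketch with dplyr's count(), which tallies rows per group (the name argument sets the name of the tally column; assumes a current dplyr version):
library(dplyr)

mydf %>%
  count(state, amount, name = "count")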
