How do I exclude rows in R based on multiple values? - r

Let's say I have a dataset that looks like this:
> data
iso3 Vaccine Coverage
1 ARG DPT3 95
2 ARG MCV 94
3 ARG Pol3 91
4 KAZ DPT3 99
5 KAZ MCV 98
6 KAZ Pol3 99
7 COD DPT3 67
8 COD MCV 62
9 COD Pol3 66
I want to filter out some records based on several conditions being met simultaneously; say, I want to drop any data from Argentina (ARG) with a coverage of more than 93 percent. The result should thus exclude rows 1 and 2:
iso3 Vaccine Coverage
3 ARG Pol3 91
4 KAZ DPT3 99
5 KAZ MCV 98
6 KAZ Pol3 99
7 COD DPT3 67
8 COD MCV 62
9 COD Pol3 66
I tried using subset() but it excludes too much:
> subset(data, iso3!="ARG" & Coverage>93)
iso3 Vaccine Coverage
4 KAZ DPT3 99
5 KAZ MCV 98
6 KAZ Pol3 99
The problem seems to be that the & operator doesn't seem to work like the boolean AND, returning the intersection of the two conditions. Instead, it functions like a boolean OR, returning their union.
My question is, what do I use here to force the boolean AND?

!= is an operator meaning "not equal".
! indicates logical negation (NOT)
Your condition
iso3!="ARG" & Coverage>93
is
(iso3 not equal to "ARG") AND (Coverage > 93)
If you want
NOT((iso equal to "ARG") AND (Coverage > 93))
You need to create a condition appropriately, eg
eg
!(iso == 'ARG' & Coverage > 93)
For a complete coverage of logical operators in base R see
help('Logic', package='base')

Related

Change selected data.frame columns using value in first row

I have created a table that show how much time each person in a team has spend for tasks each month.
Empl_level team_member 2022/05 2022/06 2022/07 2022/08
0 department 117 69 73 30
1 Diana 108 108 113 184
1 Irina 90 63 56 40
2 Inga 77 56 74 30
3 Elina 23 35 58 79
However there is such "team member" as department. how to to create a new dataset, where time from the sell department will be equally divided by real team members
Empl_level team_member 2022/05 2022/06
1 Diana 108+(117/4) 108+(69/4)
1 Irina 90+(117/4) 63+(69/4)
2 Inga 77+(117/4) etc.
3 Elina 23+(117/4)
Using data.table, something like the following could work:
library(data.table)
setDT(df)
df[, names(df)[-(1:2)] := lapply(.SD, function(x) {x + x[1]/4}), .SDcols = !1:2][-1]
The [-1] at the end removes the first "department" row.

How to sum column based on value in another column in two dataframes?

I am trying to create a limit order book and in one of the functions I want to return a list that sums the column 'size' for the ask dataframe and the bid dataframe in the limit order book.
The output should be...
$ask
oid price size
8 a 105 100
7 o 104 292
6 r 102 194
5 k 99 71
4 q 98 166
3 m 98 88
2 j 97 132
1 n 96 375
$bid
oid price size
1 b 95 100
2 l 95 29
3 p 94 87
4 s 91 102
Total volume: 318 1418
Where the input is...
oid,side,price,size
a,S,105,100
b,B,95,100
I have a function book.total_volumes <- function(book, path) { ... } that should return total volumes.
I tried to use aggregate but struggled with the fact that it is both ask and bid in the limit order book.
I appreciate any help, I am clearly a complete beginner. Only hear to learn :)
If there is anything more I can add to this question so is more clear feel free to leave a comment!

Basic for loop not working

I am trying to get my head around for loops in R and I have what seems to me a very basic example which isn't working.
I have data in a table:
Author ev.ctrl n.ctrl ev.trt n.trt year
1 Cammu 8 56 7 54 1994
2 Eckert 49 137 46 137 2001
3 Kuusela 1 15 1 18 1998
4 Ohlisson 205 625 183 612 2001
5 Rush 259 392 235 393 1996
6 Woodward 7 20 6 40 2004
I want to calculate the sum of the column n.trt I know I could do sum(epidural$n.trt) but want to try and use a for loop.
I have:
for (i in 1:6){
sum(epidural$n.trt[i])
}
This is not giving me anything, not a number nor an error. Any idea what the problem is?
Thanks
Do this instead... we don't need no steenking loops:
> treats <- sum(epidural['n.trt']); treats
[1] 1254
You need to declare sum variable outside of for loop and add values to it. There is no need to call sum function since you have only one value not vector.
s <- 0
for (i in 1:6){
s <- s + epidural$n.trt[i]
}
s

How to obtain a new table after filtering only one column in an existing table in R?

I have a data frame having 20 columns. I need to filter / remove noise from one column. After filtering using convolve function I get a new vector of values. Many values in the original column become NA due to filtering process. The problem is that I need the whole table (for later analysis) with only those rows where the filtered column has values but I can't bind the filtered column to original table as the number of rows for both are different. Let me illustrate using the 'age' column in 'Orange' data set in R:
> head(Orange)
Tree age circumference
1 1 118 30
2 1 484 58
3 1 664 87
4 1 1004 115
5 1 1231 120
6 1 1372 142
Convolve filter used
smooth <- function (x, D, delta){
z <- exp(-abs(-D:D/delta))
r <- convolve (x, z, type='filter')/convolve(rep(1, length(x)),z,type='filter')
r <- head(tail(r, -D), -D)
r
}
Filtering the 'age' column
age2 <- smooth(Orange$age, 5,10)
data.frame(age2)
The number of rows for age column and age2 column are 35 and 15 respectively. The original dataset has 2 more columns and I like to work with them also. Now, I only need 15 rows of each column corresponding to the 15 rows of age2 column. The filter here removed first and last ten values from age column. How can I apply the filter in a way that I get truncated dataset with all columns and filtered rows?
You would need to figure out how the variables line up. If you can add NA's to age2 and then do Orange$age2 <- age2 followed by na.omit(Orange) you should have what you want. Or, equivalently, perhaps this is what you are looking for?
df <- tail(head(Orange, -10), -10) # chop off the first and last 10 observations
df$age2 <- age2
df
Tree age circumference age2
11 2 1004 156 915.1678
12 2 1231 172 876.1048
13 2 1372 203 841.3156
14 2 1582 203 911.0914
15 3 118 30 948.2045
16 3 484 51 1008.0198
17 3 664 75 955.0961
18 3 1004 108 915.1678
19 3 1231 115 876.1048
20 3 1372 139 841.3156
21 3 1582 140 911.0914
22 4 118 32 948.2045
23 4 484 62 1008.0198
24 4 664 112 955.0961
25 4 1004 167 915.1678
Edit: If you know the first and last x observations will be removed then the following works:
x <- 2
df <- tail(head(Orange, -x), -x) # chop off the first and last x observations
df$age2 <- age2

Find the non zero values and frequency of those values in R

I have a data which has two parameters, they are data/time and flow. The flow data is intermittent flow. Lets say at times there is zero flow and suddenly the flow starts and there will be non-zero values for sometime and then the flow will be zero again. I want to understand when the non-zero values occur and how long does each non-zero flow last. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made plot to see the distribution but couldn't get a clue where to start to get the frequency of each non zero values. I would like to see a output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check the non-zero value first and then find how many non zero values are there continuously before it reaches zero value again. What I want to understand is the flow release durations. For eg. in one day there might be multiple releases and I want to note at what time did the release start and how long did it continue before coming to value zero. I hope this explain the problem little better.
The first point is that you have too many NA in your data. In case you want to look into it.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, zeros, non-zeros etc.. for each date.
This can be achieved with rle of course, as also mentioned by #mnel under comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though, just we'll use just g1)
# to make sure that the order of rows are not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
I looked a a small sample of the first two days
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you jsut wante the number of flow-postive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278

Resources