Checking data for equality in R [duplicate]

This question already has answers here:
How to convert a factor to integer/numeric without loss of information?
(12 answers)
Closed 8 years ago.
I am quite new to R and I am having a problem checking some values for equality. I have a dataframe rt (below), and I wish to check whether the values in column rt$V8 are equal to 606.
V1 V2 V3 V4 V5 V6 V7 V8 V9
710 256225 RAIN 1853-12-26 00:00 1 DLY3208 900 1 606 1001
712 256225 RAIN 1853-12-27 00:00 1 DLY3208 900 1 606 1001
714 256225 RAIN 1853-12-28 00:00 1 DLY3208 900 1 606 1001
716 256225 RAIN 1853-12-29 00:00 1 DLY3208 900 1 606 1001
718 256225 RAIN 1853-12-30 00:00 1 DLY3208 900 1 606 1001
720 256225 RAIN 1853-12-31 00:00 1 DLY3208 900 1 606 1001
> typeof(rt$V8)
[1] "integer"
> mode(rt$V8)
[1] "numeric"
> class(rt$V8)
[1] "factor"
> rt$V8
[1] 606 606 606 606 606 606
Levels: 606 1530
Test if equal to 606:
> rt$V8 == 606
[1] FALSE FALSE FALSE FALSE FALSE FALSE
> as.integer(rt$V8) == as.integer(606)
[1] FALSE FALSE FALSE FALSE FALSE FALSE
I do not understand why these checks return FALSE; I would appreciate any advice.

I have encountered the same issue multiple times, and the real problem is usually how the data is imported into R. If you are using read.csv or a similar function, there is an argument called colClasses which is immensely useful. Through it you can tell R what the type of each column is, and R will then not convert your numeric columns into factors.
An easy example is shown here:
Specifying colClasses in the read.csv
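For this question specifically, a minimal sketch of both fixes (the file name data.csv and the header = FALSE setting are assumptions about how the data was read):
# A sketch only; "data.csv" is a hypothetical file name
# Fix 1: declare the type at import time; a named colClasses entry
# forces only V8 and leaves the other columns at their defaults
rt <- read.csv("data.csv", header = FALSE, colClasses = c(V8 = "numeric"))

# Fix 2: repair an already-imported factor column; note that
# as.integer(rt$V8) returns the level codes (1, 2, ...), not 606/1530,
# which is why the as.integer() comparison above returned FALSE
rt$V8 <- as.numeric(as.character(rt$V8))
rt$V8 == 606  # now compares the actual values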

Related

Finding largest consecutive region in table

I'm trying to find regions in a file where lines are consecutive based on two columns, and I want the largest such span. If one line's value in column 4 (V3) comes immediately before the next line's value in column 3 (V2), the two lines are consecutive; the output should report the longest span of consecutive values.
The input looks like this. input:
> x
grp V1 V2 V3 V4 V5 V6
1: 1 DOG.1 142 144 132 134 0
2: 2 DOG.1 313 315 303 305 0
3: 3 DOG.1 316 318 306 308 0
4: 4 DOG.1 319 321 309 311 0
5: 5 DOG.1 322 324 312 314 0
the output should look like this:
out.name in out
[1,] "DOG.1" "313" "324"
Notice how the x[1,] was removed and how the output is starting at x[2,3] and ending at x[5,4]. All of these values are consecutive.
One obvious way is to take tail(x$V2, -1L) - head(x$V3, -1L) and get the start and end indices corresponding to the maximum run of consecutive 1s (a rough base-R sketch follows the IRanges solution below). Here I'd like to show how this can be done with the help of the IRanges package:
require(data.table)
require(IRanges)  ## Bioconductor package

x.ir <- reduce(IRanges(x$V2, x$V3))   # merge overlapping/adjacent ranges
max.idx <- which.max(width(x.ir))     # index of the widest merged range
ans <- data.table(out.name = "DOG.1",
                  `in` = start(x.ir)[max.idx],  # `in` is a reserved word, hence the backquotes
                  out = end(x.ir)[max.idx])
ans
#    out.name  in out
# 1:    DOG.1 313 324
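A rough version of the base-R route mentioned above might look like this (a sketch only; it assumes a single group, as in the example):
d <- tail(x$V2, -1L) - head(x$V3, -1L)  # gap between each row's V3 and the next row's V2
r <- rle(d == 1L)                       # runs of "immediately consecutive" gaps
k <- which(r$values)                    # indices of the TRUE runs
best <- k[which.max(r$lengths[k])]      # longest TRUE run
e <- cumsum(r$lengths)[best]            # last gap index in that run
s <- e - r$lengths[best] + 1L           # first gap index; rows s..(e + 1) are consecutive
data.frame(out.name = x$V1[s], `in` = x$V2[s], out = x$V3[e + 1L],
           check.names = FALSE)
#   out.name  in out
# 1    DOG.1 313 324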

How to find NAs in groups and create new column for data frame

I have a data frame consisting of an "ID" column and a "Diff" column. The ID column is responsible for marking groups of corresponding Diff values.
An example looks like this:
structure(list(ID = c(566, 566, 789, 789, 789, 487, 487, 11,
11, 189, 189), Diff = c(100, 277, 529, 43, NA, 860, 780, 445,
NA, 578, 810)), .Names = c("ID", "Diff"), row.names = c(9L, 10L,
20L, 21L, 22L, 25L, 26L, 51L, 52L, 62L, 63L), class = "data.frame")
My goal is to search each group for NAs in the Diff column and create a new column, that has either a "True" or "False" value for each row, depending if the corresponding group has an NA in Diff.
I tried
x <- aggregate(Diff ~ ID, data, is.na)
and
y <- aggregate(Diff ~ ID, data, function(x) any(is.na(x)))
The idea was to merge the result depending on ID. However, none of the above created a useful result. I know R can do it … and after searching for quite a while I ask you how :)
You can use plyr and its ddply function:
require(plyr)
ddply(data, .(ID), transform, na_diff = any(is.na(Diff)))
## ID Diff na_diff
## 1 11 445 TRUE
## 2 11 NA TRUE
## 3 189 578 FALSE
## 4 189 810 FALSE
## 5 487 860 FALSE
## 6 487 780 FALSE
## 7 566 100 FALSE
## 8 566 277 FALSE
## 9 789 529 TRUE
## 10 789 43 TRUE
## 11 789 NA TRUE
A very similar solution to @dickoa's, except in base R:
do.call(rbind, by(data, data$ID, function(x) transform(x, na_diff = any(is.na(Diff)))))
# ID Diff na_diff
# 11.51 11 445 TRUE
# 11.52 11 NA TRUE
# 189.62 189 578 FALSE
# 189.63 189 810 FALSE
# 487.25 487 860 FALSE
# 487.26 487 780 FALSE
# 566.9 566 100 FALSE
# 566.10 566 277 FALSE
# 789.20 789 529 TRUE
# 789.21 789 43 TRUE
# 789.22 789 NA TRUE
Similarly, you could avoid transform with:
data$na_diff <- with(data, by(Diff, ID, function(x) any(is.na(x)))[as.character(ID)])
(You have two viable strategies already, but here is another which may be conceptually easier to follow if you are relatively new to R and aren't familiar with the way plyr works.)
I often need to know how many NAs I have in different variables, so here is a convenience function I use as standard:
sna <- function(x) {
  sum(is.na(x))
}
From there, I sometimes use aggregate(), but sometimes I find ?summaryBy in the doBy package to be more convenient. Here's an example:
library(doBy)
z <- summaryBy(Diff~ID, data=my.data, FUN=sna)
z
ID Diff.sna
1 11 1
2 189 0
3 487 0
4 566 0
5 789 1
After this, you just need to use ?merge and convert the count of NAs to a logical to get your final data frame:
my.data <- merge(my.data, z, by="ID")
my.data$Diff.sna <- my.data$Diff.sna>0
my.data
ID Diff Diff.sna
1 11 445 TRUE
2 11 NA TRUE
3 189 578 FALSE
4 189 810 FALSE
5 487 860 FALSE
6 487 780 FALSE
7 566 100 FALSE
8 566 277 FALSE
9 789 529 TRUE
10 789 43 TRUE
11 789 NA TRUE

Match objects with same IDs except for one

I have a dataframe with the following format:
df <- data.frame(DS.ID = c(123, 214, 543, 325, 123, 214),
                 P.ID  = c("AAC", "JGK", "DIF", "ADL", "AAE", "JGK"),
                 OP.ID = c("xxab", "xxac", "xxad", "xxae", "xxab", "xxac"))
DS.ID P.ID OP.ID
1 123 AAC xxab
2 214 JGK xxac
3 543 DIF xxad
4 325 ADL xxae
5 123 AAE xxab
6 214 JGK xxac
I'm trying to find instances where one row's DS.ID equals another row's DS.ID and their OP.IDs are also equal, but their P.IDs are not. I know how to do it with a loop, but I'd rather use a quicker method that returns the DS.IDs/information of the rows that do not match, either as a logical vector in another column or through the DS.IDs.
Using duplicated on both ID columns at once (note that duplicated's second positional argument is incomparables, so the two columns must be passed together as a data frame rather than as separate arguments):
df$match <- duplicated(df[, c("DS.ID", "OP.ID")], fromLast = TRUE) |
            duplicated(df[, c("DS.ID", "OP.ID")])
# df
# DS.ID P.ID OP.ID match
# 1 123 AAC xxab TRUE
# 2 214 JGK xxac TRUE
# 3 543 DIF xxad FALSE
# 4 325 ADL xxae FALSE
# 5 123 AAE xxab TRUE
# 6 214 JGK xxac TRUE
EDIT after OP clarification
dupli.2 <- duplicated(df[, c("DS.ID", "OP.ID")], fromLast = TRUE) |
           duplicated(df[, c("DS.ID", "OP.ID")])
dupli.all <- duplicated(df) | duplicated(df, fromLast = TRUE)
as.logical(dupli.2 - dupli.all)
[1] TRUE FALSE FALSE FALSE TRUE FALSE
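A different sketch without duplicated(), in case counting is easier to follow: build a composite key and count the distinct P.ID values per (DS.ID, OP.ID) pair (the column name mismatch is just illustrative):
key <- paste(df$DS.ID, df$OP.ID, sep = "|")            # composite key per pair
n.pid <- tapply(df$P.ID, key, function(p) length(unique(p)))
df$mismatch <- unname(n.pid[key] > 1)                  # TRUE where P.ID differs within a pair
df$mismatch
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE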

R Read CSV file that has timestamp

I have a CSV file that has a timestamp column stored as a string:
15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N
I use read.csv, but the problem is that once the import finishes, the date column looks like this:
1 15 1035 4530 3502 2 892 482 0 2.006011e+13 2 N
2 15 1034 7828 3501 3 263 256 0 2.007112e+13 3 N
3 15 1035 7832 4530 2 1974 1082 0 2.007112e+13 7 N
4 15 2346 8381 8155 3 2684 649 0 2.008021e+13 9 N
Is there a way to parse the date from the string as it gets read? (The CSV file does have headers; they were removed here to keep the data anonymous.) If we can't parse it as it gets read, what is the best way to do the conversion afterwards?
Here are two methods.
Using the zoo package. Personally I prefer this one, since it treats your data as a time series.
library(zoo)
read.zoo(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
index=9,tz='',format='%Y%m%d%H%M%S',sep=',')
V1 V2 V3 V4 V5 V6 V7 V8 V10 V11
2006-01-08 08:16:08 15 1035 4530 3502 2 892 482 0 2 N
2007-11-24 17:55:19 15 1034 7828 3501 3 263 256 0 3 N
2007-11-24 19:38:18 15 1035 7832 4530 2 1974 1082 0 7 N
2008-02-07 13:10:02 15 2346 8381 8155 3 2684 649 0 9 N
Using the colClasses argument of read.table, as mentioned in the comments:
dat <- read.table(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
                 colClasses = c(rep('numeric', 8),
                                'character', 'numeric', 'character'),
                 sep = ',')
strptime(dat$V9, '%Y%m%d%H%M%S')
[1] "2006-01-08 08:16:08" "2007-11-24 17:55:19"
[3] "2007-11-24 19:38:18" "2008-02-07 13:10:02"
As Ricardo says, you can set the column classes with read.csv. In this case I recommend importing the timestamps as characters and, once the csv is loaded, converting them to dates with strptime().
for example:
test <- '20080207131002'
strptime(x = test, format = "%Y%m%d%H%M%S")
This will return a POSIXlt object with the date/time info.
You can use the lubridate package:
test <- '20080207131002'
lubridate::as_datetime(test)
You can also specify the format explicitly, depending on your needs.
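For example, a small sketch using lubridate's parse_date_time(), which accepts an explicit order string:
library(lubridate)
test <- '20080207131002'
parse_date_time(test, orders = "ymdHMS")  # year, month, day, hour, minute, second
# [1] "2008-02-07 13:10:02 UTC"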

Find the non-zero values and frequency of those values in R

I have a dataset with two parameters: date/time and flow. The flow data is intermittent: let's say at times there is zero flow, then suddenly the flow starts and there are non-zero values for some time, and then the flow drops to zero again. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached a sample dataset at this location: https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution but couldn't get a clue where to start to get the frequency of each non-zero value. I would like to see an output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check for the non-zero values first and then find how many non-zero values occur continuously before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before returning to zero. I hope this explains the problem a little better.
The first point is that you have too many NAs in your data, in case you want to look into that.
If I understand correctly, you require the count of continuous 0's followed by continuous non-zeros, then zeros, then non-zeros, etc., for each date.
This can be achieved with rle of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though, just we'll use just g1)
# to make sure that the order of rows are not changed (during sort)
setkey(flow.dt, "Date", "g1")
# group by g1 and set data to TRUE/FALSE by equating to 0 and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
                      val = rle(discharge == 0)$values + 1), by = g1][val == 2, val := 0]
> out # showing just the first and last few entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
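If only the release durations (the non-zero stretches) are of interest, one small follow-up sketch is to filter out the zero runs:
out[val == 1]  # keep only the rows describing non-zero flow runs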
I looked at a small sample of the first two days:
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed, in which case the t() function should succeed; or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
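If you do want to inspect the distribution, a minimal sketch of the unlist() step mentioned above (using which() so that runs with NA values are dropped rather than propagated):
r.per.day <- tapply(flow$discharge, as.Date(flow$Date),
                    function(x) { r <- rle(x > 0); r$lengths[which(r$values)] })
durs <- unlist(r.per.day)  # pool all daily durations into one vector
summary(durs)              # summary statistics of positive-flow durations
hist(durs, breaks = 20, main = "Durations of positive flow", xlab = "Minutes")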
