Prevent duplicates in R

I have a column in a data table which has entries in non-decreasing order. But there can be duplicate entries.
labels <- c(123,123,124,125,126,126,128)
library(data.table)
time <- data.table(labels, unique_labels="")
time
labels unique_labels
1: 123
2: 123
3: 124
4: 125
5: 126
6: 126
7: 128
I want to make all entries unique, so the output will be
time
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
Following is a loop implementation for this:
prev_label <- 0
unique_counter <- 0
for (i in seq_along(time$labels)) {
  if (time$labels[i] != prev_label)
    prev_label <- time$labels[i]
  else
    unique_counter <- unique_counter + 1
  time$unique_labels[i] <- time$labels[i] + unique_counter
}

There's a vectorized solution that avoids the for loop entirely.
Since time is an R function, I've changed the name of your data.table to tm.
cumsum(duplicated(tm$labels)) + tm$labels
[1] 123 124 125 126 127 128 130
tm$unique_labels <- cumsum(duplicated(tm$labels)) + tm$labels
tm
labels unique_labels
1: 123 123
2: 123 124
3: 124 125
4: 125 126
5: 126 127
6: 126 128
7: 128 130
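If you prefer data.table's update-by-reference syntax, the same trick can be written as (a sketch, assuming tm is still a data.table):
tm[, unique_labels := labels + cumsum(duplicated(labels))]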

tank = ("t", 1:NROW(labels), sep="")
time$unique_labels = ifelse(duplicated(time), tank, time$labels)
the duplicated function of the data.table package returns the index of duplicated rows of your dataset, just replace them with "random" values you are sure are not used in your set
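For reference, duplicated() on the labels alone looks like this:
duplicated(c(123, 123, 124, 125, 126, 126, 128))
# [1] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE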


How to add columns to a dataframe based on indexes in R? (See example)

I'm working with a self-made infix function which simply calculates the percentage growth between observations in columns.
options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
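For example, it works element-wise on two numeric vectors (values chosen just to illustrate):
c(100, 200) %grow% c(110, 150)
# [1]  10 -25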
test <- data.frame(a=c(101,202,301), b=c(123,214,199), h=c(134, 217, 205))
Then I use lapply on my toy data frame in order to add two new columns.
test[,4:5] <- lapply(1:(ncol(test)-1), function(i) test[,i] %grow% test[,(i+1)])
test
#Output
a b h V4 V5
1 101 123 134 21.78 8.94
2 202 214 217 5.94 1.40
3 301 199 205 -33.89 3.02
This is easy considering I have just three columns and can simply write test[,4:5]. Now, in general terms: how can this be done for n columns using column indexes?
What I mean is I want to add n-1 new columns to a given data frame, starting after the last existing column. Something like:
test[,(last_current_column+1):(last_column_created_using_function)]
Considering what I've read in some other posts, using my example, test[,(last_current_column+1): could be written as:
test[,(ncol(test)+1):]
but the second part is still missing and I have no idea how to write it.
I hope I made myself clear. I fully appreciate any comments or advice.
Happy 2019 :)
Another way would be:
#options(digits=3)
`%grow%` <- function(x,y) {
(y-x) / x * 100
}
test <- data.frame(a=c(101,202,301),
b=c(123,214,199),
h=c(134, 217, 205),
d=c(156,234,235))
# a b h d
# 1 101 123 134 156
# 2 202 214 217 234
# 3 301 199 205 235
seqcols <- seq_along(test) # saved just to improve readability
test[,seqcols[-length(seqcols)] + max(seqcols)] <- lapply(seqcols[-length(seqcols)],
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h d V5 V6 V7
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63
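The explicit index range asked about in the question can also be written out directly; a sketch (run against the original four-column test, producing the same V5:V7 columns):
test[, (ncol(test)+1):(2*ncol(test)-1)] <- lapply(1:(ncol(test)-1),
                                                  function(i) test[,i] %grow% test[,(i+1)])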
Similar to the second solution from @Ronak Shah (shown below), just with the use of map2_df from purrr:
cbind(test,
new=purrr::map2_df(test[seqcols[-length(seqcols)]], test[seqcols[-1]], `%grow%`),
deparse.level=1)
# a b h d new.a new.b new.h
# 1 101 123 134 156 21.78 8.94 16.42
# 2 202 214 217 234 5.94 1.40 7.83
# 3 301 199 205 235 -33.89 3.02 14.63
You would always have ncol(test) - 1 new columns. Using this logic, there are multiple ways to do this.
One way would be to construct a character vector with some prefix value.
test[paste0("new_col", seq_len(ncol(test) - 1))] <- lapply(1:(ncol(test)-1),
function(i) test[,i] %grow% test[,(i+1)])
test
# a b h new_col1 new_col2
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075
Another option is to use mapply and transform, creating subsets of the dataframe:
transform(test,
new_col = mapply(`%grow%`, test[1:(ncol(test)- 1)], test[2:ncol(test)]))
# a b h new_col.a new_col.b
#1 101 123 134 21.782178 8.943089
#2 202 214 217 5.940594 1.401869
#3 301 199 205 -33.887043 3.015075

Finding the largest consecutive region in a table

I'm trying to find regions in a file that have consecutive lines based on two columns. I want to find the largest span of consecutive values. If a row's value in column 4 (V3) comes immediately before the next row's value in column 3 (V2), the rows are consecutive; I want to write out the longest such span.
The input looks like this. input:
> x
grp V1 V2 V3 V4 V5 V6
1: 1 DOG.1 142 144 132 134 0
2: 2 DOG.1 313 315 303 305 0
3: 3 DOG.1 316 318 306 308 0
4: 4 DOG.1 319 321 309 311 0
5: 5 DOG.1 322 324 312 314 0
the output should look like this:
out.name in out
[1,] "DOG.1" "313" "324"
Notice how row x[1,] was excluded and how the output starts at x[2,3] and ends at x[5,4]. All of these values are consecutive.
One obvious way is to take tail(x$V2, -1L) - head(x$V3, -1L) and get the start and end indices corresponding to the maximum run of consecutive 1s. But I'll skip it here (and leave it to others), as I'd like to show how this can be done with the help of the IRanges package:
require(data.table)
require(IRanges) ## Bioconductor package
x.ir = reduce(IRanges(x$V2, x$V3))
max.idx = which.max(width(x.ir))
ans = data.table(out.name = "DOG.1",
                 `in` = start(x.ir)[max.idx],  # `in` is a reserved word, so it needs backticks
                 out = end(x.ir)[max.idx])
# out.name  in out
# 1: DOG.1 313 324
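For completeness, here is a rough sketch of the base-R idea skipped above, using rle directly on the row-to-row gaps (untested beyond the sample data; assumes x is ordered as shown and has at least one consecutive pair of rows):
d <- tail(x$V2, -1L) - head(x$V3, -1L)   # equals 1 where row i+1 directly continues row i
r <- rle(d == 1L)
best <- which.max(r$lengths * r$values)  # longest run of TRUEs
end_d <- cumsum(r$lengths)[best]
start_row <- end_d - r$lengths[best] + 1L
end_row <- end_d + 1L
data.table(out.name = as.character(x$V1[start_row]),
           `in` = x$V2[start_row], out = x$V3[end_row])
# for the sample data this picks rows 2..5, i.e. in = 313, out = 324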

Find the non-zero values and the frequency of those values in R

I have data with two parameters: date/time and flow. The flow is intermittent. Let's say at times there is zero flow, then suddenly the flow starts; there will be non-zero values for some time and then the flow will be zero again. I want to understand when the non-zero values occur and how long each non-zero flow lasts. I have attached the sample dataset at this location https://www.dropbox.com/s/ef1411dq4gyg0cm/sampledataflow.csv
The data is 1 minute data.
I was able to import the data into R as follows:
flow <- read.csv("sampledataflow.csv")
summary(flow)
names(flow) <- c("Date","discharge")
flow$Date <- strptime(flow$Date, format="%m/%d/%Y %H:%M")
sapply(flow,class)
plot(flow$Date, flow$discharge,type="l")
I made a plot to see the distribution but couldn't get a clue where to start to get the frequency of the non-zero values. I would like to see an output table as follows:
Date Duration in Minutes
Please let me know if I am not clear here. Thanks.
Additional Info:
I think we need to check for the non-zero values first and then find how many non-zero values occur continuously before the flow reaches zero again. What I want to understand is the flow release durations. For example, in one day there might be multiple releases, and I want to note at what time each release started and how long it continued before returning to zero. I hope this explains the problem a little better.
The first point is that you have many NA values in your data, in case you want to look into it.
If I understand correctly, you require the counts of continuous 0's followed by continuous non-zeros, then zeros, then non-zeros, etc., for each date.
This can be achieved with rle of course, as also mentioned by @mnel in the comments. But there are quite a few catches.
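For anyone unfamiliar with rle, a quick toy illustration (values invented purely for demonstration):
rle(c(0, 0, 5, 7, 0) == 0)
# Run Length Encoding
#   lengths: int [1:3] 2 2 1
#   values : logi [1:3] TRUE FALSE TRUE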
First, I'll set up the data with non-NA entries:
flow <- read.csv("~/Downloads/sampledataflow.csv")
names(flow) <- c("Date","discharge")
flow <- flow[1:33119, ] # remove NA entries
# format Date to POSIXct to play nice with data.table
flow$Date <- as.POSIXct(flow$Date, format="%m/%d/%Y %H:%M")
Next, I'll create a Date column:
flow$g1 <- as.Date(flow$Date)
Finally, I prefer using data.table. So here's a solution using it.
# load package, get data as data.table and set key
require(data.table)
flow.dt <- data.table(flow)
# set key to both "Date" and "g1" (even though we'll use just g1)
# to make sure that the order of rows is not changed (during the sort)
setkey(flow.dt, "Date", "g1")
# group by g1, convert discharge to TRUE/FALSE by comparing to 0, and get rle lengths
out <- flow.dt[, list(duration = rle(discharge == 0)$lengths,
val = rle(discharge == 0)$values + 1), by=g1][val == 2, val := 0]
> out # just to show a few first and last entries
# g1 duration val
# 1: 2010-05-31 120 0
# 2: 2010-06-01 722 0
# 3: 2010-06-01 138 1
# 4: 2010-06-01 32 0
# 5: 2010-06-01 79 1
# ---
# 98: 2010-06-22 291 1
# 99: 2010-06-22 423 0
# 100: 2010-06-23 664 0
# 101: 2010-06-23 278 1
# 102: 2010-06-23 379 0
So, for example, for 2010-06-01, there are 722 0's followed by 138 non-zeros, followed by 32 0's followed by 79 non-zeros and so on...
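If you also want the start time of each release (the question asks for a table of Date and Duration in Minutes), here is a hedged sketch building on flow.dt; rleid() needs a reasonably recent data.table version, and the column names run, start and duration_min are my own:
flow.dt[, run := rleid(discharge == 0)]   # label each run of zeros / non-zeros
releases <- flow.dt[discharge != 0, .(start = min(Date), duration_min = .N), by = run]
head(releases[, .(start, duration_min)])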
I looked at a small sample of the first two days:
> do.call( cbind, tapply(flow$discharge, as.Date(flow$Date), function(x) table(x > 0) ) )
2010-06-01 2010-06-02
FALSE 1223 911
TRUE 217 529 # these are the cumulative daily durations of positive flow.
You may want this transposed in which case the t() function should succeed. Or you could use rbind.
If you just wanted the number of flow-positive minutes, this would also work:
tapply(flow$discharge, as.Date(flow$Date), function(x) sum(x > 0, na.rm=TRUE) )
#--------
2010-06-01 2010-06-02 2010-06-03 2010-06-04 2010-06-05 2010-06-06 2010-06-07 2010-06-08
217 529 417 463 0 0 263 220
2010-06-09 2010-06-10 2010-06-11 2010-06-12 2010-06-13 2010-06-14 2010-06-15 2010-06-16
244 219 287 234 31 245 311 324
2010-06-17 2010-06-18 2010-06-19 2010-06-20 2010-06-21 2010-06-22 2010-06-23 2010-06-24
299 305 124 129 295 296 278 0
To get the lengths of intervals with discharge values greater than zero:
tapply(flow$discharge, as.Date(flow$Date), function(x) rle(x>0)$lengths[rle(x>0)$values] )
#--------
$`2010-06-01`
[1] 138 79
$`2010-06-02`
[1] 95 195 239
$`2010-06-03`
[1] 57 360
$`2010-06-04`
[1] 6 457
$`2010-06-05`
integer(0)
$`2010-06-06`
integer(0)
... Snipped output
If you want to look at the distribution of these durations you will need to unlist that result. (And remember that the durations which were split at midnight may have influenced the counts and durations.) If you just wanted durations without dates, then use this:
flowrle <- rle(flow$discharge>0)
flowrle$lengths[!is.na(flowrle$values) & flowrle$values]
#----------
[1] 138 79 95 195 296 360 6 457 263 17 203 79 80 85 30 189 17 270 127 107 31 1
[23] 2 1 241 311 229 13 82 299 305 3 121 129 295 3 2 291 278
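To look at the distribution of those durations mentioned above, a minimal sketch (first storing the per-day rle result, then flattening it):
daily_runs <- tapply(flow$discharge, as.Date(flow$Date),
                     function(x) rle(x > 0)$lengths[rle(x > 0)$values])
hist(unlist(daily_runs), xlab = "Duration (minutes)", main = "Flow-positive run lengths")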

Selecting/sorting values within a table in R

I'm working in R with the following dataset for a metabolomics study.
first Name Area Sample Similarity
120 Pentanone 699468 PO4:1 954
120 Pentanone 153744 PO2:1 981
126 Methylamine 83528 PO4:1 887
126 Unknown 32741 PO2:1 645
126 Sulfurous 43634 PO1:1 800
I want to be able to select, among the rows with the same value in the first column (for example 120), the compounds with the same name (for example Pentanone). From this selection I want to copy the row information that corresponds to the highest Similarity and create new columns within the table. In this case the following information:
120 Pentanone 153744 PO2:1 981
I know that "send me the code posts" are not very appreciated by I would greatly appreciated some clues on how to start.
You can use the plyr package.
First I reproduce your data (try to use dput(dat) next time):
dat <- read.table(text ='first Name Area Sample Similarity
120 Pentanone 699468 PO4:1 954
120 Pentanone 153744 PO2:1 981
126 Methylamine 83528 PO4:1 887
126 Unknown 32741 PO2:1 645
126 Sulfurous 43634 PO1:1 800',header=TRUE)
I split my data.frame by (first & Name),
I apply the function to each set of rows,
and I aggregate into a new data.frame:
library(plyr)
ddply(dat,.(first,Name),function(x) x[x$Similarity==max(x$Similarity),])
first Name Area Sample Similarity
1 120 Pentanone 153744 PO2:1 981
2 126 Methylamine 83528 PO4:1 887
3 126 Sulfurous 43634 PO1:1 800
4 126 Unknown 32741 PO2:1 645
There are many options. You already have one example using plyr; here are two more.
Base R approach, using aggregate and merge:
merge(dat, aggregate(Similarity ~ first + Name, dat, max))
# first Name Similarity Area Sample
# 1 120 Pentanone 981 153744 PO2:1
# 2 126 Methylamine 887 83528 PO4:1
# 3 126 Sulfurous 800 43634 PO1:1
# 4 126 Unknown 645 32741 PO2:1
A sqldf approach:
library(sqldf)
sqldf("select *, max(Similarity) `Similarity` from dat group by first, Name")
# first Name Similarity Area Sample
# 1 120 Pentanone 981 153744 PO2:1
# 2 126 Methylamine 887 83528 PO4:1
# 3 126 Sulfurous 800 43634 PO1:1
# 4 126 Unknown 645 32741 PO2:1
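If you happen to use dplyr (not among the original answers, just a sketch of the equivalent), slice_max() keeps the highest-Similarity row per group:
library(dplyr)
dat %>%
  group_by(first, Name) %>%
  slice_max(Similarity, n = 1) %>%
  ungroup()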

grep: How can I search through my data using a wildcard in R

I have recently started using R, and now I am trying to get some data out of it. However, the results I get are quite confusing. I have daily data from 1961 to 1963, with dates in the format 1961-04-25. I created a vector called date.
To search for the period between April 10 and May 21 of each year and display the dates, I used this command:
date[date >= grep("196.-04-10", date, value = TRUE) &
date <= grep("196.-05-21", date, value = TRUE)]
The results I get are somewhat confusing, as they come in 3-day steps instead of giving me every single day... see below.
[1] "1961-04-10" "1961-04-13" "1961-04-16" "1961-04-19" "1961-04-22" "1961-04-25" "1961-04-28" "1961-05-01" "1961-05-04" "1961-05-07" "1961-05-10"
[12] "1961-05-13" "1961-05-16" "1961-05-19" "1962-04-12" "1962-04-15" "1962-04-18" "1962-04-21" "1962-04-24" "1962-04-27" "1962-04-30" "1962-05-03"
[23] "1962-05-06" "1962-05-09" "1962-05-12" "1962-05-15" "1962-05-18" "1962-05-21" "1963-04-11" "1963-04-14" "1963-04-17" "1963-04-20" "1963-04-23"
[34] "1963-04-26" "1963-04-29" "1963-05-02" "1963-05-05" "1963-05-08" "1963-05-11" "1963-05-14" "1963-05-17" "1963-05-20"
I think the grep strategy is misguided: each grep(..., value = TRUE) call returns one matching date per year, so the >= and <= comparisons get recycled across the whole date vector, which is why you see every third element. Maybe something like this will work instead ... basically, I'm computing the day-of-year (Julian date, yday()) and using that for comparison:
z <- as.Date(c("1961-04-10","1961-04-11","1961-04-12",
"1961-05-21","1961-05-22","1961-05-23",
"1963-04-09","1963-04-12","1963-05-21","1963-05-22"))
library(lubridate)
z[yday(z)>=yday(as.Date("1961-04-10")) & yday(z)<=yday(as.Date("1961-05-21"))]
## [1] "1961-04-10" "1961-04-11" "1961-04-12" "1961-05-21" "1963-04-12"
## [6] "1963-05-21"yz <- year(z)
Actually, this solution is fragile to leap-years ...
Better (?):
yz <- year(z)
z[z>=as.Date(paste0(yz,"-04-10")) & z<=as.Date(paste0(yz,"-05-21"))]
(You should definitely test this for yourself, I haven't tested carefully!)
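For instance, the leap-year concern can be checked quickly (1964 is a leap year; the dates are chosen only for illustration, and lubridate is already loaded above):
yday(as.Date("1964-04-10"))  # 101 in a leap year
yday(as.Date("1963-04-10"))  # 100 in a non-leap year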
Using a date format for your variable would be the best bet here.
## set up some test data
datevar <- seq.Date(as.Date("1961-01-01"),as.Date("1963-12-31"),by="day")
test <- data.frame(date=datevar,id=1:(length(datevar)))
head(test)
## which looks like:
> head(test)
date id
1 1961-01-01 1
2 1961-01-02 2
3 1961-01-03 3
4 1961-01-04 4
5 1961-01-05 5
6 1961-01-06 6
## find the date ranges you want
selectdates <-
(format(test$date,"%m") == "04" & as.numeric(format(test$date,"%d")) >= 10) |
(format(test$date,"%m") == "05" & as.numeric(format(test$date,"%d")) <= 21)
## subset the original data
result <- test[selectdates,]
## which looks as expected:
> result
date id
100 1961-04-10 100
101 1961-04-11 101
102 1961-04-12 102
103 1961-04-13 103
104 1961-04-14 104
105 1961-04-15 105
106 1961-04-16 106
107 1961-04-17 107
108 1961-04-18 108
109 1961-04-19 109
110 1961-04-20 110
111 1961-04-21 111
112 1961-04-22 112
113 1961-04-23 113
114 1961-04-24 114
115 1961-04-25 115
116 1961-04-26 116
117 1961-04-27 117
118 1961-04-28 118
119 1961-04-29 119
120 1961-04-30 120
121 1961-05-01 121
122 1961-05-02 122
123 1961-05-03 123
124 1961-05-04 124
125 1961-05-05 125
126 1961-05-06 126
127 1961-05-07 127
128 1961-05-08 128
129 1961-05-09 129
130 1961-05-10 130
131 1961-05-11 131
132 1961-05-12 132
133 1961-05-13 133
134 1961-05-14 134
135 1961-05-15 135
136 1961-05-16 136
137 1961-05-17 137
138 1961-05-18 138
139 1961-05-19 139
140 1961-05-20 140
141 1961-05-21 141
465 1962-04-10 465
...
