removing duplicate units from data frame - r

I'm working on a large dataset with n covariates. Many of the rows are duplicates. In order to identify the duplicates I need to use a subset of the covariates to create an identification variable. That is, (n-x) covariates are irrelevant. I want to concatenate the values on the x covariates to uniquely identify the observations and eliminate the duplicates.
set.seed(1234)
UNIT <- c(1,1,1,1,2,2,2,3,3,3,4,4,4,5,6,6,6)
DATE <- c("1/1/2010","1/1/2010","1/1/2010","1/2/2012","1/2/2009","1/2/2004","1/2/2005","1/2/2005",
"1/1/2011","1/1/2011","1/1/2011","1/1/2009","1/1/2008","1/1/2008","1/1/2012","1/1/2013",
"1/1/2012")
OUT1 <- c(300,400,400,400,600,700,700,800,800,800,900,700,700,100,100,100,500)
JUNK1 <- c(rnorm(17,0,1))
JUNK2 <- c(rnorm(17,0,1))
test = data.frame(UNIT,DATE,OUT1,JUNK1,JUNK2)
'test' is a sample data frame. The variables I need to use to uniquely identify the observations are 'UNIT', 'DATE' and 'OUT1'. For example,
head(test)
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.2070657 -0.9111954
2 1 1/1/2010 400 0.2774292 -0.8371717
3 1 1/1/2010 400 1.0844412 2.4158352
4 1 1/2/2012 400 -2.3456977 0.1340882
5 2 1/2/2009 600 0.4291247 -0.4906859
6 2 1/2/2004 700 0.5060559 -0.4405479
Observations 1 and 4 are not duplicates in the dataset. Observations 2 and 3 are duplicates. The new dataset I want to create would keep observations 1 and 4 and only one of 2 and 3. The solution I have tried is:
subset(test, !duplicated(c(UNIT,DATE,OUT1)))
Which unfortunately does not do the trick:
UNIT DATE OUT1 JUNK1 JUNK2
1 1 1/1/2010 300 -1.20706575 -0.9111954
5 2 1/2/2009 600 0.42912469 -0.4906859
8 3 1/2/2005 800 -0.54663186 -0.6937202
11 4 1/1/2011 900 -0.47719270 -1.0236557
14 5 1/1/2008 100 0.06445882 1.1022975
15 6 1/1/2012 100 0.95949406 -0.4755931
Although it does ignore the irrelevant variables (JUNK1, JUNK2), the technique is too greedy. The new dataset should contain three observations for unit 1, because there are three unique combinations of UNIT + DATE + OUT1 when UNIT = 1. Is there a way to achieve this without writing a function?

You can pass a data.frame to duplicated. (Your attempt fails because c(UNIT, DATE, OUT1) collapses the three columns into a single long vector, so duplicated() is no longer applied row-wise.)
In your case, you want to pass the first 3 columns of test:
test2 <- test[!duplicated(test[,1:3]),]
If you are working with big data and want to embrace data.table, you can set the key to the first three columns (the ones you want to remove the duplicates from) and then use unique:
library(data.table)
DT <- data.table(test)
# set the key
setkey(DT, UNIT,DATE,OUT1)
DTU <- unique(DT)
For more details on duplicates and data.tables see Filtering out duplicated/non-unique rows in data.table
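Note that in more recent versions of data.table (1.9.8+, if I remember correctly) unique() on a data.table uses all columns by default rather than the key, so it is safer to name the de-duplication columns explicitly:
DTU <- unique(DT, by = c("UNIT", "DATE", "OUT1"))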

Thanks! Looks like we can do:
test2 <- test[!duplicated(test[,c("OUT1","DATE","UNIT")]),]
and it delivers the goods as well. So we can just use the column names rather than 1:3, and the order of the columns doesn't matter.

You can use distinct() from the dplyr package:
library(dplyr)
test %>%
  distinct(UNIT, DATE, OUT1)
Or without the %>% pipe:
distinct(test, UNIT, DATE, OUT1)
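Note that distinct() as written returns only the three named columns. If you also want to keep JUNK1 and JUNK2 (the first row of each group is retained), add .keep_all = TRUE:
test %>%
  distinct(UNIT, DATE, OUT1, .keep_all = TRUE)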

Related

R: how to merge two columns (column addition) while ignoring rows with same value

I have a data.frame like this.
I want to add the values of Sample_Intensity_RTC and Sample_Intensity_nRTC and put the result in a new column; however, in cases where Sample_Intensity_RTC and Sample_Intensity_nRTC have the same value, no addition should be done.
Please note that these columns are not rounded in the same way, so many numbers are the same but displayed with a different number of decimal places (nsmall).
It seems you just want to combine these two columns, not add them in the sense of addition (+). Think of a zipper, perhaps, or two roads merging into one.
The two columns seem to have been created by two separate processes, and the first appears to carry more precision. However, after importing the data provided in the link, they have exactly the same values.
test <- read.csv("test.csv", row.names = 1)
options(digits=10)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC
1 191017QMXP002 NA NA
2 191017QNXP008 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46
6 191017USXP002 NA 76984658.00
In any case, to combine them, we can just use ifelse with the condition is.na for the first column.
test$new_col <- ifelse(is.na(test$Sample_Intensity_RTC),
                       test$Sample_Intensity_nRTC,
                       test$Sample_Intensity_RTC)
head(test)
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
1 191017QMXP002 NA NA NA
2 191017QNXP008 41293681.00 41293681.00 41293681.00
3 191017CPXP009 111446376.86 111446376.86 111446376.86
4 191017HPXP010 92302936.62 92302936.62 92302936.62
5 191017USXP001 NA 76693308.46 76693308.46
6 191017USXP002 NA 76984658.00 76984658.00
sapply(test, function(x) sum(is.na(x)))
Sample_ID Sample_Intensity_RTC Sample_Intensity_nRTC new_col
0 126 143 108
You could also use the coalesce function from dplyr.
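For example (a minimal sketch; coalesce() returns, element-wise, the first non-missing value among its arguments):
library(dplyr)
test$new_col <- coalesce(test$Sample_Intensity_RTC, test$Sample_Intensity_nRTC)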

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times that represent possible 'real' detections:
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in 'possible' called "between" and a column in the "truth" data frame called "match". For every value from 'possible' that falls between a start and stop I'd like a 1, otherwise a 0. For every row in "truth" that finds a match I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID; instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
  possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
# generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
                    c(sample(5:20, size = 100, replace = T)),
                    c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
After getting sample data to work with; the following solution provides what I believe you are asking for. This should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
# need the %between% operator
library(data.table)

# initialize vectors - 0 (FALSE) by default
truth.match <- rep(0, times = nrow(truth))
possible.between <- rep(0, times = nrow(possible))

# iterate through the 'possible' data frame
for (i in 1:nrow(possible)) {
  # boolean vector showing whether each 'truth' row is a 'match'
  match.vec <- apply(truth[, 2:3],
                     MARGIN = 1,
                     FUN = function(x) {possible$Times[i] %between% x})
  # if any are TRUE then update the match and between vectors
  if (any(match.vec)) {
    truth.match[match.vec] <- 1
    possible.between[i] <- 1
  }
}

# I think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
# similarly, betweenAny
possible$betweenAny <- possible.between
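For what it's worth, here is a sketch of a vectorised alternative to the loop, under the same assumptions as the example (the interval bounds sit in columns 2 and 3 of truth, and possible has a Times column):
# logical matrix: does possible$Times[i] fall inside interval j of truth?
hits <- outer(possible$Times, truth[, 2], ">=") & outer(possible$Times, truth[, 3], "<=")
possible$betweenAny <- as.integer(rowSums(hits) > 0)
truth$anyMatch <- as.integer(colSums(hits) > 0)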

Alternative to for loop and indexing?

I have a large data set with 3 columns: Order, Discharge, and date (numeric). There are 20 years of daily Discharge values for each Order, and the Order values can extend beyond 100.
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 0.00000 6574
5 3.54100 6574
6 3.61500 6574
For a given Order x, I would like to replace the Discharge value with the average of the Discharge at x+1 and x-1 for that date. I have been doing this in a crude manner with a for loop and indexing, but it takes over an hour to process. I know there has to be a better way.
x <- 4
for(i in min(dat[,3]):max(dat[,3]))
  dat[,2][dat[,3] == i & dat[,1] == x] <-
    mean(c(dat[,2][dat[,3] == i & dat[,1] == x + 1],
           dat[,2][dat[,3] == i & dat[,1] == x - 1]))
Gives
> head(dat)
Order Discharge date
1 0.04712 6574
2 0.05108 6574
3 0.00000 6574
4 1.77050 6574
5 3.54100 6574
6 3.61500 6574
Where the Discharge at Order 4, for date 6574, has been replaced with 1.77050. It works, but it's ridiculously slow.
I should specify that I don't need to do this calculation on every Order, but only a select few (only 8 out of a total of 117). Based on the answer, I have the following.
dat$NewDischarge <- by(dat$Discharge, dat$date, function(x)
  rowMeans(cbind(c(x[-1], NA), x,
                 c(NA, x[-length(x)])), na.rm = TRUE))
I am still trying to figure out a way to calculate the values only for the selected Orders, and I am stuck in the rut of a for loop and indexing on date and Order.
I would go about it as follows:
1. Ensure that Order is a factor.
2. For each Order, you now have a sub-problem:
2.1. Sort the sub-data-frame by date.
2.2. Each Discharge mean can then be produced in a vectorised way as:
rowMeans(cbind(c(Discharge[-1], NA), Discharge, c(NA, Discharge[-length(Discharge)])))
The sub-problem can be dealt with by a simple for loop or with the function by. I would prefer by.
Your data will come back rearranged, but you can easily reorder it.
For point 2.2, imagine it (or try it) with a simple vector and see the effect of the cbind operation. It also forces you to consider the limit situations: how are the first and last Discharge values calculated (they have no preceding or following date)? A short sketch of this approach, adapted to the question's grouping, follows below.
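A minimal sketch of that idea, adapted to the question's setup (grouping by date and averaging only the two neighbouring Orders, as asked); it assumes every date has the same set of Orders and the data are sorted by date and then Order:
dat <- dat[order(dat$date, dat$Order), ]
# mean of the previous and next value; na.rm handles the first/last Order on each date
neighbour_mean <- function(x)
  rowMeans(cbind(c(x[-1], NA), c(NA, x[-length(x)])), na.rm = TRUE)
dat$NewDischarge <- unlist(by(dat$Discharge, dat$date, neighbour_mean))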
There are several ways to solve your particular dilemma, but the basic question to ask when confronted with a slow for loop is, "How do I use vectorization to replace this loop?" (Well, maybe you should ask "Should I...?" first.) In your case, you're looping across dates, but there's no need to explicitly do that, since just grabbing all of the rows where dat$Order==x will implicitly grab all the dates.
The dataset you posted only has one date, but I can generate some fake data to illustrate:
generate.data <- function(n.order, n.date){
  dat <- expand.grid(Order=seq_len(n.order), date=seq_len(n.date))
  dat$Discharge <- rlnorm(n.order * n.date)
  dat[, c("Order", "Discharge", "date")]
}
dat <- generate.data(10, 5)
head(dat)
# Order Discharge date
# 1 1 2.1925563 1
# 2 2 0.4093022 1
# 3 3 2.5525497 1
# 4 4 1.9274013 1
# 5 5 1.1941986 1
# 6 6 1.2407451 1
tail(dat)
# Order Discharge date
# 45 5 1.4344575 5
# 46 6 0.5757580 5
# 47 7 0.4986190 5
# 48 8 1.2076292 5
# 49 9 0.3724899 5
# 50 10 0.8288401 5
Here's all the rows where dat$Order==4, across all dates:
dat[dat$Order==4, ]
# Order Discharge date
# 4 4 1.9274013 1
# 14 4 3.5319072 2
# 24 4 0.2374532 3
# 34 4 0.4549798 4
# 44 4 0.7654059 5
You can just take the Discharge column, and you'll have the left-hand side of your assignment:
dat[dat$Order==4, ]$Discharge
# [1] 1.9274013 3.5319072 0.2374532 0.4549798 0.7654059
Now you just need the right side, which has two components: the x-1 discharges and the x+1 discharges. You can grab these the same way you grabbed the x discharges:
dat[dat$Order==4-1, ]$Discharge
# [1] 2.5525497 1.9143963 0.2800546 8.3627810 7.8577635
dat[dat$Order==4+1, ]$Discharge
# [1] 1.1941986 4.6076114 0.3963693 0.4190957 1.4344575
To obtain the new values, you need the parallel mean. R doesn't have a pmean function, but you can cbind these and take the rowMeans:
rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge, dat[dat$Order==4+1, ]$Discharge))
# [1] 1.8733741 3.2610039 0.3382119 4.3909383 4.6461105
So, in the end you have:
dat[dat$Order==4, ]$Discharge <- rowMeans(cbind(dat[dat$Order==4-1, ]$Discharge,
                                                dat[dat$Order==4+1, ]$Discharge))
You can even use %in% to make this work across all of your x values.
Note that this assumes your data is ordered.
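Regarding the edit about only needing 8 of the 117 Orders: one rough sketch is to loop over just those Orders (8 iterations, not one per date), reusing the vectorised assignment above. This assumes the rows of neighbouring Orders are sorted the same way by date; also note that if two selected Orders are adjacent, the earlier update feeds into the later one. The orders_to_fix values below are made up.
orders_to_fix <- c(4, 7, 9)  # hypothetical: the Orders you actually need
for (x in orders_to_fix) {
  dat$Discharge[dat$Order == x] <- rowMeans(cbind(dat$Discharge[dat$Order == x - 1],
                                                  dat$Discharge[dat$Order == x + 1]))
}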

Data split and sapply

I'm new to the R language and I'm having some difficulty calculating the returns in my dataset for every Identification.
I have a very large dataset of monthly observations grouped like so:
Code Subset Identification Names Times Value %
100 1001 10011 ..... 201012 10 40
100 1001 10012 ..... 201012 11 60
100 1002 10021 ..... 201012 7 30
100 1002 10022 ..... 201012 13 70
.....
100 1001 10011 ..... 201301 11 45
100 1001 10012 ..... 201301 15 55
100 1002 10021 ..... 201301 9 33
100 1002 10022 ..... 201301 17 67
I need to write a function that calculates the monthly rate of return for every Identification. Then I need to aggregate the calculated values up to the "Subset" level (as a mean weighted by "%").
I've changed the format of the vector times to year-month i.e. "%Y-%m" in this way:
as.yearmon(as.character(Data$Times), format = "%Y%m")
and I've tried to calculate the returns for every Identification using split and sapply, like this:
xm <- split(Data, Identification)
Retxm <- sapply(1:length(xm), function(x) returns(Value))
The output I got using the code above looks like this:
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] 1.605198e-03 1.605198e-03 1.605198e-03 1.605198e-03
[3,] -1.190902e-02 -1.190902e-02 -1.190902e-02 -1.190902e-02
[4,] 3.318032e-03 3.318032e-03 3.318032e-03 3.318032e-03
The output is not very clear; I would like to have the Times on the rows and the Identifications in the column headers.
Thank you so much!
Here's a minimal dataset which is similar:
set.seed(1)
df1 <- data.frame(id=sample(c("10011", "10012", "10013"), 6, replace=TRUE),
                  d1=rep(c(201012, 201101), each=3),
                  v1=ceiling(20*runif(6)))
As to your first question, you can't format an object as Date in base R unless you specify the day in addition to the month and year.
To handle dates which are specified by month & year you could use:
library(zoo)
df1$d1 <- as.yearmon(as.character(df1$d1), format="%Y%m")
As to the second part of the question, it's unclear to me what sort of calculation you're trying to perform. Following your method, you can indeed split the data.frame and do something with each element, e.g. get the sum of the elements in the v1 column:
l1 <- split(df1, df1$id)
sapply(1:length(l1), function(i) sum(l1[[i]]$v1))
Edit: My Java isn't working so I can't add a comment. It's still not clear what you're trying to do. It would be better if you could spell it out with a working example; try editing your original question if you're able to do so.
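In case it helps, here is a minimal sketch of one possible interpretation of the returns calculation, namely the simple month-over-month change Value[t]/Value[t-1] - 1 computed separately for each Identification (column names follow the question; whether this is the intended formula is an assumption):
# sort by Identification and month, then compute per-group simple returns
Data <- Data[order(Data$Identification, Data$Times), ]
Data$Return <- ave(Data$Value, Data$Identification,
                   FUN = function(v) c(NA, diff(v) / head(v, -1)))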

Altering a large distance matrix to be just three columns

I have a large data frame/.csv that is a matrix with 42 columns and 110,357,407 rows. It was derived from the x and y coordinates of two datasets of points, one with 41 points and another with 110,357,407, and the values in the rows represent the distances between these two sets of points (the distance of each point in list 1 to every single point in list 2). The first column is an index of points (from 1 to 110,357,407). An excerpt from the matrix is below.
V1 V2 V3 V4 V5 V6 V7
1 38517.05 38717.8 38840.16 38961.37 39281.06 88551.03 88422.62
2 38514.05 38714.79 38837.15 38958.34 39278 88545.48 88417.09
3 38511.05 38711.79 38834.14 38955.3 39274.94 88539.92 88411.56
4 38508.05 38708.78 38831.13 38952.27 39271.88 88534.37 88406.03
5 38505.06 38705.78 38828.12 38949.24 39268.83 88528.82 88400.5
6 38502.07 38702.78 38825.12 38946.21 39265.78 88523.27 88394.97
7 38499.08 38699.78 38822.12 38943.18 39262.73 88517.72 88389.44
8 38496.09 38696.79 38819.12 38940.15 39259.68 88512.17 88383.91
9 38493.1 38693.8 38816.12 38937.13 39256.63 88506.62 88378.38
10 38490.12 38690.8 38813.12 38934.11 39253.58 88501.07 88372.85
11 38487.14 38687.81 38810.13 38931.09 39250.54 88495.52 88367.33
12 38484.16 38684.83 38807.14 38928.07 39247.5 88489.98 88361.8
13 38481.18 38681.84 38804.15 38925.06 39244.46 88484.43 88356.28
14 38478.21 38678.86 38801.16 38922.04 39241.43 88478.88 88350.75
15 38475.23 38675.88 38798.17 38919.03 39238.39 88473.34 88345.23
16 38472.26 38672.9 38795.19 38916.03 39235.36 88467.8 88339.71
My issue is that I would like to change this matrix into just 3 columns: the first column would be similar to the first column of the matrix (the 110,357,407 point indices), the second would identify which of the 41 points the distance refers to, and the third would be the distance between those two points. So it would look something like this:
Back Pres Dist
1 1 3486
2 1 3456
3 1 3483
4 1 3456
5 1 3429
6 1 3438
7 1 3422
8 1 3427
9 1 3428
(After the distances between all of the back points and the first pres value are complete, pres will change to 2 and eventually work its way up to 41.)
I realize that this will output a hugely ridiculous number of rows, but this is the format that I need to run some processes that are outside of R.
I tried using this code:
cols.Output <- data.frame(col = rep(colnames(output3), each = nrow(output3)),
                          row = rep(rownames(output3), ncol(output3)),
                          value = as.vector(output3))
But there won’t be the same number of rows for each column, so I received an error (and I don’t think it would have really worked with my pres column needs). I tried experimenting with some of the rbind.fill and cbind.fill functions (the one in plyr and ones that others have come up with in the forum). I also looked into some of the melting and reshaping but I was very confused about the functions and couldn’t figure out how to implement them appropriately (or if they even are appropriate for what I need). I would really appreciate any help on this as I’ve been struggling with it for a long time.
Edit: Just to be a little more clear about what I need. Take these two smaller data sets
back <- 1 dataset with 5 sets of x, y points
pres <- 1 dataset with 3 sets of x, y points
Calculating distances between these two data frames generates the initial matrix:
Back 1 2 3
1 3427 3444 3451
2 3432 3486 3476
3 3486 3479 3486
4 3449 3438 3484
5 3483 3486 3486
And my desired output would look like this:
Back Pres Dist
1 1 3427
2 1 3432
3 1 3486
4 1 3449
5 1 3483
1 2 3444
2 2 3486
3 2 3479
4 2 3438
5 2 3486
1 3 3451
2 3 3476
3 3 3486
4 3 3484
5 3 3486
Yes, it looks like this is the kind of problem generally solved with some combination of melt and cast in the reshape2 package. That said, with 100+ million rows, I'm not sure that's the most efficient way to go in this case.
You could do it all manually as follows. I'll assume your data frame is called df, and the distances are in columns 2 to 42. See if this works.
d <- unlist(df[-1]) # put all the distances into a vector
newdf <- cbind(expand.grid(back=seq_len(nrow(df)), pres=seq_len(ncol(df) - 1)), d)
This will probably die unless you have tons of memory. The same holds for any simple solution though, since you have more than 4.5 billion elements in the vector of distances. You can work on subsets of the full dataset at a time to get around this problem.
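As a rough sketch of working in pieces: build the long format one distance column at a time and append each piece to a file on disk, so the full ~4.5-billion-element result never has to sit in memory at once (the output file name is made up):
for (j in 2:ncol(df)) {
  chunk <- data.frame(Back = df[[1]], Pres = j - 1, Dist = df[[j]])
  write.table(chunk, "distances_long.csv", sep = ",", row.names = FALSE,
              col.names = (j == 2), append = (j > 2))
}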
Here's how to use melt on a small example:
require(reshape2)
a <- matrix(rnorm(9), nrow = 3)
a[, 1] <- 1:3 ## Pretending these are one set of points
rownames(a) <- a[, 1] ## We'll put them as rownames instead of a column
melt(a[, -1]) ## And omit that column when melting
If you have memory issues, you could write a for loop and do it in pieces, writing each piece to a file as it's completed.
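As a quick sanity check, here is the manual expand.grid approach applied to the small 5 x 3 example from the question (I typed the distance matrix in by hand as dist_mat, which is my own name for it):
dist_mat <- matrix(c(3427, 3432, 3486, 3449, 3483,
                     3444, 3486, 3479, 3438, 3486,
                     3451, 3476, 3486, 3484, 3486), nrow = 5)
long <- cbind(expand.grid(Back = 1:nrow(dist_mat), Pres = 1:ncol(dist_mat)),
              Dist = as.vector(dist_mat))
head(long, 7)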
