Counting the occurrences of similar rows of a data frame in R

I have data in the following format called DF (this is just a made up simplified sample):
eval.num eval.count fitness fitness.mean green.h.0 green.v.0 offset.0    random
       1          1    1500       1500         100       120       40    232342
       2          2    1000       1250         100       120       40     11843
       3          3    1250       1250         100       120       40 981340234
       4          4    1000     1187.5         100       120       40   4363453
       5          1    2000       2000         200       100       40    345902
       6          1    3000       3000         150        90       10       943
       7          1    2000       2000          90        90      100   9304358
       8          2    1800       1900          90        90      100    284333
However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values for (green.h.0, green.v.0, and offset.0) by only looking at the previous rows.
The example above uses the expected values, but assume they are incorrect.
How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables?
I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I suppose I could just wrap a loop around that, but it seems inefficient to me.

Ok, let's first do it in the easy case where you just have one column.
> data <- rep(sample(1000, 5), sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278
Then you can just use rle to figure out the contiguous sequences:
> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1
Or altogether:
> head(cbind(data, sequence(rle(data)$lengths)))
     data
[1,]  435 1
[2,]  435 2
[3,]  435 3
[4,]  278 1
[5,]  278 2
[6,]  278 3
For your case with multiple columns, there are probably a bunch of ways of applying this solution. Easiest might be to just paste the columns you care about together to form a single vector.
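For instance, a minimal sketch of that paste approach using the column names from the question (the sep argument is just there so distinct value combinations cannot collide when pasted):
key <- paste(DF$green.h.0, DF$green.v.0, DF$offset.0, sep = "_")
DF$count <- sequence(rle(key)$lengths)
Keep in mind that rle only sees contiguous runs, so the count restarts if an identical combination reappears after a gap; a running count per key (for example with ave) handles that case.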

Okay, I used the answer I had on another question and worked out a loop that I think will work. This is what I'm going to use:
cmpfun2 <- function(r) {
  count <- 0
  if (r[1] > 1) {
    # compare this row's key columns against every earlier row
    for (row in 1:(r[1] - 1)) {
      if (all(r[27:51] == DF[row, 27:51, drop = FALSE])) {
        count <- count + 1
      }
    }
  }
  return(count)
}
brows <- apply(DF, 1, cmpfun2)
print(brows)
Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!
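For what it's worth, the matching can also be done without the explicit loop by pasting the compared columns (27:51, taken from the loop above) into a single key and taking a running count per key; subtracting 1 keeps the "previous rows only" semantics. A sketch, assuming the same DF:
key <- do.call(paste, c(DF[, 27:51], sep = "\r"))  # "\r" avoids accidental collisions
DF$count <- ave(seq_len(nrow(DF)), key, FUN = seq_along) - 1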

I have a solution I figured out over time (sorry, I haven't checked this in a while):
checkIt <- function(bind) {
  print(bind)
  cmpfun <- function(r) { all(r == heeds.data[bind, 23:47, drop = FALSE]) }
  brows <- apply(heeds.data[, 23:47], 1, cmpfun)
  # print(heeds.data[brows, c("eval.num", "fitness", "green.h.1", "green.h.2", "green.v.5")])
  print(nrow(heeds.data[brows, c("eval.num", "fitness", "green.h.1", "green.h.2", "green.v.5")]))
}
Note that heeds.data is my actual data frame; I originally printed a few columns to make sure it was working (now commented out). Columns 23:47 are the ones that need to be checked for duplicates.
Also, I really haven't learned as much R as I should so I'm open to suggestions.
Hope this helps!

Related

how to find which rows are related by mathematical difference of x in R

I have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the column MASS to find which masses are related by a difference (positive or negative) of 162 ± 0.5. Then I would like a new column (d$DIFF) that reports the IDs linked by a MASS difference of 162 ± 0.5, and 0 for the IDs where the condition is not met. In this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help
Here's a base R solution using outer:
d$DIFF <- unlist(lapply(
  apply(outer(d$MASS, d$MASS,
              function(x, y) abs(abs(x - y) - 162) < 0.5),
        1, which),
  function(x) if (length(x) == 0) "0" else paste(x, collapse = " & ")))
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.
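If the full matrix proves too heavy, here is a rough sketch (my own variation, not the code above) that sorts the masses once and locates each row's partner window with findInterval, which needs only a few vectors of length 20k; ties at exactly 161.5/162.5 are glossed over:
o  <- order(d$MASS)
ms <- d$MASS[o]
first <- findInterval(ms + 161.5, ms) + 1   # first partner ~162 above this mass
last  <- findInterval(ms + 162.5, ms)       # last partner ~162 above this mass
pairs <- do.call(rbind, lapply(seq_along(ms), function(i)
  if (first[i] <= last[i]) cbind(i, first[i]:last[i])))
DIFF <- rep("0", nrow(d))
for (k in seq_len(NROW(pairs))) {
  i <- o[pairs[k, 1]]                       # map sorted positions back to rows
  j <- o[pairs[k, 2]]
  DIFF[i] <- if (DIFF[i] == "0") as.character(j) else paste(DIFF[i], j, sep = " & ")
  DIFF[j] <- if (DIFF[j] == "0") as.character(i) else paste(DIFF[j], i, sep = " & ")
}
d$DIFF <- DIFF
On the example data this reproduces the output above (each pair is recorded in both directions).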

Add Elements of Data Frame to Another Data Frame Based on Condition R

I have two data frames that showcase results of an analysis from one month and then the subsequent month.
Here is a smaller version of the data:
Jan19=data.frame(Group=c(589,630,523,581,689),Count=c(191,84,77,73,57))
Dec18=data.frame(Group=c(589,630,523,478,602),Count=c(100,90,50,6,0))
Jan19
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
Dec18
Group Count
1 589 100
2 630 90
3 523 50
4 478 6
5 602 0
Jan19 only has counts >0. Dec18 is the dataset with results from the previous month; it has counts >=0 for each group. I have been referencing the full Dec18 dataset for the groups with count = 0 and manually entering them into the full Jan19 dataset. I want to rid myself of the manual part of this exercise and just append the groups with count = 0 to the end of the Jan19 dataset.
That led me to the following code to perform what I described above:
GData=rbind(Jan19,Dec18)
GData=GData[!duplicated(GData$Group),]
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Among the appended dataset, it treats the Jan19 results >0 as the duplicates and removes those. This is the result:
Gdata
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
9 478 6
10 602 0
Essentially, I wanted that 6 to show up as a 0. That led me to the following line of code, where I wanted to set a condition: if the appended data (Dec18) has a Group duplicated in the newer data (Jan19), then the corresponding Count should be 0; otherwise, the value of Count from the Jan19 dataset should hold.
Gdata=ifelse(Dec18$Group %in% Jan19$Group==FALSE, Gdata$Count==0,Jan19$Count)
This is resulting in errors and I'm not sure how to modify it to achieve my desired result. Any help would be appreciated!
Your rbind/deduplication approach is a good one; you just need the Dec18 data you rbind on to have its Count column set to 0:
Gdata = rbind(Jan19, transform(Dec18, Count = 0))
Gdata[!duplicated(Gdata$Group), ]
# Group Count
# 1 589 191
# 2 630 84
# 3 523 77
# 4 581 73
# 5 689 57
# 9 478 0
# 10 602 0
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Among the appended dataset, it treats the Jan19 results >0 as the duplicates and removes those.
This is incorrect. !duplicated() keeps the first occurrence and removes later occurrences. None of the Jan19 data is removed; we can see that the first 5 rows of Gdata are exactly the 5 rows of Jan19. The only issue was that the non-duplicated rows from Dec18 did not all have 0 counts. We fix this with the transform().
There are plenty of other ways to do this: a join using the merge function, rbind-ing on only the non-duplicated groups as d.b suggests, rbind(Jan19, transform(Dec18, Count = 0)[!Dec18$Group %in% Jan19$Group, ]), and others. We could also make your ifelse approach work like this:
Gdata = rbind(Jan19, Dec18)
Gdata$Count = ifelse(!Gdata$Group %in% Jan19$Group, 0, Gdata$Count)
# an alternative to ifelse, a little cleaner
Gdata = rbind(Jan19, Dec18)
Gdata$Count[!Gdata$Group %in% Jan19$Group] = 0
# in both cases, deduplicate afterwards as before: Gdata[!duplicated(Gdata$Group), ]
Use whatever makes the most sense to you.
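For completeness, a sketch of the merge route mentioned above (row order will differ from the rbind result; the suffixes argument just makes the merged column names explicit):
m <- merge(Dec18, Jan19, by = "Group", all = TRUE, suffixes = c(".dec", ".jan"))
Gdata <- data.frame(Group = m$Group,
                    Count = ifelse(is.na(m$Count.jan), 0, m$Count.jan))
Groups present only in Dec18 get Count 0, and every group that appears in Jan19 keeps its Jan19 count.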

find largest smaller element

I have two lists of indices:
> k.start
[1] 3 19 45 120 400 809 1001
> k.event
[1] 3 4 66 300
I need a list that contains, for each element of k.event, the largest value in k.start which is less than or equal to it. The desired result is
k.desired = c(3,3,45,120)
So, I'm trying to replicate this code, except without a for loop:
k.desired <- numeric(length(k.event))
for (i in 1:length(k.event)) {
  k.desired[i] <- k.start[max(which(k.start <= k.event[i]))]
}
Thanks!
You could use
vapply(k.event, function(x) max(k.start[k.start <= x]), 1)
# [1] 3 3 45 120
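Since k.start is sorted, findInterval gives the same lookup without an anonymous function; note it returns 0 for an event smaller than every start, so such elements would be dropped silently by the indexing:
k.start[findInterval(k.event, k.start)]
# [1]   3   3  45 120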

select records according to the difference between records R

I hope someone can suggest something for this "problem", because I really don't know how to proceed...
Well, my data are like this
data<-data.frame(site=c(rep("A",3),rep("B",3),rep("C",3)),time=c(100,180,245,5,55,130,70,120,160))
where time is in minutes.
I want to select only the records, for each site, for which the difference is more than 60, so the output should be like this:
out<-data[c(1:4,6,7,9),]
What I have tried so far. Well, to get the difference I use this:
difference <- stack(tapply(data$time, data$site, diff))
but then I have no idea how to pick out the records which satisfy my condition...
If there is already a similar question, although I've searched for a while, I apologize for this.
To make things clear, as probably the definition of difference was not so unambiguous, I need to select all the records (for each site) which are separated at least by 60 minutes, so not only those that are strictly subsequent in time.
Specifically,
> out
  site time
1    A  100   # included because difference between 2 and 1 is > 60
2    A  180   # included because difference between 3 and 2 is > 60
3    A  245   # included because separated by 60 minutes from record 2
4    B    5   # included because difference between 6 and 4 is > 60
6    B  130   # included because separated by 60 minutes from record 4
7    C   70   # included because difference between 9 and 7 is > 60
9    C  160   # included because separated by 60 minutes from record 7
Maybe, to solve the "problem", it could be useful to consider the results of the difference, something like this:
> difference
  values ind
1     80   A   # include records 1 and 2
2     65   A   # include records 2 and 3
3     50   B   # include only record 4
4     75   B   # include record 6 because there are (50+75) > 60 min from record 4
5     50   C   # include only record 7
6     40   C   # include record 9 because there are (50+40) > 60 min from record 7
Thanks for the help.
data[ave(data$time, data$site, FUN = function(x){c(61, diff(x)) > 60}) == 1, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
Edit following updated question:
keep <- as.logical(ave(data$time, data$site, FUN = function(x) {
  c(TRUE, cumsum(diff(x)) > 60)
}))
data[keep, ]
# site time
# 1 A 100
# 2 A 180
# 3 A 245
# 4 B 5
# 6 B 130
# 7 C 70
# 9 C 160
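One caveat worth flagging (my observation, not part of the answer above): cumsum(diff(x)) measures elapsed time from the group's first record, not from the last record that was kept, and the two readings can disagree once a record has been kept mid-group. If the rule is "more than 60 minutes since the last kept record", a short sequential filter does it; on this example both give the same rows:
keep_from_last <- function(x) {
  keep <- logical(length(x))
  last <- -Inf                       # time of the last record we kept
  for (i in seq_along(x)) {
    if (x[i] - last > 60) {
      keep[i] <- TRUE
      last <- x[i]
    }
  }
  keep
}
data[as.logical(ave(data$time, data$site, FUN = keep_from_last)), ]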
# calculate the differences
data$diff <- unlist(by(data$time, data$site, function(x) c(NA, diff(x))))
# subset data
data[is.na(data$diff) | data$diff > 60, ]
Using plyr:
ddply(data, .(site), function(x) x[c(TRUE, diff(x$time) > 60), ])

plyr to calculate relative aggregration

I have a data.frame that looks like this:
> head(activity_data)
ev_id cust_id active previous_active start_date
1 1141880 201 1 0 2008-08-17
2 4927803 201 1 0 2013-03-17
3 1141880 244 1 0 2008-08-17
4 2391524 244 1 0 2011-02-05
5 1141868 325 1 0 2008-08-16
6 1141872 325 1 0 2008-08-16
for each cust_id
  for each ev_id
    create a new variable $recent_active (= sum of $active across all rows
      with this cust_id where $start_date > [this_row]$start_date - 10)
I am struggling to do this using ddply, as my split grouping was .(cust_id) and I wanted to return rows with both cust_id and ev_id.
Here is what I tried
ddply(activity_data, .(cust_id), function(x) recent_active=sum(x[this_row,]$active))
If ddply is not an option, what other efficient ways do you recommend? My dataset has ~200mn rows and I need to do this about 10-15 times per row.
sample data is here
You actually need to use a two-step approach here (and also need to convert start_date into Date format before using the following code):
ddply(activity_data, .(cust_id), transform, recent_active = <your function>) # not clear what you are asking regarding the function
ddply(activity_data, .(cust_id, ev_id), summarize, recent_active = sum(recent_active))
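For concreteness, a sketch of one way to fill in that first step, assuming the 10-day window from the pseudocode above and start_date already converted with as.Date. It is quadratic within each customer group, so at ~200mn rows you would want something faster (a rolling-window approach in data.table, say):
library(plyr)
activity_data$start_date <- as.Date(activity_data$start_date)
activity_data <- ddply(activity_data, .(cust_id), transform,
                       recent_active = sapply(start_date, function(d)
                         sum(active[start_date > d - 10])))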
