Find unique rows [duplicate] - r

This question already has an answer here:
Extracting only unique rows from data frame in R [duplicate]
(1 answer)
Closed 5 years ago.
This seems so simple, but I can't figure it out.
Given this data frame
df <- data.frame(
  x = c(12, 12, 165, 165, 115, 148, 148, 155, 155, 521),
  y = c(54, 54, 122, 122, 215, 108, 108, 655, 655, 151)
)
df
x y
1 12 54
2 12 54
3 165 122
4 165 122
5 115 215
6 148 108
7 148 108
8 155 655
9 155 655
10 521 151
Now, how can I get the rows that exist only once, i.e. rows 5 and 10? The order of rows can be totally arbitrary, so checking the "next" row is not an option. I tried many things, but nothing worked on my data.frame, which has ~40k rows.
I had one solution working on a subset (~1k rows) of my data.frame, which took 3 minutes to process. Thus, my solution would require about 120 minutes on my original data.frame, which is not acceptable. Can somebody help?

Check duplicated from the beginning and the end of the data frame; if neither returns TRUE, select the row:
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)),]
# x y
#5 115 215
#10 521 151

A solution with table:
library(dplyr)
table(df) %>% as.data.frame %>% subset(Freq == 1) %>% select(-3)
or with base R, since you said in the comments that you prefer not to load packages:
subset(as.data.frame(table(df)), Freq == 1)[, -3]
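One caveat with the table route: it builds the full x-by-y contingency table, which can get large when both columns have many distinct values, and the resulting x and y columns come back as factors rather than numerics. A minimal sketch to convert them back:
res <- subset(as.data.frame(table(df)), Freq == 1)[, -3]
res[] <- lapply(res, function(col) as.numeric(as.character(col))) # factor -> numeric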
Also, data.table is very fast for big data sets and filtering, so this may be worth trying too, since you mentioned speed:
library(data.table)
df2 <- copy(df)
df2 <- setDT(df2)[, COUNT := .N, by = 'x,y'][COUNT == 1][, c("x", "y")]
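If you want to compare the candidates on data of your size before committing, a rough timing sketch with base system.time (big is a hypothetical stand-in built by resampling df; substitute your real 40k-row data frame):
big <- df[sample(nrow(df), 4e4, replace = TRUE), ] # hypothetical 40k-row input
system.time(big[!(duplicated(big) | duplicated(big, fromLast = TRUE)), ])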

A solution using dplyr. df2 is the final output.
library(dplyr)
df2 <- df %>%
  count(x, y) %>%
  filter(n == 1) %>%
  select(-n)

Another base R solution uses ave to compute the total number of occurrences of each row and subsets those that occur exactly once. It can also be modified to subset rows that occur a specific number of times, as shown after the output.
df[ave(1:NROW(df), df, FUN = length) == 1,]
# x y
#5 115 215
#10 521 151
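For instance, to keep the rows whose (x, y) combination occurs exactly twice instead, just change the comparison:
df[ave(1:NROW(df), df, FUN = length) == 2, ]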

Related

Add Elements of Data Frame to Another Data Frame Based on Condition R

I have two data frames that contain results of an analysis from one month and the subsequent month.
Here is a smaller version of the data:
Jan19 = data.frame(Group = c(589, 630, 523, 581, 689), Count = c(191, 84, 77, 73, 57))
Dec18 = data.frame(Group = c(589, 630, 523, 478, 602), Count = c(100, 90, 50, 6, 0))
Jan19
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
Dec18
Group Count
1 589 100
2 630 90
3 523 50
4 478 6
5 602 0
Jan19 only has counts >0. Dec18 is the dataset with results from the previous month; it has counts >=0 for each group. I have been referencing the full Dec18 dataset for counts = 0 and manually entering them into the full Jan19 dataset. I want to rid myself of the manual part of this exercise and just be able to append the groups with counts = 0 to the end of the Jan19 dataset.
That led me to the following code to perform what I described above:
GData=rbind(Jan19,Dec18)
GData=GData[!duplicated(GData$Group),]
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Within the appended dataset, it treats the Jan19 results (>0) as the duplicates and removes those. This is the result:
Gdata
Group Count
1 589 191
2 630 84
3 523 77
4 581 73
5 689 57
9 478 6
10 602 0
Essentially, I wanted that 6 to show up as a 0. So that led me to the following line of code, where I wanted to set a condition: if the newly appended data (Dec18) has a Group that duplicates the newer data (Jan19), then the corresponding Count should = 0; otherwise, the value of Count from the Jan19 dataset should hold.
Gdata=ifelse(Dec18$Group %in% Jan19$Group==FALSE, Gdata$Count==0,Jan19$Count)
This is resulting in errors and I'm not sure how to modify it to achieve my desired result. Any help would be appreciated!
Your rbind/deduplication approach is a good one; you just need the Dec18 data you rbind on to have its Count column set to 0:
Gdata = rbind(Jan19, transform(Dec18, Count = 0))
Gdata[!duplicated(Gdata$Group), ]
# Group Count
# 1 589 191
# 2 630 84
# 3 523 77
# 4 581 73
# 5 689 57
# 9 478 0
# 10 602 0
While this code resulted in the correct dimensions, it does not choose the correct duplicate to remove. Within the appended dataset, it treats the Jan19 results (>0) as the duplicates and removes those.
This is incorrect. !duplicated() keeps the first occurrence and removes later occurrences. None of the Jan19 data is removed; we can see that the first 5 rows of Gdata are exactly the 5 rows of Jan19. The only issue was that the non-duplicated rows from Dec18 did not all have 0 counts. We fix this with the transform().
There are plenty of other ways to do this: with a join using the merge function (sketched below); by rbind-ing on only the non-duplicated groups, as d.b suggests, rbind(Jan19, transform(Dec18, Count = 0)[!Dec18$Group %in% Jan19$Group, ]); and others. We could make your ifelse approach work like this:
Gdata = rbind(Jan19, Dec18)
Gdata$Count = ifelse(!Gdata$Group %in% Jan19$Group, 0, Gdata$Count)
Gdata = Gdata[!duplicated(Gdata$Group), ] # then deduplicate as before
# an alternative to ifelse, a little cleaner
Gdata = rbind(Jan19, Dec18)
Gdata$Count[!Gdata$Group %in% Jan19$Group] = 0
Gdata = Gdata[!duplicated(Gdata$Group), ]
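And a minimal sketch of the merge route mentioned above: a full outer join on Group keeps every group from both months, and the groups absent from Jan19 come back with NA counts, which we then zero out.
Gdata = merge(Jan19, Dec18["Group"], by = "Group", all = TRUE)
Gdata$Count[is.na(Gdata$Count)] = 0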
Use whatever makes the most sense to you.

Least Absolute Deviation in R [duplicate]

This question already has answers here:
Grouping functions (tapply, by, aggregate) and the *apply family
(10 answers)
Closed 4 years ago.
I have a LIST of data frames. Each data frame has the same number of rows and columns.
Here is a sample dataframe:
df
TIME AMOUNT
20 456
30 345
15 122
12 267
Here is the expected RESULT. I would like to compute an AMOUNT_NORM column, where each value in the AMOUNT column is divided by the sum of all values in the AMOUNT column:
df
TIME AMOUNT AMOUNT_NORM
20 456 0.38
30 345 0.29
15 122 0.1
12 267 0.22
The following should do what you want:
library(tidyverse)
df %>% mutate(AMOUNT_NORM = AMOUNT / sum(AMOUNT))
EDIT: I didn't read the "list of data frames" bit. In that case you can just do:
lapply(your_df_list, function(x) {
  x %>% mutate(AMOUNT_NORM = AMOUNT / sum(AMOUNT))
})
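If you would rather not load any packages for this, a base R sketch of the same normalization (your_df_list is the list of data frames from the question):
your_df_list <- lapply(your_df_list, function(x) {
  x$AMOUNT_NORM <- x$AMOUNT / sum(x$AMOUNT)
  x
})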

Filtering dataset by values and replacing with values in other dataset in R [duplicate]

This question already has answers here:
Replace values in data frame based on other data frame in R
(4 answers)
Closed 4 years ago.
I have two datasets like this:
>data1
id l_eng l_ups
1 6385 239
2 680 0
3 3165 0
4 17941 440
5 135 25
6 151 96
7 102188 84
8 440 65
9 6613 408
>data2
id l_ups
1 237
2 549
3 100
4 444
5 28
6 101
7 229
8 92
9 47
I want to filter out the values from data1 where l_ups == 0 and replace them with the values in data2, using id as the lookup value in R.
Final output should look like this:
id l_eng l_ups
1 6385 239
2 680 549
3 3165 100
4 17941 440
5 135 25
6 151 96
7 102188 84
8 440 65
9 6613 408
I tried the code below, but with no luck:
if(data1[,3]==0)
{
filter(data1, last_90_uploads == 0) %>%
merge(data_2, by.x = c("id", "l_ups"),
by.y = c("id", "l_ups")) %>%
select(-l_ups)
}
I am not able to do this with an if statement, as it takes only a single value as the logical condition. But what if I have more than one value in the logical statement?
Like this:
>if(data1[,3]==0)
TRUE TRUE
Edit:
I want to filter the values with a condition and replace them with values in another dataset. Hence, this question is not similar to the one suggested as a duplicate.
You don't want to filter. filter is an operation that returns a data set where rows might have been removed.
You are looking for a "conditional update" operation (in terms of a databases). You are already using dplyr, so try a join operation instead of match:
left_join(data1, data2, by = 'id') %>%
  mutate(l_ups = ifelse(is.na(l_ups.x) | l_ups.x == 0, l_ups.y, l_ups.x))
By using a join operation rather than the direct subsetting comparison that #markus suggested, you ensure that you only compare values with the same ids. If one of your data frames happens to be missing a row, a direct subsetting comparison would fail.
Using a left_join rather than an inner_join also ensures that if data2 is missing an id, the corresponding row will not be removed from data1.
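For reference, the same conditional update can be sketched in base R with match; this assumes every id with a zero in data1 also appears in data2:
zero <- data1$l_ups == 0
data1$l_ups[zero] <- data2$l_ups[match(data1$id[zero], data2$id)]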

Subset by first and last value per group

I have a data frame in R with two columns, temp and timeStamp. The data has temp values at regular timestamps.
I have to create a line chart showing changes in temp over time. The temp value remains the same for several consecutive timeStamps; having these repeating values increases the size of the data file, and I want to remove them. So the output should show just the rows where there is a change.
I cannot think of a way to get this done in R. Any pointers in the right direction would be really helpful.
Here's a dplyr solution:
library(dplyr)
# Toy data
df <- data.frame(time = seq(20), temp = c(rep(60, 5), rep(61, 7), rep(59, 3), rep(60, 5)))
# Now filter for the first and last rows and ones bracketing a temperature change
df %>% filter(temp != lag(temp) | temp != lead(temp) | time == min(time) | time == max(time))
time temp
1 1 60
2 5 60
3 6 61
4 12 61
5 13 59
6 15 59
7 16 60
8 20 60
If the data are grouped by a third column (id), just add group_by(id) %>% before the filtering step.
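A sketch of that grouped variant, assuming a hypothetical id column:
df %>%
  group_by(id) %>%
  filter(temp != lag(temp) | temp != lead(temp) |
         time == min(time) | time == max(time))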
One option would be using data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'temp', we subset the first and last observation (.SD[c(1L, .N)]) of each group. If a group has only a single row, we take that row as is (else .SD).
library(data.table)
setDT(df1)[, if(.N>1) .SD[c(1L, .N)] else .SD, by =temp]
# temp val
#1: 22.50 1
#2: 22.50 4
#3: 22.37 5
#4: 22.42 6
#5: 22.42 7
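Note that grouping by 'temp' lumps together equal temperatures even when they are not consecutive; it works here because each temperature forms a single run. If your series can return to an earlier temperature, a hedged variant that groups by runs of consecutive values instead, using data.table's rleid():
setDT(df1)[, if (.N > 1) .SD[c(1L, .N)] else .SD,
           by = .(run = rleid(temp))][, run := NULL][]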
Or a base R option with duplicated. We check for duplicated values in 'temp' (the output is a logical vector), and also check for duplication from the reverse side (fromLast = TRUE). Use & to find the elements that are TRUE in both cases, negate (!), and subset the rows of 'df1'.
df1[!(duplicated(df1$temp) & duplicated(df1$temp,fromLast=TRUE)),]
# temp val
#1 22.50 1
#4 22.50 4
#5 22.37 5
#6 22.42 6
#7 22.42 7
data
df1 <- data.frame(temp=c(22.5, 22.5, 22.5, 22.5, 22.37,22.42, 22.42), val=1:7)

filtering a dataset dependent on a value within a string

I am currently working with Google Analytics and R and have a query I hope someone can help me with.
I have exported my data from GA into R and have it in a dataframe ready for processing.
I want to create a for loop which goes through my data and sums a number of columns in my dataframe if one column contains a certain value.
For example, my data frame has an ID column plus numeric columns (described further below).
I have a list of IDs, which are the individual 3-digit numbers, that I can use in a for loop.
From my past experience with R, I have been able to filter the list so that I have
data[data$ID == 341,] -> datanew
and I have found some code which can check whether a certain string occurs within another string, producing a boolean:
grepl(value, chars)
Is there a way to link these together so that I have summing code similar to the below?
aggregate(cbind(users, conversion)~ID,data=datanew,FUN=sum) -> resultforID
Basically, taking that data and, for every row whose ID contains 341, adding up the users and conversions.
I hope I have explained this the best way possible.
Thanks in advance
The data table has 3 columns: ID, users, Conversion, with the users and Conversion values linked to the IDs.
Some IDs are on their own, e.g. 341; others are 341|246; and some have three numbers separated by the |.
# toy data
mydata = data.frame(ID = c("341|243","341|243","341|242","341","243",
"999","111|341|222"),
Users = 10:16,
Conv = 5:11)
# ID Users Conv
# 1 341|243 10 5
# 2 341|243 11 6
# 3 341|242 12 7
# 4 341 13 8
# 5 243 14 9
# 6 999 15 10
# 7 111|341|222 16 11
# Are you looking for something like the below?
# Presume you just want to filter those IDs that contain 341.
library(dplyr)
mydata[grep("341",mydata$ID),] %>%
group_by(ID) %>%
summarise_each(funs(sum))
# ID Users Conv
# 1 111|341|222 16 11
# 2 341 13 8
# 3 341|242 12 7
# 4 341|243 21 11
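One caveat: grep("341", ...) is a substring match, so an ID such as 3415 would also be caught. If that can happen in your data, a stricter pattern anchored on the | separator:
mydata[grepl("(^|\\|)341($|\\|)", mydata$ID), ]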
If I understand your question correctly, you may want to look at cSplit from my "splitstackshape" package.
Using #KFB's sample data (which is hopefully representative of your actual data), try:
library(splitstackshape)
cSplit(mydata, "ID", "|", "long")[, lapply(.SD, sum), by = ID]
# ID Users Conv
# 1: 341 62 37
# 2: 243 35 20
# 3: 242 12 7
# 4: 999 15 10
# 5: 111 16 11
# 6: 222 16 11
Alternatively, from the Hadleyverse, you can use "dplyr" and "tidyr" together, like this:
library(dplyr)
library(tidyr)
mydata %>%
transform(ID = strsplit(as.character(ID), "|", fixed = TRUE)) %>%
unnest(ID) %>%
group_by(ID) %>%
summarise_each(funs(sum))
# Source: local data frame [6 x 3]
#
# ID Users Conv
# 1 111 16 11
# 2 222 16 11
# 3 242 12 7
# 4 243 35 20
# 5 341 62 37
# 6 999 15 10
I think this should work:
library(dplyr)
sumdf <- yourdf %>%
group_by(ID) %>%
summarise_each(funs(sum))
I'm not clear about the structure of your ID column, but if you need to just get the numbers you could try this:
library(tidyr)
library(dplyr) # for filter
newdf <- separate(yourdf, ID, c('id1', 'id2'), sep = '\\|') %>% # sep is a regex, so escape the pipe
  filter(id1 == 341) # optional if you just want one ID
Here are two answers: the first uses subset, and the second uses grep with a string.
First, with subset:
x1 <- sample(1:4, 10, replace = TRUE)
x2 <- sample(10:40, 10)
x3 <- sample(10:40, 10)
dat <- data.frame(x1, x2, x3)
for (i in unique(dat$x1)) {
  dat1 <- subset(dat, subset = x1 == i)
  z <- aggregate(. ~ x1, data = dat1, FUN = sum)
  assign(paste0('x1', i), z)
}
And with grep:
x1 <- sample(letters[1:3], 10, replace = TRUE)
x2 <- sample(10:40, 10)
x3 <- sample(10:40, 10)
dat <- data.frame(x1, x2, x3) # data.frame keeps x2 and x3 numeric; cbind would coerce everything to character
for (i in unique(dat$x1)) {
  dat1 <- dat[grep(i, dat$x1), ]
  z <- aggregate(. ~ x1, data = dat1, FUN = sum)
  assign(paste0('x1', i), z) # this assigns separate aggregate objects named after the string
}
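As an aside, a sketch that collects the per-group aggregates in a named list instead of assigning one object per group, which is usually easier to work with afterwards (this groups exactly by x1 rather than by substring match):
res <- lapply(split(dat, dat$x1),
              function(d) aggregate(. ~ x1, data = d, FUN = sum))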
