Removing rows when some values match and some do not [duplicate] - r

ID Amount Previous
1 10 15
1 10 13
2 20 18
2 20 24
3 5 7
3 5 6
I want to remove the duplicate rows from the data frame above, where ID and Amount match but the values in the Previous column differ. When deciding which row to keep, I'd like the one with the higher Previous value.
The result would look like:
ID Amount Previous
1 10 15
2 20 24
3 5 7

One option is distinct on the columns 'ID' and 'Amount', after arranging the dataset so that the highest Previous comes first within each group. Specifying .keep_all = TRUE keeps the other columns from the first row of each distinct ID/Amount combination.
library(dplyr)
df1 %>%
  arrange(ID, Amount, desc(Previous)) %>%
  distinct(ID, Amount, .keep_all = TRUE)
# ID Amount Previous
#1 1 10 15
#2 2 20 24
#3 3 5 7
Or use duplicated from base R: after ordering the rows so that the highest Previous comes first, apply it to the 'ID' and 'Amount' columns to create a logical vector and use that to subset the rows of the dataset
df2 <- df1[with(df1, order(ID, Amount, -Previous)),]
df2[!duplicated(df2[c('ID', 'Amount')]),]
# ID Amount Previous
#1 1 10 15
#4 2 20 24
#5 3 5 7
data
df1 <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L), Amount = c(10L,
10L, 20L, 20L, 5L, 5L), Previous = c(15L, 13L, 18L, 24L, 7L,
6L)), class = "data.frame", row.names = c(NA, -6L))
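Another base R possibility, sketched here on the assumption that tied maxima within a group should all be kept: compare each Previous value with its group maximum from ave. The names grp, group_max and keep are only illustrative.

```r
# Same example data as above
df1 <- data.frame(
  ID       = c(1L, 1L, 2L, 2L, 3L, 3L),
  Amount   = c(10L, 10L, 20L, 20L, 5L, 5L),
  Previous = c(15L, 13L, 18L, 24L, 7L, 6L)
)

# One grouping key per ID/Amount combination that actually occurs
grp <- paste(df1$ID, df1$Amount, sep = "_")

# ave() returns, for every row, the maximum of Previous within its group
group_max <- ave(df1$Previous, grp, FUN = max)

# Keep the rows whose Previous equals the group maximum (ties are all kept)
keep <- df1[df1$Previous == group_max, ]
keep
```

Unlike the ordering approaches, this leaves the rows in their original order and does not break ties, so a group with two equal maxima would return two rows.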

Related

Count an observation based on condition of another variable [duplicate]

I have a dataset of regional patents. I want to count how many Appln_id values have more than one Person_id and how many have only one Person_id.
Appln_id 3 3 3 10 10 10 10 2 4 4
Person_id 23 22 24 49 50 55 51 101 122 104
Here Appln_id 3 has three different Person_id values (23, 22, 24) and Appln_id 2 has only one Person_id (101). So I want to count how many Appln_id values have more than one Person_id and how many have only one.
Count the number of unique Person_id values for each Appln_id.
library(dplyr)
result <- df %>% group_by(Appln_id) %>% summarise(n = n_distinct(Person_id))
result
# Appln_id n
#* <int> <int>
#1 2 1
#2 3 3
#3 4 2
#4 10 4
Now you can count how many of them have only 1 Person_id and how many of them have more than that.
sum(result$n == 1)
#[1] 1
sum(result$n > 1)
#[1] 3
data
df <- structure(list(Appln_id = c(3L, 3L, 3L, 10L, 10L, 10L, 10L, 2L,
4L, 4L), Person_id = c(23L, 22L, 24L, 49L, 50L, 55L, 51L, 101L,
122L, 104L)), class = "data.frame", row.names = c(NA, -10L))
We can use data.table
library(data.table)
setDT(df)[, .(n = uniqueN(Person_id)), by = Appln_id]
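For completeness, the same count can be sketched in base R without any package, using tapply. The names n_persons, n_single and n_multiple are only illustrative.

```r
# Same example data as above
df <- data.frame(
  Appln_id  = c(3L, 3L, 3L, 10L, 10L, 10L, 10L, 2L, 4L, 4L),
  Person_id = c(23L, 22L, 24L, 49L, 50L, 55L, 51L, 101L, 122L, 104L)
)

# Number of distinct Person_id values per Appln_id
n_persons <- tapply(df$Person_id, df$Appln_id, function(x) length(unique(x)))

# How many applications have exactly one person, and how many have more
n_single   <- sum(n_persons == 1)
n_multiple <- sum(n_persons > 1)
```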

Create a new data frame column that is a combination of other columns

I have 3 columns a, b, c and I want to combine them into a new column based on the column mode, as follows:
if mode = 1, data from a
if mode = 2, data from b
if mode = 3, data from c
example
mode a b c
1 2 3 4
1 5 53 14
3 2 31 24
2 12 13 44
1 20 30 40
Output
mode a b c combine
1 2 3 4 2
1 5 53 14 5
3 2 31 24 24
2 12 13 44 13
1 20 30 40 20
We can use row/column indexing to get the values from the dataset. Here, the row sequence (seq_len(nrow(df1))) and the column index (df1$mode) are combined with cbind into a two-column matrix, which extracts the corresponding value for each row from the subset of the dataset holding columns a, b, c (df1[2:4])
df1$combine <- df1[2:4][cbind(seq_len(nrow(df1)), df1$mode)]
df1$combine
#[1] 2 5 24 13 20
data
df1 <- structure(list(mode = c(1L, 1L, 3L, 2L, 1L), a = c(2L, 5L, 2L,
12L, 20L), b = c(3L, 53L, 31L, 13L, 30L), c = c(4L, 14L, 24L,
44L, 40L)), class = "data.frame", row.names = c(NA, -5L))
Another solution in base R that works by converting "mode" to letters then extracting those values in the matching columns.
df1$combine <- diag(as.matrix(df1[, letters[df1$mode]]))
Also, two ways with dplyr. Nested if_else:
library(dplyr)
df1 %>%
  mutate(combine =
    if_else(mode == 1, a,
      if_else(mode == 2, b, c)
    )
  )
And case_when():
df1 %>% mutate(combine =
  case_when(mode == 1 ~ a, mode == 2 ~ b, mode == 3 ~ c)
)
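To make the row/column indexing idea from the first answer easier to follow, here is a step-by-step sketch of the same technique. It assumes mode only ever holds 1, 2 or 3; the names idx and vals are illustrative.

```r
# Same example data as above
df1 <- data.frame(
  mode = c(1L, 1L, 3L, 2L, 1L),
  a    = c(2L, 5L, 2L, 12L, 20L),
  b    = c(3L, 53L, 31L, 13L, 30L),
  c    = c(4L, 14L, 24L, 44L, 40L)
)

# Two-column index matrix: one (row, column) pair per row of df1
idx <- cbind(row = seq_len(nrow(df1)), col = df1$mode)

# Indexing a matrix with a two-column matrix picks one cell per pair
vals <- as.matrix(df1[c("a", "b", "c")])[idx]
vals
```

Each row of idx names exactly one cell, so the result has one value per row of df1 and no intermediate n-by-n matrix is built.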

Reducing multiple rows to 1 by index in R [duplicate]

I am relatively new to R. I am working with a dataset that has multiple data points per timestamp, but they are spread over multiple rows. I am trying to make a single row for each timestamp with a column for each variable.
Example dataset
Time Variable Value
10 Speed 10
10 Acc -2
10 Energy 10
15 Speed 9
15 Acc -1
20 Speed 9
20 Acc 0
20 Energy 2
I'd like to convert this to
Time Speed Acc Energy
10 10 -2 10
15 9 -1 (blank or N/A)
20 9 0 2
These are measured values so they are not always complete.
I have tried ddply to extract each individual value into an array and recombine, but the columns are different lengths. I have tried aggregate, but I can't figure out how to keep the variable and value linked. I know I could do this with a for loop type solution, but that seems a poor way to do this in R. Any advice or direction would help. Thanks!
I assume the data frame is named df
library(tidyr)
spread(df,Variable,Value)
Typically a job for dcast in reshape2. First, we make your example reproducible:
df <- structure(list(Time = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
Variable = structure(c(3L, 1L, 2L, 3L, 1L, 3L, 1L, 2L), .Label = c("Acc",
"Energy", "Speed"), class = "factor"), Value = c(10L, -2L, 10L,
9L, -1L, 9L, 0L, 2L)), .Names = c("Time", "Variable", "Value"),
class = "data.frame", row.names = c(NA, -8L))
Then:
library(reshape2)
dcast(df, Time ~ ...)
Time Acc Energy Speed
10 -2 10 10
15 -1 NA 9
20 0 2 9
With dplyr you can reorder the columns (purely cosmetic) with:
library(dplyr)
dcast(df, Time ~ ...) %>% select(Time, Speed, Acc, Energy)
Time Speed Acc Energy
10 10 -2 10
15 9 -1 NA
20 9 0 2
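If no extra package is wanted, base R's reshape can do the same long-to-wide conversion. This is only a sketch; the sub call that strips the "Value." prefix is a cosmetic choice, and wide is an illustrative name.

```r
# Same example data as above, with Variable as a character column
df <- data.frame(
  Time     = c(10L, 10L, 10L, 15L, 15L, 20L, 20L, 20L),
  Variable = c("Speed", "Acc", "Energy", "Speed", "Acc", "Speed", "Acc", "Energy"),
  Value    = c(10L, -2L, 10L, 9L, -1L, 9L, 0L, 2L)
)

# One row per Time, one Value.<Variable> column per distinct Variable
wide <- reshape(df, idvar = "Time", timevar = "Variable", direction = "wide")

# Strip the "Value." prefix that reshape() adds to the new column names
names(wide) <- sub("^Value\\.", "", names(wide))
wide
```

Missing combinations, such as Energy at Time 15, come out as NA.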

Subset first n occurrences of certain value in dataframe

Suppose I have a matrix (or dataframe):
1 5 8
3 4 9
3 9 6
6 9 3
3 1 2
4 7 2
3 8 6
3 2 7
I would like to select only the first three rows that have "3" as their first entry, as follows:
3 4 9
3 9 6
3 1 2
It is clear to me how to pull out all rows that begin with "3" and it is clear how to pull out just the first row that begins with "3."
But in general, how can I extract the first n rows that begin with "3"?
Furthermore, how can I select just the 3rd and 4th appearances, as follows:
3 1 2
3 8 6
Without the need for an extra package:
mydf[mydf$V1==3,][1:3,]
results in:
V1 V2 V3
2 3 4 9
3 3 9 6
5 3 1 2
When you need the third and fourth row:
mydf[mydf$V1==3,][3:4,]
# or:
mydf[mydf$V1==3,][c(3,4),]
Used data:
mydf <- structure(list(V1 = c(1L, 3L, 3L, 6L, 3L, 4L, 3L, 3L),
V2 = c(5L, 4L, 9L, 9L, 1L, 7L, 8L, 2L),
V3 = c(8L, 9L, 6L, 3L, 2L, 2L, 6L, 7L)),
.Names = c("V1", "V2", "V3"), class = "data.frame", row.names = c(NA, -8L))
Bonus material: besides dplyr, you can also do this very efficiently with data.table (see this answer for speed comparisons of the different data.table methods on large datasets):
library(data.table)
setDT(mydf)[V1==3, head(.SD,3)]
# or:
setDT(mydf)[V1==3, .SD[1:3]]
You can do something like this with dplyr to extract the first three rows for each unique value of that column:
library(dplyr)
df %>% arrange(columnName) %>% group_by(columnName) %>% slice(1:3)
If you want to extract only the first three rows where the value of that column is 3, you can try:
df %>% filter(columnName == 3) %>% slice(1:3)
If you want specific rows, you can pass them to slice, for example c(3, 4).
We could also use subset
head(subset(mydf, V1==3),3)
Update
If we also need to extract the row directly below each row where V1==3,
i1 <- with(mydf, V1==3)
mydf[sort(unique(c(which(i1),pmin(which(i1)+1L, nrow(mydf))))),]
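One thing to watch with the mydf[mydf$V1==3,][1:3,] pattern: if fewer than three rows match, the missing positions come back as rows of NA. A small helper built on which avoids that. This is only a sketch, and pick_occurrences is a made-up name, not a function from any package.

```r
# Same example data as above
mydf <- data.frame(
  V1 = c(1L, 3L, 3L, 6L, 3L, 4L, 3L, 3L),
  V2 = c(5L, 4L, 9L, 9L, 1L, 7L, 8L, 2L),
  V3 = c(8L, 9L, 6L, 3L, 2L, 2L, 6L, 7L)
)

# Hypothetical helper: return the requested occurrences of the matching rows,
# silently dropping occurrence numbers beyond the number of matches
pick_occurrences <- function(data, hits, which_ones) {
  pos <- which(hits)                                  # row positions that match
  pos <- pos[which_ones[which_ones <= length(pos)]]   # keep valid occurrences
  data[pos, , drop = FALSE]
}

first_three  <- pick_occurrences(mydf, mydf$V1 == 3, 1:3)
third_fourth <- pick_occurrences(mydf, mydf$V1 == 3, 3:4)
all_of_them  <- pick_occurrences(mydf, mydf$V1 == 3, 1:10)  # only 5 rows match
```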

Sorting and aggregating in R

I used the aggregate function in R to bring down my data entries from 90k to 1800.
a=test$ID
b=test$Date
c=test$Value
d=test$Value1
sumA=aggregate(c, by=list(Date=b,Id=a), FUN=sum)
sumB=aggregate(d, by=list(Date=b,Id=a), FUN=sum)
final=sumA[1:2]
final[3]=sumA[3]/sumB[3]
Now I have data for 20 different dates in a month, with close to 90 different IDs each day, so there are around 1800 entries in the final table.
My question is how to aggregate further and find the maximum value of final[3] for each date, so that I am left with just 20 values.
In simple terms:
There are 20 days.
Each day has 90 values for 90 IDs.
I want to find the maximum of these 90 values for each day.
So in the end I would be left with just 20 values for 20 days.
The aggregate function is not working here with 'max' instead of sum.
Date ID Value Value1
1 A 20 10
1 A 25 5
1 B 50 5
1 B 50 5
1 C 25 25
1 C 35 5
2 A 30 10
2 A 25 45
2 B 40 10
2 B 40 30
This is the data.
Using the aggregate function I got the final table as
Date ID x
1 A 45/15=3
1 B 100/10=10
1 C 60/30=2
2 A 55/55=1
2 B 80/40=2
Now I want the maximum value for dates 1 and 2, that's it:
Date max- Value
1 10
2 2
This is a one-step process using data.table. A data.table is an enhanced version of a data.frame; it inherits from the data.frame class, so it can be used wherever a data.frame is expected.
Step0: Converting data.frame to data.table:
library(data.table)
setDT(test)
setkey(test,Date,ID)
Step1: Do the computation
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
Here is an explanation of the step:
The first part creates what you call the final table in your question:
test[,sum(Value)/sum(Value1),by=key(test)]
# Date ID V1
# 1: 1 A 3
# 2: 1 B 10
# 3: 1 C 2
# 4: 2 A 1
# 5: 2 B 2
This result is then passed to the second [ ] call, which takes the max by Date:
test[,sum(Value)/sum(Value1),by=key(test)][,max(V1),by=Date]
# Date V1
# 1: 1 10
# 2: 2 2
Hope this helps.
It's a very well documented package. You should read more about it.
Maybe this helps.
test <- structure(list(Date = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
), ID = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B"),
Value = c(20L, 25L, 50L, 50L, 25L, 35L, 30L, 25L, 40L, 40L
), Value1 = c(10L, 5L, 5L, 5L, 25L, 5L, 10L, 45L, 10L, 30L
)), .Names = c("Date", "ID", "Value", "Value1"), class = "data.frame", row.names = c(NA,
-10L))
res1 <- aggregate(. ~ID+Date, data=test, FUN=sum)
res1 <- transform(res1, x=Value/Value1)
res1
# ID Date Value Value1 x
#1 A 1 45 15 3
#2 B 1 100 10 10
#3 C 1 60 30 2
#4 A 2 55 55 1
#5 B 2 80 40 2
aggregate(. ~Date, data=res1[,-c(1,3:4)], FUN=max)
# Date x
# 1 1 10
# 2 2 2
First, I ran aggregate on the two value columns with two grouping variables (ID and Date), using the formula . ~ ID + Date.
Then I created a new variable x, i.e. Value/Value1, with transform.
Finally, I ran aggregate again with one grouping variable (Date), after removing all variables except Date and x.
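The same two-step logic (sum per Date and ID, then the maximum ratio per Date) can also be sketched with tapply, which builds a Date-by-ID table of sums. The names sum_value, sum_value1, ratio and max_by_date are only illustrative.

```r
# Same example data as above
test <- data.frame(
  Date   = c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  ID     = c("A", "A", "B", "B", "C", "C", "A", "A", "B", "B"),
  Value  = c(20L, 25L, 50L, 50L, 25L, 35L, 30L, 25L, 40L, 40L),
  Value1 = c(10L, 5L, 5L, 5L, 25L, 5L, 10L, 45L, 10L, 30L)
)

# Date x ID tables of sums; combinations that do not occur become NA
sum_value  <- with(test, tapply(Value,  list(Date, ID), sum))
sum_value1 <- with(test, tapply(Value1, list(Date, ID), sum))

# Ratio per Date/ID, then the largest ratio in each Date row
ratio <- sum_value / sum_value1
max_by_date <- apply(ratio, 1, max, na.rm = TRUE)
max_by_date
```

The na.rm = TRUE is needed because Date 2 has no rows for ID C, so that cell of the table is NA.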
