I am currently working on a problem that involves data cleaning and calculation in the following fashion:
I have created a sample dataset here for a single unit, A.
Data is sorted by the timestamp column for each unit. There are other columns as well.
For each distinct alternate value of event_log_value_desc, I need to get rows. In the case of multiple duplicate values of event_log_value_desc, it should return the row with the first occurrence. event_log_value_desc should alternate between OFF and ON for each unit.
The program should return the following:
This solution is untested on your dataset, but I believe it should work:
library(dplyr)

df %>%
  group_by(unit) %>%
  # compare each value with the previous one within the unit
  mutate(event_log_value_desc_lag = lag(event_log_value_desc)) %>%
  # keep only rows where the value changed (the first row of each unit has an NA lag)
  filter(event_log_value_desc != event_log_value_desc_lag | is.na(event_log_value_desc_lag))
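A quick check on a toy dataset (hypothetical values, since the question's sample isn't reproduced here):

df <- data.frame(
  unit = "A",
  timestamp = 1:6,
  event_log_value_desc = c("OFF", "OFF", "ON", "ON", "ON", "OFF")
)
# the filter above keeps rows 1, 3, and 6: the first OFF, the first ON,
# and the OFF that follows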
I'm trying to write a function to process multiple similar datasets. Here I want to subtract the scores a subject obtained in the previous interview from the scores the same subject obtained in the second interview. In all the datasets I want to process, the score of interest is stored in the second column. Writing this for each specific dataset is simple: just use the exact column name and everything goes fine.
d <- a %>%
  arrange(by_group = interview_date) %>%
  dplyr::group_by(subjectkey) %>%
  dplyr::mutate(score_change = colname_2nd - lag(colname_2nd))
But since I need a generic function that can process multiple datasets, I cannot use an exact column name. So I tried 3 approaches; each of them only altered the last line.
Approach#1:
dplyr::mutate(score_change = dplyr::vars(2) - lag(dplyr::vars(2)))
Approach#2:
The second column name of each dataset of interest contains the same string, so I tried:
dplyr::mutate(score_change = dplyr::vars(matches('string')) - lag(dplyr::vars(matches('string'))))
The error message for both of the above approaches is:
Error in dplyr::vars(2) - lag(dplyr::vars(2)) :
non-numeric argument to binary operator
Approach#3:
dplyr::mutate(score_change = .[[2]] - lag(.[[2]]))
Error message:
Error: Column `score_change` must be length 2 (the group size) or one, not 10880
10880 is the number of rows in my sample dataset, so it looks like group_by does not take effect in this approach.
Does anyone know how to make the function perform in the desired way?
If you want to refer to columns by position, use cur_data()[[2]] to refer to the 2nd column of the data frame.
library(dplyr)

d <- a %>%
  arrange(interview_date) %>%
  dplyr::group_by(subjectkey) %>%
  # cur_data() is the current group's data, so [[2]] is its 2nd column
  dplyr::mutate(score_change = cur_data()[[2]] - lag(cur_data()[[2]]))
Also note that cur_data() doesn't include the grouping columns, so if subjectkey is the first column in your data and colname_2nd is the second, you may need cur_data()[[1]] instead after grouping.
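As a reusable alternative (a sketch, not part of the answer above), you could capture the second column's name before grouping and use .data[[...]], which sidesteps the grouping-column offset entirely:

library(dplyr)

# Hypothetical wrapper; assumes each dataset has subjectkey and
# interview_date columns and the score of interest in its 2nd column.
score_change_fn <- function(a) {
  score_col <- names(a)[2]  # column name captured by position
  a %>%
    arrange(interview_date) %>%
    group_by(subjectkey) %>%
    mutate(score_change = .data[[score_col]] - lag(.data[[score_col]])) %>%
    ungroup()
}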
I want to clean up this dataset (Example Table linked).
It contains many duplicates. For each duplicated UUID I want to keep only the row with the highest value in the Shape_Area column and delete the rest. A loop should detect the duplicates and compare the Shape_Area values within each set of duplicates.
I've tried the duplicated() function, but I cannot trust that the value it selects is the greatest one in Shape_Area.
I want an output table of unique UUIDs, each keeping the row with the greatest Shape_Area.
Can anyone help with this one?
You can use the dplyr package like this:
library(dplyr)
newdata <- mydata %>%
  group_by(UUID) %>%
  arrange(-Shape_Area) %>%
  slice(1)
For each value of UUID this code creates a group, then sorts each group by Shape_Area in descending order. slice(1) then keeps only the first row, i.e. the one with the highest value.
If you want to save this data use this:
write.csv(newdata, file = "Output.csv")
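On newer dplyr versions (1.0.0 and later), slice_max() expresses the same idea directly; a sketch assuming the same column names:

library(dplyr)

newdata <- mydata %>%
  group_by(UUID) %>%
  # keep the single row with the largest Shape_Area per UUID
  slice_max(Shape_Area, n = 1, with_ties = FALSE) %>%
  ungroup()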
My data frame looks like this (column headers shown):

gene HSC_7256.bam HSC_6792.bam HSC_7653.bam HSC_5852

I can do this the normal way: take out the columns, make another data frame, and average it. But I want to do it in dplyr, and I'm having a hard time; I'm not sure what the problem is.
I'm doing something like this:
HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
  select(EPIGENETIC_FACTOR_SEQMONK, gene)
I get this error:
Error: EPIGENETIC_FACTOR_SEQMONK must resolve to integer column positions, not a list
So what I need to do is take all the HSC sample columns and average them.
Can anyone suggest what I'm doing incorrectly? That would be helpful.
The %>% operator pipes whatever is on its left into the first argument of the function that follows. If your data frame is EPIGENETIC_FACTOR_SEQMONK, then these two statements are equivalent:
HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
  select(gene)

HSC <- select(EPIGENETIC_FACTOR_SEQMONK, gene)
In the first, we are passing EPIGENETIC_FACTOR_SEQMONK into select using %>%, which is generally used in dplyr chains as the first argument in dplyr functions is a data frame.
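To then average the HSC sample columns per gene, one hedged sketch using rowwise() and c_across() (dplyr 1.0.0 or later; column names taken from the question):

library(dplyr)

HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
  rowwise() %>%
  # average across every column whose name starts with "HSC"
  mutate(HSC_mean = mean(c_across(starts_with("HSC")))) %>%
  ungroup() %>%
  select(gene, HSC_mean)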
I have a dataset with a few columns and duplicate rows (duplication based on one column, named ProjectID).
I want to remove the duplicate rows and keep just one of each.
However, each of these rows has a separate Amount value, which needs to be summed and stored in the final consolidated row.
I have used the aggregate function, but as far as I know how to use it, it removes all the other columns.
Can somebody please tell me an easier way?
The example dataset is attached.
This could be solved using dplyr, as @PLapointe pointed out. If your dataset is called df, it would go as:
df %>%
  group_by(`Project ID`, `Project No.`, `Account Head`, `Function`, `Functionary`) %>%
  summarise(cost.total = sum(Amount))
This should do it. You can also adjust the variables you want to keep.
It's a more complicated method, but it worked for me.
I aggregated the amounts over the ProjectIDs using the aggregate function, storing them in a new tibble.
Then I appended this column to the original tibble as a new column.
It didn't work exactly the way I wanted, but I was able to work it out with a new column, Final_Amount, which makes the earlier Amount column irrelevant.
library(dplyr)

Duplicate_remove2 <- function(dataGP_cleaned) {
  # aggregate the amounts per ProjectID
  aggregated_amount <- aggregate(dataGP_cleaned['Amount'], by = dataGP_cleaned['ProjectID'], sum)
  # find the distinct dataset (first occurrence of each ProjectID)
  dataGP_unique <- distinct(dataGP_cleaned, ProjectID, .keep_all = TRUE)
  # rename the column for easy identification
  aggregated_amount$Final_Amount <- aggregated_amount$Amount
  # append the column; note that bind_cols() assumes both tibbles list the
  # ProjectIDs in the same row order
  aggregate_dataGP <- bind_cols(dataGP_unique, aggregated_amount['Final_Amount'])
  return(aggregate_dataGP)
}
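For comparison, the same result can be sketched as a single dplyr chain (hypothetical, assuming Amount is the only column that varies within a ProjectID):

library(dplyr)

aggregate_dataGP <- dataGP_cleaned %>%
  group_by(ProjectID) %>%
  mutate(Final_Amount = sum(Amount)) %>%  # total per ProjectID
  slice(1) %>%                            # keep the first row of each group
  ungroup()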
I've posted a sample of the data I'm working with here.
"Parcel.." is the main indexing variable and there are good amount of duplicates. The duplicates are not consistent in all of the other columns. My goal is to aggregate the data set so that there is only one observation of each parcel.
I've used the following code to attempt summing numerical vectors:
aggregate(X.11 ~ Parcel.., data = Ap.sample, FUN = sum)
The problem is it removes everything except the parcel and the other vector I reference.
My goal is to apply one rule (sum) to certain numerical vectors (X.11, X.13, X.15, num_units), a different rule (average) to others (Acres, Ttl_sq_ft, Mtr.Size), and yet another rule (just pick one value) to the character variables. For the last case, pretend there's another column customer.name with different values for the same unique parcel ID, e.g. "Steven condominiums" and "Stephen apartments". The extra observations for all other variables should just be deleted.
I've tried to use the numcolwise function but that also doesn't do what I need.
My instinct would be to specify the columns I want to sum and the columns I want to take the average like so:
DT <- as.data.table(Ap.sample)
sum_cols <- Ap.05[, c(10, 12, 14)]
mean_cols <- Ap.05[, c(17:19)]
and then use the lapply function to go through each observation and do what I need.
df05 <- DT[, lapply(.SD, sum), by = DT$Parcel.., .SDcols = sum_cols]
df05 <- DT[, lapply(.SD, mean), by = DT$Parcel.., .SDcols = mean_cols]
but that spits out errors on the first go. I know there's a simpler workaround for this than trying to muscle through it.
You could do:
library(dplyr)
df %>%
  # create a hypothetical "customer.name" column
  mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
  # group data by "Parcel.."
  group_by(Parcel..) %>%
  # apply sum() to the selected columns
  mutate(across(c(X.11, X.13, X.15, num_units), sum)) %>%
  # likewise for mean()
  mutate(across(c(Acres, Ttl_sq_ft, Mtr.Size), mean)) %>%
  # keep only the desired columns (the grouping column is retained automatically)
  select(X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
  # de-duplicate while keeping an arbitrary row (the first one in row order)
  distinct(Parcel.., .keep_all = TRUE)
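Since every rule here collapses each group to a single value, summarise() can also state all three rules in one call; a sketch using the question's column names and the hypothetical customer.name:

library(dplyr)

result <- df %>%
  group_by(Parcel..) %>%
  summarise(
    across(c(X.11, X.13, X.15, num_units), sum),  # sum these
    across(c(Acres, Ttl_sq_ft, Mtr.Size), mean),  # average these
    customer.name = first(customer.name)          # just pick one name
  )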