removing and aggregating duplicates - r

I've posted a sample of the data I'm working with here.
"Parcel.." is the main indexing variable and there are a good number of duplicates. The duplicates are not consistent across the other columns. My goal is to aggregate the data set so that there is only one observation of each parcel.
I've used the following code to attempt summing numerical vectors:
aggregate(Ap.sample$X.11~Ap.sample$Parcel..,FUN=sum)
The problem is it removes everything except the parcel and the other vector I reference.
My goal is to use the same rule for certain numerical vectors (sum) (X.11,X.13,X.15, num_units) of observations of that parcelID, a different rule (average) for other numerical vectors (Acres,Ttl_sq_ft,Mtr.Size), and still a different rule (just pick one name) for the character variables (pretend there's another column "customer.name" with different values for the same unique parcel ID, i.e. "Steven condominiums" and "Stephen apartments"), and to just delete the extra observations for all the other variables.
I've tried to use the numcolwise function but that also doesn't do what I need.
My instinct would be to specify the columns I want to sum and the columns I want to take the average like so:
DT<-as.data.table(Ap.sample)
sum_cols<-Ap.05[,c(10,12,14)]
mean_cols<-Ap.05[,c(17:19)]
and then use the lapply function to go through each observation and do what I need.
df05<-DT[,lapply(.SD,sum), by=DT$Parcel..,.SDcols=sum_cols]
df05<-DT[,lapply(.SD,mean),by=DT$Parcel..,.SDcols=mean_cols]
but that spits out errors on the first go. I know there's a simpler work around for this than trying to muscle through it.

You could do:
library(dplyr)
df %>%
  # create a hypothetical "customer.name" column
  mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
  # group data by "Parcel.."
  group_by(Parcel..) %>%
  # apply sum() to the selected columns
  mutate_each(funs(sum(.)), one_of("X.11", "X.13", "X.15", "num_units")) %>%
  # likewise for mean()
  mutate_each(funs(mean(.)), one_of("Acres", "Ttl_sq_ft", "Mtr.Size")) %>%
  # select only the desired columns
  select(X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
  # de-duplicate while keeping an arbitrary value (the first one in row order);
  # dplyr >= 0.5 needs .keep_all = TRUE here to retain the other columns
  distinct(Parcel.., .keep_all = TRUE)
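mutate_each() and funs() have since been deprecated in dplyr. As a rough sketch only, assuming dplyr 1.0 or later, the same three rules (sum, average, pick one) can be written with summarise() and across(). The toy data frame below is invented for illustration; only the column names come from the question.

```r
library(dplyr)

# Toy stand-in for the asker's data: names follow the question, values are made up.
df <- data.frame(
  Parcel..      = c("P1", "P1", "P2"),
  X.11          = c(1, 2, 5),
  num_units     = c(10, 20, 7),
  Acres         = c(0.5, 1.5, 3),
  customer.name = c("Steven condominiums", "Stephen apartments", "Oak court"),
  stringsAsFactors = FALSE
)

collapsed <- df %>%
  group_by(Parcel..) %>%
  summarise(
    across(all_of(c("X.11", "num_units")), sum),  # sum rule
    across(all_of("Acres"), mean),                # average rule
    customer.name = first(customer.name),         # "just pick one" rule
    .groups = "drop"
  )

print(collapsed)
```

Because summarise() returns one row per group, no separate de-duplication step is needed; any column not mentioned inside summarise() is dropped.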

Related

change all columns based on the maximum value in each row

I have a large data frame which has multiple columns calculated from other columns. The issue arises where there are values of 8888 and 9999, which stand for NA and refused to answer respectively. These values have been incorrectly used to calculate other columns (such as the value of pricepergram) because they were not flagged as NA prior to calculation.
I'm not able to recalculate all the values, so instead I would like to find some code which takes each row of the dataframe as an argument. If the maximum value in the row is above 8887, then I would like it to return the row but with the value of all prices set to NA.
The solution needs to be applicable to a dataframe of 250 columns.
I need to be able to apply the code across multiple columns, rather than just one.
I have confirmed that the only values above 8887 in the dataframe are indeed either 9999 or 8888 and therefore constitute values that we want to change.
I am not able to post the dataset due to data protection (apologies), but have given an example of minimum complexity to illustrate my point.
This would be the ideal output:
The rows with values above 8887 have had their price set to NA.
We can break this problem into two steps:
1. Find out if there are any 8888 or 9999 codes in a row.
2. Set values in the row to NA.
Step 1: The following code produces an indicator for whether a row contains any codes greater than 8887:
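any_large_codes = apply(df, MARGIN = 1, function(row){any(row > 8887, na.rm = TRUE)})
It works as follows: apply treats the dataframe as a matrix. MARGIN = 1 means that the function is applied to each row of the matrix. function(row){any(row > 8887, na.rm = TRUE)} checks if any value in its input (each row) is larger than 8887. The na.rm = TRUE matters: without it, a row that contains missing values but no large codes returns NA instead of FALSE, and step 2 would then blank that row as well. Note also that apply assumes the columns are numeric; if df contains character columns, the matrix becomes character and the comparison is no longer a numeric one.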
I have not used dplyr for this as I am not aware of any row-wise operators in dplyr. This seems the best option. You can use dplyr to add it into the dataframe if you wish, but this is not necessary:
df = df %>% mutate(na_indicator = any_large_codes)
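As an aside, the same indicator can be computed without a per-row function call. The following is a minimal sketch on made-up data (the column names spend, grams and pricepergram are assumptions, not the asker's real columns), and it assumes every column is numeric.

```r
# Made-up example: 8888 / 9999 are the special codes, NA is a genuine missing value.
df <- data.frame(
  spend        = c(20, 8888, 30, NA),
  grams        = c(10, 4, 9999, 3),
  pricepergram = c(2, 2222, 0.003, NA)
)

# df > 8887 gives a logical matrix; rowSums() counts the TRUEs in each row.
# na.rm = TRUE means a missing value on its own never flags a row.
any_large_codes <- rowSums(df > 8887, na.rm = TRUE) > 0

print(any_large_codes)
```

For a wide data frame this stays fully vectorised, which tends to be faster than apply() with an anonymous function, though both give the same flags on numeric data.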
Step 2: The following code sets the values in a single column to NA where there are any large codes:
df = df %>%
mutate(this_one_column = ifelse(any_large_codes, NA, this_one_column))
If you want to handle multiple columns, I would suggest something like this:
all_columns_to_handle = c(
"col1",
"col2",
"col3",
...
)
for(cc in all_columns_to_handle){
df = df %>%
mutate(!!sym(cc) := ifelse(any_large_codes, NA, !!sym(cc)))
}
Where !!sym(cc) is a way to use the column name stored in cc and := is equivalent to = but allows us to use !!sym(cc) on the left-hand side. For other options to this approach see the programming with dplyr vignette.
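One of those other options, sketched here under the assumption that dplyr 1.0 or later is available, is across(), which handles all the listed columns in a single mutate() call. The data frame and the flag vector below are invented so that the example runs on its own.

```r
library(dplyr)

# Invented data; any_large_codes stands in for the indicator built in step 1.
df <- data.frame(
  id   = 1:3,
  col1 = c(1.5, 2.5, 3.5),
  col2 = c(10, 20, 30),
  code = c(1, 8888, 2)
)
any_large_codes <- c(FALSE, TRUE, FALSE)
all_columns_to_handle <- c("col1", "col2")

# Blank every listed column in the flagged rows, leaving other columns alone.
df <- df %>%
  mutate(across(all_of(all_columns_to_handle),
                ~ ifelse(any_large_codes, NA, .x)))

print(df)
```

all_of() makes the call fail loudly if one of the names in all_columns_to_handle is missing from df, which is usually preferable with 250 columns.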

dplyr mutate grouped data without using exact column name

I'm trying to write a function to process multiple similar datasets. Here I want to subtract the score a subject obtained in the previous interview from the score the same subject obtained in the second interview. In all the datasets I want to process, the score of interest is stored in the second column. Writing this for each specific dataset is simple: use the exact column name and everything goes fine.
d <- a %>%
arrange(by_group=interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = colname_2nd-lag(colname_2nd))
But since I need a generic function that can be used to process multiple datasets, I cannot use the exact column name. So I tried 3 approaches, all of which only altered the last line.
Approach#1:
dplyr::mutate(score_change = dplyr::vars(2)-lag(dplyr::vars(2)))
Approach#2:
The name of the second column in each dataset of interest contains the same string, so I tried
dplyr::mutate(score_change = dplyr::vars(matches('string'))-lag(dplyr::vars(matches('string'))))
Error messages of the above 2 approaches will be
Error in dplyr::vars(2) - lag(dplyr::vars(2)) :
non-numeric argument to binary operator
Approach#3:
dplyr::mutate(score_change = .[[2]]-lag(.[[2]]))
Error message:
Error: Column `score_change` must be length 2 (the group size) or one, not 10880
10880 is the number of rows in my sample dataset, so it looks like group_by does not work in this approach.
Does anyone know how to make the function perform in the desired way?
If you want to refer to the column by position, use cur_data()[[2]] to refer to the 2nd column of the dataframe.
library(dplyr)
d <- a %>%
arrange(interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = cur_data()[[2]]-lag(cur_data()[[2]]))
Also note that cur_data() doesn't include the grouping column, so if subjectkey is the first column in your data and colname_2nd is the second one, you may need to use cur_data()[[1]] instead once you group_by.
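A way to sidestep that off-by-one issue is to resolve the column name from its position before grouping and then look it up with the .data pronoun. This is only a sketch: the helper score_change_by_position and the toy data are hypothetical. cur_data() is also deprecated as of dplyr 1.1.0 (pick() replaces it), so this form may age better.

```r
library(dplyr)

# Toy data: the score of interest sits in the second column, as in the question.
a <- data.frame(
  subjectkey     = c("s1", "s1", "s2", "s2"),
  score          = c(10, 14, 7, 4),
  interview_date = as.Date(c("2020-01-01", "2021-01-01",
                             "2020-02-01", "2021-02-01"))
)

# Hypothetical helper: 'position' is counted on the ungrouped data frame.
score_change_by_position <- function(data, position = 2) {
  score_col <- names(data)[position]  # resolve the name before grouping
  data %>%
    arrange(interview_date) %>%
    group_by(subjectkey) %>%
    mutate(score_change = .data[[score_col]] - lag(.data[[score_col]])) %>%
    ungroup()
}

d <- score_change_by_position(a)
print(d)
```

Because the name is looked up once on the full data frame, the position means the same thing regardless of which columns are later used for grouping.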

Remove outlier rows by column and factor in R

I am working with a data-frame in R. I have the following function which removes all rows of a data-frame df where, for a specified column index/attribute, the value at that row is outside mean (of column) plus or minus n*stdev (of column).
remove_outliers <- function(df,attr,n){
outliersgone <- df[df[,attr]<=(mean(df[,attr],na.rm=TRUE)+n*sd(df[,attr],na.rm=TRUE)) & df[,attr]>=(mean(df[,attr],na.rm=TRUE)-n*sd(df[,attr],na.rm=TRUE)),]
return(outliersgone)
}
There are two parts to my question.
(1) My data-frame df also has a column 'Group', which specifies a class label. I would like to be able to remove outliers according to mean and standard deviation within their group within the column, i.e. organised by factor (within the column). So you would remove from the data-frame a row labelled with group A if, in the specified column/attribute, the value at that row is outside mean (of group A rows in that column) plus/minus n*stdev (of group A rows in that column). And the same for groups B, C, D, E, F, etc.
How can I do this? (Preferably using only base R and dplyr.) I have tried to use df %>% group_by(Group) followed by mutate but I'm not sure what to pass to mutate, given my function remove_outliers seems to require the whole data-frame to be passed into it (so it can return the whole data-frame with rows only removed based on the chosen attribute attr).
I am open to hearing suggestions for changing the function remove_outliers as well, as long as they also return the whole data-frame as explained. I'd prefer solutions that avoid loops if possible (unless inevitable and no more efficient method presents itself in base R / dplyr).
(2) Is there a straightforward way I could combine outlier considerations across multiple columns? e.g. remove from the dataframe df those rows which are outliers wrt at least $N$ attributes out of a specified vector of attributes/column indices (length≥N). or a more complex condition like, remove from the dataframe df those rows which are outliers wrt Attribute 1 and at least 2 of Attributes 2,4,6,8.
(Ideally the definition of outlier would again be within-group within column, as specified in question 1 above, but a solution working in terms of just within column without considering the groups would also be useful for me.)
Ok - part 1 (and trying to avoid loops wherever possible):
Here's some test data:
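test_data=data.frame(
  # factor() is explicit because data.frame() no longer converts strings
  # to factors by default (R >= 4.0), and levels() is used below
  group=factor(c(rep("a",100),rep("b",100))),
  value=rnorm(200)
)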
We'll find the groups:
groups=levels(test_data[,1]) # or unique(test_data[,1]) if it isn't a factor
And we'll calculate the outlier limits (here I'm specifying only 1 sd) - sorry for the loop, but it's only over the groups, not the data:
outlier_sds=1
outlier_limits=sapply(groups,function(g) {
m=mean(test_data[test_data[,1]==g,2])
s=sd(test_data[test_data[,1]==g,2])
return(c(m-outlier_sds*s,m+outlier_sds*s))
})
So we can define the limits for each row of test_data:
test_data_limits=outlier_limits[,test_data[,1]]
And use this to determine the outliers:
outliers=test_data[,2]<test_data_limits[1,] | test_data[,2]>test_data_limits[2,]
(or, combining those last steps):
outliers=test_data[,2]<outlier_limits[1,test_data[,1]] | test_data[,2]>outlier_limits[2,test_data[,1]]
Finally:
test_data_without_outliers=test_data[!outliers,]
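Since the question mentions dplyr, here is a sketch of the same within-group rule written as a grouped filter(). The function name remove_outliers_by_group and its group argument are hypothetical, the data are invented, and dplyr 1.0 or later is assumed. Like the original remove_outliers, it returns whole rows and keeps values that lie within n standard deviations of the mean, computed per group.

```r
library(dplyr)

# Hypothetical grouped counterpart of remove_outliers(); attr and group are column names.
remove_outliers_by_group <- function(df, attr, n, group = "Group") {
  df %>%
    group_by(across(all_of(group))) %>%
    # mean() and sd() are evaluated separately inside each group
    filter(abs(.data[[attr]] - mean(.data[[attr]], na.rm = TRUE)) <=
             n * sd(.data[[attr]], na.rm = TRUE)) %>%
    ungroup()
}

# Invented data with one obvious outlier in each group.
toy <- data.frame(
  Group = c(rep("A", 4), rep("B", 5)),
  value = c(1, 2, 3, 100, 10, 11, 12, 13, -50)
)

cleaned <- remove_outliers_by_group(toy, attr = "value", n = 1)
print(cleaned)
```

One difference from base subsetting is that filter() silently drops rows where the condition is NA, so rows with a missing value in attr are removed rather than returned as all-NA rows.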
EDIT: now part 2 (apply part 1 with a loop over all the columns in the data):
Some test data with more than one column of values:
test_data2=data.frame(
  # factor() is explicit so that levels(groups) in find_outliers works in R >= 4.0
  group=factor(c(rep("a",100),rep("b",100))),
  value1=rnorm(200),
  value2=2*rnorm(200),
  value3=3*rnorm(200)
)
Combine all the steps of part 1 into a new function find_outliers that returns a logical vector indicating whether any value is an outlier for its respective column & group:
find_outliers = function(values,n_sds,groups) {
group_names=levels(groups)
outlier_limits=sapply(group_names,function(g) {
m=mean(values[groups==g])
s=sd(values[groups==g])
return(c(m-n_sds*s,m+n_sds*s))
})
return(values < outlier_limits[1,groups] | values > outlier_limits[2,groups])
}
And then apply this function to each of the data columns:
test_groups=test_data2[,1]
test_data_outliers=apply(test_data2[,-1],2,function(d) find_outliers(values=d,n_sds=1,groups=test_groups))
The rowSums of test_data_outliers indicate how many times each row is considered an 'outlier' in the various columns, with respect to its own group:
rowSums(test_data_outliers)
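From a logical matrix like that one, the combined conditions in part 2 of the question reduce to arithmetic on rows. The sketch below uses a small hand-written matrix in place of test_data_outliers; the attribute names and the data frame dat are made up for illustration.

```r
# Hypothetical outlier flags: one row per observation, one column per attribute.
outlier_flags <- cbind(
  attr1 = c(TRUE,  FALSE, TRUE,  FALSE),
  attr2 = c(TRUE,  FALSE, FALSE, FALSE),
  attr4 = c(FALSE, TRUE,  FALSE, FALSE),
  attr6 = c(TRUE,  FALSE, TRUE,  TRUE)
)
dat <- data.frame(id = 1:4)

# Rule A: drop rows that are outliers in at least N of the attributes.
N <- 2
drop_a <- rowSums(outlier_flags) >= N

# Rule B: drop rows that are outliers in attr1 AND in at least 2 of attr2, attr4, attr6.
drop_b <- outlier_flags[, "attr1"] &
  rowSums(outlier_flags[, c("attr2", "attr4", "attr6")]) >= 2

kept_a <- dat[!drop_a, , drop = FALSE]
kept_b <- dat[!drop_b, , drop = FALSE]
print(kept_a)
print(kept_b)
```

Any other rule of this kind can be built the same way, by combining column subsets of the flag matrix with rowSums() and the logical operators.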

R: Scale a subset of multiple columns (with similar names) with dplyr

I recently moved from common dataframe manipulation in R to the tidyverse, but I have a problem scaling columns with the scale() function.
My data consists of columns, some of which are numerical and some categorical features. The last column is the y value of the data. So I want to scale all numerical columns except the last one.
With the select() function I am able to write a very short line of code that selects all the numerical columns that need to be scaled if I add the ends_with("...") argument. But I can't really make use of that with scaling. There I have to use transmute(feature1=scale(feature1),feature2=scale(feature2)...) and name each feature individually. This works fine but bloats the code.
So my question is:
Is there a smart solution to manipulate column by column without the need to address every single column name with transmute?
I imagine something like:
transmute(ends_with("...")=scale(ends_with("..."),featureX,featureZ)
(well aware that this does not work)
Many thanks in advance
library(tidyverse)
data("economics")
# add variables that are not numeric
economics[7:9] <- sample(LETTERS[1:10], size = dim(economics)[1], replace = TRUE)
# add a 'y' column (for illustration)
set.seed(1)
economics$y <- rnorm(n = dim(economics)[1])
economics_modified <- economics %>%
select(-y) %>%
transmute_if(is.numeric, scale) %>%
add_column(y = economics$y)
If you want to keep those columns that are not numeric, replace transmute_if with modify_if. (There might be a smarter way to exclude column y from being scaled.)
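One possibly smarter way, sketched here on invented data and assuming dplyr 1.0 or later, is to exclude y inside the column selection itself, so it never has to be removed and re-attached. The column names feature1, feature2 and label are placeholders.

```r
library(dplyr)

# Invented data: two numeric features, one categorical column, and the target y.
dat <- data.frame(
  feature1 = c(1, 2, 3),
  feature2 = c(10, 20, 60),
  label    = c("a", "b", "c"),
  y        = c(0.1, 0.2, 0.3)
)

scaled <- dat %>%
  # every numeric column except y; as.numeric() drops the one-column
  # matrix structure that scale() returns
  mutate(across(where(is.numeric) & !y, ~ as.numeric(scale(.x))))

print(scaled)
```

Using mutate() rather than transmute() keeps the non-numeric columns, and a helper such as ends_with() can be combined into the same selection in place of where(is.numeric).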

Removing Duplicate rows while summing one column and preserving the other columns

I have a dataset of a few columns with duplicate rows (duplication is based on one column named ProjectID).
I want to remove the duplicate rows and keep just one of them.
However, each of these rows has a separate amount value that needs to be summed and stored in the final consolidated row.
I have used the aggregate function. However, it removes all other columns (at least the way I know how to use it).
Can somebody please tell me an easier way?
the example data set is attached.
This could be solved using dplyr, as @PLapointe pointed out. If your dataset is called df, then this would go as
df %>%
group_by(`Project ID`, `Project No.`, `Account Head`, `Function`, `Functionary`) %>%
summarise(cost.total = sum(Amount))
This should do it. You can also adjust the variables you want to keep.
It's a more complicated method, but it worked for me.
I aggregated the amounts by ProjectID using the aggregate function, storing them in a new tibble.
Then I appended this column to the original tibble as a new column.
It didn't work exactly the way I wanted, but I was able to work it out with a new column Final_Amount, which makes the earlier Amount column irrelevant.
library(dplyr)
Duplicate_remove2 <- function(dataGP_cleaned)
{
  # aggregate the amounts per ProjectID
  aggregated_amount <- aggregate(dataGP_cleaned['Amount'], by=dataGP_cleaned['ProjectID'], sum)
  # rename the summed column for easy identification
  names(aggregated_amount)[names(aggregated_amount) == 'Amount'] <- 'Final_Amount'
  # keep the first row of each ProjectID
  dataGP_unique <- distinct(dataGP_cleaned, ProjectID, .keep_all = TRUE)
  # match totals by ProjectID rather than by row position: aggregate() sorts its
  # output by ProjectID, while distinct() keeps the order of first appearance
  aggregate_dataGP <- left_join(dataGP_unique, aggregated_amount, by = 'ProjectID')
  return(aggregate_dataGP)
}
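A shorter route to the same result, given here only as a sketch on invented data, is to compute the group total with a grouped mutate() and then keep one row per ProjectID. Nothing has to be aligned between two tables, so row order cannot cause a mismatch. The Function column is a stand-in for whatever other columns the real data has.

```r
library(dplyr)

# Invented data: ProjectID values deliberately not in sorted order.
dat <- data.frame(
  ProjectID = c("B", "A", "B", "A", "C"),
  Function  = c("roads", "water", "roads", "water", "parks"),
  Amount    = c(100, 10, 50, 5, 7)
)

consolidated <- dat %>%
  group_by(ProjectID) %>%
  mutate(Final_Amount = sum(Amount)) %>%  # total repeated on every row of the group
  ungroup() %>%
  distinct(ProjectID, .keep_all = TRUE)   # keep the first row of each project

print(consolidated)
```

As in the function above, the original Amount column survives but only reflects the first row of each project; it can be dropped with select(-Amount) if it is no longer needed.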
