I have a dataset with a few columns and some duplicate rows (duplication is based on one column, named ProjectID).
I want to remove the duplicate rows and keep just one of each.
However, each of these rows has a separate amount value, and these need to be summed and stored in the final consolidated row.
I have used the aggregate function. However, it removes all the other columns (at least the way I know how to use it).
Can somebody please tell me an easier way?
The example dataset is attached.
dataset
This could be solved using dplyr, as @PLapointe pointed out. If your dataset is called df, then this would go as follows:
library(dplyr)

df %>%
  group_by(`Project ID`, `Project No.`, `Account Head`, `Function`, `Functionary`) %>%
  summarise(cost.total = sum(Amount))
This should do it. You can adjust the grouping variables to control which columns are kept.
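Since the question started from aggregate, here is a minimal base R sketch of the same grouping idea. The toy data and its column names (ProjectID, AccountHead, Amount) are assumptions made up for illustration, not the asker's real data.

```r
# Hypothetical toy data; the real column names may differ
df <- data.frame(
  ProjectID   = c(101, 101, 102, 103, 103),
  AccountHead = c("Roads", "Roads", "Water", "Health", "Health"),
  Amount      = c(500, 250, 1000, 300, 200),
  stringsAsFactors = FALSE
)

# Every column on the right of ~ is a grouping column and is kept in the result
totals <- aggregate(Amount ~ ProjectID + AccountHead, data = df, FUN = sum)
totals <- totals[order(totals$ProjectID), ]
print(totals)
```

This only collapses to one row per ProjectID when the other grouping columns are constant within each ProjectID. If they vary, each distinct combination gets its own row.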
It's a more complicated method, but it worked for me.
I aggregated the amounts by ProjectID using the aggregate function, storing them in a new tibble.
I then appended this column to the original tibble as a new column.
It didn't work exactly as I wanted, but I was able to work it out with a new column, Final_Amount, which makes the earlier Amount column irrelevant.
library(dplyr)

Duplicate_remove2 <- function(dataGP_cleaned) {
  # sum Amount within each ProjectID
  aggregated_amount <- aggregate(dataGP_cleaned["Amount"], by = dataGP_cleaned["ProjectID"], sum)
  # rename the summed column for easy identification
  names(aggregated_amount)[names(aggregated_amount) == "Amount"] <- "Final_Amount"
  # keep the first row of each ProjectID, with all of its other columns
  dataGP_unique <- distinct(dataGP_cleaned, ProjectID, .keep_all = TRUE)
  # append Final_Amount by matching on ProjectID, not by row position:
  # aggregate() sorts its groups, distinct() keeps first-appearance order
  left_join(dataGP_unique, aggregated_amount, by = "ProjectID")
}
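A shorter base R sketch of the same "sum, then keep one row" idea uses ave(), which writes the group total onto every row before the duplicates are dropped. The toy data below is hypothetical; only the ProjectID, Amount and Final_Amount names come from the answer above.

```r
# Hypothetical toy data, deliberately not sorted by ProjectID
dataGP_cleaned <- data.frame(
  ProjectID = c(103, 101, 103, 102, 101),
  Function  = c("Health", "Roads", "Health", "Water", "Roads"),
  Amount    = c(300, 500, 200, 1000, 250),
  stringsAsFactors = FALSE
)

# ave() returns the group total on every row of that group
dataGP_cleaned$Final_Amount <- ave(dataGP_cleaned$Amount,
                                   dataGP_cleaned$ProjectID, FUN = sum)

# keep the first row of each ProjectID, with all of its other columns
consolidated <- dataGP_cleaned[!duplicated(dataGP_cleaned$ProjectID), ]
print(consolidated)
```

Because the total is computed per row, no separate summary table has to be matched back to the de-duplicated rows afterwards.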
Related
I want to delete duplicated rows according to a set of columns (all except one), but I want to keep that column in the data frame:
dfNew <- df %>% distinct(across(-column5))
The problem with this code is that dfNew doesn't have column5. I just want to apply distinct while ignoring that column, but keep it in the final data frame.
If you want to keep the original columns, you can add '.keep_all = TRUE' as an argument to your distinct call.
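To make the effect concrete, here is a base R sketch of what "distinct on some columns, keep all columns" amounts to: the first row of each distinct combination is kept, along with every column. The toy data is hypothetical; only the column5 name comes from the question.

```r
# Hypothetical toy data: rows 1 and 2 agree on everything except column5
df <- data.frame(
  column1 = c("a", "a", "b"),
  column2 = c(1, 1, 2),
  column5 = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# dplyr form suggested in the answer (not run here):
#   dfNew <- df %>% distinct(across(-column5), .keep_all = TRUE)

# Base R sketch: test for duplicates on the key columns only,
# then index the full data frame so column5 is still present
key_cols <- setdiff(names(df), "column5")
dfNew <- df[!duplicated(df[, key_cols]), ]
print(dfNew)
```

Note that when rows differ only in column5, the value kept is simply the one from the first such row.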
I have a wide dataset which makes it really difficult to manipulate the data in the way I need. It looks like the dummy table below:
Dummy_table_unsorted
Essentially, as seen in the table, each row holds information at the user level: there is a user id, followed by all the animals owned by that user. I would like the data at the animal level instead, so that a user can have multiple rows, one for each of their animals. I have pasted a table below showing what I would like it to look like:
Dummy_table_sorted
Is there a simple way to do this? I have an idea of how, but it is very long-winded. I thought I could subset the columns relating to one animal at a time and merge the datasets back together. The problem is that in my data one person can have up to 100 animals, which makes this approach very long-winded.
Could someone please suggest a package or command that would let me change this wide dataset into a long one?
Thank you.
First, you should provide data that someone can easily insert into R. Screenshots are not helpful and increase the amount of work a person needs to perform to help you.
The data as you have it can be split and then recombined with bind_rows or rbind. I would subset the data into three data frames, rename the columns so they match, and bind them. Assuming your original data is called df:
library(dplyr)

# one data frame per animal slot, each keeping the user id column
df1 <- df[, c(1:4)]
df2 <- df[, c(1, 5:7)]
df3 <- df[, c(1, 8:10)]
# rename columns to match
names(df1) <- c('user id', 'animal', 'colour', 'legs')
names(df2) <- c('user id', 'animal', 'colour', 'legs')
names(df3) <- c('user id', 'animal', 'colour', 'legs')
remade <- bind_rows(df1, df2, df3)
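With up to 100 animals per user, writing one subset per animal by hand gets tedious. The same split-rename-bind idea can be sketched as a loop over column blocks in base R. The toy data, the column names, and the assumption that each animal occupies exactly three adjacent columns after the id are all hypothetical.

```r
# Hypothetical wide data: user id, then (animal, colour, legs) repeated per slot
wide <- data.frame(
  user_id = c(1, 2),
  animal1 = c("dog", "cat"),
  colour1 = c("brown", "black"),
  legs1   = c(4, 4),
  animal2 = c("bird", NA),
  colour2 = c("green", NA),
  legs2   = c(2, NA),
  stringsAsFactors = FALSE
)

vars_per_animal <- 3
n_animals <- (ncol(wide) - 1) / vars_per_animal

# Build one data frame per animal slot, all with the same column names
pieces <- lapply(seq_len(n_animals), function(k) {
  cols <- 1 + (k - 1) * vars_per_animal + seq_len(vars_per_animal)
  piece <- wide[, c(1, cols)]
  names(piece) <- c("user_id", "animal", "colour", "legs")
  piece
})

# Stack the slots, drop empty ones, and sort by user
long <- do.call(rbind, pieces)
long <- long[!is.na(long$animal), ]
long <- long[order(long$user_id), ]
print(long)
```

If the columns are not laid out in regular blocks, the index arithmetic would need to be replaced by matching on column names.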
First of all, I am using the ukpolice library in R and have extracted data into a new data frame called crimes. Now I am running into a new problem: I am trying to extract certain data into a new, empty data frame called df.shoplifting. If the category of the crime is equal to "shoplifting", the id, month, and street name need to be added to the new data frame. I need to use a loop and an if statement together.
EDIT:
Currently I have this working, but it lacks the if statement:
for (i in crimes$category) {
  shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
  names(shoplifting) <- c("ID", "Month", "Street_Name")
}
What I am trying to do:
for (i in crimes$category) {
  if(crimes$category == "shoplifting"){
    data1 <- subset(crimes, category == i, select = c(id, month, street_name))
  }
}
It does run and creates the new data frame data1, but the data it extracts is wrong: it does not include only items in the shoplifting category.
I'll guess, and update if needed based on your question edits.
rbind works only on data.frame and matrix objects, not on vectors. If you want to extend a vector (N.B., that is not part of a frame or column/row of a matrix), you can merely extend it with c(somevec, newvals) ... but I think that this is not what you want here.
You are iterating through each value of crimes$category, but if one category matches, then you are appending all data within crimes. I suspect you mean to subset crimes when adding. We'll address this in the next bullet.
One cannot extend a single column of a multi-column frame in the absence of the others. A data.frame has a restriction that all columns must always have the same length, and extending one column defeats that. (And doing all columns immediately-sequentially does not satisfy that restriction.)
One way to work around this is to rbind a just-created data.frame:
# i = "shoplifting"
newframe <- subset(crimes, category == i, select = c(id, month, street_name))
names(newframe) <- c("ID", "Month", "Street_Name") # match df.shoplifting names
df.shoplifting <- rbind(df.shoplifting, newframe)
I don't have the data, but if crimes$category ever has repeats, you will re-add all of the same-category rows to df.shoplifting. This might be a problem with my assumptions, but is likely not what you really need.
If you really just need to do it once for a category, then do this without the need for a for loop:
df.shoplifting <- subset(crimes, category == "shoplifting", select = c(id, month, street_name))
# optional
names(df.shoplifting) <- c("ID", "Month", "Street_Name")
Iteratively adding rows to a frame is a bad idea: while it works okay for smaller datasets, as your data scales, the performance worsens. Why? Because each time you add rows to a data.frame, the entire frame is copied into a new object. It's generally better to form a list of frames and then concatenate them all later (c.f., https://stackoverflow.com/a/24376207/3358227).
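If a loop with an if statement is a hard requirement, the "list of frames, then concatenate" advice can be sketched like this. The crimes data below is a hypothetical stand-in with the column names used in the question.

```r
# Hypothetical stand-in for the crimes data frame from the question
crimes <- data.frame(
  id          = 1:5,
  category    = c("burglary", "shoplifting", "robbery", "shoplifting", "burglary"),
  month       = c("2020-01", "2020-01", "2020-02", "2020-02", "2020-03"),
  street_name = c("High St", "Mill Ln", "Park Rd", "High St", "Kings Rd"),
  stringsAsFactors = FALSE
)

# Loop over row indices so the if() test sees exactly one value at a time
matches <- list()
for (i in seq_len(nrow(crimes))) {
  if (crimes$category[i] == "shoplifting") {
    matches[[length(matches) + 1]] <- crimes[i, c("id", "month", "street_name")]
  }
}

# Concatenate once at the end instead of growing a data frame inside the loop
df.shoplifting <- do.call(rbind, matches)
names(df.shoplifting) <- c("ID", "Month", "Street_Name")
print(df.shoplifting)
```

If no row matches, matches stays empty and do.call(rbind, matches) returns NULL, so real code would want to check for that case before renaming the columns.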
On this note, if you need one frame per category, you can get that simply with:
df_split <- split(df, df$category)
and then operate on each category as its own frame by working on a specific element within the df_split named list (e.g., df_split[["shoplifting"]]).
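As a small illustration of the one-frame-per-category list, here is a sketch using base R's split() on hypothetical toy data with the question's column names.

```r
# Hypothetical stand-in for the crimes data frame
crimes <- data.frame(
  id       = 1:5,
  category = c("burglary", "shoplifting", "robbery", "shoplifting", "burglary"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one data frame per category
by_category <- split(crimes, crimes$category)
print(names(by_category))

# Work on a single category by name
shoplifting <- by_category[["shoplifting"]]

# Or apply the same operation to every category at once
rows_per_category <- sapply(by_category, nrow)
print(rows_per_category)
```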
And lastly, depending on the analysis you're doing, it might still make sense to keep it all together. Both the dplyr and data.table dialects of R make doing calculations on data within groups very intuitive and efficient.
Try:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),]
Using a for loop in this instance will work, but when working in R you want to stick to vectorized operations if you can.
This operation subsets the crimes dataframe and selects rows where the category column is equal to shoplifting. It is not necessary to convert the category column into a factor - you can match the string with the == operator.
Note the comma after the which(...) call, inside the square brackets. The which function returns the indices (row numbers) that meet the criteria. The comma with nothing after it tells R that you want all of the columns. If you wanted to select only a few columns, you could do:
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'), c("id", "month", "street_name")]
Or you could refer to the columns by number (I don't have your data, so I don't know the numbers, but if id, month, and street_name were the first three columns, you could use 1, 2, 3):
df.shoplifting <- crimes[which(crimes$category == 'shoplifting'),c(1,2,3)]
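One practical difference between indexing with which(...) and indexing with the bare logical test shows up when the column contains missing values. A minimal sketch on hypothetical toy data:

```r
# Hypothetical toy data with one missing category
crimes <- data.frame(
  id       = 1:4,
  category = c("shoplifting", NA, "burglary", "shoplifting"),
  stringsAsFactors = FALSE
)

# which() returns only the positions where the test is TRUE, so the NA is skipped
rows <- which(crimes$category == "shoplifting")
with_which <- crimes[rows, ]

# A bare logical index produces an all-NA row for each NA in the test
with_logical <- crimes[crimes$category == "shoplifting", ]

print(with_which)
print(with_logical)
```

When the column has no missing values, both forms return the same rows.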
I want to clean up this dataset: Example Table
It contains many duplicates. For each duplicated value in the UUID column, I want to keep only the row with the highest value in the Shape_Area column and delete the rest. A loop must be created that detects the duplicates and compares the values from column Area within each set of duplicates.
I've tried the duplicated function, but I cannot be sure that the row it keeps has the greatest value in column Area.
I want an output table that contains each unique UUID once, with the row that has the greatest value in column Area.
Can anyone help with this one?
You can use the dplyr package like this:
library(dplyr)
newdata <- mydata %>%
  group_by(UUID) %>%
  arrange(-Shape_Area) %>%
  slice(1)
For each value of UUID, this code creates a group and sorts the rows by Shape_Area in descending order. Then only the first row of each group (i.e. the one with the highest value) is selected.
If you want to save this data, use this:
write.csv(newdata, file = "Output.csv")
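The same "sort descending, keep the first row per UUID" logic can also be sketched in base R without a loop. The toy data is hypothetical; the UUID and Shape_Area names come from the question.

```r
# Hypothetical toy data with duplicated UUIDs
mydata <- data.frame(
  UUID       = c("a", "a", "b", "c", "c", "c"),
  Shape_Area = c(10, 25, 7, 3, 9, 4),
  stringsAsFactors = FALSE
)

# Sort so the largest Shape_Area comes first
sorted <- mydata[order(mydata$Shape_Area, decreasing = TRUE), ]

# duplicated() flags every occurrence after the first, so negating it
# keeps exactly the first (largest) row for each UUID
newdata <- sorted[!duplicated(sorted$UUID), ]
newdata <- newdata[order(newdata$UUID), ]
print(newdata)
```

This addresses the concern in the question: duplicated() on its own keeps whichever row comes first, so sorting beforehand is what ensures that row is the one with the greatest area.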
I've posted a sample of the data I'm working with here.
"Parcel.." is the main indexing variable, and there are a good number of duplicates. The duplicates are not consistent across all of the other columns. My goal is to aggregate the dataset so that there is only one observation of each parcel.
I've used the following code to attempt summing numerical vectors:
aggregate(Ap.sample$X.11~Ap.sample$Parcel..,FUN=sum)
The problem is it removes everything except the parcel and the other vector I reference.
My goal is to use the same rule for certain numerical vectors (sum) (X.11,X.13,X.15, num_units) of observations of that parcelID, a different rule (average) for other numerical vectors (Acres,Ttl_sq_ft,Mtr.Size), and still a different rule (just pick one name) for the character variables (pretend there's another column "customer.name" with different values for the same unique parcel ID, i.e. "Steven condominiums" and "Stephen apartments"), and to just delete the extra observations for all the other variables.
I've tried to use the numcolwise function but that also doesn't do what I need.
My instinct would be to specify the columns I want to sum and the columns I want to take the average like so:
DT<-as.data.table(Ap.sample)
sum_cols<-Ap.05[,c(10,12,14)]
mean_cols<-Ap.05[,c(17:19)]
and then use the lapply function to go through each observation and do what I need.
df05<-DT[,lapply(.SD,sum), by=DT$Parcel..,.SDcols=sum_cols]
df05<-DT[,lapply(.SD,mean),by=DT$Parcel..,.SDcols=mean_cols]
but that spits out errors on the first go. I know there's a simpler workaround for this than trying to muscle through it.
You could do:
library(dplyr)

df %>%
  # create a hypothetical "customer.name" column
  mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
  # group data by "Parcel.."
  group_by(Parcel..) %>%
  # apply sum() to the selected columns
  mutate(across(c(X.11, X.13, X.15, num_units), sum)) %>%
  # likewise for mean()
  mutate(across(c(Acres, Ttl_sq_ft, Mtr.Size), mean)) %>%
  # select only the desired columns (the grouping column Parcel.. is kept automatically)
  select(X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
  # de-duplicate while keeping an arbitrary value (the first one in row order)
  distinct(Parcel.., .keep_all = TRUE)
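For comparison, the "one rule per set of columns" idea can also be sketched in base R with one aggregate() call per rule, merged back together on the parcel id. The toy data below is hypothetical and uses only a few of the question's column names.

```r
# Hypothetical toy version of the parcel data from the question
Ap.sample <- data.frame(
  Parcel..      = c("P1", "P1", "P2"),
  X.11          = c(10, 5, 7),
  num_units     = c(2, 3, 1),
  Acres         = c(1.0, 3.0, 2.5),
  customer.name = c("Steven condominiums", "Stephen apartments", "Oak Court"),
  stringsAsFactors = FALSE
)

# One aggregate() call per rule, each keyed by the parcel id
sums  <- aggregate(cbind(X.11, num_units) ~ Parcel.., data = Ap.sample, FUN = sum)
means <- aggregate(Acres ~ Parcel.., data = Ap.sample, FUN = mean)
names_first <- aggregate(customer.name ~ Parcel.., data = Ap.sample,
                         FUN = function(x) x[1])

# Stitch the per-rule summaries back together on the parcel id
parcels <- Reduce(function(a, b) merge(a, b, by = "Parcel.."),
                  list(sums, means, names_first))
print(parcels)
```

One caveat: the formula interface of aggregate() drops rows with missing values in the referenced columns by default, so real data with NAs may need an explicit na.action or na.rm choice.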