R: Scale a subset of multiple columns (with similar names) with dplyr

I recently moved from common dataframe manipulation in R to the tidyverse, but I ran into a problem with scaling columns via the scale() function.
My data consists of columns, some of which are numerical and some categorical features. The last column is the y value of the data. So I want to scale all numerical columns but not the last one.
With the select() function I am able to write a very short line of code that selects all the numerical columns needing to be scaled, by adding the ends_with("...") helper. But I can't really make use of that for scaling: there I have to use transmute(feature1 = scale(feature1), feature2 = scale(feature2), ...) and name each feature individually. This works fine but bloats up the code.
So my question is:
Is there a smart solution to manipulate a set of columns without having to address every single column name in transmute()?
I imagine something like:
transmute(ends_with("...") = scale(ends_with("...")), featureX, featureZ)
(well aware that this does not work)
Many thanks in advance

library(tidyverse)
data("economics")
# add variables that are not numeric
economics[7:9] <- sample(LETTERS[1:10], size = dim(economics)[1], replace = TRUE)
# add a 'y' column (for illustration)
set.seed(1)
economics$y <- rnorm(n = dim(economics)[1])
economics_modified <- economics %>%
  select(-y) %>%
  transmute_if(is.numeric, scale) %>%
  add_column(y = economics$y)
If you want to keep those columns that are not numeric replace transmute_if with modify_if. (There might be a smarter way to exclude column y from being scaled.)
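Since dplyr 1.0, the _if verbs have been superseded by across(). A minimal sketch of the same idea (not part of the original answer), assuming the economics data from above; as.numeric() strips the matrix attributes that scale() returns:

library(dplyr)
economics_scaled <- economics %>%
  mutate(across(where(is.numeric) & !y, ~ as.numeric(scale(.x))))

The selection where(is.numeric) & !y also answers the aside: it scales every numeric column except y in one step, while keeping the non-numeric columns untouched.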

Related

Counting number of rows where a value occurs at least once within many columns

I updated the question with pseudocode to better explain what I would like to do.
I have a data.frame named df_sel, with 5064 rows and 215 columns.
Some of the columns (~80) contain integers with a unique identifier for a specific trait (medications). These columns are named "meds_0_1", "meds_0_2", "meds_0_3" etc., as well as "meds_1_1", "meds_1_2", "meds_1_3". Each column may or may not contain any of the integer values I am looking for.
Of the specific integer values to look for, some can be grouped under one type of medication but are coded for specific brand names.
metformin = 1140884600 # not grouped
sulfonylurea = c(1140874718, 1140874724, 1140874726) # grouped
If it were possible to look up a group of medications, as in the vector format above, that would be helpful.
I would like to do this:
IF [a specific row]
CONTAINS [the single integer value of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_METFORMIN = 1 ELSE A_NEW_VARIABLE_METFORMIN = 0
and correspondingly
IF [a specific row]
CONTAINS [any of multiple integer values of interest]
IN [any of the columns within the df starting with "meds_0"]
A_NEW_VARIABLE_SULFONYLUREA = 1 ELSE A_NEW_VARIABLE_SULFONYLUREA = 0
I have managed to create a vector based on column names:
column_names <- names(df_sel) %>% str_subset('^meds_0')
But I haven't gotten any further despite some suggestions below.
I hope you understand better what I am trying to do.
As for the selection of the columns, you could do this by first extracting the names in the way you are doing with a regex, and then using select:
library(stringr)
column_names <- names(df_sel) %>%
  str_subset('^meds_0')
relevant_df <- df_sel %>%
  select(column_names)
I didn't quite get the structure of your variables (if they are integers, logicals, etc.), so I'm not sure how to continue, but it would probably involve something like summing across all the columns and dropping those that are not 0, like:
meds_taken <- rowSums(relevant_df)
df_sel_med_count <- df_sel %>%
  add_column(meds_taken)
At this point you should have your initial df with the relevant data in one column, and you can summarize by subject, medication or whatever in any way you want.
If this is not enough, please edit your question providing a relevant sample of your data (you can do this with the dput function) and I'll edit this answer to add more detail.
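To get the 0/1 indicator variables described in the question, here is a hedged sketch (assuming the meds_0_* columns hold the integer codes, possibly with NAs; the variable names come from the question's pseudocode):

metformin <- 1140884600
sulfonylurea <- c(1140874718, 1140874724, 1140874726)
meds0 <- df_sel[, grep('^meds_0', names(df_sel))]
# TRUE wherever a cell matches a code of interest, then collapse per row
df_sel$A_NEW_VARIABLE_METFORMIN <- as.integer(rowSums(sapply(meds0, `%in%`, metformin)) > 0)
df_sel$A_NEW_VARIABLE_SULFONYLUREA <- as.integer(rowSums(sapply(meds0, `%in%`, sulfonylurea)) > 0)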
First, I would like to start off by recommending Bioconductor for R libraries, as it sounds like you may be studying biological data. Now to your question.
Although the tidyverse is the most widely accepted and 'easy' method, I would recommend using lapply in this instance, as it is extremely fast. From a programming standpoint your code becomes a simple boolean test, as you stated, but I think we can go a little further. Using the built-in mtcars data:
data(mtcars)
head(mtcars, 6)
target <- 6
# TRUEs and FALSEs for each row and column
rows <- lapply(mtcars, function(x) x %in% target)
# number of TRUEs for each column, and which columns have more than 0 TRUEs
column_sums <- unlist(lapply(rows, function(x) sum(x, na.rm = TRUE)))
which(column_sums > 0)
This will work with other data types with a few tweaks here and there.
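One such tweak, since the question asks about rows rather than columns: the per-column logicals in rows can be collapsed row-wise instead. A sketch, reusing rows from above:

# TRUE for each row in which any column contains the target
row_hits <- Reduce(`|`, rows)
which(row_hits)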

add_column error with data frame

I'm trying to use the add_column function within the tibble package to add a column or two to my df data frame, but I keep getting different errors depending on how I manipulate the arguments of the function. df is a 60 x 17 data frame. Here's what I have tried so far:
Try 1:
library(tibble)
add_column(Depth +.5 =df[1], .after = 1)
Try 2:
library(tibble)
add_column(df, depth + .5 = rep(df[1], nrow(df)), .after = 1)
I want the new column to be inserted after column 1 in df, named "Depth + .5", and filled with data from my df[1] column (I'm going to alter the values in it later). I also need the row count to adapt when I import data sets of different lengths, which is why I'm referencing df[1], since its length will change depending on the data that I import. Also, I'm not sure whether I need to put "Depth + .5" in quotes to make it work, but that's what I'd like the column to be named.
Two key points: first, you need to include df in the add_column() call. Second, I see where you were going with your rep() line, because you were getting an error about the number of rows needing to match. However, all you need to do is reference the existing column (which already has the correct length) and perform your operation on it. To do that, use df$Depth, or equivalently df[,1].
add_column(df, 'Depth + .5'= df$Depth + .5, .after = 1)
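A quick check on a toy tibble (a hypothetical stand-in for your 60 x 17 df, assuming its first column is named Depth):

library(tibble)
df <- tibble(Depth = c(1, 2, 3), other = c(10, 20, 30))
df <- add_column(df, `Depth + .5` = df$Depth + .5, .after = 1)
names(df)  # "Depth" "Depth + .5" "other"

Because the name contains spaces, it has to be quoted or backticked both when it is created and when it is accessed later, e.g. df$`Depth + .5`.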

Removing Duplicate rows while summing one column and preserving the other columns

I have a dataset of a few columns with duplicate rows (duplication based on one column, named ProjectID).
I want to remove the duplicate rows and keep just one of each.
However, each of these rows has a separate amount value against it, which needs to be summed and stored in the final consolidated row.
I have used the aggregate function; however, it removes all other columns (at least with the usage I know).
Can somebody please tell me an easier way?
The example data set is attached.
This could be solved using dplyr, as @PLapointe pointed out. If your dataset is called df, then this would go as:
df %>%
  group_by(`Project ID`, `Project No.`, `Account Head`, `Function`, `Functionary`) %>%
  summarise(cost.total = sum(Amount))
This should do it. You can also adjust the variables you want to keep.
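Note that grouping by all of those columns only keeps them because their values are assumed constant within a Project ID. If they can differ between duplicates, a sketch of an alternative (assuming dplyr 1.0+ and the same hypothetical column names) is to group by Project ID alone and take the first value of each remaining column:

df %>%
  group_by(`Project ID`) %>%
  summarise(across(c(`Project No.`, `Account Head`, `Function`, `Functionary`), first),
            cost.total = sum(Amount))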
It's a more complicated method, but it worked for me.
I aggregated the amounts by ProjectID using the aggregate function, storing them in a new tibble.
Then I appended this column to the original tibble as a new column.
It didn't do exactly what I wanted, but I was able to work with a new column Final_Amount, making the earlier Amount column irrelevant.
library(dplyr)

Duplicate_remove2 <- function(dataGP_cleaned) {
  # aggregating unique amounts per ProjectID
  aggregated_amount <- aggregate(dataGP_cleaned['Amount'],
                                 by = dataGP_cleaned['ProjectID'], sum)
  # finding the distinct dataset (one row per ProjectID)
  dataGP_unique <- distinct(dataGP_cleaned, ProjectID, .keep_all = TRUE)
  # changing the name of the column for easy identification
  aggregated_amount$Final_Amount <- aggregated_amount$Amount
  # joining on ProjectID; safer than bind_cols(), which assumes the row
  # orders match (aggregate() sorts its output by the grouping variable)
  aggregate_dataGP <- left_join(dataGP_unique,
                                aggregated_amount[c('ProjectID', 'Final_Amount')],
                                by = 'ProjectID')
  return(aggregate_dataGP)
}
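A quick check with toy data (hypothetical values, just to illustrate the shape of the result):

dataGP_cleaned <- data.frame(ProjectID = c(1, 1, 2),
                             Name = c('A', 'A', 'B'),
                             Amount = c(10, 5, 7))
Duplicate_remove2(dataGP_cleaned)
#   ProjectID Name Amount Final_Amount
# 1         1    A     10           15
# 2         2    B      7            7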

How to create a "top ten" vector that keeps labels?

I have a data set that has 655 rows and 21 columns. I'm currently looping through each column and need to find the top ten of each, but when I use the head() function, it doesn't keep the labels (they are names of bacteria; each column is a sample). Is there a way to create a sorted subset of the data that carries the row names along with it?
Right now I am doing
topten <- head(sort(genuscounts[,c(1,i)], decreasing = TRUE), n = 10)
but I am getting an error message since column 1 is the list of names.
Thanks!
Because sort() applies to vectors, it won't work on your subset genuscounts[, c(1, i)], which has multiple columns. In base R, you'll want to use order():
thisColumn <- genuscounts[, c(1, i)]
topten <- head(thisColumn[order(thisColumn[, 2], decreasing = TRUE), ], 10)
You could also use arrange_() from the dplyr package, which provides a more user-friendly interface:
library(dplyr)
head(arrange_(genuscounts[, c(1, i)], paste0("desc(`", names(genuscounts)[i], "`)")), 10)
You'd need to use arrange_() instead of arrange() because your column name will be a string and not an object.
Hope this helps!!
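Note that arrange_() and the other underscore verbs are deprecated in current dplyr; a sketch of the modern equivalent (assuming the same genuscounts data and loop index i), using .data to refer to a column by a name stored as a string:

library(dplyr)
topten <- genuscounts[, c(1, i)] %>%
  arrange(desc(.data[[names(genuscounts)[i]]])) %>%
  head(10)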

removing and aggregating duplicates

I've posted a sample of the data I'm working with here.
"Parcel.." is the main indexing variable and there are good amount of duplicates. The duplicates are not consistent in all of the other columns. My goal is to aggregate the data set so that there is only one observation of each parcel.
I've used the following code to attempt summing numerical vectors:
aggregate(Ap.sample$X.11 ~ Ap.sample$Parcel.., FUN = sum)
The problem is it removes everything except the parcel and the other vector I reference.
My goal is to apply one rule (sum) to certain numerical vectors (X.11, X.13, X.15, num_units) across the observations of each parcel ID, a different rule (average) to other numerical vectors (Acres, Ttl_sq_ft, Mtr.Size), and still a different rule (just pick one name) to the character variables (pretend there's another column "customer.name" with different values for the same unique parcel ID, e.g. "Steven condominiums" and "Stephen apartments"), and to simply delete the extra observations for all the other variables.
I've tried to use the numcolwise function but that also doesn't do what I need.
My instinct would be to specify the columns I want to sum and the columns I want to take the average like so:
DT <- as.data.table(Ap.sample)
sum_cols <- Ap.05[, c(10, 12, 14)]
mean_cols <- Ap.05[, c(17:19)]
and then use the lapply function to go through each observation and do what I need.
df05 <- DT[, lapply(.SD, sum), by = DT$Parcel.., .SDcols = sum_cols]
df05 <- DT[, lapply(.SD, mean), by = DT$Parcel.., .SDcols = mean_cols]
but that spits out errors on the first go. I know there's a simpler work around for this than trying to muscle through it.
You could do:
library(dplyr)
df %>%
  # create a hypothetical "customer.name" column
  mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
  # group data by "Parcel.."
  group_by(Parcel..) %>%
  # apply sum() to the selected columns
  mutate_each(funs(sum(.)), one_of("X.11", "X.13", "X.15", "num_units")) %>%
  # likewise for mean()
  mutate_each(funs(mean(.)), one_of("Acres", "Ttl_sq_ft", "Mtr.Size")) %>%
  # select only the desired columns
  select(X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
  # de-duplicate while keeping an arbitrary value (the first one in row order)
  distinct(Parcel..)
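mutate_each(), funs(), and this use of distinct() date from older dplyr; a sketch of the same logic with current (1.0+) verbs, under the same assumed column names, collapses each parcel in a single summarise():

df %>%
  group_by(Parcel..) %>%
  summarise(across(c(X.11, X.13, X.15, num_units), sum),
            across(c(Acres, Ttl_sq_ft, Mtr.Size), mean),
            customer.name = first(customer.name))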
