How to conditionally change colnames from certain rows in R?

I have a problem that often arises when working with Excel survey data: the first 10 or so column names in a data set are appropriate, but the remaining columns, from x+1 through ncol, need to be renamed to the values in the first row of the data. (The colnames are correct up to column x; after that they are empty, and the values I want as colnames sit in the first row.)
I have been doing this manually, writing the names out one by one with dplyr::select(). How can I automate this in a tidy workflow? I imagine set_names() or rename_at() would work, but I can't get the syntax right. Thanks in advance.
mtcars %>%
  select(miles_per_gallon = "mpg", everything()) %>% # etc.; keep some names
  rename_at(vars(3:ncol(.)), funs(mtcars[, 1]))
Error: `nm` must be `NULL` or a character vector the same length as `x`
The error isn't surprising, but it illustrates the point: how do I replace the names from column x+1 through ncol() with the first row's values?

I think this should do it for you -
x <- 10 # means column names 1:10 are appropriate
# unlist() turns the one-row slice into a plain character vector
names(df)[(x + 1):ncol(df)] <- unlist(df[1, (x + 1):ncol(df)], use.names = FALSE)
df <- df[-1, ] # removing 1st row assuming it's bad data
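For the tidy workflow the question asks about, here is a minimal sketch with purrr::set_names(), assuming df holds the survey data and the first x names are already correct (df and x are illustrative names, not from the post):
library(dplyr)
library(purrr)

x <- 10
new_names <- c(
  names(df)[1:x],                                    # keep the good names
  unlist(df[1, (x + 1):ncol(df)], use.names = FALSE) # take the rest from row 1
)
df <- df %>%
  set_names(new_names) %>%
  slice(-1) # drop the row that held the names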

Related

Drop rows if combination of multiple rows match regex value(s) using R dplyr (or other)

First off, I'm not very well versed in R; I code mainly in Bash and sometimes in Python. That said, I have a dataframe with the following (variable) structure: between the columns 'info' and 'gene' there can be up to 5 columns, and they may have different names. The value in each of those columns starts with one of r, Ref, n, N, No_GT, Hom, or Het. If a value starts with Ref, No_GT, Hom, or Het, it carries additional data delimited by ':'.
For example, the demo table below:
info   s1                gene
a      r                 GG
b      Hom:10,10:20:99   TG
c      Het:5,6:11:20     TGGB
To identify the column names of interest, I'm using this snippet:
my_file %>% select('info':'gene') %>% colnames() -> samples
samples <- samples[! samples %in% c("info", "gene")]
When there is a single sample column, I need to remove rows whose value starts with r, n, N, Ref, or No_GT. This can be achieved using grepl and a regex match, for example
df[!grepl("^r|^n|^N|^Ref|^No_GT", df$s1), ]
which removes the first row of the demo table.
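A minimal dplyr/stringr equivalent of that base call, under the same single-column assumption (df and s1 as in the demo table):
library(dplyr)
library(stringr)

df %>% filter(!str_detect(s1, "^(r|n|N|Ref|No_GT)"))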
However, there may be more columns between info and gene, example:
info   s1                s2   gene
a      r                 n    GG
b      Hom:10,10:20:99   n    TG
c      Het:5,6:11:20     r    TGGB
My problem arises when there are multiple samples. In that case I have to drop rows where all sample columns contain some combination of r, n, N, Ref, and No_GT; i.e. if any sample starts with Hom or Het, the row has to be preserved. I have the column names, but I'm not sure of the optimal way to solve this. I could cycle through each column, but then how do I break out when I encounter a Hom or Het?
Any help appreciated!
I tried using filter and select, but I'm not able to specify multiple columns even when using across. I tried this -
my_file_sorted %>% filter(across(c(s1, s2), ~ "^Het|^Hom")) -> trimmed
However I'm getting this error
Error in `filter()`:
! Problem while computing `..1 = across(.cols = c(s1, s2))`.
✖ Input `..1$s1` must be a logical vector, not a character.
You could use if_any() (str_detect() is from stringr):
library(dplyr)
library(stringr)

df %>%
  filter(if_any(matches("^s[0-9]+$"), str_detect, pattern = "^(Hom|Het)"))
or pmap_lgl() from purrr:
df %>%
  filter(
    pmap_lgl(
      select(., matches("^s[0-9]+$")),
      ~ any(str_detect(c(...), "^(Hom|Het)"))
    )
  )
If you know which columns are needed for this test and have them in a vector my_cols, just replace matches("^s[0-9]+$") with all_of(my_cols).
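A minimal self-contained check of the if_any() approach, using the demo table from the question (the tibble() construction is illustrative):
library(dplyr)
library(stringr)

df <- tibble(
  info = c("a", "b", "c"),
  s1   = c("r", "Hom:10,10:20:99", "Het:5,6:11:20"),
  s2   = c("n", "n", "r"),
  gene = c("GG", "TG", "TGGB")
)

df %>%
  filter(if_any(matches("^s[0-9]+$"), str_detect, pattern = "^(Hom|Het)"))
# keeps rows b and c; row a is dropped because no sample starts with Hom or Het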

Return the column name as a value in a column

I'm working on a classifier and I'm pretty much stuck on the last step. In my intermediate output, each row corresponds to one observation and each column holds a score for one target class; the column with the highest value wins.
I'm currently writing the function in the tidyverse dialect, and so far I've tried the following and received an empty column:
result <- result %>%
  rowwise() %>%
  transmute(class = colnames(max(c_across())))
return(result)
My intention with colnames(max(c_across())) is to find the column with the highest value and assign its name to class.
In case you're willing to accept a base R solution within the pipes, you can use
class_names <- result %>%
  apply(1, function(x) names(x)[which.max(x)])
and then add the name vector to the result dataframe.
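A vectorised alternative (a sketch, assuming every column of result is a numeric score) is base R's max.col(), which avoids the row-by-row apply():
# index of the highest-scoring column in each row; first column wins ties
result$class <- colnames(result)[max.col(result, ties.method = "first")]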

Exclude one single column from sapply

I have a dataframe with multiple columns that I want to group according to their names. When several column names match the same pattern, I want them collapsed into a single column that is the row-wise sum of the group.
colnames(dataframe)
[1] "Départements" "01...3" "01...4" "01...5" "02...6" "02...7" "02...8" "02...9" "02...10" "03...11"
[11] "03...12" "03...13" "04...14" "04...15" "05...16" "05...17" "05...18" "06...19" "06...20" "06...21"
So I use this bit of code, which works just fine when every column is numeric; but the first column is character, so I hit an error. How can I exclude the first column?
# Group columns by pattern: store the patterns in a vector, then loop over them
patterns <- unique(substr(names(dataframe), 1, 3))
dataframe <- sapply(patterns, function(xx) rowSums(dataframe[, grep(xx, names(dataframe)), drop = FALSE]))
This is the error I get:
Error in rowSums(DEPTpolicedata_2012[, grep(xx, names(DEPTpolicedata_2012)), :
'x' must be numeric
You can simply remove the first column from the data frame before building the patterns:
dataframe$Départements <- NULL
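If you'd rather keep the Départements column, here is a sketch that restricts the summing to numeric columns (assuming the three-character prefixes shown above); startsWith() is used instead of grep() so the dots in names like "01...3" are not treated as regex wildcards:
num_cols <- names(dataframe)[sapply(dataframe, is.numeric)]
patterns <- unique(substr(num_cols, 1, 3))
sums <- sapply(patterns, function(xx)
  rowSums(dataframe[, startsWith(names(dataframe), xx), drop = FALSE]))
result <- cbind(dataframe["Départements"], sums)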

R: Scale a subset of multiple columns (with similar names) with dplyr

I recently moved from base dataframe manipulation in R to the tidyverse, but I've hit a problem scaling columns with the scale() function.
My data consists of columns, some of which are numerical and some categorical features, and the last column is the y value. So I want to scale all numerical columns except that last one.
With select() I can write a very short line that picks all the numerical columns to be scaled by adding an ends_with("...") helper, but I can't make use of that for the scaling itself. There I have to write transmute(feature1 = scale(feature1), feature2 = scale(feature2), ...) and name each feature individually. This works, but it bloats the code.
So my question is:
Is there a smart solution to manipulate column by column without the need to address every single column name with
transmute?
I imagine something like:
transmute(ends_with("...")=scale(ends_with("..."),featureX,featureZ)
(well aware that this does not work)
Many thanks in advance
library(tidyverse)
data("economics")
# add variables that are not numeric
economics[7:9] <- sample(LETTERS[1:10], size = dim(economics)[1], replace = TRUE)
# add a 'y' column (for illustration)
set.seed(1)
economics$y <- rnorm(n = dim(economics)[1])
economics_modified <- economics %>%
  select(-y) %>%
  transmute_if(is.numeric, scale) %>%
  add_column(y = economics$y)
If you want to keep the columns that are not numeric, replace transmute_if with modify_if. (There might be a smarter way to exclude column y from being scaled.)
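In current dplyr (>= 1.0) there is indeed a smarter way: across() supersedes the _if verbs and lets you exclude y inside the selection. A sketch:
economics_modified <- economics %>%
  mutate(across(where(is.numeric) & !y, ~ as.numeric(scale(.x))))
# scale() returns a one-column matrix; as.numeric() flattens it back to a vector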

Is there a more elegant way to find duplicated records?

I've got 81,000 records in my test frame, and duplicated is showing me that 2039 are identical matches. One answer to Find duplicated rows (based on 2 columns) in Data Frame in R suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:
dup <- data.frame(as.numeric(duplicated(df$var))) #creates df with binary var for duplicated rows
colnames(dup) <- c("dup") #renames column for simplicity
df2 <- cbind(df, dup) #bind to original df
df3 <- subset(df2, dup == 1) #subsets df using binary var for duplicated
But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?
In my case I'm working with scraped data and I need to figure out whether the duplicates exist in the original or were introduced by me scraping.
duplicated(df) will give you a logical vector (every value either TRUE or FALSE), which you can then use as a row index into your dataframe.
# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ] #note the comma
You can put it all together in one line
df[duplicated(df$var), ] # again, the comma, to indicate we are selecting rows
doops <- which(duplicated(df$var)) # the == TRUE is redundant inside which()
uniques <- df[-doops, ]
duplicates <- df[doops, ]
This is the logic I generally use when I am trying to remove duplicate entries from a data frame. (Beware: if doops is empty, df[-doops, ] returns zero rows rather than all of them, so guard for that case.)
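For the requested view of just the duplicate records, a short dplyr sketch (assuming var is the column that defines a duplicate) that keeps the originals alongside their copies, which helps when checking whether the duplicates came from the source or from the scraping:
library(dplyr)

df %>%
  group_by(var) %>%
  filter(n() > 1) %>%
  ungroup()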
