dplyr mutate grouped data without using exact column name - r

I'm trying to wirte a function to process multiple similar dataset, here I want to subtract scores obtained by subject in the second interview by scores obtained by the same subject in the previous interview. In all dataset I want to process, interested score will be stored in the second column. Writing for each specific dataset is simple, simply use the exact column name, everything will go fine.
d <- a %>%
arrange(by_group=interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = colname_2nd-lag(colname_2nd))
But since I need a generic function that can be used to process multiple dataset, I can not use exact column name. So I tried 3 approaches, both of them only altered the last line
Approach#1:
dplyr::mutate(score_change = dplyr::vars(2)-lag(dplyr::vars(2)))
Approach#2:
Second column name of interested dataset contains a same string ,so I tried
dplyr::mutate(score_change = dplyr::vars(matches('string'))-lag(dplyr::vars(matches('string'))))
Error messages of the above 2 approaches will be
Error in dplyr::vars(2) - lag(dplyr::vars(2)) :
non-numeric argument to binary operator
Approach#3:
dplyr::mutate(score_change = .[[2]]-lag(.[[2]]))
Error message:
Error: Column `score_change` must be length 2 (the group size) or one, not 10880
10880 is the row number of my sample dataset, so it look like group_by does not work in this approach
Does anyone know how to make the function perform in the desired way?

If you want to use position of the column names use cur_data()[[2]] to refer the 2nd column of the dataframe.
library(dplyr)
d <- a %>%
arrange(interview_date) %>%
dplyr::group_by(subjectkey) %>%
dplyr::mutate(score_change = cur_data()[[2]]-lag(cur_data()[[2]]))
Also note that cur_data() doesn't count the grouped column so if subjectkey is first column in your data and colname_2nd is the second one you may need to use cur_data()[[1]] instead when you group_by.

Related

Return the column name as as a value in a column

I'm working on a classifier and I'm pretty much stuck on the last step. An image of my output is below. Each row corresponds with one observation and the values determine what target class it will be, the highest value wins. The following Table is an example of my intermediate output.
I'm currently writing the function with the tidyverse dialect and so far I've tried the following and received an empty column:
result <- result %>%
rowwise() %>%
transmute(class = colnames(max(c_across())))
return(result)
My intention with colnames(max(c_across))) is to find the column with the highest value and assign it's name to class.
In case you're willing to accept a Base R solution within the pipes you can use
names <- result %>%
apply(., 1, function(x) names(x)[which.max(x)])
and then add the name vector to the results dataframe next.

Duplicate removal in time series based on other column using R

I am currently working on a problem which involves data cleaning and calculation in below fashion :
I have created the sample dataset here for a single unit A.
Data is sorted according to timestamp column for each unit. There are other columns as well.
For each distinct alternate value of event_log_value_desc, I need to get rows. In the case of multiple duplicate values of event_log_value_desc, it should return the row with the first occurrence of event_log_value_desc. event_log_value_desc should have alternate values of OFF and ON for each unit.
In return, the program should return the following :
I don't know if this solution works since it has not been tested on your dataset, but I believe it should be fine
library(dplyr)
df %>%
group_by(unit) %>%
mutate(event_log_value_desc_lag = lag(event_log_value_desc)) %>%
filter(event_log_value_desc != event_log_value_desc_lag | is.na(event_log_value_desc_lag))

Average of Multiple columns in dplyr getting error "must resolve to integer column positions, not a list"

gene HSC_7256.bam HSC_6792.bam HSC_7653.bam HSC_5852
My data frame looks like this i can do that in a normal way such as take out the columns make another data frame average it ,but i want to do that in dplyr and im having a hard time I not sure what is the problem
I doing something like this
HSC<- EPIGENETIC_FACTOR_SEQMONK %>%
select(EPIGENETIC_FACTOR_SEQMONK,gene)
I get this error
Error: EPIGENETIC_FACTOR_SEQMONK must resolve to integer column positions, not a list
So i have to do this take out all the HSC sample average them
Anyone suggest what am i doing it incorrectly ?that would be helpful
The %>% function pulls whatever is to the left of it into the first position of the following function. If your data frame is EPIGENETIC_FACTOR_SEQMONK, then these two statements are equivalent:
HSC <- EPIGENETIC_FACTOR_SEQMONK %>%
select(gene)
HSC <- select(EPIGENETIC_FACTOR_SEQMONK, gene)
In the first, we are passing EPIGENETIC_FACTOR_SEQMONK into select using %>%, which is generally used in dplyr chains as the first argument in dplyr functions is a data frame.

How to create a "top ten" vector that keeps labels?

I have a data set that has 655 Rows, and 21 Columns. I'm currently looping through each column and need to find the top ten of each, but when I use the head() function, it doesn't keep the labels (they are names of bacteria, each column is a sample). Is there a way to create sorted subset of data that sorts the row name along with it?
right now I am doing
topten <- head(sort(genuscounts[,c(1,i)], decreasing = TRUE) n = 10)
but I am getting an error message since column 1 is the list of names.
Thanks!
Because sort() applies to vectors, it's not going to work with your subset genuscounts[,c(1,i)], because the subset has multiple columns. In base R, you'll want to use order():
thisColumn <- genuscounts[,c(1,i)]
topten <- head(thisColumn[order(thisColumn[,2],decreasing=T),],10)
You could also use arrange_() from the dplyr package, which provides a more user-friendly interface:
library(dplyr)
head(arrange_(genuscounts[,c(1,i)],desc(names(genuscounts)[i])),10)
You'd need to use arrange_() instead of arrange() because your column name will be a string and not an object.
Hope this helps!!

removing and aggregating duplicates

I've posted a sample of the data I'm working with here.
"Parcel.." is the main indexing variable and there are good amount of duplicates. The duplicates are not consistent in all of the other columns. My goal is to aggregate the data set so that there is only one observation of each parcel.
I've used the following code to attempt summing numerical vectors:
aggregate(Ap.sample$X.11~Ap.sample$Parcel..,FUN=sum)
The problem is it removes everything except the parcel and the other vector I reference.
My goal is to use the same rule for certain numerical vectors (sum) (X.11,X.13,X.15, num_units) of observations of that parcelID, a different rule (average) for other numerical vectors (Acres,Ttl_sq_ft,Mtr.Size), and still a different rule (just pick one name) for the character variables (pretend there's another column "customer.name" with different values for the same unique parcel ID, i.e. "Steven condominiums" and "Stephen apartments"), and to just delete the extra observations for all the other variables.
I've tried to use the numcolwise function but that also doesn't do what I need.
My instinct would be to specify the columns I want to sum and the columns I want to take the average like so:
DT<-as.data.table(Ap.sample)
sum_cols<-Ap.05[,c(10,12,14)]
mean_cols<-Ap.05[,c(17:19)]
and then use the lapply function to go through each observation and do what I need.
df05<-DT[,lapply(.SD,sum), by=DT$Parcel..,.SDcols=sum_cols]
df05<-DT[,lapply(.SD,mean),by=DT$Parcel..,.SDcols=mean_cols]
but that spits out errors on the first go. I know there's a simpler work around for this than trying to muscle through it.
You could do:
library(dplyr)
df %>%
# create an hypothetical "customer.name" column
mutate(customer.name = sample(LETTERS[1:10], size = n(), replace = TRUE)) %>%
# group data by "Parcel.."
group_by(Parcel..) %>%
# apply sum() to the selected columns
mutate_each(funs(sum(.)), one_of("X.11", "X.13", "X.15", "num_units")) %>%
# likewise for mean()
mutate_each(funs(mean(.)), one_of("Acres", "Ttl_sq_ft", "Mtr.Size")) %>%
# select only the desired columns
select(X.11, X.13, X.15, num_units, Acres, Ttl_sq_ft, Mtr.Size, customer.name) %>%
# de-duplicate while keeping an arbitrary value (the first one in row order)
distinct(Parcel..)

Resources