How to interpret column length error from dplyr::mutate? - r

I'm trying to apply a function (more complex than the one used below, but I was trying to simplify) across two vectors. However, I receive the following error:
mutate_impl(.data, dots) :
Column `diff` must be length 2 (the group size) or one, not 777
Based on some posts I read, I think I may be getting this error because taking the difference between rows yields one row fewer than the original dataframe. However, when I followed that advice and tried to pad the result with a 0/NA for the final row, I received another error. Did I at least identify the source of the error correctly? Any ideas? Thank you.
Original code:
diff_df <- DF %>%
group_by(DF$var1, DF$var2) %>%
mutate(diff = map2(DF$duration, lead(DF$duration), `-`)) %>%
as.data.frame()

We don't need map2 to get the difference between 'duration' and the lead of 'duration'. Subtraction is vectorized, so it already works element by element. map2 would loop over each element of 'duration' together with the corresponding element of lead(duration), which is unnecessary and also returns a list rather than a numeric vector.
DF %>%
  group_by(var1, var2) %>%
  mutate(diff = duration - lead(duration))
NOTE: Extracting the column with DF$duration after group_by bypasses the grouping and returns the column of the full dataset, which is why its length (777) does not match the group size (2). Inside the pipe there is no need for dataset$columnname; use the bare columnname. (In certain situations, such as comparing against the full column, dataset$columnname can still be useful.)
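A minimal sketch of the difference, assuming a made-up four-row data frame (the column names follow the question, the values are invented for illustration):

```r
library(dplyr)

# Hypothetical data: two groups of two rows each
DF <- data.frame(
  var1     = c("a", "a", "b", "b"),
  var2     = c("x", "x", "y", "y"),
  duration = c(10, 7, 5, 1)
)

# Bare column name: evaluated once per group, so lead() stays inside the group
res <- DF %>%
  group_by(var1, var2) %>%
  mutate(diff = duration - lead(duration)) %>%
  ungroup()

res$diff
# 3 NA 4 NA  (the last row of each group has no following row)

# DF$duration ignores the grouping and always has nrow(DF) elements,
# so inside a grouped mutate() its length cannot match the group size
length(DF$duration)
# 4
```

The NA at the end of each group is expected: lead() has nothing to return for the last row of a group.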

Related

Suppose the name of a column in a dataframe is unknown to me, how can I sort the df according to the values in that column?

I'm trying to sort a dataframe in descending order according to the values in a specific column whose name is supposed to be unknown to me (i.e. I know it but I am not allowed to use it). The only clue is that it is the last column of this dataframe.
I've tried arrange() and order() but they don't work. I also noticed that if I use names(df)[ncol(df)], I get the name of that column as a character string. However, the correct argument format in arrange() seems to be columnName wrapped in backticks rather than "columnName". So I don't know how to correctly pass the name I got to the functions I want to use.
Base R
mtcars[order(mtcars[[tail(names(mtcars), 1)]]), ] # ascending; [[ ]] extracts the column as a plain vector
mtcars[order(mtcars[[tail(names(mtcars), 1)]], decreasing = TRUE), ] # descending
tidyverse
library(dplyr)
mtcars %>% arrange_at(vars(last(names(.)))) #ascending
mtcars %>% arrange_at(vars(last(names(.))), desc) #descending
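Since the question is really about passing a column name held in a string, here is a hedged sketch using the `.data` pronoun, which should work in reasonably recent dplyr versions. The variable name `last_col_name` is only illustrative:

```r
library(dplyr)

# Name of the last column, held as a character string
last_col_name <- tail(names(mtcars), 1)
last_col_name
# "carb"

# .data[[...]] looks up a column by the string stored in a variable,
# so the column name never has to be typed out
asc <- mtcars %>% arrange(.data[[last_col_name]])        # ascending
dsc <- mtcars %>% arrange(desc(.data[[last_col_name]]))  # descending
```

This avoids backticks entirely, because the name is never written as a symbol.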

R: How to check if multiple columns' values exist in a list

I have a dataframe with columns containing the words that make up an ngram. I would like to sum up the number of stopwords in each ngram and add this as a column to the dataframe, but I can't think of an elegant way to do it for multiple values of n (4-grams, 5-grams, etc.).
So far I have been doing the following:
mutate(Bigram_Counts_By_Company,
stopword_count = (word1 %in% stop_words$word) %>% as.integer() +
(word2 %in% stop_words$word) %>% as.integer())
Now this works, but I'd much rather write a general function that does the same with all columns starting with "word".
What I'd like to do:
mutate(Web_Bigram_Counts_By_Company,
stopword_count = select(Web_Bigram_Counts_By_Company, starts_with("word")) %in% stop_words$word)
select(Web_Bigram_Counts_By_Company, starts_with("word")) works great to select the columns whose names start with 'word', but when I use it in the call to mutate I get this error: Column 'stopword_count' must be length 360463 (the number of rows) or one, not 2
Is this just a simple R fundamentals error or am I going about this wrong?
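One possible approach, offered as a sketch rather than a definitive answer: the "not 2" in the error suggests that %in% treated the selected data frame as a list of two columns and returned one value per column instead of one per row. Applying %in% to each column separately and summing the results avoids that. The data below is invented; `ngrams` and the three-word `stop_words` table are hypothetical stand-ins for the objects in the question.

```r
# Hypothetical stand-ins for the question's objects
stop_words <- data.frame(word = c("the", "of", "a"), stringsAsFactors = FALSE)
ngrams <- data.frame(
  word1 = c("the", "stock", "a"),
  word2 = c("price", "of", "the"),
  n     = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# Pick every column whose name starts with "word", however many there are
word_cols <- ngrams[startsWith(names(ngrams), "word")]

# One logical vector per column (TRUE where the word is a stopword),
# then add the vectors element-wise to get a per-row count
hits <- lapply(word_cols, function(col) col %in% stop_words$word)
ngrams$stopword_count <- Reduce(`+`, hits)

ngrams$stopword_count
# 1 1 2
```

Because the columns are selected by prefix, the same code should work unchanged for 4-grams or 5-grams.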

Keeping a column throughout pipe manipulations, R, dplyr

I have a DF called data that has multiple columns. One of them, 'swim_hours', is a column of integers. I want to use dplyr to do some manipulations on this DF, but I also want to keep this column intact, even though I am not manipulating it in any way. My code is as follows:
trial_swim = data %>%
group_by(penguin, date, trial, location) %>%
summarise(swim_t = as.numeric(datetime[n()] - datetime[1])) %>%
summarise(hours = swim_hours)
Again, the purpose of the last line is to simply maintain my 'swim_hours' column into my new DF 'trial_swim'.
When I run this, I get the error message: Error in summarise_impl(.data, dots) :
Evaluation error: object 'swim_hour' not found.
Clearly 'swim_hours' is in my 'data' DF, so why can't it be found?
Is there an easier way to keep a column that is not being manipulated when using dplyr and pipes?
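A likely explanation is that the first summarise() keeps only the grouping columns and swim_t, so swim_hours no longer exists by the time the second summarise() runs. A sketch of one way around this, assuming swim_hours is constant within each group, is to carry it along in the same summarise() call. The data is invented and the grouping is reduced to two columns for brevity:

```r
library(dplyr)

# Hypothetical data: swim_hours is constant within each group
data <- data.frame(
  penguin    = c("p1", "p1", "p2", "p2"),
  trial      = c(1, 1, 1, 1),
  datetime   = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:30:00",
                            "2020-01-01 11:00:00", "2020-01-01 12:00:00"),
                          tz = "UTC"),
  swim_hours = c(2L, 2L, 3L, 3L)
)

trial_swim <- data %>%
  group_by(penguin, trial) %>%
  summarise(
    # elapsed minutes between the first and last record of the group
    swim_t     = as.numeric(difftime(datetime[n()], datetime[1], units = "mins")),
    # carry the untouched column along (assumes one value per group)
    swim_hours = first(swim_hours),
    .groups    = "drop"
  )

trial_swim$swim_t
# 30 60
```

If swim_hours can vary within a group, adding it to group_by() instead would keep every distinct value.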

Error using dplyr package in R

I am using the below code to extract the summary of data with respect to column x by counting the values in column x from the dataset unique_data and arranging the count values in descending order.
unique_data %>%
group_by(x) %>%
arrange(desc(count(x)))
But when I execute the above code, I get the following error message:
Error: no applicable method for 'group_by_' applied to an object of class "character"
Kindly let me know what is going wrong in my code. For your information, the column x is of character data type.
Regards,
Mohan
The problem is that count is wrapped inside arrange. count() expects a data frame as its first argument, but in arrange(desc(count(x))) it receives the character column x, which produces the 'group_by_' error. Keeping the rest of the OP's code, split count and arrange into two separate pipe steps. The output of count is a frequency column 'n' (by default), which we then arrange in descending (desc) order.
unique_data %>%
group_by(x) %>%
count(x) %>%
arrange(desc(n))
Also, the group_by is not needed. According to the ?count documentation:
tally is a convenient wrapper for summarise that will either call n or
sum(n) depending on whether you're tallying for the first time, or
re-tallying. count() is similar, but also does the group_by for you.
So based on that, we can just do
count(unique_data, x) %>%
arrange(desc(n))
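As a further shortcut, count() has a sort argument that orders the result by descending frequency, so the separate arrange step can be dropped. A small sketch with invented data:

```r
library(dplyr)

# Hypothetical character column
unique_data <- data.frame(x = c("b", "a", "b", "c", "b", "a"),
                          stringsAsFactors = FALSE)

# sort = TRUE puts the most frequent values first
counts <- unique_data %>% count(x, sort = TRUE)

counts$x
# "b" "a" "c"
counts$n
# 3 2 1
```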

Error dplyr summarise

I have a data.frame:
set.seed(1L)
vector <- data.frame(patient=rep(1:5,each=2),medicine=rep(1:3,length.out=10),prob=runif(10))
I want to get the mean of the "prob" column while grouping by patient. I do this with the following code:
vector %>%
group_by(patient) %>%
summarise(average=mean(prob))
This code works perfectly. However, I need to get the same values without using the word "prob" on the "summarise" line. I tried the following code, but it gives me a data.frame in which the column "average" holds 5 identical values, which is not what I want:
vector %>%
group_by(patient) %>%
summarise(average=mean(vector[,3]))
P.S.: To explain why I need this: I have another data frame with multiple columns with complex names that all need to be "summarised", so I can't list them one by one in the summarise command. What I want is to pass a vector of column names there to calculate the mean of each column, grouped by patient.
It appears you want summarise_each (note that summarise_each and funs are deprecated in recent dplyr releases):
vector %>%
  group_by(patient) %>%
  summarise_each(funs(mean), matches('prob'))
Using data.table you could do
library(data.table)
setDT(vector)[, lapply(.SD, mean), by = patient, .SDcols = 'prob']
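In dplyr 1.0.0 and later, the same idea can be sketched with across(), which accepts a character vector of column names through all_of(). The vector `cols_to_average` is a hypothetical name introduced here for illustration:

```r
library(dplyr)

set.seed(1L)
vector <- data.frame(patient  = rep(1:5, each = 2),
                     medicine = rep(1:3, length.out = 10),
                     prob     = runif(10))

# Hypothetical: names of the columns to average, held as strings
cols_to_average <- c("prob")

# across() applies mean to every listed column within each patient group
res <- vector %>%
  group_by(patient) %>%
  summarise(across(all_of(cols_to_average), mean))
```

The result has one row per patient and one averaged column per name in cols_to_average, so more columns can be added to the character vector without touching the summarise line.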
