Applying own function on rows by dplyr - r

I am new to R and have searched a lot for a solution, so I need your help :).
I am trying to write code that creates a new column of summarised values from the same table, subject to some conditions.
library(tidyverse)
set.seed(1)
a <- data.frame(weeks = 1:52,
                index = sample(1:3, 52, replace = TRUE),
                factory = sample(c('A', 'B'), 52, replace = TRUE),
                qnt = sample(1:10, 52, replace = TRUE))
a
qnt_sum <- function(x, y, z) {
  a %>% filter(index == x & factory == z) %>%
    filter(weeks > (y - 4) & weeks <= y) %>%
    summarise(suma = sum(qnt))
}
a %>%
  mutate(sum_qnt = lapply(index, qnt_sum, weeks, factory))
qnt_sum(2, 5, 'B')
Calling the function directly works, but when I apply it inside mutate I get only errors. With this particular code:
Error: Result must have length 16, not 52
I have tried many variations of this code and got a lot of different errors. I have a feeling that my approach to the problem is wrong.
(Image in the original post: a sample of the expected values.)

This might work for you:
a %>% mutate(sum_qnt=mapply(qnt_sum, index, weeks, factory))
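Note that qnt_sum() as written returns a one-row data frame from summarise(), so mapply() will likely give back a list column rather than plain numbers. A minimal sketch of one way around that, assuming a recent dplyr; the name qnt_sum_num is hypothetical and not from the original post:

```r
library(dplyr)

set.seed(1)
a <- data.frame(
  weeks   = 1:52,
  index   = sample(1:3, 52, replace = TRUE),
  factory = sample(c("A", "B"), 52, replace = TRUE),
  qnt     = sample(1:10, 52, replace = TRUE)
)

# Hypothetical variant of qnt_sum() that returns a bare number
# instead of a one-row data frame
qnt_sum_num <- function(x, y, z) {
  a %>%
    filter(index == x, factory == z, weeks > (y - 4), weeks <= y) %>%
    summarise(suma = sum(qnt)) %>%
    pull(suma)
}

# mapply() walks the three columns in parallel, one row at a time
result <- a %>%
  mutate(sum_qnt = mapply(qnt_sum_num, index, weeks, factory))
```

Because each call filters the whole table again, this is fine for 52 rows but may be slow on large data.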

Related

Sum of selected columns works on subset of data but not full data set

I have a large R data set with over 90K observations and 400 variables representing patient diagnoses. I want to calculate the sum of the values in selected columns (named Code1 through Code200) and store the value in a new column (mytotal). The code below works when I run it with a subset (around 2K) of the observations.
mysubset <- mysubset %>%
  mutate(mytotal = select(., Code1:Code200) %>%
           rowSums(na.rm = TRUE))
However, when I try to run the same code on the full (90K observations, same dataframe structure) dataframe, I get an error:
Adding missing grouping variables: patient_num
Error in mutate():
! Problem while computing utils = select(., Code1:Code200) %>% rowSums(na.rm = TRUE).
✖ utils must be size 1, not 92574.
ℹ The error occurred in group 1: patient_num = 123456789.
I've searched online for hours to try to resolve the problem or to find an alternative solution, with no luck. If anyone has insights, I'd really appreciate them. Thank you.
Update: Just to save anyone else the hours I wasted trying to figure out the problem, it finally occurred to me to compare the subset and the full data set using class(). It turns out that the full data set had been saved as a grouped dataframe. Once I used ungroup(), the original code worked on the full data set. Apologies for the newbie distress call and thanks for the helpful responses!
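A small sketch of the check and the fix described in the update, assuming dplyr 1.0 or later. The three-row data frame is made up for illustration, and rowSums(across(...)) is swapped in for the select(., ...) form used in the question:

```r
library(dplyr)

# Toy stand-in for the patient data (values are invented)
df <- data.frame(
  patient_num = c(1, 1, 2),
  Code1       = c(1, NA, 3),
  Code2       = c(4, 5, 6)
)

# A grouped data frame looks the same when printed casually,
# but is_grouped_df() reveals the grouping
grouped <- group_by(df, patient_num)
was_grouped <- is_grouped_df(grouped)

# Dropping the grouping first lets the row sums span the whole table
fixed <- grouped %>%
  ungroup() %>%
  mutate(mytotal = rowSums(across(Code1:Code2), na.rm = TRUE))
```

Here fixed$mytotal comes out as 5, 5, 9.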
Here's a tidyverse approach, where we could take just the columns we want and reshape them into longer data, which will be simpler to sum.
set.seed(42)
df <- matrix(rnorm(9E4*400), nrow= 9E4) |> as.data.frame()
library(tidyverse)
df_sums <- df %>%
  mutate(row = row_number()) %>%
  select(row, V1:V200) %>%
  pivot_longer(-row) %>%
  count(row, wt = value, name = "mytotal")
df %>%
  bind_cols(df_sums)

How can I add columns without having to type out all of the column names?

Let's say I have 10 columns, and I want to add an 11th column that sums columns 1-6 for each row. How can I do this? I saw this on another answer:
data$newCol <- sum(data[1:6])
But that resulted in a single number for all rows in newCol, which isn't what I'm trying to do. The only way I know how to do this is like this:
data$newCol <- data$colA + data$colB + data$colC
and so on, but this gets tedious when I'm working with more than just a few columns. Is there a shortcut, like using [1:6] somehow? I'm sure this is a beginner question. I tried searching but didn't see an answer that made sense to me, sorry.
Thank you!
You can try the apply function:
data$newCol <- apply(data[,1:6], 1, sum, na.rm=TRUE)
This may help, if I understood what you had in mind:
library(dplyr)
data %>%
  rowwise() %>%
  mutate(new_col = sum(c_across(col1:col6), na.rm = TRUE))
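For the record, sum(data[1:6]) collapses the whole block of columns into one total, which is why every row got the same number. A vectorised base R alternative to the two answers above is rowSums(); the small data frame here is made up purely for illustration:

```r
# Toy data: 2 rows, 10 columns named V1..V10 (invented values)
data <- as.data.frame(matrix(1:20, nrow = 2))

# sum() over a data frame adds up every cell into a single number
grand_total <- sum(data[1:6])

# rowSums() returns one total per row, selecting columns by position
data$newCol <- rowSums(data[1:6], na.rm = TRUE)
```

With this toy data, grand_total is 78 while data$newCol is 36 and 42.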

Mean from multiple Columns (Error Message)

I'm still fairly new to R and have been practicing a bit lately.
I have the following (simplified) Data Set:
It's basically a questionnaire asking random people to rate their preference for each of these cities on a scale of 1-7.
I would like to find out which city has the highest average preference.
So what I first did was: mean(dataset[, 3], na.rm=TRUE) to find out the average preference for Prag. That worked!
Now I wanted to create a table which shows me every mean of each city.
My thought was: table(mean(dataset[3:8], na.rm=TRUE))
However, all I get is the following Error Message:
In mean.default(umfrage[37:38], na.rm = TRUE) :
argument is not numeric or logical: returning NA
Does someone know what that means and how I could achieve the result?
I figured it out.
I simply used this function: lapply(dataset[3:8], mean, na.rm = TRUE)
You could also use the dplyr and tidyr packages (both are part of the tidyverse):
library(tidyverse)
result <- dataset %>%
  gather("city", "value", Pref_Prague:Pref_London) %>%
  group_by(city) %>%
  summarise(mean = mean(value, na.rm = TRUE))
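The warning in the question appears because mean() expects a vector, while dataset[3:8] is still a data frame. A base R sketch of the column-by-column approach that also answers the original goal of finding the top city; the dataset below and its column names are invented, since the real data was not shown:

```r
# Hypothetical questionnaire data; only columns 3 and 4 hold ratings
dataset <- data.frame(
  id          = 1:3,
  age         = c(20, 30, 40),
  Pref_Prague = c(7, 5, NA),
  Pref_London = c(3, 4, 5)
)

# sapply() applies mean() to each column and returns a named vector
city_means <- sapply(dataset[3:4], mean, na.rm = TRUE)

# Name of the column with the highest average rating
top_city <- names(which.max(city_means))
```

sapply() is used here instead of lapply() only because a named numeric vector is easier to pass to which.max() than a list.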

Filtering one dataframe by a multiple columns in another

Sorry if this is a silly question!
My aim is basically the same as in this post: Take dates from one dataframe and filter data in another dataframe - R. I would like to keep using dplyr, as I am later going to run this code across all rows of my dataset using rowwise().
However, in my case I wish to take the 'start' and 'end' years from 2 different columns in the second dataframe.
Here's some dummy data (taken from the original post and adapted to my problem):
main_data = data.frame(year=c(1966:2017))
second_data = data.frame(Participant = c(1:6),
Start_year = c(2012,1994,1974,1983,1969,2002),
End_year = c(2017,2017,2017,2017,2017,2017))
I wrote this code based on the original post:
filtered.total =
  main_data %>%
  rowwise() %>%
  mutate(year = any(year >= second_data$Start_year & year <=
                      second_data$End_year)) %>%
  filter(year) %>%
  data.frame()
I'm also filtering my data by location (country and county), but it just gives me the following error message for my dataset:
Error in filter_impl(.data, quo) : Result must have length 2299, not 0
and for the dummy data above:
In year <= second_data$End_year :
longer object length is not a multiple of shorter object length
Thanks for any help - quite new to R and my PhD is testing my minimal knowledge right now!
You might need to use min(second_data$Start_year) and max(second_data$End_year): at the moment you're providing many values to compare against, and I think it's complaining about that.
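A rough sketch of that suggestion on the dummy data from the question, assuming the intent is to keep every year that falls inside at least one participant's range. The names filtered_total, in_any_range and filtered_any are illustrative only:

```r
library(dplyr)

main_data <- data.frame(year = 1966:2017)
second_data <- data.frame(
  Participant = 1:6,
  Start_year  = c(2012, 1994, 1974, 1983, 1969, 2002),
  End_year    = c(2017, 2017, 2017, 2017, 2017, 2017)
)

# Option 1: compare against the overall earliest start and latest end
filtered_total <- main_data %>%
  filter(year >= min(second_data$Start_year),
         year <= max(second_data$End_year))

# Option 2: check each year against every participant's own range,
# without overwriting the year column
in_any_range <- sapply(main_data$year, function(y) {
  any(y >= second_data$Start_year & y <= second_data$End_year)
})
filtered_any <- main_data[in_any_range, , drop = FALSE]
```

The two options agree here only because the ranges overlap into one continuous span (1969 to 2017). If there were gaps between ranges, only the second option would exclude the years in the gaps.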

Simplify Repeatable Code in R by passing text to a function to be used as an argument

I've looked around StackOverflow for an answer here, but I think I may be missing a term. Here's the scenario:
I have a large data set with multiple groups that I want to report on. Let's say that this data set has answers to certain questions as columns, and I want to take specific columns and responses, group the answers, and perform counts. Essentially, I have a dplyr filter expression that would look like this:
z <- results %>% filter(AgeGroup %in% c("16-20", "21-25", "26-30")) %>%
group_by(AgeGroup) %>% summarize(ageCount=n())
Then I generate a table with the results using xtable() and dump them in my Rmarkdown document. What I'd like to do is create a function that can do this, such that I can do the following
resultPrint <- function(qualifier, groupColumn) {
  return(results %>% filter(qualifier) %>%
    group_by(groupColumn) %>% summarize(count=n()))
}
resultPrint("AgeGroup %in% c(\"16-20\", \"21-25\", \"26-30\")", "AgeGroup")
Or some equivalent.
Is there a way to do this in R? It would simplify a lot of code I am writing if I could. Thanks!
Thank you to r2evans! Here's my solution:
resultPrint <- function(qualifier, groupColumn) {
  return(results %>% filter_(qualifier) %>%
    group_by_(.dots = groupColumn) %>% summarize(count=n()))
}
filterClause = quote(AgeGroup %in% c("16-20", "21-25", "26-30"))
stuff <- resultPrint(filterClause, quote(AgeGroup))
Thank you!!
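filter_() and group_by_() have been deprecated since dplyr 0.7, so the solution above may warn on current versions. A sketch of the same idea with the embrace operator {{ }}, assuming dplyr 1.0 or later; result_print, its data argument, and the toy results data are illustrative, not from the original post:

```r
library(dplyr)

# Hypothetical rewrite: the filter condition and grouping column are
# passed as bare expressions and forwarded with {{ }}
result_print <- function(data, qualifier, group_column) {
  data %>%
    filter({{ qualifier }}) %>%
    group_by({{ group_column }}) %>%
    summarize(count = n(), .groups = "drop")
}

# Toy survey data (invented values)
results <- data.frame(AgeGroup = c("16-20", "16-20", "21-25", "31-35"))

stuff <- result_print(
  results,
  AgeGroup %in% c("16-20", "21-25", "26-30"),
  AgeGroup
)
```

Passing the data frame as an argument, rather than reading results from the global environment, is a design choice made here so the function can be reused on other tables.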
