I'm still fairly new to R and have been practicing a bit lately.
I have the following (simplified) data set:
It's basically a questionnaire asking random people to rate how much they like each of these cities on a scale from 1 to 7.
I would like to find out which city has the highest average preference.
So what I did first was mean(dataset[, 3], na.rm=TRUE) to find the average preference for Prague. That worked!
Next I wanted a table showing the mean for each city.
My thought was: table(mean(dataset[3:8], na.rm=TRUE))
However, all I get is the following Error Message:
In mean.default(umfrage[37:38], na.rm = TRUE) :
argument is not numeric or logical: returning NA
Does someone know what that means and how I could achieve the result?
I figured it out.
I simply used this function: lapply(dataset[3:8], mean, na.rm = TRUE)
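For anyone hitting the same warning: mean() expects a single numeric vector, so handing it several columns of a data frame at once returns NA. Applying it column by column is what fixes it. Below is a minimal base R sketch on an invented data set (Pref_Berlin, the id and age columns, and all values are hypothetical), which also picks out the city with the highest average:

```r
# Hypothetical stand-in for the questionnaire data; all values are invented.
dataset <- data.frame(
  id          = 1:4,
  age         = c(23, 35, 41, 29),
  Pref_Prague = c(7, 5, NA, 6),
  Pref_Berlin = c(4, NA, 3, 5),
  Pref_London = c(6, 6, 7, NA)
)

# mean() needs one numeric vector at a time, so apply it to each column.
city_means <- sapply(dataset[3:5], mean, na.rm = TRUE)

# colMeans() computes the same per-column means in a single call.
city_means_alt <- colMeans(dataset[3:5], na.rm = TRUE)

# Name of the column with the highest average preference.
best_city <- names(which.max(city_means))
```

sapply() returns a named numeric vector rather than the list that lapply() gives, which is usually easier to sort or pass to which.max().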
You could also use the dplyr and tidyr packages (both are part of the tidyverse):
library(tidyverse)
result <- dataset %>%
  gather("city", "value", Pref_Prague:Pref_London) %>%
  group_by(city) %>%
  # na.rm = TRUE because the preference columns contain missing values
  summarise(mean = mean(value, na.rm = TRUE))
I have a large R data set with over 90K observations and 400 variables representing patient diagnoses. I want to calculate the sum of the values in selected columns (named Code1 through Code200) and store the value in a new column (mytotal). The code below works when I run it with a subset (around 2K) of the observations.
mysubset <- mysubset %>%
  mutate(mytotal = select(., Code1:Code200) %>%
           rowSums(na.rm = TRUE))
However, when I try to run the same code on the full (90K observations, same dataframe structure) dataframe, I get an error:
Adding missing grouping variables: patient_num
Error in mutate():
! Problem while computing utils = select(., Code1:Code200) %>% rowSums(na.rm = TRUE).
✖ utils must be size 1, not 92574.
ℹ The error occurred in group 1: patient_num = 123456789.
I've searched online for hours to try to resolve the problem or to find an alternative solution, with no luck. If anyone has insights, I'd really appreciate them. Thank you.
Update: Just to save anyone else the hours I wasted trying to figure out the problem, it finally occurred to me to compare the subset and the full data set using class(). It turns out that the full data set had been saved as a grouped dataframe. Once I used ungroup(), the original code worked on the full data set. Apologies for the newbie distress call and thanks for the helpful responses!
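To make the update concrete, here is a small sketch of the same situation on invented data (the patients data frame, its values, and the three Code columns are hypothetical stand-ins for the real table). It assumes dplyr is installed, checks whether the data frame is grouped, and drops the grouping before summing:

```r
library(dplyr)

# Hypothetical toy data standing in for the patient table; values are invented.
patients <- data.frame(
  patient_num = c(101, 102, 103),
  Code1 = c(1, NA, 3),
  Code2 = c(4, 5, NA),
  Code3 = c(0, 1, 1)
)

# Simulate a data frame that was saved while still grouped.
grouped <- group_by(patients, patient_num)
was_grouped <- is_grouped_df(grouped)

# Drop the grouping first, then sum the selected columns row by row.
result <- grouped %>%
  ungroup() %>%
  mutate(mytotal = rowSums(select(., Code1:Code3), na.rm = TRUE))
```

The "Adding missing grouping variables" message in the error output is a useful hint that the input is grouped; class() or is_grouped_df() confirms it.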
Here's a tidyverse approach, where we could take just the columns we want and reshape them into longer data, which will be simpler to sum.
set.seed(42)
df <- matrix(rnorm(9E4*400), nrow = 9E4) |> as.data.frame()
library(tidyverse)
df_sums <- df %>%
  mutate(row = row_number()) %>%
  select(row, V1:V200) %>%
  pivot_longer(-row) %>%
  count(row, wt = value, name = "mytotal")
df %>%
  bind_cols(df_sums)
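If reshaping to long format feels heavy for a table this size, a plain base R row sum over the wanted columns is another option. This is only a sketch on a small invented matrix (the name small and the choice of V1 to V4 are assumptions for illustration), not a benchmark:

```r
set.seed(42)

# Small stand-in for the large data frame; columns are named V1..V6.
small <- as.data.frame(matrix(rnorm(5 * 6), nrow = 5))

# Build the names of the columns to sum, then add them up row by row.
wanted <- paste0("V", 1:4)
small$mytotal <- rowSums(small[wanted], na.rm = TRUE)
```

On the real data the same idea would use paste0("Code", 1:200), assuming the columns are named exactly Code1 through Code200.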
I'm trying to find the mean and standard deviation for C and P separately.
I have toyed around with this so far:
C <- rowMeans(dplyr::select(total, C1:41), na.rm=TRUE)
This didn't yield what I needed it to.
Then I thought about just using the summary, but again it didn't give me what I needed.
So then I thought of using na.omit:
Of course though, this would take out all of the data since I have NAs throughout the dataframe.
What am I missing here? Is this a matter of aggregating my data into certain groups?
I know describeBy could force these descriptives, but again I'm not sure how to do that.
So, I think the angle I want to take is to order these, then aggregate and find totals, and then find the descriptives using describeBy in order to avoid NAs. I'm stuck though. Where am I going wrong?
Try this:
library(dplyr)
total %>%
  # Select only columns that have S in their name,
  # i.e. SP and SC
  select(starts_with('S')) %>%
  # Get the data in long format, remove NA values
  tidyr::pivot_longer(cols = everything(), values_drop_na = TRUE) %>%
  # Create a group for each participant
  group_by(grp = c('Participant1', 'Participant2')[grepl('C\\d+', name) + 1]) %>%
  # Take mean and standard deviation for each group
  summarise(mean = mean(value), sd = sd(value))
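The same idea can be sketched in base R: pool every value from the columns of one group into a single vector and let na.rm = TRUE skip the gaps, so no rows need to be dropped with na.omit(). The data frame, its values, and the helper name pooled_stats below are all hypothetical; only the SC/SP column prefixes are taken from the answer above.

```r
# Hypothetical data with NAs scattered through both column groups.
total <- data.frame(
  SC1 = c(1, NA, 3),
  SC2 = c(2, 4, NA),
  SP1 = c(5, 6, NA),
  SP2 = c(NA, 7, 8)
)

# Pool all values from columns whose names start with `prefix`,
# then compute mean and sd while ignoring missing values.
pooled_stats <- function(df, prefix) {
  values <- unlist(df[startsWith(names(df), prefix)], use.names = FALSE)
  c(mean = mean(values, na.rm = TRUE), sd = sd(values, na.rm = TRUE))
}

stats_c <- pooled_stats(total, "SC")
stats_p <- pooled_stats(total, "SP")
```

This treats every non-missing cell as one observation; if per-participant averages are wanted first, rowMeans() on each column group would be the step to add.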
I am a newbie in R. I have searched a lot for a solution and need your help :).
I am trying to write code that creates a new column of summarised values from the same table, subject to some conditions.
library(tidyverse)
set.seed(1)
a <- data.frame(
  weeks = 1:52,
  index = sample(1:3, 52, replace = TRUE),
  factory = sample(c('A', 'B'), 52, replace = TRUE),
  qnt = sample(1:10, 52, replace = TRUE)
)
a
qnt_sum <- function(x, y, z) {
  a %>%
    filter(index == x & factory == z) %>%
    filter(weeks > (y - 4) & weeks <= y) %>%
    summarise(suma = sum(qnt))
}
a %>%
  mutate(sum_qnt = lapply(index, qnt_sum, weeks, factory))
qnt_sum(2,5,'B')
It works when called directly, but when I apply it inside mutate I only get errors. With this particular code:
Error: Result must have length 16, not 52
I have tried many variations of this code and got a lot of different errors. I have a feeling that my approach to the problem is wrong.
expected values sample
This might work for you:
a %>% mutate(sum_qnt=mapply(qnt_sum, index, weeks, factory))
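The reason lapply() fails here is that it iterates over index only and passes the whole weeks and factory vectors to every call, whereas mapply() walks the three vectors in parallel, one row at a time. Here is a base R sketch of the same rolling sum, using a hypothetical helper qnt_sum_scalar that returns a plain number instead of a one-row data frame, so the new column is an ordinary numeric vector:

```r
set.seed(1)
a <- data.frame(
  weeks   = 1:52,
  index   = sample(1:3, 52, replace = TRUE),
  factory = sample(c("A", "B"), 52, replace = TRUE),
  qnt     = sample(1:10, 52, replace = TRUE)
)

# Hypothetical scalar version of qnt_sum: total qnt for one index/factory
# pair over the four weeks ending at week y (weeks y-3 .. y).
qnt_sum_scalar <- function(x, y, z) {
  keep <- a$index == x & a$factory == z & a$weeks > (y - 4) & a$weeks <= y
  sum(a$qnt[keep])
}

# mapply() calls the helper once per row with matching elements
# of index, weeks and factory.
a$sum_qnt <- mapply(qnt_sum_scalar, a$index, a$weeks, a$factory)
```

Each row falls inside its own window, so sum_qnt is never smaller than that row's qnt, which makes for a quick sanity check.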
Quick question. I read my csv file into the variable data. It has a column labelled var, which has numerical values.
When I run the command
sd(data$var)
I get
[1] NA
instead of my standard deviation.
Could you please help me figure out what I am doing wrong?
Try sd(data$var, na.rm=TRUE); any NAs in the column var will then be ignored. It also pays to check your data to make sure the NAs should really be NAs and that nothing went wrong when the file was read in. Commands like head(data), tail(data), and str(data) should help with that.
I've made the mistake a time or two of reusing variable names in dplyr chains, which has caused issues.
mtcars %>%
  group_by(gear) %>%
  mutate(ave = mean(hp)) %>%
  ungroup() %>%
  group_by(cyl) %>%
  summarise(med = median(ave),
            ave = mean(ave), # should've named this variable something different
            sd = sd(ave))    # this is the sd of my newly created variable "ave", not the original one.
You probably have missing values in var, or the column is not numeric, or there's only one row.
Try removing missing values which will help for the first case:
sd(dat$var, na.rm = TRUE)
If that doesn't work, check that
class(dat$var)
is "numeric" (the second case) and that
nrow(dat)
is greater than 1 (the third case).
Finally, data is a function in R so best to use a different name, which I've done here.
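The checks above can be run in one go. Here is a small sketch on invented data (dat and its values are hypothetical) that reproduces the NA result and then walks through the three cases:

```r
# Hypothetical data with one missing value in the column of interest.
dat <- data.frame(var = c(2, 4, NA, 8))

# A single NA is enough to make the result NA.
sd_raw <- sd(dat$var)

# Case 1: ignore missing values.
sd_clean <- sd(dat$var, na.rm = TRUE)

# How many values are missing?
n_missing <- sum(is.na(dat$var))

# Case 2: the column should be numeric, not character or factor.
col_class <- class(dat$var)

# Case 3: sd() needs at least two non-missing observations.
n_usable <- sum(!is.na(dat$var))
```

If col_class comes back as "character", the file was probably read with stray text in the column, and that is worth fixing at the import step rather than with as.numeric() afterwards.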
There may be Inf or -Inf values in the data.
Try
is.finite(data$var)
or
min(data$var, na.rm = TRUE)
max(data$var, na.rm = TRUE)
to check whether that is indeed the case.
I have a dataframe so when I try to calculate the mean of column A I just write
mean(df$A)
and it works fine.
But when I try to calculate mean of only part of the data frame I get an error saying it isn't a number or logical value
df$A %>% filter(A=="some value") %>% mean(df$A)
The type of A is double. I also tried to convert it to numeric using
df$A <- as.numeric(as.character(df$A))
but it didn't work.
It would be best to provide an example of your column A.
However, just looking at your question, the problem is in your magrittr/dplyr syntax.
base syntax:
mean(df$A[df$A == 'some value'])
dplyr with pipes:
df %>% filter(A==2) %>% summarise(., average = mean(A))
Careful with syntax and pipes, more info here.
Try df %>% filter(A == "some value") %>% summarise(mean(A)).
Note that the mean of A will simply be that value, because the filter keeps only rows where A equals it.
Also, mean() works fine with objects of class double.
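To spell out what went wrong in the question: df$A %>% filter(...) pipes a bare vector into filter(), which expects a data frame, and the trailing mean(df$A) receives the piped value as its first argument with df$A as a second one. A minimal base R sketch of the usual pattern, filtering on one column and averaging another, on invented data (column B and all values are hypothetical):

```r
# Hypothetical data: filter on A, average B.
df <- data.frame(
  A = c(1, 2, 2, 3),
  B = c(10, 20, 30, 40)
)

# Mean of B over the rows where A equals 2.
mean_b_when_a_is_2 <- mean(df$B[df$A == 2])

# Averaging the filter column itself just returns the filter value.
mean_a_when_a_is_2 <- mean(df$A[df$A == 2])
```

The second result shows why averaging the same column that was filtered on an exact value is rarely informative.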