I have a dataframe:
sex age
f 10
m 12
m 11
m 17
f 13
f 12
I 8
What I want is to calculate the mean of age per sex:
f=> mean age = (10+13+12) /3
m=> mean age = (12+11+17) /3
I=> mean age = 8
I am trying something like this:
combine(df, :age => mean => :mean_age, :sex => unique)
But all mean_age have the same value.
Use groupby first:
combine(groupby(df, :sex), :age => mean => :mean_age)
Or, using DataFramesMeta.jl:
@chain df begin
    groupby(:sex)
    @combine(:mean_age = mean(:age))
end
Here is a representation of my dataset
mydata<-data.frame(ID=1:37,var=c(rep("A",12),rep("B",8),rep("C",17)))
I calculated the frequency of each modality of the variable var
library(tableone)
CreateTableOne(data = mydata["var"])
What I want is to sort the frequencies in decreasing way, like below:
Overall
n 37
var (%)
C 17 (45.9)
A 12 (32.4)
B 8 (21.6)
Change the factor levels to be in decreasing order of frequency.
library(tableone)
mydata2 <- transform(mydata, var = factor(var, names(sort(-table(var)))))
CreateTableOne("var", data = mydata2)
giving:
Overall
n 37
var (%)
C 17 (45.9)
A 12 (32.4)
B 8 (21.6)
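The reordering step can be checked on its own, without tableone. The following is a minimal sketch in base R that reuses the mydata definition from the question; the name freq is introduced here only for illustration.

```r
# Same example data as in the question
mydata <- data.frame(ID = 1:37, var = c(rep("A", 12), rep("B", 8), rep("C", 17)))

# Frequencies sorted from most to least common (freq is a hypothetical helper name)
freq <- sort(table(mydata$var), decreasing = TRUE)

# Rebuild the factor so that its levels follow that order
mydata2 <- transform(mydata, var = factor(var, levels = names(freq)))

levels(mydata2$var)
table(mydata2$var)
```

Functions that print categories in level order should then list them by decreasing frequency.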
I'm trying to subset data based on a conditional statement of a column that has blank values which means the employee logged in multiple times on a work order. An example data set is shown below:
employee_name <- c("Person A","Person A","Person A","Person A","Person A", "Person B","Person B","Person B")
work_order <- c("WO001","WO001","WO001","WO002","WO003","WO001","WO003", "WO003")
num_of_points <- c(40,"","",64,25,20,68,"")
time <- c(10, 30, 15, 20, 25, 5, 15, 30)
final_summary <- data.frame(employee_name,work_order,num_of_points, time)
View(final_summary)
Input
Basically, I want to sum up the points and time by selecting all rows with points > 30, then grouped by Employee Name and Work Order which should return this:
Output
I can do the summarize function properly, but when I perform the initial subset, it excludes the blank rows for num_of_points and thus does not compute all the adjacent time (in minutes) values. This makes sense because subset(num_of_points > 30) only finds anything greater than 30. How can I tweak this to include the blank rows so I can successfully filter the data in order to compute the sum of time accurately, grouped by unique work order and employee name?
Convert num_of_points to numeric, group by 'employee_name' and 'work_order', take the sum of 'num_of_points' where it is greater than 30 along with the sum of 'time', then filter out the rows where the summed 'num_of_points' is 0:
library(dplyr)
final_summary %>%
mutate(num_of_points = as.numeric(num_of_points)) %>%
group_by(employee_name, work_order) %>%
summarise(num_of_points = sum(num_of_points[num_of_points> 30],
na.rm = TRUE), time = sum(time)) %>%
filter(num_of_points > 0)
# A tibble: 3 x 4
# Groups: employee_name [2]
# employee_name work_order num_of_points time
# <chr> <chr> <dbl> <dbl>
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45
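To see why the blank entries do not disturb the sum of points, here is a small sketch of what happens within one group (Person A, WO001). The names pts, keep and total are hypothetical and used only for this illustration.

```r
# Points for Person A / WO001 as stored in the example data (two blanks)
pts <- as.numeric(c("40", "", ""))   # blank strings become NA

# The comparison yields TRUE, NA, NA; indexing with it gives 40, NA, NA
keep <- pts > 30

# na.rm = TRUE drops the NA entries, leaving only the 40
total <- sum(pts[keep], na.rm = TRUE)
total
```

Because time is summed without any filter, every row of the group still contributes its minutes.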
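In base R you can do the following. Note that subset drops the blank rows before aggregating, so time is summed only over the rows with more than 30 points: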
aggregate(.~employee_name + work_order, type.convert(final_summary), sum, subset = num_of_points>30)
employee_name work_order num_of_points time
1 Person A WO001 40 10
2 Person A WO002 64 20
3 Person B WO003 68 15
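You can aggregate num_of_points and time separately and merge the results. This assumes num_of_points has first been converted to numeric, since sum() fails on a character column.
final_summary$num_of_points <- as.numeric(final_summary$num_of_points)
merge(aggregate(num_of_points~employee_name + work_order, final_summary,
sum, subset = num_of_points>30),
aggregate(time~employee_name + work_order, final_summary, sum))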
# employee_name work_order num_of_points time
#1 Person A WO001 40 55
#2 Person A WO002 64 20
#3 Person B WO003 68 45
I am trying the following code for a barplot (L$neighbourhood is the apartment neighbourhood in Paris, for example Champs-Elysées or Batignolles, which is string data, and L$price is the numeric apartment price):
barplot(L$neighbourhood, L$price, main = "TITLE", xlab = "Neighbourhood", ylab = "Price")
But, I get an error:
Error in barplot.default(L$neighbourhood, L$price, main = "TITLE",
xlab = "Neighbourhood", : 'height' must be a vector or a matrix
Can we not use string data as input to the barplot function in R? How can I fix this error?
It is quite unclear what you want to plot. Let's assume you want to see the average price per neighborhood. If that's what you're after, you can proceed like this.
First some illustrative data:
set.seed(123)
Neighborhood <- sample(LETTERS[1:4], 10, replace = TRUE)
Price <- sample(10:100, 10, replace = TRUE)
df <- data.frame(Neighborhood, Price)
df
Neighborhood Price
1 C 23
2 C 34
3 C 99
4 B 100
5 C 78
6 B 100
7 B 66
8 B 18
9 C 81
10 A 35
Now compute the averages by neighborhood using the function aggregate and store the result in a new dataframe:
df_new <- aggregate(x = df$Price, by = list(df$Neighborhood), function(x) mean(x))
df_new
Group.1 x
1 A 35
2 B 71
3 C 63
And finally you can plot the average prices in variable x and add the neighborhood names from the Group.1 column:
barplot(df_new$x, names.arg = df_new$Group.1)
An even simpler solution is this, using tapply and mean:
df_new <- tapply(df$Price, df$Neighborhood, mean)
barplot(df_new, names.arg = names(df_new))
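As a quick check of the tapply approach, the sketch below hard-codes the ten rows printed above rather than sampling them, so the result does not depend on the random number generator. The names prices and avg are assumptions made for this example.

```r
# The ten rows shown in the illustrative data above, typed in directly
prices <- data.frame(
  Neighborhood = c("C", "C", "C", "B", "C", "B", "B", "B", "C", "A"),
  Price        = c(23, 34, 99, 100, 78, 100, 66, 18, 81, 35)
)

# tapply returns a numeric vector named by neighborhood
avg <- tapply(prices$Price, prices$Neighborhood, mean)
avg

# barplot needs the numeric heights; the names become the bar labels
barplot(avg, main = "Average price", xlab = "Neighborhood", ylab = "Price")
```

The point for the original error is that the first argument of barplot must be numeric (the heights), while the strings only serve as labels.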
This is the first time I have asked a question on Stack Overflow. I have tried searching for the answer but I cannot find exactly what I am looking for. I hope someone can help.
I have a huge data set of 20416 observations. Basically, I have 83 subjects and for each subject I have several observations. However, the number of observations per subject is not the same (e.g. subject 1 has 256 observations, while subject 2 has only 64 observations).
I want to add an extra column containing the mean of the observations for each subject (the observations are reading times (RT)).
I tried with the aggregate function:
aggregate (RT ~ su, data, mean)
This formula returns the correct mean per subject. But then I cannot simply do the following:
data$mean <- aggregate (RT ~ su, data, mean)
as R returns this error:
Error in $<-.data.frame(tmp, "mean", value = list(su = 1:83, RT
= c(378.1328125, : replacement has 83 rows, data has 20416
I understand that the formula lacks a command specifying that the mean for each subject has to be repeated for all the subject's rows (e.g. if subject 1 has 256 rows, the mean for subject 1 has to be repeated for 256 rows, if subject 2 has 64 rows, the mean for subject 2 has to be repeated for 64 rows and so forth).
How can I achieve this in R?
The data.table syntax lends itself well to this kind of problem:
Dt[, Mean := mean(Value), by = "ID"][]
# ID Value Mean
# 1: a 0.05881156 0.004426491
# 2: a -0.04995858 0.004426491
# 3: b 0.64054432 0.038809830
# 4: b -0.56292466 0.038809830
# 5: c 0.44254622 0.099747707
# 6: c -0.10771992 0.099747707
# 7: c -0.03558318 0.099747707
# 8: d 0.56727423 0.532377247
# 9: d -0.60962095 0.532377247
# 10: d 1.13808538 0.532377247
# 11: d 1.03377033 0.532377247
# 12: e 1.38789640 0.568760936
# 13: e -0.57420308 0.568760936
# 14: e 0.89258949 0.568760936
As we are applying a grouped operation (by = "ID"), data.table will automatically replicate each group's mean(Value) the appropriate number of times (avoiding the error you ran into above).
Data:
Dt <- data.table::data.table(
ID = sample(letters[1:5], size = 14, replace = TRUE),
Value = rnorm(14))[order(ID)]
Staying in Base R, ave is intended for this use:
data$mean = with(data, ave(x = RT, su, FUN = mean))
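A minimal sketch of how ave behaves, on a made-up data frame with the same column names as in the question (the values are invented for illustration):

```r
# Toy data: subject 1 has three reading times, subject 2 has two
data <- data.frame(su = c(1, 1, 1, 2, 2),
                   RT = c(300, 400, 500, 200, 400))

# ave returns one value per row: the group mean repeated for each row of the group
data$mean <- ave(data$RT, data$su, FUN = mean)
data
```

Since the result has the same length as RT, it can be assigned to a new column directly, which is what the aggregate attempt could not do.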
Simply merge your aggregated means with the full data frame, joining on the subject:
aggdf <- aggregate (RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"
data <- merge(data, aggdf, by="su")
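The merge approach can be sketched on a small made-up data frame as follows. The values are invented, and the result is stored under the hypothetical name merged so that the input stays untouched.

```r
# Toy data with the subjects deliberately interleaved
data <- data.frame(su = c(2, 1, 2, 1, 1),
                   RT = c(200, 300, 400, 400, 500))

# One row per subject holding that subject's mean
aggdf <- aggregate(RT ~ su, data, mean)
names(aggdf)[2] <- "MeanOfRT"

# Join the means back; every row of a subject receives the same mean
merged <- merge(data, aggdf, by = "su")
merged
```

One design consideration: merge sorts the result by the key column by default, so the original row order is not preserved. If the order matters, ave keeps rows in place.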
Another compelling way of handling this without generating extra data objects is group_by from the dplyr package:
# Generating some data
data <- data.table::data.table(
su = sample(letters[1:5], size = 14, replace = TRUE),
RT = rnorm(14))[order(su)]
# Adding the per-subject mean as a new column
library(dplyr)
data %>% group_by(su) %>%
  mutate(Mean = mean(RT)) %>%
  ungroup()
Source: local data table [14 x 3]
su RT Mean
1 a -1.62841746 0.2096967
2 a 0.07286149 0.2096967
3 a 0.02429030 0.2096967
4 a 0.98882343 0.2096967
5 a 0.95407214 0.2096967
6 a 1.18823435 0.2096967
7 a -0.13198711 0.2096967
8 b -0.34897914 0.1469982
9 b 0.64297557 0.1469982
10 c -0.58995261 -0.5899526
11 d -0.95995198 0.3067978
12 d 1.57354754 0.3067978
13 e 0.43071258 0.2462978
14 e 0.06188307 0.2462978
I hope you can help me with this problem. For my work I have to use R to analyze survey data. The data has a number of columns by which I want to group the data and then do some calculations, e.g. how many men and women work in a certain department? I then want to calculate the number and percentage for each group. For example: 42 people work in department A, of whom 30 are women and 12 are men; 70 people work in department B, of whom 26 are women and 44 are men.
I currently use the following code to output the data (using ddply):
percentage_median_per_group_multiple_columns <- function(data, column_name, column_name2){
  library(plyr)
  descriptive <- ddply( data, column_name,
    function(x){
      percentage_median_per_group(x, column_name)
      percentage_median_per_group(x, column_name2)
    }
  )
  print(data.frame(descriptive))
}
## give number, percentage and median per group_value in column
percentage_median_per_group <- function(data, column_name3){
  library(plyr)
  descriptive <- ddply( data, column_name3,
    function(x){
      c(
        N <- nrow(x[column_name3]), #number
        pct <- (N/nrow(data))*100 #percentage
        #TODO: median
      )
    }
  )
  return(descriptive)
}
#calculate
percentage_median_per_group_multiple_columns(users_surveys_full_responses, "department", "gender")
Now the data outputs like this:
Department  Sex  N   % per sex
A           f    30  71,4
            m    12  28,6
B           f    26  37,1
            m    44  62,9
But, I want the output to look like this, so calculations take place and are printed in each substep:
Department  N   % per department  Sex  N   % per sex
A           42  37,5              f    30  71,4
                                  m    12  28,6
B           70  62,5              f    26  37,1
                                  m    44  62,9
Does anyone have a suggestion for how I can do that? If possible, I would like to make it dynamic so I can group by the variables in multiple columns (e.g. department + sex + type of software + ...), but I would already be happy to get the output shown in the example =)
thanks!
EDIT
You can use this to generate example data:
n=100
sample_data = data.frame(department=sample(1:20,n,replace=TRUE), gender=sample(1:2,n,replace=TRUE))
percentage_median_per_group_multiple_columns(sample_data, "department", "gender")
V1 in the output stands for N (number) and V2 for %
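One possible way to get counts and percentages at both levels is sketched below in base R, using table and prop.table instead of ddply. This is only an illustration under stated assumptions: the data frame survey is made up to match the 42/70 example from the question, and the names survey, counts, dept_n, dept_pct, sex_pct and result are all hypothetical.

```r
# Made-up data matching the example: A has 30 f and 12 m, B has 26 f and 44 m
survey <- data.frame(
  department = rep(c("A", "A", "B", "B"), times = c(30, 12, 26, 44)),
  gender     = rep(c("f", "m", "f", "m"), times = c(30, 12, 26, 44))
)

# Cross-tabulation: rows are departments, columns are genders
counts <- table(survey$department, survey$gender)

# Department totals and their share of all respondents
dept_n   <- rowSums(counts)
dept_pct <- round(100 * dept_n / sum(dept_n), 1)

# Share of each gender within its department (rows sum to 100)
sex_pct <- round(100 * prop.table(counts, margin = 1), 1)

# Long-format result: one row per department/gender combination
result <- data.frame(
  department      = rep(rownames(counts), each = ncol(counts)),
  dept_n          = unname(rep(dept_n, each = ncol(counts))),
  dept_pct        = unname(rep(dept_pct, each = ncol(counts))),
  gender          = rep(colnames(counts), times = nrow(counts)),
  n               = as.vector(t(counts)),
  pct_within_dept = as.vector(t(sex_pct))
)
result
```

Unlike the desired layout, the department figures are repeated on every row rather than printed once per department; blanking the repeats would be a separate formatting step.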