Descriptive Statistics By Group - R

I'm looking for a way to produce descriptive statistics by group number in R. I found another answer here that uses dplyr, but I'm having too many problems with it and would like to see what alternatives others might recommend.
I'm looking to obtain descriptive statistics on revenue grouped by group_id. Let's say I have a data frame called company:
group_id company revenue
1 Company A 200
1 Company B 150
1 Company C 300
2 Company D 600
2 Company E 800
2 Company F 1000
3 Company G 50
3 Company H 80
3 Company H 60
and I'd like to produce a new data frame called new_company:
group_id company revenue average min max SD
1 Company A 200 217 150 300 62
1 Company B 150 217 150 300 62
1 Company C 300 217 150 300 62
2 Company D 600 800 600 1000 163
2 Company E 800 800 600 1000 163
2 Company F 1000 800 600 1000 163
3 Company G 50 63 50 80 12
3 Company H 80 63 50 80 12
3 Company H 60 63 50 80 12
Again, I'm looking for alternatives to dplyr. Thank you.

Using the sample data frame
dd<-read.csv(text="group_id,company,revenue
1,Company A,200
1,Company B,150
1,Company C,300
2,Company D,600
2,Company E,800
2,Company F,1000
3,Company G,50
3,Company H,80
3,Company H,60", header=T)
You could do something fancy like use ave() to create all the values per row for your different functions and then just combine that with the original data.frame.
ext <- with(dd, Map(function(x) ave(revenue, group_id, FUN=x),
                    list(avg=mean, min=min, max=max, SD=sd)))
cbind(dd, ext)
# group_id company revenue avg min max SD
# 1 1 Company A 200 216.66667 150 300 76.37626
# 2 1 Company B 150 216.66667 150 300 76.37626
# 3 1 Company C 300 216.66667 150 300 76.37626
# 4 2 Company D 600 800.00000 600 1000 200.00000
# 5 2 Company E 800 800.00000 600 1000 200.00000
# 6 2 Company F 1000 800.00000 600 1000 200.00000
# 7 3 Company G 50 63.33333 50 80 15.27525
# 8 3 Company H 80 63.33333 50 80 15.27525
# 9 3 Company H 60 63.33333 50 80 15.27525
but really a simple dplyr command would be easier.
library(dplyr)
dd %>% group_by(group_id) %>%
  mutate(
    avg = mean(revenue),
    min = min(revenue),
    max = max(revenue),
    SD = sd(revenue))
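If you want to stay entirely outside dplyr, a base R sketch using aggregate() plus merge() produces the same group-level columns (the column names here are my own choice):
# One row of statistics per group_id; aggregate() returns a matrix column of results
stats <- aggregate(revenue ~ group_id, data = dd,
                   FUN = function(x) c(avg = mean(x), min = min(x), max = max(x), SD = sd(x)))
stats <- do.call(data.frame, stats)   # flatten the matrix column into ordinary columns
names(stats) <- c("group_id", "avg", "min", "max", "SD")
# Join the per-group statistics back onto every original row
new_company <- merge(dd, stats, by = "group_id")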

Another function I like to use is describeBy() from the "psych" package.
library(psych)
description <- describeBy(df$variable_to_be_described, df$group_variable)
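For the sample data above, a minimal sketch (assuming the psych package is installed) would be:
library(psych)
# One block of summary statistics (n, mean, sd, min, max, ...) per group_id;
# add mat = TRUE if you prefer the result as a single matrix-like table
describeBy(dd$revenue, group = dd$group_id)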

Related

Transforming columns based off separate dataframe - R solution

I'm trying to write some more sophisticated code. I have a problem which I have a solution for, but I would like to increase the flexibility of the code.
Input data (d):
ID Height_cm Height_m Weight_kg Weight_lb
 1       180       NA        70        NA
 2       165       NA        NA       120
 3        NA      1.8        80        NA
 4       100       NA        NA        60
 5       190       NA        NA       200
 6        NA      1.7       100        NA
I want to transform height into cm and weight into kg, each in a single column, like this:
ID Height Weight
1 180 70
2 165 54
3 180 80
4 100 45
5 190 90
6 170 100
I have a solution, but it's hard-coded:
library(tidyverse)
d$Height <- NA
d$Weight <-NA
d$Height <- ifelse(!is.na(d$Height_cm), d$Height_cm, d$Height_m * 0.01)
d$Weight <- ifelse(!is.na(d$Weight_kg), d$Weight_kg, d$Weight_lb * 0.45)
d <- d %>% select(ID, Height, Weight)
I want to be more sophisticated: take in an input file (below) and drive the transformation from that dataframe. The results will be the same, but the conversion factors now come from this transformation df:
Transformation_d:
marker unit new_col_name transformation
Height_cm centimetre Height 1
Height_m metre Height 0.01
Weight_kg kilogram Weight 1
Weight_lb pounds Weight 0.45
This is where I'm stuck... I'd appreciate some guidance.
Go easy, I'm new to R!
Height_m should have a transformation value of 100, I guess.
You can use cur_column() to get the current column name, match it against the marker column of the Transformation_d dataframe, and multiply by the corresponding transformation value.
library(dplyr)
d %>%
  mutate(across(-1, ~ . * Transformation_d$transformation[match(cur_column(),
                                                                Transformation_d$marker)])) %>%
  transmute(ID,
            Height = coalesce(Height_cm, Height_m),
            Weight = coalesce(Weight_kg, Weight_lb))
# ID Height Weight
#1 1 180 70
#2 2 165 54
#3 3 180 80
#4 4 100 27
#5 5 190 90
#6 6 170 100
A similar process in base R
cbind(d[1],
      transform(sweep(d[-1], 2,
                      Transformation_d$transformation[match(names(d)[-1],
                                                            Transformation_d$marker)], `*`),
                Height = ifelse(is.na(Height_cm), Height_m, Height_cm),
                Weight = ifelse(is.na(Weight_kg), Weight_lb, Weight_kg)))
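For anyone who wants to run the two answers above, here is one way the input data might be constructed; the NA placement is my reading of the answers' output, and the Height_m factor is set to 100 as suggested in the comment above:
d <- data.frame(
  ID        = 1:6,
  Height_cm = c(180, 165, NA, 100, 190, NA),
  Height_m  = c(NA, NA, 1.8, NA, NA, 1.7),
  Weight_kg = c(70, NA, 80, NA, NA, 100),
  Weight_lb = c(NA, 120, NA, 60, 200, NA)
)
Transformation_d <- data.frame(
  marker         = c("Height_cm", "Height_m", "Weight_kg", "Weight_lb"),
  unit           = c("centimetre", "metre", "kilogram", "pounds"),
  new_col_name   = c("Height", "Height", "Weight", "Weight"),
  transformation = c(1, 100, 1, 0.45)  # Height_m uses 100, per the comment above
)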

Cross joining for the computation of a new variable

I have a game data set and I observe the number of points of one player.
da = data.frame(points = c(144,186,220,410,433))
da
points
1 144
2 186
3 220
4 410
5 433
I also know which level the player was in, because I know the point ranges for the different levels.
ranges = data.frame(level = c(1,2,3,4,5), points_from = c(0,100,200,300,430), points_to = c(100,170,300,430,550))
ranges
level points_from points_to
1 1 0 100
2 2 100 170
3 3 200 300
4 4 300 430
5 5 430 550
Now I want to compute a new variable that indicates how far away the player was from the next level. It is computed as da$points / ranges$points_to for the player's specific level.
For example, if the player has 144 points and the next level is reached at 170 points, the level progress is 144/170.
Thus, the data set I want to have looks like this:
da_new = data.frame(points = c(144,186,220,410,433), points_to = c(170,300,300,430,550), level_progress = c(144/170,186/300,220/300,410/430,433/550))
da_new
points points_to level_progress
1 144 170 0.8471
2 186 300 0.6200
3 220 300 0.7333
4 410 430 0.9535
5 433 550 0.7873
How can I now compute this variable?
The main idea is to use merge(da, ranges, all = T) to do a "cross join" between the data. Then, we filter to where points is between points_from and points_to (meaning 186 is not in the final data).
library(dplyr)
merge(da, ranges, all = T) %>%
  # keep only rows where points fall between points_from and points_to
  filter(points >= points_from & points <= points_to) %>%
  mutate(level_progress = points / points_to)
points level points_from points_to level_progress
1 144 2 100 170 0.8470588
2 220 3 200 300 0.7333333
3 410 4 300 430 0.9534884
4 433 5 430 550 0.7872727
Another option is to filter where points <= points_to, and find where points is closest to points_to (this method keeps 186):
merge(da, ranges, all = T) %>%
  filter(points <= points_to) %>%
  group_by(points) %>%
  slice(which.min(abs(points - points_to))) %>%
  mutate(level_progress = points / points_to)
points level points_from points_to level_progress
<dbl> <dbl> <dbl> <dbl> <dbl>
1 144 2 100 170 0.847
2 186 3 200 300 0.62
3 220 3 200 300 0.733
4 410 4 300 430 0.953
5 433 5 430 550 0.787
Here is a base R solution using findInterval
da_new <- da
da_new$points_to <- ranges$points_to[findInterval(da_new$points,c(0,ranges$points_to))]
da_new$level_progress <- da_new$points/da_new$points_to
such that
> da_new
points points_to level_progress
1 144 170 0.8470588
2 186 300 0.6200000
3 220 300 0.7333333
4 410 430 0.9534884
5 433 550 0.7872727
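If you are on dplyr 1.1.0 or later, a non-equi join expresses the same idea directly; this is a sketch, and note that 186 gets NA here because it falls in the gap between levels 2 and 3:
library(dplyr)
da %>%
  left_join(ranges, by = join_by(points >= points_from, points <= points_to)) %>%
  mutate(level_progress = points / points_to)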

In R, why does this call to gather() do this?

Here's a reproducible example, with my explanation of why it does what it does.
data = read.csv(text="Email foo.final bar.final
abc#foo.com 100 200
cde#foo.com 101 201
xyz#foo.com 102 202
zzz#foo.com 103 103", header=T, sep="" )
a = gather(data, key, Grade, -Email)
means: except for "Email", put the values of all the columns into a single new column called "Grade", and add a new column called "key" that contains the column header under which each value occurred. Given that we have 4 observations with two variables each, that should produce 8 observations. Result:
Email key Grade
1 abc#foo.com foo.final 100
2 cde#foo.com foo.final 101
3 xyz#foo.com foo.final 102
4 zzz#foo.com foo.final 103
5 abc#foo.com bar.final 200
6 cde#foo.com bar.final 201
7 xyz#foo.com bar.final 202
8 zzz#foo.com bar.final 103
b = gather(data, key, Grade)
Same meaning but now we include Email. Now we have 4 observations but with 3 variables, so we should get 12 observations. Result:
key Grade
1 Email abc#foo.com
2 Email cde#foo.com
3 Email xyz#foo.com
4 Email zzz#foo.com
5 foo.final 100
6 foo.final 101
7 foo.final 102
8 foo.final 103
9 bar.final 200
10 bar.final 201
11 bar.final 202
12 bar.final 103
I am not surprised.
You may need to do something more like this
f2 <- f1 %>%
  gather(key = Assignment, value = Grade, COURSE.final:EXAM.final) %>%
  select(-email)
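As an aside, gather() has been superseded in current tidyr; the equivalent of the first call above with pivot_longer() would be roughly:
library(tidyr)
# Same reshape as gather(data, key, Grade, -Email), though the row order may differ
a <- pivot_longer(data, cols = -Email, names_to = "key", values_to = "Grade")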

How to make new variable across conditions

I need to calculate a new variable (a new pheno) from the data using conditions.
The data set is huge.
I have a data set with the columns Animal (A), Record (R), Days (D), and Pheno (P):
A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450
The conditions are:
The constant pheno per day is 2.
If the record's days are more than 305, the old pheno should be kept.
If the record is less than 305 days but has later records, the pheno should be kept.
If the record is less than 305 days and has no later records, it should be calculated as: (305 - days) * constant + pheno, e.g. (305 - 260) * 2 + 290.
For example, animal 1 has less than 305 days for both records. So the first record stays the same in the new pheno, but the second record is the last one and has less than 305 days, so we need to recalculate: (305 - 230) * 2 + 290 = 440.
Finally, the data will look like:
A R D P N_P
1 1 240 300 300
1 2 230 290 440
2 1 305 350 350
2 2 260 290 380
3 1 350 450 450
How can I do this in R or Linux?
Here is a solution with base R
df <- read.table(header=TRUE, text=
"A R D P
1 1 240 300
1 2 230 290
2 1 305 350
2 2 260 290
3 1 350 450")
newP <- function(d) {
  np <- numeric(nrow(d))
  for (i in 1:nrow(d)) {
    if (d$D[i] > 305) { np[i] <- d$P[i]; next }                  # more than 305 days: keep old pheno
    if (d$D[i] <= 305 && i < nrow(d)) { np[i] <- d$P[i]; next }  # not the last record: keep old pheno
    np[i] <- (305 - d$D[i]) * 2 + d$P[i]                         # last record with < 305 days: recalculate
  }
  d$N_P <- np
  return(d)
}
D <- split(df, df$A)
D2 <- lapply(D, newP)
do.call(rbind, D2)
Check this out (I assume R is the record number in sorted order, so if an animal has 10 records the last one has R = 10):
library(dplyr)
df <- data.frame(A = c(1, 1, 2, 2, 3),
                 R = c(1, 2, 1, 2, 1),
                 D = c(240, 230, 305, 260, 350),
                 P = c(300, 290, 350, 290, 450))
df %>% group_by(A) %>%
  mutate(N_P = ifelse((D < 305 & R == n()),  # check if D < 305 & record is the last record
                      ((305 - D) * 2) + P,   # calculate new P
                      P))                    # else: use old P
Source: local data frame [5 x 5]
Groups: A [3]
A R D P N_P
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 240 300 300
2 1 2 230 290 440
3 2 1 305 350 350
4 2 2 260 290 380
5 3 1 350 450 450
If you have predefined constants that depend on R value in the df, for example :
const <- c(1,2,1.5,2.5,3)
You can replace the constant 2 in the code with const[R]:
df %>% group_by(A) %>%
  mutate(N_P = ifelse((D < 305 & R == n()),        # check if D < 305 & record is the last record
                      ((305 - D) * const[R]) + P,  # calculate new P
                      P))                          # else: use old P
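If the data set really is huge, a vectorised base R sketch avoids the split/loop entirely; it assumes the rows are already ordered by record within animal, so the last row per A is the last record:
# TRUE only for the last record of each animal (rows must be sorted by A, then R)
is_last <- !duplicated(df$A, fromLast = TRUE)
# Recalculate only where the last record has fewer than 305 days
df$N_P <- ifelse(df$D < 305 & is_last, (305 - df$D) * 2 + df$P, df$P)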

if statement and mutate

EMPLTOT_N FIRMTOT average min
12289593 4511051 5 1
26841282 1074459 55 10
15867437 81243 300 100
6060684 8761 750 500
52366969 8910 1000 1000
137003 47573 5 1
226987 10372 55 10
81011 507 300 100
23379 52 750 500
13698 42 1000 1000
67014 20397 5 1
My data look like the data above. I want to use the mutate() function to create a new column emp such that:
emp = average * FIRMTOT if EMPLTOT_N / FIRMTOT < min
and emp = EMPLTOT_N if EMPLTOT_N / FIRMTOT > min
In your sample data EMPLTOT_N / FIRMTOT is less than min in only a couple of rows, but this should work:
df <- read.table(text = "EMPLTOT_N FIRMTOT average min
12289593 4511051 5 1
26841282 1074459 55 10
15867437 81243 300 100
6060684 8761 750 500
52366969 8910 1000 1000
137003 47573 5 1
226987 10372 55 10
81011 507 300 100
23379 52 750 500
13698 42 1000 1000
67014 20397 5 1", header = TRUE)
library('dplyr')
mutate(df, emp = ifelse(EMPLTOT_N / FIRMTOT < min, average * FIRMTOT, EMPLTOT_N))
In the above if EMPLTOT_N / FIRMTOT == min, emp will be given the value of EMPLTOT_N since you didn't specify what you want to happen in this case.
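If you would rather make the EMPLTOT_N / FIRMTOT == min case explicit instead of letting it fall through, a case_when() sketch (still dplyr) could look like this; the NA for the tie is just a placeholder decision:
mutate(df, emp = case_when(
  EMPLTOT_N / FIRMTOT < min ~ as.numeric(average * FIRMTOT),  # below the threshold: impute from the average
  EMPLTOT_N / FIRMTOT > min ~ as.numeric(EMPLTOT_N),          # above the threshold: keep the raw total
  TRUE                      ~ NA_real_                        # exactly equal: left undefined in the question
))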
