Here's a reproducible example, with my explanation of why it does what it does.
data = read.csv(text="Email foo.final bar.final
abc#foo.com 100 200
cde#foo.com 101 201
xyz#foo.com 102 202
zzz#foo.com 103 103", header=T, sep="" )
a = gather(data, key, Grade, -Email)
means: take every column except "Email", put their values into a single new column called "Grade", and add a new column called "key" that records the column header each value came from. With 4 observations and two gathered variables, that should produce 4 × 2 = 8 observations. Result:
Email key Grade
1 abc#foo.com foo.final 100
2 cde#foo.com foo.final 101
3 xyz#foo.com foo.final 102
4 zzz#foo.com foo.final 103
5 abc#foo.com bar.final 200
6 cde#foo.com bar.final 201
7 xyz#foo.com bar.final 202
8 zzz#foo.com bar.final 103
b = gather(data, key, Grade)
Same meaning, but now Email is gathered as well. With 4 observations and 3 variables, we should get 4 × 3 = 12 observations. Result:
key Grade
1 Email abc#foo.com
2 Email cde#foo.com
3 Email xyz#foo.com
4 Email zzz#foo.com
5 foo.final 100
6 foo.final 101
7 foo.final 102
8 foo.final 103
9 bar.final 200
10 bar.final 201
11 bar.final 202
12 bar.final 103
I am not surprised.
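The expected row counts can be checked directly. This is a minimal sketch, assuming the tidyr package is installed; it only re-runs the two calls above and compares nrow() with rows × gathered columns.

```r
library(tidyr)

# Same sample data as above (sep = "" splits on whitespace)
data <- read.csv(text = "Email foo.final bar.final
abc#foo.com 100 200
cde#foo.com 101 201
xyz#foo.com 102 202
zzz#foo.com 103 103", header = TRUE, sep = "")

# Gather everything except Email: 2 columns are stacked
a <- gather(data, key, Grade, -Email)

# Gather everything, Email included: 3 columns are stacked
b <- gather(data, key, Grade)

# Each gathered column contributes nrow(data) rows to the result
nrow(a) == nrow(data) * (ncol(data) - 1)
nrow(b) == nrow(data) * ncol(data)
```

In the second call Grade has to hold both e-mail addresses and grades, so it can no longer be a numeric column.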
You may need to do something more like this:
f2 <- f1 %>%
gather(key = Assignment, value = Grade, COURSE.final:EXAM.final) %>%
select(-email)
I have done the first step:
how many persons have more than 1 point
how many persons have more than 3 points
how many persons have more than 6 points
My goal:
I need to have random samples (with no duplicates of persons)
of 3 persons that have more than 1 point
of 3 persons that have more than 3 points
of 3 persons that have more than 6 points
My dataset looks like this:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
345 nn 7
345 nn NA
490 kk 1
490 kk NA
490 kk 2
491 ww 1
491 ww 1
489 tt 1
489 tt 1
325 ll 1
325 ll 1
325 ll NA
This is what I have tried so far. Here is an example of code for finding persons that have more than 1 point:
persons_filtered <- dataset %>%
  group_by(person) %>%
  dplyr::filter(sum(points, na.rm = TRUE) > 1) %>%
  distinct(person) %>%
  pull()
persons_filtered
more_than_1 <- sample(persons_filtered, size = 3)
Question:
How can I write this code better, so that I end up with 3 lists of unique persons? (I need to prevent the same person from appearing in more than one list.)
Here's a tidyverse solution, where the sampling in the three categories of interest is made at the same time.
library(tidyverse)
dataset %>%
# Group by person
group_by(person) %>%
# Get points sum
summarize(sum_points = sum(points, na.rm = T)) %>%
# Classify the summed points into the classes defined by breaks: (0,1], (1,3], (3,6], (6,Inf]
# Inf is the last break so that every sum above 6 falls into (6,Inf]
mutate(point_class = cut(sum_points, breaks = c(0,1,3,6,Inf))) %>%
# ungroup
ungroup() %>%
# group by point class
group_by(point_class) %>%
# Sample 3 rows per point_class
sample_n(size = 3) %>%
# Eliminate the sum_points column
select(-sum_points) %>%
# If you need this data in lists you can nest the results in the sampled_data column
nest(sampled_data= -point_class)
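If "more than 1", "more than 3" and "more than 6" are read as overlapping thresholds rather than disjoint classes, the lists can still be kept free of duplicates by sampling the strictest threshold first and removing the chosen persons from the later pools. This is a minimal base R sketch under that assumption; totals, picked and samples are names introduced here for illustration, and the data frame repeats the person and points columns from the question.

```r
# person and points columns from the question's dataset
dataset <- data.frame(
  person = rep(c("rt99", "kt", "rr", "jk", "knm3", "kll2",
                 "kll", "nn", "kk", "ww", "tt", "ll"),
               times = c(3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 2, 3)),
  points = c(NA, 3, 2,  4, NA, NA,  4, NA, NA,  2, 2, NA,
             5, NA, 3,  2, 1, 5,  NA, 7, 1,  7, NA,
             1, NA, 2,  1, 1,  1, 1,  1, 1, NA)
)

# Total points per person, ignoring missing values
totals <- tapply(dataset$points, dataset$person, sum, na.rm = TRUE)

set.seed(1)                 # only to make the sketch reproducible
picked  <- character(0)     # persons already placed in a list
samples <- list()

# Strictest threshold first, so the smallest pool is served first
for (th in c(6, 3, 1)) {
  pool   <- setdiff(names(totals)[totals > th], picked)
  chosen <- sample(pool, size = 3)
  samples[[paste0("more_than_", th)]] <- chosen
  picked <- c(picked, chosen)
}

samples
```

sample() raises an error if a pool holds fewer than 3 persons, which is a useful signal that the thresholds cannot all be satisfied without reusing someone.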
I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness Weight Viscocity
132 1 3 14.93199362 94.37250417 579.4236727
676 1 4 44.58750591 70.03232054 1829.170727
699 2 5 89.02760079 54.30587287 1169.226863
850 3 6 30.74535903 83.84377678 707.2280513
951 4 237 67.79568019 51.10388484 917.6609965
1031 5 56 74.06697003 63.31274502 1981.17804
1175 4 354 98.9656142 97.7523884 100.7357981
1483 5 726 9.958040999 51.29537311 1222.910211
1529 7 800 64.11430235 65.69780939 573.8266137
1698 9 125 67.83105185 96.53847341 486.9620194
1748 9 1005 49.43602318 52.9139591 1881.740184
2005 9 28 26.89821508 82.12663209 1709.556135
2111 2 76 83.03593144 85.23622731 276.5088502
I want to split this data by Test_No and then compute the number of unique Category values per Test_No, as well as the median Category value. I chose to use split and sapply in the following way, but I am getting an error about a missing parenthesis. Is there anything wrong with my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending my question:
I would want to display the data containing the following information:
Test_No, Category, Median_Cat and Cat_Count
We can try with dplyr
library(dplyr)
Test %>%
group_by(Test_No) %>%
summarise(Cat_Count = n_distinct(Category),
Median_Cat = median(Category,na.rm = TRUE),
Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or if you prefer base R we can also try with aggregate
aggregate(Category ~ Test_No, Test, function(x) c(Cat_Count = length(unique(x)),
  Median_Cat = median(x, na.rm = TRUE), Category = toString(x)))
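As for the function you wrote, I think there are some syntax issues in it: the function is never assigned to a name, and the sapply call passes function(ModRange) without a body. A working version: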
new_func <- function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)),
Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28
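The split-based approach from the question can also be repaired while keeping the result numeric. This is a minimal sketch that assumes only the Test_No and Category columns matter; cat_stats is a name chosen here for illustration, and lapply replaces sapply so that do.call(rbind, ...) receives a list with one vector per group.

```r
# Test_No and Category columns from the question's data
Test <- data.frame(
  Test_No  = c(1, 1, 2, 3, 4, 5, 4, 5, 7, 9, 9, 9, 2),
  Category = c(3, 4, 5, 6, 237, 56, 354, 726, 800, 125, 1005, 28, 76)
)

# Per-group statistics: a named numeric vector of length 2
cat_stats <- function(CatRange) {
  c(Cat_Count  = length(unique(CatRange$Category)),
    Median_Cat = median(unique(CatRange$Category), na.rm = TRUE))
}

# split() yields one data frame per Test_No; rbind stacks the vectors
CatStat <- do.call(rbind, lapply(split(Test, Test$Test_No), cat_stats))
CatStat
```

The result is a numeric matrix whose row names are the Test_No values, so the counts and medians stay usable for further arithmetic.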
I'm looking for a way to produce descriptive statistics by group number in R. There is another answer on here I found, which uses dplyr, but I'm having too many problems with it and would like to see what alternatives others might recommend.
I'm looking to obtain descriptive statistics on revenue grouped by group_id. Let's say I have a data frame called company:
group_id company revenue
1 Company A 200
1 Company B 150
1 Company C 300
2 Company D 600
2 Company E 800
2 Company F 1000
3 Company G 50
3 Company H 80
3 Company H 60
and I'd like to produce a new data frame called new_company:
group_id company revenue average min max SD
1 Company A 200 217 150 300 62
1 Company B 150 217 150 300 62
1 Company C 300 217 150 300 62
2 Company D 600 800 600 1000 163
2 Company E 800 800 600 1000 163
2 Company F 1000 800 600 1000 163
3 Company G 50 63 50 80 12
3 Company H 80 63 50 80 12
3 Company H 60 63 50 80 12
Again, I'm looking for alternatives to dplyr. Thank you
Using the sample data frame
dd<-read.csv(text="group_id,company,revenue
1,Company A,200
1,Company B,150
1,Company C,300
2,Company D,600
2,Company E,800
2,Company F,1000
3,Company G,50
3,Company H,80
3,Company H,60", header=T)
You could do something fancy like use ave() to create all the values per row for your different functions and then just combine that with the original data.frame.
ext <- with(dd, Map(function(x) ave(revenue, group_id, FUN=x),
list(avg=mean, min=min, max=max, SD=sd)))
cbind(dd, ext)
# group_id company revenue avg min max SD
# 1 1 Company A 200 216.66667 150 300 76.37626
# 2 1 Company B 150 216.66667 150 300 76.37626
# 3 1 Company C 300 216.66667 150 300 76.37626
# 4 2 Company D 600 800.00000 600 1000 200.00000
# 5 2 Company E 800 800.00000 600 1000 200.00000
# 6 2 Company F 1000 800.00000 600 1000 200.00000
# 7 3 Company G 50 63.33333 50 80 15.27525
# 8 3 Company H 80 63.33333 50 80 15.27525
# 9 3 Company H 60 63.33333 50 80 15.27525
But really, a simple dplyr command (after library(dplyr)) would be easier:
dd %>% group_by(group_id) %>%
mutate(
avg=mean(revenue),
min=min(revenue),
max=max(revenue),
SD=sd(revenue))
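The SD values produced by sd() (76.4, 200, 15.3) differ from the ones in the question's expected table (62, 163, 12) because sd() divides by n - 1. The expected values match a population standard deviation that divides by n. This is a minimal base R sketch under that reading; pop_sd and SD_pop are hypothetical names introduced here, not part of any package.

```r
dd <- read.csv(text = "group_id,company,revenue
1,Company A,200
1,Company B,150
1,Company C,300
2,Company D,600
2,Company E,800
2,Company F,1000
3,Company G,50
3,Company H,80
3,Company H,60", header = TRUE)

# Population standard deviation: divide by n instead of n - 1
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# ave() repeats the per-group value on every row of that group
dd$SD_pop <- ave(dd$revenue, dd$group_id, FUN = pop_sd)

# One rounded value per group, in group order
round(unique(dd$SD_pop))
```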
Another function I like to use is: describeBy from package "psych".
library(psych)
description <- describeBy(df$variable_to_be_described, df$group_variable)
I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only those first rows where the values of 'X_POSITION' are increasing. That is, I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr, to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and also I'm not sure on how to make sure it's only adding the first rows per trial that have increasing values for X_POSITION.
FirstPassRT = dat %>%
group_by(TRIAL_INDEX) %>%
filter(dplyr::lag(dat$X_POSITION,1) > dat$X_POSITION) %>%
summarise(FIRST_PASS_TIME=sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
Here is something you can try with dplyr package:
library(dplyr)
dat %>% group_by(TRIAL_INDEX) %>%
mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
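Note that comparing each row with the previous one flags every increase, so a second rise later in a trial would also be summed. The sample data never rises again after the first drop, which is why the totals match. This is a minimal base R sketch that stops at the first non-increase; first_pass_time is a name introduced here, and the last row of trial 2 is a hypothetical addition (not in the question's data) to show the difference.

```r
dat <- data.frame(
  TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2),
  DURATION    = c(204, 172, 186, 670, 186, 134, 182, 806, 323, 100),
  # the final 600.0 is a hypothetical second rise after the drop to 490.0
  X_POSITION  = c(314.5, 471.6, 570.4, 539.5, 503.6,
                  306.8, 503.3, 555.7, 490.0, 600.0)
)

# Sum durations over the leading run of strictly increasing x values
first_pass_time <- function(duration, x) {
  increasing <- c(TRUE, diff(x) > 0)      # the first row always counts
  first_run  <- cumprod(increasing) == 1  # FALSE from the first drop onward
  sum(duration[first_run])
}

# Row indices per trial, then one total per trial
rows_by_trial <- split(seq_len(nrow(dat)), dat$TRIAL_INDEX)
per_trial <- sapply(rows_by_trial, function(i) {
  first_pass_time(dat$DURATION[i], dat$X_POSITION[i])
})

# Repeat each trial's total on all of its rows
dat$FIRST_PASS_TIME <- unname(per_trial[as.character(dat$TRIAL_INDEX)])
dat
```

With the extra row, trial 2 still totals 134 + 182 + 806, because the cumulative product stays at zero once the first drop has been seen.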
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- tibble(TRIAL_INDEX = c(1,1,1,1,1,2,2,2,2),
             DURATION = c(204,172,186,670, 186,134,182,806, 323),
             X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6, 306.8, 503.3, 555.7, 490.0))
res <- df %>%
  group_by(TRIAL_INDEX) %>%
  # the first row of each trial has no predecessor (lag is NA), so treat it as increasing
  mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
         x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
  filter(x.increasing == TRUE) %>%
  # sum DURATION over the rows that were kept
  summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
# TRIAL_INDEX FIRST_PASS_TIME
# (dbl) (dbl)
#1 1 562
#2 2 1122
I'm looking at some ecological data (diet) and trying to work out how to group by Predator. I would like to extract the data so that I can look at the weights of each individual prey for each species for each predator, i.e. work out the mean weight of each species eaten by, e.g., Predator 117. I've put a sample of my data below.
Predator PreySpecies PreyWeight
1 114 10 4.2035496
2 114 10 1.6307026
3 115 1 407.7279775
4 115 1 255.5430495
5 117 10 4.2503708
6 117 10 3.6268814
7 117 10 6.4342073
8 117 10 1.8590861
9 117 10 2.3181421
10 117 10 0.9749844
11 117 10 0.7424772
12 117 15 4.2803743
13 118 1 126.8559155
14 118 1 276.0256158
15 118 1 123.0529734
16 118 1 427.1129793
17 118 3 237.0437606
18 120 1 345.1957190
19 121 1 160.6688815
You can use the aggregate function as follows:
aggregate(formula = PreyWeight ~ Predator + PreySpecies, data = diet, FUN = mean)
# Predator PreySpecies PreyWeight
# 1 115 1 331.635514
# 2 118 1 238.261871
# 3 120 1 345.195719
# 4 121 1 160.668881
# 5 118 3 237.043761
# 6 114 10 2.917126
# 7 117 10 2.886593
# 8 117 15 4.280374
There are a few different ways of getting what you want:
The aggregate function. Probably what you are after.
aggregate(PreyWeight ~ Predator + PreySpecies, data=dd, FUN=mean)
tapply: Very useful. It can split by several factors when given a list of them, but here we create a single new joint factor with the paste command, which returns a flat named vector:
tapply(dd$PreyWeight, paste(dd$Predator, dd$PreySpecies), mean)
ddply: Part of the plyr package. Very useful. Worth learning.
require(plyr)
ddply(dd, .(Predator, PreySpecies), summarise, mean(PreyWeight))
dcast: The output is in more of a table format. Part of the reshape2 package.
require(reshape2)
dcast(dd, Predator ~ PreySpecies, value.var = "PreyWeight", fun.aggregate = mean, fill = 0)
mean(data$PreyWeight[data$Predator==117]);
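To get the mean weight per prey species for one predator, rather than a single overall mean, tapply can be given both grouping variables as a list. This is a minimal base R sketch using the sample rows from the question; the data frame name dd follows the answers above, and means is a name introduced here.

```r
dd <- data.frame(
  Predator    = c(114, 114, 115, 115, 117, 117, 117, 117, 117, 117,
                  117, 117, 118, 118, 118, 118, 118, 120, 121),
  PreySpecies = c(10, 10, 1, 1, 10, 10, 10, 10, 10, 10,
                  10, 15, 1, 1, 1, 1, 3, 1, 1),
  PreyWeight  = c(4.2035496, 1.6307026, 407.7279775, 255.5430495,
                  4.2503708, 3.6268814, 6.4342073, 1.8590861,
                  2.3181421, 0.9749844, 0.7424772, 4.2803743,
                  126.8559155, 276.0256158, 123.0529734, 427.1129793,
                  237.0437606, 345.1957190, 160.6688815)
)

# Predator x PreySpecies matrix of mean weights;
# combinations that never occur are NA
means <- tapply(dd$PreyWeight,
                list(Predator = dd$Predator, PreySpecies = dd$PreySpecies),
                mean)

# Row for predator 117: one mean per prey species
means["117", ]
```

The non-missing entries of this matrix are the same values that the aggregate call above lists row by row.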