I would like to assign the max value of a group to all rows within that group. How do I do that?
I have a dataframe containing the names of the group and the max number of credits that belongs to it.
course_credits <- aggregate(bsc_academic$Credits, by = list(bsc_academic$Course_code), max)
which gives
Course Credits
1 ABC1000 6.5
2 ABC1003 6.5
3 ABC1004 6.5
4 ABC1007 5.0
5 ABC1010 6.5
6 ABC1021 6.5
7 ABC1023 6.5
The main dataframe looks like this:
Appraisal.Type Resits Credits Course_code Student_ID
Final result 0 6.5 ABC1000 10
Final result 0 6.5 ABC1003 10
Grade supervisor 0 0 ABC1000 10
Grade supervisor 0 0 ABC1003 10
Final result 0 12 ABC1294 23
Grade supervisor 0 0 ABC1294 23
As you see, student 10 took course ABC1000, worth 6.5 credits. For each course (per student), however, two rows exist: Final result and Grade supervisor. In the end, Final result should be deleted, but the credits should be kept. Therefore, I want to assign the max value of 6.5 to the Grade supervisor row.
Likewise, student 23 has followed course ABC1294, worth 12 credits.
In the end, this should be the result:
Appraisal.Type Resits Credits Course_code Student_ID
Grade supervisor 0 6.5 ABC1000 10
Grade supervisor 0 6.5 ABC1003 10
Grade supervisor 0 12 ABC1294 23
How do I go about this?
One option is to group by 'Student_ID', replace 'Credits' with the group maximum via mutate, and then filter the rows where 'Appraisal.Type' is "Grade supervisor":
library(dplyr)
df1 %>%
group_by(Student_ID) %>%
dplyr::mutate(Credits = max(Credits)) %>%
ungroup %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 2 x 5
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
If 'Course_code' also needs to be part of the grouping:
df2 %>%
group_by(Student_ID, Course_code) %>%
dplyr::mutate(Credits = max(Credits)) %>%
filter(Appraisal.Type == "Grade supervisor")
# A tibble: 3 x 5
# Groups: Student_ID, Course_code [3]
# Appraisal.Type Resits Credits Course_code Student_ID
# <chr> <int> <dbl> <chr> <int>
#1 Grade supervisor 0 6.5 ABC1000 10
#2 Grade supervisor 0 6.5 ABC1003 10
#3 Grade supervisor 0 12 ABC1294 23
NOTE: If the plyr package is also loaded, some functions can be masked, especially summarise/mutate, which also exist in plyr. To prevent this, either run the code in a fresh session without loading plyr or call dplyr::mutate explicitly.
data
df1 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor"), Resits = c(0L, 0L, 0L,
0L), Credits = c(6.5, 6.5, 0, 0), Course_code = c("ABC1000",
"ABC1003", "ABC1000", "ABC1003"), Student_ID = c(10L, 10L, 10L,
10L)), class = "data.frame", row.names = c(NA, -4L))
df2 <- structure(list(Appraisal.Type = c("Final result", "Final result",
"Grade supervisor", "Grade supervisor", "Final result", "Grade supervisor"
), Resits = c(0L, 0L, 0L, 0L, 0L, 0L), Credits = c(6.5, 6.5,
0, 0, 12, 0), Course_code = c("ABC1000", "ABC1003", "ABC1000",
"ABC1003", "ABC1294", "ABC1294"), Student_ID = c(10L, 10L, 10L,
10L, 23L, 23L)), class = "data.frame", row.names = c(NA, -6L))
Generate a sample dataset.
data <- as.data.frame(list(Appraisal.Type = c(rep("Final result", 2), rep("Grade supervisor", 2)),
Resits = rep(0, 4),
Credits = c(rep(6.5, 2), rep(0, 2)),
Course_code = rep(c("ABC1000", "ABC1003"), 2),
Student_ID = rep(10, 4)))
Assign the max value of a group to all rows in this group and then delete the rows that contain "Final result".
##Reassign the values of "Credits" column
for (i in 1: nlevels(as.factor(data$Course_code))) {
Course_code <- unique(data$Course_code)[i]
data$Credits [data$Course_code == Course_code] <- max (data$Credits [data$Course_code == Course_code])
}
##New dataset without "Final result" rows
data <- data[data$Appraisal.Type != "Final result",]
Here is the result.
data
Appraisal.Type Resits Credits Course_code Student_ID
3 Grade supervisor 0 6.5 ABC1000 10
4 Grade supervisor 0 6.5 ABC1003 10
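The loop above can also be written without explicit iteration. Here is a minimal sketch, assuming base R's ave() is acceptable; the data frame name `dat` and the combined grouping key are choices made for this illustration, and the rows mirror the six-row example from the question.

```r
# Stand-in for the question's data (hypothetical name `dat`)
dat <- data.frame(
  Appraisal.Type = c("Final result", "Final result", "Grade supervisor",
                     "Grade supervisor", "Final result", "Grade supervisor"),
  Resits      = 0,
  Credits     = c(6.5, 6.5, 0, 0, 12, 0),
  Course_code = c("ABC1000", "ABC1003", "ABC1000",
                  "ABC1003", "ABC1294", "ABC1294"),
  Student_ID  = c(10, 10, 10, 10, 23, 23)
)

# One key per student/course pair, so only combinations that occur are grouped
key <- paste(dat$Student_ID, dat$Course_code)

# ave() applies FUN within each group and returns a vector as long as the input
dat$Credits <- ave(dat$Credits, key, FUN = max)

# Keep only the "Grade supervisor" rows
res <- dat[dat$Appraisal.Type == "Grade supervisor", ]
res
```

Grouping on the pasted key rather than on course alone keeps two students with the same course from sharing a maximum.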
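Here's a data.table solution, using df2 from the answer above and grouping by student and course so that each course keeps its own maximum:
library(data.table)
DT <- as.data.table(df2)
DT[, Credits := max(Credits), by = .(Student_ID, Course_code)]
Result <- DT[Appraisal.Type == "Grade supervisor"]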
I am using this dataset: http://www.openintro.org/stat/data/cdc.R
to create a table from a subset that only contains the means and standard deviations of male participants. The table should look like this:
Mean Standard Deviation
Age: 44.27 16.715
Height: 70.25 3.009219
Weight: 189.3 36.55036
Desired Weight: 178.6 26.25121
I created a subset for males and females with this code:
mdata <- subset(cdc, cdc$gender == ("m"))
fdata <- subset(cdc, cdc$gender == ("f"))
How should I create a table that only contains means and SDs of age, height, weight, and desired weight using these subsets?
The data frame you provided sucked up all the memory on my laptop, and it's not needed to provide that much data to solve your problem. Here's a dplyr/tidyr solution to create a summary table grouped by categories, using the starwars dataset available with dplyr:
library(dplyr)
library(tidyr)
starwars |>
group_by(sex) |>
summarise(across(
where(is.numeric),
.fns = list(Mean = \(x) mean(x, na.rm = TRUE), SD = \(x) sd(x, na.rm = TRUE)),
.names = "{col}__{fn}"
)) |>
pivot_longer(-sex, names_to = c("var", ".value"), names_sep = "__")
# A tibble: 15 × 4
sex var Mean SD
<chr> <chr> <dbl> <dbl>
1 female height 169. 15.3
2 female mass 54.7 8.59
3 female birth_year 47.2 15.0
4 hermaphroditic height 175 NA
5 hermaphroditic mass 1358 NA
6 hermaphroditic birth_year 600 NA
7 male height 179. 36.0
8 male mass 81.0 28.2
9 male birth_year 85.5 157.
10 none height 131. 49.1
11 none mass 69.8 51.0
12 none birth_year 53.3 51.6
13 NA height 181. 2.89
14 NA mass 48 NA
15 NA birth_year 62 NA
Just make a data frame of colMeans and the column SDs. Note that subset() can also select the columns you need.
fdata <- subset(cdc, gender == "f", select=c("age", "height", "weight", "wtdesire"))
data.frame(mean=colMeans(fdata), sd=apply(fdata, 2, sd))
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
You can also use by to do it simultaneously for both groups, it's basically a combination of split and lapply. (To avoid apply when calculating column SDs, you could also use sd=matrixStats::colSds(as.matrix(fdata)) which is considerably faster.)
res <- by(cdc[c("age", "height", "weight", "wtdesire")], cdc$gender, \(x) {
data.frame(mean=colMeans(x), sd=matrixStats::colSds(as.matrix(x)))
})
res
# cdc$gender: m
# mean sd
# age 44.27307 16.719940
# height 70.25165 3.009219
# weight 189.32271 36.550355
# wtdesire 178.61657 26.251215
# ------------------------------------------------------------------------------------------
# cdc$gender: f
# mean sd
# age 45.79772 17.584420
# height 64.36775 2.787304
# weight 151.66619 34.297519
# wtdesire 133.51500 18.963014
To extract only one of the data frames in the list-like object use e.g. res$m.
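To make the "combination of split and lapply" remark concrete, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are hypothetical and are not taken from the cdc data.

```r
# Tiny made-up data set (hypothetical values)
toy <- data.frame(
  gender = c("m", "f", "m", "f", "m", "f"),
  age    = c(30, 40, 50, 20, 40, 60),
  height = c(70, 64, 72, 66, 68, 62)
)
num_cols <- c("age", "height")

# split() cuts the data frame into one piece per gender;
# lapply() then summarises every piece in the same way
res2 <- lapply(split(toy[num_cols], toy$gender), function(x) {
  data.frame(mean = colMeans(x), sd = apply(x, 2, sd))
})

# Each list element is a small data frame, e.g. the one for "m"
res2$m
```

The result is a plain named list, so res2$m and res2[["f"]] work the same way as with the by() output.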
Usually we use aggregate for this, which you also might consider:
aggregate(cbind(age, height, weight, wtdesire) ~ gender, cdc, \(x) c(mean=mean(x), sd=sd(x))) |>
do.call(what=data.frame)
# gender age.mean age.sd height.mean height.sd weight.mean weight.sd wtdesire.mean wtdesire.sd
# 1 m 44.27307 16.71994 70.251646 3.009219 189.32271 36.55036 178.61657 26.25121
# 2 f 45.79772 17.58442 64.367750 2.787304 151.66619 34.29752 133.51500 18.96301
The trailing |> do.call(what=data.frame) is only needed to get rid of the matrix columns, which is useful if you want to process the data further.
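The matrix-column behaviour is easy to miss, so here is a minimal sketch on a made-up data frame (the names `toy`, `agg` and `flat` are hypothetical) showing what aggregate returns before and after the do.call step.

```r
# Tiny made-up data set (hypothetical values)
toy <- data.frame(
  gender = c("m", "f", "m", "f"),
  age    = c(30, 40, 50, 20)
)

# FUN returns two values, so aggregate() stores them in ONE matrix column
agg <- aggregate(age ~ gender, toy, function(x) c(mean = mean(x), sd = sd(x)))
ncol(agg)           # 2: "gender" and the matrix column "age"
is.matrix(agg$age)  # TRUE

# do.call(data.frame, ...) expands the matrix into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)         # "gender" "age.mean" "age.sd"
```

After flattening, the columns can be selected by name like any other, for example flat$age.mean.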
Note: R >= 4.1 used.
Data:
source('https://www.openintro.org/stat/data/cdc.R')
or
cdc <- structure(list(genhlth = structure(c(3L, 3L, 1L, 5L, 3L, 3L), levels = c("excellent",
"very good", "good", "fair", "poor"), class = "factor"), exerany = c(0,
1, 0, 0, 1, 1), hlthplan = c(1, 1, 1, 1, 1, 1), smoke100 = c(1,
0, 0, 0, 0, 1), height = c(69, 66, 73, 65, 67, 69), weight = c(224L,
215L, 200L, 216L, 165L, 170L), wtdesire = c(224L, 140L, 185L,
150L, 165L, 165L), age = c(73L, 23L, 35L, 57L, 81L, 83L), gender = structure(c(1L,
2L, 1L, 2L, 2L, 1L), levels = c("m", "f"), class = "factor")), row.names = c("19995",
"19996", "19997", "19998", "19999", "20000"), class = "data.frame")
I want to re-work the data that I have in a dataframe based on adding certain values up, and to do that to all of the (numerical) columns in the same way.
In code, I have created a dataframe that is structured a bit like this:
library(tibble)
df_in <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Three star", 1L, 1, 2L, 1,
"Two star", 5L, 5, 12L, 6,
"One star", 6L, 6, 100L, 50,
"No star", 88L, 88, 86L, 43,
"Empty", 0L, 0, 0L, 0,
"Also empty", 0L, 0, 0L, 0)
In my output I want to have one row that contains the sums of three rows in the input dataframe, another row that contains the sum of two of them, and one that contains the contents of a row from the original (but renamed).
I also want to keep other rows if they have numbers but to drop them if they are empty. I would prefer to do that programmatically, but can do it manually with indexing if need be, so that's a bit less important.
My desired output would be a bit like this:
df_out <- tribble(~names, ~a, ~a_pc, ~b, ~b_pc,
"Any stars", 12L, 12, 114L, 57,
"... of which at least 2 stars", 6L, 6, 14L, 7,
"... of which 3 stars", 1L, 1, 2L, 1,
"No star", 88L, 88, 86L, 43)
For example, that 12L in the top left (meaning column a, "Any stars") is the sum of the 1L, 5L and 6L entries in the a column of the input.
I want to do this merging of rows at this stage in my processing because it's important to do it after I've already calculated the percentage columns (..._pc in the example). You'll see that in the output the percentage columns add to more than 100, which is correct because there is deliberately some 'double counting' - things can correctly show up in multiple rows if they meet the conditions.
Edit to add: Note that the labels I am using in the $names column of the test dataset df_in are not the real labels I have in my real situation. I imagine that a workable solution to this will be able to somehow take a collection of vectors that specify sets of rows and another collection of the same number of strings to label the new rows, and process through them. I might be able to define the sets of rows and the associated names like this:
set_1 <- c("Three star", "Two star", "One star")
set_2 <- c("Three star", "Two star")
set_3 <- "Three star"
set_4 <- "No star"
new_name_1 <- "Any stars"
new_name_2 <- "... of which at least 2 stars"
new_name_3 <- "... of which 3 stars"
new_name_4 <- "No star"
We may use imap to loop over patterns (as some cases are overlapping) and do the group by sum across those columns (after filtering the rows)
library(purrr)
library(stringr)
imap_dfr(setNames(c('(?<!No) star', 'Two|Three', 'Three', 'Empty|No'),
c("Any stars", "... of which at least 2 stars",
"... of which 3 stars", "No star" )), ~ df_in %>%
filter(str_detect(names, regex(.x, ignore_case = TRUE))) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
-output
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
OP's expected
> df_out
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
Update
If the OP is passing a custom set of names
map2_dfr(mget(ls(pattern = '^set_\\d+')),
mget(ls(pattern = '^new_name_\\d+')),
~ df_in %>%
filter(names %in% .x) %>%
group_by(names = .y) %>%
summarise(across(everything(), sum)))
# A tibble: 4 x 5
names a a_pc b b_pc
<chr> <int> <dbl> <int> <dbl>
1 Any stars 12 12 114 57
2 ... of which at least 2 stars 6 6 14 7
3 ... of which 3 stars 1 1 2 1
4 No star 88 88 86 43
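If collecting the sets with mget(ls(...)) feels fragile, the sets and labels can also be kept together in one named list. This is a minimal base R sketch under that assumption; the list name `sets` and the helper variables are hypothetical, and the input is rebuilt as a plain data frame so that the example is self-contained.

```r
df_in <- data.frame(
  names = c("Three star", "Two star", "One star", "No star", "Empty", "Also empty"),
  a     = c(1L, 5L, 6L, 88L, 0L, 0L),
  a_pc  = c(1, 5, 6, 88, 0, 0),
  b     = c(2L, 12L, 100L, 86L, 0L, 0L),
  b_pc  = c(1, 6, 50, 43, 0, 0)
)

# One named list: the name is the new label, the value is the set of rows to add up
sets <- list(
  "Any stars"                     = c("Three star", "Two star", "One star"),
  "... of which at least 2 stars" = c("Three star", "Two star"),
  "... of which 3 stars"          = "Three star",
  "No star"                       = "No star"
)

num_cols <- setdiff(names(df_in), "names")

# For each label, sum the numeric columns over the rows in its set
rows <- lapply(names(sets), function(label) {
  keep <- df_in$names %in% sets[[label]]
  data.frame(names = label, as.list(colSums(df_in[keep, num_cols])))
})
df_out <- do.call(rbind, rows)
df_out
```

Note that colSums() returns doubles, so the integer columns a and b come back as numeric; wrap them in as.integer() if the type matters.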
I have an R dataframe:
city hour value
0 NY 0 12
1 NY 12 24
2 LA 0 3
3 LA 12 9
I want, for each city, to divide each row by the previous one and write the result into a new dataframe. The desired output is:
city ratio
NY 2
LA 3
You can try aggregate like below
aggregate(value ~city,df, function(x) x[-1]/x[1])
which gives
city value
1 LA 3
2 NY 2
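Note that x[-1]/x[1] divides by the first value of each city, which is the same as "the previous row" only when a city has exactly two rows. Here is a minimal base R sketch for the general case; the third row per city is made up for illustration, and the column name `ratio` is an assumption.

```r
# Example with three rows per city (the hour-24 values are hypothetical)
df <- data.frame(
  city  = c("NY", "NY", "NY", "LA", "LA", "LA"),
  hour  = c(0L, 12L, 24L, 0L, 12L, 24L),
  value = c(12, 24, 6, 3, 9, 18)
)
df <- df[order(df$city, df$hour), ]

# Divide every value by the one before it within the same city;
# the first row of each city has no predecessor and gets NA
df$ratio <- ave(df$value, df$city,
                FUN = function(x) x / c(NA, head(x, -1)))

# Drop the first row of each city
res <- df[!is.na(df$ratio), c("city", "hour", "ratio")]
res
```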
Data
> dput(df)
structure(list(city = c("NY", "NY", "LA", "LA"), hour = c(0L,
12L, 0L, 12L), value = c(12L, 24L, 3L, 9L)), class = "data.frame", row.names = c("0",
"1", "2", "3"))
You can use lag to get the previous value, divide each value by its previous value within each city, and drop the NA rows.
library(dplyr)
df %>%
arrange(city, hour) %>%
group_by(city) %>%
reframe(value = value / lag(value)) %>% # reframe() (dplyr >= 1.1.0) allows several rows per group
na.omit()
# city value
# <chr> <dbl>
#1 LA 3
#2 NY 2
In data.table we can do this via shift :
library(data.table)
setDT(df)[order(city, hour), value := value/shift(value), city]
na.omit(df)
I am very new to R but need to use it occasionally for my job. I have a .csv file that I need data from the first 14 rows (March through Sept) from only column 6 (Header is SNWD) to transpose horizontally with 14 new column names. I know how to read in the .csv file, just need help with the actual transpose code.
Current .csv format:
STN,NAME,MO,DAY,YEAR,SNWD
1234,STATION A,3,1,1919,2
1234,STATION A,3,15,1919,3
1234,STATION A,4,1,1919,1
1234,STATION A,4,15,1919,0
1234,STATION A,5,1,1919,6
1234,STATION A,5,15,1919,0
1234,STATION A,6,1,1919,4
1234,STATION A,6,15,1919,0.5
Need the output to look like:
March-1,March-15,April-1,April-15,May-1,May-15,June-1,June-15,July-1,July-15,Aug-1,Aug-15
2,3,1,0,6,0,4,0.5, , , , , ,
Would appreciate any help.
Thanks -K-
We can use
library(dplyr)
library(tidyr)
library(data.table)
library(lubridate)
dat %>%
unite(DATE, YEAR, MO, DAY, sep="-") %>%
mutate(DATE = format(ymd(DATE), "%b-%d"), rn = rowid(STN, NAME, DATE)) %>%
pivot_wider(names_from = DATE, values_from = SNWD)
# A tibble: 1 x 11
# STN NAME rn `Mar-01` `Mar-15` `Apr-01` `Apr-15` `May-01` `May-15` `Jun-01` `Jun-15`
# <int> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 1234 STATION A 1 2 3 1 0 6 0 4 0.5
data
dat <- structure(list(STN = c(1234L, 1234L, 1234L, 1234L, 1234L, 1234L,
1234L, 1234L), NAME = c("STATION A", "STATION A", "STATION A",
"STATION A", "STATION A", "STATION A", "STATION A", "STATION A"
), MO = c(3L, 3L, 4L, 4L, 5L, 5L, 6L, 6L), DAY = c(1L, 15L, 1L,
15L, 1L, 15L, 1L, 15L), YEAR = c(1919L, 1919L, 1919L, 1919L,
1919L, 1919L, 1919L, 1919L), SNWD = c(2, 3, 1, 0, 6, 0, 4, 0.5
)), class = "data.frame", row.names = c(NA, -8L))
dat <- read.csv(stringsAsFactors=F, text="
STN,NAME,MO,DAY,YEAR,SNWD
1234,STATION A,3,1,1919,2
1234,STATION A,3,15,1919,3
1234,STATION A,4,1,1919,1
1234,STATION A,4,15,1919,0
1234,STATION A,5,1,1919,6
1234,STATION A,5,15,1919,0
1234,STATION A,6,1,1919,4
1234,STATION A,6,15,1919,0.5")
with(dat, setNames(SNWD, format(as.Date(paste(YEAR, MO, DAY, sep="-")), format = "%B-%d")))
# March-01 March-15 April-01 April-15 May-01 May-15 June-01 June-15
# 2.0 3.0 1.0 0.0 6.0 0.0 4.0 0.5
Another option:
dat$Date <- with(dat, format(as.Date(paste(YEAR, MO, DAY, sep="-")), format = "%B-%d"))
dat
# STN NAME MO DAY YEAR SNWD Date
# 1 1234 STATION A 3 1 1919 2.0 March-01
# 2 1234 STATION A 3 15 1919 3.0 March-15
# 3 1234 STATION A 4 1 1919 1.0 April-01
# 4 1234 STATION A 4 15 1919 0.0 April-15
# 5 1234 STATION A 5 1 1919 6.0 May-01
# 6 1234 STATION A 5 15 1919 0.0 May-15
# 7 1234 STATION A 6 1 1919 4.0 June-01
# 8 1234 STATION A 6 15 1919 0.5 June-15
t(dat[,c("Date", "SNWD")])
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
# Date "March-01" "March-15" "April-01" "April-15" "May-01" "May-15" "June-01" "June-15"
# SNWD "2.0" "3.0" "1.0" "0.0" "6.0" "0.0" "4.0" "0.5"
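As the output above shows, t() on a data frame with mixed column types returns a character matrix. The following is a minimal sketch that keeps the values numeric and builds labels without the leading zero ("March-1"), assuming the built-in month.name constant is acceptable; the shortened data and the names `labels` and `wide` are hypothetical.

```r
# Shortened stand-in for the CSV data (first four rows only, values made up)
dat <- data.frame(
  MO   = c(3L, 3L, 4L, 4L),
  DAY  = c(1L, 15L, 1L, 15L),
  YEAR = 1919L,
  SNWD = c(2, 3, 1, 0.5)
)

# month.name is a built-in constant, so the labels do not depend on the locale
labels <- paste(month.name[dat$MO], dat$DAY, sep = "-")

# t() on a named numeric vector gives a 1-row numeric matrix;
# as.data.frame() keeps its column names unchanged
wide <- as.data.frame(t(setNames(dat$SNWD, labels)))
wide
```

Because the column names contain hyphens, refer to them with backticks or wide[["March-1"]].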
With tidyverse you can use a very simple pipeline:
library(tidyverse)
data <- read_csv(...) # here insert your CSV file
# 1. make a nice date as YEAR-MO-DAY
# 2. select only the DATE and SNWD columns
# 3. make a wide tibble
data %>%
mutate(DATE = format(as.Date(paste(YEAR, MO, DAY, sep="-")), format = "%B-%d")) %>%
select(DATE, SNWD) %>%
pivot_wider(names_from = DATE, values_from = SNWD)
Note that dates are only column names:
# A tibble: 1 x 8
`March-01` `March-15` `April-01` `April-15` `May-01` `May-15` `June-01` `June-15`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 3 1 0 6 0 4 0.5
I have this dataset, which is structured like this:
Neighborhood, var1, var2, COUNTRY, DAY, categ 1, categ 2
1 700 724 AL 0 YES YES
1 500 200 FR 0 YES NO
....
1 701 659 IT 1 NO YES
1 791 669 IT 1 NO YES
....
2 239 222 GE 0 YES NO
and so on...
So the hierarchy is "Neighborhood > DAY > COUNTRY", and for every neighborhood, every day and every country I have observations of var1, var2, categ1 and categ2.
I'm not interested in analyzing the country for the moment, so I want to aggregate over it: sum var1 and var2 over the country field (the categorical variables categ1 and categ2 are not influenced by the country) and get a dataset that gives var1, var2, categ1 and categ2 for each Neighborhood and each Day.
I'm quite new to R-programming and basically don't know a lot of packages (I would write a program in c++, but I'm forcing myself to learn R)...
So do you have any idea on how to do this?
Data
df1 <- structure(list(Neighborhood = c(1L, 1L, 1L, 1L, 2L),
var1 = c(700L, 500L, 701L, 791L, 239L),
var2 = c(724L, 200L, 659L, 669L, 222L),
COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
DAY = c(0L, 0L, 1L, 1L, 0L),
`categ 1` = c("YES", "YES", "NO", "NO", "YES"),
`categ 2` = c("YES", "NO", "YES", "YES", "NO")),
.Names = c("Neighborhood", "var1", "var2", "COUNTRY", "DAY", "categ 1", "categ 2"),
class = "data.frame", row.names = c(NA, -5L))
EDIT: @akrun
when I try your command, the result is:
aggregate(.~Neighborhood+DAY+COUNTRY, data= df1[!grepl("^categ", names(df1))], mean)
Neighborhood, DAY, COUNTRY, var1, var2
1 1 0 AL 700 724
2 1 0 FR 500 200
3 2 0 GE 239 222
4 1 1 IT 746 664
But (in this example) what I would like to have is:
Neighborhood, DAY, var1, var2
1 1 0 1200 924 //where var1=700+500....
2 1 1 1492 1328
3 2 0 239 222
If we are not interested in the 'categ' columns, we can grep them out and use aggregate
aggregate(.~Neighborhood+DAY, data= df1[!grepl("^(categ|COUNTRY)", names(df1))], sum)
# Neighborhood DAY var1 var2
#1 1 0 1200 924
#2 2 0 239 222
#3 1 1 1492 1328
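Instead of filtering columns out with grepl, the columns to sum can also be named explicitly with cbind() on the left-hand side of the formula. A minimal sketch on the same numbers, with the 'categ' columns left out of the example data for brevity:

```r
df1 <- data.frame(
  Neighborhood = c(1L, 1L, 1L, 1L, 2L),
  var1    = c(700L, 500L, 701L, 791L, 239L),
  var2    = c(724L, 200L, 659L, 669L, 222L),
  COUNTRY = c("AL", "FR", "IT", "IT", "GE"),
  DAY     = c(0L, 0L, 1L, 1L, 0L)
)

# cbind() on the left-hand side names the columns to sum,
# so COUNTRY is simply never mentioned instead of being filtered away
res <- aggregate(cbind(var1, var2) ~ Neighborhood + DAY, data = df1, FUN = sum)
res
```

This form stays correct if further non-numeric columns are added to the data later, because only the named columns enter the aggregation.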
Or using dplyr
library(dplyr)
df1 %>%
group_by(Neighborhood, DAY) %>%
summarise(across(matches("^var"), sum), .groups = "drop") # summarise_each()/funs() are deprecated
# Neighborhood DAY var1 var2
# (int) (int) (int) (int)
#1 1 0 1200 924
#2 1 1 1492 1328
#3 2 0 239 222