Percentages over multiple columns with two factors in R

I'm trying to get relative proportions of tallies that belong to two separate categories. This is an example of the raw data:
# A tibble: 8 x 5
  resp          euRefVoteW1 euRefVoteW2 euRefVoteW3 Paper
  <fct>               <int>       <int>       <int> <fct>
1 Remain                316         290         313 Times
2 Leave                 157         123         159 Times
3 Will Not Vote           2           3           3 Times
4 Don't Know             56          51          55 Times
5 Remain                190         175         199 Telegraph
6 Leave                 339         282         334 Telegraph
7 Will Not Vote           4           3           4 Telegraph
8 Don't Know             70          62          69 Telegraph
It is a tally of two different factors. I'm trying to convert the tally of responses into percentages so it would look something like this:
# A tibble: 8 x 5
  resp          euRefVoteW1 euRefVoteW2 euRefVoteW3 Paper
1 Remain        52%         53%         ..          Times
2 Leave         43%         42%         ..          Times
3 Will Not Vote 1%          2%          .           Times
4 Don't Know    4%          3%          .           Times
5 Remain        35%         35%         .           Telegraph
6 Leave         52%         52%         .           Telegraph
7 Will Not Vote 2%          2%          .           Telegraph
8 Don't Know    11%         11%         .           Telegraph
(Obviously these numbers aren't correct, but I hope it shows that each 4 x 1 section should sum to 100%).
The data frame is already in a format similar to a table, so is there a way to apply the prop.table method to the df? When I tried the following, it refuses because the df is not a clean numeric array. Is there a way around this?
for_stack <- combined_tallies %>%
  group_by(Paper, resp) %>%
  prop.table(margin = 2)
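For reference, prop.table() does behave as expected when I give it a plain numeric matrix built from one of the columns (a small sketch below, with a made-up matrix m just for illustration), so the data frame structure seems to be the problem:
m <- matrix(c(316, 157, 2, 56), ncol = 1,
            dimnames = list(c("Remain", "Leave", "Will Not Vote", "Don't Know"),
                            "euRefVoteW1"))
prop.table(m, margin = 2) * 100  # this column sums to 100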
Here is an rds copy of the dataframe if this helps!
The best answers I could find elsewhere on SO were of no use (Percentage of factor levels by group in R).

I have recreated your data set using dput(), which you are encouraged to use to provide reproducible data for questions on Stack Overflow.
votes <- structure(list(resp = c("Remain", "Leave", "Will Not Vote", "Don't Know",
  "Remain", "Leave", "Will Not Vote", "Don't Know"), ref1 = c(316,
  157, 2, 56, 190, 339, 4, 70), ref2 = c(290, 123, 3, 51, 175,
  282, 3, 62), ref3 = c(313, 159, 3, 55, 199, 334, 4, 69), paper = c("Times",
  "Times", "Times", "Times", "Telegraph", "Telegraph", "Telegraph",
  "Telegraph")), .Names = c("resp", "ref1", "ref2", "ref3", "paper"),
  row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
An alternative approach is to change the structure of your data set before performing your analysis. You are trying to create relative values not across entire columns or rows but for subsets. One way around this is to reshape the data into long format with the tidyverse and do the calculation there. You can always revert to the original structure once you have calculated the percentages.
library(tidyverse)
vote_long <- votes %>%
  pivot_longer(cols = c(ref1, ref2, ref3), names_to = "ref", values_to = "votes")
vote_long
# A tibble: 24 x 4
   resp          paper ref   votes
   <chr>         <chr> <chr> <dbl>
 1 Remain        Times ref1    316
 2 Remain        Times ref2    290
 3 Remain        Times ref3    313
 4 Leave         Times ref1    157
 5 Leave         Times ref2    123
 6 Leave         Times ref3    159
 7 Will Not Vote Times ref1      2
 8 Will Not Vote Times ref2      3
 9 Will Not Vote Times ref3      3
10 Don't Know    Times ref1     56
# … with 14 more rows
# create grouped relative values
vote_long_relative <- vote_long %>%
  group_by(paper, ref) %>%
  mutate(rel_votes = votes / sum(votes) * 100)
vote_wide_relative <- vote_long_relative %>%
  select(-votes) %>%
  pivot_wider(id_cols = c(resp, paper), names_from = "ref", values_from = "rel_votes")
vote_wide_relative
# Groups: paper [2]
  resp          paper       ref1   ref2   ref3
  <chr>         <chr>      <dbl>  <dbl>  <dbl>
1 Remain        Times     59.5   62.1   59.1
2 Leave         Times     29.6   26.3   30
3 Will Not Vote Times      0.377  0.642  0.566
4 Don't Know    Times     10.5   10.9   10.4
5 Remain        Telegraph 31.5   33.5   32.8
6 Leave         Telegraph 56.2   54.0   55.1
7 Will Not Vote Telegraph  0.663  0.575  0.660
8 Don't Know    Telegraph 11.6   11.9   11.4

Maybe this is what you are looking for:
library(tidyverse)
combined_tallies %>%
  group_by(Paper) %>%
  mutate(across(where(is.numeric), ~ .x / sum(.x, na.rm = TRUE) * 100))
# A tibble: 20 x 10
# Groups: Paper [5]
resp euRefVoteW1 euRefVoteW2 euRefVoteW3 euRefVoteW4 euRefVoteW6 euRefVoteW7 euRefVoteW8
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Rema~ 59.5 62.1 59.1 61.0 63.7 60.3 61.2
2 Leave 29.6 26.3 30 29.0 25.2 35.6 35.2
3 Will~ 0.377 0.642 0.566 0.565 0.377 0.377 0.377
4 Don'~ 10.5 10.9 10.4 9.42 10.7 3.77 3.20
...

Related

Time between max and min of cycles

I have a series of 60,000 data points, part of which looks like figure 1 (the whole curve is not as nice and uniform as this image; other parts of the data look like the second image), and there are many cycles with different periods in my data.
I need to calculate the duration of the three red, green, and purple rectangles for each cycle (the time between each maximum and minimum, and the total time of each cycle).
Can you give me some ideas on how to do this in R? Is there any special command or package that I can use?
The premise is that the mean of the data's range is used to split the data into peak and non-peak categories. A running id is then generated to group each segment so an appropriate min or max value can be determined. half_cycle provides the red and green boxes, while full_cycle provides the purple box for max-to-max and min-to-min. There is likely room for improvement, but it gives a method that can be adjusted as needed.
This sample uses random data since no sample data was provided.
set.seed(7)
wave <- c(seq(20, 50, 10), seq(50, 60, 0.5), seq(50, 20, -10))
df1 <- data.frame(time = seq_len(length(wave) * 5),
                  data = as.vector(replicate(5, wave + rnorm(length(wave), sd = 5))))
library(dplyr)
df1 %>%
  mutate(peak = data > mean(range(df1$data))) %>%
  mutate(run = cumsum(peak != lag(peak, default = TRUE))) %>%
  group_by(run) %>%
  mutate(max = max(data), min = min(data)) %>%
  filter((peak == TRUE & data == max) | (peak == FALSE & data == min)) %>%
  mutate(max = if_else(data == max, max, NA_real_),
         min = if_else(data == min, min, NA_real_)) %>%
  ungroup() %>%
  mutate(half_cycle = time - lag(time), full_cycle = time - lag(time, n = 2L))
# A tibble: 11 x 8
    time  data peak    run   max   min half_cycle full_cycle
   <int> <dbl> <lgl> <int> <dbl> <dbl>      <int>      <int>
 1     2  24.0 FALSE     1  NA    24.0         NA         NA
 2    12  67.1 TRUE      2  67.1  NA           10         NA
 3    29  15.1 FALSE     3  NA    15.1         17         27
 4    54  68.5 TRUE      4  68.5  NA           25         42
 5    59  20.8 FALSE     5  NA    20.8          5         30
 6    80  70.6 TRUE      6  70.6  NA           21         26
 7    87  18.3 FALSE     7  NA    18.3          7         28
 8   108  63.1 TRUE      8  63.1  NA           21         28
 9   117  13.8 FALSE     9  NA    13.8          9         30
10   140  64.5 TRUE     10  64.5  NA           23         32
11   145  22.4 FALSE    11  NA    22.4          5         28
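If you need summary statistics rather than the per-peak table, the new columns can be aggregated directly afterwards; a small sketch, assuming the result of the pipeline above was saved to an object called cycles (a name not used above):
cycles %>%
  summarise(mean_half_cycle = mean(half_cycle, na.rm = TRUE),
            mean_full_cycle = mean(full_cycle, na.rm = TRUE))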

How to change all values except for the top values (by frequency) of a categorical variable in R

I have a data frame in R which looks similar to the one below, with the factor variable "Genre":
| Genre | Listening Time |
| ----- | -------------- |
| Rock  | 1:05           |
| Pop   | 3:10           |
| RnB   | 4:12           |
| Rock  | 2:34           |
| Pop   | 5:01           |
| RnB   | 4:01           |
| Rock  | 1:34           |
| Pop   | 2:04           |
I want to leave the top 15 genres (by count) as they are and only rename all other genres that are not among the top 15. Those should be renamed to the word "Other".
In other words - if for example the Genre "RnB" is not among the top 15 Genres, then it should be replaced by the word "Other".
The table I would like to get would look like this then:
| Genre | Listening Time |
| ----- | -------------- |
| Rock  | 1:05           |
| Pop   | 3:10           |
| Other | 4:12           |
| Rock  | 2:34           |
| Pop   | 5:01           |
| Other | 4:01           |
| Rock  | 1:34           |
| Pop   | 2:04           |
How would I approach this?
Thank you!
If you want to look into the tidyverse, you may do something like this. I have tried to mimic your data frame but added some more rows.
You start with the data, group by Genre, order it, and choose the top 5:
library(tidyverse)
set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
    as.character(sample(1:5)),
    ':',
    as.character(sample(0:59))), format = '%H:%M'), format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB'), 120, replace = TRUE)
)
Data %>%
  group_by(Genre) %>%
  arrange(desc(listen)) %>%
  select(listen) %>%
  top_n(5) %>%
  arrange(Genre)
#> Adding missing grouping variables: `Genre`
#> Selecting by listen
#> # A tibble: 15 x 2
#> # Groups: Genre [3]
#> Genre listen
#> <chr> <chr>
#> 1 Pop 05:47
#> 2 Pop 05:47
#> 3 Pop 05:43
#> 4 Pop 05:41
#> 5 Pop 05:28
#> 6 RnB 05:54
#> 7 RnB 05:44
#> 8 RnB 05:43
#> 9 RnB 05:29
#> 10 RnB 05:28
#> 11 Rock 05:54
#> 12 Rock 05:44
#> 13 Rock 05:41
#> 14 Rock 05:29
#> 15 Rock 05:26
Sorry if I have misunderstood what you wanted. If you assign the code to a new data frame, make an anti_join with the original DF, and then mutate Genre to 'others', it should be what you want, I guess.
df <- Data %>%
  group_by(Genre) %>%
  arrange(desc(listen)) %>%
  select(listen) %>%
  top_n(5) %>%
  arrange(Genre)
# make an anti_join and assign 'others' to Genre
anti_join(Data, df) %>%
  mutate(Genre = 'others')
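If you then want one data frame that keeps the top-5 rows unchanged next to the renamed rows, here is a sketch (reusing Data and df from above) that recombines the two halves with bind_rows:
bind_rows(
  semi_join(Data, df),                              # rows belonging to the top 5
  anti_join(Data, df) %>% mutate(Genre = 'others')  # all remaining rows
)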
Next Edit
Hopefully I have now understood your question. You just want to count how often the genres occur in your data and give those that do not belong to the top 15 the name "Others". Maybe I was misled by the data frame you offered, which shows only 3 genres. So I looked some genres up on Wikipedia, invented a few of my own, and used LETTERS to build a DF with a sufficient number of genres.
With count(Genre) the occurrences of the genres are counted and then arranged in descending order. I have then introduced a new column with the row numbers. You can delete this if you want, as it is only there for the next step, which introduces another column named Top15 (I chose to make a new column instead of renaming all the names in Genre), giving every genre that is in place (row) 16 or later the name "Others" and keeping the rest unchanged.
head(20) just prints the first 20 rows of this DF.
library(tidyverse)
set.seed(1)
Data <- data.frame(
  listen = format(as.POSIXlt(paste0(
    as.character(sample(1:5)),
    ':',
    as.character(sample(0:59))), format = '%H:%M'), format = '%H:%M'),
  Genre = sample(c("Rock", "Pop", 'RnB', 'Opera',
                   'Birthday Songs', 'HipHop',
                   'Chinese Songs', 'Napoli Lovesongs',
                   'Benga', 'Bongo', 'Kawito', 'Noise',
                   'County Blues', 'Mambo', 'Reggae',
                   LETTERS[1:24]), 300, replace = TRUE)
)
Data %>%
  count(Genre) %>%
  arrange(desc(n)) %>%
  mutate(place = row_number()) %>%
  mutate(Top15 = ifelse(place > 15, 'Others', Genre)) %>%
  head(20)
#> # A tibble: 20 x 4
#> Genre n place Top15
#> <chr> <int> <int> <chr>
#> 1 N 15 1 N
#> 2 T 13 2 T
#> 3 V 13 3 V
#> 4 K 12 4 K
#> 5 Rock 11 5 Rock
#> 6 X 11 6 X
#> 7 E 10 7 E
#> 8 W 10 8 W
#> 9 Benga 9 9 Benga
#> 10 County Blues 9 10 County Blues
#> 11 G 9 11 G
#> 12 J 9 12 J
#> 13 M 9 13 M
#> 14 Reggae 9 14 Reggae
#> 15 B 8 15 B
#> 16 D 8 16 Others
#> 17 I 8 17 Others
#> 18 P 8 18 Others
#> 19 R 8 19 Others
#> 20 S 8 20 Others
I hope this was what you were looking for.
I can think of a data.table solution. Let's assume your data.frame is called music, then:
library(data.table)
setDT(music)
other_genres <- music[, .N, by = genre][order(-N)][16:.N, genre]
music[genre %chin% other_genres, genre := "other"]
The first line of effective code counts the appearances by genre, sorts it from largest to smallest and selects from the 16 down to the last one, assigning the result to a variable called other_genres.
The second line will check which genres are in that list, and update their name to "other".
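If you want to test those two lines without your real data, here is a minimal sketch that builds a toy music data frame (the single-letter genres are made up):
library(data.table)
set.seed(42)
music <- data.frame(genre = sample(letters[1:20], 500, replace = TRUE))
setDT(music)
other_genres <- music[, .N, by = genre][order(-N)][16:.N, genre]
music[genre %chin% other_genres, genre := "other"]
music[, .N, by = genre][order(-N)]  # 15 genres survive, the rest is lumped as "other"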
library(dplyr)
set.seed(123)
compute_listen_time <- function(n.songs) {
  min <- sample(1:15, n.songs, replace = TRUE)
  sec <- sample(0:59, n.songs, replace = TRUE)
  sec <- ifelse(sec >= 10, sec, paste0("0", sec))  # zero-pad single-digit seconds
  paste0(min, ":", sec)
}
df <- data.frame(
  Genre = sample(c("Rock", "Pop", "RnB", "Rock", "Pop"), 100, replace = TRUE),
  Listen_Time = compute_listen_time(100)
)
df <- add_count(df, Genre, name = "count") %>%
  mutate(
    rank = dense_rank(desc(count)),
    group = ifelse(rank <= 15, Genre, "other")
  )
df
There is a pretty neat solution using the forcats package, applied here to the diamonds dataset to keep only the top 5 clarity values and bundle the rest as "Other":
library(dplyr)
library(forcats)
library(ggplot2)  # provides the diamonds dataset
diamonds %>%
  mutate(clarity2 = fct_lump(fct_infreq(clarity), n = 5))
Result:
# A tibble: 53,940 x 11
carat cut color clarity depth table price x y z clarity2
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <ord>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 SI2
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 SI1
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 VS1
4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 VS2
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 SI2
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 VVS2
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 Other
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 SI1
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 VS2
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 VS1
# … with 53,930 more rows
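Applied to the genre question, the same idea would look like this (a sketch, assuming your data frame is called df and Genre is a factor or character column):
library(dplyr)
library(forcats)
df %>%
  mutate(Genre = fct_lump(fct_infreq(Genre), n = 15, other_level = "Other"))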
Try replacing df with your data frame to check whether you get the desired output:
df <- data.frame(Genre = sample(letters, 1000, replace = TRUE),
                 ListeningTime = runif(1000, 3, 5))
> head(df)
Genre ListeningTime
1 j 3.437013
2 n 4.151121
3 p 3.109044
4 z 4.529619
5 h 4.043982
6 i 3.590463
freq <- table(df$Genre)
sorted <- sort(freq, decreasing=TRUE) # Sorted by frequency of df$Genre
> sorted
 d  x  o  q  r  u  g  i  j  f  a  p  b  e  v  n  w  c  k  m  z  l  h  t  y  s
53 50 46 45 45 42 41 41 40 39 38 38 37 37 37 36 36 35 35 35 35 34 33 33 30 29
not_top_15 <- names(sorted[-(1:15)]) # The Genres not in the top 15
pos <- which(df$Genre %in% not_top_15) # Their position in df
> head(df[pos, ]) # The original data, without the top 15 Genres
Genre ListeningTime
2 n 4.151121
4 z 4.529619
5 h 4.043982
7 s 3.521054
16 w 3.528091
18 h 4.588815
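The actual relabeling is then a single assignment, sketched here as a completion of the example above (converting to character first avoids invalid-factor-level warnings in case Genre is a factor):
df$Genre <- as.character(df$Genre)  # factors would reject the new label
df$Genre[pos] <- "Other"
head(df)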

Create a sequence of values by group between a min and max interval using dplyr

This is surely a basic question, but I couldn't find a way to solve it.
I need to create a sequence of values from a minimum (dds_min) to a maximum (dds_max) per group (fs).
This is my data:
fs <- c("early", "late")
dds_min <- as.numeric(c("47.2", "40"))
dds_max <- as.numeric(c("122", "105"))
dds_min.max <- as.data.frame(cbind(fs, dds_min, dds_max))
And this is what I did....
dss_levels <- dds_min.max %>%
  group_by(fs) %>%
  mutate(dds = seq(dds_min, dds_max, length.out = 100))
I intended to create a new variable (dds) that has to be of length 100 per group and start and end at different values depending on fs. My expectation was to end up with another data frame (dss_levels) with two columns (fs and dds) and 200 values in it.
But I am getting this error.
Error: Column `dds` must be length 1 (the group size), not 100
In addition: Warning messages:
1: In Ops.factor(to, from) : ‘-’ not meaningful for factors
2: In Ops.factor(from, seq_len(length.out - 2L) * by) :
‘+’ not meaningful for factors
Any help would be really appreciated.
Thanks!
I make the sequence length 5 for illustrative purposes; you can change it to 100.
library(dplyr)
library(purrr)
library(tidyr)
dds_min.max %>%
  mutate(dds = map2(dds_min, dds_max, seq, length.out = 5)) %>%
  unnest(cols = dds)
# # A tibble: 10 x 4
# fs dds_min dds_max dds
# <fct> <dbl> <dbl> <dbl>
# 1 early 47.2 122 47.2
# 2 early 47.2 122 65.9
# 3 early 47.2 122 84.6
# 4 early 47.2 122 103.
# 5 early 47.2 122 122
# 6 late 40 105 40
# 7 late 40 105 56.2
# 8 late 40 105 72.5
# 9 late 40 105 88.8
# 10 late 40 105 105
Using this data (make sure your numeric columns are numeric! Don't use cbind!)
fs <- c("early", "late")
dds_min <- c(47.2, 40)
dds_max <- c(122, 105)
dds_min.max <- data.frame(fs, dds_min, dds_max)
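For completeness, on dplyr 1.1.0 or later the grouped formulation from the question also works if mutate() is swapped for reframe(), which allows results of any length per group (a sketch under that version assumption):
library(dplyr)
dds_min.max %>%
  group_by(fs) %>%
  reframe(dds = seq(dds_min, dds_max, length.out = 100))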

Reading fixed width format data into R with entries exceeding column width

I need to use the Annual Building Permits by Metropolitan Area data distributed by the US Census Bureau, which are downloadable here as fixed-width format text files. Here is an excerpt of the file (I've stripped the column names as they aren't in a nice format and can be replaced after reading the file into a data frame):
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0
999 13980 Blacksburg-Christiansburg-Radford,
VA 543 455 0 4 84 3
145 14010 Bloomington, IL 342 214 4 0 124 7
160 15380 Buffalo-Cheektowaga-Niagara Falls,*
NY 1964 931 14 14 1005 68
268 15500 Burlington, NC 1353 938 12 16 387 20
As seen in the excerpt above, many of the entries in the Name column exceed the width of the column (which looks to be 36 characters). I've experimented with the various fixed-width reading functions of both the utils package and readr but can't find a solution that takes these entries into account. Any tips would be much appreciated.
Edit: The original file excerpt was edited by a mod for formatting, and in the process the example entries where the third column's width was exceeded were deleted. I've since updated the excerpt to re-include them and have stripped the column names.
I ran @markdly's code, which was submitted before this edit; it works for all the entries that don't have this issue. I exported the result to a CSV and have included an excerpt below to show what happens with the problematic entries:
"38","999",NA,"13980",NA,"Blacksburg-Christiansburg-Radford,",NA,NA,NA,NA,NA,NA
"39","V","A",NA,NA,NA,"543",455,0,4,84,3
"40","145",NA,"14010",NA,"Bloomington, IL","342",214,4,0,124,7
"51","160",NA,"15380",NA,"Buffalo-Cheektowaga-Niagara Falls,*",NA,NA,NA,NA,NA,NA
"52","N","Y",NA,NA,NA,"1964",931,14,14,1005,68
"53","268",NA,"15500",NA,"Burlington, NC","1353",938,12,16,387,20
Edit 2: Most of the major metro areas I'm actually looking at don't fall into this problem category, so while it would be nice to have the data for the ones that do, if there is no workable solution, would there be a way to remove these entries from the data set altogether?
Edit:
Based on the updated information, the files are not fixed width for some records. In this situation, I think readr::read_table is more useful than read_fwf. The following example is a tidyverse approach to importing and processing one of the source files (tb3u2016.txt). A base approach might involve using something like readLines.
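For the base approach, a rough readLines() sketch; it assumes, as in the excerpt, that a split record is exactly a line whose overflowing Name field leaves it ending in "," or ",*":
lines <- readLines("tb3u2016.txt")
split_at <- grep("[,*]$", lines)          # lines whose record continues below
for (i in rev(split_at)) {                # rev() so deletions don't shift indices
  lines[i] <- paste(lines[i], trimws(lines[i + 1]))
  lines <- lines[-(i + 1)]
}
# 'lines' now holds one record per line and can be parsed further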
Step 1: Read the file in and assign the split records a common record id.
library(tidyverse)
df <- read_table("tb3u2016.txt", col_names = FALSE, skip = 11) %>%
  rownames_to_column() %>%
  mutate(record = if_else(lag(is.na(X2) & rowname > 1), lag(rowname), rowname))
df[37:40, ]
#> # A tibble: 4 x 8
#> rowname X1 X2
#> <chr> <chr> <int>
#> 1 37 999 13900 Bismarck, ND 856 629
#> 2 38 999 13980 Blacksburg-Christiansburg-Radford, NA
#> 3 39 VA 543 455
#> 4 40 145 14010 Bloomington, IL 342 214
#> # ... with 5 more variables: X3 <int>, X4 <int>, X5 <int>, X6 <int>,
#> # record <chr>
Step 2: Combine the split record text, then pull the contents into separate variables using tidyr::extract. Trim whitespace and remove the redundant records.
df <- df %>%
  mutate(new_X1 = if_else(rowname != record, paste0(lag(X1), X1), X1)) %>%
  extract(new_X1, c("CSA", "CBSA", "Name", "Total"), "([0-9]+) ([0-9]+) (.+) ([0-9]+)") %>%
  mutate(Name = trimws(Name)) %>%
  filter((lead(record) != record) | rowname == 1) %>%
  select(CSA, CBSA, Name, Total, X2, X3, X4, X5, X6)
df[37:39, ]
#> # A tibble: 3 x 9
#> CSA CBSA Name Total X2 X3 X4
#> <chr> <chr> <chr> <chr> <int> <int> <int>
#> 1 999 13900 Bismarck, ND 856 629 16 6
#> 2 999 13980 Blacksburg-Christiansburg-Radford,VA 543 455 0 4
#> 3 145 14010 Bloomington, IL 342 214 4 0
#> # ... with 2 more variables: X5 <int>, X6 <int>
Below is a condensed version of the solution provided to an earlier version of the question using readr::read_fwf.
Example data
library(readr)
# example data
txt <- " Num of
Struc-
tures
With
3 and 4 5 Units 5 Units
CSA CBSA Name Total 1 Unit 2 Units Units or more or more
999 10180 Abilene, TX 306 298 8 0 0 0
184 10420 Akron, OH 909 905 0 4 0 0"
write_file(txt, "example.txt")
Solution
col_widths <- c(3, 1, 5, 1, 36, 8, 8, 8, 8, 8, NA)
col_names <- c("CSA", "blank_1", "CBSA", "blank_2", "Name", "Total", "units_1", "units_2",
"units_3_and_4", "units_5_or_more", "num_struc_5_or_more")
df <- read_fwf("example.txt", fwf_widths(col_widths, col_names), skip = 7)
df
#> # A tibble: 2 x 11
#> CSA blank_1 CBSA blank_2 Name Total units_1 units_2
#> <int> <chr> <int> <chr> <chr> <int> <int> <int>
#> 1 999 <NA> 10180 <NA> Abilene, TX 306 298 8
#> 2 184 <NA> 10420 <NA> Akron, OH 909 905 0
#> # ... with 3 more variables: units_3_and_4 <int>, units_5_or_more <int>,
#> # num_struc_5_or_more <int>

How to select data based on a list from a split data frame and then recombine in R

I am trying to do the following. I have a dataset Test:
Item_ID Test_No Category Sharpness   Weight      Viscocity
132     1       3        14.93199362 94.37250417 579.4236727
676     1       4        44.58750591 70.03232054 1829.170727
699     2       5        89.02760079 54.30587287 1169.226863
850     3       6        30.74535903 83.84377678 707.2280513
951     4       237      67.79568019 51.10388484 917.6609965
1031    5       56       74.06697003 63.31274502 1981.17804
1175    4       354      98.9656142  97.7523884  100.7357981
1483    5       726      9.958040999 51.29537311 1222.910211
1529    7       800      64.11430235 65.69780939 573.8266137
1698    9       125      67.83105185 96.53847341 486.9620194
1748    9       1005     49.43602318 52.9139591  1881.740184
2005    9       28       26.89821508 82.12663209 1709.556135
2111    2       76       83.03593144 85.23622731 276.5088502
I would want to split this data based on Test_No and then compute the number of unique Category values per Test_No and also the median Category value. I chose to use split and sapply in the following way, but I am getting an error about a missing parenthesis. Is there anything wrong in my approach? Please find my code below:
function(CatRange){
c(Cat_Count = length(unique(CatRange$Category)), Median_Cat = median(unique(CatRange$Category), na.rm = TRUE) )
}
CatStat <- do.call(rbind,sapply(split(Test, Test$Test_No), function(ModRange)))
Appending to my question: I would want to display the data containing the following information: Test_No, Category, Median_Cat, and Cat_Count.
We can try with dplyr:
library(dplyr)
Test %>%
  group_by(Test_No) %>%
  summarise(Cat_Count = n_distinct(Category),
            Median_Cat = median(Category, na.rm = TRUE),
            Category = toString(Category))
# Test_No Cat_Count Median_Cat Category
# <int> <int> <dbl> <chr>
#1 1 2 3.5 3, 4
#2 2 2 40.5 5, 76
#3 3 1 6.0 6
#4 4 2 295.5 237, 354
#5 5 2 391.0 56, 726
#6 7 1 800.0 800
#7 9 3 125.0 125, 1005, 28
Or if you prefer base R, we can also try with aggregate:
aggregate(Category ~ Test_No, Test,
          function(x) c(Cat_Count = length(unique(x)),
                        Median_Cat = median(x, na.rm = TRUE),
                        Category = toString(x)))
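One subtlety with this aggregate() call: because the supplied function returns a named vector, the result stores Category as a single matrix column. A small sketch (assuming the aggregate result is saved as agg, a name not used above) to flatten it into ordinary columns:
agg <- aggregate(Category ~ Test_No, Test,
                 function(x) c(Cat_Count = length(unique(x)),
                               Median_Cat = median(x, na.rm = TRUE),
                               Category = toString(x)))
agg <- do.call(data.frame, agg)  # splits the matrix column into regular columns
str(agg)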
As far as the function you have written is concerned, I think there are some syntax issues in it:
new_func <- function(CatRange) {
  c(Cat_Count = length(unique(CatRange$Category)),
    Median_Cat = median(unique(CatRange$Category), na.rm = TRUE),
    Category = toString(CatRange$Category))
}
data.frame(t(sapply(split(Test, Test$Test_No), new_func)))
# Cat_Count Median_Cat Category
#1 2 3.5 3, 4
#2 2 40.5 5, 76
#3 1 6 6
#4 2 295.5 237, 354
#5 2 391 56, 726
#7 1 800 800
#9 3 125 125, 1005, 28
