How to create a scale with n numeric values in R

for example I have in a variable the different ages of people:
ed <- c(51,51,44,25,49,52,21,45,29,46,34,33,30,28,50,21,25,21,22,51,25,52,39,53,52,23,20,23,34)
but I want to summarize this set of values in seven rows:
---------------
scale | h
---------------
20 - 25 | 9
26 - 30 | 3
31 - 35 | 3
36 - 40 | 1
41 - 45 | 2
46 - 50 | 3
51 - 55 | 7
Are there any libraries or functions that make it easy to create such scales?
I have tried it with conditionals, but that is very tedious.
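One low-tech option, without any extra libraries, is base R's cut() plus table(). This is a minimal sketch; the break points and labels are assumptions chosen to reproduce the seven ranges above:

```r
ed <- c(51,51,44,25,49,52,21,45,29,46,34,33,30,28,50,21,25,21,22,51,
        25,52,39,53,52,23,20,23,34)

# Right-closed intervals (19,25], (25,30], ..., (50,55] reproduce the
# 20-25, 26-30, ..., 51-55 ranges from the desired table.
breaks <- c(19, seq(25, 55, by = 5))
labs <- c("20 - 25", "26 - 30", "31 - 35", "36 - 40",
          "41 - 45", "46 - 50", "51 - 55")
h <- table(cut(ed, breaks = breaks, labels = labs))
h
```

If equal-sized groups rather than fixed ranges are wanted, Hmisc::cut2() and dplyr::ntile() are common alternatives.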

Related

R, Friedman's test 'not an unreplicated complete block design' error?

I am trying to do a Friedman's test; my data is repeated measures but nonparametric.
The data is organized like this in the CSV; I used RStudio's import-dataset function, so it is a table in RStudio:
score| treatment | day
10 | 1 | 1
20 | 1 | 1
40 | 1 | 1
7 | 2 | 1
100| 2 | 1
58 | 2 | 1
98 | 3 | 1
89 | 3 | 1
40 | 3 | 1
70 | 4 | 1
10 | 4 | 1
28 | 4 | 1
86 | 5 | 1
200| 5 | 1
40 | 5 | 1
77 | 1 | 2
100| 1 | 2
90 | 1 | 2
33 | 2 | 2
15 | 2 | 2
25 | 2 | 2
23 | 3 | 2
54 | 3 | 2
67 | 3 | 2
1 | 4 | 2
2 | 4 | 2
400| 4 | 2
16 | 5 | 2
10 | 5 | 2
90 | 5 | 2
library(readr)
sample_data$treatment <- as.factor(sample_data$treatment) #setting treatment as categorical independent variable
sample_data$day <- as.factor(sample_data$day) #setting day as categorical independent variable
summary(sample_data)
#attach(sample_data) #not sure if this should be used only because according to https://www.sheffield.ac.uk/polopoly_fs/1.714578!/file/stcp-marquier-FriedmanR.pdf it says to use attach for R to use the variables directly
friedman3 <- friedman.test(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day)
summary(friedman3)
I am interested in day and score using Friedman's test.
This is the error I get:
>Error in friedman.test.default(y = sample_data$score, groups = sample_data$treatment, blocks = sample_data$day, :
not an unreplicated complete block design
Not sure what is wrong.
Prior to the Friedman part of the code, I only specified day and treatment as categorical using as.factor().
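The error is raised because friedman.test() expects an unreplicated complete block design: exactly one observation per treatment/block cell, while the data above has three replicates per cell. One possible workaround, sketched below under the assumption that averaging the replicates is acceptable for the analysis, is to aggregate first (the data frame is reconstructed here so the sketch runs on its own):

```r
# Reconstruction of the posted data: 5 treatments x 2 days x 3 replicates.
sample_data <- data.frame(
  score = c(10, 20, 40, 7, 100, 58, 98, 89, 40, 70, 10, 28, 86, 200, 40,
            77, 100, 90, 33, 15, 25, 23, 54, 67, 1, 2, 400, 16, 10, 90),
  treatment = factor(rep(rep(1:5, each = 3), times = 2)),
  day = factor(rep(1:2, each = 15))
)

# Collapse the three replicates in each treatment/day cell to their mean,
# leaving one observation per cell, as friedman.test() requires.
# (Using the mean is an assumption; a median or a different design may
# suit the study better.)
agg <- aggregate(score ~ treatment + day, data = sample_data, FUN = mean)
res <- friedman.test(score ~ treatment | day, data = agg)
res
```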

RStudio: write a for loop that applies a customized function to a vector of inputs and outputs a separate dataframe for each element of that vector

I have a dataframe that includes the lower and upper bound of a few parameters for each category of fruit. It looks something like this:
+----------+-----------+-------+-------+
| Category | Parameter | Upper | Lower |
+----------+-----------+-------+-------+
| Apple | alpha | 10 | 20 |
+----------+-----------+-------+-------+
| Apple | beta | 20 | 30 |
+----------+-----------+-------+-------+
| Orange | alpha | 10 | 20 |
+----------+-----------+-------+-------+
| Orange | beta | 30 | 40 |
+----------+-----------+-------+-------+
| Orange | gamma | 50 | 60 |
+----------+-----------+-------+-------+
| Pear | alpha | 10 | 30 |
+----------+-----------+-------+-------+
| Pear | beta | 20 | 40 |
+----------+-----------+-------+-------+
| Pear | gamma | 20 | 30 |
+----------+-----------+-------+-------+
| Banana | alpha | 40 | 50 |
+----------+-----------+-------+-------+
| Banana | beta | 20 | 40 |
+----------+-----------+-------+-------+
I have written a function that takes this data frame, the fruit name, and the desired length for my sequence:
library(dplyr)  # needed for filter() and %>%
library(purrr)

param_grid <- function(df, fruit, length) {
  df_fruit <- df %>%
    filter(Category == fruit)
  map2(df_fruit$Upper, df_fruit$Lower, seq, length.out = length) %>%
    set_names(df_fruit$Parameter) %>%
    cross_df()
}
Output
param_grid(df, "Apple", length=100)
# A tibble: 10,000 x 2
alpha beta
<dbl> <dbl>
1 10 20
2 10.1 20
3 10.2 20
4 10.3 20
5 10.4 20
6 10.5 20
7 10.6 20
8 10.7 20
9 10.8 20
10 10.9 20
# … with 9,990 more rows
Output
param_grid(df, "Orange", length=100)
# A tibble: 1,000,000 x 3
alpha beta gamma
<dbl> <dbl> <dbl>
1 10 30 50
2 10.1 30 50
3 10.2 30 50
4 10.3 30 50
5 10.4 30 50
6 10.5 30 50
7 10.6 30 50
8 10.7 30 50
9 10.8 30 50
10 10.9 30 50
# … with 999,990 more rows
Output
param_grid(df, "Pear", length=100)
# A tibble: 1,000,000 x 3
alpha beta gamma
<dbl> <dbl> <dbl>
1 10 20 20
2 10.2 20 20
3 10.4 20 20
4 10.6 20 20
5 10.8 20 20
6 11.0 20 20
7 11.2 20 20
8 11.4 20 20
9 11.6 20 20
10 11.8 20 20
# … with 999,990 more rows
Now, I would like to write a for loop to apply this function to multiple fruits:
names <- c("Apple", "Orange", "Pear")
for (i in names) {
  results <- param_grid(df = df, fruit = i, length = 100)
  print(head(results))
}
This works fine, but it just prints all 3 dataframes one after another:
alpha beta
1 20.00000 30
2 19.89899 30
3 19.79798 30
4 19.69697 30
5 19.59596 30
6 19.49495 30
alpha beta gamma
1 20.00000 40 60
2 19.89899 40 60
3 19.79798 40 60
4 19.69697 40 60
5 19.59596 40 60
6 19.49495 40 60
alpha beta gamma
1 30.00000 40 30
2 29.79798 40 30
3 29.59596 40 30
4 29.39394 40 30
5 29.19192 40 30
6 28.98990 40 30
Is there a way I can edit this for loop so that I get 3 separate dataframes for Apple, Orange, and Pear, respectively? Or it could be 3 dataframes, each callable/subsettable within a big nested structure (e.g. DF[["Apple"]], DF[["Orange"]], ...)?
Thanks so much for your help!
The for loop above only prints the results. Instead, we can store them in a list:
lst1 <- vector('list', length(names))
names(lst1) <- names
for (i in names) {
  results <- param_grid(df = df, fruit = i, length = 100)
  lst1[[i]] <- results
}
Then, check the structure of the list created
str(lst1)
We can extract the individual datasets with $ or [[
lst1[[1]]
lst1[[2]]
If we want to create separate objects whose names match the elements of the 'names' vector:
list2env(lst1, .GlobalEnv)
But it is better to keep them in a list and work with that.
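The same list can also be built without an explicit loop using lapply() over a named vector (purrr::map() works the same way). A sketch, with toy data and the param_grid() function from above repeated so it runs on its own:

```r
library(dplyr)
library(purrr)

# Toy version of the bounds dataframe from the question.
df <- data.frame(
  Category  = c("Apple", "Apple", "Orange", "Orange", "Orange"),
  Parameter = c("alpha", "beta", "alpha", "beta", "gamma"),
  Upper     = c(10, 20, 10, 30, 50),
  Lower     = c(20, 30, 20, 40, 60)
)

param_grid <- function(df, fruit, length) {
  df_fruit <- df %>% filter(Category == fruit)
  map2(df_fruit$Upper, df_fruit$Lower, seq, length.out = length) %>%
    set_names(df_fruit$Parameter) %>%
    cross_df()
}

# setNames() names the result, so each dataframe is reachable by fruit name.
fruits <- c("Apple", "Orange")
lst1 <- lapply(setNames(fruits, fruits),
               function(i) param_grid(df = df, fruit = i, length = 10))
```

Afterwards lst1[["Apple"]] and lst1[["Orange"]] hold the individual grids.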

R: stem and leaf plot issue

I have the following vector:
x <- c(54.11, 58.09, 60.82, 86.59, 89.92, 91.61,
95.03, 95.03, 96.77, 98.52, 100.29, 102.07,
102.07, 107.51, 113.10, 130.70, 130.70, 138.93,
147.41, 149.57, 153.94, 158.37, 165.13, 201.06,
208.67, 235.06, 240.53, 251.65,254.47, 254.47, 333.29)
I want to get the following stem and leaf plot in R:
Stem Leaf
5 4 8
6 0
8 6 9
9 1 5 5 6 8
10 0 2 2 7
11 3
13 0 0 8
14 7 9
15 3 8
16 5
20 1 8
23 5
24 0
25 1 4 4
33 3
However, when I try the stem() function in R, I get the following:
> stem(x)
The decimal point is 2 digit(s) to the right of the |
0 | 566999
1 | 000000011334
1 | 55567
2 | 0144
2 | 555
3 | 3
> stem(x, scale = 2)
The decimal point is 1 digit(s) to the right of the |
4 | 48
6 | 1
8 | 7025579
10 | 02283
12 | 119
14 | 7048
16 | 5
18 |
20 | 19
22 | 5
24 | 1244
26 |
28 |
30 |
32 | 3
Question: Am I missing an argument in the stem() function? If not, is there another solution?
I believe what you want is a little non-standard: a stem-and-leaf plot should have equally-spaced numbers/digits on its left, and you're asking for irregularly-spaced ones. I understand your frustration that 54 and 58 are grouped within the 40s, but the stem-and-leaf plot is really just a textual representation of a horizontal histogram, and the numbers on the side reflect the "bins", which will often begin/end outside of the known data. Think of stem(x, scale=2)'s left-side numbers as 40-59, 60-79, etc.
You probably already tried this, but
stem(x, scale=3)
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 7 |
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 12 |
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 17 |
# 18 |
# 19 |
# 20 | 19
# 21 |
# 22 |
# 23 | 5
# 24 | 1
# 25 | 244
# 26 |
# 27 |
# 28 |
# 29 |
# 30 |
# 31 |
# 32 |
# 33 | 3
This is a good start, and is "proper" in that the bins are equally sized.
If you must remove the empty rows (which to me are still statistically significant, relevant, and informative), then, because stem's default is to print to the console, you'll need to capture the console output (this might cause problems in rmarkdown documents), filter out the empty rows, and re-cat them to the console:
cat(Filter(function(s) grepl("decimal|\\|.*[0-9]", s),
           capture.output(stem(x, scale = 3))),
    sep = "\n")
# The decimal point is 1 digit(s) to the right of the |
# 5 | 48
# 6 | 1
# 8 | 7
# 9 | 025579
# 10 | 0228
# 11 | 3
# 13 | 119
# 14 | 7
# 15 | 048
# 16 | 5
# 20 | 19
# 23 | 5
# 24 | 1
# 25 | 244
# 33 | 3
(My grepl regex could likely be improved to handle something akin to "if there is a pipe, then it must be followed by one or more digits", but I think this suffices for now.)
There are some discrepancies, in that you want 6 | 0, but your 60.82 rounds to 61 (ergo the "1"). If you really want the 60.82 to be a 6 | 0, then truncate it with stem(trunc(x), scale=3). It's not exact, but I'm guessing that's because your sample output is hand-jammed.
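If the exact irregular table from the question is required, it can also be assembled by hand: take the truncated value's tens as the stem and its units digit as the leaf, then paste the leaves per stem. A sketch:

```r
x <- c(54.11, 58.09, 60.82, 86.59, 89.92, 91.61,
       95.03, 95.03, 96.77, 98.52, 100.29, 102.07,
       102.07, 107.51, 113.10, 130.70, 130.70, 138.93,
       147.41, 149.57, 153.94, 158.37, 165.13, 201.06,
       208.67, 235.06, 240.53, 251.65, 254.47, 254.47, 333.29)

x <- sort(x)               # leaves must appear in increasing order
stems  <- trunc(x) %/% 10  # e.g. 54.11 -> 5, 107.51 -> 10
leaves <- trunc(x) %% 10   # e.g. 54.11 -> 4, 107.51 -> 7
sl <- tapply(leaves, stems, paste, collapse = " ")
sl
```

This only lists stems that actually occur, which is exactly why it is no longer a "proper" equal-bin stem-and-leaf plot.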

Cumulative Sum of Matrix with Conditions in R [duplicate]

This question already has answers here:
Calculate cumulative sum (cumsum) by group
(5 answers)
Closed 5 years ago.
I was just wondering how I can go about calculating the cumulative sum with conditions on a matrix. Here's what I mean: Let's say we have a matrix with a column called ID and a column called Value as follows:
ID | VALUE
------------------------------
2 | 50
7 | 19
5 | 32
2 | 21
8 | 56
7 | 5
7 | 12
2 | 16
5 | 42
I wish to compute the cumulative sum on this matrix based on the ID column. This means the cumulative sum column (or vector) would look like:
ID | CUMULATIVE SUM
----------------------------------
2 | 50
7 | 19
5 | 32
2 | 71
8 | 56
7 | 24
7 | 36
2 | 87
5 | 74
Is there a way to do this? A search for this hasn't turned up much at all (I've found stuff relevant for data frames/data tables, but I haven't found anything at all when it comes to 'conditions' with matrices), so any help would be appreciated.
There are a number of ways to do this; here I use data.table. I edited your data slightly to use just a , as a separator and removed the dashed divider row:
R> suppressMessages(library(data.table))
R> dat <- fread(" ID , VALUE
2 , 50
7 , 19
5 , 32
2 , 21
8 , 56
7 , 5
7 , 12
2 , 16
5 , 42")
R> dat[, cumsum(VALUE), by=ID]
ID V1
1: 2 50
2: 2 71
3: 2 87
4: 7 19
5: 7 24
6: 7 36
7: 5 32
8: 5 74
9: 8 56
R>
After that, it is a standard group-by (which you can do in many different ways) with a cumulative sum within each group.
The reordering here is automatic because of the grouping. If you must keep the original row order, you can, e.g. by adding the cumulative sum as a new column by reference.
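To preserve the original row order, the grouped cumulative sum can be added as a new column with data.table's := operator instead; a short sketch with the same data (CUMSUM is just an illustrative column name):

```r
suppressMessages(library(data.table))

dat <- fread(" ID , VALUE
2 , 50
7 , 19
5 , 32
2 , 21
8 , 56
7 , 5
7 , 12
2 , 16
5 , 42")

# := adds the grouped cumulative sum as a column by reference,
# leaving the rows in their original order.
dat[, CUMSUM := cumsum(VALUE), by = ID]
dat
```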

R: bin into equal deciles

I have a dataset containing over 6,000 observations, each record having a score ranging from 0-100. Below is a sample:
+-----+-------+
| uID | score |
+-----+-------+
| 1 | 77 |
| 2 | 61 |
| 3 | 74 |
| 4 | 47 |
| 5 | 65 |
| 6 | 51 |
| 7 | 25 |
| 8 | 64 |
| 9 | 69 |
| 10 | 52 |
+-----+-------+
I want to bin them into equal deciles based upon their rank order relative to their peers within the score column with cutoffs being at every 10th percentile, as seen below:
+-----+-------+-----------+----------+
| uID | score | position% | scoreBin |
+-----+-------+-----------+----------+
| 7 | 25 | 0.1 | 1 |
| 4 | 47 | 0.2 | 2 |
| 6 | 51 | 0.3 | 3 |
| 10 | 52 | 0.4 | 4 |
| 2 | 61 | 0.5 | 5 |
| 8 | 64 | 0.6 | 6 |
| 5 | 65 | 0.7 | 7 |
| 9 | 69 | 0.8 | 8 |
| 3 | 74 | 0.9 | 9 |
| 1 | 77 | 1 | 10 |
+-----+-------+-----------+----------+
So far I've tried cut, cut2, tapply, etc. I think I'm on the right logic path, but I have no idea on how to apply them to my situation. Any help is greatly appreciated.
I would use ntile() in dplyr.
library(dplyr)
score<-c(77,61,74,47,65,51,25,64,69,52)
ntile(score, 10)
##[1] 10 5 9 2 7 3 1 6 8 4
scoreBin<- ntile(score, 10)
In base R we can use a combination of .bincode() and quantile():
df$new <- .bincode(df$score,
                   breaks = quantile(df$score, seq(0, 1, by = 0.1)),
                   include.lowest = TRUE)
# uID score new
#1 1 77 10
#2 2 61 5
#3 3 74 9
#4 4 47 2
#5 5 65 7
#6 6 51 3
#7 7 25 1
#8 8 64 6
#9 9 69 8
#10 10 52 4
Here is a method that uses quantile together with cut to get the bins:
df$scoreBin <- as.integer(cut(df$score,
                              breaks = quantile(df$score, seq(0, 1, .1)),
                              include.lowest = TRUE))
Note that include.lowest = TRUE is an argument of cut(), not quantile(); without it, the minimum score falls outside the lowest interval and becomes NA.
as.integer coerces the output of cut (which is a factor) into the underlying integer.
One way to get the position percent is to use rank:
df$position <- rank(df$score) / nrow(df)
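Putting the pieces together, rank() and ntile() can be combined in a single dplyr pipeline; a sketch over the sample scores that reproduces the table in the question:

```r
library(dplyr)

df <- data.frame(uID = 1:10,
                 score = c(77, 61, 74, 47, 65, 51, 25, 64, 69, 52))

result <- df %>%
  mutate(position = rank(score) / n(),    # percentile position in the column
         scoreBin = ntile(score, 10)) %>% # decile bin, 1 (lowest) to 10
  arrange(score)
result
```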
