How to make a frequency table with specific classes in R

I want to make a frequency table from matrix "b", which has 100 observations.
I'm trying to cut all observations into 15 classes, so the frequency table should also include the 'empty classes' that contain no observations.
However, when I use table(), only the classes (levels) that contain observations are included (12 levels).
How can I force the result to have 15 levels?
> b <- matrix(as.matrix(b),ncol=1)
> fivenum(b)
[1] 24.2 24.7 24.9 25.1 25.6
> bcut <- seq(from = 24.2, by =0.1, length.out = 16); bcut
[1] 24.2 24.3 24.4 24.5 24.6 24.7 24.8 24.9 25.0 25.1 25.2 25.3
[13] 25.4 25.5 25.6 25.7
> bgroup <- factor(cut(x = b, breaks = bcut, include.lowest = T))
> levels(bgroup)
[1] "[24.2,24.3]" "(24.3,24.4]" "(24.4,24.5]" "(24.6,24.7]"
[5] "(24.7,24.8]" "(24.8,24.9]" "(24.9,25]" "(25.1,25.2]"
[9] "(25.2,25.3]" "(25.3,25.4]" "(25.4,25.5]" "(25.6,25.7]"
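One likely cause, assuming the goal is simply to keep every interval: cut() already returns a factor with one level per interval, and wrapping the result in factor() again drops the levels that have no observations. A minimal sketch with made-up values (the vector b below is hypothetical, not the asker's data):

```r
# Hypothetical data: six values that fall into six different classes
b <- c(24.25, 24.35, 24.75, 24.95, 25.15, 25.55)
bcut <- seq(from = 24.2, by = 0.1, length.out = 16)

# cut() alone keeps all 15 intervals as levels, including the empty ones
bgroup <- cut(b, breaks = bcut, include.lowest = TRUE)
nlevels(bgroup)          # 15
freq <- table(bgroup)    # empty classes appear with a count of 0
print(freq)

# Re-wrapping in factor() drops the unused levels
nlevels(factor(bgroup))  # 6
```

If the extra factor() call is needed for some other reason, passing the full set of levels explicitly, as in factor(x, levels = levels(x)), should also keep the empty classes.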

Related

I'm having difficulty getting the right data into the matrices of my 4x3x21 array

What I'm having trouble with is this: I'd like the first row of this matrix (mat.a) to be the first row of matrix 1 in my array, the second row to be the first row of matrix 2, and so on. Then the first row of mat.b should be the second row of the first matrix in my array, the second row of mat.b the second row of the second matrix, and so on. This pattern continues for mat.c. The fourth row of each matrix should hold the averages of the values in each column. Also, I'm not allowed to use a for loop.
mat.a <- matrix(c(scores$A1, scores$A2, scores$avgA), ncol = 3,
byrow = FALSE)
mat.b <- matrix(c(scores$B1, scores$B2, scores$avgB), ncol = 3,
byrow = FALSE)
mat.c <- matrix(c(scores$C1, scores$C2, scores$avgC), ncol = 3,
byrow = FALSE)
scores.array<- array(c(mat.a,mat.b, mat.c), dim = c(3,3,21))
> dim(mat.a)
[1] 21 3
> dim(scores)
[1] 21 10
> dim(mat.b)
[1] 21 3
> dim(mat.c)
[1] 21 3
Here is a natural (I think) approach to this problem:
Use array to construct an array with A, B, and C lying along the third dimension.
Use aperm to transpose the array so that A, B, and C lie along the first dimension.
Use colMeans to compute means over the first dimension ("columnwise").
Use abind to attach the means to the transposed array.
nms <- c("A1", "A2", "avgA", "B1", "B2", "avgB", "C1", "C2", "avgC")
z <- array(unlist(scores[nms]), dim = c(21L, 3L, 3L))
zz <- aperm(z, 3:1)
zzz <- abind::abind(zz, colMeans(zz, dims = 1L), along = 1L)
zzz[, , 1:2]
, , 1
[,1] [,2] [,3]
[1,] 28.75775 69.28034 49.01905
[2,] 41.37243 27.43836 34.40540
[3,] 10.28646 89.03502 49.66074
[4,] 26.80555 61.91791 44.36173
, , 2
[,1] [,2] [,3]
[1,] 78.83051 64.05068 71.44060
[2,] 36.88455 81.46400 59.17427
[3,] 43.48927 91.44382 67.46655
[4,] 53.06811 78.98617 66.02714
I have used scores as (very helpfully!) defined by @langtang.
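The same steps can be checked on a tiny example without the abind dependency. This is only a sketch under assumptions: the two-student scores data frame below is invented, and the means are written into a pre-allocated array instead of being attached with abind.

```r
# Hypothetical two-student data set with the same column layout
n <- 2L
scores <- data.frame(A1 = c(10, 20), A2 = c(30, 40),
                     B1 = c(50, 60), B2 = c(70, 80),
                     C1 = c(90, 100), C2 = c(110, 120))
scores$avgA <- rowMeans(scores[c("A1", "A2")])
scores$avgB <- rowMeans(scores[c("B1", "B2")])
scores$avgC <- rowMeans(scores[c("C1", "C2")])

nms <- c("A1", "A2", "avgA", "B1", "B2", "avgB", "C1", "C2", "avgC")

# Step 1: students x (1, 2, avg) x (A, B, C)
z <- array(unlist(scores[nms]), dim = c(n, 3L, 3L))
# Step 2: reverse the dimensions so A, B, C lie along the first one
zz <- aperm(z, 3:1)
# Steps 3 and 4: compute the means over A, B, C and store them as row 4
out <- array(NA_real_, dim = c(4L, 3L, n))
out[1:3, , ] <- zz
out[4, , ] <- colMeans(zz, dims = 1L)

out[, , 1]
```

For the first student, rows 1 to 3 hold the A, B, and C scores and row 4 holds their column means.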
Try this:
library(tidyverse)
# add the averages
scores <- scores %>%
rowwise() %>%
mutate(avg1 = mean(c_across(ends_with("1"))),
avg2 = mean(c_across(ends_with("2"))),
avg3 = mean(c_across(starts_with("avg")))) %>%
# relocate the columns
relocate(ini, A1,B1,C1,avg1, A2,B2,C2,avg2, avgA,avgB,avgC, avg3)
# create scores array
scores.array = array(scores %>% pivot_longer(cols = A1:avg3) %>% pull(value), dim=c(4,3,21))
# add dim names
dimnames(scores.array) = list(c("A","B","C","mean"), c("Midterm", "Final", "mean"), scores$ini)
Output (first two):
> scores.array[,,1:2]
, , ZO
Midterm Final mean
A 28.75775 69.28034 49.01905
B 41.37243 27.43836 34.40540
C 10.28646 89.03502 49.66074
mean 26.80555 61.91791 44.36173
, , UE
Midterm Final mean
A 78.83051 64.05068 71.44060
B 36.88455 81.46400 59.17427
C 43.48927 91.44382 67.46655
mean 53.06811 78.98617 66.02714
Input Data (fake data):
set.seed(123)
scores = data.frame(
A1 = runif(21)*100,
A2 = runif(21)*100,
B1 = runif(21)*100,
B2 = runif(21)*100,
C1 = runif(21)*100,
C2 = runif(21)*100
)
scores <- scores %>% rowwise() %>%
mutate(ini = paste0(sample(LETTERS,2), collapse="")) %>%
relocate(ini)
scores$avgA = apply(scores[,c("A1","A2")],1,mean)
scores$avgB = apply(scores[,c("B1","B2")],1,mean)
scores$avgC = apply(scores[,c("C1","C2")],1,mean)
ini A1 A2 B1 B2 C1 C2 avgA avgB avgC
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ZO 28.8 69.3 41.4 27.4 10.3 89.0 49.0 34.4 49.7
2 UE 78.8 64.1 36.9 81.5 43.5 91.4 71.4 59.2 67.5
3 HS 40.9 99.4 15.2 44.9 98.5 60.9 70.2 30.0 79.7
4 JR 88.3 65.6 13.9 81.0 89.3 41.1 76.9 47.4 65.2
5 JL 94.0 70.9 23.3 81.2 88.6 14.7 82.4 52.3 51.7
6 BJ 4.56 54.4 46.6 79.4 17.5 93.5 29.5 63.0 55.5
7 VL 52.8 59.4 26.6 44.0 13.1 30.1 56.1 35.3 21.6
8 TN 89.2 28.9 85.8 75.4 65.3 6.07 59.1 80.6 35.7
9 QN 55.1 14.7 4.58 62.9 34.4 94.8 34.9 33.8 64.6
10 VC 45.7 96.3 44.2 71.0 65.7 72.1 71.0 57.6 68.9
# ... with 11 more rows

How to create a data summary function?

I'm trying to create a function that summarizes several vectors. The prompt is:
Write a function data_summary which takes three inputs:\
`dataset`: A data frame\
`vars`: A character vector whose elements are names of columns from dataset which the user wants summaries for\
`group.name`: A length one character vector which gives the name of the column from dataset which contains the factor which will be used as a grouping variable\
`var.names`: A character vector of the same length as vars which gives the names that the user would like used as the entries under “Variable” in the resulting output. This should be set equal to vars by default, so the default behavior is to use the column names from dataset.
The output of the function should be a data frame with the following structure:
Column names of the data frame will be:\
`Variable`\
`Missing`\
The `first` level of the factor group.name\
The `second` level of the factor group.name\
…\
The `kth` level of the factor group.name\
`p-value`
I've set up the code already,
data_summary <- function(dataset,vars,group.name,var.names) {
}
but I'm unsure how to proceed because I do not understand what this is trying to accomplish and what the output should look like. There is an example that shows
#data_summary<-function(dataset, vars,group.name, var.name){}
#example
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass")
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass", c("Survival rate", "% Female", "Age", "# siblings/spouses aboard", "# children/parents aboard", "Fare ($)", "Cabin"))
But it really did not help me outside of inputting the arguments for the function.
You can use the dplyr package for this function. I don't know which statistics you want to summarise your data frame with, so I use all the statistics that the summary function from base R returns.
My data:
> NewSKUMatrix
# A tibble: 268,918 x 4
LagerID FilialID CSBID Price
<int> <int> <int> <dbl>
1 233 2578 1005 38.3
2 333 2543 NA 61.0
3 334 2543 NA 15.0
4 335 2543 NA 11.0
5 337 2301 NA 71.0
6 338 2031 NA 37.0
7 338 2044 NA 35.0
8 338 2054 NA 36.0
9 338 2060 NA 37.0
10 338 2063 NA 36.0
# ... with 268,908 more rows
Function:
library(dplyr)

data_summary <- function(data,
variables,
values,
names = NULL) {
if (is.null(x = names)) {
names <- variables
}
data %>%
group_by_at(.vars = variables) %>%
summarise_at(
.vars = values,
.funs = list(
Min. = min,
`1st Qu.` = ~ quantile(x = ., probs = 0.25),
Median = median,
Mean = mean,
`3rd Qu.` = ~ quantile(x = ., probs = 0.75),
Max. = max
)
) %>%
rename_at(.vars = variables,
.funs = ~ names)
}
Output:
data_summary(NewSKUMatrix,
c('LagerID'),
c('Price'),
c('SKU'))
# A tibble: 32,454 x 7
SKU Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 39.0 39.0 39.0 39.0 39.0 39.0
2 18 120. 120. 120. 121. 120. 140.
3 21 289. 289. 289. 289. 289. 289.
4 24 37.0 37.0 37.0 45.2 45.2 70.0
5 25 14.0 14.0 14.0 14.0 14.0 14.0
6 55 30.9 30.9 30.9 30.9 30.9 30.9
7 117 26.9 26.9 26.9 26.9 26.9 26.9
8 118 24.8 24.9 24.9 25.1 25.1 25.7
9 119 24.8 24.8 24.9 25.1 25.3 25.7
10 158 104. 108. 108. 107. 108. 108.
# ... with 32,444 more rows
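The function above groups and summarises by quantiles, which differs from the column layout described in the question (Variable, Missing, one column per factor level, p-value). A rough base R sketch of that layout follows. Everything the prompt leaves open is an assumption here: each level column is filled with the group mean, the p-value comes from a classical one-way ANOVA via oneway.test(), all summarised columns are numeric, and the built-in iris data stands in for titanic4.

```r
data_summary <- function(dataset, vars, group.name, var.names = vars) {
  g <- factor(dataset[[group.name]])
  rows <- lapply(seq_along(vars), function(i) {
    x <- dataset[[vars[i]]]
    # Assumption: each level column holds the mean of x within that level
    means <- vapply(split(x, g), mean, numeric(1), na.rm = TRUE)
    row <- data.frame(Variable = var.names[i],
                      Missing = sum(is.na(x)),
                      as.list(means),
                      check.names = FALSE)
    # Assumption: p-value from a one-way ANOVA across the levels
    row[["p-value"]] <- oneway.test(x ~ g, var.equal = TRUE)$p.value
    row
  })
  do.call(rbind, rows)
}

# Stand-in example with the built-in iris data
res <- data_summary(iris,
                    c("Sepal.Length", "Petal.Width"),
                    "Species",
                    c("Sepal length", "Petal width"))
res
```

The statistic per level and the test would need to change for categorical columns such as cabin in the question's example.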

How to draw a multiline plot in R

I have a dataframe with 6 features like this:
X1 X2 X3 X4 X5 X6
Modern Dog 9.7 21.0 19.4 7.7 32.0 36.5
Golden Jackal 8.1 16.7 18.3 7.0 30.3 32.9
Chinese Wolf 13.5 27.3 26.8 10.6 41.9 48.1
Indian Wolf 11.5 24.3 24.5 9.3 40.0 44.6
Cuon 10.7 23.5 21.4 8.5 28.8 37.6
Dingo 9.6 22.6 21.1 8.3 34.4 43.1
I want to draw a line plot like this:
I'm trying this:
plot(df$X1, type = "o",col = "red", xlab = "Month", ylab = "Rain fall")
lines(c(df$X2, df$X3, df$X4, df$X5, df$X6), type = "o", col = "blue")
But it's only plotting a single variable. I'm sorry if this question is annoying; I'm totally new to R and I just don't know how to get this done. I would really appreciate any help on this.
Thanks in advance.
The easiest way would be to convert your dataset to a long format (e.g. by using the gather function in the tidyr package), and then plotting using the group aesthetic in ggplot.
I recreate your dataset, assuming your group variable is named "Group":
df <- read.table(text = "
Group X1 X2 X3 X4 X5 X6
Modern_Dog 9.7 21.0 19.4 7.7 32.0 36.5
Golden_Jackal 8.1 16.7 18.3 7.0 30.3 32.9
Chinese_Wolf 13.5 27.3 26.8 10.6 41.9 48.1
Indian_Wolf 11.5 24.3 24.5 9.3 40.0 44.6
Cuon 10.7 23.5 21.4 8.5 28.8 37.6
Dingo 9.6 22.6 21.1 8.3 34.4 43.1 ",
header = TRUE, stringsAsFactors = FALSE)
Then convert the dataset to long format and plot:
library(tidyr)
library(ggplot2)
df_long <- df %>% gather(X1:X6, key = "Month", value = "Rainfall")
ggplot(df_long, aes(x = Month, y = Rainfall, group = Group, shape = Group)) +
geom_line() +
geom_point() +
theme(legend.position = "bottom")
See also the answers here: Group data and plot multiple lines.
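For comparison, a base R sketch of the same plot, assuming (as in the answer above) one line per group across X1 to X6. The original attempt draws a single extra line because c(df$X2, df$X3, ...) concatenates all columns into one long vector. Here matplot() draws one line per matrix column; the axis labels are placeholders.

```r
df <- data.frame(
  Group = c("Modern Dog", "Golden Jackal", "Chinese Wolf",
            "Indian Wolf", "Cuon", "Dingo"),
  X1 = c(9.7, 8.1, 13.5, 11.5, 10.7, 9.6),
  X2 = c(21.0, 16.7, 27.3, 24.3, 23.5, 22.6),
  X3 = c(19.4, 18.3, 26.8, 24.5, 21.4, 21.1),
  X4 = c(7.7, 7.0, 10.6, 9.3, 8.5, 8.3),
  X5 = c(32.0, 30.3, 41.9, 40.0, 28.8, 34.4),
  X6 = c(36.5, 32.9, 48.1, 44.6, 37.6, 43.1)
)

# Transpose so that rows are X1..X6 and each column is one group
m <- t(as.matrix(df[, -1]))
colnames(m) <- df$Group

# matplot() draws one line per column of the matrix
matplot(m, type = "o", pch = 1, lty = 1, col = seq_len(ncol(m)),
        xaxt = "n", xlab = "Feature", ylab = "Value")
axis(1, at = seq_len(nrow(m)), labels = rownames(m))
legend("topleft", legend = colnames(m), col = seq_len(ncol(m)),
       lty = 1, pch = 1, cex = 0.8)
```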

How to make the speed profile of a moving object?

I am an R beginner and I am facing the following problem. I have the following data frame:
distance speed
1 61.0 36.4
2 51.4 35.3
3 42.2 34.2
4 33.4 32.8
5 24.9 31.3
6 17.5 28.4
7 11.5 24.1
8 7.1 19.4
9 3.3 16.9
10 0.5 15.5
11 4.4 15.1
12 8.5 15.5
13 13.1 17.3
14 18.8 20.5
15 25.7 24.1
16 33.3 26.3
17 41.0 27.0
18 48.7 27.7
19 56.6 28.4
20 64.8 29.2
21 73.6 31.7
22 83.3 34.2
23 93.4 35.3
The column distance represents the distance of the moving object from a specific point, and the column speed is the object's speed. As you can see, the object first gets closer to the point and then moves away from it. I am trying to plot its speed profile. I tried the following code, but it didn't give me the plot I want (I want to show how the speed changes as the object approaches and then passes the reference point):
ggplot(speedprofile, aes(x = distance, y = speed)) + #speedprofile is the data frame
geom_line(color = "red") +
geom_smooth() +
geom_vline(xintercept = 0) # the vline is the reference line
The plot is the following:
Then I tried manually setting the first 10 distances, the ones before zero (0), to negative values. That gives a plot closer to what I want:
But there is a problem: the distance can't be defined as negative.
To sum up, the expected plot is the following (and I am sorry for the quality).
Do you have any ideas on how to solve this?
Thank you in advance!
You can do something like this to auto-compute the change point (to know when the distance should be negative) and then set the axis labels to be positive.
Your data (in case anyone needs it to answer):
read.table(text="distance speed
61.0 36.4
51.4 35.3
42.2 34.2
33.4 32.8
24.9 31.3
17.5 28.4
11.5 24.1
7.1 19.4
3.3 16.9
0.5 15.5
4.4 15.1
8.5 15.5
13.1 17.3
18.8 20.5
25.7 24.1
33.3 26.3
41.0 27.0
48.7 27.7
56.6 28.4
64.8 29.2
73.6 31.7
83.3 34.2
93.4 35.3", stringsAsFactors=FALSE, header=TRUE) -> speed_profile
Now, compute the "real" distance (negative for approaching, positive for receding):
speed_profile$real_distance <- c(-1, sign(diff(speed_profile$distance))) * speed_profile$distance
Now, compute the X axis breaks ahead of time:
breaks <- scales::pretty_breaks(10)(range(speed_profile$real_distance))
ggplot(speed_profile, aes(real_distance, speed)) +
geom_smooth(linetype = "dashed") +
geom_line(color = "#cb181d", size = 1) +
scale_x_continuous(
name = "distance",
breaks = breaks,
labels = abs(breaks) # make all the labels for the axis positive
)
Provided fonts are working well on your system you could even do:
labels <- abs(breaks)
labels[(!breaks == 0)] <- sprintf("%s\n→", labels[(!breaks == 0)])
ggplot(speed_profile, aes(real_distance, speed)) +
geom_smooth(linetype = "dashed") +
geom_line(color = "#cb181d", size = 1) +
scale_x_continuous(
name = "distance",
breaks = breaks,
labels = labels
)
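The sign trick used to build real_distance can be checked on a short vector. This is a sketch with made-up distances: sign(diff(x)) is -1 while the distance shrinks and +1 once it grows, and the leading -1 covers the first observation, which has no predecessor.

```r
# Hypothetical distances: approaching (3, 2, 1), then receding (2, 3)
distance <- c(3, 2, 1, 2, 3)

# -1 while approaching, +1 while receding
direction <- c(-1, sign(diff(distance)))
real_distance <- direction * distance
real_distance  # -3 -2 -1  2  3
```

One caveat: if two consecutive distances are equal, sign(diff()) is 0 for that step, so the corresponding real distance becomes 0.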

Why subsetting rows with 'apply' in data frame doesn't work in R

I have data that looks like this.
Name|ID|p72|p78|p51|p49|c36.1|c32.1|c32.2|c36.2|c37
hsa-let-7a-5p|MIMAT0000062|9.1|38|12.7|185|8|4.53333333333333|17.9|23|63.3
hsa-let-7b-5p|MIMAT0000063|11.3|58.6|27.5|165.6|20.4|8.5|21|30.2|92.6
hsa-let-7c|MIMAT0000064|7.8|40.2|9.6|147.8|11.8|4.53333333333333|15.4|17.7|62.3
hsa-let-7d-5p|MIMAT0000065|4.53333333333333|27.7|13.4|158.1|8.5|4.53333333333333|14.2|13.5|50.5
hsa-let-7e-5p|MIMAT0000066|6.2|4.53333333333333|4.53333333333333|28|4.53333333333333|4.53333333333333|5.6|4.7|12.8
hsa-let-7f-5p|MIMAT0000067|4.53333333333333|4.53333333333333|4.53333333333333|78.2|4.53333333333333|4.53333333333333|6.8|4.53333333333333|8.9
hsa-miR-15a-5p|MIMAT0000068|4.53333333333333|70.3|10.3|147.6|4.53333333333333|4.53333333333333|21.1|30.2|100.8
hsa-miR-16-5p|MIMAT0000069|9.5|562.6|60.5|757|25.1|4.53333333333333|89.4|142.9|613.9
hsa-miR-17-5p|MIMAT0000070|10.5|71.6|27.4|335.1|6.3|10.1|51|51|187.1
hsa-miR-17-3p|MIMAT0000071|4.53333333333333|4.53333333333333|4.53333333333333|17.2|4.53333333333333|4.53333333333333|9.5|4.53333333333333|7.3
hsa-miR-18a-5p|MIMAT0000072|4.53333333333333|14.6|4.53333333333333|53.4|4.53333333333333|4.53333333333333|9.5|25.5|29.7
hsa-miR-19a-3p|MIMAT0000073|4.53333333333333|11.6|4.53333333333333|42.8|4.53333333333333|4.53333333333333|4.53333333333333|5.5|17.9
hsa-miR-19b-3p|MIMAT0000074|8.3|93.3|15.8|248.3|4.53333333333333|6.3|44.7|53.2|135
hsa-miR-20a-5p|MIMAT0000075|4.53333333333333|75.2|23.4|255.7|6.6|4.53333333333333|43.8|38|130.3
hsa-miR-21-5p|MIMAT0000076|6.2|19.7|18|299.5|6.8|4.53333333333333|49.9|68.5|48
hsa-miR-22-3p|MIMAT0000077|40.4|128.4|65.4|547.1|56.5|33.4|104.9|84.1|248.3
hsa-miR-23a-3p|MIMAT0000078|58.3|99.3|58.6|617.9|36.6|21.4|107.1|125.5|120.9
hsa-miR-24-1-5p|MIMAT0000079|4.53333333333333|4.53333333333333|4.53333333333333|9.2|4.53333333333333|4.53333333333333|4.53333333333333|4.9|4.53333333333333
hsa-miR-24-3p|MIMAT0000080|638.2|286.9|379.5|394.4|307.8|240.4|186|234.2|564
What I want to do is simply pick the rows where all the values are greater than 10.
But why does my code report only the last one?
The data clearly show that more rows satisfy this condition.
> dat<-read.delim("http://dpaste.com/1215552/plain/",sep="|",na.strings="",header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
> dat[apply(dat[, -1], MARGIN = 1, function(x) all(x > 10)), ]
Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186 234.2 564
What is the right way to do it?
Update:
alexwhan's solution works. But I wonder how I can generalize his approach
so that it can handle data with missing values (NA):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
Your call drops only the Name column, so the ID column (which is a factor) is still included in the all(), and the comparison gets messed up. Drop both non-numeric columns:
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10)), ]
# Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
EDIT
For the case where you have NA, you can just use the na.rm argument for all(). Using your new data (from the comment):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10, na.rm = T)), ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 7 hsa-miR-15a-5p MIMAT0000068 NA 70.3 10.3 147.6 NA NA 21.1 30.2 100.8
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
# 20 hsa-miR-25-3p MIMAT0000081 19.3 78.6 25.6 84.3 14.9 16.9 19.1 27.2 113.8
# 21 hsa-miR-26a-5p MIMAT0000082 NA 22.8 31.0 561.2 12.4 NA 67.0 55.8 48.9
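To see why leaving a non-numeric column in the apply() call goes wrong: apply() first converts the data frame to a matrix, and a matrix with any non-numeric column becomes a character matrix, so x > 10 turns into a string comparison. A small sketch with made-up rows (the data frame below is hypothetical), plus a vectorised alternative based on rowSums():

```r
# Hypothetical miniature of the data: two label columns, two numeric columns
dat <- data.frame(Name = c("a", "b", "c"),
                  ID   = c("id1", "id2", "id3"),
                  p1   = c(9.1, 40.4, 638.2),
                  p2   = c(38, 128.4, 286.9))

# Keeping the ID column forces the whole matrix to character
m <- as.matrix(dat[, -1])
is.character(m)  # TRUE

# Vectorised alternative: count the values that fail the test in each row
num <- dat[, -(1:2)]
keep <- rowSums(num <= 10, na.rm = TRUE) == 0
dat[keep, ]
```

With na.rm = TRUE, missing values are ignored in the same way as in the all(x > 10, na.rm = TRUE) version.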
Another idea is to transform your data to long format (or molten format). I think this is an even better way to avoid the missing values problem:
library(reshape2)
dat.m <- melt(dat,id.vars=c('Name','ID'))
dat.m$value <- as.numeric(dat.m$value)
library(plyr)
res <- ddply(dat.m,.(Name,ID), summarise, keepme = all(value > 10))
res[res$keepme,]
# Name ID keepme
# 16 hsa-miR-22-3p MIMAT0000077 TRUE
# 17 hsa-miR-23a-3p MIMAT0000078 TRUE
# 19 hsa-miR-24-3p MIMAT0000080 TRUE
