How to create a data summary function? - r

I'm trying to create a function that summarizes several vectors and the prompt is
Write a function data_summary which takes three inputs:\
`dataset`: A data frame\
`vars`: A character vector whose elements are names of columns from dataset which the user wants summaries for\
`group.name`: A length one character vector which gives the name of the column from dataset which contains the factor which will be used as a grouping variable\
`var.names`: A character vector of the same length as vars which gives the names that the user would like used as the entries under “Variable” in the resulting output. This should be set equal to vars by default, so the default behavior is to use the column names from dataset.
The output of the function should be a data frame with the following structure:
Column names of the data frame will be:\
`Variable`\
`Missing`\
The `first` level of the factor group.name\
The `second` level of the factor group.name\
…\
The `kth` level of the factor group.name\
`p-value`
I've set up the code already,
data_summary <- function(dataset,vars,group.name,var.names) {
}
but I'm unsure how to proceed because I do not understand what this is trying to accomplish and what the output should look like. There is an example that shows
#data_summary<-function(dataset, vars,group.name, var.name){}
#example
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass")
#data_summary(titanic4, c("survived", "female", "age", "sibsp", "parch", "fare", "cabin"), "pclass", c("Survival rate", "% Female", "Age", "# siblings/spouses aboard", "# children/parents aboard", "Fare ($)", "Cabin"))
But it really did not help me outside of inputting the arguments for the function.

You can use the dplyr package for this function. I don't know which functions you want to summarise your data frame with, so I used all the statistics that the base summary() function returns.
My data:
> NewSKUMatrix
# A tibble: 268,918 x 4
LagerID FilialID CSBID Price
<int> <int> <int> <dbl>
1 233 2578 1005 38.3
2 333 2543 NA 61.0
3 334 2543 NA 15.0
4 335 2543 NA 11.0
5 337 2301 NA 71.0
6 338 2031 NA 37.0
7 338 2044 NA 35.0
8 338 2054 NA 36.0
9 338 2060 NA 37.0
10 338 2063 NA 36.0
# ... with 268,908 more rows
Function:
library(dplyr)

data_summary <- function(data,
                         variables,
                         values,
                         names = NULL) {
  # Default to the grouping column names when no display names are given
  if (is.null(x = names)) {
    names <- variables
  }
  data %>%
    group_by_at(.vars = variables) %>%
    summarise_at(
      .vars = values,
      .funs = list(
        Min. = min,
        `1st Qu.` = ~ quantile(x = ., probs = 0.25),
        Median = median,
        Mean = mean,
        `3rd Qu.` = ~ quantile(x = ., probs = 0.75),
        Max. = max
      )
    ) %>%
    rename_at(.vars = variables,
              .funs = ~ names)
}
Output:
data_summary(NewSKUMatrix,
c('LagerID'),
c('Price'),
c('SKU'))
# A tibble: 32,454 x 7
SKU Min. `1st Qu.` Median Mean `3rd Qu.` Max.
<int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17 39.0 39.0 39.0 39.0 39.0 39.0
2 18 120. 120. 120. 121. 120. 140.
3 21 289. 289. 289. 289. 289. 289.
4 24 37.0 37.0 37.0 45.2 45.2 70.0
5 25 14.0 14.0 14.0 14.0 14.0 14.0
6 55 30.9 30.9 30.9 30.9 30.9 30.9
7 117 26.9 26.9 26.9 26.9 26.9 26.9
8 118 24.8 24.9 24.9 25.1 25.1 25.7
9 119 24.8 24.8 24.9 25.1 25.3 25.7
10 158 104. 108. 108. 107. 108. 108.
# ... with 32,444 more rows
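The function above summarises value columns per group, but it does not produce the Variable / Missing / one-column-per-level / p-value layout that the assignment describes. A minimal base R sketch of that layout follows. Two things are assumptions, because the prompt does not specify them: each level column holds the group mean, and the p-value comes from a Kruskal-Wallis test. It also assumes every column in `vars` is numeric, and it uses `mtcars` for the usage example because `titanic4` is not available here.

```r
# Hypothetical sketch of the assignment's data_summary().
# ASSUMPTIONS (not stated in the prompt): each level column holds the group
# mean, and the p-value comes from a Kruskal-Wallis test across the groups.
data_summary <- function(dataset, vars, group.name, var.names = vars) {
  stopifnot(length(var.names) == length(vars))
  group <- factor(dataset[[group.name]])

  rows <- lapply(seq_along(vars), function(i) {
    x <- dataset[[vars[i]]]
    # One mean per factor level, in level order
    means <- vapply(split(x, group), mean, numeric(1), na.rm = TRUE)

    row <- data.frame(Variable = var.names[i], Missing = sum(is.na(x)))
    row[names(means)] <- as.list(means)            # one column per level
    row[["p-value"]] <- kruskal.test(x, group)$p.value
    row
  })

  # Stack the one-row data frames into the final table
  do.call(rbind, rows)
}

# Usage with a built-in dataset; the display names are optional
data_summary(mtcars, c("mpg", "hp"), "cyl",
             c("Miles per gallon", "Horsepower"))
```

Making `var.names = vars` the default argument gives the behaviour the prompt asks for: the column names are used unless the caller supplies display names.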


R { : the condition has length > 1 [duplicate]

this is my first time asking a question in StackOverflow and also my first time coding using R
So, please understand if my explanation is unclear :(
I now have a data frame (data2000) that is 1092 x 6
The headers are year, month, predictive horizon, name of the company, GDP Price Index, and Consumer Price Index
I want to create vectors on gdppi and cpi for each month
My ultimate goal is to get the mean, median, interquartile range, and 90th-10th percentile range for each month and I thought this is the first step
and this is the code that I have written so far
library(tidyverse)
data2000 <- read.csv("")
for (i in 1:12) {
  i_gdppi <- c()
  i_cpi <- c()
}
for (i in 1:12) {
  if (data2000$month == i) {
    append(i_gdppi,data2000[,gdppi])
    append(i_cpi, data2000[,cpi])
  }
}
Unfortunately, I got an error message saying that
Error in if (data2000$month == 1) { : the condition has length > 1
I googled it myself and learned that an if statement cannot take a vector as its condition
How can I solve this problem?
Thank you so much and have a nice day!
If you use the group_by() function then it takes care of sub-setting your data:
library(dplyr)
# Dummy data
data2000 <- data.frame(month = rep(c(1:12), times = 2), gdppi = runif(24) * 100)
data2000 |>
  group_by(month) |>
  summarise(
    mean = mean(gdppi),
    q10 = quantile(gdppi, probs = .10),
    q25 = quantile(gdppi, probs = .25)
  ) # Add the other percentiles, as needed
Gives this
# A tibble: 12 x 4
month mean q10 q25
<int> <dbl> <dbl> <dbl>
1 1 12.5 3.44 6.83
2 2 34.7 7.15 17.5
3 3 37.8 22.1 28.0
4 4 30.3 19.0 23.2
5 5 65.7 62.2 63.5
6 6 60.7 38.7 47.0
7 7 43.0 38.2 40.0
8 8 77.9 60.7 67.1
9 9 56.3 44.0 48.6
10 10 53.1 19.6 32.2
11 11 63.8 40.6 49.3
12 12 59.0 49.2 52.9
If you have both years and months, use group_by(year, month) instead.
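The question also asks for the median, the interquartile range, and the 90th-10th percentile range per month. A sketch of how those could be added to the same summarise() call is below. The dummy data and the output column names (`iqr`, `p90_p10`) are made up for illustration, and the native pipe `|>` needs R 4.1 or later.

```r
library(dplyr)

set.seed(1)  # dummy data in the same shape as the example above
data2000 <- data.frame(
  month = rep(1:12, times = 2),
  gdppi = runif(24) * 100
)

monthly <- data2000 |>
  group_by(month) |>
  summarise(
    mean    = mean(gdppi),
    median  = median(gdppi),
    iqr     = IQR(gdppi),  # 75th minus 25th percentile
    p90_p10 = unname(quantile(gdppi, 0.90) - quantile(gdppi, 0.10))
  )

monthly
```

The same call can be repeated for `cpi`, or both columns can be handled at once with `across(c(gdppi, cpi), ...)`.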

Using map2_dfr() For Specific Column Addition and Subtraction R

Hello, I have a data frame that is 2000x56. I would like to do a simple subtraction of specific columns. For example, I would like to subtract column 3 from column 1, column 7 from column 5, etc.
Here is a sample of the data set.
df= structure(list(c(48.9518, 47.9639, 47.5751, 46.5795, 46.6301,
45.0705, 43.7893, 43.8325, 46.507, 45.1127, 46.2437, 44.6545,
43.5113, 43.2287, 43.6998, 41.44, 41.44, 41.8239, 43.2681, 42.5079,
40.315), c(51.9657, 50.928, 50.559, 50.477, 51.8529, 47.506,
49.0126, 47.8382, 57.6266, 59.9311, 71.9462, 44.6545, 43.5113,
43.2287, 43.6998, 41.44, 41.44, 41.7783, 43.6673, 42.915, 40.4284
), c(42.0552, 40.141, 40.07, 40.3302, 39.7687, 39.3804, 40.5853,
40.2478, 40.7404, 36.0079, 39.3361, 38.6883, 33.1306, 34.2174,
34.0593, 34.4541, 32.1919, 36.2109, 37.0591, 35.7394, 34.8065
), c(43.5527, 40.6115, 41.1305, 42.6484, 42.1938, 41.2828, 41.8979,
41.9331, 47.0511, 48.0175, 49.5343, 45.5063, 33.1306, 34.2174,
34.0593, 34.4541, 32.0264, 36.1705, 37.2596, 35.5938, 34.3885
), c(56.3464, 53.5964, 55.2791, 54.7751, 53.6983, 48.2984, 46.8343,
50.339, 54.6205, 54.6327, 53.7313, 51.839, 49.9128, 60.1649,
64.1637, 57.4661, 57.4661, 57.9187, 51.9147, 51.5786, 49.357),
c(61.6417, 57.054, 58.8402, 60.6182, 58.3043, 48.7071, 47.5466,
52.9527, 67.9061, 64.3576, 63.6387, 61.2588, 43.1908, 59.254,
63.8611, 57.4661, 57.4661, 58.6671, 54.097, 53.8527, 51.4929
), c(62.3702, 58.9045, 58.1827, 59.4045, 57.7552, 50.4304,
45.2969, 51.3944, 55.3861, 54.3857, 50.634, 49.1729, 51.0196,
56.8711, 59.2268, 56.1792, 56.812, 53.9583, 52.6343, 49.8832,
47.8319)), row.names = c(NA, -21L), class = c("tbl_df", "tbl",
"data.frame"))
head(df)
# A tibble: 6 x 7
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 49.0 52.0 42.1 43.6 56.3 61.6 62.4
2 48.0 50.9 40.1 40.6 53.6 57.1 58.9
3 47.6 50.6 40.1 41.1 55.3 58.8 58.2
4 46.6 50.5 40.3 42.6 54.8 60.6 59.4
5 46.6 51.9 39.8 42.2 53.7 58.3 57.8
6 45.1 47.5 39.4 41.3 48.3 48.7 50.4
I start by creating 2 vectors with the column numbers I would like to subtract.
First = seq(1, ncol(df), 4)
Second = seq(3, ncol(df), 4)
print(First)
1, 5
print(Second)
3, 7
Now I create a loop using map2 from purrr. I would like the output to be a dataframe so I use map2_dfr() from purrr
map2_dfr(First, Second, ~df[,.x]-df[,.y])
The result is a tibble with nothing.
I have tried creating a function inside map2_dfr() with no luck.
map2_dfr(First, Second, function(x, y){df[,x]-df[,y]})
My expected output is a data frame where
Column1 = df[,1]-df[,3]
Column2 = df[,5]-df[,7]
Thank you.
The issue is that the dataset doesn't have any column names. Assign them first:
colnames(df) <- paste0("col", seq_along(df))
Now, applying the OP's code should work fine.
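As a cross-check that does not depend on purrr, the same column-wise subtraction can be sketched in base R. This is an alternative technique, not the answer's method. The small data frame and its column names (`a` to `g`) are invented for illustration; `First` and `Second` are built exactly as in the question.

```r
# Toy data frame with 7 named columns (values are invented for illustration)
df <- data.frame(
  a = c(10, 20, 30), b = c(0, 0, 0), c = c(1, 2, 3), d = c(0, 0, 0),
  e = c(5, 5, 5),    f = c(0, 0, 0), g = c(1, 1, 1)
)

First  <- seq(1, ncol(df), 4)  # columns 1, 5
Second <- seq(3, ncol(df), 4)  # columns 3, 7

# Subtracting two equally sized data frames works column by column
result <- df[First] - df[Second]
names(result) <- paste0("Column", seq_along(First))
result
```

Here `Column1` is `df[, 1] - df[, 3]` and `Column2` is `df[, 5] - df[, 7]`, which matches the expected output described in the question.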

Export results from split and lapply to a csv or Excel file

I'm using the split() and lapply() functions to run Mann Kendall trend tests in bulk. In the code below, split() separates the results (ConcLow) by Analyte (water quality parameter). Then lapply() runs MannKendall and summary for each. The output goes to the console (example shown below the code), but I'd like it to go into an Excel or csv document so I can work with it. Ideally the Excel document would have the analyte (TOC for example) in the first column, then 2nd column = tau value, 3rd column = pvalue. Then the next tab or following columns would display results from the summary function. Any assistance you can provide is greatly appreciated! I'm quite new to R.
mk.analyte <- split(BarkTop$ConcLow, BarkTop$Analyte)
lapply(mk.analyte, MannKendall)
lapply(mk.analyte, summary)
Output for each analyte looks like this (abbreviated here, but it's a long list):
$TOC
tau = 0.0108, 2-sided pvalue =0.8081
$TOC
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.378 2.054 2.255 2.434 2.600 4.530
Data look like this:
Date Location Analyte ConcLow Units
5/8/2000 Barker Res. Hardness 3.34 mg/L (as CaCO3)
11/24/2000 Barker Res. Hardness 9.47 mg/L (as CaCO3)
6/12/2001 Barker Res. Hardness 1.4 mg/L (as CaCO3)
12/29/2001 Barker Res. Hardness 21.9 mg/L (as CaCO3)
7/17/2002 Barker Res. Fe (diss 81 ug/L
2/2/2003 Barker Res. Fe (diss 90 ug/L
8/21/2003 Barker Res. Fe (diss 0.08 ug/L
3/8/2004 Barker Res. Fe (diss 15.748 ug/L
9/24/2004 Barker Res. TSS 6.2 mg/L
4/12/2005 Barker Res. TSS 8 mg/L
10/29/2005 Barker Res. TSS 10 mg/L
In my own opinion, I would use the tidyverse, as it is easier to read.
Short way:
library(tidyverse)

# Sample data
set.seed(42)
df <- data.frame(
  Location = replicate(1000, sample(letters[1:15], 1)),
  Analyte = replicate(1000, sample(c("Hardness", "TSS", "Fe"), 1)),
  ConcLow = runif(1000, 1, 30))
# Solution: nest per Location/Analyte, tidy each test and summary, write one CSV
df %>%
  nest(data = -c(Location, Analyte)) %>%
  mutate(
    mannKendall = purrr::map(data, function(x) {
      broom::tidy(Kendall::MannKendall(x$ConcLow))}),
    sumData = purrr::map(data, function(x) {
      broom::tidy(summary(x$ConcLow))})) %>%
  select(-data) %>%
  unnest(c(mannKendall, sumData)) %>%
  write_excel_csv(file = "mydata.csv")
# What the table looks like:
# A tibble: 45 x 13
Location Analyte statistic p.value kendall_score denominator var_kendall_sco~ minimum q1 median
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 n Fe 0.264 0.0907 61 231. 1258. 1.38 14.4 20.6
2 o Hardne~ 0.0870 0.568 24 276. 1625. 2.02 9.52 18.3
3 e Fe -0.108 0.499 -25 231. 1258. 1.14 9.24 15.9
4 m TSS -0.00654 1 -1 153 697 2.19 5.89 10.4
5 j TSS -0.158 0.363 -27 171. 817 1.20 6.44 12.8
6 h Hardne~ 0.0909 0.466 48 528 4165. 4.28 11.1 19.4
7 l TSS -0.0526 0.780 -9 171. 817 5.39 12.5 21.1
8 c Fe -0.0736 0.652 -17 231. 1258. 1.63 5.87 10.6
9 j Hardne~ 0.415 0.0143 71 171. 817 4.50 11.7 15.4
10 k Fe -0.146 0.342 -37 253. 1434. 2.68 12.3 15.4
# ... with 35 more rows, and 3 more variables: mean <dbl>, q3 <dbl>, maximum <dbl>
Long way
It's a bit roundabout, but you can also do something like the code below.
Please note that I used a subset of the mtcars dataset for my solution.
library(tidyverse)
df <- mtcars %>%
  select(cyl, disp)
wilx <- df %>%
  split(.$cyl) %>%
  map(function(x) {broom::tidy(wilcox.test(x$disp, paired = FALSE,
                                           exact = FALSE))})
sumData <- df %>%
  split(.$cyl) %>%
  map(function(x) {summary(x$disp)})
# Write one file per group for the test results and one for the summaries
for (i in seq_along(wilx)) {
  write_excel_csv(as.data.frame(wilx[i]), file = paste0(getwd(), "/wilx", i, ".csv"))
  write_excel_csv(as.data.frame(unlist(sumData[i])), file = paste0(getwd(), "/sumData", i, ".csv"))
}
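For readers who want to stay with the split() and lapply() pattern from the question, here is a rough base R sketch that collects one row per analyte and writes a single CSV. Several things are assumptions: the tiny `BarkTop` stand-in is invented, the helper `one_row()` is hypothetical, and Kendall's tau is computed with `cor.test()` against the observation order as a base R stand-in for `Kendall::MannKendall()`, so its p-values may differ slightly from that package's output.

```r
# Small stand-in for the BarkTop data from the question (values are invented)
BarkTop <- data.frame(
  Analyte = rep(c("Hardness", "TSS"), each = 5),
  ConcLow = c(3.34, 9.47, 1.4, 21.9, 12.1, 6.2, 8, 10, 7.5, 9.1)
)

mk.analyte <- split(BarkTop$ConcLow, BarkTop$Analyte)

# Hypothetical helper: tau against observation order, its two-sided p-value,
# then the six numbers returned by summary()
one_row <- function(x) {
  kt <- cor.test(x, seq_along(x), method = "kendall", exact = FALSE)
  c(tau = unname(kt$estimate), pvalue = kt$p.value, summary(x))
}

# Bind the named vectors into a matrix, one row per analyte
m <- do.call(rbind, lapply(mk.analyte, one_row))
results <- data.frame(Analyte = rownames(m), m, check.names = FALSE)
rownames(results) <- NULL

# Written to a temporary folder here; point this at any path you like
out_file <- file.path(tempdir(), "mk_results.csv")
write.csv(results, out_file, row.names = FALSE)
```

The resulting file has the analyte in the first column, tau in the second, and the p-value in the third, followed by the summary statistics, and it opens directly in Excel.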

how to use map function in r to find the range and quantile

I first simulated 500 samples of size 55 in the normal distribution.
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
1) For each sample, I want the mean, median, range, and third quartile. Then I need to store these together in a data frame.
This is what I have. I am not sure about the range or the quantile. I tried sapply and lapply but not sure how they work.
stats <- data.frame(
means = map_dbl(samples,mean),
medians = map_dbl(samples,median),
sd= map_dbl(samples,sd),
range= map_int(samples, max-min),
third_quantile=sapply(samples,quantile,type=3)
)
2) Then plot the sampling distribution (histogram) of the means.
I try to plot but I don't get how to get the mean
stats <- gather(stats, key = "Trials", value = "Mean")
ggplot(stats,aes(x=Trials))+geom_histogram()
3) Then I want to plot the other three statistics in (three separate graphs) of a single plotting window.
I know I need to use something like gather and facet_wrap, but I am not sure how to do it.
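You were almost there. All that is needed is to define anonymous functions wherever there are errors: max-min is not a function, the range is a double rather than an integer (so map_int fails), and quantile() returns five values unless probs is given.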
library(tidyverse)
set.seed(1234) # Make the results reproducible
samples <- replicate(500, rnorm(55,mean=50, sd=10), simplify = FALSE)
str(samples)
stats <- data.frame(
means = map_dbl(samples, mean),
medians = map_dbl(samples, median),
sd = map_dbl(samples, sd),
range = map_dbl(samples, function(x) diff(range(x))),
third_quantile = map_dbl(samples, function(x) quantile(x, probs = 3/4, type = 3))
)
str(stats)
#'data.frame': 500 obs. of 5 variables:
# $ means : num 49.8 51.5 52.2 50.2 51.6 ...
# $ medians : num 51.5 51.7 51 51.1 50.5 ...
# $ sd : num 9.55 7.81 11.43 8.97 10.75 ...
# $ range : num 38.5 37.2 54 36.7 60.2 ...
# $ third_quantile: num 57.7 56.2 58.8 55.6 57 ...
The map_dbl functions you're using are definitely nice, but if you're trying to get a data frame in the end anyway, you might have an easier time converting the list into a data frame at the beginning, then taking advantage of some dplyr functions.
I'm first mapping over the list, creating tibbles, and binding it together with an added ID. The conversion creates a column value of the sample values. summarise_at lets you take a list of functions—supplying names in the list sets the names in the resultant data frame. You can use purrr's ~. notation to define these functions inline where needed. Cuts down on the number of times you have to map_dbl and so on.
library(tidyverse)
stats <- samples %>%
map_dfr(as_tibble, .id = "sample") %>%
group_by(sample) %>%
summarise_at(vars(value),
.funs = list(mean = mean, median = median, sd = sd,
range = ~(max(.) - min(.)),
third_quartile = ~quantile(., probs = 0.75)))
head(stats)
#> # A tibble: 6 x 6
#> sample mean median sd range third_quartile
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 45.0 44.4 8.71 47.6 48.6
#> 2 10 51.0 52.0 9.55 49.3 56.2
#> 3 100 51.6 52.2 10.4 60.7 58.1
#> 4 101 51.6 51.1 9.92 37.6 57.2
#> 5 102 49.1 48.2 9.65 39.8 57.0
#> 6 103 52.2 51.3 10.1 47.4 58.5
Next, in your code you gathered the data—which is often the solution folks need on SO—but if you're only trying to show the mean column, you can work with it as is.
ggplot(stats, aes(x = mean)) +
geom_histogram()
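The third part of the question (the other three statistics in one plotting window) can be sketched with a long-format data frame and facet_wrap(). This sketch uses pivot_longer(), the newer tidyr replacement for gather(). The column names `statistic` and `value`, the `bins = 30` setting, and the free x scales are choices made here for illustration.

```r
library(tidyverse)

set.seed(1234)
samples <- replicate(500, rnorm(55, mean = 50, sd = 10), simplify = FALSE)

stats <- data.frame(
  mean           = map_dbl(samples, mean),
  median         = map_dbl(samples, median),
  range          = map_dbl(samples, function(x) diff(range(x))),
  third_quartile = map_dbl(samples, function(x) quantile(x, probs = 0.75))
)

# Long format: one row per (sample, statistic) pair, leaving the mean out
stats_long <- stats %>%
  select(-mean) %>%
  pivot_longer(everything(), names_to = "statistic", values_to = "value")

# One histogram panel per statistic, each with its own x axis
p <- ggplot(stats_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ statistic, scales = "free_x")
print(p)
```

Free x scales matter here because the range sits on a very different scale from the median and the third quartile.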

Why subsetting rows with 'apply' in data frame doesn't work in R

I have a data that looks like this.
Name|ID|p72|p78|p51|p49|c36.1|c32.1|c32.2|c36.2|c37
hsa-let-7a-5p|MIMAT0000062|9.1|38|12.7|185|8|4.53333333333333|17.9|23|63.3
hsa-let-7b-5p|MIMAT0000063|11.3|58.6|27.5|165.6|20.4|8.5|21|30.2|92.6
hsa-let-7c|MIMAT0000064|7.8|40.2|9.6|147.8|11.8|4.53333333333333|15.4|17.7|62.3
hsa-let-7d-5p|MIMAT0000065|4.53333333333333|27.7|13.4|158.1|8.5|4.53333333333333|14.2|13.5|50.5
hsa-let-7e-5p|MIMAT0000066|6.2|4.53333333333333|4.53333333333333|28|4.53333333333333|4.53333333333333|5.6|4.7|12.8
hsa-let-7f-5p|MIMAT0000067|4.53333333333333|4.53333333333333|4.53333333333333|78.2|4.53333333333333|4.53333333333333|6.8|4.53333333333333|8.9
hsa-miR-15a-5p|MIMAT0000068|4.53333333333333|70.3|10.3|147.6|4.53333333333333|4.53333333333333|21.1|30.2|100.8
hsa-miR-16-5p|MIMAT0000069|9.5|562.6|60.5|757|25.1|4.53333333333333|89.4|142.9|613.9
hsa-miR-17-5p|MIMAT0000070|10.5|71.6|27.4|335.1|6.3|10.1|51|51|187.1
hsa-miR-17-3p|MIMAT0000071|4.53333333333333|4.53333333333333|4.53333333333333|17.2|4.53333333333333|4.53333333333333|9.5|4.53333333333333|7.3
hsa-miR-18a-5p|MIMAT0000072|4.53333333333333|14.6|4.53333333333333|53.4|4.53333333333333|4.53333333333333|9.5|25.5|29.7
hsa-miR-19a-3p|MIMAT0000073|4.53333333333333|11.6|4.53333333333333|42.8|4.53333333333333|4.53333333333333|4.53333333333333|5.5|17.9
hsa-miR-19b-3p|MIMAT0000074|8.3|93.3|15.8|248.3|4.53333333333333|6.3|44.7|53.2|135
hsa-miR-20a-5p|MIMAT0000075|4.53333333333333|75.2|23.4|255.7|6.6|4.53333333333333|43.8|38|130.3
hsa-miR-21-5p|MIMAT0000076|6.2|19.7|18|299.5|6.8|4.53333333333333|49.9|68.5|48
hsa-miR-22-3p|MIMAT0000077|40.4|128.4|65.4|547.1|56.5|33.4|104.9|84.1|248.3
hsa-miR-23a-3p|MIMAT0000078|58.3|99.3|58.6|617.9|36.6|21.4|107.1|125.5|120.9
hsa-miR-24-1-5p|MIMAT0000079|4.53333333333333|4.53333333333333|4.53333333333333|9.2|4.53333333333333|4.53333333333333|4.53333333333333|4.9|4.53333333333333
hsa-miR-24-3p|MIMAT0000080|638.2|286.9|379.5|394.4|307.8|240.4|186|234.2|564
What I want to do is simply pick the rows where all the values are greater than 10.
> dat<-read.delim("http://dpaste.com/1215552/plain/",sep="|",na.strings="",header=TRUE,blank.lines.skip=TRUE,fill=FALSE)
But why does this code of mine report only the last one? The data clearly show that more rows satisfy this condition.
> dat[apply(dat[, -1], MARGIN = 1, function(x) all(x > 10)), ]
Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186 234.2 564
What is the right way to do it?
Update:
alexwhan's solution works. But I wonder how I can generalize his approach
so that it can handle data with missing values (NA):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
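Since you're including your ID column (which is a factor) in the all(), it's getting messed up: apply() converts its input to a matrix, and with a non-numeric column in the mix the whole matrix becomes character, so x > 10 compares strings instead of numbers. Try: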
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10)), ]
# Name ID p72 p78 p51 p49 c36.1 c32.1 c32.2 c36.2 c37
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
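The coercion can be checked directly on a small example. The sketch below uses three rows and two value columns copied from the question's data; it only illustrates the type change and is not a replacement for the fix above.

```r
# Three rows and two value columns taken from the question's data
dat <- data.frame(
  Name = c("hsa-let-7a-5p", "hsa-miR-22-3p", "hsa-miR-24-3p"),
  ID   = c("MIMAT0000062", "MIMAT0000077", "MIMAT0000080"),
  p72  = c(9.1, 40.4, 638.2),
  p78  = c(38, 128.4, 286.9)
)

# Dropping only Name leaves the text ID column in, so the matrix is character
typeof(as.matrix(dat[, -1]))

# Dropping both text columns keeps the matrix numeric
typeof(as.matrix(dat[, -c(1, 2)]))

# With a numeric matrix, all(x > 10) behaves as intended
keep <- apply(dat[, -c(1, 2)], MARGIN = 1, function(x) all(x > 10))
dat[keep, ]
```

Only the first row is dropped, because its `p72` value of 9.1 is not greater than 10.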
EDIT
For the case where you have NA, you can just use the na.rm argument of all(). Using your new data (from the comment):
dat<-read.delim("http://dpaste.com/1215354/plain/",sep="\t",na.strings="",header=FALSE,blank.lines.skip=TRUE,fill=FALSE)
dat[apply(dat[, -c(1,2)], MARGIN = 1, function(x) all(x > 10, na.rm = TRUE)), ]
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 7 hsa-miR-15a-5p MIMAT0000068 NA 70.3 10.3 147.6 NA NA 21.1 30.2 100.8
# 16 hsa-miR-22-3p MIMAT0000077 40.4 128.4 65.4 547.1 56.5 33.4 104.9 84.1 248.3
# 17 hsa-miR-23a-3p MIMAT0000078 58.3 99.3 58.6 617.9 36.6 21.4 107.1 125.5 120.9
# 19 hsa-miR-24-3p MIMAT0000080 638.2 286.9 379.5 394.4 307.8 240.4 186.0 234.2 564.0
# 20 hsa-miR-25-3p MIMAT0000081 19.3 78.6 25.6 84.3 14.9 16.9 19.1 27.2 113.8
# 21 hsa-miR-26a-5p MIMAT0000082 NA 22.8 31.0 561.2 12.4 NA 67.0 55.8 48.9
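One edge case of `na.rm = TRUE` is worth knowing: a row whose values are all NA passes the filter, because all() over an empty set returns TRUE. The sketch below shows this on invented values, along with a stricter variant that is only a suggestion (the names `pass` and `pass_strict` are made up here).

```r
# Toy values: row 2 is entirely missing, row 3 has one small value and one NA
vals <- data.frame(
  a = c(40.4, NA, 4.5),
  b = c(128.4, NA, NA)
)

# na.rm = TRUE drops the NAs before all() runs, so the all-NA row passes
pass <- apply(vals, 1, function(x) all(x > 10, na.rm = TRUE))

# Stricter variant: additionally require at least one non-missing value
pass_strict <- apply(vals, 1, function(x) {
  any(!is.na(x)) && all(x > 10, na.rm = TRUE)
})

pass         # TRUE TRUE FALSE
pass_strict  # TRUE FALSE FALSE
```

Whether an all-NA row should be kept depends on the analysis, so treat the stricter version as one possible convention.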
Another idea is to transform your data to long format (or molten format). I think this is even better for avoiding the missing-values problem:
library(reshape2)
dat.m <- melt(dat,id.vars=c('Name','ID'))
dat.m$value <- as.numeric(dat.m$value)
library(plyr)
res <- ddply(dat.m,.(Name,ID), summarise, keepme = all(value > 10))
res[res$keepme,]
# Name ID keepme
# 16 hsa-miR-22-3p MIMAT0000077 TRUE
# 17 hsa-miR-23a-3p MIMAT0000078 TRUE
# 19 hsa-miR-24-3p MIMAT0000080 TRUE
