I need to prepare a table that contains the means and standard deviations of many variables for each level of several demographic variables.
Consider the following data:
df <- tibble(place=c("London","Paris","London","Rome","Rome","Madrid","Madrid"),gender=c("m","f","f","f","m","m","f"), education = c(1,1,2,3,5,5,3), var1 = c(2.2,3.1,4.5,1,5,1.4,2.3),var2 = c(4.2,2.1,2.5,4,5,4.4,1.3),var3 = c(0.2,0.1,3.5,3,5,2.4,4.3))
I would like to get a dataframe that contains the grouping variables (place, gender, education) and their levels (e.g., London, Paris, etc.) in the first column and their means and standard deviations for each variable starting with var (var1, var2, var3) in additional columns.
I know how to do this for one group and several variables at a time. However, since I need to repeat this dozens of times I am looking for a way to automate this process. It would be great to have a function to which I simply need to pass (a) the names of the grouping variables (e.g., gender, education) and (b) the variables from which to get the M / SD (e.g. var1, var2).
The result I am looking for should look like this (the statistics in the example below are made up):
my_results <- tibble(grouping_vars = c("place_London","place_Paris","place_Rome","place_Madrid","gender_m","gender_f","last_element"),mean_var1=c(1.3,2.5,4.5,1.7,2.5,3.6,4.0),sd_var1=c(0.01,0.41,0.21,0.12,0.02,0.38,0.28),mean_var2=c(4.3,4.5,4.0,1.2,2.5,1.6,2.3),sd_var2=c(0.21,0.1,0.1,0.32,0.22,0.18,0.08),mean_var3=c(2.3,2.5,2.0,3.2,3.5,0.6,5),sd_var3=c(0.51,0.15,0.51,0.52,0.52,0.15,0.48))
grouping_vars mean_var1 sd_var1 mean_var2 sd_var2 mean_var3 sd_var3
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 place_London 1.3 0.01 4.3 0.21 2.3 0.51
2 place_Paris 2.5 0.41 4.5 0.1 2.5 0.15
3 place_Rome 4.5 0.21 4 0.1 2 0.51
4 place_Madrid 1.7 0.12 1.2 0.32 3.2 0.52
5 gender_m 2.5 0.02 2.5 0.22 3.5 0.52
6 gender_f 3.6 0.38 1.6 0.18 0.6 0.15
7 last_element 4 0.28 2.3 0.08 5 0.48
Since I typically work with tidyverse, I would particularly appreciate solutions that use these packages (probably dplyr or purrr?).
EDIT:
I thought there would be an elegant way to do this using map(). Maybe there is, but I haven't found it yet. In the meantime, I figured out a way that simply restructures the data into an appropriate long format and then computes the statistics.
library(dplyr)
library(tidyr)
# names of the grouping variables and of the variables to summarise
grouping_vars <- c("place", "gender", "education")
test_vars <- c("var1", "var2", "var3")
df %>%
  # all grouping vars need to be of the same type, here "factor" is most appropriate
  mutate_at(grouping_vars, list(factor)) %>%
  # pivot longer, so that each row is a unique combination of grouping variable and grouping level
  pivot_longer(
    cols = one_of(grouping_vars),
    names_to = "group_var",
    values_to = "group_level"
  ) %>%
  # merge grouping variable and group level into a single column
  unite(var_level, group_var, group_level, sep = "_") %>%
  # group by group level
  group_by(var_level) %>%
  # compute means and sd for each test variable
  summarise_at(test_vars, list(~mean(., na.rm = TRUE), ~sd(., na.rm = TRUE)))
The result seems fine, e.g., the mean of var1 for the two people who live in London is (2.2 + 4.5) / 2 = 3.35.
# A tibble: 10 x 7
var_level var1_mean var2_mean var3_mean var1_sd var2_sd var3_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 education_1 2.65 3.15 0.15 0.636 1.48 0.0707
2 education_2 4.5 2.5 3.5 NA NA NA
3 education_3 1.65 2.65 3.65 0.919 1.91 0.919
4 education_5 3.2 4.7 3.7 2.55 0.424 1.84
5 gender_f 2.72 2.48 2.72 1.47 1.13 1.83
6 gender_m 2.87 4.53 2.53 1.89 0.416 2.40
7 place_London 3.35 3.35 1.85 1.63 1.20 2.33
8 place_Madrid 1.85 2.85 3.35 0.636 2.19 1.34
9 place_Paris 3.1 2.1 0.1 NA NA NA
10 place_Rome 3 4.5 4 2.83 0.707 1.41
Any thoughts on possible risks of this approach or how this could be improved?
One option is the describeBy function from psych:
library(psych)
describeBy(df,group = c("gender","education"), mat= TRUE)
Then subset what you want from there.
Another, surprisingly simple option with dplyr:
library(dplyr)
group.vars <- c("gender","education")
measure.vars <- c("var1","var2")
df %>%
  group_by_at(group.vars) %>%
  summarize_at(measure.vars,
               list(mean = ~ mean(.), sd = ~ sd(.)))
# A tibble: 5 x 6
# Groups: gender [2]
gender education var1_mean var2_mean var1_sd var2_sd
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 f 1 3.1 2.1 NA NA
2 f 2 4.5 2.5 NA NA
3 f 3 1.65 2.65 0.919 1.91
4 m 1 2.2 4.2 NA NA
5 m 5 3.2 4.7 2.55 0.424
You can keep adding functions to that list. For each element, its name is appended to the variable name and its result becomes the values of that column. Recall that ~ is shorthand for an anonymous function, with . standing for its argument.
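The question asked for a function that takes the grouping variables and the measured variables and returns one row per level. Here is a minimal sketch of that idea, assuming dplyr 1.0 or later (for across()). The name summarise_by_level is a hypothetical helper, not part of dplyr, and the data is the df from the question.

```r
library(dplyr)

# Example data from the question
df <- tibble(
  place = c("London", "Paris", "London", "Rome", "Rome", "Madrid", "Madrid"),
  gender = c("m", "f", "f", "f", "m", "m", "f"),
  education = c(1, 1, 2, 3, 5, 5, 3),
  var1 = c(2.2, 3.1, 4.5, 1, 5, 1.4, 2.3),
  var2 = c(4.2, 2.1, 2.5, 4, 5, 4.4, 1.3),
  var3 = c(0.2, 0.1, 3.5, 3, 5, 2.4, 4.3)
)

# Hypothetical helper (illustrative name, not a dplyr function):
# summarises each grouping variable separately and stacks the results.
summarise_by_level <- function(data, group_vars, measure_vars) {
  per_var <- lapply(group_vars, function(gv) {
    data %>%
      # label each level as "<variable>_<level>", e.g. "place_London"
      group_by(var_level = paste(gv, .data[[gv]], sep = "_")) %>%
      summarise(
        across(all_of(measure_vars),
               list(mean = ~ mean(.x, na.rm = TRUE),
                    sd = ~ sd(.x, na.rm = TRUE))),
        .groups = "drop"
      )
  })
  bind_rows(per_var)
}

res <- summarise_by_level(df, c("place", "gender"), c("var1", "var2"))
print(res)
```

Unlike group_by_at() with several variables, this summarises each grouping variable on its own, so the rows are levels (place_London, gender_f, ...) rather than combinations of levels. Groups with a single observation still get NA for the sd.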
I am working on cleaning and processing of data with R. I would like to remove the duplicates from a matrix. See the example below.
I would like to remove duplicates according to two criteria, if possible using a tolerance interval: if a row's RT (± 0.1) and m.z (± 0.001) both match those of another row in the table, the extra row should be removed.
RT m.z
1 2.02 326.1988
2 2.03 326.1989
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7 2.04 301.2852
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929
I would like output like this:
RT m.z
1 2.02 326.1988
2
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929
Any help would be much appreciated.
Thanks in advance.
This is a way to do it with dplyr. Not sure if it's the most efficient way.
df <- read.table(textConnection("RT m.z
1 2.02 326.1988
2 2.03 326.1989
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7 2.04 301.2852
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929"))
Now with the same data you provided.
library(dplyr)
# This calculates the difference in RT and m.z between consecutive rows
# and looks for absolute differences on which we filter further down the chain
df %>%
  mutate(
    rtdiff = abs(lag(RT) - RT),
    mzdiff = abs(lag(m.z) - m.z)
  ) %>%
  # This replaces the NAs in the first row
  # with large values so filter does not have to deal with NAs
  mutate(rtdiff = replace(rtdiff, is.na(rtdiff), 999),
         mzdiff = replace(mzdiff, is.na(mzdiff), 999)) %>%
  # Remove the rows that don't meet your condition
  filter(!(rtdiff < 0.02 & mzdiff < 0.0002)) %>%
  # select only the columns you need and lose the rest
  select(RT, m.z)
giving us:
RT m.z
1 2.02 326.1988
2 2.06 326.1990
3 2.03 331.1533
4 2.03 375.1785
5 2.03 301.2852
6 2.06 301.2852
7 2.07 357.2609
8 2.07 308.0327
9 2.08 218.2221
10 2.08 312.3617
11 2.10 473.3453
12 2.15 388.3929
Hi, it seems I have other values intercalated between my replicates.
So I propose a small change to Maiasaura's code.
# setRT and setmz are the RT and m/z tolerances; they must be defined beforehand
for (i in 1:100) {
  reduced.list.pre.filtering <- reduced.list.pre.filtering %>%
    # compare each row with the row i positions before it
    mutate(
      rtdiff = abs(lag(RT..min., i) - RT..min.),
      mzdiff = abs(lag(Max..m.z, i) - Max..m.z)) %>%
    mutate(rtdiff = replace(rtdiff, is.na(rtdiff), 999),
           mzdiff = replace(mzdiff, is.na(mzdiff), 999)) %>%
    filter(!(rtdiff < setRT & mzdiff < setmz)) %>%
    select(RT..min., Max..m.z)
}
This way, each row is compared with the rows up to 100 positions before it. I hope this helps somebody else. Do not hesitate to post a better solution.
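Another way to handle values that sit between replicates is to compare each row with every row already kept, instead of with a fixed number of lagged rows. The following is only a sketch in base R: drop_near_duplicates is a hypothetical helper name, the data is the example from the question, and the tolerances 0.02 and 0.0002 are the ones used in the dplyr answer above, not the ± 0.1 and ± 0.001 stated in the question.

```r
# Example data from the question
df <- data.frame(
  RT  = c(2.02, 2.03, 2.06, 2.03, 2.03, 2.03, 2.04, 2.06,
          2.07, 2.07, 2.08, 2.08, 2.10, 2.15),
  m.z = c(326.1988, 326.1989, 326.1990, 331.1533, 375.1785, 301.2852,
          301.2852, 301.2852, 357.2609, 308.0327, 218.2221, 312.3617,
          473.3453, 388.3929)
)

# Hypothetical helper (illustrative name): keeps a row only if no
# previously kept row lies within both tolerances.
drop_near_duplicates <- function(d, rt_tol, mz_tol) {
  keep <- logical(nrow(d))
  for (i in seq_len(nrow(d))) {
    kept <- which(keep)
    # any() on an empty comparison is FALSE, so the first row is always kept
    is_dup <- any(abs(d$RT[kept] - d$RT[i]) < rt_tol &
                  abs(d$m.z[kept] - d$m.z[i]) < mz_tol)
    keep[i] <- !is_dup
  }
  d[keep, , drop = FALSE]
}

out <- drop_near_duplicates(df, rt_tol = 0.02, mz_tol = 0.0002)
print(out)
```

This does not depend on the row order of the duplicates, but it compares every row with all kept rows, so it may be slow on very large tables.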
I have a dataframe in R.
index seq change change1 change2 change3 change4 change5 change6
1 1 0.12 0.34 1.2 1.7 4.5 2.5 3.4
2 2 1.12 2.54 1.1 0.56 0.87 2.6 3.2
3 3 1.86 3.23 1.6 0.23 3.4 0.75 11.2
... ... ... ... ... ... ... ... ...
The name of the dataframe is just FullData. I can access each column of FullData using the code:
FullData[2] for 'change'
FullData[3] for 'change1'
FullData[4] for 'change2'
...
...
Now I wish to calculate the standard deviation of the values in the first row across the first four columns, then across the next four columns (shifted by one), and so on for all the columns:
standarddeviation = sd(c(0.12, 0.34, 1.2, 1.7))
then
standarddeviation = sd(c(0.34, 1.2, 1.7, 4.5))
This has to be done for all rows. So basically I want to calculate the sd row-wise, while the data is stored column-wise. Is it possible to do this?
How can I access a row of the data frame using a for loop over the index or seq variable?
How can I do this in R? Is there a better way?
I guess you're looking for something like this.
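# Assuming, as in the question, that FullData[2] is 'change', only the first column (seq) is dropped
vals <- as.matrix(FullData[, -1])
width <- 4                       # number of consecutive columns per window
n.win <- ncol(vals) - width + 1  # number of windows per row
st.dev <- matrix(NA_real_, nrow(vals), n.win)
for (i in seq_len(nrow(vals))) {
  for (j in seq_len(n.win)) {
    # sd of row i over columns j to j + width - 1
    st.dev[i, j] <- sd(vals[i, j:(j + width - 1)])
  }
}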
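The row-wise rolling standard deviation described in the question can also be written without explicit loops. The sketch below rebuilds the three visible rows of the question's data (the column layout, with seq first, is an assumption) and uses a window of four columns. Entry [i, j] of the result is the sd of row i over columns j to j + 3.

```r
# The three rows shown in the question; the layout (seq first) is assumed
FullData <- data.frame(
  seq     = 1:3,
  change  = c(0.12, 1.12, 1.86),
  change1 = c(0.34, 2.54, 3.23),
  change2 = c(1.2, 1.1, 1.6),
  change3 = c(1.7, 0.56, 0.23),
  change4 = c(4.5, 0.87, 3.4),
  change5 = c(2.5, 2.6, 0.75),
  change6 = c(3.4, 3.2, 11.2)
)

vals <- as.matrix(FullData[, -1])          # drop seq, keep the value columns
width <- 4                                 # window size in columns
starts <- seq_len(ncol(vals) - width + 1)  # first column of each window

# For each window, compute the sd of every row over that window's columns
roll_sd <- sapply(starts, function(j) {
  apply(vals[, j:(j + width - 1), drop = FALSE], 1, sd)
})

print(roll_sd)
```

With seven value columns and a window of four, this gives a 3 x 4 matrix, and the first entry equals sd(c(0.12, 0.34, 1.2, 1.7)) from the question.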
I am looking for an explicit function to subscript elements in R, say subscript(x,i) to mean x[i].
The reason that I need this traces back to a piece of code using dplyr and magrittr pipe operator, which is not a pipe, and where I need to divide by the first element of each column.
pipedDF <- rawdata %>% filter, merge, summarize, dcast %>%
mutate_each( funs(./subscript(., 1) ), -index)
I think this would do the trick and keep that pipe syntax which people like.
Without dplyr it would look like this...
Example,
> df
index a b c
1 1 6.00 5.0 4
2 2 7.50 6.0 5
3 3 5.00 4.5 6
4 4 9.00 7.0 7
> data.frame(sapply(df, function(x)x/x[1]))
index a b c
1 1 1.00 1.0 1.00
2 2 1.25 1.2 1.25
3 3 0.83 0.9 1.50
4 4 1.50 1.4 1.75
You should be able to use '[', as in
x <- 5:1
`[`(x, 2)
# [1] 4
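To connect this back to the pipeline in the question, here is a small sketch that divides every column except index by its first element, calling the subscript function explicitly. It assumes dplyr 1.0 or later and uses across() in place of the older mutate_each()/funs(); the data is the example df shown above.

```r
library(dplyr)

# Example data from the answer above
df <- data.frame(
  index = 1:4,
  a = c(6, 7.5, 5, 9),
  b = c(5, 6, 4.5, 7),
  c = c(4, 5, 6, 7)
)

# `[`(.x, 1) is the function-call form of .x[1]
scaled <- df %>%
  mutate(across(-index, ~ .x / `[`(.x, 1)))

print(scaled)
```

Writing .x[1] inside the lambda would do the same thing; the explicit call only matters when a function object is needed, for example as an argument to sapply().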
I am trying to run a paired t-test in R on data grouped by factors:
> head(i.o.diff,n=20)
# Difference Tree.ID Tree.Name Ins Outs
#1 0.20 AK-1 Akun 1.20 1.0
#2 -1.60 AK-2 Akun 0.40 2.0
#3 -0.60 AK-3 Akun 1.40 2.0
#4 0.40 AK-4 Akun 0.40 0.0
#5 1.30 AK-5 Akun 1.80 0.5
#6 2.70 J-1 Jaror 10.20 7.5
#7 6.60 J-2 Jaror 10.60 4.0
#8 2.50 J-3 Jaror 6.00 3.5
#9 7.50 J-4 Jaror 22.00 14.5
#10 -4.50 J-5 Jaror 5.00 9.5
#11 3.50 Ce-1 Ku'ch 4.00 0.5
#12 -0.70 Ce-2 Ku'ch 4.80 5.5
#13 1.60 Ce-3 Ku'ch 2.60 1.0
#14 -2.40 Ce-4 Ku'ch 2.60 5.0
#15 -1.75 Ce-5 Ku'ch 2.25 4.0
I first tried using:
pairwise.t.test(i.o.diff$In,i.o.diff$Out,g=i.o.diff$Tree.Name,paired=TRUE,pool=FALSE,p.adj="none",alternative=c("less"),mu=0)
but I get the error
Error in complete.cases(x, y) : not all arguments have the same length
which doesn't make a whole lot of sense to me.
I considered using ddply(), apply(), and summaryBy(), but couldn't get it to work because the inputs for the paired t-test require 2 vectors and most of the previous functions I mention seem to work best when only one column is being "operated" upon.
In order to get around this, I tried to use a for loop to achieve the same end:
for(i in unique(i.o.diff$Tree.Name)) {
pair_sub<-subset(i.o.diff,Tree.Name==i)
t.pair<-t.test(pair_sub$Ins,pair_sub$Outs,paired="True")
print(t.pair)
}
However, when I do this, I get the error
Error in paired || !is.null(y) : invalid 'x' type in x||y
So I checked typeof(pair_sub$Ins). Turns out that type is double, which is numeric, so I am not sure why the paired t-test is not working. Any ideas as to how to fix either of these methods?
I removed the quotes around TRUE in the for loop (paired = TRUE instead of paired = "True"), and it works great now.
From the R documentation:
t.test {stats}    R Documentation
Student's t-Test
Description:
Performs one and two sample t-tests on vectors of data.
Usage:
t.test(x, ...)
## Default S3 method:
t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0, paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)
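Since paired is a logical argument, the per-group test can also be run without a for loop. The sketch below rebuilds the Tree.Name, Ins and Outs columns from the head() output in the question, splits the data by Tree.Name and runs one paired t-test per group. The names results and p_values are illustrative.

```r
# Columns rebuilt from the head() output in the question
i.o.diff <- data.frame(
  Tree.Name = rep(c("Akun", "Jaror", "Ku'ch"), each = 5),
  Ins  = c(1.20, 0.40, 1.40, 0.40, 1.80,
           10.20, 10.60, 6.00, 22.00, 5.00,
           4.00, 4.80, 2.60, 2.60, 2.25),
  Outs = c(1.0, 2.0, 2.0, 0.0, 0.5,
           7.5, 4.0, 3.5, 14.5, 9.5,
           0.5, 5.5, 1.0, 5.0, 4.0)
)

# One paired t-test per tree name; paired must be the logical TRUE
results <- lapply(split(i.o.diff, i.o.diff$Tree.Name), function(d) {
  t.test(d$Ins, d$Outs, paired = TRUE)
})

# Collect the p-values into a named vector
p_values <- sapply(results, function(r) r$p.value)
print(p_values)
```

No multiple-comparison adjustment is applied here; p.adjust(p_values, method = ...) could be used if one is needed.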