Grouping Frame Values - r

I have a dataset of ingredients for cookies. I'm trying to answer which group (A, B, C, etc) of cookies has the most sugar in them. The dataset is structured as follows:
group id mois prot fat hocolate sugar carb cal
1 A 14069 27.82 21.43 44.87 5.11 1.77 0.77 4.93
2 A 14053 28.49 21.26 43.89 5.34 1.79 1.02 4.84
3 A 14025 28.35 19.99 45.78 5.08 1.63 0.80 4.95
4 B 14016 30.55 20.15 43.13 4.79 1.61 1.38 4.74
5 B 14005 30.49 21.28 41.65 4.82 1.64 1.76 4.67
6 A 14075 31.14 20.23 42.31 4.92 1.65 1.40 4.67
7 C 14082 31.21 20.97 41.34 4.71 1.58 1.77 4.63
8 C 14097 28.76 21.41 41.60 5.28 1.75 2.95 4.72
etc....
How can I plot the mean of each grouping to show that one of them has a higher average of sugar than the others? Or at the least, how can I print off the results of the grouped averages of sugar to defend my argument that one has more sugar than the other?

After saving your text to CSV and loading this file into R, it's pretty easy to obtain the mean sugar quantity per group, which I'm assuming is what you need.
You first group your data by variable group and then summarize the data using the "mean" function.
library(dplyr)
(cookies = df %>%
  group_by(group) %>%
  summarize(meanSugar = mean(sugar)))
group meanSugar
<chr> <dbl>
1 A 1.71
2 B 1.62
3 C 1.66
As you can see, group A has sugar content a bit higher than the others based on your data.
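If dplyr is not available, the same grouped means can be reproduced in base R. This is a minimal sketch that assumes only the eight rows shown in the question; the data frame name cookies_raw is made up for illustration.

```r
# Only the two columns needed for the comparison, typed in from the question
cookies_raw <- data.frame(
  group = c("A", "A", "A", "B", "B", "A", "C", "C"),
  sugar = c(1.77, 1.79, 1.63, 1.61, 1.64, 1.65, 1.58, 1.75)
)

# Mean sugar per group, then sorted from highest to lowest
group_means <- aggregate(sugar ~ group, data = cookies_raw, FUN = mean)
group_means <- group_means[order(-group_means$sugar), ]
print(group_means)
```

With only two to four cookies per group, the differences are small, so the printed means are a description of this sample rather than proof of a real difference.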
If you want to go a step further and actually plot this data, you can do that:
library(ggplot2)
cookies %>%
  ggplot(aes(x = meanSugar, y = reorder(group, meanSugar), fill = group, label = meanSugar)) +
  geom_col() +
  labs(y = "Cookie groups", x = "Mean Sugar") +
  geom_label(stat = "identity", hjust = +1.2, color = "white") +
  theme(legend.position = "none")
If you have any questions on some of these steps, let me know!
Note: next time, please try to provide your data in a reproducible form, so it's easier to reproduce what you need and give you a quick answer :)

Related

How can I use dplyr to turn one column into 3 based on the characters in the original column?

Hopefully this makes sense. I have one column in my dataset that has multiple entries of one of three size categories (read in as characters): "(0,1.88]", "(1.88,4]", and "(4,10]". I would like to combine all of my entries by plot (another column in the dataset), totaling the response for each size category in its own column.
Ideally, I'm trying to take data which has multiple responses in each Plot and end up with one total response for each plot, divided by size category. I'm hoping to get something like this:
Plot Total Response for (0,1.88] Total Response for (1.88,4] Total Response for (4,10]
Here is the head of my data. Not all of it is needed, only Plot, ounces, and tuber.diam. tuber.diam has the entries grouped into size categories.
head(newChippers)
Plot ounces Height Shape Area plot variety rate block width length tuber.oz.bin tuber.diam
1 2422 1.31 1.22 26122 3237 242 Lamoka 3 4 1.65 1.70 (0,4] (0,1.88]
2 2422 2.76 1.56 27853 5740 242 Lamoka 3 4 2.20 2.24 (0,4] (1.88,4]
3 2422 1.62 1.31 24125 3721 242 Lamoka 3 4 1.53 1.95 (0,4] (0,1.88]
4 2422 3.37 1.70 27147 6498 242 Lamoka 3 4 2.17 2.48 (0,4] (1.88,4]
5 2422 3.19 1.70 27683 6126 242 Lamoka 3 4 2.22 2.34 (0,4] (1.88,4]
6 2422 2.83 1.53 27356 6009 242 Lamoka 3 4 2.00 2.53 (0,4] (1.88,4]
Here is what I currently have for making the new dataset:
YieldSizeProfileDiameter <- newChippers %>%
group_by(Plot) %>%
summarize(totalOz = sum(Weight),
Diameter.0.1.88 = (tuber.diam("(0,1.88]")),
Diameter.1.88.4 = (tuber.diam(" (1.88,4]")),
Diameter.4.10 = (tuber.diam(" (4,10]")))
I get the following error code:
Error in x[[n]] : object of type 'closure' is not subsettable
Any help would be very much appreciated! Again, I'm very sorry if I've explained it poorly or made it too complicated. If any additional information is needed, I can try to provide it. Thank you!
I have revised your code. I assume your variable Weight is the same as ounces, as there is no Weight variable in your newChippers data. I use Weight here, as in your code:
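library(dplyr)
library(tidyr) # for pivot_wider()
YieldSizeProfileDiameter <- newChippers %>%
  group_by(Plot, tuber.diam) %>%
  # one total per Plot and size category
  summarize(totalOz = sum(Weight), .groups = "drop") %>%
  # spread the size categories into their own columns
  pivot_wider(names_from = tuber.diam, values_from = totalOz)
YieldSizeProfileDiameter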
I have not tested the code on my side as I do not have the data.
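Since the answer above is untested, here is a small self-contained check of the same idea in base R. It is only a sketch: it uses the six rows shown by head(newChippers), assumes ounces is the response to total, and the name chips is made up for illustration.

```r
# The three relevant columns from the head() output in the question
chips <- data.frame(
  Plot = rep(2422, 6),
  ounces = c(1.31, 2.76, 1.62, 3.37, 3.19, 2.83),
  tuber.diam = c("(0,1.88]", "(1.88,4]", "(0,1.88]",
                 "(1.88,4]", "(1.88,4]", "(1.88,4]")
)

# Sum ounces for every Plot x size-category combination;
# the result is a matrix with one row per Plot and one column per category
totals <- tapply(chips$ounces,
                 list(Plot = chips$Plot, tuber.diam = chips$tuber.diam),
                 sum)
print(totals)
```

Combinations that never occur come back as NA from tapply(), so a plot with no tubers in a size category would show NA rather than 0.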

Rolling 182-day average

I have a dataset with multiple sites and sampling years, with a score for every day of the year. For example, SiteA has 40 years of data with a value for every day, and the sampling year is given in Sampling.Year. To make things more confusing, our sampling year runs July to June, so it takes the form 2016-2017.
For example:
SiteName Sampling.Year Date Score
A 2015-2016 1
A 2015-2016 5
A 2015-2016 2
A 2016-2017 3
A 2016-2017 12
A 2016-2017 6
B 2015-2016 9
B 2015-2016 2
B 2015-2016 1
B 2016-2017 4
B 2016-2017 1
B 2016-2017 7
I want to apply a rolling 182-day average across this data to find the maximum (182-day average) score for each site/Sampling.Year combination. The outcome would be, e.g.:
Site Sampling.Year MaxAve StartDate
A 2016-2017 7.5 01/10/2016
A 2017-2018 6.0 12/12/2017
B 2016-2017 2.3 13/11/2016
B 2017-2018 4.2 09/09/2017
I have saved a sample dataset here:
Sample data.
I want to use a loop (because I am a novice and I'm not sure of a better way) along the lines of this, but it's the grouping of sites and years that I'm finding tricky. I would ideally like the moving average to be exported as a new data frame with start and end date (or at least start date) for each window, so we can check it against weather conditions at the time.
Moving_Average_Function <- function(arr, n=182){
res = arr
for(i in n:length(arr)){
res[i] = mean(arr[(i-n+1):i])
}
res
}
Thanks in advance
If you are willing to use external libraries, you could use group_by() from the dplyr package and the roll_mean() family of functions from the RcppRoll package (roll_meanr() is its right-aligned variant). RcppRoll has a set of fast, flexible functions for calculating moving averages.
I would also tend to convert your DATE column to a date class so it arranges smoothly.
library(dplyr) # I would typically use library(tidyverse) to load both dplyr and tidyr (among other related packages)
library(tidyr)
library(lubridate)
library(RcppRoll)
my_data <- data.table::fread("DailyScore.csv") # easy way to load a data frame from file
my_data2 <- my_data %>%
  mutate(DATE = dmy(DATE)) %>% # convert to Date class
  pivot_longer(H1:T2,
    names_to = "Sensor",
    values_to = "data"
  ) %>% # convert column names to data (long format)
  group_by(STATION, Sensor) %>% # so the rolling mean never runs across stations or sensors
  arrange(STATION, DATE) %>% # to be sure rows are in date order for the rolling mean
  # Sorting by STATION isn't necessary, but helps for display
  mutate(Mean_182 = roll_meanr(data, 182)) %>% # new column with the right-aligned rolling mean
  pivot_wider(names_from = Sensor, values_from = c(data, Mean_182)) # converts back to original "wide" format
my_data2[180:195,]
# # A tibble: 16 x 14
# # Groups: STATION [1]
# STATION SITENAME Sampling.Year DATE data_H1 data_I1 data_H2 data_P2 data_T2 Mean_182_H1
# <chr> <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 Site A Site A 1979-1980 1980-06-28 2.85 1.06e- 9 2.10 0.762 2.85 NA
# 2 Site A Site A 1979-1980 1980-06-29 2.79 1.62e-12 2.06 0.744 2.79 NA
# 3 Site A Site A 1979-1980 1980-06-30 2.75 1.00e-11 2.04 0.732 2.75 2.70
# 4 Site A Site A 1980-1981 1980-07-01 2.72 1.00e-11 2.01 0.724 2.72 2.71
# 5 Site A Site A 1980-1981 1980-07-02 2.70 1.00e-11 2.00 0.720 2.70 2.73
# 6 Site A Site A 1980-1981 1980-07-03 2.68 1.00e-11 1.98 0.718 2.68 2.74
# 7 Site A Site A 1980-1981 1980-07-04 2.67 1.00e-11 1.97 0.719 2.67 2.75
# 8 Site A Site A 1980-1981 1980-07-05 2.65 1.11e- 9 1.95 0.708 2.65 2.76
# 9 Site A Site A 1980-1981 1980-07-06 2.62 2.77e-10 1.93 0.703 2.62 2.76
# 10 Site A Site A 1980-1981 1980-07-07 2.60 3.18e-12 1.92 0.700 2.60 2.77
# 11 Site A Site A 1980-1981 1980-07-08 2.59 1.00e-11 1.90 0.701 2.59 2.79
# 12 Site A Site A 1980-1981 1980-07-09 2.59 1.00e-11 1.89 0.706 2.59 2.80
# 13 Site A Site A 1980-1981 1980-07-10 2.59 1.00e-11 1.89 0.713 2.59 2.81
# 14 Site A Site A 1980-1981 1980-07-11 2.59 1.00e-11 1.88 0.722 2.59 2.82
# 15 Site A Site A 1980-1981 1980-07-12 2.60 1.00e-11 1.88 0.731 2.60 2.83
# 16 Site A Site A 1980-1981 1980-07-13 2.60 1.00e-11 1.87 0.741 2.60 2.84
# # ... with 4 more variables: Mean_182_I1 <dbl>, Mean_182_H2 <dbl>, Mean_182_P2 <dbl>, Mean_182_T2 <dbl>
A couple of things to keep in mind, which will affect how you set this up:
In general, rolling averages return NA when they don't have a complete window. So with a 182-day average, you'll get a series of 181 NAs before your first complete average.
You'll want to decide how to handle the rollover between sampling years. Especially with a long-period rolling average, if you don't want to mix sampling years, you'll have about half of each year without data.
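The leading-NA behaviour is easy to see on a tiny vector. The following sketch uses base R's stats::filter() instead of RcppRoll, with a window of 3 rather than 182 so the output stays readable; the scores are made-up values.

```r
x <- c(1, 5, 2, 3, 12, 6)  # made-up daily scores
n <- 3                     # window length (182 in the question)

# sides = 1 uses only the current and previous values,
# i.e. a right-aligned (trailing) rolling mean
rolling <- as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))

print(rolling)
print(sum(is.na(rolling)))  # number of incomplete windows: n - 1
```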
Loops would be very inefficient for this kind of operation. You can use dedicated functions that work by group, with zoo::rollmean to get the rolling mean.
library(dplyr)
DailyScore %>%
  group_by(SITENAME, Sampling.Year) %>%
  summarise(max_average = max(zoo::rollmean(Score, 182)))
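The question also asks for the start date of the best window. A minimal base R sketch of that step follows. It is an illustration under stated assumptions, not the answerers' method: the toy data, the window of 3 days, and the helper name max_window are all made up, and windows are kept inside one sampling year.

```r
# Toy data: one site, two sampling years (July to June), five days each
toy <- data.frame(
  SiteName = "A",
  Sampling.Year = rep(c("2015-2016", "2016-2017"), each = 5),
  Date = as.Date("2016-06-26") + 0:9,
  Score = c(1, 5, 2, 4, 3, 3, 12, 6, 1, 2)
)

# For one site/year group: best trailing n-day mean and the dates it covers
max_window <- function(g, n) {
  g <- g[order(g$Date), ]
  if (nrow(g) < n) return(NULL)  # not enough days for one full window
  avg <- as.numeric(stats::filter(g$Score, rep(1 / n, n), sides = 1))
  i <- which.max(avg)            # last day of the best window (NAs are ignored)
  data.frame(SiteName = g$SiteName[1], Sampling.Year = g$Sampling.Year[1],
             MaxAve = avg[i], StartDate = g$Date[i - n + 1], EndDate = g$Date[i])
}

pieces <- split(toy, list(toy$SiteName, toy$Sampling.Year), drop = TRUE)
result <- do.call(rbind, lapply(pieces, max_window, n = 3))
print(result)
```

If windows should be allowed to span two sampling years, the rolling mean would have to be computed per site first and only then summarised per year.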

rvest empty list for table via xpath

I need help scraping data from the website below. I followed the approach shown at https://msperlin.github.io/pafdR/importingInternet.html#accessing-data-from-web-pages-webscraping to download the table of rates, but I only get an empty list (list of 0). Can someone please help me?
library(rvest)
# set url and xpath
my.url <- 'https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield'
my.xpath <- '//*[@id="t-content-main-content"]/div/table/tbody/tr/td/div/table'
# get nodes from html
out.nodes <- html_nodes(read_html(my.url),
xpath = my.xpath)
# get table from nodes (each element in
# list is a table)
df <- html_table(out.nodes)
df
It's usually better to not use extremely precise XPath statements since the structure of pages can change and sometimes what looks correct in the browser source or Developer Tools may not be (browsers modify HTML when they read it in).
Using said Developer Tools (in this case via Firefox, but Chrome should work fine too), an Inspect Element near the table shows that it is a table element with the class t-chart.
We can just stick with CSS selector syntax and do:
library(rvest)
pg <- read_html("https://www.treasury.gov/resource-center/data-chart-center/interest-rates/Pages/TextView.aspx?data=yield")
html_node(pg, "table.t-chart") %>%
html_table()
## Date 1 mo 3 mo 6 mo 1 yr 2 yr 3 yr 5 yr 7 yr 10 yr 20 yr 30 yr
## 1 10/01/18 2.13 2.23 2.40 2.60 2.82 2.90 2.96 3.04 3.09 3.18 3.24
## 2 10/02/18 2.14 2.23 2.41 2.61 2.82 2.88 2.94 3.01 3.05 3.14 3.20
## 3 10/03/18 2.15 2.23 2.41 2.62 2.85 2.94 3.02 3.10 3.15 3.24 3.30
## 4 10/04/18 2.16 2.22 2.42 2.63 2.87 2.97 3.05 3.14 3.19 3.29 3.35
## 5 10/05/18 2.15 2.23 2.41 2.64 2.88 2.99 3.07 3.18 3.23 3.34 3.40
In this case CSS selectors are easier (that's not always the case), but you can also use XPath queries:
html_node(pg, xpath = ".//table[@class='t-chart']")
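Because the live page may change or be unavailable, the two selector styles can be compared on an inline HTML snippet instead. This is only a sketch: the snippet is a made-up, cut-down stand-in for the real page, and it assumes the rvest package is installed.

```r
library(rvest)

# A tiny hand-written page with one table carrying the class "t-chart"
page <- read_html('
  <html><body><div id="main">
    <table class="t-chart">
      <tr><th>Date</th><th>1 mo</th></tr>
      <tr><td>10/01/18</td><td>2.13</td></tr>
    </table>
  </div></body></html>')

# Same table, selected once with CSS and once with XPath
by_css   <- html_table(html_node(page, "table.t-chart"))
by_xpath <- html_table(html_node(page, xpath = ".//table[@class='t-chart']"))

print(by_css)
print(identical(by_css, by_xpath))
```

Note that the XPath test [@class='t-chart'] matches only when the class attribute is exactly that string, while the CSS selector also matches elements that carry additional classes.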

Trying to run many anovas and get an F value for each row

I'm working with a dataset that looks like what is shown below. Now I know that this is not the type of format that R likes. I know how to tidy up the data, but then I'm not sure what I'd do in order to obtain an F statistic for each unique_id, which is my goal. Is there an easy way to do that? Otherwise, is there a way I could use some type of apply function to tidy up each row independently, perform the ANOVA, and then add the F statistics as a new column?
unique_id heart heart heart kidney kidney kidney cortex cortex cortex
373020.8 1.39 1.18 1.30 2.71 2.96 2.52 1.97 1.67 1.44
371588.9 1.93 2.35 2.50 2.54 1.63 2.23 2.68 2.89 1.86
367772.8 0.42 0.51 0.97 1.02 0.03 0.82 0.01 0.90 1.01
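I'm partial to data.tables, and you can easily do this with one after melting your data. Here's my take, with dt as the data.table containing the info you provided.
library(data.table)
# Reshape to long format: one row per unique_id, tissue and measurement
DT <- melt(dt,
           id.vars = c("unique_id"),
           measure.vars = c("heart","cortex","kidney"))
# Fit a one-way ANOVA per unique_id and keep the F value of the tissue term
DT[, fstat := summary(aov(value ~ variable))[[1]][1, "F value"], by = unique_id]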
This calculates the F-Stat by unique_id and should work.
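The apply-based route mentioned in the question can be sketched in base R without reshaping at all. This is an illustration under the assumption that the nine measurement columns are always ordered heart, kidney, cortex in blocks of three, as in the sample; the names wide, tissue and f_stat are made up.

```r
# The three sample rows from the question (columns renamed to be unique)
wide <- data.frame(
  unique_id = c(373020.8, 371588.9, 367772.8),
  heart1 = c(1.39, 1.93, 0.42), heart2 = c(1.18, 2.35, 0.51), heart3 = c(1.30, 2.50, 0.97),
  kidney1 = c(2.71, 2.54, 1.02), kidney2 = c(2.96, 1.63, 0.03), kidney3 = c(2.52, 2.23, 0.82),
  cortex1 = c(1.97, 2.68, 0.01), cortex2 = c(1.67, 2.89, 0.90), cortex3 = c(1.44, 1.86, 1.01)
)

# Tissue label for each of the nine measurement columns, in column order
tissue <- factor(rep(c("heart", "kidney", "cortex"), each = 3))

# One-way ANOVA per row; keep the F value of the tissue term
f_stat <- apply(as.matrix(wide[, -1]), 1, function(v) {
  summary(aov(v ~ tissue))[[1]][1, "F value"]
})

wide$fstat <- f_stat
print(wide[, c("unique_id", "fstat")])
```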

remove duplicate two criterion interval R

I am working on cleaning and processing data with R. I would like to remove the duplicates from a matrix. See the example below.
I would like to remove duplicates according to two criteria, if possible using an interval: if a row's RT (± 0.1) and m.z (± 0.001) are detected more than once in the table, remove the extra row.
RT m.z
1 2.02 326.1988
2 2.03 326.1989
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7 2.04 301.2852
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929
I would like an output like this:
RT m.z
1 2.02 326.1988
2
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929
If you can help that will help me a lot.
Thanks in advance.
This is a way to do it with dplyr. Not sure if it's the most efficient way.
df <- read.table(textConnection("RT m.z
1 2.02 326.1988
2 2.03 326.1989
3 2.06 326.1990
4 2.03 331.1533
5 2.03 375.1785
6 2.03 301.2852
7 2.04 301.2852
8 2.06 301.2852
9 2.07 357.2609
10 2.07 308.0327
11 2.08 218.2221
12 2.08 312.3617
13 2.10 473.3453
14 2.15 388.3929"))
Now with the same data you provided.
library(dplyr)
# This calculates the difference in RT and m.z between consecutive rows
# and looks for absolute differences on which we filter further down the chain
df %>%
  mutate(
    rtdiff = abs(lag(RT) - RT),
    mzdiff = abs(lag(m.z) - m.z)
  ) %>%
  # This replaces the NAs in the first row
  # with large values so filter does not have to deal with NAs
  mutate(rtdiff = replace(rtdiff, is.na(rtdiff), 999),
         mzdiff = replace(mzdiff, is.na(mzdiff), 999)) %>%
  # Drop rows that are within both thresholds of the previous row
  filter(!(rtdiff < 0.02 & mzdiff < 0.0002)) %>%
  # select only the columns you need and lose the rest
  select(RT, m.z)
giving us:
RT m.z
1 2.02 326.1988
2 2.06 326.1990
3 2.03 331.1533
4 2.03 375.1785
5 2.03 301.2852
6 2.06 301.2852
7 2.07 357.2609
8 2.07 308.0327
9 2.08 218.2221
10 2.08 312.3617
11 2.10 473.3453
12 2.15 388.3929
Hi, it seems I have other values interleaved between my replicates.
So I propose a small change to Maiasaura's code.
# setRT and setmz are the RT and m/z tolerances, assumed to be defined beforehand
for (i in 1:100) {
  reduced.list.pre.filtering <- reduced.list.pre.filtering %>%
    mutate(
      rtdiff = abs(lag(RT..min., i) - RT..min.),
      mzdiff = abs(lag(Max..m.z, i) - Max..m.z)
    ) %>%
    mutate(
      rtdiff = replace(rtdiff, is.na(rtdiff), 999),
      mzdiff = replace(mzdiff, is.na(mzdiff), 999)
    ) %>%
    filter(!(rtdiff < setRT & mzdiff < setmz)) %>%
    select(RT..min., Max..m.z)
}
This way, each row is compared with the rows up to 100 positions before it. Hope it helps somebody else. Do not hesitate to share a better solution.
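One possible alternative, sketched here in base R, is to compare each row against every row already kept, so duplicates are caught no matter how many other values sit in between. It is an illustration rather than a tuned solution: the function name dedupe_tol is made up, the thresholds 0.02 and 0.0002 are taken from the dplyr answer above, and the row-by-row loop may be slow on very large tables.

```r
# Sample data from the question
peaks <- data.frame(
  RT  = c(2.02, 2.03, 2.06, 2.03, 2.03, 2.03, 2.04,
          2.06, 2.07, 2.07, 2.08, 2.08, 2.10, 2.15),
  m.z = c(326.1988, 326.1989, 326.1990, 331.1533, 375.1785, 301.2852, 301.2852,
          301.2852, 357.2609, 308.0327, 218.2221, 312.3617, 473.3453, 388.3929)
)

# Keep a row only if no previously kept row is within both tolerances
dedupe_tol <- function(d, rt_tol, mz_tol) {
  keep <- logical(nrow(d))
  for (i in seq_len(nrow(d))) {
    kept <- which(keep)
    near <- abs(d$RT[kept] - d$RT[i]) < rt_tol &
            abs(d$m.z[kept] - d$m.z[i]) < mz_tol
    keep[i] <- !any(near)
  }
  d[keep, , drop = FALSE]
}

out <- dedupe_tol(peaks, rt_tol = 0.02, mz_tol = 0.0002)
print(out)
```

On the sample data this drops original rows 2 and 7, the same rows left blank in the desired output of the question.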
