How to create correlation using dplyr in R studio


I have a data set with three hierarchy attributes (region, area, territory; territory is the lowest grain) plus two numeric fields (sales qty and headcount).
How do I compute the correlation between sales qty and territory headcount, and display it by region, area and territory?
I used the dplyr package: g = group_by(mydataset, region, area, territory), and then summarize(g, cor(sales_qty, headcount)). The display looks right, but every correlation is NA. If I omit territory (grouping by region and area only), the result looks right. Even though territory is the lowest level, can I still use group_by on it? Why is it showing NA?
Thank you for helping!

Without seeing your code and data it is hard to tell what is going wrong. Here is what I tried to get a correlation per group; it works well.
set.seed(1234)
df <- data.frame(group = rep(1:5, 100), x = rnorm(500) , y = rnorm(500) )
library(dplyr)
df %>%
group_by(group) %>%
do(data.frame(x=cor(.$x,.$y)))
Output:
group x
<int> <dbl>
1 1 0.1293551648
2 2 0.0006703073
3 3 0.2021294935
4 4 -0.0162522307
5 5 0.0995898089
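One likely explanation for the NA values in the question: if territory is the lowest grain, each region/area/territory group holds a single row, and cor() needs at least two observations to return a number. The sketch below uses made-up data (the data frame `sales` and all its values are hypothetical, chosen only to mirror the column names in the question) to compare grouping with and without territory.

```r
library(dplyr)

# Hypothetical data: exactly one row per territory, two territories per area
sales <- data.frame(
  region    = c("East", "East", "East", "East", "West", "West", "West", "West"),
  area      = c("E1", "E1", "E2", "E2", "W1", "W1", "W2", "W2"),
  territory = paste0("T", 1:8),
  sales_qty = c(10, 14, 9, 20, 7, 12, 15, 11),
  headcount = c(3, 4, 2, 6, 2, 5, 4, 3)
)

# Grouping down to territory: every group has n = 1, so cor() gives NA
by_territory <- sales %>%
  group_by(region, area, territory) %>%
  summarise(r = cor(sales_qty, headcount), n = n(), .groups = "drop")

# Dropping territory: every group has n = 2, so cor() returns a number
# (real data would need more rows per group for a meaningful value)
by_area <- sales %>%
  group_by(region, area) %>%
  summarise(r = cor(sales_qty, headcount), n = n(), .groups = "drop")

print(by_territory)
print(by_area)
```

Adding n = n() to the summary is a cheap way to check whether the NA values come from groups that are simply too small.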

Related

Combining/aggregating data in R

I feel like this is a really simple question, and I've looked a lot of places to try to find an answer to it, but everything seems to be looking to do a lot more than what I want--
I have a dataset that has multiple observations from multiple participants. One of the factors is where they're from (e.g. Buckinghamshire, Sussex, London). I want to combine everything that isn't London so I have two categories, London and notLondon. How would I do this? I'd then want to be able to run an lm on these two, so how would I edit my dataset so that I could do lm(fom ~ [other factor]) where it would be the combined category?
Also, how would I combine all observations from each respective participant for a category? e.g. I have a category that's birth year, but currently when I do a summary of my data it will say, for example, 1996:265, because there are 265 observations from people born in '96. But I just want it to tell me how many participants were born in 1996.
Thanks!

There are multiple parts to your question so let's take it step by step.
1. For the first part, this is a great use of forcats::fct_collapse() (forcats is loaded with the tidyverse). See the example here:
library(tidyverse)
set.seed(1)
d <- sample(letters[1:5], 20, T) %>% factor()
# original distribution
table(d)
#> d
#> a b c d e
#> 6 4 3 1 6
# lumped distribution
fct_collapse(d, a = "a", other_level = "other") %>% table()
#> .
#> a other
#> 6 14
Created on 2022-02-10 by the reprex package (v2.0.1)
2. For the second part, you will have to clarify and share some data to get more help.
3. Here you can use dplyr::summarize(n = n()), but you need to share some data to get an answer for your specific case.
However, something like:
df %>% group_by(birth_year) %>% summarize(n = n())
will give you the number of rows with each birth year. If each participant has several rows, count distinct participants instead, e.g. with n_distinct() on the participant ID column.
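As a rough sketch of both steps on data shaped like the question describes: the data frame `obs` and its columns `participant`, `origin` and `birth_year` are hypothetical names invented for illustration, since the asker did not share any data.

```r
library(dplyr)
library(forcats)

# Hypothetical data: several observations per participant
obs <- data.frame(
  participant = c(1, 1, 2, 2, 2, 3, 4, 4),
  origin      = factor(c("London", "London", "Sussex", "Sussex", "Sussex",
                         "Buckinghamshire", "London", "London")),
  birth_year  = c(1996, 1996, 1996, 1996, 1996, 1997, 1997, 1997)
)

# Collapse every non-London level into a single "notLondon" level
obs <- obs %>%
  mutate(origin2 = fct_collapse(origin, London = "London",
                                other_level = "notLondon"))

# Count rows and distinct participants per birth year
per_year <- obs %>%
  group_by(birth_year) %>%
  summarise(n_obs = n(),
            n_participants = n_distinct(participant),
            .groups = "drop")

print(table(obs$origin2))
print(per_year)
```

The collapsed column can then be used as a predictor in a model formula in the same way as any other two-level factor.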

dplyr::mutate changes row numbers, how to keep them?

I am using lme4::lmList on a tibble to obtain the coefficients of linear fit lines fitted for each subject (id) in my data. What I actually want is a nice long chain of pipes because I don't want to keep any of this output, just use it for a slope/intercept plot. However, I am running into a problem. lmList is creating a dataframe where the row numbers are the original subject ID numbers. I want to keep that information, but as soon as I use mutate on the output, the row numbers change to be sequential from 1. I tried rescuing them first by using rowid_to_column but that just gives me a column of sequential numbers from 1 too. What can I do, other than drop out of the pipe and put them in a column with base R? Is unique(a_df$id) really the best solution? I had a look around on here but didn't see a question like this one.
library(tibble)
library(dplyr)
library(Matrix)
library(lme4)
a_df <- tibble(id = c(rep(4, 3), rep(11, 3), rep(12, 3), rep(42, 3)),
age = c(rep(seq(1, 3), 4)),
hair = 1 + (age*2) + rnorm(12) + as.vector(sapply(rnorm(4), function(x) rep(x, 3))))
# as.data.frame to get around stupid RStudio diagnostics bug
int_slope <- coef(lmList(hair ~ age | id, as.data.frame(a_df))) %>%
setNames(., c("Intercept", "Slope"))
# Notice how the row numbers are the original subject ids?
print(int_slope)
Intercept Slope
4 2.9723596 1.387635
11 0.2824736 2.443538
12 -1.8912636 2.494236
42 0.8648395 1.680082
int_slope2 <- int_slope %>% mutate(ybar = Intercept + (mean(a_df$age) * Slope))
# Look! Mutate has changed them to be the numbers 1 to 4
print(int_slope2)
Intercept Slope ybar
1 2.9723596 1.387635 5.747630
2 0.2824736 2.443538 5.169550
3 -1.8912636 2.494236 3.097207
4 0.8648395 1.680082 4.225004
# Try to rescue them with rowid_to_column
int_slope3 <- int_slope %>% rowid_to_column(var = "id")
# Nope, 1 to 4 again
print(int_slope3)
id Intercept Slope
1 1 2.9723596 1.387635
2 2 0.2824736 2.443538
3 3 -1.8912636 2.494236
4 4 0.8648395 1.680082
Thanks,
SJ

The dplyr/tidyverse universe doesn't "believe in" row names. Any data that is important for an observation should be included in a column. The tibble package includes a function to move row names into a column. Try
int_slope %>% rownames_to_column()
before any mutates.
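A minimal sketch of that suggestion, using a hand-built data frame as a stand-in for the coef(lmList(...)) output so it runs without lme4. The values and the `mean_age` constant are placeholders, not the asker's fitted results.

```r
library(tibble)
library(dplyr)

# Stand-in for the lmList coefficients: row names carry the subject ids
int_slope <- data.frame(
  Intercept = c(2.97, 0.28, -1.89, 0.86),
  Slope     = c(1.39, 2.44, 2.49, 1.68),
  row.names = c("4", "11", "12", "42")
)

mean_age <- 2  # placeholder for mean(a_df$age)

int_slope2 <- int_slope %>%
  rownames_to_column(var = "id") %>%   # move row names into a real column first
  mutate(id = as.integer(id),          # row names are character; convert back
         ybar = Intercept + mean_age * Slope)

print(int_slope2)
```

Because the ids now live in an ordinary column, later verbs in the pipe can no longer drop them.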

Nothing like asking for help to make you see the answer. Those aren't row numbers, they're numeric row names. Of course they are! Non-contiguous row numbers make no sense. rownames_to_column is my answer.

Why don't you just create another 'ybar' column on int_slope? Base R assignment leaves the row names untouched:
int_slope$ybar <- int_slope$Intercept + mean(a_df$age) * int_slope$Slope

How to plot profiles in R with ggplot2

I have a large data set with protein IDs and corresponding abundance profiles across a number of gel fractions. I want to plot these profiles of abundances across the fractions.
The data looks like this
IDs<- c("prot1", "prot2", "prot3", "prot4")
fraction1 <- c(3,4,2,4)
fraction2<- c(1,2,4,1)
fraction3<- c(6,4,6,2)
plotdata<-data.frame(IDs, fraction1, fraction2, fraction3)
> plotdata
IDs fraction1 fraction2 fraction3
1 prot1 3 1 6
2 prot2 4 2 4
3 prot3 2 4 6
4 prot4 4 1 2
I want it to look like this:
Every protein has a profile. Every fraction has a corresponding abundance value per protein. I want to have multiple proteins per plot.
I tried figuring out ggplot2 using the cheat sheet and failed. I don't know what the input df should look like and what method I should use to get these profiles.
I would use excel, but a bug draws the wrong profile of my data depending on order of data, so I can't trust it to do what I want.

First, you'll have to reorganize your data.frame into long format for ggplot2. You can do it in one step with reshape2::melt, which produces the columns 'variable' and 'value' (you can rename them if you like).
library(reshape2)
library(dplyr)
library(ggplot2)
data2 <- melt(plotdata, id.vars = "IDs")
Then, we'll group the data by protein:
data2 <- group_by(data2, IDs)
Finally, you can plot it quite simply:
ggplot(data2) +
geom_line(aes(variable, value, group = IDs,
color = IDs))
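The same reshaping can also be sketched with tidyr::pivot_longer instead of reshape2::melt. This is an alternative to the answer's method, not what the answer used, and the column names `fraction` and `abundance` are chosen here for illustration.

```r
library(tidyr)
library(ggplot2)

# Data from the question
plotdata <- data.frame(
  IDs       = c("prot1", "prot2", "prot3", "prot4"),
  fraction1 = c(3, 4, 2, 4),
  fraction2 = c(1, 2, 4, 1),
  fraction3 = c(6, 4, 6, 2)
)

# Wide to long: one row per protein/fraction pair
long <- pivot_longer(plotdata, cols = -IDs,
                     names_to = "fraction", values_to = "abundance")

# One line per protein across the fractions
p <- ggplot(long, aes(x = fraction, y = abundance, group = IDs, color = IDs)) +
  geom_line()

print(p)
```

The group aesthetic is what tells geom_line to connect points within a protein; without it, a discrete x axis would leave each point in its own group.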

How to find the sum of the 2nd quartile based on a condition in R

The data I have represents sales and their distance (Dist) to a given store, One and Two in this example. What I would like to do is define each store's catchment area based on sales density. A catchment area is defined as the radius that contains 50% of sales. Starting with the orders that have the smallest distance (Dist) to a store, I would like to calculate the radius that contains 50% of the sales of that store.
I have the following df that I've calculated in a previous model.
df <- data.frame(ID = c(1,2,3,4,5,6,7,8),
Store = c('One','One','One','One','Two','Two','Two','Two'),
Dist = c(1,5,7,23,1,9,9,23),
Sales = c(10,8,4,1,11,9,4,2))
Now I want to find the minimum distance Dist that gives the closest figure to 50% of Sales. So my output looks as follows:
Output <- data.frame(Store = c('One','Two'),
Dist = c(5,9),
Sales = c(18,20))
I have a lot of observations in my actual df and it's unlikely that I will be able to solve for exactly 50%, so I need to round to the nearest observation.
Any suggestions how to do this?
NOTE: I apologise in advance for the poor title. I tried to think of a better way to formulate the problem; suggestions are welcome...

Here is one approach with data.table:
library(data.table)
setDT(df)
df[order(Store, Dist),
.(Dist, Sales = cumsum(Sales), Pct = cumsum(Sales) / sum(Sales)),
by = "Store"][Pct >= 0.5, .SD[1,], by = "Store"]
# Store Dist Sales Pct
# 1: One 5 18 0.7826087
# 2: Two 9 20 0.7692308
setDT(df) converts df into a data.table
The .(...) expression selects Dist, and calculates the cumulative sales and respective cumulative percentage of sales, by Store
Pct >= 0.5 subsets this to only the rows where the cumulative share of sales reaches or exceeds the threshold, and .SD[1,] takes only the top row (i.e., the smallest value of Dist), by Store
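For readers who prefer dplyr, the same logic can be sketched roughly as follows. This is a translation of the steps above rather than part of the original answer, and the column name `CumSales` is introduced here for illustration.

```r
library(dplyr)

# Data from the question
df <- data.frame(ID = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Store = c("One", "One", "One", "One", "Two", "Two", "Two", "Two"),
                 Dist = c(1, 5, 7, 23, 1, 9, 9, 23),
                 Sales = c(10, 8, 4, 1, 11, 9, 4, 2))

catchment <- df %>%
  arrange(Store, Dist) %>%                 # nearest orders first within each store
  group_by(Store) %>%
  mutate(CumSales = cumsum(Sales),         # running total of sales
         Pct = CumSales / sum(Sales)) %>%  # running share of the store's sales
  filter(Pct >= 0.5) %>%                   # keep rows at or past the 50% mark
  slice(1) %>%                             # first such row = smallest Dist
  ungroup() %>%
  select(Store, Dist, CumSales, Pct)

print(catchment)
```

Like the data.table version, this picks the first distance at which the cumulative share reaches 50%, which matches the expected output in the question rather than the distance whose share is numerically closest to 50%.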

I think it would be easier if you rearrange your data first. My logic is to take the cumulative sum of Sales within each store (this assumes df is already sorted by Store and Dist), then merge the store totals back into the data, and finally calculate the cumulative percentage. Once you have that, you can subset in any way you want to get the first observation from each group.
df$cums=unlist(lapply(split(df$Sales, df$Store), cumsum), use.names = F)
zz=aggregate(df$Sales, by = list(df$Store), sum)
names(zz)=c('Store', 'TotSale')
df = merge(df, zz)
df$perc=df$cums/df$TotSale
sub-setting the data:
merge(aggregate(perc ~ Store,data=subset(df,perc>=0.5), min),df)
Store perc ID Dist Sales cums TotSale
1 One 0.7826087 2 5 8 18 23
2 Two 0.7692308 6 9 9 20 26

R: identify the factor associated with the highest sum of values for multiple groups

Consider this:
plot=c("A","A","A","A","B","B","B","B")
mean=c(3,5,40,0,3,5,3,0)
sp=c("ch","ch","ag",NA,"ch","ag","ch",NA)
df=data.frame(plot,mean,sp)
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 <NA>
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 <NA>
I'd like to figure out some code that will return the "sp" from each "plot" with the highest cumulative "mean" value. For the example above, I'd like to return this:
plot=c("A","B")
sp=c("ag","ch")
df=data.frame(plot,sp)
plot sp
1 A ag
2 B ch
In case that wasn't clear, for plot A, the sp "ag" is returned because it has the highest cumulative mean value (40) for the plot. For plot B, "ch" is returned because it has the highest cumulative value (6). The values are not important to me; I want only the most dominant sp by cumulative mean value for each plot.
I've played around with aggregate and suspect that would be useful here, but am unsure about how to proceed.
Many thanks (this site is a huge resource for those of us new to R!)

Not sure how @jebyrnes would have done it with summarise and filter (edit: I figured it out and it's pretty simple too), but here's how I'd go about it with dplyr:
library(dplyr)
group_by(df, plot,sp) %>% summarise(sum=sum(mean)) %>% summarise(sp=sp[sum==max(sum)])
# plot sp
#1 A ag
#2 B ch

Here's an approach that uses the "data.table" package
library(data.table)
setDT(df)[, cumsum(mean), by=.(plot, sp)][, .(sp = sp[V1 == max(V1)]), by=plot]
# plot sp
# 1: A ag
# 2: B ch
After setting df to a data table with setDT(df), we are doing two things
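[, cumsum(mean), by=.(plot, sp)] calculates the cumulative sum of the mean column, grouped by plot and sp (with non-negative values the largest cumulative sum in each group equals the group total; using sum(mean) instead would return one row per group directly)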
[, .(sp = sp[V1 == max(V1)]), by=plot] takes the sp value for which V1 (calculated in step 1) is equal to the maximum of V1 and renames that column sp, grouped by plot

You should be able to do this in two steps.
Step 1, aggregate the data frame by plot and sp and calculate the cumulative mean. You can use a package such as plyr with ddply, or the dplyr package, for this.
Step 2, once you've done this, for each plot output the sp with the highest cumulative mean. There are a lot of ways to do this. I'd again go with dplyr, but that's because I'm a bit besotted with it at the moment.
Actually...you can do this whole thing with 4 lines in dplyr, with one line per operation, piping your way through with magrittr. 5 if you want to get rid of the cumulative means column. You just need a group_by, summarise, and filter statement. I'll post the code if you want it, but it will be far more useful for you to go read, say, http://seananderson.ca/2014/09/13/dplyr-intro.html and try it yourself.
Or....
df %>%
group_by(plot, sp) %>%
summarise(cumMean = sum(mean, na.rm=T)) %>%
filter(cumMean == max(cumMean)) %>%
select(plot, sp)
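If two sp values could tie for the highest total within a plot, the filter above would return both. One possible variation, sketched here under the assumption that a single sp per plot is wanted and that rows with a missing sp should be ignored, uses dplyr::slice_max with with_ties = FALSE (available in dplyr 1.0.0 and later). The column name `total` is chosen for illustration.

```r
library(dplyr)

# Data from the question
df <- data.frame(
  plot = c("A", "A", "A", "A", "B", "B", "B", "B"),
  mean = c(3, 5, 40, 0, 3, 5, 3, 0),
  sp   = c("ch", "ch", "ag", NA, "ch", "ag", "ch", NA)
)

dominant <- df %>%
  filter(!is.na(sp)) %>%                         # assumption: drop rows without an sp
  group_by(plot, sp) %>%
  summarise(total = sum(mean), .groups = "drop") %>%
  group_by(plot) %>%
  slice_max(total, n = 1, with_ties = FALSE) %>% # exactly one row per plot
  ungroup() %>%
  select(plot, sp)

print(dominant)
```

With with_ties = FALSE the winner among tied species is simply the first one in row order, so a different tie-breaking rule would need an explicit extra sort.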

Aggregate twice: once to calculate the sums for each plot and sp, and a second time to get the maximum for each plot. The second aggregation only gives you the plot and its maximum value, not the sp, so merge it back in with the first aggregate.
df2 = aggregate(mean ~ plot + sp, FUN = sum, data = df)
df3a = aggregate(mean ~ plot, data = df2, FUN = max)
merge(df3a, df2)
I haven't tested what happens if you have equal sums coming up here, though. Also, this drops any NAs in the data frame. If you want to keep those, make sure you bring the data frame in with strings rather than factors and then change the NAs to placeholders ("None" or even "NA") before you begin. The above code works fine with strings!
df = data.frame(plot,mean,sp, stringsAsFactors = FALSE)
df[is.na(df$sp), "sp"] = "None"
> df
plot mean sp
1 A 3 ch
2 A 5 ch
3 A 40 ag
4 A 0 None
5 B 3 ch
6 B 5 ag
7 B 3 ch
8 B 0 None
