Plotting Number of Times Value Appears in Two Dataframes in R

I have two sets of data. Each contains a column for the name of the molecule and a column for the number of times that molecule appears in the sample. I want to create a scatterplot with the number of times a molecule appears in dataset #1 on the x-axis and the number of times it appears in dataset #2 on the y-axis. If a molecule is in one dataset and not the other, it appears 0 times.
Example:
dat1 <- data.frame(
name = c("A", "B", "D", "E"),
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F"),
count = c(1, 3, 50, 40)
)
Point #1 would be (10,1) corresponding to A, Point #2 would be (1,3), Point #3 would be (0,50) and so on. I don't want to label my points since my datasets contain tens of thousands of molecules.

Try joining the data frames with full_join() and replacing the missing counts with 0:
library(dplyr)
library(ggplot2)
full_join(dat1, dat2, by = "name") %>%
  mutate(across(c(count.x, count.y), ~ coalesce(.x, 0))) %>%
  ggplot(aes(count.x, count.y)) +
  geom_point()

You would need a full_join():
library(dplyr)
library(ggplot2)
#Data
dat1 <- data.frame(
name = c("A", "B", "D", "E"),
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F"),
count = c(1, 3, 50, 40)
)
#Code
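dat1 %>% full_join(dat2 %>% rename(count2 = count), by = "name") %>%
  replace(is.na(.), 0) %>%
  ggplot(aes(x = count, y = count2)) +
  geom_point() +
  # Labels are only readable for small data; drop geom_text() for tens of thousands of points
  geom_text(aes(label = name), vjust = -0.5)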
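For comparison, the same join can be sketched in base R without dplyr or ggplot2. This is only a minimal sketch, not the method from the answers above: the object name `both` and the suffixes `_1` and `_2` are chosen here for illustration.

```r
# Example data from the question
dat1 <- data.frame(name = c("A", "B", "D", "E"), count = c(10, 1, 30, 10))
dat2 <- data.frame(name = c("A", "B", "C", "F"), count = c(1, 3, 50, 40))

# all = TRUE keeps molecules found in either data frame (a full outer join)
both <- merge(dat1, dat2, by = "name", all = TRUE, suffixes = c("_1", "_2"))

# A molecule missing from one data set counts as 0 there
both$count_1[is.na(both$count_1)] <- 0
both$count_2[is.na(both$count_2)] <- 0

# Unlabelled scatterplot: dataset 1 on the x-axis, dataset 2 on the y-axis
plot(both$count_1, both$count_2,
     xlab = "Count in dataset 1", ylab = "Count in dataset 2")
```

merge() sorts the result by the key column, so the rows come out in alphabetical order of name.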

Related

How to get the number of unique values in a column considering another column values with R?

I have this data frame:
df <- data.frame(id = c(918, 919, 920, 921, 922),
city = c("a", "c", "b", "c", "a"),
mosquitoes = c(9, 13, 8, 25, 10))
What I want to do is to get the number of unique ID values for each city and then create a new dataframe that should look like:
newdf <- data.frame(city = c("a", "b", "c"),
id = c(2,1,2),
mosquitoes = c(19, 8, 38))
I know how to do half of that using
newdf <- aggregate(mosquitoes ~ city, data = df, sum)
But no matter how I try, I can't get the count of unique ID values for each city. I've been trying
newdf$id <- aggregate(length(id) ~ city, data = df, sum)
And I also tried a loop (because my original data has way more than 3 cities), but only got disaster and can't make it work at all:
x <- unique(df$city)
unique_ID <-
for (x in df$city) {
city = unique(df$city)
mosquitoes = ?
ID = ?
}
This topic was the most similar to mine that I could find, but apparently it only works with numeric values. At least I couldn't make it work with my character columns.
Can someone please help me?
You could do:
library(tidyverse)
df <- data.frame(id = c(918, 919, 920, 921, 922),
city = c("a", "c", "b", "c", "a"),
mosquitoes = c(9, 13, 8, 25, 10))
df %>%
group_by(city) %>%
summarise(id = n_distinct(id), mosquitoes = sum(mosquitoes))
#> # A tibble: 3 x 3
#> city id mosquitoes
#> <chr> <int> <dbl>
#> 1 a 2 19
#> 2 b 1 8
#> 3 c 2 38
Created on 2022-09-05 with reprex v2.0.2
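Since the question started from aggregate(), a base R version may also be useful. The following is a sketch under the assumption that "number of unique IDs" means length(unique(id)) within each city; the intermediate names `ids` and `mosq` are made up for this example.

```r
df <- data.frame(id = c(918, 919, 920, 921, 922),
                 city = c("a", "c", "b", "c", "a"),
                 mosquitoes = c(9, 13, 8, 25, 10))

# Count distinct ids per city (FUN receives the id values of one city at a time)
ids <- aggregate(id ~ city, data = df, FUN = function(x) length(unique(x)))

# Sum mosquitoes per city, as in the question
mosq <- aggregate(mosquitoes ~ city, data = df, FUN = sum)

# Combine both summaries into one data frame keyed by city
newdf <- merge(ids, mosq, by = "city")
newdf
```

The attempt in the question failed because length(id) was evaluated on the left-hand side of the formula before grouping; passing the counting function as FUN lets aggregate() apply it per city.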

Calculating a rolling return

I have a data frame with 3 columns. What I want to do is to calculate the product of the return over a selected month rolling period for each monthly period (or said another way, each row) (where available). This is the basic structure of the data.
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$return <- runif(774, min=0, max=1)
What it should end with is a dataframe where a new column is added with the selected period cumulative return for that time frame. For example, I have shown below a four month return. The calculation of the 4-month return on 03/30/2012 from the data would be:
((1+0.81/100)*(1+0.715/100)*(1+0.27/100)*(1+0.80/100)-1)*100
This would be repeated for each value under the X column.
I ended up using the mutate function together with the rolling functions from zoo, where you can set the window (lag) width. This is the final version I wanted:
library(dplyr)
library(zoo)
# Create Test Dataframe
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$performance <- runif(774, min=0, max=1)
This particular code creates a 5-month average on a rolling basis. If you sort by column x you can see and recreate it in Excel.
df <- df %>%
group_by(x) %>%
mutate(x_mean = rollmean(performance, 5, fill = NA, align = 'right'))
I also found a way to create a lag so I could take the 4 prior values to the observation and calculate the mean:
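# Assumption: perf.1.previous is not defined above; it is taken here to be the one-period lag of performance
df2 <- df %>%
  mutate(perf.1.previous = lag(performance),
         perf.4.previous = rollapply(data = perf.1.previous, width = 4, FUN = mean,
                                     align = "right", fill = NA, na.rm = TRUE))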
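The code above computes rolling means, while the question asked for a compounded return over the window. A possible sketch of that calculation follows, assuming returns are stored in percent as in the formula from the question. The column names `asset`, `date`, `return` and `ret_4m`, and the smaller data set, are illustrative assumptions.

```r
library(dplyr)
library(zoo)

set.seed(100)
# Small illustrative data set: 3 assets, 12 monthly observations each
df <- expand.grid(asset = c("A", "B", "C"),
                  date = seq(as.Date("2012-01-01"), by = "month", length.out = 12),
                  stringsAsFactors = FALSE)
df$return <- runif(nrow(df), min = 0, max = 1)  # monthly return in percent

df <- df %>%
  arrange(asset, date) %>%
  group_by(asset) %>%
  # Compound the growth factors over a 4-month window, then convert back to percent
  mutate(ret_4m = (rollapply(1 + return / 100, width = 4, FUN = prod,
                             fill = NA, align = "right") - 1) * 100) %>%
  ungroup()
```

The first three rows of each asset are NA because a full 4-month window is not yet available. Changing width selects a different period.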

Why does eval(parse(text=string)) give shorter output for object of class aov?

Why does this...
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
produce a less detailed report than this...?
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)
Reproducible data, per a user's request. I know it's weird, but I'm making a user-friendly stats program for my students. Now that I've posted the "reproducible code" it wants me to provide more data. I didn't originally post this code because I thought it was a bit much and my general coding is rough. My question still stands: why does one call give me a short, unhelpful output while the other provides exactly what I want, when I ran the same command?
library(tidyverse)
DV <- c(1,1, 5,6, 1, 2, 7, 7, 1, 4, 9, 9)
IV1 <- c("A","B", "A", "B","A","B", "A", "B","A","B", "A", "B" )
IV2 <- c("C","C","D","D", "C","C", "D","D", "C","C", "D","D")
Participant <- c("A", "A", "A", "A", "B","B","B","B","C","C","C","C")
IV3 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
IV4 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
##### You have to tell the computer if the variable is within!
IV1_iswithin <-"Y"
IV2_iswithin <-"Y"
IV3_iswithin <-"N"
IV4_iswithin <-"N"
####### Your JOB is DONE
data <- data.frame(DV,Participant,IV1,IV2)
#Grouping the dataframe
data %>%group_by(IV2, IV1)%>% #subsetting the data set to calculate 4 different stats
mutate(MAD = median(abs(DV-median(DV))*2.5*1.4826))%>% #calculates 4 different MAD numbers
mutate(MADLL = median(DV)-MAD)%>% #calculates the LL of MAD, pipes output to next command
mutate(MADUL = median(DV)+MAD)%>% #calculates the UL of MAD
mutate(OutlierPresent = ifelse(DV<MADLL | DV>MADUL, NA, DV))%>% #Creates NA values if it is an outlier
ungroup()%>% #converts back to big data set
mutate(OutlierPresent = ifelse(DV<MADLL | DV>MADUL, NA, DV))%>% #Creates NA
mutate(whichgroup <- paste(IV1,IV2))%>%
mutate(observation = 1:n()) %>%
{. ->> b }
b %>%
select(DV,OutlierPresent) %>%
{. ->> outlierfeedback }
formattable(outlierfeedback)
b %>%
select(OutlierPresent,IV1,IV2,Participant) %>%
pivot_wider(names_from = c(IV1,IV2), values_from = OutlierPresent, names_sep = "_", id_cols = Participant)%>%
drop_na()%>%
pivot_longer(-Participant, names_to = c("IV1","IV2"), names_sep = "_", values_to = "DV")%>%
{. ->> AnovaAnalysisData }
### Calculating the ANOVA for our outlier free data AnovaAnalysisData
#SettingUpTheListofFixedFactors
FactorModel <- list()
ifelse(length(IV1)> 1, FactorModel<- c(FactorModel, "IV1"), FactorModel<-FactorModel)
ifelse(length(IV2)> 1, FactorModel<- c(FactorModel, "IV2"), FactorModel<-FactorModel)
ifelse(length(IV3)> 1, FactorModel<- c(FactorModel, "IV3"), FactorModel<-FactorModel)
ifelse(length(IV4)> 1, FactorModel<- c(FactorModel, "IV4"), FactorModel<-FactorModel)
#SettingUPTheListofErrorFactors
ErrorModel <-list()
ErrorModel <- list()
ifelse(IV1_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV1"), ErrorModel<-ErrorModel)
ifelse(IV2_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV2"), ErrorModel<-ErrorModel)
ifelse(IV3_iswithin== "Y",ErrorModel<- c(ErrorModel, "IV3"), ErrorModel<-ErrorModel)
ifelse(IV4_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV4"), ErrorModel<-ErrorModel)
StrStart = "aov(DV ~"
StrFactor<-paste(ErrorModel, collapse='*' )
StrErrorStart<-("+Error(Participant/(")
StrError<-paste(ErrorModel, collapse='*' )
StrErrorEnd<- ("),data=AnovaAnalysisData))")
StringforEvaluation<- paste(StrStart, StrFactor,StrErrorStart,StrError,StrErrorEnd)
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)

How to avoid for loop when iterating through unique values in a column [R]

Let's assume that we have following toy data:
library(tidyverse)
data <- tibble(
subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)
which represent network relationships for each subject. For example, there are three unique subjects in the data and the network for the first subject could be represented as sequence of relations:
a -- b, a -- c, b -- c
The task is to compute centralities for each network. Using for loop this is straightforward:
library(igraph)
# Get unique subjects
subjects_uniq <- unique(data$subject)
# Compute centrality of nodes for each graph
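centralities <- vector("list", length(subjects_uniq))
for (i in seq_along(subjects_uniq)) {
  # Filter on the subject value itself, not on the loop index
  current_data <- data %>% filter(subject == subjects_uniq[i]) %>% select(-subject)
  current_graph <- current_data %>% graph_from_data_frame(directed = FALSE)
  # Store each result; otherwise it is overwritten on the next iteration
  centralities[[i]] <- eigen_centrality(current_graph)$vector
}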
Question: My dataset is huge so I wonder how to avoid explicit for loop. Should I use apply() and its modern cousins (maybe map() in the purrr package)? Any suggestions are greatly welcome.
Here is an option using map
library(tidyverse)
library(igraph)
map(subjects_uniq, ~data %>%
filter(subject == .x) %>%
select(-subject) %>%
graph_from_data_frame(directed = FALSE) %>%
{eigen_centrality(.)$vector})
#[[1]]
#a b c
#1 1 1
#[[2]]
# a b c d
#1.0000000 0.8546377 0.8546377 0.4608111
#[[3]]
#a b c
#1 1 1
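With a huge data set, filtering the full table once per subject can itself become the bottleneck. One alternative, sketched below, is to split the edge list by subject once and apply a helper to each piece. The helper name `centrality_for` is hypothetical, and whether this is faster on real data is an assumption that should be measured.

```r
library(tibble)
library(igraph)

data <- tibble(
  subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
  id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)

# Hypothetical helper: eigenvector centrality for one subject's edge list
centrality_for <- function(edges) {
  g <- graph_from_data_frame(edges[, c("id1", "id2")], directed = FALSE)
  eigen_centrality(g)$vector
}

# split() partitions the rows by subject in a single pass;
# the result is a list named by subject
centralities <- lapply(split(data, data$subject), centrality_for)
```

The result is a named list, so the centralities for subject 2 are available as centralities[["2"]].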

sort data into deciles based on a rolling subset

I am trying to replicate the Fama French 1993 paper using R. I need to do the following sorting:
1. for each month,
2. calculate ME decile breakpoints on NYSE stocks only
3. sort all stocks into the deciles created in 2.
Data generation:
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME=floor(100*abs(rnorm(n)))))
Desired Output:
ME_NYSE_vals <- as.numeric(paste(df[df$Month==1 & df$exchangeCode=="NYSE","ME"]))
ME_ALL_vals <- as.numeric(paste(df[df$Month==1,"ME"]))
cut(x = ME_ALL_vals,
breaks = c(-Inf,quantile(ME_NYSE_vals,probs=seq(.1,.9,.1)),+Inf),
labels = 1:10
)
The breaks should be calculated based on ME_NYSE_vals. The cut should be applied to all ME_ALL_vals for each month.
If the intention is to keep the whole data frame but generate deciles only for the NYSE values the code below could do. The point was to generate deciles only for the entries pertaining to the NYSE values but to keep the full data set achieving some form of a partial sorting.
# Libs
Vectorize(require)(package = c("dplyr", "magrittr"),
character.only = TRUE)
# Transformations
df %<>%
mutate(nTileNYSE = ifelse(exchangeCode == "NYSE", ntile(ME, 10), NA)) %>%
arrange(nTileNYSE)
The code was applied to the data:
set.seed(1)
df <- as.data.frame(cbind(exchangeCode = c("NYSE", "NASDAQ"),
Stock = c("A", "B", "C", "A"),
Month = 1:12,
ME=rnorm(1200)))
2nd approach
Following the discussion in the comments I would suggest the following approach:
# Libs --------------------------------------------------------------------
Vectorize(require)(package = c( "tidyr", "dplyr", "magrittr", "xts", "Hmisc"),
char = TRUE)
# Data generation ---------------------------------------------------------
set.seed(1234)
n = 120
stocks <- c("A", "B", "C", "D", "E")
exchange <- c("NYSE", "NASDAQ", "AMEX")
df <- as.data.frame(cbind(Month = 1:12,
exchangeCode = exchange[round(runif(n, 1, 3))],
Stock = stocks[round(runif(n, 1, 5))],
ME = floor(100*abs(rnorm(n)))))
# Transformations ---------------------------------------------------------
# For some reason this was needed
df$ME <- as.numeric(as.character(df$ME))
# Generate cuts
dfNtiles <- df %>%
arrange(exchangeCode, Month, ME) %>%
group_by(exchangeCode, Month) %>%
mutate(cutsBsdOnNYSE = cut(x = ME,
breaks = cut2(x = df$ME[df$exchangeCode == "NYSE"],
g = 10, onlycuts = TRUE))) %>%
ungroup() %>%
group_by(cutsBsdOnNYSE) %>%
mutate(grpBsdOnNYSE = n())
It's fairly straightforward and boils down to:
1. Generating cut brackets reflecting a subset of the data.
2. Applying those brackets to the whole vector (ME).
3. Numbering the obtained groups so a group identifier is created.
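The month-by-month version asked for in the question could be sketched as follows. This is an illustration only and not the code from the answer above: the helper `nyse_decile` is hypothetical, the data are generated with numeric columns directly, and findInterval() is used instead of cut() so that tied breakpoints do not raise an error.

```r
library(dplyr)

# Hypothetical helper: decile (1-10) for every stock, using breakpoints
# computed from the NYSE stocks only. Returns NA if no NYSE stock is present.
nyse_decile <- function(me, is_nyse) {
  if (!any(is_nyse)) return(rep(NA_integer_, length(me)))
  breaks <- quantile(me[is_nyse], probs = seq(0.1, 0.9, 0.1), names = FALSE)
  # left.open = TRUE gives (a, b] intervals, matching the default of cut()
  findInterval(me, breaks, left.open = TRUE) + 1L
}

set.seed(1234)
n <- 120
df <- data.frame(
  Month = rep(1:12, length.out = n),
  exchangeCode = sample(c("NYSE", "NASDAQ", "AMEX"), n, replace = TRUE),
  ME = floor(100 * abs(rnorm(n)))
)

# Breakpoints are recomputed within each month, then applied to all stocks of that month
df <- df %>%
  group_by(Month) %>%
  mutate(decile = nyse_decile(ME, exchangeCode == "NYSE")) %>%
  ungroup()
```

Building the data with data.frame() rather than as.data.frame(cbind(...)) keeps ME numeric, so no conversion step is needed afterwards.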
