three min values for each column and more

My data contain pairwise distances between two sets of locations, (x, y, z) and (a, b, c, d, e, f, g, h, i, j). See below:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
I need to create another data frame that holds, for each column, the column name, the row names of the three smallest values, and those three values. In total, the new data frame will contain three columns and nine rows. Here are the first three rows:
col_name <- c("x", "x", "x")
row_name <- c("h", "g", "a")
min_val <- c(14.21208, 14.88804, 14.98797)
newdat <- data.frame(col_name, row_name, min_val)
Similarly, this needs to be repeated for columns y and z.

How about this:
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
# find indices of smallest values
idx <- sapply(distdat, order)[1:3, ]
# put everything in a data.frame
data.frame(col_name = rep(colnames(distdat), each = 3),
row_name = row.names(distdat)[c(idx)],
min_val = distdat[cbind(c(idx), rep(1:3, each = 3))]
)
Also, with the given seed I could not replicate your example, let me know if I missed something.

It's not very pretty, but this could work:
library(dplyr)
set.seed(123)
x <- rnorm(10, 15,1)
y <- rnorm(10, 7,0.1)
z <- rnorm(10, 3,0.01)
distdat <- data.frame(x,y,z)
rownames(distdat) <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
distdat$row_name <- rownames(distdat)
select(distdat, x, row_name) %>%
arrange(x) %>%
head(3) %>%
mutate(col_name='x') %>%
rename(min_val = x) -> newdat_x
select(distdat, y, row_name) %>%
arrange(y) %>%
head(3) %>%
mutate(col_name='y') %>%
rename(min_val = y) -> newdat_y
select(distdat, z, row_name) %>%
arrange(z) %>%
head(3) %>%
mutate(col_name='z') %>%
rename(min_val = z) -> newdat_z
newdat <- bind_rows(newdat_x, newdat_y, newdat_z)
Certainly, we could (and should) write a function that builds those newdat_ data frames, and then run it for each of the variables x, y and z.
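The function suggested above could be sketched as follows. This is a minimal base R version rather than the answerer's dplyr code; the helper name `min_n` and its arguments are hypothetical and chosen only for illustration.

```r
# Hypothetical helper: the n smallest values of one column, with their row names
min_n <- function(df, col, n = 3) {
  ord <- order(df[[col]])[seq_len(n)]   # row positions of the n smallest values
  data.frame(col_name = col,
             row_name = rownames(df)[ord],
             min_val  = df[[col]][ord],
             stringsAsFactors = FALSE)
}

# Same example data as in the question
set.seed(123)
distdat <- data.frame(x = rnorm(10, 15, 1),
                      y = rnorm(10, 7, 0.1),
                      z = rnorm(10, 3, 0.01))
rownames(distdat) <- letters[1:10]

# Apply the helper to every column and stack the three pieces
newdat <- do.call(rbind, lapply(colnames(distdat), min_n, df = distdat))
```

Because the column name is passed as a string, the same call works for any number of columns without repeating the pipeline.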

You can use the dplyr and tidyr packages to do this. They make transformations much more readable.
library(dplyr)
library(tidyr)
newdat <- distdat %>%
mutate(row = rownames(.)) %>%
gather(col, dist, -row) %>%
group_by(col) %>%
arrange(col, dist) %>%
top_n(-3, dist)

Related

Plotting Number of Times Value Appears in Two Dataframes in R

I have two sets of data. Each contains a column for the name of the molecule and a column for the number of times that molecule appears in the sample. I want to create a scatterplot with the number of times a molecule appears in dataset #1 on the x-axis and the number of times it appears in dataset #2 on the y-axis. If a molecule is in one dataset and not the other, it appears 0 times.
Example:
dat1 <- data.frame(
name = c("A", "B", "D", "E"),
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F"),
count = c(1, 3, 50, 40)
)
Point #1 would be (10,1) corresponding to A, Point #2 would be (1,3), Point #3 would be (0,50) and so on. I don't want to label my points since my datasets contain tens of thousands of molecules.

Try joining the data frames with full_join():
library(dplyr)
library(ggplot2)
full_join(dat1, dat2, by="name") %>%
mutate_all(function(xx) ifelse(is.na(xx), 0, xx)) %>%
ggplot(aes(count.x, count.y)) +
geom_point()

You would need a full_join():
library(dplyr)
library(ggplot2)
#Data
dat1 <- data.frame(
name = c("A", "B", "D", "E"),
count = c(10, 1, 30, 10)
)
dat2 <- data.frame(
name = c("A", "B", "C", "F"),
count = c(1, 3, 50, 40)
)
#Code
dat1 %>% full_join(dat2 %>% rename(count2=count)) %>%
replace(is.na(.),0) %>%
ggplot(aes(x=count,y=count2))+
geom_point()+
geom_text(aes(label=name),vjust=-0.5)

Calculating a rolling return

I have a data frame with 3 columns. For each monthly period (that is, each row), I want to calculate the compounded return, i.e. the product of the returns, over a rolling window of a selected number of months, wherever enough data is available. This is the basic structure of the data:
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$return <- runif(774, min=0, max=1)
The result should be a data frame with a new column holding the cumulative return over the selected period. For example, for a four-month window, the 4-month return on 03/30/2012 would be calculated from the data as:
((1+0.81/100)*(1+0.715/100)*(1+0.27/100)*(1+0.80/100)-1)*100
This would be repeated for each value under the X column.
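The calculation described above can be sketched in base R as follows. This is only an illustration under stated assumptions: returns are taken to be in percent, as in the worked example, and the helper name `roll_compound` is hypothetical.

```r
# Hypothetical helper: compounded return (in percent) over a trailing window
roll_compound <- function(r_pct, width) {
  n <- length(r_pct)
  out <- rep(NA_real_, n)               # NA until a full window is available
  if (n < width) return(out)
  for (i in width:n) {
    window <- r_pct[(i - width + 1):i]  # the current and the width - 1 prior returns
    out[i] <- (prod(1 + window / 100) - 1) * 100
  }
  out
}

# The four monthly returns from the worked example
r <- c(0.81, 0.715, 0.27, 0.80)
four_month <- roll_compound(r, width = 4)

# Per asset, assuming df has columns x (asset) and return, sorted by date
# within each asset:
# df$ret_4m <- ave(df$return, df$x, FUN = function(v) roll_compound(v, 4))
```

The first `width - 1` entries are NA because no full window exists there yet, which matches the "where available" requirement.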

I ended up using mutate() together with the rolling functions from zoo, where you can set the window width. This is the version I ended up with:
library(dplyr)
library(zoo)
# Create Test Dataframe
set.seed(100)
assets <- c("A", "B", "C", "D", "E", "F", "G", "H", "I")
FileDate <- seq(as.Date("2011-12-30"), as.Date("2019-01-31"), by="months")
df <- merge(x = assets, y = FileDate, all.x = TRUE)
df$performance <- runif(774, min=0, max=1)
This particular code creates a 5-month rolling average. If you sort by column x you can see the result and recreate it in Excel.
df <- df %>%
group_by(x) %>%
mutate(x_mean = rollmean(performance, 5, fill = NA, align = 'right'))
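I also found a way to create a lag, so I could take the 4 values prior to each observation and calculate their mean:
# perf.1.previous is assumed here to be the one-period lag of performance
df2 <- df %>%
mutate(perf.1.previous = lag(performance)) %>%
mutate(perf.4.previous = rollapply(data = perf.1.previous, width = 4, FUN = mean, align = "right", fill = NA, na.rm = TRUE))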

Why does eval(parse(text=string)) give shorter output for object of class aov?

Why does this...
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
produce a less detailed report than this...?
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)
Reproducible data, per a user's request. I know it's weird, but I'm making a user-friendly stats program for my students. Now that I've posted the "reproducible code", the site wants me to provide more detail. I didn't originally post this code because I thought it was a bit much and my general coding isn't great. My question still stands: why does one call give me a short, unhelpful output while the other provides exactly what I want, when I ran the same command?
library(tidyverse)
library(formattable) # needed for formattable() below
DV <- c(1,1, 5,6, 1, 2, 7, 7, 1, 4, 9, 9)
IV1 <- c("A","B", "A", "B","A","B", "A", "B","A","B", "A", "B" )
IV2 <- c("C","C","D","D", "C","C", "D","D", "C","C", "D","D")
Participant <- c("A", "A", "A", "A", "B","B","B","B","C","C","C","C")
IV3 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
IV4 <- "no_data" #remove the word "no_data" and add c("A","B", "C", etc.. )
##### You have to tell the computer if the variable is within!
IV1_iswithin <-"Y"
IV2_iswithin <-"Y"
IV3_iswithin <-"N"
IV4_iswithin <-"N"
####### Your JOB is DONE
data <- data.frame(DV,Participant,IV1,IV2)
#Grouping the dataframe
data %>%group_by(IV2, IV1)%>% #subsetting the data set to calculate 4 different stats
mutate(MAD = median(abs(DV-median(DV))*2.5*1.4826))%>% #calculates 4 different MAD values, one per group
mutate(MADLL = median(DV)-MAD)%>% #calculates the lower limit (LL) of MAD, pipe output to next command
mutate(MADUL = median(DV)+MAD)%>% #calculates the upper limit (UL) of MAD
mutate(OutlierPresent = ifelse(DV<MADLL | DV>MADUL, NA, DV))%>% #Creates NA values if it is an outlier
ungroup()%>% #converts back to big data set
mutate(OutlierPresent = ifelse(DV<MADLL | DV>MADUL, NA, DV))%>% #Creates NA
mutate(whichgroup = paste(IV1,IV2))%>%
mutate(observation = 1:n()) %>%
{. ->> b }
b %>%
select(DV,OutlierPresent) %>%
{. ->> outlierfeedback }
formattable(outlierfeedback)
b %>%
select(OutlierPresent,IV1,IV2,Participant) %>%
pivot_wider(names_from = c(IV1,IV2), values_from = OutlierPresent, names_sep = "_", id_cols = Participant)%>%
drop_na()%>%
pivot_longer(-Participant, names_to = c("IV1","IV2"), names_sep = "_", values_to = "DV")%>%
{. ->> AnovaAnalysisData }
### Calculating the ANOVA for our outlier free data AnovaAnalysisData
#SettingUpTheListofFixedFactors
FactorModel <- list()
ifelse(length(IV1)> 1, FactorModel<- c(FactorModel, "IV1"), FactorModel<-FactorModel)
ifelse(length(IV2)> 1, FactorModel<- c(FactorModel, "IV2"), FactorModel<-FactorModel)
ifelse(length(IV3)> 1, FactorModel<- c(FactorModel, "IV3"), FactorModel<-FactorModel)
ifelse(length(IV4)> 1, FactorModel<- c(FactorModel, "IV4"), FactorModel<-FactorModel)
#SettingUPTheListofErrorFactors
ErrorModel <-list()
ifelse(IV1_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV1"), ErrorModel<-ErrorModel)
ifelse(IV2_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV2"), ErrorModel<-ErrorModel)
ifelse(IV3_iswithin== "Y",ErrorModel<- c(ErrorModel, "IV3"), ErrorModel<-ErrorModel)
ifelse(IV4_iswithin== "Y", ErrorModel<- c(ErrorModel, "IV4"), ErrorModel<-ErrorModel)
StrStart = "aov(DV ~"
StrFactor<-paste(ErrorModel, collapse='*' )
StrErrorStart<-("+Error(Participant/(")
StrError<-paste(ErrorModel, collapse='*' )
StrErrorEnd<- ("),data=AnovaAnalysisData))")
StringforEvaluation<- paste(StrStart, StrFactor,StrErrorStart,StrError,StrErrorEnd)
B.aov2<- eval(parse(text=StringforEvaluation))
summary(B.aov2)
res.aov2 <- aov(DV ~ IV1*IV2+Error(Participant/(IV1*IV2), data = AnovaAnalysisData))
summary(res.aov2)

How to avoid for loop when iterating through unique values in a column [R]

Let's assume that we have following toy data:
library(tidyverse)
data <- tibble(
subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)
which represent network relationships for each subject. For example, there are three unique subjects in the data and the network for the first subject could be represented as sequence of relations:
a -- b, a -- c, b -- c
The task is to compute centralities for each network. Using for loop this is straightforward:
library(igraph)
# Get unique subjects
subjects_uniq <- unique(data$subject)
# Compute centrality of nodes for each graph and keep one result per subject
centrality <- vector("list", length(subjects_uniq))
for (i in seq_along(subjects_uniq)) {
current_data <- data %>% filter(subject == subjects_uniq[i]) %>% select(-subject)
current_graph <- current_data %>% graph_from_data_frame(directed = FALSE)
centrality[[i]] <- eigen_centrality(current_graph)$vector
}
Question: My dataset is huge, so I wonder how to avoid the explicit for loop. Should I use apply() or one of its modern cousins (maybe map() in the purrr package)? Any suggestions are welcome.

Here is an option using map
library(tidyverse)
library(igraph)
map(subjects_uniq, ~data %>%
filter(subject == .x) %>%
select(-subject) %>%
graph_from_data_frame(directed = FALSE) %>%
{eigen_centrality(.)$vector})
#[[1]]
#a b c
#1 1 1
#[[2]]
# a b c d
#1.0000000 0.8546377 0.8546377 0.4608111
#[[3]]
#a b c
#1 1 1
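Another option, sketched here as an assumption rather than a benchmarked recommendation, is to split the edge list once by subject and then process each piece, so the full data set is not filtered again for every subject. It uses base R's split() and lapply() together with the same igraph calls as above.

```r
library(igraph)

# Toy data from the question
data <- data.frame(
  subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  id1 = c("a", "a", "b", "a", "a", "a", "b", "a", "a", "b"),
  id2 = c("b", "c", "c", "b", "c", "d", "c", "b", "c", "c")
)

# Split the edge list once by subject (one data frame per subject)
edges_by_subject <- split(data[c("id1", "id2")], data$subject)

# Build one undirected graph per subject and compute its centralities
centralities <- lapply(edges_by_subject, function(edges) {
  g <- graph_from_data_frame(edges, directed = FALSE)
  eigen_centrality(g)$vector
})
```

The result is a list named by subject, so `centralities[["2"]]` holds the centralities for subject 2. Whether this is actually faster on a large data set would need to be measured.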

Subsetting a data frame under a specific condition

How can I get the rows of a data frame that share a value, in any element, with another data frame?
I have written this, but it didn't work.
# example of two data frame
df1 <- data.frame(V1 = c("a", "g", "h", "l", "n", "e"), V2 = c("b", "n", "i", "m", "i", "f"), stringsAsFactors = F)
df2 <- data.frame(V1 = c("a", "c", "f","h"), V2 = c("b", "d", "e","z"), stringsAsFactors = F)
# finding joint values in each element of two data frames
res1<-intersect(df1$V1,df2$V1)
res2<-intersect(df1$V2,df2$V2)
res3<-intersect(df1$V1,df2$V2)
res4<-intersect(df1$V2,df2$V1)
# Getting rows that has joint value at least in one element of df1
ress1<-df1[apply(df1, MARGIN = 1, function(x) all(x== res1)), ]
ress2<-df1[apply(df1, MARGIN = 1, function(x) all(x== res2)), ]
ress3<-df1[apply(df1, MARGIN = 1, function(x) all(x== res3)), ]
ress4<-df1[apply(df1, MARGIN = 1, function(x) all(x== res4)), ]
# Getting rows that has joint value at least in one element of df2
resss1<-df2[apply(df2, MARGIN = 1, function(x) all(x== res1)), ]
resss2<-df2[apply(df2, MARGIN = 1, function(x) all(x== res2)), ]
resss3<-df2[apply(df2, MARGIN = 1, function(x) all(x== res3)), ]
resss4<-df2[apply(df2, MARGIN = 1, function(x) all(x== res4)), ]
# then combine above results
final.res<-rbind(ress1,ress2,ress3,ress4,resss1,resss2,resss3,resss4)
My desired result is:
a b
h z
h i
f e
e f

This should work
#Import data
df1 <- data.frame(V1 = c("a", "g", "h", "l", "n", "e"), V2 = c("b", "n", "i", "m", "i", "f"), stringsAsFactors = F)
df2 <- data.frame(V1 = c("a", "c", "f","h"), V2 = c("b", "d", "e","z"), stringsAsFactors = F)
# Get the intersects
vals <- intersect(c(df1$V1, df1$V2), c(df2$V1, df2$V2))
#Get the subsets and rbind them
full <- rbind(
subset(df1, df1$V1 %in% vals),
subset(df1, df1$V2 %in% vals),
subset(df2, df2$V1 %in% vals),
subset(df2, df2$V2 %in% vals)
)
#Remove duplicates
full <- full[!duplicated(full),]
