I have a lookup table in R that I am trying to figure out how to implement. The challenge for me is that it involves ranges of continuous values: if a value falls within a range, I'd like the lookup to pick the matching row.
I want to use the two continuous variables 'Grade' and 'Sat' plus the categorical 'Type' value to assign a 'Group' value. The big block of code below looks intimidating, but these are tiny tables.
Any advice is appreciated!
#lookup table code for recreating dataframe
structure(list(Type = structure(c(1L, 2L, 1L, 1L), .Label = c("A",
"B"), class = "factor"), min_grade = c(93L, 85L, 93L, 80L), max_grade = c(100L,
93L, 100L, 92L), min_sat = c(600L, 700L, 400L, 600L), max_sat = c(800L,
800L, 599L, 800L), Group = structure(c(1L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("Type", "min_grade",
"max_grade", "min_sat", "max_sat", "Group"), class = "data.frame", row.names = c(NA,
-4L))
#example data: the desired value is in the 'Group' column, which would be empty before I used the lookup table
structure(list(Name = structure(c(3L, 1L, 2L, 4L), .Label = c("Jack",
"James", "John", "Jordan"), class = "factor"), Grade = c(95L,
95L, 92L, 93L), Sat = c(701L, 500L, 800L, 800L), Type = structure(c(1L,
1L, 1L, 2L), .Label = c("A", "B"), class = "factor"), Group = structure(c(1L,
2L, 3L, 1L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("Name",
"Grade", "Sat", "Type", "Group"), class = "data.frame", row.names = c(NA,
-4L))
How about this?
ltab <- structure(list(Type = structure(c(1L, 2L, 1L, 1L), .Label = c("A",
"B"), class = "factor"), min_grade = c(93L, 85L, 93L, 80L), max_grade = c(100L,
93L, 100L, 92L), min_sat = c(600L, 700L, 400L, 600L), max_sat = c(800L,
800L, 599L, 800L), Group = structure(c(1L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("Type", "min_grade",
"max_grade", "min_sat", "max_sat", "Group"), class = "data.frame", row.names = c(NA,
-4L))
dat <- structure(list(Name = structure(c(3L, 1L, 2L, 4L), .Label = c("Jack",
"James", "John", "Jordan"), class = "factor"), Grade = c(95L,
95L, 92L, 93L), Sat = c(701L, 500L, 800L, 800L), Type = structure(c(1L,
1L, 1L, 2L), .Label = c("A", "B"), class = "factor")), .Names = c("Name",
"Grade", "Sat", "Type"), class = "data.frame", row.names = c(NA,
-4L))
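library(plyr)
# pair every row of dat with every lookup row of the same Type,
# then flag the pairs whose Grade and Sat fall inside the lookup ranges
mdat <- adply(merge(dat, ltab, by="Type", all=TRUE), 1, function(x) {
    c(FallsIn=x$Grade > x$min_grade & x$Grade <= x$max_grade & x$Sat > x$min_sat & x$Sat <= x$max_sat)
})
# keep only the matching pairs
mdat[mdat$FallsIn,]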
Thinking about generalizing: are there going to be more continuous variables that you need to check?
EDIT: I could not edit the OP's post, so, taking the OP's comment into account, here is how I would tackle an example of "categorizing multidimensional continuous random variables"
(phrased that way so that these keywords will flag up in future searches):
breaks <- list(Var1=c(0, 0.25, 1),
Var2=c(0, 0.5, 1),
Var3=c(0, 0.25, 0.75, 1))
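#generate the interval labels on the fly, e.g. "(0, 0.25]"
genIntv <- function(x) {
    ret <- paste0("(", x[1:(length(x)-1)],", ",x[2:length(x)], "]")
    names(ret) <- 1:(length(x)-1)
    ret
}
#human-readable lookup table: one row per combination of intervals
lookupTbl <- data.frame(expand.grid(lapply(breaks, genIntv), stringsAsFactors=FALSE),
    Group=LETTERS[1:12])
#same table keyed by bin index, which is what .bincode() returns
lookupTbl2 <- data.frame(expand.grid(lapply(breaks, function(x) 1:(length(x)-1)), stringsAsFactors=FALSE),
    Group=LETTERS[1:12])
#data set
dat <- data.frame(Var1=c(0.1, 0.76), Var2=c(0.5, 0.75), Var3=c(0.25,0.9))
#bin each column with its own breaks (right-closed intervals, lowest break included)
binDat <- do.call(cbind, setNames(lapply(1:ncol(dat), function(k)
    .bincode(dat[,k], breaks[[k]], TRUE, TRUE)),colnames(dat)))
#look up the Group for each row's combination of bin indices
merge(binDat, lookupTbl2, all.x=TRUE, all.y=FALSE)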
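As a quick check of the binning step, the sketch below runs `.bincode()` on the Var1 values with the Var1 breaks from the answer, and uses `cut()` to show the matching interval labels. The names `x` and `brks` are only illustrative.

```r
# Var1 values and breaks copied from the answer above
x    <- c(0.1, 0.76)
brks <- c(0, 0.25, 1)

# .bincode() returns the index of the interval each value falls into;
# right = TRUE makes the intervals right-closed, and include.lowest = TRUE
# lets a value equal to the first break land in bin 1
bins <- .bincode(x, brks, right = TRUE, include.lowest = TRUE)
print(bins)  # 1 2

# cut() performs the same binning but returns labelled factor levels
labels <- cut(x, brks, right = TRUE, include.lowest = TRUE)
print(as.character(labels))
```

The integer codes are what `lookupTbl2` is keyed on, while the `cut()` labels correspond to the readable intervals in `lookupTbl`.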
It would be good to learn whether someone else has a better approach.
If you have small data, a full join should be fine.
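library(dplyr)
# `example` and `look_up` are the two data frames from the question
result =
  example %>%
  select(-any_of("Group")) %>%          # drop the column we are trying to derive, if present
  full_join(look_up, by = "Type") %>%   # pair each row with every lookup row of its Type
  filter(min_grade < Grade & Grade <= max_grade &
         min_sat < Sat & Sat <= max_sat)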
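For comparison, here is a minimal base R sketch of the same join-then-filter idea. It rebuilds the question's two tables under the names `look_up` and `example`, and it assumes that both ends of each range are inclusive, which the question does not state explicitly.

```r
# Lookup table and data from the question (Group left out of `example`)
look_up <- data.frame(
  Type      = c("A", "B", "A", "A"),
  min_grade = c(93, 85, 93, 80),
  max_grade = c(100, 93, 100, 92),
  min_sat   = c(600, 700, 400, 600),
  max_sat   = c(800, 800, 599, 800),
  Group     = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)
example <- data.frame(
  Name  = c("John", "Jack", "James", "Jordan"),
  Grade = c(95, 95, 92, 93),
  Sat   = c(701, 500, 800, 800),
  Type  = c("A", "A", "A", "B"),
  stringsAsFactors = FALSE
)

# Pair each person with every lookup row of the same Type
candidates <- merge(example, look_up, by = "Type")

# Keep the pairs whose Grade and Sat fall inside the ranges
# (assumption: both bounds are inclusive)
hits <- subset(candidates,
               Grade >= min_grade & Grade <= max_grade &
               Sat >= min_sat & Sat <= max_sat)

result <- hits[order(hits$Name), c("Name", "Grade", "Sat", "Type", "Group")]
print(result)
```

With this data each person matches exactly one lookup row; overlapping ranges in a larger table would produce several rows per person.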
The data frame "id" has, among others, the columns year, id, and matr (matriline), where each row is an incident. I wanted to count the number of incidents by matriline per year, so I did:
events.bymatr =
id %>%
group_by(year, matr, .drop = FALSE) %>%
dplyr::summarise(n = n()) %>%
ungroup()
events.bymatr
I plotted a line graph of the number of incidents over time, by matriline.
ggplot(events.bymatr, aes(x=year, y=n, group=matr)) + geom_line(aes(color=matr))
My question is twofold:
1. Is there a way I could recreate this line graph so that the thickness of each line is determined by how many IDs there were per matriline? I imagine this would involve reshaping my data above, but when I tried group_by(year,matr,id,.drop=FALSE) my data came out all wonky.
2. I want to change the color palette so that each color is very distinct. How do I attach a new color palette? I tried using this c25 palette with the code below, but it makes all my lines disappear.
ggplot(events.bymatr, aes(x=year, y=n, group=matr)) + geom_line(aes(color=c25))
Thanks so much in advance!
Output of "id" (shortened to just the first five rows per column):
> dput(id)
structure(list(date = structure(c(8243, 8243, 8243, 8248, 8947,
class = "Date"), year = c(1992L, 1992L, 1992L, 1992L, 1994L),
event.id = c(8L, 8L, 8L, 10L, 11L), id = structure(c(51L, 55L, 59L,
46L, 51L), .Label = c("J11", "J16", "J17", "J2", "J22"),
class = "factor"), sex = structure(c(1L, 2L, 2L, 1L, 1L),
.Label = c("0", "1"), class = "factor"), age = c(28L, 12L, 6L, 42L,
30L), matr = structure(c(20L, 20L, 20L, 11L, 20L), .Label = c("J2",
"J4", "J7", "J9", "K11"), class = "factor"),
matralive = structure(c(2L, 2L, 2L, 2L, 2L),
.Label = c("0", "1"), class = "factor"), pod = structure(c(3L, 3L,
3L, 3L, 3L), .Label = c("J", "K", "L"), class = "factor")),
row.names = c(NA, -134L), class = c("tbl_df", "tbl", "data.frame"))
Output of events.bymatr:
> dput(events.bymatr)
structure(list(year = c(1992L, 1992L, 1992L, 1992L, 1992L),
matr = structure(c(1L, 2L, 3L, 4L, 5L), .Label = c("J2", "J4",
"J7", "J9", "K11"), class = "factor"), n = c(0L, 0L, 0L, 0L, 0L)),
row.names = c(NA, -380L), class = c("tbl_df", "tbl",
"data.frame"))
As @r2evans noted, it is surprisingly hard to distinguish clearly among more than a handful of colors. I used an example 20-color scale here that does a pretty good job, but even so a few can be tricky to distinguish. Here's an attempt using the storms dataset included with dplyr.
library(dplyr)
library(ggplot2)
storms %>%
group_by(name, year) %>%
summarize(n = n(), .groups = "drop") %>% # = number of name per year
tidyr::complete(name, year = 1975:2015, fill = list(n = 0)) %>%
group_by(name) %>%
mutate(total = sum(n)) %>% # = number of name overall
ungroup() %>%
filter(total %% 12 == 0) %>% # Arbitrary, to reduce scope of data for example
ggplot(aes(year, n, color = name, size = total, group = name)) +
geom_line() +
guides(color = guide_legend(override.aes = list(size = 3))) +
ggthemes::scale_color_tableau(palette = "Tableau 20")
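On the palette part of the question: `aes(color = ...)` maps a column of the data to color, so passing a vector of color names there does not work. A common pattern is to keep `color = matr` inside `aes()` and supply the colors through `scale_color_manual()`. The sketch below uses a small invented stand-in for the `id` data and a hypothetical two-color palette `pal` in place of c25; it also assumes that line thickness should reflect the number of distinct IDs per matriline overall, in the spirit of the `total` column above.

```r
# Toy stand-in for the asker's `id` data frame (values invented for illustration)
id <- data.frame(
  year = c(1992, 1992, 1992, 1992, 1993, 1993, 1993),
  id   = c("J11", "J16", "J11", "J2", "J11", "J22", "J30"),
  matr = c("J2", "J2", "J2", "J4", "J2", "J4", "J4"),
  stringsAsFactors = FALSE
)

# Number of incidents per year and matriline
n_events <- aggregate(id$id, by = list(year = id$year, matr = id$matr), FUN = length)
names(n_events)[3] <- "n"

# Number of distinct IDs per matriline (assumed meaning of "how many IDs")
ids_per_matr <- aggregate(id$id, by = list(matr = id$matr),
                          FUN = function(x) length(unique(x)))
names(ids_per_matr)[2] <- "n_ids"

events <- merge(n_events, ids_per_matr, by = "matr")

# Hypothetical palette; naming the entries pins each color to a matriline
pal <- c(J2 = "dodgerblue2", J4 = "#E31A1C")

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # `size` matches the answer above; recent ggplot2 versions prefer `linewidth` for lines
  p <- ggplot(events, aes(x = year, y = n, group = matr)) +
    geom_line(aes(color = matr, size = n_ids)) +
    scale_color_manual(values = pal)
}
```

Printing `p` draws the plot. With the real data, `pal` would be replaced by the c25 vector, which needs at least as many entries as there are matrilines.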
I have a data frame txf_df, which I subset by gene.list$entrez; I then found the unique gene names it contains. Afterwards, txf_df is converted to the GRanges object txf_grange.
Now I want to write a for loop over the 15 unique genes that, on each iteration, subsets txf_grange to only the ranges for that specific gene.
Code:
# Subset by the Entrez IDs
txf_df <- txf_df %>% filter(geneName %in% gene.list$entrez)
# Find the number of common transcripts
unique <- unique(txf_df$geneName)
length(unique)
# Recast this dataframe back to a GRanges object
txf_grange <- makeGRangesFromDataFrame(txf_df, keep.extra.columns=T)
# For each of the 15 genes, subset the Granges objects by only the gene
for (i in gene.list["entrez"]) {
for (j in txf_grange$geneName) {
if (i==j) {
assign(paste0("gene.", i), 1:j) <- txf_grange[j,]
}
}
}
Data:
> dput(head(txf_df))
structure(list(seqnames = structure(c(1L, 1L, 1L, 1L, 1L, 1L), .Label = "16", class = "factor"),
start = c(12058964L, 12059311L, 12059311L, 12060052L, 12060198L,
12060198L), end = c(12059311L, 12060052L, 12061427L, 12060198L,
12060877L, 12061427L), width = c(348L, 742L, 2117L, 147L,
680L, 1230L), strand = structure(c(1L, 1L, 1L, 1L, 1L, 1L
), .Label = c("+", "-", "*"), class = "factor"), type = structure(c(3L,
1L, 1L, 2L, 1L, 1L), .Label = c("J", "I", "F", "L", "U"), class = "factor"),
txName = structure(list(c("uc002dbv.3", "uc010buy.3", "uc010buz.3"
), c("uc002dbv.3", "uc010buy.3"), "uc010buz.3", c("uc002dbv.3",
"uc010buy.3"), "uc010buy.3", "uc002dbv.3"), class = "AsIs"),
geneName = structure(list("608", "608", "608", "608", "608",
"608"), class = "AsIs")), row.names = c(NA, 6L), class = "data.frame")
> dput(head(gene.list))
structure(list(Name = c("AQP8", "CLCA1", "GUCA2B", "ZG16", "CA4",
"CA1"), Pvalue = c(3.24077275512836e-22, 2.57708986670727e-21,
5.53491656902485e-21, 4.14482213350182e-20, 2.7795892896524e-19,
1.23890644641685e-18), adjPvalue = c(8.3845272720681e-18, 6.66744690314504e-17,
1.43199361473811e-16, 1.07234838237959e-15, 7.19135341018869e-15,
3.20529875816967e-14), logFC = c(-3.73323340223377, -2.96422555675244,
-3.34493724166712, -2.87787132076412, -2.87670608798164, -3.15664667432159
), entrez = c(AQP8 = "343", CLCA1 = "1179", GUCA2B = "2981",
ZG16 = "653808", CA4 = "762", CA1 = "759")), row.names = c(NA,
6L), class = "data.frame")
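One possible way to avoid the nested loop and `assign()` is to split the table by gene and keep the pieces in a named list. The sketch below shows the idea in base R on a small invented data frame with the same list-typed geneName column; it assumes every row carries exactly one gene name. The same pattern should carry over to the GRanges object (for example by splitting txf_grange on the flattened gene names), but that variant is not tested here.

```r
# Toy stand-in for txf_df (coordinates invented for illustration)
txf_df <- data.frame(start = c(10L, 20L, 30L, 40L),
                     end   = c(15L, 25L, 35L, 45L))
txf_df$geneName <- I(list("608", "608", "343", "759"))

# Toy stand-in for gene.list
gene.list <- data.frame(Name   = c("AQP8", "CA1"),
                        entrez = c("343", "759"),
                        stringsAsFactors = FALSE)

# Flatten the list column (assumes one gene name per row)
gene_ids <- unlist(txf_df$geneName)

# Keep only the genes of interest, then split into one piece per gene
keep    <- gene_ids %in% gene.list$entrez
by_gene <- split(txf_df[keep, ], gene_ids[keep])
names(by_gene) <- paste0("gene.", names(by_gene))

print(names(by_gene))
```

Each gene's rows are then available as, for example, `by_gene[["gene.343"]]`, and a loop over `names(by_gene)` can process them one at a time.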
I have the following data frame:
structure(list(StepsGroup = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("(-Inf,3e+03]", "(3e+03,1.2e+04]", "(1.2e+04, Inf]"
), class = "factor"), GlucoseGroup = structure(c(1L, 2L, 3L,
1L, 2L, 3L, 1L, 2L, 3L), .Label = c("<100", "100-180", ">180"
), class = "factor"), n = c(396L, 1600L, 229L, 787L, 4182L, 375L,
110L, 534L, 55L), freq = c(0.177977528089888, 0.719101123595506,
0.102921348314607, 0.147267964071856, 0.782559880239521, 0.0701721556886228,
0.157367668097282, 0.763948497854077, 0.0786838340486409)), class =
c("grouped_df",
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -9L), vars = "StepsGroup",
labels = structure(list(
StepsGroup = structure(1:3, .Label = c("(-Inf,3e+03]", "(3e+03,1.2e+04]",
"(1.2e+04, Inf]"), class = "factor")), class = "data.frame", row.names =
c(NA, -3L), vars = "StepsGroup", drop = TRUE), indices = list(0:2,
3:5, 6:8), drop = TRUE, group_sizes = c(3L, 3L, 3L), biggest_group_size =
3L)
I would like to create a stacked bar plot, and add a summary of each StepsGroup on top of each bar. So the first group will have 2225, the second 5344 and the third 699.
I am using the following script:
ggplot(d_stepsFastingSummary , aes(y = freq, x = StepsGroup, fill =
GlucoseGroup)) + geom_bar(stat = "identity") +
geom_text(aes(label = sum(n()), vjust = 0))
Everything up to the geom_text call works, but for the last bit I get the following error:
Error: This function should not be called directly
Any idea how to add the aggregated quantity?
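We could create a new data frame, stacked_df, holding the sum of n for each StepsGroup, and pass it to geom_text (df here is the data frame from the question, d_stepsFastingSummary):
library(dplyr)
library(ggplot2)
stacked_df <- df %>% group_by(StepsGroup) %>% summarise(nsum = sum(n))
ggplot(df) +
  geom_bar(aes(y = freq, x = StepsGroup, fill = GlucoseGroup), stat = "identity") +
  # y = 1.1 places each label just above its bar, since freq sums to 1 within each group
  geom_text(data = stacked_df, aes(x = StepsGroup, y = 1.1, label = nsum))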
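The per-group totals quoted in the question can be checked with a few lines of base R. This sketch rebuilds only the StepsGroup and n columns from the posted data; `steps_levels` and `totals` are illustrative names.

```r
# StepsGroup labels and counts copied from the question's dput output
steps_levels <- c("(-Inf,3e+03]", "(3e+03,1.2e+04]", "(1.2e+04, Inf]")
df <- data.frame(
  StepsGroup = factor(rep(steps_levels, each = 3), levels = steps_levels),
  n = c(396L, 1600L, 229L, 787L, 4182L, 375L, 110L, 534L, 55L)
)

# Sum n within each StepsGroup
totals <- tapply(df$n, df$StepsGroup, sum)
print(totals)  # 2225, 5344 and 699, in the order of steps_levels
```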
I'm looking for help building some logic in R. I have a data set as shown in the image (recreated as df_temp below).
For any given Pack and ID:
For the first occurrence, cost = Rates
For the second occurrence, cost = Rates[Current] - Rates[Previous]
For the third occurrence, cost = Rates[Current] - Rates[Previous]
I tried the piece of code below, but the Cost column remains unaffected.
df_temp <- structure(list(Rates = c(100L, 200L, 300L, 400L, 500L, 600L),
ID = structure(c(2L, 2L, 2L, 1L, 3L, 3L), .Label = c("wwww",
"xxxx", "yyyy"), class = "factor"), Pack = structure(c(1L,
1L, 2L, 1L, 2L, 2L), .Label = c("a", "b"), class = "factor"),
Cost = c(100L, 100L, 300L, 400L, 500L, 100L)), class = "data.frame", row.names = c(NA,
-6L))
calculate_TTF_or_S <- function(df){
df <- arrange(df,ID,Rates)
unique_sysids <- unique(df$ID)
for (id in unique_sysids) {
df_sub <- df[which(df$ID == id),]
j=1
for (i in seq_len(nrow(df_sub))){
if (j==1){
df$Cost[i] <- df_sub$Rates[j]
} else {
df$Cost[i] <- df_sub$Rates[j] - df_sub$Rates[j-1]
}
j <- j+1
}
}
return (df$Cost)
}
df_temp$Cost <- calculate_TTF_or_S(df_temp)
Here's a solution with dplyr.
library(dplyr)
df_temp %>% # Start with your existing table...
group_by(Pack, ID) %>% # For each combination of Pack and ID...
mutate(calc_cost = if_else(row_number() == 1, # Add new column that is either...
Rates, # Rates for first appearance
Rates - lag(Rates)) # Otherwise, Rates minus prior Rates
) %>% ungroup() # Finally, ungroup the table
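For readers without dplyr, the same rule can be sketched in base R with `ave()`, which applies a function within each Pack and ID combination. Like the dplyr version, this assumes the rows are already in the intended order within each group; `calc_cost` mirrors the column name used above.

```r
# Rates, ID and Pack copied from the question's df_temp
df_temp <- data.frame(
  Rates = c(100L, 200L, 300L, 400L, 500L, 600L),
  ID    = c("xxxx", "xxxx", "xxxx", "wwww", "yyyy", "yyyy"),
  Pack  = c("a", "a", "b", "a", "b", "b"),
  stringsAsFactors = FALSE
)

# Within each Pack/ID group: current Rates minus previous Rates,
# treating the "previous" value of the first row as 0 so that it keeps Rates
df_temp$calc_cost <- ave(
  df_temp$Rates, df_temp$Pack, df_temp$ID,
  FUN = function(x) x - c(0L, head(x, -1))
)

print(df_temp)
```

The result matches the Cost column posted in the question: 100, 100, 300, 400, 500, 100.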
(Still) new to R, and very confused about how I should accomplish multiple melts of my data. Here is a subset:
df <- structure(list(Subject = c(101L, 101L, 101L, 102L, 102L, 102L
), Condition = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("apass",
"vpas"), class = "factor"), FreqCode = structure(c(1L, 1L, 1L,
2L, 2L, 2L), .Label = c("LessVerbal", "MoreVerbal"), class = "factor"),
Item = c(1L, 4L, 7L, 1L, 4L, 7L), Len = c(80L, 68L, 85L,
68L, 85L, 79L), R1_1.RT = c(237L, 203L, 207L, 336L, 487L,
340L), R1_2.RT = c(177L, 225L, 162L, 634L, 590L, 347L), R1_3.RT = c(200L,
226L, 212L, 707L, 653L, 379L), R1.RT = c(614L, 654L, 581L,
1677L, 1730L, 1066L), R1_1 = structure(c(1L, 1L, 1L, 1L,
1L, 1L), .Label = "The", class = "factor"), R1_2 = structure(c(3L,
1L, 2L, 1L, 2L, 4L), .Label = c("antique", "course", "new",
"road"), class = "factor"), R1_3 = structure(c(4L, 1L, 2L,
1L, 2L, 3L), .Label = c("car", "materials", "surfaces", "technology"
), class = "factor"), R1 = structure(c(3L, 1L, 2L, 1L, 2L,
4L), .Label = c("The antique car", "The course materials",
"The new technology", "The road surfaces"), class = "factor")), .Names = c("Subject",
"Condition", "FreqCode", "Item", "Len", "R1_1.RT", "R1_2.RT",
"R1_3.RT", "R1.RT", "R1_1", "R1_2", "R1_3", "R1"), class = "data.frame", row.names =
c(NA,
-6L))
My goal is to get output that (in part) looks like this:
Region RT WordRegion Word
R1_1.RT 237 R1_1 the
...
R1_2.RT 177 R1_2 new
...
EDIT: The variables ending with ".RT" (e.g., R1_1.RT) are Region names and will be melted into a Region column. The variables ending in numbers (e.g., R1_1) correspond exactly to the Region names and hold their associated words. I want them to be melted alongside the Region names so that I can analyze them in relation to the Region column.
In the first part of the code, I melt all of the values into a Region column and change the value to RT. This seems to work fine:
#long transform (with individual regions at end)
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
#change newly created column name to "RT" (note:you have to change the number in [] to match your data)
colnames(SmallMelt1)[11 ] <- "RT"
But I don't get how to simultaneously melt another span of variables such that they will line up vertically with the first span. I want to do something like this, after the first melt, but it does not work:
#Second Melt for region names (doesn't work)
SmallMelt2 = melt(SmallMelt1, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
#Change name to Word
colnames(SmallMelt2)[9] <- "Word" #add col number for "value" here
Please let me know if you need any clarification. I hope someone can help... thanks in advance - DT
So, after consulting with someone off-list, I found the solution. My mistake was that I was trying to run the second step on the output of the first step. By running the two steps independently on the original data and then concatenating, I get the right result.
SmallMelt1 = melt(df, measure.vars = c("R1_1.RT", "R1_2.RT", "R1_3.RT", "R1.RT"), var = "Region")
SmallMelt2 = melt(df, measure.vars = c("R1_1", "R1_2", "R1_3", "R1"), var = "WordRegion")
SmallMelt3=cbind(SmallMelt1,SmallMelt2[,11])
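The cbind step relies on both melts returning rows in the same order. A base R sketch that makes the pairing explicit is shown below: it stacks one block per region and takes the RT column and the matching word column together. It uses a two-row, two-region excerpt of the posted data, and `rt_cols`, `word_cols`, and `long` are illustrative names.

```r
# Two subjects and two regions excerpted from the question's data
df <- data.frame(
  Subject = c(101L, 102L),
  Item    = c(1L, 1L),
  R1_1.RT = c(237L, 336L),
  R1_2.RT = c(177L, 634L),
  R1_1    = c("The", "The"),
  R1_2    = c("new", "antique"),
  stringsAsFactors = FALSE
)

rt_cols   <- c("R1_1.RT", "R1_2.RT")
word_cols <- sub("\\.RT$", "", rt_cols)   # "R1_1", "R1_2"

# Build one long block per region, pairing each RT column with its word column
long <- do.call(rbind, lapply(seq_along(rt_cols), function(k) {
  data.frame(
    Subject    = df$Subject,
    Item       = df$Item,
    Region     = rt_cols[k],
    RT         = df[[rt_cols[k]]],
    WordRegion = word_cols[k],
    Word       = as.character(df[[word_cols[k]]]),
    stringsAsFactors = FALSE
  )
}))

print(long)
```

Because each RT value and its word come from the same source row, the two columns cannot drift out of alignment.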