Developing a function to analyse rows of a data.table in R - r

For a sample dataframe:
df1 <- structure(list(area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), region = structure(c(1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L), .Label = c("a1",
"a2", "b1", "b2"), class = "factor"), weight = c(0, 1.2, 3.2,
2, 1.6, 5, 1, 0.5, 0.2, 0, 1.5, 2.3, 1.5, 1.8, 1.6, 2, 1.3, 1.4,
1.5, 1.6, 2, 3, 4, 2.3, 1.3, 2.1, 1.3, 1.6, 1.7, 1.8, 2, 1.3,
1, 0.5), var.1 = c(0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 1L,
1L, 0L, 1L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L,
0L, 0L, 0L, 1L, 0L, 1L, 0L), var.2 = c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 0L, 1L, 1L, 1L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L,
1L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 1L, 1L)), .Names = c("area",
"region", "weight", "var.1", "var.2"), class = c("data.table",
"data.frame"))
I want to first produce a summary table...
library(data.table)
area_summary <- setDT(df1)[,.(.N, freq.1 = sum(var.1==1), result = weighted.mean((var.1==1),
w = weight)*100), by = area]
...and then populate it by running the following code for each area (e.g. a, b). This finds the regions with the highest and lowest 'result' within the area, produces an xtabs table for those regions, and calculates the relative difference (RD) and its p-value before adding these to the summary table. Here I have developed the code for area 'a':
#Subset to area 'a' and summarise by region
a_cntry <- subset(df1, area=="a")
a_cntry.summary <- setDT(a_cntry)[,.(.N, freq.1 = sum(var.1==1), result = weighted.mean((var.1==1),
w = weight)*100), by = region]
#Include only regions with highest or lowest percentage
incl <- a_cntry.summary[c(which.min(result), which.max(result)),region]
region <- as.data.frame.matrix(a_cntry)
a_cntry <- a_cntry[a_cntry$region %in% incl,]
#Produce weighted xtabs table
a_cntry.var.1 <- xtabs(weight ~ var.1 + region, data=a_cntry)
a_cntry.var.1
#Calculate RD and its p-value from the xtabs table
RD.var.1 <- prop.test(x=a_cntry.var.1[,2], n=rowSums(a_cntry.var.1), correct = FALSE)
RD <- round(- diff(RD.var.1$estimate), 3)
RDpvalue <- round(RD.var.1$"p.value", 4)
RD
RDpvalue
#Add RD and RDpvalue to summary table
area_summary$RD[area_summary$area == "a"] <- RD
area_summary$RDpvalue[area_summary$area == "a"] <- RDpvalue
rm(RD, RD.var.1, RDpvalue, a_cntry.var.1, incl, a_cntry,a_cntry.summary,region)
I wish to wrap this code into a function, so I can just specify the 'areas' (in the 'area' column in df1) and then the code completes all the analysis and adds the results to the summary table.
If I wanted to call my function stats, I understand it may start like this:
stats= function (df1, x) {
apply(x)
}
If anyone can start me off developing my function, I should be most grateful.
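One possible way to start, offered as a minimal sketch rather than a definitive answer: put the per-area steps into a helper that returns one row per area, then bind the rows together. The names rd_for_area and area_stats and the toy data are hypothetical, and the sketch uses base R only so that it runs without extra packages. It re-creates the region factor after subsetting, so the second column of the xtabs table is always one of the regions that were kept.

```r
# Hypothetical helper: repeats the steps from the question for one area.
rd_for_area <- function(df, a) {
  sub <- as.data.frame(df[df$area == a, ])
  sub$region <- factor(sub$region)            # keep only this area's regions
  # weighted percentage of var.1 == 1 in each region
  pct <- sapply(split(sub, sub$region),
                function(d) weighted.mean(d$var.1 == 1, w = d$weight) * 100)
  # keep only the regions with the lowest and highest percentage
  incl <- names(pct)[c(which.min(pct), which.max(pct))]
  sub <- sub[sub$region %in% incl, ]
  sub$region <- factor(sub$region)            # drop unused levels again
  tab <- xtabs(weight ~ var.1 + region, data = sub)
  test <- prop.test(x = tab[, 2], n = rowSums(tab), correct = FALSE)
  data.frame(area = a,
             RD = round(-diff(unname(test$estimate)), 3),
             RDpvalue = round(test$p.value, 4))
}

# Hypothetical wrapper: one result row per requested area.
area_stats <- function(df, areas) {
  do.call(rbind, lapply(areas, function(a) rd_for_area(df, a)))
}

# Toy data (made up for illustration, not the data from the question)
toy <- data.frame(
  area   = rep(c("a", "b"), each = 8),
  region = rep(c("a1", "a2", "b1", "b2"), each = 4),
  weight = c(10, 20, 10, 20,  20, 10, 10, 20,  10, 10, 20, 20,  20, 20, 10, 10),
  var.1  = c(0, 1, 0, 1,      1, 1, 0, 0,      1, 0, 0, 0,      1, 1, 1, 0)
)
res <- area_stats(toy, c("a", "b"))
res
```

The result has one row per area with the columns area, RD and RDpvalue, so it could then be joined to the summary table by area (for example with merge(), converting res to a data.table first if needed).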

Related

HSD.test row names error. How do I check row names?

I have a dataframe for which I did a two-way ANOVA.
dput(m3)
structure(list(Delta = c(-40, -40, -40, -40, -31.7, -29.3, -27.8,
-26.7, -26.2, -25.4, -24.7, -23.1, -23, -22.9, -22.4, -22.2,
-21.4, -21, -20.8, -15.1, -14.9, -14.1, -6.2, -6.2, -6, -5.3,
-4.9), Location = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 2L, 3L, 2L,
3L, 3L, 3L), .Label = c("int", "pen + int", "ter + pen"), class = "factor"),
Between = c(0L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 0L, 2L, 1L, 0L,
1L, 0L, 2L, 0L, 2L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L
), Relative = structure(c(5L, 6L, 6L, 7L, 8L, 3L, 3L, 4L,
5L, 4L, 3L, 5L, 3L, 5L, 7L, 5L, 4L, 6L, 3L, 3L, 6L, 2L, 1L,
2L, 1L, 1L, 1L), .Label = c("1&2", "2&3", "2&4", "2&5", "3&4",
"3&5", "3&6", "4&6"), class = "factor")), class = "data.frame", row.names = c(NA,
-27L))
library(agricolae)
aov.2sum=aov(Delta.~Location*X.between, data=m3)
I want to analyze the data using a HSD.test as I have for another dataframe using the same features.
I am following the code format in the package manual as below.
tx <- with(m3, interaction(Location, X.between))
amod <-aov(Delta~tx, data=m3)
test=HSD.test(amod, "tx", group=TRUE)
Then I receive the following error
Error in `.rowNamesDF<-`(x, value = value) :
duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘int.0’, ‘pen + int.1’, ‘pen + int.2’, ‘te + int.0’, ‘te + int.1’
Upon further analysis I see that my duplicate row names error is related to my X.between feature. When I use the following code I get the same duplicate row names error:
HSD.test(amod, "X.between", group=TRUE)
>> Error in data.frame(row.names = means[, 1], means[, 2:6]) :
duplicate row.names: 0, 1, 2
How are row names chosen for the HSD.test?
Then how can I change my row names? Or just avoid this duplication error?
Thank you for all and any help.
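As a rough way to investigate, and assuming (going only by the error message) that the treatment labels end up as row names of the table of means, the labels can be inspected with base R before calling HSD.test. The small data frame m below is made up for illustration and is not the m3 from the question.

```r
# Made-up miniature of the data, for illustration only
m <- data.frame(
  Delta    = c(-40, -31.7, -29.3, -6.2),
  Location = factor(c("int", "int", "pen + int", "ter + pen")),
  Between  = c(0L, 1L, 1L, 0L)
)
# interaction() builds one label per combination, e.g. "int.0"
tx <- with(m, interaction(Location, Between))
levels(tx)                      # includes combinations that never occur
table(tx)                       # empty cells show up with a count of 0
anyDuplicated(levels(tx))       # 0 means no label is repeated
tx_used <- droplevels(tx)       # keep only combinations present in the data
anyDuplicated(rownames(m))      # row names of the data frame itself
```

If the labels themselves are unique, the duplication probably arises elsewhere, so checking levels() of each factor passed to the test is a reasonable next step.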

R plot Fire Trace

Just getting started with R and would value your input on this question.
What I'm trying to achieve is a plot where:
The X axis has all values of "TimeStamp" (from 0 to 9)
The Y axis has all values of "NID" (from 0 to 3)
There are dots at the coordinates ("TimeStamp", "NID") where the attribute "Fires" = 1.
The source data has the following format:
dat = structure(list(TimeStamp = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L,
4L), NID = c(0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 3L, 4L, 0L, 1L,
2L, 3L, 4L, 0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 3L, 4L), NumberSynapsesTotal = c(2L,
2L, 3L, 2L, 4L, 2L, 2L, 3L, 2L, 4L, 2L, 2L, 3L, 2L, 4L, 2L, 2L,
3L, 2L, 4L, 2L, 2L, 3L, 2L, 4L), NumberActiveSynapses = c(1L,
2L, 1L, 2L, 3L, 1L, 2L, 1L, 1L, 0L, 1L, 2L, 1L, 1L, 0L, 1L, 2L,
1L, 1L, 0L, 1L, 0L, 0L, 1L, 0L), Fires = c(1L, 1L, 1L, 1L, 0L,
1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L,
0L, 1L, 0L, 0L)), row.names = c(NA, 25L), class = "data.frame")
I tried to apply a filter, but the plot then only covers the subset of rows where the attribute "Fires" is 1 (not all values appear on the axes):
dat %>%
filter(dat$Fires == 1) %>%
ggplot(aes(x = dat$TimeStamp[dat$Fires == 1], y = dat$NID[dat$Fires == 1])) +
geom_point()
Alternatively, I get all existing values for the attributes "Timestamp" and "NID" by using the following code:
plot(dat$TimeStamp, dat$NID,
xlab = "Time", ylab = "Neuron ID")
title(main = "Fire Trace Plot")
so the picture looks in the following way:
Finally, from the comment below I modified the code to:
ggplot(dat, aes(x = TimeStamp, y = NID) , xlab = "Time", ylab ="Neuron
ID") +
geom_blank() +
geom_point(dat = filter(dat) +
#title(main = "Fire Trace Plot")
scale_x_continuous(breaks = F_int_time_breaks(1) )
Is it the case that I should build two charts on one plot?
Thank you!
With ggplot2, never use data$ inside aes(), just use the column names. Similarly, the dplyr functions like filter should not be used with data$ - they know to look in the data frame for the column.
I think you want to build your ggplot with the full data, so the axes get set to cover the full data (we force this by adding a geom_blank() layer), and it is only the point layer that should be subset:
library(ggplot2)
library(dplyr)
# create some sample data (it is nice if you provide this in the question)
dat = expand.grid(Timestamp = 0:9, NID = 0:3)
dat$Fires = ifelse(dat$NID == 2, 1, 0)
# make the plot: geom_blank() sets the axes from the full data,
# and only the point layer gets the filtered rows
ggplot(dat, aes(x = Timestamp, y = NID)) +
  geom_blank() +
  geom_point(data = filter(dat, Fires == 1))
The code should look like this (see reasons in the comments):
library(ggplot2)
# returns a function that places axis breaks every k units
F_int_time_breaks <- function(k) {
  step <- k
  function(y) seq(floor(min(y)), ceiling(max(y)), by = step)
}
# axis labels and the title go in labs(); ggplot() ignores xlab/ylab arguments
ggplot(dat, aes(x = TimeStamp, y = NID)) +
  geom_blank() +
  geom_point(data = subset(dat, Fires == 1)) +
  scale_x_continuous(breaks = F_int_time_breaks(1)) +
  labs(x = "Time", y = "Neuron ID", title = "Fire Trace Plot")
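For comparison, the same idea can be sketched in base graphics, which the question also tried: set up an empty plot from the full data, then add only the fired observations. The toy data here is made up and only mimics the column names from the question.

```r
# Toy data in the same shape as the question's (values made up)
dat <- expand.grid(TimeStamp = 0:9, NID = 0:3)
dat$Fires <- as.integer(dat$NID == 2)
fired <- dat[dat$Fires == 1, ]
# type = "n" sets up the axes from the full data without drawing anything
plot(dat$TimeStamp, dat$NID, type = "n",
     xlab = "Time", ylab = "Neuron ID", main = "Fire Trace Plot")
# then only the fired observations are drawn as dots
points(fired$TimeStamp, fired$NID, pch = 16)
```

So it is one chart with two steps (or two layers in ggplot2), not two separate charts.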

Problems with levels in a xtab in R

For a sample dataframe:
df <- structure(list(area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L, 4L,
4L, 4L, 4L), .Label = c("a1", "a2", "a3", "a4"), class = "factor"),
result = c(0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 1L, 0L,
1L, 0L, 1L, 0L, 1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L, 0L, 1L),
weight = c(0.5, 0.8, 1, 3, 3.4, 1.6, 4, 1.6, 2.3, 2.1, 2,
1, 0.1, 6, 2.3, 1.6, 1.4, 1.2, 1.5, 2, 0.6, 0.4, 0.3, 0.6,
1.6, 1.8)), .Names = c("area", "result", "weight"), class = "data.frame", row.names = c(NA,
-26L))
I am trying to isolate the areas with the highest and lowest weighted result and then produce a weighted crosstab, which is then used to calculate the risk difference.
df.summary <- setDT(df)[,.(.N, freq.1 = sum(result==1), result = weighted.mean((result==1),
w = weight)*100), by = area]
#Include only regions with highest or lowest percentage
df.summary <- data.table(df.summary)
incl <- df.summary[c(which.min(result), which.max(result)),area]
df.new <- df[df$area %in% incl,]
incl
'incl' has the two areas that I want, but still the four levels:
[1] a2 a3
Levels: a1 a2 a3 a4
How do I get rid of the levels as well? The subsequent analysis that I want to do needs just the two levels as well as the areas. Any ideas?
I found this elsewhere on the web (e.g. Problems with levels in a xtab in R)
df.new$area <- factor(df.new$area)
It works!
Hope it's useful for others.
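For what it is worth, base R also has droplevels(), which does the same job for a single factor or for all factor columns of a data frame. A small sketch with made-up values (df.small is a hypothetical stand-in for the data in the question):

```r
area <- factor(c("a1", "a2", "a2", "a3", "a4"))
incl <- area[c(2, 4)]                 # two values, but still four levels
levels(incl)
incl_clean <- droplevels(incl)        # same effect as factor(incl)
levels(incl_clean)
# droplevels() also works on a whole data frame after subsetting
df.small <- data.frame(area = area, x = 1:5)
df.new <- droplevels(df.small[df.small$area %in% incl, ])
nlevels(df.new$area)
```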

identifying rows in data frame that exhibit patterns

Below I have data with 3 columns: a group field, an open/closed field for the store, and the rolling sum of opens over 3 months for the store. I also have the desired solution output.
My dataset can be thought of as an employee's availability. You can assume each row is a different time period (hour, day, month, year, whatever). The open/closed column records whether or not the employee was present. The 3-month rolling column is a sum over the previous rows.
What I want to identify are the non-zero values in this rolling sum column that follow a gap of at least 3 zero rows within that particular group. While not present in this dataset, you can assume that there might be more than one 'gap' of zeros present.
df1 <- structure(list(Group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), X0_closed_1_open = c(0L,
1L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L), X3month_roll_open = c(0L,
0L, 1L, 2L, 2L, 1L, 1L, 0L, 0L, 0L, 0L, 1L, 2L, 0L, 1L, 1L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L), desired_solution = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L), .Label = c("no", "yes"), class ="factor")), .Names = c("Group", "X0_closed_1_open", "X3month_roll_open", "desired_solution"), class = "data.frame", row.names = c(NA,
-26L))
One option is:
res <- unsplit(
lapply(split(df1, df1$Group), function(x) {
rl <- with(x,rle(X3month_roll_open==0))
indx <- cumsum(c(0,diff(inverse.rle(within.list(rl,
values[values] <- lengths[values]>=3)))<0))
x$Flag <- indx!=0 & x[,3]!=0
x}),
df1$Group)
NOTE: Instead of 'yes/no', it may be better to have 'TRUE/FALSE', which makes subsetting easier.
identical(c('no', 'yes')[res$Flag+1L], as.character(res$desired_solution))
#[1] TRUE
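The run-length idea can also be written out more explicitly. The following is a simplified sketch, not the exact logic of the code above: it flags only the non-zero run that comes immediately after a zero run of at least min_gap rows. The function name flag_after_gap and the parameter min_gap are made up here; the vectors hold the X3month_roll_open values of the two groups from the question.

```r
flag_after_gap <- function(x, min_gap = 3) {
  r <- rle(x == 0)                     # runs of zero / non-zero values
  n <- length(r$values)
  # TRUE for a run whose preceding run is a zero run of at least min_gap rows
  after_long_gap <- c(FALSE, (r$values & r$lengths >= min_gap)[-n])
  # expand back to one flag per row, keeping only the non-zero runs
  rep(!r$values & after_long_gap, r$lengths)
}

roll  <- c(0, 0, 1, 2, 2, 1, 1, 0, 0, 0, 0, 1, 2,
           0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1)
group <- rep(c("A", "B"), each = 13)
# apply the function within each group and put the flags back in row order
flag <- unsplit(lapply(split(roll, group), flag_after_gap), group)
which(flag)   # 12 13 24 25 26, the rows marked 'yes' in desired_solution
```

Whether a later non-zero run that follows only a short gap should also be flagged is a design choice; this sketch does not flag it.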

Using Rs data.table on (weighted) survey data [duplicate]

This question already has answers here:
How to compute weighted mean in R?
(2 answers)
Closed 7 years ago.
For a sample dataframe:
df <- structure(list(id = 1:25, region.1 = structure(c(1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L), .Label = c("AT1", "AT2", "AT3", "AT4"
), class = "factor"), gndr = c(0L, 1L, 0L, 0L, 0L, 1L, 0L, 1L,
1L, 1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 1L,
1L), PoorHealth = c(0L, 1L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L,
0L, 1L, 0L, 1L, 0L, 1L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 1L), weight = c(0.3,
1.6, 2.5, 3.5, 0.2, 0.2, 0.2, 0.6, 0.15, 0.25, 1.36, 1, 1, 1,
0.1, 0.2, 0.3, 0.3, 0.3, 0.4, 0.3, 1, 1.4, 1.3, 0.4)), .Names = c("id",
"region.1", "gndr", "PoorHealth", "weight"), class = c("data.table",
"data.frame"), row.names = c(NA, -25L))
I wish to create a summary data table (using data.table) using the code:
variable.table_1 <- setDT(df)[,.(.N,result=sum((PoorHealth==1)/.N)*100),
by=region.1]
However, my original data is from a survey, so I have a design weight and a population weight, which I have multiplied together (following the guidance from the survey) and called 'weight'.
How do I apply an appropriate weighting to the 'result' variable in variable.table_1?
Perhaps I have to use the survey package? Looking here, it seems that I first have to run my dataframe through the survey package...
library(survey)
df.w <- svydesign(id = ~1, data = df, weights = df$weight)
... but I am unsure how I incorporate the results into my summary data table.
Many thanks in advance.
Perhaps you can use the weighted.mean function
variable.table_1 <- setDT(df)[,.(.N, result = weighted.mean((PoorHealth==1),
w = weight)*100), by = region.1]
In your example you could also simply use mean instead of sum in combination with /.N.
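To make the difference concrete, here is a small worked sketch using the four AT1 rows from the sample data. weighted.mean(x, w) is sum(w * x) / sum(w), so rows with larger weights count for more; the variable names below are chosen only for this illustration.

```r
# Values for region AT1 (the first four rows of the sample data)
poor   <- c(0, 1, 0, 0)
weight <- c(0.3, 1.6, 2.5, 3.5)

unweighted <- mean(poor == 1) * 100                    # 1 of 4 rows: 25
question   <- sum((poor == 1) / length(poor)) * 100    # formula from the question, also 25
weighted   <- weighted.mean(poor == 1, w = weight) * 100
by_hand    <- sum(weight * (poor == 1)) / sum(weight) * 100   # 1.6 / 7.9 * 100

c(unweighted = unweighted, weighted = weighted, by_hand = by_hand)
```

Here the weighted percentage is about 20.25 rather than 25, because the one respondent in poor health carries 1.6 of the total weight of 7.9.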
