I am really new to R (I've been learning it for a week now) and even newer to Stack Overflow. I have a question about how to use the aggregate function. I am trying the following code:
a = aggregate(dom$pesoA,
              by = list(tipoE = addNA(dom$typeEsg), mun = dom$codMun),
              FUN = sum,
              na.rm = FALSE)
where:
- dom$pesoA has only the values that I need to sum
- dom$typeEsg has numbers from 1 to 6 and also many NAs
- dom$codMun has municipality codes (no NAs)
Can I transform this data frame (a) into a matrix where the tipoE values are the columns, the mun values are the rows, and the sums of dom$pesoA are the elements of the matrix (some combinations of mun and tipoE are missing)?
I don't know whether my explanation is clear; if you have any questions I will try to answer them.
This is what my data frame a looks like (screenshot omitted).
Thanks in advance
TR
If your data frame really does look like that, then there is a serious mismatch between your column names and your code.
dom <- data.frame(tipoE = sample(c(letters[1:4], NA), 30, rep = TRUE),
                  mun = rep(c(3200102, 3200106, 3200310), each = 10),
                  x = runif(30, 100, 200))
dom
This reworking succeeds:
a = aggregate(dom$x,
              by = list(tipoE = addNA(dom$tipoE), mun = dom$mun),
              FUN = sum)
a
This use of xtabs then gives what you requested:
> aT <- xtabs( x ~ tipoE + mun, a)
> aT
mun
tipoE 3200102 3200106 3200310
a 340.7700 367.1412 180.0594
b 280.9851 485.8780 798.4880
c 280.7682 236.3637 165.2295
d 176.6967 125.0732 132.5339
<NA> 376.4278 117.1063 251.2514
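If you prefer to skip the aggregate step, tapply can produce the same matrix directly from dom; a sketch (combinations that never occur come back as NA):

# Rows are tipoE (with NA kept as a level), columns are mun
aM <- tapply(dom$x, list(tipoE = addNA(dom$tipoE), mun = dom$mun), sum)
aM[is.na(aM)] <- 0  # optional assumption: treat absent combinations as zero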
I have the following data, and I would like to expand it. For example, if June has two successes and one failure, my dataset should look like:
month | is_success
------------------
6 | T
6 | T
6 | F
Dataset is as follows:
# Months from July to December
months <- 7:12
# Number of successes (failures) for each month
successes <- c(11,22,12,7,6,13)
failures <- c(20,19,11,16,13,10)
A sample solution is as follows:
dataset <- data.frame()
for (i in 1:length(months)) {
  dataset <- rbind(dataset, cbind(rep(months[i], successes[i]), rep(T, successes[i])))
  dataset <- rbind(dataset, cbind(rep(months[i], failures[i]), rep(F, failures[i])))
}
names(dataset) <- c("months", "is_success")
dataset[, "is_success"] <- as.factor(dataset[, "is_success"])
Question: What are the different ways to rewrite this code?
I am looking for a comprehensive solution with different but efficient ways (matrix, loop, apply).
Thank you!
Here is one way with rep. Create a dataset with 'months' and 'is_success' based on replication of 1 and 0. Then replicate the rows by the values of 'successes' and 'failures', order if necessary, and set the row names to NULL.
d1 <- data.frame(months, is_success = factor(rep(c(1, 0), each = length(months))))
d2 <- d1[rep(1:nrow(d1), c(successes, failures)),]
d2 <- d2[order(d2$months),]
row.names(d2) <- NULL
Now we check whether this is equal to the data created by the for loop:
all.equal(d2, dataset, check.attributes = FALSE)
#[1] TRUE
Or as #thelatemail suggested, 'd1' can be created with expand.grid
d1 <- expand.grid(months = months, is_success = 1:0)
Using mapply, you can try this:
createdf <- function(month, successes, failures) {
  data.frame(month = rep(x = month, successes + failures),
             is_success = c(rep(x = TRUE, successes),
                            rep(x = FALSE, failures)))
}
Now create a list of required data.frames:
lofdf <- mapply(FUN = createdf, months, successes, failures, SIMPLIFY = FALSE)
And then combine using the plyr package's ldply function:
library(plyr)
resdf <- ldply(lofdf, .fun = data.frame)
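If you would rather avoid the plyr dependency, a base-R sketch that should give the same stacked result:

# Bind the list of per-month data frames row-wise (lofdf from above)
resdf <- do.call(rbind, lofdf)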
I am calculating final averages for a course. There are about 500 students, and the grades are organized into a .csv file. Column headers include:
Name, HW1, ..., HW10, Quiz1, ..., Quiz5, Exam1, Exam2, Final
Each is weighted differently, and the weighting itself shouldn't be an issue to program. However, the lowest 2 HW and the lowest Quiz are dropped for each student. How could I program this in R? Note that the HW/Quiz dropped for each student may be different (i.e. Student A has HW2, HW5, Quiz2 dropped; Student B has HW4, HW8, Quiz1 dropped).
Here is a simpler solution. The sum_after_drop function takes a vector x, drops the i lowest scores, and sums the remaining ones. We invoke this function for each student's row in the dataset (the sample gradebook dd is defined in the next answer below). ddply is overkill for this job, but keeps things simple. You could also do this with apply, except that you would have to convert the end result to a data frame.
The actual grade calculations can then be carried out on dd2. Note that using the cut function with breaks is a simple way to get letter grades from the total scores (a sketch follows the code below).
library(plyr)

# Drop the i lowest scores in x and sum the remaining ones
sum_after_drop <- function(x, i){
  sum(sort(unlist(x))[-(1:i)])
}
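# Quick check: sort(c(5, 9, 7)) is 5 7 9; dropping the lowest leaves 7 + 9 = 16
sum_after_drop(c(5, 9, 7), 1)
# [1] 16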
# dd is the sample gradebook built below; pass 2 instead of 1 to drop the
# two lowest homework scores, as the question asks
dd2 = ddply(dd, .(Name), function(d){
  hw = sum_after_drop(d[, grepl("HW", names(d))], 1)
  qz = sum_after_drop(d[, grepl("Quiz", names(d))], 1)
  data.frame(hw = hw, qz = qz)
})
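To make the cut suggestion concrete, here is a minimal sketch; the break points and the percentage scale are assumptions, not from the original post:

# Hypothetical percentages and letter-grade breaks (90+ = A, 80-90 = B, ...)
pct <- c(55, 72, 88, 91)
cut(pct, breaks = c(-Inf, 60, 70, 80, 90, Inf),
    labels = c("F", "D", "C", "B", "A"))
# [1] F C B A
# Levels: F D C B A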
Here's a sketch of how you could approach it using the reshape2 package and base functions.
#sample data
set.seed(734)
dd <- data.frame(
  Name = letters[1:20],
  HW1 = rpois(20, 7),
  HW2 = rpois(20, 7),
  HW3 = rpois(20, 7),
  Quiz1 = rpois(20, 15),
  Quiz2 = rpois(20, 15),
  Quiz3 = rpois(20, 15)
)
Now I convert it to long format and split apart the field names:
require(reshape2)
mm <- melt(dd, "Name")
mm <- cbind(mm,
            colsplit(gsub("([A-Za-z]+)(\\d+)", "\\1:\\2", mm$variable, perl = TRUE), ":",
                     names = c("type", "number")))
Now I can use by() to get a data.frame for each name and do the rest of the calculations. Here I just drop the lowest homework and the lowest quiz, and I give homework a weight of .2 and quizzes a weight of .8 (assuming all homeworks were worth 15 pts and quizzes 25 pts).
grades <- unclass(by(mm, mm$Name, function(x) {
  hw <- tail(sort(x$value[x$type == "HW"]), -1)     # drop the single lowest HW
  quiz <- tail(sort(x$value[x$type == "Quiz"]), -1) # drop the single lowest quiz
  (sum(hw) * .2 + sum(quiz) * .8) / (length(hw) * 15 * .2 + length(quiz) * 25 * .8)
}))
attr(grades, "call") <- NULL # get rid of crud from by()
grades
Let's check our work. Look at student "c":
Name HW1 HW2 HW3 Quiz1 Quiz2 Quiz3
c 6 9 7 21 20 14
Their grade should be
((9+7)*.2+(21+20)*.8) / ((15+15)*.2 + (25+25)*.8) = 0.7826087
and in fact, we see
grades["c"] == 0.7826087
Here's a solution with dplyr. It ranks the scores by student and type of assignment (i.e. it calculates the rank order of all of student 1's homeworks, etc.), then filters out the lowest 1 (or 2, or whatever). dplyr's syntax is pretty intuitive; you should be able to walk through the code fairly easily.
# Load libraries
library(reshape2)
library(dplyr)
# Sample data
grades <- data.frame(name=c("Sally", "Jim"),
HW1=c(10, 9),
HW2=c(10, 5),
HW3=c(5, 10),
HW4=c(6, 9),
HW5=c(8, 9),
Quiz1=c(9, 5),
Quiz2=c(9, 10),
Quiz3=c(10, 8),
Exam1=c(95, 96))
# Melt into long form
grades.long <- melt(grades, id.vars="name", variable.name="graded.name") %>%
  mutate(graded.type=factor(sub("\\d+", "", graded.name)))
grades.long
# Remove the lowest scores for each graded type
grades.filtered <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score=rank(value, ties.method="first")) %>% # Rank all the scores
  filter((ranked.score > 2 & graded.type=="HW") |   # Ignore the lowest two HWs
         (ranked.score > 1 & graded.type=="Quiz") | # Ignore the lowest quiz
         (graded.type=="Exam"))
grades.filtered
# Calculate the average for each graded type
grade.totals <- grades.filtered %>%
  group_by(name, graded.type) %>%
  summarize(total=mean(value))
grade.totals
# Unmelt, just for fun
final.grades <- dcast(grade.totals, name ~ graded.type, value.var="total")
final.grades
You technically could add the summarize(total=mean(value)) step to the grades.filtered pipeline rather than making a separate grade.totals data frame; I separated them into multiple data frames for didactic reasons.
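A minimal sketch of that collapsed pipeline, reusing the names defined above:

# Filter and summarize in one chain instead of two data frames
grade.totals <- grades.long %>%
  group_by(name, graded.type) %>%
  mutate(ranked.score=rank(value, ties.method="first")) %>%
  filter((ranked.score > 2 & graded.type=="HW") |
         (ranked.score > 1 & graded.type=="Quiz") |
         (graded.type=="Exam")) %>%
  summarize(total=mean(value))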
I have data in this format:
Count, Thread1, Thread2, Thread3, Thread4,
10420162, 589768
46530936, 1164357
55563161, 275521, 12289
56741671, 25158, 28020
57792881, 44468, 91248
(As the additional threads come in to play, data appears in their columns)
I would like to plot the sum (running total) of the Thread data against the Count, e.g. when x is 0, y is 0; when x is 10420162, y is 589768; when x is 46530936, y is 1754125; when x is 55563161, y is 2041935; and so on.
It's not clear to me how to do this; presumably it requires at least two steps, summing the data and then plotting it?
Your calculated numbers don't match mine, so I have a feeling I didn't understand your question correctly. Or did you calculate it wrong?
df <- read.csv(textConnection('Count, Thread1, Thread2, Thread3, Thread4,
10420162, 589768
46530936, 1164357
55563161, 275521, 12289
56741671, 25158, 28020
57792881, 44468, 91248'), header=TRUE)
dfcumsum <- data.frame(
  count = df$Count,
  cumthreadsum = cumsum(rowSums(df[, -1], na.rm = TRUE))
)
Output:
> dfcumsum
count cumthreadsum
1 10420162 589768
2 46530936 1754125
3 55563161 2041935
4 56741671 2095113
5 57792881 2230829
The most elementary plot would be plot(dfcumsum$count, dfcumsum$cumthreadsum), which puts Count on the x-axis and the running total on the y-axis.
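If you also want the plot to start at the (0, 0) point mentioned in the question, a small sketch:

# Prepend the origin so the line starts at x = 0, y = 0
plot(c(0, dfcumsum$count), c(0, dfcumsum$cumthreadsum),
     type = "l", xlab = "Count", ylab = "Cumulative thread total")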
I have been working on a file to calculate hospital infection rates. I want to standardise the infection rates by yearly procedure counts. The data are located here because they are too big for dput. SSI indicates a surgical site infection (1 = infected, 0 = not infected), Procedure is the type of procedure, and Year has been derived using lubridate.
library(plyr)
fname <- "https://raw.github.com/johnmarquess/some.data/master/hospG.csv"
download.file(fname, destfile='hospG.csv', method='wget')
hospG <- read.csv('hospG.csv')
Inf_table <- ddply(hospG, "Year", summarise,
                   Infections = sum(SSI == 1),
                   Procedures = length(Procedure),
                   PropInf = round(Infections / Procedures * 100, 2))
This gives me the number of infections, procedures, and proportion infected per year for this hospital.
What I would like is an additional column with the standardised proportion infected. The long way to do this outside Inf_table is:
s1 <- sum(Inf_table$Infections)
s2 <- sum(Inf_table$Procedures)
Expected_prop_inf <- Inf_table$Procedures * s1/s2
Is there a way to get ddply to do this? I tried making a function with the calculation to produce Expected_prop_inf, but I did not get very far.
Thanks for any help offered.
It's more difficult with ddply because you are dividing by a number computed outside the grouping. Better to do it with base R:
# base
> with(Inf_table, Procedures*(sum(Infections)/sum(Procedures)))
[1] 17.39184 17.09623 23.00847 20.84065 24.83141 24.83141
rather than with ddply, which is not so natural:
# NB: .(Year) is unique for every row; you could also group by rownames
> s1 <- sum(Inf_table$Infections)
> s2 <- sum(Inf_table$Procedures)
> ddply(Inf_table, .(Year), summarise, Procedures*(s1/s2))
Year ..1
1 2001 17.39184
2 2002 17.09623
3 2003 23.00847
4 2004 20.84065
5 2005 24.83141
6 2006 24.83141
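If you want the result attached as a new column of Inf_table rather than printed as a vector, transform() gives a one-line sketch of the same base-R idea:

# Add the standardised (expected) proportion as a column
Inf_table <- transform(Inf_table,
                       Expected_prop_inf = Procedures * sum(Infections) / sum(Procedures))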
Here is a solution to aggregate using data.table.
I'm not sure if it's possible to do it in one step, but see the chained sketch at the end of this answer.
require("data.table")
fname <- "https://raw.github.com/johnmarquess/some_data/master/hospG.csv"
hospG <- read.csv(fname)
DT <- as.data.table(hospG)
Inf_table <- DT[, {Infections = sum(SSI == 1)
                   Procedures = length(Procedure)
                   PropInf = round(Infections / Procedures * 100, 2)
                   list(Infections = Infections,
                        Procedures = Procedures,
                        PropInf = PropInf)
                  }, by = Year]
Inf_table[, Expected_prop_inf := Procedures * sum(Infections) / sum(Procedures)]
tables()
The added bonus of this approach is that you are not creating another data.table in the second step; the new column is added to the existing data.table by reference. This is relevant in case your datasets are bigger.
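As a follow-up to the one-step question above, data.table chaining should allow it in a single expression; a sketch using the same column names as before:

# Aggregate by Year, then append the standardised column in one chain
Inf_table <- DT[, .(Infections = sum(SSI == 1),
                    Procedures = .N,
                    PropInf = round(sum(SSI == 1) / .N * 100, 2)),
                by = Year][, Expected_prop_inf :=
                             Procedures * sum(Infections) / sum(Procedures)][]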
I have a dataframe of the basic form:
> head(raw.data)
NAC cOF3 APir Pu Tu V2.3 mOF3 DGpf
1 6.314770 6.181188 6.708971 6.052134 6.546938 6.079848 6.640716 6.263770
2 8.825595 8.740217 9.532026 8.919598 8.776969 8.843287 8.631505 9.053732
3 5.518933 5.982044 5.632379 5.712680 5.655525 5.580141 5.750969 6.119935
4 6.063098 6.700194 6.255736 5.124315 6.133631 5.891009 6.070467 6.062815
5 8.931570 9.048621 9.258875 8.681762 8.680993 9.040971 8.785271 9.122226
6 5.694149 5.356218 5.608698 5.894171 5.629965 5.759247 5.929289 6.092337
I would like to perform t-tests of each column versus all other columns and save the resulting p-values to a variable, in some variation of the following:
# run tests
test.result = mapply(t.test, one.column, other.columns)
# store p-values
p.values = stack(mapply(function(x, y) t.test(x, y)$p.value,
                        one.column, other.columns))
Or would aov() be a better option for such an analysis? In any case, I would like to know how to streamline doing it using t-tests.
Here's one solution:
Read in the data:
dat <- read.table(text='NAC cOF3 APir Pu Tu V2.3 mOF3 DGpf
1 6.314770 6.181188 6.708971 6.052134 6.546938 6.079848 6.640716 6.263770
2 8.825595 8.740217 9.532026 8.919598 8.776969 8.843287 8.631505 9.053732
3 5.518933 5.982044 5.632379 5.712680 5.655525 5.580141 5.750969 6.119935
4 6.063098 6.700194 6.255736 5.124315 6.133631 5.891009 6.070467 6.062815
5 8.931570 9.048621 9.258875 8.681762 8.680993 9.040971 8.785271 9.122226
6 5.694149 5.356218 5.608698 5.894171 5.629965 5.759247 5.929289 6.092337')
Get all possible pairwise combinations:
com <- combn(colnames(dat), 2)
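With eight columns this gives choose(8, 2) = 28 pairs; com is a 2 x 28 character matrix whose columns name the pairs:

dim(com)
# [1]  2 28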
Get the p-values:
p <- apply(com, 2, function(x) t.test(dat[,x[1]], dat[,x[2]])$p.val)
Put into a data frame:
data.frame(comparison = paste(com[1,], com[2,], sep = ' vs. '), p.value = p)
An even better solution is to use melt from the reshape package together with pairwise.t.test (note that pairwise.t.test pools standard deviations across groups by default; pass pool.sd = FALSE if you want ordinary two-sample tests):
library(reshape)
with(melt(dat), pairwise.t.test(value, variable, p.adjust.method = 'none'))
If you want to pair just the first column with all the others, you can also use this:
x <- sapply(dat[,-1], function(x) t.test(x, dat[,1])$p.value)
data.frame(variable = names(x), p.value = as.numeric(x))