I am trying to use the $names operator on my OutVals (outliers) to find the class each outlier is associated with, and then put the outliers and their class names in a data frame so I can see clearly which class each outlier came from.
However, when I try this, the class names come back as "1", "2", etc., and not "Van", "Bus", etc., as they appear in the dataset.
Have I missed something, or am I approaching this completely wrong?
The goal is to extract the outliers from the data and place them in a table that shows which class each outlier came from.
Any help would be appreciated.
I have shown my data frame and my reproducible code below.
library(reshape2)
vehData <-
structure(
list(
Samples = 1:6,
Comp = c(95L, 91L, 104L, 93L, 85L,
107L),
Circ = c(48L, 41L, 50L, 41L, 44L, 57L),
D.Circ = c(83L,
84L, 106L, 82L, 70L, 106L),
Rad.Ra = c(178L, 141L, 209L, 159L,
205L, 172L),
Pr.Axis.Ra = c(72L, 57L, 66L, 63L, 103L, 50L),
Max.L.Ra = c(10L,
9L, 10L, 9L, 52L, 6L),
Scat.Ra = c(162L, 149L, 207L, 144L, 149L,
255L),
Elong = c(42L, 45L, 32L, 46L, 45L, 26L),
Pr.Axis.Rect = c(20L,
19L, 23L, 19L, 19L, 28L),
Max.L.Rect = c(159L, 143L, 158L, 143L,
144L, 169L),
Sc.Var.Maxis = c(176L, 170L, 223L, 160L, 241L, 280L),
Sc.Var.maxis = c(379L, 330L, 635L, 309L, 325L, 957L),
Ra.Gyr = c(184L,
158L, 220L, 127L, 188L, 264L),
Skew.Maxis = c(70L, 72L, 73L,
63L, 127L, 85L),
Skew.maxis = c(6L, 9L, 14L, 6L, 9L, 5L),
Kurt.maxis = c(16L,
14L, 9L, 10L, 11L, 9L),
Kurt.Maxis = c(187L, 189L, 188L, 199L,
180L, 181L),
Holl.Ra = c(197L, 199L, 196L, 207L, 183L, 183L),
Class = c("van", "van", "saab", "van", "bus", "bus")
),
row.names = c(NA,
6L), class = "data.frame")
#Remove outliers function
removeOutliers <- function(data) {
OutVals <- boxplot(data)$out
namesforgroups <- boxplot(OutVals)$names #get group name of the outliers
dataf <- as.data.frame(OutVals, col.names = namesforgroups)#dataframe of outlier + names
print(OutVals) # show all outliers
remOutliers <- sapply(data, function(x) x[!x %in% OutVals]) #remove outliers from data
return (remOutliers)
}
#Remove class column and sample number
vehDataRemove1 <- vehData[, -1]
vehDataRemove2 <- vehDataRemove1[,-19]
vehData <- vehDataRemove2 #assign to new variable
vehClass <- vehData$Class #store original class names
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData) #remove first set of outliers
removeOutliers2 <- removeOutliers(removeOutliers1) #test again for more and remove
The boxplot object does not record which row or class an outlier belongs to, so you have to recover that yourself. What it does give you is the column each outlier came from, in boxplot(data)$group. You can use which() to find the row that holds that value, and then use the row to look up the class. I rewrote your function so that it prints a table of each outlier value, the column it came from, and the class of the row it came from. There are 5 outliers from 3 rows in the first iteration and no outliers in the second, which makes sense because those rows have been removed.
removeOutliers <- function(data, class) {
x=boxplot(data)
OutVals <- x$out
columns <- x$group #get group name of the outliers
ind=numeric()
classes=c()
if (length(columns) > 0) {
for (i in 1:length(columns)) {
rows=which(data[,columns[i]]==OutVals[i])
ind=union(ind, rows)
classes=c(classes, class[rows])
}
dt=data.frame(OutVals, columns, classes) # show all outliers
print(dt)
return (list(data[-ind,], class[-ind]))
}
return(list(data, class))
}
#Remove class column and sample number
vehData1 <- vehData[, -c(1,20)]
vehClass <- vehData$Class #store original class names
#Begin removing outliers
removeOutliers1 <- removeOutliers(vehData1, vehClass) #remove first set of outliers
OutVals columns classes
1 103 5 bus
2 52 6 bus
3 6 6 bus
4 127 14 bus
5 14 15 saab
removeOutliers2 <- removeOutliers(removeOutliers1[[1]], removeOutliers1[[2]])
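For readers unfamiliar with what boxplot() returns, here is a minimal sketch on a made-up data frame (toy and its columns a and b are hypothetical, not part of the question's data). It shows how $out, $group and $names relate to each other, and how which() recovers the row of an outlier.

```r
# Hypothetical toy data: column a has one obvious outlier in row 5
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))

# plot = FALSE returns the statistics without drawing anything
bp <- boxplot(toy, plot = FALSE)

bp$out    # the outlier values: 100
bp$group  # the column index each outlier came from: 1
bp$names  # the column names: "a" "b"

# Row that holds the first outlier, found by matching the value in its column
outlier_row <- which(toy[, bp$group[1]] == bp$out[1])
outlier_row  # 5
```

Note that $names holds the names of the boxes (here the columns), which is why it cannot tell you anything about a class stored in another column.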
The first function returns a data frame with the outlier rows removed. The second function returns a table containing information about each outlier (the class, the column, and the value).
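library(dplyr) # provides %>%, select() and bind_rows()
removeOutliers=function(data) {
x=boxplot(data %>% select(-Class), plot=FALSE)
outlierRows=c()
for (i in seq_along(x$out)) { # seq_along() skips the loop when there are no outliers
outlierRows=c(outlierRows, which(data[,x$group[i]]==x$out[i]))
}
if (length(outlierRows) == 0) return(data) # data[-NULL,] would drop every row
return(data[-outlierRows,])
}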
getOutliers=function(data) {
x=boxplot(data %>% select(-Class))
outlierInfo=data.frame()
for (i in seq_along(x$out)) {
rows=which(data[,x$group[i]]==x$out[i])
outlierInfo=bind_rows(outlierInfo, data.frame(class=data$Class[rows],
value=x$out[i],
column=names(data)[x$group[i]]))
}
return(outlierInfo)
}
removeOutliers(vehData)
Samples Comp Circ D.Circ Rad.Ra Pr.Axis.Ra Max.L.Ra Scat.Ra Elong Pr.Axis.Rect Max.L.Rect
1 1 95 48 83 178 72 10 162 42 20 159
2 2 91 41 84 141 57 9 149 45 19 143
4 4 93 41 82 159 63 9 144 46 19 143
Sc.Var.Maxis Sc.Var.maxis Ra.Gyr Skew.Maxis Skew.maxis Kurt.maxis Kurt.Maxis Holl.Ra Class
1 176 379 184 70 6 16 187 197 van
2 170 330 158 72 9 14 189 199 van
4 160 309 127 63 6 10 199 207 van
getOutliers(vehData)
class value column
1 bus 103 Pr.Axis.Ra
2 bus 52 Max.L.Ra
3 bus 6 Max.L.Ra
4 bus 127 Skew.Maxis
5 saab 14 Skew.maxis
I am trying to fit the Richards model in R to the data below, but I can't get it to work.
time Volume
3 12
6 25
9 38
12 53
15 73
21 108
27 136
33 160
39 180
48 202
60 222
72 241
96 255
Richards <- nls(
Volume ~ (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*time)))^(1/n),
data=dat1,
start=c(Vi=3, Vf=255, u=6, n=-0.5))
Any help is appreciated!
It is better to use dput to provide data since it is quicker to get into R and preserves data types such as integer and factor:
dat1 <- structure(list(time = c(3L, 6L, 9L, 12L, 15L, 21L, 27L, 33L,
39L, 48L, 60L, 72L, 96L), Volume = c(12L, 25L, 38L, 53L, 73L,
108L, 136L, 160L, 180L, 202L, 222L, 241L, 255L)), class = "data.frame",
row.names = c(NA, -13L))
Here are the data and the curve you are trying to fit with your starting values:
plot(Volume~time, dat1)
Vi <- 3; Vf <- 255; u <- 6; n <- -0.5
Vol.pred <- (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*dat1$time)))^(1/n)
lines(dat1$time, Vol.pred, col="red")
You can see the predicted line is nowhere near the data. As @Maurits Evers indicated, it is not clear that the Richards curve is appropriate, but you can try changing the starting values to get something closer, e.g. by changing u to .05:
u <- .05
Vol.pred <- (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*dat1$time)))^(1/n)
lines(dat1$time, Vol.pred, col="blue")
That gives us starting values that will work:
Richards <- nls(
Volume ~ (Vi*Vf)/((Vi^n) + ((Vf^n-(Vi^n))*exp(-u*time)))^(1/n),
data=dat1,
start=c(Vi=3, Vf=255, u=.05, n=-0.5))
lines(dat1$time, predict(Richards), col="darkgreen")
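As a rough numeric companion to the plots, the sketch below compares the two starting values of u by their residual sum of squares. The helper names richards and rss are hypothetical, introduced only for this illustration. The formula is the one from the nls() call, and the other starting values are held at Vi=3, Vf=255, n=-0.5.

```r
# Data copied from the dput() output above
dat1 <- data.frame(
  time   = c(3, 6, 9, 12, 15, 21, 27, 33, 39, 48, 60, 72, 96),
  Volume = c(12, 25, 38, 53, 73, 108, 136, 160, 180, 202, 222, 241, 255)
)

# Hypothetical helper: same right-hand side as in the nls() call
richards <- function(time, Vi, Vf, u, n) {
  (Vi * Vf) / ((Vi^n) + ((Vf^n - (Vi^n)) * exp(-u * time)))^(1 / n)
}

# Hypothetical helper: residual sum of squares for a given u,
# holding the other starting values fixed
rss <- function(u) {
  pred <- richards(dat1$time, Vi = 3, Vf = 255, u = u, n = -0.5)
  sum((dat1$Volume - pred)^2)
}

rss(6)     # original starting value: the curve sits near Vf for every time point
rss(0.05)  # adjusted starting value: far smaller, so nls() starts close to the data
```

Scanning a few candidate values with a helper like rss() is one cheap way to pick starting values before calling nls().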
I'm trying to type this formula into R:
The formula takes the following inputs:
M: annual number of deaths (all-cause mortality);
D: annual number of cancer deaths (cancer mortality);
R: annual number of registered cancer cases;
N: size of the mid-year population.
w: width of each age-interval, e.g. [0-5) is 5 years wide, and the final interval is 85+ years and thus infinitely wide.
All of the above input vectors are 18 elements long, because they refer to 18 age-intervals.
The first 17 age-intervals are 5 years wide, and the last interval (85+ years) is infinitely wide.
The formula estimates lifetime risk of cancer as proposed by Sasieni et al. (2011):
http://www.nature.com/bjc/journal/v105/n3/full/bjc2011250a.html
It is the S*0(ai) term that I don't know how to type.
Below I have tried to implement the parts of the equation before and after the S*0(ai) term.
# Input data:
M <- c(140L, 12L, 12L, 59L, 94L, 101L, 117L, 213L, 368L, 607L, 1025L,
1488L, 2255L, 2787L, 3257L, 3715L, 4231L, 6281L)
R <- c(42L, 22L, 28L, 54L, 77L, 108L, 169L, 227L, 293L, 531L, 863L,
1464L, 2591L, 3334L, 3045L, 2605L, 1890L, 1261L)
D <- c(2L, 1L, 2L, 6L, 4L, 7L, 15L, 26L, 67L, 120L, 304L, 497L, 883L,
1158L, 1321L, 1318L, 1177L, 1065L)
N <- c(167323L, 168088L, 176017L, 180986L, 168189L, 155506L, 174274L,
195538L, 207287L, 204711L, 183802L, 174342L, 183415L, 151277L,
104199L, 71782L, 47503L, 33946L)
# W width of age interval
w <- c( 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,Inf )
# function
v1 <- numeric()
for(i in 1:length(R)) {
v1[i] <- R[i] / ( R[i] + M[i] - D[i] ) * ( 1 - exp( - (w[i]/N[i]) * (R[i] + M[i] - D[i]) ) )
}
sum(v1)
Answers where the code looks as much as possible like the equation are preferred, so that coworkers with no knowledge of R can recognize the equation in the code.
The answer is supposed to be 0.376127241057822
Maybe this will work. Isn't there an example in the paper that you can check?
f <- function(idx) {
s <- numeric(idx)
for (i in 1:idx)
s[i] <- R[i] / (R[i] + M[i] - D[i]) * S(i) * (1 - exp(-w[i] / N[i] * (R[i] + M[i] - D[i])))
s
}
S <- function(idx) {
if (idx == 1L)
return(1)
s <- numeric(idx - 1)
for (j in 1:(idx - 1))
s[j] <- (R[j] + (M[j] - D[j])) / N[j]
exp(-sum(s))
}
# Input data:
M <- c(140L, 12L, 12L, 59L, 94L, 101L, 117L, 213L, 368L, 607L, 1025L,
1488L, 2255L, 2787L, 3257L, 3715L, 4231L, 6281L)
R <- c(42L, 22L, 28L, 54L, 77L, 108L, 169L, 227L, 293L, 531L, 863L,
1464L, 2591L, 3334L, 3045L, 2605L, 1890L, 1261L)
D <- c(2L, 1L, 2L, 6L, 4L, 7L, 15L, 26L, 67L, 120L, 304L, 497L, 883L,
1158L, 1321L, 1318L, 1177L, 1065L)
N <- c(167323L, 168088L, 176017L, 180986L, 168189L, 155506L, 174274L,
195538L, 207287L, 204711L, 183802L, 174342L, 183415L, 151277L,
104199L, 71782L, 47503L, 33946L)
# W width of age interval
w <- c( 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,Inf )
f(18)
# [1] 0.0012516883 0.0006533947 0.0007939380 0.0014874104 0.0022786758 0.0034506651
# [7] 0.0048088199 0.0057397672 0.0069608906 0.0126706127 0.0226156951 0.0395612334
# [13] 0.0644167605 0.0956951717 0.1184236481 0.1330917708 0.1256574840 0.1421444626
sum(f(18))
# [1] 0.7817021
A more "R" way would be
lr <- length(R)
S <- sapply(seq(R), function(idx)
exp(-sum((R[-(idx:lr)] + (M[-(idx:lr)] - D[-(idx:lr)])) / N[-(idx:lr)])))
sum(R / (R + M - D) * S * (1 - exp(-w / N * (R + M - D))))
# [1] 0.7817021
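The same quantity can also be sketched with cumsum(), which avoids recomputing the inner sum for every index. This assumes, as the S() function in this answer does, that the term inside the exponential is (R + M - D) / N without the interval width. The names hazard and risk_terms are hypothetical. It reproduces the 0.7817021 of this answer, not the 0.376127241057822 the question expects.

```r
# Input data copied from the question
M <- c(140, 12, 12, 59, 94, 101, 117, 213, 368, 607, 1025,
       1488, 2255, 2787, 3257, 3715, 4231, 6281)
R <- c(42, 22, 28, 54, 77, 108, 169, 227, 293, 531, 863,
       1464, 2591, 3334, 3045, 2605, 1890, 1261)
D <- c(2, 1, 2, 6, 4, 7, 15, 26, 67, 120, 304, 497, 883,
       1158, 1321, 1318, 1177, 1065)
N <- c(167323, 168088, 176017, 180986, 168189, 155506, 174274,
       195538, 207287, 204711, 183802, 174342, 183415, 151277,
       104199, 71782, 47503, 33946)
w <- c(rep(5, 17), Inf)

# Per-interval term that S() above sums over j = 1 .. i-1
hazard <- (R + M - D) / N

# Shift the cumulative sum down by one; the leading 0 gives S[1] = 1
S <- exp(-c(0, head(cumsum(hazard), -1)))

risk_terms <- R / (R + M - D) * S * (1 - exp(-w / N * (R + M - D)))
sum(risk_terms)
```

The Inf in w is harmless here: exp(-Inf) is 0, so the last interval contributes R / (R + M - D) * S with no further discount.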
Maybe I'm reading the problem incorrectly, but could you solve this by manually shifting the S*0(ai) vector by 1 to account for the summation from j=1 to i-1 and combining with cumsum?
#df is a data.frame of the example data. Jump to bottom for code.
#index i = row i
#Using mutate() from dplyr library to make code easier to read
df <- dplyr::mutate(df, RMDN.i = R/(R+M-D) * ( 1 - exp( -(w/N) * (R+M-D) ) ))
#Shift values down one because equation sums from j=1 to i-1.
df$RMDN.i_1 <- c(0, head(df$RMDN.i, -1))
df$S0.ai <-exp(-cumsum(df$RMDN.i_1)) #Cumulative sum
#Again, cumulative sum to calculate lifetime risk (Eq. 7)
df <- dplyr::mutate(df, risk = cumsum( R/(R+M-D) * S0.ai * (1 - exp(-(w/N) * (R+M-D)) ) ))
df
# age M R D N w RMDN.i RMDN.i_1 S0.ai risk
#1 0 140 42 2 167323 5 0.0012516883 0.0000000000 1.0000000 0.001251688
#2 5 12 22 1 168088 5 0.0006540980 0.0012516883 0.9987491 0.001904968
#3 10 12 28 2 176017 5 0.0007949486 0.0006540980 0.9980960 0.002698403
#4 15 59 54 6 180986 5 0.0014896253 0.0007949486 0.9973029 0.004184011
#5 20 94 77 4 168189 5 0.0022834186 0.0014896253 0.9958184 0.006457881
#6 25 101 108 7 155506 5 0.0034612823 0.0022834186 0.9935471 0.009896828
#7 30 117 169 15 174274 5 0.0048298858 0.0034612823 0.9901141 0.014678966
#8 35 213 227 26 195538 5 0.0057738828 0.0048298858 0.9853435 0.020368224
#9 40 368 293 67 207287 5 0.0070171053 0.0057738828 0.9796707 0.027242676
#10 45 607 531 120 204711 5 0.0128095925 0.0070171053 0.9728203 0.039704108
#11 50 1025 863 304 183802 5 0.0229777407 0.0128095925 0.9604383 0.061772810
#12 55 1488 1464 497 174342 5 0.0405424457 0.0229777407 0.9386212 0.099826810
#13 60 2255 2591 883 183415 5 0.0669506082 0.0405424457 0.9013283 0.160171288
#14 65 2787 3334 1158 151277 5 0.1016317397 0.0669506082 0.8429595 0.245842732
#15 70 3257 3045 1321 104199 5 0.1299648254 0.1016317397 0.7614977 0.344810654
#16 75 3715 2605 1318 71782 5 0.1532142188 0.1299648254 0.6686912 0.447263656
#17 80 4231 1890 1177 47503 5 0.1550955224 0.1532142188 0.5737009 0.536242096
#18 85 6281 1261 1065 33946 Inf 0.1946888992 0.1550955224 0.4912792 0.631888708
library(ggplot2)
ggplot(df, aes(x= age, y= risk)) + geom_line() + geom_point() + theme_classic()
# Input data:
df <- data.frame(
age = seq(0,85, by = 5), #age band
M = c(140L, 12L, 12L, 59L, 94L, 101L, 117L, 213L, 368L, 607L, 1025L,
1488L, 2255L, 2787L, 3257L, 3715L, 4231L, 6281L),
R = c(42L, 22L, 28L, 54L, 77L, 108L, 169L, 227L, 293L, 531L, 863L,
1464L, 2591L, 3334L, 3045L, 2605L, 1890L, 1261L),
D = c(2L, 1L, 2L, 6L, 4L, 7L, 15L, 26L, 67L, 120L, 304L, 497L, 883L,
1158L, 1321L, 1318L, 1177L, 1065L),
N = c(167323L, 168088L, 176017L, 180986L, 168189L, 155506L, 174274L,
195538L, 207287L, 204711L, 183802L, 174342L, 183415L, 151277L,
104199L, 71782L, 47503L, 33946L) ,
w = c( 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,Inf ) # W width of age interval
)
I am curious as to why the frequencies produced by these two methods (shown below) differ even though the same dataset is used.
First Method (cut(as.vector))
wd1<- apply(wd, 2, function(x) cut(((as.numeric(x))
+ 360/(16*2) )%% 360,seq(0,360,360/16) ,
c('N', 'NNE', 'NE', 'ENE', 'E', 'ESE', '
SE', 'SSE', 'S', 'SSW', 'SW', 'WSW',
'W', 'WNW', 'NW', 'NNW')))
wd2<- as.data.frame(table(wd1))
wd3<- transform(wd2, cumFreq = cumsum(Freq),
relative = prop.table(Freq))
and this yields
> wd3
wd1 Freq cumFreq relative
1 \nSE 2942 2942 0.01579292
2 E 11550 14492 0.06200144
3 ENE 5773 20265 0.03098998
4 ESE 5713 25978 0.03066790
5 N 11051 37029 0.05932276
6 NE 4725 41754 0.02536422
7 NNE 6196 47950 0.03326069
8 NNW 14880 62830 0.07987718
9 NW 18278 81108 0.09811795
10 S 6621 87729 0.03554212
11 SSE 3772 91501 0.02024844
12 SSW 10800 102301 0.05797537
13 SW 17004 119305 0.09127900
14 W 24903 144208 0.13368154
15 WNW 20603 164811 0.11059876
16 WSW 21475 186286 0.11527973
Second method (cut(wd,breaks=))
breaks1 <- apply(wd, 2, function(x) (cut(as.numeric(x), breaks=
(seq(0,360,360/16)))))
breaks2<- as.data.frame(table(breaks1))
breaks3<- transform(breaks2, cumFreq = cumsum(Freq),
relative = prop.table(Freq))
and this yields
> breaks3
breaks1 Freq cumFreq relative
1 (0,22.5] 8110 8110 0.04358036
2 (112,135] 3314 11424 0.01780830
3 (135,158] 3084 14508 0.01657236
4 (158,180] 5039 19547 0.02707786
5 (180,202] 8387 27934 0.04506886
6 (202,225] 14246 42180 0.07655312
7 (22.5,45] 5257 47437 0.02824932
8 (225,248] 19194 66631 0.10314198
9 (248,270] 24301 90932 0.13058525
10 (270,292] 22526 113458 0.12104700
11 (292,315] 19631 133089 0.10549027
12 (315,338] 16401 149490 0.08813335
13 (338,360] 13185 162675 0.07085167
14 (45,67.5] 4614 167289 0.02479405
15 (67.5,90] 9173 176462 0.04929256
16 (90,112] 9631 186093 0.05175369
The total frequency should be 186286, as in the first method, but it's not, so I'm sure some numbers are being omitted. Also, the interval labels are not all multiples of 22.5 (as 360/16 should give); only three bins show it. The breaks themselves are correct, but R is rounding the labels for all but those three. Why is this?
The dput output is:
dput(head(wd))
structure(list(X1000mb = c(86L, 130L, 75L, 59L, 56L, 69L), X925mb = c(70L,
45L, 30L, 66L, 54L, 71L), X850mb = c(355L, 349L, 350L, 65L, 36L,
56L), X700mb = c(331L, 342L, 329L, 35L, 1L, 44L), X600mb = c(328L,
328L, 321L, 0L, 247L, 227L), X500mb = c(331L, 324L, 317L, 331L,
251L, 241L), X400mb = c(340L, 328L, 310L, 296L, 261L, 246L),
X300mb = c(336L, 334L, 328L, 295L, 259L, 262L), X250mb = c(334L,
333L, 348L, 300L, 259L, 279L), X200mb = c(336L, 330L, 356L,
331L, 257L, 282L), X150mb = c(333L, 327L, 346L, 342L, 277L,
279L), X100mb = c(317L, 326L, 325L, 318L, 260L, 274L), X70mb = c(323L,
326L, 332L, 306L, 277L, 276L), X50mb = c(350L, 4L, 352L,
328L, 305L, 311L), X30mb = c(5L, 42L, 32L, 15L, 29L, 12L),
X20mb = c(3L, 42L, 48L, 30L, 46L, 45L), X10mb = c(28L, 25L,
4L, 14L, 104L, 76L)), .Names = c("X1000mb", "X925mb", "X850mb",
"X700mb", "X600mb", "X500mb", "X400mb", "X300mb", "X250mb", "X200mb",
"X150mb", "X100mb", "X70mb", "X50mb", "X30mb", "X20mb", "X10mb"
), row.names = c(NA, 6L), class = "data.frame")
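One possible explanation for both observations can be sketched on a small made-up vector (x below is hypothetical, not the real wd data). By default cut() builds intervals that are open on the left, so a direction of exactly 0 falls outside (0,22.5] and becomes NA, which table() then drops. The labels are also formatted with 3 significant digits by default (dig.lab = 3), so 112.5 is shown as 112 even though the break itself is unchanged.

```r
# Hypothetical toy directions, including a 0 like the one in X600mb
x <- c(0, 10, 22.5, 100, 120, 359, 360)
brks <- seq(0, 360, 360 / 16)

# Default: intervals are (lo, hi], so 0 is not in any bin and becomes NA
default_bins <- cut(x, breaks = brks)
sum(is.na(default_bins))  # 1

# include.lowest = TRUE closes the first interval, giving [0,22.5]
closed_bins <- cut(x, breaks = brks, include.lowest = TRUE)
sum(is.na(closed_bins))   # 0

# Labels use dig.lab = 3 significant digits by default;
# raising dig.lab prints the actual break points
levels(default_bins)[6]                           # "(112,135]"
levels(cut(x, breaks = brks, dig.lab = 4))[6]     # "(112.5,135]"
```

If zeros are what is missing from the real data, the difference between the two totals (186286 - 186093 = 193) would be the number of zero readings, which can be checked with sum(wd == 0).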
For a dataset like:
21 79
78 245
21 186
65 522
4 21
3 4
4 212
4 881
124 303
28 653
28 1231
7 464
7 52
17 102
16 292
65 837
28 203
28 1689
136 2216
7 1342
56 412
I need to find the number of associated patterns. For example 21-79 and 21-186 have 21 in common. So they form 1 pattern. Also 21 is present in 4-21. This edge also contributes to the same pattern. Now 4-881, 4-212, 3-4 have 4 in their edge. So also contribute to the same pattern. Thus edges 21-79, 21-186, 4-21, 4-881, 4-212, 3-4 form 1 pattern. Similarly there are other patterns. Thus we need to group all edges that have any 1 node common to form a pattern (or subgraph). For the dataset given there are total 4 patterns.
I need to write code (preferably in R) that will find the number of such patterns.
Since you're describing the data as subgraphs, why not use the igraph package, which is built for working with graphs? Here's your data in data.frame form:
dd <- structure(list(V1 = c(21L, 78L, 21L, 65L, 4L, 3L, 4L, 4L, 124L,
28L, 28L, 7L, 7L, 17L, 16L, 65L, 28L, 28L, 136L, 7L, 56L), V2 = c(79L,
245L, 186L, 522L, 21L, 4L, 212L, 881L, 303L, 653L, 1231L, 464L,
52L, 102L, 292L, 837L, 203L, 1689L, 2216L, 1342L, 412L)), .Names = c("V1",
"V2"), class = "data.frame", row.names = c(NA, -21L))
We can treat each value as a vertex name so the data you provide is really like an edge list. Thus we create our graph with
library(igraph)
gg <- graph.edgelist(cbind(as.character(dd$V1), as.character(dd$V2)),
directed=F)
That defines the vertices and edges, resulting in the following graph (plot(gg)):
Now you wanted to know the number of "patterns" which are really represented as connected subgraphs in this data. You can extract that information with the clusters() command. Specifically,
clusters(gg)$no
# [1] 10
Which shows there are 10 clusters in the data you provided. But you only want the ones that have more than two vertices. That we can get with
sum(clusters(gg)$csize>2)
# [1] 4
Which is 4 as you were expecting.
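If installing igraph is not an option, the same count can be sketched in base R by merging component labels edge by edge. This is a simple label-merging approach swapped in for illustration, not what clusters() does internally, and count_patterns and its min_vertices argument are hypothetical names.

```r
# Edge list copied from the question
v1 <- c(21, 78, 21, 65, 4, 3, 4, 4, 124, 28, 28, 7, 7, 17, 16, 65, 28, 28,
        136, 7, 56)
v2 <- c(79, 245, 186, 522, 21, 4, 212, 881, 303, 653, 1231, 464, 52, 102,
        292, 837, 203, 1689, 2216, 1342, 412)

# Hypothetical helper: count connected components and those with at least
# min_vertices vertices
count_patterns <- function(from, to, min_vertices = 3) {
  nodes <- unique(c(from, to))
  comp <- seq_along(nodes)   # comp[k] is the component label of nodes[k]
  a <- match(from, nodes)
  b <- match(to, nodes)
  for (k in seq_along(a)) {
    ca <- comp[a[k]]
    cb <- comp[b[k]]
    if (ca != cb) comp[comp == cb] <- ca  # merge the two components
  }
  sizes <- table(comp)
  c(components = length(sizes), patterns = sum(sizes >= min_vertices))
}

count_patterns(v1, v2)
# components: 10, patterns: 4, matching the igraph result above
```

The relabelling step scans all vertices for every merge, which is fine for an edge list of this size but would be slow for large graphs, where igraph is the better choice.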
I have a data frame containing words and numeric entries. I want to sum all the entries for which the row entry in the word column is identical.
District name Population Child birth rate
A 30,000 .7
A 20,000 .5
B 10,000 .09
B 15,000 .6
C 80,000 .007
I want to sum up the population and child birth rates on the district level.
I tried using lapply and sum, but I can't figure it out.
The result of dput(head(mydata)) is:
structure(list(District = structure(c(5L, 5L, 5L, 5L, 5L, 5L), .Label = c("Charlottenburg-Wilmersdorf",
"Friedrichshain-Kreuzberg", "Lichtenberg", "Marzahn-Hellersdorf",
"Mitte", "Neukoelln", "Pankow", "Reinickendorf", "Spandau", "Steglitz-Zehlendorf",
"Tempelhof-Schoeneberg", "Treptow-Koepenick"), class = "factor"),
Population = c(81205L, 70911L, 5629L, 12328L, 78290L, 84789L
), Overall.crime = c(27864L, 13181L, 943L, 4515L, 15673L,
16350L), Robbery = c(315L, 195L, 20L, 79L, 232L, 261L), Mugging = c(183L,
81L, 9L, 54L, 111L, 118L), Assault = c(2016L, 1046L, 51L,
468L, 1679L, 1718L), Molestation.Stalking = c(480L, 429L,
16L, 114L, 567L, 601L), Theft = c(13587L, 4961L, 396L, 2019L,
6725L, 6954L), Car.Theft = c(185L, 149L, 10L, 28L, 159L,
159L), Bycicle.Theft = c(1444L, 561L, 95L, 123L, 588L, 595L
), Burglary = c(557L, 297L, 37L, 87L, 397L, 528L), Arson = c(36L,
51L, 7L, 15L, 28L, 56L), Property.Damage = c(2113L, 871L,
64L, 260L, 1257L, 1172L), Drug.Offenses = c(781L, 538L, 24L,
87L, 604L, 492L)), .Names = c("District", "Population", "Overall.crime",
"Robbery", "Mugging", "Assault", "Molestation.Stalking", "Theft",
"Car.Theft", "Bycicle.Theft", "Burglary", "Arson", "Property.Damage",
"Drug.Offenses"), row.names = c(NA, 6L), class = "data.frame")
I had spared you all those German names before, but I guess that was stupid since the problem is within the data...
Using ddply gives me following error:
Error in df$Population : object of type 'closure' is not subsettable
Thank you for any help!
Using the data you originally posted did you mean to do this?
df <- read.table( text = "District_name Population Child_birth_rate
A 30000 .7
A 20000 .5
B 10000 .09
B 15000 .6
C 80000 .007" , h = TRUE )
aggregate( cbind( Population , Child_birth_rate ) ~ District_name , data = df , sum )
# District_name Population Child_birth_rate
#1 A 50000 1.200
#2 B 25000 0.690
#3 C 80000 0.007
Is it a good idea to sum the birth rate?
Using your actual data, it might be more convenient to use ddply from plyr to aggregate in a similar fashion (for example, if you want sum and mean on two different columns):
require( plyr )
ddply( mydata , "District" , function(df) c( "Pop" = sum( df$Population), "Robbery" = mean( df$Robbery ) ) )
# District Pop Robbery
#1 Mitte 333152 183.6667
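For comparison, a base R sketch of the same per-district summary using tapply(), assuming only the District, Population and Robbery columns are needed. The six rows are copied from the dput(head(mydata)) output above, and the names pop, rob and result are hypothetical.

```r
# Three columns copied from the dput(head(mydata)) output above
mydata <- data.frame(
  District   = rep("Mitte", 6),
  Population = c(81205, 70911, 5629, 12328, 78290, 84789),
  Robbery    = c(315, 195, 20, 79, 232, 261)
)

# One summary per district: sum for Population, mean for Robbery
pop <- tapply(mydata$Population, mydata$District, sum)
rob <- tapply(mydata$Robbery, mydata$District, mean)

result <- data.frame(
  District = names(pop),
  Pop      = as.vector(pop),
  Robbery  = as.vector(rob)
)
result
```

Naming the data mydata rather than df also sidesteps the "object of type 'closure' is not subsettable" error, which typically appears when df refers to the built-in function stats::df instead of a data frame.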