Table in R to be weighted

I'm trying to run a crosstab/contingency table, but need it weighted by a weighting variable.
Here is some sample data.
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df <- data.frame(age,sex, wgt)
I've run this to get a regular crosstab table
table(df$sex, df$age)
To get weighted frequencies, I tried the Hmisc package (if you know a better package, let me know):
library(Hmisc)
wtd.table(df$sex, df$age, weights=df$wgt)
Error in match.arg(type) : 'arg' must be of length 1
I'm not sure where I've gone wrong, but it doesn't run, so any help will be great.
Alternatively, if you know how to do this in another package, which may be better for analysing survey data, that would be great too. Many thanks in advance.

Try this
GDAtools::wtable(df$sex, df$age, w = df$wgt)
Output
0-15 16-29 30-44 45+ NA tot
Female 56 73 60 76 0 265
Male 76 99 106 90 0 371
NA 0 0 0 0 0 0
tot 132 172 166 166 0 636
Update
In case you do not want to install the whole package, here are two essential functions you need:
wtable and dichotom
Source them and you should be able to use wtable without any problem.

A solution is to repeat the rows of the data.frame by weight and then table the result.
The following repeats the data.frame's rows (only relevant columns):
df[rep(row.names(df), df$wgt), 1:2]
And it can be used to get the contingency table.
table(df[rep(row.names(df), df$wgt), 1:2])
# sex
#age Female Male
# 0-15 56 76
# 16-29 73 99
# 30-44 60 106
# 45+ 76 90

Base R, in stats, has xtabs for exactly this:
xtabs(wgt ~ age + sex, data=df)
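As a quick check on the question's sample data, the xtabs result should agree with the table built by repeating rows. This is only a sketch; the names weighted and expanded are mine and not from the question. As far as I can tell from its signature, Hmisc::wtd.table takes a single variable x, so the second positional argument probably got matched to type, which would explain the match.arg error.

```r
# Sample data as in the question
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(1:10, 100, replace = TRUE)
df  <- data.frame(age, sex, wgt)

# Weighted crosstab: the left-hand side of the formula is summed within each cell
weighted <- xtabs(wgt ~ age + sex, data = df)

# Same counts obtained by repeating each row wgt times and tabulating
expanded <- table(df[rep(row.names(df), df$wgt), c("age", "sex")])

# Row and column totals can be added if needed
addmargins(weighted)
```

Both objects hold the same cell counts, and the grand total equals sum(df$wgt).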

A tidyverse solution using your data and the same set.seed. uncount is the equivalent of @Rui's rep of the weights.
library(dplyr)
library(tidyr)
df %>%
uncount(weights = .$wgt) %>%
select(-wgt) %>%
table
#> sex
#> age Female Male
#> 0-15 56 76
#> 16-29 73 99
#> 30-44 60 106
#> 45+ 76 90

Related

creating a two-way table with totals in R

I was wondering if there is an easy way to create a table that has the columns as well as row totals?
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
colnames(smoke) <- c("High","Low","Middle")
rownames(smoke) <- c("current","former","never")
smoke <- as.table(smoke)
I thought this would be super easy, but the solutions I have found so far seem pretty complicated, involving lapply and rbind. This seems like such a trivial task that there must be an easier way.
Desired result:
> smoke
High Low Middle TOTAL
current 51 43 22 116
former 92 28 21 141
never 68 22 9 99
TOTAL 211 93 52 356
addmargins(smoke)
addmargins is in the stats package.
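A minimal sketch of the addmargins approach on the question's data. By default the margins are labelled Sum; the renaming step to TOTAL is my own addition to match the desired output, and with_totals is a name I made up.

```r
# Data as in the question
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# Append row and column sums (margins are labelled "Sum" by default)
with_totals <- addmargins(smoke)

# Optional: relabel the "Sum" margins as "TOTAL"
dimnames(with_totals) <- lapply(dimnames(with_totals),
                                function(n) sub("^Sum$", "TOTAL", n))
with_totals
```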
You can use adorn_totals from janitor :
library(janitor)
library(magrittr)
smoke %>%
as.data.frame.matrix() %>%
tibble::rownames_to_column() %>%
adorn_totals(name = 'TOTAL') %>%
adorn_totals(name = 'TOTAL', where = 'col')
# rowname High Low Middle TOTAL
# current 51 43 22 116
# former 92 28 21 141
# never 68 22 9 99
# TOTAL 211 93 52 356

Normalise only some columns in R

I'm new to R and still getting to grips with how it handles data (my background is spreadsheets and databases). The problem I have is as follows. My data looks like this (it is held in a CSV file):
RecNo Var1 Var2 Var3
41 800 201.8 Y
43 140 39 N
47 60 20.24 N
49 687 77 Y
54 570 135 Y
58 1250 467 N
61 211 52 N
64 96 117.3 N
68 687 77 Y
Column 1 (RecNo) is my observation number; while it is a number, it is not required for my analysis. Column 4 (Var3) is a Yes/No column which, again, I do not currently need for the analysis but will need later in the process to add information in the output.
I need to normalise the numeric data in my dataframe to values between 0 and 1 without losing the other information. I have the following function:
normalize <- function(x) {
x <- sweep(x, 2, apply(x, 2, min))
sweep(x, 2, apply(x, 2, max), "/")
}
However, when I apply it to my above data by calling
myResult <- normalize(myData)
it returns an error because of the text in Column 4. If I set the text in this column to binary values it runs fine, but then also normalises my case numbers, which I don't want.
So, my question is: How can I change my normalize function above to accept the names of the columns to transform, while outputting the full dataset (i.e. without losing columns)?
I could not get TUSHAr's suggestion to work, but I have found two solutions that work fine:
1. akrun's suggestion above:
myData2 <- myData1 %>% mutate_at(2:3, funs((.-min(.))/max(.-min(.))))
This produces the following:
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
Alternatively, there is the package BBmisc which allowed me the following after transforming my record numbers to factors:
> myData <- myData %>% mutate(RecNo = factor(RecNo))
> myNorm <- normalize(myData, method="range", range = c(0,1), margin = 1)
> myNorm
RecNo Var1 Var2 Var3
1 41 0.62184874 0.40601834 Y
2 43 0.06722689 0.04195255 N
3 47 0.00000000 0.00000000 N
4 49 0.52689076 0.12693105 Y
5 54 0.42857143 0.25663508 Y
6 58 1.00000000 1.00000000 N
7 61 0.12689076 0.07102414 N
8 64 0.03025210 0.21718329 N
9 68 0.52689076 0.12693105 Y
EDIT: For completeness I include TUSHAr's solution as well, showing as always that there are many ways around a single problem:
normalize<-function(x){
minval=apply(x[,c(2,3)],2,min)
maxval=apply(x[,c(2,3)],2,max)
#print(minval)
#print(maxval)
y=sweep(x[,c(2,3)],2,minval)
#print(y)
sweep(y,2,(maxval-minval),"/")
}
df[,c(2,3)]=normalize(df)
Thank you for your help!
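For the original question (passing the column names to the function and getting the full dataset back), a base R sketch could look like the following. normalize_cols and its cols argument are hypothetical names of mine; the data frame is retyped from the question.

```r
# Rescale only the named columns to [0, 1]; all other columns are returned untouched
normalize_cols <- function(data, cols) {
  data[cols] <- lapply(data[cols], function(v) {
    rng <- range(v, na.rm = TRUE)      # rng[1] = min, rng[2] = max
    (v - rng[1]) / (rng[2] - rng[1])
  })
  data
}

# Data from the question
myData <- data.frame(
  RecNo = c(41, 43, 47, 49, 54, 58, 61, 64, 68),
  Var1  = c(800, 140, 60, 687, 570, 1250, 211, 96, 687),
  Var2  = c(201.8, 39, 20.24, 77, 135, 467, 52, 117.3, 77),
  Var3  = c("Y", "N", "N", "Y", "Y", "N", "N", "N", "Y")
)

myResult <- normalize_cols(myData, c("Var1", "Var2"))
myResult
```

RecNo and Var3 pass through unchanged, so there is no need to convert RecNo to a factor first. A constant column would divide by zero here; that case is not handled in this sketch.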

How to use for loop in R

I have a CSV dataset (call it data) as follow:
CLASS CoverageT1 CoverageT2 CoverageT3
Gamma 90 80 75
Gamma 89 72 79
Gamma 92 86 75
Alpha 50 80 67
Alpha 53 78 60
Alpha 58 81 75
I would like to retrieve the unique classes and calculate the average for each coverage column.
What I've done so far is the following:
classes <- subset(data, select = c(CLASS))
unique_classes <- unique(classes)
for(x in unique_classes){
cove <- subset(data, CLASS == x , select=c(CoverageT1:CoverageT3))
average <- colMeans(cove)
print(cove)
}
As a result, I got the following results:
CoverageT1 CoverageT2 CoverageT3
1 90 80 75
3 92 86 75
4 50 80 67
6 58 81 75
I want to retrieve the coverage values for each class and then calculate the average. When I print the retrieved coverage values, I get some rows and the others are missing!
Can someone help me solve this issue?
Thanks
Your code isn't working because, amongst other things, you are assigning to average on each iteration, so the previous value is lost.
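If you do want to keep the loop, here is a sketch of a corrected version. The missing rows appear to come from looping over a one-column data frame: x is then the whole vector c("Gamma", "Alpha"), and CLASS == x recycles it, which keeps rows 1, 3, 4 and 6. The names averages and result are mine.

```r
data <- read.table(text = "CLASS CoverageT1 CoverageT2 CoverageT3
Gamma 90 80 75
Gamma 89 72 79
Gamma 92 86 75
Alpha 50 80 67
Alpha 53 78 60
Alpha 58 81 75", header = TRUE)

# Loop over a plain character vector, not a one-column data frame
unique_classes <- unique(as.character(data$CLASS))

averages <- list()
for (x in unique_classes) {
  cove <- subset(data, CLASS == x, select = CoverageT1:CoverageT3)
  averages[[x]] <- colMeans(cove)   # store each result instead of overwriting it
}

# One row per class, one column per coverage variable
result <- do.call(rbind, averages)
result
```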
There are several ways to do what you are trying to do. This would be my approach:
library(dplyr)
data %>% group_by(CLASS) %>% summarise_all(mean)
Another option using aggregate
aggregate(. ~ CLASS, data = data, FUN = mean)
Taking your idea and wrapping it in by.
xy <- read.table(text = "CLASS CoverageT1 CoverageT2 CoverageT3
Gamma 90 80 75
Gamma 89 72 79
Gamma 92 86 75
Alpha 50 80 67
Alpha 53 78 60
Alpha 58 81 75", header = TRUE)
out <- by(data = xy[, -1], INDICES = list(xy$CLASS), FUN = colMeans)
out <- do.call(rbind, out)
out
CoverageT1 CoverageT2 CoverageT3
Alpha 53.66667 79.66667 67.33333
Gamma 90.33333 79.33333 76.33333
This is how I solved it:
coverage_all <- aggregate(coverage , list(class=data$CLASS), mean)

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I rearrange the data by stacking it, i.e.:
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data.
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. There is likely some analogous function in the tidyr package.
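The tidyr analogue is most likely pivot_longer. A sketch under that assumption, with the data frame retyped from the question and summarise written out instead of summarise_each/funs, which newer dplyr versions deprecate; res is a name I made up:

```r
library(dplyr)
library(tidyr)

DF <- data.frame(
  Selidor.Bay  = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay  = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

res <- DF %>%
  # stack all columns into site/value pairs (what melt does above)
  pivot_longer(everything(), names_to = "site", values_to = "value") %>%
  group_by(site) %>%
  summarise(
    n    = n(),
    mean = mean(value),
    sd   = sd(value),
    se   = sd(value) / sqrt(n()),
    sum  = sum(value)
  )
res
```

The wide data frame still gets stacked, but only inside the pipeline, so the original DF stays as it is.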

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted so that the rows belonging to the 0 group come first, followed by the rows belonging to the 1 group. Is there a way I can sort the data in a 0-1-0-1-0... fashion? That is, a row that belongs to the 0 group, then a row belonging to the 1 group, then the 0 group again, and so on.
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
# X1 X2 X3 X4 addme
#2 141 48 78 33 0
#1 37 111 133 3 1
#3 115 153 168 163 0
#5 189 82 70 103 1
#4 34 37 31 174 0
#6 189 171 98 126 1
#8 167 46 72 57 0
#7 26 196 30 169 1
#9 94 89 193 134 1
#10 110 15 27 31 1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there are unbalanced numbers of 0's and 1's, it will leave the extras at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)
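The same idea can be sketched in base R without the recycling warning: number the rows within each group, then order by that number and by the group. The small data frame and the names within_group and interleaved are made up for illustration.

```r
# Toy data: three 0's and four 1's, so one 1 is left over
df <- data.frame(id = 1:7, addme = c(0, 0, 1, 1, 1, 0, 1))

# Position of each row within its own group (1st zero, 2nd zero, 1st one, ...)
within_group <- ave(seq_along(df$addme), df$addme, FUN = seq_along)

# Sort by within-group position first, then by group: 0, 1, 0, 1, ...
interleaved <- df[order(within_group, df$addme), ]
interleaved$addme
# 0 1 0 1 0 1 1
```

As with the answers above, any surplus rows from the larger group end up at the bottom.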
