How to obtain conditioned results from an R dataframe - r

This is my first message here. I'm trying to solve an exercise from an edX R course, and I'm stuck on it. It would be great if somebody could help me solve it. Here are the data frame and the question:
> students
   height shoesize gender population
1     181       44   male     kuopio
2     160       38 female     kuopio
3     174       42 female     kuopio
4     170       43   male     kuopio
5     172       43   male     kuopio
6     165       39 female     kuopio
7     161       38 female     kuopio
8     167       38 female    tampere
9     164       39 female    tampere
10    166       38 female    tampere
11    162       37 female    tampere
12    158       36 female    tampere
13    175       42   male    tampere
14    181       44   male    tampere
15    180       43   male    tampere
16    177       43   male    tampere
17    173       41   male    tampere
Given the dataframe above, create two subsets with students whose height is equal to or below the median height (call it students.short) and students whose height is strictly above the median height (call it students.tall). What is the mean shoesize for each of the above 2 subsets by population?
I've been able to create the two subsets students.tall and students.short (both display the answers by TRUE/FALSE), but I don't know how to obtain the mean by population. The data should be displayed like this:
               kuopio tampere
students.short   xxxx    xxxx
students.tall    xxxx    xxxx
Many thanks if you can give me a hand!

We can split the data by a logical vector based on the median height:
# median height
medHeight <- median(students$height, na.rm = TRUE)
# split the data into a list of data.frames: FALSE = at or below the median, TRUE = above it
lst1 <- with(students, split(students, height > medHeight))
Then loop over the list and use aggregate from base R:
lapply(lst1, function(dat) aggregate(shoesize ~ population,
data = dat, FUN = mean, na.rm = TRUE))
However, we don't need to create two separate datasets or a list. The same result comes from grouping by both 'population' and a 'grp' column created from the logical vector:
library(dplyr)
students %>%
group_by(grp = height > medHeight, population) %>%
summarise(shoesize = mean(shoesize))

You can try this:
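# Subsets at or below, and strictly above, the median height
students.short <- students[students$height <= median(students$height), ]
students.tall <- students[students$height > median(students$height), ]
# Mean shoesize of each subset
mean(students.short$shoesize)
mean(students.tall$shoesize)
Output (overall means for each subset, not yet split by population):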
[1] 38.44444
[1] 42.75
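The two means above cover each subset as a whole. Here is a minimal sketch of one way to break them down by population and get the layout asked for in the question. It assumes the same students data, rebuilt here without the gender column so the block runs on its own. The helper name by_pop is hypothetical.

```r
# Rebuild the example data (gender omitted; it is not needed here)
students <- data.frame(
  height = c(181, 160, 174, 170, 172, 165, 161, 167, 164, 166,
             162, 158, 175, 181, 180, 177, 173),
  shoesize = c(44, 38, 42, 43, 43, 39, 38, 38, 39, 38,
               37, 36, 42, 44, 43, 43, 41),
  population = rep(c("kuopio", "tampere"), times = c(7, 10))
)

med <- median(students$height)
students.short <- students[students$height <= med, ]
students.tall <- students[students$height > med, ]

# Hypothetical helper: mean shoesize per population within one subset
by_pop <- function(d) sapply(split(d$shoesize, d$population), mean)

result <- rbind(students.short = by_pop(students.short),
                students.tall = by_pop(students.tall))
result
#                kuopio tampere
# students.short   39.5    37.6
# students.tall    43.0    42.6
```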

You can use pivot_wider() from tidyr and set the argument values_fn to mean.
library(dplyr)
library(tidyr)
students %>%
  mutate(grp = if_else(height > median(height), "students.tall", "students.short")) %>%
  pivot_wider(id_cols = grp, names_from = population, values_from = shoesize, values_fn = mean)
# # A tibble: 2 x 3
#   grp            kuopio tampere
#   <chr>           <dbl>   <dbl>
# 1 students.tall    43      42.6
# 2 students.short   39.5    37.6
In base R, you can try xtabs(), which returns a table object.
xtabs(shoesize ~ grp + population,
      aggregate(shoesize ~ grp + population, FUN = mean,
                data = transform(students, grp = ifelse(height > median(height), "students.tall", "students.short"))))
#                 population
# grp              kuopio tampere
#   students.short   39.5    37.6
#   students.tall    43.0    42.6
Note: To convert a table object into a data.frame, you can use as.data.frame.matrix().

Related

Table in r to be weighted

I'm trying to run a crosstab/contingency table, but need it weighted by a weighting variable.
Here is some sample data.
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df <- data.frame(age,sex, wgt)
I've run this to get a regular crosstab table
table(df$sex, df$age)
To get a weighted frequency, I tried the Hmisc package (if you know a better package, let me know):
library(Hmisc)
wtd.table(df$sex, df$age, weights=df$wgt)
Error in match.arg(type) : 'arg' must be of length 1
I'm not sure where I've gone wrong, but it doesn't run, so any help will be great.
Alternatively, if you know how to do this in another package, which may be better for analysing survey data, that would be great too. Many thanks in advance.
Try this
GDAtools::wtable(df$sex, df$age, w = df$wgt)
Output
       0-15 16-29 30-44 45+ NA tot
Female   56    73    60  76  0 265
Male     76    99   106  90  0 371
NA        0     0     0   0  0   0
tot     132   172   166 166  0 636
Update
In case you do not want to install the whole package, here are two essential functions you need:
wtable and dichotom
Source them and you should be able to use wtable without any problem.
A solution is to repeat the rows of the data.frame by weight and then table the result.
The following repeats the data.frame's rows (only relevant columns):
df[rep(row.names(df), df$wgt), 1:2]
And it can be used to get the contingency table.
table(df[rep(row.names(df), df$wgt), 1:2])
#        sex
# age     Female Male
#   0-15      56   76
#   16-29     73   99
#   30-44     60  106
#   45+       76   90
Base R, in stats, has xtabs for exactly this:
xtabs(wgt ~ age + sex, data=df)
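As a hedged sketch, the block below rebuilds the sample data from the question and checks that xtabs gives the same cell counts as the row-repetition approach. Repeating rows only makes sense for non-negative integer weights, whereas xtabs simply sums the weights and so also handles fractional ones. The prop.table call at the end is an optional extra for weighted column shares, not part of the original answers.

```r
# Sample data as defined in the question
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(1:10, 100, replace = TRUE)
df <- data.frame(age, sex, wgt)

# Weighted crosstab: sums wgt within each age x sex cell
weighted <- xtabs(wgt ~ age + sex, data = df)

# Row-repetition approach: repeat each row wgt times, then count rows
expanded <- table(df[rep(seq_len(nrow(df)), df$wgt), c("age", "sex")])

# TRUE if the two approaches agree cell by cell
all(as.vector(weighted) == as.vector(expanded))

# Weighted share of each age group within each sex
round(prop.table(weighted, margin = 2), 3)
```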
A tidyverse solution using your data and the same set.seed; uncount is the equivalent of @Rui's rep of the weights.
library(dplyr)
library(tidyr)
df %>%
uncount(weights = .$wgt) %>%
select(-wgt) %>%
table
#>        sex
#> age     Female Male
#>   0-15      56   76
#>   16-29     73   99
#>   30-44     60  106
#>   45+       76   90

creating a two-way table with totals in R

I was wondering if there is an easy way to create a table that has column as well as row totals?
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
colnames(smoke) <- c("High","Low","Middle")
rownames(smoke) <- c("current","former","never")
smoke <- as.table(smoke)
I thought this would be super easy, but the solutions I have found so far seem pretty complicated, involving lapply and rbind. This seems like such a trivial task that there must be an easier way.
Desired result:
> smoke
        High Low Middle TOTAL
current   51  43     22   116
former    92  28     21   141
never     68  22      9    99
TOTAL    211  93     52   356
addmargins(smoke)
addmargins is in the stats package.
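By default addmargins labels the new row and column Sum. A minimal sketch, assuming the smoke table from the question, that renames those margins to TOTAL so the result matches the desired output (with_totals is a name chosen for illustration):

```r
# Table as defined in the question
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# Add row and column sums, then relabel the last row and column
with_totals <- addmargins(smoke)
rownames(with_totals)[nrow(with_totals)] <- "TOTAL"
colnames(with_totals)[ncol(with_totals)] <- "TOTAL"
with_totals
```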
You can use adorn_totals from janitor:
library(janitor)
library(magrittr)
smoke %>%
as.data.frame.matrix() %>%
tibble::rownames_to_column() %>%
adorn_totals(name = 'TOTAL') %>%
adorn_totals(name = 'TOTAL', where = 'col')
#  rowname High Low Middle TOTAL
#  current   51  43     22   116
#   former   92  28     21   141
#    never   68  22      9    99
#    TOTAL  211  93     52   356

How do I filter .csv file before reading

I want to work with a filtered subset of my dataset.
Example: healthstats.csv
  age weight height gender
A  25    150     65 female
B  24    175     78   male
C  26    130     72   male
D  32    200     69 female
E  28    156     66   male
F  40    112     78 female
I would start with
patients = read.csv("healthstats.csv")
but how do I import only the subset where
patients$gender == "female"
when I run read.csv?
If you want to import only a subset of rows without first reading the whole file into R, you can use sqldf, which accepts an SQL query to filter the data.
library(sqldf)
read.csv.sql("healthstats.csv", sql = "select * from file where gender == 'female'")
We can also use read_csv_chunked from readr:
readr::read_csv_chunked("healthstats.csv",
  callback = readr::DataFrameCallback$new(function(x, pos) subset(x, gender == "female")))
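If neither sqldf nor readr is available, the same idea can be sketched in base R: read the file a few lines at a time and keep only the matching rows from each chunk. This is a different technique from the two answers above, not a drop-in for them. The function read_filtered, its arguments, and the id column are hypothetical. It assumes a comma-separated file with a header row and no line breaks inside quoted fields.

```r
# Write the example data to a temporary CSV so the sketch is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(
  id = c("A", "B", "C", "D", "E", "F"),
  age = c(25, 24, 26, 32, 28, 40),
  weight = c(150, 175, 130, 200, 156, 112),
  height = c(65, 78, 72, 69, 66, 78),
  gender = c("female", "male", "male", "female", "male", "female")
), csv_path, row.names = FALSE)

# Hypothetical helper: read `path` in chunks of `chunk_size` lines and
# keep only the rows for which keep(chunk) is TRUE
read_filtered <- function(path, keep, chunk_size = 2) {
  con <- file(path, open = "r")
  on.exit(close(con))
  # First line holds the column names; strip the quotes write.csv adds
  header <- gsub('"', "", strsplit(readLines(con, n = 1), ",")[[1]])
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    pieces[[length(pieces) + 1]] <- chunk[keep(chunk), , drop = FALSE]
  }
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

patients <- read_filtered(csv_path, function(d) d$gender == "female")
patients
```

Only one chunk plus the rows kept so far is held in memory at a time, which is the point of filtering during the read. For a real file a much larger chunk_size would be sensible.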

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
  Selidor.Bay Enlades.Bay Cumphrey.Bay
1          39          29          187
2          70         370           50
3          13          44           52
4           0          65           20
5          43         110          220
6           0          30          266
What I would like to do is create a script to calculate basic statistics for each site.
If I rearrange the data by stacking it, i.e.:
  values        site
1     29 Selidor.Bay
2    370 Selidor.Bay
3     44 Selidor.Bay
4     65 Enlades.Bay
I'm able to use the following:
library(plyr)
data <- ddply(df, c("site"), summarise,
              N = length(values),
              mean = mean(values),
              sd = sd(values),
              se = sd / sqrt(N),
              sum = sum(values)
)
data
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
#           site n  mean        sd       se sum
# 1  Selidor.Bay 6  27.5  27.93385 11.40395 165
# 2  Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. There is likely some analogous function in the tidyr package.
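The question asks how to avoid stacking altogether. One base-R sketch, assuming the wide data frame from the question, applies the statistics column by column with sapply. The helper name site_stats is hypothetical.

```r
# Wide data as shown in the question: one column per site
df <- data.frame(
  Selidor.Bay = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

# Hypothetical helper: the same statistics as the ddply call, for one column
site_stats <- function(v) {
  n <- length(v)
  c(N = n, mean = mean(v), sd = sd(v), se = sd(v) / sqrt(n), sum = sum(v))
}

# sapply returns one column per site; transpose to get one row per site
stats <- as.data.frame(t(sapply(df, site_stats)))
stats
```

Each site ends up as a row name instead of a site column, so no reshaping step is needed.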

Custom sorting of a dataframe in R

I have a binomial dataset that looks like this:
df <- data.frame(replicate(4,sample(1:200,1000,rep=TRUE)))
addme <- data.frame(replicate(1,sample(0:1,1000,rep=TRUE)))
df <- cbind(df,addme)
df <-df[order(df$replicate.1..sample.0.1..1000..rep...TRUE..),]
The data is currently sorted so that all rows belonging to the 0 group come first, followed by those belonging to the 1 group. Is there a way to sort the data in a 0-1-0-1-0... fashion? That is, a row from the 0 group, then a row from the 1 group, then one from the 0 group, and so on.
All I can think about is complex functions. I hope there's a simple way around it.
Thank you,
Here's an attempt, which will add any extra 1's at the end:
First make some example data:
set.seed(2)
df <- data.frame(replicate(4,sample(1:200,10,rep=TRUE)),
addme=sample(0:1,10,rep=TRUE))
Then order:
with(df, df[unique(as.vector(rbind(which(addme==0),which(addme==1)))),])
#    X1  X2  X3  X4 addme
#2  141  48  78  33     0
#1   37 111 133   3     1
#3  115 153 168 163     0
#5  189  82  70 103     1
#4   34  37  31 174     0
#6  189 171  98 126     1
#8  167  46  72  57     0
#7   26 196  30 169     1
#9   94  89 193 134     1
#10 110  15  27  31     1
#Warning message:
#In rbind(which(addme == 0), which(addme == 1)) :
# number of columns of result is not a multiple of vector length (arg 1)
Here's another way using dplyr, which would make it suitable for within-group ordering. It's also probably pretty quick. If there are unbalanced numbers of 0's and 1's, the extras are left at the end.
library(dplyr)
df %>%
arrange(addme) %>%
mutate(n0 = sum(addme == 0),
orderme = seq_along(addme) - (n0 * addme) + (0.5 * addme)) %>%
arrange(orderme) %>%
select(-n0, -orderme)
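A base-R variant of the same idea, offered as a sketch rather than a tested replacement for either answer: number the rows within each group with ave, then order by that within-group position and break ties by group. The small data frame here is made up for illustration, with more 1's than 0's to show where the extras land.

```r
# Illustrative data: three 0's and four 1's
df <- data.frame(x = 1:7, addme = c(0, 0, 1, 1, 1, 0, 1))

# Position of each row within its own group (1st zero, 2nd zero, ...)
within_rank <- ave(seq_along(df$addme), df$addme, FUN = seq_along)

# Order by within-group position, then by group: 0, 1, 0, 1, ...
interleaved <- df[order(within_rank, df$addme), ]
interleaved$addme
# [1] 0 1 0 1 0 1 1
```

Unlike the rbind approach, this raises no recycling warning when the groups differ in size. The surplus rows of the larger group simply come last.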
