R: Names of features not appearing when using table()

I am using the following command that returns this output:
> table(data$Smoke, data$Gender)
      female male
  no     314  334
  yes     44   33
Nonetheless, in the tutorial I'm watching, the instructor uses the same line of code and they get
     Gender
Smoke female male
  no     314  334
  yes     44   33
How can I achieve this result? It's not clear from the help menu.

Just pass a two-column data.frame object to table():
table(data[c("Smoke", "Gender")])
#      Gender
# Smoke female male
#   no      29   31
#   yes     17   23
or use xtabs():
xtabs( ~ Smoke + Gender, data)
#      Gender
# Smoke female male
#   no      29   31
#   yes     17   23
The following also works, although it is somewhat clumsy:
table(Smoke = data$Smoke, Gender = data$Gender)
Data
data <- data.frame(id = 1:100,
                   Smoke = sample(c("no", "yes"), 100, replace = TRUE),
                   Gender = sample(c("female", "male"), 100, replace = TRUE))

You can name the vectors you pass to table.
table(Smoke = c('no','yes'), Gender = c('male','female'))
#-----
     Gender
Smoke female male
  no       0    1
  yes      1    0
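
A further option, not mentioned in the answers above, is the dnn argument of table(). As far as the documentation goes, table() only picks up dimension names automatically when an argument is a plain symbol, which would explain why data$Smoke and data$Gender come out unlabeled. The sketch below uses two small made-up vectors (smoke and gender are illustrative names, not the asker's data):

```r
# Small illustrative vectors (made up for this sketch)
smoke  <- c("no", "yes", "no", "no", "yes")
gender <- c("male", "female", "female", "male", "male")

# dnn supplies the dimension names explicitly
tab <- table(smoke, gender, dnn = c("Smoke", "Gender"))
print(tab)

# The names are stored in the dimnames of the table object
names(dimnames(tab))
```

With the asker's data this would read table(data$Smoke, data$Gender, dnn = c("Smoke", "Gender")).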

Related

Table in r to be weighted

I'm trying to run a crosstab/contingency table, but need it weighted by a weighting variable.
Here is some sample data.
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df <- data.frame(age,sex, wgt)
I've run this to get a regular crosstab table
table(df$sex, df$age)
To get a weighted frequency, I tried the Hmisc package (if you know a better package, let me know):
library(Hmisc)
wtd.table(df$sex, df$age, weights=df$wgt)
Error in match.arg(type) : 'arg' must be of length 1
I'm not sure where I've gone wrong, but it doesn't run, so any help will be great.
Alternatively, if you know how to do this in another package, which may be better for analysing survey data, that would be great too. Many thanks in advance.
Try this
GDAtools::wtable(df$sex, df$age, w = df$wgt)
Output
        0-15 16-29 30-44 45+ NA tot
Female    56    73    60  76  0 265
Male      76    99   106  90  0 371
NA         0     0     0   0  0   0
tot      132   172   166 166  0 636
Update
In case you do not want to install the whole package, here are two essential functions you need:
wtable and dichotom
Source them and you should be able to use wtable without any problem.
A solution is to repeat the rows of the data.frame by weight and then table the result.
The following repeats the data.frame's rows (only relevant columns):
df[rep(row.names(df), df$wgt), 1:2]
And it can be used to get the contingency table.
table(df[rep(row.names(df), df$wgt), 1:2])
#        sex
# age     Female Male
#   0-15      56   76
#   16-29     73   99
#   30-44     60  106
#   45+       76   90
Base R, in stats, has xtabs for exactly this:
xtabs(wgt ~ age + sex, data=df)
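
To make the link between the weighted xtabs() call and the row-repetition approach concrete, here is a minimal sketch on a tiny hand-made data frame (the values are invented for illustration and are not the question's sample data). It checks that summing the weights gives the same cell counts as repeating each row by its weight:

```r
# Tiny made-up data set with integer weights
df <- data.frame(
  age = c("0-15", "0-15", "16-29", "16-29", "16-29"),
  sex = c("Male", "Female", "Male", "Male", "Female"),
  wgt = c(2, 3, 1, 4, 5)
)

# Weighted table: the left-hand side of the formula is summed per cell
weighted <- xtabs(wgt ~ age + sex, data = df)

# Same result by repeating each row wgt times and counting rows
expanded <- table(df[rep(seq_len(nrow(df)), df$wgt), c("age", "sex")])

print(weighted)
all(as.vector(weighted) == as.vector(expanded))
```

Unlike the repetition approach, xtabs() also works when the weights are not whole numbers.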
A tidyverse solution, using your data with the same set.seed. uncount() is the equivalent of @Rui's rep() on the weights.
library(dplyr)
library(tidyr)
df %>%
  uncount(weights = .$wgt) %>%
  select(-wgt) %>%
  table
#>        sex
#> age     Female Male
#>   0-15      56   76
#>   16-29     73   99
#>   30-44     60  106
#>   45+       76   90

How to obtain conditioned results from an R dataframe

This is my first message here. I'm trying to solve an R exercise from an edX R course, and I'm stuck in it. It would be great if somebody could help me solve it. Here are the dataframe and question given:
> students
   height shoesize gender population
1     181       44   male     kuopio
2     160       38 female     kuopio
3     174       42 female     kuopio
4     170       43   male     kuopio
5     172       43   male     kuopio
6     165       39 female     kuopio
7     161       38 female     kuopio
8     167       38 female    tampere
9     164       39 female    tampere
10    166       38 female    tampere
11    162       37 female    tampere
12    158       36 female    tampere
13    175       42   male    tampere
14    181       44   male    tampere
15    180       43   male    tampere
16    177       43   male    tampere
17    173       41   male    tampere
Given the dataframe above, create two subsets with students whose height is equal to or below the median height (call it students.short) and students whose height is strictly above the median height (call it students.tall). What is the mean shoesize for each of the above 2 subsets by population?
I've been able to create the two subsets students.tall and students.short (both display the answers by TRUE/FALSE), but I don't know how to obtain the mean by population. The data should be displayed like this:
               kuopio tampere
students.short   xxxx    xxxx
students.tall    xxxx    xxxx
Many thanks if you can give me a hand!
We can split by a logical vector based on the median height
# // median height
medHeight <- median(students$height, na.rm = TRUE)
# // split the data into a list of data.frames using the 'medHeight'
lst1 <- with(students, split(students, height > medHeight))
Then loop over the list and use aggregate() from base R
lapply(lst1, function(dat) aggregate(shoesize ~ population,
                                     data = dat, FUN = mean, na.rm = TRUE))
However, we don't need to create two separate datasets or a list. It can be done by grouping on both 'population' and the 'grp' column created from the logical vector
library(dplyr)
students %>%
  group_by(grp = height > medHeight, population) %>%
  summarise(shoesize = mean(shoesize))
You can try this:
#Code
students.short <- students[students$height<=median(students$height),]
students.tall <- students[students$height>median(students$height),]
#Mean
mean(students.short$shoesize)
mean(students.tall$shoesize)
Output:
[1] 38.44444
[1] 42.75
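
The two means above are overall means, while the question asks for them by population. A minimal base R sketch of that last step, assuming the students data frame is rebuilt from the question's printout (the variable names med, grp and res are illustrative):

```r
# Rebuild the data frame from the printout in the question
students <- data.frame(
  height     = c(181, 160, 174, 170, 172, 165, 161, 167, 164,
                 166, 162, 158, 175, 181, 180, 177, 173),
  shoesize   = c(44, 38, 42, 43, 43, 39, 38, 38, 39,
                 38, 37, 36, 42, 44, 43, 43, 41),
  gender     = c("male", "female", "female", "male", "male", "female",
                 "female", rep("female", 5), rep("male", 5)),
  population = rep(c("kuopio", "tampere"), times = c(7, 10))
)

# Label each student as short (<= median) or tall (> median)
med <- median(students$height)
grp <- ifelse(students$height > med, "students.tall", "students.short")

# Mean shoesize for every group/population combination, as a matrix
res <- tapply(students$shoesize, list(grp, students$population), mean)
print(res)
```

tapply() with a list of two grouping vectors returns a matrix with the groups as rows and the populations as columns, which matches the layout requested in the question.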
You can use pivot_wider() in tidyr and set the argument values_fn as mean.
library(dplyr)
library(tidyr)
students %>%
  mutate(grp = if_else(height > median(height), "students.tall", "students.short")) %>%
  pivot_wider(id_cols = grp, names_from = population, values_from = shoesize, values_fn = mean)
# # A tibble: 2 x 3
#   grp            kuopio tampere
#   <chr>           <dbl>   <dbl>
# 1 students.tall    43      42.6
# 2 students.short   39.5    37.6
With a base way, you can try xtabs(), which returns a table object.
xtabs(shoesize ~ grp + population,
      aggregate(shoesize ~ grp + population, FUN = mean,
                transform(students, grp = ifelse(height > median(height), "students.tall", "students.short"))))
#                 population
# grp              kuopio tampere
#   students.short   39.5    37.6
#   students.tall    43.0    42.6
Note: To convert a table object into data.frame, you can use as.data.frame.matrix().

How do I filter .csv file before reading

I want to work with a filtered subset of my dataset.
Example: healthstats.csv
  age weight height gender
A  25    150     65 female
B  24    175     78   male
C  26    130     72   male
D  32    200     69 female
E  28    156     66   male
F  40    112     78 female
I would start with
patients = read.csv("healthstats.csv")
but how do I import only the subset where
patients$gender == "female"
when I run
patients = read.csv("healthstats.csv")
If you want to import only a subset of rows without first reading the whole file into R, you can use sqldf, which accepts a query to filter the data.
library(sqldf)
read.csv.sql("healthstats.csv", sql = "select * from file where gender == 'female'")
We can also use read_csv_chunked() from readr
readr::read_csv_chunked('healthstats.csv',
  callback = readr::DataFrameCallback$new(function(x, pos) subset(x, gender == "female")))
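
If neither package is available, the same chunk-and-filter idea can be sketched in base R by reading a few lines at a time from a file connection. This is only an illustration under assumptions: the file is written to a temporary path, it is comma-separated with a plain header and no row labels, and the chunk size of 2 is arbitrary.

```r
# Write a small comma-separated stand-in for healthstats.csv (assumed layout)
path <- tempfile(fileext = ".csv")
writeLines(c("age,weight,height,gender",
             "25,150,65,female",
             "24,175,78,male",
             "26,130,72,male",
             "32,200,69,female",
             "28,156,66,male",
             "40,112,78,female"), path)

con <- file(path, "r")
col_names <- strsplit(readLines(con, n = 1), ",")[[1]]  # header line

kept <- list()
repeat {
  lines <- readLines(con, n = 2)        # read the next chunk of raw lines
  if (length(lines) == 0) break         # end of file
  chunk <- read.csv(text = lines, header = FALSE,
                    col.names = col_names, stringsAsFactors = FALSE)
  # Keep only the matching rows of this chunk
  kept[[length(kept) + 1]] <- chunk[chunk$gender == "female", ]
}
close(con)

patients <- do.call(rbind, kept)
print(patients)
```

Only one chunk plus the rows kept so far are in memory at any time, which is the point of filtering while reading.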

R column to row for table

I have a table:
table(sex)
  male female
    58     48
I would like to put it like that:
male   58
female 48
Is that possible?
We can wrap it with as.data.frame():
as.data.frame(table(sex))
#     sex Freq
#1 female   42
#2   male   58
data
set.seed(24)
sex <- sample(c("male", "female"), 100, replace=TRUE)
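
If the row order from the question matters (male first), one possible variation is to fix the factor levels before tabulating and to rename the count column through the responseName argument of as.data.frame(). A minimal sketch with a short made-up vector:

```r
# Short illustrative vector (not the seeded sample above)
sex <- c("male", "female", "male", "male", "female")

# Fix the level order so "male" comes first, and name the dimension
tab <- table(sex = factor(sex, levels = c("male", "female")))

# One row per level; responseName sets the name of the count column
out <- as.data.frame(tab, responseName = "n", stringsAsFactors = FALSE)
print(out)
```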

Confusion matrix with a four-level class in R

I am trying to get a confusion matrix from a multi-level factor variable (Rating)
My data looks like this:
> head(credit)
   Income    Rating Cards Age Education Gender Student Married Ethnicity Balance
1  14.891       bad     2  34        11   Male      No     Yes Caucasian     333
2 106.025 excellent     3  82        15 Female     Yes     Yes     Asian     903
3 104.593 excellent     4  71        11   Male      No      No     Asian     580
4 148.924 excellent     3  36        11 Female      No      No     Asian     964
5  55.882      good     2  68        16   Male      No     Yes Caucasian     331
6  80.180 excellent     4  77        10   Male      No      No Caucasian    1151
I built a classification tree with the rpart() function then predicted probabilities.
credit_model <- rpart(Rating ~ ., data=credit_train, method="class")
credit_pred <- predict(credit_model, credit_test)
Then I want to assess the prediction with CrossTable() from the gmodels package.
library(gmodels)
CrossTable(credit_test, credit_pred, prop.chisq=FALSE, prop.c=FALSE, prop.r=FALSE, dnn=c("actual Rating", "predicted Rating"))
But I get this error:
Error in CrossTable(credit_test, credit_pred, prop.chisq = FALSE,
prop.c = FALSE, : x and y must have the same length
I don't know why I get this error for a 4-level class. When I have a binary class it works fine.
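
A likely cause, judging from the code shown: for a classification tree, predict() on an rpart model returns a matrix of class probabilities by default, and credit_test is the whole data frame rather than the Rating column, so the two arguments do not have matching lengths. The sketch below illustrates the idea on the built-in iris data, since the credit data is not available here. The train/test split, the seed and all variable names are assumptions made for illustration.

```r
library(rpart)

# Illustrative split of a built-in data set with a three-level class
set.seed(1)
idx   <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

model <- rpart(Species ~ ., data = train, method = "class")

# Default prediction: one probability column per class (a matrix)
prob <- predict(model, test)
dim(prob)

# type = "class" returns one predicted label per test row
pred <- predict(model, test, type = "class")

# Compare the actual column (not the whole data frame) with the labels
conf <- table(actual = test$Species, predicted = pred)
print(conf)
```

Applied to the question, that would mean passing credit_test$Rating and predict(credit_model, credit_test, type = "class") to CrossTable().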
