R column to row for table - r

I have a table:
table(sex)
male female
58 48
I would like to put it like that:
male 58
female 48
Is that possible?

We can wrap it with as.data.frame:
as.data.frame(table(sex))
#      sex Freq
# 1 female   42
# 2   male   58
data
set.seed(24)
sex <- sample(c("male", "female"), 100, replace=TRUE)
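As a small sketch of the same idea, as.data.frame() on a table also accepts a responseName argument if you would rather not have the count column called Freq. The vector below is made-up illustration data, not the data from the question.

```r
# Hypothetical data, just for illustration
sex <- c("male", "female", "male", "male", "female")

# One row per category, with the count column named "n" instead of "Freq"
long <- as.data.frame(table(sex), responseName = "n", stringsAsFactors = FALSE)
long
```

The categories come out in alphabetical order, because table() sorts the levels.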

Related

Table in r to be weighted

I'm trying to run a crosstab/contingency table, but need it weighted by a weighting variable.
Here is some sample data.
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(c(1:10), 100, replace = TRUE)
df <- data.frame(age,sex, wgt)
I've run this to get a regular crosstab table:
table(df$sex, df$age)
To get a weighted frequency, I tried the Hmisc package (if you know a better package, let me know):
library(Hmisc)
wtd.table(df$sex, df$age, weights=df$wgt)
Error in match.arg(type) : 'arg' must be of length 1
I'm not sure where I've gone wrong, but it doesn't run, so any help will be great.
Alternatively, if you know how to do this in another package, which may be better for analysing survey data, that would be great too. Many thanks in advance.
Try this
GDAtools::wtable(df$sex, df$age, w = df$wgt)
Output
       0-15 16-29 30-44 45+ NA tot
Female   56    73    60  76  0 265
Male     76    99   106  90  0 371
NA        0     0     0   0  0   0
tot     132   172   166 166  0 636
Update
In case you do not want to install the whole package, here are two essential functions you need:
wtable and dichotom
Source them and you should be able to use wtable without any problem.
A solution is to repeat the rows of the data.frame by weight and then table the result.
The following repeats the data.frame's rows (only relevant columns):
df[rep(row.names(df), df$wgt), 1:2]
And it can be used to get the contingency table.
table(df[rep(row.names(df), df$wgt), 1:2])
#        sex
# age     Female Male
#   0-15      56   76
#   16-29     73   99
#   30-44     60  106
#   45+       76   90
Base R, in stats, has xtabs for exactly this:
xtabs(wgt ~ age + sex, data=df)
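A quick way to convince yourself that xtabs() and the row-repetition approach agree is to compute both and compare them. This is only a sketch using the sample data from the question; the names by_xtabs, by_rep, same_cells and same_total are made up here. Note that repeating rows only works when the weights are non-negative whole numbers, whereas xtabs() simply sums whatever is on the left-hand side of the formula.

```r
# Same sample data as in the question
set.seed(123)
sex <- sample(c("Male", "Female"), 100, replace = TRUE)
age <- sample(c("0-15", "16-29", "30-44", "45+"), 100, replace = TRUE)
wgt <- sample(1:10, 100, replace = TRUE)
df <- data.frame(age, sex, wgt)

# Weighted table: wgt is summed within each age/sex cell
by_xtabs <- xtabs(wgt ~ age + sex, data = df)

# Row repetition: each row appears wgt times, then ordinary counting
by_rep <- table(df[rep(seq_len(nrow(df)), df$wgt), c("age", "sex")])

# Both tables should hold the same cell totals
same_cells <- all(as.vector(by_xtabs) == as.vector(by_rep))
same_total <- sum(by_xtabs) == sum(df$wgt)
```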
A tidyverse solution using your data and the same set.seed. uncount() is the tidyr equivalent of repeating rows by weight with rep().
library(dplyr)
library(tidyr)
df %>%
  uncount(weights = .$wgt) %>%
  select(-wgt) %>%
  table
#>        sex
#> age     Female Male
#>   0-15      56   76
#>   16-29     73   99
#>   30-44     60  106
#>   45+       76   90

R: Names of features not appearing when using table()

I am using the following command that returns this output:
> table(data$Smoke, data$Gender)
      female male
  no     314  334
  yes     44   33
Nonetheless, in the tutorial I'm watching, the instructor uses the same line of code and they get
     Gender
Smoke female male
  no     314  334
  yes     44   33
How can I achieve this result? It's not clear from the help menu.
Just pass a two-column data.frame object to table()
table(data[c("Smoke", "Gender")])
#      Gender
# Smoke female male
#   no      29   31
#   yes     17   23
or use xtabs():
xtabs( ~ Smoke + Gender, data)
#      Gender
# Smoke female male
#   no      29   31
#   yes     17   23
The following also works, although it looks somewhat crude.
table(Smoke = data$Smoke, Gender = data$Gender)
Data
data <- data.frame(id = 1:100,
                   Smoke = sample(c("no", "yes"), 100, TRUE),
                   Gender = sample(c("female", "male"), 100, TRUE))
You can name the vectors you pass to table.
table(Smoke = c('no','yes'), Gender = c('male','female'))
#-----
     Gender
Smoke female male
  no       0    1
  yes      1    0
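Another option, sketched here with a tiny made-up data frame rather than the asker's data, is the dnn argument of table(), which sets the dimension names without having to name each vector.

```r
# Hypothetical stand-in for the question's data
data <- data.frame(
  Smoke  = c("no", "yes", "no", "no"),
  Gender = c("female", "male", "male", "female")
)

# dnn supplies the header labels for the rows and columns
tab <- table(data$Smoke, data$Gender, dnn = c("Smoke", "Gender"))
names(dimnames(tab))
```

This keeps the data$column style of the original call and only adds the labels.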

How to obtain conditioned results from an R dataframe

This is my first message here. I'm trying to solve an R exercise from an edX R course, and I'm stuck in it. It would be great if somebody could help me solve it. Here are the dataframe and question given:
> students
   height shoesize gender population
1     181       44   male     kuopio
2     160       38 female     kuopio
3     174       42 female     kuopio
4     170       43   male     kuopio
5     172       43   male     kuopio
6     165       39 female     kuopio
7     161       38 female     kuopio
8     167       38 female    tampere
9     164       39 female    tampere
10    166       38 female    tampere
11    162       37 female    tampere
12    158       36 female    tampere
13    175       42   male    tampere
14    181       44   male    tampere
15    180       43   male    tampere
16    177       43   male    tampere
17    173       41   male    tampere
Given the dataframe above, create two subsets with students whose height is equal to or below the median height (call it students.short) and students whose height is strictly above the median height (call it students.tall). What is the mean shoesize for each of the above 2 subsets by population?
I've been able to create the two subsets students.tall and students.short (both display the answers by TRUE/FALSE), but I don't know how to obtain the mean by population. The data should be displayed like this:
               kuopio tampere
students.short   xxxx    xxxx
students.tall    xxxx    xxxx
Many thanks if you can give me a hand!
We can split by a logical vector based on the median height
# // median height
medHeight <- median(students$height, na.rm = TRUE)
# // split the data into a list of data.frames using the 'medHeight'
lst1 <- with(students, split(students, height > medHeight))
Then loop over the list and use aggregate from base R
lapply(lst1, function(dat) aggregate(shoesize ~ population,
                                     data = dat, FUN = mean, na.rm = TRUE))
However, we don't need to create two separate datasets or a list. It can be done by grouping on both 'population' and the 'grp' created from the logical vector
library(dplyr)
students %>%
  group_by(grp = height > medHeight, population) %>%
  summarise(shoesize = mean(shoesize))
You can try this:
#Code
students.short <- students[students$height<=median(students$height),]
students.tall <- students[students$height>median(students$height),]
#Mean
mean(students.short$shoesize)
mean(students.tall$shoesize)
Output:
[1] 38.44444
[1] 42.75
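The two means above are overall means per subset. To break them down by population as the exercise asks, one possible base R sketch uses tapply() with two grouping vectors. The data frame is retyped from the question (gender is left out because it is not needed here), and the names grp and by_pop are made up for this example.

```r
# Data retyped from the question; gender omitted because it is not used
students <- data.frame(
  height = c(181, 160, 174, 170, 172, 165, 161,
             167, 164, 166, 162, 158, 175, 181, 180, 177, 173),
  shoesize = c(44, 38, 42, 43, 43, 39, 38,
               38, 39, 38, 37, 36, 42, 44, 43, 43, 41),
  population = rep(c("kuopio", "tampere"), times = c(7, 10))
)

# Label each student relative to the median height (170 for this data)
grp <- ifelse(students$height > median(students$height),
              "students.tall", "students.short")

# Mean shoesize for every combination of group and population
by_pop <- tapply(students$shoesize, list(grp, students$population), mean)
by_pop
#                kuopio tampere
# students.short   39.5    37.6
# students.tall    43.0    42.6
```

The result is a plain matrix laid out like the table requested in the question.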
You can use pivot_wider() in tidyr and set the argument values_fn to mean.
library(dplyr)
library(tidyr)
students %>%
  mutate(grp = if_else(height > median(height), "students.tall", "students.short")) %>%
  pivot_wider(id_cols = grp, names_from = population, values_from = shoesize, values_fn = mean)
# # A tibble: 2 x 3
#   grp            kuopio tampere
#   <chr>           <dbl>   <dbl>
# 1 students.tall    43      42.6
# 2 students.short   39.5    37.6
In base R, you can use xtabs(), which returns a table object.
xtabs(shoesize ~ grp + population,
      aggregate(shoesize ~ grp + population, FUN = mean,
                transform(students, grp = ifelse(height > median(height), "students.tall", "students.short"))))
#                 population
# grp              kuopio tampere
#   students.short   39.5    37.6
#   students.tall    43.0    42.6
Note: To convert a table object into data.frame, you can use as.data.frame.matrix().

How do I filter .csv file before reading

I want to work with a filtered subset of my dataset.
Example: healthstats.csv
  age weight height gender
A  25    150     65 female
B  24    175     78   male
C  26    130     72   male
D  32    200     69 female
E  28    156     66   male
F  40    112     78 female
I would start with
patients = read.csv("healthstats.csv")
but how do I import only the subset where
patients$gender == "female"
when I run
patients = read.csv("healthstats.csv")
If you want to import only a subset of rows without first reading the whole file into R, you can use sqldf, which accepts an SQL query to filter the data.
library(sqldf)
read.csv.sql("healthstats.csv", sql = "select * from file where gender == 'female'")
We can also use read_csv_chunked from readr. Because readr is not attached here, DataFrameCallback needs the readr:: prefix as well:
readr::read_csv_chunked('healthstats.csv',
  callback = readr::DataFrameCallback$new(function(x, pos) subset(x, gender == "female")))
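If neither package is available, the same chunk-and-filter idea can be sketched in base R with a file connection. Everything below is illustrative: read_filtered is a made-up helper, the temporary file stands in for healthstats.csv with an assumed comma-separated layout and an id column, and the header is split naively on commas, so it would not cope with quoted fields that contain commas.

```r
# Small stand-in file (assumed layout; the real file may differ)
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,age,weight,height,gender",
  "A,25,150,65,female",
  "B,24,175,78,male",
  "C,26,130,72,male",
  "D,32,200,69,female",
  "E,28,156,66,male",
  "F,40,112,78,female"
), path)

# Hypothetical helper: read chunk_size lines at a time, keep matching rows only
read_filtered <- function(path, keep, chunk_size = 2L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- strsplit(readLines(con, n = 1L), ",", fixed = TRUE)[[1]]
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break
    chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                      stringsAsFactors = FALSE)
    pieces[[length(pieces) + 1L]] <- chunk[keep(chunk), , drop = FALSE]
  }
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

females <- read_filtered(path, function(d) d$gender == "female")
```

Only one chunk plus the rows kept so far are in memory at any time, which is the point of filtering while reading.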

Count number of occurrences of a string in R under different conditions

I have a data frame called "data" with multiple columns, which looks like this:
Preferences  Status      Gender
8a 8b 9a     Employed    Female
10b 11c 9b   Unemployed  Male
11a 11c 8e   Student     Female
That is, each customer selected 3 preferences and specified other information such as Status and Gender. Each preference is given by a [number][letter] combination, and there are c. 30 possible preferences. The possible preferences are:
8[a - c]
9[a - k]
10[a - d]
11[a - c]
12[a - i]
I want to count the number of occurrences of each preference, under certain conditions for the other columns - eg. for all women.
The output will ideally be a dataframe that looks like this:
Preference  Female  Male  Employed  Unemployed  Student
8a            1034   934       234         495      203
8b             539   239       609         394      235
8c             124   395       684          94      283
9a             120   999       895         945      345
9b             978   385       596         923      986
etc.
What's the most efficient way to achieve this?
Thanks.
I am assuming you are starting with something that looks like this:
mydf <- structure(list(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female")),
  .Names = c("Preferences", "Status", "Gender"),
  class = c("data.frame"), row.names = c(NA, -3L))
mydf
#   Preferences     Status Gender
# 1    8a 8b 9a   Employed Female
# 2  10b 11c 9b Unemployed   Male
# 3  11a 11c 8e    Student Female
If that's the case, you need to "split" the "Preferences" column (by spaces), transform the data into a "long" form, and then reshape it to a wide form, tabulating while you do so.
With the right tools, this is pretty straightforward.
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit`
dcast.data.table(                             # Step 3--aggregate to wide form
  melt(                                       # Step 2--convert to long form
    cSplit(mydf, "Preferences", " ", "long"), # Step 1--split "Preferences"
    id.vars = "Preferences"),
  Preferences ~ value, fun.aggregate = length)
#    Preferences Employed Female Male Student Unemployed
# 1:         10b        0      0    1       0          1
# 2:         11a        0      1    0       1          0
# 3:         11c        0      1    1       1          1
# 4:          8a        1      1    0       0          0
# 5:          8b        1      1    0       0          0
# 6:          8e        0      1    0       1          0
# 7:          9a        1      1    0       0          0
# 8:          9b        0      0    1       0          1
I also tried a dplyr + tidyr approach, which looks like the following:
library(dplyr)
library(tidyr)
mydf %>%
  separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
  gather(Pref, Pvals, P_1:P_3) %>%                  # stack the preference columns
  gather(Var, Val, Status:Gender) %>%               # stack the status/gender columns
  group_by(Pvals, Val) %>%                          # group by these new columns
  summarise(count = n()) %>%                        # aggregate the numbers of each
  spread(Val, count)                                # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
#   Pvals Employed Female Male Student Unemployed
# 1   10b       NA     NA    1      NA          1
# 2   11a       NA      1   NA       1         NA
# 3   11c       NA      1    1       1          1
# 4    8a        1      1   NA      NA         NA
# 5    8b        1      1   NA      NA         NA
# 6    8e       NA      1   NA       1         NA
# 7    9a        1      1   NA      NA         NA
# 8    9b       NA     NA    1      NA          1
Both approaches are actually pretty quick. Test it with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
                 paste0(9, letters[1:11]),
                 paste0(10, letters[1:4]),
                 paste0(11, letters[1:3]),
                 paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
  Preferences = vapply(replicate(nrow,
                                 sample(preferences, 3, FALSE),
                                 FALSE),
                       function(x) paste(x, collapse = " "),
                       character(1L)),
  Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
  Gender = sample(c("Male", "Female"), nrow, TRUE)
)
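For comparison, the same split, lengthen and tabulate steps can be sketched in base R alone. This uses the three-row example from the question under the made-up name mydf_small so that it does not overwrite the larger test data; long and counts are also names invented for this sketch.

```r
# Three-row example from the question
mydf_small <- data.frame(
  Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
  Status = c("Employed", "Unemployed", "Student"),
  Gender = c("Female", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Step 1: split each Preferences string on spaces
prefs <- strsplit(mydf_small$Preferences, " ", fixed = TRUE)

# Step 2: long form, one row per (customer, preference) pair
long <- data.frame(
  Preference = unlist(prefs),
  Status = rep(mydf_small$Status, lengths(prefs)),
  Gender = rep(mydf_small$Gender, lengths(prefs)),
  stringsAsFactors = FALSE
)

# Step 3: tabulate against each condition and bind the tables side by side
counts <- cbind(table(long$Preference, long$Gender),
                table(long$Preference, long$Status))
counts
```

Both tables share the same sorted preference levels as row names, which is why cbind() lines them up correctly.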

Resources