R extracting the frequencies - r

I am trying to get the frequencies, but my ids are repeating. Here is some sample data:
id <- c(1,1,2,2,3,3)
gender <- c("m","m","f","f","m","m")
score <- c(10,5,10,5,10,5)
data <- data.frame("id"=id,"gender"=gender, "score"=score)
> data
id gender score
1 1 m 10
2 1 m 5
3 2 f 10
4 2 f 5
5 3 m 10
6 3 m 5
I would like to get the frequencies of the gender categories but I have repeating ids. When I run this code below:
gender<-as.data.frame(table(data$gender))
> gender
Var1 Freq
1 f 2
2 m 4
The frequency should be female = 1 and male = 2. It should look like this:
> gender
Var1 Freq
1 f 1
2 m 2
How can I get this considering the id information?

You can use data.table::uniqueN to count the number of unique ids per gender group:
library(data.table)
setDT(data)
data[, .(Freq = uniqueN(id)), gender]
# gender Freq
# 1: m 2
# 2: f 1

The idea from @IceCreamToucan with dplyr:
library(dplyr)
data %>%
group_by(gender) %>%
summarise(freq = n_distinct(id))
gender freq
<fct> <int>
1 f 1
2 m 2

In base R
rowSums(table(data$gender,data$id)!=0)
f m
1 2
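To see why this one-liner works, it may help to break it into steps. The sketch below rebuilds the sample data from the question; the intermediate names tab and present are only illustrative and do not appear in the original answer.

```r
# Sample data from the question
data <- data.frame(
  id     = c(1, 1, 2, 2, 3, 3),
  gender = c("m", "m", "f", "f", "m", "m"),
  score  = c(10, 5, 10, 5, 10, 5)
)

# Cross-tabulate gender (rows) against id (columns); each cell is a row count
tab <- table(data$gender, data$id)

# TRUE wherever a gender/id combination occurs at least once
present <- tab != 0

# Summing the TRUEs in each row counts the distinct ids per gender
freq <- rowSums(present)
freq
```

Because each id is counted at most once per gender, repeated rows for the same id no longer inflate the frequencies.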

Being late to the party, I was quite surprised by the sophisticated answers that use grouping or rowSums().
In base R, I would first remove the duplicate id rows from the data.frame by subsetting with !duplicated(id), and then apply table() on the gender column.
So, the code is
table(data[!duplicated(data$id), "gender"])
f m
1 2
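Note that duplicated() flags the second and later occurrences of a value, so the selection has to be negated to keep exactly one row per id. A minimal check, assuming hypothetical data in which ids repeat an unequal number of times (the data frame d and the name first_rows are made up for illustration):

```r
# Hypothetical data: id 1 appears three times, id 2 once, id 3 twice
d <- data.frame(
  id     = c(1, 1, 1, 2, 3, 3),
  gender = c("m", "m", "m", "f", "m", "m")
)

# Keep only the first row seen for each id, then tabulate gender
first_rows <- d[!duplicated(d$id), ]
freq <- table(first_rows$gender)
freq
```

Without the negation, id 2 would be dropped entirely here because it never appears a second time.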

Related

Sample n random rows per group in a dataframe with dplyr when some observations have less than n rows

I have a data frame with two categorical variables.
samples<-c("A","A","A","A","B","B")
groups<-c(1,1,1,2,1,1)
df<- data.frame(samples,groups)
df
samples groups
1 A 1
2 A 1
3 A 1
4 A 2
5 B 1
6 B 1
For each sample-group combination, I would like to randomly downsample the data frame to a maximum of X rows (the randomness is important) and keep all rows for combinations that appear fewer than X times. In the example here, X = 2. Is there an easy way to do this? The issue is that observation 4 (A, 2) appears only once, so dplyr's sample_n would not work.
desired output
samples groups
1 A 1
2 A 1
3 A 2
4 B 1
5 B 1
You can sample the minimum of the group's row count and x for each group:
library(dplyr)
x <- 2
df %>% group_by(samples, groups) %>% sample_n(min(n(), x))
# samples groups
# <chr> <dbl>
#1 A 1
#2 A 1
#3 A 2
#4 B 1
#5 B 1
However, note that sample_n() has been superseded in favor of slice_sample(), and n() doesn't work with slice_sample(). There is an open issue for this.
As @tmfmnk mentioned, though, we don't need to call n() here. Try:
df %>% group_by(samples, groups) %>% slice_sample(n = x)
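One option with data.table:
library(data.table)
setDT(df)  # convert df to a data.table by reference
X <- 2     # maximum number of rows to keep per group
df[df[, .I[sample(.N, min(.N, X))], by = .(samples, groups)]$V1]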
samples groups
1: A 1
2: A 1
3: A 2
4: B 1
5: B 1
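For comparison, the same capped sampling can be sketched in base R without any packages. This is not from the answers above; the names pieces, sampled and result are illustrative, and set.seed() is only there to make the random draw repeatable.

```r
samples <- c("A", "A", "A", "A", "B", "B")
groups  <- c(1, 1, 1, 2, 1, 1)
df <- data.frame(samples, groups)
X <- 2  # maximum number of rows to keep per group

set.seed(42)  # only so that the random draw is reproducible

# One data frame per (samples, groups) pair; drop = TRUE skips empty pairs
pieces <- split(df, list(df$samples, df$groups), drop = TRUE)

# From each piece, draw min(group size, X) row positions without replacement
sampled <- lapply(pieces, function(g) {
  g[sample(nrow(g), min(nrow(g), X)), , drop = FALSE]
})

# Stack the pieces back into a single data frame
result <- do.call(rbind, sampled)
rownames(result) <- NULL
result
```

Groups with fewer than X rows are returned in full, which is the behaviour the question asks for; the row order follows the split groups rather than the original data.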

count frequency of variable dependent on other variable in an R dataframe [duplicate]

df <- data.frame(samples = c('45fe.K2','45fe.K2','45fe.K2','45hi.K1','45hi.K1'),source = c('f','f','o','o','f'))
df
samples source
1 45fe.K2 f
2 45fe.K2 f
3 45fe.K2 o
4 45hi.K1 o
5 45hi.K1 f
I want to count how many rows of each sample come from source f or o.
The result should look like this:
samples source count
1 45fe.K2 f 2
3 45fe.K2 o 1
4 45hi.K1 o 1
5 45hi.K1 f 1
I have tried the following:
df <- df %>%
group_by(source) %>%
mutate(count = n_distinct(samples)) %>%
ungroup()
df <- within(df, { count <- ave(source, samples, FUN=function(x) length(unique(x)))})
df$count <- ave(as.integer(df$samples), df$source, FUN = function(x) length(unique(x)))
df$count <- with(df, ave(samples, source, FUN = function(x) length(unique(x))))
All of these count only the number of unique samples (which is 2) or the number of unique sources (which is also 2). But I want to know how many times each source occurs within each sample.
Try this dplyr solution with summarise() and n():
library(dplyr)
df %>% group_by(samples,source) %>% summarise(N=n())
Output:
# A tibble: 4 x 3
# Groups: samples [2]
samples source N
<chr> <chr> <int>
1 45fe.K2 f 2
2 45fe.K2 o 1
3 45hi.K1 f 1
4 45hi.K1 o 1
A base R solution would be to create an indicator variable N filled with ones and then sum it with aggregate():
#Data
df$N <- 1
#Code
aggregate(N~samples+source,df,sum)
Output:
samples source N
1 45fe.K2 f 2
2 45hi.K1 f 1
3 45fe.K2 o 1
4 45hi.K1 o 1
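Another base R possibility, not taken from the answers above, is to cross-tabulate the two columns with table() and turn the result into a long data frame. This is a minimal sketch using the data from the question; the name counts is illustrative.

```r
df <- data.frame(
  samples = c("45fe.K2", "45fe.K2", "45fe.K2", "45hi.K1", "45hi.K1"),
  source  = c("f", "f", "o", "o", "f")
)

# Cross-tabulate the two columns and reshape the counts into long format
counts <- as.data.frame(
  table(samples = df$samples, source = df$source),
  responseName = "count"
)

# Drop combinations that never occur in the data
counts <- counts[counts$count > 0, ]
counts
```

Unlike the aggregate() approach, this needs no helper column, but table() also produces zero-count combinations, hence the final filter.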

count number of events grouped by id [duplicate]

DF<-data.frame(id=c(1,1,2,3,3),code=c("A","A","A","E","E"))
> DF
id code
1 1 A
2 1 A
3 2 A
4 3 E
5 3 E
Now I want to count the number of ids with the same code. Desired output:
# A tibble: 2 x 2
code count
1 A 2
2 E 1
I've been trying:
> DF%>%group_by(code)%>%summarize(count=n())
# A tibble: 2 x 2
code count
<fct> <int>
1 A 3
2 E 2
> DF%>%group_by(code,id)%>%summarize(count=n())
# A tibble: 3 x 3
# Groups: code [2]
code id count
<fct> <dbl> <int>
1 A 1 2
2 A 2 1
3 E 3 2
>
Neither of these gives me the desired output.
Being pedantic, I'd rephrase your question as "count the number of distinct IDs per code". With that mindset, the answer becomes clearer.
DF %>%
group_by(code) %>%
summarize(count = n_distinct(id))
An option with data.table would be uniqueN (instead of n_distinct from dplyr) after grouping by 'code' and converting to data.table (setDT)
library(data.table)
setDT(DF)[, .(count = uniqueN(id)), code]
# code count
#1: A 2
#2: E 1
A simple base R solution also works:
#Data
DF<-data.frame(id=c(1,1,2,3,3),code=c("A","A","A","E","E"))
#Classic base R sol
aggregate(id~code,data=DF,FUN = function(x) length(unique(x)))
code id
1 A 2
2 E 1
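The difference between counting rows and counting distinct ids can be made explicit with tapply(). The sketch below is only an illustration of that distinction; the names rows_per_code and ids_per_code are made up.

```r
DF <- data.frame(id = c(1, 1, 2, 3, 3), code = c("A", "A", "A", "E", "E"))

# Rows per code: this is what n() counts in the question's first attempt
rows_per_code <- tapply(DF$id, DF$code, length)

# Distinct ids per code: this is what n_distinct() and uniqueN() count
ids_per_code <- tapply(DF$id, DF$code, function(x) length(unique(x)))

rows_per_code
ids_per_code
```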

R dplyr select not removing columns

I'm a new user of R and have a very basic question. I'm trying to delete columns using the dplyr select function. It appears to run correctly but then when the data is viewed using head the deleted column still appears, and also a count is still able to be run on this column. I've run this on a very simple test dataset, the outputs are below. Please advise on how to permanently delete the columns from the data. Thanks
> library(dplyr)
> setwd("C:/")
> mydata <- read_csv("test.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
`Smoking Status` = col_character()
)
> head(mydata)
# A tibble: 4 x 3
Age Gender `Smoking Status`
<dbl> <chr> <chr>
1 18 M Smoker
2 25 F Non-smoker
3 40 M Ex-smoker
4 53 F Non-smoker
> select(mydata,-Age)
# A tibble: 4 x 2
Gender `Smoking Status`
<chr> <chr>
1 M Smoker
2 F Non-smoker
3 M Ex-smoker
4 F Non-smoker
> head(mydata)
# A tibble: 4 x 3
Age Gender `Smoking Status`
<dbl> <chr> <chr>
1 18 M Smoker
2 25 F Non-smoker
3 40 M Ex-smoker
4 53 F Non-smoker
> mydata %>%
+ count(Age)
# A tibble: 4 x 2
Age n
<dbl> <int>
1 18 1
2 25 1
3 40 1
4 53 1
If I understand your question correctly, the column is not being deleted because you are not assigning the result of select() to a variable. select() returns a new data frame and leaves the original unchanged.
df <- data.frame(age = 10:20,
sex = c('M','M','F','F','M','F','F','M','F','F','M'),
smoker = c('N','N','Y','N','N','Y','N','N','Y','Y','N'))
df_1 <- select(df,-age)
head(df_1)
sex smoker
1 M N
2 M N
3 F Y
4 F N
5 M N
6 F Y
I hope this helps.
I have extracted the first 4 rows (head) of your data and turned them into a reproducible example that anyone can copy and run easily. This helps us understand your problem, which in turn helps you get your answer faster.
# Dataframe based on head of your table
mydata <- data.frame(Age = c(18,25,40,53),
Gender = c("M","F","M","F"),
Smoking_Status = c("Smoker","Non_smoker","Ex-smoker","Non-smoker"))
> mydata
Age Gender Smoking_Status
1 18 M Smoker
2 25 F Non_smoker
3 40 M Ex-smoker
4 53 F Non-smoker
Essentially you are creating a new data frame once you have transformed your dataframe in any way, and this new data frame needs to be saved into a variable. This can be done by using either = or <-.
I prefer using <- as it helps differentiate assigning a variable.
If you have no need for your original data frame, you can simply overwrite it by assigning the new data frame to the same name.
mydata <- select(mydata, -Age)
To preserve your original data frame, you can create a new variable and store this data frame inside. Now, mydata is still the same as above but mydata2 has no Age column.
mydata2 <- select(mydata, -Age)
> mydata2
Gender Smoking_Status
1 M Smoker
2 F Non_smoker
3 M Ex-smoker
4 F Non-smoker
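The same copy-on-return behaviour can be seen without dplyr. The sketch below uses base R's subset() as a stand-in for select(), which is an assumption made only to keep the example free of packages; the data frame is a shortened version of the one above.

```r
mydata <- data.frame(
  Age    = c(18, 25, 40, 53),
  Gender = c("M", "F", "M", "F")
)

# Calling the function returns a modified copy; mydata itself is untouched
subset(mydata, select = -Age)
names(mydata)  # still includes "Age"

# Only an assignment changes what a name refers to
mydata2 <- subset(mydata, select = -Age)
names(mydata2)
```

Printing the result at the console, as in the question, shows the modified copy once and then discards it.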

how to count and dcast for all columns in r

I have the following data frame in R:
Company Education Health
A NA 1
A 1 2
A 1 NA
I want the count of each level (1, 2, NA) in each column, in the following format:
Company Education_1 Education_NA Health_1 Health_2 Health_NA
A 2 1 1 1 1
How can I do it in R?
You can do the following:
library(tidyverse)
df %>%
gather(k, v, -Company) %>%
unite(tmp, k, v, sep = "_") %>%
count(Company, tmp) %>%
spread(tmp, n)
## A tibble: 1 x 6
# Company Education_1 Education_NA Health_1 Health_2 Health_NA
# <fct> <int> <int> <int> <int> <int>
#1 A 2 1 1 1 1
Sample data
df <- read.table(text =
" Company Education Health
A NA 1
A 1 2
A 1 NA ", header = T)
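In newer versions of tidyr, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same pipeline with those functions, assuming tidyr 1.0.0 or later is installed (the result name res is illustrative):

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  Company   = "A",
  Education = c(NA, 1, 1),
  Health    = c(1, 2, NA)
)

res <- df %>%
  # Long format: one row per (Company, column name, value)
  pivot_longer(-Company, names_to = "k", values_to = "v") %>%
  # Build labels such as "Education_1"; NA becomes the string "NA"
  unite(tmp, k, v, sep = "_") %>%
  # Count how often each label occurs per company
  count(Company, tmp) %>%
  # Back to wide format: one column per label
  pivot_wider(names_from = tmp, values_from = n)

res
```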
Using DF in the Note at the end where we have added a company B as well and using the reshape2 package it can be done in one recast call. The id.var and fun arguments can be omitted and the same answer will be given but it will produce a message saying it used those defaults.
library(reshape2)
recast(DF, Company ~ variable + value,
id.var = "Company", fun = length)
giving this data frame:
Company Education_1 Education_NA Health_1 Health_2 Health_NA
1 A 2 1 1 1 1
2 B 2 1 1 1 1
Note
Lines <- " Company Education Health
1 A NA 1
2 A 1 2
3 A 1 NA
4 B NA 1
5 B 1 2
6 B 1 NA"
DF <- read.table(text = Lines)
In plyr you can use a hack with ddply by transposing tables to get what appear to be new columns:
x <- data.frame(Company="A",Education=c(NA,1,1),Health=c(1,2,NA))
library(plyr)
ddply(x,.(Company),plyr::summarise,
Education=t(table(addNA(Education))),
Health=t(table(addNA(Health)))
)
Company Education.1 Education.NA Health.1 Health.2 Health.NA
1 A 2 1 1 1 1
However, they are not really columns, but table elements in the data.frame.
You can use a do.call(data.frame,y) construct to make them proper data frame columns, but you need more than one row for it to work.
