Frequency of data points by two variables in R [duplicate] - r

This question already has answers here:
Count number of rows within each group
(17 answers)
Closed 5 years ago.
I know this must have a simple answer, but I can't seem to figure it out.
Suppose I have a dataset:
id <- c(1,1,1,2,2,3,3,4,4)
visit <- c("A", "B", "C", "A", "B", "A", "C", "A", "B")
test <- c(12,16, NA, 11, 15,NA, 0,12, 5)
df <- data.frame(id,visit,test)
And I want to know the number of data points per visit so that the final output looks something like this:
visit test
A 3
B 3
C 1
How would I go about doing this? I've tried using table
table(df$visit, df$test)
but that gives me a full grid of counts for every combination of visit and test value.
I can sum each row by doing this:
sum(table(df$visit, df$test)[1,])
sum(table(df$visit, df$test)[2,])
sum(table(df$visit, df$test)[3,])
But I feel like there is an easier way and I'm missing it! Any help would be greatly appreciated!

aggregate from base R would be ideal for this. Group id by visit and count the length, dropping the rows where test is NA with !is.na() before taking the length:
aggregate(x = df$id[!is.na(df$test)], by = list(df$visit[!is.na(df$test)]), FUN = length)
# Group.1 x
#1 A 3
#2 B 3
#3 C 1
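If you only need the counts, an even shorter base R sketch (not part of the original answer) is to tabulate the visits for the non-missing test rows:
table(df$visit[!is.na(df$test)])
# A B C
# 3 3 1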

How about:
data.frame(rowSums(table(df$visit, df$test)))
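For reference, that returns a one-column data frame with the visits as row names; a small sketch along the same lines (column names are my own choice) turns it into the two-column layout asked for:
counts <- rowSums(table(df$visit, df$test))
data.frame(visit = names(counts), test = as.vector(counts))
#   visit test
# 1     A    3
# 2     B    3
# 3     C    1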

Related

I have 9 observations in a table. I want to take mean of first 3 observations, then next 3 and then last 3 in R. How do we do that? [duplicate]

This question already has answers here:
Calculating moving average
(17 answers)
Closed 1 year ago.
X <- c(1:9)
data <- data.frame(X)
I want a table where I get three numbers:
mean(of first three)
mean(of next three)
mean(of last three)
You need to create a new variable that will be used as a group. Here's an example:
> data$group <- rep(c("a", "b", "c"), each = 3)
> aggregate(X ~ group, data = data, FUN = mean)
group X
1 a 2
2 b 5
3 c 8
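If the groups are always consecutive blocks of three, a couple of base R shortcuts (just sketches, assuming the vector length is a multiple of 3) avoid building the grouping column by hand:
X <- 1:9
colMeans(matrix(X, nrow = 3))   # reshape into a 3-row matrix, take column means
# [1] 2 5 8
tapply(X, gl(3, 3), mean)       # or generate the grouping factor with gl()
# 1 2 3
# 2 5 8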

Select factor values with level NA [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 5 years ago.
How can I avoid using a loop to subset a dataframe based on multiple factor levels?
In the following example my desired output is a dataframe. The dataframe should contain the rows of the original dataframe where the value in "Code" equals one of the values in "selected".
Working example:
#sample data
Code<-c("A","B","C","D","C","D","A","A")
Value<-c(1, 2, 3, 4, 1, 2, 3, 4)
data<-data.frame(cbind(Code, Value))
selected<-c("A","B") #want rows that contain A and B
#Begin subsetting
result <- data[which(data$Code == selected[1]), ]
s1 <- 2
while (s1 < length(selected) + 1) {
  result <- rbind(result, data[which(data$Code == selected[s1]), ])
  s1 <- s1 + 1
}
This is a toy example of a much larger dataset, so "selected" may contain a great number of elements and the data a great number of rows. Therefore I would like to avoid the loop.
You can use %in%
data[data$Code %in% selected,]
Code Value
1 A 1
2 B 2
7 A 3
8 A 4
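The same filter can also be written with subset(), which some people find more readable in interactive use (just an equivalent sketch, not from the original answer):
subset(data, Code %in% selected)
#   Code Value
# 1    A     1
# 2    B     2
# 7    A     3
# 8    A     4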
Here's another:
data[data$Code == "A" | data$Code == "B", ]
It's also worth mentioning that the subsetting factor doesn't have to be part of the data frame if it matches the data frame rows in length and order. In this case we made our data frame from this factor anyway. So,
data[Code == "A" | Code == "B", ]
also works, which is one of the really useful things about R.
Try this:
> data[match(as.character(data$Code), selected, nomatch = FALSE), ]
Code Value
1 A 1
2 B 2
1.1 A 1
1.2 A 1

R data.frame Aggregate data to calculate diversity ratio [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 4 years ago.
I have a data frame of demographic data in R:
Name Region Gender
A    1      F
B    2      M
C    1      F
D    1      M
E    2      M
I want to calculate gender ratio for every region. Output should look like:
Region GenderRatio
1      0.67
2      0.50
This can be worked out with basic arithmetic. Is there an efficient way to calculate it in R?
You can use the dplyr library in R for all sorts of data manipulation. See here to learn more about dplyr and other extremely useful R packages.
An example:
First I create some sample data. (I changed it a little bit to actually have a gender ratio that fits your output.)
df <- data.frame(name = c("A", "B", "C", "D", "E"),
region = c(1,2,1,1,2),
gender = c("F", "M", "F", "M", "F"))
Now we can calculate gender_ratio and summarise the data. The mutate function creates and calculates the new variable gender_ratio. The group_by and summarise calls organise the data so the calculation is done per region and then collapse the output to one row per region.
library(dplyr)
df %>%
  group_by(region) %>%
  mutate(gender_ratio = sum(gender == "F") / length(gender)) %>%
  group_by(region, gender_ratio) %>%
  summarise()
Output is:
region gender_ratio
<dbl> <dbl>
1 1 0.667
2 2 0.5
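A slightly more direct variant (my own sketch, not part of the original answer) computes the ratio inside summarise and skips the second group_by:
df %>%
  group_by(region) %>%
  summarise(gender_ratio = mean(gender == "F"))
# region gender_ratio
#      1        0.667
#      2        0.5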
Hope this helps.
As a (base R) alternative, you can use by with prop.table(table(...)) to return a list of fractions for both male/female
with(df, by(df, Region, function(x) prop.table(table(x$Gender))))
#Region: 1
#
# F M
#0.6666667 0.3333333
#------------------------------------------------------------
#Region: 2
#
#F M
#0 1
Or to return only the male fraction
with(df, by(df, Region, function(x) prop.table(table(x$Gender))[2]))
#Region: 1
#[1] 0.3333333
#------------------------------------------------------------
#Region: 2
#[1] 1
Or to store male fraction and region in a data.frame simply stack the above result:
setNames(
stack(with(df, by(df, Region, function(x) prop.table(table(x$Gender))[2]))),
c("GenderRatio", "Region"))
# GenderRatio Region
#1 0.3333333 1
#2 1.0000000 2
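An aggregate-based sketch (assuming the question's data frame with Name, Region and Gender columns) returns the female share per region directly:
aggregate(Gender ~ Region, data = df, FUN = function(x) mean(x == "F"))
#   Region    Gender
# 1      1 0.6666667
# 2      2 0.0000000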

Better subsetting and counting values in a dataframe [duplicate]

This question already has answers here:
Counting unique / distinct values by group in a data frame
(12 answers)
Closed 4 years ago.
I have a data frame with two columns and 70,000 rows. One column serves as an identifier for a household (column b in the example below). The other column refers to the individuals in the household, numbering them from 1 to n with some error (could be 1,2,3 or 1,4,5) (column a in the example below).
I'm trying to use hierarchical clustering with the number of individuals in a household as a feature. The code I've written below counts the number of individuals in each household and puts them in the proper column and row, but it takes several minutes with the actual data set I have, I assume due to its size. Is there a better way of getting this information?
fake.data <- data.frame(a = c(1,1,5,6,7,1,2,3,1,2,4), b = c("a", "a", "a", "a", "a", "b", "b", "b", "c", "c", "c"))
fake.cluster <- data.frame(b = unique(fake.data$b))
fake.cluster$members <- sapply(fake.cluster$b, function(x) length(unique(subset(fake.data, fake.data$b == x)$a)))
Don't know if this is quicker, but you could use dplyr in various ways. One approach: get the distinct rows and then count b.
library(dplyr)
fake.cluster <- fake.data %>%
distinct() %>%
count(b)
Here is an option using data.table
library(data.table)
setDT(fake.data)[, .(members = uniqueN(a)), b]
# b members
#1: a 4
#2: b 3
#3: c 3
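A plain base R equivalent (just a sketch, not from either answer) uses tapply to count the distinct values of a within each b:
tapply(fake.data$a, fake.data$b, function(x) length(unique(x)))
# a b c
# 4 3 3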

Select rows in a dataframe based on values of all columns [duplicate]

This question already has an answer here:
Subset dataframe such that all values in each row are less than a certain value
(1 answer)
Closed 6 years ago.
I feel this question is very simple, but I am new in R, and I don't know how to solve it.
I have a dataframe df with 100 rows. The first column is Patient_ID and all the others are measurements of T cells over time. I want to select the rows (the patients) in which all the cell measurements are lower than 200.
My idea (maybe very complicated) was:
f200 = function(x){x < 200}
df2 = f200(df[,2:10])
Then select the rows where all elements are TRUE, i.e. where the product of all elements equals 1... But I don't know how to write this! Can you help me? Or tell me a simpler way?
We can try with Reduce and &
df[Reduce(`&`, lapply(replace(df[-1], is.na(df[-1]), 0), `<`, 200)),]
# ID col1 col2
#1 1 NA 24
#2 2 20 NA
data
set.seed(24)
df <- data.frame(ID=1:4, col1 = c(NA, 20, 210, 30), col2 = c(24, NA, 30, 240))
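If NAs should count as passing, as in the approach above, a rowSums-based sketch does the same filter in one step:
df[rowSums(df[-1] >= 200, na.rm = TRUE) == 0, ]
#   ID col1 col2
# 1  1   NA   24
# 2  2   20   NA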
