Summarizing data by subgroups in R

My dataset looks like this
Org_ID Market volume Indicator variable
1 100 1
1 200 0
1 300 0
2 50 1
2 500 1
3 400 0
3 200 0
3 300 0
3 100 0
I want to summarize it by Org_ID, calculating the percentage of each organization's market volume (TRx) that comes from rows where the indicator variable is 0, as follows:
Org_ID % of 0's by market volume
1 83.3%
2 0%
3 100%
I tried subgroups but can't get this to work. Can anyone suggest some ways to do it?

With dplyr:
library(dplyr)
df %>%
group_by(Org_ID) %>%
summarize(sum_market_vol = sum(Market_volume*!Indicator_variable),
tot_market_vol = sum(Market_volume)) %>%
transmute(Org_ID, Perc_Market_Vol = 100*sum_market_vol/tot_market_vol)
Result:
# A tibble: 3 x 2
Org_ID Perc_Market_Vol
<int> <dbl>
1 1 83.33333
2 2 0.00000
3 3 100.00000
Data:
df = structure(list(Org_ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
Market_volume = c(100L, 200L, 300L, 50L, 500L, 400L, 200L,
300L, 100L), Indicator_variable = c(1L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 0L)), .Names = c("Org_ID", "Market_volume", "Indicator_variable"
), class = "data.frame", row.names = c(NA, -9L))
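The same calculation can also be sketched in base R with `tapply`, for readers who prefer to avoid the dplyr dependency. This is a minimal sketch rather than part of the original answer: the names `zero_vol`, `tot_vol`, and `perc` are introduced here for illustration, and the sample data is rebuilt with `data.frame()`.

```r
# Sample data from the question
df <- data.frame(
  Org_ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  Market_volume = c(100L, 200L, 300L, 50L, 500L, 400L, 200L, 300L, 100L),
  Indicator_variable = c(1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L)
)

# Market volume on rows where the indicator is 0, summed per Org_ID
# (multiplying by the logical keeps the volume for 0-rows and zeroes the rest)
zero_vol <- tapply(df$Market_volume * (df$Indicator_variable == 0), df$Org_ID, sum)

# Total market volume per Org_ID
tot_vol <- tapply(df$Market_volume, df$Org_ID, sum)

# Percentage of market volume carried by indicator-0 rows
perc <- 100 * zero_vol / tot_vol
print(round(perc, 1))
```

The result is a named vector keyed by Org_ID, which is convenient for lookups but would need `data.frame(Org_ID = names(perc), perc)` or similar if a table is required.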


Counting patients that have not had medication

I have a dataframe in which patients have multiple observations of medication use over time. Some patients have consistently used medication, others have gaps, while I am trying to count the patients which have never used medication.
I can't show the actual data but here is an example data frame of what I am working with.
patid meds
1 0
1 1
1 1
2 0
2 0
3 1
3 1
3 1
4 0
5 1
5 0
So from this two patients (4 and 2) never used medication. That's what I'm looking for.
I'm fairly new to R and have no idea how to do this; any help would be appreciated.
Here is an alternative using the dplyr package.
library(dplyr)
df <- data.frame(patid = c(1,1,1,2,2,3,3,3,4,5,5),
meds = c(0,1,1,0,0,1,1,1,0,1,0))
df %>%
distinct(patid, meds) %>%
arrange(desc(meds))%>%
filter(meds == 0 & !duplicated(patid))
# patid meds
#1 2 0
#2 4 0
Try this:
library(dplyr)
#Data
df <- structure(list(patid = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 4L,
5L, 5L), meds = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 1L, 0L)), class = "data.frame", row.names = c(NA,
-11L))
#Code
df %>% group_by(patid) %>% summarise(sum = sum(meds, na.rm = TRUE)) %>% filter(sum == 0)
# A tibble: 2 x 2
patid sum
<int> <int>
1 2 0
2 4 0
A Base R solution could be
subset(aggregate(meds ~ patid, df, sum), meds == 0)
which returns
patid meds
2 2 0
4 4 0
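Since the question asks for a count of patients rather than the rows themselves, the check can also be phrased directly as "all observations for this patient are 0". The following is a minimal base R sketch under that reading; the names `never_used`, `never_ids`, and `n_never` are introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  patid = c(1, 1, 1, 2, 2, 3, 3, 3, 4, 5, 5),
  meds  = c(0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0)
)

# For each patient, TRUE only if every observation has meds == 0
never_used <- tapply(df$meds == 0, df$patid, all)

# IDs of patients who never used medication, and how many there are
never_ids <- names(never_used)[never_used]
n_never <- sum(never_used)

print(never_ids)
print(n_never)
```

Note that `tapply` returns the patient IDs as character names, so `never_ids` may need `as.numeric()` before it is compared with the original `patid` column.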

Two different id values for the same individuals in different datasets

I have two vectors of id values associated with two different datasets. The two vectors correspond to the same individuals, but the id values are unrelated (and there are multiple observations for each individual in each dataset). My goal is to merge the datasets by id, but because the ids differ and the datasets have different lengths, I first need to match one set of ids to the other. There is obviously a lot more data than what I included in the example.
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
So 4033 = 1; 4833 = 4...etc.
dummy dataset1:
id day y
1 1 10
1 2 4
1 3 2
4 1 9
4 2 10
4 3 6
dummy dataset2:
id day y1
4033 1 100
4033 1 120
4033 2 150
4033 3 200
4833 1 120
4833 2 100
4833 2 50
4833 3 100
4833 3 200
What I would like is an easy way to get:
dummy dataset1 output:
id day y id.2
1 1 10 4033
1 2 4 4033
1 3 2 4033
4 1 9 4833
4 2 10 4833
4 3 6 4833
I'm trying a solution in a for loop like:
for (i in length(dataset)) {
dataset$id[dataset[[1]] %in% int] <- int1
}
But that's not working correctly (probably for an obvious reason I'm missing).
Since we have two vectors, we can build the lookup with a named vector in base R:
df1$id.2 <- setNames(a, b)[as.character(df1$id)]
df1
# id day y id.2
#1 1 1 10 4033
#2 1 2 4 4033
#3 1 3 2 4033
#4 4 1 9 4833
#5 4 2 10 4833
#6 4 3 6 4833
Or another base R option is match
df1$id.2 <- a[match(df1$id, b)]
data
df1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)),
class = "data.frame", row.names = c(NA,
-6L))
df2 <- structure(list(id = c(4033L, 4033L, 4033L, 4033L, 4833L, 4833L,
4833L, 4833L, 4833L), day = c(1L, 1L, 2L, 3L, 1L, 2L, 2L, 3L,
3L), y1 = c(100L, 120L, 150L, 200L, 120L, 100L, 50L, 100L, 200L
)), class = "data.frame", row.names = c(NA, -9L))
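One property of the `match` approach worth checking on real data is what happens when an id has no counterpart: `match` returns `NA` for it, so the new column is `NA` on those rows. A small sketch, using the vectors `a` and `b` from the question and a hypothetical data frame `df_demo` in which id 99 is deliberately absent from `b`:

```r
a <- c(4033, 4833, 681, 9567, 6175, 7112, 3889, 264, 3918, 7685)
b <- c(1, 4, 7, 10, 14, 18, 22, 26, 27, 37)

# Hypothetical data for illustration: id 99 does not occur in b
df_demo <- data.frame(id = c(1, 4, 99), y = c(10, 9, 5))

# match() gives the position of each id in b (NA if absent);
# indexing a by those positions yields the corresponding second id
df_demo$id.2 <- a[match(df_demo$id, b)]

# ids that could not be translated
unmatched <- df_demo$id[is.na(df_demo$id.2)]

print(df_demo)
print(unmatched)
```

Inspecting `unmatched` before merging is a cheap way to catch ids that are missing from the lookup vectors.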
Another approach is to make a data.frame of the IDs and use merge.
datasetID <- data.frame(id = b, id.2 = a)
merge(dataset1, datasetID)
id day y id.2
1 1 1 10 4033
2 1 2 4 4033
3 1 3 2 4033
4 4 1 9 4833
5 4 2 10 4833
6 4 3 6 4833
Data
a <- c(4033,4833,681,9567,6175,7112,3889,264,3918,7685)
b <- c(1,4,7,10,14,18,22,26,27,37)
dataset1 <- structure(list(id = c(1L, 1L, 1L, 4L, 4L, 4L), day = c(1L, 2L,
3L, 1L, 2L, 3L), y = c(10L, 4L, 2L, 9L, 10L, 6L)), class = "data.frame", row.names = c(NA,
-6L))

How can I get the number of speakers from a data frame in R

NUMBER WEIGHT DAILY-LANG RELIGION PROVINCE DISTRICT SUB_DISTRI
5 9.50 1167 1 11 01 010
6 9.50 1167 1 11 01 010
7 9.50 1167 1 11 01 010
8 10.30 4 2 33 071 220
9 10.10 6 1 61 8 170
This is a view of my data. I have to find the number of DAILY-LANG speakers in each SUB_DISTRI.
If the columns WEIGHT, DAILY-LANG, RELIGION, PROVINCE, DISTRICT and SUB_DISTRI together identify a speaker uniquely, you can use nrow and unique to get the number of speakers.
nrow(unique(x))
#[1] 3
To get DAILY-LANG per RELIGION, PROVINCE, DISTRICT and SUB_DISTRI you can use unique, split and interaction:
y <- unique(x)
split(y$DAILY.LANG,
interaction(y[c("RELIGION", "PROVINCE", "DISTRICT", "SUB_DISTRI")], drop=TRUE))
#$`1.11.1.10`
#[1] 1167
#
#$`1.61.8.170`
#[1] 6
#
#$`2.33.71.220`
#[1] 4
Or if SUB_DISTRI is already unique:
split(y$DAILY.LANG, y$SUB_DISTRI)
#$`10`
#[1] 1167
#
#$`170`
#[1] 6
#
#$`220`
#[1] 4
Data:
x <- structure(list(WEIGHT = c(9.5, 9.5, 9.5, 10.3, 10.1), DAILY.LANG = c(1167L,
1167L, 1167L, 4L, 6L), RELIGION = c(1L, 1L, 1L, 2L, 1L), PROVINCE = c(11L,
11L, 11L, 33L, 61L), DISTRICT = c(1L, 1L, 1L, 71L, 8L), SUB_DISTRI = c(10L,
10L, 10L, 220L, 170L)), row.names = c(NA, -5L), class = "data.frame")

How can I combine rows based on a specific parameter in R

I have a dataframe which looks like this:
ID Smoker Asthma Age Sex COPD Event_Date
1 1 0 0 65 M 0 12-2009
2 1 0 1 65 M 0 21-2009
3 1 0 1 65 M 0 23-2009
4 2 1 0 67 M 0 19-2010
5 2 1 0 67 M 0 21-2010
6 2 1 1 67 M 1 01-2011
7 2 1 1 67 M 1 02-2011
8 3 2 1 77 F 0 09-2015
9 3 2 1 77 F 1 10-2015
10 3 2 1 77 F 1 10-2015
I would like to know whether it would be possible to combine my rows in order to achieve a dataset like this:
ID Smoker Asthma Age Sex COPD Event_Date
1 0 1 65 M 0 12-2009
2 1 1 67 M 1 19-2010
3 2 1 77 F 1 09-2015
I have tried using the unique function, however this doesn't give me my desired output and repeats the ID for multiple rows.
This is an example of the code I've tried:
Data2<-unique(Data)
I do not just want the first row because I want to include each column status. For example, just getting the first row would not include the COPD status which occurs in the later rows for each ID.
Alternative Solution:
library(dplyr)
d %>%
group_by(ID, Age, Sex, Smoker) %>%
summarise(Asthma = !is.na(match(1, Asthma)),
COPD = !is.na(match(1, COPD)),
Event_Date = first(Event_Date)) %>%
ungroup %>%
mutate_if(is.logical, as.numeric)
# A tibble: 3 x 7
ID Age Sex Smoker Asthma COPD Event_Date
<int> <int> <fct> <int> <dbl> <dbl> <fct>
1 1 65 M 0 1 0 12-2009
2 2 67 M 1 1 1 19-2010
3 3 77 F 2 1 1 09-2015
If you want to get the (first) row for each ID you can try something like this:
d <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
Smoker = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
Asthma = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
Age = c(65L, 65L, 65L, 67L, 67L, 67L, 67L, 77L, 77L, 77L),
Sex = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L),
.Label = c("F", "M"), class = "factor"),
COPD = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L),
Event_Date = structure(c(5L, 7L, 9L, 6L, 8L, 1L, 2L, 3L, 4L, 4L),
.Label = c("01-2011", "02-2011", "09-2015",
"10-2015", "12-2009", "19-2010",
"21-2009", "21-2010", "23-2009"),
class = "factor")),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8", "9", "10"))
d[!duplicated(d$ID), ]
# ID Smoker Asthma Age Sex COPD Event_Date
# 1 1 0 0 65 M 0 12-2009
# 4 2 1 0 67 M 0 19-2010
# 8 3 2 1 77 F 0 09-2015
Use max when you need a value that appears further down and dplyr::first for the other columns. Here is an example:
library(dplyr)
df %>% group_by(ID) %>% summarise(Smoker=first(Smoker), Asthma=max(Asthma, na.rm = TRUE))
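The same "first row plus per-ID maximum" idea can be sketched in base R. This is an illustrative sketch, not the answerer's code: it rebuilds the sample data with plain character columns instead of factors, and the names `first_rows` and `flags` are introduced here.

```r
# Sample data from the question (character columns instead of factors)
d <- data.frame(
  ID     = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  Smoker = c(0L, 0L, 0L, 1L, 1L, 1L, 1L, 2L, 2L, 2L),
  Asthma = c(0L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 1L, 1L),
  Age    = c(65L, 65L, 65L, 67L, 67L, 67L, 67L, 77L, 77L, 77L),
  Sex    = c("M", "M", "M", "M", "M", "M", "M", "F", "F", "F"),
  COPD   = c(0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L),
  Event_Date = c("12-2009", "21-2009", "23-2009", "19-2010", "21-2010",
                 "01-2011", "02-2011", "09-2015", "10-2015", "10-2015"),
  stringsAsFactors = FALSE
)

# Keep the first row per ID (first Event_Date, Age, Sex, Smoker)
first_rows <- d[!duplicated(d$ID), ]

# Per-ID maximum of the 0/1 flags: 1 if the condition ever occurs
flags <- aggregate(cbind(Asthma, COPD) ~ ID, data = d, FUN = max)

# Overwrite the first-row flags with the "ever occurred" values
idx <- match(first_rows$ID, flags$ID)
first_rows$Asthma <- flags$Asthma[idx]
first_rows$COPD   <- flags$COPD[idx]

print(first_rows)
```

Taking the maximum works here only because the flags are coded 0/1; a differently coded status column would need its own summary rule.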

Subsetting in R, for and if loop

I have a dataframe that looks like this:
ID Team
11 1
22 2
45 4
45 2
79 3
79 4
100 2
123 1
167 3
167 1
I have to subset only those rows which ARE duplicated until the end of the data frame is reached. How can it be done?
If you meant to subset the rows that have duplicated IDs:
dat <- structure(list(ID = c(11L, 22L, 45L, 45L, 79L, 79L, 100L, 123L,
167L, 167L), Team = c(1L, 2L, 4L, 2L, 3L, 4L, 2L, 1L, 3L, 1L)), .Names = c("ID",
"Team"), class = "data.frame", row.names = c(NA, -10L))
dat[duplicated(dat$ID) | duplicated(dat$ID, fromLast = TRUE), ]
# ID Team
# 3 45 4
# 4 45 2
# 5 79 3
# 6 79 4
# 9 167 3
# 10 167 1
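An equivalent way to express the same selection, which some readers find easier to follow, is to collect the IDs that occur more than once and keep every row whose ID is in that set. A minimal sketch with the sample data rebuilt via `data.frame()`; the names `dup_ids` and `dups` are introduced here for illustration.

```r
# Sample data from the question
dat <- data.frame(
  ID   = c(11L, 22L, 45L, 45L, 79L, 79L, 100L, 123L, 167L, 167L),
  Team = c(1L, 2L, 4L, 2L, 3L, 4L, 2L, 1L, 3L, 1L)
)

# IDs that appear at least twice
dup_ids <- unique(dat$ID[duplicated(dat$ID)])

# Keep every row (first and later occurrences) with one of those IDs
dups <- dat[dat$ID %in% dup_ids, ]
print(dups)
```

Both versions return the same rows; this one makes the intermediate set of repeated IDs available for inspection.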
