I have a dataframe that looks like this:
ID Team
11 1
22 2
45 4
45 2
79 3
79 4
100 2
123 1
167 3
167 1
I need to subset only those rows whose ID is duplicated, checking the whole data frame from start to end. How can this be done?
If you meant to subset the rows that have duplicated IDs:
dat <- structure(list(ID = c(11L, 22L, 45L, 45L, 79L, 79L, 100L, 123L,
167L, 167L), Team = c(1L, 2L, 4L, 2L, 3L, 4L, 2L, 1L, 3L, 1L)), .Names = c("ID",
"Team"), class = "data.frame", row.names = c(NA, -10L))
# keep every row whose ID occurs more than once (scan forwards and backwards)
dat[duplicated(dat$ID) | duplicated(dat$ID, fromLast = TRUE), ]
# ID Team
# 3 45 4
# 4 45 2
# 5 79 3
# 6 79 4
# 9 167 3
# 10 167 1
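The same selection can be written another way. This is a minimal base R sketch that first collects the IDs occurring more than once and then keeps every row whose ID is in that set. The names dup_ids and res are illustrative and not part of the original answer.

```r
# Same data as above, rebuilt so the example is self-contained
dat <- data.frame(
  ID   = c(11L, 22L, 45L, 45L, 79L, 79L, 100L, 123L, 167L, 167L),
  Team = c(1L, 2L, 4L, 2L, 3L, 4L, 2L, 1L, 3L, 1L)
)

# IDs that appear at least twice
dup_ids <- unique(dat$ID[duplicated(dat$ID)])

# Keep all rows (first and later occurrences) for those IDs
res <- dat[dat$ID %in% dup_ids, ]
res
```

This keeps the original row names (3, 4, 5, 6, 9, 10), so the result should match the output shown above.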
I have a dataset of regional patents. I want to count how many Appln_id values have more than one Person_id and how many have only one Person_id.
Appln_id 3 3 3 10 10 10 10 2 4 4
Person_id 23 22 24 49 50 55 51 101 122 104
Here Appln_id 3 has three different Person_id values (23, 22, 24) and Appln_id 2 has only one (101). So I want to count how many Appln_id values have more than one Person_id and how many have only one.
Count the number of unique persons for each Appln_id:
library(dplyr)
result <- df %>% group_by(Appln_id) %>% summarise(n = n_distinct(Person_id))
result
# Appln_id n
#* <int> <int>
#1 2 1
#2 3 3
#3 4 2
#4 10 4
Now you can count how many of them have only 1 Person_id and how many of them have more than that.
sum(result$n == 1)
#[1] 1
sum(result$n > 1)
#[1] 3
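If dplyr is not available, the same two counts can be obtained with base R. The sketch below uses tapply for the per-group distinct count and table for the final tally. The names n_per_appln and counts are illustrative.

```r
# Same data as in the question, rebuilt so the example is self-contained
df <- data.frame(
  Appln_id  = c(3L, 3L, 3L, 10L, 10L, 10L, 10L, 2L, 4L, 4L),
  Person_id = c(23L, 22L, 24L, 49L, 50L, 55L, 51L, 101L, 122L, 104L)
)

# Number of distinct Person_id values per Appln_id
n_per_appln <- tapply(df$Person_id, df$Appln_id, function(p) length(unique(p)))

# Tally applications with a single person (FALSE) versus several (TRUE)
counts <- table(multiple = n_per_appln > 1)
counts
```

Here counts[["FALSE"]] is the number of Appln_id values with one Person_id, and counts[["TRUE"]] is the number with more than one.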
data
df <- structure(list(Appln_id = c(3L, 3L, 3L, 10L, 10L, 10L, 10L, 2L,
4L, 4L), Person_id = c(23L, 22L, 24L, 49L, 50L, 55L, 51L, 101L,
122L, 104L)), class = "data.frame", row.names = c(NA, -10L))
We can also use data.table to get the same per-Appln_id counts:
library(data.table)
setDT(df)[, .(n = uniqueN(Person_id)), by = Appln_id]
NUMBER WEIGHT DAILY-LANG RELIGION PROVINCE DISTRICT SUB_DISTRI
5 9.50 1167 1 11 01 010
6 9.50 1167 1 11 01 010
7 9.50 1167 1 11 01 010
8 10.30 4 2 33 071 220
9 10.10 6 1 61 8 170
This is a sample of the data. I need to find the number of DAILY-LANG speakers in each SUB_DISTRI.
If the columns WEIGHT, DAILY-LANG, RELIGION, PROVINCE, DISTRICT and SUB_DISTRI are unique for a speaker, you can use nrow and unique to get the number of speakers:
nrow(unique(x))
#[1] 3
To get DAILY-LANG per RELIGION, PROVINCE, DISTRICT and SUB_DISTRI you can use unique, split and interaction:
y <- unique(x)
split(y$DAILY.LANG,
interaction(y[c("RELIGION", "PROVINCE", "DISTRICT", "SUB_DISTRI")], drop=TRUE))
#$`1.11.1.10`
#[1] 1167
#
#$`1.61.8.170`
#[1] 6
#
#$`2.33.71.220`
#[1] 4
Or if SUB_DISTRI is already unique:
split(y$DAILY.LANG, y$SUB_DISTRI)
#$`10`
#[1] 1167
#
#$`170`
#[1] 6
#
#$`220`
#[1] 4
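If instead every row stands for one respondent (an assumption, since the question does not say so), the number of speakers of each language per sub-district can be counted with aggregate. This is only a sketch: the helper column n and the name speakers are illustrative, and if WEIGHT is a survey weight you may want to sum WEIGHT rather than count rows.

```r
# Columns needed for the count, rebuilt so the example is self-contained
x <- data.frame(
  WEIGHT     = c(9.5, 9.5, 9.5, 10.3, 10.1),
  DAILY.LANG = c(1167L, 1167L, 1167L, 4L, 6L),
  SUB_DISTRI = c(10L, 10L, 10L, 220L, 170L)
)

# Assumption: one row = one speaker, so each row contributes a count of 1
x$n <- 1L

# Number of rows per (sub-district, language) combination
speakers <- aggregate(n ~ SUB_DISTRI + DAILY.LANG, data = x, FUN = sum)
speakers
```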
Data:
x <- structure(list(WEIGHT = c(9.5, 9.5, 9.5, 10.3, 10.1), DAILY.LANG = c(1167L,
1167L, 1167L, 4L, 6L), RELIGION = c(1L, 1L, 1L, 2L, 1L), PROVINCE = c(11L,
11L, 11L, 33L, 61L), DISTRICT = c(1L, 1L, 1L, 71L, 8L), SUB_DISTRI = c(10L,
10L, 10L, 220L, 170L)), row.names = c(NA, -5L), class = "data.frame")
My dataset is named studentperformance and looks like this:
gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50
I want to count how many times the value 2 occurs in the gender column. I tried this code:
count(studentperformance$gender[1:10], vars = "2")
But the code throws an error. How can I achieve this?
As @user2974951 said, you can use base R for that:
sum(studentperformance$gender==2)
[1] 6
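This works because the comparison gender == 2 returns a logical vector, and sum treats TRUE as 1 and FALSE as 0. A minimal sketch on a plain vector holding the same ten gender values (the names gender, n_two and share_two are illustrative):

```r
# The gender column from the question as a plain vector
gender <- c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L)

# Logical vector: TRUE where the value is 2
is_two <- gender == 2

# sum() counts the TRUE values, mean() gives their share
n_two     <- sum(is_two)
share_two <- mean(is_two)

n_two      # 6
share_two  # 0.6
```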
You can also create a table for every level in gender:
table(studentperformance$gender,factor(studentperformance$gender))
1 2
1 4 0
2 0 6
Sample data:
studentperformance <- read.table(text = "gender race lunch math reading writing
2 2 2 72 72 74
2 3 2 69 90 88
2 2 2 90 95 93
1 1 1 47 57 44
1 3 2 76 78 75
2 2 2 71 83 78
2 2 2 88 95 92
1 2 1 40 43 39
1 4 1 64 64 67
2 2 1 38 60 50", header = TRUE)
You can build simple frequency tables without indexing or comparisons. Try count from dplyr, which returns a column gender holding the unique values of gender and a column n holding the count of each value (df here is the student performance data):
library(dplyr)
count(df, gender)
#### OUTPUT ####
# A tibble: 2 x 2
gender n
<int> <int>
1 1 4
2 2 6
You can do much the same thing with base R's table. The output is laid out a little differently: the unique values 1 and 2 become the headers, and the counts 4 and 6 appear in the row just beneath them:
table(df$gender)
#### OUTPUT ####
1 2
4 6
Consider also:
studentperformance <- transform(studentperformance,
count_by_gender = ave(studentperformance$gender,
studentperformance$gender,
FUN = length))
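To see what ave does here: it splits the first argument by the grouping vector, applies FUN to each group, and writes the result back in the original row order, so every row receives the size of its own group. A small sketch on a plain vector (the name group_size is illustrative):

```r
# The gender column from the question as a plain vector
gender <- c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L, 2L)

# For each element, the number of elements sharing its value
group_size <- ave(gender, gender, FUN = length)
group_size
```

Rows with gender 2 get 6 and rows with gender 1 get 4, which matches the count_by_gender column in the data below.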
Data:
structure(
list(
gender = c(2L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 1L,
2L),
race = c(2L, 3L, 2L, 1L, 3L, 2L, 2L, 2L, 4L, 2L),
lunch = c(2L,
2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L),
math = c(72L, 69L, 90L,
47L, 76L, 71L, 88L, 40L, 64L, 38L),
reading = c(72L, 90L, 95L,
57L, 78L, 83L, 95L, 43L, 64L, 60L),
writing = c(74L, 88L, 93L,
44L, 75L, 78L, 92L, 39L, 67L, 50L),
count_by_gender = c(6L, 6L,
6L, 4L, 4L, 6L, 6L, 4L, 4L, 6L)
),
class = "data.frame",
row.names = c(NA,-10L)
)
My dataset looks like this
Org_ID  Market_volume  Indicator_variable
1       100            1
1       200            0
1       300            0
2       50             1
2       500            1
3       400            0
3       200            0
3       300            0
3       100            0
I want to summarize it by Org_ID, calculating the share of market volume (market TRx) that comes from rows where the indicator variable is 0, as follows:
Org_ID % of 0's by market volume
1 83.3%
2 0%
3 100%
I tried subgroups but can't get this to work. Can anyone suggest some ways to do this?
With dplyr:
library(dplyr)
df %>%
group_by(Org_ID) %>%
summarize(sum_market_vol = sum(Market_volume*!Indicator_variable),
tot_market_vol = sum(Market_volume)) %>%
transmute(Org_ID, Perc_Market_Vol = 100*sum_market_vol/tot_market_vol)
Result:
# A tibble: 3 x 2
Org_ID Perc_Market_Vol
<int> <dbl>
1 1 83.33333
2 2 0.00000
3 3 100.00000
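The same percentage can be computed without dplyr. The sketch below is a base R version that sums the volume of the zero-indicator rows and the total volume per Org_ID with tapply, then divides. The names zero_vol, tot_vol and pct are illustrative.

```r
# Same data as in the question, rebuilt so the example is self-contained
df <- data.frame(
  Org_ID             = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
  Market_volume      = c(100L, 200L, 300L, 50L, 500L, 400L, 200L, 300L, 100L),
  Indicator_variable = c(1L, 0L, 0L, 1L, 1L, 0L, 0L, 0L, 0L)
)

# Volume from rows with indicator 0 (other rows contribute 0), per Org_ID
zero_vol <- tapply(df$Market_volume * (df$Indicator_variable == 0), df$Org_ID, sum)

# Total volume per Org_ID
tot_vol <- tapply(df$Market_volume, df$Org_ID, sum)

# Percentage of volume with indicator 0, rounded to one decimal
pct <- round(100 * zero_vol / tot_vol, 1)
pct
```

For Org_ID 1 this is 500 / 600, which gives the 83.3% asked for in the question.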
Data:
df = structure(list(Org_ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L),
Market_volume = c(100L, 200L, 300L, 50L, 500L, 400L, 200L,
300L, 100L), Indicator_variable = c(1L, 0L, 0L, 1L, 1L, 0L,
0L, 0L, 0L)), .Names = c("Org_ID", "Market_volume", "Indicator_variable"
), class = "data.frame", row.names = c(NA, -9L))
I have some data that look like this:
head(t)
sub trialnum block.x lat.x block.y lat.y diff
1 1 10 3 1355 5 1337 18
2 1 11 3 1324 5 1470 -146
3 1 12 3 1861 5 1690 171
4 1 13 3 3501 5 1473 2028
5 1 14 3 1566 5 1402 164
6 1 15 3 1380 5 1539 -159
What I would like to do is reformat the data in R such that the values of "trialnum" (there are 20 of them) are the new columns, "sub" is the row values, and each cell has the "diff" value. For example
trialnum1 trialnum2 trialnum3...
sub
1
2
3
.
.
.
Any help would be much appreciated. Although the answer is probably simple, I've been struggling with this problem for some time.
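With base R: transpose the diff column with the function t(), then build the desired column names. The data frame is also called t, but R still finds the function when t is called. Note that this gives a single row, so it fits the sample, which contains only one sub.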
df <- data.frame(t(t[, 7]))
# Using the trialnum column
colnames(df) <- paste0(colnames(t[2]), t[, 2])
# or just the number of rows
colnames(df) <- paste0(colnames(t[2]), 1:nrow(t))
Output (first with names built from the trialnum values, then with names built from the row numbers):
trialnum10 trialnum11 trialnum12 trialnum13 trialnum14 trialnum15
1 18 -146 171 2028 164 -159
trialnum1 trialnum2 trialnum3 trialnum4 trialnum5 trialnum6
1 18 -146 171 2028 164 -159
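When the data holds several subjects, a cross-tabulation gives one row per sub. The following is a hedged base R sketch using xtabs; it assumes each sub and trialnum pair occurs only once, because xtabs sums the values in each cell. The data frame is called trials here (an illustrative name) to avoid reusing the name t.

```r
# The three relevant columns from the question's sample
trials <- data.frame(
  sub      = c(1L, 1L, 1L, 1L, 1L, 1L),
  trialnum = 10:15,
  diff     = c(18L, -146L, 171L, 2028L, 164L, -159L)
)

# Rows = sub, columns = trialnum, cells = diff (summed per cell)
wide <- xtabs(diff ~ sub + trialnum, data = trials)

# Prefix the column names and convert the table to a data frame
colnames(wide) <- paste0("trialnum", colnames(wide))
wide_df <- as.data.frame.matrix(wide)
wide_df
```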
With dplyr and tidyr, first get rid of the columns you don't want, then spread trialnum and diff.
library(dplyr)
library(tidyr)
t %>% select(-block.x:-lat.y) %>% # get rid of extra columns so t will collapse
mutate(trialnum = paste0('trialnum', trialnum)) %>% # fix values for column names
spread(trialnum, diff) # spread columns
# sub trialnum10 trialnum11 trialnum12 trialnum13 trialnum14 trialnum15
# 1 1 18 -146 171 2028 164 -159
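In current versions of tidyr, spread is superseded by pivot_wider, which can also add the column prefix directly. This is a sketch under that assumption, with the data frame named trials (an illustrative name) rather than t:

```r
library(tidyr)

# The three relevant columns from the question's sample
trials <- data.frame(
  sub      = c(1L, 1L, 1L, 1L, 1L, 1L),
  trialnum = 10:15,
  diff     = c(18L, -146L, 171L, 2028L, 164L, -159L)
)

# One row per sub, one column per trialnum, cells filled with diff
wide <- pivot_wider(
  trials,
  id_cols      = sub,
  names_from   = trialnum,
  values_from  = diff,
  names_prefix = "trialnum"
)
wide
```

Because id_cols, names_from and values_from name the columns explicitly, the extra block and lat columns do not have to be dropped first.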
Data
t <- structure(list(sub = c(1L, 1L, 1L, 1L, 1L, 1L), trialnum = 10:15,
block.x = c(3L, 3L, 3L, 3L, 3L, 3L), lat.x = c(1355L, 1324L,
1861L, 3501L, 1566L, 1380L), block.y = c(5L, 5L, 5L, 5L,
5L, 5L), lat.y = c(1337L, 1470L, 1690L, 1473L, 1402L, 1539L
), diff = c(18L, -146L, 171L, 2028L, 164L, -159L)), .Names = c("sub",
"trialnum", "block.x", "lat.x", "block.y", "lat.y", "diff"), row.names = c(NA,
-6L), class = "data.frame")