Trying to create a count column with a specific column [duplicate] - r

I have a data frame df that looks like the following where the gender column is a factor with two levels:
gender age
m 18
f 14
m 18
m 18
m 15
f 15
I would like to add a new column called count that simply reflects the number of times that gender level appears in the data frame. So, ultimately, the data frame would look like:
gender age count
m 18 4
f 14 2
m 18 4
m 18 4
m 15 4
f 15 2
I know that I can do table(df$gender) that gives me the number of times the factor appears, but I do not know how to translate those results into a new column in df. I'm wondering how can I use the table function--or is there a better way to achieve my new column?

You may try ave:
# first, convert 'gender' to class character
df$gender <- as.character(df$gender)
df$count <- as.numeric(ave(df$gender, df$gender, FUN = length))
df
# gender age count
# 1 m 18 4
# 2 f 14 2
# 3 m 18 4
# 4 m 18 4
# 5 m 15 4
# 6 f 15 2
Update following @flodel's comment - thanks!
df <- transform(df, count = ave(age, gender, FUN = length))
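A minimal, self-contained sketch of the ave idea, assuming the six-row data frame from the question. Passing seq_along() as the first argument is my own variation: it avoids the character conversion and does not depend on any other column such as age.

```r
# Rebuild the example data from the question
df <- data.frame(
  gender = factor(c("m", "f", "m", "m", "m", "f")),
  age    = c(18, 14, 18, 18, 15, 15)
)

# ave() applies FUN within each gender group and returns a vector
# aligned with the original row order
df$count <- ave(seq_along(df$gender), df$gender, FUN = length)
df
```

Because the result keeps the original row order, it can be assigned straight back to the data frame.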

Since gender is a factor, you can use it to index the table output:
dat$count <- table(dat$gender)[dat$gender]
Or to avoid repeating dat$ too many times:
dat <- transform(dat, count = table(gender)[gender])
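As a hedged variation on the table lookup above: indexing by as.character(gender) looks the counts up by name, so it should also work when the column is character rather than factor, and as.vector() drops the table class and names so that count ends up as a plain integer column. The data frame below is rebuilt from the question.

```r
dat <- data.frame(
  gender = factor(c("m", "f", "m", "m", "m", "f")),
  age    = c(18, 14, 18, 18, 15, 15)
)

# Named lookup: one count per level, then one value per row
counts <- table(dat$gender)
dat$count <- as.vector(counts[as.character(dat$gender)])
dat
```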

Using plyr:
library(plyr)
ddply(dat,.(gender),transform,count=length(age))
gender age count
1 f 14 2
2 f 15 2
3 m 18 4
4 m 18 4
5 m 18 4
6 m 15 4

And a data.table version for good measure.
library(data.table)
df <- as.data.table(df)
Once you have the data.table, it's then a simple operation:
df[,count := .N,by="gender"]
df
# gender age count
#1: m 18 4
#2: f 14 2
#3: m 18 4
#4: m 18 4
#5: m 15 4
#6: f 15 2

You can set the counts and then do something like this, but that's not exactly elegant.
m.cnt <- length(which(df$gender == "m"))
f.cnt <- length(which(df$gender == "f"))
df$count <- NA
df$count[which(df$gender == "m")] <- m.cnt
df$count[which(df$gender == "f")] <- f.cnt
Alternatively you can use plyr, but this results in recalculating the same thing over and over again, which might not be worth it since the factor only has 2 levels.

Related

mutate string into numeric, ignore alphabetical order of factor

I am trying to recode factor levels into numbers using mutate function, but I want to ignore alphabetical order the factors are appearing in. There are multiple same values of factor levels and I want them to be assigned the number in the new column of the row in which they first appeared in the dataframe.
Example:
library(stringi)
library(dplyr) # provides %>% and mutate()
set.seed(234)
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
data2<-data %>% mutate(num=(as.numeric(factor(data))))
data2
Expected outcome:
dat<-data2[,-2]
order<-c(1,2,3,2,4,5)
expected_result<-cbind.data.frame(head(dat), order)
expected_result
I think you can just create a new factor and set the levels as unique values of data2$data in your example:
new_fac <- factor(data2$data, levels = unique(data2$data))
The numeric values can be obtained:
new_order <- as.numeric(new_fac)
And this is what your final result would look like:
head(data.frame(new_fac, new_order))
new_fac new_order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
Or in your example with dplyr, you can do:
data %>%
mutate(num = as.numeric(factor(data, levels = unique(data))))
You could accomplish this with a helper table that contains the row number of the first time a string appears in your table. I.e.
library(stringi)
library(tidyverse)
# generate data
data<-stri_rand_strings(20,1)
data<-as.data.frame(data)
Create helper table:
factorlevels <- data %>% unique() %>% mutate(order = row_number())
... and inner join to data
data %>% inner_join(factorlevels)
Output:
> data %>% inner_join(factorlevels)
Joining, by = "data"
data order
1 k 1
2 m 2
3 1 3
4 m 2
5 4 4
6 d 5
7 v 6
8 i 7
9 v 6
10 H 8
11 Y 9
12 X 10
13 a 11
14 a 11
15 0 12
16 R 13
17 J 14
18 j 15
19 8 16
20 s 17
I am sure that there is a one-liner approach to this, but I could not figure it out right away.
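One candidate for such a one-liner is base R's match(), sketched here on the first six values from the output above (the vector x is just an illustration). match(x, unique(x)) returns, for each value, its position among the unique values in order of first appearance.

```r
# First six strings from the example output
x <- c("k", "m", "1", "m", "4", "d")

# Position of each value among the unique values, in order of first appearance
first_seen <- match(x, unique(x))
first_seen
```

Inside a dplyr pipeline the same expression should work as mutate(num = match(data, unique(data))).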

Is there a better way to combine these dataframes and match values?

I have two dataframes that I have combined using left_join()
data1 can be simplified to something like...
Date <- as.Date(c('2011-7-26','2011-7-26','2010-11-1','2010-11-1','2009-5-10','2009-5-10','2008-3-25','2008-3-25','2007-3-14','2007-3-14'))
Location <- c("A","B","A","B","A","B","A","B","A","B")
Result <- sample(1:30, 10)
data1 <- data.frame(Date,Location,Result)
data2 can be simplified to something like...
Date <- as.Date(c('2011-7-26','2009-5-10','2007-3-14'))
Flow_A <- c(6,2,9)
Flow_B <- c(10,11,25)
data2 <- data.frame(Date,Flow_A,Flow_B)
After combining by date, I have this
data3 <- left_join(data2, data1, by = "Date")
Date Flow_A Flow_B Location Result
1 2011-07-26 6 10 A 11
2 2011-07-26 6 10 B 17
3 2009-05-10 2 11 A 6
4 2009-05-10 2 11 B 22
5 2007-03-14 9 25 A 20
6 2007-03-14 9 25 B 1
Each value in Result corresponds to a specific Location (A or B) and I want to attach the correct values for Flow (Flow_A or Flow_B) to that row according to location (i.e. combine columns Flow_A and Flow_B into one column 'Flow' with just the correct value). I have been able to do this using a combination of mutate(),ifelse(),grepl(),and very simple functions:
a <- data3$Flow_A
Choose_A <- function(a) {
return(a)}
b <- data3$Flow_B
Choose_B <- function(b) {
return(b)}
data3 <- mutate(data3, Flow =
ifelse(grepl("A", Location), Choose_A(a),
ifelse(grepl("B", Location), Choose_B(b),NA)))
Date Flow_A Flow_B Location Result Flow
1 2011-07-26 6 10 A 11 6
2 2011-07-26 6 10 B 17 10
3 2009-05-10 2 11 A 6 2
4 2009-05-10 2 11 B 22 11
5 2007-03-14 9 25 A 20 9
6 2007-03-14 9 25 B 1 25
But this seems rather clunky. Is there a better (more efficient) way to achieve this?
Please excuse my ignorance - I'm still learning!
Thanks!
You can use match to build a vector holding, for each row, the number of the column to extract ('Flow_A' or 'Flow_B', depending on the Location column). Binding it to the row numbers with cbind gives a two-column matrix that picks out one value per row when used as an index.
column_num <- match(paste0('Flow_', data3$Location), names(data3))
row_num <- seq_len(nrow(data3))
data3$Flow <- data3[cbind(row_num, column_num)]
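One caveat, as far as I can tell: indexing the whole data frame with a matrix goes through as.matrix(), and because data3 also holds a Date and a Location column, the extracted values may come back as character strings rather than numbers. A sketch that restricts the lookup to the two Flow columns, so the result stays numeric (the toy data copies the first four rows of the printed data3; the helper name flows is my own):

```r
data3 <- data.frame(
  Date     = as.Date(c("2011-07-26", "2011-07-26", "2009-05-10", "2009-05-10")),
  Flow_A   = c(6, 6, 2, 2),
  Flow_B   = c(10, 10, 11, 11),
  Location = c("A", "B", "A", "B"),
  Result   = c(11, 17, 6, 22)
)

# Numeric matrix with only the candidate columns
flows <- as.matrix(data3[c("Flow_A", "Flow_B")])

# For each row, the column of `flows` that matches its Location
column_num <- match(paste0("Flow_", data3$Location), colnames(flows))
row_num <- seq_len(nrow(data3))

# Matrix indexing: one (row, column) pair per row of data3
data3$Flow <- flows[cbind(row_num, column_num)]
data3
```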

Loop or apply for sum of rows based on multiple conditions in R dataframe

I've hacked together a quick solution to my problem, but I have a feeling it's quite obtuse. Moreover, it uses for loops, which from what I've gathered, should be avoided at all costs in R. Any and all advice to tidy up this code is appreciated. I'm still pretty new to R, but I fear I'm making a relatively simple problem much too convoluted.
I have a dataset as follows:
id count group
2 6 A
2 8 A
2 6 A
8 5 A
8 6 A
8 3 A
10 6 B
10 6 B
10 6 B
11 5 B
11 6 B
11 7 B
16 6 C
16 2 C
16 0 C
18 6 C
18 1 C
18 6 C
I would like to create a new dataframe that contains, for each unique ID, the sum of the first two counts of that ID (e.g. 6+8=14 for ID 2). I also want to attach the correct group identifier.
In general you might need to do this when you measure a value on consecutive days for different subjects and treatments, and you want to compute the total for each subject for the first x days of measurement.
This is what I've come up with:
id <- c(rep(c(2,8,10,11,16,18),each=3))
count <- c(6,8,6,5,6,3,6,6,6,5,6,7,6,2,0,6,1,6)
group <- c(rep(c("A","B","C"),each=6))
df <- data.frame(id,count,group)
newid<-c()
newcount<-c()
newgroup<-c()
for (i in 1:length(unique(df$"id"))) {
newid[i] <- unique(df$"id")[i]
newcount[i]<-sum(df[df$"id"==unique(df$"id")[i],2][1:2])
newgroup[i] <- as.character(df$"group"[df$"id"==newid[i]][1])
}
newdf<-data.frame(newid,newcount,newgroup)
Some possible improvements/alternatives I'm not sure about:
For loops vs apply functions
Can I create a dataframe directly inside a for loop or should I stick to creating vectors I can later assign to a dataframe?
More consistent approaches to accessing/subsetting vectors/columns ($, [], [[]], subset?)
You could do this using data.table
library(data.table)
setDT(df)[, list(newcount = sum(count[1:2])), by = .(id, group)]
# id group newcount
#1: 2 A 14
#2: 8 A 11
#3: 10 B 12
#4: 11 B 11
#5: 16 C 8
#6: 18 C 7
You could use dplyr:
library(dplyr)
df %>% group_by(id,group) %>% slice(1:2) %>% summarise(newcount=sum(count))
The pipe syntax makes it easy to read: group your data by id and group, take the first two rows for each group, then sum the counts.
You can try to use a self-defined function in aggregate
sum1sttwo<-function (x){
return(x[1]+x[2])
}
aggregate(count~id+group, data=df,sum1sttwo)
and the output is:
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7
04/2015 edit: dplyr and data.table are definitely better choices when your data set is large, since base R operations on data frames can be slow at that scale. However, if you just need to aggregate a very simple/small data set, the aggregate function in base R serves the purpose.
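A small variation on the aggregate approach, offered as a sketch rather than a recommendation: x[1] + x[2] (like sum(count[1:2])) returns NA for an id that has only one row, whereas sum(head(x, 2)) adds up whatever rows are available. The data below is the data frame from the question.

```r
id    <- rep(c(2, 8, 10, 11, 16, 18), each = 3)
count <- c(6, 8, 6, 5, 6, 3, 6, 6, 6, 5, 6, 7, 6, 2, 0, 6, 1, 6)
group <- rep(c("A", "B", "C"), each = 6)
df <- data.frame(id, count, group)

# Sum at most the first two counts within each id/group combination
res <- aggregate(count ~ id + group, data = df,
                 FUN = function(x) sum(head(x, 2)))
res
```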
library(plyr)
# Keep the first 2 counts for each id and group (returned as columns V1 and V2)
df2 <- ddply(df, c("id","group"), function (x) x$count[1:2])
# Add the two counts for each id and group
df3 <- ddply(df2, c("id", "group"), summarize, count=V1+V2)
df3
id group count
1 2 A 14
2 8 A 11
3 10 B 12
4 11 B 11
5 16 C 8
6 18 C 7

How to count how many values per level in a given factor?

I have a data.frame mydf with about 2500 rows. These rows correspond to 69 classes of objects in column 1 (mydf$V1), and I want to count how many rows per object class I have.
I can get a factor of these classes with:
objectclasses = unique(factor(mydf$V1, exclude="1"));
What's the terse R way to count the rows per object class? If this were any other language I'd be traversing an array with a loop and keeping count but I'm new to R programming and am trying to take advantage of R's vectorised operations.
Or using the dplyr library:
library(dplyr)
set.seed(1)
dat <- data.frame(ID = sample(letters,100,rep=TRUE))
dat %>%
group_by(ID) %>%
summarise(no_rows = length(ID))
Note the use of %>%, which is similar to the use of pipes in bash. Effectively, the code above pipes dat into group_by, and the result of that operation is piped into summarise.
The result is:
Source: local data frame [26 x 2]
ID no_rows
1 a 2
2 b 3
3 c 3
4 d 3
5 e 2
6 f 4
7 g 6
8 h 1
9 i 6
10 j 5
11 k 6
12 l 4
13 m 7
14 n 2
15 o 2
16 p 2
17 q 5
18 r 4
19 s 5
20 t 3
21 u 8
22 v 4
23 w 5
24 x 4
25 y 3
26 z 1
See the dplyr introduction for some more context, and the documentation for details regarding the individual functions.
Here are 2 ways to do it:
set.seed(1)
tt <- sample(letters,100,rep=TRUE)
## using table
table(tt)
tt
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
## using tapply
tapply(tt,tt,length)
a b c d e f g h i j k l m n o p q r s t u v w x y z
2 3 3 3 2 4 6 1 6 5 6 4 7 2 2 2 5 4 5 3 8 4 5 4 3 1
Using plyr package:
library(plyr)
count(mydf$V1)
It returns the frequency of each value.
Using data.table
library(data.table)
setDT(dat)[, .N, keyby=ID] # using @Paul Hiemstra's `dat`
Or using dplyr 0.3
res <- count(dat, ID)
head(res)
#Source: local data frame [6 x 2]
# ID n
#1 a 2
#2 b 3
#3 c 3
#4 d 3
#5 e 2
#6 f 4
Or
dat %>%
group_by(ID) %>%
tally()
Or
dat %>%
group_by(ID) %>%
summarise(n=n())
We can use summary on factor column:
summary(myDF$factorColumn)
One more approach is to use the n() function, which counts the number of observations in each group:
library(dplyr)
library(magrittr)
data %>%
group_by(columnName) %>%
summarise(Count = n())
In case I just want to know how many unique factor levels exist in the data, I use:
length(unique(df$factorcolumn))
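A short sketch of one thing to keep in mind here, using a made-up factor f: length(unique()) counts the levels that actually occur in the data, while nlevels() counts the declared levels, including any that are unused. The two differ after subsetting unless droplevels() is applied.

```r
# Factor with a declared level "c" that never occurs
f <- factor(c("a", "b", "a"), levels = c("a", "b", "c"))

n_present  <- length(unique(f))      # levels that occur in the data
n_declared <- nlevels(f)             # all declared levels
counts     <- table(f)               # unused levels appear with a count of 0
n_dropped  <- nlevels(droplevels(f)) # declared levels after dropping unused ones
```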
Use the package plyr with lapply to get frequencies for every value (level) and every variable (factor) in your data frame.
library(plyr)
lapply(df, count)
This is an old post, but you can do this with base R and no data frames/data tables:
sapply(levels(yTrain), function(sLevel) sum(yTrain == sLevel))

adding row/column total data when aggregating data using plyr and reshape2 package in R

I create aggregate tables most of the time during my work using the flow below:
library(plyr)     # ddply()
library(reshape2) # dcast()
set.seed(1)
temp.df <- data.frame(var1=sample(letters[1:5],100,replace=TRUE),
var2=sample(11:15,100,replace=TRUE))
temp.output <- ddply(temp.df,
c("var1","var2"),
function(df) {
data.frame(count=nrow(df))
})
temp.output.all <- ddply(temp.df,
c("var2"),
function(df) {
data.frame(var1="all",
count=nrow(df))
})
temp.output <- rbind(temp.output,temp.output.all)
temp.output[,"var1"] <- factor(temp.output[,"var1"],levels=c(letters[1:5],"all"))
temp.output <- dcast(temp.output,formula=var2~var1,value.var="count",fill=0)
I am starting to feel silly writing this "boilerplate" code every time I create a new aggregate table just to include the row/column totals. Is there some way to skip it?
Looking at your desired output (now that I'm in front of a computer), perhaps you should look at the margins argument of dcast:
library(reshape2)
dcast(temp.df, var2 ~ var1, value.var = "var2",
fun.aggregate=length, margins = "var1")
# var2 a b c d e (all)
# 1 11 3 1 6 4 2 16
# 2 12 1 3 6 5 5 20
# 3 13 5 9 3 6 1 24
# 4 14 4 7 3 6 2 22
# 5 15 0 5 1 5 7 18
Also look into the addmargins function in base R.
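A minimal sketch of the addmargins route, on a small made-up data frame rather than the random one above. It starts from a contingency table built with table(); by default the added margins are labelled "Sum".

```r
# Small illustrative data set (values are made up)
temp.df <- data.frame(
  var1 = c("a", "b", "a", "c", "b", "a"),
  var2 = c(11, 11, 12, 12, 12, 11)
)

# Counts of var2 (rows) by var1 (columns)
tab <- table(var2 = temp.df$var2, var1 = temp.df$var1)

# Add both a totals row and a totals column
full <- addmargins(tab)

# Add only a column of row totals, similar to margins = "var1" in dcast
row_totals <- addmargins(tab, margin = 2)
full
```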
