Adding group column to data frame [duplicate] - r

This question already has an answer here:
Compute the minimum of a pair of vectors
(1 answer)
Closed 7 years ago.
Say I have the following data frame:
dx=data.frame(id=letters[1:4], count=1:4)
# id count
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
And I would like to programmatically add a column that holds count whenever count<3 and 3 otherwise, so that I get the following:
# id count group
# 1 a 1 1
# 2 b 2 2
# 3 c 3 3
# 4 d 4 3
I thought to use
dx$group=if(dx$count<3){dx$count}else{3}
but if/else doesn't work element-wise on vectors. How can I do it?

In this particular case you can just use pmin, which takes the element-wise minimum:
dx$group <- pmin(dx$count, 3)
In general your if/else construction does not work on vectors, because if expects a single TRUE or FALSE, but you can use the function ifelse. It takes three arguments: first the condition, then the result where the condition is met, and finally the result where it is not. For your example you would write the following:
dx$group <- ifelse(dx$count < 3, dx$count, 3)
Note that in your example the pmin solution is better; I mention ifelse only for completeness.
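To make the comparison concrete, here is a minimal sketch that runs both approaches on the question's dx. The column names group_pmin and group_ifelse are only illustrative, chosen so the two results can sit side by side:

```r
# The data frame from the question
dx <- data.frame(id = letters[1:4], count = 1:4)

# pmin() takes the element-wise minimum; the scalar 3 is recycled
dx$group_pmin <- pmin(dx$count, 3)

# ifelse() tests the condition once per element
dx$group_ifelse <- ifelse(dx$count < 3, dx$count, 3)

print(dx)

# Both columns should agree
print(all(dx$group_pmin == dx$group_ifelse))
```

Both columns cap count at 3, so either one gives the group column asked for.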

Related

Sum of data in a column based on categorical condition from another column [duplicate]

This question already has answers here:
Idiomatic R code for partitioning a vector by an index and performing an operation on that partition
(3 answers)
Closed 9 years ago.
Suppose I have a data frame like this:
set.seed(123)
df <- as.data.frame(cbind(y<-sample(c("A","B","C"),10,T), X<-sample(c(1,2,3),10,T)))
df <- df[order(df$V1),]
Is there a simple function to sum (or apply any FUN to) V2 by V1 and add the result to df as a new column, such that:
df$sum <- c(6,6,8,8,8,8,6,6,6,6)
df
I could write a function for that, but I have to do this frequently, so it would be better to know the simplest way to do it.
I agree with @mnel at least on his first point. I didn't see ave demonstrated in the answers he cited and I think it's the "simplest" base-R method. Using that data.frame(cbind( ...)) construction should be outlawed, because cbind first coerces both columns to a single type (here character), and teachers who demonstrate it should be stripped of their credentials.
set.seed(123)
df <- data.frame(y = sample(c("A", "B", "C"), 10, TRUE),
                 X = sample(c(1, 2, 3), 10, TRUE))
df <- df[order(df$y), ]  # this step is not necessary for success
df
df$sum <- ave(df$X, df$y, FUN=sum)
df
y X sum
1 A 3 6
6 A 3 6
3 B 3 8
7 B 1 8
9 B 1 8
10 B 3 8
2 C 2 6
4 C 2 6
5 C 1 6
8 C 1 6
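Since the question asks about "any FUN", here is a small sketch of ave with a few different functions. The data frame is hand-built (not the sampled one above) so that the results do not depend on the random seed or the R version; the column names grp_sum, grp_mean and grp_n are only illustrative:

```r
# Hand-built example data (an assumption, not the sampled data above)
df <- data.frame(y = c("A", "A", "B", "B", "B", "C"),
                 X = c(3, 3, 1, 2, 4, 5))

# ave() applies FUN within each group of y and returns a vector
# as long as X, so it can be assigned directly as a new column
df$grp_sum  <- ave(df$X, df$y, FUN = sum)
df$grp_mean <- ave(df$X, df$y, FUN = mean)
df$grp_n    <- ave(df$X, df$y, FUN = length)

print(df)
```

Note that FUN has to be passed by name, because ave treats every unnamed argument after the first as a grouping variable.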

Count certain words in dataframe/matrix in R [duplicate]

This question already has answers here:
Extracting a series of integers using a loop
(4 answers)
Closed 5 years ago.
I have this data, and what I want to do is count the occurrences (frequencies) of ONE, TWO and THREE in each column,
e.g. 2 ONEs in column A, 2 TWOs in column B, 1 ONE in column C, etc.
What function can I use to count certain words in R?
And how can I make a histogram out of these counts?
ABC <-read.csv("c:/Data/dataset.csv")
A B C
1 TWO ONE THREE
2 ONE ONE TWO
3 THREE TWO THREE
4 ONE TWO ONE
5 TWO THREE TWO
We can use mtabulate to get the count of unique elements in the dataset by each column
library(qdapTools)
t(mtabulate(ABC))
# A B C
#ONE 2 2 1
#THREE 1 1 2
#TWO 2 2 2
Or we use table, after unlisting the dataset and replicating the names of 'ABC'. Note that here we are calling the table only once.
tbl <- table(unlist(ABC),names(ABC)[col(ABC)])
tbl
# A B C
# ONE 2 2 1
# THREE 1 1 2
# TWO 2 2 2
A slightly faster option would be to use vapply with tabulate
vapply(ABC, function(x) tabulate(factor(x)), numeric(3))
If we need a barplot
barplot(tbl, beside=TRUE, legend=TRUE)
df <- data.frame(A = c('TWO', 'ONE', 'THREE', 'ONE', 'TWO'),
                 B = c('ONE', 'ONE', 'TWO', 'TWO', 'THREE'),
                 C = c('THREE', 'TWO', 'THREE', 'ONE', 'TWO'),
                 stringsAsFactors = FALSE)
sapply(df, table)
## A B C
## ONE 2 2 1
## THREE 1 1 2
## TWO 2 2 2
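One caveat with sapply(df, table): it only simplifies to a matrix when every column contains the same set of words. A sketch of one way around that, assuming we want a zero where a word is missing from a column, is to convert each column to a factor with shared levels first. The data frame and the names lev and counts below are made up for illustration:

```r
# Illustrative data in which column A and B lack "THREE" and C lacks "ONE"
df <- data.frame(A = c("TWO", "ONE", "ONE"),
                 B = c("ONE", "ONE", "TWO"),
                 C = c("THREE", "TWO", "THREE"),
                 stringsAsFactors = FALSE)

# All words that occur anywhere in the data frame
lev <- sort(unique(unlist(df)))

# Tabulate every column against the same levels, so absent words count as 0
counts <- sapply(df, function(x) table(factor(x, levels = lev)))

print(counts)
```

The resulting matrix has one row per word and one column per variable, so it can be handed to barplot(counts, beside=TRUE) in the same way as tbl above.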

R: Aggregate and create columns based on counts [duplicate]

This question already has answers here:
Frequency counts in R [duplicate]
(2 answers)
Closed 7 years ago.
I'm sure this question has been asked before, but I can't seem to find an answer anywhere, so I apologize if this is a duplicate.
I'm looking for R code that allows me to aggregate a variable in R, but while doing so creates new columns that count instances of levels of a factor.
For example, let's say I have the data below:
Week Var1
1 a
1 b
1 a
1 b
1 b
2 c
2 c
2 a
2 b
2 c
3 b
3 a
3 b
3 a
First, I want to aggregate by week. I'm sure this can be done with group_by in dplyr. I then need to create a new column for each level that appears in Var1. Finally, I need counts of each level of Var1 within each week. Note that I can probably figure out a way to do this manually, but I'm looking for an automated solution as I will have thousands of unique values in Var1. The result would be something like this:
Week a b c
1 2 3 0
2 1 1 3
3 2 2 0
I think from the way you worded your question, you've been looking for the wrong thing/something too complicated. It's a simple data-reshaping problem, and as such can be solved with reshape2:
library(reshape2)
# create a wide data frame (from long)
res <- dcast(data, Week ~ Var1, value.var = "Var1",
             fun.aggregate = length)
res
#   Week a b c
# 1    1 2 3 0
# 2    2 1 1 3
# 3    3 2 2 0
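If installing reshape2 is not an option, the same wide table can be sketched in base R with table. This is a different technique from the dcast call above, not part of that answer; the data frame below re-creates the question's data, and the names tab, wide and res are only illustrative:

```r
# The data from the question, typed in by hand
data <- data.frame(
  Week = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  Var1 = c("a", "b", "a", "b", "b", "c", "c", "a", "b", "c",
           "b", "a", "b", "a")
)

# Cross-tabulate: one row per Week, one column per level of Var1
tab <- table(data$Week, data$Var1)

# Turn the contingency table into a regular wide data frame
wide <- as.data.frame.matrix(tab)
res <- cbind(Week = as.numeric(rownames(wide)), wide)
rownames(res) <- NULL

print(res)
```

New levels of Var1 become new columns automatically, which is the "automated" part the question asks for.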

Return df with a columns values that occur more than once [duplicate]

This question already has answers here:
Subset data frame based on number of rows per group
(4 answers)
Closed 5 years ago.
I have a data frame df, and I am trying to subset all rows whose value in column B occurs more than once in the dataset.
I tried using table to do it, but am having trouble subsetting from the table:
t<-table(df$B)
Then I try subsetting it using:
subset(df, table(df$B)>1)
And I get the error
"Error in x[subset & !is.na(subset)] :
object of type 'closure' is not subsettable"
How can I subset my data frame using table counts?
Here is a dplyr solution (using MrFlick's data.frame dd):
library(dplyr)
newd <- dd %>% group_by(b) %>% filter(n() > 1)
newd
# a b
# 1 1 1
# 2 2 1
# 3 5 4
# 4 6 4
# 5 7 4
# 6 9 6
# 7 10 6
Or, using data.table
setDT(dd)[,if(.N >1) .SD,by=b]
Or using base R
dd[dd$b %in% unique(dd$b[duplicated(dd$b)]),]
May I suggest an alternative, faster way to do this with data.table?
require(data.table) ## 1.9.2
setDT(df)[, .N, by=B][N > 1L]$B
Or you can combine .N with .I (another special variable, see ?data.table), which gives the corresponding row numbers in df, as follows:
setDT(df)[df[, .I[.N > 1L], by=B]$V1]
Or have a look at @mnel's answer for another variation (using yet another special variable, .SD).
Using table() isn't the best here, because you then have to join its counts back to the original rows of the data.frame. The ave function makes it easier to calculate row-level values for different groups. For example:
dd<-data.frame(
a=1:10,
b=c(1,1,2,3,4,4,4,5,6, 6)
)
dd[with(dd, ave(b,b,FUN=length))>1, ]
#subset(dd, ave(b,b,FUN=length)>1) #same thing
a b
1 1 1
2 2 1
5 5 4
6 6 4
7 7 4
9 9 6
10 10 6
Here, for each level of b, it counts the length of b, which is really just the number of b's and returns that back to the appropriate row for each value. Then we use that to subset.
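For completeness, the table counts from the question can still be used if they are matched back to the rows by name. The sketch below uses the dd data frame from above and compares that route with the ave one; the names counts, repeated, via_table and via_ave are only illustrative:

```r
dd <- data.frame(
  a = 1:10,
  b = c(1, 1, 2, 3, 4, 4, 4, 5, 6, 6)
)

# table() gives one count per distinct value of b, labelled by name
counts <- table(dd$b)

# Names of the values that occur more than once
repeated <- names(counts)[counts > 1]

# Match each row's b (as text) against those names
via_table <- dd[as.character(dd$b) %in% repeated, ]

# The ave() route: a per-row group size, usable directly as a filter
via_ave <- dd[ave(dd$b, dd$b, FUN = length) > 1, ]

print(via_table)
print(identical(via_table, via_ave))
```

The table route needs the extra name-matching step, which is the rejoining cost mentioned above; ave skips it because its result already has one entry per row.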
