length of column, split by group - r

Hi, I am using R to count the number of data points in a column, split by group, like so:
Type Value
---- -----
A 1
A 6
A 4
A 6
B 8
B 10
B 3
B 8
C 7
C 4
where I want to plot 3 bars: how many A's, how many B's, how many C's. The values in the Value column are not important.
How do I do this?
If my data were in different columns I could obviously use
sapply(list(col1, col2, col3), length)
but I don't want to transform my data.
Thanks

If the Value column doesn't matter, ggplot2 can do this directly: geom_bar() counts the rows for each x value by default.
library(ggplot2)
set.seed(9001)
df <- data.frame(Type = c(rep("A", 4), rep("B", 4), rep("C", 2)), Value = sample(1:20, 10))
df
## Type Value
## 1 A 5
## 2 A 19
## 3 A 4
## 4 A 12
## 5 B 1
## 6 B 18
## 7 B 13
## 8 B 17
## 9 C 8
## 10 C 14
ggplot(df) + geom_bar(aes(x = Type))
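If you'd rather not load ggplot2, a base R sketch along the same lines is to tabulate the Type column with table() and hand the counts to barplot(). The data frame below just retypes the example from the question.

```r
# Example data retyped from the question
df <- data.frame(
  Type  = c(rep("A", 4), rep("B", 4), rep("C", 2)),
  Value = c(1, 6, 4, 6, 8, 10, 3, 8, 7, 4)
)

# table() counts how many rows fall in each Type
counts <- table(df$Type)
counts

# One bar per Type, bar height = number of rows
barplot(counts, xlab = "Type", ylab = "Count")
```

No reshaping is needed here either, since table() works on the grouping column alone.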

Related

Expand data frame and add rowsums from another dataframe

I am trying to find a faster way of accomplishing the following code since my actual dataset is very large. I would like to get rid of the for loop altogether. I am trying to duplicate each row in xdf into a new data frame based on the number of columns in values. Then, next to each entry in the new dataset, show the row sums from column 1 in values up to the column j.
library(tibble)  # tibble() replaces the deprecated data_frame()
xdf <- tibble(
  x = c('a', 'b', 'c'),
  y = c(4, 5, 6)
)
values <- tibble(
  col_1 = c(5, 9, 1),
  col_2 = c(4, 7, 6),
  col_3 = c(1, 5, 2),
  col_4 = c(7, 8, 5)
)
# stack one copy of xdf per column of values, with cumulative row sums in z
for (j in seq(ncol(values))) {
  if (j == 1) {
    Temp <- cbind(xdf, z = rowSums(values[1:j]))
  } else {
    Temp <- rbind(Temp, cbind(xdf, z = rowSums(values[1:j])))
  }
}
print(Temp)
The output should be:
x y z
1 a 4 5
2 b 5 9
3 c 6 1
4 a 4 9
5 b 5 16
6 c 6 7
7 a 4 10
8 b 5 21
9 c 6 9
10 a 4 17
11 b 5 29
12 c 6 14
Is there a shorter way to accomplish this?
This is the closest answer that I could get on SO.
How to expand data frame based on values?
I am new to R, so sorry for the longwinded code.
Here's one base R option:
Repeat the rows of xdf once for each column in values, then compute rowSums over the first 1, 2, ..., ncol(values) columns and add the result as a new column of the final data frame.
newdf <- xdf[rep(seq(nrow(xdf)), ncol(values)), ]
newdf$z <- c(sapply(seq(ncol(values)), function(x) rowSums(values[1:x])))
newdf
# A tibble: 12 x 3
# x y z
# <chr> <dbl> <dbl>
# 1 a 4 5
# 2 b 5 9
# 3 c 6 1
# 4 a 4 9
# 5 b 5 16
# 6 c 6 7
# 7 a 4 10
# 8 b 5 21
# 9 c 6 9
#10 a 4 17
#11 b 5 29
#12 c 6 14
A more concise one-liner, as suggested by @sindri_baldur, doesn't require repeating the rows explicitly, since cbind() recycles the rows of xdf to match the length of z.
cbind(xdf, z = c(sapply(seq(ncol(values)), function(x) rowSums(values[1:x]))))
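Since each z value is a running total across a row of values, another possible sketch (not from the answers above) is to compute all the cumulative sums in one pass with cumsum() instead of calling rowSums() once per column. Plain data frames are used here in place of tibbles to keep the example self-contained.

```r
xdf <- data.frame(x = c("a", "b", "c"), y = c(4, 5, 6))
values <- data.frame(
  col_1 = c(5, 9, 1),
  col_2 = c(4, 7, 6),
  col_3 = c(1, 5, 2),
  col_4 = c(7, 8, 5)
)

# apply(..., 1, cumsum) returns one column per original row, so transpose back:
# cum[i, j] is the sum of row i over columns 1..j
cum <- t(apply(as.matrix(values), 1, cumsum))

# Repeat xdf once per column of values, then flatten cum column by column
newdf <- xdf[rep(seq_len(nrow(xdf)), times = ncol(values)), ]
newdf$z <- as.vector(cum)
rownames(newdf) <- NULL
newdf
```

Whether this is actually faster on a very large dataset would need to be measured; it mainly avoids re-summing the same columns for every j.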

Combining elements of one column into two columns by group in R

Given a two-column data.frame, with one column containing group labels and the other containing integer values ordered from smallest to largest, how can the data be expanded into pairs of combinations of the integer column?
I'm not sure of the best way to state this. I'm not interested in all ordered pairs, only in the unique combinations, each starting from its lower value.
In R, the combn function gives the desired output when groups are ignored, for example:
t(combn(seq(1:4),2))
[,1] [,2]
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 2 3
[5,] 2 4
[6,] 3 4
Since the first value is 1, we get the unique combination (1,2) and not the additional combination (2,1), which I don't need. How would one apply a similar method by group?
For example, given a data.frame:
test <- data.frame(Group = rep(c("A","B"),each=4),
Val = c(1,3,6,8,2,4,5,7))
test
Group Val
1 A 1
2 A 3
3 A 6
4 A 8
5 B 2
6 B 4
7 B 5
8 B 7
I was able to come up with this solution that gives the desired output:
library(dplyr)  # for filter()
test <- data.frame(Group = rep(c("A","B"),each=4),
                   Val = c(1,3,6,8,2,4,5,7))
j=1
for(i in unique(test$Group)){
  if(j==1){
    # first group: start the result data frame
    one <- filter(test,i == Group)
    two <- data.frame(t(combn(one$Val,2)))
    test1 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
    j=j+1
  }else{
    # later groups: build the pairs and append them
    one <- filter(test,i == Group)
    two <- data.frame(t(combn(one$Val,2)))
    test2 <- data.frame(Group = i,Val1=two$X1,Val2=two$X2)
    test1 <- rbind(test1,test2)
  }
}
test1
Group Val1 Val2
1 A 1 3
2 A 1 6
3 A 1 8
4 A 3 6
5 A 3 8
6 A 6 8
7 B 2 4
8 B 2 5
9 B 2 7
10 B 4 5
11 B 4 7
12 B 5 7
However, this is not elegant and is really slow as the number of groups and length of each group become large. It seems like there should be a more elegant and efficient solution but so far I have not come across anything on SO.
I would appreciate any ideas!
Here is a data.table approach:
library( data.table )
#make test a data.table
setDT(test)
#split by group
L <- split( test, by = "Group")
#get unique combinations of 2 Vals
L2 <- lapply( L, function(x) {
as.data.table( t( combn( x$Val, m = 2, simplify = TRUE ) ) )
})
#merge them back together
data.table::rbindlist( L2, idcol = "Group" )
# Group V1 V2
# 1: A 1 3
# 2: A 1 6
# 3: A 1 8
# 4: A 3 6
# 5: A 3 8
# 6: A 6 8
# 7: B 2 4
# 8: B 2 5
# 9: B 2 7
#10: B 4 5
#11: B 4 7
#12: B 5 7
You can set simplify = FALSE in combn() and then use unnest_wider() from tidyr.
library(dplyr)
library(tidyr)
# in dplyr >= 1.1.0, reframe() is the recommended verb for multi-row results
test %>%
  group_by(Group) %>%
  summarise(Val = combn(Val, 2, simplify = FALSE)) %>%
  unnest_wider(Val, names_sep = "_")
# Group Val_1 Val_2
# <chr> <dbl> <dbl>
# 1 A 1 3
# 2 A 1 6
# 3 A 1 8
# 4 A 3 6
# 5 A 3 8
# 6 A 6 8
# 7 B 2 4
# 8 B 2 5
# 9 B 2 7
# 10 B 4 5
# 11 B 4 7
# 12 B 5 7
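Another option uses purrr with gtools::combinations():
library(tidyverse)
# n is taken from each group's length rather than hard-coded to 4
df2 <- split(test$Val, test$Group) %>%
  map(~gtools::combinations(n = length(.x), r = 2, v = .x)) %>%
  map(~as_tibble(.x, .name_repair = "unique")) %>%
  bind_rows(.id = "Group")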
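For completeness, the same split / combn / bind pattern can be sketched in base R with no extra packages. This assumes every group has at least two values (combn() treats a single positive number n as 1:n, so a one-element group would not behave as intended).

```r
test <- data.frame(Group = rep(c("A", "B"), each = 4),
                   Val = c(1, 3, 6, 8, 2, 4, 5, 7))

# One vector of values per group
groups <- split(test$Val, test$Group)

# For each group, build all unique pairs and label them with the group name
pieces <- lapply(names(groups), function(g) {
  m <- t(combn(groups[[g]], 2))
  data.frame(Group = g, Val1 = m[, 1], Val2 = m[, 2])
})

# Bind once at the end rather than growing a data frame inside a loop
result <- do.call(rbind, pieces)
result
```

Collecting the pieces in a list and binding them once is the main difference from the original loop, which called rbind() on every iteration.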

How to add a date to each row for a column in a data frame?

df <- data.frame(DAY = character(), ID = character())
I'm running a loop (for i in DAYS[i]), getting the IDs for each day and storing them in a data frame:
df <- rbind(df, data.frame(ID = IDs))
I want to add DAY[i] in a second column for each of those rows inside the loop.
How do I do that?
As @Pascal says, this isn't the best way to create a data frame in R. R is a vectorised language, so generally you don't need for loops.
I'm assuming each ID is unique, so you can create a vector of IDs from 1 to 10:
ID <- 1:10
Then, you need a vector for your DAYs which can be the same length as your IDs, or can be recycled (i.e. if you only have a certain number of days that are repeated in the same order you can have a smaller vector that's reused). Use c() to create a vector with more than one value:
DAY <- c(1, 2, 9, 4, 4)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 1
# 2 2 2
# 3 3 9
# 4 4 4
# 5 5 4
# 6 6 1
# 7 7 2
# 8 8 9
# 9 9 4
# 10 10 4
Or with a DAY vector that is the same length as ID (random values here, so some may repeat):
DAY <- sample(1:100, 10, replace = TRUE)
df <- data.frame(ID, DAY)
df
# ID DAY
# 1 1 61
# 2 2 30
# 3 3 32
# 4 4 97
# 5 5 32
# 6 6 74
# 7 7 97
# 8 8 73
# 9 9 16
# 10 10 98
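If the IDs really do arrive one day at a time, a rough sketch closer to the loop in the question is to build a small data frame per day, letting data.frame() recycle the single DAY value across that day's IDs, and bind everything once at the end. The ids_by_day list and its contents are hypothetical stand-ins for whatever the loop retrieves.

```r
# Hypothetical input: the IDs found for each day (list names are the days)
ids_by_day <- list(
  "2020-01-01" = c("id1", "id2"),
  "2020-01-02" = c("id3"),
  "2020-01-03" = c("id4", "id5", "id6")
)

# One small data frame per day; DAY is recycled across that day's IDs
per_day <- lapply(names(ids_by_day), function(day) {
  data.frame(DAY = day, ID = ids_by_day[[day]])
})

# Stack them once instead of growing df inside a loop
df <- do.call(rbind, per_day)
df
```

Inside an existing for loop, the equivalent change would be rbind(df, data.frame(DAY = DAYS[i], ID = IDs)).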

How to calculate value for an observation by group?

I have a data frame like so:
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
mydf
group score
1 a 10
2 a 9
3 a 2
4 a 3
5 b 1
6 b 10
7 b 1
8 b 10
9 c 3
10 c 7
11 c 1
12 c 3
I can calculate the mean of each group like so:
> by(mydf[,c("score")], mydf$group, mean)
mydf$group: a
[1] 6
-------------------------------------------------------------------
mydf$group: b
[1] 5.5
-------------------------------------------------------------------
mydf$group: c
[1] 3.5
But what I wish to do is create a new column, say called residual, which contains each score's residual from the mean of its group. It seems like there should be a way to use one of the apply functions to do this, but for some reason I can't see it.
I would want my end result to look like so:
mydf
group score residual
1 a 10 4
2 a 9 3
3 a 2 -4
4 a 3 -3
5 b 1 -4.5
6 b 10 4.5
7 b 1 -4.5
8 b 10 4.5
9 c 3 -.5
10 c 7 3.5
11 c 1 -2.5
12 c 3 -.5
Any ideas or pointers in the right direction are appreciated.
How about:
mydf$score - tapply(mydf$score, mydf$group, mean)[as.character(mydf$group)]
tapply works the same as by but with a nicer output. The [as.character(mydf$group)] subsets and replicates tapply's output so that it aligns with mydf$group.
library(dplyr)
mydf %>% group_by(group) %>% mutate(residual = score - mean(score))
I take the data, I group by group, then I add a column (using mutate) which is the difference between the variable score and the mean of that variable in each group.
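Another option, using the hash package to look up each group's mean:
library(hash)
mydf <- data.frame(group=c(rep("a", 4),rep("b", 4), rep("c", 4)), score=sample(1:10, 12, replace=TRUE))
byResult <- by(mydf[,c("score")], mydf$group, mean)
# map each group name to its mean
h <- hash(keys= names(byResult), values =byResult)
# residual = score - group mean (row[1] is the group, row[2] the score)
residualsVar <- apply(mydf,1,function(row){
  as.numeric(row[2])-as.vector(values(h,row[1]))
})
df <- cbind(mydf,residualsVar)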
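A further base R possibility, not mentioned in the answers above, is ave(), which returns the group mean aligned to every row, so the residual is a single subtraction. The scores below are retyped from the question so the result can be compared with the desired output.

```r
# Data retyped from the question
mydf <- data.frame(
  group = rep(c("a", "b", "c"), each = 4),
  score = c(10, 9, 2, 3, 1, 10, 1, 10, 3, 7, 1, 3)
)

# ave() defaults to FUN = mean and returns one value per row,
# namely the mean of that row's group
mydf$residual <- mydf$score - ave(mydf$score, mydf$group)
mydf
```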

sum by group in a data.frame

I'm trying to get the sum of a numerical variable per categorical variable (in a data frame). I've tried using tapply, but it doesn't take a whole data.frame.
Here is a working example with some data that looks like this:
> set.seed(667)
> df <- data.frame(a = sample(c("Group A","Group B","Group C",NA), 10, rep = TRUE),
b = sample(c(1, 2, 3, 4, 5, 6), 10, rep=TRUE),
c = sample(c(11, 12, 13, 14, 15, 16), 10, rep=TRUE))
> df
a b c
1 Group A 4 12
2 Group B 6 12
3 <NA> 4 14
4 Group C 1 16
5 <NA> 2 14
6 <NA> 3 13
7 Group C 4 13
8 <NA> 6 15
9 Group B 3 16
10 Group B 5 16
Using tapply, I can get one vector at a time:
> tapply(df$b,df$a,sum)
Group A Group B Group C
4 14 5
but I am more interested in getting something like this:
a b c
1 Group A 4 12
2 Group B 14 44
3 Group C 5 29
Any help would be appreciated. Thanks.
Use aggregate instead:
aggregate(df[ , c("b","c")], df['a'], FUN=sum)
a b c
1 Group A 4 12
2 Group B 14 44
3 Group C 5 29
The by argument of aggregate must be a list, so passing the vector df$a will error out; df['a'] is a one-column data frame, which is a list. aggregate then applies the function to each column in the first argument, within each group.
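aggregate also has a formula interface, which may read more naturally here: cbind(b, c) ~ a means "summarise b and c for each value of a". A small sketch, with the data retyped from the printed data frame in the question so it doesn't depend on the random seed:

```r
# Data retyped from the question's printed output
df <- data.frame(
  a = c("Group A", "Group B", NA, "Group C", NA, NA,
        "Group C", NA, "Group B", "Group B"),
  b = c(4, 6, 4, 1, 2, 3, 4, 6, 3, 5),
  c = c(12, 12, 14, 16, 14, 13, 13, 15, 16, 16)
)

# Sum b and c within each level of a; rows where a is NA are dropped
res <- aggregate(cbind(b, c) ~ a, data = df, FUN = sum)
res
```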
