This question already has answers here:
Count number of rows per group and add result to original data frame
(11 answers)
Closed 4 years ago.
I have no idea where to start with this, but what I am trying to do is create a new value based on the number of times another value is represented in another column.
For example
# Existing Data
key newcol
a ?
a ?
a ?
b ?
b ?
c ?
c ?
c ?
Would like the output to look like
key newcol
a 3
a 3
a 3
b 2
b 2
c 3
c 3
c 3
Thanks!
This can be achieved with the doBy package like so:
require(doBy)
#original data frame
df <- data.frame(key = c('a', 'a', 'a', 'b', 'b', 'c', 'c', 'c'))
#add counter
df$count <- 1
#use summaryBy to count number of instances of key
counts <- summaryBy(count ~ key, data = df, FUN = sum, var.names = 'newcol', keep.names = TRUE)
#merge counts into original data frame
df <- merge(df, counts, by = 'key', all.x = TRUE)
df then looks like:
> df
key count newcol
1 a 1 3
2 a 1 3
3 a 1 3
4 b 1 2
5 b 1 2
6 c 1 3
7 c 1 3
8 c 1 3
If key is a vector like this key <- rep(c("a", "b", "c"), c(3,2,3)), then you can get what you want by using table to count occurences of key elements
> N <- table(key)
> data.frame(key, newcol=rep(N,N))
key newcol
1 a 3
2 a 3
3 a 3
4 b 2
5 b 2
6 c 3
7 c 3
8 c 3
On the other hand, if key is a data.frame, then...
key.df <- data.frame(key = rep(letters[1:3], c(3, 2, 3)))
N <- table(key.df$key)
data.frame(key=key.df, newcol=rep(N, N))
Related
how do I change a value of a dataframe, based on another dataframe with different column names? I have Table 1 and Table 2, and would like to produce the Desired outcome as stated below.
I would like to change Number column of table 1, if Index of table 1 exists in Alphabet of Table2, without the use of if else as my dataframe is large.
Table 1)
Index
Number
A
1
B
2
C
3
D
4
Table 2)
Alphabet
Integer
B
25
C
30
Desired Output:
Index
Number
A
1
B
25
C
30
D
4
What's the issue with using ifelse() on a large dataframe with simple replacement?
df1 <- data.frame(Index = LETTERS[1:4],
Number = 1:4)
df2 <- data.frame(Alphabet = c("B", "C"),
Integer = c(25, 30))
df1$Number2 <- ifelse(df1$Index %in% df2$Alphabet, df2$Integer, df1$Number)
df1
#> Index Number Number2
#> 1 A 1 1
#> 2 B 2 30
#> 3 C 3 25
#> 4 D 4 4
Perhaps if you are concerned with incorrect indexing when using ifelse(), you could use merge() like this:
merge(df1, df2, by.x = "Index", by.y = "Alphabet", all.x = TRUE)
#> Index Number Integer
#> 1 A 1 NA
#> 2 B 2 25
#> 3 C 3 30
#> 4 D 4 NA
I have two dataframes, one of which is a subset of the other. I want to visualize them (ggplot) indicating the subset of dataframes using a different color. I am therefore looking for a way to identify matches across data frames, flag the matches and use the flag as the col aesthetic.
Here is a short example:
#create two sample dataframes
df_full <- data.frame(type = c('A', 'A', 'B', 'B', 'A'),
case = c(1, 2, 1, 2, 2),
val = c(3, 4, 7, 1, 5))
df_special <- data.frame(type = c('A', 'A', 'B'),
case = c(1, 2, 1))
#print df for clarity
df_full
type case val
1 A 1 3
2 A 2 4
3 B 1 7
4 B 2 1
5 A 2 5
df_special
type case
1 A 1
2 A 2
3 B 1
What I want is the following:
type case val special
1 A 1 3 TRUE
2 A 2 4 TRUE
3 B 1 7 TRUE
4 B 2 1 FALSE
5 A 2 5 TRUE
I can do this manually using ifelse conditions, but in cases where there are lots of potential matches this becomes laborious. I assume there is a simple way to to just check whether type and case match across df's (similar to a join function) and then flag if they would join. I can't seem to word the search correctly to find anything.
dplyr solutions would be welcome.
Thanks.
We can do a join
library(dplyr)
df_full %>%
left_join(df_special %>%
mutate(special = TRUE))%>%
mutate(special = replace_na(special, FALSE))
type case val special
1 A 1 3 TRUE
2 A 2 4 TRUE
3 B 1 7 TRUE
4 B 2 1 FALSE
5 A 2 5 TRUE
Or another option is %in% on the pasted columns 'type', 'case'
library(stringr)
df_full %>%
mutate(special = str_c(type, case) %in%
str_c(df_special$type, df_special$case))
I want a way to count values on a dataframe based on its presence by row
a = data.frame(c('a','b','c','d','f'),
c('a','b','a','b','d'))
colnames(a) = c('let', 'let2')
In this reproducible example, we have the letter "a" appearing in the first row and third row, totalizing two appearences. I've made this code to count the values based if the presence is TRUE, but I want it to atribute it automaticaly for all the variables present in the dataframe:
#for counting the variable a and atribunting the count to the b dataframe
b = data.frame(unique(unique(unlist(a))))
b$count = 0
for(i in 1:nrow(a)){
if(TRUE %in% apply(a[i,], 2, function(x) x %in% 'a') == TRUE){
b$count[1] = b$count[1] + 1
}
}
b$count[1]
[1] 2
The problem is that I have to make this manually for all variables and I want a way to make this automatically. Is there a way? The expected output is:
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
It can be done in base R by taking the unique values separately from the column, unlist to a vector and get the frequency count with table. If needed convert the table object to a two column data.frame with stack
stack(table(unlist(lapply(a, unique))))[2:1]
-output
# ind values
#1 a 2
#2 b 2
#3 c 1
#4 d 2
#5 f 1
If it is based on row, use apply with MARGIN = 1
table(unlist(apply(a, 1, unique)))
Or do a group by row to get the unique and count with table
table(unlist(tapply(unlist(a), list(row(a)), unique)))
Or a faster approach with dapply from collapse
library(collapse)
table(unlist(dapply(a, funique, MARGIN = 1)))
Does this work:
library(dplyr)
library(tidyr)
a %>% pivot_longer(cols = everything()) %>% distinct() %>% count(value)
# A tibble: 5 x 2
value n
<chr> <int>
1 a 2
2 b 2
3 c 1
4 d 2
5 f 1
Data used:
a
let let2
1 a a
2 b b
3 c a
4 d b
5 f d
I have searched high and low and also tried multiple options to solve this but did not get the desired output as mentioned below:
I have dataframe df3 with headers as date and values beteween 0-1 as shown below:
df = data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) = c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 = data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 = cbind(df2,df)
Now I need df4 in which sum of first 3 columns in series will form one column. This will be repeated in series for rest of the columns dynamically.
df4
Options I tried:
a) rbind.data.frame(apply(matrix(df3, nrow = n - 1), 1,sum))
b) col_list <- list(c("1/1/2018","1/2/2018","1/3/2018"), c("1/4/2018","1/5/2018","1/6/2018"))
lapply(col_list, function(x)sum(df3[,x])) %>% data.frame
One way would be to split df3 every 3 columns using split.default. To split the data we generate a sequence using rep, then for each dataframe we take rowSums and finally cbind the result together.
cbind(df3[1], sapply(split.default(df3[-1],
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))), rowSums))
# CUST_ID 1 2
#1 A 1 1
#2 B 2 0
#3 C 2 1
#4 D 1 1
#5 E 2 2
#6 F 2 2
FYI, the sequence generated from rep is
rep(1:ncol(df3), each = 3, length.out = (ncol(df3) -1))
#[1] 1 1 1 2 2 2
This makes it possible to split every 3 columns.
The results are different because OP used sample without set.seed.
If rep seems too long then we can generate the same sequence of columns using gl
gl(ncol(df3[-1])/3, 3)
#[1] 1 1 1 2 2 2
#Levels: 1 2
So the final code, would be
cbind(df3[1], sapply(split.default(df3[-1], gl(ncol(df3[-1])/3, 3)), rowSums))
We can use seq to create index, get the subset of columns within in a list, Reduce by taking the sum, and create new columns
df4 <- df3[1]
df4[paste0('col', c('123', '456'))] <- lapply(seq(2, ncol(df3), by = 3),
function(i) Reduce(`+`, df3[i:min((i+2), ncol(df3))]))
df4
# CUST_ID col123 col456
#1 A 2 2
#2 B 3 3
#3 C 1 3
#4 D 2 3
#5 E 2 1
#6 F 0 1
data
set.seed(123)
df <- data.frame(replicate(6,sample(0:1,6,rep=TRUE)))
colnames(df) <- c("1/1/2018","1/2/2018","1/3/2018","1/4/2018","1/5/2018","1/6/2018")
df2 <- data.frame(c("A","B","C","D","E","F"))
colnames(df2) = c("CUST_ID")
df3 <- cbind(df2, df)
I have a data.frame that looks like this:
which has 1000+ columns with similar names.
And I have a vector of those column names that looks like this:
The vector is sorted by the cluster_id (which goes up to 11).
I want to sort the columns in the data frame such that the columns are in the order of the names in the vector.
A simple example of what I want is that:
Data:
A B C
1 2 3
4 5 6
Vector:
c("B","C","A")
Sorted:
B C A
2 3 1
5 6 4
Is there a fast way to do this?
UPDATE, with reproducible data added by OP:
df <- read.table(h=T, text="A B C
1 2 3
4 5 6")
vec <- c("B", "C", "A")
df[vec]
Results in:
B C A
1 2 3 1
2 5 6 4
As OP desires.
How about:
df[df.clust$mutation_id]
Where df is the data.frame you want to sort the columns of and df.clust is the data frame that contains the vector with the column order (mutation_id).
This basically treats df as a list and uses standard vector indexing techniques to re-order it.
Brodie's answer does exactly what you're asking for. However, you imply that your data are large, so I will provide an alternative using "data.table", which has a function called setcolorder that will change the column order by reference.
Here's a reproducible example.
Start with some simple data:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
matches <- data.frame(X = 1:3, Y = c("C", "A", "B"), Z = 4:6)
mydf
# A B C
# 1 1 3 5
# 2 2 4 6
matches
# X Y Z
# 1 1 C 4
# 2 2 A 5
# 3 3 B 6
Provide proof that Brodie's answer works:
out <- mydf[matches$Y]
out
# C A B
# 1 5 1 3
# 2 6 2 4
Show a more memory efficient way to do the same thing.
library(data.table)
setDT(mydf)
mydf
# A B C
# 1: 1 3 5
# 2: 2 4 6
setcolorder(mydf, as.character(matches$Y))
mydf
# C A B
# 1: 5 1 3
# 2: 6 2 4
A5C1D2H2I1M1N2O1R2T1's solution didn't work for my data (I've a similar problem that Yilun Zhang) so I found another option:
mydf <- data.frame(A = 1:2, B = 3:4, C = 5:6)
# A B C
# 1 1 3 5
# 2 2 4 6
matches <- c("B", "C", "A") #desired order
mydf_reorder <- mydf[,match(matches, colnames(mydf))]
colnames(mydf_reorder)
#[1] "B" "C" "A"
match() find the the position of first element on the second one:
match(matches, colnames(mydf))
#[1] 2 3 1
I hope this can offer another solution if anyone is having problems!