What is the simplest way to summarize the first two columns of the data, so that each unique row combination is counted in a new variable freq?
In other words, go from this:
var1 var2
1 a d
2 b e
3 b e
4 c f
5 c f
6 c f
To this:
var1 var2 freq
1 a d 1
2 b e 2
3 c f 3
You probably did not take a close look at the dplyr package (you tagged it :) ). The easiest way is below:
library(dplyr)

df <- data.frame(freq1 = c("a", "b", "b", "c", "c", "c"),
                 freq2 = c("d", "e", "e", "f", "f", "f"))
df %>% group_by(freq1, freq2) %>% tally()
Output
freq1 freq2 n
(fctr) (fctr) (int)
1 a d 1
2 b e 2
3 c f 3
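The same result can be had in a single step with dplyr's count(), which wraps group_by() + tally() (a minimal equivalent, assuming dplyr is loaded):
df %>% count(freq1, freq2)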
I don't know if this is the easiest way, but if the data isn't that complex, you can create unique codes with paste0(collapse = "_") and then aggregate by that unique code using a simple table() call:
data <- read.csv("data.csv")
# paste each row into a single code such as "a_d"
x <- apply(data, 1, function(row) paste0(row, collapse = "_"))
table(x)
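If you want the counts as a data frame rather than a named table, as.data.frame() converts it (a small follow-up sketch):
as.data.frame(table(x))   # columns: x (the pasted code) and Freq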
If for some reason you don't want to use the dplyr package's count function, an alternative is to use the contingency tables generated by the ftable function and filter out contingencies with 0 occurrences. For example:
df <- data.frame(freq1 = c("a", "b", "b", "c", "c", "c"),
freq2 = c("d", "e", "e", "f", "f", "f"))
x <- as.data.frame(ftable(df))
x <- x[x$Freq > 0, ]
This yields the output:
freq1 freq2 Freq
1 a d 1
5 b e 2
9 c f 3
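For completeness, a base R sketch with aggregate() that yields the same counts; the helper column freq is introduced here purely for illustration:
df$freq <- 1   # hypothetical helper: one count per row
aggregate(freq ~ freq1 + freq2, data = df, FUN = sum)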
I imagine this question is not unique, but I was struggling with which words to search for, so if this is redundant please point me to the existing post!
I have a dataframe
test <- data.frame(x = c("a", "b", "c", "d", "e"))
x
1 a
2 b
3 c
4 d
5 e
And I'd like to replace SOME of the values using a separate data frame
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"))
Resulting in:
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
A base R solution using match + replace
test <- within(test, x <- replace(as.character(x), match(metadata$a, x), as.character(metadata$b)))
such that
> test
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
Importing your data with stringsAsFactors = FALSE and using dplyr and stringr, you can do:
library(dplyr)
library(stringr)

test %>%
  mutate(x = str_replace_all(x, setNames(metadata$b, metadata$a)))
x
1 a
2 b
3 REPLACE_1
4 REPLACE_2
5 e
Or, using the basic match() idea from #Sotos, combined with coalesce() so unmatched values fall back to the original:
test %>%
  mutate(x = coalesce(metadata$b[match(x, metadata$a)], x))
You can do:
test$x[test$x %in% metadata$a] <- na.omit(metadata$b[match(test$x, metadata$a)])
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Here's one approach, though I presume there are shorter ones:
library(dplyr)
test %>%
left_join(metadata, by = c("x" = "a")) %>%
mutate(b = coalesce(b, x))
# x b
#1 a a
#2 b b
#3 c REPLACE_1
#4 d REPLACE_2
#5 e e
(Note: I have made the data types match by loading metadata as character, not factors:
metadata <- data.frame(stringsAsFactors = FALSE,
                       a = c("c", "d"),
                       b = c("REPLACE_1", "REPLACE_2")))
You can use match to make this update join.
i <- match(metadata$a, test$x)
test$x[i] <- metadata$b
# test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Or:
i <- match(test$x, metadata$a)
j <- !is.na(i)
test$x[j] <- metadata$b[i[j]]
test
# x
#1 a
#2 b
#3 REPLACE_1
#4 REPLACE_2
#5 e
Data:
test <- data.frame(x = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
metadata <- data.frame(
a = c("c", "d"),
b = c("REPLACE_1", "REPLACE_2"), stringsAsFactors = FALSE)
I'm using a group-by function on a dataset in R, but some IDs belong to more than one target group. Here is the sample dataset:
ID Var1
A 1
A 3
B 2
C 3
C 1
D 2
With a traditional group-by on each ID, I can do
library(data.table)

DT <- data.table(dataset)
DT[, sum(Var1), by = ID]
and get the result:
ID V1
A 4
B 2
C 4
D 2
However, I have to group the IDs as A+B, B+C, and D
(say F = A+B and G = B+C),
with the target result below:
ID V1
F 6
G 6
D 2
If I simply recode ID, the shared B can only fall into one group, but it needs to be counted in both F and G.
Does anyone have a solution? Many thanks!
library(dplyr)
library(tidyr)

# flag each row with every group it belongs to (df holds the sample ID/Var1 data)
df <- df %>% mutate(F = ifelse(ID %in% c("A", "B"), 1, 0),
                    G = ifelse(ID %in% c("B", "C"), 1, 0),
                    D = ifelse(ID == "D", 1, 0))
df %>%
  gather(var, val, F:D) %>%   # one row per (ID, group) membership
  filter(val == 1) %>%
  group_by(var) %>%
  summarise(V1 = sum(Var1))
# # A tibble: 3 x 2
# var V1
# <chr> <dbl>
# 1 D 2
# 2 F 6
# 3 G 6
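Since the question mentions data.table, here is a hedged data.table sketch of the same idea: define the overlapping groups once, then sum each one (the groups list is introduced here for illustration):
library(data.table)

DT <- data.table(ID = c("A", "A", "B", "C", "C", "D"),
                 Var1 = c(1, 3, 2, 3, 1, 2))
groups <- list(F = c("A", "B"), G = c("B", "C"), D = "D")
rbindlist(lapply(names(groups), function(g)
  DT[ID %in% groups[[g]], .(ID = g, V1 = sum(Var1))]))
# ID V1
#  F  6
#  G  6
#  D  2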
I have two data.frames dfA and dfB. Both of them have a column called key.
Now I'd like to know how many duplicates for A$key there are in B$key.
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
It should be A=2, B=3, C=0 and D=1. What's the easiest way to do this?
Use table
table(factor(B$key, levels = sort(unique(A$key))))
#A B C D
#2 3 0 1
factor() is needed here so that we also 'count' entries that do not appear in B$key, namely C.
A <- data.frame(key=c("A", "B", "C", "D"))
B <- data.frame(key=c("A", "A", "B", "B", "B", "D"))
library(dplyr)
library(tidyr)
B %>%
filter(key %in% A$key) %>% # keep values that appear in A
count(key) %>% # count values
complete(key = A$key, fill = list(n = 0)) # add any values from A that don't appear
# # A tibble: 4 x 2
# key n
# <chr> <dbl>
# 1 A 2
# 2 B 3
# 3 C 0
# 4 D 1
Using the tidyverse you can do:
library(dplyr)

A %>%
  left_join(B %>%            # merge A with B, in which the count per "key" is calculated
              group_by(key) %>%
              tally(), by = c("key" = "key")) %>%
  mutate(n = ifelse(is.na(n), 0, n))   # replace NA with 0
key n
1 A 2
2 B 3
3 C 0
4 D 1
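A compact base R alternative with the same result (a sketch, counting matches for each value of A$key directly):
sapply(as.character(A$key), function(k) sum(B$key == k))
# A B C D
# 2 3 0 1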
So you actually mean how many occurrences of each value of A$key there are in B$key?
You can obtain this by coding B$key as factor with the unique values of A$key as levels.
o <- table(factor(B$key, levels=unique(A$key)))
Yielding:
> o
A B C D
2 3 0 1
If you really want to count duplicates, do
dupes <- ifelse(o - 1 < 0, 0, o - 1)
Yielding:
> dupes
A B C D
1 2 0 0
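The same clamping at zero can be written more compactly with pmax() (an equivalent sketch):
dupes <- pmax(o - 1, 0)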
I have a table df that looks like this:
a <- c(10,20, 20, 20, 30)
b <- c("u", "u", "u", "r", "r")
c <- c("a", "a", "b", "b", "b")
df <- data.frame(a,b,c)
I would like to create a new table that contains the mean of column a, grouped by variable c, and a column with the counts of each b type within each group of c.
I would therefore like the result table to look like df2:
a_m <- c(15, 23.3)
c <- c("a", "b")
counts_b <-c("2 u", "1 u, 2 r")
df2 <- data.frame(a_m, c, counts_b)
What I have so far is:
df2 <- df %>% group_by(c) %>% summarise(a_m = mean(a, na.rm = TRUE))
I do not know how to add the counts_b column shown in the example df2.
Giulia
Here's a way using a little table() magic:
library(dplyr)

df %>%
  group_by(c) %>%
  summarise(a_mean = mean(a),
            b_list = paste(names(table(b)), table(b), collapse = ', '))
# A tibble: 2 x 3
c a_mean b_list
<fct> <dbl> <chr>
1 a 15.0 r 0, u 2
2 b 23.3 r 2, u 1
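To drop the zero counts and come closer to the asker's desired format (e.g. "2 u" rather than "r 0, u 2"), a hedged variation on the same table() trick:
df %>%
  group_by(c) %>%
  summarise(a_m = mean(a, na.rm = TRUE),
            counts_b = {
              tb <- table(b)
              tb <- tb[tb > 0]                   # keep only b types that occur
              paste(tb, names(tb), collapse = ", ")
            })
# c     a_m  counts_b
# a    15.0  2 u
# b    23.3  2 r, 1 u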
Here is another solution using reshape2. The output format may be more convenient to work with: each value of b gets its own column with its number of occurrences.
library(reshape2)

out1 <- dcast(df, c ~ b, value.var = "c", fun.aggregate = length)
c r u
1 a 0 2
2 b 2 1
out2 <- df %>% group_by(c) %>% summarise(a_m = mean(a))
# A tibble: 2 x 2
c a_m
<fctr> <dbl>
1 a 15.00000
2 b 23.33333
df2 <- merge(out1, out2, by = "c")
c r u a_m
1 a 0 2 15.00000
2 b 2 1 23.33333
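The same wide count table can be built without reshape2, using dplyr and tidyr (a sketch, assuming a tidyr version that provides pivot_wider()):
library(dplyr)
library(tidyr)

df %>%
  count(c, b) %>%                                                  # count each (c, b) combination
  pivot_wider(names_from = b, values_from = n, values_fill = 0)    # one column per b type, zero-filled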
I have a df that looks like this:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
5 e 0.11258865
6 f 0.07228970
7 g 0.05673759
8 h 0.05319149
9 i 0.03989362
I would like to subset it using the sum of the column value: I want to extract rows from the top, accumulating value starting at the first row, until the running total reaches 0.6. My desired output would be:
> df2
name value
1 a 0.20019421
2 b 0.17996454
3 c 0.14257010
4 d 0.14257010
I have tried df2[, colSums[, 5] >= 0.6], but obviously colSums expects an array.
Thanks in advance.
Here's an approach:
df2[seq(which(cumsum(df2$value) >= 0.6)[1]), ]   # keep rows up to the first time the running total reaches 0.6
The result:
name value
1 a 0.2001942
2 b 0.1799645
3 c 0.1425701
4 d 0.1425701
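The same inclusive cut can also be written with dplyr, keeping each row whose running total before it is still below the threshold (a sketch, assuming dplyr):
library(dplyr)

df2 %>% filter(lag(cumsum(value), default = 0) < 0.6)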
I'm not sure I understand exactly what you are trying to do, but I think cumsum should be able to help.
First to make this reproducible, let's use dput so others can help:
df <- structure(list(name = structure(1:9, .Label = c("a", "b", "c",
"d", "e", "f", "g", "h", "i"), class = "factor"), value = c(0.20019421,
0.17996454, 0.1425701, 0.1425701, 0.11258865, 0.0722897, 0.05673759,
0.05319149, 0.03989362)), .Names = c("name", "value"), class = "data.frame", row.names = c(NA,
-9L))
Then look at what cumsum(df$value) provides:
cumsum(df$value)
# [1] 0.2001942 0.3801587 0.5227289 0.6652990 0.7778876 0.8501773 0.9069149 0.9601064 1.0000000
Finally, subset accordingly:
subset(df, cumsum(df$value) <= 0.6)
# name value
# 1 a 0.2001942
# 2 b 0.1799645
# 3 c 0.1425701
subset(df, cumsum(df$value) >= 0.6)
# name value
# 4 d 0.14257010
# 5 e 0.11258865
# 6 f 0.07228970
# 7 g 0.05673759
# 8 h 0.05319149
# 9 i 0.03989362