Given a data frame
df=data.frame(
E=c(1,1,2,1,3,2,2,3,2),
N=c(4,4,10,4,3,2,2,6,10)
)
I would like to create a third column that labels matching rows: whenever two or more rows have the same value in E and also the same value in N, they form a match, and each distinct match gets its own letter. Rows without a match get NA. The desired result:
dfx=data.frame(
E=c(1,1,2,1,3,2,2,3, 2),
N=c(4,4,10,4,3,2,2,6, 10),
matched=c("A", "A", "B","A", NA, "C", "C", NA, "B")
)
Thanks!
Here, df is:
df <- structure(list(E = c(1, 1, 2, 1, 3, 2, 2, 3, 2), N = c(4, 4,
10, 4, 3, 2, 2, 6, 10)), .Names = c("E", "N"), row.names = c(NA,
-9L), class = "data.frame")
You can do:
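dfx <- transform(df, matched = {
# one key per row, e.g. "1.4" for E = 1, N = 4
i <- as.character(interaction(df[c("E", "N")]))
# counts per key, in order of first appearance
tab <- table(i)[unique(i)]
# letter the keys that occur more than once; all other rows become NA
LETTERS[match(i, names(tab)[tab > 1])]
})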
# E N matched
# 1 1 4 A
# 2 1 4 A
# 3 2 10 B
# 4 1 4 A
# 5 3 3 <NA>
# 6 2 2 C
# 7 2 2 C
# 8 3 6 <NA>
# 9 2 10 B
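The same labelling can also be sketched without table(), in case the steps are easier to follow that way. This is only an illustration: the names key, has_match and matched_keys are made up here, and like the answer above it runs out of letters after 26 distinct matches.

```r
df <- data.frame(
  E = c(1, 1, 2, 1, 3, 2, 2, 3, 2),
  N = c(4, 4, 10, 4, 3, 2, 2, 6, 10)
)

# one key per row built from both columns
key <- paste(df$E, df$N, sep = ".")

# a row has a match if its key occurs again before or after it
has_match <- duplicated(key) | duplicated(key, fromLast = TRUE)

# matched keys in order of first appearance, each one gets a letter
matched_keys <- unique(key[has_match])
df$matched <- LETTERS[match(key, matched_keys)]
```

Rows whose key is not in matched_keys get NA from match(), and indexing LETTERS with NA keeps them NA.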
Let's say I test 3 drugs (A, B, C) at 3 conditions (0, 1, 2), and then I want to compare two of the conditions (1, 2) to a reference condition (0). This is the plot I would like to get:
First: I do get there, but my solution seems overly complex.
# The data I have
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9)
)
# The data I want
df_wider0 <- data.frame(
drug = c("A", "A", "B", "B", "C", "C"),
result0 = c(1, 1, 2, 2, 3, 3),
cond = c(1, 2, 1, 2, 1, 2),
result = c(2, 3, 4, 6, 6, 9)
)
# This pivots also condition 1 and 2 ...
df_wider <- tidyr::pivot_wider(
df,
names_from = cond,
values_from = result
)
# ... so I pivot out these two again ...
colnames(df_wider)[colnames(df_wider) == "0"] <- "result0"
df_wider0 <- tidyr::pivot_longer(
df_wider,
cols = c("1", "2"),
names_to = "cond",
values_to = "result"
)
# ... so that I can use this ggplot command:
library(ggplot2)
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
As you can see, I use a sequence of pivot_wider and pivot_longer to do a selective pivot_wider (by inverting some of its effects later). Is there an integrated command that I can use to achieve this more elegantly?
Another strategy is to join the non-reference rows to the reference rows. (This works even if the number of conditions differs between groups.)
library(dplyr)
df %>%
filter(cond != 0) %>%
right_join(df %>% filter(cond == 0), by = "drug", suffix = c("", "0")) %>%
select(-cond0)
Using a revised df, in which drug D has only the reference condition:
df <- data.frame(
drug = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D"),
cond = c(0, 1, 2, 0, 1, 2, 0, 1, 2, 0),
result = c(1, 2, 3, 2, 4, 6, 3, 6, 9, 10)
)
Result of the code above:
drug cond result result0
1 A 1 2 1
2 A 2 3 1
3 B 1 4 2
4 B 2 6 2
5 C 1 6 3
6 C 2 9 3
7 D NA NA 10
You may also fill in cond afterwards, if desired.
You can do this without any pivot statement at all.
library(dplyr)
library(ggplot2)
df_wider0 <- df %>%
# look up each drug's reference result (the row with cond == 0)
mutate(result0 = result[cond == 0][match(drug, drug[cond == 0])]) %>%
filter(cond != 0)
df_wider0
# drug cond result result0
#1 A 1 2 1
#2 A 2 3 1
#3 B 1 4 2
#4 B 2 6 2
#5 C 1 6 3
#6 C 2 9 3
Plot the data:
ggplot(df_wider0, aes(x = result0, y = result, label = drug)) +
geom_label() +
facet_wrap("cond")
I have a dataset with a group column and an x column (in reality there are more than 100 groups),
and I want to use dplyr to create a variable y for each group, where the first value of y is 1 and every later value is computed from the previous row:
Second y = 1 * first x + 2 * first y
The result would be:
I tried creating a column y with all values set to 1 and then using
df%>% group_by(group)%>% mutate(var=shift(x)+2*shift(y))%>% ungroup()
but then the formula always uses the initial value of y, which is 1:
Second y = 1 * first x + 2 * 1
Could someone give me some ideas about this? Thank you!
The dput of my desired result is:
structure(list(group = c("a", "a", "a", "a", "a", "b", "b", "b" ), x =
c(1, 2, 3, 4, 5, 6, 7, 8), y = c(1, 3, 8, 19, 42, 1, 8, 23)),
row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame" ))
Such a recursive calculation can be done with accumulate from purrr or with Reduce in base R.
Since you are already using dplyr, accumulate fits in naturally:
library(dplyr)
df %>%
group_by(group) %>%
mutate(y1 = purrr::accumulate(x[-n()], ~.x * 2 + .y, .init = 1))
# group x y y1
# <chr> <dbl> <dbl> <dbl>
#1 a 1 1 1
#2 a 2 3 3
#3 a 3 8 8
#4 a 4 19 19
#5 a 5 42 42
#6 b 6 1 1
#7 b 7 8 8
#8 b 8 23 23
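For completeness, here is a rough sketch of the Reduce variant mentioned above, using only base R. The helper name recur is made up for this example, and ave() is just one way to apply it per group.

```r
df <- data.frame(
  group = c("a", "a", "a", "a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# recur() is a hypothetical helper: y starts at 1, then
# next y = previous x + 2 * previous y
recur <- function(x) {
  Reduce(
    function(prev_y, prev_x) prev_x + 2 * prev_y,
    x[-length(x)],      # the last x is never needed
    init = 1,
    accumulate = TRUE   # keep every intermediate value
  )
}

# apply recur() within each group and put the values back in row order
df$y1 <- ave(df$x, df$group, FUN = recur)
```

With init supplied and accumulate = TRUE, Reduce returns one value more than it consumes, which is why dropping the last x gives a result of the right length.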
I have a data.table similar to the one below, but with around 3 million rows and a lot more columns.
key1 price qty status category
1: 1 9.26 3 5 B
2: 1 14.64 1 5 B
3: 1 16.66 3 5 A
4: 1 18.27 1 5 A
5: 2 2.48 1 7 A
6: 2 0.15 2 7 C
7: 2 6.29 1 7 B
8: 3 7.06 1 2 A
9: 3 24.42 1 2 A
10: 3 9.16 2 2 C
11: 3 32.21 2 2 B
12: 4 20.00 2 9 B
Here's the dput() string:
dados = structure(list(key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06,
24.42, 9.16, 32.21, 20), qty = c(3, 1, 3, 1, 1, 2, 1, 1,
1, 2, 2, 2), status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2,
9), category = c("B", "B", "A", "A", "A", "C", "B", "A",
"A", "C", "B", "B")), .Names = c("key1", "price", "qty",
"status", "category"), row.names = c(NA, -12L), class = c("data.table",
"data.frame"))
I need to transform this data so that I have one entry for each key, and in the process I need to create some additional variables. So far I have been using this:
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
key.aggregate = function(x){
return(data.table(
key1 = Mode(x$key1),
perc.A = sum(x$price[x$category == "A"],na.rm=T)/sum(x$price),
perc.B = sum(x$price[x$category == "B"],na.rm=T)/sum(x$price),
perc.C = sum(x$price[x$category == "C"],na.rm=T)/sum(x$price),
status = Mode(x$status),
qty = sum(x$qty),
price = sum(x$price)
))
}
new_data = split(dados,by = "key1") #Runs out of RAM here
results = rbindlist(lapply(new_data,key.aggregate))
I expect the following output:
> results
key1 perc.A perc.B perc.C status qty price
1: 1 0.5937447 0.4062553 0.00000000 5 8 58.83
2: 2 0.2780269 0.7051570 0.01681614 7 4 8.92
3: 3 0.4321208 0.4421414 0.12573782 2 6 72.85
4: 4 0.0000000 1.0000000 0.00000000 9 2 20.00
But I'm always running out of RAM when splitting the data by keys. I've tried using only a third of the data, and now only a sixth of it, but it still gives the same error: cannot allocate vector of size 593 Kb.
I think this approach is very inefficient. What would be the best way to get this result?
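One possible direction, as a sketch only and not benchmarked on 3 million rows: data.table can aggregate per key directly with by =, so the split() step and the list of per-key tables are not needed at all. The code below reuses the Mode helper from the question and assumes the data.table package is installed and that price has no missing values.

```r
library(data.table)

dados <- data.table(
  key1 = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4),
  price = c(9.26, 14.64, 16.66, 18.27, 2.48, 0.15, 6.29, 7.06,
            24.42, 9.16, 32.21, 20),
  qty = c(3, 1, 3, 1, 1, 2, 1, 1, 1, 2, 2, 2),
  status = c(5, 5, 5, 5, 7, 7, 7, 2, 2, 2, 2, 9),
  category = c("B", "B", "A", "A", "A", "C", "B", "A", "A", "C", "B", "B")
)

# most frequent value, as defined in the question
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# one row per key1, computed group by group without splitting the table
results <- dados[, .(
  perc.A = sum(price[category == "A"]) / sum(price),
  perc.B = sum(price[category == "B"]) / sum(price),
  perc.C = sum(price[category == "C"]) / sum(price),
  status = Mode(status),
  qty = sum(qty),
  price = sum(price)
), by = key1]
```

Whether this is enough to fit in memory on the real data depends on the other columns, so treat it as a starting point rather than a guaranteed fix.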
I'm looking to obtain a subset of my first, larger, dataframe 'df1' by selecting rows which contain particular combinations in the first two variables, as specified in a smaller 'df2'. For example:
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df1 # my actual df has 20 variables
ID day value
A 1 4
A 2 5
A 2 6
B 1 7
B 2 8
B 3 9
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
df2 # this df remains at 2 variables
ID day
A 2
B 1
Where the output would be:
ID day value
A 2 5
A 2 6
B 1 7
Any help would be much appreciated, thanks!
This is a good use of the merge function.
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
day = c(1, 2, 2, 1, 2, 3), value = seq(4,9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))
merge(df1,
df2,
by = c("ID", "day"))
Which gives output:
ID day value
1 A 2 5
2 A 2 6
3 B 1 7
Here is a dplyr solution:
library("dplyr")
semi_join(df1, df2, by = c("ID", "day"))
# ID day value
# 1 A 2 5
# 2 A 2 6
# 3 B 1 7
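If you would rather avoid both merge and dplyr, a rough base R sketch is to build a combined key in each data frame and filter with %in%. The names key1 and key2 are made up for this example, and it assumes the separator "_" does not occur in ID or day.

```r
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "B"),
                  day = c(1, 2, 2, 1, 2, 3), value = seq(4, 9))
df2 <- data.frame(ID = c("A", "B"), day = c(2, 1))

# combine the two matching columns into a single key per row
key1 <- paste(df1$ID, df1$day, sep = "_")
key2 <- paste(df2$ID, df2$day, sep = "_")

# keep the rows of df1 whose key also appears in df2
res <- df1[key1 %in% key2, ]
```

Like semi_join, this keeps the original row order of df1 and never duplicates rows, even if a combination appears more than once in df2.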
Sample data.frame:
structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
Output:
df
# a b c
# 1 1 4 7
# 2 2 5 8
# 3 3 6 9
I'd like to get the first and third columns, but I want to subset by name and also by column index.
df[, "a"]
# [1] 1 2 3
df[, 3]
# [1] 7 8 9
df[, c("a", 3)]
# Error in `[.data.frame`(df, , c("a", 3)) : undefined columns selected
df[, c(match("a", names(df)), 3)]
# a c
# 1 1 7
# 2 2 8
# 3 3 9
Are there functions or packages that allow for clean/simple syntax, as in the third example, while also achieving the result of the fourth example?
Maybe use dplyr?
For interactive use, i.e. if you know the name of the column ahead of time:
library(dplyr)
df %>% select(a, 3)
If you do not know the name of the column in advance and want to pass it as a variable:
x <- names(df)[1]
x
# [1] "a"
# all_of() replaces the deprecated select_()
df %>% select(all_of(x), 3)
Either way the output is
# a c
#1 1 7
#2 2 8
#3 3 9
In base R you can use subset with its select argument, which accepts unquoted column names and indices together.
df <- structure(list(a = c(1, 2, 3),
b = c(4, 5, 6), c = c(7, 8, 9)),
.Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df <- subset(df, select = c(a, 3))
You can index names(df) without using dplyr:
df <- structure(list(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9)), .Names = c("a", "b", "c"), row.names = c(NA, -3L), class = "data.frame")
df[, c("a", names(df)[3])]
Output:
a c
1 1 7
2 2 8
3 3 9
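If this comes up often, the lookup can be wrapped in a small function. The sketch below is only an illustration: pick_cols is a made-up helper, not something from base R or a package. It turns every name into a position and leaves numbers as they are.

```r
df <- data.frame(a = c(1, 2, 3), b = c(4, 5, 6), c = c(7, 8, 9))

# pick_cols() is a hypothetical helper that accepts names and indices mixed
pick_cols <- function(data, ...) {
  idx <- vapply(list(...), function(s) {
    # names are looked up in names(data); numbers are used as positions
    if (is.character(s)) match(s, names(data)) else as.integer(s)
  }, integer(1))
  stopifnot(!anyNA(idx))  # fail early on an unknown column name
  data[, idx, drop = FALSE]
}

res <- pick_cols(df, "a", 3)
```

The reason df[, c("a", 3)] fails in the first place is that c() coerces 3 to the string "3", which is then treated as a column name.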