Conditional adding of values in R with two dataframes [duplicate] - r

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 6 years ago.
I have two dataframes (df1, df2).
df1 <- data.frame(term = c("A", "B", "C", "D", "E", "F"))
df2 <- data.frame(term = c("C", "F", "G"), freq = c(7, 3, 5))
In df1, I want to add a column ("freq") based on the values of "freq" in df2. So if the term in df1 and the term in df2 match, the count ("freq") of this term should be added to df1. Else it should be "0" (zero).
How can I do this so that the processing time is as small as possible? Is there a way to do it with dplyr? I cannot figure it out.

If we need a faster option, a data.table join can be used along with assigning (:=) the NA values to 0 in place.
library(data.table)
setDT(df2)[df1, on = "term"][is.na(freq), freq := 0][]
Or, to avoid copies, as @Arun mentioned, create a 'freq' column in 'df1', then join on 'term' and replace 'freq' with the corresponding 'i.freq' values.
setDT(df1)[, freq := 0][df2, freq := i.freq, on = "term"]
Or use left_join
library(dplyr)
left_join(df1, df2, by = 'term') %>%
mutate(freq = replace(freq, is.na(freq), 0))
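For comparison, the same lookup can be sketched in base R without any join. This is only an illustration, not part of the original answer; it reuses the question's df1 and df2 and assumes each term occurs at most once in df2 (match() returns only the first hit).

```r
# Base R sketch; df1 and df2 are the data frames from the question.
df1 <- data.frame(term = c("A", "B", "C", "D", "E", "F"))
df2 <- data.frame(term = c("C", "F", "G"), freq = c(7, 3, 5))

# match() gives, for each term in df1, its row position in df2 (NA if absent).
idx <- match(df1$term, df2$term)

# Look up freq by position, then set the unmatched rows to 0.
df1$freq <- df2$freq[idx]
df1$freq[is.na(idx)] <- 0

df1
```

Whether this is faster than the data.table join depends on the data size, so it is worth benchmarking on the real data.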

Related

Efficient vectors accumulation by group in data frame [duplicate]

This question already has answers here:
Cumulatively paste (concatenate) values grouped by another variable
(6 answers)
Closed 3 years ago.
I currently have a large data table and I would like to accumulate a vector column (the classes column) for each group (id) along the years to get all past classes up to the current year in vector format.
EDIT: Previous topics (i.e. Cumulatively paste (concatenate) values grouped by another variable) have answered this question for character concatenation, which I don't want (because analyzing strings forces me to parse them first, which is computationally intensive on large datasets). I would like to accumulate the vectors and get a column of vectors as well. I think the solution is pretty close, but I just can't find the right syntax for it.
 
Sample data:
id year classes
----------------------------
1 2000 c("A", "B")
1 2001 c("C", "A")
1 2002 "D"
1 2003 "E"
2 2001 "A"
2 2002 c("A", "D")
2 2003 "E"
...
Expected output :
id year classes cumclasses
-----------------------------------------------------------
1 2000 c("A", "B") c("A", "B")
1 2001 c("C", "A") c("A", "B", "C", "A")
1 2002 "D" c("A", "B", "C", "A", "D")
1 2003 "E" c("A", "B", "C", "A", "D", "E")
2 2001 "A" "A"
2 2002 c("A", "D") c("A", "A", "D")
2 2003 "E" c("A", "A", "D", "E")
...
My goal is to find an efficient solution because my dataset is fairly large.
For now I have a working (but ultra slow) solution using dplyr and purrr :
dt2 <- dt %>%
setkeyv(c("id", "year")) %>%
group_by(id) %>%
mutate(cumclasses = accumulate(classes, append))
I'm looking for a data.table solution of the type:
#not working example
dt2 <- dt[, cumclasses := accumulate(classes, append), by = id]
or even a base R solution, the faster the better !
Thank you!
If you want to reproduce sample data please copy the following code:
dt <- data.table(id =
c(1,1,1,1,2,2,2),
year =
c(2000,2001,2002,2003,2001,2002,2003),
classes =
list(c('A', 'B'), c('C', 'A'), 'D', 'E', 'A', c('A', 'D'), 'E'), key = 'id')
EDIT [SOLVED]:
A working solution is (using data.table and purrr):
dt[, cumClasses := list(accumulate(classes, append)), by = id]
One option would be to group by 'id', loop over the sequence of rows, extract the 'classes' elements up to each row, and paste them together after unlisting the list column. Note that this yields comma-separated strings rather than vectors.
dt[, cumClasses := sapply(seq_len(.N), function(i) toString(unlist(classes[seq_len(i)]))), id][,
cumClasses := as.list(cumClasses)][]
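Since the question also asks for a base R route, here is a minimal sketch that keeps the accumulated values as vectors, using Reduce() with accumulate = TRUE inside each group. The helper name cum_by_id is made up for illustration, and the sketch assumes the rows are already sorted by year within each id.

```r
# Sample data from the question, as a plain data frame with a list column.
dt <- data.frame(id   = c(1, 1, 1, 1, 2, 2, 2),
                 year = c(2000, 2001, 2002, 2003, 2001, 2002, 2003))
dt$classes <- list(c("A", "B"), c("C", "A"), "D", "E", "A", c("A", "D"), "E")

# Hypothetical helper: cumulative concatenation of a list column within groups.
cum_by_id <- function(classes, id) {
  out <- vector("list", length(classes))
  # split() yields the row positions belonging to each id.
  for (idx in split(seq_along(classes), id)) {
    # Reduce(..., accumulate = TRUE) returns every intermediate result of c().
    out[idx] <- Reduce(c, classes[idx], accumulate = TRUE)
  }
  out
}

dt$cumclasses <- cum_by_id(dt$classes, dt$id)
dt$cumclasses[[4]]
```

No timing claim is made here; on a large table it should be compared against the data.table and purrr version above.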

Subsetting data from a dataframe and taking specific values from the subsetted values

I want to check whether values (in the example below, "letters") in one dataframe appear in another dataframe. If they do, I want a value from the first dataframe that is specific to each match (in the example below, "ranking") to be added to the second dataframe. What I have now is the following:
Df1 <- data.frame(c("A", "C", "E"), c(1:3))
colnames(Df1) <- c("letters", "ranking")
Df2 <- data.frame(c("A", "B", "C", "D", "E"))
colnames(Df2) <- c("letters")
Df2$rank <- ifelse(Df2$letters %in% Df1$letters, 1, 0)
However... Instead of getting a '1' when the letters overlap, I want to get the specific 'ranking' number from Df1.
Thanks!
What you're looking for is called a merge:
merge(Df2, Df1, by="letters", all.x=TRUE)
Also, fun fact, you can create a dataframe and name the columns at the same time (and you'll usually want to "turn off" strings as factors):
df1 <- data.frame(
letters = c("a", "b", "c"),
ranking = 1:3,
stringsAsFactors = FALSE)
The dplyr package works well for this:
Df2 <- Df2 %>%
left_join(Df1,by = "letters")
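This keeps every row of Df2 and shows NA for the letters without a match ("B" and "D").
Otherwise you can use semi_join:
Df2 <- Df2 %>%
semi_join(Df1, by = "letters")
This keeps only the rows whose letters appear in both data frames (the intersection). Note that semi_join returns only the columns of Df2, so it does not add the ranking column.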
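If neither merge nor dplyr is wanted, a named-vector lookup is another way to pull the ranking across. The sketch below is illustrative only; the name lookup is introduced here, and it assumes the letters in Df1 are unique.

```r
Df1 <- data.frame(letters = c("A", "C", "E"), ranking = 1:3)
Df2 <- data.frame(letters = c("A", "B", "C", "D", "E"))

# Named vector: names are the letters, values are the rankings.
# as.character() guards against factor columns in older R versions.
lookup <- setNames(Df1$ranking, as.character(Df1$letters))

# Indexing by name returns NA for letters that are not in Df1.
Df2$rank <- unname(lookup[as.character(Df2$letters)])

Df2
```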

Create an ordered list in a dataframe [duplicate]

This question already has answers here:
Sorting each row of a data frame [duplicate]
(2 answers)
Row wise Sorting in R
(2 answers)
Row-wise sort then concatenate across specific columns of data frame
(2 answers)
Closed 5 years ago.
I have the following data frame:
col1 <- c("a", "b", "c")
col2 <- c("c", "a", "d")
col3 <- c("b", "c", "a")
df <- data.frame(col1,col2,col3)
I want to create a new column in this data frame that has, for each row, the ordered list of the columns col1, col2, col3. So, for the first row it would be a list like "a", "b", "c".
The way I'm handling it is to create a loop but since I have 50k rows, it's quite inefficient, so I'm looking for a better solution.
rown <- nrow(df)
i = 0
while(i<rown){
i = i +1
col1 <- df$col1[i]
col2 <- df$col2[i]
col3 <- df$col3[i]
col1 <- as.character(col1)
col2 <- as.character(col2)
col3 <- as.character(col3)
list1 <- c(col1, col2, col3)
list1 <- list1[order(sapply(list1, '[[', 1))]
a <- list1[1]
b <- list1[2]
c <- list1[3]
df$col.list[i] <- paste(a, b, c, sep = " ")
}
Any ideas on how to make this code more efficient?
EDIT: the other question is not relevant in my case since I need to paste the three columns after sorting each row, so it's the paste statement that is dynamic, I'm not trying to change the data frame by sorting.
Expected output:
col1 col2 col3 col.list
a c b a b c
b a c a b c
c d a a c d
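No answer is recorded for this question here. One possible way to replace the loop, offered as a sketch rather than a benchmarked solution, is to sort and paste each row with apply():

```r
df <- data.frame(col1 = c("a", "b", "c"),
                 col2 = c("c", "a", "d"),
                 col3 = c("b", "c", "a"))

# apply() over rows (MARGIN = 1) passes each row as a character vector;
# sort it, then collapse it into a single space-separated string.
df$col.list <- apply(df[, c("col1", "col2", "col3")], 1,
                     function(row) paste(sort(row), collapse = " "))

df
```

apply() converts the selected columns to a character matrix first, so this also works when the columns are factors.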

Sort and concatenate values by group [duplicate]

This question already has answers here:
Collapse / concatenate / aggregate a column to a single comma separated string within each group
(6 answers)
Closed 6 years ago.
I've got a list of Groups and Names, as seen in DF below. I'm looking to arrange this list alphabetically and concatenate each name separated by a comma, as seen in DF2 below. I thought this would be simple, but it is proving to be more challenging than expected!
DF <- tibble::data_frame(
Group = c(1, 1, 1, 2, 2, 3, 3, 3),
Name = c("A", "B", "C", "B", "A", "B", "C", "A"))
DF2 <- tibble::data_frame(
Group = c(1, 2, 3),
Name = c("A, B, C", "A, B", "A, B, C"))
I'd appreciate any help in solving this to account for an unknown number of names listed per group, either with or without a dplyr pipeline.
Thanks!
We can use data.table
library(data.table)
setDT(DF)[order(Name), .(Comb = toString(Name)) , by = Group]
In base R:
aggregate(Name~Group, DF, function(x) paste0(sort(x), collapse = ","))
# Group Name
#1 1 A,B,C
#2 2 A,B
#3 3 A,B,C
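Because the question mentions a dplyr pipeline, the same idea can be sketched with group_by() and summarise(). This is an illustrative variant, not one of the original answers; toString() joins with ", ", which matches the format shown in DF2.

```r
library(dplyr)

DF <- data.frame(Group = c(1, 1, 1, 2, 2, 3, 3, 3),
                 Name  = c("A", "B", "C", "B", "A", "B", "C", "A"),
                 stringsAsFactors = FALSE)

# Sort the names within each group, then collapse them into one string.
DF2 <- DF %>%
  group_by(Group) %>%
  summarise(Name = toString(sort(Name)))

DF2
```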

Most efficient to append some columns of a data frame to some other columns

Suppose I have the following data frame:
foo <- data.frame(a=letters,b=seq(1,26),
n1=rnorm(26),n2=rnorm(26),
u1=runif(26),u2=runif(26))
I want to append columns u1 and u2 to columns n1 and n2. For now, I found the following way:
df1 <- foo[,c("a","b","n1","n2")]
df2 <- foo[,c("a","b","u1","u2")]
names(df2) <- names(df1)
bar <- rbind(df1,df2)
That does the trick. However, it seems a little bit involved. Am I too picky? Or is there a faster/simpler way to do this in R?
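Here is one way using full_join() from dplyr. Note that it relies on df2 still having its original column names u1 and u2, i.e. before the names(df2) <- names(df1) step in the question: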
library(dplyr)
full_join(df1, df2, by = c("a", "b", "n1" = "u1", "n2" = "u2"))
From the documentation:
full_join
return all rows and all columns from both x and y. Where
there are not matching values, returns NA for the one missing.
by
a character vector of variables to join by. If NULL, the default,
join will do a natural join, using all variables with common names
across the two tables. A message lists the variables so that you can
check they're right.
To join by different variables on x and y use a named vector. For
example, by = c("a" = "b") will match x.a to y.b.
Use Map() to concatenate the columns, and cbind() with recycling to arrive at the final data frame.
cbind(foo[1:2], Map(c, foo[3:4], foo[5:6]))
Substitute numerical indexes with column names, if desired.
cbind(foo[c("a", "b")], Map(c, foo[c("n1", "n2")], foo[c("u1", "u2")]))
Short-hand:
rbind(foo[1:4], setNames(foo[c(1, 2, 5, 6)], names(foo[1:4])))
Long-winded:
rbind(foo[c("a", "b", "n1", "n2")], setNames(foo[c("a", "b", "u1", "u2")], c("a", "b", "n1", "n2")))
Long-winded (more DRY):
nms <- c("a", "b", "n1", "n2")
rbind(foo[nms], setNames(foo[c("a", "b", "u1", "u2")], nms))
