Combining 3 versions of same table together in R

I scraped some data from a website but it was really janky and for some reason had little mistakes in it. So, I scraped the same data 3 times, and produced 3 tables that look like:
library(data.table)
df1 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(2, 1, 3, 4),
otherthing = c(2,1, 3, 4)
)
df2 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(1, 1, 1, 4),
otherthing = c(2,2, 3, 4)
)
df3 <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(1, 1, 3, 4),
otherthing = c(2,1, 3, 3)
)
Except I have many more columns. I want to combine the 3 tables together, and when the values for "thing" and "other thing" etc. conflict, I want it to pick the value that has at least 2/3 and perhaps return an N/A if there is no 2/3 value. I'm confident the "name" and "id" field are good and they're what I want to sort of merge on.
I was considering setting the names for the tables to be, "thing1" "thing2" and "thing3" in the 3 tables respectively, merging together, and then writing some loops through the names. Is there a more elegant solution? It needs to work for 300+ value columns although I'm not super worried about speed.
In this example, the solution I think should be:
final_result <- data.table(name = c('adam', 'bob', 'carl', 'dan'),
id = c(1, 2, 3, 4),
thing=c(1, 1, 3, 4),
otherthing = c(2,1, 3, 4)
)

To generalize the approach from @IceCreamToucan, we can use:
library(dplyr)
n_mode <- function(...) {
x <- table(c(...))
if(any(x > 1)) as.numeric(names(x)[which.max(x)])
else NA
}
bind_rows(df1, df2, df3) %>%
group_by(name, id) %>%
summarise_all(n_mode) # funs() is deprecated as of dplyr 0.8.0; pass the function directly
N.B. Be careful how you name the function: something like n_mode() avoids masking base::mode. Finally, if you extend this to more data.frames, you probably want to put them in a list. If that's not possible or practical, you could replace the bind_rows() call with purrr::map_df(ls(pattern = "^df[[:digit:]]+"), get), which collects every object in the environment whose name starts with df followed by digits.

A data.table version of Jason's solution (you should leave his as accepted):
library(data.table)
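n_mode <- function(x) {
  x <- table(x) # count how often each value occurs
  # keep the most frequent value if it appears at least twice
  if (any(x > 1)) as.numeric(names(x)[which.max(x)])
  else NA_real_ # same type as the other branch, so every group returns a double
}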
my_list <- list(df1, df2, df3)
rbindlist(my_list)[, lapply(.SD, n_mode), .(name, id)]
# name id thing otherthing
# 1: adam 1 1 2
# 2: bob 2 1 1
# 3: carl 3 3 3
# 4: dan 4 4 4
Here's the output of rbindlist. Hopefully this makes it more clear why just taking n_mode of all the columns, grouped by name and id, gives the output you want.
rbindlist(my_list)[order(name, id)]
# name id thing otherthing
# 1: adam 1 2 2
# 2: adam 1 1 2
# 3: adam 1 1 2
# 4: bob 2 1 1
# 5: bob 2 1 2
# 6: bob 2 1 1
# 7: carl 3 3 3
# 8: carl 3 1 3
# 9: carl 3 3 3
# 10: dan 4 4 4
# 11: dan 4 4 4
# 12: dan 4 4 3
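With more than three scraped copies, "appears at least twice" is no longer the same as "appears in most copies". Here is a minimal sketch of a stricter rule, assuming numeric value columns. The name majority is a hypothetical helper, not something from either answer:

```r
library(data.table)

df1 <- data.table(name = c('adam', 'bob', 'carl', 'dan'), id = c(1, 2, 3, 4),
                  thing = c(2, 1, 3, 4), otherthing = c(2, 1, 3, 4))
df2 <- data.table(name = c('adam', 'bob', 'carl', 'dan'), id = c(1, 2, 3, 4),
                  thing = c(1, 1, 1, 4), otherthing = c(2, 2, 3, 4))
df3 <- data.table(name = c('adam', 'bob', 'carl', 'dan'), id = c(1, 2, 3, 4),
                  thing = c(1, 1, 3, 4), otherthing = c(2, 1, 3, 3))

# Hypothetical helper: return the value found in more than half of the
# copies, or NA when no value reaches a strict majority.
majority <- function(x) {
  counts <- table(x)
  top <- which.max(counts)
  if (counts[[top]] > length(x) / 2) as.numeric(names(counts)[top]) else NA_real_
}

# Stack the copies, then vote column by column within each (name, id) pair.
res <- rbindlist(list(df1, df2, df3))[, lapply(.SD, majority), by = .(name, id)]
```

For three copies this gives the same answer as n_mode(); the two only differ once there are four or more copies per row.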

Related

Remove rows of a data frame from another dataframe but keep duplicated in R

I'm working in R and I have two data frames: a base data frame, and another holding the rows that I need to remove from the base one. I can't use setdiff(), because it removes duplicated rows. Here's an example:
a <- data.frame(var1 = c(1, NA, 2, 2, 3, 4, 5),
var2 = c(1, 7, 2, 2, 3, 4, 5))
b <- data.frame(id = c(2, 4),
numero = c(2, 4))
And the result must be:
var1 var2
1 1
NA 7
2 2
3 3
5 5
It must be an efficient algorithm, too, because the base dataframe has 3 million rows with 26 columns.
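We may need to create a sequence column before joining. rowid() numbers repeated rows within each table, so including it in the anti-join makes each row of b remove only one matching row of a: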
library(data.table)
setDT(a)[, rn := rowid(var1, var2)][
  !setDT(b)[, rn := rowid(id, numero)],
  on = .(var1 = id, var2 = numero, rn)
][, rn := NULL][]
Output:
var1 var2
<num> <num>
1: 1 1
2: NA 7
3: 2 2
4: 3 3
5: 5 5
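The chained call above adds rn to a and b by reference. Here is a sketch of the same idea split into steps and run on copies, so the inputs stay untouched. The names a_rn, b_rn and res are introduced only for illustration:

```r
library(data.table)

a <- data.table(var1 = c(1, NA, 2, 2, 3, 4, 5),
                var2 = c(1, 7, 2, 2, 3, 4, 5))
b <- data.table(id = c(2, 4), numero = c(2, 4))

# Number repeated rows within each table: the first (2, 2) in a gets
# rn = 1, the second gets rn = 2.
a_rn <- copy(a)[, rn := rowid(var1, var2)]
b_rn <- copy(b)[, rn := rowid(id, numero)]

# Anti-join on the values plus the occurrence counter, so each row of b
# removes at most one row of a.
res <- a_rn[!b_rn, on = .(var1 = id, var2 = numero, rn)][, rn := NULL]
```

Only the first (2, 2) row is removed because b contains (2, 2) once. If b listed it twice, both copies in a would be dropped.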

Per group, select first row and another which matches a condition

Let's say I have the following data.table:
x <- data.table(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
And, after grouping by b, I want to select rows which:
are the first row of the group
have the highest a in the group
If a single row satisfies both conditions, it should only be selected once (the group will only contain one row).
Each of these selections is trivial:
x[, .SD[1], by = b] # selects first row per group
# b a
# 1: 1 1
# 2: 2 2
# 3: 3 10
x[, .SD[which.max(a)], by = b] # selects row with the highest 'a' in the group
# b a
# 1: 1 3
# 2: 2 7
# 3: 3 10
But I can't figure out how to do both at once (obviously .SD[1 | which.max(a)] doesn't work). I could perform them separately and then rbindlist the final result, but I'd like to know if there's a simpler way.
For clarity, in the case above, the expected output would be (different order is also acceptable):
b a
1: 1 1
2: 1 3
3: 2 2
4: 2 7
5: 3 10
One option is to concatenate the index 1 (for the first row) with the result of which.max(a), which also returns a numeric index. Take the unique() of that vector, in case which.max also returns 1, and use it to subset the data.table (.SD):
x[, .SD[unique(c(1, which.max(a)))], by = b]
# b a
#1: 1 1
#2: 1 3
#3: 2 2
#4: 2 7
#5: 3 10
Or use .I, which returns the row numbers in the full table, and subset once:
x[x[, .I[unique(c(1, which.max(a)))], by = b]$V1]
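To make the .I version easier to follow, here is the same call split into two steps. The intermediate names idx and res are just for illustration:

```r
library(data.table)

x <- data.table(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
                b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))

# Step 1: within each group, .I holds the row numbers of that group in the
# full table. Keep those of the first row and of the row with the highest a.
idx <- x[, .I[unique(c(1, which.max(a)))], by = b]$V1

# Step 2: subset the original table once with the collected row numbers.
res <- x[idx]
```

Here idx is 1, 2, 4, 7, 8, so a single subset returns the five expected rows.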
Here is how I would do it in dplyr:
library(dplyr)
x <- data.frame(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))
x %>% group_by(b) %>% filter(row_number() == 1 | a == max(a))
Output
# a b
#1: 1 1
#2: 3 1
#3: 2 2
#4: 7 2
#5: 10 3
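Note that a == max(a) keeps every row tied for the maximum, while which.max(a) keeps only the first one. If you want exactly one row per maximum, a sketch using slice() with the same index trick as the data.table answers might look like this:

```r
library(dplyr)

x <- data.frame(a = c(1, 3, 2, 2, 4, 3, 7, 10, 9, 8),
                b = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3))

# Within each group, keep position 1 and the position of the first maximum;
# unique() collapses the two when the first row is also the maximum.
res <- x %>%
  group_by(b) %>%
  slice(unique(c(1, which.max(a)))) %>%
  ungroup()
```

On the example data there are no ties, so both versions return the same five rows.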
If you only have those two columns, just take the union of the two tables:
funion(
x[, lapply(.SD, max), by=b],
x[, lapply(.SD, first), by=b]
)
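I'd expect max to be faster than your which.max here, since data.table optimizes it internally (see ?GForce). funion() also drops the duplicate row when the first row of a group is already its maximum.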

Create new identifier column in data frame with values from the name of containing nested list

I would like to create a new identifier column in each data frame, with values taken from the name of the nested list that contains it.
parent <- list(
a = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))),
b = list(
foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))))
Therefore, the result for the first data frame in list a would look like this:
> foo
first second identifier
1 1 4 a
2 2 5 a
3 3 6 a
The first data frame in list b would look like this:
>foo
first second identifier
1 1 4 b
2 2 5 b
3 3 6 b
Seems like you might want something like this
Map(function(name, list) {
lapply(list, function(x) cbind(x, identifier=name))
}, names(parent), parent)
Here we use Map() and take the list and the names of the list and just cbind those identifiers into the data.frames.
We could use tidyverse. Loop through the outer list with imap, which gives both the values and the keys (the names of the list) as .x and .y. Then, with map2, loop through the inner list of data.frames and mutate to create the column 'identifier' from .y, i.e. the name of the outer list:
library(tidyverse)
imap(parent, ~ map2(.x, .y, ~ .x %>%
mutate(identifier = .y)))
#$a
#$a$foo
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$a$bar
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$a$puppy
# first second identifier
#1 1 4 a
#2 2 5 a
#3 3 6 a
#$b
#$b$foo
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
#$b$bar
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
#$b$puppy
# first second identifier
#1 1 4 b
#2 2 5 b
#3 3 6 b
If we want the column to be based on the data.frame name instead, loop through the outer list elements with map, then use imap on the inner list to get the keys (the names of the inner list) and create the new column 'identifier':
map(parent, ~ imap(.x, ~ .x %>%
mutate(identifier = .y)))
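For comparison, here is a base R sketch of the same "name of the data.frame" variant. It assumes every inner list is named, and the result name res is introduced only for illustration:

```r
parent <- list(
  a = list(
    foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
    bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
    puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))),
  b = list(
    foo = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
    bar = data.frame(first = c(1, 2, 3), second = c(4, 5, 6)),
    puppy = data.frame(first = c(1, 2, 3), second = c(4, 5, 6))))

# For each inner list, pair every data.frame with its own name and
# attach that name as the identifier column.
res <- lapply(parent, function(inner) {
  Map(function(df, nm) cbind(df, identifier = nm), inner, names(inner))
})
```

The nesting and the list names are preserved, so res$a$bar is the original data.frame plus an identifier column filled with "bar".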

Sample by group with a condition (r)

I need to randomly select a diary for each individual (id) but only for those who filled more than one.
Let us suppose my data look like this
dta = rbind(c(1, 1, 'a'),
c(1, 2, 'a'),
c(1, 3, 'b'),
c(2, 1, 'a'),
c(3, 1, 'b'),
c(3, 2, 'a'),
c(3, 3, 'c'))
colnames(dta) <- c('id', 'DiaryNumber', 'type')
dta = as.data.frame(dta)
dta
id DiaryNumber type
1 1 a
1 2 a
1 3 b
2 1 a
3 1 b
3 2 a
3 3 c
For example, id 1 filled 3 diaries. What I need is to randomly select one of the 3 diaries. Id 2 only filled one diary, so I do not need to do anything with it.
I have no idea how I could do that.
Any ideas ?
You can use sample_n:
library(dplyr)
dta %>% group_by(id) %>% sample_n(1)
## Source: local data frame [3 x 3]
## Groups: id
##
## id DiaryNumber type
## 1 1 2 a
## 2 2 1 a
## 3 3 1 b
Base package:
set.seed(123)
df <- lapply(split(dta, dta$id), function(x) x[sample(nrow(x), 1), ])
do.call("rbind", df)
Output:
id DiaryNumber type
1 1 1 a
2 2 1 a
3 3 2 a
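In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). Here is a sketch of the same idea, assuming the data is built directly as a data.frame rather than through rbind(); the name picked is just for illustration:

```r
library(dplyr)

dta <- data.frame(id = c(1, 1, 1, 2, 3, 3, 3),
                  DiaryNumber = c(1, 2, 3, 1, 1, 2, 3),
                  type = c("a", "a", "b", "a", "b", "a", "c"))

set.seed(123) # only to make the random draw reproducible

# Draw one diary per id. An id with a single diary has only one row to
# draw from, so that row is always returned unchanged.
picked <- dta %>%
  group_by(id) %>%
  slice_sample(n = 1) %>%
  ungroup()
```

Because ids with one diary always return that diary, they need no special handling.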

data.table merge produces extra columns [R]

Below I define a master dataset of dimensions 12x5. I divide it into four data.tables and I want to merge them. There is no row ID overlap between data.tables and some column name overlap. When I merge them, merge() doesn't recognize column name matches, and creates new columns for every column in each data.table. The final merged data.table should be 12x5, but it is coming out as 12x7. I thought that the all=TRUE command in data.table's merge() would solve this.
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))
setkey(a, "id")
setkey(b, "id")
setkey(c, "id")
setkey(d, "id")
final <- merge(a, b, all = TRUE)
final <- merge(final, c, all = TRUE)
final <- merge(final, d, all = TRUE)
names(final)
dim(final) #outputs correct numb of rows, but too many columns
The problem is with the way you are using the merge() function.
By default, merge() in the data.table package merges two data.tables by the "shared key columns between them". Suppose you create the data.tables 'a' and 'b' like this:
library(data.table)
a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
setkey(a, "id")
setkey(b, "id")
where 'a' is going to be like:
id C1
1: 1 1
2: 2 2
3: 3 3
and 'b' is going to be like:
id C1 C2
1: 4 1 2
2: 5 2 3
3: 6 3 4
Now, let's first try your code:
merge(a, b, all = TRUE)
This is the result:
id C1.x C1.y C2
1: 1 1 NA NA
2: 2 2 NA NA
3: 3 3 NA NA
4: 4 NA 1 2
5: 5 NA 2 3
6: 6 NA 3 4
This happens because merge() takes only the 'id' field (the shared key of 'a' and 'b') as the merging column. C1 exists in both tables but is not part of the key, so it is treated as ordinary data and comes back twice, as C1.x and C1.y. Now let's try specifying which columns to merge on:
merge(a, b, by=c("id","C1"), all = TRUE)
Now the result is going to be:
id C1 C2
1: 1 1 NA
2: 2 2 NA
3: 3 3 NA
4: 4 1 2
5: 5 2 3
6: 6 3 4
The same applies to the other merge calls. So try this:
final <- merge(a, b, by = c("id", "C1"), all = TRUE)
final <- merge(final, c, by = "id", all = TRUE) # here you don't necessarily need to specify by
final <- merge(final, d, by = c("id", "C3"), all = TRUE)
dim(final)
[1] 12 5
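Since the four tables share no ids, they could also simply be stacked. This is a sketch of an alternative to merge(), not what the question asked about: rbindlist() with fill = TRUE matches columns by name and fills the missing ones with NA.

```r
library(data.table)

a <- data.table(id = c(1, 2, 3), C1 = c(1, 2, 3))
b <- data.table(id = c(4, 5, 6), C1 = c(1, 2, 3), C2 = c(2, 3, 4))
c <- data.table(id = c(7, 8, 9), C3 = c(5, 2, 7))
d <- data.table(id = c(10, 11, 12), C3 = c(8, 2, 3), C4 = c(4, 6, 8))

# Stack the tables row-wise; columns are aligned by name and any column
# missing from a table is filled with NA.
final <- rbindlist(list(a, b, c, d), fill = TRUE)
dim(final)
```

This only gives the intended 12x5 result because no id appears in more than one table. With overlapping ids, the rows would need merging as in the answer above.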
