Setting different values in duplicated observations to NA - r

I have a data frame (DF) that looks as follows:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c("2", "1", "1", "2", "1"), VALUE4 = c("2", "1",
"2", "1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
Uniqueness in this case is defined by ID and REPORTER, so the DF above contains a duplicate for ID 123 with REPORTER ONE and another for ID 789 with REPORTER THREE. Since I cannot tell which values of VALUE1 to VALUE4 are the correct ones, I would like to set every value that differs within a duplicate to NA.
This means I first have to identify the VALUE columns that contain different values within a duplicate. These are the ones to be set to NA. I would like to keep the rest of the data, since there I can tell the value is correct.
The expected output would look like this:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c(NA, NA, "1", NA, NA), VALUE4 = c(NA, NA, "2",
"1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
The goal is to ensure data quality. I don't want to simply remove the problem cases, since I can still use the values that do not differ for analysis. But I also don't want to just use one of the rows, because this would lead to wrong conclusions if I picked the wrong values.
How can I do this?

I think this is what you are looking for:
library(reshape2)
DFL <- melt(cbind(rn = 1:nrow(DF), DF), id.vars = c("rn", "ID", "REPORTER"))
DFL$value2 <- ave(DFL$value, DFL[c("ID", "REPORTER", "variable")],
                  FUN = function(x) {
                    ifelse(length(unique(x)) > 1, NA, x)
                  })
dcast(DFL, rn + ID + REPORTER ~ variable, value.var = "value2")
# rn ID REPORTER VALUE1 VALUE3 VALUE4
# 1 1 123 ONE 1 <NA> <NA>
# 2 2 123 ONE 1 <NA> <NA>
# 3 3 456 TWO 2 1 2
# 4 4 789 THREE 1 <NA> 1
# 5 5 789 THREE 1 <NA> 1
As you can see, I had to add a dummy "rn" supplementary ID variable to make sure that dcast wouldn't just collapse all the values into one row per ID+REPORTER combination.
Update
This is actually also entirely doable with base R's reshape and the ave step described above:
DFL <- reshape(DF, direction = "long",
               varying = grep("VALUE", names(DF)), sep = "")
DFL <- within(DFL, {
  VALUE <- ave(VALUE, ID, REPORTER, time, FUN = function(x)
    ifelse(length(unique(x)) > 1, NA, x))
})
reshape(DFL)
# ID REPORTER id VALUE1 VALUE3 VALUE4
# 1.1 123 ONE 1 1 <NA> <NA>
# 2.1 123 ONE 2 1 <NA> <NA>
# 3.1 456 TWO 3 2 1 2
# 4.1 789 THREE 4 1 <NA> 1
# 5.1 789 THREE 5 1 <NA> 1
In the last line above, the attributes from the original reshape statement make it so we don't have to even worry about what arguments we need to put in. :-)
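The same rule can also be applied without reshaping at all. What follows is a minimal base R sketch, not part of the answer above: it rebuilds DF from the question, counts the distinct values per ID/REPORTER group in each VALUE column with tapply, and sets the column to NA for every group that holds more than one distinct value. The names value_cols, grp and res are made up for illustration.

```r
# Example data from the question
DF <- data.frame(
  ID       = c("123", "123", "456", "789", "789"),
  REPORTER = c("ONE", "ONE", "TWO", "THREE", "THREE"),
  VALUE1   = c("1", "1", "2", "1", "1"),
  VALUE3   = c("2", "1", "1", "2", "1"),
  VALUE4   = c("2", "1", "2", "1", "1"),
  stringsAsFactors = FALSE
)

value_cols <- grep("^VALUE", names(DF), value = TRUE)
# One group per ID/REPORTER combination that actually occurs
grp <- interaction(DF$ID, DF$REPORTER, drop = TRUE)

res <- DF
for (col in value_cols) {
  # Number of distinct values per group in this column
  n_unique <- tapply(DF[[col]], grp, function(x) length(unique(x)))
  # TRUE for rows whose group holds conflicting values
  conflict <- as.vector(n_unique[as.character(grp)] > 1)
  res[[col]][conflict] <- NA
}
res
```

Because nothing is melted or cast, the row order and row count of DF are kept, so no helper "rn" column is needed here.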

I created a function replaceDifferent() that looks like this:
replaceDifferent <- function(vector) {
  # Return NA if the values differ, otherwise the single common value.
  # (Avoids naming local variables max/min, which would shadow the functions.)
  if (max(vector) != min(vector)) {
    return(NA)
  }
  min(vector)
}
Then I melted the DF with melt() from the reshape2 package:
DFmelt <- melt(DF, id = c("ID", "REPORTER"))
After that I was able to apply the new function to the melted data frame with ddply() from the plyr package:
DFres <- ddply(DFmelt, .(ID, REPORTER, variable), function(x){replaceDifferent(x$value)})
To get the result data frame with duplicates removed I called dcast() on DFres:
DFres <- dcast(DFres, ID+REPORTER ~ variable)
This produces a slightly different output than the one I asked for, but it is better in that I no longer have to deal with duplicates.
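For readers without reshape2 and plyr, a similar collapsed result can probably be obtained with base R's aggregate(). This is only a sketch under the assumption that one row per ID/REPORTER is wanted; collapse_value and collapsed are illustrative names, not part of the code above.

```r
# Example data from the question
DF <- data.frame(
  ID       = c("123", "123", "456", "789", "789"),
  REPORTER = c("ONE", "ONE", "TWO", "THREE", "THREE"),
  VALUE1   = c("1", "1", "2", "1", "1"),
  VALUE3   = c("2", "1", "1", "2", "1"),
  VALUE4   = c("2", "1", "2", "1", "1"),
  stringsAsFactors = FALSE
)

value_cols <- grep("^VALUE", names(DF), value = TRUE)

# NA when a group holds conflicting values, otherwise the common value
collapse_value <- function(x) {
  if (length(unique(x)) > 1) NA_character_ else x[1]
}

# One row per ID/REPORTER combination
collapsed <- aggregate(DF[value_cols],
                       by = DF[c("ID", "REPORTER")],
                       FUN = collapse_value)
collapsed
```

The row order of the result is determined by aggregate(), so it may differ from the order in DF.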

Related

Merging 2 data sets with different number of rows, matched on a column, and creating NA values

I'm trying to accomplish something that allows me to merge two datasets with differing number of rows, match them on a common column and create NA values where there isn't matching data. For some reason, when I'm merging, the newly created data frame is auto filling values that should be NA and creating extra rows that I don't want. I'm trying to merge df_add (which has a total of 6 rows) into df_main (which has a total of 4 rows) and match the 2 on column "match_id" in df_main and "other_id" in df_add.
df_main <- data.frame (match_id = c("1", "1", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", "2006-09-13", "2006-09-13"),
type = c("Good", "Good", "Bad", "Bad")
)
df_add <- data.frame (other_id = c("1", "1", "1", "2", "2", "2"),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
This code is the closest I've gotten so far. It gives me the 6 rows that I want with the NA values, but it doesn't match "match_id" and "other_id":
merge(df_main, df_add, by = 0, all = TRUE)[-1]
This is what I want my final merged data set to look like with only a total of 6 rows:
df_goal <- data.frame (match_id = c("1", "1", "1", "2", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", NA, "2006-09-13", "2006-09-13", NA),
type = c("Good", "Good", NA, "Bad", "Bad", NA),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
df_goal
Is there a way to accomplish this in r? Any help would be greatly appreciated!
This is really not a merge operation; it is closer to a cbind by id.
ids <- unique(c(df_main$match_id, df_add$other_id))
ids
# [1] "1" "2"
mains <- split(df_main, df_main$match_id)
adds <- split(df_add, df_add$other_id)
do.call(rbind,
  Map(function(x1, x2) {
    nr <- max(nrow(x1), nrow(x2))
    cbind(
      rbind(x1, x1[0,][rep(NA, nr - nrow(x1)),]),
      rbind(x2, x2[0,][rep(NA, nr - nrow(x2)),])
    )
  }, mains[ids], adds[ids])
)
# match_id index_date type other_id measure_date wt
# 1.1 1 2006-09-13 Good 1 2005-01-01 10
# 1.2 1 2006-09-13 Good 1 2005-03-13 11
# 1.NA <NA> <NA> <NA> 1 2005-04-19 15
# 2.3 2 2006-09-13 Bad 2 2005-06-22 60
# 2.4 2 2006-09-13 Bad 2 2005-09-29 42
# 2.NA <NA> <NA> <NA> 2 2005-11-03 33
The use of [ids] is solely to ensure that the _id variables are in the same order. This will run into problems if an id is in one and not the other, though if that's a possibility then it's possible to overcome that ...
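The padding step inside the Map call relies on the fact that indexing the rows of a data frame with NA yields rows filled with NA. As a rough illustration, that idiom can be isolated in a small helper; pad_rows is a hypothetical name and the two-row input is made up for the example.

```r
# Hypothetical helper: append all-NA rows to x until it has n rows
pad_rows <- function(x, n) {
  extra <- max(n - nrow(x), 0)
  # Indexing rows with NA produces rows in which every column is NA
  rbind(x, x[rep(NA_integer_, extra), , drop = FALSE])
}

x1 <- data.frame(match_id = c("1", "1"),
                 type     = c("Good", "Good"),
                 stringsAsFactors = FALSE)

padded <- pad_rows(x1, 3)
padded
```

When x already has n or more rows, extra is 0 and the input comes back unchanged.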
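Below is a solution with the package data.table. I have added the variable id_row, a row counter within each *_id group, and then joined on it as well. Note that df_main[df_add] keeps every row of df_add (a right outer join), which is enough here because df_add has at least as many rows per id as df_main.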
library(data.table)
df_main <- data.frame (match_id = c("1", "1", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", "2006-09-13", "2006-09-13"),
type = c("Good", "Good", "Bad", "Bad")
)
df_add <- data.frame (other_id = c("1", "1", "1", "2", "2", "2"),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
df_goal <- data.frame (match_id = c("1", "1", "1", "2", "2", "2"),
index_date = c("2006-09-13", "2006-09-13", NA, "2006-09-13", "2006-09-13", NA),
type = c("Good", "Good", NA, "Bad", "Bad", NA),
measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22", "2005-09-29", "2005-11-03"),
wt = c(10, 11, 15, 60, 42, 33)
)
# convert to data.table
setDT(df_main)
setDT(df_add)
# define a row counter within each match_id / other_id group
df_main[ , id_row := 1L:.N, by = match_id]
df_add[ , id_row := 1L:.N, by = other_id]
# rename other_id to match_id
setnames(df_add, "other_id", "match_id")
# set joining keys
setkey(df_main, match_id, id_row)
setkey(df_add, match_id, id_row)
# right outer join: keep every row of df_add
out = df_main[ df_add ]
out
#> match_id index_date type id_row measure_date wt
#> 1: 1 2006-09-13 Good 1 2005-01-01 10
#> 2: 1 2006-09-13 Good 2 2005-03-13 11
#> 3: 1 <NA> <NA> 3 2005-04-19 15
#> 4: 2 2006-09-13 Bad 1 2005-06-22 60
#> 5: 2 2006-09-13 Bad 2 2005-09-29 42
#> 6: 2 <NA> <NA> 3 2005-11-03 33
Created on 2022-09-23 with reprex v2.0.2
You're missing a column to join by. We can create it and then slightly modify your code:
df_main$id2 <- ave(df_main$match_id, df_main$match_id, FUN = seq_along)
df_add$id2 <- ave(df_add$other_id, df_add$other_id, FUN = seq_along)
merge(df_main, df_add, by.x = c("match_id", "id2"), by.y = c("other_id", "id2"), all = TRUE)
#> match_id id2 index_date type measure_date wt
#> 1 1 1 2006-09-13 Good 2005-01-01 10
#> 2 1 2 2006-09-13 Good 2005-03-13 11
#> 3 1 3 <NA> <NA> 2005-04-19 15
#> 4 2 1 2006-09-13 Bad 2005-06-22 60
#> 5 2 2 2006-09-13 Bad 2005-09-29 42
#> 6 2 3 <NA> <NA> 2005-11-03 33
Created on 2022-09-27 by the reprex package (v2.0.1)
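Since all = TRUE requests a full outer join, this approach should also cope with an id that occurs in only one of the two data frames. The sketch below checks that under an assumption: the id "3" with its date and weight is invented for the example, and the counter is built from row positions so that id2 is an integer rather than a character.

```r
df_main <- data.frame(
  match_id   = c("1", "1", "2", "2"),
  index_date = "2006-09-13",
  type       = c("Good", "Good", "Bad", "Bad"),
  stringsAsFactors = FALSE
)
# Same as the question's df_add, plus a made-up id "3" that is not in df_main
df_add <- data.frame(
  other_id     = c("1", "1", "1", "2", "2", "2", "3"),
  measure_date = c("2005-01-01", "2005-03-13", "2005-04-19", "2005-06-22",
                   "2005-09-29", "2005-11-03", "2005-12-01"),
  wt           = c(10, 11, 15, 60, 42, 33, 50),
  stringsAsFactors = FALSE
)

# Integer row counter within each id
df_main$id2 <- ave(seq_len(nrow(df_main)), df_main$match_id, FUN = seq_along)
df_add$id2  <- ave(seq_len(nrow(df_add)),  df_add$other_id,  FUN = seq_along)

# Full outer join on id and within-id counter
out <- merge(df_main, df_add,
             by.x = c("match_id", "id2"),
             by.y = c("other_id", "id2"),
             all = TRUE)
out
```

The unmatched id ends up as a row of its own with NA in index_date and type.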

How to subset R data frame based on duplicates in one column and unique values in another

This seems pretty straightforward but I am stumped. I have a data frame that looks like this:
df1 %>% head()
values paired
<ch> <int>
1 apples 1
2 x 1
3 oranges 2
4 z 2
5 bananas 3
6 y 3
7 apples 4
8 p 4
I would like to create a new data frame that extracts all paired values based on a search criterion. So if I want all pairs that correspond to apples, I would like to end up with something like this:
df1 %>% head()
values paired
<ch> <int>
1 apples 1
2 x 1
3 apples 4
4 p 4
I have tried using:
new_pairs <- df1 %>%
arrange(values, paired) %>%
filter(duplicated(paired) == TRUE,
values=="apples")
But I am getting only the apple rows back.
You'll need to group on the paired variable before filtering.
How about:
library(dplyr)
df1 %>%
group_by(paired) %>%
filter("apples" %in% values) %>%
ungroup()
Result:
# A tibble: 4 x 2
values paired
<chr> <int>
1 apples 1
2 x 1
3 apples 4
4 p 4
Your data:
df1 <- structure(list(values = c("apples", "x", "oranges", "z", "bananas", "y", "apples", "p"),
paired = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)),
class = "data.frame",
row.names = c("1", "2", "3", "4", "5", "6", "7", "8"))
Here is another tidyverse possibility. I filter for the rows that have apples and also keep the rows that immediately follow apples.
library(tidyverse)
df %>%
filter((values == "apples" |
row_number() %in% c(which(
values == "apples", arr.ind = TRUE
) + 1)))
Output
values paired
1 apples 1
2 x 1
3 apples 4
4 p 4
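Note that keeping "the row after apples" assumes the partner always directly follows the apples row. A small base R sketch with made-up data, where the partner comes first, shows how the positional rule and the pair-based rule can diverge (by_position and by_pair are illustrative names):

```r
# Made-up data: the partner "x" comes BEFORE "apples" within pair 1
df <- data.frame(values = c("x", "apples", "oranges", "z"),
                 paired = c(1L, 1L, 2L, 2L),
                 stringsAsFactors = FALSE)

# Positional rule: apples rows plus the rows right after them
hit <- which(df$values == "apples")
by_position <- df[sort(unique(c(hit, hit + 1))), ]

# Pair-based rule: every row whose pair id contains apples
by_pair <- df[df$paired %in% df$paired[df$values == "apples"], ]

by_position$values
by_pair$values
```

With the question's data both rules agree, because each apples row is listed first in its pair.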
Here is a data.table option (subset is only used to change the order of the columns):
library(data.table)
dt <- as.data.table(df)
subset(dt[, .SD[any(values == "apples")], by = paired], select = c("values", "paired"))
values paired
1: apples 1
2: x 1
3: apples 4
4: p 4
Data
df <-
structure(list(
values = c("apples", "x", "oranges", "z", "bananas",
"y", "apples", "p"),
paired = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L)
),
class = "data.frame",
row.names = c(NA,-8L))
In base R, find the values of the pairs of interest
pairs = subset(df1, values %in% "apples")$paired
and create a subset of the data
subset(df1, paired %in% pairs)

Count words in each cell of a dataframe in R

I have a dataframe that looks like
df <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("word1, word2", "word1", "word1"),
Variable2 = c("word1", "word1, word2", "word1"),
Variable3 = c("word1, word2", "word1", "word1, word2, word3")),
row.names = c(NA, -3L), class = "data.frame")
and would like to create a df that counts the words in each cell (separated by ",") and puts that number into the cell:
df2 <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("2", "1", "1"),
Variable2 = c("1", "2", "1"),
Variable3 = c("2", "1", "3")),
row.names = c(NA, -3L), class = "data.frame")
Would someone be able to help me in how this would be done?
Thanks!
Using dplyr and stringi:
library(dplyr)
df %>%
mutate(across(matches("variable\\d{1,}"),stringi::stri_count_words))
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3
If a base-R solution is desired, you could try this. Count the number of characters in each value with nchar, and subtract the number of characters left after removing the commas. The difference is the number of commas, and adding 1 gives the number of words/phrases separated by commas. This should be fast too (also see this answer).
cbind(df[1], t(apply(df[-1], 1, \(x) {
  nchar(x) - nchar(gsub(",", "", x, fixed = TRUE)) + 1
})))
Output
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3
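Another base R possibility, offered only as a sketch, is to split each cell on the comma and count the pieces with lengths(). The name counts is made up; df is rebuilt from the question.

```r
df <- data.frame(
  Variable  = c("Factor1", "Factor2", "Factor3"),
  Variable1 = c("word1, word2", "word1", "word1"),
  Variable2 = c("word1", "word1, word2", "word1"),
  Variable3 = c("word1, word2", "word1", "word1, word2, word3"),
  stringsAsFactors = FALSE
)

counts <- df
# Split every cell on "," and count the resulting pieces, column by column
counts[-1] <- lapply(df[-1], function(col) {
  lengths(strsplit(col, ",", fixed = TRUE))
})
counts
```

Empty strings and trailing commas are counted differently by strsplit() than by the comma-counting approach, so the two are only interchangeable for cells shaped like the ones in the question.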

R: Combining dataset and lookup-table to extract value to new column

I want to combine two data frames: df1 with 15,000 observations and df2 with 2.3 million. I'm trying to match values: if df1$col1 == df2$c1 AND df1$col2 == df2$c2, then insert the value from df2$dummy into df1$col3. If both do not match, do nothing. All values are 8 digits, except df2$dummy, which is a dummy of 0 or 1.
df1 col1 col2 col3
1 25382701 65352617 -
2 22363658 45363783 -
3 20019696 23274747 -
df2 c1 c2 dummy
1 17472802 65548585 1
2 20383829 24747473 0
3 20019696 23274747 0
4 01382947 21930283 1
5 22123425 65382920 0
In the example the only match is row 3, and the value 0 from the dummy column should be inserted in col3 row3.
I've tried to make a look-up table and a function using for and if, but have not found a solution when a match is required on both columns. (Needless to say, I'm new to R and programming.)
We can use a join in data.table
library(data.table)
df1$col3 <- NULL
setDT(df1)[df2, col3 := i.dummy, on = .(col1 = c1, col2 = c2)]
df1
# col1 col2 col3
#1: 25382701 65352617 NA
#2: 22363658 45363783 NA
#3: 20019696 23274747 0
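If data.table is not available, a two-column lookup can be sketched in base R by pasting both columns into a single key and using match(). The names key1 and key2 are illustrative, and the data are rebuilt from the example in the question.

```r
df1 <- data.frame(col1 = c(25382701L, 22363658L, 20019696L),
                  col2 = c(65352617L, 45363783L, 23274747L))
df2 <- data.frame(c1    = c(17472802L, 20383829L, 20019696L, 1382947L, 22123425L),
                  c2    = c(65548585L, 24747473L, 23274747L, 21930283L, 65382920L),
                  dummy = c(1L, 0L, 0L, 1L, 0L))

# Build one key per row from both columns
key1 <- paste(df1$col1, df1$col2, sep = "_")
key2 <- paste(df2$c1, df2$c2, sep = "_")

# match() returns the position of the first matching key in df2, or NA
df1$col3 <- df2$dummy[match(key1, key2)]
df1
```

match() only uses the first hit, so this assumes each c1/c2 combination occurs at most once in df2. If ids can start with 0, as 01382947 does in the question, storing them as character rather than integer may be safer so that the leading zero is not lost.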
data
df1 <- structure(list(col1 = c(25382701L, 22363658L, 20019696L), col2 = c(65352617L,
45363783L, 23274747L), col3 = c("-", "-", "-")), class = "data.frame", row.names = c("1",
"2", "3"))
df2 <- structure(list(c1 = c(17472802L, 20383829L, 20019696L, 1382947L,
22123425L), c2 = c(65548585L, 24747473L, 23274747L, 21930283L,
65382920L), dummy = c(1L, 0L, 0L, 1L, 0L)), class = "data.frame",
row.names = c("1",
"2", "3", "4", "5"))

R lapply need to use a different input depending on which value is being evaluated

Say I have a list c of three data frames:
> c
$first
a b
1 1 2
2 2 3
3 3 4
$second
a b
1 2 4
2 4 6
3 6 8
$third
a b
1 3 6
2 6 9
3 9 12
I want to run an lapply on c that will do a custom function on each data frame.
The custom function depends on three numbers and I want the function to use a different number depending on which data frame it's evaluating.
I was thinking of utilizing the names 'first', 'second', and 'third', but I'm unsure how to get those names once they're inside the lapply function. It would look something like this:
lapply(c, function(list, num1 = 1, num2 = -1, num3 = 0) {num <- ifelse(names(list) == "first", num1, ifelse(names(list) == "second", num2, num3)); return(list*num)})
So the result I would want would be first multiplied by 1, second multiplied by -1, and third multiplied by 0.
The names function gives the values a and b (the column names) instead of the name of the data frame itself, so that doesn't work. Is there a function that would be able to give me the 'first', 'second', and 'third' values I need?
Or alternatively, is there a better way of doing this in a lapply function?
Maybe it would be easier with Map. We pass the multipliers in the order we want and do a simple multiplication:
Map(`*`, lst1, c(1, -1, 0))
If the numbers are named
num1 <- setNames(c(1, -1, 0), c("first", "third", "second"))
then, match with the names of the list
Map(`*`, lst1, num1[names(lst1)])
#$first
# a b
#1 1 2
#2 2 3
#3 3 4
#$second
# a b
#1 0 0
#2 0 0
#3 0 0
#$third
# a b
#1 -3 -6
#2 -6 -9
#3 -9 -12
Or, if we decide to go with lapply, loop over the names of the list and extract each list element by name, along with the corresponding element of the named vector (note that the result is an unnamed list):
lapply(names(lst1), function(nm) lst1[[nm]] * num1[nm])
Or with sapply, which keeps the names:
sapply(names(lst1), function(nm) lst1[[nm]] * num1[nm], simplify = FALSE)
Or another option is map2 from purrr
library(purrr)
map2(lst1, num1[names(lst1)], `*`)
Note: c is a function name, and it is not recommended to give objects the same name as a function.
data
lst1 <- list(first = structure(list(a = 1:3, b = 2:4), class = "data.frame",
row.names = c("1",
"2", "3")), second = structure(list(a = c(2L, 4L, 6L), b = c(4L,
6L, 8L)), class = "data.frame", row.names = c("1", "2", "3")),
third = structure(list(a = c(3L, 6L, 9L), b = c(6L, 9L, 12L
)), class = "data.frame", row.names = c("1", "2", "3")))
Besides the solutions by @akrun, you can also try the following code:
mapply(`*`, lst1, c(1, -1, 0), SIMPLIFY = FALSE)
or
lapply(seq_along(lst1), function(k) lst1[[k]]*c(1,-1,0)[k])
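To tie the multiplier to the name of each list element, as the question asks, one option is to pass the names as a second argument to Map. This is a sketch with an assumed named vector mult (first = 1, second = -1, third = 0, the mapping described in the question); lst1 is rebuilt in a shorter form.

```r
lst1 <- list(
  first  = data.frame(a = 1:3, b = 2:4),
  second = data.frame(a = c(2L, 4L, 6L), b = c(4L, 6L, 8L)),
  third  = data.frame(a = c(3L, 6L, 9L), b = c(6L, 9L, 12L))
)

# Multiplier per list element, looked up by name
mult <- c(first = 1, second = -1, third = 0)

# Map passes each data frame together with its own name
res <- Map(function(d, nm) d * mult[[nm]], lst1, names(lst1))
res
```

Inside the function, nm holds "first", "second" or "third", which is the piece of information that names(list) could not provide inside the original lapply call.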

Resources