Is there a way to pickout data based on associations using R? - r

If I have a data set like this:
names <- c("Dave", "Ashley", "Drew")
score1 <- c(5, 1, 3)
opponent <- c("Drew", "Dave", "Ashley")
x <- cbind(names, score1, opponent)
x
y <- as.numeric(ifelse(x[, 3]==x[1, 1], x[1, "score1"], ifelse(
x[, 3]==x[2, 1], x[2, "score1"], ifelse(
x[, 3]==x[3, 1], x[3, "score1"], 1))))
y <- (y * score1)
x <- cbind(x, y)
x
Can I create a loop, to create a new column, where the number in the "score1" column is multiplied by the number from the "y" column from a different row. For instance, create a new column where the values at [1, 2] would be 5 since "Dave" had "Drew" in his row so "Dave"’s "score1" column “5” is multiplied by "Drew"’s "score1" column “3”. Is there a way to do this in a loop that would work for a hundred rows and a hundred columns? Currently, the only way I know to do it is to write a ton of "ifelse" statements like above.

You can perform a self join. Although, you'll have to be careful of duplicates if any names are repeated.:
names <- c("Dave", "Ashley", "Drew")
score1 <- c(5, 1, 3)
opponent <- c("Drew", "Dave", "Ashley")
x <- data.frame(names,score1,opponent)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
new_df <- left_join(x,x[,-3],by=c("opponent"="names"))
new_df
#> names score1.x opponent score1.y
#> 1 Dave 5 Drew 3
#> 2 Ashley 1 Dave 5
#> 3 Drew 3 Ashley 1
new_df <- mutate(new_df,y=score1.x * score1.y)
new_df
#> names score1.x opponent score1.y y
#> 1 Dave 5 Drew 3 15
#> 2 Ashley 1 Dave 5 5
#> 3 Drew 3 Ashley 1 3
Created on 2019-05-07 by the reprex package (v0.2.1)

Related

Convert every n # of rows to columns and stack them in R?

I have a tab-delimited text file with a series of timestamped data. I've read it into R using read.delim() and it gives me all the data as characters in a single column. Example:
df <- data.frame(c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
colnames(df) <- "col1"
df
I want to convert every n # of rows (in this case 4) to columns and stack them without using a for loop. Desired result:
col1 <- c("2017","2018","2018")
col2 <- c("A","X","X")
col3 <- c("B","Y","B")
col4 <- c("C","Z","C")
df2 <- data.frame(col1, col2, col3, col4)
df2
I created a for loop, but it can't handle the millions of rows in my df. Should I convert to a matrix? Would converting to a list help? I tried as.matrix(read.table()) and unlist() but without success.
You could use tidyr to reshape data into the form you want, you will first need to mutate the data as to identify which indexes should be first, and which go with a specific column.
Assuming you know there are 4 groups (n = 4) you could do something like the following with the help of the dplyr package.
library(tidyr)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
n <- 4
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C")) %>%
mutate(cols = rep(1:n, n()/n),
id = rep(1:(n()/n), each = n))
pivot_wider(df, id_cols = id, names_from = cols, values_from = x, names_prefix = "cols")
#> # A tibble: 3 × 5
#> id cols1 cols2 cols3 cols4
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 2017 A B C
#> 2 2 2018 X Y Z
#> 3 3 2018 X B C
Or, in base you could use the split function on the vector, and then use do.call to make the data frame
df <- data.frame(x = c("2017","A","B","C","2018","X","Y","Z","2018","X","B","C"))
split_df <- setNames(split(df$x, rep(1:4, 3)), paste0("cols", 1:4))
do.call("data.frame", split_df)
#> cols1 cols2 cols3 cols4
#> 1 2017 A B C
#> 2 2018 X Y Z
#> 3 2018 X B C
Created on 2022-02-01 by the reprex package (v2.0.1)
The easiest way would be to create a matrix with matrix(ncol=x, byrow=TRUE), then convert back to data.frame. Should be quite fast too.
df |>
unlist() |>
matrix(ncol=4, byrow = TRUE) |>
as.data.frame() |>
setNames(paste0('col', 1:4))
col1 col2 col3 col4
1 2017 A B C
2 2018 X Y Z
3 2018 X B C

Remove column only if it exists

I want to remove a column from a data frame only if it's there.
Example:
a <- 1:5
x <- tibble(a, b = a * 2, c = 1)
x %>% select(-'a')
x %>% select(-'d') # Throws an error
I want a way to remove columns a and d only if they exist, so a is removed and the attempt to remove d never happens. I tried modifying this solution to my problem, but I could not get it to work.
data.table
library(data.table)
a <- 1:5
x <- data.frame(a, b = a * 2, c = 1)
cols <- c("a", "d")
my_cols <- intersect(cols, names(x))
setDT(x)[, ..my_cols]
#> a
#> 1: 1
#> 2: 2
#> 3: 3
#> 4: 4
#> 5: 5
Created on 2021-07-09 by the reprex package (v2.0.0)

How to anonymize data without losing duplicates

I need to anonymize data containing clientnumbers. About half of them are duplicate values, as these clients appear more than once.
How can I anonymize in R so that duplicates are transformed into the same value?
Thanks in advance!
Suppose your data looks like this:
df <- data.frame(id = c("A", "B", "C", "A", "B", "C"), value = rnorm(6),
stringsAsFactors = FALSE)
df
#> id value
#> 1 A -0.8238857
#> 2 B -0.1553338
#> 3 C -0.6297834
#> 4 A -0.4616377
#> 5 B 0.1643057
#> 6 C -0.6719061
And your list of new ID strings (which can be created randomly - see footnote) looks like this:
newIds <- c("newId1", "newId2", "newId3")
Then you should first ensure that your id column is a factor:
df$id <- as.factor(df$id)
Then you should probably store the client IDs for safe lookup later
lookup <- data.frame(key = newIds, value = levels(df$id))
lookup
#> key value
#> 1 newId1 A
#> 2 newId2 B
#> 3 newId3 C
Now all you need to do is overwrite the factor levels:
levels(df$id) <- newIds
df
#> id value
#> 1 newId1 0.7241847
#> 2 newId2 0.4313706
#> 3 newId3 -0.8687062
#> 4 newId1 1.3464852
#> 5 newId2 0.6973432
#> 6 newId3 1.9872338
Note: If you want to create random strings for the ids you can do this:
sapply(seq_along(levels(df$id)), function(x) paste0(sample(LETTERS, 5), collapse = ""))
#> [1] "TWABF" "YSBUF" "WVQEY"
Created on 2020-03-02 by the reprex package (v0.3.0)

left_join two data frames and overwrite

I'd like to merge two data frames where df2 overwrites any values that are NA or present in df1. Merge data frames and overwrite values provides a data.table option, but I'd like to know if there is a way to do this with dplyr. I've tried all of the _join options but none seem to do this. Is there a way to do this with dplyr?
Here is an example:
df1 <- data.frame(y = c("A", "B", "C", "D"), x1 = c(1,2,NA, 4))
df2 <- data.frame(y = c("A", "B", "C"), x1 = c(5, 6, 7))
Desired output:
y x1
1 A 5
2 B 6
3 C 7
4 D 4
I think what you want is to keep the values of df2 and only add the ones in df1 that are not present in df2 which is what anti_join does:
"anti_join return all rows from x where there are not matching values in y, keeping just columns from x."
My solution:
df3 <- anti_join(df1, df2, by = "y") %>% bind_rows(df2)
Warning messages:
1: In anti_join_impl(x, y, by$x, by$y) :
joining factors with different levels, coercing to character vector
2: In rbind_all(x, .id) : Unequal factor levels: coercing to character
> df3
Source: local data frame [4 x 2]
y x1
(chr) (dbl)
1 D 4
2 A 5
3 B 6
4 C 7
this line gives the desired output (in a different order) but, you should pay attention to the warning message, when working with your dataset be sure to read y as a character variable.
This is the idiom I now use, as, in addition, it handles keeping columns that are not part of the update table. I use some different names than from the OP, but the flavor is similar.
The one thing I do is create a variable for the keys used in the join, as I use that in a few spots. But otherwise, it does what is desired.
In itself it doesn't handle the action of, for example, "update this row if a value is NA", but you should exercise that condition when creating the join table.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
.keys <- c("key1", "key2")
.base_table <- tribble(
~key1, ~key2, ~val1, ~val2,
"A", "a", 0, 0,
"A", "b", 0, 1,
"B", "a", 1, 0,
"B", "b", 1, 1)
.join_table <- tribble(
~key1, ~key2, ~val2,
"A", "b", 100,
"B", "a", 111)
# This works
df_result <- .base_table %>%
# Pull off rows from base table that match the join table
semi_join(.join_table, .keys) %>%
# Drop cols from base table that are in join table, except for the key columns
select(-matches(setdiff(names(.join_table), .keys))) %>%
# Left join on the join table columns
left_join(.join_table, .keys) %>%
# Remove the matching rows from the base table, and bind on the newly joined result from above.
bind_rows(.base_table %>% anti_join(.join_table, .keys))
df_result %>%
print()
#> # A tibble: 4 x 4
#> key1 key2 val1 val2
#> <chr> <chr> <dbl> <dbl>
#> 1 A b 0 100
#> 2 B a 1 111
#> 3 A a 0 0
#> 4 B b 1 1
Created on 2019-12-12 by the reprex package (v0.3.0)

How to replace NA values in a table for selected columns

There are a lot of posts about replacing NA values. I am aware that one could replace NAs in the following table/frame with the following:
x[is.na(x)]<-0
But, what if I want to restrict it to only certain columns? Let's me show you an example.
First, let's start with a dataset.
set.seed(1234)
x <- data.frame(a=sample(c(1,2,NA), 10, replace=T),
b=sample(c(1,2,NA), 10, replace=T),
c=sample(c(1:5,NA), 10, replace=T))
Which gives:
a b c
1 1 NA 2
2 2 2 2
3 2 1 1
4 2 NA 1
5 NA 1 2
6 2 NA 5
7 1 1 4
8 1 1 NA
9 2 1 5
10 2 1 1
Ok, so I only want to restrict the replacement to columns 'a' and 'b'. My attempt was:
x[is.na(x), 1:2]<-0
and:
x[is.na(x[1:2])]<-0
Which does not work.
My data.table attempt, where y<-data.table(x), was obviously never going to work:
y[is.na(y[,list(a,b)]), ]
I want to pass columns inside the is.na argument but that obviously wouldn't work.
I would like to do this in a data.frame and a data.table. My end goal is to recode the 1:2 to 0:1 in 'a' and 'b' while keeping 'c' the way it is, since it is not a logical variable. I have a bunch of columns so I don't want to do it one by one. And, I'd just like to know how to do this.
Do you have any suggestions?
You can do:
x[, 1:2][is.na(x[, 1:2])] <- 0
or better (IMHO), use the variable names:
x[c("a", "b")][is.na(x[c("a", "b")])] <- 0
In both cases, 1:2 or c("a", "b") can be replaced by a pre-defined vector.
Building on #Robert McDonald's tidyr::replace_na() answer, here are some dplyr options for controlling which columns the NAs are replaced:
library(tidyverse)
# by column type:
x %>%
mutate_if(is.numeric, ~replace_na(., 0))
# select columns defined in vars(col1, col2, ...):
x %>%
mutate_at(vars(a, b, c), ~replace_na(., 0))
# all columns:
x %>%
mutate_all(~replace_na(., 0))
Edit 2020-06-15
Since data.table 1.12.4 (Oct 2019), data.table gains two functions to facilitate this: nafill and setnafill.
nafill operates on columns:
cols = c('a', 'b')
y[ , (cols) := lapply(.SD, nafill, fill=0), .SDcols = cols]
setnafill operates on tables (the replacements happen by-reference/in-place)
setnafill(y, cols=cols, fill=0)
# print y to show the effect
y[]
This will also be more efficient than the other options; see ?nafill for more, the last-observation-carried-forward (LOCF) and next-observation-carried-backward (NOCB) versions of NA imputation for time series.
This will work for your data.table version:
for (col in c("a", "b")) y[is.na(get(col)), (col) := 0]
Alternatively, as David Arenburg points out below, you can use set (side benefit - you can use it either on data.frame or data.table):
for (col in 1:2) set(x, which(is.na(x[[col]])), col, 0)
This is now trivial in tidyr with replace_na(). The function appears to work for data.tables as well as data.frames:
tidyr::replace_na(x, list(a=0, b=0))
Not sure if this is more concise, but this function will also find and allow replacement of NAs (or any value you like) in selected columns of a data.table:
update.mat <- function(dt, cols, criteria) {
require(data.table)
x <- as.data.frame(which(criteria==TRUE, arr.ind = TRUE))
y <- as.matrix(subset(x, x$col %in% which((names(dt) %in% cols), arr.ind = TRUE)))
y
}
To apply it:
y[update.mat(y, c("a", "b"), is.na(y))] <- 0
The function creates a matrix of the selected columns and rows (cell coordinates) that meet the input criteria (in this case is.na == TRUE).
We can solve it in data.table way with tidyr::repalce_na function and lapply
library(data.table)
library(tidyr)
setDT(df)
df[,c("a","b","c"):=lapply(.SD,function(x) replace_na(x,0)),.SDcols=c("a","b","c")]
In this way, we can also solve paste columns with NA string. First, we replace_na(x,""),then we can use stringr::str_c to combine columns!
Starting from the data.table y, you can just write:
y[, (cols):=lapply(.SD, function(i){i[is.na(i)] <- 0; i}), .SDcols = cols]
Don't forget to library(data.table) before creating y and running this command.
This needed a bit extra for dealing with NA's in factors.
Found a useful function here, which you can then use with mutate_at or mutate_if:
replace_factor_na <- function(x){
x <- as.character(x)
x <- if_else(is.na(x), 'NONE', x)
x <- as.factor(x)
}
df <- df %>%
mutate_at(
vars(vector_of_column_names),
replace_factor_na
)
Or apply to all factor columns:
df <- df %>%
mutate_if(is.factor, replace_factor_na)
For a specific column, there is an alternative with sapply
DF <- data.frame(A = letters[1:5],
B = letters[6:10],
C = c(2, 5, NA, 8, NA))
DF_NEW <- sapply(seq(1, nrow(DF)),
function(i) ifelse(is.na(DF[i,3]) ==
TRUE,
0,
DF[i,3]))
DF[,3] <- DF_NEW
DF
For completeness, built upon #sbha's answer, here is the tidyverse version with the across() function that's available in dplyr since version 1.0 (which supersedes the *_at() variants, and others):
# random data
set.seed(1234)
x <- data.frame(a = sample(c(1, 2, NA), 10, replace = T),
b = sample(c(1, 2, NA), 10, replace = T),
c = sample(c(1:5, NA), 10, replace = T))
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
# with the magrittr pipe
x %>% mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
# with the native pipe (since R 4.1)
x |> mutate(across(1:2, ~ replace_na(.x, 0)))
#> a b c
#> 1 2 2 5
#> 2 2 2 2
#> 3 1 0 5
#> 4 0 2 2
#> 5 1 2 NA
#> 6 1 2 3
#> 7 2 2 4
#> 8 2 1 4
#> 9 0 0 3
#> 10 2 0 1
Created on 2021-12-08 by the reprex package (v2.0.1)
it's quite handy with data.table and stringr
library(data.table)
library(stringr)
x[, lapply(.SD, function(xx) {str_replace_na(xx, 0)})]
FYI
this works fine for me
DataTable DT = new DataTable();
DT = DT.AsEnumerable().Select(R =>
{
R["Campo1"] = valor;
return (R);
}).ToArray().CopyToDataTable();

Resources