Remove duplicate values across a few columns but keep rows - r

I have a dataframe that looks like this:
dat <- data.frame(id=1:6,
z_1=c(100,290,38,129,0,290),
z_2=c(20,0,0,0,0,290),
z_3=c(0,0,38,0,0,98),
z_4=c(0,0,38,127,38,78),
z_5=c(23,0,25,0,0,98),
z_6=c(100,0,25,127,0,9))
dat
id z_1 z_2 z_3 z_4 z_5 z_6
1 1 100 20 0 0 23 100
2 2 290 0 0 0 0 0
3 3 38 0 38 38 25 25
4 4 129 0 0 127 0 127
5 5 0 0 0 38 0 0
6 6 290 290 98 78 98 9
I want to remove duplicate values of z_x within each row, replacing any duplicates with either a 0 or NA, but leaving the rows and columns intact (i.e. not dropping any). The 0s here do not count as duplicates; they are missing values. Duplicate values within a column are ok. My ideal output would look like this:
id z_1 z_2 z_3 z_4 z_5 z_6
1 1 100 20 0 0 23 0
2 2 290 0 0 0 0 0
3 3 38 0 0 0 25 0
4 4 129 0 0 127 0 0
5 5 0 0 0 38 0 0
6 6 290 0 98 78 0 9
I don't really care what order the values within the z_xs appear in, so it's fine if they get moved around. Is there an efficient way to do this, preferably in some tidyverse way? I know I can pivot longer and drop duplicate rows, but my dataset is very large and I'm looking for a way to do this without pivoting.

Base R way using apply :
cols <- grep('z_\\d+', names(dat))
dat[cols] <- t(apply(dat[cols], 1, function(x) replace(x, duplicated(x), 0)))
# id z_1 z_2 z_3 z_4 z_5 z_6
#1 1 100 20 0 0 23 0
#2 2 290 0 0 0 0 0
#3 3 38 0 0 0 25 0
#4 4 129 0 0 127 0 0
#5 5 0 0 0 38 0 0
#6 6 290 0 98 78 0 9
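To see what the function inside apply does, here is a minimal sketch on a single row (the id = 3 row from the question). The NA variant at the end is my own addition, since the question allows either 0 or NA as the replacement.

```r
# z values of the id = 3 row from the question
x <- c(z_1 = 38, z_2 = 0, z_3 = 38, z_4 = 38, z_5 = 25, z_6 = 25)

# duplicated() flags every repeat after the first occurrence
dup <- duplicated(x)

# Replace repeats with 0; repeated zeros simply stay 0
res <- replace(x, dup, 0)

# Variant (assumption: NA is preferred): only non-zero repeats become NA
res_na <- replace(x, dup & x != 0, NA)

print(res)
print(res_na)
```

Because repeated zeros are overwritten with 0, the "zeros are not duplicates" requirement is met without any special casing in the 0 variant.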
tidyverse way without reshaping can be done using pmap :
library(tidyverse)
dat %>%
mutate(result = pmap(select(., matches('z_\\d+')), ~{
x <- c(...)
replace(x, duplicated(x), 0)
})) %>%
select(id, result) %>%
unnest_wider(result)
Since tests performed by @thelatemail suggest that reshaping is a better option than handling the data rowwise, you might want to consider it.
dat %>%
pivot_longer(cols = matches('z_\\d+')) %>%
group_by(id) %>%
mutate(value = replace(value, duplicated(value), 0)) %>%
pivot_wider()

This solution isn't tidyverse, but hopefully is sufficiently simple.
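The duplicated() function does what you want. You can use the apply() function to feed duplicated() your data by row, restricted to the z columns so that id is not compared against them.
cols <- grep('z_\\d+', names(dat))
dat[cols][t(apply(dat[cols], MARGIN = 1, duplicated))] <- 0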
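A note on scope: when duplicated() is applied row-wise to the whole data frame, the id column takes part in the comparison too. A minimal sketch with hypothetical data (the data frame d below is made up so that a z value happens to equal the id):

```r
# Hypothetical two-row example: in row 1, z_1 equals the id
d <- data.frame(id = 1:2, z_1 = c(1, 5), z_2 = c(7, 5))

# Row-wise duplicated() over all columns compares z values against id as well
all_cols <- t(apply(d, 1, duplicated))

# Restricting to the z columns leaves id out of the comparison
zc <- grep("^z_", names(d))
z_only <- t(apply(d[zc], 1, duplicated))

print(all_cols)  # z_1 in row 1 is flagged because it matches id
print(z_only)    # only the repeated 5 in row 2 is flagged
```

With the question's example data no z value coincides with an id, so both versions give the same result there.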


Interaction terms loop in R

Here is a small example of data. Imagine I have many more covariates than this.
install.packages("mltools")
library(mltools)
library(dplyr)
set.seed(1234)
data <- tibble::tibble(
age = round(runif(60, min = 48, max = 90)),
gender = sample(c(0,1), replace=TRUE, size=60),
weight = round(runif(60, min = 100, max = 300)),
group = sample(letters[1:4], size = 60, replace = TRUE))
one_hot <- data[,c("group")] %>%
glmnet::makeX() %>%
data.frame()
data$group <- NULL
data <- cbind(data, one_hot)
I want to create a data.frame that contains the interaction of each group column (groupa, groupb, groupc, groupd) with all other variables (age, gender, weight).
groupa * age
groupa * gender
groupa * weight
Same for the groupb, groupc, and groupd.
I've seen many questions about all possible interaction generators.
But I haven't seen any that show interaction with one column and the rest.
Hope this question was clear enough.
Thanks.
I am sure there is a more elegant solution, but you could try writing your own function that does the interaction then use apply to go over the columns and do.call to combine everything:
intfun <- function(var){
data %>%
mutate(across(starts_with("group"),~.*{{var}})) %>%
select(starts_with("group"))
}
int_terms <- cbind(data,
do.call(cbind, apply(data[,1:3], 2, function(x) intfun(x))))
Output (note not all columns presented here):
# > head(int_terms)
# age gender weight groupa groupb groupc groupd age.groupa age.groupb age.groupc age.groupd gender.groupa gender.groupb gender.groupc gender.groupd weight.groupa
# 1 88 33 113 0 1 0 0 0 88 0 0 0 33 0 0 0
# 2 49 33 213 1 0 0 0 49 0 0 0 33 0 0 0 213
# 3 83 33 152 1 0 0 0 83 0 0 0 33 0 0 0 152
# 4 75 33 101 0 1 0 0 0 75 0 0 0 33 0 0 0
# 5 61 33 218 0 1 0 0 0 61 0 0 0 33 0 0 0
# 6 79 33 204 1 0 0 0 79 0 0 0 33 0 0 0 204
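For comparison, here is a base R sketch of the same idea: list every covariate/group pair, then multiply the two columns of each pair. The miniature data frame d and its values are made up for illustration; only the column naming follows the question.

```r
# Hypothetical miniature of the question's data (values are made up)
d <- data.frame(age = c(50, 60, 70), gender = c(0, 1, 1),
                groupa = c(1, 0, 0), groupb = c(0, 1, 1))

covs <- c("age", "gender")
grps <- grep("^group", names(d), value = TRUE)

# One row per covariate/group pair
pairs <- expand.grid(cov = covs, grp = grps, stringsAsFactors = FALSE)

# Element-wise product of each pair of columns, collected as a matrix
ints <- mapply(function(cv, g) d[[cv]] * d[[g]], pairs$cov, pairs$grp)
colnames(ints) <- paste(pairs$cov, pairs$grp, sep = ".")

int_terms <- cbind(d, ints)
print(int_terms)
```

This only crosses the group columns with the other covariates, not every column with every other column.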

Creating columns from scraped pdf with cuts on spaces

I'm trying to create a data frame from the following PDF
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
However, when I call tab1 it only has one column:
[,1]
[1,] "NYS DOCCS INCARCERATED INDIVIDUALS COVID-19 REPORT BY REPORTED FACILITY"
[2,] "AS OF JUNE 29, 2020 AT 3:00 PM"
[3,] "POSITIVE CASE STATUS OTHER TESTS"
[4,] "TOTAL"
[5,] "FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE"
[6,] "TOTAL 495 16 519 97 805"
[7,] "ADIRONDACK 0 0 0 75 0"
[8,] "ALBION 0 0 0 0 2"
[9,] "ALTONA 0 0 0 0 1"
I would like to extract what should be the individual columns to create a dataframe (e.g. for row 7, I would extract its contents into the following columns: Facility ("Adirondack"), Recovered (0), Deceased (0), Positive (0), Pending (75), Negative (0)). I'm thinking that the most efficient way to do this would be to make cuts in tab1 based on spaces, but this doesn't work since some of the facilities have multiple words in their names, so the space cut would get messed up. Does anyone have an idea for a solution? Thanks for the help!
Here is how I would handle this using the "lattice" method of table extraction from the tabulizer package.
#install.packages("tidyverse")
library(tidyverse)
#install.packages("janitor")
library(janitor)
#install.packages("tabulizer")
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- tabulizer::extract_tables(url, method = "lattice") %>%
as.data.frame() %>%
dplyr::slice(-1,-2) %>%
janitor::row_to_names(row_number = 1)
Here is a workaround:
library(tabulizer)
url <- "https://doccs.ny.gov/system/files/documents/2020/06/doccs-covid-19-confirmed-by-facility-6.30.2020.pdf"
tab1 <- extract_tables(url)
plouf <- tab1[[1]][6:dim(tab1[[1]])[1],]
plouf <- gsub("([A-Z]+) ([A-Z]+)","\\1_\\2",plouf)
df <- read.table(text = paste0(t(plouf) ,collapse = "\n"),sep = " ")
names(df) <- strsplit(tab1[[1]][5,]," ")[[1]]
FACILITY RECOVERED DECEASED POSITIVE PENDING NEGATIVE
1 TOTAL 495 16 519 97 805
2 ADIRONDACK 0 0 0 75 0
3 ALBION 0 0 0 0 2
4 ALTONA 0 0 0 0 1
5 ATTICA 2 0 2 1 7
6 AUBURN 0 0 0 0 10
7 BARE_HILL 0 0 0 0 6
8 BEDFORD_HILLS 43 1 44 5 53
9 CAPE_VINCENT 0 0 0 0 0
10 CAYUGA 0 0 0 2 1
11 CLINTON 1 0 1 0 25
12 COLLINS 1 0 1 0 13
13 COXSACKIE 1 0 1 0 57
14 DOWNSTATE 1 0 1 0 12
15 EASTERN 17 1 20 0 17
16 EDGECOMBE 0 0 0 0 0
17 ELMIRA 0 0 0 1 20
18 FISHKILL 78 5 83 4 98
19 FIVE_POINTS 0 0 0 0 4
20 FRANKLIN 1 0 1 0 24
I take the table after the title, then remove the spaces inside the FACILITY names with gsub (I actually replace them with _, so you can change them back to spaces afterwards if you want; you can also use str_replace_all from stringr instead of gsub).
I then use read.table, forcing a line break after each line of text. I add the column names afterwards, because otherwise they get changed by the gsub and read.table does not read them properly.
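Another way to deal with multi-word facility names, sketched here under the assumption that every data line ends in exactly five space-separated integers: peel off those five numbers with a regular expression and treat whatever precedes them as the name. The two input strings are rows from the table above, with the original space in "BEDFORD HILLS".

```r
# Two lines in the form returned by extract_tables() (taken from the table above)
lines <- c("ADIRONDACK 0 0 0 75 0", "BEDFORD HILLS 43 1 44 5 53")

# Group 1: facility name (lazy); group 2: the trailing five " <digits>" fields
pat <- "^(.*?)((?: \\d+){5})$"
facility <- sub(pat, "\\1", lines, perl = TRUE)
nums <- trimws(sub(pat, "\\2", lines, perl = TRUE))

# Split the numeric part on spaces and bind into an integer matrix
counts <- do.call(rbind, lapply(strsplit(nums, " "), as.integer))

df <- data.frame(FACILITY = facility, counts, stringsAsFactors = FALSE)
names(df)[-1] <- c("RECOVERED", "DECEASED", "POSITIVE", "PENDING", "NEGATIVE")
print(df)
```

This keeps names of any number of words intact, but it would need adjusting if a facility name itself ended in a number.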

count the number of occurrences for each variable using dplyr

Here is my data frame (tibble) df:
ENSG00000000003 ENSG00000000005 ENSG00000000419 ENSG00000000457 ENSG00000000460
<dbl> <dbl> <dbl> <dbl> <dbl>
1 61 0 70 0 0
2 0 0 127 0 0
3 318 0 2 0 0
4 1 0 0 0 0
5 1 0 67 0 0
6 0 0 0 139 0
7 0 0 0 0 0
8 113 0 0 0 0
9 0 0 1 0 0
10 0 0 0 1 0
For each column/variable, I would like to count the number of rows with value greater than 10. In this case, column 1 would be 3, column 2 would be zero, etc. This is a test data frame, and I would like to do this for many columns.
We can use colSums on a logical matrix
colSums(df > 10, na.rm = TRUE)
Or using dplyr
library(dplyr)
df %>%
summarise_all(~ sum(. > 10, na.rm = TRUE))
I think
library(dplyr)
df %>% summarise_all(~sum(.>10))
will do what you want.
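A small self-contained check, using the first and third columns from the question, plus a note on why na.rm = TRUE matters. The vector with_na is a made-up example.

```r
# First and third columns of the example data from the question
df <- data.frame(
  ENSG00000000003 = c(61, 0, 318, 1, 1, 0, 0, 113, 0, 0),
  ENSG00000000419 = c(70, 127, 2, 0, 67, 0, 0, 0, 1, 0)
)

# df > 10 is a logical matrix; colSums counts the TRUEs per column
counts <- colSums(df > 10, na.rm = TRUE)
print(counts)

# Without na.rm = TRUE, a single NA makes the count NA (made-up vector)
with_na <- c(61, NA, 318)
print(sum(with_na > 10))                # NA
print(sum(with_na > 10, na.rm = TRUE))  # 2
```

So the two dplyr answers only agree when the columns contain no missing values.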

Stacking multiple columns in R

I am trying to convert a data frame into long form in R.
This is example data for surveys conducted in 'id' grids over 9 days, recording whether the variable of interest was detected ('1') or not detected ('0').
I want to convert this data frame so that the number of surveys is reduced from 9 to 3,
but each survey period now contains 3 visits.
I am trying to do so by stacking three columns at a time, so that survey visits 'v1' to 'v9' (in the image below) get converted to v1, v2, v3 by adding a column called 'visit_no' which describes the visit number within the survey period.
The following link is the image of the data frame in its current form, and below is the code to generate the data.
Code to generate data:
id<- c(240,220,160)
v1<- c(rep(0,9))
v2<-c(rep(0,3),1,rep(0,5))
v3<- c(1,rep(0,8))
v<-as.data.frame(rbind(v1,v2,v3))
survey<- cbind(id,v)
survey
This is the link to the image of data frame I need
Reference data-frame
One way is using reshape in base R:
reshape(survey, direction="long", idvar="id",
varying=list(c("V1","V4","V7"), c("V2","V5","V8"), c("V3","V6","V9")),
v.names=c("Visit1", "Visit2", "Visit3"), timevar="visit_no")
id visit_no Visit1 Visit2 Visit3
240.1 240 1 0 0 0
220.1 220 1 0 0 0
160.1 160 1 1 0 0
240.2 240 2 0 0 0
220.2 220 2 1 0 0
160.2 160 2 0 0 0
240.3 240 3 0 0 0
220.3 220 3 0 0 0
160.3 160 3 0 0 0
If you want it sorted by id, then add arrange from dplyr
%>% dplyr::arrange(id)
id visit_no Visit1 Visit2 Visit3
1 160 1 1 0 0
2 160 2 0 0 0
3 160 3 0 0 0
4 220 1 0 0 0
5 220 2 1 0 0
6 220 3 0 0 0
7 240 1 0 0 0
8 240 2 0 0 0
9 240 3 0 0 0
If your original variable names were in a consistent format, then the reshape command is even simpler because it will correctly guess the times from the names. For example,
names(survey)[2:10] <- paste0(names(survey)[2:10], ".", rep(1:3, 3))
head(survey)
id V1.1 V2.2 V3.3 V4.1 V5.2 V6.3 V7.1 V8.2 V9.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction="long", idvar="id",
varying=2:10, # Can just give the indices now.
v.names=c("Visit1", "Visit2", "Visit3"), timevar="visit_no") %>%
arrange(id)
Although the times are in a consistent format, the original variable names are not, so R cannot guess the names for the long format (Visit1, Visit2, Visit3), and these need to be supplied in the v.names argument.
If they were in a consistent format, then the reshape is even simpler.
names(survey)[2:10] <- paste0("Visit", rep(1:3, each=3), ".", rep(1:3, 3))
head(survey)
id Visit1.1 Visit1.2 Visit1.3 Visit2.1 Visit2.2 Visit2.3 Visit3.1 Visit3.2 Visit3.3
v1 240 0 0 0 0 0 0 0 0 0
v2 220 0 0 0 1 0 0 0 0 0
v3 160 1 0 0 0 0 0 0 0 0
reshape(survey, direction="long", varying=2:10, timevar="visit_no") %>%
arrange(id)
The tidyr version would probably involve two reshapes; one to get everything in very long form, and again to get it back to a wider form (what I call the 1 step back, 2 steps forward method).
You can change the name of the columns based on the sequence that you want.
names(survey)[-1] <- paste(rep(paste0("visit", 1:3), each =3), 1:3, sep = "_")
names(survey)
#[1] "id" "visit1_1" "visit1_2" "visit1_3" "visit2_1" "visit2_2" "visit2_3"
# "visit3_1" "visit3_2" "visit3_3"
And then use pivot_longer from tidyr to get data in different columns.
tidyr::pivot_longer(survey, cols = -id, names_to = c(".value", "visit_no"),
names_sep = "_") %>%
type.convert(as.is = TRUE)
# A tibble: 9 x 5
# id visit_no visit1 visit2 visit3
# <int> <int> <int> <int> <int>
#1 240 1 0 0 0
#2 240 2 0 0 0
#3 240 3 0 0 0
#4 220 1 0 1 0
#5 220 2 0 0 0
#6 220 3 0 0 0
#7 160 1 1 0 0
#8 160 2 0 0 0
#9 160 3 0 0 0
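If you would rather avoid renaming, one literal reading of "stacking three columns at a time" can be sketched in base R by cutting the nine columns into blocks of three and row-binding the blocks. The column name survey_no is my own label, and the grouping V1-V3, V4-V6, V7-V9 is an assumption that matches the pivot_longer answer rather than the reshape answer.

```r
# Data from the question
id <- c(240, 220, 160)
v <- as.data.frame(rbind(v1 = rep(0, 9),
                         v2 = c(rep(0, 3), 1, rep(0, 5)),
                         v3 = c(1, rep(0, 8))))
survey <- cbind(id, v)

# Column positions of each block: V1-V3, V4-V6, V7-V9
blocks <- split(2:10, rep(1:3, each = 3))

# Take id plus one block at a time, give the block common names, then stack
long <- do.call(rbind, lapply(seq_along(blocks), function(k) {
  out <- survey[, c(1, blocks[[k]])]
  names(out) <- c("id", "v1", "v2", "v3")
  out$survey_no <- k  # hypothetical label for the block index
  out
}))
print(long)
```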

Create factor based on matching rows from dataframes of unequal size

I have two dataframes
DF1
10
11
12
13
15
16
17
19
and
DF2
12
16
19
I am looking for a way to get an output as
DF3
10 0
11 0
12 1
13 0
15 0
16 1
17 0
19 1
I know how to find matched rows from two data frames
match = which(DF1 %in% DF2)
but I am lost as to how to assign 0/1 to the matched rows in these two dataframes. Any help is highly appreciated.
We can do this with %in% to create a logical vector, which can be coerced to binary with as.integer
DF3 <- DF1
DF3$newCol <- as.integer(DF3[,1] %in% DF2[,1])
DF3$newCol
#[1] 0 0 1 0 0 1 0 1
If we need consecutive numbers from the above result
DF3$newCol[DF3$newCol!=0] <- seq_along(DF3$newCol[DF3$newCol!=0])
DF3$newCol
#[1] 0 0 1 0 0 2 0 3
Or another option is
cumsum(DF1[,1] %in% DF2[,1])*(DF1[,1] %in% DF2[,1])
#[1] 0 0 1 0 0 2 0 3
This is a good job for match (R is case-sensitive, so df1 and df2 below stand for the question's DF1 and DF2):
df3 <- df1
df3$matchCol <- +!is.na(match(df1[,1], df2[,1]))
# df1 matchCol
#1 10 0
#2 11 0
#3 12 1
#4 13 0
#5 15 0
#6 16 1
#7 17 0
#8 19 1
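Putting the pieces together as a runnable sketch: the column name x is an assumption, since the question does not show the column names. It also checks that the %in% and match approaches give the same flags.

```r
# Data from the question; the column name x is hypothetical
DF1 <- data.frame(x = c(10, 11, 12, 13, 15, 16, 17, 19))
DF2 <- data.frame(x = c(12, 16, 19))

# %in% on the column vectors gives TRUE/FALSE, as.integer() turns that into 1/0
flag <- as.integer(DF1$x %in% DF2$x)

# Same result via match(): unmatched values give NA, unary + coerces to integer
flag_match <- +!is.na(match(DF1$x, DF2$x))

DF3 <- cbind(DF1, flag)
print(DF3)
```

Note that %in% has to be applied to the columns; applied to the data frames themselves it compares whole columns as list elements rather than individual values.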
