Space-separated IDs treated differently in a text file - count

Input file looks something like this:
123 456
456 869
123 562
562 123
How do I find the total number of unique IDs?
What I've tried so far:
cat file | tr " " "\n" | sort | uniq -c
Which gives:
2 123
1 123
1 456
1 456
1 562
1 562
1 869
That would give a total of 7 unique IDs, but there are only 4: 123, 456, 562 and 869.
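The duplicate entries in the uniq -c output (123 and 456 each appearing twice) suggest the tokens are not byte-identical, commonly because of stray carriage returns from CRLF line endings or repeated spaces. A minimal sketch, assuming that is the cause (the file name ids.txt is hypothetical), that extracts only the digit runs and counts distinct IDs:

```shell
# Hypothetical input reproducing the symptom: one line has a CRLF ending.
printf '123 456\n456 869\r\n123 562\n562 123\n' > ids.txt

# Extract digit runs only (immune to \r and runs of spaces), then count.
grep -oE '[0-9]+' ids.txt | sort -u | wc -l    # 4 distinct IDs
```

Dropping the -u and piping through uniq -c instead would show the per-ID counts without the spurious duplicates.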

Related

How to detect duplicates of single values in all rows and columns in an R data.frame

I have a large data set consisting of a header row and a series of values in each column. I want to detect the presence and number of duplicates of these values within the whole dataset.
1 2 3 4 5 6 7
734 456 346 545 874 734 455
734 783 482 545 456 948 483
So for example, it would detect 734 3 times, 456 twice etc.
I've tried using the duplicated function in R, but this seems to only work on rows as a whole or columns as a whole. Using
duplicated(df)
doesn't pick up any duplicates, though I know there are two duplicates in the first row.
So I'm asking how to detect duplicates both within and between columns/rows.
Cheers
You can use table() and data.frame() to see the occurrences:
data.frame(table(v))
such that
v Freq
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 6 1
7 7 1
8 346 1
9 455 1
10 456 2
11 482 1
12 483 1
13 545 2
14 734 3
15 783 1
16 874 1
17 948 1
DATA
v <- c(1, 2, 3, 4, 5, 6, 7, 734, 456, 346, 545, 874, 734, 455, 734,
783, 482, 545, 456, 948, 483)
You can transform it to a vector and then use table() as follows:
library(data.table)
library(dplyr)
df <- fread("734 456 346 545 874 734 455
734 783 482 545 456 948 483")
df %>% unlist() %>% table()
# 346 455 456 482 483 545 734 783 874 948
# 1 1 2 1 1 2 3 1 1 1
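If the goal is just the duplicated values and their counts, the table can be filtered; a small sketch reusing the values from the question:

```r
v <- c(734, 456, 346, 545, 874, 734, 455,
       734, 783, 482, 545, 456, 948, 483)
tab <- table(v)
tab[tab > 1]   # only the values occurring more than once: 456, 545, 734
sum(tab > 1)   # number of distinct duplicated values: 3
```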

Apply binary variable to multiple records of same key in R

I have a table of doctor visits wherein there are sometimes multiple records for the same encounter key if there are multiple diagnoses, such as:
Enc_Key | Patient_Key | Enc_Date | Diag_Key
123 789 20160512 765
123 789 20160512 263
123 789 20160515 493
546 013 20160226 765
564 444 20160707 004
789 226 20160707 546
789 226 20160707 765
I am trying to create an indicator variable based on the value of the Diag_Key column, but I need to apply it for the entire encounter. In other words, if I get a value of "765" for the diagnosis code, then I want to apply a "1" for the indicator variable to every record that has the same Enc_Key as the record with a Diag_Key value of 765, such as below:
Enc_Key | Patient_Key | Enc_Date | Diag_Key | Diag_Ind
123 789 20160512 765 1
123 789 20160512 263 1
123 789 20160515 493 1
546 013 20160226 723 0
564 444 20160707 004 0
789 226 20160707 546 1
789 226 20160707 765 1
I can't seem to figure out a way to apply this binary indicator to multiple different records. I have been using a line of code that resembles this:
tbl$Diag_Ind <- ifelse(grepl('765',tbl$Diag_Key),1,0)
but this would only assign a value of "1" to the single record with that Diag_Key value, and I'm unsure of how to apply it to the rest of the records with the same Enc_Key value.
Use == to compare values directly and %in% for filtering with multiple values. For example, this will identify all Enc_Keys which have some Diag_Key == 765:
dat$Enc_Key[dat$Diag_Key == 765]
Then just select the data by Enc_Key and convert boolean to integer:
as.integer(
  dat$Enc_Key %in% unique(dat$Enc_Key[dat$Diag_Key == 765])
)
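Put together on the question's data (using the corrected row where Enc_Key 546 has Diag_Key 723), the whole thing is one assignment; a sketch:

```r
dat <- read.table(text = "Enc_Key Patient_Key Enc_Date Diag_Key
123 789 20160512 765
123 789 20160512 263
123 789 20160515 493
546 013 20160226 723
564 444 20160707 004
789 226 20160707 546
789 226 20160707 765", header = TRUE)

# flag every record whose Enc_Key also occurs with Diag_Key 765
dat$Diag_Ind <- as.integer(dat$Enc_Key %in% dat$Enc_Key[dat$Diag_Key == 765])
dat$Diag_Ind
# 1 1 1 0 0 1 1
```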
Use mutate from dplyr. You may have a typo in the required output: in the original data, Enc_Key = 546 has Diag_Key 765, but not in the required data frame.
library(dplyr)
input = read.table(text = "Enc_Key Patient_Key Enc_Date Diag_Key
123 789 20160512 765
123 789 20160512 263
123 789 20160515 493
546 013 20160226 765
564 444 20160707 004
789 226 20160707 546
789 226 20160707 765", header = TRUE, stringsAsFactors = FALSE)
input %>%
  group_by(Enc_Key) %>%
  mutate(Diag_Ind = max(grepl('765', Diag_Key)))
Output:
Enc_Key Patient_Key Enc_Date Diag_Key Diag_Ind
1 123 789 20160512 765 1
2 123 789 20160512 263 1
3 123 789 20160515 493 1
4 546 13 20160226 765 1
5 564 444 20160707 4 0
6 789 226 20160707 546 1
7 789 226 20160707 765 1
With the typo corrected, the output is:
Enc_Key Patient_Key Enc_Date Diag_Key Diag_Ind
1 123 789 20160512 765 1
2 123 789 20160512 263 1
3 123 789 20160515 493 1
4 546 13 20160226 723 0
5 564 444 20160707 4 0
6 789 226 20160707 546 1
7 789 226 20160707 765 1

Binary variable to multiple records of same key based on characters in another field in R

I have a table of doctor visits wherein there are sometimes multiple records for the same encounter key (Enc_Key) if there are multiple diagnoses, such as:
Enc_Key | Patient_Key | Enc_Date | Diag
123 789 20160512 asthma
123 789 20160512 fever
123 789 20160515 coughing
546 013 20160226 flu
564 444 20160707 laceration
789 226 20160707 asthma
789 226 20160707 fever
I am trying to create an indicator variable Diag_Ind based on the value of the character variable Diag, but I need to apply it for the entire encounter. In other words, if I get a value of "asthma" for Diag for a record, then I want to apply a "1" for the Diag_Ind to every record that has the same Enc_Key, such as below:
Enc_Key | Patient_Key | Enc_Date | Diag | Diag_Ind
123 789 20160512 asthma 1
123 789 20160512 fever 1
123 789 20160515 coughing 1
546 013 20160226 flu 0
564 444 20160707 laceration 0
789 226 20160707 asthma attack 1
789 226 20160707 fever 1
I can't seem to figure out a way to apply this binary indicator to multiple records. I have been using a line of code that resembles this:
tbl$Diag_Ind <- ifelse(grepl('asthma',tolower(tbl$Diag)),1,0)
but this would only assign a value of "1" to the single record with that Diag value, such as this:
Enc_Key | Patient_Key | Enc_Date | Diag | Diag_Ind
123 789 20160512 asthma 1
123 789 20160512 fever 0
123 789 20160515 coughing 0
546 013 20160226 flu 0
564 444 20160707 laceration 0
789 226 20160707 asthma attack 1
789 226 20160707 fever 0
I'm unsure of how to apply it to the rest of the records with the same Enc_Key value.
We can use base R's ave to check whether any value in each group of Enc_Key has "asthma" in it:
df$Diag_Ind <- ave(df$Diag, df$Enc_Key, FUN = function(x) as.integer(any(grep("asthma", x))))
df
# Enc_Key Patient_Key Enc_Date Diag Diag_Ind
#1 123 789 20160512 asthma 1
#2 123 789 20160512 fever 1
#3 123 789 20160515 coughing 1
#4 546 13 20160226 flu 0
#5 564 444 20160707 laceration 0
#6 789 226 20160707 asthma 1
#7 789 226 20160707 fever 1
A similar solution with dplyr:
library(dplyr)
df %>%
  group_by(Enc_Key) %>%
  mutate(Diag_Ind = as.numeric(any(grep("asthma", Diag))))
# Enc_Key Patient_Key Enc_Date Diag Diag_Ind
# (int) (int) (int) (fctr) (dbl)
#1 123 789 20160512 asthma 1
#2 123 789 20160512 fever 1
#3 123 789 20160515 coughing 1
#4 546 13 20160226 flu 0
#5 564 444 20160707 laceration 0
#6 789 226 20160707 asthma 1
#7 789 226 20160707 fever 1
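The %in% idiom from the previous question works here as well, with grepl() picking out the matching rows; a base R sketch on the question's data:

```r
df <- data.frame(
  Enc_Key = c(123, 123, 123, 546, 564, 789, 789),
  Diag = c("asthma", "fever", "coughing", "flu",
           "laceration", "asthma attack", "fever"),
  stringsAsFactors = FALSE
)

# flag every record whose Enc_Key also has an asthma diagnosis somewhere
df$Diag_Ind <- as.integer(df$Enc_Key %in% df$Enc_Key[grepl("asthma", df$Diag)])
df$Diag_Ind
# 1 1 1 0 0 1 1
```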

sort data.frame based on the number of identical repeats in a column

I want to sort the data.frame based on the number of times a given string is repeated in the last column.
data=
chr start end name
1 234 267 ttn
2 345 367 Elm
3 445 489 ttn
4 544 598 Rm
5 644 680 ttn
I want something like this:
chr start end name
1 234 267 ttn
3 445 489 ttn
5 644 680 ttn
2 345 367 Elm
4 544 598 Rm
Here's a quick data.table approach which will sort the data by reference:
library(data.table)
setorder(setDT(df)[, indx := .N, by = name], -indx)[]
# chr start end name indx
# 1: 1 234 267 ttn 3
# 2: 3 445 489 ttn 3
# 3: 5 644 680 ttn 3
# 4: 2 345 367 Elm 1
# 5: 4 544 598 Rm 1
Try
data[with(data, order(-ave(seq_along(name), name, FUN=length))),]
# chr start end name
#1 1 234 267 ttn
#3 3 445 489 ttn
#5 5 644 680 ttn
#2 2 345 367 Elm
#4 4 544 598 Rm
Or another base R approach is
data[order(factor(data$name, levels=names(sort(-table(data$name))))),]
# chr start end name
# 1 1 234 267 ttn
# 3 3 445 489 ttn
# 5 5 644 680 ttn
# 2 2 345 367 Elm
# 4 4 544 598 Rm
Or using dplyr
library(dplyr)
data %>%
  group_by(name) %>%
  mutate(n = n()) %>%
  arrange(-n) %>%
  select(-n)
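In more recent dplyr (0.7 and later), add_count() collapses the group_by()/mutate() pair; a sketch on the question's data:

```r
library(dplyr)

data <- data.frame(
  chr = 1:5,
  start = c(234, 345, 445, 544, 644),
  end = c(267, 367, 489, 598, 680),
  name = c("ttn", "Elm", "ttn", "Rm", "ttn")
)

data %>%
  add_count(name) %>%   # adds a column n with each name's frequency
  arrange(desc(n)) %>%  # stable sort: ties keep their original order
  select(-n)
```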

load this format into an R dataframe

How can I load an input file like this into an R dataframe?
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=====================================================================================
959335 959806 | 169 640 | 472 472 | 80.84 | LmjF.34 ULAVAL|LtaPseq521
322990 324081 | 1436 342 | 1092 1095 | 83.86 | LmjF.12 ULAVAL|LtaPseq501
324083 324327 | 245 1 | 245 245 | 91.84 | LmjF.12 ULAVAL|LtaPseq501
1097873 1098325 | 892 437 | 453 456 | 76.75 | LmjF.32 ULAVAL|LtaPseq491
1098566 1098772 | 207 4 | 207 204 | 75.60 | LmjF.32 ULAVAL|LtaPseq491
This looks like fixed-width formatted data, and can easily be read in with read.fwf - the tricky bit might be getting rid of the | marks. What do you want to do with the [TAGS] section?
Here I work out the widths of each field, add some fields (length 3) to skip over the | markers, read it in, then use negative column subsetting to drop the separator columns:
widths <- c(8, 9, 3, 9, 9, 3, 9, 9, 3, 9, 3, 100)
read.fwf("data.txt", widths = widths, skip = 2)[, -c(3, 6, 9, 11)]
V1 V2 V4 V5 V7 V8 V10 V12
1 959335 959806 169 640 472 472 80.84 LmjF.34 ULAVAL|LtaPseq521
2 322990 324081 1436 342 1092 1095 83.86 LmjF.12 ULAVAL|LtaPseq501
3 324083 324327 245 1 245 245 91.84 LmjF.12 ULAVAL|LtaPseq501
4 1097873 1098325 892 437 453 456 76.75 LmjF.32 ULAVAL|LtaPseq491
5 1098566 1098772 207 4 207 204 75.60 LmjF.32 ULAVAL|LtaPseq491
You might want to split the tags into two columns - just work out the width of each part and add field widths to the widths vector. An exercise for the reader.
Note this only works if the file is spaced out with space characters and NOT tab characters...
Read the file using readLines or scan:
test <-' [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] |
[% IDY] | [TAGS]
=====================================================================================
959335 959806 | 169 640 | 472 472 | 80.84 | LmjF.34 ULAVAL|LtaPseq521
322990 324081 | 1436 342 | 1092 1095 | 83.86 | LmjF.12 ULAVAL|LtaPseq501
324083 324327 | 245 1 | 245 245 | 91.84 | LmjF.12 ULAVAL|LtaPseq501
1097873 1098325 | 892 437 | 453 456 | 76.75 | LmjF.32 ULAVAL|LtaPseq491
1098566 1098772 | 207 4 | 207 204 | 75.60 | LmjF.32 ULAVAL|LtaPseq491'
test2 <- gsub('|', ' ', test, fixed = TRUE)
test2 <- gsub('=', '', test2, fixed = TRUE)
test3 <- gsub('[ \t]{2,8}', ';', test2, perl = TRUE)
test3 <- gsub('\n', '', test3, perl = TRUE)
test4 <- strsplit(test3, split = ';')
test5 <- data.frame(matrix(test4[[1]], ncol = 9, byrow = TRUE),
                    stringsAsFactors = FALSE)
colnames(test5)[1:8] <- test5[1, 2:9]
test5 <- test5[-1, ]
The output:
test5
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [TAGS] X9
2 959335 959806 169 640 472 472 80.84 LmjF.34 ULAVAL LtaPseq521
3 322990 324081 1436 342 1092 1095 83.86 LmjF.12 ULAVAL LtaPseq501
4 324083 324327 245 1 245 245 91.84 LmjF.12 ULAVAL LtaPseq501
5 1097873 1098325 892 437 453 456 76.75 LmjF.32 ULAVAL LtaPseq491
6 1098566 1098772 207 4 207 204 75.60 LmjF.32 ULAVAL LtaPseq491
Doing this in R alone is not simple. If you are using UNIX, simple scripts (e.g. in awk) can convert even large files before importing them (I use this technique).
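A sketch of that awk preprocessing, assuming the column separators are always space-pipe-space (" | ") while the "|" inside the TAGS field has no surrounding spaces:

```shell
# Skip the header and ruler lines, then delete only the " | " separators;
# the bare "|" inside tags like ULAVAL|LtaPseq521 is left intact.
awk 'NR > 2 { gsub(/ \| /, " "); print }' data.txt > clean.txt
# afterwards, in R: read.table("clean.txt") sees 9 whitespace-separated columns
```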
