How can I load an input file like this into an R dataframe?
[S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] | [% IDY] | [TAGS]
=====================================================================================
959335 959806 | 169 640 | 472 472 | 80.84 | LmjF.34 ULAVAL|LtaPseq521
322990 324081 | 1436 342 | 1092 1095 | 83.86 | LmjF.12 ULAVAL|LtaPseq501
324083 324327 | 245 1 | 245 245 | 91.84 | LmjF.12 ULAVAL|LtaPseq501
1097873 1098325 | 892 437 | 453 456 | 76.75 | LmjF.32 ULAVAL|LtaPseq491
1098566 1098772 | 207 4 | 207 204 | 75.60 | LmjF.32 ULAVAL|LtaPseq491
This looks like fixed-width formatted data, and can be easily read in with read.fwf - the tricky bit might be getting rid of the | marks. What do you want to do with the [TAGS] section?
Here I work out the widths of each field, add some fields (length 3) to skip over the | markers, read it in, then use negative column subsetting to drop the separator columns:
> widths=c(8,9,3,9,9,3,9,9,3,9,3,100)
> read.fwf("data.txt",widths=widths,skip=2)[,-c(3,6,9,11)]
V1 V2 V4 V5 V7 V8 V10 V12
1 959335 959806 169 640 472 472 80.84 LmjF.34 ULAVAL|LtaPseq521
2 322990 324081 1436 342 1092 1095 83.86 LmjF.12 ULAVAL|LtaPseq501
3 324083 324327 245 1 245 245 91.84 LmjF.12 ULAVAL|LtaPseq501
4 1097873 1098325 892 437 453 456 76.75 LmjF.32 ULAVAL|LtaPseq491
5 1098566 1098772 207 4 207 204 75.60 LmjF.32 ULAVAL|LtaPseq491
You might want to split the tags into two columns - just work out the width of each part and add field widths to the widths vector. An exercise for the reader.
Note this only works if the file is spaced out with space characters and NOT tab characters...
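If counting field widths is awkward, a whitespace-based alternative may be easier. The sketch below is not the read.fwf approach above: it drops the two header lines, removes only the | separators that have a space on both sides (so the | inside the tag is kept), and lets read.table split on whitespace. The sample lines are re-spaced copies of the question's data, and the column names (in particular REF and QUERY for the two parts of [TAGS]) are made up for illustration.

```r
# Sample lines standing in for readLines("data.txt"); the spacing is an assumption.
raw <- c(
  "    [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  | [TAGS]",
  "=====================================================================================",
  "  959335   959806  |      169      640  |      472      472  |    80.84  | LmjF.34 ULAVAL|LtaPseq521",
  "  322990   324081  |     1436      342  |     1092     1095  |    83.86  | LmjF.12 ULAVAL|LtaPseq501"
)

# Drop the header line and the ===== rule.
body <- raw[-(1:2)]

# Remove only the separators written as space-pipe-space;
# the pipe inside "ULAVAL|LtaPseq521" has no surrounding spaces and survives.
body <- gsub(" | ", " ", body, fixed = TRUE)

# Split on runs of whitespace; the column names are hypothetical.
coords <- read.table(
  text = body, header = FALSE, stringsAsFactors = FALSE,
  col.names = c("S1", "E1", "S2", "E2", "LEN1", "LEN2", "IDY", "REF", "QUERY")
)
```

This also splits the tags into two columns, but it relies on no field containing embedded whitespace.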
Read the file using readLines or scan (here the contents are pasted into a string for illustration):
test <-' [S1] [E1] | [S2] [E2] | [LEN 1] [LEN 2] |
[% IDY] | [TAGS]
=====================================================================================
959335 959806 | 169 640 | 472 472 | 80.84 | LmjF.34 ULAVAL|LtaPseq521
322990 324081 | 1436 342 | 1092 1095 | 83.86 | LmjF.12 ULAVAL|LtaPseq501
324083 324327 | 245 1 | 245 245 | 91.84 | LmjF.12 ULAVAL|LtaPseq501
1097873 1098325 | 892 437 | 453 456 | 76.75 | LmjF.32 ULAVAL|LtaPseq491
1098566 1098772 | 207 4 | 207 204 | 75.60 | LmjF.32 ULAVAL|LtaPseq491'
test2 <- gsub('|',' ',test, fixed=TRUE)
test2 <- gsub('=','',test2, fixed=TRUE)
test3 <- gsub('[ \t]{2,8}',';',test2,perl=TRUE)
test3 <- gsub('\n','',test3,perl=TRUE)
test4<-strsplit(test3,split=';')
test5<- data.frame(matrix(test4[[1]],ncol=9,
byrow=T),stringsAsFactors=FALSE)
colnames(test5)[1:8]<-test5[1,2:9]
test5<-test5[-1,]
the output:
test5
[S1] [E1] [S2] [E2] [LEN 1] [LEN 2] [% IDY] [TAGS] X9
2 959335 959806 169 640 472 472 80.84 LmjF.34 ULAVAL LtaPseq521
3 322990 324081 1436 342 1092 1095 83.86 LmjF.12 ULAVAL LtaPseq501
4 324083 324327 245 1 245 245 91.84 LmjF.12 ULAVAL LtaPseq501
5 1097873 1098325 892 437 453 456 76.75 LmjF.32 ULAVAL LtaPseq491
6 1098566 1098772 207 4 207 204 75.60 LmjF.32 ULAVAL LtaPseq491
Doing it this way is not simple in R.
If you are on UNIX, simple scripts (for example in awk) can convert even large files before you import them (I use this technique).
Input file looks something like this:
123 456
456 869
123 562
562 123
How do I find the total amount of IDs?
What I've tried so far:
cat file | tr " " "\n" | sort | uniq -c
Which gives:
2 123
1 123
1 456
1 456
1 562
1 562
1 869
That would give a total of 7 unique IDs, but there are only 4: 123, 456, 562 and 869.
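For comparison, the same count can be sketched in R. This is not a fix for the shell pipeline; it only assumes the file holds whitespace-separated IDs, and the inline text below stands in for the real file.

```r
# Inline copy of the sample input; with a real file use scan("file", what = character()).
ids <- scan(text = "123 456\n456 869\n123 562\n562 123",
            what = character(), quiet = TRUE)

# scan() splits on any run of whitespace, so each ID becomes one clean element.
n_unique <- length(unique(ids))   # 4
counts <- table(ids)              # occurrences per ID
```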
I have a two-column tibble with many rows, and I would like to display the contents of the tibble in an HTML table while flowing the contents of the tibble to multiple columns. Here's a sample tibble of data typical of what I'm working with.
structure(list(scores = 328:360, points = c(1.44976648822324,
2.39850620178477, 3.54432361637504, 4.87641377160755, 6.38641933285773,
8.06758106817846, 9.91425882252425, 11.9216360940354, 14.0855256867808,
16.4022354393545, 18.8684718158155, 21.4812684932283, 24.2379320869751,
27.136, 30.1732070790481, 33.3474588150326, 36.6568095024477,
40.0994442217347, 43.6736638137649, 47.3778722279125, 51.2105657758874,
55.1703239324027, 59.2558014037444, 63.46572124494, 67.7988688512606,
72.2540866842331, 76.8302696189673, 81.5263608204015, 86.3413480724772,
91.2742604972992, 96.3241656118041, 101.490166677918, 106.771400309065
)), row.names = c(NA, -33L), class = c("tbl_df", "tbl", "data.frame"
))
I would like to display those data pairs in an HTML table something like the following:
| Score | Points | Score | Points | Score | Points | Score | Points |
|------:|-------:|------:|-------:|------:|-------:|------:|-------:|
| 328 | 1.4 | 337 | 16.4 | 346 | 43.7 | 355 | 81.5 |
| 329 | 2.4 | 338 | 18.9 | 347 | 47.4 | 356 | 86.3 |
| 330 | 3.5 | 339 | 21.5 | 348 | 51.2 | 357 | 91.3 |
| 331 | 4.9 | 340 | 24.2 | 349 | 55.2 | 358 | 96.3 |
| 332 | 6.4 | 341 | 27.1 | 350 | 59.3 | 359 | 101.5 |
| 333 | 8.1 | 342 | 30.2 | 351 | 63.5 | 360 | 106.8 |
| 334 | 9.9 | 343 | 33.3 | 352 | 67.8 | | |
| 335 | 11.9 | 344 | 36.7 | 353 | 72.3 | | |
| 336 | 14.1 | 345 | 40.1 | 354 | 76.8 | | |
I'd like to have a solution that would generate a four-doublecolumn layout no matter how many rows are in the original tibble.
I started by slicing the tibble into four sections, but got stuck because the fourth one didn't have as many elements as the first three.
Any suggestions on a method to accomplish this?
You can write a small function that takes the data and the number of columns you need. The default is 4 columns:
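reshaping = function(dat, cols = 4){
  n = nrow(dat)
  m = ceiling(n/cols)                            # rows per block
  time = rep(1:cols, each = m, length.out = n)   # block each row belongs to
  id = rep(1:m, times = cols, length.out = n)    # row position within its block
  # one output row per id, one scores/points pair per block; [-1] drops the id column
  reshape(cbind(id, time, dat), idvar = 'id', direction = 'wide')[-1]
}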
reshaping(dat)
scores.1 points.1 scores.2 points.2 scores.3 points.3 scores.4 points.4
1 328 1.449766 337 16.40224 346 43.67366 355 81.52636
2 329 2.398506 338 18.86847 347 47.37787 356 86.34135
3 330 3.544324 339 21.48127 348 51.21057 357 91.27426
4 331 4.876414 340 24.23793 349 55.17032 358 96.32417
5 332 6.386419 341 27.13600 350 59.25580 359 101.49017
6 333 8.067581 342 30.17321 351 63.46572 360 106.77140
7 334 9.914259 343 33.34746 352 67.79887 NA NA
8 335 11.921636 344 36.65681 353 72.25409 NA NA
9 336 14.085526 345 40.09944 354 76.83027 NA NA
reshaping(dat,8)
scores.1 points.1 scores.2 points.2 scores.3 points.3 scores.4 points.4 scores.5 points.5 scores.6 points.6 scores.7 points.7
1 328 1.449766 333 8.067581 338 18.86847 343 33.34746 348 51.21057 353 72.25409 358 96.32417
2 329 2.398506 334 9.914259 339 21.48127 344 36.65681 349 55.17032 354 76.83027 359 101.49017
3 330 3.544324 335 11.921636 340 24.23793 345 40.09944 350 59.25580 355 81.52636 360 106.77140
4 331 4.876414 336 14.085526 341 27.13600 346 43.67366 351 63.46572 356 86.34135 NA NA
5 332 6.386419 337 16.402235 342 30.17321 347 47.37787 352 67.79887 357 91.27426 NA NA
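Note that reshaping(dat, 8) above returns seven pairs rather than eight, because 33 rows split into blocks of ceiling(33/8) = 5 fill only seven blocks. If a fixed number of pairs is required, one alternative is to pad the data with NA rows first. The following is a base R sketch of that idea, not the answer's method; flow_columns is a hypothetical name and the points values are placeholders.

```r
# Hypothetical helper: always returns exactly `cols` blocks, padding with NA rows.
flow_columns <- function(dat, cols = 4) {
  dat <- as.data.frame(dat)                 # plain data.frame so out-of-range rows become NA
  rows <- ceiling(nrow(dat) / cols)         # rows per block
  padded <- dat[seq_len(rows * cols), , drop = FALSE]
  blocks <- split(padded, rep(seq_len(cols), each = rows))
  out <- do.call(cbind, unname(blocks))     # side by side, original column names repeated
  rownames(out) <- NULL
  out
}

# Placeholder data with the same shape as the question's tibble.
dat <- data.frame(scores = 328:360, points = seq_len(33))
wide <- flow_columns(dat)                   # 9 rows, 4 scores/points pairs
```

The repeated column names (scores, points, scores, points, ...) would still need renaming before display.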
I have a table of doctor visits wherein there are sometimes multiple records for the same encounter key if there are multiple diagnoses, such as:
Enc_Key | Patient_Key | Enc_Date | Diag_Key
123 789 20160512 765
123 789 20160512 263
123 789 20160515 493
546 013 20160226 765
564 444 20160707 004
789 226 20160707 546
789 226 20160707 765
I am trying to create an indicator variable based on the value of the Diag_Key column, but I need to apply it for the entire encounter. In other words, if I get a value of "765" for the diagnosis code, then I want to apply a "1" for the indicator variable to every record that has the same Enc_Key as the record that has a Diag_Key value of 765, such as below:
Enc_Key | Patient_Key | Enc_Date | Diag_Key | Diag_Ind
123 789 20160512 765 1
123 789 20160512 263 1
123 789 20160515 493 1
546 013 20160226 723 0
564 444 20160707 004 0
789 226 20160707 546 1
789 226 20160707 765 1
I can't seem to figure out a way to apply this binary indicator to multiple different records. I have been using a line of code that resembles this:
tbl$Diag_Ind <- ifelse(grepl('765',tbl$Diag_Key),1,0)
but this only assigns a value of "1" to the single record with that Diag_Key value, and I'm unsure how to apply it to the rest of the records with the same Enc_Key value.
Use == to compare values directly and %in% for filtering with multiple values. For example, this will identify all Enc_Keys which have some Diag_Key == 765:
dat$Enc_Key[dat$Diag_Key == 765]
Then just select the data by Enc_Key and convert boolean to integer:
as.integer(
dat$Enc_Key %in% unique(dat$Enc_Key[dat$Diag_Key == 765])
)
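Put together on a small data frame, the two steps look like this. The data are a cut-down copy of the question's table, using the desired-output value 723 for Enc_Key 546; the variable name flagged is only for illustration.

```r
# Cut-down copy of the question's data (only the columns the indicator needs).
dat <- data.frame(
  Enc_Key  = c(123, 123, 123, 546, 564, 789, 789),
  Diag_Key = c(765, 263, 493, 723, 4, 546, 765)
)

# Step 1: encounters that have at least one record with Diag_Key 765.
flagged <- unique(dat$Enc_Key[dat$Diag_Key == 765])

# Step 2: mark every record belonging to one of those encounters.
dat$Diag_Ind <- as.integer(dat$Enc_Key %in% flagged)
dat$Diag_Ind   # 1 1 1 0 0 1 1
```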
Use mutate from dplyr. There may be a typo in your required output: in the original data the Diag_Key for Enc_Key 546 is 765, but in the required data frame it is 723.
library(dplyr)
input = read.table(text = "Enc_Key Patient_Key Enc_Date Diag_Key
123 789 20160512 765
123 789 20160512 263
123 789 20160515 493
546 013 20160226 765
564 444 20160707 004
789 226 20160707 546
789 226 20160707 765", header = TRUE, stringsAsFactors = FALSE)
input %>% group_by(Enc_Key) %>%
mutate(Diag_Ind = max(grepl('765',Diag_Key)))
Output:
Enc_Key Patient_Key Enc_Date Diag_Key Diag_Ind
1 123 789 20160512 765 1
2 123 789 20160512 263 1
3 123 789 20160515 493 1
4 546 13 20160226 765 1
5 564 444 20160707 4 0
6 789 226 20160707 546 1
7 789 226 20160707 765 1
With the typo corrected, the output is:
Enc_Key Patient_Key Enc_Date Diag_Key Diag_Ind
1 123 789 20160512 765 1
2 123 789 20160512 263 1
3 123 789 20160515 493 1
4 546 13 20160226 723 0
5 564 444 20160707 4 0
6 789 226 20160707 546 1
7 789 226 20160707 765 1
I have a table of doctor visits wherein there are sometimes multiple records for the same encounter key (Enc_Key) if there are multiple diagnoses, such as:
Enc_Key | Patient_Key | Enc_Date | Diag
123 789 20160512 asthma
123 789 20160512 fever
123 789 20160515 coughing
546 013 20160226 flu
564 444 20160707 laceration
789 226 20160707 asthma
789 226 20160707 fever
I am trying to create an indicator variable Diag_Ind based on the value of the character variable Diag, but I need to apply it for the entire encounter. In other words, if I get a value of "asthma" for Diag for a record, then I want to apply a "1" for the Diag_Ind to every record that has the same Enc_Key, such as below:
Enc_Key | Patient_Key | Enc_Date | Diag | Diag_Ind
123 789 20160512 asthma 1
123 789 20160512 fever 1
123 789 20160515 coughing 1
546 013 20160226 flu 0
564 444 20160707 laceration 0
789 226 20160707 asthma attack 1
789 226 20160707 fever 1
I can't seem to figure out a way to apply this binary indicator to multiple records. I have been using a line of code that resembles this:
tbl$Diag_Ind <- ifelse(grepl('asthma',tolower(tbl$Diag)),1,0)
but this would only assign a value of "1" to the single record with that Diag value, such as this:
Enc_Key | Patient_Key | Enc_Date | Diag | Diag_Ind
123 789 20160512 asthma 1
123 789 20160512 fever 0
123 789 20160515 coughing 0
546 013 20160226 flu 0
564 444 20160707 laceration 0
789 226 20160707 asthma attack 1
789 226 20160707 fever 0
I'm unsure of how to apply it to the rest of the records with the same Enc_Key value.
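We can use base R ave to check whether any value in each Enc_Key group contains asthma. Passing the logical result of grepl to ave, rather than df$Diag itself, keeps the result from being coerced back to the type of Diag:
df$Diag_Ind <- as.integer(ave(grepl("asthma", df$Diag), df$Enc_Key, FUN = any))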
df
# Enc_Key Patient_Key Enc_Date Diag Diag_Ind
#1 123 789 20160512 asthma 1
#2 123 789 20160512 fever 1
#3 123 789 20160515 coughing 1
#4 546 13 20160226 flu 0
#5 564 444 20160707 laceration 0
#6 789 226 20160707 asthma 1
#7 789 226 20160707 fever 1
Similar solution with dplyr
library(dplyr)
df %>%
group_by(Enc_Key) %>%
mutate(Diag_Ind = as.numeric(any(grep("asthma", Diag))))
# Enc_Key Patient_Key Enc_Date Diag Diag_Ind
# (int) (int) (int) (fctr) (dbl)
#1 123 789 20160512 asthma 1
#2 123 789 20160512 fever 1
#3 123 789 20160515 coughing 1
#4 546 13 20160226 flu 0
#5 564 444 20160707 laceration 0
#6 789 226 20160707 asthma 1
#7 789 226 20160707 fever 1
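The question's own attempt lower-cased Diag and its desired output includes "asthma attack", so a case-insensitive partial match may be wanted. Here is a minimal base R sketch under that assumption; the mixed-case value "Asthma attack" is invented to show the effect, and has_asthma is an illustrative name.

```r
# Cut-down copy of the question's data; "Asthma attack" is an invented variant.
df <- data.frame(
  Enc_Key = c(123, 123, 123, 546, 564, 789, 789),
  Diag = c("asthma", "fever", "coughing", "flu", "laceration",
           "Asthma attack", "fever"),
  stringsAsFactors = FALSE
)

# Row-level match, ignoring case and allowing longer strings such as "asthma attack".
has_asthma <- grepl("asthma", df$Diag, ignore.case = TRUE)

# Spread the match to every record of the same encounter.
df$Diag_Ind <- as.integer(df$Enc_Key %in% df$Enc_Key[has_asthma])
df$Diag_Ind   # 1 1 1 0 0 1 1
```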
I have data on the work stations where workers worked each day, and I need to find how many days a worker began working at the same station where he left off the previous day. Each observation is one work-day per worker.
worker.id | start.station | end.station | day
1 | 234 | 342 | 2015-01-02
1 | 342 | 425 | 2015-01-03
1 | 235 | 621 | 2015-01-04
2 | 155 | 732 | 2015-01-02
2 | 318 | 632 | 2015-01-03
2 | 632 | 422 | 2015-01-04
So the desired outcome would be a variable (same) that identifies the days on which a worker started at the same work station where he left off the previous day (with NA or FALSE in the first observation for each worker).
worker.id | start.station | end.station | day | same
1 | 234 | 342 | 2015-01-02 | FALSE
1 | 342 | 425 | 2015-01-03 | TRUE
1 | 235 | 621 | 2015-01-04 | FALSE
2 | 155 | 732 | 2015-01-02 | FALSE
2 | 318 | 632 | 2015-01-03 | FALSE
2 | 632 | 422 | 2015-01-04 | TRUE
I think something using dplyr would work, but not sure what.
Thanks!
library(dplyr)  # for lag(); stats::lag() does not shift a plain vector
worker.id<-c(1,1,1,2,2,2)
start.station<-c(234,342,235,155,218,632)
end.station<-c(342,425,621,732,632,422)
day<-c("2015-01-02","2015-01-03","2015-01-04","2015-01-02","2015-01-03","2015-01-04")
df<-data.frame(worker.id, start.station ,end.station, day)
worker.id start.station end.station day
1 1 234 342 2015-01-02
2 1 342 425 2015-01-03
3 1 235 621 2015-01-04
4 2 155 732 2015-01-02
5 2 218 632 2015-01-03
6 2 632 422 2015-01-04
df$same<-ifelse(df$start.station!=lag(df$end.station) |
df$day=="2015-01-02", "FALSE","TRUE")
worker.id start.station end.station day same
1 1 234 342 2015-01-02 FALSE
2 1 342 425 2015-01-03 TRUE
3 1 235 621 2015-01-04 FALSE
4 2 155 732 2015-01-02 FALSE
5 2 218 632 2015-01-03 FALSE
6 2 632 422 2015-01-04 TRUE
Per suggestions in comments below if you want to group by worker ID but use ifelse (clunky):
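library(dplyr)  # provides %>%, group_by(), mutate() and the vector-shifting lag()
df <-df %>%
  group_by(worker.id) %>%
  # lag() is NA on each worker's first row, so same is NA there for now
  mutate(same=ifelse(start.station!=lag(end.station), "FALSE","TRUE")) %>%
  # turn those NAs into "FALSE" and leave the other values unchanged
  mutate(same=ifelse(is.na(same), "FALSE",same))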
as.data.frame(df)
worker.id start.station end.station day same
1 1 234 342 2015-01-02 FALSE
2 1 342 425 2015-01-03 TRUE
3 1 235 621 2015-01-04 FALSE
4 2 155 732 2015-01-02 FALSE
5 2 218 632 2015-01-03 FALSE
6 2 632 422 2015-01-04 TRUE
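The answers above store same as the strings "TRUE" and "FALSE". If a logical column is preferred and dplyr is not available, the per-worker lag can be sketched in base R with ave. This assumes the rows are already sorted by day within each worker; prev_end is an illustrative name, and the data are the question's values.

```r
# The question's data (start.station 318 on worker 2's second day).
df <- data.frame(
  worker.id     = c(1, 1, 1, 2, 2, 2),
  start.station = c(234, 342, 235, 155, 318, 632),
  end.station   = c(342, 425, 621, 732, 632, 422),
  day = c("2015-01-02", "2015-01-03", "2015-01-04",
          "2015-01-02", "2015-01-03", "2015-01-04")
)

# Previous day's end station within each worker; NA on a worker's first row.
prev_end <- ave(df$end.station, df$worker.id,
                FUN = function(x) c(NA, head(x, -1)))

# Logical result: FALSE where there is no previous day.
df$same <- !is.na(prev_end) & df$start.station == prev_end

# Number of such days per worker.
days_same <- tapply(df$same, df$worker.id, sum)
```

sum on the logical column then answers the original "how many days" question directly.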