Sometimes I see data posted in a Stack Overflow question formatted like in this question. This is not the first time, so I have decided to ask a question about it, and answer the question with a way to make the posted data palatable.
I will post the dataset example here just in case the question is deleted.
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
As you can see, this is not the right way to post data. As a user wrote in a comment:
It must've taken a bit of time to format the data the way you're
showing it here. Unfortunately this is not a good format for us to
copy & paste.
I believe this says it all. The asker is well-intentioned and clearly put time and effort into being helpful, but the result is not usable.
What can R code do to make that table usable, if anything? Will it take a great deal of trouble?
Using data.table::fread:
x = '
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
'
library(data.table)
fread(gsub('[\\+-]+\\n', '', x), drop = c(1, 7))
# Date Emp1 Case Priority PriorityCountinLast7days
# 1: 2018-06-01 A A1 0 0
# 2: 2018-06-03 A A2 0 1
# 3: 2018-06-03 A A3 0 2
# 4: 2018-06-03 A A4 1 1
# 5: 2018-06-03 A A5 2 1
# 6: 2018-06-04 A A6 0 3
# 7: 2018-06-01 B B1 0 1
# 8: 2018-06-02 B B2 0 2
# 9: 2018-06-03 B B3 0 3
The gsub part removes the horizontal rules; drop removes the extra columns created by the delimiters at the start and end of each line.
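For a quick sanity check of that intermediate step, you can print the cleaned string before handing it to fread:
# the "+----+" rules are gone; only the "|"-separated header and data lines remain
cat(gsub('[\\+-]+\\n', '', x))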
The short answer to the question is yes, R code can solve that mess and no, it doesn't take that much trouble.
The first step after copying & pasting the table into an R session is to read it in with read.table setting the header, sep, comment.char and strip.white arguments.
Credit for reminding me of the arguments comment.char and strip.white goes to @nicola and his comment.
dat <- read.table(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
", header = TRUE, sep = "|", comment.char = "+", strip.white = TRUE)
But as you can see there are some issues with the result.
dat
X Date Emp1 Case Priority PriorityCountinLast7days X.1
1 NA 2018-06-01 A A1 0 0 NA
2 NA 2018-06-03 A A2 0 1 NA
3 NA 2018-06-03 A A3 0 2 NA
4 NA 2018-06-03 A A4 1 1 NA
5 NA 2018-06-03 A A5 2 1 NA
6 NA 2018-06-04 A A6 0 3 NA
7 NA 2018-06-01 B B1 0 1 NA
8 NA 2018-06-02 B B2 0 2 NA
9 NA 2018-06-03 B B3 0 3 NA
Because each data row starts and ends with a separator, R believes those separators mark extra columns, which is not what the original question's OP meant.
So the second step is to keep only the real columns. I will do this by subsetting the columns by their numbers, which is easily done: the spurious ones are usually the first and last columns.
dat <- dat[-c(1, ncol(dat))]
dat
Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-03 A A3 0 2
4 2018-06-03 A A4 1 1
5 2018-06-03 A A5 2 1
6 2018-06-04 A A6 0 3
7 2018-06-01 B B1 0 1
8 2018-06-02 B B2 0 2
9 2018-06-03 B B3 0 3
That wasn't too hard; much better.
In this case there is still one task left: coercing column Date to class Date.
dat$Date <- as.Date(dat$Date)
And the result is satisfactory.
str(dat)
'data.frame': 9 obs. of 5 variables:
$ Date : Date, format: "2018-06-01" "2018-06-03" ...
$ Emp1 : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 2 2 2
$ Case : Factor w/ 9 levels "A1","A2","A3",..: 1 2 3 4 5 6 7 8 9
$ Priority : int 0 0 0 1 2 0 0 0 0
$ PriorityCountinLast7days: int 0 1 2 1 1 3 1 2 3
Note that I have not set the more or less standard argument stringsAsFactors = FALSE. If needed, this should be done when running read.table.
The whole process took only 3 lines of base R code.
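If character columns are preferred over factors, a minimal sketch of the same call with stringsAsFactors = FALSE would look as follows (tbl here is a hypothetical variable holding the same pasted table text as above; since R 4.0 this is already the default):
# tbl stands for the copied table text passed to text = "..." above
dat <- read.table(text = tbl, header = TRUE, sep = "|",
                  comment.char = "+", strip.white = TRUE,
                  stringsAsFactors = FALSE)
dat <- dat[-c(1, ncol(dat))]  # drop the fake first and last columns, as before
dat$Date <- as.Date(dat$Date)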
Finally, here is the end result in dput format, the way the data should have been posted in the first place.
dat <-
structure(list(Date = structure(c(17683, 17685, 17685, 17685,
17685, 17686, 17683, 17684, 17685), class = "Date"), Emp1 = c("A",
"A", "A", "A", "A", "A", "B", "B", "B"), Case = c("A1", "A2",
"A3", "A4", "A5", "A6", "B1", "B2", "B3"), Priority = c(0, 0,
0, 1, 2, 0, 0, 0, 0), PriorityCountinLast7days = c(0, 1, 2, 1,
1, 3, 1, 2, 3)), row.names = c(NA, -9L), class = "data.frame")
The issue isn't so much how many lines of code it takes; two or five makes little difference. The question is more whether it will work beyond the example posted here.
I haven't come across this sort of thing in the wild, but I had a go at constructing another example that I thought could conceivably exist.
I've since come across a couple more cases and added them to the test suite.
I've also included a table drawn using box-drawing characters. You don't come across this much these days, but for completeness' sake it's here.
x1 <- "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+
"
x2 <- "
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
Date | Emp1 | Case | Priority | PriorityCountinLast7days
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
2018-06-01 | A | A|1 | 0 | 0
2018-06-03 | A | A|2 | 0 | 1
2018-06-02 | B | B|2 | 0 | 2
2018-06-03 | B | B|3 | 0 | 3
––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
"
x3 <- "
Maths | English | Science | History | Class
0.1 | 0.2 | 0.3 | 0.2 | Y2
0.9 | 0.5 | 0.7 | 0.4 | Y1
0.2 | 0.4 | 0.6 | 0.2 | Y2
0.9 | 0.5 | 0.2 | 0.7 | Y1
"
x4 <- "
Season | Team | W | AHWO
-------------------------------------
1 | 2017/2018 | TeamA | 2 | 1.75
2 | 2017/2018 | TeamB | 1 | 1.85
3 | 2017/2018 | TeamC | 1 | 1.70
4 | 2016/2017 | TeamA | 1 | 1.49
5 | 2016/2017 | TeamB | 3 | 1.51
6 | 2016/2017 | TeamC | 2 | N/A
"
x5 <- "
A B C
┌───┬───┬───┐
A │ 5 │ 1 │ 3 │
├───┼───┼───┤
B │ 2 │ 5 │ 3 │
├───┼───┼───┤
C │ 3 │ 4 │ 4 │
└───┴───┴───┘
"
x6 <- "
------------------------------------------------------------
|date |Material |Description |
|----------------------------------------------------------|
|10/04/2013 |WM.5597394 |PNEUMATIC |
|11/07/2013 |GB.D040790 |RING |
------------------------------------------------------------
------------------------------------------------------------
|date |Material |Description |
|----------------------------------------------------------|
|08/06/2013 |WM.4M01004A05 |TOUCHEUR |
|08/06/2013 |WM.4M010108-1 |LEVER |
------------------------------------------------------------
"
My go at a function
f <- function(x=x6, header=TRUE, rem.dup.header=header,
na.strings=c("NA", "N/A"), stringsAsFactors=FALSE, ...) {
# read each row as a character string
x <- scan(text=x, what="character", sep="\n", quiet=TRUE)
# keep only lines containing alphanumerics
x <- x[grep("[[:alnum:]]", x)]
# remove vertical bars with trailing or leading space
x <- gsub("\\|? | \\|?", " ", x)
# remove vertical bars at beginning and end of string
x <- gsub("\\|?$|^\\|?", "", x)
# remove vertical box-drawing characters
x <- gsub("\U2502|\U2503|\U2505|\U2507|\U250A|\U250B", " ", x)
if (rem.dup.header) {
dup.header <- x == x[1]
dup.header[1] <- FALSE
x <- x[!dup.header]
}
# read the result as a table
read.table(text=paste(x, collapse="\n"), header=header,
na.strings=na.strings, stringsAsFactors=stringsAsFactors, ...)
}
lapply(c(x1, x2, x3, x4, x5, x6), f)
Output
[[1]]
Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A1 0 0
2 2018-06-03 A A2 0 1
3 2018-06-02 B B2 0 2
4 2018-06-03 B B3 0 3
[[2]]
Date Emp1 Case Priority PriorityCountinLast7days
1 2018-06-01 A A|1 0 0
2 2018-06-03 A A|2 0 1
3 2018-06-02 B B|2 0 2
4 2018-06-03 B B|3 0 3
[[3]]
Maths English Science History Class
1 0.1 0.2 0.3 0.2 Y2
2 0.9 0.5 0.7 0.4 Y1
3 0.2 0.4 0.6 0.2 Y2
4 0.9 0.5 0.2 0.7 Y1
[[4]]
Season Team W AHWO
1 2017/2018 TeamA 2 1.75
2 2017/2018 TeamB 1 1.85
3 2017/2018 TeamC 1 1.70
4 2016/2017 TeamA 1 1.49
5 2016/2017 TeamB 3 1.51
6 2016/2017 TeamC 2 NA
[[5]]
A B C
A 5 1 3
B 2 5 3
C 3 4 4
[[6]]
date Material Description
1 10/04/2013 WM.5597394 PNEUMATIC
2 11/07/2013 GB.D040790 RING
3 08/06/2013 WM.4M01004A05 TOUCHEUR
4 08/06/2013 WM.4M010108-1 LEVER
x3 is from here (you will have to look at the edit history).
x4 is from here.
x6 is from here.
md_table <- scan(text = "
+------------+------+------+----------+--------------------------+
| Date | Emp1 | Case | Priority | PriorityCountinLast7days |
+------------+------+------+----------+--------------------------+
| 2018-06-01 | A | A1 | 0 | 0 |
| 2018-06-03 | A | A2 | 0 | 1 |
| 2018-06-03 | A | A3 | 0 | 2 |
| 2018-06-03 | A | A4 | 1 | 1 |
| 2018-06-03 | A | A5 | 2 | 1 |
| 2018-06-04 | A | A6 | 0 | 3 |
| 2018-06-01 | B | B1 | 0 | 1 |
| 2018-06-02 | B | B2 | 0 | 2 |
| 2018-06-03 | B | B3 | 0 | 3 |
+------------+------+------+----------+--------------------------+",
what = "", sep = "", comment.char = "+", quiet = TRUE)
## it is clear that there are 5 columns
mat <- matrix(md_table[md_table != "|"], ncol = 5, byrow = TRUE)
# [,1] [,2] [,3] [,4] [,5]
# [1,] "Date" "Emp1" "Case" "Priority" "PriorityCountinLast7days"
# [2,] "2018-06-01" "A" "A1" "0" "0"
# [3,] "2018-06-03" "A" "A2" "0" "1"
# [4,] "2018-06-03" "A" "A3" "0" "2"
# [5,] "2018-06-03" "A" "A4" "1" "1"
# [6,] "2018-06-03" "A" "A5" "2" "1"
# [7,] "2018-06-04" "A" "A6" "0" "3"
# [8,] "2018-06-01" "B" "B1" "0" "1"
# [9,] "2018-06-02" "B" "B2" "0" "2"
#[10,] "2018-06-03" "B" "B3" "0" "3"
## a data frame with all character columns
dat <- setNames(data.frame(mat[-1, ], stringsAsFactors = FALSE), mat[1, ])
# Date Emp1 Case Priority PriorityCountinLast7days
#1 2018-06-01 A A1 0 0
#2 2018-06-03 A A2 0 1
#3 2018-06-03 A A3 0 2
#4 2018-06-03 A A4 1 1
#5 2018-06-03 A A5 2 1
#6 2018-06-04 A A6 0 3
#7 2018-06-01 B B1 0 1
#8 2018-06-02 B B2 0 2
#9 2018-06-03 B B3 0 3
## or maybe just use `type.convert` on some columns?
dat[] <- lapply(dat, type.convert, as.is = TRUE)  # as.is = TRUE keeps the character columns as character
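One detail, just as in the read.table answer above: type.convert will not produce Date objects, so that column still needs an explicit coercion, for example:
# type.convert() handles logical/integer/numeric, but leaves dates as text
dat$Date <- as.Date(dat$Date)
str(dat)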
Well, for this specific dataset I used the import feature in RStudio, but I took a few additional steps beforehand:
1. Copy the dataset into a Notepad file.
2. Replace all | characters with ,.
3. Import the file into RStudio with read.csv, separating the columns by , (roughly along the lines of the sketch below).
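A rough sketch of step 3; the file name and the exact arguments are assumptions on my part, not the original import code:
# hypothetical file name for the text edited in Notepad (| already replaced by ,)
dat <- read.csv("data_from_notepad.txt", comment.char = "+", strip.white = TRUE)
dat <- dat[-c(1, ncol(dat))]  # drop the empty columns created by the leading/trailing commas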
But if you mean getting R to fully parse the original table in one step, then I have no idea.
As was suggested, you could use dput to save the content of a data frame to a file, open the file in a text editor and paste its content. An example with the mtcars dataset limited to its first 10 rows:
library(dplyr)  # for %>%
dput(mtcars %>% head(10), file = 'reproducible.txt')
The content of reproducible.txt can then be used to make a data frame/tibble as shown below. In this form the data is machine-readable, but it is hard for a human to understand at first glance (without pasting it into R).
df <- structure(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
24.4, 22.8, 19.2), cyl = c(6, 6, 4, 6, 8, 6, 8, 4, 4, 6), disp = c(160,
160, 108, 258, 360, 225, 360, 146.7, 140.8, 167.6), hp = c(110,
110, 93, 110, 175, 105, 245, 62, 95, 123), drat = c(3.9, 3.9,
3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92), wt = c(2.62,
2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19, 3.15, 3.44), qsec = c(16.46,
17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3), vs = c(0,
0, 1, 1, 0, 1, 0, 1, 1, 1), am = c(1, 1, 1, 0, 0, 0, 0, 0, 0,
0), gear = c(4, 4, 4, 3, 3, 3, 3, 4, 4, 4), carb = c(4, 4, 1,
1, 2, 1, 4, 2, 2, 4)), .Names = c("mpg", "cyl", "disp", "hp",
"drat", "wt", "qsec", "vs", "am", "gear", "carb"), row.names = c("Mazda RX4",
"Mazda RX4 Wag", "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout",
"Valiant", "Duster 360", "Merc 240D", "Merc 230", "Merc 280"), class = "data.frame")
Related
I have data that looks like the table below. I would like to skip 2 rows after the max index of certain types (3 and 4). For example, I have two 4s in my table, but I only need to remove the 2 rows after the second 4. Same for 3: I only need to remove the 2 rows after the third 3.
-----------------
| grade | type |
-----------------
| 93 | 2 |
-----------------
| 90 | 2 |
-----------------
| 54 | 2 |
-----------------
| 36 | 4 |
-----------------
| 31 | 4 |
-----------------
| 94 | 1 |
-----------------
| 57 | 1 |
-----------------
| 16 | 3 |
-----------------
| 11 | 3 |
-----------------
| 12 | 3 |
-----------------
| 99 | 1 |
-----------------
| 99 | 1 |
-----------------
| 9 | 3 |
-----------------
| 10 | 3 |
-----------------
| 97 | 1 |
-----------------
| 96 | 1 |
-----------------
The desired output would be:
-----------------
| grade | type |
-----------------
| 93 | 2 |
-----------------
| 90 | 2 |
-----------------
| 54 | 2 |
-----------------
| 36 | 4 |
-----------------
| 31 | 4 |
-----------------
| 16 | 3 |
-----------------
| 11 | 3 |
-----------------
| 12 | 3 |
-----------------
| 9 | 3 |
-----------------
| 10 | 3 |
-----------------
Here is the code of my example:
data <- data.frame(grade = c(93,90,54,36,31,94,57,16,11,12,99,99,9,10,97,96), type = c(2,2,2,4,4,1,1,3,3,3,1,1,3,3,1,1))
Could anyone give me some hints on how to approach this in R? Thanks a bunch in advance for your help and your time!
# drop the 2 rows after the last occurrence of type 3 and of type 4
data[-c(max(which(data$type == 3)) + 1:2, max(which(data$type == 4)) + 1:2), ]
# grade type
# 1 93 2
# 2 90 2
# 3 54 2
# 4 36 4
# 5 31 4
# 8 16 3
# 9 11 3
# 10 12 3
Using some indexing:
data[-(nrow(data) - match(c(3,4), rev(data$type)) + 1 + rep(1:2, each=2)),]
# grade type
#1 93 2
#2 90 2
#3 54 2
#4 36 4
#5 31 4
#8 16 3
#9 11 3
#10 12 3
Or more generically:
vals <- c(3,4)
data[-(nrow(data) - match(vals, rev(data$type)) + 1 + rep(1:2, each=length(vals))),]
The logic is to match the first instance of each value in the reversed column, convert that position back into the original row index of the last occurrence (via nrow(data) - position + 1), then add 1 and 2 to those indexes and drop the resulting rows.
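The same computation broken into intermediate steps, purely as an illustration (it is the one-liner above, just unrolled):
vals <- c(3, 4)
pos_rev  <- match(vals, rev(data$type))               # first hit of each value in the reversed column
pos_last <- nrow(data) - pos_rev + 1                  # i.e. the last occurrence in the original order
drop_idx <- pos_last + rep(1:2, each = length(vals))  # the two rows following each of them
data[-drop_idx, ]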
Similar to Ric, but I find it a bit easier to read (way more verbose, though):
library(dplyr)

idx <- data %>%
  mutate(id = row_number()) %>%
  filter(type %in% 3:4) %>%
  group_by(type) %>%
  filter(id == max(id)) %>%
  pull(id)
data[-c(idx + 1, idx + 2),]
I have the following data frame:
df = data.frame(name = c("abc", "abc", "abc", "def", "def", "ghi", "ghi", "jkl", "jkl", "jkl", "jkl", "jkl"),
                ignore = c(0,1,0,0,1,1,1,0,0,0,1,1),
                time = 31:42)
name | ignore | time |
-----|--------|------|
abc | 0 | 31 |
abc | 1 | 32 |
abc | 0 | 33 |
def | 0 | 34 |
def | 1 | 35 |
ghi | 1 | 36 |
ghi | 1 | 37 |
jkl | 0 | 38 |
jkl | 0 | 39 |
jkl | 0 | 40 |
jkl | 1 | 41 |
jkl | 1 | 42 |
and I want to do the following:
Group by name
If ignore is all non-zero in a group, leave the time values as is for this group
If ignore contains at least one zero in a group (e.g. where name is jkl), randomly choose one of the rows in this group where ignore is zero, and apply a function f to the time value.
More specifically, for example if f(x) = x - 30 then I would expect to see something like this:
name | ignore | time |
-----|--------|------|
abc | 0 | 1 | <- changed
abc | 1 | 32 |
abc | 0 | 33 |
def | 0 | 4 | <- changed
def | 1 | 35 |
ghi | 1 | 36 | <- unchanged group
ghi | 1 | 37 | <- unchanged group
jkl | 0 | 38 |
jkl | 0 | 39 |
jkl | 0 | 10 | <- changed
jkl | 1 | 41 |
jkl | 1 | 42 |
I'm finding it hard to get an elegant solution to this. I am not sure how to apply a function to randomly selected rows within a group, nor what the best approach is for only applying a function to selected groups. I would ideally like to solve this via dplyr, but no problem if not.
library(dplyr)

f <- function(x) x - 30

df %>%
  group_by(name) %>%
  mutate(samp = if (any(ignore == 0)) sample(which(ignore == 0), 1) else FALSE,
         time = ifelse(row_number() != samp, time, f(time))) %>%
  select(-samp)
output
name ignore time
<chr> <dbl> <dbl>
1 abc 0 1
2 abc 1 32
3 abc 0 33
4 def 0 4
5 def 1 35
6 ghi 1 36
7 ghi 1 37
8 jkl 0 8
9 jkl 0 39
10 jkl 0 40
11 jkl 1 41
12 jkl 1 42
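Two caveats worth noting: the output changes between runs because sample() is involved, and sample(x, 1) treats a length-1 numeric x as 1:x, which could pick the wrong row if a group's single zero were not in its first row (it happens to be fine for this df). A sketch that addresses both, as an illustration rather than a replacement for the answer above:
set.seed(123)                                    # arbitrary seed, only for reproducibility
pick_one <- function(x) x[sample(length(x), 1)]  # always samples from x itself, even when length(x) == 1

df %>%
  group_by(name) %>%
  mutate(samp = if (any(ignore == 0)) pick_one(which(ignore == 0)) else 0L,
         time = ifelse(row_number() != samp, time, f(time))) %>%
  select(-samp)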
I want to check if some strings in one data frame are present in an index data frame; if they are not, I want to add them and put 0 in the empty columns. I suppose it should be quite simple with %in%, but I am struggling to combine it with other functions.
Imagine I have these two data frames: ls has all the possible values of the columns A and B. df, on the other hand, is the data frame I want to complete by adding rows, so that every unique combination of ID and S includes all the possible A and B values.
Example of dfs:
ls <- data.frame(A = c("ABC", "DEF", "GHI", "XYZ", "JKL"),
B = c("KLM","MNO","", "", ""))
df <- data.frame(ID = c(1,2,2),
S = c("x","y","z"),
A = c("ABC","DEF","XYZ"),
B = c("KLM","MNO","MNO"),
C = c("100","150","2"))
ls
+-----+-----+
| A | B |
+-----+-----+
| ABC | KLM |
| DEF | MNO |
| GHI | |
| XYZ | |
| JKL | |
+-----+-----+
df
+----+---+-----+-----+-----+
| ID | S | A | B | C |
+----+---+-----+-----+-----+
| 1 | x | ABC | KLM | 100 |
| 2 | y | DEF | MNO | 150 |
| 2 | z | XYZ | MNO | 2 |
+----+---+-----+-----+-----+
From those two data sets, I want to check, for each unique pair of ID and S, whether each A from ls is present in df. For the pairs that are incomplete, the missing A values from ls will be added.
So, the output data_frame would be something like this:
+----+---+-----+-----+-----+
| ID | S | A | B | C |
+----+---+-----+-----+-----+
| 1 | x | ABC | KLM | 100 |
| 1 | x | ABC | MNO | 0 |
| 1 | x | DEF | KLM | 0 |
| 1 | x | DEF | MNO | 0 |
| 1 | x | GHI | KLM | 0 |
| 1 | x | GHI | MNO | 0 |
| 1 | x | XYZ | KLM | 0 |
| 1 | x | XYZ | MNO | 0 |
| 1 | x | JKL | KLM | 0 |
| 1 | x | JKL | MNO | 0 |
| 2 | y | ABC | KLM | 0 |
| 2 | y | ABC | MNO | 0 |
| 2 | y | DEF | KLM | 0 |
| 2 | y | DEF | MNO | 150 |
| 2 | y | GHI | KLM | 0 |
| 2 | y | GHI | MNO | 0 |
| 2 | y | XYZ | KLM | 0 |
| 2 | y | XYZ | MNO | 0 |
| 2 | y | JKL | KLM | 0 |
| 2 | y | JKL | MNO | 0 |
| 2 | z | ABC | KLM | 0 |
| 2 | z | ABC | MNO | 0 |
| 2 | z | DEF | KLM | 0 |
| 2 | z | DEF | MNO | 0 |
| 2 | z | GHI | KLM | 0 |
| 2 | z | GHI | MNO | 0 |
| 2 | z | XYZ | KLM | 0 |
| 2 | z | XYZ | MNO | 2 |
| 2 | z | JKL | KLM | 0 |
| 2 | z | JKL | MNO | 0 |
+----+---+-----+-----+-----+
So far, I was trying something with group_by and add_row:
df %>% group_by(ID, S) %>%
ifelse(ls$A %in% df$A & ls$B %in% df$B, "",add_row(ID = df$ID,
S = df$S,
A = ls$A,
B = ls$B,
C = 0))
I am not sure if I am on the right path; I would be happy if someone could enlighten me on this.
Edit:
My real dataframes are like this:
> str(vj)
'data.frame': 2123 obs. of 5 variables:
$ ID : chr "E11" "E11" "E11" "E11" ...
$ Specificity: chr "DP" "PostF" "DP" "DP" ...
$ V_gene : chr "IGHV5-15" "IGHV2-NGC5" "IGHV5-157" "IGHV3-122" ...
$ J_gene : chr "IGHJ4-3" "IGHJ4-3" "IGHJ4-3" "IGHJ4-3" ...
$ Size : num 664 533 369 282 273 205 200 175 164 163 ...
> str(ls)
'data.frame': 96 obs. of 2 variables:
$ V_gene: chr "IGHV1-124" "IGHV1-138" "IGHV1-170" "IGHV1-58" ...
$ J_gene: chr "IGHJ1-1" "IGHJ2-1" "IGHJ3-2" "IGHJ4-3" ...
> head(vj)
ID Specificity V_gene J_gene Size
1 E11 DP IGHV5-15 IGHJ4-3 664
2 E11 PostF IGHV2-NGC5 IGHJ4-3 533
3 E11 DP IGHV5-157 IGHJ4-3 369
4 E11 DP IGHV3-122 IGHJ4-3 282
5 E11 PreF IGHV3-76 IGHJ2-1 273
6 E11 DP IGHV3-76 IGHJ4-3 205
> head(ls)
V_gene J_gene
1 IGHV1-124 IGHJ1-1
2 IGHV1-138 IGHJ2-1
3 IGHV1-170 IGHJ3-2
4 IGHV1-58 IGHJ4-3
5 IGHV1-84 IGHJ5-4
6 IGHV1-NGC1 IGHJ5-5
You can use complete and fill:
library(dplyr)
library(tidyr)
df %>%
complete(S, A = unique(ls$A), B = unique(ls$B), fill = list(C = 0)) %>%
group_by(S) %>%
fill(ID, .direction = "downup")
You can use expand.grid, cbind and mutate. The code below should give you some direction. I am sure there are shorter ways to do it, but this gives you a step-by-step approach so you can understand each step.
library(dplyr)  # needed for mutate()

ls <- data.frame(A = c("ABC", "DEF", "GHI", "XYZ"),
B = c("KLM","MNO","", ""))
df <- data.frame(ID = c(1,2,2),
S = c("x","y","z"),
A = c("ABC","DEF","XYZ"),
B = c("KLM","MNO","MNO"),
C = c("100","150","2"))
lsb <- subset(ls,ls$B != "")
ls2 <- expand.grid(S=df$S, A=ls$A, B=lsb$B)
ls3 <- expand.grid(ID=df$ID, A=ls$A, B=lsb$B)
ls4 <- cbind(ID=ls3$ID,ls2)
lsa <- mutate(ls4, C=ifelse((ls4$ID==df$ID & ls4$S==df$S & ls4$A==df$A & ls4$B==df$B) , df$C, 0))
lsa
> lsa
ID S A B C
1 1 x ABC KLM 100
2 2 y ABC KLM 0
3 2 z ABC KLM 0
4 1 x DEF KLM 0
5 2 y DEF KLM 0
6 2 z DEF KLM 0
7 1 x GHI KLM 0
8 2 y GHI KLM 0
9 2 z GHI KLM 0
10 1 x XYZ KLM 0
11 2 y XYZ KLM 0
12 2 z XYZ KLM 0
13 1 x ABC MNO 0
14 2 y ABC MNO 0
15 2 z ABC MNO 0
16 1 x DEF MNO 0
17 2 y DEF MNO 150
18 2 z DEF MNO 0
19 1 x GHI MNO 0
20 2 y GHI MNO 0
21 2 z GHI MNO 0
22 1 x XYZ MNO 0
23 2 y XYZ MNO 0
24 2 z XYZ MNO 2
I'm currently working on transforming a dataset to take the product of each previous observation in a data.table. This is easy to implement in Excel, but I am struggling to find a non-recursive solution in data.table. The data is shown below in short form; in the real data ID has thousands more levels and there are thousands of X's per ID. Each ID has the same number of X's.
| index | ID | X |
|-------|----|------|
| 1 | 1 | 0.8 |
| 2 | 1 | 0.75 |
| 3 | 1 | 0.72 |
| 4 | 2 | 0.9 |
| 5 | 2 | 0.5 |
| 6 | 2 | 0.45 |
What I want to end up with is the following
| index | ID | X | product |
|-------|----|------|---------|
| 1 | 1 | 0.8 | 0.8 |
| 2 | 1 | 0.75 | 0.6 |
| 3 | 1 | 0.72 | 0.432 |
| 4 | 2 | 0.9 | 0.9 |
| 5 | 2 | 0.5 | 0.45 |
| 6 | 2 | 0.45 | 0.2025 |
Here product is equal to X multiplied by all previous values of X for that particular ID. This can be done in a for loop, but I am looking for a solution that leverages data.table so it can be run on a cluster.
Reproducible data:
df <- fread('
index ID X
1 1 0.8
2 1 0.75
3 1 0.72
4 2 0.9
5 2 0.5
6 2 0.45
')
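For reference, the for-loop baseline the question alludes to could look something like this sketch (just the naive version being replaced, not an answer):
# naive approach: for each row, multiply together all X values of the same ID up to that row
df$product <- NA_real_
for (i in seq_len(nrow(df))) {
  df$product[i] <- prod(df$X[df$ID == df$ID[i] & df$index <= df$index[i]])
}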
You can use cumprod
# If data.table not already loaded, these steps are required first
# library(data.table)
# setDT(df)
df[, Xprod := cumprod(X), ID][]
# index ID X Xprod
# 1: 1 1 0.80 0.8000
# 2: 2 1 0.75 0.6000
# 3: 3 1 0.72 0.4320
# 4: 4 2 0.90 0.9000
# 5: 5 2 0.50 0.4500
# 6: 6 2 0.45 0.2025
If you need to apply a function other than prod, you can use frollapply. For example, the code below gives the same result as the code above.
df[, Xprod := frollapply(X, 1:.N, prod, adaptive = TRUE), by = ID]
I have a database like this:
ID | familysize | age | gender
------+------------+-----+--------
1001 | 4 | 26 | 1
1001 | 4 | 38 | 2
1001 | 4 | 30 | 2
1001 | 4 | 7 | 1
1002 | 3 | 25 | 2
1002 | 3 | 39 | 1
1002 | 3 | 10 | 2
1003 | 5 | 60 | 1
1003 | 5 | 50 | 2
1003 | 5 | 26 | 2
1003 | 5 | 23 | 1
1003 | 5 | 20 | 1
1004 | ....
I want to order this data frame by the age of the people in each ID, so I use this command:
library(plyr)
require(plyr)
b2 <- ddply(b, "ID", function(x) head(x[order(x$age, decreasing = TRUE), ]))
but when I use this command I lose some observations. What should I do to order this database?
b2 <- b[order(b$ID, -b$age), ]
should do the trick.
The arrange function in plyr does a great job here. Order by ID and then by age in descending order.
arrange(b, ID, desc(age))