This question already has answers here:
Using ifelse() to replace NAs in one data frame by referencing another data frame of different length
(3 answers)
Closed 6 years ago.
I have a data frame c like this:
c
Freq CTM
000110100111 2 NA
110110100111 1 32.58847
111001011000 2 NA
111111111111 1 25.61041
and a data frame nona_c like this:
nona_c
Freq CTM
000110100111 2 37.0642
111001011000 2 37.0642
I want to replace the NAs in the CTM column of c with the CTM values of nona_c. The rownames of nona_c (the binary strings) will always exist in c.
The output should be
mergedC
Freq CTM
000110100111 2 37.0642
110110100111 1 32.58847
111001011000 2 37.0642
111111111111 1 25.61041
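For anyone reproducing this, a minimal construction of the two data frames (values transcribed from the printouts above):

c <- data.frame(
  Freq = c(2, 1, 2, 1),
  CTM  = c(NA, 32.58847, NA, 25.61041),
  row.names = c("000110100111", "110110100111",
                "111001011000", "111111111111")
)
nona_c <- data.frame(
  Freq = c(2, 2),
  CTM  = c(37.0642, 37.0642),
  row.names = c("000110100111", "111001011000")
)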
I've been trying merge without success:

mergedC <- merge(x = c, y = nona_c, by = 0,  # by = 0 merges on row names
                 all.y = TRUE)
A match operation might make this more straightforward:
c$CTM[is.na(c$CTM)] <- nona_c$CTM[match(rownames(c)[is.na(c$CTM)], rownames(nona_c))]
#             Freq      CTM
#000110100111    2 37.06420
#110110100111    1 32.58847
#111001011000    2 37.06420
#111111111111    1 25.61041
We can do this with data.table using a join on the variable of interest. Here we join on the row-name column ('rn'). The values of 'i.CTM' are assigned (:=) to 'CTM'.
library(data.table)
setDT(c, keep.rownames=TRUE)[]
setDT(nona_c, keep.rownames=TRUE)[]
c[nona_c, CTM := i.CTM, on = "rn"]
c
# rn Freq CTM
#1: 000110100111 2 37.06420
#2: 110110100111 1 32.58847
#3: 111001011000 2 37.06420
#4: 111111111111 1 25.61041
NOTE: Row names are not retained in data.table (or dplyr). So, while converting the 'data.frame' to 'data.table', we use keep.rownames = TRUE to keep them as the 'rn' column.
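For completeness, a rough dplyr equivalent of the same fill, assuming c and nona_c are still plain data frames with row names (i.e., before the setDT conversion above; the helper column names rn and CTM_fill are made up here):

library(dplyr)
library(tibble)

mergedC <- c %>%
  rownames_to_column("rn") %>%
  left_join(
    nona_c %>% rownames_to_column("rn") %>% select(rn, CTM_fill = CTM),
    by = "rn"
  ) %>%
  mutate(CTM = coalesce(CTM, CTM_fill)) %>%  # keep CTM where present, fill NAs
  select(-CTM_fill) %>%
  column_to_rownames("rn")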
This question already has answers here:
Choose groups to keep/drop in data.table
(3 answers)
Subset rows corresponding to max value by group using data.table
(1 answer)
Closed 3 years ago.
I am trying to filter all rows within a group in a data.table based on the group's max value (here: keep the groups whose max x is below a cutoff). Below is how I would do it in dplyr, and how I got it working in two steps in data.table.
# dplyr
library(data.table)
library(dplyr)
df <- data.table(
  x = 1:12,
  y = 1:3
)
df %>%
  group_by(y) %>%
  filter(max(x) < 11)
# data.table
df[, max_value := max(x), by = y][max_value < 11]
The output should be
x y
1: 1 1
2: 4 1
3: 7 1
4: 10 1
Is there a way to do this in one step, without creating an extra column in my dataset? All I have been able to find is subsetting a group to get one specific value within the group, not returning all rows of the group that meet the condition.
We can use .I to get the row indices, extract the index column ('V1') and subset:
df[df[, .I[max(x) < 11], y]$V1]
# x y
#1: 1 1
#2: 4 1
#3: 7 1
#4: 10 1
Or another option is .SD
df[, .SD[max(x) < 11], y]
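On large data the .I route is usually the faster of the two, because it only collects a vector of row numbers per group instead of materialising a copy of each group's rows. A rough benchmark sketch (the sizes and the cutoff are invented for illustration):

library(data.table)
set.seed(42)
big <- data.table(x = sample(1e6), y = sample(1e3, 1e6, replace = TRUE))

# row-index approach: one grouped pass to collect indices, then one subset
system.time(big[big[, .I[max(x) < 999000], by = y]$V1])

# .SD approach: builds each group's subset before the filter is applied
system.time(big[, .SD[max(x) < 999000], by = y])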
This question already has answers here:
Use `j` to select the join column of `x` and all its non-join columns
(2 answers)
What does x. stand for in data.table joins with on=
(2 answers)
Closed 3 years ago.
Morning everyone
In data.table I found that, in a left join, referring to a column implicitly, i.e. without mentioning the table the column resides in, produces unexpected results even though the column names are unique.
Dummy data:
x <- data.table(a = 1:2); x
# a
# 1: 1
# 2: 2
y <- data.table(c = 1, d = 2); y
# c d
# 1: 1 2
Left join, retrieving column c without mentioning its table name:
z <- y[x, on=.(c=a), .(a,c,d)]; z
# a c d
# 1: 1 1 2
# 2: 2 2 NA
The problem arises when looking at the results above: row 2 of column c is supposed to be NA, yet it shows 2.
This is only rectified when the user explicitly mentions the table:
z <- y[x, on=.(c=a), .(a,x.c,d)]; z
# a x.c d
# 1: 1 1 2
# 2: 2 NA NA
It is perhaps worth mentioning that the x in x.c refers to the x in the x[i] syntax; in this case that is table y.
My question is: why is explicitly mentioning the table necessary for such a seemingly basic task? Or am I missing something? Thank you.
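Not an explanation of the design, but a minimal sketch that makes the behaviour visible: selecting both the plain join column and its x.-prefixed version side by side shows where each value comes from.

library(data.table)
x <- data.table(a = 1:2)
y <- data.table(c = 1, d = 2)

# After on=.(c=a), plain `c` is the join column and carries the values
# joined on (x's `a`), while x.c is table y's original column.
y[x, on = .(c = a), .(a, c, x.c, d)]
#    a c x.c  d
# 1: 1 1   1  2
# 2: 2 2  NA NA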
I have a data frame like this:
x = data.frame(type = c('a','b','c','a','b','a','b','c'),
               value = c(5,2,3,2,10,6,7,8))
Every item has attributes a, b and c, but some items may have missing records, i.e. only a and b.
The desired output is
y=data.frame(item=c(1,2,3), a=c(5,2,6), b=c(2,10,7), c=c(3,NA,8))
How can I transform x to y? Thanks
We can use dcast, numbering the occurrences of each type with rowid. (Note that rowid counts each type independently, so when a group skips a type the later values shift up: here the second 'c' (value 8) lands in row 2, whereas the desired output has it in row 3. The grouping-vector approach below avoids this.)
library(data.table)
out <- dcast(setDT(x), rowid(type) ~ type, value.var = 'value')
setnames(out, 'type', 'item')
out
# item a b c
#1: 1 5 2 3
#2: 2 2 10 8
#3: 3 6 7 NA
Create a grouping vector g, assuming each occurrence of a starts a new group; use tapply to create a table tab and coerce that to a data frame. No packages are used.
g <- cumsum(x$type == "a")
tab <- with(x, tapply(value, list(g, type), c))
as.data.frame(tab)
giving:
a b c
1 5 2 3
2 2 10 NA
3 6 7 8
An alternate definition of the grouping vector, slightly more complex but needed if some groups are missing a, is the following. It assumes that x lists the type values in order of their levels within each group, so that if a level is less than the prior level it must be the start of a new group.
g <- cumsum(c(-1, diff(as.numeric(x$type))) < 0)
Note that ultimately there must be some restriction on missingness; otherwise the problem is ambiguous. For example, if one group can have b and c missing and the next group can have a missing, then it is not determinable whether the b and c in the second group form a second group of their own or are part of the first.
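To make the grouping concrete, here is what the two definitions produce on the example x (a quick check, no new data):

# each 'a' starts a new group: a b c a b a b c  ->  1 1 1 2 2 3 3 3
cumsum(x$type == "a")

# level-drop definition: a new group starts whenever the level decreases
# relative to the previous row; on this data it yields the same grouping
cumsum(c(-1, diff(as.numeric(x$type))) < 0)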
I have the following data.table:
dat <- data.table(Y = as.factor(c("a","b","a")), a = c(1,2,3), b = c(3,2,1))
It looks like:
Y a b
1: a 1 3
2: b 2 2
3: a 3 1
What I want is to subtract the value of the column indicated by the value of Y by 1. E.g. the Y value of the first row is "a", so the value of the column "a" in the first row should be reduced by one.
The result should be:
Y a b
1: a 0 3
2: b 2 1
3: a 2 1
Is this possible? If yes, how? Thank you!
Using self-joins and get:
for (yval in dat[, unique(Y)]) {
  dat[yval, (yval) := get(yval) - 1L, on = "Y"]
}
dat[]
# Y a b
# 1: a 0 3
# 2: b 2 1
# 3: a 2 1
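A row-by-row set() variant is also possible; it is more verbose but makes the per-row update explicit. A sketch assuming dat as defined in the question:

library(data.table)
for (i in seq_len(nrow(dat))) {
  col <- as.character(dat$Y[i])  # name of the column this row points at
  set(dat, i = i, j = col, value = dat[[col]][i] - 1)
}
dat[]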
We can use melt/dcast to do this: melt the dataset to 'long' format after creating a row-sequence column ('N'), subtract 1 from the 'value' column where the 'Y' and 'variable' elements are equal, assigning (:=) the result to 'value', then dcast the 'long' format back to 'wide'.
dcast(melt(dat[, N := 1:.N], id.var = c("Y", "N"))[Y == variable,
  value := value - 1], N + Y ~ variable, value.var = "value")[, N := NULL][]
# Y a b
#1: a 0 3
#2: b 2 1
#3: a 2 1
First, an apply function to make the actual transformation. We need to apply by row and then use the first element (Y) to name the element to access and overwrite. The values I was accessing in a and b were strings because apply coerces the whole table to a character matrix when any column (here Y) is non-numeric, so I used as.numeric to turn them back into numbers.
tformDat <- apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})
Then you need to reformat back to the original data.table format
data.table(t(tformDat))
The whole thing can be done in one line.
data.table(t(apply(dat, 1, function(x) {x[x[1]] <- as.numeric(x[x[1]]) - 1;x})))
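Since apply goes through a character matrix, the numeric columns come back as character; a small follow-up sketch to restore them (column names as in the question):

res <- data.table(t(tformDat))
res[, c("a", "b") := lapply(.SD, as.numeric), .SDcols = c("a", "b")]
res
#    Y a b
# 1: a 0 3
# 2: b 2 1
# 3: a 2 1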
If I specify n columns as the key of a data.table, I'm aware that I can join to fewer columns than are defined in that key, as long as I join to the head of key(DT). For example, for n = 2:
X = data.table(A=rep(1:5, each=2), B=rep(1:2, each=5), key=c('A','B'))
X
A B
1: 1 1
2: 1 1
3: 2 1
4: 2 1
5: 3 1
6: 3 2
7: 4 2
8: 4 2
9: 5 2
10: 5 2
X[J(3)]
A B
1: 3 1
2: 3 2
There I only joined to the first column of the 2-column key of DT. I know I can join to both columns of the key like this:
X[J(3,1)]
A B
1: 3 1
But how do I subset using only the second column of the key (e.g. B == 2), while still using binary search rather than a vector scan? I'm aware that's a duplicate of:
Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan
so I'd like to generalise this question to n. My data set has about a million rows, and the solution provided in the duplicate question linked above doesn't seem to be optimal.
Here is a simple function that will extract the correct unique values and return a data table to use as a key.
X <- data.table(A = rep(1:5, each = 4), B = rep(1:4, each = 5),
                C = letters[1:20], key = c('A','B','C'))
make.key <- function(ddd, what){
  # the names of the key columns
  zzz <- key(ddd)
  # the key columns for which we keep all unique values
  whichUnique <- setdiff(zzz, names(what))
  ## unique values of each such column (already sorted when keyed);
  ## the `..` prefix means "look up one level"
  ud <- lapply(ddd[, ..whichUnique], unique)
  ## append the `what` columns and build a cross join (CJ) of the new
  ## key columns, in key order
  do.call(CJ, c(ud, what)[zzz])
}
X[make.key(X, what = list(C = c('a','b'))), nomatch = 0]
## A B C
## 1: 1 1 a
## 2: 1 1 b
I'm not sure this will be any quicker than a couple of vector scans on a large data.table though.
Adding secondary keys is on the feature request list:
FR#1007 Build in secondary keys
In the meantime we are stuck with either vector scan, or the approach used in the answer to the n=2 case linked in the question (which @mnel generalises nicely in his answer).
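For readers on a newer data.table: secondary indices have since been built in via setindex(), so a sketch like the following subsets on the second key column with binary search (assuming a reasonably recent data.table, >= 1.9.6):

library(data.table)
X <- data.table(A = rep(1:5, each = 2), B = rep(1:2, each = 5),
                key = c('A','B'))

# create a secondary index on B; on= joins can then use binary search
# without physically re-sorting X
setindex(X, B)
X[.(2), on = "B"]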