Difference across columns in a data.table - r

Sorry if this is a quite basic point, but I can't find a convenient tool.
I have a (quite large) data table and want to difference across columns, that is
A B C D
9 N.A. 3 2
15 4 N.A. N.A.
N.A. N.A. 2 3
I want to create a new column E that is what is left of A after subtracting B, C, and D. For N.A.s in columns B, C and D, I can assume zeros, but when there is an N.A. in A I have to ignore this observation. So the final result should be
A B C D E
9 N.A. 3 2 4
15 4 N.A. N.A. 11
I was removing all the rows in DT that are N.A. in A by
DT <- DT[!(DT$A=="N.A.")]
and then I tried
DT[, E:= lapply(.SD, diff), .SDcols = c("A", "B", "C", "D")]
but that fails because of the N.A.'s.
I don't want to manually change the N.A.s into 0 (because later on I might want to distinguish what was a real zero and what was what I imputed)- I'd like to do it inside a function. Does anybody have a good idea?

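Here you go. Since E = A - (B + C + D) = 2*A - (A + B + C + D), you can subtract the row sums (with NAs dropped) from twice A: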
df$E[!is.na(df$A)] = 2*df$A[!is.na(df$A)] - rowSums(df[!is.na(df$A),], na.rm = T)
Example:
df = data.frame(A = c(19,25,NA,17),B = c(1,2,3,4), C = c(5,NA,NA,9), D = c(3,1,2,NA))
>df
A B C D
1 19 1 5 3
2 25 2 NA 1
3 NA 3 NA 2
4 17 4 9 NA
df$E[!is.na(df$A)] = 2*df$A[!is.na(df$A)] - rowSums(df[!is.na(df$A),], na.rm = T)
> df
A B C D E
1 19 1 5 3 10
2 25 2 NA 1 22
3 NA 3 NA 2 NA
4 17 4 9 NA 4

I assume all columns are of type character.
require(data.table)
DT <- data.table(A = c("9", "15", "N.A."),
B = c("N.A.", "4", "N.A."),
C = c("3", "N.A.", "2"),
D = c("2", "N.A.", "3"))
DT <- DT[A != "N.A."]
Compute row number.
DT[, rownum := .I]
You will get warnings because N.A. can not be converted to type numeric.
DT[, E := as.numeric(A) - sum(as.numeric(B),
as.numeric(C),
as.numeric(D), na.rm = T), by = rownum]
DT
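If the columns can be stored as numeric with real NA values (for example by reading the file with na.strings = "N.A.", which is an assumption about how the data is loaded), a vectorized sketch avoids both the per-row grouping and the coercion warnings. The zeros are never written into B, C or D, so the original NAs stay distinguishable:

```r
library(data.table)

# Same data as in the question, with "N.A." stored as a real NA
DT <- data.table(A = c(9, 15, NA),
                 B = c(NA, 4, NA),
                 C = c(3, NA, 2),
                 D = c(2, NA, 3))

# Only rows with a non-missing A get a value; rowSums(na.rm = TRUE)
# treats NAs in B, C and D as zeros without modifying those columns
DT[!is.na(A), E := A - rowSums(.SD, na.rm = TRUE), .SDcols = c("B", "C", "D")]
DT
```

The row where A is missing keeps E = NA, so it can be dropped later or left in place.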

Related

Find overlap of multiple ranges in data.table

I would like to find the overlapping part of multiple ranges which are given row-wise in a data.table object.
An example would be:
t <- data.table(a = c(3,4,5), b = c(13,12,19))
So we have the ranges:
3 - 13,
4 - 12,
5 - 19
Hence the overlapping range would be:
5 - 12
In case of an additional range 19 - 22 the overlap should return NA - NA or 0 - 0 since there is no overlap.
I found solutions for similar problems, like spatstat.utils::intersect.ranges(). However, this works only on two vectors and is hard to use in a data.table manner such as
DT[, .(o.l = function()[1], o.r = function()[2]), by = .()]
which I would really like to do if possible.
As output for this example I would like to have:
t <- data.table(a = c(3,4,5), b = c(13,12,19), o.l = c(5,5,5), o.r = c(12,12,12))
Here's a one-line example:
library(data.table)
dt = data.table(a = c(3,4,5), b = c(13,12,19))
dt[, c("o.l", "o.r") := as.list(range(Reduce(intersect, Map(seq, a, b))))]
dt
# a b o.l o.r
# 1: 3 13 5 12
# 2: 4 12 5 12
# 3: 5 19 5 12
Where the core of the problem is
dt = data.table(a = c(3,4,5), b = c(13,12,19))
dt[, Reduce(intersect, Map(seq, a, b))]
# [1] 5 6 7 8 9 10 11 12
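Because the intersection of closed intervals is itself the interval from the largest left end to the smallest right end, the same result can be sketched without enumerating every integer. This also works for non-integer bounds and returns NA when there is no common overlap. Note that common_range is a hypothetical helper name, not part of data.table:

```r
library(data.table)

# Hypothetical helper: common overlap of all ranges [lo, hi],
# or NA/NA when the ranges do not all overlap
common_range <- function(lo, hi) {
  left  <- max(lo)
  right <- min(hi)
  if (left <= right) list(left, right) else list(NA_real_, NA_real_)
}

dt <- data.table(a = c(3, 4, 5), b = c(13, 12, 19))
dt[, c("o.l", "o.r") := common_range(a, b)]

# Adding the range 19 - 22 removes the common overlap
dt2 <- data.table(a = c(3, 4, 5, 19), b = c(13, 12, 19, 22))
dt2[, c("o.l", "o.r") := common_range(a, b)]
```

Ranges that only touch at an end point (for example 5 - 12 and 12 - 19) count as overlapping here; change `<=` to `<` if that is not wanted.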
Borrowing the idea from David Arenburg's answer in How to flatten / merge overlapping time periods, here is another possible approach:
DT[, g := c(0L, cumsum(shift(a, -1L) >= cummax(b))[-.N])][,
c("ol", "or") := .(max(a), min(b)), g]
data:
DT <- data.table(a = c(3,4,5,19,20,24), b = c(13,12,19,22,23,25))
output:
a b g ol or
1: 3 13 0 5 12
2: 4 12 0 5 12
3: 5 19 0 5 12
4: 19 22 1 20 22
5: 20 23 1 20 22
6: 24 25 2 24 25

Column order of `.SD` in j argument differs when `get()` is used

I very often transform subsets of data using the .SDcols option in data.table. It makes sense that the .SD columns sent to j are in the same order as the original data.table.
EDITED to properly identify the issue
It's nice that .SD columns have the same order as that specified in the .SDcols argument. This does not happen when get is used in the j argument (inside an lapply call, at least). In this case, the .SD table columns maintain their original order.
Is there any way to override this behaviour?
An example without get works fine
library(data.table)
dt = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# Generate columns of first differences by group
dt[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) L - shift(L, n = 1, type='lag') ),
keyby = col1, .SDcols = d.vars]
The results are as expected: the .SD table's columns are ordered the same way as the names in d.vars.
> dt
col1 b a c d.a d.b
1: A -0.28901751 1 A NA NA
2: A 0.65746901 4 D 3 0.94648651
3: A -0.10602462 7 G 3 -0.76349362
4: A -0.38406252 10 J 3 -0.27803790
5: B -1.06963450 2 B NA NA
6: B 0.35137273 5 E 3 1.42100723
7: B 0.43394046 8 H 3 0.08256772
8: B 0.82525042 11 K 3 0.39130996
9: C 0.50421710 3 C NA NA
10: C -1.09493665 6 F 3 -1.59915375
11: C -0.04858163 9 I 3 1.04635501
12: C 0.45867279 12 L 3 0.50725443
Which is the expected output because lapply in j processed column a first and b second, in spite of the column order in dt.
Example with get behaves differently
dt2 = data.table(col1 = rep(LETTERS[1:3], 4),
b = rnorm(12),
a = 1:12,
neg = -1,
c = LETTERS[1:12])
# columns I want to do something to
d.vars = c('a', 'b') #' names in different order than names(dt)
# name of variable to be called in j.
negate <- 'neg'
dt2[, paste('d', d.vars, sep='.') :=
lapply(.SD, function(L) {(L - shift(L, n = 1, type='lag') ) * get(negate) }),
keyby = col1, .SDcols = d.vars]
Now the naming of the newly created columns doesn't align with the name order in d.vars:
> dt2
col1 b a neg c d.a d.b
1: A -0.3539066 1 -1 A NA NA
2: A 0.2702374 4 -1 D -0.62414408 -3
3: A -0.7834941 7 -1 G 1.05373150 -3
4: A -1.2765652 10 -1 J 0.49307118 -3
5: B -0.2936422 2 -1 B NA NA
6: B -0.2451996 5 -1 E -0.04844252 -3
7: B -1.6577614 8 -1 H 1.41256181 -3
8: B 1.0668059 11 -1 K -2.72456737 -3
9: C -0.1160938 3 -1 C NA NA
10: C -0.7940771 6 -1 F 0.67798333 -3
11: C 0.2951743 9 -1 I -1.08925140 -3
12: C -0.4508854 12 -1 L 0.74605969 -3
In this second example the b column is processed by lapply first and therefore assigned to d.a.
If I refer to neg directly (i.e., I don't use get) then the results are as expected: lapply processes the .SD columns in the order given in d.vars.
p.s. Thanks data.table team! I love this package!
Based on the description, we can use match to reorder 'd.vars' so that it follows the column order of 'dt' (giving 'd.vars1'), and then use that to get the names right
d.vars1 <- d.vars[match(names(dt), d.vars, nomatch = 0)]
dt[, paste0("d.",d.vars1) := lapply(.SD, function(L)
L - shift(L, n = 1, type='lag') ), keyby = col1, .SDcols = d.vars1]
dt
# col1 b a c d.b d.a
# 1: A -0.28901751 1 A NA NA
# 2: A 0.65746901 4 D 0.94648652 3
# 3: A -0.10602462 7 G -0.76349363 3
# 4: A -0.38406252 10 J -0.27803790 3
# 5: B -1.06963450 2 B NA NA
# 6: B 0.35137273 5 E 1.42100723 3
# 7: B 0.43394046 8 H 0.08256773 3
# 8: B 0.82525042 11 K 0.39130996 3
# 9: C 0.50421710 3 C NA NA
#10: C -1.09493665 6 F -1.59915375 3
#11: C -0.04858163 9 I 1.04635502 3
#12: C 0.45867279 12 L 0.50725442 3
Update
Based on the new dataset
d.vars1 <- d.vars[match(names(dt2), d.vars, nomatch = 0)]
dt2[, paste0('d.', d.vars1) := lapply(.SD, function(L)
(L - shift(L, n = 1, type='lag')) * get(negate) ),
keyby = col1, .SDcols = d.vars1]
dt2
# col1 b a neg c d.b d.a
# 1: A -0.3539066 1 -1 A NA NA
# 2: A 0.2702374 4 -1 D -0.62414408 -3
# 3: A -0.7834941 7 -1 G 1.05373150 -3
# 4: A -1.2765652 10 -1 J 0.49307118 -3
# 5: B -0.2936422 2 -1 B NA NA
# 6: B -0.2451996 5 -1 E -0.04844252 -3
# 7: B -1.6577614 8 -1 H 1.41256181 -3
# 8: B 1.0668059 11 -1 K -2.72456737 -3
# 9: C -0.1160938 3 -1 C NA NA
#10: C -0.7940771 6 -1 F 0.67798333 -3
#11: C 0.2951743 9 -1 I -1.08925140 -3
#12: C -0.4508854 12 -1 L 0.74605969 -3
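Another way to make the result independent of the column order inside .SD is to look each column up by name rather than iterating over .SD itself. The following is only a sketch under some assumptions: it replaces rnorm(12) with deterministic numbers so the result is reproducible, passes the multiplier column through .SDcols instead of calling get(), and uses by rather than keyby so the rows are not reordered:

```r
library(data.table)

dt2 <- data.table(col1 = rep(LETTERS[1:3], 4),
                  b = as.numeric((1:12)^2),   # deterministic stand-in for rnorm(12)
                  a = 1:12,
                  neg = -1,
                  c = LETTERS[1:12])
d.vars <- c("a", "b")
negate <- "neg"

# Pull each column out of .SD by name, so the position of the
# columns inside .SD never decides which result goes where
dt2[, paste0("d.", d.vars) := lapply(d.vars, function(v) {
        (.SD[[v]] - shift(.SD[[v]], n = 1, type = "lag")) * .SD[[negate]]
      }),
    by = col1, .SDcols = c(d.vars, negate)]
dt2
```

Here d.a is built from a and d.b from b because the list returned by lapply follows d.vars, the same vector used to build the new names.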

Set value of data frame new field equal to another field based on condition on a third field in R

I want to add a field to a given data frame and set it equal to an existing field in the same data frame, based on a condition on a different (existing) field.
I know this works:
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
df$test[is.even(df$a)] <- as.character(df[is.even(df$a), "b"])
> df
a b test
1 1 A NA
2 2 B B
3 3 C NA
4 4 D D
5 5 E NA
6 6 F F
But I have this feeling it can be done a lot better than this.
Using data.table it's quite easy
library(data.table)
dt = data.table(a = c(1,2,3,4,5,6),
b = c("A","B","C","D","E","F"))
dt[is.even(a), test := b]
> dt
a b test
1: 1 A NA
2: 2 B B
3: 3 C NA
4: 4 D D
5: 5 E NA
6: 6 F F
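For completeness, a base R sketch of the same idea with ifelse, which builds the whole column in one step instead of assigning into a subset. It reuses the is.even helper from the question:

```r
is.even <- function(x) x %% 2 == 0
df <- data.frame(a = c(1, 2, 3, 4, 5, 6),
                 b = c("A", "B", "C", "D", "E", "F"))

# Take b where a is even, NA elsewhere; as.character() guards against
# b being a factor (the default in R versions before 4.0)
df$test <- ifelse(is.even(df$a), as.character(df$b), NA_character_)
df
```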

Match a list of items with rows items of a data.frame

Hi guys I have a difficult situation to manage:
I have a data.frame that looks like this:
General_name
a
b
c
d
m
n
and another data.frame that looks like this:
First_names_list a=34;b=4
Second_names_list d=2;m=98;n=32
Third_names_list c=1;d=12;m=0.1
I have to match each element of the first data.frame with the part before = in each item of the second data.frame's second column, so that I finally obtain the following table:
Names a b c d m n
First_names_list 34 4 NA NA NA NA
Second_names_list NA NA NA 2 98 32
Third_names_list NA NA 1 12 0.1 NA
Any suggestion? It seems to be too difficult to me.
Best
E.
Option 1
Here is one approach using dcast from "reshape2" and concat.split from my "splitstackshape" package:
library(splitstackshape)
## The following can also be done in 2 steps. The basic idea is to split
## the values into a semi-long form for `dcast` to be able to use. So,
## I've split first on the semicolon, and made the data into a long form
## at the same time, then I've split on =, but kept it wide that time.
out <- concat.split(concat.split.multiple(df, "V2", ";", "long"),
"V2", "=", drop = TRUE)
out
# V1 time V2_1 V2_2
# 1 First_names_list 1 a 34.0
# 2 Second_names_list 1 d 2.0
# 3 Third_names_list 1 c 1.0
# 4 First_names_list 2 b 4.0
# 5 Second_names_list 2 m 98.0
# 6 Third_names_list 2 d 12.0
# 7 First_names_list 3 <NA> NA
# 8 Second_names_list 3 n 32.0
# 9 Third_names_list 3 m 0.1
library(reshape2)
dcast(out[complete.cases(out), ], V1 ~ V2_1, value.var="V2_2")
# V1 a b c d m n
# 1 First_names_list 34 4 NA NA NA NA
# 2 Second_names_list NA NA NA 2 98.0 32
# 3 Third_names_list NA NA 1 12 0.1 NA
Option 2
Here's another option using a more recent version of data.table. The concept is very similar to the approach taken above.
library(data.table)
library(reshape2)
packageVersion("data.table")
# [1] ‘1.8.11’
dt <- data.table(df)
S1 <- dt[, list(X = unlist(strsplit(as.character(V2), ";"))), by = V1]
S1[, c("A", "B") := do.call(rbind.data.frame, strsplit(X, "="))]
S1
# V1 X A B
# 1: First_names_list a=34 a 34
# 2: First_names_list b=4 b 4
# 3: Second_names_list d=2 d 2
# 4: Second_names_list m=98 m 98
# 5: Second_names_list n=32 n 32
# 6: Third_names_list c=1 c 1
# 7: Third_names_list d=12 d 12
# 8: Third_names_list m=0.1 m 0.1
dcast.data.table(S1, V1 ~ A, value.var="B")
# V1 a b c d m n
# 1: First_names_list 34 4 NA NA NA NA
# 2: Second_names_list NA NA NA 2 98 32
# 3: Third_names_list NA NA 1 12 0.1 NA
Both of the above options assume we're starting with:
df <- structure(list(V1 = c("First_names_list", "Second_names_list",
"Third_names_list"), V2 = c("a=34;b=4", "d=2;m=98;n=32",
"c=1;d=12;m=0.1")), .Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -3L))
Here is a solution, using apply within apply:
#Data frame 1
df1 <- read.table(text=
"General_name
a
b
c
d
m
n", header=T, as.is=T)
#Data frame 2
df2 <- read.table(text=
"col1 col2
First_names_list a=34;b=4
Second_names_list d=2;m=98;n=32
Third_names_list c=1;d=12;m=0.1", header=T, as.is=T)
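#make lists for each row, sep by ";"
df2split <- strsplit(df2$col2, split = ";")
#result: one row per list, one column per general name
t(
  sapply(seq_len(nrow(df2)), function(i) {
    x <- df2split[[i]]
    sapply(df1$General_name, function(n) {
      # keep only the item whose key is exactly n, then drop the "n=" prefix
      key <- paste0(n, "=")
      val <- sub(key, "", x[startsWith(x, key)], fixed = TRUE)
      if (length(val) == 0) NA else as.numeric(val[1])
    })
  })
)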
I feel this is a slightly round-about way to do it so I look forward to a better solution as well. But this works.
library(data.table)
library(reshape2)
#creating datasets
dt <- data.table(read.csv(textConnection('
"First_names_list","a=34;b=4"
"Second_names_list","d=2;m=98;n=32"
"Third_names_list","c=1;d=12;m=0.1"
'),header = FALSE))
General_name = c('a','b','c','d','m','n')
TotalBreakup <- data.table(
V1 = General_name
)
# Fixing datatypes
TotalBreakup <- TotalBreakup[,lapply(.SD,as.character)]
dt <- dt[,lapply(.SD,as.character)]
# looping through each row and calculating breakdown
for(i in 1:nrow(dt))
{
# the next two statements are the workhorse of this code. Run each part of these statements step by step to see
# split the row into "key=value" items, then each item into key and value
dtlist <- strsplit(unlist(strsplit(dt[i,V2],";")),"=")
# one row per item: V1 = key, V2 = value
breakup <- data.table(t(matrix(unlist(dtlist), nrow = 2)))
# fixing datatypes again
breakup <- breakup[,lapply(.SD,as.character)]
#appending to master dataset
TotalBreakup <- merge(TotalBreakup, breakup, by = "V1", all.x = TRUE)
}
#formatting results
setnames(TotalBreakup,c("Names",dt[,V1]))
TotalBreakup <- acast(melt(TotalBreakup,id.vars = "Names"),variable~Names)
Output -
> TotalBreakup
a b c d m n
First_names_list "34" "4" NA NA NA NA
Second_names_list NA NA NA "2" "98" "32"
Third_names_list NA NA "1" "12" "0.1" NA
A way is this:
#the second dataframe you provided
DF2 <- read.table(text = '
First_names_list a=34;b=4
Second_names_list d=2;m=98;n=32
Third_names_list c=1;d=12;m=0.1
', header = F, stringsAsFactors = F)
#empty dataframe
DF <- structure(list(a = c(NA, NA, NA), b = c(NA, NA, NA), c = c(NA,
NA, NA), d = c(NA, NA, NA), m = c(NA, NA, NA), n = c(NA, NA,
NA)), .Names = c("a", "b", "c", "d", "m", "n"), row.names = c("First_names_list",
"Second_names_list", "Third_names_list"), class = "data.frame")
DF
# a b c d m n
#First_names_list NA NA NA NA NA NA
#Second_names_list NA NA NA NA NA NA
#Third_names_list NA NA NA NA NA NA
#fill the dataframe
myls <- strsplit(DF2$V2, split = ";")
for (i in seq_along(myls)) {
  for (item in myls[[i]]) {
    # item looks like "a=34": res[1] is the column name, res[2] the value
    res <- unlist(strsplit(item, "="))
    DF[i, res[1]] <- res[2]
  }
}
DF
# a b c d m n
#First_names_list 34 4 <NA> <NA> <NA> <NA>
#Second_names_list <NA> <NA> <NA> 2 98 32
#Third_names_list <NA> <NA> 1 12 0.1 <NA>
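A compact base R sketch of the same matching, assuming the second data frame has the columns V1 and V2 as in the answers above. Each row is turned into a named numeric vector and then indexed by the full list of general names, so names that are absent come back as NA:

```r
df2 <- data.frame(V1 = c("First_names_list", "Second_names_list", "Third_names_list"),
                  V2 = c("a=34;b=4", "d=2;m=98;n=32", "c=1;d=12;m=0.1"),
                  stringsAsFactors = FALSE)
keys <- c("a", "b", "c", "d", "m", "n")

# One character vector of "key=value" items per row
pairs <- strsplit(df2$V2, ";", fixed = TRUE)

res <- t(sapply(pairs, function(p) {
  kv <- do.call(rbind, strsplit(p, "=", fixed = TRUE))  # column 1: key, column 2: value
  vals <- setNames(as.numeric(kv[, 2]), kv[, 1])
  vals[keys]                                            # missing keys become NA
}))
dimnames(res) <- list(df2$V1, keys)
res
```

The result is a numeric matrix rather than a data frame; wrapping it in as.data.frame() is enough if a data frame is needed.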

I have multiple dataframes under one name and I need to create a new column in each one by combining two of the other columns? [duplicate]

I have several csv files all named with dates and for all of them I want to create a new column in each file that contains data from two other columns placed together. Then, I want to combine them into one big dataframe and choose only two of those columns to keep. Here's an example:
Say I have two dataframes:
a b c a b c
x 1 2 3 x 3 2 1
y 2 3 1 y 2 1 3
Then I want to create a new column d in each of them:
a b c d a b c d
x 1 2 3 13 x 3 2 1 31
y 2 3 1 21 y 2 1 3 23
Then I want to combine them like this:
a b c d
x 1 2 3 13
y 2 3 1 21
x 3 2 1 31
y 2 1 3 23
Then keep two of the columns a and d and delete the other two columns b and c:
a d
x 1 13
y 2 21
x 3 31
y 2 23
Here is my current code (It doesn't work when I try to combine two of the columns or when I try to only keep two of the columns):
f <- list.files(pattern="201\\d{5}\\.csv") # reading in all the files
mydata <- sapply(f, read.csv, simplify=FALSE) # assigning them to a dataframe
do.call(rbind,mydata) # combining all of those dataframes into one
mydata$Data <- paste(mydata$LAST_UPDATE_DT,mydata$px_last) # combining two of the columns into a new column named "Data"
c('X','Data') %in% names(mydata) # keeping two of the columns while deleting the rest
The object mydata is a list of data frames. You can change the data frames in the list with lapply:
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
file1 <- "a b
x 2 3"
file2 <- "a b
x 3 1"
mydata <- lapply(c(file1, file2), function(x) read.table(text = x, header =TRUE))
lapply(mydata, function(x) "[<-"(x, "c", value = paste0(x$a, x$b)))
# [[1]]
# a b c
# x 2 3 23
#
# [[2]]
# a b c
# x 3 1 31
You can use rbind(data1, data2)[, c("a", "d")] for that. I assume that you can already create column d in each data frame.
data1<-structure(list(a = 1:2, b = 2:3, c = c(3L, 1L), d = c(13L, 21L
)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"), class = "data.frame")
> data1
a b c d
x 1 2 3 13
y 2 3 1 21
data2<-structure(list(a = c(3L, 2L), b = c(2L, 1L), c = c(1L, 3L), d = c(31L,
23L)), .Names = c("a", "b", "c", "d"), row.names = c("x", "y"
), class = "data.frame")
> data2
a b c d
x 3 2 1 31
y 2 1 3 23
data3<-rbind(data1,data2)
> data3
a b c d
x 1 2 3 13
y 2 3 1 21
x1 3 2 1 31
y1 2 1 3 23
finaldata<-data3[,c("a","d")]
> finaldata
a d
x 1 13
y 2 21
x1 3 31
y1 2 23
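Putting the steps together, here is a sketch of the whole pipeline on a list of data frames. The list is built inline here as an assumption; a list produced by sapply(f, read.csv, simplify = FALSE), as in the question, would have the same shape:

```r
# Two small data frames standing in for the csv files
mydata <- list(
  data.frame(a = c(1, 2), b = c(2, 3), c = c(3, 1), row.names = c("x", "y")),
  data.frame(a = c(3, 2), b = c(2, 1), c = c(1, 3), row.names = c("x", "y"))
)

# 1. add column d to every data frame in the list
with_d <- lapply(mydata, function(x) {
  x$d <- paste0(x$a, x$c)
  x
})

# 2. stack them into one data frame (the result must be assigned)
combined <- do.call(rbind, with_d)

# 3. keep only the columns a and d
finaldata <- combined[, c("a", "d")]
finaldata
```

The key differences from the code in the question are that the new column is added to each element of the list rather than to the list itself, and that the results of rbind and of the column selection are assigned to a name.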
