How to select a subset using only [] in R?

a<-data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),w1=c(1,'A','B','C'))
I want to set the elements of q1 and q2 that are not equal to 1 to 0, and I want to use only []. I believe every subsetting operation can be done with [].
a[grep("q\\d",colnames(a),perl=TRUE)!=1,grep("q\\d",colnames(a),perl=TRUE)]<-0
But it doesn't work. What's the problem?

We create a numeric index ('nm1') of the columns whose names contain 'q' followed by digits, use it to subset the columns of 'a', and assign 0 to the values in that subset that are not equal to 1.
nm1 <- grep("q\\d+", names(a))
a[nm1][a[nm1] != 1] <- 0
Make sure the columns are of class character by using stringsAsFactors = FALSE in the data.frame call (this is the default from R 4.0.0 onward).
The replacement above builds a logical matrix (a[nm1] != 1), which may cause memory problems if the dataset is very large. In that case, it is better to loop through the columns and replace the values with 0:
a[nm1] <- lapply(a[nm1], function(x) replace(x, x!=1, 0))
data
a <- data.frame(q1=rep(c(1,'A','B'),4),q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
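As for why the attempt in the question misbehaves, a minimal sketch that inspects its row index on its own may help. The names idx, row_sel and rows_hit are introduced here for illustration only.

```r
a <- data.frame(q1 = rep(c(1, "A", "B"), 4), q2 = c(1, "A", "B", "C"),
                w1 = c(1, "A", "B", "C"), stringsAsFactors = FALSE)

# Column positions whose names match "q" followed by a digit
idx <- grep("q\\d", colnames(a), perl = TRUE)

# This compares the column positions (1 and 2) with 1, not the cell values
row_sel <- idx != 1

# A logical index shorter than the vector it indexes is recycled
rows_hit <- seq_len(nrow(a))[row_sel]
```

Here row_sel is a length-2 logical vector, so used as a row index it picks every second row regardless of what the cells contain. The comparison with 1 has to be applied to the values themselves, as in a[nm1] != 1.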

If you know the column names, you can use them directly for indexing.
a<-data.frame(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'),
w1=c(1,'A','B','C'), stringsAsFactors=FALSE)
col_n <- c("q1", "q2")
a[, col_n][a[, col_n]!=1]<-0
> a
q1 q2 w1
1 1 1 1
2 0 0 A
3 0 0 B
4 1 0 C
5 0 1 1
6 0 0 A
7 1 0 B
8 0 0 C
9 0 1 1
10 1 0 A
11 0 0 B
12 0 0 C

data.table approach:
library(data.table)
a <- data.table(q1=rep(c(1,'A','B'),4), q2=c(1,'A','B','C'), w1=c(1,'A','B','C'))
a[, grep("^q", colnames(a), value = TRUE) := lapply(a[, grep("^q", colnames(a), value = TRUE), with = FALSE], function(x) ifelse(x == 1, 1, 0))]
> a
q1 q2 w1
1: 1 1 1
2: 0 0 A
3: 0 0 B
4: 1 0 C
5: 0 1 1
6: 0 0 A
7: 1 0 B
8: 0 0 C
9: 0 1 1
10: 1 0 A
11: 0 0 B
12: 0 0 C
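The same update can also be sketched with .SD and .SDcols, which avoids repeating the grep() call. This is one possible formulation, not necessarily what the answer's author had in mind; the name cols is introduced here for illustration.

```r
library(data.table)

a <- data.table(q1 = rep(c(1, "A", "B"), 4), q2 = c(1, "A", "B", "C"),
                w1 = c(1, "A", "B", "C"))

# Names of the columns to modify
cols <- grep("^q", names(a), value = TRUE)

# .SD holds only the columns listed in .SDcols; (cols) on the left-hand side
# tells := to assign to the columns named in the character vector
a[, (cols) := lapply(.SD, function(x) ifelse(x == 1, 1, 0)), .SDcols = cols]
```

Note that ifelse() returns numbers here, so q1 and q2 become numeric columns while w1 stays character.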

Related

How to remove rows where all columns are zero using data.table

I have the following data frame:
# the original dataset
dat <- data.frame(a = c(0,0,2,3), b= c(1,0,0,0), c=c(0,0,1,3))
It looks like this:
> dat
a b c
1 0 1 0
2 0 0 0
3 2 0 1
4 3 0 3
What I want to do is to remove rows with all zero, resulting in :
a b c
0 1 0
2 0 1
3 0 3
How can I do that with data.table?
In reality I have a much higher-dimensional dataset to process, so it needs to be very fast.
I tried this, but it is still slow:
dat <- dat[Reduce(`|`, dat), ]
You can try with rowSums -
library(data.table)
setDT(dat)
dat[rowSums(dat != 0) != 0]
# a b c
#1: 0 1 0
#2: 2 0 1
#3: 3 0 3
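Again using rowSums(). I think this is more readable, but note that it only works when all values are non-negative, since a row such as (1, -1, 0) also sums to zero.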
library(data.table)
dat[(rowSums(dat) !=0),]
a b c
1 0 1 0
3 2 0 1
4 3 0 3
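The two rowSums() conditions are not interchangeable once negative values appear. A minimal base R sketch, using a hypothetical variant of the data in which the last row contains -1:

```r
# Hypothetical data: row 4 is non-zero but its values sum to zero
dat <- data.frame(a = c(0, 0, 2, -1), b = c(1, 0, 0, 0), c = c(0, 0, 1, 1))

# Counts the non-zero cells per row; keeps every row with at least one
keep_any <- unname(rowSums(dat != 0) != 0)

# Sums the values per row; also drops rows whose values cancel out
keep_sum <- unname(rowSums(dat) != 0)

res_any <- dat[keep_any, ]
res_sum <- dat[keep_sum, ]
```

With this data res_any keeps rows 1, 3 and 4, while res_sum loses row 4 as well, so counting non-zero cells is the safer test when the sign of the values is not known in advance.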

Replacing row values in R based on previous rows

Below is my data frame df
df <- data.frame(A=c(1,1,1,1,0,0,-1,-1,-1,1,1,1,1))
I would like to add another variable, T_D, that keeps the value of A in the first row of each run where A changes to 1 or -1, and is 0 in all the following rows.
Expected Output:
A T_D
1 1
1 0
1 0
1 0
0 0
0 0
-1 -1
-1 0
-1 0
1 1
1 0
1 0
1 0
dplyr's window functions make this easy. You can use the lag function to look at the previous value and see if it equals the current value. The first row of the table doesn't have a previous value so T_D will always be NA. Fortunately that row will always be equal to a so it's an easy problem to fix with a second mutate (or df[1,2] <- df[1,1]).
library(tidyverse) # Loads dplyr and other useful packages
df <- tibble(a = c(1, 1, 1, 1, 0, 0, -1, -1, -1, 1, 1, 1, 1))
df %>%
mutate(T_D = ifelse(a == lag(a), 0, a)) %>%
mutate(T_D = ifelse(is.na(T_D), a, T_D))
Base R solution:
df$T_D = df$A*!c(FALSE,diff(df$A,lag=1)==0)
Find the difference between sequential rows. If the difference is non-zero (the value changed), take the entry from column A; otherwise set it to 0. The first row has no predecessor, so it always keeps its value.
OUTPUT
A T_D
1 1 1
2 1 0
3 1 0
4 1 0
5 0 0
6 0 0
7 -1 -1
8 -1 0
9 -1 0
10 1 1
11 1 0
12 1 0
13 1 0
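To make the one-liner easier to follow, here is the same idea split into intermediate steps. The names step and changed are introduced only for this sketch.

```r
A <- c(1, 1, 1, 1, 0, 0, -1, -1, -1, 1, 1, 1, 1)

# Difference between each row and the one before it (one element shorter than A)
step <- diff(A)

# TRUE where the value differs from the previous row; the first row has no
# predecessor, so it is treated as a change
changed <- c(TRUE, step != 0)

# Multiplying by a logical keeps A where changed is TRUE and gives 0 elsewhere
T_D <- A * changed
```

Rows 1, 7 and 10 keep their value of A. Row 5 is also flagged as a change, but A is 0 there, so the product is 0 anyway.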
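Another base R option for values in {-1, 0, 1}: the product is non-zero only where A is non-zero and differs from the previous row (the value before the first row is taken as 0), and sign() reduces it to 1 or -1:
df$T_D <- sign(abs(df$A)*diff(c(0, df$A)))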
A data.table approach would be,
library(data.table)
setDT(df)[, T_D := replace(A, duplicated(A), 0), by = rleid(A)][]
# A T_D
# 1: 1 1
# 2: 1 0
# 3: 1 0
# 4: 1 0
# 5: 0 0
# 6: 0 0
# 7: -1 -1
# 8: -1 0
# 9: -1 0
#10: 1 1
#11: 1 0
#12: 1 0
#13: 1 0
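The run-based logic behind rleid() can also be sketched in base R with rle(), for readers who want to see the runs explicitly. The names runs and starts are illustrative.

```r
A <- c(1, 1, 1, 1, 0, 0, -1, -1, -1, 1, 1, 1, 1)

# Run-length encoding: consecutive equal values and how long each run is
runs <- rle(A)

# Row index at which each run begins
starts <- cumsum(c(1, head(runs$lengths, -1)))

# Start from all zeros and write each run's value at its first row only
T_D <- numeric(length(A))
T_D[starts] <- runs$values
```

This follows the same rule as the grouped replace(): the first element of each run keeps its value and the rest become 0.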

Conversion of row value counts to column in R

Suppose I have a dataframe as follows,
Name value
A 0
A 1
A 2
A 3
B 0
B 0
B 3
C 5
I want the following output,
Name 0 0<X<2 2-4 5 and above
A 1 1 2 0
B 2 0 1 0
C 0 0 0 1
I want new columns to be created that hold the counts of rows falling into each range. I tried reshape, but it only changes the structure; the counts do not work. Can anybody help me with this?
Thanks
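We can use cut. Note that cut() closes intervals on the right by default, so with the breaks below the value 2 is counted under '0<X<2'; this is why row A reads 1 2 1 0 rather than the 1 1 2 0 requested. For integer values, breaks=c(-1, 0, 1, 4, Inf) would match the requested table.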
library(data.table)
setDT(df1)[, gr := cut(value, breaks=c(-1, 0, 2, 4, Inf),
labels=c(0, '0<X<2', '2-4', '5 and above'))]
dcast(df1, Name~gr)
# Name 0 0<X<2 2-4 5 and above
#1: A 1 2 1 0
#2: B 2 0 1 0
#3: C 0 0 0 1
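The effect of the break points can be checked in base R without data.table. This is a minimal sketch that assumes integer values; the alternative breaks and the names gr_orig, gr_alt and tab_alt are not from the answer above.

```r
name  <- c("A", "A", "A", "A", "B", "B", "B", "C")
value <- c(0, 1, 2, 3, 0, 0, 3, 5)
labs  <- c("0", "0<X<2", "2-4", "5 and above")

# Intervals are (-1,0], (0,2], (2,4], (4,Inf]: the value 2 lands in "0<X<2"
gr_orig <- cut(value, breaks = c(-1, 0, 2, 4, Inf), labels = labs)

# Intervals are (-1,0], (0,1], (1,4], (4,Inf]: the value 2 lands in "2-4"
gr_alt <- cut(value, breaks = c(-1, 0, 1, 4, Inf), labels = labs)

# Cross-tabulate names against the alternative grouping
tab_alt <- table(name, gr_alt)
```

For non-integer data the second set of breaks would put values such as 1.5 into "2-4", so the break points would need to be chosen differently there.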
You can make a class column in your dataframe, then use table():
df <- read.table(text="Name value
A 0
A 1
A 2
A 3
B 0
B 0
B 3
C 5", header=T)
df[df$value==0, 'class'] <- "0"
df[df$value==1, 'class'] <- "0<X<2"
df[df$value>=2&df$value<=4, 'class'] <- "2-4"
df[df$value>4, 'class'] <- "5 and above"
table(df$Name, df$class)
0 0<X<2 2-4 5 and above
A 1 1 2 0
B 2 0 1 0
C 0 0 0 1
And just for fun:
f <- function(x) c(sum(x == 0), sum(x > 0 & x < 2), sum(x >= 2 & x <= 4), sum(x >= 5))
t(sapply(split(df$value, df$Name), f))
[,1] [,2] [,3] [,4]
A 1 1 2 0
B 2 0 1 0
C 0 0 0 1

R Data table - add vector of values as a column

I would like to create a new column inside my data table, this column being a vector of values; but I am getting the following error:
> DT = data.table(x=rep(c("a","b"),c(2,3)),y=1:5)
> DT
x y
1: a 1
2: a 2
3: b 3
4: b 4
5: b 5
> DT[, my_vec := rep(0,y)]
Error in rep(0, y) : invalid 'times' argument
My expected result is:
> DT
x y my_vec
1: a 1 0
2: a 2 0 0
3: b 3 0 0 0
4: b 4 0 0 0 0
5: b 5 0 0 0 0 0
Is there a way to do that?
The syntax is a little cumbersome, but you can do this:
DT[, my_vec := list(list(rep(0, y))), by = y]
DT
# x y my_vec
#1: a 1 0
#2: a 2 0,0
#3: b 3 0,0,0
#4: b 4 0,0,0,0
#5: b 5 0,0,0,0,0
It is not clear whether you need my_vec as a list column or as a single string per row. If it is the latter, we group by row number, replicate 0 'y' times and paste the elements together within each group.
DT[, my_vec := paste(rep(0, y), collapse=' ') , 1:nrow(DT)]
DT
# x y my_vec
#1: a 1 0
#2: a 2 0 0
#3: b 3 0 0 0
#4: b 4 0 0 0 0
#5: b 5 0 0 0 0 0
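If the list column is what is wanted, it can also be built without grouping by wrapping the result of lapply() in a one-element list. This is a sketch of one possible alternative; n_zeros and third are illustrative names.

```r
library(data.table)

DT <- data.table(x = rep(c("a", "b"), c(2, 3)), y = 1:5)

# lapply() returns one numeric vector per row; wrapping it in .() makes
# data.table store the whole list as a single list column
DT[, my_vec := .(lapply(y, function(n) rep(0, n)))]

# Each cell is an ordinary numeric vector that can be pulled out with [[ ]]
n_zeros <- lengths(DT$my_vec)
third   <- DT$my_vec[[3]]
```

Unlike the pasted string, the list column keeps the zeros as numbers, so they can be used in further computation.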

Populating data from one data.table to another

I have a distance matrix (as data.table) showing pairwise distances between a number of items, but not all items are in the matrix. I need to create a larger data.table that has all the missing items populated. I can do this with matrices fairly easily:
items=c("a", "b", "c", "d")
small_matrix=matrix(c(0, 1, 2, 3), nrow=2, ncol=2,
dimnames=list(c("a", "b"), c("a", "b")))
# create zero matrix of the right size
full_matrix <- matrix(0, ncol=length(items), nrow=length(items),
dimnames=list(items, items))
# populate items from the small matrix
full_matrix[rownames(small_matrix), colnames(small_matrix)] <- small_matrix
full_matrix
# a b c d
# a 0 2 0 0
# b 1 3 0 0
# c 0 0 0 0
# d 0 0 0 0
What is the equivalent of that in data.table? I can create an 'id' column in small_DT and use it as the key, but I'm not sure how to overwrite the items in full_DT that have the same id/column pair.
Let's convert to data.table and keep the row names as an extra column:
dts = as.data.table(small_matrix, keep.rownames = TRUE)
# rn a b
#1: a 0 2
#2: b 1 3
dtf = as.data.table(full_matrix, keep.rownames = TRUE)
# rn a b c d
#1: a 0 0 0 0
#2: b 0 0 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
Now just join on the rows, and assuming small matrix is always a subset you can do the following:
dtf[dts, names(dts) := dts, on = 'rn']
dtf
# rn a b c d
#1: a 0 2 0 0
#2: b 1 3 0 0
#3: c 0 0 0 0
#4: d 0 0 0 0
Above assumes version 1.9.5+. Otherwise you'll need to set the key first.
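A slightly more explicit way to write such an update join is to refer to the columns of the joined table with the i. prefix. This is a sketch that builds the two tables directly rather than from the matrices; cols is an illustrative name.

```r
library(data.table)

items <- c("a", "b", "c", "d")

# Full table of zeros and the smaller table with the known distances
dtf <- data.table(rn = items, a = 0, b = 0, c = 0, d = 0)
dts <- data.table(rn = c("a", "b"), a = c(0, 1), b = c(2, 3))

# Columns to overwrite in dtf
cols <- c("a", "b")

# For each row of dts, find the matching rn in dtf and copy the values over;
# i.a and i.b refer to the columns of dts (the table in the i position)
dtf[dts, on = "rn", (cols) := .(i.a, i.b)]
```

Rows of dtf with no match in dts are left untouched, which corresponds to the behaviour of the matrix assignment in the question.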
Suppose you have these two data.table:
dt1 = as.data.table(small_matrix)
# a b
#1: 0 2
#2: 1 3
dt2 = as.data.table(full_matrix)
# a b c d
#1: 0 0 0 0
#2: 0 0 0 0
#3: 0 0 0 0
#4: 0 0 0 0
You can't assign the way you would with a data.frame or matrix, e.g. by doing:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1), with=F] <- dt1
This code will raise an error because, to assign new values, you need to use the := operator:
dt2[rownames(full_matrix) %in% rownames(small_matrix), names(dt1):=dt1][]
# a b c d
#1: 0 2 0 0
#2: 1 3 0 0
#3: 0 0 0 0
#4: 0 0 0 0
