Split dataframe array column into multiple binary columns [R]

Array column is current and the others are the goal
I have a column of arrays and I would like to split it out into multiple binaries.
I have created all the columns by using
df[,unique(unlist(df$array_column))] = 0
I tried to use an ifelse statement to then set the columns to 1 as needed; however, %in% does not work the way I need inside ifelse. I could write a nested for loop, but I have millions of rows and am looking for a faster solution than that.
testdf = data.frame('a'=c(1,2,3,4,5),'array_column'=c('a-b-c','b-a','c-d','d-e-e','e-a'),stringsAsFactors = F)
testdf$array_column = strsplit(testdf$array_column,'-')

I think the question is rather how to convert a list of vectors into a binary matrix/data.frame.
Here is a solution:
testdf = data.frame('a'=c(1,2,3,4,5),'array_column'=c('a-b-c','b-a','c-d','d-e-e','e-a'),stringsAsFactors = F)
testdf$array_column = strsplit(testdf$array_column,'-')
library('plyr')
# Creates a list of data.frames with 1s for each value observed
binary <- lapply(testdf$array_column, function(x) {
  vals <- unique(x)
  x <- setNames(rep(1,length(vals)), vals)
  do.call(data.frame, as.list(x))
})
# Joins into single data.frame
result <- do.call(rbind.fill, binary)
result[is.na(result)] <- 0
result
# a b c d e
# 1 1 1 1 0 0
# 2 1 1 0 0 0
# 3 0 0 1 1 0
# 4 0 0 0 1 1
# 5 1 0 0 0 1
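With millions of rows, building one data.frame per row may become slow. As a possible alternative (a minimal base R sketch, not the method used in the answer above; the names items, lev, row_id, col_id and binary are illustrative), a preallocated matrix can be filled in one step with matrix indexing:

```r
testdf <- data.frame(a = 1:5,
                     array_column = c("a-b-c", "b-a", "c-d", "d-e-e", "e-a"),
                     stringsAsFactors = FALSE)
items <- strsplit(testdf$array_column, "-", fixed = TRUE)

# All distinct values become the columns of the result
lev <- sort(unique(unlist(items)))

# One (row, column) pair per value observed in each row
row_id <- rep(seq_along(items), lengths(items))
col_id <- match(unlist(items), lev)

# Start from all zeros, then set every observed (row, column) cell to 1
binary <- matrix(0L, nrow = length(items), ncol = length(lev),
                 dimnames = list(NULL, lev))
binary[cbind(row_id, col_id)] <- 1L
binary
```

Repeated values within a row (such as the second "e" in "d-e-e") simply set the same cell to 1 twice, so no call to unique is needed per row.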

Related

Replacing values in a list of data frames

I have a list of data frames. Each has an ID column, followed by a number of numeric columns (with column names).
I would like to replace all the 1's with 0's for all the numeric columns, but keep the ID column the same. I can do this in part with a single data frame using
df[,-1] <- 0
But when I try to embed this in lapply, it fails:
df2 <- lapply(df, function(x) {x[,-1] <- 0})
I've tried using subset, ifelse, while, mutate, but struggling with this simple replacement. Could recreate the data frames from scratch, or recombine the ID column at the end, but this strikes me as something that should be easy...
Test list:
test_list <- list(data.frame("ID"=letters[1:3], "col2"=1:3, "col3"=0:2), data.frame("ID"=letters[4:6], "col2"=4:6, "col3"=0:2))
The end result should be:
final_list <- list(data.frame("ID"=letters[1:3], "col2"=0, "col3"=0), data.frame("ID"=letters[4:6], "col2"=0, "col3"=0))
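Add return(x) to your function and then it should work fine. Without it, the function returns the value of its last expression, the assignment x[,-1] <- 0, which is just 0 rather than the modified data frame.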
lapply(test_list, function(x){
x[, -1] <- 0
return(x)
})
# [[1]]
# ID col2 col3
# 1 a 0 0
# 2 b 0 0
# 3 c 0 0
#
# [[2]]
# ID col2 col3
# 1 d 0 0
# 2 e 0 0
# 3 f 0 0
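To see the difference side by side, here is a small sketch (the names broken and fixed are illustrative) that runs both versions on the test list from the question:

```r
test_list <- list(
  data.frame(ID = letters[1:3], col2 = 1:3, col3 = 0:2),
  data.frame(ID = letters[4:6], col2 = 4:6, col3 = 0:2)
)

# The last expression is the assignment, whose value is the right-hand side (0)
broken <- lapply(test_list, function(x) { x[, -1] <- 0 })

# Ending with x makes the function return the modified data frame
fixed <- lapply(test_list, function(x) { x[, -1] <- 0; x })

str(broken[[1]])
fixed[[1]]
```

Writing x on the last line is equivalent to return(x) here, because an R function returns the value of its last evaluated expression.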
Your question is worded a little bit strangely: it sounds like you want to replace only the 1's with 0's, but your example output replaces every value.
If you want to replace just the 1's with 0's, you could do so like this:
lapply(test_list, function(x) {x[x==1] <- 0; return(x)})
[[1]]
ID col2 col3
1 a 0 0
2 b 2 0
3 c 3 2
[[2]]
ID col2 col3
1 d 4 0
2 e 5 0
3 f 6 2

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above lines run, but neither actually removes the all-zero columns.
I am not sure why that is; any tips are greatly appreciated.
The OP found a solution by exchanging comments with me, but I want to leave the following for future readers. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things: first, subset each data frame in my.list; second, return each data frame. The OP presumably thought that each data frame was updated by the subsetting. But the result of the subset was never assigned to anything, so the second expression simply returned the original, unchanged data frame. That is why no change was visible. Following the OP's approach, I would do something like this:
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
This creates a temporary object in the anonymous function and returns it. Alternatively, the subset can be returned directly:
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, this runs a logical check on each column. If colSums(x) != 0 is TRUE, the column is kept; otherwise it is removed. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1
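The question mentions that the real data has non-numeric leading columns, on which colSums would fail. A hedged sketch of one way to handle that follows; drop_zero_cols is a hypothetical helper name, the small data frames are made up for illustration, and treating NA as zero via na.rm = TRUE is an assumption:

```r
# Keep every non-numeric column, and every numeric column whose sum is not 0
drop_zero_cols <- function(x) {
  keep <- vapply(x,
                 function(col) !is.numeric(col) || sum(col, na.rm = TRUE) != 0,
                 logical(1))
  # drop = FALSE keeps a data frame even when only one column survives
  x[, keep, drop = FALSE]
}

df1 <- data.frame(id = c("r1", "r2", "r3"), a = c(0, 1, 0), b = c(0, 0, 0),
                  stringsAsFactors = FALSE)
df2 <- data.frame(k = c(0, 0), l = c(1, 0))

list2 <- lapply(list(df1, df2), drop_zero_cols)
list2
```

Without drop = FALSE, a data frame that ends up with a single remaining column would be simplified to a plain vector.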

R Sort one column ascending, all others descending (based on column order)

I have an ordered table, similar to as follows:
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)
In reality there will be many more columns, but this is fine for a solution.
I wish to sort this table first by SIZE (Ascending), then by each other column in priority sequence (Descending) - i.e. by column A first, then B, then C, etc.
The problem is that I will not know the column names in advance so cannot name them, but need in effect "all columns except SIZE".
End result should be:
A B C Size
1 0 0 1
0 1 0 1
0 0 1 1
1 1 0 2
0 1 1 2
I've seen examples of sorting by two columns, but I just can't find the correct syntax to sort by 'all other columns sequentially'.
Many thanks
If the column names are known, use order like this. No packages are needed.
o <- with(df, order(Size, -A, -B, -C))
df[o, ]
This gives:
A B C Size
1 1 0 0 1
5 0 1 0 1
3 0 0 1 1
4 1 1 0 2
2 0 1 1 2
or without the names just use column numbers:
o <- order(df[[4]], -df[[1]], -df[[2]], -df[[3]])
or
k <- 4
o <- do.call("order", data.frame(df[[k]], -df[-k]))
If Size is always the last column, use k <- ncol(df) instead; if it is not necessarily last but is always called Size, use k <- match("Size", names(df)) instead.
Note: Although it is not needed for the example shown in the question, non-numeric columns cannot be negated. A more general solution is to replace the first line above with the following, where xtfrm is an R function that converts objects to numeric values that sort in the expected order.
o <- with(df, order(Size, -xtfrm(A), -xtfrm(B), -xtfrm(C)))
We can use arrange from dplyr
library(dplyr)
arrange(df, Size, desc(A), desc(B), desc(C))
For a larger number of columns, arrange_ can be used (note that arrange_ has since been deprecated in dplyr):
cols <- paste0("desc(", names(df)[1:3], ")")
arrange_(df, .dots = c("Size", cols))
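When the other column names are not known in advance, the sort keys can also be built programmatically in base R. This is a minimal sketch that assumes only that the ascending column is called Size; the names others, keys and sorted are illustrative:

```r
df <- read.table(text = "A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1", header = TRUE)

# Every column except Size, in their existing order of priority
others <- setdiff(names(df), "Size")

# Size ascending, then each other column descending;
# xtfrm makes the negation work for non-numeric columns too
keys <- c(list(df$Size), lapply(df[others], function(col) -xtfrm(col)))

# unname so that column names are not mistaken for arguments of order()
o <- do.call(order, unname(keys))
sorted <- df[o, ]
sorted
```

For the example data this picks rows 1, 5, 3, 4, 2, which matches the result shown in the answer above.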

Populate a dataframe with a for loop

I would like to fill a dataframe ("DF") with 0's or 1's depending on whether values in a vector ("Date") match date values in a second dataframe ("df$Date").
If they match, the output value has to be 1, otherwise 0.
I tried to adjust this code made by a friend of mine, but it doesn't work:
for(j in 1:length(Date)) { #Date is a vector with all dates from 1967 to 2006
# Start count
count <- 0
# Check all Dates between 1967-2006
if(any(Date[j] == df$Date)) { #df$Date contains specific dates of interest
count <- count + 1
}
# If there is a match between Date and df$Date, its output is 1, else 0.
DF[j,i] <- count
}
The main dataframe "DF" has 190 columns, which have to be filled, and of course a number of rows equal to the length of the Date vector.
extra info
1) Each column is different from the other ones and therefore the observations in a row cannot be all equal (i.e. in a single row, I should have a mixture between 0's and 1's).
2) The column names in "DF" are also present in "df" as df$Code.
We can vectorize this operation with %in% and as.integer(), leveraging the fact that coercing logical to integer returns 0 for false and 1 for true:
Mat[,i] <- as.integer(Date%in%df$Date);
If you want to fill every single column of Mat with the exact same result vector:
Mat[] <- as.integer(Date%in%df$Date);
My above code exactly reproduces the logic of the code in your (original) question.
From your edit, I'm not 100% sure I understand the requirement, but my best guess is this:
set.seed(4L);
LV <- 10L; ND <- 10L;
Date <- sample(seq_len(ND),LV,T);
df <- data.frame(Date=sample(seq_len(ND),3L),Code=c('V1','V2','V3'));
DF <- data.frame(V1=rep(NA,LV),V2=rep(NA,LV),V3=rep(NA,LV));
Date;
## [1] 6 1 3 3 9 3 8 10 10 1
df;
## Date Code
## 1 8 V1
## 2 3 V2
## 3 1 V3
for (cn in colnames(DF)) DF[,cn] <- as.integer(Date%in%df$Date[df$Code==cn]);
DF;
## V1 V2 V3
## 1 0 0 0
## 2 0 0 1
## 3 0 1 0
## 4 0 1 0
## 5 0 0 0
## 6 0 1 0
## 7 1 0 0
## 8 0 0 0
## 9 0 0 0
## 10 0 0 1
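The per-column loop can also be written without preallocating DF. The following is only a sketch, using the Date and df values printed above as fixed inputs; the name codes is illustrative and stands in for the 190 real column names:

```r
Date <- c(6, 1, 3, 3, 9, 3, 8, 10, 10, 1)
df <- data.frame(Date = c(8, 3, 1), Code = c("V1", "V2", "V3"),
                 stringsAsFactors = FALSE)

codes <- unique(df$Code)

# One 0/1 column per code: 1 where Date matches a date recorded for that code
DF <- as.data.frame(vapply(
  codes,
  function(cn) as.integer(Date %in% df$Date[df$Code == cn]),
  integer(length(Date))
))
DF
```

vapply is used instead of sapply so that each column is checked to be an integer vector of the same length as Date.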

Conversion of pipe delimited single column data to multiple column matrix - R

I need some help with data manipulation in R. I have a long code which does this as a series of steps, but I am looking for a shorter way to do it.
Here is a data frame which has two columns - the first one is an ID and the other has pipe delimited data in it as shown below:
ID DATA
1 a
2 a|b
3 b|c
4 d|e
I need to convert this to this form:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I am hoping there is a simpler way to do this than the lengthy code I have.
Thanks in advance for your help.
This works on the supplied data. First read in your data:
pipdat <- read.table(stdin(),header=TRUE,stringsAsFactors=FALSE)
ID DATA
1 a
2 a|b
3 b|c
4 d|e
# leave a blank line at the end so it stops reading
Now here goes:
nr <- dim(pipdat)[1]
chrs <- strsplit(pipdat[,2],"[|]")
af <- unique(unlist(chrs))
whichlet <- function(a,fac) as.numeric(fac %in% a)
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
(That can be done in fewer lines, but it's handy to see what some of those steps do)
It produces:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I guessed from your post that you wanted ID as row names; if you need it to be a column of data that last line needs to be different.
I'd have used sapply instead of lapply, but you end up with the transpose of the desired matrix, so it has to be transposed back. That works if you replace the last line with:
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(ID=1:nr,af)
res
but it might be slower.
---
If you don't follow the line
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
just break it up from the innermost function outward:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
matrix(vres,byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
---
If you need ID as a column of data instead of row names, one way to do it is:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
cbind(ID=1:nr,matrix(vres,byrow=TRUE,nr=nr,dimnames=list(1:nr,af)))
or you could do
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(1:nr,af)
cbind(ID=1:nr,res)
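Another possible route, sketched here as an assumption rather than as part of the answer above, is to reshape the split values into long form and cross-tabulate them with table. The data is read with read.table(text = ...) so the example also runs non-interactively; the names long, counts and res are illustrative:

```r
pipdat <- read.table(text = "ID DATA
1 a
2 a|b
3 b|c
4 d|e", header = TRUE, stringsAsFactors = FALSE)

chrs <- strsplit(pipdat$DATA, "|", fixed = TRUE)

# Long form: one row per (ID, value) pair
long <- data.frame(ID = rep(pipdat$ID, lengths(chrs)),
                   value = unlist(chrs),
                   stringsAsFactors = FALSE)

# Cross-tabulate; the factor levels keep every ID, even one with no values
counts <- unclass(table(factor(long$ID, levels = pipdat$ID), long$value))

# Convert counts to 0/1 indicators
res <- (counts > 0) + 0L
res
```

The row names of res are the IDs and the column names are the distinct values, as in the matrix produced by the answer above.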
