Conversion of pipe delimited single column data to multiple column matrix - R - r

I need some help with data manipulation in R. I have a long code which does this as a series of steps, but I am looking for a shorter way to do it.
Here is a data frame which has two columns - the first one is an ID and the other has pipe delimited data in it as shown below:
ID DATA
1 a
2 a|b
3 b|c
4 d|e
I need to convert this to this form:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I am hoping there is a simpler way to do this than the lengthy code I have.
Thanks in advance for your help.

This works on the supplied data. First read in your data:
pipdat <- read.table(stdin(),header=TRUE,stringsAsFactors=FALSE)
ID DATA
1 a
2 a|b
3 b|c
4 d|e
# leave a blank line at the end so it stops reading
Now here goes:
nr <- dim(pipdat)[1]
chrs <- strsplit(pipdat[,2],"[|]")
af <- unique(unlist(chrs))
whichlet <- function(a,fac) as.numeric(fac %in% a)
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
(That can be done in fewer lines, but it's handy to see what some of those steps do)
It produces:
ID a b c d e
1 1 0 0 0 0
2 1 1 0 0 0
3 0 1 1 0 0
4 0 0 0 1 1
I guessed from your post that you wanted ID as row names; if you need it to be a column of data that last line needs to be different.
I'd have used sapply instead of lapply, but you end up with the transpose of the desired matrix. That works if you replace the last line with:
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(ID=1:nr,af)
res
but it might be slower.
---
If you don't follow the line
matrix(unlist(lapply(chrs,whichlet,af)),
byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
just break it up from the innermost function outward:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
matrix(vres,byrow=TRUE,nr=nr,dimnames=list(ID=1:nr,af))
---
If you need ID as a column of data instead of row names, one way to do it is:
lres <- lapply(chrs,whichlet,af)
vres <- unlist(lres)
cbind(ID=1:nr,matrix(vres,byrow=TRUE,nr=nr,dimnames=list(1:nr,af)))
or you could do
res <- t(sapply(chrs,whichlet,af))
dimnames(res) <- list(1:nr,af)
cbind(ID=1:nr,res)

Related

removing columns equal to 0 from multiple data frames in a list; lapply not actually removing columns when applying function to a list

I have a list of three data frames that are similar (same number of columns but different number of rows), and were split from a larger data set.
Here is some example code to make three data frames and put them in a list. It is really hard to make an exact replicate of my data since the files are so large (over 400 columns and the first 6 columns are not numerical)
a <- c(0,1,0,1,0,0,0,0,0,1,0,1)
b <- c(0,0,0,0,0,0,0,0,0,0,0,0)
c <- c(1,0,1,1,1,1,1,1,1,1,0,1)
d <- c(0,0,0,0,0,0,0,0,0,0,0,0)
e <- c(1,1,1,1,0,1,0,1,0,1,1,1)
f <- c(0,0,0,0,0,0,0,0,0,0,0,0)
g <- c(1,0,1,0,1,1,1,1,1,1)
h <- c(0,0,0,0,0,0,0,0,0,0)
i <- c(1,0,0,0,0,0,0,0,0,0)
j <- c(0,0,0,0,1,1,1,1,1,0)
k <- c(0,0,0,0,0)
l <- c(1,0,1,0,1)
m <- c(1,0,1,0,0)
n <- c(0,0,0,0,0)
o <- c(1,0,1,0,1)
df1 <- data.frame(a,b,c,d,e,f)
df2 <- data.frame(g,h,i,j)
df3 <- data.frame(k,l,m,n,o)
my.list <- list(df1,df2,df3)
I am looking to remove all the columns in each data frame whose total == 0. The code is below:
list2 <- lapply(my.list, function(x) {x[, colSums(x) != 0];x})
list2 <- lapply(my.list, function(x) {x[, colSums(x != 0) > 0];x})
Both of the above codes will run, but neither actually remove the columns == 0.
I am not sure why that is, any tips are greatly appreciated
The OP found a solution by exchanging comments with me. But I wanna drop the following. In lapply(my.list, function(x) {x[, colSums(x) != 0];x}), the OP was asking R to do two things. The first thing was subsetting each data frame in my.list. The second thing was showing each data frame. I think he thought that each data frame was updated after subsetting columns. But he was simply asking R to show each data frame as it is in the second command. So R was showing the result for the second command. (On the surface, he did not see any change.) If I follow his way, I would do something like this.
lapply(my.list, function(x) {foo <- x[, colSums(x) != 0]; foo})
He wanted to create a temporary object in the anonymous function and return the object. Alternatively, he wanted to do the following.
lapply(my.list, function(x) x[, colSums(x) != 0])
For each data frame in my.list, run a logical check for each column. If colSums(x) != 0 is TRUE, keep the column. Otherwise remove it. Hope this will help future readers.
[[1]]
a c e
1 0 1 1
2 1 0 1
3 0 1 1
4 1 1 1
5 0 1 0
6 0 1 1
7 0 1 0
8 0 1 1
9 0 1 0
10 1 1 1
11 0 0 1
12 1 1 1
[[2]]
g i j
1 1 1 0
2 0 0 0
3 1 0 0
4 0 0 0
5 1 0 1
6 1 0 1
7 1 0 1
8 1 0 1
9 1 0 1
10 1 0 0
[[3]]
l m o
1 1 1 1
2 0 0 0
3 1 1 1
4 0 0 0
5 1 0 1

R Sort one column ascending, all others descending (based on column order)

I have an ordered table, similar to as follows:
df <- read.table(text =
"A B C Size
1 0 0 1
0 1 1 2
0 0 1 1
1 1 0 2
0 1 0 1",
header = TRUE)
In reality there will be many more columns, but this is fine for a solution.
I wish to sort this table first by SIZE (Ascending), then by each other column in priority sequence (Descending) - i.e. by column A first, then B, then C, etc.
The problem is that I will not know the column names in advance so cannot name them, but need in effect "all columns except SIZE".
End result should be:
A B C Size
1 0 0 1
0 1 0 1
0 0 1 1
1 1 0 2
0 1 1 2
I've seen examples of sorting by two columns, but I just can't find the correct syntax to sort by 'all other columns sequentially'.
Many thanks
With the names use order like this. No packages are used.
o <- with(df, order(Size, -A, -B, -C))
df[o, ]
This gives:
A B C Size
1 1 0 0 1
5 0 1 0 1
3 0 0 1 1
4 1 1 0 2
2 0 1 1 2
or without the names just use column numbers:
o <- order(df[[4]], -df[[1]], -df[[2]], -df[[3]])
or
k <- 4
o <- do.call("order", data.frame(df[[k]], -df[-k]))
If Size is always the last column use k <- ncol(df) instead or if it is not necessarily last but always called Size then use k <- match("Size", names(df)) instead.
Note: Although not needed in the example shown in the question if the columns were not numeric then one could not negate them so a more general solution would be to replace the first line above with the following where xtfrm is an R function which converts objects to numeric such that the result sorts in the order expected.
o <- with(df, order(Size, -xtfrm(A), -xtfrm(B), -xtfrm(C)))
We can use arrange from dplyr
library(dplyr)
arrange(df, Size, desc(A), desc(B), desc(C))
For more number of columns, arrange_ can be used
cols <- paste0("desc(", names(df)[1:3], ")")
arrange_(df, .dots = c("Size", cols))

finding if boolean is ever true by groups in R

I want a simple way to create a new variable determining whether a boolean is ever true in R data frame.
Here is and example:
Suppose in the dataset I have 2 variables (among other variables which are not relevant) 'a' and 'b' and 'a' determines a group, while 'b' is a boolean with values TRUE (1) or FALSE (0). I want to create a variable 'c', which is also a boolean being 1 for all entries in groups where 'b' is at least once 'TRUE', and 0 for all entries in groups in which 'b' is never TRUE.
From entries like below:
a b
-----
1 1
2 0
1 0
1 0
1 1
2 0
2 0
3 0
3 1
3 0
I want to get variable 'c' like below:
a b c
-----------
1 1 1
2 0 0
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
3 0 1
3 1 1
3 0 1
-----------
I know how to do it in Stata, but I haven't done similar things in R yet, and it is difficult to find information on that on the internet.
In fact I am doing that only in order to later remove all the observations for which 'c' is 0, so any other suggestions would be fine as well. The application of that relates to multinomial logit estimation, where the alternatives that are never-chosen need to be removed from the dataset before estimation.
if X is your data frame
library(dplyr)
X <- X %>%
group_by(a) %>%
mutate(c = any(b == 1))
A base R option would be
df1$c <- with(df1, ave(b, a, FUN=any))
Or
library(sqldf)
sqldf('select * from df1
left join(select a, b,
(sum(b))>0 as c
from df1
group by a)
using(a)')
Simple data.table approach
require(data.table)
data <- data.table(data)
data[, c := any(b), by = a]
Even though logical and numeric (0-1) columns behave identically for all intents and purposes, if you'd like a numeric result you can simply wrap the call to any with as.numeric.
An answer with base R, assuming a and b are in dataframe x
c value is a 1-to-1 mapping with a, and I create a mapping here
cmap <- ifelse(sapply(split(x, x$a), function(x) sum(x[, "b"])) > 0, 1, 0)
Then just add in the mapped value into the data frame
x$c <- cmap[x$a]
Final output
> x
a b c
1 1 1 1
2 2 0 0
3 1 0 1
4 1 0 1
5 1 1 1
6 2 0 0
7 2 0 0
8 3 0 1
9 3 1 1
10 3 0 1
edited to change call to split.

Two-way frequency table row and column labels

I want to style the output of table(). Suppose I have the following:
dat$a <- c(1,2,3,4,4,3,4,2,2,2)
dat$b <- c(1,2,3,4,1,2,4,3,2,2)
table(dat$a,dat$b)
1 2 3 4
1 50 0 0 0
2 0 150 50 0
3 0 50 50 0
4 50 0 0 100
There are two problems with this. First, it doesn't give me the correct frequencies. Additionally, it has no row or column labels. I found this, and the table works for both frequency counts and axis labels. Is there an issue because this way subsets from a data frame? I would appreciate any tips on both fixing the frequency counts and adding style to the table.
The only problem is the way that you are inputting arguments to table. To get the desired output (with labels), use the data frame as argument, not 2 vectors (the columns). If you have a larger data frame, use only the subset that you want.
a <- c(1,2,3,4,4,3,4,2,2,2)
b <- c(1,2,3,4,1,2,4,3,2,2)
dat <- data.frame(a,b)
table(dat)
Gives me the output:
b
a 1 2 3 4
1 1 0 0 0
2 0 3 1 0
3 0 1 1 0
4 1 0 0 2
It shouldn't give the wrong frequencies, even with your approach. You could try restarting your R session to check this.

How do I apply a "patch" to a data.frame?

For lack of a better word, how do I apply a "patch" to a R data.frame? Suppose I have a master database with firm and outlet columns and an ownership shares variable that is 1 or 0 in this example, but could be any percentage.
// master
firm outlet shares.pre
1 five 1 0
2 one 1 1
3 red 1 0
4 yellow 1 0
5 five 2 0
6 one 2 0
// many more
I want to let firm "one" sell outlet "1" to firm "red", which transaction I have in another data.frame
// delta
firm outlet shares.delta
1 one 1 -1
2 red 1 1
What is the most efficient way in R to apply this "patch" or transaction to my master database? The end result should look like this:
// preferably master, NOT a copy
firm outlet shares.post
1 five 1 0
2 one 1 0 <--- was 1
3 red 1 1 <--- was 0
4 yellow 1 0
5 five 2 0
6 one 2 0
// many more
I am not particular about keeping the suffixes pre, post or delta. If they were all named shares that would be fine too, I simply want to "add" these data frames.
UPDATE: my current approach is this
update <- (master$firm %in% delta$firm) & (master$outlet %in% delta$outlet)
master[update,]$shares <- master[update,]$shares + delta$shares
Yes, I'm aware it does a vector scan to creat the Boolean update vector, and that the subsetting is also not very efficient. But the thing I don't like about it most is that I have to write out the matching columns.
Another way using data.table. Assuming you've loaded both your data in df1 and df2 data.frames,
require(data.table)
dt1 <- data.table(df1)
dt2 <- data.table(df2)
setkey(dt1, firm, outlet)
setkey(dt2, firm, outlet)
dt1 <- dt2[dt1]
dt1[is.na(dt1)] <- 0
dt1[, shares.post := shares.delta + shares.pre]
# firm outlet shares.delta shares.pre shares.post
# 1: five 1 0 0 0
# 2: five 2 0 0 0
# 3: one 1 -1 1 0
# 4: one 2 0 0 0
# 5: red 1 1 0 1
# 6: yellow 1 0 0 0
I'd give a more precise answer if you had provided a reproducible example, but here's one way:
Call your first data.frame dat and your second chg
Then you could merge the two:
dat <- merge(dat,chg)
And just subtract:
dat$shares <- with(dat, shares.pre + shares.delta )

Resources