How do I make two data columns into one longer data column - r

Let me clear, I do not want to add, multiply, subtract or divide the data. I want the new column to include all the information from both the first column and the second column. Here is an example of what I mean.
data 1 data 2 new data
1 Q 1 Q
2 T 5 T
3 R 3 R
4 1
5 5
6 3
The values I am looking at are not categorical but I used them as an example to show the difference.

cbind.fill <- function(...){
# From a SO answer by Tyler Rinker
nm <- list(...)
nm <- lapply(nm, as.matrix)
n <- max(sapply(nm, nrow))
do.call(cbind, lapply(nm, function (x)
rbind(x, matrix(, n-nrow(x), ncol(x)))))
}
df1 <- data.frame("data 1"=c("Q","T","R"),"data 2"=c(1,5,3), stringsAsFactors = F)
df2 <- cbind.fill(df1, c(df1$data.1, df1$data.2))
colnames(df2) <- c(colnames(df2)[1:2], "new data")
df2
data.1 data.2 new data
[1,] "Q" "1" "Q"
[2,] "T" "5" "T"
[3,] "R" "3" "R"
[4,] NA NA "1"
[5,] NA NA "5"
[6,] NA NA "3"
Source of cbind.fill function: cbind a df with an empty df (cbind.fill?)

Related

Find and remove duplicated observations across two columns in R

I have an example data set like this:
df1 <- data.frame(c1=c('a','b','c','d','e','f','g', 'h'),
c2=c('l','m','a','g','e','q','a','d'))
and I just want a data frame that removed the duplicates between c1 and c2. I already know how to grab the unique elements from c1 and c2, but what do I do after that, to end up with something like the following:
data.frame(c1=c(b,c,f,h),c2=c(l,m,q,NA))
An option is to get the intersecting elements with Reduce, remove those elements from each column with %in% and !, and then pad NA at the end
v1 <- Reduce(intersect, df1)
lst1 <- lapply(df1, function(x) x[!x %in% v1])
data.frame(lapply(lst1, `length<-`, max(lengths(lst1))))
# c1 c2
#1 b l
#2 c m
#3 f q
#4 h <NA>
data
df1 <- data.frame(c1=c('a','b','c','d','e','f','g', 'h'),
c2=c('l','m','a','g','e','q','a','d'))
One-liner in base:
sapply(list(df1$c1[!df1$c1%in%df1$c2],
df1$c2[!df1$c2%in%df1$c1]), '[', 1:length(setdiff(df1$c1, df1$c2)))
# [,1] [,2]
# [1,] "b" "l"
# [2,] "c" "m"
# [3,] "f" "q"
# [4,] "h" NA

From list of characters to matrix/data-frame of numeric (R)

I have a long list, whose elements are lists of length one containing a character vector. These vectors can have different lengths.
The element of the vectors are 'characters' but I would like to convert them in numeric, as they actually represent numbers.
I would like to create a matrix, or a data frame, whose rows are the vectors above, converted into numeric. Since they have different lengths, the "right ends" of each row could be filled with NA.
I am trying to use the function rbind.fill.matrix from the library {plyr}, but the only thing I could get is a long numeric 1-d array with all the numbers inside, instead of a matrix.
This is the best I could do to get a list of numeric (dat here is my original list):
dat<-sapply(sapply(dat,unlist),as.numeric)
How can I create the matrix now?
Thank you!
I would do something like:
library(stringi)
temp <- stri_list2matrix(dat, byrow = TRUE)
final <- `dim<-`(as.numeric(temp), dim(temp))
The basic idea is that stri_list2matrix will convert the list to a matrix, but it would still be a character matrix. as.numeric would remove the dimensional attributes of the matrix, so we add those back in with:
`dim<-` ## Yes, the backticks are required -- or at least quotes
POC:
dat <- list(1:2, 1:3, 1:2, 1:5, 1:6)
dat <- lapply(dat, as.character)
dat
# [[1]]
# [1] "1" "2"
#
# [[2]]
# [1] "1" "2" "3"
#
# [[3]]
# [1] "1" "2"
#
# [[4]]
# [1] "1" "2" "3" "4" "5"
#
# [[5]]
# [1] "1" "2" "3" "4" "5" "6"
library(stringi)
temp <- stri_list2matrix(dat, byrow = TRUE)
final <- `dim<-`(as.numeric(temp), dim(temp))
final
# [,1] [,2] [,3] [,4] [,5] [,6]
# [1,] 1 2 NA NA NA NA
# [2,] 1 2 3 NA NA NA
# [3,] 1 2 NA NA NA NA
# [4,] 1 2 3 4 5 NA
# [5,] 1 2 3 4 5 6

cbind coerces a data frame to matrix

I'm having trouble When using cbind. Prior to using cbind the object is a data.frame of two character vectors.
After I add a column using cbind, the data.frame object changes class to matrix. I've tried as.vector, declaring h as an empty character vector, etc. but couldn't fix it. Thank you for any suggestions and help.
output <- data.frame(h = character(), st = character()) ## empty dataframe
st <- state.abb
h <- (rep("a", 50))
output <- cbind(output$h, h) ## output changes to matrix class here
output <- cbind(output, st) ## adding a second column
I guess you may not need cbind().
output <- data.frame(state = state.abb, h = rep("a", 50))
head(output)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
# Ken I'm not sure what you actually want to obtain but it may be easier if variables are kept in a list. Below is an example.
state <- state.abb
h <- rep("a", 50)
lst <- list(state = state, h = h)
mat <- as.matrix(do.call(cbind, lst))
head(mat)
state h
[1,] "AL" "a"
[2,] "AK" "a"
[3,] "AZ" "a"
[4,] "AR" "a"
[5,] "CA" "a"
[6,] "CO" "a"
df <- as.data.frame(do.call(cbind, lst))
head(df)
state h
1 AL a
2 AK a
3 AZ a
4 AR a
5 CA a
6 CO a
As a complement of info, notice that you could use single bracket notation to make it work with something close to your original code:
data
output <- data.frame(h = letters[1:5],st = letters[6:10])
h2 <- (rep("a", 5))
This won't work
cbind(output$h, h2)
# h2
# [1,] "1" "a"
# [2,] "2" "a"
# [3,] "3" "a"
# [4,] "4" "a"
# [5,] "5" "a"
class(cbind(output$h, h2)) # matrix
It's a matrix and factors have been coerced in numbers
this will work
cbind(output["h"], h2)
# h h2
# 1 a a
# 2 b a
# 3 c a
# 4 d a
# 5 e a
class(cbind(output["h"], h2)) # data.frame
Note that with double brackets (output[["h"]]) you'll have the same inadequate result as when using the dollar notation.

R: transposing and splitting a row with a delimiter.

I have a table
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3))
1 4 a,b,c
2 5 d,e
3 6 f
I would like to convert to
1 2 3
4 5 6
a d f
b e
c
so far I can transpose and split the third column, however, I'm lost as to how to reconstruct a new table with the format outline above?
new = t(rawData)
for (e in 1:ncol(new)){
s<-strsplit(new[3:3,e], split=",")
print(s)
}
I tried creating new vectors for each iteration but I'm not sure how to efficiently put each one back into a dataframe. Would be grateful for any help. thanks!
You can use stri_list2matrix from the stringi package:
library(stringi)
rawData <- as.data.frame(matrix(c(1,2,3,4,5,6,"a,b,c","d,e","f"),nrow=3,ncol=3),stringsAsFactors = F)
d1 <- t(rawData[,1:2])
rownames(d1) <- NULL
d2 <- stri_list2matrix(strsplit(rawData$V3,split=','))
rbind(d1,d2)
# [,1] [,2] [,3]
# [1,] "1" "2" "3"
# [2,] "4" "5" "6"
# [3,] "a" "d" "f"
# [4,] "b" "e" NA
# [5,] "c" NA NA
You can also use cSplit from my "splitstackshape" package.
By default, it just creates additional columns after splitting the input:
library(splitstackshape)
cSplit(rawData, "V3")
# V1 V2 V3_1 V3_2 V3_3
# 1: 1 4 a b c
# 2: 2 5 d e NA
# 3: 3 6 f NA NA
You can just transpose that to get your desired output.
t(cSplit(rawData, "V3"))
# [,1] [,2] [,3]
# V1 "1" "2" "3"
# V2 "4" "5" "6"
# V3_1 "a" "d" "f"
# V3_2 "b" "e" NA
# V3_3 "c" NA NA

count of records within levels of a factor

I am trying to populate a field in a table (or create a separate vector altogether, whichever is easier) with consecutive numbers from 1 to n, where n is the total number of records that share the same factor level, and then back to 1 for the next level, etc. That is, for a table like this
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
the result should be a new column (e.g. "sample") as follows:
sample<-c(1,2,3,4,1,2,3,1,2,3,4,1,2)
You can get it as follows, using ave:
data <- data.frame(data)
new <- ave(rep(1,nrow(data)),data$data,FUN=cumsum)
all.equal(new,sample) # check if it's right.
You can use rle function together with lapply :
sample <- unlist(lapply(rle(data[,1])$lengths,FUN=function(x){1:x}))
data <- cbind(data,sample)
Or even better, you can combine rle and sequence in the following one-liner (thanks to #Arun suggestion)
data <- cbind(data,sequence(rle(data[,1])$lengths))
> data
[,1] [,2]
[1,] "A" "1"
[2,] "A" "2"
[3,] "A" "3"
[4,] "A" "4"
[5,] "B" "1"
[6,] "B" "2"
[7,] "B" "3"
[8,] "C" "1"
[9,] "C" "2"
[10,] "C" "3"
[11,] "C" "4"
[12,] "D" "1"
[13,] "D" "2"
There are lots of different ways of achieving this, but I prefer to use ddply() from plyr because the logic seems very consistent to me. I think it makes more sense to be working with a data.frame (your title talks about levels of a factor):
dat <- data.frame(ID = c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)))
library(plyr)
ddply(dat, .(ID), summarise, sample = 1:length(ID))
# ID sample
# 1 A 1
# 2 A 2
# 3 A 3
# 4 A 4
# 5 B 1
# 6 B 2
# 7 B 3
# 8 C 1
# 9 C 2
# 10 C 3
# 11 C 4
# 12 D 1
# 13 D 2
My answer:
sample <- unlist(lapply(levels(factor(data)), function(x)seq_len(sum(factor(data)==x))))
factors <- unique(data)
f1 <- length(which(data == factors[1]))
...
fn <- length(which(data == factors[length(factors)]))
You can use a for loop or 'apply' family to speed that part up.
Then,
sample <- c(1:f1, 1:f2, ..., 1:fn)
Once again you can use a for loop for that part. Here is the full script you can use:
data<-matrix(c(rep('A',4),rep('B',3),rep('C',4),rep('D',2)),ncol=1)
factors <- unique(data)
f <- c()
for(i in 1:length(factors)) {
f[i] <- length(which(data == factors[i]))
}
sample <- c()
for(i in 1:length(f)) {
sample <- c(sample, 1:f[i])
}
> sample
[1] 1 2 3 4 1 2 3 1 2 3 4 1 2

Resources