I am trying to bind a named vector onto a matrix as a new row. The vector has fewer elements than the matrix has columns:
> m <- matrix(data = c("1", "2", "3"),
nrow = 1, ncol = 3,
dimnames = list(c(),
c("column 1", "column 2", "column 3")))
> named_vec <- c("4", "5")
> names(named_vec) <- c("column 1", "column 2")
> rbind(m, named_vec)
I get the following:
Warning message:
In rbind(m, named_vec) :
number of columns of result is not a multiple of vector length (arg 2)
This has the undesired effect of repeating the shorter vector.
Also, plyr's rbind.fill function does not work here, since both arguments need to be data frames:
> plyr::rbind.fill(data.frame(m), data.frame(named_vec))
Error: All inputs to rbind.fill must be data.frames
My desired output is a matrix that fills in missing values with NA's instead of repeating the vector, like this:
column 1 column 2 column 3
[1,] "1" "2" "3"
[2,] "4" "5" NA
Below is a base R solution
do.call(rbind,lapply(u<-list(m,named_vec),`length<-`,max(lengths(u))))
such that
column 1 column 2 column 3
[1,] "1" "2" "3"
[2,] "4" "5" NA
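If the vector's names should decide which columns get filled (rather than its position), one possible approach is to start from an all-NA row and fill it by name. This is only a sketch: rbind_by_name is a made-up helper name, not a base R function, and it assumes every name in the vector is also a column name of the matrix.

```r
# Same data as in the question
m <- matrix(c("1", "2", "3"), nrow = 1,
            dimnames = list(NULL, c("column 1", "column 2", "column 3")))
named_vec <- c("column 1" = "4", "column 2" = "5")

# rbind_by_name is a hypothetical helper, not part of base R
rbind_by_name <- function(mat, vec) {
  # assumption: all names of vec occur among the column names of mat
  stopifnot(all(names(vec) %in% colnames(mat)))
  # start from an all-NA row that carries the matrix's column names
  new_row <- setNames(rep(NA, ncol(mat)), colnames(mat))
  # fill only the positions named in vec
  new_row[names(vec)] <- vec
  # deparse.level = 0 stops rbind from labelling the new row "new_row"
  rbind(mat, new_row, deparse.level = 0)
}

res <- rbind_by_name(m, named_vec)
res
```

Because the row is filled by name, a vector such as c("column 3" = "9") would land in the third column instead of the first.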
If it is OK to convert the matrix and the vector to data frames, you can use bind_rows.
dplyr::bind_rows(data.frame(m), data.frame(t(named_vec)))
# column.1 column.2 column.3
#1 1 2 3
#2 4 5 <NA>
We can use rbindlist
library(data.table)
rbindlist(list(as.data.frame(m), as.data.frame(t(named_vec))), fill = TRUE)
I have a simple problem: a very long data frame reports zero as the character string "nothing" in one of its columns. How can I replace all of these with a numeric 0? A sample data frame is below:
Group Candy
A     5
B     nothing
This is what I want to change it into:
Group Candy
A     5
B     0
Keep in mind that my actual dataset is hundreds of rows long.
My own attempt used is.na, which converts NA values to zero with ease but apparently only works for NA. I wasn't sure whether there is a solution for actual character values.
Thanks
The best way is to read the data in right, not with "nothing" for missing values. This can be done with argument na.strings of functions read.table or read.csv. Then change the NA's to zero.
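As a minimal sketch of that advice, assuming the data can be re-read and using the same toy data as this answer (read from an inline string here instead of a file):

```r
# Treat the string "nothing" as missing while reading
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE, na.strings = "nothing")

# Candy is now read as a number with one NA; replace the NA with zero
df1$Candy[is.na(df1$Candy)] <- 0
str(df1)
```

With a real file, the same na.strings argument can be passed to read.csv.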
The following function is probably slow for large data.frames but replaces the "nothing" values by zeros.
nothing_zero <- function(x){
tc <- textConnection("nothing", "w")
sink(tc) # divert output to tc connection
print(x) # print in string "nothing" instead of console
sink() # set the output back to console
close(tc) # close connection
tc <- textConnection(nothing, "r")
y <- read.table(tc, na.strings = "nothing", header = TRUE)
close(tc) # close connection
y[is.na(y)] <- 0
y
}
nothing_zero(df1)
# Group Candy
#1 A 5
#2 B 0
The main advantage is to read numeric data as numeric.
str(nothing_zero(df1))
#'data.frame': 2 obs. of 2 variables:
# $ Group: chr "A" "B"
# $ Candy: num 5 0
Data
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE)
sapply(df,function(x) {x <- gsub("nothing",0,x)})
Output
a
[1,] "0"
[2,] "5"
[3,] "6"
[4,] "0"
Data
df <- structure(list(a = c("nothing", "5", "6", "nothing")),
class = "data.frame",
row.names = c(NA,-4L))
Another option
df[] <- lapply(df, gsub, pattern = "nothing", replacement = "0", fixed = TRUE)
If you only want to apply this to one column:
library(tidyverse)
df$a <- str_replace(df$a,"nothing","0")
Or, applying it to one column in base R:
df$a <- gsub("nothing","0",df$a)
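Note that the gsub and str_replace variants leave the column as character ("0", not 0). If a numeric column is wanted, as the question asks, a possible follow-up step is shown below. This is a sketch that assumes every other value in the column parses as a number.

```r
df <- data.frame(a = c("nothing", "5", "6", "nothing"),
                 stringsAsFactors = FALSE)

# Replace exact matches only, then convert the whole column to numeric
df$a[df$a == "nothing"] <- "0"
df$a <- as.numeric(df$a)
str(df)
```

Using == instead of gsub avoids touching longer strings that merely contain "nothing".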
I'd like to convert the values of a matrix into a vector according to some conditions. In my example:
# Create my matrix
mymatrix <-matrix(
# Create a numeric variable (10 rows x 3 columns = 30 values)
abs(rnorm(30)),
# No of rows
nrow = 10,
# No of columns
ncol = 3,
# By default matrices are in column-wise order
# So this parameter decides how to arrange the matrix
byrow = TRUE
)
# Naming rows
rownames(mymatrix) = 1:10
# Naming columns
colnames(mymatrix ) = c("1", "2", "3")
mymatrix
# 1 2 3
#1 0.85882558 1.38755611 0.369197570
#2 1.58785948 1.13064411 1.542977629
#3 0.35293056 1.44036121 1.806414543
#4 0.02709663 1.25620400 0.794001157
#5 0.34426152 0.32365824 2.026024465
#6 0.03608507 1.12315562 1.072635275
#7 0.39055300 0.49463748 0.645037388
#8 0.33406392 0.63543332 0.005055208
#9 1.04796081 0.04062249 2.330948193
#10 0.42538451 0.24574490 0.268357588
I'd like to convert my matrix to a vector (myvector) using a custom rule:
If mymatrix[, 1] is the maximum value in the row and mymatrix[, 1] >= 0.95, the result is "1"; if it is the maximum but mymatrix[, 1] < 0.95, the result is "misclassified". If mymatrix[, 2] or mymatrix[, 3] is the maximum value in the row, the result is "2" or "3", respectively. My desired output is:
myvector
#[1] "2" "1" "3" "2" "3" "2" "3" "2" "1" "misclassified"
Please, any ideas?
Here's a vectorised option -
#Get the column number of max value in each row
res <- max.col(mymatrix)
#Get the row numbers where column 1 is highest
inds <- which(res == 1)
#If those values are less than 0.95, make them 'misclassified'
res[inds][mymatrix[inds, 1] < 0.95] <- 'misclassified'
res
#[1] "2" "1" "3" "2" "3"
#[6] "2" "3" "2" "3" "misclassified"
It looks like you want to apply a function over your rows. So apply would be appropriate here:
apply(mymatrix, 1, \(x) { y <- which.max(x)
if (y == 1) {if (x[y] >= 0.95) "1" else "misclassified"} else as.character(y)})
[1] "2" "1" "3" "2" "3"
[6] "2" "3" "2" "3" "misclassified"
You can try apply + ifelse
apply(
mymatrix,
1,
function(x) {
ifelse(which.max(x) != 1 || max(x) >= 0.95,
colnames(mymatrix)[which.max(x)],
"misclassified"
)
}
)
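Since the matrix in the question is random, the rule is easier to check on a small fixed matrix. The following is a sketch with made-up values (mat, best and out are illustration names), using max.col with ties.method = "first" so the result is reproducible:

```r
# Three hand-picked rows: max in column 2, max in column 1 (>= 0.95),
# and max in column 1 (< 0.95)
mat <- matrix(c(0.86, 1.39, 0.37,
                1.59, 1.13, 1.54,
                0.43, 0.25, 0.27), nrow = 3, byrow = TRUE)

# Column index of the row maximum; "first" makes tie-breaking deterministic
best <- max.col(mat, ties.method = "first")

out <- as.character(best)
# Only rows won by column 1 are subject to the 0.95 threshold
out[best == 1 & mat[, 1] < 0.95] <- "misclassified"
out
# [1] "2" "1" "misclassified"
```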
I have the following data frame in R and am trying to apply a string-splitting function to it to produce a different data frame.
DF
A       B    C
"1,2,3"      "1,2"
"2"     "1"
The cells of the dataframe are filled with characters. The empty spaces are blank values. I have created the following function
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
The function works neatly when I use it on a single column:
sapply(DF$A, sepfunc)
[1] "1" "2"
However, the following command yields only a single row
sapply(DF, sepfunc)
A B C
"1" NA "1"
The second row is not displayed. I know I must be missing something rudimentary. I request someone to help.
The expected output is
A B C
"1" NA "1"
"2" "1" NA
strsplit returns a list of vectors, one per input element. Subsetting that list with [[1]] keeps only the first element and discards the rest; here the first element corresponds to the first row. When sapply is called on a single column, it loops over each element and runs strsplit on it, so the list always has length 1 and taking [[1]] does no harm. With sapply(DF, sepfunc) the case is different: the function receives a whole column, so the list has as many elements as the column has rows. We therefore need to loop through the list, either with sapply or with lapply (the former returns a vector or matrix where it can simplify, while the latter always returns a list).
sapply(DF, function(x) sapply(strsplit(as.character(x), ","), `[`, 1))
# A B C
#[1,] "1" NA "1"
#[2,] "2" "1" NA
Let's look at this more closely by splitting the code into chunks. For each column, the output is a list of split vectors:
lapply(DF, function(x) strsplit(as.character(x), ","))
#$A
#$A[[1]]
#[1] "1" "2" "3"
#$A[[2]]
#[1] "2"
#$B
#$B[[1]]
#[1] NA
#$B[[2]]
#[1] "1"
#$C
#$C[[1]]
#[1] "1" "2"
#$C[[2]]
#character(0)
When we do [[1]], the first element is extracted i.e. the first row of 'A', 'B', 'C'
lapply(DF, function(x) strsplit(as.character(x), ",")[[1]])
#$A
#[1] "1" "2" "3"
#$B
#[1] NA
#$C
#[1] "1" "2"
If we subset the above again, i.e. take the first element, the output will be "1" NA "1".
Instead, we want to loop through the list and get the first element of each list element.
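Put together, the loop over list elements could look like the sketch below. It assumes the blank cells are empty strings; first_piece is a made-up helper name. vapply is used instead of sapply so that an empty split result is turned into NA explicitly.

```r
DF <- data.frame(A = c("1,2,3", "2"),
                 B = c("", "1"),
                 C = c("1,2", ""),
                 stringsAsFactors = FALSE)

# first_piece is a hypothetical helper: first comma-separated piece per cell
first_piece <- function(x) {
  parts <- strsplit(as.character(x), ",", fixed = TRUE)
  # an empty string splits to character(0), which is mapped to NA here
  vapply(parts, function(p) if (length(p)) p[1] else NA_character_,
         character(1))
}

res <- sapply(DF, first_piece)
res
```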
As you only want to extract the part before the first comma, you can also do
sapply(DF, function(x) gsub("^([^,]*),.*$", "\\1", x))
# A B C
# [1,] "1" NA "1"
# [2,] "2" NA "1"
This extracts the first group (\\1), which is marked here with parentheses: ([^,]*).
Or with stringr:
library(stringr)
sapply(DF, function(x) str_extract(x, "^([^,]*)"))
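A slightly shorter base R variant of the same idea is to delete everything from the first comma onward instead of capturing what precedes it. This is a sketch on a plain character vector; note that empty strings stay empty and NA stays NA, so blanks still have to be recoded separately if NA is wanted.

```r
x <- c("1,2,3", "2", NA, "")

# Drop the first comma and everything after it
first <- sub(",.*$", "", x)
first
# [1] "1" "2" NA  ""
```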
Here is another version of this
lapply(X = DF, FUN = function(x) sapply(strsplit(x = as.character(x), split = ","), FUN = head, n=1))
First of all, notice that your sepfunc should always give an error:
sepfunc<-function(x){strsplit(as.character(x, split= ","))[[1]][1]}
split should go with strsplit, not as.character, so what you meant is probably:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
Second, the question of data sanity. You have character variables stored as factors, and missing data stored as empty strings. I would recommend dealing with these issues before trying to do anything else. (Why do I say NA is more sensible here than an empty string? Because you told me so. You want NA's in the output, so I guess this means that if there are no numbers in the string, it means that something is missing. Missing = NA. There is also a technical reason which would take a bit longer to explain.)
So in the following, I'm just using an altered version of your DF:
DF <- data.frame(A=c("1,2,3", "2"), B=c(NA, "1"), C=c("1,2", NA), stringsAsFactors=FALSE)
(If DF comes from a file, then you could use read.csv("file", as.is=TRUE). And then DF[DF==""] <- NA.)
The output of strsplit is a list so you'll need sapply to get something useful out from it. And another sapply to apply it to all columns in a data frame.
sapply(DF, function(x) sapply(strsplit(x, ","), head, 1))
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
Or step by step. Before you can sapply a function over all columns of a data frame, you need it to give meaningful results for all the columns. Let's try:
sf <- function(x) sapply(strsplit(x, ","), head, 1)
# and sepfunc as defined above:
sepfunc<-function(x){strsplit(as.character(x), split= ",")[[1]][1]}
sf(DF$A)
# [1] "1" "2"
# as expected
sepfunc(DF$A)
# [1] "1"
Notice that sepfunc uses only the first element (as you told it to!) of each column, and the rest is discarded. You need sapply or something similar to use all elements. So as a consequence, you get this:
sapply(DF, sepfunc)
# A B C
# "1" NA "1"
(It works, because we've redefined empty strings as NA. But you get the results only for the first row of each variable.)
sapply(DF, sf)
# A B C
# [1,] "1" NA "1"
# [2,] "2" "1" NA
I have a data table in R with text columns of colon delimited data. I want to return a matrix/data table of results where one of the delimited values is returned for each cell.
The code pasted below demonstrates the problem and is a working solution. However, my actual data table is large (a few thousand rows and columns), and the pasted method takes on the order of a minute or two to complete.
I'm wondering if there is a more efficient way to perform this task? It appears that the sep2 option in fread will be very useful for this problem once implemented.
Thanks!
> # Set up data.table
> DT <- data.table(A = c("cat:1:meow", "dog:2:bark", "cow:3:moo"),
B = c("dog:3:meow", "dog:4:bark", "frog:3:croak"),
C = c("dingo:0:moo", "cat:8:croak", "frog:1:moo"))
> print(DT)
A B C
1: cat:1:meow dog:3:meow dingo:0:moo
2: dog:2:bark dog:4:bark cat:8:croak
3: cow:3:moo frog:3:croak frog:1:moo
# grab the second delimited value in each cell
> part_index <- 2
> f = function(x) {vapply(t(x), function(x) {unlist(strsplit(x, ":", fixed=T))[part_index]}, character(1))}
> sapply(DT, f)
A B C
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"
1) sub. Try this:
DT[, lapply(.SD, sub, pattern = ".*:(.*):.*", replacement = "\\1")]
giving:
A B C
1: 1 3 0
2: 2 4 8
3: 3 3 1
2) fread. Or, using fread:
DT[, lapply(.SD, function(x) fread(paste(x, collapse = "\n"))$V2)]
3) matrix. Note that similar code would work with a plain character matrix, without data.table:
m <- as.matrix(DT)
replace(m, TRUE, sub(".*:(.*):.*", "\\1", m))
giving:
A B C
[1,] "1" "3" "0"
[2,] "2" "4" "8"
[3,] "3" "3" "1"
3a) Even simpler (no regular expressions) would be:
replace(m, TRUE, read.table(text = m, sep = ":")$V2)
3b) or using fread from data.table:
replace(m, TRUE, fread(paste(m, collapse = "\n"))$V2)
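The approaches above hard-code the second field. If the field position has to stay configurable, as with part_index in the question, one possible base R sketch is to split the whole matrix once and index each piece. The data here is a shortened two-column version of the question's table; no claim is made about how its speed compares with the options above.

```r
m <- matrix(c("cat:1:meow", "dog:2:bark", "cow:3:moo",
              "dog:3:meow", "dog:4:bark", "frog:3:croak"),
            ncol = 2, dimnames = list(NULL, c("A", "B")))
part_index <- 2L

# One strsplit call over all cells (column-major order)
pieces <- strsplit(as.vector(m), ":", fixed = TRUE)

# Keep the matrix shape and dimnames, overwrite only the values
out <- m
out[] <- vapply(pieces, `[`, character(1), part_index)
out
```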
I have a list of lists of strings as follows:
> ll
[[1]]
[1] "2" "1"
[[2]]
character(0)
[[3]]
[1] "1"
[[4]]
[1] "1" "8"
The longest list is of length 2, and I want to build a data frame with 2 columns from this list. Bonus points for also converting each item in the list to a number or NA for character(0). I have tried using mapply() and data.frame to convert to a data frame and fill with NA's as follows.
# Find length of each list element
len = sapply(ll, length)
# Number of NAs to fill for column shorter than longest
len = 2 - len
df = data.frame(mapply( function(x,y) c( x , rep( NA , y ) ) , ll , len))
However, I do not get a data frame with 2 columns (and NA's as fillers) using the code above.
Thanks for the help.
We can use stri_list2matrix from stringi. As the list elements are all character vectors, it seems okay to use this function
library(stringi)
t(stri_list2matrix(ll))
# [,1] [,2]
#[1,] "2" "1"
#[2,] NA NA
#[3,] "1" NA
#[4,] "1" "8"
If we need a data.frame, wrap the result in as.data.frame.
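For the "bonus" part of the question (numbers instead of strings), a base R alternative without stringi might look like the sketch below. It relies on the fact that indexing past the end of a vector yields NA, and it assumes the longest element has length 2 or more (with a maximum length of 1, vapply would return a plain vector and the transpose would have the wrong shape).

```r
ll <- list(c("2", "1"), character(0), "1", c("1", "8"))

n <- max(lengths(ll))
# Convert each element to numeric and pad it to length n with NA
mat <- t(vapply(ll, function(x) as.numeric(x)[seq_len(n)], numeric(n)))

df <- as.data.frame(mat)
df
```

The resulting columns get the default names V1 and V2.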