Converting data frame columns from matrix to vector - R

I'm trying to combine a few dataframes. In the process of doing so, I noticed that one dataframe contains matrices rather than vectors. Here's a basic example:
df3 <- structure(list(
  v5 = structure(c(NA, 0), .Dim = c(2L, 1L),
                 .Dimnames = list(c("206", "207"), "ecbi1")),
  v6 = structure(c(NA, 0), .Dim = c(2L, 1L),
                 .Dimnames = list(c("206", "207"), "ecbi2"))),
  .Names = c("v5", "v6"), row.names = 206:207, class = "data.frame")
# get class
class(df3[,1])
# [1] "matrix"
I want the columns in df3 to be vectors, not matrices.

Just apply as.vector to each column:
df3[] = lapply(df3, as.vector)

I think the most important thing is to figure out how you got matrix-type columns in the first place, and whether that was intended or a side effect of a mistake somewhere earlier.
Given where you are, you can just use c to undo a given column:
df3$v5 <- c(df3$v5)
Or if this is a problem with all columns:
df3[ ] <- lapply(df3, c)
(lapply returns a list of vectors, and when we assign a list to a data.frame, each list element is interpreted as a column; df3[ ] selects all columns of df3. We could also write df3 <- lapply(df3, c), but using [ ] is more robust: if we make a mistake and somehow return the wrong number of columns, an error is thrown, whereas plain df3 <- ... would silently overwrite our data.frame.)
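Running this on the example data from the question confirms the change (a quick sanity check, not part of the original answer):

```r
# df3 as given in the question: both columns are 2x1 matrices
df3 <- structure(list(
  v5 = structure(c(NA, 0), .Dim = c(2L, 1L),
                 .Dimnames = list(c("206", "207"), "ecbi1")),
  v6 = structure(c(NA, 0), .Dim = c(2L, 1L),
                 .Dimnames = list(c("206", "207"), "ecbi2"))),
  .Names = c("v5", "v6"), row.names = 206:207, class = "data.frame")

sapply(df3, is.matrix)   # v5 v6: TRUE TRUE
df3[] <- lapply(df3, c)  # drop the matrix structure in place
sapply(df3, is.matrix)   # v5 v6: FALSE FALSE
```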
Lastly, if only some columns are matrix-type, we can replace only those columns like so:
mat.cols <- sapply(df3, is.matrix)
df3[ , mat.cols] <- lapply(df3[ , mat.cols], c)
As for how this approach relates to the one using as.vector, from ?c:
c is sometimes used for its side effect of removing attributes except names, for example to turn an array into a vector. as.vector is a more intuitive way to do this, but also drops names.
So given that the names don't mean much in this context, c is simply a more concise approach, but the end result is practically identical.
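The difference only shows up when the input has a names attribute; for matrix columns like the ones above, the two are interchangeable (a minimal illustration):

```r
m <- matrix(1:2, ncol = 1, dimnames = list(c("a", "b"), "x"))
identical(c(m), as.vector(m))  # TRUE: a matrix has dimnames, not names

v <- c(a = 1, b = 2)
names(c(v))          # "a" "b" -- c() keeps the names attribute
names(as.vector(v))  # NULL   -- as.vector() drops it
```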

Use this.
df3[,1] = as.vector(df3[,1])
The same procedure can be applied generally to the rest of the columns.

We can use do.call:
do.call(data.frame, df3)
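This works because data.frame() rebuilds each argument, splitting matrix columns back into plain vectors. A quick check on the question's data (a sketch, not part of the original answer):

```r
df3 <- structure(list(
  v5 = structure(c(NA, 0), .Dim = c(2L, 1L),
                 .Dimnames = list(c("206", "207"), "ecbi1")),
  v6 = structure(c(NA, 0), .Dim = c(2L, 1L),
                 .Dimnames = list(c("206", "207"), "ecbi2"))),
  .Names = c("v5", "v6"), row.names = 206:207, class = "data.frame")

df3_fixed <- do.call(data.frame, df3)
sapply(df3_fixed, is.matrix)  # all FALSE
```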

Related

Extract specific portion of a string and paste to a new column in R

I have the following dataframe with a string column, and I want to extract the T, N, M, G, L status (and so on) for each observation into separate new columns, including their respective prefix and suffix. I have tried grep() and strsplit(), but the resulting columns have differing numbers of rows due to NA values, so it doesn't seem to work. I'm not an expert in coding and I'd really appreciate your support with a working script. Thanks in advance.
df <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                           "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                           "cT3;cN0;M0"))
The expected output should look like
df <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                           "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                           "cT3;cN0;M0"),
                 T_output = c("cT1b", "pT1a", "cT3"),
                 G_output = c("G3", "G1", NA),
                 L_output = c(NA, "L0", NA))
grep is typically for finding matches (indices, or TRUE/FALSE with grepl), or occasionally for returning whole strings that contain a substring (value = TRUE), but not for extracting substrings from a whole string. For that, one might look into sub/gsub, regmatches with gregexpr, or stringr::str_extract/str_extract_all. However, I think that's not the best (well, certainly not the only) approach here.
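For reference, this is what substring extraction with the base tools mentioned above looks like (a minimal sketch on one of the question's strings):

```r
x <- "pT1a;pN0;cM0;G1;L0"
# sub(): keep only the captured group
sub(".*\\b(G[^;]*).*", "\\1", x)     # "G1"
# regmatches()/regexpr(): extract the first match directly
regmatches(x, regexpr("G[^;]*", x))  # "G1"
```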
Try this:
library(dplyr)
dat %>%
  select(input) %>%
  mutate(
    bind_rows(lapply(
      strsplit(input, ";"),
      function(S) as.data.frame(lapply(
        setNames(nm = c("T", "G", "L")),
        function(z) paste0(grep(pattern = z, x = S, value = TRUE),
                           collapse = ";"))))),
    across(one_of(c("T", "G", "L")), ~ ifelse(nzchar(.), ., .[NA]))
  )
# input T G L
# 1 cT1b;cN1a;cM0;G3 cT1b G3 <NA>
# 2 pT1a;pN0;cM0;G1;L0;V0;Pn0;R0 pT1a G1 L0
# 3 cT3;cN0;M0 cT3 <NA> <NA>
Note: this deliberately does nothing with the M or N substrings, which may or may not be what you intend. If you want them too, add the letters, e.g. setNames(nm = c("T", "G", "L", "N")) (and again the second time within one_of), to get additional upper-letter columns.
Data
dat <- structure(list(input = c("cT1b;cN1a;cM0;G3",
                                "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                                "cT3;cN0;M0")),
                 class = "data.frame", row.names = c(NA, -3L))
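A base-R sketch of the same idea (hypothetical helper name; split each string on ";" and keep the first piece containing each letter):

```r
# dat as in the Data block above
dat <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                            "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                            "cT3;cN0;M0"))
pieces <- strsplit(dat$input, ";")

# first part of each row that contains the given letter, else NA
first_match <- function(parts, letter) {
  hit <- grep(letter, parts, value = TRUE)
  if (length(hit)) hit[1] else NA_character_
}
for (letter in c("T", "G", "L")) {
  dat[[paste0(letter, "_output")]] <-
    vapply(pieces, first_match, character(1), letter = letter)
}
dat$T_output  # "cT1b" "pT1a" "cT3"
```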

How to match list of characters with partial strings in R?

I am analysing IDs from the RePEc database. Each ID matches a unique publication and sometimes publications are linked because they are different versions of each other (e.g. a working paper that becomes a journal article). I have a database of about 250,000 entries that shows the main IDs in one column and then the previous or alternative IDs in another. It looks like this:
df$repec_id <- c("RePEc:cid:wgha:353", "RePEc:hgd:wpfacu:350", "RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019",
                     "RePEc:tqu:vishdizi:d8z7-200x",
                     "RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050",
                     "RePEc:cid:wgha:353|RePEc:hgd:wpfacu:350")
I want to find out which IDs from the repec_id column are also present in the alt_repec_id column and create a dataframe that only has rows matching this condition. I tried to strsplit at "|" and use the %in% function like this:
df <- separate_rows(df, alt_repec_id, sep = "\\|")
df1 <- df[trimws(df$alt_repec_id) %in% trimws(df$repec_id), ]
df1<- data.frame(df1)
df1 <- na.omit(df1)
df1 <- df1[!duplicated(df1$repec_id),]
It works but I'm worried that by eliminating duplicate rows based on the values in the repec_id column, I'm randomly eliminating matches. Is that right?
Ultimately, I want a dataframe that only contains values in which strings in the repec_id column match the partial strings in the alt_repec_id column. Using the example above, I want the following result:
df$repec_id <- c("RePEc:cpi:dynxce:050")
df$alt_repec_id <- c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050")
Does anyone have a solution to my problem? Thanks in advance for your help!
Try using str_detect() from stringr to identify if the repec_id exists in the larger alt_repec_id string.
Then filter() down to where it was found. If this is not returning as expected, try looking at and posting a few examples where found_match == FALSE but a match was expected.
library(stringr)
library(dplyr)
df %>%
  mutate(found_match = str_detect(alt_repec_id, repec_id)) %>%
  filter(found_match == TRUE)
Here is a base R solution using grepl() + apply() + subset():
dfout <- subset(df, apply(df, 1, function(v) grepl(v[1], v[2])))
such that
> dfout
repec_id alt_repec_id
3 RePEc:cpi:dynxce:050 RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050
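A vectorised base-R variant of the same check (a sketch; fixed = TRUE makes grepl() treat each ID as a literal string rather than a regular expression, which is safer if IDs ever contain regex metacharacters):

```r
# the question's data, rebuilt as plain character columns
df <- data.frame(
  repec_id = c("RePEc:cid:wgha:353", "RePEc:hgd:wpfacu:350",
               "RePEc:cpi:dynxce:050"),
  alt_repec_id = c("RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019",
                   "RePEc:tqu:vishdizi:d8z7-200x",
                   "RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050"),
  stringsAsFactors = FALSE)

# pair each repec_id with its own alt_repec_id row
hit <- mapply(grepl, df$repec_id, df$alt_repec_id,
              MoreArgs = list(fixed = TRUE))
dfout <- df[hit, ]
nrow(dfout)  # 1: only RePEc:cpi:dynxce:050 appears in its alt column
```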
DATA
df <- structure(list(repec_id = structure(c(1L, 3L, 2L), .Label = c("RePEc:cid:wgha:353",
"RePEc:cpi:dynxce:050", "RePEc:hgd:wpfacu:350"), class = "factor"),
alt_repec_id = structure(c(2L, 3L, 1L), .Label = c("RePEc:aus:cecips:15_59|RePEc:sga:leciam:c8wc0z888s|RePEc:cpi:dynxce:050",
"RePEc:sii:giihdizi:heidwg06-2019|RePEc:azi:cusiihdizi:gdhs06-2019",
"RePEc:tqu:vishdizi:d8z7-200x"), class = "factor")), class = "data.frame", row.names = c(NA,
-3L))

Nested named list to data frame

I have the following named list output from an analysis. The reproducible code is as follows:
list(structure(c(-213.555409754509, -212.033637890131, -212.029474755074,
-211.320398316741, -211.158815833294, -210.470525157849), .Names = c("wasn",
"chappal", "mummyji", "kmph", "flung", "movie")), structure(c(-220.119433774144,
-219.186901747536, -218.743319709963, -218.088361753899, -217.338920075687,
-217.186050877079), .Names = c("crazy", "wired", "skanndtyagi",
"andr", "unveiled", "contraption")))
I want to convert this to a data frame. I have tried unlist to data frame options using reshape2, dplyr and other solutions given for converting a list to a data frame but without much success. The output that I am looking for is something like this:
Col1 Val1 Col2 Val2
1 wasn -213.55 crazy -220.11
2 chappal -212.03 wired -219.18
3 mummyji -212.02 skanndtyagi -218.74
so on and so forth. The actual output has multiple columns with paired values and runs into many rows. I have tried the following code already:
do.call(rbind, lapply(df, data.frame, stringsAsFactors = TRUE))
This partially works: it puts all the character values in one column and the numeric values in a second.
data.frame(Reduce(rbind, df))
This didn't work: it keeps the names from the first list only, and stacks the numbers from both lists as two different rows.
colNames <- unique(unlist(lapply(df, names)))
M <- matrix(0, nrow = length(df), ncol = length(colNames),
            dimnames = list(names(df), colNames))
matches <- lapply(df, function(x) match(names(x), colNames))
M[cbind(rep(sequence(nrow(M)), sapply(matches, length)),
        unlist(matches))] <- unlist(df)
M
didn't work correctly.
Can someone help?
Since the list elements are all of the same length, you should be able to stack them and then combine them by columns.
Try:
do.call(cbind, lapply(myList, stack))
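With a shortened version of the list (trimmed here for brevity), the stacked result has the default values/ind names, which can then be renamed in one step (a sketch, not part of the original answer):

```r
myList <- list(c(wasn = -213.56, chappal = -212.03),
               c(crazy = -220.12, wired = -219.19))
out <- do.call(cbind, lapply(myList, stack))
names(out)  # "values" "ind" "values" "ind"
names(out) <- c("Val1", "Col1", "Val2", "Col2")
```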
Here's another way:
as.data.frame(c(col = lapply(x, names), val = lapply(x, unname)))
How it works. lapply returns a list; two lists combined with c make another list; and a list is easily coerced to a data.frame, since the latter is just a list of vectors having the same length.
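On a trimmed version of the list you can see how the names come out: the col/val prefixes get numbered automatically by c().

```r
x <- list(c(wasn = -213.56, chappal = -212.03),
          c(crazy = -220.12, wired = -219.19))
res <- as.data.frame(c(col = lapply(x, names), val = lapply(x, unname)))
names(res)  # "col1" "col2" "val1" "val2"
```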
Better than coercing to a data.frame is just modifying its class, effectively telling the list "you're a data.frame now":
L = c(col = lapply(x, names), val = lapply(x,unname))
library(data.table)
setDF(L)
The result doesn't need to be assigned anywhere with = or <- because L is modified "in place."

Rename the column names after cbind-ing data

merger <- cbind(as.character(Date), weather1$High, weather1$Low,
                weather1$Avg..High, weather1$Avg.Low,
                sale$Scanned.Movement[a])
After cbind-ing the data, the new data frame automatically gets the column names V1, V2, ...
I want to rename a column with
colnames(merger)[,1] <- "Date"
but that failed. And when I use merger$V1, I get:
Error in merger$V1 : $ operator is invalid for atomic vectors
You can also name columns directly in the cbind call, e.g.
cbind(date=c(0,1), high=c(2,3))
Output:
date high
[1,] 0 2
[2,] 1 3
Try:
colnames(merger)[1] <- "Date"
Example
Here is a simple example:
a <- 1:10
b <- cbind(a, a, a)
colnames(b)
# change the first one
colnames(b)[1] <- "abc"
# change all colnames
colnames(b) <- c("aa", "bb", "cc")
You gave the following example in your question:
colnames(merger)[,1]<-"Date"
The problem is the comma: colnames() returns a vector, not a matrix, so the solution is:
colnames(merger)[1]<-"Date"
If you pass only vectors to cbind(), it creates a matrix, not a data frame; read ?data.frame.
A way of producing a data.frame and doing this in one line is to coerce each matrix/data frame passed to cbind into a data.frame while setting its column names via setNames:
a = matrix(rnorm(10), ncol = 2)
b = matrix(runif(10), ncol = 2)
cbind(setNames(data.frame(a), c('n1', 'n2')),
setNames(data.frame(b), c('u1', 'u2')))
which produces:
n1 n2 u1 u2
1 -0.2731750 0.5030773 0.01538194 0.3775269
2 0.5177542 0.6550924 0.04871646 0.4683186
3 -1.1419802 1.0896945 0.57212043 0.9317578
4 0.6965895 1.6973815 0.36124709 0.2882133
5 0.9062591 1.0625280 0.28034347 0.7517128
Unfortunately, there is no setColNames function analogous to setNames that returns the object after setting its column names. However, nothing stops you from adapting the code of setNames to produce one:
setColNames <- function (object = nm, nm) {
  colnames(object) <- nm
  object
}
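Usage then mirrors setNames (using the helper defined above, redefined here so the snippet is self-contained):

```r
setColNames <- function (object = nm, nm) {
  colnames(object) <- nm
  object
}
m <- setColNames(cbind(1:2, 3:4), c("Date", "High"))
colnames(m)  # "Date" "High"
```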
See this answer; the magrittr package contains functions for this.
If you offer cbind a set of arguments all of which are vectors, you will get not a data frame but a matrix, in this case an all-character matrix. They have different features. You can get a data frame if some of your arguments remain data frames. Try:
merger <- cbind(Date = as.character(Date),
                weather1[, c("High", "Low", "Avg..High", "Avg.Low")],
                ScnMov = sale$Scanned.Movement[a])
It's easy: just add the name you want, in quotes, before the vector you are adding:
a_matrix <- cbind(b_matrix, 'Name-Change' = c_vector)

How to obtain all possible matrices (or data.frames) which have a column less from the original matrix (or data.frame)

How can I obtain all possible matrices (or data.frames) which have one column less than the original matrix (or data.frame)?
For example, let's say I have a matrix (or data.frame) tmp:
tmp <- structure(list(V1 = 1:5, V2 = 6:10, V3 = 11:15, V4 = 16:20, V5 = 21:25),
                 .Names = c("V1", "V2", "V3", "V4", "V5"),
                 row.names = c(NA, -5L), class = "data.frame")
How can I efficiently get tmp[, -1], tmp[, -2], tmp[, -3], tmp[, -4], and tmp[, -5]?
At the moment I can think of:
llply(list(1,2,3,4,5), function(x) {tmp[[x]] <- NULL; tmp})
This can also be done using lapply. Is there a better or more intuitive and efficient way of doing this, especially when tmp is a matrix (without doing as.data.frame(tmp))?
Thank you in advance for any help or pointers.
Why not just
lapply(1:ncol(tmp), function(i) tmp[, -i])
Works both for matrices and data frames.
I can't see how you could do it more intuitively than that.
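Checking this against the question's tmp (a quick sketch):

```r
tmp <- structure(list(V1 = 1:5, V2 = 6:10, V3 = 11:15, V4 = 16:20, V5 = 21:25),
                 .Names = c("V1", "V2", "V3", "V4", "V5"),
                 row.names = c(NA, -5L), class = "data.frame")

drops <- lapply(1:ncol(tmp), function(i) tmp[, -i])
length(drops)        # 5 data frames
sapply(drops, ncol)  # each has 4 columns
names(drops[[1]])    # "V2" "V3" "V4" "V5"
```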
You can do it with lapply with simple indexing :
R> tmp <- matrix(1:10000,nrow=100,ncol=100)
R> lapply(1:ncol(tmp), function(i) {tmp[,-i]})
Maybe there is a more efficient solution, but I think the main problem with this method is that if your original matrix has a large number of columns (say n), the result of lapply will be huge (a list of n matrices, each with n-1 columns), and you may run out of memory handling it.