I haven't been able to find an answer to this, but I am guessing this is because I am not phrasing my question properly.
I want to combine two strings containing several comma-separated values into one string, alternating the inputs from each original string.
x <- '1,2'
y <- 'R,L'
# fictitious function
z <- combineSomehow(x,y)
z = '1R, 2L'
EDIT : Adding dataframe to better describe my issue. I would like to be able to accomplish the above, but within a mutate ideally.
df <- data.frame(
x = c('1','2','1,1','2','1'),
y = c('R','L','R,L','L','R'),
desired_result = c('1R','2L','1R,1L','2L','1R')
)
df:
x y desired_result
1 1 R 1R
2 2 L 2L
3 1,1 R,L 1R,1L
4 2 L 2L
5 1 R 1R
Final Edit/Answer: Based on #akrun's comment/response below and after removing the error originally in df, this ended up being the tidyverse answer:
mutate(desired_result = map2(.x=strsplit(x,','),.y=strsplit(y,','),
~ str_c(.x,.y, collapse=',')))
It can be done with strsplit and paste
combineSomehow <- function(x, y) {
do.call(paste0, c(strsplit(c(x,y),","), collapse=", "))
}
combineSomehow(x,y)
#[1] "1R, 2L"
Without modifying the function, we can Vectorize it to apply on multiple elements
df$desired_result2 <- Vectorize(combineSomehow)(df$x, df$y)
Related
HEADLINE: Is there a way to get R to recognize data.frame column names contained within lists in the same way that it can recognize free-floating vectors?
SETUP: Say I have a vector named varA:
(varA <- 1:6)
# [1] 1 2 3 4 5 6
To get the length of varA, I could do:
length(varA)
#[1] 6
and if the variable was contained within a larger list, the variable and its length could still be found by doing:
list <- list(vars = "varA")
length(get(list$vars[1]))
#[1] 6
PROBLEM:
This is not the case when I substitute the vector for a dataframe column and I don't know how to work around this:
rows <- 1:6
cols <- c("colA")
(df <- data.frame(matrix(NA,
nrow = length(rows),
ncol = length(cols),
dimnames = list(rows, cols))))
# colA
# 1 NA
# 2 NA
# 3 NA
# 4 NA
# 5 NA
# 6 NA
list <- list(vars = "varA",
cols = "df$colA")
length(get(list$vars[1]))
#[1] 6
length(get(list$cols[1]))
#Error in get(list$cols[1]) : object 'df$colA' not found
Though this contrived example seems inane, because I could always use the simple length(variable) approach, I'm actually interested in writing data from hundreds of variables varying in lengths onto respective dataframe columns, and so keeping them in a list that I could iterate through would be very helpful. I've tried everything I could think of, but it may be the case that it's just not possible in R, especially given that I cannot find any posts with solutions to the issue.
You could try:
> length(eval(parse(text = list$cols[1])))
[1] 6
Or:
list <- list(vars = "varA",
cols = "colA")
length(df[, list$cols[1]])
[1] 6
Or with regex:
list <- list(vars = "varA",
cols = "df$colA")
length(df[, sub(".*\\$", "", list$cols[1])])
[1] 6
If you are truly working with a data frame d, then nrow(d) is the length of all of the variables in d. There should be no reason to use length in this case.
If you are actually working with a list x containing variables of potentially different lengths, then you should use the [[ operator to extract those variables by name (see ?Extract):
x <- list(a = 1:10, b = rnorm(20L))
l <- list(vars = "a")
length(d[[l$vars[1L]]]) # 10
If you insist on using get (you shouldn't), then you need to supply a second argument telling it where to look for the variable (see ?get):
length(get(l$vars[1L], x)) # 10
After searching for some time, I cannot find a smooth R-esque solution.
I have a list of vectors that I want to convert to dataframes and add a column with the names of the vectors. I cant do this with cbind() and melt() to a single dataframe b/c there are vectors with different number of rows.
Basic example would be:
list<-list(a=c(1,2,3),b=c(4,5,6,7))
var<-"group"
What I have come up with and works is:
list<-lapply(list, function(x) data.frame(num=x,grp=""))
for (j in 1:length(list)){
list[[j]][,2]<-names(list[j])
names(list[[j]])[2]<-var
}
But I am trying to better use lapply() and have cleaner coding practices. Right now I rely so heavily on for and if statements, which a lot of the base functions do already and much more efficiently than I can code at this point.
The psuedo code I would like is something like:
list<-lapply(list, function(x) data.frame(num=x,get(var)=names(x))
Is there a clean way to get this done?
Second closely related question, if I already have a list of dataframes, why is it so hard to reassign column values and names using lapply()?
So using something like:
list<-list(a=data.frame(num=c(1,2,3),grp=""),b=data.frame(num=c(4,5,6,7),grp=""))
var<-"group"
#pseudo code
list<-lapply(list, function(x) x[,2]<-names(x)) #populate second col with name of df[x]
list<-lapply(list, function(x) names[[x]][2]<-var) #set 2nd col name to 'var'
The first line of pseudo code throws an error about matching row lengths. Why does lapply() not just loop over and repeat names(x) like the same function on a single dataframe does in a for loop?
For the second line, as I understand it I can use setNames() to reassign all the column names, but how do I make this work for just one of the col names?
Many thanks for any ideas or pointing to other threads that cover this and helping me understand the behavior of lapply() in this context.
A full R base approach without using loops
> l<-list(a=c(1,2,3),b=c(4,5,6,7))
> data.frame(grp=rep(names(l), lengths(l)), num=unlist(l), row.names = NULL)
grp num
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
Related to your first/main question you can use the function enframe from package tibble for this purpose
library(tibble)
library(tidyr)
library(dplyr)
l<-list(a=c(1,2,3),b=c(4,5,6,7))
l %>%
enframe(name = "group", value="value") %>%
unnest(value) %>%
group_split(group)
Try this:
library(dplyr)
mylist <- list(a = c(1,2,3), b = c(4,5,6,7))
bind_rows(lapply(names(mylist), function(x) tibble(grp = x, num = mylist[[x]])))
# A tibble: 7 x 2
grp num
<chr> <dbl>
1 a 1
2 a 2
3 a 3
4 b 4
5 b 5
6 b 6
7 b 7
This is essentially a lapply-based solution where you iterate over the names of your list, and not the individual list elements themselves. If you prefer to do everything in base R, note that the above is equivalent to
do.call(rbind, lapply(names(mylist), function(x) data.frame(grp = x, num = mylist[[x]], stringsAsFactors = F)))
Having said that, tibbles as modern implementation of data.frames are preferred, as is bind_rows over the do.call(rbind... construct.
As to the second question, note the following:
lapply(mylist, function(x) str(x))
num [1:3] 1 2 3
num [1:4] 4 5 6 7
....
lapply(mylist, function(x) names(x))
$a
NULL
$b
NULL
What you see here is that the function inside of lapply gets the elements of mylist. In this case, it get's to work with the numeric vector. This does not have any name as far as the function that is called inside lapply is concerned. To highlight this, consider the following:
names(c(1,2,3))
NULL
Which is the same: the vector c(1,2,3) does not have a name attribute.
Another question for me as a beginner. Consider this example here:
n = c(2, 3, 5)
s = c("ABBA", "ABA", "STING")
b = c(TRUE, "STING", "STRING")
df = data.frame(n,s,b)
n s b
1 2 ABBA TRUE
2 3 ABA STING
3 5 STING STRING
How can I search within this dataframe for similar strings, i.e. ABBA and ABA as well as STING and STRING and make them the same (doesn't matter whether ABBA or ABA, either fine) that would not require me knowing any variations? My actual data.frame is very big so that it would not be possible to know all the different variations.
I would want something like this returned:
> n = c(2, 3, 5)
> s = c("ABBA", "ABBA", "STING")
> b = c(TRUE, "STING", "STING")
> df = data.frame(n,s,b)
> print(df)
n s b
1 2 ABBA TRUE
2 3 ABBA STING
3 5 STING STING
I have looked around for agrep, or stringdist, but those refer to two data.frames or are able to name the column which I can't since I have many of those.
Anyone an idea? Many thanks!
Best regards,
Steffi
This worked for me but there might be a better solution
The idea is to use a recursive function, special, that uses agrepl, which is the logical version of approximate grep, https://www.rdocumentation.org/packages/base/versions/3.4.1/topics/agrep. Note that you can specify the 'error tolerance' to group similar strings with agrep. Using agrepl, I split off rows with similar strings into x, mutate the s column to the first-occurring string, and then add a grouping variable grp. The remaining rows that were not included in the ith group are stored in y and recursively passed through the function until y is empty.
You need the dplyr package, install.packages("dplyr")
library(dplyr)
desired <- NULL
grp <- 1
special <- function(x, y, grp) {
if (nrow(y) < 1) { # if y is empty return data
return(x)
} else {
similar <- agrepl(y$s[1], y$s) # find similar occurring strings
x <- rbind(x, y[similar,] %>% mutate(s=head(s,1)) %>% mutate(grp=grp))
y <- setdiff(y, y[similar,])
special(x, y, grp+1)
}
}
desired <- special(desired,df,grp)
To change the stringency of string similarity, change max.distance like agrepl(x,y,max.distance=0.5)
Output
n s b grp
1 2 ABBA TRUE 1
2 3 ABBA STING 1
3 5 STING STRING 2
To remove the grouping variable
withoutgrp <- desired %>% select(-grp)
I am trying to train a data that's converted from a document term matrix to a dataframe. There are separate fields for the positive and negative comments, so I wanted to add a string to the column names to serve as a "tag", to differentiate the same word coming from the different fields - for example, the word hello can appear both in the positive and negative comment fields (and thus, represented as a column in my dataframe), so in my model, I want to differentiate these by making the column names positive_hello and negative_hello.
I am looking for a way to rename columns in such a way that a specific string will be appended to all columns in the dataframe. Say, for mtcars, I want to rename all of the columns to have "_sample" at the end, so that the column names would become mpg_sample, cyl_sample, disp_sample and so on, which were originally mpg, cyl, and disp.
I'm considering using sapplyor lapply, but I haven't had any progress on it. Any help would be greatly appreciated.
Use colnames and paste0 functions:
df = data.frame(x = 1:2, y = 2:1)
colnames(df)
[1] "x" "y"
colnames(df) <- paste0('tag_', colnames(df))
colnames(df)
[1] "tag_x" "tag_y"
If you want to prefix each item in a column with a string, you can use paste():
# Generate sample data
df <- data.frame(good=letters, bad=LETTERS)
# Use the paste() function to append the same word to each item in a column
df$good2 <- paste('positive', df$good, sep='_')
df$bad2 <- paste('negative', df$bad, sep='_')
# Look at the results
head(df)
good bad good2 bad2
1 a A positive_a negative_A
2 b B positive_b negative_B
3 c C positive_c negative_C
4 d D positive_d negative_D
5 e E positive_e negative_E
6 f F positive_f negative_F
Edit:
Looks like I misunderstood the question. But you can rename columns in a similar way:
colnames(df) <- paste(colnames(df), 'sample', sep='_')
colnames(df)
[1] "good_sample" "bad_sample" "good2_sample" "bad2_sample"
Or to rename one specific column (column one, in this case):
colnames(df)[1] <- paste('prefix', colnames(df)[1], sep='_')
colnames(df)
[1] "prefix_good_sample" "bad_sample" "good2_sample" "bad2_sample"
You can use setnames from the data.table package, it doesn't create any copy of your data.
library(data.table)
df <- data.frame(a=c(1,2),b=c(3,4))
# a b
# 1 1 3
# 2 2 4
setnames(df,paste0(names(df),"_tag"))
print(df)
# a_tag b_tag
# 1 1 3
# 2 2 4
In my code, I am filling the columns of a dataframe with vectors, as so:
df1[columnNum] <- barWidth
This works fine, except for one thing: I want the name of the vector variable (barWidth above) to be retained as the column header, one column at a time. Furthermore, I do not wish to use cbind. This slows the execution of my code down considerably. Consequently, I am using a pre-allocated dataframe.
Can this be done in the vector-to-column assignment? If not, then how do I change it after the fact? I can't find the right syntax to do this with colNames().
TIA
It's being done by the [<-.data.frame function. It could conceivably be replaced by one that looked at the name of the argument but it's such a fundamental function I would be hesitant. Furthermore there appears to be an aversion to that practice signaled by this code at the top of the function definition:
> `[<-.data.frame`
function (x, i, j, value)
{
if (!all(names(sys.call()) %in% c("", "value")))
warning("named arguments are discouraged")
nA <- nargs()
if (nA == 4L) {
<snipped rest of rather long definition>
I don't know why that is there, but it is. Maybe you should either be thinking about using names<- after the column assignment, or using this method:
> dfrm["barWidth"] <- barWidth
> dfrm
a V2 barWidth
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
This can be generalized to a list of new columns:
dfrm <- data.frame(a=letters[1:4])
barWidth <- 1:4
newcols <- list(barWidth=barWidth, bw2 =barWidth)
dfrm[names(newcol)] <- newcol
dfrm
#
a barWidth bw2
1 a 1 1
2 b 2 2
3 c 3 3
4 d 4 4
If you have the list of names of vectors you want to apply you could do:
namevec <- c(...,"barWidth"...,)
columnNums <- c(...,10,...)
df1[columnNums[i]] <- get(namevec[i])
names(df1)[columnNums[i]] <- namevec[i]
or even
columnNums <- c(barWidth=4,...)
for (i in seq_along(columnNums)) {
df1[columnNums[i]] <- get(names(columnNums)[i])
}
names(df1)[columnNums] <- names(columnNums)
but the deeper question would be where this set of vectors is coming from in the first place: could you have them in a list all along?
I'd simply use cbind():
df1 <- cbind( df1, barWidth )
which retains the name. It will, however, end up as the last column in df1