I want to concatenate text across 20 columns of my dataset (dat), skipping all NA values.
For example, if the first row had "cat" in column 1, "dog" in column 2, and NA in column 3, I want to compile that as "cat dog" in a new column (dat$results). Here's what I have:
m <- ""
for(i in 1:20){
if(!is.na(dat[,i])){
m <- paste(m, dat[,i], sep = " ")
}
else {
next
}
}
dat$results <- m
The loop only runs up to column 3 (which is NA for my first row). Not a problem for that first row, BUT other rows that do have text in column 3 don't get that column compiled. What can I do here?
Maybe just concatenate all the columns in one go and remove the NAs and extra spaces afterward:
dat$results <- trimws(gsub('\\s+', ' ', gsub('\\bNA\\b', '', do.call(paste, dat[, 1:20]))))
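Another option, as a minimal sketch: drop the NA values row by row before pasting, so that no "NA" text ever has to be cleaned up. The small data frame here is hypothetical sample data standing in for the 20-column dat.

```r
# Hypothetical sample data standing in for `dat` (3 columns instead of 20)
dat <- data.frame(
  a = c("cat", "fish"),
  b = c("dog", NA),
  c = c(NA, "bird"),
  stringsAsFactors = FALSE
)

# For each row, drop the NA entries and join the rest with single spaces
dat$results <- apply(dat[, 1:3], 1, function(r) paste(r[!is.na(r)], collapse = " "))
dat$results
```

For the real data, dat[, 1:3] would become dat[, 1:20]. This is row-wise, so it may be slower than the single paste call on very large data.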
I want to find the maximum number of times a comma appears in a single row of one column.
For example,
Cars
1 Bugatti (4)","Ferrari (7)","Audi (10)
2 Toyota (6)
3 Tesla (9)","Mercedes(8)
4 Suzuki (11)","Mitsubishi (19)","Ford (7)","BMW (6)
For the column above, the maximum number of commas in a row is 3, and it occurs in row 4. How can I compute this on much larger data (4000+ rows)?
You can use gregexpr() to return a vector of the positions of the comma(s) in each string. Then you can apply the length() function to count up the commas:
sapply(gregexpr(",", df$cars), length)
## 2 1 1 3
To answer the exact question asked, just wrap the above line of code in max() to determine the maximum number of times a comma appeared in one of your strings.
The above actually returns a "1" when a "0" is expected, because gregexpr() returns -1 (a vector of length 1) for a string with no match. There is probably a more elegant solution, but here's a function that handles zeros correctly:
count_commas <- function(x) {
  y <- gregexpr(",", x, fixed = TRUE)  # positions of commas; -1 means no comma
  # count the positions, treating the -1 "no match" marker as zero commas
  sapply(y, function(p) if (p[1] == -1) 0L else length(p))
}
count_commas(df$cars)
# 2 0 1 3
My idea is to remove the non-comma characters and calculate the number of chars.
I have no clue which class of object you are using for cars. Assuming your input is
cars <- c(' Bugatti (4)","Ferrari (7)","Audi (10)','Toyota (6)','Tesla (9)","Mercedes(8)','Suzuki (11)","Mitsubishi (19)","Ford (7)","BMW (6)')
then you can use nchar(gsub("[^,]","", cars)) to get the number of commas of each row.
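To get from the per-row counts to the maximum the question asks for, a short sketch using the same sample vector might look like this (n_commas, max_commas and max_row are just names I picked):

```r
# Sample input copied from the answer above
cars <- c(' Bugatti (4)","Ferrari (7)","Audi (10)', 'Toyota (6)',
          'Tesla (9)","Mercedes(8)',
          'Suzuki (11)","Mitsubishi (19)","Ford (7)","BMW (6)')

n_commas   <- nchar(gsub("[^,]", "", cars))  # number of commas in each row
max_commas <- max(n_commas)                  # largest count over all rows
max_row    <- which.max(n_commas)            # first row that reaches it
```

Both gsub() and nchar() are vectorized, so this should scale fine to a few thousand rows.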
The first column in my data.frame consists of strings, and the second column contains unique keys.
I want to extract all words after the nth word from each string, and if the string has <= n words, extract the entire string.
I have over 10k rows in my data.frame and was wondering if there is a quick way of doing this other than using for loops?
Thanks.
How about the following:
# Generate some sample data
library(tidyverse)
df <- data.frame(
one = c("Entries from row one", "Entries from row two", "Entries from row three"),
two = runif(3))
# Define function to extract all words after the n-th word
# (or return the full string if the string has <= n words)
crop_string <- function(ss, n) {
  # sapply (rather than lapply) returns a character vector, not a list column
  sapply(strsplit(as.character(ss), "\\s"), function(v)
    if (length(v) > n) paste(v[(n + 1):length(v)], collapse = " ")
    else paste(v, collapse = " "))
}
# Let's crop strings from column one by removing the first 3 words (n = 3)
n <- 3;
df %>%
mutate(words_after_n = crop_string(one, n))
#                     one       two words_after_n
#1   Entries from row one 0.5120053           one
#2   Entries from row two 0.1873522           two
#3 Entries from row three 0.0725107         three
# If n > # of words, return the full string
n <- 10;
df %>%
mutate(words_after_n = crop_string(one, n))
#                     one       two          words_after_n
#1   Entries from row one 0.9363278   Entries from row one
#2   Entries from row two 0.3024628   Entries from row two
#3 Entries from row three 0.6666226 Entries from row three
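If you would rather avoid splitting and re-pasting, here is a rough regex-based sketch of the same idea. The function name words_after is made up for illustration, and it assumes words are separated by whitespace with no trailing whitespace in the strings.

```r
# Drop the first n whitespace-separated words. Strings with <= n words
# do not match the pattern, so sub() returns them unchanged.
words_after <- function(x, n) {
  pattern <- sprintf("^\\s*(\\S+\\s+){%d}", n)
  sub(pattern, "", as.character(x), perl = TRUE)
}

words_after(c("Entries from row one", "only two", "just three words"), 3)
```

Since sub() is vectorized, this runs over the whole column in one call, which should be comfortable for 10k rows.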
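Here I use nchar(), so make sure your data has been converted to character first. Note that this extracts everything from the n-th character onward, not from the n-th word.
YOUR_DATA <- as.character(YOUR_DATA)
as.character(sapply(YOUR_DATA, function(x, y) {
  if (nchar(x) >= y) {
    substr(x, y, nchar(x))  # keep characters y to the end
  } else {
    x                       # shorter than y characters: keep as is
  }
}, y = nth_data_you_want))
Assume the data looks like: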
"gene#seq"
"Cblb#TAGTCCCGAAGGCATCCCGA"
"Fbxo27#CCCACGTGTTCTCCGGCATC"
"Fbxo11#GGAATATACGTCCACGAGAA"
"Pwp1#GCCCGACCCAGGCACCGCCT"
Using 10 as the n-th position, the result is:
"gene#seq"
"CCCGAAGGCATCCCGA"
"CACGTGTTCTCCGGCATC"
"AATATACGTCCACGAGAA"
"GACCCAGGCACCGCCT"
I am trying to read a file into R that uses different delimiters. The first row is space-delimited. From the second row to the last, there is a space between the first and second columns and between the second and third; after that, every digit in the block of twos, zeros and ones should be its own column.
Any hints?
ID Chip AX-77047182 AX-80910836 AX-80737273 AX-77048714 AX-77048779 AX-77050447
3811582 1 2002202222200202022020200200220200222200022220002200000201202000222022
3712982 1 2002202222200202022020200200220200222200022220002200000200202000222022
3712990 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713019 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713025 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713126 1 2002202222200202022020200200220200222200022220002200000200202000222022
Certainly not the most elegant solution, but you could try the following. If I have understood your example data correctly, you have not provided all the column names (AX-77047182, ...) that would be needed for the rows of zeros/ones/twos. If my understanding is wrong, the approach below will not give the desired result, but it might still help you find a workaround: you could simply adapt the delimiter in the second split command. I hope this helps.
# read file as character vector
chipstable <- readLines(".../chips.txt")
# extract first line to be used as column names
tablehead <- unlist(strsplit(chipstable[1], " "))
# split by first delimiter, i.e., space
chipstable <- strsplit(chipstable[2:length(chipstable)], " ")
# split by second delimiter, i.e., between each character (here number)
# and merge the two split results in one line
chipstable <- lapply(chipstable, function(x) {
  c(x[1:2], unlist(strsplit(x[3], "")))
})
# combine all lines to a matrix
chipstable <- do.call(rbind, chipstable)
# assign column names (needs exactly one name per column)
colnames(chipstable) <- tablehead
# turn values to numeric (if needed)
chipstable <- apply(chipstable, 2, as.numeric)
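Here is a small self-contained sketch of the same two-step split, wrapped in a function so it can be tried without the file. The function name parse_chip_lines, the shortened sample lines and the fallback names SNP_1, SNP_2, ... are all made up for illustration; the fallback is used when the header does not supply one name per digit column.

```r
# Parse lines of the form "ID Chip DIGITS" into a data frame with one
# column per digit. `lines` is a character vector whose first element
# is the header row.
parse_chip_lines <- function(lines) {
  header <- strsplit(lines[1], " +")[[1]]
  parts  <- strsplit(lines[-1], " +")
  # split the third field of every row into single digits
  geno <- do.call(rbind, lapply(parts, function(p) {
    as.integer(strsplit(p[3], "")[[1]])
  }))
  out <- data.frame(
    ID   = vapply(parts, `[`, "", 1),
    Chip = as.integer(vapply(parts, `[`, "", 2)),
    geno,
    stringsAsFactors = FALSE
  )
  # use the header names only if there is exactly one per digit column
  marker_names <- header[-(1:2)]
  if (length(marker_names) != ncol(geno)) {
    marker_names <- paste0("SNP_", seq_len(ncol(geno)))
  }
  names(out) <- c("ID", "Chip", marker_names)
  out
}

# Shortened, hypothetical sample in the same layout as the question
lines <- c("ID Chip AX-1 AX-2 AX-3",
           "3811582 1 20022",
           "3712982 1 20012")
out <- parse_chip_lines(lines)
```

With the real file, lines would come from readLines() as in the answer above.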
You can try ... read(pattern = " || 1 ", recursive = TRUE)
Then bind the pieces together.
For instance:
data <- "ID Chip AX-77047182 AX-80910836 AX-80737273 AX-77048714 AX-77048779 AX-77050447
3811582 1 2002202222200202022020200200220200222200022220002200000201202000222022
3712982 1 2002202222200202022020200200220200222200022220002200000200202000222022
3712990 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713019 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713025 1 2002202211200202021011100101210200111101022121112100111110211110122122
3713126 1 2002202222200202022020200200220200222200022220002200000200202000222022"
teste <- strsplit(data, split = "\n")
for (i in seq_along(teste[[1]])) {
  if (i == 1) {
    # header row: split on single spaces
    dataOut <- strsplit(teste[[1]][i], split = " ")
  } else {
    # data rows: split the ID from the digit block at the " 1 " chip field
    dataOut <- strsplit(teste[[1]][i], split = " 1 ")
  }
  print(dataOut)
}
I have a data frame (df) with several columns. Column A contains values such as "1 34", "368 879", ..., that is, with white space in between.
I would like to create a new column that replaces the white space with a fixed number of zeros, given by column Rep. That is:
A        Rep  New_A
1 15     3    100015
378 567  2    37800567
45 2     4    4500002
For a single value, for example df[1,"A"], something like this works:
New_A <- gsub("[[:punct:]])|\\s+",paste(rep(0,df[1,"Rep"]), sep="", collapse=""), df[1,"A"])
But for the whole data frame I tried the following, and it doesn't work:
df$New_A <- gsub("[[:punct:]])|\\s+",paste(rep(0,df$Rep), sep="", collapse=""), df$A)
I could do it with a for loop, but I would prefer to avoid that because my data frame has more than 1,000,000 rows, so it would not be efficient at all.
Following the idea of @CathG, here's how I got what I was asking for:
df$New_A <- mapply(function(x,y){gsub("[[:punct:]])|\\s+",paste(rep(0,y), sep="", collapse=""), x)}, x=df$A, y=df$Rep)
where:
rep(0,y) # produces as many zeros as indicated in df$Rep, as a vector
paste(rep(0,y), sep="", collapse="") # joins the zeros of that vector into one string (like "000")
gsub("[[:punct:]])|\\s+",...,...) # replaces the white space in df$A with that string of zeros
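For a million rows, a fully vectorized variant may be worth trying. This is only a sketch and assumes every value in A has exactly one white-space gap (two numbers), as in the example table; the sample data frame is hypothetical.

```r
# Hypothetical sample data matching the example table
df <- data.frame(A = c("1 15", "378 567", "45 2"), Rep = c(3, 2, 4),
                 stringsAsFactors = FALSE)

left  <- sub("\\s.*$", "", df$A)   # text before the first white space
right <- sub("^.*\\s", "", df$A)   # text after the last white space
# strrep() repeats "0" a different number of times for each row
df$New_A <- paste0(left, strrep("0", df$Rep), right)
```

Unlike the mapply() version, nothing here loops over rows at the R level; sub(), strrep() and paste0() all work on the whole column at once.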
I have a data frame where the row names are words, and I can get the first column of a given row using something like
>df['rowB',1]
I know I can use paste to combine a variable and a string, with something like
>paste("the value is ", df['rowB',1], ".")
and that gives me the string with the value of the variable. What if rowname is a variable that equals 'rowB'? I tried a first paste to build the expression for the paste above, but the result of the first paste doesn't evaluate to the value; it is just a string:
>rowname<-'rowB'
>type<-paste("df[\'", rowname, "\',1]", sep="")
'df['rowB',1]'
Long story short, I want to pass a value called 'rowname' as a parameter of a function and have it evaluated, so I can then put the corresponding value into a string within that same function.
I'm also open to a wholly different solution. Any and all suggestions are welcome.
Thanks
Not sure what the problem might be; it is not entirely clear from your description. But if rowname is a variable, you don't need anything special, because it evaluates to its value anyway. Let
mat <- matrix(1:10, nrow = 5)
rownames(mat) <- letters[1:5]
mat
##  [,1] [,2]
##a    1    6
##b    2    7
##c    3    8
##d    4    9
##e    5   10
and rowname <- "b", then
rowname
##[1] "b"
so
mat[rowname, 1]
##b
##2
which is the same as mat["b", 1]. It only fails, if you use mat['rowname', 1].
If you want to put this in functions, you can do something like:
getElement <- function(mat, row.name, column.index) {
  mat[row.name, column.index]
}
getElement(mat, "b", 1)
##b
##2
pasteSentence <- function(mat, row.name, col.index) {
  paste("The element of row", row.name, "and column", col.index, "is",
        getElement(mat, row.name, col.index))
}
pasteSentence(mat, "b", 1)
##[1] "The element of row b and column 1 is 2"
which also works with rowname <- "b"
pasteSentence(mat, rowname, 1)
##[1] "The element of row b and column 1 is 2"
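The same thing works for a data frame with word row names, as in the question. A minimal sketch, where the data frame and the function name describe_value are made up for illustration:

```r
# Hypothetical data frame with words as row names
df <- data.frame(value = c(10, 20, 30),
                 row.names = c("rowA", "rowB", "rowC"))

# `rowname` is an ordinary character argument, so it can index df directly;
# no paste/eval round trip is needed
describe_value <- function(df, rowname) {
  sprintf("the value is %s.", df[rowname, 1])
}

rowname <- "rowB"
describe_value(df, rowname)
```

I used sprintf() here instead of paste() only to avoid the extra spaces around the value; paste0() would do just as well.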
This should work:
paste("the value is ", df[get("rowname"), 1], ".")
If you are not familiar with it, get() in R looks up an object by its name given as a string, loosely similar to eval() in Python.
x=c('a', 'c', 'b')
a=2
x[1]
'a'
get(x[1])
2
I'm afraid I don't understand the question; how is your function different from the following?
foo = function(rowname = "Species", d = t(iris)){
paste("I'm selecting", d[rowname, 1])
}
foo()
# [1] "I'm selecting setosa"