Pairing truncated character into a dataframe

I have a truncated character vector (chr [1:10]). Each element is organized as shown below, and the rows don't all have the same length:
[1] "\nA B C D E"
[2] "\n1  3 4 5"
[3] "\nF G H"
[4] "\n6 7 8"
Here's an updated version of my question
line.1 <- c("A B C D E")
line.2 <- c("1  3 4 5")
line <- rbind(line.1, line.2)
line <- data.frame(line)
line
line.1 A B C D E
line.2 1  3 4 5
So, my desired output should be:
       V1 V2 V3 V4 V5
Line.1  A  B  C  D  E
Line.2  1     3  4  5
I can't quite figure out how to split it into different columns so that the extra space in between is counted as one (empty) value.

Here's one way to do it:
# Build the character vector (note the two spaces between 1 and 3)
x <- c("\nA B C D E", "\n1  3 4 5", "\nF G H", "\n6 7 8")
# Remove the newline characters
x <- sub("\n", "", x)
# Select every other element of the character vector as column 1
Col1 <- paste(x[c(TRUE, FALSE)], collapse = ' ')
Col1 <- strsplit(Col1, ' ')[[1]]
# Do the same for column 2
Col2 <- paste(x[c(FALSE, TRUE)], collapse = ' ')
Col2 <- strsplit(Col2, ' ')[[1]]
# Combine them in a data frame
data.frame(Col1, Col2)
#   Col1 Col2
# 1    A    1
# 2    B
# 3    C    3
# 4    D    4
# 5    E    5
# 6    F    6
# 7    G    7
# 8    H    8
strsplit is what splits the values into different columns. Splitting on a single space turns the extra space into an empty string:
> strsplit(line.2, ' ')[[1]]
[1] "1" ""  "3" "4" "5"
So, to combine both lines into a data frame, you can do:
data.frame(rbind(strsplit(line.1, ' ')[[1]], strsplit(line.2, ' ')[[1]]))
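
To illustrate why the split pattern matters here, the following minimal sketch compares splitting on exactly one space with splitting on a run of spaces. It assumes, as in the question, that line.2 contains two spaces between 1 and 3; the names keep_gap and drop_gap are only illustrative.

```r
line.2 <- "1  3 4 5"  # assumption: two spaces between "1" and "3"

# Splitting on exactly one space keeps the gap as an empty string
keep_gap <- strsplit(line.2, " ", fixed = TRUE)[[1]]

# Splitting on one or more spaces drops the gap
drop_gap <- strsplit(line.2, " +")[[1]]

length(keep_gap)  # 5 values, so it lines up with line.1
length(drop_gap)  # 4 values
```

Only the first form gives five values per line, which is what rbind needs to build a rectangular result.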

Related

Find recurring values in one column that correspond to differing values in another column

I have a dataframe with two columns. The first column ("A") contains numbers, the second ("B") letters:
A B
1 a
1 a
1 a
2 b
2 c
3 d
4 e
4 e
5 f
5 g
5 g
5 h
Most numbers are always matched with the same letter (e.g. "1" is always matched with "a"), but some numbers are matched with different letters (e.g. "2" is matched with "b" and "c"). I want to find the numbers that are matched with multiple letters. For the example, the result should be a vector containing "2" and "5".
Sample Data:
example <- read.table(textConnection('
A B
1 a
1 a
1 a
2 b
2 c
3 d
4 e
4 e
5 f
5 g
5 g
5 h
'), header = TRUE, colClasses=c("double", "character"))

Same as @Paul's, without the apply function:
names(which(rowSums(table(example$A, example$B) != 0) > 1))
# [1] "2" "5"

library(tidyverse)
distinct(example) %>%
  group_by(A) %>%
  summarize(count = n()) %>%
  filter(count > 1)
# A tibble: 2 x 2
#       A count
#   <dbl> <int>
# 1     2     2
# 2     5     3

Another possible solution, in base R:
as.numeric(names(which(apply(table(example$A, example$B), 1,
\(x) sum(x == 0) != (length(x)-1)))))
#> [1] 2 5
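
For comparison, here is a minimal base R sketch that counts the distinct letters per number directly instead of going through a contingency table. It rebuilds the question's sample data inline; the name n_letters is only illustrative.

```r
example <- data.frame(
  A = c(1, 1, 1, 2, 2, 3, 4, 4, 5, 5, 5, 5),
  B = c("a", "a", "a", "b", "c", "d", "e", "e", "f", "g", "g", "h"),
  stringsAsFactors = FALSE
)

# Number of distinct letters seen for each value of A
n_letters <- tapply(example$B, example$A, function(b) length(unique(b)))

# Values of A that are paired with more than one letter
result <- names(which(n_letters > 1))
result
```

The result is a character vector; wrap it in as.numeric() if numbers are needed.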

I have a sample dataset which has missing values in it

I have a sample dataset which has missing values in it. I want to create a new column with a message that says which columns have missing values in each row.
Example:
Dataset:
A   B   C   D
1   NA  2   4
NA  4   4   NA
4   1   NA  NA
NA  NA  NA  NA
3   NA  2   3
The possible combinations of missing columns in the above dataset are:
"a", "b", "c", "d", "a, b", "a, c", "a, d", "b, c", "b, d", "c, d", "a, b, c", "a, b, d", "a, c, d", "b, c, d", "a, b, c, d"
Result:
A   B   C   D   Message
1   NA  2   4   Column B is missing
NA  4   4   NA  Column A and D is missing
4   1   NA  NA  Column C and D is missing
NA  NA  NA  NA  All column values are missing
3   NA  2   3   Column B is missing
Any suggestions would be really appreciated.

Here's a way using apply from base R:
set.seed(4)
df <- data.frame(matrix(sample(c(1:5, NA), 15, replace = TRUE), ncol = 3))
names(df) <- LETTERS[1:3]
df$msg <- apply(df, 1, function(x) {
  if (anyNA(x)) {
    paste0(paste0(names(x)[which(is.na(x))], collapse = " "), " missing", collapse = "")
  } else {
    "No missing"
  }
})
df
  A  B  C         msg
1 4  2  5  No missing
2 1  5  2  No missing
3 2 NA  1   B missing
4 2 NA NA B C missing
5 5  1  3  No missing
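
To get messages closer to the wording in the question, the same apply idea can be adapted as in the sketch below. The data frame is my reading of the question's example with NA marking each missing value, and describe_missing is a hypothetical helper name, not part of any package.

```r
# Assumed reconstruction of the question's data; NA marks a missing value
dat <- data.frame(
  A = c(1, NA, 4, NA, 3),
  B = c(NA, 4, 1, NA, NA),
  C = c(2, 4, NA, NA, 2),
  D = c(4, NA, NA, NA, 3)
)

# Hypothetical helper: builds the message for one row
describe_missing <- function(row) {
  missing_cols <- names(row)[is.na(row)]
  if (length(missing_cols) == 0) {
    "No column is missing"
  } else if (length(missing_cols) == length(row)) {
    "All column values are missing"
  } else {
    paste("Column", paste(missing_cols, collapse = " and "), "is missing")
  }
}

# Compute the messages first, then attach them as a new column
msg <- apply(dat, 1, describe_missing)
dat$Message <- msg
dat
```

With three missing columns this produces "Column A and B and C is missing"; adjust the collapse string if a comma-separated list reads better.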

Optimizing matching in R

Hoping someone can help. I have a ton of ortholog mapping to do in R, which is proving to be incredibly time consuming. I've posted an example structure below. Obvious answers such as iterating line by line (for i in 1:nrow(df)) and string splitting, or using sapply have been tried and are incredibly slow. I am therefore hoping for a vectorized option.
# example accession mapping
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
                  target = c("a b", "c", "d e f", "g", "h i"),
                  stringsAsFactors = FALSE)
# example protein list
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"),
                 stringsAsFactors = FALSE)
# now, map df$sourceIDs to map$target
# expected output
> matches
[1] "a b c" "d e f" "g" "" "g h i"
I appreciate any help!

In most cases, the best approach to this kind of problem is to create data.frames with one observation per row.
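# Split both columns of map into lists of individual IDs
map_split <- lapply(map, strsplit, split = ' ')
# Pair every source ID with every target in the same row
long_mappings <- mapply(expand.grid, map_split$source, map_split$target, SIMPLIFY = FALSE)
all_map <- do.call(rbind, long_mappings)
names(all_map) <- c('source', 'target')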
Now all_map looks like this:
source target
1 1 a
2 1 b
3 2 c
4 4 c
5 3 d
6 3 e
7 3 f
8 4 g
9 6 g
10 8 g
11 9 h
12 9 i
Doing the same for df...
sourceIDs_split <- strsplit(df$sourceIDs, ' ')
df_long <- data.frame(
index = rep(seq_along(sourceIDs_split), lengths(sourceIDs_split)),
source = unlist(sourceIDs_split)
)
This gives us df_long:
index source
1 1 1
2 1 2
3 2 3
4 3 4
5 4 5
6 5 8
7 5 9
Now they just need to be merged and collapsed.
matches <- merge(df_long, all_map, by = 'source', all.x = TRUE)
tapply(
  matches$target,
  matches$index,
  function(x) {
    paste0(sort(x), collapse = ' ')
  }
)
#       1       2     3  4       5
# "a b c" "d e f" "c g" "" "g h i"
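
As a rough alternative to the merge step, the long mapping can also be turned into a named lookup list. The sketch below is self-contained and has not been benchmarked; it still visits each row of df through vapply, so treat it as an illustration of the one-pair-per-row idea, not as a guaranteed speed-up. The names long_source, long_target and lookup are illustrative.

```r
map <- data.frame(source = c("1", "2 4", "3", "4 6 8", "9"),
                  target = c("a b", "c", "d e f", "g", "h i"),
                  stringsAsFactors = FALSE)
df <- data.frame(sourceIDs = c("1 2", "3", "4", "5", "8 9"),
                 stringsAsFactors = FALSE)

src <- strsplit(map$source, " ", fixed = TRUE)
tgt <- strsplit(map$target, " ", fixed = TRUE)

# Pair every source ID of a row with every target of that row
long_source <- unlist(Map(function(s, t) rep(s, times = length(t)), src, tgt))
long_target <- unlist(Map(function(s, t) rep(t, each = length(s)), src, tgt))

# Named list: source ID -> all targets it maps to
lookup <- split(long_target, long_source)

ids <- strsplit(df$sourceIDs, " ", fixed = TRUE)
matches <- vapply(ids, function(id) {
  # IDs without a mapping contribute nothing
  hits <- as.character(unlist(lookup[id], use.names = FALSE))
  paste(sort(unique(hits)), collapse = " ")
}, character(1))
matches
```

As in the merge-based answer, source ID 4 yields "c g" because it appears in two rows of map.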

R: reshape data frame when one column has unequal number of entries

I have a data frame x with 2 character columns:
x <- data.frame(a = numeric(), b = I(list()))
x[1:3,"a"] = 1:3
x[[1, "b"]] <- "a, b, c"
x[[2, "b"]] <- "d, e"
x[[3, "b"]] <- "f"
x$a = as.character(x$a)
x$b = as.character(x$b)
x
str(x)
The entries in column b are comma-separated strings of characters.
I need to produce this data frame:
1 a
1 b
1 c
2 d
2 e
3 f
I know how to do it when I loop row by row. But is it possible to do without looping?
Thank you!

Have you checked out the splitstackshape package?
> library(splitstackshape)
> cSplit(x, "b", ",", direction = "long")
a b
1: 1 a
2: 1 b
3: 1 c
4: 2 d
5: 2 e
6: 3 f

> s <- strsplit(as.character(x$b), ',')
> data.frame(value=rep(x$a, sapply(s, FUN=length)),b=unlist(s))
value b
1 1 a
2 1 b
3 1 c
4 2 d
5 2 e
6 3 f
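
One detail worth checking with the strsplit approach: the strings in the question are separated by a comma followed by a space, so splitting on "," alone leaves a leading space on every value after the first. A minimal sketch that trims it, with the data frame rebuilt inline as plain character columns (an assumption about the final shape of x):

```r
x <- data.frame(a = c("1", "2", "3"),
                b = c("a, b, c", "d, e", "f"),
                stringsAsFactors = FALSE)

parts <- strsplit(x$b, ",", fixed = TRUE)

long <- data.frame(
  a = rep(x$a, lengths(parts)),  # repeat each key once per piece
  b = trimws(unlist(parts)),     # drop the space left after each comma
  stringsAsFactors = FALSE
)
long
```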

Here you go; this should be very fast:
library(data.table)
x <- data.table(x)
x[ ,strsplit(b, ","), by = a]

How can I build an inverted index from a data frame in R?

Say I have a data frame in R : data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
x y
1 1 a b c
2 2 b
3 3 a c
4 4 c
Now I want to build a new data frame, an inverted index which is quite common in IR or recommendation systems, from it:
y x
a 1 3
b 1 2
c 1 3 4
How can I do this in an efficient way?

conv <- function(x) {
  l <- function(z) {
    paste(x$x[grep(z, x$y)], collapse=' ')
  }
  lv <- Vectorize(l)
  alphabet <- unique(unlist(strsplit(as.character(x$y), ' '))) # hard-coding this might be preferred for some uses.
  y <- lv(alphabet)
  data.frame(y=names(y), x=y)
}
x <- data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
> conv(x)
## y x
## a a 1 3
## b b 1 2
## c c 1 3 4

An attempt, after converting y to characters:
test <- data.frame(x=1:4,y=c("a b c","b","a c","c"),stringsAsFactors=FALSE)
result <- strsplit(test$y," ")
result2 <- sapply(unique(unlist(result)),function(y) sapply(result,function(x) y %in% x))
result3 <- apply(result2,2,function(x) test$x[which(x)])
final <- data.frame(x=names(result3),y=sapply(result3,paste,collapse=" "))
> final
x y
a a 1 3
b b 1 2
c c 1 3 4

Quick and dirty:
original.df <- data.frame(x=1:4, y=c("a b c", "b", "a c", "c"))
original.df$y <- as.character(original.df$y)
y.split <- strsplit(original.df$y, " ")
y.unlisted <- unique(unlist(y.split))
new.df <-
  sapply(y.unlisted, function(element)
    paste(which(sapply(y.split, function(y.row) element %in% y.row)), collapse=" " ))
as.data.frame(new.df)
> new.df
a 1 3
b 1 2
c 1 3 4
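
Another way to sketch the inverted index in base R is to flatten the terms once and let split() group the row IDs by term. This is only an illustration under the question's assumptions (space-separated terms); the names docs, terms, postings and index are mine.

```r
docs <- data.frame(x = 1:4, y = c("a b c", "b", "a c", "c"),
                   stringsAsFactors = FALSE)

terms <- strsplit(docs$y, " ", fixed = TRUE)

# Repeat each row ID once per term, then group the IDs by term
postings <- split(rep(docs$x, lengths(terms)), unlist(terms))

index <- data.frame(
  y = names(postings),
  x = unname(vapply(postings, paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
index
```

Because terms are compared as whole strings, a term such as "a" does not match inside a longer term such as "ab", which a grep-based lookup would do.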
