Once again I'm struggling with strsplit. I'm transforming some strings to data frames, but there's a forward slash (/) and some white space in my strings that keep bugging me. I could work around it, but I'm eager to learn whether I can use some fancy either/or pattern in strsplit. My working example below should illustrate the issue.
The strsplit function I'm currently using:
str_to_df <- function(string){
t(sapply(1:length(string), function(x) strsplit(string, "\\s+")[[x]])) }
one type of string I got,
string1 <- c('One\t58/2', 'Two 22/3', 'Three\t15/5')
str_to_df(string1)
#> [,1] [,2]
#> [1,] "One" "58/2"
#> [2,] "Two" "22/3"
#> [3,] "Three" "15/5"
another type I got in the same spot,
string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')
str_to_df(string2)
#> [,1] [,2] [,3] [,4]
#> [1,] "One" "58" "/" "2"
#> [2,] "Two" "22" "/" "3"
#> [3,] "Three" "15" "/" "5"
They obviously create different outputs, and I can't figure out how to code a solution that works for both. Below is my desired outcome. Thank you in advance!
desired_outcome <- structure(c("One", "Two", "Three", "58", "22",
"15", "2", "3", "5"), .Dim = c(3L, 3L))
desired_outcome
#> [,1] [,2] [,3]
#> [1,] "One" "58" "2"
#> [2,] "Two" "22" "3"
#> [3,] "Three" "15" "5"
This works:
str_to_df <- function(string){
t(sapply(1:length(string), function(x) strsplit(string, "[/[:space:]]+")[[x]])) }
string1 <- c('One\t58/2', 'Two 22/3', 'Three\t15/5')
string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')
str_to_df(string1)
# [,1] [,2] [,3]
# [1,] "One" "58" "2"
# [2,] "Two" "22" "3"
# [3,] "Three" "15" "5"
str_to_df(string2)
# [,1] [,2] [,3]
# [1,] "One" "58" "2"
# [2,] "Two" "22" "3"
# [3,] "Three" "15" "5"
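Since the question asks about an "either/or" pattern, a regex alternation (`|`) is one way to spell that out explicitly. The following is only a sketch: the helper name `split_fields` is made up for illustration, and `perl = TRUE` is an assumption chosen so that the "slash with optional surrounding whitespace" alternative is tried before the plain whitespace one.

```r
# Hypothetical helper: split on EITHER a slash with optional whitespace
# around it, OR a run of whitespace (spaces or tabs).
split_fields <- function(x) {
  do.call(rbind, strsplit(x, "\\s*/\\s*|\\s+", perl = TRUE))
}

string1 <- c('One\t58/2', 'Two 22/3', 'Three\t15/5')
string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')

# Both input styles should end up as the same 3 x 3 character matrix
identical(split_fields(string1), split_fields(string2))
```

The character class `[/[:space:]]+` used above is shorter; the alternation is mainly useful when the two kinds of delimiter need to be treated differently.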
Another approach with tidyr could be:
library(dplyr) # provides %>% and as_tibble()
library(tidyr) # provides separate()
string1 %>%
as_tibble() %>%
separate(value, into = c("Col1", "Col2", "Col3"), sep = "[/[:space:]]+")
# A tibble: 3 x 3
# Col1 Col2 Col3
# <chr> <chr> <chr>
# 1 One 58 2
# 2 Two 22 3
# 3 Three 15 5
We can create a function to split at one or more spaces, tabs, or forward slashes
f1 <- function(str1) do.call(rbind, strsplit(str1, "[/\t ]+"))
f1(string1)
# [,1] [,2] [,3]
#[1,] "One" "58" "2"
#[2,] "Two" "22" "3"
#[3,] "Three" "15" "5"
f1(string2)
# [,1] [,2] [,3]
#[1,] "One" "58" "2"
#[2,] "Two" "22" "3"
#[3,] "Three" "15" "5"
Or we can use read.csv after replacing the spaces, tabs, and slashes with a common delimiter
read.csv(text=gsub("[\t/ ]+", ",", string1), header = FALSE)
# V1 V2 V3
#1 One 58 2
#2 Two 22 3
#3 Three 15 5
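Because the stated goal is a data frame rather than a character matrix, the split result can also be wrapped and typed by hand in base R. This is a minimal sketch, not the only way to do it; the column names `label`, `numerator` and `denominator` are invented here purely for illustration.

```r
string2 <- c('One 58 / 2', 'Two 22 / 3', 'Three 15 / 5')

# Split into a character matrix, one row per input string
parts <- do.call(rbind, strsplit(string2, "[/[:space:]]+"))

# Wrap as a data frame; the column names are hypothetical
df <- as.data.frame(parts, stringsAsFactors = FALSE)
names(df) <- c("label", "numerator", "denominator")

# Convert the two numeric-looking columns from character to integer
df[-1] <- lapply(df[-1], as.integer)
str(df)
```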
I am currently trying to create a new matrix by looping over the old one. In the new matrix I want to replace certain values with the character string "recoding". Both matrices should have 10 columns and 100 rows.
In the current case, a value should be replaced if it matches one of the values in vector_A.
e.g:
for (i in 1:10) {
new_matrix[,i] <- old_matrix[,i]
output_t_or_f <- is.element(new_matrix[,i],unlist(vector_A))
if (any(output_t_or_f, na.rm = FALSE)) {
replace(new_matrix, list = new_matrix[,i], values = "recode")
}
}
so output_t_or_f should either take on the value TRUE or FALSE, depending on whether i is in vector_A
and if output_t_or_f is TRUE then the old value should be replaced with the character "recode"
Currently the new_matrix looks just like the old_matrix so I guess there is a problem with the if statement?
Unfortunately, I can't really share my Data but I put some example data together:
if old_matrix looks like this:
> old_matrix
[,1] [,2] [,3] [,4] [,5]
[1,] 1 6 11 16 21
[2,] 2 7 12 17 22
[3,] 3 8 13 18 23
[4,] 4 9 14 19 24
[5,] 5 10 15 20 25
and vector_A looks like this:
> vector_A
[1] 12 27 30 42 37 9
then the new matrix should look like this:
new_matrix
[,1] [,2] [,3] [,4] [,5]
[1,] "1" "6" "11" "16" "21"
[2,] "2" "7" "recoding" "17" "22"
[3,] "3" "8" "13" "18" "23"
[4,] "4" "recoding" "14" "19" "24"
[5,] "5" "10" "15" "20" "25"
I am very new to R and can't seem to find the problem. Would appreciate any help!!
Thanks :-)
Since the replacements are the same in every column you shouldn't need a loop. Try this:
new_matrix <- old_matrix
new_matrix[new_matrix %in% vector_A] <- "recode"
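As for why the loop in the question leaves new_matrix unchanged: replace() returns a modified copy and does not change its argument in place, so its result has to be assigned; its list argument is also meant to be an index (positions or a logical vector), not the values themselves. A small sketch using the example data from the question (the name `via_replace` is made up for illustration):

```r
old_matrix <- matrix(1:25, nrow = 5)
vector_A <- c(12, 27, 30, 42, 37, 9)

# Vectorised version: %in% gives one TRUE/FALSE per matrix cell
new_matrix <- old_matrix
new_matrix[new_matrix %in% vector_A] <- "recoding"

# Equivalent with replace(): the result must be assigned to be kept
via_replace <- replace(old_matrix, old_matrix %in% vector_A, "recoding")

identical(new_matrix, via_replace)
new_matrix[2, 3]  # the cell that held 12
new_matrix[4, 2]  # the cell that held 9
```

Note that assigning a string turns the whole matrix into a character matrix, which is why the other cells print with quotes.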
First of all I've never coded anything in my life, and I'm just learning R this week.
I'm not sure if the title is any clear, but I guess showing my problem is easier:
Let's say I have this Matrix (m):
[,1] [,2] [,3] [,4]
[1,] A 1 2 3
[2,] B 1 4
[3,] C 3
Basically that A contains 1, 2 and 3, B contains 1 and 4 and so on.
How would I show that in a matrix with 2 columns only?
[,1] [,2]
[1,] A 1
[2,] A 2
[3,] A 3
[4,] B 1
[5,] B 4
[6,] C 3
Thanks a lot!
Assuming that the blanks shown are NA, get the count of non-NA elements per row with rowSums, then cbind the first column, replicated according to 'n', with the remaining columns flattened row by row (via the transpose) after omitting the NAs
n <- rowSums(!is.na(m1[,-1]))
cbind(rep(m1[,1], n), na.omit(c(t(m1[,-1]))))
# [,1] [,2]
#[1,] "A" "1"
#[2,] "A" "2"
#[3,] "A" "3"
#[4,] "B" "1"
#[5,] "B" "4"
#[6,] "C" "3"
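Or a slightly more compact option is to replicate the first column with the col index of the transposed remaining columns, cbind with those transposed values, and finally remove the NA rows with na.omit
# col() is taken on the transpose so the labels also line up when m1[,-1] is not square
na.omit(cbind(m1[,1][col(t(m1[,-1]))], c(t(m1[,-1]))))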
# [,1] [,2]
#[1,] "A" "1"
#[2,] "A" "2"
#[3,] "A" "3"
#[4,] "B" "1"
#[5,] "B" "4"
#[6,] "C" "3"
NOTE: matrix cannot have multiple column types. So, if there is a character class, all the elements are converted to character
data
m1 <- structure(c("A", "B", "C", "1", "1", "3", "2", "4", NA, "3",
NA, NA), .Dim = 3:4)
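To make the first approach easier to follow, here is the same idea broken into named intermediate steps. It is only a sketch on the example data; the variable names (`vals`, `labels`, `flat`, `values`, `long`) are chosen here for illustration.

```r
m1 <- matrix(c("A", "B", "C", "1", "1", "3", "2", "4", NA, "3", NA, NA),
             nrow = 3)

vals <- m1[, -1, drop = FALSE]       # everything except the label column
n <- rowSums(!is.na(vals))           # number of non-NA values per row
labels <- rep(m1[, 1], times = n)    # repeat each label once per value
flat <- c(t(vals))                   # flatten row by row
values <- flat[!is.na(flat)]         # drop the NA padding
long <- cbind(labels, values)
long
```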
I have a list like L (comes from a vector splitting).
L <- strsplit(c("1 5 9", "", "3 7 11", ""), " ")
# [[1]]
# [1] "1" "5" "9"
#
# [[2]]
# character(0)
#
# [[3]]
# [1] "3" "7" "11"
#
# [[4]]
# character(0)
When I do an ordinary rbind as follows, I'm losing all the character(0) rows.
do.call(rbind, L)
# [,1] [,2] [,3]
# [1,] "1" "5" "9"
# [2,] "3" "7" "11"
Do I always have to do a lapply like the following or have I missed something?
do.call(rbind, lapply(L, function(x)
if (length(x) == 0) rep("", 3) else x))
# [,1] [,2] [,3]
# [1,] "1" "5" "9"
# [2,] "" "" ""
# [3,] "3" "7" "11"
# [4,] "" "" ""
Base R answers are preferred.
If you use lapply you don't have to worry about the length, so you can skip the rep part: rbind automatically recycles the single "" across the columns.
do.call(rbind, lapply(L, function(x) if (length(x) == 0) "" else x))
# [,1] [,2] [,3]
#[1,] "1" "5" "9"
#[2,] "" "" ""
#[3,] "3" "7" "11"
#[4,] "" "" ""
Another option, using the same logic as #NelsonGon: we can replace the empty list elements with a blank string and then rbind.
L[lengths(L) == 0] <- ""
do.call(rbind, L)
# [,1] [,2] [,3]
#[1,] "1" "5" "9"
#[2,] "" "" ""
#[3,] "3" "7" "11"
#[4,] "" "" ""
Maybe this roundabout using data.table suits you:
L <- data.table::tstrsplit(c("1 5 9", "", "3 7 11", ""), " ", fill="")
t(do.call(rbind,L))
With plyr, then proceed with replacing the NAs. Since OP asked for base R, see below.
plyr::ldply(L,rbind)
1 2 3
1 1 5 9
2 <NA> <NA> <NA>
3 3 7 11
4 <NA> <NA> <NA>
A less efficient base R way:
L <- strsplit(c("1 5 9", "", "3 7 11", ""), " ")
L[lapply(L,length)==0]<-"Miss"
res<-Reduce(rbind,L)
res[res=="Miss"]<-""
Result:
[,1] [,2] [,3]
init "1" "5" "9"
"" "" ""
"3" "7" "11"
"" "" ""
That is the defined behavior for scenarios like that. As written in ?rbind:
For cbind (rbind), vectors of zero length (including NULL) are ignored
unless the result would have zero rows (columns), for S compatibility.
(Zero-extent matrices do not occur in S3 and are not ignored in R.)
When you inspect your elements, you see that it is true:
length(L[[1]])
[1] 3
length(L[[2]])
[1] 0
However, as you see, multiple workarounds are possible.
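One such workaround, sketched in base R, is to pad every element to the longest length before binding, so nothing has length zero when rbind sees it. The helper name `pad` and its `fill` argument are hypothetical; unlike relying on recycling, this also handles elements that are shorter than the others but not empty.

```r
L <- strsplit(c("1 5 9", "", "3 7 11", ""), " ")

width <- max(lengths(L))  # length of the longest element

# Hypothetical helper: append `fill` until x has length n
pad <- function(x, n, fill = "") c(x, rep(fill, n - length(x)))

res <- do.call(rbind, lapply(L, pad, n = width))
res
```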
We can use stri_list2matrix in a simple way
library(stringi)
stri_list2matrix(L, byrow = TRUE, fill = "")
# [,1] [,2] [,3]
#[1,] "1" "5" "9"
#[2,] "" "" ""
#[3,] "3" "7" "11"
#[4,] "" "" ""
a is a matrix:
a <- matrix(1:9,3)
> a
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
I want to replace every 1 with good, every 4 with medium, and every 9 with bad.
I use the following code:
a[a==1] <- "good"
a[a==4] <- "medium"
a[a==9] <- "bad"
> a
[,1] [,2] [,3]
[1,] "good" "medium" "7"
[2,] "2" "5" "8"
[3,] "3" "6" "bad"
It works, but is this the simplest way to do it? Can I combine these lines into one command?
Using cut():
matrix(cut(a, breaks = c(0:9),
labels = c("good", 2:3, "medium", 5:8, "bad")), 3)
But not really happy with manual labels bit.
Maybe using match(), more flexible:
res <- matrix(c("good", "medium", "bad")[match(a, c(1, 4, 9))], 3)
res <- ifelse(is.na(res), a, res)
car::recode() does nicely here, returning the same matrix structure as was given as input.
car::recode(a, "1='good';4='medium';9='bad'")
# [,1] [,2] [,3]
# [1,] "good" "medium" "7"
# [2,] "2" "5" "8"
# [3,] "3" "6" "bad"
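A base R variant of the same lookup idea keeps the old-to-new mapping in one named vector, so adding another recoding means adding one entry. This is a sketch under the assumption that the values can be compared as strings; the names `lookup`, `res` and `hit` are invented for illustration.

```r
a <- matrix(1:9, 3)

# Named vector: names are the old values (as strings), elements the new labels
lookup <- c("1" = "good", "4" = "medium", "9" = "bad")

res <- a
storage.mode(res) <- "character"   # keeps the dimensions, converts the cells
hit <- res %in% names(lookup)      # cells that have a replacement
res[hit] <- unname(lookup[res[hit]])
res
```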
I have a data frame df that needs subsetting into chunks of 2 names. In the example below, there are 4 unique names: a,b,c,d. I need to subset it into two one-column matrices: a,b and c,d.
Output format:
name1
item_value
item_value
...
END
name2
item_value
item_value
...
END
Example:
#dummy data
df <- data.frame(name=sort(c(rep(letters[1:4],2),"a","a","c")),
item=round(runif(11,1,10)),
stringsAsFactors=FALSE)
#tried approach - split per name. I need to split per 2 names.
lapply(split(df,f=df$name),
function(x)
{name <- unique(x$name)
as.matrix(c(name,x[,2],"END"))
})
#expected output
[,1]
[1,] "a"
[2,] "8"
[3,] "9"
[4,] "6"
[5,] "4"
[6,] "END"
[1,] "b"
[2,] "2"
[3,] "10"
[4,] "END"
[,2]
[1,] "c"
[2,] "6"
[3,] "6"
[4,] "2"
[5,] "END"
[1,] "d"
[2,] "4"
[3,] "1"
[4,] "END"
Note: Actual df has ~300000 rows with ~35000 unique names.
You may try this.
# for each 'name', "pad" 'item' with 'name' and 'END'
l1 <- lapply(split(df, f = df$name), function(x){
name <- unique(x$name)
as.matrix(c(name, x$item, "END"))
})
# create a sequence of offsets (0, 2, 4, ...) to select list elements two by two
steps <- seq(from = 0, to = length(l1) - 1, by = 2)
# loop over 'steps' to bind together list elements, two by two.
l2 <- lapply(steps, function(x){
do.call(rbind, l1[1:2 + x])
})
l2
# [[1]]
# [,1]
# [1,] "a"
# [2,] "6"
# [3,] "4"
# [4,] "10"
# [5,] "3"
# [6,] "END"
# [7,] "b"
# [8,] "6"
# [9,] "7"
# [10,] "END"
#
# [[2]]
# [,1]
# [1,] "c"
# [2,] "2"
# [3,] "6"
# [4,] "10"
# [5,] "END"
# [6,] "d"
# [7,] "5"
# [8,] "4"
# [9,] "END"
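The pairing step can also be written with a grouping index instead of offsets, which makes the chunk size a parameter. The following is a sketch only: `chunk_size`, `items`, `blocks`, `chunk_id` and `chunks` are names made up here, and set.seed is added just so the dummy data is reproducible.

```r
set.seed(42)  # only so the dummy data is reproducible
df <- data.frame(name = sort(c(rep(letters[1:4], 2), "a", "a", "c")),
                 item = round(runif(11, 1, 10)),
                 stringsAsFactors = FALSE)

chunk_size <- 2  # number of names per output matrix

# One block per name: the name, its items, then "END"
items <- split(df$item, df$name)
blocks <- Map(function(nm, v) c(nm, v, "END"), names(items), items)

# Assign consecutive blocks to chunks: 1, 1, 2, 2, ...
chunk_id <- ceiling(seq_along(blocks) / chunk_size)

# Stack the blocks of each chunk into a one-column matrix
chunks <- lapply(split(blocks, chunk_id), function(b) {
  matrix(unlist(b, use.names = FALSE), ncol = 1)
})
chunks
```

With an odd number of names the last chunk simply contains a single block.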
Instead of making the lists from individual names, build them from the item column of subsets of the data.frame:
res <- list("a_b" = c(df[df$name == "a",2],"END",df[df$name == "b", 2],"END"),
"c_d" = c(df[df$name == "c",2],"END", df[df$name == "d", 2],"END"))
res2 <- vector(mode="list",length=2)
res2 <- sapply(1:(length(unique(df$name))/2),function(x) {
sapply(seq(1,length(unique(df$name))-1,by=2), function(y) {
name <- unique(df$name)
res2[x] <- as.matrix(c(name[y],df[df$name == name[y],2],"END",name[y+1],df[df$name == name[y+1],2],"END"))
})
})
answer <- res2[,1]
This gives me a matrix of lists, since there are two sapplys happening. I think everything you want is in res2[,1].