How can I plot (for example with a barplot) the top 5 values in a matrix with row names and column names?
rownames = c("row1", "row2", "row3", "row4")
colnames = c("col1", "col2", "col3")
P <- matrix(3:14, nrow = 4, byrow = TRUE, dimnames = list(rownames, colnames))
# col1 col2 col3
#row1 3 4 5
#row2 6 7 8
#row3 9 10 11
#row4 12 13 14
List top k values (with their positions) of a matrix P:
k <- 5
ij <- arrayInd(order(P, decreasing = TRUE)[1:k], dim(P))
top.k <- data.frame(x = P[ij], i = ij[, 1], j = ij[, 2])
# x i j
#1 14 4 3
#2 13 4 2
#3 12 4 1
#4 11 3 3
#5 10 3 2
If P has row/column names, we can add them to top.k:
top.k[c("ni", "nj")] <- Map(`[`, dimnames(P), top.k[c("i", "j")])
# x i j ni nj
#1 14 4 3 row4 col3
#2 13 4 2 row4 col2
#3 12 4 1 row4 col1
#4 11 3 3 row3 col3
#5 10 3 2 row3 col2
You can quickly produce a bar-chart using:
with( top.k, barplot(x, names.arg = paste(ni, nj, sep = ",")) )
but I don't know if you like its style. Again, plotting is a subjective matter.
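The steps above can be collected into one function. This is only a minimal sketch, not part of the original answer: the name top_k is hypothetical, and it assumes P is a numeric matrix. Names are attached only when P actually has them.

```r
# Hypothetical helper: top k values of a matrix with positions and names
top_k <- function(P, k) {
  k <- min(k, length(P))                       # never ask for more cells than exist
  ij <- arrayInd(order(P, decreasing = TRUE)[seq_len(k)], dim(P))
  out <- data.frame(x = P[ij], i = ij[, 1], j = ij[, 2])
  if (!is.null(rownames(P))) out$ni <- rownames(P)[out$i]
  if (!is.null(colnames(P))) out$nj <- colnames(P)[out$j]
  out
}

P <- matrix(3:14, nrow = 4, byrow = TRUE,
            dimnames = list(paste0("row", 1:4), paste0("col", 1:3)))
res <- top_k(P, 5)
res
```

The result has the same columns as top.k, so it can be passed to the same barplot call.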
I want to replace multiple letters/words with a single letter/word, multiple times in a dataframe. As an example,
Some data:
df = data.frame(
a = 1:8,
b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
"soundi i", "soundr I", "sounde in", "soundw In"))
df
a b
1 1 colour1 o
2 2 colour2 O
3 3 colour3 out
4 4 colour4 Out
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Here is what I want to replace with:
df_repl <- list(
O = c("o", "out", "Out"),
In = c("i", "in", "I"))
So in df$b o, out and Out should become O and i, in and I become In, but only if they are separated from any other words by a space, so o in colour is not capitalised.
This gets me half way there, but I think I need another nested for-loop to move through df_repl...
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
Above o, out and Out become O but i, in and I are not altered, here is the desired output:
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
In the real data there are many more than two replacement words/letters so I can't just rerun the for-loop again. I'm not tied to a for-loop solution, but preferably using base R, any suggestions much appreciated.
EDIT
Trying to clarify my question:
Whenever one of o, out or Out occur in df$b I want to replace it with O
Whenever one of i, in or I occur in df$b I want to replace it with In
I can achieve the desired output like this:
for (word in df_repl[[1]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[1]))
df$b <- gsub(patt, repl, df$b)
}
for (word in df_repl[[2]]){
patt <- paste0('\\b', word, '\\b')
repl <- paste(names(df_repl[2]))
df$b <- gsub(patt, repl, df$b)
}
But in my real dataset df_repl is length 50 rather than two, so I don't want to copy/paste/edit/rerun the for-loop 50 times.
You can skip the inner loop over the words in df_repl by pasting each group together with | (or) and looping only over the names:
for(i in names(df_repl)) {
# gsub replaces every match in a string; sub would stop at the first one
df$b <- gsub(paste(paste0("\\b",df_repl[[i]],"\\b"), collapse = "|")
, i, df$b)
}
df
# a b
#1 1 colour1 O
#2 2 colour2 O
#3 3 colour3 O
#4 4 colour4 O
#5 5 soundi In
#6 6 soundr In
#7 7 sounde In
#8 8 soundw In
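If the replacement is needed in more than one place, the loop can be wrapped in a function. The following is a sketch under assumptions: replace_words is a hypothetical name, and the words in the list are assumed to contain no regex metacharacters (they would need escaping otherwise).

```r
# Hypothetical wrapper: one alternation pattern per replacement word
replace_words <- function(x, repl) {
  for (new in names(repl)) {
    # e.g. "\\b(o|out|Out)\\b" matches any of the words as a whole word
    patt <- paste0("\\b(", paste(repl[[new]], collapse = "|"), ")\\b")
    x <- gsub(patt, new, x)
  }
  x
}

df_repl <- list(O = c("o", "out", "Out"), In = c("i", "in", "I"))
out <- replace_words(c("colour1 o", "sounde in", "o and in"), df_repl)
out
# [1] "colour1 O" "sounde In" "O and In"
```

Because the groups are applied one after another, a replacement word that also appears in a later group (for example if "O" were listed under In) would be replaced again, so the order of df_repl can matter.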
You may try using three separate calls to sub:
df$b <- sub("\\bo\\b", "i", df$b)
df$b <- sub("\\bout\\b", "in", df$b)
df$b <- sub("\\bOut\\b", "I", df$b)
df
a b
1 1 colour1 i
2 2 colour2 O
3 3 colour3 in
4 4 colour4 I
5 5 soundi i
6 6 soundr I
7 7 sounde in
8 8 soundw In
To automate this, you could try using sapply with an index:
terms_in <- c("o", "out", "Out")
pat <- paste0("\\b", terms_in, "\\b")
replace <- c("i", "in", "I")
sapply(seq_along(pat), function(x) {
df$b <<- sub(pat[x], replace[x], df$b)
})
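The global assignment with <<- can be avoided by folding over the patterns with Reduce. This is a sketch of that alternative, not the answer's own code; b stands in for df$b and holds only three of the example strings.

```r
b <- c("colour1 o", "colour3 out", "colour4 Out")
pat <- paste0("\\b", c("o", "out", "Out"), "\\b")
replace <- c("i", "in", "I")

# Start from b and apply one sub() per pattern, feeding each result into the next
b_new <- Reduce(function(x, k) sub(pat[k], replace[k], x), seq_along(pat), b)
b_new
# [1] "colour1 i"  "colour3 in" "colour4 I"
```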
This is another solution:
library(stringr)
in1 <- str_split(df$b, " ", simplify = TRUE)[,1]
in2 <- str_split(df$b, " ", simplify = TRUE)[,2]
in2[in2 %in% c("o", "out", "Out")] <- "O"
in2[in2 %in% c("i", "in", "I")] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
If you have a long list of words in your data, you could also move c(word list) outside:
in1<- str_split(df$b, " ", simplify = TRUE)[,1]
in2<- str_split(df$b, " ", simplify = TRUE)[,2]
o <- c("o", "out", "Out")
i <- c("i", "in", "I")
in2[in2 %in% o] <- "O"
in2[in2 %in% i] <- "In"
df$b <- paste(in1, in2, sep=" ")
df
> df
a b
1 1 colour1 O
2 2 colour2 O
3 3 colour3 O
4 4 colour4 O
5 5 soundi In
6 6 soundr In
7 7 sounde In
8 8 soundw In
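The same split-and-match idea can be written in base R and driven directly by df_repl, which may scale better to 50 groups. A minimal sketch, assuming words are separated by single spaces; lookup and recode_tokens are hypothetical names.

```r
df <- data.frame(
  a = 1:8,
  b = c("colour1 o", "colour2 O", "colour3 out", "colour4 Out",
        "soundi i", "soundr I", "sounde in", "soundw In"),
  stringsAsFactors = FALSE)
df_repl <- list(O = c("o", "out", "Out"), In = c("i", "in", "I"))

# Named vector: names are the old words, values are their replacements
lookup <- setNames(rep(names(df_repl), lengths(df_repl)),
                   unlist(df_repl, use.names = FALSE))

recode_tokens <- function(s) {
  tok <- strsplit(s, " ", fixed = TRUE)[[1]]
  hit <- tok %in% names(lookup)      # only whole tokens are matched
  tok[hit] <- lookup[tok[hit]]
  paste(tok, collapse = " ")
}

df$b <- vapply(df$b, recode_tokens, character(1), USE.NAMES = FALSE)
df
```

Since each token is looked up exactly once, a replacement can never be replaced again by a later group.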
I have a dataframe, say
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
I want to remove only those rows in which one or more ts sit directly between a d and a c; all other rows should be kept. In this example, I would like to remove the ts on rows 8, 18 and 19, but keep the others. I have thousands of cases, so doing this manually would be a true horror. Any help is very much appreciated.
One option would be to use rle to get runs of the same string and then you can use an sapply to check forward/backward and return all the positions you want to drop:
rle_vals <- rle(as.character(df$x))
drop <- unlist(sapply(seq_along(rle_vals$values), # loop over runs
function(i, vals, lengths) {
# a run of "t" with a "d" run before it and a "c" run after it;
# the bounds check keeps the first and last runs from indexing outside vals
if (i > 1 && i < length(vals) && vals[i] == "t" && vals[i-1] == "d" && vals[i+1] == "c") {
(sum(lengths[1:(i-1)]) + 1):sum(lengths[1:i]) # row numbers covered by the run
}
}, vals = rle_vals$values, lengths = rle_vals$lengths))
drop
#[1] 8 18 19
df[-drop,]
# x y
#1 a 2
#2 a 4
#3 b 5
#4 b 2
#5 b 6
#6 c 2
#7 d 4
#9 c 2
#10 b 6
#11 t 2
#12 c 4
#13 t 5
#14 a 2
#15 a 6
#16 b 2
#17 d 4
#20 c 6
This also works, by collapsing to a string, identifying groups of t's between d and c (or c and d - not sure whether you wanted this option as well), then working out where they are and removing the rows as appropriate.
df = data.frame(x=c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y=c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6),stringsAsFactors = FALSE)
dfs <- paste0(df$x,collapse="") #collapse to a string
dfs2 <- do.call(rbind,lapply(list(gregexpr("dt+c",dfs),gregexpr("ct+d",dfs)),
function(L) data.frame(x=L[[1]],y=attr(L[[1]],"match.length"))))
dfs2 <- dfs2[dfs2$x>0,] #remove any -1 values (if string not found)
drop <- unlist(mapply(function(a,b) (a+1):(a+b-2),dfs2$x,dfs2$y))
df2 <- df[-drop,]
Here is another solution with base R:
df = data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
#
s <- paste0(df$x, collapse="")
L <- c(NA, NA)
while (TRUE) {
r <- regexec("dt+c", s)[[1]]
if (r[1]==-1) break
L <- rbind(L, c(pos=r[1]+1, length=attr(r, "match.length")-2))
s <- sub("d(t+)c", "x\\1x", s)
}
L <- L[-1, , drop = FALSE] # remove the NA seed row; stay a matrix even with a single match
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
# > drop
# 8 18 19
# > df[-drop, ]
# x y
# 1 a 2
# 2 a 4
# 3 b 5
# 4 b 2
# 5 b 6
# 6 c 2
# 7 d 4
# 9 c 2
# 10 b 6
# 11 t 2
# 12 c 4
# 13 t 5
# 14 a 2
# 15 a 6
# 16 b 2
# 17 d 4
# 20 c 6
With gregexpr() it is shorter:
s <- paste0(df$x, collapse="")
g <- gregexpr("dt+c", s)[[1]]
L <- data.frame(pos=g+1, length=attr(g, "match.length")-2)
drop <- unlist(apply(L,1, function(x) seq(from=x[1], len=x[2])))
df[-drop, ]
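Two limitations of the approaches above are worth noting: collapsing to a string assumes every value in x is a single character, and df[-drop, ] returns zero rows when drop is empty. The following sketch avoids both by returning a logical vector from rle. The function name drop_between and its argument names are hypothetical.

```r
# Hypothetical helper: TRUE for rows in a run of `target` that is
# directly preceded by a run of `before` and followed by a run of `after`
drop_between <- function(x, before = "d", target = "t", after = "c") {
  r <- rle(as.character(x))
  n <- length(r$values)
  prev <- c(NA, r$values[-n])   # value of the previous run
  nxt  <- c(r$values[-1], NA)   # value of the next run
  hit <- r$values == target & prev %in% before & nxt %in% after
  rep(hit, r$lengths)           # expand from runs back to rows
}

df <- data.frame(x = c("a","a","b","b","b","c","d","t","c","b","t","c","t","a","a","b","d","t","t","c"),
                 y = c(2,4,5,2,6,2,4,5,2,6,2,4,5,2,6,2,4,5,2,6))
to_drop <- drop_between(df$x)
which(to_drop)
# [1]  8 18 19
df2 <- df[!to_drop, ]
```

Logical indexing with !to_drop keeps all rows when nothing matches.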
Given a list variable, I'd like to have a data frame of the positions of each element. For a simple non-nested list, it seems quite straightforward.
For example, here's a list of character vectors.
l <- replicate(
10,
sample(letters, rpois(1, 2), replace = TRUE),
simplify = FALSE
)
l looks like this:
[[1]]
[1] "m"
[[2]]
[1] "o" "r"
[[3]]
[1] "g" "m"
# etc.
To get the data frame of positions, I can use:
d <- data.frame(
value = unlist(l),
i = rep(seq_len(length(l)), lengths(l)),
j = rapply(l, seq_along, how = "unlist"),
stringsAsFactors = FALSE
)
head(d)
## value i j
## 1 m 1 1
## 2 o 2 1
## 3 r 2 2
## 4 g 3 1
## 5 m 3 2
## 6 w 4 1
Given a trickier nested list, for example:
l2 <- list(
"a",
list("b", list("c", c("d", "a", "e"))),
character(),
c("e", "b"),
list("e"),
list(list(list("f")))
)
this doesn't easily generalize.
The output I expect for this example is:
data.frame(
value = c("a", "b", "c", "d", "a", "e", "e", "b", "e", "f"),
i1 = c(1, 2, 2, 2, 2, 2, 4, 4, 5, 6),
i2 = c(1, 1, 2, 2, 2, 2, 1, 2, 1, 1),
i3 = c(NA, 1, 1, 2, 2, 2, NA, NA, 1, 1),
i4 = c(NA, NA, 1, 1, 2, 3, NA, NA, NA, 1),
i5 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 1)
)
How do I get a data frame of positions for a nested list?
Here's an approach that yields a slightly different output than you showed, but it'll be useful further down the road.
f <- function(l) {
names(l) <- seq_along(l)
lapply(l, function(x) {
x <- setNames(x, seq_along(x))
if(is.list(x)) f(x) else x
})
}
Function f simply iterates (recursively) through all levels of the given list and names its elements 1,2,...,n where n is the length of the (sub)list. Then, we can make use of the fact that unlist has a use.names argument that is TRUE by default and has an effect when used on a named list (that's why we have to use f to name the list first).
For the nested list l2 it returns:
unlist(f(l2))
# 1.1 2.1.1 2.2.1.1 2.2.2.1 2.2.2.2 2.2.2.3 4.1 4.2 5.1.1 6.1.1.1.1
# "a" "b" "c" "d" "a" "e" "e" "b" "e" "f"
Now, in order to return a data.frame as asked for in the question, I'd do this:
g <- function(l) {
vec <- unlist(f(l))
n <- max(lengths(strsplit(names(vec), ".", fixed=TRUE)))
require(tidyr)
data.frame(
value = unname(vec),
i = names(vec)
) %>%
separate(i, paste0("i", 1:n), sep = "\\.", fill = "right", convert = TRUE)
}
And apply it like this:
g(l2)
# value i1 i2 i3 i4 i5
#1 a 1 1 NA NA NA
#2 b 2 1 1 NA NA
#3 c 2 2 1 1 NA
#4 d 2 2 2 1 NA
#5 a 2 2 2 2 NA
#6 e 2 2 2 3 NA
#7 e 4 1 NA NA NA
#8 b 4 2 NA NA NA
#9 e 5 1 1 NA NA
#10 f 6 1 1 1 1
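If adding a dependency is not wanted, the splitting step can also be done in base R. This is a sketch of an alternative, not the answer's own code: g_base is a hypothetical name, f is repeated from above so the block runs on its own, and it assumes the list contains at least one value.

```r
f <- function(l) {
  names(l) <- seq_along(l)
  lapply(l, function(x) {
    x <- setNames(x, seq_along(x))
    if (is.list(x)) f(x) else x
  })
}

g_base <- function(l) {
  vec <- unlist(f(l))
  idx <- strsplit(names(vec), ".", fixed = TRUE)
  n <- max(lengths(idx))
  # Indexing past the end pads each position vector with NA up to length n
  m <- matrix(unlist(lapply(idx, function(p) as.integer(p)[seq_len(n)])),
              ncol = n, byrow = TRUE,
              dimnames = list(NULL, paste0("i", seq_len(n))))
  data.frame(value = unname(vec), m, stringsAsFactors = FALSE)
}

l2 <- list(
  "a",
  list("b", list("c", c("d", "a", "e"))),
  character(),
  c("e", "b"),
  list("e"),
  list(list(list("f")))
)
res <- g_base(l2)
res
```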
An improved version of g, contributed by @AnandaMahto (thanks!), would use data.table:
g <- function(inlist) {
require(data.table)
temp <- unlist(f(inlist))
setDT(tstrsplit(names(temp), ".", fixed = TRUE))[, value := unname(temp)][]
}
Edit (credits go to @TylerRinkler - thanks!)
This has the benefit of being easily converted to a data.tree object, which can then be converted to many other data types. With a slight modification to g:
g <- function(l) {
vec <- unlist(f(l))
n <- max(lengths(strsplit(names(vec), ".", fixed=TRUE)))
require(tidyr)
data.frame(
i = names(vec),
value = unname(vec)
) %>%
separate(i, paste0("i", 1:n), sep = "\\.", fill = "right", convert = TRUE)
}
library(data.tree)
x <- data.frame(top=".", g(l2))
x$pathString <- apply(x, 1, function(x) paste(trimws(na.omit(x)), collapse="/"))
mytree <- data.tree::as.Node(x)
mytree
# levelName
#1 .
#2 ¦--1
#3 ¦ °--1
#4 ¦ °--a
#5 ¦--2
#6 ¦ ¦--1
#7 ¦ ¦ °--1
#8 ¦ ¦ °--b
#9 ¦ °--2
#10 ¦ ¦--1
#11 ¦ ¦ °--1
#12 ¦ ¦ °--c
#13 ¦ °--2
#14 ¦ ¦--1
#15 ¦ ¦ °--d
#16 ¦ ¦--2
#17 ¦ ¦ °--a
#18 ¦ °--3
#19 ¦ °--e
#20 ¦--4
#21 ¦ ¦--1
#22 ¦ ¦ °--e
#23 ¦ °--2
#24 ¦ °--b
#25 ¦--5
#26 ¦ °--1
#27 ¦ °--1
#28 ¦ °--e
#29 °--6
#30 °--1
#31 °--1
#32 °--1
#33 °--1
#34 °--f
And to produce a nice plot:
plot(mytree)
Other forms of presenting the data:
as.list(mytree)
ToDataFrameTypeCol(mytree)
More on converting data.tree types:
https://cran.r-project.org/web/packages/data.tree/vignettes/data.tree.html#tree-conversion
http://www.r-bloggers.com/how-to-convert-an-r-data-tree-to-json/
Here's an alternative. It's not going to be as fast as the approach by @docendodiscimus, but it is still pretty straightforward.
The basic idea is to use melt from "reshape2"/"data.table". melt has a method for lists that creates output like the following:
melt(l2)
# value L3 L2 L4 L1
# 1 a NA NA NA 1
# 2 b NA 1 NA 2
# 3 c 1 2 NA 2
# 4 d 2 2 NA 2
# 5 a 2 2 NA 2
# 6 e 2 2 NA 2
# 7 e NA NA NA 4
# 8 b NA NA NA 4
# 9 e NA 1 NA 5
# 10 f 1 1 1 6
Except for the column ordering and the last value that you're interested in, that seems to have all the info you're after. To get the last value you're interested in, you can use rapply(l2, seq_along).
Putting those two requirements together, you would have something like this:
myFun <- function(inlist) {
require(reshape2) ## Load required package
x1 <- melt(inlist) ## Melt the data
x1[[paste0("L", ncol(x1))]] <- NA_integer_ ## Add a column to hold the position info
x1 <- x1[c(1, order(names(x1)[-1]) + 1)] ## Reorder the columns
vals <- rapply(inlist, seq_along) ## These are the positional values
positions <- max.col(is.na(x1), "first") ## This is where the positions should go
x1[cbind(1:nrow(x1), positions)] <- vals ## Matrix indexing for replacement
x1 ## Return the output
}
myFun(l2)
# value L1 L2 L3 L4 L5
# 1 a 1 1 NA NA NA
# 2 b 2 1 1 NA NA
# 3 c 2 2 1 1 NA
# 4 d 2 2 2 1 NA
# 5 a 2 2 2 2 NA
# 6 e 2 2 2 3 NA
# 7 e 4 1 NA NA NA
# 8 b 4 2 NA NA NA
# 9 e 5 1 1 NA NA
# 10 f 6 1 1 1 1
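The key step in myFun is locating the first NA in each row with max.col and then filling those cells through matrix indexing. A small stand-alone sketch of just that step, using a made-up 2 x 3 matrix and made-up fill values:

```r
m <- matrix(c(1, NA, NA,
              2,  1, NA), nrow = 2, byrow = TRUE)

# is.na(m) is TRUE where a cell is empty; with ties.method = "first",
# max.col returns the column of the first TRUE in each row
first_na <- max.col(is.na(m), ties.method = "first")
first_na
# [1] 2 3

# A two-column (row, column) matrix addresses one cell per row
m[cbind(seq_len(nrow(m)), first_na)] <- c(7, 8)
m
```

Note that a row without any NA would also report column 1, which is why myFun first adds an all-NA column.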
The "data.table" version of g from the answer by @docendodiscimus is a little bit more direct:
g <- function(inlist) {
require(data.table)
temp <- unlist(f(inlist))
setDT(tstrsplit(names(temp), ".", fixed = TRUE))[, value := unname(temp)][]
}
Similar to docendo's, but attempting to do as much as possible inside the recursion rather than fixing the result afterwards:
ff = function(x)
{
if(!is.list(x)) if(length(x)) return(seq_along(x)) else return(NA)
lapply(seq_along(x),
function(i) cbind(i, do.call(rBind, as.list(ff(x[[i]])))))
}
ans = do.call(rBind, ff(l2))
data.frame(value = unlist(l2),
ans[rowSums(is.na(ans[, -1L])) != (ncol(ans) - 1L), ])
# value X1 X2 X3 X4 X5
#1 a 1 1 NA NA NA
#2 b 2 1 1 NA NA
#3 c 2 2 1 1 NA
#4 d 2 2 2 1 NA
#5 a 2 2 2 2 NA
#6 e 2 2 2 3 NA
#7 e 4 1 NA NA NA
#8 b 4 2 NA NA NA
#9 e 5 1 1 NA NA
#10 f 6 1 1 1 1
rBind is a wrapper around rbind to avoid the "non-matching columns" errors:
rBind = function(...)
{
args = lapply(list(...), function(x) if(is.matrix(x)) x else matrix(x))
nc = max(sapply(args, ncol))
do.call(rbind,
lapply(args, function(x)
do.call(cbind, c(list(x), rep_len(list(NA), nc - ncol(x))))))
}
This can also be done with rrapply in the rrapply-package (extended version of base rapply) using how = "melt" to return a melted data.frame similar to reshape2::melt:
library(rrapply)
## use rapply or rrapply to convert terminal nodes to lists
l2_list <- rapply(l2, f = as.list, how = "replace")
## use rrapply with how = "melt" to return melted data.frame
l2_melt <- rrapply(l2_list, how = "melt")
#> L1 L2 L3 L4 L5 value
#> 1 ..1 ..1 <NA> <NA> <NA> a
#> 2 ..2 ..1 ..1 <NA> <NA> b
#> 3 ..2 ..2 ..1 ..1 <NA> c
#> 4 ..2 ..2 ..2 ..1 <NA> d
#> 5 ..2 ..2 ..2 ..2 <NA> a
#> 6 ..2 ..2 ..2 ..3 <NA> e
#> 7 ..4 ..1 <NA> <NA> <NA> e
#> 8 ..4 ..2 <NA> <NA> <NA> b
#> 9 ..5 ..1 ..1 <NA> <NA> e
#> 10 ..6 ..1 ..1 ..1 ..1 f
NB: we can convert the level columns to numeric columns afterwards if necessary.
rrapply(l2_melt, condition = function(x, .xname) grepl("^L", .xname), f = function(x) as.numeric(sub("\\.+", "", x)))
#> L1 L2 L3 L4 L5 value
#> 1 1 1 NA NA NA a
#> 2 2 1 1 NA NA b
#> 3 2 2 1 1 NA c
#> 4 2 2 2 1 NA d
#> 5 2 2 2 2 NA a
#> 6 2 2 2 3 NA e
#> 7 4 1 NA NA NA e
#> 8 4 2 NA NA NA b
#> 9 5 1 1 NA NA e
#> 10 6 1 1 1 1 f
Computation times
Using rrapply instead of reshape2::melt can give significant speed-ups for (very) large nested lists as shown in the benchmark timings below:
## create deeply nested list
deep_list <- rrapply(list(1, 1), classes = c("list", "numeric"), condition = function(x, .xpos) length(.xpos) < 18, f = function(x) list(1, 1), how = "recurse")
system.time(reshape2::melt(deep_list))
#> user system elapsed
#> 119.747 0.024 119.784
system.time(rrapply(deep_list, how = "melt"))
#> user system elapsed
#> 0.240 0.008 0.249
## create large shallow nested list
large_list <- lapply(replicate(500, 1, simplify = FALSE), function(x) replicate(500, 1, simplify = FALSE))
system.time(reshape2::melt(large_list))
#> user system elapsed
#> 40.558 0.008 40.569
system.time(rrapply(large_list, how = "melt"))
#> user system elapsed
#> 0.073 0.000 0.073
I have a data frame where "." is used both as the decimal marker and, on its own, as NA.
A B C D
1 . 1.2 6
1 12 . 3
2 14 1.6 4
To work on this data frame I need to obtain:
A B C D
1 NA 1.2 6
1 12 NA 3
2 14 1.6 4
How can I keep the decimals but convert a lone "." (as in columns B and C) to NA?
Here is the data in a reproducible format:
data <- structure(list(A = c(1L, 1L, 2L), B = c(".", "12", "14"), C = c("1.2",
".", "1.6"), D = c(6L, 3L, 4L)), .Names = c("A", "B", "C", "D"),
class = "data.frame", row.names = c(NA, -3L))
Assuming your data frame is data:
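data[data == "."] <- NA
should work, although columns B and C remain character. To also convert every column to numeric while keeping a data frame:
data[] <- lapply(data, as.numeric)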
You can use type.convert and specify "." as your na.string:
df <- data ## Create a copy in case you need the original form
df
# A B C D
# 1 1 . 1.2 6
# 2 1 12 . 3
# 3 2 14 1.6 4
df[] <- lapply(df, function(x) type.convert(as.character(x), na.strings=".", as.is = TRUE))
df
# A B C D
# 1 1 NA 1.2 6
# 2 1 12 NA 3
# 3 2 14 1.6 4
Note that the argument is na.strings (with a plural "s") so you can specify more characters to be treated as NA values if you have any.
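As a small illustration of passing several NA markers at once, here is a sketch on a made-up vector in which both "." and "-" stand for missing values (the "-" marker is an assumption for the example, not something from the question's data):

```r
x <- c("1.2", ".", "-", "1.6")

# Both markers become NA; the remaining strings are parsed as numbers
res <- type.convert(x, na.strings = c(".", "-"), as.is = TRUE)
res
# [1] 1.2  NA  NA 1.6
```

Setting as.is = TRUE keeps any non-numeric column as character instead of converting it to a factor.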
Also, the actual answer to this question might be to simply specify the na.strings argument when you are first reading your data into R, perhaps with read.table or read.csv.
Let's replicate the process of reading a csv from within R:
x <- tempfile()
write.csv(data, x, row.names = FALSE)
read.csv(x)
# A B C D
# 1 1 . 1.2 6
# 2 1 12 . 3
# 3 2 14 1.6 4
read.csv(x, na.strings = ".")
# A B C D
# 1 1 NA 1.2 6
# 2 1 12 NA 3
# 3 2 14 1.6 4