Counting characters and subsetting specific patterns


I have a list of data.frames (d) that looks like this:
$ 1 :'data.frame': 1 obs. of 2 variables:
..$ index: int 2
..$ V1 : Factor w/ 125 levels "cgtsloqasmlkjybjlo,..:"
$ 2 :'data.frame': 1 obs. of 2 variables:
..$ index: int 2
..$ V1 : Factor w/ 125 levels "ponlohlofdctlo,..:"
and so on for 1000 data.frames. I have to count the number of unique letters occurring in "cgtsloqasmlkjybjlo,..:" as well as in "ponlohlofdctlo,..:" and in the other 1000 data.frames.
I tried writing a function, but I'm not an expert and it does not work.
I also tried to split the strings, but that does not work either:
chars = sapply(d, function(x) strsplit(as.character(d),""))
In addition, I have to count the number of occurrences of "lo" in "cgtsloqasmlkjybjlo,..:" as well as in "ponlohlofdctlo,..:" and in the other 1000.
Edit: the desired output will be a data.frame:
Seq length(unique_letters) lo_occurrences
cgtsloqasmlkjybjlo 13 2
ponlohlofdctlo 9 3
.............. ............ ............
dput output:
dput(d[1:3])
structure(list(1 = structure(1000L, .Label = c("jhgfilsouilohgucaksfiaaknajdauloadbayrzjdhad", "fjkhqurtglowqgbdahhmolovdethabvfdalo", "....", "V1"), class = "factor")), .Names = c("1", "2", "3"))

One way is this:
#simulating your list; I got an error trying to use your dput
d <- list(data.frame(index = 2, V1 = "cgtsloqasmlkjybjlo"),
data.frame(index = 2, V1 = "ponlohlofdctlo"))
d
#[[1]]
# index V1
#1 2 cgtsloqasmlkjybjlo
#[[2]]
# index V1
#1 2 ponlohlofdctlo
res <- do.call(rbind, lapply(d, function(x) data.frame(seq = as.character(x$V1),
length_uniques = length(unique(unlist(strsplit(as.character(x$V1), "")))),
# gregexpr() returns -1 when there is no match, so count only real match positions
lo_counts = sum(unlist(gregexpr("lo", as.character(x$V1), fixed = TRUE)) > 0))))
res
# seq length_uniques lo_counts
#1 cgtsloqasmlkjybjlo 13 2
#2 ponlohlofdctlo 9 3
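For a list of 1000 data.frames it may be simpler to pull the sequences into one character vector first and summarise that vector. The following is only a sketch under that assumption; summarise_seqs is a hypothetical helper name, and the third sequence ("abc") is made up to show the no-match case.

```r
# Hypothetical helper: summarise every sequence in a character vector at once.
summarise_seqs <- function(seqs, pattern = "lo") {
  seqs <- as.character(seqs)
  # Split each sequence into single characters and count the distinct ones
  n_unique <- vapply(strsplit(seqs, ""),
                     function(ch) length(unique(ch)), integer(1))
  # gregexpr() returns -1 for "no match", so count only positive positions
  n_pattern <- vapply(gregexpr(pattern, seqs, fixed = TRUE),
                      function(m) sum(m > 0), integer(1))
  data.frame(seq = seqs, length_uniques = n_unique, lo_counts = n_pattern,
             stringsAsFactors = FALSE)
}

d <- list(data.frame(index = 2, V1 = "cgtsloqasmlkjybjlo"),
          data.frame(index = 2, V1 = "ponlohlofdctlo"),
          data.frame(index = 2, V1 = "abc"))  # made-up case without "lo"

res <- summarise_seqs(vapply(d, function(x) as.character(x$V1), character(1)))
res
```

The as.character() call inside vapply() is there so that the sketch behaves the same whether V1 is stored as a factor or as a character column.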

Related

Creating an empty dataframe in R with column names stored in two separate lists

I have two separate lists containing column names of a new dataframe df to be created.
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")
How do I create df so that its column names appear in the order a, b, a1, b1, c1?

Probably the simplest approach is to unlist both lists, concatenate them, and subset the data:
df[unlist(c(fixed, variable))]
If the lists contain elements that are not column names in 'df', use intersect:
df[intersect(unlist(c(fixed, variable)), names(df))]
a a1 c1
1 7 8 1
2 3 1 5
3 8 5 4
4 7 5 6
5 2 5 6
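As a small self-contained illustration of the intersect approach, the sketch below builds a toy data.frame (its columns and values are made up for this example) and keeps only the requested columns that exist, in the requested order.

```r
fixed <- list("a", "b")
variable <- list("a1", "b1", "c1")

# Toy data: deliberately out of order, missing "b" and "b1", with an extra "z"
df <- data.frame(c1 = 1:2, a = 3:4, z = 5:6, a1 = 7:8)

wanted <- unlist(c(fixed, variable))  # "a" "b" "a1" "b1" "c1"
keep <- intersect(wanted, names(df))  # keeps the order of 'wanted'
ordered_df <- df[keep]
names(ordered_df)
```

intersect() returns its result in the order of its first argument, which is why the requested order survives even though the columns of df are arranged differently.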
If you need an empty data.frame (zero rows) with those column names, we could do
v1 <- unlist(c(fixed, variable))
df <- as.data.frame(matrix(numeric(), nrow = 0,
ncol = length(v1), dimnames = list(NULL, v1)))
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
Or another option is
df <- data.frame(setNames(rep(list(0), length(v1)), v1))[0,]
str(df)
'data.frame': 0 obs. of 5 variables:
$ a : num
$ b : num
$ a1: num
$ b1: num
$ c1: num
data
v1 <- c('a', 'd2', 'c', 'a1', 'd1', 'c1', 'e1')
set.seed(24)
df <- as.data.frame(matrix(sample(1:9, 5 * length(v1),
replace = TRUE), ncol = length(v1), dimnames = list(NULL, v1)))

r reshape2 melt returns Warning in if (drop.margins) { : the condition has length > 1 and only the first element will be used

I have a piece of code that returns a warning message:
Warning in if (drop.margins) { :
the condition has length > 1 and only the first element will be used
from deep inside the reshape2 melt function, as the message shows.
How do I correct this?
The code is difficult to subset so I've included a description of the data frame. I'm just looking for a hint.
S_Melted <- melt(S_Flattened, S_Flattened$db_date)
BTW: S_Flattened was created by a cast in an earlier statement:
S_Flattened = cast(S, db_date ~ MetricType, value="AvgValue")

There are a couple of problems here:
You're actually using the "reshape" package, not "reshape2".
You're specifying a vector of values as the "id" variable, hence this warning.
Consider the following:
long <- structure(list(ID = structure(c(1L, 2L, 3L, 1L, 2L, 3L), .Label = c("A",
"B", "C"), class = "factor"), variable = structure(c(1L, 1L,
1L, 2L, 2L, 2L), .Label = c("V1", "V2"), class = "factor"), value = 1:6),
.Names = c("ID", "variable", "value"), row.names = c(NA, 6L), class = "data.frame")
Using cast from "reshape" gives you something that looks like a data.frame, but which has a bunch of other attributes.
reshape_df <- reshape::cast(long, ID ~ variable)
str(reshape_df)
# List of 3
# $ ID: Factor w/ 3 levels "A","B","C": 1 2 3
# $ V1: int [1:3] 1 2 3
# $ V2: int [1:3] 4 5 6
# - attr(*, "row.names")= int [1:3] 1 2 3
# - attr(*, "idvars")= chr "ID"
# - attr(*, "rdimnames")=List of 2
# ..$ :'data.frame': 3 obs. of 1 variable:
# .. ..$ ID: Factor w/ 3 levels "A","B","C": 1 2 3
# ..$ :'data.frame': 2 obs. of 1 variable:
# .. ..$ variable: Factor w/ 2 levels "V1","V2": 1 2
Here's your warning:
reshape::melt(reshape_df, reshape_df$ID)
# ID value variable
# V1 A 1 V1
# V1.1 B 2 V1
# V1.2 C 3 V1
# V2 A 4 V2
# V2.1 B 5 V2
# V2.2 C 6 V2
# Warning message:
# In if (drop.margins) { :
# the condition has length > 1 and only the first element will be used
And the same thing, without a warning.
reshape::melt(reshape_df, id = "ID")
# ID value variable
# V1 A 1 V1
# V1.1 B 2 V1
# V1.2 C 3 V1
# V2 A 4 V2
# V2.1 B 5 V2
# V2.2 C 6 V2
A better approach would be to stop using "reshape" and start using "reshape2", "data.table" (which provides more flexible implementations of melt and dcast than "reshape2" does), or "tidyr".
Here's the same set of steps with "reshape2":
reshape2_df <- reshape2::dcast(long, ID ~ variable)
You get back a standard data.frame with no extra attributes.
str(reshape2_df)
# 'data.frame': 3 obs. of 3 variables:
# $ ID: Factor w/ 3 levels "A","B","C": 1 2 3
# $ V1: int 1 2 3
# $ V2: int 4 5 6
Melting is not a problem either -- just don't supply it with a vector of values, as you did in your attempt.
reshape2::melt(reshape2_df, "ID")
# ID variable value
# 1 A V1 1
# 2 B V1 2
# 3 C V1 3
# 4 A V2 4
# 5 B V2 5
# 6 C V2 6
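For the "tidyr" route mentioned above, a rough equivalent of the same round trip might look like the sketch below. It assumes a tidyr version that provides pivot_wider() and pivot_longer() (1.0.0 or later), and it rebuilds the example data with character columns instead of factors.

```r
library(tidyr)

# Same toy data as 'long' above, rebuilt here so the sketch is self-contained
long <- data.frame(ID = rep(c("A", "B", "C"), times = 2),
                   variable = rep(c("V1", "V2"), each = 3),
                   value = 1:6,
                   stringsAsFactors = FALSE)

# Long -> wide: one column per level of 'variable'
wide <- pivot_wider(long, names_from = variable, values_from = value)

# Wide -> long: the id column is given by name, never as a vector of values
long_again <- pivot_longer(wide, cols = -ID,
                           names_to = "variable", values_to = "value")
long_again
```

As with melt, the id column is identified by its name; passing the column's values is what triggered the original warning.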

R: How to write a function that converts the data type of multiple columns from integer to factor

I have a dataset that looks like this:
df <- data.frame(id = seq(1, 10, by = 1),
group = rep(1:2, each = 5),
r_level = c(rep('low', 2), rep('medium', 4), rep('high', 4)),
date = sample(seq(as.Date('2017/02/09'), as.Date('2018/02/09'), by = 'day'), 10),
score = round(rnorm(10, mean = 66, sd = 12), 0),
time_rank = floor(runif(10, min = 1, max = 10)))
I want to convert the data type of id, group, r_level, and time_rank to 'factor' without repeating as.factor() for every column, like this:
df$id <- as.factor(df$id)
df$group <- as.factor(df$group)
I would like a convert_dtype() function along these lines:
DataFrame <- convert_dtype(DataFrame, ColumnNames, Old_Data_Type, New_Data_Type)
The post "Convert type of multiple columns of a dataframe at once" might be helpful.
Thanks in advance!

Just use lapply on the part of the dataframe required and assign back to those columns:
df[ , c('id', 'group', 'r_level', 'time_rank')] <-
lapply( df[ , c('id', 'group', 'r_level', 'time_rank')], factor)
Check result:
str(df)
'data.frame': 10 obs. of 6 variables:
$ id : Factor w/ 10 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10
$ group : Factor w/ 2 levels "1","2": 1 1 1 1 1 2 2 2 2 2
$ r_level : Factor w/ 3 levels "high","low","medium": 2 2 3 3 3 3 1 1 1 1
$ date : Date, format: "2017-11-01" "2018-01-21" "2017-12-10" ...
$ score : num 59 73 67 69 68 48 71 60 43 68
$ time_rank: Factor w/ 6 levels "1","3","6","7",..: 1 5 6 5 1 2 6 4 3 6
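If a reusable function is preferred, the lapply idea can be wrapped as in the sketch below. The name convert_dtype comes from the question; its argument list, the set of supported target types, and the small example data are assumptions made for illustration (the old type is not needed, so it is left out).

```r
# Sketch of the convert_dtype() function asked for in the question.
convert_dtype <- function(data, columns,
                          new_type = c("factor", "character",
                                       "numeric", "integer")) {
  new_type <- match.arg(new_type)
  converter <- switch(new_type,
                      factor = factor,
                      character = as.character,
                      numeric = as.numeric,
                      integer = as.integer)
  # Fail early if a requested column does not exist
  missing_cols <- setdiff(columns, names(data))
  if (length(missing_cols) > 0) {
    stop("Unknown columns: ", paste(missing_cols, collapse = ", "))
  }
  data[columns] <- lapply(data[columns], converter)
  data
}

# Made-up example data
df <- data.frame(id = 1:4, group = rep(1:2, each = 2),
                 score = c(59, 73, 67, 69))
df <- convert_dtype(df, c("id", "group"), "factor")
str(df)
```

Columns that are not listed (score here) are left untouched.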

Transforming a nested data frame with varying number of elements

I have a data frame with a column of nested data frames with 1 or 2 columns and n rows. It looks like df in the sample below:
'data.frame': 3 obs. of 2 variables:
$ vector:List of 3
..$ : chr "p1"
..$ : chr "p2"
..$ : chr "p3"
$ lists :List of 3
..$ :'data.frame': 2 obs. of 2 variables:
.. ..$ n1: Factor w/ 2 levels "a","b": 1 2
.. ..$ n2: Factor w/ 2 levels "1","2": 1 2
..$ :'data.frame': 1 obs. of 1 variable:
.. ..$ n1: Factor w/ 1 level "d": 1
..$ :'data.frame': 1 obs. of 2 variables:
.. ..$ n1: Factor w/ 1 level "e": 1
.. ..$ n2: Factor w/ 1 level "3": 1
df can be recreated like this :
v <- c("p1", "p2", "p3")
l <- list(data.frame(n1 = c("a", "b"), n2 = c("1", "2")), data.frame(n1 = "d"), data.frame(n1 = "e", n2 = "3"))
df <- as.data.frame(cbind(v, l))
I'd like to transform it into a data frame that looks like this:
[v] [n1] [n2]
p1 a 1
p1 b 2
p2 d NA
p3 e 3
n1 and n2 are in separate columns
if the data frame in row i has n rows, the vector element of row i should be repeated n times
if there is no content in n1 or n2, there should be an NA
I've tried using tidyr::unnest but got the following error
unnest(df)
Error: All nested columns must have the same number of elements.
Does anyone have a better idea of how to transform the data frame into the desired format?

Using purrr::pmap_df, within each row of df, we combine v and l into a single data frame and then combine all of the data frames into a single data frame.
library(tidyverse)
pmap_df(df, function(v,l) {
data.frame(v,l)
})
v n1 n2
1 p1 a 1
2 p1 b 2
3 p2 d <NA>
4 p3 e 3

This will avoid by-row operations, which will be important if you have a lot of rows.
library(data.table)
rbindlist(df$l, fill = TRUE, idcol = 'row')[, v := df$v[row]][]
# row n1 n2 v
#1: 1 a 1 p1
#2: 1 b 2 p1
#3: 2 d NA p2
#4: 3 e 3 p3

A solution using dplyr and tidyr. suppressWarnings is not required; it only silences the warning that is raised when the factor columns of the nested data frames are combined.
library(dplyr)
library(tidyr)
df1 <- suppressWarnings(df %>%
mutate(v = unlist(.$v)) %>%
unnest())
df1
# v n1 n2
# 1 p1 a 1
# 2 p1 b 2
# 3 p2 d <NA>
# 4 p3 e 3
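For comparison, the three requirements from the question can also be met without extra packages. The sketch below works on the v and l objects directly rather than on df, uses character columns instead of factors, and is only one possible base R approach.

```r
v <- c("p1", "p2", "p3")
l <- list(data.frame(n1 = c("a", "b"), n2 = c("1", "2"),
                     stringsAsFactors = FALSE),
          data.frame(n1 = "d", stringsAsFactors = FALSE),
          data.frame(n1 = "e", n2 = "3", stringsAsFactors = FALSE))

# Every column name that occurs in any nested data frame
all_cols <- unique(unlist(lapply(l, names)))

pieces <- Map(function(key, nested) {
  # Add columns that this nested data frame lacks, filled with NA
  for (nm in setdiff(all_cols, names(nested))) {
    nested[[nm]] <- NA_character_
  }
  # Repeat the key once per row of the nested data frame
  data.frame(v = rep(key, nrow(nested)), nested[all_cols],
             stringsAsFactors = FALSE)
}, v, l)

out <- do.call(rbind, pieces)
rownames(out) <- NULL
out
```

Because this loops over the rows in R, it is likely to be slower than the rbindlist answer on large inputs.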

Store list into a file (R)

I have the following data frame:
Group.1 V2
1 27562761 GO:0003676
2 27562765 c("GO:0004345", "GO:0050661", "GO:0006006", "GO:0055114")
3 27562775 GO:0016020
4 27562776 c("GO:0005525", "GO:0007264", "GO:0005622")
where the second column is a list. I tried to write the data frame into a text file using write.table, but it did not work. My desired output is the following one (file.txt):
27562761 GO:0003676
27562765 GO:0004345, GO:0050661, GO:0006006, GO:0055114
27562775 GO:0016020
27562776 GO:0005525, GO:0007264, GO:0005622
How could I obtain that?

You could look into sink, or you could use write.csv after flattening "V2" to a character string.
Try the following examples:
## recreating some data that is similar to your example
df <- data.frame(a = c(1, 1, 2, 2, 3), b = letters[1:5])
x <- aggregate(list(V2 = df$b), list(df$a), c)
x
# Group.1 V2
# 1 1 1, 2
# 2 2 3, 4
# 3 3 5
## V2 is a list, as you describe in your question
str(x)
# 'data.frame': 3 obs. of 2 variables:
# $ Group.1: num 1 2 3
# $ V2 :List of 3
# ..$ 1: int 1 2
# ..$ 3: int 3 4
# ..$ 5: int 5
sink(file = "somefile.txt")
print(x) # print explicitly so this also works when the script is source()d
sink()
## now open up "somefile.txt" from your working directory
x$V2 <- sapply(x$V2, paste, collapse = ", ")
write.csv(x, file = "somefile.csv")
## now open up "somefile.csv" from your working directory
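To get closer to the layout requested in the question (no header, no row names, no quotes), write.table can be used after the same flattening step. This is a sketch using the first two rows of the question's data; writing to a temporary file and using a single space as separator are assumptions.

```r
# Rebuild two rows of the question's data with V2 as a list column
x <- data.frame(Group.1 = c(27562761, 27562765))
x$V2 <- list("GO:0003676",
             c("GO:0004345", "GO:0050661", "GO:0006006", "GO:0055114"))

# Flatten each list element into one comma-separated string
x$V2 <- vapply(x$V2, paste, character(1), collapse = ", ")

out_file <- tempfile(fileext = ".txt")  # replace with "file.txt" as needed
write.table(x, file = out_file, quote = FALSE, sep = " ",
            row.names = FALSE, col.names = FALSE)

written <- readLines(out_file)
written
```

vapply() is used instead of sapply() here so that the result is guaranteed to be a character vector with one string per row.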
