Creating new rows for listed substrings [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
My goal is to create a wordcloud in R, but I'm working with nested JSON data (which also happens to be incredibly messy).
There's a nice explanation here for how to create a wordcloud of phrases rather than singular words. I also know melt() from reshape2 can create new rows out of entire columns. Is there a way in R to perform a melt-like function over nested substrings?
Example:
N Group String
1 A c("a", "b", "c")
2 A character(0)
3 B a
4 B c("b", "d")
5 B d
...should become:
N Group String
1 A a
2 A b
3 A c
4 A character(0)
5 B a
6 B b
7 B d
8 B d
...where each subsequent substring is returned to the next row. In my actual data, the pattern c("x", "y") is consistent, but the substrings are too varied to know a priori.
If there's no great way to do this, too bad... just thought I'd ask the experts!

You can use separate_rows from the tidyr package (dmap_at below comes from an older version of purrr):
library(tidyverse)
data %>%
  separate_rows(listcites, sep = ",") %>% # split on commas
  dmap_at("listcites", ~ gsub("^c\\(\"|\")$|\"", "", .x)) # clean up the quotations and parens
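Applied to the example data from the question, a minimal sketch (assuming the column is literally named String and holds the deparsed vectors as plain text; the cleanup regex here is my own reconstruction):

```r
library(tidyr)
library(dplyr)

# Hypothetical reconstruction of the question's data frame
df <- data.frame(
  N      = 1:5,
  Group  = c("A", "A", "B", "B", "B"),
  String = c('c("a", "b", "c")', 'character(0)', 'a', 'c("b", "d")', 'd'),
  stringsAsFactors = FALSE
)

out <- df %>%
  separate_rows(String, sep = ",") %>%                     # one row per substring
  mutate(String = gsub('^c\\("|"\\)$|"|\\s', "", String))  # strip c(, quotes, parens
```

To renumber N as in the desired output, add mutate(N = row_number()) at the end of the pipe.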


select multiple ranges of columns in data.table using column names [duplicate]

This question already has answers here:
Select multiple ranges of columns using column names in data.table
(2 answers)
Closed 4 years ago.
I can select multiple ranges of columns in a data.table using a numeric vector like c(1:5,27:30). Is there any way to do the same with column names? For example, in some form similar to col1:col5,col27:col30?
You can with dplyr:
df <- data.frame(a=1, b=2, c=3, d=4, e=5, f=6, g=7)
dplyr::select(df, a:c, f:g)
  a b c f g
1 1 2 3 6 7
I am not sure if my answer is efficient, but I think it could at least give you a workaround in case you need to work with data.table.
My proposal is to use data.table in conjunction with cbind. Thus you could have:
library(data.table)
df <- data.table(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7)
multColSelectedByName <- cbind(df[, a:c], df[, f:g])
#   a b c f g
#1: 1 2 3 6 7
One point to be careful about is that if there is only one column in one of the selections, for example df[, f], then the name of that column will be something like V2 and not f. In such a case one could name it explicitly:
multColSelectedByName <- cbind(df[, a:c], f = df[, f])
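Another workaround that stays in data.table is to translate each name range into a numeric range and select by index. The helper below (rangeByName is a hypothetical name of my own) is a sketch of that idea:

```r
library(data.table)

dt <- data.table(a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7)

# Hypothetical helper: turn a pair of column names into a numeric index range
rangeByName <- function(x, from, to) {
  match(from, names(x)):match(to, names(x))
}

# Combine several ranges and select by position
dt[, c(rangeByName(dt, "a", "c"), rangeByName(dt, "f", "g")), with = FALSE]
```

This keeps the single-column case well-named too, since selection is purely by index.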

How do I pass column name as variable to data.table in R? [duplicate]

This question already has answers here:
Select subset of columns in data.table R [duplicate]
(7 answers)
Closed 6 years ago.
I would like to pass a variable (that holds the column name as a string) as argument to data.table. How do I do it?
Consider a data.table below:
myvariable <- "a"
myvariable_2 <- "b"
DT = data.table(ID = c("b","b","b","a","a","c"), a = 1:6, b = 7:12, c = 13:18)
DT
# ID a b c
# 1: b 1 7 13
# 2: b 2 8 14
# 3: b 3 9 15
# 4: a 4 10 16
# 5: a 5 11 17
# 6: c 6 12 18
I can use subset to extract columns, i.e. subset(DT, TRUE, myvariable), but this just outputs the column(s).
How do I use subset to extract column based on some criteria? e.g: extract myvariable column when myvariable_2 < 10
How do I extract summary statistics over groups by passing column names as variables?
How do I plot descriptive plots using data.table by passing column names as variables?
I know that this could be easier in data.frame i.e. passing variables as column names. But I read everywhere that data.table is faster/memory efficient hence would like to stick with it.
Does switching between data.table and data.frame have huge memory/performance implications?
I do not want to explicitly code the column names as I want this piece of code to be re-usable.
The comment from @thelatemail is a very good start. Do read that first! Another quick way is below:
library(data.table)
df <- data.table(a = 1:10, b = letters[1:2], c = 11:20)
var1 <- "a"
var2 <- "b"
dt1 <- df[, c(var1, var2), with = FALSE]
Think of with = FALSE as making the j part of data.table behave like that of a data.frame.
Edit 1: to subset on a condition within a data.table:
df[get(var1) > 5, c(var1, var2), with = FALSE]
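In newer versions of data.table (1.10.2 and later), the .. prefix offers a shorthand for the with = FALSE idiom: it tells data.table to look up the variable one level up, outside the table's own columns. A minimal sketch:

```r
library(data.table)

df <- data.table(a = 1:10, b = letters[1:2], c = 11:20)
cols <- c("a", "b")   # column names held in a variable

df[, ..cols]          # same result as df[, cols, with = FALSE]
df[a > 5, ..cols]     # combined with a subsetting condition in i
```

This reads like the Unix "one directory up" convention and avoids the extra with argument.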

Doing something similar to melt to an R dataframe [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 6 years ago.
I've got a dataframe like this:
The first column is numeric, and the second column is a comma separated list (character)
id numbers
1 2,4,5
2 1,4,6
3 NA
4 NA
5 5,1,2
And I want to in essence "melt" the dataframe similar to the reshape package. So that the output is a dataframe which looks like this
id numbers
1 2
1 4
1 5
2 1
2 4
2 6
3 NA
4 NA
5 5
5 1
5 2
Except that with the reshape2 package each number would have to be in its own column... which takes up too much storage space when there are many numbers, which is why I opted to store the numbers as a comma-separated list. But melt no longer works with this setup.
Can you recommend the most efficient way to achieve the transformation from the input dataframe to output dataframe?
The way I would do it is, for each row, create a data.frame and store them in a list, where df is your initial data.frame.
l <- list()
for (j in 1:nrow(df)) {
  l[[j]] <- data.frame(id = df$id[[j]],
                       numbers = unlist(strsplit(df$numbers[[j]], ',')))
}
Afterwards, you can stack all list elements into a single data.frame using plyr::ldply with the 'data.frame' option.
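The same result can be had without a loop by splitting the whole column at once and repeating each id by the number of pieces. A base-R sketch, assuming the numbers column is character (NA rows survive as single-NA rows, matching the desired output):

```r
df <- data.frame(id = 1:5,
                 numbers = c("2,4,5", "1,4,6", NA, NA, "5,1,2"),
                 stringsAsFactors = FALSE)

parts <- strsplit(df$numbers, ",")   # NA rows stay as a single NA piece
out <- data.frame(id      = rep(df$id, lengths(parts)),
                  numbers = unlist(parts),
                  stringsAsFactors = FALSE)
```

Because everything is vectorized, this avoids building one data.frame per row.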

How can I merge the different elements of the list? [duplicate]

This question already has answers here:
Paste multiple columns together
(11 answers)
Concatenate row-wise across specific columns of dataframe
(3 answers)
Closed 7 years ago.
I have a list/dataframe such as
a b c d e f g VALUE
1 0 1 0 0 0 1 934
what I wanted to do is to print,
1010001 without using for loop. so basically, take those integers as a string and merge them while printing?
I will define a function which truncates the last value and pastes all the other elements together, and then use apply over the rows of the data frame.
cc <- data.frame(a=1,b=0,c=1,d=0,e=0,f=0,g=1,VALUE=934)
# This function contains all the jobs you want to do for a row.
myfuns <- function(x, collapse = ""){
  x <- x[-length(x)]       # truncate the last element
  paste(x, collapse = "")  # paste all the integers together
}
# the second argument "MARGIN = 1" means apply this function on each row
apply(cc, MARGIN = 1, myfuns) # output: "1010001"
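An alternative that skips apply entirely: paste0 is vectorized across columns, so calling it on the data frame minus the VALUE column concatenates each row in one shot. A minimal sketch:

```r
cc <- data.frame(a = 1, b = 0, c = 1, d = 0, e = 0, f = 0, g = 1, VALUE = 934)

# paste0 across every column except VALUE, element-wise over rows
do.call(paste0, cc[setdiff(names(cc), "VALUE")])
```

Unlike apply, this never coerces the data frame to a matrix, so mixed column types are not a problem.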

R count occurrences of an element by groups [duplicate]

This question already has answers here:
Add column with order counts
(2 answers)
Count number of rows within each group
(17 answers)
Closed 7 years ago.
What is the easiest way to count the occurrences of an element in a vector or data.frame within every group?
I don't mean just counting the total (as other Stack Overflow questions ask) but giving a different number to every successive occurrence.
for example for this simple dataframe: (but I will work with dataframes with more columns)
mydata <- data.frame(A=c("A","A","A","B","B","A", "A"))
I've found this solution:
cbind(mydata,myorder=ave(rep(1,nrow(mydata)),mydata$A, FUN=cumsum))
and here the result:
A myorder
A 1
A 2
A 3
B 1
B 2
A 4
A 5
Isn't there a single command to do it? Or a specialized package?
I want to use the result later with tidyr's spread() function.
My question is not the same as
Is there an aggregate FUN option to count occurrences?
because I don't want the total number of occurrences at the end but the cumulative occurrences up to every element.
OK, my problem is a little bit more complex
mydata <- data.frame(group=c("x","x","x","x","y","y", "y"), letter=c("A","A","A","B","B","A", "A"))
I only know to solve the first example I wrote above.
But what happens when I want it also by a second grouping variable?
something like occurrences(letter) by group.
group letter "occurencies within group"
x A 1
x A 2
x A 3
x B 1
y B 1
y A 1
y A 2
I've found the way with
ave(rep(1,nrow(mydata)),list(mydata$group, mydata$letter), FUN=cumsum)
though there should be something easier.
Using data.table
library(data.table)
setDT(mydata)
mydata[, myorder := 1:.N, by = .(group, letter)]
The by argument makes the table be dealt with within the groups defined by the columns group and letter. .N is the number of rows within that group (if the by argument were empty it would be the number of rows in the table), so for each sub-table, each row is indexed from 1 to the number of rows in that sub-table.
mydata
group letter myorder
1: x A 1
2: x A 2
3: x A 3
4: x B 1
5: y B 1
6: y A 1
7: y A 2
or a dplyr solution, which is pretty much the same:
library(dplyr)
mydata %>%
  group_by(group, letter) %>%
  mutate(myorder = 1:n())
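As for the asker's wish for a single command: data.table also ships a helper, rowid(), that produces exactly this within-group counter in one call, without even converting the data.frame. A minimal sketch:

```r
library(data.table)

mydata <- data.frame(group  = c("x", "x", "x", "x", "y", "y", "y"),
                     letter = c("A", "A", "A", "B", "B", "A", "A"))

# rowid() numbers successive occurrences within each (group, letter) combination
mydata$myorder <- rowid(mydata$group, mydata$letter)
```

The result matches the ave(..., FUN = cumsum) trick above, but reads as a single self-describing call.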
