How to count strings separated by a semicolon - r

My data looks like below:
df <- structure(list(V1 = structure(c(7L, 4L, 8L, 8L, 5L, 3L, 1L, 1L,
2L, 1L, 6L), .Label = c("", "cell and biogenesis;transport",
"differentiation;metabolic process;regulation;stimulus", "MAPK cascade;cell and biogenesis",
"MAPK cascade;cell and biogenesis;transport", "metabolic process;regulation;stimulus;transport",
"mRNA;stimulus;transport", "targeting"), class = "factor")), .Names = "V1", class = "data.frame", row.names = c(NA,
-11L))
I want to count how many times each string occurs, and also keep track of which rows it comes from. The strings within a row are separated by a ; and each string belongs to the row it appears in.
I want to have the output like this:
String Count position
mRNA 1 1
stimulus 3 1,6,11
transport 4 1,5,9,11
MAPK cascade 2 2,5
cell and biogenesis 3 2,5,9
targeting 2 3,4
regulation of mRNA stability 1 1
regulation 2 6,11
differentiation 1 6,11
metabolic process 2 6,11
The count shows how many times each string (strings are separated by a semicolon) occurs in the entire data.
The position column shows which rows it occurred in. For example, mRNA was only in the first row, so it is 1; stimulus was in three rows: 1, 6 and 11.
Some rows are blank, and they also count as rows.

In the code below we do the following:
Add the row numbers as a column.
Use strsplit to split each string into its components and store the result in a column called string.
strsplit returns a list. We use unnest to stack the list components to create a "long" data frame, giving us a "tidy" data frame that's ready to summarize.
Group by string and return a new data frame that counts the frequency of each string and gives the original row number in which each instance of the string originally appeared.
library(tidyverse)
df$V1 = as.character(df$V1)
df %>%
  rownames_to_column() %>%
  mutate(string = strsplit(V1, ";")) %>%
  unnest(string) %>%   # name the list column explicitly; recent tidyr versions warn otherwise
  group_by(string) %>%
  summarise(count = n(),
            rows = paste(rowname, collapse=","))
string count rows
1 cell and biogenesis 3 2,5,9
2 differentiation 1 6
3 MAPK cascade 2 2,5
4 metabolic process 2 6,11
5 mRNA 1 1
6 regulation 2 6,11
7 stimulus 3 1,6,11
8 targeting 2 3,4
9 transport 4 1,5,9,11
If you plan to do further processing on the row numbers, you might want to keep them as numeric values, rather than as a string of pasted values. In that case, you could do this:
df.new = df %>%
  rownames_to_column("rows") %>%
  mutate(string = strsplit(V1, ";")) %>%
  select(-V1) %>%
  unnest(string)
This will give you a long data frame with one row for each combination of string and rows.
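As a rough illustration of what that long format contains, here is a minimal base R sketch (not the answer's tidyverse code). It assumes the question's V1 column re-typed as a plain character vector `v1`; the names `parts`, `long` and `rows_by_string` are made up for this example.

```r
# The question's V1 column as a character vector (assumption: re-typed by hand)
v1 <- c("mRNA;stimulus;transport",
        "MAPK cascade;cell and biogenesis",
        "targeting",
        "targeting",
        "MAPK cascade;cell and biogenesis;transport",
        "differentiation;metabolic process;regulation;stimulus",
        "", "",
        "cell and biogenesis;transport",
        "",
        "metabolic process;regulation;stimulus;transport")

# One list element per original row; blank rows split to character(0)
parts <- strsplit(v1, ";", fixed = TRUE)

# Long format: one line per (row number, string) pair, row numbers kept as integers
long <- data.frame(rows = rep(seq_along(parts), lengths(parts)),
                   string = unlist(parts),
                   stringsAsFactors = FALSE)

# Integer row numbers can be processed further, e.g. grouped by string
rows_by_string <- split(long$rows, long$string)
rows_by_string[["transport"]]
```

Because the row numbers stay integers, they can be compared, sorted, or used for indexing without parsing a pasted string first.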

A base R approach:
# convert 'V1' to a character vector (only necessary if it isn't already)
df$V1 <- as.character(df$V1)
# get the unique strings
strng <- unique(unlist(strsplit(df$V1,';')))
# create a list with the rows for each unique string
lst <- lapply(strng, function(x) grep(x, df$V1, fixed = TRUE))
# get the counts for each string
count <- lengths(lst)
# collapse the list of positions into a string with the row numbers for each string
pos <- sapply(lst, toString)
# put everything together in one dataframe
d <- data.frame(strng, count, pos)
You can shorten this approach to:
d <- data.frame(strng = unique(unlist(strsplit(df$V1,';'))))
lst <- lapply(d$strng, function(x) grep(x, df$V1, fixed = TRUE))
transform(d, count = lengths(lst), pos = sapply(lst, toString))
The result:
> d
strng count pos
1 mRNA 1 1
2 stimulus 3 1, 6, 11
3 transport 4 1, 5, 9, 11
4 MAPK cascade 2 2, 5
5 cell and biogenesis 3 2, 5, 9
6 targeting 2 3, 4
7 differentiation 1 6
8 metabolic process 2 6, 11
9 regulation 2 6, 11
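One caveat worth checking: grep(x, ..., fixed = TRUE) matches substrings, so a short string such as "regulation" would also hit a row that only contains a longer string such as "regulation of mRNA stability". The sketch below is one possible variant that compares whole elements instead. The three-row vector `v1` is hypothetical data chosen to show the difference, not the question's data.

```r
# Hypothetical data where one string is a substring of another
v1 <- c("regulation of mRNA stability;stimulus", "regulation", "")

parts <- strsplit(v1, ";", fixed = TRUE)
strng <- unique(unlist(parts))

# For each unique string, keep the rows whose split elements contain it exactly
lst <- lapply(strng, function(x) {
  which(vapply(parts, function(p) x %in% p, logical(1)))
})

d <- data.frame(strng,
                count = lengths(lst),
                pos = vapply(lst, toString, character(1)))
d
```

With the grep() version, "regulation" would be reported in rows 1 and 2 here; with exact matching it is reported in row 2 only.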

A possible data.table solution for completeness
library(data.table)
setDT(df)[, .(.I, unlist(tstrsplit(V1, ";", fixed = TRUE)))
][!is.na(V2), .(count = .N, pos = toString(sort(I))),
by = .(String = V2)]
# String count pos
# 1: mRNA 1 1
# 2: MAPK cascade 2 2, 5
# 3: targeting 2 3, 4
# 4: differentiation 1 6
# 5: cell and biogenesis 3 2, 5, 9
# 6: metabolic process 2 6, 11
# 7: stimulus 3 1, 6, 11
# 8: transport 4 1, 5, 9, 11
# 9: regulation 2 6, 11
This basically splits the V1 column by ; and converts it to a long format, while simultaneously binding it with the row index (.I). Afterwards it's just a simple aggregation: the row count (.N) and the positions bound into a single string per String.

Related

Get column names into a new variable based on conditions

I have a data frame like this, and I am working in R. My problem can be divided into two steps.
SUBID   ABC  BCD  DEF
192838    4   -3    2
193928   -6   -2    6
205829    4   -5    9
201837    3    4    4
I want to make a new variable that contains a list of the column names that has a negative value for each SUBID. The output should look something like this:
SUBID   ABC  BCD  DEF  output
192838    4   -3    2  "BCD"
193928   -6   -2    6  "ABC","BCD"
205829    4   -5    9  "BCD"
201837    3    4    4  " "
And then, in the second step, I would like to collapse the SUBID into a more general ID and get the number of unique strings from the output variable for each ID (I just need the number, the specific strings in the parenthesis are just for illustration).
SUBID  output
19     2 ("ABC","BCD")
20     1 ("BCD")
Those are the two steps that I think should be done, but maybe there is a way I don't know of that skips the first step and goes to the second step directly.
I would appreciate any help since right now I am not sure where to start on this. Thank you!
Another way:
library(dplyr)
library(tidyr)
df <- df %>% pivot_longer(-SUBID)
df1 <- df %>%
group_by(SUBID) %>%
summarise(output = paste(name[value < 0L], collapse = ','))
df2 <- df %>%
group_by(SUBID = substr(SUBID, 1, 2)) %>%
summarise(output_count = n_distinct(name[value < 0L]),
output = paste0(output_count, ' (', paste(name[value < 0L], collapse = ','), ')'))
Outputs (two columns are created in the second case, one with just the count and another following your example):
df1
# A tibble: 4 x 2
SUBID output
<int> <chr>
1 192838 "BCD"
2 193928 "ABC,BCD"
3 201837 ""
4 205829 "BCD"
df2
# A tibble: 2 x 3
SUBID output_count output
<chr> <int> <chr>
1 19 2 2 (BCD,ABC,BCD)
2 20 1 1 (BCD)
This answers the first part of your question; I didn't understand the second one.
df$output <-apply(df[,-1], 1, function(x) paste(names(df)[-1][x<0], collapse = ","))
df
SUBID ABC BCD DEF output
1 192838 4 3 -2 DEF
2 193928 -6 -2 6 ABC,BCD
3 205829 4 -5 9 BCD
4 201837 3 4 4
For the second part, try this:
id <- sapply(strsplit(sub("\\W+", "", df$output), split = ""), function(x){
sum(!(duplicated(x) | duplicated(x, fromLast = TRUE)))
} )
data.frame(SUBID = substr(df$SUBID, 1,2), output = id, string = df$output)
SUBID output string
1 19 3 DEF
2 19 2 ABC,BCD
3 20 3 BCD
4 20 0
I added the variable string so you can make sure your count of unique values is OK.
One option is to take advantage of dplyr::cur_data() to access the names() of the data and subset based on your criteria. Then you can take advantage of tibble list-columns to hold on to a set of column names of arbitrary length and finally calculate the number of unique values in that list.
library(tidyverse)
d <- structure(list(SUBID = c(192838, 193928, 205829, 201837), ABC = c(4, -6, 4, 3), BCD = c(-3, -2, -5, 4), DEF = c(2, 6, 9, 4)), row.names = c(NA, -4L), class = "data.frame")
d %>%
rowwise() %>%
mutate(neg_col_names = list(names(cur_data())[cur_data() < 0])) %>%
group_by(ID_grp = str_sub(SUBID, 1, 2)) %>%
summarize(neg_col_count = n_distinct(unlist(c(neg_col_names))))
#> # A tibble: 2 × 2
#> ID_grp neg_col_count
#> <chr> <int>
#> 1 19 2
#> 2 20 1
Created on 2022-11-22 with reprex v2.0.2
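For comparison, the second step (number of distinct negative columns per two-digit ID) can also be sketched in base R without building the intermediate list of names. This is only one possible approach, using the same toy data; the names `neg`, `id_grp`, `any_neg` and `neg_col_count` are made up for this example.

```r
d <- data.frame(SUBID = c(192838, 193928, 205829, 201837),
                ABC = c(4, -6, 4, 3),
                BCD = c(-3, -2, -5, 4),
                DEF = c(2, 6, 9, 4))

# Logical matrix: TRUE where a value column is negative
neg <- d[-1] < 0

# General ID: first two characters of SUBID
id_grp <- substr(d$SUBID, 1, 2)

# Per group, count negative rows in each column, then flag columns with at least one
any_neg <- rowsum(neg * 1L, id_grp) > 0

# Number of distinct columns with a negative value per group
neg_col_count <- rowSums(any_neg)
neg_col_count
```

The result is a named vector with one entry per ID group, which matches the counts in the question's expected output (2 for 19, 1 for 20).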

For Loop over data frame, using dplyr results in error

I have a data frame with 71 columns and N rows. What I want to get is a frequency table of the values in the first column based on all other columns (all other columns are dummies). Simplified (with only 4 columns), it would look like this:
df <- data.frame(sample(1:8,20,replace=T),sample(0:1,20,replace = T),sample(0:1,20,replace = T),sample(0:1,20,replace = T))
I have tried this for loop with dplyr (where x is the first column with the 8 different values). It only works without problems for the first 10 or 11 columns; after that it only generates NAs and returns the error shown below the code:
freq_df <- data.frame(matrix(NA, nrow=8, ncol=71))
for (i in 2:71){
freq_df[,i] <- df %>%
filter(df[i]==1) %>%
count(x) %>%
select(n)
}
in `[<-.data.frame`(`*tmp*`, , i, value = list(n = c(3L, 5L, 8L, :
replacement element 1 has 7 rows, need 8
Anyone knows why R returns this error? Thank you for your help!
Your error is because not all first column values will occur where other columns are 1. You have 8 unique values in the first column, maybe you have 7 when you filter on the 11th column == 1. So the results can have different lengths, which is a problem.
Try this instead, I think it's what you're trying to do. (If not, please clarify your goal by showing the expected output.)
names(df) = paste0("V", 1:4)
df %>%
group_by(V1) %>%
summarize(across(everything(), sum, .names = "{.col}_count"))
# V1 V2_count V3_count V4_count
# <int> <int> <int> <int>
# 1 1 1 0 1
# 2 2 2 1 2
# 3 3 3 3 2
# 4 4 0 0 0
# 5 5 0 0 0
# 6 6 3 1 2
# 7 7 3 1 1
# 8 8 1 1 0
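The length mismatch behind the original error can be reproduced on a tiny example. The sketch below uses base R's table() rather than dplyr::count(), and the vectors `x` and `flag` are hypothetical stand-ins for the first column and one dummy column.

```r
# Hypothetical data: the value 2 never occurs where flag == 1
x    <- c(1, 2, 3, 3)
flag <- c(1, 0, 1, 1)

# Counting only the filtered values drops the value that is absent
tab_filtered <- table(x[flag == 1])
length(tab_filtered)

# Fixing the set of levels keeps one entry per possible value, with 0 for absent ones
tab_fixed <- table(factor(x[flag == 1], levels = 1:3))
as.vector(tab_fixed)
```

The filtered table has fewer entries than there are distinct values of x, which is the same situation as the "has 7 rows, need 8" error; fixing the levels (or summing the dummies per group, as in the answer) gives a result of constant length.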
In base R, we can do
names(df) <- paste0("V", 1:4)
out <- aggregate(.~ V1, df, sum, na.rm = TRUE)
names(out)[-1] <- paste0(names(out)[-1], "_count")

Assigning values to patterns of letters in character strings using R

I have a data frame that looks like this:
head(df)
shotchart
1 BMMMBMMBMMBM
2 MMMBBMMBBMMB
3 BBBBMMBMMMBB
4 MMMMBBMMBBMM
Different patterns of the letter 'M' are worth certain values such as the following:
MM = 1
MMM = 2
MMMM = 3
I want to create an extra column to this data frame that calculates the total value of the different patterns of 'M' in each row individually.
For example:
head(df)
shotchart score
1 BMMMBMMBMMBM 4
2 MMMBBMMBBMMB 4
3 BBBBMMBMMMBB 3
4 MMMMBBMMBBMM 5
I can't seem to figure out how to assign the values to the different 'M' patterns.
I tried using the following code but it didn't work:
df$score <- revalue(df$scorechart, c("MM"="1", "MMM"="2", "MMMM"="3"))
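We create a named vector ('nm1') that maps each pattern to its value, replace everything that is not 'M' in 'shotchart' with a delimiter and split on it so that only the runs of 'M' remain, and then use the named vector to look up the value of each run and sum them: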
nm1 <- setNames(1:3, strrep("M", 2:4))
sapply(strsplit(gsub("[^M]+", ",", df$shotchart), ","),
function(x) sum(nm1[x[nzchar(x)]], na.rm = TRUE))
Or using tidyverse
library(tidyverse)
df %>%
mutate(score = str_extract_all(shotchart, "M+") %>%
map_dbl(~ nm1[.x] %>%
sum(., na.rm = TRUE)))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
You can also split on "B" and base the result on the count of "M" characters -1 as follows:
df <- data.frame(shotchart = c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM"),
                 score = NA_integer_,
                 stringsAsFactors = FALSE)
# sapply() returns a plain vector; lapply() would store a list column
df$score <- sapply(strsplit(df$shotchart, "B"), function(i) sum((nchar(i)-1)[(nchar(i)-1)>0]))
# shotchart score
#1 BMMMBMMBMMBM 4
#2 MMMBBMMBBMMB 4
#3 BBBBMMBMMMBB 3
#4 MMMMBBMMBBMM 5
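The answers above rely on the observation that a run of n consecutive 'M' characters is worth n - 1. A minimal sketch of the same idea with rle() follows. It assumes the rule extends to runs longer than four, which the question does not state, and `score_runs` is a made-up helper name.

```r
# Score one string: each run of n 'M's contributes n - 1 (assumed to hold for any n)
score_runs <- function(s) {
  r <- rle(strsplit(s, "", fixed = TRUE)[[1]])   # run-length encode the characters
  m_runs <- r$lengths[r$values == "M"]           # lengths of the 'M' runs only
  sum(pmax(m_runs - 1L, 0L))                     # single 'M's contribute 0
}

shotchart <- c("BMMMBMMBMMBM", "MMMBBMMBBMMB", "BBBBMMBMMMBB", "MMMMBBMMBBMM")
scores <- vapply(shotchart, score_runs, integer(1), USE.NAMES = FALSE)
scores
```

For the four example rows this gives 4, 4, 3 and 5, the same scores as in the question's expected output.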

Product of several columns on a data frame by a vector using dplyr

I would like to multiply several columns on a dataframe by the values of a vector (all values within the same column should be multiplied by the same value, which will be different according to the column), while keeping the other columns as they are.
Since I'm using dplyr extensively I thought that it might be useful to use mutate_each function, so I can modify all columns at the same time, but I am completely lost on the syntax on the fun() part.
On the other hand, I've read this solution which is simple and works fine, but only works for all columns instead of the selected ones.
That's what I've done so far:
Imagine that I want to multiply all columns in df but letters by weight_df vector as follows:
df = data.frame(
letters = c("A", "B", "C", "D"),
col1 = c(3, 3, 2, 3),
col2 = c(2, 2, 3, 1),
col3 = c(4, 1, 1, 3)
)
> df
letters col1 col2 col3
1 A 3 2 4
2 B 3 2 1
3 C 2 3 1
4 D 3 1 3
>
weight_df = c(1:3)
If I use select before applying mutate_each, I get rid of the letters column (as expected), and that's not what I want (apart from the fact that the vector is not applied on a per-column basis but on a per-row basis, and I want the opposite):
df = df %>%
select(-letters) %>%
mutate_each(funs(. * weight_df))
> df
col1 col2 col3
1 3 2 4
2 6 4 2
3 6 9 3
4 3 1 3
But if I don't select any particular columns, all values within letters are removed (which makes a lot of sense, by the way), and that's not what I want either (again, the vector is applied on a per-row basis rather than a per-column basis):
df = df %>%
mutate_each(funs(. * weight_df))
> df
letters col1 col2 col3
1 NA 3 2 4
2 NA 6 4 2
3 NA 6 9 3
4 NA 3 1 3
(Please note that this is a very simple dataframe and the original one has way more rows and columns -which unfortunately are not labeled in such an easy way and no patterns can be obtained)
The problem here is that you are basically trying to operate over rows rather than columns, hence methods such as mutate_* won't work. If you are not satisfied with the many vectorized approaches proposed in the linked question, one way to achieve this with the tidyverse (assuming that letters is a unique identifier) is to convert to long form first, multiply a single column by group, and then convert back to wide (I don't think this will be overly efficient, though):
library(tidyr)
library(dplyr)
df %>%
gather(variable, value, -letters) %>%
group_by(letters) %>%
mutate(value = value * weight_df) %>%
spread(variable, value)
#Source: local data frame [4 x 4]
#Groups: letters [4]
# letters col1 col2 col3
# * <fctr> <dbl> <dbl> <dbl>
# 1 A 3 4 12
# 2 B 3 4 3
# 3 C 2 6 3
# 4 D 3 2 9
Using sweep() with the dplyr pipe. This selects the numeric columns only, which gives flexibility in choosing columns, and returns the new values along with all the other (non-numeric) columns:
index <- which(sapply(df, is.numeric) == TRUE)
df[,index] <- df[,index] %>% sweep(2, weight_df, FUN="*")
> df
letters col1 col2 col3
1 A 3 4 12
2 B 3 4 3
3 C 2 6 3
4 D 3 2 9
Try this:
library(plyr)
library(dplyr)
df %>% select_if(is.numeric) %>% adply(., 1, function(x) x * weight_df)
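If the columns to weight are chosen by name rather than by type, one option is to pair each selected column with its own weight using Map(). This is a base R sketch under the assumption that the weights are stored in a named vector; the names `weights` and `cols` are made up for this example.

```r
df <- data.frame(letters = c("A", "B", "C", "D"),
                 col1 = c(3, 3, 2, 3),
                 col2 = c(2, 2, 3, 1),
                 col3 = c(4, 1, 1, 3))

# Assumption: one weight per column, named after the column it applies to
weights <- c(col1 = 1, col2 = 2, col3 = 3)
cols <- names(weights)

# Multiply each selected column by its own weight; other columns are left untouched
df[cols] <- Map(`*`, df[cols], weights)
df
```

Naming the weights makes the pairing explicit, so the result does not depend on the order of the columns in the data frame.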

Convert Column Values into Row Names using R

I need to convert column values into row names using R.
For example, to convert format1 into format2:
var<-c("Id", "Name", "Score", "Id", "Score", "Id", "Name")
num<-c(1, "Tom", 4, 2, 7, 3, "Jim")
format1<-data.frame(var, num)
format1
var num
1 Id 1
2 Name Tom
3 Score 4
4 Id 2
5 Score 7
6 Id 3
7 Name Jim
Be careful: there are missing values in format1, and that's the challenge, I guess.
Id<-c(1, 2, 3)
Name<-c("Tom", NA, "Jim")
Score<-c(4, 7, NA)
format2<-data.frame(Id, Name, Score)
format2
Id Name Score
1 1 Tom 4
2 2 <NA> 7
3 3 Jim NA
# How to convert format1 into format2?
I may not have articulated this precisely, but you can refer to the toy data I gave above.
I know a little bit about reshape and reshape2; however, I failed to convert the data format using either of them.
format1$ID <- cumsum(format1$var == "Id")
format2 <- reshape(format1, idvar = "ID",timevar = "var", direction = "wide")[-1]
names(format2) <- gsub("num.", "", names(format2), fixed = TRUE)
# Id Name Score
# 1 1 Tom 4
# 4 2 <NA> 7
# 6 3 Jim <NA>
Alternatively, if you'd like to skip the gsub() step, you could directly specify the output column names via the varying argument:
reshape(format1, idvar = "ID",timevar = "var", direction = "wide",
varying = list(c("Id", "Name", "Score")))[-1]
You can use dcast after adding an identifier column.
format1$pk <- cumsum( format1$var=="Id" )
library(reshape2)
dcast( format1, pk ~ var, value.var="num" )
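Both answers depend on the same trick: cumsum(var == "Id") increases by one at every "Id" line, so it labels each record. The sketch below shows that identifier and then fills a character matrix by (record, field) position. This is only an illustration in base R, not the reshape() or dcast() code above; `num` is written as character here, and `wide` is a made-up name.

```r
var <- c("Id", "Name", "Score", "Id", "Score", "Id", "Name")
num <- c("1", "Tom", "4", "2", "7", "3", "Jim")

# Record identifier: goes up by one each time a new "Id" line starts
pk <- cumsum(var == "Id")
pk

# Empty result, one row per record and one column per field; missing fields stay NA
wide <- matrix(NA_character_, nrow = max(pk), ncol = 3,
               dimnames = list(NULL, c("Id", "Name", "Score")))

# Place every value at its (record, field) position
wide[cbind(pk, match(var, colnames(wide)))] <- num
wide
```

Records that lack a field, such as the name of record 2 or the score of record 3, simply keep their NA.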
