I'm new to programming in R, and I've been stuck on this problem for several days. I started with a list that I created by splitting a file. Each element of this list holds a number of items in a single row.
head(sales2)
> $`7143443`
>>[1] "SSS-U-CCXVCSS1" "L-CCX-8GETTS-LIC"
>$`7208993`
>>[1] "NFFGSR4=" "1MV-FT-1=" "VI-NT/TE="
>$`7241758`
>>[1] "PW_SQSGG="
>$`9273628`
>>[1] "O1941-SE9" "CCO887VA-K9" "2901-SEC/K9" "CO1941-C/K9"
>$`9371709`
>>[1] "HGR__SASS=" "WWQTTB0S-L" "WS-RRRT48FP" "WTTTF24PS-L"
[5] "GEDQTT8TS-L" "WD-SRNS-2S-L"
>$`9830473`
>>[1] "SPA$FFSB0S"
I wanted to convert it into a data frame, so I used
x<-do.call(rbind, lapply(sales2,data.frame))
It does get converted into a data frame, but the result looks like this:
> head(x,6)
id
> 7143443.1 "SSS-U-CCXVCSS1"
> 7143443.2 "L-CCX-8GETTS-LIC"
> 7208993.1 "NFFGSR4="
> 7208993.2 "1MV-FT-1="
> 7208993.3 "VI-NT/TE="
> 7241758 "PW_SQSGG="
I want all of the items for 7143443 in a single row, not spread across multiple rows.
From this I want to calculate how many rows contain two given items together,
for example "WS-C2960S-48TS-L" and "WS-C2960S-24TS-L":
in how many rows do these two elements both appear?
You could also express this as the probability of these two occurring together, relative to all other elements.
I am not sure what your final desired output is, but the following script converts your list to a data frame. Perhaps you can begin your analysis from this data frame.
# Create example list
sales2 <- list(`7143443` = c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC"),
`7208993` = c("NFFGSR4=", "1MV-FT-1=", "VI-NT/TE="),
`7241758` = "PW_SQSGG=",
`9273628` = c("O1941-SE9", "CCO887VA-K9", "2901-SEC/K9", "CO1941-C/K9"),
`9371709` = c("HGR__SASS=", "WWQTTB0S-L", "WS-RRRT48FP", "WTTTF24PS-L",
"GEDQTT8TS-L", "WD-SRNS-2S-L"),
`9830473` = "SPA$FFSB0S")
# Load packages
library(dplyr)
library(purrr)
dat <- map(sales2, ~ tibble(Value = .x)) %>% # Convert each list element to a data frame with a Value column
  bind_rows(.id = "ID") %>% # Combine all data frames, storing the list names in ID
  group_by(ID) %>% # Group by the first column
  summarise(Value = paste0(Value, collapse = " ")) # Collapse the second column
dat
# A tibble: 6 × 2
ID Value
<chr> <chr>
1 7143443 SSS-U-CCXVCSS1 L-CCX-8GETTS-LIC
2 7208993 NFFGSR4= 1MV-FT-1= VI-NT/TE=
3 7241758 PW_SQSGG=
4 9273628 O1941-SE9 CCO887VA-K9 2901-SEC/K9 CO1941-C/K9
5 9371709 HGR__SASS= WWQTTB0S-L WS-RRRT48FP WTTTF24PS-L GEDQTT8TS-L WD-SRNS-2S-L
6 9830473 SPA$FFSB0S
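For comparison, the same one-row-per-ID table can be sketched in base R, without dplyr or purrr. This is only a minimal sketch: it assumes sales2 is a named list of character vectors as in the example above, and the name dat_base is purely illustrative.

```r
# Example list, copied from the answer above
sales2 <- list(`7143443` = c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC"),
               `7208993` = c("NFFGSR4=", "1MV-FT-1=", "VI-NT/TE="),
               `7241758` = "PW_SQSGG=",
               `9273628` = c("O1941-SE9", "CCO887VA-K9", "2901-SEC/K9", "CO1941-C/K9"),
               `9371709` = c("HGR__SASS=", "WWQTTB0S-L", "WS-RRRT48FP", "WTTTF24PS-L",
                             "GEDQTT8TS-L", "WD-SRNS-2S-L"),
               `9830473` = "SPA$FFSB0S")

# Collapse each list element into one space-separated string,
# then pair it with the list name as the ID
dat_base <- data.frame(
  ID = names(sales2),
  Value = unname(vapply(sales2, paste, character(1), collapse = " ")),
  stringsAsFactors = FALSE
)
dat_base
```

Collapsing is convenient for display, but the uncollapsed list is usually easier to query.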
Update
After reading the original poster's comment, I decided to update my solution to count how many rows contain two specified string patterns.
Here one row corresponds to one unique ID, so I assume the request can be rephrased as "How many IDs contain two specified string patterns?" If that is the case, I would prefer not to collapse the observations, because after collapsing them to form one ID per row we would need a strategy to match the strings, such as regular expressions. I am not familiar with regular expressions, so I will leave that for others to provide solutions.
In addition, the original poster did not specify which two strings are targeted, so I developed a strategy in which users can replace the target strings case by case.
dat <- map(sales2, ~ tibble(Value = .x)) %>% # Convert each list element to a data frame with a Value column
  bind_rows(.id = "ID") # Combine all data frames, storing the list names in ID
# After this, there is no need to collapse the rows
# Set the target string, User can change the strings here
target_string1 <- c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC")
dat2 <- dat %>%
  filter(Value %in% target_string1) %>% # Keep rows matching one of the target strings
  distinct(ID, Value, .keep_all = TRUE) %>% # Keep only one row when ID and Value are exact duplicates
  count(ID) %>% # Count how many matching rows each ID has
  filter(n > 1) %>% # Keep only IDs that matched more than one target string
  select(ID)
dat2
# A tibble: 1 × 1
ID
<chr>
1 7143443
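The count and the "probability" asked for in the question can also be sketched directly on the list, without building a data frame first. This is a minimal base R sketch that assumes exact string matches are wanted; the names targets, has_both, n_both and prop_both are illustrative and not part of the original answer.

```r
# Example list, copied from the answer above
sales2 <- list(`7143443` = c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC"),
               `7208993` = c("NFFGSR4=", "1MV-FT-1=", "VI-NT/TE="),
               `7241758` = "PW_SQSGG=",
               `9273628` = c("O1941-SE9", "CCO887VA-K9", "2901-SEC/K9", "CO1941-C/K9"),
               `9371709` = c("HGR__SASS=", "WWQTTB0S-L", "WS-RRRT48FP", "WTTTF24PS-L",
                             "GEDQTT8TS-L", "WD-SRNS-2S-L"),
               `9830473` = "SPA$FFSB0S")

# The two items to look for (change these as needed)
targets <- c("SSS-U-CCXVCSS1", "L-CCX-8GETTS-LIC")

# TRUE for every ID whose items include all of the targets
has_both <- vapply(sales2, function(items) all(targets %in% items), logical(1))

n_both <- sum(has_both)      # number of IDs containing both items
prop_both <- mean(has_both)  # share of all IDs containing both items
ids_both <- names(sales2)[has_both]
```

Because all() is used, the same code works unchanged if targets holds more than two items.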
Related
I'm an R beginner and have a basic question I can't seem to figure out. I have row values that need to be separated into different columns, but there is more than one delimiter involved. The Expression_level column contains an expression value with its corresponding Ensembl gene ID, in the form exp value:ensembl, but there are sometimes 2 such pairs in the same row separated by ;. I want to have one column for the Ensembl ID and one for the gene expression value, but I'm not sure how to separate them while keeping each ID mapped to the correct expression value. This is the type of data I am working with: rna_seq, and this is what I am trying to get out: org_rna. TYIA
rna_seq= cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78"), "Expression_level" = c("1.62760683812965:ENSG00000174010", "-9.96578428466209:ENSG00000147183",
"-4.32192809488736:ENSG00000129680", "-1.39592867633114:ENSG00000152763;-9.96578428466209:ENSG00000231080"))
org_rna = cbind("Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78", "WDR78"), "Ensembl" = c("ENSG00000174010", "ENSG00000147183", "ENSG00000129680", "ENSG00000152763", "ENSG00000231080")
, "Expression" = c("1.62760683812965", "-9.96578428466209", "-4.32192809488736", "-1.39592867633114", "-9.96578428466209"))
library(tidyr)
library(dplyr)
rna_seq %>%
as.data.frame() %>%
# separate cells containing multiple values into
# multiple rows
separate_rows(Expression_level, sep = ";") %>%
# extract pairs
extract(col = Expression_level,
into = c("Expression", "Ensembl"),
regex = "(.*):(.*)")
# A tibble: 5 x 3
# Final_gene Expression Ensembl
# <chr> <chr> <chr>
# KLHL15 1.62760683812965 ENSG00000174010
# CPXCR1 -9.96578428466209 ENSG00000147183
# MAP7D3 -4.32192809488736 ENSG00000129680
# WDR78 -1.39592867633114 ENSG00000152763
# WDR78 -9.96578428466209 ENSG00000231080
Another (less elegant) solution using separate():
library(tidyr)
library(dplyr)
rna_seq |> as.data.frame() |>
# Separate any second IDs
separate(Expression_level, sep = ";", into = c("ID1", "ID2")) |>
# Reshape to longer (columns to rows)
pivot_longer(cols = starts_with("ID")) |>
# Separate Expression from Ensembl
separate(value, sep = ":", into = c("Expression", "Ensembl")) |>
filter(!is.na(Expression)) |>
select(Final_gene, Ensembl, Expression)
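The same split can also be sketched in base R with strsplit(), which may help show what the two delimiters are doing. This is only a sketch: it assumes every entry has the form value:ID, with multiple pairs separated by ;, and the names pairs, n_pairs, parts and org_rna_base are illustrative.

```r
rna_seq <- cbind(
  "Final_gene" = c("KLHL15", "CPXCR1", "MAP7D3", "WDR78"),
  "Expression_level" = c("1.62760683812965:ENSG00000174010",
                         "-9.96578428466209:ENSG00000147183",
                         "-4.32192809488736:ENSG00000129680",
                         "-1.39592867633114:ENSG00000152763;-9.96578428466209:ENSG00000231080")
)

# Step 1: split each cell on ";" into its value:ID pairs
pairs <- strsplit(rna_seq[, "Expression_level"], ";", fixed = TRUE)
n_pairs <- lengths(pairs)  # how many pairs each gene has

# Step 2: split every pair on ":" and stack the pieces into a two-column matrix
parts <- do.call(rbind, strsplit(unlist(pairs), ":", fixed = TRUE))

# Step 3: repeat each gene name once per pair so the rows stay aligned
org_rna_base <- data.frame(
  Final_gene = rep(rna_seq[, "Final_gene"], times = n_pairs),
  Ensembl = parts[, 2],
  Expression = as.numeric(parts[, 1]),
  stringsAsFactors = FALSE
)
org_rna_base
```

Unlike the tidyr answers, this sketch converts Expression to numeric; drop as.numeric() to keep it as character.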
I am analysing some fMRI data. In particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:
library(httr)
scrape_and_sort = function(neurosynth_link){
result = content(GET(neurosynth_link), "parsed")$data
names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
df$z_score = as.numeric(df$z_score)
df = df[order(-df$z_score), ]
df = df[which(df$z_score >= 3), ] # keep z-scores of at least 3 (safe even when no row falls below 3)
df = na.omit(df)
return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')
Now, I want to know which keywords come up most often and ideally construct a list of the most common words. I tried the following:
sort(table(RO4$Name),decreasing=TRUE)
But this clearly won't work. The problem is that the names (for example "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.
I am not sure how to search inside each string and record the individual words like that. Any ideas?
Using the packages {jsonlite}, {dplyr} and {tidyr} (for separate_rows()), with the pipe operator %>% for legibility:
Store the response as data frame df:
library(dplyr)
library(tidyr)
url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame
reshape and aggregate
df %>%
## keep first column only and name it 'keywords':
select('keywords' = 1) %>%
## multiple cell values (as separated by a blank)
## into separate rows:
separate_rows(keywords, sep = " ") %>%
group_by(keywords) %>%
summarise(count = n()) %>%
arrange(desc(count))
result:
# A tibble: 965 x 2
keywords count
<chr> <int>
1 cortex 53
2 gyrus 26
3 temporal 26
4 parietal 23
5 task 22
6 anterior 19
7 frontal 18
8 visual 17
9 memory 16
10 motor 16
# ... with 955 more rows
edit: or, if you want to proceed from your dataframe
RO4 %>%
select(Name) %>%
## select(everything())
## select(Name:func_con)
separate_rows(Name, sep=' ') %>%
## count each word, most frequent first
count(Name, sort = TRUE)
You can of course select more columns in a number of convenient ways (see commented lines above and ?dplyr::select). Mind that the values of the other variables will be repeated as many times as rows are needed to accommodate any multi-word value in column "Name", so that will introduce some redundancy.
If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores would read like this:
RO4 %>%
filter(z_score >= 3 & !is.na(z_score)) %>%
arrange(desc(z_score))
I'm not sure I understand. Can't you proceed like this:
x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex" "auditory" "auditory" "hello" "friend"
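To get from the split words to the frequency list asked for in the question, the result can be passed to table() and sorted. This is a minimal sketch on the same toy vector; with the real data, x would be RO4$Name, and the names words and word_counts are illustrative.

```r
x <- c("auditory cortex", "auditory", "auditory", "hello friend")

# Split every string on spaces and flatten into one vector of words
words <- unlist(strsplit(x, " ", fixed = TRUE))

# Count each distinct word and put the most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
word_counts
```

Here "auditory" is counted three times, once for each string it appears in.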
I have a column in a data frame in R that contains sample names. Some names are identical except that they end in A or B, and some samples are repeated, like this:
df <- data.frame(Samples = c("S_026A", "S_026B", "S_028A", "S_028B", "S_038A", "S_040_B", "S_026B", "S_38A"))
What I am trying to do is to isolate all sample names that have an A and B at the end and not include the sample names that only have either A or B.
The end result of what I'm looking for would look like this:
"S_026" and "S_028" as these are the only ones that have A and B at the end.
All I seem to find is how to remove duplicates, and removing duplicates would only give me "S_026B" and "S_38A" in this case.
Alternatively, I have tried stripping the A's and B's at the end and then checking which of the resulting names occur at least twice, but again, this does not give me the desired results.
Any suggestions?
We could group by the sample name without its last character, use substring to get the last character of each name, and then check whether both 'A' and 'B' occur among the last characters in each group
library(dplyr)
df %>%
group_by(grp = substr(Samples, 1, nchar(Samples)-1)) %>%
filter(all(c("A", "B") %in% substring(Samples, nchar(Samples)))) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 5 x 1
Samples
<chr>
1 S_026A
2 S_026B
3 S_028A
4 S_028B
5 S_026B
You can extract the last character of Samples into a separate column, keep only those groups that have both 'A' and 'B', and then keep only the unique values.
library(dplyr)
library(tidyr)
df %>%
extract(Samples, c('value', 'last'), '(.*)(.)') %>%
group_by(value) %>%
filter(all(c('A', 'B') %in% last)) %>%
ungroup %>%
distinct(value)
# value
# <chr>
#1 S_026
#2 S_028
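The same idea can be sketched in base R: strip the last character to get a stem, then keep the stems whose set of last characters includes both A and B. This is a minimal sketch on the sample names from the question; the names stem, suffix, suffix_by_stem and both are illustrative.

```r
samples <- c("S_026A", "S_026B", "S_028A", "S_028B",
             "S_038A", "S_040_B", "S_026B", "S_38A")

# Everything except the last character, and the last character itself
stem <- substr(samples, 1, nchar(samples) - 1)
suffix <- substring(samples, nchar(samples))

# Group the suffixes by stem, then test each group for both "A" and "B"
suffix_by_stem <- split(suffix, stem)
has_ab <- vapply(suffix_by_stem,
                 function(s) all(c("A", "B") %in% s),
                 logical(1))

both <- names(suffix_by_stem)[has_ab]
both
```

Repeated names such as the second "S_026B" do not affect the result, because only the presence of each suffix is tested.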
So basically I have a vector of tags that I want to find in my Transcript column (row by row) and if I find any word from the tags in my Transcript string, I want to create a separate column concatenating all the tags as shown in the example below (see image):
tags=c("loan","deposit","quarter","morning")
So, the output should look like this:
Output Result
Currently, I am able to tag this by using two for loops i.e. one to go over Tags vector and the other to go over my data frame's Transcript column one-by-one. But, I have a tag list of around 500 words and data frame has more than 100,000 rows. So, I am concerned about the run time. Is there any better way to optimize my R code using apply function or any other method?
I am using the following code to tag all the rows of the Transcript column one by one:
for (i in 1:length(tags)) {
for (j in 1:nrow(FinalData)){
check_tag <- str_extract(string = FinalData$Cleaned_Transcript[j], pattern = tags[i])
if (is.na(check_tag)==FALSE) {
FinalData$Tags[j] <- stri_remove_empty(paste(FinalData$Tags[j],check_tag,sep = ","))
}
}
}
Not sure if you are open to not using a for loop, but if so, here's a tidyverse approach.
library(tidyverse)
dat <- data.frame(Transcript = c("This is example text a", "this is loan", "deposit is not quarter"))
# as per comment from TO, we want to provide an input vector of tags
my_tags <- c("loan", "deposit", "quarter", "morning")
my_tags_collapsed <- str_c(my_tags, collapse = "|")
# We can now use the collapsed tags in the str_extract_all function
dat %>%
mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
unnest_wider(test) %>%
mutate(across(-Transcript, ~ replace_na(.x, ""))) %>%
mutate(Tags_Marked = apply(across(-Transcript), 1, str_c, collapse = ",")) %>%
select(Transcript, Tags_Marked)
Which gives:
# A tibble: 3 x 2
Transcript Tags_Marked
<chr> <chr>
1 This is example text a ,
2 this is loan loan,
3 deposit is not quarter deposit,quarter
Admittedly, this is not 100% ok, since you still get the comma separator for 0-length characters.
An alternative could be to not concatenate the strings into one column, but to keep them as separate columns, which would mean that you could stop much earlier:
dat %>%
mutate(test = str_extract_all(Transcript, my_tags_collapsed)) %>%
unnest_wider(test)
which would give you:
# A tibble: 3 x 3
Transcript ...1 ...2
<chr> <chr> <chr>
1 This is example text a NA NA
2 this is loan loan NA
3 deposit is not quarter deposit quarter
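The stray comma mentioned above can be avoided by pasting only the tags that were actually found in each row. This is a minimal base R sketch on the same toy data, not a benchmarked solution for 100,000 rows; it treats the tags as a regular expression alternation, like the answer above, and the names transcripts, matches and tags_marked are illustrative.

```r
transcripts <- c("This is example text a", "this is loan", "deposit is not quarter")
tags <- c("loan", "deposit", "quarter", "morning")

# One alternation pattern covering all tags
pattern <- paste(tags, collapse = "|")

# All matches per transcript (character(0) where nothing matches)
matches <- regmatches(transcripts, gregexpr(pattern, transcripts))

# Join the distinct matches of each row; an empty match set gives ""
tags_marked <- vapply(matches,
                      function(m) paste(unique(m), collapse = ","),
                      character(1))
tags_marked
```

Both regmatches() and gregexpr() are vectorised over the transcripts, so no explicit loop over rows or tags is needed.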
In a data frame, I have a column (type: chr) that contains answers separated by a comma. I want to create another column based on the size of the string and award points. For example, some of the entries in a column are:
Column1
word1,word2,word3
word1,word2
word1
Now, for the first cell, I want the size of the cell to be evaluated as 3 (as it contains three distinct words and there are no duplicates in the cell values). I'm not sure how to achieve this.
An option is to split the column with strsplit into a list of vectors, get the unique elements by looping over the list with lapply and get the lengths
df1$Size <- lengths(lapply(strsplit(df1$Column1, ",\\s*"), unique))
Another option is separate_rows from tidyr
library(dplyr)
library(tidyr)
df1 %>%
mutate(rn = row_number()) %>%
separate_rows(Column1) %>%
group_by(rn) %>%
summarise(Size = n_distinct(Column1), .groups = 'drop') %>%
select(Size) %>%
bind_cols(df1, .)
-output
# Column1 Size
#1 word1,word2,word3 3
#2 word1,word2 2
#3 word1 1
data
df1 <- data.frame(Column1 = c('word1,word2,word3', 'word1,word2', 'word1'))
Original Answer:
Another option:
library(dplyr)
library(stringr)
df1 %>%
mutate(Lengths = str_count(Column1, ",") + 1)
Edit:
I hadn't read the OP's requirements properly (about non-duplicates). As @Onyambu pointed out in the comments, this chunk will only work if there are no duplicated words in the data.
It basically counts how many words there are.
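The difference between counting words and counting distinct words only shows up when a cell repeats a word. The sketch below adds a fourth, hypothetical entry ("word1,word1,word2") that is not in the original data to illustrate this; the names n_words and n_distinct_words are illustrative.

```r
# The last entry is a made-up example with a duplicated word
column1 <- c("word1,word2,word3", "word1,word2", "word1", "word1,word1,word2")

words <- strsplit(column1, ",", fixed = TRUE)

# Counts every word, like str_count(Column1, ",") + 1
n_words <- lengths(words)

# Counts each word once per cell, like the strsplit()/unique() answer
n_distinct_words <- lengths(lapply(words, unique))

data.frame(column1, n_words, n_distinct_words)
```

The two counts agree for the first three cells and differ only for the duplicated one.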