Subset the Orginal dataframe with different combinations of 2 factor variables - r

I have a dataset with 11 columns and 18350 observations which has a variable company and region. There are 9 companies(company-0) spread across 5 regions(region-0 to region-5) and not all companies are present at all regions. I want to create a seperate dataframe for each combination of company and region.You can see like this-
company0-region1,
company0-region10,
company0-region7,
company1-region5,
company2-region0,
company3-region2,
company4-region3,
company5-region7,
company6-region6,
company8-region9,
company9-region8
Thus I need 11 different dataframes in R.No other combinations are possible
Any other approach would be highly appreciated.
Thanks in Advance
I used split function to get a list-
p<-split(tsog1,list(tsog1$company),drop=TRUE)
Now I have a list of dataframes and I can't convert the each element of that list into an individual dataframe.
I tried using loops too, but can't get a unique named dataframe.
v<-c(1:9)
p<-levels(tsog1$company)
for (x in v)
{
x.tsog1<-subset(tsog1,tsog1$company==p[x])
}
Dataset Image

You can create a column for the region company combination and split by that column.
For example:
library(tidyverse)
# Create a df with 9 regions, 6 companies, and some dummy observations (3 per case)
df <- expand.grid(region = 0:8, company = 0:5, dummy = 1:3 ) %>%
mutate(x = round(rnorm((54*3)),2)) %>%
select(-dummy) %>% as_tibble()
# Create the column to split, and split.
df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
Now, what to do once you have the list of data frames, depends on your next steps. If you want to for example, save them, you can do walk or lapply.
For saving:
df_list <- df %>%
mutate(region_company = paste(region,company, sep = '_')) %>%
split(., .$region_company)
iwalk(df_list,function(df, nm){
write_csv(df, paste0(nm,'.csv'))
})
Or if you simply wants to access it:
> df_list$`0_4`
# A tibble: 3 x 4
region company x region_company
<int> <int> <dbl> <chr>
1 0 4 0.54 0_4
2 0 4 1.61 0_4
3 0 4 0.16 0_4

Related

How do I find the most common words in a character vector in R?

I am analysing some fmri data – in particular, I am looking at what sorts of cognitive functions are associated with coordinates from an fmri scan (conducted while subjects were performing a task. My data can be obtained with the following function:
library(httr)
scrape_and_sort = function(neurosynth_link){
result = content(GET(neurosynth_link), "parsed")$data
names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
df$z_score = as.numeric(df$z_score)
df = df[order(-df$z_score), ]
df = df[-which(df$z_score<3),]
df = na.omit(df)
return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')
Now, I want know which key words are coming up most often and ideally construct a list of the most common words. I tried the following:
sort(table(RO4$Name),decreasing=TRUE)
But this clearly won't work.The problem is that the names (for example: "auditory cortex") are strings with multiple words in, so results such 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.
But I am not sure how to search inside each string and record individual words like that. Any ideas?
using packages {jsonlite}, {dplyr} and the pipe operator %>% for legibility:
store response as dataframe df
url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame
reshape and aggregate
df %>%
## keep first column only and name it 'keywords':
select('keywords' = 1) %>%
## multiple cell values (as separated by a blank)
## into separate rows:
separate_rows(keywords, sep = " ") %>%
group_by(keywords) %>%
summarise(count = n()) %>%
arrange(desc(count))
result:
+ # A tibble: 965 x 2
keywords count
<chr> <int>
1 cortex 53
2 gyrus 26
3 temporal 26
4 parietal 23
5 task 22
6 anterior 19
7 frontal 18
8 visual 17
9 memory 16
10 motor 16
# ... with 955 more rows
edit: or, if you want to proceed from your dataframe
RO4 %>%
select(Name) %>%
## select(everything())
## select(Name:func_con)
separate_rows(Name, sep=' ') %>%
## do remaining stuff
You can of course select more columns in a number of convenient ways (see commented lines above and ?dplyr::select). Mind that values of the other variables will repeated as many times as rows are needed to accomodate any multivalue in column "Name", so that will introduce some redundancy.
If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores would read like this:
RO4 %>%
filter(z_score < 3 & !is.na(z_score)) %>%
arrange(desc(z_score))
Not sure to understand. Can't you proceed like this:
x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex" "auditory" "auditory" "hello" "friend"

Obtain a Count of all the combinations created in a column when grouping by another column in df with different length combinations in R

Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output
I have tried about a dozen different ideas but not getting close. Closest was setting up a crosstab but didn't know how to get counts from that. Long/wide got me nowhere. etc. Too new still to think out of the box I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to those expected:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
arrange(Guest,State) %>%
group_by(Guest) %>%
summarise(Chain=paste0(State,collapse = '-')) %>%
group_by(Chain,.drop = T) %>%
summarise(N=n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
We can use base R with aggregate and table
table(aggregate(State~ Guest, df[do.call(order, df),], paste, collapse='-')$State)
-output
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1

Looping dplyr and creating multiple dataframe

I have a large dataset in which I would like to use dplyr and filter and select the data to create 12 separate dataframes.
Essentially, I am using only two columns of data from a larger dataset. The first column is "plot", where I filter by "plot" number and another condition in another 3rd column ("pos_ID"). I want to create a loop that filters by plot number (I tried plot==[i]) and the 3rd condition, and then creates a new dataframe. The loop would repeat 12 times (because plot spans from 1-12).
Here is the code that I used without a loop (based on sample data)
p1_Germ <- data %>% #p1 stands for plot 1
filter(plot==1, pos_ID<21) %>%
select(germ_bin)
Here is the code that I tried to incorporate a loop (based on sample data)
for(i in seq_along(plot)) {
data %>%
group_by(plot[[i]], pos_ID<21) %>%
select(germ_bin)
}
Here is some sample data
plot <- c(1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12)
germ_bin <- c(0,0,1,0,1,0,0,1,1,0,1,1,0,1,0,1,0,1,1,0,1,0,1,0)
pos_ID <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24)
dataset <- data.frame(plot, germ_bin, pos_ID)
dataset
My guess is to use a list, but I'm not familiar with loops and list and could not find a solution online. I need to create 12 dataframes because I'm trying to convert them each into a matrix after for another function. Any helpful would be much appreciated!
We can use group_split and map to filter based on criteria to get list of dataframes.
library(dplyr)
library(purrr)
dataset %>%
group_split(plot) %>%
map(. %>% filter(pos_ID < 21) %>% select(germ_bin))
#[[1]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 0
#2 0
#[[2]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#[[3]]
# A tibble: 2 x 1
# germ_bin
# <dbl>
#1 1
#2 0
#....
For the shared example, if you want to drop empty groups you can filter first
dataset %>%
filter(pos_ID < 21) %>%
group_split(plot) %>%
map(. %>% select(germ_bin))
As far your attempt with for loop is concerned, you can correct that by doing
unique_plot <- unique(dataset$plot)
plot_list <- list(length = length(unique_plot))
for(i in seq_along(unique_plot)) {
plot_list[[i]] <- dataset %>%
filter(plot == unique_plot[i], pos_ID<21) %>%
select(germ_bin)
}
Or keeping it completely in base R
lapply(split(dataset, dataset$plot), function(x)
subset(x, pos_ID < 21, select = germ_bin, drop = FALSE))

Reshaping a data frame with a list column produced by tidygraph

I am working with the tidygraph package and try to find a "tidy" solution
for the example below. The problem is not really tied to tidygraph and more about data wrangling but I think it is interesting for people working with this package.
In the following code chunk I just generate some sample data.
library(tidyverse)
library(tidygraph)
library(igraph)
library(randomNames)
library(reshape2)
graph <- play_smallworld(1, 100, 3, 0.05)
labeled_graph <- graph %>%
activate(nodes) %>%
mutate(group = sample(letters[1:3], size = 100, replace = TRUE),
name = randomNames(100)
)
sub_graphs_df <- labeled_graph %>%
morph(to_split, group) %>%
crystallise()
The resulting data.frame looks as follows:
sub_graphs_df
# A tibble: 3 x 2
name graph
<chr> <list>
1 group: a <S3: tbl_graph>
2 group: b <S3: tbl_graph>
3 group: c <S3: tbl_graph>
Now to my actual problem. I want do apply a function to each element in the column graph. The result is simply a named vector.
sub_graphs_df$graph %>% map(degree)
The first thing I do not like is the subsetting by $. Is there a better way?
Next, I want to reshape this result into only one data.frame with 3 columns. One column for name (the name attribute of the vectors), one for group (the name attribute of the list) and one for the number (the elements of the vectors).
I tried melt from the reshape2 package.
sub_graphs_df$graph %>% map(degree) %>% melt
It works decently but the names are lost and as I read it, one should use
tidyr instead. However, I could not get gather to work because only data.frames are accepted.
Another option would be unlist:
sub_graphs_df$graph %>% map(degree) %>% unlist
Here the group and the name are in the names attribute and I would have to recover them with regular expressions.
I am pretty sure there is an easy way I just could not think of.
We can create a list column with mutate while applying the function with map, extract the names and integer and unnest to create the 'long' format dataset
sub_graphs_df %>%
mutate(newout = map(graph, degree)) %>%
transmute(name, group = map(newout, ~.x %>% names), number = map(newout, as.integer)) %>%
unnest
# A tibble: 100 x 3
# name group number
# <chr> <chr> <int>
# 1 group: a Seng, Trevor 0
# 2 group: a Buccieri, Joshua 1
# 3 group: a Street, Aimee 2
# 4 group: a Gonzalez, Corey 2
# 5 group: a Barber, Monique 1
# 6 group: a Doan, Christina 1
# 7 group: a Ninomiya, Janna 1
# 8 group: a Bazemore, Chao 1
# 9 group: a Perfecto, Jennifer 1
#10 group: a Lopez Jr, Vinette 0
# ... with 90 more rows

R: Add new column to dataframe using function

I have a data frame df that has two columns, term and frequency. I also have a list of terms with given IDs stored in a vector called indices. To illustrate these two info, I have the following:
> head(indices)
Term
1 hello
256 i
33 the
Also, for the data frame.
> head(df)
Term Freq
1 i 24
2 hello 12
3 the 28
I want to add a column in df called TermID which will just be the index of the term in the vector indices. I have tried using dplyr::mutate but to no avail. Here is my code below
library(dplyr)
whichindex <- function(term){
ind <- which(indices == as.character(term))
ind}
mutate(df, TermID = whichindex(Term))
What I am getting as output is a df that has a new column called TermID, but all the values for TermID are the same.
Can someone help me figure out what I am doing wrong? It would be nice as well if you can recommend a more efficient algorithm to do this in [R]. I have implemented this in Python and I have not encountered such issues.
Thanks in advance.
what about?
df %>% rowwise() %>% mutate(TermID = grep(Term,indices))
w/ example data:
library(dplyr)
indices <- c("hello","i","the")
df <- data_frame(Term = c("i","hello","the"), Freq = c(24,12,28))
df_res <- df %>% rowwise() %>% mutate(TermID = grep(Term,indices))
df_res
gives:
Source: local data frame [3 x 3]
Groups: <by row>
Term Freq TermID
1 i 24 2
2 hello 12 1
3 the 28 3

Resources