Hi, I have not seen a solution to this particular problem. I am trying to write a regex pattern to extract the characters inside the { } following the word "major" and place them in a major column. However, major repeats in row 2, and I need to extract and combine the characters from both { } groups. Ideally I would do this for the minor and incidental attributes as well. Not sure what I am getting wrong here. Thanks!
test <- data.frame(lith=c("major{basalt} minor{andesite} incidental{dacite rhyolite}",
"major {andesite flows} major {dacite flows}",
"major{andesite} minor{dacite}",
"major{basaltic andesitebasalt}"))
test %>%
mutate(major = str_extract_all(test$lith, "[major].*[{](\\D[a-z]*)[}]") %>%
map_chr(toString))
What I am looking for:
  major                        minor    incidental
1 basalt                       andesite dacite rhyolite
2 andesite flows, dacite flows <NA>     <NA>
3 andesite                     dacite   <NA>
4 basaltic andesitebasalt      <NA>     <NA>
First: (almost) never use test$ within a dplyr pipe that starts with test %>%. At best it's just a little inefficient; if any intermediate steps re-order, alter, or filter the data, then the results will be either (a) an error, which is the preferred failure; or (b) silently just wrong. The reason: let's say you do
test %>%
filter(grepl("[wy]", lith)) %>%
mutate(major = str_extract_all(test$lith, ...))
In this case, the filter reduced the data from 4 rows to just 2 rows. However, since you're using test$lith, that's taken from the contents of test before the pipe started, so here test$lith is length-4 where we need it to be length-2.
Alternatively (and preferred),
test %>%
filter(grepl("[wy]", lith)) %>%
mutate(major = str_extract_all(lith, ...))
Here, the str_extract_all(lith, ...) sees only two values, not the original four.
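To see the failure concretely, run the bad version after the filter. A sketch, with stringr attached and reusing the pattern from the answer below; the exact error wording depends on your dplyr version:
test %>%
  filter(grepl("[wy]", lith)) %>%  # keeps only rows 1 and 2
  mutate(major = str_extract_all(test$lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"))
# Error in `mutate()`: `major` must be size 2, not 4.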
On to the regularly-scheduled answer ...
I'll add a row-number column rn as a reference to the original rows. This is both functional (things need it to work internally) and useful in case you need to tie the results back to the original data somehow. I'm inferring that you want the values grouped together as strings instead of list-columns, though it's easy enough to switch to the latter if desired (see the sketch after the output below).
library(dplyr)
library(stringr) # str_extract_all
library(tidyr) # unnest, pivot_wider
test %>%
  mutate(
    rn = row_number(),
    # pull out each "word{...}" chunk as its own string
    tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
    # split each chunk into the category (before the brace) and the value (inside it)
    tmp = lapply(tmp, function(z) strcapture("^([^{}]*) ?\\{(.*)\\}", z, list(categ = "", val = "")))
  ) %>%
  unnest(tmp) %>%
  mutate(across(c(categ, val), trimws)) %>%
  group_by(rn, categ) %>%
  summarize(val = paste(val, collapse = ", ")) %>%
  pivot_wider(id_cols = rn, names_from = "categ", values_from = "val") %>%
  ungroup()
# # A tibble: 4 x 4
# rn incidental major minor
# <int> <chr> <chr> <chr>
# 1 1 dacite rhyolite basalt andesite
# 2 2 NA andesite flows, dacite flows NA
# 3 3 NA andesite dacite
# 4 4 NA basaltic andesitebasalt NA
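For reference, here's a sketch of the list-column variant mentioned above: drop the group_by/summarize step and let pivot_wider collect repeated categories with values_fn = list.
test %>%
  mutate(
    rn = row_number(),
    tmp = str_extract_all(lith, "\\b([[:alpha:]]+) ?\\{([^}]+)\\}"),
    tmp = lapply(tmp, function(z) strcapture("^([^{}]*) ?\\{(.*)\\}", z, list(categ = "", val = "")))
  ) %>%
  unnest(tmp) %>%
  mutate(across(c(categ, val), trimws)) %>%
  # repeated categories within a row (e.g. the two "major"s) become list elements
  pivot_wider(id_cols = rn, names_from = "categ", values_from = "val", values_fn = list)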
I am analysing some fMRI data. In particular, I am looking at which cognitive functions are associated with coordinates from an fMRI scan (conducted while subjects were performing a task). My data can be obtained with the following function:
library(httr)
scrape_and_sort = function(neurosynth_link){
result = content(GET(neurosynth_link), "parsed")$data
names = c("Name", "z_score", "post_prob", "func_con", "meta_analytic")
df = do.call(rbind, lapply(result, function(x) setNames(as.data.frame(x), names)))
df$z_score = as.numeric(df$z_score)
df = df[order(-df$z_score), ]
df = df[which(df$z_score >= 3), ] # safer than -which(df$z_score < 3), which drops every row when nothing matches
df = na.omit(df)
return(df)
}
RO4 = scrape_and_sort('https://neurosynth.org/api/locations/-58_-22_6_6/compare')
Now, I want to know which keywords come up most often, and ideally construct a list of the most common words. I tried the following:
sort(table(RO4$Name),decreasing=TRUE)
But this clearly won't work. The problem is that the names (for example, "auditory cortex") are strings containing multiple words, so results such as 'auditory' and 'auditory cortex' come out as two separate entries, whereas I want them counted as two instances of 'auditory'.
But I am not sure how to search inside each string and count individual words like that. Any ideas?
Using the packages {jsonlite}, {dplyr}, and {tidyr}, with the pipe operator %>% for legibility.
Store the response as dataframe df:
library(dplyr)
library(tidyr) # separate_rows
url <- 'https://neurosynth.org/api/locations/-58_-22_6_6/compare/'
df <- jsonlite::fromJSON(url) %>% as.data.frame
Reshape and aggregate:
df %>%
## keep first column only and name it 'keywords':
select('keywords' = 1) %>%
## split multiple cell values (separated by a blank)
## into separate rows:
separate_rows(keywords, sep = " ") %>%
group_by(keywords) %>%
summarise(count = n()) %>%
arrange(desc(count))
Result:
# A tibble: 965 x 2
keywords count
<chr> <int>
1 cortex 53
2 gyrus 26
3 temporal 26
4 parietal 23
5 task 22
6 anterior 19
7 frontal 18
8 visual 17
9 memory 16
10 motor 16
# ... with 955 more rows
Edit: or, if you want to proceed from your dataframe:
RO4 %>%
select(Name) %>%
## select(everything())
## select(Name:func_con)
separate_rows(Name, sep=' ') %>%
## do remaining stuff
You can of course select more columns in a number of convenient ways (see the commented lines above and ?dplyr::select). Mind that the values of the other variables will be repeated in as many rows as are needed to accommodate any multi-word value in column "Name", so this will introduce some redundancy.
If you want to adopt {dplyr} style, arranging by descending z-score and excluding unwanted z-scores (the original code keeps z_score >= 3) would read like this:
RO4 %>%
filter(z_score >= 3) %>% # rows where the condition is NA are dropped automatically
arrange(desc(z_score))
Not sure I understand. Can't you proceed like this:
x <- c("auditory cortex", "auditory", "auditory", "hello friend")
unlist(strsplit(x, " "))
# "auditory" "cortex" "auditory" "auditory" "hello" "friend"
# Wrong: `penguins$`-indexing pulls the full, ungrouped column
penguins %>% group_by(island, species) %>% drop_na() %>%
  summarise(meaxbill = max(penguins$bill_length_mm))

# Right: the bare column name respects the grouping
penguins %>% group_by(island, species) %>% drop_na() %>%
  summarise(meaxbill = max(bill_length_mm))
I'll word it a little more strongly: when using the pipe operator %>% and the dplyr package, you should not use the dataframe name together with the column names ($-indexing). While it sometimes works, if anything in the pipeline removes, adds, or reorders the rows, then your subsequent calculations will be wrong. It isn't just that you don't need the dataframe name; it's that if you do use it, you are likely corrupting your data. The first code block is broken, do not trust it. (Whether its output is truly corrupted may be contextual; I don't know if it corrupts it here.)
Let me demonstrate. If we want to know the max bill length (mm) of all of the penguins, by sex, we should do something like this:
library(dplyr)
data("penguins", package = "palmerpenguins")
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(bill_length_mm))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female 58
# 2 male 59.6
If for some reason we instead use penguins$bill_length_mm, then we'll see this:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(penguins$bill_length_mm))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female NA
# 2 male NA
which will likely encourage us to add na.rm=TRUE to the call, and we'll get a seemingly valid-ish number:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(maxbill = max(penguins$bill_length_mm, na.rm = TRUE))
# # A tibble: 2 x 2
# sex maxbill
# <fct> <dbl>
# 1 female 59.6
# 2 male 59.6
but the problem is that max(.) is being passed all of penguins$bill_length_mm, not just the values within each group.
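We can check where that repeated 59.6 comes from; it is just the ungrouped maximum of the whole column:
max(penguins$bill_length_mm, na.rm = TRUE)
# [1] 59.6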
In this case, the use of penguins$ is not a syntax error, it is a logical error, and there is no way for dplyr or anything else in R to know that what you are doing is not what you really need. It works, because max(.) sees a vector and it returns a single number; then summarize(.) sees a single number and assigns it to a new variable.
And in this case, our results are corrupted.
The only time it may be valid to use penguins$ in this pipeline is if we truly need to bring in a number or object from outside the current "view" of the data. Realize that the data that summarize(.) sees is not the data that started in the pipe: it has been filtered (by drop_na()), and it might have been changed (had we mutated some columns) or reordered (had we arranged the data).
But if we need to find out the percentage of the max bill length with respect to the max of the original data, we might do this:
penguins %>%
drop_na() %>%
group_by(sex) %>%
summarize(
maxbill = max(bill_length_mm),
maxbill_ratio = max(bill_length_mm) / max(penguins$bill_length_mm, na.rm = TRUE)
)
# # A tibble: 2 x 3
# sex maxbill maxbill_ratio
# <fct> <dbl> <dbl>
# 1 female 58 0.973
# 2 male 59.6 1
(Recall that we needed to add na.rm=TRUE in that call because one of the rows has an NA ... and the data we see in that last max has not been filtered/cleaned by the drop_na() call.)
Sample data frame
Guest <- c("ann","ann","beth","beth","bill","bill","bob","bob","bob","fred","fred","ginger","ginger")
State <- c("TX","IA","IA","MA","AL","TX","TX","AL","MA","MA","IA","TX","AL")
df <- data.frame(Guest,State)
Desired output (a count of each distinct chain of states per guest):
Chain      N
AL-MA-TX   1
AL-TX      2
IA-MA      2
IA-TX      1
I have tried about a dozen different ideas but not getting close. Closest was setting up a crosstab but didn't know how to get counts from that. Long/wide got me nowhere. etc. Too new still to think out of the box I guess.
Try this approach. You can arrange your values and then use group_by() and summarise() to reach a structure similar to the one expected:
library(dplyr)
library(tidyr)
#Code
new <- df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste0(State, collapse = '-')) %>%
  group_by(Chain, .drop = TRUE) %>%
  summarise(N = n())
Output:
# A tibble: 4 x 2
Chain N
<chr> <int>
1 AL-MA-TX 1
2 AL-TX 2
3 IA-MA 2
4 IA-TX 1
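A slightly shorter variant of the same idea, assuming a dplyr version recent enough for count()'s name argument (>= 0.8):
df %>%
  arrange(Guest, State) %>%
  group_by(Guest) %>%
  summarise(Chain = paste0(State, collapse = '-')) %>%
  count(Chain, name = "N")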
We can use base R with aggregate and table:
table(aggregate(State ~ Guest, df[do.call(order, df), ], paste, collapse = '-')$State)
Output:
# AL-MA-TX AL-TX IA-MA IA-TX
# 1 2 2 1
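The same one-liner decomposed into steps for readability (the intermediate names are mine):
ord <- df[do.call(order, df), ]              # sort rows by Guest, then State
agg <- aggregate(State ~ Guest, ord,         # one "-"-separated chain per guest
                 paste, collapse = '-')
table(agg$State)                             # count identical chains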
I have extremely messy data. A portion of it looks like the following example.
x1_01=c("bearing_coordinates", "bearing_coordinates", "bearing_coordinates", "roadkill")
x1_02=c(146,122,68,1)
x2_01=c("tree_density","animals_on_road","animals_on_road", "tree_density")
x2_02=c(13,2,5,11)
x3_01=c("animals_on_road", "tree_density", "roadkill", "bearing_coordinates")
x3_02=c(3,10,1,1000)
x4_01=c("roadkill","roadkill", "tree_density", "animals_on_road")
x4_02=c(1,1,12,6)
testframe = data.frame(x1_01 = x1_01,x1_02=x1_02,x2_01=x2_01, x2_02=x2_02, x3_01=x3_01, x3_02=x3_02, x4_01=x4_01, x4_02=x4_02)
x1_01 x1_02 x2_01 x2_02 x3_01 x3_02 x4_01
1 bearing_coordinates 146 tree_density 13 animals_on_road 3 roadkill
2 bearing_coordinates 122 animals_on_road 2 tree_density 10 roadkill
3 bearing_coordinates 68 animals_on_road 5 roadkill 1 tree_density
4 roadkill 1 tree_density 11 bearing_coordinates 1000 animals_on_road
x4_02
1 1
2 1
3 12
4 6
I noticed when using tidyr's spread that if I spread x1_01 and x1_02 on the initial datasheet, e.g.
test <- testframe %>%
spread(x1_01, x1_02)
and then used spread on that dataframe for x2_01 and x2_02, e.g.
testtest <- test %>%
spread(x2_01, x2_02)
that the second "bearing_coordinates" column would replace the original column, and result in NAs where there were values. To get around that, I went down the route of creating multiple dataframes and merging them together, e.g.
test <- testframe %>%
spread(x1_01, x1_02) %>%
mutate(id = row_number())
test2 <- testframe %>%
spread(x2_01, x2_02) %>%
mutate(id = row_number())
test3 <- testframe %>%
spread(x3_01, x3_02) %>%
mutate(id = row_number())
test4 <- testframe %>%
spread(x4_01, x4_02) %>%
mutate(id = row_number())
merge_test <- merge(test, test2, by="id")
merge_test2 <- merge(merge_test, test3, by ="id")
merge_test3 <- merge(merge_test2, test4, by = "id")
This (long-winded) approach is ok for a small dataset, like the test data I have supplied. However, as the variables increase (x5_01, x5_02, x6_01, x6_02, etc.), columns begin to be duplicated and to delete the previously created columns (e.g. "bearing_coordinates"), which results in loss of data. My question is: is there a way to pivot the data from long to wide so that, as it moves across the variables, each key gets one logical key:value column, with all values associated with "bearing_coordinates" ending up in that column? The data should then look like this:
bearing_coordinates=c(146,122,68,1000)
roadkill=c(1,1,1,1)
tree_density=c(13,10,12,11)
animals_on_road=c(3,2,5,6)
id=c(1,2,3,4)
clean.data = data.frame(bearing_coordinates=bearing_coordinates, roadkill=roadkill, tree_density=tree_density, animals_on_road=animals_on_road, id=id)
bearing_coordinates roadkill tree_density animals_on_road id
1 146 1 13 3 1
2 122 1 10 2 2
3 68 1 12 5 3
4 1000 1 11 6 4
I assume there must be a way to do this surprisingly easily in dplyr, but I rarely have data this messy and so am at a bit of a loss as to which tools will accomplish this.
I've been looking through the dplyr documentation and SO posts, and everything seems to be almost what I'm looking for but not quite right. For example, one post indicates a different strategy of taking "bearing.coordinates.x" and "bearing.coordinates.y", making those columns have duplicate names, and finally merging them with no loss of data. However, that looks even more long-winded (particularly with multiple key:value pairs, as in my real dataset) and also prone to error. I've also looked at filter as a possible option, but it seems to hit the same issue of columns deleting each other, and it requires an extra coding step to keep the rest of the data.
Thank you in advance for your help.
EDIT: Ben's answer below is correct, but I initially inaccurately represented the variables as being separated by "." and not "_" as they are in my real data. This could be addressed by simply changing the regex to (.*)_(.*), so:
testframe %>%
pivot_longer(cols = everything(), names_to = c("name", ".value"), names_pattern = "(.*)_(.*)") %>%
select(-name) %>%
pivot_wider(names_from = "01", values_from = "02", values_fn = list) %>%
unnest(cols = everything())
This is a really beautiful and elegant solution. Thank you Ben!
Maybe you might try something like this below. Based on your needs it could be modified further, but a lot depends on what your actual data looks like. This assumes complete key/value pairs, evenly divided.
I would first use pivot_longer to get your keys and values into two columns, then pivot_wider so that the values are placed in the appropriate key columns.
library(tidyr)
library(dplyr)
testframe %>%
pivot_longer(cols = everything(), names_to = c("name", ".value"), names_pattern = "x(\\d+)_(\\d+)") %>%
select(-name) %>%
pivot_wider(names_from = `01`, values_from = `02`, values_fn = list) %>%
unnest(cols = everything())
Output
bearing_coordinates tree_density animals_on_road roadkill
<dbl> <dbl> <dbl> <dbl>
1 146 13 3 1
2 122 10 2 1
3 68 12 5 1
4 1000 11 6 1
I have a lot of old R code using the following syntax to perform what I think are left joins (or left outer joins if you prefer the SQL name):
merge(a, b, by="id", all.x=TRUE)
From my point of view, this is completely equivalent to using dplyr's dedicated function:
left_join(a, b, by="id")
I'm wondering if this is always the case or if the two can in some cases lead to different results. Please feel free to provide examples of when they could be considered equivalent and when not.
In this silly example, the two seem to yield the same result:
require(dplyr)
a = data.frame(id = 1:4, letter = c(letters[1:3], NA)) %>% as_tibble()
b = data.frame(id=1:2) %>% as_tibble()
all_equal(left_join(b, a, by="id"), merge(b, a, by='id', all.x = T))
# TRUE
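One concrete difference worth knowing about (my example, not from the original post): merge() sorts the result by the by columns by default (sort = TRUE), whereas left_join() preserves the row order of its left-hand table.
a2 = data.frame(id = c(3, 1, 2), x = c("c", "a", "b"))
b2 = data.frame(id = 1:3, y = c("A", "B", "C"))
left_join(a2, b2, by = "id")            # rows stay in a2's order: id 3, 1, 2
merge(a2, b2, by = "id", all.x = TRUE)  # rows sorted by id: 1, 2, 3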
Why am I asking this question?
I'm asking this because, for instance, stats::aggregate and dplyr::group_by, if used with default arguments are not equivalent:
a %>% group_by(letter) %>% summarise(mean(id))
# # A tibble: 4 x 2
# letter `mean(id)`
# <fct> <dbl>
# 1 a 1.00
# 2 b 2.00
# 3 c 3.00
# 4 <NA> 4.00
aggregate(id ~ letter, data = a, FUN = mean)
# letter id
# 1 a 1
# 2 b 2
# 3 c 3
That is, they give the same result only if you omit the NAs from dplyr's data (because the default for aggregate is na.omit). I'm asking also because, when working with big datasets, it's hard to spot at a glance why something is happening (especially when dealing with code that was not written by you), and seemingly harmless substitutions like those presented above can cause significant changes in the output.
EDIT: I'm using dplyr 0.7.4 and R 3.4.1.
The tidyverse functions treat NA as part of the data, because missingness can explain aspects of the information that the "identified" data can't. In other words, you must use a specific function to drop NA values. In your example there are many ways to perform the same process with equivalent results. For example, consider the na.omit() function:
library(tidyverse) # this includes the dplyr package
a = data.frame(id = 1:4,
letter = c(letters[1:3], NA)
) %>% as_tibble()
all.equal(
a %>%
na.omit() %>% # drops rows with NA in any column (here only "letter" has them)
group_by(letter) %>%
summarise(id = mean(id)),
aggregate(id ~ letter,
data = a,
FUN = mean ))
#>[1] TRUE
Another example uses the filter() function:
all.equal(
a %>%
filter(!is.na(letter)) %>% #This drop NA values in the column "letter"
group_by(letter) %>%
summarise(id = mean(id)),
aggregate(id ~ letter,
data = a,
FUN = mean ))
#>[1] TRUE
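A third equivalent, assuming {tidyr} is attached (it is, via library(tidyverse)), is drop_na() scoped to the column:
all.equal(
  a %>%
    drop_na(letter) %>% # drops rows where "letter" is NA
    group_by(letter) %>%
    summarise(id = mean(id)),
  aggregate(id ~ letter,
            data = a,
            FUN = mean ))
#>[1] TRUE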
Hope this helps!