How to write the Mesh from PubMed into a data frame - r

I recently installed the RISmed package to download a little under 500,000 article abstracts from PubMed, and I'm trying to put this data into a data frame. My problem is that when I try to include the MeSH data, I get the error message
'Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows:'
Is there a way to nest this information into a data frame for analysis?
RCT_topic <- 'randomized clinical trial'
RCT_query <- EUtilsSummary(RCT_topic, mindate=2016, maxdate=2017, retmax=100)
summary(RCT_query)
RCT_records <- EUtilsGet(RCT_query)
RCT_data <- data.frame('PMID'=PMID(RCT_records),
'Title'=ArticleTitle(RCT_records),
'Abstract'=AbstractText(RCT_records),
'YearPublished'=YearPubmed(RCT_records),
'Month.Published'=MonthPubmed(RCT_records),
'Country'=Country(RCT_records),
'Grant'=GrantID(RCT_records),
'Acronym'=Acronym(RCT_records),
'Agency'=Agency(RCT_records),
'Mesh'=Mesh(RCT_records))

Consider using a tibble, which can store the list returned by Mesh() as a list column:
library(RISmed)
library(dplyr) # tibble and other functions
RCT_topic <- 'randomized clinical trial'
RCT_query <- EUtilsSummary(RCT_topic, mindate=2016, maxdate=2017, retmax=100)
summary(RCT_query)
RCT_records <- EUtilsGet(RCT_query)
RCT_data <- tibble('PMID'=PMID(RCT_records),
'Title'=ArticleTitle(RCT_records),
'Abstract'=AbstractText(RCT_records),
'YearPublished'=YearPubmed(RCT_records),
'Month.Published'=MonthPubmed(RCT_records),
'Country'= Country(RCT_records),
'Grant' =GrantID(RCT_records),
'Acronym' =Acronym(RCT_records),
'Agency' =Agency(RCT_records),
'Mesh'=Mesh(RCT_records))
The Mesh column is now a list column holding one data frame per article:
select(RCT_data, PMID, Mesh)
# # A tibble: 100 x 2
# PMID Mesh
# <chr> <list>
# 1 29288775 <data.frame [21 × 2]>
# 2 29288545 <data.frame [19 × 2]>
# 3 29288510 <data.frame [15 × 2]>
# 4 29288507 <data.frame [19 × 2]>
# 5 29288478 <data.frame [16 × 2]>
# 6 29288309 <data.frame [19 × 2]>
# 7 29288191 <data.frame [11 × 2]>
# 8 29288190 <data.frame [23 × 2]>
# 9 29288184 <data.frame [21 × 2]>
# 10 29288175 <data.frame [12 × 2]>
# # ... with 90 more rows
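To analyse the nested terms, one option is to unnest the list column into one row per article and heading. The sketch below uses mock data instead of a live PubMed query. The column names Heading and Type and the PMID values are assumptions made for illustration, so check them against what Mesh() actually returns for your records. If some elements of the list are not data frames (for example NA for articles without MeSH terms), they would probably need to be filtered out or replaced first.

```r
library(tibble)
library(tidyr)

# Mock stand-in for Mesh(RCT_records): one data frame per article.
# The column names (Heading, Type) are assumed for illustration.
mesh_list <- list(
  data.frame(Heading = c("Humans", "Adult"),
             Type    = c("Descriptor", "Descriptor")),
  data.frame(Heading = c("Humans", "Female", "Aged"),
             Type    = rep("Descriptor", 3))
)

# Made-up identifiers, not real PubMed IDs.
articles <- tibble(PMID = c("id_1", "id_2"), Mesh = mesh_list)

# Expand the list column: one row per (article, heading) pair.
mesh_long <- unnest(articles, Mesh)
mesh_long
```

The long form works directly with count(), filter() and joins, while the nested form keeps one row per article.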

Related

How to refer to data frames dynamically in an R for loop?

R novice, so apologies in advance. I want to write a for loop that does sequential operations on a series of dataframes and then binds them (by sequence number).
Ideally, I'd think it would be something like this, where sc2 is the base dataframe I'm working from and week3 is the selection variable. The dataframes I'm trying to create would be t1, t2, t3, ... and w1, w2, w3, ... and so on. In other words, the 'i' in each dataframe name would be read from the for statement.
for(i in 1:16) {
ti= tail((subset(sc2, sc2$week3<i)), n=200)
wi= subset(sc2, sc2$week3==i)
mi=rbind(ti, wi)
}
I'm sure you can see that this doesn't work. I've gotten this far:
for(i in 1:16) {
txi= tail((subset(sc2, sc2$week3<i)), n=200)
assign(paste0("trst",i), txi, envir = .GlobalEnv)
wxi= subset(sc2, sc2$week3==i)
assign(paste0("w",i), wxi, envir = .GlobalEnv)
}
This creates dummy dataframes (txi and wxi) that are then assigned, for each i, to the global environment. But how do I rbind them now? Is there a more elegant way to do all of this, or am I missing something about how to refer to dataframes dynamically?
Don't do it in a loop!
This is much easier if you hold the data frames inside a data frame, or rather tibbles inside a tibble. See the example below.
library(tidyverse)
sc2 = tibble(
week3 = sample(1:20, 100, replace = TRUE),
x = rnorm(100)
)
ftxi = function(i) sc2 %>% filter(week3<i)
fwxi = function(i) sc2 %>% filter(week3==i)
df = tibble(id = 1:16) %>%
group_by(id) %>%
mutate(txi = map(id, ~ftxi(.x)),
wxi = map(id, ~fwxi(.x)))
Let's see what df looks like.
# A tibble: 16 x 3
# Groups: id [16]
id txi wxi
<int> <list> <list>
1 1 <tibble [0 x 2]> <tibble [4 x 2]>
2 2 <tibble [4 x 2]> <tibble [6 x 2]>
3 3 <tibble [10 x 2]> <tibble [6 x 2]>
4 4 <tibble [16 x 2]> <tibble [6 x 2]>
5 5 <tibble [22 x 2]> <tibble [4 x 2]>
6 6 <tibble [26 x 2]> <tibble [4 x 2]>
7 7 <tibble [30 x 2]> <tibble [6 x 2]>
8 8 <tibble [36 x 2]> <tibble [4 x 2]>
9 9 <tibble [40 x 2]> <tibble [3 x 2]>
10 10 <tibble [43 x 2]> <tibble [6 x 2]>
11 11 <tibble [49 x 2]> <tibble [3 x 2]>
12 12 <tibble [52 x 2]> <tibble [4 x 2]>
13 13 <tibble [56 x 2]> <tibble [6 x 2]>
14 14 <tibble [62 x 2]> <tibble [5 x 2]>
15 15 <tibble [67 x 2]> <tibble [5 x 2]>
16 16 <tibble [72 x 2]> <tibble [7 x 2]>
As you can see, it is a tibble that has other tibbles inside it.
Let's check that everything is correct by looking at the second row.
First, the txi variable, df$txi[[2]]:
# A tibble: 4 x 2
week3 x
<int> <dbl>
1 1 -0.0829
2 1 -2.15
3 1 -0.949
4 1 -0.0583
Now the wxi variable, df$wxi[[2]]:
# A tibble: 6 x 2
week3 x
<int> <dbl>
1 2 -0.0643
2 2 -0.228
3 2 -0.620
4 2 -1.21
5 2 0.186
6 2 1.19
Bingo, you get what you expected!
It is also a very quick method. See my other answer on this forum:
What is faster/better: Loop over each row..
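The question also asked how to rbind the two pieces for each i. Here is a minimal sketch of one way to do that in the same nested layout, assuming the tail(n = 200) step from the question should be kept. The names res and mi are just illustrative.

```r
library(dplyr)
library(purrr)
library(tibble)

set.seed(1)
sc2 <- tibble(
  week3 = sample(1:20, 100, replace = TRUE),
  x     = rnorm(100)
)

res <- tibble(id = 1:16) %>%
  mutate(
    # last (up to) 200 rows from the weeks before i, as in the question
    txi = map(id, ~ tail(filter(sc2, week3 < .x), n = 200)),
    # rows from week i itself
    wxi = map(id, ~ filter(sc2, week3 == .x)),
    # row-bind the two pieces pairwise, one combined tibble per id
    mi  = map2(txi, wxi, bind_rows)
  )

res$mi[[2]]
```

Each element of res$mi plays the role of the mi data frame from the question's loop, and nothing is written to the global environment.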

How to convert a list of tibbles/dataframes into a nested tibble/dataframe

Sample Data
ex_list <- list(a = tibble(x = 1:4, y = 5:8),
b = mtcars)
How do I convert this list of tibbles/dataframes into a nested tibble as shown below:
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
Tidy solutions appreciated!
We may use enframe
library(tibble)
enframe(ex_list)
# A tibble: 2 x 2
name value
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
If we need different column names, use the name and value arguments:
enframe(ex_list, name = 'data_name', value = 'data')
# A tibble: 2 x 2
data_name data
<chr> <list>
1 a <tibble [4 × 2]>
2 b <df [32 × 11]>
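If you later need the plain named list back, tibble::deframe() reverses the conversion. A small sketch using the same sample data (the names nested and back are illustrative):

```r
library(tibble)

ex_list <- list(a = tibble(x = 1:4, y = 5:8),
                b = mtcars)

# list -> nested tibble with custom column names
nested <- enframe(ex_list, name = "data_name", value = "data")

# nested tibble -> named list again
back <- deframe(nested)
names(back)
```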
Is this what you want?
library(tidyverse)
lapply(ex_list, nest) %>%
dplyr::bind_rows(., .id = "data_name")
# # A tibble: 2 x 2
# data_name data
# <chr> <list>
# 1 a <tibble [4 x 2]>
# 2 b <tibble [32 x 11]>
#OR map
#map(ex_list, nest) %>%
# bind_rows(., .id = "data_name")

purrr::map does not work with pipe operator

I have a data frame like this:
df <- tibble(
i = rep(1:10, times = 5),
t = rep(1:5, each = 10)
) %>%
mutate(y = rnorm(50))
I want to apply a function that takes the data frame for each t as its argument:
f <- function(df){
return(lm(y ~ +1, data = df))
}
When I apply purrr::map to a nested data frame using the pipe operator, I get an error.
# does not work
df_nested <- df %>%
nest(data = c(t, y)) %>%
rename(data_col = data)
df_nested %>%
purrr::map(.x = .$data_col, .f = f)
On the other hand, when I do not use the pipe operator, I get the desired result.
# Ok
purrr::map(.x = df_nested$data_col, .f = f)
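To my understanding, both versions should return the same result. What is wrong with the code that uses the pipe operator?
The pipe already passes the left-hand side (df_nested) as the first argument to map(). Because the dot appears only inside the nested expression .$data_col, the pipe still inserts df_nested as an extra first argument. You may wrap the call in {} to stop that from happening.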
library(tidyverse)
df_nested %>%
{purrr::map(.x = .$data_col, .f = f)}
Another way is to fit the models inside mutate():
df %>%
nest(data_col = c(t, y)) %>%
mutate(model = map(data_col, f))
# i data_col model
# <int> <list> <list>
# 1 1 <tibble [5 × 2]> <lm>
# 2 2 <tibble [5 × 2]> <lm>
# 3 3 <tibble [5 × 2]> <lm>
# 4 4 <tibble [5 × 2]> <lm>
# 5 5 <tibble [5 × 2]> <lm>
# 6 6 <tibble [5 × 2]> <lm>
# 7 7 <tibble [5 × 2]> <lm>
# 8 8 <tibble [5 × 2]> <lm>
# 9 9 <tibble [5 × 2]> <lm>
#10 10 <tibble [5 × 2]> <lm>
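A third option, sketched here under the same setup, is to pull the list column out first and pipe that into map(), so the value the pipe passes along is exactly what map() should iterate over. The seed and the name models are only there to make the sketch self-contained.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)

set.seed(42)
df <- tibble(
  i = rep(1:10, times = 5),
  t = rep(1:5, each = 10)
) %>%
  mutate(y = rnorm(50))

# intercept-only model, as in the question
f <- function(df) lm(y ~ 1, data = df)

df_nested <- df %>%
  nest(data_col = c(t, y))

# pull() extracts the list column, so map() receives it as .x
models <- df_nested %>%
  pull(data_col) %>%
  map(f)

length(models)
```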

Match two datasets by minimum geospatial distance (R)

I have the two following datasets:
houses <- data.table(house_number = c(1:3),
lat_decimal = seq(1.1, 1.3, by = 0.1),
lon_decimal = seq(1.4, 1.6, by = 0.1))
stations <- data.table(station_numer = c(1:11),
lat_decimal = seq(1, 2, by = 0.1),
lon_decimal = seq(2, 3, by = 0.1))
I want to merge houses and stations together such that the resulting station_number is the station that's closest to the corresponding house_number.
This question is very similar, but I'm not sure whether it works with latitude and longitude, and I don't know how to calculate distances from longitude and latitude myself (which is why I'd prefer to simply use distm from the geosphere package).
I have never worked with the outer function. If the answer to the aforementioned question would work, how can I adapt it to use the distm function instead of the sqrt function?
Use match_nrst_haversine from hutilscpp:
library(hutilscpp)
houses[, c("station_number", "dist") := match_nrst_haversine(lat_decimal,
lon_decimal,
addresses_lat = stations$lat_decimal,
addresses_lon = stations$lon_decimal,
Index = stations$station_numer,
close_enough = 0,
cartesian_R = 5)]
houses
#> house_number lat_decimal lon_decimal station_number dist
#> 1: 1 1.1 1.4 1 67.62617
#> 2: 2 1.2 1.5 1 59.87076
#> 3: 3 1.3 1.6 1 55.59026
You may want to adjust close_enough and cartesian_R for performance if you have many points to match (e.g. over a million).
`cartesian_R`
The maximum radius of any address from the points to be geocoded. Used to accelerate the detection of minimum distances. Note, as the argument name suggests, the distance is in cartesian coordinates, so a small number is likely.
`close_enough`
The distance, in metres, below which a match will be considered to have occurred. (The distance that is considered "close enough" to be a match.)
For example, close_enough = 10 means the first location within ten metres will be matched, even if a closer match occurs later.
May be provided as a string to emphasize the units, e.g. close_enough = "0.25km". Only km and m are permitted.
Your question is a bit more complicated than a simple merge, and outer is somewhat ill-suited for the purpose. To be as thorough as possible, we want to calculate the distance between all combinations of houses and stations, then keep only the closest station per house. We'll need two packages:
library(tidyverse)
library(geosphere)
First, a bit of prep. distm expects coordinates to be ordered as longitude first, latitude second (you have the opposite), so let's fix that, name the columns better, and correct a typo while we're at it:
houses <- data.frame(house_number = c(1:3),
lon_house = seq(1.4, 1.6, by = 0.1),
lat_house = seq(1.1, 1.3, by = 0.1)
)
stations <- data.frame(station_number = c(1:11),
lon_station = seq(2, 3, by = 0.1),
lat_station = seq(1, 2, by = 0.1)
)
We'll create "nested" data frames so that it's easier to keep coordinates together:
house_nest <- nest(houses, house_coords = -house_number)
station_nest <- nest(stations, station_coords = -station_number)
house_number house_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
station_number station_coords
<int> <list>
1 1 <data.frame [1 × 2]>
2 2 <data.frame [1 × 2]>
3 3 <data.frame [1 × 2]>
4 4 <data.frame [1 × 2]>
5 5 <data.frame [1 × 2]>
6 6 <data.frame [1 × 2]>
7 7 <data.frame [1 × 2]>
8 8 <data.frame [1 × 2]>
9 9 <data.frame [1 × 2]>
10 10 <data.frame [1 × 2]>
11 11 <data.frame [1 × 2]>
Use tidyr::crossing to combine every row from both data frames:
data.master <- crossing(house_nest, station_nest)
house_number house_coords station_number station_coords
<int> <list> <int> <list>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]>
2 1 <data.frame [1 × 2]> 2 <data.frame [1 × 2]>
3 1 <data.frame [1 × 2]> 3 <data.frame [1 × 2]>
4 1 <data.frame [1 × 2]> 4 <data.frame [1 × 2]>
5 1 <data.frame [1 × 2]> 5 <data.frame [1 × 2]>
6 1 <data.frame [1 × 2]> 6 <data.frame [1 × 2]>
7 1 <data.frame [1 × 2]> 7 <data.frame [1 × 2]>
8 1 <data.frame [1 × 2]> 8 <data.frame [1 × 2]>
9 1 <data.frame [1 × 2]> 9 <data.frame [1 × 2]>
10 1 <data.frame [1 × 2]> 10 <data.frame [1 × 2]>
# ... with 23 more rows
With all this now in place, we can use distm on each row to calculate a distance, and keep the shortest distance per house:
data.dist <- data.master %>%
mutate(dist = map2_dbl(house_coords, station_coords, distm)) %>%
group_by(house_number) %>%
filter(dist == min(dist))
house_number house_coords station_number station_coords dist
<int> <list> <int> <list> <dbl>
1 1 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 67690.
2 2 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 59883.
3 3 <data.frame [1 × 2]> 1 <data.frame [1 × 2]> 55519.
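Since the question asked about outer, here is a rough base R sketch of that route. It swaps distm for a hand-written haversine function with an assumed mean Earth radius of 6371 km, so the distances are in kilometres and will differ slightly from distm's default ellipsoidal result. The function name haversine_km and the shortened column names are illustrative only.

```r
# Great-circle distance in km between points given in decimal degrees.
# Vectorised, which outer() requires. r = 6371 km is an assumed mean radius.
haversine_km <- function(lat1, lon1, lat2, lon2, r = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

houses <- data.frame(house_number = 1:3,
                     lat = seq(1.1, 1.3, by = 0.1),
                     lon = seq(1.4, 1.6, by = 0.1))
stations <- data.frame(station_number = 1:11,
                       lat = seq(1, 2, by = 0.1),
                       lon = seq(2, 3, by = 0.1))

# Houses in rows, stations in columns. outer() passes all index pairs
# to the function in a single vectorised call.
d <- outer(seq_len(nrow(houses)), seq_len(nrow(stations)),
           function(i, j) haversine_km(houses$lat[i], houses$lon[i],
                                       stations$lat[j], stations$lon[j]))

# Column index of the closest station for each house.
nearest <- apply(d, 1, which.min)
houses$station_number <- stations$station_number[nearest]
houses$dist_km <- d[cbind(seq_len(nrow(houses)), nearest)]
houses
```

The full distance matrix has one cell per house and station pair, so this approach suits small to medium inputs. For millions of points, a dedicated nearest-neighbour routine like the one in the first answer is likely the better fit.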

apply function on certain columns in nested variable in r

I want to apply a vectorised operation to certain columns inside a nested variable. Specifically, I want to count the missing values in the numeric features, i.e. weight and calories. The data frame I have is as follows:
df <- data.frame(country = c("US", "US", "UK", "PAK"),name = c("David",
"James", "Junaid", "Ali"), fruit = c("Apple", "banana", "orange", "melon"),
weight = c(90,110,120,NA), calories = c(NA,20, NA,NA))
country name fruit weight calories
1 US David Apple 90 NA
2 US James banana 110 20
3 UK Junaid orange 120 NA
4 PAK Ali melon NA NA
When I nest the data frame
nested_df <- df %>% group_by(country) %>% nest()
# A tibble: 3 × 2
country data
<fctr> <list>
1 US <tibble [2 × 4]>
2 UK <tibble [1 × 4]>
3 PAK <tibble [1 × 4]>
I have tried to use the following syntax but to no avail.
nested_df %>% mutate(missings = map(data, c("weight", "calories")) %>%
map_lgl(function(x) sum(!is.na(x))/length(x) ==1))
The result I expect is as follows:
# A tibble: 3 × 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 × 4]> FALSE
2 UK <tibble [1 × 4]> FALSE
3 PAK <tibble [1 × 4]> TRUE
However, what I am getting is:
# A tibble: 3 × 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 × 4]> NA
2 UK <tibble [1 × 4]> NA
3 PAK <tibble [1 × 4]> NA
This checks whether more than 50% of the values in the chosen columns are NA:
colstocheck <- c("weight", "calories")
nested_df %>% mutate(missings = map_lgl(data,
function(x) mean(is.na(x[, colstocheck])) > 0.5))
# A tibble: 3 x 3
country data missings
<fctr> <list> <lgl>
1 US <tibble [2 x 4]> FALSE
2 UK <tibble [1 x 4]> FALSE
3 PAK <tibble [1 x 4]> TRUE
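If you want the actual count of missing values per country rather than a threshold test, a sketch along the same lines follows. The column names n_missing and all_missing are made up for illustration, and the ungroup() call is a choice, not a requirement.

```r
library(dplyr)
library(tidyr)
library(purrr)

df <- data.frame(
  country  = c("US", "US", "UK", "PAK"),
  name     = c("David", "James", "Junaid", "Ali"),
  fruit    = c("Apple", "banana", "orange", "melon"),
  weight   = c(90, 110, 120, NA),
  calories = c(NA, 20, NA, NA)
)

nested_df <- df %>% group_by(country) %>% nest()

colstocheck <- c("weight", "calories")

res <- nested_df %>%
  ungroup() %>%
  mutate(
    # number of NA cells across the chosen columns
    n_missing   = map_int(data, ~ sum(is.na(.x[colstocheck]))),
    # TRUE only when every cell in the chosen columns is NA
    all_missing = map_lgl(data, ~ all(is.na(.x[colstocheck])))
  )
res
```

With this data, all_missing is TRUE only for PAK, which matches the result expected in the question.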

Resources