How to use extract on multiple columns and name output columns based on input column names - r

I have a data frame of blood pressure data of the following form:
bpdata <- data.frame(bp1 = c("120/89", "110/70", "121/78"), bp2 = c("130/69", "120/90", "125/72"), bp3 = c("115/90", "112/71", "135/80"))
I would like to apply the following extract command globally, i.e. to every column matching bp\d:
extract(bp1, c("systolic_1","diastolic_1"),"(\\d+)/(\\d+)")
How can I capture the digit in the column selection and use it in the column output names? I can hack around this by creating a list of column names and then using one of the apply family, but it seems to me there ought to be a more elegant way to do this.
Any suggestions?

We could apply read.csv to each column in a loop (Map) with sep = "/", then cbind the resulting list elements with do.call:
do.call(cbind, Map(function(x, y) read.csv(text= x, sep="/", header = FALSE,
col.names = paste0(c('systolic', 'diastolic'), y)),
unname(bpdata), seq_along(bpdata)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or without a loop, paste the columns to a single string for each row and then use read.csv/read.table
read.csv(text = do.call(paste, c(bpdata, sep="/")),
sep="/", header = FALSE,
col.names = paste0(c('systolic', 'diastolic'),
rep(seq_along(bpdata), each = 2)))
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
Or, using the tidyverse, a similar option is to unite the columns into a single one separated by /, then use either extract or separate to split that column into multiple columns:
library(dplyr)
library(tidyr)
library(stringr)
bpdata %>%
unite(bpcols, everything(), sep="/") %>%
separate(bpcols, into = str_c(c('systolic', 'diastolic'),
rep(seq_along(bpdata), each = 2)), convert = TRUE)
# systolic1 diastolic1 systolic2 diastolic2 systolic3 diastolic3
#1 120 89 130 69 115 90
#2 110 70 120 90 112 71
#3 121 78 125 72 135 80
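If the output names should come from the digit in each input column name rather than from the column position, a minimal base R sketch could look like the following. It assumes every column of interest is named bp followed by digits and always holds two numbers separated by a slash; bp_cols, pieces and bp_split are names made up for this illustration.

```r
bpdata <- data.frame(
  bp1 = c("120/89", "110/70", "121/78"),
  bp2 = c("130/69", "120/90", "125/72"),
  bp3 = c("115/90", "112/71", "135/80"),
  stringsAsFactors = FALSE
)

# Columns whose names look like bp1, bp2, ...
bp_cols <- grep("^bp\\d+$", names(bpdata), value = TRUE)

pieces <- lapply(bp_cols, function(col) {
  idx <- sub("^bp", "", col)  # digit(s) taken from the column name
  parts <- do.call(rbind, strsplit(bpdata[[col]], "/", fixed = TRUE))
  out <- data.frame(as.integer(parts[, 1]), as.integer(parts[, 2]))
  names(out) <- paste0(c("systolic_", "diastolic_"), idx)
  out
})

# Bind the per-column results side by side
bp_split <- do.call(cbind, pieces)
names(bp_split)
```

Because the suffix is read from the name itself, a column called bp7 would produce systolic_7 and diastolic_7 regardless of where it sits in the data frame.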

Related

Select a range of rows from every n rows from a data frame

I have 2880 observations in my data.frame. I have to create a new data.frame in which, I have to select rows from 25-77 from every 96 selected rows.
df.new = df[seq(25, nrow(df), 77), ] # extract from 25 to 77
The above code extracts only row number 25 to 77 but I want every row from 25 to 77 in every 96 rows.
One option is to create a vector of indices with which to subset the data frame.
idx <- rep(25:77, times = nrow(df)/96) + 96*rep(0:29, each = 77-25+1)
df[idx, ]
You can use vector recycling to extract these rows:
from = 25
to = 77
n = 96
df.new <- df[rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to)), ]
To explain how it works for this example, the logical vector has one entry per row of a 96-row block:
length(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19))) #returns
#[1] 96
Of these 96 values, positions 25 to 77 are TRUE and the rest are FALSE, which we can verify with:
which(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19)))
# [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#[23] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#[45] 69 70 71 72 73 74 75 76 77
Now this vector is recycled for all the remaining rows in the dataframe.
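As a quick check of the recycling idea, here is a small sketch on a toy data frame. It assumes 2880 rows as in the question; the id column and the names keep and pos are invented for the illustration.

```r
# Toy data: one row per observation, id records the original row number
df <- data.frame(id = seq_len(2880))

from <- 25
to <- 77
n <- 96

# One logical value per position in a block of n rows
keep <- rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to))

# The 96-element logical index is recycled across all 2880 rows
df.new <- df[keep, , drop = FALSE]

# Position of every kept row inside its own block of n rows
pos <- (df.new$id - 1) %% n + 1
range(pos)
nrow(df.new)
```

The same pos arithmetic can also be used directly as a filter, for example by computing it for all rows and keeping those where pos lies between from and to.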
First, define a Group variable, with values 1 to 30, each value repeating 96 times. Then define RowWithinGroup and filter as required. Finally, undo the changes introduced to do the filtering.
library(dplyr)
library(tibble)
df <- tibble(X = rnorm(2880)) %>%
  add_column(Group = rep(1:30, each = 96)) %>%  # 30 blocks of 96 consecutive rows
  group_by(Group) %>%
  mutate(RowWithinGroup = row_number()) %>%
  filter(RowWithinGroup >= 25 & RowWithinGroup <= 77) %>%
  ungroup() %>%                                  # ungroup first so Group can be dropped
  select(-Group, -RowWithinGroup)
Welcome to SO. This question may not have been asked in this exact form before, but the principles required have been referenced in many, many questions and answers.
A one-liner base solution.
lapply(split(df, cut(1:nrow(df), nrow(df)/96, F)), `[`, 25:77, )
Note: Nothing after the last comma
The code above returns a list. To combine all data together, just pass the result above into
do.call(rbind, ...)
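Put together, the split-then-combine approach could be sketched as below. This version swaps cut for an explicit block vector built with rep, and uses an anonymous function instead of passing `[` with an empty argument; both are substitutions made for readability, and the toy data frame and the names block and parts are assumptions.

```r
# Toy data: id records the original row number
df <- data.frame(id = seq_len(2880))

# Block label for every row: 1 for rows 1-96, 2 for rows 97-192, ...
block <- rep(seq_len(nrow(df) / 96), each = 96)

# Rows 25 to 77 of every block, as a list of data frames
parts <- lapply(split(df, block), function(d) d[25:77, , drop = FALSE])

# Stack the list back into a single data frame
df.new <- do.call(rbind, parts)
nrow(df.new)
```

This assumes nrow(df) is an exact multiple of 96; with a partial final block, rep would need a length.out argument and the last block could contain fewer than 77 rows.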

using map function to create a dataframe from google trends data

I'm relatively new to R. I have a list of words that I want to run through the gtrends function (from the gtrendsR package) to look at the Google search hits, and then create a tibble with dates as the index and the hits for each word as columns. I'm struggling to do this with the map functions in purrr.
I started off with a for loop, but I've been told to use map from the tidyverse instead. This is what I have so far:
library(gtrendsR)
words = c('cruise', 'plane', 'car')
for (i in words) {
rel_word_data = gtrends(i,geo= '', time = 'today 12-m')
iot <- data.frame()
iot[i] <- rel_word_data$interest_over_time$hits
}
I need the gtrends function to take one word at a time; otherwise it returns hits that are adjusted for the popularity of the other words. So, basically, I need gtrends to run on the first word in the list, take the hits column from the interest_over_time element, and add it to a final data frame that contains one column per word, with the date as index.
I'm a bit lost as to how to do this without a for loop.
Assuming the gtrends output is the same length for every keyword, you can do the following:
# Load packages
library(purrr)
library(gtrendsR)
# Generate a vector of keywords
words <- c('cruise', 'plane', 'car')
# Download data by iterating gtrends over the vector of keywords
# Extract the hits data and make it into a dataframe for each keyword
trends <- map(.x = words,
~ as.data.frame(gtrends(keyword = .x, time = 'now 1-H')$interest_over_time$hits)) %>%
# Add the keywords as column names to the three dataframes
map2(.x = .,
.y = words,
~ set_names(.x, nm = .y)) %>%
# Convert the list of three dataframes to a single dataframe
map_dfc(~ data.frame(.x))
# Check data
head(trends)
#> cruise plane car
#> 1 50 75 84
#> 2 51 74 83
#> 3 100 67 81
#> 4 46 76 83
#> 5 48 77 84
#> 6 43 75 82
str(trends)
#> 'data.frame': 59 obs. of 3 variables:
#> $ cruise: int 50 51 100 46 48 43 48 53 43 50 ...
#> $ plane : int 75 74 67 76 77 75 73 80 70 79 ...
#> $ car : int 84 83 81 83 84 82 84 87 85 85 ...
Created on 2020-06-27 by the reprex package (v0.3.0)
You can use map to get all the data as a list and use reduce to combine the data.
library(purrr)
library(gtrendsR)
library(dplyr)
map(words, ~gtrends(.x,geo= '', time = 'today 12-m')$interest_over_time %>%
dplyr::select(date, !!.x := hits)) %>%
reduce(full_join, by = 'date')
# date cruise plane car
#1 2019-06-30 64 53 96
#2 2019-07-07 75 48 97
#3 2019-07-14 73 48 100
#4 2019-07-21 74 48 100
#5 2019-07-28 71 47 100
#6 2019-08-04 67 47 97
#7 2019-08-11 68 56 98
#.....
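The map-then-join pattern can be tried offline with a stand-in for the download step. In the sketch below, fake_trends is a hypothetical placeholder that only mimics the date and hits columns of interest_over_time; it is not part of gtrendsR, and the numbers it returns are made up. The per-word rename and the join on date mirror the approach above, using base R only.

```r
# Hypothetical stand-in for gtrends(): a date column plus a hits column
fake_trends <- function(word) {
  data.frame(
    date = as.Date("2020-01-05") + 7 * 0:3,
    hits = nchar(word) * 10 + 0:3  # made-up values, for illustration only
  )
}

words <- c("cruise", "plane", "car")

# One data frame per word, with hits renamed to the word itself
per_word <- lapply(words, function(w) {
  res <- fake_trends(w)
  names(res)[names(res) == "hits"] <- w
  res
})

# Join all per-word data frames on date (full outer join)
trends <- Reduce(function(a, b) merge(a, b, by = "date", all = TRUE), per_word)
trends
```

Joining on date rather than binding columns side by side means the result stays aligned even if one keyword returns fewer rows than the others.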

Grouping columns and creating a list output

I am new to R. I have an R data frame with the following structure:
164_I_.CEL 164_II.CEL 183_I.CEL 183_II.CEL 2114_I.CEL
1 4496 5310 4492 4511 2872
2 181 280 137 101 91
3 4556 5104 4379 4608 2972
4 167 217 99 79 82
5 89 110 69 58 47
I want to group the columns which have "_I.CEL" in the column name.
I need a list output like NI, NI, I, NI, I
where NI means Not I.
Use a combination of ifelse and grepl to look for the required pattern in the column names:
ifelse(grepl("_I\\.CEL", names(df1)), "I", "NI")
#[1] "NI" "NI" "I" "NI" "I"
where df1 is your data frame.
Or use fixed = TRUE, so that the pattern is matched literally and the dot does not need escaping:
ifelse(grepl("_I.CEL", names(df1), fixed = TRUE), "I", "NI")
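If the labels are then needed to actually group the columns, one possible follow-up is sketched below. It rebuilds the first two rows of the example data, uses endsWith as a stricter alternative to grepl (an assumption that the pattern must sit at the end of the name), and splits the columns with split.default; grp and by_group are illustrative names.

```r
# First two rows of the example data
df1 <- data.frame(c(4496, 181), c(5310, 280), c(4492, 137),
                  c(4511, 101), c(2872, 91))
names(df1) <- c("164_I_.CEL", "164_II.CEL", "183_I.CEL",
                "183_II.CEL", "2114_I.CEL")

# Label every column: "I" when the name ends in "_I.CEL", "NI" otherwise
grp <- ifelse(endsWith(names(df1), "_I.CEL"), "I", "NI")
grp

# Split the columns (not the rows) of the data frame by label
by_group <- split.default(df1, grp)
names(by_group$I)
```

split.default treats the data frame as a list of columns, so each element of by_group is itself a data frame holding the columns of one group.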

Subset Columns based on partial matching of column names in the same data frame

I would like to understand how to subset multiple columns of a data frame by comparing the first 5 letters of the column names: columns whose names share the same first 5 letters should be subset together and stored in a new variable.
Here is a small example of the output I need.
Let's say the data frame is eatable:
fruits_area fruits_production vegetables_area vegetable_production
12 100 26 324
33 250 40 580
660 510 43 581
eatable <- data.frame(c(12,33,660),c(100,250,510),c(26,40,43),c(324,580,581))
names(eatable) <- c("fruits_area", "fruits_production", "vegetables_area",
"vegetable_production")
I was trying to write a function which will match the strings in a loop and will store the subset columns after matching first 5 letters from the column names.
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
checkExpression(eatable,"your_string")
The above function checks the string correctly but I am confused how to do matching among the column names in the dataset.
Edit:- I think regular expressions would work here.
You could try:
v <- unique(substr(names(eatable), 1, 5))
lapply(v, function(x) eatable[grepl(x, names(eatable))])
Or using map() + select() with starts_with():
library(tidyverse)
map(v, ~ select(eatable, starts_with(.x)))
Which gives:
#[[1]]
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
#
#[[2]]
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581
Should you want to make it into a function:
checkExpression <- function(df, l = 5) {
v <- unique(substr(names(df), 1, l))
lapply(v, function(x) df[grepl(x, names(df))])
}
Then simply use:
checkExpression(eatable, 5)
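A related base R sketch, which skips pattern matching altogether, is to compute the prefix of every column name once and split the columns on it. It assumes the same eatable data as in the question; prefix and groups are names chosen for the illustration.

```r
eatable <- data.frame(c(12, 33, 660), c(100, 250, 510),
                      c(26, 40, 43), c(324, 580, 581))
names(eatable) <- c("fruits_area", "fruits_production",
                    "vegetables_area", "vegetable_production")

# First 5 characters of every column name
prefix <- substr(names(eatable), 1, 5)

# Split the columns by prefix; the result is a named list of data frames
groups <- split.default(eatable, prefix)
names(groups)
groups$fruit
```

Since the grouping is by exact prefix rather than by regular expression, a prefix cannot accidentally match in the middle of another column name.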
I believe this may address your needs:
checkExpression <- function(dataset,str){
cols <- grepl(paste0("^",str),colnames(dataset),ignore.case = TRUE)
subset(dataset,select=colnames(dataset)[cols])
}
Note the addition of "^" to the pattern used in grepl.
Using your data:
checkExpression(eatable,"fruit")
## fruits_area fruits_production
##1 12 100
##2 33 250
##3 660 510
checkExpression(eatable,"veget")
## vegetables_area vegetable_production
##1 26 324
##2 40 580
##3 43 581
Your function does exactly what you want, but it contained a small error:
checkExpression <- function(dataset,str){
dataset[grepl((str),names(dataset),ignore.case = TRUE)]
}
Change the name of the object you are subsetting from obje to dataset.
checkExpression(eatable,"fr")
# fruits_area fruits_production
#1 12 100
#2 33 250
#3 660 510
checkExpression(eatable,"veg")
# vegetables_area vegetable_production
#1 26 324
#2 40 580
#3 43 581

Splitting multiple columns in R

I have the following data frame:
olddf <- structure(list(test = structure(1:6, .Label = c("test1", "test2",
"test3", "test4", "test5", "test6"), class = "factor"), month0_gp1 = c("163±28",
"133±20", "177±29", "153±30", "161±31", "159±23"), month0_gp2 = c("122±17",
"167±20", "146±26", "150±27", "148±33", "161±37"), month1_gp1 = c("157±32",
"152±37", "151±24", "143±25", "144±29", "126±30"), month1_gp2 = c("181±14",
"133±34", "152±38", "144±30", "148±20", "137±19"), month3_gp1 = c("139±38",
"161±39", "166±38", "162±39", "151±38", "155±38"), month3_gp2 = c("151±40",
"161±33", "137±25", "161±31", "168±30", "147±34")), .Names = c("test",
"month0_gp1", "month0_gp2", "month1_gp1", "month1_gp2", "month3_gp1",
"month3_gp2"), row.names = c(NA, 6L), class = "data.frame")
test month0_gp1 month0_gp2 month1_gp1 month1_gp2 month3_gp1 month3_gp2
1 test1 163±28 122±17 157±32 181±14 139±38 151±40
2 test2 133±20 167±20 152±37 133±34 161±39 161±33
3 test3 177±29 146±26 151±24 152±38 166±38 137±25
4 test4 153±30 150±27 143±25 144±30 162±39 161±31
5 test5 161±31 148±33 144±29 148±20 151±38 168±30
6 test6 159±23 161±37 126±30 137±19 155±38 147±34
I have to split columns 2:7 into 2 each (one for mean and other for sd):
test month0_gp1_mean month0_gp1_sd month0_gp2_mean month0_gp2_sd month1_gp1_mean month1_gp1_sd ....
I checked earlier posts and used do.call(rbind... method:
mydf <- data.frame(do.call(rbind, strsplit(olddf$month0_gp1,'±')))
mydf
X1 X2
1 163 28
2 133 20
3 177 29
4 153 30
5 161 31
6 159 23
But this works for one column at a time. How can I modify this to loop for 2:7 columns, and combine them to form one new dataframe? Thanks for your help.
First, get my cSplit function from this GitHub Gist.
Second, split it up:
cSplit(olddf, 2:ncol(olddf), sep = "±")
# test 2_1 2_2 3_1 3_2 4_1 4_2 5_1 5_2 6_1 6_2 7_1 7_2
# 1: test1 163 28 122 17 157 32 181 14 139 38 151 40
# 2: test2 133 20 167 20 152 37 133 34 161 39 161 33
# 3: test3 177 29 146 26 151 24 152 38 166 38 137 25
# 4: test4 153 30 150 27 143 25 144 30 162 39 161 31
# 5: test5 161 31 148 33 144 29 148 20 151 38 168 30
# 6: test6 159 23 161 37 126 30 137 19 155 38 147 34
If you want to do the column renaming in the same step, try:
Nam <- names(olddf)[2:ncol(olddf)]
setnames(
cSplit(olddf, 2:ncol(olddf), sep = "±"),
c("test", paste(rep(Nam, each = 2), c("mean", "sd"), sep = "_")))[]
Another option would be to look at dplyr + tidyr.
Here's the best I could come up with, but I'm not sure if this is the correct way to do this with these tools....
library(dplyr)
library(tidyr)
olddf %>%
  gather(GM, value, -test) %>%            # Makes the data somewhat long
  separate(value, c("MEAN", "SD")) %>%    # Splits "value" into two columns, so we're wider again
  gather(MSD, value, -test, -GM) %>%      # Makes the data long again
  unite(var, GM, MSD) %>%                 # Combines GM and MSD columns
  spread(var, value)                      # Goes from long to wide
This is sort of the equivalent of melting the data once, using colsplit on the resulting "value" column, melting the data again, and using dcast to get the wide format.
Here's a qdap approach:
library(qdap)
for(i in seq(2, 13, by = 2)){
olddf <- colsplit2df(olddf, i,
paste0(names(olddf)[i], "_", c("mean", "sd")), sep = "±")
}
olddf[,-1] <- lapply(olddf[,-1], as.numeric)
olddf
I looked at Ananda's splitstackshape package first as I figured there was an easy way to do this but I couldn't figure out a way.
Not sure if you need the last line converting the columns to numeric but assumed you would.
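For comparison, the do.call(rbind, strsplit(...)) idea from the question can be extended to all columns in base R along these lines. This is only a sketch on a two-row, two-column excerpt of olddf, assuming every value has exactly one ± separator; split_cols and newdf are illustrative names.

```r
# Small excerpt of the data from the question
olddf <- data.frame(
  test = c("test1", "test2"),
  month0_gp1 = c("163±28", "133±20"),
  month0_gp2 = c("122±17", "167±20"),
  stringsAsFactors = FALSE
)

# Split every column except "test" and name the halves <column>_mean / <column>_sd
split_cols <- lapply(names(olddf)[-1], function(col) {
  parts <- do.call(rbind, strsplit(olddf[[col]], "±", fixed = TRUE))
  out <- data.frame(as.numeric(parts[, 1]), as.numeric(parts[, 2]))
  names(out) <- paste(col, c("mean", "sd"), sep = "_")
  out
})

# Put the id column back in front of the split columns
newdf <- cbind(olddf["test"], do.call(cbind, split_cols))
names(newdf)
```

Replacing names(olddf)[-1] with names(olddf)[2:7] would restrict the split to the columns named in the question.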
