Combine two data frames based on cell value in R

I have two data frames: one holds baseline data for different test types, and the other holds my experiment data. I would like to combine the two, but it is not a simple merge or rbind. I hope an R professional can help me solve it. Thank you.
Here is an example of the two data frames:
experiment data:
experiment_num timepoint type value
50 10 7a,b4 90
50 20 7a,b4 89
50 20 10a,b4 93
50 10 7a,b6 85
50 20 7a,b6 87
50 20 10a,b6 88
baseline data:
experiment_num timepoint type value
50 0 0,b4 85
50 0 0,b6 90
Here is the output I would like to have:
experiment_num timepoint type value
50 0 7a,b4 85
50 10 7a,b4 90
50 20 7a,b4 89
50 0 10a,b4 85
50 20 10a,b4 93
50 0 7a,b6 90
50 10 7a,b6 85
50 20 7a,b6 87
50 0 10a,b6 90
50 20 10a,b6 88

This should do the job. You first need to install a couple of packages:
install.packages("dplyr")
install.packages("tidyr")
Data:
ed <- data.frame(experiment_num = rep(50, 6),
                 timepoint = rep(c(10, 20, 20), 2),
                 type = c("7a,b4", "7a,b4", "10a,b4", "7a,b6", "7a,b6", "10a,b6"),
                 value = c(90, 89, 93, 85, 87, 88))
db <- data.frame(experiment_num = rep(50, 2), timepoint = rep(0, 2),
                 type = c("0,b4", "0,b6"), value = c(85, 90))
Code:
library(tidyr)
library(dplyr)
final <- rbind(
  separate(ed, type, into = c("typea", "typeb")),
  left_join(ed %>% select(type) %>% unique %>%
              separate(type, into = c("typea", "typeb")),
            separate(db, type, into = c("zero", "typeb"))) %>%
    select(experiment_num, timepoint, typea, typeb, value)
) %>%
  arrange(typeb, typea, timepoint) %>%
  mutate(type = paste(typea, typeb, sep = ",")) %>%
  select(experiment_num, timepoint, type, value)
The logic is the following.
Separate type into two columns, typea and typeb, then "create" the missing typea values for the baseline data by joining its rows to every typea/typeb combination seen in the experimental data.
final is the data set you are looking for.
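For comparison, the same idea can be sketched in base R without tidyr/dplyr: match each experiment type to its baseline row via the part after the comma. The `typeb()` helper below is mine, for illustration only, not part of the answer above.

```r
# Base R sketch: give every experiment type a timepoint-0 baseline row
# by matching on the part after the comma ("b4"/"b6").
ed <- data.frame(experiment_num = rep(50, 6),
                 timepoint = rep(c(10, 20, 20), 2),
                 type  = c("7a,b4", "7a,b4", "10a,b4", "7a,b6", "7a,b6", "10a,b6"),
                 value = c(90, 89, 93, 85, 87, 88))
db <- data.frame(experiment_num = rep(50, 2), timepoint = rep(0, 2),
                 type = c("0,b4", "0,b6"), value = c(85, 90))

typeb <- function(x) sub(".*,", "", x)   # "7a,b4" -> "b4"
types <- unique(ed$type)                 # every experiment type
baseline <- data.frame(experiment_num = 50, timepoint = 0, type = types,
                       value = db$value[match(typeb(types), typeb(db$type))])
final <- rbind(baseline, ed)
final <- final[order(typeb(final$type), final$type, final$timepoint), ]
```

This gives the same ten rows: each type gets the baseline value of its matching b4/b6 group at timepoint 0.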


How to merge a single measurement into a dataframe of multiple measurements in R

I have a long dataframe of multiple measurements per ID, at different time points for variables BP1 and BP2.
ID <- c(1,1,1,2,2,2,3,3,4)
Time <- c(56,57,58,61,62,64,66,67,72)
BP1 <- c(70,73,73,74,75,76,74,74,70)
BP2 <- c(122,122,123,126,124,121,130,132,140)
df1 <- data.frame(ID, Time, BP1, BP2)
I would like to merge another dataframe (df2), which contains a single measurement for BP1 and BP2 per ID.
ID <- c(1,2,3,4)
Time <- c(55, 60, 65, 70)
BP1 <- c(70, 72, 73, 74)
BP2 <- c(120, 124, 130, 134)
df2 <- data.frame(ID, Time, BP1, BP2)
How do I combine these dataframes so that, within each ID, the Time variable is in order?
Any help greatly appreciated, thank you!
In base R, use rbind() to combine and order() to sort, then clean up the rownames:
df3 <- rbind(df1, df2)
df3 <- df3[order(df3$ID, df3$Time), ]
rownames(df3) <- seq(nrow(df3))
df3
Or, using dplyr:
library(dplyr)
bind_rows(df1, df2) %>%
  arrange(ID, Time)
Result from either approach:
ID Time BP1 BP2
1 1 55 70 120
2 1 56 70 122
3 1 57 73 122
4 1 58 73 123
5 2 60 72 124
6 2 61 74 126
7 2 62 75 124
8 2 64 76 121
9 3 65 73 130
10 3 66 74 130
11 3 67 74 132
12 4 70 74 134
13 4 72 70 140
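If this combine-and-sort comes up repeatedly, the base R steps can be wrapped in a small helper. This is just an illustrative convenience; the name `combine_sorted` is mine, not from any package.

```r
# Illustrative helper: rbind any number of data frames with the same
# columns, sort by the given key columns, and reset the rownames.
combine_sorted <- function(..., keys) {
  out <- do.call(rbind, list(...))
  out <- out[do.call(order, out[keys]), ]
  rownames(out) <- NULL
  out
}

df1 <- data.frame(ID = c(1, 1, 2), Time = c(56, 57, 61),
                  BP1 = c(70, 73, 74), BP2 = c(122, 122, 126))
df2 <- data.frame(ID = c(1, 2), Time = c(55, 60),
                  BP1 = c(70, 72), BP2 = c(120, 124))
df3 <- combine_sorted(df1, df2, keys = c("ID", "Time"))
```

`do.call(order, out[keys])` works because a data frame is a list of columns, so each key column becomes one argument to order().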

Randomly Select 10 percent of data from the whole data set in R

For my project, I have a data set with 1296765 observations of 23 columns, and I want to take a random 10% of it. How can I do that in R?
I tried the code below, but it sampled only 10 rows. I want a random 10% of the data. I am a beginner, so please help.
library(dplyr)
x <- sample_n(train, 10)
Here is a function from dplyr that selects rows at random by a specific proportion:
dplyr::slice_sample(train, prop = 0.1)
In base R, you can subset by sampling a proportion of nrow():
set.seed(13)
train <- data.frame(id = 1:101, x = rnorm(101))
train[sample(nrow(train), nrow(train) / 10), ]
id x
69 69 1.14382456
101 101 -0.36917269
60 60 0.69967564
58 58 0.82651036
59 59 1.48369123
72 72 -0.06144699
12 12 0.46187091
89 89 1.60212039
8 8 0.23667967
49 49 0.27714729
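If the 10% sample needs to be reproducible, set a seed before sampling. A small base R wrapper, shown as a sketch (the name `sample_prop` is illustrative, not a library function):

```r
# Illustrative base R helper: sample a proportion of rows, optionally
# with a fixed seed so the same sample is drawn each time.
sample_prop <- function(df, prop, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  df[sample(nrow(df), round(prop * nrow(df))), , drop = FALSE]
}

train <- data.frame(id = 1:1000, x = rnorm(1000))
sub <- sample_prop(train, 0.1, seed = 13)
nrow(sub)   # 100 rows, i.e. 10% of 1000
```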

Select a range of rows from every n rows from a data frame

I have 2880 observations in my data.frame. I have to create a new data.frame in which I select rows 25 to 77 out of every 96 rows.
df.new = df[seq(25, nrow(df), 77), ] # extract from 25 to 77
The above code does not do what I want: I need every row from 25 to 77 within every 96 rows.
One option is to create a vector of indices with which to subset the dataframe.
idx <- rep(25:77, times = nrow(df)/96) + 96*rep(0:29, each = 77-25+1)
df[idx, ]
You can use the recycling technique to extract these rows:
from = 25
to = 77
n = 96
df.new <- df[rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to)), ]
To explain: for this example it works as follows.
length(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19))) #returns
#[1] 96
In these 96 values, values 25 to 77 are TRUE and the rest are FALSE, which we can verify with:
which(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19)))
# [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#[23] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#[45] 69 70 71 72 73 74 75 76 77
Now this vector is recycled for all the remaining rows in the dataframe.
First, define a Group variable, with values 1 to 30, each value repeating 96 times. Then define RowWithinGroup and filter as required. Finally, undo the changes introduced to do the filtering.
library(tibble)
library(dplyr)
df <- tibble(X = rnorm(2880)) %>%
  add_column(Group = rep(1:30, each = 96)) %>%
  group_by(Group) %>%
  mutate(RowWithinGroup = row_number()) %>%
  filter(RowWithinGroup >= 25 & RowWithinGroup <= 77) %>%
  ungroup() %>%
  select(-Group, -RowWithinGroup)
Welcome to SO. This question may not have been asked in this exact form before, but the principles required have been referenced in many, many questions and answers.
A one-liner base solution.
lapply(split(df, cut(1:nrow(df), nrow(df)/96, F)), `[`, 25:77, )
Note: the trailing comma with nothing after it keeps all columns.
The code above returns a list. To combine all data together, just pass the result above into
do.call(rbind, ...)
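The pattern generalizes: to take rows from:to out of every block of n rows, you can build the index vector once with outer(). A sketch; the helper name `block_rows` is mine, not from the answers above.

```r
# Illustrative helper: row indices for rows `from`:`to` inside every
# consecutive block of `n` rows, for a data frame with `nr` rows.
block_rows <- function(nr, from, to, n) {
  starts <- seq(0, nr - 1, by = n)          # offset of each block
  as.vector(outer(from:to, starts, `+`))    # from:to shifted into each block
}

df <- data.frame(x = 1:2880)
df.new <- df[block_rows(nrow(df), 25, 77, 96), , drop = FALSE]
nrow(df.new)   # 30 blocks x 53 rows = 1590
```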

using map function to create a dataframe from google trends data

Relatively new to R. I have a list of words I want to run through the gtrendsR function to look at the Google search hits, and then create a tibble with dates as the index and the hits for each word as columns. I'm struggling to do this using the map functions in purrr.
I started off trying a for loop, but I've been told to use map from the tidyverse package instead. This is what I have so far:
library(gtrendsR)
words = c('cruise', 'plane', 'car')
for (i in words) {
rel_word_data = gtrends(i,geo= '', time = 'today 12-m')
iot <- data.frame()
iot[i] <- rel_word_data$interest_over_time$hits
}
I need the gtrends function to take one word at a time; otherwise it returns a hits value that is adjusted for the popularity of the other words. So basically, I need gtrends to run on the first word in the list, obtain the hits column from the interest_over_time section, and add it to a final dataframe that contains a column for each word and the date as index.
I'm a bit lost on how to do this without a for loop.
Assuming the gtrends output is the same length for every keyword, you can do the following:
# Load packages
library(purrr)
library(gtrendsR)
# Generate a vector of keywords
words <- c('cruise', 'plane', 'car')
# Download data by iterating gtrends over the vector of keywords
# Extract the hits data and make it into a dataframe for each keyword
trends <- map(.x = words,
              ~ as.data.frame(gtrends(keyword = .x, time = 'now 1-H')$interest_over_time$hits)) %>%
  # Add the keywords as column names to the three dataframes
  map2(.x = ., .y = words, ~ set_names(.x, nm = .y)) %>%
  # Convert the list of three dataframes to a single dataframe
  map_dfc(~ data.frame(.x))
# Check data
head(trends)
#> cruise plane car
#> 1 50 75 84
#> 2 51 74 83
#> 3 100 67 81
#> 4 46 76 83
#> 5 48 77 84
#> 6 43 75 82
str(trends)
#> 'data.frame': 59 obs. of 3 variables:
#> $ cruise: int 50 51 100 46 48 43 48 53 43 50 ...
#> $ plane : int 75 74 67 76 77 75 73 80 70 79 ...
#> $ car : int 84 83 81 83 84 82 84 87 85 85 ...
Created on 2020-06-27 by the reprex package (v0.3.0)
You can use map to get all the data as a list and use reduce to combine the data.
library(purrr)
library(gtrendsR)
library(dplyr)
map(words, ~ gtrends(.x, geo = '', time = 'today 12-m')$interest_over_time %>%
      dplyr::select(date, !!.x := hits)) %>%
  reduce(full_join, by = 'date')
# date cruise plane car
#1 2019-06-30 64 53 96
#2 2019-07-07 75 48 97
#3 2019-07-14 73 48 100
#4 2019-07-21 74 48 100
#5 2019-07-28 71 47 100
#6 2019-08-04 67 47 97
#7 2019-08-11 68 56 98
#.....
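Since gtrends() needs network access, the collect-rename-join pattern can be demonstrated offline with a stand-in function and base R equivalents of map/reduce. Everything below except the pattern itself (`fake_trends` and its numbers) is made up for illustration; swap in the real gtrends() call in practice.

```r
# Stand-in for gtrends(): returns a date/hits data frame per word.
# In real use, replace the body with
#   gtrends(word, geo = '', time = 'today 12-m')$interest_over_time
fake_trends <- function(word) {
  data.frame(date = as.Date("2020-01-01") + 0:2,
             hits = nchar(word) * c(10, 11, 12))  # deterministic dummy hits
}

words <- c("cruise", "plane", "car")

# One data frame per word, with the hits column renamed to the word ...
cols <- lapply(words, function(w) {
  d <- fake_trends(w)
  names(d)[names(d) == "hits"] <- w
  d
})

# ... then merged on date, mirroring reduce(full_join, by = 'date')
trends <- Reduce(function(x, y) merge(x, y, by = "date", all = TRUE), cols)
```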

How to choose the third 20% of a dataset?

Assume I have a dataset like:
df <- data.frame(data = 1:100)
How can I select the nth 20% of my data?
Let's say I need to access the third 20%, which contains the numbers 41 to 60.
Using the function ntile from the dplyr package, we divide the data frame into 5 buckets and take the third one.
library(dplyr)
# One line
df[ntile(df$data, 5) == 3, ]
# Using pipes
df %>%
  mutate(n = ntile(data, 5)) %>%
  filter(n == 3) %>%
  select(data)
Output:
[1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Here's a quick function that returns the row numbers of the nth set of rows, given the percentage of rows per set:
rowNumbs <- function(i, perc, df){
  ((i - 1) * ceiling(perc * nrow(df)) + 1) : (i * ceiling(perc * nrow(df)))
}
where i is the nth set, perc is the percentage, and df is the data.frame.
To call the third 20% of your data.frame:
df[rowNumbs(3, .2, df), ]
[1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
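The same idea can be packaged to return the slice itself rather than the row numbers, clipping at the last row so a short final slice still works. A sketch; the name `nth_slice` is illustrative.

```r
# Illustrative helper: the i-th `prop` slice of a data frame's rows,
# clipped at the last row so the final (possibly short) slice works too.
nth_slice <- function(df, i, prop) {
  k <- ceiling(prop * nrow(df))                        # rows per slice
  df[((i - 1) * k + 1):min(i * k, nrow(df)), , drop = FALSE]
}

df <- data.frame(data = 1:100)
nth_slice(df, 3, 0.2)$data   # 41 42 ... 60
```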
