R programming, split data - r

I have data as follows:
ID age sugarlevel
123 15 8
456 13 10
789 25 5
...
Does anyone know how to use R to split the data according to sugar level (>=7, <7)? That is, the data should be split into two groups:
group 1:
ID age sugarlevel
123 15 8
456 13 10
...
group 2:
ID age sugarlevel
789 25 5
...
Thanks in advance.

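We can split the dataset by the logical grouping variable df1$sugarlevel >= 7 (from @nicola's comments). Because split() orders FALSE before TRUE, group1 holds the rows with sugarlevel < 7 and group2 holds the rows with sugarlevel >= 7: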
lst <- setNames(split(df1, df1$sugarlevel >=7), paste0('group',1:2))
lst
#$group1
# ID age sugarlevel
#3 789 25 5
#$group2
# ID age sugarlevel
#1 123 15 8
#2 456 13 10
It is better to keep working with the datasets inside the list, but if we need two separate objects in the global environment:
list2env(lst, envir=.GlobalEnv)
group1
# ID age sugarlevel
#3 789 25 5
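If the groups should be numbered as in the question (group1 for sugarlevel >= 7), one option is to build the group labels explicitly before splitting. A minimal sketch, assuming the sample data is stored in a data frame called df1 as above; the label vector `grp` is just an illustrative name:

```r
# Sample data from the question
df1 <- data.frame(ID = c(123, 456, 789),
                  age = c(15, 13, 25),
                  sugarlevel = c(8, 10, 5))

# Label each row explicitly; split() orders the labels alphabetically
grp <- ifelse(df1$sugarlevel >= 7, "group1", "group2")
lst <- split(df1, grp)

lst$group1  # rows with sugarlevel >= 7
lst$group2  # rows with sugarlevel < 7
```

This keeps the meaning of each group name visible in the code instead of relying on the FALSE/TRUE ordering.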

Related

How can I read in a data table with the first column name missing?

I want to read in the data set from this website into R: https://socialsciences.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Cowles.txt
There is an ID column without a column name in the first column. How can I either not read in the ID column or read in the ID column with a blank/new title?
Import all columns (then change the name):
# load package
library(data.table)
# import txt
df <- fread("https://socialsciences.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Cowles.txt")
# change 1st column's name
names(df)[1] <- 'id'
Import all columns except the first:
df <- fread("https://socialsciences.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Cowles.txt", drop = 1)
We can use setnames to change the first column's name (which is automatically named V1) when you read in the file, and do it all in one step:
library(data.table)
dt <- setnames(
  fread(
    "https://socialsciences.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Cowles.txt"
  ),
  "V1",
  "ID"
)
Output
ID neuroticism extraversion sex volunteer
1: 1 16 13 female no
2: 2 8 14 male no
3: 3 5 16 male no
4: 4 8 20 female no
5: 5 9 19 male no
---
1417: 1417 5 10 male yes
1418: 1418 8 4 female yes
1419: 1419 8 8 male yes
1420: 1420 19 20 female yes
1421: 1421 15 20 male yes
If you want to ignore the first column, then you can either use drop (as suggested in the other answer), or you can also use select to keep the columns that you want.
dt <- fread(
"https://socialsciences.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Cowles.txt", select = c(2:5)
)
neuroticism extraversion sex volunteer
1: 16 13 female no
2: 8 14 male no
3: 5 16 male no
4: 8 20 female no
5: 9 19 male no
---
1417: 5 10 male yes
1418: 8 4 female yes
1419: 8 8 male yes
1420: 19 20 female yes
1421: 15 20 male yes
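Base R can also handle this layout without extra packages. As documented for read.table, when the header line has one fewer field than the data rows, the first column is used as row names rather than as a data column. A minimal sketch, using a few inline rows copied from the output above instead of the URL so that it runs offline:

```r
# A few rows in the same layout as Cowles.txt: the header has 4 names,
# but every data row has 5 fields (the first one is the unnamed ID)
txt <- "neuroticism extraversion sex volunteer
1 16 13 female no
2 8 14 male no
3 5 16 male no"

# read.table treats the extra leading field as row names
cowles <- read.table(text = txt, header = TRUE)

names(cowles)     # the four named columns only
rownames(cowles)  # the IDs end up here
```

The same call should work with the URL in place of `text = txt`, assuming the file is reachable.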

Summation of dataframes in a list [duplicate]

I am working on a script where I have two lists and I am trying to combine the results so I get a new list. Each list has a date and then two numbers. The lists look like this:
date clicks impressions
1 2019-06-01 1 2
2 2019-06-02 0 0
3 2019-06-03 100 120
and
date clicks impressions
1 2019-06-01 2 14
2 2019-06-02 3 14
3 2019-06-03 11 29
I'd like a single list that is
date clicks impressions
1 2019-06-01 3 16
2 2019-06-02 3 14
3 2019-06-03 111 149
What is the best way to accomplish this? In time I will have 20 to 30 more lists to add, so I'll want to take the first list, combine it with the second, then with a third, and so on. I don't know if I'll be able to assume that each date will be in each list.
Assuming your list is called list_df, you can bind them all together using bind_rows, group_by date and then sum all the other columns.
library(dplyr)
list_df %>%
bind_rows() %>%
group_by(date) %>%
summarise_all(sum)
# A tibble: 3 x 3
# date clicks impressions
# <fct> <int> <int>
#1 2019-06-01 3 16
#2 2019-06-02 3 14
#3 2019-06-03 111 149
In base R, the same result could be achieved using Reduce and aggregate:
aggregate(.~date, Reduce(rbind, list_df), sum)
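Because the rows are stacked first and only then summed per date, this approach does not need every date to appear in every data frame. A small sketch of that case, assuming two sample data frames where the second one is missing 2019-06-02 (the sample values follow the question; the variable names are illustrative):

```r
df1 <- data.frame(date = c("2019-06-01", "2019-06-02", "2019-06-03"),
                  clicks = c(1L, 0L, 100L),
                  impressions = c(2L, 0L, 120L))
# Second data frame has no row for 2019-06-02
df2 <- data.frame(date = c("2019-06-01", "2019-06-03"),
                  clicks = c(2L, 11L),
                  impressions = c(14L, 29L))
list_df <- list(df1, df2)

# Stack all data frames, then sum every numeric column within each date
combined <- do.call(rbind, list_df)
totals <- aggregate(cbind(clicks, impressions) ~ date, data = combined, FUN = sum)
totals
```

Adding more data frames only means adding them to list_df; the rest stays the same.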
We can use data.table
library(data.table)
rbindlist(list_df)[, lapply(.SD, sum), date]
# date clicks impressions
#1: 2019-06-01 3 16
#2: 2019-06-02 3 14
#3: 2019-06-03 111 149
data
list_df <- mget(paste0("df", 1:2))
We can do:
cbind(date=df1[,1],do.call(`+`, list(df1[,-1],df2[,-1])),
row.names = NULL)
date clicks impressions
1 2019-06-01 3 16
2 2019-06-02 3 14
3 2019-06-03 111 149
If you are not sure about the presence of dates, sum only the non-date columns (the dates can then be added back with cbind as above):
do.call(`+`,lapply(list(df1,df2), function(x) x[,-1]))
clicks impressions
1 3 16
2 3 14
3 111 149
This assumes that the data sets always have the same structure, with their rows in the same order.

Exclude intervals that overlap between two data frames (by range of two column values)

This is almost an extension of a previous question I asked, but I've run into a new problem I haven't found a solution for.
Here is the original question and answer: Find matching intervals in data frame by range of two column values
(this found overlapping intervals that were common among different names within same data frame)
I now want to find a way to exclude rows in DF1 when there are overlapping intervals with a new data frame, DF2.
Using the same DF1 :
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID1
ADAM 2 A 384 407 23 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
ADAM 6 C 522 599 77 ID1
This continues for 18 different names and two ID groups.
I now have a second data frame with intervals that I wish to exclude from the above data frame.
Here is an example of DF2:
Name Event Order Sequence start_event end_event duration Group
GAP1 1 A 55 121 66 ID1
GAP2 2 A 394 419 25 ID1
GAP3 3 C 502 635 133 ID1
I.E., I am hoping to find any interval for each "Name" in DF1, that is in the same "Sequence" and has overlapping time at any point of the interval found in DF2 (any portion, whether it begins before the start event, or begins midway and ends after the end event). I would like to iterate through each distinct "Name" in DF1. Also, the sequence matters, so I would only like to return results found common between sequence A and sequence A, then sequence B and sequence B, and finally sequence C and sequence C.
Desired Result (showing just the first name):
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
ADAM 3 B 0 79 79 ID1
ADAM 4 B 505 586 81 ID1
ADAM 5 C 140 205 65 ID1
Last time the answer was resolved in part with foverlaps, but I am still not overly familiar with it to be able to solve this problem - assuming that's the best way to answer this.
Thanks!
This piece of code should work for you
library(data.table)
Dt1 <- data.table(a = 1:1000,b=1:1000 + 100)
Dt2 <- data.table(a = 100:200,b=100:200+10)
#identify the positions that are not allowed
badSeq <- unique(unlist(lapply(1:nrow(Dt2),function(i) Dt2[i,a:b,])))
#select for the rows outside of the range
correctPos <- sapply(1:nrow(Dt1),
                     function(i)
                       all(!Dt1[i,a:b %in% badSeq]))
Dt1[correctPos,]
I have done it with data.tables rather than data.frames. I like them better and they can be faster, but you can apply the same ideas to a data.frame.
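The toy example above does not use the Sequence column. Below is a minimal base R sketch of the same idea applied to the data in the question. It assumes DF1 and DF2 are rebuilt as shown (duration and Group are left out for brevity), that intervals which merely touch at an endpoint do not count as overlapping, and the helper name `overlaps_gap` is made up for illustration:

```r
# Sample data from the question (duration and Group columns omitted)
DF1 <- data.frame(
  Name        = rep(c("JOHN", "ADAM"), each = 6),
  EventOrder  = rep(1:6, times = 2),
  Sequence    = c("A", "A", "A", "B", "C", "C",
                  "A", "A", "B", "B", "C", "C"),
  start_event = c(0, 60, 392, 282, 147, 566, 19, 384, 0, 505, 140, 522),
  end_event   = c(19, 112, 429, 329, 226, 611, 75, 407, 79, 586, 205, 599),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  Name        = c("GAP1", "GAP2", "GAP3"),
  Sequence    = c("A", "A", "C"),
  start_event = c(55, 394, 502),
  end_event   = c(121, 419, 635),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE if the interval overlaps any gap in the same Sequence
overlaps_gap <- function(seq_id, start, end, gaps) {
  g <- gaps[gaps$Sequence == seq_id, ]
  any(start < g$end_event & end > g$start_event)
}

# Test every row of DF1 against the gaps of its own Sequence
hit <- mapply(overlaps_gap, DF1$Sequence, DF1$start_event, DF1$end_event,
              MoreArgs = list(gaps = DF2), USE.NAMES = FALSE)

# Keep only the rows of DF1 that overlap no gap
kept <- DF1[!hit, ]
kept
```

For this sample it keeps JOHN events 1, 4, 5 and ADAM events 3, 4, 5, which matches the desired result in the question. Sequence B has no gaps in DF2, so all of its rows are kept.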

Find matching intervals in data frame by range of two column values

I have a data frame of time related events.
Here is an example:
Name Event Order Sequence start_event end_event duration Group
JOHN 1 A 0 19 19 ID1
JOHN 2 A 60 112 52 ID1
JOHN 3 A 392 429 37 ID1
JOHN 4 B 282 329 47 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 1 A 19 75 56 ID2
ADAM 2 A 384 407 23 ID2
ADAM 3 B 0 79 79 ID2
ADAM 4 B 505 586 81 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
There are essentially two different groups, ID 1 & 2. For each of those groups, there are 18 different names. Each of those people appears in 3 different sequences, A-C. They then have active time periods during those sequences, and I mark the start/end events and calculate the duration.
I'd like to isolate each person and find when they have matching time intervals with people in both the opposite and same group ID.
Using the example data above, I want to find when John and Adam appear during the same sequence, at the same time. I then want to compare John to the rest of the 17 names in ID1/ID2.
I do not need to match the exact amount of shared 'active' time, I just am hoping to isolate the rows that are common.
My comforts are in using dplyr, but I can't crack this yet. I looked around and saw some similar examples with adjacency matrices, but those are with precise and exact data points. I can't figure out the strategy with a range/interval.
Thank you!
UPDATE:
Here is the example of the desired result
Name Event Order Sequence start_event end_event duration Group
JOHN 3 A 392 429 37 ID1
JOHN 5 C 147 226 79 ID1
JOHN 6 C 566 611 45 ID1
ADAM 2 A 384 407 23 ID2
ADAM 5 C 140 205 65 ID2
ADAM 6 C 522 599 77 ID2
I'm thinking you'd isolate each event row for John, mark the start/end time frame and then iterate through every name and event for the remainder of the data frame to find time points that fit first within the same sequence, and then secondly against the bench-marked start/end time frame of John.
As I understand it, you want to return any row where an event for John with a particular sequence number overlaps an event for anybody else with the same sequence value. To achieve this, you could use split-apply-combine to split by sequence, identify the overlapping rows, and then re-combine:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
  jpos <- which(x$Name == "JOHN")
  njpos <- which(x$Name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) {
    overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
  })
  x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
}))
# Name EventOrder Sequence start_event end_event duration Group
# A.2 JOHN 2 A 60 112 52 ID1
# A.3 JOHN 3 A 392 429 37 ID1
# A.7 ADAM 1 A 19 75 56 ID2
# A.8 ADAM 2 A 384 407 23 ID2
# C.5 JOHN 5 C 147 226 79 ID1
# C.6 JOHN 6 C 566 611 45 ID1
# C.11 ADAM 5 C 140 205 65 ID2
# C.12 ADAM 6 C 522 599 77 ID2
Note that my output includes two additional rows that are not shown in the question -- sequence A for John from time range [60, 112], which overlaps sequence A for Adam from time range [19, 75].
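The overlap test can be checked directly on the intervals mentioned here. A small sketch that repeats the overlap function from this answer; the numbers are taken from the sample data:

```r
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)

# John [60, 112] vs Adam [19, 75] in sequence A: they share the stretch 60 to 75
overlap(60, 112, 19, 75)

# John [0, 19] vs Adam [19, 75]: they only touch at 19, and the strict
# inequality does not count that as an overlap
overlap(0, 19, 19, 75)
```

If intervals that touch at an endpoint should count as overlapping, `>` would need to become `>=`.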
This could be pretty easily mapped into dplyr language:
library(dplyr)
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
sliceRows <- function(name, start, end) {
  jpos <- which(name == "JOHN")
  njpos <- which(name != "JOHN")
  over <- outer(jpos, njpos, function(a, b) overlap(start[a], end[a], start[b], end[b]))
  c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0])
}
dat %>%
  group_by(Sequence) %>%
  slice(sliceRows(Name, start_event, end_event))
# Source: local data frame [8 x 7]
# Groups: Sequence [3]
#
# Name EventOrder Sequence start_event end_event duration Group
# (fctr) (int) (fctr) (int) (int) (int) (fctr)
# 1 JOHN 2 A 60 112 52 ID1
# 2 JOHN 3 A 392 429 37 ID1
# 3 ADAM 1 A 19 75 56 ID2
# 4 ADAM 2 A 384 407 23 ID2
# 5 JOHN 5 C 147 226 79 ID1
# 6 JOHN 6 C 566 611 45 ID1
# 7 ADAM 5 C 140 205 65 ID2
# 8 ADAM 6 C 522 599 77 ID2
If you wanted to be able to compute the overlaps for a specified pair of users, this could be done by wrapping the operation into a function that specifies the pair of users to be processed:
overlap <- function(start1, end1, start2, end2) pmin(end1, end2) > pmax(start2, start1)
pair.overlap <- function(dat, user1, user2) {
  dat <- dat[dat$Name %in% c(user1, user2),]
  do.call(rbind, lapply(split(dat, dat$Sequence), function(x) {
    jpos <- which(x$Name == user1)
    njpos <- which(x$Name == user2)
    over <- outer(jpos, njpos, function(a, b) {
      overlap(x$start_event[a], x$end_event[a], x$start_event[b], x$end_event[b])
    })
    x[c(jpos[rowSums(over) > 0], njpos[colSums(over) > 0]),]
  }))
}
You could use pair.overlap(dat, "JOHN", "ADAM") to get the previous output. Generating the overlaps for every pair of users can now be done with combn and apply:
apply(combn(unique(as.character(dat$Name)), 2), 2, function(x) pair.overlap(dat, x[1], x[2]))

Deduplicate dataframe based on criteria in R?

I've got this dataframe:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18
As you can see, the value "Jane" appears 3 times. What I want to do is deduplicate the list based on the variable "Name", but because the rest of the columns are important to me, I want to keep the rows that have the most information in them. For example, if I were to deduplicate the above file in Excel, it would keep the first value of "Jane" and delete all the other ones. But the first value of "Jane" (row no. 3) has missing information in the other columns.
So in other words, I want to deduplicate the list by "Name" but add a criterion to keep the rows that have any value other than "0" in the column "Age". This way the result I would get would be this:
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane US F 30
4 Kate GB F 18
I have tried this
file3 <- file1[!duplicated(file1$Name),]
But like Excel, it keeps the value of "Jane" that has no usable information in the other columns.
How do I sort the rows by the column "Age" in Z-A (descending) order, so that anything with "0" ends up at the bottom and is removed when I deduplicate the list?
Cheers
David
Try this trick
ind <- with(DF,
            Country !=0 &
            Gender %in% c('F', 'M') &
            Age !=0)
DF[ind, ]
Name Country Gender Age
1 John GB M 25
2 Mark US M 35
4 Jane US F 30
6 Kate GB F 18
For the sample data, this works and produces your desired output.
EDIT
library(doBy)
orderBy(~ -Age+Name, DF) # Sort decreasingly by Age and Name
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Or simply using Base functions:
DF[order(DF$Age, DF$Name, decreasing = TRUE), ]
Name Country Gender Age
2 Mark US M 35
4 Jane US F 30
1 John GB M 25
6 Kate GB F 18
3 Jane 0 0 0
5 Jane US F 0
Now you can select the rows that meet your conditions by indexing. I really think the first approach is better than these last two.
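For completeness, the sort-then-deduplicate route asked about in the question can be sketched in base R as follows. This assumes the data frame is called file1 as in the question and that the row with the highest Age is the one worth keeping for each Name; the intermediate names `sorted` and `dedup` are illustrative:

```r
# Sample data from the question
file1 <- data.frame(
  Name    = c("John", "Mark", "Jane", "Jane", "Jane", "Kate"),
  Country = c("GB", "US", "0", "US", "US", "GB"),
  Gender  = c("M", "M", "0", "F", "F", "F"),
  Age     = c(25, 35, 0, 30, 0, 18),
  stringsAsFactors = FALSE
)

# Sort by Age in decreasing order so rows with Age 0 come last
sorted <- file1[order(-file1$Age), ]

# duplicated() flags every occurrence of a Name after the first one,
# so the row with the highest Age survives for each Name
dedup <- sorted[!duplicated(sorted$Name), ]

# Optional: restore the original row order
dedup <- dedup[order(as.integer(rownames(dedup))), ]
dedup
```

Unlike filtering on Age != 0, this still keeps one row for a Name whose rows all have Age 0.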
If all duplicated rows have the value zero in column Age, it will work with subset:
# the data
file1 <- read.table(text="Name Country Gender Age
1 John GB M 25
2 Mark US M 35
3 Jane 0 0 0
4 Jane US F 30
5 Jane US F 0
6 Kate GB F 18", header = TRUE, stringsAsFactors = FALSE)
# create a subset of the data
subset(file1, Age > 0)
# Name Country Gender Age
# 1 John GB M 25
# 2 Mark US M 35
# 4 Jane US F 30
# 6 Kate GB F 18
