selecting a subset of data based on another column

I have a dataset which looks something like this:
Area Num
[1,] "Area 1" "99"
[2,] "Area 3" "85"
[3,] "Area 1" "60"
[4,] "Area 2" "90"
[5,] "Area 1" "40"
[6,] "Area 3" NA
[7,] "Area 4" "10"
...
code:
structure(c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1",
"Area 3", "Area 4", "99", "85", "60", "90", "40", NA, "10"), .Dim = c(7L,
2L), .Dimnames = list(NULL, c("Area", "Num")))
I need to do some calculation on values in Num for each Area, for example calculating the sum of each Area, or the summary of each Area.
I'm thinking of using a nested for loop to achieve this, but I'm not sure how to.

You can do this using aggregate, but the dplyr package makes it very easy to work with such problems. There are plenty of duplicates of this question, though.
library(dplyr)
df <- structure(c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1",
"Area 3", "Area 4", "99", "85", "60", "90", "40", NA, "10"), .Dim = c(7L,
2L), .Dimnames = list(NULL, c("Area", "Num")))
df <- data.frame(df)
df$Num <- as.numeric(df$Num)
df2 <- df %>%
  group_by(Area) %>%
  summarise(totalNum = sum(Num, na.rm = TRUE))
df2
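The aggregate route mentioned above needs only base R. A minimal sketch, assuming the same matrix as in the question (the names m, areas and totals are just illustrative):

```r
# Rebuild the question's character matrix and convert it to a data frame
m <- structure(c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1",
                 "Area 3", "Area 4", "99", "85", "60", "90", "40", NA, "10"),
               .Dim = c(7L, 2L), .Dimnames = list(NULL, c("Area", "Num")))
areas <- data.frame(m, stringsAsFactors = FALSE)
areas$Num <- as.numeric(areas$Num)

# The formula interface drops rows containing NA by default (na.action = na.omit)
totals <- aggregate(Num ~ Area, data = areas, FUN = sum)
totals
```

Swapping FUN = sum for another function such as mean gives a different per-Area statistic without changing the rest of the call.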

In order to apply the function to every level of the factor, we can resort to the by function:
dt <- structure(c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1",
"Area 3", "Area 4", "99", "85", "60", "90", "40", NA, "10"), .Dim = c(7L, 2L), .Dimnames = list(NULL, c("Area", "Num")))
dt <- data.frame(dt)
dt$Num <- as.numeric(dt$Num)
# extra arguments are passed on to sum; na.rm = TRUE skips the NA in Area 3
res <- by(dt$Num, dt$Area, sum, na.rm = TRUE)
res
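Since the question also asks for a summary per Area, tapply is another base R option worth a look. A rough sketch, assuming a small data frame equivalent to the one above (obs, sums and summaries are illustrative names):

```r
obs <- data.frame(
  Area = c("Area 1", "Area 3", "Area 1", "Area 2", "Area 1", "Area 3", "Area 4"),
  Num  = c(99, 85, 60, 90, 40, NA, 10),
  stringsAsFactors = FALSE
)

# One sum per Area, ignoring missing values
sums <- tapply(obs$Num, obs$Area, sum, na.rm = TRUE)

# summary() returns several numbers per group, so the result is a list-like array
summaries <- tapply(obs$Num, obs$Area, summary)

sums
summaries[["Area 1"]]
```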

Doing the same thing using data.table
library(data.table)
dt <- data.table(df)
dt[,sum(as.numeric(Num),na.rm=T),by=Area]
## Area V1
## 1: Area 1 199
## 2: Area 3 85
## 3: Area 2 90
## 4: Area 4 10

Related

how to change one col value using the information from other cols in the same df

I have a data.frame that looks like this:
I want to generate a new df as a codebook, where the numbers in the Label column are replaced using the information from ID and Subject.
What should I do?
The codebook that I want to achieve looks something like this:
Sample data can be built using this code:
df<-structure(list(Var = c("Subject1", "Subject2", "Subject4", "Subject5",
"Subject6", "Score1", "Score2", "Score3", "Score4", "Score5",
"Score6", "TestDate1", "TestDate2", "TestDate3", "TestDate4",
"TestDate5", "TestDate6"), Label = c("Subject 1", "Subject 2",
"Subject 4", "Subject 5", "Subject 6", "Score for Subject 1",
"Score for Subject 2", "Score for Subject 3", "Score for Subject 4",
"Score for Subject 5", "Score for Subject 6", "Date for test Subject 1",
"Date for test Subject 2", "Date for test Subject 3", "Date for test Subject 4",
"Date for test Subject 5", "Date for test Subject 6"), ID = c(1,
2, 3, 4, 5, 6, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Subject = c("Math",
"ELA", "PE", "Art", "Physic", "Chemistry", NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df",
"tbl", "data.frame"))

We can use str_replace_all with a named vector
library(dplyr)
library(stringr)
df1 <- df %>%
  transmute(Var, Label = str_replace_all(Label,
    setNames(na.omit(Subject), na.omit(ID))))
Output:
df1
# A tibble: 17 x 2
# Var Label
# <chr> <chr>
# 1 Subject1 Subject Math
# 2 Subject2 Subject ELA
# 3 Subject4 Subject Art
# 4 Subject5 Subject Physic
# 5 Subject6 Subject Chemistry
# 6 Score1 Score for Subject Math
# 7 Score2 Score for Subject ELA
# 8 Score3 Score for Subject PE
# 9 Score4 Score for Subject Art
#10 Score5 Score for Subject Physic
#11 Score6 Score for Subject Chemistry
#12 TestDate1 Date for test Subject Math
#13 TestDate2 Date for test Subject ELA
#14 TestDate3 Date for test Subject PE
#15 TestDate4 Date for test Subject Art
#16 TestDate5 Date for test Subject Physic
#17 TestDate6 Date for test Subject Chemistry
or using gsubfn
library(gsubfn)
df$Label <- with(df, gsubfn("(\\d+)",
setNames(as.list(na.omit(Subject)), na.omit(ID)), Label))
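If the IDs ever go beyond a single digit, replacing whole numbers rather than individual digits may be safer, because a pattern such as "1" would also match inside "10". A base R sketch of that idea, assuming every label contains exactly one number and that number has an entry in the lookup (lookup and labels are illustrative names, and only a few labels are shown):

```r
lookup <- c("1" = "Math", "2" = "ELA", "3" = "PE",
            "4" = "Art", "5" = "Physic", "6" = "Chemistry")
labels <- c("Subject 1", "Score for Subject 3", "Date for test Subject 6")

# Locate the first run of digits in each label
m <- regexpr("[0-9]+", labels)

# Swap each matched number for the subject it maps to
regmatches(labels, m) <- unname(lookup[regmatches(labels, m)])
labels
```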

create dataframe from list of lists of data.frames

I have a list of lists of data.frames, which I would like to convert to a data.frame. The structure is as follows:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
The data frame has two particularities I tried to mimic above:
the column names differ by 1-2 characters (e.g. date vs. dates)
some columns are only present in some data frames (e.g. another_col)
I now would like to convert this to a data frame (I tried different calls to rbind and also do.call, as described e.g. here, unsuccessfully) and would like to
- match on column names tolerantly (if the column names differ by only 1-2 characters, I want them to be matched) and
- fill columns that do not exist in some of the data frames with NA.
I want a data frame similar to the following
year level date type another_col
1 one "Jan-10" "type 1" NA
1 one "Jan-22" "type 2" NA
1 two "Feb-1" "type 2" NA
1 two "Feb-28" "type 3" NA
1 three "Mar-10" "type 1" NA
1 three "Mar-15" "type 4" NA
2 one "Jan-22" "type 2" "entry 2"
2 two "Feb-1" "type 2" "entry 2"
2 two "Feb-28" "type 3" "entry 3"
2 three "Mar-10" "type 1" "entry 4"
2 three "Mar-15" "type 4" "entry 5"
3 one "Jan-10" "type 1" NA
3 one "Jan-12" "type 2" NA
3 two "Feb-8" "type 2" NA
3 two "Feb-28" "type 3" NA
Can someone point out if rbind is the correct path here - and what I am missing?

You could do something like the following using purrr and dplyr:
l_of_lists <- list(
year1 = list(
one = data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-1", "Feb-28"), type = c("type 2", "type 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"))
),
year2 = list( # dates is used here on purpose, as the names don't perfectly match
one = data.frame(dates = c("Jan-22"), type = c("type 2"), another_col = c("entry 2")),
two = data.frame(date = c("Feb-10", "Feb-18"), type = c("type 2", "type 3"), another_col = c("entry 2", "entry 3")),
three = data.frame(date = c("Mar-10", "Mar-15"), type = c("type 1", "type 4"), another_col = c("entry 4", "entry 5"))
),
year3 = list( # this deliberately only contains two data frames
one = data.frame(date = c("Jan-10", "Jan-12"), type = c("type 1", "type 2")),
two = data.frame(date = c("Feb-8", "Jan-28"), type = c("type 2", "type 3"))
))
# add libraries
library(dplyr)
library(purrr)
# Map bind_rows to each list within the list
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year")
This will yield:
year level date type dates another_col
1 year1 one Jan-10 type 1 <NA> <NA>
2 year1 one Jan-22 type 2 <NA> <NA>
3 year1 two Feb-1 type 2 <NA> <NA>
4 year1 two Feb-28 type 3 <NA> <NA>
5 year1 three Mar-10 type 1 <NA> <NA>
6 year1 three Mar-15 type 4 <NA> <NA>
7 year2 one <NA> type 2 Jan-22 entry 2
8 year2 two Feb-10 type 2 <NA> entry 2
9 year2 two Feb-18 type 3 <NA> entry 3
10 year2 three Mar-10 type 1 <NA> entry 4
11 year2 three Mar-15 type 4 <NA> entry 5
12 year3 one Jan-10 type 1 <NA> <NA>
13 year3 one Jan-12 type 2 <NA> <NA>
14 year3 two Feb-8 type 2 <NA> <NA>
15 year3 two Jan-28 type 3 <NA> <NA>
Then of course you can do some regex parsing to keep only the numeric year:
l_of_lists %>%
map_dfr(~bind_rows(.x, .id = "level"), .id = "year") %>%
mutate(year = substring(year, regexpr("\\d", year)))
If you know that date and dates are the same, you can always use mutate to keep whichever of the two values is not missing (i.e. mutate(date = ifelse(!is.na(date), date, dates))).
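For the "tolerant" name matching asked for in the question, one possible base R sketch is to rename near-miss columns before binding. It assumes you can list the canonical column names up front and that no data frame contains two columns mapping to the same canonical name. fix_names and bind_fill are hypothetical helpers written for this illustration, not functions from any package:

```r
# Hypothetical helper: rename columns within 2 edits of a canonical name
fix_names <- function(d, canonical = c("date", "type", "another_col")) {
  dist <- utils::adist(names(d), canonical)     # edit-distance matrix
  nearest <- apply(dist, 1, which.min)          # closest canonical name per column
  close <- dist[cbind(seq_along(nearest), nearest)] <= 2
  names(d)[close] <- canonical[nearest[close]]
  d
}

# Hypothetical helper: row-bind data frames, filling absent columns with NA
bind_fill <- function(dfs) {
  cols <- unique(unlist(lapply(dfs, names)))
  filled <- lapply(dfs, function(d) {
    for (col in setdiff(cols, names(d))) d[[col]] <- NA
    d[cols]
  })
  do.call(rbind, filled)
}

dfs <- list(
  data.frame(date = c("Jan-10", "Jan-22"), type = c("type 1", "type 2"),
             stringsAsFactors = FALSE),
  data.frame(dates = "Jan-22", type = "type 2", another_col = "entry 2",
             stringsAsFactors = FALSE)
)
out <- bind_fill(lapply(dfs, fix_names))
out
```

Applying fix_names to each inner data frame before binding would avoid ending up with separate date and dates columns in the first place.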

binning the numbers with wrong outcome

I have problems with the output after I bin a numerical vector.
I am trying to bin the length of stay, which was calculated beforehand with the difftime function. It does not make sense to provide the whole code since this is only the background. Yet, when I bin, I do not get the right answer.
Here is the length of stay, which I assigned to los.
dput(los)
c(61.0416666666667, 61.0416666666667, 61.0416666666667, 2, 2, 3, 3)
Here are my breaks. I tried several variants with na.rm inside max: passing TRUE, passing FALSE, and leaving it out altogether.
breaks <- c(0, 0.8, 0.16,
1.0, 1.8, 1.16,
2.0, 2.8, 2.16,
3.0, 3.8, 3.16,
4.0, 4.8, 4.16,
5.0, 5.8, 5.16,
6.0, 6.8, 6.16,
7.0, 14.0, 21.0, 28.0, max(los)) #, , na.rm = FALSE
Nevertheless, the next code I tried
dt_los$losbinned <- cut(dt_los$LOS,
breaks = breaks,
labels = c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"),
right = FALSE)#
with different values passed for 'right', gives me this:
When right = FALSE, the LOS of 61.04 is not binned into the category "> 28 d", but I do get the right bins for the other values, 2.00 and 3.00.
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(NA, NA, NA, 7L, 7L,
10L, 10L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
When I pass right = TRUE, 61.04 is binned into "> 28 d", which is the desired answer, yet I do not get the right bins for 2.0 and 3.0, which are binned into "1 d 16hrs" and "2 d 16hrs" respectively. Again, these should be binned into "2 d" and "3 d".
structure(list(IDcol = 101:107, Admissions = structure(c(1539160200,
1539160200, 1539160200, 1539154800, 1539154800, 1539154800, 1539154800
), class = c("POSIXct", "POSIXt"), tzone = "Europe/London"),
Discharges = structure(c(1544434200, 1544434200, 1544434200,
1539327600, 1539327600, 1539414000, 1539414000), class = c("POSIXct",
"POSIXt"), tzone = "Europe/London"), Admission_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), LOS = c(61.0416666666667, 61.0416666666667,
61.0416666666667, 2, 2, 3, 3), Ward_code = c("DSN", "DSN",
"DNA", "NAS", "BAS", "BAS", "BAS"), Same_day_discharge = c(FALSE,
FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), Spell_type = c("Elective",
"Emergency", "Emergency", "Elective", "Emergency", "Elective",
"Emergency"), Adm_period = c(TRUE, TRUE, TRUE, TRUE, TRUE,
TRUE, TRUE), losbinned = structure(c(25L, 25L, 25L, 6L, 6L,
9L, 9L), .Label = c("0hrs", "8hrs", "16hrs", "1 d", "1 d 8hrs",
"1 d 16hrs", "2 d", "2 d 8hrs", "2 d 16hrs", "3 d", "3 d 8hrs",
"3 d 16hrs", "4 d", "4 d 8hrs", "4 d 16hrs", "5 d", "5 d 8hrs",
"5 d 16hrs", "6 d", "6 d 8hrs", "6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d"), class = "factor")), row.names = c(NA,
-7L), class = c("tbl_df", "tbl", "data.frame"))
The expected result is the right bin assigned to each length of stay: for 61.04 -> "> 28 d", for 2 -> "2 d", for 3 -> "3 d".
If this can be done with tidyverse, that would be amazing, as long as it respects the bins I have assigned. However, I am aware this may not be possible yet, so I am also okay with a corrected version of the code I have come up with.

By default (right = TRUE), the cut function's bins are exclusive on the left and inclusive on the right.
From the cut function's help: The factor level labels are constructed as "(b1, b2]", "(b2, b3]" etc. for right = TRUE and as "[b1, b2)" (the latter form applies when right = FALSE).
In order to include the lowest value (or the highest value in this case, because right = FALSE), the include.lowest = TRUE option is required. This will make the outermost bin inclusive on both sides, "[b1, b2]".
Try:
labels<-c("0hrs", "8hrs", "16hrs", "1 d",
"1 d 8hrs", "1 d 16hrs", "2 d",
"2 d 8hrs", "2 d 16hrs", "3 d",
"3 d 8hrs", "3 d 16hrs", "4 d",
"4 d 8hrs", "4 d 16hrs", "5 d",
"5 d 8hrs", "5 d 16hrs", "6 d",
"6 d 8hrs","6 d 16hrs", "7 - 14 d",
"14 - 21 d", "21 - 28 d", "> 28 d")
dt_los$losbinned <- cut(los, breaks=breaks, labels=labels, right=FALSE, include.lowest = TRUE)
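To see the effect in isolation, here is a small sketch with made-up breaks and labels (x, b and labs are illustrative and much coarser than the bins in the question):

```r
x    <- c(2, 3, 61)
b    <- c(0, 2, 3, 28, 61)          # last break equals max(x), as in the question
labs <- c("< 2 d", "2 d", "3 - 28 d", "> 28 d")

# Left-closed bins [b1, b2): the maximum falls outside the last bin and becomes NA
open_last <- cut(x, breaks = b, labels = labs, right = FALSE)

# include.lowest = TRUE closes the last bin, [28, 61], so the maximum is kept
closed_last <- cut(x, breaks = b, labels = labs, right = FALSE,
                   include.lowest = TRUE)

# Alternative: use Inf as the final break so no value can reach it
open_ended <- cut(x, breaks = c(0, 2, 3, 28, Inf), labels = labs, right = FALSE)

as.character(open_last)
as.character(closed_last)
as.character(open_ended)
```

Separately, cut sorts its breaks before use, and 8 and 16 hours expressed in days would be 8/24 and 16/24 rather than 0.8 and 0.16, so the sub-day labels in the question may not line up with the intended intervals. The sketch above does not try to correct that.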

R select certain values from multiple columns using conditions

I want to select certain values from multiple columns using conditions. (Also, let row 1 be ID#1, ..., row 5 be ID#5.)
column1 <- c("rice 2", "apple 4", "melon 6", "blueberry 4", "orange 6")
column2 <- c("rice 8", "blueberry 8", "grape 10", "water 10", "mango 3")
column3 <- c("rice 6", "apple 8", "blueberry 12", "pineapple 8", "mango 3")
I want to get a new column for the IDs that meet any of the conditions rice > 5, blueberry > 7 or orange > 5.
First, I would like to get ID#1, ID#2, ID#3, ID#5.
Second, I would like to count how many conditions are met per ID.
I would like to get these results:
ID#1 -> 2 conditions met
ID#2 -> 1 conditions met
ID#3 -> 1 conditions met
ID#4 -> 0 conditions met
ID#5 -> 1 conditions met

If I understood the question correctly, then one approach could be the following (df is the sample data given at the end of this answer):
library(dplyr)
cols <- names(df)[-1]
df1 <- df %>%
mutate_if(is.factor, as.character) %>%
mutate(rice_gt_5 = (select(., one_of(cols)) %>%
rowwise() %>%
mutate_all(funs(strsplit(., split=" ")[[1]][1] =='rice' & as.numeric(strsplit(., split=" ")[[1]][2]) > 5)) %>%
rowSums)) %>%
mutate(blueberry_gt_7 = (select(., one_of(cols)) %>%
rowwise() %>%
mutate_all(funs(strsplit(., split=" ")[[1]][1] =='blueberry' & as.numeric(strsplit(., split=" ")[[1]][2]) > 7)) %>%
rowSums)) %>%
mutate(orange_gt_5 = (select(., one_of(cols)) %>%
rowwise() %>%
mutate_all(funs(strsplit(., split=" ")[[1]][1] =='orange' & as.numeric(strsplit(., split=" ")[[1]][2]) > 5)) %>%
rowSums))
#IDs which satisfy at least one of your conditions i.e. rice > 5 OR blueberry > 7 OR orange > 5
df1$ID[which(df1 %>% select(rice_gt_5, blueberry_gt_7, orange_gt_5) %>% rowSums() >0)]
#[1] 1 2 3 5
#How many conditions are met per ID
df1 %>%
mutate(no_of_cond_met = rowSums(select(., one_of(c("rice_gt_5", "blueberry_gt_7", "orange_gt_5"))))) %>%
select(ID, no_of_cond_met)
# ID no_of_cond_met
#1 1 2
#2 2 1
#3 3 1
#4 4 0
#5 5 1
Sample data:
df <- structure(list(ID = 1:5, column1 = structure(c(5L, 1L, 3L, 2L,
4L), .Label = c("apple 4", "blueberry 4", "melon 6", "orange 6",
"rice 2"), class = "factor"), column2 = structure(c(4L, 1L, 2L,
5L, 3L), .Label = c("blueberry 8", "grape 10", "mango 3", "rice 8",
"water 10"), class = "factor"), column3 = structure(c(5L, 1L,
2L, 4L, 3L), .Label = c("apple 8", "blueberry 12", "mango 3",
"pineapple 8", "rice 6"), class = "factor")), .Names = c("ID",
"column1", "column2", "column3"), row.names = c(NA, -5L), class = "data.frame")
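A shorter base R variant of the same idea is sketched below. It assumes every cell has the form "name value" with a single space, and it counts each matching cell as one condition met, like the code above. The names fruit, thresholds, item, value, limit and met are illustrative:

```r
fruit <- data.frame(
  ID = 1:5,
  column1 = c("rice 2", "apple 4", "melon 6", "blueberry 4", "orange 6"),
  column2 = c("rice 8", "blueberry 8", "grape 10", "water 10", "mango 3"),
  column3 = c("rice 6", "apple 8", "blueberry 12", "pineapple 8", "mango 3"),
  stringsAsFactors = FALSE
)
thresholds <- c(rice = 5, blueberry = 7, orange = 5)

cells <- as.matrix(fruit[-1])                  # character matrix of "name value"
item  <- as.vector(sub(" .*$", "", cells))     # text before the space
value <- as.numeric(sub("^.* ", "", cells))    # number after the space
limit <- unname(thresholds[item])              # NA where no condition applies

# TRUE where the cell's item has a threshold and its value exceeds it
met <- matrix(!is.na(limit) & value > limit, nrow = nrow(cells))
fruit$no_of_cond_met <- rowSums(met)

fruit$ID[fruit$no_of_cond_met > 0]             # IDs meeting at least one condition
fruit[c("ID", "no_of_cond_met")]
```

Adding another condition would then only require a new entry in thresholds.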

Extract maximum number from text string

I am trying to get the maximum value from the column Number struck in a data frame. As you can see some of the rows have a range. Thanks in advance.
Aircraft Number struck
B-757-200 2 to 10
B-737-300 1
B-737-300 1
B-727-200 1
UNKNOWN 1
C-550 1
B-727-200 1
CITATION II 1
DA-2000 1
B-737-500 1
B-737-300 2 to 10
UNKNOWN 2 to 10
HAWKER 800 1
MD-80 11 to 100
B-737-400 1
B-737 1
B-767-300 2 to 10
EMB-120 2 to 10
Data
df <- structure(list(Aircraft = c("B-757-200", "B-737-300", "B-737-300",
"B-727-200", "UNKNOWN", "C-550", "B-727-200", "CITATION II",
"DA-2000", "B-737-500", "B-737-300", "UNKNOWN", "HAWKER 800",
"MD-80", "B-737-400", "B-737", "B-767-300", "EMB-120"), Number.struck = c("2 to 10",
"1", "1", "1", "1", "1", "1", "1", "1", "1", "2 to 10", "2 to 10",
"1", "11 to 100", "1", "1", "2 to 10", "2 to 10")), .Names = c("Aircraft",
"Number.struck"), row.names = c(NA, -18L), class = "data.frame")

Maybe this will work
res <- as.numeric(as.character(unlist(strsplit(gsub("[a-zA-Z]","",df$Number.struck),"\\s"))))
max(res,na.rm=T)
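An alternative sketch that extracts the digit runs directly, instead of deleting the letters first, is shown here on a shortened version of the column (struck, found, overall_max and per_row are illustrative names):

```r
struck <- c("2 to 10", "1", "11 to 100", "1")

# All runs of digits in each string, as a list of character vectors
found <- regmatches(struck, gregexpr("[0-9]+", struck))

# Largest number anywhere in the column
overall_max <- max(as.numeric(unlist(found)))

# Largest number within each row, i.e. the upper end of each range
per_row <- vapply(found, function(v) max(as.numeric(v)), numeric(1))

overall_max
per_row
```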
