Let's consider the following data:
df1 <- data.frame(col_1 = rnorm(100), col_2 = runif(100), col_3 = rexp(100))
head(df1)
col_1 col_2 col_3
1 1.1626853 0.7081688 0.1356186
2 -0.5859245 0.8679017 0.4680558
3 1.7854650 0.4107538 0.5867553
4 -1.3325937 0.3032165 0.4111656
5 -0.4465668 0.8882200 3.4235329
6 0.5696061 0.4715614 1.0981746
Now I want to filter my data:
df1 %>%
filter(col_1>0)
However, I lost my original numbering, i.e. I now have a new data frame with rows numbered 1-49, and I want to keep the old indexing with just the filtered rows deleted. Is there any way this can be done?
The best approach is to create a new column holding the row index, since tibbles don't support row names.
library(dplyr)
df1 %>%
mutate(row = row_number()) %>%
filter(col_1 > 0)
In order to keep the row index, try this:
library(tidyverse)
#Data
df1 <-data.frame('col_1'=rnorm(100),'col_2'=runif(100),'col_3'=rexp(100))
#Code
new <- df1 %>% rownames_to_column('id') %>%
filter(col_1>0) %>%
column_to_rownames('id')
Output:
col_1 col_2 col_3
1 0.44582154 0.485113710 1.12780556
9 0.91338077 0.028025045 0.03392986
12 0.39850519 0.693677593 0.08575707
15 1.31992767 0.875082565 1.69923642
18 1.01032450 0.874306072 0.07470948
19 0.21004100 0.489900673 0.06544119
20 1.83231058 0.777010624 1.04503362
23 1.76636414 0.932134284 0.89963322
24 0.14665427 0.453811105 1.69614288
27 0.95768915 0.540466270 2.08754680
28 2.12894656 0.265205677 1.26068462
29 1.20613178 0.590121360 0.69933346
31 0.17498536 0.003435992 0.90773187
33 1.09692125 0.321649196 3.08840026
35 0.71434379 0.592343229 1.51961595
36 2.18998179 0.288959794 0.86319077
37 0.24424922 0.129267751 0.01765732
39 1.10932154 0.515400529 0.34381840
40 1.62120910 0.843270861 1.22549044
42 0.61201364 0.299831635 0.24302644
43 0.69583869 0.621354113 1.71074969
50 0.12516294 0.337942860 0.13970981
51 0.55032446 0.204976125 0.58245053
52 1.24819371 0.796629076 0.36528538
53 0.78363419 0.321154495 0.09472414
55 0.98528573 0.626797295 0.36268645
56 0.82932405 0.404080363 0.18517625
60 0.65893951 0.441280360 0.15770949
62 0.23747401 0.498418489 0.32947354
67 2.05117816 0.702286040 2.04353073
68 0.46038166 0.455878959 0.78142526
69 0.85814858 0.167027385 0.77806710
73 0.36265229 0.836850527 0.08689737
74 1.75032050 0.918432489 2.44187445
80 1.84781396 0.064257761 1.31418005
82 0.69448019 0.664345881 0.22248944
84 1.43213456 0.172975017 1.02372291
86 0.05623400 0.436021922 0.67705170
87 0.50485963 0.791348607 0.32379094
90 0.08281623 0.608697963 0.87405171
91 0.15252262 0.026808318 0.28446487
92 0.13104612 0.649343508 1.19998877
95 2.47542034 0.071355988 0.78619673
97 0.42994024 0.616706005 0.68963918
98 1.42811745 0.642106243 0.99258297
99 0.27834373 0.310252127 0.71026805
100 0.98552422 0.073099646 0.21789834
Using dplyr, we can use slice
library(dplyr)
df1 %>%
mutate(row = row_number()) %>%
slice(which(col_1 > 0))
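Also worth noting: as long as df1 stays a plain data.frame (rather than a tibble), base R subsetting with `[` already preserves the original row names, so no helper column is needed. A minimal sketch:

```r
set.seed(1)
df1 <- data.frame(col_1 = rnorm(100), col_2 = runif(100), col_3 = rexp(100))

# Base-R subsetting keeps the original row names of a data.frame
kept <- df1[df1$col_1 > 0, ]

# The surviving row names are exactly the original row indices
identical(rownames(kept), as.character(which(df1$col_1 > 0)))  # TRUE
```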
Related
I have 2880 observations in my data.frame. I have to create a new data.frame in which I select rows 25-77 out of every 96 rows.
df.new = df[seq(25, nrow(df), 77), ] # extract from 25 to 77
The above code extracts only row number 25 to 77 but I want every row from 25 to 77 in every 96 rows.
One option is to create a vector of indices with which to subset the dataframe.
idx <- rep(25:77, times = nrow(df)/96) + 96*rep(0:29, each = 77-25+1)
df[idx, ]
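A quick sanity check of the index vector (a sketch using the question's 2880 rows; the n_rows name is mine):

```r
n_rows <- 2880

# Rows 25:77 of each block, shifted by 96 for each of the 30 blocks
idx <- rep(25:77, times = n_rows / 96) + 96 * rep(0:29, each = 77 - 25 + 1)

length(idx)  # 30 blocks * 53 rows = 1590
idx[54]      # first kept row of the second block: 96 + 25 = 121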
You can use recycling technique to extract these rows :
from = 25
to = 77
n = 96
df.new <- df[rep(c(FALSE, TRUE, FALSE), c(from - 1, to - from + 1, n - to)), ]
To explain, for this example it works as follows:
length(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19))) #returns
#[1] 96
In these 96 values, values 25-77 are TRUE and the rest are FALSE, which we can verify with:
which(rep(c(FALSE, TRUE, FALSE), c(24, 53, 19)))
# [1] 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
#[23] 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68
#[45] 69 70 71 72 73 74 75 76 77
Now this vector is recycled for all the remaining rows in the dataframe.
First, define a Group variable, with values 1 to 30, each value repeating 96 times. Then define RowWithinGroup and filter as required. Finally, undo the changes introduced to do the filtering.
df <- tibble(X=rnorm(2880)) %>%
add_column(Group=rep(1:30, each=96)) %>%
group_by(Group) %>%
mutate(RowWithinGroup=row_number()) %>%
filter(RowWithinGroup >= 25 & RowWithinGroup <= 77) %>%
select(-Group, -RowWithinGroup) %>%
ungroup()
Welcome to SO. This question may not have been asked in this exact form before, but the principles required have been referenced in many, many questions and answers.
A one-liner base R solution:
lapply(split(df, cut(1:nrow(df), nrow(df)/96, F)), `[`, 25:77, )
Note: nothing comes after the last comma; the empty argument selects all columns.
The code above returns a list. To combine all data together, just pass the result above into
do.call(rbind, ...)
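Putting the two steps together (a sketch with a made-up two-column df of 2880 rows, matching the question's dimensions):

```r
df <- data.frame(x = seq_len(2880), y = rnorm(2880))

# Split into 30 consecutive blocks of 96 rows each, then keep rows 25-77
# of every block and bind the pieces back into one data frame
blocks <- split(df, cut(seq_len(nrow(df)), nrow(df) / 96, labels = FALSE))
df.new <- do.call(rbind, lapply(blocks, `[`, 25:77, ))

nrow(df.new)  # 30 blocks * 53 rows = 1590
```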
I was wondering if there is an easy way to create a table that has column totals as well as row totals?
smoke <- matrix(c(51,43,22,92,28,21,68,22,9),ncol=3,byrow=TRUE)
colnames(smoke) <- c("High","Low","Middle")
rownames(smoke) <- c("current","former","never")
smoke <- as.table(smoke)
I thought this would be super easy, but the solutions I found so far seem pretty complicated, involving lapply and rbind. Since this seems like such a trivial task, there must be an easier way?
desired results:
> smoke
High Low Middle TOTAL
current 51 43 22 116
former 92 28 21 141
never 68 22 9 99
TOTAL 211 93 52 356
addmargins(smoke)
addmargins is in the stats package.
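Applied to the table above (a sketch; the margins are labelled "Sum" by default, so to match the question's "TOTAL" label I rename them via the dimnames afterwards):

```r
smoke <- matrix(c(51, 43, 22, 92, 28, 21, 68, 22, 9), ncol = 3, byrow = TRUE)
colnames(smoke) <- c("High", "Low", "Middle")
rownames(smoke) <- c("current", "former", "never")
smoke <- as.table(smoke)

# addmargins() appends a "Sum" row and column
totals <- addmargins(smoke)
rownames(totals)[4] <- "TOTAL"
colnames(totals)[4] <- "TOTAL"

totals["current", "TOTAL"]  # 116
totals["TOTAL", "TOTAL"]    # 356
```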
You can use adorn_totals from janitor :
library(janitor)
library(magrittr)
smoke %>%
as.data.frame.matrix() %>%
tibble::rownames_to_column() %>%
adorn_totals(name = 'TOTAL') %>%
adorn_totals(name = 'TOTAL', where = 'col')
# rowname High Low Middle TOTAL
# current 51 43 22 116
# former 92 28 21 141
# never 68 22 9 99
# TOTAL 211 93 52 356
I have a base with the following information:
edit: each row is an individual that lives in a house. Multiple individuals, each with a unique P_ID and AGE, can live in the same house, sharing the same H_ID. I'm looking for all the houses, with all their individuals, based on the condition that there is at least one person over 60 in the house. I hope that explains it better.
show(base)
H_ID P_ID AGE CONACT
1 10010000001 1001000000102 35 33
2 10010000001 1001000000103 12 31
3 10010000001 1001000000104 5 NA
4 10010000001 1001000000101 37 10
5 10010000002 1001000000206 5 NA
6 10010000002 1001000000205 10 NA
7 10010000002 1001000000204 18 31
8 10010000002 1001000000207 3 NA
9 10010000002 1001000000203 24 35
10 10010000002 1001000000202 43 33
11 10010000002 1001000000201 47 10
12 10010000003 1001000000302 26 33
13 10010000003 1001000000301 29 10
14 10010000004 1001000000401 56 32
15 10010000004 1001000000403 22 31
16 10010000004 1001000000402 49 10
17 10010000005 1001000000503 1 NA
18 10010000005 1001000000501 24 10
19 10010000005 1001000000502 23 10
20 10010000006 1001000000601 44 10
21 10010000007 1001000000701 69 32
I want a list with all the houses and all the individuals living there based on the condition that there's at least one person 60+, here's a link for the data: https://drive.google.com/drive/folders/1Od8zlOE3U3DO0YRGnBadFz804OUDnuQZ?usp=sharing
And here's how I made the base:
hogares<-read.csv("/home/servicio/Escritorio/TR_VIVIENDA01.CSV")
personas<-read.csv("/home/servicio/Escritorio/TR_PERSONA01.CSV")
datos<-merge(hogares,personas)
base<-data.frame(datos$ID_VIV, datos$ID_PERSONA, datos$EDAD, datos$CONACT)
base
Any help is much appreciated, thanks!
This can be done by:
Adding a variable with the maximum age per household
base$maxage <- ave(base$AGE, base$H_ID, FUN=max)
Then keep only households with a maximum age of at least 60.
base <- subset(base, maxage >= 60)
Or you could combine the two lines into one. With the column names in your linked data:
> base <- subset(base, ave(base$datos.EDAD, base$datos.ID_VIV, FUN=max) >= 60)
> head(base)
datos.ID_VIV datos.ID_PERSONA datos.EDAD datos.CONACT
21 10010000007 1001000000701 69 32
22 10010000008 1001000000803 83 33
23 10010000008 1001000000802 47 33
24 10010000008 1001000000801 47 10
36 10010000012 1001000001204 4 NA
37 10010000012 1001000001203 2 NA
Using dplyr, we can group_by H_ID and select houses where any AGE is greater than 60.
library(dplyr)
df %>% group_by(H_ID) %>% filter(any(AGE > 60))
Similarly with data.table
library(data.table)
setDT(df)[, .SD[any(AGE > 60)], H_ID]
To get a list of the houses with a tenant aged over 60, we can filter and create a list of distinct H_IDs:
house_list <- base %>%
filter(AGE > 60) %>%
distinct(H_ID) %>%
pull(H_ID)
Then we can filter the original dataframe based on that house_list to remove any households that do not have someone over the age of 60.
house_df <- base %>%
filter(H_ID %in% house_list)
To then calculate the CON values we can filter out NA values in CONACT, group_by(H_ID) and summarize to find the number of individuals within each house that have a non-NA CONACT value.
CON_calcs <- house_df %>%
filter(!is.na(CONACT)) %>%
group_by(H_ID) %>%
summarize(Count = n())
And join that back into house_df by H_ID to include the newly calculated CON values; that should produce your desired result.
final_df <- left_join(house_df, CON_calcs, by = 'H_ID')
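The steps above can also be collapsed into one grouped pipeline. A sketch on a small made-up base (the Count column is my name for the per-house number of non-NA CONACT values):

```r
library(dplyr)

# Toy data: only house 2 has a member over 60
base <- data.frame(
  H_ID   = c(1, 1, 2, 2, 3),
  P_ID   = 1:5,
  AGE    = c(35, 12, 69, 5, 44),
  CONACT = c(33, 31, 32, NA, 10)
)

final_df <- base %>%
  group_by(H_ID) %>%
  filter(any(AGE > 60)) %>%                  # keep whole households
  mutate(Count = sum(!is.na(CONACT))) %>%    # non-NA CONACT per house
  ungroup()

# Only house 2 (ages 69 and 5) survives, with Count = 1
```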
Assume i have a dataset like:
df<-data.frame(data=(1:100))
How can I select the nth 20% of my data?
Let's say I need to access the third 20%, which contains the numbers 41-60.
Using the function ntile from the dplyr package, we divide the data frame into 5 buckets and take the third one.
library(dplyr)
# One line
df[ntile(df$data, 5) == 3, ]
# Using pipes
df %>%
mutate(n = ntile(data, 5)) %>%
filter(n == 3) %>%
select(data)
Output:
[1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
Here's a quick function that returns the row numbers of the i-th set, based on a percentage of the rows of the data:
rowNumbs <- function(i, perc, df){
((i - 1)*ceiling(perc*nrow(df)) + 1) : (i*ceiling(perc*nrow(df)))
}
where i is the nth set, perc is the percentage and df is the data.frame.
To call the third 20% of your data.frame:
df[rowNumbs(3, .2, df), ]
[1] 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
I have two data frames. One holds baseline data for different test types and the other holds my experiment data. Now I would like to combine the two data frames, but it is not a simple merge or rbind. I wonder whether any R professionals can help me solve it. Thank you.
Here is a example of two data frames:
experiment data:
experiment_num timepoint type value
50 10 7a,b4 90
50 20 7a,b4 89
50 20 10a,b4 93
50 10 7a,b6 85
50 20 7a,b6 87
50 20 10a,b6 88
baseline data:
experiment_num timepoint type value
50 0 0,b4 85
50 0 0,b6 90
Here is the output I would like to have:
experiment_num timepoint type value
50 0 7a,b4 85
50 10 7a,b4 90
50 20 7a,b4 89
50 0 10a,b4 85
50 20 10a,b4 93
50 0 7a,b6 90
50 10 7a,b6 85
50 20 7a,b6 87
50 0 10a,b6 90
50 20 10a,b6 88
This should do the job. You first need to install a couple of packages:
install.packages("dplyr")
install.packages("tidyr")
Data:
ed <- data.frame(experiment_num=rep(50, 6), timepoint=rep(c(10, 20, 20), 2),
type=c("7a,b4", "7a,b4", "10a,b4", "7a,b6", "7a,b6", "10a,b6"),
value=c(90, 89, 93, 85, 87, 88))
db <- data.frame(experiment_num=rep(50, 2), timepoint=rep(0, 2), type=c("0,b4", "0,b6"),
value=c(85, 90))
Code:
library(tidyr)
library(dplyr)
final <- rbind(separate(ed, type, into=c("typea", "typeb")),
left_join(ed %>% select(type) %>% unique %>%
separate(type, into=c("typea", "typeb")),
separate(db, type, into=c("zero", "typeb"))) %>%
select(experiment_num, timepoint, typea, typeb, value)
) %>%
arrange(typeb, typea, timepoint) %>% mutate(type=paste(typea, typeb, sep=",")) %>%
select(experiment_num, timepoint, type, value)
The logic is the following.
Separate type into two columns, typea and typeb, then "create" the missing typea for the baseline data, and then join it to the experimental data.
final is the data set you are looking for.
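An alternative sketch of the same idea, building the missing timepoint-0 rows with a join on the shared "b" suffix instead of rbind-ing separated frames (the base_rows name and the sub() calls are mine):

```r
library(dplyr)

ed <- data.frame(experiment_num = rep(50, 6), timepoint = rep(c(10, 20, 20), 2),
                 type = c("7a,b4", "7a,b4", "10a,b4", "7a,b6", "7a,b6", "10a,b6"),
                 value = c(90, 89, 93, 85, 87, 88))
db <- data.frame(experiment_num = rep(50, 2), timepoint = rep(0, 2),
                 type = c("0,b4", "0,b6"), value = c(85, 90))

# One timepoint-0 row per experimental type, taking its value from the
# baseline row with the same suffix after the comma ("b4" or "b6")
base_rows <- ed %>%
  distinct(experiment_num, type) %>%
  mutate(typeb = sub(".*,", "", type)) %>%
  inner_join(db %>%
               transmute(typeb = sub(".*,", "", type), timepoint, value),
             by = "typeb") %>%
  select(experiment_num, timepoint, type, value)

final <- bind_rows(base_rows, ed) %>% arrange(type, timepoint)
```

Note that arrange(type, timepoint) sorts the type strings alphabetically, so the row order differs from the output shown in the question even though the rows themselves match.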