R dplyr select not removing columns - r

I'm a new user of R and have a very basic question. I'm trying to delete columns using the dplyr select function. It appears to run correctly but then when the data is viewed using head the deleted column still appears, and also a count is still able to be run on this column. I've run this on a very simple test dataset, the outputs are below. Please advise on how to permanently delete the columns from the data. Thanks
> library(dplyr)
> setwd("C:/")
> mydata <- read_csv("test.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
`Smoking Status` = col_character()
)
> head(mydata)
# A tibble: 4 x 3
Age Gender `Smoking Status`
<dbl> <chr> <chr>
1 18 M Smoker
2 25 F Non-smoker
3 40 M Ex-smoker
4 53 F Non-smoker
> select(mydata,-Age)
# A tibble: 4 x 2
Gender `Smoking Status`
<chr> <chr>
1 M Smoker
2 F Non-smoker
3 M Ex-smoker
4 F Non-smoker
> head(mydata)
# A tibble: 4 x 3
Age Gender `Smoking Status`
<dbl> <chr> <chr>
1 18 M Smoker
2 25 F Non-smoker
3 40 M Ex-smoker
4 53 F Non-smoker
> mydata %>%
+ count(Age)
# A tibble: 4 x 2
Age n
<dbl> <int>
1 18 1
2 25 1
3 40 1
4 53 1

If I understand your question correctly, the column is not being deleted because you are not assigning the result of select() to a variable.
df <- data.frame(age = 10:20,
sex = c('M','M','F','F','M','F','F','M','F','F','M'),
smoker = c('N','N','Y','N','N','Y','N','N','Y','Y','N'))
df_1 <- select(df,-age)
head(df_1)
sex smoker
1 M N
2 M N
3 F Y
4 F N
5 M N
6 F Y
I hope this helps.

I have extracted the first 4 rows (head) of your data and turned them into a reproducible example that anyone can copy and run easily. This helps us understand your problem, which in turn helps you get your answer faster.
# Dataframe based on head of your table
mydata <- data.frame(Age = c(18,25,40,53),
Gender = c("M","F","M","F"),
Smoking_Status = c("Smoker","Non_smoker","Ex-smoker","Non-smoker"))
> mydata
Age Gender Smoking_Status
1 18 M Smoker
2 25 F Non_smoker
3 40 M Ex-smoker
4 53 F Non-smoker
Essentially, transforming a data frame in any way returns a new data frame rather than changing the original, and this new data frame needs to be saved into a variable. This can be done using either = or <-.
I prefer <- because it makes an assignment easy to spot.
If you have no need for your original data frame, you can simply overwrite it by assigning the new data frame to the same name.
mydata <- select(mydata, -Age)
To preserve your original data frame, you can create a new variable and store this data frame inside. Now, mydata is still the same as above but mydata2 has no Age column.
mydata2 <- select(mydata, -Age)
> mydata2
Gender Smoking_Status
1 M Smoker
2 F Non_smoker
3 M Ex-smoker
4 F Non-smoker
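A minimal sketch of the same point, using a cut-down, made-up version of the data: calling select() without assignment only prints the result, while assigning it back replaces the original. The name before_names is introduced here purely for illustration.

```r
library(dplyr)

# Small stand-in for the asker's data (values are illustrative)
mydata <- data.frame(Age = c(18, 25), Gender = c("M", "F"))

# No assignment: the result is returned (and printed at the console),
# but mydata itself is left untouched
select(mydata, -Age)
before_names <- names(mydata)

# Assignment: mydata is replaced by the version without Age
mydata <- mydata %>% select(-Age)
names(mydata)
```

In base R, mydata$Age <- NULL removes the column in place without dplyr.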

Related

Creating a Subset from a Dataframe using a Group Based on summary values

count(df1,age,gender)
age gender n
25 M 4
32 F 3
full_df
patient_ID age gender
pt1 23 M
pt2 26 F
...
I would like to create a 4:1 age/sex matched subset of full_df based on the count stats of df1. For example, I have 4 male patients aged 25 in df1, so I would like to pull 16 random male patients aged 25 from full_df, and likewise 12 females aged 32.
I need to find a way to shuffle full_df, then add 1:len(group) to it as follows:
patient_ID age gender order
pt100 25 M 1
pt251 25 M 2
pt201 25 M 3
...
pt376 26 M 1
pt872 26 M 2
pt563 26 M 3
...
I have created a small example for you based only on age (since there was no example df available this saves a lot of typing) but you can easily add gender to the method.
First we join the dataframe with the count information to the full dataframe, and then sample the number of rows per age group (in this example 2 times n, you would want to do 4 times n but my df is too small).
Then we add a new column 'order' with numbers ranging from 1 to the number of samples and lastly drop the 'n' column.
df1 = data.frame(age = c(25,32),
n = c(1,2))
df = data.frame(patient_ID = 1:10,
age = c(rep(25,4),rep(32,6)))
df %>%
left_join(df1, by = 'age') %>%
group_by(age) %>%
sample_n(n*2) %>%
mutate(order = 1:n()) %>%
ungroup() %>%
select(-n)
This gives the output with the selected patients (in line with the numbers in df1):
# A tibble: 6 x 3
patient_ID age order
<int> <dbl> <int>
1 4 25 1
2 2 25 2
3 10 32 1
4 9 32 2
5 7 32 3
6 8 32 4
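As a rough sketch of how gender could be added to the method above, the join and the grouping can both use age and gender. The data below are made up for illustration, ratio is a hypothetical helper variable, and slice(sample(...)) is swapped in for sample_n(); it fails if a group has fewer rows than requested.

```r
library(dplyr)

set.seed(1)  # only so the sketch is repeatable

# Hypothetical stand-ins for df1 (counts) and full_df (patient pool)
df1 <- data.frame(age = c(25, 32), gender = c("M", "F"), n = c(1, 2))
full_df <- data.frame(
  patient_ID = paste0("pt", 1:20),
  age = rep(c(25, 32), each = 10),
  gender = rep(c("M", "F"), each = 10)
)

ratio <- 4  # matched patients to draw per patient counted in df1

matched <- full_df %>%
  inner_join(df1, by = c("age", "gender")) %>%   # attach n to each row
  group_by(age, gender) %>%
  slice(sample(n(), first(n) * ratio)) %>%       # random rows per group
  mutate(order = row_number()) %>%               # 1..k within each group
  ungroup() %>%
  select(-n)
```

Here n() is the group size and n is the joined count column, as in the answer above.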

Caret dummy variable does not work as expected

I am trying to use caret's dummyVars function in R to convert some categorical data to numeric. My dataset has an id, town (A or B), district (d1, d2, d3), street (s1, s2, s3, s4), family (f1, f2, f3), gender (male, female), and replicate (numeric). Here is a snapshot:
Dataset Snapshot
Here is the code I currently have to encode the variables:
library('caret')
train <- read.csv("HW1PB4Data_train.csv", header = TRUE)
dummy <- dummyVars("~ .", data = train)
train2 <- data.frame(predict(dummy, newdata = train))
train2
When I look at the output, train2, it has created a few additional towns (C, D, E) which did not exist in the original data. This does not happen with any of the other columns. Why is this? How do I fix it? Here is a snapshot of the output data: Output
We can use tidyr::pivot_wider or fastDummies::dummy_cols
Example data:
library(dplyr)
df <- tibble(subject = c(1.2, 1.5), town = c('a', 'b'), street = c('1', '2'))
# A tibble: 2 × 3
subject town street
<dbl> <chr> <chr>
1 1.2 a 1
2 1.5 b 2
Solution with tidyr:
library(tidyr)
df %>% pivot_wider(names_from = c(town:street),
values_from = c(town:street),
values_fill = 0,
values_fn = ~1)
# A tibble: 2 × 5
subject town_a_1 town_b_2 street_a_1 street_b_2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1.2 1 0 1 0
2 1.5 0 1 0 1
Solution with dummy_cols:
> library(fastDummies)
> dummy_cols(df,
c("town", "street"),
remove_selected_columns = TRUE)
# A tibble: 2 × 5
subject town_a town_b street_1 street_2
<dbl> <int> <int> <int> <int>
1 1.2 1 0 1 0
2 1.5 0 1 0 1
The above answer is already good. You can also go the easy way and just use an ifelse statement to convert your data from categorical to numeric. An example dataset similar to yours:
train <- data.frame(subject = round(rnorm(n=100,
mean=5,
sd=2)), # rounded subjects
town = rep(c("A","B"),50),
district = rep(c("d1","d2"),50),
street = rep(c("s1","s2"),50),
family = rep(c("f1","f2"),50),
gender = rep(c("male","female"),50),
replicate = rbinom(n=100,
size=2,
prob=.9))
head(train)
Seen below:
subject town district street family gender replicate
1 6 A d1 s1 f1 male 2
2 4 B d2 s2 f2 female 2
3 4 A d1 s1 f1 male 1
4 7 B d2 s2 f2 female 2
5 3 A d1 s1 f1 male 2
6 6 B d2 s2 f2 female 2
Simply mutate the gender data with ifelse by coding "male" as 0 and everything else ("female" in this case) as 1:
m.train <- train %>%
mutate(gender = ifelse(gender=="male",0,1))
head(m.train)
You get a transformed gender variable with 0's and 1's for dummy coding:
subject town district street family gender replicate
1 6 A d1 s1 f1 0 2
2 4 B d2 s2 f2 1 2
3 4 A d1 s1 f1 0 1
4 7 B d2 s2 f2 1 2
5 3 A d1 s1 f1 0 2
6 6 B d2 s2 f2 1 2
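On the original question of where the extra towns come from: one possible cause, and this is only an assumption since the data are not shown, is that town is a factor carrying unused levels. The sketch below uses base R's model.matrix() rather than caret to show that unused levels produce extra dummy columns and that droplevels() removes them. The toy data frame is made up.

```r
# Toy data: only A and B occur, but the factor also declares C, D and E
toy <- data.frame(
  town = factor(c("A", "B", "A"), levels = c("A", "B", "C", "D", "E"))
)

# One dummy column per declared level, including the unused ones
with_unused <- model.matrix(~ town - 1, data = toy)
colnames(with_unused)

# Drop the unused levels, then encode again
toy$town <- droplevels(toy$town)
without_unused <- model.matrix(~ town - 1, data = toy)
colnames(without_unused)
```

Running levels(train$town) and table(train$town) on the real data would show whether this applies.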

R extracting the frequencies

I am trying to get the frequencies but my ids are repeating. Here is a sample data:
id <- c(1,1,2,2,3,3)
gender <- c("m","m","f","f","m","m")
score <- c(10,5,10,5,10,5)
data <- data.frame("id"=id,"gender"=gender, "score"=score)
> data
id gender score
1 1 m 10
2 1 m 5
3 2 f 10
4 2 f 5
5 3 m 10
6 3 m 5
I would like to get the frequencies of the gender categories but I have repeating ids. When I run this code below:
gender<-as.data.frame(table(data$gender))
> gender
Var1 Freq
1 f 2
2 m 4
The frequency should be female = 1, male = 2. It should look like this:
> gender
Var1 Freq
1 f 1
2 m 2
How can I get this considering the id information?
You can use data.table::uniqueN to count the number of unique ids per gender group
library(data.table)
setDT(data)
data[, .(Freq = uniqueN(id)), gender]
# gender Freq
# 1: m 2
# 2: f 1
The idea from @IceCreamToucan with dplyr:
data %>%
group_by(gender) %>%
summarise(freq = n_distinct(id))
gender freq
<fct> <int>
1 f 1
2 m 2
In base R
rowSums(table(data$gender,data$id)!=0)
f m
1 2
Being late to the party, I was quite surprised by the sophisticated answers which use grouping or rowSums().
In base R, I would
remove the duplicate id rows from the data.frame by subsetting with !duplicated(id),
apply table() to the gender column.
So, the code is
table(data[!duplicated(data$id), "gender"])
f m
1 2
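A closely related base R sketch, assuming each id always has a single gender: reduce the data to unique (id, gender) pairs first, then tabulate. The names people and freq are introduced here for illustration.

```r
# Sample data from the question
data <- data.frame(
  id = c(1, 1, 2, 2, 3, 3),
  gender = c("m", "m", "f", "f", "m", "m"),
  score = c(10, 5, 10, 5, 10, 5)
)

# One row per distinct (id, gender) combination
people <- unique(data[, c("id", "gender")])

# Count people, not rows, per gender
freq <- table(people$gender)
freq
```

If an id could appear with more than one gender, this would count that id once per gender, which may or may not be what is wanted.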

How can I create an incremental ID column based on whenever one of two variables are encountered?

My data came to me like this (but with 4000+ records). The following is data for 4 patients. Every time you see surgery OR age reappear, it is referring to a new patient.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
So to say again, every time surgery or age appear (surgery isn't always there, but age is), those records and the ones after pertain to the same patient until you see surgery or age appear again.
Thus I somehow need to add an ID column with this data:
ID = c(1,1,1,1,1,2,2,2,2,2,3,3,3,4,4,4,4)
testdat$ID = ID
I know how to transpose and melt and all that to put the data into regular format, but how can I create that ID column?
Advice on relevant tags to use is helpful!
Assuming that surgery and age will be the first two pieces of information for each patient, and that each patient will have at least one piece of information afterward that is not age or surgery, this is a solution.
col1 = c("surgery", "age", "weight","albumin","abiotics","surgery","age", "weight","BAPPS", "abiotics","surgery", "age","weight","age","weight","BAPPS","albumin")
col2 = c("yes","54","153","normal","2","no","65","134","yes","1","yes","61","210", "46","178","no","low")
testdat = data.frame(col1,col2)
library(dplyr)
library(tidyr)  # for spread() below
# Use a tibble and get rid of factors.
dfTest = as_tibble(testdat) %>%
mutate_all(as.character)
# A little dplyr magic to find the start of each new patient, then give them an id.
dfTest = dfTest %>%
mutate(couldBeStart = if_else(col1 == "surgery" | col1 == "age", T, F)) %>%
mutate(isStart = couldBeStart & !lag(couldBeStart, default = FALSE)) %>%
mutate(patientID = cumsum(isStart)) %>%
select(-couldBeStart, -isStart)
# # A tibble: 17 x 3
# col1 col2 patientID
# <chr> <chr> <int>
# 1 surgery yes 1
# 2 age 54 1
# 3 weight 153 1
# 4 albumin normal 1
# 5 abiotics 2 1
# 6 surgery no 2
# 7 age 65 2
# 8 weight 134 2
# 9 BAPPS yes 2
# 10 abiotics 1 2
# 11 surgery yes 3
# 12 age 61 3
# 13 weight 210 3
# 14 age 46 4
# 15 weight 178 4
# 16 BAPPS no 4
# 17 albumin low 4
# Get the data to a wide workable format.
dfTest %>% spread(col1, col2)
# # A tibble: 4 x 7
# patientID abiotics age albumin BAPPS surgery weight
# <int> <chr> <chr> <chr> <chr> <chr> <chr>
# 1 1 2 54 normal NA yes 153
# 2 2 1 65 NA yes no 134
# 3 3 NA 61 NA NA yes 210
# 4 4 NA 46 low no NA 178
Using dplyr:
library(dplyr)
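testdat = testdat %>%
# default = TRUE keeps the counter from becoming NA when the very first row is 'age'
mutate(patient_counter = cumsum(col1 == 'surgery' | (col1 == 'age' & lag(col1 != 'surgery', default = TRUE))))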
This works by checking whether the col1 value is either 'surgery' or 'age', provided 'age' is not preceded by 'surgery'. It then uses cumsum() to get the cumulative sum of the resulting logical vector.
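The cumsum() step can be seen in isolation in a small base R sketch. The shortened vector and the helper names is_key, prev_key and is_start are made up for illustration.

```r
# Shortened, made-up version of col1 covering two patients
col1 <- c("surgery", "age", "weight", "age", "weight", "BAPPS")

# TRUE where the row is one of the two marker variables
is_key <- col1 %in% c("surgery", "age")

# Same flags shifted down by one row (the first row has no predecessor)
prev_key <- c(FALSE, head(is_key, -1))

# A patient starts at a marker row that does not follow another marker row
is_start <- is_key & !prev_key

# Each TRUE bumps the running total by one, giving the patient id
patient_id <- cumsum(is_start)
patient_id
```

Because TRUE counts as 1 and FALSE as 0, the running total only increases at the first row of each patient.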
You can try the following
keywords <- c('surgery', 'age')
lgl <- testdat$col1 %in% keywords
testdat$ID <- cumsum(c(0, diff(lgl)) == 1) + 1
col1 col2 ID
1 surgery yes 1
2 age 54 1
3 weight 153 1
4 albumin normal 1
5 abiotics 2 1
6 surgery no 2
7 age 65 2
8 weight 134 2
9 BAPPS yes 2
10 abiotics 1 2
11 surgery yes 3
12 age 61 3
13 weight 210 3
14 age 46 4
15 weight 178 4
16 BAPPS no 4
17 albumin low 4

How to add a variable based on order of observation in a dataframe - R

I want to add a variable value to a dataframe based on the order of the observation in the data frame.
… Subject Latency(s)
1 A 25
2 A 24
3 A 25
4 B 22
5 B 24
6 B 23
I want to add a third column called Trial and I want the values to be either T1, T2, or T3 based on the order of the observation and by Subject. So for example, Subject A would get T1 in row 1, T2 in row 2, and T3 in row 3. Then the same for subject B, and so on.
Right now my approach is to use group_by in dplyr to group by Subject. But I'm not sure then how to specify the new variable using mutate.
Use mutate() with row_number() after group_by(Subject):
library(dplyr)
txt <- "ID Subject Latency(s)
1 A 25
2 A 24
3 A 25
4 B 22
5 B 24
6 B 23"
dat <- read.table(text = txt, header = TRUE)
dat <- dat %>%
group_by(Subject) %>%
mutate(Trial = paste0("T", row_number()))
dat
#> # A tibble: 6 x 4
#> # Groups: Subject [2]
#> ID Subject Latency.s. Trial
#> <int> <fct> <int> <chr>
#> 1 1 A 25 T1
#> 2 2 A 24 T2
#> 3 3 A 25 T3
#> 4 4 B 22 T1
#> 5 5 B 24 T2
#> 6 6 B 23 T3
Created on 2018-03-17 by the reprex package (v0.2.0).
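This solution should work for any number of subjects, provided the rows of each subject are contiguous and the subjects appear in sorted order (count() returns its rows sorted by subject). To illustrate, copy and paste this code into your console.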
library(dplyr)
d <- data.frame(subject = c("A","A","A","B","B","B","C","D","D"),
latency = c(25,24,25,22,24,23,34,54,34))
# get counts of unique subjects
n <- d %>% dplyr::count(subject)
# create a list of sequences
my_list <- lapply(n$n, seq)
# paste a "T" to each of these sequences
t_list <- lapply(my_list, function(x){paste0("T", x)})
# bind the collapsed list back onto your df
d$trial <- do.call(c, t_list)
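If the rows are not sorted by subject, a base R sketch with ave() numbers the observations within each subject wherever they occur. The unsorted data frame below is made up for illustration.

```r
# Made-up data with subjects deliberately interleaved
d <- data.frame(
  subject = c("B", "A", "B", "A", "A"),
  latency = c(22, 25, 24, 24, 25)
)

# ave() applies seq_along() within each subject and puts the results
# back in the original row positions
d$trial <- paste0("T", ave(seq_along(d$subject), d$subject, FUN = seq_along))
d
```

The trial labels follow the order in which each subject's rows appear in the data frame.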
