Applying custom formats to numeric variables in R - r

I have a df that uses numbers to refer to categorical data and a CSV that defines the categories (1=Smoker, 2=Non-Smoker, etc). In SAS, I was able to convert the format CSV into a format file and apply these formats to the variables:
data want;
set have;
formatted = put(varX,custFormat.);
run;
This would provide me with the output:
varX formatted
1 Smoker
2 Non-Smoker
3 Occasional Smoker
1 Smoker
Given that I have a csv with all the formats, I could bring this in and merge to my R df to have the formats in a new column:
print(have)
varX
1
2
3
1
print(format.file)
formatIndex group
1 Smoker
2 Non-Smoker
3 Occasional Smoker
11 Female
12 Male
13 Unknown
df.format <- merge(have, format.file, by.x = "varX", by.y = "formatIndex")
print(df.format)
varX group
1 Smoker
2 Non-Smoker
3 Occasional Smoker
1 Smoker
The issue with a join approach is I often want to apply the same formats for many columns (i.e. varX, varY, and varZ all use different formatIndex). Is there a similar method of applying formats to variables as SAS has?

You could use plyr::mapvalues inside dplyr's across(), called from mutate().
Example:
df <- data.frame(V1 = c(1,2,3,4),
V2 = c(2,3,1,3))
V1 V2
1 1 2
2 2 3
3 3 1
4 4 3
liste_format <- data.frame(ID = c(1,2,3,4),
                           group = c("Smoker","Non-Smoker","Occasional Smoker","Unknown"))
ID group
1 1 Smoker
2 2 Non-Smoker
3 3 Occasional Smoker
4 4 Unknown
library(dplyr)
df |>
  mutate(across(V1:V2,
                ~ plyr::mapvalues(.,
                                  from = liste_format$ID,
                                  to = liste_format$group,
                                  warn_missing = FALSE),
                .names = "format_{.col}"))
V1 V2 format_V1 format_V2
1 1 2 Smoker Non-Smoker
2 2 3 Non-Smoker Occasional Smoker
3 3 1 Occasional Smoker Smoker
4 4 3 Unknown Occasional Smoker
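The question describes one format file covering several variables (1 to 3 for smoking status, 11 to 13 for sex). A base R alternative is to turn that file into a named lookup vector and index it once per column. This is only a minimal sketch: varY is a hypothetical second column that is not in the original data, and any code missing from the format file simply becomes NA.

```r
# Format file as shown in the question (formatIndex -> group)
format.file <- data.frame(
  formatIndex = c(1, 2, 3, 11, 12, 13),
  group = c("Smoker", "Non-Smoker", "Occasional Smoker",
            "Female", "Male", "Unknown"),
  stringsAsFactors = FALSE
)

# varY is a hypothetical second coded column, added for illustration only
have <- data.frame(varX = c(1, 2, 3, 1), varY = c(11, 12, 13, 11))

# Named vector: names are the codes (as character), values are the labels
lookup <- setNames(format.file$group, format.file$formatIndex)

# Look up every column by name; codes absent from the format file become NA
cols <- c("varX", "varY")
have[paste0("format_", cols)] <- lapply(
  have[cols],
  function(x) unname(lookup[as.character(x)])
)

print(have)
```

Because the lookup is built once, adding varZ only means adding its name to cols.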

Related

Use replicate to create new variable

I have the following code:
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
This results in the following output:
ID observations
1 1
1 3
1 4
1 5
1 6
1 8
However, I also want a variable 'times' that numbers the measurements within each individual (1 for the first, 2 for the second, and so on). Since every ID has a different number of rows, I am not sure how to implement this. Does anybody know how to do that? I want it to look like this:
ID observations times
1 1 1
1 3 2
1 4 3
1 5 4
1 6 5
1 8 6
Using dplyr you could group by ID and use the row number for times:
library(dplyr)
dat |>
group_by(ID) |>
mutate(times = row_number()) |>
ungroup()
With base R we could build the sequence from the run lengths of the ID variable (this assumes the rows of each ID are contiguous, as they are here):
dat$times <- sequence(rle(dat$ID)$lengths)
Output:
# A tibble: 734 × 3
ID observations times
<int> <dbl> <int>
1 1 1 1
2 1 3 2
3 1 9 3
4 2 1 1
5 2 5 2
6 2 6 3
7 2 8 4
8 3 1 1
9 3 2 2
10 3 5 3
# … with 724 more rows
Data (using a seed):
set.seed(1)
Ni <- 133 # number of individuals
MXmeas <- 10 # number of measurements
# simulate number of observations for each individual
Nmeas <- round(runif(Ni, 1, MXmeas))
# simulate observation moments (under the assumption that everybody has at least one observation)
obs <- unlist(sapply(Nmeas, function(x) c(1, sort(sample(2:MXmeas, x-1, replace = FALSE)))))
# set up dataframe (id, observations)
dat <- data.frame(ID = rep(1:Ni, times = Nmeas), observations = obs)
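The rle() approach above counts runs, so it relies on all rows of an ID sitting next to each other. If that cannot be guaranteed, a base R sketch with ave() numbers the rows within each ID regardless of row order. The small data frame below is a made-up example, not the simulated data:

```r
# Hypothetical toy data in which the rows of each ID are NOT contiguous
dat_toy <- data.frame(
  ID = c(1, 2, 1, 2, 1, 3),
  observations = c(1, 1, 3, 5, 4, 1)
)

# ave() applies seq_along() within each ID group and puts the results
# back in the original row positions
dat_toy$times <- ave(seq_along(dat_toy$ID), dat_toy$ID, FUN = seq_along)

print(dat_toy)
```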

R dplyr select not removing columns

I'm a new user of R and have a very basic question. I'm trying to delete columns using the dplyr select function. It appears to run correctly but then when the data is viewed using head the deleted column still appears, and also a count is still able to be run on this column. I've run this on a very simple test dataset, the outputs are below. Please advise on how to permanently delete the columns from the data. Thanks
> library(dplyr)
> setwd("C:/")
> mydata <- read_csv("test.csv")
Parsed with column specification:
cols(
Age = col_double(),
Gender = col_character(),
`Smoking Status` = col_character()
)
> head(mydata)
# A tibble: 4 x 3
Age Gender `Smoking Status`
<dbl> <chr> <chr>
1 18 M Smoker
2 25 F Non-smoker
3 40 M Ex-smoker
4 53 F Non-smoker
> select(mydata,-Age)
# A tibble: 4 x 2
Gender `Smoking Status`
<chr> <chr>
1 M Smoker
2 F Non-smoker
3 M Ex-smoker
4 F Non-smoker
> head(mydata)
# A tibble: 4 x 3
Age Gender `Smoking Status`
<dbl> <chr> <chr>
1 18 M Smoker
2 25 F Non-smoker
3 40 M Ex-smoker
4 53 F Non-smoker
> mydata %>%
+ count(Age)
# A tibble: 4 x 2
Age n
<dbl> <int>
1 18 1
2 25 1
3 40 1
4 53 1
If I understand your question correctly, the column is not being deleted because you are not assigning the result of select() to a variable.
df <- data.frame(age = 10:20,
sex = c('M','M','F','F','M','F','F','M','F','F','M'),
smoker = c('N','N','Y','N','N','Y','N','N','Y','Y','N'))
df_1 <- select(df,-age)
head(df_1)
sex smoker
1 M N
2 M N
3 F Y
4 F N
5 M N
6 F Y
I hope this helps.
I have taken the 4 rows of your data and turned them into a reproducible example that anyone can copy and run. This helps us understand your problem, which in turn helps you get an answer faster.
# Data frame based on the head of your table
library(dplyr)
mydata <- data.frame(Age = c(18,25,40,53),
                     Gender = c("M","F","M","F"),
                     Smoking_Status = c("Smoker","Non-smoker","Ex-smoker","Non-smoker"))
> mydata
Age Gender Smoking_Status
1 18 M Smoker
2 25 F Non-smoker
3 40 M Ex-smoker
4 53 F Non-smoker
select() does not modify mydata in place. It returns a new data frame, and that new data frame has to be assigned to a variable if you want to keep it. You can assign with either = or <-.
I prefer <- because it makes assignments easier to spot.
If you have no need for your original data frame, you can simply overwrite it by assigning the new data frame to the same name.
mydata <- select(mydata, -Age)
To preserve your original data frame, assign the result to a new variable instead. If you run the line below in place of the one above, mydata stays the same as before and mydata2 has no Age column.
mydata2 <- select(mydata, -Age)
> mydata2
Gender Smoking_Status
1 M Smoker
2 F Non-smoker
3 M Ex-smoker
4 F Non-smoker
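The same behaviour can be checked without dplyr. The sketch below uses base R's subset() as a stand-in for select() (a substitution made here for illustration, not what the question used). It shows that the call alone leaves the original untouched, and contrasts it with the one base R idiom that does remove a column in place:

```r
mydata <- data.frame(
  Age = c(18, 25, 40, 53),
  Gender = c("M", "F", "M", "F"),
  Smoking_Status = c("Smoker", "Non-smoker", "Ex-smoker", "Non-smoker")
)

# Calling the function without assignment only prints a modified copy
subset(mydata, select = -Age)
still_has_age <- "Age" %in% names(mydata)   # the original is unchanged

# Assigning the result keeps the modified copy
mydata2 <- subset(mydata, select = -Age)

# Assigning NULL to a column removes it from the object itself
mydata3 <- mydata
mydata3$Age <- NULL

print(names(mydata))
print(names(mydata2))
print(names(mydata3))
```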

how to convert factor levels to integer in r

I have following dataframe in R
ID Season Year Weekday
1 Winter 2017 Monday
2 Winter 2018 Tuesday
3 Summer 2017 Monday
4 Summer 2018 Wednsday
I want to convert these factor levels to integers. The following is my desired data frame:
ID Season Year Weekday
1 1 1 1
2 1 2 2
3 2 1 1
4 2 2 3
Winter = 1,Summer =2
2017 = 1 , 2018 = 2
Monday = 1,Tuesday = 2,Wednesday = 3
Currently, I am doing ifelse for above 3
otest_xgb$Weekday <- as.integer(ifelse(otest_xgb$Weekday == "Monday",1,
ifelse(otest_xgb$Weekday == "Tuesday",2,
ifelse(otest_xgb$Weekday == "Wednesday",3,
ifelse(otest_xgb$Weekday == "Thursday",4,5)))))
Is there any way to avoid writing a long chain of ifelse calls?
m <- dat
m[] <- lapply(dat, function(x) as.integer(factor(x, unique(x))))
m
ID Season Year Weekday
1 1 1 1 1
2 2 1 2 2
3 3 2 1 1
4 4 2 2 3
We can use match with unique elements
library(dplyr)
dat %>%
  mutate(across(everything(), ~ match(.x, unique(.x)))) # across() replaces the deprecated funs() and the superseded mutate_all()
# ID Season Year Weekday
#1 1 1 1 1
#2 2 1 2 2
#3 3 2 1 1
#4 4 2 2 3
Ordered and nominal factor variables need to be handled separately. Converting a factor column directly to integer or numeric gives codes that follow the lexicographic (alphabetical) order of the levels.
Here Weekday is conceptually ordinal, Year is an integer, and Season is generally nominal. However, this is subjective and depends on the kind of analysis required.
For example, when you convert directly from factor to integer, Wednesday in the Weekday column gets a higher value than both Saturday and Tuesday:
dat[] <- lapply(dat, function(x)as.integer(factor(x)))
dat
# ID Season Year Weekday
#1 1 2 1 1
#2 2 2 2 3
#3 3 1 1 2 (Saturday)
#4 4 1 2 4 (Wednesday): assigned value greater than that of Saturday
Therefore, you can convert directly from factor to integer for the Season and Year columns only. Note that this works for the Year column because its lexicographic order matches its natural order.
dat[c('Season', 'Year')] <- lapply(dat[c('Season', 'Year')],
function(x) as.integer(factor(x)))
Weekday needs to be converted via an ordered factor with the desired order of levels. The default ordering might be harmless for general aggregation, but it will drastically affect results in statistical models.
dat$Weekday <- as.integer(factor(dat$Weekday,
levels = c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday"), ordered = TRUE))
dat
# ID Season Year Weekday
#1 1 2 1 1
#2 2 2 2 2
#3 3 1 1 6 (Saturday)
#4 4 1 2 3 (Wednesday): assigned value less than that of Saturday
Data Used:
dat <- read.table(text=" ID Season Year Weekday
1 Winter 2017 Monday
2 Winter 2018 Tuesday
3 Summer 2017 Saturday
4 Summer 2018 Wednesday", header = TRUE)
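If only the integer codes are needed, the explicit level order can also be passed straight to match(), skipping the intermediate factor. This is a sketch using the same data. The name week_order and the level vectors for Season and Year are chosen here to reproduce the mapping requested in the question:

```r
dat <- read.table(text = "ID Season Year Weekday
1 Winter 2017 Monday
2 Winter 2018 Tuesday
3 Summer 2017 Saturday
4 Summer 2018 Wednesday", header = TRUE)

# Desired order of the values; the position in each vector is the integer code
week_order <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                "Friday", "Saturday", "Sunday")

dat$Weekday <- match(dat$Weekday, week_order)
dat$Season  <- match(dat$Season, c("Winter", "Summer"))
dat$Year    <- match(dat$Year, c(2017, 2018))

print(dat)
```

Values that do not appear in the supplied order come back as NA, which makes misspellings easy to catch.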
You can simply use as.numeric() to convert a factor to a numeric. Each value will be changed to the corresponding integer that that factor level represents:
library(dplyr)
### Change factor levels to the levels you specified
otest_xgb$Season <- factor(otest_xgb$Season , levels = c("Winter", "Summer"))
otest_xgb$Year <- factor(otest_xgb$Year , levels = c(2017, 2018))
otest_xgb$Weekday <- factor(otest_xgb$Weekday, levels = c("Monday", "Tuesday", "Wednesday"))
otest_xgb %>%
dplyr::mutate_at(c("Season", "Year", "Weekday"), as.numeric)
# ID Season Year Weekday
# 1 1 1 1 1
# 2 2 1 2 2
# 3 3 2 1 1
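# 4 4 2 2 NA (NA because the question's data spells this value "Wednsday", which is not among the specified levels)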
Once you have converted Season, Year and Weekday to factors, you can inspect the dummy (indicator) coding R will use for each of them with contrasts():
contrasts(factor(dat$Season))
contrasts(factor(dat$Year))
contrasts(factor(dat$Weekday))

R: how to remove duplicate rows by column [duplicate]

df <- data.frame(id = c(1, 1, 1, 2, 2),
gender = c("Female", "Female", "Male", "Female", "Male"),
variant = c("a", "b", "c", "d", "e"))
> df
id gender variant
1 1 Female a
2 1 Female b
3 1 Male c
4 2 Female d
5 2 Male e
I want to remove duplicate rows in my data.frame according to the gender column in my data set. I know a similar question has been asked before, but the difference here is that I would like to remove duplicate rows within each subset of the data set, where each subset is defined by a unique id.
My desired result is this:
id gender variant
1 1 Female a
3 1 Male c
4 2 Female d
5 2 Male e
I've tried the following and it works, but I'm wondering if there's a cleaner, more efficient way of doing this?
out = list()
for(i in 1:2){
df2 <- subset(df, id == i)
out[[i]] <- df2[!duplicated(df2$gender), ]
}
do.call(rbind.data.frame, out)
df[!duplicated(df[ , c("id","gender")]),]
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Another way of doing this using subset as below:
subset(df, !duplicated(subset(df, select=c(id, gender))))
# id gender variant
# 1 1 Female a
# 3 1 Male c
# 4 2 Female d
# 5 2 Male e
Here's a dplyr based solution in case you are interested (edited to include Gregor's suggestions)
library(dplyr)
group_by(df, id, gender) %>% slice(1)
#> # A tibble: 4 x 3
#> # Groups: id, gender [4]
#> id gender variant
#> <dbl> <fctr> <fctr>
#> 1 1 Female a
#> 2 1 Male c
#> 3 2 Female d
#> 4 2 Male e
Depending on which values of variant should be kept, it might also be worth calling arrange first, since slice(1) keeps the first row of each group.
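dplyr also has a dedicated verb for this. A short sketch using distinct() with .keep_all = TRUE, which keeps the first row for each id and gender combination and retains the other columns:

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2),
                 gender = c("Female", "Female", "Male", "Female", "Male"),
                 variant = c("a", "b", "c", "d", "e"))

# One row per (id, gender); .keep_all = TRUE retains variant as well
out <- distinct(df, id, gender, .keep_all = TRUE)

print(out)
```

Unlike the group_by() version, the result is not a grouped data frame, so no ungroup() is needed afterwards.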

Create missing observations in panel data

I am working on panel data with a unique case identifier and a column for the time points of the observations (long format). There are both time-constant variables and time-varying observations:
id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2
For my model I now need data with complete records per id for each time point. In other words, if an observation is missing I still need to put in a row with id, time, time-constant variables, and NA for the observed variables (as would be the line (102, 2, "female", NA) in the above example). So my question is:
How can I find out if a row with unique combination of id and time already exists in my dataset?
If not, how can I add this row, carry over time-constant variables and fill the observations with NA?
Would be great if someone could shed some light on this.
Thanks a lot in advance!
EDIT
Thank you everyone for your replies. Here is what I finally did, which is a mix of several suggested approaches. The thing is that I have several time-varying variables (obs1-obsn) per row and I could not get dcast to accommodate that: value.var does not take more than one argument.
# create all possible combinations of id and year
iddat <- expand.grid(id = unique(dataset$id), time = c(1996,1999,2002,2005,2008,2011))
iddat <- iddat[order(iddat$id, iddat$time), ]
# add combinations to existing data, combinations so far missing are NA
dataset_new <- merge(dataset, iddat, all = TRUE, by = c("id", "time"))
# drop time-constant variables from data
dataset_new[c("tc1", "tc2", "tc3")] <- list(NULL)
# merge back time-constant variables from original data, one row per id
# (without unique(), the merge repeats each row once per original observation of that id)
temp <- unique(dataset[c("id", "tc1", "tc2", "tc3")])
dataset_new <- merge(dataset_new, temp, by = "id")
# sort
dataset_new <- dataset_new[order(dataset_new$id, dataset_new$time), ]
rm(temp, iddat)
All the best and thanks again, Matt
You could create a complete grid of id and time combinations and then merge in the records you actually have.
# Create dataset. For your actual data, you would replace c(1:3) with
# c(1:max(yourdata$id)) and adjust the number of time periods to match your data.
id <- rep(c(1:3), each = 3)
time <- rep(c(1:3), 3)
df <- data.frame(id,time)
test <- df[c(1,3,5,7,9),]
test$tc1 <- c("male", "male", "female", "male", "male")
test$obs1 <-c(4,5,3,6,2)
merge(df, test, by.x = c("id","time"), by.y = c("id","time"), all.x = TRUE)
The result:
id time tc1 obs1
1 1 1 male 4
2 1 2 <NA> NA
3 1 3 male 5
4 2 1 <NA> NA
5 2 2 female 3
6 2 3 <NA> NA
7 3 1 male 6
8 3 2 <NA> NA
9 3 3 male 2
There are probably more elegant ways, but here's one option. I'm assuming that you need all combinations of id and time but not tc1 (i.e. tc1 is tied to id).
# your data
df <- read.table(text = " id time tc1 obs1
1 101 1 male 4
2 101 2 male 5
3 101 3 male 3
4 102 1 female 6
5 102 3 female 2
6 103 1 male 2", header = TRUE)
First cast your data to wide format to introduce NAs, then convert back to long.
library('reshape2')
df_wide <- dcast(
df,
id + tc1 ~ time,
value.var = "obs1",
fill = NA
)
df_long <- melt(
df_wide,
id.vars = c("id","tc1"),
variable.name = "time",
value.name = "obs1"
)
# sort by id and then time
df_long[order(df_long$id, df_long$time), ]
id tc1 time obs1
1 101 male 1 4
4 101 male 2 5
7 101 male 3 3
2 102 female 1 6
5 102 female 2 NA
8 102 female 3 2
3 103 male 1 2
6 103 male 2 NA
9 103 male 3 NA
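Since the question's edit mentions several time-varying variables per row, here is a base R sketch that avoids reshaping altogether: build the grid from the distinct (id, tc1) pairs crossed with all time points, then left-join the observed rows. The column obs2 is hypothetical and only there to show that any number of observation columns is carried along:

```r
df <- data.frame(
  id   = c(101, 101, 101, 102, 102, 103),
  time = c(1, 2, 3, 1, 3, 1),
  tc1  = c("male", "male", "male", "female", "female", "male"),
  obs1 = c(4, 5, 3, 6, 2, 2),
  obs2 = c(10, 11, 12, 13, 14, 15)  # hypothetical second time-varying variable
)

# One row per id together with its time-constant variables
ids <- unique(df[c("id", "tc1")])

# All time points that occur anywhere in the data
times <- data.frame(time = sort(unique(df$time)))

# merge() on data frames with no shared columns returns their Cartesian product
grid <- merge(ids, times)

# Left join: every grid row is kept, missing observations become NA
full <- merge(grid, df, by = c("id", "tc1", "time"), all.x = TRUE)
full <- full[order(full$id, full$time), ]

print(full)
```

Because tc1 travels with the grid rather than with the observations, it is filled in for the added rows without a separate carry-over step.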

Resources