Caret dummy variable does not work as expected

I am trying to use caret's dummyVars function in R to convert some categorical data to numeric. My dataset has an id, town (A or B), district (d1, d2, d3), street (s1, s2, s3, s4), family (f1, f2, f3), gender (male, female), and replicate (numeric). Here is a snapshot: [Image: Dataset Snapshot]
Here is the code I currently have to encode the variables:
library('caret')
train <- read.csv("HW1PB4Data_train.csv", header = TRUE)
dummy <- dummyVars("~ .", data = train)
train2 <- data.frame(predict(dummy, newdata = train))
train2
When I look at the output, train2, it has created a few additional towns (C, D, E) which did not exist in the original data. This does not happen with any of the other columns. Why is this? How do I fix it? Here is a snapshot of the output data: [Image: Output]

We can use tidyr::pivot_wider or fastDummies::dummy_cols
Example data:
library(dplyr)
df <- tibble(subject = c(1.2, 1.5), town = c('a', 'b'), street = c('1', '2'))
# A tibble: 2 × 3
subject town street
<dbl> <chr> <chr>
1 1.2 a 1
2 1.5 b 2
Solution with tidyr:
library(tidyr)
df %>% pivot_wider(names_from = c(town:street),
                   values_from = c(town:street),
                   values_fill = 0,
                   values_fn = ~1)
# A tibble: 2 × 5
subject town_a_1 town_b_2 street_a_1 street_b_2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1.2 1 0 1 0
2 1.5 0 1 0 1
Solution with dummy_cols:
library(fastDummies)
dummy_cols(df,
           c("town", "street"),
           remove_selected_columns = TRUE)
# A tibble: 2 × 5
subject town_a town_b street_1 street_2
<dbl> <int> <int> <int> <int>
1 1.2 1 0 1 0
2 1.5 0 1 0 1

The above answer is already good. You can also go the easy way and just use an ifelse statement to convert your data from categorical to numeric. An example dataset similar to yours:
train <- data.frame(subject = round(rnorm(n=100,
mean=5,
sd=2)), # rounded subjects
town = rep(c("A","B"),50),
district = rep(c("d1","d2"),50),
street = rep(c("s1","s2"),50),
family = rep(c("f1","f2"),50),
gender = rep(c("male","female"),50),
replicate = rbinom(n=100,
size=2,
prob=.9))
head(train)
Seen below:
subject town district street family gender replicate
1 6 A d1 s1 f1 male 2
2 4 B d2 s2 f2 female 2
3 4 A d1 s1 f1 male 1
4 7 B d2 s2 f2 female 2
5 3 A d1 s1 f1 male 2
6 6 B d2 s2 f2 female 2
Simply mutate the gender data with ifelse by coding "male" as 0 and everything else ("female" in this case) as 1:
library(dplyr)
m.train <- train %>%
  mutate(gender = ifelse(gender == "male", 0, 1))
head(m.train)
You get a transformed gender variable with 0's and 1's for dummy coding:
subject town district street family gender replicate
1 6 A d1 s1 f1 0 2
2 4 B d2 s2 f2 1 2
3 4 A d1 s1 f1 0 1
4 7 B d2 s2 f2 1 2
5 3 A d1 s1 f1 0 2
6 6 B d2 s2 f2 1 2
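Neither answer addresses where the extra town columns come from. One possible explanation, which the question's data cannot confirm, is that the town column carries more values than the snapshot shows: either rows with C, D, or E exist further down the file, or town is a factor with unused levels. Encoders built on model.matrix emit one column per factor level, used or not, and dummyVars appears to behave the same way. A minimal base-R sketch with made-up data (the data frame below is hypothetical, not the asker's file):

```r
# Hypothetical data: town has an unused level "C" that never appears in the rows
train <- data.frame(town = factor(c("A", "B", "A"), levels = c("A", "B", "C")))

# Compare the values that actually occur with the levels that are declared
table(train$town)
levels(train$town)

# One indicator column per declared level, including the unused "C"
with_unused <- model.matrix(~ town - 1, data = train)
colnames(with_unused)

# Dropping unused levels before encoding removes the all-zero column
cleaned <- model.matrix(~ town - 1, data = droplevels(train))
colnames(cleaned)
```

If table() shows real rows for C, D, or E, the extra columns reflect the data rather than an encoding problem.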

Related

In R, how do you classify your data into different classes based on certain standards?

For example, I want to classify R1 based on R2.
R1 is like
# A tibble: 5 x 2
lon lat
<dbl> <dbl>
1 1 2
2 3 5
3 6 8
4 5 10
5 3 2
and R2 is like
# A tibble: 3 x 3
lon lat place
<dbl> <dbl> <chr>
1 1 2 A
2 3 6 B
3 5 8 C
R2 is like a standard. I want to find the corresponding place for my observations in R1. Suppose the 1st place in R1 is graded like:
scores of A: (1-1)^2 + (2-2)^2 = 0
scores of B: (1-3)^2 + (2-6)^2 = 20
scores of C: (1-5)^2 + (2-8)^2 = 52
If the score for a place is smaller than 3, we classify the observation into that place's class; otherwise it gets NA.
The final result should look like this:
# A tibble: 5 x 3
lon lat place
<dbl> <dbl> <chr>
1 1 2 A
2 3 5 B
3 6 8 C
4 5 10 NA
5 3 2 NA

There might be a neater way to do this with some purrr mapping, but using a couple of loops instead could get you the desired results:
library(tidyverse)
## Create R1 and R2 as tibbles, with place as a row name
R1 <- tribble(~lon, ~lat,
1,2,
3,5,
6,8,
5,10,
3,2)
R2 <- tribble(~lon, ~lat,~place,
1,2,"A",
3,6,"B",
5,8,"C") %>% column_to_rownames(var = "place")
## Create a results tibble
results <- R1 %>% mutate(A = NaN, B = NaN, C = NaN, match = "NA")
## Function to calculate place scores
place_scores <- function(vec){
apply(R2,1,function(x) x-vec) %>%
apply(.,2,function(x) x^2) %>%
colSums()
}
## Run function in a loop for each row in R1
for(i in 1:nrow(R1)){
res <- place_scores(as.numeric(R1[i,]))
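results[i,3:5] <- as.list(res) # a list gives one value per column; recent tibble versions reject a bare length-3 vector for a single row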
}
## Run another loop to match the column with the lowest score and < 3
for(i in 1:nrow(results)){
match <- ifelse(any( results[i,3:5] < 3), colnames(results[,3:5])[which.min(as.numeric(results[i,3:5]))], NA)
results$match[i] <- match
}
results
# A tibble: 5 x 6
lon lat A B C match
<dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 1 2 0 20 52 A
2 3 5 13 1 13 B
3 6 8 61 13 1 C
4 5 10 80 20 4 NA
5 3 2 4 16 40 NA
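The same scores can also be computed without loops. The following is a base-R sketch, not the answerer's method: it rebuilds R1 and R2 as plain data frames, uses outer() to form the full 5 x 3 matrix of squared distances, and then keeps the closest place only when its score is below 3. The variable names scores, best, and best_score are illustrative.

```r
R1 <- data.frame(lon = c(1, 3, 6, 5, 3), lat = c(2, 5, 8, 10, 2))
R2 <- data.frame(lon = c(1, 3, 5), lat = c(2, 6, 8),
                 place = c("A", "B", "C"), stringsAsFactors = FALSE)

# scores[i, j] = squared distance between observation i in R1 and place j in R2
scores <- outer(R1$lon, R2$lon, "-")^2 + outer(R1$lat, R2$lat, "-")^2

# Index of the closest place for each observation, and the score it achieved
best <- apply(scores, 1, which.min)
best_score <- scores[cbind(seq_len(nrow(R1)), best)]

# Keep the closest place only if its score is under the threshold of 3
R1$place <- ifelse(best_score < 3, R2$place[best], NA)
R1
```

The first row of scores is 0, 20, 52, matching the worked example in the question.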

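I also came up with a way to do this using nested for loops (R2 here is the original tibble with a place column, not the row-named version above):
classes <- R2$place
R1$class <- NA_character_
for (i in seq_len(nrow(R1))) {
  dist <- numeric(nrow(R2))
  for (j in seq_len(nrow(R2))) {
    dist[j] <- (R1$lon[i] - R2$lon[j])^2 + (R1$lat[i] - R2$lat[j])^2
  }
  # assign the closest place only if its score is below the threshold of 3
  if (min(dist) < 3) R1$class[i] <- classes[which.min(dist)]
}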

Comparison across unique readers

Reprex
dat <- data.frame(id = c(1,1,2,2,3,3,4,4),
reader = c(1,4,2,3,3,4,2,5),
response = c("CR","PR","SD","SD","PR","PR","CR","SD"))
Problem: Wish to compare response across each unique reader for each id. There are 5 unique readers in total, but each id only has 2 individual readers.
The resulting dataset would look something like this:
# A tibble: 4 x 4
id read1 read2 matchflag
<dbl> <chr> <chr> <dbl>
1 1 CR PR 0
2 2 SD SD 1
3 3 PR PR 1
4 4 CR SD 0

A data.table option
library(data.table)
# dat is the data frame from the question; setDT converts it by reference
dcast(
  setDT(dat),
  id ~ paste0("reader", rowid(id)),
  value.var = "response"
)[
  ,
  match_flag := +(reader1 == reader2)
][]
gives
id reader1 reader2 match_flag
1: 1 CR PR 0
2: 2 SD SD 1
3: 3 PR PR 1
4: 4 CR SD 0

This should work:
library(dplyr)
library(tidyr)
# dat is the data frame from the question
dat %>%
select(-reader) %>%
group_by(id) %>%
mutate(obs = seq_along(id)) %>%
pivot_wider(names_from="obs", values_from="response", names_prefix="read") %>%
mutate(match_flag = as.numeric(read1 == read2))
# # A tibble: 4 x 4
# # Groups: id [4]
# id read1 read2 match_flag
# <dbl> <chr> <chr> <dbl>
# 1 1 CR PR 0
# 2 2 SD SD 1
# 3 3 PR PR 1
# 4 4 CR SD 0

A slight variation on @DaveArmstrong's solution: create the row sequence with rowid (from data.table), pivot to wide format, and then create the new column with a relational operator, coercing the logical result to binary with +:
library(dplyr)
library(tidyr)
library(data.table)
dat %>%
transmute(id, obs = rowid(id), response) %>%
pivot_wider(names_from = obs,values_from = response, names_prefix = 'read') %>%
mutate(match_flag = +(read1 == read2))
# A tibble: 4 x 4
# id read1 read2 match_flag
# <dbl> <chr> <chr> <int>
#1 1 CR PR 0
#2 2 SD SD 1
#3 3 PR PR 1
#4 4 CR SD 0
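The + in +(read1 == read2) is just unary plus applied to a logical vector, which R coerces to integer. A small base-R sketch with the responses from the question typed in by hand (the vector names are illustrative):

```r
read1 <- c("CR", "SD", "PR", "CR")
read2 <- c("PR", "SD", "PR", "SD")

agree <- read1 == read2   # logical vector: TRUE where the two readers match
match_flag <- +agree      # unary plus turns TRUE/FALSE into 1/0 (integer)

# as.integer() is the more explicit spelling of the same coercion
same_flag <- as.integer(agree)
```

This is why the data.table and rowid versions report the flag as an integer column, while as.numeric() in the dplyr answer yields a double.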

Infill missing variables of a df from a list

I have missing categorical variables in a list. I would like to add all the combinations of these classifications to the data frame using complete. I can do this for a single variable using mutate.
Simplified example:
library(tidyverse)
df <- tibble(a1 = 1:6,
b1 = rep(c(1,2),3),
c1 = rep(c(1:3), 2))
missing_cols <- list(d1 = c(7:8),
e1 = c(12:14))
# Use the first classification of d1 for mutate and complete with all classifications
df %>%
mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
complete(nesting(a1, b1,c1), d1 = missing_cols[[1]])
Desired output
df %>%
mutate(!!names(missing_cols)[1] := missing_cols[[1]][1]) %>%
mutate(!!names(missing_cols)[2] := missing_cols[[2]][1]) %>%
complete(nesting(a1, b1,c1), d1 = missing_cols[[1]], e1 = missing_cols[[2]])
This will get the correct output for d1. How can I do this for all variables in my list?

We can use crossing with cross_df (from purrr; deprecated since purrr 1.0.0 in favour of tidyr::expand_grid):
library(tidyr)
library(purrr)
crossing(df, cross_df(missing_cols))
# a1 b1 c1 d1 e1
# <int> <dbl> <int> <int> <int>
# 1 1 1 1 7 12
# 2 1 1 1 7 13
# 3 1 1 1 7 14
# 4 1 1 1 8 12
# 5 1 1 1 8 13
# 6 1 1 1 8 14
# 7 2 2 2 7 12
# 8 2 2 2 7 13
# 9 2 2 2 7 14
#10 2 2 2 8 12
# … with 26 more rows
cross_df creates all possible combinations of the values in missing_cols, while crossing takes that output and combines it with every row of df.

Using expand.grid
library(tidyr)
crossing(df, expand.grid(missing_cols))
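For reference, the same cross join can be sketched in base R alone, assuming a plain data frame instead of a tibble: expand.grid builds every combination of the missing values, and merge returns the Cartesian product when the two inputs share no column names. The names combos and out are illustrative.

```r
df <- data.frame(a1 = 1:6,
                 b1 = rep(c(1, 2), 3),
                 c1 = rep(1:3, 2))
missing_cols <- list(d1 = 7:8,
                     e1 = 12:14)

# All 2 x 3 = 6 combinations of the missing classifications
combos <- expand.grid(missing_cols)

# No shared column names, so merge() pairs every row of df with every combination
out <- merge(df, combos)
dim(out)
```

Row order differs from the crossing output, so sort the result if the order matters.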

Data Frame: mean over certain variables, ignore but keep others

I am analysing my data with R for the first time which is a bit challenging. I have a data frame with my data that looks like this:
head(data)
subject group age trial cond acc rt
1 S1 2 1 1 1 1 5045
2 S1 2 1 2 2 1 8034
3 S1 2 1 3 1 1 6236
4 S1 2 1 4 2 1 8087
5 S1 2 1 5 3 0 8756
6 S1 2 1 6 1 1 6619
I would like to compute a mean and standard deviation for each subject in each condition for rt and a sum for each subject in each condition for acc. All the other variables should remain the same (group and age are subject-specific, and trial can be disregarded).
I have tried using aggregate but that seemed kind of complicated because I had to do it in several steps and re-add information...
I'd be thankful for any help =)
Edit: I realise that I wasn't being clear. I want trial to be disregarded and end up with one row per subject per condition:
head(data_new)
subject group age cond rt_mean rt_sd acc_sum
1 S1 2 1 1 7581 100 5
2 S2 2 1 2 8034 150 4
Sorry about the confusion!

If you don't mind using the data.table package:
library(data.table)
data <- data.table(data)
data[, ':=' (rt_mean = mean(rt), rt_sd = sd(rt), acc_sum = sum(acc)), by = .(subject, cond)]
data
subject group age trial cond acc rt rt_mean rt_sd acc_sum
1: S1 2 1 1 1 1 5045 5966.667 820.83758 3
2: S1 2 1 2 2 1 8034 8060.500 37.47666 2
3: S1 2 1 3 1 1 6236 5966.667 820.83758 3
4: S1 2 1 4 2 1 8087 8060.500 37.47666 2
5: S1 2 1 5 3 0 8756 8756.000 NA 0
6: S1 2 1 6 1 1 6619 5966.667 820.83758 3
Edit:
If you want to get rid of some of the variables and duplicated rows, you need only a small modification: remove the := assignment operator (instead of adding new columns, it will now create a new data.table), add the variables you want to keep, and use the unique function:
unique(data[, .(group, age, rt_mean = mean(rt), rt_sd = sd(rt), acc_sum = sum(acc)), by = .(subject, cond)])
subject cond group age rt_mean rt_sd acc_sum
1: S1 1 2 1 5966.667 820.83758 3
2: S1 2 2 1 8060.500 37.47666 2
3: S1 3 2 1 8756.000 NA 0
If you additionally want to get rid of rows with missing values, use the na.omit function.

The package dplyr is made for this:
library(dplyr)
d %>%
group_by(subject, cond) %>% # we group by the two values
summarise(
mean_rt = mean(rt, na.rm=T),
sd_rt = sd(rt, na.rm=T),
sum_acc = sum(acc, na.rm=T) # here we apply each function to summarise values
)
# A tibble: 3 x 5
# Groups: subject [?]
subject cond mean_rt sd_rt sum_acc
<fct> <int> <dbl> <dbl> <int>
1 S1 1 5967. 821. 3
2 S1 2 8060. 37.5 2
3 S1 3 8756 NA 0
# NA for the last sd_rt is because you can't have
# sd for a single obs.
Basically, you group_by the column or columns that define the groups, then inside summarise you apply each function you need (mean, sd, sum, etc.) to each variable (rt, acc, etc.).
Replace summarise with mutate if you want to keep all variables:
d %>%
select(-trial) %>% # use select with -var_name to eliminate columns
group_by(subject, cond) %>%
mutate(
mean_rt = mean(rt, na.rm=T),
sd_rt = sd(rt, na.rm=T),
sum_acc = sum(acc, na.rm=T)
) %>%
ungroup()
# A tibble: 6 x 9
subject group age cond acc rt mean_rt sd_rt sum_acc
<fct> <int> <int> <int> <int> <int> <dbl> <dbl> <int>
1 S1 2 1 1 1 5045 5967. 821. 3
2 S1 2 1 2 1 8034 8060. 37.5 2
3 S1 2 1 1 1 6236 5967. 821. 3
4 S1 2 1 2 1 8087 8060. 37.5 2
5 S1 2 1 3 0 8756 8756 NA 0
6 S1 2 1 1 1 6619 5967. 821. 3
Update based on the OP's request; maybe this is what you need:
d %>%
group_by(subject, cond, group, age) %>%
summarise(
mean_rt = mean(rt, na.rm=T),
sd_rt = sd(rt, na.rm=T),
sum_acc = sum(acc, na.rm=T)
)
# A tibble: 3 x 7
# Groups: subject, cond, group [?]
subject cond group age mean_rt sd_rt sum_acc
<fct> <int> <int> <int> <dbl> <dbl> <int>
1 S1 1 2 1 5967. 821. 3
2 S1 2 2 1 8060. 37.5 2
3 S1 3 2 1 8756 NA 0
Data used:
tt <- "subject group age trial cond acc rt
S1 2 1 1 1 1 5045
S1 2 1 2 2 1 8034
S1 2 1 3 1 1 6236
S1 2 1 4 2 1 8087
S1 2 1 5 3 0 8756
S1 2 1 6 1 1 6619"
d <- read.table(text=tt, header=T)

If you want to compute for example the mean of rt for subject S1 under condition 1, you can use mean(data[data$subject == "S1" & data$cond == 1, 7]).
I hope this gives you an idea how you can filter your values.
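Since the question mentions that aggregate felt complicated, here is one way it could be done in base R, offered as a sketch rather than the only approach. It assumes group and age are constant within each subject (as the question states), so they can be included as grouping variables. The names rt_stats, acc_sum, and out are illustrative.

```r
tt <- "subject group age trial cond acc rt
S1 2 1 1 1 1 5045
S1 2 1 2 2 1 8034
S1 2 1 3 1 1 6236
S1 2 1 4 2 1 8087
S1 2 1 5 3 0 8756
S1 2 1 6 1 1 6619"
d <- read.table(text = tt, header = TRUE)

# Mean and sd of rt per subject and condition; FUN returns two values,
# so the rt column of the result is a two-column matrix
rt_stats <- aggregate(rt ~ subject + group + age + cond, data = d,
                      FUN = function(x) c(mean = mean(x), sd = sd(x)))
# Flatten the matrix column into ordinary columns rt.mean and rt.sd
rt_stats <- do.call(data.frame, rt_stats)

# Sum of acc per subject and condition
acc_sum <- aggregate(acc ~ subject + group + age + cond, data = d, FUN = sum)

# Join the two summaries on their shared grouping columns
out <- merge(rt_stats, acc_sum)
out
```

The result has one row per subject and condition; rt.sd is NA for condition 3 because it has a single observation.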

R: Generating indicators that values differ within groups

I have a data frame where each row is an observation and I have two columns:
the group membership of the observation
the outcome for the observation.
I'm trying to create a new variable outcome_change that takes a value of 1 if outcome is NOT identical for all observations in a given group and 0 otherwise.
Shown in the below code (dat) is an example of the data I have. Meanwhile, dat_out1 shows what I'm looking for the code to produce in the presence of no NA values. The dat_out2 is identical except it shows that the same results arise when there are missing values in a group's values.
Surely there is some way to do this with dplyr::group_by()? I don't know how to make these comparisons within groups.
# Input (2 groups: 1 with identical values of outcome
# in the group (group a) and 1 with differing values of
# outcome in the group (group b)
dat <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2))
# Output 1: add a variable for all observations belonging to
# a group where the outcome changed within each group
dat_out1 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2),
outcome_change = c(0,0,0,1,1,1))
# Output 2: same as Output 1, but able to ignore NA values
dat_out2 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,NA,3,2,NA),
outcome_change = c(0,0,0,1,1,1))

Here is an approach:
library(tidyverse)
dat %>%
group_by(group) %>%
mutate(outcome_change = ifelse(length(unique(outcome[!is.na(outcome)])) > 1, 1, 0))
#output
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a 1 0
4 b 3 1
5 b 2 1
6 b 2 1
With dat2 (the same data with the NA values from the question's second example):
# A tibble: 6 x 3
# Groups: group [2]
group outcome outcome_change
<fctr> <dbl> <dbl>
1 a 1 0
2 a 1 0
3 a NA 0
4 b 3 1
5 b 2 1
6 b NA 1

library(dplyr)
dat <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,1,3,2,2))
dat2 <- data.frame(group = c("a","a","a","b","b","b"),
outcome = c(1,1,NA,3,2,NA))
dat_out1 <- dat %>% group_by(group) %>%
mutate(outcome_change = ifelse(min(outcome) == max(outcome), 0, 1))
dat_out2 <- dat2 %>% group_by(group) %>%
mutate(outcome_change = ifelse(min(outcome, na.rm = TRUE) == max(outcome, na.rm = TRUE), 0, 1))

Here is an option using data.table
library(data.table)
setDT(dat)[, outcome_change := as.integer(uniqueN(outcome[!is.na(outcome)]) > 1), group]
dat
# group outcome outcome_change
#1: a 1 0
#2: a 1 0
#3: a 1 0
#4: b 3 1
#5: b 2 1
#6: b 2 1
If we apply the same with 'dat2'
dat2
# group outcome outcome_change2
#1: a 1 0
#2: a 1 0
#3: a NA 0
#4: b 3 1
#5: b 2 1
#6: b NA 1
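The same per-group check can be sketched in base R with ave, which applies a function within each group and returns a vector aligned with the original rows. This is an alternative to the answers above, using the NA example from the question; the helper name changed is illustrative.

```r
dat2 <- data.frame(group = c("a", "a", "a", "b", "b", "b"),
                   outcome = c(1, 1, NA, 3, 2, NA))

# 1 if the non-missing outcomes in a group take more than one distinct value, else 0
changed <- function(x) as.numeric(length(unique(x[!is.na(x)])) > 1)

dat2$outcome_change <- ave(dat2$outcome, dat2$group, FUN = changed)
dat2
```

Because missing values are dropped before counting distinct values, the same call works unchanged on data without NAs.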
