r - Duplicated rows in dummyVars

I have a data frame in R; here is an example:
asdf <- data.frame(id = c(2345, 7323, 2345, 4533),
                   place = c("Home", "Home", "Office", "Office"),
                   sex = c("Male", "Male", "Male", "Female"),
                   consumed = c(1000, 800, 1000, 500))
As you can see, one id is duplicated because that person has two locations, Home and Office. I want to convert every character variable to dummy variables and end up with a single row per id, with no duplicated ids. I am sure that "place" is the only variable whose values can differ between the duplicated rows.
When I apply dummyVars from caret I can't do this, and to me this behavior does not make sense. For example, when I apply the following:
library(caret)
dummy <- dummyVars( ~ ., data = asdf, fullRank = FALSE, levelsOnly = TRUE)
predict(dummy, asdf)
I get the following data frame, with duplicated ids:
result <- data.frame(id = c(2345, 7323, 2345, 4533),
                     placeHome = c(1, 1, 0, 0),
                     placeOffice = c(0, 0, 1, 1),
                     sexFemale = c(0, 0, 0, 1),
                     sexMale = c(1, 1, 1, 0),
                     consumed = c(1000, 800, 1000, 500))
but I want this:
sexy_result <- data.frame(id = c(2345, 7323, 4533),
                          placeHome = c(1, 1, 0),
                          placeOffice = c(1, 0, 1),
                          sexFemale = c(0, 0, 1),
                          sexMale = c(1, 1, 0),
                          consumed = c(1000, 800, 500))

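You could collapse your result data frame to one row per id with the dplyr package.
library(dplyr)
# summarise_all() is superseded; across() is the current equivalent
sexy_result <- result %>% group_by(id) %>% summarise(across(everything(), sum))
data.frame(sexy_result)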
    id placeHome placeOffice sexFemale sexMale consumed
1 2345         1           1         0       2     2000
2 4533         0           1         1       0      500
3 7323         1           0         0       1      800
Note that summing every column also doubles sexMale and consumed for the duplicated id. If you want to sum only placeHome and placeOffice and keep the other columns unchanged, take the mean of the others instead:
sexy_result <- result %>%
  group_by(id) %>%
  summarise(placeHome = sum(placeHome),
            placeOffice = sum(placeOffice),
            sexFemale = mean(sexFemale),
            sexMale = mean(sexMale),
            consumed = mean(consumed))
data.frame(sexy_result)
    id placeHome placeOffice sexFemale sexMale consumed
1 2345         1           1         0       1     1000
2 4533         0           1         1       0      500
3 7323         1           0         0       1      800
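A base R alternative, offered as a minimal sketch rather than a definitive answer: since the question states that only place can differ within an id, taking the per-id maximum of every column collapses the duplicates without doubling sexMale or consumed. It assumes the dummy-coded `result` data frame from the question; the name `collapsed` is introduced here only for illustration.

```r
# Dummy-coded data from the question (what predict(dummy, asdf) is said to return)
result <- data.frame(id = c(2345, 7323, 2345, 4533),
                     placeHome = c(1, 1, 0, 0),
                     placeOffice = c(0, 0, 1, 1),
                     sexFemale = c(0, 0, 0, 1),
                     sexMale = c(1, 1, 1, 0),
                     consumed = c(1000, 800, 1000, 500))

# max() turns the place indicators into "seen at this place at least once"
# flags and leaves columns that are constant within an id unchanged
collapsed <- aggregate(. ~ id, data = result, FUN = max)
collapsed
```

This relies on the non-place columns really being constant within each id; if they are not, max() silently keeps the largest value.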

Related

CRAN R - Assign the value '1' to many dummy variables at once [duplicate]

Here is dput() of a structure I currently have.
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA"), USA = c(0, 0, 0, 0, 0), Germany = c(0, 0, 0, 0, 0), Japan = c(0, 0, 0, 0, 0)), class = "data.frame", row.names = c(NA, -5L))
I want to edit this dataframe to get the below results in order to apply this approach to a dataset with 100k+ observations. Specifically, I want to use information from (df$country) that describes a country assigned to a particular ID (e.g., id == 1 and country == Japan), and changes the column value with the corresponding column name (e.g., a column named "Japan") equal to 1. Note that IDs are not unique!
This is what I'd like to end up with:
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan", "Germany", "Germany", "USA"), USA = c(1, 1, 0, 1, 1), Germany = c(0, 0, 1, 1, 1), Japan = c(1, 1, 0, 0, 0)), class = "data.frame", row.names = c(NA, -5L))
The following code gives a close result:
df[levels(factor(df$country))] = model.matrix(~country - 1, df)
But ends up giving me the following, erroneous structure:
structure(list(id = c(1, 1, 2, 4, 4), country = c("USA", "Japan",
"Germany", "Germany", "USA"), USA = c(1, 0, 0, 0, 1), Germany = c(0,
0, 1, 1, 0), Japan = c(0, 1, 0, 0, 0)), row.names = c(NA, -5L
), class = "data.frame")
How can I edit the above command in order to yield my desired result? I cannot use pivot because, in actuality, I'm working with many datasets that have different values in the "country" column that, once pivoted, will yield datasets with non-uniform columns/structures, which will impede data analysis later on.
Thank you for any help!
Perhaps this helps
library(dplyr)
df %>%
  mutate(across(USA:Japan, ~ +(country == cur_column()))) %>%
  group_by(id) %>%
  mutate(across(USA:Japan, max)) %>%
  ungroup()
Output:
# A tibble: 5 × 5
     id country   USA Germany Japan
  <dbl> <chr>   <int>   <int> <int>
1     1 USA         1       0     1
2     1 Japan       1       0     1
3     2 Germany     0       1     0
4     4 Germany     1       1     0
5     4 USA         1       1     0
Or modifying the model.matrix as
m1 <- model.matrix(~country - 1, df)
m1[] <- ave(c(m1), df$id[row(m1)], col(m1), FUN = max)
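The two lines above stop at the modified matrix. A minimal end-to-end sketch, assuming the `df` from the question, could write the result back into the existing columns as follows; stripping the "country" prefix with sub() is an addition made here, not part of the original answer.

```r
df <- data.frame(id = c(1, 1, 2, 4, 4),
                 country = c("USA", "Japan", "Germany", "Germany", "USA"),
                 USA = 0, Germany = 0, Japan = 0)

# One indicator column per country that occurs in the data
m1 <- model.matrix(~ country - 1, df)

# Within each id and each column, replace every entry by the group maximum
m1[] <- ave(c(m1), df$id[row(m1)], col(m1), FUN = max)

# model.matrix() names its columns "countryUSA" and so on; strip the prefix
# so the names match the existing columns, then overwrite only those columns
colnames(m1) <- sub("^country", "", colnames(m1))
df[colnames(m1)] <- m1
df
```

Because only the columns named after countries that actually occur are overwritten, any other pre-existing country column keeps its zeros, which fits the requirement of a uniform column layout across datasets.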
You can use base R
re <- rle(df$id)
for (j in re$values) {
  y <- which(j == df$id)
  df[y, match(df$country[y], colnames(df))] <- 1
}
Output:
  id country USA Germany Japan
1  1     USA   1       0     1
2  1   Japan   1       0     1
3  2 Germany   0       1     0
4  4 Germany   1       1     0
5  4     USA   1       1     0
Are you looking for a solution like this for your closed question, "CRAN R - Assign the value '1' to many dummy variables at once"?
The solution provided by @akrun answers the question here, but you may be looking for something like this:
library(dplyr)
library(tidyr)  # fill() comes from tidyr
df %>%
  group_by(id) %>%
  mutate(across(-country, ~case_when(country == cur_column() ~ 1))) %>%
  fill(-country, .direction = "updown") %>%
  mutate(across(-country, ~ifelse(is.na(.), 0, .))) %>%
  ungroup()
     id country   USA Germany Japan
  <dbl> <chr>   <dbl>   <dbl> <dbl>
1     1 USA         1       0     1
2     1 Japan       1       0     1
3     2 Germany     0       1     0
4     4 Germany     1       1     0
5     4 USA         1       1     0

generate a weighted matrix from r dataframe

I have a toy example of a dataframe:
df <- data.frame(matrix(, nrow = 5, ncol = 0))
df["A|A"] <- c(0.3, 0, 0, 100, 23)
df["A|B"]= c(0, 0, 0.3, 10, 0.23)
df["A|C"]= c(0.3, 0.1, 0, 100, 2)
df["B|B"]= c(0, 0, 0, 12, 2)
# "B|B" is assigned a second time here, overwriting the line above; there is no "B|A" column
df["B|B"]= c(0, 0, 0.3, 0, 0.23)
df["B|C"]= c(0.3, 0, 0, 21, 3)
df["C|A"]= c(0.3, 0, 1, 100, 0)
df["C|B"]= c(0, 0, 0.3, 10, 0.2)
df["C|C"]= c(0.3, 0, 1, 1, 0.3)
I need to get a matrix with counts of non-zero values between A and A, A and B, ..., C and C.
I started by splitting the column names and assigning the parts to variables, but I don't know how to create a matrix with specific rows and columns in a loop:
counts <- colSums(df != 0)
df <- rbind(df, counts)
for (i in colnames(df)) {
  cluster1 <- (strsplit(i, "\\|")[[1]])[1]
  cluster2 <- (strsplit(i, "\\|")[[1]])[2]
}
A base R option
table(read.table(text = rep(names(df), colSums(df > 0)), sep = "|"))
   V2
V1  A B C
  A 3 3 4
  B 0 2 3
  C 3 3 4
or a longer version
table(
  data.frame(
    do.call(
      rbind,
      strsplit(
        as.character(subset(stack(df), values > 0)$ind),
        "\\|"
      )
    )
  )
)
gives
   X2
X1  A B C
  A 3 3 4
  B 0 2 3
  C 3 3 4
Reshape the data into 'long' format with pivot_longer, separate the 'name' column into two, and reshape back to 'wide' with pivot_wider, passing values_fn a lambda function that counts the non-zero values:
library(dplyr)
library(tidyr)
df %>%
  pivot_longer(cols = everything()) %>%
  separate(name, into = c('name1', 'name2')) %>%
  pivot_wider(names_from = name2, values_from = value,
              values_fn = list(value = ~ sum(. > 0)), values_fill = 0)
Output:
# A tibble: 3 x 4
  name1     A     B     C
  <chr> <int> <int> <int>
1 A         3     3     4
2 B         0     2     3
3 C         3     3     4
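The loop started in the question can also be completed directly. The following is a minimal sketch, assuming the same columns the question ends up with (the repeated "B|B" assignment leaves only the second vector, and there is no "B|A" column); the names `parts`, `labels` and `m` are introduced here for illustration.

```r
# Columns as they exist after the question's assignments
df <- data.frame("A|A" = c(0.3, 0, 0, 100, 23),
                 "A|B" = c(0, 0, 0.3, 10, 0.23),
                 "A|C" = c(0.3, 0.1, 0, 100, 2),
                 "B|B" = c(0, 0, 0.3, 0, 0.23),
                 "B|C" = c(0.3, 0, 0, 21, 3),
                 "C|A" = c(0.3, 0, 1, 100, 0),
                 "C|B" = c(0, 0, 0.3, 10, 0.2),
                 "C|C" = c(0.3, 0, 1, 1, 0.3),
                 check.names = FALSE)

# Number of non-zero values per column, then split each name at "|"
counts <- colSums(df != 0)
parts <- strsplit(names(counts), "|", fixed = TRUE)
labels <- sort(unique(unlist(parts)))

# Square matrix of zeros with one row and one column per label
m <- matrix(0, nrow = length(labels), ncol = length(labels),
            dimnames = list(labels, labels))

# Fill the cell addressed by each column name's two parts
for (k in seq_along(parts)) {
  m[parts[[k]][1], parts[[k]][2]] <- counts[[k]]
}
m
```

Pairs that never occur as a column name, such as B and A here, simply keep their initial zero.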

Create single year variables corresponding to a range of dates in R

I would like to be able to take two variables representing a starting date and an ending date and create variables indicating which years are covered over the range of those two dates.
What I have:
df1 <- data.frame(ID = c("A", "B", "C"),
                  Start_Date = c("3/5/2004", "8/22/2005", "4/8/2008"),
                  End_Date = c("6/25/2009", "11/2/2006", "6/9/2011"))
What I want:
df2 <- data.frame(ID = c("A", "B", "C"),
                  Start_Date = c("3/5/2004", "8/22/2005", "4/8/2008"),
                  End_Date = c("6/25/2009", "11/2/2006", "6/9/2011"),
                  y2004 = c(1, 0, 0),
                  y2005 = c(1, 1, 0),
                  y2006 = c(1, 1, 0),
                  y2007 = c(1, 0, 0),
                  y2008 = c(1, 0, 1),
                  y2009 = c(1, 0, 1),
                  y2010 = c(0, 0, 1),
                  y2011 = c(0, 0, 1))
As above, each new year variable indicates whether or not the year is captured in the range of the two date variables "Start_Date" and "End_Date".
Any ideas would be greatly appreciated. Thanks in advance.
One method is to pivot to 'long' format, convert the values to Date class and extract the year, take the sequence (:) from the first to the last year within each 'ID', reshape back to 'wide', and then join with the original data by 'ID':
library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
df1 %>%
  pivot_longer(cols = -ID) %>%
  group_by(ID) %>%
  # summarise() returns several rows per ID here; dplyr >= 1.1.0 recommends reframe() for that
  summarise(year = str_c('y', year(mdy(value)[1]):year(mdy(value)[2])),
            n = 1, .groups = 'drop') %>%
  pivot_wider(names_from = year, values_from = n, values_fill = 0) %>%
  left_join(df1, .)
Output:
#  ID Start_Date  End_Date y2004 y2005 y2006 y2007 y2008 y2009 y2010 y2011
#1  A   3/5/2004 6/25/2009     1     1     1     1     1     1     0     0
#2  B  8/22/2005 11/2/2006     0     1     1     0     0     0     0     0
#3  C   4/8/2008  6/9/2011     0     0     0     0     1     1     1     1
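The same idea can be sketched in base R without reshaping, assuming the dates are always month/day/year strings as in the example. The helper `year_of` and the matrix `flags` are names introduced here for illustration.

```r
df1 <- data.frame(ID = c("A", "B", "C"),
                  Start_Date = c("3/5/2004", "8/22/2005", "4/8/2008"),
                  End_Date = c("6/25/2009", "11/2/2006", "6/9/2011"))

# Parse month/day/year strings and keep only the year as an integer
year_of <- function(x) as.integer(format(as.Date(x, format = "%m/%d/%Y"), "%Y"))
start_year <- year_of(df1$Start_Date)
end_year <- year_of(df1$End_Date)

# Every year from the earliest start to the latest end
years <- min(start_year):max(end_year)

# Row i, column j is TRUE when start_year[i] <= years[j] <= end_year[i]
flags <- outer(start_year, years, "<=") & outer(end_year, years, ">=")
storage.mode(flags) <- "integer"
colnames(flags) <- paste0("y", years)

df2 <- cbind(df1, flags)
df2
```

Using outer() keeps `flags` a matrix even when df1 has a single row, so the cbind() step does not need a special case.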

How to plot a "matrix" in a fixed grid pattern in R

I have a large data frame. A sample of the first 6 rows is below:
> temp
  M1 M2 M3 M4 M5 M6
1  1  1  1  1  1  1
2  1  1  1  1  1  1
3  0  1  0 -1  1  0
4  1  1  1  1  1  1
5  0  0  0 -1  0  1
6  0  0  0  0  0  0
> dput(temp)
structure(list(M1 = c(1, 1, 0, 1, 0, 0), M2 = c(1, 1, 1, 1, 0,
0), M3 = c(1, 1, 0, 1, 0, 0), M4 = c(1, 1, -1, 1, -1, 0), M5 = c(1,
1, 1, 1, 0, 0), M6 = c(1, 1, 0, 1, 1, 0)), .Names = c("M1", "M2",
"M3", "M4", "M5", "M6"), row.names = c(NA, -6L), class = "data.frame")
The data frame only has the values -1, 0 and 1. The total number of rows is 2,156. What I would like to do is plot it in a "grid" format where each row consists of 6 squares (one for each column). Each of the three values is then assigned a color (say, red, green, blue).
I've tried to do this with heatmap.2, but I can't get the distinct squares to show up.
I've tried to do this using ggplot2 with geom_point() but can't figure out the best way to do it.
Any help on how to efficiently do this would be much appreciated!
Thanks!
Another option:
library(lattice)
temp <- as.matrix(temp)
levelplot(temp, col.regions= colorRampPalette(c("red","green","blue")))
This produces a level plot of the matrix (plot image not included).
I think geom_tile() is a better bet, in combination with reshaping to long.
library(ggplot2)
library(reshape2)
# assign an id so that each row gets its own position on the y-axis
temp$id <- 1:nrow(temp)
# reshape to long
m_temp <- melt(temp, id.vars = "id")
p1 <- ggplot(m_temp, aes(x = variable, y = id, fill = factor(value))) +
  geom_tile()
p1
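You could use ggplot to do the following:
library(ggplot2)
# expand.grid() varies its first argument fastest, so y (rows) must come first
# to match the column-by-column order produced by unlist()
dd <- expand.grid(y = 1:nrow(temp), x = 1:ncol(temp))
dd$col <- unlist(c(temp))
ggplot(dd, aes(x = x, y = y, fill = factor(col))) + geom_tile()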
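To pin each of the three values to a fixed color, as the question asks, one possible sketch uses base graphics image() with explicit breaks. This is a different technique from the lattice and ggplot2 answers above, and it assumes `temp` is the 6-row sample from the question.

```r
temp <- data.frame(M1 = c(1, 1, 0, 1, 0, 0), M2 = c(1, 1, 1, 1, 0, 0),
                   M3 = c(1, 1, 0, 1, 0, 0), M4 = c(1, 1, -1, 1, -1, 0),
                   M5 = c(1, 1, 1, 1, 0, 0), M6 = c(1, 1, 0, 1, 1, 0))

# image() draws z[i, j] at (x[i], y[j]), so transpose to put the data
# frame's columns on the x-axis and its rows on the y-axis
z <- t(as.matrix(temp))

# One color per value: the breaks bracket -1, 0 and 1 in turn
image(x = seq_len(ncol(temp)), y = seq_len(nrow(temp)), z = z,
      col = c("red", "green", "blue"),
      breaks = c(-1.5, -0.5, 0.5, 1.5),
      xlab = "column", ylab = "row")
```

With these defaults row 1 is drawn at the bottom of the plot; the breaks guarantee that -1 is always red, 0 green and 1 blue, even when one of the values is absent from the data.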
