How to split the data in each record in R?

I have a dataframe which has a column,
service-id
ids-1-2-3-4-5
ids-1-2-6
ids-5
ids-7-8
with many other columns.
I want to split a value such as ids-1-2-3-4-5 into separate indicator columns 1, 2, 3, ..., 8, as in one-hot encoding: a column holds 1 if its number appears in the value and 0 otherwise.
col.1 col.2 col.3 col.4 col.5 col.6 ..... col.8
1 1 1 1 1 0 ..... 0 for ids-1-2-3-4-5
1 1 0 0 0 1 ...... 0 for ids-1-2-6
I tried the tidyverse but could not get it to work.

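A solution using base R code.
Your data (data.frame() converts the name "service-id" to service.id)
db<-data.frame("service-id"=c("ids-1-2-3-4-5","ids-1-2-6","ids-5","ids-7-8"))
Identify the number of columns (the largest id; the "ids" prefix becomes NA and is dropped)
ncol<-max(suppressWarnings(as.numeric(unlist(strsplit(as.character(db$service.id),"-")))),na.rm = TRUE)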
Extract numeric id list
number_list<-strsplit(as.character(db$service.id),"-")
number_list<-suppressWarnings(lapply(number_list,as.numeric))
number_list <- lapply(number_list, function(x) x[!is.na(x)])
Create the output (a matrix with one indicator column per id)
f<-function(x,ncol)
{
return(as.numeric(seq_len(ncol) %in% x))
}
out<-t(data.frame(lapply(number_list, f, ncol=ncol)))
colnames(out)<-paste0("col.",seq_len(ncol))
rownames(out)<-NULL
Your output
out
col.1 col.2 col.3 col.4 col.5 col.6 col.7 col.8
[1,] 1 1 1 1 1 0 0 0
[2,] 1 1 0 0 0 1 0 0
[3,] 0 0 0 0 1 0 0 0
[4,] 0 0 0 0 0 0 1 1
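The steps above can also be condensed. The following is a minimal sketch, not the answerer's code: it assumes the ids are positive integers embedded in the string, and the names x, ids and n are illustrative. It pulls the digits out with regmatches() instead of splitting and discarding NA values.

```r
# Same data as above
db <- data.frame(service.id = c("ids-1-2-3-4-5", "ids-1-2-6", "ids-5", "ids-7-8"))

# Extract every run of digits from each string
x <- as.character(db$service.id)
ids <- lapply(regmatches(x, gregexpr("[0-9]+", x)), as.integer)

# One indicator column per id, from 1 up to the largest id seen
n <- max(unlist(ids))
out <- t(vapply(ids, function(v) as.integer(seq_len(n) %in% v), integer(n)))
colnames(out) <- paste0("col.", seq_len(n))
out
```

Using vapply() with a declared result type makes the function fail loudly if a row ever yields a vector of the wrong length.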

If we need a tidyverse option, here is one way
library(tidyverse)
df1 %>%
rownames_to_column('rn') %>%
extract(service.id, into = c('id', 'col'), "^([^-]+)-(.*)") %>%
separate_rows(col) %>%
mutate(n = 1, col = paste0("col.", col)) %>%
spread(col, n, fill = 0) %>%
select(-rn, -id)
# col.1 col.2 col.3 col.4 col.5 col.6 col.7 col.8
#1 1 1 1 1 1 0 0 0
#2 1 1 0 0 0 1 0 0
#3 0 0 0 0 1 0 0 0
#4 0 0 0 0 0 0 1 1
data
df1 <- structure(list(service.id = c("ids-1-2-3-4-5", "ids-1-2-6", "ids-5",
"ids-7-8")), .Names = "service.id", class = "data.frame", row.names = c(NA,
-4L))
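Since spread() has been superseded in tidyr, the same idea can be sketched with pivot_wider(). This is an illustrative variant under the assumption that every value starts with the prefix "ids-"; the helper columns rn, col and n are made-up names.

```r
library(dplyr)
library(tidyr)

df1 <- data.frame(service.id = c("ids-1-2-3-4-5", "ids-1-2-6", "ids-5", "ids-7-8"))

res <- df1 %>%
  mutate(rn = row_number(),                        # remember the original row
         col = sub("^ids-", "", service.id)) %>%   # drop the "ids-" prefix
  select(-service.id) %>%
  separate_rows(col, sep = "-") %>%                # one row per id
  mutate(col = paste0("col.", col), n = 1) %>%
  pivot_wider(names_from = col, values_from = n, values_fill = 0) %>%
  select(-rn)
res
```

pivot_wider() creates the columns in order of first appearance, so with other data the columns may need to be reordered afterwards.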

Related

Separate rows to make dummy rows

Consider this dataframe:
dat <- structure(list(col1 = c(1, 2, 0), col2 = c(0, 3, 2), col3 = c(1, 2, 3)), class = "data.frame", row.names = c(NA, -3L))
col1 col2 col3
1 1 0 1
2 2 3 2
3 0 2 3
How can one dummify rows? i.e. whenever there is a row with more than 1 non-0 value, separate the row into multiple rows with one non-0 value per row.
In this case, this would be:
col1 col2 col3
1 1 0 0
2 0 0 1
3 2 0 0
4 0 3 0
5 0 0 2
6 0 2 0
7 0 0 3
You can do:
library(tidyverse)
dat |>
pivot_longer(everything()) |>
mutate(id = 1:n()) |>
pivot_wider(values_fill = 0) |>
filter(!if_all(-id, ~ . == 0)) |>
select(-id)
# A tibble: 7 x 3
col1 col2 col3
<dbl> <dbl> <dbl>
1 1 0 0
2 0 0 1
3 2 0 0
4 0 3 0
5 0 0 2
6 0 2 0
7 0 0 3
Another approach, this time using data.table:
library(data.table)
x <- rbindlist(apply(dat, 1, function(x) {
x <- data.table(diag(x, ncol(dat)))
x[colSums(x) > 0]
}))
setnames(x, names(dat))
x
# col1 col2 col3
# 1: 1 0 0
# 2: 0 0 1
# 3: 2 0 0
# 4: 0 3 0
# 5: 0 0 2
# 6: 0 2 0
# 7: 0 0 3
A very ugly way is:
library(tidyverse)
dat %>%
apply(1, diag) %>%
matrix(nrow = 3) %>%
t() %>%
as.data.frame() %>%
rename_with(~ names(dat), everything()) %>%
filter(rowSums(.) != 0)
col1 col2 col3
1 1 0 0
2 0 0 1
3 2 0 0
4 0 3 0
5 0 0 2
6 0 2 0
7 0 0 3
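For comparison, here is a base R sketch of the same diagonal trick without any packages. It is one possible way to write it, not taken from the answers above; pieces and out are illustrative names.

```r
dat <- data.frame(col1 = c(1, 2, 0), col2 = c(0, 3, 2), col3 = c(1, 2, 3))

# For each row, build a diagonal matrix and keep only the rows whose
# diagonal entry is non-zero
pieces <- lapply(seq_len(nrow(dat)), function(i) {
  v <- unname(unlist(dat[i, ]))
  m <- diag(v, nrow = length(v))
  m[v != 0, , drop = FALSE]
})

out <- as.data.frame(do.call(rbind, pieces))
names(out) <- names(dat)
out
```

Passing nrow to diag() matters: with a single number and no nrow, diag() would return an identity matrix of that size instead of a 1 x 1 matrix.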

Change multiple values in a dataframe based on two other values

I would appreciate some help with this. What I am trying to do is make a new dataframe based on the values in the data frame below.
id value
ant 10
cat 4
cat 6
dog 5
dog 3
dog 2
fly 9
What I want to do next is, in sequential order I want to make a dataframe that looks like the following.
Every time we see a new id, we create a column. The max value is 10 so there should be 10 rows.
Our first word is ant and so therefore for every row of ant, I would like a 0.
Our next column is cat. We have 2 values and what I would like to do is for the first value we see, the first 4 rows must be 0 which is followed by 6 rows of 1.
Same logic for dog, with first five rows as 0 and next three rows as 1 and last 2 as 0.
Fly has only 9 rows of 0 and the last row should contain NA.
It should look like this
ant cat dog fly
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 1 0 0
0 1 1 0
0 1 1 0
0 1 1 0
0 1 0 0
0 1 0 NA
I know how to do this the long way by
newdf <- data.frame(matrix(2, ncol = length(unique(df[,"id"])) , nrow = 10))
newdf$X1[1:10] <- 0
newdf$X2[1:4] <- 0
newdf$X2[5:10] <- 1
...
However, is there any way to do this more efficiently? Note that my actual data will have roughly 50 rows so that's why I am looking for a more efficient way to complete this!
Here's a tidyverse answer -
library(dplyr)
library(tidyr)
df %>%
group_by(id) %>%
mutate(val = rep(c(0, 1), length.out = n())) %>%
uncount(value) %>%
mutate(row = row_number()) %>%
complete(row = 1:10) %>%
pivot_wider(names_from = id, values_from = val) %>%
select(-row)
# ant cat dog fly
# <dbl> <dbl> <dbl> <dbl>
# 1 0 0 0 0
# 2 0 0 0 0
# 3 0 0 0 0
# 4 0 0 0 0
# 5 0 1 0 0
# 6 0 1 1 0
# 7 0 1 1 0
# 8 0 1 1 0
# 9 0 1 0 0
#10 0 1 0 NA
For each id we assign an alternate 0, 1 value and use uncount to repeat the rows based on the count. Get the data in wide format so that we have a separate column for each id.
data
df <- structure(list(id = c("ant", "cat", "cat", "dog", "dog", "dog",
"fly"), value = c(10, 4, 6, 5, 3, 2, 9)), row.names = c(NA, -7L
), class = "data.frame")
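To see the core step in isolation, here is a small sketch of what the alternating flags and the run lengths produce for a single id. The vectors dog and fly are built by hand from the example data and are only for illustration.

```r
# Run lengths for "dog" in the example data: 5 zeros, 3 ones, 2 zeros
value <- c(5, 3, 2)
flags <- rep(c(0, 1), length.out = length(value))  # alternate 0, 1, 0
dog <- rep(flags, times = value)                   # expand each flag by its run length
dog

# "fly" has a single run of 9 zeros; padding to 10 rows fills with NA
fly <- rep(0, times = 9)
length(fly) <- 10
fly
```

uncount() followed by complete() in the pipeline above does the same expansion and padding for all ids at once.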
You can try the following base R code
maxlen <- with(df, max(tapply(value, id, sum)))
list2DF(
lapply(
with(df, split(value, id)),
function(x) {
`length<-`(
rep(rep(c(0, 1), length.out = length(x)), x),
maxlen
)
}
)
)
which gives
ant cat dog fly
1 0 0 0 0
2 0 0 0 0
3 0 0 0 0
4 0 0 0 0
5 0 1 0 0
6 0 1 1 0
7 0 1 1 0
8 0 1 1 0
9 0 1 0 0
10 0 1 0 NA

Separate character string variable into several variables [duplicate]

I have data (a column in a dataframe) of type character. I want to separate these characters and, depending on the content, fill separate variables with 0s and 1s.
The column can be recreated with:
df <- data.frame(var = c("1;2", NA, "1;2;3;4;5", "3;5", "1", "1;4", "3", NA, "4", "1;5"))
For example, the characters can range from 1 to 5. I want to create six variables:
var_1, var_2, var_3, var_4, var_5, and var_NA. I want var_1 to contain a 1 if that row has a 1 within the character string, and 0 if it does not.
Thank you!
Perhaps, using cSplit_e would be an option
library(splitstackshape)
library(dplyr)
cSplit_e(df, 'var', sep=";", type = 'character', fill = 0, drop = TRUE)%>%
mutate(var_NA = +(is.na(df$var)))
# var_1 var_2 var_3 var_4 var_5 var_NA
#1 1 1 0 0 0 0
#2 0 0 0 0 0 1
#3 1 1 1 1 1 0
#4 0 0 1 0 1 0
#5 1 0 0 0 0 0
#6 1 0 0 1 0 0
#7 0 0 1 0 0 0
#8 0 0 0 0 0 1
#9 0 0 0 1 0 0
#10 1 0 0 0 1 0
Or using base R
t(sapply(strsplit(df$var, "[:;]"), function(x) +(1:5 %in% x)))
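The base R one-liner returns an unnamed matrix without the var_NA column. A possible extension, sketched here under the assumption that the values are always 1 to 5, adds that column and the names the question asks for.

```r
df <- data.frame(var = c("1;2", NA, "1;2;3;4;5", "3;5", "1", "1;4", "3", NA, "4", "1;5"))

# Split on ";" (an NA entry stays NA and matches none of 1:5)
parts <- strsplit(as.character(df$var), ";", fixed = TRUE)
out <- t(vapply(parts, function(x) as.integer(1:5 %in% x), integer(5)))

# Append the indicator for missing values and name the columns
out <- cbind(out, as.integer(is.na(df$var)))
colnames(out) <- c(paste0("var_", 1:5), "var_NA")
out
```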
In the tidyverse, we can get the data in long format by splitting on ";", create a column with the "var_" prefix, change all values to 1 and get the data in wide format.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
separate_rows(var, sep = ";") %>%
mutate(col = paste0('var_', var),
var = 1) %>%
pivot_wider(names_from = col, values_from = var, values_fill = 0) %>%
ungroup %>%
select(-row)
# A tibble: 10 x 6
# var_1 var_2 var_NA var_3 var_4 var_5
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0 0 0 0
# 2 0 0 1 0 0 0
# 3 1 1 0 1 1 1
# 4 0 0 0 1 0 1
# 5 1 0 0 0 0 0
# 6 1 0 0 0 1 0
# 7 0 0 0 1 0 0
# 8 0 0 1 0 0 0
# 9 0 0 0 0 1 0
#10 1 0 0 0 0 1

Create a new dataframe with the all possible combinations

Having a dataframe like this:
# next is a reserved word, so it has to be backquoted; data.frame() then names the column next.
data.frame(previous = c(1,2,2,1,3,3), `next` = c(1,1,2,3,1,3), id = c(1,2,3,4,5,6))
How is it possible to extract a data frame which checks the previous and next columns and creates 9 new columns, each holding 1 only if that combination of previous and next exists in the row? For example, if previous is 2 and next is 1, the combination is 2 1 and column col2_1 receives a 1.
Example of expected output:
data.frame(previous = c(1,2,2,1,3,3), `next` = c(1,1,2,3,1,3),
col1_1 = c(1,0,0,0,0,0),
col1_2 = c(0,0,0,0,0,0),
col1_3 = c(0,0,0,1,0,0),
col2_1 = c(0,1,0,0,0,0),
col2_2 = c(0,0,1,0,0,0),
col2_3 = c(0,0,0,0,0,0),
col3_1 = c(0,0,0,0,1,0),
col3_2 = c(0,0,0,0,0,0),
col3_3 = c(0,0,0,0,0,1), id = c(1,2,3,4,5,6))
You could use expand.grid to get all the combinations.
Assuming your data frame is called df and the column next is actually named next. (with a trailing dot) to avoid clashing with the keyword next:
as.data.frame(apply(expand.grid(1:3, 1:3), 1, function(x) {
as.numeric(x[1] == df$previous & x[2] == df$next.)}))
#> V1 V2 V3 V4 V5 V6 V7 V8 V9
#> 1 1 0 0 0 0 0 0 0 0
#> 2 0 1 0 0 0 0 0 0 0
#> 3 0 0 0 0 1 0 0 0 0
#> 4 0 0 0 0 0 0 1 0 0
#> 5 0 0 1 0 0 0 0 0 0
#> 6 0 0 0 0 0 0 0 0 1
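The result above has generic column names V1 to V9. A hedged sketch of how the same comparison could be given the names from the question follows; combos and onehot are illustrative names, and the values are assumed to run from 1 to 3.

```r
df <- data.frame(previous = c(1, 2, 2, 1, 3, 3),
                 next.    = c(1, 1, 2, 3, 1, 3),
                 id       = 1:6)

# nxt varies fastest, so the columns come out as col1_1, col1_2, col1_3, col2_1, ...
combos <- expand.grid(nxt = 1:3, prev = 1:3)

# One indicator column per (previous, next) combination
onehot <- mapply(function(p, n) as.integer(df$previous == p & df$next. == n),
                 combos$prev, combos$nxt)
colnames(onehot) <- paste0("col", combos$prev, "_", combos$nxt)

res <- cbind(df[c("previous", "next.")], onehot, id = df$id)
res
```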
A step-by-step approach might be the following one. I have renamed the next column to next1 to avoid problems:
AllComb<-expand.grid(unique(df$previous),unique(df$next1))# Creating all possible combinations
myframe <- matrix(rep(0,nrow(AllComb)*nrow(df)),ncol=nrow(AllComb),nrow =nrow(df))
colnames(myframe)<-paste("col_",AllComb$Var1,"_",AllComb$Var2, sep ="")
for(id_row in seq_len(nrow(df))){ # Loop over the rows of df (not its columns)
myvec <- df[id_row,]
Word <- paste("col_",myvec[1],"_",myvec[2], sep ="")# Finding Word
Colindex <-which(colnames(myframe)==Word) #Finding Column index
myframe[id_row, Colindex] <-1 # Replacing in column index and vector
}
dfRes<-cbind(previous =df$previous, "next"= df$next1, myframe, id=df$id)
# previous next col_1_1 col_2_1 col_3_1 col_1_2 col_2_2 col_3_2 col_1_3 col_2_3 col_3_3 id
# [1,] 1 1 1 0 0 0 0 0 0 0 0 1
# [2,] 2 1 0 1 0 0 0 0 0 0 0 2
# [3,] 2 2 0 0 0 0 1 0 0 0 0 3
# [4,] 1 3 0 0 0 0 0 0 1 0 0 4
# [5,] 3 1 0 0 1 0 0 0 0 0 0 5
# [6,] 3 3 0 0 0 0 0 0 0 0 1 6
Inside a by grouped on previous you could use a switch on next., because your values are nicely consecutive 1:3. Finally we merge to get the result.
tmp <- by(dat, dat$previous, function(x) {
x1 <- x$next.
o <- `colnames<-`(t(sapply(x1, function(z)
switch(z, c(1, 0, 0), c(0, 1, 0), c(0, 0, 1)))),
paste(x$previous[1], 1:3, sep="_"))
cbind(x, col=o)
})
res <- Reduce(function(...) merge(..., all=TRUE), tmp)
res[is.na(res)] <- 0 ## set NA to zero if wanted
Result
res[order(res$id),] ## order by ID if needed
# previous next. id col.1_1 col.1_2 col.1_3 col.2_1 col.2_2 col.2_3 col.3_1 col.3_2 col.3_3
# 1 1 1 1 1 0 0 0 0 0 0 0 0
# 3 2 1 2 0 0 0 1 0 0 0 0 0
# 4 2 2 3 0 0 0 0 1 0 0 0 0
# 2 1 3 4 0 0 1 0 0 0 0 0 0
# 5 3 1 5 0 0 0 0 0 0 1 0 0
# 6 3 3 6 0 0 0 0 0 0 0 0 1
Data
dat <- structure(list(previous = c(1, 2, 2, 1, 3, 3), next. = c(1, 1,
2, 3, 1, 3), id = c(1, 2, 3, 4, 5, 6)), class = "data.frame", row.names = c(NA,
-6L))
Note: next is not a good choice for a column name, since it is a reserved word in R.
Here is a tidyverse approach (with the next column renamed to nxt):
library(tibble) # for rowid_to_column()
library(tidyr)
library(dplyr)
df %>%
rowid_to_column() %>%
complete(previous, nxt) %>%
unite(col , previous, nxt, sep = "_", remove = FALSE) %>%
pivot_wider(names_from = col, values_from = rowid, values_fn = list(rowid = ~1), values_fill = list(rowid = 0)) %>%
na.omit() %>%
arrange(id)
# A tibble: 6 x 12
previous nxt id `1_1` `1_2` `1_3` `2_1` `2_2` `2_3` `3_1` `3_2` `3_3`
<dbl> <dbl> <dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 1 0 0 0 0 0 0 0 0
2 2 1 2 0 0 0 1 0 0 0 0 0
3 2 2 3 0 0 0 0 1 0 0 0 0
4 1 3 4 0 0 1 0 0 0 0 0 0
5 3 1 5 0 0 0 0 0 0 1 0 0
6 3 3 6 0 0 0 0 0 0 0 0 1
This is another tidyverse solution that differs a little from H1's answer (and is maybe more concise).
library(dplyr)
library(tidyr)
df %>%
mutate(n = 1) %>%
complete(id, previous, next., fill = list(n = 0)) %>%
unite(col, previous, next.) %>%
pivot_wider(names_from = col, names_prefix = "col", values_from = n) %>%
right_join(df)
# # A tibble: 6 x 12
# id col1_1 col1_2 col1_3 col2_1 col2_2 col2_3 col3_1 col3_2 col3_3 previous next.
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 0 0 0 0 0 0 0 0 1 1
# 2 2 0 0 0 1 0 0 0 0 0 2 1
# 3 3 0 0 0 0 1 0 0 0 0 2 2
# 4 4 0 0 1 0 0 0 0 0 0 1 3
# 5 5 0 0 0 0 0 0 1 0 0 3 1
# 6 6 0 0 0 0 0 0 0 0 1 3 3
You can try the code below (here the columns are named Previous and Next)
dfout <- within(df,
col <- `colnames<-`(t(sapply((Previous-1)*3+Next,
function(v) replace(rep(0,9),v,1))),
do.call(paste,c(expand.grid(1:3,1:3)[2:1],sep = "_"))))
such that
> dfout
Previous Next id col.1_1 col.1_2 col.1_3 col.2_1 col.2_2 col.2_3 col.3_1 col.3_2 col.3_3
1 1 1 1 1 0 0 0 0 0 0 0 0
2 2 1 2 0 0 0 1 0 0 0 0 0
3 2 2 3 0 0 0 0 1 0 0 0 0
4 1 3 4 0 0 1 0 0 0 0 0 0
5 3 1 5 0 0 0 0 0 0 1 0 0
6 3 3 6 0 0 0 0 0 0 0 0 1

adding together multiple sets of columns in r

I'm trying to add several sets of columns together.
Example df:
df <- data.frame(
key = 1:5,
ab0 = c(1,0,0,0,1),
ab1 = c(0,2,1,0,0),
ab5 = c(1,0,0,0,1),
bc0 = c(0,1,0,2,0),
bc1 = c(2,0,0,0,0),
bc5 = c(0,2,1,0,1),
df0 = c(0,0,0,1,0),
df1 = c(1,0,3,0,0),
df5 = c(1,0,0,0,6)
)
Giving me:
key ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5
1 1 1 0 1 0 2 0 0 1 1
2 2 0 2 0 1 0 2 0 0 0
3 3 0 1 0 0 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 1 0 1 0 0 1 0 0 6
For each prefix, I want to add the column ending in 5 to the column ending in 0 and store the result in the 0 column.
So the end result would be:
key ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5
1 1 2 0 1 0 2 0 1 1 1
2 2 0 2 0 3 0 2 0 0 0
3 3 0 1 0 1 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 2 0 1 1 0 1 6 0 6
I could add the columns together using 3 lines:
df$ab0 <- df$ab0 + df$ab5
df$bc0 <- df$bc0 + df$bc5
df$df0 <- df$df0 + df$df5
But my real example has over a hundred columns so I'd like to iterate over them and use apply.
The column names of the first set are contained in col0 and the names of the second set are in col5.
col0 <- c("ab0","bc0","df0")
col5 <- c("ab5","bc5","df5")
I created a function to add the columns together using mapply:
fun1 <- function(df,x,y) {
df[,x] <- df[,x] + df[,y]
}
mapply(fun1,df,col0,col5)
But I get an error: Error in df[, x] : incorrect number of dimensions
Thoughts?
Simply add the two subsets of columns together as data frames, assuming both subsets have the same number of columns in matching order. No loops are needed, since the operation is vectorized.
final_df <- df[grep("0", names(df))] + df[grep("5", names(df))]
final_df <- cbind(final_df, df[grep("0", names(df), invert=TRUE)])
final_df <- final_df[order(names(final_df))]
final_df
# ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5 key
# 1 2 0 1 0 2 0 1 1 1 1
# 2 0 2 0 3 0 2 0 0 0 2
# 3 0 1 0 1 0 1 0 3 0 3
# 4 0 0 0 2 0 0 1 0 0 4
# 5 2 0 1 1 0 1 6 0 6 5
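The mapply call in the question fails because mapply iterates over the columns of df, so inside fun1 the argument df is a plain vector and df[, x] has no second dimension. A minimal sketch that reuses the col0 and col5 vectors from the question and updates df in place, keeping the original column order:

```r
df <- data.frame(
  key = 1:5,
  ab0 = c(1,0,0,0,1), ab1 = c(0,2,1,0,0), ab5 = c(1,0,0,0,1),
  bc0 = c(0,1,0,2,0), bc1 = c(2,0,0,0,0), bc5 = c(0,2,1,0,1),
  df0 = c(0,0,0,1,0), df1 = c(1,0,3,0,0), df5 = c(1,0,0,0,6)
)
col0 <- c("ab0", "bc0", "df0")
col5 <- c("ab5", "bc5", "df5")

# Element-wise sum of two equally shaped column subsets, written back to the 0 columns
df[col0] <- df[col0] + df[col5]
df
```

This relies on col0 and col5 listing the columns in matching order.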
You could use map2 from the purrr package to iterate over the two vectors at once:
df <- data.frame(
key = 1:5,
ab0 = c(1,0,0,0,1),
ab1 = c(0,2,1,0,0),
ab5 = c(1,0,0,0,1),
bc0 = c(0,1,0,2,0),
bc1 = c(2,0,0,0,0),
bc5 = c(0,2,1,0,1),
df0 = c(0,0,0,1,0),
df1 = c(1,0,3,0,0),
df5 = c(1,0,0,0,6)
)
col0 <- c("ab0","bc0","df0")
col5 <- c("ab5","bc5","df5")
purrr::map2(col0, col5, function(x, y) {
df[[x]] <<- df[[x]] + df[[y]]
})
> df
key ab0 ab1 ab5 bc0 bc1 bc5 df0 df1 df5
1 1 2 0 1 0 2 0 1 1 1
2 2 0 2 0 3 0 2 0 0 0
3 3 0 1 0 1 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 2 0 1 1 0 1 6 0 6
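With over a hundred columns, typing col0 and col5 by hand is impractical. As a sketch, and assuming every column whose name ends in 0 has a partner ending in 5, the name vectors could be derived from names(df) and combined in a plain loop, which avoids the global assignment operator:

```r
df <- data.frame(
  key = 1:5,
  ab0 = c(1,0,0,0,1), ab1 = c(0,2,1,0,0), ab5 = c(1,0,0,0,1),
  bc0 = c(0,1,0,2,0), bc1 = c(2,0,0,0,0), bc5 = c(0,2,1,0,1),
  df0 = c(0,0,0,1,0), df1 = c(1,0,3,0,0), df5 = c(1,0,0,0,6)
)

col0 <- grep("0$", names(df), value = TRUE)  # names ending in 0
col5 <- sub("0$", "5", col0)                 # their partners ending in 5
stopifnot(all(col5 %in% names(df)))          # fail early if a partner is missing

for (i in seq_along(col0)) {
  df[[col0[i]]] <- df[[col0[i]]] + df[[col5[i]]]
}
df
```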
Here's an approach using tidyr and dplyr from the tidyverse meta-package.
First, I bring the table into long ("tidy") format, and split out the column into two components, and spread by the number part of those components.
Then I do the calculation you describe.
Finally, I bring it back into the original format using the inverse of step 1.
library(tidyverse)
df_tidy <- df %>%
# Step 1
gather(col, value, -key) %>%
separate(col, into = c("grp", "num"), 2) %>%
spread(num, value) %>%
# Step 2
mutate(`0` = `0` + `5`) %>%
# Step 3, which is just the inverse of Step 1.
gather(num, value, -key, - grp) %>%
unite(col, c("grp", "num")) %>%
spread(col, value)
df_tidy
key ab_0 ab_1 ab_5 bc_0 bc_1 bc_5 df_0 df_1 df_5
1 1 2 0 1 0 2 0 1 1 1
2 2 0 2 0 3 0 2 0 0 0
3 3 0 1 0 1 0 1 0 3 0
4 4 0 0 0 2 0 0 1 0 0
5 5 2 0 1 1 0 1 6 0 6
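gather() and spread() have been superseded by pivot_longer() and pivot_wider(). The three steps can be sketched with the newer functions as follows; this is an illustrative rewrite, not the answerer's code, and it assumes each column name is a prefix followed by a single digit.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  key = 1:5,
  ab0 = c(1,0,0,0,1), ab1 = c(0,2,1,0,0), ab5 = c(1,0,0,0,1),
  bc0 = c(0,1,0,2,0), bc1 = c(2,0,0,0,0), bc5 = c(0,2,1,0,1),
  df0 = c(0,0,0,1,0), df1 = c(1,0,3,0,0), df5 = c(1,0,0,0,6)
)

df_tidy <- df %>%
  # Step 1: long format, splitting each name into prefix and trailing digit
  pivot_longer(-key, names_to = c("grp", "num"), names_pattern = "^(.*)([0-9])$") %>%
  pivot_wider(names_from = num, values_from = value) %>%
  # Step 2: add the 5 column to the 0 column
  mutate(`0` = `0` + `5`) %>%
  # Step 3: back to the original layout, keeping the original names
  pivot_longer(c(`0`, `1`, `5`), names_to = "num") %>%
  pivot_wider(names_from = c(grp, num), values_from = value, names_sep = "")
df_tidy
```

Setting names_sep = "" restores names such as ab0 rather than ab_0.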
