Revert one-hot encoding (multiple columns to one column) - r

I have a dataset with answers from a survey of 17 questions (10 questions use a 5-point scale and 7 questions use a 7-point scale). The data format gives me 5 or 7 columns per question (True or False), in a one-hot encoding style, and I want to convert these columns back to 17 single columns.
To be more specific, the data I have looks like the following
Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 .... Q17.1 Q17.2 ... Q17.5
row1 T F F F F F F F T F
... ...
row2000 F T F F F F F T F F
The desired format is
Q1 Q2 .... Q17
row1 1 4 2 # with number indicating the value that the column is True
....
row2000 2 3 1 #(e.g., if Q2.4 is T, then for Q2, it is 4).

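Base R approach using split.default and max.col. Using split.default we can split the columns based on the pattern in their names, so that the columns of each question form one list element. Assuming every question has exactly one TRUE value per row, we can use max.col to find the index of the TRUE column. Note that max.col returns the column position within each group, which equals the answer value only when the group's columns are complete and in order.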
sapply(split.default(df, sub("\\..*", "", names(df))), max.col)
# Q1 Q2
#[1,] 1 2
#[2,] 6 5
data
df <-read.table(text = "Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q2.1 Q2.2 Q2.3 Q2.4 Q2.5
T F F F F F F F T F F F
F F F F F T F F F F F T", header = T)
This assumes the class of your data is "logical". If "T"/"F" is stored in character format (as in Maurits Evers's answer), we need to convert the columns to logical first.
Using the data from Maurits Evers's answer:
df[] <- lapply(df, as.logical)
sapply(split.default(df, sub("\\..*", "", names(df))), max.col)
# Q1 Q17
#[1,] 1 2
#[2,] 2 1
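Because max.col returns a position rather than the number in the column name, a group with missing columns (such as Q17.1, Q17.2, Q17.5 in the sample data) would decode Q17.5 as 3. The following is a minimal sketch that maps the position back to the numeric suffix and checks the one-TRUE-per-row assumption. The helper name decode and the small example data are made up for illustration.

```r
# Small example with a gap in the Q17 columns (Q17.3 and Q17.4 are missing)
df <- read.table(text = "Q1.1 Q1.2 Q1.3 Q17.1 Q17.2 Q17.5
T F F F F T
F F T T F F", header = TRUE)

# One list element (a data frame) per question
groups <- split.default(df, sub("\\..*", "", names(df)))

# Hypothetical helper: decode one question's columns to the answer value
decode <- function(g) {
  # Every row must have exactly one TRUE, otherwise max.col is meaningless
  stopifnot(rowSums(g) == 1)
  idx <- max.col(g, ties.method = "first")
  # Translate the column position into the numeric suffix of the column name
  as.integer(sub(".*\\.", "", names(g)))[idx]
}

decoded <- as.data.frame(lapply(groups, decode))
decoded
```

With complete, ordered columns this gives the same result as the plain max.col call, so the extra lookup only matters when some answer columns are absent.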

Here is a tidyverse option:
library(tidyverse)
df %>%
rownames_to_column("row") %>%
gather(k, v, -row) %>%
separate(k, c("question", "part"), sep = "\\.") %>%
filter(v == "T") %>%
group_by(row) %>%
select(-v) %>%
spread(question, part)
## A tibble: 2 x 3
## Groups: row [2]
# row Q1 Q17
# <chr> <chr> <chr>
#1 row1 1 2
#2 row2000 2 1
I assume that your original data contains "T"/"F" as character entries. If they are in fact TRUE/FALSE, you should change filter(v == "T") to filter(v == TRUE).
Sample data
df <- read.table(text =
"Q1.1 Q1.2 Q1.3 Q1.4 Q1.5 Q1.6 Q1.7 Q17.1 Q17.2 Q17.5
row1 T F F F F F F F T F
row2000 F T F F F F F T F F", colClasses = "character")

Related

How to combine two columns using some conditions?

I know that this is not a big problem, but I'm just new to this. I have this output obtained from merging two data frames. Each one has a column that corresponds to the sex of each participant of an event.
Sex.x Sex.y
M M
F F
F M
M M
F F
M M
NA M
F F
Desired output: the two columns combined into one that has "?" when the two values don't match and that keeps the only available value if there is an NA in the adjacent cell.
F_Sex
M
F
?
M
F
M
M
F
I was trying to do it with the dplyr package, but I only got as far as this code. I know I need to use if_else, but after many tries I have nothing.
all_data1 <- all_data %>% unite(F_sexo, c(sexo.x, sexo.y), sep = "-", remove = TRUE)
Thanks a lot in advance.
Here is one idea. Use coalesce first so that rows with only one NA get the non-missing sex. Then use ifelse to change the rows where the two sexes differ to "?".
Notice that if both columns are NA in a row, this solution returns NA. Please make sure this is the behavior you want.
library(dplyr)
dat2 <- dat %>%
  mutate(Sex = coalesce(Sex.x, Sex.y)) %>%
  mutate(Sex = ifelse(Sex.x != Sex.y & !is.na(Sex.x) & !is.na(Sex.y), "?", Sex))
dat2
# Sex.x Sex.y Sex
# 1 M M M
# 2 F F F
# 3 F M ?
# 4 M M M
# 5 F F F
# 6 M M M
# 7 <NA> M M
# 8 F F F
DATA
dat <- read.table(text = "Sex.x Sex.y
M M
F F
F M
M M
F F
M M
NA M
F F", header = TRUE, stringsAsFactors = FALSE)
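Here is a solution with case_when, assuming the data is stored in df. This first version returns "?" whenever the two values differ or either of them is NA: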
df %>% mutate(F_sex = case_when(Sex.x == Sex.y ~ Sex.x,
TRUE ~"?"))
or, to keep the non-missing value when one of the two columns is NA:
df %>% mutate(F_sex = case_when(is.na(Sex.x) ~ Sex.y,
is.na(Sex.y) ~ Sex.x,
Sex.x == Sex.y ~ Sex.x,
TRUE ~"?"))
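For comparison, the same rules can be written without dplyr. This is a rough base R sketch, not taken from the answers above; the function name combine_sex and the shortened example data are assumptions for illustration. As with the coalesce approach, a row where both values are NA stays NA.

```r
# Shortened example data, including a row where both values are missing
dat <- data.frame(
  Sex.x = c("M", "F", "F", "M", NA),
  Sex.y = c("M", "F", "M", NA, NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: merge two columns, flagging disagreements with "?"
combine_sex <- function(x, y) {
  # Take x where available, otherwise fall back to y
  out <- ifelse(is.na(x), y, x)
  # A mismatch needs both values present and different
  mismatch <- !is.na(x) & !is.na(y) & x != y
  out[mismatch] <- "?"
  out
}

dat$F_Sex <- combine_sex(dat$Sex.x, dat$Sex.y)
dat
```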

Is there a base R version of tidyr's unnest() function?

I've been using tidyverse quite a lot and now I'm interested in the possibilities of base R.
Let's take a look at this simple data.frame
df <- data.frame(id = 1:4, nested = c("a, b, f", "c, d", "e", "e, f"))
Using dplyr, stringr and tidyr we could do
df %>%
mutate(nested = str_split(nested, ", ")) %>%
unnest(nested)
to get (let's ignore the tibble part)
# A tibble: 8 x 2
id nested
<int> <chr>
1 1 a
2 1 b
3 1 f
4 2 c
5 2 d
6 3 e
7 4 e
8 4 f
Now we want to rebuild this one using base R tools. So
transform(df, nested = strsplit(nested, ", "))
gives us the mutate part, but how can we unnest() this data.frame? I thought of using unlist() but couldn't find a satisfying way.
We could use stack on a named list in a single line
with(df, setNames(stack(setNames(strsplit(nested, ", "), id))[2:1], names(df)))
-output
id nested
1 1 a
2 1 b
3 1 f
4 2 c
5 2 d
6 3 e
7 4 e
8 4 f
If we use transform, we can then use rep to replicate id according to the lengths of the list column
out <- transform(df, nested = strsplit(nested, ", "))
data.frame(id = rep(out$id, lengths(out$nested)), nested = unlist(out$nested))
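When the data frame has more columns than id, rebuilding it column by column gets tedious. One possible variation, sketched here under the assumption that the separator is always ", ", repeats whole rows by index instead, so every other column is carried along automatically.

```r
df <- data.frame(id = 1:4, nested = c("a, b, f", "c, d", "e", "e, f"),
                 stringsAsFactors = FALSE)

# Split each string into its pieces (a list with one vector per row)
parts <- strsplit(df$nested, ", ", fixed = TRUE)

# Repeat each row once per piece, then overwrite the nested column
out <- df[rep(seq_len(nrow(df)), lengths(parts)), , drop = FALSE]
out$nested <- unlist(parts)
rownames(out) <- NULL
out
```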

Lookup value from DF column in df col names, take value for corresponding row

My DF:
dataAB <- c("A","B","A","A","B")
dataCD <- c("C","C","D","D","C")
dataEF <- c("F","E","E","E","F")
key <- c("dataC","dataA","dataC","dataE","dataE")
df <- data.frame(dataAB,dataCD,dataEF,key)
I'd like to add a column that looks for the value in "key" in the names of the DF and takes the value in that column for the row. My result would look like this:
df$result <- c("C","B","D","E","F")
Note that the value in the "key" column only partially matches the col names of df and is not the complete names of the col names. I suspect I'll need grep or grepl somewhere. I've tried variations on the following code, but can't get anything to work, and I'm unsure how to apply grep or grepl in this case.
df$result <- mapply(function(a) {df[[as.character(a)]]}, a=df$key)
Use apply with MARGIN = 1 (row-wise): for each row, grepl finds the column whose name contains the row's key, and we take the value from that column.
df$result <- apply(df, 1, function(x) x[grepl(x["key"], names(x))])
df
# dataAB dataCD dataEF key result
#1 A C F dataC C
#2 B C E dataA B
#3 A D E dataC D
#4 A D E dataE E
#5 B C F dataE F
Another option uses mapply: first find the column to extract from for each key with sapply and grep, then take the corresponding value from each row.
df$result <- mapply(function(x, y) df[x, y], 1:nrow(df),
sapply(df$key, function(x) grep(x, names(df), value = TRUE)))
df
# dataAB dataCD dataEF key result
#1 A C F dataC C
#2 B C E dataA B
#3 A D E dataC D
#4 A D E dataE E
#5 B C F dataE F
Perhaps with the tidyverse:
library(tidyverse)
df <- data.frame(dataAB,dataCD,dataEF,key,stringsAsFactors=FALSE) %>% mutate(id=row_number())
df %>% gather(k,v,-key,-id) %>%
filter(str_detect(substring(k,5),substring(key,5))) %>%
select(result=v,id) %>%
inner_join(df,.,by="id")
# dataAB dataCD dataEF key id result
#1 A C F dataC 1 C
#2 B C E dataA 2 B
#3 A D E dataC 3 D
#4 A D E dataE 4 E
#5 B C F dataE 5 F
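A vectorised alternative is to index a matrix with a two-column matrix of (row, column) positions. The sketch below is an illustration rather than one of the posted answers; it assumes each key matches exactly one column name as a literal substring, and the names value_cols and col_idx are made up.

```r
df <- data.frame(
  dataAB = c("A", "B", "A", "A", "B"),
  dataCD = c("C", "C", "D", "D", "C"),
  dataEF = c("F", "E", "E", "E", "F"),
  key    = c("dataC", "dataA", "dataC", "dataE", "dataE"),
  stringsAsFactors = FALSE
)

# Columns that hold the values (everything except the key itself)
value_cols <- setdiff(names(df), "key")

# For each key, the position of the first column whose name contains it
col_idx <- vapply(df$key,
                  function(k) grep(k, value_cols, fixed = TRUE)[1],
                  integer(1), USE.NAMES = FALSE)
stopifnot(!anyNA(col_idx))  # every key must match some column

# Pick one cell per row using (row, column) pairs
values <- as.matrix(df[value_cols])
df$result <- values[cbind(seq_len(nrow(df)), col_idx)]
df
```

Matrix indexing avoids calling a function once per row for the extraction step, which may matter on larger data.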

Extract character list values from data.frame rows and reshape data

I have a variable x with character lists in each row:
dat <- data.frame(id = c(rep('a',2),rep('b',2),'c'),
x = c('f,o','f,o,o','b,a,a,r','b,a,r','b,a'),
stringsAsFactors = F)
I would like to reshape the data so that each row is a unique (id, x) pair such as:
dat2 <- data.frame(id = c(rep('a',2),rep('b',3),rep('c',2)),
x = c('f','o','a','b','r','a','b'))
> dat2
id x
1 a f
2 a o
3 b a
4 b b
5 b r
6 c a
7 c b
I've attempted to do this by splitting the character lists and keeping only the unique list values in each row:
dat$x <- sapply(strsplit(dat$x, ','), sort)
dat$x <- sapply(dat$x, unique)
dat <- unique(dat)
> dat
id x
1 a f, o
3 b a, b, r
5 c a, b
However, I'm not sure how to proceed with converting the row lists into individual row entries.
How would I accomplish this? Or is there a more efficient way of converting a list of strings to reshape the data as described?
You can use tidytext::unnest_tokens:
library(tidytext)
library(dplyr)
dat %>%
unnest_tokens(x1, x) %>%
distinct()
id x1
1 a f
2 a o
3 b b
4 b a
5 b r
6 c b
7 c a
A base R method with two lines is
# split each x string into a vector of values
x <- strsplit(dat$x, ",")
# construct full data.frame, then use unique to return desired rows
unique(data.frame(id=rep(dat$id, lengths(x)), x=unlist(x)))
This returns
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
If you don't want to write out the variable names yourself, you can use setNames.
setNames(unique(data.frame(rep(dat$id, lengths(x)), unlist(x))), names(dat))
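The desired output in the question lists the values in alphabetical order within each id, while the result above keeps the order of first appearance. If the ordering matters, one way to get it is to sort after de-duplicating; this is a small sketch along the same lines, with the intermediate names parts and long chosen for illustration.

```r
dat <- data.frame(id = c("a", "a", "b", "b", "c"),
                  x = c("f,o", "f,o,o", "b,a,a,r", "b,a,r", "b,a"),
                  stringsAsFactors = FALSE)

# Split the strings and build one row per (id, value) pair
parts <- strsplit(dat$x, ",", fixed = TRUE)
long <- unique(data.frame(id = rep(dat$id, lengths(parts)),
                          x = unlist(parts),
                          stringsAsFactors = FALSE))

# Sort by id, then by value, and reset the row names
long <- long[order(long$id, long$x), ]
rownames(long) <- NULL
long
```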
We could use separate_rows
library(tidyverse)
dat %>%
separate_rows(x) %>%
distinct()
# id x
#1 a f
#2 a o
#3 b b
#4 b a
#5 b r
#6 c b
#7 c a
Another solution uses splitstackshape::cSplit to split the x column into multiple columns. Then gather and filter produce the desired output.
library(tidyverse)
library(splitstackshape)
dat %>% cSplit("x", sep=",") %>%
mutate_if(is.factor, as.character) %>%
gather(key, value, -id) %>%
filter(!is.na(value)) %>%
select(-key) %>% unique()
# id value
# 1 a f
# 3 b b
# 5 c b
# 6 a o
# 8 b a
# 10 c a
# 13 b r
Base solution:
temp <- do.call(rbind, apply( dat, 1,
function(z){ data.frame(
id=z[1],
x = scan(text=z['x'], what="",sep=","),
stringsAsFactors=FALSE)} ) )
Read 2 items
Read 3 items
Read 4 items
Read 3 items
Read 2 items
Warning messages:
1: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
2: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
3: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
4: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
5: In data.frame(id = z[1], x = scan(text = z["x"], what = "", sep = ",")) :
row names were found from a short variable and have been discarded
temp[!duplicated(temp),]
#------
id x
1 a f
2 a o
6 b b
7 b a
9 b r
13 c b
14 c a
To get rid of all the messages and warnings:
temp <- do.call(rbind, apply( dat, 1,
function(z){ suppressWarnings(data.frame(id=z[1],
x = scan(text=z['x'], what="",sep=",", quiet=TRUE), stringsAsFactors=FALSE)
)} ) )
temp[!duplicated(temp),]

parent child structure in R dataframe

I have a CSV that contains an org structure as follows, plus some additional columns. I use R to create charts and it works great!
The challenge is creating the charts for a subset: one manager and its children/grandchildren.
Is there any filtering that is possible in dplyr or any alternative package?
Sample format:
emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3
The subset I need is for manager "C"
Scenario 1: emp_id == C should contain all nodes 'D', 'D1', 'D2', 'E', 'E1', 'F'
expected structure:
manager,all_children
C D
C D1
C D2
C E
C E1
C F
Scenario 2: emp_id == C should contain all the above nodes but retain the mgr_id structure for 'D', 'E'
expected structure:
manager,all_children
C D
C E
C F
D D1
D D2
E E1
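Consider base R's by(), which builds a list of data frames, one for every level of mgr_id (not just C). Note that the code below only goes two levels deep (children and grandchildren), which is enough for the sample data: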
SCENARIO 1
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("mgr_id_", "emp_id")]
names(child) <- sub("_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- sub("_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 C D1
# 5 C D2
# 6 C E1
SCENARIO 2 (where the columns selected for grandchild change and the first column is then renamed)
dfList <- by(df, df$mgr_id, function(i){
names(i) <- paste0(names(i), "_") # SUFFIX UNDERSCORE (TO AVOID DUP COLUMNS)
child <- merge(i, df, by.x="mgr_id_", by.y="emp_id")[,1:2]
grandchild <- merge(child, df, by.x="emp_id_", by.y="mgr_id")[c("emp_id_", "emp_id")]
names(child) <- sub("_$", "", names(child)) # REMOVE LAST UNDERSCORE
names(grandchild) <- sub("_$", "", names(grandchild)) # REMOVE LAST UNDERSCORE
names(grandchild)[1] <- "mgr_id"
rbind(child, grandchild)
})
dfList$C
# mgr_id emp_id
# 1 C D
# 2 C E
# 3 C F
# 4 D D1
# 5 D D2
# 6 E E1
Here is one solution using functions from dplyr and data.table. dt3 is the output for scenario 1, while dt4 is the output for scenario 2.
# Load packages
library(dplyr)
library(data.table)
# Create example data frame
dt <- read.table(text = "emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3",
header = TRUE, stringsAsFactors = FALSE)
# Process the data
dt2 <- dt %>%
# Keep only rows with nest_id greater than 1
filter(nest_id > 1) %>%
mutate(group_id = ifelse(nest_id > 2, 0, 1)) %>%
# Create "run_id", which will be used to fill manager label
mutate(run_id = rleid(group_id)) %>%
mutate(run_id = ifelse(run_id %% 2 == 0, run_id - 1, run_id)) %>%
group_by(run_id) %>%
mutate(manager = first(emp_id)) %>%
# Select for manager C
filter(manager %in% "C") %>%
ungroup() %>%
# Remove rows if manager == emp_id
filter(manager != emp_id) %>%
rename(all_children = emp_id)
# Scenario 1
dt3 <- dt2 %>% select(manager, all_children)
# Scenario 2
dt4 <- dt2 %>%
select(manager = mgr_id, all_children) %>%
arrange(manager, all_children)
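Both answers above depend on the shape of the sample data (two levels below the manager, or a particular row order). As a more general illustration, here is a base R sketch that walks the hierarchy level by level to any depth. The function name descendants and the variable names are assumptions, and it expects columns emp_id and mgr_id as in the sample.

```r
org <- read.table(text = "emp_id mgr_id nest_id
A A 0
B A 1
C B 2
D C 3
D1 D 4
D2 D 4
E C 3
E1 E 4
F C 3
G B 2
H G 3", header = TRUE, stringsAsFactors = FALSE)

# Hypothetical helper: all (manager, employee) pairs below a given root
descendants <- function(edges, root) {
  found <- edges[0, c("mgr_id", "emp_id")]  # empty result to start with
  frontier <- root
  while (length(frontier) > 0) {
    # Direct reports of the current level, ignoring self-managed rows
    hits <- edges[edges$mgr_id %in% frontier & edges$emp_id != edges$mgr_id,
                  c("mgr_id", "emp_id")]
    # Skip anyone already collected, which also guards against cycles
    hits <- hits[!hits$emp_id %in% found$emp_id, ]
    found <- rbind(found, hits)
    frontier <- hits$emp_id
  }
  rownames(found) <- NULL
  found
}

# Scenario 2: keep the direct manager of every descendant
scenario2 <- descendants(org, "C")

# Scenario 1: attribute every descendant to the root manager
scenario1 <- data.frame(manager = "C", all_children = scenario2$emp_id,
                        stringsAsFactors = FALSE)
scenario2
scenario1
```

The loop visits one level of the tree per iteration, so the rows come out ordered by depth rather than alphabetically; sort the result if a specific order is needed.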

Resources