I have the result of a select_multiple question stored in a list. It comes from a dataset collected with OpenDataKit:
example <- list("a", c("b", "c"), c("d", "e", "f"), c(""))
In the example above, record #4 has no answers (meaning NA for all options).
I need to create a data frame from this list in which each option of the select_multiple question becomes a new variable. The elements of the list are not all of the same length.
The result should look like this:
variable | a    b    c    d    e    f
row1     | 1    0    0    0    0    0
row2     | 0    1    1    0    0    0
row3     | 0    0    0    1    1    1
row4     | <NA> <NA> <NA> <NA> <NA> <NA>
I have found options with stri_list2matrix but that does not provide the expected results.
I tried as well
df <-data.frame( lNames <- rep(names(example), lapply(example, length)),
lVal <- unlist(example))
and got the same
arguments imply differing number of rows
Please help!
Thanks
You could use setNames, stack and dcast for that:
example <- list("a", c("b", "c"), c("d", "e", "f"), c(""))
example <- setNames(example, seq_along(example))
ex2 <- stack(example)
ex2[ex2$values=='','values'] <- NA
library(reshape2)
dcast(ex2, ind ~ values, fun.aggregate = length)
This will result in:
ind a b c d e f NA
1 1 1 0 0 0 0 0 0
2 2 0 1 1 0 0 0 0
3 3 0 0 0 1 1 1 0
4 4 0 0 0 0 0 0 1
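The dcast result above counts the empty record in a separate NA column rather than filling row 4 with NA as the question asked. A minimal base R sketch that produces the requested layout is shown below; the names opts, res and empty are illustrative only, and it assumes that an empty string always means "no answer".

```r
example <- list("a", c("b", "c"), c("d", "e", "f"), c(""))

# All distinct non-empty answers become the columns.
opts <- sort(unique(unlist(example)))
opts <- opts[nzchar(opts)]

# One row per record: 1 if the option was selected, 0 otherwise.
res <- t(vapply(example,
                function(x) as.integer(opts %in% x),
                integer(length(opts))))
colnames(res) <- opts

# Records with no answer at all (only "") become NA across every option.
empty <- vapply(example, function(x) !any(nzchar(x)), logical(1))
res[empty, ] <- NA

res <- as.data.frame(res)
res
```

Note that the t(vapply(...)) step relies on there being more than one option; with a single option the result would need to be reshaped differently.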
I have a data frame of the following way
dat <- data.frame(A=c("D", "A", "D", "B"), B=c("B", "B", "D", "R"), C=c("A", "D", "C", ""), D=c("D", "C", "A", "A"))
My idea is to create a matrix with this information, based on the number of occasions that each column variable refers to the other columns (and ignore when referring to other things that are not in one of the columns (e.g. "R")). So I want to fill the following matrix:
n <- ncol(dat)
names_d <- colnames(dat)
mat <- matrix(0, nrow=n, ncol=n)
rownames(mat) <- names_d
colnames(mat) <- names_d
So in the end, I would have something like this:
A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
Which would be the most efficient way of doing this in R?
You can try the code below
> t(sapply(dat, function(x) table(factor(x, levels = names(dat)))))
A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
or
> t(xtabs(~., subset(stack(dat), values %in% names(dat))))
values
ind A B C D
A 1 1 0 2
B 0 2 0 1
C 1 0 1 1
D 2 0 1 1
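Another option is stack with table. Here the rows are the values and the columns are the source columns, so wrap the result in t() to match the layout above: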
table(subset(stack(dat), nzchar(values) & values != 'R'))
My variable is as follows
variable
D
D
B
C
B
D
C
C
D
I want to turn the column above into the table below, with one 0/1 indicator column per value:
variable B C D
D        0 0 1
D        0 0 1
B        1 0 0
C        0 1 0
B        1 0 0
D        0 0 1
C        0 1 0
C        0 1 0
D        0 0 1
But I don't want code like the one below, because the variable column has too many distinct values to write out by hand.
data = data %>% mutate(B=ifelse(variable=="B", 1,0),
C=ifelse(variable=="C", 1,0),
D=ifelse(variable=="D", 1,0))
Here is a base R approach. We can first find all unique variable values from the data frame. Then, sapply over that vector and generate a new column for each value. Finally, we can cbind the resulting matrix of 0/1 valued columns to the original data frame.
cols <- sort(unique(df$variable))
df2 <- sapply(cols, function(x) ifelse(df$variable == x, 1, 0))
df <- cbind(df, df2)
df
variable B C D
1 D 0 0 1
2 D 0 0 1
3 B 1 0 0
4 C 0 1 0
5 B 1 0 0
6 D 0 0 1
7 C 0 1 0
8 C 0 1 0
9 D 0 0 1
Data:
df <- data.frame(variable=c("D", "D", "B", "C", "B",
"D", "C", "C", "D"),
stringsAsFactors=FALSE)
Try this with reshaping and duplicating the original variable in order to have a reference for values. Then, you can reshape to obtain the expected output:
library(dplyr)
library(tidyr)
#Code
new <- df %>% mutate(Var=variable,Val=1,id=row_number()) %>%
pivot_wider(names_from = Var,values_from=Val,values_fill = 0) %>%
select(-id)
Output:
# A tibble: 9 x 4
variable D B C
<chr> <dbl> <dbl> <dbl>
1 D 1 0 0
2 D 1 0 0
3 B 0 1 0
4 C 0 0 1
5 B 0 1 0
6 D 1 0 0
7 C 0 0 1
8 C 0 0 1
9 D 1 0 0
Some data used:
#Data
df <- structure(list(variable = c("D", "D", "B", "C", "B", "D", "C",
"C", "D")), class = "data.frame", row.names = c(NA, -9L))
1) model.matrix
model.matrix will generate column names like variableB so the last line removes the variable part to ensure that the column names are exactly the same as in the question. Omit the last line if it is not important that the column names be exactly as shown there.
dat2 <- cbind(dat, model.matrix(~ variable - 1, dat))
names(dat2) <- sub("variable(.)", "\\1", names(dat2))
giving:
> dat2
variable B C D
1 D 0 0 1
2 D 0 0 1
3 B 1 0 0
4 C 0 1 0
5 B 1 0 0
6 D 0 0 1
7 C 0 1 0
8 C 0 1 0
9 D 0 0 1
2) outer
This can also be done using outer as shown. Each component of variable is compared to each level. We name the levels so that outer uses them as column names. The output is the same.
levs <- sort(unique(dat$variable))
names(levs) <- levs
cbind(dat, +outer(dat$variable, levs, `==`))
Note
The input in reproducible form:
Lines <- "
variable
D
D
B
C
B
D
C
C
D"
dat <- read.table(text = Lines, header = TRUE)
I would need some help with adding row names from one dataframe to another.
For the sake of simplicity, say I have two dataframes (df1 and df2) with different dimensions (df1 is 3x3 and df2 is 5x5). In reality my dataframes are a lot bigger (i.e. thousands of rows / columns)
df1 <- data.frame("rownames" = c("A", "B", "C"), "a1" = c(0,1,2), "a2" = c(2,0,1), "a3" = c(0,0,1), row.names = "rownames")
df2 <- data.frame("rownames" = c("A", "B", "D", "E", "F"), "a1" = c(1,1,2,2,0), a2 = c(2,0,0,1,0), a3 = c(1,0,2,3,0), a4 = c(1,1,0,0,1), a5 = c(0,0,0,0,0), row.names = "rownames")
What I'd like to do is to append the rows of df1 to include the row names "D", "E", and "F" that are in df2 but not in df1, in such a way that the column ("a1", "a2", "a3") values would be set to zeros.
So the input would be the two dataframes:
df1
a1 a2 a3
A 0 2 0
B 1 0 0
C 2 1 1
df2
a1 a2 a3 a4 a5
A 1 2 1 1 0
B 1 0 0 1 0
D 2 0 2 0 0
E 2 1 3 0 0
F 0 0 0 1 0
and the desired output would be:
a1 a2 a3
A 0 2 0
B 1 0 0
C 2 1 1
D 0 0 0
E 0 0 0
F 0 0 0
Thank you!
If you know that df1 is the smaller data frame and df2 the bigger one, you can do:
df1[setdiff(rownames(df2), rownames(df1)), ] <- 0
df1
# a1 a2 a3
#A 0 2 0
#B 1 0 0
#C 2 1 1
#D 0 0 0
#E 0 0 0
#F 0 0 0
If you have to determine programmatically which data frame is bigger and which is smaller, you can test it with an if condition:
if(nrow(df1) > nrow(df2)) {
small_df <- df2
big_df <- df1
} else {
small_df <- df1
big_df <- df2
}
small_df[setdiff(rownames(big_df), rownames(small_df)), ] <- 0
We can use %in% with negate (!)
df1[row.names(df2)[!row.names(df2) %in% row.names(df1)], ] <- 0
df1
# a1 a2 a3
#A 0 2 0
#B 1 0 0
#C 2 1 1
#D 0 0 0
#E 0 0 0
#F 0 0 0
I have an edge list like this
a 1
b 2
c 3
a 2
b 1
and I want to build its incidence matrix, which would look like this:
a b c d
1 1 1 0 0
2 1 1 0 0
3 0 0 1 0
4 0 0 0 0
Any idea how to do it?
Use factors and add one more level to each:
df=read.table(text='A B
a 1
b 2
c 3
a 2
b 1',header=T)
df$A=as.factor(df$A)  # read.table returns character columns by default in R >= 4.0
levels(df$A)=c(levels(df$A),'d')
df$B=as.factor(df$B)
levels(df$B)=c(levels(df$B),'4')
subset=table(df$B,df$A)
> subset
a b c d
1 1 1 0 0
2 1 1 0 0
3 0 0 1 0
4 0 0 0 0
df<-cbind(df,1)
library(magrittr)  # provides the %>% pipe used below
require(qdapTools)
incidence<-df[rep(seq_len(nrow(df)), df[,'1']), c('A', 'B')] %>%
{split(.[,'B'], .[,'A'])} %>%
mtabulate()
You can do it with xtabs from the base stats package:
a<- data.frame(var1=c("a", "b", "c", "a", "b"), var2=c(1, 2, 3, 2, 1), stringsAsFactors = F)
b<- as.matrix(xtabs(~var1 +var2, a, sparse = F))
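The xtabs call above only tabulates nodes that actually occur in the edge list, so "d" and 4 from the requested output are missing. One possible way to include them, sketched here under the assumption that the full node sets are known in advance, is to convert both columns to factors with explicit levels first (xtabs keeps unused levels by default). The name inc is illustrative.

```r
a <- data.frame(var1 = c("a", "b", "c", "a", "b"),
                var2 = c(1, 2, 3, 2, 1),
                stringsAsFactors = FALSE)

# Declare every node up front, including ones with no edges ("d" and 4).
a$var1 <- factor(a$var1, levels = c("a", "b", "c", "d"))
a$var2 <- factor(a$var2, levels = 1:4)

# Rows are var2 (1..4) and columns are var1 (a..d), as in the question.
inc <- xtabs(~ var2 + var1, data = a)
inc
```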
If you have also a column weight that you want to use to populate the incidence matrix, I recommend the package EconGeo. The function get.matrix() does the trick wonderfully.
I have a data frame like this, called df:
a b c d e f
b c f a a a
d f a b c c
f e d f f d
The first row is actually the column names. To explain the meaning with an example: df[1,1] is b, which means there is a relation from a to b. In other words, each value in a column means there is a relation from that column's name to the value.
I want to create a 6x6 matrix (df1) whose row and column names are both the column names of df. The (i,j) entry is 1 if there is a relation from 'i' to 'j', and 0 otherwise.
The output I want is:
a b c d e f
a 0 1 0 1 0 1
b 0 0 1 0 1 1
c 1 0 0 1 0 1
d 1 1 0 0 0 1
e 1 0 1 0 0 1
f 1 0 1 1 0 0
How to do this with a loop in R?
How to do this without a loop, and only use basic R?
How to do this using some fancy packages in R?
Using the reshape2 package, this is one way to go. My sample data has all columns as character. You use melt() to reshape your data in a long format. Then, you use dcast() from the same package.
library(magrittr)
library(reshape2)
melt(mydf, measure.vars = names(mydf)) %>%
dcast(variable ~ value, length)
variable a b c d e f
1 a 0 1 0 1 0 1
2 b 0 0 1 0 1 1
3 c 1 0 0 1 0 1
4 d 1 1 0 0 0 1
5 e 1 0 1 0 0 1
6 f 1 0 1 1 0 0
EDIT
As mentioned below by akrun, you can do all work using recast() in the reshape2 package.
recast(mydf, measure.var= names(mydf),variable~value, length)
DATA
mydf <- structure(list(a = c("b", "d", "f"), b = c("c", "f", "e"), c = c("f",
"a", "d"), d = c("a", "b", "f"), e = c("a", "c", "f"), f = c("a",
"c", "d")), .Names = c("a", "b", "c", "d", "e", "f"), class = "data.frame", row.names = c(NA,
-3L))
Just use table:
table(colnames(mydf)[col(mydf)], unlist(mydf) )
# a b c d e f
# a 0 1 0 1 0 1
# b 0 0 1 0 1 1
# c 1 0 0 1 0 1
# d 1 1 0 0 0 1
# e 1 0 1 0 0 1
# f 1 0 1 1 0 0
If you have multiple matches, then:
pmin(table(colnames(mydf)[col(mydf)], unlist(mydf) ), 1)
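To illustrate what "multiple matches" means here, the sketch below uses a small made-up data frame (dup is hypothetical, not the question's data) in which column a points to "b" twice. The raw table counts the relation twice, and pmin caps each cell at 1.

```r
# Hypothetical data in which column a points to "b" twice.
dup <- data.frame(a = c("b", "b", "c"),
                  b = c("a", "c", "c"),
                  c = c("a", "b", "b"),
                  stringsAsFactors = FALSE)

# Rows are the source columns, columns are the values they point to.
counts <- table(colnames(dup)[col(dup)], unlist(dup))
counts["a", "b"]   # raw count of a -> b relations

# Cap every cell at 1 to turn the counts into a 0/1 matrix.
adj <- pmin(counts, 1)
adj["a", "b"]
```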
You can do this with reshaping.
library(dplyr)
library(tidyr)
data %>%
gather(from, to) %>%
distinct %>%
mutate(value = 1) %>%
spread(to, value, fill = 0)
The other solution using dplyr is really neat and smart. I recommend using that solution.
Here is an alternative solution to your problem using most basic functions in R.
Say your data frame has n columns and m rows i.e. n <- ncol(df) and m <- nrow(df).
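n <- ncol(df)
m <- nrow(df)
output_matrix <- matrix(0, nrow = n, ncol = n)
for (i in seq_len(n)) {
  for (j in seq_len(m)) {
    # utf8ToInt("a") is 97, so subtracting 96 maps "a" to 1, "b" to 2, ...
    # (this assumes the node names are the single letters a, b, c, ...)
    target <- utf8ToInt(as.character(df[j, i])) - 96
    # row = source (column i of df), column = target, as in the desired output
    output_matrix[i, target] <- 1
  }
}
rownames(output_matrix) <- letters[seq_len(n)]
colnames(output_matrix) <- letters[seq_len(n)]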
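For the "without a loop, base R only" part of the question, a possible sketch uses match() and matrix indexing instead of the utf8ToInt trick, so it does not depend on the nodes being named a, b, c, and so on. The names nodes, from, to, keep and adj are illustrative; the data is the mydf sample given earlier.

```r
mydf <- data.frame(a = c("b", "d", "f"), b = c("c", "f", "e"),
                   c = c("f", "a", "d"), d = c("a", "b", "f"),
                   e = c("a", "c", "f"), f = c("a", "c", "d"),
                   stringsAsFactors = FALSE)

nodes <- names(mydf)
adj <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))

# Source index = position of the column; target index = position of the cell value.
from <- rep(seq_along(nodes), each = nrow(mydf))
to   <- match(unlist(mydf), nodes)

# Skip values that are not column names, then set all (from, to) cells at once.
keep <- !is.na(to)
adj[cbind(from[keep], to[keep])] <- 1
adj
```

Indexing a matrix with a two-column matrix addresses individual (row, column) cells, which is what replaces the nested loop.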