How to count different answers in R - r

I am analyzing a questionnaire and I've written the code below to count how many answers there are to each question. The questions are in columns and each answer is coded as a number, where 1 = a, 2 = b, and so on.
The main objective is to count how many times each answer was chosen, ignoring response patterns, in order to summarize the information.
DS is the data frame, containing questions Q_092 to Q_096. I have the code to change column names, but it expects a fixed number of columns.
Is there a prettier way to do it?
conta_respostas <- function (arr_resp) {
arr_resp[(is.na(arr_resp))]<-99
arr_result = c(
sum(arr_resp[(arr_resp=="1")])/1,
sum(arr_resp[(arr_resp=="2")])/2,
sum(arr_resp[(arr_resp=="3")])/3,
sum(arr_resp[(arr_resp=="4")])/4,
sum(arr_resp[(arr_resp=="5")])/5,
sum(arr_resp[(arr_resp=="6")])/6,
sum(arr_resp[(arr_resp=="7")])/7,
sum(arr_resp[(arr_resp=="8")])/8,
sum(arr_resp[(arr_resp=="9")])/9,
sum(arr_resp[(arr_resp=="10")])/10,
sum(arr_resp[(arr_resp=="99")])/99
)
}
adply(DS, 2, conta_respostas)
X1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Q_092 431 1987 5053 1388 0 0 0 0 0 0 36
2 Q_093 281 1489 5728 1336 0 0 0 0 0 0 61
3 Q_094 594 3380 4365 519 0 0 0 0 0 0 37
4 Q_095 89 216 5042 3511 0 0 0 0 0 0 37
5 Q_096 213 1764 5384 1511 0 0 0 0 0 0 23

Here is what it sounds like your data looks like:
DS <- data.frame(
'Q_092' = c(1, 3, 4, 5, 2, 99, 10),
'Q_093' = c(2, 5, 6, 2, 99, 1, 1),
'Q_094' = c(3, 5, 6, 2, 4, 7, 8),
'Q_095' = c(10, 5, 5, 6, 7, 8, 6),
'Q_096' = c(1, 3, 4, 5, 2, 99, 10)
)
DS
Q_092 Q_093 Q_094 Q_095 Q_096
1 1 2 3 10 1
2 3 5 5 5 3
3 4 6 6 5 4
4 5 2 2 6 5
5 2 99 4 7 2
6 99 1 7 8 99
7 10 1 8 6 10
Recreating your code:
library(plyr)
conta_respostas <- function (arr_resp) {
arr_resp[(is.na(arr_resp))]<-99
arr_result = c(
sum(arr_resp[(arr_resp=="1")])/1,
sum(arr_resp[(arr_resp=="2")])/2,
sum(arr_resp[(arr_resp=="3")])/3,
sum(arr_resp[(arr_resp=="4")])/4,
sum(arr_resp[(arr_resp=="5")])/5,
sum(arr_resp[(arr_resp=="6")])/6,
sum(arr_resp[(arr_resp=="7")])/7,
sum(arr_resp[(arr_resp=="8")])/8,
sum(arr_resp[(arr_resp=="9")])/9,
sum(arr_resp[(arr_resp=="10")])/10,
sum(arr_resp[(arr_resp=="99")])/99
)
}
adply(DS, 2, conta_respostas)
X1 V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1 Q_092 1 1 1 1 1 0 0 0 0 1 1
2 Q_093 2 2 0 0 1 1 0 0 0 0 1
3 Q_094 0 1 1 1 1 1 1 1 0 0 0
4 Q_095 0 0 0 0 2 2 1 1 0 1 0
5 Q_096 1 1 1 1 1 0 0 0 0 1 1
Without having to write that function, you can do something like this:
t(apply(DS, 2, function(x) table(factor(x, levels = c('1', '2', '3', '4', '5',
'6', '7', '8', '9', '10',
'99')))))
This will do the following:
Transform your data into factors with the levels given in levels =. Storing the data as a factor ensures that levels no respondent chose are kept (with a count of 0) rather than left out.
Create a table for each variable, with a cell for each of the factor levels.
Apply this function over the five variable columns that you are interested in.
Finally, transpose the output from the apply() function to match the output of your original code:
1 2 3 4 5 6 7 8 9 10 99
Q_092 1 1 1 1 1 0 0 0 0 1 1
Q_093 2 2 0 0 1 1 0 0 0 0 1
Q_094 0 1 1 1 1 1 1 1 0 0 0
Q_095 0 0 0 0 2 2 1 1 0 1 0
Q_096 1 1 1 1 1 0 0 0 0 1 1
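One difference from your original function is worth noting: table() drops NA by default, so missing answers are not counted under 99 unless they are recoded first. A minimal sketch of that variant, assuming missing answers are stored as NA (the toy DS and the helper name count_answers are made up for illustration):

```r
# Toy data (hypothetical); NA stands for a missing answer
DS <- data.frame(
  Q_092 = c(1, 3, 4, NA, 2),
  Q_093 = c(2, 2, NA, NA, 1)
)

# All answer codes that should appear as columns, chosen or not
answer_levels <- c(1:10, 99)

# Hypothetical helper: recode NA to 99, then count every level
count_answers <- function(x) {
  x[is.na(x)] <- 99
  table(factor(x, levels = answer_levels))
}

# sapply() returns one column per question, so transpose to get one row each
counts <- t(sapply(DS, count_answers))
counts
```

Because the levels are fixed up front, every question yields a row of the same length, whatever answers actually occur in it.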

One option would be to use the apply() function with FUN=table. The only issue here is that your tables may be of different lengths, so the results cannot always be combined row by row.
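A small sketch of that length problem and one possible way around it, assuming made-up data. Here the columns are stacked into long form and cross-tabulated in a single table() call, so all questions share the same set of answer columns:

```r
# Toy data (hypothetical): the two questions use different sets of answers
DS <- data.frame(Q_092 = c(1, 3, 3, 2), Q_093 = c(2, 2, 2, 2))

# One table per column: the tables have different lengths
ragged <- lapply(DS, table)
lengths(ragged)

# Stack into long form (columns 'values' and 'ind'), then cross-tabulate
long <- stack(DS)
counts <- table(long$ind, long$values)
counts
```

Note that this only produces columns for answers that occur somewhere in the data; answers nobody chose in any question do not appear.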

Related

How can I count values in one dataframe and transfer the results to a second dataframe under the corresponding column in R?

I'm trying to extract and organize the values from the first data frame into the second. In the first you have cbn which is a factor that lists combinations of variables 1 to 31 (example dataframe shows a portion of all my data). For each of these combinations A, B, and C have values 1 or 2.
cbn A B C
1 1, 2, 3, 4 1 2 1
2 1, 2, 3, 5 1 1 1
3 1, 2, 3, 7 1 1 1
4 1, 2, 3, 8 1 2 1
5 1, 2, 3, 9 1 1 1
6 1, 2, 3, 10 1 1 1
7 1, 2, 3, 12 1 2 1
8 1, 2, 3, 13 1 2 1
9 1, 2, 3, 17 1 2 1
10 1, 2, 3, 18 1 2 1
11 1, 2, 3, 20 2 2 2
12 1, 2, 3, 22 1 2 1
13 1, 2, 3, 23 1 2 1
14 1, 2, 3, 25 1 2 1
15 1, 2, 3, 26 1 2 1
16 1, 2, 3, 28 1 2 1
17 1, 2, 3, 29 1 2 1
18 1, 2, 3, 30 1 2 1
19 1, 2, 3, 31 1 2 1
I'm trying to get all that data into a new data frame. The rows become the 31 variables, and the columns are split into values 1 and 2 for each of A, B, and C. For every row in df1, the variables used in the combination are separated, and 1 is added to the corresponding row in df2 under the column with the letter and value indicated in df1. Thus, the first line in df1 has variables 1, 2, 3, and 4, and A is 1, so in df2, 1 is added under the A1 column for each of those variable rows. I have added the first two rows of df1 to df2.
Variable A1 A2 B1 B2 C1 C2
1 1 2 0 1 1 2 0
2 2 2 0 1 1 2 0
3 3 2 0 1 1 2 0
4 4 1 0 0 1 1 0
5 5 1 0 1 0 1 0
6 6 0 0 0 0 0 0
7 7 0 0 0 0 0 0
8 8 0 0 0 0 0 0
9 9 0 0 0 0 0 0
10 10 0 0 0 0 0 0
11 11 0 0 0 0 0 0
12 12 0 0 0 0 0 0
13 13 0 0 0 0 0 0
14 14 0 0 0 0 0 0
15 15 0 0 0 0 0 0
16 16 0 0 0 0 0 0
... ... ... ... ... ... ... ...
31 31 0 0 0 0 0 0
How can I transfer this data into df2?
Using your first two rows of data:
df1 <- data.frame(cbn = c("1, 2, 3, 4", "1, 2, 3, 5" ),
A = c(1,1),
B = c(2,1),
C = c(1,1))
First add the letters to the entries of the column:
for(x in c("A","B","C")){
df1[,x] <- paste0(x, df1[,x])
}
Then use separate to split the cbn column into multiple columns, followed by gather, summarize and spread:
library(tidyverse)
df2 <- df1 %>%
separate(cbn , paste("V",1:4), sep = ",") %>%
gather("dummy", "Variable", starts_with("V")) %>%
mutate(Variable = as.numeric(Variable))%>%
select(-dummy) %>%
gather("dummy", "value", -Variable) %>%
select(-dummy) %>%
mutate(value = factor(value, levels = c("A1","A2","B1","B2","C1","C2"))) %>%
group_by(Variable, value) %>%
summarize(n = n()) %>%
spread("value", "n", fill = 0, drop = F) %>%
as.data.frame()
results in:
> df2
Variable A1 A2 B1 B2 C1 C2
1 1 2 0 1 1 2 0
2 2 2 0 1 1 2 0
3 3 2 0 1 1 2 0
4 4 1 0 0 1 1 0
5 5 1 0 1 0 1 0
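The result above only contains the variables that occur in the data. If all 31 rows are wanted, a base R alternative is to build the long form by hand and cross-tabulate it with fixed factor levels. This is a sketch under the assumption that cbn always holds comma-separated integers between 1 and 31; the names vars, n_vars, long and counts are made up for illustration:

```r
df1 <- data.frame(cbn = c("1, 2, 3, 4", "1, 2, 3, 5"),
                  A = c(1, 1), B = c(2, 1), C = c(1, 1))

# Split each combination into its integer variables
vars <- lapply(strsplit(as.character(df1$cbn), ",\\s*"), as.integer)
n_vars <- lengths(vars)

# Long form: one row per (variable, letter-value) pair
long <- data.frame(
  Variable = rep(unlist(vars), times = 3),
  value = c(paste0("A", rep(df1$A, n_vars)),
            paste0("B", rep(df1$B, n_vars)),
            paste0("C", rep(df1$C, n_vars)))
)

# Fixed levels keep unused variables and unused letter-value pairs as zeros
long$Variable <- factor(long$Variable, levels = 1:31)
long$value <- factor(long$value,
                     levels = c("A1", "A2", "B1", "B2", "C1", "C2"))

counts <- table(long$Variable, long$value)
head(counts)
```

If a data frame is needed rather than a table, as.data.frame.matrix(counts) converts it while keeping the wide layout.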

Transform variable length list into matrix in R

If I had a list of vectors of variable lengths:
[[1]]
[1] 1 2 3 4
[[2]]
[1] 4 5 6
[[3]]
[1] 1 2 3 4 5 6 7 8 9
[[4]]
[1] 'a' 'b' 'c'
How could I transform this into a data frame / logical matrix with elements of the list represented as columns?
i.e. a data frame like:
1 2 3 4 5 6 7 8 9 'a' 'b' 'c'
[1] 1 1 1 1 0 0 0 0 0 0 0 0
[2] 0 0 0 1 1 1 0 0 0 0 0 0
[3] 1 1 1 1 1 1 1 1 1 0 0 0
[4] 0 0 0 0 0 0 0 0 0 1 1 1
Some data:
x <- list(c(1, 2, 3, 4), c(4, 5, 6), c(1, 2, 3, 4, 5, 6, 7, 8, 9), c("a", "b", "c"))
Here is a base R option:
# extract unique values from x
uv <- unique(unlist(x))
# Check for each list element which values are present and bind everything together
out <- do.call(rbind, lapply(x, function(e) as.integer(uv %in% e) ))
# Convert from matrix to data.frame and add column names
out <- setNames(as.data.frame(out), uv)
out
1 2 3 4 5 6 7 8 9 a b c
1 1 1 1 1 0 0 0 0 0 0 0 0
2 0 0 0 1 1 1 0 0 0 0 0 0
3 1 1 1 1 1 1 1 1 1 0 0 0
4 0 0 0 0 0 0 0 0 0 1 1 1
Here is a base R option with stack and table
table(stack(setNames(x, seq_along(x)))[2:1])
# values
#ind 1 2 3 4 5 6 7 8 9 a b c
# 1 1 1 1 1 0 0 0 0 0 0 0 0
# 2 0 0 0 1 1 1 0 0 0 0 0 0
# 3 1 1 1 1 1 1 1 1 1 0 0 0
# 4 0 0 0 0 0 0 0 0 0 1 1 1
Something like this?
library(tidyverse)
x = list(c(1, 2, 3, 4), c(4, 5, 6), c(1, 2, 3, 4, 5, 6, 7, 8, 9))
y = tibble(column1= map_chr(x, str_flatten, " "))
Where y is this:
# A tibble: 3 x 1
column1
<chr>
1 1 2 3 4
2 4 5 6
3 1 2 3 4 5 6 7 8 9

Rename a range of columns in a dataframe

I'm trying to rename a range of columns in a dataframe so that they have the format [V1:V5]:
result_df = data.frame(V1 = 1, V2 = 2, V3 = 3, V4 = 4, V5 = 5, colnamethatshouldntberenamed = 6)
If the existing column names contain the range of numbers somewhere, it's relatively straightforward (although there's probably a way to do it with one line of code, not two):
df1 = data.frame(X1q = 1, X2q = 2, X3q = 3, X4q = 4, X5q = 5, colnamethatshouldntberenamed = 6)
names(df1) <- gsub("X", "V", names(df1))
names(df1) <- gsub("q", "", names(df1))
But what if the column names have completely random names?
df2 = data.frame(name = 1, col = 2, random = 3, alsorandom = 4, somethingelse = 5, colnamethatshouldntberenamed = 6)
Is there a way to rename all of these columns in one go? (Assume that they are adjoining columns in the data frame, but that there may be other columns whose names don't need to be changed.)
If you have a different number of columns and/or you want to use the pipe (%>%), you can use purrr::set_names().
For example:
Sample data with 10 columns:
example1 <- data.frame(replicate(10,sample(0:1,5,rep=TRUE)))
example1
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
1 1 1 0 1 1 1 0 1 0 1
2 0 0 1 0 1 1 1 0 0 1
3 0 0 1 0 0 1 1 1 0 0
4 0 1 0 1 1 0 1 0 0 0
5 1 0 0 1 1 0 1 1 0 1
You can use seq_along inside set_names which will rename the columns by order (with piping):
library(purrr)  # provides set_names() and the %>% pipe
example1 %>%
set_names(c(seq_along(example1)))
Results:
1 2 3 4 5 6 7 8 9 10
1 1 1 0 1 1 1 0 1 0 1
2 0 0 1 0 1 1 1 0 0 1
3 0 0 1 0 0 1 1 1 0 0
4 0 1 0 1 1 0 1 0 0 0
5 1 0 0 1 1 0 1 1 0 1
Same idea with 15 columns and naming them using paste in set_names:
example2 <- data.frame(replicate(15,sample(0:1,10,rep=TRUE)))
example2 %>%
set_names(c(paste("VarNum", seq_along(example2), sep = "")))
Results
VarNum1 VarNum2 VarNum3 VarNum4 VarNum5 VarNum6 VarNum7 VarNum8 VarNum9 VarNum10 VarNum11 VarNum12 VarNum13 VarNum14 VarNum15
1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 1
2 1 1 0 0 0 1 0 1 1 0 0 0 1 0 1
3 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1
4 0 0 0 0 1 1 1 1 0 1 1 0 0 0 1
5 1 1 0 1 0 0 1 0 0 1 1 0 0 0 0
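If only a contiguous range of columns should be renamed while the others keep their names, one base R possibility is to assign into a slice of names(). A minimal sketch using df2 from the question, assuming the columns to rename are the first five (idx is a hypothetical index vector):

```r
df2 <- data.frame(name = 1, col = 2, random = 3, alsorandom = 4,
                  somethingelse = 5, colnamethatshouldntberenamed = 6)

# Positions of the adjoining columns to rename (assumed to be the first five)
idx <- 1:5

# Replace only those names; all other columns are left untouched
names(df2)[idx] <- paste0("V", seq_along(idx))
names(df2)
```

The same pattern works for any range, for example idx <- 3:7, as long as idx refers to existing column positions.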

Creating a factor/categorical variable from 4 dummies

I have a data frame with four columns, let's call them V1-V4 and ten observations. Exactly one of V1-V4 is 1 for each row, and the others of V1-V4 are 0. I want to create a new column called NEWCOL that takes on the value of 3 if V3 is 1, 4 if V4 is 1, and is 0 otherwise.
I have to do this for MANY sets of variables V1-V4 so I would like the solution to be as short as possible so that it will be easy to replicate.
This does it for 4 columns to add a fifth using matrix multiplication:
> cbind( mydf, newcol=data.matrix(mydf) %*% c(0,0,3,4) )
V1 V2 V3 V4 newcol
1 1 0 0 0 0
2 1 0 0 0 0
3 0 1 0 0 0
4 0 1 0 0 0
5 0 0 1 0 3
6 0 0 1 0 3
7 0 0 0 1 4
8 0 0 0 1 4
9 0 0 0 1 4
10 0 0 0 1 4
It's generalizable to getting multiple columns; we just need the rules. You need to make a matrix with the same number of rows as there are columns in the original data, and one column of weights for each new variable you want to build. This shows how to build one new column as 3 times the third column plus 4 times the fourth, and another new column as 1 times the first plus 2 times the second.
> cbind( mydf, newcol=data.matrix(mydf) %*% matrix(c(0,0,3,4, # first set of factors
1,2,0,0), # second set
ncol=2) )
V1 V2 V3 V4 newcol.1 newcol.2
1 1 0 0 0 0 1
2 1 0 0 0 0 1
3 0 1 0 0 0 2
4 0 1 0 0 0 2
5 0 0 1 0 3 0
6 0 0 1 0 3 0
7 0 0 0 1 4 0
8 0 0 0 1 4 0
9 0 0 0 1 4 0
10 0 0 0 1 4 0
An example data set:
mydf <- data.frame(V1 = c(1, 1, rep(0, 8)),
V2 = c(0, 0, 1, 1, rep(0, 6)),
V3 = c(rep(0, 4), 1, 1, rep(0, 4)),
V4 = c(rep(0, 6), rep(1, 4)))
# V1 V2 V3 V4
# 1 1 0 0 0
# 2 1 0 0 0
# 3 0 1 0 0
# 4 0 1 0 0
# 5 0 0 1 0
# 6 0 0 1 0
# 7 0 0 0 1
# 8 0 0 0 1
# 9 0 0 0 1
# 10 0 0 0 1
Here's an easy approach to generate the new column:
mydf <- transform(mydf, NEWCOL = V3 * 3 + V4 * 4)
# V1 V2 V3 V4 NEWCOL
# 1 1 0 0 0 0
# 2 1 0 0 0 0
# 3 0 1 0 0 0
# 4 0 1 0 0 0
# 5 0 0 1 0 3
# 6 0 0 1 0 3
# 7 0 0 0 1 4
# 8 0 0 0 1 4
# 9 0 0 0 1 4
# 10 0 0 0 1 4
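Since this has to be repeated for many sets of dummies, the same calculation could be wrapped in a small helper. This is only a sketch; code_dummies and its arguments are hypothetical names, not from any package, and it assumes the dummy columns contain only 0 and 1:

```r
mydf <- data.frame(V1 = c(1, 0, 0, 0),
                   V2 = c(0, 1, 0, 0),
                   V3 = c(0, 0, 1, 0),
                   V4 = c(0, 0, 0, 1))

# Hypothetical helper: weight each dummy column by its code and sum per row
code_dummies <- function(df, cols, codes) {
  drop(data.matrix(df[cols]) %*% codes)
}

# 3 if V3 is 1, 4 if V4 is 1, and 0 otherwise
mydf$NEWCOL <- code_dummies(mydf, paste0("V", 1:4), c(0, 0, 3, 4))
mydf
```

For another set of dummies, only the column names and the vector of codes need to change.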

Splitting one column into multiple columns

I have a huge dataset in which there is one column including several values for each subject (row). Here is a simplified sample dataframe:
data <- data.frame(subject = c(1:8), sex = c(1, 2, 2, 1, 2, 1, 1, 2),
age = c(35, 29, 31, 46, 64, 57, 49, 58),
v1 = c("2", "0", "3,5", "2 1", "A,4", "B,1,C", "A and B,3", "5, 6 A or C"))
> data
subject sex age v1
1 1 1 35 2
2 2 2 29 0
3 3 2 31 3,5 # separated by a comma
4 4 1 46 2 1 # separated by a blank space
5 5 2 64 A,4
6 6 1 57 B,1,C
7 7 1 49 A and B,3
8 8 2 58 5, 6 A or C
I first want to remove the letters (A, B, A and B, …) in the fourth column (v1), and then split the fourth column into multiple columns just like this:
subject sex age x1 x2 x3 x4 x5 x6
1 1 1 35 0 1 0 0 0 0
2 2 2 29 0 0 0 0 0 0
3 3 2 31 0 0 1 0 1 0
4 4 1 46 1 1 0 0 0 0
5 5 2 64 0 0 0 1 0 0
6 6 1 57 1 0 0 0 0 0
7 7 1 49 0 0 1 0 0 0
8 8 2 58 0 0 0 0 1 1
where the 1st subject takes 1 at x2 because it takes 2 at v1 in the original dataset, the 3rd subject takes 1 at both x3 and x5 because it takes 3 and 5 at v1 in the original dataset, and so on.
I would appreciate any help on this question. Thanks a lot.
You can cbind this result to data[-4] and get what you need:
0+t(sapply(as.character(data$v1), function(line)
sapply(1:6, function(x) x %in% unlist(strsplit(line, split="\\s|\\,"))) ))
#----------------
[,1] [,2] [,3] [,4] [,5] [,6]
2 0 1 0 0 0 0
0 0 0 0 0 0 0
3,5 0 0 1 0 1 0
2 1 1 1 0 0 0 0
A,4 0 0 0 1 0 0
B,1,C 1 0 0 0 0 0
A and B,3 0 0 1 0 0 0
5, 6 A or C 0 0 0 0 1 1
One solution:
r <- lapply(strsplit(as.character(data$v1), "[^0-9]+"), as.numeric)
m <- as.data.frame(t(sapply(r, function(x) {
y <- rep(0, 6)
y[x[!is.na(x)]] <- 1
y
})))
data <- cbind(data[, c("subject", "sex", "age")], m)
# subject sex age V1 V2 V3 V4 V5 V6
# 1 1 1 35 0 1 0 0 0 0
# 2 2 2 29 0 0 0 0 0 0
# 3 3 2 31 0 0 1 0 1 0
# 4 4 1 46 1 1 0 0 0 0
# 5 5 2 64 0 0 0 1 0 0
# 6 6 1 57 1 0 0 0 0 0
# 7 7 1 49 0 0 1 0 0 0
# 8 8 2 58 0 0 0 0 1 1
Following DWin's awesome solution, m could be modified as:
m <- as.data.frame(t(sapply(r, function(x) {
0 + 1:6 %in% x[!is.na(x)]
})))
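Another possible variant extracts the digits directly instead of splitting on non-digits, which avoids the NA values from empty strings. This is a sketch on a shortened, made-up version of the data, assuming the values of interest are the integers 1 to 6; digits and result are hypothetical names:

```r
data <- data.frame(subject = 1:4,
                   v1 = c("2", "0", "3,5", "5, 6 A or C"))

# Pull out every run of digits from each entry of v1
v1 <- as.character(data$v1)
digits <- regmatches(v1, gregexpr("[0-9]+", v1))

# One row per subject, one 0/1 indicator column per value 1..6
m <- t(sapply(digits, function(d) as.integer(1:6 %in% as.integer(d))))
colnames(m) <- paste0("x", 1:6)

result <- cbind(data["subject"], m)
result
```

Letters and words such as "A or C" are ignored because only digit runs are matched.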
