I have a data set in R as follows:
ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1
I need to produce the following output table from it:
Id Variable1-1 Variable1-2 Variable2-1 Variable2-2
1 1 0 0 1
2 0 2 2 0
Note that only those rows are counted where the choice is 1 (choice is a binary variable, however other variables have any integer values). The aim is to have as many columns for a variable as its levels.
Is there a way I can do this in R?
You could use melt and dcast from the reshape2 package:
mydf<-read.table(text="ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1",header=TRUE)
library(reshape2)
First melt the data.frame, selecting only those rows where Choice == 1 and removing the Choice column:
mydfM <- melt(mydf[mydf$Choice %in% 1, -match("Choice", names(mydf))], id = "ID")
# EDIT above: As @TylerRinker points out, using which could be avoided.
# I've replaced it with %in%
# ID variable value
# 1 1 Variable1 1
# 2 2 Variable1 2
# 3 2 Variable1 2
# 4 1 Variable2 2
# 5 2 Variable2 1
# 6 2 Variable2 1
Then cast the melted data.frame, using length as the aggregation function
(mydfC <- dcast(mydfM, ID ~ variable + value, fun.aggregate = length))
# ID Variable1_1 Variable1_2 Variable2_1 Variable2_2
# 1 1 1 0 0 1
# 2 2 0 2 2 0
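As a cross-check that does not depend on reshape2, the same counts can be sketched in base R with table(). This is only an illustration: the names chosen, vars, tabs and check are made up here, and levels are taken from the full data so that a level seen only in unchosen rows still gets a column.

```r
mydf <- read.table(text = "ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1", header = TRUE)

# Keep only the rows where Choice is 1
chosen <- mydf[mydf$Choice == 1, ]
vars <- c("Variable1", "Variable2")

# Build one ID-by-level count table per variable
tabs <- lapply(vars, function(v) {
  lv <- sort(unique(mydf[[v]]))
  tab <- unclass(table(chosen$ID, factor(chosen[[v]], levels = lv)))
  colnames(tab) <- paste(v, lv, sep = "_")
  tab
})

# Bind the per-variable tables side by side and add the ID column
check <- data.frame(ID = as.integer(rownames(tabs[[1]])), do.call(cbind, tabs))
check
```

Like the dcast call, this drops any ID that has no row with Choice == 1.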
It took me a while to figure out what you were after but I got it (I think). I have done what you asked but it's convoluted at best. I think this will help others see what you're after and you'll get better answers now.
dat <- read.table(text="ID Variable1 Variable2 Choice
1 1 2 1
1 2 1 0
2 2 1 1
2 2 1 1", header=TRUE)
A <- split(dat$Choice, list(dat$Variable1, dat$ID))
B <- split(dat$Choice, list(dat$Variable2, dat$ID))
C <- list(A, B)
FUN <- function(x) sapply(x, function(y) sum(y))
FUN2 <- function(x){
len <- length(x)/2
rbind(x[1:len], x[(len+1):length(x)])
}
dat2 <- do.call('data.frame', lapply(lapply(C, FUN), FUN2))
colnames(dat2) <- c('Variable1-1', 'Variable1-2', 'Variable2-1',
'Variable2-2')
dat2
This ain't your grandmother's contingency table, that's for sure. There's probably a much better way to accomplish all of this, maybe with reshape.
I am trying to combine several binary variables into one categorical variable. I have ten categorical variables, each describing tasks of a job.
Data looks something like this:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
# etc.
My goal is to combine them into one variable, where the value 1 (=Yes) of each binary variable will be a separate level of the categorical variable.
To illustrate what I imagine (wrong code obviously):
If Personal_Help = 1 -> Jobcontent = 1
If PR = 1 -> Jobcontent = 2
If Fundraising = 1 -> Jobcontent = 3
etc.
Thank you very much in advance!
Edit:
Thanks for your answers, and apologies for my late reply. I think more context from my side is needed. The goal of combining the binary variables into a categorical variable is to show them in one graphic (using ggplot). The graphic should display how many respondents report the above-mentioned tasks as part of their work.
If you're interested only in the first occurrence of 1 among your variables:
df <- data.frame(t(data.frame(Personal_Help, PR,Fundraising)))
result <- sapply(df, function(x) which(x==1)[1])
X1 X2 X3 X4 X5 X6
1 1 2 1 2 1
Of course, this will depend on what you want to do when multiple values are 1 as asked in comments.
Since there are three different variables, and each variable can take either of 2 values, there are 2^3 = 8 possible unique combinations of the three variables, each of which should have a unique number associated.
One way to do this is to imagine each column as being a digit in a three digit binary number. If we subtract 1 from each column, we get a 1 for "no" and a 0 for "yes". This means that our eight possible unique values, and the binary numbers associated with each would be:
binary decimal
0 0 0 = 0
0 0 1 = 1
0 1 0 = 2
0 1 1 = 3
1 0 0 = 4
1 0 1 = 5
1 1 0 = 6
1 1 1 = 7
This system will work for any number of columns, and can be achieved as follows:
Personal_Help <- c(1,1,2,1,2,1)
PR <- c(2,1,1,2,1,2)
Fundraising <- c(1,2,1,2,2,1)
df <- data.frame(Personal_Help, PR, Fundraising)
New_var <- 0
for(i in seq_along(df)) New_var <- New_var + (2^(i - 1)) * (df[[i]] - 1)
df$New_var <- New_var
The end result would then be:
df
#> Personal_Help PR Fundraising New_var
#> 1 1 2 1 2
#> 2 1 1 2 4
#> 3 2 1 1 1
#> 4 1 2 2 6
#> 5 2 1 2 5
#> 6 1 2 1 2
In your actual data, there will be 1024 possible combinations of tasks, so this will generate numbers for New_var between 0 and 1023. Because of how it is generated, you can actually use this single number to reverse engineer the entire row as long as you know the original column order.
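The reverse step can be sketched as follows. This is an illustration under the same coding (1 = yes, 2 = no) and the same column order as above; the names tasks, code and decoded are made up for the example.

```r
Personal_Help <- c(1, 1, 2, 1, 2, 1)
PR            <- c(2, 1, 1, 2, 1, 2)
Fundraising   <- c(1, 2, 1, 2, 2, 1)
tasks <- data.frame(Personal_Help, PR, Fundraising)

# Encode exactly as in the loop above: column i contributes 2^(i - 1)
code <- 0
for (i in seq_along(tasks)) code <- code + 2^(i - 1) * (tasks[[i]] - 1)

# Decode: binary digit i of the code is (code %/% 2^(i - 1)) %% 2;
# adding 1 restores the original 1 = yes, 2 = no coding
decoded <- as.data.frame(
  lapply(seq_along(tasks), function(i) (code %/% 2^(i - 1)) %% 2 + 1)
)
names(decoded) <- names(tasks)

# TRUE if every original value was recovered
all(decoded == tasks)
```

Note that with this coding a value of 0 means "yes" to every task, and the largest value means "no" to every task.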
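As @ulfelder commented, you need to clarify how you want to handle cases where more than one column is 1.
Assuming you want to use the first column equal to 1, you can use which.min(), applied by row. which.min() returns the position of the first minimum, so a row that contains no 1 at all would also get the value 1: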
data <- data.frame(Personal_Help, PR, Fundraising)
data$Jobcontent <- apply(data, MARGIN = 1, which.min)
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 1
2 1 1 2 1
3 2 1 1 2
4 1 2 2 1
5 2 1 2 2
6 1 2 1 1
If you’d like Jobcontent to include the name of each job, you can index into names(data):
data$Jobcontent <- names(data)[apply(data, MARGIN = 1, which.min)]
Result:
Personal_Help PR Fundraising Jobcontent
1 1 2 1 Personal_Help
2 1 1 2 Personal_Help
3 2 1 1 PR
4 1 2 2 Personal_Help
5 2 1 2 PR
6 1 2 1 Personal_Help
max.col may help here:
Jobcontent <- max.col(-data.frame(Personal_Help, PR, Fundraising), "first")
Jobcontent
#> [1] 1 1 2 1 2 1
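If the aim stated in the question's edit is a chart of how many respondents report each task, a single categorical variable may not be needed, because a respondent who reports several tasks should then be counted once per task. A minimal sketch of that count, assuming 1 means yes (the names tasks, task_counts and counts_long are made up here):

```r
Personal_Help <- c(1, 1, 2, 1, 2, 1)
PR            <- c(2, 1, 1, 2, 1, 2)
Fundraising   <- c(1, 2, 1, 2, 2, 1)
tasks <- data.frame(Personal_Help, PR, Fundraising)

# Number of respondents answering 1 (= yes) for each task
task_counts <- colSums(tasks == 1)

# Long format: one row per task, which is the shape a bar chart expects
counts_long <- data.frame(task = names(task_counts), n = unname(task_counts))
counts_long
```

A data frame in this shape could then be passed to ggplot2, for example with task on the x axis and n as the bar height in geom_col().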
Suppose I have the following data:
A <- c(4,4,4,4,4)
B <- c(1,2,3,4,4)
C <- c(1,2,4,4,4)
D <- c(3,2,4,1,4)
filt <- c(1,1,10,8,10)
data <- as.data.frame(rbind(A,B,C,D,filt))
data <- t(data)
data <- as.data.frame(data)
> data
A B C D filt
V1 4 1 1 3 1
V2 4 2 2 2 1
V3 4 3 4 4 10
V4 4 4 4 1 8
V5 4 4 4 4 10
I want to get counts of the occurrences of 1, 2, 3, and 4 for each variable, after filtering. In my attempt to achieve this below, I get Error: length(rows) == 1 is not TRUE.
data %>%
dplyr::filter(filt ==1) %>%
plyr::summarize(A_count = count(A),
B_count = count(B))
I get the error because some of my columns do not contain all of the values 1-4. Is there a way to specify which values it should look for and to return 0 for any that are not found? I'm not sure how to do this, or whether there is a different workaround.
Any help is VERY appreciated!!!
This was a bit of a weird one. I didn't use classical plyr, but I think this is roughly what you're looking for. I removed the filtering column, filt, so that its values are not counted:
library(dplyr)
data %>%
filter(filt == 1) %>%
select(-filt) %>%
purrr::map_df(function(a_column){
purrr::map_int(1:4, function(num) sum(a_column == num))
})
# A tibble: 4 x 4
A B C D
<int> <int> <int> <int>
1 0 1 1 0
2 0 1 1 1
3 0 0 0 1
4 2 0 0 0
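The same idea, fixing the set of values to count so that missing ones come back as 0, can also be sketched in base R by converting each column to a factor with explicit levels before calling table(). This assumes the data frame is built directly with data.frame() rather than via rbind() and t() as in the question; kept and counts are names made up for the example.

```r
A    <- c(4, 4, 4, 4, 4)
B    <- c(1, 2, 3, 4, 4)
C    <- c(1, 2, 4, 4, 4)
D    <- c(3, 2, 4, 1, 4)
filt <- c(1, 1, 10, 8, 10)
data <- data.frame(A, B, C, D, filt)

# Filter on filt, then drop it so it is not counted
kept <- data[data$filt == 1, c("A", "B", "C", "D")]

# factor(..., levels = 1:4) makes table() report a count for every
# value from 1 to 4, including values that never occur
counts <- sapply(kept, function(col) table(factor(col, levels = 1:4)))
counts
```

The result is a matrix with one row per value (1 to 4) and one column per variable.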
Is it possible to group and count instances of all other columns using R (dplyr)? For example, The following dataframe
x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1
Turns to this (note: y is value that is being counted)
EDIT: To explain the transformation: x is the variable I'm grouping by. Within each group of x, I want to count how many times each of the values 0, 1 and 2 occurs in each of the other columns. For example, the first row of the transformed data frame is for x = 1 and y = 0: among the rows with x = 1, the value 0 occurs once in column a, twice in column b and once in column c.
x y a b c
1 0 1 2 1
1 1 1 0 2
1 2 1 1 0
2 1 1 0 1
2 2 0 1 0
An approach with a combination of the melt and dcast functions of data.table or reshape2:
library(data.table) # v1.9.5+
dt.new <- dcast(melt(setDT(df), id.vars="x"), x + value ~ variable)
this gives:
dt.new
# x value a b c
# 1: 1 0 1 2 1
# 2: 1 1 1 0 2
# 3: 1 2 1 1 0
# 4: 2 1 1 0 1
# 5: 2 2 0 1 0
In dcast you can specify which aggregation function to use, but that is not necessary here because the default aggregation function is length. If you do not specify one, you will get a warning about that:
Aggregation function missing: defaulting to length
Furthermore, if you do not explicitly convert the dataframe to a data table, data.table will redirect to reshape2 (see the explanation from @Arun in the comments). Consequently this method can be used with reshape2 as well:
library(reshape2)
df.new <- dcast(melt(df, id.vars="x"), x + value ~ variable)
Used data:
df <- read.table(text="x a b c
1 0 0 0
1 1 0 1
1 2 2 1
2 1 2 1", header=TRUE)
I'd use a combination of gather and spread from the tidyr package, and count from dplyr:
library(dplyr)
library(tidyr)
df = data.frame(x = c(1,1,1,2), a = c(0,1,2,1), b = c(0,0,2,2), c = c(0,1,1,1))
res = df %>%
gather(variable, value, -x) %>%
count(x, variable, value) %>%
spread(variable, n, fill = 0)
# Source: local data frame [5 x 5]
#
# x value a b c
# 1 1 0 1 2 1
# 2 1 1 1 0 2
# 3 1 2 1 1 0
# 4 2 1 1 0 1
# 5 2 2 0 1 0
Essentially, you first change the format of the dataset to:
head(df %>%
gather(variable, value, -x))
# x variable value
#1 1 a 0
#2 1 a 1
#3 1 a 2
#4 2 a 1
#5 1 b 0
#6 1 b 0
Which allows you to use count to get the information regarding how often certain values occur in columns a to c. After that, you reformat the dataset to your required format using spread.
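In more recent versions of tidyr, gather and spread are superseded by pivot_longer and pivot_wider. The same pipeline can be sketched with them as follows, assuming a reasonably recent tidyr in which values_fill accepts a single value; res2 is a name made up for the example.

```r
library(dplyr)
library(tidyr)

df <- data.frame(x = c(1, 1, 1, 2), a = c(0, 1, 2, 1),
                 b = c(0, 0, 2, 2), c = c(0, 1, 1, 1))

res2 <- df %>%
  # long format: one row per (x, variable, value)
  pivot_longer(-x, names_to = "variable", values_to = "value") %>%
  # how often each value occurs per x and variable
  count(x, variable, value) %>%
  # back to wide format, with 0 for combinations that never occur
  pivot_wider(names_from = variable, values_from = n, values_fill = 0)
res2
```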
I am trying to split one column in a data frame into multiple columns, using the values from the original column as the new column names. Then, if a value occurred in the original column for a given row, the corresponding new column should get a 1, or a 0 if there was no match. I realize this is not the best way to explain it, so here is an example:
df <- data.frame(subject = c(1:4), Location = c('A', 'A/B', 'B/C/D', 'A/B/C/D'))
# subject Location
# 1 1 A
# 2 2 A/B
# 3 3 B/C/D
# 4 4 A/B/C/D
and would like to expand it to wide format, something such as, with 1's and 0's (or T and F):
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
I have looked into tidyr and the separate function, and reshape2 and the cast function, but I seem to be getting hung up on producing logical values. Any help on the issue would be greatly appreciated. Thank you.
You may try cSplit_e from package splitstackshape:
library(splitstackshape)
cSplit_e(data = df, split.col = "Location", sep = "/",
type = "character", drop = TRUE, fill = 0)
# subject Location_A Location_B Location_C Location_D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
You could take the following step-by-step approach.
## get the unique values after splitting
u <- unique(unlist(strsplit(as.character(df$Location), "/")))
## compare 'u' with 'Location'
m <- vapply(u, grepl, logical(nrow(df)), x = df$Location)
## coerce to integer representation
m[] <- as.integer(m)
## bind 'm' to 'subject'
cbind(df["subject"], m)
# subject A B C D
# 1 1 1 0 0 0
# 2 2 1 1 0 0
# 3 3 0 1 1 1
# 4 4 1 1 1 1
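One caveat with grepl is that it matches substrings, so a value such as "A" would also match a location named "AB". A variant that compares whole tokens instead can be sketched like this; it is only an illustration, and parts and wide are names made up for the example.

```r
df <- data.frame(subject = 1:4,
                 Location = c("A", "A/B", "B/C/D", "A/B/C/D"))

# Split each Location into its tokens
parts <- strsplit(as.character(df$Location), "/", fixed = TRUE)

# All distinct tokens become the new column names
u <- sort(unique(unlist(parts)))

# One row per subject: 1 if the token is present, 0 otherwise
m <- do.call(rbind, lapply(parts, function(p) as.integer(u %in% p)))
colnames(m) <- u

wide <- cbind(df["subject"], m)
wide
```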
I'd like to group this data but apply different functions to some columns when grouping.
ID type isDesc isImage
1 1 1 0
1 1 0 1
1 1 0 1
4 2 0 1
4 2 1 0
6 1 1 0
6 1 0 1
6 1 0 0
I want to group by ID, columns isDesc and isImage can be summed, but I would like to get the value of type as it is. type will be the same through the whole dataset. The result should look like this:
ID type isDesc isImage
1 1 1 2
4 2 1 1
6 1 1 1
Currently I am using
library(plyr)
summarized = ddply(data, .(ID), numcolwise(sum))
but it simply sums up all the columns. You don't have to use ddply, but if you think it's good for the job I'd like to stick to it. The data.table library is also an alternative.
Using data.table:
require(data.table)
dt <- data.table(data, key="ID")
dt[, list(type=type[1], isDesc=sum(isDesc),
isImage=sum(isImage)), by=ID]
# ID type isDesc isImage
# 1: 1 1 1 2
# 2: 4 2 1 1
# 3: 6 1 1 1
Using plyr:
ddply(data , .(ID), summarise, type=type[1], isDesc=sum(isDesc), isImage=sum(isImage))
# ID type isDesc isImage
# 1 1 1 1 2
# 2 4 2 1 1
# 3 6 1 1 1
Edit: If you have too many columns to list by hand, some to be summed and others where only the first value should be kept, you can use data.table's .SDcols:
dt1 <- dt[, lapply(.SD, sum), by=ID, .SDcols=c(3,4)]
dt2 <- dt[, lapply(.SD, head, 1), by=ID, .SDcols=c(2)]
dt2[dt1]
# ID type isDesc isImage
# 1: 1 1 1 2
# 2: 4 2 1 1
# 3: 6 1 1 1
You can provide column names or column numbers as arguments to .SDcols. For example, .SDcols=c("type") is also valid.
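The two steps and the join can also be folded into a single call by combining a list for the "take the first value" column with lapply(.SD, sum) for the summed columns. This is a sketch rather than the only way to do it; sum_cols and res are names made up for the example.

```r
library(data.table)

data <- read.table(text = "ID type isDesc isImage
1 1 1 0
1 1 0 1
1 1 0 1
4 2 0 1
4 2 1 0
6 1 1 0
6 1 0 1
6 1 0 0", header = TRUE)

dt <- as.data.table(data)

# Columns to be summed; every other column is handled explicitly in j
sum_cols <- c("isDesc", "isImage")

# type is not in .SDcols but can still be referred to by name in j
res <- dt[, c(list(type = type[1L]), lapply(.SD, sum)),
          by = ID, .SDcols = sum_cols]
res
```

Only sum_cols needs to change if more columns have to be summed.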