I'm new to R and seeking some help. I understand the following problem is fairly simple and have looked for similar questions. None give quite the answer I'm looking for - any help would be appreciated.
The problem:
Producing a frequency table using the table() function for three variables with data in the format:
Var1 Var2 Var3
1 0 1 0
2 0 1 0
3 1 1 1
4 0 0 1
Where 0 = "No" and 1 = "Yes".
And the final table is in the following format with variables and values labelled:
Var3
Yes No
Var1 Yes 1 0
No 1 2
Var2 Yes 1 2
No 1 0
What I have tried so far:
Using the following code I'm able to produce a two-variable table, with labels for the variables but not for the values (i.e. No and Yes).
table(data$Var1, data$Var3, dnn = c("Var1", "Var3"))
It looks like this:
Var3
Var1 0 1
0 2 1
1 0 1
In trying to label the row and column values (0 = No and 1 = Yes), I understand row.names and responseName can be used; however, the following attempt to label the row names gives an "all arguments must have the same length" error.
> table(data$Var1, data$Var2, dnn = c("Var1", "Var2"), row.names = c("No", "Yes"))
I have also tried using ftable(); however, the shape of the table produced by the code below is not correct, resulting in incorrect frequencies for the problem. The issue with labelling the row and column values persists.
> ftable(data$Var1, data$Var2, data$Var3, dnn = c("Var1", "Var2", "Var3"))
Var3 0 1
Var1 Var2
0 0 0 1
1 2 0
1 0 0 0
1 0 1
Any help in using table() to produce a table of the shape desired would be greatly appreciated.
You could try tabular from library(tables) after changing the labels as shown by @thelatemail:
library(tables)
data[] <- lapply(data, factor, levels=1:0, labels=c('Yes', 'No'))
tabular(Var1+Var2~Var3, data=data)
# Var3
# Yes No
#Var1 Yes 1 0
# No 1 2
#Var2 Yes 1 2
# No 1 0
data
data <- structure(list(Var1 = c(0L, 0L, 1L, 0L), Var2 = c(1L, 1L, 1L,
0L), Var3 = c(0L, 0L, 1L, 1L)), .Names = c("Var1", "Var2", "Var3"
), class = "data.frame", row.names = c("1", "2", "3", "4"))
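As an aside, once the labels have been changed by the lapply/factor step above, the OP's original two-way call also picks up the Yes/No value labels on its own (a quick sketch on the relabelled data; the printed alignment may differ slightly):
table(data$Var1, data$Var3, dnn = c("Var1", "Var3"))
#      Var3
# Var1  Yes No
#   Yes   1  0
#   No    1  2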
The easiest way is probably to use the reshape2 package. First you will need to convert the numeric columns to factors so that they aren't treated as numbers.
data$Var1 <- as.factor(data$Var1)
data$Var2 <- as.factor(data$Var2)
data$Var3 <- as.factor(data$Var3)
Then you can simply apply table(data) to get the information you want. If you really want to transform it into the format you specified, convert it to a data.frame and reshape it as required:
df <- as.data.frame(table(data))
library(reshape2)
dcast(df, Var1+Var2 ~ Var3)
This gives the output:
Var1 Var2 0 1
1 0 0 0 1
2 0 1 2 0
3 1 0 0 0
4 1 1 0 1
EDIT: You can just use ftable on the data frame once it's all factors:
> ftable(data)
Var3 0 1
Var1 Var2
0 0 0 1
1 2 0
1 0 0 0
1 0 1
Related
I have 2 indicators:
licence age.6-17
Na 1
1 0
Na 0
0 1
How can I change Na to 1 if a person is more than 17 years old (that is, the second column is 0), and to 0 otherwise?
Desired output:
licence age.6-17
0 1
1 0
1 0
0 1
Using dplyr and ifelse:
yourdata %>% mutate(licence = ifelse(`age.6-17` == 0, 1, 0))
There is no need to change the nature of "Na" or the column name.
In addition, in case you need to replace only the "Na" cells (treating "Na" as a string here):
yourdata %>% mutate(licence = ifelse(licence == "Na", ifelse(`age.6-17` == 0, 1, 0), licence))
If, however, it is a real <NA>, you would need is.na(licence) instead of licence == "Na".
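For example, a sketch of that variant, assuming the same yourdata but with real NA values in licence:
yourdata %>% mutate(licence = ifelse(is.na(licence), ifelse(`age.6-17` == 0, 1, 0), licence))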
In base you can subset with is.na and then subtract the value of age.6.17 from 1.
x <- read.table(header=T, na.string="Na", text="licence age.6-17
Na 1
1 0
Na 0
0 1")
idx <- is.na(x$licence)
x$licence[idx] <- 1-x$age.6.17[idx]
x
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
Or, in case you don't know what is actually stored in the licence column, you can use:
with(x, data.frame(licence=1-age.6.17, age.6.17))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
Assuming your NAs are actual NA values, we can use case_when from dplyr and apply the conditions.
library(dplyr)
df %>%
mutate(licence = case_when(is.na(licence) & age.6.17 == 0 ~ 1L,
is.na(licence) & age.6.17 == 1 ~ 0L,
TRUE ~ licence))
# licence age.6.17
#1 0 1
#2 1 0
#3 1 0
#4 0 1
data
df <- structure(list(licence = c(NA, 1L, NA, 0L), age.6.17 = c(1L,
0L, 0L, 1L)), class = "data.frame", row.names = c(NA, -4L))
I have a dataframe that I would like to delete rows from, based on a value in a specific column.
As an example, the dataframe appears something like this:
a b c d
1 1 2 3 0
2 4 NA 1 NA
3 6 4 0 1
4 NA 5 0 0
I would like to remove all rows with a value greater than 0 in column d. I have been trying to use the following code to do this:
df <- df[!df$d > 0, ]
but this appears to also delete all the rows with an NA value in column d. I assumed an na.rm = TRUE argument was needed, but I wasn't sure where to fit it into the expression above.
Cheers,
Ant
We need to select the rows where d is not greater than 0 OR there is NA in d
df[with(df, !d > 0 | is.na(d)), ]
# a b c d
#1 1 2 3 0
#2 4 NA 1 NA
#4 NA 5 0 0
Or we can also use subset
subset(df, !d > 0 | is.na(d))
or dplyr filter
library(dplyr)
df %>% filter(!d > 0 | is.na(d))
Since d holds whole numbers, the !d > 0 part can also be written as d < 1:
subset(df, d < 1 | is.na(d))
which gives the same result.
We can construct the logical vector with complete.cases (negated, so that the incomplete rows in d are kept):
subset(df, !d > 0 | !complete.cases(d))
# a b c d
#1 1 2 3 0
#2 4 NA 1 NA
#4 NA 5 0 0
Or use subset with replace
subset(df, !replace(d, is.na(d), 0) > 0)
Or with tidyverse
library(tidyverse)
df %>%
  filter(!replace_na(d, 0) > 0)
which is slightly different from the methods mentioned in the other answers.
data
df <- structure(list(a = c(1L, 4L, 6L, NA), b = c(2L, NA, 4L, 5L),
c = c(3L, 1L, 0L, 0L), d = c(0L, NA, 1L, 0L)), class = "data.frame",
row.names = c("1", "2", "3", "4"))
If you add is.na(df$d) with a |, all rows that have an NA will match, while the condition !df$d > 0 handles the values of d that are not NA. So I think you were looking for:
df[is.na(df$d) | !df$d > 0, ]
Whereas the following won't include the rows that have an NA in column d, nor those that don't match the condition !df$d > 0:
df[!is.na(df$d) & !df$d > 0, ]
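With the df from the data block in the other answer, that stricter version should return only rows 1 and 4, i.e. the NA row is dropped as well:
# a b c d
#1 1 2 3 0
#4 NA 5 0 0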
I have the following data
ID v1 v2 v3 v4 v5
1 1 3 6 4
2 4 2
3 3 1 8 5
4 2 5 3 1
Can I rearrange the data so that it automatically creates new columns and assigns a binary value (1 or 0) according to the values in each variable (v1 to v5)?
E.g. in the first row I have the values 1, 3, 4 and 6. Can R automatically create 6 dummy variables and assign the value 1 to the corresponding columns, as below:
ID dummy1 dummy2 dummy3 dummy4 dummy5 dummy6
1 1 0 1 1 0 1
To have something like this:
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 0 1 1 0 1 0 0
2 0 1 0 1 0 0 0 0
3 1 0 1 0 1 0 0 1
4 1 1 1 0 1 0 0 0
Thanks.
We can use base R to do this. Loop through the rows of the dataset (excluding the first column), build the sequence from 1 up to the maximum value in the row, check which of those values occur in the row, and convert the logical result to integer with as.integer. Then pad the elements of the list output with NAs so their lengths match, and cbind with the first column.
lst <- apply(df[-1], 1, function(x) as.integer(seq_len(max(x, na.rm = TRUE)) %in% x))
res <- cbind(df[1], do.call(rbind, lapply(lst, `length<-`, max(lengths(lst)))))
res[is.na(res)] <- 0
colnames(res)[-1] <- paste0('c', 1:8)
res
# ID c1 c2 c3 c4 c5 c6 c7 c8
#1 1 1 0 1 1 0 1 0 0
#2 2 0 1 0 1 0 0 0 0
#3 3 1 0 1 0 1 0 0 1
#4 4 1 1 1 0 1 0 0 0
In base R, you can use:
table(transform(cbind(mydf[1], stack(mydf[-1]))[1:2], values = factor(values, 1:8)))
## values
## ID 1 2 3 4 5 6 7 8
## 1 1 0 1 1 0 1 0 0
## 2 0 1 0 1 0 0 0 0
## 3 1 0 1 0 1 0 0 1
## 4 1 1 1 0 1 0 0 0
Note that you need to convert the stacked values to factor if you want the "7" to be included in the output. This applies to the "data.table" and "tidyverse" approaches as well.
Alternatively, you can try the following with "data.table":
library(data.table)
melt(as.data.table(mydf), "ID", na.rm = TRUE)[
, dcast(.SD, ID ~ factor(value, 1:8), fun = length, drop = FALSE)]
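With the sample data shown further down, I would expect this to print the same 0/1 pattern as the base R table above, along the lines of:
#    ID 1 2 3 4 5 6 7 8
# 1:  1 1 0 1 1 0 1 0 0
# 2:  2 0 1 0 1 0 0 0 0
# 3:  3 1 0 1 0 1 0 0 1
# 4:  4 1 1 1 0 1 0 0 0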
Or the following with the "tidyverse":
library(tidyverse)
mydf %>%
gather(var, val, -ID, na.rm = TRUE) %>%
select(-var) %>%
mutate(var = 1, val = factor(val, 1:8)) %>%
spread(val, var, fill = 0, drop = FALSE)
Sample data:
mydf <- structure(list(ID = 1:4, v1 = c(1L, 4L, 3L, 2L), v2 = c(3L, 2L,
1L, 5L), v3 = c(6L, NA, 8L, 3L), v4 = c(4L, NA, 5L, 1L), v5 = c(NA,
NA, NA, NA)), .Names = c("ID", "v1", "v2", "v3", "v4", "v5"), row.names = c(NA,
4L), class = "data.frame")
If automation is important, you can also use syntax like factor(value, sequence(max(value))) in the "data.table" approach or val = factor(val, sequence(max(val))) in the "tidyverse" approach.
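For instance, here is a sketch of the automated tidyverse version, assuming the same mydf and packages as above; the only change is that the levels are derived from the data instead of being hard-coded as 1:8:
mydf %>%
  gather(var, val, -ID, na.rm = TRUE) %>%
  select(-var) %>%
  mutate(var = 1, val = factor(val, sequence(max(val)))) %>%
  spread(val, var, fill = 0, drop = FALSE)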
Another base R answer with some similarities to akrun's is
# create matrix of values
myMat <- as.matrix(dat[-1])
# create result matrix of desired shape, filled with 0s
res <- matrix(0L, nrow(dat), ncol=max(myMat, na.rm=TRUE))
# use matrix indexing to fill in 1s
res[cbind(dat$ID, as.vector(myMat))] <- 1L
# convert to data.frame, add ID column, and provide variable names
setNames(data.frame(cbind(dat$ID, res)), c("ID", paste0("c", 1:8)))
which returns
ID c1 c2 c3 c4 c5 c6 c7 c8
1 1 1 0 1 1 0 1 0 0
2 2 0 1 0 1 0 0 0 0
3 3 1 0 1 0 1 0 0 1
4 4 1 1 1 0 1 0 0 0
I have a dataset where all my data is categorical and I would like to use one hot encoding for further analysis.
Main issues I would like to resolve:
Some cells contain multiple text values in one cell (an example follows).
Some numerical values need to be changed to factor for further processing.
Data with 3 headings: Age, Info & Target
mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"",
"c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age",
"Info", "Target"), row.names = c(NA, 4L), class = "data.frame")
I want to create one hot encoding of all these variables shown above so it will look like the following:
Age_99 Age_10 Age_40 Age_15 good bad sad nice happy joy null okay nice fun wild go Boy Girl
1 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1
Some of the questions on SO I have checked are this and this.
I would suppose that the following should work:
library(splitstackshape)
library(magrittr)
suppressWarnings({ ## Just to silence melt
mydf %>% ## The dataset
as.data.table(keep.rownames = TRUE) %>% ## Convert to data.table
.[, Info := gsub("c\\(|\"", "", Info)] %>% ## Strip out c( and quotes
cSplit("Info", ",") %>% ## Split the "Info" column
melt(id.vars = "rn") %>% ## Melt everyting except rn
dcast(rn ~ value, fun.aggregate = length) ## Go wide
})
# rn 10 15 40 99 Boy Girl NULL bad fun go good happy joy nice okay sad wild NA
# 1: 1 0 0 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 2
# 2: 2 1 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 2
# 3: 3 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 4
# 4: 4 0 1 0 0 1 0 0 0 1 1 0 0 0 1 1 0 1 0
Here's the sample data I used:
mydf <- structure(list(Age = c(99L, 10L, 40L, 15L), Info = c("c(\"good\", \"bad\", \"sad\"",
"c(\"nice\", \"happy\", \"joy\"", "NULL", "c(\"okay\", \"nice\", \"fun\", \"wild\", \"go\""
), Target = c("Boy", "Girl", "Boy", "Boy")), .Names = c("Age",
"Info", "Target"), row.names = c(NA, 4L), class = "data.frame")
You can use the grepl function to scan each string for whatever you are looking for, and use ifelse to fill the column appropriately.
Something like:
# This will create a new column labeled 'good' with 1 if the string contains and 0 if not
data$good = ifelse(grepl("good",data$info),1, 0)
# and do this for each variable of interest
And at the end you can remove the info column if you'd like. This way you don't have to make any new data tables.
data$info <- NULL
Note that you should change 'data' to whatever the actual name of your data set is.
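If there are many words, one way to avoid repeating the ifelse line by hand is a small loop (just a sketch; the words vector is only an illustration and data$Info assumes the column name from the question's sample data):
# one indicator column per word; \\b word boundaries so that e.g. "go" does not also match "good"
words <- c("good", "bad", "sad", "nice", "happy", "joy")
for (w in words) {
  data[[w]] <- ifelse(grepl(paste0("\\b", w, "\\b"), data$Info), 1, 0)
}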
As for the problem with age, there is no need to change it into a factor; just use:
data$age99 = ifelse(data$Age == 99, 1,0) # and so forth for the other ages
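Or, to cover every age that occurs without writing one line per value, a loop along the same lines (again a sketch, assuming the Age column from the sample data):
for (a in unique(data$Age)) {
  data[[paste0("Age_", a)]] <- ifelse(data$Age == a, 1, 0)
}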
I have a data frame that contains three variables: treatment, dose, and outcome (plus or minus). I have multiple observations for each treatment and dose. I'm trying to output a contingency table that would collapse the data to indicate the number of each outcome as a function of the treatment and dose, as well as the number of observations. For example:
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1
The desired output would be:
treatment  dose outcome n
control       0       1 4
treatmentA    1       2 3
treatmentA    2       3 3
I've played around with this all day and haven't had much luck beyond being able to get a frequency for each outcome for each observation. Any suggestions would be appreciated, including pointers to the parts of the R manual and/or examples I've overlooked.
Thanks!
Here is a solution using the wonderful data.table package:
library(data.table)
x <- data.table(read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1", header = TRUE)
x[, list(outcome = sum(outcome), count = .N), by = 'treatment,dose']
produces (here .N is data.table's built-in count of rows in each group):
treatment dose outcome count
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
If you don't want to use extra libraries as suggested in other answers, you can try the following.
> df
treatment dose outcome
1 control 0 0
2 control 0 0
3 control 0 0
4 control 0 1
5 treatmentA 1 0
6 treatmentA 1 1
7 treatmentA 1 1
8 treatmentA 2 1
9 treatmentA 2 1
10 treatmentA 2 1
> dput(df)
structure(list(treatment = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L, 2L, 2L), .Label = c("control", "treatmentA"), class = "factor"),
dose = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 2L, 2L, 2L), outcome = c(0L,
0L, 0L, 1L, 0L, 1L, 1L, 1L, 1L, 1L)), .Names = c("treatment",
"dose", "outcome"), class = "data.frame", row.names = c(NA, -10L
))
Now we use the aggregate function to get the count and sum of the outcome column.
> nObs <- aggregate(outcome ~ treatment + dose, data = df, length)
> sObs <- aggregate(outcome ~ treatment + dose, data = df, sum)
Change the names of the aggregated columns appropriately:
> names(nObs) <- c('treatment', 'dose', 'count')
> names(sObs) <- c('treatment', 'dose', 'sum')
> nObs
treatment dose count
1 control 0 4
2 treatmentA 1 3
3 treatmentA 2 3
> sObs
treatment dose sum
1 control 0 1
2 treatmentA 1 2
3 treatmentA 2 3
Use merge to combine the two results above by all columns with the same name, treatment and dose in this case.
> result <- merge(nObs, sObs)
> result
treatment dose count sum
1 control 0 4 1
2 treatmentA 1 3 2
3 treatmentA 2 3 3
If I understand correctly, this is straightforward with the data.table library. First, load the library and read the data in:
library(data.table)
data <- read.table(header=TRUE, text="
treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1")
Next, create a data.table with the treatment and dose columns as the table keys (indices).
data <- data.table(data, key="treatment,dose")
Then aggregate using data.table syntax.
data[, list(outcome=sum(outcome), n=length(outcome)), by=list(treatment,dose)]
treatment dose outcome n
1: control 0 1 4
2: treatmentA 1 2 3
3: treatmentA 2 3 3
imho, sql is underrated. :)
# read in your example data as `x`
x <- read.table( text = "treatment dose outcome
control 0 0
control 0 0
control 0 0
control 0 1
treatmentA 1 0
treatmentA 1 1
treatmentA 1 1
treatmentA 2 1
treatmentA 2 1
treatmentA 2 1",h=T)
# load the sql data frame library
library(sqldf)
# create a new table of all unique `treatment` and `dose` columns,
# summing the `outcome` column and
# counting the number of records in each combo
y <- sqldf( 'SELECT treatment, dose ,
sum( outcome ) as outcome ,
count(*) as n
FROM x
GROUP BY treatment, dose' )
# check the results
y
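With the example data read in above, y should come back as something like (row order may vary):
#    treatment dose outcome n
# 1    control    0       1 4
# 2 treatmentA    1       2 3
# 3 treatmentA    2       3 3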
Here are another couple of options (even though the data.table approach clearly wins in succinctness of syntax).
The first uses ave inside within. ave applies a function to a variable (the first argument) grouped by one or more variables. We wrap the output in unique after dropping the now unnecessary "outcome" column.
unique(within(df, {
SUM <- ave(outcome, treatment, dose, FUN = sum)
COUNT <- ave(outcome, treatment, dose, FUN = length)
rm(outcome)
}))
# treatment dose COUNT SUM
# 1 control 0 4 1
# 5 treatmentA 1 3 2
# 8 treatmentA 2 3 3
A second solution in base R is very similar to @geektrader's answer, except it calculates both sum and length in one call to aggregate. There is a "downside" though: the result of that cbind is a "column" in your data.frame that is actually a matrix. See the result of str to see what I mean.
temp <- aggregate(outcome ~ treatment + dose, df,
function(x) cbind(sum(x), length(x)))
str(temp)
# 'data.frame': 3 obs. of 3 variables:
# $ treatment: Factor w/ 2 levels "control","treatmentA": 1 2 2
# $ dose : int 0 1 2
# $ outcome : int [1:3, 1:2] 1 2 3 4 3 3
colnames(temp$outcome) <- c("SUM", "COUNT")
temp
# treatment dose outcome.SUM outcome.COUNT
# 1 control 0 1 4
# 2 treatmentA 1 2 3
# 3 treatmentA 2 3 3
I mention storage structure as a "downside" mostly because you might not get what you expect when you try to access the data in ways you might be accustomed to.
temp$outcome.SUM
# NULL
temp$outcome
# SUM COUNT
# [1,] 1 4
# [2,] 2 3
# [3,] 3 3
Instead, you have to access it via:
temp$outcome[, "SUM"] ## or temp$outcome[, 1]
# [1] 1 2 3
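If you would rather have two ordinary columns, one way (a sketch reusing the temp object built above) is to rebuild the data frame, which makes data.frame split the matrix column into outcome.SUM and outcome.COUNT:
do.call(data.frame, temp)
#    treatment dose outcome.SUM outcome.COUNT
# 1    control    0           1             4
# 2 treatmentA    1           2             3
# 3 treatmentA    2           3             3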