Generate crosstabulations from dataframe of categorical variables in survey - r

I've got some survey results and am trying to do some basic cross tabulations. Each column is a type of chemical, and the numbers 0:5 indicate how useful it was.
I'm trying to build a table that presents the frequency and percent. Using table or xtabs, I'm able to get individual results for each column, but I'd like to find a way to create a single table, covering all the chemicals, that I can output to LaTeX.
Thanks for any help you can provide.
Data Frame :
df <- read.table(text = "
V1 V2 V3 V4 V5 V6 V7
1 NA NA NA NA NA NA NA
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 NA
4 NA NA NA NA NA NA 5
5 0 0 0 0 0 2 0
6 NA 4 NA NA NA NA NA
7 0 0 0 0 0 0 0
8 NA NA NA NA NA 3 NA
9 NA 2 NA NA NA 3 NA
10 NA 4 NA NA NA NA NA
11 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0
14 NA NA NA NA NA 2 3
15 NA 3 NA 3 NA NA NA
16 NA 4 NA NA NA NA NA
17 0 0 0 0 0 0 0
18 NA 5 NA 5 NA NA NA
19 0 0 0 0 0 0 0
20 NA 1 NA NA NA NA NA", header = T)
Desired Output (Accurate numbers for V1 and V2) :
V1 V2 etc....
Freq Percent Freq Percent
No 9 100 9 56.2
Poor 0 0 1 6.2
Somewhat effective 0 0 1 6.2
Good 0 0 1 6.2
Very Good 0 0 3 18.75
NA 0 0 1 6.2

Here we get the frequencies for each column with lapply and table. lapply iterates over the columns of the data.frame; each column is converted to a factor with levels 0:5 so that table reports every level, including those with zero counts. prop.table gives the proportions, cbind combines Freq and Percent, do.call(cbind, ...) binds the per-column results into one matrix, and finally the column names are rebuilt as V1.Freq, V1.Percent, and so on.
res <- do.call(cbind,lapply(df, function(x) {
x1 <- table(factor(x, levels=0:5,
labels=c('No', 'Poor', 'Somewhat Effective',
'Good', 'Very Good', 'NA') ))
cbind(Freq=x1, Percent=round(100*prop.table(x1),2))}))
colnames(res) <- paste(rep(paste0('V',1:7),each=2),
colnames(res),sep=".")
head(res,2)
# V1.Freq V1.Percent V2.Freq V2.Percent V3.Freq V3.Percent V4.Freq
#No 9 100 9 56.25 9 100 9
#Poor 0 0 1 6.25 0 0 0
# V4.Percent V5.Freq V5.Percent V6.Freq V6.Percent V7.Freq V7.Percent
#No 81.82 9 100 8 66.67 8 80
#Poor 0.00 0 0 0 0.00 0 0
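As a quick sanity check on the percentages, a single column can be tabulated on its own. This is a minimal base R sketch; the vector v2 is just the V2 column from the question typed out by hand, and the labels (including the literal "NA" label for level 5) follow the question's table.

```r
# V2 column from the question, copied by hand (NA = no answer given)
v2 <- c(NA, 0, 0, NA, 0, 4, 0, NA, 2, 4, 0, 0, 0, NA, 3, 4, 0, 5, 0, 1)
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")

# Fixing the levels makes table() report zero counts too;
# real missing values are dropped by table() by default
freq <- table(factor(v2, levels = 0:5, labels = levs))
pct  <- round(100 * prop.table(freq), 2)

v2_tab <- cbind(Freq = freq, Percent = pct)
v2_tab
```

The percentages are taken over the 16 non-missing answers, which is why "No" comes out at 56.25 rather than 45.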

I'm not a regular "dplyr" or "tidyr" user, so I'm not sure if this is the best approach using those tools (but it seems to work):
library(dplyr)
library(tidyr)
df %>%
gather(var, val, V1:V7) %>% ## Make the data long
na.omit() %>% ## We don't need the NAs
## Factor the "value" column
mutate(val = factor(val, 0:5, c("No", "Poor", "Somewhat Effective",
"Good", "Very Good", "NA"))) %>%
group_by(val, var) %>% ## Group by val and var
summarise(Freq = n()) %>% ## Get the count
group_by(var) %>% ## Group just by var now
mutate(Pct = Freq/sum(Freq) * 100) %>% ## Calculate the percent
gather(R1, R2, Freq:Pct) %>% ## Go long again....
unite(Var, var, R1) %>% ## Combine the var and R1 cols
spread(Var, R2, fill = 0) ## Go wide....
# Source: local data frame [6 x 15]
#
# val V1_Freq V1_Pct V2_Freq V2_Pct V3_Freq V3_Pct V4_Freq
# 1 No 9 100 9 56.25 9 100 9
# 2 Poor 0 0 1 6.25 0 0 0
# 3 Somewhat Effective 0 0 1 6.25 0 0 0
# 4 Good 0 0 1 6.25 0 0 1
# 5 Very Good 0 0 3 18.75 0 0 0
# 6 NA 0 0 1 6.25 0 0 1
# Variables not shown: V4_Pct (dbl), V5_Freq (dbl), V5_Pct (dbl), V6_Freq
# (dbl), V6_Pct (dbl), V7_Freq (dbl), V7_Pct (dbl)
The "data.table" approach is similar in terms of the series of steps you'll have to go through.
library(data.table)
library(reshape2)
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")
DT <- melt(as.data.table(df, keep.rownames = TRUE),
id.vars = "rn", na.rm = TRUE)
DT <- DT[, value := factor(value, 0:5, levs)
][, list(Freq = .N), by = list(variable, value)
][, Pct := Freq/sum(Freq) * 100, by = list(variable)]
dcast.data.table(melt(DT, id.vars = c("variable", "value")),
value ~ variable + variable.1,
value.var = "value.1", fill = 0)
OK, one more... (a variant on @akrun's answer)
library(gdata) ## For "interleave"
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")
x1 <- sapply(lapply(df, factor, 0:5, levs), table)
t(interleave(t(x1), t(prop.table(x1, 2) * 100)))
### Or, skipping the transposing....
## library(SOfun) ## For "Riffle" which is like "interleave"
## Riffle(x1, prop.table(x1, 2) * 100)
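If neither gdata nor SOfun is available, the same interleaving can be sketched in base R by binding the count and percent matrices side by side and then reordering the columns. This is only an illustration: `small` is a made-up two-column data frame, not the survey data.

```r
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")
# Hypothetical two-column example, not the full survey data
small <- data.frame(V1 = c(0, 0, NA, 0), V2 = c(0, 4, 2, 4))

freq <- sapply(lapply(small, factor, 0:5, levs), table)  # counts per variable
pct  <- prop.table(freq, 2) * 100                         # column percentages
colnames(freq) <- paste0(colnames(freq), ".Freq")
colnames(pct)  <- paste0(colnames(pct), ".Percent")

n <- ncol(freq)
# c(rbind(1:n, n + 1:n)) yields 1, n+1, 2, n+2, ... so each Freq
# column is followed by its matching Percent column
res <- cbind(freq, pct)[, c(rbind(seq_len(n), n + seq_len(n)))]
res
```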

Related

need help creating a function to delete one wave of data if it's a duplicate of the previous wave for a retrospective measure

I'm working with a longitudinal dataset that has a retrospective measure of trauma that provides a yes/no endorsement of a question and the age of onset if the answer was "yes". If a question was endorsed at the first wave of data and then again at the second wave with the same age, the second-wave answer needs to be converted to a "no" and the age set to NA. My data looks like this:
df <- as.data.frame(cbind(Aw1 = c(0,0,1,0,0),
Aagew1 = c(NA,NA,23,NA,NA),
Aw2 = c(1,0,1,0,0),
Aagew2 = c(29,NA,23,NA,NA),
Bw1 = c(1,0,0,0,1),
Bagew1 = c(20,NA,NA,NA,23),
Bw2 = c(1,0,1,0,1),
Bagew2 = c(20,NA,28,NA,23)))
print(df)
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1 0 NA 1 29 1 20 1 20
2 0 NA 0 NA 0 NA 0 NA
3 1 23 1 23 0 NA 1 28
4 0 NA 0 NA 0 NA 0 NA
5 0 NA 0 NA 1 23 1 23
Using the following data.table syntax I'm able to recode what I want conditionally so that the "A" question at wave two, that is the same incident recorded at wave 1, is no longer present
library(data.table)
dt <- as.data.table(df)
dt[Aagew1 == Aagew2, ':=' (Aw2 = 0, Aagew2 = NA)]
print(dt)
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1: 0 NA 1 29 1 20 1 20
2: 0 NA 0 NA 0 NA 0 NA
3: 1 23 0 NA 0 NA 1 28
4: 0 NA 0 NA 0 NA 0 NA
5: 0 NA 0 NA 1 23 1 23
I'd like to automate this syntax and put it into a function so that every question that repeats itself at wave 2 is deleted. Here's a function I made that doesn't work:
rm.duplicate <- function(x){
y <- as.data.table(x)
for(i in LETTERS[1:2]){
y[paste0(i,"age","w1") == paste0(i,"age","w2"), ':=' (paste0(i,"w2") = 0, paste0(i,"age","w2") = NA)]
}
return(as.data.frame(y))
}
The desired outcome is (so that the duplicates at wave 2 are deleted for all the unique questions):
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1: 0 NA 1 29 1 20 0 NA
2: 0 NA 0 NA 0 NA 0 NA
3: 1 23 0 NA 0 NA 1 28
4: 0 NA 0 NA 0 NA 0 NA
5: 0 NA 0 NA 1 23 0 NA
Thank you for helping me out!
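The OP's function can be modified slightly: wrap the pasted column names in get() so that the comparison uses the column values rather than the name strings, and build the left-hand side of := as a character vector of column names.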
rm.duplicate <- function(x){
y <- as.data.table(x)
for(i in LETTERS[1:2]){
y[get(paste0(i,"age","w1")) == get(paste0(i,"age","w2")),
paste0(i,c("", "age"), "w2") := .(0, NA)]
}
return(as.data.frame(y))
}
rm.duplicate(df)
# Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
#1 0 NA 1 29 1 20 0 NA
#2 0 NA 0 NA 0 NA 0 NA
#3 1 23 0 NA 0 NA 1 28
#4 0 NA 0 NA 0 NA 0 NA
#5 0 NA 0 NA 1 23 0 NA
Another option is set(), which assigns by reference:
dt <- as.data.table(df)
for(i in LETTERS[1:2]){
w1 <- paste0(i,"age","w1")
w2 <- paste0(i,"age","w2")
i1 <- which(dt[[w1]] == dt[[w2]])
nm1 <- paste0(i,c("", "age"), "w2")
set(dt, i = i1, j = nm1[1], value = 0)
set(dt, i = i1, j = nm1[2], value = NA)
}
dt
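For comparison, the same recoding can be written without data.table. This base R sketch assumes the column naming pattern from the question (a question letter followed by w2, agew1 or agew2); the function name `drop_repeats` and its `questions` argument are made up for illustration.

```r
df <- data.frame(Aw1 = c(0, 0, 1, 0, 0), Aagew1 = c(NA, NA, 23, NA, NA),
                 Aw2 = c(1, 0, 1, 0, 0), Aagew2 = c(29, NA, 23, NA, NA),
                 Bw1 = c(1, 0, 0, 0, 1), Bagew1 = c(20, NA, NA, NA, 23),
                 Bw2 = c(1, 0, 1, 0, 1), Bagew2 = c(20, NA, 28, NA, 23))

# Hypothetical helper: blank out wave-2 answers that repeat wave 1
drop_repeats <- function(x, questions = c("A", "B")) {
  for (q in questions) {
    age1 <- paste0(q, "agew1")
    age2 <- paste0(q, "agew2")
    # which() discards the NA comparisons and keeps only true matches
    dup <- which(x[[age1]] == x[[age2]])
    x[dup, paste0(q, "w2")] <- 0
    x[dup, age2] <- NA
  }
  x
}

out <- drop_repeats(df)
out
```

Unlike the data.table versions, this works on a copy, so `df` itself is left unchanged.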

R - adding a count table in a dataframe

I would like to get a count table from several values in my dataframe, which are spread over several columns.
In my case, several of my subjects have one or more types of medication. Now they are mentioned in several columns. I would instead like to have each medication name in a separate column, and a 1 or 0 (a count) per subject if they have it or not.
For example, my dataframe now looks like:
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA
I would like it to look like:
Pp trimipramin escitalopram quetiapin venlafaxin duloxetin …
1 0 0 0 0 0 …
2 1 0 0 0 0 …
3 0 1 1 1 0 …
4 0 0 0 0 1 …
5 0 1 1 0 0 …
6 0 0 0 0 0 …
7 0 1 0 1 1 …
Would it be possible to get such a count table added to my dataframe and not separately?
I assume tidyr's spread() might be useful in this case, but I couldn't get it to work :-(
Any help?
The table function in base is a simple method for creating this type of binary indicator matrix.
library(tidyverse)
D %>%
gather(key, drug, -Pp) %>%
select(-key) %>%
table
#>Pp amitriptylin citalopram duloxetin escitalopram Lithium ...
#> 1 0 0 0 0 0 ...
#> 2 0 0 0 0 0 ...
#> 3 0 0 0 1 0 ...
#> 4 1 0 1 0 0 ...
#> 5 0 1 0 1 1 ...
#> 6 0 0 0 0 0 ...
#> 7 0 0 1 1 0 ...
You can also do it using spread instead, but this solution will remove patients without any associated drugs:
D %>%
gather(key, value, -Pp) %>%
filter(!is.na(value)) %>%
mutate(key=1) %>%
spread(value, key, fill=0, drop = FALSE)
library(dplyr)
library(tidyr)
library(reshape2)
data %>% gather(key = "Med", value = "Value", -Pp) %>% select(-Med) %>%
dcast(Pp~Value,fun.aggregate = function(x){as.integer(length(x) > 0)})
Data
data <- read.table(text="
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA",header=T, stringsAsFactors = F)
I do not know if it is the easiest solution, but it works. First you have to get all medications in a vector. I do that with unlist and unique. Then you have to loop through them and get the number of cases. I do that with the apply function and the sum of matches. As you want it in the same data.frame, you can store it with data[[name]]. This solution works without any packages.
names = as.character(na.omit(unique(unlist(data))))
for(name in names){
data[[name]] = apply(data, FUN = function(x) { sum(x == name, na.rm = T) }, 1)
}
Data
data = read.table(text = "X X2 X3 X4
NA NA NA NA
trimipramin NA NA NA
Quetiapin venlafaxin escitalopram NA
duloxetin amitriptylin NA NA
quetiapin citalopram escitalopram Lithium
NA NA NA NA
duloxetin escitalopram venlafaxin NA", header = T, stringsAsFactors = F)
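One thing the sample data hides is that Quetiapin and quetiapin differ only in case, so the approaches above treat them as two different drugs. Here is a base R sketch that lower-cases the names first and builds the 0/1 columns with table(); whether case should really be ignored is an assumption about the data.

```r
data <- read.table(text = "
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA", header = TRUE, stringsAsFactors = FALSE)

meds <- tolower(unlist(data[-1]))             # every medication cell, column by column
pp   <- rep(data$Pp, times = ncol(data) - 1)  # subject id belonging to each cell

# Making pp a factor keeps subjects without medication as all-zero rows
ind <- table(factor(pp, levels = data$Pp), meds)
ind <- (ind > 0) * 1                          # cap at 1 if a drug is listed twice

result <- cbind(data, as.data.frame.matrix(ind))
result
```

The indicator columns are appended to the original data frame, so the free-text columns X to X4 stay available for checking.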

Turn dataframe with a row for each id and law (with begin and end years) into a file with a row for each id and year

I have a df called laws with a row for each law (one for each id):
laws <- data.frame(id=c(1,2,3),beginyear=c(2001,2002,2005),endyear=c(2003,2005,2006), law1=c(0,0,1), law2=c(1,0,1))
from which I want to create a second data frame called idyear with a row for each id and year:
idyear <- data.frame(id=c(rep(1,6),rep(2,6),rep(3,6)), year=(rep(c(2001:2006),3)), law1=c(rep(0,16),1,1), law2=c(1,1,1,rep(0,13),1,1))
How would I efficiently go about writing some code to get the idyear df output from the laws df? The two law variables are indicator variables == 1 if the idyear$year is >= laws$beginyear AND idyear$year is <= laws$endyear.
I am a beginner with R, but I'm willing to try anything (apply, loops, etc.) to get this to work.
1) base expand.grid will create an 18 x 2 data frame of all id and year combinations and then merge will merge it back together with laws. Zero out any law1 and law2 entry for which year is not between beginyear and endyear. Finally drop the beginyear and endyear columns. No packages are used.
g <- with(laws, expand.grid(year = min(beginyear):max(endyear), id = id))
m <- merge(g, laws)
m[m$year < m$beginyear | m$year > m$endyear, c("law1", "law2")] <- 0
m <- subset(m, select = - c(beginyear, endyear))
# check
identical(m, idyear)
## [1] TRUE
2) magrittr This is the same solution as (1) except we have used magrittr pipelines to express it. Note the mixture of pipe operators.
library(magrittr)
laws %$%
expand.grid(year = min(beginyear):max(endyear), id = id) %>%
merge(laws) %$%
{ .[year < beginyear | year > endyear, c("law1", "law2")] <- 0; .} %>%
subset(select = - c(beginyear, endyear))
A solution using tidyverse. The last as.data.frame() is optional; it just converts the tbl to a data frame.
library(tidyverse)
idyear <- laws %>%
mutate(year = map2(beginyear, endyear, `:`)) %>%
unnest(year) %>%
complete(id, year = full_seq(year, period = 1L), fill = list(law1 = 0L, law2 = 0L)) %>%
select(-beginyear, -endyear) %>%
as.data.frame()
idyear
# id year law1 law2
# 1 1 2001 0 1
# 2 1 2002 0 1
# 3 1 2003 0 1
# 4 1 2004 0 0
# 5 1 2005 0 0
# 6 1 2006 0 0
# 7 2 2001 0 0
# 8 2 2002 0 0
# 9 2 2003 0 0
# 10 2 2004 0 0
# 11 2 2005 0 0
# 12 2 2006 0 0
# 13 3 2001 0 0
# 14 3 2002 0 0
# 15 3 2003 0 0
# 16 3 2004 0 0
# 17 3 2005 1 1
# 18 3 2006 1 1
The mapply function can help here.
# Function to expand year between begin and end
gen_data <- function(x_id, x_beginyear, x_endyear, x_law1, x_law2){
df <- data.frame(x_id, x_beginyear:x_endyear, x_law1, x_law2)
df
}
idyearlst <- data.frame()
idyearlst <- rbind(idyearlst, mapply(gen_data, laws$id, laws$beginyear,
laws$endyear, laws$law1, laws$law2))
# Finally convert list to data.frame
idyear <- setNames(do.call(rbind.data.frame, idyearlst), c("id", "year", "law1", "law2"))
Result will be like:
> idyear
id year law1 law2
V1.1 1 2001 0 1
V1.2 1 2002 0 1
V1.3 1 2003 0 1
V2.4 2 2002 0 0
V2.5 2 2003 0 0
V2.6 2 2004 0 0
V2.7 2 2005 0 0
V3.8 3 2005 1 1
V3.9 3 2006 1 1
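Note that the output above only contains the years in which each law was in force, not all 18 id/year combinations. The same rows can be produced a little more directly with Map() and do.call(rbind, ...), which avoids relying on how mapply simplifies its result. This is a sketch; the helper name `expand_law` is hypothetical.

```r
laws <- data.frame(id = c(1, 2, 3),
                   beginyear = c(2001, 2002, 2005),
                   endyear = c(2003, 2005, 2006),
                   law1 = c(0, 0, 1), law2 = c(1, 0, 1))

# Hypothetical helper: one small data frame per law,
# with a row for each year between begin and end
expand_law <- function(id, begin, end, law1, law2) {
  data.frame(id = id, year = begin:end, law1 = law1, law2 = law2)
}

pieces <- Map(expand_law, laws$id, laws$beginyear, laws$endyear,
              laws$law1, laws$law2)
active <- do.call(rbind, pieces)   # stack the per-law pieces
active
```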
Kind of an ugly approach, but I think it gets what you're after, using G. Grothendieck's g expand.grid data frame as a base, and your laws dataframe.
new.df <- data.frame(t(apply(g, 1, function(x){
yearspan = laws[laws$id == x['id'], 'beginyear']:laws[laws$id == x['id'], 'endyear']
law1 = laws$law1[laws$id == x['id'] & x['year'] %in% yearspan]
law2 = laws$law2[laws$id == x['id'] & x['year'] %in% yearspan]
x['law1'] = ifelse(length(law1) > 0, law1, 0)
x['law2'] = ifelse(length(law2) > 0, law2, 0)
return(x)
})))
> new.df
id year law1 law2
1 1 2001 0 1
2 1 2002 0 1
3 1 2003 0 1
4 1 2004 0 0
5 1 2005 0 0
6 1 2006 0 0
7 2 2001 0 0
8 2 2002 0 0
9 2 2003 0 0
10 2 2004 0 0
11 2 2005 0 0
12 2 2006 0 0
13 3 2001 0 0
14 3 2002 0 0
15 3 2003 0 0
16 3 2004 0 0
17 3 2005 1 1
18 3 2006 1 1
Libraries:
dplyr (for arrange, not really necessary)
Data:
laws <- data.frame(id=c(1,2,3),
beginyear=c(2001,2002,2005),
endyear=c(2003,2005,2006),
law1=c(0,0,1), law2=c(1,0,1))
g <- with(laws, expand.grid(id = id, year = min(beginyear):max(endyear)))
g <- arrange(g, id)

dcast in r with duplicates and no aggregation

I have looked through many similar questions here but can't find an answer that addresses this situation. My data frame is like this:
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1
Where SET is purposefully duplicated many times, with each record including a single SP and values for multiple TRIALS (T1-3).
What I desire is a wide dataframe like the following. There is to be NO summation/averaging/mathematical operation of any kind:
SET DOG_T1 DOG_T2 DOG_T3 CAT_T1 CAT_T2 CAT_T3 BIRD_T1 BIRD_T2 BIRD_T3
142 1 0 0 0 NA 4 5 0 NA
255 NA NA NA 2 0 0 NA 3 0
336 1 0 0 0 0 6 0 0 0
66 NA 22 1 NA NA NA NA NA NA
I have tried the following, but I get the message that dcast is defaulting to length as the aggregation function. This turns the SET variable into different numbers and only fills in 0s and 1s for the value.
df %>%
group_by(SET, SP) %>%
melt(id.vars = c('SET','SP')) %>%
data.table::dcast(SP + variable ~ SET, fun.aggregate = NULL, value.var = 'value')
This works when I DON'T have any duplicate SETs, but fails as soon as I include the full dataset.
A note: my real data frame is about 2.5 million rows, so speed is of concern.
This is a situation where dcast.data.table excels. It allows for multiple arguments as 'value.var', allowing for very concise syntax:
library(data.table)
dcast(setDT(df), SET ~ SP, value.var=c("T1", "T2", "T3"))  # setDT() converts df to a data.table by reference
# SET T1_bird T1_cat T1_dog T2_bird T2_cat T2_dog T3_bird T3_cat T3_dog
#1: A 5 0 1 0 NA 0 NA 4 0
#2: B NA 2 NA 3 0 NA 0 0 NA
#3: C 0 0 1 0 0 0 0 6 0
#4: D NA NA NA NA NA 22 NA NA 1
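For readers without data.table, base R's reshape() can do the same widening with no aggregation, provided each SET/SP pair occurs at most once (an assumption that holds for the sample data). It is a sketch only: the columns come out named T1.dog rather than T1_dog, and on millions of rows it is likely to be slower than dcast.

```r
df <- read.table(header = TRUE, text = "
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1")

# idvar identifies the output rows, timevar supplies the column suffix;
# all remaining columns (T1, T2, T3) are spread, with NA where a pair is absent
wide <- reshape(df, idvar = "SET", timevar = "SP", direction = "wide")
wide
```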
As per @lukeA's answer above, but add the fun.aggregate = identity or fun.aggregate = list argument to the dcast() call.
You could try
library(tidyverse)
df <- read.table(header=T, text="
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1")
df %>%
gather(var, val, -(1:2)) %>%
unite("SP", SP, var) %>%
spread(SP, val)
# SET bird_T1 bird_T2 bird_T3 cat_T1 cat_T2 cat_T3 dog_T1 dog_T2 dog_T3
# 1 A 5 0 NA 0 NA 4 1 0 0
# 2 B NA 3 0 2 0 0 NA NA NA
# 3 C 0 0 0 0 0 6 1 0 0
# 4 D NA NA NA NA NA NA NA 22 1
Since my current reputation doesn't allow adding a comment to @lukeA's answer above, I am making this a new answer, which is more of a suggestion:
using the data.table function setcolorder one could have the columns reordered as "bird_T1, cat_T1, dog_T1, bird_T2, cat_T2 etc." by using a custom function similar to
newOrder <- function() {
lapply(1:max(index)
, function(i) grep(
sprintf('%s', i)
, names(DT), value = TRUE)
)}
where index is the index created in order to allow dcast.data.table for duplicated keys such as the SET variable in the initial table above:
DT[, index := 1:.N, by = SET]
finally the new order is achieved in regular manner:
setcolorder(dcast(DT), neworder = newOrder())

R add columns by loop in data table

I have a data table like this:
DT <- data.table(ID=rep(c(1:2),each=6), year=rep(c(2003:2006),each=3), month=rep(c(5:8),3), day=rep(c(11:14),3),value=c(101:112))
And I would like to add columns with the conditions:
1, add 5 columns with names: V100, V102, V105, V108, V112
2, in each column, grouped by ID and year, count the values less than or equal to the value in the column name, e.g. for column V112, count the grouped values that are at most 112
So the outcome will look like:
DT1 <- data.table(ID=rep(c(1:2),each=2), year=c(2003:2006), "100"=rep(0,4), "102"=c(2,0,0,0),"105"=c(3,2,0,0),"108"=c(3,3,2,0),"112"=rep(3,4))
I tried writing code but couldn't figure it out:
degree <- c(100,102,105,108,112)
for (d in degree)
{
f_year <- function(d) {sum(DT$value <= d)}
DT <- DT[,d:=f_year(),by=list(ID,year)]
}
Any help would be appreciated!
That's what lapply can be used for.
degree <- c(100, 102, 105, 108, 112)
myfun <- function(x,y) sum(y <= x)
DT1 <- DT[, lapply(degree, myfun, value), by = .(ID, year)]
setnames(DT1, c("ID", "year", as.character(degree)))
Result:
> DT1
ID year 100 102 105 108 112
1: 1 2003 0 2 3 3 3
2: 1 2004 0 0 2 3 3
3: 2 2005 0 0 0 2 3
4: 2 2006 0 0 0 0 3
Just another way:
cols = c(100,102,105,108,112)
DT[, lapply(cols, function(x) sum(value <= x)), by=.(ID, year)]
# ID year V1 V2 V3 V4 V5
# 1: 1 2003 0 2 3 3 3
# 2: 1 2004 0 0 2 3 3
# 3: 2 2005 0 0 0 2 3
# 4: 2 2006 0 0 0 0 3
Then you can set the names.
Instead if you'd like to set names directly, then you can create a named list first:
named_cols = setattr(as.list(cols), 'names', cols)
DT[, lapply(named_cols, function(x) sum(value<=x)), by=.(ID, year)]
# ID year 100 102 105 108 112
# 1: 1 2003 0 2 3 3 3
# 2: 1 2004 0 0 2 3 3
# 3: 2 2005 0 0 0 2 3
# 4: 2 2006 0 0 0 0 3
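To see what the grouped lapply is computing, the same counts can be sketched in base R with split() and sapply(). The data frame `dat` below is a plain data.frame copy of the relevant columns of DT, named differently to avoid confusion with the data.table.

```r
dat <- data.frame(ID = rep(1:2, each = 6),
                  year = rep(2003:2006, each = 3),
                  value = 101:112)
degree <- c(100, 102, 105, 108, 112)

# Split the values by ID/year group, dropping combinations that never occur
groups <- split(dat$value, list(dat$ID, dat$year), drop = TRUE)

# For each group, count the values at or below each threshold
counts <- t(sapply(groups, function(v) sapply(degree, function(d) sum(v <= d))))
colnames(counts) <- degree
counts
```

The row names take the form ID.year (for example 1.2003), so the matrix can be matched back to the groups by name.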
