R - adding a count table in a dataframe

I would like to get a count table from several values in my dataframe, which are spread over several columns.
In my case, several of my subjects have one or more types of medication. Now they are mentioned in several columns. I would instead like to have each medication name in a separate column, and a 1 or 0 (a count) per subject if they have it or not.
For example, my dataframe now looks like:
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA
I would like it to look like:
Pp trimipramin escitalopram quetiapin venlafaxin duloxetin …
1 0 0 0 0 0 …
2 1 0 0 0 0 …
3 0 1 1 1 0 …
4 0 0 0 0 1 …
5 0 1 1 0 0 …
6 0 0 0 0 0 …
7 0 1 0 1 1 …
Would it be possible to get such a count table added to my dataframe and not separately?
I assume tidyr's spread() might be useful in this case, but I couldn't get it to work :-(
Any help?

The table() function in base R is a simple way to create this type of binary indicator matrix.
library(tidyverse)
D %>%
  gather(key, drug, -Pp) %>%
  select(-key) %>%
  table
#>Pp amitriptylin citalopram duloxetin escitalopram Lithium ...
#> 1 0 0 0 0 0 ...
#> 2 0 0 0 0 0 ...
#> 3 0 0 0 1 0 ...
#> 4 1 0 1 0 0 ...
#> 5 0 1 0 1 1 ...
#> 6 0 0 0 0 0 ...
#> 7 0 0 1 1 0 ...
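Since the question asks for the counts inside the data frame rather than as a separate table, one hedged sketch (not part of the original answer) is to convert the table to plain columns and bind it back on; this assumes D is sorted by Pp so the rows line up:

```r
library(tidyverse)

D <- read.table(text = "
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA", header = TRUE, stringsAsFactors = FALSE)

# the binary indicator table, as above
tab <- D %>%
  gather(key, drug, -Pp) %>%
  select(-key) %>%
  table

# bind the indicator columns onto D; rows of `tab` are ordered by Pp
D2 <- cbind(D, as.data.frame.matrix(tab))
```

Note that case matters here: "Quetiapin" and "quetiapin" become separate columns, as in the original data.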
You can also do it using spread instead, but this solution will remove patients without any associated drugs:
D %>%
  gather(key, value, -Pp) %>%
  filter(!is.na(value)) %>%
  mutate(key = 1) %>%
  spread(value, key, fill = 0, drop = FALSE)
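If the dropped patients matter, a hedged sketch of one workaround (an addition, not the answerer's code): run the same pipeline, join the full Pp list back on, and zero-fill the resulting NAs.

```r
library(tidyverse)

D <- read.table(text = "
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA", header = TRUE, stringsAsFactors = FALSE)

out <- D %>%
  gather(key, value, -Pp) %>%
  filter(!is.na(value)) %>%
  mutate(key = 1) %>%
  spread(value, key, fill = 0) %>%
  right_join(distinct(D["Pp"]), by = "Pp") %>%  # bring back drug-free patients
  arrange(Pp)
out[is.na(out)] <- 0   # their indicator columns become 0 instead of NA
```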

library(dplyr)
library(tidyr)
library(reshape2)
data %>%
  gather(key = "Med", value = "Value", -Pp) %>%
  select(-Med) %>%
  dcast(Pp ~ Value, fun.aggregate = function(x) as.integer(length(x) > 0))
Data
data <- read.table(text="
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA",header=T, stringsAsFactors = F)

I do not know if it is the easiest solution, but it works. First you have to get all medications into a vector; I do that with unlist and unique. Then you have to loop through them and get the number of cases; I do that with the apply function and the sum of matches. As you want it in the same data.frame, you can store it with data[[name]]. This solution works without any packages.
names <- as.character(na.omit(unique(unlist(data))))
for (name in names) {
  data[[name]] <- apply(data, 1, function(x) sum(x == name, na.rm = TRUE))
}
Data
data = read.table(text = "X X2 X3 X4
NA NA NA NA
trimipramin NA NA NA
Quetiapin venlafaxin escitalopram NA
duloxetin amitriptylin NA NA
quetiapin citalopram escitalopram Lithium
NA NA NA NA
duloxetin escitalopram venlafaxin NA", header = T, stringsAsFactors = F)
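A base-R variant of the same idea, sketched without the explicit loop: build all indicator columns at once with sapply over the unique medication names (assumes the same data as above).

```r
data <- read.table(text = "X X2 X3 X4
NA NA NA NA
trimipramin NA NA NA
Quetiapin venlafaxin escitalopram NA
duloxetin amitriptylin NA NA
quetiapin citalopram escitalopram Lithium
NA NA NA NA
duloxetin escitalopram venlafaxin NA", header = TRUE, stringsAsFactors = FALSE)

# unique medication names, NAs dropped
meds <- as.character(na.omit(unique(unlist(data))))

# one column of row-wise match counts per medication
counts <- sapply(meds, function(m) rowSums(data == m, na.rm = TRUE))
data2 <- cbind(data, counts)
```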

Related

need help creating a function to delete one wave of data if it's a duplicate of the previous wave for a retrospective measure

I'm working with a longitudinal dataset that has a retrospective measure of trauma that provides a yes/no endorsement of a question and the age of onset if the answer was "yes". If a question was endorsed at the first wave of data and then again at the second wave with the same age, it needs to be converted to a "no" and the age to NA. My data looks like this:
df <- as.data.frame(cbind(Aw1 = c(0,0,1,0,0),
Aagew1 = c(NA,NA,23,NA,NA),
Aw2 = c(1,0,1,0,0),
Aagew2 = c(29,NA,23,NA,NA),
Bw1 = c(1,0,0,0,1),
Bagew1 = c(20,NA,NA,NA,23),
Bw2 = c(1,0,1,0,1),
Bagew2 = c(20,NA,28,NA,23)))
print(df)
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1 0 NA 1 29 1 20 1 20
2 0 NA 0 NA 0 NA 0 NA
3 1 23 1 23 0 NA 1 28
4 0 NA 0 NA 0 NA 0 NA
5 0 NA 0 NA 1 23 1 23
Using the following data.table syntax I'm able to recode what I want conditionally, so that the "A" question at wave two (the same incident recorded at wave 1) is no longer present:
dt <- as.data.table(df)
dt[Aagew1 == Aagew2, ':=' (Aw2 = 0, Aagew2 = NA)]
print(dt)
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1: 0 NA 1 29 1 20 1 20
2: 0 NA 0 NA 0 NA 0 NA
3: 1 23 0 NA 0 NA 1 28
4: 0 NA 0 NA 0 NA 0 NA
5: 0 NA 0 NA 1 23 1 23
I'd like to automate this syntax and put it into a function, so that every question that repeats itself at wave 2 is deleted. Here's a function I made that doesn't work:
rm.duplicate <- function(x){
  y <- as.data.table(x)
  for(i in LETTERS[1:2]){
    y[paste0(i,"age","w1") == paste0(i,"age","w2"), ':=' (paste0(i,"w2") = 0, paste0(i,"age","w2") = NA)]
  }
  return(as.data.frame(y))
}
The desired outcome is (so that the duplicates at wave 2 are deleted for all the unique questions):
Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
1: 0 NA 1 29 1 20 0 NA
2: 0 NA 0 NA 0 NA 0 NA
3: 1 23 0 NA 0 NA 1 28
4: 0 NA 0 NA 0 NA 0 NA
5: 0 NA 0 NA 1 23 0 NA
Thank you for helping me out!
The OP's function can be modified slightly, using get to return the column values by name
rm.duplicate <- function(x){
  y <- as.data.table(x)
  for(i in LETTERS[1:2]){
    y[get(paste0(i,"age","w1")) == get(paste0(i,"age","w2")),
      paste0(i,c("", "age"), "w2") := .(0, NA)]
  }
  return(as.data.frame(y))
}
rm.duplicate(df)
# Aw1 Aagew1 Aw2 Aagew2 Bw1 Bagew1 Bw2 Bagew2
#1 0 NA 1 29 1 20 0 NA
#2 0 NA 0 NA 0 NA 0 NA
#3 1 23 0 NA 0 NA 1 28
#4 0 NA 0 NA 0 NA 0 NA
#5 0 NA 0 NA 1 23 0 NA
Or another option is set
dt <- as.data.table(df)
for(i in LETTERS[1:2]){
  w1 <- paste0(i,"age","w1")
  w2 <- paste0(i,"age","w2")
  i1 <- which(dt[[w1]] == dt[[w2]])
  nm1 <- paste0(i,c("", "age"), "w2")
  set(dt, i = i1, j = nm1[1], value = 0)
  set(dt, i = i1, j = nm1[2], value = NA)
}
dt
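If more than the A and B questions exist, one hedged extension of the set approach is to recover the question prefixes from the column names instead of hardcoding LETTERS[1:2]; this sketch assumes the `<prefix>agew1` / `<prefix>agew2` naming pattern holds.

```r
library(data.table)

df <- data.frame(Aw1 = c(0,0,1,0,0), Aagew1 = c(NA,NA,23,NA,NA),
                 Aw2 = c(1,0,1,0,0), Aagew2 = c(29,NA,23,NA,NA),
                 Bw1 = c(1,0,0,0,1), Bagew1 = c(20,NA,NA,NA,23),
                 Bw2 = c(1,0,1,0,1), Bagew2 = c(20,NA,28,NA,23))

dt <- as.data.table(df)
# every question has an "<prefix>agew1" column; recover the prefixes from those
prefixes <- sub("agew1$", "", grep("agew1$", names(dt), value = TRUE))
for (p in prefixes) {
  i1 <- which(dt[[paste0(p, "agew1")]] == dt[[paste0(p, "agew2")]])
  set(dt, i = i1, j = paste0(p, "w2"), value = 0)
  set(dt, i = i1, j = paste0(p, "agew2"), value = NA_real_)
}
dt
```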

Combine two columns (numerical values) and retain NAs

I have two columns that I wish to combine. I would usually sum both columns. However, in this case I need to retain the original NAs and only have the numerical values combined from the two columns into one. I have tried to use ifelse statements, but since I'm working with two columns, combining both is trickier with nested ifelse.
Here's my example data:
# Example
data <- c(80, 7.692307692, 7.692307692, 8.333333333, 9.090909091, 20,
          27.27272727, 50, 50, 21.42857143, 58.33333333, 46.66666667,
          78.06451613, 186.15384615, 42.85714286, 44.1860465, 20, 25,
          40, 45, 78, 55)
df <- data.frame(data)
df$long <- ifelse(df$data <20,1,0) # print 1s
df$long_exit <- df$data <=70 # logical
df$long_sigs <- ifelse(df$long_exit == FALSE,0,NA) # convert FALSE to 0 and retain NAs
desired_sum_output <- c(0,1,1,1,1,NA,NA,NA,NA,NA,NA,NA,0,0,NA,NA,NA,NA,NA,NA,0,NA)
df <- data.frame(df,desired_sum_output)
# sum df$long + df$long_sigs, retaining the 0s and NAs
sums <- df$long + df$long_sigs # summing
> df
data long long_exit long_sigs desired_sum_output
1 80.000000 0 FALSE 0 0
2 7.692308 1 TRUE NA 1
3 7.692308 1 TRUE NA 1
4 8.333333 1 TRUE NA 1
5 9.090909 1 TRUE NA 1
6 20.000000 0 TRUE NA NA
7 27.272727 0 TRUE NA NA
8 50.000000 0 TRUE NA NA
9 50.000000 0 TRUE NA NA
10 21.428571 0 TRUE NA NA
11 58.333333 0 TRUE NA NA
12 46.666667 0 TRUE NA NA
13 78.064516 0 FALSE 0 0
14 186.153846 0 FALSE 0 0
15 42.857143 0 TRUE NA NA
16 44.186047 0 TRUE NA NA
17 20.000000 0 TRUE NA NA
18 25.000000 0 TRUE NA NA
19 40.000000 0 TRUE NA NA
20 45.000000 0 TRUE NA NA
21 78.000000 0 FALSE 0 0
22 55.000000 0 TRUE NA NA
This is usually reasonable:
df$z <- rowSums(df[, c("long", "long_sigs")], na.rm=TRUE)
However, I lose my NA positions.
The reason for retaining the NAs is that I will use na.locf from the zoo package to forward-fill them.
If I understand correctly, you want to sum with na.rm = TRUE where long = 1, otherwise retain the NA. So this should create z = desired_sum_output:
df$z <- ifelse(df$long == 1,
rowSums(df[, c("long", "long_sigs")], na.rm=TRUE),
rowSums(df[, c("long", "long_sigs")]))
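Since long is itself the wanted value whenever it equals 1, the same logic can be sketched a bit more simply in base R; this uses a shortened, hypothetical subset of the data rather than the full 22-value vector above.

```r
# shortened illustrative data (not the full vector from the question)
df <- data.frame(data = c(80, 7.69, 20, 78, 55))
df$long      <- ifelse(df$data < 20, 1, 0)
df$long_sigs <- ifelse(df$data <= 70, NA, 0)

# where long == 1 keep the 1; otherwise plain + lets NA propagate
df$z <- with(df, ifelse(long == 1, long, long + long_sigs))
df$z
```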

dcast in r with duplicates and no aggregation

I have looked through many similar questions here but can't find an answer that addresses this situation. My data frame is like this:
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1
Where SET is purposefully duplicated many times, with each record including a single SP and values for multiple TRIALS (T1-3).
What I desire is a wide dataframe like the following. There is to be NO summation/averaging/mathematical operation of any kind:
SET DOG_T1 DOG_T2 DOG_T3 CAT_T1 CAT_T2 CAT_T3 BIRD_T1 BIRD_T2 BIRD_T3
142 1 0 0 0 NA 4 5 0 NA
255 NA NA NA 2 0 0 NA 3 0
336 1 0 0 0 0 6 0 0 0
66 NA 22 1 NA NA NA NA NA NA
I have tried the following, but I get the warning about melt and dcast defaulting to length as the aggregation function. This turns the SET variable into different numbers and only fills in 0s and 1s for the value.
df %>%
  group_by(SET, SP) %>%
  melt(id.vars = c('SET','SP')) %>%
  data.table::dcast(SP + variable ~ SET, fun.aggregate = NULL, value.var = 'value')
This works when I DON'T have any duplicate SETs, but fails as soon as I include the full dataset.
A note: my real data frame is about 2.5 million rows, so speed is of concern.
This is a situation where dcast from data.table excels. It allows multiple columns as 'value.var', allowing for very concise syntax:
library(data.table)
dcast(as.data.table(df), SET ~ SP, value.var = c("T1", "T2", "T3"))
# SET T1_bird T1_cat T1_dog T2_bird T2_cat T2_dog T3_bird T3_cat T3_dog
#1: A 5 0 1 0 NA 0 NA 4 0
#2: B NA 2 NA 3 0 NA 0 0 NA
#3: C 0 0 1 0 0 0 0 6 0
#4: D NA NA NA NA NA 22 NA NA 1
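A hedged aside, not part of the original answer: on tidyr >= 1.0.0 the same multi-column reshape can be written with pivot_wider, which likewise accepts several values_from columns.

```r
library(tidyr)

df <- read.table(header = TRUE, text = "
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1", stringsAsFactors = FALSE)

# one row per SET; missing SET/SP combinations become NA, no aggregation
wide <- pivot_wider(df, id_cols = SET,
                    names_from = SP, values_from = c(T1, T2, T3))
wide
```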
As per @lukeA's answer above, but add a fun.aggregate = identity or fun.aggregate = list argument in the dcast() function call
You could try
library(tidyverse)
df <- read.table(header=T, text="
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1")
df %>%
  gather(var, val, -(1:2)) %>%
  unite("SP", SP, var) %>%
  spread(SP, val)
# SET bird_T1 bird_T2 bird_T3 cat_T1 cat_T2 cat_T3 dog_T1 dog_T2 dog_T3
# 1 A 5 0 NA 0 NA 4 1 0 0
# 2 B NA 3 0 2 0 0 NA NA NA
# 3 C 0 0 0 0 0 6 1 0 0
# 4 D NA NA NA NA NA NA NA 22 1
Since my current reputation doesn't allow adding a comment to @lukeA's answer above, I am making this a new answer, which is more of a suggestion:
using the data.table function setcolorder, the columns can be reordered as "bird_T1, cat_T1, dog_T1, bird_T2, cat_T2, etc." by using a custom function similar to
newOrder <- function() {
  lapply(1:max(index), function(i) {
    grep(sprintf('%s', i), names(DT), value = TRUE)
  })
}
where index is the index created in order to allow dcast.data.table for duplicated keys such as the SET variable in the initial table above:
DT[, index := 1:.N, by = SET]
finally the new order is applied in the regular manner (newOrder() returns a list, so unlist it; res is the result of the dcast call):
setcolorder(res, unlist(newOrder()))

creating data frame for confusionMatrix() input, however getting unusual dataframe

I'm trying to understand the results of a prediction object via caret's confusionMatrix() function, which requires table input according to http://artax.karlin.mff.cuni.cz/r-help/library/caret/html/confusionMatrix.html. My table() call creates results that I understand, but they are not friendly to the confusionMatrix() function.
Here is the relevant code snippet:
#MODEL CREATION
#convert categorical A to E values , into numeric 1 to 5 in order to be regression friendly
training_data_subset_numeric <- training_data_subset;
testing_data_subset_numeric <- testing_data_subset;
training_data_subset_numeric$classe <- as.numeric(training_data_subset$classe)
testing_data_subset_numeric$classe <- as.numeric(testing_data_subset$classe)
#model
exercise.model <- glm(formula = classe ~ ., data = training_data_subset_numeric)
#MODEL EVALUATION
exercise.prediction <- predict(exercise.model,newdata = testing_data_subset_numeric)
eval_table <- table(exercise.prediction,testing_data_subset$classe)
tail(eval_table)
exercise.prediction A B C D E
4.35504232913594 1 0 0 0 0
4.47219097065568 1 0 0 0 0
4.50838854075835 1 0 0 0 0
4.6173551930011 0 1 0 0 0
4.69261223447305 0 1 0 0 0
4.73297946213265 0 1 0 0 0
Basically I need to convert the above output to a data frame with one column corresponding to the prediction value, following this rule:
If column A is 1, then the predicted value is 1
If column B is 1, then the predicted value is 2
If column C is 1, then the predicted value is 3
If column D is 1, then the predicted value is 4
If column E is 1, then the predicted value is 5
I therefore wrote this function to get the job done:
getPredictResults<- function(x)
{
# create 1 column & n row data frame
num <- data.frame(matrix(0, ncol = 1, nrow = nrow(x)));
for (r in 1:nrow(x) ) {
for (c in 1:ncol(x) ) {
#if column A has value 1 than num[1,r] <- 1
if (x[r,'A']== 1)
{
num[1,r] <- 1;
}
#if column B has value 1 than num[1,r] <- 2
else if (x[r,'B']== 1)
{
num[1,r] <- 2;
}
#if column C has value 1 than num[1,r] <- 3
else if (x[r,'C']== 1)
{
num[1,r] <- 3;
}
#if column D has value 1 than num[1,r] <- 4
else if (x[r,'D']== 1)
{
num[1,r] <- 4;
}
#if column E has value 1 than num[1,r] <- 5
else if (x[r,'E']== 1)
{
num[1,r] <- 5;
}
else
{
}
}#end inner for
}#end outer for
return (num);
}#end function
exercise.prediction_df <- getPredictResults(eval_table)
However when typing :
head(exercise.prediction_df)
I'm getting an unusual output; here is the bottom snippet:
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4840 V4841 V4842 V4843 V4844 V4845 V4846 V4847 V4848 V4849 V4850 V4851 V4852 V4853 V4854 V4855 V4856 V4857
1 5 1 4 5 2 2 5 5 1 2 5 4 5 5 1 5 5 4
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4858 V4859 V4860 V4861 V4862 V4863 V4864 V4865 V4866 V4867 V4868 V4869 V4870 V4871 V4872 V4873 V4874 V4875
1 4 2 1 2 5 1 4 5 2 1 4 5 2 4 2 4 4 2
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4876 V4877 V4878 V4879 V4880 V4881 V4882 V4883 V4884 V4885 V4886 V4887 V4888 V4889 V4890 V4891 V4892 V4893
1 5 1 1 4 1 2 2 1 1 5 1 4 1 1 1 1 1 1
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4894 V4895 V4896 V4897 V4898 V4899 V4900 V4901 V4902 V4903 V4904
1 1 1 1 1 1 1 1 1 2 2 2
2 NA NA NA NA NA NA NA NA NA NA NA
[ reached getOption("max.print") -- omitted 4 rows ]
Further investigation shows:
> ncol(exercise.prediction_df)
[1] 4904
> nrow(exercise.prediction_df)
[1] 4904
whereas ncol() should return 1, and nrow() can be any integer value.
How can I fix this function, in order to create the right dataframe as an input to confusionMatrix() function?
Thanks.
classe <- cut(runif(100), seq(0, 1, length.out = 6))
levels(classe) <- c("A", "B", "C", "D", "E")
exercise.prediction <- rnorm(100)
eval_table <- table(exercise.prediction, classe)
eval_matrix <- as.matrix(eval_table)
transform <- apply(eval_matrix, 1, function(x) sum(x * 1:5))
head(as.data.frame(transform))
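A hedged alternative that skips the table() detour entirely: round the continuous glm predictions to the nearest class and clamp them into 1..5. The pred vector below is made up for illustration, standing in for exercise.prediction.

```r
# hypothetical continuous predictions in place of exercise.prediction
pred <- c(0.7, 2.3, 4.6, 5.8, 3.1)

# round to the nearest class and clamp into the valid 1..5 range
pred_class <- pmin(pmax(round(pred), 1), 5)
pred_class   # 1 2 5 5 3
```

The result can then be compared against the true classes with caret::confusionMatrix(factor(pred_class, levels = 1:5), factor(truth, levels = 1:5)), where truth is the numeric classe vector.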

Generate crosstabulations from dataframe of categorical variables in survey

I've got some survey results and am trying to do some basic cross tabulations. Each column is a type of chemical, and the numbers 0:5 indicate how useful they were.
I'm trying to figure out a nice table that presents the freq and percent. Using table or xtabs, I'm able to get individual results for each column, but I'd like to figure out a way to create a nice table, covering all the chemicals at once, that I can output to LaTeX.
Thanks for any help you can provide.
Data Frame :
df <- read.table(text = "
V1 V2 V3 V4 V5 V6 V7
1 NA NA NA NA NA NA NA
2 0 0 0 0 0 0 0
3 0 0 0 0 0 0 NA
4 NA NA NA NA NA NA 5
5 0 0 0 0 0 2 0
6 NA 4 NA NA NA NA NA
7 0 0 0 0 0 0 0
8 NA NA NA NA NA 3 NA
9 NA 2 NA NA NA 3 NA
10 NA 4 NA NA NA NA NA
11 0 0 0 0 0 0 0
12 0 0 0 0 0 0 0
13 0 0 0 0 0 0 0
14 NA NA NA NA NA 2 3
15 NA 3 NA 3 NA NA NA
16 NA 4 NA NA NA NA NA
17 0 0 0 0 0 0 0
18 NA 5 NA 5 NA NA NA
19 0 0 0 0 0 0 0
20 NA 1 NA NA NA NA NA", header = T)
Desired Output (Accurate numbers for V1 and V2) :
V1 V2 etc....
Freq Percent Freq Percent
No 9 100 9 56.2
Poor 0 0 1 6.2
Somewhat effective 0 0 1 6.2
Good 0 0 1 6.2
Very Good 0 0 3 18.75
NA 0 0 1 6.2
Here, we get the frequency for each column using lapply and table: lapply treats the data.frame as a list and applies table after converting each column to a factor with levels 0:5. prop.table gives the proportions, cbind combines Freq and Percent, do.call(cbind, ...) collapses the list into a single matrix, and finally we fix the row and column names.
res <- do.call(cbind,lapply(df, function(x) {
x1 <- table(factor(x, levels=0:5,
labels=c('No', 'Poor', 'Somewhat Effective',
'Good', 'Very Good', 'NA') ))
cbind(Freq=x1, Percent=round(100*prop.table(x1),2))}))
colnames(res) <- paste(rep(paste0('V',1:7),each=2),
colnames(res),sep=".")
head(res,2)
# V1.Freq V1.Percent V2.Freq V2.Percent V3.Freq V3.Percent V4.Freq
#No 9 100 9 56.25 9 100 9
#Poor 0 0 1 6.25 0 0 0
# V4.Percent V5.Freq V5.Percent V6.Freq V6.Percent V7.Freq V7.Percent
#No 81.82 9 100 8 66.67 8 80
#Poor 0.00 0 0 0 0.00 0 0
I'm not a regular "dplyr" or "tidyr" user, so I'm not sure if this is the best approach using those tools (but it seems to work):
library(dplyr)
library(tidyr)
df %>%
gather(var, val, V1:V7) %>% ## Make the data long
na.omit() %>% ## We don't need the NAs
## Factor the "value" column
mutate(val = factor(val, 0:5, c("No", "Poor", "Somewhat Effective",
"Good", "Very Good", "NA"))) %>%
group_by(val, var) %>% ## Group by val and var
summarise(Freq = n()) %>% ## Get the count
group_by(var) %>% ## Group just by var now
mutate(Pct = Freq/sum(Freq) * 100) %>% ## Calculate the percent
gather(R1, R2, Freq:Pct) %>% ## Go long again....
unite(Var, var, R1) %>% ## Combine the var and R1 cols
spread(Var, R2, fill = 0) ## Go wide....
# Source: local data frame [6 x 15]
#
# val V1_Freq V1_Pct V2_Freq V2_Pct V3_Freq V3_Pct V4_Freq
# 1 No 9 100 9 56.25 9 100 9
# 2 Poor 0 0 1 6.25 0 0 0
# 3 Somewhat Effective 0 0 1 6.25 0 0 0
# 4 Good 0 0 1 6.25 0 0 1
# 5 Very Good 0 0 3 18.75 0 0 0
# 6 NA 0 0 1 6.25 0 0 1
# Variables not shown: V4_Pct (dbl), V5_Freq (dbl), V5_Pct (dbl), V6_Freq
# (dbl), V6_Pct (dbl), V7_Freq (dbl), V7_Pct (dbl)
The "data.table" approach is similar in terms of the series of steps you'll have to go through.
library(data.table)
library(reshape2)
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")
DT <- melt(as.data.table(df, keep.rownames = TRUE),
id.vars = "rn", na.rm = TRUE)
DT <- DT[, value := factor(value, 0:5, levs)
][, list(Freq = .N), by = list(variable, value)
][, Pct := Freq/sum(Freq) * 100, by = list(variable)]
dcast.data.table(melt(DT, id.vars = c("variable", "value")),
value ~ variable + variable.1,
value.var = "value.1", fill = 0)
OK, one more... (a variant on @akrun's answer)
library(gdata) ## For "interleave"
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")
x1 <- sapply(lapply(df, factor, 0:5, levs), table)
t(interleave(t(x1), t(prop.table(x1, 2))))
### Or, skipping the transposing....
## library(SOfun) ## For "Riffle" which is like "interleave"
## Riffle(x1, prop.table(x1, 2) * 100)
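And a final hedged sketch without gdata: cbind the frequency and percentage matrices, then reorder the columns so each Freq is immediately followed by its Percent. Illustrated on a small made-up data frame, not the question's full data.

```r
levs <- c("No", "Poor", "Somewhat Effective", "Good", "Very Good", "NA")

# small hypothetical data frame standing in for the survey columns
df <- data.frame(V1 = c(0, 0, NA, 1), V2 = c(0, 4, 2, NA))

x1  <- sapply(lapply(df, factor, 0:5, levs), table)
pct <- round(prop.table(x1, 2) * 100, 2)

# interleave Freq/Percent columns: order(rep(1:2, 2)) gives 1,3,2,4
res <- cbind(x1, pct)[, order(rep(seq_len(ncol(x1)), 2))]
colnames(res) <- paste(rep(names(df), each = 2), c("Freq", "Pct"), sep = ".")
res
```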
