I have two columns that I wish to combine. I would usually sum both columns. However, in this case I need to retain the original NAs and combine only the numerical values from the two columns into one. I have tried ifelse statements, but because I am working with two columns, combining them with nested ifelse calls is tricky.
Here is my example data:
# Example
data <- c(80,7.692307692,
7.692307692
,8.333333333
,9.090909091
,20
,27.27272727
,50
,50
,21.42857143
,58.33333333
,46.66666667
,78.06451613
,186.15384615
,42.85714286
,44.1860465,
20,
25,
40,45,78,55)
df <- data.frame(data)
df$long <- ifelse(df$data <20,1,0) # print 1s
df$long_exit <- df$data <=70 # logical
df$long_sigs <- ifelse(df$long_exit == FALSE,0,NA) # convert FALSE to 0 and turn everything else into NA
desired_sum_output <- c(0,1,1,1,1,NA,NA,NA,NA,NA,NA,NA,0,0,NA,NA,NA,NA,NA,NA,0,NA)
df <- data.frame(df,desired_sum_output)
#sum df$long + df$long_sigs to and retain 0
sums <- df$long + df$long_sigs # summing
> df
data long long_exit long_sigs desired_sum_output
1 80.000000 0 FALSE 0 0
2 7.692308 1 TRUE NA 1
3 7.692308 1 TRUE NA 1
4 8.333333 1 TRUE NA 1
5 9.090909 1 TRUE NA 1
6 20.000000 0 TRUE NA NA
7 27.272727 0 TRUE NA NA
8 50.000000 0 TRUE NA NA
9 50.000000 0 TRUE NA NA
10 21.428571 0 TRUE NA NA
11 58.333333 0 TRUE NA NA
12 46.666667 0 TRUE NA NA
13 78.064516 0 FALSE 0 0
14 186.153846 0 FALSE 0 0
15 42.857143 0 TRUE NA NA
16 44.186047 0 TRUE NA NA
17 20.000000 0 TRUE NA NA
18 25.000000 0 TRUE NA NA
19 40.000000 0 TRUE NA NA
20 45.000000 0 TRUE NA NA
21 78.000000 0 FALSE 0 0
22 55.000000 0 TRUE NA NA
This is usually reasonable:
df$z <- rowSums(df[, c("long", "long_sigs")], na.rm=TRUE)
However, I lose my NA positioning.
The reason for retaining the NAs is that I will use na.locf from the zoo package to forward fill the NA values.
If I understand correctly, you want to sum with na.rm = TRUE where long = 1, otherwise retain the NA. So this should create z = desired_sum_output:
df$z <- ifelse(df$long == 1,
rowSums(df[, c("long", "long_sigs")], na.rm=TRUE),
rowSums(df[, c("long", "long_sigs")]))
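The same rule can also be written without rowSums. This is only a rough sketch on shortened stand-in vectors (`long`, `long_sigs`, and `sigs_filled` are illustrative names, not columns of the full example): treat a missing signal as 0 only where long is 1, and let the NA propagate everywhere else.

```r
# Shortened stand-ins for df$long and df$long_sigs (assumed values)
long      <- c(0, 1, 1, 0, 0)
long_sigs <- c(0, NA, NA, NA, 0)

# Replace NA by 0 in a helper copy; it is only used where long == 1
sigs_filled <- ifelse(is.na(long_sigs), 0, long_sigs)

# Where long == 1 ignore the NA, otherwise keep it
z <- ifelse(long == 1, long + sigs_filled, long + long_sigs)
print(z)
```

The remaining NAs stay in place, so a forward fill can still be applied afterwards.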
I would like to get a count table from several values in my dataframe, which are spread over several columns.
In my case, several of my subjects have one or more types of medication. Now they are mentioned in several columns. I would instead like to have each medication name in a separate column, and a 1 or 0 (a count) per subject if they have it or not.
For example, my dataframe now looks like:
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA
I would like it to look like:
Pp trimipramin escitalopram quetiapin venlafaxin duloxetin …
1 0 0 0 0 0 …
2 1 0 0 0 0 …
3 0 1 1 1 0 …
4 0 0 0 0 1 …
5 0 1 1 0 0 …
6 0 0 0 0 0 …
7 0 1 0 1 1 …
Would it be possible to get such a count table added to my dataframe and not separately?
I assume tidyr's spread() might be useful in this case, but I couldn't get it to work :-(
Any help?
The table function in base is a simple method for creating this type of binary indicator matrix.
library(tidyverse)
D %>%
gather(key, drug, -Pp) %>%
select(-key) %>%
table
#>Pp amitriptylin citalopram duloxetin escitalopram Lithium ...
#> 1 0 0 0 0 0 ...
#> 2 0 0 0 0 0 ...
#> 3 0 0 0 1 0 ...
#> 4 1 0 1 0 0 ...
#> 5 0 1 0 1 1 ...
#> 6 0 0 0 0 0 ...
#> 7 0 0 1 1 0 ...
You can also do it using spread instead, but this solution will remove patients without any associated drugs:
D %>%
gather(key, value, -Pp) %>%
filter(!is.na(value)) %>%
mutate(key=1) %>%
spread(value, key, fill=0, drop = FALSE)
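For comparison, the table() idea can also be sketched in base R only, which additionally shows one way to attach the counts to the original data frame. The small data frame `D` and the drug names "a" and "b" below are made up for illustration; `long` and `counts` are hypothetical helper names.

```r
# Toy stand-in for the medication data (assumed, not the real data)
D <- data.frame(Pp = 1:3,
                X  = c(NA, "a", "b"),
                X2 = c(NA, NA, "a"),
                stringsAsFactors = FALSE)

# Stack all medication columns into one long column, repeating Pp per column
long <- data.frame(Pp   = rep(D$Pp, times = ncol(D) - 1),
                   drug = unlist(D[-1], use.names = FALSE))

# table() drops the NA drugs but keeps every Pp level, so subjects
# without medication still get a row of zeros
counts <- table(long$Pp, long$drug)

# Bind the indicator columns onto the original data frame
result <- cbind(D, as.data.frame.matrix(counts))
print(result)
```

If a subject could list the same drug twice, the cells would hold counts above 1; comparing with `counts > 0` would turn them back into indicators.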
library(dplyr)
library(tidyr)
library(reshape2)
data %>% gather(key = "Med", value = "Value", -Pp) %>% select(-Med) %>%
dcast(Pp~Value,fun.aggregate = function(x){as.integer(length(x) > 0)})
Data
data <- read.table(text="
Pp X X2 X3 X4
1 NA NA NA NA
2 trimipramin NA NA NA
3 Quetiapin venlafaxin escitalopram NA
4 duloxetin amitriptylin NA NA
5 quetiapin citalopram escitalopram Lithium
6 NA NA NA NA
7 duloxetin escitalopram venlafaxin NA",header=T, stringsAsFactors = F)
I do not know if it is the easiest solution, but it works. First you have to get all medications in a vector. I do that with unlist and unique. Then you have to loop through them and get the number of cases. I do that with the apply function and the sum of matches. As you want it in the same data.frame, you can store it with data[[name]]. This solution works without any packages.
names = as.character(na.omit(unique(unlist(data))))
for(name in names){
data[[name]] = apply(data, FUN = function(x) { sum(x == name, na.rm = T) }, 1)
}
Data
data = read.table(text = "X X2 X3 X4
NA NA NA NA
trimipramin NA NA NA
Quetiapin venlafaxin escitalopram NA
duloxetin amitriptylin NA NA
quetiapin citalopram escitalopram Lithium
NA NA NA NA
duloxetin escitalopram venlafaxin NA", header = T, stringsAsFactors = F)
I have looked through many similar questions here but can't find an answer that addresses this situation. My data frame is like this:
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1
Where SET is purposefully duplicated many times, with each record including a single SP and values for multiple TRIALS (T1-3).
What I desire is a wide dataframe like the following. There is to be NO summation/averaging/mathematical operation of any kind:
SET DOG_T1 DOG_T2 DOG_T3 CAT_T1 CAT_T2 CAT_T3 BIRD_T1 BIRD_T2 BIRD_T3
142 1 0 0 0 NA 4 5 0 NA
255 NA NA NA 2 0 0 NA 3 0
336 1 0 0 0 0 6 0 0 0
66 NA 22 1 NA NA NA NA NA NA
I have tried the following, but melt and dcast report that the aggregation function is defaulting to length. This turns the SET variable into different numbers and only fills in 0s and 1s for the value.
df %>%
group_by(SET, SP) %>%
melt(id.vars = c('SET','SP')) %>%
data.table::dcast(SP + variable ~ SET, fun.aggregate = NULL, value.var = 'value')
This works when I DON'T have any duplicate SETs, but fails as soon as I include the full dataset.
A note: my real data frame is about 2.5 million rows, so speed is of concern.
This is a situation where dcast.data.table excels. It allows for multiple arguments as 'value.var', allowing for very concise syntax:
library(data.table)
setDT(df) # several value.var columns are only supported for a data.table
dcast(df, SET ~ SP, value.var=c("T1", "T2", "T3"))
# SET T1_bird T1_cat T1_dog T2_bird T2_cat T2_dog T3_bird T3_cat T3_dog
#1: A 5 0 1 0 NA 0 NA 4 0
#2: B NA 2 NA 3 0 NA 0 0 NA
#3: C 0 0 1 0 0 0 0 6 0
#4: D NA NA NA NA NA 22 NA NA 1
As per @lukeA above, but add the fun.aggregate = identity or fun.aggregate = list argument to the dcast() function call.
You could try
library(tidyverse)
df <- read.table(header=T, text="
SET SP T1 T2 T3
A dog 1 0 0
A cat 0 NA 4
A bird 5 0 NA
B cat 2 0 0
B bird NA 3 0
C dog 1 0 0
C cat 0 0 6
C bird 0 0 0
D dog NA 22 1")
df %>%
gather(var, val, -(1:2)) %>%
unite("SP", SP, var) %>%
spread(SP, val)
# SET bird_T1 bird_T2 bird_T3 cat_T1 cat_T2 cat_T3 dog_T1 dog_T2 dog_T3
# 1 A 5 0 NA 0 NA 4 1 0 0
# 2 B NA 3 0 2 0 0 NA NA NA
# 3 C 0 0 0 0 0 6 1 0 0
# 4 D NA NA NA NA NA NA NA 22 1
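If neither data.table nor the tidyverse is at hand, base R's reshape() can produce the same wide layout without any aggregation. This is a sketch that assumes each SET/SP pair occurs only once, as in the example data; with repeated pairs reshape() keeps only the first match, so an extra index column would be needed. The name `wide` is just illustrative.

```r
# Example data from the question, typed out as a data frame
df <- data.frame(
  SET = c("A", "A", "A", "B", "B", "C", "C", "C", "D"),
  SP  = c("dog", "cat", "bird", "cat", "bird", "dog", "cat", "bird", "dog"),
  T1  = c(1, 0, 5, 2, NA, 1, 0, 0, NA),
  T2  = c(0, NA, 0, 0, 3, 0, 0, 0, 22),
  T3  = c(0, 4, NA, 0, 0, 0, 6, 0, 1),
  stringsAsFactors = FALSE
)

# One row per SET, one column per trial/species pair (named like T1.dog)
wide <- reshape(df, idvar = "SET", timevar = "SP", direction = "wide")
print(wide)
```

Combinations that never occur, such as dog in SET B, come out as NA.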
Since my current reputation doesn't allow adding a comment to @lukeA's answer above, I am making this a new answer, which is more of a suggestion:
using the data.table function setcolorder one could have the columns reordered as "bird_T1, cat_T1, dog_T1, bird_T2, cat_T2 etc." by using a custom function similar to
newOrder <- function() {
lapply(1:max(index)
, function(i) grep(
sprintf('%s', i)
, names(DT), value = TRUE)
)}
where index is the index created in order to allow dcast.data.table for duplicated keys such as the SET variable in the initial table above:
DT[, index := 1:.N, by = SET]
finally the new order is achieved in regular manner:
setcolorder(dcast(DT), neworder = newOrder())
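The reordering itself only needs a character vector of column names in the desired order. A minimal base R sketch, assuming the columns are named species_trial as in the spread() output (`cols`, `trial`, and `new_order` are illustrative names):

```r
# Column names as produced by the species_trial naming scheme (assumed)
cols <- c("bird_T1", "bird_T2", "cat_T1", "cat_T2", "dog_T1", "dog_T2")

# Strip everything up to the last underscore to get the trial label
trial <- sub("^.*_", "", cols)

# Sort by trial first, then by the full name within each trial
new_order <- cols[order(trial, cols)]
print(new_order)
```

A vector like `new_order` could then be passed to setcolorder() together with the key column, or used for plain column indexing on a data frame.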
I'm trying to understand the results of a prediction object via caret's confusionMatrix() function, which requires table input according to http://artax.karlin.mff.cuni.cz/r-help/library/caret/html/confusionMatrix.html. My table() call creates results that I understand, but they are not friendly to the confusionMatrix() function.
Here is the relevant code snippet:
#MODEL CREATION
#convert categorical A to E values , into numeric 1 to 5 in order to be regression friendly
training_data_subset_numeric <- training_data_subset;
testing_data_subset_numeric <- testing_data_subset;
training_data_subset_numeric$classe <- as.numeric(training_data_subset$classe)
testing_data_subset_numeric$classe <- as.numeric(testing_data_subset$classe)
#model
exercise.model <- glm(formula = classe ~ ., data = training_data_subset_numeric)
#MODEL EVALUATION
exercise.prediction <- predict(exercise.model,newdata = testing_data_subset_numeric)
eval_table <- table(exercise.prediction,testing_data_subset$classe)
tail(eval_table)
exercise.prediction A B C D E
4.35504232913594 1 0 0 0 0
4.47219097065568 1 0 0 0 0
4.50838854075835 1 0 0 0 0
4.6173551930011 0 1 0 0 0
4.69261223447305 0 1 0 0 0
4.73297946213265 0 1 0 0 0
Basically I need to convert the above output to a data frame with one column holding the predicted value, following this rule:
If column A is 1, then the predicted value is 1
If column B is 1, then the predicted value is 2
If column C is 1, then the predicted value is 3
If column D is 1, then the predicted value is 4
If column E is 1, then the predicted value is 5
I therefore, wrote this function to get the job done:
getPredictResults<- function(x)
{
# create 1 column & n row data frame
num <- data.frame(matrix(0, ncol = 1, nrow = nrow(x)));
for (r in 1:nrow(x) ) {
for (c in 1:ncol(x) ) {
#if column A has value 1 than num[1,r] <- 1
if (x[r,'A']== 1)
{
num[1,r] <- 1;
}
#if column B has value 1 than num[1,r] <- 2
else if (x[r,'B']== 1)
{
num[1,r] <- 2;
}
#if column C has value 1 than num[1,r] <- 3
else if (x[r,'C']== 1)
{
num[1,r] <- 3;
}
#if column D has value 1 than num[1,r] <- 4
else if (x[r,'D']== 1)
{
num[1,r] <- 4;
}
#if column E has value 1 than num[1,r] <- 5
else if (x[r,'E']== 1)
{
num[1,r] <- 5;
}
else
{
}
}#end inner for
}#end outer for
return (num);
}#end function
exercise.prediction_df <- getPredictResults(eval_table)
However, when typing:
head(exercise.prediction_df)
I'm getting unusual output. Here is the bottom snippet:
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4840 V4841 V4842 V4843 V4844 V4845 V4846 V4847 V4848 V4849 V4850 V4851 V4852 V4853 V4854 V4855 V4856 V4857
1 5 1 4 5 2 2 5 5 1 2 5 4 5 5 1 5 5 4
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4858 V4859 V4860 V4861 V4862 V4863 V4864 V4865 V4866 V4867 V4868 V4869 V4870 V4871 V4872 V4873 V4874 V4875
1 4 2 1 2 5 1 4 5 2 1 4 5 2 4 2 4 4 2
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4876 V4877 V4878 V4879 V4880 V4881 V4882 V4883 V4884 V4885 V4886 V4887 V4888 V4889 V4890 V4891 V4892 V4893
1 5 1 1 4 1 2 2 1 1 5 1 4 1 1 1 1 1 1
2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
V4894 V4895 V4896 V4897 V4898 V4899 V4900 V4901 V4902 V4903 V4904
1 1 1 1 1 1 1 1 1 2 2 2
2 NA NA NA NA NA NA NA NA NA NA NA
[ reached getOption("max.print") -- omitted 4 rows ]
Further investigation shows:
> ncol(exercise.prediction_df)
[1] 4904
> nrow(exercise.prediction_df)
[1] 4904
ncol() should return only 1, while nrow() can obviously be any integer value.
How can I fix this function, in order to create the right dataframe as an input to confusionMatrix() function?
Thanks.
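# Simulated stand-in data: 100 observations in five classes A to E
classe <- cut(runif(100), seq(0, 1, length.out = 6)) # 6 break points give 5 intervals
levels(classe) <- c("A", "B", "C", "D", "E")
exercise.prediction <- rnorm(100)
eval_table <- table(exercise.prediction, classe)
eval_matrix <- as.matrix(eval_table)
# Each row holds a single 1, so the weighted row sum is the class index (A = 1, ..., E = 5)
predicted_class <- apply(eval_matrix, 1, function(x) sum(x * 1:5))
head(as.data.frame(predicted_class))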
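As for the function in the question, the 4904 columns most likely come from writing to num[1, r], which addresses row 1 and column r; num[r, 1] would address row r of the single column. A loop-free sketch of the same mapping, assuming every row of the table contains exactly one 1 (the small matrix below is made up for illustration):

```r
# Toy indicator matrix with one 1 per row (assumed stand-in for eval_table)
eval_matrix <- matrix(c(1, 0, 0, 0, 0,
                        0, 1, 0, 0, 0,
                        0, 0, 0, 0, 1),
                      nrow = 3, byrow = TRUE,
                      dimnames = list(NULL, c("A", "B", "C", "D", "E")))

# max.col() returns, per row, the index of the column holding the maximum,
# which here is the position of the 1 (A = 1, ..., E = 5)
predicted <- data.frame(predicted = max.col(eval_matrix, ties.method = "first"))
print(predicted)
```

The result has one column and one row per prediction, which is the shape the question asks for.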
My data is structured as follows:
library(data.table)
DT <- data.table(Id=c(1,2,3,4,5), Va1=c(3,13,NA,NA,NA), Va2=c(4,40,NA,NA,4), Va3=c(5,34,NA,7,84),
Va4=c(2,23,NA,63,9), Vb1=c(8,45,1,7,0), Vb2=c(0,35,0,7,6), Vb3=c(63,0,0,0,5), Vc1=c(2,5,0,0,4))
>DT
Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1
1: 1 3 4 5 2 8 0 63 2
2: 2 13 40 34 23 45 35 0 5
3: 3 NA NA NA NA 1 0 0 0
4: 4 NA NA 7 63 7 7 0 0
5: 5 NA 4 84 9 0 6 5 4
Additionally, I have a reference list that references all the column groups:
reference <- list(g.1=c(2,3,4,5), g.2=c(6,7,8), g.3=c(9))
Columns 2,3,4,5 (variables Va1, Va2, Va3, and Va4) belong to one group of variables. Columns 6,7,8 (variables Vb1, Vb2, Vb3) belong to a second group. Column 9 (variable Vc1) belongs to a third group.
What I need to do is calculate the difference between consecutive columns within column groups.
I.e. I need to find the difference between Va2 and Va1, and between Va3 and Va2, etc... but not between Vb1 and Va4.
The output should look like:
Id Va1 Va2 Va3 Va4 Vb1 Vb2 Vb3 Vc1 D[Va1:Va2] D[Va2:Va3] D[Va3:Va4] D[Vb1:Vb2] D[Vb2:Vb3]
1: 1 3 4 5 2 8 0 63 2 1 1 -3 -8 63
2: 2 13 40 34 23 45 35 0 5 27 -6 -11 -10 -35
3: 3 NA NA NA NA 1 0 0 0 NA NA NA -1 0
4: 4 NA NA 7 63 7 7 0 0 NA NA 56 0 -7
5: 5 NA 4 84 9 0 6 5 4 NA 80 -75 6 -1
Currently I am using the following loop:
for(i in 1:(length(reference)-1)){
tmp <- NULL
tmp <- as.list(reference[[i]])
tmp <- tmp[-length(tmp)]
tmp <- mapply(c, lapply(tmp, FUN = function(x) x+1), tmp, SIMPLIFY=FALSE)
for(j in 1:length(tmp)){
data <- cbind(data, delta = data[, tmp[[j]][1], with = F] - data[, tmp[[j]][2], with = F])
}
}
but my real data.table has 300-500 columns and +1'000'000 rows.
How can I make this more efficient?
I think your loop is fine, except you should use := instead of cbind to add columns:
ref <- lapply(reference,function(x) names(DT)[x])
for (g in ref){
if (length(g)==1) next
gx = tail(g,-1)
gy = head(g,-1)
gn = paste0("D[",gy,":",gx,"]")
DT[,(gn) := mapply(function(x,y).SD[[x]]-.SD[[y]], gx, gy, SIMPLIFY=FALSE)]
}
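The head/tail pairing used above does not depend on data.table. As a rough base R sketch on a cut-down data frame (the data and the names `left`, `right`, and `new_names` are illustrative assumptions), whole column blocks can be subtracted at once:

```r
# Cut-down stand-in for the data: two rows, three column groups
df <- data.frame(Id = 1:2,
                 Va1 = c(3, 13), Va2 = c(4, 40), Va3 = c(5, 34),
                 Vb1 = c(8, 45), Vb2 = c(0, 35),
                 Vc1 = c(2, 5))
reference <- list(g.1 = 2:4, g.2 = 5:6, g.3 = 7)

for (idx in reference) {
  if (length(idx) < 2) next          # a single column has no neighbour
  left  <- idx[-length(idx)]         # all but the last column of the group
  right <- idx[-1]                   # all but the first column of the group
  new_names <- paste0("D[", names(df)[left], ":", names(df)[right], "]")
  # Block-wise difference: each column minus its left neighbour
  df[new_names] <- df[right] - df[left]
}
print(df)
```

Because the new columns are appended at the end, the positions stored in `reference` stay valid throughout the loop.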
Still getting the gist of R. I have two data frames where the rows are named with different coordinates (e.g. x_1013y_41403; see below). The coordinates form sets of five, each set makes a cross if plotted onto a grid. The center coordinate is in one data frame, and the four peripheral coordinates are in the other.
Center A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA
Peripheral A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0
To form a set, the peripheral coordinates would have the same x or y location as the center coordinate. For example, the center coordinate x_723y_6363 would be associated with x_723y_6100 and x_723y_6627 (same x location), as well as x_459y_6363 and x_987y_6363 (same y location).
I would like to combine the coordinates into their respective sets, and name the set with the center coordinate. For the case above, I would end up with two rows, where each row is the summation of a set.
A B C D E F
x_723y_6363.txt 554 5 604 1 645 8
x_749y_41403.txt 14 4 6 0 13 0
I am not sure at all how this can be done. I have thought about creating regular expressions to pick out the x and y coordinates individually and then doing a comparison across the two data frames. Any help would be greatly appreciated!
I hope someone else comes up with a better answer, as this is ugly. I would first split the .txt names into x and y values, then loop over each of the variables that is NA in center and sum all values that share an x or y value with that center. Edit: Changed the sapply to make it slightly nicer.
center <- read.table(textConnection("
A B C D E F
x_723y_6363.txt 554 NA 604 NA 645 NA
x_749y_41403.txt 14 NA 6 NA 13 NA"),
header = TRUE)
peripheral <- read.table(textConnection("
A B C D E F
x_1013y_41403.txt NA 1 NA 0 NA 0
x_459y_6363.txt NA 2 NA 1 NA 4
x_485y_41403.txt NA 0 NA 0 NA 0
x_723y_6100.txt NA 1 NA 0 NA 3
x_723y_6627.txt NA 1 NA 0 NA 1
x_749y_41139.txt NA 1 NA 0 NA 0
x_749y_41667.txt NA 2 NA 0 NA 0
x_987y_6363.txt NA 1 NA 0 NA 0"),
header = TRUE)
xpat <- "^([^y]+).*"
ypat <- ".*(y_[0-9]+)\\.txt"
center$x <- gsub(xpat, "\\1", rownames(center))
center$y <- gsub(ypat, "\\1", rownames(center))
peripheral$x <- gsub(xpat, "\\1", rownames(peripheral))
peripheral$y <- gsub(ypat, "\\1", rownames(peripheral))
vars <- c("B", "D", "F")
center[vars] <- sapply(peripheral[vars], function(col)
apply(center, 1, function(row) sum(col[peripheral$x %in% row["x"] | peripheral$y %in% row["y"]]) )
)
R> center
A B C D E F x y
x_723y_6363.txt 554 5 604 1 645 8 x_723 y_6363
x_749y_41403.txt 14 4 6 0 13 0 x_749 y_41403
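The coordinate matching can be checked in isolation. A small sketch, assuming every name follows the x_<number>y_<number>.txt pattern shown in the question (`ids`, `center_pos`, and `same_set` are illustrative names):

```r
# A few row names from the question; the first one is a center
ids <- c("x_723y_6363.txt", "x_723y_6100.txt",
         "x_459y_6363.txt", "x_749y_41403.txt")

# Pull out the numeric x and y parts with regular expressions
x <- as.integer(sub("^x_([0-9]+)y_.*$", "\\1", ids))
y <- as.integer(sub("^.*y_([0-9]+)\\.txt$", "\\1", ids))

# A point belongs to the set of a center if it shares its x or its y
center_pos <- 1
same_set <- x == x[center_pos] | y == y[center_pos]
print(data.frame(ids, x, y, same_set))
```

Comparing numbers rather than strings avoids accidental matches between coordinates that merely share a prefix.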
Another option:
# function to split coordinates x and y:
f <- function(DF) structure(
t(sapply(strsplit(row.names(DF), "[_y.]"), `[`, c(2,4))),
dimnames=list(NULL, c("x", "y")))
# get x and y for peripheral data:
P <- cbind(Peripheral, f(Peripheral))
# get x and y for centers, and mark ids:
C <- cbind(Center, f(Center), id=1:nrow(Center))
# matching:
Q <- merge(merge(P, C[,c("x","id")], all=TRUE), C[,c("y","id")], by="y", all=TRUE)
# prepare for union:
R <- within(Q, {id <- ifelse(is.na(id.y), id.x, id.y); id.x <- NULL; id.y <- NULL})
# join everything and aggregate:
S <- rbind(R, C)
aggregate(S[,3:8], by=list(id=S$id), FUN=sum, na.rm=TRUE)
Result:
id A B C D E F
1 1 554 5 604 1 645 8
2 2 14 4 6 0 13 0