How to add an aggregated variable to an existing dataset in R - r

How do you add a variable to a dataset using the aggregate and by commands? For example, I have:
num x1
1 1
1 0
2 0
2 0
And I'm looking to create a variable that flags every num group in which any x1 is 1, for example:
num x1 x2
1 1 1
1 0 1
2 0 0
2 0 0
or
num x1 x2
1 1 TRUE
1 0 TRUE
2 0 FALSE
2 0 FALSE
I've tried to use
df$x2 <- aggregate(df$x1, by = list(df$num), FUN = sum)
But I'm getting an error that says the replacement has a different number of rows than the data. Can anyone help?

This can be done by grouping by 'num' and checking whether any element of 'x1' equals 1. The ave function from base R is more convenient here than aggregate, because it returns one value per row rather than one per group:
df1$x2 <- with(df1, ave(x1 == 1, num, FUN = any))
df1$x2
#[1]  TRUE  TRUE FALSE FALSE
# wrap in as.integer() for a 0/1 column instead
as.integer(df1$x2)
#[1] 1 1 0 0
Or, using dplyr, we group by 'num' and create 'x2' by checking whether any 'x1' equals 1. Without the as.integer wrapper the result is a logical vector; as.integer converts it to 0/1.
library(dplyr)
df1 %>%
group_by(num) %>%
mutate(x2 = as.integer(any(x1==1)))
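The error in the question arises because aggregate returns one row per group, not one per original row. A minimal base R sketch, assuming the example data is stored in a data frame called df1 (the names agg and merged are illustrative, not from the original posts), shows how an aggregated result could still be attached by merging it back on num:

```r
# Example data from the question
df1 <- data.frame(num = c(1, 1, 2, 2), x1 = c(1, 0, 0, 0))

# aggregate() collapses to one row per group, so its result
# cannot be assigned directly as a column of df1
agg <- aggregate(x1 ~ num, data = df1,
                 FUN = function(v) as.integer(any(v == 1)))
names(agg)[names(agg) == "x1"] <- "x2"

# merge() copies the per-group value back onto every row of that group
merged <- merge(df1, agg, by = "num")
merged
```

Whether this is preferable to ave depends on taste; ave avoids the intermediate table, while the merge version makes the per-group summary explicit.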

Related

Best option to search for characters in a column using R

I have a df with a column that contains different ICD-10 codes, each consisting of 4 alphanumeric characters. I want to search for specific codes based on just the first two characters. For example, if this is the column
codes = c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
and I want to flag all rows whose code starts with S0, S1, T0, T9 or T1, assigning 1 if it does and 0 if not. I have previously used %like% with case_when, but I would like to know if there is a more efficient way to do this in R. Thanks
Use grepl() to test a regular expression that returns TRUE for any string starting with s0, s1, T0, T1 or T9 and FALSE otherwise. Then use ifelse() to turn that vector of TRUEs and FALSEs into 1 for the TRUEs and 0 otherwise. Note that the match is case-sensitive, so "t985" is not matched.
codes <- c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
ifelse(grepl("^s[01]|^T[019]", codes), 1, 0)
Output:
[1] 1 1 0 1 1 0 0 0
Can also do:
as.numeric(grepl("^s[01]|^T[019]", codes))
We can use
+(grepl("^s[01]|^T[019]", codes))
[1] 1 1 0 1 1 0 0 0
We could define a pattern you want to detect and then use str_detect and assign 1 to TRUE and 0 to FALSE:
library(dplyr)
library(stringr)
# your dataframe with codes column
df <- data.frame(codes = c("s001", "s1234", "s4g6",
"T002", "T191","t985",
"s761","t17.5"))
# define what you want to search for (anchored at the start, case-sensitive)
search_pattern <- "^(s0|s1|T0|T9|T1)"
# check with `str_detect`
df %>%
mutate(check = ifelse(str_detect(codes, search_pattern), 1, 0))
Output:
codes check
1 s001 1
2 s1234 1
3 s4g6 0
4 T002 1
5 T191 1
6 t985 0
7 s761 0
8 t17.5 0
Another option with grepl
> +grepl("^([sT][01]|T9)", codes)
[1] 1 1 0 1 1 0 0 0
You can also use the substring approach. Extract only first 2 characters from the codes using substr and compare it against the correct_values.
codes = c("s001", "s1234", "s4g6", "T002", "T191","t985","s761","t17.5")
correct_values <- c("s0", "s1", "T0", "T9", "T1")
as.integer(substr(codes, 1, 2) %in% correct_values)
#[1] 1 1 0 1 1 0 0 0
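The grepl and substr answers above compare case-sensitively, which is why "t985" and "t17.5" get 0. If the codes should instead be matched regardless of case (an assumption; the question does not say so explicitly), a sketch along these lines might work. The names prefixes, flag_regex and flag_substr are illustrative.

```r
codes <- c("s001", "s1234", "s4g6", "T002", "T191", "t985", "s761", "t17.5")

# Option 1: case-insensitive regular expression anchored at the start
flag_regex <- as.integer(grepl("^(s[01]|t[019])", codes, ignore.case = TRUE))

# Option 2: upper-case the first two characters before comparing
prefixes <- c("S0", "S1", "T0", "T1", "T9")
flag_substr <- as.integer(toupper(substr(codes, 1, 2)) %in% prefixes)

flag_regex
# [1] 1 1 0 1 1 1 0 1
```

Both options give the same result here; the substr version avoids regular expressions entirely, which can be easier to maintain when the list of prefixes is long.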

How to reference all other columns in R?

I am working with data similar to the data below:
ID <- c("A", "B", "C", "D", "E")
x1 <- c(1,1,1,1,0)
x2 <- c(0,0,1,2,2)
x3 <- c(0,0,0,0,0)
x4 <- c(0,0,0,0,0)
df <- data.frame(ID, x1, x2, x3, x4)
It looks like:
> df
ID x1 x2 x3 x4
1 A 1 0 0 0
2 B 1 0 0 0
3 C 1 1 0 0
4 D 1 2 0 0
5 E 0 2 0 0
I want to create a new column, which is the product of the conditional statement: if x1 == 1 and all the other columns are equal to 0, then it is coded "Positive".
How can I reference all the other columns besides x1 without having to write out the rest of the columns in the conditional statement?
Base R:
df$new <- ifelse(df$x1 == 1 & ## check x1 condition
rowSums(df[, 3:5] != 0) == 0, ## add the logical outcomes by row
"Positive",
"not_Positive")
The second line is a little tricky.
df[,3:5] (or df[,-(1:2)]) selects all the columns except the first two. You could also use subset(df,select=x2:x4) here (although ?subset says "Warning: This is a convenience function intended for use interactively ...")
!=0 tests whether the values are 0 or not, returning TRUE or FALSE
rowSums() adds up the values (FALSE→0, TRUE →1)
the row sum is zero only if every logical value in that row is FALSE, i.e. none of the other columns is different from zero
If there might be NA values then you'll need an na.rm=TRUE in your rowSums() specification
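The steps above can be traced on the example data. This sketch also selects the "other" columns by name with setdiff() rather than by position, which is one way to avoid writing them out; the variable names others, nonzero and n_nonzero are illustrative.

```r
df <- data.frame(ID = c("A", "B", "C", "D", "E"),
                 x1 = c(1, 1, 1, 1, 0),
                 x2 = c(0, 0, 1, 2, 2),
                 x3 = c(0, 0, 0, 0, 0),
                 x4 = c(0, 0, 0, 0, 0))

# every column except the ID and x1
others <- setdiff(names(df), c("ID", "x1"))

# logical matrix: TRUE where a value differs from zero
nonzero <- df[others] != 0

# number of nonzero values per row (values: 0 0 1 1 1)
n_nonzero <- rowSums(nonzero)

df$new <- ifelse(df$x1 == 1 & n_nonzero == 0, "Positive", "not_Positive")
df$new
# [1] "Positive"     "Positive"     "not_Positive" "not_Positive" "not_Positive"
```

Selecting by name keeps the code correct if columns are later reordered or new ones are added.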
With select, there are several options. The one below uses a range (:); in the example, the selected columns run from 'x2' to 'x4' and are adjacent and in order. To select columns by pattern instead, use matches("^x[2-9]$").
The code below builds a logical condition on the single column 'x1' and another with rowSums on the remaining selected columns, joins them with &, and passes the result to case_when as the left-hand side of a two-sided formula, with the replacement value on the right-hand side of the ~. By default, all elements that don't satisfy the condition become NA.
library(dplyr)
df %>%
mutate(new = case_when(x1 == 1 &
rowSums(select(., x2:x4)!= 0) == 0~ 'Positive'))
# ID x1 x2 x3 x4 new
#1 A 1 0 0 0 Positive
#2 B 1 0 0 0 Positive
#3 C 1 1 0 0 <NA>
#4 D 1 2 0 0 <NA>
#5 E 0 2 0 0 <NA>

If any value is present in a column of Dataframe, change the value to 1 else insert 0

I have a dataframe with about 1000 rows and 1000 columns. What I want to do is that if any value is present in any cell of the dataframe then change the value to 1 or else put a 0 in that cell. I am programming in R so a R code would be appreciated. I don't want the value of the T column to change but only for the rest of the columns to change.
For example
I have a dataframe like this :
T A B C D
1 29 90 0 100
2 30 12 76 0
3 0 12 0 32
convert it to :
T A B C D
1 1 1 0 1
2 1 1 1 0
3 0 1 0 1
To ignore the first column, you could combine it with a simple modification of akrun's first solution. For example,
data.frame(df[, 1, drop=FALSE], +(df[,-1] != 0))
We can convert to a logical matrix and coerce it to integer
df1 <- +(df != 0)
Or with replace
replace(df, df != 0, 1)
If we need to do this without taking the first column
df[-1] <- +(df[-1] != 0)
Or with sapply
+(sapply(df, `!=`, 0))
In tidyverse, we can use mutate_all (note that this converts every column, including T)
library(dplyr)
df <- df %>%
mutate_all(~ as.integer(. != 0))
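As a self-contained check on the question's example, here is a base R sketch that leaves the first column untouched. It assumes "a value is present" means "non-missing and not zero"; if the real data has no NA values, the is.na() test is harmless.

```r
df <- data.frame(T = c(1, 2, 3),
                 A = c(29, 30, 0),
                 B = c(90, 12, 12),
                 C = c(0, 76, 0),
                 D = c(100, 0, 32))

# convert every column except the first to a 0/1 indicator;
# NA is treated as "no value" (an assumption, see above)
df[-1] <- lapply(df[-1], function(col) as.integer(!is.na(col) & col != 0))
df
```

Using lapply on df[-1] keeps the result a data frame, whereas +(df[-1] != 0) goes through an intermediate matrix; for 1000 x 1000 numeric data either should be fast.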

Counting occurrences by row

Imagine I have a data.frame (or matrix) with few different values such as this
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
test2 <- test
If I want to add extra columns with counts I could do:
test2$good <- apply(test,1, function(x) sum(x==1))
test2$bad <- apply(test,1, function(x) sum(x==-1))
test2$neutral <- apply(test,1, function(x) sum(x==0))
But if I had many possible values, I would have to write one line per value, which is not elegant.
I've tried with table(), but the output is not easily usable
apply(test,1, function(x) table(x))
and there is a big problem: if a row doesn't contain any occurrence of some value, the result generated by table() doesn't have the same length and the results can't be bound together.
Is there a way to force table() to take that value into account and report zero occurrences?
Then I've thought of using do.call or lapply and merge but it's too difficult for me.
I've also read about dplyr count but I have no clue on how to do it.
Could anyone provide a solution with dplyr or tidyr?
PS: What about a data.table solution?
We could melt the dataset to long format after converting to matrix, get the frequency using table and cbind with the original dataset.
library(reshape2)
cbind(test2, as.data.frame.matrix(table(melt(as.matrix(test2))[-2])))
Or use mtabulate on the transpose of 'test2' and cbind with the original dataset.
library(qdapTools)
cbind(test2, mtabulate(as.data.frame(t(test2))))
Or we can use gather/spread from tidyr after creating row id with add_rownames from dplyr
library(dplyr)
library(tidyr)
add_rownames(test2) %>%
gather(Var, Val, -rowname) %>%
group_by(rn= as.numeric(rowname), Val) %>%
summarise(N=n()) %>%
spread(Val, N, fill=0) %>%
bind_cols(test2, .)
You can use rowSums():
test2 <- cbind(test2, sapply(c(-1, 0, 1), function(x) rowSums(test==x)))
similar to the code in the comment from etienne, but without the call to apply()
Here is the answer using base R.
test <- data.frame(replicate(10,sample(c(-1,0,1),20, replace=T, prob=c(0.2,0.2,0.6))))
testCopy <- test
# find all unique values, note that data frame is a list
uniqVal <- unique(unlist(test))
# the new column names start with Y
for (val in uniqVal) {
test[paste0("Y",val)] <- apply(testCopy, 1, function(x) sum(x == val))
}
head(test)
# X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 Y-1 Y1 Y0
# 1 -1 0 1 1 1 0 -1 -1 1 1 3 5 2
# 2 1 -1 0 1 1 -1 -1 0 0 1 3 4 3
# 3 -1 0 1 0 1 1 1 1 -1 1 2 6 2
# 4 1 1 1 1 0 1 1 0 1 0 0 7 3
# 5 0 -1 1 -1 -1 0 0 1 0 0 3 2 5
# 6 1 1 0 1 1 1 1 1 1 1 0 9 1
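On the specific question of forcing table() to report zero counts: one common approach is to convert each row to a factor with a fixed set of levels before tabulating, so every row yields a table of the same length. The sketch below uses a small fixed data frame instead of random data so the result is reproducible; vals and counts are illustrative names.

```r
test <- data.frame(X1 = c(1, 0, -1),
                   X2 = c(1, 1, -1),
                   X3 = c(0, 1, -1))

# all values that can occur, in a fixed order
vals <- sort(unique(unlist(test)))

# factor() with explicit levels makes table() keep zero counts,
# so every row gives a vector of the same length
counts <- t(apply(test, 1, function(r) table(factor(r, levels = vals))))

result <- cbind(test, counts)
result
```

The count columns are named after the values themselves ("-1", "0", "1"); rename them if syntactic column names are needed.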

Finding if boolean is ever true by groups in R

I want a simple way to create a new variable determining whether a boolean is ever true in R data frame.
Here is an example:
Suppose in the dataset I have 2 variables (among other variables which are not relevant) 'a' and 'b' and 'a' determines a group, while 'b' is a boolean with values TRUE (1) or FALSE (0). I want to create a variable 'c', which is also a boolean being 1 for all entries in groups where 'b' is at least once 'TRUE', and 0 for all entries in groups in which 'b' is never TRUE.
From entries like below:
a b
-----
1 1
2 0
1 0
1 0
1 1
2 0
2 0
3 0
3 1
3 0
I want to get variable 'c' like below:
a b c
-----------
1 1 1
2 0 0
1 0 1
1 0 1
1 1 1
2 0 0
2 0 0
3 0 1
3 1 1
3 0 1
-----------
I know how to do it in Stata, but I haven't done similar things in R yet, and it is difficult to find information on that on the internet.
In fact I am doing that only in order to later remove all the observations for which 'c' is 0, so any other suggestions would be fine as well. The application of that relates to multinomial logit estimation, where the alternatives that are never-chosen need to be removed from the dataset before estimation.
if X is your data frame
library(dplyr)
X <- X %>%
group_by(a) %>%
mutate(c = any(b == 1))
A base R option would be
df1$c <- with(df1, as.integer(ave(b == 1, a, FUN = any)))
Or
library(sqldf)
sqldf('select * from df1
left join(select a,
(sum(b))>0 as c
from df1
group by a)
using(a)')
Simple data.table approach
require(data.table)
data <- data.table(data)
data[, c := any(b), by = a]
Logical and numeric (0/1) columns behave the same in most arithmetic and subsetting contexts, but if you'd like a numeric result you can simply wrap the call to any with as.numeric.
An answer with base R, assuming a and b are in dataframe x
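The value of c maps one-to-one onto a, so I build that mapping first
cmap <- ifelse(sapply(split(x, x$a), function(g) sum(g[, "b"])) > 0, 1, 0)
Then look up the mapped value for each row by group name (indexing by name rather than position also works when the group IDs are not 1, 2, 3, ...)
x$c <- unname(cmap[as.character(x$a)])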
Final output
> x
a b c
1 1 1 1
2 2 0 0
3 1 0 1
4 1 0 1
5 1 1 1
6 2 0 0
7 2 0 0
8 3 0 1
9 3 1 1
10 3 0 1
edited to change call to split.
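Since the stated goal is to drop the groups in which 'b' is never TRUE, the intermediate column may not be needed at all. A minimal base R sketch on the example data, where keep and X_kept are illustrative names:

```r
X <- data.frame(a = c(1, 2, 1, 1, 1, 2, 2, 3, 3, 3),
                b = c(1, 0, 0, 0, 1, 0, 0, 0, 1, 0))

# TRUE for every row whose group has at least one b == 1
keep <- ave(X$b == 1, X$a, FUN = any)

# keep only the rows of groups that were chosen at least once
X_kept <- X[keep, ]
unique(X_kept$a)
# [1] 1 3
```

Here group 2 is removed entirely because none of its rows has b equal to 1.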
