Create column based on unique values [duplicate]

This question already has answers here:
Automatically expanding an R factor into a collection of 1/0 indicator variables for every factor level
(10 answers)
Closed 6 years ago.
I'd like to create columns in a data frame based on the unique values from a single column.
E.g.
Column1
A
B
C
Into
A B C
True False False
False True False
False False True

We can use table
!!table(1:nrow(df1), df1$Column1)
# A B C
# 1 TRUE FALSE FALSE
# 2 FALSE TRUE FALSE
# 3 FALSE FALSE TRUE
Or using mtabulate from qdapTools
library(qdapTools)
mtabulate(df1$Column1)!=0
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE
Or using model.matrix
model.matrix(~Column1-1, df1)!=0
# Column1A Column1B Column1C
#1 TRUE FALSE FALSE
#2 FALSE TRUE FALSE
#3 FALSE FALSE TRUE

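You could also loop over the unique values with sapply. Comparing with == instead of grepl avoids false positives when one value is a substring of another:
sapply(unique(as.character(df$Column1)), function(i) df$Column1 == i)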
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE

You can also use dcast from reshape2 package
library(reshape2)
!is.na(dcast(df, Column1 ~ Column1))[, -1]
# A B C
#[1,] TRUE FALSE FALSE
#[2,] FALSE TRUE FALSE
#[3,] FALSE FALSE TRUE
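
None of the answers above define the input or attach the result back to the data frame, so here is a minimal base R sketch that does both. The data frame df, the repeated "A" row, and the names lv, ind and res are made up for illustration; outer() with "==" is just one more way to build the same logical matrix.

```r
# Hypothetical input: one column with a repeated value
df <- data.frame(Column1 = c("A", "B", "C", "A"), stringsAsFactors = FALSE)

# One logical column per unique value
lv <- unique(df$Column1)
ind <- outer(df$Column1, lv, "==")
colnames(ind) <- lv

# Attach the indicator columns to the original data frame
res <- cbind(df, ind)
res
#   Column1     A     B     C
# 1       A  TRUE FALSE FALSE
# 2       B FALSE  TRUE FALSE
# 3       C FALSE FALSE  TRUE
# 4       A  TRUE FALSE FALSE
```

Because the comparison is exact equality, values that are substrings of each other (say "A" and "AB") stay in separate columns.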

Related

Turns thousands of dummy variables into multinomial variable

I have a dataframe of the following sort:
a<-c('q','w')
b<-c(T,T)
d<-c(F,F)
.e<-c(T,F)
.f<-c(F,F)
.g<-c(F,T)
h<-c(F,F)
i<-c(F,T)
j<-c(T,T)
df<-data.frame(a,b,d,.e,.f,.g,h,i,j)
a b d .e .f .g h i j
1 q TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
2 w TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE
I want to collapse all variables whose names start with a period into a single multinomial variable called Index, which holds, for each row, the name (without the leading period) of the dummy variable that is TRUE in that row:
df$Index<-c('e','g')
a b d .e .f .g h i j Index
1 q TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE e
2 w TRUE FALSE FALSE FALSE TRUE FALSE TRUE TRUE g
Although any given dummy variable can be T in many rows, each row can be T for only ONE of the dummy variables.
If it were just a few items, I'd use an ifelse statement:
df$Index <- ifelse(df$`_10000`, '10000',...
But there are 12000 of these. In my real data the names of all dummy variables begin with underscores (the toy example above uses periods instead), so I feel like there must be a better way. In pseudocode I would say something like:
for every row:
    for every column beginning with '_':
        if value == T:
            assign the name of the column without '_' to a column 'Index'
Thanks in advance

Sample data:
df <- cbind(a = letters[1:10], b = LETTERS[1:10],
data.frame(diag(10) == 1))
names(df)[-(1:2)] <- paste0("_", 1:10)
set.seed(42)
df <- df[sample(nrow(df)),]
head(df,3)
# a b _1 _2 _3 _4 _5 _6 _7 _8 _9 _10
# 1 a A TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# 5 e E FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
Execution:
df$Index <- apply(subset(df, select = grepl("^_", names(df))), 1,
function(z) which(z)[1])
df
# a b _1 _2 _3 _4 _5 _6 _7 _8 _9 _10 Index
# 1 a A TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 1
# 5 e E FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE 5
# 10 j J FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE 10
# 8 h H FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE 8
# 2 b B FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 2
# 4 d D FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE 4
# 6 f F FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE 6
# 9 i I FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE 9
# 7 g G FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE 7
# 3 c C FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 3
If there is more than one TRUE in a row of _-columns, the first one found is used and the rest are silently ignored. If there is none, Index will be NA for that row.
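
Note that which(z)[1] returns the position of the TRUE column, which only equals the column name here because the sample columns are called _1 to _10. If the name itself is wanted, as in the question, something along these lines might work. It is a sketch on made-up data: the column names _10 to _40 and the all-FALSE second row are assumptions added to show both cases.

```r
# Hypothetical data: dummy columns whose names are not their positions
df <- cbind(a = letters[1:4], data.frame(diag(4) == 1))
names(df)[-1] <- paste0("_", c(10, 20, 30, 40))
df[2, "_20"] <- FALSE   # a row with no TRUE at all

# Logical matrix of just the dummy columns
m <- as.matrix(df[grepl("^_", names(df))])

# Position of the first TRUE per row (m * 1 turns the logicals into 0/1)
hit <- max.col(m * 1, ties.method = "first")
hit[rowSums(m) == 0] <- NA   # rows without any TRUE get NA

# Look the name up and drop the leading underscore
df$Index <- sub("^_", "", colnames(m)[hit])
df$Index
# [1] "10" NA   "30" "40"
```

max.col works on the whole matrix at once, so it avoids calling an R function per row, which may matter with 12000 dummy columns.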

R - Insert a True every nth row based on previous rows

Test data Frame:
a<-data.frame(True_False = c(T,F,F,F,F,T,F,T,T,T,F,F,F,F,F,F,F,F))
True_False
1 TRUE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
6 TRUE
7 FALSE
8 TRUE
9 TRUE
10 TRUE
11 FALSE
12 FALSE
13 FALSE
14 FALSE
15 FALSE
16 FALSE
17 FALSE
18 FALSE
Using this, I would like to edit this column, or make a new one, so that it has a TRUE at least once every third row. That is, for each row I need to check whether the current row and the previous two rows are all FALSE and, if so, set the current row to TRUE; otherwise leave it as it is. Using zoo, dplyr, and rollapply, I get close.
library(zoo)
library(tidyverse)
b <- a %>%
  rename(Input = True_False) %>%
  mutate(Roll = ifelse(rollapplyr(Input, 3, sum, partial = TRUE) == 0, TRUE, Input))
b$Desired <- c(T,F,F,T,F,T,F,T,T,T,F,F,T,F,F,T,F,F)
Input Roll Desired
1 TRUE TRUE TRUE
2 FALSE FALSE FALSE
3 FALSE FALSE FALSE
4 FALSE TRUE TRUE
5 FALSE TRUE FALSE
6 TRUE TRUE TRUE
7 FALSE FALSE FALSE
8 TRUE TRUE TRUE
9 TRUE TRUE TRUE
10 TRUE TRUE TRUE
11 FALSE FALSE FALSE
12 FALSE FALSE FALSE
13 FALSE TRUE TRUE
14 FALSE TRUE FALSE
15 FALSE TRUE FALSE
16 FALSE TRUE TRUE
17 FALSE TRUE FALSE
18 FALSE TRUE FALSE
Essentially my issue is that it will rollapply the sum to the whole column, and then add the Trues after. Thus, we have Trues that are not necessary. So is there a way I can do this in which the True is applied before going to the next row? I assume I need to use an apply of some sort, but that is an area I'm not familiar with, and even reading the documentation I'm not sure how to do this directly.

Due to the fact that you need to update your vector on the fly to process further operations, I'd say a simple for-loop is the way to go:
for (i in 3:nrow(a)) {
  a$True_False[i] <- ifelse(sum(a$True_False[(i-2):i]) == 0, TRUE, a$True_False[i])
}
> a
True_False
1 TRUE
2 FALSE
3 FALSE
4 TRUE
5 FALSE
6 TRUE
7 FALSE
8 TRUE
9 TRUE
10 TRUE
11 FALSE
12 FALSE
13 TRUE
14 FALSE
15 FALSE
16 TRUE
17 FALSE
18 FALSE

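Looks like you need something like this. Here is one approach (not the cleanest). Note that it only considers rows where z %% 3 == 1 and reads the original column rather than the updated one, so it reproduces the desired output for this data but does not carry inserted TRUEs forward in general:
a <- data.frame(True_False = c(T,F,F,F,F,T,F,T,T,T,F,F,F,F,F,F,F,F))
a$Desired <- NA
a$Desired[4:nrow(a)] <- sapply(4:nrow(a), function(z) {
  # TRUE if z is one past a multiple of 3 and this row and the two before it are all FALSE
  if (z %% 3 == 1 && !a$True_False[z] && !a$True_False[z-1] && !a$True_False[z-2]) TRUE else a$True_False[z]
})
a$Desired[1:3] <- a$True_False[1:3]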

Define an update function f and run it through Reduce.
f <- function(x, i) {
  if (i >= 3 && all(!x[seq(to = i, length = 3)])) x[i] <- TRUE
  x
}
transform(a, new = Reduce(f, init = True_False, seq_along(True_False)))
giving:
True_False new
1 TRUE TRUE
2 FALSE FALSE
3 FALSE FALSE
4 FALSE TRUE
5 FALSE FALSE
6 TRUE TRUE
7 FALSE FALSE
8 TRUE TRUE
9 TRUE TRUE
10 TRUE TRUE
11 FALSE FALSE
12 FALSE FALSE
13 FALSE TRUE
14 FALSE FALSE
15 FALSE FALSE
16 FALSE TRUE
17 FALSE FALSE
18 FALSE FALSE
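
Since the title asks about every nth row, here is a sketch that generalises the for-loop idea to an arbitrary window. The function name fill_every_n and its argument n are made up for illustration. Instead of re-summing the last n values at each step, it keeps a running count of consecutive FALSEs.

```r
# Set x[i] to TRUE whenever it would be the n-th consecutive FALSE
fill_every_n <- function(x, n = 3) {
  run <- 0L                      # length of the current run of FALSEs
  for (i in seq_along(x)) {
    if (x[i]) {
      run <- 0L                  # an existing TRUE resets the run
    } else {
      run <- run + 1L
      if (run == n) {            # n FALSEs in a row: insert a TRUE here
        x[i] <- TRUE
        run <- 0L
      }
    }
  }
  x
}

x <- c(T,F,F,F,F,T,F,T,T,T,F,F,F,F,F,F,F,F)
which(fill_every_n(x) != x)
# [1]  4 13 16
```

With n = 3 this changes rows 4, 13 and 16, the same rows as in the Desired column of the question.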

Compare each element of a variable within each group

Consider the data frame in R:
set.seed(36)
y <- runif(10,0,200)
group <- sample(rep(1:2, each=5))
d <- data.frame(y, group)
I want to compare all y against all y within each group. The following code does this correctly:
d_split <- split(d, d$group)
a <- with(d_split[[1]],outer(y, y, "<="))
b <- with(d_split[[2]],outer(y, y, "<="))
But I am doing this inside a function where the number of groups varies (group will be an argument of that function), so I cannot hard-code one line per group like this. How can I elegantly rewrite the last three lines of code to compare all y against all y within each group?

To perform the same operation for multiple groups we can use lapply and perform the outer operation for every group.
lapply(split(d, d$group), function(x) outer(x[["y"]], x[["y"]], "<="))
#$`1`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE FALSE FALSE
#[2,] FALSE TRUE FALSE FALSE FALSE
#[3,] TRUE TRUE TRUE FALSE TRUE
#[4,] TRUE TRUE TRUE TRUE TRUE
#[5,] TRUE TRUE FALSE FALSE TRUE
#$`2`
# [,1] [,2] [,3] [,4] [,5]
#[1,] TRUE TRUE FALSE TRUE FALSE
#[2,] FALSE TRUE FALSE TRUE FALSE
#[3,] TRUE TRUE TRUE TRUE TRUE
#[4,] FALSE FALSE FALSE TRUE FALSE
#[5,] TRUE TRUE FALSE TRUE TRUE

Here is an option without splitting
library(data.table)
setDT(d)[, as.data.table(outer(y, y, "<=")), group]
# group V1 V2 V3 V4 V5
#1: 1 TRUE TRUE FALSE FALSE FALSE
#2: 1 FALSE TRUE FALSE FALSE FALSE
#3: 1 TRUE TRUE TRUE FALSE TRUE
#4: 1 TRUE TRUE TRUE TRUE TRUE
#5: 1 TRUE TRUE FALSE FALSE TRUE
#6: 2 TRUE TRUE FALSE TRUE FALSE
#7: 2 FALSE TRUE FALSE TRUE FALSE
#8: 2 TRUE TRUE TRUE TRUE TRUE
#9: 2 FALSE FALSE FALSE TRUE FALSE
#10: 2 TRUE TRUE FALSE TRUE TRUE
Or in a 'long' format with CJ
setDT(d)[, CJ(V1 = y, V2 = y, sorted = FALSE), group][, V1 <= V2, group]
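
Since the question asks for this inside a function with the grouping variable as an argument, the lapply/split answer could be wrapped roughly like this. The function name compare_within and its arguments are hypothetical, and the small data frame is made up so the result is easy to check by hand.

```r
# Compare every value against every other value within each group.
# 'value' and 'group' are column names given as strings.
compare_within <- function(data, value, group) {
  lapply(split(data[[value]], data[[group]]),
         function(v) outer(v, v, "<="))
}

d <- data.frame(y = c(3, 1, 2, 5, 4), group = c(1, 2, 1, 2, 1))
res <- compare_within(d, "y", "group")
res[["2"]]
#       [,1] [,2]
# [1,]  TRUE TRUE
# [2,] FALSE TRUE
```

The result is a named list with one square logical matrix per group, so it works for any number of groups and for groups of different sizes.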

Exclude multiple words from a vector with grepl [duplicate]

This question already has answers here:
Matching multiple patterns
(6 answers)
Closed 7 years ago.
Here is some sample data:
exclude.words <- c("zoznam","azet","dovera","joj","alza","telecom","google","post","sme")
main.data <- c("zoznam","registration","azet","azet.com","dovera","dna","joj","alza","telecom","google","post","sme")
The following works when the words match exactly, but azet.com is not excluded. For that we could use agrepl().
main.data[!(main.data %in% exclude.words)]
So how to use agrepl with two vectors?
main.data[!agrepl(main.data, exclude.words)]

As commented, you can use:
main.data[!grepl(paste(exclude.words, collapse = "|"), main.data)]
to exclude any element of main.data that partially or completely matches one of the exclude.words.
paste(exclude.words, collapse = "|")
creates a single string with "|" (the regular-expression alternation operator, i.e. OR) between the exclude.words, which can be used as a single pattern in grepl. Therefore, you don't need to loop over the individual words.
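
One caveat with the pasted pattern is that each word is interpreted as a regular expression, so a word containing a metacharacter such as "." or "+" may match more than intended. A possible alternative, sketched here on the question's data, matches each word literally with fixed = TRUE and combines the results with Reduce; the names hits and kept are made up.

```r
exclude.words <- c("zoznam", "azet", "dovera", "joj", "alza",
                   "telecom", "google", "post", "sme")
main.data <- c("zoznam", "registration", "azet", "azet.com", "dovera", "dna",
               "joj", "alza", "telecom", "google", "post", "sme")

# One logical vector per word, each a literal (non-regex) substring match
hits <- lapply(exclude.words, grepl, x = main.data, fixed = TRUE)

# TRUE where at least one word matched; keep the rest
kept <- main.data[!Reduce(`|`, hits)]
kept
# [1] "registration" "dna"
```

For plain alphabetic words like these, both approaches give the same result; the difference only shows up once the words contain regex syntax.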

main.data[!as.logical(rowSums(sapply(exclude.words, function(x) agrepl(x, main.data))))]
# [1] "registration" "dna"
# clarification
sapply(exclude.words, function(x) agrepl(x, main.data))
# zoznam azet dovera joj alza telecom google post sme
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [7,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE
# [8,] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
# [9,] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
# [10,] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE
# [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
# [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE

You can use this functional programming approach:
library(functional)
funcs = lapply(exclude.words, function(u) function(x) x[!grepl(u, x)])
Reduce(Compose, funcs)(main.data)
#[1] "registration" "dna"

find the biggest change in a time series

I have a time series in R, e.g.
[1] 0.2 0.6 0.4 -0.2 -0.1 0.3 0.8 0.7
How can I find the biggest change in the series? (Here it is from point 4 to point 7: biggest change = 1.)
How can I find where a change of, e.g., 1 occurs? (Again from point 4 (= -0.2) to point 7 (= 0.8).)

To calculate the distance matrix for a set of points, you can use the dist function. After that it is just a matter of selecting the point pair with the highest distance between them. In code:
m = as.matrix(dist(runif(10)))
m == max(m)
1 2 3 4 5 6 7 8 9 10
1 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
2 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
3 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
5 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
6 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE
7 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
8 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
9 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
10 FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
which(m == max(m), arr.ind = TRUE)[1,]
row col
10 6

You can use expand.grid here.
exg <- expand.grid(x, x)
exg[apply(exg, 1, diff) == VALUE.TO.FIND, ] # notice the ', ' (comma-and-space)
Var1 Var2
52 -0.2 0.8
where VALUE.TO.FIND is whichever specific value you are searching for.
If instead you want to find the maximum distance:
dist <- apply(exg, 1, diff)
exg[dist == max(dist), ]

To get the biggest change in a list, just iterate through it and get the max and min values. Then compare them. It's in O(n) time. It's dirt simple.
To find a certain change is a little more complex. Don't know why you'd want it, but it's still possible. One way would be to call the first function you just wrote with every combination of start index and end indexes of the list. That's a little more computationally complex, but it's the simplest way of implementing it. Then when you get the change from position 1 to 2, you can check to see if it's what you want, if not, 1-3. Eventually you'll get to n-1 to n, and if that's not the change you're looking for, then it's not in the set.
This method will be in O(n^2).
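
The O(n) idea can be sketched in a few lines of R on the series from the question. The variable names are made up. The first part ignores time order, as described above; the second part is an assumed variant for the case where the low point has to come before the high point, using cummin to track the lowest value seen so far.

```r
x <- c(0.2, 0.6, 0.4, -0.2, -0.1, 0.3, 0.8, 0.7)

# Largest absolute change, ignoring order: overall max minus overall min
lo <- which.min(x)
hi <- which.max(x)
biggest <- x[hi] - x[lo]

# Largest rise where the low point comes first:
# compare each value with the smallest value seen up to that point
rise <- x - cummin(x)
end_idx <- which.max(rise)
start_idx <- which.min(x[1:end_idx])

c(lo = lo, hi = hi, start = start_idx, end = end_idx)
#    lo    hi start   end
#     4     7     4     7
```

Both versions make a single pass over the data, so they stay linear in the length of the series, unlike the all-pairs approaches with dist or expand.grid.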
