Categorizing the contents of a vector in R

Here is my problem:
myvec <- c(1, 2, 2, 2, 3, 3,3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
I want to develop a function that can categorize this vector according to however many categories I define.
If there is 1 category, all elements of newvec will be 1.
If there are 2 categories, then over unique(myvec):
1 = 1, 2 = 2, 3 = 1, 4 = 2, 5 = 1, 6 = 2, 7 = 1, 8 = 2, 9 = 1, 10 = 2
(which is the odd/even situation).
If there are 3 categories, the first three unique values get 1:3 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 1, 5 = 2, 6 = 3, 7 = 1, 8 = 2, 9 = 3, 10 = 1
If there are 4 categories, the first four get 1:4 and then the pattern repeats:
1 = 1, 2 = 2, 3 = 3, 4 = 4, 5 = 1, 6 = 2, 7 = 3, 8 = 4, 9 = 1, 10 = 2
Similarly for n categories: the first n unique values get 1:n, then the pattern repeats.

This should do what you need, if I understood the question correctly. You can vary the variable n to choose the number of groups.
myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
out <- vector(mode="integer", length=length(myvec))
uid <- sort(unique(myvec))
n <- 3
for (i in 1:n) {
  s <- seq(i, length(uid), n)   # every n-th unique value, starting at the i-th
  out[myvec %in% s] <- i
}

Using the recycling features of R (this gives a warning if the vector length is not divisible by n):
R> myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)
R> n <- 3
R> y <- cbind(x=sort(unique(myvec)), y=1:n)[, 2]
R> y
[1] 1 2 3 1 2 3 1 2 3 1
or using rep:
R> x <- sort(unique(myvec))
R> y <- rep(1:n, length.out=length(x))
R> y
[1] 1 2 3 1 2 3 1 2 3 1
Update: you could just use the modulo operator
R> myvec
[1] 1 2 2 2 3 3 3 4 4 5 6 6 6 6 7 8 8 9 10 10 10
R> n <- 4
R> ((myvec - 1) %% n) + 1
[1] 1 2 2 2 3 3 3 4 4 1 2 2 2 2 3 4 4 1 2 2 2
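The modulo trick assumes the values themselves are consecutive integers starting at 1. As a sketch of a more general helper (my own wrapper, not from the answers above), one can first map each value to its rank among the sorted unique values and then apply the same arithmetic:

```r
myvec <- c(1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 6, 6, 6, 7, 8, 8, 9, 10, 10, 10)

# Sketch: categorize any numeric vector into n repeating groups.
# Works even when the values are not consecutive integers,
# because each value is first replaced by its rank.
categorize <- function(v, n) {
  idx <- match(v, sort(unique(v)))  # rank of each value among the sorted unique values
  ((idx - 1) %% n) + 1
}

categorize(myvec, 3)  # equivalent to ((myvec - 1) %% 3) + 1 here, since the values are 1:10
```

For values like c(10, 20, 30) the rank step makes the difference, since the raw modulo would not cycle 1:n.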

How to dynamically alternate columns between dataframe subset given column prefixes in R?

I have a dataframe that looks like this:
ID <- c(1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0)
State <- c('AZ', 'IA', 'MN', 'NY', 'IL', 'FL', 'TX', 'TN', 'LA', 'ND')
ABC1_current <- c(1, 5, 3, 4, 3, 4, 5, 2, 2, 1)
ABC2_current <- c(4, 5, 5, 4, 2, 5, 5, 1, 2, 4)
ABC1_future <- c(2, 5, 3, 5, 3, 4, 5, 3, 2, 3)
ABC2_future <- c(2, 5, 5, 1, 2, 1, 5, 1, 3, 4)
df <- data.frame(ID, State, ABC1_current, ABC2_current, ABC1_future, ABC2_future)
I am trying to dynamically place the columns with the future suffix to the right of the columns with the current suffix for a given prefix (ABC1, ABC2, etc.). The ID and State columns don't move at all. Here's what I am hoping to get as a result:
df2 <- data.frame(ID, State, ABC1_current, ABC1_future, ABC2_current, ABC2_future)
Is there a way to interlace columns like this dynamically? Ideally, I'd like to use dplyr if possible.
Although not purely dplyr, this may help.
It takes advantage of the alphabetical ordering of the column names (ABC1 before ABC2, and "current" before "future").
library(dplyr)
df1 <- df %>% select("ID", "State")
df2 <- df %>% select(-c("ID", "State"))
index <- sort(colnames(df2))
df3 <- cbind(df1, df2[index])   # bind the sorted columns back on (merge would cross-join here)
df3
ID State ABC1_current ABC1_future ABC2_current ABC2_future
1 1.1 AZ 1 2 4 2
2 1.2 IA 5 5 5 5
3 1.3 MN 3 3 5 5
4 1.4 NY 4 5 4 1
5 1.5 IL 3 3 2 2
6 1.6 FL 4 4 5 1
7 1.7 TX 5 5 5 5
8 1.8 TN 2 3 1 1
9 1.9 LA 2 2 2 3
10 2.0 ND 1 3 4 4
Basically the same idea as #PeteKittinun's:
library(dplyr)
df %>%
select(ID, State, sort(colnames(.)[3:ncol(.)]))
returns
> df %>%
+ select(ID, State, sort(colnames(.)[3:ncol(.)]))
ID State ABC1_current ABC1_future ABC2_current ABC2_future
1 1.1 AZ 1 2 4 2
2 1.2 IA 5 5 5 5
3 1.3 MN 3 3 5 5
4 1.4 NY 4 5 4 1
5 1.5 IL 3 3 2 2
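A base R version of the same idea (my sketch, assuming the first two columns are always ID and State): sorting the remaining names works because "current" sorts before "future" alphabetically.

```r
# Small illustrative frame with the question's column layout
df <- data.frame(ID = c(1.1, 1.2), State = c("AZ", "IA"),
                 ABC1_current = c(1, 5), ABC2_current = c(4, 5),
                 ABC1_future = c(2, 5), ABC2_future = c(2, 5))

# Keep ID and State in place, sort everything after them
df2 <- df[, c("ID", "State", sort(names(df)[-(1:2)]))]
names(df2)
# "ID" "State" "ABC1_current" "ABC1_future" "ABC2_current" "ABC2_future"
```

This relies on the suffixes sorting in the desired order; if they didn't, you would need to order by prefix and suffix explicitly.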

R DataTable Select Rows

data1=data.frame("Student"=c(1, 1, 1, 2, 2, 2, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 2, 2, 3, 10, NA, 3))
data2=data.frame("Student"=c(1, 1, 1, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 10, NA, 3))
I have 'data1' and wish for 'data2' where I ONLY include 'Student' if 'Score' at 'Grade' = 1 is at least 4.
My only knowledge of how to do this is filtering by 'Grade' and 'Score', but that does not give the desired output.
library(data.table)
setDT(data1)
data1 = data1[Grade == 1 & Score >= 4]
How is it possible to specify that I wish to select all STUDENTS who have a Score >= 4 at Grade 1, and not just those ROWS?
You just need to do a join with your desired conditions to retain the Student id.
Does this work?
library(data.table)
data1 <- data.frame("Student"=c(1, 1, 1, 2, 2, 2, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 2, 2, 3, 10, NA, 3))
data2 <- data.frame("Student"=c(1, 1, 1, 3, 3, 3),
"Grade"=c(1, 2, 3, 1, 2, 3),
"Score"=c(5, 7, 9, 10, NA, 3))
setDT(data1)
setDT(data2)
wanted <- data1[Grade == 1 & Score >= 4, .(Student)]
setkey(wanted, Student)
setkey(data1, Student)
data3 <- data1[wanted]
data2
#> Student Grade Score
#> 1: 1 1 5
#> 2: 1 2 7
#> 3: 1 3 9
#> 4: 3 1 10
#> 5: 3 2 NA
#> 6: 3 3 3
data3
#> Student Grade Score
#> 1: 1 1 5
#> 2: 1 2 7
#> 3: 1 3 9
#> 4: 3 1 10
#> 5: 3 2 NA
#> 6: 3 3 3
Created on 2020-04-29 by the reprex package (v0.3.0)
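For what it's worth, the same selection can also be written without explicit keys, by filtering data1 on the Student ids that qualify (a sketch using the question's data):

```r
library(data.table)
data1 <- data.table(Student = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                    Grade   = c(1, 2, 3, 1, 2, 3, 1, 2, 3),
                    Score   = c(5, 7, 9, 2, 2, 3, 10, NA, 3))

# Students whose Grade-1 score is at least 4, then all of their rows
keep  <- data1[Grade == 1 & Score >= 4, Student]
data3 <- data1[Student %in% keep]
```

This keeps all rows (all grades) for students 1 and 3, matching data2 above.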

How to expand a vector of integers into consecutive integers in each group in r

I want to expand a vector of integers into consecutive integers within each group in R. Does anyone have hints on this problem?
Below is my original dataset:
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
data = data.frame(x, group)
and my desired dataset is as below:
desired_data = data.frame(
x = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 8),
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3))
Thanks for your help!
This can be easily done via expand from tidyr,
library(tidyverse)
data %>%
  group_by(group) %>%
  expand(x = full_seq(x, 1))
Which gives,
# A tibble: 19 x 2
# Groups: group [3]
group x
<dbl> <dbl>
1 1 1
2 1 2
3 1 3
4 1 4
5 1 5
6 2 1
7 2 2
8 2 3
9 2 4
10 2 5
11 2 6
12 3 1
13 3 2
14 3 3
15 3 4
16 3 5
17 3 6
18 3 7
19 3 8
I'm sure someone will have a cleaner solution soon. In the meantime:
minVals <- aggregate(data$x, by = list(data$group), min)[, 2]
maxVals <- aggregate(data$x, by = list(data$group), max)[, 2]
ls <- apply(cbind(minVals, maxVals), 1, function(x) x[1]:x[2])
desired_data <- data.frame(
  x = unlist(ls),
  group = rep(unique(data$group), lengths(ls)))
x group
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 4 2
10 5 2
11 6 2
12 1 3
13 2 3
14 3 3
15 4 3
16 5 3
17 6 3
18 7 3
19 8 3
Here's a base R solution.
x = c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)
sl <- split(x, group)
expanded <- lapply(names(sl), function(g) {
  r <- range(sl[[g]])
  data.frame(x = seq(r[1], r[2], 1), group = g)
})
do.call(rbind,expanded)
split x by group which results in a named list per group
using lapply on the names we can expand the integer range for each group
finally use do.call to rbind the results together.
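Another base R sketch of the same steps, collapsing the per-group min/max into a single tapply pass:

```r
x <- c(1, 2, 3, 4, 5, 1, 3, 5, 6, 1, 2, 3, 6, 8)
group <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3)

ranges <- tapply(x, group, range)   # list of c(min, max) per group
desired <- do.call(rbind, lapply(names(ranges), function(g)
  data.frame(x = seq(ranges[[g]][1], ranges[[g]][2]),
             group = as.numeric(g))))
```

The result has 19 rows, one per consecutive integer between each group's minimum and maximum.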

R - Get Common Columns without the use of Joins or Merge

I have two data frames in R
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
> df1
Cust ItemId
1 1 1
2 1 2
3 1 3
4 1 4
5 2 2
6 2 3
7 2 2
8 2 5
9 3 1
10 3 2
11 3 5
12 4 6
13 4 2
> df2
ItemId1 ItemId2
1 1 3
2 3 1
3 2 2
4 3 3
5 2 4
6 1 1
7 2 6
8 3 4
9 4 2
10 6 4
11 5 3
12 3 1
13 2 3
14 4 5
All I need is the following output, which should be less costly than joins/merge (because in reality I am dealing with billions of records):
> output
ItemId1 ItemId2 Cust
1 1 3 1
2 3 1 1
3 2 2 1, 2, 3, 4
4 3 3 1, 2
5 2 4 1
6 1 1 1, 3
7 2 6 4
8 3 4 1
9 4 2 1
10 6 4 NA
11 5 3 2
12 3 1 1
13 2 3 1, 2
14 4 5 NA
What happens is: if both ItemId1 and ItemId2 of a df2 row are present among a customer's ItemId values in df1, we return that customer's Cust value (possibly several customers). If no customer has both, we return NA.
Take the first row as an example: ItemId1 = 1, ItemId2 = 3. Only customer 1 has both ItemId 1 and 3 in df1. Similarly for the next rows.
We could do this using joins/merge, but those are costly operations and they result in a memory error.
This may take more time but won't take much of your memory.
Please convert the for loops to apply if possible:
library(plyr)
df1 = data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4), ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 = data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4), ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))
temp2 = ddply(df1[,c("Cust","ItemId")], .(Cust), summarize, ItemId = toString(unique(ItemId)))
temp3 = ddply(df1[,c("ItemId","Cust")], .(ItemId), summarize, Cust = toString(unique(Cust)))
dfout = cbind(df2[0,],data.frame(Cust = df1[0,1]))
for(i in 1:nrow(df2)){
a = df2[i,1]
b = df2[i,2]
if(a == b){
dfout = rbind(dfout,data.frame(ItemId1 = a,ItemId2 = a,Cust = temp3$Cust[temp3$ItemId == a]))
}else{
cusli = c()
for(j in 1:nrow(temp2)){
if(length(grep(a,temp2$ItemId[j]))>0 & length(grep(b,temp2$ItemId[j]))>0){
cusli = c(cusli,temp2$Cust[j])
}
}
dfout = rbind(dfout,data.frame(ItemId1 = a,ItemId2 = b,Cust = paste(cusli,collapse = ", ")))
}
}
dfout$Cust[dfout$Cust == ""] <- NA
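A vectorized alternative sketch (avoiding the inner loop): split df1's items by customer once, then test each df2 row for set membership. Still no merge, and memory stays proportional to the inputs.

```r
df1 <- data.frame(Cust = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4),
                  ItemId = c(1, 2, 3, 4, 2, 3, 2, 5, 1, 2, 5, 6, 2))
df2 <- data.frame(ItemId1 = c(1, 3, 2, 3, 2, 1, 2, 3, 4, 6, 5, 3, 2, 4),
                  ItemId2 = c(3, 1, 2, 3, 4, 1, 6, 4, 2, 4, 3, 1, 3, 5))

items_by_cust <- split(df1$ItemId, df1$Cust)   # one item set per customer

# For each df2 row, list the customers whose items contain both ids
df2$Cust <- apply(df2, 1, function(r) {
  hit <- names(items_by_cust)[vapply(items_by_cust,
                                     function(it) all(r %in% it),
                                     logical(1))]
  if (length(hit) == 0) NA_character_ else toString(hit)
})
```

This reproduces the NA rows (e.g. ItemId1 = 6, ItemId2 = 4) and the multi-customer rows (e.g. ItemId1 = 2, ItemId2 = 2 gives "1, 2, 3, 4").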

Find the weight of each element of a vector

I have a vector v and I want a vector w giving the weight (count) of each unique element of v. How can I get the result (vector w) in R? For example,
v = c(0, 0, 1, 1, 1, 3, 4, 4, 4, 4, 5, 5, 6)
u = unique(v)
w = c(2, 3, 1, 4, 2, 1)
Use table:
table(v)
v
0 1 3 4 5 6
2 3 1 4 2 1
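If you need w as a plain unnamed vector (and u as the corresponding values), you can strip the names from the table:

```r
v <- c(0, 0, 1, 1, 1, 3, 4, 4, 4, 4, 5, 5, 6)
tab <- table(v)
u <- as.numeric(names(tab))  # the unique values, sorted
w <- as.vector(tab)          # their counts, names dropped
```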