R Check for levels within a group and duplicate row if not present - r

I have a problem with in Shiny which I will show in a simple example:
I have the following data:
Group<-c("A","A","B","C","C","D")
Value<-c(1,2,6,7,3,9)
df<-data.frame(Group, Value)
Group Value
A 1
A 2
B 6
C 7
C 3
D 9
Then I add a row to see how many reps a group has:
df$num <- ave(df$Value, df$Group, FUN = seq_along)
Group Value num
A 1 1
A 2 2
B 6 1
C 7 1
C 3 2
D 9 1
Now, I would like it, to check if the group contains a 2nd rep, and if not, duplicate the 1st row of the group (containing num=1) and setting num to 2.
So I would like to end up with:
Group Value num
A 1 1
A 2 2
B 6 1
B 6 2 #this row is added
C 7 1
C 3 2
D 9 1
D 9 2 #this row is added
I have tried to search for solution, but I mainly ended up with subject like a condition that is based on a certain value, rather than conditions within a group.
Could someone help me? I would appreciate it a lot!

Can this code do the trick ?
res <- lapply(unique(df$Group), function(x){
a <- df[df$Group == x, ]
if(nrow(a) == 1) {
a <- a[rep(row.names(a), 2), ]
a$num <- c(1:2)
}
a
})
do.call(rbind, res)

Related

Finding group with distinct (non-overlapping) elements

I have a simple dataframe with group IDs and elements of each group, like this:
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
Each ID may have a different number of elements. Now I want to find all IDs that have distinct elements with other IDs. In this example, ID1 and ID3 will be selected because they have distinct elements (3,5,7 vs 2,4,6). I also want to copy these unique IDs and their elements into a new dataframe, similar to the original.
How would I do that in R? My skills with R is quite limited.
Thank you very much!
Bests,
Seems like a good question for igraph cliques with one edge to another clique but I cant seem to wrap my head on how to use it.
Anyway, here is an option applying join to identify IDs with same Values and then anti-join to remove those IDs using data.table:
library(data.table)
DT <- as.data.table(x)
for (i in DT[, unique(ID)]) {
dupeID <- DT[DT[ID==i], on=.(Values), .(ID=unique(x.ID[x.ID!=i.ID]))]
DT <- DT[!dupeID , on=.(ID)]
}
output:
ID Values
1: 1 3
2: 1 5
3: 1 7
4: 3 2
5: 3 4
6: 3 6
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3), "Values" = c(3,5,7,2,4,5,2,4,6))
gps = split(x, x$ID)
nGroups = length(gps)
k = 1
results = data.frame(ID = NULL, Values = NULL)
for(i in 1:(nGroups - 1)){
j = i + 1
while(j <= nGroups){
if(length(intersect(gps[[i]]$Values, gps[[j]]$Values)) == 0){
print(c(i,j))
results = rbind(results, gps[[i]], gps[[j]])
}
j = j + 1
}
}
results
> results
ID Values
1 1 3
2 1 5
3 1 7
7 3 2
8 3 4
9 3 6
You can try the following code, where the y is the list of data frames (including all data frames that have exclusive Value)
xs <- split(x,x$ID)
id <- names(xs)
y <- list()
ids <- seq_along(xs)
repeat {
if (length(ids)==0) break;
y[[length(y)+1]] <- xs[[ids[1]]]
p <- ids[[1]]
qs <- p
for (q in ids[-1]) {
if (length(intersect(xs[[p]]$Value,xs[[q]]$Value))==0) {
y[[length(y)]] <- rbind(y[[length(y)]],xs[[q]])
qs <- c(qs,q)
}
}
ids <- setdiff(ids,qs)
}
Example
x <- data.frame("ID" = c(1,1,1,2,2,2,3,3,3,4,4),
"Values" = c(3,5,7,2,4,5,2,4,6,1,3))
> x
ID Values
1 1 3
2 1 5
3 1 7
4 2 2
5 2 4
6 2 5
7 3 2
8 3 4
9 3 6
10 4 1
11 4 3
then you will get
> y
[[1]]
ID Values
1 1 3
2 1 5
3 1 7
7 3 2
8 3 4
9 3 6
[[2]]
ID Values
4 2 2
5 2 4
6 2 5
10 4 1
11 4 3

Return first occurring 2nd largest value in data frame rows using colnames & apply

Consider i have a df
> editor
A B C D E F G H I J
User1 1 0 5 6 5 6 5 6 2 6
User2 0 5 4 6 4 5 5 1 7 5
I want to store the column name of the first occuring 2nd largest value in above rows.
Expected results
> editor
A B C D E F G H I J 2nd_highest
User1 1 0 5 6 5 6 5 6 2 6 C
User2 0 5 4 6 4 5 5 1 7 5 D
i tried edited$2nd_highest <- colnames(edited)[apply(edited, 1, which.max)+1] but did'nt worked well .
Any ideas ?
Here's an attempt to achieve this using algebra in order to keep it vectorized and avoid by row operations (though it still does a matrix conversion similar to apply). The idea here is to find the maximum- then reduce it from the data set, then convert to log (after multiplying by -1) which will result in the largest value becoming -Inf (meaning the smallest value) and then do 1/result in order to find the largest value out of the values left.
indx <- max.col(1/log((editor - editor[cbind(1:nrow(editor),
max.col(editor))]) * -1), ties.method = "first")
names(editor)[indx]
# [1] "C" "D"
Here is an idea. We first sort the unique values of each row and extract the second value. Since we specify decreasing = TRUE, then the second value will be the second highest. We then use the first value of each element of the new list as the index for the column names
ind_lst <- apply(df, 1, function(i) which(i == sort(unique(i), decreasing = TRUE)[2]))
df$highest.two <- names(df)[unlist(lapply(ind_lst, '[', 1))]
df
# A B C D E F G H I J highest.two
#User1 1 0 5 6 5 6 5 6 2 6 C
#User2 0 5 4 6 4 5 5 1 7 5 D
This can help you:
mat <- matrix(sample(1:8, 24, replace=TRUE), ncol=6)
mat
sec_highest <- apply(mat, 1, function(x) which(x == max(x[which(x != max(x))])))
LETTERS[sec_highest] # letters display
Note that if you have two second highests with same scores, only one will be displayed.

Cumulative sum conditional over multiple columns in r dataframe containing the same values

Say my data.frame is as outlined below:
df<-as.data.frame(cbind("Home"=c("a","c","e","b","e","b"),
"Away"=c("b","d","f","c","a","f"))
df$Index<-rep(1,nrow(df))
Home Away Index
1 a b 1
2 c d 1
3 e f 1
4 b c 1
5 e a 1
6 b f 1
What I want to do is calculate a cumulative sum using the Index column for each character a - f regardless of whether they in the Home or Away columns. Thus a column called Cumulative_Sum_Home, say, takes the character in the Home row, "b" in the case of row 6, and counts how many times "b" has appeared in either the Home or Away columns in all previous rows including row 6. Thus in this case b has appeared 3 times cumulatively in the first 6 rows, and thus the Cumulative_Sum_Home gives the value 3. Likewise the same logic applies to the Cumulative_Sum_Away column. Taking row 5, character "a" appears in the Away column, and has cumulatively appeared 2 times in either Home or Away columns up to that row, so the column Cumulative_Sum_Away takes the value 2.
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
I have to confess to being totally stumped as to how to solve this problem. I've tried looking at the data.table approaches, but I've never used that package before so I can't immediately see how to solve it. Any tips would be greatly received.
There is scope to make this leaner but if that doesn't matter much for you then this should be okay.
NewColumns = list()
for ( i in sort(unique(c(levels(df[,"Home"]),levels(df[,"Away"]))))) {
NewColumnAddition = i == df$Home | i ==df$Away
NewColumnAddition[NewColumnAddition] = cumsum(NewColumnAddition[NewColumnAddition])
NewColumns[[i]] = NewColumnAddition
}
df$Cumulative_Sum_Home = sapply(
seq(nrow(df)),
function(i) {
NewColumns[[as.character(df[i,"Home"])]][i]
}
)
df$Cumulative_Sum_Away = sapply(
seq(nrow(df)),
function(i) {
NewColumns[[as.character(df[i,"Away"])]][i]
}
)
> df
Home Away Index HomeSum AwaySum
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
Here's a data.table alternative -
setDT(df)
for ( i in sort(unique(c(levels(df[,Home]),levels(df[,Away]))))) {
df[, TotalSum := cumsum(i == Home | i == Away)]
df[Home == i, Cumulative_Sum_Home := TotalSum]
df[Away == i, Cumulative_Sum_Away := TotalSum]
}
df[,TotalSum := NULL]

Multirow deletion: delete row depending on other row

I'm stuck with a quite complex problem. I have a data frame with three rows: id, info and rownum. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is to delete all other rows of one id if one of the rows contains the info a. This would mean for example that row 2 and 3 should be removed as row 1's coloumn info contains the value a. Please note that the info values are not ordered (id 3/row 5 & 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all id containing an "a"-value
a_val <- data$id[grep("a", data$info)]
# check for every id containing an "a"-value
for(i in a_val) {
temp_data <- data[which(data$id == i),]
# only go on if the given id contains more than one row
if (nrow(temp_data) > 1) {
for (ii in nrow(temp_data)) {
if (temp_data$info[ii] != "a") {
temp <- temp_data$row[ii]
if (!exists("delete_rows")) {
delete_rows <- temp
} else {
delete_rows <- c(delete_rows, temp)
}
}
}
}
}
My solution works quite well. Nevertheless, it is very, very, very slow as the original data contains more than 700k rows and more that 150k rows with an "a"-value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. #BenBarnes found out that this solution only works if the data frame is ordered according to id.
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df,.(id),subset,rep(!'a'%in%info,length(info))|info=='a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
if df is this (RE Sacha above) use match which just finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
try to take a look at subset, it's quite easy to use and it will solve your problem.
you just need to specify the value of the column that you want to subset based on, alternatively you can choose more columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html

Generating random number by length of blocks of data in R data frame

I am trying to simulate n times the measuring order and see how measuring order effects my study subject. To do this I am trying to generate integer random numbers to a new column in a dataframe. I have a big dataframe and i would like to add a column into the dataframe that consists a random number according to the number of observations in a block.
Example of data(each row is an observation):
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5))
A B C
1 1 x 1
2 1 b 2
3 1 c 4
4 2 g 1
5 2 h 5
6 3 g 7
7 3 g 1
8 3 u 2
9 3 l 5
What I'd like to do is add a D column and generate random integer numbers according to the length of each block. Blocks are defined in column A.
Result should look something like this:
df <- data.frame(A=c(1,1,1,2,2,3,3,3,3),
B=c("x","b","c","g","h","g","g","u","l"),
C=c(1,2,4,1,5,7,1,2,5),
D=c(2,1,3,2,1,4,3,1,2))
> df
A B C D
1 1 x 1 2
2 1 b 2 1
3 1 c 4 3
4 2 g 1 2
5 2 h 5 1
6 3 g 7 4
7 3 g 1 3
8 3 u 2 1
9 3 l 5 2
I have tried to use R:s sample() function to generate random numbers but my problem is splitting the data according to block length and adding the new column. Any help is greatly appreciated.
It can be done easily with ave
df$D <- ave( df$A, df$A, FUN = function(x) sample(length(x)) )
(you could replace length() with max(), or whatever, but length will work even if A is not numbers matching the length of their blocks)
This is really easy with ddply from plyr.
ddply(df, .(A), transform, D = sample(length(A)))
The longer manual version is:
Use split to split the data frame by the first column.
split_df <- split(df, df$A)
Then call sample on each member of the list.
split_df <- lapply(split_df, function(df)
{
df$D <- sample(nrow(df))
df
})
Then recombine with
df <- do.call(rbind, split_df)
One simple way:
df$D = 0
counts = table(df$A)
for (i in 1:length(counts)){
df$D[df$A == names(counts)[i]] = sample(counts[i])
}

Resources