Using grep in a nested for loop - r

I am trying to automate one of my simulations. I have two sets of data: one is the subject IDs of patients (187 rows long), the other is the sample IDs (3057 rows long). I would like to classify the sample IDs based on the subject.
For example: if the subject ID is ABCD, the samples taken from that subject will be ABCD-0001, ABCD-0002 and so on.
Now I am trying to use grep to check, for every element of the subject IDs, whether it is a substring of a sample ID. If so, the value grep returns (the row number in the sample IDs) gives the position in a new vector, and the value stored there is the row number in the subject IDs.
As in
SubID SampID
ABCD  ABCD-0001
EFGH  ABCD-0002
IJKL  IJKL-0001
      IJKL-0002
      EFGH-0001
      EFGH-0002
      EFGH-0003
Desired Output
Numeric ID
1
1
3
3
2
2
2
I am using this code:
for (j in 1:nrow(SubID)) {
  for (i in 1:nrow(SampID)) {
    if (length(k <- grep(SubID[j, 1], SampID[i, 1])) > 0) {
      Ind[i] <- j
    }
  }
}

There are ways to solve this without using a for-loop
Data:
a = data.frame(subID = c("ab","cd","de"))
b = data.frame(SampID = c("ab-1","ab-2","de-1","de-2","cd-1","cd-2","cd-3"))
> a
subID
1 ab
2 cd
3 de
> b
SampID
1 ab-1
2 ab-2
3 de-1
4 de-2
5 cd-1
6 cd-2
7 cd-3
To obtain the corresponding index, first take the substring holding the subject ID (characters 1 to 2 in my example; in yours it would be characters 1 to 4, if all subject IDs have 4 letters):
f = substr(b$SampID, 1, 2)
b$num = sapply(f, function(x) { which(x == a$subID) })
Which gives:
> b
SampID num
1 ab-1 1
2 ab-2 1
3 de-1 3
4 de-2 3
5 cd-1 2
6 cd-2 2
7 cd-3 2
Edit: Different letter lengths
If the subject IDs in a have different lengths, you can do it with a single for loop. Try this:
a = data.frame(subID = c("ab", "cd", "def"))
b = data.frame(SampID = c("ab-1", "ab-2", "def-1", "def-2", "cd-1", "cd-2", "cd-3"))
b$num = 0
for (k in 1:length(a$subID)) {
  b$num[grepl(pattern = a$subID[k], x = b$SampID)] = k
}
In this case we loop through every element of a and use grepl to find the SampID entries that contain that pattern, assigning the loop index to those that match.
New Results:
> b
SampID num
1 ab-1 1
2 ab-2 1
3 def-1 3
4 def-2 3
5 cd-1 2
6 cd-2 2
7 cd-3 2
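If you would rather avoid the loop entirely, even with variable-length IDs, one option (a sketch using the same example data) is to strip the suffix with sub and look the prefix up with match:

```r
a <- data.frame(subID = c("ab", "cd", "def"), stringsAsFactors = FALSE)
b <- data.frame(SampID = c("ab-1", "ab-2", "def-1", "def-2", "cd-1", "cd-2", "cd-3"),
                stringsAsFactors = FALSE)

# Drop everything from the first hyphen onward, leaving just the subject prefix
prefix <- sub("-.*$", "", b$SampID)

# match() returns the position of each prefix in a$subID
b$num <- match(prefix, a$subID)
b$num
# [1] 1 1 3 3 2 2 2
```

This assumes the sample IDs always use a hyphen to separate the subject ID from the sample number, as in the examples above.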

Related

R Check for levels within a group and duplicate row if not present

I have a problem (which came up in a Shiny app) that I will show with a simple example:
I have the following data:
Group <- c("A", "A", "B", "C", "C", "D")
Value <- c(1, 2, 6, 7, 3, 9)
df <- data.frame(Group, Value)
Group Value
A 1
A 2
B 6
C 7
C 3
D 9
Then I add a column to see how many reps a group has:
df$num <- ave(df$Value, df$Group, FUN = seq_along)
Group Value num
A 1 1
A 2 2
B 6 1
C 7 1
C 3 2
D 9 1
Now I would like to check whether each group contains a 2nd rep, and if not, duplicate the group's 1st row (the one with num = 1), setting num to 2.
So I would like to end up with:
Group Value num
A 1 1
A 2 2
B 6 1
B 6 2 #this row is added
C 7 1
C 3 2
D 9 1
D 9 2 #this row is added
I have tried searching for a solution, but I mainly found cases where the condition is based on a certain value, rather than conditions within a group.
Could someone help me? I would appreciate it a lot!
Can this code do the trick?
res <- lapply(unique(df$Group), function(x) {
  a <- df[df$Group == x, ]
  if (nrow(a) == 1) {
    a <- a[rep(row.names(a), 2), ]
    a$num <- 1:2
  }
  a
})
do.call(rbind, res)
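A vectorized sketch of the same idea (assuming, as in the question, that a group has at most two reps): find the groups that lack a num == 2 row, copy their first rows with num set to 2, and append them:

```r
df <- data.frame(Group = c("A", "A", "B", "C", "C", "D"),
                 Value = c(1, 2, 6, 7, 3, 9))
df$num <- ave(df$Value, df$Group, FUN = seq_along)

# Groups that already have a second rep
full <- df$Group[df$num == 2]

# First rows of the groups that don't; duplicate them with num = 2
extra <- df[df$num == 1 & !df$Group %in% full, ]
extra$num <- 2

out <- rbind(df, extra)
out <- out[order(out$Group, out$num), ]
```

The final ordering step is only needed if the duplicated rows should sit next to their group, as in the desired output.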

For loop to paste rows to create new dataframe from existing dataframe

New to SO, but I can't figure out how to get this code to work. I have a very large dataframe that is set up like this:
Number Year Type Amount
1 1 A 5
1 2 A 2
1 3 A 7
1 4 A 1
1 1 B 5
1 2 B 11
1 3 B 0
1 4 B 2
This goes on like this for multiple numbers. I want to take this dataframe and make a new dataframe that puts two rows together side by side, nested (for example, row 1 with row 2, row 1 with row 3, row 1 with row 4, row 2 with row 3, row 2 with row 4), so that every combination of years appears together within each type and number.
Example output:
Number Year Type Amount Number Year Type Amount
1 1 A 5 1 2 A 2
1 1 A 5 1 3 A 7
1 1 A 5 1 4 A 1
1 2 A 2 1 3 A 7
1 2 A 2 1 4 A 1
1 3 A 7 1 4 A 1
I thought I would use a for loop over number and type, but I do not know how to paste the rows together from there, or how to make sure I only get each combination of rows once. For example:
for (i in 1:n_number) {
  for (j in 1:n_type) {
    ....
  }
}
Any tips would be appreciated! I am relatively new to coding, so I don't know if I should be using a for loop at all. Thank you!
df <- data.frame(Number = rep(1, 8),
                 Year = rep(1:4, 2),
                 Type = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))
My interpretation is that you want to create a dataframe with all row combinations, where Number and Type are the same and Year is different.
First suggestion: join on Number and Type, then remove rows that have the same Year. I added an index to prevent redundant matches (1 with 2 and 2 with 1).
df$index <- 1:nrow(df)
out <- merge(df, df, by = c("Number", "Type"))
out <- out[which(out$index.x > out$index.y & out$Year.x != out$Year.y), ]
Second suggestion: if you want to see a version using a loop.
out2 <- NULL
for (i in 1:(nrow(df) - 1)) {
  for (j in (i + 1):nrow(df)) {
    if (df[i, "Year"] != df[j, "Year"] & df[i, "Number"] == df[j, "Number"] & df[i, "Type"] == df[j, "Type"]) {
      out2 <- rbind(out2, cbind(df[i, ], df[j, ]))
    }
  }
}
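A third sketch uses combn(), which generates each unordered pair of rows exactly once per Number/Type group (assuming, as in the example data, that Year is unique within a group, so no separate Year check is needed):

```r
df <- data.frame(Number = rep(1, 8),
                 Year = rep(1:4, 2),
                 Type = rep(c("A", "B"), each = 4),
                 Amount = c(5, 2, 7, 1, 5, 11, 0, 2))

# All unordered pairs of rows within each Number/Type group
pairs <- do.call(rbind, lapply(split(df, list(df$Number, df$Type)), function(g) {
  if (nrow(g) < 2) return(NULL)
  idx <- combn(nrow(g), 2)   # 2 x choose(n, 2) matrix of row-index pairs
  cbind(g[idx[1, ], ], g[idx[2, ], ])
}))
```

With the example data this yields choose(4, 2) = 6 pairs for each of the two Number/Type groups, 12 rows in total.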

Cumulative sum conditional over multiple columns in r dataframe containing the same values

Say my data.frame is as outlined below:
df <- as.data.frame(cbind("Home" = c("a", "c", "e", "b", "e", "b"),
                          "Away" = c("b", "d", "f", "c", "a", "f")))
df$Index <- rep(1, nrow(df))
Home Away Index
1 a b 1
2 c d 1
3 e f 1
4 b c 1
5 e a 1
6 b f 1
What I want to do is calculate a cumulative sum, using the Index column, for each character a to f, regardless of whether it appears in the Home or Away column. A column called Cumulative_Sum_Home takes the character in the Home column ("b" in the case of row 6) and counts how many times it has appeared in either the Home or Away column in all rows up to and including that one. In this case "b" has appeared 3 times cumulatively in the first 6 rows, so Cumulative_Sum_Home takes the value 3. The same logic applies to the Cumulative_Sum_Away column: taking row 5, character "a" appears in the Away column and has cumulatively appeared 2 times in either column up to that row, so Cumulative_Sum_Away takes the value 2.
Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1 a b 1 1 1
2 c d 1 1 1
3 e f 1 1 1
4 b c 1 2 2
5 e a 1 2 2
6 b f 1 3 2
I have to confess to being totally stumped as to how to solve this problem. I've tried looking at data.table approaches, but I've never used that package before, so I can't immediately see how to solve it. Any tips would be gratefully received.
There is scope to make this leaner, but if that doesn't matter much for you, then this should be okay.
NewColumns = list()
for (i in sort(unique(c(levels(df[, "Home"]), levels(df[, "Away"]))))) {
  NewColumnAddition = i == df$Home | i == df$Away
  NewColumnAddition[NewColumnAddition] = cumsum(NewColumnAddition[NewColumnAddition])
  NewColumns[[i]] = NewColumnAddition
}
df$Cumulative_Sum_Home = sapply(seq(nrow(df)), function(i) {
  NewColumns[[as.character(df[i, "Home"])]][i]
})
df$Cumulative_Sum_Away = sapply(seq(nrow(df)), function(i) {
  NewColumns[[as.character(df[i, "Away"])]][i]
})
> df
  Home Away Index Cumulative_Sum_Home Cumulative_Sum_Away
1    a    b     1                   1                   1
2    c    d     1                   1                   1
3    e    f     1                   1                   1
4    b    c     1                   2                   2
5    e    a     1                   2                   2
6    b    f     1                   3                   2
Here's a data.table alternative:
setDT(df)
for (i in sort(unique(c(levels(df[, Home]), levels(df[, Away]))))) {
  df[, TotalSum := cumsum(i == Home | i == Away)]
  df[Home == i, Cumulative_Sum_Home := TotalSum]
  df[Away == i, Cumulative_Sum_Away := TotalSum]
}
df[, TotalSum := NULL]
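A base-R sketch of the same computation, without data.table: interleave the Home and Away columns into one long vector (so appearances are in row order, with Home counted before Away within a row, matching the desired output), take a running count per team with ave(), and map the counts back:

```r
df <- data.frame(Home = c("a", "c", "e", "b", "e", "b"),
                 Away = c("b", "d", "f", "c", "a", "f"),
                 stringsAsFactors = FALSE)

# Interleave Home/Away so appearances are in chronological order
long <- data.frame(team = c(rbind(df$Home, df$Away)),
                   side = rep(c("Home", "Away"), nrow(df)),
                   stringsAsFactors = FALSE)

# Running appearance count for each team
long$count <- ave(seq_along(long$team), long$team, FUN = seq_along)

df$Cumulative_Sum_Home <- long$count[long$side == "Home"]
df$Cumulative_Sum_Away <- long$count[long$side == "Away"]
```

The c(rbind(...)) trick works because c() reads a matrix column by column, so the two columns come out interleaved row by row.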

R - interesting issue with either data frame or global variable

Take a look at this code. I am trying to write code that starts with an empty data frame containing only IDs and dynamically adds data. For example, let's say it starts with
ID
1 1
2 2
3 3
And then I make a call
addPair(1, "a", 4) # sets the value of column "a" at row 1 to the value 4
it would become
ID a
1 1 4
2 2 NA
3 3 NA
Take a look at the code below. The desired final value of the total variable is:
ID a
1 1 4
2 2 NA
3 3 NA
But at the end, total is just
ID
1 1
2 2
3 3
Why is total not keeping what it adds? At the end of the function, total is correct, but after the function call, total is back to just the IDs. Here is the code, and below is the output.
# rm(list = ls()) # that code _should_ always be commented out
# get all the IDs
IDs <- c("1", "2", "3")
N <- length(IDs)
# the big data frame
total <- data.frame(ID = IDs)
addPair <- function(i, name, val) {
  total[, toString(name)] <- rep(NA, N)
  total[, toString(name)][i] <- val
  print("end")
  print(total)
}
addPair(1,"a",4)
print("after call")
print(total)
Here is the output:
[1] "end"
ID a
1 1 4
2 2 NA
3 3 NA
> print("after call")
[1] "after call"
> print(total)
ID
1 1
2 2
3 3
Why does total lose the column a after the function returns?
transform(total, a = {b = rep(NA, N); b[1] <- 4; b})
ID a
1 1 4
2 2 NA
3 3 NA
This returns the desired object; however, it does not change total unless you assign the result to that name.
addPair <- function(df, item, name, val) transform(
  df, name = {t = rep(NA, nrow(df)); t[item] = val; t})
addPair(total, 1,"a",4)
ID name
1 1 4
2 2 NA
3 3 NA
> total <- addPair(total, 1,"a",4)
> total
ID name
1 1 4
2 2 NA
3 3 NA
Unfortunately, as with with, transform is designed more for console use than for programming. It's not totally safe for coding use.
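Since R functions receive copies of their arguments, the usual pattern is simply to return the modified data frame and reassign it at the call site. A minimal sketch of that pattern (the function body here is illustrative, not the original code):

```r
total <- data.frame(ID = c("1", "2", "3"), stringsAsFactors = FALSE)

# Return the modified copy; the caller reassigns it
addPair <- function(df, i, name, val) {
  df[[name]] <- NA        # new column, all NA
  df[[name]][i] <- val    # set the requested cell
  df
}

total <- addPair(total, 1, "a", 4)
# total$a is now 4, NA, NA
```

This avoids both the local-copy surprise in the question and the non-standard-evaluation pitfalls of transform.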

Multirow deletion: delete row depending on other row

I'm stuck on a quite complex problem. I have a data frame with three columns: id, info and row. The data looks like this:
id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8
What I want to do now is delete all the other rows for an id if one of that id's rows contains the info value a. This means, for example, that rows 2 and 3 should be removed because row 1's info column contains the value a. Please note that the info values are not ordered (see id 3, rows 5 and 6) and cannot be ordered due to other data limitations.
I solved the case using a for loop:
# select all ids containing an "a" value
a_val <- data$id[grep("a", data$info)]
# check every id containing an "a" value
for (i in a_val) {
  temp_data <- data[which(data$id == i), ]
  # only go on if the given id contains more than one row
  if (nrow(temp_data) > 1) {
    for (ii in 1:nrow(temp_data)) {
      if (temp_data$info[ii] != "a") {
        temp <- temp_data$row[ii]
        if (!exists("delete_rows")) {
          delete_rows <- temp
        } else {
          delete_rows <- c(delete_rows, temp)
        }
      }
    }
  }
}
My solution works, but it is very, very slow, as the original data contains more than 700k rows, over 150k of them with an "a" value.
I could use a foreach loop with 4 cores to speed it up, but maybe someone could give me a hint for a better solution.
Best regards,
Arne
[UPDATE]
The outcome should be:
id info row
1 a 1
2 a 4
3 a 6
4 b 7
4 c 8
Here is one possible solution.
First find ids where info contains "a":
ids <- with(data, unique(id[info == "a"]))
Subset the data:
subset(data, (id %in% ids & info == "a") | !id %in% ids)
Output:
id info row
1 1 a 1
4 2 a 4
6 3 a 6
7 4 b 7
8 4 c 8
An alternative solution (maybe harder to decipher):
subset(data, info == "a" | !rep.int(tapply(info, id, function(x) any(x == "a")),
table(id)))
Note. #BenBarnes found out that this solution only works if the data frame is ordered according to id.
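A base-R variant of the same idea that does not depend on the data frame being ordered by id is to compute the per-id flag with ave() (a sketch using the question's data):

```r
data <- read.table(text = "id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8", header = TRUE, stringsAsFactors = FALSE)

# TRUE for every row whose id has at least one 'a' (order-independent)
has_a <- ave(data$info == "a", data$id, FUN = any)

# Keep the 'a' rows of those ids, and all rows of ids without an 'a'
result <- data[(has_a & data$info == "a") | !has_a, ]
```

Because ave() aligns its result with the original rows, this works no matter how the ids are interleaved.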
You might want to investigate the data.table package:
EDIT: If the row variable is not a sequential numbering of each row in your data (as I assumed it was), you could create such a variable to obtain the original row order:
library(data.table)
# Create data.table of your data
dt <- as.data.table(data)
# Create index to maintain row order
dt[, idx := seq_len(nrow(dt))]
# Set a key on id and info
setkeyv(dt, c("id", "info"))
# Determine unique ids
uid <- dt[, unique(id)]
# subset your data to select rows with "a"
dt2 <- dt[J(uid, "a"), nomatch = 0]
# identify rows of dataset where the id doesn't have an "a"
dt3 <- dt[J(dt2[, setdiff(uid, id)])]
# rbind those two data.tables together
(dt4 <- rbind(dt2, dt3))
# id info row idx
# 1: 1 a 1 1
# 2: 2 a 4 4
# 3: 3 a 6 6
# 4: 4 b 7 7
# 5: 4 c 8 8
# And if you need the original ordering of rows,
dt5 <- dt4[order(idx)]
Note that setting a key for the data.table will order the rows according to the key columns. The last step (creating dt5) sets the row order back to the original.
Here is a way using ddply:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
library("plyr")
ddply(df, .(id), subset, rep(!'a' %in% info, length(info)) | info == 'a')
Returns:
id info row
1 1 a 1
2 2 a 4
3 3 a 6
4 4 b 7
5 4 c 8
If df is as above (re Sacha's answer), use match, which simply finds the index of the first occurrence:
df <- read.table(text="id info row
1 a 1
1 b 2
1 c 3
2 a 4
3 b 5
3 a 6
4 b 7
4 c 8",header=TRUE)
# the first info row matching 'a' and all other rows that are not 'a'
with(df, df[c(match('a',info), which(info != 'a')),])
id info row
1 1 a 1
2 1 b 2
3 1 c 3
5 3 b 5
7 4 b 7
8 4 c 8
Try taking a look at subset; it's quite easy to use and it will solve your problem.
You just need to specify the value of the column that you want to subset on; alternatively you can choose more columns.
http://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html
http://www.statmethods.net/management/subset.html