removing duplicate subsets of rows

removing duplicate subsets of rows - r

I have a list of stocks in an index sorted by date, and I'm trying to remove all rows in which the previous row has the same stock code. This will give a dataframe of the initial index and all dates that there was a change to the index
In my working example, I'll use names instead of the date column, and some numbers.
At first, I thought I could remove the rows by using subset() and !duplicated
name <- c("Joe","Mary","Sue","Frank","Carol","Bob","Kate","Jay")
num <- c(1,2,2,1,2,2,2,3)
num2 <- c(1,1,1,1,1,1,1,1)
df <- data.frame(name,num,num2)
dfnew <- subset(df, !duplicated(df[,2]))
However, this might not work in the case where a stock is removed from the list and then later replaced. So, in my working example, the desired output are the rows of Joe, Mary, Frank, Carol and Jay.
Next I created a function to tell if the index changes. The input of the function is row number:
#------ function to tell if there is a change in the row subset-----#
df2 <- as.matrix(df)
ChangeDay <- function(x){
Current <- df2[x,2:3]
Prev <- df2[x-1,2:3]
if (length(Current) != length(Prev))
NewList <- true
else
NewList <- length(which(Current==Prev))!=length(Current)
return(NewList)
}
Finally, I attempt to create a loop to remove the desired rows. I'm new to programming, and I struggle with loops. I'm not sure what the best way is to pre-allocate memory when the dimensions of my final output is unknown. All the books I've looked at only give trivial loop examples. Here is my latest attempt:
result <- matrix(data=NA,nrow=nrow(df2),ncol=3) #pre allocate memory
tmp <- as.numeric(df2) #store the original data
changes <- 1
for (i in 2:nrow(df2)){ #always keep row 1, thus the loop starts at row 2
if(ChangeDay(i)==TRUE){
result[i,] <-tmp[i] #store the row in result if ChangeDay(i)==TRUE
changes <- changes + 1 #increment counter
}
}
result <- result[1:changes,]
Thansk for your help, and any additional general advice on loops is appreciated!

It is not clear what you want to do. But I guess :
df[c(1,diff(df$num)) !=0,]
name num num2
1 Joe 1 1
2 Mary 2 1
4 Frank 1 1
5 Carol 2 1
8 Jay 3 1

Related

How to replace values in all columns in r

Just a quick question: how can I replace some values with others if these values are present in all the dataframe's column? Functions like mapvalues and recode work only if the column is specified, but in my case the dataframe has 89 columns so that would be time-consuming.
For the sake of clarity, take in consideration the following example. I want to replace [NULL] with another value.
Example:
a <- c("NULL",2,"NULL")
b <- c(3, "NULL", 1)
df <- data.frame(a, b)
df
a b
0 NULL 3
1 2 NULL
2 NULL 1
The difference between the example and my case is that the dataset is [35383 x 89], and the values I want to replace are more than one.
Thank you in advance for your time.

An extension to the comment by Ronak Shah. You can add 0 if you want like that. Or you can replace it with desired values, if you like that.
For example, replace the NULLs with mean of the respective columns:
#Run a loop to convert the characters into numbers because for your case it is all characters
#This will change the NULL to NAs.
for (i in colnames(df)){
df[,i] <- as.numeric(df[,i])
}
#Now replace the NAs with the mean of the column
for (i in colnames(df)){
df[,i][is.na(df[,i])] <- mean(df[,i], na.rm=TRUE)
}
You can similarly do this for median also. Let me know in the comment if you have any doubts.

For starters, I have added a few more rows to your example to better show how the code works
df
# a b
#1 NULL 3
#2 2 NULL
#3 NULL 1
#4 a 14
#5 1 a
#6 14 5
First, create two vectors: one with whe values you want to replace (pattern) and one with replacements in the same order. To make sure you have done it right, put them together in a data frame and take a look at the rows (this will also help in next step)
In this case, I want NULL to be 0, "a" to be "alpha", and so on, as shown below
pattern <- c("NULL", "a", 14, 1)
replacement <- c(0, "alpha", "fourteen", "one")
subs <- data.frame(pattern, replacement)
subs
# pattern replacement
#1 NULL 0
#2 a alpha
#3 14 fourteen
#4 1 one
To finish it, we will make a for tthat each time we will pick a pattern and its replacement from the subs data frame we created, and with these values execute a map_df(). This function iterates over the columns from our original data frame (df) and apply the gsub() function with the pattern and replacement
for (i in 1:nrow(subs)) {
df <- map_df(df, gsub, pattern = subs$pattern[i], replacement = subs$replacement[i])
}
df
# a b
#1 0 3
#2 2 0
#3 0 one
#4 alpha fourteen
#5 one alpha
#6 fourteen 5
I hope this was clear. Let me know if you have any doubts

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe, and only keeps those rows whose values are at least let's say 10 different, starting with the first row.Thus it would start with the first row (and store it), and then carry on until it finds a row with a value at least 10 higher than the first, store this row, then start from this value again looking for the next >10diff one.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
# create new dfs to store output
new.df <- list()
new.df1 <- data.frame()
# iterate through each row of df
for (i in 1:nrow(df)) {
# if the value of next row is higher or equal than value or row i+posdiff, keep
# if not ascending, keep
# if first row, keep
if(isTRUE(df$pos[i+1] >= df$pos[i]+pos.diff | df$pos[i+1] < df$pos[i] | i==1 )) {
# add rows that meet conditions to list
new.df[[i]] <- df[i,] }
}
# bind all rows that met conditions
new.df1 <- bind_rows(new.df)
return(new.df1)}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.

It seems I misunderstood the question earlier since we don't want to calculate the difference between consecutive rows, you can try :
nrows <- 1
previous_match <- 1
for(i in 2:nrow(df)) {
if(df$pos[i] - df$pos[previous_match] > 10) {
nrows <- c(nrows, i)
previous_match <- i
}
}
and then subset the selected rows :
df[nrows, ]
Earlier answer
We can use diff to get the difference between consecutive rows and select the row which has difference of greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The first TRUE is to by default select the first row.
In dplyr, we can use lag to get value from previous row :
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)

How to run Chisq test for multiple rows FASTER in R?

I have managed to do chisq-test using loop in R but it is very slow for a large data and I wonder if you could help me out doing it faster with something like dplyr? I've tried with dplyr but I ended up getting an error all the time which I am not sure about the reason.
Here is a short example of my data:
df
1 2 3 4 5
row_1 2260.810 2136.360 3213.750 3574.750 2383.520
row_2 328.050 496.608 184.862 383.408 151.450
row_3 974.544 812.508 1422.010 1307.510 1442.970
row_4 2526.900 826.197 1486.000 2846.630 1486.000
row_5 2300.130 2499.390 1698.760 1690.640 2338.640
row_6 280.980 752.516 277.292 146.398 317.990
row_7 874.159 794.792 1033.330 2383.420 748.868
row_8 437.560 379.278 263.665 674.671 557.739
row_9 1357.350 1641.520 1397.130 1443.840 1092.010
row_10 1749.280 1752.250 3377.870 1534.470 2026.970
cs
1 1 1 2 1 2 2 1 2 3
What I want to do is to run chisq-test between each row of the df and cs. Then giving me the statistics and p.values as well as row names.
here is my code for the loop:
value = matrix(nrow=ncol(df),ncol=3)
for (i in 1:ncol(df)) {
tst <- chisq.test(df[i,], cs)
value[i,1] <- tst$p.value
value[i,2] <- tst$statistic
value[i,3] <- rownames(df)[i]}
Thanks for your help.

I guess you do want to do this column by column. Knowing the structure of Biobase::exprs(PANCAN_w)) would have helped greatly. Even better would have been to use an example from the Biobase package instead of a dataset that cannot be found.
This is an implementation of the code I might have used. Note: you do NOT want to use a matrix to store results if you are expecting a mixture of numeric and character values. You would be coercing all the numerics to character:
value = data.frame(p_val =NA, stat =NA, exprs = rownames(df) )
for (i in 1:col(df)) {
# tbl <- table((df[i,]), cs) ### No use seen for this
# I changed the indexing in the next line to compare columsn to the standard `cs`.
tst <- chisq.test(df[ ,i], cs) #chisq.test not vectorized, need some sort of loop
value[i, 1:2] <- tst[ c('p.value', 'statistic')] # one assignment per row
}
Obviously, you would need to change every instance of df (not a great name since there is also a df function) to Biobase::exprs(PANCAN_w)

Searching an item with an for loop

I am trying to do a for loop which would search over every row in data frame, but just the
first column checking the tag ID, and if its not it, then it should move to the next row and so on until it finds the value or get to the end of the data frame.
Then the row as a result should be printed.
The purpose is just checking how the for loop works and how "slow" it is ( I want it to compare to other way of searching). I am a bit inexperienced in R and programming general.
Progress so far/my code
Thus far I have done this code and the stopping point is how to make the function move to the other column and check it and move to the next.
SearchID = function(data,value) {
for(i in 1:nrow(testdata)) {
row <- testdata[i,1]
if("row" == "value") return(row)
#what now?
}
}
This is an reproducible example:
ID=c("ID43","ID23","ID14","ID14")
y=c(23,45,66,76)
k=c("yes","no","yes","no")
testdata= data.frame(ID,y,k)
If I give the ID14 as value, it should return the whole row with the ID14:
ID y k
4 ID14 76 no

Here, first you create an object d1 to hold the rows that match the value with the ID column. We are looping through each row of the data with the for loop and check the condition. If it matches, then use rbind to bind that row with the created object. You can also initialize d1 as d1 <- data.frame().
SearchID <- function(data,value) {
d1 <- c()
for(i in 1:nrow(data)) {
row <- data[i,1]
if(row==value){
d1 <- rbind(d1,data[i,])
}
}
d1
}
SearchID(testdata, 'ID14')
# ID y k
#3 ID14 66 yes
#4 ID14 76 no
SearchID(testdata, 'ID43')
# ID y k
#1 ID43 23 yes
SearchID(testdata, "ID86")
#NULL

It's not clear what assumptions about R knowledge can be made in trying to answer this question. For instance, if we could assume that we know how to extract rows based on a vector of row indices, the following would seem more natural to me than binding the rows one-by-one:
SearchID = function(data,value) {
getme <- numeric() # Initialize empty vector
for(i in 1:nrow(data)) { # Start the loop
row <- data[i, 1] # Capture the relevant value
# Compare, and If there's a match,
if (row == value) getme <- c(getme, i) # add loop index to "getme" vector
}
if (length(getme) == 0) NULL # If the vector is still empty, NULL
else data[getme, ] # else return the relevant rows
}
SearchID(testdata, "ID14")
# ID y k
# 3 ID14 66 yes
# 4 ID14 76 no
At the very least, this answer should give you something else to benchmark against :-)

Does column exist and how to rearrange columns in R data frame

How do I add a column in the middle of an R data frame? I want to see if I have a column named "LastName" and then add it as the third column if it does not already exist.

One approach is to just add the column to the end of the data frame, and then use subsetting to move it into the desired position:
d$LastName <- c("Flim", "Flom", "Flam")
bar <- d[c("x", "y", "Lastname", "fac")]

1) Testing for existence: Use %in% on the colnames, e.g.
> example(data.frame) # to get 'd'
> "fac" %in% colnames(d)
[1] TRUE
> "bar" %in% colnames(d)
[1] FALSE
2) You essentially have to create a new data.frame from the first half of the old, your new column, and the second half:
> bar <- data.frame(d[1:3,1:2], LastName=c("Flim", "Flom", "Flam"), fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim C
2 1 2 Flom A
3 1 3 Flam A
>

Of the many silly little helper functions I've written, this gets used every time I load R. It just makes a list of the column names and indices but I use it constantly.
##creates an object from a data.frame listing the column names and location
namesind=function(df){
temp1=names(df)
temp2=seq(1,length(temp1))
temp3=data.frame(temp1,temp2)
names(temp3)=c("VAR","COL")
return(temp3)
rm(temp1,temp2,temp3)
}
ni <- namesind
Use ni to see your column numbers. (ni is just an alias for namesind, I never use namesind but thought it was a better name originally) Then if you want insert your column in say, position 12, and your data.frame is named bob with 20 columns, it would be
bob2 <- data.frame(bob[,1:11],newcolumn, bob[,12:20]
though I liked the add at the end and rearrange answer from Hadley as well.

Dirk Eddelbuettel's answer works, but you don't need to indicate row numbers or specify entries in the lastname column. This code should do it for a data frame named df:
if(!("LastName" %in% names(df))){
df <- cbind(df[1:2],LastName=NA,df[3:length(df)])
}
(this defaults LastName to NA, but you could just as easily use "LastName='Smith'")

or using cbind:
> example(data.frame) # to get 'd'
> bar <- cbind(d[1:3,1:2],LastName=c("Flim", "Flom", "Flam"),fac=d[1:3,3])
> bar
x y LastName fac
1 1 1 Flim A
2 1 2 Flom B
3 1 3 Flam B

I always thought something like append() [though unfortunate the name is] should be a generic function
## redefine append() as generic function
append.default <- append
append <- `body<-`(args(append),value=quote(UseMethod("append")))
append.data.frame <- function(x,values,after=length(x))
`row.names<-`(data.frame(append.default(x,values,after)),
row.names(x))
## apply the function
d <- (if( !"LastName" %in% names(d) )
append(d,values=list(LastName=c("Flim","Flom","Flam")),after=2) else d)

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

removing duplicate subsets of rows - r

It is not clear what you want to do. But I guess : df[c(1,diff(df$num)) !=0,] name num num2 1 Joe 1 1 2 Mary 2 1 4 Frank 1 1 5 Carol 2 1 8 Jay 3 1

Related

How to replace values in all columns in r

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

How to run Chisq test for multiple rows FASTER in R?

Searching an item with an for loop

Does column exist and how to rearrange columns in R data frame

Categories

Resources