R: find Index of data frame with non-unique/duplicated values


I want to extract some values from a vector, modify them, and put them back at their original positions.
I have searched a lot and tried different approaches to this problem. I'm afraid the answer might be really simple, but I'm not seeing it yet.
First I create a vector and convert it to a data frame. I also create an empty data frame for the results.
hight <- c(5,6,1,3)
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
hight_min_df <- data.frame()
For every pair of adjacent values, extract the smaller value together with its ID.
for(i in 1:(length(hight_df[,2])-1))
{
hight_min_df[i,1] <- which(grepl(min(hight_df[,2][i:(i+1)]), hight_df[,2]))
hight_min_df[i,2] <- min(hight_df[,2][i:(i+1)])
}
Modify the extracted values and aggregate rows with the same ID by taking the higher value. Finally, write the modified values back.
hight_min_df[,2] <- hight_min_df[,2]+20
adj_hight <- aggregate(x=hight_min_df[,2],by=list(hight_min_df[,1]), FUN=max)
hight[adj_hight[,1]] <- adj_hight[,2]
This works perfectly as long as I have only unique values in hight.
How can I run this script with a vector like this: hight <- c(5,6,1,3,5)?

Alright, there's a lot to unpack here. Instead of looping, I would suggest piping functions with dplyr. Read the dplyr vignette; it is an outstanding resource, and dplyr is an excellent approach to data manipulation in R.
So using dplyr we can rewrite your code like this:
library(dplyr)
hight <- c(5,6,1,3,5) #skip straight to the test case
hight_df <- data.frame("ID"=1:length(hight), "hight"=hight)
adj_hight <- hight_df %>%
  # Logic: walking from the first row to the last, compare each row with the
  # previous one (via lag()). If the previous hight is greater than the current
  # hight, take the current row's ID and value; otherwise take the previous row's.
  mutate(subst.id = ifelse(lag(hight) > hight, ID, lag(ID)),
         subst.val = ifelse(lag(hight) > hight, hight, lag(hight)) + 20) %>%
  filter(!is.na(subst.val)) %>% # drop rows with no previous value (the first row)
  select(subst.id, subst.val) %>% # keep just the columns we want
  # grouping: replaces your use of aggregate()
  group_by(subst.id) %>%
  summarise(subst.val = max(subst.val)) %>%
  data.frame(.)
#tying back in
hight[adj_hight[,1]] <- adj_hight[,2]
print(hight)
Giving:
[1] 25 6 21 23 5
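For comparison, the same pairwise logic can probably also be sketched in base R without dplyr. This is only an illustration and not part of the answer above; the names left, right, min_id and min_val are made up for this sketch. It avoids grepl(), which does pattern matching and therefore returns several positions at once when the minimum value occurs more than once in the vector.

```r
hight <- c(5, 6, 1, 3, 5)
n <- length(hight)

# Adjacent pairs: left[i] and right[i] are elements i and i + 1
left  <- hight[-n]
right <- hight[-1]

# Position of the smaller element of each pair (ties go to the left element)
min_id  <- ifelse(right < left, 2:n, 1:(n - 1))
# Smaller value of each pair, modified by +20 as in the question
min_val <- pmin(left, right) + 20

# Where a position was picked more than once, keep the higher value
adj <- tapply(min_val, min_id, max)

# Write the modified values back to their original positions
hight[as.integer(names(adj))] <- as.vector(adj)
print(hight)
```

Because positions are computed from indices rather than looked up by value, duplicated values in hight should not matter here.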

Related

Using dplyr/purrr instead of for loops to mask multiple columns and/or expand rows

Essentially, it's about using bitmask/binary columns and row-oriented operations against a data table/frame. Firstly, to construct a logical vector from a combination of selected columns that can be used to mask a character vector to represent 'what' columns are flagged. Secondly, row expansion: given a count in one column, produce a data table that contains the original row data replicated that number of times.
For summarising the flags using a row-wise bitmask, which uses purrr::reduce to concatenate the row-represented flags, I cannot find a succinct method to do this in a %>% chain rather than a separate for loop. I suspect a purrr::map is required, but I cannot get it or its syntax right.
For the row expansion, the nested for loop has appalling performance, and I cannot find a way for dplyr/purrr to replicate each row a given number of times. A map and other functions would need to produce and append multiple rows, which I don't think map is capable of.
The following code produces the required output - but, apart from performance issues (especially regarding row expansion), I'd like to be able to do this as vectorised operations.
library(tidyverse)
library(data.table)
dt <- data.table(C1=c(0,0,1,0,1,0),
C2=c(1,0,0,0,0,1),
C3=c(0,1,0,0,1,0),
C4=c(0,1,1,0,0,0),
C5=c(0,0,0,0,1,1),
N=c(5,2,6,8,1,3),
Spurious = '')
flags <- c("Scratching Head","Screaming",
"Breaking Keyboard","Coffee Break",
"Giving up")
# Summarise states
flagSummary <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5),.funs=as.logical) %>%
dplyr::mutate(States=c(""))
for(i in 1:nrow(interim)){
interim$States[i] <-
flags[as.logical(interim[i,1:5])] %>%
purrr::reduce(~ paste(.x, .y, sep = ","),.init="") %>%
stringr::str_replace("^[,]","") }
dplyr::select(interim,States,N) }
summary <- flagSummary(dt)
View(summary)
# Expand states
expandStates <- function(dt){
interim <- dt %>%
dplyr::mutate_at(vars(C1:C5), .funs=as.logical) %>%
dplyr::select_at(vars(C1:C5,N)) %>%
data.table::setnames(.,append(flags,"Count"))
expansion <- interim[0,1:5]
for(i in 1:nrow(interim)){
for(j in 1:interim$Count[i]){
expansion <- bind_rows(expansion, interim[i,1:5]) } }
expansion }
expansion <- expandStates(dt)
View(expansion)
As stated, the code produces the expected result. I'd 'like' to see the same without resorting to for loops and whilst being able to chain the functions into the initial mutate/selects.

As for the row expansion in the expandStates function, the answer is given in "Replicate each row of data.frame and specify the number of replications for each row?" by A5C1D2H2I1M1N2O1R2T1.
Essentially, the nested for loop is simply replaced by
interim[rep(rownames(interim[,1:5]),interim$Count),][1:5]
On my 'actual' data, this reduces the user time reported by system.time() from 28.64 seconds to 0.06 seconds when producing some 26000 rows.
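A minimal base R sketch of both steps, assuming a plain data.frame rather than a data.table and a cut-down version of the example data (two flag columns instead of five). The flag summary part is not covered by the answer above and is only one possible approach.

```r
# Cut-down toy data: two flag columns and a count column
df <- data.frame(C1 = c(0, 0, 1),
                 C2 = c(1, 0, 0),
                 N  = c(2, 1, 3))
flags <- c("Scratching Head", "Screaming")

# Row expansion: repeat each row index N times, then subset once
expanded <- df[rep(seq_len(nrow(df)), times = df$N), c("C1", "C2")]

# Flag summary: turn the flag columns into a logical matrix and use each
# row as a mask on the flags vector, joining the selected names with commas
mask   <- as.matrix(df[, c("C1", "C2")]) == 1
states <- apply(mask, 1, function(row) paste(flags[row], collapse = ","))

print(nrow(expanded))
print(states)
```

Indexing once with a repeated row index avoids growing the result row by row, which is what makes the nested loop with bind_rows so slow.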

Struggling with for-loop and mutate in R

I am trying to wrap my head around R, and I'm sure I'm doing something silly.
I have a dataframe that includes 30 brands (whose names I have separately in a list called "brands") and a list of new names that I wish to insert into the dataframe (called "known brands").
I am trying to populate the results of an if statement within new columns in an R dataframe (using the names within "known brands"), but this keeps generating an error message (unexpected '{' in "{").
I'm not sure where I'm going wrong - here's my code:
for(i in 1:length(brands)){
plot1a_df <- plot1a_df %>% mutate(known_brands[i] = ifelse(brands[i] >1, 1, 0))
}
To illustrate with data (assume 3 rows x 2 columns):
plot1a_df = data.frame(brands = c(1,0,2), Misc = c(0,0,0))
The idea is to end up with a third column ("known_brands") with c(0,0,1)

To add a logical column with dplyr:
library(dplyr)
plot1a_df %>% mutate(is_brand_known = brands %in% brand_list)
Another example, using the iris dataset:
species_list = c('setosa')
iris %>% mutate(is_setosa = Species %in% species_list)

for (i in 1:30){
plot1a_df[, known_cols[i]] <- ifelse(plot1a_df[,brands[i]] >1, 1, 0)
print(plot1a_df[, known_cols[i]])
}
I found that my solution could be achieved without mutate, although I still wonder whether it is possible to combine for loops with dplyr. (I realise there's a lot of commentary on here, but nothing at a high enough level for this simpleton to understand, at least!)
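The loop above can probably also be written without an explicit for, by assigning all new columns in one step. This is a minimal base R sketch, assuming brands holds the existing column names and known_cols the new ones; the known_ prefix is made up for illustration.

```r
plot1a_df <- data.frame(brands = c(1, 0, 2), Misc = c(0, 0, 0))

# Assumed for illustration: existing column names and the new names to create
brands     <- c("brands", "Misc")
known_cols <- paste0("known_", brands)

# Apply the same test to every listed column and add the results as new columns
plot1a_df[known_cols] <- lapply(plot1a_df[brands],
                                function(x) ifelse(x > 1, 1, 0))
print(plot1a_df)
```

The original mutate call most likely fails because known_brands[i] = ... is not a valid argument name; selecting and assigning columns by character name sidesteps that.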

Apply series of changes to multiple similar datasets in R

I have 20 csv files of data that are formatted exactly the same, about 40 columns of different numbers, but with different values in each column. I want to apply a series of changes to each data frame in order to extract specific information from every one of them.
Specifically I want to extract four columns from each data frame, find the maximum value of each column in each data frame and then add all of these maximum values together, so I get one final number for each data frame. Something like this:
str(data)
Extract<-data[c(1,2,3,4)]
Max<-apply(Extract,2,max)
Add<-Max[1] + Max[2] + Max[3] + Max[4]
I have the code written above to do all these steps for every data frame individually, but is it possible to apply this code to all of them at once?

If you put all 20 filenames into a vector called files, you can loop over them:
Maxes <- numeric(length(files))
i <- 1
for (file in files) {
data <- read.csv(file)
str(data)
Extract<-data[c(1,2,3,4)]
Max<-apply(Extract,2,max)
Add<-Max[1] + Max[2] + Max[3] + Max[4]
Maxes[i] <- Add
i <- i+1
}
Note that str(data) will just print a lot of output to the terminal 20 times. I'm not sure what the value of that is, but it was in your question, so I included it.
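The same idea can be sketched with a small helper function and vapply instead of a manual counter. This is only an illustration: the temporary folder, the two toy CSV files and the helper name sum_of_maxes are made up so that the example runs on its own.

```r
# Create two small CSV files in a temporary folder so the sketch is runnable
tmp_dir <- tempfile("csv_demo")
dir.create(tmp_dir)
write.csv(data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12, e = 0),
          file.path(tmp_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(a = c(2, 8), b = c(1, 1), c = c(0, 5), d = c(3, 4), e = 9),
          file.path(tmp_dir, "two.csv"), row.names = FALSE)

# full.names = TRUE returns paths that read.csv() can open directly
files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

# Read one file, take the first four columns, and add up their maxima
sum_of_maxes <- function(path) {
  data <- read.csv(path)
  as.numeric(sum(sapply(data[1:4], max, na.rm = TRUE)))
}

# One number per file, named by file name
totals <- vapply(files, sum_of_maxes, numeric(1))
names(totals) <- basename(files)
print(totals)
```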

Put all your files into a common folder such as /path/temp/
csvs <- list.files("/path/temp", pattern = "\\.csv$", full.names = TRUE) # full paths of the csv files
Use a custom function for the column maxima
colMax <- function(data) sapply(data, max, na.rm = TRUE)
Using foreach, dplyr, and readr
library(foreach)
library(dplyr)
library(readr) # provides read_csv()
foreach(i = seq_along(csvs), .combine = "c") %do% {
  read_csv(csvs[i]) %>%
    select(1:4) %>%
    colMax(.) %>%
    sum(.)
} # returns a vector with one sum per file

Remove Duplicates, but Keep the Most Complete Iteration

I'm trying to figure out how to remove duplicates based on three variables (id, key, and num). I would like to remove the duplicate with the fewest columns filled. If an equal number are filled, either can be removed.
For example,
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
The output would be the following:
Finished <- data.frame(id= c(1,2,3,4,5),
key=c(1,2,3,4,5),
num=c(1,1,1,1,1),
v4= c(1,5,5,5,7),
v5=c(1,5,5,5,7))
My real dataset is bigger and a mix of mostly numerical, but some character variables, but I couldn't determine the best way to go about doing this. I've previously used a program that would do something similar within the duplicates command called check.all.
So far, my thought has been to use grepl to determine where "anything" is present:
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
Then, using the resulting matrix, I compute rowSums and cbind the result to the original.
CompleteNess <- rowSums(Present)
cbind(Original, CompleteNess)
This is the point where I'm unsure of my next steps... I have a variable which tells me how many columns are filled in each row (CompleteNess); however, I'm unsure of how to implement duplicates.
Simply put: when id, key, and num are duplicated, I want to keep the row with the highest value of CompleteNess.
If anybody can think of a better way to do this or get me through the last little bit I would greatly appreciate it. Thanks All!

Here is a solution. It is not very pretty but it should work for your application:
#Order by the degree of completeness
Original<-Original[order(CompleteNess),]
#Starting from the bottom select the not duplicated rows
#based on the first 3 columns
Original[!duplicated(Original[,1:3], fromLast = TRUE),]
This does rearrange your original data frame so beware if there is additional processing later on.
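A self-contained sketch of the same order-then-deduplicate idea, using the example data from the question. Here completeness is counted with is.na() instead of grepl(), which assumes that missing values are real NAs and not empty strings; the final re-sort by id is an addition for readability.

```r
Original <- data.frame(id  = c(1, 2, 2, 3, 3, 4, 5, 5),
                       key = c(1, 2, 2, 3, 3, 4, 5, 5),
                       num = c(1, 1, 1, 1, 1, 1, 1, 1),
                       v4  = c(1, NA, 5, 5, NA, 5, NA, 7),
                       v5  = c(1, NA, 5, 5, NA, 5, NA, 7))

# Number of non-missing columns per row
completeness <- rowSums(!is.na(Original))

# Put the most complete rows first, then keep the first row of each key combination
ordered  <- Original[order(completeness, decreasing = TRUE), ]
Finished <- ordered[!duplicated(ordered[, c("id", "key", "num")]), ]

# Restore a readable order
Finished <- Finished[order(Finished$id), ]
print(Finished)
```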

You can aggregate your data and select the row with max score:
Original <- data.frame(id= c(1,2,2,3,3,4,5,5),
key=c(1,2,2,3,3,4,5,5),
num=c(1,1,1,1,1,1,1,1),
v4= c(1,NA,5,5,NA,5,NA,7),
v5=c(1,NA,5,5,NA,5,NA,7))
Present <- apply(Original, 2, function(x) grepl("[[:alnum:]]", x))
#get the score
Original$present <- rowSums(Present)
#create a column to aggregate on
Original$id.key.num <- paste(Original$id, Original$key, Original$num, sep = "-")
library("plyr")
#aggregate here
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present))
And if you want to keep the other columns, just do this:
Final <- ddply(Original,.(id.key.num),summarize,
Max = max(present),
v4 = v4[which.max(present)],
v5 = v5[which.max(present)]
)

dplyr mutate in R - adding a new column depending on sequence of another column

I am having an issue with the mutate function in dplyr.
I am trying to add a new column called state depending on the change in one of the columns (the V column). The V column repeats itself in a sequence, so each sequence, rep(seq(100,2100,100),each=96), corresponds to one dataset in my df.
I get the following error:
Error: impossible to replicate vector of size 8064
Here is a reproducible example of my df:
df <- data.frame (
No=(No= rep(seq(0,95,1),times=84)),
AC= rep(rep(c(78,110),each=1),times=length(No)/2),
AR = rep(rep(c(256,320,384),each=2),times=length(No)/6),
AM = rep(1,times=length(No)),
DQ = rep(rep(seq(0,15,1),each=6),times=84),
V = rep(rep(seq(100,2100,100),each=96),times=4),
R = sort(replicate(6, sample(5000:6000,96))))
labels <- rep(c("CAP-CAP","CP-CAP","CAP-CP","CP-CP"),each=2016)
I used the value 2016 here intentionally, since I know the number of rows of each dataset.
But I want to assign these labels automatically whenever the dataset changes, because the total number of rows may differ for each df in my real files. For this question, assume there is only one txt file, but keep in mind that there are many of them with different numbers of rows. The format is the same.
I use dplyr to arrange my df:
library("dplyr")
newdf<-df%>%mutate_each(funs(as.numeric))%>%
mutate(state = labels)
Is there an elegant way to do this?

Iff you know the number of data sets contained in df AND the column you're keying off --- here, V --- is ordered in df like it is in your toy data, then this works. It's pretty clunky, and there should be a way to make it even more efficient, but it produced what I take to be the desired result:
# You'll need dplyr for the lead() part
library(dplyr)
# Make a vector with the labels for your subsets of df
labels <- c("AP-AP","P-AP","AP-P","P-P")
# This line a) produces an index that marks the final row of each subset in df
# with a 1 and then b) produces a vector with the row numbers of the 1s
endrows <- which(grepl(1, with(df, ifelse(lead(V) - V < 0, 1, 0))))
# This line uses those row numbers or the differences between them to tell rep()
# how many times to repeat each label
newdf$state <- c(rep(labels[1], endrows[1]), rep(labels[2], endrows[2] - endrows[1]),
rep(labels[3], endrows[3] - endrows[2]), rep(labels[4], nrow(newdf) - endrows[3]))
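A shorter variant of the same idea can be sketched with diff() and cumsum(), so the number of rows per dataset does not need to be known in advance. It assumes, as above, that V only ever decreases at the start of a new dataset and that there are at least as many labels as datasets; the toy V below and the name dataset_id are made up for illustration.

```r
# Toy stand-in for the V column: three datasets of different lengths
V <- c(100, 200, 300, 100, 200, 300, 300, 100, 200)
labels <- c("CAP-CAP", "CP-CAP", "CAP-CP")

# A drop in V marks the first row of a new dataset;
# the running count of drops gives a dataset number for every row
dataset_id <- cumsum(c(TRUE, diff(V) < 0))

# Look up the label for each row by its dataset number
state <- labels[dataset_id]
print(data.frame(V, dataset_id, state))
```

With the data frame from the question, the result could then be assigned with something like df$state <- labels[cumsum(c(TRUE, diff(df$V) < 0))].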
