I have this code that generates random values as part of a sequence. I need to keep the duplicates and remove the values that are NOT repeated. Any help with this?
Apparently, a solution is supposed to contain '%/%, as.numeric, names, table, >'
Here is the original code.
x <- sample(c(-10:10), sample(20:40, 1), replace = TRUE)
Any help would be appreciated!
I would use table(), which returns every distinct value together with the number of times it occurs.
vec <- c(1,2,3,4,5,3,4,5,4,5)
valueset <- unique(vec) # all unique values in vec, used later
# now determine which values occur more than once
valuecounts <- table(vec) # count of each unique value; the values become the names, as strings regardless of the original data type
morethanone <- names(valuecounts)[valuecounts>1] # values with count > 1
morethanone <- as.numeric(morethanone) # convert the strings back to numeric
valueset[valueset %in% morethanone] # the original values that passed the above criterion
As a function....
duplicates <- function(vector){
# Returns all unique values that occur more than once in a vector
valueset <- unique(vector)
valuecounts <- table(vector)
morethanone <- names(valuecounts)[valuecounts>1]
morethanone <- as.numeric(morethanone)
return( valueset[valueset %in% morethanone] )
}
Now try running duplicates(x)
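Note that duplicates() returns each repeated value once. If the goal is instead to keep every occurrence of the repeated values and drop only the values that appear a single time, the same table(), names(), as.numeric() and > building blocks can be applied to x directly. This is a minimal sketch of that reading of the question; the seed and the names counts, repeated and kept are assumptions added for illustration.

```r
# Keep every occurrence of the values that appear more than once in x
set.seed(42)  # assumption: seed chosen only to make the sketch reproducible
x <- sample(c(-10:10), sample(20:40, 1), replace = TRUE)

counts <- table(x)                                 # occurrences of each distinct value
repeated <- as.numeric(names(counts)[counts > 1])  # values seen at least twice
kept <- x[x %in% repeated]                         # original order, repeats retained
```

Here kept has the same ordering as x, minus the values that occurred exactly once.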
I have a number of x dataframes (depending on previous operation). The names of the dataframes are stored in a different vector:
> list.industries
[1] "misc" "machinery" "electronics" "drugs" "chemicals"
Now, I want to set every column after the 4th as numeric. Since the number of created dataframes, and therefore their names, can change, I would like to know whether there is a way to do this automatically.
I tried:
for (i in 1:length(list.industries)) {
paste0(list.industries) <- lapply(paste0(list.industries)[,4:ncol(paste0(list.industries))] , as.numeric)
}
The idea is that the loop automatically substitutes each dataframe name from the vector list.industries and converts its columns to numeric.
Is there a way to use the name of a dataframe, taken from a vector, as a variable?
Thanks!
You can use mget to get the data as a named list, convert every column from the 4th onward to numeric, and return the dataframes.
new_data <- lapply(mget(list.industries), function(x) {
x[, 4:ncol(x)] <- lapply(x[, 4:ncol(x)], as.numeric)
x
})
new_data is a list of dataframes. If you want the changes to be reflected in the original dataframes, use list2env.
list2env(new_data, .GlobalEnv)
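To try the mget/list2env round trip without touching the global workspace, the same steps can be run against a dedicated environment. This is only a sketch: the environment env, the two toy dataframes and their column names are hypothetical stand-ins for the real data.

```r
# Hypothetical toy data: two small dataframes stored in a dedicated environment
env <- new.env()
env$misc <- data.frame(id = 1:2, a = c("x", "y"), b = c("u", "v"),
                       v1 = c("1.5", "2"), v2 = c("3", "4"),
                       stringsAsFactors = FALSE)
env$drugs <- data.frame(id = 3:4, a = c("p", "q"), b = c("r", "s"),
                        v1 = c("10", "20"), v2 = c("0.1", "0.2"),
                        stringsAsFactors = FALSE)
list.industries <- c("misc", "drugs")

# Fetch by name, convert columns 4..ncol, and keep the result as a named list
new_data <- lapply(mget(list.industries, envir = env), function(x) {
  x[, 4:ncol(x)] <- lapply(x[, 4:ncol(x)], as.numeric)
  x
})

# Write the converted dataframes back under their original names
list2env(new_data, envir = env)
```

Because mget returns a named list, list2env can use those names to overwrite the originals in place.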
You could use this fragment (untested):
one_df <- function(x) {
dat <- get(x)
for (i in seq(4, ncol(dat))) dat[,i] <- as.numeric(dat[,i])
return(dat)
}
ans <- lapply(list.industries, one_df)
So in short: you are looking for get.
Problem:
I am trying to combine the columns of two vectors with different lengths and starting dates (think stock prices) and want to cut off the excess from the longest while matching the length of the shortest. Any help would be appreciated!
What I have tried so far:
combinedcolumns<-cbind(A$Col,B$Col[(length(B$Col)-length(A$Col)):length(B$Col)])
Results:
I am able to bind the two columns and get correct values for A, but I get the same value for B across the entire length of combinedcolumns.
Thanks in advance!
One way would be to extend the shorter column by filling it with NAs (or anything else basically), i.e.
lmax <- max(length(A$col1), length(B$col2)) # length of the longer column
lmin <- min(length(A$col1), length(B$col2)) # length of the shorter column
col2change <- which.min(c(length(A$col1), length(B$col2))) # 1 if A is the shorter (or equal), 2 if B is
newcol <- rep(NA, lmax) # vector of length lmax filled with NAs
# copy the data of the shorter column; if/else is needed because ifelse()
# with a length-one test would return only a single value
newcol[seq_len(lmin)] <- if (col2change == 1) A$col1 else B$col2
# newcol replaces whichever column was the shorter one
combinedcolumns <- if (col2change == 1) {
cbind(newcol, B$col2)
} else {
cbind(A$col1, newcol)
}
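If the longer column should be trimmed rather than the shorter one padded, which is what the question describes, tail() can take the last n values of each column. The sketch below assumes both series end on the same date, so that the excess sits at the start of the longer one; the data in A and B is hypothetical. Note also that the index range in the question's attempt, (length(B$Col)-length(A$Col)):length(B$Col), selects one element more than length(A$Col).

```r
# Hypothetical data standing in for the question's A and B
A <- data.frame(Col = c(11, 12, 13))
B <- data.frame(Col = c(1, 2, 3, 4, 5))

n <- min(length(A$Col), length(B$Col))  # length of the shorter series
# tail() keeps the last n values, so both series end on the same observation
combinedcolumns <- cbind(A = tail(A$Col, n), B = tail(B$Col, n))
```

If the series share a start date instead, head() would play the same role.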
I'm struggling to solve this apparently simple question in R, with no success so far.
I have a data.frame with a character variable that contains some blank and some non-blank values. I'm trying to fill those blanks with the last non-blank value found above them in the same variable, as in the following example for the variable 'Species' in the data.frames 'want' vs 'have'.
If someone could help, thanks in advance!
set.seed(12346)
foi <- split(iris, iris$Species)
want <- do.call("rbind", lapply(foi, function(x){
x[1:sample(1:10, 1), ]
}))
row.names(want) <- NULL
want$Species <- as.character(want$Species)
have <- want
have$Species[2:10] <- ""
have$Species[12:16] <- ""
have$Species[18:21] <- ""
head(have, 20)
head(want, 20)
A simple for loop assuming the first value is non missing:
for(i in which(have$Species=="")) have$Species[i]=have$Species[i-1]
You could split your variable by block of consecutive blank values and fill each block with the first previous non blank value if speed is an issue and your file is huge.
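One vectorized way to get the same result without an explicit loop is to carry the position of the last non-blank entry forward with cummax(). This is a sketch of a different technique from the block-splitting idea above, not an implementation of it, and like the loop it assumes the first value is non-blank; the vector species is hypothetical example data.

```r
# Hypothetical input: blanks should inherit the last non-blank value above them
species <- c("setosa", "", "", "versicolor", "", "virginica", "")

# Position of each non-blank entry, 0 for blanks
pos <- seq_along(species) * (species != "")
# cummax() carries the most recent non-blank position forward
filled <- species[cummax(pos)]
```

Applied to the question's data, the equivalent call would index have$Species by the cumulative maximum computed from have$Species != "".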
Surprised this hasn't been asked before (as far as I can see)
I have a data.frame with multiple columns and two rows, such as the below.
df<-as.data.frame(rbind(row1=c(NA,NA,rep(0,2),"FOO",NA,"BAR","FOO","FOOBAR","ETC"),
row2=c(300,23.4,1,2,"BAR","FOO","BAR","HELLO","WORLD","ETC")))
I want to select the entry in the first row by default, but only if it's not NA. If it is NA, I want the entry in the second row. I've tried the following:
apply(df,2,function(x) ifelse(is.na(x[1]),x[2],x[1]))
However, the data is a combination of numeric and character columns and each column's class needs to be maintained, so apply is causing issues. I also need the result returned as a data frame and not as a named vector.
Try this and see if this is what you are after.
df<-as.data.frame(rbind(row1=c(NA,NA,rep(0,2),"FOO",NA,"BAR","FOO","FOOBAR","ETC"),
row2=c(300,23.4,1,2,"BAR","FOO","BAR","HELLO","WORLD","ETC")))
outDF <- lapply(df, function(x){
if(is.na(x[[1]])&!is.na(x[[2]])){
x[[1]] <- x[[2]]
}
x
})
data.frame(outDF, stringsAsFactors = FALSE)
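Because rbind on character vectors makes every column of the example df character, it cannot show whether classes survive. The sketch below uses a hypothetical data frame with genuinely mixed classes and returns a single row, taking row 1 unless it is NA; working column by column with lapply avoids the matrix coercion that apply performs.

```r
# Hypothetical two-row data frame with genuinely mixed column classes
df <- data.frame(a = c(NA, 300), b = c(0, 1),
                 c = c(NA, "FOO"), d = c("BAR", "BAZ"),
                 stringsAsFactors = FALSE)

# Per column: take row 1 unless it is NA, otherwise fall back to row 2
picked <- lapply(df, function(x) if (is.na(x[1])) x[2] else x[1])
outDF <- data.frame(picked, stringsAsFactors = FALSE)
```

The result is a one-row data frame in which the numeric columns stay numeric and the character columns stay character.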
I have a dataset with three columns.
## generate sample data
set.seed(1)
x<-sample(1:3,50,replace = T )
y<-sample(1:3,50,replace = T )
z<-sample(1:3,50,replace = T )
data<-as.data.frame(cbind(x,y,z))
What I am trying to do is:
Select those rows where all the three columns have 1
Select those rows where only two columns have 1 (could be any column)
Select only those rows where only one column has 1 (could be any column)
Basically I want any two columns (for 2nd case) to fulfill the conditions and not any specific column.
I am aware of rows selection using
subset<-data[c(data$x==1,data$y==1,data$z==1),]
But this only selects rows based on conditions for specific columns, whereas I want any of the three/two columns to fulfill my criteria
Thanks
n = 1 # or 2 or 3
data[rowSums(data == 1) == n,]
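This works because data == 1 yields a logical matrix and rowSums counts the TRUE entries in each row. Spelled out for the three cases in the question, using the question's sample data, it might look like the sketch below; the names ones, all_three, exactly_two and exactly_one are introduced here for illustration only.

```r
set.seed(1)
data <- data.frame(x = sample(1:3, 50, replace = TRUE),
                   y = sample(1:3, 50, replace = TRUE),
                   z = sample(1:3, 50, replace = TRUE))

ones <- rowSums(data == 1)        # number of 1s in each row (0 to 3)
all_three   <- data[ones == 3, ]  # every column is 1
exactly_two <- data[ones == 2, ]  # any two columns are 1
exactly_one <- data[ones == 1, ]  # any single column is 1
```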
Here is another method:
rowCounts <- table(c(which(data$x==1), which(data$y==1), which(data$z==1)))
# this is the long way
df.oneOne <- data[as.integer(names(rowCounts)[rowCounts == 1]),]
df.oneTwo <- data[as.integer(names(rowCounts)[rowCounts == 2]),]
df.oneThree <- data[as.integer(names(rowCounts)[rowCounts == 3]),]
It is better to save multiple data.frames in a list, especially when there is some structure that guides this storage, as is the case here. Following @richard-scriven's suggestion, you can do this easily with lapply:
df.oneCountList <- lapply(1:3, function(i)
data[as.integer(names(rowCounts)[rowCounts == i]),])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")
You can then pull out the data.frames using either their index, df.oneCountList[[1]] or their name df.oneCountList[["df.oneOne"]].
@eddi below suggests a nice shortcut to my method of pulling out the table names, using tabulate and the arr.ind argument of which. When which is applied to a multidimensional object such as an array or a data.frame, setting arr.ind = TRUE produces the row and column indices where the logical expression evaluates to TRUE. His suggestion exploits this to pull out the row indices where a 1 is found in any of the variables. The tabulate function is then applied to these row indices and returns a vector in which each element represents a row, in order, and rows without a 1 are filled in with a 0.
Under this method,
rowCounts <- tabulate(which(data == 1, arr.ind = TRUE)[,1], nbins = nrow(data)) # nbins keeps one count per row even if the last rows contain no 1
returns a vector from which you might immediately pull the values. You can include the above lapply to get a list of data.frames:
df.oneCountList <- lapply(1:3, function(i) data[rowCounts == i,])
names(df.oneCountList) <- c("df.oneOne", "df.oneTwo", "df.oneThree")