Function to check if column names are unique - r

I'm working on a program and I'm looking for a way to check the column names when a file is uploaded. If the names are not unique, an error should be raised. Is there any way to do this?
For example, if I have this data frame:
> a <- c(10, 20, 30)
> b <- c(1, 2, 3)
> c <- c("Peter", "Ann", "Mike")
> test <- data.frame(a, b, c)
with:
library(dplyr)
test <- rename(test, Number = a)
test <- rename(test, Number = b)
> test
Number Number c
1 10 1 Peter
2 20 2 Ann
3 30 3 Mike
If this were a file, how could I check whether the column names are unique? Ideally the result would be just TRUE or FALSE.
Thanks!

We can use:
any(duplicated(names(df))) #tested with df as iris
[1] FALSE
On OP's data:
any(duplicated(names(test)))
[1] TRUE
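The above can be simplified with anyDuplicated, as suggested by @sindri_baldur and @akrun. It returns the index of the first duplicated name, or 0 if there is none, so compare the result with 0 if you need TRUE or FALSE: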
anyDuplicated(names(test))
If you wish to know how many are duplicated:
length(which(duplicated(names(test))==TRUE))
[1] 1
This can also be simplified to (as suggested by @sindri_baldur):
sum(duplicated(names(test)))
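Since the question asks for an error on upload, here is a minimal sketch of how the check above could be wrapped. The helper name check_unique_names and the inline CSV text are made up for illustration. One thing to watch for when reading a file: read.csv defaults to check.names = TRUE, which silently renames repeated headers (the second Number becomes Number.1), so the check only sees the duplicates if that is switched off.

```r
# Hypothetical helper: stops with an error if any column name is repeated.
check_unique_names <- function(df) {
  nms <- names(df)
  dups <- unique(nms[duplicated(nms)])
  if (length(dups) > 0) {
    stop("Duplicate column names: ", paste(dups, collapse = ", "))
  }
  invisible(TRUE)
}

# check.names = FALSE keeps the header exactly as written in the file.
csv_text <- "Number,Number,c\n10,1,Peter\n20,2,Ann\n30,3,Mike"
uploaded <- read.csv(text = csv_text, check.names = FALSE)

# Plain TRUE/FALSE answer.
has_unique_names <- !anyDuplicated(names(uploaded))

# Error version; the message is captured here instead of halting the script.
result <- tryCatch(check_unique_names(uploaded),
                   error = function(e) conditionMessage(e))
```

In a real upload handler you would call check_unique_names() directly and let the error propagate.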

test.frame <- data.frame(a = c(1:5), b = c(6:10))
a <- c(5:1)
test.frame <- cbind(test.frame, a)
## Build data.frame with duplicate column
test.unique <- function(df) { ## function to test unique columns
length1 <- length(colnames(df))
length2 <- length(unique(colnames(df)))
if (length1 - length2 > 0 ) {
print(paste("There are", length1 - length2, "duplicates"))
}
}
This results in ...
test.unique(test.frame)
[1] "There are 1 duplicates"

Have a look at the functions unique() and colnames(). For example:
are.unique.colnames <- function(array){
return(length(unique(colnames(array))) == ncol(array))
}
This function compares the number of distinct column names with the number of columns, which works for any array-like structure that has column names.


Paste leading zero in columns A and B if column A meets condition

Data:
A B
"2058600192", "2058644"
"4087600101", "4087601"
"30138182591","30138011"
I am trying to add one leading 0 to columns A and B if column A is 10 characters.
This is what I have written so far:
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A)
data$B[i] <- paste0(0, data$B)
}
}
But I'm getting the following warning:
number of items to replace is not a multiple of replacement length
I've also tried using a dplyr solution, but I'm not sure how to mutate two columns based on one column. Any insight would be appreciated.
Your solution was already pretty good. You just made some very small mistakes. This code would give the correct output:
data <- data.frame(A = c("2058600192","4087600101","30138182591"), B = c("2058644","4087601","30138011"))
for (i in 1:nrow(data)) {
if (nchar(data$A[i]) == 10) {
data$A[i] <- paste0(0, data$A[i])
data$B[i] <- paste0(0, data$B[i])
}
}
The only difference is data$A[i] <- paste0(0, data$A[i]) instead of data$A[i] <- paste0(0, data$A). Without the [i] on the right-hand side, you would try to assign the whole column to a single element, which is what triggers the warning.
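To make the cause of the warning concrete, here is a small sketch using a plain vector in place of the data frame column. The helper try_assign is made up for illustration; it only exists so the faulty assignment can be run on a copy.

```r
A <- c("2058600192", "4087600101", "30138182591")

# paste0() is vectorised: given the whole column it returns one value per row.
rhs <- paste0(0, A)
length(rhs)  # 3 values, but A[1] has room for only one

# Hypothetical helper that performs the faulty assignment on a copy of x.
try_assign <- function(x, value) {
  x[1] <- value
  x
}

# Capture the warning text instead of printing it.
msg <- tryCatch({
  try_assign(A, rhs)
  "no warning"
}, warning = function(w) conditionMessage(w))

# Indexing both sides keeps the lengths equal, so no warning is raised.
A[1] <- paste0(0, A[1])
```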
You can get the index where the number of characters is equal to 10 and replace those values using lapply for multiple columns.
inds <- nchar(df$A) == 10
df[] <- lapply(df, function(x) replace(x, inds, paste0('0', x[inds])))
#If you want to replace only specific columns
#df[c('A', 'B')] <- lapply(df[c('A', 'B')], function(x)
# replace(x, inds, paste0('0', x[inds])))
df
# A B
#1 02058600192 02058644
#2 04087600101 04087601
#3 30138182591 30138011
Data:
df <- structure(list(A = c(2058600192, 4087600101, 30138182591), B = c(2058644L,
4087601L, 30138011L)), class = "data.frame", row.names = c(NA, -3L))
Just in case you are interested in using dplyr, here's another solution using transmute.
library(dplyr)
df %>%
# Need to transmute B first, so that nchar is evaluated on the original A column and not on the one with leading zeros
transmute(B = ifelse(nchar(A) == 10, paste0(0, B), B),
A = ifelse(nchar(A) == 10, paste0(0, A), A)) %>%
# Just change the order of the columns to the original one
select(A,B)
Another way you can try
library(dplyr)
library(stringr)
df %>%
mutate(A = ifelse(str_length(A) == 10, str_pad(A, width = 11, side = "left", pad = "0"), A),
B = ifelse(grepl("^0", A), paste0("0", B), B))
# A B
# 1 02058600192 02058644
# 2 04087600101 04087601
# 3 30138182591 30138011
str_length detects the length of each string.
str_pad adds the leading zero by padding the string on the left to a width of 11.
grepl then detects the strings in column A that start with a zero, so that a leading zero can be added to column B in the same rows.
You can use the vectorized ifelse function here. Compute the condition on column A once, before either column is modified, and apply it to both columns:
pad <- nchar(data$A) == 10
data$A <- ifelse(pad, paste0("0", data$A), data$A)
data$B <- ifelse(pad, paste0("0", data$B), data$B)
data
A B
1 02058600192 02058644
2 04087600101 04087601
3 30138182591 30138011

Nested for loop leading to: Error in `[<-.data.frame`(`*tmp*`, ...): replacement has x rows, data has y

I have 6 data frames (dfs) with a lot of data of different biological groups and another 6 data frames (tax.dfs) with taxonomical information about those groups. I want to replace a column of each of the 6 dfs with a column with the scientific name of each species present in the 6 tax.dfs.
To do that I created two lists of the data frames and I'm trying to apply a nested for loop:
dfs <- list(df.birds, df.mammals, df.crocs, df.snakes, df.turtles, df.lizards)
tax.dfs <- list(tax.birds,tax.mammals, tax.crocs, tax.snakes, tax.turtles, tax.lizards )
for(i in dfs){
for(y in tax.dfs){
i[,1] <- y[,2]
}}
And this is the output I'm getting:
Error in `[<-.data.frame`(`*tmp*`, , 1, value = c("Aotus trivirgatus", :
replacement has 64 rows, data has 43
But both data frames have the same number of rows, I actually used dfs to create tax.dfs applying the tnrs_match_names function from rotl package.
Any suggestions of how I could fix this error or that help me to find another way to do what I need to will be greatly appreciated.
Thank You!
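The nested loop pairs every element of dfs with every element of tax.dfs, so data frames that do not belong together (and have different numbers of rows) get combined, which produces the error. In addition, i is only a copy inside the loop, so assigning to it never changes dfs. For what it is worth, to iterate over two objects simultaneously, the following works: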
Example Data:
df1 <- data.frame(a=1, b=2)
df2 <- data.frame(c=3, d=4)
df3 <- data.frame(e=5, f=6)
df_1 <- data.frame(a='A', b='B')
df_2 <- data.frame(c='C', d='D')
df_3 <- data.frame(e='E', f='F')
dfs <- list(df1, df2, df3)
df_s <- list(df_1, df_2, df_3)
Using mapply:
out <- mapply(function(one, two) {
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
e f
1 F 6
Here, one and two in mapply correspond to the different elements in dfs and df_s. Having said that, let's make it a bit more interesting. Let's change my third example to the following:
df_3 <- data.frame(e=c('E', 'e'), f=c('F', 'f'))
df_s <- list(df_1, df_2, df_3) # needs to be executed again
Now, let's adjust the function:
out <- mapply(function(one, two) {
if(nrow(one) != nrow(two)){return('Wrong dimensions')}
one[,1] <- two[,2]
return(one)
}, dfs, df_s, SIMPLIFY = FALSE)
out
[[1]]
a b
1 B 2
[[2]]
c d
1 D 4
[[3]]
[1] "Wrong dimensions"
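If you would rather keep a for loop, the same pairing can be done by looping over positions instead of over the list elements. This is only a sketch: the toy data frames and their column names below are invented stand-ins for the real dfs and tax.dfs, and it assumes both lists are in the same order.

```r
# Toy stand-ins for the real lists (hypothetical data).
dfs <- list(
  birds   = data.frame(name = c("b1", "b2"), mass = c(10, 20)),
  mammals = data.frame(name = c("m1", "m2", "m3"), mass = c(1, 2, 3))
)
tax.dfs <- list(
  birds   = data.frame(id = 1:2, unique_name = c("Aves one", "Aves two")),
  mammals = data.frame(id = 1:3,
                       unique_name = c("Mam one", "Mam two", "Mam three"))
)

for (k in seq_along(dfs)) {
  # Guard against the row mismatch that caused the original error.
  stopifnot(nrow(dfs[[k]]) == nrow(tax.dfs[[k]]))
  # Assign into the list element itself, not into a loop copy.
  dfs[[k]][, 1] <- tax.dfs[[k]][, 2]
}
```

Writing to dfs[[k]] is what makes the change stick; assigning to the loop variable would be lost after each iteration.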

Name of i in loop - R

I'm trying to extract the name of the i column used in a loop:
for (i in df){
print(name(i))
}
Python code solution example:
for i in df:
print(i)
PS: R gives me the column values if I use the same code as in Python (Python gives just the name).
EDIT: It has to be in a loop, as I will do more elaborate things with this.
for (i in names(df)){
print(i)
}
Just do
names(df)
to print all the column names in df. There's no need for a loop, unless you want to do something more elaborate with each column.
If you want the i'th column name:
names(df)[i]
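For the "more elaborate things" mentioned in the question, a loop over the names can use each name both as a label and as a way to fetch the column. A minimal sketch, with an example data frame chosen only for illustration:

```r
df <- data.frame(a = 1:10, b = 21:30, c = 31:40)

col_sums <- numeric(0)
for (nm in names(df)) {
  # nm is the column name; df[[nm]] is that column's values.
  col_sums[nm] <- sum(df[[nm]])
}
col_sums
```

The result is a named numeric vector with one entry per column.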
Instead of looping, you can use the imap function from the purrr package. When writing the code, .x is the object and .y is the name.
df <- data.frame(a = 1:10, b = 21:30, c = 31:40)
library(purrr)
imap(df, ~paste0("The name is ", .y, " and the sum is ", sum(.x)))
# $a
# [1] "The name is a and the sum is 55"
#
# $b
# [1] "The name is b and the sum is 255"
#
# $c
# [1] "The name is c and the sum is 355"
This is just a more convenient way of writing the following Base R code, which gives the same output:
Map(function(x, y) paste0("The name is ", y, " and the sum is ", sum(x))
, df, names(df))
You can try the following code:
# Simulating your data
a <- c(1,2,3)
b <- c(4,5,6)
df <- data.frame(a, b)
# Answer 1
for (i in seq_len(ncol(df))){
print(names(df)[i]) # accessing the name of the column
print(df[,i]) # accessing the column content
print('----')
}
Or this alternative:
# Answer 2
columns <- names(df)
for(i in columns) {
print(i) # accessing the name of the column
print(df[, i]) # accessing the column content
print('----')
}
Hope it helps!

join two columns in a dataframe so they do not contain same values

Sooo
I’ve got two lists
list1 <- rep(c("john","steve","lisa","sara","anna"), c(50,0,15,25,10))
list2 <- rep(c("john","steve","lisa","sara","anna"), c(15,25,0,10,50))
I need to put them into a dataframe.
df <- as.data.frame(matrix(1, nrow = 100, ncol = 2))
df$V1 <- list1
Now the problem.
I need to put list2 into df$V2
without any row in df containing the same value twice.
It does not matter which values end up in which row.
I use this to test whether every row contains two different values:
all(apply(ballots, 1, function(x) length(unique(x)) == 2) == TRUE)
to clarify:
I need every value to stay in its column; which row it ends up in doesn't matter.
I need a way to randomize or change the order of the second column (or the first) so that no row has the same value in both columns.
The desired output:
V1 V2
John Steve
John Lisa
Sara John
John Lisa
Steve Anna
Currently, when I join the columns in the dataframe, many rows contain the same value in both columns.
Alright... finally found the answer after much trial and error.
If anyone has a cleaner method to do this, I would love to see it.
The following code puts list A in column A, shuffles list B and puts it in column C, and leaves column B as NA.
For every row where A and C differ, the value is moved from column C to column B.
Column C is then shifted by one position and the check is repeated.
If it fails to fill all the rows, it starts over with a new shuffle of column C.
library(taRifx)
failed.counter <- 0
while (failed.counter <= 1) {
failed.counter <- 0 # reset, otherwise the inner loop is skipped forever after a failed attempt
list1 <- rep(c("A","B","C"), c(3,1,2))
list2 <- sample(rep(c("A","B","C"), c(2,3,1)))
df <- as.data.frame(matrix(NA, nrow = length(list1), ncol = 3))
df[,1] <- list1
df[,3] <- list2
iteration.counter <- 0
while (anyNA(df$V2) == TRUE && failed.counter == 0 ) {
iteration.counter <- iteration.counter + 1
df.sub <- df[is.na(df[,2]) & df[,1] != df[,3] & !is.na(df[,3]),]
df.sub <- df.sub[,c("V1", "V3", "V2")]
colnames(df.sub) <- c("V1", "V2", "V3")
r.names <- rownames(df.sub)
df[r.names,] <- df.sub
df[,3] <- shift(df[,3], 1, Wrap=TRUE)
if(iteration.counter >= nrow(df)+1) {failed.counter <- 1}
}
if(anyNA(df$V2) == FALSE) {failed.counter <- 2}
}
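As a possibly cleaner alternative, here is a sketch of a different technique than the one above: shuffle the second list once, then repair each conflicting row by swapping its second value with that of another row. The function name pair_without_matches is made up. My reasoning is that a swap partner should always exist when, for every value, its count in the first list plus its count in the second list is smaller than the number of rows (true for the data in the question), but treat that as an assumption; the function stops with an error if no partner is found.

```r
# Hypothetical helper: shuffle v2, then fix rows where v1 and v2 coincide.
pair_without_matches <- function(v1, v2) {
  stopifnot(length(v1) == length(v2))
  v2 <- sample(v2)
  for (i in which(v1 == v2)) {
    if (v1[i] != v2[i]) next  # already fixed by an earlier swap
    # Rows j where swapping v2[i] and v2[j] leaves both rows conflict-free.
    ok <- which(v2 != v1[i] & v1 != v2[i])
    if (length(ok) == 0) stop("no valid swap partner found")
    j <- ok[1]
    tmp <- v2[i]
    v2[i] <- v2[j]
    v2[j] <- tmp
  }
  data.frame(V1 = v1, V2 = v2, stringsAsFactors = FALSE)
}

list1 <- rep(c("john", "steve", "lisa", "sara", "anna"), c(50, 0, 15, 25, 10))
list2 <- rep(c("john", "steve", "lisa", "sara", "anna"), c(15, 25, 0, 10, 50))

set.seed(1)
res <- pair_without_matches(list1, list2)
all(res$V1 != res$V2)
```

Because only values within the second column are swapped, each column keeps exactly the values it started with.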

Passing vector with multiple values into R function to generate data frame

I have a table, called table_wo_nas, with multiple columns, one of which is titled ID. For each value of ID there are many rows. I want to write a function that for input x will output a data frame containing the number of rows for each ID, with column headers ID and nobs respectively as below for x <- c(2,4,8).
## id nobs
## 1 2 1041
## 2 4 474
## 3 8 192
This is what I have. It works when x is a single value (ex. 3), but not when it contains multiple values, for example 1:10 or c(2,5,7). I receive the warning "In ID[counter] <- x : number of items to replace is not a multiple of replacement length". I've just started learning R and have been struggling with this for a week and have searched manuals, this site, Google, everything. Can someone help please?
counter <- 1
ID <- vector("numeric") ## contain x
nobs <- vector("numeric") ## contain nrow
for (i in x) {
r <- subset(table_wo_nas, ID %in% x) ## create subset for rows of ID=x
ID[counter] <- x ## add x to ID
nobs[counter] <- nrow(r) ## add nrow to nobs
counter <- counter + 1 } ## loop
result <- data.frame(ID, nobs) ## create data frame
In base R,
# To make a named vector, either:
tmp <- sapply(split(table_wo_nas, table_wo_nas$ID), nrow)
# OR just:
tmp <- table(table_wo_nas$ID)
# AND
# arrange into data.frame
nobs_df <- data.frame(ID = names(tmp), nobs = as.vector(tmp))
Alternately, coerce the table into a data.frame directly, and rename:
nobs_df <- data.frame(table(table_wo_nas$ID))
names(nobs_df) <- c('ID', 'nobs')
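If you only want certain IDs, subset on the ID column (indexing with c(2, 4, 8) directly would select rows by position, not by ID):
nobs_df[nobs_df$ID %in% c(2, 4, 8), ]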
There are many, many more options; these are just a few.
With dplyr,
library(dplyr)
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n())
If you only want certain IDs, add on a filter:
table_wo_nas %>% group_by(ID) %>% summarise(nobs = n()) %>% filter(ID %in% c(2, 4, 8))
Seems pretty straightforward if you just use table again:
tbl <- table( table_wo_nas[ , 'ID'] )
data.frame( IDs = names(tbl), nobs = as.vector(tbl))
Could also get a quick answer although with different column names using:
as.data.frame(table( table_wo_nas[ , 'ID'] ))
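To get the function of x that the question asks for, the table approach can be wrapped as in the sketch below. The function name count_ids and the toy data are made up for illustration. Turning ID into a factor with levels = x keeps the output in the order of x and reports 0 for any requested ID that does not occur.

```r
# Hypothetical wrapper: number of rows for each requested ID.
count_ids <- function(data, x) {
  counts <- table(factor(data$ID, levels = x))
  data.frame(id = x, nobs = as.vector(counts))
}

# Toy stand-in for table_wo_nas.
toy <- data.frame(ID = c(2, 2, 4, 8, 8, 8, 3), value = 1:7)

res <- count_ids(toy, c(2, 4, 8))
res
```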
Try this.
x <- c(2, 4, 8)
# df is your data frame table_wo_nas
count_of <- function(x, df) {
count_of_id <- integer(length(x))
for (i in seq_along(x)) {
count_of_id[i] <- sum(df$ID == x[i]) # number of rows for each value of x
}
data.frame(id = x, nobs = count_of_id)
}
count_of(x, table_wo_nas)
