Renaming data frame column variables without using column name - r

I want to create a function which renames specific values in a column to something else, which is specified by the function, something like this (although in reality there would be much more to rename):
func <- function(x) x %>%
  mutate(col_name = ifelse(col_name == "something", "something else",
                    ifelse(col_name == "something2", "something_else2", col_name)))
Note that it isn't the column names that I want to change; it is the values themselves in the column. However, I would like this to work regardless of which column the values are in (i.e. the function works across the whole data frame). Also, this only works if the values named in the function are present, and I would like it to ignore the ones that aren't present in the columns. Here is a small reproducible example (column values are arbitrary):
col1 <- c("a","b","c","d","e")
col2 <- c("b","f","d","c","g")
df <- data.frame(col1, col2)
col3 <- c("a","h","i","b","c")
col4 <- c("c","d","j","a","g")
df2 <- data.frame(col3, col4)
Which looks like this:
df:
col1 col2
1 a b
2 b f
3 c d
4 d c
5 e g
df2:
col3 col4
1 a c
2 h d
3 i j
4 b a
5 c g
Say that I want to rename like this:
df:
col1 col2
1 can chi
2 chi pig
3 equ she
4 she equ
5 fox bov
df2:
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
So what I was hoping to get was a function that changes the names of multiple values in data frame columns regardless of its position in the data frame, and that it ignores the values not found in the data frame by the function.

Recode all columns
library(dplyr)
func = function(x, originals = letters[1:10],
rename_tos = c("can", "chi", "equ", "she", "fox", "pig", "bov", "avi", "tyr", "asp")){
names(rename_tos) = originals
x %>%
mutate_if(is.factor, as.character) %>%
lapply(function(y){
y = rename_tos[y]
}) %>%
data.frame(row.names = NULL)
}
Results:
> func(df)
col1 col2
1 can chi
2 chi pig
3 equ she
4 she equ
5 fox bov
> func(df2)
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
Notes:
The method I used is basically to create a lookup table (named vector) for the renames and index the rename_tos vector with column values. Here, I've set the originals and renames as the default of the function, but you can also supply your own.
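One caveat with plain indexing is that a value with no entry in the lookup comes back as NA. If you would rather leave such values as they are, a minimal base R sketch could look like the following. The helper name recode_all and the small example data are made up for illustration; they are not part of the answer above.

```r
# Hypothetical helper: recode every column through a named lookup vector,
# leaving values that have no entry in the lookup untouched.
recode_all <- function(x, lookup) {
  x[] <- lapply(x, function(col) {
    col <- as.character(col)            # factors become character
    hit <- col %in% names(lookup)       # which values have a replacement
    col[hit] <- unname(lookup[col[hit]])
    col
  })
  x
}

lookup <- c(a = "can", b = "chi", c = "equ")
example_df <- data.frame(col1 = c("a", "b", "z"),
                         col2 = c("c", "y", "a"),
                         stringsAsFactors = FALSE)
recoded <- recode_all(example_df, lookup)
recoded
```

Here "z" and "y" have no replacement, so they are kept rather than turned into NA.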
User-supplied column names
If you want to be able to rename columns specified and leave the other columns the same, you can do something like the following:
library(dplyr)
library(rlang)
func = function(x, ..., originals = letters[1:10],
rename_tos = c("can", "chi", "equ", "she", "fox", "pig", "bov", "avi", "tyr", "asp")){
names(rename_tos) = originals
dots = quos(...)
x %>%
mutate_at(vars(!!! dots), as.character) %>%
mutate_at(vars(!!! dots), ~ unname(rename_tos[.])) %>%  # funs() is deprecated since dplyr 0.8.0
data.frame(row.names = NULL)
}
Result:
> func(df, col2)
col1 col2
1 a chi
2 b pig
3 c she
4 d equ
5 e bov
> func(df2, col3, col4)
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
> func(df2, c(col3, col4))
col3 col4
1 can equ
2 avi she
3 tyr asp
4 chi can
5 equ bov
Notes:
Here, I added the ... argument to allow the user to input their own column names. I used quos from rlang to quote the ... arguments and later unquoted them inside vars to mutate_at using !!!. For example, if the user supplied func(df, col2), the first argument of mutate_at evaluates to vars(col2). This works with multiple arguments as well as a vector of arguments as one can see in the results.

Related

Drop Multiple Columns in R

I have a data set of 80k rows and 874 columns. Some of these columns are empty. I use sum(is.na) in a for loop to determine the indices of the empty columns. Since the first column is not empty, if sum(is.na) is equal to the number of rows of the first column, it means that column is empty.
for (i in 1:ncol(loans)){
if (sum(is.na(loans[i])) == nrow(loans[1])){
print(i)
}
}
Now that I know the indices of empty columns, I want to drop them from the data. I thought about storing those indices in an array and dropping them in a loop but I don't think it will work since columns with data will replace the empty columns. How can I drop them?
You should try to provide a toy dataset for your question.
loans <- data.frame(
a = c(NA, NA, NA),
b = c(1,2,3),
c = c(1,2,3),
d = c(1,2,3),
e = c(NA, NA, NA)
)
loans[!sapply(loans, function(col) all(is.na(col)))]
sapply loops over columns of loans and applies the anonymous function checking if all elements are NA. It then coerces the output to a vector, in this case logical.
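To make the intermediate step visible, here is a small sketch of the same idea that stores the logical vector before subsetting. It uses vapply instead of sapply so the result type is guaranteed to be logical; the names is_empty and loans_clean are only illustrative.

```r
loans <- data.frame(
  a = c(NA, NA, NA),
  b = c(1, 2, 3),
  e = c(NA, NA, NA)
)

# One TRUE/FALSE per column: TRUE when every value in the column is NA
is_empty <- vapply(loans, function(col) all(is.na(col)), logical(1))

# Keep only the columns that are not entirely NA
loans_clean <- loans[!is_empty]
loans_clean
```

Because the selection is done in one step on the original data frame, there is no problem with column positions shifting as columns are removed.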
The tidyverse option:
loans[!purrr::map_lgl(loans, ~all(is.na(.x)))]
Does this work:
df <- data.frame(col1 = rep(NA, 5),
col2 = 1:5,
col3 = rep(NA,5),
col4 = 6:10)
df
col1 col2 col3 col4
1 NA 1 NA 6
2 NA 2 NA 7
3 NA 3 NA 8
4 NA 4 NA 9
5 NA 5 NA 10
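# drop columns in which every value is NA (testing the column sum against 0 would also drop columns that merely sum to zero)
df[, which(colSums(is.na(df)) == nrow(df))] <- NULL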
df
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
Another approach:
df[!apply(df, 2, function(x) all(is.na(x)))]
col2 col4
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
A dplyr solution:
library(dplyr)
df %>%
select_if(~ !all(is.na(.)))
You can solve almost any problem with fundamental tools like if/else and for loops, although the drawback is that they are slower.
# evaluate each column; if a column is entirely NA, remove it.
# Iterate from the last column to the first so that removing a column
# does not shift the positions of the columns still to be checked.
for (i in rev(seq_along(loans))){
if (sum(is.na(loans[,i])) == nrow(loans)){
loans[,i] <- NULL
}
}

How to place multiple columns as a single column most easily in R dataframe?

How can I make the values in columns 1, 2, 3, 4 appear as values in a single column, placed one below the other? The contents are non-numeric. I am unable to install the tidyverse package for some reason. Is there any other way to accomplish this? My data frame df looks something like this:
df
Person1 Person2 Person3
Doctor Self No
Friend No Others
Self Others Doctor
I want the data frame to be:
df1
Person
Doctor
Friend
Self
Self
No
Others
No
Others
Doctor
As a general rule, check this excellent answer on how to make a reproducible example in R. It'll help others to provide answers faster.
You can get your data into a long format using tidyr (assuming your columns are filled with strings, as you mentioned).
> df <- data.frame(col1 = c("some", "strings"), col2 = c("more", "strings"), col3 = c("lotof", "strings"))
> df
col1 col2 col3
1 some more lotof
2 strings strings strings
> library(tidyr)
> pivot_longer(df, c(col1, col2, col3))
# A tibble: 6 x 2
name value
<chr> <fct>
1 col1 some
2 col2 more
3 col3 lotof
4 col1 strings
5 col2 strings
6 col3 strings
Regarding the package installation problems, could you copy the error that pops up and the console output of sessionInfo()?
data.table::melt()
or
reshape::melt()
Example
library(reshape)
mdata <- melt(mydata, id=c("id","te"))
Result
id te variable value
1 1 x1 5
1 2 x1 3
2 1 x1 6
2 2 x1 2
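Since the question mentions that installing packages is a problem, here is a base R sketch that needs no extra packages. It rebuilds the data from the question under the assumed name people and stacks the columns with unlist, which concatenates them column by column.

```r
# The data frame from the question, rebuilt by hand (name `people` is illustrative)
people <- data.frame(
  Person1 = c("Doctor", "Friend", "Self"),
  Person2 = c("Self", "No", "Others"),
  Person3 = c("No", "Others", "Doctor"),
  stringsAsFactors = FALSE
)

# unlist() concatenates the columns one after another;
# use.names = FALSE drops the generated names such as "Person11"
df1 <- data.frame(Person = unlist(people, use.names = FALSE),
                  stringsAsFactors = FALSE)
df1
```

If you also need to know which column each value came from, base R's stack(people) returns a two-column data frame with the values and the originating column name.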

Loop to count number of rows in each column and save output

I think this is pretty simple. I have a dataframe called df. It has 51 columns. The rows in each column contain random integers. All I want to do as a loop is add all the integers in all the rows of each column and then store the output for each of the columns in a separate list or dataframe.
The df looks like this
Col1 col2 col3 col4
34 12 33 67
22 1 56 66
Etc
The output I want is:
Col1 col2 col3 col4
56 13 89 133
I do want to do this as a loop as I want to apply what I've learnt here to a more complex script with similar output and I need to do it quick- can't quite master functions as yet...
You can use the built in function colSums for this:
> df <- data.frame(col1 = c(1,2,3), col2 = c(2,3,4))
> colSums(df)
col1 col2
6 9
Another option using a loop:
# Create the result data frame
> res <- data.frame(df[1,], row.names = c('count'))
# Populate the results
> for(n in 1:ncol(df)) { res[colnames(df)[n]] <- sum(df[n]) }
> res
col1 col2
count 6 9
If you really want to use a loop over a vectorized solution, use apply to loop over columns (second argument equal to 2, 1 is to loop over rows), by mentioning the function you want (here sum):
df = data.frame(col1=1:3,col2=2:4,col3=3:5)
apply(df, 2, sum)
#col1 col2 col3
# 6 9 12
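If the goal is specifically to practise writing the loop by hand, a minimal sketch is to preallocate a result vector and fill it one column at a time. The data below are the two rows shown in the question; the names scores and totals are only illustrative.

```r
# The two example rows from the question
scores <- data.frame(col1 = c(34, 22), col2 = c(12, 1),
                     col3 = c(33, 56), col4 = c(67, 66))

# Preallocate one slot per column and carry over the column names
totals <- numeric(ncol(scores))
names(totals) <- names(scores)

# Loop over column positions; [[j]] extracts column j as a plain vector
for (j in seq_along(scores)) {
  totals[j] <- sum(scores[[j]])
}
totals
```

The same pattern carries over to more complex scripts: swap sum for whatever per-column computation you need.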

Loop through columns and add string lengths as new columns

I have a data frame with a number of columns, and would like to output a separate column for each with the length of each row in it.
I am trying to iterate through the column names, and for each column output a corresponding column with '_length' attached.
For example col1 | col2 would go to col1 | col2 | col1_length | col2_length
The code I am using is:
df <- data.frame(col1 = c("abc","abcd","a","abcdefg"),col2 = c("adf qqwe","d","e","f"))
for(i in names(df)){
df$paste(i,'length',sep="_") <- str_length(df$i)
}
However, this throws an error:
invalid function in complex assignment.
Am I able to use loops in this way in R?
You need to use [[, the programmatic equivalent of $. Otherwise, for example, when i is col1, R will look for df$i instead of df$col1.
for(i in names(df)){
df[[paste(i, 'length', sep="_")]] <- str_length(df[[i]])
}
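To see the difference between $ and [[ directly, here is a small base R sketch. It uses nchar instead of str_length so that it runs without stringr, and the name words is only illustrative.

```r
words <- data.frame(col1 = c("abc", "abcd"), stringsAsFactors = FALSE)
i <- "col1"

# `$` takes the name literally: there is no column called "i", so this is NULL
literal_lookup <- words$i

# `[[` evaluates i first, so this returns the col1 column
programmatic_lookup <- words[[i]]

# The same applies on the left-hand side of an assignment
words[[paste(i, "length", sep = "_")]] <- nchar(words[[i]])
words
```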
You can use lapply to pass each column to str_length, then cbind it to your original data.frame...
library(stringr)
out <- lapply( df , str_length )
df <- cbind( df , out )
# col1 col2 col1 col2
#1 abc adf qqwe 3 8
#2 abcd d 4 1
#3 a e 1 1
#4 abcdefg f 7 1
With dplyr and stringr you can use mutate_all:
> df %>% mutate_all(list(length = ~ str_length(.)))
col1 col2 col1_length col2_length
1 abc adf qqwe 3 8
2 abcd d 4 1
3 a e 1 1
4 abcdefg f 7 1
For the sake of completeness, there is also a data.table solution:
library(data.table)
result <- setDT(df)[, paste0(names(df), "_length") := lapply(.SD, stringr::str_length)]
result
# col1 col2 col1_length col2_length
#1: abc adf qqwe 3 8
#2: abcd d 4 1
#3: a e 1 1
#4: abcdefg f 7 1

Convert the values in a column into row names in an existing data frame

I would like to convert the values in a column of an existing data frame into row names. Is it possible to do this without exporting the data frame and then reimporting it with a row.names = call?
For example I would like to convert:
> samp
names Var.1 Var.2 Var.3
1 A 1 5 0
2 B 2 4 1
3 C 3 3 2
4 D 4 2 3
5 E 5 1 4
Into:
> samp.with.rownames
Var.1 Var.2 Var.3
A 1 5 0
B 2 4 1
C 3 3 2
D 4 2 3
E 5 1 4
This should do:
samp2 <- samp[,-1]
rownames(samp2) <- samp[,1]
So in short: no, there is no alternative to reassigning.
Edit: Correcting myself, one can also do it in place: assign rowname attributes, then remove column:
R> df<-data.frame(a=letters[1:10], b=1:10, c=LETTERS[1:10])
R> rownames(df) <- df[,1]
R> df[,1] <- NULL
R> df
b c
a 1 A
b 2 B
c 3 C
d 4 D
e 5 E
f 6 F
g 7 G
h 8 H
i 9 I
j 10 J
R>
As of 2016 you can also use the tidyverse.
library(tidyverse)
samp %>% remove_rownames %>% column_to_rownames(var="names")
In one line:
> samp.with.rownames <- data.frame(samp[,-1], row.names=samp[,1])
It looks like the one-liner got even simpler along the line (currently using R 3.5.3):
# generate original data.frame
df <- data.frame(a = letters[1:10], b = 1:10, c = LETTERS[1:10])
# use first column for row names
df <- data.frame(df, row.names = 1)
The column used for row names is removed automatically.
With a one-row dataframe
Beware that if the dataframe has a single row, the behaviour might be confusing. As the documentation mentions:
If row names are supplied of length one and the data frame has a single row, the row.names is taken to specify the row names and not a column (by name or number).
This means that, if you use the same command as above, it might look like it did nothing (when it actually named the first row "1", which won't look any different in the viewer).
In that case, you will have to stick to the more verbose:
df <- data.frame(a = "a", b = 1)
df <- data.frame(df, row.names = df[,1])
... but the column won't be removed. Also remember that, if you remove a column to end up with a single-column dataframe, R will simplify it to an atomic vector. In that case, you will want to use the extra drop argument:
df <- data.frame(df[,-1, drop = FALSE], row.names = df[,1])
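The single-row behaviour can be checked with a short sketch like the one below. The names one_row, surprising and explicit are only illustrative, and the behaviour described in the comments follows the documentation passage quoted above.

```r
one_row <- data.frame(a = "x", b = 1, stringsAsFactors = FALSE)

# With a single row, row.names = 1 is treated as the row name itself,
# not as "use column 1", so column a stays and the row is named "1"
surprising <- data.frame(one_row, row.names = 1)

# Passing the values explicitly and dropping the column by hand
# gives the intended result; drop = FALSE keeps a data frame
explicit <- data.frame(one_row[, -1, drop = FALSE], row.names = one_row[, 1])

rownames(surprising)
rownames(explicit)
```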
You can execute this in 2 simple statements:
row.names(samp) <- samp$names
samp[1] <- NULL

Resources