Command for renaming several variables - r

After reshaping my data, I have a large dataset with column names that look like this:
1_abc 1_vwxyz 2_abc 2_vwxyz
I would like to change my column names to look like this: abc_1 vwxyz_1 abc_2 vwxyz_2
My code looks like this:
data <- tibble("1_abc" = c(1,2,3), "1_vwxyz" = c(10,11,12),
"2_abc" = c(1,1,2),"2_vwxyz" = c(9,11,15))
data_renamed <- data %>%
rename_(.dots=setNames(names(.), paste(substr(names(.), start=3, stop=nchar(names(.))),
substr(names(.), start=1, stop=1))))
I get this error:
Error in parse(text = x) : <text>:1:2: unexpected input
1: 1_
^

Here's a solution in base R. (The error occurs because rename_ tries to parse names such as 1_abc, and a name starting with a digit is not a syntactically valid R symbol.) You first take the column names as a character vector, split each into its two parts, reverse the order of each pair, and paste them back together with _.
ll <- strsplit(colnames(data), pattern = "_")
# apply across this list of character vectors to reverse the order and concatenate
ll1 <- lapply(ll, function(x) paste(rev(x), collapse = "_"))
# unlist and assign them to the new data frame
data_renamed <- data
colnames(data_renamed) <- unlist(ll1)
# A tibble: 3 x 4
# abc_1 vwxyz_1 abc_2 vwxyz_2
# <dbl> <dbl> <dbl> <dbl>
# 1 1 10 1 9
# 2 2 11 1 11
# 3 3 12 2 15
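For a tidyverse route, a sketch using rename_with (available in dplyr 1.0+) with a single regex should give the same result; this is an addition for comparison, not part of the original answer:
library(dplyr)
# swap the numeric prefix and the text suffix around the underscore: "1_abc" -> "abc_1"
data_renamed <- data %>%
  rename_with(~ sub("^(\\d+)_(.*)$", "\\2_\\1", .x))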

Related

How to loop through multiple data sets removing specific characters from specified columns in r

I have 25 data sets, each structured the same way. Each contains many rows and 7 columns. Column 6 contains data that should be numeric but is not, because the numbers contain commas, e.g. 100000 is written as 100,000.
I can manually resolve this in each data set by removing the commas and then converting the data to numeric with the following code:
df$column_6 <- gsub("[,]" , "", df$column_6)
df$column_6 <- as.numerical(df$column_6)
However, as there are 25 data sets, I would like to loop through them doing this, but I have been unable to.
Additionally, because column 6 has a different name in each data set, I would prefer to refer to it by position rather than by name, like below:
df[6] <- gsub("[,]" , "", df[6])
However, this doesn't seem to work.
My code is as follows:
list_of_dfs = c(df1, df2, ..... , df25)
for (i in list_of_dfs) {
i[6] <- gsub("[,]" , "", i[6])
i[6] <- as.numerical(i[6])
}
Does anyone have any advice on how to do this?
Your code is close, but has a few problems:
The result never gets assigned back to the list.
as.numerical is a typo; it needs to be as.numeric.
i[6] doesn't work because gsub needs the column as a vector, not a one-column data frame: use i[, 6]. See here for details on [ vs [[.
c(df1, df2) doesn't actually create a list of data frames; it concatenates their columns into one flat list.
Try this instead:
## this is bad, it will make a single list of columns, not of data frames
# list_of_dfs = c(df1, df2, ..... , df25)
# use this instead
list_of_dfs = list(df1, df2, ..... , df25)
# or this
list_of_dfs = mget(ls(pattern = "df"))
for (i in seq_along(list_of_dfs)) {
list_of_dfs[[i]][, 6] <- as.numeric(gsub("[,]" , "", list_of_dfs[[i]][, 6]))
}
We can do a bit better: gsub treats the pattern as a regular expression by default, and since we only need to match a literal comma, passing fixed = TRUE will be quite a bit faster:
for (i in seq_along(list_of_dfs)) {
list_of_dfs[[i]][, 6] <- as.numeric(gsub(",", "", list_of_dfs[[i]][, 6], fixed = TRUE))
}
And we could use lapply instead of a for loop for slightly shorter code:
list_of_dfs <- lapply(list_of_dfs, function(x) {
  x[, 6] <- as.numeric(gsub(",", "", x[, 6], fixed = TRUE))
  return(x)
})
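One step the answer leaves implicit: if the list was built with mget() (so its elements keep the names df1 ... df25) and you want the cleaned data frames back as separate objects, list2env() can write them to the global environment. A small sketch of that follow-up (my addition):
# assumes list_of_dfs came from mget(), so its element names are "df1" ... "df25"
list2env(list_of_dfs, envir = .GlobalEnv)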
Try this out. You put all the data frames in a list, then make the column numeric. Instead of gsub I use readr::parse_number. I'll also include a practice set for illustration.
library(tidyverse)
df1 <- data_frame(id = rep(1,3), num = c("10,000", "11,000", "12,000"))
df2 <- data_frame(id = rep(2,3), num = c("13,000", "14,000", "15,000"))
df3 <- data_frame(id = rep(3,3), num = c("16,000", "17,000", "18,000"))
list(df1, df2, df3) %>% map(~mutate(.x, num = parse_number(num)))
#> [[1]]
#> # A tibble: 3 x 2
#> id num
#> <dbl> <dbl>
#> 1 1 10000
#> 2 1 11000
#> 3 1 12000
#>
#> [[2]]
#> # A tibble: 3 x 2
#> id num
#> <dbl> <dbl>
#> 1 2 13000
#> 2 2 14000
#> 3 2 15000
#>
#> [[3]]
#> # A tibble: 3 x 2
#> id num
#> <dbl> <dbl>
#> 1 3 16000
#> 2 3 17000
#> 3 3 18000
Created on 2018-09-20 by the reprex package (v0.2.0).
Part of the answer has been sourced from here: Looping through list of data frames in R
In your case, you can do the following:
list_of_dfs = list(df1, df2, ..... , df25)
list_of_dfs <- lapply(list_of_dfs, function(x) { x[, 6] <- as.integer(gsub(",", "", x[, 6])); x })
The data.table way
library(data.table)
test <- data.table(col1 = c('100,00', '100', '100,000'), col2 = c('90', '80,00', '60'))
col1 col2
100,00 90
100 80,00
100,000 60
Your list of data frames:
testList <- list(test, test)
Assume you want to correct col2 in this case, but want to use the column index as the reference:
removeNonnumeric<-function(x){return(as.numeric(gsub(',','',x)))}
data<-function(x){return(x[,lapply(.SD,removeNonnumeric),.SDcols=names(x)[2],by=col1])}
removeNonnumeric strips the "," from a column and converts it to numeric, and data applies removeNonnumeric to the chosen column of each data table in testList. Combining the two functions in an lapply produces a list of corrected data tables:
lapply(testList,data)
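As an added variant (a sketch, not part of the original answer), data.table's set() can correct the column purely by index and by reference, without the intermediate helper functions:
library(data.table)
for (dt in testList) {
  # replace column 2 by reference; the tables inside testList are updated in place
  set(dt, j = 2L, value = as.numeric(gsub(",", "", dt[[2]], fixed = TRUE)))
}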

unquote string as variable in pipe

I want to remove duplicate rows from a dataframe, for specific columns only. That can be obtained with distinct:
data <- tibble(a = c(1, 1, 2, 2), b = c(3, 3, 3, 4), z = c(5,4,5,5))
filtered_data <- data %>% distinct(a, b, .keep_all = T)
dim(filtered_data)
# [1] 3 3
This is (almost) what I need. Yet, my problem is that the column names I need to use with distinct will change. So I have a character vector gen that contains the names of the columns I want to use with the distinct function. They need to get unquoted to be useful in the pipe. I found suggestions to use as.name() or eval(parse()). This, however, gives me a different result:
gen <- c("a", "b")
filtered_data <- data %>% distinct(eval(parse(text = gen)), .keep_all = T)
dim(filtered_data)
# [1] 2 4
The eval seems to do something funny with the number of times the data is filtered (and it adds an extra column; I could live with that, though). So, how do I obtain the same result as if I had typed a, b, but using a variable instead?
additional information
I actually obtain gen by reading the column names of a dataframe: gen <- colnames(data)[1:2]. The solution suggested by @gymbrane would be perfect, if I had a way to transform gen to c(a, b). The whole point is to avoid hardcoding the column names. I tried things like gen <- noquote(gen), which does not give an error in the rm_dup_rows function suggested below, but it does give a different result, with the same sort of repeated filtering as I started with...
fixed
I think I got it working. It might be inelegant, and I'm not sure if every step is necessary for the result, but it seems to work by combining the function provided by @gymbrane below with ensym and quos in a for loop while adding to a list in GlobalEnv (edit: GlobalEnv isn't necessary):
unquote_string <- function(string) {
  out <- list()
  i <- 1
  for (s in string) {
    t <- ensym(s)
    out[i] <- dplyr::quos(!!t)
    i <- i + 1
  }
  return(out)
}
gen_quo <- unquote_string(gen)
filtered_data <- rm_dup_rows(data, gen_quo)
dim(filtered_data)
# [1] 3 3
How about creating a function and using quosures? Perhaps something like this is what you are looking for...
rm_dup_rows <- function(data, ...) {
  vars <- dplyr::quos(...)
  data %>% distinct(!!!vars, .keep_all = T)
}
I believe this returns what you are asking for
rm_dup_rows(data = data, a, b)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
2 3 5
2 4 5
rm_dup_rows(data, b, z)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 4 5
Additional
You could modify rm_dup_rows just slightly and construct your vector with quos. Something like this...
rm_dup_rows <- function(data, vars) {
  data %>% distinct(!!!vars, .keep_all = T)
}
# quos your column name vector
gen <- quos(a,z)
rm_dup_rows(data, gen)
# A tibble: 3 x 3
a b z
<dbl> <dbl> <dbl>
1 3 5
1 3 4
2 3 5
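Since gen in the question is already a character vector, a shorter route is to convert the strings to symbols with rlang::syms() and splice them straight into distinct; this addition is not part of the original answer, but it avoids the helper function entirely:
library(dplyr)
library(rlang)
gen <- c("a", "b")
filtered_data <- data %>% distinct(!!!syms(gen), .keep_all = TRUE)
dim(filtered_data)
# [1] 3 3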

R - Creating DFs (tibbles) in a loop. How to rename them and columns inside, to include date? (I do it with eval(..), but is there a better solution?)

I have a loop that creates a tibble, tbl, at the end of each iteration. The loop uses a different date each time, stored in date.
Assume:
tbl <- tibble(colA = 1:5, colB = 5:9)
date <- as.Date("2017-02-28")
> tbl
# A tibble: 5 x 2
colA colB
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
(The contents change every loop, but the names tbl and date and the column names (colA, colB) remain the same.)
The output needs to be named starting with output: outputdate1, outputdate2, etc. The columns inside should be named colAdate1, colBdate1, then colAdate2, colBdate2, and so on.
At the moment I am using this piece of code, which works, but is not easy to read:
eval(parse(text = (
paste0("output", year(date), months(date), " <- tbl %>% rename(colA", year(date), months(date), " = 'colA', colB", year(date), months(date), " = 'colB')")
)))
It produces this code for eval(parse(...)) to evaluate:
"output2017February <- tbl %>% rename(colA2017February = 'colA', colB2017February = 'colB')"
Which gives me the output that I want:
> output2017February
# A tibble: 5 x 2
colA2017February colB2017February
<int> <int>
1 1 5
2 2 6
3 3 7
4 4 8
5 5 9
Is there a better way of doing this? (Preferably with dplyr)
Thanks!
This avoids eval and is easier to read:
ym <- "2017February"
assign(paste0("output", ym), setNames(tbl, paste0(names(tbl), ym)))
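If ym needs to be derived from date inside the loop, it can be built the same way the question builds its names (a small sketch added here; it assumes lubridate is loaded for year(), as the question's own code does, and that months() returns English month names in your locale):
ym <- paste0(year(date), months(date))   # "2017February" for date <- as.Date("2017-02-28")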
Partial rename
If you only wanted to replace the names in the character vector old with the corresponding names in the character vector new then use the following:
assign(paste0("output", ym),
setNames(tbl, replace(names(tbl), match(old, names(tbl)), new)))
Variation
You might consider putting your data frames in a list instead of having a bunch of loose objects in your workspace:
L <- list()
L[[paste0("output", ym)]] <- setNames(tbl, paste0(names(tbl), ym))
.GlobalEnv could also be used in place of L (omitting the L <- list() line) if you want this style but still want to put the objects separately in the global environment.
dplyr
Here it is using dplyr and rlang but it does involve increased complexity:
library(dplyr)
library(rlang)
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
rename(!!!setNames(names(tbl), paste0(names(tbl), ym)))
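With dplyr 1.0 or later, rename_with gives a slightly shorter spelling of the same renaming step; this is an addition for comparison, not part of the original answer:
.GlobalEnv[[paste0("output", ym)]] <- tbl %>%
  rename_with(~ paste0(.x, ym))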

Sorting Interior of a column in R

I am trying to sort the interior of a column in R. For example I have this:
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
And I am trying to sort the values within each cell, like this:
ID HoursAvailable
1 a,b,c,d,k
2 e,g,h,,
3 a,b,c,d,h
I have tried to use the separate function like this:
cdMCd<- cdMf %>% separate(HoursAvailable, c("a","b","c","d","e","f","g","h","i","j"))
But I cannot get it to sort correctly. For this example e in ID 2 would be sorted into the a column, but I need it sorted into the e column. I was planning to separate all the hours into separate columns, order, then recombine, but I cannot get them to separate correctly.
library(dplyr)
dt = read.table(text="
ID HoursAvailable
1 a,b,c,k,d
2 e,g,h
3 a,b,c,h,d
", header=T, stringsAsFactors=F)
SortString = function(x) {paste0(sort(unlist(strsplit(x, split=","))),collapse = ",")}
dt %>%
rowwise() %>%
mutate(Updated = SortString(HoursAvailable)) %>%
ungroup()
# # A tibble: 3 x 3
# ID HoursAvailable Updated
# <int> <chr> <chr>
# 1 1 a,b,c,k,d a,b,c,d,k
# 2 2 e,g,h e,g,h
# 3 3 a,b,c,h,d a,b,c,d,h
Here is what I would do:
First create a function that sorts a single string, then create a function that applies it to a vector of strings.
library(stringr)
library(plyr)
split_and_sort <- function(x) {
  x_split <- sort(unlist(str_split(x, ",")))
  return(paste(x_split, collapse = ","))
}
split_and_sort_column <- function(x) {
  laply(x, split_and_sort)
}
df$HoursAvailable <- split_and_sort_column(df$HoursAvailable)
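If you'd rather not depend on plyr, base vapply does the same per-element application; a small variation added here for reference:
df$HoursAvailable <- vapply(df$HoursAvailable, split_and_sort, character(1), USE.NAMES = FALSE)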

Extract data after a string on each appearance in R

I have data like this (named spectra):
#Milk spectra: 1234
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216
In this data, each occurrence of the string ##XYDATA=(X++(Y..Y)) marks the start of the data for a different animal.
So, I want code that can extract this sample into 3 pieces of data:
Animal 1: 3 lines after 1st ' ##XYDATA=(X++(Y..Y))'
Animal 2: 3 lines after 2nd ' ##XYDATA=(X++(Y..Y))'
And so on.
I tried this line of code, but it only extracts the first line after every appearance of the string '##XYDATA=(X++(Y..Y))', all collected together. It does not give me the three lines after each appearance as separate pieces of data.
bo<-data.frame(spectra$V1[which(spectra$V1 == '##XYDATA=(X++(Y..Y))')+1])
Okay, I think you could do something along these lines. I'm sure it could be made better and more efficient: read the data in as a character vector, then loop through it to spread it out into a wide format. However, this assumes there is always the same number of measures and that you have a way to identify the marker values.
# toy character vector standing in for the raw file
c_data <- c("split", 1, 2, 3,
            "split", 4, 5, 6)
# flag the marker positions
y <- c_data == "split"
# empty wide data frame to fill, one row per animal
df_wide <- data.frame("animal" = character(), "v1" = numeric(), "v2" = numeric(), "v3" = numeric(),
                      stringsAsFactors = FALSE)
names(df_wide) <- c("animal", "v1", "v2", "v3")
x <- 0
for (i in 1:length(c_data)) {
  if (y[i] == TRUE) {
    x <- x + 1
    # take the marker plus the three following values as one row
    df_wide[x, ] <- rbind(c(c_data[i], c_data[i + 1], c_data[i + 2], c_data[i + 3]))
  }
}
yields
animal v1 v2 v3
1 split 1 2 3
2 split 4 5 6
If it is a one-time thing, it may not be worth trying to write something nicer. If it is an ongoing thing, then you may want to look at using an apply function, writing a small helper function for it.
You can do either of the following with split and map:
library(dplyr)
library(purrr)
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
split(.$Animal) %>%
map(~slice(., -1) %>% mutate(V1 = as.numeric(V1))) %>%
'['(-1)
This creates an indicator variable Animal, splits by that indicator, removes the first row of each data frame, converts V1 to numeric, and finally removes the first element of the list (the header group).
You can also do the following:
df %>%
mutate(Animal = cumsum(grepl("##XYDATA=(X++(Y..Y))", V1, fixed = TRUE))) %>%
filter(!grepl("^#.*$", V1)) %>%
mutate(V1 = as.numeric(V1)) %>%
split(.$Animal)
This also creates the indicator Animal, but it instead filters out all rows containing # and converts V1 to numeric before splitting into separate data frames.
Result:
$`1`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 649.0251 1
2 667.6752 1
3 686.3254 1
$`2`
# A tibble: 3 x 2
V1 Animal
<dbl> <int>
1 723.6257 2
2 742.2758 2
3 760.9260 2
$`3`
# A tibble: 4 x 2
V1 Animal
<dbl> <int>
1 872.8268 3
2 891.4770 3
3 910.1271 3
4 928.7773 3
Note:
Here I assumed #Milk spectra: 1234 is also a row in your column, hence the subsetting at the end.
Data:
df = read.table(textConnection("'#Milk spectra: 1234'
##XYDATA=(X++(Y..Y))
649.025085449219
667.675231457819
686.325377466418
##XYDATA=(X++(Y..Y))
723.625669483618
742.275815492218
760.925961500818
##XYDATA=(X++(Y..Y))
872.826837552417
891.476983561017
910.127129569617
928.777275578216"),comment.char = "", stringsAsFactors = FALSE)
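For completeness, a base R sketch (my addition, using the same cumulative-marker idea as the dplyr answers) that returns a plain list of numeric vectors, one per animal:
marker <- grepl("##XYDATA=(X++(Y..Y))", df$V1, fixed = TRUE)
animal <- cumsum(marker)                 # 0 for the header, then 1, 2, 3 ...
keep   <- !marker & animal > 0           # drop the header row and the marker rows
split(as.numeric(df$V1[keep]), animal[keep])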
