Every week I receive an incomplete dataset for an analysis. It looks like this:
df1 <- data.frame(var1 = c("a","","","b",""),
var2 = c("x","y","z","x","z"))
Some var1 values are missing. The dataset should end up looking like this:
df2 <- data.frame(var1 = c("a","a","a","b","b"),
var2 = c("x","y","z","x","z"))
Currently I use an Excel macro to do this, but that makes it harder to automate the analysis. From now on I would like to do this in R, but I have no idea how.
Thanks for your help.
QUESTION UPDATE AFTER COMMENT
var2 is not relevant for my question. The only thing I am trying to do is get from df1 to df2.
df1 <- data.frame(var1 = c("a","","","b",""))
df2 <- data.frame(var1 = c("a","a","a","b","b"))
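Here is one way of doing it, using run-length encoding (rle) and its inverse, inverse.rle:
fillTheBlanks <- function(x, missing = "") {
  runs <- rle(as.character(x))
  # positions of the runs that consist of the missing marker
  empty <- which(runs$values == missing)
  # overwrite each blank run with the value of the run before it
  # (assumes the first value of x is not blank)
  runs$values[empty] <- runs$values[empty - 1]
  inverse.rle(runs)
}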
df1$var1 <- fillTheBlanks(df1$var1)
The results:
df1
var1 var2
1 a x
2 a y
3 a z
4 b x
5 b z
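To see why the rle approach works, it can help to look at what rle() returns for the example column. The following is a small base R sketch; the variable names x, runs and restored are only for illustration.

```r
x <- c("a", "", "", "b", "")

# rle() collapses consecutive equal values into runs
runs <- rle(x)
runs$lengths  # 1 2 1 1
runs$values   # "a" ""  "b" ""

# inverse.rle() expands the runs back into the original vector
restored <- inverse.rle(runs)
identical(restored, x)  # TRUE
```

Filling the blanks then only requires changing the entries of runs$values before expanding again.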
Here is a simpler way:
library(zoo)
df1$var1[df1$var1 == ""] <- NA
df1$var1 <- na.locf(df1$var1)
The tidyr package has the fill() function, which does the trick:
library(tidyr)
df1 <- data.frame(var1 = c("a",NA,NA,"b",NA), stringsAsFactors = FALSE)
df1 %>% fill(var1)
Here is another way which is slightly shorter and doesn't coerce to character:
Fill <- function(x,missing="")
{
Log <- x != missing
y <- x[Log]
y[cumsum(Log)]
}
Results:
# For factor:
Fill(df1$var1)
[1] a a a b b
Levels: a b
# For character:
Fill(as.character(df1$var1))
[1] "a" "a" "a" "b" "b"
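Both helper functions above assume that the first value is not blank. A possible variant that tolerates a leading blank (leaving it as NA) and also treats NA as missing might look like the sketch below. fill_down is a hypothetical name, not a function from any package.

```r
fill_down <- function(x, missing = "") {
  is_blank <- is.na(x) | x == missing
  # index of the most recent non-blank value; 0 before the first one
  idx <- cumsum(!is_blank)
  # a 0 index would silently drop the element, so turn it into NA
  idx[idx == 0] <- NA
  x[!is_blank][idx]
}

fill_down(c("a", "", "", "b", ""))   # "a" "a" "a" "b" "b"
fill_down(c("", "a", NA, "b"))       # NA  "a" "a" "b"
```

The result always has the same length as the input, so it can be assigned straight back to the column.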
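Below is my unfill function, which does the opposite (it replaces repeated values with NA). I ran into the same problem and hope it helps. It needs dplyr and purrr to be loaded.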
unfill <- function(df,cols){
col_names <- names(df)
unchanged <- df[!(names(df) %in% cols)]
changed <- df[names(df) %in% cols] %>%
map_df(function(col){
col[col == col %>% lag()] <- NA
col
})
unchanged %>% bind_cols(changed) %>% select(one_of(col_names))
}
Related
My goal is to get a concise way to rename multiple columns in a data frame. Let's consider a small data frame df as below:
df <- data.frame(a=1, b=2, c=3)
df
Let's say we want to change the names from a, b, and c to Y, Z, and E respectively.
I define a named character vector that maps the new names to the old names:
df_names <- c(Y = "a", Z = "b", E = "c")
I would use this to rename the columns:
rename(df, !!!df_names)
df
Any suggestions?
One more !:
df <- data.frame(a=1, b=2, c=3)
df_names <- c(Y = "a", Z ="b", E = "c")
library(dplyr)
df %>% rename(!!!df_names)
## Y Z E
##1 1 2 3
A non-tidy way might be through match:
names(df) <- names(df_names)[match(names(df), df_names)]
df
## Y Z E
##1 1 2 3
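Note that the match approach returns NA for any column that is not listed in df_names. If only some columns should be renamed, a sketch like the following keeps the remaining names. The two-entry df_names here is an assumed example, and idx is just an illustrative variable name.

```r
df <- data.frame(a = 1, b = 2, c = 3)
df_names <- c(Y = "a", E = "c")   # new name = old name; "b" is not listed

idx <- match(names(df), df_names)  # NA where a column has no new name
names(df) <- ifelse(is.na(idx), names(df), names(df_names)[idx])
names(df)  # "Y" "b" "E"
```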
You could try:
sample(LETTERS[which(LETTERS %in% names(df) == FALSE)], size= length(names(df)), replace = FALSE)
[1] "S" "D" "N"
Here, you don't really care what the new names are, as you're using sample. Otherwise a straightforward names(df) <- c('name1', 'name2'...
I am trying to standardize feedback from an API in R. However in some cases, the API returns a different format. This does not allow me to standardize and automate. I have thought of a solution which is as follows:
if dataframe has more than 1 variable, keep dataframe as it is
if dataframe has 1 variable then transpose
This is what I have tried so far:
col <- ncol(df)
df <- ifelse(col > 1, as.data.frame(df), as.data.frame(t(df)))
This, however, returns a list and does not let me process the data further. Thank you in advance for your help. Any links would help too.
Thanks
Maybe you need something like this:
# some simple dataframes
df1 <- data.frame(col1 = c("a","b"))
df2 <- data.frame(col1 = c("a","b"),
col2 = c("c","d"))
func <- function(df) {
if (ncol(df) ==1) {
as.data.frame(t(df))
} else {
(df)
}
}
func(df1)
V1 V2
col1 a b
func(df2)
col1 col2
1 a c
2 b d
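As for why the ifelse attempt returns a list: ifelse is vectorised and builds a result shaped like its test, so with a single TRUE or FALSE it picks only the first element of the chosen data frame, which is its first column wrapped in a list. A plain if/else returns the whole object. A small sketch with made-up data (res_ifelse and res_if are illustrative names):

```r
df <- data.frame(col1 = c("a", "b"), col2 = c("c", "d"))

# ifelse() keeps only the first element (column) of the data frame
res_ifelse <- ifelse(ncol(df) > 1, df, as.data.frame(t(df)))
is.data.frame(res_ifelse)  # FALSE
is.list(res_ifelse)        # TRUE

# if/else returns the chosen object unchanged
res_if <- if (ncol(df) > 1) df else as.data.frame(t(df))
is.data.frame(res_if)      # TRUE
```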
I am trying to re-order columns in R using a for loop since the column range needs to be dynamic. Does anyone know what is missing from my code?
Group <- c("A","B","C","D")
Attrib1 <- c("x","y","x","z")
Attrib2 <- c("q","w","u","i")
Day1A <- c(5,4,6,3)
Day2A <- c(6,5,7,4)
Day3A <- c(9,8,10,7)
Day1B <- c(4,3,5,2)
Day2B <- c(3,2,4,1)
Day3B <- c(2,1,3,0)
df <- data.frame(Group, Attrib1,Attrib2,Day1A,Day2A,Day3A,Day1B,Day2B,Day3B)
day_count <- 3
for(i in 4:ncol(df)) {
if (i == day_count+3) break
df[c(i,day_count+i)]
}
Here is my desired result:
df <- data.frame(Group, Attrib1,Attrib2,Day1A,Day1B,Day2A,Day2B,Day3A,Day3B)
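In theory you can just do sort(colnames(df)[4:ncol(df)]) to get that, but it gets tricky when you have, say, Day1A..Day10A..Day20A, because sort compares the names as text and would put Day10A before Day2A.
Below is a quick workaround that extracts the day numbers and letters and orders by both: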
COLS = colnames(df)[4:ncol(df)]
day_no = as.numeric(gsub("[^0-9]","",COLS))
day_letter = gsub("Day[0-9]*","",COLS)
o = order(day_no,day_letter)
To get your final dataframe:
df[,c(colnames(df)[1:3],COLS[o])]
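To illustrate why plain sorting breaks down once the day numbers reach two digits, here is a small sketch with made-up column names. method = "radix" is used only so that the text sort does not depend on the locale; by_text and by_day are illustrative variable names.

```r
cols <- c("Day1A", "Day2A", "Day10A", "Day1B", "Day2B", "Day10B")

# text sort: "Day10A" lands before "Day1A" and "Day2A"
by_text <- sort(cols, method = "radix")
by_text
# "Day10A" "Day10B" "Day1A" "Day1B" "Day2A" "Day2B"

# numeric day first, then letter
day_no <- as.numeric(gsub("[^0-9]", "", cols))
day_letter <- gsub("Day[0-9]*", "", cols)
by_day <- cols[order(day_no, day_letter)]
by_day
# "Day1A" "Day1B" "Day2A" "Day2B" "Day10A" "Day10B"
```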
An option with select:
library(dplyr)
library(stringr)
df %>%
  select(Group, starts_with('Attrib'),
         # order the remaining columns by their day number (as a number, not as text)
         names(.)[-(1:3)][order(as.numeric(str_remove_all(names(.)[-(1:3)], '\\D+')))])
I have a data frame with a number of columns of the form var1.mean, var2.mean. I would like to strip the suffix ".mean" from all columns that contain it. I tried using rename_all in conjunction with a regex in a pipe but could not come up with the correct syntax. Any suggestions?
If you want to use the dplyr package, I'd recommend using the rename_at function.
Dframe <- data.frame(var1.mean = rnorm(10),
var2.mean = rnorm(10),
var1.sd = runif(10))
library(dplyr)
Dframe %>%
  rename_at(.vars = vars(ends_with(".mean")),
            # funs() is deprecated in recent dplyr; a formula does the same job
            .funs = ~ sub("[.]mean$", "", .))
Using newer dplyr (1.0.0 or later), with stringr for str_remove:
library(stringr)
df %>% rename_with(~ str_remove(., "\\.mean$"))
We can use rename_all
df1 %>%
rename_all(.funs = funs(sub("\\..*", "", names(df1)))) %>%
head(2)
# var1 var2 var3 var1 var2 var3
#1 -0.5458808 -0.09411013 0.5266526 -1.3546636 0.08314367 0.5916817
#2 0.5365853 -0.08554095 -1.0736261 -0.9608088 2.78494703 -0.2883407
NOTE: If the column names are duplicated, it needs to be made unique with make.unique
data
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(25*6), 25, 6, dimnames = list(NULL,
paste0(paste0("var", 1:3), rep(c(".mean", ".sd"), each = 3)))))
You may use gsub. Escape the dot so that it matches a literal ".":
colnames(df) <- gsub('\\.mean$','',colnames(df))
The following works for me:
dat <- data.frame(var1.mean = 1, var2.mean = 2)
col_old <- colnames(dat)
col_new <- gsub(pattern = "\\.mean$", replacement = "", x = col_old)
colnames(dat) <- col_new
You can replace these names using the stri_replace_last_regex function from the stringi package, like this:
require(stringi)
df <- data.frame(1,2,3,4,5,6)
names(df) <- stri_paste("var",1:6,c(".mean",".sd"))
df
## var1.mean var2.sd var3.mean var4.sd var5.mean var6.sd
##1 1 2 3 4 5 6
names(df) <- stri_replace_last_regex(names(df),"\\.mean$","")
df
## var1 var2.sd var3 var4.sd var5 var6.sd
##1 1 2 3 4 5 6
The regex is \\.mean$ because you need to escape the dot character (it has a special meaning in regex), and you can also add a $ sign at the end to ensure that you replace only names that END with this pattern (if the .mean text is in the middle of the string, it won't be replaced).
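A small sketch of what can go wrong without the escape and the anchor. The column names here are invented for illustration, and loose and strict are illustrative variable names.

```r
cols <- c("var1.mean", "demeaned.sd", "var.mean.sd")

# unescaped dot matches any character, and there is no end anchor
loose <- gsub(".mean", "", cols)
loose
# "var1" "ded.sd" "var.sd"

# escaped dot plus $ only strips a trailing ".mean"
strict <- gsub("\\.mean$", "", cols)
strict
# "var1" "demeaned.sd" "var.mean.sd"
```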
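I would use strsplit:
x <- as.data.frame(matrix(runif(16), ncol = 4))
colnames(x) <- c("var1.mean", "var2.mean", "var3.mean", "something.else")
# fixed = TRUE treats ".mean" literally; unlist() turns the list from strsplit into a character vector
colnames(x) <- unlist(strsplit(colnames(x), split = ".mean", fixed = TRUE))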
colnames(x)
Lots of quick answers have been given; the most intuitive to me would be:
Dframe <- data.frame(var1.mean = rnorm(10), #Create Example
var2.mean = rnorm(10),
var1.sd = runif(10))
names(Dframe) <- gsub("[.]mean","",names(Dframe)) #remove ".mean"
I'm working from this answer, trying to optimize the second argument to plyr::rename, as suggested by Jared.
In short, they are renaming some columns in a data frame using plyr like this:
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df
newNames <- c("new_col1", "new_col2", "new_col3")
oldNames <- names(df)
require(plyr)
df <- rename(df, c("col1"="new_col1", "col2"="new_col2", "col3"="new_col3"))
df
In passing Jared writes '[a]nd you can be creative in making that second argument to rename so that it is not so manual.'
I've tried being creative like this,
df <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df
secondArgument <- paste0('"', oldNames, '"','=', '"',newNames, '"',collapse = ',')
df <- rename(df, secondArgument)
df
But it does not work. Can anyone help me automate this?
Thanks!
Update Sun Sep 9 11:55:42PM
I realized I should have been more specific in my question.
I'm using plyr::rename because, in my real-life example, I have other variables and I don't always know the position of the variables I want to rename. I'll add an update to my question.
My case looks like this, but with 100+ variables:
df2 <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df2
df2 <- rename(df2, c("col1"="new_col1", "col3"="new_col3"))
df2
df2 <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df2
newNames <- c("new_col1", "new_col3")
oldNames <- names(df[,c('col1', 'col3')])
secondArgument <- paste0('"', oldNames, '"','=', '"',newNames, '"',collapse = ',')
df2 <- rename(df2, secondArgument)
df2
Please add a comment if there is anything I need to clarify.
Solution to modified question:
df2 <- data.frame(col1=1:3,col2=3:5,col3=6:8)
df2
newNames <- c("new_col1", "new_col3")
oldNames <- names(df2[,c('col1', 'col3')])
(Isn't oldNames equal to c('col1','col3') by definition?)
Solution with plyr:
secondArgument <- setNames(newNames,oldNames)
library(plyr)
df2 <- rename(df2, secondArgument)
df2
Or in base R you could do:
names(df2)[match(oldNames,names(df2))] <- newNames
Set the names on newNames to the names from oldNames:
R> names(newNames) <- oldNames
R> newNames
col1 col2 col3
"new_col1" "new_col2" "new_col3"
R> df <- rename(df, newNames)
R> df
new_col1 new_col2 new_col3
1 1 3 6
2 2 4 7
3 3 5 8
plyr::rename requires a named character vector, with new names as values, and old names as names.
This should work:
names(newNames) <- oldNames
df <- rename(df, newNames)
df
new_col1 new_col2 new_col3
1 1 3 6
2 2 4 7
3 3 5 8
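The paste0 attempt from the question fails because it produces one long string that merely looks like R code, not a named vector. A short base R sketch of the difference, using the names from the question (as_text and mapping are illustrative variable names):

```r
oldNames <- c("col1", "col3")
newNames <- c("new_col1", "new_col3")

# paste0(..., collapse = ",") yields a single unnamed string
as_text <- paste0('"', oldNames, '"', '=', '"', newNames, '"', collapse = ',')
length(as_text)   # 1
names(as_text)    # NULL

# setNames() yields a named vector: new names as values, old names as names
mapping <- setNames(newNames, oldNames)
mapping
#       col1       col3
# "new_col1" "new_col3"
```

A vector like mapping is the kind of object that can be passed as the second argument to plyr::rename.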