How to rename columns by substituting a suffix in R?

I have several column names in my dataset (eat10.18) that have the suffix "_10p". I'd like to change that suffix to "_p_10" but preserve the rest of the variable name. I only want this to affect columns that end in the exact string "_10p". I cannot figure out how to make this work with rename_with(). Can anyone help? Faux data below:
eat10.18 <- data.frame(id = c(1000, 1001, 1002),
                       eat_10 = c(2, 4, 1),
                       eat_10p = c(1, 2, 3),
                       run_10p = c(1, 1, 2))
In the above example, the variables "id" and "eat_10" would remain the same, but "eat_10p" and "run_10p" would become "eat_p_10" and "run_p_10".
Thanks for your help!

library(tidyverse)
eat10.18 %>%
  rename_with(~ str_replace(., '_10p$', '_p_10'))

    id eat_10 eat_p_10 run_p_10
1 1000      2        1        1
2 1001      4        2        1
3 1002      1        3        2
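If you want to be explicit that only columns ending in the exact suffix can be renamed, rename_with() also takes a .cols tidyselect argument; a minor variation on the answer above:

```r
library(tidyverse)

eat10.18 <- data.frame(id = c(1000, 1001, 1002),
                       eat_10 = c(2, 4, 1),
                       eat_10p = c(1, 2, 3),
                       run_10p = c(1, 1, 2))

# Only columns selected by .cols are passed to the renaming function,
# so names such as "eat_10" cannot be touched by accident.
eat10.18 <- eat10.18 %>%
  rename_with(~ str_replace(.x, "_10p$", "_p_10"), .cols = ends_with("_10p"))

names(eat10.18)
# "id" "eat_10" "eat_p_10" "run_p_10"
```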

I suggest using gsub and referring to this post.
names(eat10.18) <- gsub(x = names(eat10.18), pattern = "_10p$", replacement = "_p_10")
(The $ anchors the pattern to the end of the name, so only columns ending in the exact string "_10p" are affected.)
Result

    id eat_10 eat_p_10 run_p_10
1 1000      2        1        1
2 1001      4        2        1
3 1002      1        3        2

Related

How to rename dimnames default format (e.g. dimnames[[3]]) into a specific label (e.g. dimnames$time)

So I was trying to convert a data frame into a multidimensional array, and I was able to do so. My only problem now is labelling the dimensions.
So far, I have four dimensions in dimnames, but I cannot rename them. The tutorials I have seen online all want me to list a name per column, like this:
dimnames(dfa)[[2]] <- c("col_1", "col_2", "col_3")
but I don't need several names per column or row; all I need is one general label per dimension.
Basically, my dimnames currently print with the default, unnamed format. What I want instead are labels (time, ID, factor_level, chan) after the $ sign, e.g. dimnames$time rather than dimnames[[3]].
I tried the following to rename the rows, but it didn't work:
rownames(dfa) <-"factor_level"
dimnames(dfa) [[1]] <- c("factor_level")
I hope you can help me.
We need to set the names() of the dimnames list:
names(dimnames(dfa))[1] <- 'factor_level'
Output:
> dfa
, , 1, 1
NA
factor_level col_1 col_2 col_3
row_1 7 7 8
row_2 3 2 6
row_3 8 3 9
, , 2, 1
NA
factor_level col_1 col_2 col_3
row_1 4 7 8
row_2 9 8 1
row_3 8 1 5
, , 3, 1
...
data
set.seed(24)
dfa <- array(sample(1:9, size = 3 * 3 * 3 * 2, replace = TRUE),
dim = c(3, 3, 3, 2))
# Assign the full dimnames list at once (assigning into a NULL
# dimnames element-by-element fails on a 4-d array)
dimnames(dfa) <- list(c("row_1", "row_2", "row_3"),
                      c("col_1", "col_2", "col_3"),
                      NULL, NULL)
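Since the question mentions four labels (time, ID, factor_level, chan), note that the whole dimnames list can be named in a single assignment by using a named list. A sketch, where the mapping of labels to dimensions is an assumption:

```r
set.seed(24)
dfa <- array(sample(1:9, size = 3 * 3 * 3 * 2, replace = TRUE),
             dim = c(3, 3, 3, 2))

# A named list names every dimension at once; NULL components leave
# the individual row/column labels of that dimension unset.
dimnames(dfa) <- list(factor_level = c("row_1", "row_2", "row_3"),
                      chan         = c("col_1", "col_2", "col_3"),
                      time         = NULL,
                      ID           = NULL)

names(dimnames(dfa))
# "factor_level" "chan" "time" "ID"
```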

How can I add a column of all 1s to my dataframe?

I have a data frame and I want to add a new column with all entries equal to 1. How can I do that? For example:
col1. col2
1.    2.
4.    5.
33.   4.
5.    3.

new column:

col1. col2. col3
1.    2.    1
4.    5.    1
33.   4.    1
5.    3.    1
df1$col3 <- 1
This should work as well.
Likewise, the following could also work:
df1 <- data.frame(df1, col3 = 1)
The simplest option is to use [ (see ?Extract):
df1['col3'] <- 1
One of the good things about using [ instead of $ is that we can pass a variable holding the column name as well:
v1 <- 'col3'
df1[v1] <- 1
But, if we do
df1$v1 <- 1
it creates a column named 'v1' instead of 'col3'.
Other variations without changing the initial object would be
transform(df1, col3 = 1)
cbind(df1, col3 = 1)
NOTE: All of these create the new column appended as the last column.
Also, the tibble package has a convenient function add_column() which can add a column at a specified position. By default, it appends the column as the last one:
library(tibble)
add_column(df1, col3 = 1)
# col1. col2 col3
#1 1 2 1
#2 4 5 1
#3 33 4 1
#4 5 3 1
But if we need to place it at a specific location, there are arguments for that:
add_column(df1, col3 = 1, .after = 1)
# col1. col3 col2
#1 1 1 2
#2 4 1 5
#3 33 1 4
#4 5 1 3
data
df1 <- structure(list(col1. = c(1, 4, 33, 5), col2 = c(2, 5, 4, 3)),
                 class = "data.frame", row.names = c(NA, -4L))
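For completeness, the usual tidyverse idiom for the same task is mutate(), which also recycles a length-1 value across all rows; shown here as an alternative, not as part of the answers above:

```r
library(dplyr)

df1 <- structure(list(col1. = c(1, 4, 33, 5), col2 = c(2, 5, 4, 3)),
                 class = "data.frame", row.names = c(NA, -4L))

# mutate() returns a modified copy; df1 itself is unchanged
df2 <- df1 %>% mutate(col3 = 1)
df2$col3
# 1 1 1 1
```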

R: compare multiple columns pairs and place value on new corresponding variable

I am a basic R user.
I have 50 column pairs (example pair is: "pair_q1" and "pair_01_v_rde") per "id" in the same dataframe that I would like to collect data from and place it in a new corresponding variable e.g. "newvar_q1".
All the pair variable names have a pattern in their names that can be distilled to this ("pair_qX" and "pair_X_v_rde", where X = 1:50, and the final variables I would like to have are "newvar_qX", where X = 1:50)
Ideally only one member of the pair should contain data, but this is not always the case.
Each of the variables can contain values from 1:5 or NA(missing).
Rules for collecting data from each pair based on "id" and what to place in their newly created corresponding variable are:
If one of the pairs has a value and the other is missing then place the value in their corresponding new variable. e.g. ("pair_q1" = 1 and "pair_01_v_rde" = NA then "newvar_q1" = 1)
If both pairs have the same value or both are missing then place that value/missing in their corresponding new variable e.g. ("pair_q50" = 1/NA and "pair_50_v_rde" = 1/NA then "newvar_q50" = 1/NA)
If both pairs have different values then ignore both values and assign their corresponding new variable 999 e.g. ("pair_q02" = 3 and "pair_02_v_rde" = 2 then "newvar_q02" = 999)
Can anyone show me how I can execute this in R please?
Thanks!
Nelly
# Create Toy dataset
id <- c(100, 101, 102)
pair_q1 <- c(1, NA, 1)
pair_01_v_rde <- c(NA, 2, 1)
pair_q2 <- c(1, 1, NA)
pair_02_v_rde <- c(2, NA, NA)
pair_q50 <- c(NA, 2, 4)
pair_50_v_rde <- c(4, 3, 1)
mydata <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde)
# The dataset
> mydata
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
1 100 1 NA 1 2 NA 4
2 101 NA 2 1 NA 2 3
3 102 1 1 NA NA 4 1
# Here I manually build what I would like to have in the dataset
newvar_q1 <- c(1, 2, 1)
newvar_q2 <- c(999, 1, NA)
newvar_q50 <- c(4, 999, 999)
mydata2 <- data.frame(id, pair_q1, pair_01_v_rde, pair_q2, pair_02_v_rde, pair_q50, pair_50_v_rde, newvar_q1, newvar_q2, newvar_q50)
> mydata2
id pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde newvar_q1 newvar_q2 newvar_q50
1 100 1 NA 1 2 NA 4 1 999 4
2 101 NA 2 1 NA 2 3 2 1 999
3 102 1 1 NA NA 4 1 1 NA 999
A possible solution using the 'tidyverse' (use 'inner_join(mydata,.,by="id")' to get the new columns in the order you give in your question):
mydata %>%
  select(id, matches("^pair_q")) %>%    # keep only the left member of each pair
  gather(k, v1, -id) %>%                # reshape into tuples (id, variable name, value)
  mutate(n = as.integer(str_extract(k, "\\d+"))) -> df1  # extract the pair number
mydata %>%
  select(id, matches("^pair_\\d")) %>%  # same on the right member of each pair
  gather(k, v2, -id) %>%
  mutate(n = as.integer(str_extract(k, "\\d+"))) -> df2
inner_join(df1, df2, by = c("id", "n")) %>%
  mutate(w = case_when(is.na(v1) ~ v2,  # build the new variable value
                       is.na(v2) ~ v1,  # from your rules
                       v1 == v2 ~ v1,
                       TRUE ~ 999),
         k = paste0("newvar_q", n)) %>% # build the new variable name from the pair number
  select(id, k, w) %>%                  # keep only the useful columns
  spread(k, w) %>%                      # switch back from long to wide view
  inner_join(mydata, by = "id")         # and merge the new variables into the original data
#   id newvar_q1 newvar_q2 newvar_q50 pair_q1 pair_01_v_rde pair_q2 pair_02_v_rde pair_q50 pair_50_v_rde
# 1 100         1       999          4       1            NA       1             2       NA             4
# 2 101         2         1        999      NA             2       1            NA        2             3
# 3 102         1        NA        999       1             1      NA            NA        4             1
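Since gather() and spread() have since been superseded in tidyr, here is a base-R sketch of the same rules that loops over the pair numbers directly. It assumes the exact naming pattern pair_qX / pair_XX_v_rde described in the question:

```r
# Toy data from the question
mydata <- data.frame(id = c(100, 101, 102),
                     pair_q1 = c(1, NA, 1),
                     pair_01_v_rde = c(NA, 2, 1),
                     pair_q2 = c(1, 1, NA),
                     pair_02_v_rde = c(2, NA, NA),
                     pair_q50 = c(NA, 2, 4),
                     pair_50_v_rde = c(4, 3, 1))

# Pair numbers actually present in the data (1, 2, 50 here)
nums <- as.integer(sub("pair_q", "",
                       grep("^pair_q\\d+$", names(mydata), value = TRUE)))

for (n in nums) {
  v1 <- mydata[[sprintf("pair_q%d", n)]]
  v2 <- mydata[[sprintf("pair_%02d_v_rde", n)]]
  # Rules: take the non-missing value, keep agreements, flag conflicts as 999
  mydata[[sprintf("newvar_q%d", n)]] <- ifelse(is.na(v1), v2,
                                        ifelse(is.na(v2), v1,
                                        ifelse(v1 == v2, v1, 999)))
}

mydata[, c("newvar_q1", "newvar_q2", "newvar_q50")]
```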

Selecting between rows in a dataframe

I am working with a rather noisy data set, and I am wondering if there is a good way to selectively choose between two rows of data within a group, or leave them alone. Logic-wise, I want to filter by group and then build an if-else type control structure to compare rows based on the value of a second column.
Example:
Row ID V1 V2
1 1 blah 1.2
2 1 blah NA
3 2 foo 2.3
4 3 bar NA
5 3 bar NA
I want to group by ID (1, 2, 3), then go to column V2 and choose, for example, row 1 over row 2 because row 2 has NA. But for rows 4 and 5, where both are NA, I want to just leave them alone.
Thanks,
What you need really depends on what exactly you have. In the case of NAs, this might help:
df <- data.frame(
  Row = c(1, 2, 3, 4, 5),
  ID = c(1, 1, 2, 3, 3),
  V1 = c("blah", "blah", "foo", "bar", "bar"),
  V2 = c(1.2, NA, 2.3, NA, NA),
  stringsAsFactors = FALSE)
df <- df[complete.cases(df), ]
A solution using purrr. The idea is to split the data frame by ID. After that, apply a user-defined function which checks whether all the elements in V2 are NA. If TRUE, it returns the original data frame; otherwise it returns the subset of the data frame with NA rows removed by na.omit. map_dfr is similar to lapply, but it combines all the data frames in a list automatically. dt2 is the final output.
library(purrr)
dt2 <- dt %>%
  split(.$ID) %>%
  map_dfr(function(x) {
    if (all(is.na(x$V2))) {
      return(x)
    } else {
      return(na.omit(x))
    }
  })
dt2
# Row ID V1 V2
# 1 1 1 blah 1.2
# 2 3 2 foo 2.3
# 3 4 3 bar NA
# 4 5 3 bar NA
DATA
dt <- read.table(text = "Row ID V1 V2
1 1 blah 1.2
2 1 blah NA
3 2 foo 2.3
4 3 bar NA
5 3 bar NA",
header = TRUE, stringsAsFactors = FALSE)
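A dplyr alternative to the purrr answer, expressing the same logic as a grouped filter (keep a row when its V2 is present, or when the whole group is missing):

```r
library(dplyr)

dt <- read.table(text = "Row ID V1 V2
1 1 blah 1.2
2 1 blah NA
3 2 foo 2.3
4 3 bar NA
5 3 bar NA",
header = TRUE, stringsAsFactors = FALSE)

# all(is.na(V2)) is evaluated per group, so fully-NA groups are kept intact
dt2 <- dt %>%
  group_by(ID) %>%
  filter(all(is.na(V2)) | !is.na(V2)) %>%
  ungroup()

# Rows 1, 3, 4 and 5 remain, matching the purrr answer
```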

In R, find duplicated dates in a dataset and replace their associated values with their mean

I have a rather small dataset of 3 columns (id, date and distance) in which some dates may be duplicated (otherwise unique) because there is a second distance value associated with that date.
For those duplicated dates, how do I average the distances then replace the original distance with the averages?
Let's use this dataset as the model:
z <- data.frame(id=c(1,1,2,2,3,4),var=c(2,4,1,3,5,2))
# id var
# 1 2
# 1 4
# 2 1
# 2 3
# 3 5
# 4 2
The mean for id 1 is 3 and for id 2 is 2, and these would then replace each of the original var values.
I've checked multiple questions to address this and have found related discussions. As a result, here is what I have so far:
# Check if any dates have two estimates (duplicate Epochs)
length(unique(Rdataset$Epoch)) == nrow(Rdataset)
# if 'TRUE' then each day has a unique data point (no duplicate Epochs)
# if 'FALSE' then duplicate Epochs exist, and the distances must be
# averaged for each duplicate Epoch
Rdataset$Distance <- ave(Rdataset$Distance, Rdataset$Epoch, FUN=mean)
Rdataset <- unique(Rdataset)
Then, with the distances for duplicate dates averaged and replaced, I wish to perform other functions on the entire dataset.
Here's a solution that doesn't bother to actually check whether the ids are duplicated: you don't need to, since for non-duplicated ids the mean of the single var value is just that value.
library(plyr)
z_deduped = ddply(
  z,
  .(id),
  function(df_section) {
    data.frame(id = df_section$id[1], var = mean(df_section$var))
  }
)
Output:
> z_deduped
id var
1 1 3
2 2 2
3 3 5
4 4 2
Unless I misunderstand:
library(plyr)
ddply(z, .(id), summarise, var2 = mean(var))
# id var2
# 1 1 3
# 2 2 2
# 3 3 5
# 4 4 2
Here is another answer in data.table style:
library(data.table)
z <- data.table(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))
z[, mean(var), by = id]
id V1
1: 1 3
2: 2 2
3: 3 5
4: 4 2
There is no need to treat unique values differently from duplicated values, as the mean of a single value is that value.
zt <- aggregate(var ~ id, data = z, mean)
zt
id var
1 1 3
2 2 2
3 3 5
4 4 2
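And for completeness, the dplyr equivalent of the aggregate()/ddply() answers above:

```r
library(dplyr)

z <- data.frame(id = c(1, 1, 2, 2, 3, 4), var = c(2, 4, 1, 3, 5, 2))

# One row per id, with var replaced by its group mean
res <- z %>%
  group_by(id) %>%
  summarise(var = mean(var))
res
```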
