Adding prefixes to some variables without touching others? - r

I'd like to produce a data frame like df3 from df1, i.e. add a prefix (important_) to variables that don't have one, whilst not touching variables with certain prefixes (gea_, win_, hea_). Thus far I've only managed something like df2, where the important_ variables end up in a separate data frame, but I'd like all variables in the same data frame. Any thoughts on it would be much appreciated.
What I've got:
library(dplyr)
df1 <- data.frame("hea_income"=c(45000,23465,89522),"gea_property"=c(1,1,2) ,"win_state"=c("AB","CA","GA"), "education"=c(1,2,3), "commute"=c(13,32,1))
df2 <- df1 %>% select(-contains("gea_")) %>% select(-contains("win_")) %>% select(-contains("hea_")) %>% setNames(paste0('important_', names(.)))
What I would like:
df3 <- data.frame("hea_income"=c(45000,23465,89522),"gea_property"=c(1,1,2) ,"win_state"=c("AB","CA","GA"), "important_education"=c(1,2,3), "important_commute"=c(13,32,1))

An option would be rename_at (funs() is deprecated in current dplyr, so a formula is used here):
dfN <- df1 %>%
rename_at(4:5, ~ paste0("important_", .))
identical(dfN, df3)
#[1] TRUE
We can also use a regex if we want to specify the variables by name rather than by numeric index. Here the assumption is that every column that doesn't already contain a _ should get the prefix.
df1 %>%
rename_at(vars(matches("^[^_]*$")), ~ paste0("important_", .))
# hea_income gea_property win_state important_education important_commute
#1 45000 1 AB 1 13
#2 23465 1 CA 2 32
#3 89522 2 GA 3 1
Or with matches and -
df1 %>%
rename_at(vars(-matches("_")), ~ paste0("important_", .))
# hea_income gea_property win_state important_education important_commute
#1 45000 1 AB 1 13
#2 23465 1 CA 2 32
#3 89522 2 GA 3 1
All three solutions above give the expected output as shown in the OP's post.

Here's another possibility:
names(df1) <- names(df1) %>% {ifelse(grepl("_",.),.,paste0("important_",.))}
# > df1
# hea_income gea_property win_state important_education important_commute
# 1 45000 1 AB 1 13
# 2 23465 1 CA 2 32
# 3 89522 2 GA 3 1
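The answers above assume that any column containing an underscore already has a prefix. If only the three prefixes from the question should be protected, a minimal base R sketch could look like the following. The names keep_prefixes, has_prefix and dfP are made up for illustration.

```r
# Example data from the question
df1 <- data.frame(hea_income = c(45000, 23465, 89522),
                  gea_property = c(1, 1, 2),
                  win_state = c("AB", "CA", "GA"),
                  education = c(1, 2, 3),
                  commute = c(13, 32, 1))

# Prefixes that should be left alone (assumed list, taken from the question)
keep_prefixes <- c("gea_", "win_", "hea_")

# Build one anchored pattern: ^(gea_|win_|hea_)
pattern <- paste0("^(", paste(keep_prefixes, collapse = "|"), ")")
has_prefix <- grepl(pattern, names(df1))

# Rename only the columns that do not start with a protected prefix
dfP <- df1
names(dfP)[!has_prefix] <- paste0("important_", names(dfP)[!has_prefix])
names(dfP)
```

Unlike the underscore test, this would also prefix a column such as other_var, which may or may not be what you want.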

Related

R - Return column name for row where first given value is found

I am trying to find the first occurrence of a FALSE in a dataframe for each row value. My rows are specific occurrences and the columns are dates. I would like to be able to find the date of first FALSE so that I can use that value to find a return date.
An example structure of my dataframe:
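# check.names = FALSE keeps the column names as "2001" etc. instead of "X2001"
df <- data.frame(ID = c(1,2,3), '2001' = c(TRUE, TRUE, TRUE),
'2002' = c(FALSE, TRUE, FALSE), '2003' = c(TRUE, FALSE, TRUE),
check.names = FALSE)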
I want to end up with a second dataframe or list that contains the ID and the column name that identifies the first instance of a FALSE.
For example :
ID | Date
1 | 2002
2 | 2003
3 | 2002
I do not know the mechanism to find such a result.
The actual dataframe contains a couple thousand rows so I unfortunately can't do it by hand.
I am a new R user so please don't refrain from suggesting things you might expect a more experienced R user to have already thought about.
Thanks in advance
Try this using tidyverse functions. You can reshape the data to long format and then filter for the FALSE values. If an ID has several FALSE values, the second filter keeps only the first one. Here is the code:
library(dplyr)
library(tidyr)
#Code
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
filter(!duplicated(value)) %>% select(-value) %>%
rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Another option, without duplicated(), is to use row_number() to keep the first FALSE row of each group (row_number()==1):
library(dplyr)
library(tidyr)
#Code 2
newdf <- df %>% pivot_longer(-ID) %>%
group_by(ID) %>%
filter(value==F) %>%
mutate(V=ifelse(row_number()==1,1,0)) %>%
filter(V==1) %>%
select(-c(value,V)) %>% rename(Myname=name)
Output:
# A tibble: 3 x 2
# Groups: ID [3]
ID Myname
<dbl> <chr>
1 1 2002
2 2 2003
3 3 2002
Or using base R with apply() and an anonymous function:
#Code 3
out <- data.frame(df[,1,drop=F],Res=apply(df[,-1],1,function(x) names(x)[min(which(x==F))]))
Output:
ID Res
1 1 2002
2 2 2003
3 3 2002
We can use max.col with ties.method = 'first' after inverting the logical values.
cbind(df[1], Date = names(df[-1])[max.col(!df[-1], ties.method = 'first')])
# ID Date
#1 1 2002
#2 2 2003
#3 3 2002
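One thing to watch for with the approaches above: a row that contains no FALSE at all. As far as I can tell, max.col with ties.method = 'first' then reports the first date column, and min(which(...)) gives a warning. Below is a minimal sketch that returns NA for such rows instead. The fourth row and the names flags, first_false and result are made up for illustration.

```r
# Example data with an extra row (ID 4) that never turns FALSE
df <- data.frame(ID = c(1, 2, 3, 4),
                 "2001" = c(TRUE, TRUE, TRUE, TRUE),
                 "2002" = c(FALSE, TRUE, FALSE, TRUE),
                 "2003" = c(TRUE, FALSE, TRUE, TRUE),
                 check.names = FALSE, stringsAsFactors = FALSE)

# Logical matrix of the date columns only
flags <- as.matrix(df[-1])

# For each row: name of the first FALSE column, or NA if there is none
first_false <- apply(flags, 1, function(x) {
  idx <- which(!x)
  if (length(idx) == 0) NA_character_ else names(x)[idx[1]]
})

result <- data.frame(ID = df$ID, Date = unname(first_false),
                     stringsAsFactors = FALSE)
result
```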

How to keep one instance or more of the values in one column when removing duplicate rows?

I'm trying to remove rows with duplicate values in one column of a data frame. I want to make sure that all the existing values in that column are represented, appearing more than once if its values in one other column are not duplicated and non-missing, and only once if the values in that other column are all missing. Take for example the following data frame:
toy <- data.frame(Group = c(1,1,2,2,2,3,3,4,5,5,6,7,7), Class = c("a",NA,"a","b",NA,NA,NA,NA,"a","b","a","a","a"))
I would like to end up with this:
ideal <- data.frame(Group = c(1,2,2,3,4,5,5,6,7), Class = c("a","a","b",NA,NA,"a","b","a","a"))
I tried transforming the data frame into a data table and following the advice here, like this:
library(data.table)
toy.dt <- as.data.table(toy)
toy.dt[, .(Class = if(all(is.na(Class))) NA_character_ else na.omit(Class)), by = Group]
but duplicates weren't handled as needed: value 7 in the column 'Group' should appear only once in the resulting data.
It would be a bonus if the solution doesn't require transforming the data into a data table.
Here is one way using base R. We first drop NA rows in toy and select only unique rows. We can then left join it with unique Group values to get the rows which are NA for the group.
df1 <- unique(na.omit(toy))
merge(unique(subset(toy, select = Group)), df1, all.x = TRUE)
# Group Class
#1 1 a
#2 2 a
#3 2 b
#4 3 <NA>
#5 4 <NA>
#6 5 a
#7 5 b
#8 6 a
#9 7 a
Same logic using dplyr functions:
library(dplyr)
toy %>%
na.omit() %>%
distinct() %>%
right_join(toy %>% distinct(Group), by = "Group")
If you would like to try a tidyverse approach:
library(tidyverse)
toy %>%
group_by(Group) %>%
filter(!(is.na(Class) & sum(!is.na(Class)) > 0)) %>%
distinct()
Output
# A tibble: 9 x 2
# Groups: Group [7]
Group Class
<dbl> <chr>
1 1 a
2 2 a
3 2 b
4 3 NA
5 4 NA
6 5 a
7 5 b
8 6 a
9 7 a
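The filter condition above keeps an NA row only when its group has no non-missing Class at all. The same rule can be spelled out step by step in base R. This is just a sketch of that logic; has_value, keep and result are names made up for illustration.

```r
toy <- data.frame(Group = c(1, 1, 2, 2, 2, 3, 3, 4, 5, 5, 6, 7, 7),
                  Class = c("a", NA, "a", "b", NA, NA, NA, NA, "a", "b", "a", "a", "a"),
                  stringsAsFactors = FALSE)

# TRUE for every row whose group has at least one non-missing Class
has_value <- ave(!is.na(toy$Class), toy$Group, FUN = any)

# Keep non-missing rows, plus NA rows from groups that are entirely NA
keep <- !is.na(toy$Class) | !has_value

# Drop exact duplicates and reset the row names
result <- unique(toy[keep, ])
rownames(result) <- NULL
result
```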

How to replace column names that include specific string r

I would like to replace columns that contain "score" string with predefined names.
Here is a simple example dataset and my desired column names to replace.
df1 <- data.frame(a = c(1,2,3,4,5),
b = c(5,6,7,8,9),
c.1_score = c(10,10,2,3,4),
a.2_score= c(1,3,5,6,7))
replace.cols <- c("c_score", "a_score")
The number of columns changes each trial. So whenever the column name includes _score, I would like to replace them with my predefined replace.cols names.
The desired col names should be a b c_score and a_score.
Any thought?
Thanks.
We can use rename_at
library(dplyr)
df1 <- df1 %>%
rename_at(vars(ends_with('score')), ~ replace.cols)
df1
# a b c_score a_score
#1 1 5 10 1
#2 2 6 10 3
#3 3 7 2 5
#4 4 8 3 6
#5 5 9 4 7
or with str_remove
library(stringr)
df1 %>%
rename_at(vars(ends_with('score')), ~ str_remove(., '\\.\\d+'))
Or using base R (assuming the column names order is maintained in 'replace.cols')
names(df1)[endsWith(names(df1), 'score')] <- replace.cols
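If you would rather not rely on the order of replace.cols, the str_remove idea can also be written in base R with sub(). This is only a sketch and assumes every score column looks like name.digits_score, as in the example data; new_names is a name made up for illustration.

```r
df1 <- data.frame(a = c(1, 2, 3, 4, 5),
                  b = c(5, 6, 7, 8, 9),
                  c.1_score = c(10, 10, 2, 3, 4),
                  a.2_score = c(1, 3, 5, 6, 7))

# Remove ".<digits>" only when it sits directly before "_score" at the end
new_names <- sub("\\.[0-9]+_score$", "_score", names(df1))
names(df1) <- new_names
names(df1)
```

Columns without the _score suffix are left untouched, so this should keep working when the number of score columns changes between trials.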

Remove first 10 and last 10 values

I have a file that contains multiple individuals and multiple values for the same individual.
I need to remove the first 10 and last 10 values of each individual, putting all the leftover values in a new table.
This is what my data kinda looks like:
Cow Data
NL123456 123
NL123456 456
I tried a for-loop that counts how many values each individual has, but I think I already got stuck there because I am not using the right command (all values in Cow are factors).
I figured removing the first and last had to be something like this:
data1[c(11: n-10),]
If you know you always have more than 20 data points per cow, you can do the following, illustrated on the iris dataset:
library(dplyr)
dim(iris)
# [1] 150 5
iris_trimmed <-
iris %>%
group_by(Species) %>%
slice(11:(n()-10)) %>%
ungroup()
dim(iris_trimmed)
# [1] 90 5
On your data:
res <-
your_data %>%
group_by(Cow) %>%
slice(11:(n()-10)) %>%
ungroup()
In base R you can do :
iris_trimmed <- do.call(
rbind,
lapply(split(iris, iris$Species),
function(x) head(tail(x,-10),-10)))
dim(iris_trimmed)
# [1] 90 5
Using data.table:
library(data.table)
idt <- as.data.table(iris)
idt[, .SD[11:(.N-10)], Species]
Same logic in base R:
do.call(
rbind,
lapply(
split(iris, iris[["Species"]]),
function(x) x[11:(nrow(x)-10), ]
)
)
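Two details about the index 11:(n-10) are worth spelling out. The parentheses matter, because : binds tighter than -, so 11:n-10 means (11:n)-10. Also, when a group has 20 rows or fewer the sequence counts backwards and picks the wrong rows. The sketch below guards against that; trim_ends is a hypothetical helper, not part of any package.

```r
# Operator precedence: these are not the same thing
n <- 15
11:n - 10     # (11:n) - 10, i.e. 1 2 3 4 5
11:(n - 10)   # 11:5, a sequence that counts down

# Hypothetical helper: drop the first and last k rows of a data frame,
# returning zero rows when the group is too small to trim
trim_ends <- function(x, k = 10) {
  n <- nrow(x)
  if (n <= 2 * k) return(x[0, , drop = FALSE])
  x[(k + 1):(n - k), , drop = FALSE]
}

# Apply per group, as in the split/lapply answers above
trimmed <- do.call(rbind, lapply(split(iris, iris$Species), trim_ends))
dim(trimmed)
```

The head(tail(x, -10), -10) version shown earlier also handles small groups without picking wrong rows.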
Here is a solution with dplyr.
In my example I cut only the first and last value of each group (to drop k values from each end, filter on n > k and n1 > k instead).
The idea is to group by id, then add the row number within each group counted from the top (n) and counted from the bottom (n1), and then simply filter on both.
library(dplyr)
data %>%
group_by(id) %>%
mutate(n=1:n(),
n1 = n():1) %>% # n and n1 are the row numbers
filter(n >= 2,n1 >= 2) %>% # use n > 10, n1 > 10 to drop the first and last 10
# filter() keeps only the rows that you want
select(-n, -n1) %>%
ungroup()
# # A tibble: 4 x 2
# id value
# <dbl> <int>
# 1 1 6
# 2 1 8
# 3 2 1
# 4 2 2
Data:
set.seed(123)
data <- data.frame(id = c(rep(1,4), rep(2,4)), value=sample(8))
data
# id value
# 1 1 3
# 2 1 6
# 3 1 8
# 4 1 5
# 5 2 4
# 6 2 1
# 7 2 2
# 8 2 7

Joining data in R by first row, then second and so on

I have two data sets with one common variable, ID (there are duplicate ID numbers in both data sets). I need to link dates to one data set, but I can't use a left join because the first (left) file needs to stay as it is: I don't want it to return all combinations and add rows. I also don't want it to link data like VLOOKUP in Excel, which finds the first match and returns it, so with duplicate ID numbers it only ever returns the first match. I need it to return the first match, then the second, then the third and so on (the dates are sorted so that the newest date is always first for every ID number), BUT I can't have added rows. Is there any way to do this? Since I don't know how else to show you, I have included an example picture of what I need. Not sure if I made myself clear, but thank you in advance!
You can add a second column to both data sets that numbers the rows within each ID. Then you can use an inner_join on both columns to join everything together.
Since you didn't provide example data sets, I created two to show the principle (see "data:" below).
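library(dplyr)
# number the rows within each ID in both data sets
df1 <- df1 %>%
group_by(ID) %>%
mutate(follow_id = row_number())
df2 <- df2 %>% group_by(ID) %>%
mutate(follow_id = row_number())
# join on the ID and the within-ID row number
outcome <- df1 %>% inner_join(df2, by = c("ID", "follow_id"))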
# A tibble: 7 x 3
# Groups: ID [?]
ID follow_id var1
<dbl> <int> <fct>
1 1 1 a
2 1 2 b
3 2 1 e
4 3 1 f
5 4 1 h
6 4 2 i
7 4 3 j
data:
df1 <- data.frame(ID = c(1, 1, 2,3,4,4,4))
df2 <- data.frame(ID = c(1,1,1,1,2,3,3,4,4,4,4),
var1 = letters[1:11])
You need a secondary id column. Since you need the first n matches, just group by the id, create an auto-increment id within each group, then join as usual:
df1<-data.frame(id=c(1,1,2,3,4,4,4))
d1=sample(seq(as.Date('1999/01/01'), as.Date('2012/01/01'), by="day"),11)
df2<-data.frame(id=c(1,1,1,1,2,3,3,4,4,4,4),d1,d2=d1+sample.int(50,11))
library(dplyr)
df11 <- df1 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
df21 <- df2 %>%
group_by(id) %>%
mutate(id2=1:n())%>%
ungroup()
left_join(df11,df21,by = c("id", "id2"))
# A tibble: 7 x 4
id id2 d1 d2
<dbl> <int> <date> <date>
1 1 1 2009-06-10 2009-06-13
2 1 2 2004-05-28 2004-07-11
3 2 1 2001-08-13 2001-09-06
4 3 1 2005-12-30 2006-01-19
5 4 1 2000-08-06 2000-08-17
6 4 2 2010-09-02 2010-09-10
7 4 3 2007-07-27 2007-09-05
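The same idea also works without dplyr. The following is a rough base R sketch using the small example data from the first answer: ave() with seq_along builds the within-ID counter, and merge() with all.x = TRUE plays the role of the left join. The column name k is made up for illustration.

```r
df1 <- data.frame(ID = c(1, 1, 2, 3, 4, 4, 4))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 3, 3, 4, 4, 4, 4),
                  var1 = letters[1:11], stringsAsFactors = FALSE)

# Within-ID row counter: 1, 2, ... restarting for every ID
df1$k <- ave(seq_along(df1$ID), df1$ID, FUN = seq_along)
df2$k <- ave(seq_along(df2$ID), df2$ID, FUN = seq_along)

# Match the first row of an ID to the first, the second to the second, ...
out <- merge(df1, df2, by = c("ID", "k"), all.x = TRUE)

# merge() may reorder rows, so sort explicitly by the keys
out <- out[order(out$ID, out$k), ]
out
```

Because each (ID, k) pair appears at most once in df2, the result has exactly as many rows as df1. If the original row order of df1 matters and is not already sorted by ID, you would need to carry an extra row index through the merge and sort on that instead.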
