I would like to mutate several variables at once using mutate_at(). Below is how I've been doing it up until now, but since I'm dealing with a long list of variables to recode/rename, I want to know how I can do this using mutate_at(). I want to keep the original columns, which is why I'm using mutate() rather than rename(). This is what I normally do:
df <- df %>%
mutate(q_50_a = as.numeric(`question_50_part_a: very long very long very long very long` == "yes"),
q_50_b = as.numeric(`question_50_part_b: very long very long very long very long` == "yes"),
q_50_c = as.numeric(`question_50_part_c: very long very long very long very long` == "yes"))
This is what I have so far:
df <- df %>% mutate_at(vars(starts_with("question_50")), funs(q_50 = as.numeric(. == "yes")))
It works and creates a new numeric variable, but I'm not sure how to get it to rename the new variables like this: q_50_a, q_50_b, q_50_c, etc.
Thank you.
edit: this is what the data looks like (except there are many many more columns which all look alike)
question_50_part_a: a very long title question_50_part_b: a very long title
yes yes
yes no
yes no
yes yes
no no
yes yes
but would like this:
q_50_a q_50_b
1 1
1 0
1 0
1 1
0 0
1 1
but I want to keep the original columns as they are and simply mutate these new columns with the shorter name and numeric binary coding.
We can use rename_at to rename the new columns.
library(dplyr)
df %>%
mutate_at(vars(starts_with('question_50')),
list(new = ~as.numeric(. == 'yes'))) %>%
rename_at(vars(ends_with('new')),
~sub('\\w+(_\\d+)_part(\\w+):.*', 'q\\1\\2', .))
# question_50_part_a: a very long title question_50_part_b: a very long title
#1 yes yes
#2 yes no
#3 yes no
#4 yes yes
#5 no no
#6 yes yes
# q_50_a q_50_b
#1 1 1
#2 1 0
#3 1 0
#4 1 1
#5 0 0
#6 1 1
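For what it's worth, mutate_at() and rename_at() are superseded in dplyr 1.0.0 and later. Here is a minimal sketch of the same idea with across() and rename_with(), assuming dplyr >= 1.0.0 is installed. The small df below is made-up example data with the same shape as the question's, and the "_new" suffix is just a temporary marker I chose so the new columns can be selected for renaming.

```r
library(dplyr)

# Made-up example data in the same shape as the question's
df <- data.frame(
  `question_50_part_a: a very long title` = c("yes", "yes", "no"),
  `question_50_part_b: a very long title` = c("yes", "no", "no"),
  check.names = FALSE
)

out <- df %>%
  # .names keeps the originals and adds "<old name>_new" columns
  mutate(across(starts_with("question_50"),
                ~ as.numeric(.x == "yes"),
                .names = "{.col}_new")) %>%
  # shorten only the newly created columns
  rename_with(~ sub("\\w+(_\\d+)_part(\\w+):.*", "q\\1\\2", .x),
              .cols = ends_with("_new"))

names(out)
```

The original long-named columns stay untouched, and the regular expression is the same one used in the rename_at() call above.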
Here is an approach that loops over each column:
column_names = colnames(df)
# optional filter out column names you don't want to change here
for(col in column_names){
# construct replacement name
col_replace = paste0("q_", substr(col, 10, 11), "_", substr(col, 18, 18))
# assign and drop old column
df = df %>%
mutate(!!sym(col_replace) := ifelse(!!sym(col) == "yes", 1, 0)) %>%
select(-!!sym(col))
}
Points to note:
If you have other columns you don't want changed, be sure to exclude them
The !!sym(col) construction takes the text string stored in col and turns it into a column name.
We use := rather than = because the LHS requires some evaluation before assignment can happen.
I have used ifelse instead of as.numeric but you can code the RHS of the equals sign as you please.
creating col_replace makes some assumptions about the format of your input names. If everything is the same length this should work. If the number of characters differ (e.g. Q_9_a and Q_10_a) then you may want to use a method based on strsplit instead.
The - sign in select makes it exclude the specified column
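As a rough sketch of the strsplit idea mentioned above: the helper below (make_short_name is a name I made up) builds the short name from the pieces of the column name instead of fixed character positions. It assumes every column name follows the pattern question_<number>_part_<letter>: <title>.

```r
# Hypothetical helper: build "q_<number>_<letter>" from a long column name.
# Assumes names look like "question_<number>_part_<letter>: <title>".
make_short_name <- function(col) {
  prefix <- strsplit(col, ":", fixed = TRUE)[[1]][1]   # text before the colon
  parts <- strsplit(prefix, "_", fixed = TRUE)[[1]]    # "question", number, "part", letter
  paste0("q_", parts[2], "_", parts[4])
}

long_names <- c("question_9_part_a: a very long title",
                "question_10_part_b: a very long title")
short_names <- vapply(long_names, make_short_name, character(1), USE.NAMES = FALSE)
short_names
```

Because the pieces are found by splitting on "_" rather than by position, one-digit and two-digit question numbers are handled the same way. In the loop above, col_replace = make_short_name(col) could replace the substr() line.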
I need to compare two rows next to each other in a column in a dataframe, if the data in both those rows matches, then save the most recent row, e.g.
# Animals
# 1 dog
# 2 cat
# 3 cat
It should compare dog and cat, then not save any data. So it won't save row 1 and 2.
But when it moves onto compare cat and cat, realise they are the same and save those rows. So save rows 2 and 3. As they are the same. There are several other columns but the animals column is the only one I need to use to decide whether the row is saved. However I want to keep all the data in the columns within the saved rows.
I need to do this for lots of rows, iterating through to compare a big set of data (~68,000)
I've tried to produce an if statement in which:
# results <- list()
#
# for (i in seq_len(nrow(data) - 1)) {
# if(isTRUE(data$Animals[i+1] == data$Animals[i])) {
# output <- print(data$Animals[i+1])
# results[[i+1]] <- output
# output <- print(data$Animals[i])
# results[[i]] <- output
# }
# }
I then converted this results list into a dataframe for further manipulation. However, this method only gives me the animal name; I would prefer it if the entire row was saved. I'm not too sure how to achieve this. I've been trying to edit the statement but I can't seem to get it working.
I'm new to R and learning, so please help any way you can. I'd appreciate it :)
To "prove" that we're saving the "most recent row", I'll add a row-number column. The data:
dat <- structure(list(Animals = c("dog", "cat", "cat"), row = 1:3), row.names = c(NA, -3L), class = "data.frame")
dat
# Animals row
# 1 dog 1
# 2 cat 2
# 3 cat 3
base R
dat[c(with(dat, Animals[-nrow(dat)] != Animals[-1]), TRUE),,drop=FALSE]
# Animals row
# 1 dog 1
# 3 cat 3
dplyr
library(dplyr)
dat %>%
filter(Animals != lead(Animals, default = ''))
# Animals row
# 1 dog 1
# 2 cat 3
The only caution I have with this is that if package loading is at all out of order, there exist both stats::filter and stats::lag, which behave completely differently from their dplyr namesakes. If you see odd results, try prepending dplyr:: to make sure it isn't a which-function-am-I-using problem.
dat %>%
dplyr::filter(Animals != dplyr::lead(Animals, default = ''))
We could use lead and filter
library(dplyr)
df %>%
mutate(helper = lead(animals)) %>%
filter(animals == helper) %>%
select(animals)
Output:
animals
<chr>
1 cat
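If the goal is instead to keep both rows of every matching adjacent pair along with all their other columns, as the question describes, a vectorised comparison avoids the loop entirely. This is only a sketch in base R; the extra "bird" row and the row column are made up so the result is easy to check.

```r
# Made-up data: rows 2 and 3 form the only matching adjacent pair
dat <- data.frame(Animals = c("dog", "cat", "cat", "bird"), row = 1:4)

n <- nrow(dat)
# TRUE where row i matches row i + 1 (one comparison per adjacent pair)
pair_match <- dat$Animals[-n] == dat$Animals[-1]

same_as_next <- c(pair_match, FALSE)   # the last row has no next row
same_as_prev <- c(FALSE, pair_match)   # the first row has no previous row

# Keep whole rows that match either neighbour
kept <- dat[same_as_next | same_as_prev, , drop = FALSE]
kept
```

Indexing the data frame with the logical vector keeps every column of the selected rows, which is the part the original if statement was missing.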
I want to "upgrade" my code by replacing mass-import using for loop with lapply function. After using lapply(list.files(), read.csv) I've got a list of dataframes. The problem is, the data is a bit messy and some things (like participant's sex) are mentioned only once, in one specific cell. It wasn't a problem when I used a for loop, as I could just refer to a specific cell. When I used:
for (x in list.files()) {
temp <- read.csv(x)
temp <- temp %>% slice(4:11) %>% select(form_2.index, form_2.response) %>% mutate(sex = temp[1,4])
#temp[1,4] is the one cell where the participant's sex is mentioned
database <- rbind(database, temp)
}
each temp variable looked like this:
form_2.index form_2.response sex$form.response
<dbl> <chr> <chr>
1 1 yes male
2 2 no male
3 3 no male
4 4 yes male
5 5 yes male
6 6 yes male
7 7 no male
8 8 no male
That's what I want. But how can I refer to a certain cell when using lapply? The following code doesn't work, as the temp variable is now a list:
temp <- lapply(list.files(), read_csv)
temp %>% lapply(slice, 4:11) %>% lapply(select, form_2.index, form_2.response) %>% lapply(mutate, plec = temp[1,4])
The slice and select functions work all right, the problem lies in the mutate part. Given it's a list, I need to point to a certain element of the list, not only column and row, but how can I do that? After all, I want it to be done in each element. Any ideas?
You can do :
library(dplyr)
temp <- lapply(list.files(), function(x) {
tmp <- readr::read_csv(x)
tmp %>%
slice(4:11) %>%
select(form_2.index, form_2.response) %>%
mutate(sex = tmp[1,4])
})
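To end up with one combined table like the database in the original loop, the list can be passed to bind_rows(). The sketch below is self-contained only because it first writes two made-up CSV files into a temporary folder; the file names, the id column and the file layout (sex in row 1, column 4) are assumptions for illustration. It uses base read.csv and tmp[[1, 4]], which extracts the cell as a plain value rather than a one-cell data frame.

```r
library(dplyr)

# Made-up stand-in files: 11 rows each, participant's sex in row 1, column 4
demo_dir <- file.path(tempdir(), "lapply_demo")
dir.create(demo_dir, showWarnings = FALSE)
for (id in 1:2) {
  demo <- data.frame(
    id = id,
    form_2.index = c(rep(NA, 3), 1:8),
    form_2.response = c(rep(NA, 3), rep(c("yes", "no"), 4)),
    form.response = c(if (id == 1) "male" else "female", rep(NA, 10))
  )
  write.csv(demo, file.path(demo_dir, paste0("p", id, ".csv")), row.names = FALSE)
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read and reshape each file inside the function, then stack the results
database <- bind_rows(lapply(files, function(x) {
  tmp <- read.csv(x, stringsAsFactors = FALSE)
  tmp %>%
    slice(4:11) %>%
    select(form_2.index, form_2.response) %>%
    mutate(sex = tmp[[1, 4]])   # single cell, recycled down the 8 rows
}))

table(database$sex)
```

The key point is the same as above: inside the anonymous function, tmp is a single data frame again, so a specific cell can be addressed just as in the for loop.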
I am trying to pull the cell values from the StudyID column to the empty cells SigmaID column, but I am running into an odd issue with the output.
This is how my data looks before running commands.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44
ST04238 0 50
LM04829 1 24 LM23921
ST91124 0 89
ST29001 0 55
I tried accomplishing this by writing the syntax in three ways, because I wasn't sure if there is a problem with the way the logic was set up. All three produce the same output.
df$SigmaID <- ifelse(test = df$SigmaID != "", yes = df$SigmaID, no = df$StudyID)
df$SigmaID <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)
df %>% mutate(SigmaID = ifelse(Gender == 0, df$StudyID, df$SigmaID))
Output: instead of pulling the values from the StudyID column, it is populating one- to four-digit numbers.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44 5
ST04238 0 50 4908
LM04829 1 24 LM23921
ST91124 0 89 209
ST29001 0 55 4092
I have tried recoding the empty spaces to NA and then calling on NA in the logic, but this produced the same output as seen above. I'm wondering if it could have anything to do with variable type or variable attributes and something's off about how it's reading the characters in StudyID. Would appreciate feedback on this issue!
Here is how to do it:
df$SigmaID[df$SigmaID == ""] = df$StudyID[df$SigmaID == ""]
df$SigmaID == "" is TRUE only for the rows where SigmaID is empty, so the assignment copies StudyID into just those rows.
I also recommend using data.table instead of data.frame. It is faster and has some useful syntax features:
library(data.table)
setDT(df) # setDT converts a data.frame to a data.table
df[SigmaID=="",SigmaID:=StudyID]
Following up on this! As it turns out, R versions before 4.0.0 convert strings into factors by default when creating or reading a data frame, so ifelse() was returning the factors' underlying integer codes instead of the IDs. There are a few ways of addressing the issue above.
i <- sapply(df, is.factor)
df[i] <- lapply(df[i], as.character)
Another method:
df <- read.csv("/insert file pathway here", stringsAsFactors = FALSE)
This is what I found to be helpful! I'm sure there are additional methods of troubleshooting this as well.
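Here is a small sketch that reproduces the symptom, assuming the ID columns came in as factors. The three-row data frame is made up, and stringsAsFactors = TRUE is set explicitly so it behaves the same on current R versions.

```r
# Made-up data with factor columns, as older R versions created by default
df <- data.frame(
  StudyID = c("LM24008", "ST04283", "ST04238"),
  SigmaID = c("LM24008", "", ""),
  stringsAsFactors = TRUE
)

# ifelse() drops the factor attributes, so only the integer level codes survive
bad <- ifelse(df$SigmaID == "", df$StudyID, df$SigmaID)
class(bad)

# Converting to character first gives the intended result
good <- ifelse(df$SigmaID == "",
               as.character(df$StudyID),
               as.character(df$SigmaID))
good
```

Converting the columns once, as in the lapply() line above, has the same effect and avoids repeating as.character() in every call.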
I have a column variable that I want to split into three factor variables. These are the factor variables I want to create:
goal<-c('newref', 'meow', 'woof')
area<-c('eco', 'social', 'bank')
fr<-c('demo', 'hist', 'util')
And the current variable looks more or less like that:
code<-c('goal\\\\meow', 'area\\\\bank', 'area\\\\bank', 'fr\\\\utilitarian', 'fr\\\\history')
And let's say the dataframe is something like that
df<-data.frame(var1=c(1,2,3,4,5), var2=c('a', 'b', 'c', 'd', 'e'), code=code)
So I would like to create 3 new columns, one per each factor variable, and use a regular expression that detected what it belongs to. So for example row number one should look as follows:
row1<-data.frame(var1=1, var2=c('a'), code=c('goal\\\\meow'), goal=2, area=NA, fr=NA)
Also note that the value of the factor variables is an abbreviation of the value in code (eg, history / hist).
The database is likely to have 10000 entries, so I would really appreciate any hints on this.
Thank you!
We can define a function that finds the position of the factor variable that, when used as a regular expression, finds a match in the code column:
find_match <- function(code, matches) {
apply(sapply(matches, grepl, code), 1, match, x = TRUE)
}
If there is no match, this function returns NA for that row.
Next, we can simply use mutate from dplyr to add each column of factors:
df %>% mutate(goal = find_match(code, goal),
area = find_match(code, area),
fr = find_match(code, fr))
Which gives:
var1 var2 code goal area fr
1 1 a goal\\\\meow 2 NA NA
2 2 b area\\\\bank NA 3 NA
3 3 c area\\\\bank NA 3 NA
4 4 d fr\\\\utilitarian NA NA 3
5 5 e fr\\\\history NA NA 2
Doing this with tidyverse tools like the pipe %>% and dplyr:
Separate breaks up the code column into two with the separator you specify.
Because "\" is a special character in regex, you have to escape each \ you want to look for with another \.
Spread converts it from tall form to wide form as you needed.
library(dplyr)
library(tidyr) # separate() and spread() come from tidyr
df %>%
separate(code, into = c("colName", "value"), sep = "\\\\\\\\") %>%
spread(colName, value)
I have a simple problem:
I have a column with thousands of values and I'm trying to convert it into a dichotomous variable (Yes|No). Replacing strings with 'No' was easy enough as the value I was converting was a single asterisk
Data$Complete <- gsub("\\*", "No", Data$Complete)
But when I attempt to replace everything apart from 'No', the following code replaces everything with 'Yes' in my column. I don't understand why it would, as I'm specifying to replace everything apart from "No":
Data$Complete <- Data[!Data$Complete %in% c("No"), "Complete"] <- "Yes"
Any pointers would be appreciated.
You can use a combination of the ifelse function and grepl to extract the necessary data, as below:
library(stringi)
# data simulation
set.seed(123)
n <- 1000
data <- data.frame(
complete = stri_rand_strings(n = n, length = 20, pattern = "[A-Za-z0-9\\*]")
)
# string matching
data$yes_no <- ifelse(grepl("\\*", data$complete), "No", "Yes")
head(data)
Output:
complete yes_no
1 HmOsw1WtXRxRfZ5tE1Jx Yes
2 tgdzehXaH8xtgn0TkCJD Yes
3 7PPM87DSFr1Qn6YC7ktM Yes
4 e4NGoRoonQkch*SCMbL6 No
5 EfPm5QztsA7eKeJAm4SV Yes
6 aJTxTtubO8vH2wi7XxZO Yes
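As for why the original attempt replaced everything: in a chained assignment like a <- b[i] <- "Yes", the inner assignment returns the value "Yes", and that single string is then assigned to the whole of Data$Complete. Assigning to the subset only avoids this. A minimal sketch, using a made-up four-row Data with plain character values:

```r
# Made-up data: a single asterisk marks an incomplete entry
Data <- data.frame(Complete = c("*", "abc", "*", "xyz"),
                   stringsAsFactors = FALSE)

# Step 1, as in the question: asterisks become "No"
Data$Complete <- gsub("\\*", "No", Data$Complete)

# Step 2: assign "Yes" to the non-"No" elements only (one assignment, no chaining)
Data$Complete[Data$Complete != "No"] <- "Yes"

Data$Complete
```

This only matches the question's setup if every incomplete entry is exactly a single asterisk; for values that merely contain an asterisk, the grepl() approach above is the safer choice.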