I have a dataframe as such:
sample = data.frame(
beer_brewerId = c("8481", "8481", "8481"),
rev_app = c("4/5","1/5", "2/5"),
beer_name = c("John Harvards Simcoe IPA", "John Harvards Simcoe IPA", "John Harvards American Brown Ale"),
review_taste =c("6/10", "7/10", "6/10"), stringsAsFactors = FALSE
)
str(sample)
I would like to convert only columns 2 and 4 from character to integer for analysis. Normally this would be easy with the following code if I wanted to convert every character column to numeric, but that does not work here because I want to keep column 3 as chr:
sample %>%
select(2,4) %>%
mutate_if(is.character, as.numeric)
You can easily accomplish this with base R:
#base approach
#numeric positions, not the strings "2" and "4", which would be looked up as column names
cols <- c(2, 4)
sample[cols] <- lapply(sample[cols], as.numeric)
Is there an easy way to do this using dplyr, ideally within a pipe sequence? If you select only certain columns using select(), you cannot save the results back into the original data frame.
Something like this would work, but as my dataset has 15+ columns, this seems like very cumbersome code:
cleandf <- sample %>%
#Use transform or mutate to convert each column manually
transform(rev_app = as.integer(rev_app)) %>%
transform(review_taste = as.integer(review_taste))
Is mutate_at or mutate_each meant to perform this task? Any help would be appreciated. Thanks.
#Maybe something like this:
cols <- c("2","4")
data %>%
mutate_each(is.character[cols], as.numeric)
The easiest way to accomplish this is to use mutate_at with the column indexes:
library(dplyr)
library(stringr)
sample <- sample %>%
#Do normal mutations on the data
mutate(rev_app = str_replace_all(rev_app, "/5", "")) %>%
mutate(review_taste = str_replace_all(review_taste, "/10", "")) %>%
#Now add this one-liner onto your chain
mutate_at(c(2,4), as.numeric) %>%
glimpse()
You can achieve this with the mutate_at function.
sample = data.frame(beer_brewerId = c("8481", "8481", "8481"),
rev_app = c("4/5","1/5", "2/5"),
beer_name = c("John Harvards Simcoe IPA", "John Harvards Simcoe IPA", "John Harvards American Brown Ale"),
review_taste =c("6/10", "7/10", "6/10"), stringsAsFactors = FALSE)
# evaluate each "a/b" string as a fraction, e.g. "4/5" becomes 0.8
clean <- function(foo) {
sapply(foo, function(x) eval(parse(text = x)), USE.NAMES = FALSE)
}
# you can replace c(2,4) by whatever columns you need
clean_sample <- sample %>%
mutate_at(c(2,4), clean)
Columns 2 and 4 are now numeric.
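In dplyr 1.0.0 and later, mutate_at is superseded by across(). Here is a minimal sketch of the same conversion with across(), assuming the sample data frame from the question and assuming that only the number before the slash should be kept. The sub() pattern is an illustrative choice, not something taken from the answers above.

```r
library(dplyr)

sample <- data.frame(
  beer_brewerId = c("8481", "8481", "8481"),
  rev_app = c("4/5", "1/5", "2/5"),
  beer_name = c("John Harvards Simcoe IPA", "John Harvards Simcoe IPA",
                "John Harvards American Brown Ale"),
  review_taste = c("6/10", "7/10", "6/10"),
  stringsAsFactors = FALSE
)

# Drop everything from the slash onward, then convert to integer.
# across(c(2, 4), ...) touches only columns 2 and 4; the others are left alone.
clean_sample <- sample %>%
  mutate(across(c(2, 4), ~ as.integer(sub("/.*$", "", .x))))

str(clean_sample)
```

Selecting by name, for example across(c(rev_app, review_taste), ...), would probably be more robust than positions if the column order can change.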
Related
This should be relatively simple, but I am new to R and the tidyverse. I have a data frame that is the result of five (!) combined CSVs and looks like this:
I want to combine the data to be only four rows, which would look like this:
Would you please help me create an if-else statement to fill in the NAs in the columns? For example:
If the "Division" value for an observation is 1, set "DivisionName" to "New England". Essentially, I want to combine Division (the number) with DivisionName, and RegionNumber with RegionName, to clean the data. Any insights would be appreciated. I believe this can be done with dplyr, perhaps using transmute and bind_rows. Thank you for helping someone who is learning to combine and rename multiple CSVs. Here's the code I have now:
library(tidyverse)
DS1 <- read.csv("./datafiles/Division_State-I.csv")
DS2 <- read.csv("./datafiles/Division_State-II.csv")
DS3 <- read.csv("./datafiles/Division_State-III.csv")
RD <- read.csv("./datafiles/Region_Division.csv")
Region <-read.csv("./datafiles/Region.csv")
DS123 <- bind_rows(DS1,DS2,DS3,RD,Region)
uniqueDS123 <- unique(DS123) %>%
rename("Division"="DivisionNumber", "FIPS"="StateFIPS", "State"="StateName")
You can use unite to combine columns.
library(tidyr)
uniqueDS123 %>%
unite(Division, Division, DivisionName, sep = "-", na.rm = TRUE) %>%
unite(Region, RegionNumber, RegionName, sep = "-", na.rm = TRUE)
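If the goal is to fill in the missing names rather than paste number and name together, one option is to fill within each Division group. The original tables are not shown, so the data frame below is entirely hypothetical: it assumes that binding the CSVs left some rows with a State but no DivisionName, and other rows with a DivisionName but no State.

```r
library(dplyr)
library(tidyr)

# Hypothetical rows standing in for the combined CSVs
combined <- data.frame(
  State = c("Maine", "Vermont", NA, "New York", NA),
  Division = c(1, 1, 1, 2, 2),
  DivisionName = c(NA, NA, "New England", NA, "Middle Atlantic"),
  stringsAsFactors = FALSE
)

filled <- combined %>%
  group_by(Division) %>%
  # copy the one known name to every row of the same Division
  fill(DivisionName, .direction = "downup") %>%
  ungroup() %>%
  # drop the rows that only carried the name
  filter(!is.na(State))

filled
```

The same pattern could be repeated for RegionNumber and RegionName. This avoids writing one if-else branch per division, at the cost of assuming each Division number maps to exactly one name.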
I have a dataframe that has one column which contains json data. I want to extract some attributes from this json data into named columns of the data frame.
Sample data
json_col = c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}')
id = c(1,2,3)
df <- data.frame(id, json_col)
I was able to achieve this using
library(tidyverse)
library(jsonlite)
extract_json_attr <- function(from, attr, default=NA) {
value <- from %>%
as.character() %>%
jsonlite::fromJSON(txt = .) %>%
.[attr]
return(ifelse(is.null(value[[1]]), default, value[[1]]))
}
df <- df %>%
rowwise() %>%
mutate(name = extract_json_attr(json_col, "name"),
points = extract_json_attr(json_col, "points", 0))
In this case, extract_json_attr has to parse the JSON column once for every attribute to be extracted.
Is there a better way to extract all attributes at one shot?
I tried this function to return multiple values as a list, but I am not able to use it with mutate to set multiple columns.
extract_multiple <- function(from, attributes){
values <- from %>%
as.character() %>%
jsonlite::fromJSON(txt = .) %>%
.[attributes]
return (values)
}
I am able to extract the desired values using this function
extract_multiple(df$json_col[1],c('name','points'))
extract_multiple(df$json_col[2],c('name','points'))
But I cannot apply this to set multiple columns in a single go. Is there a more efficient way to do this?
Here is one way using bind_rows from dplyr
dplyr::bind_rows(lapply(as.character(df$json_col), jsonlite::fromJSON))
# A tibble: 3 x 2
# name points
# <chr> <int>
#1 john NA
#2 doe 10
#3 jane 20
To keep only specific attributes, we can subset the parsed result inside the function:
bind_rows(lapply(as.character(df$json_col), function(x)
jsonlite::fromJSON(x)[c('name', 'points')]))
On the R4DS slack channel I received an alternative approach for handling json arrays as columns. Using that, I found another approach that seems to work better on larger datasets.
library(tidyverse)
library(jsonlite)
extract <- function(input, fields){
json_df <- fromJSON(txt=input)
missing <- setdiff(fields, names(json_df))
json_df[missing] <- NA
return (json_df %>% select(all_of(fields)))
}
df <- data.frame(id=c(1,2,3),
json_col=c('{"name":"john"}','{"name":"doe","points": 10}', '{"name":"jane", "points": 20}'),
stringsAsFactors=FALSE)
df %>%
mutate(json_col = paste0('[',json_col,']'),
json_col = map(json_col, function(x) extract(input=x, fields=c('name', 'points')))) %>%
unnest(cols=c(json_col))
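Another way to parse each JSON string only once and still keep the other columns is to store the parsed result in a list column and widen it with tidyr::unnest_wider(). This is a minimal sketch, assuming tidyr 1.0.0 or later and the same sample data as above. The column name parsed is made up for the example.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(jsonlite)

df <- data.frame(
  id = c(1, 2, 3),
  json_col = c('{"name":"john"}', '{"name":"doe","points": 10}',
               '{"name":"jane", "points": 20}'),
  stringsAsFactors = FALSE
)

result <- df %>%
  # parse every row exactly once into a list column
  mutate(parsed = map(json_col, fromJSON)) %>%
  # one new column per attribute; attributes absent from a row become NA
  unnest_wider(parsed)

result
```

Unlike the extract() function above, this keeps every attribute found in the JSON, so a select() afterwards would be needed to restrict the result to particular fields.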
I want to turn a table into a data frame with three columns: (1) the zip code, (2) the count for outcome "0", and (3) the count for outcome "1". But as.data.frame.matrix turns the zip codes into row names, which makes them unusable as a column.
I tried to add a fourth column with made-up IDs (1:100) so that R would use those as row names instead, but R tells me that "all arguments must be the same length", which they are!
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
abs <- table(df$zip, df$outcome)
abs <- as.data.frame.matrix(abs)
Does anyone have a nice, slick idea? Thanks in advance!
Edit:
When I run:
abs <- as.matrix(as.data.frame(abs))
I get something close to what I want, but the outcomes end up together in one column. How can I split them apart so the result looks like the table again?
You can get to your desired result more easily with dplyr and tidyr:
library(dplyr)
library(tidyr)
id <- 1:5000
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)
df <- data.frame(id, outcome, zip)
df <- df %>% group_by(zip, outcome) %>%
summarise(freq = n()) %>%
ungroup() %>%
spread(outcome, freq)
You are supplying only 100 values to a data frame that has 101 rows (100:200 contains 101 distinct values).
> nrow(abs)
[1] 101
So this would work:
abs$new_col <- 1:101
I think you want this:
abs2 <- as.data.frame(abs) %>% select(2,3,1)
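To keep the as.data.frame.matrix approach and still have the zip code as a regular column, the row names can be copied into a column and then dropped. This is a base-R sketch. The seed and the names tab and wide are arbitrary choices for the example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
zip <- sample(100:200, 5000, replace = TRUE)
outcome <- rbinom(5000, 1, 0.23)

tab <- table(zip, outcome)

# The wide layout keeps one column per outcome, with zip codes as row names
wide <- as.data.frame.matrix(tab)

# Copy the row names into a proper column, then reset the row names
wide <- data.frame(zip = as.integer(rownames(wide)), wide, check.names = FALSE)
rownames(wide) <- NULL

head(wide)
```

check.names = FALSE keeps the outcome columns named "0" and "1". Without it, data.frame() would rename them to X0 and X1.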
I am trying to read a set of tab separated files into a matrix or data.frame. For each file I need to extract one column and then concatenate all the columns into a single matrix keeping both column and row names.
I am using tidyverse (and I am terrible at that). I successfully get column names but I miss row names at the very last stage of processing.
library("purrr")
library("tibble")
samples <- c("a","b","c","d")
a <- samples %>%
purrr::map_chr(~ file.path(getwd(), TARGET_FOLDER, paste(., "tsv", sep = "."))) %>%
purrr::map(safely(~ read.table(., row.names = 1, skip = 4))) %>%
purrr::set_names(samples) %>%
purrr::transpose()
is_ok <- a$error %>% purrr::map_lgl(is_null)
x <- a$result[is_ok] %>%
purrr::map(~ {
v <- .[,1]
names(v) <- rownames(.)
v
}) %>% as_tibble(rownames = NA)
The x data frame has the correct colnames but lacks rownames. All the elements of the a list have the same rownames in exactly the same order. I am aware of tricks like rownames(x) <- rownames(a$result[[1]]), but I am looking for a more consistent solution.
It turned out that the solution was easier than expected. Using as.data.frame instead of the final as_tibble solved it.
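A small illustration of why this works: when a list of named vectors is converted with as.data.frame, the names of the vectors become the row names. The list below is a hypothetical stand-in for a$result[is_ok] after one column has been extracted per file. The names gene1 and gene2 are made up.

```r
# Hypothetical stand-in for the per-file named vectors
cols <- list(
  a = c(gene1 = 1.5, gene2 = 2.5),
  b = c(gene1 = 3.0, gene2 = 4.0)
)

x <- as.data.frame(cols)

rownames(x)  # taken from the names of the vectors
colnames(x)  # taken from the names of the list
```

This relies on every vector carrying the same names in the same order, which the question states is the case.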
I'd like to do what I think is a very simple operation -- adding a column with a number for each person to a dataset with a list of (potentially) duplicative names. I think that I am close. This code looks at a dataset of names, does pairwise comparisons, and appends a column whether there is a likely match. Now I just want to go one step further -- instead of dropping duplicates, I want to come up with a unique identifier.
Peter
Example:
Peter
Peter
Peter
Connor
Matt
would become
Example:
Peter -- 1
Peter -- 1
Peter -- 1
Connor -- 2
Matt -- 3
library(RecordLinkage)
data(RLdata10000)
rpairs <- compare.dedup(RLdata10000, blockfld = 5)
p=epiWeights(rpairs)
classify <- epiClassify(p,0.7)
summary(classify)
match <- classify$prediction
results <- cbind(classify$pairs,match)
Here is a small rewrite that avoids having to tune the weights and classifier with the IDs included:
df_names <- data.frame(Name=c("Peter","Peter","Peter","Connor","Matt"))
df_names %>% compare.dedup() %>%
epiWeights() %>%
epiClassify(0.3) %>%
getPairs(show = "links", single.rows = TRUE) -> matches
left_join(mutate(df_names,ID = 1:nrow(df_names)),
select(matches,id1,id2) %>% arrange(id1) %>% filter(!duplicated(id2)),
by=c("ID"="id2")) %>%
mutate(ID = ifelse(is.na(id1), ID, id1) ) %>%
select(-id1)
I figured out the answer to my own question.
df_names <- df_names %>% mutate(ID = 1:nrow(df_names))
rpairs <- compare.dedup(df_names)
p=epiWeights(rpairs)
classify <- epiClassify(p,0.83)
summary(classify)
matches <- getPairs(classify, show = "links", single.rows = TRUE)
# this code writes an "ID" column that is the same for similar names
matches <- matches %>% arrange(ID.1) %>% filter(!duplicated(ID.2))
df_names$ID_prior <- df_names$ID
# merge matching information with the original data
df_names <- left_join(df_names, matches %>% select(ID.1,ID.2), by=c("ID"="ID.2"))
# replace matches in ID with the thing they match with from ID.1
df_names$ID <- ifelse(is.na(df_names$ID.1), df_names$ID, df_names$ID.1)
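For comparison, if names only need to match exactly (no fuzzy matching), the identifier from the question's example can be produced without RecordLinkage. This base-R sketch numbers names in order of first appearance. It is not a substitute for the probabilistic linkage above when names contain typos or variants.

```r
df_names <- data.frame(
  Name = c("Peter", "Peter", "Peter", "Connor", "Matt"),
  stringsAsFactors = FALSE
)

# Position of each name in the vector of distinct names
df_names$ID <- match(df_names$Name, unique(df_names$Name))

df_names
```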