If I read in a Stata or SAS dataset with labels using haven, it will be (at least in haven 0.2.0) read with the following format:
library(dplyr)
df1 <- data_frame(fips = structure(c(1001, 1001, 1001, 1001, 1001),
label = "FIPS (numeric)"),
id = structure(letters[1:5], label = "ID"))
df2 <- data_frame(fips = structure(c(1001, 1003, 1005, 1007, 1009),
label = "FIPS (numeric)"),
state = structure("AL", label = "State Abbreviation"))
(If necessary, I can post some Stata data that produces this, but this should be easy to verify using any labeled Stata/SAS dataset.)
When I try to use any of the dplyr join functions to join on a labeled column, I am sorely disappointed:
df1 %>% inner_join(df2)
returns the error
Error in eval(expr, envir, enclos) : cannot join on columns 'fips' x
'fips': Can't join on 'fips' x 'fips' because of incompatible types
(numeric / numeric)
The only way to avoid it seems to be to remove the labels on the join variables:
df1 %>%
mutate(fips = `attr<-`(fips, 'label', NULL)) %>%
inner_join(df2 %>% mutate(fips = `attr<-`(fips, 'label', NULL)))
which raises the question of why the labels were read in the first place. (The join also obliterates the labels in df2.)
This would seem to be a bug in the way haven and dplyr interact. Is there a better solution?
Try converting the columns into a character string. This seems to work
df1$fips<-as.character(df1$fips)
df2$fips<-as.character(df2$fips)
df1 %>% inner_join(df2)
The help page for inner_join does state: "a character vector of variables to join by"
When dplyr joins on a variable that is a factor in one dataset and a character in the other it sends out a warning but completes the join. numeric and character vectors are not compatible classes so it errors out. By converting them both to character the join works fine
library(dplyr)
df1 %>%
mutate(fips = as.character(fips)) %>%
inner_join(
df2 %>%
mutate(fips = as.character)
)
This was fixed at some point, and works in dplyr 0.7.4. I can't track down the exact version where it was fixed.
Related
I am performing Data Analysis and cleaning in R using tidyverse.
I have a Data Frame with 23 columns containing values 'NO','STEADY','UP' and 'down'.
I want to change all the values in these 23 columns to 0 in case of 'NO','STEADY' and 1 in other case.
What i did is, i created a list by name keys in which i have kept all my columns, After that i am using for loop, ifelse statements and mutate.
Please have a look at the code below
# Column names are kept in the list by name keys
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
After that, i used following code to get the desired result :
for (col in keys){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1)) }
I was expecting that, it will do the changes that i require, but nothing happens after this. (NO ERROR MESSAGE AND NO DESIRED RESULT)
After that, i researched further and executed following code
for (col in keys){
print(col)}
It gives me elements of list as characters like - "metformin"
So, i thought - may be this is the issue. Hence, i used the below code to caste the keys as symbols :
keys_new = sym(keys)
After that i again ran the same code:
for (col in keys_new){
Dataset = Dataset %>%
mutate(col = ifelse(col %in% c('No','Steady'),0,1))}
It gives me following Error -
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
After all this. I also tried to create a function to get the desired results, but that too didn't worked:
change = function(name){
Dataset = Dataset %>%
mutate(name = ifelse(name %in% c('No','Steady'),0,1),
name = as.factor(name))
return(Dataset)}
for (col in keys){
change(col)}
This didn't perform any action. (NO ERROR MESSAGE AND NO DESIRED RESULT)
When keys_new is placed in this code:
for (col in keys_new){
change(col)}
I got the same Error :
Error in match(x, table, nomatch = 0L) :
'match' requires vector arguments
PLEASE GUIDE
There's no need to loop or keep track of column names. You can use mutate_all -
Dataset %>%
mutate_all(~ifelse(. %in% c('No','Steady'), 0, 1))
Another way, thanks to Rui Barradas -
Dataset %>%
mutate_all(~as.integer(!. %in% c('No','Steady')))
There's a simpler way using mutate_at and case_when.
Dataset %>% mutate_at(keys, ~case_when(. %in% c("NO", "STEADY") ~ 0, TRUE ~ 1))
mutate_at will only mutate the columns specified in the keys variable. case_when then lets you replace one value by another by some condition.
This answer for using mutate through forloop.
I don't have your data, so i tried to make my own data, i changed the keys into a tibble using enframe then spread it into columns and used the row number as a value for each column, then check if the value is higher than 10 or not.
To use the column name in mutate you have to use !! and := in the mutate function
df <- enframe(c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
) %>% spread(key = value,value = name)
keys = c('metformin', 'repaglinide', 'nateglinide', 'chlorpropamide', 'glimepiride',
'glipizide', 'glyburide', 'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol',
'insulin', 'glyburide-metformin', 'tolazamide', 'metformin-pioglitazone',
'metformin-rosiglitazone', 'glimepiride-pioglitazone', 'glipizide-metformin',
'troglitazone', 'tolbutamide', 'acetohexamide')
for (col in keys){
df = df %>%
mutate(!!as.character(col) := ifelse( df[col] > 10,0,100) )
}
I have two datasets, I'm trying to join together. the column i am joining by does not exactly match up with each other. first file the column looks like this: 00:01:54:2145 etc. 00: for every single observation. I want to change all the observations in this column to be in this format: 01/54/2145.
I have tried several things with string package, but can't get it to work.
df1 <- df %>%
str_replace_all("00:")
I'm getting this error, but don't think that's the only problem:
argument is not an atomic vector; coercing
Thank you
library(stringr)
library(dplyr)
my_conversion <- Vectorize(function(str) {
str_replace(str, "^00:", "") %>%
str_replace_all(":", "/")
})
df <- data.frame(
a_column = 1:3, key_column = c("00:01:54:2145", "00:01:54:2145", "00:01:54:2145"))
df %>% mutate(key_column = my_conversion(key_column))
I'm trying to do analysis from multiple csv files, and in order to create a key that can be used for left_join I think that I need to try and merge two columns. At present I'm trying to use the tidyverse packages (inc. mutate), but I'm running into an issue as the two columns to merge have different formatting: 1 is a double and the other is in date format. I'm using the following code
qlik2 <- qlik %>%
separate('Admit DateTime', into = c('Admit Date', 'Admit Time'), sep = 10) %>%
mutate(key = MRN + `Admit Date`)
and getting tis output error:
Error in mutate_impl(.data, dots) :
Evaluation error: non-numeric argument to binary operator.
If there's another way around this (or if the error is actually related to something else), then I'd appreciate any thoughts on the matter. Equally, if people know of a way to left_join with multiple keys, then that would work as well.
Thanks,
Cal
Hard without a reproducible example. But if i understand your question you either want a numeric key, or trying to concatinate a string with the plus +.
Numeric key
library(hablar)
qlik2 <- qlik %>%
separate('Admit DateTime',
into = c('Admit Date', 'Admit Time'),
sep = 10) %>%
convert(num(MRN, `Admit Date`)) %>%
mutate(key = MRN + `Admit Date`)
String key
qlik2 <- qlik %>%
separate('Admit DateTime',
into = c('Admit Date', 'Admit Time'),
sep = 10) %>%
mutate(key = paste(MRN, `Admit Date`))
First of all I am sorry if my formatting is bad, this is my first time posting, (also new to programming & R)
I am trying to merge two data frames together on string variables. I am merging university names, which might not match up perfectly, so I was hoping to merge using a fuzzy or approximate string matching function. I was happy when I found the ‘fuzzyjoin’ package.
from cranR:
stringdist_join: Join two tables based on fuzzy string matching of their columns
stringdist_join(x, y, by = NULL, max_dist = 2, method = c("osa", "lv",
"dl", "hamming", "lcs", "qgram", "cosine", "jaccard", "jw","soundex"), mode = "inner", ignore_case = FALSE, distance_col = NULL, ...)
my code:
stringdist_left_join(new, institutions, by = c("tm_9_undergradu" = "Institution.Name"))
Error:
Error in dists[include] <- stringdist::stringdist(v1[include], v2[include], :
NAs are not allowed in subscripted assignments
I know that there are some NA's in these columns, but I am not sure how I could remove them as I need them there as well. I know it other join & merge functions the NA's will simply be ignored. Does anyone know a way to get around this error for this package or to do an approximate join on strings another way. Thank you for your help.
This answer worked for me and is from GitHub
Step 1: figure out which Df has the NAs
`which(is.na(df1))
which(is.na(df2))`
Step 2: replace NAs with something else.
df1[is.na(df1)] <- "empty_string"
Step 3: run the join (the code I was working with when I got the error)
`test1 <- msa_table %>%
as_tibble() %>%
unlist() %>%
mutate(msa = sub("\\(.*)","", as.character(msa)) %>%
stringdist_full_join(msa_table, df1, by = 'msa', max_dist = 2)`
The result for me was not having the same error, but still having NAs in my tables.
Hope this helps! Also, to be clear: this solution came from Anton Prokopyev '#prokopyev' on GitHub.
Try
`test1 <- msa_table %>%
as_tibble() %>%
unlist() %>%
mutate(msa = stringr::str_squish(msa)) %>%
stringdist_full_join(msa_table, df1, by = 'msa', max_dist = 2)`
Short version: when executing the following command qtm(World, "amount") I get the following error message:
Error in $<-.data.frame(*tmp*, "SHAPE_AREAS", value =
c(653989.801201595, : replacement has 177 rows, data has 175
Disclaimer: this is the same problem I used to have in this question, but if I'm not wrong, in that one the problem was that I had one variable on the left dataframe that matched to several variables on the right one, and hence, I needed to group variables on right dataframe. In this case, I am pretty sure that I do not have the same problem, as can be seen from the code below:
library(tmap)
library(tidyr)
# Read tmap's world map.
data("World")
# Load my dataframe.
df = read.csv("https://gist.githubusercontent.com/ccamara/ad106eda807f710a6f331084ea091513/raw/dc9b51bfc73f09610f199a5a3267621874606aec/tmap.sample.dataframe.csv",
na = "")
# Compare the countries in df that do not match with World's
# SpatialPolygons.
df$iso_a3 %in% World$iso_a3
# Return rows which do not match
selected.countries = df$iso_a3[!df$iso_a3 %in% World$iso_a3]
df.f = filter(df, !(iso_a3 %in% selected.countries))
# Verification.
df.f$iso_a3[!df.f$iso_a3 %in% World$iso_a3]
World#data = World#data %>%
left_join(df.f, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
qtm(World, "amount")
My guess is that the clue may be the fact that the column I am using when joining both dataframes has different levels (hence it is converted to string), but I'm ashamed to admit that I still don't understand the error that I am having here. I'm assuming I have something wrong with my dataframe, although I have to admit that it didn't work even with a smaller dataframe:
selected.countries2 = c("USA", "FRA", "ITA", "ESP")
df.f2 = filter(df, iso_a3 %in% selected.countries2)
df.f2$iso_a3 = droplevels(df.f2$iso_a3)
World#data = World#data %>%
left_join(df.f2, by = "iso_a3") %>%
mutate(iso_a3 = as.factor(iso_a3)) %>%
filter(complete.cases(iso_a3))
World$iso_a3 = droplevels(World$iso_a3)
qtm(World, "amount")
Can anyone help me pointing out what's causing this error (providing an solution may also be much appreaciated)
Edited: It is again your data
table(df$iso_a3)