remove double quotes from factors in a dataframe - r

I got a dataframe to work on where I have a bunch of variables as factors in quotation marks like ""x1"".
str(df) gives me something like this:
$ x : Factor w/ 10 Levels "\"\"x1\"\"",..: 1 7 9 ...
I tried to get rid of the quotation marks with the gsub() function, but that didn't work, probably because I don't know what to use as the pattern. It would be great if somebody could solve this puzzle and maybe explain to me whether the "\"\"x1\"\"" is the solution to this.
An example for the dataframe would look like this:
structure(list(Sent = structure(c(2L, 2L, 2L, 2L, 2L), .Label = c("\"\"Opted out\"\"",
"\"\"Yes\"\""), class = "factor"), Responded = structure(c(2L,
2L, 2L, 2L, 2L), .Label = c("\"\"Complete\"\"", "\"\"No\"\"",
"\"\"Partial\"\""), class = "factor")), row.names = c(NA, -5L
), class = c("tbl_df", "tbl", "data.frame"), .Names = c("Sent",
"Responded"))
Thanks in advance!

vec = c('""x1""', '""x2""', '""x3""')
vec = factor(vec)
levels(vec) <- gsub('["\\]', "", levels(vec))
#> vec
#[1] x1 x2 x3
#Levels: x1 x2 x3
Note that I use ' as the string delimiter when I want to use " inside a string.
Another reason it probably didn't work for you is that you applied gsub() to the factor variable itself rather than to its levels attribute.
Factors are stored internally as integer codes (1, 2, 3, ...) that index into the levels, so the text to clean lives in levels(vec).
As you now have provided data, you can use: (df1 being your data with the factor columns)
df1[] <- lapply(df1, function(vec){ levels(vec) <- gsub('["\\]',"",levels(vec)); vec})
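As a minimal sketch of the same idea on data shaped like the question's, the block below builds a small stand-in data frame (a plain data.frame with made-up rows, not the original tibble), shows the integer codes behind a factor, and cleans the levels. Using fixed = TRUE is my own variation: it treats the pattern as a literal string, so no regex escaping is needed.

```r
# Small stand-in for the data in the question (assumed rows, base data.frame)
df1 <- data.frame(
  Sent      = factor(c('""Yes""', '""Yes""', '""Opted out""')),
  Responded = factor(c('""No""', '""Complete""', '""Partial""'))
)

# A factor is a vector of integer codes plus a "levels" attribute
as.integer(df1$Sent)
levels(df1$Sent)

# Strip every double quote from the levels of each factor column;
# fixed = TRUE matches the literal character instead of a regex
df1[] <- lapply(df1, function(vec) {
  levels(vec) <- gsub('"', "", levels(vec), fixed = TRUE)
  vec
})

levels(df1$Sent)
levels(df1$Responded)
```

The columns stay factors; only their level labels change.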

Related

How can I apply case_when(mapply (adist, x, y) <= 3 ~ x, TRUE ~ y)) to columns of different length and order

Hi, I have been trying for a while to match two large columns of names, several of which have different spellings. So far I have written some code to practice on a smaller dataset:
examples%>% mutate(new_ID = case_when(mapply (adist, example_1 , example_2) <= 3 ~ example_1, TRUE ~ example_2))
This creates a new column containing the name from example_1 if it is within an edit distance of 3. However, it does not give the name from example_2 when that criterion is not met, which I need it to do.
This code also only compares values in the same row of each column, whereas I need it to work on a dataset with two columns of different lengths (one is larger, so they can't be put in the same order).
It also needs to skip the NAs in the smaller column of names (they are only there to pad it to the same length as the other one).
Anyone know how to do something like this?
dput(head(examples))
structure(list(. = structure(c(4L, 3L, 2L, 1L, 5L), .Label = c("grarryfieldsred","harroldfrankknight", "sandramaymeres", "sheilaovensnew", "terrifrank"), class = "factor"), example_2 = structure(c(4L, 2L, 3L, 1L,
5L), .Label = c(" grarryfieldsred", "candramymars", "haroldfranrinight",
"sheilowansknew", "terryfrenk"), class = "factor")), row.names = c(NA,
5L), class = "data.frame")
The problem is that your columns have become factors rather than character vectors. When you try to combine two columns together with different factor levels, unexpected results can happen.
First convert your columns to character:
library(dplyr)
examples %>%
  mutate(across(contains("example"), as.character)) %>%
  mutate(new_ID = case_when(mapply(adist, example_1, example_2) <= 3 ~ example_1,
                            TRUE ~ example_2))
# example_1 example_2 new_ID
#1 sheilaovensnew sheilowansknew sheilowansknew
#2 sandramaymeres candramymars candramymars
#3 harroldfrankknight haroldfranrinight harroldfrankknight
#4 grarryfieldsred grarryfieldsred grarryfieldsred
#5 terrifrank terryfrenk terrifrank
In your dput output, somehow the name of example_1 was changed. I ran this first:
names(examples)[1] <- "example_1"
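The code above compares names row by row. For the other part of the question (columns of different length and order, with NA padding), one possible approach is to let adist() compute the full distance matrix between the two vectors and pick the closest candidate for each name. This is only a sketch under assumptions: the shortened example_2 vector, the threshold of 3, and the rule "use the closest example_2 name if it is within the threshold, otherwise keep the example_1 name" are my own reading of the requirement.

```r
example_1 <- c("sheilaovensnew", "sandramaymeres", "harroldfrankknight",
               "grarryfieldsred", "terrifrank")
# Shorter column in a different order, padded with NA (assumed layout)
example_2 <- c("terryfrenk", "haroldfranrinight", "grarryfieldsred", NA, NA)

# Drop the NA padding so it is never offered as a match
candidates <- example_2[!is.na(example_2)]

# Rows correspond to example_1, columns to candidates
d <- adist(example_1, candidates)

# Index and distance of the closest candidate for each example_1 name
best      <- apply(d, 1, which.min)
best_dist <- d[cbind(seq_along(example_1), best)]

# Take the closest candidate when it is within 3 edits, else keep the original
new_ID <- ifelse(best_dist <= 3, candidates[best], example_1)
data.frame(example_1, new_ID)
```

For large columns the distance matrix has length(example_1) * length(candidates) entries, so memory use may become the limiting factor.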

Can I use %in% to search and match two columns?

I have a large dataframe and a vector of terms of interest to pull out. For a previous project I was using:
a=data[data$rn %in% y, "Gene"]
to pull out information into a new vector. Now I have another job I'd like to do.
I have a large dataframe of 15 columns and >100000 rows. I want to search columns 3 and 9 for the content of the vector and output the matching rows as a new dataframe.
To make this extra annoying, the hit could be in v3 and not in v9, and vice versa.
Working example
I have stripped the dataframe down to 3 columns and a few rows.
data <- structure(list(Gene = structure(c(1L, 5L, 3L, 2L, 4L), .Label = c("ibp","leuA", "pLeuDn_02", "repA", "repA1"), class = "factor"), LocusTag = structure(c(1L,2L, 5L, 3L, 4L), .Label = c("pBPS1_01", "pBPS1_02", "pleuBTgp4","pleuBTgp5", "pLeuDn_02"), class = "factor"), hit = structure(c(2L,4L, 3L, 1L, 5L), .Label = c("2-isopropylmalate synthase", "Ibp protein","ORF1", "repA1 protein", "replication-associated protein"), class = "factor")), .Names = c("Gene","LocusTag", "hit"), row.names = c(NA, 5L), class = "data.frame")
y <- c("ibp", "orf1")
First of all, R is case sensitive, so your example will not pick up the third row, though I guess you want it extracted. You would have to change your y to
y <- c("ibp", "ORF1")
From your example I am not sure this is exactly what you want to achieve, but R has the operator | for element-wise "or", so you could try something like:
new.data<-data[data$Gene %in% y|data$hit %in% y,]
If you only want to extract certain columns of your data set, you can specify them after the comma, e.g.:
new.data<-data[data$Gene %in% y|data$hit %in% y, c("LocusTag","Gene")]
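If editing y by hand is not practical, a case-insensitive variant is possible by lower-casing both sides before calling %in%. The sketch below is an assumption about what is wanted, not part of the original answer; it rebuilds only the Gene and hit columns from the example data.

```r
# Two of the example columns, rebuilt as factors like in the question
data <- data.frame(
  Gene = c("ibp", "repA1", "pLeuDn_02", "leuA", "repA"),
  hit  = c("Ibp protein", "repA1 protein", "ORF1",
           "2-isopropylmalate synthase", "replication-associated protein"),
  stringsAsFactors = TRUE
)
y <- c("ibp", "orf1")

# Compare lower-cased text so "ORF1" matches "orf1";
# as.character() makes the factor-to-text conversion explicit
gene_lc <- tolower(as.character(data$Gene))
hit_lc  <- tolower(as.character(data$hit))

# Keep a row if either column contains one of the terms
keep <- gene_lc %in% tolower(y) | hit_lc %in% tolower(y)
new.data <- data[keep, ]
new.data
```

Note that %in% tests for whole-value equality, so "ibp" matches the Gene value "ibp" but not the hit value "Ibp protein".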

Extract the labels attribute from "labeled" tibble columns from a haven import from Stata

Hadley Wickham's haven package, applied to a Stata file, returns a tibble with many columns of type "labeled". You can see these with str(), e.g.:
$ MSACMSZ :Class 'labelled' atomic [1:8491861] NA NA NA NA NA NA NA NA NA NA ...
.. ..- attr(*, "label")= chr "metropolitan area size (cmsa/msa)"
.. ..- attr(*, "labels")= Named int [1:7] 0 1 2 3 4 5 6
.. .. ..- attr(*, "names")= chr [1:7] "not identified or nonmetropolitan" "100,000 - 249,999" "250,000 - 499,999" "500,000 - 999,999" ...
It would be nice if I could simply extract all these labeled vectors to factors, but I have compared the length of the labels attribute to the number of unique values in each vector, and it is sometimes longer and sometimes shorter. So I think I need to look at all of them and decide how to handle each one individually.
So I would like to extract the values of the labels attribute to a list. However, this function:
labels93 <- lapply(cps_00093.df, function(x){attr(X, which="labels", exact=TRUE)})
returns NULL for all variables.
Is this a tibble vs data frame problem? How do I extract these attributes from the tibble columns into a list?
Note that the labels vector is named, and I need both the labels and the names.
As per @Hack-R's request, here is a tiny snippet of my data as converted by dput (which I had never used before). I applied this code:
filter(cps_00093.df, YEAR==2015) %>%
sample_n(10) %>%
select(HHTENURE, HHINTYPE) -> tiny
dput(tiny, file = "tiny")
to produce the file tiny. Hey! That was easy! I thought it would be hard to break off a piece this small.
Opening tiny with Notepad++, this is what I found:
structure(list(HHTENURE = structure(c(2L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 1L, 1L), labels = structure(c(0L, 1L, 2L, 3L, 6L, 7L), .Names = c("niu",
"owned or being bought", "rented for cash", "occupied without payment of cash rent",
"refused", "don't know")), class = "labelled"), HHINTYPE = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), labels = structure(1:3, .Names = c("interview",
"type a non-interview", "type b/c non-interview")), class = "labelled")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"), .Names = c("HHTENURE",
"HHINTYPE"))
I suspect this could be made more readable with a little spacing, but I did not want to muck with it for fear of accidentally destroying relevant information.
The original question asks how 'to extract the values of the labels attribute to a list.' A solution to the main question follows (assuming some_df is imported via haven and has label attributes). Update: I've now added a way to extract a label vector with the package sjlabelled.
library(purrr)
n <- ncol(some_df)
labels_list <- map(1:n, function(x) attr(some_df[[x]], "label") )
# if a vector of character strings is preferable
labels_vector <- map_chr(1:n, function(x) attr(some_df[[x]], "label") )
# to make a simple codebook
library(knitr)  # kable() is provided by the knitr package
variable_name <- names(some_df)
data.frame(variable_name, description = labels_vector) %>%
kable(format = 'markdown')
# UPDATE: another approach with package sjlabelled
library(sjlabelled)
sjlabelled::get_label(some_df)
I'm going to take a go at answering this one, though my code isn't very pretty.
First I make a function to extract a named attribute from a single column.
ColAttr <- function(x, attrC, ifIsNull) {
# Returns column attribute named in attrC, if present, else ifIsNull.
atr <- attr(x, attrC, exact = TRUE)
atr <- if (is.null(atr)) {ifIsNull} else {atr}
atr
}
Then a function to lapply it to all the columns:
AtribLst <- function(df, attrC, isNullC){
# Returns list of values of the col attribute attrC, if present, else isNullC
lapply(df, ColAttr, attrC=attrC, ifIsNull=isNullC)
}
Finally I run it for each attribute.
stub93 <- AtribLst(cps_00093.df, attrC="label", isNullC=NA)
labels93 <- AtribLst(cps_00093.df, attrC="labels", isNullC=NA)
labels93 <- labels93[!is.na(labels93)]
All the columns have a "label" attribute, but only some are of type "labeled" and so have a "labels" attribute. The labels attribute is named, where the labels match values of the data and the names tell you what those values signify.
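To illustrate the difference on something small, here is a sketch that mimics two labelled columns with plain base R attributes (no haven import; the rows and the shortened label sets are made up, loosely following the dput output in the question) and pulls out the named "labels" vectors with lapply().

```r
# Stand-in for a haven import: integer columns carrying a "labels" attribute
tiny <- data.frame(HHTENURE = c(2L, 1L, 1L), HHINTYPE = c(1L, 1L, 1L))
attr(tiny$HHTENURE, "labels") <- c("niu" = 0L,
                                   "owned or being bought" = 1L,
                                   "rented for cash" = 2L)
attr(tiny$HHINTYPE, "labels") <- c("interview" = 1L,
                                   "type a non-interview" = 2L)

# One named vector per column: values are the codes, names are their meanings
labels_list <- lapply(tiny, function(col) attr(col, "labels", exact = TRUE))
labels_list$HHTENURE

# When the codes in a column are all covered by its labels,
# it can be turned into a factor directly
labs <- labels_list$HHTENURE
tenure <- factor(tiny$HHTENURE, levels = labs, labels = names(labs))
as.character(tenure)
```

Columns without a "labels" attribute would come back as NULL in labels_list, which is what the ifIsNull argument above is meant to handle.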
Jumping off @omar-waslow's answer above, but adding the use of attr_getter.
If the data (some_df) is imported using read_dta in the haven package, then each column in the tibble has an attr called "label". So we split up the dataframe, going column by column. This creates a two column dataframe which can be joined back (after pivot_longer, for example).
library(tidyverse)
label_lookup_map <- tibble(
col_name = some_df %>% names(),
labels = some_df %>% map_chr(attr_getter("label"))
)

Error in setting up and cleaning a dataframe R

I am attempting to generate out-of-sample predictions and am getting this message after running the following code: Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied.
I checked the str to verify that the two variables I am using are both numeric and they appear to be. I did a bunch of hunting around on here and think this might be somewhat related, but I haven't been able to get the suggestions to work.
Here is the code I have so far.
library(foreign)
library(plyr)
library(rvest)
library(stringi)
library(purrr)
library(XLConnect)
library(splitstackshape)
library(tidyr)
library(dplyr)
donner_raw <- read.csv("donner.txt", sep="\t", header = FALSE)
colnames(donner_raw) <- c("age_gen", "survive")
donner_raw <- separate(donner_raw, age_gen, into = c("age", "gender"), "(?<=\\d)(?=[A-Za-z])")
logit <- glm(survive ~ age + dummygen,family=binomial(link='logit'),data=donner_raw)
newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))
ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)
I'm not sure where in the process of what I've done I messed up. Here is a reproducible part of the data.
dput(droplevels(head(donner_raw)))
structure(list(age = structure(c(6L, 4L, 5L, 3L, 2L, 1L), .Label = c("13", "3", "4", "45", "6", "60"), class = "factor"), gender = c("M", "F", "F", "F", "F", "F"), dummygen = structure(c(2L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor")), .Names = c("age", "gender", "survive", "dummygen"), row.names = c(NA, 6L), class = "data.frame")
Let's simply read and think about the error message:
Error: variable 'dummygen' was fitted with type "numeric" but type "factor" was supplied
This error occurs after the line:
ooslogit <- predict.glm(logit, newlogit, se.fit=TRUE)
(Presumably, at least, because your question isn't very clear about this and provides lots of code that doesn't seem related.)
So R is telling you that when the model was fit the variable dummygen was numeric, but now you've given it a factor.
So let's look:
str(newlogit)
'data.frame': 20 obs. of 2 variables:
$ age : num 1 1.26 1.53 1.79 2.05 ...
$ dummygen: Factor w/ 1 level "0": 1 1 1 1 1 1 1 1 1 1 ...
Yep!
So your problem was that you inexplicably created the data frame newlogit by specifying:
newlogit <- data.frame(age=seq(1,6, length=20), dummygen=("0"))
which clearly specifies that the variable dummygen is not going to be numeric. Just convert it back, or remove the quotes in the first place. For example:
newlogit <- data.frame(age=seq(1,6, length=20), dummygen= 0)
or
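# go through as.character(): as.numeric() on a factor returns its level codes (here 1, not 0)
newlogit$dummygen <- as.numeric(as.character(newlogit$dummygen))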
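To see the check in action without the original file, here is a small sketch with made-up training data (the twelve rows below are invented for illustration and are not the Donner data). It fits the same kind of model, reproduces the type error with a factor column, and then predicts with a numeric one. Note that from R 4.0.0 onwards data.frame() no longer turns strings into factors by default, so there the quoted "0" would be reported as type "character" instead.

```r
# Made-up training data: numeric age, numeric 0/1 dummy, binary outcome
train <- data.frame(
  age      = c(23, 40, 40, 30, 28, 40, 45, 62, 65, 45, 25, 28),
  dummygen = c(1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1),
  survive  = c(0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1)
)
logit <- glm(survive ~ age + dummygen, family = binomial(link = "logit"),
             data = train)

# New data with dummygen as a factor: predict() refuses the mismatched type
bad_new <- data.frame(age = seq(1, 6, length.out = 20),
                      dummygen = factor("0"))
msg <- tryCatch(predict(logit, newdata = bad_new, se.fit = TRUE),
                error = function(e) conditionMessage(e))
msg

# New data with dummygen numeric, matching the type used when fitting
newlogit <- data.frame(age = seq(1, 6, length.out = 20), dummygen = 0)
sapply(newlogit, class)
ooslogit <- predict(logit, newdata = newlogit, se.fit = TRUE)
length(ooslogit$fit)
```

Comparing sapply(newlogit, class) against the classes of the training columns is a quick way to spot this kind of mismatch before calling predict().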

Converting factors into numeric format with signs in R

Suppose I have a dataframe (df) like this, where every element is a factor:
df
---
+100.5
+120.2
-30.0
+75.0
-600.3
How can I convert df into a numeric df using R? I will be very glad for any help. Thanks a lot.
Calling as.numeric() directly on a factor returns its internal integer codes rather than the values shown, so you need to convert the factor to character first and then to numeric.
This should work:
df_n <- as.data.frame(as.numeric(as.character(df[,1])))
colnames(df_n) <- "df_n"
head(df_n)
# df_n
#1 100.5
#2 120.2
#3 -30.0
#4 75.0
#5 -600.3
class(df_n[,1])
#[1] "numeric"
data
df <- structure(list(df = structure(c(4L, 5L, 2L, 3L, 1L),
.Label = c("-600.3", "-30", "75", "100.5", "120.2"),
class = "factor")), .Names = "df",
row.names = c(NA, -5L), class = "data.frame")
Hope this helps.
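A short sketch of why the detour through character is needed, using the signed values from the question as the factor levels (the variable names are mine):

```r
# Factor whose labels are signed numbers, as in the question
f <- factor(c("+100.5", "+120.2", "-30.0", "+75.0", "-600.3"))

# Direct conversion returns the internal level codes, not the values
codes <- as.numeric(f)
codes

# Converting to character first parses the labels, sign included
vals <- as.numeric(as.character(f))
vals

# Equivalent alternative: convert the levels once, then index by the codes
vals2 <- as.numeric(levels(f))[f]
```

The second form only parses each distinct level once, which can matter for long vectors with few distinct values.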
