R help converting a non-numeric column to numeric

I'm trying to help my friend, a Director of Sales, make sense of his logged call data. He is interested in one column in particular, "Disposition". This column holds string values, and I'm trying to convert them to numeric codes (e.g. "Not Answered" becomes 1, "Answered" becomes 2, etc.) and remove any row with no value entered. I've created data frames, used as.numeric, and created and deleted columns/rows, all to no avail. I'm just trying to run simple R code to give him some insight. Any and all help is much appreciated. Thanks in advance!
P.S. I'm unsure as to whether I should provide some code due to the fact that there is a lot of delicate information (personal phone numbers and emails).

First off: You should always provide representative sample data; if your data is sensitive in nature, provide mock-up data.
That aside, to recode a character vector as numeric you could convert to factor and then use as.numeric. For example:
# Sample data
column <- c("Not Answered", "Answered", "Something else", "Others")
# Convert character vector to factor
column <- factor(column, levels = unique(column))
# Convert to numeric
as.numeric(column)
#[1] 1 2 3 4
The numbering can be adjusted by changing the order of the factor levels.
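For instance, to make "Answered" map to 1 instead, list it first when setting the levels (a small sketch reusing the sample data above):
# "Answered" listed first, so it becomes level 1
column2 <- c("Not Answered", "Answered", "Something else", "Others")
column2 <- factor(column2, levels = c("Answered", "Not Answered", "Something else", "Others"))
as.numeric(column2)
#[1] 2 1 3 4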

Alternatively, you can create a new column and fill it with the numeric values using an ifelse statement. To illustrate, let's assume this is your dataframe:
df <- data.frame(
  Disposition = rep(c("answer", "no answer", "whatever", NA), 3),
  Anything = rnorm(12)
)
df
   Disposition    Anything
1       answer  2.54721951
2    no answer  1.07409803
3     whatever  0.60482744
4         <NA>  2.08405038
5       answer  0.31799860
6    no answer -1.17558239
7     whatever  0.94206106
8         <NA>  0.45355501
9       answer  0.01787330
10   no answer -0.07629330
11    whatever  0.83109679
12        <NA> -0.06937357
Now you define a new column, say df$Analysis, and assign to it numbers based on the information in df$Disposition:
df$Analysis <- ifelse(df$Disposition == "no answer", 1,
                ifelse(df$Disposition == "answer", 2, 3))
df
   Disposition    Anything Analysis
1       answer  2.54721951        2
2    no answer  1.07409803        1
3     whatever  0.60482744        3
4         <NA>  2.08405038       NA
5       answer  0.31799860        2
6    no answer -1.17558239        1
7     whatever  0.94206106        3
8         <NA>  0.45355501       NA
9       answer  0.01787330        2
10   no answer -0.07629330        1
11    whatever  0.83109679        3
12        <NA> -0.06937357       NA
The advantage of this method is that you keep the original information unchanged. If you now want to remove NA values from the dataframe, use na.omit. NB: this removes not only the rows where df$Disposition is NA but any row with an NA in any column:
df_clean <- na.omit(df)
df_clean
   Disposition   Anything Analysis
1       answer  2.5472195        2
2    no answer  1.0740980        1
3     whatever  0.6048274        3
5       answer  0.3179986        2
6    no answer -1.1755824        1
7     whatever  0.9420611        3
9       answer  0.0178733        2
10   no answer -0.0762933        1
11    whatever  0.8310968        3
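If instead you only want to drop the rows where Disposition itself is missing, keeping rows with NAs in other columns, you can subset on that single column (a minimal sketch):
# Keep rows whose Disposition is not NA
df_clean <- df[!is.na(df$Disposition), ]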

Related

R subset dataframe column with variable when column name is escaped

I am trying to select a column from a dataframe using a variable as the column name, with the complication that the column name has to be escaped with backticks. I have a couple of workarounds, but they involve changing my code more than I'd like, and anyway, having looked around, I am curious whether anybody knows a solution for this kind of weird case.
My dataset is actually a list of time series (which I construct after some operations); this is a toy example.
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
> df
$`01/19/17`
[1] 1 2 3 4 5 6 7 8 9 10
$`01/20/17`
[1] 2 3 4 5 6 7 8 9 10 11
I don't put the backtick escapes in the column names because I want to; they come in as dates from the process I follow to construct the dataset.
If I know the column name I can access like this,
df$`01/19/17`
If I want to use a variable, from looking around I see I could rewrite it to something like this,
`$`(df, `01/19/17`)
But I cannot assign a variable like this,
> name1 <- `01/19/17`
Error: object '01/19/17' not found
and if assign it this other way I get a NULL,
> name1 <- "01/19/17"
> `$`(df, name1)
NULL
As I say there are workarounds like e.g. changing all the column names in the list of series, but I just would like to know. Thank you so much.
You can access with brackets rather than with $, even when the key is a string:
df <- list(`01/19/17`=seq(1,10), `01/20/17`=seq(2,11))
name1 <- "01/19/17"
df[[name1]]
# [1] 1 2 3 4 5 6 7 8 9 10
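The reason $ fails with a variable is that $ does not evaluate its argument: df$name1 looks for an element literally called "name1" (and returns NULL when there is none), while [[ evaluates name1 and uses its value. A quick check:
df$name1         # NULL: no element is literally named "name1"
df[["01/19/17"]] # the literal string works with [[ as well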

Counting the number of individual letters in a string in R

I have the following string of characters:
pig <- c("A", "B", "C", "D", "AB", "ABC", "AB", "AA", "CD", "CA", NA)
I am trying to get R to tell me how many of each letter there are in total and how many NAs there are. Thus, in this case I would like the result to look like this:
print(cow)
A B C D NA
7 4 4 2 1
I have tried table in combination with strsplit but cannot figure out exactly how to do it. Any thoughts? Thanks!
You would need to use NULL (or the empty string "") as the split value in strsplit(), then unlist the result. In table(), use the useNA argument to include NA values: with "ifany", NA is shown in the table only if there are any NA values, and is omitted from the result otherwise.
table(unlist(strsplit(pig, NULL)), useNA = "ifany")
#
# A B C D <NA>
# 7 4 4 2 1
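If you would rather leave the NA out of the count, drop the useNA argument; table() excludes NA by default:
table(unlist(strsplit(pig, "")))
#
# A B C D
# 7 4 4 2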

R - remove rows from a data frame with empty lines (not only numbers)

This issue seems to have been treated before, but after checking I couldn't find a solution. I load a table from a file, and somehow (I don't know how) some entire lines are empty. So when I get the data frame I have
#   id c1 c2
# 1  a  1  2
# 2  b  2  4
# 3    NA NA
# 4  d  6  1
# 5  e  7  5
# 6    NA NA
if I do
apply(df, 1, function(x) all(is.na(x)))
I get all FALSE, because the first column contains an empty string rather than NA (the real table is much bigger, with mixed character and numeric columns), so I can't filter these lines out. I also cannot sort it out with na.omit or complete.cases.
Is there any function or expression to check for empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv:
For instance, if the blanks are empty strings or a single space, you could use
df <- read.csv(<your other logic here>, na.strings = c("NA", "", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, then you can use the apply logic to work with the offending rows.
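Once the blanks come in as NA, the row check you wrote will work. A sketch, assuming your data frame is called df:
# TRUE for rows in which every column is NA
empty_rows <- apply(df, 1, function(x) all(is.na(x)))
df_clean <- df[!empty_rows, ]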

Find out if column in R table includes duplicate values?

I've got a lovely dataframe, my very first, and I'm starting to get the hang of R. One thing I haven't been able to find is a test for duplicate values. I have one column that I'm pretty sure is all unique values, but I don't know that.
Is there a way I can ask? For simplicity, let's pretend this is my data:
  var1 var2 var3
1    1    A    1
2    2    B    3
3    3    C   NA
4    4    D   NA
5    5    E    4
and I want to know whether var1 ever repeats.
Check out the duplicated function:
duplicated(dat$var1) # TRUE for each element of dat$var1 that repeats an earlier value
See ?duplicated for the documentation.
You should also look at the unique function.
To remove duplicate rows based on one column:
my_data[!duplicated(my_data$Col_id), ] # ! negates: keep rows whose Col_id has not appeared before
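To answer the question directly, i.e. whether var1 ever repeats at all, you can wrap duplicated in any(), or use anyDuplicated(), which returns 0 when there are no duplicates:
any(duplicated(dat$var1)) # TRUE if var1 contains any repeated value
anyDuplicated(dat$var1)   # index of the first duplicate, or 0 if none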

filtering large data sets to exclude an identical element across all columns

I am a relatively new R user, and most complex code (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl), and I have only used R for very simple manipulations in the past (loading data from a file, subsetting, ANOVA/t-tests). However, I am working on a project where I had no control over the data layout, and the data file is very large.
In my data, I have 172 rows, one per Participant in a survey, and 158 columns, each representing a question number. The answers for each are 1-5. The raw data uses the number "99" to indicate that a question was not answered. I need to exclude any answer where a Participant did not respond, without excluding the entire participant.
Part Q001 Q002 Q003 Q004
   1    2    4   99    2
   2    3   99    1    3
   3    4    4    2    5
   4   99    1    3    2
   5    1    3    4    2
In the past I have used the subset feature to filter my data
data.filter <- subset(data, Q001 != 99)
Which works fine when I am working with sets where all my answers are contained in one column. Then this would just delete the whole row where the answer was not available.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data so that the large data set ends up with 'blanks' wherever a "99" occurred, so that these 99s do not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Resp Q001 Q002 Q003 Q004
   1    2    4         2
   2    3         1    3
   3    4    4    2    5
   4         1    3    2
   5    1    3    4    2
Is this possible to do in R? I've tried to filter the 99s out before loading the data into R, but R won't read the file in when it has blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time consuming if there is a better approach or package to use).
Any assistance would be greatly appreciated!
You could replace the "99" by NA and then calculate the colMeans omitting NAs:
# Toy data: a 4 x 20 matrix of answers, with 99 marking non-answers
df <- replicate(20, sample(c(1, 2, 3, 99), 4))
colMeans(df) # wrong: the 99s inflate the means
dfc <- df
dfc[dfc == 99] <- NA # recode 99 as missing
colMeans(dfc, na.rm = TRUE) # per-column means, ignoring NAs
You can also indicate which values are NAs when you read your data in. For your particular case:
mydata <- read.table('dat_base', na.strings = "99")
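From there, the per-question means can be computed in one go. A sketch, assuming the file has a header row (add header = TRUE to read.table) and the first column holds the Participant id:
# Mean answer per question, ignoring unanswered (NA) entries;
# drop the first column since it is the Participant id
sapply(mydata[, -1], mean, na.rm = TRUE)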
