How to ignore a "Level" in R? - r

Not sure how to do the following. Please refer to the picture in the link below:
https://i.stack.imgur.com/Kx79x.png
I have some blank spaces, and they are the missing values. I do not want this level to be read. I want R to ignore this level. I want to write a regression so that this empty category is not part of the model.
The data was read from a csv file. The variable is "I", "II"...."IV", but there is an extra "" factor because of missing data. I want R to ignore this factor. My question is how?

You can do the following:
df <- data.frame(letters=letters[1:5], numbers=c(1,2,3,"",5), stringsAsFactors = TRUE) # my data frame; since R 4.0 strings are only read as factors if you ask for it
# letters numbers
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d
# 5 e 5
levels(df$numbers)
# "" "1" "2" "3" "5"
subdf <- subset(df, numbers != "") # data subset
subdf$numbers <- factor(subdf$numbers)
levels(subdf$numbers)
# "1" "2" "3" "5"

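Change the "" values to missing (NA):
# generate sample data
df <- data.frame(x = sample(c("","I","II","III"), 100, replace = TRUE), stringsAsFactors = TRUE)
Option 1 (the now unused "" level stays in levels(df$x) until you call droplevels()):
df[df$x == "", 'x'] <- NA
Option 2 (rebuilds the factor, so the "" level is gone):
df$x <- factor(ifelse(df$x == "", NA, as.character(df$x)))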
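If the blank level comes from reading the CSV, it can also be avoided at import time. The following is a minimal sketch, assuming a made-up file with columns named score and stage (the names and the data are hypothetical). Passing na.strings = "" asks read.csv to treat empty fields as NA, so no "" level is created, and lm() drops the incomplete rows by default.

```r
# Hypothetical CSV contents standing in for the asker's file
csv_text <- "score,stage
10,I
12,II
15,III
11,
14,IV
9,I
13,II
16,III
18,IV"

# Empty fields become NA instead of a "" factor level
dat <- read.csv(text = csv_text, na.strings = "", stringsAsFactors = TRUE)
levels(dat$stage)   # "I" "II" "III" "IV"

# Rows with NA are omitted by the default na.action, so the model
# only has terms for the real categories
fit <- lm(score ~ stage, data = dat)
names(coef(fit))
```

With a real file, the same arguments would go to read.csv("yourfile.csv", ...), where the file name is a placeholder.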

Related

Change a char value in a data column into zero?

I have a simple problem in that I have a very long data frame which reports 0 as a char "nothing" in the data frame column. How would I replace all of these to a numeric 0. A sample data frame is below
Group
Candy
A
5
B
nothing
And this is what I want to change it into
Group
Candy
A
5
B
0
Keeping in mind my actual dataset is 100s of rows long.
My own attempt was to use is.na but apparently it only works for NA and can convert those into zeros with ease but wasn't sure if there's a solution for actual character datatypes.
Thanks
The best way is to read the data in right, not with "nothing" for missing values. This can be done with argument na.strings of functions read.table or read.csv. Then change the NA's to zero.
The following function is probably slow for large data.frames but replaces the "nothing" values by zeros.
nothing_zero <- function(x){
tc <- textConnection("nothing", "w")
sink(tc) # divert output to tc connection
print(x) # print in string "nothing" instead of console
sink() # set the output back to console
close(tc) # close connection
tc <- textConnection(nothing, "r")
y <- read.table(tc, na.strings = "nothing", header = TRUE)
close(tc) # close connection
y[is.na(y)] <- 0
y
}
nothing_zero(df1)
# Group Candy
#1 A 5
#2 B 0
The main advantage is to read numeric data as numeric.
str(nothing_zero(df1))
#'data.frame': 2 obs. of 2 variables:
# $ Group: chr "A" "B"
# $ Candy: num 5 0
Data
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE)
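The na.strings route mentioned at the start of this answer can be sketched as follows, assuming the data can be read again from its source (here the same inline text is reused in place of a file):

```r
# Read "nothing" as NA at import time, so Candy is numeric from the start
df1 <- read.table(text = "
Group Candy
A 5
B nothing", header = TRUE, na.strings = "nothing")

# Then replace the missing values with zero
df1$Candy[is.na(df1$Candy)] <- 0
str(df1)
```

This avoids the round trip through print() and a text connection.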
sapply(df,function(x) {x <- gsub("nothing",0,x)})
Output
a
[1,] "0"
[2,] "5"
[3,] "6"
[4,] "0"
Data
df <- structure(list(a = c("nothing", "5", "6", "nothing")),
class = "data.frame",
row.names = c(NA,-4L))
Another option
df[] <- lapply(df, gsub, pattern = "nothing", replacement = "0", fixed = TRUE)
If you are only wanting to apply to one column
library(tidyverse)
df$a <- str_replace(df$a,"nothing","0")
Or applying to one column in base R
df$a <- gsub("nothing","0",df$a)
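Note that gsub and str_replace return character vectors, so the column still holds the text "0" rather than the number 0. A small sketch of the extra conversion step, using the same sample column a as above:

```r
df <- data.frame(a = c("nothing", "5", "6", "nothing"),
                 stringsAsFactors = FALSE)

# Replace the placeholder, then convert the whole column to numeric
df$a[df$a == "nothing"] <- "0"
df$a <- as.numeric(df$a)

str(df$a)
sum(df$a)   # arithmetic now works on the column
```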

Why are empty levels in my factor tabulated after I assign NAs to missing values?

I have a dataframe df with a column foo containing data of type factor:
df <- data.frame("bar" = c(1:4), "foo" = c("M", "F", "F", "M"))
When I inspect the structure with str(df$foo), I get this:
Factor w/ 3 levels "","F",..: 2 2 2 2 2 2 2 2 2 2 ..
Why does it report 3 levels when there are only 2 in my data?
Edit:
There seems to be a missing value "" that I clean up by assigning it NA.
When I call table(df$foo), it seems to still count the "missing value" level, but finds no occurrences:
F M
0 2 2
However, when I call df$foo I find it reports only two levels:
Levels: F M
How is it possible that table still counts the empty level, and how can I fix that behaviour?
Check whether your data frame really has no missing values, because it looks as though it does contain some. Try this:
# works because factor-levels are integers, internally; "" seems to be level 1
which(as.integer(df$MF) == 1)
# works if your missing value is just ""
which(df$MF == "")
You should then clean up your data frame to properly reflect missing values. A factor will handle NA:
df <- data.frame("rest" = c(1:5), "sex" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
df$sex[which(as.integer(df$sex) == 1)] <- NA
Once you have cleaned your data, you will have to drop unused levels to avoid tabulations such as table counting occurrences of the empty level.
Observe this sequence of steps and its outputs:
# Build a dataframe to reproduce your behaviour
> df <- data.frame("Restaurant" = c(1:5), "MF" = c("M", "F", "F", "M", ""), stringsAsFactors = TRUE)
# notice the empty level "" for the missing value
> levels(df$MF)
[1] "" "F" "M"
# notice how a tabulation counts the empty level;
# this is the first column with a 1 (it has no label because
# there is no label, it is "")
> table(df$MF)
F M
1 2 2
# find the culprit and change it to NA
> df$MF[which(as.integer(df$MF) == 1)] <- as.factor(NA)
# AHA! So despite us changing the value, the original factor
# was not updated! I wonder what happens if we tabulate the column...
> levels(df$MF)
[1] "" "F" "M"
# Indeed, the empty level is present in the factor, but there are
# no occurrences!
> table(df$MF)
F M
0 2 2
# droplevels to the rescue:
# it is used to drop unused levels from a factor or, more commonly,
# from factors in a data frame.
> df$MF <- droplevels(df$MF)
# factors fixed
> levels(df$MF)
[1] "F" "M"
# tabulation fixed
> table(df$MF)
F M
2 2
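The same clean-up can be wrapped in a small function and applied to every factor column at once. This is only a sketch: blank_to_na is a hypothetical helper name, not a base R function.

```r
# Hypothetical helper: turn "" into NA in every factor column,
# then drop the level that is no longer used
blank_to_na <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) {
      col[which(col == "")] <- NA
      col <- droplevels(col)
    }
    col
  })
  df
}

df <- data.frame(Restaurant = 1:5,
                 MF = c("M", "F", "F", "M", ""),
                 stringsAsFactors = TRUE)
clean <- blank_to_na(df)
levels(clean$MF)                   # "F" "M"
table(clean$MF, useNA = "ifany")   # the blank row is now counted under NA
```

Non-factor columns are returned unchanged.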

Change values from categorical to nominal in R

I want to change all the values in categorical columns by rank. Rank can be decided using the index of the sorted unique elements in the column.
For instance,
> data[1:5,1]
[1] "B2" "C4" "C5" "C1" "B5"
then I want these entries in the column replacing categorical values
> data[1:5,1]
[1] "1" "4" "5" "3" "2"
Another column:
> data[1:5,3]
[1] "Verified" "Source Verified" "Not Verified" "Source Verified" "Source Verified"
Then the updated column:
> data[1:5,3]
[1] "3" "2" "1" "2" "2"
I used this code for this task but it is taking a lot of time.
for(i in 1:ncol(data)){
if(is.character(data[,i])){
temp <- sort(unique(data[,i]))
for(j in 1:nrow(data)){
for(k in 1:length(temp)){
if(data[j,i] == temp[k]){
data[j,i] <- k}
}
}
}
}
Please suggest me the efficient way to do this, if possible.
Thanks.
Here is a solution in base R. I create a helper function that converts each column to a factor using its sorted unique values as levels. This is similar to what you did, except that I use as.integer to get the ranking values.
rank_fac <- function(col1)
as.integer(factor(col1,levels = sort(unique(col1))))
Some data example:
dx <- data.frame(
col1= c("B2" ,"C4" ,"C5", "C1", "B5"),
col2=c("Verified" , "Source Verified", "Not Verified" , "Source Verified", "Source Verified")
)
Apply it without a for loop. It is better to use lapply here to avoid side effects.
data.frame(lapply(dx,rank_fac))
Results:
# col1 col2
# [1,] 1 3
# [2,] 4 2
# [3,] 5 1
# [4,] 3 2
# [5,] 2 2
using data.table syntax-sugar
library(data.table)
setDT(dx)[,lapply(.SD,rank_fac)]
# col1 col2
# 1: 1 3
# 2: 4 2
# 3: 5 1
# 4: 3 2
# 5: 2 2
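Simpler solution, if the columns are already factors (as.integer then returns the level codes; on character columns it returns NA with a warning):
Using only as.integer :
setDT(dx)[,lapply(.SD,as.integer)]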
Using match:
# df is your data.frame
df[] <- lapply(df, function(x) match(x, sort(unique(x))))
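As a quick check, the match approach and a plain factor conversion should give the same ranks, because factor() sorts the unique values by default. A sketch using the sample values from the question (the data frame name dx is reused here only for illustration):

```r
dx <- data.frame(
  col1 = c("B2", "C4", "C5", "C1", "B5"),
  col2 = c("Verified", "Source Verified", "Not Verified",
           "Source Verified", "Source Verified"),
  stringsAsFactors = FALSE
)

# Rank = position of each value among the sorted unique values
by_match  <- lapply(dx, function(x) match(x, sort(unique(x))))
# Same idea via the integer codes of a factor
by_factor <- lapply(dx, function(x) as.integer(factor(x)))

identical(by_match, by_factor)
by_match$col1   # 1 4 5 3 2
```

Both versions are vectorised, so they avoid the three nested loops in the question.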

Comparing multiple rows and creating a matrix in R or in Excel

I have a file containing, multiple rows as follows
In file1:
a 8|2|3|4 4
b 2|3|5|6|7 5
c 8|5|6|7|9 5
a to a has 4 overlaps, similarly a to b had 2 overlaps, so to check the overlaps between various entity, I need to generate a matrix with the above details, and the output should be a matrix like
a b c
a 4 2 1
b 2 5 3
c 1 3 5
Please give me a suggestion, how to do this? Is there any way to do this using excel or using a shell script or using R? I have written this following code but since I am not a good coder, I couldn't get the output printed in a right format.
setwd('C:\\Users\\Desktop\\')
newmet1<-file("file.txt")
newmet2<-strsplit(readLines(newmet1),"\t")
Newmet<-sapply(newmet2, function(x) x[2:length(x)], simplify=F )
for (i in 1:length(Newmet))
{
for (j in 1:length(Newmet))
{
c <- intersect(Newmet[[i]], Newmet[[j]])
print(length(c))
}
}
Edited: Thanks for all the answers.. I got the matrix using both excel and R with the help of following answers.
Here is a function in R that returns the number of matches between each pair of columns as a new matrix.
First we get your data into an R data.frame object:
A <- c(8,2,3,4,NA)
B <- c(2,3,5,6,7)
C <- c(8,5,6,7,9)
dataset <- data.frame(A,B,C)
Then we create a function:
count_matches <- function (x) {
if (is.data.frame(x)) {
y <- NULL
for (i in 1:dim(x)[2]) {
for (j in 1:dim(x)[2]) {
count <- sum(x[[i]][!is.na(x[i])] %in% x[[j]][!is.na(x[j])])
y <- c(y, count)
}
}
y <- matrix(y, dim(x)[2], )
colnames(y) <- names(x)
rownames(y) <- names(x)
return(y)
} else {
print('Argument must be a data.frame')
}
}
We test the function on our dataset:
count_matches(dataset)
Which returns a matrix:
A B C
A 4 2 1
B 2 5 3
C 1 3 5
If the numbers are in separate cells starting in Sheet1!A1, try
=SUM(--ISNUMBER(MATCH(Sheet1!$A1:$E1,INDEX(Sheet1!$A$1:$E$3,COLUMN(),0),0)))
starting at Sheet2!A1.
Must be entered as an array formula using Ctrl+Shift+Enter.
Alternative formula that doesn't have to start at Sheet2!A1:
=SUM(--ISNUMBER(MATCH(Sheet1!$A1:$E1,INDEX(Sheet1!$A$1:$E$3,COLUMNS($A:A),0),0)))
Using R:
# dummy data
df1 <- read.table(text = "a 8|2|3|4 4
b 2|3|5|6|7 5
c 8|5|6|7|9 5", as.is = TRUE)
df1
# V1 V2 V3
# 1 a 8|2|3|4 4
# 2 b 2|3|5|6|7 5
# 3 c 8|5|6|7|9 5
# convert 2nd column to a splitted list
myList <- unlist(lapply(df1$V2, strsplit, split = "|", fixed = TRUE), recursive = FALSE)
names(myList) <- df1$V1
myList
# $a
# [1] "8" "2" "3" "4"
# $b
# [1] "2" "3" "5" "6" "7"
# $c
# [1] "8" "5" "6" "7" "9"
# get overlap counts
crossprod(table(stack(myList)))
# ind
# ind a b c
# a 4 2 1
# b 2 5 3
# c 1 3 5
If we remove data processing bit, this answer is already provided by
similar post: Intersect all possible combinations of list elements
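The pairwise overlap can also be computed directly with intersect, which is close to what the loop in the question was attempting. A minimal sketch, assuming the three sets have already been parsed into a named list (the name sets is made up for this example):

```r
sets <- list(a = c(8, 2, 3, 4),
             b = c(2, 3, 5, 6, 7),
             c = c(8, 5, 6, 7, 9))

# For every pair of sets, count the shared elements;
# nested sapply collects the counts into a named matrix
overlap <- sapply(sets, function(x)
  sapply(sets, function(y) length(intersect(x, y))))
overlap
```

The diagonal holds the size of each set, and the matrix is symmetric because intersection does not depend on order.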

Splitting a Large Data File in R using Strsplit and R Connection

Hi I am trying to read in a large data file into R. It is a tab delimited file, however the first two columns are filled with multiple pieces of data separated by a "|". The file looks like:
A|1 B|2 0.5 0.4
C|3 D|4 0.9 1
I only care about the first values in both the first and second columns as well as the third and fourth column. In the end I want to end up with a vectors for each line that look like:
A B 0.5 0.4
I am using a connection to read in the file:
con <- file("inputfile.txt", open = "r")
lines <- readLines(con)
which gives me:
lines[1]
[1] "A|1\tB|2\t0.5\t0.4"
then I am using strsplit to split the tab delimited file:
linessplit <- strsplit(lines, split="\t")
which gives me:
linessplit[1]
[1] "A|1" "B|2"
[3] "0.5" "0.4"
When I try the following to split "A|1" into "A" "1":
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
I get:
"Error in strsplit(line1[1], split = "|") : non-character argument"
Does anyone have a way in which I can fix this?
Thanks!
Since you provided an approach, I will explain the errors in your code, even though you may want to consider a different approach to the problem as a whole.
Putting aside personal taste about code, the problems are:
You have to extract the first element of the list with double brackets:
line1[[1]]
The split argument accepts regular expressions. If you supply |, which is a metacharacter, it is not read literally. You must escape it as \\| or (as suggested by @nongkrong) use the fixed = TRUE argument, which matches the string exactly as written (that is, without its meaning as a metacharacter).
The final code is l1 <- strsplit(line1[[1]], split = "\\|")
As a final personal consideration, you might consider an lapply solution:
lapply(linessplit, strsplit, split = "|", fixed = T)
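To see why the unescaped | misbehaves, it may help to compare the variants on a single string. A short sketch (x is just an example value):

```r
x <- "A|1"

# As a regular expression, "|" is an empty alternation that matches
# the empty string, so the input is split into single characters
strsplit(x, "|")[[1]]                 # "A" "|" "1"

# Three ways to treat | as a literal character
strsplit(x, "\\|")[[1]]               # "A" "1"
strsplit(x, "|", fixed = TRUE)[[1]]   # "A" "1"
strsplit(x, "[|]")[[1]]               # "A" "1"
```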
Here is my solution to your original problem, that is, to
split the lines
"A|1\tB|2\t0.5\t0.4"
"C|3\tD|4\t0.9\t1"
into
A B 0.5 0.4
C D 0.9 1
Below is my code:
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1:4))
linessplit
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
tmp <- t(colsplit(x, pattern=pat, names=nam))
tmp[sel,]
}
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Let's break it down:
Read original data into lines
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1", "E|5\tF|6\t0.7\t0.2")
lines
Results:
[1] "A|1\tB|2\t0.5\t0.4" "C|3\tD|4\t0.9\t1" "E|5\tF|6\t0.7\t0.2"
Load reshape2 library to import function colsplit, then use it with pattern "\t" to split lines into 4 columns named 1,2,3,4.
library(reshape2)
linessplit <- colsplit(lines, pattern="\t", names=c(1,2,3,4))
linessplit
Results:
1 2 3 4
1 A|1 B|2 0.5 0.4
2 C|3 D|4 0.9 1.0
3 E|5 F|6 0.7 0.2
Let's make a function that takes a vector of strings, splits each string into pieces, and selects the piece we want.
Take the first row of linessplit into colsplit
tmp <- colsplit(linessplit[1,], pattern="\\|", names=c(1:2))
tmp
Results:
1 2
1 A 1
2 B 2
3 0.5 NA
4 0.4 NA
Take transpose
tmp <- t(colsplit(linessplit[1,], pattern="\\|", names=c(1:2)))
tmp
Results:
[,1] [,2] [,3] [,4]
1 "A" "B" "0.5" "0.4"
2 " 1" " 2" NA NA
Select first row:
tmp[1,]
Results:
[1] "A" "B" "0.5" "0.4"
Make above steps a function split_n_select:
split_n_select <- function(x, sel=c(1), pat="\\|", nam=c(1:2)){
tmp <- t(colsplit(x, pattern=pat, names=nam))
tmp[sel,]
}
Use sapply to apply function split_n_select to each column in linessplit (sapply iterates over the columns of a data frame)
linessplit2 <- sapply(linessplit, split_n_select)
linessplit2
Results:
1 2 3 4
[1,] "A" "B" "0.5" "0.4"
[2,] "C" "D" "0.9" "1"
[3,] "E" "F" "0.7" "0.2"
You can also select the second row by adding sel=c(2)
linessplit2 <- sapply(linessplit, split_n_select, sel=c(2))
linessplit2
Results:
1 2 3 4
[1,] "1" "2" NA NA
[2,] "3" "4" NA NA
[3,] "5" "6" NA NA
Change
line1 <- linessplit[1]
l1 <- strsplit(line1[1], split = "|")
to
line1 <- linessplit[[1]] # double brackets return the character vector, not a one-element list
l1 <- strsplit(line1[1], split = "[|]") # square brackets make | a literal character
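For the original goal (keep only the part before "|" in the first two columns), the data can also be read as a table and trimmed with sub(). This is a sketch under the assumption that every line has four tab-separated fields, using the two sample lines from the question; with a real file you would pass the file name instead of text =. V1 to V4 are the default column names that read.table assigns.

```r
lines <- c("A|1\tB|2\t0.5\t0.4", "C|3\tD|4\t0.9\t1")

# Parse the tab-delimited text; columns 3 and 4 are read as numbers
dat <- read.table(text = lines, sep = "\t", stringsAsFactors = FALSE)

# Drop everything from the first "|" onward in the first two columns
dat$V1 <- sub("\\|.*$", "", dat$V1)
dat$V2 <- sub("\\|.*$", "", dat$V2)
dat
```

Each row of dat then holds the four values wanted for one input line.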
