I am performing an analysis in R. I want to fill the first row of a matrix with the content of a table. The problem is that the content of the table varies with the data, so sometimes certain identifiers that appear in the matrix do not appear in the table.
> random.evaluate
DNA LINE LTR/ERV1 LTR/ERVK LTR/ERVL LTR/ERVL-MaLR other SINE
[1,] NA NA NA NA NA NA NA NA
> y
DNA LINE LTR/ERVK LTR/ERVL LTR/ERVL-MaLR SINE
1 1 1 1 1 4
Because of this, when I try to fill the matrix row with the data from the table, I get the following error:
random.evaluate[1,] <- y
Error in random.evaluate[1, ] <- y :
number of items to replace is not a multiple of replacement length
Could someone help me fix this bug? I have found solutions to this error but in my case they do not work for me.
First check which column names of the table also exist in the matrix.
For the names that exist, just set the values as usual, indexing the matrix columns by name.
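A minimal sketch of that idea, rebuilding the matrix and the table from the question. The way the example objects are constructed here is an assumption; only their names and values come from the post. It takes the names common to both objects and assigns by name, so columns missing from the table keep their NA.

```r
# Rebuild the objects from the question (this construction is illustrative)
cols <- c("DNA", "LINE", "LTR/ERV1", "LTR/ERVK", "LTR/ERVL",
          "LTR/ERVL-MaLR", "other", "SINE")
random.evaluate <- matrix(NA, nrow = 1, ncol = length(cols),
                          dimnames = list(NULL, cols))
y <- table(c("DNA", "LINE", "LTR/ERVK", "LTR/ERVL", "LTR/ERVL-MaLR",
             rep("SINE", 4)))

# Names that appear in both the table and the matrix
common <- intersect(names(y), colnames(random.evaluate))

# Assign by name; "LTR/ERV1" and "other" are not in y and stay NA
random.evaluate[1, common] <- as.vector(y[common])
print(random.evaluate)
```

Because the assignment is by name rather than by position, the order of the entries in the table does not matter.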
I have a data set such as the one below:
salaries <- read.csv('salaries.csv', header=TRUE)
print(salaries)
Name Job Salary CompanyExperience IndustryExperience
John Engineer 50000 3 12
Adam Manager 55000 6 7
Alice Manager #N/A 6 6
Bob Engineer 65000 5 #N/A
Carl Engineer 70000 #N/A 10
I would like to plot some of this information as Salary ~ CompanyExperience. To do that, I need to exclude any data points with "#N/A" by removing every row that contains the text string "#N/A" (produced by an MS Excel spreadsheet exported to CSV).
My code to subset is as follows:
salaries <-salaries[salaries$CompanyExperience!="#N/A" &
salaries$Salary!="#N/A",]
#write.csv(salaries, "salaries2.csv")
#salaries <- read.csv('salaries2.csv', header=TRUE)
print(salaries)
Now this seems to work without any issue, producing:
Name Job Salary CompanyExperience IndustryExperience
1 John Engineer 50000 3 12
2 Adam Manager 55000 6 7
4 Bob Engineer 65000 5 #N/A
Which seems fine, however as soon as I try to put this data subset into a linear regression, I get an error:
> salarylinear <- lm(salaries$CompanyExperience ~ salaries$Salary)
Warning messages:
1: In model.response(mf, "numeric") :
using type = "numeric" with a factor response will be ignored
2: In Ops.factor(y, z$residuals) : ‘-’ not meaningful for factors
Now, I've done some experimenting and found that if I subset the data using things like "!=10000" or "<50", I don't get this error. I've also found that when I write this new subset to a CSV file and read it back in (by removing the # comment markers in the code above), the data set gains a mysterious "X" column at the front and no longer produces the error when I run a linear regression:
X Name Job Salary CompanyExperience IndustryExperience
1 1 John Engineer 50000 3 12
2 2 Adam Manager 55000 6 7
3 4 Bob Engineer 65000 5 #N/A
I've searched the web and can't find any reason why this is happening. Is there a way to produce a usable subset by excluding "#N/A" strings without having to write the data to disk and read it back into memory?
Most likely what is happening is that columns of data that you think are numeric are not in fact numeric. Two things are leading to this:
read.csv() doesn't know that "#N/A" means "missing" and as a result, it is reading in "#N/A" as a string (not a number), causing it to think that the whole columns of Salary, CompanyExperience, and IndustryExperience are string variables.
read.csv() has a notorious default (in R versions before 4.0.0) of reading strings in as factors. If you're unfamiliar with factors, one good resource is this.
This combination of events is why lm() thinks your dependent variable is a factor and is throwing an error.
The solution is to add na.strings = "#N/A" as an argument to read.csv(). Then your data will be read in as numeric. You can proceed straight to running your regression because lm() will drop rows with NA's automatically.
However, to be a bit more explicit, you may also want to add stringsAsFactors = FALSE as an argument to read.csv(), just in case you have any other things that mean "missing" but are coded as, say, a blank. And if you want to handle the NAs manually before running your regression, you can drop rows with NAs using complete.cases() or something like salaries[!is.na(salaries$Salary), ]
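As a rough sketch of the fix described above, the data from the question can be read with na.strings set. To keep the example self-contained, the CSV is passed inline through the text argument instead of being read from 'salaries.csv', and the comma-separated layout is an assumption about the original file.

```r
# Inline copy of the question's data (comma-separated layout is assumed)
csv_text <- "Name,Job,Salary,CompanyExperience,IndustryExperience
John,Engineer,50000,3,12
Adam,Manager,55000,6,7
Alice,Manager,#N/A,6,6
Bob,Engineer,65000,5,#N/A
Carl,Engineer,70000,#N/A,10"

# Treat "#N/A" as missing and keep strings as plain character vectors
salaries <- read.csv(text = csv_text, header = TRUE,
                     na.strings = "#N/A", stringsAsFactors = FALSE)
str(salaries)  # Salary and both experience columns are now numeric

# Optional: drop rows with an NA in the two columns of interest by hand
complete <- salaries[complete.cases(salaries[, c("Salary", "CompanyExperience")]), ]

# lm() drops rows containing NA on its own, so the full data frame works too
fit <- lm(Salary ~ CompanyExperience, data = salaries)
```

Using the formula with data = salaries, rather than salaries$Salary inside the formula, keeps the variable names readable in the model output.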
Follow-up to our discussion in the comments about what happens when you subset a data.frame with a matrix:
First, we create a 3x2 dataframe to work with:
df <- data.frame(x=1:3, y=4:6)
Then, let's create a vector of TRUE/FALSE for the rows we want to keep when we subset our dataframe.
v <- c(T,T,F)
Here, v has 2 TRUEs followed by 1 FALSE so if we subset our 3-row dataframe with v, we will be selecting the first 2 rows and omitting the 3rd row:
df[v,]
x y
1 1 4
2 2 5
Great, that works as expected. But what about if we subset with a matrix? We create matrix m that has the same 3x2 dimensions as our dataframe. m is full of TRUEs except for 2 FALSEs in cells (1,1) and (3,2).
m <- matrix(c(F,T,T,T,T,F), ncol=2)
m
[,1] [,2]
[1,] FALSE TRUE
[2,] TRUE TRUE
[3,] TRUE FALSE
Now, if we try to subset our dataframe with m, we might at first think that we're going to get only row 2 back, because m has a FALSE in its first and third rows. That, of course, isn't what happens.
df[m,]
x y
2 2 5
3 3 6
NA NA NA
NA.1 NA NA
The trick to understanding this is to know that a matrix in R is just a vector with a dimension attribute. The dimension is as expected, because we created m:
dim(m)
[1] 3 2
But as a vector, what does m look like:
as.vector(m)
[1] FALSE TRUE TRUE TRUE TRUE FALSE
We see that m-as-a-vector is just the columns of m, repeated one after the other (because R "fills in" matrices column-wise). Let me re-write m with the original cells identified, in case my description isn't clear:
[1] FALSE TRUE TRUE TRUE TRUE FALSE
(1,1) (2,1) (3,1) (1,2) (2,2) (3,2)
So when we try to subset our dataframe with m, it's like using this length-6 vector, and this length-6 vector says to select rows 2:5. So when we write df[m, ] R faithfully selects rows 2 and 3, and then when it tries to select rows 4 and 5, they don't "exist" so R fills them in with NAs. This is why we get more rows in our subset than in our original dataframe.
Lastly, we saw that df[m, ] has funny rownames like NA.1. Rownames must be unique, so R calls the row 4 of the "subset" 'NA' and it calls row 5 of the subset 'NA.1'.
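To see the same thing in code, here is a small sketch using the df and m defined above. It shows which positions the flattened matrix selects, and one possible way (an addition, not part of the original discussion) to get the "rows where every cell is TRUE" behaviour that one might have expected.

```r
df <- data.frame(x = 1:3, y = 4:6)
m <- matrix(c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE), ncol = 2)

# Positions selected when m is treated as a plain length-6 vector
idx <- which(as.vector(m))

# Subsetting by those positions is what df[m, ] effectively does
by_position <- df[idx, ]

# Collapse m to one logical value per row, then subset with that instead
keep <- apply(m, 1, all)
only_full_rows <- df[keep, ]
print(only_full_rows)
```

Here idx holds positions 2 to 5, which is why two of the returned rows fall past the end of the 3-row dataframe.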
I hope this clears it up for you. Happy coding!
I am programming in R for a commercial real estate project at the place where I recently started working. I have data frames with 195 categories for each of the properties sold in that area over the last year. The categories run along the top as columns, and the properties are the rows.
I tried to write a function called cuttingvariables1 to reduce the number of variables, first by taking a subset of the categories based on whether the column name contains seller, buyer, buyers, or listing.
I was able to get it to work when I ran it as individual commands, so why doesn't it work when I define it as a function in a source file and run it from there?
cuttingvariables2 is my second function, and I do not understand why it stops working at line 7, at the loop. The loop is meant to check the na_count of each category and see whether it is greater than 20% of the number of properties listed in the loaded CSV. If it is, the column gets deleted.
Any help would be appreciated.
cuttingvariables1 <- function(dataset)
(
dataset <- (subset(dataset,select=c(!grepl("Seller|Buyer|Buyers|Listing",names(dataset))))
)
)
Cuttingvariables2 function below!
cuttingvariables2 <- function(dataset)
{
z = ncol(dataset)
na_count <- c(lapply(dataset, function(y) sum(length(which(is.na(y))))))
setDT(na_count, keep.rownames = TRUE)[]
j = ncol(na_count)
for (i in 1:j) if((as.integer(na_count[,i])) > (nrow(dataset)/5)) na_count <- na_count[,-i]
for (i in 1:ncol(dataset)) if(colnames(dataset)[i] %in% (colnames(na_count))) dataset <- dataset[,-i]
return (dataset[1:5,1:5])
return (colnames(dataset))
}
#sample data
BROWNSVILLEMF2016TO2017[1:12,1:5]
Actual.Cap.Rate Age Asking.Price Assessed.Improved Assessed.Land
1 NA 31 NA 12039000 1776000
2 NA NA NA 1434000 1452000
3 NA 87 NA 306900 270000
4 NA 11 NA 432900 337950
5 NA 89 NA 281700 107100
6 4.5 87 3300000 NA NA
7 NA 96 NA 427500 66150
8 NA 87 NA 1228000 300000
9 NA 95 NA NA NA
10 NA 95 NA NA NA
11 NA 87 NA 210755 14418
12 NA 87 NA NA NA
I would not use subset directly with grep because you have so many fields. There may be very different versions of the words, and you want them whether they are capitalized or not.
(be sure to check my R grammar I have been working in python all day)
#Empty character vector - you will build a vector of column names
extractList <- character()
#names you are looking for in column names (lowercase)
nameList <- c("seller","buyer","buyers","listing")
#Outer loop: go through the columns and pull the column name off each one at a time
for (i in seq_len(ncol(dataset))) {
  cName <- names(dataset)[i]
  lcName <- tolower(cName)
  #Inner loop: compare each keyword in nameList to the column name to see if the word is in the title (with the title lowercased)
  for (j in nameList) {
    #if it is, append the column name to the vector NOT LOWER CASE, ***ORIGINAL***
    if (grepl(j, lcName, fixed = TRUE)) { extractList <- c(extractList, cName) }
  }
}
#Now remove duplicate names from the extract list
extractList <- unique(extractList)
At this point you should have a concatenated list of column names each of which has one (or more) of those four words in ANY FORM capital or lowercase or camel case...which was the point of lowering the case of the column name before comparing them. Now you just need to subset the data frame the easy way!
newSet<- dataset[,which((names(dataset) %in% extractList)==TRUE)]
This creates a logical vector with %in% statement so only names in the data frame which appear on the new list of unique column names with ANY version of your keywords will show as TRUE and be included in the columns of the new set.
Now you should have a complete set of data with only the types of column names you are looking to use. DO NOT JUST USE THIS...look at it and try to understand why some of the more esoteric tricks are at play so that you can work through similar problems in the future.
Almost forgot:
install.packages("questionr")
Then:
library(questionr)
freq.na(newSet)
will give you a formatted table with the number missing and the percent of NAs for each column. You can assign this to a variable and use it in your vetting process!
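For comparison, here is a vectorised sketch of both steps in base R, with no loops and no extra packages. The small data frame and its column names are made up for illustration, and this is only one possible way to write it. The second step follows the rule from the question: drop a column when its NA count is greater than 20% of the number of rows.

```r
# Hypothetical example data; column names are invented for illustration
dataset <- data.frame(
  Seller.Name    = c("A", "B", "C", "D", "E"),
  BUYERS.Address = c("x", NA, NA, "y", "z"),
  Age            = c(31, NA, 87, 11, 89),
  Asking.Price   = c(NA, NA, NA, NA, 3300000),
  stringsAsFactors = FALSE
)

# Step 1: columns whose name contains a keyword, in any capitalisation
has_keyword <- grepl("seller|buyer|listing", names(dataset), ignore.case = TRUE)
newSet <- dataset[, has_keyword, drop = FALSE]

# Step 2: count NAs per column and keep columns at or below the 20% threshold
na_count <- colSums(is.na(dataset))
trimmed <- dataset[, na_count <= nrow(dataset) / 5, drop = FALSE]

print(names(newSet))
print(names(trimmed))
```

Computing the logical index once and subsetting afterwards avoids deleting columns inside a loop, where the column positions shift after each deletion.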
This issue seems to have been covered before, but after checking I couldn't find a solution. I load a table from a file, and it can happen (I don't know how) that some entire lines are empty. So when I get the data frame, I have
# id c1 c2
# 1 a 1 2
# 2 b 2 4
# 3 NA NA
# 4 d 6 1
# 5 e 7 5
# 6 NA NA
If I do
apply(df, 1, function(x) all(is.na(x)))
I get all FALSE, because the first column is not numeric (the real table is much bigger, with mixed character and numeric columns), so I can't filter these lines out. na.omit and complete.cases don't solve it either.
Is there any function or expression to check empty rows?
You may be able to cut this problem off at the source with the parameters you pass to read.csv:
For instance, if the blank fields are empty strings or a single space, you could use
df <- read.csv(<your other logic here>, na.strings=c("NA","", " "))
This question seems to raise similar issues: read.csv blank fields to NA
If this works, then you can use the apply logic to work with the offending rows.
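A small sketch of the whole approach, using an inline copy of the table from the question. The comma-separated layout and the use of the text argument are assumptions made to keep the example self-contained.

```r
# Inline version of the question's table; rows 3 and 6 are entirely empty
csv_text <- "id,c1,c2
a,1,2
b,2,4
,,
d,6,1
e,7,5
,,"

# Read empty strings and single spaces as NA in every column
df <- read.csv(text = csv_text, na.strings = c("NA", "", " "),
               stringsAsFactors = FALSE)

# A row is empty when every one of its cells is NA
empty <- rowSums(is.na(df)) == ncol(df)
df_clean <- df[!empty, ]
print(df_clean)
```

Once the blank fields in the character column are read as NA, the all(is.na(x)) check from the question flags the same rows.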
I have a data frame x with columns a,b,c.
I am trying to order the data based on columns a and c and my ordering criteria is ascending on a and descending on c. Below is the code I wrote.
z <- x[order(x$a,-x$c),]
This code gives me a warning as below.
Warning message:
In Ops.factor(x$c) : - not meaningful for factors
Also, when I check the data frame z using head(z), it gives me wrong data as below:
30708908 0.3918980 NA
22061768 0.4022183 NA
21430343 0.4118651 NA
21429828 0.4134888 NA
21425966 0.4159323 NA
22057521 0.4173094 NA
Initially there weren't any NA values in column c of the data frame x. I have gone through a lot of threads but couldn't find a solution. Can anybody please suggest one?
try this
install.packages('plyr');
library('plyr');
z<-arrange(x,a,desc(c));
In addition, you can use the
options(stringsAsFactors = FALSE)
before you create your frame, or while creating your 'x' data frame, specify
stringsAsFactors = FALSE
z <- x[order(x$a, -xtfrm(as.character(x$c))), ]
z
If, as Roman suspects, you have digits in your factor levels, you may need to do as he suggests and add as.numeric; otherwise 9 will sort as greater than 10
z <- x[order(x$a,-as.numeric(as.character(x$c)) ), ]
z
But if they are characters, then you will again get all NAs, so it really depends on the nature of the levels of x$c
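To make the difference concrete, here is a short sketch with a made-up data frame whose column c is a factor with digit levels. The data are invented for illustration.

```r
# Hypothetical data: c is a factor whose levels are digit strings
x <- data.frame(a = c(1, 1, 2, 2),
                c = factor(c("9", "10", "3", "7")))

# as.numeric() on a factor returns the internal level codes, not the values
codes <- as.numeric(x$c)

# Converting through character first recovers the actual numbers
values <- as.numeric(as.character(x$c))

# Ascending on a, descending on the numeric value of c
z <- x[order(x$a, -values), ]
print(z)
```

If the levels of c were not digit strings, as.numeric(as.character(x$c)) would return NA for them with a warning, which is the case the last sentence above warns about.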
I have a vector called classes that is the output of an analysis that used listwise deletion. As a result, the cases included in classes are a subset of the entire dataset: some cases were dropped because of incomplete data.
Selection is a dummy variable that occurs with every case in my dataset. A shortened example of my data is below. There is also a unique case ID for every observation.
classes <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case <-seq(1,26,1)
I would like to create a new version of selection (say, selection2) so that it only includes cases that are in classes. Basically, I would like both variables to be the same length for comparison purposes, where the cases that are NOT included in classes are also not included in selection2.
I thought this would be an easy fix, but I've spent a lot of time getting nowhere, so I thought I'd ask. Thanks in advance!
If they are to be the same length, then the reduced version must have NA's:
> selection2 <- selection
> is.na(selection2) <- !selection2 %in% classes
> selection2
[1] 1 NA NA NA 1 1 1 1 NA NA NA NA NA 1 1 1 1 NA NA NA 1 1 1 NA 1 NA
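Note that the code above compares the values of selection against the values of classes, not the case IDs. If the IDs of the cases that survived listwise deletion are available, matching on the ID may be closer to what was asked. The sketch below assumes a hypothetical vector kept_cases holding those IDs; it is not in the original post, and its values are invented.

```r
classes   <- c(1,2,1,1,1,2,3,3,3,1,1,1,3,3,2,2,2)
selection <- c(1,0,0,0,1,1,1,1,0,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,1,0)
case      <- seq(1, 26, 1)

# HYPOTHETICAL: IDs of the 17 cases kept by the analysis (invented values)
kept_cases <- c(1,2,3,5,6,8,9,10,12,13,15,16,18,20,21,23,25)

# Version with the same length as classes: keep only the retained cases
selection2 <- selection[case %in% kept_cases]

# Version with the same length as selection: NA for the dropped cases
selection_padded <- ifelse(case %in% kept_cases, selection, NA)
```

With real data, kept_cases would come from whatever the analysis reports as the rows it used, for example the row names of its model frame.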