R: Remove character observations in a variable - r

I have a variable in a data frame whose observations are a mix of numeric and character values (due to faulty data entry). How can I keep only the observations that are numeric? Suppose the values of filename$varname are (1, 2, 1, 5, 3, a, 3, d, 1); I would like to drop "a" and "d" and keep only the remaining, numeric values.

You can make use of the fact that as.numeric will convert character strings to NA whilst keeping numeric data:
x <- c(1, 2, 1, 5, 3, "a", 3, "d", 1)
as.numeric(x)
[1] 1 2 1 5 3 NA 3 NA 1
Warning message:
NAs introduced by coercion
Now use is.na to test for NA values and exclude these using vector subsetting:
y <- as.numeric(x)
y[!is.na(y)]
[1] 1 2 1 5 3 3 1
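Applied to the data frame from the question, the same idea might look like the sketch below. The contents of filename are made up here to mirror the example values, and stringsAsFactors = FALSE is an assumption about how the column was read in.

```r
# Made-up data mirroring the values in the question.
filename <- data.frame(
  varname = c("1", "2", "1", "5", "3", "a", "3", "d", "1"),
  stringsAsFactors = FALSE
)

# Coerce once; non-numeric strings become NA (the warning is silenced on purpose).
num <- suppressWarnings(as.numeric(filename$varname))

# Keep only the rows whose value parsed as a number.
cleaned <- filename[!is.na(num), , drop = FALSE]
cleaned$varname <- num[!is.na(num)]
```

Subsetting rows of the data frame, rather than the vector alone, keeps any other columns aligned with the cleaned values.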

Without a reproducible example it is hard to see what your data actually looks like. For instance, is the column of your data frame a factor or just strings? If it is just strings then Andrie's answer works (just use as.numeric()), and if the data is a factor you first need to convert that to strings with as.character(x):
as.numeric(as.character(filename$varname))
You will get some NAs but that is absolutely fine as those values are indeed missing.
EDIT: To clarify a bit more: you have a data frame, so you don't want to remove values from a single column, because every column of a data frame must have the same number of rows. Instead, assign NA to the invalid values; most statistical functions in R can handle NA (for example via na.rm = TRUE).
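A minimal sketch of why the as.character() step matters for factors, using a made-up factor f rather than the asker's real column:

```r
# Made-up factor column, as read.csv() would have produced before R 4.0.
f <- factor(c("1", "2", "5", "a", "3"))

# Direct conversion returns the internal level codes, NOT the original values.
codes <- as.numeric(f)

# Going through character first recovers the values; "a" becomes NA.
vals <- suppressWarnings(as.numeric(as.character(f)))

# NA-aware summaries skip the missing entry instead of failing.
avg <- mean(vals, na.rm = TRUE)
```

Comparing codes with vals on your own data is a quick way to check whether a column was stored as a factor.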

Related

Finding the maximum value for each row and extract column names [duplicate]

This question already has answers here:
R Create column which holds column name of maximum value for each row
(4 answers)
Closed 1 year ago.
Say we have the following matrix,
x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))
What I'm trying to do is:
1- Find the maximum value of each row. For this part, I'm doing the following,
df <- apply(X=x, MARGIN=1, FUN=max)
2- Then, I want to extract the column names of the maximum values and put them next to the values. Following the reproducible example, it would be "C" for the three rows.
Any assistance would be wonderful.
You can use apply like
maxColumnNames <- apply(x,1,function(row) colnames(x)[which.max(row)])
Since you have a numeric matrix, you can't add the names as an extra column (the whole matrix would be converted to a character matrix).
You can choose a data.frame and do
resDf <- cbind(data.frame(x),data.frame(maxColumnNames = maxColumnNames))
resulting in
resDf
  A B C maxColumnNames
X 1 4 7              C
Y 2 5 8              C
Z 3 6 9              C
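As an alternative sketch, base R's max.col() returns the column index of each row's maximum directly, which avoids the apply() call. The names idx, res, maxValue and maxColumn are chosen here for illustration, and ties.method = "first" is an assumption about how ties should be resolved.

```r
x <- matrix(1:9, nrow = 3, dimnames = list(c("X", "Y", "Z"), c("A", "B", "C")))

# Column index of the maximum in each row; the first column wins on ties.
idx <- max.col(x, ties.method = "first")

# Combine the original values, each row's maximum and its column name.
res <- data.frame(
  x,
  maxValue = x[cbind(seq_len(nrow(x)), idx)],  # matrix indexing by (row, col) pairs
  maxColumn = colnames(x)[idx],
  stringsAsFactors = FALSE
)
```

Note that max.col() breaks ties at random by default, so setting ties.method explicitly keeps the result reproducible.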

Is it possible in R that by adding two dataframe we get the result even if other value for same type is not there?

I have two data frames with the same number of columns but different numbers of rows.
x: [table image not available]
y: [table image not available]
Now when I try to subtract y[,c(1,2)]-x[,c(4,3)], I get this error:
Error in Ops.data.frame(y[, c(1, 2)], x[, c(4, 3)]) :
‘-’ only defined for equally-sized data frames
I figured out that this happens because some type and wire combinations are missing from x.
So my objective is: can the code assume a value of 0 for any type and wire missing from the x data frame, so that the result for those rows is y - 0 = y?
You are subsetting your data frames by column. When subsetting with square brackets, the values before the comma select rows and the values after the comma select columns. Your y[, c(1, 2)] - x[, c(4, 3)] is trying to subtract columns 4 and 3 of x from columns 1 and 2 of y. Since the two data frames have different numbers of rows, it fails.
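One way to get the "missing means 0" behaviour is to join the two data frames on their key columns first and subtract afterwards. The sketch below is only an illustration: the original tables were posted as images, so the column names type, wire and value and all the numbers are assumptions.

```r
# Hypothetical data; column names and values are assumptions for illustration.
y <- data.frame(type = c("a", "a", "b"), wire = c(1, 2, 1), value = c(10, 20, 30))
x <- data.frame(type = c("a", "b"), wire = c(1, 1), value = c(4, 5))

# Left join: keep every row of y and match x on the key columns.
merged <- merge(y, x, by = c("type", "wire"), all.x = TRUE,
                suffixes = c(".y", ".x"))

# Keys missing from x come back as NA; treat them as 0.
merged$value.x[is.na(merged$value.x)] <- 0

# The subtraction is now row-aligned by (type, wire).
merged$diff <- merged$value.y - merged$value.x
```

Joining on the keys also protects against the two tables listing the same type and wire combinations in a different row order.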

How can I attach new levels with specific values for specific rows

I have a data frame with a column for the names of individuals and columns for results.
Now I want to attach a new column with either 1, 2 or NA depending on the individual.
I have one vector with all the individuals that are level 1 and another for the individuals that are level 2.
How can I attach a column to this data frame that goes something like this:
if dataframe$individual is in (1,3,6,7) the value in the column is 1, if dataframe$individual is in (2,5,8) the value is 2, else the value is NA
I hope the example makes clear what I am looking for.
Thanks for the help
Try
dat$newCol <- with(dat, ifelse(individual %in% c(1,3,6,7), 1,
ifelse(individual %in% c(2,5,8), 2, NA)))
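Since the question mentions that the level 1 and level 2 individuals are already stored in vectors, the same result can be sketched with plain logical assignment instead of nested ifelse(). The data frame contents and the vector names level1 and level2 are hypothetical.

```r
# Hypothetical data frame and membership vectors.
dat <- data.frame(individual = 1:9, result = seq(10, 90, by = 10))
level1 <- c(1, 3, 6, 7)
level2 <- c(2, 5, 8)

# Start with NA everywhere, then fill in the rows that belong to each level.
dat$newCol <- NA_real_
dat$newCol[dat$individual %in% level1] <- 1
dat$newCol[dat$individual %in% level2] <- 2
```

This form scales to more levels by adding one assignment per vector, without deepening the nesting.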

Subset a dataframe by multiple factor levels [duplicate]

This question already has answers here:
Select rows from a data frame based on values in a vector
(3 answers)
Closed 5 years ago.
How can I avoid using a loop to subset a dataframe based on multiple factor levels?
In the following example my desired output is a dataframe. The dataframe should contain the rows of the original dataframe where the value in "Code" equals one of the values in "selected".
Working example:
#sample data
Code<-c("A","B","C","D","C","D","A","A")
Value<-c(1, 2, 3, 4, 1, 2, 3, 4)
data<-data.frame(Code, Value) #cbind() would coerce Value to character
selected<-c("A","B") #want rows that contain A and B
#Begin subsetting
result<-data[which(data$Code==selected[1]),]
s1<-2
while(s1<length(selected)+1)
{
result<-rbind(result,data[which(data$Code==selected[s1]),])
s1<-s1+1
}
This is a toy example of a much larger dataset, so "selected" may contain a great number of elements and the data a great number of rows. Therefore I would like to avoid the loop.
You can use %in%
data[data$Code %in% selected,]
  Code Value
1    A     1
2    B     2
7    A     3
8    A     4
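The %in% approach can be spelled out step by step as below. This is a sketch that assumes Code is stored as a factor (stringsAsFactors = TRUE is set explicitly for that reason); the names keep and result are illustrative.

```r
Code <- c("A", "B", "C", "D", "C", "D", "A", "A")
Value <- c(1, 2, 3, 4, 1, 2, 3, 4)
data <- data.frame(Code, Value, stringsAsFactors = TRUE)
selected <- c("A", "B")

# One TRUE/FALSE per row: is this row's Code among the selected values?
keep <- data$Code %in% selected

# Logical row indexing keeps the original row order and row names.
result <- data[keep, ]

# The factor still remembers the unused levels "C" and "D"; drop them.
result$Code <- droplevels(result$Code)
```

Because %in% is vectorised over both arguments, the cost does not depend on writing one comparison per element of selected.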
Here's another:
data[data$Code == "A" | data$Code == "B", ]
It's also worth mentioning that the subsetting vector doesn't have to be a column of the data frame, as long as it matches the data frame's rows in length and order. Here the standalone vector Code, from which we built the data frame, is still in the workspace. So,
data[Code == "A" | Code == "B", ]
also works, which is one of the really useful things about R.
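Try this (nomatch = 0 marks the rows without a match, and the comparison turns the match positions into a logical row filter):
data[match(as.character(data$Code), selected, nomatch = 0) > 0, ]
  Code Value
1    A     1
2    B     2
7    A     3
8    A     4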
