Below is an excerpt of the data I'm working with. I'm having trouble extracting the last letter from the sbp.id column and using the result to add a new column called "sex" to the data frame below. I initially tried grepl to separate the rows ending in F from the ones ending in M, but couldn't figure out how to use that to create a new column containing just M or F, depending on the last letter of each value in the sbp.id column.
sbp.id newID
125F 125
13000M 13000
13120M 13120
13260M 13260
13480M 13480
Another way, if you know you need the last letter: this works irrespective of whether the other characters are letters or digits, and even if the elements all have different lengths, since it simply takes the last character of the string in every row:
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
This works because substr and nchar are both vectorized, so the whole column is processed in one call.
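As a quick sketch, recreating the sample data from the question (stringsAsFactors = FALSE is added so the code behaves the same on R versions before 4.0):

```r
# Sample data as in the question
df <- data.frame(sbp.id = c("125F", "13000M", "13120M", "13260M", "13480M"),
                 newID = c(125, 13000, 13120, 13260, 13480),
                 stringsAsFactors = FALSE)

# substr and nchar are vectorized, so this extracts the last
# character of every element in a single call
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
df$sex
# [1] "F" "M" "M" "M" "M"
```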
Using regex, you can extract the last letter from sbp.id:
df$sex <- sub('.*([A-Z])$', '\\1', df$sbp.id)
#Also
#df$sex <- sub('.*([MF])$', '\\1', df$sbp.id)
Or another way would be to remove the digits:
df$sex <- sub('\\d+', '', df$sbp.id)
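Both regex forms can be checked on a small vector: the capture-group version keeps only the trailing letter, while the digit-stripping version deletes the leading run of digits:

```r
ids <- c("125F", "13000M", "13120M")

# Capture the final uppercase letter and keep only it
sub('.*([A-Z])$', '\\1', ids)
# [1] "F" "M" "M"

# Or delete the leading digits, leaving the letter behind
sub('\\d+', '', ids)
# [1] "F" "M" "M"
```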
I have a data.frame with 1 column and an arbitrary number of rows.
This column contains strings, and some strings contain a substring, let's say "abcd".
I want to remove any strings from the database that contain that substring. For example, I may have five strings that are "123 abcd", and I want those to be removed.
I am currently using grepl() to try and remove these values, but it is not working. I am trying:
data.frame[!grepl("abcd", dataframe)]
but it returns an empty data frame.
We can use grepl to get a logical vector, negate it (!), and use that to subset the data:
data[!grepl("abcd", data$Col),,drop = FALSE]
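For example, with a small hypothetical data frame whose string column is named Col (the data and column name here are invented for illustration):

```r
data <- data.frame(Col = c("123 abcd", "keep me", "abcd too", "also keep"),
                   stringsAsFactors = FALSE)

# grepl returns TRUE for rows containing "abcd"; negating keeps the rest.
# drop = FALSE preserves the data.frame class for a single-column result.
data[!grepl("abcd", data$Col), , drop = FALSE]
#         Col
# 2   keep me
# 4 also keep
```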
I have the data frame:
DT=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100))
The objective is to create a new column Quantity that selects the value for each row in the column equal to Price, such that:
DT.Objective=data.frame(Row=c(1,2,3,4,5),Price=c(2.1,2.1,2.2,2.3,2.5),
'2.0'= c(100,300,700,400,0),
'2.1'= c(400,200,100,500,0),
'2.2'= c(600,700,200,100,200),
'2.3'= c(300,0,300,100,100),
'2.4'= c(400,0,0,500,600),
'2.5'= c(0,200,0,800,100),
Quantity= c(400,200,200,100,100))
The dataset is very large, so efficiency is important. I currently use the following and am looking to make it more efficient:
Names <- names(DT)
DT$Quantity<- DT[Names][cbind(seq_len(nrow(DT)), match(DT$Price, Names))]
For some reason the column names in the example come with an "X" in front of them, whereas in the actual data there is no X.
Cheers.
We can do this with row/column indexing after removing the prefix 'X' using sub or substring, and then do the match as shown in the OP's post:
DT$Quantity <- DT[cbind(1:nrow(DT), match(DT$Price, sub("^X", "", names(DT))))]
DT$Quantity
#[1] 400 200 200 100 100
The X is attached as a prefix when a column name starts with a number. One way to take care of this is to pass check.names = FALSE in the data.frame call or in read.csv/read.table.
@akrun is correct: check.names = TRUE is the default behavior for data.frame(); from the man page:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
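A small illustration of the difference, using a trimmed version of the data above:

```r
# Default: syntactically invalid names get an "X" prefix via make.names()
d1 <- data.frame('2.0' = 1:2, '2.1' = 3:4)
names(d1)
# [1] "X2.0" "X2.1"

# With check.names = FALSE the names are kept verbatim
d2 <- data.frame('2.0' = 1:2, '2.1' = 3:4, check.names = FALSE)
names(d2)
# [1] "2.0" "2.1"
```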
If possible, you may want to make your column names a bit more descriptive.
I have a simple data frame
d <- data.frame(var1=c(5,5,5),var1_c=c(5,NA,6),var2 =c(6,6,6),var2_c = c(8,6,NA))
with lots of rows and lots of variables, all labeled "varXXX" and "varXXX_c", and whenever there is an NA in a varXXX_c column I want to replace it with the value from the corresponding varXXX variable.
In short, I want to do :
d[is.na(d$var1_c),"var1_c"] <- d$var1[is.na(d$var1_c)]
but I am trying to find a better way to do this than copy-pasting it and changing "1" to the number of each variable.
I would prefer a solution in base R or dplyr, but would be grateful for any help!
We can use grepl to find the column names that start with var followed by numbers (\\d+), an underscore, and c. Similarly, we build another logical index for var followed by one or more numbers (\\d+) up to the end of the string ($). We then subset the columns with each index and replace the NA values (is.na(d[i1])) with the corresponding elements of d[i2].
i1 <- grepl("var\\d+_c", names(d))
i2 <- grepl('var\\d+$', names(d))
d[i1][is.na(d[i1])] <- d[i2][is.na(d[i1])]
NOTE: This is based on the assumption that the columns are in the same order.
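On the example data frame from the question, this fills both NA cells from the matching varXXX columns:

```r
d <- data.frame(var1 = c(5, 5, 5), var1_c = c(5, NA, 6),
                var2 = c(6, 6, 6), var2_c = c(8, 6, NA))

i1 <- grepl("var\\d+_c", names(d))   # the "varXXX_c" columns
i2 <- grepl("var\\d+$", names(d))    # the plain "varXXX" columns

# replace NAs in the _c columns with the aligned varXXX values
d[i1][is.na(d[i1])] <- d[i2][is.na(d[i1])]
d
#   var1 var1_c var2 var2_c
# 1    5      5    6      8
# 2    5      5    6      6
# 3    5      6    6      6
```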
I have a big data frame and I want to remove certain rows from it based on whether the first character of a column is a letter or a number. A sample of my data frame looks like the one below:
y<-c('34TA912','JENAR','TEST','34CC515')
z<-c('23.12.2015','24.12.2015','24.12.2015','25.12.2015')
abc<-data.frame(y,z)
Based on the sample above, I would like to remove the second and third rows because their values in the y column start with a letter instead of a number. The values in the y column could be anything, so the only way I can filter is by checking the first character, without using any predefined value. If I use grep with a specific character, I could remove other rows as well, since the other rows also contain letters. Can you assist?
We can use grep. The regex anchor ^ indicates the beginning of the string. We match a numeric character ([0-9]) at the start of the 'y' column using grep. The output is a numeric index, which we use to subset the rows of 'abc'.
abc[grep('^[0-9]', abc$y),]
# y z
#1 34TA912 23.12.2015
#4 34CC515 25.12.2015
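Equivalently, the filter can be written as a negated letter match with grepl; this is just an alternative sketch of the same subset, not a change to the answer above:

```r
y <- c('34TA912', 'JENAR', 'TEST', '34CC515')
z <- c('23.12.2015', '24.12.2015', '24.12.2015', '25.12.2015')
abc <- data.frame(y, z, stringsAsFactors = FALSE)

# drop rows whose y value starts with a letter
abc[!grepl('^[A-Za-z]', abc$y), ]
#         y          z
# 1 34TA912 23.12.2015
# 4 34CC515 25.12.2015
```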