How to transform many columns of a data frame in R

I have a simple data frame
d <- data.frame(var1=c(5,5,5),var1_c=c(5,NA,6),var2 =c(6,6,6),var2_c = c(8,6,NA))
with many rows and many variables, all named "varXXX" and "varXXX_c". Every time there is an NA in a varXXX_c column, I want to replace it with the value from the corresponding varXXX column.
In short, I want to do:
d[is.na(d$var1_c),"var1_c"] <- d$var1[is.na(d$var1_c)]
but I am trying to find a better way to do this than copy-pasting the line and changing "1" to the number of each variable.
I would prefer a solution in base R or dplyr, but would be grateful for any help!

We can use grepl to find the column names that consist of var followed by one or more digits (\\d+), then _ and c. Similarly, we build a second logical index for var followed by one or more digits (\\d+) up to the end of the string ($). We then subset the columns with these indexes and replace the NA values (is.na(d[i1])) with the corresponding elements of d[i2].
i1 <- grepl("var\\d+_c", names(d))
i2 <- grepl('var\\d+$', names(d))
d[i1][is.na(d[i1])] <- d[i2][is.na(d[i1])]
NOTE: This is based on the assumption that the columns are in the same order.
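If the columns are not guaranteed to be in the same order, one option is to pair each varXXX_c column with its varXXX counterpart by name instead of by position. A minimal base R sketch, assuming every _c column has a matching base column (c_cols, base_cols and is_missing are illustrative names, not part of the answer above):

```r
d <- data.frame(var1 = c(5, 5, 5), var1_c = c(5, NA, 6),
                var2 = c(6, 6, 6), var2_c = c(8, 6, NA))

# Names of the "_c" columns and of the base columns they belong to
c_cols <- grep("^var\\d+_c$", names(d), value = TRUE)
base_cols <- sub("_c$", "", c_cols)

# Fill each NA in a "_c" column from the base column with the same stem
for (k in seq_along(c_cols)) {
  is_missing <- is.na(d[[c_cols[k]]])
  d[[c_cols[k]]][is_missing] <- d[[base_cols[k]]][is_missing]
}
```

Because the pairing is derived from the names, this version does not depend on the column order.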

Related

Best way to extract a single letter from each row and create a new column in R?

Below is an excerpt of the data I'm working with. I am having trouble finding a way to extract the last letter from the sbp.id column and use the result to add a new column called "sex" to the data frame below. I initially tried grepl to separate the rows ending in F from the ones ending in M, but couldn't figure out how to use that to create a new column containing just M or F, depending on which one is the last letter of each row in the sbp.id column.
sbp.id newID
125F 125
13000M 13000
13120M 13120
13260M 13260
13480M 13480
Another way, if you know you need the last letter, irrespective of whether the other characters are letters or digits, or whether the elements have different lengths, is to take the last character of the string in every row:
df$sex <- substr(df$sbp.id, nchar(df$sbp.id), nchar(df$sbp.id))
This works because all of the functions are vectorized by default.
Using regex you can extract the last part from sbp.id
df$sex <- sub('.*([A-Z])$', '\\1', df$sbp.id)
#Also
#df$sex <- sub('.*([MF])$', '\\1', df$sbp.id)
Another way is to remove the leading digits.
df$sex <- sub('\\d+', '', df$sbp.id)
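As a quick check, the three approaches above can be compared side by side on a few rows of the sample data. This is only a sketch: the variable names by_position, by_capture and by_removal are illustrative, and it assumes sbp.id is stored as character.

```r
df <- data.frame(sbp.id = c("125F", "13000M", "13120M"),
                 stringsAsFactors = FALSE)

# Last character by position
n <- nchar(df$sbp.id)
by_position <- substr(df$sbp.id, n, n)

# Last character via a capture group
by_capture <- sub(".*([A-Z])$", "\\1", df$sbp.id)

# Whatever is left after dropping the first run of digits
by_removal <- sub("\\d+", "", df$sbp.id)

df$sex <- by_position
```

All three agree on this data. They would differ on an ID with letters before the digits, where only the first two still return the final character.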

Exclude one single column from sapply

I have a dataframe with multiple columns that I want to group according to their names. When several column names match the same pattern, I want them combined into a single column that holds the sum of the group.
colnames(dataframe)
[1] "Départements" "01...3" "01...4" "01...5" "02...6" "02...7" "02...8" "02...9" "02...10" "03...11"
[11] "03...12" "03...13" "04...14" "04...15" "05...16" "05...17" "05...18" "06...19" "06...20" "06...21"
So I use this bit of code, which works just fine when every column is numeric. However, the first column is character, so I hit an error. How can I exclude the first column from the code?
#Group columns by pattern: look for a pattern and loop through
patterns <- unique(substr(names(dataframe_2012), 1, 3)) #store patterns in a vector
dataframe <- sapply(patterns, function(xx) rowSums(dataframe[,grep(xx, names(dataframe)), drop=FALSE]))
#loop through
This is the error message I get
Error in rowSums(DEPTpolicedata_2012[, grep(xx, names(DEPTpolicedata_2012)), :
'x' must be numeric
You can simply remove the first column from the data frame before computing the patterns, using
dataframe$Départements <- NULL
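A sketch of the whole grouping step with the character column set aside and re-attached afterwards. The data here is made up, the column name dept stands in for the real first column, and numeric_part, sums and result are illustrative names. It also assumes the group is identified by the first two characters of each column name.

```r
# Hypothetical data: one character column plus numeric columns named by prefix
dataframe <- data.frame(dept = c("A", "B"),
                        `01...2` = c(1, 2), `01...3` = c(3, 4),
                        `02...4` = c(5, 6), `02...5` = c(7, 8),
                        check.names = FALSE, stringsAsFactors = FALSE)

# Work only on the numeric columns (everything except the first)
numeric_part <- dataframe[, -1, drop = FALSE]

# One pattern per distinct two-character prefix
patterns <- unique(substr(names(numeric_part), 1, 2))

# Sum the columns that share each prefix, row by row
sums <- sapply(patterns, function(xx) {
  rowSums(numeric_part[, startsWith(names(numeric_part), xx), drop = FALSE])
})

# Put the character column back next to the grouped sums
result <- data.frame(dept = dataframe$dept, sums, check.names = FALSE)
```

startsWith() is used here instead of grep() so that the dots in the column names are not interpreted as regular-expression wildcards.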

create flag based on row values in grep()

I have a 10-row data frame of tweets about potatoes and need to flag them based on the punctuation each tweet contains (question marks or exclamation points). The grep function will return the row numbers where these characters appear:
grep("\\?", potatoes$tweet)
grep("!", potatoes$tweet)
I've tried to create the flag variable question with mutate in dplyr as shown...
potatoes$question <- NA
potatoes <- mutate(potatoes, question = +row_number(grep("\\?", potatoes$tweet)))
Error in mutate_impl(.data, dots) :
Column `question` must be length 10 (the number of rows) or one, not 3
I'm also happy to consider more elegant solutions than conditioning on the output of grep. Any help appreciated!
We can use grepl instead of grep: grep returns the positions of the matching elements, whereas grepl returns a logical vector in which TRUE marks a match and FALSE a non-match. It can be used directly as a flag
i1 <- grepl("\\?", potatoes$tweet)
and if we need row numbers instead,
potatoes$question <- i1 * seq_len(nrow(potatoes))
Similarly, the index returned by grep can be used for assignment
i2 <- grep("\\?", potatoes$tweet)
potatoes$question[i2] <- seq_len(nrow(potatoes))[i2]
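For illustration, here is a small self-contained sketch that builds one flag per punctuation mark. The three tweets are invented, and the column names exclamation and question_num are assumptions rather than part of the question.

```r
potatoes <- data.frame(
  tweet = c("Are potatoes vegetables?", "I love potatoes!", "Potatoes are fine."),
  stringsAsFactors = FALSE
)

# fixed = TRUE matches the characters literally, so "?" needs no escaping
potatoes$question <- grepl("?", potatoes$tweet, fixed = TRUE)
potatoes$exclamation <- grepl("!", potatoes$tweet, fixed = TRUE)

# Optional 0/1 version of the logical flag
potatoes$question_num <- as.integer(potatoes$question)
```

Since grepl always returns one value per row, the result has the length that mutate expects, so the same expression also works inside mutate.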

How to change values that start with a certain letter to NA (in R)

I have a data frame that I'm using called "fish".
The data frame has 3 different variables. One of the variables is called "species".
There are some species that start with the letter M. I want to change all the values of species that start with the letter M to be missing (NA) instead.
I know how to change it to NA when you are doing the whole species name, but how do you do it for just species that START with the letter M?
I've tried this:
fish$species[fish$species=="^M_"] <- NA
But this doesn't work. Can anyone help?
You could use the replacement function is.na<-() along with startsWith().
is.na(fish$species) <- startsWith(fish$species, "M")
According to the R documentation help(startsWith),
startsWith() is equivalent to but much faster than grepl("^<prefix>", x), where prefix is not to contain special regular expression characters.
The code above assumes a character column. For a factor column, you can change the appropriate levels.
is.na(levels(fish$species)) <- startsWith(levels(fish$species), "M")
Another way would be to replace with levels<-(), using NA for the replacement on the right-hand-side.
levels(fish$species)[startsWith(levels(fish$species), "M")] <- NA
Note that you can definitely use grepl() if you'd like, but this question seems like a good example use of the new startsWith() function.
Also note that all these were successfully tested on the iris data set.
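A small sketch of the character and factor cases on made-up data. The species names and the second data frame fish_f are hypothetical and only serve to show the effect.

```r
# Character column: mark every species starting with "M" as missing
fish <- data.frame(species = c("M_salmoides", "L_macrochirus", "M_dolomieu"),
                   length = c(30, 15, 25),
                   stringsAsFactors = FALSE)
is.na(fish$species) <- startsWith(fish$species, "M")

# Factor column: setting the matching levels to NA drops them
# and turns the affected values into NA
fish_f <- data.frame(species = factor(c("M_salmoides", "L_macrochirus", "M_dolomieu")))
levels(fish_f$species)[startsWith(levels(fish_f$species), "M")] <- NA
```

In both cases only the species column changes. The other columns keep their values.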

Using grepl() in a particular type of pattern matching

I'm not sure how to do this, I have a feeling that I can use grepl() with this but I am not sure how.
I have a column in my dataset where I have names like "Abbot", "Baron", "William", and hundreds of other names, and many blanks/missing-values.
I want to extract the first letter of each name into a new column that contains only that letter, and if the value is missing, fill it in with "unknown".
Below I use a quick sapply statement and strsplit to grab the first letter. There is likely a better way to do this, but here's one solution. :)
test <- c('Abbot', 'Baron', 'William')
firstLetter <- sapply(test, function(x){unlist(strsplit(x,''))[1]})
What do you mean by "and if its missing a value then fill in with unknown"?
The following code using substr should be very fast with a large number of rows. It always returns the first letter and returns NA if the respective value in test$name is NA.
test <- data.frame(name = c('Abbot', 'Baron', 'William', NA))
test$first.letter <- substr(test$name, 1, 1)
If you want to convert all NA values in test$first.letter to 'unknown', you can do this afterwards:
test$first.letter <- ifelse(is.na(test$first.letter), "unknown", test$first.letter)
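The question mentions blanks as well as missing values. A minimal sketch that treats both the same way, assuming a blank is an empty string (the sample names and the intermediate variable first are illustrative):

```r
test <- data.frame(name = c("Abbot", "Baron", "", NA, "William"),
                   stringsAsFactors = FALSE)

# substr() is vectorized: "" stays "", and NA stays NA
first <- substr(test$name, 1, 1)

# Treat both empty strings and NA as unknown
first[is.na(first) | first == ""] <- "unknown"
test$first.letter <- first
```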
