Create column based on another one but without matching alphabetically - r

I created a column of names like this
df:
PC
1 word
2 word
3 Hate
4 Look
5 Check
I want to create another column based on this one, and I was able to do that with:
df <- df %>%
  mutate(PCcode = LETTERS[factor(PC)])
However, the new column assigns letters alphabetically, which I do not want. I need to assign letters from A to Z based on the order in which the values appear in the column, like this:
df:
PC PCcode
1 word A
2 word A
3 Hate B
4 Look C
5 Check D

You can use match + unique to get an index based on the first occurrence of each value.
transform(df, PCcode = LETTERS[match(PC, unique(PC))])
# PC PCcode
#1 word A
#2 word A
#3 Hate B
#4 Look C
#5 Check D
If you prefer to do it in dplyr :
library(dplyr)
df %>% mutate(PCcode = LETTERS[match(PC, unique(PC))])
data
df <- structure(list(PC = c("word", "word", "Hate", "Look", "Check"
)), class = "data.frame", row.names = c(NA, -5L))
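To see why the two approaches order the letters differently, a small base R sketch may help. It assumes the same PC values as the data above; the variable names (alphabetical, by_appearance, via_levels) are illustrative only. factor() sorts its levels, whereas match(PC, unique(PC)) numbers values by first appearance.

```r
PC <- c("word", "word", "Hate", "Look", "Check")

# factor() sorts its levels, so the integer codes follow alphabetical order
alphabetical <- LETTERS[as.integer(factor(PC))]

# match() against unique() numbers each value by first appearance
by_appearance <- LETTERS[match(PC, unique(PC))]

# Same result via a factor whose levels are given in order of appearance
via_levels <- LETTERS[as.integer(factor(PC, levels = unique(PC)))]

print(alphabetical)   # "D" "D" "B" "C" "A"
print(by_appearance)  # "A" "A" "B" "C" "D"
```

LETTERS has 26 elements, so with more than 26 distinct values the extra ones would come back as NA.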

Related

Creating new column with condition

I have this data set:
ID Type Frequency
1 A 0.136546185
2 A 0.228915663
3 B 0.006024096
4 C 0.008032129
I want to create a new column that changes the Frequency values less than 0.01 into "other" and keeps the other values as they are. Like this:
ID Type Frequency New_Frequency
1 A 0.136546185 0.136546185
2 A 0.228915663 0.228915663
3 B 0.006024096 other
4 C 0.008032129 other
I used mutate, but I don't know how to keep the original frequency for values of 0.01 or more.
Can you please help me?
You can't achieve exactly what you want in R, because you cannot mix characters and numerics in the same vector. If you are willing to convert everything to character, the other answers will work. If you want to keep the column numeric, you need to use NA rather than "other". You can also try the labelled package, which allows something like SPSS labels or SAS formats on numeric data.
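As a rough illustration of the NA option, here is a minimal base R sketch. It assumes the question's data and the 0.01 cutoff used in the other answers; the is_other column is a hypothetical addition that records which rows were blanked out.

```r
d <- data.frame(
  ID = 1:4,
  Type = c("A", "A", "B", "C"),
  Frequency = c(0.136546185, 0.228915663, 0.006024096, 0.008032129),
  stringsAsFactors = FALSE
)

# Combining a number and text in one vector coerces everything to character
mixed <- c(0.136546185, "other")
print(class(mixed))  # "character"

# Numeric alternative: NA marks the small values, a logical flag records why
small <- d$Frequency < 0.01
d$New_Frequency <- ifelse(small, NA_real_, d$Frequency)
d$is_other <- small  # hypothetical helper column

str(d)
```

New_Frequency stays numeric here, so it can still be summed or averaged (with na.rm = TRUE).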
Using mutate():
library(dplyr)
d <- tibble(ID = 1:4,
            Type = c("A", "A", "B", "C"),
            Frequency = c(0.136546185, 0.228915663, 0.006024096, 0.008032129))
d %>%
  mutate(New_Frequency = case_when(Frequency < .01 ~ "other",
                                   TRUE ~ as.character(Frequency)))
You can use ifelse
transform(df, Frequency = ifelse(Frequency < 0.01, 'Other', Frequency))
# ID Type Frequency
#1 1 A 0.136546185
#2 2 A 0.228915663
#3 3 B Other
#4 4 C Other
Note that the Frequency column is now character, since a column can hold data of only one type.

Remove prefix letter from column variables

I have all column names that start with 'm'. Example: mIncome, mAge. I want to remove the prefix. So far, I have tried the following:
df %>%
rename_all(~stringr::str_replace_all(.,"m",""))
This removes every 'm' from the column names. I just need it removed from the start. Any suggestions?
You can use sub in base R to remove "m" from the beginning of the column names.
names(df) <- sub('^m', '', names(df))
We need to specify the location. The ^ matches the start of the string (or here the column name). So, if we use ^m, it will only match 'm' at the beginning or start of the string and not elsewhere.
library(dplyr)
library(stringr)
df %>%
rename_all(~stringr::str_replace(.,"^m",""))
# ba Mbgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Also, if the case should be ignored, wrap with regex and specify ignore_case = TRUE
df %>%
rename_all(~ stringr::str_replace(., regex("^m", ignore_case = TRUE), ""))
# ba bgeg gmba cfor
#1 1 2 4 6
#2 2 3 5 7
#3 3 4 6 8
Another option is a word boundary (\\bm), but this could match the beginning of every word in multi-word column names.
NOTE: str_replace_all is used when we want to replace multiple occurrences of the pattern. Here, we just need to replace the first instance, and for that str_replace is enough.
data
df <- data.frame(mba = 1:3, Mbgeg = 2:4, gmba = 4:6, cfor = 6:8)
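The effect of the ^ anchor can be checked on plain strings. This is a small base R sketch, not part of the answer above; the names in cols are taken from the question and the sample data, and the variable names are illustrative.

```r
cols <- c("mIncome", "mAge", "gmba", "Mbgeg")

all_m    <- gsub("m", "", cols)                      # every lowercase "m" removed
leading  <- sub("^m", "", cols)                      # only a leading lowercase "m"
any_case <- sub("^m", "", cols, ignore.case = TRUE)  # leading "m" or "M"

print(all_m)     # "Incoe" "Age" "gba" "Mbgeg"
print(leading)   # "Income" "Age" "gmba" "Mbgeg"
print(any_case)  # "Income" "Age" "gmba" "bgeg"
```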
Another way you can try
library(tidyverse)
df <- data.frame(mma = 1:2, mbapbe = 1:2)
df2 <- df %>%
rename_at(vars(c("mma", "mbapbe")) ,function(x) gsub("^m", "", x))
# ma bapbe
# 1 1 1
# 2 2 2

How to select rows with a certain value in r?

I am trying to edit my dataframe but cannot seem to find the function that I need to sort this out.
I have a dataframe that looks roughly like this:
Title Description Rating
Beauty and the Beast a 2.5
Aladdin b 3
Coco c 2
etc.
(rating is between 1 and 3)
I am trying to edit my dataframe so that I get a new dataframe where there are no decimal numbers in the rating column.
i.e: the new dataframe would be:
Title Description Rating
Aladdin b 3
Coco c 2
As Beauty and the Beast's rating is not 1, 2 or 3.
I feel like there's a simple function in R that I just cannot find on Google, and I was hoping someone could help.
We can use subset (from base R) with a comparison on the integer converted values of 'Rating'
subset(df1, Rating == as.integer(Rating))
# Title Description Rating
#2 Aladdin b 3
#3 Coco c 2
Or if we are comparing with specific set of values, use %in%
subset(df1, Rating %in% 1:3)
data
df1 <- structure(list(Title = c("Beauty and the Beast", "Aladdin", "Coco"
), Description = c("a", "b", "c"), Rating = c(2.5, 3, 2)),
class = "data.frame", row.names = c(NA,
-3L))
You can get the remainder after dividing by 1 and select rows where the remainder is 0.
subset(df, Rating %% 1 == 0)
# Title Description Rating
#2 Aladdin b 3
#3 Coco c 2
You can use the filter function from dplyr. Note that this only drops rows whose rating is exactly 2.5:
library(dplyr)
df1 %>%
  filter(Rating != 2.5)
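The approaches above can be compared side by side. The following base R sketch assumes the question's data plus one invented row ("Hypothetical film", rating 1.5) added only to show where the approaches diverge; the result names are illustrative.

```r
df1 <- data.frame(
  Title = c("Beauty and the Beast", "Aladdin", "Coco", "Hypothetical film"),
  Description = c("a", "b", "c", "d"),
  Rating = c(2.5, 3, 2, 1.5),  # the last row is an invented example
  stringsAsFactors = FALSE
)

# Three ways to keep only whole-number ratings
whole_a <- subset(df1, Rating == as.integer(Rating))
whole_b <- subset(df1, Rating %% 1 == 0)
whole_c <- subset(df1, Rating %in% 1:3)

# Excluding one specific value leaves other fractional ratings in place
not_2_5 <- subset(df1, Rating != 2.5)

print(whole_a$Title)  # "Aladdin" "Coco"
print(not_2_5$Title)  # "Aladdin" "Coco" "Hypothetical film"
```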

select columns that do NOT start with a string using dplyr in R

I want to select columns from my tibble that end with the letter R AND do NOT start with a character string ("hc"). For instance, if I have a dataframe that looks like this:
name hc_1 hc_2 hc_3r hc_4r lw_1r lw_2 lw_3r lw_4
Joe 1 2 3 2 1 5 2 2
Barb 5 4 3 3 2 3 3 1
To do what I want, I've tried many options, but I'm surprised that this one doesn't work:
library(tidyverse)
data %>%
select(ends_with("r"), !starts_with("hc"))
When I try it, I get this error:
Error: !starts_with("hc") must evaluate to column positions or names, not a logical vector
I've also tried using negate() and get the same error.
library(tidyverse)
data %>%
select(ends_with("r"), negate(starts_with("hc")))
Error: negate(starts_with("hc")) must evaluate to column positions or names, not a function
I'd like to keep the answer within the dplyr select function because, once I select the variables, I'm going to end up reversing them by using mutate_at, so a tidy solution is best.
Thank you!
We can use - as the starts_with output is not a logical vector
library(dplyr)
data %>%
select(ends_with("r"), -starts_with("hc"))
# lw_1r lw_3r
#1 1 2
#2 2 3
data
data <- structure(list(name = c("Joe", "Barb"), hc_1 = c(1L, 5L), hc_2 = c(2L,
4L), hc_3r = c(3L, 3L), hc_4r = 2:3, lw_1r = 1:2, lw_2 = c(5L,
3L), lw_3r = 2:3, lw_4 = 2:1), class = "data.frame", row.names = c(NA,
-2L))
If you need a more advanced regular expression, use matches:
library(dplyr)
# Starts with any character except h or c and ends with an r
data %>% select(matches('^[^hc].*r$'))
lw_1r lw_3r
1 1 2
2 2 3
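For comparison, the same selection can be sketched in base R. This assumes the data object shown earlier; "cost_r" is a hypothetical column name used only to show that the character class [^hc] rejects more than the literal prefix "hc".

```r
data <- data.frame(
  name = c("Joe", "Barb"),
  hc_1 = c(1L, 5L), hc_2 = c(2L, 4L), hc_3r = c(3L, 3L), hc_4r = 2:3,
  lw_1r = 1:2, lw_2 = c(5L, 3L), lw_3r = 2:3, lw_4 = 2:1,
  stringsAsFactors = FALSE
)
nm <- names(data)

# Literal prefix/suffix tests, combined with logical operators
keep <- endsWith(nm, "r") & !startsWith(nm, "hc")

# One regex: a negative lookahead rejects the prefix "hc" as a whole
keep_regex <- grepl("^(?!hc).*r$", nm, perl = TRUE)

# "^[^hc]" is broader: it also rejects names starting with a single "c" or "h"
class_hit     <- grepl("^[^hc].*r$", "cost_r")                # FALSE
lookahead_hit <- grepl("^(?!hc).*r$", "cost_r", perl = TRUE)  # TRUE

print(nm[keep])  # "lw_1r" "lw_3r"
selected <- data[, keep, drop = FALSE]
```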

Regex solution for dataframe rownames

I have a dataframe returned from a function that looks like this:
df <- data.frame(data = c(1,2,3,4,5,6,7,8))
rownames(df) <- c('firsta','firstb','firstc','firstd','seconda','secondb','secondc','secondd')
firsta 1
seconda 5
firstb 2
secondb 6
my goal is to turn it into this:
df_goal <- data.frame(first = c(1,2,3,4), second = c(5,6,7,8))
rownames(df_goal) <- c('a','b','c','d')
first second
a 1 5
b 2 6
Basically the problem is that there is information in the row names that I can't discard because there isn't otherwise a way to distinguish between the column values.
This is a simple long-to-wide conversion; the twist is that we need to generate the key variable from the rownames by splitting the string appropriately.
In the data you present, each rowname is the concatenation of a "position" (i.e. 'first', 'second') and an id (i.e. 'a', 'b') stuck on the end. This structure makes splitting complicated: ideally, you'd use a separator (e.g. first_a, first_b) to make the separation unambiguous. Without a separator, our only option is to split on position, but that requires the splitting position to be a fixed distance from the start or end of the string.
In your example, the id is always the last single character, so we can pass -1 to the sep argument of separate to split off the last character as the ID column. If that wasn't always true, you would need to come up with a more complex solution to resolve the rownames.
Once you have converted the rownames into a "position" and "id" column, it's a simple matter to use spread to spread the position column into the wide format:
library(tidyverse)
df %>%
rownames_to_column('row') %>%
separate(row, into = c('num', 'id'), sep = -1) %>%
spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
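The same split-then-widen idea can also be sketched in base R, without tidyr. This is an alternative illustration, not the answer's method: it assumes the id is always the last character, and it uses substr and tapply in place of separate and spread.

```r
df <- data.frame(data = c(1, 2, 3, 4, 5, 6, 7, 8))
rownames(df) <- c("firsta", "firstb", "firstc", "firstd",
                  "seconda", "secondb", "secondc", "secondd")

rn <- rownames(df)
n  <- nchar(rn)
num <- substr(rn, 1, n - 1)  # everything except the last character
id  <- substr(rn, n, n)      # the last character only

# Cross-tabulate: one row per id, one column per position
wide <- tapply(df$data, list(id, num), FUN = function(v) v[1])
print(wide)
```

The result is a numeric matrix with ids as row names and positions as column names.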
If row ids could be of variable length, the above solution wouldn't work. If you have a known and limited number of "position" values, you could use a regex solution to split the rowname:
Here, we extract the position value by matching to a regex containing all possible values (| is the OR operator).
We match the "id" value by putting that same regex in a positive lookbehind. This regex will match 1 or more lowercase letters that come immediately after a match to the position value. The downside of this approach is that you need to specify all possible values of "position" in the regex. If there are many options, this could quickly become too long and difficult to maintain:
df2
data
firsta 1
firstb 2
firstc 3
firstd 4
seconda 5
secondb 6
secondc 7
secondd 8
secondee 9
df2 %>%
  rownames_to_column('row') %>%
  mutate(num = str_extract(row, 'first|second'),
         id = str_extract(row, '(?<=first|second)[a-z]+')) %>%
  select(-row) %>%
  spread(num, data)
id first second
1 a 1 5
2 b 2 6
3 c 3 7
4 d 4 8
5 ee NA 9
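A variant of the regex split that avoids lookbehind uses capture groups. This is a base R sketch under the same assumption as above, namely that the possible position values ('first', 'second') are known in advance; rn holds a few of the rownames from df2.

```r
rn <- c("firsta", "firstb", "seconda", "secondee")

# One pattern with two capture groups: the position, then the id
pattern <- "^(first|second)([a-z]+)$"

num <- sub(pattern, "\\1", rn)  # keep only the first group
id  <- sub(pattern, "\\2", rn)  # keep only the second group

print(num)  # "first" "first" "second" "second"
print(id)   # "a" "b" "a" "ee"
```

A rowname that does not match the pattern is returned unchanged by sub, so it may be worth checking grepl(pattern, rn) first.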
