Replace strings of numbers separated by commas with the median in R [duplicate] - r

This question already has answers here:
R: split string into numeric and return the mean as a new column in a data frame
(3 answers)
Closed 2 years ago.
I need help replacing the comma-separated string of numbers in each element of a column of my df with its median. For example,
a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")
b <- c("Karina", "Eva", "Jake", "Ana")
df <- data.frame(b,a)
Now I need to replace variable a with the median of the numbers contained in each element, so that it looks like this:
b a
1 Karina 4
2 Eva 6
3 Jake 4
4 Ana 6
A little background: each number is the length of a word that belongs to the corresponding name. I need to find the median length for each name and figure out whether names that start with a vowel have a longer median length or not. For example, from the above I would conclude that names that start with a vowel have a longer median length (6 versus 4). I then want to use a test to show whether the difference is statistically significant. If someone can guide me in any way, I would really appreciate it!

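We can split the 'a' column with strsplit on a comma followed by zero or more spaces (",\\s*"), loop over the resulting list with sapply, convert each element to numeric, take the median, and assign the result back to the same column. (If 'a' is a factor, as it is by default in R versions before 4.0, wrap it in as.character first, because strsplit only accepts character input.)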
df$a <- sapply(strsplit(df$a, ",\\s*"), function(x) median(as.numeric(x)))
df$a
#[1] 4 6 4 6
Or, using the tidyverse, we can use separate_rows to split the 'a' column into one row per number (convert = TRUE converts the values to numeric), then group by 'b' and take the median
library(dplyr)
library(tidyr)
df %>%
  separate_rows(a, convert = TRUE) %>%
  group_by(b) %>%
  summarise(a = median(a))
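The question also asks how to compare names that start with a vowel against the rest, which the answer above does not cover. The following is a minimal base R sketch of one way to do it, not the answerer's method. The column name starts_with_vowel is made up for illustration, and the Wilcoxon rank-sum test is only an assumption about which test suits the data; with four rows it cannot say anything meaningful, so treat it as a template for the full data set.

```r
# Rebuild the example data from the question
a <- c("3, 3, 5, 5", "7, 7, 5, 5", "3, 4, 4, 5", "5, 7")
b <- c("Karina", "Eva", "Jake", "Ana")
df <- data.frame(b, a, stringsAsFactors = FALSE)

# Median word length per name (same approach as the base R answer)
df$a <- sapply(strsplit(df$a, ",\\s*"), function(x) median(as.numeric(x)))

# Hypothetical helper column: TRUE when the name starts with a vowel
df$starts_with_vowel <- grepl("^[aeiou]", df$b, ignore.case = TRUE)

# Median of the per-name medians within each group
group_medians <- tapply(df$a, df$starts_with_vowel, median)

# Assumption: a Wilcoxon rank-sum test is one reasonable choice here;
# exact = FALSE requests the normal approximation, which tolerates ties
test_result <- wilcox.test(a ~ starts_with_vowel, data = df, exact = FALSE)
test_result$p.value
```

A rank-based test is used in the sketch because medians of short integer vectors are unlikely to be normally distributed; whether that holds for the real data is something to check before relying on it.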


How to recode values of a character column in a dataframe? [duplicate]

This question already has answers here:
How to remove + (plus sign) from string in R?
(3 answers)
Closed last year.
Beginner Question: What is a simple way to rename a variable observation in a dataframe column?
I have a dataframe "Stuff" with a column of categorical data called "Age", where one of the values is "Age80+". I've learned that R does not like "+" in a name,
e.g. Age80+ <- brings up an error
In column "Age" there are 7 other values, e.g. "Age18_30", so I cannot manually change the values efficiently.
I have looked but I haven't found a simple way to rename all "Age80+" to "Age80plus" without bringing in complicated packages like "stringr" or "dplyr". The dataframe has hundreds of "Age80+" observations.
Thank you
I have tried
Stuff$Age<- gsub("Age80+", "Age80plus", Stuff$Age)
But that changes "Age80+" to "Age80plus+", not "Age80plus"; the "+" is left behind.
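+ is a regular-expression metacharacter (a quantifier meaning "one or more of the preceding item"), so the pattern "Age80+" matches "Age80" and leaves the literal "+" behind. Escape it as \\+ if you want the actual character.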
dat <- transform(dat, age=gsub('Age80\\+', 'Age80plus', age))
dat
# id age x
# 1 1 Age80plus -0.9701187
# 2 2 Age80plus -0.5522213
# 3 3 Age80plus -1.6060125
# 4 4 Age60 -1.5417523
# 5 5 Age40 -1.9090871
Data:
dat <- structure(list(id = 1:5, age = c("Age80+", "Age80+", "Age80+",
"Age60", "Age40"), x = c(-0.970118672988532, -0.552221336521097,
-1.60601248510621, -1.54175233366043, -1.909087068272)), class = "data.frame", row.names = c(NA,
-5L))
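As a small illustration of the point above, here is a sketch comparing the unescaped pattern, the escaped pattern, and gsub's fixed = TRUE argument, which treats the pattern as a literal string instead of a regular expression. The vector age is made-up sample data, not the asker's dataframe.

```r
# Made-up sample values modelled on the question
age <- c("Age80+", "Age18_30", "Age80+")

# Unescaped: "0+" means "one or more zeros", so the literal "+" is left behind
unescaped <- gsub("Age80+", "Age80plus", age)

# Escaped: "\\+" matches a literal plus sign
escaped <- gsub("Age80\\+", "Age80plus", age)

# fixed = TRUE: the whole pattern is matched literally, no escaping needed
literal <- gsub("Age80+", "Age80plus", age, fixed = TRUE)

unescaped
escaped
literal
```

Both the escaped and the fixed = TRUE versions stay within base R, so no extra package is required.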

Finding the maximum value for each row and extract column names [duplicate]

This question already has answers here:
R Create column which holds column name of maximum value for each row
(4 answers)
Closed 1 year ago.
Say we have the following matrix,
x <- matrix(1:9, nrow = 3, dimnames = list(c("X","Y","Z"), c("A","B","C")))
What I'm trying to do is:
1- Find the maximum value of each row. For this part, I'm doing the following,
df <- apply(X=x, MARGIN=1, FUN=max)
2- Then, I want to extract the column names of the maximum values and put them next to the values. Following the reproducible example, it would be "C" for the three rows.
Any assistance would be wonderful.
You can use apply like
maxColumnNames <- apply(x,1,function(row) colnames(x)[which.max(row)])
Since you have a numeric matrix, you can't add the names as an extra column (the whole matrix would be converted to a character matrix).
You can choose a data.frame and do
resDf <- cbind(data.frame(x),data.frame(maxColumnNames = maxColumnNames))
resulting in
resDf
A B C maxColumnNames
X 1 4 7 C
Y 2 5 8 C
Z 3 6 9 C
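For comparison, here is a sketch of an alternative that uses base R's max.col to find the position of the row maximum without an explicit function. The result names res, maxValue and maxColumn are chosen for illustration only. Setting ties.method = "first" is a deliberate choice here, because the default for max.col breaks ties at random.

```r
x <- matrix(1:9, nrow = 3, dimnames = list(c("X", "Y", "Z"), c("A", "B", "C")))

# Column index of the largest value in each row; the first one wins on ties
max_idx <- max.col(x, ties.method = "first")

# Put the row maxima and the matching column names side by side
res <- data.frame(
  maxValue  = apply(x, 1, max),
  maxColumn = colnames(x)[max_idx],
  stringsAsFactors = FALSE
)
res
```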

Clean special character, numeric and character

I have a variable like below in my dataframe
df$emp_length(10+ years, <1 year, 8 years)
I need to clean this variable for better analysis. For example, I want to compare this variable with other categorical or numerical variables. What is the best way to separate this variable into multiple columns?
I am thinking to separate this variable based on space something like below,
df$emp_length = c("10+", "<1", "8")
df$years = c("years", "years", "years")
Also I would like to know if the number with special characters like + and < will be considered as numeric in R or I have to separate special character and numbers?
I want to have emp_length variable as numeric and years variable as character.
Please help!
One can use tidyr::extract to first separate emp_length into 2 columns. Then, in the column holding the number, replace any symbol (anything other than 0-9) with "" and convert the result to numeric.
Option#1: Keep the symbol with number
library(tidyverse)
df <- df %>%
  extract(emp_length, c("emp_length", "years"),
          regex = "([[:digit:]+<]+)\\s+(\\w+)")
df
# emp_length years
# 1 10+ years
# 2 <1 year
# 3 8 years
Option#2: Keep only the number, so the column is numeric
library(tidyverse)
df <- df %>%
  extract(emp_length, c("emp_length", "years"), regex = "([[:digit:]+<]+)\\s+(\\w+)") %>%
  # Drop every non-digit character, then convert to numeric
  mutate(emp_length = as.numeric(gsub("[^0-9]", "", emp_length)))
df
# emp_length years
# 1 10 years
# 2 1 year
# 3 8 years
Data:
df <- data.frame(emp_length = c("10+ years", "<1 year", "8 years"),
stringsAsFactors = FALSE)
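Option#2 turns both "10+" and "<1" into plain numbers, which discards the information carried by the symbol. One possible way to keep it, sketched here in base R, is to store the symbol in its own column next to the numeric value. The column names years_num and qualifier are hypothetical, and the pattern assumes every value looks like an optional "<", some digits, an optional "+", then the unit.

```r
df <- data.frame(emp_length = c("10+ years", "<1 year", "8 years"),
                 stringsAsFactors = FALSE)

# Numeric part: strip everything that is not a digit
df$years_num <- as.numeric(gsub("[^0-9]", "", df$emp_length))

# Qualifier: keep only a leading "<" or a "+" directly after the digits
df$qualifier <- sub("^(<?)[0-9]+(\\+?).*$", "\\1\\2", df$emp_length)

df
```

With the qualifier kept separately, years_num can be used in numeric comparisons while "<1" and "10+" remain distinguishable from exact values.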

How can I cross tabulate multiple select and single select questions in R [duplicate]

This question already has answers here:
Split comma-separated strings in a column into separate rows
(6 answers)
Closed 5 years ago.
Problem description
I've run a survey with a multiple select question, where the output is separated by commas in one column, and a grouping question (e.g. sex). Now I want to cross tabulate those 2 variables.
Sample data
My data comprises 2 columns:
A multiple select question, which the survey software outputs as one column with commas separating the selection
A grouping variable, in this case male or female
dat <- data.frame(Multiple = c("A,B,C","B","A,C"), Sex = c("M","F","F"))
Desired output
I want to cross tabulate the multiple select options (without commas) with sex:
Multiple Sex Count
A M 1
B M 1
C M 1
A F 1
B F 1
C F 1
Attempted solution
This is a partial solution where I count the elements in the multiple select question only. My problem is that I don't know how to include the grouping variable sex into this function because I am using a regular expression to count the elements in the comma separated vector:
MSCount <- function(X){
  # Function to count values in a comma separated vector
  # Find the possible options from the data alone, e.g. "A", "B" etc.
  Answers <- sort(unique(unlist(strsplit(as.character(X), ","))))
  # Drop blank answers (logical indexing is safe even when there are none)
  Answers <- Answers[Answers != ""]
  # Initialise the counts, one per answer
  CountAnswers <- numeric(length(Answers))
  # Loop round and count the rows with a match for the answer text
  for(i in seq_along(Answers)){
    CountAnswers[i] <- sum(grepl(Answers[i], X))
  }
  SummaryAnswers <- data.frame(Answers, CountAnswers, PropAnswers = 100*CountAnswers/length(X[!is.na(X)]))
  return(SummaryAnswers)
}
We can use separate_rows to split the 'Multiple' column into one row per selected option, then add a Count column:
library(tidyverse)
separate_rows(dat, Multiple) %>%
  mutate(Count = 1) %>%
  arrange(Sex, Multiple) %>%
  select(Multiple, Sex, Count)
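The answer above assigns Count = 1 to every row, which matches the desired output only because no option/sex pair repeats in the sample data. For data where pairs do repeat, the rows need to be tallied. The following base R sketch shows one way to do that with strsplit and table; the object names parts, long, tab and counts are illustrative.

```r
dat <- data.frame(Multiple = c("A,B,C", "B", "A,C"), Sex = c("M", "F", "F"),
                  stringsAsFactors = FALSE)

# Split each multiple-select answer into its individual options
parts <- strsplit(dat$Multiple, ",", fixed = TRUE)

# One row per selected option, repeating Sex once for each option
long <- data.frame(Multiple = unlist(parts),
                   Sex = rep(dat$Sex, lengths(parts)),
                   stringsAsFactors = FALSE)

# Cross tabulate options against sex
tab <- table(Multiple = long$Multiple, Sex = long$Sex)

# Long format with one row per option/sex pair and its count
counts <- as.data.frame(tab, responseName = "Count")
counts
```

The table object tab is the cross tabulation itself, while counts has the Multiple, Sex, Count layout shown in the desired output.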

Count number of unique values per row [duplicate]

This question already has answers here:
Count of unique elements of each row in a data frame in R
(3 answers)
Closed 5 years ago.
I want to count the number of unique values per row.
For instance with this data frame:
example <- data.frame(var1 = c(2,3,3,2,4,5),
var2 = c(2,3,5,4,2,5),
var3 = c(3,3,4,3,4,5))
I want to add a column which counts the number of unique values per row; e.g. 2 for the first row (as there are 2's and 3's in the first row) and 1 for the second row (as there are only 3's in the second row).
Does anyone know an easy code to do this? Up until now I only found code for counting the number of unique values per column.
This apply function returns a vector of the number of unique values in each row:
apply(example, 1, function(x)length(unique(x)))
You can append it to your data.frame in one of the following two ways (here the new column is named count):
example <- cbind(example, count = apply(example, 1, function(x)length(unique(x))))
or
example$count <- apply(example, 1, function(x)length(unique(x)))
We can also use a vectorized approach with regex. After pasting together the elements of each row of the dataset (do.call(paste0, ...)), we match any character and capture it as a group ((.)), then use a positive lookahead ((?=.*?\\1), where \\1 is the backreference to the captured group) so that the character is matched only if it appears again later in the string, and replace it with blank (""). In effect, only one copy of each distinct character remains. Then nchar counts the characters left in the string. Note that this works character by character, so it assumes every value is a single character (for example, a single digit).
example$count <- nchar(gsub("(.)(?=.*?\\1)", "", do.call(paste0, example), perl = TRUE))
example$count
#[1] 2 1 3 3 2 1
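Because the regex approach counts distinct characters in the pasted string, the two methods can disagree once a value has more than one digit. The sketch below shows this on made-up data (example2 is not from the question).

```r
# Made-up data with multi-digit values
example2 <- data.frame(var1 = c(11, 2),
                       var2 = c(1, 2),
                       var3 = c(12, 3))

# Row-wise count of distinct values
apply_count <- unname(apply(example2, 1, function(x) length(unique(x))))

# Regex count of distinct characters in the pasted row ("11112" and "223")
regex_count <- nchar(gsub("(.)(?=.*?\\1)", "",
                          do.call(paste0, example2), perl = TRUE))

apply_count  # 3 2
regex_count  # 2 2
```

The first row holds three distinct values (11, 1, 12) but only two distinct characters, so the apply version is the safer choice unless all values are known to be single characters.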
