Convert multiple character columns with numbers and <, > symbols into numeric in R

I have a dataset with multiple character columns with numbers and >,< signs.
I want to change them all to numeric.
Values of the form "<x" should be halved (x/2), and values of the form ">x" should be set to x.
Sample dataframe and my approach (data=labor_df):
  a     b     c
1 "1"   "9"   "20"
2 "<10" "14"  "1.99"
3 "12"  ">5"  "14.5"
library(stringr)  # provides str_extract()
# values marked "<" are extracted and halved
half.value.a <- as.numeric(str_extract(
  labor_df$"a"[which(grepl(labor_df$"a", pattern = "<", fixed = TRUE))],
  "\\d+\\.*\\d*")) / 2
# values marked ">" are extracted and kept as they are
min.value.a <- as.numeric(str_extract(
  labor_df$"a"[which(grepl(labor_df$"a", pattern = ">", fixed = TRUE))],
  "\\d+\\.*\\d*"))
labor_df$"a"[which(grepl(labor_df$"a", pattern = "<", fixed = TRUE))] <- half.value.a
labor_df$"a"[which(grepl(labor_df$"a", pattern = ">", fixed = TRUE))] <- min.value.a
labor_df$"a" <- as.numeric(labor_df$"a")
I would like to apply this to multiple columns in my df or use a different approach entirely to convert multiple columns in my df to numeric.

You can apply this approach to whichever columns you want. To apply it to columns 1 through 3, use labor_df[1:3]. To apply it to specific columns by name, create a cols vector containing those column names and use labor_df[cols] instead.
The first gsub will remove the greater than sign, and keep the value unchanged. The ifelse is vectorized and will apply to all values in the column. It will first check with grepl if less than sign is present; if it is, remove it, convert to a numeric value, and then divide by 2. Otherwise leave as is.
labor_df[1:3] <- lapply(labor_df[1:3], function(x) {
  x <- gsub(">", "", x)
  x <- ifelse(grepl("<", x), as.numeric(gsub("<", "", x)) / 2, x)
  as.numeric(x)
})
labor_df
Output
a b c
1 1 9 20.00
2 5 14 1.99
3 12 5 14.50
Data
labor_df <- structure(list(a = c("1", "<10", "12"), b = c("9", "14", ">5"
), c = c("20", "1.99", "14.5")), class = "data.frame", row.names = c(NA,
-3L))
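As a variation on the same idea, the conversion can be pulled out into a small named function and applied by column name. This is only a sketch: parse_censored and cols are hypothetical names chosen for illustration, and it assumes every value is a plain number optionally prefixed by "<" or ">".

```r
# Hypothetical helper: convert one character vector with "<x" / ">x" entries.
parse_censored <- function(x) {
  # Strip either sign and convert what is left to a number.
  value <- as.numeric(gsub("[<>]", "", x))
  # Halve the values that carried "<"; keep all others (including ">x") as x.
  ifelse(grepl("<", x, fixed = TRUE), value / 2, value)
}

labor_df <- data.frame(
  a = c("1", "<10", "12"),
  b = c("9", "14", ">5"),
  c = c("20", "1.99", "14.5"),
  stringsAsFactors = FALSE
)

# Assumed column selection; replace with the names of the columns to convert.
cols <- c("a", "b", "c")
labor_df[cols] <- lapply(labor_df[cols], parse_censored)
str(labor_df)
```

Computing the numeric value once and then choosing between value and value / 2 keeps ifelse() working on numbers only, so no intermediate character result needs to be converted afterwards.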

Related

How to export values from the list of lists into a data frame?

I have a very complicated nested list structure. I have been struggling with this problem for a long time, and I really need help.
Here's the sample code to make my question more reproducible
t <- list(
  list(answers = list(list(values = "male"),
                      list(values = "6"),
                      list(values = "9"),
                      list(values = "9"),
                      list(values = "other"))
  ),
  list(answers = list(list(values = "145"))
  )
)
What I need are the values which are in the answers (from the each list).
I need this to look like a data frame: each list is a column (variable), and each value in the first list in answers is a value of that variable. Like this:
> d <- data.frame("1" = "male", "2" = 6,
+ "3" = 9, "4.1" = 9,
+ "4.2" = 8,
+ "5" = "other",
+ "6" = 145)
> d
X1 X2 X3 X4 X5 X6
1 male 6 9 9 other 145
The other issue is that values in the 1st list of answers can contain multiple values. And I really do not know how to deal with it as I need to assign variables their values accurately.
So, I cannot imagine how to get this. Intuitively, I think that lapply() may help me, but I do not know how to use it properly.
Using your sample data:
results = unlist(lapply(t, "[[", "answers"))
names(results) = paste0("X", seq_along(results))
results = as.data.frame(t(results))
# X1 X2 X3 X4 X5 X6
# 1 male 6 9 9 other 145
The numbers are class character here. You may want to use type.convert(results), which will convert them to numerics (though it will also convert the remaining strings to factors unless you pass as.is = TRUE).
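A minimal sketch of that conversion step, using a small stand-in for the one-row results data frame (the columns here are abbreviated for illustration). The as.is = TRUE argument asks type.convert() to leave non-numeric strings as character rather than turning them into factors.

```r
# Stand-in for the one-row, all-character data frame built above.
results <- data.frame(X1 = "male", X2 = "6", X6 = "145",
                      stringsAsFactors = FALSE)

# Convert each column to the most specific type that fits its contents.
converted <- type.convert(results, as.is = TRUE)

# Inspect the resulting column classes.
sapply(converted, class)
```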
Tricky one. Here are my thoughts. Also, not so helpful to not provide a dput of your data.
First I have to recreate a data set that resembles yours (extra work for me):
test <- list(
  list(
    answers = list(
      values = list("6", "8", "4", "11", "18"),
      question = list("some_text_1", "some_text_2", "some_text_3"))
  ),
  list(
    answers = list(
      values = list("male"),
      question = list("some_text_4", "some_text_5", "some_text_6"))
  )
)
With some effort I can do this:
l1 <- lapply(test, function(x) lapply(x,`[[`, 1))
l2 <- unlist(l1, recursive = FALSE)
l3 <- unlist(l2, recursive = FALSE)
With this result:
> l3
$answers1
[1] "6"
$answers2
[1] "8"
$answers3
[1] "4"
$answers4
[1] "11"
$answers5
[1] "18"
$answers
[1] "male"
Or simpler:
unlist(l1)
But the latter loses structure, and all values vectors end up as a single character vector.
In your list, I think this would give you all the values vectors at the 3rd nested level as a list with elements of unequal length. Because your values vectors have unequal length I would probably not try to coerce this to a data frame. Is this close enough?
UPDATE
With the update in the data set we can now do:
l1 <- lapply(t, `[[`, 1)
l2 <- unlist(l1, recursive = FALSE)
df <- as.data.frame(l2)
with this output:
> df
values values.1 values.2 values.3 values.4 values.5
1 male 6 9 9 other 145
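The question also mentions that a single answer can hold several values. One possible way to keep one column per answer in that case is to collapse each answer's values into a single string. This is a sketch under assumptions: the ";" separator, the name t_list, and the sample data with a two-value answer are all made up for illustration.

```r
# Made-up sample in the same shape as t, with one multi-value answer.
t_list <- list(
  list(answers = list(list(values = "male"),
                      list(values = c("9", "8")))),
  list(answers = list(list(values = "145")))
)

# Flatten one level so that every element is a single answer.
answers <- unlist(lapply(t_list, `[[`, "answers"), recursive = FALSE)

# Collapse the values of each answer into one string (assumed separator ";").
vals <- vapply(answers, function(a) paste(a$values, collapse = ";"),
               character(1))
names(vals) <- paste0("X", seq_along(vals))

# One row, one column per answer.
out <- as.data.frame(as.list(vals), stringsAsFactors = FALSE)
out
```

If separate columns per value are needed instead (such as 4.1 and 4.2 in the question), the collapsed strings could be split again afterwards, but the naming scheme for those columns would have to be decided first.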

R regexp for odd sorting of a char vector

I have several hundred files that need their columns sorted in a convoluted way. Imagine a character vector x which is the result of names(foo) where foo is a data.frame:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
I'd like to have it ordered according to the following rule: First, alphabetical for anything starting with "Component". Second, alphabetical for anything remaining starting with "C" and a number. Third anything remaining in alphabetical order.
For x that would be:
x[c(3,4,6,5,9,8,2,7,1)]
Is this a regexp kind of task? And does one use match? Each file will have a different number of columns (so x will be of varying lengths). Any tips appreciated.
You can achieve that with the function order from base R:
x <- c("x1","i2","Component.1","Component.10","Component.143","Component.13",
"r4","A","C16:1n-7")
order(
!startsWith(x, "Component"), # 0 - starts with component, 1 - o.w.
!grepl("^C\\d", x), # 0 - starts with C<NUMBER>, 1 - o.w.
x # alphabetical
)
# output: 3 4 6 5 9 8 2 7 1
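Since the question involves several hundred files, the same order() call can be wrapped in a function that reorders the columns of any data frame. This is a rough sketch: sort_columns is a hypothetical name, and foo is a dummy data frame built only to have the column names from the question.

```r
# Hypothetical wrapper: reorder columns by the three rules from the question.
sort_columns <- function(df) {
  nm <- names(df)
  idx <- order(
    !startsWith(nm, "Component"),  # FALSE sorts first: "Component..." names
    !grepl("^C\\d", nm),           # then names starting with C and a digit
    nm                             # alphabetical within each group
  )
  df[idx]
}

x <- c("x1", "i2", "Component.1", "Component.10", "Component.143",
       "Component.13", "r4", "A", "C16:1n-7")

# Dummy data frame with those column names.
foo <- as.data.frame(matrix(0, nrow = 2, ncol = length(x),
                            dimnames = list(NULL, x)))
original_names <- names(foo)

sorted <- sort_columns(foo)
names(sorted)
```

Because the function only looks at names(df), it works for files with different numbers of columns. Note that the alphabetical tie-break compares names as strings, so "Component.10" sorts before "Component.2"; the mixedsort approach is the one to use if numeric ordering within a group matters.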
A brute-force solution using only base R:
first = sort(x[grepl('^Component', x)])
second = sort(x[grepl('^C\\d', x)])
third = sort(setdiff(x, c(first, second)))
c(first, second, third)
We can split it into different elements and then use mixedsort from gtools
v1 <- c(gtools::mixedsort(grep("Component", x, value = TRUE)),
gtools::mixedsort(grep("^C\\d+", x, value = TRUE)))
c(v1, gtools::mixedsort(x[!x %in% v1]))
#[1] "Component.1" "Component.10" "Component.13" "Component.143" "C16:1n-7" "A" "i2" "r4"
#[9] "x1"
Or another option in select assuming that these are the columns of the data.frame
library(dplyr)
df1 %>%
  select(gtools::mixedsort(starts_with('Component')),
         gtools::mixedsort(names(.)[matches("^C\\d+")]),
         gtools::mixedsort(names(.)[everything()]))
If it is just the order of occurrence
df1 %>%
select(starts_with('Component'), matches('^C\\d+'), sort(names(.)[everything()]))
data
set.seed(24)
df1 <- as.data.frame(matrix(rnorm(5 * 9), ncol = 9,
dimnames = list(NULL, x)))

R - How to rename every nth column with "name_x" where x=1 and increases by 1 for each column?

I have a data set where the names of the columns are very messy, and I want to simplify them. Example data below:
structure(list(MemberID = 1L, This.was.the.first.question = "ABC",
This.was.the.first.date = 1012018L, This.was.the.first.city = "New York",
This.was.the.second.question = "XYZ", This.was.the.second.date = 11052018L,
This.was.the.second.city = "Boston"), .Names = c("MemberID",
"This.was.the.first.question", "This.was.the.first.date", "This.was.the.first.city",
"This.was.the.second.question", "This.was.the.second.date", "This.was.the.second.city"
), class = "data.frame", row.names = c(NA, -1L))
MemberID This was the first question This was the first date This was the first city This was the second question This was the second date This was the second city
1 ABC 1012018 New York XYZ 11052018 Boston
This is what I want the columns to look like:
MemberID Question_1 Date_1 City_1 Question_2 Date_2 City_2
So essentially the column name is the same, but every 3rd column the number increases by 1. How would I do this? While this example data set is small, my real data set is much larger, and I want to learn how to do this by column indexing and iteration.
An easier option is to remove the substring except the last word and use make.unique
names(df1)[-1] <- make.unique(sub(".*\\.", "", names(df1)[-1]), sep="_")
names(df1)
#[1] "MemberID" "question" "date" "city" "question_1" "date_1" "city_1"
Or if we need the exact output as expected, extract the last word with sub and use ave to create the sequence based on duplicate names
v1 <- sub(".*\\.(\\w)", "\\U\\1", names(df1)[-1], perl = TRUE)
names(df1)[-1] <- paste(v1, ave(v1, v1, FUN = seq_along), sep="_")
names(df1)
#[1] "MemberID" "Question_1" "Date_1" "City_1"
#[5] "Question_2" "Date_2" "City_2"
#
# create vector of question name triplets
theList <- c("question_","date_","city_")
# create enough headings for 10 questions
questions <- rep(theList,10)
idNumbers <- 1:length(questions)
library(numbers)
# use mod function to group ids into triplets
idNumbers <- as.character(ifelse(mod(idNumbers,3)>0,floor(idNumbers/3)+1,floor(idNumbers/3)))
# concatenate question stems with numbers and add MemberID column at start of vector
questionHeaders <- c("MemberID",paste0(questions,idNumbers))
head(questionHeaders)
...and the output:
[1] "MemberID" "question_1" "date_1" "city_1" "question_2" "date_2"
Use the colnames() or names() function to assign this vector as the column names of the data frame.
As noted in the comments on the OP, the question ID numbers can be generated by using the each= argument in rep(), eliminating the need for the mod() function.
idNumbers <- rep(1:10,each = 3)
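Putting the rep() idea together with the assignment step, a base R sketch might look like the following. The names stems and n_groups are introduced here for illustration, and the code assumes the columns after MemberID always come in complete question/date/city triplets.

```r
# Sample data from the question.
df1 <- data.frame(
  MemberID = 1L,
  This.was.the.first.question = "ABC",
  This.was.the.first.date = 1012018L,
  This.was.the.first.city = "New York",
  This.was.the.second.question = "XYZ",
  This.was.the.second.date = 11052018L,
  This.was.the.second.city = "Boston",
  stringsAsFactors = FALSE
)

stems <- c("Question", "Date", "City")
# Number of triplets after the MemberID column (assumed to divide evenly).
n_groups <- (ncol(df1) - 1) / length(stems)

# Repeat the stems once per group and the group number once per stem.
names(df1)[-1] <- paste(
  rep(stems, times = n_groups),
  rep(seq_len(n_groups), each = length(stems)),
  sep = "_"
)
names(df1)
```

Deriving n_groups from ncol(df1) avoids hard-coding the number of questions, so the same lines apply to a wider data set with the same triplet layout.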

Returning specific values within a row

I have 1 row of data and 50 columns in the row from a csv which I've put into a dataframe. The data is arranged across the spreadsheet like this:
"FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"...
How would I select only the middle part of each element (e.g., "DFGS" in the first one, "SGRE" in the second, etc.), count their occurrences and display the results?
I have tried using the strsplit function, but I couldn't get it to work for the entire row of data. I'm thinking a loop of some kind might be what I need.
You can do unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)] (assuming your data is consistently of the form A-B-C).
# E.g.
fun <- function(x) unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)]
fun(c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"))
# [1] "DFGS" "SGRE" "DFGS"
Edit
# Data frame
df <- structure(list(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS"),
class = "data.frame", row.names = c(NA, -1L))
fun(t(df[1,]))
# [1] "DFGS" "SGRE" "DFGS"
First we create a function strng() and then we apply() it on every column of df. strsplit() splits a string by "-" and strng() returns the second part.
df = data.frame(a = "ab-bc-ca", b = "gn-bc-ca", c = "kj-ll-mn")
strng = function(x) {
strsplit(x,"-")[[1]][2]
}
# table() outputs frequency of elements in the input
table(apply(df, MARGIN = 2, FUN = strng))
# output:
# bc ll
#  2  1
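For the data in the question, the split, extract, and count steps can be combined without writing an explicit loop. This is a sketch that assumes every cell has the form A-B-C and uses only the three sample values shown; parts, middle, and counts are illustrative names.

```r
# One-row data frame with the sample values from the question.
df <- data.frame(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF",
                 c = "DIDC-DFGS-LEMS", stringsAsFactors = FALSE)

# Turn the single row into a character vector and split each element on "-".
parts <- strsplit(unlist(df[1, ]), "-", fixed = TRUE)

# Take the second piece of every element.
middle <- vapply(parts, `[`, character(1), 2)

# Count how often each middle part occurs.
counts <- table(middle)
counts
```

vapply() is used instead of sapply() so that an element without a second piece produces NA of a known type rather than silently changing the shape of the result.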

R update if statement to add count

I'm trying to count how many columns contain text per row. I have the following that tells me if all columns contain text:
df = structure(list(Participant = 1:3, A = c("char", "foo", ""), B = c("char2", 0L, 0L)), .Names = c("Participant", "A", "B"), row.names = c(NA, -3L), class = "data.frame")
df$newcolumn <- ifelse(nchar(df$A)>1 & nchar(df$B)>1, "yes", "no")
Instead of "Yes" or "No" I want a count of how many matches occur. Ideas?
Using your logic you can try something like the following:
df$newcolumn <- (nchar(df$A)>1) + (nchar(df$B)>1)
df
  Participant    A     B newcolumn
1           1 char char2         2
2           2  foo     0         1
3           3          0         0
If we need to get the nchar per row, loop through the columns of interest, get the nchar, and use Reduce with + to get the sum per each row
df$CountNChar <- Reduce(`+`, lapply(df[-1], nchar))
Or if we need the sum of logical condition, just change the nchar to nchar(x) > 1 (with anonymous function call)
df$CountNChar <- Reduce(`+`, lapply(df[-1], function(x) nchar(x) >1))
df$CountNChar
#[1] 2 1 0
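The same per-row count can also be written with rowSums(), which may read more directly when there are many columns. This is a sketch: text_cols is a hypothetical vector naming the columns to check, and the nchar(x) > 1 threshold is carried over from the question.

```r
# Sample data from the question (column B is character after coercion).
df <- data.frame(Participant = 1:3,
                 A = c("char", "foo", ""),
                 B = c("char2", "0", "0"),
                 stringsAsFactors = FALSE)

# Assumed set of columns to inspect.
text_cols <- c("A", "B")

# nchar() keeps the matrix shape, so the comparison yields a logical matrix
# and rowSums() counts the TRUE entries in each row.
df$newcolumn <- unname(rowSums(nchar(as.matrix(df[text_cols])) > 1))
df$newcolumn
```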
You appear to be trying to count the number of rows where df$A and df$B both have more than one character in them. The easiest way to do this is with sum, since logical vectors can be added up just like numeric or integer vectors. Thus, the code fragment you want is
sum(nchar(df$A)>1 & nchar(df$B)>1)
However, looking at your first sentence, you should be aware that only one type of data can exist in a column of a data frame. c("foo",0L,0L) is a vector of class "character", with elements "foo","0","0".
