R: update if statement to add count

I'm trying to count how many columns contain text per row. I have the following, which tells me whether all columns contain text:
df = structure(list(Participant = 1:3, A = c("char", "foo", ""), B = c("char2", 0L, 0L)), .Names = c("Participant", "A", "B"), row.names = c(NA, -3L), class = "data.frame")
df$newcolumn <- ifelse(nchar(df$A)>1 & nchar(df$B)>1, "yes", "no")
Instead of "Yes" or "No" I want a count of how many matches occur. Ideas?

Using your logic you can try something like the following:
df$newcolumn <- (nchar(df$A)>1) + (nchar(df$B)>1)
df
  Participant    A     B newcolumn
1           1 char char2         2
2           2  foo     0         1
3           3          0         0

If we need the nchar per row, loop through the columns of interest, get the nchar, and use Reduce with `+` to get the sum for each row:
df$CountNChar <- Reduce(`+`, lapply(df[-1], nchar))
Or if we need the sum of the logical condition, just change nchar to an anonymous function returning nchar(x) > 1:
df$CountNChar <- Reduce(`+`, lapply(df[-1], function(x) nchar(x) > 1))
df$CountNChar
#[1] 2 1 0

You appear to be trying to count the number of rows where df$A and df$B have more than one character in them. The easiest way to do this is with sum, since logical vectors can be added up just like numeric or integer vectors. Thus, the code fragment you want is
sum(nchar(df$A)>1 & nchar(df$B)>1)
However, looking at your first sentence, you should be aware that only one type of data can exist in a column of a data frame: c("foo", 0L, 0L) is a vector of class "character", with elements "foo", "0", "0".
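A quick demonstration of both points (type coercion in a column, and summing a logical vector):

```r
# Mixing "foo" with integers in one vector coerces everything to character,
# because an atomic vector (and hence a data frame column) holds one type.
x <- c("foo", 0L, 0L)
class(x)  # "character"
x         # "foo" "0" "0"

# Logical vectors add up like integers, so sum() counts the TRUEs:
sum(c(TRUE, FALSE, TRUE))  # 2
```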

Related

Extract specific portion of a string and paste to a new column in R

I have the following dataframe with a string column, and I want to extract the T, N, M, G, L status (and so on) for each observation into separate new columns, including their respective prefix and suffix. I have tried grep() and strsplit, but the resulting columns have differing numbers of rows due to NA values, and it doesn't seem to work. I'm not an expert in coding and I'd really appreciate your support for a working script. Thanks in advance.
df <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                           "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                           "cT3;cN0;M0"))
The expected output should look like
df <- data.frame(input = c("cT1b;cN1a;cM0;G3",
                           "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                           "cT3;cN0;M0"),
                 T_output = c("cT1b", "pT1a", "cT3"),
                 G_output = c("G3", "G1", NA),
                 L_output = c(NA, "L0", NA))
grep is typically for finding matches (grepl for TRUE/FALSE, grep for indices, or whole strings that contain a substring with value=TRUE), but not for extracting substrings from a whole string. For extraction, look into sub/gsub, regmatches/gregexpr, or stringr::str_extract/str_extract_all. However, I think that's not the best (well, certainly not the only) approach.
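As a minimal illustration of that matching-versus-extracting distinction (using the first input string; the `[cp]T[^;]*` pattern here is just one way to describe a T code):

```r
x <- "cT1b;cN1a;cM0;G3"

grepl("T", x)                             # TRUE -- only tells you the pattern occurs
sub(".*([cp]T[^;]*).*", "\\1", x)         # "cT1b" -- extracts the substring
regmatches(x, regexpr("[cp]T[^;]*", x))   # "cT1b" -- another extraction route
```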
Try this:
library(dplyr)
dat %>%
  select(input) %>%
  mutate(
    bind_rows(lapply(
      strsplit(input, ";"),
      function(S) as.data.frame(lapply(setNames(nm = c("T", "G", "L")),
        function(z) paste0(grep(pattern = z, x = S, value = TRUE), collapse = ";"))))),
    across(one_of(c("T", "G", "L")), ~ ifelse(nzchar(.), ., .[NA]))
  )
# input T G L
# 1 cT1b;cN1a;cM0;G3 cT1b G3 <NA>
# 2 pT1a;pN0;cM0;G1;L0;V0;Pn0;R0 pT1a G1 L0
# 3 cT3;cN0;M0 cT3 <NA> <NA>
Note: it is obviously doing nothing with the M or N substrings, which might be intentional. If you want them too, use setNames(nm = c("T", "G", "L", "N")) (and again the same vector the second time, within one_of) to get another upper-case-letter column.
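For instance, adding "N" to both name vectors looks like this (a sketch; dat is the data frame from the Data section, repeated here so the chunk runs on its own):

```r
library(dplyr)

# same data as in the Data section below
dat <- structure(list(input = c("cT1b;cN1a;cM0;G3",
                                "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0",
                                "cT3;cN0;M0")),
                 class = "data.frame", row.names = c(NA, -3L))

cols <- c("T", "G", "L", "N")  # "N" added; use the same vector in both places
res <- dat %>%
  select(input) %>%
  mutate(
    bind_rows(lapply(
      strsplit(input, ";"),
      function(S) as.data.frame(lapply(setNames(nm = cols),
        function(z) paste0(grep(pattern = z, x = S, value = TRUE), collapse = ";"))))),
    across(one_of(cols), ~ ifelse(nzchar(.), ., .[NA]))
  )
res$N
# [1] "cN1a" "pN0"  "cN0"
```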
Data
dat <- structure(list(input = c("cT1b;cN1a;cM0;G3", "pT1a;pN0;cM0;G1;L0;V0;Pn0;R0", "cT3;cN0;M0")), class = "data.frame", row.names = c(NA, -3L))

Convert multiple character columns with numbers and <,> symbols into numeric in R

I have a dataset with multiple character columns with numbers and >,< signs.
I want to change them all to numeric.
The values with "<x" are supposed to be halfed and the values with ">x" are supposed to equal to x.
Sample dataframe and my approach (data=labor_df):
    a  b    c
1   1  9   20
2 <10 14 1.99
3  12 >5 14.5
half.value.a <- as.numeric(str_extract(
  labor_df$a[which(grepl(labor_df$a, pattern = "<", fixed = TRUE))],
  "\\d+\\.*\\d*")) / 2
min.value.a <- as.numeric(str_extract(
  labor_df$a[which(grepl(labor_df$a, pattern = ">", fixed = TRUE))],
  "\\d+\\.*\\d*"))
labor_df$a[which(grepl(labor_df$a, pattern = "<", fixed = TRUE))] <- half.value.a
labor_df$a[which(grepl(labor_df$a, pattern = ">", fixed = TRUE))] <- min.value.a
labor_df$a <- as.numeric(labor_df$a)
I would like to apply this to multiple columns in my df or use a different approach entirely to convert multiple columns in my df to numeric.
You can apply this approach to whichever columns you want. In this case, if you want to apply to columns 1 through 3, you can specify as labor_df[1:3]. If you want to apply to specific columns based on the column name, then create a cols vector containing the names of columns to apply this to and use labor_df[cols] instead.
The first gsub removes the greater-than sign and keeps the value unchanged. The ifelse is vectorized and applies to all values in the column: it first checks with grepl whether a less-than sign is present; if so, it removes it, converts to numeric, and divides by 2. Otherwise the value is left as is.
labor_df[1:3] <- lapply(labor_df[1:3], function(x) {
x <- gsub(">", "", x)
x <- ifelse(grepl("<", x), as.numeric(gsub("<", "", x)) / 2, x)
as.numeric(x)
})
labor_df
Output
   a  b     c
1  1  9 20.00
2  5 14  1.99
3 12  5 14.50
Data
labor_df <- structure(list(a = c("1", "<10", "12"), b = c("9", "14", ">5"
), c = c("20", "1.99", "14.5")), class = "data.frame", row.names = c(NA,
-3L))
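And if the columns are picked by name rather than position, the same loop works with a character vector of names (the cols selection below is just illustrative):

```r
labor_df <- structure(list(a = c("1", "<10", "12"), b = c("9", "14", ">5"),
                           c = c("20", "1.99", "14.5")),
                      class = "data.frame", row.names = c(NA, -3L))

cols <- c("a", "c")  # any subset of column names
labor_df[cols] <- lapply(labor_df[cols], function(x) {
  x <- gsub(">", "", x)                                            # ">5"  -> "5"
  x <- ifelse(grepl("<", x), as.numeric(gsub("<", "", x)) / 2, x)  # "<10" -> 5
  as.numeric(x)
})
labor_df$a
# [1]  1  5 12
```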

R: Counting frequency of words from predefined dictionary

I have a very large dataset that looks like this: one column contains names, the second column contains their respective (very long) texts. I also have a pre-defined dictionary that contains at least 20 terms. How can I count the number of times these key words occur in each row of my dataframe? I have tried str_detect, grep(l), and %like%, and looped over each row, but the problem seems to be that I want to detect too many terms, and these functions stop working when I use 15+ terms or so.
Would be sooo happy if anyone could help me out with this!
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me") #but my actual dictionary is much larger
Create a unique identifier for your rows. Split col2 into words, one per row, with tidyr::separate_rows. Filter to only the words in your dict. Then count by row. Finally, join back to the original df and replace NA with zero for rows that contain no words from your dict.
library(dplyr)
col1 <- c("A","B","A")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he did not")
df <- data.frame(col1, col2, stringsAsFactors = FALSE)
dict <- c("groceries", "going", "me")
df <- df %>% mutate(row=row_number()) %>% select(row, everything())
counts <- df %>% tidyr::separate_rows(col2) %>% filter(col2 %in% dict) %>% group_by(row) %>% count(name = "counts")
final <- left_join(df, counts, by="row") %>% tidyr::replace_na(list(counts=0L))
final
#>   row col1                        col2 counts
#> 1   1    A I am going to get groceries      2
#> 2   2    B        He called me at six.      1
#> 3   3    A              No, he did not      0
Here is a base R option using gregexpr
dfout <- within(
df,
counts <- sapply(
gregexpr(paste0(dict, collapse = "|"), col2),
function(x) sum(x > 0)
)
)
or
dfout <- within(
df,
counts <- sapply(
regmatches(col2, gregexpr("\\w+", col2)),
function(v) sum(v %in% dict)
)
)
which gives
> dfout
  col1                        col2 counts
1    1 I am going to get groceries      2
2    2        He called me at six.      1
3    3              No, he did not      0
Data
structure(list(col1 = 1:3, col2 = c("I am going to get groceries",
"He called me at six.", "No, he did not")), class = "data.frame", row.names = c(NA,
-3L))
I think my solution gives you the output you want: for each word in your dict vector, you can see how many times it appears in each sentence. Each row is an entry in df$col2, i.e. a sentence, and dict is your vector of terms to match. We loop over that vector, and for each entry we count how many times it appears in each row/sentence using stringr::str_count. Note the syntax: str_count(string being checked, expression to match).
str_count returns a vector showing how many times the word appears in each row. I build a data frame of these vectors, which has one row per entry in dict. Then you can cbind dict to that data frame to see how many times each word is used in each sentence. I adjust the column names at the end so you can match the words to the sentence numbers. Note that if you want to calculate row means, you'll need to drop the dict column of the final data frame first, because it's character.
library(stringr)
col1<- c("Henrik", "Joseph", "Lucy")
col2 <- c("I am going to get groceries", "He called me at six.", "No, he
did not")
df <- data.frame(col1, col2)
dict <- c("groceries", "going", "me")
word_matches <- data.frame()
for (i in dict) {
  word_tot <- str_count(df$col2, i)
  word_matches <- rbind(word_matches, word_tot)
}
word_matches
colnames(word_matches) <- paste("Sentence", 1:ncol(word_matches))
cbind(dict,word_matches)
       dict Sentence 1 Sentence 2 Sentence 3
1 groceries          1          0          0
2     going          1          0          0
3        me          0          1          0

R - How to rename every nth column with "name_x" where x=1 and increases by 1 for each column?

I have a data set where the names of the columns are very messy, and I want to simplify them. Example data below:
structure(list(MemberID = 1L, This.was.the.first.question = "ABC",
This.was.the.first.date = 1012018L, This.was.the.first.city = "New York",
This.was.the.second.question = "XYZ", This.was.the.second.date = 11052018L,
This.was.the.second.city = "Boston"), .Names = c("MemberID",
"This.was.the.first.question", "This.was.the.first.date", "This.was.the.first.city",
"This.was.the.second.question", "This.was.the.second.date", "This.was.the.second.city"
), class = "data.frame", row.names = c(NA, -1L))
  MemberID This.was.the.first.question This.was.the.first.date This.was.the.first.city
1        1                         ABC                 1012018                New York
  This.was.the.second.question This.was.the.second.date This.was.the.second.city
1                          XYZ                 11052018                   Boston
This is what I want the columns to look like:
MemberID Question_1 Date_1 City_1 Question_2 Date_2 City_2
So essentially the column name is the same but every 3rd column the number increases by 1. How would I do this? While this example data set small, my real data set is much larger and I want to learn how to do this by column indexing and iteration.
An easier option is to remove everything except the last word and use make.unique:
names(df1)[-1] <- make.unique(sub(".*\\.", "", names(df1)[-1]), sep="_")
names(df1)
#[1] "MemberID" "question" "date" "city" "question_1" "date_1" "city_1"
Or if we need the exact output as expected, extract the last word with sub and use ave to create the sequence based on duplicate names
v1 <- sub(".*\\.(\\w)", "\\U\\1", names(df1)[-1], perl = TRUE)
names(df1)[-1] <- paste(v1, ave(v1, v1, FUN = seq_along), sep="_")
names(df1)
#[1] "MemberID" "Question_1" "Date_1" "City_1"
#[5] "Question_2" "Date_2" "City_2"
# create vector of question name triplets
theList <- c("question_","date_","city_")
# create enough headings for 10 questions
questions <- rep(theList,10)
idNumbers <- 1:length(questions)
library(numbers)
# use mod function to group ids into triplets
idNumbers <- as.character(ifelse(mod(idNumbers, 3) > 0, floor(idNumbers/3) + 1, floor(idNumbers/3)))
# concatenate question stems with numbers and add MemberID column at start of vector
questionHeaders <- c("MemberID",paste0(questions,idNumbers))
head(questionHeaders)
...and the output:
[1] "MemberID" "question_1" "date_1" "city_1" "question_2" "date_2"
Use the colnames() or names() function to assign this vector as the column names of the data frame.
As noted in the comments on the OP, the question ID numbers can be generated by using the each= argument in rep(), eliminating the need for the mod() function.
idNumbers <- rep(1:10,each = 3)
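Putting that together, the full header vector can be built without the numbers package (a minimal sketch of the simplified version):

```r
# create vector of question name triplets
theList <- c("question_", "date_", "city_")
questions <- rep(theList, 10)      # 30 column stems
idNumbers <- rep(1:10, each = 3)   # 1 1 1 2 2 2 ... 10 10 10

# concatenate stems with numbers and add MemberID at the start
questionHeaders <- c("MemberID", paste0(questions, idNumbers))
head(questionHeaders)
# [1] "MemberID"   "question_1" "date_1"     "city_1"     "question_2" "date_2"
```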

Returning specific values within a row

I have 1 row of data with 50 columns, read from a csv into a dataframe. The data is arranged across the spreadsheet like this:
"FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"...
How would I select only the middle part of each element (e.g. "DFGS" in the first, "SGRE" in the second, etc.), count their occurrences, and display the results?
I have tried using the strsplit function, but I couldn't get it to work for the entire row of data. I'm thinking a loop of some kind might be what I need.
You can do unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)] (assuming your data is consistently of the form A-B-C).
# E.g.
fun <- function(x) unlist(strsplit(x, '-'))[seq(2, length(x)*3, 3)]
fun(c("FSEG-DFGS-THDG", "SGDG-SGRE-JJDF", "DIDC-DFGS-LEMS"))
# [1] "DFGS" "SGRE" "DFGS"
Edit
# Data frame
df <- structure(list(a = "FSEG-DFGS-THDG", b = "SGDG-SGRE-JJDF", c = "DIDC-DFGS-LEMS"),
class = "data.frame", row.names = c(NA, -1L))
fun(t(df[1,]))
# [1] "DFGS" "SGRE" "DFGS"
First we create a function strng() and then apply() it to every column of df. strsplit() splits a string on "-" and strng() returns the second part.
df = data.frame(a = "ab-bc-ca", b = "gn-bc-ca", c = "kj-ll-mn")
strng <- function(x) {
  strsplit(x, "-")[[1]][2]
}
# table() outputs frequency of elements in the input
table(apply(df, MARGIN = 2, FUN = strng))
# output:
# bc ll
#  2  1
