Count words in each cell of a dataframe in R - r

I have a dataframe that looks like
df <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("word1, word2", "word1", "word1"),
Variable2 = c("word1", "word1, word2", "word1"),
Variable3 = c("word1, word2", "word1", "word1, word2, word3")),
row.names = c(NA, -3L), class = "data.frame")
and would like to create a df that counts the number of words in each cell (separated by ",") and puts that count into the corresponding cell.
df2 <- structure(list(Variable = c("Factor1", "Factor2", "Factor3"),
Variable1 = c("2", "1", "1"),
Variable2 = c("1", "2", "1"),
Variable3 = c("2", "1", "3")),
row.names = c(NA, -3L), class = "data.frame")
Would someone be able to help me with how this would be done?
Thanks!

Using dplyr and stringi:
library(dplyr)

df %>%
  mutate(across(matches("variable\\d{1,}"), stringi::stri_count_words))
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3

I suppose you could try this if you want a base R solution. Count the number of characters in a given character value with nchar, then subtract the number of characters left after removing the commas. The difference is the number of commas, and adding 1 gives the number of words/phrases separated by commas. This should be fast too (see also this answer).
cbind(df[1], t(apply(df[-1], 1, \(x) {
  nchar(x) - nchar(gsub(",", "", x, fixed = TRUE)) + 1
})))
Output
Variable Variable1 Variable2 Variable3
1 Factor1 2 1 2
2 Factor2 1 2 1
3 Factor3 1 1 3
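As a quick sanity check of the comma-counting idea on a single cell (just an illustration, not part of the output above):
x <- "word1, word2, word3"
nchar(x)                                              # 19 characters
nchar(gsub(",", "", x, fixed = TRUE))                 # 17 characters once the commas are removed
nchar(x) - nchar(gsub(",", "", x, fixed = TRUE)) + 1  # 2 commas + 1 = 3 words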

Related

separate_rows with unequal size of strings in R

Suppose I have a dataset like this:
a b
"1/2/3" "a/b/c"
"3/5" "e/d/s"
"1" "f"
I want to use separate_rows, but I can't because of the second row. How can I find these kinds of rows?
You can find the rows with unequal numbers of '/' symbols by doing:
which(lengths(strsplit(df$a, '/')) != lengths(strsplit(df$b, '/')))
#> [1] 2
Presumably these rows contain data input mistakes, since the number of rows implied by each entry is different.
Or you can directly count the number of "/" in each column and output the rows that do not have an equal number of "/".
library(stringr)
with(df, which(str_count(a, "/") != str_count(b, "/")))
[1] 2
Input data
df <- structure(list(a = c("1/2/3", "3/5", "1"), b = c("a/b/c", "e/d/s",
"f")), class = "data.frame", row.names = c(NA, -3L))
Perhaps cSplit would help
library(splitstackshape)
library(dplyr)
cSplit(df, c("a", "b"), sep = "/", "long") %>%
  filter(if_any(c(a, b), complete.cases))
-output
a b
<int> <char>
1: 1 a
2: 2 b
3: 3 c
4: 3 e
5: 5 d
6: NA s
7: 1 f
data
df <- structure(list(a = c("1/2/3", "3/5", "1"), b = c("a/b/c", "e/d/s",
"f")), class = "data.frame", row.names = c(NA, -3L))

How do I specify pivot_wider for an entire dataframe?

I am able to pivot_wider for a specific column using the following:
new_df <- pivot_wider(old_df, names_from = col10, values_from = value_col, values_fn = list)
I would like to pivot_wider with every column in a dataframe (minus an id column). What is the best way to do this? Should I use a loop, or is there a way for this function to take the whole dataframe?
To clarify, using the below sample dataframes, I am able to go from old_df to new_df using the pivot_wider function I listed above. I would like to now go from old_df2 to new_df2.
old_df <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
old_df2 <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), col11 = c("dog",
"cat", "dog"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
new_df <- pivot_wider(old_df, names_from = col10, values_from = value_col, values_fn = list)
new_df2 <- structure(list(id = c("1", "2"), yellow = c("1", "NULL"), green = c("1", "1"), dog = c("1", "1"), cat = c("1", "NULL")), row.names = c(NA, -2L), class = c("tbl_df", "tbl", "data.frame"))
If you would like to have separate column names for each value across these two columns (or any number of columns), you first need to use pivot_longer to put all the column names into a single column and then use pivot_wider to spread them:
library(tidyr)
library(dplyr)

old_df2 %>%
  pivot_longer(!c(id, value_col), names_to = "Cols", values_to = "vals") %>%
  pivot_wider(names_from = vals, values_from = value_col) %>%
  select(-Cols) %>%
  group_by(id) %>%
  summarise(across(everything(), ~ sum(as.numeric(.x), na.rm = TRUE)))
# A tibble: 2 x 5
id yellow dog green cat
<chr> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 2 0 1 1 0
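To see why the final group_by()/summarise() step is needed, it can help to look at the intermediate long format (shown here purely for illustration; printed roughly as follows):
old_df2 %>%
  pivot_longer(!c(id, value_col), names_to = "Cols", values_to = "vals")
# A tibble: 6 x 4
#   id    value_col Cols  vals
#   <chr> <chr>     <chr> <chr>
# 1 1     1         col10 yellow
# 2 1     1         col11 dog
# 3 1     1         col10 green
# 4 1     1         col11 cat
# 5 2     1         col10 green
# 6 2     1         col11 dog
Each original row contributes one line per colXX column, so after pivot_wider every id still spans several rows with NAs in the value columns it did not mention, and those rows have to be collapsed back to one row per id.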
Update 1
As per your update, here is a data.table option
library(data.table)

dcast(
  melt(setDT(old_df2),
       id.var = "id",
       measure.vars = patterns("^col\\d+")
  ),
  id ~ value,
  fun.aggregate = length,
  fill = NA
)
which gives
id cat dog green yellow
1: 1 1 1 1 1
2: 2 NA 1 1 NA
Are you looking for something like below?
reshape(
  transform(
    old_df2,
    q = ave(id, id, FUN = seq_along)
  ),
  direction = "wide",
  idvar = "id",
  timevar = "q"
)
The output is
id col10.1 col11.1 value_col.1 col10.2 col11.2 value_col.2
1 1 yellow dog 1 green cat 1
3 2 green dog 1 <NA> <NA> <NA>
You could combine those columns and unnest them, followed by pivot_wider:
library(tidyr)
library(dplyr)
old_df2 <- structure(list(id = c("1", "1", "2"), col10 = c("yellow",
"green", "green"), col11 = c("dog",
"cat", "dog"), value_col = c("1", "1", "1")), row.names = c(NA, -3L), class = c("tbl_df", "tbl", "data.frame"))
old_df2 %>%
  mutate(new_col = strsplit(paste(col10, col11, sep = "_"), "_"), .keep = "unused") %>%
  unnest(new_col) %>%
  pivot_wider(names_from = new_col, values_from = value_col)
#> # A tibble: 2 x 5
#> id yellow dog green cat
#> <chr> <chr> <chr> <chr> <chr>
#> 1 1 1 1 1 1
#> 2 2 <NA> 1 1 <NA>
Created on 2021-08-25 by the reprex package (v2.0.1)

Subsetting Data in R not working through a vector

Scenario:
In the DataCamp course Cleaning Data with R: Case Studies, there is an exercise near the very end where we have a dataset "att5" with 5 columns (say 1, 2, 3, 4, 5). Only column 1 is a character column that actually contains text; columns 2:5 hold numbers but are stored as character. The task is to make a vector cols containing the indices (2, 3, 4, 5) and use sapply to apply as.numeric to those columns.
My solution is not working, although it seems to make sense. I'm sharing their solution first and then mine. Please help me understand what is going on.
DataCamp solution (working)
# Define vector containing numerical columns: cols
cols <- -1
# Use sapply to coerce cols to numeric
att5[, cols] <- sapply(att5[, cols], as.numeric)
My solution (not working)
# Define vector containing numerical columns: cols
cols <- c(2:5)
# Use sapply to coerce cols to numeric
att5[, cols] <- sapply(att5[, cols], as.numeric)
I'm getting this error: invalid subscript type 'list'.
Please help me understand; I'm a newbie in R.
Your solution works perfectly on my machine. The only difference I can see is that cols <- -1 is of class "numeric", whereas cols <- c(2:5) is of class "integer". If you want to know the difference between the two, have a look at What's the difference between integer class and numeric class in R.
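You can check this directly:
class(-1)
# [1] "numeric"
class(c(2:5))
# [1] "integer"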
So, one way to reverse-engineer their solution is to generate cols as a numeric vector, and seq can help do that.
cols <- seq(2,5,1)
#class(cols)
#[1] "numeric"
att5[, cols] <- sapply(att5[, cols], as.numeric)
# str(att5)
# 'data.frame': 5 obs. of 5 variables:
# $ att1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
# $ att2: num 1 2 3 4 5
# $ att3: num 1 2 3 4 5
# $ att4: num 1 2 3 4 5
# $ att5: num 1 2 3 4 5
Data
dput(att5)
att5 <- structure(list(att1 = structure(1:5, .Label = c("A", "B", "C",
"D", "E"), class = "factor"), att2 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att3 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att4 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor"), att5 = structure(1:5, .Label = c("1",
"2", "3", "4", "5"), class = "factor")), class = "data.frame", row.names = c(NA,
-5L))
Hope it works on your end.

r dataframe using rank

I would like to rank the rows of a dataframe (with 30 columns) which has numerical values ranging from -Inf to +Inf.
This is what I have:
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
> df
StockA StockB StockC
1 -5 2 -3
2 3 -1 -4
3 6 3 4
This is what I would like to have:
> df_rank
StockA StockB StockC
1 3 1 2
2 1 2 3
3 1 3 2
I am using this command:
> rank(df[1,])
StockA StockB StockC
2 3 1
The resulting rank values are not correct though, as you can see.
rank() assigns the lowest rank to the smallest value.
So the short answer to your question is to use rank of the vector multiplied by -1:
rank(-c(-5, 2, -3))
[1] 3 1 2
Here is the full code:
# data frame definition. The numbers should actually be integers as pointed out
# in comments, otherwise the rank command will sort them as strings
# So in the real world you should define them as integers,
# but to go with your data I will convert them to integers in the next step
df <- structure(list(StockA = c("-5", "3", "6"),
StockB = c("2", "-1", "3"),
StockC = c("-3", "-4", "4")),
.Names = c( "StockA","StockB", "StockC"),
class = "data.frame", row.names = c(NA, -3L))
# since you plan to rank them not as strings, but numbers, you need to convert
# them to integers:
df[] <- lapply(df,as.integer)
# apply will return a matrix or a list and you need to
# transpose the result and convert it back to a data.frame if needed
result <- as.data.frame(t( apply(df, 1, FUN=function(x){ return(rank(-x)) }) ))
result
#   StockA StockB StockC
# 1      3      1      2
# 2      1      2      3
# 3      1      3      2
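As a quick illustration of why converting away from strings matters (character values are ranked in string order, which generally differs from numeric order):
rank(c("-5", "2", "-3"))  # ranks the strings in lexicographic order, not numerically
rank(c(-5, 2, -3))        # ranks the actual numbers: -5 is the smallest, so it gets rank 1
# [1] 1 3 2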

Setting different values in duplicated observations to NA

I have a data frame (DF) that looks as follows:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c("2", "1", "1", "2", "1"), VALUE4 = c("2", "1",
"2", "1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
Uniqueness in this case is defined by ID and REPORTER, so the DF above contains a duplicate for ID 123 with REPORTER ONE and for ID 789 with REPORTER THREE. Since I cannot tell which values of VALUE1 to VALUE4 are the correct ones, I would like to set to NA all values that differ within a duplicate.
This means I first have to identify the VALUE columns that contain differing values; these are the ones to be set to NA. For the rest I would like to keep the data, since there I can tell the value is correct.
The expected output would look like this:
structure(list(ID = c("123", "123", "456", "789", "789"), REPORTER = c("ONE",
"ONE", "TWO", "THREE", "THREE"), VALUE1 = c("1", "1", "2", "1",
"1"), VALUE3 = c(NA, NA, "1", NA, NA), VALUE4 = c(NA, NA, "2",
"1", "1")), .Names = c("ID", "REPORTER", "VALUE1", "VALUE3",
"VALUE4"), row.names = c(1L, 2L, 3L, 5L, 6L), class = "data.frame")
The goal is to ensure data quality. I don't want to simply remove the problem cases, since I can still use the non-differing values for analysis. But I also do not want to just pick one of the rows, because that would lead to wrong conclusions if I happened to choose the wrong values.
How can I do this?
I think this is what you are looking for:
library(reshape2)
DFL <- melt(cbind(rn = 1:nrow(DF), DF), id.vars=c("rn", "ID", "REPORTER"))
DFL$value2 <- ave(DFL$value, DFL[c("ID", "REPORTER", "variable")],
                  FUN = function(x) {
                    ifelse(length(unique(x)) > 1, NA, x)
                  })
dcast(DFL, rn + ID + REPORTER ~ variable, value.var = "value2")
# rn ID REPORTER VALUE1 VALUE3 VALUE4
# 1 1 123 ONE 1 <NA> <NA>
# 2 2 123 ONE 1 <NA> <NA>
# 3 3 456 TWO 2 1 2
# 4 4 789 THREE 1 <NA> 1
# 5 5 789 THREE 1 <NA> 1
As you can see, I had to add a dummy "rn" supplementary ID variable to make sure that dcast wouldn't just collapse all the values into one row per ID+REPORTER combination.
Update
This is actually also entirely doable with base R's reshape and the ave step described above:
DFL <- reshape(DF, direction = "long",
varying = grep("VALUE", names(DF)), sep = "")
DFL <- within(DFL, {
  VALUE <- ave(VALUE, ID, REPORTER, time, FUN = function(x)
    ifelse(length(unique(x)) > 1, NA, x))
})
reshape(DFL)
# ID REPORTER id VALUE1 VALUE3 VALUE4
# 1.1 123 ONE 1 1 <NA> <NA>
# 2.1 123 ONE 2 1 <NA> <NA>
# 3.1 456 TWO 3 2 1 2
# 4.1 789 THREE 4 1 <NA> 1
# 5.1 789 THREE 5 1 <NA> 1
In the last line above, the attributes from the original reshape statement make it so we don't have to even worry about what arguments we need to put in. :-)
I created a function replaceDifferent() that looks like this:
replaceDifferent <- function(vector){
  max <- max(vector)
  min <- min(vector)
  test <- max == min
  if (!test){
    return(NA)
  } else {
    return(min(vector))
  }
}
Then I melted the DF with melt() from the reshape package:
DFmelt <- melt(DF, id = c("ID", "REPORTER"))
After that I was able to apply the new function to the melted data frame with ddply() from the plyr package:
DFres <- ddply(DFmelt, .(ID, REPORTER, variable), function(x){replaceDifferent(x$value)})
To get the result data frame with duplicates removed I called dcast() on DFres:
DFres <- dcast(DFres, ID+REPORTER ~ variable)
This produces a slightly different output than the one I asked for, but is better in the way that I do not have to deal with duplicates anymore.
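For completeness, the same "set values that disagree within an ID/REPORTER group to NA" idea can also be written without any reshaping, for example with a grouped dplyr mutate (a sketch, not taken from the answers above; it assumes a dplyr version that provides across() and n_distinct()):
library(dplyr)

DF %>%
  group_by(ID, REPORTER) %>%
  mutate(across(starts_with("VALUE"),
                ~ if (n_distinct(.x) > 1) NA_character_ else .x)) %>%
  ungroup()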
