This question already has an answer here:
Include levels of zero count in result of table()
(1 answer)
Closed 2 years ago.
I have a list of integers, all between 1 and 365. Some integers appear multiple times and some do not appear at all. I would like a function like count that produces a data frame with the number of occurrences of each value, including a count of 0 for values that do not appear.
df
  x freq
  1    0
  2    1
  3    3
  4    0
Currently, the rows for 1 and 4 do not appear at all in the result of df = count(list).
We can use factor with the levels specified so that it also takes care of the missing elements and reports their count as 0:
table(factor(df$x, levels = 1:4))
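A minimal runnable sketch of this approach; the input vector x here is made-up example data matching the question's expected output:

```r
# Hypothetical input: a small vector of day numbers between 1 and 365.
x <- c(2, 3, 3, 3)

# factor() with explicit levels keeps categories that never occur,
# so table() reports a zero count for them.
freq <- table(factor(x, levels = 1:4))
df <- data.frame(x = as.integer(names(freq)), freq = as.integer(freq))
df
#   x freq
# 1 1    0
# 2 2    1
# 3 3    3
# 4 4    0
```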
This question already has answers here:
Cumulative number of unique values in a column up to current row
(2 answers)
Closed 4 years ago.
My data frame looks something like this:
USER URL
   1 homepage.com
   1 homepage.com/welcome
   1 homepage.com/overview
   1 homepage.com/welcome
What I want is a vector with the following values:
UNIQUE
1
2
3
3
How do I do that?
We could use cumsum and duplicated:
df$unique <- cumsum(!duplicated(df$URL))
df$unique
#[1] 1 2 3 3
duplicated gives us a logical vector marking whether each value is a duplicate; we negate it (!) so that first occurrences are TRUE, then take cumsum over it to get the cumulative count of unique values.
Using dplyr to add a new column:
library(dplyr)
df %>%
  mutate(Dups = cumsum(!duplicated(URL)))
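For completeness, a self-contained base-R version using the data frame reconstructed from the question:

```r
df <- data.frame(
  USER = c(1, 1, 1, 1),
  URL  = c("homepage.com", "homepage.com/welcome",
           "homepage.com/overview", "homepage.com/welcome")
)

# duplicated() flags repeated URLs; negating it marks first occurrences,
# and cumsum() turns those flags into a running count of distinct URLs.
df$unique <- cumsum(!duplicated(df$URL))
df$unique
# [1] 1 2 3 3
```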
This question already has answers here:
Filtering a data frame by values in a column [duplicate]
(3 answers)
Closed 4 years ago.
Let's say I have the following data frame in R:
> patientData
  patientID age diabetes    status
1         1  25   Type 1      Poor
2         2  34   Type 2  Improved
3         3  28   Type 1 Excellent
4         4  52   Type 1      Poor
How can I reference a specific row or group of rows by using the specific value/level of a particular column rather than the row index? For instance, if I wanted to set a variable x to equal all of the rows which contain a patient with Type 1 diabetes or all of the rows that contain a patient in "Improved" status, how would I do that?
Try this one:
library(dplyr)
patientData %>%
  filter(diabetes == "Type 1")
Next time, please provide a minimal reproducible example.
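In base R the same selection can be done with logical subsetting, which also covers the "Type 1 or Improved" part of the question; the data frame below is reconstructed from the example:

```r
patientData <- data.frame(
  patientID = 1:4,
  age       = c(25, 34, 28, 52),
  diabetes  = c("Type 1", "Type 2", "Type 1", "Type 1"),
  status    = c("Poor", "Improved", "Excellent", "Poor")
)

# Rows with Type 1 diabetes OR Improved status.
x <- patientData[patientData$diabetes == "Type 1" |
                   patientData$status == "Improved", ]
x$patientID
# [1] 1 2 3 4

# Equivalent, and often more readable, with subset():
x2 <- subset(patientData, diabetes == "Type 1" | status == "Improved")
```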
This question already has answers here:
Split string column to create new binary columns
(10 answers)
Closed 1 year ago.
I have a simple question about cleaning up messy data. I have a dataset, emailed to me, that contains multiple columns, each of which holds a comma-separated string of numbers. Each of those numbers should traditionally be its own variable, but this is not how the datasets are given to me. Here is an example of some data:
indication treatment
       1,2         3
         2       2,1
       1,3       2,3
Please imagine these datasets containing close to 100 such columns and thousands of rows, with a varying number of values in each cell. My goal is to import a dataset like this and split each column so that every unique value in the strings gets its own column. Like this:
indication_1 indication_2 indication_3 treatment_1 treatment_2 treatment_3
           1            1            0           0           0           1
           0            1            0           1           1           0
           1            0            1           0           1           1
Notice that the column header has changed and the numeric value is listed as a binary 0 or 1, where 1 indicates the presence of the variable.
I've had issues because the split functions I have tried require me to know how many columns I need, and they don't sort the values into their own columns after the split. It has become quite complicated and requires separate code for each individual column containing a string.
I'd like a function that can take a column containing such a string, split the data into separate sorted columns, make those columns binary yes/no indicators, and rename them to reflect both the original column name and the value in that column. I'd like this to be applicable to any column of data so I don't have to rewrite or modify the function for individual columns (assuming all columns are numeric strings with a character title).
Thanks in advance.
We can do strsplit and then get the frequency with mtabulate:
library(qdapTools)
do.call(cbind, lapply(df, function(x) mtabulate(strsplit(x, ","))))
# indication.1 indication.2 indication.3 treatment.1 treatment.2 treatment.3
#1 1 1 0 0 0 1
#2 0 1 0 1 1 0
#3 1 0 1 0 1 1
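If qdapTools is not available, here is a base-R sketch of the same idea; the helper binarize is hypothetical, not from any package:

```r
df <- data.frame(
  indication = c("1,2", "2", "1,3"),
  treatment  = c("3", "2,1", "2,3"),
  stringsAsFactors = FALSE
)

# Split one column on commas and tabulate each distinct value into a
# 0/1 column, naming the new columns "<original>_<value>".
binarize <- function(x, prefix) {
  parts <- strsplit(x, ",")
  levs  <- sort(unique(unlist(parts)))
  out   <- sapply(levs, function(l)
    as.integer(vapply(parts, function(p) l %in% p, logical(1))))
  colnames(out) <- paste(prefix, levs, sep = "_")
  out
}

# Apply to every column and bind the results side by side.
res <- do.call(cbind, lapply(names(df), function(nm) binarize(df[[nm]], nm)))
res
#      indication_1 indication_2 indication_3 treatment_1 treatment_2 treatment_3
# [1,]            1            1            0           0           0           1
# [2,]            0            1            0           1           1           0
# [3,]            1            0            1           0           1           1
```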
This question already has an answer here:
How to create a TRUE or FALSE column based on regexpr() findings in R?
(1 answer)
Closed 5 years ago.
df <-
SUB CONC
1   baseline (predose)
2   screen
2   predose
I want to add a flag such that if CONC column has "predose" written in it regardless of other things in the cell, then give it a flag 1, otherwise 0.
dfout <-
SUB CONC               PREDOSE
1   baseline (predose) 1
2   screen             0
2   predose            1
How can I do this in R? I am using RStudio.
We can use grepl with the pattern 'predose' to create a logical vector, then coerce that to binary with as.integer:
df$PREDOSE <- as.integer(grepl('predose', df$CONC))
df$PREDOSE
#[1] 1 0 1
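A runnable version using the data from the question:

```r
df <- data.frame(
  SUB  = c(1, 2, 2),
  CONC = c("baseline (predose)", "screen", "predose"),
  stringsAsFactors = FALSE
)

# grepl() is TRUE wherever "predose" occurs anywhere in the cell,
# regardless of surrounding text; as.integer() maps TRUE/FALSE to 1/0.
df$PREDOSE <- as.integer(grepl("predose", df$CONC))
df$PREDOSE
# [1] 1 0 1
```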
This question already has answers here:
Create counter within consecutive runs of certain values
(6 answers)
Closed 5 years ago.
I have a vector, for example
ind <- c(TRUE,FALSE,TRUE,TRUE,FALSE,FALSE,FALSE,TRUE,TRUE,TRUE,FALSE)
and I want to count the number of consecutive TRUE values, where the counting restarts at 1 after every FALSE that separates blocks of TRUE values. The result for the example above should be
result <- c(1,0,1,2,0,0,0,1,2,3,0)
Any ideas how to do this nicely?
rle computes "the lengths and values of runs of equal values in a vector".
sequence creates, for each element of nvec, the sequence seq_len(nvec[i]).
Logical values are automatically coerced to 0/1 when multiplied with numbers.
All these functions together:
sequence(rle(ind)$lengths) * ind
#[1] 1 0 1 2 0 0 0 1 2 3 0
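Putting it together as a runnable sketch with the question's vector:

```r
ind <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)

# rle() gives the run lengths (here 1, 1, 2, 3, 3, 1); sequence() expands
# each length L into 1:L, numbering positions within every run; multiplying
# by ind zeroes out the positions belonging to FALSE runs.
result <- sequence(rle(ind)$lengths) * ind
result
# [1] 1 0 1 2 0 0 0 1 2 3 0
```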