Categorise multiple rows into one variable - r

A simple question, but apparently not answered on SO yet.
I've got a long data frame where 3 of the columns are:
person | trip | driver
=======================
1        car
1        bike
1        train
1        walk
2        walk
2        train
2        boat
What I'd like is to populate column 'driver', so that it reads 1 if at least one of the trips is made by car, 0 otherwise:
person | driver
================
1        1
1        1
1        1
1        1
2        0
2        0
2        0
I have a slight preference for doing this without resorting to fancy packages, but I am happy with most of the popular ones (e.g. plyr, data.table, sqldf, ...), or even new ones that prove helpful in the long term.
Thanks in advance, .p.
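
For a reproducible setup, here is a minimal df1 matching the table above (the answers below assume this name; the construction itself is not part of the original question):
df1 <- data.frame(person = c(1, 1, 1, 1, 2, 2, 2),
                  trip = c("car", "bike", "train", "walk", "walk", "train", "boat"),
                  stringsAsFactors = FALSE)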

We could use data.table: convert the 'data.frame' to 'data.table' (setDT(df1)), check whether there is any 'car' in 'trip' grouped by 'person', convert the logical output to numeric (+0L, or wrap with as.numeric), and assign (:=) to the 'driver' column. If needed, we can remove the 'trip' column by assigning it to NULL, or subset with [, c(1,3), with=FALSE].
library(data.table)
setDT(df1)[, driver := any(trip == 'car')+0L, by = person][, trip := NULL]
Or instead of any, we can use max(trip=='car') as @Arun mentioned in the comments
setDT(df1)[, driver := max(trip == 'car'), by = person]
Or using a similar logic as above, we group_by 'person' and create a new column with mutate and remove the unwanted columns with select
library(dplyr)
df1 %>%
  group_by(person) %>%
  mutate(driver = any(trip == 'car') + 0L) %>%
  select(-trip)
Or with base R, we can use ave to create 'driver' and then subset to remove the 'trip' column.
df1$driver <- with(df1, ave(trip=='car', person, FUN=any)+0L)
subset(df1, select=-trip)
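As a quick check with the df1 above, every person-1 row should get driver = 1 (there is a car trip) and every person-2 row driver = 0, matching the expected output in the question:
with(df1, ave(trip == 'car', person, FUN = any) + 0L)
# [1] 1 1 1 1 0 0 0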

Related

Check if a string is a subset in another string in R

I have this data below, which includes ID and Code (chr type):
ID <- c(1, 1, 1, 2, 2, 3, 3, 3, 4, 4)
Code <- c("0011100000", "0001100000", "1001100000", "1100000000",
          "1000000000", "1000000000", "0100000000", "0010000000",
          "0010000001", "0010000001")
df <- data.frame(ID, Code)
I need to remove records (within each ID) based on the Code value pattern. That is:
For each ID, we look at the values of Code, and we remove the ones that are subset of other row.
For example, for ID=1, row #2 is a subset of row #1, so we remove row #2. But row #3 is NOT a subset of row #1 or row #2, so we keep it.
For ID=2, row #5 is a subset of row #4, so we remove it.
For ID=3, they are all different, so we keep them all.
For ID=4, since the Code for both records are the same, then keep the first one.
Here is the expected final view of the results:
  ID       Code
1  1 0011100000
2  1 1001100000
3  2 1100000000
4  3 1000000000
5  3 0100000000
6  3 0010000000
7  4 0010000001
It's not that pretty, but a bit of checking of every combination with a join will do it.
Convert to a data.table
library(data.table)
setDT(df)
Make a row counter, and identify the positions of every "1" in each string, saving them to a list column.
df[, rn := .I]
df[, ones := gregexpr("1", Code)]
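To see what ones holds, here is the subset test on the first two codes in isolation (an illustration only, not part of the answer's pipeline):
gregexpr("1", "0011100000")[[1]]  # positions 3 4 5 (plus match-length attributes)
gregexpr("1", "0001100000")[[1]]  # positions 4 5
# row 2 is a subset of row 1: all of its "1" positions occur among row 1's
all(c(4, 5) %in% c(3, 4, 5))      # TRUE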
Join each group to itself, and compare the lists where the row numbers don't match. Then keep the row numbers where the lists are subsets, and drop these rows from the original data. In the case of duplicates, only remove the first occasion of the duplicate.
df[
  funion(
    df[df, on = c("ID", "rn>rn"), if (all(i.ones[[1]] %in% ones[[1]])) .(Code = i.Code), by = .EACHI][, -"rn"],
    df[df, on = c("ID", "rn<rn"), if (all(i.ones[[1]] %in% ones[[1]])) .(Code = i.Code), by = .EACHI][, -"rn"]
  ),
  on = c("ID", "Code"),
  mult = "first",
  drop := 1
]
df[is.na(drop), -c("rn", "ones", "drop")]
# ID Code
#1: 1 0011100000
#2: 1 1001100000
#3: 2 1100000000
#4: 3 1000000000
#5: 3 0100000000
#6: 3 0010000000
#7: 4 0010000001

Split columns in a dataframe into a column that contains text not numbers and a column that contains numbers not text in R

Here is a simplified version of data I am working with:
a <- c("There are 5 programs", "2 - adult programs, 3- youth programs", "25",
       " ", "there are a number of programs", "other agencies run our programs")
b <- c("four", "we don't collect this", "5 from us, more from others", "", "", "")
c <- c(2, 6, 5, 8, 2, "")
df <- cbind.data.frame(a, b, c)
df$c <- as.numeric(df$c)
I want to keep both the text and numbers from the data because some of the text is important.
expected output:
What I think makes sense is the following:
1. Identify all columns that have text in them, perhaps in a list (because some columns are just numbers).
2. Subset the columns from step 1 to a new data frame; let's call this df1.
3. Delete the subsetted columns in df1 from df.
4. Split all the columns in df1 into 2 columns, one that keeps the text and one that has the number.
5. Bind the new split columns from df1 into the original df.
What I am struggling with is steps 1-2 and 4. I am okay with the characters (e.g., - and ') being excluded or included. There is additional processing I have to do after (e.g., when there are multiple numbers in a column after splitting I will need to split and add these and also address the written numbers), but those are things I can do.
Here's a dplyr solution using regular expressions:
library(stringr)
library(dplyr)
df %>%
  mutate(
    a.text = gsub("(^|\\s)\\d+", "", a),
    a.num = str_extract_all(a, "\\d+"),
    b.text = gsub("(^|\\s)\\d+", "", b),
    b.num = str_extract_all(b, "\\d+")
  ) %>%
  select(c(4:7, 3))
a.text a.num b.text b.num c
1 There are programs 5 four 2
2 - adult programs,- youth programs 2, 3 we don't collect this 6
3 25 from us, more from others 5 5
4 8
5 there are a number of programs 2
6 other agencies run our programs NA
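One thing to note (not from the original answer): str_extract_all() returns a list column, which is why a.num prints entries like 2, 3. If plain character columns are preferred, one option is to collapse each match vector:
sapply(str_extract_all(df$a, "\\d+"), paste, collapse = ", ")
# [1] "5"    "2, 3" "25"   ""     ""     ""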
Here is what I would do with my preferred tools. The solution will work with arbitrary numbers of arbitrarily named character and non-character columns.
library(data.table) # development version 1.14.3 used here
library(magrittr) # piping used to improve readability
num <- \(x) stringr::str_extract_all(x, "\\d+", simplify = TRUE) %>%
  apply(1L, \(x) sum(as.integer(x), na.rm = TRUE))
txt <- \(x) stringr::str_remove_all(x, "\\d+") %>%
  stringr::str_squish()
setDT(df)[, lapply(
  .SD, \(x) if (is.character(x)) data.table(txt = txt(x), num = num(x)) else x)]
which returns
a.txt a.num b.txt b.num c
<char> <int> <char> <int> <num>
1: There are programs 5 four 0 2
2: - adult programs, - youth programs 5 we don't collect this 0 6
3: 25 from us, more from others 5 5
4: 0 0 8
5: there are a number of programs 0 0 2
6: other agencies run our programs 0 0 NA
Explanation
num() is a function which uses the regular expression \\d+ to extract all strings which consist of contiguous digits (aka integer numbers), coerces them to type integer, and computes the rowwise sum of the extracted numbers (as requested in OP's last sentence).
txt() is a function which removes all strings which consist of contiguous digits (aka integer numbers), removes whitespace from start and end of the strings and reduces repeated whitespace inside the strings.
\(x) is a new shortcut for function(x) introduced with R version 4.1
The next steps implement OP's proposed approach in data.table syntax, by and large:
- lapply(.SD, ...) loops over each column of df.
- If the column is character, both functions txt() and num() are applied. The two resulting vectors are turned into a data.table as a partial result. Note that cbind() cannot be used here as it would return a character matrix.
- If the column is non-character, it is returned as is.
The final result is a data.table where the column names have been renamed automagically.
This approach keeps the relative position of columns.
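As a quick illustration of the two helpers on their own (using the sample df from the question; a sketch of the intermediate behaviour):
num(df$a)
# [1]  5  5 25  0  0  0
txt(df$a)
# [1] "There are programs"
# [2] "- adult programs, - youth programs"
# [3] ""
# [4] ""
# [5] "there are a number of programs"
# [6] "other agencies run our programs"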

Generate group by condition on row value in column R data.table

I want to split a data.table in R into groups based on a condition on the row values in a column. I have searched SO extensively and can't find an efficient data.table way to do this (I'm not looking to loop over rows with a for loop).
I have data like this:
library(data.table)
dt1 <- data.table(x = 1:139,
                  t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)))
I'd like to group at the large numbers (over a settable value) and come up with the example below:
dt.desired <- data.table(x = 1:139,
                         t = c(rep(1:5, 10), 120928, rep(6:10, 9), 10400, rep(13:19, 6)),
                         group = c(rep(1, 50), rep(2, 46), rep(3, 43)))
Since every large value starts a new group, a cumulative sum of the condition gives the group index directly, and we can check the rows holding the large values against the desired result:
dt1[, group := cumsum(t > 200) + 1]
dt1[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
dt.desired[t > 200]
# x t group
# 1: 51 120928 2
# 2: 97 10400 3
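A fuller check (a quick sketch comparing all 139 group assignments, not just the break rows):
all(dt1$group == dt.desired$group)
# [1] TRUE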
You can use a test like t>100 to find the large values. You can then use cumsum() to get a running integer for each set of rows up to (but not including) the large number.
# assuming you can define "large" as >100
dt1[ , islarge := t>100]
dt1[ , group := shift(cumsum(islarge))]
I understand that you want the large number to be part of the group above it. To do this, use shift() and then fill in the first value (which will be NA after shift() is run).
# a little cleanup
# (fix first value and start group at 1 instead of 0)
dt1[1, group := 0]
dt1[ , group := group+1]
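Put together, the same steps can be written in one line using shift()'s fill argument instead of the manual cleanup (a sketch, same t > 100 threshold):
dt1[, group := shift(cumsum(t > 100), fill = 0) + 1]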

merging multiple dataframes with duplicate rows in R

Relatively new with R for this kind of thing, searched quite a bit and couldn't find much that was helpful.
I have about 150 .csv files with 40,000 - 60,000 rows each and I am trying to merge 3 columns from each into 1 large data frame. I have a small script that extracts the 3 columns of interest ("id", "name" and "value") from each file and merges by "id" and "name" with the larger data frame "MergedData". Here is my code (I'm sure this is a very inefficient way of doing this and that's ok with me for now, but of course I'm open to better options!):
file_list <- list.files()
for (file in file_list) {
  if (!exists("MergedData")) {
    MergedData <- read.csv(file, skip = 5)[, c("id", "name", "value")]
    colnames(MergedData) <- c("id", "name", file)
  }
  else if (exists("MergedData")) {
    temp_data <- read.csv(file, skip = 5)[, c("id", "name", "value")]
    colnames(temp_data) <- c("id", "name", file)
    MergedData <- merge(MergedData, temp_data, by = c("id", "name"), all = TRUE)
    rm(temp_data)
  }
}
Not every file has the same number of rows, though many rows are common to many files. I don't have an inclusive list of rows, so I included all=TRUE to append new rows that don't yet exist in the MergedData file.
My problem is: many of the files contain 2-4 rows with identical "id" and "name" entries, but different "value" entries. So, when I merge them I end up adding rows for every possible combination, which gets out of hand fast. Most frustrating is that none of these duplicates are of any interest to me whatsoever. Is there a simple way to take the value for the first entry and just ignore any further duplicate entries?
Thanks!
Based on your comment, we could stack each file and then cast the resulting data frame from "long" to "wide" format:
library(dplyr)
library(readr)
library(reshape2)
df = lapply(file_list, function(file) {
  dat = read_csv(file)
  dat$source.file = file
  return(dat)
})
df = bind_rows(df)
df = dcast(df, id + name ~ source.file, value.var = "value")
In the code above, after reading in each file, we add a new column source.file containing the file name (or a modified version thereof).* Then we use dcast to cast the data frame from "long" to "wide" format to create a separate column for the value from each file, with each new column taking one of the names we just created in source.file.
Note also that depending on what you're planning to do with this data frame, you may find it more convenient to keep it in long format (i.e., skip the dcast step) for further analysis.
Addendum: Dealing with the "Aggregation function missing: defaulting to length" warning. This happens when you have more than one row with the same id, name and source.file. That means there are multiple values that have to get mapped to the same cell, resulting in aggregation. The default aggregation function is length (i.e., a count of the number of values in that cell). The only ways around this that I know of are (a) keep the data in long format, (b) use a different aggregation function (e.g., mean), or (c) add an extra counter column to differentiate cases with multiple values for the same combination of id, name, and source.file. We demonstrate these below.
First, let's create some fake data:
df = data.frame(id = rep(1:2, 2),
                name = rep(c("A", "B"), 2),
                source.file = rep(c("001", "002"), each = 2),
                value = 11:14)
df
id name source.file value
1 1 A 001 11
2 2 B 001 12
3 1 A 002 13
4 2 B 002 14
Only one value per combination of id, name and source.file, so dcast works as desired.
dcast(df, id + name ~ source.file, value.var="value")
id name 001 002
1 1 A 11 13
2 2 B 12 14
Add an additional row with the same id, name and source.file. Since there are now two values getting mapped to a single cell, dcast must aggregate. The default aggregation function is to provide a count of the number of values.
df = rbind(df, data.frame(id=1, name="A", source.file="002", value=50))
dcast(df, id + name ~ source.file, value.var="value")
Aggregation function missing: defaulting to length
id name 001 002
1 1 A 1 2
2 2 B 1 1
Instead, use mean as the aggregation function.
dcast(df, id + name ~ source.file, value.var="value", fun.aggregate=mean)
id name 001 002
1 1 A 11 31.5
2 2 B 12 14.0
Add a new counter column to differentiate cases where there are multiple rows with the same id, name and source.file and include that in dcast. This gets us back to a single value per cell, but at the expense of having more than one column for some source.files.
# Add counter column
df = df %>% group_by(id, name, source.file) %>%
  mutate(counter = 1:n())
As you can see, the counter value only has a value of 1 in cases where there's only one combination of id, name, and source.file, but has values of 1 and 2 for one case where there are two rows with the same id, name, and source.file (rows 3 and 5 below).
df
id name source.file value counter
1 1 A 001 11 1
2 2 B 001 12 1
3 1 A 002 13 1
4 2 B 002 14 1
5 1 A 002 50 2
Now we dcast with counter included, so we get two columns for source.file "002".
dcast(df, id + name ~ source.file + counter, value.var="value")
id name 001_1 002_1 002_2
1 1 A 11 13 50
2 2 B 12 14 NA
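If, as the question asks, only the first value for each combination of id, name, and source.file matters and the remaining duplicates can be discarded outright, one further option (a sketch, not part of the original answer) is to drop the duplicates before casting:
df %>%
  distinct(id, name, source.file, .keep_all = TRUE) %>%
  dcast(id + name ~ source.file, value.var = "value")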
* I'm not sure what your file names look like, so you'll probably need to adjust this to create a naming format with a unique file identifier. For example, if your file names follow the pattern "file001.csv", "file002.csv", etc., you could do this: dat$source.file = paste0("Value", gsub("file([0-9]{3})\\.csv", "\\1", file)).

Custom function within subset of data, base functions, vector output?

Apologies for a semi 'double post'. I feel I should be able to crack this, but I'm going round in circles. This is on a similar note to my previously well-answered question:
Within ID, check for matches/differences
test <- data.frame(
  ID = c(rep(1, 3), rep(2, 4), rep(3, 2)),
  DOD = c(rep("2000-03-01", 3), rep("2002-05-01", 4), rep("2006-09-01", 2)),
  DOV = c("2000-03-05", "2000-06-05", "2000-09-05",
          "2004-03-05", "2004-06-05", "2004-09-05", "2005-01-05",
          "2006-10-03", "2007-02-05")
)
What I want to do is tag each subject whose first visit (as given by DOV) was less than 180 days from their diagnosis (DOD). I have the following using the plyr package.
ddply(test, "ID", function(x) ifelse( (as.numeric(x$DOV[1]) - as.numeric(x$DOD[1])) < 180,1,0))
Which gives:
ID V1
1 A 1
2 B 0
3 C 1
What I would like is the vector 1,1,1,0,0,0,0,1,1 so I can append it as a column to the data frame. This ddply call is fine as far as it goes: it builds a 'lookup' table showing which IDs had their first visit within 180 days of diagnosis, and I could then go back through the original test and construct an indicator variable from it. But I'd have thought this should be possible in one step.
I'd also like to use base if possible. I had a method with 'by', but again it only gave one result per ID and was also a list. Have been trying with aggregate but getting things like 'by has to be a list', then 'it's not the same length' and using the formula method of input I'm stumped 'cbind(DOV,DOD) ~ ID'...
Appreciate the input, keen to learn!
After wrapping as.Date around the creation of those date columns, this returns the desired marking vector assuming the df named 'test' is sorted by ID (and done in base):
# could put an ordering operation here if needed
0 + unlist(                       # to make vector from list and coerce logical to integer
  lapply(split(test, test$ID),    # to apply fn within ID
         function(x) rep(         # to extend a listwise value across all IDs
           min(x$DOV - x$DOD) < 180,  # compare the minimum of a set of intervals
           NROW(x))))
# 11 12 13 21 22 23 24 31 32   (the labels)
#  1  1  1  0  0  0  0  1  1   (the values)
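For completeness, here is the as.Date conversion this answer presupposes ("after wrapping as.Date around the creation of those date columns"), plus attaching the result as a column, as a sketch:
test$DOD <- as.Date(test$DOD)
test$DOV <- as.Date(test$DOV)
test$V1 <- 0 + unlist(lapply(split(test, test$ID),
                             function(x) rep(min(x$DOV - x$DOD) < 180, NROW(x))))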
I added stringsAsFactors=FALSE to the data.frame() call:
test <- data.frame(ID = c(rep(1, 3), rep(2, 4), rep(3, 2)),
                   DOD = c(rep("2000-03-01", 3), rep("2002-05-01", 4), rep("2006-09-01", 2)),
                   DOV = c("2000-03-05", "2000-06-05", "2000-09-05", "2004-03-05",
                           "2004-06-05", "2004-09-05", "2005-01-05", "2006-10-03", "2007-02-05"),
                   stringsAsFactors = FALSE)
CODE
test$V1 <- ifelse(c(FALSE, diff(test$ID) == 0), 0,
                  1 * (as.numeric(as.Date(test$DOV) - as.Date(test$DOD)) < 180))
test$V1 <- ave(test$V1, test$ID, FUN = max)
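As a check, this reproduces the vector the OP asked for:
test$V1
# [1] 1 1 1 0 0 0 0 1 1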
