R, add column to dataframe, count of substrings

This is my desired output:
> head(df)
   String numSubStrings
1       1             1
2       1             1
3 1;1;1;1             4
4 1;1;1;1             4
5   1;1;1             3
6       1             1
Hi, I have a data frame which has a "String" column as above. I would like to add a column "numSubStrings" which contains the number of substrings separated by ";" in "String".
I have tried
lapply(df, transform, numSubStrings=length(strsplit(df$Strings,";")[[1]]))
which gives me 1s in the numSubStrings instead.
Please advise.
Thanks.

It sounds like you're looking for count.fields. Usage would be something like:
> count.fields(textConnection(mydf$String), sep = ";")
[1] 1 1 4 4 3 1
You may need to wrap the mydf$String in as.character, depending on how the data were read in or created.
Or, you can try lengths:
> lengths(strsplit(mydf$String, ";", TRUE))
[1] 1 1 4 4 3 1
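For a fully reproducible version, here is a minimal sketch that rebuilds the example data from the output above and applies lengths:

```r
# Rebuild the example data (values taken from the question's output)
df <- data.frame(String = c("1", "1", "1;1;1;1", "1;1;1;1", "1;1;1", "1"),
                 stringsAsFactors = FALSE)

# strsplit returns a list of character vectors; lengths() counts each one
df$numSubStrings <- lengths(strsplit(df$String, ";", fixed = TRUE))
df$numSubStrings
# [1] 1 1 4 4 3 1
```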

We can use gsub to remove all the characters except the ; and count the ; with nchar
df$numSubStrings <- nchar(gsub('[^;]+', '', df$String))+1
df$numSubStrings
#[1] 1 1 4 4 3 1
Or another option is stri_count from library(stringi) to count the ; characters and add 1.
library(stringi)
stri_count_fixed(df$String, ';')+1
#[1] 1 1 4 4 3 1

You may use str_count from stringr package.
x <- " String
1 1
2 1
3 1;1;1;1
4 1;1;1;1
5 1;1;1
6 1 "
df <- read.table(text=x, header=TRUE)
library(stringr)
df$numSubStrings <- str_count(df$String, "[^;]+")
df
#    String numSubStrings
# 1       1             1
# 2       1             1
# 3 1;1;1;1             4
# 4 1;1;1;1             4
# 5   1;1;1             3
# 6       1             1

Related

Rename dataframe column names by switching string patterns

I have following dataframe and I want to rename the column names to c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")
> dataf <- data.frame(
+ WBC_D7_MIN=1:4,WBC_D7_MAX=1:4,DBP_D3_MIN=1:4
+ )
> dataf
  WBC_D7_MIN WBC_D7_MAX DBP_D3_MIN
1          1          1          1
2          2          2          2
3          3          3          3
4          4          4          4
> names(dataf)
[1] "WBC_D7_MIN" "WBC_D7_MAX" "DBP_D3_MIN"
Probably the rename_with function in the tidyverse can do it, but I cannot figure out how.
You can use capture groups with sub to extract values in order -
names(dataf) <- sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', names(dataf))
Same regex can be used in rename_with -
library(dplyr)
dataf %>% rename_with(~ sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', .))
#   WBC_MIN_D7 WBC_MAX_D7 DBP_MIN_D3
# 1          1          1          1
# 2          2          2          2
# 3          3          3          3
# 4          4          4          4
You can rename your dataf with your vector with names(yourDF) <- c("A","B",...,"Z"):
names(dataf) <- c("WBC_MIN_D7", "WBC_MAX_D7", "DBP_MIN_D3")
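As a quick end-to-end check of the capture-group approach, run against the example data from the question:

```r
dataf <- data.frame(WBC_D7_MIN = 1:4, WBC_D7_MAX = 1:4, DBP_D3_MIN = 1:4)

# \1, \2, \3 are the three underscore-separated parts; emit them as part1_part3_part2
names(dataf) <- sub('^(\\w+)_(\\w+)_(\\w+)$', '\\1_\\3_\\2', names(dataf))
names(dataf)
# [1] "WBC_MIN_D7" "WBC_MAX_D7" "DBP_MIN_D3"
```

Note that \\w also matches underscores, so the anchored pattern relies on backtracking to split on the two literal underscores; with names of the form PART_PART_PART it resolves as intended.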

Patterned Vector in base

I'd like to produce a vector with the following repeating pattern:
1 1 2 1 2 3 1 2 3 4 ...
that ranges from one to some arbitrary stopping point.
I can hack it together using an sapply followed by an unlist, as in the following, but it sure feels like there should be a base call that is more direct than this.
repeating_function <- function(stop_point) {
  res_list <- sapply(1:stop_point, FUN = function(x) 1:x, simplify = TRUE)
  res <- unlist(res_list)
  return(res)
}
Which produces:
repeating_function(5)
[1] 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
An easier option would be
sequence(sequence(5))
#[1] 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
Wrapping in a function
repeating_function <- function(val) {
  sequence(sequence(val))
}
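To see why the nested call works: sequence(5) is just 1:5, and sequence applied to a vector concatenates seq_len of each element:

```r
sequence(5)
# [1] 1 2 3 4 5
sequence(1:5)   # c(1:1, 1:2, 1:3, 1:4, 1:5)
# [1] 1 1 2 1 2 3 1 2 3 4 1 2 3 4 5
```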

Recoding specific column values using reference list

My dataframe looks like this
data = data.frame(ID=c(1,2,3,4,5,6,7,8,9,10),
Gender=c('Male','Female','Female','Female','Male','Female','Male','Male','Female','Female'))
And I have a reference list that looks like this -
ref=list(Male=1,Female=2)
I'd like to replace values in the Gender column using this reference list, without adding a new column to my dataframe.
Here's my attempt
do.call(dplyr::recode, c(list(data), ref))
Which gives me the following error -
no applicable method for 'recode' applied to an object of class
"data.frame"
Any inputs would be greatly appreciated
An option would be to do a left_join after stacking the 'ref' list into a two-column data.frame
library(dplyr)
left_join(data, stack(ref), by = c('Gender' = 'ind')) %>%
select(ID, Gender = values)
A base R approach would be
unname(unlist(ref)[as.character(data$Gender)])
#[1] 1 2 2 2 1 2 1 1 2 2
In base R:
data$Gender = sapply(data$Gender, function(x) ref[[x]])
You can use factor, i.e.
factor(data$Gender, levels = names(ref), labels = ref)
#[1] 1 2 2 2 1 2 1 1 2 2
You can unlist ref to give you a named vector of codes, and then index this with your data:
transform(data,Gender=unlist(ref)[as.character(Gender)])
   ID Gender
1   1      1
2   2      2
3   3      2
4   4      2
5   5      1
6   6      2
7   7      1
8   8      1
9   9      2
10 10      2
Surprisingly, this one works as well (note, though, that it turns Gender into a list column):
data$Gender <- ref[as.character(data$Gender)]
#> data
#    ID Gender
# 1   1      1
# 2   2      2
# 3   3      2
# 4   4      2
# 5   5      1
# 6   6      2
# 7   7      1
# 8   8      1
# 9   9      2
# 10 10      2

R write.table takes longer for 2 columns than for whole dataframe

A dataframe with 40 columns.
This finishes after a few seconds:
write.table(data_2[1:10000,], file = "/Volumes/2018/06_abteilungen/bi/analytics/tools/adobe/adobe_analytics/adobe_analytics_api_rohdaten/api_via_data_feed_auf_ftp/beispiel_datenexporte_data_feed/r_exporte/channel_va_closer.csv", sep = ";", col.names = NA)
This never ends:
write.table(data_2[1:1000,c(data_2$va_closer_detail,data_2$va_closer_id)], file = "/Volumes/2018/06_abteilungen/bi/analytics/tools/adobe/adobe_analytics/adobe_analytics_api_rohdaten/api_via_data_feed_auf_ftp/beispiel_datenexporte_data_feed/r_exporte/channel_va_closer.csv", sep = ";", col.names = NA)
How can I extract only 2 columns without performance-delay?
You can use [ to subset a data frame either by giving it row/column indices or row/column names. For example:
dd = data.frame(col1 = rep(1:2, 5), col2 = c(rep(1:3, 3), 1), col3 = 'a')
dd
#    col1 col2 col3
# 1     1    1    a
# 2     2    2    a
# 3     1    3    a
# 4     2    1    a
# 5     1    2    a
# 6     2    3    a
# 7     1    1    a
# 8     2    2    a
# 9     1    3    a
# 10    2    1    a
If you wanted the first 5 rows and the first 2 columns, you could do either of these:
# good
dd[1:5, 1:2] # using column indices
dd[1:5, c("col1", "col2")] # using column names
But what you have in your question is
# bad
dd[1:5, c(dd$col1, dd$col2)] # using actual values :(
What columns are you asking for? Well, dd$col1 is the first column values: 1,2,1,2,... and dd$col2 is the second column values 1,2,3,1,2,3... Using c() you are sticking them together, so we can expand this out to
c(dd$col1, dd$col2) # these are the columns you are asking for
# [1] 1 2 1 2 1 2 1 2 1 2 1 2 3 1 2 3 1 2 3 1
# these are equivalent for this data
dd[1:5, c(dd$col1, dd$col2)]
dd[1:5, c(1,2,1,2,1,2,1,2,1,2,1,2,3,1,2,3,1,2,3,1)]
# col1 col2 col1.1 col2.1 col1.2 col2.2 col1.3 col2.3 col1.4 col2.4 col1.5 col2.5 col3 col1.6 col2.6 col3.1 col1.7 col2.7
# 1 1 1 1 1 1 1 1 1 1 1 1 1 a 1 1 a 1 1
# 2 2 2 2 2 2 2 2 2 2 2 2 2 a 2 2 a 2 2
# 3 1 3 1 3 1 3 1 3 1 3 1 3 a 1 3 a 1 3
# 4 2 1 2 1 2 1 2 1 2 1 2 1 a 2 1 a 2 1
# 5 1 2 1 2 1 2 1 2 1 2 1 2 a 1 2 a 1 2
# col3.2 col1.8
# 1 a 1
# 2 a 2
# 3 a 1
# 4 a 2
# 5 a 1
We are asking to repeat the columns again and again, with twice as many columns as there are rows in the original data! I don't know how many rows you have, but it looks like more than 1000, so you are asking not for 2 columns but for more than 2000 columns, maybe a lot more.
Two footnotes:
I second the comment recommending data.table::fwrite; it will be much faster.
As a debugging technique, don't forget you can run small pieces of code to isolate the problem. When you try
write.table(data_2[1:1000,c(data_2$va_closer_detail,data_2$va_closer_id)],
file = "/Volumes/2018/06_abteilungen/bi/analytics/tools/adobe/adobe_analytics/adobe_analytics_api_rohdaten/api_via_data_feed_auf_ftp/beispiel_datenexporte_data_feed/r_exporte/channel_va_closer.csv",
sep = ";", col.names = NA)
And it doesn't seem to work, there are two things worth checking: (a) is the file path valid, and (b) is the data valid? If you had tried running just the data_2[...] part of the line, you would have identified the problem without needing help.
data_2[1:1000,c(data_2$va_closer_detail,data_2$va_closer_id)]
And when you ran that and saw different output than expected, again you run a smaller piece of the line,
c(data_2$va_closer_detail,data_2$va_closer_id)
And hopefully the issue is clear.
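Putting it together, the fix is to subset by quoted column names. A minimal sketch, using the column names from the question on a toy stand-in for data_2:

```r
# Toy stand-in for data_2, with the column names from the question
data_2 <- data.frame(va_closer_detail = letters[1:5],
                     va_closer_id     = 1:5,
                     other_col        = 6:10)

# Subset by column *names* (quoted strings), not by the columns' values
out <- data_2[1:5, c("va_closer_detail", "va_closer_id")]
write.table(out, file = tempfile(fileext = ".csv"), sep = ";", col.names = NA)
```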

Conditionally dropping duplicates from a data.frame

I am trying to figure out how to subset my dataset according to the repeated value of the variable s, also taking into account the id associated with the row.
Suppose my dataset is:
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2",
header=TRUE)
What I would like to do is, for each id, to keep only the first row for which s = 3. The result with dat would be:
id s
1 2
1 2
1 1
1 3
2 3
3 2
3 2
I have tried using both duplicated() and which() so I could then use subset(), but I am not getting anywhere. The main problem is that it is not sufficient to isolate the first row of each s = 3 "block", because in some cases (as here between id = 1 and id = 2) the 3's overlap between one id and the next. Which strategy would you adopt?
Like this:
subset(dat, s != 3 | s == 3 & !duplicated(dat))
#    id s
# 1   1 2
# 2   1 2
# 3   1 1
# 4   1 3
# 7   2 3
# 9   3 2
# 10  3 2
Note that subset can be dangerous to work with (see Why is `[` better than `subset`?), so the longer but safer version would be:
dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
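To see the mechanics: duplicated(dat) flags rows that repeat an earlier (id, s) pair, so the filter keeps every non-3 row plus only the first s == 3 row per id (recall that & binds tighter than |):

```r
dat <- read.table(text = "
id s
1 2
1 2
1 1
1 3
1 3
1 3
2 3
2 3
3 2
3 2", header = TRUE)

duplicated(dat)   # TRUE for rows repeating an earlier (id, s) combination
# [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE

res <- dat[dat$s != 3 | dat$s == 3 & !duplicated(dat), ]
nrow(res)
# [1] 7
```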
