applying functions to pasted strings in R

I have a dataframe, df, with many columns (cola, colb, etc.), each consisting of a sequence of integers 0 or 1:
df$cola
[1] 1 0 0 1 0 1 0 0 etc.
I am using the subSeq function in the doBy package to obtain some sequences and want to apply it to all columns.
I have tried putting the column names into a vector:
cols <- colnames(df) # "cola" "colb" etc.
and have tried, without success, this approach:
subSeq(get(paste0("df$", cols[1]))) # error: object 'df$cola' not found
I could not easily find an equivalent question on the site via search.

I think you are looking for df[[cols[1]]].
Note that df[["foo"]] is the same as df$foo.
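To apply subSeq to every column at once, here is a minimal sketch (assuming doBy is loaded and every column of df is such a 0/1 vector):
library(doBy)
# One run-length summary per column, collected into a named list
runs <- lapply(cols, function(nm) subSeq(df[[nm]]))
names(runs) <- cols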

Related

How to apply a function in different ranges of a vector in R?

I have the following matrix:
x <- matrix(c(1, 2, 2, 1, 10, 10, 20, 21, 30, 31, 40,
              1, 3, 2, 3, 10, 11, 20, 20, 32, 31, 40,
              0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0), 11, 3)
For each unique value in the first column of x, I would like to find the maximum of the third column across all rows having that first-column value.
I have created the following code:
v1 <- sequence(rle(x[,1])$lengths)
A <- split(seq_along(v1), cumsum(v1 == 1))  # row indices for each run in x[,1]
A_diff <- rep(0, length(A))
for (i in 1:length(A)) {
  A_diff[i] <- max(x[A[[i]], 3])
}
However, this code works only when identical elements are consecutive in the first column (because I use rle), and it relies on a for loop.
How can I make it work in general, and without the for loop, i.e. using a function?
If I understand correctly
> tapply(x[,3],x[,1],max)
 1  2 10 20 21 30 31 40
 1  1  1  0  1  1  0  0
For grouping by more than one variable I would use aggregate. Note that matrices are cumbersome for this purpose; I would suggest transforming it into a data frame. Nonetheless:
> aggregate(x[,3],list(x[,1],x[,2]),max)
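As a sketch of the data-frame route (the column names v1/v2/v3 below are just illustrative):
d <- as.data.frame(x)
names(d) <- c("v1", "v2", "v3")
# Maximum of the third column within each combination of the first two
aggregate(v3 ~ v1 + v2, data = d, FUN = max)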

Keeping only certain columns of a data frame provided they match a condition

I am new to programming, so do bear with me. I have a data frame with about 1500 rows and 1000 variables. I am trying to keep only the columns whose values are binary, i.e. "0" or "1" (NAs are also allowed), and discard all columns that don't meet this criterion. Is there a way of doing this without knowing in advance which column names meet the criterion?
I have read up on the dplyr filter() function and on base R subsetting, but neither matches what I am looking for.
The new features in dplyr 1.0.0 provide a simple solution to this: select(.data, where(is.logical)), where .data is your tibble/data frame (provided your variables are of type logical, i.e. TRUE/FALSE).
You can try something like this:
df <- data.frame(a = 1:5,
                 b = c(0, 1, 0, 1, 0),
                 c = c(0, 1, 0, 1, NA_real_),
                 d = c(0, 1, 0, 1, 2))
# TRUE if a column contains only 0, 1 and NA
is_binary <- function(x) {
  all(x %in% c(0, 1, NA_real_))
}
df[, sapply(df, is_binary)]
Output:
  b  c
1 0  0
2 1  1
3 0  0
4 1  1
5 0 NA
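The same predicate also plugs into dplyr's where() (dplyr >= 1.0.0); a quick sketch reusing is_binary from above:
library(dplyr)
# Keep only the columns for which is_binary() returns TRUE
df %>% select(where(is_binary))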

R - Creating new columns based on searching vector elements in a df

I would like to add columns to a df where the newly added columns are based on searching values of a vector in an existing column of the df.
My original dataset contains web data where rows represent pages visited by each customer; the pages visited are stored in df$URL. I have a separate vector of web page URLs; each element of this vector needs to be added as a column whose values indicate whether that customer's page visit in the original df (df$URL) matches the to-be-added column (= the vector element).
Basically: I want to create a column for each element of the vector (where the column name = the vector element) with values (0/1), obtained by searching the rows of the URL column of the df: add a 1 on a match, 0 otherwise.
All of the vector elements in urlnames occur in df$URL (but not in every row), and df$URL contains more URLs than are in the vector (basically the vector contains only some of the top visited URL pages).
urlnames <- c("/home", "/login", "/contact")
df <- data.frame("URL" = c("/home", "/login", "/contact", "/chat", "/product-page"))
Manually I would do something like (with dplyr):
df %<>%
mutate(home = ifelse(URL == "/home", 1, 0))
Basically, the variable name and the ifelse criterion should be replaced by the vector element. I don't know if there are more efficient or neater ways of doing this.
I really want to learn how to do such things automatically rather than having to do manual mutate calls for each of these variables.
(BTW, I would also appreciate input on potential issues the URL slashes could create when used as column names, e.g. /home as a variable name.)
Hope I've been clear enough to explain my issue, apologies if not - it's my first post and I'm (obviously) new to R. Thank you!
Try table:
table(1:nrow(df),df$URL)
#   /chat /contact /home /login /product-page
# 1     0        0     1      0             0
# 2     0        0     0      1             0
# 3     0        1     0      0             0
# 4     1        0     0      0             0
# 5     0        0     0      0             1
You can drop the columns you don't want afterwards and coerce to a data.frame if needed.
There are tons of ways to remove the columns. One is to replace the values that are not in urlnames with NA and reapply the above. Something like:
table(1:nrow(df), droplevels(replace(df$URL, which(!df$URL %in% urlnames), NA)))
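Another option, sketched here, is to build the full table first and then keep only the urlnames columns (this works whether df$URL is a factor or a character vector):
tab <- table(1:nrow(df), df$URL)
# Keep only the columns listed in urlnames, then coerce to a data frame
as.data.frame.matrix(tab[, urlnames])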
Something like this, using lapply?
setNames(as.data.frame(lapply(urlnames, function(x) +(x==df$URL))), urlnames)
#>   /home /login /contact
#> 1     1      0        0
#> 2     0      1        0
#> 3     0      0        1
#> 4     0      0        0
#> 5     0      0        0
What happens here is that we use lapply to create a list of vectors, one for each member of urlnames. Each vector is filled with 1s and 0s depending on whether the element of urlnames was found at each position in df$URL. We then turn the list into a data frame and set its column names to urlnames.
Longer answer (a bit late to the party), not as succinct, eloquent, or efficient as those above, but it can be used for partial matches with only minor adjustments (removing the paste0 call that encases the urlnames):
setNames(as.data.frame(
  lapply(paste0("^", urlnames, "$"), function(x) {
    +Vectorize(grepl)(x, df$URL)
  }),
  row.names = NULL), urlnames)
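On the side question about the slashes: names such as /home are legal in a data frame but awkward to use with $. A small sketch of two ways to get friendlier column names (pick whichever suits you):
make.names(urlnames)    # replaces invalid characters such as "/" with "."
sub("^/", "", urlnames) # or simply strip the leading slash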

table() function does not convert data frame correctly

I am trying to convert a data.frame to a table without packages. Basically, I took the cookbook as a reference for this and tried converting from a data frame and from both named and unnamed vectors. The data set is the Stack Overflow survey from Kaggle.
moreThan1000 is a data.frame that stores the countries with more than 1000 Stack Overflow users, sorted by the Number column, as shown below:
moreThan1000 <- subset(users, users$Number >1000)
moreThan1000 <- moreThan1000[order(moreThan1000$Number),]
When I try to convert it to a table like:
tbl <- table(moreThan1000)
tbl <- table(moreThan1000$Country, moreThan1000$Number)
tbl <- table(moreThan1000$Country, moreThan1000$Number, dnn = c("Country","Number"))
After each attempt my conversion looks like this:
Why does the moreThan1000 data.frame not send just the related countries, but all countries, to the table? It seems to me the conversion looks like a matrix.
I believe this is because the countries do not relate to each other. To each country corresponds a number, and to another country corresponds an unrelated number. So the best way to reflect this is the original data.frame, not a table that will have just a single 1 per row (unless two countries have the very same number of Stack Overflow users). I haven't downloaded the dataset you're using, but look at what happens with a fake dataset ordered by number, just like your moreThan1000:
dat <- data.frame(A = letters[1:5], X = 21:25)
table(dat$A, dat$X)
    21 22 23 24 25
  a  1  0  0  0  0
  b  0  1  0  0  0
  c  0  0  1  0  0
  d  0  0  0  1  0
  e  0  0  0  0  1
Why would you expect anything different from your dataset?
The function "table" is used to tabulate your data.
So it will count how often every value occurs (in the "number"column!). In your case, every number only occurs once, so don't use this function here. It's working correctly, but it's not what you need.
Your data is already a tabulation, no need to count frequencies again.
You can check whether there is an object-conversion function; I guess you are looking for as.table rather than table.
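If the goal is a table-like object that keeps the existing counts rather than re-tabulating them, here is a sketch with xtabs (assuming the columns are named Country and Number as in the question):
# One cell per country, holding the pre-computed user count
tbl <- xtabs(Number ~ Country, data = moreThan1000)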

transpose row to column in R using qdap

I have been using the wfm function in the "qdap" package to transpose the text row values into columns, and I ran into a problem when the data contains numbers along with text. For example, if the row value is "abcdef" the transpose works fine, but if the value is "ab1000" the numbers get truncated. Can anyone suggest how to work around this?
Approach tried so far:
input <- read.table(header=F, text="101 ab0003
101 pp6500
102 sm2456")
colnames(input) <- c("id","channel")
require(qdap)
library(qdap)
output <- t(with(input, wfm(channel, id)))
output <- as.data.frame(output)
expected_output<- read.table(header=F,text="1 1 0
0 0 1")
colnames(expected_output) <- c("ab0003","pp6500", "sm2456")
I think maybe wfm isn't the right tool for this job. It seems you don't really have sentences that you want to split into words, so you're using a function with a lot of unnecessary overhead. What you really want is to tabulate the values you have by another grouping variable.
Here are two approaches. One using qdapTools's mtabulate, another using base R's table:
library(qdapTools)
mtabulate(with(input, split(channel, id)))
##     ab0003 pp6500 sm2456
## 101      1      1      0
## 102      0      0      1
t(with(input, table(channel, id)))
##      channel
## id    ab0003 pp6500 sm2456
##   101      1      1      0
##   102      0      0      1
It may be that your MWE does not reflect the complexity of the data; if that is the case, it brings us back to the original problem. wfm uses the tm package as a backend for some of the manipulations, so we need to supply something through the dots (...). I re-read the documentation and this is a bit confusing (I have added this info in the dev version), but we want to pass removeNumbers = FALSE to TermDocumentMatrix, as seen here:
output <- t(with(input, wfm(channel, id, removeNumbers=FALSE)))
as.data.frame(output)
##     ab0003 pp6500 sm2456
## 101      1      1      0
## 102      0      0      1
