Combine/unite series of columns - r

I have a data frame like this; its width varies, but the number of columns is always even:
V1 V2 V3 V4 V5 V6
A B C D E F
I would like the first half of the data frame to form pairs with the second half of the data frame. (In the case above that would be pairs such as AD, BE and CF.)
Based on another post, I wrote the following, but I can't manage to turn the result into a data frame.
lapply(1:(ncol(df)/2), function(x) paste(df[,c(x,x+(ncol(df)/2))], collapse = "")) %>%
data.frame
Could someone explain what actually happens in this piece of code?

I am not sure exactly what problem you are facing, but I see two potential problems. The first is that your character variables may actually be factors, in which case you get back the underlying integer codes rather than the characters. The second is the paste call itself. Writing it like this gives me the right results. You will have to use rename_all from dplyr to make the variable names usable.
lapply(1:(ncol(df)/2), function(x) paste0(df[[x]], df[[x + ncol(df) / 2]])) %>% data.frame
Now what is going on here:
Assuming that you always have an even number of columns, we divide that number by two and then apply paste0 for each column index from 1 up to that half. paste0(...) is equivalent to paste(..., sep = ''). We paste column x together with column x + half the number of columns. In my updated code I use [[ ]] because it returns the column itself (a character vector) rather than a data frame.
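The explanation above can be sketched as a small self-contained example. It assumes the one-row data frame from the question, built with stringsAsFactors = FALSE so the columns are character vectors; the column names pair1, pair2, pair3 are made up for illustration, and base R is used instead of the pipe.

```r
# Example data from the question; stringsAsFactors = FALSE keeps the letters as characters
df <- data.frame(V1 = "A", V2 = "B", V3 = "C", V4 = "D", V5 = "E", V6 = "F",
                 stringsAsFactors = FALSE)

half <- ncol(df) / 2  # assumes an even number of columns

# Paste column x with column x + half, for x = 1 .. half
pairs <- lapply(seq_len(half), function(x) paste0(df[[x]], df[[x + half]]))

# Hypothetical column names; a named list converts cleanly to a data frame
names(pairs) <- paste0("pair", seq_len(half))
result <- as.data.frame(pairs, stringsAsFactors = FALSE)
print(result)
```

Because paste0 is vectorised, the same code should work row by row when the data frame has more than one row.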

Related

Extracting information between special characters in a column in R

I'm sorry because I feel like versions of this question have been asked many times, but I simply cannot find code from other examples that works in this case. I have a column where all the information I want is stored between two sets of "%%", and I want to extract the text between those two delimiters and put it into a new column, in this case called df$empty.
This is a long column, but in all cases I just want the information between the two sets of "%%". Is there a way to code this out across the whole column?
To be specific, I want in this example a new column that will look like "information", "wanted".
empty <- c('NA', 'NA')
information <- c('notimportant%%information%%morenotimportant', 'ignorethis%%wanted%%notthiseither')
df <- data.frame(information, empty)
In this case you can do:
df$empty <- sapply(strsplit(df$information, '%%'), '[', 2)
#                                   information       empty
# 1 notimportant%%information%%morenotimportant information
# 2           ignorethis%%wanted%%notthiseither      wanted
That is, split the text on '%%' and take the second element of each resulting vector.
Or you can get the same result using sub():
df$empty <- sub('.*%%(.+)%%.*', '\\1', df$information)
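As a quick check that the two approaches agree, here is a minimal sketch using the example data from the question. The variable names by_split and by_regex are introduced only for this illustration, and fixed = TRUE is an optional addition that treats '%%' as a literal string rather than a regular expression.

```r
information <- c("notimportant%%information%%morenotimportant",
                 "ignorethis%%wanted%%notthiseither")
df <- data.frame(information, stringsAsFactors = FALSE)

# Approach 1: split on the literal delimiter and keep the second piece
by_split <- sapply(strsplit(df$information, "%%", fixed = TRUE), `[`, 2)

# Approach 2: capture whatever sits between the two delimiters
by_regex <- sub(".*%%(.+)%%.*", "\\1", df$information)

df$empty <- by_split
print(df$empty)
print(identical(by_split, by_regex))
```

Note that the two approaches can differ on strings with more or fewer than two delimiters, so it is worth checking the real column for such rows.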

replacing a pattern in a column value with a column value

After an hour or so of skimming through Stack Overflow and trying out different things, I've decided to ask a new question.
I made a data frame [picture 1] into which I inserted a vector, of the same length as the df, containing a URL used to access an API's data.
I added the text "FLAVOURS" to the URL as a placeholder for gsub to replace later with the value from the flavors column.
What I ended up with was a df with 2 columns, one with the URL used for the API and one with all the flavors. What I want to do now is insert the flavors [column 2] into the URL so it becomes, e.g.:
"http://strainapi.evanbusse.com/ZlWfxSa/searchdata/flavors/Earthy"
So what I would like is for the pattern "FLAVOURS" to be replaced by the column 2 value, row by row.
I've tried using gsub on its own, or in combination with rowwise() but I've been getting errors out of them, or they do something I didn't expect at all.
*I'm still new to both R and making stackoverflow posts so please do give me pointers if I did something wrong.
In base R, you can use any of the apply family of functions to do this. If your data frame is called df and the first two columns are a and b, you can do:
df$a <- mapply(function(x, y) sub('FLAVOURS', y, x), df$a, df$b)
However, stringr has a vectorised function str_replace.
df$a <- stringr::str_replace(df$a, 'FLAVOURS', df$b)
Another base R option would be to treat column a as file paths, use dirname to extract the path up to the last '/', and paste it together with column b.
df$a <- paste(dirname(df$a), df$b, sep = '/')
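A minimal sketch of the two base R options on a toy data frame follows. The column names a and b follow the answer; the URL is the one quoted in the question with the FLAVOURS placeholder at the end, and "Sweet" is a made-up second flavor used only for illustration. USE.NAMES = FALSE and fixed = TRUE are optional additions that drop the names mapply would otherwise attach and treat the placeholder as a literal string.

```r
df <- data.frame(
  a = rep("http://strainapi.evanbusse.com/ZlWfxSa/searchdata/flavors/FLAVOURS", 2),
  b = c("Earthy", "Sweet"),  # "Sweet" is a hypothetical value
  stringsAsFactors = FALSE
)

# Option 1: row-wise replacement of the placeholder with the value in b
replaced <- mapply(function(url, flavor) sub("FLAVOURS", flavor, url, fixed = TRUE),
                   df$a, df$b, USE.NAMES = FALSE)

# Option 2: drop the last path component and append b instead
rebuilt <- paste(dirname(df$a), df$b, sep = "/")

print(replaced)
```

The second option only works when the placeholder is the final path component, as it is here.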

Calculate ratios of all column combinations from a dataframe

I have a CSV file imported as df in R. The dimension of this df is 18x11. I want to calculate all possible ratios between the columns. Can you please help me with this? I understand that either a 'for loop' or a vectorized function will do the job. The row names will remain the same, while column name combinations can be merged using paste. However, I don't know how to execute this. I did this in Excel, as it is still a small data set. A larger size will make it tedious and error-prone in Excel, so I would like to try it in R.
Will be great help indeed. Thanks. Let's say below is the data frame as subset from my data.
dfn <- data.frame(replicate(18, sample(100:1000, 15, replace = TRUE)))
If you do:
do.call("cbind", lapply(seq_along(dfn), function(y) apply(dfn, 2, function(x) x / dfn[[y]])))
You will get a matrix with 15 rows and 324 columns: the first 18 columns are all columns divided by the first column, the next 18 are all columns divided by the second column, and so on.
You can keep track of them by labelling the columns with the following names:
apply(expand.grid(names(dfn), names(dfn)), 1, paste, collapse = " / ")
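To see how the ratios and the labels line up, here is a small sketch on a hypothetical three-column data frame (the columns a, b, c and their values are made up). In this sketch each column is the numerator and the y-th column is the denominator, so the column labelled "b / a" holds b divided by a, which matches the order produced by expand.grid.

```r
# Hypothetical toy data, small enough to check by hand
dfn <- data.frame(a = c(1, 2), b = c(2, 8), c = c(4, 4))

# For each denominator column y, divide every column by it and bind the blocks
ratios <- do.call("cbind", lapply(seq_along(dfn), function(y) {
  apply(dfn, 2, function(x) x / dfn[[y]])
}))

# expand.grid varies its first argument fastest: numerator first, then denominator
colnames(ratios) <- apply(expand.grid(names(dfn), names(dfn)), 1,
                          paste, collapse = " / ")

print(dim(ratios))
print(ratios[, "b / a"])
```

With n columns the result has n * n columns, including the trivial ratios of each column with itself.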

How can I apply an operator to different elements of the same data frame without calling the elements via [row,column]?

First time asking a question. I tried searching in vain for an answer but can't seem to find exactly what I am looking for.
I have a small (2x2) data frame at the moment:
   status weighted.responses
1 control           3.872694
2 exposed           3.713198
What I want to be able to do is subtract 'Exposed' from 'Control' by calling out a specific name, as opposed to [2,2] - [1,2]. Reason being, there will be several more columns added to this data frame as time goes on.
I have tried to transpose the data frame, promote the rows to headers, and then remove the row being used for a header, but then I end up with a vector.
Transposing also seems to turn my data frame to strings for some reason, which is another problem.
I have tried just taking the vector of 'weighted.responses', naming the elements accordingly, and subtracting, but then the new variable ends up being a named number, which I don't want. At that point it just seemed like a waste of time and space to have several different lines for a problem I am sure is simple.
I feel like I am running circles around a very simple solution, but I can't figure it out.
I am very appreciative of your time, and apologies for the formatting.
There are a couple of different ways you can do this using dplyr/tidyverse. Note that functions like spread tend to work better for reshaping dataframes than t(), which turns your dataframe into a matrix and coerces all values to the same type. Examples of things you can do:
library(tidyverse)
df = data.frame(
  status = c("Control", "Exposed"),
  response = c(3.87, 3.71)
)
df %>% spread(status, response) %>% summarize(diff = Control - Exposed)
# Output:
#   diff
# 1 0.16
df %>%
  summarize(diff = response[status == "Control"] - response[status == "Exposed"])
# Output:
#   diff
# 1 0.16
1) Subtraction. This will subtract row 1 from row 2 even if there are more than 2 columns. It is assumed that the other columns are numeric and are to be differenced as well. Note that -1 here means all columns except the first.
DF[2, -1] - DF[1, -1]
2) Row names. Another way to do it is to convert the first column to row names and then do the subtraction:
DF1 <- DF[-1]
rownames(DF1) <- DF[[1]]
DF1["exposed", ] - DF1["control", ]
3) lapply. This would also work:
data.frame(lapply(DF[-1], diff))
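The row-name approach can be tried end to end with the sketch below. It assumes the data from the question plus one made-up numeric column, other.metric, to stand in for the columns the asker expects to add later. Unlike the snippets above, it computes control minus exposed, which is the direction the question asked for.

```r
DF <- data.frame(
  status = c("control", "exposed"),
  weighted.responses = c(3.872694, 3.713198),
  other.metric = c(10, 12),  # hypothetical extra column
  stringsAsFactors = FALSE
)

# Move the status labels into the row names so rows can be addressed by name
DF1 <- DF[-1]
rownames(DF1) <- DF[[1]]

# Difference of every numeric column, addressed by name rather than position
by_name <- DF1["control", ] - DF1["exposed", ]
print(by_name)
```

The result is a one-row data frame, so individual differences can be pulled out with by_name$weighted.responses and so on.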

Column indexing based on row value

I have the data frame:
DT = data.frame(Row = c(1,2,3,4,5), Price = c(2.1,2.1,2.2,2.3,2.5),
                '2.0' = c(100,300,700,400,0),
                '2.1' = c(400,200,100,500,0),
                '2.2' = c(600,700,200,100,200),
                '2.3' = c(300,0,300,100,100),
                '2.4' = c(400,0,0,500,600),
                '2.5' = c(0,200,0,800,100))
The objective is to create a new column Quantity that selects the value for each row in the column equal to Price, such that:
DT.Objective = data.frame(Row = c(1,2,3,4,5), Price = c(2.1,2.1,2.2,2.3,2.5),
                          '2.0' = c(100,300,700,400,0),
                          '2.1' = c(400,200,100,500,0),
                          '2.2' = c(600,700,200,100,200),
                          '2.3' = c(300,0,300,100,100),
                          '2.4' = c(400,0,0,500,600),
                          '2.5' = c(0,200,0,800,100),
                          Quantity = c(400,200,200,100,100))
The dataset is very large, so efficiency is important. This is what I currently use and am looking to make more efficient:
Names <- names(DT)
DT$Quantity<- DT[Names][cbind(seq_len(nrow(DT)), match(DT$Price, Names))]
For some reason the column names in the example come with an "X" in front of them, whereas in the actual data there is no X.
Cheers.
We can do this with row/column indexing after removing the prefix 'X' using sub or substring, and then do the match as shown in the OP's post:
DT$Quantity <- DT[cbind(1:nrow(DT), match(DT$Price, sub("^X", "", names(DT))))]
DT$Quantity
#[1] 400 200 200 100 100
The X is attached as a prefix when the column names start with numbers. One way to take care of this would be to use check.names=FALSE in the data.frame call or in read.csv/read.table.
@akrun is correct, check.names=TRUE is the default behavior for data.frame(); from the man page:
check.names
logical. If TRUE then the names of the variables in the data frame are checked to ensure that they are syntactically valid variable names and are not duplicated. If necessary they are adjusted (by make.names) so that they are.
If possible, you may want to make your column names a bit more descriptive.
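The check.names point above can be sketched as follows. This uses a cut-down, three-row version of the question's data (an assumption made to keep the example short) and converts Price to character explicitly before matching; with check.names = FALSE the numeric column names are kept as written, so no "X" prefix has to be stripped.

```r
DT <- data.frame(Row = c(1, 2, 3), Price = c(2.1, 2.1, 2.2),
                 "2.1" = c(400, 200, 100),
                 "2.2" = c(600, 700, 200),
                 check.names = FALSE)  # keep "2.1" and "2.2" as column names

print(names(DT))

# One (row, column) pair per row: the column is the one whose name equals Price
idx <- cbind(seq_len(nrow(DT)), match(as.character(DT$Price), names(DT)))

# Matrix indexing picks exactly one cell per row
DT$Quantity <- as.matrix(DT)[idx]
print(DT$Quantity)
```

One caveat, stated as an assumption about real data: a price such as 2.0 becomes "2" when converted to character, so it would not match a column named "2.0" without extra formatting.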

Resources