Make rows out of one column based on another column's value - r

I have a data frame like this
V1 V2
10 5
20 4
30 8
40 6
10 10
20 7
30 4
40 9
And I would like to have all the values relating to the same V1 in one row, like so...
V1 V2 V3
10 5 10
20 4 7
30 8 4
40 6 9

Here is a solution in base R. Feed the unique values of column V1 into lapply and, for each one, extract the matching values of V2. Because lapply returns a list, pass that list to do.call with rbind to stack the pieces into a matrix, then attach the vector of unique values with cbind.
# Create df1 for demonstration
df1 = data.frame(a = rep(1:4, 10), b = sample(1:40))
output = cbind(unique(df1$a), do.call(rbind, lapply(unique(df1$a), function(x) df1$b[df1$a == x])))
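This solution relies on every value of the key column occurring the same number of times (otherwise rbind recycles the shorter vectors and warns), and on the values being of the same type, because cbind returns a matrix and a matrix holds a single type. If the types differ, you will have to convert the columns back to the correct types afterwards, but this should not be a problem.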
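As a minimal sketch, the same lapply / do.call(rbind) / cbind steps can be applied to the data from the question. The names keys, vals and wide are illustrative only, and the sketch assumes each V1 value occurs equally often.

```r
# Data from the question
df <- data.frame(V1 = rep(c(10, 20, 30, 40), 2),
                 V2 = c(5, 4, 8, 6, 10, 7, 4, 9))

# One row of V2 values per unique V1 value
keys <- unique(df$V1)
vals <- do.call(rbind, lapply(keys, function(k) df$V2[df$V1 == k]))

# Attach the keys and name the value columns V2, V3, ...
wide <- data.frame(V1 = keys, vals)
names(wide) <- c("V1", paste0("V", seq_len(ncol(vals)) + 1))
wide
```

Building a data.frame at the end, rather than calling cbind on the matrix, keeps the key column and the value columns as separate columns with their own types.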

You can do what you want with apply functions.
DF <- data.frame(A = c(1:5,1:5),B=11:20)
lst <- lapply(unique(DF$A),function(AA) DF[DF$A ==AA,'B'])
Result <- do.call(rbind,lst)
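If you wish to have the A column back in, you can use Result <- cbind(A = unique(DF$A), Result). (names(lst) would not work here, because lapply over an unnamed vector returns an unnamed list.)
Be careful: this gives you a matrix, not a data.frame. If your values are not all numeric, as they are in this example, everything in the matrix is coerced to one common type (typically character).
There are alternative ways to do this using data.table or dplyr.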

We can do this with dcast from data.table
library(data.table)
dcast(setDT(df1), V1 ~ paste0("V", rowid(V1) + 1), value.var = "V2")
# V1 V2 V3
#1: 10 5 10
#2: 20 4 7
#3: 30 8 4
#4: 40 6 9
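To make the one-liner easier to follow, here is a sketch that splits it into two steps on the question's data. The helper column name col and the result name wide are assumptions made for illustration.

```r
library(data.table)

df1 <- data.table(V1 = rep(c(10, 20, 30, 40), 2),
                  V2 = c(5, 4, 8, 6, 10, 7, 4, 9))

# rowid(V1) numbers the occurrences of each V1 value: 1 1 1 1 2 2 2 2.
# Adding 1 and pasting "V" gives the target column name for each row.
df1[, col := paste0("V", rowid(V1) + 1)]

# Spread V2 across the new column names, one row per V1
wide <- dcast(df1, V1 ~ col, value.var = "V2")
wide
```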

Related

In R, how can I filter only those rows in which the value for column V6 appears exactly 2 times?

How can I keep only those rows in R in which the value in column V6 appears exactly 2 times?
My dataset is called date.
I tried:
library(dplyr)
df <- as.data.frame(date)
df1 <- subset(df,duplicated(V6))
but it does not work.
You can use a contingency table to get the value counts. Here's some example code.
# Make some dummy data (only 8 and 2 appear exactly twice in this example)
df <- data.frame(V1=1:10,
V2=11:10,
V6=c(1,2,8,3,4,3,2,3,8,7))
# Get table of counts for column "V6"
tab <- table(df$V6)
# Get values that appear exactly twice
twice <- as.numeric(names(tab)[tab == 2])
# Filter the data frame based on these values
df <- df[df$V6 %in% twice,]
Output:
V1 V2 V6
2 2 10 2
3 3 11 8
7 7 11 2
9 9 11 8
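A possible one-step alternative, sketched here with base R's ave, computes the count of each V6 value per row and filters on it. It also shows why the duplicated attempt from the question falls short. The names dup_only, n_per_value and exactly_twice are illustrative.

```r
df <- data.frame(V1 = 1:10, V6 = c(1, 2, 8, 3, 4, 3, 2, 3, 8, 7))

# duplicated() only flags the second and later occurrences of a value,
# so this keeps rows 6 to 9 (including the value 3, which appears 3 times)
dup_only <- subset(df, duplicated(V6))

# ave() returns, for every row, how often that row's V6 value occurs
n_per_value <- ave(df$V6, df$V6, FUN = length)
exactly_twice <- df[n_per_value == 2, ]
exactly_twice
```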

Transform table [duplicate]

I would like to repeat entire rows in a data-frame based on the samples column.
My input:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text=df, header=TRUE)
My expected output:
df <- 'chr start end samples
1 10 20 1-10-20-s1
1 10 20 1-10-20-s2
2 4 10 2-4-10-s1
2 4 10 2-4-10-s2
2 4 10 2-4-10-s3'
Any idea how to do this neatly?
We can use expandRows to expand the rows based on the value in the 'samples' column and convert the result to a data.table. Then, grouped by 'chr', we use sprintf to paste the columns together with the sequence of rows and assign the result to the 'samples' column.
library(splitstackshape)
setDT(expandRows(df, "samples"))[,
samples := sprintf("%d-%d-%d-%s%d", chr, start, end, "s",1:.N) , chr][]
# chr start end samples
#1: 1 10 20 1-10-20-s1
#2: 1 10 20 1-10-20-s2
#3: 2 4 10 2-4-10-s1
#4: 2 4 10 2-4-10-s2
#5: 2 4 10 2-4-10-s3
NOTE: data.table will be loaded when we load splitstackshape.
You can achieve this using base R (i.e. avoiding data.table), with the following code:
df <- 'chr start end samples
1 10 20 2
2 4 10 3'
df <- read.table(text = df, header = TRUE)
# Build the repeated rows for a single input row; the column names
# match the input (start, end) so the result lines up with the original
duplicate_rows <- function(chr, start, end, samples) {
  expanded_samples <- paste0(chr, "-", start, "-", end, "-", "s", seq_len(samples))
  data.frame(chr = chr, start = start, end = end, samples = expanded_samples)
}
expanded_rows <- Map(f = duplicate_rows, df$chr, df$start, df$end, df$samples)
new_df <- do.call(rbind, expanded_rows)
The basic idea is to define a function that will take a single row from your initial data.frame and duplicate rows based on the value in the samples column (as well as creating the distinct character strings you're after). This function is then applied to each row of your initial data.frame. The output is a list of data.frames that then need to be re-combined into a single data.frame using the do.call pattern.
The above code can be made cleaner with Hadley Wickham's purrr package (on CRAN) and the data.frame-specific version of map (see the documentation for the by_row function, which has since moved to the purrrlyr package), but this may be overkill for what you're after.
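If the per-row function feels heavy, a more compact base R variant of the same idea is to repeat the row indices with rep and number the copies with sequence. This is only a sketch; idx and out are illustrative names.

```r
df <- data.frame(chr = c(1, 2), start = c(10, 4), end = c(20, 10),
                 samples = c(2, 3))

# Repeat each row index as many times as its samples value: 1 1 2 2 2
idx <- rep(seq_len(nrow(df)), df$samples)
out <- df[idx, c("chr", "start", "end")]

# sequence(c(2, 3)) gives 1 2 1 2 3, the running number within each row
out$samples <- paste(out$chr, out$start, out$end,
                     paste0("s", sequence(df$samples)), sep = "-")
rownames(out) <- NULL
out
```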
Example using the DataFrame function from the S4Vectors package (Bioconductor):
library(S4Vectors)
df <- DataFrame(x=c('a', 'b', 'c', 'd', 'e'), y=1:5)
rep(df, df$y)
where the y column gives the number of times to repeat its corresponding row.
Result:
DataFrame with 15 rows and 2 columns
x y
<character> <integer>
1 a 1
2 b 2
3 b 2
4 c 3
5 c 3
... ... ...
11 e 5
12 e 5
13 e 5
14 e 5
15 e 5

Creating a new variable in R from two existing ones

My apologies if this is a basic question. I'm new to R.
I have a dataset, DAT, which has 3 variables: ID, V1 and V2. Unfortunately, V2 data are missing for many cases. I want to create a new variable, V3. I want V3 to have the same values as V2, but for any case that has a missing value for V2, I want V3 to take the value of V1 instead. What is the most efficient way to do this in R?
One approach using the dplyr package.
# Step 1: Load verb-like data wrangling package.
library(dplyr)
# Step 2: Create some data.
df <- data.frame(ID=1:5, V1 = 11:15, V2 = c(31:33, NA, NA))
ID V1 V2
1 11 31
2 12 32
3 13 33
4 14 NA
5 15 NA
# Step 3: Create a variable V3 using your criteria
df <- mutate(df, V3 = if_else(is.na(V2), V1, V2))
ID V1 V2 V3
1 11 31 31
2 12 32 32
3 13 33 33
4 14 NA 14
5 15 NA 15
Using the data.table package would probably be more efficient if you have a big data frame.
You can also use the ifelse function.
DAT$V3 <- ifelse(is.na(DAT$V2), DAT$V1, DAT$V2)
This reads as: if V2 is missing (NA), use V1; otherwise use the value in V2.
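Another way to express the same rule, sketched below in base R, is to copy V2 and then overwrite only the missing positions. One reason to consider it is that ifelse can drop attributes (for example the Date class), whereas subassignment keeps the type of V2. The example data and the name missing_v2 are assumptions for illustration.

```r
DAT <- data.frame(ID = 1:5, V1 = 11:15, V2 = c(31:33, NA, NA))

# Start from V2, then fill the gaps from V1
DAT$V3 <- DAT$V2
missing_v2 <- is.na(DAT$V2)
DAT$V3[missing_v2] <- DAT$V1[missing_v2]
DAT
```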

Moving rows from one dataframe to another based on a matching column

I'm very sorry for asking this question, because I saw something similar in the past but couldn't find it (so marking it as a duplicate would be understandable).
I have 2 data frames, and I want to move every customer who appears in both data frames into one of them. Note that I want to move the entire row.
Here is an example:
# df1
customer_ip V1 V2
1 15 20
2 12 18
# df2
customer_ip V1 V2
2 45 50
3 12 18
And I want my new data frames to look like:
# df1
customer_ip V1 V2
1 15 20
2 12 18
2 45 50
# df2
customer_ip V1 V2
3 12 18
Thank you in advance!
This does it.
df1<-rbind(df1,df2[df2$customer_ip %in% df1$customer_ip,])
df2<-df2[!(df2$customer_ip %in% df1$customer_ip),]
EDIT: Gaurav & Sotos got here before me whilst I was writing with essentially the same answer, but I'll leave this here as it shows the code without the redundant 'which'
This should do the trick:
#Add appropriate rows to df1
df1 <- rbind(df1, df2[which(df2$customer_ip %in% df1$customer_ip),])
#Remove appropriate rows from df2
df2 <- df2[-which(df2$customer_ip %in% df1$customer_ip),]
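The difference between the two answers matters when nothing matches. The sketch below computes the match once, before either data frame changes, and then shows the -which() pitfall on a hypothetical data frame no_match (not part of the question).

```r
df1 <- data.frame(customer_ip = c(1, 2), V1 = c(15, 12), V2 = c(20, 18))
df2 <- data.frame(customer_ip = c(2, 3), V1 = c(45, 12), V2 = c(50, 18))

# Compute the match once, then move the rows
in_both <- df2$customer_ip %in% df1$customer_ip
df1 <- rbind(df1, df2[in_both, ])
df2 <- df2[!in_both, ]

# Pitfall: with no matches, which() returns integer(0), and
# negative indexing with integer(0) selects zero rows
no_match <- data.frame(customer_ip = c(7, 8), V1 = 1:2, V2 = 3:4)
hits <- which(no_match$customer_ip %in% df1$customer_ip)
dropped_all <- no_match[-hits, ]
kept_all <- no_match[!(no_match$customer_ip %in% df1$customer_ip), ]
c(nrow(dropped_all), nrow(kept_all))
```

The logical version keeps both unmatched rows, while the -which() version silently returns an empty data frame.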

Same function over multiple data frames in R - not over a list of data frames

This question is almost what I want to do, except that its output is given as a list of data frames. Let's reproduce the example from that question.
Let's say I have 2 data frames:
df1
ID col1 col2
x 0 10
y 10 20
z 20 30
df2
ID col1 col2
a 0 10
b 10 20
c 20 30
What I want is a 4th column with an ifelse result. My rationale is:
if col1>=20 in any data.frame whose name matches the pattern "df", then the new column res=1, else res=0.
But I want to create the new column in each data.frame matching that name pattern, not put all of those data.frames in a list and apply the function, unless I could "extract" each element of this list back into an individual data frame.
Thanks
Per @Frank... if my understanding of what you are looking for is correct, consider using data.table. MWE:
library(data.table)
addcol <- function(x) x[,res:=ifelse(col1>=20,1,0)]
df1 <- data.table(ID=c("x","y","z"),col1=c(0,10,20),col2=c(10,20,30))
df2 <- data.table(ID=c("x","y","z"),col1=c(20,10,20),col2=c(10,20,30))
#modified df2 so you can see different effects
lapply(list(df1,df2),addcol)
> df1
ID col1 col2 res
1: x 0 10 0
2: y 10 20 0
3: z 20 30 1
> df2
ID col1 col2 res
1: x 20 10 1
2: y 10 20 0
3: z 20 30 1
This works because data.table's := operates by reference, so inside the function you are updating the underlying table itself, not just a local copy of it.
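For plain data frames, which are copied rather than modified by reference, one possible route is the "extract back" step the question mentions: collect the objects by name pattern, apply the function, and write the results back under the same names with list2env. This is a sketch under assumptions: add_res, df_names, updated and the pattern "^df[0-9]+$" are illustrative choices, and every object matching the pattern must have a col1 column.

```r
df1 <- data.frame(ID = c("x", "y", "z"), col1 = c(0, 10, 20), col2 = c(10, 20, 30))
df2 <- data.frame(ID = c("a", "b", "c"), col1 = c(0, 10, 20), col2 = c(10, 20, 30))

# Returns a modified copy with the new res column
add_res <- function(d) {
  d$res <- as.integer(d$col1 >= 20)
  d
}

# The environment holding the data frames (the global one at top level)
env <- environment()

# Find the data frames by name, update them, and write them back
df_names <- ls(envir = env, pattern = "^df[0-9]+$")
updated <- lapply(mget(df_names, envir = env), add_res)
invisible(list2env(updated, envir = env))

df1
```

After this runs, df1 and df2 are still separate objects, each with the extra res column.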

Resources