I have two data frames, and I am trying to use the values from one to populate a column in the other.
They are as follows:
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"))
df1$B <- 0
df2 <- data.frame(A = c("Doug", "Steve", "John"), B = c(1,1,0))
And the result that I am looking for is:
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"), B = c(1,0,1,0,0,0))
I tried the following approach, but only Doug has a 1 value while the others are 0.
df1$B[(df1$A == df2$A & df2$B == 1)] <- 1
When attempting an approach with %in%, Doug has a 1 value but John does as well when Steve should be the one to receive the 1.
df1$B[(df1$A %in% df2$A & df2$B == 1)] <- 1
Am I missing something here that would resolve this issue?
Thanks in advance
An option with data.table is to join on the 'A' column and assign 'B' from the second dataset (i.B) to 'B' in the first. Rows of df1 with no match in df2 keep their existing value:
library(data.table)
setDT(df1)[df2, B := i.B, on = .(A)]
Output:
df1
# A B
#1: Doug 1
#2: Michele 0
#3: Steve 1
#4: John 0
#5: Pete 0
#6: David 0
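For comparison, here is a minimal base R sketch of the same lookup using match(). The original attempt fails because df1$A == df2$A compares the two columns element by element (recycling the shorter one) instead of looking each name up. The helper names idx and hit are my own and not from the question.

```r
df1 <- data.frame(A = c("Doug", "Michele", "Steve", "John", "Pete", "David"),
                  stringsAsFactors = FALSE)
df1$B <- 0
df2 <- data.frame(A = c("Doug", "Steve", "John"), B = c(1, 1, 0),
                  stringsAsFactors = FALSE)

# idx: position of each df1 name within df2$A, NA when the name is absent
idx <- match(df1$A, df2$A)
# hit: rows of df1 that have a partner in df2
hit <- !is.na(idx)
# copy B across only for the matched rows; unmatched rows keep their 0
df1$B[hit] <- df2$B[idx[hit]]
df1
```

Unmatched names (Michele, Pete, David) are left untouched, which is the same behaviour as the data.table join above.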
Related
I don't know how to phrase this clearly, which may be why I did not find an answer: I want to edit the values of two different columns at the same time, where those two columns are also the ones that identify the row.
For example, this is the data:
> data = data.frame(name1 = c("John","Jake","John","Paul"),
name2 = c("Paul", "Paul","John","John"),
value1 = c(0,0,1,0),
value2 = c(1,0,1,0))
> data
name1 name2 value1 value2
1 John Paul 0 1
2 Jake Paul 0 0
3 John John 1 1
4 Paul John 0 0
I would like to edit the first row so that it becomes Jake & John instead of John & Paul, so I would like to combine these two lines of code and apply them at the same time:
data$name1[(data$name1 == "John" & data$name2 == "Paul")] <- "Jake"
data$name2[(data$name1 == "John" & data$name2 == "Paul")] <- "John"
It should be a simple trick, but I don't have it!
Also, I need to do this on larger datasets where each modification can apply to multiple rows, and I can't know in advance which rows will be affected.
How about this?
data[data$col1 == "A" & data$col2 == "B", ] <- list("B", "D")
data
# col1 col2
#1 B D
#2 A C
#3 B A
#4 B B
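Applied to the name1/name2 data from the question, the same idea can be sketched as below. The two separate assignments in the question fail because the second condition is evaluated after name1 has already been changed to "Jake", so it no longer matches. Computing the condition once and reusing it avoids that; idx is a name I made up for illustration.

```r
data <- data.frame(name1 = c("John", "Jake", "John", "Paul"),
                   name2 = c("Paul", "Paul", "John", "John"),
                   value1 = c(0, 0, 1, 0),
                   value2 = c(1, 0, 1, 0),
                   stringsAsFactors = FALSE)

# evaluate the condition once, before either column is modified
idx <- data$name1 == "John" & data$name2 == "Paul"

# reuse the stored index for both columns
data$name1[idx] <- "Jake"
data$name2[idx] <- "John"
data
```

Because idx is a logical vector over all rows, this also works when the condition matches several rows at unknown positions.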
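library(tidyverse)
data %>%
  mutate(
    # hit is a temporary helper column: the condition is evaluated once,
    # before either name column changes (mutate() applies its expressions
    # in order, so a second test on name1 would see the new value "Jake")
    hit = name1 == "John" & name2 == "Paul",
    name1 = case_when(hit ~ "Jake", TRUE ~ name1),
    name2 = case_when(hit ~ "John", TRUE ~ name2)
  ) %>%
  select(-hit)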
How do I swap one value with another in a column within a dataframe?
For example swap the 2's and 4's around in df1 to give df2:
df1 <- data.frame(col1 = c(1,2,1,4))
df2 <- data.frame(col1 = c(1,4,1,2))
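Simple solution using replace in base R. Note that replace() indexes by position, so this works here because the values 2 and 4 happen to sit at positions 2 and 4: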
df2 <- data.frame(col1 = replace(df1$col1, c(4,2), c(2,4)))
Output
col1
1 1
2 4
3 1
4 2
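If the 2's and 4's can appear at arbitrary positions, a swap by value rather than by position may be safer. Here is a minimal sketch with nested ifelse(); swap_values is a hypothetical helper name of my own, not a base R function.

```r
# swap every occurrence of a with b and vice versa, leaving other values alone
swap_values <- function(x, a, b) {
  ifelse(x == a, b, ifelse(x == b, a, x))
}

df1 <- data.frame(col1 = c(1, 2, 1, 4))
df2 <- data.frame(col1 = swap_values(df1$col1, 2, 4))
df2

# the positions of the values do not matter
swap_values(c(4, 4, 1, 2), 2, 4)
```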
We can try using case_when from the dplyr package for some switch functionality:
library(dplyr)
df2 <- df1
df2$col1 <- case_when(
df2$col1 == 2 ~ 4,
df2$col1 == 4 ~ 2,
TRUE ~ df2$col1
)
df2
col1
1 1
2 4
3 1
4 2
Data:
df1 <- data.frame(col1 = c(1,2,1,4))
You can also swap by position, reassigning the column with its rows reordered by index.
With the dataframe:
df1 <- data.frame(col1 = c("a","b","c","d"))
> df1
col1
1 a
2 b
3 c
4 d
we can:
df1[,1] <- df1[c(1,4,3,2),1]
to get
> df1
col1
1 a
2 d
3 c
4 b
I have the following two data.frames:
df1 <- data.frame(Var1=c(3,4,8,9),
Var2=c(11,32,1,7))
> df1
Var1 Var2
1 3 11
2 4 32
3 8 1
4 9 7
df2 <- data.frame(ID=c('A', 'B', 'C'),
ball=I(list(c("3","11", "12"), c("4","1"), c("9","32"))))
> df2
ID ball
1 A 3, 11, 12
2 B 4, 1
3 C 9, 32
Note that column ball in df2 is a list.
I want to select the ID in df2 with elements in column ball that match a row in df1.
The ideal output would look like this:
> df3
ID ball1 ball2
1 A 3 11
Does anyone have an idea how to do this efficiently? The original data consists of millions of rows in both data.frames.
A data.table solution would work much more quickly than this base R solution but here is a possibility.
your data:
df1 <- data.frame(Var1=c(3,4,8,9),
Var2=c(11,32,1,7))
df2 <- data.frame(ID=c('A', 'B', 'C'),
ball=I(list(c("3","11", "12"), c("4","1"), c("9","32"))))
the process:
df2$ID <- as.character(df2$ID) # just in case they are factor levels instead
n <- nrow(df1) * nrow(df2) # upper bound on the number of matches, so df3 is big enough
df3 <- data.frame(ID = character(n),
Var1 = numeric(n), Var2 = numeric(n),
stringsAsFactors = FALSE) # to make sure we get the ID as a string
count <- 0 # counter
for(i in 1:nrow(df1)){
for(j in 1:nrow(df2)){
if(all(df1[i,] %in% df2$ball[[j]])){
count = count + 1
df3$ID[count] <- df2$ID[j]
df3$Var1[count] <- df1$Var1[i]
df3$Var2[count] <- df1$Var2[i]
}
}
}
df3_final <- df3[df3$ID != "", ] # drop the unused rows, since we overestimated the size of df3
df3_final
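To avoid the double loop, one possibility is to unnest the list column into long form and join twice. This is only a sketch in base R that I have not benchmarked on millions of rows; it assumes the ball values are strings that convert cleanly to numbers, and the names long, m1 and m2 are mine.

```r
df1 <- data.frame(Var1 = c(3, 4, 8, 9),
                  Var2 = c(11, 32, 1, 7))
df2 <- data.frame(ID = c("A", "B", "C"),
                  ball = I(list(c("3", "11", "12"), c("4", "1"), c("9", "32"))))

# one row per (ID, ball value) pair
long <- data.frame(ID = rep(df2$ID, lengths(df2$ball)),
                   ball = as.numeric(unlist(df2$ball)))

# first join: IDs whose ball values contain Var1
m1 <- merge(df1, long, by.x = "Var1", by.y = "ball")

# second join: keep only those where the same ID also contains Var2
m2 <- merge(m1, long, by.x = c("ID", "Var2"), by.y = c("ID", "ball"))

df3 <- m2[, c("ID", "Var1", "Var2")]
df3
```

The same two joins should translate directly to data.table or dplyr if base merge() turns out to be too slow.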
I have a data set in chronological order which I have imported to R using:
mydata <- read.csv(file="test.csv",stringsAsFactors=FALSE)
Two of the columns in the data set are 'winner' and 'loser'. Each row in the data is a tennis match.
What I am looking to do is add two columns: one giving a cumulative count of the total matches the player in the 'winner' column has played up to and including the match on that row, and one giving the same count for the player in the 'loser' column.
So for example it would look like this:
winner loser winner_matches loser_matches
tom andy 1 1
andy greg 2 1
greg tom 2 2
I hope that makes sense.
I have tried using the following code but can't get it to work across both columns:
ave(mydata$winner_name==mydata$winner_name, mydata$winner_name, FUN=cumsum)
So the data below is the first 10 rows of around 20,000.
1) base Define a function which counts matches up to the ith row for the indicated player and then apply it for the winner and loser matches separately. No packages are used:
count_matches <- function(i, player) {
with(DF[1:i, ], sum(winner == player | loser == player))
}
n <- nrow(DF)
transform(DF, winner_matches = mapply(count_matches, 1:n, winner),
loser_matches = mapply(count_matches, 1:n, loser))
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
2) sqldf A different solution can be obtained using sqldf upon realizing that this problem can be solved with a self-join on a complex condition like this:
library(sqldf)
sqldf("select a.winner,
a.loser,
sum(a.winner = b.winner or a.winner = b.loser) winner_matches,
sum(a.loser = b.winner or a.loser = b.loser) loser_matches
from DF a join DF b on a.rowid >= b.rowid
group by a.rowid")
giving:
winner loser winner_matches loser_matches
1 tom andy 1 1
2 andy greg 2 1
3 greg tom 2 2
Note: The input used, in reproducible form, is:
Lines <- "winner loser
tom andy
andy greg
greg tom"
DF <- read.table(text = Lines, header = TRUE, as.is = TRUE)
We can get the number of times each player has won or lost so far with the data.table package. Note that this counts wins and losses separately rather than total matches played:
library(data.table)
setDT(dat)[, winner_matches_won := seq_len(.N), by=(winner)]
setDT(dat)[, loser_matches_lost := seq_len(.N), by=(loser)]
dat
# winner loser winner_matches_won loser_matches_lost
# 1: tom andy 1 1
# 2: andy greg 1 1
# 3: greg tom 1 1
# 4: greg tom 2 2
# 5: tom greg 2 2
Data:
dat <- structure(list(winner = structure(c(3L, 1L, 2L, 2L, 3L), .Label = c("andy",
"greg", "tom"), class = "factor"), loser = structure(c(1L, 2L,
3L, 3L, 2L), .Label = c("andy", "greg", "tom"), class = "factor")), .Names = c("winner",
"loser"), class = "data.frame", row.names = c(NA, -5L))
You're really close to getting ave to work. The cumsum function doesn't know how to handle text so I created a dummy column that's equal to 1 for each row. That gives cumsum something to count.
Here's a sample dataframe.
mydata <-
data.frame(
winner = c("tom", "andy", "greg", "tom", "gary"),
loser = c("andy", "greg", "tom", "gary", "tom"),
stringsAsFactors = FALSE
)
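And here's the code to add the two new columns. Note that each ave() call groups by a single column, so the first counts cumulative wins for the winner and the second counts cumulative losses for the loser, not total matches played.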
library(tidyverse)
mydata <- mutate(mydata, one = 1) # Add dummy column
# Use ave() to calculate both the wins and losses
mydata$winner_matches <- ave(x = mydata$one, mydata$winner, FUN = cumsum)
mydata$loser_matches <- ave(x = mydata$one, mydata$loser, FUN = cumsum)
mydata <- select(mydata, -one) # Remove dummy column
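To count total matches across both columns with ave(), one option is to interleave the winner and loser names into a single vector in match order, run the cumulative count there, and split it back. This is a sketch using the three-row example from the question; players and counts are names I chose for illustration.

```r
mydata <- data.frame(winner = c("tom", "andy", "greg"),
                     loser = c("andy", "greg", "tom"),
                     stringsAsFactors = FALSE)
n <- nrow(mydata)

# interleave as winner1, loser1, winner2, loser2, ... to keep match order
players <- as.vector(t(as.matrix(mydata[c("winner", "loser")])))

# running appearance count per player, regardless of which column it came from
counts <- ave(seq_along(players), players, FUN = seq_along)

# odd positions belong to winners, even positions to losers
mydata$winner_matches <- counts[seq(1, 2 * n, by = 2)]
mydata$loser_matches <- counts[seq(2, 2 * n, by = 2)]
mydata
```

Using seq_along as the counting function also removes the need for the dummy column of ones.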
I need to delete all rows that contain a value of 2 or -2 in any column except the first.
Example dataframe:
df
a b c d
zzz 2 2 -1
yyy 1 1 1
xxx 1 -1 -2
Desired output:
df
a b c d
yyy 1 1 1
I have tried
df <- df[!grepl(-2 | 2, df),]
df <- subset(df, !df[-1] == 2 |!df[-1] == -2)
My actual dataset has over 300 rows and 70 variables.
I believe I need to use some sort of apply function, but I am not sure.
Any help is appreciated; please let me know if you need more info.
We can create a logical index by comparing the absolute value of every column except the first with 2 and taking the row-wise sum. A row with no matching values sums to 0, so negating with ! gives TRUE for the rows to keep and FALSE for the others, and we subset on that index:
df[!rowSums(abs(df[-1])==2),]
# a b c d
#2 yyy 1 1 1
Or another option is to compare within each column using lapply, collapse it to a logical vector with | and use that to subset the rows
df[!Reduce(`|`,lapply(abs(df[-1]), `==`, 2)),]
# a b c d
#2 yyy 1 1 1
We could also do this with tidyverse
library(tidyverse)
df %>%
select(-1) %>% #to remove the first column
map(~abs(.) ==2) %>% #do the columnwise comparison
reduce(`|`) %>% #reduce it to logical vector
`!` %>% #negate to convert TRUE/FALSE to FALSE/TRUE
df[., ] #subset the rows of original dataset
# a b c d
# 2 yyy 1 1 1
data
df <- structure(list(a = c("zzz", "yyy", "xxx"), b = c(2L, 1L, 1L),
c = c(2L, 1L, -1L), d = c(-1L, 1L, -2L)), .Names = c("a",
"b", "c", "d"), class = "data.frame", row.names = c(NA, -3L))
Option with dplyr:
library(dplyr)
a <- c("zzz","yyy","xxx")
b <- c(2,1,1)
c <- c(2,1,-1)
d <- c(-1,1,-2)
df <- data.frame(a,b,c,d)
filter(df,((abs(b) != 2) & (abs(c) != 2) & (abs(d) != 2)))
a b c d
1 yyy 1 1 1
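With 70 variables, listing every column by hand gets unwieldy. Assuming a dplyr version that provides if_all() (1.0.4 or later, as far as I know), the same filter can be sketched without naming each column:

```r
library(dplyr)

df <- data.frame(a = c("zzz", "yyy", "xxx"),
                 b = c(2, 1, 1),
                 c = c(2, 1, -1),
                 d = c(-1, 1, -2))

# keep rows where every column except a has an absolute value other than 2
res <- filter(df, if_all(-a, ~ abs(.x) != 2))
res
```

The -a selection assumes every remaining column is numeric; a character column there would make abs() fail.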