I have a data frame and I would like to assign a calculated value to one cell in each row, selected by a given column index.
df <- data.frame(a = c(2,4,7,3,5,3), b = c(8,3,8,2,6,1))
> df
a b
1 2 8
2 4 3
3 7 8
4 3 2
5 5 6
6 3 1
max <- apply(df, 1, which.max)
> max
[1] 2 1 2 1 2 1
addition <- apply(df, 1, sum)
> addition
[1] 10 7 15 5 11 4
Then I need some operation, which I cannot figure out, that assigns the following result to df2:
> df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
I would highly appreciate your ideas and your help. Thank you.
You can use cbind to access your selected columns for each row:
df2 = df
df2[cbind(1:nrow(df2),max)] = addition
df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
Here, cbind returns a matrix of 2 columns and 6 rows that we use to subset the dataframe using matrix subsetting.
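For example, inspecting that index matrix (using the objects already defined above) makes the mechanism easy to see:
# one (row number, column number) pair per row of df
idx <- cbind(1:nrow(df), max)
# extracting with the same matrix pulls out each row's maximum: 8 4 8 3 6 3
df[idx]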
You can also use vectorised ifelse directly:
with(df, cbind.data.frame(a = ifelse(a > b, a + b, a), b = ifelse(a > b, b, a + b)));
# a b
#1 2 10
#2 7 3
#3 7 15
#4 5 2
#5 5 11
#6 4 1
I have a very large data set with 400 string and numeric variables. I want to compare each pair of consecutive columns: 3 & 4, 5 & 6, and so on. That is, I compare the third variable (.x) with the fourth (.y), the fifth with the sixth, the seventh with the eighth, etc., in the following way: if (.y) is NA, replace the NA with the value in the corresponding row of (.x). For example, if number.y is NA, we replace the NA with the corresponding value from number.x, which would be 5. Again, if day.y is NA, we replace the NA in day.y with the corresponding value from day.x, which would be 3. How can I write a loop to do that?
A<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
B<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
number.x<-c(1,2,3,4,5,6,7,NA,NA,5,5,6)
number.y<-c(3,4,5,6,1,2,7,6,7,NA,NA,6)
day.x<-c(1,3,4,5,6,7,8,1,NA,3,5,3)
day.y<-c(4,5,6,7,8,7,8,1,2,3,5,NA)
school.x<-c("a","b","b","c","n","f","h","NA","F","G","z","h")
school.y<-c("a","b","b","c","m","g","h","NA","NA","G","H","T")
city.x<- c(1,2,3,7,5,8,7,5,6,7,5,1)
city.y<- c(1,2,3,5,5,7,7,NA,NA,3,4,5)
df<-data.frame(A,B,number.x,number.y,day.x,day.y,school.x,school.y,city.x,city.y)
This is a somewhat hacky approach to your question, and it requires that every pair of adjacent columns is compared against one another.
library(dplyr)

start_group <- seq(1, length(df), by = 2)  # first column of each pair
df2 <- data.frame(id = 1:nrow(df))

for (i in start_group) {
  j <- i + 1
  dnames <- df[, c(i, j)] %>% names
  # fill NAs in each column of the pair from its partner column
  df_ <- data.frame(col1 = df[, i],
                    col2 = df[, j]) %>%
    mutate(col1 = ifelse(is.na(col1), col2 %>% paste, col1 %>% paste)) %>%
    mutate(col2 = ifelse(is.na(col2), col1 %>% paste, col2 %>% paste))
  names(df_) <- dnames
  df2 <- cbind(df2, df_)
}
df2[, -1]
number.x number.y day.x day.y school.x school.y city.x city.y
1 1 3 1 4 a a 1 1
2 2 4 3 5 b b 2 2
3 3 5 4 6 b b 3 3
4 4 6 5 7 c c 7 5
5 5 1 6 8 n m 5 5
6 6 2 7 7 f g 8 7
7 7 7 8 8 h h 7 7
8 6 6 1 1 NA NA 5 5
9 7 7 2 2 F F 6 6
10 5 5 3 3 G G 7 3
11 5 5 5 5 z H 5 4
12 6 6 3 3 h T 1 5
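The output above shows only the paired columns from number.x onward; if the standalone A and B columns should be left out of the pairing (as the question's pairing from column 3 onward suggests), one could presumably just start the sequence at column 3 instead:
# skip A and B: begin pairing at the third column
start_group <- seq(3, length(df), by = 2)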
Consider the following base R solution. Essentially, it loops through the distinct list of column stems (number, day, school, city) and replaces NA values in the .x columns with the corresponding values from the .y columns, and vice versa. NOTE: the school columns require conversion from factor to character, and one of their rows has NA in both the .x and .y columns.
# CONVERT TO CHARACTER (NOTE: NA VALUES BECOME "NA" STRINGS)
df[, c('school.x', 'school.y')] <-
  sapply(df[, c('school.x', 'school.y')], as.character)

# SET UP FINAL DF
finaldf <- df

# OBTAIN UNIQUE LIST OF COLUMN STEMS (W/O .x AND .y SUFFIXES), SKIPPING A AND B
distinctcols <- unique(gsub("[.][x]|[.][y]", "", names(df)[3:ncol(df)]))

# LOOP THROUGH COLUMN STEMS, REPLACING NA VALUES
for (col in distinctcols) {
  # REPLACE NA .x COLUMN VALUES WITH .y VALUES
  finaldf[is.na(finaldf[paste0(col, '.x')]) | finaldf[paste0(col, '.x')] == "NA",
          paste0(col, '.x')] <-
    finaldf[is.na(finaldf[paste0(col, '.x')]) | finaldf[paste0(col, '.x')] == "NA",
            paste0(col, '.y')]
  # REPLACE NA .y COLUMN VALUES WITH .x VALUES
  finaldf[is.na(finaldf[paste0(col, '.y')]) | finaldf[paste0(col, '.y')] == "NA",
          paste0(col, '.y')] <-
    finaldf[is.na(finaldf[paste0(col, '.y')]) | finaldf[paste0(col, '.y')] == "NA",
            paste0(col, '.x')]
}
OUTPUT
number.x number.y day.x day.y school.x school.y city.x city.y
1 1 3 1 4 a a 1 1
2 2 4 3 5 b b 2 2
3 3 5 4 6 b b 3 3
4 4 6 5 7 c c 7 5
5 5 1 6 8 n m 5 5
6 6 2 7 7 f g 8 7
7 7 7 8 8 h h 7 7
8 6 6 1 1 NA NA 5 5
9 7 7 2 2 F F 6 6
10 5 5 3 3 G G 7 3
11 5 5 5 5 z H 5 4
12 6 6 3 3 h T 1 5
I have two data frames which initially have the same elements, but after eliminating some rows in one of them they are no longer the same length.
x <-c(4,2,3,6,7,3,1,8,5,2,4,1,2,6,3)
y <-c(1,4,2,3,6,7,3,1,8,5,2,3,1,4,3)
z <-c(4,2,3,1,8,5,2,4,1)
k <-c(1,4,2,3,1,8,5,2,3)
df1 <- data.frame(x,y)
df2 <- data.frame(z,k)
I would like to find a way, in the second data frame (df2), to create a column (or otherwise keep an index reference) holding the corresponding row numbers of the first data frame (df1), so that the result is a new data frame as follows (a would be the index reference into df1).
df3
a z k
1 1 4 1
2 2 2 4
3 3 3 2
4 7 1 3
5 8 8 1
6 9 5 8
7 10 2 5
8 11 4 2
9 12 1 3
I could manually create a column of all the rows that were eliminated, or use
library(sqldf)
a1NotIna2 <- (sqldf('SELECT * FROM df1 EXCEPT SELECT * FROM df2'))
a1NotIna2
x y
1 2 1
2 3 3
3 3 7
4 6 3
5 6 4
6 7 6
I have tried using which on this last expression, without success, to find out which rows of df1 were eliminated, so that I could remove those positions from a sequence vector of length equal to nrow(df1) and obtain an index vector like the one in df3.
Any help is welcome.
A generic solution if your data.frames have two columns, using pmatch:
transform(df2, a=pmatch(do.call(paste0, df2), do.call(paste0, df1)))
# z k a
#1 4 1 1
#2 2 4 2
#3 3 2 3
#4 1 3 7
#5 8 1 8
#6 5 8 9
#7 2 5 10
#8 4 2 11
#9 1 3 12
You can get the first matching row of df1 for each row in df2 with:
match(paste(df2$z, df2$k), paste(df1$x, df1$y))
# [1] 1 2 3 7 8 9 10 11 7
Unfortunately this won't maintain ordering when you have duplicated rows, so for instance we got index 7 for the last row of df2 instead of 12.
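If that matters, one possible workaround (just a sketch, not the only option) is to append a per-key occurrence counter so that repeated rows are matched in order:
key1 <- paste(df1$x, df1$y)
key2 <- paste(df2$z, df2$k)
# number the repeats within each key, then match on key + repeat count
match(paste(key2, ave(key2, key2, FUN = seq_along)),
      paste(key1, ave(key1, key1, FUN = seq_along)))
# [1]  1  2  3  7  8  9 10 11 12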
I have a data frame with an id variable, which may be duplicated. I want to split this into two data frames: one containing only the entries whose ids are duplicated, the other containing only the entries whose ids are unique. What is the best way of doing this?
For example, say I had the data frame:
dataDF <- data.frame(id = c(1,1,2,3,4,4,5,6),
                     a = c(1,2,3,4,5,6,7,8),
                     b = c(8,7,6,5,4,3,2,1))
i.e. the following
id a b
1 1 1 8
2 1 2 7
3 2 3 6
4 3 4 5
5 4 5 4
6 4 6 3
7 5 7 2
8 6 8 1
I want to get the following dataframes:
id a b
1 1 1 8
2 1 2 7
5 4 5 4
6 4 6 3
and
id a b
3 2 3 6
4 3 4 5
7 5 7 2
8 6 8 1
I am currently doing this as follows
dupeIds <- unique(subset(dataDF, duplicated(dataDF$id))$id)
uniqueDF <- subset(dataDF, !id %in% dupeIds)
dupeDF <- subset(dataDF, id %in% dupeIds)
This seems to work, but it feels a bit off to subset three times. Is there a simpler way of doing this? Thanks
Use duplicated twice, once top down, and once bottom up, and then use split to get it all in a list, like this:
split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
# $`FALSE`
# id a b
# 3 2 3 6
# 4 3 4 5
# 7 5 7 2
# 8 6 8 1
#
# $`TRUE`
# id a b
# 1 1 1 8
# 2 1 2 7
# 5 4 5 4
# 6 4 6 3
If you need to split this out into separate data.frames in your workspace (not sure why you would need to do that), assign names to the list items (e.g. names(mylist) <- c("nodupe", "dupe")) and then use list2env, as sketched below.
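A minimal sketch of that follow-up step (the list name mylist and the element names are just illustrative):
mylist <- split(dataDF, duplicated(dataDF$id) | duplicated(dataDF$id, fromLast = TRUE))
names(mylist) <- c("nodupe", "dupe")   # FALSE group = unique ids, TRUE group = duplicated ids
list2env(mylist, envir = .GlobalEnv)   # creates data frames `nodupe` and `dupe` in the workspace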
I have a dataframe df as follows:
A B C
NA 1 2
2 NA 3
4 5 6
7 8 9
What I want to do is remove all the rows that have an NA.
If I use
apply(df,1,function(row) all(!is.na(row)))
I get a logical vector over all the rows: TRUE if the row does not contain an NA and FALSE if it does.
But how do I get the row names, so that I can write something like
df2<-df[-c(list of rows that contains NA),]
which would give me the new data frame with the NA-containing rows removed.
Thanks in advance.
Assuming you have a dataframe that looks like this:
A B C
1 NA 1 2
2 2 NA 3
3 4 5 6
4 7 8 9
Then try:
df1[apply(df1,1,function(x) !any(is.na(x))), ]
A B C
3 4 5 6
4 7 8 9
It doesn't use rownames but rather a logical vector. I guess Joshua and I read your question differently, but we used the same method.
Joshua's suggestion is more compact:
> na.omit(df1)
A B C
3 4 5 6
4 7 8 9
And it reminds me that I should have used:
> df1[complete.cases(df1), ]
A B C
3 4 5 6
4 7 8 9
You can use the logical vector from your apply call to index your data.frame.
> Data[!apply(Data,1,function(row) all(!is.na(row))),]
A B C
1 NA 1 2
2 2 NA 3
> # or like this:
> Data[apply(Data,1,function(row) any(is.na(row))),]
A B C
1 NA 1 2
2 2 NA 3
is.na on a data.frame returns a matrix, which is a better candidate for apply:
df <- read.table(textConnection(" A B C
NA 1 2
2 NA 3
4 5 6
7 8 9
"))
## a matrix
is.na(df)
## logical for selecting rows that are all NA
apply(df, 1, function(x) all(is.na(x)))
## one liner
df[!apply(df, 1, function(x) all(is.na(x))), ]