R: ifelse statement: comparing data.frames

I have 2 data frames and I'm trying to compare a value in one with a value in the other.
If the value matches in both table 1 and table 2, then a third value from table 2 is inserted into table 1.
Example table mydf:
    words number
1      it      1
2     was      2
3     the      3
4 LTD QTY      4
5     end      5
6      of      6
7  winter      7
Table x.sub:
   lev_dist    Var1    Var2
31        1 LTD QTY LTD QTY
What I want to say is: if Var1 in x.sub is equal to words in mydf, then insert x.sub$lev_dist into a third column next to that word in mydf.
My attempt is below, but it keeps producing 3 in the results instead of the lev_dist value:
mydf$lev_dist <- ifelse(test = (mydf$words == x.sub$Var1), x.sub$Var1, 0)
Results:
    words number lev_dist
1      it      1        0
2     was      2        0
3     the      3        0
4 LTD QTY      4        3
5     end      5        0
6      of      6        0
7  winter      7        0
Can anyone help?

x.sub$Var1 is a factor column, so when we use it inside the ifelse we get the numeric level code of the factor (the 3 you are seeing) rather than its label. Wrap it in as.character() for the comparison, and return x.sub$lev_dist as the 'yes' value:
mydf$lev_dist <- ifelse(mydf$words == as.character(x.sub$Var1),
                        x.sub$lev_dist, 0)
This could have been avoided if the columns were of character class. Using stringsAsFactors = FALSE in read.csv/read.table or data.frame would ensure that all character columns stay character.
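For example, a minimal sketch of building the data with stringsAsFactors = FALSE (values taken from the question):
mydf  <- data.frame(words  = c("it", "was", "the", "LTD QTY", "end", "of", "winter"),
                    number = 1:7,
                    stringsAsFactors = FALSE)   # words stays character, not factor
x.sub <- data.frame(lev_dist = 1, Var1 = "LTD QTY", Var2 = "LTD QTY",
                    stringsAsFactors = FALSE)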

You can also use merge:
x.sub <- setNames(x.sub, c('lev_dist', 'words', 'Var2'))
df_ <- merge(mydf, x.sub[, 1:2], by = 'words', all = TRUE)
df_[is.na(df_)] <- 0
# > df_
#     words number lev_dist
#1      end      5        0
#2       it      1        0
#3  LTD QTY      4        1
#4       of      6        0
#5      the      3        0
#6      was      2        0
#7   winter      7        0


Count the amount of times value A occurs without value B and vice versa

I'm having trouble figuring out how to do the opposite of the answer to this question (and in R, not Python):
Count the amount of times value A occurs with value B
Basically I have a dataframe with a lot of combinations of pairs of columns like so:
df <- data.frame(id1 = c("1","1","1","1","2","2","2","3","3","4","4"),
                 id2 = c("2","2","3","4","1","3","4","1","4","2","1"))
I want to count how often each value in column A occurs anywhere in the data frame without the value from column B. So the result for this small example would be the output of:
df_result <- data.frame(id1 = c("1","1","1","2","2","2","3","3","4","4"),
                        id2 = c("2","3","4","1","3","4","1","4","2","1"),
                        count = c("4","5","5","3","5","4","2","3","3","3"))
The important criterion is that the final results data frame is collapsed by pairs (so in my example rows 1 and 2 are duplicates; they are collapsed, and the count is the total frequency with which 1 is observed without 2). For tallying the occurrences, both columns have to be examined, i.e. the order of the columns doesn't matter: if column A has 1 and B has 2, this counts the same as if column A has 2 and B has 1.
I can do this very slowly by filtering for each pair, but it's not really feasible for my real data, where I have many, many different pairs.
Any guidance is greatly appreciated.
First paste the two id columns together into id12 for later matching. Then use sapply to go through all rows and, for each one, count the records where id1 appears in id12 but id2 doesn't. Keep only the distinct records and, finally, remove the id12 column.
library(dplyr)
df %>%
  mutate(id12 = paste0(id1, id2),
         count = sapply(1:nrow(.),
                        function(x)
                          sum(grepl(id1[x], id12) & !grepl(id2[x], id12)))) %>%
  distinct() %>%
  select(-id12)
Or completely in base R:
id12 <- paste0(df$id1, df$id2)
df$count <- sapply(1:nrow(df), function(x) sum(grepl(df$id1[x], id12) & !grepl(df$id2[x], id12)))
df <- df[!duplicated(df),]
Output:
   id1 id2 count
1    1   2     4
2    1   3     5
3    1   4     5
4    2   1     3
5    2   3     5
6    2   4     4
7    3   1     2
8    3   4     3
9    4   2     3
10   4   1     3
A full tidyverse version:
library(tidyverse)
df %>%
  mutate(id = paste(id1, id2),
         count = map_int(cur_group_rows(),
                         ~ sum(str_detect(id, id1[.x]) & str_detect(id, id2[.x], negate = TRUE))))
A more efficient approach would be to work on a tabulation format:
tab = crossprod(table(rep(seq_len(nrow(df)), ncol(df)), c(df$id1, df$id2)))
# tab
#
#   1 2 3 4
# 1 7 3 2 2
# 2 3 6 1 2
# 3 2 1 4 1
# 4 2 2 1 5
So now we have, for each pair of values, the number of times they appear together (irrespective of their order in the two columns), and on the diagonal the total number of appearances of each id. From here on, we need a way to subset the table by each pair and subtract their co-occurrence count from the first id's total number of appearances.
Make a grid of all combinations:
gr = expand.grid(id1 = colnames(tab), id2 = rownames(tab), stringsAsFactors = FALSE)
Create 2-column matrices to subset the table:
id1.ij = cbind(match(gr$id1, colnames(tab)),
               match(gr$id1, rownames(tab)))
id2.ij = cbind(match(gr$id1, colnames(tab)),
               match(gr$id2, rownames(tab)))
Subtract the respective values:
cbind(gr, count = tab[id1.ij] - tab[id2.ij])
# id1 id2 count
#1 1 1 0
#2 2 1 3
#3 3 1 2
#4 4 1 3
#5 1 2 4
#6 2 2 0
#7 3 2 3
#8 4 2 3
#9 1 3 5
#10 2 3 5
#11 3 3 0
#12 4 3 4
#13 1 4 5
#14 2 4 4
#15 3 4 3
#16 4 4 0
Of course, if we do not need the full grid of values, we can set:
gr = unique(df)
which results in:
# id1 id2 count
#1 1 2 4
#3 1 3 5
#4 1 4 5
#5 2 1 3
#6 2 3 5
#7 2 4 4
#8 3 1 2
#9 3 4 3
#10 4 2 3
#11 4 1 3

In R, how do I add rows containing counts of the number of values in that column that ==x?

I have a data frame like this:
Q1 <- c(1,0,1,4,3)
Q2 <- c(0,1,2,1,4)
df <- data.frame(Q1,Q2)
df
Q1 Q2
1 1 0
2 0 1
3 1 2
4 4 1
5 3 4
There are many more columns like this, and what I want to do is add 5 rows at the bottom of the data frame containing the count of how many items in each column == 0, how many == 1, how many == 2, how many == 3, and how many == 4. Thank you.
You can apply table on each column of df and rbind the result to the original dataset.
rbind(df, sapply(df, function(x) table(factor(x, levels = 0:4))))
#   Q1 Q2
#1   1  0
#2   0  1
#3   1  2
#4   4  1
#5   3  4
#0   1  1
#11  2  2
#21  0  1
#31  1  0
#41  1  1
You can use this (if the column is numeric, make it character or factor in order to get the count of each level):
New_df1 <- as.data.frame(table(df$Q1))
New_df2 <- as.data.frame(table(df$Q2))
Then you can transform them and, if you want, add them to your original data frame, for example as sketched below.
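A minimal sketch of that last step (assuming the levels 0:4 from the question, so both frequency tables line up row by row):
New_df1 <- as.data.frame(table(factor(df$Q1, levels = 0:4)))
New_df2 <- as.data.frame(table(factor(df$Q2, levels = 0:4)))
# bind the per-level counts underneath the original data frame
rbind(df, data.frame(Q1 = New_df1$Freq, Q2 = New_df2$Freq))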

Sort across rows to obtain three largest values

There is an injury score called the ISS score.
I have a table of injury data in rows according to patient ID (pt_id).
I would like to obtain the top three values across the 6 injury columns for each patient.
Column values range from 0-5.
pt_id head face abdo pelvis Extremity External
1 4 0 0 1 0 3
2 3 3 5 0 3 2
3 0 0 2 1 1 1
4 2 0 0 0 0 1
5 5 0 0 2 0 1
My output for the above example would be:
pt_id n1 n2 n3
1 4 3 1
2 5 3 3
3 2 1 1
4 2 1 0
5 5 2 1
Values can be in a list or in new columns, as calculating the score is simple from that point on.
I had thought that I would be able to create a list for the 6 injury columns and then apply a sort to each list taking the top three values. My code for that was:
ais$ais_list <- setNames(split(ais[,2:7], seq(nrow(ais))), rownames(ais))
But I struggled to apply the sort to the lists within the data frame, as unfortunately some of the data in my data set includes NA values.
We could use apply row-wise to sort the values in decreasing order and take only the first three values in each row.
cbind(df[1], t(apply(df[-1], 1, sort, decreasing = TRUE)[1:3, ]))
# pt_id 1 2 3
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1
As some rows may contain NA values, it is better to apply sort inside an anonymous function and then take the top 3 values using head.
cbind(df[1], t(apply(df[-1], 1, function(x) head(sort(x, decreasing = TRUE), 3))))
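Note that sort() silently drops NAs, so a row with fewer than three non-NA values would no longer give a clean 3-column result above. A sketch that pads such rows back to length 3 (the helper name top3 is made up for illustration):
top3 <- function(x, n = 3) {
  out <- head(sort(x, decreasing = TRUE), n)  # sort() drops NAs by default
  length(out) <- n                            # pad with NA when fewer than n values remain
  out
}
cbind(df[1], t(apply(df[-1], 1, top3)))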
A tidyverse option is to first gather the data, arrange it in descending order, and for every pt_id keep only the first three values. We then replace the injury column with the rank labels we want (1:3) and finally spread the data back to wide format.
library(tidyverse)
df %>%
  gather(injury, value, -pt_id) %>%
  arrange(desc(value)) %>%
  group_by(pt_id) %>%
  slice(1:3) %>%
  mutate(injury = 1:3) %>%
  spread(injury, value)
# pt_id `1` `2` `3`
# <int> <int> <int> <int>
#1 1 4 3 1
#2 2 5 3 3
#3 3 2 1 1
#4 4 2 1 0
#5 5 5 2 1

How to replace 1's in each column using a lookup table

I have a data frame (test_df) and a lookup list (key). I would like to replace the 1's in test_df with a value using key. key has only a subset of the column names, and not in the same order. So look up the value for "dog" in key (5), and replace the 1's in the "dog" column of test_df with 5.
test_df
cat dog monkey bear
1 1 0 2
2 1 1 1
0 2 2 0
key
dog cat bear
5 6 7
desired output
cat dog monkey bear
6 5 0 2
2 5 1 7
0 2 2 0
Thanks for your help.
We can loop through the columns using Map and replace the 1s with the corresponding values of 'key':
test_df[names(key)] <- Map(function(x, y) replace(x, x==1, y), test_df[names(key)], key)
test_df
# cat dog monkey bear
#1 6 5 0 2
#2 2 5 1 7
#3 0 2 2 0
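For reference, a minimal sketch of the example data this assumes (key built here as a one-row data frame; a named vector or list would also work with the Map call):
test_df <- data.frame(cat    = c(1, 2, 0),
                      dog    = c(1, 1, 2),
                      monkey = c(0, 1, 2),
                      bear   = c(2, 1, 0))
key <- data.frame(dog = 5, cat = 6, bear = 7)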
Another option is a simple for loop over the key columns:
for(f in names(key)){
  col.num = which(names(test_df) == f)
  test_df[which(test_df[, col.num] == 1), col.num] = key[, f]
}

R saving the output of table() into a data frame

I have the following data frame:
id<-c(1,2,3,4,1,1,2,3,4,4,2,2)
period<-c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df<-data.frame(id,period)
typing
table(df)
results in
period
id calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
However, if I save it as a data frame 'df',
df<-data.frame(table(df))
the format of 'df' becomes:
id period Freq
1 1 calib 2
2 2 calib 1
3 3 calib 1
4 4 calib 0
5 1 first 1
6 2 first 2
7 3 first 0
8 4 first 0
9 1 valid 0
10 2 valid 0
11 3 valid 2
12 4 valid 3
How can I avoid this, and how can I save the first output as-is into a data frame?
More importantly, is there any way to get the same result using 'dcast'?
Would this help?
> data.frame(unclass(table(df)))
calib first valid
1 1 2 0
2 2 0 2
3 0 0 2
4 1 1 1
To elaborate just a little bit: I've changed the ids in the example data frame so that they are not 1:4, in order to show that the ids are carried along into the table and are not just a sequence of row counts.
id <- c(10,20,30,40,10,10,20,30,40,40,20,20)
period <- c("first","calib","valid","valid","calib","first","valid","valid","calib","first","calib","valid")
df <- data.frame(id,period)
Create the new data.frame in one of two ways. rengis' answer above is fine for 2-column data frames that have the id column first; it won't work so well if your data frame has more than 2 columns, or if the columns are in a different order.
The alternative is to specify the columns and column order for your table:
df3 <- data.frame(unclass(table(df$id, df$period)))
The id column is included in the new data.frame as row.names(df3). To add it as a new column:
df3$id <- row.names(df3)
df3
calib first valid id
10 1 2 0 10
20 2 0 2 20
30 0 0 2 30
40 1 1 1 40
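As for the dcast part of the question, a short sketch using reshape2 (assuming the package is installed); value.var is named explicitly only to avoid the column-guessing message:
library(reshape2)
# count rows per id/period combination and spread period out into columns
dcast(df, id ~ period, fun.aggregate = length, value.var = "period")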
