Match and Fill Values in R - r

I have a data set containing 3 columns. The first column contains product names (A through E), and the other 2 columns contain each product's 2 nearest neighbors, i.e. customers who own the product in the first column are most likely to buy these 2 products next.
m1 = data.frame(Product=c("A","B","C","D","E"), V1=c("C","A","A","A","D"),
V2=c("D","D","B","E","A"))
In the second data set, I have data at the user level. The first column contains user IDs, and the other 5 columns indicate whether the user owns each product: 1 means they own it, 0 means they don't.
m2 = data.frame(ID = c(1:7), A = rbinom(7,1,1/2), B = rbinom(7,1,1/2),
C = rbinom(7,1,1/2), D = rbinom(7,1,1/2), E = rbinom(7,1,1/2))
I want product recommendations at the user level: m1 should be merged with m2 based on which products each user owns. The output should look like this:
User - 1 A D

You haven't posted a reproducible example or the exact expected results, but this seems to do what you want.
set.seed(321)
m1 = data.frame(Product=c("A","B","C","D","E"), V1=c("C","A","A","A","D"),
V2=c("D","D","B","E","A"))
m2 = data.frame(ID = c(1:7), A = rbinom(7,1,1/2), B = rbinom(7,1,1/2),
C = rbinom(7,1,1/2), D = rbinom(7,1,1/2), E = rbinom(7,1,1/2))
recommended <- apply(m2, 1, function(x) {
client.recommended <- m1[as.logical(x[-1]),-1]
top <- names(sort(table(as.vector(t(client.recommended))),
decreasing = TRUE)[1:2])
c(x[1], top)
})
recommended <- as.data.frame(t(recommended), stringsAsFactors = FALSE)
ID V2 V3
1 1 A B
2 2 A D
3 3 A B
4 4 A D
5 5 A D
6 6 A D
7 7 A B
What this code does:
For every row in the m2 data.frame (every client), take that row
Take the subset of the m1 data.frame corresponding to the values found in that row (if the client has chosen "A" and "B", take rows "A" and "B" from m1)
Turn this subset into a vector
Count occurrences of each unique value in the vector
Sort the unique values by count
Take the two most common values
Return these values along with the client ID
Turn everything into a proper data.frame for further processing
It seems that you expect to obtain only two products for each client, and that is what this code does. For products with the same number of occurrences, the one that comes first alphabetically apparently wins. You can get all recommended products by dropping the [1:2] part, but then you will need to figure out how to coerce vectors of uneven length into a single data.frame.
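One possible way to handle the uneven-length case is to keep every client's full ranking in a list and collapse each ranking into a single string. This is only a sketch under assumptions: the ownership values in m2 are fixed by hand here instead of random, and the names all_recommended and result are made up for illustration.

```r
m1 <- data.frame(Product = c("A", "B", "C", "D", "E"),
                 V1 = c("C", "A", "A", "A", "D"),
                 V2 = c("D", "D", "B", "E", "A"),
                 stringsAsFactors = FALSE)
# Hand-picked ownership (assumption): user 1 owns A and B, user 2 owns C
m2 <- data.frame(ID = 1:2,
                 A = c(1, 0), B = c(1, 0), C = c(0, 1),
                 D = c(0, 0), E = c(0, 0))

# One character vector per client, ordered from most to least frequent neighbor
all_recommended <- lapply(seq_len(nrow(m2)), function(i) {
  owned <- as.logical(unlist(m2[i, -1]))
  neighbours <- unlist(m1[owned, c("V1", "V2")], use.names = FALSE)
  names(sort(table(neighbours), decreasing = TRUE))
})

# Collapse each ranking into one string so that the lengths no longer matter
result <- data.frame(ID = m2$ID,
                     recommended = vapply(all_recommended, paste,
                                          character(1), collapse = ", "),
                     stringsAsFactors = FALSE)
result
```

Keeping all_recommended as a list is usually more convenient for further processing; the collapsed string is mainly useful for display.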

Related

Merging Two Datasets Using Different Column names: left_Join

I am trying to merge two datasets on columns that have different names but share the same values. For instance, column A in dataset 1 contains xyzw, while in dataset 2 the column is named B but also contains xyzw.
However, the problem is that in dataset 2 the values in column B are firm names, and each one appears several times, once for every employee of that firm in the dataset.
Essentially, I want to create a new column in dataset 1, let's call it C, telling me how many employees are in each firm.
I have tried the following:
## Counting how many teachers are in each matched school, using the "Matched" column from matching_file_V4, along with the school_name column from the sample11 dataset:
merged_dataset <- left_join(sample11,matched_datasets,by="school_name")
While this code works, it is not really providing me with the number of employees per firm.
If you could provide sample data and the expected output, it would make it easier for others to help. That notwithstanding, I hope this gives you what you want:
Assuming we have these two data frames:
library(dplyr)  # provides %>%, group_by(), summarise(), n() and left_join()
df_1 <- data.frame(
A = letters[1:5],
B = c('empl_1','empl_2','empl_3','empl_4','empl_5')
)
df_2 <- data.frame(
C = sample(rep(c('empl_1','empl_2','empl_3','empl_4','empl_5'), 15), 50),
D = sample(letters[1:5], 50, replace=T)
)
# I suggest you find the number of employees for each firm in the second data frame
df_2%>%group_by(C)%>%
summarise(
num_empl = n()
)%>% ### Then do the left join
left_join(
df_1,., by=c('B' = 'C') ## this is how you can join on two different column names
)
# A B num_empl
# 1 a empl_1 8
# 2 b empl_2 11
# 3 c empl_3 10
# 4 d empl_4 10
# 5 e empl_5 11
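For comparison, the same count-then-join idea can be sketched in base R with table() and merge(). The data frames firms and staff below are small hand-made stand-ins (assumptions, not the asker's data), chosen so the counts are easy to check by eye.

```r
firms <- data.frame(A = letters[1:3],
                    B = c("empl_1", "empl_2", "empl_3"),
                    stringsAsFactors = FALSE)
staff <- data.frame(C = c("empl_1", "empl_1", "empl_2",
                          "empl_1", "empl_3", "empl_3"),
                    stringsAsFactors = FALSE)

# Count the rows per firm name in the long table
counts <- as.data.frame(table(C = staff$C),
                        responseName = "num_empl",
                        stringsAsFactors = FALSE)

# Left join on differently named key columns: B in firms, C in counts
merged <- merge(firms, counts, by.x = "B", by.y = "C", all.x = TRUE)
merged
```

With all.x = TRUE, a firm that has no rows in staff would be kept with NA in num_empl, which mirrors the behaviour of a left join.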

apply function by name of list

Imagine that I have a list
l <- list("a" = 1, "b" = 2)
and a data frame
id value
a 3
b 4
I want to match id with the list names and apply a function to the matching list element and the value in the data frame. For example, if I want the sum of the value in the data frame and the corresponding value in the list, I get
id value
a 4
b 6
Does anyone have a clue?
Edit:
I want to expand the question a little: now I have more than one value in each element of the list.
l <- list("a" = c(1, 2), "b" =c(1, 2))
I still want the sum
id value
a 6
b 7
We can match the names of the list with the id column of the data frame, unlist the list accordingly and add it to value
df$value <- unlist(l[match(df$id, names(l))]) + df$value
df
# id value
#1 a 4
#2 b 6
EDIT
If we have multiple entries in each list element, we need to sum every element after matching. We can do
df$value <- df$value + sapply(l[match(df$id, names(l))], sum)
df
# id value
#1 a 6
#2 b 7
You just need
df$value=df$value+unlist(l)[df$id] # a named vector can be indexed by name
df
id value
1 a 4
2 b 6
Compare with Ronak's answer when the list is in a different order:
l <- list("b" = 2, "a" = 1)
unlist(l)[as.character(df$id)] # use as.character() if id in df is a factor
a b
1 2
Update
df$value=df$value+unlist(lapply(l,sum))[df$id]
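Since df is never defined in the snippets above, here is a self-contained sketch of the name-based lookup. It assumes df has a character id column and a numeric value column as in the question, and it deliberately stores the list in a different order than df$id; list_totals is a name made up for illustration.

```r
l <- list(b = c(1, 2), a = c(1, 2))  # order differs from df$id on purpose
df <- data.frame(id = c("a", "b"), value = c(3, 4),
                 stringsAsFactors = FALSE)

# Sum each list element, keeping the element names
list_totals <- vapply(l, sum, numeric(1))

# Look the totals up by name, so the order of the list does not matter
df$value <- df$value + unname(list_totals[as.character(df$id)])
df
```

Indexing by as.character(df$id) keeps the lookup name-based even if id happens to be a factor.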

Removing rows in data.frame having columns subsumed in others

I am trying to achieve something similar to unique on a data.frame in which each element of one column is a vector. If the vector in one row is a subset of, or equal to, the vector in another row, I want to remove the row with the smaller number of elements. I can achieve this with a nested for loop, but since the data contains 400,000 rows the program is very inefficient.
Sample data
# Set the seed for reproducibility
set.seed(42)
# Create a random data frame
mydf <- data.frame(items = rep(letters[1:4], length.out = 20),
grps = sample(1:5, 20, replace = TRUE),
supergrp = sample(LETTERS[1:4], replace = TRUE))
# Aggregate items into a single column
temp <- aggregate(items ~ grps + supergrp, mydf, unique)
# Arrange by number of items for each grp and supergroup
indx <- order(lengths(temp$items), decreasing = T)
temp <- temp[indx, ,drop=FALSE]
temp looks like
grps supergrp items
1 4 D a, c, d
2 3 D c, d
3 5 D a, d
4 1 A b
5 2 A b
6 3 A b
7 4 A b
8 5 A b
9 1 D d
10 2 D c
Now you can see that the combinations of supergrp and items in the second and third rows are contained in the first row, so I want to delete the second and third rows from the result. Similarly, rows 5 to 8 are contained in row 4. Finally, rows 9 and 10 are contained in the first row, so I want to delete them as well.
Hence, my result would look like:
grps supergrp items
1 4 D a, c, d
4 1 A b
My implementation is as follows:
# initialise the result dataframe by first row of old data frame
newdf <-temp[1, ]
# For all rows in the original data
for(i in 1:nrow(temp))
{
# Index to check if all the items are found
indx <- TRUE
# Check if item in the original data appears in the new data
for(j in 1:nrow(newdf))
{
if(all(c(temp$supergrp[[i]], temp$items[[i]]) %in%
c(newdf$supergrp[[j]], newdf$items[[j]]))){
# set indx to false if a row with same items and supergroup
# as the old data is found in the new data
indx <- FALSE
}
}
# If none of the rows in new data contain items and supergroup in old data append that
if(indx){
newdf <- rbind(newdf, temp[i, ])
}
}
I believe there is an efficient way to implement this in R, maybe using the tidyverse and dplyr chains, but I am missing the trick. Apologies for the longish question. Any input would be highly appreciated.
I would try to get the items out of a list column and store them in a longer dataframe. Here is my somewhat hacky solution:
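library(stringr)
library(purrr)  # map(), map_df()
library(dplyr)  # bind_cols(), select(), filter()
library(tidyr)  # gather()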
items <- temp$items %>%
map(~str_split(., ",")) %>%
map_df(~data.frame(.))
out <- bind_cols(temp[, c("grps", "supergrp")], items)
out %>%
gather(item_name, item, -grps, -supergrp) %>%
select(-item_name, -grps) %>%
unique() %>%
filter(!is.na(item))
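As an alternative, here is a base R sketch of the subset check itself. It is not the approach from the answer above: it walks the rows from the largest item set to the smallest and compares each row only against rows already kept in the same supergrp, which keeps the number of comparisons small when few rows survive. The toy temp below is a hand-made abbreviation of the question's table (an assumption), and keep, ord and result are illustrative names.

```r
temp <- data.frame(grps = c(4, 3, 5, 1, 2, 1),
                   supergrp = c("D", "D", "D", "A", "A", "D"),
                   stringsAsFactors = FALSE)
temp$items <- list(c("a", "c", "d"), c("c", "d"), c("a", "d"), "b", "b", "d")

# Visit rows from the largest item set to the smallest, so supersets come first
ord <- order(lengths(temp$items), decreasing = TRUE)
keep <- logical(nrow(temp))

for (i in ord) {
  # Only rows already kept within the same supergrp can subsume row i
  kept <- which(keep & temp$supergrp == temp$supergrp[i])
  covered <- any(vapply(temp$items[kept],
                        function(s) all(temp$items[[i]] %in% s),
                        logical(1)))
  if (!covered) keep[i] <- TRUE
}

result <- temp[keep, ]
result
```

On this toy input only the rows with items a, c, d (supergrp D) and b (supergrp A) survive, matching the shape of the expected result in the question.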

Remove columns in R based on the values of the first two rows

I have a big data set that looks like this (it actually has thousands of columns):
A = c("AA","AA","AA","AA","AA")
B = c("CC","GG","CC","CG","GG")
C = c("TT","AA","AA","AT","TT")
D = c("GG","GG","GG","GG","GG")
E = c("TT","TT","NA","TT","TT")
mydata = data.frame(A, B, C, D, E)
mydata
Basically I would like to do 2 things:
Remove the columns in which the values of the first and second rows are the same; in this case, columns "A", "D" and "E" would be excluded.
Recode the remaining cells relative to the values in the first and second rows of their column: a cell with the same value as row 1 becomes "f", one with the same value as row 2 becomes "m", and anything else becomes "h".
This is the table I would like to obtain in the end:
B = c("CC","GG","f","h","m")
C = c("TT","AA","m","h","f")
mydata = data.frame(B, C)
mydata
For the first point I've managed to get similar results by using an apply function, as in "How to remove non-informative columns with and without missing values in dataframe", but what I would like is to make the condition refer to specific cells, like when using an "if" function in Excel.
I would appreciate any ideas of types of functions to use.
The first thing you need to do is make sure that your strings are characters instead of factors:
A = c("AA","AA","AA","AA","AA")
B = c("CC","GG","CC","CG","GG")
C = c("TT","AA","AA","AT","TT")
D = c("GG","GG","GG","GG","GG")
E = c("TT","TT","NA","TT","TT")
mydata = data.frame(A, B, C, D, E,stringsAsFactors = F)
Then for your first step you can do something like this:
mydata2<-mydata[,!mydata[1,]==mydata[2,]]
mydata2
and for your second step:
mydata2[-c(1:2),]<-lapply(mydata2,function(x)
ifelse(x[-c(1,2)]==x[1],'f',
ifelse(x[-c(1,2)]==x[2],'m','h'))
)
> mydata2
B C
1 CC TT
2 GG AA
3 f m
4 h h
5 m f
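The two steps can also be wrapped into one small self-contained sketch. The helper recode_column and the intermediate names differs and kept are made up for illustration; the logic follows the answer above.

```r
mydata <- data.frame(A = c("AA", "AA", "AA", "AA", "AA"),
                     B = c("CC", "GG", "CC", "CG", "GG"),
                     C = c("TT", "AA", "AA", "AT", "TT"),
                     D = c("GG", "GG", "GG", "GG", "GG"),
                     E = c("TT", "TT", "NA", "TT", "TT"),
                     stringsAsFactors = FALSE)

# Step 1: keep only the columns whose first two rows differ
differs <- unlist(mydata[1, ]) != unlist(mydata[2, ])
kept <- mydata[, differs, drop = FALSE]

# Step 2: recode rows 3 and below relative to rows 1 and 2 of the same column
recode_column <- function(x) {
  body <- x[-(1:2)]
  c(x[1:2], ifelse(body == x[1], "f", ifelse(body == x[2], "m", "h")))
}
result <- as.data.frame(lapply(kept, recode_column), stringsAsFactors = FALSE)
result
```

Using drop = FALSE keeps kept a data frame even if only a single column passes the filter.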

converting string IDs into numbers in a multilevel analysis using R [duplicate]

I have two data sets, one for student level data and another one for class level data. Student and class level IDs are generated as string values like:
Student data set:
student ID ->141PSDM2L,1420CHY1L,1JNLV36HH,1MNSBXUST,2K7EVS7X6,2N2SC26HL,...
class ID ->XK37HDN,XK37HDN,XK37HDN,3K3EH77,3K3EH77,2K36HN6,...
class level data set:
class ID ->XK37HDN,3K3EH77,2K36HN6,3K3LHSH,3K3LHSY,DK3EH14,DK3EH1H,DK3EH1K,...
In the student data set, each class ID is repeated once for every student in the class, but in the class level data set there is only one row per class.
How can I convert those IDs into integers, for both the student and class level IDs? In other words, I want to have IDs as below (or something similar):
Student data set:
student ID ->1,2,3,4,5,6,...
class ID ->1,1,1,2,2,3,...
class level data set:
class ID ->1,2,3,4,5,6,7,8,...
EDIT:
Conversion of the student level data is not difficult. The problem arises when I want to convert the class level data. Because of the repetition of class IDs in the student data set, class IDs take values from 1 to 1533, but applying the same conversion method to the class level data produces values from 1 to 896, so I don't know whether, for example, class ID 45 in the student level data refers to the same class as class ID 45 in the class level data set.
Assuming that your studentID and classID are factors, I would use the fact that internally these are stored numerically. Hence if you can get the levels the same on both factors (i.e. in same order, and such that identical(levels(f1), levels(f2)) == TRUE), then you can simply coerce to integers.
I was thinking something along the lines of:
## dummy data first
set.seed(1)
df1 <- data.frame(f1 = sample(letters, 100, replace = TRUE),
f2 = sample(LETTERS, 100, replace = TRUE,
prob = rep(c(0.25, 0.75), length = 26)),
stringsAsFactors = TRUE)  # needed in R >= 4.0, where strings are no longer factors by default
df2 <- with(df1, data.frame(f2 = sample(factor(unique(f2),
levels = sample(unique(f2)))),
vals = rnorm(length(unique(f2)))))
Note that the levels of the factors are not identical, even though there is a match between the data (given the way I generated them):
> identical(with(df1, levels(f2)), with(df2, levels(f2)))
[1] FALSE
Now make the levels identical, here I just take the union in case there are some values in one factor and not the other, and vice versa.
## make levels identical
levs <- sort(union(with(df1, levels(f2)), with(df2, levels(f2))))
df1 <- transform(df1, f2 = factor(f2, levels = levs))
df2 <- transform(df2, f2 = factor(f2, levels = levs))
> identical(with(df1, levels(f2)), with(df2, levels(f2)))
[1] TRUE
Now recode to numeric
## recode as numeric
df1b <- transform(df1, f2int = as.numeric(f2))
df2b <- transform(df2, f2int = as.numeric(f2))
> head(df1b)
f1 f2 f2int
1 g B 2
2 j D 4
3 o R 17
4 x A 1
5 f F 6
6 x J 10
> head(df2b)
f2 vals f2int
1 Z -0.17955653 23
2 U -0.10019074 20
3 N 0.71266631 13
4 J -0.07356440 10
5 B -0.03763417 2
6 X -0.68166048 22
Notice that the f2int values for f2 equal to B or J are the same in both data frames.
My point in the comments about merge() was if you want to match the tables, you can do the usual database joins using merge(). E.g.:
> head(merge(df1, df2, sort = FALSE))
f2 f1 vals
1 B g -0.03763417
2 B v -0.03763417
3 B u -0.03763417
4 B e -0.03763417
5 B w -0.03763417
6 D i -0.58889449
which would avoid the potentially error-prone step of getting the levels in order and converting to integers, if this was the ultimate aim.
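If integer codes are still wanted, one way to keep them consistent across both tables is to build a single shared lookup vector and use match() against it in each data set. This is a sketch using the first few IDs quoted in the question; the names students, classes, class_levels and the *_int columns are illustrative assumptions.

```r
students <- data.frame(
  student_id = c("141PSDM2L", "1420CHY1L", "1JNLV36HH",
                 "1MNSBXUST", "2K7EVS7X6", "2N2SC26HL"),
  class_id = c("XK37HDN", "XK37HDN", "XK37HDN",
               "3K3EH77", "3K3EH77", "2K36HN6"),
  stringsAsFactors = FALSE)
classes <- data.frame(
  class_id = c("XK37HDN", "3K3EH77", "2K36HN6", "3K3LHSH"),
  stringsAsFactors = FALSE)

# One shared lookup: every class ID seen in either table, in a fixed order
class_levels <- union(classes$class_id, students$class_id)

# The position in the shared lookup becomes the integer code in both tables
students$class_int <- match(students$class_id, class_levels)
classes$class_int <- match(classes$class_id, class_levels)

# Student IDs only need to be unique within the student table
students$student_int <- match(students$student_id, unique(students$student_id))
```

Because both tables are coded against the same class_levels vector, a given integer always refers to the same class, even when a class appears in only one of the tables.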
