grouping data frame within the list - r

I have a unique issue that I am trying to solve.
I have a data table that contains a few different types of information.
An example is below.
ID|inpSeq|Act |User |Representing
--|----- |----|---- |-----
1 | 123 | s | ABC | NA
1 | 124 | s | ABC | NA
1 | 125 | c | ABC | x1
1 | 126 | c | XYZ | x2
1 | 127 | d | ABC | x2
What I am trying to do is organize the data so that I can view how "User" relates to "Representing".
In other words, I am looking to create the following output:
ID|Act |User|....
--|------|----|----|----
1 | sscd | ABC| x1 | x2.....
1 | c | XYZ| x2.....
So, as you can see, the original table is compacted into a "User"-centric view, and "Act" now contains all the activity that the User performed on a single ID.
Additionally, once I have this activity sorted out, I would need to (dynamically, if different) show on whose behalf they performed the activity. This is represented by x1, x2....., meaning that this can grow depending on how many unique "Representing" parties there are for each ID/Act/User combination.
An important thing to note is that "s" values in the Act field will always have NA in the Representing field, so those NAs do not need to be included in the transformed view.
So far I have been able to get the ID|Act|User part figured out by using the following code:
aggregate(Act~ID+User, paste, collapse="", data=df)
But I need to figure out how to do the rest. That is where I need all of your help.
P.S. The "inpSeq" field is just a unique numeric field that is created sequentially by an outside application; it allows the activities to be put in the correct sequential order.

With your data as a data frame df, you can use dplyr with the spread function from tidyr to get what you want:
library(dplyr)
library(tidyr)
f <- function(x) { paste(na.omit(x), collapse="") } ## 1.
result <- df %>% spread(Representing, Representing) %>% ## 2.
select(-inpSeq, -`<NA>`) %>% ## 3.
group_by(ID, User) %>% ## 4.
summarise_each(funs(f))
Notes:
1. We define a function f that collapses a character vector into a single string, omitting NAs in the process.
2. The first argument to spread is the column name for the keys and the second argument is the column name for the values. The spread function spreads the values into multiple columns, which are named by the keys. Here, we spread the values of Representing into multiple columns named after those same values. The result of just that command on your data is:
## ID inpSeq Act User x1 x2 <NA>
##1 1 123 s ABC <NA> <NA> <NA>
##2 1 124 s ABC <NA> <NA> <NA>
##3 1 125 c ABC x1 <NA> <NA>
##4 1 126 c XYZ <NA> x2 <NA>
##5 1 127 d ABC <NA> x2 <NA>
Note that there are now three additional columns named x1, x2, and <NA> replacing the original Representing column.
3. From this result, we use select to omit the columns inpSeq and <NA>.
4. We then group_by ID and User and summarise_each of the remaining columns using the function f that we defined.
The result is:
print(result)
##Source: local data frame [2 x 5]
##Groups: ID [?]
## ID User Act x1 x2
## <int> <fctr> <chr> <chr> <chr>
##1 1 ABC sscd x1 x2
##2 1 XYZ c x2
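For comparison, the same user-centric view can be sketched in base R alone. This is an illustrative alternative rather than the method above: it rebuilds the sample data from the question, and the names out, k and seen are made up for the example. It assumes that sorting by inpSeq gives the correct activity order, as the question states.

```r
# Sample data copied from the question
df <- data.frame(
  ID = 1,
  inpSeq = 123:127,
  Act = c("s", "s", "c", "c", "d"),
  User = c("ABC", "ABC", "ABC", "XYZ", "ABC"),
  Representing = c(NA, NA, "x1", "x2", "x2"),
  stringsAsFactors = FALSE
)

# Order by inpSeq so activities are concatenated in sequence
df <- df[order(df$ID, df$inpSeq), ]

# One row per ID/User with all activities pasted together
out <- aggregate(Act ~ ID + User, data = df, FUN = paste, collapse = "")

# One column per distinct non-NA Representing value;
# the cell holds the value if that ID/User ever acted for it, else ""
for (k in sort(unique(na.omit(df$Representing)))) {
  out[[k]] <- mapply(function(id, user) {
    seen <- df$ID == id & df$User == user & df$Representing %in% k
    if (any(seen)) k else ""
  }, out$ID, out$User)
}

print(out)
```

Because the columns are created in a loop over the distinct Representing values, the output widens automatically when new parties appear in the data.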

Related

Compare two DF's and find differences in a specified column of different data types

I have two DF's with same column names but with different data types in them.
DF-1
DF-2
I want to compare above two DF's on column 'A' and write the difference to a new variable 'C'.
Problem here is I need to compare Alphanumeric values of "DF-1 column A" with Numeric values of "DF-2 column A" and find if the numerics in DF-2 are present in DF-1 or not.
If a value is not found in DF-1 then I want that difference to be written to new variable C below.
I want the variable C to be added to DF-1 like this with the differences identified.
Please advise.
You can use sub to extract only the number from DF1$A and use %in% to test whether that number is present in DF2$A.
DF1$C <- ""
i <- !sub("\\D*", "", DF1$A) %in% DF2$A
DF1$C[i] <- DF1$A[i]
DF1
# A B C
#1 ABC 1 AA ABC 1
#2 ABC 2 AB
#3 ABC 3 AC
#4 ABC 4 AD
#5 ABC 5 AE ABC 5
#6 ABC 6 AF ABC 6
#7 ABC 7 AG ABC 7
Data:
DF1 <- data.frame(A=paste("ABC", 1:7), B=paste0("A", LETTERS[1:7]))
DF2 <- data.frame(A=c(2:4,9:12), B=paste0("B", LETTERS[1:7]))
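To see what each step contributes, the two pieces can be run separately on plain vectors. This is a small sketch using the same values as the data above; the names a, num and present are illustrative only.

```r
a <- paste("ABC", 1:7)        # "ABC 1", "ABC 2", ...
b <- c(2:4, 9:12)             # the numeric values from DF2$A

# Strip the leading run of non-digit characters, leaving the number as text
num <- sub("\\D*", "", a)

# %in% compares after coercing both sides to a common type,
# so the character "2" matches the numeric 2
present <- num %in% b

print(data.frame(a, num, present))
```

Note that sub only removes the first run of non-digits; if the values could contain non-digit characters after the number, gsub("\\D", "", a) would strip them all.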

How to remove rows in a dataset by matching a column in another dataset

I'm very new to coding and am struggling with some data processing. I want to remove rows from a dataset based on another dataset. The datasets are fairly large and I am unable to match it accurately. The first dataset, dat1, is:
userId showId
user1 1
user1 3
user2 2
user3 1
user3 3
The second dataset, dat2, contains show IDs and attributes such as genre in the other columns:
showId genre
1 a
2 b
3 a
4 b
5 b
I want to delete the rows in dataset 2 where the showId does not appear in dataset 1 (i.e., I want to remove information about shows which are not in dataset 1 from dataset 2). I've tried:
nl <- subset(unique(dat1$showId) %in% dat2$showId)
Since I have 3 unique showIds in dat1, I should have 3 rows in object nl but this does not work and it returns me with rows =/= the number of unique showIds in dat1. Does anyone know any other way I can do this? Any help is appreciated, thanks!
Does this work?
dat2[!is.na(match(dat2$showId, dat1$showId)),]
# A tibble: 3 x 2
showId genre
<dbl> <chr>
1 1 a
2 2 b
3 3 a
One option that works, and should be easy to follow and reuse in other cases, would be:
library(dplyr)
# Create a filter based on unique values in showId of dat1
filter_dat2 <- unique(dat1$showId)
# Filter dat2 to just those showId values that are in dat1
dat2_limited <- dat2 %>% filter(showId %in% filter_dat2)
This gives the following result:
showId genre
1 1 a
2 2 b
3 3 a
Hope that might help.
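The same filter can also be written in base R without any packages. The sketch below rebuilds the two small data sets from the question; the name keep is illustrative. It also shows why the original attempt gave the wrong count: the %in% test has to ask, for each row of dat2, whether its showId occurs in dat1, not the other way round.

```r
dat1 <- data.frame(
  userId = c("user1", "user1", "user2", "user3", "user3"),
  showId = c(1, 3, 2, 1, 3)
)
dat2 <- data.frame(
  showId = 1:5,
  genre = c("a", "b", "a", "b", "b")
)

# One TRUE/FALSE per row of dat2: does this show appear anywhere in dat1?
keep <- dat2$showId %in% dat1$showId

# Logical row indexing keeps only the matching rows
dat2_limited <- dat2[keep, ]
print(dat2_limited)
```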

Values comparison under columns combinations

I have a data frame of the following type:
date ID1 ID2 sum
2017-1-5 1 a 200
2017-1-5 1 b 150
2017-1-5 2 a 300
2017-1-4 1 a 200
2017-1-4 1 b 120
2017-1-4 2 a 300
2017-1-3 1 b 150
I'm trying to compare between columns combinations over different dates to see if the sum values are equal. So, in the above-mentioned example, I'd like the code to identify that the sum of [ID1=1, ID2=b] combination is different between 2017-1-5 and 2017-1-4 (In my real data I have more than 2 ID categories and more than 2 Dates).
I'd like my output to be a data frame which contains all the combinations that include (at least one) unequal results. In my example:
date ID1 ID2 sum
2017-1-5 1 b 150
2017-1-4 1 b 120
2017-1-3 1 b 150
I tried to solve it using loops, along the lines of "Is there a R function that applies a function to each pair of columns", but with no great success.
Your help will be appreciated.
Using dplyr, we can group_by_(.dots=paste0("ID",1:2)) and then see if the values are unique:
library(dplyr)
res <- df %>% group_by_(.dots=paste0("ID",1:2)) %>%
mutate(flag=(length(unique(sum))==1)) %>%
ungroup() %>% filter(flag==FALSE) %>% select(-flag)
The group_by_ allows you to group multiple ID columns easily. Just change 2 to however many ID columns (i.e., N) you have assuming that they are numbered consecutively from 1 to N. The column flag is created to indicate if all of the values are the same (i.e., number of unique values is 1). Then we filter for results for which flag==FALSE. This gives the desired result:
res
### A tibble: 3 x 4
## date ID1 ID2 sum
## <chr> <int> <chr> <int>
##1 2017-1-5 1 b 150
##2 2017-1-4 1 b 120
##3 2017-1-3 1 b 150
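If you would rather avoid the underscore verbs, the same check can be sketched in base R with ave. This is an alternative under assumptions, not the answer's method: the data frame is rebuilt from the question, and key and n_unique are illustrative names.

```r
df <- data.frame(
  date = c("2017-1-5", "2017-1-5", "2017-1-5", "2017-1-4",
           "2017-1-4", "2017-1-4", "2017-1-3"),
  ID1 = c(1, 1, 2, 1, 1, 2, 1),
  ID2 = c("a", "b", "a", "a", "b", "a", "b"),
  sum = c(200, 150, 300, 200, 120, 300, 150),
  stringsAsFactors = FALSE
)

# Build one grouping key from all ID columns (add more columns as needed)
key <- paste(df$ID1, df$ID2)

# For every row, count the distinct sum values within its ID combination
n_unique <- ave(df$sum, key, FUN = function(x) length(unique(x)))

# Keep every row of a combination that has more than one distinct value
res <- df[n_unique > 1, ]
print(res)
```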

Reshaping data with R [duplicate]

I have my data in the below table structure:
Person ID | Role | Role Count
-----------------------------
1 | A | 24
1 | B | 3
2 | A | 15
2 | B | 4
2 | C | 7
I would like to reshape this so that there is one row for each Person ID, a column for each distinct role (e.g. A, B, C), and the Role Count for each person as the values. Using the above data, the output would be:
Person ID | Role A | Role B | Role C
-------------------------------------
1 | 24 | 3 | 0
2 | 15 | 4 | 7
Coming from a Java background I would take an iterative approach to this:
Find all distinct values for Role
Create a new table with a column for PersonID and each of the distinct roles
Iterate through the first table, get role counts for each Person ID and Role combination and insert results into new table.
Is there another way of doing this in R without iterating through the first table?
Thanks
Try:
library(tidyr)
df %>% spread(Role, `Role Count`)
To make the column names exactly as per your example:
df2 <- df %>% spread(Role, `Role Count`)
names(df2)[-1] <- paste('Role', names(df2)[-1])
Try this:
library(reshape2)
df <- dcast(df, PersonID~Role, value.var='RoleCount')
df[is.na(df)] <- 0
names(df)[-1] <- paste('Role', names(df[-1]))
df
PersonID Role A Role B Role C
1 1 24 3 0
2 2 15 4 7
With spread from tidyr
library(tidyr)
spread(data, Role, `Role Count`, sep = " ")
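A base R sketch of the same long-to-wide step, with no packages, is possible with xtabs. It assumes the columns are named PersonID, Role and RoleCount without spaces (the same assumption the dcast answer makes), and wide is an illustrative name. A side effect of xtabs is that missing Person/Role combinations come out as 0, which matches the desired output.

```r
df <- data.frame(
  PersonID = c(1, 1, 2, 2, 2),
  Role = c("A", "B", "A", "B", "C"),
  RoleCount = c(24, 3, 15, 4, 7),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are persons, columns are roles, cells sum RoleCount
tab <- xtabs(RoleCount ~ PersonID + Role, data = df)

# Turn the table into an ordinary data frame (PersonID ends up in the row names)
wide <- as.data.frame.matrix(tab)
names(wide) <- paste("Role", names(wide))
print(wide)
```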

Faster looping in r

I have two data frames Test and User.
Test has 100,000 rows while User has 1,400,000 rows. I want to extract specific columns from the User data frame and merge them into the Test data frame. For example, I want Income and Cat from User for every row in Test. Test contains repeated names, and for each one I want just one value from User. I want to keep Test as it is, without removing duplicates.
For example, for Name A, Income is 100 and Cat is both M and L. Since M occurs first, I need M.
> Test
Name Income Cat
A
B
C
D
...
User Cat Income
A M 100
B M 320
C U 400
D L 900
A L 100
..
I used a for loop, but it takes a lot of time. I do not want to use the merge function.
for (i in 1:nrow(Test)) {
  # First row of User whose Name matches this row of Test
  hit <- which(User$Name == Test[i, "Name"])[1]
  Test[i, "Cat"] <- User[hit, "Cat"]
  Test[i, "Income"] <- User[hit, "Income"]
}
I used merge as well, but the resulting row count is more than the 100k rows in Test; it appends extra rows.
I want a faster way that avoids both the for loop and merge. Can someone suggest any apply-family functions?
You can use match to find the first matching row (then vectorize the copying):
# Setup the data
User=data.frame(User=c('A','B','C','D','A'),Cat=c('M','M','U','L','L'),
Income=c(100,320,400,900,100))
Test=data.frame(Name=c('A','B','C','D'))
Test$Income<-NA
Test$Cat<-NA
> Test
Name Income Cat
1 A NA NA
2 B NA NA
3 C NA NA
4 D NA NA
## Copy only the first match from User to Test
Test[,c("Income","Cat")]<-User[match(Test$Name,User$User),c("Income","Cat")]
> Test
Name Income Cat
1 A 100 M
2 B 320 M
3 C 400 U
4 D 900 L
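The reason match fits this problem is that it returns, for each element of its first argument, the position of the first match only, so repeated names in Test are kept while duplicate rows in User are ignored. The sketch below illustrates that with a Test that contains a repeated name; the key column is called Name in both frames here, which is an assumption for the example, and idx is an illustrative name.

```r
User <- data.frame(
  Name = c("A", "B", "C", "D", "A"),
  Cat = c("M", "M", "U", "L", "L"),
  Income = c(100, 320, 400, 900, 100),
  stringsAsFactors = FALSE
)
# "A" appears twice in Test and must stay twice
Test <- data.frame(Name = c("A", "B", "A", "D"), stringsAsFactors = FALSE)

# Position in User of the first row matching each Test name
idx <- match(Test$Name, User$Name)

# Vectorized copy: one lookup per Test row, no loop and no extra rows
Test$Income <- User$Income[idx]
Test$Cat <- User$Cat[idx]
print(Test)
```

Names in Test that do not occur in User at all would get NA in idx, and therefore NA in the copied columns.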
Using dplyr package you can do something like this:
library(dplyr)
df %>% group_by(Name) %>% slice(1)
For your example, you get:
Original data frame:
df
Name Cat Income
1 A M 100
2 B M 320
3 C U 400
4 D L 900
5 A L 100
Picking first occurrence:
df %>% group_by(Name) %>% slice(1)
Source: local data frame [4 x 3]
Groups: Name [4]
Name Cat Income
(chr) (chr) (int)
1 A M 100
2 B M 320
3 C U 400
4 D L 900
