Merge Uneven Data Files in R

I have 5 weeks of measured data in 5 separate CSV files, and am looking for a way to merge them into a single document that makes sense. The issue I'm having is that not all data points are present in each file: my largest has ~20k rows and my smallest has ~2k, so there isn't a 1:1 relation. Here's what my data looks like:
Keyword   URL              5/12 Rank
activity  site.com         2
activity  site.com/page    1
backup    site.com/backup  4
The next file would look something like this:
Keyword   URL              5/19 Rank
activity  site.com/page    2
database  site.com/data    3
What I'd like to end up with is something like this:
Keyword   URL              5/12 Rank  5/19 Rank
activity  site.com         2          -
activity  site.com/page    1          2
backup    site.com/backup  4          -
database  site.com/data    -          3
My preference would be to do this with R. I think plyr will make this a snap, but I've never used it before and I'm just not getting how this comes together.

Use merge:
csv1 <- read.table(header=TRUE, text="
Keyword URL 5/12_Rank
activity site.com 2
activity site.com/page 1
backup site.com/backup 4
")
csv2 <- read.table(header=TRUE, text="
Keyword URL 5/19_Rank
activity site.com/page 2
database site.com/data 3
")
csv12 <- merge(csv1, csv2, all=TRUE)
#> csv12
# Keyword URL X5.12_Rank X5.19_Rank
#1 activity site.com 2 NA
#2 activity site.com/page 1 2
#3 backup site.com/backup 4 NA
#4 database site.com/data NA 3
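Note that read.table converts the header 5/12_Rank into the syntactically valid name X5.12_Rank (check.names=TRUE is the default), which is why the merged output shows X5.12_Rank. If you're reading from a file and want to keep the raw header, you can pass check.names=FALSE (the file name here is hypothetical):
# keep the original column name instead of the mangled X5.12... form
csv1 <- read.csv("week1.csv", check.names=FALSE)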
If you have several tables, you can put them in a list and use Reduce:
csv3 <- read.table(header=TRUE, text="
Keyword URL 5/42_Rank
activity site.com 5
html site.com/data 6
")
L <- list(csv1, csv2, csv3)
Reduce(f=function(x,y)merge(x,y,all=TRUE), L)
Result:
# Keyword URL X5.12_Rank X5.19_Rank X5.42_Rank
#1 activity site.com 2 NA 5
#2 activity site.com/page 1 2 NA
#3 backup site.com/backup 4 NA NA
#4 database site.com/data NA 3 NA
#5 html site.com/data NA NA 6
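Since the question mentions five separate CSV files on disk, here is a minimal sketch of the whole workflow; the folder and file names are hypothetical:
# read every weekly CSV from a "ranks" folder
files <- list.files("ranks", pattern="\\.csv$", full.names=TRUE)
tabs <- lapply(files, read.csv)
# merge them all on Keyword + URL, keeping rows that appear in any file
merged <- Reduce(function(x, y) merge(x, y, by=c("Keyword", "URL"), all=TRUE), tabs)
write.csv(merged, "merged_ranks.csv", row.names=FALSE)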

How to remove duplicates right beneath original response?

Background: I have a survey attached to an Excel sheet, and at times a response gets duplicated due to user interaction. The duplication appears right beneath the original response. I would like R to delete the duplications that occur right beneath the original response, while keeping the original. Is there a way to target the duplicated responses right beneath the original one?
If my dataframe looks like this:
Area Year Course Tested Grade
1 Git 1 Material Y A
2 Ort 3 Fabric Y B
3 Pinst 2 Pattern N NA
4 Coker 1 Fashion Y B+
5 Coker 1 Fashion Y B+
6 South 4 Business N NA
This is what I would want:
Area Year Course Tested Grade
1 Git 1 Material Y A
2 Ort 3 Fabric Y B
3 Pinst 2 Pattern N NA
4 Coker 1 Fashion Y B+
5 South 4 Business N NA
Thank you in advance
Assuming you want to delete the duplicates only when they occur in consecutive rows, and keep them if they occur elsewhere, you can use rleidv from data.table along with duplicated:
df[!duplicated(data.table::rleidv(df)),]
# Area Year Course Tested Grade
#1 Git 1 Material Y A
#2 Ort 3 Fabric Y B
#3 Pinst 2 Pattern N <NA>
#4 Coker 1 Fashion Y B+
#6 South 4 Business N <NA>
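If you would rather not depend on data.table, a minimal base-R sketch of the same idea (assuming your data frame is called df, as above) builds a one-string key per row and drops any row whose key matches the row directly above it:
# paste each row into a single string so whole-row comparison is easy;
# "\r" is just a separator unlikely to occur in the data
row_key <- do.call(paste, c(df, sep="\r"))
# keep row 1, then keep each later row only if it differs from the row above
df[c(TRUE, row_key[-1] != row_key[-length(row_key)]), ]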

How to add columns from another data frame where there are multiple matching rows

I'm new to R and I'm stuck.
The problem:
I have two data frames (Graduations and Occupations). I want to match the occupations to the graduations. The difficult part is that one person might be present multiple times in both data frames and I want to keep all the data.
Example:
Graduations
One person may have finished many curriculums. Original DF has more columns but they are not relevant for the example.
Person_ID  curriculum_ID  School_ID
        1            100         10
        2            100         10
        2            200         10
        3            300         12
        4            100         10
        4            200         12
Occupations
Not all graduates have jobs; everyone in the DF should have only one main job (JOB_Type code "1") and can have 0-5 extra jobs (JOB_Type code "0"). The original DF has more columns but they are not relevant currently.
Person_ID  JOB_ID  JOB_Type
        1    1223         1
        3    3334         1
        3    2122         0
        3    7843         0
        4    4522         0
        4    1240         1
End result:
New DF named "Result" containing the information of all graduations from the first DF (Graduations) and added columns from the second DF (Occupations).
Note that person "2" is not in the Occupations DF. Their data remains, but the added columns stay empty.
Note that person "3" has multiple jobs and thus extra duplicate rows are added.
Note that person "4" has both multiple jobs and multiple graduations, so extra rows were added to fit in all the data.
New DF: "Result"
Person_ID  curriculum_ID  School_ID  JOB_ID  JOB_Type
        1            100         10    1223         1
        2            100         10
        2            200         10
        3            300         12    3334         1
        3            300         12    2122         0
        3            300         12    7843         0
        4            100         10    4522         0
        4            100         10    1240         1
        4            200         12    4522         0
        4            200         12    1240         1
For me the most difficult part is how to make R add the extra duplicate rows. I looked around for an example or tutorial about something similar but couldn't find one. Probably I did not use the right keywords.
I will be very grateful if you could give me examples of how to code it.
You can use merge like this:
merge(Graduations, Occupations, all.x=TRUE)
# Person_ID curriculum_ID School_ID JOB_ID JOB_Type
#1 1 100 10 1223 1
#2 2 100 10 NA NA
#3 2 200 10 NA NA
#4 3 300 12 3334 1
#5 3 300 12 2122 0
#6 3 300 12 7843 0
#7 4 100 10 4522 0
#8 4 100 10 1240 1
#9 4 200 12 4522 0
#10 4 200 12 1240 1
Data:
Graduations <- read.table(header=TRUE, text="Person_ID curriculum_ID School_ID
1 100 10
2 100 10
2 200 10
3 300 12
4 100 10
4 200 12")
Occupations <- read.table(header=TRUE, text="Person_ID JOB_ID JOB_Type
1 1223 1
3 3334 1
3 2122 0
3 7843 0
4 4522 0
4 1240 1")
An option with left_join
library(dplyr)
left_join(Graduations, Occupations)
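By default left_join matches on every column name the two data frames share (here just Person_ID) and prints a message saying so; you can silence it by naming the key explicitly:
Result <- left_join(Graduations, Occupations, by="Person_ID")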

Is there an R function to redefine a variable so I can use the spread function?

I'm new to R and I have the following problem. Maybe it's a really easy question, but I don't know the terms to search for an answer.
My problem:
I have several persons; each person is assigned a studynumber (SN). Each SN has one or more tests performed, and each test can have multiple results.
My data is long at the moment, but I need it to be wide (one row for each SN).
For example:
What I have:
SN testnumbers result
1 1 1234 6
2 1 1234 9
3 2 4567 6
4 3 5678 9
5 3 8790 9
What I want:
SN test1result1 test1result2 test2result1
1 1 6 9 NA
2 2 6 NA NA
3 3 9 NA 9
So I think I need to renumber the testnumbers into test 1, test 2, etc. for each SN in order to use the spread function, but I don't know how.
I did manage to renumber testnumber from 1 up to the last unique testnumber, but the wide dataframe still looks awful.
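One way to do that renumbering, sketched with dplyr and tidyr (assuming the long data frame is called dat): number the tests within each SN, number the results within each test, build the test1result1-style column names, and then go wide. pivot_wider is used below; spread works the same way with the name and result columns.
library(dplyr)
library(tidyr)
dat %>%
  group_by(SN) %>%
  # renumber the tests 1, 2, ... within each SN
  mutate(test = match(testnumbers, unique(testnumbers))) %>%
  group_by(SN, test) %>%
  # number the results 1, 2, ... within each test
  mutate(res = row_number()) %>%
  ungroup() %>%
  # build column names like test1result1, then spread to one row per SN
  mutate(name = paste0("test", test, "result", res)) %>%
  select(SN, name, result) %>%
  pivot_wider(names_from = name, values_from = result)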

Subsetting in R using a list

I have a large amount of data which I would like to subset based on the values in one of the columns (dive site in this case). The data looks like this:
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
alice rain 95 NA 50 NA 2 4 9
alice over NA 25 NA 25 2 4 9
steps clear NA 27 NA 25 2 4 9
steps NA 30 NA 20 1 4 9
andrea1 clear 60 NA 60 NA 2 4 5
I would like to create a subset of the data which contains only data for one dive site at a time (e.g. one subset for alice, one for steps, one for andrea1 etc...).
I understand that I could subset each individually using
alice <- subset(reefdata, site=="alice")
But as I have over 100 different sites to subset by, I would like to avoid having to specify each subset individually. I think that subset is probably not flexible enough to take a list of names (or at least not to my current knowledge of R, which is growing but still in its infancy). Is there another command I should be looking into?
Thank you
This will create a list that contains the subset data frames in separate list elements.
splitdat <- split(reefdata, reefdata$site)
Then if you want to access the "alice" data, you can reference it like this:
splitdat[["alice"]]
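A nice side effect of keeping the subsets in a list is that you can operate on all 100+ sites at once instead of creating one object per site, for example:
# number of observations per site
sapply(splitdat, nrow)
# mean rating per site
sapply(splitdat, function(d) mean(d$rate, na.rm=TRUE))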
I would use the plyr package.
library(plyr)
ll <- dlply(reefdata, .variables = "site")
Result:
>ll
$alice
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 alice rain 95 NA 50 NA 2 4 9
2 alice over NA 25 NA 25 2 4 9
$andrea1
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 andrea1 clear 60 NA 60 NA 2 4 5
$steps
site weather depth_ft depth_m vis_ft vis_m coral_safety coral_deep rate
1 steps clear NA 27 NA 25 2 4 9
2 steps <NA> 30 NA 20 1 4 9 NA
split() and dlply() are perfect one-shot solutions.
If you want a "step by step" procedure with a loop (which is frowned upon by many R users, but I find it helpful in order to understand what's going on), try this:
# create a vector of site names, assuming reefdata$site is a factor
sites <- as.character(unique(reefdata$site))
# preallocate an empty list to hold the dive data per site
dives <- vector("list", length(sites))
# collect the data per site into the list
for (i in seq_along(sites)) {
  # subset the rows for this site
  dive <- reefdata[reefdata$site == sites[i], ]
  # add the resulting data.frame to the list
  dives[[i]] <- dive
  # name the list element after the site
  names(dives)[i] <- sites[i]
}

Tabulating association frequency counts

I have data which is in this format:
User Item
1 A
1 B
1 C
1 D
2 A
2 C
2 E
What I want to get is a frequency count for each pair. Order is not important so I don't want to count the inverse. I want to end up with a result similar to this, where the frequency counts are partitioned by user.
Pair Frequency
AB 1
AC 2
AD 1
AE 1
BC 1
BD 1
BE 0
CD 1
CE 1
What tool can I use to formulate this kind of table? I'd prefer some open source solution if possible.
Edit: Added an example for my comment below.
I'm reading in data from a CSV file and removing the factors using the following two lines of code:
xa <- read.csv("C:/Directory/MyData.csv")
xa <- data.frame(lapply(xa, as.character), stringsAsFactors=FALSE)
User Item
1 394324 Item A
2 124209 Item B
3 212457 Item C
4 427052 Item A
5 118281 Item D
6 156831 Item A
7 212442 Item E
8 156831 Item B
9 212442 Item A
10 177734 Item C
When I try running the suggested answer, I get this error:
Error in combn(x, 2) : n < m
Well, R is open source.
Here's an example based on your tiny sample of data. I just read it in by copy-pasting it straight from your post:
> xa=read.table(stdin(),header=TRUE,as.is=TRUE)
0: User Item
1: 1 A
2: 1 B
3: 1 C
4: 1 D
5: 2 A
6: 2 C
7: 2 E
8:
So that's the data in. Then with a couple of lines of code:
> f=function(x) apply(combn(x,2),2,paste0,collapse="")
> table(unlist(tapply(xa$Item,xa$User,f)))
AB AC AD AE BC BD CD CE
1 2 1 1 1 1 1 1
If you need all the empty combinations explicitly as zeroes it takes another line or two (you need to generate all the possible combinations as a factor, rather than just the observed ones and tell table to include the empty ones).
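For completeness, here is a sketch of those extra lines: build every possible unordered pair from the observed items, then tabulate the observed pairs as a factor with that full set of levels so the missing pairs show up as zeroes.
# all possible unordered pairs, formatted the same way f() formats them
all_pairs <- apply(combn(sort(unique(xa$Item)), 2), 2, paste0, collapse="")
# observed pairs per user, counted against the full set of levels
observed <- unlist(tapply(xa$Item, xa$User, f))
table(factor(observed, levels=all_pairs))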
After some research and suggestions by Glen, I came up with the following code, which gets me a 3-column CSV file with the pair combination plus frequency count. If anyone sees a better way, let me know! But this seems to work.
The errors I was referring to in my follow-up comments were caused by users who had purchased at only one location.
library(reshape2)
library(data.table)  # needed for data.table() and setkey() below

# read the purchase data and drop exact duplicate rows
xa <- read.csv("C:/Input.csv", as.is=TRUE)
xa <- xa[!duplicated(xa), ]

# sort by contact and location so each pair always comes out in the same order
xa <- data.table(xa)
setkey(xa, ContactId, PurchaseLocation)

# keep only contacts seen at more than one location (combn() errors when n < 2)
tab <- table(xa$ContactId)
xa <- xa[xa$ContactId %in% names(tab[tab > 1]), ]

# build every location pair per contact and tabulate the pairs
f <- function(x) apply(combn(x, 2), 2, paste0, collapse="--")
xb <- as.data.frame(table(unlist(tapply(xa$PurchaseLocation, xa$ContactId, f))))

# split the "a--b" label back into two columns and keep counts above 1
xc <- with(xb, cbind(Freq, colsplit(xb$Var1, pattern="--", names=c("a", "b"))))
xc <- subset(xc, a != b & a != "" & b != "" & Freq > 1)
write.csv(xc, file="C:/Output.csv")
Edit: I made a small change to make it order-independent by sorting the data table on a key.
