I have two data frames that are related by a really long user ID, and I want to replace these values with something more readable, like a simple integer value. Obviously I want to keep these values consistent between data frames and I was wondering if there is a simple way to do this. Here is what the data.frames look like:
ArtistData - Shows how many times a user listened to a particular artist:
UserID Artist Plays
00000c289a1829a808ac09c00daf10bc3c4e223b elvenking 706
00000c289a1829a808ac09c00daf10bc3c4e223b lunachicks 538
00001411dc427966b17297bf4d69e7e193135d89 stars 373
... ... ...
UserData - Shows information on each individual user:
UserID gender age country
00001411dc427966b17297bf4d69e7e193135d89 m 21 Germany
00004d2ac9316e22dc007ab2243d6fcb239e707d f 34 Mexico
000063d3fe1cf2ba248b9e3c3f0334845a27a6bf m 27 Poland
... ... ... ...
So basically, can I replace these long strings that have no meaning for me with an integer that is consistent between each data frame?
Convert to factors with simplified labels, using all possible UserIDs from both datasets:
levs <- union(UserData$UserID, ArtistData$UserID)
ArtistData$newid <- factor(
ArtistData$UserID, levels=levs, labels=seq_along(levs)
)
UserData$newid <- factor(
UserData$UserID, levels=levs, labels=seq_along(levs)
)
ArtistData
# UserID Artist Plays newid
#1 00000c289a1829a808ac09c00daf10bc3c4e223b elvenking 706 4
#2 00000c289a1829a808ac09c00daf10bc3c4e223b lunachicks 538 4
#3 00001411dc427966b17297bf4d69e7e193135d89 stars 373 1
UserData
# UserID gender age country newid
#1 00001411dc427966b17297bf4d69e7e193135d89 m 21 Germany 1
#2 00004d2ac9316e22dc007ab2243d6fcb239e707d f 34 Mexico 2
#3 000063d3fe1cf2ba248b9e3c3f0334845a27a6bf m 27 Poland 3
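If plain integers are preferred over factor columns, one possible variation (not part of the answer above) is to call match() against the same levs vector. A minimal sketch, with both data frames rebuilt from the sample rows shown in the question:

```r
# Toy data copied from the sample rows in the question
ArtistData <- data.frame(
  UserID = c("00000c289a1829a808ac09c00daf10bc3c4e223b",
             "00000c289a1829a808ac09c00daf10bc3c4e223b",
             "00001411dc427966b17297bf4d69e7e193135d89"),
  Artist = c("elvenking", "lunachicks", "stars"),
  Plays  = c(706, 538, 373),
  stringsAsFactors = FALSE
)
UserData <- data.frame(
  UserID  = c("00001411dc427966b17297bf4d69e7e193135d89",
              "00004d2ac9316e22dc007ab2243d6fcb239e707d",
              "000063d3fe1cf2ba248b9e3c3f0334845a27a6bf"),
  gender  = c("m", "f", "m"),
  age     = c(21, 34, 27),
  country = c("Germany", "Mexico", "Poland"),
  stringsAsFactors = FALSE
)

# Shared lookup vector: every UserID that occurs in either data frame
levs <- union(UserData$UserID, ArtistData$UserID)

# match() returns the position of each UserID in levs as a plain integer
ArtistData$newid <- match(ArtistData$UserID, levs)
UserData$newid   <- match(UserData$UserID, levs)
```

Calling as.integer() on the factor columns from the answer should give the same numbers, since the labels there are just the level positions.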
I have a dataset with around 80 columns and 1000 rows; a sample follows:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
2 F F josh linda 198
3 M NA Claude Bere 200
4 F M John Mary 350
5 F F Peter Lucy 298
I need to select all rows where gend.y and gend.x differ, like this:
ID gend.y gend.x Sire Dam Weight
1 M F Jim jud 220
3 M NA Claude Bere 200
4 F M John Mary 350
Remember, I need to keep the other 76 columns too.
I tried this command:
library(dplyr)
new.file=my.file %>%
filter(gend.y != gend.x)
But it didn't work, and this message appeared:
Error in Ops.factor(gend.y, gend.x) : level sets of factors are different
As @divibisan said: "Still not a reproducible example, but the error gets you closer. These 2 variables are factors. The interpretation of a factor depends on both the codes and the "levels" attribute. Be careful only to compare factors with the same set of levels (in the same order). You probably want to convert them to character before comparing, or fix the levels to match."
So I did this (convert them to character):
my.file$new.gend.y=as.character(my.file$gend.y)
my.file$new.gend.x=as.character(my.file$gend.x)
Then I reran my previous command with the new variables (now converted to character):
library(dplyr)
new.file=my.file %>%
filter(new.gend.y != new.gend.x | is.na(new.gend.y != new.gend.x))
Now it works as I expected. Credit to @divibisan.
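For comparison, the same selection can be sketched in base R without adding helper columns. The data frame below is rebuilt from the sample rows in the question; the explicit is.na() checks are there because comparing with NA yields NA rather than TRUE:

```r
# Sample rows from the question, with the two gender columns stored as factors
my.file <- data.frame(
  ID     = 1:5,
  gend.y = factor(c("M", "F", "M", "F", "F")),
  gend.x = factor(c("F", "F", NA, "M", "F")),
  Sire   = c("Jim", "josh", "Claude", "John", "Peter"),
  Dam    = c("jud", "linda", "Bere", "Mary", "Lucy"),
  Weight = c(220, 198, 200, 350, 298)
)

# Compare as character so differing factor level sets cannot raise an error
y <- as.character(my.file$gend.y)
x <- as.character(my.file$gend.x)

# Keep rows that differ, treating a missing value on either side as "different"
keep <- is.na(x) | is.na(y) | x != y
new.file <- my.file[keep, ]
new.file$ID
```

All other columns are carried along automatically because the row filter is applied to the whole data frame.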
I have a dataset (~14410 rows) with observations including the country. I split this set into training and test sets and fit a decision tree with the rpart() function. When predicting, I sometimes get an error saying that the test set contains countries which are not in the training set.
At first I excluded/deleted the countries which appeared only once:
# Get orderland with frequency one
var.names <- names(table(mydata1$country))[table(mydata1$country) == 1]
loss <- match(var.names, mydata1$country)
names(which(table(mydata1$country) == 1))
mydata1 <- mydata1[-loss, ]
When rerunning my code, I get the same error at the same code line, saying that I have new countries in test which are not in train.
Now I did a count to see how often a country appears.
count <- as.data.frame(count(mydata1, vars=mydata1$country))
count[rev(order(count$n)),]
vars n
3 Bundesrep. Deutschland 7616
9 Grossbritannien 1436
12 Italien 930
2 Belgien 731
22 Schweden 611
23 Schweiz 590
13 Japan 587
19 Oesterreich 449
17 Niederlande 354
8 Frankreich 276
18 Norwegen 238
7 Finnland 130
21 Portugal 105
5 Daenemark 65
26 Spanien 57
4 China 55
20 Polen 51
27 Taiwan 31
14 Korea Süd 30
11 Irland 26
29 Tschechien 13
16 Litauen 9
10 Hong Kong 7
30 <NA> 3
6 Estland 3
24 Serbien 2
1 Australien 2
28 Thailand 1
25 Singapur 1
15 Kroatien 1
From this I can see, I also have NA's in my data.
My question now is, how can I proceed with this problem?
Should I exclude/delete all countries with, e.g., fewer than 7 observations, or should I take those rows and duplicate them so that my predict() function will always work, also for other data sets?
It's somehow not "fancy" just to delete the rows...is there any other possibility?
You need to convert every chr variable to a factor:
mydata1$country <- as.factor(mydata1$country)
Then you can simply proceed with the train/test split. You won't need to remove anything (except NAs).
By using the factor type, your model will know that the country of an observation can take one of a fixed set of levels:
Example:
country <- factor("Italy", levels = c("Italy", "USA", "UK")) # just 3 levels for example
country
[1] Italy
Levels: Italy USA UK
# note that as.factor() takes care of defining the levels for you
See the difference with:
country <- "Italy"
country
[1] "Italy"
By using factor, the model will know all the possible levels, as long as the conversion happens before the split. Because of this, even if the training data contains no "Italy" observation, the model will know that "Italy" can occur in the test data.
factor is generally the appropriate type for categorical character variables in models.
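The point about levels surviving the split can be checked directly. This is only a sketch on a made-up vector (the country names and the split indices are illustrative, not taken from the question's data), and it says nothing about how a particular model treats an empty level:

```r
# Hypothetical toy vector; converted to factor BEFORE splitting
country <- factor(c("Italy", "USA", "UK", "Italy", "USA"))

# Illustrative split in which "Italy" ends up only in the test part
train_idx <- c(2, 3, 5)
train <- country[train_idx]
test  <- country[-train_idx]

# Subsetting a factor keeps the full level set, including unused levels
levels(train)
table(train)

# droplevels() would discard the unused level again
levels(droplevels(train))
```

Here table(train) reports a count of zero for "Italy" instead of omitting it, which is the behaviour the answer relies on.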
I have to analyze data from an economic experiment.
My database is composed of 14,976 observations with 212 variables. It also contains other information such as the profit, total profit, the treatments and other variables.
You can see that I have two types:
Type 1 is for sellers
Type 2 is for buyers
For some variables, results were recorded in the buyers' (type 2) rows and not in the sellers' rows (a completely arbitrary choice). However, I would like to analyze, for instance, the gender of sellers who overcharged. So I need to manipulate my database, and I don't know how to do this.
Here is part of the database:
ID Gender Period Matching group Group Type Overcharging ...
654 1 1 73 1 1 NA
654 1 2 73 1 1 NA
654 1 3 73 1 1 NA
654 1 4 73 1 1 NA
435 1 1 73 2 1 NA
435 1 2 73 2 1 NA
435 1 3 73 2 1 NA
435 1 4 73 2 1 NA
708 0 1 73 1 2 1
708 0 2 73 1 2 0
708 0 3 73 1 2 0
708 0 4 73 1 2 1
546 1 1 73 2 2 0
546 1 2 73 2 2 0
546 1 3 73 2 2 1
546 1 4 73 2 2 0
To do this I have plenty of information: exactly one seller was matched with one buyer in a given period, group, matching group, and treatment.
To give you an example, in matching group 73 we know that in period 1 subject 708 (the one in group 1) was overcharged. As I know that this subject belongs to group 1 and matching group 73, I can identify the seller who overcharged him in period 1: subject 654, with gender = 1.
So I would like to copy the buyers' overcharging values (and some others) onto the sellers' rows (type == 1) to analyze seller behavior, but for the right period, the right group and the right matching group.
I have a long way of doing it with data.frames. If you are looking to code in R long term, I would suggest checking out either (i) the dplyr package, part of the tidyverse suite, or (ii) the data.table package. The first has the more popular syntax and ties together nicely with a bunch of useful packages. The second is harder to learn but quicker. For data of your size, the speed difference is negligible, though.
In base data.frames, here is something I hope matches your request. Let me know if I've mistaken anything, or been unclear.
# sellers data eg
dt1 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 1,
Overcharging = NA)
# buyers data eg
dt2 <- data.frame(Period = 1:4, MatchGroup = 73, Group = 1, Type = 2,
Overcharging = c(1,0,0,1))
# make my current data view
dt <- rbind(dt1, dt2)
dt[]
# split in to two data frames, on the Type column:
dt_split <- split(dt, dt$Type)
dt_split
# move out of list
dt_suffix <- seq_along(dt_split)
dt_names <- sprintf("dt%s", dt_suffix)
for(name in dt_names){
assign(name, dt_split[match(name, dt_names)][[1]])
}
dt1[]
dt2[]
# define the columns in which to match up the buyer to seller
merge_cols <- c("Period", "MatchGroup", "Group")
# define the columns you want to merge, that you know are NA
na_cols <- c("Overcharging")
# now use merge operation, and filter dt2, to pull in only columns you want
# I suggest dropping the na_cols first in dt1, as otherwise it will create two
# columns post-merge: Overcharging.x, Overcharging.y
dt1 <- dt1[,setdiff(names(dt1), na_cols)]
dt1_new <- merge(dt1,
dt2[, c(merge_cols, na_cols)], # filter dt2
by = merge_cols, # columns to match on
all.x = TRUE) # dt1 is x, dt2 is y. Want to keep all of dt1
# if you want to bind them back together, ensure the column order matches, and
# bind e.g.
dt1_new <- dt1_new[, names(dt2)]
dt_final <- rbind(dt1_new, dt2)
dt_final[]
My line of thinking is to split the buyers and sellers into two separate data frames, identify how they join, and migrate the data you need from buyers to sellers. Finally, bind them back together if desired.
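The same split-then-join idea can be sketched more compactly with match() on a combined key. The data frame below is rebuilt from the 16 sample rows in the question; the column name MatchingGroup (for "Matching group") and the helper make_key() are assumptions made for this illustration:

```r
# Sample rows from the question: two sellers (Type 1) and two buyers (Type 2)
d <- data.frame(
  ID            = rep(c(654, 435, 708, 546), each = 4),
  Gender        = rep(c(1, 1, 0, 1), each = 4),
  Period        = rep(1:4, times = 4),
  MatchingGroup = 73,
  Group         = rep(c(1, 2, 1, 2), each = 4),
  Type          = rep(c(1, 1, 2, 2), each = 4),
  Overcharging  = c(rep(NA, 8), 1, 0, 0, 1, 0, 0, 1, 0)
)

sellers <- d[d$Type == 1, ]
buyers  <- d[d$Type == 2, ]

# Hypothetical helper: one key string per matching group / group / period
make_key <- function(x) paste(x$MatchingGroup, x$Group, x$Period, sep = "|")

# For each seller row, find the buyer row with the same key and copy its value
idx <- match(make_key(sellers), make_key(buyers))
sellers$Overcharging <- buyers$Overcharging[idx]

# Example analysis: overcharging counts by seller gender
table(Gender = sellers$Gender, Overcharging = sellers$Overcharging)
```

This assumes each key identifies exactly one buyer, as described in the question; with the full data the treatment variable would probably need to be part of the key as well.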
In R, I have two data frames A & B as follows-
Data-Frame A:
Name Age City Gender Income Company ...
JXX 21 Chicago M 20K XYZ ...
CXX 25 NewYork M 30K PQR ...
CXX 26 Chicago M NA ZZZ ...
Data-Frame B:
Age City Gender Avg Income Avg Height Avg Weight ...
21 Chicago M 30K ... ... ...
25 NewYork M 40K ... ... ...
26 Chicago M 50K ... ... ...
I want to fill missing values in data frame A from data frame B.
For example, for third row in data frame A I can substitute avg income from data frame B instead of exact income. I don't want to merge these two data frames, instead want to perform look-up like operation using Age, City and Gender columns.
library(data.table);
## generate data
set.seed(5L);
NK <- 6L; pA <- 0.8; pB <- 0.2;
keydf <- unique(data.frame(Age=sample(18:65,NK,T),City=sample(c('Chicago','NewYork'),NK,T),Gender=sample(c('M','F'),NK,T),stringsAsFactors=F));
NO <- nrow(keydf)-1L;
Af <- cbind(keydf[-1L,],Name=sample(paste0(LETTERS,LETTERS,LETTERS),NO,T),Income=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pA,rep((1-pA)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
Bf <- cbind(keydf[-2L,],`Avg Income`=sample(c(NA,paste0(seq(20L,90L,10L),'K')),NO,T,c(pB,rep((1-pB)/8,8L))),stringsAsFactors=F)[sample(seq_len(NO)),];
At <- as.data.table(Af);
Bt <- as.data.table(Bf);
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS NA
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX NA
Bt;
## Age City Gender Avg Income
## 1: 62 NewYork M NA
## 2: 51 Chicago F 60K
## 3: 31 Chicago M 50K
## 4: 27 NewYork M NA
## 5: 23 Chicago M 60K
I generated some random test data for demonstration purposes. I'm quite happy with the result I got with seed 5, which covers many cases:
one row in A that doesn't join with B (50/NewYork/F).
one row in B that doesn't join with A (27/NewYork/M).
two rows that join and should result in a replacement of NA in A with a non-NA value from B (23/Chicago/M and 31/Chicago/M).
one row that joins but has NA in B, so shouldn't affect the NA in A (62/NewYork/M).
one row that could join, but has non-NA in A, so shouldn't take the value from B (I assumed you would want this behavior) (51/Chicago/F). The value in A (90K) differs from the value in B (60K), so we can verify this behavior.
And I intentionally scrambled the rows of A and B to ensure we join them correctly, regardless of incoming row order.
## data.table solution
keys <- c('Age','City','Gender');
At[is.na(Income),Income:=Bt[.SD,on=keys,`Avg Income`]];
At;
## Age City Gender Name Income
## 1: 50 NewYork F OOO NA
## 2: 23 Chicago M SSS 60K
## 3: 62 NewYork M VVV NA
## 4: 51 Chicago F FFF 90K
## 5: 31 Chicago M XXX 50K
In the above I filter for NA values in A first, then do a join in the j argument on the key columns and assign in-place the source column to the target column using the data.table := syntax.
Note that in the data.table world X[Y] does a right join, so if you want a left join you need to reverse it to Y[X] (with "left" now referring to X, counter-intuitively). That's why I used Bt[.SD] instead of (the likely more natural expectation of) .SD[Bt]. We need a left join on .SD because the result of the join index expression will be assigned in-place to the target column, and so the RHS of the assignment must be a full vector correspondent to the target column.
You can repeat the in-place assignment line for each column you want to replace.
## base R solution
keys <- c('Age','City','Gender');
m <- merge(cbind(Af[keys],Ai=seq_len(nrow(Af))),cbind(Bf[keys],Bi=seq_len(nrow(Bf))))[c('Ai','Bi')];
m;
## Ai Bi
## 1 2 5
## 2 5 3
## 3 4 2
## 4 3 1
mi <- which(is.na(Af$Income[m$Ai])); Af$Income[m$Ai[mi]] <- Bf$`Avg Income`[m$Bi[mi]];
Af;
## Age City Gender Name Income
## 2 50 NewYork F OOO <NA>
## 5 23 Chicago M SSS 60K
## 3 62 NewYork M VVV <NA>
## 6 51 Chicago F FFF 90K
## 4 31 Chicago M XXX 50K
I guess I was feeling a little bit creative here, so for a base R solution I did something that's probably a little unusual, and which I've never done before. I column-bound a synthesized row index column into the key-column subset of each of the A and B data.frames, then called merge() to join them (note that this is an inner join, since we don't need any kind of outer join here), and extracted just the row index columns that resulted from the join. This effectively precomputes the joined pairs of rows for all subsequent modification operations.
For the modification, I precompute the subset of the join pairs for which the row in A satisfies the replacement condition, e.g. that its Income value is NA for the Income replacement. We can then subset the join pair table for those rows, and do a direct assignment from B to A to carry out the replacement.
As before, you can repeat the assignment line for every column you want to replace.
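So I think this works for Income, as long as the rows of the two data frames line up one-to-one (the assignment below is positional and does not look up Age, City and Gender). If there are only those 3 columns, you could substitute the names of the other columns in: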
df1<-read.table(header = T, stringsAsFactors = F, text = "
Name Age City Gender Income Company
JXX 21 Chicago M 20K XYZ
CXX 25 NewYork M 30K PQR
CXX 26 Chicago M NA ZZZ")
df2<-read.table(header = T, stringsAsFactors = F, text = "
Age City Gender Avg_Income
21 Chicago M 30K
25 NewYork M 40K
26 Chicago M 50K ")
df1[is.na(df1$Income),]$Income<-df2[is.na(df1$Income),]$Avg_Income
It wouldn't surprise me if one of the regulars has a better way that prevents you from having to re-type the names of the columns.
You can simply use the following to copy each city's average income from B into the Income column of A. Note that it matches on City only and overwrites every Income value, not just the missing ones.
dataFrameA$Income = dataFrameB$`Avg Income`[match(dataFrameA$City, dataFrameB$City)]
You'll have to use backticks (`) if the column name contains a space.
This is similar to a lookup using INDEX and MATCH in Excel; I'm assuming you're coming from Excel. The code will be more compact if you use data.table.
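The match() lookup can be extended to all three key columns and restricted to the missing values. This is a sketch under assumptions: the data frames A and B are rebuilt from the question's sample rows, the average-income column is named Avg_Income to avoid the space, and the rows of B are shuffled on purpose to show that row order does not matter:

```r
A <- data.frame(
  Name   = c("JXX", "CXX", "CXX"),
  Age    = c(21, 25, 26),
  City   = c("Chicago", "NewYork", "Chicago"),
  Gender = "M",
  Income = c("20K", "30K", NA),
  stringsAsFactors = FALSE
)
# Rows deliberately in a different order than in A
B <- data.frame(
  Age        = c(26, 21, 25),
  City       = c("Chicago", "Chicago", "NewYork"),
  Gender     = "M",
  Avg_Income = c("50K", "30K", "40K"),
  stringsAsFactors = FALSE
)

# Build one lookup key per row from Age, City and Gender
keyA <- paste(A$Age, A$City, A$Gender, sep = "|")
keyB <- paste(B$Age, B$City, B$Gender, sep = "|")

# Fill only the missing incomes, looking each one up by key in B
miss <- is.na(A$Income)
A$Income[miss] <- B$Avg_Income[match(keyA[miss], keyB)]
A
```

Rows of A without a partner in B simply stay NA, because match() returns NA for them.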
I have a data.frame like so:
category count
A 11
B 1
C 45
A 1003
D 20
B 207
E 634
E 40
A 42
A 7
B 44
B 12
Each row represents a specific element with a category type and a count of that element. I would like to produce a frequency distribution of counts per category, but the categories are at the moment redundant.
How do I retrieve a table of redundant category counts? i.e. I want a table that looks like:
category count
A 11234
B 4005
C 100023
D 65567
E 54654
... ...
I almost got there using lapply:
df.nrcounts <- lapply(unique(df.counts$category),
function(x) c(category=x, count=sum(subset(df.counts, category==x)$count)))
but I can't seem to coerce the output to a proper dataframe. I can't quite get my head around using the function.
aggregate(df.counts$count,by=list(df.counts$category),FUN=sum)
Or
library(data.table)
setDT(df.counts)[, list(count=sum(count)), by = category]
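For reference, the aggregate() call can also be written with the formula interface, which keeps the original column names in the result. A small sketch using the twelve sample rows from the question:

```r
# Sample rows from the question
df.counts <- data.frame(
  category = c("A", "B", "C", "A", "D", "B", "E", "E", "A", "A", "B", "B"),
  count    = c(11, 1, 45, 1003, 20, 207, 634, 40, 42, 7, 44, 12)
)

# Sum count within each category; result columns are named category and count
df.nrcounts <- aggregate(count ~ category, data = df.counts, FUN = sum)
df.nrcounts

# Same totals as a named vector, if a data frame is not needed
tapply(df.counts$count, df.counts$category, sum)
```

The first form returns a data frame with one row per category, which is the shape the question asks for.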