Creating an ID to reshape a dataset [duplicate]

This question already has answers here:
Create a sequential number (counter) for rows within each group of a dataframe [duplicate]
(6 answers)
Closed 5 years ago.
First time posting, mainly because I got tired of banging my head against the wall.
Thanks in advance for looking at this.
I have a data frame that looks like this:
state city x y z
1 OR Portland 8 10 1
2 OR Portland 8 10 4
3 OR Portland 8 10 10
4 NY New York 29 15 10
5 NY New York 29 15 18
6 NJ Trenton 8 10 50
7 NJ Trenton 8 10 60
8 NJ Trenton 8 10 70
9 WA Seattle 1 70 6
10 WA Seattle 1 70 7
11 WA Seattle 1 70 8
12 WA Seattle 1 70 9
13 WA Seattle 1 70 10
14 WA Seattle 1 70 11
I have been trying to reshape it to look like this:
state city x y z.1 z.2 z.3 z.4 z.5 z.6
OR Portland 8 10 1 4 10
NY New York 29 15 10 18
NJ Trenton 8 10 50 60 70
WA Seattle 1 70 6 7 8 9 10 11
I have been using the package reshape2 and the code looks like this:
df <- melt(data,id.vars = c("state","city","x","y"),measure.vars = "z")
wide <- dcast(df, state + city + x + y ~ variable)
Which returns a count of variable z for each set of id.vars.
I also tried this:
wide <- dcast(df, state + city + x + y ~ value)
Which looks like this:
state city x y 1 4 6 7 etc...
OR Portland 8 10 1 1 0 0
NY New York 29 15 0 0 0 0
NJ Trenton 8 10 0 0 0 0
WA Seattle 1 70 0 0 1 1
This is closer to what I'm looking for but would be very difficult to use for looking up information.
Tell me if I'm wrong, but it looks like I need an id variable for each duplicate value of state, city, x, y. I haven't been able to think up or find anything that will let me create a column that numbers the duplicate values like below:
state city x y z num
1 OR Portland 8 10 1 1
2 OR Portland 8 10 4 2
3 OR Portland 8 10 10 3
4 NY New York 29 15 10 1
5 NY New York 29 15 18 2
6 NJ Trenton 8 10 50 1
7 NJ Trenton 8 10 60 2
8 NJ Trenton 8 10 70 3
9 WA Seattle 1 70 6 1
10 WA Seattle 1 70 7 2
11 WA Seattle 1 70 8 3
12 WA Seattle 1 70 9 4
13 WA Seattle 1 70 10 5
14 WA Seattle 1 70 11 6
I would appreciate any help or an idea of where to keep looking for solutions.
Best,
-n

If using dplyr (with tidyr, which provides spread()) is an option you can use:
library(dplyr)
library(tidyr)
df %>%
  group_by(state, city, x, y) %>%
  mutate(n = row_number()) %>%
  spread(n, z, sep = '')
Note that the original row ordering is lost, though.
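The same counter-and-widen idea can also be sketched in base R, which keeps the original row order. This assumes the data frame is named df; the data is reconstructed below from the question. ave() builds the per-group counter and reshape() widens z into z.1, z.2, ... columns.

```r
# Reconstruction of the question's data frame
df <- data.frame(
  state = rep(c("OR", "NY", "NJ", "WA"), times = c(3, 2, 3, 6)),
  city  = rep(c("Portland", "New York", "Trenton", "Seattle"), times = c(3, 2, 3, 6)),
  x     = rep(c(8, 29, 8, 1), times = c(3, 2, 3, 6)),
  y     = rep(c(10, 15, 10, 70), times = c(3, 2, 3, 6)),
  z     = c(1, 4, 10, 10, 18, 50, 60, 70, 6, 7, 8, 9, 10, 11)
)

# Number the duplicate rows within each state/city/x/y group
df$num <- ave(df$z, df$state, df$city, df$x, df$y, FUN = seq_along)

# Widen: one z.<num> column per within-group position
wide <- reshape(df, idvar = c("state", "city", "x", "y"),
                timevar = "num", direction = "wide")
```

Unlike the spread() version, reshape() leaves the id rows in their original order (OR, NY, NJ, WA).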

Related

How to make an histogram with a data frame

I was trying to make a histogram of the frequencies in a name list; the list looks like this:
> x[1:15,]
X x
1 1 JUAN DOMINGOGONZALEZDELGADO
2 2 FRANCISCO JAVIERVELARAMIREZ
3 3 JUAN CARLOSPEREZMEDINA
4 4 ARMANDOSALINASSALINAS
5 5 JOSE FELIXZAMBRANOAMEZQUITA
6 6 GABRIELMONTIELRIVAS
7 7 DONACIANOSANCHEZHERRERA
8 8 JUAN MARTINXHUERTA
9 9 ALVARO ALEJANDROGONZALEZRAMOS
10 10 OMAR ROMANCASTAÑEDALOPEZ
11 11 IGNACIOBUENOCANO
12 12 RAFAELBETANCOURTDELGADO
13 13 LUIS ALBERTOCASTILLOESCOBEDO
14 14 VICTORHERNANDEZGONZALEZ
15 15 FATIMAROMOTORRES
In order to do that I converted it to a frequency table, which looks like this:
> y[1:15,]
X x Freq
1 1 15
2 2 JULIO CESAR ORDAZFLORES 1
3 3 MARCOS ANTONIOCUEVASNAVARRO 1
4 4 DULEY DILTON TRIBOUILLIERLOARCA 1
5 5 ANTONIORAMIREZLOPEZ 2
6 6 BRAYAN ALEJANDROOJEDARAMIREZ 1
7 7 JOSE DE JESUSESCOTOCORTEZ 1
8 8 AARONFLORESGARCIA 1
9 9 ABIGAILNAVARROAMBRIZ 1
10 10 ABILENYRODRIGUEZORTEGA 1
11 11 ABRAHAMHERNANDEZRAMIREZ 1
12 12 ABRAHAMPONCEALCANTARA 1
13 13 ADRIAN VAZQUEZ BUSTAMANTE 2
14 14 ADRIANHERNANDEZBERMUDEZ 28
15 15 ALAN ORLANDOCASTILLALOPEZ 11
When I try hist(x) or hist(x[,2]) I get:
Error in hist.default(x) : 'x' must be numeric
and if I try hist(y[,3]) I get a strange histogram, which is not what I want. How can I make a histogram of the frequencies of the name list?
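A sketch of the distinction, with made-up names standing in for the question's data: hist() requires numeric input, so per-name counts are normally drawn with table() plus barplot(); hist() only makes sense on the numeric frequencies themselves.

```r
# Made-up name vector standing in for the question's column x
names_vec <- c("JUAN", "PEDRO", "JUAN", "MARIA", "JUAN", "PEDRO")

freq <- table(names_vec)    # counts per name
barplot(freq, las = 2)      # one bar per name: the "histogram" of the list

# If the goal is instead the distribution of the counts themselves
# (how many names occur once, twice, ...), hist() on the numeric
# frequencies works:
hist(as.numeric(freq))
```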

How to sort with multiple conditions in R [duplicate]

This question already has answers here:
Sort (order) data frame rows by multiple columns
(19 answers)
Closed 3 years ago.
I have a very simple dataframe in R:
x <- data.frame("SN" = 1:7, "Age" = c(21,15,22,33,21,15,25), "Name" = c("John","Dora","Paul","Alex","Bud","Chad","Anton"))
My goal is to sort the dataframe by Age and then by Name. I am able to achieve this task partially if I type the following command:
x[order(x[, 'Age']),]
which returns:
SN Age Name
2 2 15 Dora
6 6 15 Chad
1 1 21 John
5 5 21 Bud
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
As you can see the dataframe is ordered by Age but not by Name.
Question: how can I order the dataframe by age and name at the same time? This is what the result should look like:
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
Note: I would like to avoid additional packages and use just the default ones.
With dplyr:
library(dplyr)
x %>%
  arrange(Age, Name)
SN Age Name
1 6 15 Chad
2 2 15 Dora
3 5 21 Bud
4 1 21 John
5 3 22 Paul
6 7 25 Anton
7 4 33 Alex
Or in base R:
x[with(x, order(Age, Name)), ]
SN Age Name
6 6 15 Chad
2 2 15 Dora
5 5 21 Bud
1 1 21 John
3 3 22 Paul
7 7 25 Anton
4 4 33 Alex
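If one of the keys should run in the opposite direction, base order() still works: wrapping a column in -xtfrm() sorts that key descending while the others stay ascending (xtfrm() turns any sortable column, including character, into numeric ranks). A sketch with the question's data:

```r
x <- data.frame(SN = 1:7,
                Age = c(21, 15, 22, 33, 21, 15, 25),
                Name = c("John", "Dora", "Paul", "Alex", "Bud", "Chad", "Anton"))

# Age ascending, Name descending within each Age
mixed <- x[order(x$Age, -xtfrm(x$Name)), ]
```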

Matching and merging two csv files in R

I have file 1 with attributes like (706 attributes)
Matchid TeamName Opp_TeamName TeamRank Opp_TeamRank Team_Top10RankingBatsman
1 New Zealand Bangladesh 1 10 2
2 New Zealand India 1 2 2
3 India England 2 5 1
4 Australia England 6 5 1
and file 2 with attributes (706 attributes)
id actual predicted error
3 79 206.828 127.828
1 90 182.522 92.522
2 101 193.486 92.486
4 89 174.889 85.889
I want to match "Matchid and id" of both files and add file2 attributes in file1 so that the final result is
Matchid TeamName Opp_TeamName TeamRank Opp_TeamRank Team_Top10RankingBatsman id actual predicted error
1 New Zealand Bangladesh 1 10 2 1 90 182.522 92.522
2 New Zealand India 1 2 2 2 101 193.486 92.486
3 India England 2 5 1 3 79 206.828 127.828
4 Australia England 6 5 1 4 89 174.889 85.889
So far I have tried the simple merge function below and it didn't work. How can I achieve my task?
merge(file1,file2,by.x="Matchid",by.y="id")
Maybe this way?
The dplyr way:
library(dplyr)
joined <- inner_join(file_1, file_2, by = c("Matchid" = "id"))
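For reference, the base merge() call from the question should also work on this data; the usual pitfall is that unmatched ids silently drop rows. A small sketch with toy columns (names shortened from the question):

```r
file1 <- data.frame(Matchid = 1:4,
                    TeamName = c("New Zealand", "New Zealand", "India", "Australia"))
file2 <- data.frame(id = c(3, 1, 2),   # note: no id 4
                    actual = c(79, 90, 101))

# Inner join drops Matchid 4 (it has no matching id in file2);
# all.x = TRUE keeps it, filling the file2 columns with NA
inner <- merge(file1, file2, by.x = "Matchid", by.y = "id")
left  <- merge(file1, file2, by.x = "Matchid", by.y = "id", all.x = TRUE)
```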

Sampling with a key in different data table using data.table in 'R'

I have the following code.
DT <- data.table(s3ITR)
DTKey <- data.table(s3Key, key = "Age")
> DT
Index Country Age Time Charity
1: 1 France 30 40 1
2: 2 France 40 40 0
3: 3 France 40 50 0
4: 4 Germany 40 40 1
5: 5 France 60 40 1
6: 6 France 40 40 1
7: 7 Germany 30 40 0
8: 8 Germany 30 40 1
9: 9 Germany 30 40 NA
10: 10 Germany 30 40 1
> DTKey
Index Country Age Time Charity
1: 1 France 30 40 0
2: 2 Germany 30 40 0
3: 3 Germany 30 40 1
4: 4 Germany 30 40 0
5: 5 Germany 30 40 1
6: 6 Germany 30 40 1
I would like to impute the NA in DT by random sampling from DTKey; the result may be stored in a new column called Impute.
I can easily set a key within DT and sample from DT itself with the code below
DT <- data.table(s3ITR, key = "Age")
DT[, Impute := sample(na.omit(Charity), length(Charity), replace = T), by = key(DT)]
DT[!is.na(Charity), Impute := Charity]
It is a bit convoluted, but it works and I get the result below
Index Country Age Time Charity Impute
1: 1 France 30 40 1 1
2: 2 France 40 40 0 0
3: 3 France 40 50 0 0
4: 4 Germany 40 40 1 1
5: 5 France 60 40 1 1
6: 6 France 40 40 1 1
7: 7 Germany 30 40 0 0
8: 8 Germany 30 40 1 1
9: 9 Germany 30 40 NA 1
10: 10 Germany 30 40 1 1
Where the probability of NA being imputed as 1 is 3/4. I would like to do this exact same thing but sample from DTKey instead, where the probability would be 3/6.
Is there an easy way to do this without merging the tables?
Is there a special reason why you want to sample from DTKey? To achieve a "fair" probability you could simply use:
sample(0:1,1,replace=T)
Assuming that Charity is either 0 or 1.
UPDATE:
Okay, in that case you could try the following:
DT[, Impute:= sample(DTKey[,Charity], length(DT[,Charity]), replace=T)]
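The one-liner above samples from all of DTKey at once. If the draw should respect the Age groups (sampling only from DTKey rows whose Age matches), one hedged sketch using the table's key for the lookup, shown on toy stand-ins for DT and DTKey:

```r
library(data.table)

# Toy stand-ins for DT and DTKey (only the relevant columns)
DT    <- data.table(Age = c(30, 40, 30), Charity = c(NA, 1, NA))
DTKey <- data.table(Age = c(30, 30, 30, 40), Charity = c(0, 1, 1, 0), key = "Age")

DT[, Impute := Charity]
DT[is.na(Impute), Impute := sapply(Age, function(a) {
  pool <- DTKey[.(a), Charity]     # keyed lookup: Charity values for this Age
  pool[sample(length(pool), 1L)]   # index the pool to avoid sample()'s scalar quirk
})]
```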

Using R to generate a time series of averages from a very large dataset without using for loops

I am working with a large dataset of patent data. Each row is an individual patent, and columns contain information including application year and number of citations in the patent.
> head(p)
allcites appyear asscode assgnum cat cat_ocl cclass country ddate gday gmonth
1 6 1974 2 1 6 6 2/161.4 US 6 1
2 0 1974 2 1 6 6 5/11 US 6 1
3 20 1975 2 1 6 6 5/430 US 6 1
4 4 1974 1 NA 5 <NA> 114/354 6 1
5 1 1975 1 NA 6 6 12/142S 6 1
6 3 1972 2 1 6 6 15/53.4 US 6 1
gyear hjtwt icl icl_class icl_maingroup iclnum nclaims nclass nclass_ocl
1 1976 1 A41D 1900 A41D 19 1 4 2 2
2 1976 1 A47D 701 A47D 7 1 3 5 5
3 1976 1 A47D 702 A47D 7 1 24 5 5
4 1976 1 B63B 708 B63B 7 1 7 114 9
5 1976 1 A43D 900 A43D 9 1 9 12 12
6 1976 1 B60S 304 B60S 3 1 12 15 15
patent pdpass state status subcat subcat_ocl subclass subclass1 subclass1_ocl
1 3930271 10030271 IL 63 63 161.4 161.4 161
2 3930272 10156902 PA 65 65 11.0 11 11
3 3930273 10112031 MO 65 65 430.0 430 331
4 3930274 NA CA 55 NA 354.0 354 2
5 3930275 NA NJ 63 63 NA 142S 142
6 3930276 10030276 IL 69 69 53.4 53.4 53
subclass_ocl term_extension uspto_assignee gdate
1 161 0 251415 1976-01-06
2 11 0 246000 1976-01-06
3 331 0 10490 1976-01-06
4 2 0 0 1976-01-06
5 142 0 0 1976-01-06
6 53 0 243840 1976-01-06
I am attempting to create a new data frame which contains the mean number of citations (allcites) per application year (appyear), separated by category (cat), for patents from 1970 to 2006 (the data goes all the way back to 1901). I did this successfully, but I feel like my solution is somewhat ad hoc and does not take advantage of the specific capabilities of R. Here is my solution
#citations by category
citescat <- data.frame("chem"=integer(37),
"comp"=integer(37),
"drugs"=integer(37),
"ee"=integer(37),
"mech"=integer(37),
"other"=integer(37),
"year"=1970:2006
)
for (i in 1:37) {
for (j in 1:6) {
citescat[i,j] <- mean(p$allcites[p$appyear==(i+1969) & p$cat==j], na.rm=TRUE)
}
}
I am wondering if there is a simple way to do this without using the nested for loops which would make it easy to make small tweaks to it. It is hard for me to pin down exactly what I am looking for other than this, but my code just looks ugly to me and I suspect that there are better ways to do this in R.
Joran is right - here's a plyr solution. Without your dataset in a usable form it's hard to show you exactly, but here it is with a simplified dataset:
library(plyr)
library(reshape2)  # for dcast()
p <- data.frame(allcites = sample(1:20, 20), appyear = 1974:1975, pcat = rep(1:4, each = 5))
#First calculate the means of each group
cites <- ddply(p, .(appyear, pcat), summarise, meancites = mean(allcites, na.rm = TRUE))
#This gives us the data in long form
# appyear pcat meancites
# 1 1974 1 14.666667
# 2 1974 2 9.500000
# 3 1974 3 10.000000
# 4 1974 4 10.500000
# 5 1975 1 16.000000
# 6 1975 2 4.000000
# 7 1975 3 12.000000
# 8 1975 4 9.333333
#Now use dcast to get it in wide form (which I think your for loop was doing):
citescat <- dcast(cites, appyear ~ pcat)
# appyear 1 2 3 4
# 1 1974 14.66667 9.5 10 10.500000
# 2 1975 16.00000 4.0 12 9.333333
Hopefully you can see how to adapt that to your specific data.
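For completeness, the same computation can be sketched in base R with the answer's toy data: aggregate() gives the long form and tapply() the wide year-by-category matrix.

```r
p <- data.frame(allcites = sample(1:20, 20),
                appyear  = 1974:1975,
                pcat     = rep(1:4, each = 5))

# Long form: mean citations per (year, category)
cites <- aggregate(allcites ~ appyear + pcat, data = p, FUN = mean, na.rm = TRUE)

# Wide form: years as rows, categories as columns
wide <- tapply(p$allcites, list(p$appyear, p$pcat), mean, na.rm = TRUE)
```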
