R newbie here. I have a simple data table (DT) that records the number of people per household (NumHH) across several U.S. states (Residence):
NumHH Residence
6 AK
4 AL
7 AR
6 AZ
1 CA
2 CO
2 CT
1 AK
4 AL
6 AR
3 AZ
1 CA
6 CO
3 CT
5 AL
By using with(),
with(DT, table(NumHH, Residence))
I can get a table that's close to what I want:
Residence
NumHH AK AL AR AZ CA CO CT
1 1 0 0 0 2 0 0
2 0 0 0 0 0 1 1
3 0 0 0 1 0 0 1
4 0 2 0 0 0 0 0
5 0 1 0 0 0 0 0
6 1 0 1 1 0 1 0
7 0 0 1 0 0 0 0
but I need a table that provides the frequency of several ranges per residence. The frequencies are calculated this way:
##Frequency of ranges per State
One <- DT$NumHH <=1 ##Only 1 person/household
Two_Four <- ((DT$NumHH <=4) - (DT$NumHH <=1)) ##2 to 4 people in Household
OverFour <- DT$NumHH >4 ##More than 4 people in HH
Ideally, the result would look like this:
Residence
NumHH AK AL AR AZ CA CO CT
One 1 0 0 0 2 0 0
Two_Four 0 2 0 1 0 1 2
OverFour 1 1 2 1 0 1 0
I've tried:
with() - I can only do one range at a time, e.g. with(DT, table(One, Residence)), which gives me a FALSE row and a TRUE row per state.
data.frame() - it asks me to name each state ("AK", "AL", "AR", etc.), which with() already knows.
ddply - I got a list of each calculation's results (150 unlabeled rows in 4 columns, not the desired 3 labeled rows with one column per state), so I'm obviously not doing it right.
Any assistance is greatly appreciated.
Use ?cut to establish your groups before using table:
with(dat, table( NumHH=cut(NumHH, c(0,1,4,Inf), labels=c("1","2-4",">4")), Residence))
# Residence
#NumHH AK AL AR AZ CA CO CT
# 1 1 0 0 0 2 0 0
# 2-4 0 2 0 1 0 1 2
# >4 1 1 2 1 0 1 0
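If you also want marginal totals, base R's addmargins() can wrap the same table; a minimal sketch reusing the call above:
tab <- with(DT, table(NumHH = cut(NumHH, c(0, 1, 4, Inf), labels = c("1", "2-4", ">4")),
                      Residence))
addmargins(tab)   # appends a Sum row and a Sum column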
Related
I have this data in Excel that I imported into Power BI:
Area Attach Pay Start Entry
North 1 0 1 0
North 1 0 1 0
North 1 0 1 1
North 0 1 1 0
West 1 0 1 0
West 1 0 1 1
West 0 0 1 0
West 1 1 1 0
West 1 0 1 0
West 1 0 1 0
West 1 0 1 1
I produced a report using a matrix visual, summing the values of Attach, Pay, Start and Entry grouped by Area.
My current output and my desired output are shown in the attached screenshots.
I can do the formatting, don't worry about it.
There is a "Show on rows" option in the matrix's formatting pane (under Values) that moves the measure names from the columns to the rows.
I want to clean up my table, but since I'm still new to R, what I can do is pretty limited. The list is actually quite long, around 100,000 rows, so it would be impossible for me to do this manually. Please help me.
Suppose I have a very long list of data in table form. Each row has a "Publication.Code" and a "Date". The code is unique while the date can be duplicated. Each row also has a list of names under the column "Type".
Publication.Code Date Type
1 AC00069535742 2009-04-16 E62D 21/15;E60R 7/06;E60R 21/06;E62D 25/14
2 BB000069535652 2008-10-30 F06Q 10/
3 FV000069434701 2007-04-05 E30B 15/;E30B 15/16
4 RG000069534443 2006-07-06 E62D 21/15;E62D 25/14;T60T 7/06;E60R 21/06
5 MV000069333663 2006-02-23 H04N 1/1;G01J 3/51
6 KK000069533634 2006-02-23 H12N 9/1;H12N 15/54;H12P 9/
7 NQ000069534198 2006-02-16 H12N 15/54;H12N 15/7;H12N 1/21;H12N 9/1
I want to create new columns from the first 4 characters of each name in the "Type" column (those are E60R, E62D, F06Q, E30B, T60T, H04N, G01J, H12N) and count each one's frequency among the list of names, just like below:
Publication.Code Date E60R E62D F06Q E30B T60T H04N G01J H12N
1 AC00069535742 2009-04-16 2 2 1 0 0 0 0 0
2 BB000069535652 2008-10-30 0 0 1 0 0 0 0 0
3 FV000069434701 2007-04-05 0 0 0 2 0 0 0 0
4 RG000069534443 2006-07-06 1 2 0 0 1 0 0 0
5 MV000069333663 2006-02-23 0 0 0 0 0 1 1 0
6 KK000069533634 2006-02-23 0 0 0 0 0 0 0 3
7 NQ000069534198 2006-02-16 0 0 0 0 0 0 0 4
After that, I would like to sum that up by year, maybe by:
Year E60R E62D F06Q E30B T60T H04N G01J H12N
1 2009 2 2 1 0 0 0 0 0
2 2008 0 0 1 0 0 0 0 0
3 2007 0 0 0 2 0 0 0 0
4 2006 1 2 0 0 1 1 1 7
and also the cumulative sum of each column:
Year E60R E62D F06Q E30B T60T H04N G01J H12N
1 2009 2 2 1 0 0 0 0 0
2 2008 2 2 2 0 0 0 0 0
3 2007 2 2 2 2 0 0 0 0
4 2006 2 4 2 2 1 1 1 7
I understand that I can use dplyr to mutate the columns and count the frequency by year, but I'm not sure how to extract just certain values from the column. I'd really appreciate any help ~
If you put your types into the vector myTypes, this should work for the first part of your problem:
require(plyr)
require(stringr)
df <- read.table(header = TRUE, sep = ",", strip.white = TRUE,
                 stringsAsFactors = FALSE, text = "
Publication.Code, Date, Type
AC00069535742, 2009-04-16, E62D 21/15;E60R 7/06;E60R 21/06;E62D 25/14
BB000069535652, 2008-10-30, F06Q 10/")
myTypes <- c("E60R", "E62D", "F06Q", "E30B", "T60T", "H04N", "G01J", "H12N")
# count the occurrences of each prefix in Type, row by row
res <- adply(df, .margins = 1, .fun = function(x) setNames(str_count(x$Type, pattern = myTypes), myTypes))
res$Type <- NULL
This will solve the second part; note that the result must be stored (here as res2) for the cumulative step:
library(lubridate)
res$Date <- ymd(res$Date)  # parse the dates
res2 <- ddply(res, .(year(Date)), function(x) colSums(x[, -(1:2)]))
To calculate cumulative values for each column, use cumsum inside colwise():
names(res2)[1] <- "year"
cbind(year = res2$year, colwise(cumsum, myTypes)(res2))
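For reference, here is a base-R sketch of the same three steps without plyr, assuming the full data set is in df and the prefixes in myTypes (sapply() plus rowsum() is just one way to do it):
library(stringr)
library(lubridate)
counts <- sapply(myTypes, function(p) str_count(df$Type, fixed(p)))  # one column per prefix
byYear <- rowsum(counts, year(ymd(df$Date)))  # yearly totals, one row per year
apply(byYear, 2, cumsum)                      # running totals, in ascending year order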
I am a rookie in R, and I think my questions are basic ones. I want to know the frequency of a variable under a couple of conditions. I tried to use table() but it does not work. I have searched a lot and still cannot find the answer.
My data looks like this:
ID AGE LEVEL End_month
1 14 1 201005
2 25 2 201006
3 17 2 201006
4 16 1 201008
5 19 3 201007
6 33 2 201008
7 17 2 201006
8 15 3 201005
9 23 1 201004
10 25 2 201007
I want to know two things.
First, I want to know the frequency of each age under the different levels, with the ages shown individually up to a certain value and the rest aggregated into one group. It looks like this:
level
1 2 3 sum
age 14 1 0 0 1
16 1 0 0 1
15 0 0 1 1
17 0 2 0 2
19 0 0 1 1
20+ 1 3 0 4
sum 3 5 2 10
Second, I want to know the frequency of each age in each End_month for level 2 and level 3 customers. I want to get tables like this.
For level 2 customers:
End_month
201004 201005 201006 201007 201008 sum
age 15 0 0 0 0 0 0
19 0 0 0 0 0 0
17 0 0 2 0 0 2
19 0 0 0 0 0 0
25 0 0 0 1 0 1
33 0 0 0 1 1 2
sum 0 0 2 2 1 5
For level 3 customers:
End_month
201004 201005 201006 201007 201008 sum
age 15 0 1 0 0 0 1
19 0 0 0 1 0 1
17 0 0 0 0 0 0
19 0 0 0 0 0 0
25 0 0 0 0 0 0
33 0 0 0 0 0 0
sum 0 1 0 1 0 2
Many thanks in advance.
You can still achieve this with table, because it can take more than one variable.
For example, use
table(AGE, LEVEL)
to get the first two-way table.
Now, when you want to produce such a table for each subset according to LEVEL, you can do it this way (going for level 1 here):
sel <- LEVEL == 1   # "sel" rather than "subset", which would mask base::subset()
table(AGE[sel], End_month[sel])
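To get the "20+" bucket and the Sum margins from your first desired table, one sketch is to recode AGE before tabulating (assuming AGE and LEVEL are the vectors shown above):
ageGrp <- ifelse(AGE >= 20, "20+", as.character(AGE))  # collapse ages 20 and over
addmargins(table(age = ageGrp, level = LEVEL))         # appends the Sum row and column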
I was wondering if you guys could help me build an adjacency matrix. I have data in CSV format like this:
Paper_ID Author
2 Foster-McGregor, N.
3 Van Houte, M.
4 van de Meerendonk, A.
5 Farla, K.
6 van Houte, M.
6 Siegel, M.
8 Farla, K.
11 Farla, K.
11 Verspagen, B.
As you can see, the column "Paper_ID" has a repeated value of 11, meaning that "Farla, K." and "Verspagen, B." are coauthors of a publication. I need to build a square weighted matrix from the authors' names, counting the number of times they collaborate.
Does the following do what you are looking for?
# simulate data.
d <- data.frame(
id=c(2,3,4,5,6,6,8,11,11,12,12),
author=c("FN", "VM","VA","FK","VM","SM","FK","FK","VB","FK","VB")
)
d
id author
1 2 FN
2 3 VM
3 4 VA
4 5 FK
5 6 VM
6 6 SM
7 8 FK
8 11 FK
9 11 VB
10 12 FK
11 12 VB
# create incidence matrix:
m <- xtabs(~author+id,d)
m
id
author 2 3 4 5 6 8 11 12
FK 0 0 0 1 0 1 1 1
FN 1 0 0 0 0 0 0 0
SM 0 0 0 0 1 0 0 0
VA 0 0 1 0 0 0 0 0
VB 0 0 0 0 0 0 1 1
VM 0 1 0 0 1 0 0 0
# convert to adjacency matrix.
# tcrossprod does "m %*% t(m)"
tcrossprod(m)
author
author FK FN SM VA VB VM
FK 4 0 0 0 2 0
FN 0 1 0 0 0 0
SM 0 0 1 0 0 1
VA 0 0 0 1 0 0
VB 2 0 0 0 2 0
VM 0 0 1 0 0 2
Note that crossprod() will give you the corresponding paper-by-paper matrix for the id variable (i.e. it does t(m) %*% m).
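If you want to analyze the result as a network, the weighted matrix converts directly to an igraph graph; a sketch (the diagonal counts each author's own papers, so it is zeroed first):
library(igraph)
a <- tcrossprod(m)
diag(a) <- 0  # drop self-collaboration counts
g <- graph_from_adjacency_matrix(a, mode = "undirected", weighted = TRUE)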
I have a data.frame with 8 columns: one for the list of subjects and the other 7 columns each hold a score of either 1 or 0.
This is what the data looks like:
>head(splitkscores)
subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1
I want to create a data.frame with 3 columns: one for the subjects, one with the sum of 3 or 4 randomly chosen values from each row of my data.frame (excluding the subject), and one with the sum of the remaining values that were not chosen in the first random sample.
Help is much appreciated.
Thanks in advance
Here's a neat and tidy solution free of unnecessary complexity (assume the input is called df):
chosen <- sort(sample(setdiff(colnames(df), "subject"), sample(c(3, 4), 1)))
notchosen <- setdiff(colnames(df), c("subject", chosen))
out <- data.frame(subject = df$subject,
                  sum1 = apply(df[, chosen], 1, sum),
                  sum2 = apply(df[, notchosen], 1, sum))
In plain English: sample from the column names other than "subject", choosing a sample size of either 3 or 4, and call those column names chosen; define notchosen to be the other columns (excluding "subject" again, obviously); then return a data frame with the list of subjects, the sum of the chosen columns, and the sum of the non-chosen columns. Done.
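As a side note, rowSums() is the more idiomatic (and faster) way to take those row sums, e.g.:
out <- data.frame(subject = df$subject,
                  sum1 = rowSums(df[, chosen]),
                  sum2 = rowSums(df[, notchosen]))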
I think this'll do it (I changed the way the data were read in, based on the other response, because I had made a manual mistake):
splitkscores <- read.table(text = " subject block3 block4 block5 block6 block7 block8 block9
1 40002 0 0 1 0 0 0 0
2 40002 0 0 1 0 0 1 1
3 40002 1 1 1 1 1 1 1
4 40002 1 1 0 0 0 1 0
5 40002 0 1 0 0 0 1 1
6 40002 0 1 1 0 1 1 1", header = TRUE)
df2 <- data.frame(subject = splitkscores$subject, sum3or4 = NA, leftover = NA)
df2$sum3or4 <- apply(splitkscores[, 2:ncol(splitkscores)], 1, function(x){
  sum(sample(x, sample(c(3, 4), 1), replace = FALSE))  # draw 3 or 4 scores from this row and sum them
})
df2$leftover <- rowSums(splitkscores[,2:ncol(splitkscores)]) - df2$sum3or4
df2
subject sum3or4 leftover
1 40002 1 0
2 40002 2 1
3 40002 3 4
4 40002 1 2
5 40002 2 1
6 40002 1 4
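Both answers draw the columns (or values) at random, so the sums will differ between runs; calling set.seed() first makes a run reproducible:
set.seed(123)  # any fixed seed; the sample() calls that follow now give repeatable draws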