I have copied my code below. I start with a list of 50 small integers, representing the number of televisions owned by 50 families. My objective is shown in the object 'tv.final' below. My effort seems very wordy and inefficient.
Question: is there a better way to start with a list of 50 integers and end with a grouped data table with proportions? (Just taking my first baby steps with R, sorry for such a stupid question, but inquiring minds want to know.)
tv.data <- read.table("Tb02-08.txt",header=TRUE)
str(tv.data)
# 'data.frame': 50 obs. of 1 variable:
# $ TVs: int 1 1 1 2 6 3 3 4 2 4 ...
tv.table <- table(tv.data)
tv.table
# tv.data
#  0  1  2  3  4  5  6
#  1 16 14 12  3  2  2
tv.prop <- prop.table(tv.table)*100
tv.prop
# tv.data
#  0  1  2  3  4  5  6
#  2 32 28 24  6  4  4
tvs <- rbind(tv.table,tv.prop)
tvs
#          0  1  2  3 4 5 6
# tv.table 1 16 14 12 3 2 2
# tv.prop  2 32 28 24 6 4 4
tv.final <- t(tvs)
tv.final
#   tv.table tv.prop
# 0        1       2
# 1       16      32
# 2       14      28
# 3       12      24
# 4        3       6
# 5        2       4
# 6        2       4
You can treat the object returned by table() as any other vector/matrix:
tv.table <- table(tv.data)
round(100 * tv.table/sum(tv.table))
That will give you the proportions in rounded percentage points.
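The counts and percentages can also be collected into one data frame in a single step. The following is a minimal sketch rather than the only way to do it: the vector `tvs` is rebuilt from the frequency table shown in the question (it stands in for `tv.data$TVs`), and the column names `count` and `percent` are illustrative.

```r
# Stand-in for tv.data$TVs, rebuilt from the frequency table in the question.
tvs <- rep(0:6, times = c(1, 16, 14, 12, 3, 2, 2))

tv.table <- table(tvs)

# One data frame holding counts and percentages, one row per distinct value.
tv.final <- data.frame(
  count     = as.vector(tv.table),
  percent   = as.vector(round(100 * prop.table(tv.table), 1)),
  row.names = names(tv.table)
)
tv.final
```

Using `as.vector()` strips the "table" class, so `data.frame()` keeps one row per value instead of expanding the table into its long format.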
Related
I have tried to create a data frame from a matrix; however, the result has different dimensions from the original matrix. Please see my code below:
out <- table(UL_Final$Issue_Year, UL_Final$Insured_Age_Group)
out <- out/rowSums(out) #changing all numbers to ratio
The result is a matrix 12 by 7:
1 2 3 4 5 6 7
1387 0.165137615 0.036697248 0.229357798 0.321100917 0.201834862 0.018348624 0.027522936
1388 0.149222065 0.110325318 0.197312588 0.342291372 0.136492221 0.055162659 0.009193777
1389 0.144979508 0.101946721 0.222848361 0.335553279 0.138575820 0.046362705 0.009733607
1390 0.146991622 0.120030465 0.191622239 0.336024372 0.142269612 0.052551409 0.010510282
1391 0.165462754 0.111794582 0.185835214 0.321049661 0.135553047 0.064503386 0.015801354
1392 0.162399144 0.109583402 0.165321917 0.317388441 0.146344476 0.076115594 0.022847028
1393 0.181602139 0.116447173 0.151104070 0.325131201 0.148628577 0.062778493 0.014308347
1394 0.163760504 0.098529412 0.142489496 0.323792017 0.178728992 0.076050420 0.016649160
1395 0.137097032 0.094699511 0.128981757 0.321320170 0.197610147 0.098245950 0.022045433
1396 0.167187958 0.103851041 0.112696706 0.293202033 0.200689082 0.099306031 0.023067149
1397 0.193250090 0.130540713 0.108114843 0.270743930 0.186411584 0.091364656 0.019574185
1398 0.208026156 0.147573562 0.100455157 0.249503173 0.191935380 0.083338676 0.019167895
then using the code below:
out <- data.frame(out)
However, the result is a data frame with dimensions 84 by 3:
Var1 Var2 Freq
1 1387 1 0.165137615
2 1388 1 0.149222065
3 1389 1 0.144979508
4 1390 1 0.146991622
5 .... .......
I am not sure why this happens. However in another case, as I explained below, I am not seeing such strange behavior. In another case, I used the code below to calculate another ratio for another variable:
out <- table( df_select$Insured_Age_Group,df_select$Policy_Status)
out <- cbind(out, Ratio = out[,2]/rowSums(out))
the result is :
Issuance Surrended Ratio
1 31046 5735 0.1559229
2 20039 4409 0.1803420
3 20399 9228 0.3114726
4 48677 17216 0.2612721
5 30045 8132 0.2130078
6 13947 4106 0.2274414
7 3157 1047 0.2490485
Now if we use the code below (by @Ronak Shah):
out <- data.frame(out) %>% mutate(x = row_number())
the result is :
Issuance Surrended Ratio x
1 31046 5735 0.1559229 1
2 20039 4409 0.1803420 2
3 20399 9228 0.3114726 3
4 48677 17216 0.2612721 4
5 30045 8132 0.2130078 5
6 13947 4106 0.2274414 6
7 3157 1047 0.2490485 7
As you can see the result is now a data frame with same dimension. Can anyone explain why this happens?
See ?table for an explanation:
The as.data.frame method for objects inheriting from class "table" can be used to convert the array-based representation of a contingency table to a data frame containing the classifying factors and the corresponding entries (the latter as component named by responseName). This is the inverse of xtabs.
A workaround is to use as.data.frame.matrix:
m <- table(mtcars$carb, mtcars$gear)
as.data.frame(m)
# Var1 Var2 Freq
# 1 1 3 3
# 2 2 3 4
# 3 3 3 3
# 4 4 3 5
# 5 6 3 0
# 6 8 3 0
# 7 1 4 4
# 8 2 4 4
# 9 3 4 0
# 10 4 4 4
# 11 6 4 0
# 12 8 4 0
# 13 1 5 0
# 14 2 5 2
# 15 3 5 0
# 16 4 5 1
# 17 6 5 1
# 18 8 5 1
as.data.frame.matrix(m)
# 3 4 5
# 1 3 4 0
# 2 4 4 2
# 3 3 0 0
# 4 5 4 1
# 6 0 0 1
# 8 0 0 1
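The difference between the two cases in the question comes down to class. Dividing a table by rowSums() leaves it a "table", so data.frame() dispatches to the long-format method, whereas cbind() returns a plain matrix, which is converted column by column. A small sketch with mtcars (the names `prop`, `wide` and `plain` are illustrative):

```r
m <- table(mtcars$carb, mtcars$gear)

# Arithmetic keeps the "table" class, so as.data.frame() would still go long.
prop <- m / rowSums(m)
inherits(prop, "table")

# Converting through the matrix method keeps the 6 x 3 layout.
wide <- as.data.frame.matrix(prop)
dim(wide)

# cbind() drops the "table" class, which is why the second case in the
# question kept its shape when passed to data.frame().
plain <- cbind(m, Ratio = m[, 2] / rowSums(m))
inherits(plain, "table")
dim(data.frame(plain))
```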
I am an R noob, and I hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The locations are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid ids close to the store's grid id.
I.e., basically SUMIF the grid population variable, with the condition that
grid(x) <= store(x) + 1 &
grid(x) >= store(x) - 1 &
grid(y) <= store(y) + 1 &
grid(y) >= store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 721
Store2 5 2 119
I.e., for Store1 it is the sum of TOT_P over the grid cells with X in [2, 4] and Y in [5, 7].
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
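#create example population data (values copied from the sample table in the OP,
#listed row by row: GridY = 1..7, with GridX = 1..7 inside each row)
df.pop<-data.frame(
TOT_P=c(8,7,3,3,22,20,9,
28,8,3,12,12,15,7,
3,3,3,4,13,18,3,
61,25,5,20,23,72,14,
178,407,26,167,58,113,73,
76,3,3,3,4,13,18,
3,61,25,26,167,58,113),
GridX=rep(1:7,times=7),
GridY=rep(1:7,each=7))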
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
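For comparison, the same neighbourhood sum can be written in base R without a cross join. This is a sketch under the assumption that "close" means within one grid cell in both directions (the 3 x 3 block around the store). The names `store` and `grid` are illustrative, and the population values are copied from the sample table in the question.

```r
store <- data.frame(StoreName = c("Store1", "Store2"),
                    StoreX = c(3, 5),
                    StoreY = c(6, 2))

# Sample grid from the question, listed row by row (GridY = 1..7).
grid <- data.frame(
  GridX = rep(1:7, times = 7),
  GridY = rep(1:7, each = 7),
  TOT_P = c(  8,   7,  3,   3,  22,  20,   9,
             28,   8,  3,  12,  12,  15,   7,
              3,   3,  3,   4,  13,  18,   3,
             61,  25,  5,  20,  23,  72,  14,
            178, 407, 26, 167,  58, 113,  73,
             76,   3,  3,   3,   4,  13,  18,
              3,  61, 25,  26, 167,  58, 113)
)

# For each store, sum TOT_P over grid cells at most one step away in x and y.
store$SUM_P <- mapply(
  function(sx, sy) {
    near <- abs(grid$GridX - sx) <= 1 & abs(grid$GridY - sy) <= 1
    sum(grid$TOT_P[near])
  },
  store$StoreX, store$StoreY
)
store
#   StoreName StoreX StoreY SUM_P
# 1    Store1      3      6   721
# 2    Store2      5      2   119
```

This loops over stores only, so it avoids building the full store-by-grid table, which may matter when the grid is large.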
My first language isn't English, so I apologize in advance for any mistakes. I'm a newbie in R, but you will notice that anyway.
I'm trying to build a co-occurrence matrix. I have several data frames and I am interested in 3 variables: idT, numname and numstim.
This is the single data frame that contains the merged data:
z=rbind(df1,df2,df3,df4,df5,df6,df7,df8,df9,df10,df11,df12,df13,df14,
df15,df16,df17,df18,df19,df20,df21,df22,df23,df24,df25,df26,df27,df28,df29,df30,df31,df32)
write.csv(z, file = ".../listz.csv")
Then I extracted the 3 variables with :
#Extract columns 3 & 6 from all the files within the list
z1 = z[,c(3,6)]
#Create a new variable 'numname' to convert name groups into numeric groups,
#then obtain levels with facNum
z1$numname <- as.numeric(z1$namegroup)
colnames(z1) <- c("namegroup", "idT", "numname")
facNum <- factor(z1$numname)
write.csv(z1, file = "...D:/z1.csv")
And data look like :
namegroup idT numname
1 GLISSEVIBREVITE 1 6
2 CINETIQUE 1 3
3 VIBRATIONS_LEGERES 1 20
4 DIFFUS 1 5
5 LIQUIDE 1 8
6 PICOTEMENTS 1 10
How to read the table: each idT is classified into a group (namegroup), and this group is then converted to a numeric variable (numname).
# Specify z1 as a data frame to make next operations
z1 = as.data.frame(z1, idT = z1$numstim, numgroup = z1$numname)
tab1 <- table(z1)
write.csv(tab1, file = ".../tab1test.csv")
out1 <- data.matrix(tab1 %*% t(tab1))
write.csv(out1, file = ".../bmtest.csv")
But the bmtest matrix doesn't look like a count of idT pairs: only 22 users participated and there are 32 idT, yet some of the numbers are much higher:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 24 10 7 7 11 7 7 8 10 8 11 8 6 11 11 12
2 10 32 27 7 5 4 7 4 4 4 5 3 2 6 6 14
3 7 27 40 0 3 1 0 2 0 0 2 2 1 2 0 15
4 7 7 0 30 7 14 15 9 15 13 13 7 5 12 13 5
5 11 5 3 7 24 7 9 20 12 13 10 19 14 20 12 7
I want a matrix that shows how many times each pair of idT was grouped together. The matrix should look like:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 15 3 2 2 3 3 2 1 2 1 3 3 1 3 3 5
2 3 15 9 2 0 1 2 0 0 0 0 0 0 0 1 3
3 2 9 15 0 2 1 0 2 0 0 1 1 1 2 0 2
4 2 2 0 15 1 6 5 1 7 5 6 2 0 1 3 2
5 3 0 2 1 15 1 2 12 4 5 3 13 9 11 3 2
In other words, I want to see which idT have been paired together. I've looked at this topic but didn't find a way to solve my problem.
Also, I tried :
library(igraph)
library(tnet)
idT_numname <- cbind(z1$idT, z1$numname)
igraph <- graph.data.frame(idT_numname)
item_item <- projecting_tm(net = idT_numname, method="sum")
item_item <- tnet_igraph(item_item,type="weighted one-mode tnet")
itemmat <- get.adjacency(item_item,attr="weight")
itemmat #8x8 matrix of items to items
But I get an error message, and I don't know how to get past the "duplicated entries in the edgelist" error, because it seems to me that duplicated entries are necessary to build a co-occurrence matrix:
> idT_numname <- cbind(z1$idT, z1$numname)
> item_item <- projecting_tm(idT_numname, method="sum")
Error in as.tnet(net, type = "binary two-mode tnet") :
There are duplicated entries in the edgelist
> item_item <- as.tnet(net = idT_numname, type ="binary two-mode tnet", method="sum")
Error in as.tnet(net = idT_numname, type = "binary two-mode tnet", method = "sum") :
unused argument (method = "sum")
> item_item <- as.tnet(net = idT_numname, type ="binary two-mode tnet")
Error in as.tnet(net = idT_numname, type = "binary two-mode tnet") :
There are duplicated entries in the edgelist
Your help is greatly appreciated.
I like doing data analysis and I want to learn more every day!
Thank you
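One common way to get such pair counts is to build a binary incidence matrix and multiply it by its own transpose. The sketch below is only an illustration on toy data, not the asker's data: the excerpt above shows no participant column, so the `user` column is an assumption, as are the names `toy`, `key`, `inc` and `cooc`. The idea is to give every (participant, group) combination its own column, so that groups made by different participants never mix, and to record presence/absence rather than raw row counts.

```r
# Hypothetical toy data: each row says "this participant put item idT in this group".
toy <- data.frame(
  user  = c(1, 1, 1, 2, 2, 2),
  idT   = c(1, 2, 3, 1, 2, 3),
  group = c("a", "a", "b", "x", "y", "y")
)

# One column per (user, group) pair.
key <- interaction(toy$user, toy$group, drop = TRUE)

# Rows = idT, columns = (user, group); TRUE where the item is in that group.
inc <- unclass(table(toy$idT, key)) > 0

# inc %*% t(inc): entry [i, j] counts the groups that contain both i and j.
cooc <- tcrossprod(inc * 1)
cooc
```

On this toy input, items 1 and 2 share one group (participant 1), items 2 and 3 share one group (participant 2), and the diagonal holds the number of groups each item appears in.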
I have used the code below to "bin" a year.month string into three month bins. The problem is that I want each of the bins to have a number that corresponds where the bin occurs chronologically (i.e. first bin =1, second bin=2, etc.). Right now, the first month bin is assigned to the number 4, and I am not sure why. Any help would be highly appreciated!
> head(Master.feed.parts.gn$yr.mo, n=20)
[1] "2007.10" "2007.10" "2007.10" "2007.11" "2007.11" "2007.11" "2007.11" "2007.12" "2008.01"
[10] "2008.01" "2008.01" "2008.01" "2008.01" "2008.02" "2008.03" "2008.03" "2008.03" "2008.04"
[19] "2008.04" "2008.04"
>
> yearmonth_to_integer <- function(xx) {
+ yy_mm <- as.integer(unlist(strsplit(xx, '.', fixed=T)))
+ return( (yy_mm[1] - 2006) + (yy_mm[2] %/% 3) )
+ }
>
> Cluster.GN <- sapply(Master.feed.parts.gn$yr.mo, yearmonth_to_integer)
> Cluster.GN
2007.10 2007.10 2007.10 2007.11 2007.11 2007.11 2007.11 2007.12 2008.01 2008.01 2008.01
4 4 4 4 4 4 4 5 2 2 2
2008.01 2008.01 2008.02 2008.03 2008.03 2008.03 2008.04 2008.04 2008.04 2008.04 2008.05
2 2 2 3 3 3 3 3 3 3 3
2008.05 2008.05 2008.06 2008.10 2008.11 2008.11 2008.12 <NA> 2009.05 2009.05 2009.05
3 3 4 5 5 5 6 NA 4 4 4
2009.06 2009.07 2009.07 2009.07 2009.09 2009.10 2009.11 2010.01 2010.02 2010.02 2010.02
5 5 5 5 6 6 6 4 4 4 4
UPDATE:
I was asked to provide sample input (yr.mo) and the desired output (Cluster.GN). I have a year-month string that has varying numbers of observations for each month, and some months don't have any observations. What I want to do is bin each set of three consecutive months that have data, assigning each three-month "bin" a number as shown below.
yr.mo Cluster.GN
1 2007.10 1
2 2007.10 1
3 2007.10 1
4 2007.10 1
5 2007.10 1
6 2007.11 1
7 2007.11 1
8 2007.11 1
9 2007.11 1
10 2007.12 1
11 2007.12 1
12 2007.12 1
13 2007.12 1
14 2008.10 2
15 2008.10 2
16 2008.10 2
17 2008.10 2
18 2008.12 2
19 2008.12 2
20 2008.12 2
21 2008.12 2
22 2008.12 2
1) Convert the strings to zoo's "yearqtr" class and then to integers:
s <- c("2007.10", "2007.10", "2007.10", "2007.11", "2007.11", "2007.11",
"2007.11", "2007.12", "2008.01", "2008.01", "2008.01", "2008.01",
"2008.01", "2008.02", "2008.03", "2008.03", "2008.03", "2008.04",
"2008.04", "2008.04")
library(zoo)
yq <- as.yearqtr(s, "%Y.%m")
as.numeric(factor(yq))
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
The last line could alternately be: 4*(yq - yq[1])+1
Note that in the question 2007.12 is classified as in a different quarter than 2007.10 and 2007.11; however, they are all in the same quarter and we assume you did not intend this.
2) Another possibility depending on what you want is:
f <- factor(s)
nlev <- nlevels(f)
levels(f) <- gl(nlev, 3, nlev)
f
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
## Levels: 1 2 3
If there are missing months then this will give a different answer than (1), so it all depends on what you are looking for.
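As for why the first bin in the question comes out as 4: for "2007.10" the function returns (2007 - 2006) + (10 %/% 3) = 1 + 3, and December gives 1 + 4 = 5 because 12 %/% 3 is 4. Below is a base R sketch of approach (1) without zoo, assuming calendar quarters and consecutive numbering of only those quarters that contain data. The sample vector and the names `quarter_index` and `bin` are illustrative.

```r
s <- c("2007.10", "2007.11", "2007.12", "2008.01", "2008.03",
       "2008.04", NA, "2008.10", "2008.12")

year  <- as.integer(substr(s, 1, 4))
month <- as.integer(substr(s, 6, 7))

# Running quarter number: (month - 1) %/% 3 maps months 1-3 to 0, ..., 10-12 to 3.
quarter_index <- 4L * year + (month - 1L) %/% 3L

# Number the quarters that actually occur 1, 2, 3, ... in chronological order.
bin <- match(quarter_index, sort(unique(quarter_index)))
bin
## [1]  1  1  1  2  2  3 NA  4  4
```

Quarters with no observations are skipped in the numbering, and missing inputs stay NA.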
I have a dataframe df
Reads Counts
aaaa 10
bbbb 20
cccc 25
and so on.
I want to calculate the number of reads whose count is at least a given value, and plot that. For example, I want a data frame that looks like
Counts>= #reads with Counts>=
1 3
2 3
3 3
11 2
20 2
21 1
and so on. Can you suggest how I can get such a data frame and plot it?
Given the levels you want to plot at...
cutoffs <- 1:30
... you could do something like:
data.frame(cutoff=cutoffs, num.above=Reduce("+", lapply(df$Counts, ">=", cutoffs)))
# cutoff num.above
# 1 1 3
# 2 2 3
# 3 3 3
# 4 4 3
# 5 5 3
# 6 6 3
# 7 7 3
# 8 8 3
# 9 9 3
# 10 10 3
# 11 11 2
# 12 12 2
# 13 13 2
# 14 14 2
# 15 15 2
# 16 16 2
# 17 17 2
# 18 18 2
# 19 19 2
# 20 20 2
# 21 21 1
# 22 22 1
# 23 23 1
# 24 24 1
# 25 25 1
# 26 26 0
# 27 27 0
# 28 28 0
# 29 29 0
# 30 30 0
Basically for each value in the original data frame you compute a vector of whether it's greater than or equal to each cutoff (using lapply with >=). Then you add them up (using Reduce with +), getting the total number greater than or equal to each cutoff.
Another option would be to use outer/colSums:
cutoffs <- 1:30
data.frame(cutoff=cutoffs, num.above=colSums(outer(df$Counts, cutoffs, ">=")))
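Since the question also asks for a plot, here is a minimal sketch using base graphics. The data frame is rebuilt from the three rows shown in the question, `res` is an illustrative name, and the step-style line is just one reasonable choice for a count that only changes at the observed values.

```r
df <- data.frame(Reads = c("aaaa", "bbbb", "cccc"),
                 Counts = c(10, 20, 25))
cutoffs <- 1:30

# Rows of the outer() matrix are reads, columns are cutoffs;
# column sums count the reads with Counts >= each cutoff.
res <- data.frame(cutoff = cutoffs,
                  num.above = colSums(outer(df$Counts, cutoffs, ">=")))

# Step plot of the number of reads at or above each cutoff.
plot(res$cutoff, res$num.above, type = "s",
     xlab = "Counts >=", ylab = "Number of reads")
```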