problem in changing matrix to a data frame with same dimensions - r

I have tried to create a data frame from a matrix; however, the result has different dimensions from the original matrix. Please see my code below:
out <- table(UL_Final$Issue_Year, UL_Final$Insured_Age_Group)
out <- out/rowSums(out) #changing all numbers to ratio
The result is a matrix 12 by 7:
1 2 3 4 5 6 7
1387 0.165137615 0.036697248 0.229357798 0.321100917 0.201834862 0.018348624 0.027522936
1388 0.149222065 0.110325318 0.197312588 0.342291372 0.136492221 0.055162659 0.009193777
1389 0.144979508 0.101946721 0.222848361 0.335553279 0.138575820 0.046362705 0.009733607
1390 0.146991622 0.120030465 0.191622239 0.336024372 0.142269612 0.052551409 0.010510282
1391 0.165462754 0.111794582 0.185835214 0.321049661 0.135553047 0.064503386 0.015801354
1392 0.162399144 0.109583402 0.165321917 0.317388441 0.146344476 0.076115594 0.022847028
1393 0.181602139 0.116447173 0.151104070 0.325131201 0.148628577 0.062778493 0.014308347
1394 0.163760504 0.098529412 0.142489496 0.323792017 0.178728992 0.076050420 0.016649160
1395 0.137097032 0.094699511 0.128981757 0.321320170 0.197610147 0.098245950 0.022045433
1396 0.167187958 0.103851041 0.112696706 0.293202033 0.200689082 0.099306031 0.023067149
1397 0.193250090 0.130540713 0.108114843 0.270743930 0.186411584 0.091364656 0.019574185
1398 0.208026156 0.147573562 0.100455157 0.249503173 0.191935380 0.083338676 0.019167895
Then I convert it using the code below:
out <- data.frame(out)
However, the result is a data frame with dimensions 84 by 3:
Var1 Var2 Freq
1 1387 1 0.165137615
2 1388 1 0.149222065
3 1389 1 0.144979508
4 1390 1 0.146991622
5 .... .......
I am not sure why this happens. In another case, explained below, I do not see this strange behavior. There I used the code below to calculate a ratio for another variable:
out <- table( df_select$Insured_Age_Group,df_select$Policy_Status)
out <- cbind(out, Ratio = out[,2]/rowSums(out))
The result is:
Issuance Surrended Ratio
1 31046 5735 0.1559229
2 20039 4409 0.1803420
3 20399 9228 0.3114726
4 48677 17216 0.2612721
5 30045 8132 0.2130078
6 13947 4106 0.2274414
7 3157 1047 0.2490485
Now if we use the code below (by @Ronak Shah):
out <- data.frame(out) %>% mutate(x = row_number())
The result is:
Issuance Surrended Ratio x
1 31046 5735 0.1559229 1
2 20039 4409 0.1803420 2
3 20399 9228 0.3114726 3
4 48677 17216 0.2612721 4
5 30045 8132 0.2130078 5
6 13947 4106 0.2274414 6
7 3157 1047 0.2490485 7
As you can see, the result is now a data frame with the same dimensions. Can anyone explain why this happens?

See ?table for an explanation:
The as.data.frame method for objects inheriting from class "table" can be used to convert the array-based representation of a contingency table to a data frame containing the classifying factors and the corresponding entries (the latter as component named by responseName). This is the inverse of xtabs.
A workaround is to use as.data.frame.matrix:
m <- table(mtcars$carb, mtcars$gear)
as.data.frame(m)
# Var1 Var2 Freq
# 1 1 3 3
# 2 2 3 4
# 3 3 3 3
# 4 4 3 5
# 5 6 3 0
# 6 8 3 0
# 7 1 4 4
# 8 2 4 4
# 9 3 4 0
# 10 4 4 4
# 11 6 4 0
# 12 8 4 0
# 13 1 5 0
# 14 2 5 2
# 15 3 5 0
# 16 4 5 1
# 17 6 5 1
# 18 8 5 1
as.data.frame.matrix(m)
# 3 4 5
# 1 3 4 0
# 2 4 4 2
# 3 3 0 0
# 4 5 4 1
# 6 0 0 1
# 8 0 0 1
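This also explains the second case in the question: cbind() returns a plain matrix and drops the "table" class, so data.frame() keeps the shape, whereas out/rowSums(out) is still a table and gets the long format. A minimal sketch using the same mtcars table as above (m2 and ratios are illustrative names, not from the original post):

```r
# A two-way contingency table, as in the answer above
m <- table(mtcars$carb, mtcars$gear)

# Arithmetic keeps the "table" class, so the table method is still used
ratios <- m / rowSums(m)
inherits(ratios, "table")

# cbind() builds an ordinary matrix, so the "table" class is gone
m2 <- cbind(m, Total = rowSums(m))
inherits(m2, "table")

# Compare the resulting shapes
dim(as.data.frame(ratios))         # long format: one row per cell
dim(as.data.frame(m2))             # same shape as the matrix
dim(as.data.frame.matrix(ratios))  # same shape as the table
```

If row proportions are the goal, prop.table(m, 1) computes the same ratios as m/rowSums(m) and also returns a table, so the same conversion applies.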

Related

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am an R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid IDs close to the store's grid ID.
I.e., basically SUMIF the grid population variable, with the condition that
grid(x) <= store(x) + 1 &
grid(x) >= store(x) - 1 &
grid(y) <= store(y) + 1 &
grid(y) >= store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e., for Store1 it is the sum of TOT_P for grid cells with X in [2, 4] and Y in [5, 7].
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
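#import library
library(dplyr)
#create example store table
df.store <- data.frame(StoreName = paste0("Store", 1:2),
                       StoreX = c(3, 5),
                       StoreY = c(6, 2))
#create example population data (copied from the example table in the question,
#listed row by row: GridX runs 1..7 within each GridY)
df.pop <- data.frame(
  TOT_P = c(8, 7, 3, 3, 22, 20, 9,
            28, 8, 3, 12, 12, 15, 7,
            3, 3, 3, 4, 13, 18, 3,
            61, 25, 5, 20, 23, 72, 14,
            178, 407, 26, 167, 58, 113, 73,
            76, 3, 3, 3, 4, 13, 18,
            3, 61, 25, 26, 167, 58, 113),
  GridX = rep(1:7, times = 7),
  GridY = rep(1:7, each = 7))
#add dummy column to each table to enable cross join
df.store$k <- 1
df.pop$k <- 1
#dplyr to join, calculate absolute distance, filter and sum
#(recent dplyr versions may warn about a many-to-many join; that is the intended cross join)
df.store %>%
  inner_join(df.pop, by = 'k') %>%
  mutate(x.diff = abs(StoreX - GridX), y.diff = abs(StoreY - GridY)) %>%
  filter(x.diff <= 1, y.diff <= 1) %>%
  group_by(StoreName) %>%
  summarise(StoreX = max(StoreX), StoreY = max(StoreY), tot.pop = sum(TOT_P))
#output (Store1 sums to 721 under the +/-1 rule, not the 479 quoted in the question):
#  StoreName StoreX StoreY tot.pop
#1 Store1         3      6     721
#2 Store2         5      2     119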
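For comparison, the same neighbourhood sum can be sketched in base R without building the full store-by-grid cross join. This is only an illustration under the same "within 1 in both X and Y" rule; the column name SUM_P follows the question, and the anonymous helper function is an assumption of this sketch, not part of the answer above.

```r
# Example store table from the question
df.store <- data.frame(StoreName = c("Store1", "Store2"),
                       StoreX = c(3, 5),
                       StoreY = c(6, 2))

# Example grid table from the question, listed row by row (GridX runs 1..7 within each GridY)
df.pop <- data.frame(
  TOT_P = c(8, 7, 3, 3, 22, 20, 9,
            28, 8, 3, 12, 12, 15, 7,
            3, 3, 3, 4, 13, 18, 3,
            61, 25, 5, 20, 23, 72, 14,
            178, 407, 26, 167, 58, 113, 73,
            76, 3, 3, 3, 4, 13, 18,
            3, 61, 25, 26, 167, 58, 113),
  GridX = rep(1:7, times = 7),
  GridY = rep(1:7, each = 7))

# For each store, flag the grid cells within 1 step in X and Y and sum their population
df.store$SUM_P <- mapply(function(sx, sy) {
  near <- abs(df.pop$GridX - sx) <= 1 & abs(df.pop$GridY - sy) <= 1
  sum(df.pop$TOT_P[near])
}, df.store$StoreX, df.store$StoreY)

df.store
```

This keeps memory use proportional to the grid size rather than to stores times grid cells, which may matter when both tables are large.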

How to select columns that contain specific values of a chosen row in R?

I have a dataset that looks like this
Site <- c(1,2,3,4,5,6,7,8,9,10,"kingdom","phylum","class")
A <- c(0,0,1,2,4,5,6,7,13,56,"Eukaryota","Arthropoda","Insecta")
B <- c(1,0,0,0,0,4,5,7,7,8,"Eukaryota","Arthropoda","Insecta")
C <- c(2,3,0,0,4,5,67,8,43,21,"Eukaryota","Arthropoda","")
D <- c(134,0,0,2,0,0,9,0,45,55,"Eukaryota","Arthropoda","Arachnida")
site.species.sample <- data.frame(Site,A,B,C,D)
I want to select only the columns from this dataset where the row "class" is "Insecta" (i.e. in this example only columns A and B satisfy this condition). I tried this code:
site.species.sample <- site.species.sample[,site.species.sample["class",]=="Insecta"]
But got an error:
Error in `[.data.frame`(site.species.sample, , site.species.sample["class", :
undefined columns selected
So how do I do it? Thanks
Below is an option
site.species.sample[,c(TRUE,subset(site.species.sample[,-1],site.species.sample$Site=="class")=="Insecta")]
Site A B
1 1 0 1
2 2 0 0
3 3 1 0
4 4 2 0
5 5 4 0
6 6 5 4
7 7 6 5
8 8 7 7
9 9 13 7
10 10 56 8
11 kingdom Eukaryota Eukaryota
12 phylum Arthropoda Arthropoda
13 class Insecta Insecta
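As far as I can tell, the original attempt fails because site.species.sample["class", ] looks up a row *name* "class"; the row names here are "1" to "13", so the lookup returns NAs and the column selection breaks. A slightly more explicit sketch of the same idea as the one-liner above (class.row, keep and insecta are illustrative names):

```r
# Data from the question
Site <- c(1,2,3,4,5,6,7,8,9,10,"kingdom","phylum","class")
A <- c(0,0,1,2,4,5,6,7,13,56,"Eukaryota","Arthropoda","Insecta")
B <- c(1,0,0,0,0,4,5,7,7,8,"Eukaryota","Arthropoda","Insecta")
C <- c(2,3,0,0,4,5,67,8,43,21,"Eukaryota","Arthropoda","")
D <- c(134,0,0,2,0,0,9,0,45,55,"Eukaryota","Arthropoda","Arachnida")
site.species.sample <- data.frame(Site, A, B, C, D)

# Pick the row whose Site value is "class" (a value lookup, not a row-name lookup)
class.row <- site.species.sample[site.species.sample$Site == "class", ]

# Turn that row into a character vector, one entry per column
class.values <- vapply(class.row, as.character, character(1))

# Keep the Site column plus every column classified as Insecta
keep <- names(site.species.sample) == "Site" | class.values == "Insecta"
insecta <- site.species.sample[, keep]
names(insecta)
```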

R: grouped data table with proportions

I have copied my code below. I start with a list of 50 small integers, representing the number of televisions owned by 50 families. My objective is shown in the object 'tv.final' below. My effort seems very wordy and inefficient.
Question: is there a better way to start with a list of 50 integers and end with a grouped data table with proportions? (Just taking my first baby steps with R, sorry for such a stupid question, but inquiring minds want to know.)
tv.data <- read.table("Tb02-08.txt",header=TRUE)
str(tv.data)
# 'data.frame': 50 obs. of 1 variable:
# $ TVs: int 1 1 1 2 6 3 3 4 2 4 ...
tv.table <- table(tv.data)
tv.table
# tv.data
# 0 1 2 3 4 5 6
# 1 16 14 12 3 2 2
tv.prop <- prop.table(tv.table)*100
tv.prop
# tv.data
# 0 1 2 3 4 5 6
# 2 32 28 24 6 4 4
tvs <- rbind(tv.table,tv.prop)
tvs
# 0 1 2 3 4 5 6
# tv.table 1 16 14 12 3 2 2
# tv.prop 2 32 28 24 6 4 4
tv.final <- t(tvs)
tv.final
# tv.table tv.prop
# 0 1 2
# 1 16 32
# 2 14 28
# 3 12 24
# 4 3 6
# 5 2 4
# 6 2 4
You can treat the object returned by table() as any other vector/matrix:
tv.table <- table(tv.data)
round(100 * tv.table/sum(tv.table))
That will give you the proportions in rounded percentage points.
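To end up with the two-column layout of tv.final in one step, the counts and percentages can be placed side by side in a data frame. A minimal sketch: since the file Tb02-08.txt is not available here, the TVs vector below is reconstructed from the counts shown in the question, and the column names count and percent are illustrative.

```r
# Hypothetical stand-in for the file: 50 values with the frequencies shown in the question
tv.data <- data.frame(TVs = rep(0:6, times = c(1, 16, 14, 12, 3, 2, 2)))

# Counts per number of TVs
tv.table <- table(tv.data$TVs)

# Counts and rounded percentages side by side
tv.final <- data.frame(
  TVs     = names(tv.table),
  count   = as.vector(tv.table),
  percent = as.vector(round(100 * prop.table(tv.table))))
tv.final
```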

Appending a tally of subjects to a dataframe in R

I have a list of subjects:
myDat = list(Subject = c(10234, 10234, 10234, 10234, 10242, 10242, 10242, 10242, 10253, 10253, 10253, 10268, 10268, 10268, 10268))
and I would like to add a count (DayNo) to the data frame that restarts at 1 with every change in subject.
Thanks in advance
An ave variant:
df <- as.data.frame(myDat)
df$Day <- ave(df$Subject, df$Subject, FUN=seq_along)
Produces:
Subject Day
1 10234 1
2 10234 2
3 10234 3
4 10234 4
5 10242 1
6 10242 2
7 10242 3
8 10242 4
9 10253 1
10 10253 2
11 10253 3
12 10268 1
13 10268 2
14 10268 3
15 10268 4
Use rle to get the run lengths and use sequence to create sequences of corresponding length.
myDat <- as.data.frame(myDat)
myDat$DayNo <- sequence(rle(myDat$Subject)$lengths)
# Subject DayNo
# 1 10234 1
# 2 10234 2
# 3 10234 3
# 4 10234 4
# 5 10242 1
# 6 10242 2
# 7 10242 3
# 8 10242 4
# 9 10253 1
# 10 10253 2
# 11 10253 3
# 12 10268 1
# 13 10268 2
# 14 10268 3
# 15 10268 4
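The two answers agree on the example because each subject appears in one contiguous block. They can differ when a subject reappears later: ave() counts per subject, while rle() counts per run. A small sketch with made-up subject IDs (not from the question) to show the difference:

```r
# Hypothetical data: subject 1 appears again after subject 2
subject <- c(1, 1, 2, 2, 1)

# Counter per subject: the last row continues subject 1's count
by.subject <- ave(subject, subject, FUN = seq_along)

# Counter per run: the last row starts a new run and restarts at 1
by.run <- sequence(rle(subject)$lengths)

data.frame(subject, by.subject, by.run)
```

Which behaviour is wanted depends on whether "restarts with every change in subject" should treat a returning subject as new.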

Grouping data in R with including those attributes that are not on grouping condition

I want to group data while keeping the attributes that are not part of the grouping condition.
Example data
pixel740 label num
1 0 0 4132
2 0 1 4684
3 0 2 4177
4 1 7 4
5 1 9 1
6 2 7 11
7 2 9 6
8 3 7 10
9 3 9 4
The result I want:
pixel740 label num
0 1 4684 // 4684 is the max num for pixel740 = 0, so I include this row
1 7 4
2 7 11
3 7 10
I.e., I want to include the rows that have the max num for each value of the pixel740 attribute.
I have tried ddply and split, but they only return the attribute used for grouping (pixel740) and not the whole row.
How can I do this? Is there a function for it, or do I have to use loops, which I want to avoid?
Here's how to get the max num value for each value of pixel740 with aggregate (calling your original data x):
aggregate(num ~ pixel740, data=x, FUN=max)
## pixel740 num
## 1 0 4684
## 2 1 4
## 3 2 11
## 4 3 10
To get the rows, you can merge with the original set:
ag <- aggregate(num ~ pixel740, data=x, FUN=max)
res <- merge(ag, x)
res
## pixel740 num label
## 1 0 4684 1
## 2 1 4 7
## 3 2 11 7
## 4 3 10 7
As requested in the comment, here's how to sort the data by the value of pixel740:
res[order(res$pixel740),]
For this short example, there is no difference in the output.
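I've been trying to work out a solution using data.table; I believe this gives the desired result, though I imagine it can be improved.
library(data.table)
# read the example data copied from the question (the "clipboard" connection is platform-dependent)
DT <- data.table(read.table("clipboard", header=TRUE))
# maximum num for each pixel740
DT2 <- DT[, list(max_num = max(num)), by="pixel740"]
# key both tables so the join matches num to max_num and pixel740 to pixel740
setkey(DT,num,pixel740)
setkey(DT2,max_num,pixel740)
# keyed join; the columns are listed explicitly because recent data.table versions
# return only the columns named in j
RES <- DT[DT2, list(num, pixel740, label)]
setkey(RES,pixel740)
RES
num pixel740 label
1: 4684 0 1
2: 4 1 7
3: 11 2 7
4: 10 3 7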
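A base R variant that avoids the merge step is to compare each num with its group maximum and keep the matching rows. This is only a sketch; is.max is an illustrative name, and ties for the maximum would all be kept.

```r
# Example data from the question
x <- data.frame(
  pixel740 = c(0, 0, 0, 1, 1, 2, 2, 3, 3),
  label    = c(0, 1, 2, 7, 9, 7, 9, 7, 9),
  num      = c(4132, 4684, 4177, 4, 1, 11, 6, 10, 4))

# ave() returns the group maximum repeated for every row of that group
is.max <- x$num == ave(x$num, x$pixel740, FUN = max)

# Keep whole rows, in the original column order
res <- x[is.max, ]
res
```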
