Separating three dimensional array by the stratifying variable - r

I am working with the UCBAdmissions data set, and I want to separate out the data set into the 6 departmental tables that you get when you simply run
>UCBAdmissions
, , Dept = A
Gender
Admit Male Female
Admitted 512 89
Rejected 313 19
, , Dept = B
Gender
Admit Male Female
Admitted 353 17
Rejected 207 8
, , Dept = C
Gender
Admit Male Female
Admitted 120 202
Rejected 205 391
, , Dept = D
Gender
Admit Male Female
Admitted 138 131
Rejected 279 244
, , Dept = E
Gender
Admit Male Female
Admitted 53 94
Rejected 138 299
, , Dept = F
Gender
Admit Male Female
Admitted 22 24
Rejected 351 317
I am pretty sure I can convert the data set into a data frame, then grep by department and sum to build the tables, but I am wondering if there is an easier way, since the data is already in exactly the format I want. I just need to handle each department table individually.

Oh, sorry I misread your question. You are not looking for converting this into a data frame but for splitting.
You may use:
setNames(lapply(1:dim(UCBAdmissions)[3], function (i) UCBAdmissions[,,i]),
dimnames(UCBAdmissions)[[3]])
#$A
# Gender
#Admit Male Female
# Admitted 512 89
# Rejected 313 19
#
#$B
# Gender
#Admit Male Female
# Admitted 353 17
# Rejected 207 8
#
#$C
# Gender
#Admit Male Female
# Admitted 120 202
# Rejected 205 391
#
#$D
# Gender
#Admit Male Female
# Admitted 138 131
# Rejected 279 244
#
#$E
# Gender
#Admit Male Female
# Admitted 53 94
# Rejected 138 299
#
#$F
# Gender
#Admit Male Female
# Admitted 22 24
# Rejected 351 317
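Once the departments sit in a named list, each table can be processed on its own with lapply or sapply. As a minimal sketch (dept_tables and admit_rate are names made up for this illustration), the following computes the share of applicants admitted, by gender, within each department:

```r
# Split the 3-d table into a named list of 2-d tables, one per department
dept_tables <- setNames(
  lapply(seq_len(dim(UCBAdmissions)[3]), function(i) UCBAdmissions[, , i]),
  dimnames(UCBAdmissions)[[3]]
)

# Admitted / (Admitted + Rejected), per gender, for every department
admit_rate <- sapply(dept_tables, function(tab) tab["Admitted", ] / colSums(tab))

round(admit_rate, 2)
```

The result is a matrix with one row per gender and one column per department, so a single department can be picked out with admit_rate[, "A"].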

You can use assign in a for loop:
for (i in 1:6){assign(LETTERS[i], UCBAdmissions[,,i])}
A
# Gender
# Admit Male Female
# Admitted 512 89
# Rejected 313 19
and the same goes for B, C, D, E and F
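One caveat with this approach: F is base R's built-in shorthand for FALSE, so creating an object called F in the global environment masks it. A hedged alternative, assuming you still want one object per department, is to assign into a separate environment (dept_env is a hypothetical name) and take the names from the array itself instead of LETTERS:

```r
# A dedicated environment keeps the department objects out of the workspace
dept_env <- new.env()
dept_names <- dimnames(UCBAdmissions)[[3]]

for (i in seq_along(dept_names)) {
  assign(dept_names[i], UCBAdmissions[, , i], envir = dept_env)
}

ls(dept_env)   # the six department objects
dept_env$F     # table for department F
F              # the global F is still FALSE
```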

Related

Scraping .txt files for many files that are formatted similarly

I converted some .pdf's into .txt files using R and am having trouble finding a way to scrape them to ultimately construct a data frame. I am new to text scraping, so please have mercy on my ignorance.
This is the format of the .txt file and I am mainly interested in the numbers and headers. Any recommendations are much appreciated.
Township of Buena Vista
General Election Results - November 2, 2010
Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk
Township Committee Public Count
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Ward Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
D-6 77 70 109 108 188 188
I would like the output to be something in tabular form like
Mary Ann
Peter C. Richard Henry L. Total Total Total Total Total
Micheletti-
Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public
Levari
Democratic Democratic Republican Count Mail Count Count Count
Republican
District
D-1 205 195 230 223 436 113 16 565
D-2 202 160 275 261 459 459
D-3 331 346 99 87 457 457
D-4 215 205 164 152 377 377
D-5 104 95 169 166 271 271
D-6 77 70 109 108 188 188
except with the names and party affiliation as one character string. The goal is to merge this with other files like it to create a dataset.
It's always going to be ugly, but this should be somewhat automated:
# read it in as individual lines
rl <- readLines(textConnection(txt))
# drop all the extra info at top
rl <- rl[-(1:9)]
# just keep header
dist <- which(rl == "District")
hd <- head(rl, dist - 1)
# make everything same length and split characters
hd <- lapply(strsplit(hd, ""), `length<-`, max(nchar(hd)))
hd <- lapply(hd, function(x) replace(x, is.na(x), " "))
# find where spaces are in common in all rows
wdths <- rle(Reduce(`&`, lapply(hd, `==`, " ")))$lengths
# read it all in, ignoring district row
out <- read.fwf(textConnection(rl[-dist]), widths=wdths )
# keep those columns that aren't all NA
out <- out[!sapply(out, function(x) all(is.na(x)) )]
# collapse the header
hdr <- sapply(head(out, dist - 1),
function(x) trimws(gsub("\\s+", " ", paste(na.omit(x), collapse=" "))))
# finalise by joining
setNames(
data.frame(lapply(tail(out, -(dist-1)), type.convert, as.is=TRUE)),
hdr
)
Result:
# Ward Peter C. Bylone, Sr. Democratic Richard Harlan Democratic
#1 D-1 205 195
#2 D-2 202 160
#3 D-3 331 346
#4 D-4 215 205
#5 D-5 104 95
#6 D-6 77 70
# Mary Ann Micheletti- Levari Republican Henry L. Coia, Jr. Republican
#1 230 223
#2 275 261
#3 99 87
#4 164 152
#5 169 166
#6 109 108
# Total Machine Count Total Vote By Mail Total Provisional Count
#1 436 113 16
#2 459 NA NA
#3 457 NA NA
#4 377 NA NA
#5 271 NA NA
#6 188 NA NA
# Total Emergency Count Total Public Count
#1 NA 565
#2 NA 459
#3 NA 457
#4 NA 377
#5 NA 271
#6 NA 188
The example txt used was:
" Township of Buena Vista\n General Election Results - November 2, 2010\n Prepared by the Office of Edward P. McGettigan, Atlantic County Clerk\n\n\n\n\n Township Committee Public Count\n\n Mary Ann\n Peter C. Richard Henry L. Total Total Total Total Total\n Micheletti-\n Bylone, Sr. Harlan Coia, Jr. Machine Vote By Provisional Emergency Public\n Levari\nWard Democratic Democratic Republican Count Mail Count Count Count\n Republican\nDistrict\n D-1 205 195 230 223 436 113 16 565\n D-2 202 160 275 261 459 459\n D-3 331 346 99 87 457 457\n D-4 215 205 164 152 377 377\n D-5 104 95 169 166 271 271\n D-6 77 70 109 108 188 188"
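The least obvious step in the code above is how the column widths are found: a position only counts as a gap if every header line has a space there, and rle() then turns that logical vector into run lengths. A small sketch on two made-up header lines (not the election file) shows the idea:

```r
# Two made-up header lines of equal length (13 characters each)
hd <- c(paste0("Name",  strrep(" ", 4), "Total"),
        paste0("First", strrep(" ", 3), "Count"))

# TRUE where every header line has a space at that character position
all_space <- Reduce(`&`, lapply(strsplit(hd, ""), `==`, " "))

# Run lengths alternate between text columns and the gaps separating them
runs <- rle(all_space)
runs$lengths   # 5 3 5
runs$values    # FALSE TRUE FALSE
```

Here the first column is 5 characters wide because "First" is longer than "Name"; those run lengths are what the answer passes to read.fwf() as widths.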
Perhaps you can generalize this approach, but I don't think it will be very stable when used on data other than the example.
I put your example into a file named example.txt.
library(tidyverse)
input <- read_lines("example.txt")
input[as.logical(cumsum(input == "District"))] %>%
tibble() %>%
slice(-1) %>%
mutate(count = str_replace_all(string = ., "\\s{9,12}", ";")) %>%
select(-.) %>%
separate(col = count, into = c("District", as.character(1:9)), sep = ";") %>%
mutate(across(everything(), str_trim),
across(as.character(1:9), as.integer))
returns
# A tibble: 6 x 10
District `1` `2` `3` `4` `5` `6` `7` `8` `9`
<chr> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 D-1 205 195 230 223 436 113 16 NA 565
2 D-2 202 160 275 261 459 NA NA NA 459
3 D-3 331 346 99 87 457 NA NA NA 457
4 D-4 215 205 164 152 377 NA NA NA 377
5 D-5 104 95 169 166 271 NA NA NA 271
6 D-6 77 70 109 108 188 NA NA NA 188
Creating the column names (the candidate names) is a tricky task. Depending on the counts, you may need to adjust how many spaces are replaced with ";": \\s{9,12} matches a run of at least 9 and at most 12 whitespace characters.
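To see what the quantifier does, here is a small sketch using base R's gsub() as a stand-in for str_replace_all(); the input strings are made up, not taken from the election file:

```r
# A run of 10 spaces becomes one separator; the 3-space run is left alone
one_gap <- paste0("D-1", strrep(" ", 10), "205", strrep(" ", 3), "195")
res_one <- gsub("\\s{9,12}", ";", one_gap)
res_one   # "D-1;205   195"

# A run of 24 spaces is consumed as two 12-space matches, giving two
# separators and therefore an empty field between the two numbers
two_gaps <- paste0("459", strrep(" ", 24), "459")
res_two <- gsub("\\s{9,12}", ";", two_gaps)
res_two   # "459;;459"
```

This is why the approach depends on the exact spacing: if a file pads its columns differently, the bounds 9 and 12 have to be changed to match.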

Issue Getting New Variable in R

This is my first ever post, so bear with me. I am trying to manipulate a data set in R by adding new columns based on existing data. I've converted my data to a data frame and have employed the mutate function. The function works. However, when I call my dataset again to look at the changes, the new column disappears. What am I doing wrong?
# Converting raw data into a tibble data frame for easier data analysis:
spdata <- as_tibble(rawdata)
# Creating a new Grade column based on Math Scores
spdata %>%
mutate(math.grade = case_when(math.score < 60 ~ "F",
math.score >= 60 & math.score <= 69 ~ "D",
math.score >= 70 & math.score <= 79 ~ "C",
math.score >= 80 & math.score <= 89 ~ "B",
math.score >= 90 & math.score <= 100 ~ "A"))
Here is the output that automatically generates after I run my mutate function:
# A tibble: 1,000 x 9
gender race.ethnicity parental.level.of.education lunch test.preparation.course math.score reading.score writing.score math.grade
<fct> <fct> <fct> <fct> <fct> <int> <int> <int> <chr>
1 female group B bachelor's degree standard none 72 72 74 C
2 female group C some college standard completed 69 90 88 D
3 female group B master's degree standard none 90 95 93 A
4 male group A associate's degree free/reduced none 47 57 44 F
5 male group C some college standard none 76 78 75 C
6 female group B associate's degree standard none 71 83 78 C
7 female group B some college standard completed 88 95 92 B
8 male group B some college free/reduced none 40 43 39 F
9 male group D high school free/reduced completed 64 64 67 D
10 female group B high school free/reduced none 38 60 50 F
# ... with 990 more rows
My new math.grade variable shows up as expected.
However, when I call spdata again to look at it, the math.grade column is missing:
# A tibble: 1,000 x 8
gender race.ethnicity parental.level.of.education lunch test.preparation.course math.score reading.score writing.score
<fct> <fct> <fct> <fct> <fct> <int> <int> <int>
1 female group B bachelor's degree standard none 72 72 74
2 female group C some college standard completed 69 90 88
3 female group B master's degree standard none 90 95 93
4 male group A associate's degree free/reduced none 47 57 44
5 male group C some college standard none 76 78 75
6 female group B associate's degree standard none 71 83 78
7 female group B some college standard completed 88 95 92
8 male group B some college free/reduced none 40 43 39
9 male group D high school free/reduced completed 64 64 67
10 female group B high school free/reduced none 38 60 50
# ... with 990 more rows
You need to assign the data frame with the additional column to a new variable with <- :
new_df <- spdata %>%
mutate(math.grade = case_when(math.score < 60 ~ "F",
math.score >= 60 & math.score <= 69 ~ "D",
math.score >= 70 & math.score <= 79 ~ "C",
math.score >= 80 & math.score <= 89 ~ "B",
math.score >= 90 & math.score <= 100 ~ "A"))
new_df
This should work...
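The same rule holds outside dplyr: R functions return a modified copy and leave the original object alone unless the result is assigned. A minimal base R sketch, using a made-up four-row data frame and cut() in place of case_when() (an alternative, not the code from the question):

```r
scores <- data.frame(math.score = c(72L, 69L, 90L, 47L))

# transform() returns a modified copy; `scores` itself is untouched
graded <- transform(
  scores,
  math.grade = cut(math.score,
                   breaks = c(-Inf, 59, 69, 79, 89, 100),
                   labels = c("F", "D", "C", "B", "A"))
)
before <- names(scores)   # still only "math.score"

# Assigning the result back to the same name makes the change stick
scores <- graded
after <- names(scores)    # "math.score" "math.grade"
```

Assigning back to spdata instead of a new name such as new_df works the same way if you do not need to keep the original.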

more columns than column name on txt file

I have this txt file and I would like to read it into R with this command:
read.table("C:/users/vatlidak/My Documents/Documents/Untitled.txt", header=TRUE)
R returns the following error:
"more columns than column names"
txt file:
height Shoesize gender Location
1 181 44 male city center
4 170 43 female city center
5 172 43 female city center
13 175 42 male out of city
14 181 44 male out of city
15 180 43 male out of city
16 177 43 female out of city
17 133 41 male out of city
If myFile contains the path/filename then replace each of the first 4 stretches of whitespace on every line with a comma and then re-read using read.csv. No packages are used.
L <- readLines(myFile) ##
for(i in 1:4) L <- sub("\\s+", ",", L)
DF <- read.csv(text = L)
giving:
> DF
height Shoesize gender Location
1 181 44 male city center
4 170 43 female city center
5 172 43 female city center
13 175 42 male out of city
14 181 44 male out of city
15 180 43 male out of city
16 177 43 female out of city
17 133 41 male out of city
Note: For purposes of testing we can use this in place of the line marked ## above. (Note that SO can introduce spaces at the beginnings of the lines so we remove them.)
Lines <- " height Shoesize gender Location
1 181 44 male city center
4 170 43 female city center
5 172 43 female city center
13 175 42 male out of city
14 181 44 male out of city
15 180 43 male out of city
16 177 43 female out of city
17 133 41 male out of city"
L <- readLines(textConnection(Lines))
L[-1] <- sub("^\\s+", "", L[-1])
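To see why read.table complains in the first place, count.fields() reports how many whitespace-separated fields each line has. The sketch below uses two rows from the question and assumes the header line has no leading whitespace (the names lines, n_fields and fixed are made up for the illustration):

```r
lines <- c("height Shoesize gender Location",
           "1 181 44 male city center",
           "13 175 42 male out of city")

# Fields per line: 4 in the header, but 6 and 7 in the data rows because
# the Location values contain spaces
con <- textConnection(lines)
n_fields <- count.fields(con)
close(con)
n_fields

# Turn only the first four whitespace runs into commas, then parse as CSV;
# the header has one field fewer than the data, so column 1 becomes row names
fixed <- lines
for (i in 1:4) fixed <- sub("\\s+", ",", fixed)
DF <- read.csv(text = fixed)
DF
```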
This is a bit late, but I had the same problem and tried the suggestions above; they did not work on my data set. I then converted the csv file into an xlsx file and it worked without any extra steps, like this:
library(gdata)
df <- read.xls(file, sheet = 1, row.names=1)
This may help future readers.

Group function by two variables on data.table

My data looks something like this
students<-data.table(studid=c(1:6) ,FACULTY= c("IT","SCIENCE", "LAW","IT","IT","IT"),
SEX=c("Male","Male","Male","Female","Female","Male"), WAM=c(65,35,98,55,20,80))
studid FACULTY SEX AVE_MARK (WAM)
1 IT Male 65
2 SCIENCE Male 35
3 LAW Male 98
4 IT Female 55
5 IT Female 20
6 IT Male 80
I have used the following code to calculate the averages
students[, mean(WAM, na.rm=TRUE), by=FACULTY][order(-V1)]
So my headings are
FACULTY V1
IT 65
LAW 50
etc
Any advice on how to do this would be greatly appreciated.
I would like to break this up by sex also
FACULTY VI VI
Male Female
IT 65 11
LAW 50 11
You could try
dcast.data.table(students, FACULTY~SEX, fun.aggregate=mean, na.rm=TRUE,
value.var='WAM')
# FACULTY Female Male
#1: IT 37.5 72.5
#2: LAW NaN 98.0
#3: SCIENCE NaN 35.0
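If data.table is not available, a rough base R equivalent of this cross tabulation is tapply() with two grouping factors. This is a sketch of an alternative, not the answer's method; wam_by_sex is a made-up name, and empty cells come out as NA rather than NaN:

```r
students <- data.frame(
  studid  = 1:6,
  FACULTY = c("IT", "SCIENCE", "LAW", "IT", "IT", "IT"),
  SEX     = c("Male", "Male", "Male", "Female", "Female", "Male"),
  WAM     = c(65, 35, 98, 55, 20, 80)
)

# Mean WAM for every FACULTY x SEX cell; cells with no students are NA
wam_by_sex <- with(students,
                   tapply(WAM, list(FACULTY, SEX), mean, na.rm = TRUE))
wam_by_sex
```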
Do you definitely need it in cross tabular format? If so, akrun's answer is the way to go.
Otherwise, here they are stacked:
> students[, mean(WAM, na.rm=T),by=c('FACULTY','SEX')]
FACULTY SEX V1
1: IT Male 72.5
2: SCIENCE Male 35.0
3: LAW Male 98.0
4: IT Female 37.5

Summarizing a data frame

I am trying to take the following data and use it to create a table that has the information broken down by state.
Here's the data:
> head(mydf2, 10)
lead_id buyer_account_id amount state
1 52055267 62 300 CA
2 52055267 64 264 CA
3 52055305 64 152 CA
4 52057682 62 75 NJ
5 52060519 62 750 OR
6 52060519 64 574 OR
15 52065951 64 152 TN
17 52066749 62 600 CO
18 52062751 64 167 OR
20 52071186 64 925 MN
I've already subset the states that I'm interested in, so I have just the data I need:
mydf2 = subset(mydf, state %in% c("NV","AL","OR","CO","TN","SC","MN","NJ","KY","CA"))
Here's an idea of what I'm looking for:
State Amount Count
NV 1 50
NV 2 35
NV 3 20
NV 4 15
AL 1 10
AL 2 6
AL 3 4
AL 4 1
...
For each state, I'm trying to find a count for each amount "level." I don't necessarily need to group the amount variable, but keep in mind that the values are not just 1, 2, 3, etc.
> mydf$amount
[1] 300 264 152 75 750 574 113 152 750 152 675 489 188 263 152 152 600 167 34 925 375 156 675 152 488 204 152 152
[29] 600 489 488 75 152 152 489 222 563 215 452 152 152 75 100 113 152 150 152 150 152 452 150 152 152 225 600 620
[57] 113 152 150 152 152 152 152 152 152 152 640 236 152 480 152 152 200 152 560 152 240 222 152 152 120 257 152 400
Is there an elegant solution for this in R, or will I be stuck using Excel (yuck!)?
Here's my understanding of what you're trying to do:
Start with a simple data.frame with 26 states and amounts only ranging from 1 to 50 (which is much more restrictive than what you have in your example, where the range is much higher).
set.seed(1)
mydf <- data.frame(
state = sample(letters, 500, replace = TRUE),
amount = sample(1:50, 500, replace = TRUE)
)
head(mydf)
# state amount
# 1 g 28
# 2 j 35
# 3 o 33
# 4 x 34
# 5 f 24
# 6 x 49
Here's some straightforward tabulation. I've also removed any instances where frequency equals zero, and I've reordered the output by state.
temp1 <- data.frame(table(mydf$state, mydf$amount))
temp1 <- temp1[!temp1$Freq == 0, ]
head(temp1[order(temp1$Var1), ])
# Var1 Var2 Freq
# 79 a 4 1
# 157 a 7 2
# 391 a 16 1
# 417 a 17 1
# 521 a 21 1
# 1041 a 41 1
dim(temp1) # How many rows/cols
# [1] 410 3
Here's a little bit different tabulation. We are tabulating after grouping the "amount" values. Here, I've manually specified the breaks, but you could just as easily let R decide what it thinks is best.
temp2 <- data.frame(table(mydf$state,
cut(mydf$amount,
breaks = c(0, 12.5, 25, 37.5, 50),
include.lowest = TRUE)))
temp2 <- temp2[!temp2$Freq == 0, ]
head(temp2[order(temp2$Var1), ])
# Var1 Var2 Freq
# 1 a [0,12.5] 3
# 27 a (12.5,25] 3
# 79 a (37.5,50] 3
# 2 b [0,12.5] 2
# 28 b (12.5,25] 6
# 54 b (25,37.5] 5
dim(temp2)
# [1] 103 3
I am not sure if I understand correctly (you have two data.frames mydf and mydf2). I'll assume your data is in mydf. Using aggregate:
mydf$count <- 1:nrow(mydf)
aggregate(data = mydf, count ~ amount + state, length)
Is this what you are looking for?
Note: here count is a variable that is created just to get directly the output of the 3rd column as count.
Alternatives with ddply from plyr:
library(plyr)
# no need to create a variable called count
ddply(mydf, .(state, amount), summarise, count=length(lead_id))
Here, one could use any column that exists in the data instead of lead_id, even state:
ddply(mydf, .(state, amount), summarise, count=length(state))
Or equivalently without using summarise:
ddply(mydf, .(state, amount), function(x) c(count=nrow(x)))
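For completeness, the State / Amount / Count layout from the question can also be produced in base R without plyr or a helper column. The sketch below uses a small made-up sample with the same two columns; note that table() turns amount into a factor, so convert it back if you need numbers:

```r
# Made-up sample with the same two columns as the question's data
mydf <- data.frame(
  state  = c("CA", "CA", "CA", "OR", "OR", "NJ"),
  amount = c(152, 152, 300, 750, 152, 75)
)

# Cross-tabulate, name the frequency column, and drop empty combinations
counts <- as.data.frame(table(state = mydf$state, amount = mydf$amount),
                        responseName = "Count")
counts <- counts[counts$Count > 0, ]
counts <- counts[order(counts$state, counts$amount), ]
counts
```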

Resources