Fortran90 Reading and filling in arrays from data file

I am supposed to read in this file and fill in the arrays from that data file. But every time I try to run it, I receive an error saying I have invalid memory. The file looks like this:
4
SanDiego
0
350
900
1100
Phoenix
350
0
560
604
Denver
900
560
0
389
Dallas
1100
604
389
0
It is basically a traveling salesman algorithm that finds the best distance. Here is my whole code:
Program P4
IMPLICIT NONE
!Variable Declarations
INTEGER :: count, i, j, ios, distance=0, permutations=0, best_distance
CHARACTER(50) :: filename
TYPE city
CHARACTER(20) :: name
END TYPE
TYPE(city), ALLOCATABLE, DIMENSION(:) :: city_list
INTEGER, ALLOCATABLE, DIMENSION(:,:) :: d_table
INTEGER, ALLOCATABLE, DIMENSION(:) :: path, best_path
PRINT *, "Enter filename"
READ *, filename
!Open the file and read number of cities
OPEN(UNIT = 15, FILE = filename, FORM="FORMATTED", ACTION="READ", STATUS="OLD", IOSTAT=ios)
IF(ios /= 0) THEN
PRINT *, "ERROR, could not open file.", TRIM(filename), "Error code: ", ios
STOP
END IF
READ (UNIT=15, FMT=100) count
PRINT *, "Number of cities: ", count
!Allocate memory for all needed arrays
ALLOCATE(city_list(1:count), d_table(1:count,1:count), best_path(1:count), path(1:count), STAT=ios)
IF(ios /= 0) THEN
PRINT *, "ERROR, could not allocate memory."
STOP
END IF
!Fill in arrays from data file
DO i=1, count
path(i) = i
READ(UNIT=15, FMT=200) city_list(i)
IF(ios < 0) THEN
EXIT
END IF
DO j=1, 4
PRINT *, i, j, city_list(i)
READ(UNIT=15, FMT=100) d_table(i,j)
END DO
END DO
!Use recursion to find minimal distance
CALL permute(2, count)
!Print formatted output
PRINT *
DO i=1, count
PRINT *, path(i)
END DO
DO i=1, count
PRINT *, (city_list(i))
END DO
DO i=1, count
DO j=1, count
PRINT *, d_table(i,j)
END DO
END DO
100 FORMAT (I6)
200 FORMAT (A)
CONTAINS
!Permute function
RECURSIVE SUBROUTINE permute(first, last)
!Declare intent of parameter variables
IMPLICIT NONE
INTEGER, INTENT(in) :: first, last
INTEGER :: i, temp
IF(first == last) THEN
distance = d_table(1,path(2))
PRINT *, city_list(1)%name, city_list(path(2))%name, " ", d_table(1, path(2))
DO i=2, last-1
distance = distance + d_table(path(i),path(i+1))
print *, city_list(path(i))%name, " ", city_list(path(i+1))%name, d_table(path(i),path(i+1))
END DO
distance = distance + d_table(path(last),path(1))
PRINT *, city_list(path(last))%name," ",city_list(path(1))%name, d_table(path(last),path(1))
PRINT *, "Distance is ",distance
PRINT *
permutations = permutations + 1
IF(distance < best_distance) THEN
best_distance = distance
DO i=2, count
best_path(i) = path(i)
END DO
END IF
ELSE
DO i=first, last
temp = path(first)
path(first) = path(i)
path(i) = temp
call permute(first+1,last)
temp = path(first)
path(first) = path(i)
path(i) = temp
END DO
END IF
END SUBROUTINE permute
END PROGRAM P4
I am no longer getting an error message and am able to run the program, but it is not running properly. It is supposed to output this:
Number of cities: 4
San Diego Phoenix 350
Phoenix Denver 560
Denver Dallas 389
Dallas San Diego 1100
Distance is 2399

San Diego Phoenix 350
Phoenix Dallas 604
Dallas Denver 389
Denver San Diego 900
Distance is 2243

San Diego Denver 900
Denver Phoenix 560
Phoenix Dallas 604
Dallas San Diego 1100
Distance is 3164

San Diego Denver 900
Denver Dallas 389
Dallas Phoenix 604
Phoenix San Diego 350
Distance is 2243

San Diego Dallas 1100
Dallas Denver 389
Denver Phoenix 560
Phoenix San Diego 350
Distance is 2399

San Diego Dallas 1100
Dallas Phoenix 604
Phoenix Denver 560
Denver San Diego 900
Distance is 3164


San Diego to Phoenix -- 350 miles
Phoenix to Dallas -- 604 miles
Dallas to Denver -- 389 miles
Denver to San Diego -- 900 miles

Best distance is: 2243
Number of permutations: 6
But instead it outputs this:
Enter filename
data.txt
Number of cities: 4
1 1 SanDiego
1 2 SanDiego
1 3 SanDiego
1 4 SanDiego
2 1 Phoenix
2 2 Phoenix
2 3 Phoenix
2 4 Phoenix
3 1 Denver
3 2 Denver
3 3 Denver
3 4 Denver
4 1 Dallas
4 2 Dallas
4 3 Dallas
4 4 Dallas
SanDiego Phoenix 1100
Phoenix Denver 604
Denver Dallas 389
Dallas SanDiego 0
Distance is 2093
SanDiego Phoenix 1100
Phoenix Dallas 604
Dallas Denver 0
Denver SanDiego 389
Distance is 2093
SanDiego Denver 1100
Denver Phoenix 389
Phoenix Dallas 604
Dallas SanDiego 0
Distance is 2093
SanDiego Denver 1100
Denver Dallas 389
Dallas Phoenix 0
Phoenix SanDiego 604
Distance is 2093
SanDiego Dallas 1100
Dallas Denver 0
Denver Phoenix 389
Phoenix SanDiego 604
Distance is 2093
SanDiego Dallas 1100
Dallas Phoenix 0
Phoenix Denver 604
Denver SanDiego 389
Distance is 2093
1
2
3
4
SanDiego
Phoenix
Denver
Dallas
1100
1100
1100
1100
604
604
604
604
389
389
389
389
0
0
0
0

When you first open the file, the READ command reads the first line. If you call READ again, you will read line two.
I think that this error comes because you are trying to READ the first line (for the second time) but you are actually reading the second line that is a STRING.
You have two options to fix the error in line 39 (READ statement):
Call the OPEN statement (before the line 39) for a second time in order to READ the first line
Or delete the READ statement because you have already read "count" before (best option in my opinion)
Update: for the segmentation fault error
I made the following changes:
Comment line 39 (to solve the first problem)
In line 42: FMT=200
In line 46: DO j=1, 4 (because between the city names there are only 4 numbers)
The following code worked for me:
Program P4
IMPLICIT NONE
!Variable Declarations
INTEGER :: count, i, j, ios, distance=0, permutations=0, best_distance
CHARACTER(50) :: filename
TYPE city
CHARACTER(20) :: name
END TYPE
TYPE(city), ALLOCATABLE, DIMENSION(:) :: city_list
INTEGER, ALLOCATABLE, DIMENSION(:,:) :: d_table
INTEGER, ALLOCATABLE, DIMENSION(:) :: path, best_path
PRINT *, "Enter filename"
READ *, filename
!Open the file and read number of cities
OPEN(UNIT = 15, FILE = filename, FORM="FORMATTED", ACTION="READ", STATUS="OLD", IOSTAT=ios)
IF(ios /= 0) THEN
PRINT *, "ERROR, could not open file.", TRIM(filename), "Error code: ", ios
STOP
END IF
READ (UNIT=15, FMT=100) count
PRINT *, "Number of cities: ", count
!Allocate memory for all needed arrays
ALLOCATE(city_list(1:count), d_table(1:count,1:count), best_path(1:count), path(1:count), STAT=ios)
IF(ios /= 0) THEN
PRINT *, "ERROR, could not allocate memory."
STOP
END IF
!Fill in arrays from data file
!READ (UNIT=15, FMT=100) count
DO i=1, count
path(i) = i
READ (UNIT=15, FMT=200, IOSTAT=ios) city_list(i)
IF(ios < 0) THEN
EXIT
END IF
DO j=1, 4
print*,i,j,city_list(i)
READ (UNIT=15, FMT=100) d_table(i,j)
END DO
END DO
!Use recursion to find minimal distance
CALL permute(2, count)
!Print formatted output
PRINT *
DO i=1, count
PRINT *, path(i)
END DO
DO i=1, count
PRINT *, (city_list(i))
END DO
DO i=1, count
DO j=1, count
PRINT *, d_table(i,j)
END DO
END DO
100 FORMAT (I6)
200 FORMAT (A)
CONTAINS
!Permute function
RECURSIVE SUBROUTINE permute(first, last)
!Declare intent of parameter variables
IMPLICIT NONE
INTEGER, INTENT(in) :: first, last
INTEGER :: i, temp
IF(first == last) THEN
distance = d_table(1,path(2))
PRINT *, city_list(1)%name, city_list(path(2))%name, " ", d_table(1, path(2))
DO i=2, last-1
distance = distance + d_table(path(i),path(i+1))
print *, city_list(path(i))%name, " ", city_list(path(i+1))%name, d_table(path(i),path(i+1))
END DO
distance = distance + d_table(path(last),path(1))
PRINT *, city_list(path(last))%name," ",city_list(path(1))%name, d_table(path(last),path(1))
PRINT *, "Distance is ",distance
PRINT *
permutations = permutations + 1
IF(distance < best_distance) THEN
best_distance = distance
DO i=2, count
best_path(i) = path(i)
END DO
END IF
ELSE
DO i=first, last
temp = path(first)
path(first) = path(i)
path(i) = temp
call permute(first+1,last)
temp = path(first)
path(first) = path(i)
path(i) = temp
END DO
END IF
END SUBROUTINE permute
END PROGRAM P4
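
As a cross-check on the sequential-read pattern described above (each read consumes the next record), here is a minimal Python sketch of the same file layout: one count line, then for each city a name line followed by count distance lines. The names read_table and best_tour are illustrative only and do not come from the Fortran program; the data is the sample file from the question.

```python
from itertools import permutations

# Sample file from the question: count, then name + `count` distances per city.
DATA = """4
SanDiego
0
350
900
1100
Phoenix
350
0
560
604
Denver
900
560
0
389
Dallas
1100
604
389
0
"""

def read_table(text):
    """Consume the records strictly in order, like successive READ statements."""
    lines = iter(text.splitlines())
    count = int(next(lines))            # first record: number of cities
    names, table = [], []
    for _ in range(count):
        names.append(next(lines).strip())                       # one name record
        table.append([int(next(lines)) for _ in range(count)])  # count distances
    return names, table

def best_tour(table):
    """Brute force: keep city 0 fixed and try every ordering of the others."""
    n = len(table)
    best_distance, best_path, tried = None, None, 0
    for rest in permutations(range(1, n)):
        path = (0,) + rest
        # Sum each leg, including the return leg from the last city to city 0.
        distance = sum(table[path[i]][path[(i + 1) % n]] for i in range(n))
        tried += 1
        if best_distance is None or distance < best_distance:
            best_distance, best_path = distance, path
    return best_distance, best_path, tried

names, table = read_table(DATA)
best_distance, best_path, tried = best_tour(table)
print(best_distance, tried)              # 2243 6
print([names[i] for i in best_path])     # ['SanDiego', 'Phoenix', 'Dallas', 'Denver']
```

The result agrees with the expected output in the question (best distance 2243 over 6 permutations). Note that the sketch starts with no best distance at all; in the Fortran listing, best_distance is declared without an initial value before it is compared against, which may be worth checking.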

Related

Combining "previous row" of same table and JOIN from different table in Sqlite

I have the following table
CREATE TABLE "shots" (
"player" INTEGER,
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"shot" INTEGER,
"text" TEXT,
"distance" REAL,
"x" TEXT,
"y" TEXT,
"z" TEXT
);
With a sample of the data:
28237 470 2015 717 1 1 1 Shot 1 302 yds to left fairway, 257 yds to hole 10874 11451.596 10623.774 78.251
28237 470 2015 717 1 1 2 Shot 2 234 yds to right fairway, 71 ft to hole 8437 12150.454 10700.381 86.035
28237 470 2015 717 1 1 3 Shot 3 70 ft to green, 4 ft to hole 838 12215.728 10725.134 88.408
28237 470 2015 717 1 1 4 Shot 4 in the hole 46 12215.1 10729.1 88.371
28237 470 2015 717 1 2 1 Shot 1 199 yds to green, 29 ft to hole 7162 12776.03 10398.086 91.017
28237 470 2015 717 1 2 2 Shot 2 putt 26 ft 7 in., 2 ft 4 in. to hole 319 12749.444 10398.854 90.998
28237 470 2015 717 1 2 3 Shot 3 in the hole 28 12747.3 10397.6 91.027
28237 470 2015 717 1 3 1 Shot 1 296 yds to left intermediate, 204 yds to hole 10651 12596.857 9448.27 94.296
28237 470 2015 717 1 3 2 Shot 2 208 yds to green, 15 ft to hole 7478 12571.0 8825.648 94.673
28237 470 2015 717 1 3 3 Shot 3 putt 17 ft 6 in., 2 ft 5 in. to hole 210 12561.831 8840.539 94.362
I want to get for each shot the previous location (x, y, z). I wrote the below query.
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot, cur.x, cur.y, cur.z, prev.x, prev.y, prev.z
FROM shots cur
INNER JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
(prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot - 1)
This query takes forever basically. How can I rewrite it to make it faster?
In addition, I need to make an adjustment for the first shot on a hole (shot = 1). This shot is made from tee_x, tee_y and tee_z. These values are available in table holes
CREATE TABLE "holes" (
"tournament" TEXT,
"year" INTEGER,
"course" INTEGER,
"round" INTEGER,
"hole" INTEGER,
"tee_x" TEXT,
"tee_y" TEXT,
"tee_z" TEXT
);
With data:
470 2015 717 1 1 11450 10625 78.25
470 2015 717 1 2 12750 10400 91
470 2015 717 1 3 2565 8840.5 95
Thanks

First, you need a composite index to speed up the operation:
CREATE INDEX idx_shots ON shots (player, tournament, year, course, round, hole, shot);
With that index, your query should run faster:
SELECT cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot, cur.x, cur.y, cur.z,
prev.x AS prev_x, prev.y AS prev_y, prev.z AS prev_z
FROM shots cur LEFT JOIN shots prev
ON (cur.player, cur.tournament, cur.year, cur.course, cur.round, cur.hole, cur.shot) =
(prev.player, prev.tournament, prev.year, prev.course, prev.round, prev.hole, prev.shot + 1);
The changes I made:
the join should be a LEFT join so that all rows are included and not only the ones that have a previous row
-1 should be +1 because the previous row's shot is 1 less than the current row's shot
added aliases for the previous row's x, y and z
But, if your version of SQLite is 3.25.0+ it would be better to use window function LAG() instead of a self join:
SELECT *,
LAG(x) OVER w AS prev_x,
LAG(y) OVER w AS prev_y,
LAG(z) OVER w AS prev_z
FROM shots
WINDOW w AS (PARTITION BY player, tournament, year, course, round, hole ORDER BY shot);
See the demo (I include the query plan for both queries where you can see the use of the composite index).
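
Neither query above covers the first shot on each hole, which the question says is played from the tee position stored in holes. The following is a minimal sketch of one way to handle it, not a tested solution for the full data set: it assumes SQLite 3.25+ (for window functions), uses trimmed-down versions of the two tables with a few rows from the question, and runs through Python's sqlite3 module. LAG() supplies the previous shot's position, and COALESCE falls back to the tee columns when there is no previous shot.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed-down tables (assumption: only the columns needed for this example).
conn.executescript("""
CREATE TABLE shots (player INTEGER, tournament TEXT, year INTEGER, course INTEGER,
                    round INTEGER, hole INTEGER, shot INTEGER, x TEXT, y TEXT, z TEXT);
CREATE TABLE holes (tournament TEXT, year INTEGER, course INTEGER, round INTEGER,
                    hole INTEGER, tee_x TEXT, tee_y TEXT, tee_z TEXT);
INSERT INTO shots VALUES
 (28237, '470', 2015, 717, 1, 1, 1, '11451.596', '10623.774', '78.251'),
 (28237, '470', 2015, 717, 1, 1, 2, '12150.454', '10700.381', '86.035'),
 (28237, '470', 2015, 717, 1, 2, 1, '12776.03', '10398.086', '91.017');
INSERT INTO holes VALUES
 ('470', 2015, 717, 1, 1, '11450', '10625', '78.25'),
 ('470', 2015, 717, 1, 2, '12750', '10400', '91');
""")

query = """
SELECT s.hole, s.shot,
       COALESCE(LAG(s.x) OVER w, h.tee_x) AS prev_x,
       COALESCE(LAG(s.y) OVER w, h.tee_y) AS prev_y,
       COALESCE(LAG(s.z) OVER w, h.tee_z) AS prev_z
FROM shots s
LEFT JOIN holes h
  ON h.tournament = s.tournament AND h.year = s.year AND h.course = s.course
 AND h.round = s.round AND h.hole = s.hole
WINDOW w AS (PARTITION BY s.player, s.tournament, s.year, s.course, s.round, s.hole
             ORDER BY s.shot)
ORDER BY s.hole, s.shot
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (1, 1, '11450', '10625', '78.25')          first shot: tee position
# (1, 2, '11451.596', '10623.774', '78.251') later shot: previous shot's position
# (2, 1, '12750', '10400', '91')
```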

Tableau issue with case statement

I am trying to say if the year is 2021 and the month is one then INT [api_contacts_handled] should be 320 and so on. [Y] and [M] are smallint.
IF [Y] = 2021 THEN
CASE [M]
WHEN 1 THEN [api_contacts_handled] = 320
WHEN 2 THEN [api_contacts_handled] = 420
WHEN 3 THEN [api_contacts_handled] = 520
WHEN 4 THEN [api_contacts_handled] = 620
WHEN 5 THEN [api_contacts_handled] = 820
WHEN 6 THEN [api_contacts_handled] = 920
ELSE [api_contacts_handled]
END
ELSE
[api_contacts_handled]
END
Error - Expected type Boolean found Integer, Results types from CASE expressions should match.
Any help would be great.
Thanks
John

You are probably misunderstanding how CASE works.
If the logic is the one you wrote, you should create a calculated field called api_contacts_handled_calculated, with this formula:
IF [Y] = 2021 THEN
CASE [M]
WHEN 1 THEN 320
WHEN 2 THEN 420
WHEN 3 THEN 520
WHEN 4 THEN 620
WHEN 5 THEN 820
WHEN 6 THEN 920
ELSE [api_contacts_handled]
END
ELSE [api_contacts_handled]
END
You cannot "overwrite" your value like this:
THEN [api_contacts_handled] = 320
In addition, Tableau reads the expression after THEN, [api_contacts_handled] = 320, as a Boolean comparison of the column against the value, while the ELSE branch returns an integer. The mismatched result types lead to your error.
So with that new calculated field, you will test all your conditions and assign a value based on them, handling the ELSE parts as well.
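
For readers who think more easily in a general-purpose language, the logic of the calculated field can be sketched in Python. This is only an analogy, not Tableau syntax; the function and dictionary names are made up for illustration. It returns a new value rather than assigning to the column, and it shows that an expression like x == 320 is a comparison that yields a Boolean.

```python
def api_contacts_handled_calculated(year, month, api_contacts_handled):
    """Return the overridden value for Jan-Jun 2021, else the original value."""
    overrides_2021 = {1: 320, 2: 420, 3: 520, 4: 620, 5: 820, 6: 920}
    if year == 2021:
        # Months without an override fall through to the original value (ELSE).
        return overrides_2021.get(month, api_contacts_handled)
    return api_contacts_handled

print(api_contacts_handled_calculated(2021, 1, 50))   # 320
print(api_contacts_handled_calculated(2021, 7, 50))   # 50
print(api_contacts_handled_calculated(2020, 1, 50))   # 50

# A comparison is not an assignment: it produces a Boolean, not 320.
print(50 == 320)                                      # False
```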

bs4 Attribute Error while scraping table python

I am trying to scrape a table using bs4. But whenever I iterate over the <tbody> elements, I get the following error:
Traceback (most recent call last):
  File "f:\Python Programs\COVID-19 Notifier\main.py", line 28, in <module>
    for tr in soup.find('tbody').findAll('tr'):
AttributeError: 'NoneType' object has no attribute 'findAll'
I am new to bs4 and have faced this error many times before too. This is the code I am using. Any help would be greatly appreciated as this is an official project to be submitted in a competition and the deadline is near. Thanks in advance. And beautifulsoup4=4.8.2, bs4==0.0.4 and soupsieve==2.0.
My code:
from plyer import notification
import requests
from bs4 import BeautifulSoup
import time
def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )
def getData(url):
    r = requests.get(url)
    return r.text
if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = getData('https://www.mohfw.gov.in/')
        soup = BeautifulSoup(myHtmlData, 'html.parser')
        #print(soup.prettify())
        myDataStr = ""
        for tr in soup.find('tbody').find_all('tr'):
            myDataStr += tr.get_text()
        myDataStr = myDataStr[1:]
        itemList = myDataStr.split("\n\n")
        print(itemList)
        states = ['Chandigarh', 'Telengana', 'Uttar Pradesh']
        for item in itemList[0:22]:
            dataList = item.split('\n')
            if dataList[1] in states:
                nTitle = 'Cases of Covid-19'
                nText = f"State {dataList[1]}\nIndian : {dataList[2]} & Foreign : {dataList[3]}\nCured : {dataList[4]}\nDeaths : {dataList[5]}"
                notifyMe(nTitle, nText)
                time.sleep(2)
        time.sleep(3600)

This line raises the error:
for tr in soup.findAll('tbody').findAll('tr'):
You can only call find_all on a single tag, not a result set returned by another find_all. (findAll is the same as find_all - the latter one is preferably used because it meets the Python PEP 8 styling standard)
According to the documentation:
The find_all() method looks through a tag’s descendants and retrieves all descendants that match your filters.
If you're looping through a single table, simply replace the first findAll with find. If multiple tables, store the result set in a variable and loop through it, and you can apply the findAll on a single tag.
This should fix it:
for tr in soup.find('tbody').find_all('tr'):
Multiple tables:
tables = soup.find_all('tbody')
for table in tables:
    for tr in table.find_all('tr'):
        ...

There are a few issues here.
The <tbody> tag is within the comments of the html. BeautifulSoup skips comments, unless you specifically pull those.
Why bother with the getData() function? It's just one line, so why not put it directly in the code? The extra function doesn't really add efficiency or readability.
Even when you pull the <tbody> tag, your dataList doesn't have 6 items (you call dataList[5], which will throw an error). I adjusted it, but I don't know if those are the correct numbers. I don't know what each of those values represents, so you may need to fix that. The headers for the data you are pulling are ['S. No.','Name of State / UT','Active Cases*','Cured/Discharged/Migrated*','Deaths**'], so I don't know what Indian : {dataList[2]} & Foreign : are supposed to be.
With that, I don't know what those numbers represent, but is it the correct data? Looks like you can pull new data here, but it's not the same numbers as in the <tbody>.
So, here's how to get that other data source; maybe it's more accurate?
import requests
import pandas as pd
jsonData = requests.get('https://www.mohfw.gov.in/data/datanew.json').json()
df = pd.DataFrame(jsonData)
Output:
print(df.to_string())
sno state_name active positive cured death new_active new_positive new_cured new_death state_code
0 2 Andaman and Nicobar Islands 153 5527 5309 65 146 5569 5358 65 35
1 1 Andhra Pradesh 66944 997462 922977 7541 74231 1009228 927418 7579 28
2 3 Arunachal Pradesh 380 17296 16860 56 453 17430 16921 56 12
3 4 Assam 11918 231069 217991 1160 13942 233453 218339 1172 18
4 5 Bihar 69869 365770 293945 1956 76420 378442 300012 2010 10
5 6 Chandigarh 4273 36404 31704 427 4622 37232 32180 430 04
6 7 Chhattisgarh 121555 605568 477339 6674 123479 622965 492593 6893 22
7 8 Dadra and Nagar Haveli and Daman and Diu 1668 5910 4238 4 1785 6142 4353 4 26
8 10 Delhi 91618 956348 851537 13193 92029 980679 875109 13541 07
9 11 Goa 10228 72224 61032 964 11040 73644 61628 976 30
10 12 Gujarat 92084 453836 355875 5877 100128 467640 361493 6019 24
11 13 Haryana 58597 390989 328809 3583 64057 402843 335143 3643 06
12 14 Himachal Pradesh 11859 82876 69763 1254 12246 84065 70539 1280 02
13 15 Jammu and Kashmir 16094 154407 136221 2092 16993 156344 137240 2111 01
14 16 Jharkhand 40942 184951 142294 1715 43415 190692 145499 1778 20
15 17 Karnataka 196255 1247997 1037857 13885 214330 1274959 1046554 14075 29
16 18 Kerala 156554 1322054 1160472 5028 179311 1350501 1166135 5055 32
17 19 Ladakh 2041 12937 10761 135 2034 13089 10920 135 37
18 20 Lakshadweep 803 1671 867 1 920 1805 884 1 31
19 21 Madhya Pradesh 84957 459195 369375 4863 87640 472785 380208 4937 23
20 22 Maharashtra 701614 4094840 3330747 62479 693632 4161676 3404792 63252 27
21 23 Manipur 513 30047 29153 381 590 30151 29180 381 14
22 24 Meghalaya 1133 15488 14198 157 1238 15631 14236 157 17
23 25 Mizoram 608 5220 4600 12 644 5283 4627 12 15
24 26 Nagaland 384 12800 12322 94 457 12889 12338 94 13
25 27 Odisha 32963 388479 353551 1965 36718 394694 356003 1973 21
26 28 Puducherry 5923 50580 43931 726 6330 51372 44314 728 34
27 29 Punjab 40584 319719 270946 8189 43943 326447 274240 8264 03
28 30 Rajasthan 107157 467875 357329 3389 117294 483273 362526 3453 08
29 31 Sikkim 640 6970 6193 137 693 7037 6207 137 11
30 32 Tamil Nadu 89428 1037711 934966 13317 95048 1051487 943044 13395 33
31 34 Telengana 52726 379494 324840 1928 58148 387106 326997 1961 36
32 33 Tripura 563 34302 33345 394 645 34429 33390 394 16
33 35 Uttarakhand 26980 138010 109058 1972 29949 142349 110379 2021 05
34 36 Uttar Pradesh 259810 976765 706414 10541 273653 1013370 728980 10737 09
35 37 West Bengal 68798 700904 621340 10766 74737 713780 628218 10825 19
36 11111 2428616 16263695 13648159 186920 2552940 16610481 13867997 189544
Here's your code, changed to pull the table out of the comments:
Code:
import requests
from bs4 import BeautifulSoup, Comment
from plyer import notification  # needed by notifyMe()
import time
def notifyMe(title, message):
    notification.notify(
        title = title,
        message = message,
        app_icon = ".\\icon.ico",
        timeout = 6
    )
if __name__ == "__main__":
    while True:
        # notifyMe("Harry", "Lets stop the spread of this virus together")
        myHtmlData = requests.get('https://www.mohfw.gov.in/').text
        soup = BeautifulSoup(myHtmlData, 'html.parser')
        # Collect every HTML comment; the <tbody> is hidden inside one of them
        comments = soup.find_all(string=lambda text: isinstance(text, Comment))
        myDataStr = ""
        for each in comments:
            if 'tbody' in str(each):
                # Re-parse the comment text so its tags become searchable
                soup = BeautifulSoup(each, 'html.parser')
                for tr in soup.find('tbody').findAll('tr'):
                    myDataStr += tr.get_text()
                myDataStr = myDataStr[1:]
                itemList = myDataStr.split("\n\n")
                print(itemList)
                states = ['Chandigarh', 'Telengana', 'Uttar Pradesh','Meghalaya']
                for item in itemList[0:22]:
                    dataList = item.split('\n')
                    if dataList[1] in states:
                        nTitle = 'Cases of Covid-19'
                        nText = f"State {dataList[1]}\nIndian : {dataList[0]} & Foreign : {dataList[2]}\nCured : {dataList[3]}\nDeaths : {dataList[4]}" #<-- I changed this
                        notifyMe(nTitle, nText)
                        time.sleep(2)
        time.sleep(3600)
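
To see why a search for tbody can come back empty when the markup sits inside an HTML comment, here is a small sketch that uses only the standard library's html.parser module (the 'html.parser' backend in BeautifulSoup builds on it). The TagCollector class, the parse helper and the one-line page are hypothetical and made up for illustration; the real page is far larger. The point is that a comment arrives as one opaque string, so its tags only become visible after the comment text is parsed again.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records the start tags and comment bodies seen while parsing."""
    def __init__(self):
        super().__init__()
        self.tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_comment(self, data):
        # Everything between <!-- and --> arrives here as plain text.
        self.comments.append(data)

def parse(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector

# Hypothetical page: the table body exists only inside an HTML comment.
page = "<table><!-- <tbody><tr><td>Chandigarh</td></tr></tbody> --></table>"

outer = parse(page)
print(outer.tags)    # ['table']  (no 'tbody': it is hidden in the comment)

# Parsing the comment text a second time exposes the tags inside it.
inner = parse(outer.comments[0])
print(inner.tags)    # ['tbody', 'tr', 'td']
```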

Subset data frame in R given grouping length criterion

I'm working on some exercises based on this dataset.
There's a State column listing the rate of deaths per month by heart attack for each hospital of the state (column 11):
> table(data$State)
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96
Now I try to keep only those states for which more than 20 values are available:
> table(data$State)>20
AK AL AR AZ CA CO CT DC DE FL GA GU
FALSE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE
So using subset I try to get a subset of data based on the above conditions, but that gives me a result I can't follow:
> data_subset <- subset(data, table(data$State)>20)
> table(data_subset$State)
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY
14 84 66 65 288 64 25 8 5 155 109 1 19 93 24 153 107 100 83
Why am I getting AK 14, when I would expect that state to be filtered out by the condition?

You can use the following approach to filter out the data with less than 20 rows:
tab <- table(data$State)
data[data$State %in% names(tab)[tab > 19], ]
Your code
subset(data, table(data$State)>20)
does not work because table(data$State)>20 returns a boolean vector of length length(table(data$State)), i.e. one element per state rather than one per row. In your data, the boolean vector is shorter than the number of rows in your data frame. Due to vector recycling, the vector is combined with itself until the longer length is reached. E.g., have a look at (1:3)[c(TRUE, FALSE)].
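
The recycling effect can be imitated outside R. The following Python sketch uses a made-up six-row state column and a threshold of 2 instead of 20; itertools.cycle stands in for R's vector recycling, and sorting the names stands in for the order in which table() lists them. It contrasts the positional mask (what the subset() call effectively applied) with filtering by state name.

```python
from collections import Counter
from itertools import cycle

# Made-up State column: AL appears 4 times, AK only twice.
rows = ["AL", "AK", "AL", "AK", "AL", "AL"]

counts = Counter(rows)
names = sorted(counts)                        # ['AK', 'AL']
mask = [counts[name] > 2 for name in names]   # [False, True], one entry per state

# Positional filtering with the short mask repeated over all rows:
# row 1 pairs with False, row 2 with True, row 3 with False, and so on.
recycled = [state for state, keep in zip(rows, cycle(mask)) if keep]
print(recycled)   # ['AK', 'AK', 'AL']  (AK survives although it fails the test)

# Filtering by membership: keep a row only if its own state passes the test.
by_name = [state for state in rows if counts[state] > 2]
print(by_name)    # ['AL', 'AL', 'AL', 'AL']
```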

Count rows for selected column values and remove rows based on count in R

I am new to R and am trying to work on a data frame from a csv file (as seen from the code below). It has hospital data with 46 columns and 4706 rows (one of those columns being 'State'). I made a table showing counts of rows for each value in the State column. So in essence the table shows each state and the number of hospitals in that state. Now what I want to do is subset the data frame and create a new one without the entries for which the state has less than 20 hospitals.
How do I count the occurrences of values in the State column and then remove those that count up to fewer than 20? Maybe I am supposed to use the table() function, remove the undesired data and put that into a new data frame using something like lapply(), but I'm not sure due to my lack of experience in programming with R.
Any help will be much appreciated. I have seen other examples of removing rows that have certain column values in this site, but not one that does that based on the count of a particular column value.
> outcome <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
> hospital_nos <- table(outcome$State)
> hospital_nos
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME MI
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37 134
MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX UT VA
133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370 42 87
VI VT WA WI WV WY
2 15 88 125 54 29

Here is one way to do it. Starting with the following data frame :
df <- data.frame(x=c(1:10), y=c("a","a","a","b","b","b","c","d","d","e"))
If you want to keep only the rows with more than 2 occurrences in df$y, you can do :
tab <- table(df$y)
df[df$y %in% names(tab)[tab>2],]
Which gives :
x y
1 1 a
2 2 a
3 3 a
4 4 b
5 5 b
6 6 b
And here is a one line solution with the plyr package :
ddply(df, "y", function(d) {if(nrow(d)>2) d else NULL})
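
The per-group idea behind the ddply call (split the rows by key, then keep or drop each group as a whole) can be sketched in plain Python with the same example data. This is an illustration of the approach only, not a translation of plyr; the variable names are arbitrary.

```python
from collections import defaultdict

# Same example data as above: x = 1..10, y = a a a b b b c d d e
rows = list(zip(range(1, 11), "aaabbbcdde"))

# Split: collect the rows of each group under its key.
groups = defaultdict(list)
for x, y in rows:
    groups[y].append((x, y))

# Apply and combine: keep a group only if it has more than 2 rows.
kept = [row for key in sorted(groups) if len(groups[key]) > 2
        for row in groups[key]]

for row in kept:
    print(row)
# (1, 'a')
# (2, 'a')
# (3, 'a')
# (4, 'b')
# (5, 'b')
# (6, 'b')
```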
