More columns than column names in txt file - R

I have this txt file and I would like to read it into R with this command:
read.table("C:/users/vatlidak/My Documents/Documents/Untitled.txt", header=TRUE)
R returns the following error:
"more columns than column names"
txt file:
height Shoesize gender Location
1 181 44 male city center
4 170 43 female city center
5 172 43 female city center
13 175 42 male out of city
14 181 44 male out of city
15 180 43 male out of city
16 177 43 female out of city
17 133 41 male out of city

If myFile contains the path/filename, then replace each of the first 4 stretches of whitespace on every line with a comma and re-read using read.csv. No packages are used.
L <- readLines(myFile) ##
for(i in 1:4) L <- sub("\\s+", ",", L)
DF <- read.csv(text = L)
giving:
> DF
height Shoesize gender Location
1 181 44 male city center
4 170 43 female city center
5 172 43 female city center
13 175 42 male out of city
14 181 44 male out of city
15 180 43 male out of city
16 177 43 female out of city
17 133 41 male out of city
Note: for purposes of testing, we can use this in place of the line marked ## above. (Note that SO can introduce spaces at the beginning of the lines, so we remove them.)
Lines <- " height Shoesize gender Location
1 181 44 male city center
4 170 43 female city center
5 172 43 female city center
13 175 42 male out of city
14 181 44 male out of city
15 180 43 male out of city
16 177 43 female out of city
17 133 41 male out of city"
L <- readLines(textConnection(Lines))
L[-1] <- sub("^\\s+", "", L[-1])

It is a bit late, but I had the same problem; I tried these solutions and they did not work on my dataset, so I just converted the csv file into an xlsx file and it worked without any extra operation. Like this:
library(gdata)
df <- read.xls(file, sheet = 1, row.names=1)
This may help future readers.
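A related option, not from the original answer: if the file is saved with an .xlsx extension, the readxl package can read it without gdata's Perl dependency. A minimal sketch (the filename here is hypothetical):
library(readxl)
df <- read_excel("Untitled.xlsx", sheet = 1)  # reads the first sheet into a data frame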

Related

Am I able to get a specific P-value to see where the significance lies?

So these are the survey results. I have tried to do pairwise testing (pairwise.wilcox.test) on these results, collected in Spring and Autumn at these sites, but I can't get a specific p-value showing which site has the most influence.
This is the error message I keep getting. My dataset isn't even, i.e. some of the sites were not surveyed in Spring, which I think may be the issue.
Error in wilcox.test.default(xi, xj, paired = paired, ...) :
'x' must be numeric
So I'm not sure if I have laid the table out wrong for seeing how much site influences the results between Spring and Autumn.
Site Autumn Spring
Stokes Bay 25 6
Stokes Bay 54 6
Stokes Bay 31 0
Gosport Wall 213 16
Gosport Wall 24 19
Gosport Wall 54 60
No Mans Land 76 25
No Mans Land 66 68
No Mans Land 229 103
Osbourne 1 77
Osbourne 1 92
Osbourne 1 92
Osbourne 2 114 33
Osbourne 2 217 114
Osbourne 2 117 64
Osbourne 3 204 131
Osbourne 3 165 85
Osbourne 3 150 81
Osbourne 4 124 15
Osbourne 4 79 64
Osbourne 4 176 65
Ryde Roads 217 165
Ryde Roads 182 63
Ryde Roads 112 53
Ryde Sands 386 44
Ryde Sands 375 25
Ryde Sands 147 45
Spit Bank 223 23
Spit Bank 78 29
Spit Bank 60 15
St Helen's 1 247 11
St Helen's 1 126 36
St Helen's 1 107 20
St Helen's 2 108 115
St Helen's 2 223 25
St Helen's 2 126 30
Sturbridge 58 43
Sturbridge 107 34
Sturbridge 156 0
Osbourne Deep 1 76 59
Osbourne Deep 1 64 52
Osbourne Deep 1 77 30
Osbourne Deep 2 153 60
Osbourne Deep 2 106 88
Osbourne Deep 2 74 35
Sturbridge Shoal 169 45
Sturbridge Shoal 19 84
Sturbridge Shoal 81 44
Mother's Bank 208
Mother's Bank 119
Mother's Bank 153
Ryde Middle 16
Ryde Middle 36
Ryde Middle 36
Stanswood 14 132
Stanswood 47 87
Stanswood 14 88
This is what I've done so far:
MWU <- read.csv(file.choose(), header = T)
#attach file to workspace
attach(MWU)
#Read column names of the data
colnames(MWU) # Site, Autumn, Spring
MWU.1 <- MWU[c(1,2,3)] # the csv included blank columns, so keep only Site, Autumn, Spring
kruskal.test(MWU.1$Autumn ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Autumn by MWU.1$Site
#Kruskal-Wallis chi-squared = 36.706, df = 24, p-value = 0.0468
kruskal.test(MWU.1$Spring ~ MWU.1$Site)
#Kruskal-Wallis rank sum test
#data: MWU.1$Spring by MWU.1$Site
#Kruskal-Wallis chi-squared = 35.134, df = 21, p-value = 0.02729
wilcox.test(MWU.1$Autumn, MWU.1$Spring, paired = T)
#Wilcoxon signed rank exact test
#data: MWU.1$Autumn and MWU.1$Spring
#V = 1066, p-value = 8.127e-08
#alternative hypothesis: true location shift is not equal to 0
#Tried this version too to see if it would give a summary of where the influence is.
pairwise.wilcox.test(MWU.1$Spring, MWU.1$Autumn)
#Error in wilcox.test.default(xi, xj, paired = paired, ...) : not enough (non-missing) 'x' observations
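For reference, pairwise.wilcox.test(x, g) expects x to be a numeric response vector and g a grouping factor; passing a second numeric column (or the Site column) in the wrong position creates many tiny or non-numeric groups, which can lead to errors like the ones above. A minimal sketch, assuming the column names Site, Autumn and Spring used above, would reshape the data to long format first:
# one row per count, with its Site and Season
long <- reshape(MWU.1, direction = "long",
                varying = c("Autumn", "Spring"), v.names = "Count",
                timevar = "Season", times = c("Autumn", "Spring"))
# numeric response, grouping factor: pairwise comparisons between sites
pairwise.wilcox.test(long$Count, long$Site, p.adjust.method = "BH")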

Assign name in column based on values from another column in R

I am looking for some help with what seems like a very simple question. Any advice is greatly appreciated! I have created a data frame and I am looking to assign names under one column (Region) based on the values in the other column (Unit).
rdf <- as.data.frame(matrix(NA, nrow = 59, ncol = 2))
colnames(rdf) <- c("Unit", "Region")
rdf$Unit <- c(1:35, 37:60)
rdf$Region <- ## (See below)
Here I want the region for Units 1:13 to be East,
for Units 14:25 and 27 to be labeled Central,
Units 26, 28:38, 40:43, 45:46 to be labeled West,
and then Units 44, 39, 47:60 to be labeled BC.
I've been trying case_when or nested if-else statements, but I am getting errors saying that a longer object length is not a multiple of the shorter object length.
We can use dplyr. case_when is the way to go here.
library(dplyr)
rdf %>%
  mutate(Region = case_when(Unit %in% 1:13 ~ "East",
                            Unit %in% c(14:25, 27) ~ "Central",
                            Unit %in% c(26, 28:38, 40:43, 45:46) ~ "West",
                            Unit %in% c(44, 39, 47:60) ~ "BC"))
Unit Region
1 1 East
2 2 East
3 3 East
4 4 East
5 5 East
6 6 East
7 7 East
8 8 East
9 9 East
10 10 East
11 11 East
12 12 East
13 13 East
14 14 Central
15 15 Central
16 16 Central
17 17 Central
18 18 Central
19 19 Central
20 20 Central
21 21 Central
22 22 Central
23 23 Central
24 24 Central
25 25 Central
26 26 West
27 27 Central
28 28 West
29 29 West
30 30 West
31 31 West
32 32 West
33 33 West
34 34 West
35 35 West
36 37 West
37 38 West
38 39 BC
39 40 West
40 41 West
41 42 West
42 43 West
43 44 BC
44 45 West
45 46 West
46 47 BC
47 48 BC
48 49 BC
49 50 BC
50 51 BC
51 52 BC
52 53 BC
53 54 BC
54 55 BC
55 56 BC
56 57 BC
57 58 BC
58 59 BC
59 60 BC
Because this is a small data frame, using a translation table might be the most convenient way:
xlat <- list(East    = c(1:13),
             Central = c(14:25, 27),
             West    = c(26, 28:38, 40:43, 45:46),
             BC      = c(44, 39, 47:60))
rdf$Region <- NA
for (r in names(xlat)) rdf$Region[rdf$Unit %in% xlat[[r]]] <- r
This solution (a) clearly documents the recoding; (b) will indicate if you have overlooked any unit number (by setting Region to NA); and (c) is easy to alter and maintain.
For larger tables, learn about the join operation among relations. It has many implementations in R.
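As a hedged sketch of that join idea (not part of the original answer), the same translation table can be written as a lookup data frame and merged in with base merge(); the Unit and Region values mirror the example above:
# one row per Unit with its Region; 13 East, 13 Central, 18 West and 16 BC cover Units 1:60
lookup <- data.frame(
  Unit   = c(1:13, 14:25, 27, 26, 28:38, 40:43, 45:46, 44, 39, 47:60),
  Region = rep(c("East", "Central", "West", "BC"), times = c(13, 13, 18, 16)))
rdf$Region <- NULL                                    # drop the placeholder column
rdf <- merge(rdf, lookup, by = "Unit", all.x = TRUE)  # left join on Unit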

Specific Join of two Dataframes

I have two data frames: df1 and df2:
> df1
ID Gender age cd evnt scr test_dt
1 C0004 MALE 22 1 1 82 7/3/2014
2 C0004 MALE 22 1 2 76 7/3/2014
3 C0005 MALE 22 1 3 1514 7/3/2014
4 C0005 MALE 23 2 1 81 11/3/2014
5 C0006 MALE 23 2 2 75 11/3/2014
6 C0006 MALE 23 2 3 878 11/3/2014
and,
> df2
ID hgt wt phys_dt
1 C0004 70 147 6/29/2015
2 C0004 70 157 6/27/2016
3 C0005 67 175 6/27/2016
4 C0005 65 171 7/2/2014
5 C0006 69 160 6/29/2015
6 C0006 64 143 7/2/2014
I want to join df1 and df2 in a way that yields the following data frame, call it df3:
> df3
ID Gender age cd evnt scr hgt wt
1 C0004 MALE 22 1 1 82 70 147
2 C0004 MALE 22 1 2 76 70 157
3 C0005 MALE 22 1 3 1514 67 175
4 C0005 MALE 23 2 1 81 65 171
5 C0006 MALE 23 2 2 75 69 160
6 C0006 MALE 23 2 3 878 64 143
I'm trying to add df2$hgt and df2$wt to the proper ID row. The tricky part is that I want to join hgt and wt to the ID row whose dates (df1$test_dt and df2$phys_dt) most closely align. I was thinking I could first sort the two data frames by ID then by their respective dates then try and join? I'm not quite sure how to approach this. Thanks.
If you want to merge just by matching df1$ID and df2$ID, the following should do it:
df3 <- left_join(df1, df2, by = c("ID" = "ID"))
If the date should be matched as well as the ID, you could try:
df3 <- left_join(df1, df2, by = c("ID" = "ID", "test_dt" = "phys_dt"))
left_join() is in the dplyr package (library(dplyr)).
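An exact match on the dates will drop rows where the dates differ, though. For the "closest date" part of the question, one possible sketch (not from the original answer) is a data.table rolling join, assuming the date columns are parsed as Date first:
library(data.table)
# parse the m/d/Y character dates shown in the question
setDT(df1)[, test_dt := as.Date(test_dt, format = "%m/%d/%Y")]
setDT(df2)[, phys_dt := as.Date(phys_dt, format = "%m/%d/%Y")]
# for each df1 row, take the df2 row with the same ID whose phys_dt is nearest to test_dt
df3 <- df2[df1, on = .(ID, phys_dt = test_dt), roll = "nearest"]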

Error in sort.list(y) while using strata() in R

When I run the command:
H <- length(table(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=data$Team, size=n.h, method="srswor")
I get the error statement:
'Error in sort.list(y) : 'x' must be atomic for 'sort.list' Have you called 'sort' on a list?'
Please help me understand how I can get this stratified sample. The variable 'Team' is of type 'Factor'.
Data is as below:
zz <- "Team League.ID Player Salary POS G GS InnOuts PO A
ANA AL molinjo0 335000 C 73 57 1573 441 37
ANA AL percitr0 7833333 P 3 0 149 1 3
ARI NL bautida0 4000000 RF 141 135 3536 265 8
ARI NL estalbo0 550000 C 7 3 92 19 2
ARI NL finlest0 7000000 CF 104 102 2689 214 5
ARI NL koplomi0 330000 P 72 0 260 6 23
ARI NL sparkst0 500000 P 27 18 362 8 21
ARI NL villaos0 325000 P 17 0 54 0 4
ARI NL webbbr01 335000 P 33 35 624 13 41
ATL NL francju0 750000 1B 125 71 1894 627 48
ATL NL hamptmi0 14625000 P 35 29 517 13 37
ATL NL marreel0 3000000 LF 90 42 1125 80 4
ATL NL ortizru0 6200000 P 32 34 614 7 38
BAL AL surhobj0 800000 LF 100 31 805 69 0"
data <- read.table(text=zz, header=T)
This should work:
library(sampling)
H <- length(levels(data$Team))
n.h <- rep(5, H)
strata(data, stratanames=c("Team"), size=n.h, method="srswor")
stratanames should be a vector of column names, not a reference to the actual column data.
Update:
Now that example data is available, I see another problem: you are sampling without replacement (srswor), but your requested samples are bigger than the available data. You need to sample with replacement in this case:
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswr")
BTW, you get the actual data with:
sampledData <- getdata(data, smpl)
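Alternatively (this is not from the original answer), if sampling with replacement is not wanted, the stratum sizes can be capped at the number of rows actually available. A sketch, assuming the teams appear in the data in the same order that table() lists them (which holds for the example data):
n.avail <- as.vector(table(data$Team))  # rows available per team (ANA, ARI, ATL, BAL)
n.h <- pmin(5, n.avail)                 # never request more rows than a team has
smpl <- strata(data, stratanames=c("Team"), size=n.h, method="srswor")
sampledData <- getdata(data, smpl)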
This doesn't really answer your question, but a long time ago, I wrote a function called stratified that might be of use to you.
I've posted it here as a GitHub Gist.
Notice that when you have asked for samples that are bigger than your data, it just returns all of the relevant rows.
output <- stratified(data, "Team", 5)
# Some groups
# ---ANA, ATL, BAL---
# contain fewer observations than desired number of samples.
# All observations have been returned from those groups.
table(output$Team)
#
# ANA ARI ATL BAL
# 2 5 4 1
output
# Team League.ID Player Salary POS G GS InnOuts PO A
# 1 ANA AL molinjo0 335000 C 73 57 1573 441 37
# 2 ANA AL percitr0 7833333 P 3 0 149 1 3
# 9 ARI NL webbbr01 335000 P 33 35 624 13 41
# 7 ARI NL sparkst0 500000 P 27 18 362 8 21
# 8 ARI NL villaos0 325000 P 17 0 54 0 4
# 3 ARI NL bautida0 4000000 RF 141 135 3536 265 8
# 6 ARI NL koplomi0 330000 P 72 0 260 6 23
# 12 ATL NL marreel0 3000000 LF 90 42 1125 80 4
# 13 ATL NL ortizru0 6200000 P 32 34 614 7 38
# 10 ATL NL francju0 750000 1B 125 71 1894 627 48
# 11 ATL NL hamptmi0 14625000 P 35 29 517 13 37
# 14 BAL AL surhobj0 800000 LF 100 31 805 69 0
I'll add official documentation to the function at some point, but here's a summary to help you get the best use out of it:
The arguments to stratified are:
df: The input data.frame
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
If size is a value less than 1, a proportionate sample is taken from each stratum.
If size is a single integer of 1 or more, that number of samples is taken from each stratum.
If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
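For illustration, and assuming the stratified function from the Gist has been sourced, the arguments above could be used like this (the sizes are hypothetical):
stratified(data, "Team", size = 0.5)                                       # 50% of each team's rows
stratified(data, "Team", size = c(ANA = 2, ARI = 3, ATL = 4, BAL = 1))     # per-stratum counts
stratified(data, "Team", size = 2, select = list(Team = c("ARI", "ATL")))  # sample only these teams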

Exclude Row with State having less than some value on Aggregation

I'm currently learning R with the help of a video course on Coursera. When trying to exclude from the table all hospitals in states that have fewer than 20 hospitals, I wasn't able to find the correct solution because of my lack of R programming knowledge (I have programmed a lot in C, and the logic I tried to implement in R is also C-like).
The code I used is like this:
> test <- read.csv("outcome-of-care-measures.csv", colClasses = "character")
> test[, 11] <- as.numeric(test[, 11])
> test2 <- table(test$State)
Here, from table test2, I can get the value of a particular entry as test2[[2]], but I couldn't find out how to use conditional logic to get the states with fewer than 20 hospitals (if I can get the state names, then I can use subset() to address the actual problem). I also had a look at the dimnames() function but couldn't see how it would solve my problem. So my question is: in R, how can I check a threshold value against the table values?
The values stored in test2 are:
AK AL AR AZ CA CO CT DC DE FL GA GU HI IA ID IL IN KS KY LA MA MD ME
17 98 77 77 341 72 32 8 6 180 132 1 19 109 30 179 124 118 96 114 68 45 37
MI MN MO MS MT NC ND NE NH NJ NM NV NY OH OK OR PA PR RI SC SD TN TX
134 133 108 83 54 112 36 90 26 65 40 28 185 170 126 59 175 51 12 63 48 116 370
UT VA VI VT WA WI WV WY ##State Name
42 87 2 15 88 125 54 29 ##Count of Hospital
As Arun also specified in his comment, you can do it as names(test2[test2 >= 20]) in order to get the states with at least 20 hospitals. There is also a nice explanation elsewhere of why you may want to avoid subset.
Or you can transform your table to a data.frame and use subset:
dat <- as.data.frame(test2)
subset(dat, Freq < 20)
nn Freq
1 AK 17
8 DC 8
9 DE 6
12 GU 1
13 HI 19
42 RI 12
49 VI 2
50 VT 15
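To then drop those states from the original data frame, a minimal sketch (assuming the state column in test is named State, as the table() call in the question implies):
keep <- names(test2[test2 >= 20])          # states with at least 20 hospitals
test.big <- subset(test, State %in% keep)  # keep only rows from those states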
