R functions will not recognise apostrophe in character string

I have a large data frame of survey data read from a .csv that looks like this when simplified.
x <- data.frame("q1" = c("yes","no","don’t_know"),
"q2" = c("no","no","don’t_know"),
"q3" = c("yes","don’t_know","don’t_know"))
I want to create a column using rowSums as below
x$dntknw <- rowSums(x == "don’t_know")
I can do it for all the yes and no answers easily, but in my data frame it just generates zeros for the don’t_know answers.
I previously had an issue with the apostrophe looking like this: don’t_know. I added encoding = "UTF-8" to my read.table call to fix that. However, now I can't seem to get any R functions to recognise the string; I tried gsub("’", "", df), but this didn't work, just as rowSums didn't.
Is this a problem with the encoding? Is there a regex solution for removing the apostrophes? What options are there for dealing with this?

It is an encoding issue, not a regex one. I am unable to reproduce the problem, and my encoding is set to UTF-8 in R. Try setting the encoding to UTF-8 as R's default rather than only at read time.
Here is my sample output with your code.
> x
          q1         q2         q3 dntknw
1        yes         no        yes      0
2         no         no don’t_know      1
3 don’t_know don’t_know don’t_know      3
> Sys.setlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
Here is some more detail that may be helpful.
https://support.rstudio.com/hc/en-us/articles/200532197-Character-Encoding
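If the locale or file encoding is the culprit, a minimal sketch of the idea (the locale string and the file name "survey.csv" are placeholders, not your actual values) would be:
Sys.setlocale("LC_ALL", "en_US.UTF-8")  # on recent Windows builds, something like "English_United States.utf8"
x <- read.csv("survey.csv", fileEncoding = "UTF-8", stringsAsFactors = FALSE)
x$dntknw <- rowSums(x == "don\u2019t_know")  # \u2019 is the curly right single quote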

As @Drj stated, it is probably an encoding error. When I paste your code into my console, I get
> x$q1
[1] yes no don<U+0092>t_know
Even if the encoding is off, you can still match it using regex:
grepl("don.+t_know", x$q1)
# [1] FALSE FALSE TRUE
Hence, you can calculate the row sums as follows:
x$dntknw <- rowSums(apply(x, 2, function(y) grepl("don.+t_know", y)))
Which results in
> x
                 q1                q2                q3 dntknw
1               yes                no               yes      0
2                no                no don<U+0092>t_know      1
3 don<U+0092>t_know don<U+0092>t_know don<U+0092>t_know      3
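A follow-up on the gsub() attempt from the question: gsub() works on vectors, not on a whole data frame, so it has to be applied column by column. A minimal sketch (the pattern uses . to match whatever byte the apostrophe has become):
x[c("q1", "q2", "q3")] <- lapply(x[c("q1", "q2", "q3")],
                                 function(col) gsub("don.t_know", "dont_know", col))
x$dntknw <- rowSums(x[c("q1", "q2", "q3")] == "dont_know")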

I am having trouble trying to do the statistical analysis part for my bio class

This is my first time asking a question on here. I am tirelessly working on a lab that was due ages ago but got extended, and I am not sure what I am doing anymore. I have to do a statistical analysis using one of four tests: correlation, linear regression, t-test, or ANOVA.
Currently I am just trying to get my dataset to be readable in a wide format in R (see the linked dataset image). I have done the bare minimum, which is getting it read in, but my lesson tells me it needs to be in a wide format, and from what I can see it is not even in that format. I know I would have to run an ANOVA test, since more than two categories are being tested, but I do not know how to change variable names in R, nor how to run the statistical test, because the data are not being read the way I want. Any suggestions would be helpful! Thanks.
edit: here's my code
# Statistical Data for Lab 2: Measuring Diversity
Lab2 <- read.csv2('Lab2Measure.csv')
Lab2_wide <- Lab2
which gives me the following output:
X.x1.y1.z1.x2.y2.z2
1 1,4,80,10,4,100,0
2 2,5,90,5,6,90,5
3 3,3,100,20,5,90,0
4 4,6,60,5,6,57,0
5 5,8,70,3,6,95,2
6 6,5,95,6,5,25,0
7 7,5,80,15,3,90,10
8 8,3,75,20,4,80,0
9 9,5,70,25,3,85,10
10 10,7,95,5,6,97,2
11 11,6,90,2,5,90,0.5
12 12,5,70,1,5,75,5
13 13,3,60,15,3,97,1
14 14,4,90,10,2,70,0
15 15,3,85,8,3,98,1
16 16,2,96,17,8,90,5
17 17,5,70,20,5,98,1
18 18,3,40,10,4,80,9
19 19,3,80,15,4,95,0
20 20,1,90,2,2,92,0
21 21,2,75,7,2,96,5
But please refer to the photo provided to understand my woes.
When you see a column name like:
X.x1.y1.z1.x2.y2.z2
it means you didn't give the correct separator to the read function.
The default separator for read.table is whitespace, and the default for read.csv2 is a semicolon. You can change the separator with the sep parameter; read.csv and read.csv2 accept it as well, though read.csv, whose default separator is already a comma, is the simplest choice here. As user20650 suggests, you should have success with:
Lab2 <- read.csv("Lab2Measure.csv")
Rather than posting images of datasets or results, you should post text copied from the console.
The default value for the separator in read.csv2() is ; but it seems that the separator in your data is ,. So you should add sep="," to the code to make it work correctly.
Lab2 <- read.csv2('Lab2Measure.csv', sep=",")
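A small sketch of the same idea, assuming Lab2Measure.csv really is comma-separated with the columns x1, y1, z1, x2, y2, z2 shown in the printed output; str() is a quick way to confirm that each measurement now sits in its own column:
Lab2 <- read.csv("Lab2Measure.csv")  # read.csv already defaults to sep = ","
str(Lab2)                            # should show one numeric column per variable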

Yet another "ValueError: Input contains NaN, infinity or a value too large for dtype('float64')". I have checked, but data seems to be ok

I'm trying to prepare a dataset to use as training data for a deep neural network. It consists of 13 .txt files, each between 500 MB and 2 GB in size. However, when I try to run a "data_prepare.py" file, I get the ValueError in this post's title.
Reading answers from previous posts, I loaded my data into R and checked for both NaN and infinite numbers, but the commands I used suggest there is nothing wrong with my data. I have done the following:
I load my data as one single data frame using the magrittr, data.table and purrr packages (there are about 300 million rows, each with 7 variables):
txt_fread <-
  list.files(pattern = "*.txt") %>%
  map_df(~ fread(.))
I have used sapply to check for finite and NaN values:
> any(sapply(txt_fread, is.finite))
[1] TRUE
> any(sapply(txt_fread, is.nan))
[1] FALSE
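Note that any() only tells you that at least one value is finite; it does not prove that every value is. A stricter, column-wise check (a sketch reusing the txt_fread object from above) would be:
bad <- sapply(txt_fread, function(col) sum(!is.finite(col)))
bad             # count of NA, NaN and +/-Inf values in each column
which(bad > 0)  # columns that contain at least one problem value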
I have also tried loading each file into a Jupyter notebook and checking individually for those values using the following commands:
file1 = pd.read_csv("File_name_xyz_intensity_rgb.txt", sep=" ", header=None)
np.any(np.isnan(file1))
False
np.all(np.isfinite(file1))
True
And when I use print(file1.info()), this is what I get as info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22525176 entries, 0 to 22525175
Data columns (total 7 columns):
 #   Column  Dtype
---  ------  -----
 0   0       float64
 1   1       float64
 2   2       float64
 3   3       int64
 4   4       int64
 5   5       int64
 6   6       int64
dtypes: float64(3), int64(4)
memory usage: 1.2 GB
None
I know the file containing the code (data_prepare.py) works, because it runs properly with a similar dataset. So it must be a problem with the new data described here, but I don't know what I have missed or done wrong while checking for NaNs and infinities. I have also tried reading and checking the .txt files individually, but that hasn't helped much either.
Any help is really appreciated!!
Btw: the R code with map_df came from a post by leerssej in How to import multiple .csv files at once?
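For checking the files one at a time in R, a sketch along these lines may help. Note that a column read in as character shows up as entirely non-finite, which is itself a useful clue that a stray string has slipped into a numeric column (the file names are simply whatever list.files() finds):
library(data.table)
for (f in list.files(pattern = "\\.txt$")) {
  d <- fread(f)
  bad <- sapply(d, function(col) sum(!is.finite(col)))
  if (any(bad > 0)) {
    cat(f, "has problem values:\n")
    print(bad[bad > 0])
  }
}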

Technique for finding bad data in read.csv in R

I am reading in a file of data that looks like this:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
but the actual file has around 20,000 records. I use the following R code to read it in:
user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")
The reason I have quote="" is that without it the import stops prematurely and I end up with a total of 9569 observations. While I don't understand exactly why, quote="" seems to overcome this problem.
Except that it introduces other problems that I then have to 'fix'. The first one I noticed is that the dates end up as strings that include the quotes, and those won't convert to actual dates when I use as.Date() on them.
Now I could fix the strings and hack my way through. But better to know more about what I am doing. Can someone explain:
Why does quote="" fix the 'bad data'?
What is a best-practice technique to figure out what is causing the read.csv to stop early? (If I just look at the input data at +/- the indicated row, I don't see anything amiss).
Here are the lines 'near' the 'problem'. I don't see the damage; do you?
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
* Update *
It's trickier than that. Even though the total number of rows imported is 9569, if I look at the last few rows they correspond to the last few rows of data. So I surmise that something happened during the import that caused a lot of rows to be skipped. In fact 15914 - 9569 = 6345 records are missing. When I have quote="" in there I get 15914.
So my question can be modified: Is there a way to get read.csv to report about rows it decides not to import?
* UPDATE 2 *
@Dwin, I had to remove na.strings="\\N" because the count.fields function doesn't accept it. With that, I get this output, which looks interesting but which I don't understand:
    3     4    22    23    24
    1    83 15466   178     4
Your second command produces a lot of data (and stops when max.print is reached). The first row is this:
[1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4
which I don't understand, given that the output is supposed to show how many fields there are in each input record. Clearly the first lines all have more than 2, 4, 2, etc. fields... I feel like I am getting closer, but I am still confused!
The count.fields function can be very useful in identifying where to look for malformed data.
This gives a tabulation of fields per line while ignoring quoting, which could be a problem if there are embedded commas:
table( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") )
This gives a tabulation that ignores both quotes and "#" (octothorpe) as a comment character (without sep="," it splits on whitespace, so the counts are of space-separated tokens rather than comma-separated fields):
table( count.fields("~/Desktop/dbdump/users.txt", quote="", comment.char="") )
After seeing what you report for the first tabulation (most of which was as desired), you can get a list of the line positions with non-22 field counts (using the comma and no-quote settings):
which( count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22)
Sometimes the problem can be solved with fill=TRUE if the only difficulty is missing commas at the ends of lines.
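As a follow-up sketch (not from the original answer): once which() has returned the offending line numbers, the raw lines can be pulled out for visual inspection:
bad <- which(count.fields("~/Desktop/dbdump/users.txt", quote = "", sep = ",") != 22)
raw <- readLines("~/Desktop/dbdump/users.txt")
raw[bad]  # look at the malformed records directly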
One problem I have spotted (thanks to data.table) is the missing quote (") after John Smith. Could this also be a problem for other lines in your file?
If I add the "missing" quote after John Smith, it reads fine.
I saved this data to data.txt:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith","john.smith#gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
"16888","user1","user1#gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2#gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3#gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4#gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5#gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
And here is the code. Both fread and read.csv work fine.
require(data.table)
dat1 <- fread("data.txt", header = T, na.strings = "\\N")
dat1
dat2 <- read.csv("data.txt", header = T, na.strings = "\\N")
dat2

Save matrix to .csv file in R without losing format

I'm trying to write a matrix to a .csv file using write.matrix from MASS, but I'm having some problems.
When I print the matrix, it looks something like this:
p q s S2 R2 R2adj Cp AIC PRESS
1 0 1 167.27779 27981.8583 NA NA 3679.294476 NA NA
2 1 2 160.32254 25703.3165 0.08866209 0.08142925 3343.909110 1666.993 3338167.3
3 1 2 86.73559 7523.0630 0.73326195 0.73114498 891.016823 1509.726 1045980.3
4 1 2 67.50458 4556.8690 0.83843145 0.83714916 490.815893 1445.555 693993.5
but when I do
write.matrix(moDat2, file = paste(targetPath, "dat2.csv", sep="/"), sep=",")
It saves it to the file like this:
p,q,s,S2,R2,R2adj,Cp,AIC,PRESS
0.000000e+00,1.000000e+00,1.672778e+02,2.798186e+04, NA, NA,3.679294e+03, NA, NA
1.000000e+00,2.000000e+00,1.603225e+02,2.570332e+04,8.866209e-02,8.142925e-02,3.343909e+03,1.666993e+03,3.338167e+06
1.000000e+00,2.000000e+00,8.673559e+01,7.523063e+03,7.332620e-01,7.311450e-01,8.910168e+02,1.509726e+03,1.045980e+06
Is there any way I can save it to the file without the data getting transformed to scientific notation?
You can use format inside your write.matrix call.
write.matrix(format(moDat2, scientific = FALSE),
             file = paste(targetPath, "dat2.csv", sep = "/"), sep = ",")
The help page for MASS::write.matrix does not suggest that there are controls available. This is what write.table's help page says about formatting numbers:
"In almost all cases the conversion of numeric quantities is governed by the option "scipen" (see options), but with the internal equivalent of digits=15. For finer control, use format to make a character matrix/data frame, and call write.table on that."

read.csv not working as expected in R

I am stumped. Normally, read.csv works as expected, but I have come across an issue where the behavior is unexpected. It most likely is user error on my part, but any help will be appreciated.
Here is the URL for the file
http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip
Here is my code to get the file, unzip, and read it in:
URL <- "http://nces.ed.gov/ipeds/datacenter/data/SFA0910.zip"
download.file(URL, destfile="temp.zip")
unzip("temp.zip")
tmp <- read.table("sfa0910.csv",
header=T, stringsAsFactors=F, sep=",", row.names=NULL)
Here is my problem. When I open the csv data in Excel, the data look as expected. When I read the data into R, the first column is actually named row.names. R is reading in one extra field of data per row, but I can't figure out where the "error" occurs that causes row.names to become a column. Simply put, it looks like the data shifted over.
However, what is strange is that the last column in R does appear to contain the proper data.
Here are a few rows from the first few columns:
tmp[1:5,1:7]
  row.names UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP
1    100654      R     4496       R     1044       R       23
2    100663      R    10646       R     1496       R       14
3    100690      R      380       R        5       R        1
4    100706      R     6119       R      774       R       13
5    100724      R     4638       R     1209       R       26
Any thoughts on what I could be doing wrong?
My tip: use count.fields() as a quick diagnostic when delimited files do not behave as expected.
First, count the number of fields using table():
table(count.fields("sfa0910.csv", sep = ","))
#  451  452
#    1 6852
That tells you that all but one of the lines contains 452 fields. So which is the aberrant line?
which(count.fields("sfa0910.csv", sep = ",") != 452)
# [1] 1
The first line is the problem. On inspection, all lines except the first are terminated by 2 commas.
The question now is: what does that mean? Is there supposed to be an extra field in the header row which was omitted? Or were the 2 commas appended to the other lines in error? It may be best to contact whoever generated the data, if possible, to clarify the ambiguity.
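A quick way to confirm that on the raw text (a sketch, added here rather than part of the original answer) is to look for two trailing commas on every line after the header:
dat <- readLines("sfa0910.csv")
table(grepl(",,$", dat[-1]))  # all TRUE => every data line ends with two empty fields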
I may have a fix, based on mnel's comments:
dat<-readLines(paste("sfa", '0910', ".csv", sep=""))
ncommas <- sapply(seq_along(dat), function(x) {
  sum(attributes(gregexpr(',', dat[x])[[1]])$match.length)
})
> head(ncommas)
[1] 450 451 451 451 451 451
All lines after the first have an extra separator, which Excel ignores.
for (i in seq_along(dat)[-1]) {
  dat[i] <- gsub('(.*),', '\\1', dat[i])  # the greedy .* means only the final comma is dropped
}
write(dat,'temp.csv')
tmp<-read.table('temp.csv',header=T, stringsAsFactors=F, sep=",")
> tmp[1:5,1:7]
  UNITID XSCUGRAD SCUGRAD XSCUGFFN SCUGFFN XSCUGFFP SCUGFFP
1 100654        R    4496        R    1044        R      23
2 100663        R   10646        R    1496        R      14
3 100690        R     380        R       5        R       1
4 100706        R    6119        R     774        R      13
5 100724        R    4638        R    1209        R      26
the moral of the story .... listen to Joshua Ulrich ;)
Quick fix: open the file in Excel and save it. This will also delete the extra separators.
Alternatively
dat <- readLines(paste("sfa", '0910', ".csv", sep = ""), n = 1)  # read just the header line
dum.names <- unlist(strsplit(dat, ','))                          # split it into column names
tmp <- read.table(paste("sfa", '0910', ".csv", sep = ""),
                  header = F, stringsAsFactors = F,
                  col.names = c(dum.names, 'XXXX'), sep = ",", skip = 1)  # dummy name for the extra field
tmp1 <- tmp[, -dim(tmp)[2]]                                      # drop that last dummy column
I know you've found an answer, but since your answer helped me figure this out, I'll share it:
If you read into R a file with different numbers of columns in different rows, like this:
1,2,3,4,5
1,2,3,4
1,2,3
it will be read in with the missing columns filled with NAs, like this:
1,2,3,4,5
1,2,3,4,NA
1,2,3,NA,NA
BUT!
But if the row with the most columns is not the first row, like this:
1,2,3,4
1,2,3,4,5
1,2,3
then it will be read in in a rather confusing way:
1,2,3,4
1,2,3,4
5,NA,NA,NA
1,2,3,NA
(overwhelming before you figure out the problem and quite simple after!)
Just hope it may help someone!
If you are using local data, also make sure that it is in the right place. To be safe, put it in your working directory, for instance, and set that directory with
setwd("C:/[User]/[MyFolder]")
directly in your R console.
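A quick sanity check along the same lines (a sketch; the file name is the one from the question):
getwd()                      # where R is currently looking for files
file.exists("sfa0910.csv")   # TRUE only if the file is really there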
