reading comma-separated strings with read.csv() - r

I am trying to load a comma-delimited data file that also has commas in one of its text columns. The following sample code generates such a file, 'test.csv', which I'll load using read.csv() to illustrate my problem.
> d <- data.frame(name = c("John Smith", "Smith, John"), age = c(34, 34))
> d
name age
1 John Smith 34
2 Smith, John 34
> write.csv(d, file = "test.csv", quote = F, row.names = F)
> d2 <- read.csv("test.csv")
> d2
name age
John Smith 34 NA
Smith John 34
Because of the ',' in Smith, John, d2 is not assigned correctly. How do I read the file so that d2 looks exactly like d?
Thanks.

1) read.pattern. read.pattern (in the gsubfn package) can read such files:
library(gsubfn)
pat <- "(.*),(.*)"
read.pattern("test.csv", pattern = pat, header = TRUE, as.is = TRUE)
giving:
name age
1 John Smith 34
2 Smith, John 34
2) two pass. Another possibility is to read it in, fix it up and then re-read it. This uses no packages and gives the same output.
L <- readLines("test.csv")
read.table(text = sub("(.*),", "\\1|", L), header = TRUE, sep = "|", as.is = TRUE)
Note: For 3 fields, with the comma-containing text field first and the other two fields after it, use this in (1):
pat <- "(.*),([^,]+),([^,]+)"
In the same situation, use this in (2), assuming that there are non-spaces adjacent to each of the last two commas, at least one space adjacent to any comma in the text field, and that fields have at least 2 characters:
text = gsub("(\\S),(\\S)", "\\1|\\2", L)
If you have some other arrangement just modify the regular expression in (1) appropriately and the sub or gsub in (2).
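As a minimal sketch of the three-field case in (2), the gsub call can be tried on in-memory lines instead of a file. The third column, city, and its values are hypothetical and not part of the original question:

```r
# Hypothetical three-field input: the first field may contain ", " (comma + space),
# the last two fields contain no commas. The city column is made up for illustration.
L <- c("name,age,city",
       "John Smith,34,NYC",
       "Smith, John,34,LA")

# Replace only those commas that have a non-space character on both sides.
L2 <- gsub("(\\S),(\\S)", "\\1|\\2", L)

# Re-read using "|" as the separator.
d3 <- read.table(text = L2, header = TRUE, sep = "|", as.is = TRUE)
print(d3)
```

The comma inside "Smith, John" survives because it is followed by a space, which is exactly the assumption stated above.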

Related

Parsing names in mixed formats using R

I have a list of names in mixed formats, which I would like to separate into columns containing first and last names in R. An example dataset:
Names <- c("Mary Smith","Hernandez, Maria","Bonds, Ed","Michael Jones")
The goal is to assemble a dataframe that contains names in a format like this:
FirstNames <- c("Mary","Maria","Ed","Michael")
LastNames <- c("Smith","Hernandez","Bonds","Jones")
FinalData <- data.frame (FirstNames, LastNames)
I tried a few approaches to select either the First or Last name based on whether the names are separated by a space only versus comma-space. For instance I wanted to use regular expressions in gsub to copy first names from rows in which a comma-space separates the names:
FirstNames2 <- gsub (".*,\\s","",Names)
This worked for rows that contained names in the LastName, FirstName format, but gsub collected the entire contents in rows with names in FirstName LastName format.
So my request for help is to please advise: How would you tackle this problem? Thanks to all in advance!
Here is a one-liner. The pattern tries Firstname lastname first and if that fails it tries lastname, firstname. No packages are used.
read.table(text = sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names), as.is=TRUE)
giving:
V1 V2
1 Mary Smith
2 Maria Hernandez
3 Ed Bonds
4 Michael Jones
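To see why the replacement "\\1\\4 \\2\\3" works, it may help to look at the intermediate strings: only one side of the alternation matches for each name, and the capture groups of the other side come back empty. A small sketch of that, where the names swapped and FinalData2 are mine rather than from the answer:

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Groups 1 and 2 are filled for "First Last"; groups 3 and 4 for "Last, First".
# Unmatched groups are replaced by empty strings, so every name ends up "First Last".
swapped <- sub("(\\w+) (\\w+)|(\\w+), (\\w+)", "\\1\\4 \\2\\3", Names)
print(swapped)

# Same read.table call as above, with column names supplied directly.
FinalData2 <- read.table(text = swapped, as.is = TRUE,
                         col.names = c("FirstNames", "LastNames"))
print(FinalData2)
```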
You could rearrange the , version to first last name and then just strsplit.
FirstNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 1)
LastNames <- sapply(strsplit(gsub('(\\w+), (\\w+)', '\\2 \\1', Names), ' '), `[[`, 2)
temp = strsplit(x = Names, split = "(, | )")
do.call(rbind, lapply(1:length(temp), function(i){
if (grepl(pattern = ", ", x = Names[i])){
data.frame(F = temp[[i]][2], L = temp[[i]][1])
}else{
data.frame(F = temp[[i]][1], L = temp[[i]][2])
}
}))
# F L
#1 Mary Smith
#2 Maria Hernandez
#3 Ed Bonds
#4 Michael Jones
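The row-by-row version above can also be written without building one data frame per name. This is only a sketch of the same idea (split on ", " or " ", then swap the pieces when a comma is present); the variable names parts, has_comma, piece1 and piece2 are mine, not from the answers:

```r
Names <- c("Mary Smith", "Hernandez, Maria", "Bonds, Ed", "Michael Jones")

# Split each name on either comma-space or a single space.
parts <- strsplit(Names, ", | ")
piece1 <- vapply(parts, function(p) p[1], character(1))
piece2 <- vapply(parts, function(p) p[2], character(1))

# Names containing ", " are in "Last, First" order, so the pieces are swapped.
has_comma <- grepl(", ", Names, fixed = TRUE)

FinalData <- data.frame(
  FirstNames = ifelse(has_comma, piece2, piece1),
  LastNames  = ifelse(has_comma, piece1, piece2),
  stringsAsFactors = FALSE
)
print(FinalData)
```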

Read CSV in R with first column as dataframe header

I have a simple text file where the first column is names (strings) and the second column is values (floats). As an example, names and ages:
Name, Age
John, 32
Heather, 46,
Jake, 23
Sally, 19
I'd like to read this in as a dataframe (call this df) but transposed so that I can access ages by names such that df$John would return 32. How can I do this?
Previously I tried creating a new dataframe, tdf, looping through the data in a for loop, assigning each name and age and then inserting into the empty dataframe as tdf[name] = age, but this did not work as I expected.
You can read your data using read.table().
Then you can transpose it using t() and set colnames after.
Example:
If df is:
df=read.table("dummydata", header=T, sep=",")
df
Name Age
1 John 32
2 Heather 46
3 Jake 23
4 Sally 19
You transpose the age and then transform them into a dataframe:
tdf=as.data.frame(t(df$Age))
colnames(tdf)=t(df$Name)
So tdf will return:
tdf
John Heather Jake Sally
1 32 46 23 19
And, as you asked, tdf$John will return:
tdf$John
[1] 32
Now, if you have more than two columns you can do the same but instead of indicating the name of the column you can simply indicate the position using brackets.
df=read.table("dummydata", header=T, sep=",")
With t(df[2:ncol(df)]) you transpose the whole table starting from the second column, no matter the number of columns. The first column will be the names after the transpose.
tdf=as.data.frame(t(df[2:ncol(df)]))
Then you set the column names.
colnames(tdf)=t(df[1])
tdf$John
[1] 32
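For the case with more than two columns, here is a self-contained sketch that reads from a string instead of the "dummydata" file. The Height column and its values are hypothetical, added only to show what tdf$John looks like when there are several value columns:

```r
# Hypothetical data with an extra Height column (not in the original question).
Lines <- "Name, Age, Height
John, 32, 180
Heather, 46, 165
Jake, 23, 175
Sally, 19, 170"

# strip.white = TRUE removes the spaces that follow each comma.
df <- read.csv(text = Lines, strip.white = TRUE)

# Transpose everything except the first column, then use the names as column names.
tdf <- as.data.frame(t(df[2:ncol(df)]))
colnames(tdf) <- as.character(df$Name)

print(tdf$John)          # one value per original column (Age, Height)
print(tdf["Age", "John"])
```

With several value columns, tdf$John is a vector, so a single value is picked out by row name, as in the last line.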
This should add the row as header when you read from the file:
read.csv2(filename, as.is = TRUE, header = TRUE)
Read the data into a data frame, DF (see Note).
1) Assign the names to the rows of DF in which case this will give John's age without having to create a new data structure:
rownames(DF) <- DF$Name
DF["John", "Age"]
## [1] 32
2) Alternatively, split DF into a named list in which case you can get the precise syntax requested:
ages <- with(DF, split(Age, Name))
ages$John
## [1] 32
3) This alternative would also create the same list:
ages <- with(DF, setNames(as.list(Age), Name))
Note: DF in reproducible form is as follows. (We have removed the trailing comma on one line in the question but if it is really there add fill = TRUE to the read.csv line.)
Lines <- "Name, Age
John, 32
Heather, 46
Jake, 23
Sally, 19"
DF <- read.csv(text = Lines)
A bit late but hopefully helpful. The "row.names" parameter allows you to use the desired column (here the first one) as the row names:
read.csv("df.csv", header = TRUE, row.names = 1)
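A small sketch of what row.names = 1 gives you, reading from a string rather than from "df.csv" and assuming the trailing comma from the question has been removed. The name by_name is mine:

```r
Lines <- "Name, Age
John, 32
Heather, 46
Jake, 23
Sally, 19"

# The first column becomes the row names instead of a regular column.
DF <- read.csv(text = Lines, header = TRUE, row.names = 1)
print(DF["John", "Age"])

# Transposing turns the row names into column names, which allows the $ syntax.
by_name <- as.data.frame(t(DF))
print(by_name$John)
```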

Merging two data frames with columns with certain patterns in strings

(I have been stuck with this problem for the past two days, so if it has an answer on SO please bear with me.)
I have two data frames A and B. I want to merge them on Name column. Suppose, A has two columns Name and Numbers. The Name column of A df has values ".tony.x.rds", ".tom.x.rds" and so on.
Name Numbers
.tony.x.rds 15.6
.tom.x.rds 14.5
The B df has two columns Name and ChaR. The Name column of B has values "tony.x","tom.x" and so on.
Name ChaR
tony.x ENG
tom.x US
The main element in column Name of both dfs is "tony", "tom" and so on.
So, ".tony.x.rds" is equal to "tony.x" and ".tom.x.rds" is equal to "tom.x".
I have tried gsub with various options, leaving me with "tony", "tom", and so on in column Name of both A and B data frames. But when I use
StoRe<-merge(A,B, all=T)
I get all the rows of A and B rather than single merged rows. That is, there are two rows for each name ("tony", "tom" and so on), each with its respective value in the Numbers or ChaR column. For example:
Name Numbers ChaR
tony 15.6 NA
tony NULL ENG
tom 14.5 NA
tom NULL US
It has been giving me splitting headache. I request you to help.
One possible solution. I am not completely sure what you want to do with the 'x' in the strings; I have kept it in the linkage key, but by changing the \\1\\2 to \\1 you keep only the first part (the name).
a <- data.frame(
Name = paste0(".", c("tony", "tom", "foo", "bar", "foobar"), ".x.rds"),
Numbers = rnorm(5)
)
b <- data.frame(
Name = paste0(c("tony", "tom", "bar", "foobar", "company"), ".x"),
ChaR = LETTERS[11:15]
)
# String consists of 'point letter1 point letter2 point rds'; replace by
# 'letter1 letter2'
a$Name_stand <- gsub("^\\.([a-z]+)\\.([a-z]+)\\.rds$", "\\1\\2", a$Name)
# String consists of 'letter1 point letter2'; replace by 'letter1 letter2'
b$Name_stand <- gsub("^([a-z]+)\\.([a-z]+)$", "\\1\\2", b$Name)
result <- merge(a, b, all = TRUE, by = "Name_stand")
Output:
#> result
# Name_stand Name.x Numbers Name.y ChaR
#1 barx .bar.x.rds 1.38072696 bar.x M
#2 companyx <NA> NA company.x O
#3 foobarx .foobar.x.rds -1.53076596 foobar.x N
#4 foox .foo.x.rds 1.40829287 <NA> <NA>
#5 tomx .tom.x.rds -0.01204651 tom.x L
#6 tonyx .tony.x.rds 0.34159406 tony.x K
Another approach, perhaps somewhat more robust to variations of the strings (such as 'tom.rds' and 'tom', which will still be linked; this can of course also be a disadvantage):
# Remove the rds from a$Name
a$Name_stand <- gsub("rds$" , "", a$Name)
# Remove all non alpha numeric characters from the strings
a$Name_stand <- gsub("[^[:alnum:]]", "", a$Name_stand)
b$Name_stand <- gsub("[^[:alnum:]]", "", b$Name)
result2 <- merge(a, b, all = TRUE, by = "Name_stand")
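For comparison, here is a deterministic sketch on the two rows from the question, without random numbers. It assumes the only differences between the two Name columns are the leading "." and the trailing ".rds" in A; the key column is a helper name of mine:

```r
A <- data.frame(Name = c(".tony.x.rds", ".tom.x.rds"),
                Numbers = c(15.6, 14.5),
                stringsAsFactors = FALSE)
B <- data.frame(Name = c("tony.x", "tom.x"),
                ChaR = c("ENG", "US"),
                stringsAsFactors = FALSE)

# Build a common key: strip the trailing ".rds", then the leading ".".
A$key <- sub("^\\.", "", sub("\\.rds$", "", A$Name))
B$key <- B$Name

# Merge explicitly on the key so the differing Name columns are not used for matching.
StoRe <- merge(A, B, by = "key", all = TRUE)
print(StoRe)
```

If the keys really are identical in both data frames, each name appears on a single row with both Numbers and ChaR filled in.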

Reading text into data.frame where string values contain spaces

What's the easiest way to read text from a printed data.frame into a data.frame when there are string values containing spaces that interfere with read.table? For instance, this data.frame excerpt does not pose a problem:
candname party elecVotes
1 BarackObama D 365
2 JohnMcCain R 173
I can paste it into a read.table call without a problem:
dat <- read.table(text = " candname party elecVotes
1 BarackObama D 365
2 JohnMcCain R 173", header = TRUE)
But if the data has strings with spaces like this:
candname party elecVotes
1 Barack Obama D 365
2 John McCain R 173
Then read.table throws an error as it interprets "Barack" and "Obama" as two separate variables.
Read the file into L, remove the row numbers and use sub with the indicated regular expression to insert commas between the remaining fields. (Note that "\\d" matches any digit and "\\S" matches any non-whitespace character.) Now re-read it using read.csv:
Lines <- " candname party elecVotes
1 Barack Obama D 365
2 John McCain R 173"
# L <- readLines("myfile") # read file; for demonstration use next line instead
L <- readLines(textConnection(Lines))
L2 <- sub("^ *\\d+ *", "", L) # remove row numbers
read.csv(text = sub("^ *(.*\\S) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L2), as.is = TRUE)
giving:
candname party elecVotes
1 Barack Obama D 365
2 John McCain R 173
The regular expression used above (originally shown as a Debuggex visualization) is:
^ *(.*\S) +(\S+) +(\S+)$
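To make the effect of the regular expression visible, here is a sketch that applies it to the lines after the row numbers have been removed and prints the intermediate comma-separated text. The name csv_lines is mine:

```r
# Lines as they look once the row numbers have been stripped.
L2 <- c("candname party elecVotes",
        "Barack Obama D 365",
        "John McCain R 173")

# Greedy (.*\\S) takes everything up to the last two whitespace-separated tokens.
csv_lines <- sub("^ *(.*\\S) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L2)
print(csv_lines)

dat <- read.csv(text = csv_lines, as.is = TRUE)
print(dat)
```

Because only the last two tokens are split off, this relies on party and elecVotes never containing spaces.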

Read data with space character in R

Usually, read.table solves most of my data input problems. Like this one:
China 2 3
USA 1 4
Sometimes, the data can madden people, like:
Chia 2 3
United States 3 4
Here read.table does not work, and any assistance is appreciated.
P.S. the format of data file is .dat
First set up some test data:
# create test data
cat("Chia 2 3
United States 3 4
", file = "file.with.spaces.txt")
1) Using the above, read in the data, insert commas between fields and re-read:
L <- readLines("file.with.spaces.txt")
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L) # 1
DF <- read.table(text = L2, sep = ",")
giving:
> DF
V1 V2 V3
1 Chia 2 3
2 United States 3 4
2) Another approach. Using L from above, replace the last string of spaces with comma twice (since there are three fields):
L2 <- L
for(i in 1:2) L2 <- sub(" +(\\S+)$", ",\\1", L2) # 2
DF <- read.table(text = L2, sep = ",")
If the column separator 'sep' is indeed whitespace, read.table logically cannot differentiate between spaces within a name and spaces that actually separate columns. I'd suggest changing your country names to single strings, i.e., strings without spaces. Alternatively, use semicolons to separate your data columns and use:
data <- read.table("foo.dat", sep = ";")
If you have many rows in your .dat file, you can consider using regular expressions to find spaces between the columns and replace them with semicolons.
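A minimal sketch of that regular-expression idea, assuming every line ends in exactly two space-free numeric fields as in the question. It works on in-memory lines; for a real file you would first read the lines with readLines:

```r
L <- c("Chia 2 3",
       "United States 3 4")

# Replace the runs of spaces in front of the last two fields with semicolons.
L2 <- sub(" +(\\S+) +(\\S+)$", ";\\1;\\2", L)
print(L2)

data <- read.table(text = L2, sep = ";", stringsAsFactors = FALSE)
print(data)
```

Spaces inside the country name are left alone because the pattern is anchored to the end of the line.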
