R - Parse text file as fscanf ? [duplicate] - r

This question already has answers here:
How to transform a key/value string into distinct rows?
(2 answers)
Closed 4 years ago.
I have a large text file that I want to import in R with multimodal data encoded as such :
A=1,B=1,C=2,...
A=2,B=1,C=1,...
A=1,B=2,C=1,...
What I'd like to have is a dataframe similar to this :
A B C
1 1 2
2 1 1
1 2 1
Because the column name is being repeated over and over for each row, I was wondering if there was a way import that text file with a fscanf functionality that would parse the A, B, C column names such as "A=%d,B=%d,C=%d,...."
Or maybe there's a simpler way using read.table or scan ? But I couldn't figure out how.
Thanks for any tip

1) read.pattern read.pattern in the gsubfn package is very close to what you are asking. Instead of %d use (\\d+) when specifying the pattern. If the column names are not important the col.names argument could be omitted.
library(gsubfn)
L <- c("A=1,B=1,C=2", "A=1,B=1,C=2", "A=1,B=1,C=2") # test input
pat <- "A=(\\d+),B=(\\d+),C=(\\d+)"
read.pattern(text = L, pattern = pat, col.names = unlist(strsplit(pat, "=.*?(,|$)")))
giving:
A B C
1 1 1 2
2 1 1 2
3 1 1 2
1a) percent format Just for fun we could implement it using exactly the format given in the question.
fmt <- "A=%d,B=%d,C=%d"
pat <- gsub("%d", "(\\\\d+)", fmt)
Now run the read.pattern statement above.
2) strapply Using the same input and the gsubfn package, again, an alternative is to pull out all strings of digits eliminating the need for the pat shown in (1) reducing the pattern to just "\\d+".
DF <- strapply(L, "\\d+", as.numeric, simplify = data.frame)
names(DF) <- unlist(strsplit(L[1], "=.*?(,|$)"))
3) read.csv Even simpler is this base only solution which deletes the headings and reads in what is left setting the column names as in the prior solution. Again, omit the col.names argument if column names are not important.
read.csv(text = gsub("\\w*=", "", L), header = FALSE,
col.names = unlist(strsplit(L[1], "=.*?(,|$)")))

Related

Split columns in a dataframe into a column that contains text not numbers and a column that contains numbers not text in R

Here is a simplified version of data I am working with:
a<-c("There are 5 programs", "2 - adult programs, 3- youth programs","25", " ","there are a number of programs","other agencies run our programs")
b<-c("four", "we don't collect this", "5 from us, more from others","","","")
c<-c(2,6,5,8,2,"")
df<-cbind.data.frame(a,b,c)
df$c<-as.numeric(df$c)
I want to keep both the text and numbers from the data b/c some of the text is important
expected output:
What I think makes sense is the following:
id all columns that have text in them, perhaps in a list (because some columns are just numbers)
subset columns from step 1 to a new dataframe lets call this df1
delete the subsetted columns in df1 from df
split all the columns in df1 into 2 columns, one that keeps the text and one that has the number.
bind the new spit columns from df1 into the orginal df
What I am struggling with is steps 1-2 and 4. I am okay with the characters (e.g., - and ') being excluded or included. There is additional processing I have to do after (e.g., when there are multiple numbers in a column after splitting I will need to split and add these and also address the written numbers), but those are things I can do.
Here's a dplyr solution using regular expression:
library(stringr)
library(dplyr)
df %>%
mutate(
a.text = gsub("(^|\\s)\\d+", "", a),
a.num = str_extract_all(a, "\\d+"),
b.text = gsub("(^|\\s)\\d+", "", b),
b.num = str_extract_all(b, "\\d+")
) %>%
select(c(4:7,3))
a.text a.num b.text b.num c
1 There are programs 5 four 2
2 - adult programs,- youth programs 2, 3 we don't collect this 6
3 25 from us, more from others 5 5
4 8
5 there are a number of programs 2
6 other agencies run our programs NA
Here is what I would do with my preferred tools. The solution will work with arbitrary numbers of arbitrarily named character and non-character columns.
library(data.table) # development version 1.14.3 used here
library(magrittr) # piping used to improve readability
num <- \(x) stringr::str_extract_all(x, "\\d+", simplify = TRUE) %>%
apply(1L, \(x) sum(as.integer(x), na.rm = TRUE))
txt <- \(x) stringr::str_remove_all(x, "\\d+") %>%
stringr::str_squish()
setDT(df)[, lapply(
.SD, \(x) if (is.character(x)) data.table(txt = txt(x), num = num(x)) else x)]
which returns
a.txt a.num b.txt b.num c
<char> <int> <char> <int> <num>
1: There are programs 5 four 0 2
2: - adult programs, - youth programs 5 we don't collect this 0 6
3: 25 from us, more from others 5 5
4: 0 0 8
5: there are a number of programs 0 0 2
6: other agencies run our programs 0 0 NA
Explanation
num() is a function which uses the regular expression \\d+ to extract all strings which consist of contiguous digits (aka integer numbers), coerces them to type integer, and computes the rowwise sum of the extracted numbers (as requested in OP's last sentence).
txt() is a function which removes all strings which consist of contiguous digits (aka integer numbers), removes whitespace from start and end of the strings and reduces repeated whitespace inside the strings.
\(x) is a new shortcut for function(x) introduced with R version 4.1
The next steps implement OP's proposed approach in data.table syntax, by and large:
lapply(.SD, ...) loops over each column of df.
if the column is character both functions txt() and num() are applied. The two resulting vectors are turned into a data.table as a partial result. Note that cbind() cannot be used here as it would return a character matrix.
if the column is non-character it is returned as is.
The final result is a data.table where the column names have been renamed automagically.
This approach keeps the relative position of columns.

Extract String Part to Column in R

Consider the following dataframe:
status
1 file-status-done-bad
2 file-status-maybe-good
3 file-status-underreview-good
4 file-status-complete-final-bad
We want to extract the last part of status, wherein part is delimited by -. Such:
status status_extract
1 file-status-done-bad done
2 file-status-maybe-good maybe
3 file-status-ok-underreview-good underreview
4 file-status-complete-final-bad final
In SQL this is easy, select split_part(status, '-', -2).
However, the solutions I've seen with R either operate on vectors or are messy to extract particular elements (they return ALL elements). How is this done in a mutate chain? The below is a failed attempt.
df %>%
mutate(status_extract = str_split_fixed(status, pattern = '-')[[-2]])
Found the a really simple answer.
library(tidyverse)
df %>%
mutate(status_extract = word(status, -1, sep = "-"))
In base R you can combine the functions sapply and strsplit
df$status_extract <- sapply(strsplit(df$status, "-"), function(x) x[length(x) - 1])
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
You can use map() and nth() to extract the nth value from a vector.
library(tidyverse)
df %>%
mutate(status_extract = map_chr(str_split(status, "-"), nth, -2))
# status status_extract
# 1 file-status-done-bad done
# 2 file-status-maybe-good maybe
# 3 file-status-underreview-good underreview
# 4 file-status-complete-final-bad final
which is equivalent to a base version like
sapply(strsplit(df$status, "-"), function(x) rev(x)[2])
# [1] "done" "maybe" "underreview" "final"
You can use regex to get what you want without splitting the string.
sub('.*-(\\w+)-.*$', '\\1', df$status)
#[1] "done" "maybe" "underreview" "final"

Sum number in a character string (R)

I have a vector that looks like :
numbers <- c("1/1/1", "1/0/2", "1/1/1/1", "2/0/1/1", "1/2/1")
(not always the same number of "/" character)
How can I create another vector with the sum of the numbers of each string?
Something like :
sum
3
3
4
4
4
One solution with strsplit and sapply:
sapply(strsplit(numbers, '/'), function(x) sum(as.numeric(x)))
#[1] 3 3 4 4 4
strsplit will split your stings on / (doesn't matter how many /s you have). The output of strsplit is a list, so we iterate over it to calculate the sum with sapply.
What seems to me to be the most straightforward approach here is to convert your number strings to actual valid string arithmetic expressions, and then evaluate them in R using eval along with parse. Hence, the string 1/0/2 would become 1+0+2, and then we can simply evaluate that expression.
sapply(numbers, function(x) { eval(parse(text=gsub("/", "+", x))) })
1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
3 3 4 4 4
Demo
1) strapply strapply matches each string of digits using \\d+ and then applies as.numeric to it returning a list with one vector of numbers per input string. We then apply sum to each of those vectors. This solution seems particularly short.
library(gsubfn)
sapply(strapply(numbers, "\\d+", as.numeric), sum)
## [1] 3 3 4 4 4
2) read.table This applies sum(read.table(...)) to each string. It is a bit longer (but still only one line of code) but uses no packages.
sapply(numbers, function(x) sum(read.table(text = x, sep = "/")))
## 1/1/1 1/0/2 1/1/1/1 2/0/1/1 1/2/1
## 3 3 4 4 4
Add the USE.NAMES = FALSE argument to sapply if you don't want names on the output.
scan(textConnection(x), sep = "/", quiet = TRUE) could be used in place of read.table but is longer.

Counting letters before and after a letter

I have an excel file of a list of sequences. How would I go about getting the number of times a letter appears before a letter in square brackets? An example of an entry is below.
GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT
I'd also like to do this for the letter after the square brackets.
Edit: Apologies for the confusion. Take the example below. Id like to count how many times A, C, G, and T appears immediately before and after the letter in square brackets (for which there is only one per line). So to count the occurences of A[A]A, A[A]C, C[A]A, and so on. The file is in excel, and I'm happy to use any method in excel, R or in Linux.
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT
You could split the original string into parts. From the start of the string to the first [ and from the first ] to the end of the string.
int count = firstPart.Count(f => f == 'a');
count += secondPart.Count(f => f == 'a');
Option Explicit
Sub test()
Dim seq As String
seq = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
Debug.Print CountLetter("A", seq)
End Sub
Function CountLetter(letter As String, ByVal sequence As String) As Long
'--- assumes the letter in the brackets is the same as that being counted
Dim allLetters() As String
allLetters = Split("A,C,G,T", ",")
Dim letterToDelete As Variant
For Each letterToDelete In allLetters
If letterToDelete <> letter Then
sequence = Replace(sequence, letterToDelete, "")
End If
Next letterToDelete
CountLetter = Len(sequence) - 1
End Function
x = "GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT"
#COUNT 'A'
sapply(unlist(strsplit(x,"\\[[A-z]\\]")), function(a) length(unlist(gregexpr("A", a))))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 3 4
#COUNT 'G'
sapply(unlist(strsplit(x,"\\[[A-z]\\]")), function(a) length(unlist(gregexpr("G", a))))
# GTCCTGGTTGTAGCTGAAGCTCTTCCC CTCCTCCCGATCACTGGGACGTCCTATGT
# 7 6
New R solution (after clarification by OP)
Let's assume the data have been read from Excel into a data.table called los (list of sequences) which has only one column called sequence. Then, the occurences can be counted as follows:
library(data.table)
los[, .N, by = stringr::str_extract(sequence, "[ACGT]\\[[ACGT]\\][ACGT]")]
# stringr N
#1: C[A]C 8
#2: A[A]C 5
#3: C[A]G 1
#4: G[A]G 1
#5: G[A]C 1
#6: T[A]C 1
str_extract() looks for one of the letters A, C, G, T followed by [ followed by one of the letters A, C, G, T followed by ] followed by one of the letters A, C, G, T in column sequence and extracts the matching substrings. Then, los is grouped by the substrings and the number of occurences is counted (.N).
Data
If the Excel file is stored in CSV format then it can be read using data.table's fread() function like this
los <- fread("your_file_name.csv")
(Perhaps, some parameters to fread() might need to be adjusted for the specific file.)
However, some data already are provided in the question. These can be read as character string using fread() as well:
los <- fread("sequence
CCCACCCGCCAGGAAGCCGCTATCACTGTCCAAGTTGTCATCGGAACTCC[A]CCAGCCTGTGGACTTGGCCTGGTGCCGCCCATCCCCCTTGCGGTCCTTGC
ACCACTACCCCCTTCCCCACCATCCACCTCAGAAGCAGTCCCAGCCTGCC[A]CCCGCCAGCCCCTGCCCAGCCCTGGCTTTTTGGAAACGGGTCAGGATTGG
TTTGCTTTAAAATACTGCAACCACTCCAGGTAAATCTTCCGCTGCCTATA[A]CCCCGCCAATGAGCCTGCACATCAGGAGAGAAAGGGAAGTAACTCAAGCA
GAAATCTTCTGAAACAGTCTCCAGAAGACTGTCTCCAAATACACAGCAGA[A]CCAGCCAGTCCACAGCACTTTACCTTCTCTATTCTCAGATGGCAATTGAG
GGACTGCCCCAAGGCCCGCAGGGAGGTGGAGCTGCACTGGCGGGCCTCCC[A]GTGCCCGCACATCGTACGGATCGTGGATGTGTACGAGAATCTGTACGCAG
GGCCCAACGCCATCCTGAAACTCACTGACTTTGGCTTTGCCAAGGAAACC[A]CCAGCCACAACTCTTTGACCACTCCTTGTTATACACCGTACTATGTGGGT
TCTGCCTGGTCCGCTGGAGCTGGGCATTGAAGCCCCGCAGCTGCTCAGCC[A]CCTGCCCCGCCATCAAGAAGGCCCCACCGGCCCTGGGAAGGACACCCCTG
TTTGAAGCCCTTATGAACCAAGAAACCTTCGTTCAGGACCTCAAAATCAA[A]CCCCGCCACATGCAGCTCGCAGGCCTGCAGGAGGAAAGACAGGTTAGCAA
CTGCAGCCTACCTGTCCATGTCCCAGGGGGCCGTTGCCAACGCCAACAGC[A]CCCCGCCGCCCTATGAGCGTACCCGCCTCTCCCCACCCCGGGCCAGCTAC
ACTGGCAAACATGTTGAGGACAATGATGGAGGGGATGAGCTTGCATAGGA[A]CCTGCCGTAGGGCCACTGTCCCTGGAGAGCCAAGTGAGCCAGCGAGAAGG
CACCCTCAGAGAAGAAGAAAGGAGCTGAGGAGGAGAAGCCAAAGAGGAGG[A]GGCAGGAGAAGCAGGCAGCCTGCCCCTTCTACAACCACGAGCAGATGGGC
CCAGCCCTGTATGAGGACCCCCCAGATCAGAAAACCTCACCCAGTGGCAA[A]CCTGCCACACTCAAGATCTGCTCTTGGAATGTGGATGGGCTTCGAGCCTG
TTCCTGTGCGCCCCAACAACTCCTTTAGCTGGCCTAAAGTGAAAGGACGG[A]CCTGCCAATGAAAATAGACTTTCAGGGTCTAGCAGAAGGCAAGACCACCA
CTAACACCCGCACGAGCTGCTGGTAGATCTGAATGGCCAAGTCACTCAGC[A]CCTGCCGATACTCAGCCAGGTCAAAATTGGTGAGGCAGTGTTCATTCTGG
AGTTCTGCATCTGGAGCAAATCCTTGGCACTCCCTCATGCTGGCTATCAC[A]CCTGCCACGAATGTGCCATGGCCCAACCCTGCAGTCCATAAAGAAAACAA
CGTGCCCATGCAGCTAGTGCTCTTCCGAGAGGCTATTGAACACAGTGAGC[A]CCTGCCACGCCTATCCCCTTCCCCATCATCTCAGTGATGGGGTATGTCTA
ACAAGGACCTGGCCCTGGGGCAGCCCCTCAGCCCACCTGGTCCCTGCCTT[A]CCCAGCCAGTACTCTCCATCAGCACGGCCGAAGCCCAGCTTGTAGTCATT")
Old solution (before clarification by OP) - left here for reference
This is a solution in base R with help of the stringr package which will work with a "list" of sequences (a data.frame), any single letter enclosed in square brackets, and arbitrary lengths of the sequences. It assumes that the data already have been read from file into a data.frame which is named los here.
# create data: data frame with two sequences
los <- data.frame(
sequence = c("GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT",
"GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT"))
# split sequences in three parts
mat <- stringr::str_split_fixed(los$sequence, "[\\[\\]]", n = 3)
los$letter <- mat[, 2]
los$n_before <- stringr::str_count(mat[, 1], mat[, 2])
los$n_after <- stringr::str_count(mat[, 3], mat[, 2])
print(los)
# sequence letter n_before n_after
#1 GTCCTGGTTGTAGCTGAAGCTCTTCCC[A]CTCCTCCCGATCACTGGGACGTCCTATGT A 3 4
#2 GTCCTGGTTGTAGCTGAAGCTCTTCCCACT[C]CTCCCGATCACTGGGACGTCCTATGT C 9 9
Note this code works best if there is exactly one pair of square brackets in each sequence. Any additional brackets will be ignored.
It will also work if there is more than just one letter enclosed in brackets, e.g., [GT].
I'm confessing that I'm addicted to Hadley Wickham's stringr package because I have difficulties to remember the inconsistently named base R functions for string maninpulation like strsplit, grepl, sub, match, gregexpr, etc. To understand what I mean please have a look at the Usage and See Also sections of ?grep and compare to stringr.
I would think that R packages for bioinformatics, such as seqinr or Biostrings, would be a good starting point. However, here's a "roll your own" solution.
First step: get your data from Excel into R. I will assume that file mydata.xlsx contains one sheet with a column of sequence and no header. You need to adapt this for your file and sheet format.
library(readxl)
sequences <- read_excel("mydata.xlsx", col_names = FALSE)
colnames(sequences) <- "sequence"
Now you need a function to extract the base in square brackets and the bases at -1 and +1. This function uses the stringr package to extract bases using regular expressions.
get_bases <- function(seq) {
require(stringr)
require(magrittr)
subseqs <- str_match(seq, "^([ACGT]+)\\[([ACGT])\\]([ACGT]+)$")
bases <- list(
before = subseqs[, 2] %>% str_sub(-1, -1),
base = subseqs[, 3],
after = subseqs[, 4] %>% str_sub(1, 1)
)
return(bases)
}
Now you can pass the column of sequences to the function to generate a list of lists, which can be converted to a data frame.
library(purrr)
sequences_df <- lapply(sequences, get_bases) %>%
map_df(as.data.frame, stringsAsFactors = FALSE)
head(sequences_df, 3)
before base after
1 C A C
2 C A C
3 A A C
The last step is to use functions from dplyr and tidyr to count up the bases.
library(tidyr)
sequences_df %>%
gather(position, letter, -base) %>%
group_by(base, position, letter) %>%
tally() %>%
spread(position, n) %>%
select(base, letter, before, after)
Result using your 17 example sequences. I would use better names than I did if I were you: base = the base in square brackets, letter = the base being counted, before = count at -1, after = count at +1.
base letter before after
* <chr> <chr> <int> <int>
1 A A 5 NA
2 A C 9 15
3 A G 2 2
4 A T 1 NA

Read data with space character in R

Usually, read.table will solve many data input problems personally. Like this one:
China 2 3
USA 1 4
Sometimes, the data can madden people, like:
Chia 2 3
United States 3 4
So the read.table cannot work, and any assistance is appreciated.
P.S. the format of data file is .dat
First set up some test data:
# create test data
cat("Chia 2 3
United States 3 4
", file = "file.with.spaces.txt")
1) Using the above read in the data, insert commas between fields and re-read:
L <- readLines("file.with.spaces.txt")
L2 <- sub("^(.*) +(\\S+) +(\\S+)$", "\\1,\\2,\\3", L) # 1
DF <- read.table(text = L2, sep = ",")
giving:
> DF
V1 V2 V3
1 Chia 2 3
2 United States 3 4
2) Another approach. Using L from above, replace the last string of spaces with comma twice (since there are three fields):
L2 <- L
for(i in 1:2) L2 <- sub(" +(\\S+)$", ",\\1", L2) # 2
DF <- read.table(text = L2, sep = ",")
ADDED second solution. Minor improvements.
If the column seperator 'sep' is indeed a whitespace, it logically cannot differentiate between spaces in a name and spaces that actually seperate between columns. I'd suggest to change your country names to single strings, ie, strings without spaces. Alternatively, use semicolons to seperate between your data colums and use:
data <- read.table(foo.dat, sep= ";")
If you have many rows in your .dat file, you can consider using regular expressions to find spaces between the columns and replace them with semicolons.

Resources