How to read numbers with irregular spaces in R

I have my data in a txt file containing the following numbers. How can I read it into R?
I tried fread, but it did not work:
Error in fread("x.txt") :
Expected sep (' ') but new line, EOF (or other non printing character) ends field 0 when detecting types ( first):
Here is the data:
2 3 3 2 1 2 3 2 3 2 1 3 1 2
1 1 3 2 3 1 2 1 2 3 3 2
3 1 1 1 2 1 1 3 1 2 2 2
1 3 1 1 3 2 3 3 1 1 2 2
1 3 2 3 2 1 3 1 1 1 3 1
1 3 1 2 3 3 2 2 2 2 3 3
1 3 2 3 2 3 2 2 2 1 3 1
3 2 1 2 2 3 3 2 3 2 3 3
2 1

Try scan(), which reads whitespace-separated values no matter how many appear on each line:
x <- scan("x.txt")
data <- as.data.frame(x)
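As a small self-contained sketch of why scan() copes with this layout: it returns one flat vector of whitespace-separated values, so lines of different lengths do not matter. The temporary file, the shortened three-line sample, and the 7-column matrix are assumptions made for illustration only; with the real file, scan("x.txt") is all that is needed.

```r
# Write a shortened, ragged sample (14, 12 and 2 values per line) to a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("2 3 3 2 1 2 3 2 3 2 1 3 1 2",
             "1 1 3 2 3 1 2 1 2 3 3 2",
             "2 1"), tmp)

# scan() ignores line boundaries and returns a single numeric vector.
x <- scan(tmp, quiet = TRUE)
length(x)

# One value per row, as in the answer above.
data <- as.data.frame(x)

# If a rectangular layout is wanted instead, fill a matrix row by row.
# The choice of 7 columns is arbitrary; 28 values give 4 complete rows.
m <- matrix(x, ncol = 7, byrow = TRUE)
```

fread, by contrast, expects every row to have the same number of fields, which is probably why it stopped on this file.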

Related

discrete choice experiment data preparation for analysis using GMNL package

I have conducted a discrete choice experiment using Google Forms and written up the results as a CSV file in Excel. I am having trouble understanding how to get the data from a standard CSV format into a format that I can analyse using the gmnl package.
I am using the data below, which has been dummy coded:
personid choiceid alt payment management assessment crop
1 1 1 3 2 2 3
1 2 2 2 2 1 3
1 3 1 3 2 1 3
1 4 1 2 1 3 1
1 5 1 2 1 3 1
1 6 2 1 1 2 1
1 7 2 3 1 2 3
1 8 2 3 1 2 3
1 9 2 3 1 1 2
1 10 2 3 1 1 2
1 11 2 3 1 2 1
1 12 2 2 1 1 3
1 13 3 1 2 1 1
1 14 2 1 1 2 3
1 15 2 2 1 2 2
1 16 2 1 1 1 3
2 17 3 1 2 1 2
2 18 3 1 3 1 2
2 19 1 3 1 1 3
test <- as.data.frame(testchoices)
choices <- mlogit.data(test, shape = "long", idx = list(c("choiceid", "personid")),
idnames = c("management", "crops", "assessment", "price"))
write_csv(choices, "choicesnext.csv")
It works fine up to write_csv, where this error is thrown:
Error in `[.data.frame`(x, start:min(NROW(x), start + len)) : undefined columns selected
I would be grateful for any assistance.

Hierarchical Clustering produces list instead of hclust

I have been doing some hierarchical clustering in R. It has worked fine up until now, producing hclust objects left, right and center, but suddenly not anymore. Now it only produces lists when I run:
mydata.clusters <- hclust(dist(mydata[, 1:8]))
mydata.clustercut <- cutree(mydata.clusters, 4)
and when I try:
table(mydata.clustercut, mydata$customer_lifetime)
it doesn't produce a table, but an endless print of the values (I'm guessing from the list).
The cutree function returns the group to which each observation belongs. For example:
iris.clust <- hclust(dist(iris[,1:4]))
iris.clustcut <- cutree(iris.clust, 4)
iris.clustcut
# [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2
# [52] 2 2 3 2 3 2 3 2 3 3 3 3 2 3 2 3 3 2 3 2 3 2 2 2 2 2 2 2 3 3 3 3 2 3 2 2 2 3 3 3 2 3 3 3 3 3 2 3 3 2 2
# [103] 4 2 2 4 3 4 2 4 2 2 2 2 2 2 2 4 4 2 2 2 4 2 2 4 2 2 2 4 4 4 2 2 2 4 2 2 2 2 2 2 2 2 2 2 2 2 2 2
Additional comparison can then be done by using this as a grouping variable for the observed data:
new.iris <- data.frame(iris, gp=iris.clustcut)
# example to visualise quickly the Species membership of each group
library(ggplot2)
ggplot(new.iris, aes(gp, fill = Species)) +
  geom_bar()
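On the "list instead of hclust" worry, a minimal sketch using iris: an hclust object is itself stored as a list with class "hclust", so seeing a list is expected, while cutree returns a plain vector of group numbers that table() can cross-tabulate. The binning of Sepal.Length with cut() is an assumption added for illustration; it shows one way to keep the table small if the second variable (such as customer_lifetime, which is not shown in the question) has many distinct values.

```r
iris.clust <- hclust(dist(iris[, 1:4]))

# An hclust object is a list carrying the class attribute "hclust".
class(iris.clust)
is.list(iris.clust)

# cutree() returns an integer vector of group memberships, not a list.
iris.clustcut <- cutree(iris.clust, 4)
is.list(iris.clustcut)

# Cross-tabulating against a factor with few levels gives a compact table.
tab <- table(iris.clustcut, iris$Species)

# A continuous variable yields one column per distinct value, which prints
# as a very wide table; binning it first keeps the output readable.
tab_binned <- table(iris.clustcut, cut(iris$Sepal.Length, breaks = 3))
```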

How to select only the last row among the subset of rows satisfying a condition in R programming

The data frame looks like this:
Customer_id A B C D E F G
10000001 1 1 2 3 1 3 1
10000001 1 2 3 1 2 1 3
10000002 2 2 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 5 2 4 7 2 4
10000003 1 5 2 6 3 7 2
10000003 1 1 2 2 1 2 1
10000004 1 2 3 1 2 3 1
10000004 1 3 2 3 1 3 2
10000004 1 3 2 1 3 2 1
10000004 1 4 1 4 1 3 1
10000006 1 2 3 4 5 1 2
10000006 1 3 1 4 1 2 1
10000008 2 3 2 3 2 1 2
10000008 2 3 1 1 2 1 2
10000008 1 3 1 1 2 2 1
There are multiple entries for each Customer_id. I need to create another data frame from this existing data frame. The new data frame should contain only the last row for every Customer_id. It should look like this:
10000001 1 2 3 1 2 1 3
10000002 2 2 1 4 2 3 1
10000003 1 1 2 2 1 2 1
10000004 1 4 1 4 1 3 1
10000006 1 3 1 4 1 2 1
10000008 1 3 1 1 2 2 1
Something like this (hard to code without the data in R format):
dataframe[ rev(!duplicated(rev(dataframe$Customer_id))),]
or, better:
dataframe[ !duplicated(dataframe$Customer_id,fromLast=TRUE),]
You can also use aggregate:
aggregate(. ~ Customer_id, data = DF, FUN = tail, 1)
## Customer_id A B C D E F G
## 1 10000001 1 2 3 1 2 1 3
## 2 10000002 2 2 1 4 2 3 1
## 3 10000003 1 1 2 2 1 2 1
## 4 10000004 1 4 1 4 1 3 1
## 5 10000006 1 3 1 4 1 2 1
## 6 10000008 1 3 1 1 2 2 1
Assuming your data is named dat, here's one way using by and rbind, although the other two methods (aggregate and duplicated) are much nicer:
> do.call(rbind, by(dat,dat$Customer_id,FUN=tail,1))
## Customer_id A B C D E F G
## 2 10000001 1 2 3 1 2 1 3
## 4 10000002 2 2 1 4 2 3 1
## 7 10000003 1 1 2 2 1 2 1
## 11 10000004 1 4 1 4 1 3 1
## 13 10000006 1 3 1 4 1 2 1
## 16 10000008 1 3 1 1 2 2 1
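For anyone who wants to run the approaches above, here is a self-contained sketch that builds the data frame from the question and checks that duplicated(..., fromLast = TRUE) and aggregate() pick the same rows. The names DF, last_rows and agg are chosen for illustration.

```r
DF <- read.table(header = TRUE, text = "Customer_id A B C D E F G
10000001 1 1 2 3 1 3 1
10000001 1 2 3 1 2 1 3
10000002 2 2 2 3 1 3 1
10000002 2 2 1 4 2 3 1
10000003 1 5 2 4 7 2 4
10000003 1 5 2 6 3 7 2
10000003 1 1 2 2 1 2 1
10000004 1 2 3 1 2 3 1
10000004 1 3 2 3 1 3 2
10000004 1 3 2 1 3 2 1
10000004 1 4 1 4 1 3 1
10000006 1 2 3 4 5 1 2
10000006 1 3 1 4 1 2 1
10000008 2 3 2 3 2 1 2
10000008 2 3 1 1 2 1 2
10000008 1 3 1 1 2 2 1")

# Scanning from the end, the first occurrence of each id is its last row.
last_rows <- DF[!duplicated(DF$Customer_id, fromLast = TRUE), ]

# aggregate() applies tail(x, 1) to every column within each Customer_id.
agg <- aggregate(. ~ Customer_id, data = DF, FUN = tail, 1)

# Both approaches should return the same values (row names differ).
all(as.matrix(agg) == as.matrix(last_rows))
```

The duplicated() version keeps the original row order and row names, whereas aggregate() sorts by Customer_id and renumbers the rows.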

Digits being neglected while performing N-gram in R

I want to get the counts of all character-level n-grams present in a text file.
I wrote a small piece of R code for this. However, the code ignores all the digits present in the text. Could anyone help me fix this issue?
Here is the code:
library(tau)
temp<-read.csv("/home/aravi/Documents/sample/csv/ex.csv",header=TRUE,stringsAsFactors=F)
r<-textcnt(temp, method="ngram",n=4L, decreasing=TRUE)
a<-data.frame(counts = unclass(r), size = nchar(names(r)))
b<-split(a,a$size)
b
Here is the contents of the input file:
abcd123
appl2345e
coun56ry
live123
names3423bsdf
coun56ryas
This is the output:
$`1`
counts size
_ 18 1
a 3 1
e 3 1
n 3 1
s 3 1
c 2 1
l 2 1
o 2 1
p 2 1
r 2 1
u 2 1
y 2 1
b 1 1
d 1 1
f 1 1
i 1 1
m 1 1
v 1 1
$`2`
counts size
_c 2 2
_r 2 2
co 2 2
e_ 2 2
n_ 2 2
ou 2 2
ry 2 2
s_ 2 2
un 2 2
_a 1 2
_b 1 2
_e 1 2
_l 1 2
_n 1 2
am 1 2
ap 1 2
as 1 2
bs 1 2
df 1 2
es 1 2
f_ 1 2
iv 1 2
l_ 1 2
li 1 2
me 1 2
na 1 2
pl 1 2
pp 1 2
sd 1 2
ve 1 2
y_ 1 2
ya 1 2
$`3`
counts size
_co 2 3
_ry 2 3
cou 2 3
oun 2 3
un_ 2 3
_ap 1 3
_bs 1 3
_e_ 1 3
_li 1 3
_na 1 3
ame 1 3
app 1 3
as_ 1 3
bsd 1 3
df_ 1 3
es_ 1 3
ive 1 3
liv 1 3
mes 1 3
nam 1 3
pl_ 1 3
ppl 1 3
ry_ 1 3
rya 1 3
sdf 1 3
ve_ 1 3
yas 1 3
$`4`
counts size
_cou 2 4
coun 2 4
oun_ 2 4
_app 1 4
_bsd 1 4
_liv 1 4
_nam 1 4
_ry_ 1 4
_rya 1 4
ames 1 4
appl 1 4
bsdf 1 4
ive_ 1 4
live 1 4
mes_ 1 4
name 1 4
ppl_ 1 4
ryas 1 4
sdf_ 1 4
yas_ 1 4
Could anyone tell me what I am missing or where I went wrong?
Thanks in advance.
The default value of the split argument in textcnt is "[[:space:][:punct:][:digit:]]+", so digits are treated as delimiters. Remove [:digit:] from that pattern and the digits will be kept.
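A minimal sketch of that change, assuming the tau package is installed. The input is given here as a character vector (txt, a name chosen for illustration) rather than read from the CSV file, and the split pattern simply drops [:digit:] from the default.

```r
library(tau)

# The sample strings from the question, as a plain character vector.
txt <- c("abcd123", "appl2345e", "coun56ry", "live123",
         "names3423bsdf", "coun56ryas")

# Split on whitespace and punctuation only, so digits stay inside the tokens.
r <- textcnt(txt, method = "ngram", n = 4L,
             split = "[[:space:][:punct:]]+", decreasing = TRUE)

# Same post-processing as in the question: group the n-grams by length.
a <- data.frame(counts = unclass(r), size = nchar(names(r)))
b <- split(a, a$size)

# n-grams containing digits should now be present.
head(names(r)[grepl("[0-9]", names(r))])
```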

Episode count for each row

I'm sure this has been asked before but for the life of me I can't figure out what to search for!
I have the following data:
x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4
And I would like to output a running count that resets every time either x or y changes value.
x y o
1 3 1
1 3 2
1 3 3
1 2 1
1 2 2
2 2 1
2 4 1
3 4 1
3 4 2
Try something like
df<-read.table(header=T,text="x y
1 3
1 3
1 3
1 2
1 2
2 2
2 4
3 4
3 4")
cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
> cbind(df,o=sequence(rle(paste(df$x,df$y))$lengths))
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
After seeing @ttmaccer's answer, I see my first attempt with ave was wrong, and this is perhaps what is needed:
> dat$o <- ave(dat$y, list(dat$y, dat$x), FUN=seq_along )
# seq_along avoids the warning that FUN=seq gives for single-row groups; the answer is the same.
> dat
x y o
1 1 3 1
2 1 3 2
3 1 3 3
4 1 2 1
5 1 2 2
6 2 2 1
7 2 4 1
8 3 4 1
9 3 4 2
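The two answers agree on this data, but they are not the same technique: rle counts within consecutive runs, while ave counts within each (x, y) combination wherever it occurs. The sketch below reproduces the rle result and then uses a small made-up vector (key2, an assumption for illustration) to show where the two diverge.

```r
df <- data.frame(x = c(1, 1, 1, 1, 1, 2, 2, 3, 3),
                 y = c(3, 3, 3, 2, 2, 2, 4, 4, 4))

# Run-based count: restarts whenever the (x, y) pair changes from the previous row.
df$o <- sequence(rle(paste(df$x, df$y))$lengths)
df$o

# Hypothetical key in which "a" reappears after an interruption.
key2 <- c("a", "a", "b", "a")

# Run-based: the second block of "a" starts again at 1.
run_count <- sequence(rle(key2)$lengths)

# Group-based: the last "a" continues the count from the first block.
group_count <- ave(seq_along(key2), key2, FUN = seq_along)
```

If a combination can reappear later in the data and the count should reset each time, the rle version is the one that matches the question.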
