I am trying to convert data that I have in a txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated onto individual rows:
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data <- strsplit(as.character(mydata), ";")
But it does not work. separate_data in this case consists of only one element:
[[1]]
[1] "1"
The OP doesn't directly state whether the raw data file contains multiple observations of a single variable, or should be broken into n-tuples. Since the OP does state that read.table() results in a single row where they expect multiple rows, we can conclude that the correct technique is to use scan(), not read.table().
If the data in the raw data file represents a single variable, then the solution posted in the comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
Here is an approach using scan() that reads the file into a vector and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData), sep = ";")
columns <- 5                                  # set desired number of columns
observations <- length(value) / columns
observation <- rep(1:observations, each = columns)
variable <- rep(1:columns, times = observations)
data.frame(observation, variable, value)
...and the output:
> data.frame(observation, variable, value)
   observation variable    value
1            1        1 4.094573
2            1        2 4.079999
3            1        3 4.068667
4            1        4 4.059601
5            1        5 4.052183
6            2        1 4.094573
7            2        2 4.079999
8            2        3 4.068667
9            2        4 4.059601
10           2        5 4.052183
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
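For instance (a small sketch; long is just the data frame built above, and wide is an illustrative name):
library(reshape2)
long <- data.frame(observation, variable, value)
# one row per observation, one column per variable
wide <- dcast(long, observation ~ variable, value.var = "value")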
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
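Under that same divisibility assumption, you could also skip the long format entirely and reshape with matrix(); a minimal sketch:
# fill row by row: each row becomes one observation of 'columns' variables
wide2 <- as.data.frame(matrix(value, ncol = columns, byrow = TRUE))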
I would like to find elements of the second column of a data frame that match elements of the first column, and create trigrams using the matched element as the middle element of the trigram. In case of no match, the middle and last elements of the trigram are the unmatched second-column element. Here is an example:
gdf <- data.frame(from = c(1,2,3,4,5), to = c(2,3,1,5,6), stringsAsFactors = FALSE)
gdf
# from to
# 1 2
# 2 3
# 3 1
# 4 5
# 5 6
The desired output trigrams are as follows:
from middle to
1 2 3
2 3 1
3 1 2
4 5 6
5 6 6
My code with nested for loops takes a long time to process my huge data set; my data set has 54,304 rows.
This is what I wrote:
num <- nrow(gdf)
df2 <- data.frame(from = character(0), middle = character(0), to = character(0),
                  stringsAsFactors = FALSE)
count <- rep(0, nrow(gdf))
for (row in 1:nrow(gdf)) {
  for (rowc in 1:nrow(gdf)) {
    if (gdf[rowc, ]$from == gdf[row, ]$to) {
      df2[nrow(df2) + 1, ] <- c(gdf[row, ]$from, gdf[row, ]$to, gdf[rowc, ]$to)
      count[row] <- row
    }
  }
  if (count[row] == 0) {
    df2[nrow(df2) + 1, ] <- c(gdf[row, ]$from, gdf[row, ]$to, gdf[row, ]$to)
  }
}
Any help would be greatly appreciated!
Not sure if your example is too simple for this to work on the real data set, but a simple merge works for the example. I then reorder (and rename) the columns, since merge places the column you merge by first.
Merged <- setNames(merge(gdf, gdf, by.x = "to", by.y = "from")[, c(2, 1, 3)],
                   c("from", "middle", "to"))
Then you can add in the no-match elements using a row bind:
noMatch <- gdf[!paste(gdf$from, gdf$to) %in% paste(Merged$from, Merged$middle), ]
rbind(Merged, setNames(noMatch[, c(1, 2, 2)], c("from", "middle", "to")))
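If every from value is unique (as in the example), a vectorized match() lookup avoids the nested loops entirely; a sketch under that assumption (idx and trigrams are illustrative names):
# for each row, find the row whose 'from' equals this row's 'to'
idx <- match(gdf$to, gdf$from)
trigrams <- data.frame(from   = gdf$from,
                       middle = gdf$to,
                       # no match: repeat the unmatched 'to' as the last element
                       to     = ifelse(is.na(idx), gdf$to, gdf$to[idx]))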
I am reading in parameter estimates from some results files that I would like to compare side by side in a table, but I can't get the data frame into the structure I want (parameter name, values from file 1, values from file 2).
When I read in the files, I get a wide data frame with each parameter in a separate column, which I would like to transform to "long" format using melt. But that gives only one column of values. Any idea how to get several value columns without using a for loop?
library(reshape2)

paraA <- c(1, 2)
paraB <- c(6, 8)
paraC <- c(11, 9)
Source <- c("File1", "File2")
parameters <- data.frame(paraA, paraB, paraC, Source)
wrong_table <- melt(parameters, id.vars = "Source")
You can use melt in combination with dcast to get what you want. This is in fact the intended pattern of use, which is why the functions have the names they do:
m <- melt(parameters, id.vars = "Source")
dcast(m, variable ~ Source)
# variable File1 File2
# 1 paraA 1 2
# 2 paraB 6 8
# 3 paraC 11 9
Converting @alexis's comment to an answer, transpose (t()) pretty much does what you want:
setNames(data.frame(t(parameters[1:3])), parameters[, "Source"])
# File1 File2
# paraA 1 2
# paraB 6 8
# paraC 11 9
I've used setNames above to conveniently rename the resulting data.frame in one step.
This seems basic, but I can't figure it out. I am trying to compute a frequency table in R for the data below:
1 2
2 1
3 1
I want to export the two-way frequencies to CSV output, where the rows are all the unique entries in column A of the data, the columns are all the unique entries in column B, and the cell values are the number of times each combination occurred. I have explored constructs like table, but I am not able to output the values correctly in CSV format.
Desired output for the sample data:
"","1","2"
"1",0,1
"2",1,0
"3",1,0
The data:
df <- read.table(text = "1 2
2 1
3 1")
Calculate frequencies using table:
(If your object is a matrix, you can convert it to a data frame using as.data.frame before calling table; see the sketch after the output below.)
tab <- table(df)
   V2
V1  1 2
  1 0 1
  2 1 0
  3 1 0
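To illustrate the parenthetical note above, a small sketch with a hypothetical matrix m holding the same two columns (m and tab_m are illustrative names):
m <- cbind(c(1, 2, 3), c(2, 1, 1))   # matrix version of the same data
tab_m <- table(as.data.frame(m))     # convert to a data frame first, then tabulate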
Write data with the function write.csv:
write.csv(tab, "tab.csv")
The resulting file:
"","1","2"
"1",0,1
"2",1,0
"3",1,0
I have done a lot of googling, but I didn't find a satisfactory solution to my problem.
Say we have data file as:
Tag v1 v2 v3
A 1 2 3
B 1 2 2
C 5 6 1
A 9 2 7
C 1 0 1
The first line is the header. The first column is the group ID (the data have three groups: A, B, C), while the other columns contain values.
I want to read this file in R so that I can apply different functions on the data.
For example, I tried to read the file and compute the column means:
dt <- read.table(file_name, header = TRUE)  # gives warnings
apply(dt, 2, mean)                          # gives NA NA NA
I want to read this file and get the column means. Then I want to separate the data into three groups (according to Tag A, B, C) and calculate the mean (column-wise) for each group. Any help?
apply(dt, 2, mean) doesn't work because apply coerces the first argument to an array via as.matrix (as is stated in the first paragraph of the Details section of ?apply). Since the first column is character, all elements in the coerced matrix object will be character.
Try this instead:
sapply(dt[-1], mean) # works because data.frames are lists; drop the character Tag column first
To calculate column means by groups:
# using base functions
grpMeans1 <- t(sapply(split(dt[, c("v1", "v2", "v3")], dt[, "Tag"]), colMeans))

# using plyr
library(plyr)
grpMeans2 <- ddply(dt, "Tag", function(x) colMeans(x[, c("v1", "v2", "v3")]))
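An equivalent base-R sketch using aggregate() with formula syntax, in case you prefer it (grpMeans3 is an illustrative name):
# one row per Tag, mean of each value column
grpMeans3 <- aggregate(cbind(v1, v2, v3) ~ Tag, data = dt, FUN = mean)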
I am a relatively new R user, and most of the complex coding (and packages) looks like Greek to me. It has been a long time since I used a programming language (Java/Perl) and I have only used R for very simple manipulations in the past (basic loading data from file, subsetting, ANOVA/T-Test). However, I am working on a project where I had no control over the data layout and the data file is very lengthy.
In my data, I have 172 rows, each representing a participant in a survey, and 158 columns, each representing a question number. The answers for each question are 1-5. The raw data uses the number "99" to indicate that a question was not answered. I need to exclude any question a participant did not answer without excluding the entire participant.
Part Q001 Q002 Q003 Q004
1 2 4 99 2
2 3 99 1 3
3 4 4 2 5
4 99 1 3 2
5 1 3 4 2
In the past I have used subset to filter my data:
data.filter <- subset(data, Q001 != 99)
That works fine when all my answers are contained in one column; it just deletes the whole row where the answer is unavailable.
However, with the answers in this set spread across 158 columns, if I subset out 99 in column 1 (Q001), I also filter out that entire Participant.
I'd like to know if there is a way to filter/subset the data so that my large data set ends up having blanks wherever a "99" occurred, so that the 99s do not inflate or otherwise interfere with the statistics I run on the rest of the numbers. I need to be able to calculate means per question and run ANOVAs and t-tests on various questions.
Resp Q001 Q002 Q003 Q004
1 2 4 2
2 3 1 3
3 4 4 2 5
4 1 3 2
5 1 3 4 2
Is this possible to do in R? I've tried to filter the data before loading it into R, but R won't read the data file in when I have blanks, and I'd like to be able to use the whole data set without creating a subset for each question (which I will do if I have to... it's just time-consuming if there is a better approach or package to use).
Any assistance would be greatly appreciated!
You could replace the 99s with NA and then calculate the column means omitting NAs:
df <- replicate(20, sample(c(1, 2, 3, 99), 4))  # toy data: 4 respondents x 20 questions
colMeans(df)                                    # no good: the 99s inflate the means
dfc <- df
dfc[dfc == 99] <- NA                            # recode 99 as missing
colMeans(dfc, na.rm = TRUE)                     # means that ignore the NAs
You can also indicate which values are NAs when you read in your data. For your particular case:
mydata <- read.table("dat_base", header = TRUE, na.strings = "99")
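From there, per-question means are a one-liner; a small sketch, assuming the first column is the participant ID (question_means is an illustrative name):
# drop the ID column, then average each question, ignoring unanswered (NA) entries
question_means <- colMeans(mydata[-1], na.rm = TRUE)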