I'm new to R but want to use its statistics tools on some collected data. I'm trying to import raw data from an instrument's output; to do so, I need to strip out the useless comments left over from the machine's display and then separate the multiple samples into their own data frames. The data comes out as:
////this is some preamble
////for sample 1 that would graph
////data on the machines display
1 10
2 20
3 30
///This is the preamble
////for the second sample
1 11
2 19
3 32
4 41
5 50
////this is closing statements
////and final plot command
////for the machine's display
I'm currently trying to import it with whitespace delimiters. If I only had the one sample, I know I can just skip the first four lines and add the column names later, as:
library(readr)
DATA <- read_table2("DATA.txt", col_names = FALSE, skip = 4)
colnames(DATA) <- c("X","Y")
But I can't figure out how to separate sample 2 and the remainder of the unimportant text.
Another problem that might arise is that the separation between sample one and sample two happens on a different line depending on the file, so I figure I need to import the text file and scan through it before even making tables.
I know this is a bit of a cluster, but I appreciate any help.
This should get you started. I'm using base R but you can always convert to a tibble later if you want.
DATA <- read.table("DATA.txt", header = FALSE, comment.char = "/")
colnames(DATA) <- c("X", "Y")
begin <- which(DATA$X == 1)                        # first row of each sample
end <- c(begin[-1] - 1, nrow(DATA))                # last row of each sample
groups <- mapply(":", begin, end, SIMPLIFY = FALSE)
DATA.lst <- lapply(groups, function(g) DATA[g, ])
names(DATA.lst) <- sprintf("Group%02i", seq_along(DATA.lst))
DATA.lst
# $Group01
# X Y
# 1 1 10
# 2 2 20
# 3 3 30
#
# $Group02
# X Y
# 4 1 11
# 5 2 19
# 6 3 32
# 7 4 41
# 8 5 50
DATA.lst is a list of data frames. You can extract them into separate objects with the following code, but that may not be your best option if you plan to perform the same analyses on each. R makes it easy to process all of the data frames in a list, which saves you from writing the same code for each one:
for (i in seq_along(DATA.lst)) assign(names(DATA.lst)[i], DATA.lst[[i]])
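For example, the same analysis can be run on every sample at once with lapply(); the lm() fit below is only an illustration of the pattern, not something your data necessarily needs:
lapply(DATA.lst, summary)                                   # summarize every sample
fits <- lapply(DATA.lst, function(d) lm(Y ~ X, data = d))   # fit the same model to each sample
lapply(fits, coef)                                          # collect the coefficients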
I am working on a project for a client who wants their charts done in Excel.
One of the charts I need to make is of cumulative hazards, which I get with survival::survfit.
My problem is that Excel can't do stepwise charts, so I need to transform the data so that every time point occurs twice: once with the previous cumulative hazard and once with the cumulative hazard at that time point.
It is relatively easy, but annoying and time-consuming, to do this in Excel. Is there a smart way of doing it in R?
I am a relatively new R user, and I have not been able to figure out a way to do this.
I have shown what I get and what I want below:
#Load survival package
library(survival)
#Create survfit object
Survival_Function <- survfit(Surv(lung$time,
lung$status == 2)~1)
#extract cumulative hazards
cumhaz <- data.frame(Survival_Function$time, Survival_Function$cumhaz)
head(cumhaz)
Gives me the following:
Survival_Function.time Survival_Function.cumhaz
1 5 0.004385965
2 11 0.017601824
3 12 0.022066110
4 13 0.031034720
5 15 0.035559606
6 26 0.040105061
But for Excel to make the charts properly, I'd need it to look like this:
Survival_Function.time Survival_Function.cumhaz
1 5 0.004385965
2 11 0.004385965
3 11 0.017601824
4 12 0.017601824
5 12 0.022066110
6 13 0.022066110
7 13 0.031034720
8 15 0.031034720
9 15 0.035559606
10 26 0.035559606
11 26 0.040105061
Based on your code, one simple approach is to repeat the columns, with each element repeated twice. From there, you can remove the first element from the time column and the last element from the cumhaz column, then combine them.
An example of this code is:
x <- data.frame(
Time = c(1,2,3,4,5),
Hazard = c(6,7,8,9,10)
)
data.frame(
  Time   = rep(x$Time, each = 2)[-1],          # repeat each time point, dropping the first
  Hazard = head(rep(x$Hazard, each = 2), -1)   # repeat each hazard, dropping the last so the lengths match
)
and this gives you the desired output.
Note: if you have a large number of columns this will be cumbersome, but for just two or so it should be fine.
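Applied to the cumhaz data frame built in the question (keeping its column names), the same pattern would look roughly like this:
stepped <- data.frame(
  time   = rep(cumhaz$Survival_Function.time, each = 2)[-1],
  cumhaz = head(rep(cumhaz$Survival_Function.cumhaz, each = 2), -1)
)
head(stepped, 11)                                             # the 11-row layout shown in the question
write.csv(stepped, "stepped_cumhaz.csv", row.names = FALSE)   # hypothetical file name, ready to open in Excel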
I am attempting to delete a row like this:
data <- data[-1645,]
However, after running the code, the row is still there. I can tell because there is an outlier in that row that shows up on all my graphs, and when I view the data I can sort a column to easily find the offending value. I have had no trouble deleting rows in the past; has anyone run into anything similar? I understand the limitations of outlier removal and I don't typically remove outliers, but for a number of reasons I would like to see what the data look like without this one (all other values in the response variable are between -1 and 0, while in this row the value is on the order of 10^4).
You really need to provide more information, but there are several ways you can troubleshoot the problem. The first is to print out the row you are removing:
data[1645, ]
Is that the outlier? You did not tell us how you identified the outlier. If rows have previously been removed from the data frame, the row names stay the same but the positional indices shift, e.g.
set.seed(42)
x <- sample.int(25)
y <- sample.int(25)
data <- data.frame(x, y)
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 5 4 10
# 6 18 11
data <- data[-c(5, 10, 15, 20, 25), ]
head(data)
# x y
# 1 17 2
# 2 5 8
# 3 1 3
# 4 10 1
# 6 18 11
# 7 25 15
data[6, ]
# x y
# 7 25 15
data["6", ]
# x y
# 6 18 11
Notice that the 6th row of the data has a row name of "7" but the row with name "6" is the 5th row in the data frame because we deleted the 5th row. The which function will give you the index value, but if you identified the outlier by looking at the printout, you got the row name and that may be different from the index. If we want to remove values in x greater than 24, here is one way to do that:
data[data$x<25, ]
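If you identified the outlier by its value, which() gives you the positional index directly; a small sketch:
idx <- which(data$x > 24)                    # positional indices of the offending values
if (length(idx) > 0) data <- data[-idx, ]    # guard: data[-integer(0), ] would drop every row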
After playing around with the data, I think the best explanation is that the indexing is off. This is in line with what dcarlson was saying: the code could well be removing the 1,645th row, it just isn't labelled as such. I think the best solution is to use subset:
data <- subset(data, Yield.Decline < 100)
This is a more robust solution than trying to remove a given row by its position (the line can accidentally be run multiple times without erroneously removing additional rows).
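A quick illustration of that point, using the question's own Yield.Decline column (the row counts here are hypothetical):
nrow(data)                                  # say, 1700 rows including the outlier
data <- subset(data, Yield.Decline < 100)   # drops the single offending row
data <- subset(data, Yield.Decline < 100)   # running it again removes nothing further
nrow(data)                                  # still 1699 rows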
I am trying to use the LDA function to evaluate a corpus of text in R. However, when I do so, it seems to use the row names of the observations rather than the actual words in the corpus. I can't find anything else about this online so I imagine I must be doing something very basic incorrectly.
library(tm)
library(SnowballC)
library(tidytext)
library(stringr)
library(tidyr)
library(topicmodels)
library(dplyr)
#read in data
data <- read.csv('CSV_format_data.csv',sep=',')
#Create corpus/DTM
interviews <- as.matrix(data[,2])
ints.corpus <- Corpus(VectorSource(interviews))
ints.dtm <- TermDocumentMatrix(ints.corpus)
chapters_lda <- LDA(ints.dtm, k = 4, control = list(seed = 5421685))
chapters_lda_td <- tidy(chapters_lda,matrix="beta")
chapters_lda_td
head(ints.dtm$dimnames$Terms)
The 'chapters_lda_td' command outputs
# A tibble: 4,084 x 3
topic term beta
<int> <chr> <dbl>
1 1 1 0.000555
2 2 1 0.00399
3 3 1 0.000614
4 4 1 0.000699
5 1 2 0.0000195
6 2 2 0.000708
7 3 2 0.000731
8 4 2 0.00000155
9 1 3 0.000974
10 2 3 0.0000363
# ... with 4,074 more rows
Note that the "term" column contains numbers instead of the words it should contain. The number of rows also matches the number of documents times the number of topics, rather than the number of terms times the number of topics, as it should. The 'head(ints.dtm$dimnames$Terms)' call is there to check that there really are words in the DTM, which there are. The result is:
[1] "aaye" "able" "adjust" "admission" "after" "age"
The data file itself is a pretty standard two-column CSV file with an ID and a block of text, and hasn't given me any problem while doing other text-mining stuff with it and the tm package. Any help would be appreciated, thank you!
I figured it out! It is because I am using the command
ints.dtm <- TermDocumentMatrix(ints.corpus)
rather than
ints.dtm <- DocumentTermMatrix(ints.corpus)
It turns out the ordering matters: LDA expects a document-term matrix, with documents in rows and terms in columns, so when it is handed a term-document matrix it treats the terms as documents and the document IDs as terms.
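In other words, LDA() from topicmodels wants documents in rows and terms in columns, which is exactly what DocumentTermMatrix produces. A minimal sketch of the corrected pipeline (tm also defines a t() method that should flip an existing TermDocumentMatrix, but building the matrix the right way around is simpler):
ints.dtm <- DocumentTermMatrix(ints.corpus)              # documents in rows, terms in columns
chapters_lda <- LDA(ints.dtm, k = 4, control = list(seed = 5421685))
chapters_lda_td <- tidy(chapters_lda, matrix = "beta")
chapters_lda_td                                          # the "term" column now contains words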
I am trying to convert the data which I have in txt file:
4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;...
to a column (table) where the values are separated by tab.
4.0945725440979
4.07999897003174
4.0686674118042...
So far I tried
mydata <- read.table("1.txt", header = FALSE)
separate_data<- strsplit(as.character(mydata), ";")
But it does not work; separate_data in this case consists of only one element:
[[1]]
[1] "1"
It is not directly stated in the OP whether the raw data file contains multiple observations of a single variable or should be broken into n-tuples. Since the OP does state that read.table() produces a single row where they expect multiple rows, we can conclude that the correct tool is scan(), not read.table().
If the data in the raw data file represent a single variable, then the solution posted in the comments by @docendo works without additional effort. Otherwise, additional work is required to tidy the data.
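If it really is a single variable, a minimal sketch of that case (using the file name from the question; the output file name is made up) might be:
values <- scan("1.txt", sep = ";")          # read all semicolon-separated numbers into one vector
write.table(values, "1_column.txt",         # write one value per line, easy to paste into a table
            row.names = FALSE, col.names = FALSE)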
Here is an approach using scan() that reads the file into a vector, and breaks it into observations containing 5 variables.
rawData <- "4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512;4.0945725440979;4.07999897003174;4.0686674118042;4.05960083007813;4.05218315124512"
value <- scan(textConnection(rawData),sep=";")
columns <- 5 # set desired # of columns
observations <- length(value) / columns   # number of complete observations
observation <- rep(1:observations, each = columns)
variable <- rep(1:columns,times=observations)
data.frame(observation,variable,value)
...and the output:
> data.frame(observation,variable,value)
observation variable value
1 1 1 4.094573
2 1 2 4.079999
3 1 3 4.068667
4 1 4 4.059601
5 1 5 4.052183
6 2 1 4.094573
7 2 2 4.079999
8 2 3 4.068667
9 2 4 4.059601
10 2 5 4.052183
>
At this point the data can be converted into a wide format tidy data set with reshape2::dcast().
Note that this solution requires that the number of data values in the raw data file is evenly divisible by the number of variables.
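A sketch of that final reshaping step, assuming the long data frame built above:
library(reshape2)
longData <- data.frame(observation, variable, value)
wideData <- dcast(longData, observation ~ variable, value.var = "value")
wideData    # one row per observation, one column per variable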
I have a .csv file with several columns, but I am only interested in two of them (TIME and USER). The USER column consists of the value markers 1 or 2 in chunks, and the TIME column consists of a value in seconds. I want to calculate the difference between the TIME value of the first 2 in a chunk of the USER column and the first 1 in a chunk of the USER column. I want to accomplish this in R. It would be ideal for there to be another column added to my data file with these differences.
So far I have only imported the .csv into R.
Latency <- read.csv("/Users/alinazjoo/Documents/Latency_allgaze.csv")
I'm going to guess your data looks like this
# sample data
set.seed(15)
rr<-sample(1:4, 10, replace=T)
dd<-data.frame(
user=rep(1:5, each=10),
marker=rep(rep(1:2,10), c(rbind(rr, 5-rr))),
time=1:50
)
Then you can calculate the difference using the base functions aggregate and transform. Observe:
namin<-function(...) min(..., na.rm=T)
dx<-transform(aggregate(
cbind(m2=ifelse(marker==2,time,NA), m1=ifelse(marker==1, time,NA)) ~ user,
dd, namin, na.action=na.pass),
diff = m2-m1)
dx
# user m2 m1 diff
# 1 1 4 1 3
# 2 2 15 11 4
# 3 3 23 21 2
# 4 4 35 31 4
# 5 5 44 41 3
We use aggregate to find the earliest time for each of the two kinds of markers, then we use transform to calculate the difference between them.
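For comparison, the same computation can be sketched with dplyr, assuming (as in the sample data) that every user's chunk contains both markers:
library(dplyr)
dd %>%
  group_by(user) %>%
  summarise(m1 = min(time[marker == 1]),   # first time a 1 appears for this user
            m2 = min(time[marker == 2]),   # first time a 2 appears for this user
            diff = m2 - m1)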