Read tab delimited text file - r

I am trying to read the data from this link in R using the following code but I keep getting warning messages and the dataframe doesn't read the data properly.
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt'
df <- read.table(url, sep = '\t',header = F, skip = 2,quote='', comment='')
Can you tell me what I need to change to read the data correctly?
EDIT
Adding data snippet
REMISS CELL SMEAR INFIL LI BLAST TEMP
1 0.8 0.83 0.66 1.9 1.1 1
1 0.9 0.36 0.32 1.4 0.74 0.99
0 0.8 0.88 0.7 0.8 0.18 0.98
0 1 0.87 0.87 0.7 1.05 0.99
1 0.9 0.75 0.68 1.3 0.52 0.98
0 1 0.65 0.65 0.6 0.52 0.98
1 0.95 0.97 0.92 1 1.23 0.99
0 0.95 0.87 0.83 1.9 1.35 1.02

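This is an encoding issue: the file is encoded as UTF-16LE, so read.table needs a matching fileEncoding. Please see this thread for more information (Get "embedded nul(s) found in input" when reading a csv using read.csv()). The file also has a header row, so the call below uses header = TRUE and no skip.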
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt'
df <- read.table(url, sep = '\t',header = TRUE, fileEncoding = "UTF-16LE")
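To see why the original call produced warnings, it can help to look at the raw bytes. The sketch below does not download the course file. It writes a small UTF-16LE file to a temporary path (the file contents and the names tmp, con and first_bytes are made up for illustration) and shows the zero bytes that lead to "embedded nul" warnings when the encoding is not declared.

```r
# Hypothetical stand-in for the real file: a tiny tab-delimited table
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "w", encoding = "UTF-16LE")
writeLines(c("REMISS\tCELL", "1\t0.8", "0\t1"), con)
close(con)

# In UTF-16LE each ASCII character takes two bytes, the second being 00
first_bytes <- readBin(tmp, what = "raw", n = 4)
print(first_bytes)

# Declaring the encoding lets read.table decode the file correctly
df <- read.table(tmp, sep = "\t", header = TRUE, fileEncoding = "UTF-16LE")
print(df)
```

If the first bytes of a file alternate between a character and 00 like this, UTF-16LE is a reasonable guess for fileEncoding.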

Also consider,
url <- 'https://onlinecourses.science.psu.edu/stat501/sites/onlinecourses.science.psu.edu.stat501/files/data/leukemia_remission.txt'
df <- read.csv(url, sep="\t", header=T)

Related

Advanced pivot_longer: extract pattern in variables

My R df looks like this:
ID CO.RT CO.ER SC.RT SC.ER
1 0.19 0.06 1.24 0.09
2 0.61 0.01 0.63 0.03
3 0.43 0.02 1.31 0.09
I've been trying to find a way to use tidyr::pivot_longer to achieve the following:
ID Type RT ER
1 CO 0.19 0.06
1 SC 1.24 0.09
2 CO 0.61 0.01
2 SC 0.63 0.03
3 CO 0.43 0.02
3 SC 1.31 0.09
My issue: I can only pivot RT and ER into a single "score" column; I fail to find the right regex(?) pattern to pivot my df in the way shown above.
Here is what I tried:
df %>% pivot_longer(cols = c(CO.RT:SC.ER),
names_pattern = "(.+).(.+)",
names_to=c("Type", ".value"))
There are many pivot_longer/names_to questions out there, however, I couldn't find the right one for my problem. Can anyone help?
My dataset:
df <- tibble(
ID = c(1,2,3),
CO.RT = c(0.19,0.61,0.43),
CO.ER = c(0.06,0.01,0.02),
SC.RT = c(1.24,0.63,1.31),
SC.ER = c(0.09,0.03,0.09)
)
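One likely cause, offered as a sketch rather than a definitive answer: in a regular expression an unescaped . matches any character, so "(.+).(.+)" does not split the names at the literal dot. Escaping the dot should give the layout shown above. This assumes a tidyr version that provides pivot_longer (1.0.0 or later); the name long is only illustrative.

```r
library(tidyr)

df <- tibble::tibble(
  ID = c(1, 2, 3),
  CO.RT = c(0.19, 0.61, 0.43),
  CO.ER = c(0.06, 0.01, 0.02),
  SC.RT = c(1.24, 0.63, 1.31),
  SC.ER = c(0.09, 0.03, 0.09)
)

# "\\." matches a literal dot: the part before it becomes Type,
# the part after it (".value") names the output columns RT and ER
long <- pivot_longer(
  df,
  cols = CO.RT:SC.ER,
  names_to = c("Type", ".value"),
  names_pattern = "(.+)\\.(.+)"
)
print(long)
```

Since the names contain exactly one dot, names_sep = "\\." in place of names_pattern should behave the same way here.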

'x' and 'y' lengths differ in custom entropy function

I am trying to learn R and I am having problems with the way it works. I tried to make an entropy function of variables p and 1-p from scratch and I am having problems when I try to add some ifs to avoid the NaN when dividing by 0.
When I try the custom entropy with the plot, it just works but it shows the NaN when I print the results. But when I try to add the ifs, then it says:
Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
entropy <- function(p){
cat("p = " , p)
if (p==0 || p==1) {
result = 0
}else{
result = - p*log2(p)-(1-p)*log2((1-p))
}
cat("\nresult=",result)
return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
I don't understand it since I am using a plot of an array as x and a function with that array as parameter as y, so it should be the same lengths with and without ifs.
Console without the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= NaN 0.08079314 0.1414405 0.1943919 0.2422922 0.286397 0.3274449 0.3659237 0.4021792 0.4364698 0.4689956 0.499916 0.5293609 0.5574382 0.5842388 0.6098403 0.6343096 0.6577048 0.680077 0.7014715 0.7219281 0.7414827 0.7601675 0.7780113 0.7950403 0.8112781 0.8267464 0.8414646 0.8554508 0.8687212 0.8812909 0.8931735 0.9043815 0.9149264 0.9248187 0.9340681 0.9426832 0.9506721 0.958042 0.9647995 0.9709506 0.9765005 0.9814539 0.985815 0.9895875 0.9927745 0.9953784 0.9974016 0.9988455 0.9997114 1 0.9997114 0.9988455 0.9974016 0.9953784 0.9927745 0.9895875 0.985815 0.9814539 0.9765005 0.9709506 0.9647995 0.958042 0.9506721 0.9426832 0.9340681 0.9248187 0.9149264 0.9043815 0.8931735 0.8812909 0.8687212 0.8554508 0.8414646 0.8267464 0.8112781 0.7950403 0.7780113 0.7601675 0.7414827 0.7219281 0.7014715 0.680077 0.6577048 0.6343096 0.6098403 0.5842388 0.5574382 0.5293609 0.499916 0.4689956 0.4364698 0.4021792 0.3659237 0.3274449 0.286397 0.2422922 0.1943919 0.1414405 0.08079314 NaN
Console with the ifs:
p = 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74 0.75 0.76 0.77 0.78 0.79 0.8 0.81 0.82 0.83 0.84 0.85 0.86 0.87 0.88 0.89 0.9 0.91 0.92 0.93 0.94 0.95 0.96 0.97 0.98 0.99 1
result= 0Error in xy.coords(x, y, xlabel, ylabel, log) :
'x' and 'y' lengths differ
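Your function returns a scalar rather than a vector because if/else is not vectorized: the condition is evaluated once (here on the first element of p, which is 0), so result is the single number 0 while p has 101 elements. That is why plot complains that the lengths differ.
This should work: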
entropy <- function(p){
  # initialize a result vector of the same length as p, filled with zeros
  result <- numeric(length(p))
  # positions where the formula is defined (p is neither 0 nor 1)
  keep <- !(p %in% c(0, 1))
  x <- p[keep]
  # overwrite only those positions with the entropy formula;
  # the entries for p == 0 and p == 1 stay 0
  result[keep] <- -x * log2(x) - (1 - x) * log2(1 - x)
  return(result)
}
p <- seq(0,1,0.01)
plot(p, entropy(p), type='l', main='Funcion entropia con dos valores posibles')
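Another vectorized option, sketched here as an alternative and not taken from the answer above, is ifelse, which applies the test element by element and keeps the shape of the original if/else:

```r
entropy <- function(p) {
  # where p is 0 or 1 the result is 0, elsewhere the entropy formula applies;
  # ifelse() returns a vector of the same length as p
  ifelse(p == 0 | p == 1, 0, -p * log2(p) - (1 - p) * log2(1 - p))
}

p <- seq(0, 1, 0.01)
h <- entropy(p)
print(length(h))
```

Note the single | (element-wise OR) instead of ||. The result h can be passed to plot(p, h, type='l') as before.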
EDIT:
Even though vectorizing was suggested, I wanted to write it in a way similar to the other languages I know, since I am just starting out. I was able to fix it, although I ended up using a for loop and plotting two arrays instead of the function itself.
entropy <- function(p){
if (p==0 || p==1) {
result = 0
}else{
result = - p*log2(p)-(1-p)*log2((1-p))
}
return(result)
}
x <- seq(0,1,0.01)
y <- numeric(length(x))  # size y from x; p only exists later as the loop variable
i = 1
for (p in x) {
y[i] = entropy(p)
cat(x[i],"=",y[i],"\n")
i=i+1
}
plot(x, y, type='l', main='Funcion entropia con dos valores posibles')
You can keep your entropy function as it is and apply it to each element of p with sapply before plotting, so that it is called once per value and returns a vector of the same length as p.
entropy <- function(p){
cat("p = " , p)
if (p==0 || p==1) {
result = 0
}else{
result = - p*log2(p)-(1-p)*log2((1-p))
}
cat("\nresult=",result)
return(result)
}
p <- seq(0,1,0.01)
# Apply the function over all the values of 'p'
entropy_p <- sapply(p,FUN = entropy)
plot(p, entropy_p, type='l', main='Funcion entropia con dos valores posibles')

Column Mean for rows with unique values

How can I compute the mean of the R, R1, R2 and R3 values over the rows sharing the same lon, lat pair? I'm sure this question has been asked many times, but I could not easily find it.
lon lat length depth R R1 R2 R3
1 147.5348 -35.32395 13709 1 0.67 0.80 0.84 0.83
2 147.5348 -35.32395 13709 2 0.47 0.48 0.56 0.54
3 147.5348 -35.32395 13709 3 0.43 0.29 0.36 0.34
4 147.4290 -35.27202 12652 1 0.46 0.61 0.60 0.58
5 147.4290 -35.27202 12652 2 0.73 0.96 0.95 0.95
6 147.4290 -35.27202 12652 3 0.77 0.92 0.92 0.91
I'd recommend the split-apply-combine strategy: split by BOTH lon and lat, apply mean to each group, then recombine the results into a single data frame.
This is straightforward with dplyr:
library(dplyr)
mydata %>%
group_by(lon, lat) %>%
summarize(
mean_r = mean(R)
, mean_r1 = mean(R1)
, mean_r2 = mean(R2)
, mean_r3 = mean(R3)
)
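If dplyr is not available, the same split-apply-combine step can be sketched in base R with aggregate. The data frame below is rebuilt from the rows shown in the question; the name means is only illustrative.

```r
# Data from the question, reconstructed by hand
mydata <- data.frame(
  lon    = rep(c(147.5348, 147.4290), each = 3),
  lat    = rep(c(-35.32395, -35.27202), each = 3),
  length = rep(c(13709, 12652), each = 3),
  depth  = rep(1:3, times = 2),
  R  = c(0.67, 0.47, 0.43, 0.46, 0.73, 0.77),
  R1 = c(0.80, 0.48, 0.29, 0.61, 0.96, 0.92),
  R2 = c(0.84, 0.56, 0.36, 0.60, 0.95, 0.92),
  R3 = c(0.83, 0.54, 0.34, 0.58, 0.95, 0.91)
)

# One row per (lon, lat) pair, with the mean of each R column
means <- aggregate(cbind(R, R1, R2, R3) ~ lon + lat, data = mydata, FUN = mean)
print(means)
```

The formula interface groups by every variable on the right-hand side, so both lon and lat must match for rows to land in the same group.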

How to format the output in CSV file when outputting

So I am trying to get an output in csv file but I am having trouble formatting as per my need.
My Code
method.metric <- mmetric(testCenScal[[course_name]], method.pred, c("RMSE", "R2", "MAE", "COR"))
write.table(method.metric, "metric.csv", sep = ",", col.names = T, append = T)
Current Output
"x"
"MAE",0.636059658390333
"RMSE",0.814405873704867
"COR",0.581863604936215
"R2",0.338565254749368
"x"
"MAE",0.636059658390333
"RMSE",0.814405873704867
"COR",0.581863604936215
"R2",0.338565254749368
"x"
"RMSE",0.869309100173694
"R2",0.356594555638249
"MAE",0.653084184175849
"COR",0.597155386510286
"x"
"RMSE",0.869309100173694
"R2",0.356594555638249
"MAE",0.653084184175849
"COR",0.597155386510286
It would be nice if I could format this output into something like:
RMSE R2 MAE COR param1 param2
0.89 0.35 0.65 0.59 courseA Blackboost
0.89 0.35 0.65 0.59 courseB Blackboost
0.89 0.35 0.65 0.59 courseC Blackboost
0.89 0.35 0.65 0.59 courseD Blackboost
0.89 0.35 0.65 0.59 courseE Blackboost
0.89 0.35 0.65 0.59 courseA Rpart
0.89 0.35 0.65 0.59 courseB Rpart
0.89 0.35 0.65 0.59 courseC Rpart
0.89 0.35 0.65 0.59 courseD Rpart
0.89 0.35 0.65 0.59 courseE Rpart
I don't know what "x" is or where it comes from. I guess that because I haven't given a column name, it prints "x" as the default?
I have this code in a function so I am passing two parameters one is the method and another is the target field. I would like to print those while appending it to a CSV file.
If I type dput(method.metric)
I get output as:
structure(c(0.869309100173694, 0.356594555638249, 0.653084184175849,
0.597155386510286), .Names = c("RMSE", "R2", "MAE", "COR"))
I already tried using the code write.csv(method.metric, file ="metric.csv", row.names=FALSE, eol=",", append=T) but it did not help much.
I will try to work on what you suggested, formatting in R using cbind and other functions. If I get the output in the above format, I will be able to create graphs with ease, as I have a lot of predictive model results being output.
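A possible approach, sketched under assumptions: write.table turns a plain named vector into a one-column data frame, which is most likely where the "x" header comes from. Converting the vector to a one-row data frame first gives one column per metric. The helper append_metrics and its arguments course and method are hypothetical names, and the numbers come from the dput output above.

```r
# Hypothetical helper: append one row of metrics plus two labels to a CSV file
append_metrics <- function(metric, course, method, file = "metric.csv") {
  # as.list() keeps the names, so each metric becomes its own column
  row <- data.frame(as.list(metric), param1 = course, param2 = method)
  exists <- file.exists(file)
  # write the header only when the file is created, append afterwards
  write.table(row, file, sep = ",", row.names = FALSE,
              col.names = !exists, append = exists)
}

method.metric <- c(RMSE = 0.869309100173694, R2 = 0.356594555638249,
                   MAE = 0.653084184175849, COR = 0.597155386510286)

out_file <- tempfile(fileext = ".csv")
append_metrics(method.metric, "courseA", "Blackboost", file = out_file)
append_metrics(method.metric, "courseB", "Blackboost", file = out_file)
result <- read.csv(out_file)
print(result)
```

Because the header is written only once, the resulting file can be read back with read.csv and used directly for plotting.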

Save unequal output to a csv or txt file

I want to save the following output I get in the R console into a csv or txt file.
Discordancy measures (critical value 3.00)
0.17 3.40 1.38 0.90 1.62 0.13 0.15 1.69 0.34 0.39 0.36 0.68 0.39
0.54 0.70 0.70 0.79 2.08 1.14 1.23 0.60 2.00 1.81 0.77 0.35 0.15
1.55 0.78 2.87 0.34
Heterogeneity measures (based on 100 simulations)
30.86 14.23 3.75
Goodness-of-fit measures (based on 100 simulations)
glo gev gno pe3 gpa
-3.72 -12.81 -19.80 -32.06 -37.66
This is the outcome I get when I run the following
Heter<-regtst(regsamlmu(-extremes), nsim=100)
where Heter is a list (i.e., is.list(Heter) returns TRUE)
You could use capture.output:
capture.output(regtst(regsamlmu(-extremes), nsim=100), file="myoutput.txt")
Or, to capture output from several consecutive commands:
sink("myfile.txt")
#
# [commands generating desired output]
#
sink()
You could make a character vector which you write to a file. Each entry in the vector will be separated by a newline character.
out <- capture.output(regtst(regsamlmu(-extremes), nsim=100))
write(out, "output.txt", sep="\n")
If you would like to add more lines, just do something like c(out, "hello Kostas") before writing.
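A self-contained sketch of the same idea, using summary(cars) on a built-in dataset as a stand-in for the regtst call (which needs its own package and data); the names out and path are illustrative:

```r
# Capture the printed output as a character vector, one element per line
out <- capture.output(summary(cars))

# Add an extra line, then write everything to a text file
out <- c(out, "hello Kostas")
path <- tempfile(fileext = ".txt")
writeLines(out, path)

# Read the file back to confirm what was written
print(readLines(path))
```

writeLines puts each element on its own line, much like write(out, file, sep="\n").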
